# Aq-emod

`aq-emod` is a collection of built-in functions that perform complex data processing and manipulation when used with other aq_tools commands.
The functions range from simple mathematical functions and URL encoding functions to complex string extraction functions.
This notebook covers the syntax of these functions, along with basic sample usage for the functions whose examples are not covered in the documentation.

## Prerequisites

Readers of this notebook are expected to know:

* Bash commands
* Regular Expressions
* Use of aq_input / input spec and column spec with the `aq_pp` command

We won't go over input, column and output specs in this notebook. They are covered in:
- [aq-input](aq_input.ipynb)
- [aq-output](aq_output.ipynb)
 

Also keep the [aq-emod documentation](http://auriq.com/documentation/source/reference/manpages/aq-emod.html) at hand, so you can refer to the details of each option as needed.


## Overview

Roughly speaking, `aq-emod` functions (built-in functions) fall into the following categories.

- [String property functions](#string_property)
- [Math functions](#math)
- [Comparison functions](#comparison)
- [Data extraction and encode/decode functions](#extract_code)
- [General data conversion functions](#conversion)
- [Date/Time conversion functions](#date_time)
- [Character set encoding conversion functions](#character_encoding)
- [Key hashing functions](#key_hashing)
- [Speciality functions](#speciality)
- [RTmetrics functions](#rtmetrics)
- [Udb specific functions](#udb)

Note that we'll look at each function using the `aq_pp` command's `-eval` option. Below is the basic syntax of the command.

### Syntax


```aq_pp ... -eval ColSpec|ColName Expr```

where
- `ColSpec`: new column's column spec to assign the result
- `ColName`: existing column name to assign the result
- `Expr`: expression to be evaluated. This is where `aq-emod` functions reside.


## Data

[Ramen Ratings Dataset](https://www.kaggle.com/residentmario/ramen-ratings) from Kaggle will be used in this sample; it contains ratings of 2500 ramen products. 

Review|Brand|Variety|Style|Country|Stars
---|---|---|---|---|---|
2580|New Touch|T's Restaurant Tantanmen|Cup|Japan|3.75
2579|Just Way|Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles|Pack|Taiwan|1
2578|Nissin|Cup Noodles Chicken Vegetable|Cup|USA|2.25
2577|Wei Lih|GGE Ramen Snack Tomato Flavor|Pack|Taiwan|2.75
2576|Ching's Secret|Singapore Curry|Pack|India|3.75
2575|Samyang Foods|Kimchi song Song Ramen|Pack|South Korea|4.75
2574|Acecook|Spice Deli Tantan Men With Cilantro|Cup|Japan|4
2573|Ikeda Shoku|Nabeyaki Kitsune Udon|Tray|Japan|3.75
2572|Ripe'n'Dry|Hokkaido Soy Sauce Ramen|Pack|Japan|0.25
2571|KOKA|The Original Spicy Stir-Fried Noodles|Pack|Singapore|2.5

Columns and corresponding data types for the dataset are as follows.
- `int: Review #`: review ID number; the more recent the review, the larger the number
- `str: Brand`: brand / manufacturer of the product
- `str: Variety`: title of the product
- `str: Style`: categorical style of the product: cup, pack or tray
- `str: Country`: country of origin
- `float: Stars`: star rating of each product

Smaller datasets are used for some examples; they are introduced along the way.<br>

## Input and Column Specification

Here are the corresponding column specs for the data:<br>
`i:reviewID s:brand s:variety s:style s:country f:stars`

**Note**<br>
When reading in files with `aq_pp`, we'll use bash's [variable substitution](http://www.compciv.org/topics/bash/variables-and-substitution/) to keep the commands short and clean. For instance, 
```bash
# assign file name & path to variable 'file'
file='data/aq_pp/fileName.csv'
```


In [2]:
file="data/aq_pp/ramen-ratings-part.csv"
cols="i:reviewID s:brand s:variety s:style s:country f:stars"
aq_pp -f,+1 $file -d $cols

"reviewID","brand","variety","style","country","stars"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25


<a id='string_property'></a>
### String Property Functions


**`SHash(Val)`**:Returns the numeric hash value of a string.<br>
Val can be a string column’s name, a string constant, or an expression that evaluates to a string.

In this example, we'll hash the `style` column, whose values are Cup, Pack or Tray, and store the result in the `style_hash` column.

In [3]:
aq_pp -f,+1 $file -d $cols -eval 'i:style_hash' 'SHash(style)' -c style style_hash

"style","style_hash"
"Cup",193488781
"Pack",2090607556
"Cup",193488781
"Pack",2090607556
"Pack",2090607556
"Pack",2090607556
"Cup",193488781
"Tray",2090769765
"Pack",2090607556


You can see that identical string values always produce the same hash.

**`SLeng(Val)`**:Returns the length of a string.<br>
Val can be a string column’s name, a string constant, or an expression that evaluates to a string.
    
Again, in this example we'll provide the `style` column, and the result will be stored in the `style_len` column.

In [4]:
aq_pp -f,+1 $file -d $cols -eval 'i:style_len' 'SLeng(style)' -c style style_len

"style","style_len"
"Cup",3
"Pack",4
"Cup",3
"Pack",4
"Pack",4
"Pack",4
"Cup",3
"Tray",4
"Pack",4


<a id='math'></a>
### Math Functions
<br>

**Basics for math functions**

With a few exceptions, math functions take a single argument `Val`, which can be a numeric column, a constant or an expression that evaluates to a numeric value. 
We will go over just a few of them here, plus functions with irregular syntax. For the list of all available math functions, refer to the [aq-emod documentation](http://auriq.com/documentation/source/reference/manpages/aq-emod.html#math-functions).

**`Ceil(Val)`**: Rounds Val up to the nearest integral value and returns the result.

Val can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.
The `stars` column, which contains the star rating, is provided as `Val`, and the result is stored in the `ceiling` column.

In [5]:
aq_pp -f,+1 $file -d $cols -eval 'i:ceiling' 'Ceil(stars)' -c stars ceiling

"stars","ceiling"
3.75,4
1,1
2.25,3
2.75,3
3.75,4
4.75,5
4,4
3.75,4
0.25,1


**`Floor(Val)`**: Rounds Val down to the nearest integral value and returns the result.

Val can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.
Similarly to `Ceil()`, we'll use the `stars` column again. Notice the difference in the result compared to the `Ceil()` function.


In [6]:
aq_pp -f,+1 $file -d $cols -eval 'i:floor' 'Floor(stars)' -c stars floor

"stars","floor"
3.75,3
1,1
2.25,2
2.75,2
3.75,3
4.75,4
4,4
3.75,3
0.25,0



**`Round(Val)`**: Rounds Val up/down to the nearest integral value and returns the result. Halfway cases are rounded away from zero.

Val can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.

Given the `stars` column, the result will be rounded to the nearest integer.

In [7]:
aq_pp -f,+1 $file -d $cols -eval 'i:round' 'round(stars)' -c stars round

"stars","round"
3.75,4
1,1
2.25,2
2.75,3
3.75,4
4.75,5
4,4
3.75,4
0.25,0
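
The three rounding modes can also be compared side by side outside of `aq_pp`. The following is a minimal sketch in plain `awk`, not an `aq-emod` call; the helper names `rounding_demo`, `ceil`, `floor` and `round_half_away` are hypothetical, and the inputs are a few of the `stars` values shown above.

```bash
# Plain awk sketch of the three rounding modes (not aq_pp itself).
# rounding_demo, ceil, floor and round_half_away are hypothetical names.
rounding_demo() {
  printf '%s\n' "$@" | awk '
    function ceil(v,  t)  { t = int(v); return (t < v) ? t + 1 : t }
    function floor(v,  t) { t = int(v); return (t > v) ? t - 1 : t }
    # Halfway cases are rounded away from zero, as described for Round().
    function round_half_away(v) { return (v >= 0) ? int(v + 0.5) : -int(-v + 0.5) }
    { print $1 "," ceil($1) "," floor($1) "," round_half_away($1) }
  '
}

# Columns printed: stars,ceiling,floor,round
rounding_demo 3.75 2.25 2.75 0.25
```

For `2.25` this prints `2.25,3,2,2`, which agrees with the `Ceil()`, `Floor()` and `Round()` outputs above.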


**`Sqrt(Val)`**: Computes the square root of Val.

Val can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.
In this example, we will provide the constant `9` as the argument for clarity.

In [8]:
aq_pp -f,+1 $file -d $cols -eval 'i:squared' 'Sqrt(9)' 

"reviewID","brand","variety","style","country","stars","squared"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,3
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,3
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,3
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,3
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,3
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,3
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,3
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,3
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,3


**Math functions with irregular syntax**

These functions require multiple values as their arguments to return the result.

**`Min(Val1, Val2 [, Val3 ...])`**: Returns the smallest value among Val1, Val2 and so on.

Each Val can be a numeric column’s name, a number, or an expression that evaluates to a number.
* If all values are integers, the result will also be an integer.
* If any value is a floating point number, the result will be a floating point number.

We'll provide a constant as well as the `stars` column to compare, and store the result in a column called `min`.

In [9]:
aq_pp -f,+1 $file -d $cols -eval 'f:min' 'Min(3, stars)'  -c stars min

"stars","min"
3.75,3
1,1
2.25,2.25
2.75,2.75
3.75,3
4.75,3
4,3
3.75,3
0.25,0.25


**`Pow(Val, Power)`**: Computes Val raised to the power of Power.

Val and Power can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.

In this example, we'll calculate the 8th power of 2, meaning `Val = 2` and `Power = 8`, and the result will be stored in an integer column called `byte`. 

In [10]:
aq_pp -f,+1 $file -d $cols -eval 'i:byte' 'Pow(2, 8)' 

"reviewID","brand","variety","style","country","stars","byte"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,256
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,256
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,256
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,256
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,256
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,256
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,256
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,256
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,256


**`IsInf(Val)`**: Tests if Val is infinite.

Returns 1, -1 or 0 if the value is positive infinity, negative infinity or finite respectively.
`Val` can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.

To produce negative infinity, we'll provide the expression `-1.0/0`, which evaluates to negative infinity, so `IsInf()` should return -1.

**Note:** 
* To get positive / negative infinity, the expression needs to be evaluated as a float (e.g. `1.0/0` instead of `1/0`).
* The column that receives the result needs to be a **signed integer datatype**, either `is` or `ls`, in order to display negative values correctly.

In [11]:
aq_pp -f,+1 $file -d $cols -eval 'is:IsInf' 'IsInf(-1.0/0)' -c IsInf

"IsInf"
-1
-1
-1
-1
-1
-1
-1
-1
-1


<a id='comparison'></a>
### Comparison Functions

Most comparison functions compare one or more string constants, patterns or regular expressions against all or part of a given string or string column. They return 1 if there is a match, and 0 otherwise. 

Let's start with the functions that compare the beginning and end of a string with a given pattern.



**`BegCmp(Val, BegStr [, BegStr ...])`**: Checks whether the string `Val` starts exactly with `BegStr`. 

* Returns 1 if there is a match, 0 otherwise.
* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* Each `BegStr` is a string constant that specifies the starting string to match.

Let's use the `style` column again to demonstrate this function. We'll give "P" as the string to match, so this should return 1 whenever the content of the `style` column starts with "P".

In [12]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'BegCmp(style, "P")' -c style beginWith

"style","beginWith"
"Cup",0
"Pack",1
"Cup",0
"Pack",1
"Pack",1
"Pack",1
"Cup",0
"Tray",0
"Pack",1


We can also provide multiple `BegStr` values to match strings that start with any of them. You can observe that this time it returns 1 for strings that start with either "P" or "Tr". 

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'BegCmp(style, "P", "Tr")' -c style beginWith

**`EndCmp(Val, EndStr [, EndStr ...])`**: Compares one or more ending string EndStr with the tail of Val. All the comparisons are case sensitive.

* Returns 1 if there is a match, 0 otherwise.
* Val can be a string column’s name, a string constant, or an expression that evaluates to a string.
* Each `EndStr` is a string constant that specifies the ending string to match.

This compares the ending of `Val` with the given `EndStr`. 
Let's see it in action with the `style` column; we'll provide two `EndStr` values this time as well.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'EndCmp(style, "ck", "up")' -c style beginWith

1 is returned for `Cup` and `Pack`, which end with one of the given strings.
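
To make the prefix and suffix semantics concrete without running `aq_pp`, here is a small sketch in plain POSIX shell. It is only an analogue of `BegCmp()` and `EndCmp()`; the helper names `beg_cmp` and `end_cmp` are hypothetical, and the inputs are the three `style` values from the dataset.

```bash
# Plain POSIX shell sketch of prefix/suffix tests (not aq_pp itself).
# beg_cmp and end_cmp are hypothetical helper names.
beg_cmp() {  # usage: beg_cmp VAL BEGSTR [BEGSTR ...]
  val=$1; shift
  for s in "$@"; do
    # Quoting "$s" keeps the candidate literal; the trailing * matches the rest.
    case $val in "$s"*) echo 1; return ;; esac
  done
  echo 0
}

end_cmp() {  # usage: end_cmp VAL ENDSTR [ENDSTR ...]
  val=$1; shift
  for s in "$@"; do
    case $val in *"$s") echo 1; return ;; esac
  done
  echo 0
}

# Columns printed: style,beginsWith(P|Tr),endsWith(ck|up)
for style in Cup Pack Tray; do
  echo "$style,$(beg_cmp "$style" P Tr),$(end_cmp "$style" ck up)"
done
```

Like the functions described above, both helpers return 1 as soon as any one of the candidate strings matches, and the comparison is case sensitive.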

**`SubCmp(Val, SubStr [, SubStr ...])`**: Compares one or more substrings `SubStr` with any part of `Val`. All the comparisons are case sensitive.

* Returns 1 if there is a match, 0 otherwise.
* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* Each `SubStr` is a string constant that specifies the substring to match.

We'll provide "Noodle" as `SubStr` for the `variety` column, to detect ramen names that contain "Noodle" (case sensitive).

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'SubCmp(variety, "Noodle")' -c variety beginWith

Next, we will provide two strings, "Noodle" and "Spic" (to match both "Spicy" and "Spice"), to find names that contain **EITHER** of the words.



In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'SubCmp(variety, "Noodle", "Spic")' -c variety beginWith

**`SubCmpAll(Val, SubStr [, SubStr ...])`**:Same as `SubCmp()`, except when multiple `SubStr` are provided, it will return 1 only if `Val` contains every single one of `SubStr`.

We'll provide "Noodle" and "Spic" to be compared with `variety` column. This time though it'll return 1 only if `variety` contains **BOTH** "Noodle" and "Spic".

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'SubCmpAll(variety, "Noodle", "Spic")' -c variety beginWith
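
The difference between the "either" behaviour of `SubCmp()` and the "both" behaviour of `SubCmpAll()` can be sketched in plain POSIX shell. This is an illustration only, not the `aq-emod` implementation; `sub_cmp` and `sub_cmp_all` are hypothetical helper names, and the sample lines are `variety` values from the dataset.

```bash
# Plain POSIX shell sketch of "any substring" vs "all substrings" matching.
# sub_cmp and sub_cmp_all are hypothetical helper names (not aq_pp itself).
sub_cmp() {      # usage: sub_cmp VAL SUBSTR [SUBSTR ...]
  val=$1; shift
  for s in "$@"; do
    case $val in *"$s"*) echo 1; return ;; esac   # one hit is enough
  done
  echo 0
}

sub_cmp_all() {  # usage: sub_cmp_all VAL SUBSTR [SUBSTR ...]
  val=$1; shift
  for s in "$@"; do
    case $val in *"$s"*) ;; *) echo 0; return ;; esac   # one miss fails
  done
  echo 1
}

# Columns printed: any,all,variety
while IFS= read -r variety; do
  echo "$(sub_cmp "$variety" Noodle Spic),$(sub_cmp_all "$variety" Noodle Spic),$variety"
done <<'EOF'
Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles
Cup Noodles Chicken Vegetable
Singapore Curry
EOF
```

Only the first sample line contains both "Noodle" and "Spic", so it is the only one for which both helpers print 1.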

**`MixedCmp(Val, SubStr, Typ [, SubStr, Typ ...])`**: You can see this function as a more versatile version of `BegCmp`, `EndCmp`, and `SubCmp`. Given `Val`, you provide each `SubStr` together with a `Typ`, which is one of the following:
* `BEG` - Match with the head of `Val`.
* `END` - Match with the tail of `Val`.
* `SUB` - Match with any part of `Val`.

Note that when more than one `SubStr` is provided, this function returns 1 if **ANY** of the provided `SubStr` / `Typ` pairs matches. This will be demonstrated in the **`SUB`** section later.

* `BEG`<br>
Let's start with `BEG`. We will use the `style` column and give "C" as `SubStr` to find `style` values that begin with "C".

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'MixedCmp(style, "C", BEG)' -c style beginWith

* `END`<br>
Next, we'll provide "ck" as `SubStr` and `END` as `Typ` to find records whose style ends with "ck" (the Pack type).

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'MixedCmp(style, "ck", END)' -c style beginWith

* `SUB`<br>
Lastly, we'll provide "Noodle" and "Spic" as `SubStr` values, with the `variety` column as `Val`, to find records that contain EITHER of these strings.
Since we're looking for a substring match at any position in `variety`, `SUB` will be the `Typ`.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'MixedCmp(variety, "Noodle", SUB, "Spic", SUB)' -c variety beginWith

**`MixedCmpAll(Val, SubStr, Typ [, SubStr, Typ ...])`**: Same as the `MixedCmp()` function above, except that when more than one `SubStr` is provided, it returns 1 only when all of the `SubStr` / `Typ` pairs match `Val`.
For each `SubStr` you provide a `Typ`, which is one of:
* `BEG` - Match with the head of Val.
* `END` - Match with the tail of Val.
* `SUB` - Match with any part of Val.

We will demonstrate it using **`SUB`**, with `variety` column.
This should only return 1 for the record that contains both "Noodle" and "Spic" in `variety` column.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'MixedCmpAll(variety, "Noodle", SUB, "Spic", SUB)' -c variety beginWith

**`Contain(Val, SubStrs)`**: Compares the substrings in `SubStrs` with any part of Val. All the comparisons are case sensitive.

* Returns 1 if there is a match, 0 otherwise.
* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* `SubStrs` is a string constant that specifies what substrings to match. It is a comma-newline separated list of literal substrings of the form “`SubStr1,[\r]\nSubStr2...`”.

Let's test this function by providing "Noodle" and "Spic" as `SubStrs`, and `variety` as `Val`.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'contain(variety, "Noodle,\nSpic")' -c variety beginWith

**`ContainAll(Val, SubStrs)`**:Compares the substrings in SubStrs with any part of Val. All the comparisons are case sensitive.

* Returns 1 if all the substrings match, 0 otherwise.
* Val can be a string column’s name, a string constant, or an expression that evaluates to a string.
* SubStrs is a string constant that specifies what substrings to match. It is a comma-newline separated list of literal substrings of the form “`SubStr1,[\r]\nSubStr2...`”.

Same as `Contain()`, except that when multiple substrings are provided in `SubStrs`, it returns 1 only if all of them are present in `Val`. 
Using the `variety` column with "Noodle" and "Spic" again, we'll see that only records with **BOTH** words present in their `variety` column have 1 as the return value.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:beginWith' 'ContainAll(variety, "Noodle,\nSpic")' -c variety beginWith

For the two examples below, we'll use data from an airline online ticket search, shown below.

ticket|
-----|
From=ISGTo=OKADate=20150904Class=Y
From=OKATo=ITMDate=20150426Class=Y
From=ISGTo=OKADate=20150406Class=Y
From=ITMTo=FUKDate=20151016Class=Y
From=HNDTo=KOJDate=20171112Class=Y
From=NGik0To=OKADate=20150425Class=1
From=HNDTo=ITMDate=20151113Class=S
From=NGOTo=SPKDate=20160528Class=S
From=SPKTo=NGODate=20160207Class=S
From=OKATo=nG0ODate=20150425Class=3




**`PatCmp(Val, Pattern [, AtrLst])`**:Compares a generic wildcard pattern with Val.

* Returns 1 if it matches, 0 otherwise. `Pattern` must match the _entire_ `Val` to be successful.
* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* `Pattern` is a string constant that specifies the pattern to match. It is a simple wildcard pattern containing just '*' (matches any number of bytes) and ‘?’ (matches any 1 byte) only; literal ‘*’, ‘?’ and ‘\’ in the pattern must be ‘\’ escaped.
* Optional `AtrLst` is a list of `|` separated attributes containing:
    * `ncas` - Perform a case insensitive match (default is case sensitive). For ASCII data only.

In the example below, we'll use wildcards to look for records whose Class is equal to "Y".

In [None]:
airline="data/aq_pp/airline_sample.csv"
aq_pp -f,+1 $airline -d S:ticket -eval 'is:contains' 'PatCmp(ticket, "From=*To=*Date=*Class=Y")' -c ticket contains

**`RxCmp(Val, Pattern [, AtrLst])`**:Compares a string with a regular expression.

* Returns 1 if they match, 0 otherwise. `Pattern` only needs to match a subpart of `Val` to be successful.
* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* `Pattern` is a string constant that specifies the regular expression to match.
* Optional `AtrLst` is a list of `|` separated regular expression attributes.

Similar to `PatCmp()`, except that this function takes a regular expression as the pattern. Again, we'll get the records with Class=Y: 

In [None]:
aq_pp -f,+1 $airline -d S:ticket -eval 'is:contains' 'RxCmp(ticket, "Class=Y", pcre)' -c ticket contains
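
The key difference between the two functions is that a `PatCmp()` pattern must match the entire value, while an `RxCmp()` pattern only needs to match part of it. The sketch below illustrates that contrast with a shell `case` pattern and `grep -E`; it is an analogue, not `aq_pp` itself. The helper names `pat_cmp` and `rx_cmp` are hypothetical, and shell patterns also treat `[...]` specially, which goes beyond the `*` and `?` wildcards described for `PatCmp()`.

```bash
# Plain shell sketch: whole-value wildcard match vs partial regex match.
# pat_cmp and rx_cmp are hypothetical helper names (not aq_pp itself).
pat_cmp() {  # usage: pat_cmp VAL PATTERN
  # $2 is left unquoted on purpose so that * and ? act as wildcards.
  case $1 in $2) echo 1 ;; *) echo 0 ;; esac
}

rx_cmp() {   # usage: rx_cmp VAL REGEX
  if printf '%s\n' "$1" | grep -Eq -- "$2"; then echo 1; else echo 0; fi
}

ticket='From=ISGTo=OKADate=20150904Class=Y'
pat_cmp "$ticket" 'From=*To=*Date=*Class=Y'   # 1: pattern covers the whole value
pat_cmp "$ticket" 'Class=Y'                   # 0: pattern covers only the tail
rx_cmp  "$ticket" 'Class=Y'                   # 1: a partial match is enough
```

This is why the wildcard example spells out `From=*To=*Date=*` before `Class=Y`, whereas the regular expression can be just `Class=Y`.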

**`NumCmp(Val1, Val2, Delta)`**:Tests if Val1 and Val2 are within Delta of each other - i.e., whether `Abs(Val1 - Val2) <= Delta`.

* Returns 1 if true, 0 otherwise.
* `Val1`, `Val2` and `Delta` can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.
* `Delta` should be greater than or equal to zero.

We'll use the following dataset with year, month and day columns.

year|month|day
----|----|---|
2015|04|03
2015|08|08
2015|08|23
2015|12|28
2016|03|21
2016|04|11
2016|05|02
2016|05|15
2016|11|02
2016|11|04
2016|12|04
2017|04|26
2017|05|15
2017|10|23
2017|12|18
2018|02|21
2018|08|07
2018|08|07
2018|10|05
2018|12|03

We will calculate the absolute difference between `month` and `day`, then check whether it is within our `Delta`, which is 8.

In [None]:
forth_dim="data/aq_pp/year_month_date.csv"
aq_pp -f,+1 $forth_dim -d I:year I:month I:day -eval 'is:delta' 'abs(month - day)' -eval 'is:isWithin' 'NumCmp(month, day, 8)' -c ~year

You can see that for rows whose `delta` value is greater than 8, `isWithin` is 0, and for the remaining rows it is 1.
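
The check `Abs(Val1 - Val2) <= Delta` can be reproduced by hand with a few lines of `awk`. This is a sketch for verifying the expected values, not the `aq_pp` command; `num_cmp_demo` is a hypothetical helper name, and the input lines are three rows taken from the table above.

```bash
# Plain awk sketch of the NumCmp() test Abs(month - day) <= Delta.
# num_cmp_demo is a hypothetical helper name (not aq_pp itself).
num_cmp_demo() {  # usage: num_cmp_demo DELTA   (reads year,month,day lines)
  awk -F, -v delta_max="$1" '{
    m = $2 + 0; dd = $3 + 0        # force numeric values (drops leading zeros)
    d = m - dd
    if (d < 0) d = -d              # absolute difference
    print m "," dd "," d "," (d <= delta_max)
  }'
}

# Columns printed: month,day,delta,isWithin
printf '%s\n' 2015,04,03 2015,08,23 2016,11,02 | num_cmp_demo 8
```

The first row has a difference of 1 and yields 1; the other two rows have differences of 15 and 9, both above 8, and yield 0.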

<a id='extract_code'></a>
### Data extraction and encode / decode Functions



**`SubStr(Val, Start [, Length])`**: Returns a substring of a string.

* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* `Start` is the starting position (zero-based) of the substring in `Val`. It can be a numeric column’s name, a number, or an expression that evaluates to a number.
     * If `Start` is negative, the length of `Val` will be added to it. If it is still negative, 0 will be used. (Think of it as Python-style indexing from the end of the string.)
* Optional `Length` specifies the length of the substring in `Val`. It can be a numeric column’s name, a number, or an expression that evaluates to a number.
    * Max length is length of `Val` minus `Start`.
    * If `Length` is not specified, max length is assumed.
    * If `Length` is negative, max length will be added to it. If it is still negative, 0 will be used.

For this example, to keep things simple, we'll use a file containing two rows, one with a numeric string and the other with the good old "Hello World", as shown below.


simple_str|
---|
0123456789
Hello World

We'll read this column as `val_str`, provide it as `Val`, and extract the substring from index 3 (counting from zero) through the last index, as follows.

In [None]:
# first define the filename
subStr="data/aq_pp/substr.csv"
aq_pp -f,+1 $subStr -d s:val_str -eval 's:subStr' 'SubStr(val_str, 3)' -c  val_str SubStr

As you can see, the characters from index 3 (counting from zero) onward are extracted as the substring. <br>

* `Length`<br>
Providing this argument will allow users to specify the length of **extracted substring**. Note that this is NOT the ending index of the substring. 
For example, in order to extract substring at index 3 ~ 7 in the original `Val` string, we'd need to provide `3` as `Start` and `5` as `Length`, since the substring extracted will be the length of 5.

In [None]:
aq_pp -f,+1 $subStr -d s:val_str -eval 's:subStr' 'SubStr(val_str, 3, 5)' -c  val_str SubStr

* **Negative Index**<br>
Users can specify an index from the right side of the `Val` string by using negative indexing (similar to Python strings).
For example, say we'd like to extract the word "World" using the negative index. Letter "W" is the 5th character from the right side of the string, so we'll provide `-5` as `Start`.

In [None]:
aq_pp -f,+1 $subStr -d s:val_str -eval 's:subStr' 'SubStr(val_str, -5)' -c  val_str SubStr

`Length` argument can also be negative number. Let's provide `-2` as `Length` parameter, and `0` for `Start`.

In [None]:
aq_pp -f,+1 $subStr -d s:val_str -eval 's:subStr' 'SubStr(val_str, 0, -2)' -c  val_str SubStr

This works like counting the end position from the right: the extracted substring ends just before the 2nd character from the right side of the original string.
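
The `Start` and `Length` rules listed for `SubStr()` (zero-based start, negative values counted from the end, clamping at 0) can be traced with a short `awk` sketch. It follows the rules as written in this section and is not the `aq-emod` implementation; `sub_str` is a hypothetical helper name.

```bash
# Plain awk sketch of the SubStr() Start/Length rules described above.
# sub_str is a hypothetical helper name (not aq_pp itself).
sub_str() {  # usage: sub_str VAL START [LENGTH]
  awk -v s="$1" -v start="$2" -v len="${3-}" 'BEGIN {
    n = length(s)
    if (start < 0) start += n          # negative Start: add the length of Val
    if (start < 0) start = 0           # still negative: use 0
    if (start > n) start = n
    max = n - start                    # max length is length of Val minus Start
    if (len == "") { len = max }       # no Length given: max length is assumed
    else if (len < 0) { len += max }   # negative Length: add max length
    if (len < 0) len = 0               # still negative: use 0
    if (len > max) len = max
    print substr(s, start + 1, len)    # awk strings are 1-based
  }'
}

sub_str "0123456789" 3        # 3456789
sub_str "0123456789" 3 5      # 34567
sub_str "Hello World" -5      # World
sub_str "Hello World" 0 -2    # Hello Wor
```

The four calls correspond to the four `SubStr()` examples in this section: start only, start with length, negative start, and negative length.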

**`StrIndex(Val, Str [, AtrLst])`**: Returns the position (zero-based) of the first occurrence of `Str` in `Val` or -1 if it is not found.

* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* `Str` is the value to find within `Val`. It can be a string column’s name, a string constant, or an expression that evaluates to a string.
* Optional `AtrLst` is a list of `|` separated attributes containing:
    * `ncas` - Perform a case insensitive match (default is case sensitive). For ASCII data only.
    * `back` - Search backwards from the end of `Val`.

Let's check this out with the ramen dataset. We will provide `variety` column as `Val`, and "Noodle" as string to get the index of the string in `Val`.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:isAt' 'StrIndex(variety, "Noodle")' -c variety isAt

You can see that it is outputting the index number of the given string, counting from left side of the `Val` string.

**Attributes**<br>

* **`back`**<br>
Note that the 2nd record (the 3rd row, counting the header) contains 2 occurrences of "Noodle", and the result is index 0. This is because `StrIndex()` returns only the first occurrence of the given string. <br>
We can search backwards from the end instead by giving the `back` attribute, like this.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:isAt' 'StrIndex(variety, "Noodle", back)' -c variety isAt

The 2nd record's result is now 52, which is the index of the second occurrence of the "Noodle" string.

* **`ncas`**<br>
This attribute performs a case insensitive search (the default is case sensitive). 
We'll provide "NOODLE" as `Str`, which would not match any record under the default case sensitive search, together with `ncas` as the attribute.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'is:isAt' 'StrIndex(variety, "NOODLE", ncas)' -c variety isAt
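
The index values discussed above (0 for the first occurrence, 52 when searching backwards) can be checked by hand with `awk`. The sketch below is an analogue written for this notebook, not the `aq-emod` implementation; `str_index` is a hypothetical helper name, and its third argument simply looks for the words `ncas` and `back`.

```bash
# Plain awk sketch of a zero-based substring search with ncas/back options.
# str_index is a hypothetical helper name (not aq_pp itself).
str_index() {  # usage: str_index VAL STR [ATTRS]
  printf '%s\n' "$1" | awk -v t="$2" -v attrs="${3-}" '{
    s = $0
    if (index(attrs, "ncas")) { s = tolower(s); t = tolower(t) }
    # Forward search: awk index() is 1-based and returns 0 when not found,
    # so subtracting 1 gives a zero-based position or -1.
    if (!index(attrs, "back")) { print index(s, t) - 1; next }
    # Backward search: keep the position of the last occurrence found.
    pos = -1; off = 0
    while ((i = index(s, t)) > 0) {
      pos = off + i - 1
      off += i
      s = substr(s, i + 1)
    }
    print pos
  }'
}

variety='Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles'
str_index "$variety" Noodle                              # 0
str_index "$variety" Noodle back                         # 52
str_index 'Cup Noodles Chicken Vegetable' NOODLE         # -1
str_index 'Cup Noodles Chicken Vegetable' NOODLE ncas    # 4
```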

**`RxMap(Val, MapFrom [, Col, MapTo ...] [, AtrLst])`**: Extracts substrings from a string based on a `MapFrom` expression and places the results in columns based on `MapTo` expressions.

* Returns 1 if successful or 0 otherwise. MapFrom only needs to match a subpart of `Val` to be successful.

* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.

* `MapFrom` is a string constant that specifies the regular expression to match. The expression should contain subexpressions for substring extractions.

* The `Col` and `MapTo` pairs define how to save the results. `Col` is the column to put the result in. It must be of string type. `MapTo` is a string constant that defines how to render the result; see the aq-emod documentation for its form.

* Optional `AtrLst` is a list of `|` separated regular expression attributes.

Let's start with a simple example. We have an address column filled with fake addresses, and we'd like to extract the state and ZIP code. The data looks like below.

<a id='fake_address_data'></a>
**Fake Address Dataset**


fake_address|
-----|
06060 Cruz Loop Suite 043, Randyberg, WA 82176
11919 Wells Field Suite 087, East Dianaport, AL 96554
92586 Ferguson Inlet, Port Natalieview, HI 90811
836 Myers Road, South Cynthia, TN 70598
Unit 2992 Box 3756, DPO AE 65985
75721 Jo Bypass, Lake Kaitlin, FL 74395
9964 Justin Cliffs Apt. 446, Elizabethstad, MN 58843
499 Anderson Ridge, Pattersonton, TN 09233
USNS Harris, FPO AE 17643
741 Denise Motorway Suite 930, Desireeland, DC 76580

We will extract the state and ZIP code from each address using a regular expression. 
The expression `[A-Z]{2}\s[0-9]{5}` is provided as `MapFrom` to match 2 capital letters, followed by a whitespace character and a 5-digit number. 

In [None]:
# define the file path and column spec
fake_addrs="data/aq_pp/fake_addrs.csv"
aq_pp -f,+1 $fake_addrs -d S:address -eval S:State_zip '"96828 HI"' -eval - 'RxMap(address, "[A-Z]{2}\s[0-9]{5}", State_zip, "%%0%%", pcre)' -c State_zip
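
To preview which part of each address such a pattern picks up, the same idea can be tried with `awk`'s `match()` function. This is only a sketch: it uses POSIX awk regular expressions rather than `pcre`, so `\s` is replaced by a literal space and the repetition counts are written out; `state_zip` is a hypothetical helper name.

```bash
# Plain awk sketch: extract "two capital letters, space, five digits".
# state_zip is a hypothetical helper name (not aq_pp itself).
state_zip() {  # usage: state_zip ADDRESS
  printf '%s\n' "$1" | awk '{
    # match() sets RSTART/RLENGTH to the leftmost match of the pattern.
    if (match($0, /[A-Z][A-Z] [0-9][0-9][0-9][0-9][0-9]/))
      print substr($0, RSTART, RLENGTH)
  }'
}

state_zip '06060 Cruz Loop Suite 043, Randyberg, WA 82176'   # WA 82176
state_zip 'Unit 2992 Box 3756, DPO AE 65985'                 # AE 65985
```

As with `RxMap()`, the pattern only needs to match a part of the address; everything outside the match is ignored.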

**`PatMap(Val, MapFrom [, Col, MapTo ...] [, AtrLst])`**: Extracts substrings from `Val` based on the `MapFrom` pattern, and maps them to `Col` based on the `MapTo` pattern.

This function uses [RT MapFrom](http://auriq.com/documentation/source/reference/manpages/aq_pp.html?highlight=pcre#rt-mapfrom-syntax) expression instead of regex for `MapFrom` value. 
Let's keep things simple, and extract the zip code only this time.

In [None]:
aq_pp -f,+1 $fake_addrs -d S:address -eval S:Zip '"96828 HI"' -eval - 'PatMap(address, "%*%%ZIP:@n:5-5%%%*", Zip, "%%ZIP%%")' -c Zip

The MapFrom expression `%*%%ZIP:@n:5-5%%%*` was used here. 

- `%*`: represents any number of any characters
- `%%VarName%%`: a variable name of your choice, surrounded by double percent signs
- `:`: works as a separator between the attributes that specify the pattern to store in the variable
- `@n`: the character after `@` represents the character class; here `n` stands for numbers
- `5-5`: represents the min and max number of characters to match; here we'd like to extract exactly 5 characters

**`KeyEnc(Col [, Col ...])`**: Encodes columns of various types into a single string.

* Returns a string key. The key is binary, do not try to interpret or modify it.
* `Col` are the columns to encode into the key.

This feature comes in handy when you'd like to create one composite key column out of multiple columns. Let's create a composite key from `reviewID`, `brand` and `country`.

Here are the selected columns of the data to be encoded.

Review|Brand|Country|
---|---|---|
2580|New Touch|Japan|
2579|Just Way|Taiwan|
2578|Nissin|USA|
2577|Wei Lih|Taiwan|
2576|Ching's Secret|India|
2575|Samyang Foods|South Korea|
2574|Acecook|Japan|
2573|Ikeda Shoku|Japan|
2572|Ripe'n'Dry|Japan|

In [None]:
aq_pp -f,+1 $file -d $cols -eval 's:key' 'KeyEnc(reviewID, brand, country)' -c key

**`KeyDec(Key, Col|"ColType" [, Col|"ColType" ...])`**: Decodes a key previously encoded by KeyEnc() and place the resulting components in the given columns.

* Returns 1 if successful. A failure is considered a processing error. There is no failure return value.
* `Key` is the previously encoded value. It can be a string column’s name, a string constant or an expression that evaluates to a string.
* Each `Col` or `ColType` specifies a component in the key.
    * If a column is given, a component matching the column’s type is expected; the extracted value will be placed in the given column.
    * If a column type string is given, a component matching this type is expected, but the extracted value will not be saved.
* The components must be given in the same order as in the encoding call.

For demonstration purposes, we will encode the records with the first `aq_pp` command, and pipe its output into a second `aq_pp` command.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 's:key' 'KeyEnc(reviewID, brand, country)' -c key | \
aq_pp -f,+1 - -d s:key -eval I:dec_ID '100' -eval S:dec_brand '"prada"' -eval S:dec_country '"South America"' \
-eval - 'KeyDec(key, dec_ID, dec_brand, dec_country)' -c dec_ID dec_brand dec_country

* The 1st line encodes the 3 columns into one string column named `key`, then outputs only that column.
* The 2nd line reads the key from the first `aq_pp` command, then sets up the destination columns for the decoded components.
* The 3rd line decodes the key into the destination columns, and outputs those 3 columns.

You can verify that the content of output column does match the original column's contents.

**`ClipStr(Val, ClipSpec)`**: Returns a substring of a string, based on `ClipSpec`.

* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.

* `ClipSpec` is a string constant that specifies how to clip the substring from the source. It is a sequence of individual clip elements separated by “;”:<br>
Each clip element specifies either the starting or trailing portion of the source string `Val`. Below are some of the commonly used components.
    - `Num`: number of bytes / separators (`Sep`) to clip
    - `Dir`: direction in which to clip the string; `>`: left to right, `<`: right to left
    - `Sep`: single-byte separator character. The substring up to the `Num`-th `Sep` character will be clipped.

Let's take a look at simple examples.

We'll use a list of web URLs as data for the next few examples. It is a single-column list of web URL strings.<br>
Here is what the URL data looks like.

URL|
-----|
https://duckduckgo.com/?q=is+duckduckgo+safe&t=h_&ia=web|
https://www.google.com/search?client=ubuntu&channel=fs&q=hello+world&ie=utf-8&oe=utf-8|
http://auriq.com/documentation/search.html?q=emod&check_keywords=yes&area=default|

First, let's say you'd like to extract the first 5 characters of the URL. We can specify `ClipSpec` as `5>` where 
- `Num`: `5` - specifying the numbers of characters to extract.
- `Dir`: `>` - specifying the direction of extraction, in this case from left to right.

The `val_str` column will be provided as `Val`, and the result will be stored in the `subStr` column.

In [None]:
# assigning new data file name
urls="data/aq_pp/clipstr.csv"
aq_pp -f,+1 $urls -d s:val_str -eval 's:subStr' 'ClipStr(val_str, "5>")' -c val_str subStr

* **`Sep`**<br>
Now what if we would like to extract everything up to and including the domain name from the URLs? We can do this using the `Sep` component. <br>
We will extract everything up to the third `/` character in each URL, by specifying `/` as `Sep` and `3` as `Num`.
Note that `Sep` is inclusive, so the extracted string will include the `Sep` as its last character.

In [None]:
aq_pp -f,+1 $urls -d s:val_str -eval 's:subStr' 'ClipStr(val_str, "3>/")' -c val_str subStr

* **Different Direction**<br>
We can also clip the string from the right side. In this example we'll clip the trailing portion of each URL, by providing `/` as `Sep`, `<` as `Dir` and `2` as `Num`, so that two `/` separators are counted from the right.

In [None]:
aq_pp -f,+1 $urls -d s:val_str -eval 's:subStr' 'ClipStr(val_str, "2</")' -c val_str subStr
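
To make the `5>` and `3>/` clip specs described above more concrete, here is a minimal bash sketch that mimics them outside of `aq_pp`. The helper names `clip_head_bytes` and `clip_head_sep` are hypothetical (they are not part of `aq-emod`), and the behaviour when the string holds fewer separators than `Num` is an assumption of this sketch, not something taken from the `ClipStr()` documentation.

```shell
# Hypothetical helpers for illustration; they are not part of aq_pp or aq-emod.

# clip_head_bytes VAL NUM: first NUM characters of VAL (compare ClipSpec "NUM>").
# Bash counts characters here, which equals bytes for ASCII input.
clip_head_bytes() {
  printf '%s' "${1:0:$2}"
}

# clip_head_sep VAL NUM SEP: everything up to and including the NUM-th SEP,
# scanning left to right (compare ClipSpec "NUM>SEP").
# Assumption: if VAL has fewer than NUM separators, the whole string is returned.
clip_head_sep() {
  local rest=$1 num=$2 sep=$3 out="" i
  for ((i = 0; i < num; i++)); do
    case $rest in
      *"$sep"*)
        out+="${rest%%"$sep"*}$sep"   # keep text up to the next SEP, inclusive
        rest=${rest#*"$sep"}          # continue after that SEP
        ;;
      *)
        out+=$rest                    # no separators left
        break
        ;;
    esac
  done
  printf '%s' "$out"
}

url='https://duckduckgo.com/?q=is+duckduckgo+safe&t=h_&ia=web'
clip_head_bytes "$url" 5; echo      # https
clip_head_sep "$url" 3 /; echo      # https://duckduckgo.com/
```

The second call shows why the separator is described as inclusive: the clipped string ends with the third `/`.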

**`QryDec(Val [, AtrLst], Col, KeyName [, AtrLst] [, Col, KeyName [, AtrLst] ...])`**:<br>
Given a URL or query string as `Val`, extracts the values of selected [query parameters](https://en.wikipedia.org/wiki/Query_string) and places the results in columns. It returns the number of parameters extracted.

`AtrLst` is used to specify extraction behaviour, such as
- `beg`: the character that marks where extraction begins
- `dec`: number of times to perform decoding after extraction

There are [more attributes available](http://auriq.com/documentation/source/reference/manpages/aq-emod.html?highlight=emod#qrydec), but here we use only `beg`, set to `?`, in order to focus on the string after the `?` character in each URL (query strings usually follow the `?` character).
We will extract the value of the `q` parameter into the `result` column, and store the function's return value (1 or 0) in the `e_num` column.

In [None]:
aq_pp -f,+1 $urls -d S:URL S:result -eval 'I:e_num' 0 -eval e_num 'QryDec(URL, "beg=?", result, "q")'
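
As a rough cross-check of what the call above is expected to find, the following bash sketch pulls a single parameter out of a URL by hand. The function name `qry_get` is hypothetical, and the sketch deliberately performs no URL decoding, so its output may differ from `QryDec()` wherever decoding applies (for example `+` is left as is).

```shell
# Hypothetical helper for illustration; it is not part of aq_pp or aq-emod.
# qry_get URL KEY: print the raw (undecoded) value of query parameter KEY.
# Returns 1 when KEY is not present.
qry_get() {
  local query=${1#*\?} key=$2 pair   # drop everything up to the first "?" (compare "beg=?")
  local -a pairs
  IFS='&' read -r -a pairs <<< "$query"   # split the query string on "&"
  for pair in "${pairs[@]}"; do
    if [[ $pair == "$key="* ]]; then
      printf '%s' "${pair#*=}"            # text after the first "="
      return 0
    fi
  done
  return 1
}

qry_get 'https://duckduckgo.com/?q=is+duckduckgo+safe&t=h_&ia=web' q; echo
qry_get 'http://auriq.com/documentation/search.html?q=emod&check_keywords=yes&area=default' q; echo
```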

**`UrlEnc(Val)`**: URL-encode a string.

* Returns the encoded result.
* `Val` is the string to encode. It can be a string column’s name, a string constant or an expression that evaluates to a string.

We'll encode the URLs, and store the result in the `encoded` column.

In [None]:
aq_pp -f,+1 $urls -d s:val_str -eval 's:encoded' 'UrlEnc(val_str)' -c encoded

**`UrlDec(Val)`**: Decodes a URL-encoded string.

* Returns the decoded result.
* `Val` is a URL-encoded string. It can be a string column’s name, a string constant or an expression that evaluates to a string.

We'll first encode the URLs with `UrlEnc()` as in the previous example, pipe the result into another `aq_pp` command, and decode it there using `UrlDec()`. The final result will be stored in and output from the `decoded_url` column.

In [None]:
aq_pp -f,+1 $urls -d s:val_str -eval 's:encoded' 'UrlEnc(val_str)' -c encoded | \
aq_pp -f,+1 - -d s:encoded -eval 's:decoded_url' 'UrlDec(encoded)' -c decoded_url

We can see that the decoded URLs are identical to the original URLs before encoding was applied.
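
The same encode-then-decode round trip can be sketched in plain bash, which may help when checking `aq_pp` output by hand. The function names `url_encode` and `url_decode` are hypothetical, the sketch handles ASCII input only, and it assumes the RFC 3986 "unreserved" character set is left unencoded; the exact set of characters that `UrlEnc()` leaves alone is not stated here and may differ.

```shell
# Hypothetical helpers for illustration; they are not part of aq_pp or aq-emod.
# ASCII input only.

# url_encode STR: percent-encode every character outside A-Z a-z 0-9 . _ ~ -
url_encode() {
  local LC_ALL=C s=$1 out="" c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case $c in
      [A-Za-z0-9._~-]) out+=$c ;;                    # unreserved: copy as is
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;      # anything else: %HH
    esac
  done
  printf '%s' "$out"
}

# url_decode STR: turn each %HH back into the character it stands for.
# Assumes STR contains no literal backslashes.
url_decode() {
  printf '%b' "${1//%/\\x}"
}

encoded=$(url_encode 'http://auriq.com/documentation/search.html?q=emod')
echo "$encoded"
url_decode "$encoded"; echo
```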

**`Base64Enc(Val)`**: Base64-encode a string.

* Returns the encoded result.
* `Val` is the string to encode. It can be a string column's name, a string constant or an expression that evaluates to a string.

Using the ramen data, we'll encode the `country` column.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'S:encoded64' 'Base64Enc(country)' -c country encoded64

**`Base64Dec(Val)`**: Decodes a base64-encoded string.

* Returns the decoded result. There is no integrity check. Portions of `Val` that are not base64-encoded are simply skipped. As a result, the function may return a blank string.

* `Val` is a base64-encoded string. It can be a string column's name, a string constant or an expression that evaluates to a string.

We'll demonstrate it by first encoding the `country` column and placing the result in the `encoded64` column. The second `aq_pp` command then decodes that column, so the result can be compared with the original `country` column.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 'S:encoded64' 'Base64Enc(country)' -c country encoded64 | \
aq_pp -f,+1 - -d s:country s:encoded64 -eval 's:decoded64' 'Base64Dec(encoded64)' -c country decoded64

You can observe that decoded country matches the original column contents.
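
For a quick sanity check of individual values, the standard `base64` command-line utility can be used alongside `aq_pp`. The wrapper names `b64_encode` and `b64_decode` below are hypothetical, and the sketch assumes that `Base64Enc()` produces standard padded base64, which is not confirmed in this notebook.

```shell
# Hypothetical wrappers around the base64 utility; they are not part of aq_pp.

# b64_encode STR: base64-encode STR (no trailing newline is added to the input).
b64_encode() {
  printf '%s' "$1" | base64
}

# b64_decode STR: decode a base64 string back to the original text.
b64_decode() {
  printf '%s' "$1" | base64 --decode
}

b64_encode 'Japan'            # SmFwYW4=
b64_decode 'SmFwYW4='; echo   # Japan
```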

<a id='conversion'></a>
### General Data Conversion Functions

There are several data conversion functions that convert data into other types, such as the following:
- `ToIP(Val)`: IP type
- `ToS(Val)`: String type
- `ToI(Val)`: Integer type
- `ToF(Val)`: Float type

Each of these takes `Val` as its argument, and returns the data as the corresponding data type.
For concrete examples of these functions, refer to the Data Conversion section in the `aq_pp -eval` notebook.

The other 4 entries in this category manipulate string values. Let's take a look at each of them.


**`ToUpper(Val), ToLower(Val)`**: Returns the upper or lower case string representation of `Val`.

- For ASCII strings only. May corrupt multibyte character strings.
- `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.

We'll convert the contents of the `style` column in the Ramen dataset into upper case letters.
`ToLower(Val)` can be used in the same manner.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 's:upper' 'ToUpper(style)' -c style upper


**`MaskStr(Val)`**: Irreversibly masks (or obfuscates) a string value. The result should be nearly as unique as the original (the probability of two different values having the same masked value is extremely small).

* `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.
* The length of the result may be the same or longer than the original.

Let's apply this function to the `style` column as well. You can observe that identical original values are masked into identical strings.

In [None]:
aq_pp -f,+1 $file -d $cols -eval 's:masked' 'MaskStr(style)' -c style masked

**`RxReplace(Val, RepFrom, Col, RepTo [, AtrLst])`**: Replaces the first or all occurrences of a substring in `Val` matching expression `RepFrom` with expression `RepTo` and places the result in `Col`.

- Returns the number of replacements performed or 0 if there is no match.

- `Val` can be a string column’s name, a string constant, or an expression that evaluates to a string.

- `RepFrom` is a string constant that specifies the regular expression to match. Substring(s) matching this expression will be replaced. The expression can contain subexpressions that can be referenced in RepTo.

- `Col` is the column to put the result in. It must be of string type.

Again, we will use the fake address dataset.

**Fake Address Dataset**


fake_address|
-----|
06060 Cruz Loop Suite 043, Randyberg, WA 82176
11919 Wells Field Suite 087, East Dianaport, AL 96554
92586 Ferguson Inlet, Port Natalieview, HI 90811
836 Myers Road, South Cynthia, TN 70598
Unit 2992 Box 3756, DPO AE 65985
75721 Jo Bypass, Lake Kaitlin, FL 74395
9964 Justin Cliffs Apt. 446, Elizabethstad, MN 58843
499 Anderson Ridge, Pattersonton, TN 09233
USNS Harris, FPO AE 17643
741 Denise Motorway Suite 930, Desireeland, DC 76580

In the example below, we will do 2 things.
1. Find the state and ZIP code in the `address` column, rewrite that portion in the format `STATE: WA, ZIP: 82176`, and assign the resulting string to a new column called `replaced`.
2. Store the number of replacements performed on each row in an integer column `num_rep`.

In [None]:
aq_pp -f,+1 $fake_addrs -d S:address \
    -eval 's:replaced' '"SZ"' -eval 'i:num_rep' '0' \
    -eval num_rep 'RxReplace(address, "([A-Z]{2})\s([0-9]{5})", replaced, "STATE: %%1%%, ZIP: %%2%%", pcre)' -c replaced num_rep

* 1st line deals with the input spec.
* 2nd line creates the columns that store the results.
* 3rd line applies `RxReplace()` to the `address` column.
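
To preview what the pattern above does to a single address, a roughly equivalent substitution can be written with `sed`. This is only a sketch: the function name `state_zip` is hypothetical, `[[:space:]]` stands in for `\s`, `\1` and `\2` play the role of `%%1%%` and `%%2%%`, and only the first match is replaced, which is assumed to mirror the `RxReplace()` call as written.

```shell
# Hypothetical helper for illustration; it is not part of aq_pp or aq-emod.
# state_zip ADDRESS: rewrite the first "XX 12345" portion (two capital letters,
# one whitespace character, five digits) as "STATE: XX, ZIP: 12345".
state_zip() {
  printf '%s\n' "$1" |
    LC_ALL=C sed -E 's/([A-Z]{2})[[:space:]]([0-9]{5})/STATE: \1, ZIP: \2/'
}

state_zip '836 Myers Road, South Cynthia, TN 70598'
state_zip 'USNS Harris, FPO AE 17643'
```

Note that the rest of the address is kept; only the matched portion is rewritten.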

**`RxRep(Val, RepFrom, RepTo [, AtrLst])`**: The same as `RxReplace()` except that it returns the result string directly (for this reason, it does not have `RxReplace()`'s `Col` argument).

Let's apply this to the same column as above, but this time without recording the number of replacements performed.

In [None]:
aq_pp -f,+1 $fake_addrs -d S:address -eval 's:replaced' '"SZ"' \
    -eval replaced 'RxRep(address, "([A-Z]{2})\s([0-9]{5})", "STATE: %%1%%, ZIP: %%2%%", pcre)' -c replaced

<a id='date_time'></a>
### Date/Time Conversion Functions

**`DateToTime(DateVal, DateFmt)`**, **`GmDateToTime(DateVal, DateFmt)`**: Both functions take a date string `DateVal` and return the corresponding [UNIX time](https://en.wikipedia.org/wiki/Unix_time) as an integer, unless otherwise specified.

- `DateVal` can be a string column’s name, a string constant, or an expression that evaluates to a string.
- `DateFmt` is a string constant that specifies the format of `DateVal`.

The example below converts the values of a date column into UNIX time, and stores them in a new column (`Unix_time`).


In [None]:
date_data="data/aq_pp/dates.csv"
aq_pp -f,+1 $date_data -d S:date -eval 'I:Unix_time' 'DateToTime(date, "%Y.%m.%d")'

The `DateFmt` conversion codes used in the example above, together with a few common time-of-day codes, are as follows:

- `.` (a dot) - represents a single unwanted character (e.g., a separator).
- `%Y` - 1-4 digit year.
- `%m` - Month in 1-12.
- `%d` - Day of month in 1-31.

- `%H` or `%I` - hour in 0-23 or 1-12.
- `%M` - Minute in 0-59.
- `%S` - Second in 0-59.

We only covered a simple example, but more conversion codes are available.
For more details, please refer to [Date/Time conversion - aq-emod](http://auriq.com/documentation/source/reference/manpages/aq-emod.html#date-time-conversion-functions).
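
To illustrate what the resulting integer means, the sketch below computes the UNIX time of a `%Y.%m.%d` style date with plain bash arithmetic, using the well-known days-from-civil calculation. The function name `date_to_epoch_utc` is hypothetical, and the date is interpreted as midnight UTC, which is an assumption of this sketch: `DateToTime()` itself may apply a timezone, so its result can differ by the timezone offset.

```shell
# Hypothetical helper for illustration; it is not part of aq_pp or aq-emod.
# date_to_epoch_utc DATE: DATE is year, month and day separated by ".", "-" or "/".
# Prints the UNIX time of midnight UTC on that day.
date_to_epoch_utc() {
  local y m d era yoe mp doy doe
  IFS='.-/' read -r y m d <<< "$1"
  y=$((10#$y)); m=$((10#$m)); d=$((10#$d))   # force base 10 so "08" is not read as octal
  if (( m <= 2 )); then y=$((y - 1)); fi     # count Jan/Feb as the end of the previous year
  era=$(( (y >= 0 ? y : y - 399) / 400 ))    # 400-year cycle number
  yoe=$(( y - era * 400 ))                   # year within the cycle
  mp=$(( m > 2 ? m - 3 : m + 9 ))            # March = 0 ... February = 11
  doy=$(( (153 * mp + 2) / 5 + d - 1 ))      # day within the March-based year
  doe=$(( yoe * 365 + yoe / 4 - yoe / 100 + doy ))
  echo $(( (era * 146097 + doe - 719468) * 86400 ))   # days since 1970-01-01, in seconds
}

date_to_epoch_utc '2019.10.21'   # 1571616000
date_to_epoch_utc '1970-01-01'   # 0
```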

**`TimeToDate(TimeVal, DateFmt)`**, **`TimeToGmDate(TimeVal, DateFmt)`**: Both functions return the date string corresponding to TimeVal. The result string’s maximum length is 127.

- `TimeVal` can be a numeric column’s name, a numeric constant, or an expression that evaluates to a number.
- `DateFmt` is a string constant that specifies the format of the output. 
- Conversion is timezone dependent. It is done using the program’s default timezone. Set the program’s timezone, e.g., via the `TZ` environment variable, before execution if necessary.

The example below takes a column of UNIX time values and converts them to date strings such as "2019-10-21 16:17:56".

In [54]:
unix_data="data/aq_pp/Unix_time.csv"
aq_pp -f,+1 $unix_data -d I:Unix_time -eval 'S:Date' 'TimeToDate(Unix_time, "%Y-%m-%d %H:%M:%S")'

"Unix_time","Date"
119731017,"1973-10-17 18:36:57"
1000000000,"2001-09-09 01:46:40"
1111111111,"2005-03-18 01:58:31"
2000000000,"2033-05-18 03:33:20"
2147483647,"2038-01-19 03:14:07"


You can also change the timezone used for the conversion by setting the `TZ` environment variable for the command. For example, to use Japan's timezone:

In [55]:
TZ="Japan" aq_pp -f,+1 $unix_data -d I:Unix_time -eval 'S:Date' 'TimeToDate(Unix_time, "%Y-%m-%d %H:%M:%S")'

"Unix_time","Date"
119731017,"1973-10-18 03:36:57"
1000000000,"2001-09-09 10:46:40"
1111111111,"2005-03-18 10:58:31"
2000000000,"2033-05-18 12:33:20"
2147483647,"2038-01-19 12:14:07"
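
The two outputs above can be reproduced without `aq_pp` by using bash's builtin `printf` time format (available in bash 4.2 and later). The function name `epoch_to_date` is hypothetical; the zone strings `UTC0` and `JST-9` are POSIX-style `TZ` values chosen for this sketch so that it does not depend on an installed timezone database.

```shell
# Hypothetical helper for illustration; it is not part of aq_pp or aq-emod.
# epoch_to_date EPOCH [ZONE]: format EPOCH as "YYYY-mm-dd HH:MM:SS" in ZONE.
# ZONE is a TZ value; it defaults to UTC0 in this sketch.
epoch_to_date() {
  local epoch=$1 zone=${2:-UTC0}
  TZ=$zone printf '%(%Y-%m-%d %H:%M:%S)T\n' "$epoch"
}

epoch_to_date 1000000000          # 2001-09-09 01:46:40
epoch_to_date 1000000000 JST-9    # 2001-09-09 10:46:40
```

The nine-hour difference between the two lines matches the difference between the two `aq_pp` outputs shown above.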


## To Dos

<a id='character_encoding'></a>
* ### Character set encoding conversion Functions
