# aq_pp command examples

In this notebook, we'll go over common usage examples of the data preprocessing command, `aq_pp`. We'll start with the basics and work our way up to advanced examples.

Before going over this notebook, make sure you're familiar with

* Bash commands
* Regular expressions
* aq_input / input-spec

We won't go over the input, column and output specs in this notebook. They are covered in
- [aq_input notebook](aq_input.ipynb)
- [aq_output notebook](aq_output.ipynb)

Also keep the [aq_pp documentation](http://auriq.com/documentation/source/reference/manpages/aq_pp.html) at hand, so you can refer to the details of each option as you go through the examples.

## Table of Contents

## Basic Options

### Evaluation

- `-eval`
- `-var`

### Output Options
- `-c`
- `-o`
- `-ovar`

### Combining Datasets
- `-cat`
- `-cmb`
- `-sub`

### Filtering
- `-grep`
- `-filt`


### String Manipulation
- `-mapf ... -mapc`
- `-map`


### Conditionals
- `-if`
- `-elif`
- `-else`
- `-endif`


### Variable usage
### Advanced Options
- `-kenc`
- `-kdec`
- `-pmod`
- `-imp`
- `-exp` (this belongs to input spec, so should it be in the [aq_input notebook](aq_input.ipynb)?)
### Builtin Variables
- `Random`
- `RowNum`
- `CurSec`
- `CurUSec`


## Basic Options
We'll start with basic options for data wrangling, grouped by objective.

### Evaluation

This section explores 2 options:
- `-eval`: evaluates the given expression and stores the result in an existing or new column.
- `-var`: creates a new variable and initializes its value.

These 2 options are essential for manipulating data and performing calculations on it.

#### [-eval](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#eval)

The basic syntax for this option is
```bash
aq_pp ... -eval ColSpec|ColName Expr
```
where
- `ColSpec`: column spec of a new column that receives the result
- `ColName`: name of an existing column that receives the result
- `Expr`: expression to be evaluated

#### Data

The [Ramen Ratings Dataset](https://www.kaggle.com/residentmario/ramen-ratings) from Kaggle will be used in this sample. It contains ratings of 2500 ramen products.

Review|Brand|Variety|Style|Country|Stars
---|---|---|---|---|---|
2580|New Touch|T's Restaurant Tantanmen|Cup|Japan|3.75
2579|Just Way|Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles|Pack|Taiwan|1
2578|Nissin|Cup Noodles Chicken Vegetable|Cup|USA|2.25
2577|Wei Lih|GGE Ramen Snack Tomato Flavor|Pack|Taiwan|2.75
2576|Ching's Secret|Singapore Curry|Pack|India|3.75
2575|Samyang Foods|Kimchi song Song Ramen|Pack|South Korea|4.75
2574|Acecook|Spice Deli Tantan Men With Cilantro|Cup|Japan|4
2573|Ikeda Shoku|Nabeyaki Kitsune Udon|Tray|Japan|3.75
2572|Ripe'n'Dry|Hokkaido Soy Sauce Ramen|Pack|Japan|0.25
2571|KOKA|The Original Spicy Stir-Fried Noodles|Pack|Singapore|2.5

Here are its columns and data types.
- `int: Review #`: review ID number; the more recent the review, the larger the number
- `str: Brand`: brand / manufacturer of the product
- `str: Variety`: title of the product
- `str: Style`: style category of the product: cup, pack or tray
- `str: Country`: country of origin
- `float: Stars`: star rating of each product

And here is the corresponding column spec<br>
`i:reviewID s:brand s:variety s:style s:country f:stars`

**Note**<br>
When reading in the files with `aq_pp`, we'll be using bash's [variable substitution](http://www.compciv.org/topics/bash/variables-and-substitution/) to keep the commands short.

#### Todo

The operations that `-eval` can perform on data can be broken down into the 4 categories below.
Things to cover in this section are
- ONIT: numerical evaluation
    - constant and colname, on new column
    - constant and colName, on existing column
    - colName and ColName, on new column
- string manipulation
- data type conversion operation
    - data type
- any operation possible by builtin functions (aq-emod)


Let's start with Numerical operations.

**Numerical Operation**

The operators supported for numerical operations are<br>

_Arithmetic_
- `*`: multiplication
- `/`: division
- `%`: modulus
- `+`: addition
- `-`: subtraction

_Bitwise_
- `&`: AND
- `|`: OR
- `^`: XOR

First, we will double the value of the `stars` column and assign it to a new column named `double_rating`.

In [1]:
# First store filename and column spec in variable to simplify commands
file="data/aq_pp/ramen-ratings-part.csv"
cols="i:reviewID s:brand s:variety s:style s:country f:stars"
# now create a column called double_rating, and assign the value of 2 * stars
aq_pp -f,+1 $file -d $cols -eval f:double_rating '2*stars'

"reviewID","brand","variety","style","country","stars","double_rating"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,7.5
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,2
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,4.5
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,5.5
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,7.5
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,9.5
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,7.5
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,0.5


Now the new column `double_rating` contains twice the `stars` value.

**A couple of things to note**<br>
- **Column datatype:** the destination column's datatype has to match the datatype of the result of `Expr`. In the example above the result is a float, so we declared `double_rating` as a float.
- **Quotations:** `ColName|ColSpec` cannot be quoted, while the `Expr` to be evaluated needs to be quoted. Single quotes are recommended, because string values inside the expression need to be quoted as well.

Now we will perform the same operation, but store the result in the existing column `stars`.

In [22]:
aq_pp -f,+1 $file -d $cols -eval stars '2*stars'

"reviewID","brand","variety","style","country","stars"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",7.5
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",2
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",4.5
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",5.5
2576,"Ching's Secret","Singapore Curry","Pack","India",7.5
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",9.5
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",7.5
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.5


You can apply any of the other arithmetic operators in the same way.

In the example above, `Expr` contained only an existing column and a constant. `Expr` can also reference multiple columns in one calculation.

We'll divide `reviewID` (int) by `stars` (float), and store the result in a new column `div` (float).

In [23]:
aq_pp -f,+1 $file -d $cols -eval f:div 'reviewID/stars'

"reviewID","brand","variety","style","country","stars","div"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,688
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,2579
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,1145.7777777777778
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,937.09090909090912
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,686.93333333333328
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,542.10526315789468
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,643.5
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,686.13333333333333
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,10288
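
The quotients above can be sanity-checked outside of `aq_pp`. The following is a minimal sketch, assuming a POSIX `awk` is available; `divide` is a hypothetical helper name, and the sketch only reproduces the arithmetic, not how `aq_pp` evaluates or formats its results.

```bash
# Hypothetical helper: divide the first argument by the second in floating
# point and print the quotient with 4 decimal places.
divide() {
    awk -v num="$1" -v den="$2" 'BEGIN { printf "%.4f\n", num / den }'
}

divide 2580 3.75   # reviewID / stars for the first row
divide 2574 4      # reviewID / stars for the seventh row
```

An integer divided by a float generally gives a fractional result, which fits the float destination column `f:div` used above.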


**+ operator with string**<br>
Besides numeric addition, the `+` operator can also be used to concatenate string values. As an example, we'll create a string column `s:info` and store the combined strings of `brand` and `country`, separated by ` - `.


In [4]:
aq_pp -f,+1 $file -d $cols -eval s:info 'brand+" - "+country'

"reviewID","brand","variety","style","country","stars","info"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,"New Touch - Japan"
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,"Just Way - Taiwan"
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,"Nissin - USA"
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,"Wei Lih - Taiwan"
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,"Ching's Secret - India"
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,"Samyang Foods - South Korea"
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,"Acecook - Japan"
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,"Ikeda Shoku - Japan"
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,"Ripe'n'Dry - Japan"


Note that `brand` and `country` are column names while ` - ` is a string constant, which is why it is double quoted. 
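
How the nested quotes reach the command can be checked in plain shell, without running `aq_pp`. In this sketch `count_args` is a hypothetical helper; it only shows what the shell does with the quotes, and assumes nothing about how `aq_pp` parses the expression afterwards.

```bash
# Hypothetical helper: report how many arguments the shell passed.
count_args() {
    echo "$#"
}

# Single quotes keep the expression as one argument and preserve the inner
# double quotes around the string constant.
count_args 'brand+" - "+country'
printf '%s\n' 'brand+" - "+country'

# Without the outer single quotes, the shell consumes the double quotes
# itself, so the string constant arrives unquoted.
printf '%s\n' brand+" - "+country
```

The last command prints the expression with the double quotes stripped, which is why the whole `Expr` is wrapped in single quotes.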

**[Builtin Variables](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#eval)**<br>
You can also use builtin variables in `Expr`. There are a couple of them, and here we'll take a look at `$RowNum` and `$Random`.

Using `$RowNum`, we'll create a new integer column `row` and store the row number.

In [24]:
aq_pp -f,+1 $file -d $cols -eval i:row '$RowNum'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,1
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,2
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,3
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,4
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,5
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,6
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,7
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,8
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,9


Since we are skipping the header row with the `-f,+1` option, we'll correct the row numbers by adding 1 to each of them, so that they match the line numbers in the file.

In [25]:
aq_pp -f,+1 $file -d $cols -eval i:row '$RowNum +1'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,2
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,3
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,4
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,5
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,6
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,7
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,9
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,10
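
Single quotes matter even more for expressions that contain builtin variables: inside double quotes, bash expands a `$name` itself before the command receives the argument. The sketch below shows this in plain shell; `show_arg` is a hypothetical helper, and the empty shell variable `RowNum` is set here only for the demonstration.

```bash
# Hypothetical helper: print the argument exactly as a command would receive it.
show_arg() {
    printf '%s\n' "$1"
}

# Demonstration only: the shell has no meaningful value for this name.
RowNum=""

show_arg '$RowNum +1'   # single quotes: the text is passed through literally
show_arg "$RowNum +1"   # double quotes: the shell expands $RowNum to nothing
```

With double quotes, the expression loses the variable name before it reaches the command, so the examples in this notebook keep `Expr` in single quotes.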


In [26]:
aq_pp -f,+1 $file -d $cols -eval i:row '$random'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,476707713
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,1186278907
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,505671508
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,2137716191
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,936145377
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,1215825599
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,589265238
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,924859463
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,1182112391


This outputs a very large positive integer for each row. Sometimes we need random numbers within a certain range, say between 0 and 9. We can get that with the modulus operator:

In [27]:
aq_pp -f,+1 $file -d $cols -eval i:row '$random%10'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,3
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,7
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,8
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,1
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,7
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,9
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,3
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,1
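
Comparing the two outputs above, each `row` value of the `$random%10` run equals the last decimal digit of the value in the same row of the `$random` run. The sketch below reproduces that arithmetic in plain shell for the first five listed values. The helpers `mod10` and `one_to_ten` are hypothetical names, and the second one is an assumption about how one might shift the range to 1..10; neither is part of `aq_pp`.

```bash
# Values copied from the first five rows of the `$random` output above.
samples="476707713 1186278907 505671508 2137716191 936145377"

# Hypothetical helper: map a non-negative integer into the range 0..9.
mod10() {
    echo $(( $1 % 10 ))
}

# Hypothetical helper: map a non-negative integer into the range 1..10.
one_to_ten() {
    echo $(( $1 % 10 + 1 ))
}

for value in $samples; do
    printf '%s -> %s (0..9), %s (1..10)\n' \
        "$value" "$(mod10 "$value")" "$(one_to_ten "$value")"
done
```

A modulus of `N` always yields values from 0 to `N - 1`, so `%10` never produces 10 itself.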


Let's take a look at the bitwise operators, which perform bitwise logical operations on integers.

For this, we'll use a different file containing integer values, which looks like this

number|mask
---|---
1|981
290|90
31|12
79|56
10|874

Let's apply the `|` (bitwise OR) operator to the `number` column with the constant 32. The result will be stored in the new column `i:result`.

In [28]:
aq_pp -f,+1 data/aq_pp/bitwise.csv -d i:number i:mask -eval i:result 'number | 32'

"number","mask","result"
1,981,33
290,90,290
31,12,63
79,56,111
10,874,42


**Note**: `aq_pp` interprets numbers as decimal by default, so the operands of a bitwise operator are read as decimal numbers and the result is shown as a decimal number as well.
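
The results above can be cross-checked with shell integer arithmetic, which uses the same three operator symbols. This is a minimal sketch with hypothetical helper names (`bit_or`, `bit_and`, `bit_xor`); it illustrates the bitwise arithmetic on decimal inputs and does not call `aq_pp`.

```bash
# Hypothetical helpers wrapping shell integer arithmetic.
bit_or()  { echo $(( $1 | $2 )); }
bit_and() { echo $(( $1 & $2 )); }
bit_xor() { echo $(( $1 ^ $2 )); }

# Same inputs as the `number` column, OR-ed with the constant 32.
for number in 1 290 31 79 10; do
    printf '%s | 32 = %s\n' "$number" "$(bit_or "$number" 32)"
done

# AND and XOR on one number/mask pair from the table (31 and 12).
bit_and 31 12
bit_xor 31 12
```

290 stays unchanged because 290 = 256 + 32 + 2 already has the 32 bit set, while 1, 31, 79 and 10 do not, so the OR adds 32 to each of them.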

**Data Conversion Operation**<br>

Aq tools come with various [builtin functions / aq-emod](http://auriq.com/documentation/source/reference/manpages/aq-emod.html) to manipulate, clean and convert different data types.
We'll go over a couple of [data conversion functions](http://auriq.com/documentation/source/reference/manpages/aq-emod.html#general-data-conversion-functions) in this notebook.

For the purpose of this sample, we will modify the column spec for the ramen dataset in order to input all the columns as strings.

After that, we'll create a new column `i:int_reviewID` to store the converted integer value of `reviewID`.

In [9]:
cols="s:reviewID s:brand s:variety s:style s:country s:stars"
# input every column as string
aq_pp -f,+1 $file -d $cols 

"reviewID","brand","variety","style","country","stars"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75"
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1"
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25"
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75"
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75"
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75"
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4"
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75"
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25"


Notice that the `reviewID` and `stars` columns are quoted, showing that `aq_pp` is interpreting them as strings at this point. Let's convert them into appropriate data types with builtin functions:
- `ToF(Val)`: convert `Val` to float
- `ToI(Val)`: convert `Val` to integer

`Val` can be a constant value or the name of a column of string / numeric data type.

In [30]:
aq_pp -f,+1 $file -d $cols -eval i:int_reviewID 'ToI(reviewID)'

"reviewID","brand","variety","style","country","stars","int_reviewID"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",2580
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",2579
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",2578
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",2577
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",2576
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",2575
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",2574
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",2573
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",2572


We provided a column name as `Val` in the example above, but we can also provide a string constant. Note that you should always quote string values in the `Expr` of the `-eval` option.

In [31]:
aq_pp -f,+1 $file -d $cols -eval i:int_reviewID 'ToI("13")'

"reviewID","brand","variety","style","country","stars","int_reviewID"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",13
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",13
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",13
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",13
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",13
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",13
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",13
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",13
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",13


Builtin functions can also be combined with arithmetic expressions. Let's convert `"13"` (string constant) and `reviewID` into integers, add them together, and store the result in the `i:result` column this time.

In [32]:
aq_pp -f,+1 $file -d $cols -eval i:result 'ToI(reviewID) + ToI("13")'

"reviewID","brand","variety","style","country","stars","result"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",2593
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",2592
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",2591
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",2590
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",2589
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",2588
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",2587
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",2586
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",2585


In the `result` column, 13 has been added to the original `reviewID` value.

We can also concatenate numeric strings with `+`, convert the result to a numeric data type, and then store it in a numeric column. Let's take a look.
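
Before running it, the difference from the previous example is worth spelling out: converting first and then adding is arithmetic, while adding the strings first joins their digits. The sketch below shows the same contrast in plain shell. The helper names `add_as_int` and `join_as_text` are hypothetical, and the shell is only a stand-in for the two orders of operation, not for `ToI` itself.

```bash
# Hypothetical helper: treat both arguments as integers and add them.
add_as_int() {
    echo $(( $1 + $2 ))
}

# Hypothetical helper: join the two arguments as text, digit by digit.
join_as_text() {
    printf '%s%s\n' "$1" "$2"
}

add_as_int 2580 13     # convert each value, then add
join_as_text 2580 13   # concatenate the strings, then read as one number
```

The first call yields 2593 and the second 258013, which are the kinds of values to expect from `ToI(reviewID) + ToI("13")` and `ToI(reviewID + "13")` respectively for the first row.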



In [14]:
aq_pp -f,+1 $file -d $cols -eval i:result 'ToI(reviewID + "13")'

"reviewID","brand","variety","style","country","stars","result"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",258013
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",257913
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",257813
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",257713
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",257613
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",257513
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",257413
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",257313
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",257213


**String Manipulation**<br>

The `-map` options will be covered in a different notebook.