# aq_pp  -eval

In this notebook, we'll go over common usage examples of the data preprocessing command `aq_pp`, focusing primarily on the `-eval` option.

## Overview

The `-eval` option of the `aq_pp` command is responsible for data manipulation and column creation. Given an expression and a destination column name, it _evaluates_ the expression and stores the result in the destination column.

Before going through this notebook, make sure you're familiar with the following concepts.

* Bash commands
* Regular Expression
* aq_input / input-spec 

We won't go over input, column, and output specs in this notebook. Some resources are available at

* Notebooks
    - [aq_input notebook](aq_input.ipynb)
    - [aq_output notebook](aq_output.ipynb)
* Documentation
    - [aq_input](http://auriq.com/documentation/source/reference/manpages/aq-input.html)
    - [aq_output](http://auriq.com/documentation/source/reference/manpages/aq-output.html)
    
    
Also keep the [-eval, aq_pp documentation](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#eval) at hand, so you can refer to the details of each option as needed.

**Skip to the [Data](#data) section if you're already familiar with the syntax.**<br>
### Syntax

```bash
aq_pp ... -eval ColSpec|ColName Expr
```
where
- `ColSpec`: column spec of a new column that receives the result
- `ColName`: name of an existing column that receives the result
- `Expr`: expression to be evaluated; it can combine column names, constants, and operators.


**A couple of things to note**<br>
- **Column datatype:** the destination column's datatype has to match the datatype of the result of `Expr`. For example, in `-eval f:double_rating '2*stars'` (used in the [Single Column in `Expr`](#single_col_exp) sample), the result is a float, so `double_rating` is declared as float.
- **Quotations:** you cannot quote `ColName|ColSpec`, while `Expr` needs to be quoted. Single quotes are recommended, because string constants inside `Expr` need their own (double) quotes.
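
The quoting rule can be checked without running `aq_pp` at all. The sketch below uses plain shell only; `count_args` is a hypothetical helper introduced for illustration (it is not part of `aq_pp`) that reports how many arguments the shell actually passes to a command.

```bash
# Hypothetical helper (not an aq_pp feature): prints the number of arguments received.
count_args() {
    echo "$#"
}

# Single-quoted: the whole expression, inner double quotes included, is ONE argument.
count_args 'brand + " - " + country'    # prints 1

# Unquoted: the shell splits on whitespace and removes the inner double quotes.
count_args brand + " - " + country      # prints 5
```

With single quotes the expression reaches the command untouched, which is why they are the safer default for `Expr`.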

### Operators

The following operators are supported for numerical operations.<br>

_Arithmetic_
- `*`: multiplication
- `/`: division
- `%`: modulus
- `+`: addition
- `-`: subtraction

_Bitwise_
- `&`: AND
- `|`: OR
- `^`: XOR

<a id='data'></a>
## Data

The [Ramen Ratings Dataset](https://www.kaggle.com/residentmario/ramen-ratings) from Kaggle is used in this sample; it contains ratings of 2500 ramen products.

Review|Brand|Variety|Style|Country|Stars
---|---|---|---|---|---|
2580|New Touch|T's Restaurant Tantanmen|Cup|Japan|3.75
2579|Just Way|Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles|Pack|Taiwan|1
2578|Nissin|Cup Noodles Chicken Vegetable|Cup|USA|2.25
2577|Wei Lih|GGE Ramen Snack Tomato Flavor|Pack|Taiwan|2.75
2576|Ching's Secret|Singapore Curry|Pack|India|3.75
2575|Samyang Foods|Kimchi song Song Ramen|Pack|South Korea|4.75
2574|Acecook|Spice Deli Tantan Men With Cilantro|Cup|Japan|4
2573|Ikeda Shoku|Nabeyaki Kitsune Udon|Tray|Japan|3.75
2572|Ripe'n'Dry|Hokkaido Soy Sauce Ramen|Pack|Japan|0.25
2571|KOKA|The Original Spicy Stir-Fried Noodles|Pack|Singapore|2.5

The columns and corresponding data types of the dataset are as follows.
- `int: Review #`: review ID number; the more recent the review, the larger the number
- `str: Brand`: brand / manufacturer of the product
- `str: Variety`: title of the product
- `str: Style`: categorical style of the product: cup, pack, or tray
- `str: Country`: country of origin
- `float: Stars`: star rating of each product

Several other datasets will be used; they will be introduced along the way.<br>
Now that we are all set, let's get started with numerical operations.

### Table of Samples
#### Arithmetic
- [Single Column in `Expr`](#single_col_exp)
- [Multiple Columns in `Expr`](#mul_col_exp)
- [String Operation](#string_op)
- [Bitwise Operation](#bit_op)

#### Builtin Variables
- [`RowNum`](#row_num)
- [`Random`](#random)

#### [Data Conversion](#data_conversion)

<br>

## Arithmetic

<a id='single_col_exp'></a>
### Single Column in `Expr`
First, we will double the value of the star rating column, `stars`, and assign it to a new column named `double_rating`.

In [1]:
# First store filename and column spec in variable to simplify commands
file="data/aq_pp/ramen-ratings-part.csv"
cols="i:reviewID s:brand s:variety s:style s:country f:stars"
# now create a column called double_rating, and assign the value of 2 * stars
aq_pp -f,+1 $file -d $cols -eval f:double_rating '2*stars' -c stars double_rating

"stars","double_rating"
3.75,7.5
1,2
2.25,4.5
2.75,5.5
3.75,7.5
4.75,9.5
4,8
3.75,7.5
0.25,0.5


Now the new column `double_rating` contains the value twice as large as the `stars` value.
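
As a quick cross-check of the arithmetic, the same doubling can be reproduced outside `aq_pp`. This is a minimal sketch using `awk`, with `double_stars` as a hypothetical helper name and three ratings typed in by hand rather than read from the CSV file.

```bash
# double_stars is a hypothetical helper, not an aq_pp feature.
# It reads one rating per line and prints "rating,2*rating".
double_stars() {
    awk '{ print $1 "," (2 * $1) }'
}

printf '%s\n' 3.75 1 0.25 | double_stars
# prints:
# 3.75,7.5
# 1,2
# 0.25,0.5
```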


In the example below, we'll assign the result to the existing column `stars` instead of creating a new column.

In [85]:
aq_pp -f,+1 $file -d $cols -eval stars '2*stars'

"reviewID","brand","variety","style","country","stars"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",7.5
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",2
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",4.5
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",5.5
2576,"Ching's Secret","Singapore Curry","Pack","India",7.5
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",9.5
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",7.5
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.5


You can apply any of the other arithmetic operators in the same way.

So far, `Expr` has contained only one existing column and a constant. We can also use multiple column names in `Expr` and perform calculations across them.

<a id='mul_col_exp'></a>
### Multiple Columns in `Expr`

We'll divide `reviewID` (int) by `stars` (float) and store the result in a new column, `div` (float). Feel free to try more complex operations using other operators as well.

In [3]:
aq_pp -f,+1 $file -d $cols -eval f:div 'reviewID/stars'

"reviewID","brand","variety","style","country","stars","div"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,688
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,2579
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,1145.7777777777778
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,937.09090909090912
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,686.93333333333328
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,542.10526315789468
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,643.5
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,686.13333333333333
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,10288
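
The quotients above can be sanity-checked outside `aq_pp`. The following is a minimal sketch with `awk` doing the floating-point division; `divide` is a hypothetical helper name, and only two rows with short results are checked.

```bash
# divide is a hypothetical helper, not an aq_pp feature.
# awk performs the division in floating point.
divide() {
    awk -v a="$1" -v b="$2" 'BEGIN { print a / b }'
}

divide 2580 3.75    # prints 688
divide 2574 4       # prints 643.5
```

Dividing an integer by a float generally gives a fractional result, which is why the destination column `div` is declared as float.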


<a id='string_op'></a>
### String Operation 
**+ operator with string**<br>
Besides numeric addition, the `+` operator can also be used to concatenate string values. As an example, we will concatenate the strings in the `brand` and `country` columns, separated by ` - `, and store the result in a newly created string column, `s:info`.

Note that `+` is the only operator that supports string manipulation.<br>
In the `Expr` `brand + " - " + country`, ` - ` (a dash surrounded by spaces) is double quoted because it is a string constant, as opposed to a column name.

In [4]:
aq_pp -f,+1 $file -d $cols -eval s:info 'brand + " - " + country' 

"reviewID","brand","variety","style","country","stars","info"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,"New Touch - Japan"
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,"Just Way - Taiwan"
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,"Nissin - USA"
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,"Wei Lih - Taiwan"
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,"Ching's Secret - India"
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,"Samyang Foods - South Korea"
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,"Acecook - Japan"
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,"Ikeda Shoku - Japan"
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,"Ripe'n'Dry - Japan"
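
For comparison, the same concatenation can be written in plain shell. This sketch does not call `aq_pp`; `join_info` is a hypothetical helper that mirrors the expression `brand + " - " + country` for two values passed by hand.

```bash
# join_info is a hypothetical helper, not an aq_pp feature.
# It joins its two arguments with " - ", like: brand + " - " + country
join_info() {
    printf '%s - %s\n' "$1" "$2"
}

join_info "New Touch" "Japan"         # prints: New Touch - Japan
join_info "Ching's Secret" "India"    # prints: Ching's Secret - India
```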


More complex string manipulations are possible in `aq_pp` with the `-map` options and/or [`builtin functions / aq-emod`](http://auriq.com/documentation/source/reference/manpages/aq-emod.html), which will be covered in [another notebook](aq-emod.ipynb).

<a id='bit_op'></a>
### Bitwise Operation

Let's take a look at the bitwise operators, which perform [bitwise logical operations](https://en.wikipedia.org/wiki/Bitwise_operation) on integers written in decimal notation.

The following 3 operators are supported.

- `&`: AND
- `|`: OR
- `^`: XOR

To demonstrate the result of bitwise operations clearly, we'll use a different dataset containing integers. It is stored at `data/aq_pp/bitwise.csv` and looks like this:

number|mask
---|---
1|981
290|90
31|12
79|56
10|874

Let's apply the `|` (bitwise OR) operator to the `number` column with a constant 32. The result will be stored in the new column `i:result`.

In [5]:
aq_pp -f,+1 data/aq_pp/bitwise.csv -d i:number i:mask -eval i:result 'number | 32'

"number","mask","result"
1,981,33
290,90,290
31,12,63
79,56,111
10,874,42


Feel free to try the other 2 operators to see the result!<br>

**Note**: <br>
`aq_pp` interprets numbers as decimal by default; therefore, input to the operators is interpreted as decimal, and the output is a decimal number.
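
Shell arithmetic happens to use the same symbols for AND, OR and XOR, so it offers a convenient way to predict what the other two operators should return. The sketch assumes `aq_pp` follows the usual integer semantics for these operators, which the OR results above are consistent with; `bit_ops` is a hypothetical helper, not an `aq_pp` feature.

```bash
# bit_ops is a hypothetical helper, not an aq_pp feature.
# It prints "AND OR XOR" for two decimal integers.
# The operators sit inside $(( )) so the shell does not read | and & as pipe / background.
bit_ops() {
    echo "$(( $1 & $2 )) $(( $1 | $2 )) $(( $1 ^ $2 ))"
}

bit_ops 1 981    # prints: 1 981 980
bit_ops 31 12    # prints: 12 31 19

# The OR with the constant 32 from the example above:
echo "$(( 79 | 32 ))"     # prints 111
echo "$(( 290 | 32 ))"    # prints 290 (the 32 bit is already set in 290)
```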

## Builtin Variables
`aq_pp` is equipped with [builtin variables](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#eval) that can be used in place of values in the `-eval` option. There are a couple of them; here we'll take a look at `$RowNum` and `$Random`.

<a id='row_num'></a>
### `RowNum`
Represents the row number of the record, starting at 1.

In the example below, we'll create a new integer column `row` and store the row number in it. You'll find the row number in the rightmost column.

In [7]:
aq_pp -f,+1 $file -d $cols -eval i:row '$RowNum'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,1
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,2
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,3
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,4
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,5
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,6
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,7
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,8
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,9


**Arithmetic with Variable**<br>
Since we are skipping the header row with the `-f,+1` option, we'll correct the row numbers by adding 1 to each row number (counting the header as row 1).

In [90]:
aq_pp -f,+1 $file -d $cols -eval i:row '$RowNum +1'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,2
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,3
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,4
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,5
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,6
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,7
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,9
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,10
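
The relationship between the row number and the line number in the file can be illustrated with `awk`, whose `NR` counts every input line, including the header. This is a minimal sketch with a three-line stand-in for the CSV and a hypothetical helper name, `number_rows`; it does not call `aq_pp`.

```bash
# number_rows is a hypothetical helper, not an aq_pp feature.
# For every line after the header it prints: reviewID,row number,file line number
number_rows() {
    awk -F, 'NR > 1 { print $1 "," (NR - 1) "," NR }'
}

printf '%s\n' 'reviewID,stars' '2580,3.75' '2579,1' | number_rows
# prints:
# 2580,1,2
# 2579,2,3
```

The second field corresponds to the row number counted after skipping the header, and the third to the same number plus 1.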


<a id='random'></a>
### `Random`

Represents a positive random number; the value changes every time the variable is referenced (that is, on every record).

In this example, we will use `Random` to generate a random integer for every row and store it in an integer column named `random`.

In [91]:
aq_pp -f,+1 $file -d $cols -eval i:random '$random'

"reviewID","brand","variety","style","country","stars","random"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,476707713
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,1186278907
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,505671508
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,2137716191
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,936145377
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,1215825599
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,589265238
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,924859463
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,1182112391


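This outputs very large positive integers. Sometimes we need random numbers within a certain range, say between 0 and 9. We can get them with the modulus operator: `$random%10` is the remainder after dividing by 10, which is always between 0 and 9. (The example below reuses the column name `row` for the result.)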

In [92]:
aq_pp -f,+1 $file -d $cols -eval i:row '$random%10'

"reviewID","brand","variety","style","country","stars","row"
2580,"New Touch","T's Restaurant Tantanmen ","Cup","Japan",3.75,3
2579,"Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan",1,7
2578,"Nissin","Cup Noodles Chicken Vegetable","Cup","USA",2.25,8
2577,"Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan",2.75,1
2576,"Ching's Secret","Singapore Curry","Pack","India",3.75,7
2575,"Samyang Foods","Kimchi song Song Ramen","Pack","South Korea",4.75,9
2574,"Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan",4,8
2573,"Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan",3.75,3
2572,"Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan",0.25,1


Other than modulus, you can form and apply more complex numerical operations with builtin variables. 
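
The modulus trick generalizes to any inclusive range via `min + r % (max - min + 1)`. The sketch below checks this in plain shell arithmetic on two values taken from the `$random` output earlier in this section; `scale_to_range` is a hypothetical helper, and the sketch assumes `aq_pp` applies `%` to non-negative integers the same way the shell does.

```bash
# scale_to_range is a hypothetical helper, not an aq_pp feature.
# It maps a non-negative integer r into the inclusive range [min, max].
scale_to_range() {
    r=$1
    min=$2
    max=$3
    echo "$(( min + r % (max - min + 1) ))"
}

scale_to_range 476707713 0 9      # prints 3 (same as 476707713 % 10)
scale_to_range 476707713 1 10     # prints 4 (shifted up by one)
scale_to_range 2137716191 0 10    # prints 6 (0 to 10 inclusive needs % 11)
```

Note that a range of 0 to 10 inclusive has 11 possible values, so the divisor is 11 rather than 10.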

<a id='data_conversion'></a>
## Data Conversion

Users can take advantage of the powerful [builtin functions / aq-emod](http://auriq.com/documentation/source/reference/manpages/aq-emod.html), which allow more complex data processing than one can do by combining `-eval` options alone.

While a variety of functions is available, in this section we'll look at the ones for data type conversion, specifically `ToI()` and `ToF()`.

As a first step, we'll declare every column as a string in the column spec.

In [93]:
cols="s:reviewID s:brand s:variety s:style s:country s:stars"
# read every column as a string
aq_pp -f,+1 $file -d $cols 

"reviewID","brand","variety","style","country","stars"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75"
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1"
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25"
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75"
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75"
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75"
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4"
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75"
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25"


Notice that the `reviewID` and `stars` columns are quoted, showing that `aq_pp` is interpreting them as strings. Let's convert them into appropriate data types with builtin functions:
- `ToF(Val)`: convert `Val` to float
- `ToI(Val)`: convert `Val` to integer

where `Val` can be a constant or a column name of string / numeric data type.

In [94]:
aq_pp -f,+1 $file -d $cols -eval i:int_reviewID 'ToI(reviewID)'

"reviewID","brand","variety","style","country","stars","int_reviewID"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",2580
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",2579
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",2578
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",2577
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",2576
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",2575
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",2574
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",2573
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",2572


We provided a column name as `Val` in the example above, but we can also provide a string constant. Note that you should always quote string values in the `Expr` of an `-eval` option.

In [95]:
aq_pp -f,+1 $file -d $cols -eval i:int_reviewID 'ToI("13")'

"reviewID","brand","variety","style","country","stars","int_reviewID"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",13
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",13
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",13
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",13
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",13
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",13
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",13
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",13
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",13


Builtin functions can also be combined with arithmetic expressions. Let's convert `"13"` (a string constant) and `reviewID` into integers, add them together, and store the result in the `i:result` column this time.

In [96]:
aq_pp -f,+1 $file -d $cols -eval i:result 'ToI(reviewID) + ToI("13")'

"reviewID","brand","variety","style","country","stars","result"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",2593
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",2592
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",2591
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",2590
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",2589
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",2588
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",2587
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",2586
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",2585


In the `result` column, 13 has been added to the original `reviewID` value.

We can also concatenate numeric strings with `+`, convert the result to a numeric data type, and store it in a numeric column. Let's take a look.



In [97]:
aq_pp -f,+1 $file -d $cols -eval i:result 'ToI(reviewID + "13")'

"reviewID","brand","variety","style","country","stars","result"
"2580","New Touch","T's Restaurant Tantanmen ","Cup","Japan","3.75",258013
"2579","Just Way","Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles","Pack","Taiwan","1",257913
"2578","Nissin","Cup Noodles Chicken Vegetable","Cup","USA","2.25",257813
"2577","Wei Lih","GGE Ramen Snack Tomato Flavor","Pack","Taiwan","2.75",257713
"2576","Ching's Secret","Singapore Curry","Pack","India","3.75",257613
"2575","Samyang Foods","Kimchi song Song Ramen","Pack","South Korea","4.75",257513
"2574","Acecook","Spice Deli Tantan Men With Cilantro","Cup","Japan","4",257413
"2573","Ikeda Shoku","Nabeyaki Kitsune Udon","Tray","Japan","3.75",257313
"2572","Ripe'n'Dry","Hokkaido Soy Sauce Ramen","Pack","Japan","0.25",257213
