# aq_pp -map, -mapf, -mapc

In this notebook, we'll go over `aq_pp`'s string extraction and mapping options: `-map`, `-mapf`, and `-mapc`.


## Prerequisites
Readers are assumed to have a working knowledge of the following.
- `aq_pp`
- `aq_input`
- `aq_output`
- Regex

Also keep the [aq_pp official documentation](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#map) open for reference.
**If you're already familiar with the syntax of the command, feel free to skip to the [data](#data) section.**

## Overview

There are 3 map options, namely
- `-map`: extracts the desired pattern(s) from a given column and maps the result back to the same column, formatted according to the MapTo string.
- `-mapf` & `-mapc`: these are used together. `-mapf` specifies the source column and the pattern to extract from it, and `-mapc` specifies the destination column and the MapTo string.

### Syntax

**Map**<br>
```aq_pp ... -map[,AtrLst] colName "MapFrom" "MapTo"```

**Mapf/Mapc**<br>
```aq_pp ... -mapf[,AtrLst] ColName MapFrom -mapc ColSpec|ColName MapTo```

Where
- `[,AtrLst]`: optional list of attributes.
    * `ncas` for case-insensitive matching (the default is case sensitive)
    * [regular expression attributes](http://auriq.com/documentation/source/reference/manpages/aq-emod.html#regex-attributes) to specify the type of regex being used.
- `colName` / `ColName`: name of an existing column
- `ColSpec`: column spec (for example `S:Origin`), used when the result is stored in a new column.





## MapFrom
This string dictates the pattern that is matched against the source string for extraction. Its main components are the following.
- Literal: matches literal characters / strings
- Wildcard: `%*` matches any number of bytes, and `%?` matches any 1 byte.
- Variable: `%%varName%%`, a variable name surrounded by double `%` characters. The name can be referenced later in `-mapc`'s MapTo string.
> A variable can take additional attributes. The full syntax looks like this.<br>
```%%VarName[:@class][:[chars]][:min[-max]][,brks]%%``` <a id='var_attr'></a>
    * `:@class`: specifies the class of characters to match, such as `n` for 0~9, `a` for a-z, and `b` for A-Z. 
    * `[chars]`: specifies which characters may be included in the variable. 
    * `min-max`: integers that restrict the minimum and maximum number of characters. 
    * `,brks`: specifies the list of characters at which the extraction should stop. 
    
    E.g. `%%CAPITAL:@b:3-5%%` matches a substring of capital letters that is 3 to 5 characters long.<br>
    More attributes for MapFrom are available at [RT MapFrom Syntax](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#rt-mapfrom-syntax)<br>
    **Note:** Attributes are separated by `:`, as shown in the syntax above.
    
    
## MapTo

This string specifies the format in which the extracted value is mapped into the destination column. Its rules are similar to the MapFrom syntax:
- Literal: maps constant characters / strings
- Variable: places the value stored in the named variable. A variable can also take attributes, as follows.
> ```%%VarName[:cnv][[:start]:length][,brks]%%```<br>
    * `:cnv`: either `b64` or `url[Num]`, which applies a base64 or URL conversion to the variable string, respectively (see the MapTo documentation linked below for the exact behavior).
    * `:start:length`: starting byte position and subsequent length of the extracted variable. Without the starting position, `length` is the number of bytes taken from the beginning of the extracted variable.
    * `,brks`: list of characters at which substitution of the variable stops.
    
    **Note:** Attributes are separated by `:`, as shown in the syntax above.<br>
    Details of MapTo is also available at [MapTo Syntax](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#mapto-syntax)
    
## -Map(c/f)

### MapFrom

Let's start with simple string extraction using the options and variables.

**Extracting Origin of Flight**<br>

We use `-mapf` to extract the origin and `-mapc` to map it to a newly created column. Here are the arguments to think about.

- `-mapf`
    - `ColName`: column to extract data from. This should be the `search` column.
    - `MapFrom`: pattern that matches the 3-capital-letter airport code that follows the "From=" string and is followed by "To=...". We'll use a variable named `ORG` to extract and store the match.
- `-mapc`
    - `ColSpec`: column spec for the new string column. We'll name it `Origin`.
    - `MapTo`: we can place the variable `ORG` directly in the destination column, since we're not formatting the result in any way.
    
    
    

<a id='data'></a>
## Data

We'll be using airline flight search history data (modified from the original for privacy purposes). This dataset contains long string columns that are well suited to this tutorial.
Below is what the data looks like.

search|
-----|
From=HNDTo=NGODate=20150506Class=Y|
From=NGik0To=OKADate=20150425Class=1|
From=OKATo=NGODate=20l5o4A5Class=Y|
From=OKATo=NGODate=20150425Class=S|
From=OKATo=nG0ODate=20150425Class=3|
From=3GOTo=OKADate=S0150F25Class=Y|
From=NGOTo=OKADate=20150419Class=Y|
From=OKaTo=NGODate=20150517Class=Y|


Each row is the record of one search query and, from left to right, is composed of
- `From=*`: 3 capital letter alphabetical airport code
- `To=*`: 3 capital letter alphabetical airport code
- `Date=*`: Date of the flight
- `Class=*`: class of flight, categorical value

Some records in the data intentionally have invalid values, for practice.

## Sample

### Loading the Data


In [6]:
# set up the column spec and file name
file="data/aq_pp/airline_sample.csv"
col="S:search"
aq_pp -f,+1 $file -d $col 

"search"
"From=ISGTo=OKADate=20150904Class=Y"
"From=OKATo=ITMDate=20150426Class=Y"
"From=ISGTo=OKADate=2015o406Class=Y"
"From=ITMTo=FUKDate=20151016Class=Y"
"From=HNDTo=KOJDate=20171112Class=Y"
"From=NGik0To=OKADate=20150425Class=1"
"From=HNDTo=ITMDate=20l518920113Class=S"
"From=NGOTo=SPKDate=20160528Class=S"
"From=SPKTo=NGODate=20160207Class=S"
"From=OKATo=nG0ODate=20150425Class=3"


## Table of Sample

- [MapFrom](#mapfrom)
	- [Variable Attribute - String](#var_attr)
	- [Variable Attribute - Number](#var_attr_num)
- [MapTo](#mapto)
	- [Attributes](#attr)
	- [Map One Variable across Columns](#map_one_few)
- [Multiple Variables](#mul_var)
- [Mapping Multiple Columns](#from_to_mul)
	- [Mapping to Multiple Columns](#to_mul_col)
	- [Mapping from Multiple Columns](#from_mul_col)
	- [Advanced Examples](#adv_ex)
- [Map](#map)

<a id='mapfrom'></a>
### MapFrom

We use `-mapf` for extraction and `-mapc` for mapping the string, with one variable named `ORG`. <br>
Taking a closer look at the MapFrom string, we have 
```From=%%ORG%%To=%*```<br>
- `From=`: literal that matches the substring before the origin airport code
- `%%ORG%%`: variable named ORG
- `To=`: string literal
- `%*`: wildcard that matches everything after "To=" in the column

Precisely speaking, this expression matches **ANY** substring between `From=` and `To=` (where `To=` may be followed by any number of bytes) and stores it in the variable named `ORG`. As a result, invalid airport codes are extracted as well.

In [7]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG%%To=%*' -mapc S:Origin '%%ORG%%'

"search","Origin"
"From=ISGTo=OKADate=20150904Class=Y","ISG"
"From=OKATo=ITMDate=20150426Class=Y","OKA"
"From=ISGTo=OKADate=2015o406Class=Y","ISG"
"From=ITMTo=FUKDate=20151016Class=Y","ITM"
"From=HNDTo=KOJDate=20171112Class=Y","HND"
"From=NGik0To=OKADate=20150425Class=1","NGik0"
"From=HNDTo=ITMDate=20l518920113Class=S","HND"
"From=NGOTo=SPKDate=20160528Class=S","NGO"
"From=SPKTo=NGODate=20160207Class=S","SPK"
"From=OKATo=nG0ODate=20150425Class=3","OKA"


<a id='var_attr'></a>
**Variable Attributes - string**<br>

What if there are anomalies in the dataset, such as invalid airport codes between `From=` and `To=`? 
With the MapFrom expression above, an invalid origin code is still picked up and mapped to the new column. In fact, the result above contains such a value: `NGik0`.

How can we make sure that we're only extracting valid airport codes? 
We can do it by specifying the format of the variable with attributes.

In this case, we want to make sure that the extracted airport code is
- made of capital alphabetical characters
- exactly 3 letters long

This can be achieved by providing [some attributes after the variable name](#var_attr) like below.<br>
```%%ORG:@b:3-3%%```<br>
where
- `@b`: class that restricts the characters to the range A~Z
- `3-3`: min and max number of characters in the variable

In [8]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG:@b:3-3%%To=%*' -mapc S:Origin '%%ORG%%'

"search","Origin"
"From=ISGTo=OKADate=20150904Class=Y","ISG"
"From=OKATo=ITMDate=20150426Class=Y","OKA"
"From=ISGTo=OKADate=2015o406Class=Y","ISG"
"From=ITMTo=FUKDate=20151016Class=Y","ITM"
"From=HNDTo=KOJDate=20171112Class=Y","HND"
"From=NGik0To=OKADate=20150425Class=1",
"From=HNDTo=ITMDate=20l518920113Class=S","HND"
"From=NGOTo=SPKDate=20160528Class=S","NGO"
"From=SPKTo=NGODate=20160207Class=S","SPK"
"From=OKATo=nG0ODate=20150425Class=3","OKA"


As a result, only valid airport codes are extracted and mapped to the Origin column, while rows with invalid codes are left blank (meaning no match was found).
This can be used as a data validation method.
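
For readers who think in regular expressions, the difference between the two MapFrom strings above can be approximated with `sed`. This is only an analogy under stated assumptions, not how `aq_pp` works internally: the helper names `origin_loose` and `origin_strict` are hypothetical, and `aq_pp`'s wildcard matching may not behave exactly like a greedy regex.

```shell
#!/bin/sh
# Rough regex analogues of the two MapFrom strings (illustration only, not aq_pp).
#   origin_loose  ~ 'From=%%ORG%%To=%*'         anything between From= and To=
#   origin_strict ~ 'From=%%ORG:@b:3-3%%To=%*'  exactly three letters A-Z
# Both function names are hypothetical helpers for this sketch.

origin_loose() {
    # -n with the p flag prints only when the substitution matched
    printf '%s\n' "$1" | LC_ALL=C sed -nE 's/^From=(.*)To=.*$/\1/p'
}

origin_strict() {
    printf '%s\n' "$1" | LC_ALL=C sed -nE 's/^From=([A-Z]{3})To=.*$/\1/p'
}

for row in 'From=ISGTo=OKADate=20150904Class=Y' \
           'From=NGik0To=OKADate=20150425Class=1'; do
    printf 'loose=[%s] strict=[%s]\n' "$(origin_loose "$row")" "$(origin_strict "$row")"
done
```

The second row yields `NGik0` from the loose pattern and an empty string from the strict one, mirroring the blank `Origin` cell in the output above.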

<a id='var_attr_num'></a>
**Variable Attribute - number**<br>

Let us try to extract numeric values, namely the year, month and day. We will use tactics similar to the ones above to filter out invalid numeric records as well. Here are examples of a valid and an invalid numeric record.

* Valid: `...Date=20150425Class...`
* Invalid: `...Date=S0150F25Class...`

Valid date data is preceded by the `Date=` substring and consists of 8 digits: 4 digits for the year, 2 for the month and 2 for the day.
Let's extract all 8 digits, store them in the `TIME` variable, and map them to the `time` column.

In [9]:
aq_pp -f,+1 $file -d $col -mapf search '%*Date=%%TIME:@n:8-8%%Class%*' -mapc S:time '%%TIME%%'

"search","time"
"From=ISGTo=OKADate=20150904Class=Y","20150904"
"From=OKATo=ITMDate=20150426Class=Y","20150426"
"From=ISGTo=OKADate=2015o406Class=Y",
"From=ITMTo=FUKDate=20151016Class=Y","20151016"
"From=HNDTo=KOJDate=20171112Class=Y","20171112"
"From=NGik0To=OKADate=20150425Class=1","20150425"
"From=HNDTo=ITMDate=20l518920113Class=S",
"From=NGOTo=SPKDate=20160528Class=S","20160528"
"From=SPKTo=NGODate=20160207Class=S","20160207"
"From=OKATo=nG0ODate=20150425Class=3","20150425"


A few notes about the MapFrom string, ```%*Date=%%TIME:@n:8-8%%Class%*```<br>
- `%*`: two of them are used, at the beginning and at the end, to skip the unnecessary parts of the string.
- `Date=`, `Class`: literals that precede and follow the variable
- `%%TIME:@n:8-8%%`: variable, with the attributes
    - `@n`: digits 0~9
    - `8-8`: min 8 and max 8 length
    
You can see that only valid dates are extracted, and the rest are left blank.

<a id='mapto'></a>
### MapTo

So far we've focused on using MapFrom with `-mapf/c` to extract patterns, and kept the use of MapTo simple.<br>
In this part of the notebook, we'll take a look at how to use the MapTo string to specify the formatting of the destination column.

In this example, we'll keep using the same `-mapf` string that we used to extract the 8-digit date value. <br>

**Literal and Variable**<br>
To make the content of the column clear, we'd like to map the date digits in the format of <br>
```date=20150904```.

How can we map this using MapTo? We know that `%%varName%%` substitutes the extracted value. All we need is the literal string `date=` before it. Let's take a look.

In [10]:
aq_pp -f,+1 $file -d $col -mapf search '%*Date=%%TIME:@n:8-8%%Class%*' \
-mapc S:time 'date=%%TIME%%'

"search","time"
"From=ISGTo=OKADate=20150904Class=Y","date=20150904"
"From=OKATo=ITMDate=20150426Class=Y","date=20150426"
"From=ISGTo=OKADate=2015o406Class=Y","date="
"From=ITMTo=FUKDate=20151016Class=Y","date=20151016"
"From=HNDTo=KOJDate=20171112Class=Y","date=20171112"
"From=NGik0To=OKADate=20150425Class=1","date=20150425"
"From=HNDTo=ITMDate=20l518920113Class=S","date="
"From=NGOTo=SPKDate=20160528Class=S","date=20160528"
"From=SPKTo=NGODate=20160207Class=S","date=20160207"
"From=OKATo=nG0ODate=20150425Class=3","date=20150425"


<a id='attr'></a>
**Attributes**<br>

There are 3 attributes available, and we'll cover 2 of them: 
- `:length`: number of bytes to take from the beginning of the value stored in the variable
- `:start:length`: starting byte position and **subsequent** length of the extracted value. Positions start at 0.

We've extracted 8 digits, but would like to use only the year portion (the first 4 digits) of the data. We can accomplish this using the `:length` attribute.

In [11]:
aq_pp -f,+1 $file -d $col -mapf search '%*Date=%%TIME:@n:8-8%%Class%*' \
-mapc S:time 'Year=%%TIME:4%%'

"search","time"
"From=ISGTo=OKADate=20150904Class=Y","Year=2015"
"From=OKATo=ITMDate=20150426Class=Y","Year=2015"
"From=ISGTo=OKADate=2015o406Class=Y","Year="
"From=ITMTo=FUKDate=20151016Class=Y","Year=2015"
"From=HNDTo=KOJDate=20171112Class=Y","Year=2017"
"From=NGik0To=OKADate=20150425Class=1","Year=2015"
"From=HNDTo=ITMDate=20l518920113Class=S","Year="
"From=NGOTo=SPKDate=20160528Class=S","Year=2016"
"From=SPKTo=NGODate=20160207Class=S","Year=2016"
"From=OKATo=nG0ODate=20150425Class=3","Year=2015"


What if we'd like to map only the month (the 2 digits in the middle)? We can use the `:start:length` attribute. Note that `start` is the byte position at which substitution begins, an index starting from zero, and `length` is the number of bytes to take from there.<br>
Let's take a look.

In [12]:
aq_pp -f,+1 $file -d $col -mapf search '%*Date=%%TIME:@n:8-8%%Class%*' \
-mapc S:time 'Month=%%TIME:4:2%%'

"search","time"
"From=ISGTo=OKADate=20150904Class=Y","Month=09"
"From=OKATo=ITMDate=20150426Class=Y","Month=04"
"From=ISGTo=OKADate=2015o406Class=Y","Month="
"From=ITMTo=FUKDate=20151016Class=Y","Month=10"
"From=HNDTo=KOJDate=20171112Class=Y","Month=11"
"From=NGik0To=OKADate=20150425Class=1","Month=04"
"From=HNDTo=ITMDate=20l518920113Class=S","Month="
"From=NGOTo=SPKDate=20160528Class=S","Month=05"
"From=SPKTo=NGODate=20160207Class=S","Month=02"
"From=OKATo=nG0ODate=20150425Class=3","Month=04"


That looks correct!
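
The zero-based `start` is easy to get wrong, so here is a small shell sketch of the same slicing arithmetic applied to an already-extracted value. It is an illustration only: `slice` is a hypothetical helper, not part of `aq_pp`, and it assumes single-byte characters such as the digits used here.

```shell
#!/bin/sh
# Illustration only: mimic MapTo's %%TIME:4%% and %%TIME:4:2%% on a plain string.
# slice is a hypothetical helper: slice VALUE START LENGTH,
# with START counted from 0 as in the MapTo attribute.
slice() {
    first=$(( $2 + 1 ))   # cut counts positions from 1, hence the +1
    last=$(( $2 + $3 ))
    printf '%s\n' "$1" | cut -c "${first}-${last}"
}

TIME=20150904
# %%TIME:4%% is the same as start 0, length 4
printf 'Year=%s Month=%s\n' "$(slice "$TIME" 0 4)" "$(slice "$TIME" 4 2)"
```

With the sample value this prints `Year=2015 Month=09`, matching the first row of the two outputs above.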

<a id='map_one_few'></a>
**Mapping one variable across columns**<br>

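In the 2 examples above, we mapped a part of the `TIME` variable only once.
However, the same variable can be referenced more than once in a MapTo string, each time with different attributes. The example below places the year and the month side by side in a single column; splitting them into individual columns is covered in [Mapping to Multiple Columns](#to_mul_col).<br>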

In [13]:
aq_pp -f,+1 $file -d $col -mapf search '%*Date=%%TIME:@n:8-8%%Class%*' \
-mapc S:time 'Year=%%TIME:4%%, Month=%%TIME:4:2%%'

"search","time"
"From=ISGTo=OKADate=20150904Class=Y","Year=2015, Month=09"
"From=OKATo=ITMDate=20150426Class=Y","Year=2015, Month=04"
"From=ISGTo=OKADate=2015o406Class=Y","Year=, Month="
"From=ITMTo=FUKDate=20151016Class=Y","Year=2015, Month=10"
"From=HNDTo=KOJDate=20171112Class=Y","Year=2017, Month=11"
"From=NGik0To=OKADate=20150425Class=1","Year=2015, Month=04"
"From=HNDTo=ITMDate=20l518920113Class=S","Year=, Month="
"From=NGOTo=SPKDate=20160528Class=S","Year=2016, Month=05"
"From=SPKTo=NGODate=20160207Class=S","Year=2016, Month=02"
"From=OKATo=nG0ODate=20150425Class=3","Year=2015, Month=04"


<a id='mul_var'></a>
### Multiple Variables

Let's say we'd like to extract multiple substrings from one column, such as 
- origin
- destination
- year
- month
- date


We can do this by using multiple variables and assigning a string pattern to each of them. For simplicity, we'll start with 2 variables in order to extract the origin and the destination.<br>
The basic rules do not change when using multiple variables. You can fit them into one MapFrom string and one MapTo string. For example, to extract origin and destination, 

```From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%%*```

For now, we'll map both values into a single column in the format `Origin=airport_code, Destination=airport_code`.

Let's see this in action.

In [14]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%%*' \
-mapc S:airports 'Origin=%%ORG%%, Destination=%%DEST%%'

"search","airports"
"From=ISGTo=OKADate=20150904Class=Y","Origin=ISG, Destination=OKA"
"From=OKATo=ITMDate=20150426Class=Y","Origin=OKA, Destination=ITM"
"From=ISGTo=OKADate=2015o406Class=Y","Origin=ISG, Destination=OKA"
"From=ITMTo=FUKDate=20151016Class=Y","Origin=ITM, Destination=FUK"
"From=HNDTo=KOJDate=20171112Class=Y","Origin=HND, Destination=KOJ"
"From=NGik0To=OKADate=20150425Class=1","Origin=, Destination="
"From=HNDTo=ITMDate=20l518920113Class=S","Origin=HND, Destination=ITM"
"From=NGOTo=SPKDate=20160528Class=S","Origin=NGO, Destination=SPK"
"From=SPKTo=NGODate=20160207Class=S","Origin=SPK, Destination=NGO"
"From=OKATo=nG0ODate=20150425Class=3","Origin=, Destination="


Let's scale up to extracting 3 substrings: origin, destination and date (year, month and day).

For extraction, we can simply combine the 3 variables in MapFrom, as follows.

`From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*`

And map it in the format of 

`Origin=airport_code, Destination=airport_code, date=date_digits` <br>

In [15]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*' \
-mapc S:airports 'Origin=%%ORG%%, Destination=%%DEST%%, Date=%%TIME%%'

"search","airports"
"From=ISGTo=OKADate=20150904Class=Y","Origin=ISG, Destination=OKA, Date=20150904"
"From=OKATo=ITMDate=20150426Class=Y","Origin=OKA, Destination=ITM, Date=20150426"
"From=ISGTo=OKADate=2015o406Class=Y","Origin=, Destination=, Date="
"From=ITMTo=FUKDate=20151016Class=Y","Origin=ITM, Destination=FUK, Date=20151016"
"From=HNDTo=KOJDate=20171112Class=Y","Origin=HND, Destination=KOJ, Date=20171112"
"From=NGik0To=OKADate=20150425Class=1","Origin=, Destination=, Date="
"From=HNDTo=ITMDate=20l518920113Class=S","Origin=, Destination=, Date="
"From=NGOTo=SPKDate=20160528Class=S","Origin=NGO, Destination=SPK, Date=20160528"
"From=SPKTo=NGODate=20160207Class=S","Origin=SPK, Destination=NGO, Date=20160207"
"From=OKATo=nG0ODate=20150425Class=3","Origin=, Destination=, Date="
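
Notice that a single invalid field (for example the date in the third row) empties all three variables, because the MapFrom string must match as a whole. A regex analogue with three capture groups shows the same all-or-nothing behavior. This is a sketch under assumptions: `summarize` is a hypothetical helper, and unlike `-mapc`, which still writes the literal parts of MapTo with empty variables, it prints nothing for a row that does not match.

```shell
#!/bin/sh
# Illustration only: a regex analogue of
#   From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*
# summarize is a hypothetical helper, not part of aq_pp.
summarize() {
    # All three groups must match for the substitution (and the print) to happen
    printf '%s\n' "$1" | LC_ALL=C sed -nE \
        's/^From=([A-Z]{3})To=([A-Z]{3})Date=([0-9]{8}).*$/Origin=\1, Destination=\2, Date=\3/p'
}

summarize 'From=ISGTo=OKADate=20150904Class=Y'
summarize 'From=ISGTo=OKADate=2015o406Class=Y'   # invalid date: prints nothing
```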


<a id='from_to_mul'></a>
### Mapping Multiple Columns

So far we've been extracting all the strings from one column and mapping them onto another single column. This is not sufficient when you'd like to extract multiple pieces of data and map them across several columns, or combine data from multiple columns into one column.

`-mapf` and `-mapc` support such operations. You can use these options multiple times to map to and from multiple columns. Each `-mapf` option extracts from a single column and each `-mapc` option maps into a single column, so you need to use these options as many times as the number of columns you're dealing with. 

For example, if you'd like to extract data from 2 columns, and map them into 3 columns, then you'd need to use 2 `-mapf` options and 3 `-mapc` options. We'll take a look at them individually.

<a id='to_mul_col'></a>
**Mapping To Multiple Columns**<br>

In this first example, we'll take 3 extracted strings (origin, destination and date) and put them into 3 different columns accordingly. This does not change `-mapf` option, so pay attention to the 3 `-mapc` options.

In [16]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*' \
-mapc S:Origin '%%ORG%%' -mapc S:Destination '%%DEST%%' -mapc S:Date '%%TIME%%'

"search","Origin","Destination","Date"
"From=ISGTo=OKADate=20150904Class=Y","ISG","OKA","20150904"
"From=OKATo=ITMDate=20150426Class=Y","OKA","ITM","20150426"
"From=ISGTo=OKADate=2015o406Class=Y",,,
"From=ITMTo=FUKDate=20151016Class=Y","ITM","FUK","20151016"
"From=HNDTo=KOJDate=20171112Class=Y","HND","KOJ","20171112"
"From=NGik0To=OKADate=20150425Class=1",,,
"From=HNDTo=ITMDate=20l518920113Class=S",,,
"From=NGOTo=SPKDate=20160528Class=S","NGO","SPK","20160528"
"From=SPKTo=NGODate=20160207Class=S","SPK","NGO","20160207"
"From=OKATo=nG0ODate=20150425Class=3",,,


In the next example, we'll further split the `TIME` data into three columns, namely Year, Month and Date. 
To do so, we can again use the `:length` and `:start:length` attributes in the MapTo strings.<br>
Let's take a look.

In [17]:
aq_pp -f,+1 $file -d $col -mapf search 'From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*' \
-mapc S:Origin '%%ORG%%' -mapc S:Destination '%%DEST%%' \
-mapc S:Year '%%TIME:4%%' -mapc S:Month '%%TIME:4:2%%' -mapc S:Date '%%TIME:6:2%%' 

"search","Origin","Destination","Year","Month","Date"
"From=ISGTo=OKADate=20150904Class=Y","ISG","OKA","2015","09","04"
"From=OKATo=ITMDate=20150426Class=Y","OKA","ITM","2015","04","26"
"From=ISGTo=OKADate=2015o406Class=Y",,,,,
"From=ITMTo=FUKDate=20151016Class=Y","ITM","FUK","2015","10","16"
"From=HNDTo=KOJDate=20171112Class=Y","HND","KOJ","2017","11","12"
"From=NGik0To=OKADate=20150425Class=1",,,,,
"From=HNDTo=ITMDate=20l518920113Class=S",,,,,
"From=NGOTo=SPKDate=20160528Class=S","NGO","SPK","2016","05","28"
"From=SPKTo=NGODate=20160207Class=S","SPK","NGO","2016","02","07"
"From=OKATo=nG0ODate=20150425Class=3",,,,,


<a id='from_mul_col'></a>
**Mapping From Multiple Columns**<br>

Because the flight search data we've been using has only one source column, we'll use a different dataset for this particular example. 
The [Titanic survivor dataset](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html) lists the passengers of the Titanic and whether they survived.
The actual data looks like this.

Survived|Pclass|Name|Sex|Age|Siblings_Spouses_Aboard|Parents_Children_Aboard|Fare
-----|-----|-----|-----|-----|-----|-----|-----|
1|2|Mrs. Amin S (Marie Marthe Thuillard) Jerwan|female|23|0|0|13.791700000000001
0|3|Mr. William Thomas Beavan|male|19|0|0|8.0500000000000007
1|1|Miss. Kornelia Theodosia Andrews|female|63|1|0|77.958299999999994
0|3|Master. Eino Viljami Panula|male|1|4|1|39.6875
1|1|Mr. Algernon Henry Wilson Barkworth|male|80|0|0|30
0|3|Mr. Farred Chehab Emir|male|26|0|0|7.2249999999999996



Let's say we'd like to create a summary column of information composed of
- Survived or not
- Age

Note that we'll interpret all of the columns as strings so that we can apply the map options, and we'll keep it to 2 source columns for now. We can do this by using 2 `-mapf` options and a single `-mapc` option.

In [18]:
# set up the params
sunken_ship="data/aq_pp/titanic.csv"
ship_col="S:Survived S:Pclass S:Name S:Sex S:Age S:Siblings_Spouses_Aboard S:Parents_Children_Aboard S:Fare"

aq_pp -f,+1 $sunken_ship -d $ship_col -mapf Survived '%%STATE%%' -mapf Age '%%AGE%%' \
-mapc S:Summary 'Age: %%AGE%% Alive: %%STATE%%' -c Survived Age Summary

"Survived","Age","Summary"
"1","23.0","Age: 23.0 Alive: 1"
"0","19.0","Age: 19.0 Alive: 0"
"1","63.0","Age: 63.0 Alive: 1"
"0","1.0","Age: 1.0 Alive: 0"
"1","80.0","Age: 80.0 Alive: 1"
"0","26.0","Age: 26.0 Alive: 0"
"0","22.0","Age: 22.0 Alive: 0"
"1","42.0","Age: 42.0 Alive: 1"
"1","33.0","Age: 33.0 Alive: 1"
"1","35.0","Age: 35.0 Alive: 1"
"1","4.0","Age: 4.0 Alive: 1"
"0","26.0","Age: 26.0 Alive: 0"
"1","48.0","Age: 48.0 Alive: 1"
"1","38.0","Age: 38.0 Alive: 1"
"0","24.0","Age: 24.0 Alive: 0"
"1","34.0","Age: 34.0 Alive: 1"
"1","48.0","Age: 48.0 Alive: 1"
"1","19.0","Age: 19.0 Alive: 1"
"0","45.0","Age: 45.0 Alive: 0"
"0","55.0","Age: 55.0 Alive: 0"


<a id='adv_ex'></a>
**Advanced Example**<br>
Let's try something a little more complicated. 
We'd like to combine the following information
- Sex
- Age
- Passenger Class
- A person's title (included in the Name column, "Mr", "Mrs", etc)

into a summary column. We'd also like to extract the person's name only (without the title) and map it into a name column.
So we're extracting from 4 columns (`Sex`, `Age`, `Pclass`, and `Name`) and mapping into 2 columns (`Summary` and `Birth_name`).

Let's take a look.

In [19]:
aq_pp -f,+1 $sunken_ship -d $ship_col \
-mapf Sex '%%SEX%%' -mapf Age '%%AGE%%' -mapf Pclass '%%CLASS%%' -mapf Name '%%TITLE:[ CDJLMRSadehijklmnoprstuvy]:2-12%%.%%NAME%%' \
-mapc S:Summary 'Title: %%TITLE%%, Sex: %%SEX%%, Age: %%AGE%%, Class: %%CLASS%%' -mapc S:Birth_name '%%NAME%%' -c Summary Birth_name

"Summary","Birth_name"
"Title: Mrs, Sex: female, Age: 23.0, Class: 2"," Amin S (Marie Marthe Thuillard) Jerwan"
"Title: Mr, Sex: male, Age: 19.0, Class: 3"," William Thomas Beavan"
"Title: Miss, Sex: female, Age: 63.0, Class: 1"," Kornelia Theodosia Andrews"
"Title: Master, Sex: male, Age: 1.0, Class: 3"," Eino Viljami Panula"
"Title: Mr, Sex: male, Age: 80.0, Class: 1"," Algernon Henry Wilson Barkworth"
"Title: Mr, Sex: male, Age: 26.0, Class: 3"," Farred Chehab Emir"
"Title: Mr, Sex: male, Age: 22.0, Class: 1"," Sante Ringhini"
"Title: Mrs, Sex: female, Age: 42.0, Class: 2"," Charles Alexander (Alice Adelaide Slow) Louch"
"Title: Mrs, Sex: female, Age: 33.0, Class: 3"," Karl Alfred (Maria Mathilda Gustafsson) Backstrom"
"Title: Mr, Sex: male, Age: 35.0, Class: 1"," Gustave J Lesurer"
"Title: Master, Sex: male, Age: 4.0, Class: 3"," Harold Theodor Johnson"
"Title: Mr, Sex: male, Age: 26.0, Class: 3"," Henrik Juul Hansen"
"Title: Lady, Sex: female, Age: 48.0, Class: 1"," (Lucille Chris

In the command above:
* 1st line: we set up the input and the column spec for the dataset.
* 2nd line: 4 `-mapf` options are used to extract a total of 5 variables.
    * With the MapFrom string `%%TITLE:[ CDJLMRSadehijklmnoprstuvy]:2-12%%.%%NAME%%`, we extract 2 variables. 
        * `TITLE` variable, which is
            * 2 to 12 characters long: `2-12`
            * composed of the following characters: `[ CDJLMRSadehijklmnoprstuvy]`, including the white space
        * `NAME` variable, which takes everything after the `.` that follows the title
        
* 3rd line: with 2 `-mapc` options we map the results to the columns Summary and Birth_name, then output only these two columns on stdout.
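
The `Name` pattern is the densest part of this command, so a regex analogue may help. The sketch below is an illustration only and does not reproduce `aq_pp`'s matching engine: `split_name` is a hypothetical helper, and `|` is an arbitrary separator chosen for display.

```shell
#!/bin/sh
# Illustration only: a regex analogue of the Name MapFrom string
#   %%TITLE:[ CDJLMRSadehijklmnoprstuvy]:2-12%%.%%NAME%%
# split_name is a hypothetical helper that prints "TITLE|NAME".
split_name() {
    # group 1: 2 to 12 characters from the allowed set, then a literal dot,
    # group 2: the rest of the string (including the leading space)
    printf '%s\n' "$1" | LC_ALL=C sed -nE \
        's/^([ CDJLMRSadehijklmnoprstuvy]{2,12})\.(.*)$/\1|\2/p'
}

split_name 'Mrs. Amin S (Marie Marthe Thuillard) Jerwan'
split_name 'Master. Eino Viljami Panula'
```

As in the `aq_pp` output above, the name part keeps the space that follows the dot.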

<a id='map'></a>
## `-Map`

The `-map` option extracts a string from a column and maps it back to the source column.
Quickly reviewing its syntax, it looks like the following...<br>
```aq_pp ... -map[,AtrLst] colName "MapFrom" "MapTo"```

Note that both the MapFrom and MapTo strings come right after `colName`, which is both the source and the destination column in this case.

Almost everything we've covered in the `-mapf/c` section can also be done with the `-map` option. The following are some examples.

Starting with the basics, let's extract the origin airport code from the `search` column and map it back to the same column.

In [20]:
aq_pp -f,+1 $file -d $col -map search 'From=%%ORG:@b:3-3%%To=%*' 'Origin=%%ORG%%'

"search"
"Origin=ISG"
"Origin=OKA"
"Origin=ISG"
"Origin=ITM"
"Origin=HND"
"From=NGik0To=OKADate=20150425Class=1"
"Origin=HND"
"Origin=NGO"
"Origin=SPK"
"Origin=OKA"


**NOTE:** When there is no match in the string, `-map` simply leaves the original content of the column in place. In practice, it's therefore a good idea to filter out unmatched records.
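
This pass-through behavior resembles a plain `sed` substitution, which also leaves non-matching lines untouched. The sketch below is only an analogy for readers familiar with `sed`; `remap` and `remap_matched_only` are hypothetical helpers, and the second one shows one way to drop unmatched rows instead of keeping them.

```shell
#!/bin/sh
# Illustration only: sed's s/// resembles -map in one respect:
# a line that does not match is passed through unchanged.
# remap and remap_matched_only are hypothetical helpers for this sketch.
remap() {
    printf '%s\n' "$1" | LC_ALL=C sed -E 's/^From=([A-Z]{3})To=.*$/Origin=\1/'
}

# Variant that drops unmatched rows instead of passing them through.
remap_matched_only() {
    printf '%s\n' "$1" | LC_ALL=C sed -nE 's/^From=([A-Z]{3})To=.*$/Origin=\1/p'
}

remap 'From=ISGTo=OKADate=20150904Class=Y'               # rewritten
remap 'From=NGik0To=OKADate=20150425Class=1'             # left as-is
remap_matched_only 'From=NGik0To=OKADate=20150425Class=1' # prints nothing
```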

Let's try extracting origin, destination and date, and mapping the three values back to search column.

In [21]:
aq_pp -f,+1 $file -d $col -map search 'From=%%ORG:@b:3-3%%To=%%DEST:@b:3-3%%Date=%%TIME:@n:8-8%%%*' \
'ORN=%%ORG%%, DEST=%%DEST%%, Time=%%TIME%%'

"search"
"ORN=ISG, DEST=OKA, Time=20150904"
"ORN=OKA, DEST=ITM, Time=20150426"
"From=ISGTo=OKADate=2015o406Class=Y"
"ORN=ITM, DEST=FUK, Time=20151016"
"ORN=HND, DEST=KOJ, Time=20171112"
"From=NGik0To=OKADate=20150425Class=1"
"From=HNDTo=ITMDate=20l518920113Class=S"
"ORN=NGO, DEST=SPK, Time=20160528"
"ORN=SPK, DEST=NGO, Time=20160207"
"From=OKATo=nG0ODate=20150425Class=3"


The MapTo string includes 3 variable names as well as literal strings to format them.
**Split the time column into year, month and date with literal string mapping and Each Columns**