# aq_pp -cmb

## Overview
The `-cmb` option of the `aq_pp` command combines 2 or more datasets into 1 **horizontally**, much like a join. Unlike SQL, however, this option assumes that the join key (column) contains no duplicates. If several records match the same key value, only the first match is used by default and the rest are discarded. (The `all` attribute, described below, changes this.)

We'll cover the basic usage of this option and how it compares to SQL joins.

### Prerequisite
Readers are expected to have knowledge of the following:
- bash and its commands
- aq_tool's [input, column](aq_input.ipynb) and [output](aq_output.ipynb) specs
- basic knowledge of SQL joins

### Syntax 

The syntax of the option is shown below.

```aq_pp ... -cmb[,AtrLst] File [File ...] ColSpec [ColSpec ...] ```
where
- `AtrLst` - list of attributes that dictate the join behavior as well as the input specification for `File`
- `File` - name of the file(s) to join to the original file
- `ColSpec` - column specs of the `File` that you're combining


**`AtrLst`**<br>
Because the `-cmb` option _imports_ the other file to join, it needs an input specification for that file, just like the one given to the `-f` option.
In addition, a few attributes control the join behavior:
- `ncas` - Perform case insensitive match (default is case sensitive). For ASCII data only.
- `req` - Discard unmatched records.
- `all` - Use all matches. Normally, only the first match is used. With this attribute, one row is produced for each match.
- `mrg` - Use merge mode. Records in the current data set and in the combine files must already be sorted according to the combine keys in the same order (default is ascending unless `dec` is given). Use this approach if the combine data is too large to fit into memory.
- `dec` - Same as `mrg` except that all the data are sorted in descending order.
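
Since `mrg` expects pre-sorted input, the combine file usually needs a sorting step first. The following is a minimal sketch of that preparation only, not of `aq_pp` itself. It assumes a comma-separated file with one header line, and it assumes that an ascending numeric sort on the first column matches the order `aq_pp` expects for an integer key. The demo file and variable names are hypothetical, and the `aq_pp` line in the final comment has not been run here.

```shell
# Hypothetical demo file standing in for the file passed to -cmb.
likes=$(mktemp)
printf '%s\n' 'id,likes' '1,Climbing' '1,Code' '3,Stars' '6,Rugby' '4,Apples' > "$likes"

# Keep the header line first, then sort the data rows by the first column
# (numeric, ascending). -s keeps rows with equal keys in their original order.
likes_sorted=$(mktemp)
{ head -n 1 "$likes"; tail -n +2 "$likes" | sort -s -t, -k1,1n; } > "$likes_sorted"

cat "$likes_sorted"

# The sorted file could then be combined in merge mode, for example
# (assumption, not executed in this sketch; the main input must be sorted too):
#   aq_pp -f,+1 "$users_sorted" -d i:id s:name -cmb,+1,mrg "$likes_sorted" i:id s:likes
```

For descending data (`dec`), the same sketch would use `sort -s -t, -k1,1nr` instead.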


**`ColSpec`**<br>
This defines the columns of the `File` and which of them to combine on. Besides the usual input column spec attributes, there are 2 extensions:
- `key` - Marks a column as being a join key. It must be a common column. This is the default for a common column.
- `cmb` - Marks a column to be combined into the current data set. This is the default for a non-common column. It is typically used to mark a common column as not a join key.


**Default Behaviors**<br>
There are 3 default behaviors of the option that you should be aware of. 
* Performs **case sensitive** match 
* When there are matches of same keys across several rows, only the first match is used (extracted). 
* All records from the original input file are kept regardless of match, while only matched records are kept from the data provided to the `-cmb` option as `File`.

We will look at each of these behaviors in the later examples.

## Data
We will use 2 tables for this example: one with user IDs and names, the other with user IDs and their likes (or hobbies). They look like this.


<h4><center>Users.csv</center></h4>

id|name|                   
--|----|                   
1|Patrik|                
2|Albert|
3|Maria|
4|Darwin|
5|Elizabeth|

<br>

<h4><center>Likes.csv</center></h4>

id|likes|
--|-----|
1|Climbing|
1|Code|
3|Stars|
6|Rugby|
4|Apples|


We will be performing the join on `id` column.
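
To follow along, the two tables can be recreated locally. This is a minimal sketch that assumes the files are plain comma-separated text with a single header line (which is what the `+1` attribute in the later commands skips) and that they live at the relative paths used in the examples. The exact contents of the original files are not shown in this document, so treat the layout as an assumption.

```shell
# Create the sample files under the relative paths used in the examples.
mkdir -p data/aq_pp/cmb

cat > data/aq_pp/cmb/users.csv <<'EOF'
id,name
1,Patrik
2,Albert
3,Maria
4,Darwin
5,Elizabeth
EOF

cat > data/aq_pp/cmb/likes.csv <<'EOF'
id,likes
1,Climbing
1,Code
3,Stars
6,Rugby
4,Apples
EOF

# Quick check: each file should hold a header plus five data rows.
wc -l data/aq_pp/cmb/users.csv data/aq_pp/cmb/likes.csv
```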

## Samples

### Default Behavior

Let's start with the default behavior of the option. We'll join the 2 tables with Users.csv as the left table (the input spec to `aq_pp`) and Likes.csv as the right table (the input to the `-cmb` option).


```shell
# Set up the file paths
users="data/aq_pp/cmb/users.csv"
likes="data/aq_pp/cmb/likes.csv"

aq_pp -f,+1 "$users" -d i:id s:name -cmb,+1 "$likes" i:id s:likes
```

```
"id","name","likes"
1,"Patrik","Climbing"
2,"Albert",
3,"Maria","Stars"
4,"Darwin","Apples"
5,"Elizabeth",
```


As you can see,
1. There is only 1 "Patrik" row with user ID "1" in the combined dataset, even though the original Likes.csv dataset has 2 entries with user ID 1. This shows that the `-cmb` option uses the first match only.
2. "Albert" with user ID "2" is present despite having no match, because it belongs to the left data (Users.csv). Note that aq_tool outputs a missing (NULL) string value as an empty string.

Note that the `+1` attribute after the `-cmb` option skips the header line in the Likes.csv file.
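
The two default rules (keep every left record, attach only the first right match) can be cross-checked without `aq_pp`. The following is a rough emulation in `awk`, not how `aq_pp` is implemented. The function name `first_match_join` and the temporary files are hypothetical, and the sketch assumes simple comma-separated input with one header line and no quoted fields. Its output is unquoted, unlike the `aq_pp` output above.

```shell
# Recreate the two sample tables in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'id,name' '1,Patrik' '2,Albert' '3,Maria' '4,Darwin' '5,Elizabeth' > "$workdir/users.csv"
printf '%s\n' 'id,likes' '1,Climbing' '1,Code' '3,Stars' '6,Rugby' '4,Apples' > "$workdir/likes.csv"

# first_match_join LEFT RIGHT
# Emulates the default rules: every LEFT row is kept, and only the first
# RIGHT row with the same id contributes its second column.
first_match_join() {
  awk -F, '
    NR == FNR {                                       # first file read: the right side
      if (FNR > 1 && !($1 in like)) like[$1] = $2     # remember the first match only
      next
    }
    FNR == 1 { print "id,name,likes"; next }          # write a combined header
    { print $1 "," $2 "," (($1 in like) ? like[$1] : "") }   # no match: empty value
  ' "$2" "$1"
}

first_match_join "$workdir/users.csv" "$workdir/likes.csv"
```

The right-side row with id 6 ("Rugby") never appears, because only matched records from the combined file are kept.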

### Attributes
We'll go over 2 of the attributes here, `all` and `req`.

**`all`**<br>
Note that the Likes.csv table contains a duplicated key: the value "1", which corresponds to "Patrik" in the other table, appears twice. By default `-cmb` drops the second record; the `all` attribute tells it to keep all matching records for "Patrik".

```shell
aq_pp -f,+1 "$users" -d i:id s:name -cmb,+1,all "$likes" i:id s:likes
```

```
"id","name","likes"
1,"Patrik","Climbing"
1,"Patrik","Code"
2,"Albert",
3,"Maria","Stars"
4,"Darwin","Apples"
5,"Elizabeth",
```


Now there are 2 records for "Patrik", with different values in the `likes` column.

**`req`**<br>

In the examples above, the unmatched records from the left table ("Albert" and "Elizabeth") were still present in the result.
We can use the `req` attribute to keep only the records that match in both tables.

```shell
aq_pp -f,+1 "$users" -d i:id s:name -cmb,+1,req "$likes" i:id s:likes
```

```
"id","name","likes"
1,"Patrik","Climbing"
3,"Maria","Stars"
4,"Darwin","Apples"
```


This removes the 2 unmatched records ("Albert" and "Elizabeth").
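
To make the difference between the two attributes concrete, the following `awk` sketch mimics each one on the sample data. It is only an illustration of the described semantics, not the `aq_pp` implementation. The function name `emulate_cmb` and its `mode` argument are hypothetical, the input is assumed to be simple comma-separated text with one header line, and the output is unquoted.

```shell
# Recreate the two sample tables in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'id,name' '1,Patrik' '2,Albert' '3,Maria' '4,Darwin' '5,Elizabeth' > "$workdir/users.csv"
printf '%s\n' 'id,likes' '1,Climbing' '1,Code' '3,Stars' '6,Rugby' '4,Apples' > "$workdir/likes.csv"

# emulate_cmb MODE LEFT RIGHT
#   MODE=all : one output row per matching RIGHT row, unmatched LEFT rows kept
#   MODE=req : first match only, unmatched LEFT rows discarded
emulate_cmb() {
  awk -F, -v mode="$1" '
    NR == FNR {                        # first file read: the right side
      if (FNR == 1) next               # skip its header
      n[$1]++                          # count the matches per key
      like[$1, n[$1]] = $2             # store them in input order
      next
    }
    FNR == 1 { print "id,name,likes"; next }
    !($1 in n) {                       # left row without any match
      if (mode != "req") print $1 "," $2 ","
      next
    }
    {
      last = (mode == "all") ? n[$1] : 1
      for (i = 1; i <= last; i++) print $1 "," $2 "," like[$1, i]
    }
  ' "$3" "$2"
}

echo "--- all ---"
emulate_cmb all "$workdir/users.csv" "$workdir/likes.csv"
echo "--- req ---"
emulate_cmb req "$workdir/users.csv" "$workdir/likes.csv"
```

With `all`, "Patrik" appears twice and the unmatched users stay; with `req`, only the three matched users remain, each with their first match.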

### SQL equivalent of each

Let's perform each of the 4 SQL joins using this option.

**Left Join**<br>
A left join retains all records from the left table (Users.csv), and retains records from the right table (Likes.csv) only when there is a match.<br>
Note that this demonstrates left join **with duplicates in key**.
