# aq_cnt tips and samples

This notebook goes over `aq_cnt`'s options and their sample usages.
Based on AQ Tools version: 2.0.1-1.

### Prerequisites
Users are assumed to have a working knowledge of:
- bash commands
- the `aq_pp` command
- input, column, and output specs for AQ Tools


We'll go over each option of the `aq_cnt` command and its use cases. Have the [aq_cnt documentation](http://auriq.com/documentation/source/reference/manpages/aq_cnt.html?highlight=aq_cnt) ready on the side, so you can refer to it whenever needed.
We'll start with the basic usage of each option, then dive into advanced usage.

### Dataset
We'll be using the [Amazon customer review dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html).
This dataset was collected over two decades, from 1995 to 2015, and contains over 130 million customer reviews.

We'll be using files from the multilingual dataset, which covers several international marketplaces, to have variety in the data.


### Terminology
- Key: in `aq_cnt`, a key is a unique value present in a given column. It can also be a composite key, that is, a unique combination of values from several columns.

### To do
#### Basic 
Concrete versions of the examples from the documentation's manual page.
- `-k`
- `-kx`
- `-kX`
- `-g`
#### Advanced
- `-k`
- `-kx`
- `-kX`
- `-g`

Now we're ready, let's get started!

**Note**

Throughout the tutorial, we'll use bash variables to hold the file name and the column spec, to avoid repetition and lengthy commands.
They are assigned in the cell below.

In [22]:
# setting filename and column spec, and brief look at the dataset
file="data/reviews.tsv"
colSpec=$(loginf -f,auto $file -o_pp_col -)
head $file -n 4

marketplace	customer_id	review_id	product_id	product_parent	product_title	product_category	star_rating	helpful_votes	total_votes	vine	verified_purchase	review_headline	review_body	review_date
US	23871632	R31B5MWO3O7O6	B007IXWKUK	600633062	Fifty Shades Darker (Fifty Shades, Book 2)	Digital_Ebook_Purchase	2.0	0.0	0.0	N	Y	Fifty shades Darker	Would not recommend this book to anybody! Please do not waste your money. There is no real plot, the chapters are very repetitive and boring. The story is pathetic and predictable.	2013-05-31
US	10261718	RZ891DUCUNPMD	B00BVMXBVG	940561470	Orphan Black: Season 1 (Blu-ray)	Video DVD	5.0	1.0	1.0	N	Y	awesome show	A fantastic new series.  Tatiana Maslany is one of the best actresses of this generation and was was robbed of an emmy nomination. Add in the Sci Fi mystery and orphan black is really original and not a clone or knockoff.	2013-08-10
UK	19372562	R2D1VN26VB52J0	B005DOL0R0	359342335	Amazon Zip Sleeve for 7-Inch Tablets	PC	4.0	0.0	1.0	N	Y	As I said i

## Options

- `-k`
- `-kx`
- `-kX`
- `-g`

### -k
`-k KeyName ColName [ColName ...]`
This option counts the number of unique values present in the given column(s).
`KeyName` specifies the name of the key (or of the combination of keys, if multiple columns are given).
You can pass multiple columns to count a composite key.

**Single Column**<br>
We can use this to count how many unique products are present in the reviews.
The `product_id` column is a unique identifier for each product.
We set `KeyName` to `num_product` and pass `product_id` as the `ColName`, like below.

In [24]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -k num_product product_id

"row","num_product"
data/reviews.tsv: Bad field count: byte=2842981+656 rec=5081 field=#15
9999,6575




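- `row`: the total number of rows processed by the command (in this case the entire dataset).
- `num_product`: the number of unique values in the `product_id` column.

The `Bad field count` line is a diagnostic about a malformed record in the input. With the `eok` input attribute, such errors are treated as non-fatal and processing continues.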

**Multiple Columns**<br>
Let's observe what happens when we provide the `-k` option with multiple columns. This time we'll use a data file (`data/multiple_k.txt`) that looks like the table below. It contains marketplace abbreviations and fake product ID numbers.

|marketplace|product_id|
|---|---|
|US|1|
|US|2|
|US|3|
|JP|1|
|JP|2|
|JP|3|
|FR|1|
|FR|2|
|FR|3|

The number of unique `product_id` values in the table above is 3.
However, the number of unique combinations of `marketplace` and `product_id` is 9. Let's check it.

In [29]:
aq_cnt -f,+1,sep="|" data/multiple_k.txt -d s:marketplace i:product_id -k num_product marketplace product_id

"row","num_product"
9,9


You can also provide more than 2 columns. This comes in handy when counting the number of records based on a composite key (such as a combination of last name, first name, phone number, etc.).
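
For readers who think in Python, the difference between a single-column key and a composite key can be sketched with plain sets. This is only an illustration of the counting idea on the 9-row table above. It is not how `aq_cnt` is implemented, and the variable names are made up for this sketch.

```python
# Rows mirroring the marketplace/product_id table above.
rows = [(m, p) for m in ("US", "JP", "FR") for p in (1, 2, 3)]

# Single-column key: distinct product_id values.
unique_products = {product_id for _, product_id in rows}

# Composite key: distinct (marketplace, product_id) pairs.
unique_pairs = set(rows)

print(len(rows), len(unique_products), len(unique_pairs))  # 9 3 9
```

The composite count is larger because the same `product_id` is counted once per marketplace it appears in.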

### -kx

`-kx[,AtrLst] File KeyName ColName [ColName ...]` <br>
While the `-k` option counts and displays the number of unique values in the given column(s), this option displays the actual unique values in the given column(s).

**Be wary of the syntactic difference**<br>
This option requires the output file name as its first argument.
In these samples we'll use `-`, which writes the output to stdout (the command line window).
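
As a rough mental model, listing unique values corresponds to building a set in Python. The sketch below uses a handful of made-up records (not taken from the review dataset) and shows both the single-column case and the multi-column case. It illustrates the idea only and makes no claim about how `aq_cnt` works internally or in which order it emits rows.

```python
# Made-up (marketplace, star_rating) records, not taken from the dataset.
records = [("US", 5), ("US", 5), ("UK", 4), ("JP", 1), ("UK", 4), ("US", 2)]

# Single column: each distinct marketplace once.
marketplaces = sorted({m for m, _ in records})

# Two columns: each distinct (marketplace, star_rating) combination once.
pairs = sorted(set(records))

print(marketplaces)  # ['JP', 'UK', 'US']
print(pairs)         # [('JP', 1), ('UK', 4), ('US', 2), ('US', 5)]
```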
    
**Single Column**<br>
As an example, we'll display all the marketplace names contained in the `marketplace` column of the Amazon review dataset.


In [31]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country marketplace

"marketplace"
data/reviews.tsv: Bad field count: byte=2842981+656 rec=5081 field=#15
"FR"
"JP"
"DE"
"UK"
"US"


Let's take a look at `star_rating` as well. Amazon's star ratings range from 1 to 5, so this dataset should contain all of those values.


In [43]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country star_rating | aq_ord -f,+1 - -d i:star_rating -sort star_rating

data/reviews.tsv: Bad field count: byte=2842981+656 rec=5081 field=#15
"star_rating"
1
2
3
4
5


**Multiple Columns**<br>
By providing the `marketplace` and `star_rating` columns, we can take a look at the `star_rating` values in each marketplace. (More technically, it lists the unique combinations of values from `marketplace` and `star_rating`.)

**Note**: in the example below, `aq_ord` is used after the pipe to order the result, but it is out of scope for this sample. Don't worry about it for now!

In [44]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country marketplace star_rating \
| aq_ord -f,+1 - -d s:marketplace s:star -sort marketplace star

data/reviews.tsv: Bad field count: byte=2842981+656 rec=5081 field=#15
"marketplace","star"
"DE","1"
"DE","2"
"DE","3"
"DE","4"
"DE","5"
"FR","1"
"FR","2"
"FR","3"
"FR","4"
"FR","5"
"JP","1"
"JP","2"
"JP","3"
"JP","4"
"JP","5"
"UK","1"
"UK","2"
"UK","3"
"UK","4"
"UK","5"
"US","1"
"US","2"
"US","3"
"US","4"
"US","5"


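**Note:** The sections below still need updating. Their examples were written against a passenger dataset (with columns such as `Pclass`, `Sex`, `Name`, and `Survived`), so `$file` and `$colSpec` in those cells refer to that dataset rather than the review data used above.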


## Using Groupby with -g option

**What is groupby and how does it work?**<br>
`-g` lets users create groups within which records are counted and analyzed. For example, users can specify the `marketplace` column as a group and count the number of reviews within each marketplace.

### Using Groupby with -k


#### Passengers per passenger class

We'll use the `-g` option to specify the `Pclass` column as the group, and within each group `-k` will count the number of unique names.

In [46]:
aq_cnt -f,+1 $file -d $colSpec -g Pclass -k head_counts Name | \
aq_ord -f,+1 - -d i:Pclass i:row i:head_counts -sort Pclass 

"Pclass","row","head_counts"
1,216,216
2,184,184
3,487,487


The output format is: <br>
`GroupbyCol(Pclass), row, count`

#### Passengers per passenger class and sex

This time we use `Pclass` and `Sex` as the group, and count the names that belong to each category.

In [49]:
aq_cnt -f,+1 $file -d $colSpec -g Pclass Sex -k head_counts Name | \
aq_ord -f,+1 - -d i:Pclass s:Sex i:row i:head_counts -sort Pclass Sex

"Pclass","Sex","row","head_counts"
1,"female",94,94
1,"male",122,122
2,"female",76,76
2,"male",108,108
3,"female",144,144
3,"male",343,343
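
Conceptually, grouping with `-g` and counting with `-k` resembles bucketing rows by the group column and then counting distinct key values inside each bucket. The sketch below shows that idea in Python on made-up records. One name is repeated on purpose, so that the row count and the unique count differ for one group. It is an illustration only, not a description of `aq_cnt`'s internals.

```python
from collections import defaultdict

# Made-up (Pclass, Name) records, for illustration only.
records = [(1, "Ann"), (1, "Bob"), (2, "Cid"), (3, "Dee"), (3, "Eve"), (3, "Eve")]

row_counts = defaultdict(int)    # rows seen per group
unique_names = defaultdict(set)  # distinct key values per group

for pclass, name in records:
    row_counts[pclass] += 1
    unique_names[pclass].add(name)

# One (group, row, count) tuple per group, like the output layout above.
summary = [(g, row_counts[g], len(unique_names[g])) for g in sorted(row_counts)]
print(summary)  # [(1, 2, 2), (2, 1, 1), (3, 3, 2)]
```

In the outputs above, `row` and `head_counts` are equal in every group, which suggests that every name is unique within its group.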


### Using Groupby with -kx

**A quick refresher on the `-kx` option**

`-kx` outputs the actual unique values of the given column(s) to stdout or to a file.
It is close to

```python
df[column].unique()
```
in the Python and pandas stack.

#### Titles per passenger class
Let's take a look at each person's title (Mr., Miss., Master., etc.) and display the titles within each passenger class.
To do that, we'll use `aq_pp` to extract the title from the name column and map it into a new column named `title`.
Feel free to skip to the counting part.

In [83]:
# extract the title from the name column into a new `title` column
aq_pp -f,+1 $file -d $colSpec -mapf,pcre name "(M(rs?|is{2}|a(s|j).{1,2}r))" -mapc s:title "%%1%%" -c Pclass title |
# display the unique titles within each passenger class
aq_cnt -f,+1 - -d i:Pclass s:title -g Pclass -kx - title_by_class title |
# order the result by passenger class
aq_ord -f,+1 - -d i:Pclass s:title -sort Pclass

"Pclass","title"
1,"Major"
1,"Master"
1,
1,"Miss"
1,"Mr"
1,"Mrs"
2,
2,"Master"
2,"Miss"
2,"Mr"
2,"Mrs"
3,"Mrs"
3,"Master"
3,"Miss"
3,"Mr"


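We can see that classes 2 and 3 share the same set of titles (Master, Miss, Mr, Mrs), while class 1 also has Major. The empty titles in classes 1 and 2 presumably come from names in which the pattern found no title.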
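
The same extract-then-group idea can be sketched in Python, reusing the regular expression from the `aq_pp` command. The passenger names below are made up for illustration. Mapping an unmatched name to an empty string is an assumption, chosen to mirror the empty titles in the output above.

```python
import re
from collections import defaultdict

# Same pattern as in the aq_pp command above; the names below are made up.
TITLE = re.compile(r"(M(rs?|is{2}|a(s|j).{1,2}r))")

passengers = [
    (1, "Doe, Major. John"),
    (1, "Roe, Mrs. Jane"),
    (2, "Poe, Miss. Ada"),
    (3, "Loe, Master. Tim"),
    (3, "Coe, Mr. Sam"),
    (3, "Noe, Rev. Paul"),  # no match: title is left empty (assumption)
]

# Collect the distinct titles seen in each passenger class.
titles_by_class = defaultdict(set)
for pclass, name in passengers:
    match = TITLE.search(name)
    titles_by_class[pclass].add(match.group(1) if match else "")

for pclass in sorted(titles_by_class):
    print(pclass, sorted(titles_by_class[pclass]))
```

Because each class keeps a set, a title that occurs many times within a class is listed only once.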

### Using Groupby with -kX
For instance, let's count the number of people who survived within each passenger class (`Pclass`). We use `-g` to group by `Pclass`, then apply `-kX` to display the frequency of each unique value in the `Survived` column (0s and 1s).

In [30]:
aq_cnt -f,+1 $file -d $colSpec -g Pclass -kX - survivor_by_class Survived

"Pclass","Survived","count"
2,0,97
2,1,87
1,0,80
1,1,136
3,1,119
3,0,368


The output format is:

`GroupByCol(Pclass), KeyCol(Survived), Count`
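
In Python terms, this kind of output is a frequency table keyed by (group value, key value). The following sketch uses `collections.Counter` on made-up records to illustrate the layout. It is not meant to describe `aq_cnt`'s implementation.

```python
from collections import Counter

# Made-up (Pclass, Survived) records, for illustration only.
records = [(1, 1), (1, 1), (1, 0), (2, 0), (2, 1), (3, 0), (3, 0)]

# Count how often each (Pclass, Survived) combination occurs.
counts = Counter(records)

# Print one line per combination: group value, key value, count.
for (pclass, survived), count in sorted(counts.items()):
    print(pclass, survived, count)
```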

### Multiple Groupby
We can also specify multiple columns as groups to analyze data. 

Let's take a look at survivor counts grouped by both `Pclass` and `Sex`.
The groupby columns will be `Pclass` and `Sex`.

In [36]:
aq_cnt -f,+1 $file -d $colSpec -g Pclass Sex -kX - survivor_by_class_sex survived | \
aq_ord -f,+1 - -d i:Pclass s:Sex i:survived i:count -sort Pclass Sex Survived #-sort,dec survived # ordering the results for visual

"Pclass","Sex","survived","count"
1,"female",0,3
1,"female",1,91
1,"male",0,77
1,"male",1,45
2,"female",0,6
2,"female",1,70
2,"male",0,91
2,"male",1,17
3,"female",0,72
3,"female",1,72
3,"male",0,296
3,"male",1,47


You can see the grouping structure of Pclass > Sex > Survived. <br>
What happens if we'd like to categorize by `Sex` first, and then by `Pclass`?

In [33]:
aq_cnt -f,+1 $file -d $colSpec -g Sex Pclass -kX - survivor_by_class_sex survived | \
aq_ord -f,+1 - -d s:Sex i:Pclass i:survived i:count -sort Sex Pclass Survived #-sort,dec survived # ordering the results for visual

"Sex","Pclass","survived","count"
"female",1,0,3
"female",1,1,91
"female",2,0,6
"female",2,1,70
"female",3,0,72
"female",3,1,72
"male",1,0,77
"male",1,1,45
"male",2,0,91
"male",2,1,17
"male",3,0,296
"male",3,1,47


That worked by reversing the order of the column names passed to the `-g` option. However, a closer comparison of the two outputs shows that they contain essentially the same data, with only the order of the rows and columns differing.


### Wait, but can we do this just by providing multiple colNames?

Let's see if we can achieve the same result as in the last example by passing `Sex` and `Pclass` as key columns instead of as a group.

In [90]:
aq_cnt -f,+1 $file -d $colSpec -kX - survivor_by_class_sex Sex Pclass survived | \
aq_ord -f,+1 - -d s:Sex i:Pclass i:survived i:count -sort Sex Pclass Survived #-sort,dec survived # ordering the results

"Sex","Pclass","survived","count"
"female",1,0,3
"female",1,1,91
"female",2,0,6
"female",2,1,70
"female",3,0,72
"female",3,1,72
"male",1,0,77
"male",1,1,45
"male",2,0,91
"male",2,1,17
"male",3,0,296
"male",3,1,47


### Todo
As far as I know, you can, but there may be things that can only be done through the `-g` option.

This section will be updated in the future.
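
One way to see why the grouped and composite-key commands above produce the same rows: counting a key within groups and counting the combined (group, key) tuple directly yield the same frequency table. The Python sketch below checks this on made-up records. It only illustrates the counting logic and says nothing about other behaviors of `-g` that may differ.

```python
from collections import Counter, defaultdict

# Made-up (Sex, Pclass, Survived) records, for illustration only.
records = [
    ("female", 1, 1), ("female", 1, 1), ("female", 3, 0),
    ("male", 1, 0), ("male", 3, 0), ("male", 3, 0), ("male", 3, 1),
]

# "Group first": bucket by (Sex, Pclass), then count Survived within each bucket.
grouped = defaultdict(Counter)
for sex, pclass, survived in records:
    grouped[(sex, pclass)][survived] += 1

# Flatten the nested counts into (Sex, Pclass, Survived) -> count.
grouped_flat = {
    (sex, pclass, survived): n
    for (sex, pclass), inner in grouped.items()
    for survived, n in inner.items()
}

# "Composite key": count (Sex, Pclass, Survived) tuples directly.
composite = dict(Counter(records))

print(grouped_flat == composite)  # True
```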