# aq_cnt tips and samples

This notebook goes over `aq_cnt`'s options and their sample usages.
Based on AQ Tools version: 2.0.1-1.

### Prerequisites
Readers are assumed to have a working knowledge of:
- bash commands
- the `aq_pp` command
- input, column and output specs for AQ Tools


We'll go over each option of the `aq_cnt` command and its use cases. Keep the [aq_cnt documentation](http://auriq.com/documentation/source/reference/manpages/aq_cnt.html?highlight=aq_cnt) open on the side, so you can refer to it whenever needed.
We'll start with the basic usage of each option, then move on to advanced usage.

### Dataset
We'll be using the [Amazon customer review dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html).
It was collected between 1995 and 2015 and contains over 130 million customer reviews.

We'll use files from the multilingual dataset, which covers several international marketplaces, to get some variety in the data.


### Terminology
- Key: in `aq_cnt`, a key is a unique value found in a given column. It can also be a composite key, that is, a unique combination of values from several columns.

Now that we're ready, let's get started!

**Note**

*Bash*
Throughout the tutorial, we'll use bash variables for the file name and the column spec to avoid repetition and lengthy commands.
They are assigned in the cell below.

In [2]:
# setting filename and column spec, and brief look at the dataset
file="data/sample_reviews.tsv"
allColSpec=$(loginf -f,auto $file -o_pp_col -) 
colSpec="S:marketplace X S:review_id S:product_id X X S:product_category F:star_rating X X X X X X X"
#aq_pp -f,+1,tsv $file -d $colSpec -filt '$rowNum < 11'

## Data

Because the full dataset is very large, this sample uses a reduced version of it for clarity.
Below are the first 10 rows of the data we'll be using.

marketplace|review_id|product_id|product_category|star_rating
---|---|---|---|---|
US|R31B5MWO3O7O6|B007IXWKUK|Digital_Ebook_Purchase|2
US|RZ891DUCUNPMD|B00BVMXBVG|Video DVD|5
UK|R2D1VN26VB52J0|B005DOL0R0|PC|4
US|R3RKWQN433BXL4|B007SSEZNA|Digital_Video_Download|4
DE|RXW35JFHT3MU2|B005KPLN5Q|Video DVD|5
US|RBPHJIGASHA68|0525946284|Books|5
US|R1GD0IA9TYWTFF|145162607X|Books|5
UK|R142V52WZDX8GR|B00005421R|Video DVD|5
US|R119R14XEIOATS|0736092269|Books|5
UK|R2V6F0Z5LE75S1|B00538VY5Y|Video DVD|4

### Columns

- **marketplace (string):** abbreviation of the country where the Amazon marketplace is located
- **review_id (string):** unique review id
- **product_id (string):** unique product id
- **product_category (string):** category of the product
- **star_rating (int):** star rating given to the product in the review

## Options

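- `-k`: counts the unique values (keys) in the given column(s)
- `-kx`: lists the unique values themselves
- `-kX`: lists the unique values with their occurrence counts and optional statistics
- `-g`: groups records so that counting is done within each group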

### -k
`-k KeyName ColName [ColName ...]`<br>
This option counts the number of unique values present in the given column(s).
`KeyName` names the key (the column, or the combination of columns when several are given) and appears as the header of the count in the output.
Pass multiple columns to count a composite key.

**Single Column**<br>
We can use this to count how many unique products appear in the reviews.
The `product_id` column is a unique identifier for each product.
We set `KeyName` to `num_product` and pass `product_id` as `ColName`, like below.

In [3]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -k num_product product_id

"row","num_product"
2181,879




- `row`: total number of rows processed by the command (in this case the entire dataset)
- `num_product`: number of unique values in the `product_id` column

**Multiple columns**<br>
Let's see what happens when we give the `-k` option multiple columns. This time we'll use a small file (`data/multiple_k.txt`) that looks like the table below. It contains marketplace abbreviations and fake product id numbers.

|marketplace|product_id|
|---|---|
|US|1|
|US|2|
|US|3|
|JP|1|
|JP|2|
|JP|3|
|FR|1|
|FR|2|
|FR|3|

The number of unique `product_id` values in this table is 3.
However, the number of unique combinations of `marketplace` and `product_id` is 9. Let's check it.

In [4]:
aq_cnt -f,+1,sep="|" data/multiple_k.txt -d s:marketplace i:product_id -k num_product marketplace product_id

"row","num_product"
9,9


You can also provide more than two columns. This comes in handy when counting records by a composite key (such as a combination of last name, first name and phone number).
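
As a cross-check on what `-k` reports, the two counts for the nine-row table can be reproduced with standard Unix tools. This is only a sketch of the counting semantics, not a description of how `aq_cnt` works internally. The scratch file created with `mktemp` is an assumption made for illustration and stands in for `data/multiple_k.txt`.

```shell
# Recreate the nine-row table from this section as a pipe-separated scratch file
# (a hypothetical temp file, not the tutorial's data/multiple_k.txt).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
marketplace|product_id
US|1
US|2
US|3
JP|1
JP|2
JP|3
FR|1
FR|2
FR|3
EOF

# Distinct product_id values: skip the header, keep column 2, de-duplicate, count.
distinct_ids=$(tail -n +2 "$tmp" | cut -d'|' -f2 | sort -u | wc -l | tr -d ' ')

# Distinct (marketplace, product_id) pairs: de-duplicate whole rows, count.
distinct_pairs=$(tail -n +2 "$tmp" | sort -u | wc -l | tr -d ' ')

echo "distinct product_id: $distinct_ids"
echo "distinct pairs:      $distinct_pairs"
rm -f "$tmp"
```

The single-column count is 3 and the composite count is 9, the same `num_product` value that `aq_cnt` reported for the two-column key.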

### -kx

`-kx[,AtrLst] File KeyName ColName [ColName ...]` <br>
While the `-k` option counts the unique values in the given column(s), this option outputs the unique values themselves.

**Be wary of the syntactic difference**<br>
This option requires the output file name as its first argument.
In this sample we'll use `-`, which writes to stdout (the command line window).
    
**Single Column**<br>
As an example, we'll display all the marketplace names contained in the `marketplace` column of the Amazon review dataset.


In [5]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country marketplace

"marketplace"
"FR"
"JP"
"DE"
"US"
"UK"


Let's take a look at `star_rating` as well. Amazon's star ratings range from 1 to 5, so this dataset should contain all five values.


In [6]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country star_rating | aq_ord -f,+1 - -d i:star_rating -sort star_rating

"star_rating"
1
2
3
4
5


**Multiple Columns**<br>
Providing the `marketplace` and `star_rating` columns, we can look at the `star_rating` values in each marketplace. (More technically, it lists the unique combinations of values from `marketplace` and `star_rating`.)

**Note**: in the example below, `aq_ord` is used after the pipe to sort the result. It is outside the scope of this sample, so don't worry about it for now!

In [7]:
aq_cnt -f,+1,tsv,eok $file -d $colSpec -kx - country marketplace star_rating \
| aq_ord -f,+1 - -d s:marketplace s:star -sort marketplace star

"marketplace","star"
"DE","1"
"DE","2"
"DE","3"
"DE","4"
"DE","5"
"FR","1"
"FR","2"
"FR","3"
"FR","4"
"FR","5"
"JP","1"
"JP","2"
"JP","3"
"JP","4"
"JP","5"
"UK","1"
"UK","2"
"UK","3"
"UK","4"
"UK","5"
"US","1"
"US","2"
"US","3"
"US","4"
"US","5"


### -kX

`-kX[,AtrLst] File KeyName ColName [ColName ...] [STATS:ColName [STATS:ColName ...]]`<br>
This option has two main functions.
1. Given a file name, key name and column names, it outputs every unique value (or combination of values, if multiple columns are given) along with the occurrence count of each.
2. The `STATS:ColName` option returns summary statistics of the given numeric column for each key (column value combination). Concretely, it provides the sum, average, standard deviation, minimum and maximum.

Let's take a look at it in action. We'll start with the first function.
We can explore how many times each marketplace appears in the dataset.
We do this by setting `ColName` to `marketplace`. Just like before, we'll set `File` to `-` to display the result in the notebook instead of writing it to an external file.

In [9]:
aq_cnt -f,+1,tsv $file -d $colSpec -kX - country marketplace

"marketplace","count"
"FR",21
"JP",30
"DE",102
"US",1723
"UK",305


As you can see, it displays every unique value in the `marketplace` column along with its occurrence count.

Next, let's see what we can do with the `STATS` option.

In [10]:
aq_cnt -f,+1,tsv $file -d $colSpec -kX - country marketplace STATS:star_rating

"marketplace","count","star_rating.sum","star_rating.avg","star_rating.stddev","star_rating.min","star_rating.max"
"FR",21,84,3.9999999999999991,1.3416407864998738,1,5
"JP",30,117,3.8999999999999995,1.3222238320605804,1,5
"DE",102,420,4.1176470588235308,1.3591939781498403,1,5
"US",1723,7337,4.2582704585025963,1.2226258479092469,1,5
"UK",305,1274,4.177049180327872,1.2306921248330025,1,5


We gave the `star_rating` column to the `STATS` option here.

The first two columns of the result show the same information as before, and the remaining columns show statistics of the `star_rating` column for each marketplace (the key).

This feature is very useful for computing statistics per group, such as a monthly analysis of the number of reviews, or a `star_rating` analysis by `product_category`.

**Multiple Columns**<br>
We will provide the `product_category` column in addition to `marketplace`.

**1. Distribution of product categories within each marketplace**


In [11]:
aq_cnt -f,+1,tsv $file -d $colSpec -kX - country marketplace product_category \
| aq_ord -f,+1 - -d s:marketplace s:product_category i:count -sort,dec marketplace count

"marketplace","product_category","count"
"US","Mobile_Apps",1478
"US","Digital_Music_Purchase",99
"US","Toys",73
"US","PC",48
"US","Musical Instruments",15
"US","Shoes",3
"US","Kitchen",3
"US","Health & Personal Care",2
"US","Office Products",2
"UK","Mobile_Apps",228
"UK","Digital_Music_Purchase",29
"UK","Toys",27
"UK","PC",15
"UK","Office Products",3
"UK","Musical Instruments",2
"UK","Shoes",1
"JP","Mobile_Apps",12
"JP","Toys",11
"JP","PC",5
"JP","Digital_Music_Purchase",2
"FR","Mobile_Apps",8
"FR","Toys",5
"FR","PC",4
"FR","Digital_Music_Purchase",3
"FR","Shoes",1
"DE","Mobile_Apps",47
"DE","Toys",24
"DE","Digital_Music_Purchase",15
"DE","PC",12
"DE","Shoes",1
"DE","Health & Personal Care",1
"DE","Musical Instruments",1
"DE","Office Products",1


**2. Statistics of `star_rating` for each key (from the example above)**<br>
We can display `star_rating` statistics for each key (composed of marketplace and product category).

In [12]:
aq_cnt -f,+1,tsv $file -d $colSpec -kX - country marketplace product_category STATS:star_rating \
| aq_ord -f,+1 - \
-d s:marketplace s:product_category i:count f:rating_sum f:rating_avg f:rating_stddev f:rating_min f:rating_max \
-sort,dec marketplace count


"marketplace","product_category","count","rating_sum","rating_avg","rating_stddev","rating_min","rating_max"
"US","Mobile_Apps",1478,6247,4.2266576454668376,1.2492569731437477,1,5
"US","Digital_Music_Purchase",99,455,4.5959595959595987,0.76823667589330191,2,5
"US","Toys",73,329,4.5068493150684921,1.0817075964422702,1,5
"US","PC",48,197,4.1041666666666679,1.3720551960445719,1,5
"US","Musical Instruments",15,67,4.4666666666666668,0.8338093878327919,3,5
"US","Shoes",3,12,4,1,3,5
"US","Kitchen",3,14,4.666666666666667,0.57735026918962573,4,5
"US","Health & Personal Care",2,6,3,1.4142135623730951,2,4
"US","Office Products",2,10,5,0,5,5
"UK","Mobile_Apps",228,916,4.0175438596491233,1.3173358252480474,1,5
"UK","Digital_Music_Purchase",29,134,4.6206896551724128,0.82000841038580119,1,5
"UK","Toys",27,128,4.7407407407407405,0.81299979149361468,1,5
"UK","PC",15,68,4.5333333333333332,0.63994047342218441,3,5
"UK","Office Products",3,14,4.666666666666667,0.57735026918962584,4,5
"UK","Musical Instrume

This displays the star rating statistics for each key (combination of marketplace and product category values).
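
To make the `STATS` columns concrete, here is a minimal `awk` sketch that computes the same kind of per-key summary. It illustrates the arithmetic only and is not `aq_cnt`'s implementation; the five input rows are made up for illustration. The standard deviation uses the sample (n - 1) form, which is an inference from the output above: the `"US","Shoes"` row (3 reviews, sum 12, minimum 3, maximum 5) reports a standard deviation of exactly 1.

```shell
# Hypothetical (marketplace, star_rating) rows; not taken from the review dataset.
stats=$(printf '%s\n' 'US 3' 'US 4' 'US 5' 'DE 2' 'DE 4' |
awk '
{
    key = $1; v = $2 + 0
    n[key]++; sum[key] += v; sumsq[key] += v * v
    if (!(key in lo) || v < lo[key]) lo[key] = v
    if (!(key in hi) || v > hi[key]) hi[key] = v
}
END {
    for (key in n) {
        avg = sum[key] / n[key]
        # Sample standard deviation (n - 1 denominator); set to 0 for a single row.
        sd = (n[key] > 1) ? sqrt((sumsq[key] - n[key] * avg * avg) / (n[key] - 1)) : 0
        printf "%s,%d,%d,%.4f,%.4f,%d,%d\n", key, n[key], sum[key], avg, sd, lo[key], hi[key]
    }
}' | LC_ALL=C sort)

# Columns: key, count, sum, avg, stddev, min, max
echo "$stats"
```

The `US` line comes out as `US,3,12,4.0000,1.0000,3,5`, which lines up with the count, sum, average, standard deviation, minimum and maximum columns of the `aq_cnt` output.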

## Using Groupby with -g 

**What is groupby and how does it work?**<br>
`-g` lets users define groups within which records are counted and analyzed. Its major application is counting the distinct elements within each group. Without `-g` this requires cascading executions of `aq_cnt`, but with it the same result takes one command.

### Counting rows within each group with `-g`

First, we will count the number of rows in each group (each marketplace).

In [13]:
aq_cnt -f,+1,tsv $file -d $colSpec -g marketplace

"marketplace","row"
"FR",21
"JP",30
"DE",102
"US",1723
"UK",305


As you can see, using the `-g` option alone with one column name gives the record count of each group in that column.

### Analyzing Products 

Let's see how many products are present in the review dataset. This takes one simple `-k` option, which counts the distinct `product_id` values in the whole dataset.

In [14]:
aq_cnt -f,+1,tsv $file -d $colSpec -k num_product product_id

"row","num_product"
2181,879


There are 879 distinct products among the 2181 rows of the dataset.

**Single Column**<br>
What if we want to know the number of distinct products **within each marketplace?** That's where the `-g` option comes in handy.<br>
We specify `marketplace` as the groupby column here, and count the distinct `product_id` values with the `-k` option.

In [15]:
aq_cnt -f,+1,tsv $file -d $colSpec -g marketplace -k num_product product_id

"marketplace","row","num_product"
"FR",21,21
"JP",30,30
"DE",102,96
"US",1723,668
"UK",305,222


Achieving the same result without the `-g` option takes two cascading commands, like this.

In [16]:
aq_cnt -f,+1,tsv $file -d $colSpec -kx - keyName marketplace product_id | aq_cnt -f,+1 - -d s:marketplace s:product_id -kX - - marketplace

"marketplace","count"
"US",668
"FR",21
"DE",96
"UK",222
"JP",30


**Multiple Columns**

Let's use two groupby columns, `marketplace` and `product_category`.
This way we count the number of distinct products within each product category within each marketplace.

In [17]:
aq_cnt -f,+1,tsv $file -d $colSpec -g marketplace product_category -k num_product product_id \
| aq_ord -f,+1 - -d s:marketplace s:product_category X i:num_product -sort marketplace product_category

"marketplace","product_category","num_product"
"DE","Digital_Music_Purchase",15
"DE","Health & Personal Care",1
"DE","Mobile_Apps",43
"DE","Musical Instruments",1
"DE","Office Products",1
"DE","PC",11
"DE","Shoes",1
"DE","Toys",23
"FR","Digital_Music_Purchase",3
"FR","Mobile_Apps",8
"FR","PC",4
"FR","Shoes",1
"FR","Toys",5
"JP","Digital_Music_Purchase",2
"JP","Mobile_Apps",12
"JP","PC",5
"JP","Toys",11
"UK","Digital_Music_Purchase",29
"UK","Mobile_Apps",147
"UK","Musical Instruments",2
"UK","Office Products",3
"UK","PC",14
"UK","Shoes",1
"UK","Toys",26
"US","Digital_Music_Purchase",93
"US","Health & Personal Care",1
"US","Kitchen",2
"US","Mobile_Apps",457
"US","Musical Instruments",10
"US","Office Products",2
"US","PC",38
"US","Shoes",3
"US","Toys",62


Just like that, you can nest groups within groups.
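
Counting distinct values per group boils down to two steps, as in the `-kx` then `-kX` pipeline shown earlier: deduplicate the (group, value) combinations, then count what is left per group. The sketch below shows that logic with `sort` and `awk` for a nested group. It is an illustration on made-up rows, not a replacement for `aq_cnt`, and the product ids are hypothetical.

```shell
# Hypothetical (marketplace, product_category, product_id) rows,
# including repeat reviews of the same product.
rows='US Toys A1
US Toys A1
US Toys B2
US PC C3
UK Toys A1
UK Toys A1'

# Step 1: keep each distinct (marketplace, category, product) combination once.
# Step 2: count the remaining combinations per (marketplace, category) group.
per_group=$(printf '%s\n' "$rows" | sort -u |
awk '{ count[$1 "," $2]++ } END { for (g in count) print g "," count[g] }' |
LC_ALL=C sort)

echo "$per_group"
```

The six rows collapse to four distinct combinations, so `US,Toys` ends up with 2 distinct products while `US,PC` and `UK,Toys` each have 1.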

### Chronological analysis

You can also apply the `-g` option to group records by time.

By specifying a time frame (year/month) as the groupby column, we can perform analysis over time.
Combined with `-k`, this lets us count the unique elements within each time period.

For comparison, we'll show how to do this both with and without the `-g` option.

For this section, we'll add more columns to the data, namely year and month columns. (This was done with `aq_pp`'s string manipulation. If you're interested, check out our tutorial on [aq_pp -map](aq_pp%20-map).)

marketplace|review_id|product_id|product_category|star_rating|year|month|review_date
---|---|---|---|---|---|---|---|
UK|R2D1VN26VB52J0|B005DOL0R0|PC|4|2015|6|2015-06-17
US|R227AKNUDMALRT|B006N0YWGY|Mobile_Apps|5|2012|2|2012-02-05
US|R3QKWFUMPXK7WF|B009UX2YAC|Mobile_Apps|5|2014|2|2014-02-11
US|R36EDJI0TQ59V5|B00B2V66VS|Mobile_Apps|1|2015|3|2015-03-24
US|R2MCQNEALKQ1Y6|B0054JZC6E|Mobile_Apps|5|2012|4|2012-04-16
UK|RC98JYWAQUZVS|B0094BB4TW|Mobile_Apps|3|2014|10|2014-10-19
US|R1CYPW3LETPTCN|B0091REZMW|Mobile_Apps|5|2013|11|2013-11-03
US|R2IQ07J1AEPRKF|B004SJ3BCI|Mobile_Apps|5|2011|11|2011-11-29
US|RCW5PBZOAGACG|B00BJA2VFW|Mobile_Apps|1|2014|12|2014-12-20
US|R19ZSKFTU408PA|B00IKZX1ZI|Mobile_Apps|5|2014|4|2014-04-02

**Dev Note**:<br>
Here I'm using the `data/chronological_reviews.tsv` file, which contains the extracted columns plus the year, month and review_date columns, because the colSpec of the file with the full 18 columns can't currently be detected correctly by `loginf`.
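
The `year` and `month` columns in the table above were derived with `aq_pp`. As a rough alternative for readers who want to reproduce similar columns without that step, they can be split out of `review_date` with plain `awk`. This is a swapped-in technique, not the method used for this tutorial, and it assumes dates are always in `YYYY-MM-DD` form. The two input rows are copied from the table above.

```shell
# review_id and review_date of the first two rows of the table, tab-separated.
derived=$(printf '%s\t%s\n' 'R2D1VN26VB52J0' '2015-06-17' 'R227AKNUDMALRT' '2012-02-05' |
awk -F'\t' -v OFS='\t' '{
    split($2, d, "-")                  # d[1] = year, d[2] = month, d[3] = day
    print $1, d[1] + 0, d[2] + 0, $2   # adding 0 drops the leading zero: 06 becomes 6
}')

# Columns: review_id, year, month, review_date
echo "$derived"
```

The derived values (2015 and 6, 2012 and 2) match the `year` and `month` cells of the corresponding table rows.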

In [18]:
# set up variables
colSpec="S:marketplace S:review_id S:product_id S:product_category
            I:star_rating I:year I:month S:review_date"
file="data/chronological_reviews.tsv"


For example, let's see how the total number of reviews changes over the years.
We do this by setting the `year` column as the groupby column and passing `review_id` to the `-k` option (which counts the number of distinct values).

In [19]:
aq_cnt -f,+1,tsv $file -d $colSpec -g year -k rate_by_year review_id \
| aq_ord -f,+1 - -d i:year X i:rate_by_year -sort year # just sorting the results by year

"year","rate_by_year"
2006,1
2007,2
2008,4
2009,3
2010,6
2011,44
2012,242
2013,569
2014,770
2015,540


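We can see that, generally speaking, the number of reviews increases year over year. The 2015 total is lower, but the 2015 rows in this sample only run through August (see the monthly breakdown further down).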

<br>**Without using the `-g` option**, we need to take the following two steps.
1. First, list all unique combinations of year and review_id using the `-kx` option.
2. Second, count the year-review_id pairs for each year using `-kX`.

In [21]:
# 1. get the composite keys of unique year and review_id values.
# head keeps the first 10 lines (the header plus 9 rows) for clarity
aq_cnt -f,+1,tsv $file -d $colSpec -kx - key year review_id | head -n 10

"year","review_id"
2015,"R2KQL9CZFKVV0N"
2014,"R1I7BV7VW46S1Z"
2014,"R2SRQCD31X6K2F"
2010,"R2U69XPUKM8RQF"
2014,"R23SU296YFJ0MM"
2015,"R2LG5P4FT5QF91"
2013,"R36BQ1B0AEIN4O"
2015,"R25WNX1SABYYXR"
2014,"R1LYAVDUGBOXRL"


In [22]:
# first step, with its output piped to the second step
aq_cnt -f,+1,tsv $file -d $colSpec -kx - key year review_id \
| aq_cnt -f,+1 - -d s:year s:review_id -kX - keyName year # on the second pass, count the number of occurrences of each year

"year","count"
"2006",1
"2008",4
"2009",3
"2007",2
"2011",44
"2012",242
"2013",569
"2010",6
"2014",770
"2015",540


We can see that `-g` option makes things a lot easier for counting distinct elements!

### Number of monthly reviews within each year

Let's take a look at the number of reviews in each month, within each year.

In [23]:
aq_cnt -f,+1,tsv $file -d $colSpec -g year month -k key review_id \
| aq_ord -f,+1 - -d i:year i:month i:row i:key -sort year month # just sorting

"year","month","row","key"
2006,11,1,1
2007,7,1,1
2007,12,1,1
2008,4,1,1
2008,5,1,1
2008,6,1,1
2008,10,1,1
2009,8,1,1
2009,10,1,1
2009,11,1,1
2010,5,2,2
2010,8,1,1
2010,10,2,2
2010,12,1,1
2011,1,3,3
2011,2,2,2
2011,3,3,3
2011,4,3,3
2011,5,2,2
2011,6,2,2
2011,7,3,3
2011,8,6,6
2011,9,2,2
2011,10,7,7
2011,11,4,4
2011,12,7,7
2012,1,23,23
2012,2,19,19
2012,3,15,15
2012,4,22,22
2012,5,17,17
2012,6,10,10
2012,7,12,12
2012,8,22,22
2012,9,22,22
2012,10,14,14
2012,11,21,21
2012,12,45,45
2013,1,53,53
2013,2,51,51
2013,3,48,48
2013,4,48,48
2013,5,30,30
2013,6,49,49
2013,7,53,53
2013,8,58,58
2013,9,37,37
2013,10,42,42
2013,11,44,44
2013,12,56,56
2014,1,74,74
2014,2,60,60
2014,3,62,62
2014,4,53,53
2014,5,40,40
2014,6,57,57
2014,7,68,68
2014,8,47,47
2014,9,71,71
2014,10,71,71
2014,11,76,76
2014,12,91,91
2015,1,84,84
2015,2,70,70
2015,3,84,84
2015,4,57,57
2015,5,64,64
2015,6,61,61
2015,7,65,65
2015,8,55,55


### Groupby Statistics with `-g` and `-kX`

### Annual rating statistics by month
Now that we have a basic idea of the annual number of reviews, let's take a look at the statistics for each year.
We will get the sum, average, standard deviation, minimum and maximum of the monthly review counts within each year.

1. On the first pass, we get the review count for each month of each year. This looks exactly like the result above.
2. On the second pass, we use `-kX` with the `stats` option on `review_count` to get the review count statistics. We provide the `year` column as the key, since we want annual statistics.

**Note** <br>
The data from the first pass is organized monthly, which means the average, standard deviation, minimum and maximum for each year are calculated from the monthly review counts, not the daily ones (the original dataset is daily).
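
The two-pass idea can be sketched on a toy input with two chained `awk` stages. This is only an illustration of why the second pass sees monthly counts rather than individual reviews. The six (year, month) rows are made up, and the stages are plain `awk`, not `aq_cnt`.

```shell
# Hypothetical (year, month) pairs, one line per review.
reviews='2014 1
2014 1
2014 1
2014 2
2015 1
2015 1'

# Pass 1: count reviews per (year, month).
# Pass 2: per year, summarize those monthly counts (months, total, avg, min, max).
yearly=$(printf '%s\n' "$reviews" |
awk '{ c[$1 " " $2]++ } END { for (k in c) print k, c[k] }' |
awk '
{
    y = $1; v = $3 + 0
    n[y]++; sum[y] += v
    if (!(y in lo) || v < lo[y]) lo[y] = v
    if (!(y in hi) || v > hi[y]) hi[y] = v
}
END {
    for (y in n) printf "%s,%d,%d,%.1f,%d,%d\n", y, n[y], sum[y], sum[y] / n[y], lo[y], hi[y]
}' | LC_ALL=C sort)

echo "$yearly"
```

For 2014 the second stage sees two monthly counts (3 and 1), so the average is 2.0 reviews per month even though there are four underlying reviews.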



In [24]:
aq_cnt -f,+1,tsv $file -d $colSpec -g year month -k review_count review_id \
| aq_cnt -f,+1 - -d i:year i:month i:row i:review_count -kX - key year stats:review_count

"year","count","review_count.sum","review_count.avg","review_count.stddev","review_count.min","review_count.max"
2013,12,569,47.416666666666679,8.0730565725902395,30,58
2015,8,540,67.5,11.199489784296933,55,84
2014,12,770,64.166666666666671,13.953385599809451,40,91
2012,12,242,20.166666666666664,8.9527378578575139,10,45
2006,1,1,1,0,1,1
2008,4,4,1,0,1,1
2010,4,6,1.5,0.57735026918962584,1,2
2011,12,44,3.6666666666666665,1.9227505550564008,2,7
2009,3,3,1,0,1,1
2007,2,2,1,0,1,1


### Annual Rating Statistics by Days

Here, we'll look at the annual review count statistics again, but based on daily counts.

This process is very similar to the one above, except that we provide the year and review_date columns to the `-g` option on the first pass of `aq_cnt`, in order to get daily review counts.

1. The first pass gets the daily review count, associated with the year column.
2. Then we use that to get the yearly statistics of `review_count`.

In [25]:
aq_cnt -f,+1,tsv $file -d $colSpec -g year review_date -k review_count review_id \
| aq_cnt -f,+1 - -d i:year s:review_date X i:review_count -kX - key year stats:review_count

"year","count","review_count.sum","review_count.avg","review_count.stddev","review_count.min","review_count.max"
2006,1,1,1,0,1,1
2008,4,4,1,0,1,1
2009,3,3,1,0,1,1
2007,2,2,1,0,1,1
2011,40,44,1.0999999999999999,0.37893237337253671,1,3
2012,173,242,1.3988439306358373,0.72127055525151751,1,4
2013,284,569,2.003521126760563,1.1663668317148435,1,6
2014,306,770,2.5163398692810444,1.398381149117975,1,7
2015,214,540,2.5233644859813085,1.5281139646837443,1,8
2010,6,6,1,0,1,1


That concludes the examples for the `aq_cnt` command. We've covered basic usage, including single- and multi-column keys for the `-k`, `-kx` and `-kX` options, as well as advanced grouping with `-g`.
