# UDB Walk Through with Amazon Product Dataset

## Intro

In this notebook we walk through a simple use case of UDB databases, tables, vectors, and variables to organize the well-known [Amazon Product Review Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which is publicly available in an AWS S3 bucket.

It is a collection of Amazon customer reviews written between 1995 and 2015, and it contains millions of data points.

## Pre-requisites

To follow this notebook, we assume that you are somewhat familiar with
* bash and Linux commands/scripting
* Essentia and the `aq_tool` commands
* a little bit of UDB and the `aq_udb` command

The resources below are available for reference if needed:
* [Essentia Playground Home](../README.ipynb)
* [AuriQ Knowledge Base](http://auriq.com/knowledge-base/), where documentation and tutorials are available.


## Dataset

Before getting our hands dirty with the data, we'll take a look at the columns and the information they contain.
The dataset has 15 columns in total. The following ones are the most important for this tutorial.

* **customer_id**: int - unique identifier for a customer who left review(s)
* **product_parent**: int - unique identifier for a product. This can be used to merge the same product across different marketplaces.
* **review_id**: string - unique identifier for a review. 
* **star_rating**: int - star rating, ranging from 1 to 5. 
* **verified_purchase**: string - whether or not the review is based on a verified purchase. String value of "Y" for verified, or "N" otherwise.
* **review_date**: string - the year, month, and day of the review, in the format `2014-08-01`.

## What's UDB?
If you're already familiar with these concepts, feel free to skip to the next section (Goal). 

What is UDB (User Data Base)? It is an in-memory database that stores data using a key-value data structure. 
There are 4 main components and several attributes to keep in mind.

### Components


* **Database** - the database itself. It contains the following 3 components and must have a specified primary key.
* **Table** - analogous to a SQL table with a primary key column, except that foreign keys do not exist.
* **Vector** - a container that stores a single record (row) of data. Each vector stores information associated with one unique primary key in the database (for instance, a single customer or a single product in the Amazon dataset). *It also acts as a "strainer" for data. More on this later.*
* **Variable** - as the name suggests, this stores values temporarily. It is not really a data type, but it comes in handy when performing complex calculations. 

### Attributes
UDB attributes are very powerful. You can assign one of the following attributes to each column of a table / vector. When data flows into a column with one of these attributes, the column handles the incoming value as follows.

* `pkey`: primary hash key, must be string type
* `tkey`: integer sorting key, used to sort records within each primary key.
* `+key`: [long description available](http://www.auriq.com/documentation/source/reference/manpages/udb.spec.html)
* `+first`: keep the first value when importing data
* `+last`: keep the last value when importing data
* `+add`: add incoming value to existing value, can be used to calculate cumulative sum across records
* `+bor`: Bitwise-OR numeric values
* `+min`: Take the smallest value when importing
* `+max`: Take the largest value when importing
* `+nozero`: Ignore values of 0 or an empty string
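
To make the attribute semantics more concrete, here is a minimal sketch in plain shell and `awk` (not Essentia code) that mimics the value a single column would end up holding for one primary key when several records arrive in order. The function name `merge_values` is hypothetical, and the sketch only covers `+first`, `+last`, `+add`, `+min` and `+max` as they are described in the list above.

```bash
# merge_values ATTR VALUE...
# Emulates the value a column would hold for ONE primary key after the
# given values arrive in order. Illustration only, not an Essentia API.
merge_values() {
    attr=$1
    shift
    printf '%s\n' "$@" | awk -v attr="$attr" '
        NR == 1         { v = $1 + 0; next }              # first record initializes the column
        attr == "first" { next }                          # +first: keep the earliest value
        attr == "last"  { v = $1 + 0; next }              # +last: overwrite with the newest value
        attr == "add"   { v += $1; next }                 # +add: running sum
        attr == "min"   { if ($1 + 0 < v) v = $1 + 0; next }
        attr == "max"   { if ($1 + 0 > v) v = $1 + 0; next }
        END             { print v }
    '
}

# star ratings 4, 2, 5 arriving for the same customer
merge_values add 4 2 5     # prints 11
merge_values min 4 2 5     # prints 2
merge_values max 4 2 5     # prints 5
merge_values first 4 2 5   # prints 4
merge_values last 4 2 5    # prints 5
```

The point of the sketch is that the aggregation happens at import time: the column never stores the individual values, only the merged result.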


## Goal

The final goal of this tutorial is to import the data into udb, in a way that is easy to manage and analyze later. Concretely, we will create the following databases, tables and vectors.

### Database: Amazon
This database contains a table and a vector:
* Table: reviews - keeps all of the original review dataset. 
* vector: customer - contains a summary of each customer's information, such as the number of reviews the customer left, the sum of their star ratings (from which the average rating can be derived), and the number of helpful votes. 

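**Schema**

* `reviews`: `S:marketplace I,pkey:customer_id I:u_time S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date I:year I:month I:day`
* `customer`: `I,pkey:customer_id I,+add:num_review I,+add:helpful_votes I,+add:total_votes I,+max:max_star I,+min:min_star I,+add:sum_star I,+add:num_verified_purchase`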

### Database: Products
This database contains only one vector:
* vector: product - summarizes the information for each product, such as the number of reviews left, the sum of star ratings (from which the average can be derived), etc. 

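**Schema**

* `product`: `I,pkey:product_parent I,+add:num_review I,+add:num_verified_purchase I,+min:min_star I,+add:sum_star`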

## Steps
The whole project can be divided into the following steps. 
1. Define and create the data schemas on UDB, and start the UDB server.
2. Stream the data from the datastore, process some of the columns, and fill the `reviews` table and the two vectors. 

## 1. Definition 
We'll select a datastore, create a category, and define the data schemas for UDB. Finally, we'll start the UDB server.

In [2]:
# choose the public S3 bucket that stores the Amazon review dataset as the datastore
ess select s3://amazon-reviews-pds
# take a look at what's inside the bucket
ess ls /tsv/amazon_reviews_multi*.tsv.gz

# create a category that includes only the German (DE) reviews to keep the data size small
# (the category name "danish_reviews" is kept as-is because later cells refer to it)
ess category add danish_reviews "/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz" 

 230M Nov 24 2017     /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
  67M Nov 24 2017     /tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
  90M Nov 24 2017     /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
 333M Nov 24 2017     /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
 1.4G Nov 24 2017     /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz


In [3]:
ess summary danish_reviews

Name:        danish_reviews
Pattern:     /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Tab
# of files:  1
Total size:  230.7MB
File range:  1970-01-01 - 1970-01-01
# columns:   15
Column Spec: S:marketplace I:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S,trm:review_headline S,esc:review_body S:review_date
Pkey: 
Schema: S:marketplace I:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S,trm:review_headline S,esc:review_body S:review_date
Preprocess:  
usecache:    False
Comment:    

First few lines:
marketplace	customer_id	review_id	product_id	product_parent	product_title	product_category	star_rating	helpful_votes	total_votes	vine	verified_purchase	review_headline	review_body	review_d

Now that we know what the data and its schema look like, we can create and start UDB.

In [5]:
# This section will create a database and its schemas

# first make sure there's no existing udb / schemas. 
ess server reset

# create database "amazon", on port 0. 
# Everything created after this will be inside the "amazon" database (until another database is created).
ess create database amazon --port 0

# create tables and vectors, with column specs in amazon db
ess create table reviews S:marketplace I,pkey:customer_id I:u_time S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date I:year I:month I:day
ess create vector customer I,pkey:customer_id I,+add:num_review I,+add:helpful_votes I,+add:total_votes I,+max:max_star I,+min:min_star I,+add:sum_star I,+add:num_verified_purchase

# create products db, and product vector inside
ess create database products
ess create vector product I,pkey:product_parent I,+add:num_review I,+add:num_verified_purchase I,+min:min_star I,+add:sum_star

# this will start the udbd server. 
ess udbd start 

ip-10-10-1-36: Starting udbd-10010.
ip-10-10-1-36: udbd-10010 (13405) started.


### Closer Look at the UDB Schemas

Let's take a look at each schema to understand what it does. 
We will focus only on the significant columns, that is, the ones that are used in or produced by the data processing later. 


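#### `amazon:reviews` table
* `I,pkey:customer_id`: primary key to hash on.
* `I:u_time`: `review_date` converted to Unix time.
* `S:review_id`: unique identifier for a review.
* `I:product_parent`: unique identifier for a product.
* `I:star_rating`: star rating, ranging from 1 to 5.
* `I:helpful_votes` & `I:total_votes`: number of helpful votes and total votes the review received.
* `S:verified_purchase`: "Y" for a verified purchase, "N" otherwise.
* `S:review_date`: the review date as a string, e.g. `2014-08-01`.
* `I:year`, `I:month`, and `I:day`: year, month, and day extracted from `review_date`.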

#### `amazon:customer` vector

This vector is used to summarize each customer's data.
A single vector stores information about one customer (`customer_id`), so there are as many vectors as there are unique `customer_id` values in the data. 

* `I,pkey:customer_id`: primary key, corresponds to the unique customers in the dataset.  
* `I,+add:num_review`: `+add` keeps adding the values that go through this column. It is used to get the cumulative sum of the `num_review` column across records, for each `customer_id`. 
* `I,+add:helpful_votes` & `I,+add:total_votes`: get the total number of each kind of vote per `customer_id`. 
* `I,+max:max_star` and `I,+min:min_star`: `+max` and `+min` keep only the max and min value across the records for each primary key (`customer_id`). They are used to get the min and max star_rating that each customer left.
* `I,+add:sum_star`: used to get the cumulative sum of star_rating across records for each `customer_id`. 
* `I,+add:num_verified_purchase`: used to count the number of verified purchases made by each `customer_id`. It is used in conjunction with the `-if -else` options of the `aq_pp` command, which are covered later. 

#### `products:product` vector
Similarly, each of these vectors contains the record for one unique product present in the dataset. 
A similar schema is used to collect data for each product, with a different primary key.

* `I,pkey:product_parent`: primary key to identify a product
* `I,+add:num_review`: used to count the number of records (reviews) per product
* `I,+add:num_verified_purchase`: used to count the number of verified purchases. Used with the conditional statement in `aq_pp`.
* `I,+min:min_star`: keeps the minimum star rating value for each product.
* `I,+add:sum_star`: calculates the cumulative sum of the star ratings left on each product.
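
Neither vector stores an average directly. The average star rating has to be derived afterwards as `sum_star / num_review`. The sketch below shows that arithmetic in plain shell and `awk`. It assumes a CSV input with a header line and the column order of the `customer` vector schema above. The function name `avg_star` and the sample rows are made up for illustration and are not real query output.

```bash
# avg_star: read a CSV export of the customer vector on stdin and print
# customer_id plus the average star rating (sum_star / num_review).
# Assumed column order (from the vector schema):
# customer_id,num_review,helpful_votes,total_votes,max_star,min_star,sum_star,num_verified_purchase
avg_star() {
    # NR > 1 skips the header line; $2 > 0 guards against division by zero
    awk -F, 'NR > 1 && $2 > 0 { printf "%s,%.2f\n", $1, $7 / $2 }'
}

# made-up sample rows, for illustration only
avg_star <<'EOF'
"customer_id","num_review","helpful_votes","total_votes","max_star","min_star","sum_star","num_verified_purchase"
101,4,3,5,5,2,14,3
102,1,0,0,5,5,5,1
EOF
# prints:
# 101,3.50
# 102,5.00
```

If the vector export has the same CSV shape as the table export shown later in this notebook, the function could presumably be fed from `aq_udb -exp amazon:customer`, but that pairing is an assumption rather than something tested here.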


### Streaming Data 

Now that the servers are running with the necessary schemas, we'll stream our data from the datastore (S3) into the UDB table and vectors, processing it on the fly. 

While streaming, the data goes through the following processing in the `aq_pp` command:

* convert the `review_date` column's string value into Unix time, and store it in the `u_time` integer column.
* extract the year, month, and day from the `review_date` column, and map them onto individual new columns. This is done by the `-mapf` and `-mapc` options.
* create new `sum_star`, `min_star`, and `max_star` columns from the `star_rating` column, and stream them into the `product` and `customer` vectors. 
* create a `num_review` column with a constant value of 1, which is used to count the number of records (reviews) per product / customer as they are streamed into the 2 vectors. 
* create a `num_verified_purchase` column, and conditionally fill its value using the `-if -else` options. 

Finally, the `-imp` option is used to specify the table and vectors to import/stream the data into. 

**Note:** In order for a column of data to be imported, 
the column spec of the data (from `aq_pp`) needs to **match the column spec of the table/vector.**
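
The two date steps can be pictured in plain shell. This is only a sketch of the transformation, not of how `aq_pp` implements it. The function names `split_date` and `to_unix_time` are hypothetical, and `to_unix_time` assumes GNU `date` (the `-d` option), which is what a typical Linux host provides.

```bash
# split_date YYYY-MM-DD -> "year month day" as plain integers
split_date() {
    printf '%s\n' "$1" | awk -F- '{ printf "%d %d %d\n", $1, $2, $3 }'
}

# to_unix_time YYYY-MM-DD -> seconds since the epoch at 00:00:00 UTC
# Assumes GNU date; BSD/macOS date takes different options.
to_unix_time() {
    date -u -d "$1" +%s
}

split_date 2015-01-02      # prints: 2015 1 2
to_unix_time 2015-01-02    # prints: 1420156800
```

These values match the first exported row shown later in this notebook, where a `review_date` of `2015-01-02` appears with `u_time` 1420156800 and `year`, `month`, `day` of 2015, 1, 2.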

In [6]:
# set column spec
COL="S:marketplace I:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S,trm:review_headline S,esc:review_body S:review_date"
AMT="1000"

# stream, process and fill the udb table and both vectors. 
ess stream danish_reviews "*" "*" \
    "aq_pp -f,eok,tsv - -d $COL \
    -eval I:u_time 'DateToTime(review_date, \"%Y.%m.%d\")' \
    -mapf review_date '%%YEAR%%-%%MONTH%%-%%DATE%%' \
    -mapc s:s_year '%%YEAR%%' -mapc s:s_month '%%MONTH%%' -mapc s:s_day '%%DATE%%' \
    -eval i:year 'ToI(s_year)' -eval i:month 'ToI(s_month)' -eval i:day 'ToI(s_day)' \
    -eval I:sum_star 'star_rating' -eval I:min_star 'star_rating' -eval I:max_star 'star_rating' -eval I:num_review 1 \
    -if -filt 'verified_purchase == \"Y\"' -eval I:num_verified_purchase '1' -else -eval num_verified_purchase '0' -endif \
    -imp,ddef,seg=1/$AMT amazon:reviews -imp,seg=1/$AMT amazon:customer -imp,seg=1/$AMT products:product"



The table and vectors are now filled, so let's take a look using the `aq_udb` command. 

In [8]:
# export data from the tables and vectors, top few results only
# table 
aq_udb -exp amazon:reviews -top 2
# vectors
aq_udb -exp amazon:customer -top 5
aq_udb -exp products:product -top 5

"marketplace","customer_id","u_time","review_id","product_id","product_parent","product_title","product_category","star_rating","helpful_votes","total_votes","vine","verified_purchase","review_headline","review_body","review_date","year","month","day"
"DE",3859831,1420156800,"RU66BVNON80FI","B000024EXY",202640400,"Seconds Out","Music",5,1,3,"N","Y","einfach klasse","Ich h&ouml;re Genesis schon seit Anfang der 80'er. Nach dem Abgang des genialen, aber auch, f&uuml;r meinen Geschmack etwas zu abgedrehten Peter Gabriel, hatten Genesis m.E. ihre beste Zeit. Ich habe mir endlich auch die &#34;And than they where three&#34; gekauft. Unbedingt empfehlen kann ich auch &#34;A Trick of the Tail&#34;, &#34;Wind and Wuthering&#34; und nat&uuml;rlich &#34;The Lamb lies down on Broadway&#34;!!","2015-01-02",2015,1,2
"DE",3965250,1433462400,"R3OI6RJRW2ZPVQ","B00ERLTDCO",921065352,"Desperate Housewives - Staffel 8","Digital_Video_Download",1,5,11,"N","N","Stellungnahme !!!!","Hi Amazon....!!!!  Wie wä

## How did this happen??

This section explains how the 2 vectors and their attributes work together with the `aq_pp` command to calculate the values. 

Let's start by looking at the relevant parts of the `aq_pp` command. 


### Star Ratings
This is done by a few of the `-eval` options. Below is the relevant chunk of the command. 

```bash
aq_pp -f, .... 
-eval I:sum_star 'star_rating' -eval I:min_star 'star_rating' -eval I:max_star 'star_rating' ...
```
We're simply creating 3 new columns, all of which contain the same values as the `star_rating` column. The data in all three columns is streamed into the columns with the same names in the 2 vectors, and each column keeps or calculates a value according to its attribute.

For each primary key (`customer_id` or `product_parent`),
* `I,+min:min_star`: keeps only the minimum star_rating value.
* `I,+max:max_star`: same, but keeps only the maximum value.
* `I,+add:sum_star`: adds up all of the star_rating values across the records.
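
The per-key behavior can be imitated with a few lines of `awk`. This is a sketch of the logic only, not of UDB itself. The function name `star_summary` is hypothetical, the input is simplified to "key star_rating" pairs, and the sample ratings are invented.

```bash
# star_summary: read "key star_rating" pairs on stdin and print, per key,
# the minimum, maximum and sum of the ratings (like +min, +max and +add).
star_summary() {
    awk '
        !($1 in sum) { min[$1] = $2 + 0; max[$1] = $2 + 0 }   # first record for this key
        {
            if ($2 + 0 < min[$1]) min[$1] = $2 + 0
            if ($2 + 0 > max[$1]) max[$1] = $2 + 0
            sum[$1] += $2
        }
        END { for (k in sum) print k, min[k], max[k], sum[k] }
    ' | sort    # awk iterates keys in no particular order, so sort the output
}

# invented sample: two reviews by one customer, one by another
printf '%s\n' '3859831 5' '3965250 1' '3859831 3' | star_summary
# prints:
# 3859831 3 5 8
# 3965250 1 1 1
```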

### Number of Reviews

How is the number of reviews for each customer/product calculated? 

Remember that each unique value of `review_id` represents a unique review entry in the dataset. Because there are no duplicate `review_id` values across the entire dataset, we can assume that **each row of the dataset represents a unique review entry** as well. <br>
Hence, we can get the number of reviews per customer/product by counting the number of rows per customer/product. 

This is also done by one of the `-eval` options:
```bash
aq_pp -f ... 
-eval I:num_review 1 ... 
```
where every value of the new `num_review` column is equal to 1. 
This data is streamed into the column with the same name in each of the 2 vectors, which has the `+add` attribute (`I,+add:num_review`). Adding 1 for every row yields the number of rows, that is, the number of reviews per key. 

### Number of Verified Purchases

The original column is a string: "Y" for verified, and "N" for not verified. 
We'd like to count the number of "Y" values for each customer/product, and store it in the vectors. 

First, each string value is mapped to a number in a new column called `num_verified_purchase`:
* "Y" --> 1
* "N" --> 0

This is done with a conditional statement block in `aq_pp`:

```bash
aq_pp -f ... 
-if -filt 'verified_purchase == \"Y\"' -eval I:num_verified_purchase '1' -else -eval num_verified_purchase '0' -endif ...
```

Finally, the values from this column are summed up for each customer/product as the data goes through the `I,+add:num_verified_purchase` column of the vectors.
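
Put together, the constant `1` and the "Y"/"N" mapping turn `+add` into a counter. The sketch below imitates that in `awk`. It is an illustration of the logic, not Essentia code. The function name `count_reviews` is hypothetical, the input is simplified to "key verified_purchase" pairs, and the sample rows are invented.

```bash
# count_reviews: read "key verified_purchase" pairs (Y or N) on stdin and
# print, per key, the number of reviews and the number of verified ones.
count_reviews() {
    awk '
        {
            num_review[$1] += 1               # like -eval I:num_review 1 with +add
            flag = ($2 == "Y") ? 1 : 0        # like the -if / -else block
            num_verified[$1] += flag          # like +add on num_verified_purchase
        }
        END { for (k in num_review) print k, num_review[k], num_verified[k] }
    ' | sort    # awk iterates keys in no particular order, so sort the output
}

# invented sample rows
printf '%s\n' '3859831 Y' '3965250 N' '3859831 N' | count_reviews
# prints:
# 3859831 2 1
# 3965250 1 0
```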


In [16]:
# turn things off after using
ess server reset