# UDB Walk Through with Amazon Product Dataset

## Intro

In this notebook we will look at a simple use case of UDB tables, vectors and variables to organize and run a small analysis on the [Amazon Product Review Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which is publicly available in an AWS S3 bucket. 

If you're not familiar with the data, go ahead and take a look at the link above. 

## Pre-requisites

You're assumed to have some knowledge of:
* Bash and Linux
* Essentia and the aq_commands
* UDB and the aq_udb command

If you're not confident enough, you can take a look at the other notebooks, or at the [AuriQ Knowledge Base](http://auriq.com/knowledge-base/), where documentation and tutorials are available. 

Now that that is out of the way, let's get started. 

## Dataset

Before getting our hands dirty, we'll take a look at the schema and at what's inside the data. 
This dataset contains customer reviews left on amazon.com from 1995 to 2015. It has 15 columns; the following are the important ones for this tutorial.
* **customer_id** - unique identifier for a customer who left review(s)
* **product_parent** - unique identifier for a product. This can be used to merge the same product across different marketplaces.
* **review_id** - unique identifier for a review. 
* **star_rating** - star rating, ranging from 1 to 5. 
* **verified_purchase** - whether or not the review is based on a verified purchase. String value of Y or N.
* **review_date** - string value describing the year, month and day of the review. 
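
To make the column layout concrete, here is a minimal Python sketch (not part of Essentia) that splits one tab-separated line into the 15 named columns. The column names and integer types follow the `reviews` table spec used later in this notebook, leaving out `u_time`, `year`, `month` and `day`, which are assumed to be derived columns. The sample row is invented for illustration and is not taken from the dataset.

```python
# Column names, in the order used by the reviews table spec in this notebook.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# Columns declared as integers (type I) in that spec.
INT_COLUMNS = {"customer_id", "product_parent", "star_rating",
               "helpful_votes", "total_votes"}


def parse_review(line):
    """Split one tab-separated line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    return row


# Hypothetical sample row (all values invented for illustration).
SAMPLE = "\t".join([
    "DE", "12345", "R1EXAMPLE", "B00EXAMPLE", "67890", "Example product",
    "Books", "5", "3", "4", "N", "Y", "Great", "Really liked it.", "2015-08-31",
])

review = parse_review(SAMPLE)
print(review["customer_id"], review["star_rating"], review["review_date"])
```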

## What's UDB?
If you're already familiar with these concepts, feel free to skip to the Goal section. 

What is UDB? It is an in-memory database that stores data in a key-value data structure. 
There are 3 data structures and several attributes to keep in mind.

### Structures
Because the database is key-value based, each database needs to have a primary key column specified, which is common across all the data structures within the database. 
* **Table** - Analogous to a SQL table with a primary key column, except that foreign keys do not exist.
* **Vector** - If you're familiar with matrices, you can think of this as a row/column vector-like data structure. Each vector holds summary information for one primary key. It can be used, for example, to summarize customer information across the entire dataset. 
* **Variable** - As the name suggests, this stores values temporarily. It is not really a data type, but it comes in handy when performing complex calculations. 

### Attributes
UDB attributes are very powerful. You can assign one of the following attributes to each column; when data flows into a table or vector whose columns carry these attributes, the incoming values are handled according to them:
* `pkey`: primary hash key, must be string type
* `tkey`: integer sorting key
* `+key`: string key to merge on - only applicable for table. **ADD MORE EXPLANATION**
* `+first`: Use the first imported value when merging
* `+last`: Use the last imported value when merging
* `+add`: Sum values across rows for each unique value
* `+bor`: Bitwise-OR numeric values
* `+min`: Take the smallest value
* `+max`: Take the largest value
* `+nozero`: Ignore values of 0 or an empty string
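
The merge attributes listed above can be illustrated with a small Python model. This is not how UDB is implemented; it is only a sketch of the semantics as described in the list, where rows sharing the same primary key are folded into a single entry and each column follows its own rule. The names `MERGE_RULES` and `merge_rows` are hypothetical, and the handling of `+bor` and `+nozero` is modelled solely from their one-line descriptions.

```python
# Merge rules keyed by attribute name; each takes (current, incoming).
MERGE_RULES = {
    "+first": lambda cur, new: cur,
    "+last": lambda cur, new: new,
    "+add": lambda cur, new: cur + new,
    "+bor": lambda cur, new: cur | new,
    "+min": min,
    "+max": max,
}


def merge_rows(rows, pkey, attrs, nozero=()):
    """Fold rows sharing a primary key into one entry per key.

    attrs maps column name -> attribute; nozero lists columns whose
    0 / empty-string values are skipped (a model of +nozero).
    """
    vector = {}
    for row in rows:
        entry = vector.setdefault(row[pkey], {})
        for col, attr in attrs.items():
            value = row[col]
            if col in nozero and value in (0, ""):
                continue  # ignore zero / empty values
            if col not in entry:
                entry[col] = value  # first value seen for this key
            else:
                entry[col] = MERGE_RULES[attr](entry[col], value)
    return vector


# Hypothetical rows: two reviews by customer 1, one by customer 2.
rows = [
    {"customer_id": 1, "helpful_votes": 2, "max_star": 4, "min_star": 4},
    {"customer_id": 1, "helpful_votes": 5, "max_star": 1, "min_star": 1},
    {"customer_id": 2, "helpful_votes": 0, "max_star": 3, "min_star": 3},
]
attrs = {"helpful_votes": "+add", "max_star": "+max", "min_star": "+min"}
customer = merge_rows(rows, "customer_id", attrs)
print(customer)
```

In this model the two rows for customer 1 collapse into one entry whose `helpful_votes` is the sum of both rows, while `max_star` and `min_star` keep the largest and smallest rating seen.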

## Goal

The final goal of this tutorial is to import the data into udb, in a way that is easy to manage and analyze later. Concretely, we will create the following udb databases, tables, vectors and variables.

### Database: Amazon
This database contains a table and a vector:
* Table: reviews - keeps all of the original review dataset. 
* Vector: customer - contains a summary of each customer's information, such as the number of reviews the customer left, the customer's average star rating, and the number of helpful votes. 

**Put schemas here**

### Database: Products
This database contains only one vector:
* Vector: product - summarizes the information for each product, such as the number of reviews left, the average star rating, etc. 

**Schema**

## Steps
The whole project can be divided up into the following steps. 
1. Define and create the data schemas on udb, and start the udb server.
2. Stream the data from datastore, process some of the columns, and fill up `reviews` table. 
3. **Update this section** Do some calculation and fill up the `customer` and `product` vector.

## 1. Definition
We'll select the datastore, create a category, and define the data schemas. Finally, we'll start the udb server.

In [7]:
# choosing the public s3 bucket that stores amazon review dataset as datastore
ess select s3://amazon-reviews-pds
ess ls /tsv/amazon_reviews_multi*.tsv.gz

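# create a category that includes only the German (DE marketplace) reviews.
# (the category is named danish_reviews, although DE is the German marketplace)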
ess category add danish_reviews "/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz" --noprobe

 230M Nov 24 2017     /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
  67M Nov 24 2017     /tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
  90M Nov 24 2017     /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
 333M Nov 24 2017     /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
 1.4G Nov 24 2017     /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
2020-03-04 19:43:23 ip-10-10-1-118 ess[3339]: Fetching file list from datastore.
2020-03-04 19:43:23 ip-10-10-1-118 ess[3339]: Examining largest matched file to determine compression type: /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz


In [8]:
ess summary danish_reviews

Name:        danish_reviews
Pattern:     /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   
# of files:  1
Total size:  230.7MB
File range:  1970-01-01 - 1970-01-01
# columns:   0
Column Spec: 
Pkey: 
Schema: None
Preprocess:  
usecache:    False
Comment:    

First few lines:
(unreadable binary data: raw bytes of the gzip-compressed file)

Because we did not scan the source files when creating the category, the column spec, schema and first few lines of the files are not available with `ess summary <categoryName>`.

But we do know what the data looks like from the [data source's website](https://s3.amazonaws.com/amazon-reviews-pds/readme.html).

With this in mind, we'll create databases/tables/vectors with schemas.

In [9]:
# This section will create a database and its schemas

# first make sure there's no existing udb / schemas. 
ess server reset

# create database "amazon", on port 0. 
# Everything created after this will be inside the "amazon" database (except for other databases).
ess create database amazon --port 0

# create tables and vectors, with column specs in amazon db
ess create table reviews S:marketplace I,pkey:customer_id I:u_time S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date I:year I:month I:day
ess create vector customer I,pkey:customer_id I:num_review I,+add:helpful_votes I,+add:total_votes I,+max:max_star I,+min:min_star F:avg_star I:verified_purchases

# create products db, and product vector inside
ess create database products
ess create vector product I,pkey:product_parent I,+add:num_review I,+add:num_verified_purchase I,+min:min_star I,+add:sum_star

# this will start the udbd server. 
ess udbd start 

ip-10-10-1-118: Starting udbd-10010.
ip-10-10-1-118: udbd-10010 (3544) started.
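
The column specs passed to `ess create table` and `ess create vector` above appear to follow a `TYPE[,attribute...]:name` pattern. As a reading aid, the following Python sketch splits such specs into their parts. The pattern is inferred from the commands in this notebook rather than from the Essentia documentation, so treat `parse_colspec` as a hypothetical helper and not as part of any Essentia tool.

```python
def parse_colspec(spec):
    """Split a spec such as 'I,+add:helpful_votes' into (type, [attributes], name)."""
    head, _, name = spec.partition(":")
    if not name:
        raise ValueError(f"missing column name in {spec!r}")
    col_type, *attributes = head.split(",")
    return col_type, attributes, name


# Spec string copied from the `product` vector definition in this notebook.
PRODUCT_SPEC = (
    "I,pkey:product_parent I,+add:num_review I,+add:num_verified_purchase "
    "I,+min:min_star I,+add:sum_star"
)

for spec in PRODUCT_SPEC.split():
    col_type, attributes, name = parse_colspec(spec)
    print(f"{name:24} type={col_type} attributes={attributes}")
```

Reading the specs this way makes it easy to see which columns of the `product` vector are accumulated (`+add`) and which one is the primary key.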


Now that we have the server running with the necessary schemas, we'll stream our data:
* datastore (s3) --> udb table (`reviews`, in the `amazon` database). 

Within the stream, we will also process some columns and their values using the `aq_pp` command. Concretely, we will:
* extract the year, month and day from the `review_date` column, and map them onto individual new columns.
* for the `star_rating` column, 
    * create new `sum_star` and `min_star` columns, which will be streamed into the `product` vector. 
* create a `num_review` column and assign it the value 1.
* create a `num_verified_purchase` column, and conditionally fill its value using the `-if -else` option. 
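
The per-row processing listed above can be sketched in Python as follows. This only models the intended logic; in this notebook the actual work is done with `aq_pp`, whose options are not reproduced here. The `YYYY-MM-DD` date layout, the 1/0 encoding of `num_verified_purchase`, and the helper name `derive_columns` are assumptions made for illustration.

```python
def derive_columns(row):
    """Return a copy of row with the derived columns added."""
    out = dict(row)
    # Split review_date (assumed 'YYYY-MM-DD') into integer parts.
    year, month, day = (int(part) for part in row["review_date"].split("-"))
    out.update(year=year, month=month, day=day)
    # Copies of star_rating for the product vector to merge per product.
    out["sum_star"] = row["star_rating"]
    out["min_star"] = row["star_rating"]
    # Every row counts as exactly one review.
    out["num_review"] = 1
    # Conditional fill, mirroring an if/else on verified_purchase (assumed 1/0).
    out["num_verified_purchase"] = 1 if row["verified_purchase"] == "Y" else 0
    return out


# Hypothetical input row containing only the columns needed here.
row = {"star_rating": 4, "verified_purchase": "Y", "review_date": "2015-08-31"}
derived = derive_columns(row)
print(derived["year"], derived["month"], derived["day"],
      derived["num_verified_purchase"])
```

Giving every row a `num_review` of 1 and a copy of its rating in `sum_star` means that summing these columns per product yields the review count and the rating total, from which an average can be computed afterwards.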