# UDB Walkthrough with the Amazon Product Review Dataset

## Intro

In this notebook we will look at a simple use case of UDB tables, vectors, and variables to organize and lightly analyze the [Amazon Product Review Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which is publicly available in an AWS S3 bucket.

If you're not familiar with the data, go ahead and take a look at the link above. 

## Pre-requisites

This notebook assumes you have some knowledge of:
* Bash and Linux
* Essentia and the `aq_` commands
* UDB and the `aq_udb` command

If you're not confident with these, take a look at the other notebooks or at the [AuriQ Knowledge Base](http://auriq.com/knowledge-base/), where documentation and tutorials are available.

With that out of the way, let's get started.

## Dataset

Before getting our hands dirty with the data, we'll take a look at its schema and contents.
The dataset contains customer reviews left on amazon.com from 1995 to 2015. It has 15 columns; the ones that matter for this tutorial are:
* **customer_id** - unique identifier for a customer who left review(s)
* **product_parent** - unique identifier for a product. This can be used to merge the same product across different marketplaces.
* **review_id** - unique identifier for a review. 
* **star_rating** - star rating, ranging from 1 to 5. 
* **verified_purchase** - whether or not the review is based on a verified purchase. String value of Y or N.
* **review_date** - string value giving the year, month, and day of the review.
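
To make the schema concrete, here is a minimal Python sketch that parses one tab-separated record into named columns. The column order is assumed from the dataset's readme and should be checked against the header line of the actual files. The sample row is made up for illustration, and `parse_reviews` is a hypothetical helper, not part of Essentia.

```python
import csv
import io

# Column order assumed from the dataset readme; verify against the file header.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# Made-up sample: a header line plus one row, for illustration only.
SAMPLE_TSV = "\t".join(COLUMNS) + "\n" + "\t".join([
    "US", "12345", "R1EXAMPLE", "B00EXAMPLE", "987654", "Example Widget",
    "Toys", "5", "3", "4", "N", "Y", "Great", "Works as described.",
    "2015-08-31",
]) + "\n"


def parse_reviews(text):
    """Parse tab-separated review text into dicts, typing a few columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["star_rating"] = int(row["star_rating"])
        row["helpful_votes"] = int(row["helpful_votes"])
        row["verified_purchase"] = row["verified_purchase"] == "Y"
        rows.append(row)
    return rows


reviews = parse_reviews(SAMPLE_TSV)
print(reviews[0]["customer_id"], reviews[0]["star_rating"],
      reviews[0]["review_date"])
```

Note that `customer_id` stays a string here, which matches how it is used as a key later on.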

## Refresher
Before we get into the details of this project, let's briefly go over the concepts of UDB, tables, vectors, and variables. If you're already familiar with these, feel free to skip ahead to the Goal section.

What is UDB? It is an in-memory database that stores data in a key-value data structure.
There are three data structures and a handful of attributes to keep in mind.

### Structures
Because the database is key-value based, each database needs to have a primary key column specified, which is common across all the data structures within the database. 
* **Table** - Analogous to a SQL table with a primary key column, except that foreign keys do not exist.
* **Vector** - If you're familiar with matrices, you can think of this as something like a row or column vector. Each vector holds summary information for one primary key; for example, it can be used to summarize a customer's information across the entire dataset.
* **Variable** - As the name suggests, this stores values temporarily. It is not really a data type, but it comes in handy when performing complex calculations.
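
As a rough mental model only (this is not how UDB is implemented), the three structures can be pictured in Python as one bucket per primary key that holds many table rows and a single vector record, plus a scratch value standing in for a variable. All names below are hypothetical, and treating the variable as a counter shared across keys is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical model: one bucket per primary key (customer_id).
db = defaultdict(lambda: {"reviews": [], "customer": {"review_count": 0}})

# Stand-in for a variable: a temporary value outside any bucket (assumed scope).
total_rows = 0

# Made-up input rows, for illustration only.
incoming = [
    {"customer_id": "c1", "review_id": "r1", "star_rating": 5},
    {"customer_id": "c2", "review_id": "r2", "star_rating": 3},
    {"customer_id": "c1", "review_id": "r3", "star_rating": 4},
]

for row in incoming:
    bucket = db[row["customer_id"]]
    bucket["reviews"].append(row)            # table: many rows per key
    bucket["customer"]["review_count"] += 1  # vector: one record per key
    total_rows += 1                          # variable: temporary counter

print(len(db["c1"]["reviews"]), db["c1"]["customer"]["review_count"], total_rows)
```

The point of the sketch is the shape: every structure in the database hangs off the same primary key, so looking up one customer reaches both their rows and their summary.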

### Attributes
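UDB attributes are very powerful. You can assign one of the following attributes to each column; when data flows into a table or vector whose columns have these attributes, the attribute determines how each incoming value is keyed, sorted, or merged with what is already stored: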
* `pkey`: primary hash key, must be string type
* `tkey`: integer sorting key
* `+key`: string key to merge on - only applicable for table. **ADD MORE EXPLANATION**
* `+first`: Use the first imported value when merging
* `+last`: Use the last imported value when merging
* `+add`: Sum values across rows for each unique value
* `+bor`: Bitwise-OR numeric values
* `+min`: Take the smallest value
* `+max`: Take the largest value
* `+nozero`: Ignore values of 0 or an empty string
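
The merge attributes are easier to compare side by side. The Python sketch below imitates what each one does to a stream of values arriving for the same key. It is an illustration of the descriptions above, not UDB code: `merge` and `fold` are hypothetical helpers, and treating `+nozero` as a modifier that can be combined with another attribute is an assumption.

```python
def merge(stored, incoming, attr, nozero=False):
    """Combine a stored value with an incoming one under a merge attribute."""
    # +nozero (assumed to act as a modifier): skip 0 or empty-string values.
    if nozero and incoming in (0, ""):
        return stored
    # Nothing stored yet: the incoming value becomes the stored value.
    if stored is None:
        return incoming
    if attr == "+first":
        return stored
    if attr == "+last":
        return incoming
    if attr == "+add":
        return stored + incoming
    if attr == "+bor":
        return stored | incoming
    if attr == "+min":
        return min(stored, incoming)
    if attr == "+max":
        return max(stored, incoming)
    raise ValueError(f"unknown attribute: {attr}")


def fold(values, attr, nozero=False):
    """Apply merge() across values in import order."""
    result = None
    for value in values:
        result = merge(result, value, attr, nozero)
    return result


# Made-up values imported for a single key, in order.
values = [4, 0, 5, 2]
for attr in ("+first", "+last", "+add", "+bor", "+min", "+max"):
    print(attr, fold(values, attr))
print("+min with +nozero", fold(values, "+min", nozero=True))
```

With these inputs, `+min` alone picks up the 0, while combining it with the zero-skipping behavior ignores it.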

## Goal

The final goal of this tutorial is to import the data into UDB in a way that is easy to manage and analyze later. Concretely, we will create the following UDB databases, tables, vectors, and variables.

### Database: Amazon
This database contains a table and a vector:
* Table: reviews - keeps all of the original review dataset.
* Vector: customer - contains a summary of each customer's information, such as the number of reviews each customer left, each customer's average star rating, and the number of helpful votes.
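
To show what the customer summary amounts to, here is a small Python sketch that computes it from a few made-up rows. It is one possible approach, not necessarily the one used later in this notebook: it keeps a running count and sum per customer and derives the average at the end, because an average cannot be updated from one new value alone. The field names `review_count`, `star_sum`, and `avg_star_rating` are hypothetical.

```python
from collections import defaultdict

# Made-up rows for illustration; column names follow the dataset.
reviews = [
    {"customer_id": "c1", "star_rating": 5, "helpful_votes": 2},
    {"customer_id": "c2", "star_rating": 3, "helpful_votes": 0},
    {"customer_id": "c1", "star_rating": 2, "helpful_votes": 1},
]


def summarize_customers(rows):
    """Build one summary record per customer_id."""
    summary = defaultdict(
        lambda: {"review_count": 0, "star_sum": 0, "helpful_votes": 0}
    )
    for row in rows:
        record = summary[row["customer_id"]]
        record["review_count"] += 1                    # running count
        record["star_sum"] += row["star_rating"]       # running sum
        record["helpful_votes"] += row["helpful_votes"]
    # Derive the average only after all rows have been merged.
    for record in summary.values():
        record["avg_star_rating"] = record["star_sum"] / record["review_count"]
    return dict(summary)


customers = summarize_customers(reviews)
print(customers["c1"])
```

Storing the sum and count rather than the average keeps each field mergeable by simple addition as more rows arrive.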

**Put schemas here**

### Database: Products
This database contains only one vector:
* Vector: product - summarizes the information for each product, such as the number of reviews left, the average star rating, etc.

**Schema**