# aq_udb

## Overview
This command lets users interact with UDB (User Database), a distributed, hash-based, in-memory database. 
UDB is designed to process vast amounts of data in a computing cluster, where each node is responsible for a unique set of primary keys in the database. 

This allows users to perform much more complex analyses and queries, on larger datasets, than the `aq_pp` command could alone.

`aq_udb` is used to clean and transform data in UDB once the database has been created. 
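
As a rough illustration of what "each node is responsible for a unique set of primary keys" means, the sketch below assigns keys to nodes by hashing. It is not UDB's actual algorithm, which this notebook does not describe: `node_for_key` is a hypothetical helper, `cksum` is a stand-in hash, and the three-node cluster is an assumption.

```shell
# Hypothetical helper: map a primary key to one of N nodes.
# cksum (the POSIX CRC) is only a stand-in for whatever hash UDB really uses.
node_for_key() {
    key=$1
    nodes=$2
    crc=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
    echo $(( crc % nodes ))
}

# Assumed 3-node cluster; the keys are customer_id values from the sample data.
for key in 10349 10629 16563 17139; do
    echo "customer_id=$key -> node $(node_for_key "$key" 3)"
done
```

In this sketch the node depends only on the key, so every record for a given `customer_id` would land on the same node and no key is shared between nodes.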

### Components of UDB
* Database: the user database, which contains one or more of the following components
* Table: similar to a TABLE in MySQL, with a schema that is a variant of a column spec
* Vector: ??
* Variable: ??

### Commands to be used
* `ess create`
* `aq_udb`
* `ess server summary`

## Contents
In this notebook, we'll first go through the general steps of managing a database, then work through actual examples of data processing using UDB.

### [Preparing Database](#prep_db)
* creating schema for database
* starting the database server
* populating it with data

### [Clean Up Database](#clean_db)
* shutting it down
* clearing the data inside
* deleting schema

* reusing previously defined database

[**aq_udb**](#aq_udb_option)<br>
* [`-exp`](#exp) - export data 
* [`-sort`](#sort) - sort output data
* [`-ord`](#ord) - sort keys in DB, or records in table internally
* [`-shf`](#shf) - shuffle keys or records in DB internally
* [`-cnt`](#cnt) - count unique primary keys in DB
* [`-eval`](#eval) - same as `aq_pp`'s option
* [`-filt`](#filt) - same as `aq_pp`'s option
* [`-var`](#var) - assign value to predefined variable
* [`-del_row`](#del_row) - delete a row in DB
* [`-lim_key Num`](#lim_key) - output only `Num` keys
* [`-lim_rec Num`](#lim_rec) - output only `Num` records
* [`-key_rec Num`](#key_rec) - output `Num` records per unique key
* [`-top [Start:]Num`](#top) - limit the output to `Num` records from the top of the DB
* [`-last [Start:]Num`](#last) - same as above, but from the bottom of the DB 
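
To make the difference between the record-limiting and key-based options concrete, here is a small emulation in plain shell and `awk`. It does not call `aq_udb`; it only mimics one reading of the one-line descriptions above on a tiny `key,value` sample, and the real options may behave differently on a multi-node cluster. The helper names `sample`, `lim_rec`, `key_rec` and `cnt_keys` are hypothetical.

```shell
# Tiny sample shaped like the export: customer_id,review_id
sample() {
    printf '%s\n' \
        '10349,R2YVNBBMXD8KVJ' \
        '10349,R2YVNBBMXD8KVJ' \
        '10629,R2K4BOL8MN1TTY' \
        '10629,R2K4BOL8MN1TTY' \
        '16563,R3G5WIW7NNA1CS'
}

# Emulates the idea of -lim_rec Num: keep only the first Num records.
lim_rec() { head -n "$1"; }

# Emulates the idea of -key_rec Num: keep the first Num records of each key.
key_rec() { awk -F, -v n="$1" 'seen[$1]++ < n'; }

# Emulates the idea of -cnt: count the unique keys.
cnt_keys() { awk -F, '!seen[$1]++ { n++ } END { print n + 0 }'; }

sample | lim_rec 3     # first three rows, whatever their keys
sample | key_rec 1     # one row per customer_id
sample | cnt_keys      # prints 3
```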




Below are the general steps for using UDB.
1. Create a database, table, vector, and/or variable with an arbitrary schema.
2. Fill the tables with data. 
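
Using the commands that the rest of this notebook walks through, the steps above can be collected into one script. This is only a sketch: the `run` helper and the `DRY_RUN` switch are hypothetical additions, not part of Essentia, and by default the script just prints each command instead of executing it.

```shell
# DRY_RUN=1 (the default here) only prints each command. Set DRY_RUN=0 on a
# machine where Essentia is installed to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
# Note that the printed form does not preserve the original quoting.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

schema='S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date'

# Step 1: create the database and table, then start the server.
run ess create database amazon
run ess create table reviews $schema   # left unquoted so each column is its own argument
run ess udbd start

# Step 2: fill the table with data, then export a few records to check.
run ess stream amazon '*' '*' 'aq_pp -f,+1,tsv,eok - -d %cols -imp amazon:reviews'
run aq_udb -exp amazon:reviews -lim_rec 30
```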


<a id='prep_db'></a>
### Preparing Database
In this section we'll cover the steps to prepare UDB for use for the first time.
This includes the steps below:
* selecting datastore and creating data category
* taking a look at data, and getting column spec
* creating database schema
* starting database server

Let's start with preparing the data we'll use.

**Prepare Data**

In [12]:
# select the datastore, which is an S3 bucket
ess select essentia-playground

# display the directory structures and files stored
ess ls /tsv/ | head -n 10

# create data category
ess category add amazon \
 '/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz'
 
# get info about the category
ess summary amazon | head -n 20

 230M Nov 12 23:31    /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
  67M Nov 12 23:31    /tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
  90M Nov 12 23:31    /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
 333M Nov 12 23:31    /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
 1.4G Nov 12 23:31    /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
 618M Nov 12 23:31    /tsv/amazon_reviews_us_Apparel_v1_00.tsv.gz
 555M Nov 12 23:31    /tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz
 340M Nov 12 23:31    /tsv/amazon_reviews_us_Baby_v1_00.tsv.gz
 871M Nov 12 23:31    /tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
 2.6G Nov 12 23:31    /tsv/amazon_reviews_us_Books_v1_00.tsv.gz
Name:        amazon
Pattern:     tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Tab
# of files:  2
Total size:  423.5MB
File range:  1970-01-01 - 1970-01-01
# columns:   15
Column Spec: S:mar

Now we have the information about the data that we need in order to define the database schema: the column spec. 
Let's go ahead and create the database and table.

**Database Creation**<br>
We'll create 
* a database named `amazon` 
* a table named `reviews` with 
    * schema - the column spec of the data from the data category.
    > When creating a schema, we need to specify a primary key column, much like in a SQL database. Here is the schema.
    `S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date`
    * Note the `I,pkey:customer_id`: `pkey` specifies that this column is the primary key of the table.
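
As a quick sanity check on a schema string like the one above, the following sketch lists the columns marked as primary key. It relies only on the `Type[,attribute]:name` shape visible in the schema; `pkey_columns` is a hypothetical helper, not an Essentia command.

```shell
schema='S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date'

# Print the name of every column whose attribute list contains "pkey".
pkey_columns() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -F: '
        {
            # $1 is "Type" or "Type,attr,...", $2 is the column name
            n = split($1, attrs, ",")
            for (i = 2; i <= n; i++)
                if (attrs[i] == "pkey") print $2
        }'
}

pkey_columns "$schema"   # prints customer_id
```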
    
We can use `ess create {}`, where `{}` is the kind of entity to create, followed by the name of that entity.

After that, we'll start the database server.

In [15]:
# creating database named amazon
ess create database amazon
# creating table named reviews, with the schema
ess create table reviews S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date
# start the db server
ess udbd start

2019-11-14 02:22:30 ip-10-10-1-118 ess[4193]: ***Error*** Database amazon already exists
2019-11-14 02:22:31 ip-10-10-1-118 ess[4202]: ***Error*** Table reviews already exists.
ip-10-10-1-118: Starting udbd-10010.
ip-10-10-1-118: udbd-10010 (4273) started.


**Populate it with Data**<br>

Now we'll fill the database with the review dataset, using Essentia's `ess stream` and the `aq_pp` command. Note that the **`-imp` option directs the output into the database**, here the `reviews` table of the `amazon` database.

In [17]:
ess stream amazon "*" "*" 'aq_pp -f,+1,tsv,eok - -d %cols -imp amazon:reviews'



Using `aq_udb` with the [`-exp`](#exp) option, we can see that the data is now in the database. 

In [20]:
aq_udb -exp amazon:reviews -lim_rec 30

"marketplace","customer_id","review_id","product_id","product_parent","product_title","product_category","star_rating","helpful_votes","total_votes","vine","verified_purchase","review_headline","review_body","review_date"
"UK",10349,"R2YVNBBMXD8KVJ","B00MWK7BWG",307651059,"My Favourite Faded Fantasy","Music",5,0,0,"N","Y","Five Stars","The best album ever!","2014-12-29"
"UK",10349,"R2YVNBBMXD8KVJ","B00MWK7BWG",307651059,"My Favourite Faded Fantasy","Music",5,0,0,"N","Y","Five Stars","The best album ever!","2014-12-29"
"UK",10629,"R2K4BOL8MN1TTY","B006CHML4I",835010224,"Seiko 5 Men's Automatic Watch with Black Dial Analogue Display and Blue Fabric Strap SNK807K2","Watches",4,0,0,"N","Y","Great watch from casio.","What a great watch. Both watches and strap is in a great quality, and the prize is low. Especially compared to the price here in Denmark.","2013-10-24"
"UK",10629,"R2K4BOL8MN1TTY","B006CHML4I",835010224,"Seiko 5 Men's Automatic Watch with Black Dial Analogue Display and Blue Fa

"UK",16563,"R3G5WIW7NNA1CS","B004OY47JS",74795975,"Billy Elliot (1 Disc Collectors Steelbook Edition) [2000] (Region 2) (Import)","Video DVD",5,0,0,"N","Y","Nice steelbook case very good soundtrack and most important a very good movie.","Nice steelbook case very good soundtrack and most important a very good movie a very nice item for a steelbook collector.","2013-10-28"
"UK",17139,"R75U5MUIZ9T0D","B009O36EO0",269758980,"Heal","Music",5,3,4,"N","Y","MAGIC!!!","Euphoria is one of the reason why I bought this album, since the victory @ Eurovision Song Contest 2012. I'm Indonesian, but I watched the show, and love her performance.<br />My Fave Tracks :<br />- In My Head<br />- My Heart Is Refusing Me<br />- Euphoria<br />- Sober<br />- Crying Out Your Name<br />- Breaking The Robot<br />Loreen's music is influenced by enigma, there are so many \""mysteries\"" in every single track. Electro Pop, Euro,  Harmony Vocals, in \""Dark\"" Nuance Reminds me the style of MDNA's in Frozen.<br />Trus

<a id='clean_db'></a>
### Clean Up Database

<a id='aq_udb_option'></a>
## aq_udb options

<a id='exp'></a>
### -exp

<a id='sort'></a>
### -sort

<a id='ord'></a>
### -ord

<a id='shf'></a>
### -shf

<a id='cnt'></a>
### -cnt










<a id='eval'></a>
### -eval

<a id='filt'></a>
### -filt

<a id='var'></a>
### -var

<a id='del_row'></a>
### -del_row

<a id='lim_key'></a>
### -lim_key

<a id='lim_rec'></a>
### -lim_rec

<a id='key_rec'></a>
### -key_rec

<a id='top'></a>
### -top

<a id='last'></a>
### -last