# aq_udb

## Overview
This command lets users interact with UDB (User Database), a distributed, hash-based, in-memory database. 
It is designed to process vast amounts of data in a computing cluster, where each node is responsible for a unique set of primary keys in the database. 
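
To make the partitioning idea concrete, here is a minimal sketch of how a hash-based scheme could assign primary keys to cluster nodes. The function name `node_for_key`, the three-node cluster, and the plain modulo mapping are all assumptions made for illustration; this notebook does not describe UDB's actual hash function.

```shell
# Hypothetical sketch: map an integer primary key to one of N nodes.
# A plain modulo stands in for whatever hash UDB really uses.
node_for_key() {
  echo $(( $1 % $2 ))
}

# Records that share a key always land on the same node.
for key in 10349 10629 12136 10349; do
  echo "customer_id $key -> node $(node_for_key "$key" 3)"
done
```

The point of the sketch is only that the mapping is deterministic: every record for a given `customer_id` is routed to the same node, so per-key processing needs no cross-node communication.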

This allows users to perform much more complex analyses and queries, on larger datasets, than the `aq_pp` command could handle alone.

`aq_udb` is used to clean and transform data in UDB once the database has been created. 

### Components of UDB
* Database: the user database, which contains one or more of the following components
* Table: similar to a TABLE in MySQL, with a schema that is a variant of the column spec
* Vector: ??
* Variable: ??

## Data and Database setup

We'll be using the international-market portion of the [Amazon customer review dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), namely the reviews from Japan and the UK.

**Database**<br>
The following are some important details about the database we'll create.
* Database: `amazon`
* Table: `reviews`
* Primary key (column): `customer_id`

## Contents
In this notebook, we'll first go through the general steps of managing a database, then go over actual examples of data processing using UDB.

### Manage UDB

#### [Preparing Database](#prep_db)


#### [Checking Database State](#check_db)

#### [Clean Up Database](#clean_db)


### [aq_udb](#aq_udb_option)<br>
* [`-exp`](#exp) - export data 
* [`-sort`](#sort) - sort output data
* [`-ord`](#ord) - sort keys in DB, or records in table internally
* [`-shf`](#shf) - shuffle keys or records in DB internally
* [`-cnt`](#cnt) - count unique primary keys in DB
* [`-eval`](#eval) - same as `aq_pp`'s option
* [`-filt`](#filt) - same as `aq_pp`'s option
* [`-var`](#var) - assign value to predefined variable
* [`-del_row`](#del_row) - delete a row in DB
* [`-lim_key Num`](#lim_key) - output `Num` of keys only
* [`-lim_rec Num`](#lim_rec) - output `Num` of records
* [`-key_rec Num`](#key_rec) - output `Num` of records per unique key
* [`-top [Start:]Num`](#top) - limit the output result to `Num` of records from top of the DB
* [`-last [Start:]Num`](#last) - same as above, but from the bottom of the DB 




Below are the general steps for using UDB.
1. Create a database, table, vector, and/or variable with an arbitrary schema.
2. Fill the tables with data. 


<a id='prep_db'></a>
### Preparing Database
In this section we'll cover the steps to prepare UDB for first-time use.
This includes the steps below:
* selecting datastore and creating data category
* taking a look at data, and getting column spec
* creating database schema
* starting database server

Let's start with preparing the data we'll use.

**Prepare Data**

In [1]:
# select datastore, which is s3 bucket
ess select essentia-playground

# display the directory structures and files stored
ess ls /tsv/ | head -n 10

# create data category
ess category add amazon \
 '/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz'
 
# get info about the category
ess summary amazon | head -n 10

 230M Nov 12 23:31    /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
  67M Nov 12 23:31    /tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
  90M Nov 12 23:31    /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
 333M Nov 12 23:31    /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
 1.4G Nov 12 23:31    /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
 618M Nov 12 23:31    /tsv/amazon_reviews_us_Apparel_v1_00.tsv.gz
 555M Nov 12 23:31    /tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz
 340M Nov 12 23:31    /tsv/amazon_reviews_us_Baby_v1_00.tsv.gz
 871M Nov 12 23:31    /tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
 2.6G Nov 12 23:31    /tsv/amazon_reviews_us_Books_v1_00.tsv.gz
Name:        amazon
Pattern:     tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Tab
# of files:  2
Total size:  423.5MB
File range:  1970-01-01 - 1970-01-01


Now we have the information we need about the data, including its column spec, to define the database schema. 
Let's go ahead and create the database and table now.
<a id='db_creation'></a>
**Database Creation**<br>
We'll create 
* a database named `amazon` 
* a table named `reviews`, with 
    * schema - the column spec of the data from the data category.
    > When creating a schema, we need to specify the primary key column, much like in an SQL database. Here is the schema.
    `S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date`
    * Note the `I,pkey:customer_id`: `pkey` specifies that this column is the primary key of the table.
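
As a quick sanity check before creating the table, the schema string can be scanned for the column marked `pkey`. The sketch below uses only standard shell tools; the helper name `pkey_column` is hypothetical, and the parsing assumes each column is written as `Type[,Attr...]:Name` as in the schema above.

```shell
# Schema string copied from this notebook; the parsing logic is illustrative.
schema='S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date'

# Print the name of every column whose attribute list contains "pkey".
pkey_column() {
  printf '%s\n' "$1" | tr ' ' '\n' | awk -F: '$1 ~ /(^|,)pkey(,|$)/ { print $2 }'
}

pkey_column "$schema"   # prints: customer_id
```

If this prints nothing, the schema has no primary key column defined.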
    
We can use `ess create`, specifying the type of entity to create (for example `database` or `table`) followed by its name.

After that, we'll start the database server.

In [48]:
# delete the database, schema, and data if they already exist
ess server reset 
# creating database named amazon
ess create database amazon
# creating table named reviews, with the schema
ess create table reviews S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date
# start the db server
ess udbd start

ip-10-10-1-118: Starting udbd-10010.
ip-10-10-1-118: udbd-10010 (5644) started.


<a id='db_population'></a>
**Populate it with Data**<br>

Now we'll fill the database with the review dataset, using Essentia's stream and the `aq_pp` command. Note that **the `-imp` option is used to direct the output into the database**.

In [49]:
ess stream amazon "*" "*" 'aq_pp -f,+1,tsv,eok - -d %cols -imp amazon:reviews'



Using `aq_udb` with the [`-exp`](#exp) option, we can see that the data is now inside the database. 

In [42]:
aq_udb -exp amazon:reviews -lim_rec 30

"marketplace","customer_id","review_id","product_id","product_parent","product_title","product_category","star_rating","helpful_votes","total_votes","vine","verified_purchase","review_headline","review_body","review_date"
"UK",10349,"R2YVNBBMXD8KVJ","B00MWK7BWG",307651059,"My Favourite Faded Fantasy","Music",5,0,0,"N","Y","Five Stars","The best album ever!","2014-12-29"
"UK",10629,"R2K4BOL8MN1TTY","B006CHML4I",835010224,"Seiko 5 Men's Automatic Watch with Black Dial Analogue Display and Blue Fabric Strap SNK807K2","Watches",4,0,0,"N","Y","Great watch from casio.","What a great watch. Both watches and strap is in a great quality, and the prize is low. Especially compared to the price here in Denmark.","2013-10-24"
"UK",12136,"R3P40IEALROVCH","B00IIFCJX0",271687675,"Dexter Season 8","Digital_Video_Download",5,0,0,"N","Y","fantastic","love watching all the episodes of Dexter, when i first heard about this series i wasnt too sure about watching it. it took me a very long time to start and 

"UK",20583,"R28IKPKZMZZV52","B00A6HL704",438645704,"The Twilight Saga: The Complete Collection [DVD]","Video DVD",4,0,3,"N","Y","all good apart from postage","didn't turn up on time and paid extra for day of release. DVD fine just the timing was awful which ruined the experience.","2013-03-15"
"UK",20725,"R21DHG6AOGXIZ6","B00IABBXIO",777928797,"RATHER BE - CLEAN BANDIT","Music",5,0,1,"N","Y","Top tune","Bought this single as it was number 1 the day my daughter was born. Didn't like it at 1st but it's grown on me and now I love it. Great catchy tune.","2014-04-12"
"UK",20849,"R2Z32STUPPU8O4","B00FDPK2JQ",650376702,"Salute","Music",5,0,0,"N","Y","music","bought for son for christmas he loves  liitle mix just started listening to music after turning 12 yrs of age great for teenages","2014-01-07"
"UK",20849,"R2K985KNJXCYX0","B004B8NBQW",702526844,"Bright Lights","Music",5,0,0,"N","Y","amazing","bought for hubby for christmas great fan of ellie goulding love all of her music fast delivery i

<a id='check_db'></a>
### Checking the database state

Here, we'll go over three useful commands for checking the state of the database server. For details and the syntax of each command, refer to its man page. 

`ess server summary` provides the overall information and status of the database server, such as its databases, tables, and vectors, along with their schemas.

In [13]:
ess server summary

DATABASE : amazon (active)
   TABLE :reviews	S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date
  VECTOR : (none)
     VAR : (none)

ip-10-10-1-118: (+) udbd-10010 (4114) running.


`udbd status` reports whether the server is running.

In [22]:
ess exec "udbd status"

ip-10-10-1-118: (+) udbd-10010 (4114) running.


`aq_udb`'s `-inf` option provides information about a specific database running on a server.

In [21]:
aq_udb -inf amazon

"memx","strx","pkey","var","reviews"
1335836544,4914141,1059156,0,1969910
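
The `-inf` output is plain CSV, so it can be post-processed with standard tools. The sketch below assumes that the `pkey` column reports the number of unique primary keys and that the `reviews` column reports the number of records in that table; this is an interpretation of the output above, not something this notebook states. The helper name `avg_records_per_key` is hypothetical.

```shell
# Sample output copied from the cell above; on a live system you could pipe
# `aq_udb -inf amazon` into the same function instead.
inf_output='"memx","strx","pkey","var","reviews"
1335836544,4914141,1059156,0,1969910'

# Assumption: column 3 (pkey) = unique keys, column 5 (reviews) = table records.
avg_records_per_key() {
  awk -F, 'NR == 2 { printf "%.2f\n", $5 / $3 }'
}

printf '%s\n' "$inf_output" | avg_records_per_key   # prints: 1.86
```

Under that reading, each customer in the table has written just under two reviews on average.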


<a id='clean_db'></a>
### Clean Up Database
Now that we know how to create our UDB and get information about it and its server, let's learn how to clean it up after use, depending on the use case.


**1. Stop database server and delete schema**<br>
* `ess server reset`: use this when you're not planning to use the database and its schema again.
* A new schema will need to be created from scratch afterwards.

**2. Stop the database server, but preserve schema**<br>
* `ess udbd stop`: ideal when you'd like to shut down your instance but come back and use the database again later. 
* To use it again, start the server with `ess udbd start` and fill it with data.

**3. Clear up the data inside of database**<br>
* `aq_udb -clr`: empties the data from a database, table, etc. Use this to repopulate tables or the whole database. 
    * with a table name, it empties the data but preserves the database and table schema
    * with a database name, it empties the whole database (the schema appears to be preserved), so you can refill the data without recreating the schema
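
The three cases above can be summarized as a small lookup. The helper below is hypothetical and only prints the matching command rather than running it, so it is safe to try without a running server; the commands it prints are the ones listed above.

```shell
# Hypothetical helper: prints (does not run) the clean-up command for a use case.
cleanup_command() {
  case "$1" in
    reset) echo "ess server reset" ;;   # case 1: stop the server and delete the schema
    stop)  echo "ess udbd stop" ;;      # case 2: stop the server, keep the schema
    clear) echo "aq_udb -clr $2" ;;     # case 3: empty a table (DB:Table) or database (DB)
    *)     echo "unknown case: $1" >&2; return 1 ;;
  esac
}

cleanup_command clear amazon:reviews   # prints: aq_udb -clr amazon:reviews
```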
            
            
In the following three cells, we'll demonstrate each command and check the result using `ess server summary` or `aq_udb -exp`.

In [39]:
# case 1, delete everything
ess server reset
ess server summary

ip-10-10-1-118: No running server detected.


In [43]:
# stop the database server, preserve schema
ess udbd stop
ess server summary

ip-10-10-1-118: Stopping udbd-10010 (5182).
ip-10-10-1-118: udbd-10010 stopped.
DATABASE : amazon (active)
   TABLE :reviews	S:marketplace I,pkey:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S:review_body S:review_date
  VECTOR : (none)
     VAR : (none)

2019-11-15 01:40:54 ip-10-10-1-118 ess[5291]: ***Error*** ip-10-10-1-118: (-) udbd-10010 not running.



: 1

In [46]:
# clean up the data only
aq_udb -clr amazon:reviews
# verify that the table is now empty (only the header row should be returned)
aq_udb -exp amazon:reviews 

"marketplace","customer_id","review_id","product_id","product_parent","product_title","product_category","star_rating","helpful_votes","total_votes","vine","verified_purchase","review_headline","review_body","review_date"


<a id='aq_udb_option'></a>
## aq_udb options

Now we'll take a look at each option of the `aq_udb` command. Before going through this section, go over the [preparing database](#prep_db) section and make sure the database is running and filled with data.

<a id='exp'></a>
### -exp

This option exports the data from the given `DatabaseName:TableName`. If only `DatabaseName` is given, it exports the primary keys from the database. <br>

This option is used with many other options in order to process the data.
Let's take a look.

In [63]:
# exporting the data from table (-top 20 limits the output to 20 records)
aq_udb -exp amazon:reviews -top 20

"marketplace","customer_id","review_id","product_id","product_parent","product_title","product_category","star_rating","helpful_votes","total_votes","vine","verified_purchase","review_headline","review_body","review_date"
"UK",10349,"R2YVNBBMXD8KVJ","B00MWK7BWG",307651059,"My Favourite Faded Fantasy","Music",5,0,0,"N","Y","Five Stars","The best album ever!","2014-12-29"
"UK",10629,"R2K4BOL8MN1TTY","B006CHML4I",835010224,"Seiko 5 Men's Automatic Watch with Black Dial Analogue Display and Blue Fabric Strap SNK807K2","Watches",4,0,0,"N","Y","Great watch from casio.","What a great watch. Both watches and strap is in a great quality, and the prize is low. Especially compared to the price here in Denmark.","2013-10-24"
"UK",12136,"R3P40IEALROVCH","B00IIFCJX0",271687675,"Dexter Season 8","Digital_Video_Download",5,0,0,"N","Y","fantastic","love watching all the episodes of Dexter, when i first heard about this series i wasnt too sure about watching it. it took me a very long time to start and 

In [60]:
# now outputting primary keys only
aq_udb -exp amazon -top 10

"customer_id"
10349
10629
12136
12268
12677
13070
15356
16019
16563
17139


<a id='sort'></a>
### -sort

You can sort the data by a given column name as it is being exported. Note that the data is not sorted within the database itself. 


In [65]:
# sort by the customer_id column; feel free to try other columns as well
# the -c option outputs only the given column
aq_udb -exp amazon:reviews -sort customer_id -top 10 -c customer_id

"customer_id"
10349
10629
12136
12268
12677
13070
15356
16019
16563
17139


In [66]:
# note that the earlier -sort did not change the order of data within the table
aq_udb -exp amazon:reviews -top 10 -c customer_id

"customer_id"
10349
10629
12136
12268
12677
13070
15356
16019
16563
17139


<a id='ord'></a>
### -ord

In [69]:
aq_udb -ord amazon:reviews customer_id

Server(127.0.0.1:10010) error: sort column: Bad spec
aq_udb: Udb request invalid


: 34

<a id='shf'></a>
### -shf

<a id='cnt'></a>
### -cnt










<a id='eval'></a>
### -eval

<a id='filt'></a>
### -filt

<a id='var'></a>
### -var

<a id='del_row'></a>
### -del_row

<a id='lim_key'></a>
### -lim_key

<a id='lim_rec'></a>
### -lim_rec

<a id='key_rec'></a>
### -key_rec

<a id='top'></a>
### -top

<a id='last'></a>
### -last