### Columnar vs. row data storage, and why columnar storage is faster for analytical queries
- PostgreSQL columnar extension: cstore_fdw
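
To make the difference concrete, here is a toy sketch in plain Python. It is only an illustration of the two layouts, not how PostgreSQL or cstore_fdw store data on disk, and the sample values are made up. An aggregate over one column has to walk every full record in the row layout, but only one list in the column layout.

```python
# Toy illustration only: the sample values below are made up.
rows = [
    {"product_title": "A", "review_rating": 5, "review_votes": 10},
    {"product_title": "B", "review_rating": 3, "review_votes": 2},
    {"product_title": "C", "review_rating": 4, "review_votes": 7},
]

# Column layout: one list of values per column name.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_rating_row_store(rows):
    """Row layout: every full record is visited to read a single field."""
    return sum(row["review_rating"] for row in rows) / len(rows)


def avg_rating_column_store(columns):
    """Column layout: only the review_rating values are visited."""
    ratings = columns["review_rating"]
    return sum(ratings) / len(ratings)
```

Both functions return the same average; the difference is how much unrelated data each one has to touch to get there.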

In [2]:
%load_ext sql
DB_ENDPOINT = '127.0.0.1'
DB = 'reviews'
DB_USER = 'postgres'
DB_PASSWORD = '1234'
DB_PORT = '5500'

conn_string = 'postgresql://{}:{}@{}:{}/{}'\
                .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)
print(conn_string)
%sql $conn_string

postgresql://postgres:1234@127.0.0.1:5500/reviews
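
The connection string above works because the password contains only digits. A minimal sketch, assuming the credentials may contain characters such as @ or /, would URL-encode the user and password before formatting. The helper name build_conn_string is hypothetical.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a postgresql:// URL, escaping characters that would break it."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


conn_string = build_conn_string("postgres", "1234", "127.0.0.1", "5500", "reviews")
```

With the values used in this notebook the result is identical to the string printed above.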


In [17]:
#!sudo -u postgres psql -c 'CREATE DATABASE reviews;'

!wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
!wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

!gzip -d customer_reviews_1998.csv.gz 
!gzip -d customer_reviews_1999.csv.gz 

!mv customer_reviews_1998.csv ../data/customer_reviews_1998.csv
!mv customer_reviews_1999.csv ../data/customer_reviews_1999.csv

URL transformed to HTTPS due to an HSTS policy
--2021-04-14 08:13:23--  https://examples.citusdata.com/customer_reviews_1998.csv.gz
Resolving examples.citusdata.com (examples.citusdata.com)... 2606:4700:20::681a:f38, 2606:4700:20::ac43:4902, 2606:4700:20::681a:e38, ...
Connecting to examples.citusdata.com (examples.citusdata.com)|2606:4700:20::681a:f38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24774482 (24M) [application/x-gzip]
Saving to: ‘customer_reviews_1998.csv.gz’


2021-04-14 08:13:25 (20.1 MB/s) - ‘customer_reviews_1998.csv.gz’ saved [24774482/24774482]

URL transformed to HTTPS due to an HSTS policy
--2021-04-14 08:13:25--  https://examples.citusdata.com/customer_reviews_1999.csv.gz
Resolving examples.citusdata.com (examples.citusdata.com)... 2606:4700:20::ac43:4902, 2606:4700:20::681a:e38, 2606:4700:20::681a:f38, ...
Connecting to examples.citusdata.com (examples.citusdata.com)|2606:4700:20::ac43:4902|:443... connected.
HTTP request sent, await

### STEP 1: Create a table with normal (row) storage & load data

Create a table called customer_reviews_row with the column names contained in the customer_reviews_1998.csv and customer_reviews_1999.csv files.

In [3]:
%%sql
DROP TABLE IF EXISTS customer_reviews_row;
CREATE TABLE customer_reviews_row
(
    customer_id TEXT,
    review_date DATE,
    review_rating INT,
    review_votes INT,
    review_helpful_votes INT,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)

 * postgresql://postgres:***@127.0.0.1:5500/reviews
Done.
Done.


[]

### Objective: 
- Use the COPY statement to populate the tables with the data in the customer_reviews_1998.csv and customer_reviews_1999.csv files. 
- You can access the files in the /tmp/ folder.

In [13]:
import pandas as pd

In [14]:
df=pd.read_csv('customer_reviews_1999.csv')
df.head()

Unnamed: 0,ATVPDKIKX0DER,1999-01-01,1,0,0.1,0385322135,The Voice on the Radio,562396,Book,Teens,Literature & Fiction,"{0440220653,0590988492,0590474782,0440219817,0590457403}"
0,ATVPDKIKX0DER,1999-01-01,1,0,0,0385484518,Tuesdays with Morrie,757,Book,Biographies & Memoirs,General,"{0786868716,0385318790,0385504209,0671027360,B..."
1,ATVPDKIKX0DER,1999-01-01,1,0,0,0395732611,Hatchet,1432267,Book,Children's Books,Literature,{}
2,ATVPDKIKX0DER,1999-01-01,1,0,0,0425147363,Tom Clancy's Op-Center (Tom Clancy's Op Center...,19188,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
3,ATVPDKIKX0DER,1999-01-01,1,0,0,0425151875,Tom Clancy's Op-Center,76807,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
4,ATVPDKIKX0DER,1999-01-01,1,0,0,042515601X,Acts of War (Tom Clancy's Op Center (Paperback)),165249,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
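
In the preview above, the first review ended up as the column header (note the duplicated 0 renamed to 0.1), which suggests the files have no header row. A minimal sketch, using the standard-library csv module and the column names from the CREATE TABLE statement, pairs each record with its column name. The helper read_headerless_csv is hypothetical, and the sample line is one of the rows shown in the preview. In pandas, passing header=None and names=COLUMNS to pd.read_csv has the same effect.

```python
import csv
import io

# Column names taken from the customer_reviews_row table definition.
COLUMNS = [
    "customer_id", "review_date", "review_rating", "review_votes",
    "review_helpful_votes", "product_id", "product_title",
    "product_sales_rank", "product_group", "product_category",
    "product_subcategory", "similar_product_ids",
]


def read_headerless_csv(handle, columns=COLUMNS):
    """Read CSV records that have no header row and label them with `columns`."""
    return [dict(zip(columns, record)) for record in csv.reader(handle)]


# One row copied from the preview above, used as in-memory sample input.
sample = io.StringIO(
    "ATVPDKIKX0DER,1999-01-01,1,0,0,0395732611,Hatchet,1432267,"
    "Book,Children's Books,Literature,{}\n"
)
records = read_headerless_csv(sample)
```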


In [16]:
%%sql 
COPY customer_reviews_row FROM 'customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_row FROM 'customer_reviews_1999.csv' WITH CSV;

 * postgresql://postgres:***@127.0.0.1:5500/reviews
(psycopg2.errors.UndefinedFile) could not open file "customer_reviews_1998.csv" for reading: No such file or directory
HINT:  COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.

[SQL: COPY customer_reviews_row FROM 'customer_reviews_1998.csv' WITH CSV;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
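
The error above occurs because COPY FROM with a file name is resolved by the PostgreSQL server process, so a relative path on the notebook machine is not found. Two ways out are to give the server an absolute path it can read, or to stream the file from the client, as the HINT suggests. The following is a minimal sketch of the second option, assuming conn is a psycopg2 connection; the helper name copy_csv_from_client is hypothetical.

```python
import re


def copy_csv_from_client(conn, table, csv_path):
    """Stream a local CSV file into `table` over the client connection.

    `conn` is assumed to be a psycopg2 connection; the file is read on the
    client and sent to the server through COPY ... FROM STDIN.
    """
    # Table names cannot be bound as query parameters, so only plain
    # identifiers are accepted here.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    statement = f"COPY {table} FROM STDIN WITH CSV"
    with open(csv_path, "rb") as handle, conn.cursor() as cur:
        cur.copy_expert(statement, handle)
    conn.commit()
    return statement
```

With a connection opened through psycopg2.connect, a call such as copy_csv_from_client(conn, 'customer_reviews_row', 'customer_reviews_1998.csv') would load the file from the notebook's working directory.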


### STEP 2: Create a table with columnar storage & load data

First, load the extension that provides columnar storage in Postgres.

- Load the extension the first time you use it in a database after installing it.
- Create the server object (needed for cstore_fdw foreign tables).

In [10]:
%%sql
CREATE EXTENSION citus

 * postgresql://postgres:***@127.0.0.1:5500/reviews
Done.


[]

In [18]:
%%sql
CREATE EXTENSION cstore_fdw
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw

# other cell

SERVER cstore_server
OPTIONS(compression 'pglz');

 * postgresql://postgres:***@127.0.0.1:5432/reviews
(psycopg2.errors.SyntaxError) syntax error at or near "CREATE"
LINE 1: CREATE EXTENSION cstore_fdw CREATE SERVER cstore_server FORE...
                                    ^

[SQL: CREATE EXTENSION cstore_fdw CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw]
(Background on this error at: http://sqlalche.me/e/13/f405)
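
The syntax error above comes from the two statements being sent as one because no semicolon separates them; ending each statement with a semicolon in the %%sql cell resolves that. The trailing SERVER ... OPTIONS fragment is not a statement on its own: in cstore_fdw it closes a CREATE FOREIGN TABLE statement. A minimal sketch of the same idea from Python, assuming cursor is a DB-API cursor (for example from psycopg2); the helper names run_statements and foreign_table_ddl are hypothetical.

```python
# The two setup statements from the cell above, kept as separate strings.
SETUP_STATEMENTS = [
    "CREATE EXTENSION cstore_fdw",
    "CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw",
]


def run_statements(cursor, statements):
    """Execute each statement on its own so none runs into the next."""
    for statement in statements:
        cursor.execute(statement)
    return len(statements)


def foreign_table_ddl(table, column_defs):
    """Build a cstore_fdw foreign table definition.

    The SERVER ... OPTIONS clause from the cell above belongs at the end
    of this statement.
    """
    body = ",\n    ".join(column_defs)
    return (
        f"CREATE FOREIGN TABLE {table}\n(\n    {body}\n)\n"
        "SERVER cstore_server\n"
        "OPTIONS(compression 'pglz')"
    )
```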


### Objective: 
Create a FOREIGN TABLE called customer_reviews_col with the column names contained in the customer_reviews_1998.csv and customer_reviews_1999.csv files.

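- With cstore_fdw, this is a CREATE FOREIGN TABLE statement that ends with SERVER cstore_server.
- The cell below instead creates a regular table with USING columnar, the columnar table access method provided by the citus extension loaded above.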

In [11]:
%%sql
DROP TABLE IF EXISTS customer_reviews_col;

CREATE TABLE customer_reviews_col
(
    customer_id TEXT,
    review_date DATE,
    review_rating INT,
    review_votes INT,
    review_helpful_votes INT,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
) USING columnar


 * postgresql://postgres:***@127.0.0.1:5500/reviews
Done.
Done.


[]

### Objective: 
- Use the COPY statement to populate the tables with the data in the customer_reviews_1998.csv and customer_reviews_1999.csv files. 
- You can access the files in the /tmp/ folder.

In [16]:
%%sql 
COPY customer_reviews_col FROM 'customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_col FROM 'customer_reviews_1999.csv' WITH CSV;

 * postgresql://postgres:***@127.0.0.1:5500/reviews
(psycopg2.errors.UndefinedFile) could not open file "customer_reviews_1998.csv" for reading: No such file or directory
HINT:  COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.

[SQL: COPY customer_reviews_col FROM 'customer_reviews_1998.csv' WITH CSV;]
(Background on this error at: http://sqlalche.me/e/13/e3q8)


In [23]:
!pwd
!ls

/Users/brad/Programming/Learning/DataEngineering/02-DataWarehouse
1.0_3NF-SchemaExploration.ipynb
1.1_3NF-CreatingFactDimensionTables.ipynb
1.2-ETL3NFToFactDimensionTables.ipynb
1.3-ComputFromFactDimensionTables.ipynb
1.4_OLAP-SlicingDicingRollupDrillDownGroupBy.ipynb
1.5_OLAP-GroupingSets-GroupByCube.ipynb
1.6_OLAP-ColumnFormatInROLAP.ipynb
README.md
add-citus-repo.sh
customer_reviews_1998.csv
customer_reviews_1998.csv.gz.1
customer_reviews_1999.csv
customer_reviews_1999.csv.gz.1
data
wget-log


### STEP 3: Compare performance
Now run the same query on the two tables and compare the run time. Which form of storage is more performant?

Write a query that calculates the average review_rating by product_title for all reviews in 1995. Sort the data by review_rating in descending order. Limit the results to 20.

First run the query on customer_reviews_row:

In [8]:
%%time
%%sql

SELECT product_title, review_date, avg(review_rating) as rating
FROM customer_reviews_row
WHERE review_date BETWEEN ('1995-01-01') AND ('1995-12-31')
GROUP BY product_title, review_date
ORDER BY rating DESC
LIMIT 20;

 * postgresql://postgres:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 6.76 ms, sys: 2.31 ms, total: 9.06 ms
Wall time: 405 ms


product_title,review_date,rating
Acts of Kindness,1995-11-14,5.0
Albinus on Anatomy,1995-09-25,5.0
Accidental Empires,1995-10-04,5.0
Act Like Nothing's Wrong,1995-07-20,5.0
Ain't Nobody's Business If You Do,1995-08-14,5.0
99 Critical Shots in Pool,1995-08-03,5.0
A Year in Provence (Vintage Departures),1995-12-09,5.0
A Year in Provence (abridged),1995-12-09,5.0
A Civil Action (Vintage),1995-11-05,5.0
A People's History of the United States,1995-10-30,5.0


In [9]:
%%time
%%sql

SELECT product_title, review_date, avg(review_rating) as rating
FROM customer_reviews_col
WHERE review_date BETWEEN ('1995-01-01') AND ('1995-12-31')
GROUP BY product_title, review_date
ORDER BY rating DESC
LIMIT 20;

 * postgresql://postgres:***@127.0.0.1:5432/reviews
(psycopg2.errors.UndefinedTable) relation "customer_reviews_col" does not exist
LINE 2: FROM customer_reviews_col
             ^

[SQL: SELECT product_title, review_date, avg(review_rating) as rating
FROM customer_reviews_col
WHERE review_date BETWEEN ('1995-01-01') AND ('1995-12-31')
GROUP BY product_title, review_date
ORDER BY rating DESC
LIMIT 20;]
(Background on this error at: http://sqlalche.me/e/13/f405)
CPU times: user 5.11 ms, sys: 3.29 ms, total: 8.39 ms
Wall time: 71.8 ms
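
Note that the second cell failed with UndefinedTable, so its 71.8 ms wall time measures an error round trip rather than the query; the two numbers above are not a comparison yet. Once both tables are loaded on the same server, a single %%time run is also noisy (caching, network). A minimal sketch, assuming run_query is any zero-argument callable that executes the query, repeats the run and reports the median wall time. The helper name median_wall_time is hypothetical.

```python
import statistics
import time


def median_wall_time(run_query, repeats=5):
    """Call `run_query` several times and return the median wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Calling it once with a function that queries customer_reviews_row and once with one that queries customer_reviews_col gives two figures that can be compared directly.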
