# Exercise 03 - Columnar Vs Row Storage

- The columnar storage extension used here:
    - cstore_fdw by Citus Data [https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)
- The data tables are the ones Citus Data uses to demonstrate the storage extension


In [7]:
%load_ext sql

## STEP 0 : Create and connect to the local `reviews` database

### Create the database

In [None]:
!sudo -u postgres psql -c 'CREATE DATABASE reviews;'

!wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
!wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

!gzip -d customer_reviews_1998.csv.gz 
!gzip -d customer_reviews_1999.csv.gz 

!mv customer_reviews_1998.csv /tmp/customer_reviews_1998.csv
!mv customer_reviews_1999.csv /tmp/customer_reviews_1999.csv

### Connect to the database

In [8]:
DB_ENDPOINT = "127.0.0.1"
DB = 'reviews'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)

postgresql://student:student@127.0.0.1:5432/reviews
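The format string above works because neither the user name nor the password contains characters that are special in a URL. A minimal sketch, assuming credentials might contain characters such as `@` or `/`, percent-encodes them with the standard library's `urllib.parse.quote_plus`. The helper name `build_conn_string` and the second password are hypothetical.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, database):
    """Return a postgresql:// URL with percent-encoded credentials."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Same values as the notebook cell: the result is unchanged.
print(build_conn_string("student", "student", "127.0.0.1", "5432", "reviews"))

# Hypothetical password with URL-special characters: they get escaped.
print(build_conn_string("student", "p@ss/word", "127.0.0.1", "5432", "reviews"))
```

With the notebook's own credentials the output is identical to the string printed above, so the helper can be swapped in without changing the connection.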


In [9]:
%sql $conn_string

'Connected: student@reviews'

In [2]:
import pandas as pd
pd.read_csv('/tmp/customer_reviews_1999.csv').head()

Unnamed: 0,ATVPDKIKX0DER,1999-01-01,1,0,0.1,0385322135,The Voice on the Radio,562396,Book,Teens,Literature & Fiction,"{0440220653,0590988492,0590474782,0440219817,0590457403}"
0,ATVPDKIKX0DER,1999-01-01,1,0,0,0385484518,Tuesdays with Morrie,757,Book,Biographies & Memoirs,General,"{0786868716,0385318790,0385504209,0671027360,B..."
1,ATVPDKIKX0DER,1999-01-01,1,0,0,0395732611,Hatchet,1432267,Book,Children's Books,Literature,{}
2,ATVPDKIKX0DER,1999-01-01,1,0,0,0425147363,Tom Clancy's Op-Center (Tom Clancy's Op Center...,19188,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
3,ATVPDKIKX0DER,1999-01-01,1,0,0,0425151875,Tom Clancy's Op-Center,76807,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
4,ATVPDKIKX0DER,1999-01-01,1,0,0,042515601X,Acts of War (Tom Clancy's Op Center (Paperback)),165249,Book,Mystery & Thrillers,Thrillers,"{0425165566,0425168220,0425174808,0425180050,0..."
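The preview above promotes the first data row to column headers because the CSV files have no header row. A minimal sketch of reading them with explicit column names follows. The names are taken from the table definition in STEP 1, and the two inline rows (adapted from the preview) stand in for `/tmp/customer_reviews_1999.csv` so the snippet runs on its own.

```python
import io

import pandas as pd

# Column names mirror the customer_reviews_row table defined in STEP 1.
COLUMNS = [
    "customer_id", "review_date", "review_rating", "review_votes",
    "review_helpful_votes", "product_id", "product_title",
    "product_sales_rank", "product_group", "product_category",
    "product_subcategory", "similar_product_ids",
]

# Two rows adapted from the preview above; they stand in for the real file.
sample_csv = io.StringIO(
    'ATVPDKIKX0DER,1999-01-01,1,0,0,0385322135,The Voice on the Radio,'
    '562396,Book,Teens,Literature & Fiction,'
    '"{0440220653,0590988492,0590474782,0440219817,0590457403}"\n'
    "ATVPDKIKX0DER,1999-01-01,1,0,0,0395732611,Hatchet,"
    "1432267,Book,Children's Books,Literature,{}\n"
)

# header=None stops pandas from treating the first row as a header;
# reading product_id as str keeps leading zeros such as "0385322135".
reviews = pd.read_csv(
    sample_csv,
    header=None,
    names=COLUMNS,
    dtype={"product_id": str},
    parse_dates=["review_date"],
)
print(reviews[["review_date", "product_id", "product_title"]])
```

For the real file, replace `sample_csv` with the path in `/tmp/`.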


## STEP 1 : Create a table with normal (row) storage & load data

**TODO:** Create a table called customer_reviews_row with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [10]:
%%sql
DROP TABLE IF EXISTS customer_reviews_row;
CREATE TABLE customer_reviews_row 
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate the `customer_reviews_row` table with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [12]:
%%sql 
COPY customer_reviews_row FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_row FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]
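`COPY` reports 589859 and 1172645 rows, 1762504 in total. As a quick sanity check, the records in each file can be counted independently. A minimal sketch using the standard `csv` module, which handles quoted fields that contain commas or line breaks, is below. The helper `count_csv_records` is hypothetical, the UTF-8 encoding is an assumption, and the temporary file merely stands in for the real files in `/tmp/`.

```python
import csv
import os
import tempfile


def count_csv_records(path):
    """Count logical CSV records (not physical lines) in a file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.reader(handle))


# Stand-in file with two records; the second has a quoted multi-line field.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write('A1,1999-01-01,5,"Title, with comma"\n')
    tmp.write('A2,1999-01-02,4,"Title with\nline break"\n')
    sample_path = tmp.name

record_count = count_csv_records(sample_path)
print(record_count)
os.remove(sample_path)
```

Counting records rather than lines matters here because a quoted field may span several physical lines.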

## STEP 2 : Create a table with columnar storage & load data

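First, load the extension to use columnar storage in Postgres. The cell below uses a plain `CREATE EXTENSION`, so it fails with an "already exists" error when the extension was loaded in an earlier run, as in the output recorded here. Writing `CREATE EXTENSION IF NOT EXISTS cstore_fdw;` avoids that error on reruns.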

In [13]:
%%sql

-- load extension first time after install
CREATE EXTENSION cstore_fdw;

-- create server object
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

 * postgresql://student:***@127.0.0.1:5432/reviews
(psycopg2.ProgrammingError) extension "cstore_fdw" already exists
 [SQL: '-- load extension first time after install\nCREATE EXTENSION cstore_fdw;']


**TODO:** Create a `FOREIGN TABLE` called `customer_reviews_col` with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [14]:
%%sql
-- create foreign table
DROP FOREIGN TABLE IF EXISTS customer_reviews_col;

CREATE FOREIGN TABLE customer_reviews_col
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)
SERVER cstore_server
OPTIONS(compression 'pglz');

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate the `customer_reviews_col` table with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [15]:
%%sql 
COPY customer_reviews_col FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_col FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]

## STEP 3 : Compare performance

Now run the same query on the two tables and compare the run time. Which form of storage is more performant?
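The `%%time` magic used in this step measures a single execution, which can be skewed by caching and other one-off effects. A minimal sketch of a more robust comparison repeats each call and reports the median wall time. The helper `median_runtime_ms` is hypothetical, and the two stand-in workloads take the place of the SQL queries.

```python
import statistics
import time


def median_runtime_ms(func, repeats=5):
    """Run func() `repeats` times and return the median wall time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workloads; in the notebook these would execute the query
# against customer_reviews_row and customer_reviews_col respectively.
def slow_workload():
    sum(i * i for i in range(200_000))


def fast_workload():
    sum(i * i for i in range(2_000))


slow_ms = median_runtime_ms(slow_workload)
fast_ms = median_runtime_ms(fast_workload)
print(f"slow: {slow_ms:.2f} ms, fast: {fast_ms:.2f} ms")
```

The median is less sensitive than the mean to a single slow first run, for example one that has to read data from disk.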

**TODO**: Write a query that calculates the average `review_rating` by `product_title` for all reviews in 1995. Sort the data by `review_rating` in descending order. Limit the results to 20.

First run the query on `customer_reviews_row`:

In [27]:
%%time
%%sql

SELECT product_title, AVG(review_rating)
FROM customer_reviews_row
WHERE review_date >= '1995-01-01'  
      AND review_date <= '1995-12-31'
GROUP BY product_title
ORDER BY product_title
LIMIT 20

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 4.73 ms, sys: 86 µs, total: 4.81 ms
Wall time: 361 ms


product_title,avg
22 Immutable Laws of Marketing,4.0
8051 Microcontroller,4.0
99 Critical Shots in Pool,5.0
A Beginner's Guide to Constructing the Universe,5.0
A Civil Action (Vintage),5.0
A Darkness at Sethanon,5.0
A Fire Upon The Deep (Zones of Thought),3.5
A First Course in General Relativity,5.0
A Man on the Moon,4.0
A People's History of the United States,5.0


Then on `customer_reviews_col`:

In [28]:
%%time
%%sql

SELECT product_title, AVG(review_rating)
FROM customer_reviews_col
WHERE review_date >= '1995-01-01'  
      AND review_date <= '1995-12-31'
GROUP BY product_title
ORDER BY product_title
LIMIT 20

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 3.34 ms, sys: 0 ns, total: 3.34 ms
Wall time: 12.3 ms


product_title,avg
22 Immutable Laws of Marketing,4.0
8051 Microcontroller,4.0
99 Critical Shots in Pool,5.0
A Beginner's Guide to Constructing the Universe,5.0
A Civil Action (Vintage),5.0
A Darkness at Sethanon,5.0
A Fire Upon The Deep (Zones of Thought),3.5
A First Course in General Relativity,5.0
A Man on the Moon,4.0
A People's History of the United States,5.0
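Both queries above sort by `product_title`, whereas the TODO asks for the highest average rating first. A minimal sketch of the ordering the TODO describes follows. It uses an in-memory SQLite table with made-up rows as a stand-in for the review tables; SQLite is a substitution for illustration only, not the notebook's PostgreSQL setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (product_title TEXT, review_date TEXT, "
    "review_rating INTEGER)"
)
# Made-up sample rows, not taken from the real data set.
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("Book A", "1995-03-01", 5),
        ("Book A", "1995-07-15", 4),
        ("Book B", "1995-05-20", 3),
        ("Book C", "1995-11-02", 5),
        ("Book C", "1996-01-10", 1),  # outside 1995, filtered out
    ],
)

# Average rating per title for 1995, highest average first;
# product_title breaks ties so the order is deterministic.
query = """
SELECT product_title, AVG(review_rating) AS avg_rating
FROM reviews
WHERE review_date >= '1995-01-01' AND review_date <= '1995-12-31'
GROUP BY product_title
ORDER BY avg_rating DESC, product_title
LIMIT 20
"""
top_titles = conn.execute(query).fetchall()
print(top_titles)
conn.close()
```

On the PostgreSQL tables the same `ORDER BY avg_rating DESC` clause applies, with `customer_reviews_row` or `customer_reviews_col` in the `FROM` clause.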


## Conclusion: For this query, columnar storage is faster (12.3 ms vs. 361 ms wall time, roughly 29x)
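One way to see why the columnar table can win on this query: the query touches only `product_title`, `review_date`, and `review_rating`, three of the twelve columns. A toy sketch with made-up records contrasts the two layouts in plain Python. It illustrates the general idea only and does not model how cstore_fdw stores data internally.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"customer_id": "C1", "product_title": "Book A", "review_rating": 5,
     "product_group": "Book"},
    {"customer_id": "C2", "product_title": "Book A", "review_rating": 4,
     "product_group": "Book"},
    {"customer_id": "C3", "product_title": "Book B", "review_rating": 3,
     "product_group": "Book"},
]

# Column layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A row scan visits every field of every record, even unused ones.
fields_touched_row = sum(len(row) for row in rows)

# A column scan reads only the columns the query names.
needed = ["product_title", "review_rating"]
fields_touched_col = sum(len(columns[name]) for name in needed)

avg_rating = sum(columns["review_rating"]) / len(columns["review_rating"])
print(fields_touched_row, fields_touched_col, avg_rating)
```

In this toy case the column scan touches half as many values as the row scan. The saving grows with the number of columns a query does not need.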