# Exercise 03 - Columnar Vs Row Storage

- The columnar storage extension used here:
    - cstore_fdw by citus_data [https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)
- The data tables are the ones citus_data uses to demonstrate the storage extension
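
As a rough illustration of the two layouts this exercise compares, the sketch below holds the same three made-up records once row by row and once column by column. The records and values are hypothetical; only the field names come from the dataset. It is a toy model of the idea, not of how cstore_fdw stores data on disk.

```python
# Three made-up review records; the field names mirror the dataset's columns.
rows = [
    {"customer_id": "A1", "review_rating": 5, "product_title": "Dune"},
    {"customer_id": "A2", "review_rating": 3, "product_title": "Dune"},
    {"customer_id": "A3", "review_rating": 4, "product_title": "Footfall"},
]

# Row layout: each record is kept together, so reading one field
# still means walking over whole records.
row_avg = sum(r["review_rating"] for r in rows) / len(rows)

# Column layout: each field is kept as its own sequence, so an
# aggregate over one field touches only that sequence.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_avg = sum(columns["review_rating"]) / len(columns["review_rating"])

print(row_avg, col_avg)
```

Both layouts give the same answer; the difference is how much unrelated data has to be read to get there.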


In [1]:
%load_ext sql

## STEP 0 : Connect to the local database where the customer reviews data is loaded

### Create the database

In [2]:
!sudo -u postgres psql -c 'CREATE DATABASE reviews;'

!wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
!wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

!gzip -d customer_reviews_1998.csv.gz 
!gzip -d customer_reviews_1999.csv.gz 

!mv customer_reviews_1998.csv /tmp/customer_reviews_1998.csv
!mv customer_reviews_1999.csv /tmp/customer_reviews_1999.csv

CREATE DATABASE
--2021-02-15 21:27:20--  http://examples.citusdata.com/customer_reviews_1998.csv.gz
Resolving examples.citusdata.com (examples.citusdata.com)... 172.67.73.2, 104.26.14.56, 104.26.15.56, ...
Connecting to examples.citusdata.com (examples.citusdata.com)|172.67.73.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://examples.citusdata.com/customer_reviews_1998.csv.gz [following]
--2021-02-15 21:27:20--  https://examples.citusdata.com/customer_reviews_1998.csv.gz
Connecting to examples.citusdata.com (examples.citusdata.com)|172.67.73.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24774482 (24M) [application/x-gzip]
Saving to: ‘customer_reviews_1998.csv.gz’


2021-02-15 21:27:21 (53.7 MB/s) - ‘customer_reviews_1998.csv.gz’ saved [24774482/24774482]

URL transformed to HTTPS due to an HSTS policy
--2021-02-15 21:27:22--  https://examples.citusdata.com/customer_reviews_1999.csv.gz
Resolving examples.ci

### Connect to the database

In [3]:
DB_ENDPOINT = "127.0.0.1"
DB = 'reviews'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)

postgresql://student:student@127.0.0.1:5432/reviews
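
The format string above works because `student` contains no characters that are special in a URL. A minimal sketch of a more defensive variant follows; `build_conn_string` is a hypothetical helper, not part of the exercise, and it percent-encodes the user name and password with `urllib.parse.quote_plus`.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Same values as in the cell above.
print(build_conn_string("student", "student", "127.0.0.1", "5432", "reviews"))
```

With the exercise's credentials this prints the same string as above; the encoding only matters once a password contains characters such as `@` or `/`.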


In [4]:
%sql $conn_string

'Connected: student@reviews'

## STEP 1 : Create a table with normal (row) storage & load data

**TODO:** Create a table called customer_reviews_row with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.
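
Before writing the table definition it can help to look at one line of the file and count its fields. The sketch below parses a hypothetical sample line with Python's `csv` module; the values are made up, and the assumption that the files have no header row and store the array column as a quoted `{...}` literal is not confirmed here.

```python
import csv
import io

# Hypothetical sample line laid out like the 12 columns of the table;
# the values are made up and the real files may differ in detail.
sample = (
    "A27T7HVDXA3K2A,1998-04-10,5,1,1,044100590X,Dune,100,"
    'Book,Fiction,Science Fiction,"{0441172717,1559949570}"\n'
)

columns = [
    "customer_id", "review_date", "review_rating", "review_votes",
    "review_helpful_votes", "product_id", "product_title",
    "product_sales_rank", "product_group", "product_category",
    "product_subcategory", "similar_product_ids",
]

# csv.reader handles the quoted array literal, which contains a comma.
fields = next(csv.reader(io.StringIO(sample)))
record = dict(zip(columns, fields))

print(len(fields))
print(record["similar_product_ids"])
```

To inspect the real data, `io.StringIO(sample)` could be replaced with `open("/tmp/customer_reviews_1998.csv", newline="")`.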

In [5]:
%%sql
DROP TABLE IF EXISTS customer_reviews_row;
CREATE TABLE customer_reviews_row
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate the table with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [6]:
%%sql 
COPY customer_reviews_row FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_row FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]
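
The two `COPY` statements differ only in the year, so they can also be generated. A minimal sketch, assuming the files keep the `customer_reviews_<year>.csv` naming; `copy_statements` is a hypothetical helper introduced for illustration.

```python
def copy_statements(table, years, directory="/tmp"):
    """Return one COPY ... WITH CSV statement per yearly file."""
    return [
        "COPY {} FROM '{}/customer_reviews_{}.csv' WITH CSV;".format(
            table, directory, year
        )
        for year in years
    ]


for statement in copy_statements("customer_reviews_row", [1998, 1999]):
    print(statement)
```

Passing `"customer_reviews_col"` as the table name yields the statements for the columnar table.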

## STEP 2 : Create a table with columnar storage & load data

First, load the extension to use columnar storage in Postgres.

In [7]:
%%sql

-- load extension first time after install
CREATE EXTENSION cstore_fdw;

-- create server object
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Create a `FOREIGN TABLE` called `customer_reviews_col` with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [8]:
%%sql
-- create foreign table
DROP FOREIGN TABLE IF EXISTS customer_reviews_col;

-------------
CREATE FOREIGN TABLE customer_reviews_col
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)


-------------
-- leave code below as is
SERVER cstore_server
OPTIONS(compression 'pglz');

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate the table with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [9]:
%%sql 
COPY customer_reviews_col FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_col FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]

## STEP 3 : Compare performance

Now run the same query on the two tables and compare the run time. Which form of storage is more performant?
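
A single `%%time` measurement can be noisy, for example because of caching. One way to get a steadier number is to repeat the call and take the median. The sketch below assumes the query is wrapped in a Python callable; `median_wall_time` is a hypothetical helper and the workload shown is only a stand-in.

```python
import statistics
import time


def median_wall_time(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; in the notebook this would be a function that runs the query.
elapsed = median_wall_time(lambda: sum(range(100_000)))
print("median wall time: {:.6f} s".format(elapsed))
```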

**TODO**: Write a query that calculates the average `review_rating` by `product_title` for all reviews in 1995. Sort the data by `review_rating` in descending order. Limit the results to 20.
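
The aggregation this TODO describes can be sketched in plain Python on a few made-up rows: filter to 1995, group by title, average the ratings, sort in descending order, and keep the first 20. The data below is hypothetical and only illustrates the logic.

```python
from collections import defaultdict
from datetime import date

# Made-up reviews: (product_title, review_date, review_rating).
reviews = [
    ("Dune", date(1995, 3, 1), 5),
    ("Dune", date(1995, 7, 9), 4),
    ("Footfall", date(1995, 5, 2), 5),
    ("Titanic", date(1998, 1, 1), 1),  # outside 1995, filtered out
]

# WHERE: keep 1995 only. GROUP BY: collect ratings per title.
ratings_by_title = defaultdict(list)
for title, review_date, rating in reviews:
    if review_date.year == 1995:
        ratings_by_title[title].append(rating)

# avg(), then ORDER BY the average DESC and LIMIT 20.
averages = {t: sum(r) / len(r) for t, r in ratings_by_title.items()}
top = sorted(averages.items(), key=lambda item: item[1], reverse=True)[:20]

print(top)
```

If `review_rating` were part of the grouping key as well, every group would hold a single rating value, so its average would simply equal that rating.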

First run the query on `customer_reviews_row`:

In [18]:
%%time
%%sql

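-- Note: review_rating is part of the grouping key, so avg(review_rating)
-- within each group equals that group's rating.
SELECT product_title, avg(review_rating) AS average_rating
FROM customer_reviews_row
WHERE EXTRACT(YEAR FROM review_date) = 1995
GROUP BY review_rating, product_title
ORDER BY review_rating DESC
LIMIT 20;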

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 4.31 ms, sys: 75 µs, total: 4.39 ms
Wall time: 588 ms


product_title,average_rating
Data Link Protocols,5.0
Literature Guide,5.0
The Griffin & Sabine Trilogy Boxed Set,5.0
A Skiff for All Seasons,5.0
Titanic,5.0
"How to Write a Damn Good Novel, II",5.0
The Art of Hugging,5.0
Serenity,5.0
Footfall,5.0
Native Texas Plants,5.0


Then on `customer_reviews_col`:

In [20]:
%%time
%%sql

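-- Note: review_rating is part of the grouping key, so avg(review_rating)
-- within each group equals that group's rating.
SELECT product_title, avg(review_rating) AS average_rating
FROM customer_reviews_col
WHERE EXTRACT(YEAR FROM review_date) = 1995
GROUP BY review_rating, product_title
ORDER BY review_rating DESC
LIMIT 20;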

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 3.93 ms, sys: 491 µs, total: 4.42 ms
Wall time: 578 ms


product_title,average_rating
Data Link Protocols,5.0
Literature Guide,5.0
The Griffin & Sabine Trilogy Boxed Set,5.0
A Skiff for All Seasons,5.0
Titanic,5.0
"How to Write a Damn Good Novel, II",5.0
The Art of Hugging,5.0
Serenity,5.0
Footfall,5.0
Native Texas Plants,5.0


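## Conclusion: Columnar storage is faster

For the query above the difference is marginal (578 ms vs. 588 ms wall time). The queries below show a clearer gap in favor of `customer_reviews_col`: 173 ms vs. 455 ms for the filtered lookup and 513 ms vs. 794 ms for the multi-year average. Each figure comes from a single run.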

In [21]:
%%time
%%sql
SELECT
    customer_id, review_date, review_rating, product_id, product_title
FROM
    customer_reviews_row
WHERE
    customer_id ='A27T7HVDXA3K2A' AND
    product_title LIKE '%Dune%' AND
    review_date >= '1998-01-01' AND
    review_date <= '1998-12-31';

 * postgresql://student:***@127.0.0.1:5432/reviews
5 rows affected.
CPU times: user 2.1 ms, sys: 3.2 ms, total: 5.3 ms
Wall time: 455 ms


customer_id,review_date,review_rating,product_id,product_title
A27T7HVDXA3K2A,1998-04-10,5,0399128964,Dune (Dune Chronicles (Econo-Clad Hardcover))
A27T7HVDXA3K2A,1998-04-10,5,044100590X,Dune
A27T7HVDXA3K2A,1998-04-10,5,0441172717,"Dune (Dune Chronicles, Book 1)"
A27T7HVDXA3K2A,1998-04-10,5,0881036366,Dune (Dune Chronicles (Econo-Clad Hardcover))
A27T7HVDXA3K2A,1998-04-10,5,1559949570,Dune Audio Collection


In [22]:
%%time
%%sql
SELECT
    customer_id, review_date, review_rating, product_id, product_title
FROM
    customer_reviews_col
WHERE
    customer_id ='A27T7HVDXA3K2A' AND
    product_title LIKE '%Dune%' AND
    review_date >= '1998-01-01' AND
    review_date <= '1998-12-31';

 * postgresql://student:***@127.0.0.1:5432/reviews
5 rows affected.
CPU times: user 3.51 ms, sys: 626 µs, total: 4.14 ms
Wall time: 173 ms


customer_id,review_date,review_rating,product_id,product_title
A27T7HVDXA3K2A,1998-04-10,5,0399128964,Dune (Dune Chronicles (Econo-Clad Hardcover))
A27T7HVDXA3K2A,1998-04-10,5,044100590X,Dune
A27T7HVDXA3K2A,1998-04-10,5,0441172717,"Dune (Dune Chronicles, Book 1)"
A27T7HVDXA3K2A,1998-04-10,5,0881036366,Dune (Dune Chronicles (Econo-Clad Hardcover))
A27T7HVDXA3K2A,1998-04-10,5,1559949570,Dune Audio Collection


In [23]:
%%time
%%sql
SELECT product_title, avg(review_rating)
FROM customer_reviews_col
WHERE review_date >= '1995-01-01'
    AND review_date <= '1998-12-31'
GROUP BY product_title
ORDER BY product_title
LIMIT 20;

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 4.39 ms, sys: 94 µs, total: 4.48 ms
Wall time: 513 ms


product_title,avg
!Yo!,4.75
# 1's,4.268292682926829
#1 Record/Radio City,5.0
"#1 Soul Hits Of The 60's, Vol. 3",5.0
#1's,4.240963855421687
'58 Miles Featuring Stella by Starlight,5.0
'Bout It,3.0
'Round Midnight,5.0
'Salem's Lot,4.633333333333333
'The Moon by Whale Light,4.25


In [24]:
%%time
%%sql
SELECT product_title, avg(review_rating)
FROM customer_reviews_row
WHERE review_date >= '1995-01-01'
    AND review_date <= '1998-12-31'
GROUP BY product_title
ORDER BY product_title
LIMIT 20;

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 33 µs, sys: 3.88 ms, total: 3.91 ms
Wall time: 794 ms


product_title,avg
!Yo!,4.75
# 1's,4.268292682926829
#1 Record/Radio City,5.0
"#1 Soul Hits Of The 60's, Vol. 3",5.0
#1's,4.240963855421687
'58 Miles Featuring Stella by Starlight,5.0
'Bout It,3.0
'Round Midnight,5.0
'Salem's Lot,4.633333333333333
'The Moon by Whale Light,4.25
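
The wall times reported by `%%time` in the cells above can be turned into speedup ratios (row time divided by columnar time). The numbers are copied from this notebook's output; the query labels are informal names chosen here for illustration.

```python
# Wall times in milliseconds as reported by %%time in the cells above.
wall_ms = {
    "1995 average rating": {"row": 588, "col": 578},
    "Dune reviews by one customer": {"row": 455, "col": 173},
    "1995-1998 average rating": {"row": 794, "col": 513},
}

# A ratio above 1 means the columnar table answered faster.
speedups = {name: t["row"] / t["col"] for name, t in wall_ms.items()}

for name, ratio in speedups.items():
    print("{}: {:.2f}x".format(name, ratio))
```

Since each timing is a single run, these ratios are indicative rather than a benchmark.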
