# Exercise 03 - Columnar Vs Row Storage

- The columnar storage extension used here:
    - cstore_fdw by Citus Data: [https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)
- The data tables are the ones Citus Data uses to demonstrate the storage extension


In [22]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


## STEP 0: Set up and connect to the local `reviews` database

### Create the database

In [None]:
!sudo -u postgres psql -c 'CREATE DATABASE reviews;'

### Download the data

In [1]:
!wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
!wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

!gzip -d customer_reviews_1998.csv.gz 
!gzip -d customer_reviews_1999.csv.gz 

!mv customer_reviews_1998.csv /tmp/customer_reviews_1998.csv
!mv customer_reviews_1999.csv /tmp/customer_reviews_1999.csv

--2020-02-01 20:16:24--  http://examples.citusdata.com/customer_reviews_1998.csv.gz
Resolving examples.citusdata.com (examples.citusdata.com)... 104.25.47.11, 104.25.46.11, 2606:4700:20::6819:2e0b, ...
Connecting to examples.citusdata.com (examples.citusdata.com)|104.25.47.11|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://examples.citusdata.com/customer_reviews_1998.csv.gz [following]
--2020-02-01 20:16:29--  https://examples.citusdata.com/customer_reviews_1998.csv.gz
Connecting to examples.citusdata.com (examples.citusdata.com)|104.25.47.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24774482 (24M) [application/x-gzip]
Saving to: ‘customer_reviews_1998.csv.gz’


2020-02-01 20:16:32 (8.02 MB/s) - ‘customer_reviews_1998.csv.gz’ saved [24774482/24774482]

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/claudiordgz/.wget-hsts'. H

In [8]:
!pwd

/home/claudiordgz/udacity/dend/notes/data_warehouses


### Connect to the database

In [23]:
DB_ENDPOINT = "127.0.0.1"
DB = 'reviews'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)

postgresql://student:student@127.0.0.1:5432/reviews
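
The connection string above works because `student` contains no characters that are special in a URL. As a minimal sketch, assuming the password might one day contain characters such as `@`, `:` or `/`, the user and password can be percent-encoded before formatting. The helper name `build_conn_string` is hypothetical and not part of the exercise.

```python
from urllib.parse import quote


def build_conn_string(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )


# Same values as the notebook cell above.
conn_string = build_conn_string("student", "student", "127.0.0.1", "5432", "reviews")
print(conn_string)

# A password with URL-special characters is encoded instead of breaking the URL.
print(build_conn_string("student", "p@ss:w/rd", "127.0.0.1", "5432", "reviews"))
```

For the credentials used in this exercise the result is identical to the string printed above.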


In [24]:
%sql $conn_string

'Connected: student@reviews'

## STEP 1: Create a table with normal (row) storage & load data

**TODO:** Create a table called customer_reviews_row with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [19]:
%%sql
DROP TABLE IF EXISTS customer_reviews_row;
CREATE TABLE customer_reviews_row
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
);

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate `customer_reviews_row` with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [20]:
%%sql 
COPY customer_reviews_row FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_row FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]
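
One way to sanity-check the row counts reported by `COPY` is to count the records in a CSV file independently. The sketch below is an assumption-laden illustration: it uses Python's `csv` module, whose parsing may differ from PostgreSQL's in edge cases, and the helper name `count_csv_records` is hypothetical. It counts records rather than lines, because a quoted field may contain a line break.

```python
import csv
import io


def count_csv_records(lines):
    """Count CSV records; quoted fields may span lines, so records != lines."""
    return sum(1 for _ in csv.reader(lines))


# Toy input (not from the dataset): three physical lines, two records.
sample = io.StringIO('a,"first line\nsecond line",1\nb,plain,2\n', newline="")
print(count_csv_records(sample))
```

For the real files, the same function could be called on a file opened with `open('/tmp/customer_reviews_1998.csv', newline='')`.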

## STEP 2: Create a table with columnar storage & load data

First, load the extension to use columnar storage in Postgres.

In [25]:
%%sql

-- load extension first time after install
CREATE EXTENSION cstore_fdw;

-- create server object
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Create a `FOREIGN TABLE` called `customer_reviews_col` with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [26]:
%%sql
-- create foreign table
DROP FOREIGN TABLE IF EXISTS customer_reviews_col;

-------------
CREATE FOREIGN TABLE customer_reviews_col
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)
--------------- leave code below as is
SERVER cstore_server
OPTIONS(compression 'pglz');

 * postgresql://student:***@127.0.0.1:5432/reviews
Done.
Done.


[]

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate `customer_reviews_col` with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [27]:
%%sql 
COPY customer_reviews_col FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_col FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

 * postgresql://student:***@127.0.0.1:5432/reviews
589859 rows affected.
1172645 rows affected.


[]

## STEP 3: Compare performance

Now run the same query on the two tables and compare the run time. Which form of storage is more performant?
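
A single timed run can be skewed by caching or other activity on the machine. As a hedged sketch, a small helper can repeat a call several times and report the median wall-clock time. The name `time_call` and the stand-in workload are hypothetical; in the notebook, the function passed in would be one that executes the query.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the median wall-clock time in seconds over several calls to fn."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; replace with a function that runs the query under test.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"median: {elapsed * 1000:.2f} ms")
```

The median is used here rather than the mean so that one unusually slow run does not dominate the result.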

**TODO**: Write a query that calculates the average `review_rating` by `product_title` for all reviews in 1995. Sort the data by `review_rating` in descending order. Limit the results to 20.

First run the query on `customer_reviews_row`:

In [34]:
%%time
%%sql

SELECT 
    avg(review_rating) as review_rating, product_title
FROM customer_reviews_row
WHERE 
    review_date >= '1995-01-01' AND
    review_date < '1996-01-01'  -- exclusive upper bound keeps the range within 1995
GROUP BY product_title
ORDER BY review_rating desc
LIMIT 20;

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 260 ms


review_rating,product_title
5.0,Act Like Nothing's Wrong
5.0,Albinus on Anatomy
5.0,Accidental Empires
5.0,A Civil Action (Vintage)
5.0,Acts of Kindness
5.0,Ain't Nobody's Business If You Do
5.0,A First Course in General Relativity
5.0,A Year in Provence (abridged)
5.0,A People's History of the United States
5.0,A People's History of the United States


Then run it on `customer_reviews_col`:

In [35]:
%%time
%%sql

SELECT 
    avg(review_rating) as review_rating, product_title
FROM customer_reviews_col
WHERE 
    review_date >= '1995-01-01' AND
    review_date < '1996-01-01'  -- exclusive upper bound keeps the range within 1995
GROUP BY product_title
ORDER BY review_rating desc
LIMIT 20;

 * postgresql://student:***@127.0.0.1:5432/reviews
20 rows affected.
CPU times: user 15.6 ms, sys: 0 ns, total: 15.6 ms
Wall time: 15.6 ms


review_rating,product_title
5.0,Christmas in America
5.0,Wizard's First Rule (Bookcassette(r) Edition)
5.0,Postmortem
5.0,Fingerprints of the Gods (Alternative History)
5.0,Constructing the Sexual Crucible
5.0,SOCIETY OF MIND
5.0,The Face on the Milk Carton
5.0,Post Captain (Aubrey-Maturin (Audio))
5.0,Wild Swans
5.0,The Doors (Special Edition)


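## Conclusion: We can see that the columnar storage is faster!

In this run, the query took about 260 ms of wall time on `customer_reviews_row` and about 15.6 ms on `customer_reviews_col`. The two result lists differ because many titles tie at an average rating of 5.0 and the query does not specify an order among ties.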
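
One intuition for the difference is that the query touches only three of the twelve columns (`review_date`, `review_rating`, `product_title`). The sketch below is a conceptual illustration in plain Python using hypothetical toy data. It does not model cstore_fdw's actual on-disk format or compression; it only shows that a column layout lets the same aggregation read just the columns it needs.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy rows (not from the dataset):
# (review_date, review_rating, product_title, product_group)
rows = [
    (date(1995, 3, 1), 5, "Book A", "Book"),
    (date(1995, 7, 9), 3, "Book A", "Book"),
    (date(1995, 12, 31), 4, "Book B", "Book"),
    (date(1996, 1, 1), 1, "Book B", "Book"),
]

# Column layout: one list per column, holding the same data.
columns = {
    "review_date": [r[0] for r in rows],
    "review_rating": [r[1] for r in rows],
    "product_title": [r[2] for r in rows],
    "product_group": [r[3] for r in rows],
}


def avg_rating_rows(rows, start, end):
    """Row layout: every full row is visited, including the unused field."""
    totals = defaultdict(lambda: [0, 0])  # title -> [sum, count]
    for review_date, rating, title, _unused_group in rows:
        if start <= review_date < end:
            totals[title][0] += rating
            totals[title][1] += 1
    return {title: s / n for title, (s, n) in totals.items()}


def avg_rating_columns(columns, start, end):
    """Column layout: only the three columns the query needs are read."""
    totals = defaultdict(lambda: [0, 0])  # title -> [sum, count]
    needed = zip(
        columns["review_date"], columns["review_rating"], columns["product_title"]
    )
    for review_date, rating, title in needed:
        if start <= review_date < end:
            totals[title][0] += rating
            totals[title][1] += 1
    return {title: s / n for title, (s, n) in totals.items()}


start, end = date(1995, 1, 1), date(1996, 1, 1)
by_rows = avg_rating_rows(rows, start, end)
by_columns = avg_rating_columns(columns, start, end)
print(by_rows == by_columns, by_columns)
```

Both functions return the same averages; the column version never touches `product_group`, which is the kind of saving that grows with the number of unused columns.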