# Greenplum Demo (Part 2)

### This is Part 2 of the Greenplum Demo. If you missed Part 1 or wish to repeat it, click [here](GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

CONNECTION_STRING = os.getenv('GPDBCONN')

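# Raw string avoids invalid-escape warnings; fail early if GPDBCONN is unset or malformed.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING or '')
if cs is None:
    raise ValueError('GPDBCONN must look like postgresql://user:password@host:port/dbname')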

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)
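
The regular expression above expects a URL of the form `postgresql://user:password@host:port/dbname`. As a hedged alternative, the same fields can be extracted with the standard library's `urllib.parse.urlsplit`. The helper name `parse_conn` and the sample URL below are illustrative assumptions, not part of the demo.

```python
from urllib.parse import urlsplit

def parse_conn(url):
    """Split a postgresql:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// URL")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                # int, or None when absent
        "dbname": parts.path.lstrip("/"),
    }

# Sample values only; not the demo's real credentials.
info = parse_conn("postgresql://gpadmin:secret@localhost:5432/gpadmin")
print(info["user"], info["host"], info["port"], info["dbname"])
```

Unlike the regex, this returns the port as an integer and tolerates a missing port or password.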

In [2]:
%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'

## Step 4. Familiarize yourself with the newly loaded data table

### 1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [61]:
!cat script/4-1-psql-describe-amzn-reviews.sql

\d demo.amzn_reviews


In [62]:
psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING
display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 2. DESCRIBE *demo.amzn_reviews* table using `information_schema` catalog table.

In [63]:
!cat script/4-2-gp-describe-amzn-reviews.sql

SELECT * 
FROM information_schema.COLUMNS 
WHERE TABLE_NAME = 'amzn_reviews';


In [64]:
query = !cat script/4-2-gp-describe-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [65]:
!cat script/4-3-select-sample-amzn-reviews.sql

SELECT * 
FROM demo.amzn_reviews 
ORDER BY RANDOM() 
LIMIT 10;


In [None]:
query = !cat script/4-3-select-sample-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

### 4. Show *demo.amzn_reviews* table data distribution across segments:

In [26]:
!cat script/4-4-data-distrib-amzn-reviews.sql

SELECT gp_segment_id, 
  count(*) 
FROM
  demo.amzn_reviews
GROUP BY 1 
ORDER BY 1;


In [28]:
query = !cat script/4-4-data-distrib-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


gp_segment_id,count
0,1726796
1,1726368


### 5. Check *demo.amzn_reviews* table object size and disk space

### 5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [76]:
!cat script/4-5-1-object-size-and-disk-space.sql

SELECT 'demo' AS schemaname, 'amzn_reviews' AS tablename, pg_size_pretty(pg_relation_size('demo.amzn_reviews')) AS size, 'Disk space used by the table or index.' AS level 
UNION ALL 
SELECT 'demo' AS schemaname, 'amzn_reviews' AS tablename, pg_size_pretty(pg_total_relation_size('demo.amzn_reviews')) AS size, 'Total disk space used by the table, including indexes and toasted data.' AS level;


In [77]:
query = !cat script/4-5-1-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,1872 MB,Disk space used by the table or index.
demo,amzn_reviews,1875 MB,"Total disk space used by the table, including indexes and toasted data."


In [78]:
!cat script/4-5-2-object-size-and-disk-space.sql

SELECT gp_size_of_table_disk.schemaname,
	gp_size_of_table_disk.tablename, 
	tabledisksize, 
	uncompressedsize, 
	tablesize, 
	indexsize, 
	toastsize, 
	othersize 
FROM (
	SELECT sotd.sotdoid as oid, 
		sotd.sotdschemaname as schemaname, 
		sotd.sotdtablename as tablename, 
		pg_size_pretty(sotd.sotdsize::BIGINT) as tablesize, 
		pg_size_pretty(sotd.sotdtoastsize::BIGINT) as toastsize, 
		pg_size_pretty(sotd.sotdadditionalsize::BIGINT) as othersize
	FROM
		gp_toolkit.gp_size_of_table_disk as sotd
	WHERE
		sotd.sotdschemaname || '.' || sotd.sotdtablename = 'demo.amzn_reviews') gp_size_of_table_disk, (
	SELECT sotaid.sotaidoid as oid, 
		sotaid.sotaidschemaname as schemaname, 
		sotaid.sotaidtablename as tablename, 
		pg_size_pretty(sotaid.sotaidtablesize::BIGINT) as tabledisksize, 
		pg_size_pretty(sotaid.sotaididxsize::BIGINT) as indexsize
	FROM
		gp_toolkit.gp_size_of_table_and_indexes_disk as sotaid 
	WHERE
		sotaid.sotaidschemaname || '.' || sotaid.sotaidt

In [79]:
query = !cat script/4-5-2-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,tabledisksize,uncompressedsize,tablesize,indexsize,toastsize,othersize
demo,amzn_reviews,1875 MB,1875 MB,1872 MB,0 bytes,3776 kB,0 bytes


### 6. Check table for Data Skew
Data skew may be caused by uneven data distribution due to a poor choice of distribution key, or by single-tuple insert or copy operations. Data skew at the table level is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects not only scan (read) performance but also all other query execution operations, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query shows the maximum and minimum number of rows per segment, and the percentage difference between them:

In [83]:
!cat script/4-6-1-data-skew.sql

SELECT
	'Example Table' AS "Table Name", 
	max(c) AS "Max Seg Rows", min(c) AS "Min Seg Rows", 
	(max(c)-min(c))*100.0/max(c) AS "Percentage Difference Between Max & Min" 
FROM (
	SELECT
		count(*) c, 
		gp_segment_id 
	FROM
		demo.amzn_reviews
	GROUP BY 2
) AS a;


In [84]:
query = !cat script/4-6-1-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
Example Table,1726796,1726368,0.0247857882459769
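
As a quick cross-check, the percentage can be recomputed in Python from the per-segment counts returned by the distribution query earlier in this step. This is a minimal sketch; the variable names are illustrative.

```python
# Per-segment row counts copied from the distribution query output above.
seg_rows = {0: 1726796, 1: 1726368}

max_rows = max(seg_rows.values())
min_rows = min(seg_rows.values())

# Same formula as the SQL: (max - min) * 100.0 / max
pct_diff = (max_rows - min_rows) * 100.0 / max_rows
print(f"max={max_rows} min={min_rows} diff={pct_diff:.6f}%")
```

A difference of roughly 0.02% means the two segments hold almost the same number of rows.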


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it takes into account both the average and the variability around it. The lower the value, the better; higher values indicate greater data skew.

In [89]:
!cat script/4-6-2-data-skew.sql

SELECT
	skcc.skcnamespace as schemaname,
	skcc.skcrelname as tablename, 
	skcc.skccoeff as coefficient 
FROM
	gp_toolkit.gp_skew_coefficients skcc 
WHERE
	skcc.skcnamespace || '.' || skcc.skcrelname = 'demo.amzn_reviews';


In [90]:
query = !cat script/4-6-2-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0175283712182706
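
The coefficient of variation can be sketched in Python from the per-segment row counts shown earlier. The choice of the sample standard deviation and the scaling by 100 are assumptions: they reproduce the value above for this table, but the view's exact definition should be confirmed in the Greenplum documentation.

```python
from statistics import mean, stdev

# Per-segment row counts from the distribution query earlier in this step.
seg_rows = [1726796, 1726368]

# Coefficient of variation: standard deviation divided by the average.
# Assumption: sample standard deviation, expressed as a percentage.
cv_percent = stdev(seg_rows) / mean(seg_rows) * 100
print(cv_percent)
```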


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. The `siffraction` column holds this value. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

In [92]:
!cat script/4-6-3-data-skew.sql

SELECT
	sif.sifnamespace as schemaname,
	sif.sifrelname as tablename, 
	sif.siffraction as fraction 
FROM
	gp_toolkit.gp_skew_idle_fractions sif 
WHERE
	sif.sifnamespace || '.' || sif.sifrelname = 'demo.amzn_reviews'


In [93]:
query = !cat script/4-6-3-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0001239289412298
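
One way to read the idle fraction: during a scan, every segment waits for the one holding the most rows, so the idle share is one minus the ratio of the average to the maximum segment row count. The sketch below applies that formula to the per-segment counts shown earlier. The formula is an assumption that happens to reproduce the value above, not a quoted definition of the view.

```python
# Per-segment row counts from the distribution query earlier in this step.
seg_rows = [1726796, 1726368]

avg_rows = sum(seg_rows) / len(seg_rows)

# Fraction of the scan during which an average segment has nothing left to do.
idle_fraction = 1 - avg_rows / max(seg_rows)
print(idle_fraction)
```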


## Step 5. Partitioning

### 1. Create a new copy of the original table, define a *PARTITION* pattern (by month) and load it.

In [None]:
%%sql 

CREATE TABLE demo.amzn_reviews_v2(
  marketplace TEXT, 
  customer_id BIGINT, 
  review_id TEXT, 
  product_id TEXT, 
  product_parent BIGINT, 
  product_title TEXT, 
  product_category TEXT, 
  star_rating INTEGER, 
  helpful_votes INTEGER, 
  total_votes INTEGER, 
  vine TEXT, 
  verified_purchase TEXT, 
  review_headline TEXT, 
  review_body TEXT, 
  review_date DATE)
DISTRIBUTED BY (review_id)
PARTITION BY RANGE(review_date) 
(
    START ('1998-07-01'::date) END ('2015-09-01'::date)
    EVERY ('1 month'::interval)
);

INSERT INTO demo.amzn_reviews_v2
SELECT * FROM demo.amzn_reviews;
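
To get a feel for what the `START ... END ... EVERY ('1 month')` clause produces, the monthly boundaries can be enumerated in Python. This is a sketch that assumes Greenplum's default range semantics (start inclusive, end exclusive); the helper `month_starts` is hypothetical.

```python
from datetime import date

def month_starts(start, end):
    """Yield the first day of each month in [start, end); start must be day 1."""
    year, month = start.year, start.month
    current = start
    while current < end:
        yield current
        month += 1
        if month > 12:
            year, month = year + 1, 1
        current = date(year, month, 1)

bounds = list(month_starts(date(1998, 7, 1), date(2015, 9, 1)))
print(len(bounds), bounds[0], bounds[-1])
```

Each boundary corresponds to one child partition, so the number of entries in `bounds` is the number of monthly partitions the table definition should create.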

### 2. Show row count per partition for the new table.

In [None]:
%%sql
SELECT tableoid::regclass, count(*) FROM demo.amzn_reviews_v2 GROUP BY 1 ORDER BY 1;

### 3. Demonstrate *Partition Elimination* functionality

In [None]:
!psql -d gpadmin -U gpadmin -h 10.0.2.15 -f './scripts/explain_example_1_1.sql'

In [None]:
psql_out = !psql -H -d gpadmin -U gpadmin -h 10.0.2.15 -f './scripts/example_1_2.sql'

display_html(''.join(psql_out), raw=True)

In [None]:
%%sql
EXPLAIN
SELECT COUNT(*)
    , date_part('year', review_date::DATE) AS YEAR_NUM
    , date_part('month', review_date::DATE) AS MONTH_NUM
FROM demo.amzn_reviews_v2
GROUP BY 2, 3
ORDER BY 2, 3;

## Step 6. Compression

## Populate the three tables:
- Load heap table with gpload (gpload_h.yaml):

```yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/crimes_all.txt
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.fact_crimes_heap
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```

In [None]:
%%sql

DROP TABLE IF EXISTS demo.fact_crimes_heap;

CREATE TABLE demo.fact_crimes_heap
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
distributed by (id);

DROP TABLE IF EXISTS demo.fact_crimes_row_comp;

CREATE TABLE demo.fact_crimes_row_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3)
distributed by (id);


DROP TABLE IF EXISTS demo.fact_crimes_col_comp;

CREATE TABLE demo.fact_crimes_col_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=3)
distributed by (id);

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'gpload -d gpadmin -f /home/gpadmin/gpload_h.yaml > /home/gpadmin/gpload_h.log 2>&1'


**Note:** The heap table loaded the data from the source file in under 33 seconds. Heap and compressed tables have different load performance, so time each load separately.

- Load the **demo.fact_crimes_row_comp** table with data from the **heap** table above, and check the timing

In [None]:
%%sql
DELETE FROM demo.fact_crimes_row_comp;

`INSERT INTO demo.fact_crimes_row_comp SELECT * FROM demo.fact_crimes;`

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_row_comp.sql '

- Load the **demo.fact_crimes_col_comp** table with data from the **heap** table above, and check the timing

In [None]:
%%sql
DELETE FROM demo.fact_crimes_col_comp;

`INSERT INTO demo.fact_crimes_col_comp SELECT * FROM demo.fact_crimes;`

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_col_comp.sql '

## Check the size of each of the three tables:

In [None]:
%%sql
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_heap')) AS TABLESIZE, 'demo.fact_crimes_heap' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_row_comp')) AS TABLESIZE, 'demo.fact_crimes_row_comp' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_col_comp')) AS TABLESIZE, 'demo.fact_crimes_col_comp' AS TABLENAME;

**Notes:**
- The heap table has no compression. It is best for staging tables or when frequent updates/deletes are needed.
- The row-oriented table has the best compression. It is best for frequent inserts and `SELECT`s on all or most of the columns.
- The column-oriented table also compresses better than the heap table, but not better than the row-oriented table. It is best for static partitions/tables and `SELECT`s on fewer columns.
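
To compare the tables numerically rather than by eye, the raw byte counts from `pg_relation_size` (without `pg_size_pretty`) can be turned into a compression ratio. The helper below is a hypothetical sketch, and the byte values in the example call are placeholders, not measurements from this demo.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Return how many times smaller the compressed table is (hypothetical helper)."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_bytes / compressed_bytes

# Placeholder sizes in bytes; substitute the values returned by pg_relation_size.
ratio = compression_ratio(1_000_000_000, 250_000_000)
print(f"{ratio:.1f}x smaller than the heap table")
```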

In [None]:
# Step 7. EXPLAIN plans and statistics

In [None]:
%%sql
EXPLAIN SELECT location_desc
	, count(case_number)
FROM
	demo.fact_crimes
WHERE
	crime_date >= '2014-01-01'
	AND crime_date <= '2014-12-31'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'



**Notes:**
- Copy the `EXPLAIN` plan created above and paste it into http://planchecker.cfapps.io/
- The Planchecker app will provide recommendation(s) about collecting statistics. Highlight this as a recommendation that the database provides for optimization.
- Use the `ANALYZE` command to collect statistics for the optimizer; missing or stale statistics can generate bad plans.
- Use the `analyzedb` utility and schedule it to run frequently (for example, every day) to collect statistics only on tables/partitions that have changed since the last run.
- The same utility can also be easily stopped and resumed.
- There is no need for the DBA to explicitly look for different statistics collection policies for different types of tables/partitions.
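
A scheduled statistics run could be wrapped in a small Python helper that builds the `analyzedb` command line. This is a sketch under assumptions: the helper name is hypothetical, and the `-a` (no confirmation prompt), `-d` (database), `-t` (table) and `-s` (schema) options should be checked against `analyzedb --help` for the installed Greenplum version.

```python
import shlex

def analyzedb_command(dbname, table=None, schema=None):
    """Build an analyzedb argument list (hypothetical helper; verify flags locally)."""
    cmd = ["analyzedb", "-a", "-d", dbname]
    if table:
        cmd += ["-t", table]      # a single schema-qualified table
    elif schema:
        cmd += ["-s", schema]     # every table in the schema
    return cmd

cmd = analyzedb_command("gpadmin", table="demo.fact_crimes")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host where the Greenplum utilities are on the `PATH`, or the printed string can be placed in a cron entry.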

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -c "analyze demo.fact_crimes"'



In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'

