# Greenplum Demo (Part 2)

### This is Part 2 of the Greenplum demo. If you missed Part 1 or want to repeat it, click [here](GP-demo-1.ipynb).

In [5]:
import os, re
from IPython.display import display_html

from pygments import highlight
from pygments.lexers import PostgresLexer
from pygments.formatters import HtmlFormatter

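# Connection URL, expected in the form postgresql://user:password@host:port/dbname
CONNECTION_STRING = os.getenv('GPDBCONN')
if CONNECTION_STRING is None:
    raise RuntimeError('The GPDBCONN environment variable is not set')

# Raw string so the regex escapes (\S, \/) are not treated as string escapes
cs = re.match(r'^postgresql:\/\/(\S+):(\S+)@(\S+):(\S+)\/(\S+)$', CONNECTION_STRING)
if cs is None:
    raise ValueError('GPDBCONN does not match postgresql://user:password@host:port/dbname')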

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)
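
As an alternative to the regular expression in the cell above, the standard library can split the connection URL into the same five parts. This is only a sketch and assumes Python 3 (`urllib.parse`); the helper name `parse_connection_url` and the example URL are hypothetical and not part of the demo.

```python
from urllib.parse import urlsplit  # Python 3 standard library


def parse_connection_url(url):
    """Split a postgresql:// URL into user, password, host, port and database name."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                 # int, or None when the URL has no port
        "dbname": parts.path.lstrip("/"),   # path is "/dbname"
    }


# Hypothetical URL for illustration; the notebook reads the real one from GPDBCONN.
example = parse_connection_url("postgresql://gpadmin:secret@localhost:5432/gpadmin")
print(example)
```

Unlike the regular expression, this version returns the port as an integer and tolerates a URL without an explicit port.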

In [6]:
%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'

## Step 4. Familiarize yourself with the newly loaded data table

### 1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [8]:
psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING
display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 2. DESCRIBE *demo.amzn_reviews* table using `information_schema` catalog table.

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [10]:
query = !cat script/4-2-gp-describe-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [11]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [None]:
query = !cat script/4-3-select-sample-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

### 4. Show *demo.amzn_reviews* table data distribution across segments:

In [12]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [13]:
query = !cat script/4-4-data-distrib-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


gp_segment_id,count
0,1726796
1,1726368


### 5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [14]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [15]:
query = !cat script/4-5-1-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,1872 MB,Disk space used by the table or index.
demo,amzn_reviews,1875 MB,"Total disk space used by the table, including indexes and toasted data."


#### 5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [16]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [17]:
query = !cat script/4-5-2-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,tabledisksize,uncompressedsize,tablesize,indexsize,toastsize,othersize
demo,amzn_reviews,1875 MB,1875 MB,1872 MB,0 bytes,3776 kB,0 bytes


### 6. Check table for Data Skew
Data skew may be caused by uneven data distribution due to a poor choice of distribution key, or by single-tuple insert or copy operations. Data skew at the table level is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects not only scan (read) performance but also all other query execution operations, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query shows the maximum and minimum number of rows per segment and the percentage difference between them:

In [18]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [19]:
query = !cat script/4-6-1-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
Example Table,1726796,1726368,0.0247857882459769
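
The percentage above can be checked by hand from the per-segment row counts reported earlier in this notebook. The SQL script itself is not reproduced here, so the formula in this sketch is an assumption: dividing the max-min gap by the maximum happens to reproduce the figure shown in the output.

```python
# Per-segment row counts taken from the data distribution output earlier in this notebook.
segment_rows = {0: 1726796, 1: 1726368}


def max_min_pct_difference(rows_per_segment):
    """Return (max, min, percentage difference relative to the max) of the row counts."""
    largest = max(rows_per_segment.values())
    smallest = min(rows_per_segment.values())
    # Assumed formula: gap between fullest and emptiest segment, as a percentage of the fullest.
    return largest, smallest, 100.0 * (largest - smallest) / largest


largest, smallest, pct = max_min_pct_difference(segment_rows)
print(largest, smallest, pct)
```

A difference of roughly 0.025% means the two segments hold almost the same number of rows.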


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.

In [20]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [21]:
query = !cat script/4-6-2-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0175283712182706
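
The coefficient can be reproduced from the per-segment row counts. The text defines it as the standard deviation divided by the average; the sketch below additionally assumes the sample standard deviation and a scaling by 100, because that combination matches the value reported for this table. The function name `skew_coefficient` is hypothetical.

```python
import statistics

# Per-segment row counts taken from the data distribution output earlier in this notebook.
segment_rows = [1726796, 1726368]


def skew_coefficient(rows_per_segment):
    """Coefficient of variation as a percentage: sample std deviation / mean * 100."""
    return 100.0 * statistics.stdev(rows_per_segment) / statistics.mean(rows_per_segment)


cv = skew_coefficient(segment_rows)
print(cv)
```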


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. The `siffraction` column holds this value: for example, 0.1 indicates 10% skew and 0.5 indicates 50% skew. Tables with more than 10% skew should have their distribution policies evaluated.

In [22]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [23]:
query = !cat script/4-6-3-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0001239289412298
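
One way to read the idle fraction: during a scan every segment waits for the fullest one, so the idle share is the gap between the average and the maximum row count, relative to the maximum. The sketch below assumes that formula (1 - average / maximum), which reproduces the value reported above, and applies the 10% rule of thumb from the text. Both function names and the `[900, 100]` example are hypothetical.

```python
# Per-segment row counts taken from the data distribution output earlier in this notebook.
segment_rows = [1726796, 1726368]


def idle_fraction(rows_per_segment):
    """Share of scan capacity left idle while the fullest segment finishes."""
    largest = max(rows_per_segment)
    mean = sum(rows_per_segment) / len(rows_per_segment)
    return (largest - mean) / largest  # equivalent to 1 - mean / max


def needs_distribution_review(rows_per_segment, threshold=0.10):
    """True when the idle fraction exceeds the threshold (10% by default)."""
    return idle_fraction(rows_per_segment) > threshold


fraction = idle_fraction(segment_rows)
print(fraction, needs_distribution_review(segment_rows))

# A hypothetical, badly skewed table: one segment holds 90% of the rows.
print(needs_distribution_review([900, 100]))
```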


## Step 5. Partitioning

Table partitioning enables supporting very large tables, such as fact tables, by logically dividing them into smaller, more manageable pieces. Partitioned tables can improve query performance by allowing the Greenplum Database query optimizer to scan only the data needed to satisfy a given query instead of scanning all the contents of a large table.

### 1. Create a new copy of the original table, define a *PARTITION* pattern (by year) and load it.

After you create the partitioned table structure, top-level parent tables are empty. Data is routed to the bottom-level child table partitions. In a multi-level partition design, only the subpartitions at the bottom of the hierarchy can contain data.

Rows that cannot be mapped to a child table partition are rejected and the load fails. To avoid unmapped rows being rejected at load time, define your partition hierarchy with a DEFAULT partition. Any rows that do not match a partition's CHECK constraints load into the DEFAULT partition.

At runtime, the query optimizer scans the entire table inheritance hierarchy and uses the CHECK table constraints to determine which of the child table partitions to scan to satisfy the query's conditions. The DEFAULT partition (if your hierarchy has one) is always scanned. DEFAULT partitions that contain data slow down the overall scan time.

When you use COPY or INSERT to load data into a parent table, the data is automatically rerouted to the correct partition, just like a regular table.
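
The routing rules described above can be modelled in a few lines of plain Python. This is an illustration of the behaviour, not of how Greenplum implements it: the partition dictionary covers only three years, the DEFAULT partition name `other` is hypothetical, and each range is treated as START-inclusive and END-exclusive.

```python
from datetime import date

# Yearly range partitions: START is inclusive, END is exclusive.
PARTITIONS = {
    "year1998": (date(1998, 1, 1), date(1999, 1, 1)),
    "year1999": (date(1999, 1, 1), date(2000, 1, 1)),
    "year2000": (date(2000, 1, 1), date(2001, 1, 1)),
}
DEFAULT_PARTITION = "other"  # hypothetical name for the DEFAULT partition


def route(row_date, partitions=PARTITIONS, default=DEFAULT_PARTITION):
    """Return the name of the child partition that should receive a row."""
    for name, (start, end) in partitions.items():
        if start <= row_date < end:
            return name
    if default is None:
        # Without a DEFAULT partition, an unmapped row is rejected and the load fails.
        raise ValueError("no partition accepts %s" % row_date)
    return default


print(route(date(1998, 6, 1)))   # falls inside the 1998 range
print(route(date(1999, 1, 1)))   # END is exclusive, so this belongs to 1999
print(route(date(1990, 1, 1)))   # unmapped, so it lands in the DEFAULT partition
```

Passing `default=None` models a hierarchy without a DEFAULT partition, where the same unmapped row raises an error instead.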

In [42]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-1-create-and-load-partition-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [43]:
query = !cat script/5-1-create-and-load-partition-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
3453164 rows affected.
2 rows affected.


row_count,tablename
3453164,demo.amzn_reviews
3453164,demo.amzn_reviews_2


### 2. Familiarize yourself with the Partitioned Table Design and Present Basic Demographics

#### 2.1. Retrieve Partitioned Table Design

In [66]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-2-1-partition-table-design.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [68]:
query = !cat script/5-2-1-partition-table-design.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

18 rows affected.


partitionboundary,partitiontablename,partitionname,partitionlevel,partitionrank
PARTITION year1998 START ('1998-01-01'::date) END ('1999-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year1998,year1998,0,1
PARTITION year1999 START ('1999-01-01'::date) END ('2000-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year1999,year1999,0,2
PARTITION year2000 START ('2000-01-01'::date) END ('2001-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2000,year2000,0,3
PARTITION year2001 START ('2001-01-01'::date) END ('2002-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2001,year2001,0,4
PARTITION year2002 START ('2002-01-01'::date) END ('2003-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2002,year2002,0,5
PARTITION year2003 START ('2003-01-01'::date) END ('2004-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2003,year2003,0,6
PARTITION year2004 START ('2004-01-01'::date) END ('2005-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2004,year2004,0,7
PARTITION year2005 START ('2005-01-01'::date) END ('2006-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2005,year2005,0,8
PARTITION year2006 START ('2006-01-01'::date) END ('2007-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2006,year2006,0,9
PARTITION year2007 START ('2007-01-01'::date) END ('2008-01-01'::date) EVERY ('1 year'::interval),amzn_reviews_2_1_prt_year2007,year2007,0,10


#### 2.2. Row Count per Partition

In [69]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-2-2-row_count_per_partition.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [70]:
query = !cat script/5-2-2-row_count_per_partition.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

18 rows affected.


partition_name,row_count
demo.amzn_reviews_2_1_prt_year1998,7
demo.amzn_reviews_2_1_prt_year1999,1199
demo.amzn_reviews_2_1_prt_year2000,6732
demo.amzn_reviews_2_1_prt_year2001,9611
demo.amzn_reviews_2_1_prt_year2002,12196
demo.amzn_reviews_2_1_prt_year2003,14407
demo.amzn_reviews_2_1_prt_year2004,15162
demo.amzn_reviews_2_1_prt_year2005,17792
demo.amzn_reviews_2_1_prt_year2006,22957
demo.amzn_reviews_2_1_prt_year2007,49904


#### 2.3. Row Count per Partition & Segment

In [71]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-2-3-row-count-per-partition-segment.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [72]:
query = !cat script/5-2-3-row-count-per-partition-segment.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

36 rows affected.


partition_name,segment_id,segment_count,partition_count
demo.amzn_reviews_2_1_prt_year1998,0,2,7
demo.amzn_reviews_2_1_prt_year1998,1,5,7
demo.amzn_reviews_2_1_prt_year1999,0,602,1199
demo.amzn_reviews_2_1_prt_year1999,1,597,1199
demo.amzn_reviews_2_1_prt_year2000,0,3348,6732
demo.amzn_reviews_2_1_prt_year2000,1,3384,6732
demo.amzn_reviews_2_1_prt_year2001,0,4779,9611
demo.amzn_reviews_2_1_prt_year2001,1,4832,9611
demo.amzn_reviews_2_1_prt_year2002,0,6024,12196
demo.amzn_reviews_2_1_prt_year2002,1,6172,12196


### 3. Partitioned Table Size and Disk Space Usage

After you create the partitioned table structure, top-level parent tables are empty. Data is routed to the bottom-level child table partitions. In a multi-level partition design, only the subpartitions at the bottom of the hierarchy can contain data.

Compare the output below with the [Non-Partitioned Table Size and Disk Usage](http://127.0.0.1:9900/notebooks/gp-demo/GP-demo-2.ipynb#5.2.-Using-the-gp_toolkit-Administrative-Schema-(Greenplum-5.x)).

In [143]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-3-partitioned-table-size-disk.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [74]:
query = !cat script/5-3-partitioned-table-size-disk.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

18 rows affected.


tablename,tabledisksize,uncompressedsize,tablesize,indexsize,toastsize,othersize,partitionname,partitiontablesize,partitionindexsize
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year1998,160 kB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year1999,1056 kB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2000,6304 kB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2001,9376 kB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2002,11 MB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2003,14 MB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2004,16 MB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2005,19 MB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2006,23 MB,0 bytes
demo.amzn_reviews_2,96 kB,96 kB,0 bytes,0 bytes,96 kB,0 bytes,demo.amzn_reviews_2_1_prt_year2007,41 MB,0 bytes


### 4. Verify your Partition Strategy and Demonstrate *Partition Elimination* functionality

When a table is partitioned based on the query predicate, you can use `EXPLAIN` to examine the query plan and verify that the query optimizer scans only the relevant partitions. For example, the `demo.amzn_reviews_2` table is date-range partitioned by year.

#### Example 1: `SELECT` data for a single day (`2011-07-12`):

In [144]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-4-explain-example-1.sql
display_html('\n'.join(sqlfilecode), raw=True)

The query plan for this query should show a table scan of only the following tables:

- the default partition returning 0-1 rows (if your partition design has one)
- the 2011 partition (`demo.amzn_reviews_2_1_prt_year2011`) returning ***some number*** of rows
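
The expectation in the list above follows from a simple overlap test between the query's date range and each partition's range. The sketch below models that test in plain Python; it is not the optimizer's actual algorithm. The helper names, the 2010-2013 partition range and the DEFAULT partition name `other` are hypothetical.

```python
from datetime import date


def yearly_partitions(first_year, last_year):
    """Build {name: (start, end)} for yearly range partitions, end exclusive."""
    return {
        "year%d" % year: (date(year, 1, 1), date(year + 1, 1, 1))
        for year in range(first_year, last_year + 1)
    }


def partitions_to_scan(lo, hi, partitions, default=None):
    """Return the partitions whose [start, end) range overlaps the predicate range [lo, hi)."""
    hits = [name for name, (start, end) in partitions.items() if start < hi and lo < end]
    if default is not None:
        hits.append(default)  # a DEFAULT partition is always scanned
    return hits


parts = yearly_partitions(2010, 2013)  # hypothetical subset of the real hierarchy

# Example 1: a single day, 2011-07-12.
print(partitions_to_scan(date(2011, 7, 12), date(2011, 7, 13), parts, default="other"))

# A longer range inside one year, 1 January to 25 October 2012 (inclusive).
print(partitions_to_scan(date(2012, 1, 1), date(2012, 10, 26), parts))
```

Every partition that fails the overlap test is eliminated before any data is read, which is where the performance gain comes from.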

To confirm, execute the `EXPLAIN` query and check the query plan:

In [145]:
explain_output = !cat script/5-4-explain-example-1.sql \
    | psql $CONNECTION_STRING | pygmentize -f html -O full,style=colorful -l psql 
display_html('\n'.join(explain_output), raw=True)

#### Example 2: Number of reviews for the period 1 January - 25 October 2012

- **Non-partitioned Table**

In [147]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-4-explain-example-2-1.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [163]:
explain_output = !cat script/5-4-explain-example-2-1.sql \
    | psql $CONNECTION_STRING | pygmentize -f html -O full,style=colorful -l psql 
display_html('\n'.join(explain_output), raw=True)

- **Partitioned Table**

In [164]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/5-4-explain-example-2-2.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [165]:
explain_output = !cat script/5-4-explain-example-2-2.sql \
    | psql $CONNECTION_STRING | pygmentize -f html -O full,style=colorful -l psql
display_html('\n'.join(explain_output), raw=True)

In [176]:
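# Capture the full psql output for the partitioned-table EXPLAIN script
explain_output_2_2 = !cat script/5-4-explain-example-2-2.sql | psql $CONNECTION_STRING
#'\n'.join(explain_output_2_2)

# Keep only the 'Total runtime:' line; the list is empty if the plan output
# contains no such line (for example, EXPLAIN without ANALYZE)
out = !cat script/5-4-explain-example-2-2.sql | psql $CONNECTION_STRING \
    | grep -e 'Total runtime:'

out
#    | awk 'BEGIN{FS="|";OFS=" "} {print $3}' \
#    | awk '{print $1, "COUNT(*)", $3, $4, $5, $6, $7, $8}'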

[]