# Greenplum Demo - Part 2

This is Part 2 of the Greenplum Demo, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or wish to repeat it, click [here](AWS-GP-demo-1.ipynb).

In [11]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string avoids invalid-escape warnings for \S; "/" needs no escaping in Python regexes.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'
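
As an alternative to the regular expression in the cell above, the standard library can split the connection string. The sketch below assumes Python 3 and a URL of the form `postgresql://user:password@host:port/dbname`. The helper name `parse_connection_string` and the sample credentials are hypothetical and not part of the demo.

```python
from urllib.parse import urlsplit


def parse_connection_string(conn):
    """Split a postgresql:// URL into the fields the setup cell extracts."""
    parts = urlsplit(conn)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// connection string")
    return {
        "user": parts.username,
        "password": parts.password,
        "server": parts.hostname,
        "port": parts.port,               # int, or None when the URL has no port
        "name": parts.path.lstrip("/"),   # database name without the leading slash
    }


# Hypothetical credentials, for illustration only.
example = parse_connection_string("postgresql://gpadmin:secret@10.0.0.5:5432/gpadmin")
print(example["user"], example["server"], example["port"], example["name"])
```

Unlike the regular-expression groups, `port` comes back as an integer here, so convert it with `str()` if a string is needed.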

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [12]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema.columns` catalog view.

In [13]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [14]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,25136478,R2IUGUKNWHYWJ5,B00G0KQ2G0,80180978,Pink Platinum Little Girls' Strawberry One Piece Swimsuit,Apparel,1,0,0,N,N,Bad fit,It does not fit true to size unfortunately so my daughter was unable to wear it. I think if I had ordered up avsize it may have fit but then again I am not sure.,2014-06-20
US,11011177,R21BWQACZB0LQT,1430226293,628000125,Beginning Android 2,Books,4,0,0,N,Y,good book if your still old school,google is better,2015-04-24
US,13064977,R3ZX7T9DT2NV0,1439264597,351350363,Jasmine's Justice: Historical Treasure Lost on the Underground Railroad,Books,4,0,0,N,N,My review,A good book which teaches you to appreciate life. The book as educational as it teaches US history in the 1850s.,2011-08-07
US,17192100,RTMZHI9EUNGDV,B007Q0OA8A,467183789,LEGO Friends 3189 Heartlake Stables,Toys,5,0,0,N,Y,Awesome,The friends series is wonderful. I am able to continue to add to them for birthdays and other occasions. Some many to choose from.,2014-01-02
US,12109988,R20BZ82M8OBMU3,B004CFA9RS,941986263,"Divergent (Divergent Trilogy, Book 1)",Digital_Ebook_Purchase,4,0,0,N,N,Quick read,"Fun, fast_paced book. Good character development in the beginning, but jumped awkwardly into the climax. Predictable, but still worth your time.",2014-09-07
US,30622587,RA34JIDAFJ9GW,B00GPDYLXS,15250050,Dash of Peril (Love Undercover (Foster) series Book 4),Digital_Ebook_Purchase,4,0,0,N,N,Another good one by Lori Foster,"Fast and fulfilling! If you are looking for a quick read, familiar characters and a bit of a rush, this is the one for you. Loved a Dash!",2014-04-14
US,43732074,R3D9OU7UVASKL2,B007SLSDKS,485091737,"The Indifferent Stars Above: The Harrowing Saga of a Donner Party Bride[ THE INDIFFERENT STARS ABOVE: THE HARROWING SAGA OF A DONNER PARTY BRIDE ] by Brown, Daniel James (Author) Jun-01-10[ Paperback ]",Books,5,0,0,N,Y,Good read,This is an Interesting story. There are lots and lots of details and factual information. I would recommend ti to everyone.,2014-03-16
US,12122639,R1GV4MVHWLPF7R,B003WKDMKK,35790382,Dominique Womens Satin and Lace Torsolette Bridal Bustier 8949,Apparel,5,1,2,N,N,Good product but size is tricky,"This is a great product - really worked well under my wedding dress. It provides great support and the lace and satin is sooo pretty! However, I agree with other reviewers that the sizing on this product is weird. I'm usually a 36C, and I ordered this in a 34C, which fit perfectly.",2012-06-02
US,32445450,RRWGU8HS6H4IT,B000WD3IR2,450587274,Energizer Rechargeable Batteries and Charger - 8 rechargeble AA batteries and 4 rechargeble AAA batteries,Health & Personal Care,1,1,1,N,Y,A Waste,"I bought 3 packs of these batteries and a charger to charge them overseas. Upon getting to where I was traveling to, I tried to charge the batteries for my use only to discover that the charger would not work. I tried severally but end up getting another charger which was so expensive. I have tried to send the charger back to the manufacturer for repairs but they would not accept it but instead offering monetary offer. I consider buying this a waste of money and waste of time altogether.",2010-01-25
US,41491734,R28OLAPZEO0L7K,B00E0WNH7M,268784785,Brixton Men's Barrel Classic Driver Cap,Apparel,5,0,0,N,Y,Five Stars,"Good fabric, high quality cap.",2014-12-15


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [15]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

24 rows affected.


gp_segment_id,count
0,4298868
1,4296740
2,4291976
3,4297200
4,4300819
5,4297083
6,4301096
7,4298972
8,4297364
9,4296533


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [16]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,59 GB,Disk space used by the table or index.
demo,amzn_reviews,60 GB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [17]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,59 GB,196 MB,0 bytes,60 GB,0 bytes


### 4.6. Check table for Data Skew
Data skew is an uneven distribution of rows across segments. It may be caused by a poor choice of distribution key or by single-tuple insert or copy operations. Table-level data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data slows scans (reads), and it also affects all other query execution operations, such as joins and group-by operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.
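
To illustrate how the choice of distribution key drives skew, here is a toy simulation. It is a sketch under stated assumptions: the SHA-256-based `segment_for` function is a stand-in and not Greenplum's actual hash, the row count is arbitrary, and the helper names are hypothetical. It compares a key with one distinct value per row against a key with only five distinct values.

```python
import hashlib
from collections import Counter


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment (stand-in hash, not Greenplum's)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_segments


def rows_per_segment(keys, n_segments):
    """Count how many rows land on each segment for the given key values."""
    counts = Counter(segment_for(k, n_segments) for k in keys)
    return [counts.get(seg, 0) for seg in range(n_segments)]


N_SEGMENTS = 24    # the distribution query in section 4.4 returns 24 rows
N_ROWS = 10_000    # toy row count, assumed for illustration

# High-cardinality key: one distinct value per row (like a review id).
unique_keys = [f"R{i}" for i in range(N_ROWS)]
# Low-cardinality key: only five distinct values (like a 1-5 star rating).
rating_keys = [1 + i % 5 for i in range(N_ROWS)]

by_unique = rows_per_segment(unique_keys, N_SEGMENTS)
by_rating = rows_per_segment(rating_keys, N_SEGMENTS)

print("unique key: max rows", max(by_unique), "| empty segments", by_unique.count(0))
print("rating key: max rows", max(by_rating), "| empty segments", by_rating.count(0))
```

With only five distinct key values, at most five segments can receive rows, so most of the cluster sits idle no matter how good the hash function is.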

The following query shows the maximum and minimum number of rows per segment and the percentage difference between them:

In [18]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,4301096,4291976,0.2120389779721261
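
The reported percentage is consistent with the difference between the largest and smallest segment, divided by the largest and multiplied by 100. The script's SQL is not shown here, so this formula is an inference from the output rather than the query's confirmed definition. The function name is hypothetical.

```python
def pct_diff_max_min(max_rows, min_rows):
    """Percentage by which the smallest segment trails the largest (assumed formula)."""
    return (max_rows - min_rows) / max_rows * 100.0


# Max and min segment row counts taken from the query output above.
pct = pct_diff_max_min(4301096, 4291976)
print(f"{pct:.4f}%")  # about 0.2120%
```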


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew through the coefficient of variation (CV) of the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.

In [19]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0529644564132474
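
A coefficient of variation can be computed from per-segment row counts as sketched below. This follows the "standard deviation divided by the average" description only. Using the population standard deviation is an assumption, and the view may use a different estimator or scale the result, so the numbers are not expected to match `skccoeff`. The row counts are made up.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rows_per_segment):
    """Standard deviation divided by the average (population stddev is an assumption)."""
    return pstdev(rows_per_segment) / mean(rows_per_segment)


even = [1000, 1000, 1000, 1000]   # hypothetical, perfectly balanced table
skewed = [100, 100, 100, 3700]    # hypothetical table with one overloaded segment

print(coefficient_of_variation(even))
print(coefficient_of_variation(skewed))
```

The balanced table yields a CV of zero, while the table with one overloaded segment yields a CV above 1.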


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew through the fraction of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. The `siffraction` column holds this fraction: a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

In [20]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0007849840288769
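
One way to reason about an idle fraction: if a scan lasts as long as the busiest segment needs, every other segment is idle for the share of that time it does not use. Averaged over all segments, this gives `1 - average / maximum`. The sketch below implements that reading. The formula is an assumption used for illustration, not a confirmed definition of `siffraction`, and the row counts are made up.

```python
def idle_fraction(rows_per_segment):
    """Average share of scan time segments spend idle, assuming 1 - avg/max."""
    busiest = max(rows_per_segment)
    if busiest == 0:
        return 0.0  # empty table: nothing to scan, nothing idle
    average = sum(rows_per_segment) / len(rows_per_segment)
    return 1.0 - average / busiest


print(idle_fraction([1000, 1000, 1000, 1000]))  # balanced: no idle time
print(idle_fraction([50, 100]))                 # one segment holds half as many rows
print(idle_fraction([100, 100, 100, 3700]))     # one overloaded segment
```

Under this reading, the 10% threshold mentioned above corresponds to the average segment holding 90% as many rows as the busiest one.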


## Continue to Part 3 of the Greenplum Demo: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**.