# Greenplum Demo - Part 2

This is Part 2 of the Greenplum Demo, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or wish to repeat it, click [here](AWS-GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

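# Connection URL is read from the environment: postgresql://user:password@host:port/dbname
CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string avoids invalid-escape warnings for \S; '/' needs no escaping in a Python regex
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)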

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
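
As an alternative to the regular expression in the cell above, the connection URL can also be split with the standard library's `urllib.parse`. This is only a sketch: the helper name `parse_connection_url` and the example URL are made up for illustration and are not part of the demo environment.

```python
from urllib.parse import urlsplit, unquote


def parse_connection_url(url):
    """Split a postgresql:// URL into user, password, host, port and database."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// URL")
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname,
        "port": parts.port,  # int, or None when the URL omits the port
        "dbname": parts.path.lstrip("/"),
    }


# Hypothetical URL, used only to show the shape of the result
example = parse_connection_url("postgresql://gpadmin:s3cret@gp.example.com:5432/gpadmin")
print(example["user"], example["host"], example["port"], example["dbname"])
```

Unlike the greedy `\S+` groups, this approach fails with a clear `ValueError` when the scheme is wrong, and it returns the port as an integer.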

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [2]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema` catalog views.

In [3]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the *demo.amzn_reviews* table data (10 rows).

In [4]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,26663989,R37SO9ODGBHRVP,B005WX6EL4,815186521,"SGT KNOTS Paracord Bracelet Whistle Buckle 3/4"" - 100 Pack",Sports,1,0,1,N,Y,"Greedy, Seedy Seller - Beware","To be clear - I have no qualms about the actual product. I give it five stars.<br /><br />However, our dealings with the very rude seller have been miserable. The seller's description is simply misleading. We thought the buckles were 3/4\"" in total width, but they are actually larger. The 3/4\"" width refers to the little opening. Now, maybe everyone else on the earth already knows this tidy little piece of insider information, as the seller claims, but we did not. How hard is it to make that clarification in the product description?<br /><br />We asked for a refund of our shipping as well as a prepaid mailing label for the return, which has been standard procedure from every seller we've encountered on Amazon. Until now. The seller must have missed the \""customer is always right\"" chapter in Salesmanship 101. Which is too bad for him because he has lost our business.<br /><br />The whole ordeal could have been avoided if he'd made a tiny clarification in the product description to notify buyers of just what part of the buckle the 3/4\"" width applies to. After 4 emails to rectify the situation, I actually had to file a claim in order to get all my money back. So...five stars for the product minus 4 for the awful seller leaves me with a 1 star rating.",2012-05-22
US,85871,R3IP8ALYBGH7FX,B00JKULIPW,843462495,Cellto Samsung Galaxy S5 Premium Wallet Case with HD Screen Protector [Dual Magnetic Flap] Diary Cover /w ID Pocket Top Quality for Galaxy S V Galaxy SV i9500 [Made in Korea] + Life Time Warranty,Wireless,5,0,0,N,Y,I really like it! Love the colors,"I really like it! Love the colors, mini wallet and how it protects the the screen when I close it.",2014-08-22
US,49186302,R287P2NZEMW41R,B00KXD2HL4,451597140,Bridged (Callahan & McLane Book 2),Digital_Ebook_Purchase,5,0,0,N,N,Five Stars,Great book!,2015-07-18
US,22919302,R3EM5MFGK7CJPI,B00280LYI2,152274097,"Titan's Curse, The (Percy Jackson and the Olympians, Book 3)",Digital_Ebook_Purchase,5,0,0,N,Y,Love it,Omg such a good book love it a lot percy annabeth Thalia everything was awesome I love love love this book don't wanna tell how it goes the twists are great u gotta read it,2013-01-26
US,40316220,RBJFD9C2LHCMT,B00D9F4MDE,69272273,Reaching Out For You (Never Letting Go Book 1),Digital_Ebook_Purchase,5,4,4,N,N,Please read this book! You will not be disappointed!,"Omg! This book made me cry! I love that Sophie and Adam get a chance and maybe just maybe fall in love completely. The attraction is still there, along with that magnetic pull, but will it be enough to make Adam stay to be there when Sophie needs him the most? I am not going to tell you. You will have to read it for yourself. What I can say is...this book was a carosel of emotions. One minute I was crying the next I wanted to kick Kyle's behind (to put it nicely) Trust me when I say you will not be disappointed. If you were hoping for spoilers, you are not going to get them from me!",2013-08-06
US,19857589,R2W70IMA9M19RM,B008CO80SA,684833372,Step2 Whisper Ride II,Toys,3,0,0,N,Y,Three Stars,It's not a good quality of plastic and all the parts didn't fit..,2015-04-30
US,21642808,R308M0EC6M1HR5,B001F51TNQ,584087904,Olay Moisturizing Lotion Sensitive Skin 6.0 Fl Oz,Beauty,5,2,2,N,Y,I use it every Night,"I use this every night its not greasy and who doesn't love olay is what my mother used and her mother used and now they only keep making this product better I ordered this online, because I really like how much I get for the price",2014-12-10
US,13835146,R3NWHMO6VQAT12,B001JI51BK,69273548,Strict Leather Ball Stretcher with 2 Pulls,Health & Personal Care,3,0,1,N,Y,too big,this thing is designed for someone who has a really large droopy set. regular guys may have problems with fit.,2013-12-27
US,46654430,R21BCWSFAUXD2P,B001VLFE6W,362553810,Paul Blart: Mall Cop [Blu-ray],Video DVD,5,0,0,N,Y,Five Stars,Great movie.,2015-05-28
US,32792279,R24TSQCBND3PRR,B008VQ2YUY,326494619,Lexar Professional USB 3.0 Dual-Slot Reader,PC,4,0,2,N,Y,Four Stars,Very good product !! Very satisfied,2014-12-23


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


gp_segment_id,count
0,51577078
1,51568195
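
To make the per-segment counts easier to read, they can be converted into shares of the total row count. The sketch below simply copies the two counts from the output above into a dictionary; the variable names are illustrative.

```python
# Row counts per segment, copied from the query output above
segment_rows = {0: 51577078, 1: 51568195}

total_rows = sum(segment_rows.values())
even_share = 1 / len(segment_rows)  # share each segment would hold if perfectly balanced

# Fraction of all rows stored on each segment
shares = {seg: rows / total_rows for seg, rows in segment_rows.items()}

for seg, share in shares.items():
    print(f"segment {seg}: {share:.4%} of rows (even share would be {even_share:.4%})")
```

With two segments, each share should sit very close to 50% when the distribution key spreads rows evenly.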


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [6]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,59 GB,Disk space used by the table or index.
demo,amzn_reviews,60 GB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,59 GB,195 MB,0 bytes,60 GB,0 bytes


### 4.6. Check table for Data Skew
Data skew may be caused by uneven data distribution resulting from a poor choice of distribution keys, or from single-tuple insert or copy operations. Data skew at the table level is often the root cause of poor query performance and out-of-memory conditions. Skewed data slows scans (reads), and it also affects all other query execution operations, such as joins and group-by operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query reports the largest and smallest per-segment row counts and the percentage difference between them:

In [8]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,51577078,51568195,0.0172227670594289


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) of the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0121793841920028


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which indicates uneven data distribution or query processing skew. The `siffraction` column holds this value as a fraction: 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.

In [10]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,8.61138352971e-05
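
One way to read the idle fraction: a table scan finishes only when the fullest segment finishes, so segments holding fewer rows sit idle for part of that time. The sketch below assumes the fraction is `(max - average) / max` over the per-segment row counts from section 4.4. This formula is inferred from the value above rather than taken from the view definition.

```python
# Per-segment row counts from section 4.4
segment_rows = [51577078, 51568195]

max_rows = max(segment_rows)
avg_rows = sum(segment_rows) / len(segment_rows)

# Assumed formula: share of scan capacity left unused while the fullest segment finishes
idle_fraction = (max_rows - avg_rows) / max_rows

# Threshold from the text: more than 10% skew warrants reviewing the distribution policy
needs_review = idle_fraction > 0.10

print(f"idle fraction: {idle_fraction:.6e}, review distribution policy: {needs_review}")
```

For this table the fraction is far below the 10% threshold, so the current distribution policy looks reasonable.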


## Continue to Part 3 of the Greenplum Demo: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**.