# Greenplum Demo - Part 2

This is Part 2 of the Greenplum Demo, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or want to repeat it, click [here](AWS-GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
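
As an alternative to the regular expression in the cell above, the same five parts could be extracted with the standard library's `urllib.parse`. This is a minimal sketch and not part of the original notebook; the helper name `parse_conn` and the sample URL are made up for illustration.

```python
from urllib.parse import urlsplit

def parse_conn(url):
    """Split a postgresql:// URL into user, password, host, port, and database."""
    if not url:
        raise ValueError("connection string is empty; is AWSGPDBCONN set?")
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// URL")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                 # int, or None if no port is given
        "dbname": parts.path.lstrip("/"),
    }

# Hypothetical connection string, for illustration only.
conn = parse_conn("postgresql://gpadmin:secret@gp-master.example.com:5432/gpadmin")
print(conn["user"], conn["host"], conn["port"], conn["dbname"])
```

Unlike the regex, this version fails with a readable message when the environment variable is unset, instead of raising an error from `re.match`.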

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [2]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema.columns` view.

In [3]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [4]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,50122160,RH9653V2SZS22,1416949720,962927882,The Mysterious Edge of the Heroic World,Books,5,0,1,N,N,A new kid who has a secret dream: to discover some hidden treasure to make himself famous,"E.L. Konigsburg's THE MYSTERIOUS EDGE OF THE HEROIC WORLD tells of Amedeo Kaplan, a new kid who has a secret dream: to discover some hidden treasure to make himself famous - and make a new friend to share his discoveries with. William is an aloof boy who is a loner and seems an odd friendship choice - but the two find themselves working on an odd mansion project and find themselves becoming involved in a long-kept secret revolving around Nazi Germany.",2007-12-03
US,1239482,R2FBAEIQPELS3T,B005NAUKAC,847714360,Outlining Your Novel: Map Your Way to Success (Helping Writers Become Authors Book 1),Digital_Ebook_Purchase,3,0,0,N,Y,Three Stars,"Fair guide, too many examples of what not to do",2014-11-09
US,50631608,RN7896UK707SL,0553496786,831458401,Thomas & Friends Story Time Collection (Thomas & Friends),Books,5,0,0,N,Y,Five Stars,Great stories!,2015-03-30
US,41103187,R1FH0ZLPLP7JKC,B000XSKDH4,876690792,Anne of Green Gables: Collector's Edition,Video DVD,5,0,0,N,Y,Fantastic,"This is a truly great series. The acting is superbly done and very in-line to the novel. Megan Follows' portrayal of Anne is nothing but fantastic. The story pulls you in to the place, the time, and the characterization. It becomes a story you want to see over and over.",2014-03-12
US,38807921,R20OYNWVELE403,B003C383H4,611621847,"Shoulder Holster Glock 29,30,36 And Beretta Cheetah 84,85,.380 81,32 ACP,87,.22LR Taurus Millennium Pro. PT-745, PT-140, PT-145, PT-111, PT138, PT-957, PT-940, PT-908",Sports,5,0,0,N,Y,Five Stars,Worked perfect,2015-03-08
US,15931374,R1IELQM1XFXE4G,B001NJQ5PG,78035719,iPearl mCover Hard Shell Case with FREE keyboard cover for Model A1278 13-inch Regular display Aluminum Unibody MacBook Pro - AQUA,PC,5,0,0,N,Y,This case is awesome!,I love this color! I wish there was a a silicone keyboard cover that matches this exactly. I love the legs that allow me to flip it up when my computer needs a little breathing room from the table. Easy to put on and take off. Protects my computer from getting as dirty has it probably should be. Wish I would have bought one for my old white Macbook because then it wouldn't be so dingy and dirty. Great item and great seller.,2013-04-30
US,16863311,R2DEIYVFDU4A2C,B000M9N7CM,144344680,Wilton Petite Silicone Heart Pan,Kitchen,5,1,1,N,Y,Cheesecakes and Lotion Bars,I absolutely love this pan for cheesecakes as well as hot chocolate cubes and lotion bars. I would highly recommend.,2013-01-23
US,2329972,R1LP1BZU6CNUVI,B004M8YRVY,853405223,Legacy 7-Inch Motorized Touch Screen TFT/LCD Monitor with DVD/CD/MP3/AM/FM Player,Wireless,1,0,0,N,Y,One Star,no sirven ni siquiera me lo quieren regresar,2013-11-23
US,47401021,R39M8JVJRF5ALB,B005890G8O,904098182,"Kindle Touch 3G, Free 3G + Wi-Fi, 6"" E Ink Display",PC,5,0,1,N,Y,Amazon Kindle- Incredible Deal & Product!,"The Amazon Kindle works for my daughter who travels in Missouri, she has not been in a major town yet that she did not have amazon coverage.Thanks for a great product(Amazon Network) that works with a great product (Kindle)",2012-01-10
US,36185594,R1V5274KT56CJC,B00006ISA3,860460320,Now or Never,Music,5,0,0,N,N,This album is awesome!,"This album is simply amazing! Nick Carter at his best. I think that the best song on the album is \""Blow Your Mind\"", it sounds soo awesome!",2002-10-30


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

24 rows affected.


gp_segment_id,count
0,4298868
1,4296740
2,4291976
3,4297200
4,4300819
5,4297083
6,4301096
7,4298972
8,4297364
9,4296533


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [6]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,59 GB,Disk space used by the table or index.
demo,amzn_reviews,60 GB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,59 GB,195 MB,0 bytes,60 GB,0 bytes


### 4.6. Check table for Data Skew
Data skew may be caused by uneven data distribution resulting from a poor choice of distribution key, or by single-tuple insert or copy operations. Present at the table level, data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects not only scan (read) performance but also every other query execution operation, such as joins and group-by operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.
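
To make the effect of the distribution key concrete, here is a small simulation. It is only a sketch: it uses CRC32 as a stand-in hash (Greenplum's real hash function is different), and the segment count, row count, and key values are invented for illustration.

```python
import zlib

def distribute(keys, num_segments):
    """Assign each key to a segment by hashing it (CRC32, purely illustrative)."""
    counts = [0] * num_segments
    for key in keys:
        seg = zlib.crc32(str(key).encode("utf-8")) % num_segments
        counts[seg] += 1
    return counts

NUM_SEGMENTS = 8      # assumed cluster size for this sketch
ROWS = 80_000         # assumed table size for this sketch

# High-cardinality key (think of a unique id): rows can land on every segment.
even = distribute(range(ROWS), NUM_SEGMENTS)

# Low-cardinality key (only two distinct values): at most two segments get rows.
skewed = distribute(["US" if i % 10 else "UK" for i in range(ROWS)], NUM_SEGMENTS)

print("unique key :", even)
print("2-value key:", skewed)
```

With the two-value key, nine out of ten rows share one hash value and therefore one segment, which is the kind of imbalance the checks in this section are meant to catch.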

The following query shows the largest and smallest per-segment row counts and the percentage difference between them:

In [8]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,4301096,4291976,0.2120389779721261
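
The SQL script itself is not reproduced here, so the exact formula is an assumption. Taking the gap between the largest and smallest segment relative to the largest gives about 0.212, in line with the value reported above. The row counts below are the ten segments displayed in section 4.4, which include both the reported maximum and minimum.

```python
# Per-segment row counts copied from the section 4.4 output (displayed rows only).
seg_rows = [4298868, 4296740, 4291976, 4297200, 4300819,
            4297083, 4301096, 4298972, 4297364, 4296533]

max_rows = max(seg_rows)
min_rows = min(seg_rows)

# Assumed formula: (max - min) / max, expressed as a percentage.
pct_diff = (max_rows - min_rows) / max_rows * 100

print(max_rows, min_rows, round(pct_diff, 4))
```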


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew through the coefficient of variation (CV) of the amount of data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around it. The lower the value, the better; higher values indicate greater data skew.

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0529644564132474
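
The coefficient of variation described above can be sketched in a few lines of Python. This is an illustration, not the view's implementation: it uses the population standard deviation and only the ten segment counts displayed in section 4.4, so the result is not expected to match the value reported by `gp_skew_coefficients`, which covers every segment and may be scaled differently.

```python
from statistics import mean, pstdev

def coefficient_of_variation(counts):
    """Standard deviation divided by the average; 0.0 means perfectly even."""
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0

# Per-segment row counts copied from the section 4.4 output (displayed rows only).
seg_rows = [4298868, 4296740, 4291976, 4297200, 4300819,
            4297083, 4301096, 4298972, 4297364, 4296533]

cv = coefficient_of_variation(seg_rows)
print(f"CV over the displayed segments: {cv:.6f}")
```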


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew through the fraction of the system that sits idle during a table scan, reported in the `siffraction` column. This is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.

In [10]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0007849840288769
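
One way to think about the idle fraction: a scan finishes only when the fullest segment is done, so the other segments are idle for the part of that time they did not need. The sketch below assumes the fraction is `1 - average / maximum` of the per-segment row counts. That formula is an assumption made for illustration and is not taken from the view's definition. It uses only the ten segment counts displayed in section 4.4, so it will not reproduce the reported value exactly.

```python
def idle_fraction(counts):
    """Assumed definition: 1 - (average rows per segment / rows on the fullest segment)."""
    biggest = max(counts)
    if biggest == 0:
        return 0.0
    return 1 - (sum(counts) / len(counts)) / biggest

# Per-segment row counts copied from the section 4.4 output (displayed rows only).
seg_rows = [4298868, 4296740, 4291976, 4297200, 4300819,
            4297083, 4301096, 4298972, 4297364, 4296533]

frac = idle_fraction(seg_rows)
print(f"idle fraction over the displayed segments: {frac:.6f}")
print("needs review" if frac > 0.10 else "within the 10% guideline")
```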


## Continue to Part 3 of the Greenplum Demo: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**.