# Greenplum Demo - Part 2

This is Part 2 of Greenplum Demo, ***Basic Table Functions***. 
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or want to repeat it, click [here](GP-demo-1.ipynb).

In [2]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

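# Connection string comes from the environment, in the form
# postgresql://<user>:<password>@<server>:<port>/<database>
CONNECTION_STRING = os.getenv('GPDBCONN')

# A raw string keeps \S from being treated as an invalid string escape; '/' needs no escaping.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)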

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'
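
As an alternative to the regular expression above, the standard library's URL parser can split a connection string into the same five parts. This is only a sketch, assuming Python 3: the helper name `parse_conn` is hypothetical, and the sample URL (credentials, host and port included) is made up for illustration rather than taken from the demo's real `GPDBCONN` value.

```python
from urllib.parse import urlsplit

def parse_conn(conn_str):
    """Split a postgresql:// URL into user, password, server, port and database name."""
    parts = urlsplit(conn_str)
    if parts.scheme != 'postgresql':
        raise ValueError('expected a postgresql:// URL')
    return {
        'user': parts.username,
        'password': parts.password,
        'server': parts.hostname,
        'port': parts.port,              # int, or None when the URL has no port
        'name': parts.path.lstrip('/'),  # the path component is '/<database>'
    }

# Hypothetical value, for illustration only.
example = parse_conn('postgresql://gpadmin:secret@localhost:5432/gpadmin')
print(example)
```

Unlike the regex, this version returns the port as an integer and raises a clear error when the scheme is not `postgresql`.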

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [20]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema` catalog

In [21]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [22]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,35871264,R2RRA8VU3Z5AYK,B001GHOPTI,629306937,Texas Instrument 84 Plus Silver Edition graphing Calculator (Full Pink in color) (Packaging may vary),Office Products,4,0,0,N,Y,My daughter loves this device. She stated she wishes ...,"My daughter loves this device. She stated she wishes it was in color, she also said that she had some difficulty reading the information in the direction, maybe you can put the instruction on line so the print can be enlarged.",2014-10-24
US,52906199,RVBIY0AYLSTAL,B008UXOBGI,255302013,ELP 6ct Zaner Bloser Pencil Grips,Office Products,4,7,8,N,Y,Found another purpose for this simple item.,Nicely made...but I use mine for crochet to keep the yarn from twisting. It also helps weight the strand of working yarn enough for arthritic fingers.,2015-02-04
US,44614362,R9E8G1VD8BC8D,B000VG2FCO,382681376,Remanufactured Ink Cartridge Replacement for HP 56 and HP 57 (2 Black 1 Color 3 Pack),Office Products,2,0,0,N,Y,Didn't work but I blame HP not the cartridge,"this product may or may not be good, I could never find out. My printer insisted they were dry which was a lie. I wasted my money on these cartridges. This printer had never printed more than 50 pages in its life. It demanded all new cartridges and wouldn't work after it got them. I refilled the old ones and that didn't work. So......I took the printer outside and beat it up with a large sledge. I HATE HP. It used to be a great company but now all they want to do is screw their customers while management makes really bad decisions and collects huge bonuses. New cartridges, old cartridges, full, empty, they want you to buy another just to prove you are a sucker. I will never buy an HP product again. I bought a Samsung laser and am happy and it cost about twice what a full set of new HP ink jet cartridges cost.",2013-01-12
US,11071769,R260HMZNL42VFT,B008LDJ8HO,329537080,"24 Pack Compatible Canon CLI-226 , PGI-225 8 Big Black, 4 Small Black, 4 Cyan, 4 Magenta, 4 Yellow for use with Canon PIXMA iP4820, PIXMA iP4920, PIXMA iX6520, PIXMA MG5120, PIXMA MG5220, PIXMA MG5320, PIXMA MG6120, PIXMA MG6220, PIXMA MG8120, PIXMA MG8120B, PIXMA MG8220, PIXMA MX712, PIXMA MX882, PIXMA MX892. Ink Cartridges for inkjet printers. CLI-526BK , CLI-526C , CLI-526M , CLI-526Y , PGI-525",Office Products,5,0,0,N,Y,Exactly what I needed and at a great price!,"The product was exactly what I needed and arrived when promised. I was a little worried at first since the price seemed like such a great deal. But after installing it and using it with no trouble, I am glad I purchased the item from Blake Printing Supply (through Amazon). The company even took the time to see how I liked the product. I am glad I could support a family-owned business that wants to provide fantastic customer service.",2013-10-14
US,52012464,R385HM4FD63TWG,B000VKY0NW,818655383,"LD Remanufactured HP 02 Cartridges : BLK C8721WN, C C8771WN, M C8772WN, Y C8773WN, LT C C8774WN, LT M C8775WN",Office Products,5,3,5,N,Y,As good as the original,This non-HP cartridge seems to work just like the HP version. No problems so far.,2008-08-07
US,51768784,R15KP6MITXBO9L,B00GH9N2EE,629745590,"Pyle Digital Voice Recorder with USB and PC Interface, Micro SD Slot, 4GB Built-In Memory and Headphone Jack",Office Products,4,0,0,N,Y,Really nice. I've owned several digital recorders,"Really nice. I've owned several digital recorders, by different makers, and this one is as good as any, at a lower price than any.",2015-05-09
US,33563974,RRNAURP9NPH0G,B0058P4JXQ,112324700,Van Gogh Almond Blossoms Mouse Pad Mousepad -Fine Art - Ideal Gift for all occassions!,Office Products,4,2,2,N,Y,functional and nice to look at,Good size and a pleasing design. The surface is fabric so you don't get the cheap scratchy plastic noise when you move your mouse. I am satisfied with this purchase.,2011-11-13
US,12592954,R2JDXFAE80B122,B006Q6HTDI,278858267,RioRand Remote Control Wireless Presentation Presenter Mouse For Windows system,Office Products,3,1,1,N,Y,Its ok.,"For the price you get what you get.<br />Its cheap material.<br />It came apart the 2nd time of use and was able to put it back together.<br />Overall, it does the job but I have feeling I will be looking for a new remote soon.",2014-02-15
US,29433152,R23UG5TRJCNN5S,B003LL1RUC,784033968,Mickey Spiral Notebook & Pen Set,Office Products,1,0,0,N,N,not very good quality,not very well quality...picture on front looks old..very small in size. I wish I did not make this purchase..unhappy customer,2013-09-17
US,12928411,R2GQAGG9YRBCOY,B008MHAHA6,421981030,"LG Optimus 4X HD P880 Black Factory Unlocked International Version by New Generation Products LLC.,",Home Entertainment,4,0,0,N,Y,It's good price,"It's a good cell phone, much lighter than Iphone 4, but the lifetime of battery is not long as expected.But it's best value anyway and I like it.",2012-12-29


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [23]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


gp_segment_id,count
0,1726796
1,1726368


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [24]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,1878 MB,Disk space used by the table or index.
demo,amzn_reviews,1881 MB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

6 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize,uncompressedsize,compressionpercentage
demo,amzn_reviews_ao_co_zlib3,731 MB,96 kB,192 kB,731 MB,0 bytes,1718 MB,57.44
demo,amzn_reviews_ao_ro_zlib3,885 MB,3744 kB,192 kB,889 MB,0 bytes,1810 MB,50.87
demo,amzn_reviews_heap,1878 MB,3776 kB,0 bytes,1881 MB,0 bytes,1881 MB,0.0
demo,calendar,64 kB,0 bytes,0 bytes,64 kB,0 bytes,64 kB,0.0
demo,sales,64 kB,96 kB,0 bytes,160 kB,0 bytes,160 kB,0.0
demo,sales2,64 kB,96 kB,0 bytes,160 kB,0 bytes,160 kB,0.0


### 4.6. Check table for Data Skew
Data skew may be caused by uneven data distribution due to a poor choice of distribution keys, or by single-tuple table insert or copy operations. Data skew at the table level is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects not only scan (read) performance but all other query execution operations as well, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query shows the maximum and minimum number of rows per segment, as well as the percentage difference between them:

In [28]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,1726796,1726368,0.0247857882459769


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around the average of a data series. The lower the value, the better; higher values indicate greater data skew.

In [29]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0175283712182706
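
The coefficient above can be checked by hand from the per-segment counts in section 4.4. The sketch below assumes the sample standard deviation (n - 1 in the denominator) and a result expressed as a percentage. Both assumptions are inferred from the fact that they reproduce the value shown, not quoted from the view's definition, and `skew_coefficient` is a hypothetical helper name.

```python
import statistics

def skew_coefficient(rows_per_segment):
    """Coefficient of variation of per-segment row counts, as a percentage."""
    mean = statistics.mean(rows_per_segment)
    # Sample standard deviation: divides by (n - 1), not n.
    stdev = statistics.stdev(rows_per_segment)
    return 100.0 * stdev / mean

# Per-segment row counts from section 4.4.
seg_rows = [1726796, 1726368]
cv = skew_coefficient(seg_rows)
print(cv)
```

A perfectly even distribution gives a coefficient of 0, and the value grows as the segment counts spread further from their average.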


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of computational skew. The `siffraction` column holds this value, and a high value points to uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

In [30]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0001239289412298
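
The fraction above is consistent with comparing the average segment size to the largest one: during a scan every segment waits for the largest, so the idle share is the gap between the maximum and the average, divided by the maximum. The sketch below uses that formula with the counts from section 4.4. It is inferred from the numbers shown rather than taken from the view's definition, and `idle_fraction` is a hypothetical helper name.

```python
def idle_fraction(rows_per_segment):
    """Fraction of scan capacity left idle: (max - average) / max."""
    biggest = max(rows_per_segment)
    average = sum(rows_per_segment) / float(len(rows_per_segment))
    return (biggest - average) / biggest

# Per-segment row counts from section 4.4.
seg_rows = [1726796, 1726368]
fraction = idle_fraction(seg_rows)
print(fraction)

# Made-up counts for a badly skewed two-segment table, for contrast.
skewed = idle_fraction([900, 100])
print(skewed)
```

The demo table sits far below the 10% threshold mentioned above, while the made-up 900/100 split would leave about 44% of the system idle.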


## Continue to Part 3 of Greenplum Demo: **[MPP Fundamentals and Partitioning](GP-demo-3.ipynb)**.