# Greenplum Database Concepts Explained - Part 2

This is Part 2 of Greenplum Database Concepts Explained, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or want to revisit it, click [here](AWS-GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string so that regex escapes such as \S are not read as string escapes.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
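
The regular expression above is fine for simple connection strings, but its greedy `\S+` groups can assign the wrong parts when the password itself contains a `:`. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can extract the same five parts. The helper name `parse_connection_string` and the example URL are hypothetical; the notebook reads the real value from `AWSGPDBCONN`.

```python
from urllib.parse import urlsplit, unquote

def parse_connection_string(url):
    """Split a postgresql:// URL into (user, password, host, port, dbname)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// connection string")
    return (
        unquote(parts.username or ""),
        unquote(parts.password or ""),
        parts.hostname,
        parts.port,                 # int, or None if the URL has no port
        parts.path.lstrip("/"),     # database name without the leading slash
    )

# Hypothetical example value, not the real cluster address.
user, pwd, host, port, dbname = parse_connection_string(
    "postgresql://gpadmin:s3cret@gp.example.com:5432/gpadmin"
)
```

Note that `port` comes back as an integer here, whereas the regex version yields a string.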

## 4. Basic Table Functions

### 4.1. DESCRIBE the *demo.amzn_reviews* table using the psql utility (`\d <table name>`)

In [2]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE the *demo.amzn_reviews* table using the `information_schema` catalog.

In [3]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the _demo.amzn_reviews_ table data (10 rows).

In [4]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,44480318,RWBD4MW6QBWJA,B00IK810HQ,614124732,Sisters of Treason (The Tudor Trilogy),Digital_Ebook_Purchase,4,3,3,N,N,Sisters of Treason,"Jane Grey is executed after reigning only for nine days and her family is left behind tainted as traitors. The remaining Grey sisters grew up in the court under the suspicious eyes of the queen(s) but their mother’s confidante, Levina, looks after them.<br /><br />The story is told by 3 people: Katherine Grey, Mary Grey and court painter Levina Teerlinc and it worked well for this book.<br /><br />My favourite was definitely Mary and I really wished it could have ended happily for her. Being crook backed definitely didn’t make things easy for her and people can be so cruel. And yet she remained kind and gentle despite everything. Her sister’s death had deep impact on her and taught how dangerous it can be to have royal blood in your veins.<br /><br />Katherine was the type that thinks with her heart and not with her head, and it can be dangerous when you’re so close to the throne. She was little shallow and empty headed and I wished she would have listened Mary’s warnings. Her chapters were my least favourite and I think the weakest link in the book.<br /><br />I really liked how the sisters’ mother Frances Grey was portrayed. She was shown as caring and loving mother who deeply mourned her daughter and it was nice to see her friendship with Levina who was “just” a court painter and not noble born.<br /><br />This was truly enjoyable book and I look forward reading The Queen’s Gambit which I already own.",2014-08-16
US,23693521,R1C9221P1TSIO,B0060AY8K2,471047680,The Wolf Gift: The Wolf Gift Chronicles (1),Digital_Ebook_Purchase,3,0,0,N,Y,the wolf gift,"Parts of the \""Wolf Gift were really good\"" It would make a good movie. It goes into a lot of detail , sometimes gory , but seems realistic. But the ending went on forever, I thought it would never end.",2012-10-24
US,19590922,R2TG7AY3LY10E1,B001800I44,356631611,WEN Sweet Almond Mint Cleansing Conditioner 32oz,Beauty,5,0,0,N,Y,Wen is worth every penny! A healing shampoo that conditions and does not strip hair or fade color,I process by hair by foiling it blonde and Wen after one use repairs any existing damage and coats each strand providing instant moisture and repair. It also protects your color from fade and provides some thermal heat protection. Wen is worth every penny you spend on it!,2013-08-29
US,26434168,R3595FZRQ6KT69,B008AL9VXI,493629973,Apple MD564ZM/A USB 2.0 SuperDrive [Old Model],PC,5,0,0,N,Y,Easy,"As usual, it was really easy to set up my superdrives for both the Apple iMac and MacBook Air. They work perfectly every time.",2013-06-02
US,44286736,RKL4WWZXVJJDL,B00CAA19IK,732911039,70&#039;s Feathered Hair Wig Costume Accessory,Apparel,5,0,0,N,Y,Super!,Excellent quality wig! It looks sensational! Better than the picture!<br /><br />I would recommend this wig to anyone dressing up for a 70's get together or for Halloween.,2014-05-16
US,51650493,R363YTEO45KTHL,B00005NIMR,989169486,Belkin FireWire Cable,PC,4,0,0,N,Y,Four Stars,Love it no issues will buy from this person again.,2014-08-24
US,15447066,RXA5SYPIFX7TW,B000R8KBOK,144376864,Ulta-Lit Tree Co-Import 01203-CD Christmas Light Repair Tool,Home,5,0,0,N,Y,Works great - this is my second one,Works great - this is my second one. I gave one to the church because they do a big light display each year. It helped save a few items From the trash. I use it on our pre-lit tree. I See they also make one for the LED strings.,2014-11-29
US,50465366,R2IPLMHTAH30U3,B004HD6E4M,675518066,Mark of Royalty,Digital_Ebook_Purchase,5,0,0,N,Y,Well written and entertaining.,"I have enjoyed this book very much. It's a good story with action and romance. The characters are portrayed well. I especially think the step father is done quite well. Kind of deteriorating in character after the death of his wife whom he loved. Not a villain, but a very misguided man. I like the other characters more, of course. Probably reads more for young adults, but I think many adults would like it, as well.",2015-03-08
US,39082384,R1S8AGZOHD1SI1,B004MME6SY,104332691,Panasonic SC-EN17 CD Micro System (Discontinued by Manufacturer),Electronics,5,1,1,N,Y,Gaint Compact,Panasonic doesn't drop the ball with this! Great compact system I keep on kitchen counter to play cd's & radio. It stylish and good with decor.,2011-09-09
US,20702973,R1ZJUEQ409Z8YG,0544003411,669379389,The Lord of the Rings,Books,5,0,0,N,Y,Five Stars,god,2014-12-31


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

24 rows affected.


gp_segment_id,count
0,4298868
1,4296740
2,4291976
3,4297200
4,4300819
5,4297083
6,4301096
7,4298972
8,4297364
9,4296533
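
A quick way to read a result like this is to summarise the per-segment counts. The sketch below hard-codes the ten rows displayed above; the query reported 24 rows in total, so the figures are illustrative only, and the variable names are hypothetical.

```python
# Per-segment row counts copied from the ten rows displayed above.
# The full result covers 24 segments, so this is an illustration only.
segment_rows = {
    0: 4298868, 1: 4296740, 2: 4291976, 3: 4297200, 4: 4300819,
    5: 4297083, 6: 4301096, 7: 4298972, 8: 4297364, 9: 4296533,
}

total = sum(segment_rows.values())
mean = total / len(segment_rows)
largest = max(segment_rows, key=segment_rows.get)    # segment id with most rows
smallest = min(segment_rows, key=segment_rows.get)   # segment id with fewest rows

for seg, rows in sorted(segment_rows.items()):
    # Deviation of each segment from the mean, as a percentage.
    deviation = 100 * (rows - mean) / mean
    print(f"segment {seg}: {rows:>9,d} rows ({deviation:+.3f}% vs mean)")
print(f"largest: segment {largest}, smallest: segment {smallest}")
```

Deviations of a small fraction of a percent, as here, suggest the distribution key spreads rows evenly.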


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [6]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,59 GB,Disk space used by the table or index.
demo,amzn_reviews,60 GB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,59 GB,196 MB,0 bytes,60 GB,0 bytes


### 4.6. Check table for Data Skew
Data skew is an uneven distribution of rows across segments, typically caused by a poor choice of distribution key or by single-tuple table insert or copy operations. Table-level data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data slows scans (reads), and it also affects every other query execution operation, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query reports the largest and smallest per-segment row counts for the table, along with the percentage difference between them:

In [8]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,4301096,4291976,0.2120389779721261
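
The script itself is not reproduced here, but the reported figure is consistent with computing (max - min) / max × 100. The sketch below checks that arithmetic against the values shown above; the function name `pct_diff_max_min` is hypothetical.

```python
def pct_diff_max_min(max_rows, min_rows):
    """Percentage by which the smallest segment falls short of the largest."""
    if max_rows == 0:
        return 0.0  # empty table: treat as no skew
    return 100.0 * (max_rows - min_rows) / max_rows

# Max and min segment row counts taken from the query output above.
pct = pct_diff_max_min(4301096, 4291976)
print(f"{pct:.4f}%")  # about 0.2120%
```

A difference of roughly 0.2% between the fullest and emptiest segment indicates a well-balanced table.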


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) of the data stored on each segment. The `skccoeff` column holds the CV, which is the standard deviation divided by the average, so it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0529644564132474
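
To make the definition concrete, here is a minimal sketch of a coefficient of variation over per-segment row counts. It assumes the population standard deviation and no extra scaling; the view may use a different estimator or express the result as a percentage, so the number is not expected to match the output above. The function name and the use of only the ten segment counts displayed in section 4.4 are illustrative choices.

```python
from statistics import mean, pstdev

def coefficient_of_variation(counts):
    """Standard deviation divided by the average (assumption: population stddev)."""
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0

# Ten of the 24 per-segment row counts, copied from section 4.4.
visible_counts = [4298868, 4296740, 4291976, 4297200, 4300819,
                  4297083, 4301096, 4298972, 4297364, 4296533]

cv = coefficient_of_variation(visible_counts)
print(f"CV = {cv:.6f}")
```

A perfectly even distribution gives a CV of 0, and the value grows as segments diverge from the average.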


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the fraction of the system that is idle during a table scan, reported in the `siffraction` column. This is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.

In [10]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0007849840288769
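
One way to reason about this metric is sketched below. It assumes the idle fraction is one minus the ratio of the average to the maximum rows per segment, meaning the capacity left unused while the fullest segment finishes its scan. The source does not state the view's exact formula, so treat this as an assumption; the function names and sample counts are hypothetical.

```python
def idle_fraction(counts):
    """Assumed formula: 1 - (total rows / (largest segment * number of segments))."""
    biggest = max(counts)
    if biggest == 0:
        return 0.0  # empty table: nothing to scan, no skew
    return 1.0 - sum(counts) / (biggest * len(counts))

def needs_review(counts, threshold=0.10):
    """Flag tables whose idle fraction exceeds the 10% guideline."""
    return idle_fraction(counts) > threshold

balanced = [100, 100, 100, 100]   # hypothetical even distribution
skewed = [100, 100, 100, 700]     # hypothetical hot segment

print(idle_fraction(balanced), needs_review(balanced))
print(round(idle_fraction(skewed), 3), needs_review(skewed))
```

Under this assumption, the value reported for `demo.amzn_reviews` is far below the 10% guideline, so its distribution policy needs no change.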


## Continue to Part 3 of Greenplum Database Concepts Explained: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**