# Greenplum Database Concepts Explained - Part 2

This is Part 2 of Greenplum Database Concepts Explained, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or want to revisit it, click [here](AWS-GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string avoids invalid-escape warnings for \S; '/' needs no escaping in a Python regex.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
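The regular expression above is enough for a simple connection string. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can extract the same five parts. The helper name `parse_connection_string` and the example URL below are made up for illustration and are not part of the original notebook.

```python
from urllib.parse import urlsplit


def parse_connection_string(url):
    """Split a postgresql:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,                # int, or None if absent
        "dbname": parts.path.lstrip("/"),  # path is "/<dbname>"
    }


# Illustrative URL only; the real value comes from the AWSGPDBCONN variable.
example = parse_connection_string(
    "postgresql://gpadmin:secret@gp-master.example.com:5432/gpadmin"
)
print(example["user"], example["host"], example["port"], example["dbname"])
```

Unlike the regex groups, `port` comes back as an integer here, and a malformed URL surfaces as an exception or a `None` field rather than as an `AttributeError` on a failed match.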

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [2]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING
display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema.columns` view.

In [3]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
gpadmin,demo,amzn_reviews,product_parent,5,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,5,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,customer_id,2,,YES,bigint,,,64.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int8,,,,,2,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,total_votes,10,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,10,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,helpful_votes,9,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,9,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,star_rating,8,,YES,integer,,,32.0,2.0,0.0,,,,,,,,,,,,,gpadmin,pg_catalog,int4,,,,,8,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
gpadmin,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,gpadmin,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the *demo.amzn_reviews* table data (10 rows).

In [4]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

10 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
US,12254726,R9GDAII80EZXR,B002JM27PS,911457122,CalExoticsLife-Like Curve Stroker - Tight Ass,Health & Personal Care,2,7,12,N,Y,A bit lacking.,"After trying this toy, I found myself left wanting. The penetration is a little difficult to do without paying attention, given the lack of structure. The design of the interior doesn't lend itself well towards retaining lube, though the sensations are fine when generously lubricated. The textures are nothing special, though. The hole in the back is small enough to allow the passage of air without sacrificing grip. At this price point, the [[ASIN:B000V1OPYA Tenga Deep Throat Cup]] is a better value, I feel.",2013-08-23
US,52638353,R326SRLP18YR2K,B001B8EUEW,83272512,Microsoft Wireless Keyboard 6000,PC,1,6,9,N,N,I'd give it negative stars,"If you type no more then 15 to 20 words a minute you can use this keyboard. Any more then that and its totally frustrating. Even with the USB only 2' away from the keyboard, the lag time makes this keyboard virtually unusable. Total waste of money and it wasn't cheap either.",2009-07-31
US,49854407,RH09Z81Z3KH11,B003YHF4AC,120433442,Rubbermaid Reveal Spray Mop Kit,Home,5,6,7,N,N,Absolutely worth trying!!!!,"I just picked one of these Rubbermaid mops up at Target earlier this week for twenty bucks during my mini-shopping spree for new cleaning products. I hate mopping, and usually have to wait until nap time or when the kiddos are strapped into highchairs and off the floor to actually clean. But with this mop, I can sweep first (love love love that the microfiber and dusting mops can be switched out!) and then quickly spray and mop when I have the time(the spraying range isn't a problem if you angle the handle just right and point to where you need cleaning solution).<br /><br />I use both a traditional mop/bucket and Swiffer, so I'm definitely impressed with the Rubbermaid. I'm already pulling out this mop multiple times daily for a quick sweep (I actually remove the cleaning bottle until it's needed) or to mop up a small spill. The great thing about the microfiber mop is its \""sharp\"" edges - absolutely picks up all the stray pet hair and dustbunnies trapped in the edges and corners (it even reached way further under the fridge and oven than either Swiffer or old fashioned brooms ever did). Contrary to what others have said about this mop being flimsy, I can definitely tell the difference between Rubbermaid (it's honestly very sturdy, though one has to be a little delicate with the rotating head and handle) and predecessors like Swiffer (how sturdy can something boxed in pieces be???) and O'Cedar (got too expensive having to constantly replace the entire handle and mop just because the 'side holders' broke).<br /><br />I'm definitely happy with my purchase and hope it can last the test of my busy household!",2011-03-18
US,9893228,RDOV9BK6X84WC,B006FO87N2,87083978,Diamond in the Rough: A Memoir,Digital_Ebook_Purchase,3,15,19,N,Y,Kind of a downer,"I had just finished reading Carole King's memoir, \""Natural Woman\"". Much of her story is about her insecurities and the bad choices of men she's made in her life, and the bad situations they led her into. But there's enough good stuff, and enough about her music career, to have made it, in the end, an enjoyable read. Shawn Colvin's \""Diamond in the Rough\"" is similar, but it spends more time on Shawn's troubles with men and drugs than it does on her music. Like Shawn Colvin, I'm a native South Dakotan, so it was fun to read about the girl who grew up just down the road. And I've always loved her music. But she has had a hard life, mostly because of her insecurities and the unhealthy relationships she's been in. Like Carole. I recommended this book to a friend, who had also lived in South Dakota, and after a few pages she asked, \""And why did you think I would like this book?\"" And yet, though it's a downer, I enjoyed the glimmers of wit and the insight into Shawn Colvin's creative life. She mentions a performance at the Minnesota State Fair, where she went onstage in a deep funk, expecting little response from the corn-dog-munching audience. But she was surprised and delighted when the audience wrapped her in their love. Shawn, I was in that audience. What we gave you was real, sincere, and heartfelt. And we're still here for you.",2012-07-19
US,33698170,R2UTTHWZC9VZA1,B00FQKI3W6,940671580,Viva Naturals,Health & Personal Care,5,0,0,N,Y,Five Stars,Vitamins came on time and price is excellent,2015-08-24
US,2435491,R3QABRFX82O6Q0,B00020S7XK,852949495,"Sony ICF-S10MK2 Pocket AM/FM Radio, Silver",Electronics,5,0,0,N,Y,This is the only place I've been able to find ...,This is the only place I've been able to find them.. They are 1 of the 3 things my mentally challenged nephew enjoys. He carries them wherever he goes. They are also the first thing he throws when having a spell. We go through several a year. Hopefully you will continue manufacturing them.,2015-05-25
US,13883720,R3NQ09XOF4BUNQ,B0085DOBX6,160319154,"Every Day Is an Atheist Holiday!: More Magical Tales from the Bestselling Author of God, No!",Digital_Ebook_Purchase,3,0,2,N,Y,Vulgarity Not Needed,"Mr. Jillete detracts from his otherwise entertaining style, by the &#34;shock jock&#34; inserts. Big man with big feelings about life.",2013-04-24
US,17298435,RDPBOA9OOOEWX,B000CN2OIQ,679140852,Chanel No. 5 by Chanel 1.7 oz Eau de Parfum Spray Classic Bottle (Unboxed),Beauty,3,0,0,N,Y,"Good perfume, but not great smell","The perfume is good, but not as great as the popularity of this brand and name. The lasting power is also not so good (atleast for me). I would not like to repeat this perfume in future. However it was delivered on time. So again full marks to Amazon for prompt delivery.",2012-04-23
US,1368862,RBU8XIR7L00I3,B00JG8GOWU,816234934,"Kindle Paperwhite, 6"" High-Resolution Display (212 ppi) with Built-in Light, Wi-Fi",PC,5,0,0,N,Y,Five Stars,Love it,2015-01-11
US,14163454,R1RWU8WXSYIUXH,B0072LAAA0,773879572,Motorola MT350R FRS Weatherproof Two-Way,Wireless,4,0,0,N,Y,Four Stars,Good value. Very stylish and easy to use.,2015-05-16


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

24 rows affected.


gp_segment_id,count
0,4298868
1,4296740
2,4291976
3,4297200
4,4300819
5,4297083
6,4301096
7,4298972
8,4297364
9,4296533
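The SQL script itself is not reproduced in this export. Judging by the output columns, it groups rows by Greenplum's `gp_segment_id` system column. The query text below is an assumption along those lines, and the Python part only imitates the same grouping on made-up data.

```python
from collections import Counter

# Assumed shape of the query; the real script file is not shown here.
ASSUMED_QUERY = """
SELECT gp_segment_id, count(*)
FROM demo.amzn_reviews
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
"""

# Toy stand-in for the table: one entry per row, holding the id of the
# segment that stores that row.
row_segments = [0, 1, 2, 0, 1, 2, 0, 1, 0]

# Equivalent of GROUP BY gp_segment_id with count(*).
rows_per_segment = Counter(row_segments)

for segment_id in sorted(rows_per_segment):
    print(segment_id, rows_per_segment[segment_id])
```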


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [6]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,59 GB,Disk space used by the table or index.
demo,amzn_reviews,60 GB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,59 GB,196 MB,0 bytes,60 GB,0 bytes


### 4.6. Check table for Data Skew
Data skew is an uneven distribution of data across segments, typically caused by a poor choice of distribution key or by single-tuple insert or copy operations. Because it is present at the table level, data skew is often the root cause of poor query performance and out-of-memory conditions. Skewed data slows scans (reads), and it also affects every other query execution operation, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.

The following query reports the largest and smallest per-segment row counts and the percentage difference between them:

In [8]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,4301096,4291976,0.2120389779721261
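The percentage in the last column can be reproduced from the two row counts. This is a quick check, assuming the script divides the gap between maximum and minimum by the maximum; the script is not shown here, but that formula matches the reported figure.

```python
# Values taken from the query output above.
max_seg_rows = 4301096
min_seg_rows = 4291976

# Assumed formula: gap between fullest and emptiest segment, relative to
# the fullest segment, expressed as a percentage.
pct_diff = (max_seg_rows - min_seg_rows) / max_seg_rows * 100

print(round(pct_diff, 4))  # 0.212
```

A difference of about 0.2% between the fullest and emptiest segment means the table is very evenly distributed.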


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) of the amount of data stored on each segment. The `skccoeff` column holds the CV, which is the standard deviation divided by the average, so it accounts for both the average and the variability around it. The lower the value, the better; higher values indicate greater data skew.

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,coefficient
demo,amzn_reviews,0.0529644564132474
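To make the coefficient of variation concrete, the sketch below computes it for the ten per-segment row counts listed in section 4.4. The table is spread over 24 segments and only ten are shown, so the result is not expected to equal the view's value. Comparing magnitudes does suggest that the view reports the CV on a percentage scale (ratio × 100); that is an inference from these numbers, not something the notebook states.

```python
import statistics

# Per-segment row counts for the ten segments listed in section 4.4
# (only a subset of the 24 segments).
segment_rows = [
    4298868, 4296740, 4291976, 4297200, 4300819,
    4297083, 4301096, 4298972, 4297364, 4296533,
]

mean_rows = statistics.mean(segment_rows)
stdev_rows = statistics.stdev(segment_rows)  # sample standard deviation

cv_ratio = stdev_rows / mean_rows  # standard deviation / average
cv_percent = cv_ratio * 100        # same value on a percentage scale

print(f"CV as a ratio:      {cv_ratio:.6f}")
print(f"CV as a percentage: {cv_percent:.4f}")
```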


- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the fraction of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. The `siffraction` column holds this value: 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

In [10]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0.0007849840288769
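The intuition behind the idle fraction is that a parallel scan finishes only when the fullest segment finishes, so every other segment sits idle for the difference. A minimal sketch, assuming the fraction is computed as `1 - average / maximum` of the per-segment row counts (the view definition is not shown in this notebook), and using only the ten segments listed in section 4.4:

```python
# Per-segment row counts for the ten segments listed in section 4.4
# (only a subset of the 24 segments).
segment_rows = [
    4298868, 4296740, 4291976, 4297200, 4300819,
    4297083, 4301096, 4298972, 4297364, 4296533,
]

avg_rows = sum(segment_rows) / len(segment_rows)
max_rows = max(segment_rows)

# Assumed formula: share of scan capacity left unused while the fullest
# segment is still working.
idle_fraction = 1 - avg_rows / max_rows

# Threshold from the text: more than 10% skew warrants a review.
SKEW_THRESHOLD = 0.10
needs_review = idle_fraction > SKEW_THRESHOLD

print(f"idle fraction: {idle_fraction:.6f}")
print(f"review distribution policy: {needs_review}")
```

The result lands in the same range as the value reported by the view, far below the 10% threshold.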


## Continue to Part 3 of Greenplum Database Concepts Explained: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**.