# Greenplum Database Concepts Explained - Part 2

This is Part 2 of Greenplum Database Concepts Explained, ***Basic Table Functions***.
- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or want to revisit it, click [here](AWS-GP-demo-1.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

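CONNECTION_STRING = os.getenv('AWSGPDBCONN')
if not CONNECTION_STRING:
    raise RuntimeError('The AWSGPDBCONN environment variable is not set')

# Expected form: postgresql://user:password@host:port/dbname
# A raw string avoids invalid-escape warnings; '/' needs no escaping in Python.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)
if cs is None:
    raise ValueError('AWSGPDBCONN must look like postgresql://user:password@host:port/dbname')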

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

## 4. Basic Table Functions

### 4.1. DESCRIBE *demo.amzn_reviews* table using psql utility (`\d <table name>`)

In [2]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l psql script/4-1-psql-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

psql_out = !cat script/4-1-psql-describe-amzn-reviews.sql | psql -H $CONNECTION_STRING

display_html(''.join(psql_out), raw=True)

Column,Type,Collation,Nullable,Default
marketplace,text,,,
customer_id,bigint,,,
review_id,text,,,
product_id,text,,,
product_parent,bigint,,,
product_title,text,,,
product_category,text,,,
star_rating,integer,,,
helpful_votes,integer,,,
total_votes,integer,,,


### 4.2. DESCRIBE *demo.amzn_reviews* table using the `information_schema.columns` catalog view.

In [4]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-2-gp-describe-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-2-gp-describe-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

15 rows affected.


table_catalog,table_schema,table_name,column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length,character_octet_length,numeric_precision,numeric_precision_radix,numeric_scale,datetime_precision,interval_type,interval_precision,character_set_catalog,character_set_schema,character_set_name,collation_catalog,collation_schema,collation_name,domain_catalog,domain_schema,domain_name,udt_catalog,udt_schema,udt_name,scope_catalog,scope_schema,scope_name,maximum_cardinality,dtd_identifier,is_self_referencing,is_identity,identity_generation,identity_start,identity_increment,identity_maximum,identity_minimum,identity_cycle,is_generated,generation_expression,is_updatable
dev,demo,amzn_reviews,review_date,15,,YES,date,,,,,,0.0,,,,,,,,,,,,dev,pg_catalog,date,,,,,15,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,marketplace,1,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,1,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,review_id,3,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,3,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,product_id,4,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,4,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,product_title,6,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,6,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,product_category,7,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,7,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,vine,11,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,11,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,verified_purchase,12,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,12,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,review_headline,13,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,13,NO,NO,,,,,,,NEVER,,YES
dev,demo,amzn_reviews,review_body,14,,YES,text,,1073741824.0,,,,,,,,,,,,,,,,dev,pg_catalog,text,,,,,14,NO,NO,,,,,,,NEVER,,YES


### 4.3. Retrieve a sample of the *demo.amzn_reviews* table data (10 rows).

In [5]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-3-select-sample-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-3-select-sample-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

0 rows affected.


marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date


### 4.4. Show *demo.amzn_reviews* table data distribution across segments:

In [6]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-4-data-distrib-amzn-reviews.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-4-data-distrib-amzn-reviews.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

0 rows affected.


gp_segment_id,count


### 4.5. *demo.amzn_reviews* Table Size and Disk Space Usage

#### 4.5.1. Using PostgreSQL System Administration Functions (PG 8.4)

In [7]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-1-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-1-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

2 rows affected.


schemaname,tablename,size,level
demo,amzn_reviews,0 bytes,Disk space used by the table or index.
demo,amzn_reviews,2336 kB,"Total disk space used by the table, including indexes and toasted data."


#### 4.5.2. Using the `gp_toolkit` Administrative Schema (Greenplum 5.x)

In [8]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-5-2-object-size-and-disk-space.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-5-2-object-size-and-disk-space.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

21 rows affected.


schema,relation,tablesize,toastsize,othersize,tabledisksize,indexsize
demo,amzn_reviews,0 bytes,2336 kB,0 bytes,2336 kB,0 bytes
demo,amzn_reviews_1,86 GB,259 MB,0 bytes,86 GB,0 bytes
demo,amzn_reviews_1_1,86 GB,259 MB,0 bytes,86 GB,0 bytes
demo,amzn_reviews_1_2,83 GB,260 MB,4640 kB,83 GB,0 bytes
demo,amzn_reviews_2,86 GB,259 MB,0 bytes,86 GB,0 bytes
demo,amzn_reviews_3,86 GB,259 MB,0 bytes,86 GB,0 bytes
demo,amzn_reviews_3_1,86 GB,259 MB,0 bytes,86 GB,0 bytes
demo,amzn_reviews_summary,32 kB,2336 kB,0 bytes,2368 kB,0 bytes
public,test100,1760 kB,0 bytes,0 bytes,1760 kB,0 bytes
public,test1000,2304 kB,0 bytes,0 bytes,2304 kB,0 bytes


### 4.6. Check table for Data Skew
Data skew may be caused by uneven data distribution due to a poor choice of distribution key, or by single-tuple table insert or copy operations. Data skew at the table level is often the root cause of poor query performance and out-of-memory conditions. Skewed data affects not only scan (read) performance but also all other query execution operations, such as joins and GROUP BY operations.

It is very important to *validate* distributions to ensure that data is evenly distributed after the initial load. It is equally important to *continue* to validate distributions after incremental loads.
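
To make the link between distribution key and skew concrete, here is a minimal Python sketch that places rows on segments by hashing a key. It is an illustration only: `zlib.crc32` stands in for Greenplum's actual hash function, and `rows_per_segment`, `NUM_SEGMENTS`, and the sample keys are hypothetical names and values chosen for this example.

```python
import zlib
from collections import Counter

def rows_per_segment(keys, num_segments):
    """Count rows per segment when each row is placed by hash(key) % num_segments."""
    counts = Counter()
    for key in keys:
        # crc32 is a stand-in hash for illustration, not Greenplum's real one
        seg = zlib.crc32(str(key).encode("utf-8")) % num_segments
        counts[seg] += 1
    # Report every segment, including those that received no rows
    return [counts[seg] for seg in range(num_segments)]

NUM_SEGMENTS = 4  # hypothetical cluster size

# High-cardinality key (for example, a unique review id)
unique_keys = [f"R{i}" for i in range(10_000)]
even = rows_per_segment(unique_keys, NUM_SEGMENTS)

# Low-cardinality key (for example, a two-valued flag): at most two segments get rows
flag_keys = ["Y" if i % 10 else "N" for i in range(10_000)]
skewed = rows_per_segment(flag_keys, NUM_SEGMENTS)

print("unique key:", even)
print("flag key  :", skewed)
```

With the two-valued key, no more than two segments can ever receive rows regardless of cluster size, which is the kind of imbalance the checks in this section are meant to catch.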

The following query shows the maximum and minimum number of rows per segment and the percentage difference between them:

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-1-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-1-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


Table Name,Max Seg Rows,Min Seg Rows,Percentage Difference Between Max & Min
demo.amzn_reviews,,,
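
The same three figures can be reproduced offline from a list of per-segment row counts. The sketch below assumes the percentage is computed as `(max - min) * 100 / max`; the script itself is not shown here, so treat that formula and the helper name `skew_summary` as assumptions.

```python
def skew_summary(segment_row_counts):
    """Return (max_rows, min_rows, pct_diff) for a list of per-segment row counts."""
    if not segment_row_counts:
        # SQL aggregates over zero rows return NULL; None plays that role here
        return None, None, None
    max_rows = max(segment_row_counts)
    min_rows = min(segment_row_counts)
    if max_rows == 0:
        # Avoid dividing by zero when every segment is empty
        return max_rows, min_rows, None
    pct_diff = (max_rows - min_rows) * 100.0 / max_rows
    return max_rows, min_rows, pct_diff

print(skew_summary([1000, 980, 1020, 1000]))  # nearly even distribution
print(skew_summary([4000, 10, 10, 10]))       # one segment holds almost everything
print(skew_summary([]))                       # no per-segment counts at all
```

A small percentage difference suggests an even distribution, while a value close to 100 means at least one segment holds almost no rows compared with the fullest one.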


The `gp_toolkit` schema has two views that you can use to check for skew.
- The `gp_toolkit.gp_skew_coefficients` view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The `skccoeff` column holds the CV, calculated as the standard deviation divided by the average, so it accounts for both the average and the variability around it. Lower values are better; higher values indicate greater data skew.

In [10]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-2-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-2-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

DataError: (psycopg2.errors.NullValueNotAllowed) query string argument of EXECUTE is null
CONTEXT:  PL/pgSQL function gp_toolkit.gp_skew_details(oid) line 51 at OPEN
SQL function "gp_skew_coefficient" statement 1
SQL statement "SELECT *                     FROM
            gp_toolkit.gp_skew_coefficient(skcoid)"
PL/pgSQL function gp_toolkit.__gp_skew_coefficients() line 9 at SQL statement

[SQL: SELECT skcc.skcnamespace as schemaname, skcc.skcrelname as tablename, skcc.skccoeff as coefficient FROM gp_toolkit.gp_skew_coefficients skcc WHERE skcc.skcnamespace || '.' || skcc.skcrelname = 'demo.amzn_reviews' ;]
(Background on this error at: http://sqlalche.me/e/9h9h)
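
Independently of the view, the coefficient of variation described above can be computed from per-segment row counts. This is a minimal sketch: it uses the population standard deviation, and the view may use a sample standard deviation or a different scaling, so the function name `coefficient_of_variation` and the exact numbers are illustrative assumptions.

```python
import statistics

def coefficient_of_variation(segment_row_counts):
    """Standard deviation divided by the mean of per-segment row counts."""
    if not segment_row_counts:
        return None  # nothing to measure
    mean = statistics.mean(segment_row_counts)
    if mean == 0:
        return 0.0  # all segments empty: no skew to report
    # Population standard deviation (an assumption for this sketch)
    return statistics.pstdev(segment_row_counts) / mean

balanced = coefficient_of_variation([100, 100, 100, 100])
lopsided = coefficient_of_variation([0, 0, 0, 400])

print("balanced:", balanced)
print("lopsided:", lopsided)
```

The balanced case yields 0, and the value grows as rows concentrate on fewer segments.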

- The `gp_toolkit.gp_skew_idle_fractions` view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing (computational) skew. The `siffraction` column holds this value as a fraction: for example, 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

In [11]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/4-6-3-data-skew.sql

display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/4-6-3-data-skew.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


schemaname,tablename,fraction
demo,amzn_reviews,0
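
One way to reason about the idle fraction is that a scan takes as long as the fullest segment needs, so every other segment sits idle for the remainder. The sketch below assumes the fraction is `1 - total_rows / (num_segments * max_rows)`; this is an assumption about how such a metric can be defined, not a confirmed copy of the view's logic, and `idle_fraction` and `SKEW_THRESHOLD` are hypothetical names.

```python
def idle_fraction(segment_row_counts):
    """Fraction of total segment capacity left idle while the fullest segment scans."""
    if not segment_row_counts:
        return 0.0
    max_rows = max(segment_row_counts)
    if max_rows == 0:
        return 0.0  # empty table: nothing to scan, nothing idle
    total_rows = sum(segment_row_counts)
    return 1 - total_rows / (len(segment_row_counts) * max_rows)

SKEW_THRESHOLD = 0.10  # the 10% guideline mentioned in the text

for counts in ([100, 100, 100, 100], [100, 50], [400, 0, 0, 0]):
    fraction = idle_fraction(counts)
    flag = "review distribution policy" if fraction > SKEW_THRESHOLD else "ok"
    print(counts, fraction, flag)
```

Under this definition an empty table gives 0, and a perfectly even one does too.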


## Continue to Part 3 of Greenplum Database Concepts Explained: **[MPP Fundamentals and Partitioning](AWS-GP-demo-3.ipynb)**.