# Greenplum Demo (Part 4)

This is Part 4 of the Greenplum Demo: ***Table Storage Models***.

- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or wish to repeat it, click [here](GP-demo-1.ipynb).
- If you missed Part 2 (*Basic Table Functions*) or wish to repeat it, click [here](GP-demo-2.ipynb).
- If you missed Part 3 (*MPP Fundamentals and Partitioning*) or wish to repeat it, click [here](GP-demo-3.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('GPDBCONN')

# Raw string avoids invalid-escape warnings; "/" needs no escaping in a Python regex.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)
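
The regular expression in the cell above splits the connection URL into its five parts. As an alternative sketch, assuming Python 3, the standard library's `urllib.parse.urlsplit` can do the same split. The function name `parse_connection_string` and the example URL are hypothetical and are not part of the demo. Unlike the regex, this version returns the port as an integer.

```python
from urllib.parse import urlsplit


def parse_connection_string(conn):
    """Split a postgresql:// URL into (user, password, host, port, database)."""
    parts = urlsplit(conn)
    if parts.scheme != 'postgresql' or parts.hostname is None:
        raise ValueError('expected postgresql://user:password@host:port/dbname')
    # The database name is the URL path without its leading slash.
    return (parts.username, parts.password, parts.hostname,
            parts.port, parts.path.lstrip('/'))


# Hypothetical connection string, for illustration only.
print(parse_connection_string('postgresql://gpadmin:secret@gpdb-host:5432/gpadmin'))
```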

In [2]:
%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'

In [3]:
query = !cat script/7-db-maintenance.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

Done.
Done.
Done.
1 rows affected.
Done.


[]

## 7. Table Storage Models

### 7.1. Comparing Greenplum Table Storage Models: Loading

Re-create the Amazon Reviews table using three different table storage models: a heap table, an append-optimized (AO) row-oriented table with zlib (level 3) compression, and an append-optimized (AO) column-oriented table with zlib (level 3) compression, as shown below:

In [4]:
sqlfilecode1 = !pygmentize -f html -O full,style=colorful -l postgres script/7-1-amzn-reviews-heap.sql
sqlfilecode2 = !pygmentize -f html -O full,style=colorful -l postgres script/7-1-amzn-reviews-ao-ro-zlib3.sql
sqlfilecode3 = !pygmentize -f html -O full,style=colorful -l postgres script/7-1-amzn-reviews-ao-co-zlib3.sql

display_html('\n'.join(sqlfilecode1), raw=True)
display_html('\n'.join(sqlfilecode2), raw=True)
display_html('\n'.join(sqlfilecode3), raw=True)

query1 = !cat script/7-1-amzn-reviews-heap.sql
query2 = !cat script/7-1-amzn-reviews-ao-ro-zlib3.sql
query3 = !cat script/7-1-amzn-reviews-ao-co-zlib3.sql

%sql $DB_USER@$DB_SERVER {''.join(query1)}
%sql $DB_USER@$DB_SERVER {''.join(query2)}
%sql $DB_USER@$DB_SERVER {''.join(query3)}

Done.
Done.
Done.
Done.
Done.
Done.


[]
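
The contents of the three script files are not reproduced in this text. As a rough sketch of how the three storage models differ at table-creation time, the snippet below builds one `CREATE TABLE` statement per model. The `appendonly`, `orientation`, `compresstype` and `compresslevel` storage options are Greenplum's; the two-column schema, the distribution key and the helper name `create_table_sql` are assumptions made for illustration, so the actual scripts may differ.

```python
# Storage options per model. The option names are Greenplum storage
# parameters; everything else in this sketch is illustrative.
STORAGE_OPTIONS = {
    'heap': [],
    'ao_ro_zlib3': [('appendonly', 'true'), ('orientation', 'row'),
                    ('compresstype', 'zlib'), ('compresslevel', '3')],
    'ao_co_zlib3': [('appendonly', 'true'), ('orientation', 'column'),
                    ('compresstype', 'zlib'), ('compresslevel', '3')],
}

# Hypothetical column subset and distribution key, not the demo's real schema.
COLUMNS = [('review_id', 'text'), ('star_rating', 'int')]
DISTRIBUTION_KEY = 'review_id'


def create_table_sql(model):
    """Build a CREATE TABLE statement for one of the three storage models."""
    options = STORAGE_OPTIONS[model]
    columns = ', '.join('{} {}'.format(name, typ) for name, typ in COLUMNS)
    with_clause = ''
    if options:  # a heap table needs no WITH clause
        pairs = ', '.join('{}={}'.format(key, value) for key, value in options)
        with_clause = ' WITH ({})'.format(pairs)
    return 'CREATE TABLE demo.amzn_reviews_{} ({}){} DISTRIBUTED BY ({});'.format(
        model, columns, with_clause, DISTRIBUTION_KEY)


for model in ('heap', 'ao_ro_zlib3', 'ao_co_zlib3'):
    print(create_table_sql(model))
```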

Load the input dataset into each table using the `gpload` utility, and compare the loading times.

In [5]:
!scp script/7-1-gpload-amzn-reviews-heap.yaml $DB_USER@$DB_SERVER:gpload-amzn-reviews-heap.yaml
!ssh $DB_USER@$DB_SERVER 'gpload -d gpadmin -f /home/gpadmin/gpload-amzn-reviews-heap.yaml 2>&1 \
    | tee /home/gpadmin/gpload-amzn-reviews-heap.log'

7-1-gpload-amzn-reviews-heap.yaml             100%  374   211.2KB/s   00:00    
2019-08-05 12:28:06|INFO|gpload session started 2019-08-05 12:28:06
2019-08-05 12:28:06|INFO|no host supplied, defaulting to localhost
2019-08-05 12:28:06|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amzn_reviews*.tsv.gz" -t 30 -m 1000000
2019-08-05 12:28:06|INFO|did not find an external table to reuse. creating ext_gpload_reusable_7a152d56_b77c_11e9_aa41_080027acd876
2019-08-05 12:29:06|WARN|134 bad rows
2019-08-05 12:29:06|WARN|Please use following query to access the detailed error
2019-08-05 12:29:06|WARN|select * from gp_read_error_log('ext_gpload_reusable_7a152d56_b77c_11e9_aa41_080027acd876') where cmdtime > to_timestamp('1565008086.35')
2019-08-05 12:29:06|INFO|running time: 60.38 seconds
2019-08-05 12:29:06|INFO|rows Inserted          = 3453164
2019-08-05 12:29:06|INFO|rows Updated           = 0
2019-08-05 12:29:06|INFO|data formatting errors = 134


In [6]:
!scp script/7-1-gpload-amzn-reviews-ao-ro-zlib3.yaml $DB_USER@$DB_SERVER:gpload-amzn-reviews-ao-ro-zlib3.yaml
!ssh $DB_USER@$DB_SERVER 'gpload -d gpadmin -f /home/gpadmin/gpload-amzn-reviews-ao-ro-zlib3.yaml 2>&1 \
    | tee /home/gpadmin/gpload-amzn-reviews-ao-ro-zlib3.log'

7-1-gpload-amzn-reviews-ao-ro-zlib3.yaml      100%  380    96.6KB/s   00:00    
2019-08-05 12:29:12|INFO|gpload session started 2019-08-05 12:29:12
2019-08-05 12:29:12|INFO|no host supplied, defaulting to localhost
2019-08-05 12:29:12|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amzn_reviews*.tsv.gz" -t 30 -m 1000000
2019-08-05 12:29:12|INFO|did not find an external table to reuse. creating ext_gpload_reusable_a17e29ec_b77c_11e9_9118_080027acd876
2019-08-05 12:30:07|WARN|134 bad rows
2019-08-05 12:30:07|WARN|Please use following query to access the detailed error
2019-08-05 12:30:07|WARN|select * from gp_read_error_log('ext_gpload_reusable_a17e29ec_b77c_11e9_9118_080027acd876') where cmdtime > to_timestamp('1565008152.47')
2019-08-05 12:30:07|INFO|running time: 54.68 seconds
2019-08-05 12:30:07|INFO|rows Inserted          = 3453164
2019-08-05 12:30:07|INFO|rows Updated           = 0
2019-08-05 12:30:07|INFO|data formatting errors = 134


In [7]:
!scp script/7-1-gpload-amzn-reviews-ao-co-zlib3.yaml $DB_USER@$DB_SERVER:gpload-amzn-reviews-ao-co-zlib3.yaml
!ssh $DB_USER@$DB_SERVER 'gpload -d gpadmin -f /home/gpadmin/gpload-amzn-reviews-ao-co-zlib3.yaml 2>&1 \
    | tee /home/gpadmin/gpload-amzn-reviews-ao-co-zlib3.log'

7-1-gpload-amzn-reviews-ao-co-zlib3.yaml      100%  381    89.8KB/s   00:00    
2019-08-05 12:30:32|INFO|gpload session started 2019-08-05 12:30:32
2019-08-05 12:30:32|INFO|no host supplied, defaulting to localhost
2019-08-05 12:30:32|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amzn_reviews*.tsv.gz" -t 30 -m 1000000
2019-08-05 12:30:32|INFO|did not find an external table to reuse. creating ext_gpload_reusable_d0e92894_b77c_11e9_ad93_080027acd876
2019-08-05 12:31:22|WARN|134 bad rows
2019-08-05 12:31:22|WARN|Please use following query to access the detailed error
2019-08-05 12:31:22|WARN|select * from gp_read_error_log('ext_gpload_reusable_d0e92894_b77c_11e9_ad93_080027acd876') where cmdtime > to_timestamp('1565008232.03')
2019-08-05 12:31:22|INFO|running time: 50.86 seconds
2019-08-05 12:31:22|INFO|rows Inserted          = 3453164
2019-08-05 12:31:22|INFO|rows Updated           = 0
2019-08-05 12:31:22|INFO|data formatting errors = 134


In [8]:
cmd = 'grep -e '"'"'running'"'"' /home/gpadmin/gpload-amzn-reviews*\
    | awk '"'"'BEGIN{FS=":"} {print $1, "finished in", $5}'"'"'' 
grep_output = !ssh $DB_USER@$DB_SERVER $cmd | pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(grep_output), raw=True)
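
The same comparison can be made locally from the `running time` lines in the three `gpload` logs above. The sketch below copies those lines verbatim, extracts the seconds, and derives a rows-per-second figure from the reported row count. The helper name `running_seconds` is an assumption. Each load was run only once, so small differences may not be significant.

```python
import re

# 'running time' lines copied from the three gpload logs above.
LOG_LINES = {
    'heap': '2019-08-05 12:29:06|INFO|running time: 60.38 seconds',
    'ao_ro_zlib3': '2019-08-05 12:30:07|INFO|running time: 54.68 seconds',
    'ao_co_zlib3': '2019-08-05 12:31:22|INFO|running time: 50.86 seconds',
}
ROWS_INSERTED = 3453164  # identical for all three loads


def running_seconds(line):
    """Extract the elapsed seconds from a gpload 'running time' log line."""
    match = re.search(r'running time: ([\d.]+) seconds', line)
    if match is None:
        raise ValueError('not a running-time line: {!r}'.format(line))
    return float(match.group(1))


load_times = {name: running_seconds(line) for name, line in LOG_LINES.items()}

for name, seconds in sorted(load_times.items(), key=lambda item: item[1]):
    print('{:<12} {:6.2f} s  {:>9,.0f} rows/s'.format(
        name, seconds, ROWS_INSERTED / seconds))
```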

### 7.2. Comparing Greenplum Table Storage Models: Table Size and Disk Space Usage

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-2-table-size-comparison.sql
display_html('\n'.join(sqlfilecode), raw=True)
query = !cat script/7-2-table-size-comparison.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

3 rows affected.


tablename,tablesize,total_tablesize
demo.amzn_reviews_heap,1878 MB,1881 MB
demo.amzn_reviews_ao_ro_zlib3,885 MB,889 MB
demo.amzn_reviews_ao_co_zlib3,731 MB,731 MB
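
To make the size difference easier to read, the sketch below expresses each table's size as a fraction of the heap table's size, using the `tablesize` values from the result above. The helper name `size_relative_to` is an assumption.

```python
# 'tablesize' values (in MB) copied from the query result above.
TABLE_SIZES_MB = {
    'demo.amzn_reviews_heap': 1878,
    'demo.amzn_reviews_ao_ro_zlib3': 885,
    'demo.amzn_reviews_ao_co_zlib3': 731,
}


def size_relative_to(baseline, sizes):
    """Return each table's size as a fraction of the baseline table's size."""
    base = float(sizes[baseline])
    return {name: size / base for name, size in sizes.items()}


ratios = size_relative_to('demo.amzn_reviews_heap', TABLE_SIZES_MB)

for name, ratio in ratios.items():
    print('{:<32} {:6.1%} of heap size'.format(name, ratio))
```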


### 7.3. Comparing Greenplum Table Storage Models: Query Performance

#### 7.3.1. Narrow (*Few columns of the table*) `SELECT`

In [10]:
sqlfilecode1 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-narrow-select-heap.sql
display_html('\n'.join(sqlfilecode1), raw=True)
cmd1 = !echo $(cat script/7-3-narrow-select-heap.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd1), raw=True)

sqlfilecode2 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-narrow-select-ao-ro.sql
display_html('\n'.join(sqlfilecode2), raw=True)
cmd2 = !echo $(cat script/7-3-narrow-select-ao-ro.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd2), raw=True)

sqlfilecode3 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-narrow-select-ao-co.sql
display_html('\n'.join(sqlfilecode3), raw=True)
cmd3 = !echo $(cat script/7-3-narrow-select-ao-co.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd3), raw=True)
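
Each timing above comes from the `Total runtime` line that `grep` picks out of the `psql` output. As a sketch, the same value can be extracted in Python so the three storage models can be compared numerically. The helper name `total_runtime_ms` and the sample text are hypothetical. The pattern assumes the line has the form `Total runtime: <number> ms`, which may differ between Greenplum versions.

```python
import re


def total_runtime_ms(explain_output):
    """Return the 'Total runtime' value in milliseconds, or None if absent."""
    match = re.search(r'Total runtime:\s*([\d.]+)\s*ms', explain_output)
    return float(match.group(1)) if match else None


# Hypothetical tail of an EXPLAIN ANALYZE result, for illustration only.
sample = ' Slice statistics:\n   (slice0) ...\n Total runtime: 1234.567 ms\n'
print(total_runtime_ms(sample))
```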

#### 7.3.2. Super Narrow (*1 column of the table*) `SELECT`

In [11]:
sqlfilecode1 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-super-narrow-select-heap.sql
display_html('\n'.join(sqlfilecode1), raw=True)
cmd1 = !echo $(cat script/7-3-super-narrow-select-heap.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd1), raw=True)

sqlfilecode2 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-super-narrow-select-ao-ro.sql
display_html('\n'.join(sqlfilecode2), raw=True)
cmd2 = !echo $(cat script/7-3-super-narrow-select-ao-ro.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd2), raw=True)

sqlfilecode3 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-super-narrow-select-ao-co.sql
display_html('\n'.join(sqlfilecode3), raw=True)
cmd3 = !echo $(cat script/7-3-super-narrow-select-ao-co.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd3), raw=True)

#### 7.3.3. Wide (*Most/Many columns of the table*) `SELECT`

In [12]:
sqlfilecode1 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-wide-select-heap.sql
display_html('\n'.join(sqlfilecode1), raw=True)
cmd1 = !echo $(cat script/7-3-wide-select-heap.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd1), raw=True)

sqlfilecode2 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-wide-select-ao-ro.sql
display_html('\n'.join(sqlfilecode2), raw=True)
cmd2 = !echo $(cat script/7-3-wide-select-ao-ro.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd2), raw=True)

sqlfilecode3 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-wide-select-ao-co.sql
display_html('\n'.join(sqlfilecode3), raw=True)
cmd3 = !echo $(cat script/7-3-wide-select-ao-co.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd3), raw=True)

#### 7.3.4. Aggregate/Window Functions over a limited number of columns

In [13]:
sqlfilecode1 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-aggr-select-heap.sql
display_html('\n'.join(sqlfilecode1), raw=True)
cmd1 = !echo $(cat script/7-3-aggr-select-heap.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd1), raw=True)

sqlfilecode2 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-aggr-select-ao-ro.sql
display_html('\n'.join(sqlfilecode2), raw=True)
cmd2 = !echo $(cat script/7-3-aggr-select-ao-ro.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd2), raw=True)

sqlfilecode3 = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-aggr-select-ao-co.sql
display_html('\n'.join(sqlfilecode3), raw=True)
cmd3 = !echo $(cat script/7-3-aggr-select-ao-co.sql | \
               psql $CONNECTION_STRING | \
               grep -e 'Total runtime') | \
    pygmentize -f html -O full,style=colorful -l postgres
display_html('\n'.join(cmd3), raw=True)