# NBK Demo, 07/2019

## Step 0 - System and Connection Check
- Start with gpstate. Use jupyter, dbeaver or pgadmin for queries.
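- Check that *gp_autostats_mode* is set to **NONE**. This avoids automatic `ANALYZE` runs during loading and is required for one of the later steps that runs `EXPLAIN`. Note that `SET` changes the value only for the current session; loads started from other sessions (for example `gpload` over `ssh`) still use the server-wide setting.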

In [3]:
import os, re
connection_string = os.getenv('GPDBCONN')

# Raw string so that \S is passed to the regex engine untouched; '/' needs no escaping.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', connection_string)

db_usr      = cs.group(1)
db_pwd      = cs.group(2)
db_host_ip   = cs.group(3)
db_host_port = cs.group(4)
db_host_db   = cs.group(5)
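
The regular expression above relies on greedy matching and does not decode percent-escaped characters in the user name or password. As an alternative, here is a minimal sketch that uses the standard library's `urllib.parse`, assuming Python 3. The helper name `parse_gp_conn` and the sample URL are illustrative assumptions, not part of the demo scripts.

```python
from urllib.parse import urlsplit, unquote


def parse_gp_conn(url):
    """Split a postgresql:// URL into its components (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// URL")
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),  # decodes e.g. %40 to @
        "host": parts.hostname,
        "port": parts.port,  # int, or None when the URL omits the port
        "database": parts.path.lstrip("/"),
    }


# Sample URL made up for illustration; the demo reads the real one from GPDBCONN.
conn = parse_gp_conn("postgresql://gpadmin:s%40cret@10.0.2.15:5432/gpadmin")
print(conn["user"], conn["host"], conn["port"], conn["database"])
```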

In [35]:
%reload_ext sql
%sql $connection_string

u'Connected: gpadmin@gpadmin'

In [36]:
%%sql
SHOW gp_autostats_mode;

 * postgresql://gpadmin:***@10.0.2.15:5432/gpadmin
1 rows affected.


gp_autostats_mode
ON_NO_STATS


In [37]:
%%sql
SET gp_autostats_mode = 'NONE';

 * postgresql://gpadmin:***@10.0.2.15:5432/gpadmin
Done.


[]

In [8]:
%%sql
SELECT version();

 * postgresql://gpadmin:***@10.0.2.15:5432/gpadmin
1 rows affected.


version
"PostgreSQL 8.3.23 (Greenplum Database 5.21.0 build commit:27db6bab4c909daa8d6699d94cabc48f87b07fab) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jul 12 2019 23:39:01"


## Step 1. The Amazon Customer Reviews Dataset

More than 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote and escape characters). Samples of the data are available in English and French; more details on the information in each column can be found [here](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt).

If you use the AWS Command Line Interface, you can list data in the bucket with the `ls` command: 

```aws s3 ls s3://amazon-reviews-pds/tsv/```

To download data using the AWS Command Line Interface, you can use the `cp` command. For instance, the following command will copy the file named `amazon_reviews_us_Camera_v1_00.tsv.gz` to your local directory:

```aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Camera_v1_00.tsv.gz .```

For our demo, we choose to download three files under the `/home/gpadmin/data/` folder, using the `aws s3 cp <S3 File> <Local File>` command described above:
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz) (~185MB)
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz) (~22MB)
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz) (~489MB)
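
The three downloads can be scripted. The following is a minimal sketch that only builds the `aws s3 cp <S3 File> <Local File>` command lines for the files listed above; the helper name `build_cp_commands` is an assumption introduced for illustration, and the commands still have to be run on the Greenplum host.

```python
import shlex

BUCKET_PREFIX = "s3://amazon-reviews-pds/tsv/"
LOCAL_DIR = "/home/gpadmin/data/"
FILES = [
    "amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz",
    "amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz",
    "amazon_reviews_us_Office_Products_v1_00.tsv.gz",
]


def build_cp_commands(files, prefix=BUCKET_PREFIX, dest=LOCAL_DIR):
    """Return one `aws s3 cp <S3 File> <Local File>` command per file."""
    return [
        "aws s3 cp {} {}".format(shlex.quote(prefix + name), shlex.quote(dest + name))
        for name in files
    ]


cmds = build_cp_commands(FILES)
for cmd in cmds:
    print(cmd)
```

Note that the control file in Step 3 reads `/home/gpadmin/data/amzn_reviews*.tsv.gz`; if the files keep their original `amazon_reviews_us_…` names, that pattern would need to be adjusted.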

## Step 2. Create Database Table to hold the Dataset

### Create the Schema (optional) and the Database Table to hold the dataset, as shown below:

In [50]:
!cat gp-demo/script/2-1-create-db-schema-table.sql

DROP SCHEMA IF EXISTS demo CASCADE;

CREATE SCHEMA demo;

DROP TABLE IF EXISTS demo.amzn_reviews;


CREATE TABLE demo.amzn_reviews(
  marketplace TEXT, 
  customer_id TEXT, 
  review_id TEXT, 
  product_id TEXT, 
  product_parent TEXT, 
  product_title TEXT, 
  product_category TEXT, 
  star_rating TEXT, 
  helpful_votes TEXT, 
  total_votes TEXT, 
  vine TEXT, 
  verified_purchase TEXT, 
  review_headline TEXT, 
  review_body TEXT, 
  review_date TEXT)
DISTRIBUTED BY (review_id);


In [52]:
query = !cat gp-demo/script/2-1-create-db-schema-table.sql

%sql {''.join(query)}

 * postgresql://gpadmin:***@10.0.2.15:5432/gpadmin
Done.
Done.
Done.
Done.


[]

In [53]:
!cat gp-demo/script/2-2-count-table.sql

SELECT COUNT(*) FROM demo.amzn_reviews;


In [54]:
query = !cat gp-demo/script/2-2-count-table.sql
%sql {''.join(query)}

 * postgresql://gpadmin:***@10.0.2.15:5432/gpadmin
1 rows affected.


count
0


## Step 3. Load dataset into the database using `gpload`.

**gpload** is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum Database parallel file server (*gpfdist*), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database. 

You can declare more than one input file, as long as all of the files share the same format. Files compressed with gzip or bzip2 (with a `.gz` or `.bz2` extension) are uncompressed automatically, provided that `gunzip` or `bunzip2` is in your path. The control file can also declare the schema of the source data, basic transformations, custom delimiter and escape characters, and more. For the full list of options, see the gpload utility reference in the [Pivotal Greenplum Database Documentation](https://gpdb.docs.pivotal.io/latest) (*Pivotal Greenplum Documentation* > *Utility Guide* > *Management Utility Reference* > *gpload*).

The operation, including any SQL commands specified in the SQL collection of the YAML control file, is performed as a single transaction to prevent inconsistent data when multiple load operations run simultaneously against a target table.

For our demo, we use the **gpload_amzn_reviews.yaml** control file, shown below:

In [55]:
!cat gp-demo/script/3-1-gpload-amzn-reviews.yaml

VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/amzn_reviews*.tsv.gz
    - FORMAT: text
    - HEADER: true
    - LOG_ERRORS: true
    - MAX_LINE_LENGTH: 1000000
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.amzn_reviews
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true


In [59]:
!scp gp-demo/script/3-1-gpload-amzn-reviews.yaml $db_usr@$db_host_ip:gpload_amzn_reviews.yaml
!ssh $db_usr@$db_host_ip 'gpload -d gpadmin -f /home/gpadmin/gpload_amzn_reviews.yaml 2>&1 \
    | tee /home/gpadmin/gpload_amzn_reviews.log'

3-1-gpload-amzn-reviews.yaml                  100%  353   446.0KB/s   00:00    
2019-07-26 14:52:03|INFO|gpload session started 2019-07-26 14:52:03
2019-07-26 14:52:03|INFO|no host supplied, defaulting to localhost
2019-07-26 14:52:03|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amzn_reviews*.tsv.gz" -t 30 -m 1000000
2019-07-26 14:52:03|INFO|did not find an external table to reuse. creating ext_gpload_reusable_ee043088_afb4_11e9_8538_080027acd876
2019-07-26 14:52:49|WARN|134 bad rows
2019-07-26 14:52:49|WARN|Please use following query to access the detailed error
2019-07-26 14:52:49|WARN|select * from gp_read_error_log('ext_gpload_reusable_ee043088_afb4_11e9_8538_080027acd876') where cmdtime > to_timestamp('1564152723.37')
2019-07-26 14:52:49|INFO|running time: 45.85 seconds
2019-07-26 14:52:49|INFO|rows Inserted          = 3453164
2019-07-26 14:52:49|INFO|rows Updated           = 0
2019-07-26 14:52:49|INFO|data formatting errors = 134


### Check `gpload` execution

Check the `gpload` output (shown above and also saved to `/home/gpadmin/gpload_amzn_reviews.log`) to confirm that the data loaded successfully and to identify any messages that require attention or action.
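
The counters at the end of the log can also be read programmatically. Below is a minimal sketch that extracts them with a regular expression; the sample text is copied from the Step 3 output, while the helper name `summarize_gpload_log` is an assumption and the pattern only covers the three counter lines seen in that output.

```python
import re

SAMPLE_LOG = """\
2019-07-26 14:52:49|WARN|134 bad rows
2019-07-26 14:52:49|INFO|running time: 45.85 seconds
2019-07-26 14:52:49|INFO|rows Inserted          = 3453164
2019-07-26 14:52:49|INFO|rows Updated           = 0
2019-07-26 14:52:49|INFO|data formatting errors = 134
"""

# Matches lines such as "...|INFO|rows Inserted          = 3453164".
SUMMARY_RE = re.compile(
    r"\|INFO\|(rows Inserted|rows Updated|data formatting errors)\s*=\s*(\d+)"
)


def summarize_gpload_log(text):
    """Collect the numeric counters gpload prints at the end of a run."""
    return {name: int(value) for name, value in SUMMARY_RE.findall(text)}


summary = summarize_gpload_log(SAMPLE_LOG)
print(summary)
```

The `rows Inserted` value is the number to compare against the `COUNT(*)` result in the next check.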

### 1. Check the data has been properly loaded, by confirming row count shown above:

In [None]:
%%sql SELECT COUNT(*) 
FROM demo.amzn_reviews;

### 2. Check data formatting errors and row counts, if identified by the `gpload` execution log:

In [None]:
%sql SELECT COUNT(*) \
    FROM gp_read_error_log('ext_gpload_reusable_ee043088_afb4_11e9_8538_080027acd876') \
    WHERE cmdtime > to_timestamp('1564152723.37')

In [None]:
%sql SELECT * \
    FROM gp_read_error_log('ext_gpload_reusable_ee043088_afb4_11e9_8538_080027acd876') \
    WHERE cmdtime > to_timestamp('1564152723.37')
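
The external table name and the timestamp change on every `gpload` run, so hard-coded values go stale quickly. As a hedged sketch, the query can be derived from the `WARN` line that `gpload` prints (the sample line below is taken from the Step 3 output); `error_log_count_query` is a hypothetical helper, not part of `gpload`.

```python
import re

WARN_LINE = (
    "2019-07-26 14:52:49|WARN|select * from "
    "gp_read_error_log('ext_gpload_reusable_ee043088_afb4_11e9_8538_080027acd876') "
    "where cmdtime > to_timestamp('1564152723.37')"
)

# Captures the external table name and the epoch timestamp from the hint.
QUERY_RE = re.compile(
    r"gp_read_error_log\('([^']+)'\) where cmdtime > to_timestamp\('([\d.]+)'\)"
)


def error_log_count_query(log_line):
    """Build the COUNT(*) variant of the query that gpload suggests."""
    match = QUERY_RE.search(log_line)
    if match is None:
        raise ValueError("no gp_read_error_log hint found in log line")
    table, timestamp = match.groups()
    return (
        "SELECT COUNT(*) FROM gp_read_error_log('{}') "
        "WHERE cmdtime > to_timestamp('{}')"
    ).format(table, timestamp)


count_query = error_log_count_query(WARN_LINE)
print(count_query)
```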

## Step 4. Familiarize yourself with the newly loaded data table

### 1. DESCRIBE *demo.amzn_reviews* table using psql utility.

In [None]:
from IPython.display import display_html

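# Use the variables parsed from the connection string in Step 0.
psql_cmd = !psql -H -h $db_host_ip -p $db_host_port -U $db_usr -d $db_host_db -c '\d demo.amzn_reviews'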

display_html(''.join(psql_cmd), raw=True)

### 2. DESCRIBE *demo.amzn_reviews* table using *information_schema* database catalog table.

In [None]:
%%sql
SELECT *
FROM information_schema.columns
WHERE table_schema = 'demo'
  AND table_name = 'amzn_reviews'
ORDER BY ordinal_position;

### 3. Retrieve a sample of the demo.amzn_reviews table data (10 rows).

In [None]:
%%sql
SELECT * FROM demo.amzn_reviews LIMIT 10;

### 4. Show *demo.amzn_reviews* table data distribution across segments:

In [None]:
%%sql
SELECT gp_segment_id, count(*) FROM demo.amzn_reviews GROUP BY 1 ORDER BY 1;
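
One way to summarize the per-segment counts is the ratio of the largest segment to the average, where 1.0 means a perfectly even distribution. The sketch below assumes Python 3 and uses made-up counts for a two-segment system; `skew_ratio` is a hypothetical helper, and the real input would be the result of the query above.

```python
def skew_ratio(rows_per_segment):
    """Return max/average rows per segment; 1.0 means perfectly even."""
    counts = list(rows_per_segment.values())
    average = sum(counts) / len(counts)
    return max(counts) / average


# Made-up counts keyed by gp_segment_id; replace with the query result.
example = {0: 1726000, 1: 1727164}
print(round(skew_ratio(example), 4))
```

A ratio well above 1.0 suggests that the distribution key sends a disproportionate share of rows to one segment.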

## Step 5. Partitioning

### 1. Create a new copy of the original table, define a *PARTITION* pattern (by month) and load it.

In [None]:
%%sql 

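DROP TABLE IF EXISTS demo.amzn_reviews_v2;

CREATE TABLE demo.amzn_reviews_v2(
  marketplace TEXT, 
  customer_id BIGINT, 
  review_id TEXT, 
  product_id TEXT, 
  product_parent BIGINT, 
  product_title TEXT, 
  product_category TEXT, 
  star_rating INTEGER, 
  helpful_votes INTEGER, 
  total_votes INTEGER, 
  vine TEXT, 
  verified_purchase TEXT, 
  review_headline TEXT, 
  review_body TEXT, 
  review_date DATE)
DISTRIBUTED BY (review_id)
PARTITION BY RANGE(review_date) 
(
    START ('1998-07-01'::date) END ('2015-09-01'::date)
    EVERY ('1 month'::interval)
);

-- demo.amzn_reviews stores every column as TEXT, so cast explicitly
-- to the typed columns of the partitioned table.
INSERT INTO demo.amzn_reviews_v2
SELECT marketplace, customer_id::BIGINT, review_id, product_id,
       product_parent::BIGINT, product_title, product_category,
       star_rating::INTEGER, helpful_votes::INTEGER, total_votes::INTEGER,
       vine, verified_purchase, review_headline, review_body,
       review_date::DATE
FROM demo.amzn_reviews;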
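
The `START ... END ... EVERY ('1 month')` clause creates one child partition per month, with the start of each range inclusive and the end exclusive. As a rough illustration, the sketch below enumerates those monthly ranges in plain Python so that the expected number of partitions can be checked; `month_ranges` is a hypothetical helper and assumes the start date falls on the first day of a month.

```python
from datetime import date


def month_ranges(start, end):
    """Yield (range_start, range_end) pairs for one-month ranges in [start, end)."""
    current = start
    while current < end:
        # Roll over to January of the next year after December.
        year = current.year + (current.month // 12)
        month = current.month % 12 + 1
        following = date(year, month, 1)
        yield current, min(following, end)
        current = following


ranges = list(month_ranges(date(1998, 7, 1), date(2015, 9, 1)))
print(len(ranges), ranges[0], ranges[-1])
```

Rows whose `review_date` falls outside the declared range have no matching partition, so it is worth checking the minimum and maximum dates in the source table before loading.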

### 2. Show row count per partition for the new table.

In [None]:
%%sql
SELECT tableoid::regclass, count(*) FROM demo.amzn_reviews_v2 GROUP BY 1 ORDER BY 1;

### 3. Demonstrate *Partition Elimination* functionality

In [None]:
!psql -d gpadmin -U gpadmin -h 10.0.2.15 -f './scripts/explain_example_1_1.sql'

In [None]:
psql_out = !psql -H -d gpadmin -U gpadmin -h 10.0.2.15 -f './scripts/example_1_2.sql'

display_html(''.join(psql_out), raw=True)

In [None]:
%%sql
EXPLAIN
SELECT COUNT(*)
    , date_part('year', review_date::DATE) AS YEAR_NUM
    , date_part('month', review_date::DATE) AS MONTH_NUM
FROM demo.amzn_reviews_v2
GROUP BY 2, 3
ORDER BY 2, 3;
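
The query above has no filter on `review_date`, so every partition is a candidate for scanning. With a range predicate on the partition key, only the partitions that overlap the range need to be read. The sketch below is an idealized model of that arithmetic for the monthly layout from Step 5.1, not a reproduction of what the planner does; `partitions_scanned` and `month_index` are hypothetical helpers.

```python
from datetime import date


def month_index(d):
    """Map a date to a running month number."""
    return d.year * 12 + (d.month - 1)


def partitions_scanned(filter_start, filter_end,
                       first=date(1998, 7, 1), last=date(2015, 9, 1)):
    """Count monthly partitions overlapping [filter_start, filter_end)."""
    lo = max(filter_start, first)
    hi = min(filter_end, last)
    if lo >= hi:
        return 0
    # hi is exclusive, so a boundary on the first of a month
    # does not touch that month.
    last_month = month_index(hi) - (1 if hi.day == 1 else 0)
    return last_month - month_index(lo) + 1


# A filter covering calendar year 2014 versus no effective filter at all.
print(partitions_scanned(date(2014, 1, 1), date(2015, 1, 1)))
print(partitions_scanned(date(1990, 1, 1), date(2020, 1, 1)))
```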

## Step 6. Compression

### Populate the three tables:
- Load heap table with gpload (gpload_h.yaml):

```yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/crimes_all.txt
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.fact_crimes_heap
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```

In [None]:
%%sql

DROP TABLE IF EXISTS demo.fact_crimes_heap;

CREATE TABLE demo.fact_crimes_heap
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
distributed by (id);

DROP TABLE IF EXISTS demo.fact_crimes_row_comp;

CREATE TABLE demo.fact_crimes_row_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3)
distributed by (id);


DROP TABLE IF EXISTS demo.fact_crimes_col_comp;

CREATE TABLE demo.fact_crimes_col_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=3)
distributed by (id);

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'gpload -d gpadmin -f /home/gpadmin/gpload_h.yaml > /home/gpadmin/gpload_h.log 2>&1'


**Note:** The heap table loaded the source file in under 33 seconds. Loading performance differs between heap and compressed tables.

- Load the **demo.fact_crimes_row_comp** table with data from the **heap** table above, and check the timing:

In [None]:
%%sql
TRUNCATE TABLE demo.fact_crimes_row_comp;

`INSERT INTO demo.fact_crimes_row_comp SELECT * FROM demo.fact_crimes_heap;`

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql $connection_string -f insert_into_row_comp.sql'

- Load the **demo.fact_crimes_col_comp** table with data from the **heap** table above, and check the timing:

In [None]:
%%sql
TRUNCATE TABLE demo.fact_crimes_col_comp;

`INSERT INTO demo.fact_crimes_col_comp SELECT * FROM demo.fact_crimes_heap;`

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql $connection_string -f insert_into_col_comp.sql'

### Check the size of each of the three tables:

In [None]:
%%sql
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_heap')) AS TABLESIZE, 'demo.fact_crimes_heap' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_row_comp')) AS TABLESIZE, 'demo.fact_crimes_row_comp' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_col_comp')) AS TABLESIZE, 'demo.fact_crimes_col_comp' AS TABLENAME;

**Notes:** 
- The heap table has no compression. It is best for staging tables or when frequent updates/deletes are needed.
- The row-oriented table has the best compression in this demo. It is best for frequent inserts and `SELECT`s on all or most of the columns.
- The column-oriented table also compresses better than the heap table, though not as well as the row-oriented table here. It is best for static partitions/tables and `SELECT`s on few columns.
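
To put numbers on the comparison, the raw byte sizes (for example from `pg_relation_size` without `pg_size_pretty`) can be turned into compression ratios relative to the heap table. The sketch below assumes Python 3 and uses placeholder sizes invented for illustration, not measurements from this demo; `compression_ratios` is a hypothetical helper.

```python
def compression_ratios(sizes_bytes, baseline="demo.fact_crimes_heap"):
    """Return baseline_size / table_size for every table except the baseline."""
    base = sizes_bytes[baseline]
    return {
        name: base / size
        for name, size in sizes_bytes.items()
        if name != baseline
    }


# Placeholder sizes in bytes; substitute the values returned by the database.
sizes = {
    "demo.fact_crimes_heap": 1000,
    "demo.fact_crimes_row_comp": 200,
    "demo.fact_crimes_col_comp": 250,
}
ratios = compression_ratios(sizes)
for table, ratio in sorted(ratios.items()):
    print("{}: {:.1f}x smaller than heap".format(table, ratio))
```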

## Step 7. EXPLAIN Plans and Statistics

In [None]:
%%sql
EXPLAIN SELECT location_desc
	, count(case_number)
FROM
	demo.fact_crimes_heap
WHERE
	crime_date >= '2014-01-01'
	AND crime_date < '2015-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql $connection_string -f explain_select.sql'



**Notes:**
- Copy the `EXPLAIN` plan created above and paste it into http://planchecker.cfapps.io/.
- The Planchecker app provides recommendations about collecting statistics. Highlight this as a recommendation the database provides for optimization.
- Use the `ANALYZE` command to collect statistics for the optimizer; missing or stale statistics can produce bad plans.
- Use the `analyzedb` utility and schedule it to run frequently (for example, every day) to collect statistics only on tables/partitions that changed since the last run.
- The same utility can also be stopped and resumed easily.
- The DBA does not need to define separate statistics collection policies for different types of tables/partitions.

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql $connection_string -c "analyze demo.fact_crimes_heap"'



In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql $connection_string -f explain_select.sql'

