# Greenplum Demo, 07/2019

## Step 0. System and Connection Check
- Start with `gpstate` to confirm the cluster is running. Use Jupyter, DBeaver or pgAdmin for queries.
- Check that *gp_autostats_mode* is set to **NONE**. This avoids automatic ANALYZE time during loading and is required for one of the later steps that runs EXPLAIN.

In [1]:
import os, re
from IPython.display import display_html

CONNECTION_STRING = os.getenv('GPDBCONN')

# Raw string avoids invalid-escape warnings; '/' needs no escaping in Python regexes
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)
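
As an alternative to the regular expression above, the standard library's `urllib.parse.urlsplit` can extract the same five fields from the connection URL. This is a minimal sketch, assuming Python 3; the `parse_conn` helper and the example URL are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit

def parse_conn(url):
    """Split a postgresql:// URL into (user, password, host, port, dbname)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// URL")
    # parts.path is '/dbname'; strip the leading slash.
    # Note: parts.port is an int, whereas the regex groups are strings.
    return (parts.username, parts.password, parts.hostname,
            parts.port, parts.path.lstrip("/"))

# Hypothetical example value, not the real GPDBCONN setting
conn = parse_conn("postgresql://gpadmin:secret@localhost:5432/gpadmin")
print(conn)
```

Unlike `cs.group(...)`, which fails with an `AttributeError` when the pattern does not match, this variant raises an explicit error for an unexpected scheme.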

In [2]:
%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'

In [3]:
%%sql $DB_USER@$DB_SERVER
SHOW gp_autostats_mode;

1 rows affected.


gp_autostats_mode
ON_NO_STATS


In [4]:
%%sql $DB_USER@$DB_SERVER
SET gp_autostats_mode = 'NONE';

Done.


[]

In [5]:
%%sql $DB_USER@$DB_SERVER
SELECT version();

1 rows affected.


version
"PostgreSQL 8.3.23 (Greenplum Database 5.21.0 build commit:27db6bab4c909daa8d6699d94cabc48f87b07fab) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jul 12 2019 23:39:01"


## Step 1. The Amazon Customer Reviews Dataset

More than 130 million customer reviews are available to researchers as part of this release. The data is available in TSV files in the amazon-reviews-pds S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab delimited, with no quote or escape characters). Samples of the data are available in English and French; more details on the information in each column can be found [here](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt).
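
Because the files are tab delimited with no quote or escape characters, a reader must treat quote characters as ordinary text. The sketch below shows one way to do that with Python's `csv` module. The column names come from the table definition used later in this demo; the `read_reviews` helper and the sample row are made up for illustration.

```python
import csv
import io

# Column names taken from the demo.amzn_reviews table definition.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

def read_reviews(stream):
    """Yield one dict per review line; quotes are kept as literal text."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # the first line holds the column names
    for fields in reader:
        yield dict(zip(header, fields))

# Made-up sample row for illustration only (not real review data)
sample = "\t".join(COLUMNS) + "\n" + "\t".join([
    "US", "1", "R1", "B1", "2", 'A 32" TV', "Home Entertainment",
    "5", "0", "0", "N", "Y", "Great", 'Works "as advertised"', "2015-08-31",
]) + "\n"

rows = list(read_reviews(io.StringIO(sample)))
print(rows[0]["product_title"])
```

For a downloaded file, the same function could be fed from `gzip.open(path, "rt", encoding="utf-8", newline="")` instead of the in-memory sample.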

If you use the AWS Command Line Interface, you can list data in the bucket with the `ls` command: 

```aws s3 ls s3://amazon-reviews-pds/tsv/```

To download data using the AWS Command Line Interface, you can use the `cp` command. For instance, the following command will copy the file named `amazon_reviews_us_Camera_v1_00.tsv.gz` to your local directory:

```aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Camera_v1_00.tsv.gz .```

For our demo, we choose to download three files under the `/home/gpadmin/data/` folder, using the `aws s3 cp <S3 File> <Local File>` command described above:
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz) (~185MB)
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz) (~22MB)
- [`s3://amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz`](s3://amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz) (~489MB)
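
The three downloads can be scripted. The sketch below only builds the `aws s3 cp <S3 File> <Local File>` argument lists; it does not run them. One assumption is labeled in the code: the gpload control file in Step 3 reads `/home/gpadmin/data/amzn_reviews*.tsv.gz`, so the sketch renames the local copies from `amazon_...` to `amzn_...`. The source does not state how the local files were named, and `build_cp_commands` is a hypothetical helper.

```python
import posixpath

BUCKET_PREFIX = "s3://amazon-reviews-pds/tsv/"
LOCAL_DIR = "/home/gpadmin/data/"
CATEGORIES = ["Home_Entertainment", "Mobile_Electronics", "Office_Products"]

def build_cp_commands(categories):
    """Return one `aws s3 cp <S3 File> <Local File>` argument list per category."""
    commands = []
    for category in categories:
        s3_name = "amazon_reviews_us_{}_v1_00.tsv.gz".format(category)
        # ASSUMPTION: rename amazon_ -> amzn_ locally so the files match the
        # amzn_reviews*.tsv.gz pattern in the gpload control file.
        local_name = s3_name.replace("amazon_", "amzn_", 1)
        commands.append([
            "aws", "s3", "cp",
            BUCKET_PREFIX + s3_name,
            posixpath.join(LOCAL_DIR, local_name),
        ])
    return commands

commands = build_cp_commands(CATEGORIES)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the AWS CLI is installed and configured.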

## Step 2. Create Database Table to hold the Dataset

### Create the Schema (optional) and the Database Table to hold the dataset, as shown below:

In [48]:
!cat script/2-1-create-db-schema-table.sql

DROP SCHEMA IF EXISTS demo CASCADE;

CREATE SCHEMA demo;

DROP TABLE IF EXISTS demo.amzn_reviews;


CREATE TABLE demo.amzn_reviews(
  marketplace TEXT, 
  customer_id BIGINT, 
  review_id TEXT,
  product_id TEXT, 
  product_parent BIGINT, 
  product_title TEXT, 
  product_category TEXT, 
  star_rating INTEGER, 
  helpful_votes INTEGER, 
  total_votes INTEGER, 
  vine TEXT, 
  verified_purchase TEXT, 
  review_headline TEXT, 
  review_body TEXT, 
  review_date DATE) 
DISTRIBUTED BY (review_id);


In [49]:
query = !cat script/2-1-create-db-schema-table.sql

%sql $DB_USER@$DB_SERVER {''.join(query)}

Done.
Done.
Done.
Done.


[]

In [50]:
!cat script/2-2-count-table.sql

SELECT COUNT(*) FROM demo.amzn_reviews;


In [51]:
query = !cat script/2-2-count-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
0


## Step 3. Load dataset into the database using `gpload`.

**gpload** is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum Database parallel file server (*gpfdist*), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database. 

You can declare more than one file as input as long as all the specified files share the same format. Additionally, files compressed with gzip or bzip2 (with a .gz or .bz2 file extension) are uncompressed automatically, provided that `gunzip` or `bunzip2` is in your path. You can also declare the schema of the source data files, apply basic transformations, define custom delimiter and escape characters, and more. For the full list of available options, see the gpload utility reference in the [Pivotal Greenplum Database Documentation](https://gpdb.docs.pivotal.io/latest) (*Pivotal Greenplum Documentation* > *Utility Guide* > *Management Utility Reference* > *gpload*).

The operation, including any SQL commands specified in the SQL collection of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.
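
To make the sequence described above concrete, the following sketch builds the kind of SQL such a load boils down to: an external table that points at *gpfdist*, followed by an `INSERT ... SELECT` into the target table. It is an approximation for illustration, not the statements gpload actually generates. The helper name, external table name, host, port and file path are all assumptions; the target table and error limit are borrowed from this demo.

```python
def sketch_load_sql(target, ext_table, host, port, file_pattern, error_limit):
    """Return (create_sql, insert_sql) approximating a gpfdist-based load."""
    create_sql = (
        "CREATE EXTERNAL TABLE {ext} (LIKE {target})\n"
        "LOCATION ('gpfdist://{host}:{port}/{pattern}')\n"
        "FORMAT 'TEXT' (HEADER)\n"
        "LOG ERRORS SEGMENT REJECT LIMIT {limit} ROWS;"
    ).format(ext=ext_table, target=target, host=host, port=port,
             pattern=file_pattern, limit=error_limit)
    insert_sql = "INSERT INTO {target} SELECT * FROM {ext};".format(
        target=target, ext=ext_table)
    return create_sql, insert_sql

create_sql, insert_sql = sketch_load_sql(
    target="demo.amzn_reviews",
    ext_table="ext_amzn_reviews_sketch",  # hypothetical name
    host="localhost", port=8000,          # assumed gpfdist endpoint
    file_pattern="amzn_reviews*.tsv.gz",  # assumed path as served by gpfdist
    error_limit=50000,
)
print(create_sql)
print(insert_sql)
```

The sketch only prints the statements. In a real load, gpload also starts and stops *gpfdist* and wraps the work in a single transaction, as described above.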

For our demo, we use the **gpload_amzn_reviews.yaml** control file (kept locally as `script/3-2-gpload-amzn-reviews.yaml`), shown below:

In [53]:
!cat script/3-2-gpload-amzn-reviews.yaml

VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/amzn_reviews*.tsv.gz
    - FORMAT: text
    - HEADER: true
    - LOG_ERRORS: true
    - MAX_LINE_LENGTH: 1000000
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.amzn_reviews
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
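
The `FILE` entry is a wildcard pattern, so only files whose names start with `amzn_reviews` are picked up. The quick check below uses Python's `fnmatch` as a stand-in for the pattern matching done on the server; treat it as a sketch, since the exact matching rules of *gpfdist* may differ and the renamed file name is hypothetical. It suggests that a file kept under its original S3 name would not match.

```python
from fnmatch import fnmatch

PATTERN = "amzn_reviews*.tsv.gz"  # file-name part of the FILE entry

candidates = [
    "amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz",  # original S3 name
    "amzn_reviews_us_Mobile_Electronics_v1_00.tsv.gz",    # hypothetical renamed copy
]

# Map each candidate file name to whether the pattern would select it
matches = {name: fnmatch(name, PATTERN) for name in candidates}
for name, ok in matches.items():
    print(name, "->", "matched" if ok else "skipped")
```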


### 1. Delete existing error log information for the target table.

In [54]:
!cat script/3-1-delete-error-log-info.sql

SELECT gp_truncate_error_log('demo.amzn_reviews');


In [55]:
query = !cat script/3-1-delete-error-log-info.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


gp_truncate_error_log
True


### 2. Copy the gpload YAML file to the database server and execute it

In [56]:
!scp script/3-2-gpload-amzn-reviews.yaml $DB_USER@$DB_SERVER:gpload_amzn_reviews.yaml
!ssh $DB_USER@$DB_SERVER 'gpload -d gpadmin -f /home/gpadmin/gpload_amzn_reviews.yaml 2>&1 \
    | tee /home/gpadmin/gpload_amzn_reviews.log'

3-2-gpload-amzn-reviews.yaml                  100%  353   480.6KB/s   00:00    
2019-07-29 12:27:27|INFO|gpload session started 2019-07-29 12:27:27
2019-07-29 12:27:27|INFO|no host supplied, defaulting to localhost
2019-07-29 12:27:27|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amzn_reviews*.tsv.gz" -t 30 -m 1000000
2019-07-29 12:27:27|INFO|did not find an external table to reuse. creating ext_gpload_reusable_39fe859e_b1fc_11e9_a1c4_080027acd876
2019-07-29 12:28:25|WARN|134 bad rows
2019-07-29 12:28:25|WARN|Please use following query to access the detailed error
2019-07-29 12:28:25|WARN|select * from gp_read_error_log('ext_gpload_reusable_39fe859e_b1fc_11e9_a1c4_080027acd876') where cmdtime > to_timestamp('1564403247.44')
2019-07-29 12:28:25|INFO|running time: 58.49 seconds
2019-07-29 12:28:25|INFO|rows Inserted          = 3453164
2019-07-29 12:28:25|INFO|rows Updated           = 0
2019-07-29 12:28:25|INFO|data formatting errors = 134
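
The log lines above follow a `timestamp|LEVEL|message` layout, which makes the summary easy to extract programmatically. The following is a minimal sketch: `summarize_gpload_log` is a hypothetical helper, and the sample text is copied from the output above.

```python
import re

LOG_TEXT = """\
2019-07-29 12:28:25|WARN|134 bad rows
2019-07-29 12:28:25|INFO|running time: 58.49 seconds
2019-07-29 12:28:25|INFO|rows Inserted          = 3453164
2019-07-29 12:28:25|INFO|rows Updated           = 0
2019-07-29 12:28:25|INFO|data formatting errors = 134
"""

def summarize_gpload_log(text):
    """Collect the 'name = number' INFO counters and all WARN messages."""
    counters, warnings = {}, []
    for line in text.splitlines():
        parts = line.split("|", 2)  # timestamp | level | message
        if len(parts) != 3:
            continue
        _, level, message = parts
        if level == "WARN":
            warnings.append(message)
        elif level == "INFO":
            m = re.match(r"^(.*?)\s*=\s*(\d+)$", message)
            if m:
                counters[m.group(1)] = int(m.group(2))
    return counters, warnings

counters, warnings = summarize_gpload_log(LOG_TEXT)
print(counters)
print(warnings)
```

The `rows Inserted` counter can then be compared with the `COUNT(*)` result from the target table, and `data formatting errors` with the `ERROR_LIMIT` set in the control file.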


### Check `gpload` execution

Check the `gpload` execution output (shown above and also saved to `/home/gpadmin/gpload_amzn_reviews.log` on the database server) to confirm that the data loaded successfully and to identify any messages that require attention or action:

### 1. Check that the data has been loaded properly by confirming the row count shown above:

In [57]:
!cat script/3-3-count-amzn-reviews.sql

SELECT COUNT(*) FROM demo.amzn_reviews;


In [58]:
query = !cat script/3-3-count-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
3453164


### 2. Check the count and details of rows with data formatting errors, if the `gpload` execution log reported any:

In [59]:
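# Pull the suggested gp_read_error_log query out of the gpload log (the line
# containing the literal text "WARN|select"), keep the message field, and
# swap the "*" for "COUNT(*)" so that only the number of bad rows is returned.
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\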
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|";OFS=" "} {print $3}'"'"'\
    | awk '"'"'{print $1, "COUNT(*)", $3, $4, $5, $6, $7, $8}'"'"''
query = !ssh $DB_USER@$DB_SERVER $cmd
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
134
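
The same extraction can be done in Python instead of `grep` and `awk`. The sketch below is an alternative to the shell pipeline, not what the notebook runs: `error_count_query` is a hypothetical helper, and the sample line is copied from the `gpload` output in this section.

```python
LOG_LINE = (
    "2019-07-29 12:28:25|WARN|select * from gp_read_error_log("
    "'ext_gpload_reusable_39fe859e_b1fc_11e9_a1c4_080027acd876') "
    "where cmdtime > to_timestamp('1564403247.44')"
)

def error_count_query(log_text):
    """Find the suggested error-log query and rewrite it to COUNT(*)."""
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp | level | message
        if len(parts) == 3 and parts[1] == "WARN" and parts[2].startswith("select "):
            # Replace only the select list, leaving the rest of the query intact
            return parts[2].replace("select *", "select COUNT(*)", 1)
    return None  # no suggested query, e.g. a load without bad rows

count_query = error_count_query(LOG_LINE)
print(count_query)
```

Returning `None` when no query is found makes it easy to skip the check after a load that produced no formatting errors.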


In [None]:
# Same extraction as above, but keep "select *" to list the rejected rows themselves.
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|"} {print $3}'"'"' ' 
query = !ssh $DB_USER@$DB_SERVER $cmd
%sql $DB_USER@$DB_SERVER {''.join(query)}

### Continue to Part 2 of Greenplum Demo, "[Step 4. Familiarize yourself with the newly loaded data table](GP-demo-2.ipynb)"