# Greenplum Demo - Part 1

## 1. System Setup
### 1.1 Initialize database connection and set up global variable values

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string so that \S reaches the regex engine unchanged; "/" needs no escaping.
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
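
As an alternative to the regular expression above, the connection string can also be split with the standard library's URL parser. The following is a minimal sketch, not part of the original notebook: the function name `parse_connection_string` and the example URL are hypothetical, and the port is returned as an integer rather than a string.

```python
from urllib.parse import urlsplit, unquote

def parse_connection_string(conn):
    """Split a postgresql:// URL into user, password, server, port and database name."""
    parts = urlsplit(conn)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// connection string")
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),  # percent-encoded characters are decoded
        "server": parts.hostname,
        "port": parts.port,                         # int, or None if no port is given
        "name": parts.path.lstrip("/"),
    }

# Hypothetical example value, not the demo's real credentials.
example = parse_connection_string("postgresql://gpadmin:secret@example.com:5432/gpadmin")
print(example["user"], example["server"], example["port"], example["name"])
```

Unlike the regex, this approach also handles passwords that contain percent-encoded characters such as `%40` for `@`.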

In [2]:
%%sql $DB_USER@$DB_SERVER
SHOW gp_autostats_mode;
ALTER DATABASE gpadmin SET gp_autostats_mode TO 'NONE';
SHOW gp_autostats_mode;
SELECT version();

1 rows affected.
Done.
1 rows affected.
1 rows affected.


version
"PostgreSQL 8.3.23 (Greenplum Database 5.20.1 build commit:03ff833f877a23469ca41aab0b2dfc58c48520ad) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jun 28 2019 08:56:11"


## 2. The Amazon Customer Reviews Dataset

More than 130 million customer reviews are available to researchers as part of this release. The data is available as TSV files in the `amazon-reviews-pds` S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab-delimited, with no quote or escape characters). Samples of the data are available in English and French; more details on the information in each column can be found [here](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt).
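
To illustrate what "tab-delimited, with no quote or escape characters" means in practice, here is a minimal sketch that reads such a file with Python's `csv` module and quoting switched off. It assumes the first line of each file is a header row; the helper name `read_reviews` and the three-column sample are made up for illustration and do not reproduce the full dataset schema.

```python
import csv
import io

def read_reviews(fileobj):
    """Yield one dict per review from a tab-delimited file with a header row."""
    # QUOTE_NONE keeps double quotes in the text as ordinary characters.
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

# Tiny made-up sample with only three columns.
sample = io.StringIO(
    'marketplace\treview_id\treview_body\n'
    'US\tR1\tSaid "great" and meant it\n'
)
rows = list(read_reviews(sample))
print(rows[0]["review_body"])
```

For a downloaded file, the same function could be given a handle from `gzip.open(path, "rt", encoding="utf-8", newline="")`.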

If you use the AWS Command Line Interface, you can list data in the bucket with the `ls` command: 

```aws s3 ls s3://amazon-reviews-pds/tsv/```

To download data using the AWS Command Line Interface, you can use the `cp` command. For instance, with `<S3 File>` set to `amazon_reviews_us_Camera_v1_00.tsv.gz`, the following command copies that file to the local path given as `<Local File>`:

```aws s3 cp s3://amazon-reviews-pds/tsv/<S3 File> <Local File>```

### 2.1 Prepare AWS System and set up the `awscli` library via `pip`

In [3]:
shfilecode = !pygmentize -f html -O full,style=friendly -l shell script/1-1-system-prepare.sh
display_html('\n'.join(shfilecode), raw=True)

In [5]:
!ssh-keygen -R $DB_SERVER
!ssh-keyscan $DB_SERVER >> ~/.ssh/known_hosts
!scp -i ~/.ssh/aws-gp.pem script/1-1-system-prepare.sh $DB_USER@$DB_SERVER:system-prepare.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./system-prepare.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER ./system-prepare.sh

--------------------------------------------------
/usr/local/greenplum-db/./bin:/usr/local/greenplum-db/./ext/python/bin:/usr/local/bin:/usr/bin:/usr/local/bin:/usr/local/greenplum-db/pxf/bin:/usr/local/greenplum-cloud
/usr/bin/pip
--------------------------------------------------
# Host ec2-18-130-57-92.eu-west-2.compute.amazonaws.com found: line 26
# Host ec2-18-130-57-92.eu-west-2.compute.amazonaws.com found: line 27
# Host ec2-18-130-57-92.eu-west-2.compute.amazonaws.com found: line 28
/root/.ssh/known_hosts updated.
Original contents retained as /root/.ssh/known_hosts.old
# ec2-18-130-57-92.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
# ec2-18-130-57-92.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
# ec2-18-130-57-92.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
1-1-system-prepare.sh                         100%  722    80.5KB/s   00:00    
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                

  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting jmespath<1.0.0,>=0.7.1 (from botocore==1.12.231->awscli)
  Using cached https://files.pythonhosted.org/packages/83/94/7179c3832a6d45b266ddb2aac329e101367fbdb11f425f13771d27f225bb/jmespath-0.9.4-py2.py3-none-any.whl
Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli)
  Using cached https://files.pythonhosted.org/packages/a1/71/8f0d444e3a74e5640a3d5d967c1c6b015da9c655f35b2d308a55d907a517/pyasn1-0.4.7-py2.py3-none-any.whl
Collecting futures<4.0.0,>=2.2.0; python_version == "2.6" or python_version == "2.7" (from s3transfer<0.3.0,>=0.2.0->awscli)
  Using cached https://files.pythonhosted.org/packages/d8/a6/f46ae3f1da0cd4361c344888f59ec2f5785e69c872e175a748ef6071cdb5/futures-3.3.0-py2-none-any.whl
Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.12.231->awscli)
  Using 

### 2.2 Provide AWS Access Key ID & Secret Access Key

In [5]:
shfilecode = !pygmentize -f html -O full,style=friendly -l bash script/1-2-aws-configure.sh
display_html('\n'.join(shfilecode), raw=True)

In [6]:
import getpass

!scp -i ~/.ssh/aws-gp.pem script/1-2-aws-configure.sh $DB_USER@$DB_SERVER:aws-configure.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./aws-configure.sh

cmd = './aws-configure.sh ' 
cmd = cmd + getpass.getpass("AWS Access Key ID [None]:") 
cmd = cmd + ' ' + getpass.getpass("AWS Secret Access Key [None]:")

!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd

1-2-aws-configure.sh                          100%  455    54.5KB/s   00:00    
AWS Access Key ID [None]:········
AWS Secret Access Key [None]:········
AWS S3 Configuration setup correctly


### 2.3 Copy source files from AWS S3

For our demo, we download the available files into the `/home/gpadmin/data/` folder, using the `aws s3 cp <S3 File> <Local File>` command described above, as follows:

In [7]:
shfilecode = !pygmentize -f html -O full,style=friendly -l bash script/1-3-aws-s3-copy.sh
display_html('\n'.join(shfilecode), raw=True)

In [8]:
!scp -i ~/.ssh/aws-gp.pem script/1-3-aws-s3-copy.sh $DB_USER@$DB_SERVER:aws-s3-copy.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./aws-s3-copy.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER ./aws-s3-copy.sh

1-3-aws-s3-copy.sh                            100% 2240   362.7KB/s   00:00    
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz to ./amazon_reviews_us_Books_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_01.tsv.gz to ./amazon_reviews_us_Books_v1_01.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz to ./amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz to ./amazon_reviews_us_Wireless_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Music_v1_00.tsv.gz to ./amazon_reviews_us_Music_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_PC_v1_00.tsv.gz to ./amazon_reviews_us_PC_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Video_DVD_v1_00.tsv.gz to ./amazon_reviews_us_Video_DVD_v1_00.tsv.gz
download: s3://amazon-reviews-pds/tsv/amazon_reviews_multil

## 3. Data Loading

### 3.1. Create the Schema (optional) and the Database Table to hold the dataset, as shown below:

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/2-1-create-db-schema-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [10]:
query = !cat script/2-1-create-db-schema-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

Done.
Done.
Done.
Done.


[]

In [11]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/2-2-count-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/2-2-count-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
0


### 3.2. Load the Input Dataset using the `gpload` Utility

**gpload** is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML-formatted control file, gpload executes a load by invoking the Greenplum Database parallel file server (*gpfdist*), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

You can declare more than one file as input/source as long as the data is of the same format in all files specified. Additionally, if the files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), the files will be uncompressed automatically (provided that `gunzip` or `bunzip2` is in your path). You can also declare the schema of the source data files, perform basic transformations, define a custom delimiter and/or escape character, and set many other options. For the full list of available options, check the gpload utility reference in the [Pivotal Greenplum Database Documentation](https://gpdb.docs.pivotal.io/latest) (*Pivotal Greenplum Documentation* > *Utility Guide* > *Management Utility Reference* > *gpload*).

The operation, including any SQL commands specified in the SQL collection of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.

For our demo, we use the **gpload_amzn_reviews.yaml** file, as follows:

In [12]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l yaml script/3-2-gpload-amzn-reviews.yaml
display_html('\n'.join(sqlfilecode), raw=True)

#### 3.2.1. Delete error log information for existing tables in the current database.

In [13]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/3-1-delete-error-log-info.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/3-1-delete-error-log-info.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


gp_truncate_error_log
True


#### 3.2.2. Copy GPLoad YAML file across to Database Server and Execute

In [14]:
!scp -i ~/.ssh/aws-gp.pem script/3-2-gpload-amzn-reviews.yaml $DB_USER@$DB_SERVER:gpload_amzn_reviews.yaml
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER 'gpload -d gpadmin -f /home/gpadmin/gpload_amzn_reviews.yaml 2>&1 \
    | tee /home/gpadmin/gpload_amzn_reviews.log'

3-2-gpload-amzn-reviews.yaml                  100%  358    13.1KB/s   00:00    
2019-09-17 15:03:41|INFO|gpload session started 2019-09-17 15:03:41
2019-09-17 15:03:41|INFO|no host supplied, defaulting to localhost
2019-09-17 15:03:41|INFO|started gpfdist -p 8000 -P 9000 -f "/home/gpadmin/data/amazon_reviews_us*.tsv.gz" -t 30 -m 1000000
2019-09-17 15:03:41|INFO|did not find an external table to reuse. creating ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0
2019-09-17 15:43:39|WARN|3714 bad rows
2019-09-17 15:43:39|WARN|Please use following query to access the detailed error
2019-09-17 15:43:39|WARN|select * from gp_read_error_log('ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0') where cmdtime > to_timestamp('1568729021.23')
2019-09-17 15:43:39|INFO|running time: 2398.52 seconds
2019-09-17 15:43:39|INFO|rows Inserted          = 103145273
2019-09-17 15:43:39|INFO|rows Updated           = 0
2019-09-17 15:43:39|INFO|data formatting errors = 3714
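
The summary lines at the end of the log follow a regular `timestamp|LEVEL|message` layout, so they can be collected programmatically. The sketch below is a hypothetical helper (the name `summarize_gpload_log` and the dictionary keys are assumptions, not part of gpload) that extracts the counters and the running time from log text shaped like the output above.

```python
import re

def summarize_gpload_log(text):
    """Extract row counters and the running time from gpload log text."""
    summary = {}
    for line in text.splitlines():
        # Each log line has the form "timestamp|LEVEL|message".
        fields = line.split("|", 2)
        if len(fields) != 3:
            continue
        message = fields[2].strip()
        counter = re.match(
            r"(rows Inserted|rows Updated|data formatting errors)\s*=\s*(\d+)$", message
        )
        if counter:
            summary[counter.group(1)] = int(counter.group(2))
        timing = re.match(r"running time: ([\d.]+) seconds$", message)
        if timing:
            summary["running time (s)"] = float(timing.group(1))
    return summary

# Lines copied from the log output shown above.
log_text = """\
2019-09-17 15:43:39|INFO|running time: 2398.52 seconds
2019-09-17 15:43:39|INFO|rows Inserted          = 103145273
2019-09-17 15:43:39|INFO|rows Updated           = 0
2019-09-17 15:43:39|INFO|data formatting errors = 3714
"""
print(summarize_gpload_log(log_text))
```

The resulting values can then be compared against the row counts queried from the database in the checks that follow.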


### 3.3. Check `gpload` execution

Review the `gpload` execution output (shown above and also available in `/home/gpadmin/gpload_amzn_reviews.log`) to confirm that the data loaded successfully and to identify any messages that require attention or action:

#### 3.3.1. Check that the data has been properly loaded by confirming the row count shown above:

In [15]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/3-3-count-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/3-3-count-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
103145273


#### 3.3.2. Check the number of rows with data formatting errors, if any were reported in the `gpload` execution log:

In [16]:
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|";OFS=" "} {print $3}'"'"'\
    | awk '"'"'{print $1, "COUNT(*)", $3, $4, $5, $6, $7, $8}'"'"''
query = !ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
3714
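
The `grep`/`awk` pipeline above pulls the `gp_read_error_log` query out of the log and rewrites `select *` as `select COUNT(*)`. For readers who find the nested shell quoting hard to follow, the same transformation can be sketched in Python, assuming the log text has already been fetched into a string. The function name `error_log_query` and its parameters are hypothetical.

```python
def error_log_query(log_text, count_only=False, limit=None):
    """Return the gp_read_error_log query that gpload printed in its log, or None."""
    for line in log_text.splitlines():
        # Log lines have the form "timestamp|LEVEL|message".
        fields = line.split("|", 2)
        if len(fields) == 3 and fields[1] == "WARN" and fields[2].startswith("select "):
            query = fields[2].strip()
            if count_only:
                # Same effect as replacing the second awk field with COUNT(*).
                query = query.replace("select *", "select COUNT(*)", 1)
            if limit is not None:
                query = f"{query} LIMIT {int(limit)}"
            return query
    return None

# Line copied from the gpload log shown earlier.
log_line = (
    "2019-09-17 15:43:39|WARN|select * from "
    "gp_read_error_log('ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0') "
    "where cmdtime > to_timestamp('1568729021.23')"
)
print(error_log_query(log_line, count_only=True))
```

Calling the function with `limit=10` instead of `count_only=True` produces a query that returns a sample of the rejected rows rather than their count.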


#### 3.3.3. Check a sample of 10 rows with data formatting errors, if any were reported in the `gpload` execution log:

In [17]:
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|"} {print $3, "LIMIT 10"}'"'"' ' 
query = !ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd
%sql {''.join(query)}

 * postgresql://gpadmin:***@ec2-18-130-134-20.eu-west-2.compute.amazonaws.com:5432/gpadmin
10 rows affected.


cmdtime,relname,filename,linenum,bytenum,errmsg,rawdata,rawbytes
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	26056830	R1DP0EMU6DDE4I	B00WGBT1D8	728815172	Malibu Sugar Big Girls' Woke Up Like This Muscle Top	Apparel	1	0	5	N	N	but i was browsing and saw this adorable shirt and when i looked at the price	i haven't got it, but i was browsing and saw this adorable shirt and when i looked at the price, my heart dropped. its so cute but just a muscle top, for $40?!! u need to change the price. :\	2015-07-23",
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	33597262	R2J6YUD2SZTWNP	B00VH3U2F2	613590419	MABUA Cotton No Show See liner boat loafer Mens Womens Socks 5 Pairs. Get Colors Now!	Apparel	2	0	0	N	Y	The concept and fit of this sock is great BUT not so sure about the quality	The concept and fit of this sock is great BUT not so sure about the quality. I brought 5 striped pairs on May 8th and as of today, they all have holes. I don't mean just the toes, like underneath the sock as well. I don't generally get holes in my socks, if anything I'd say they usually loosen up or get too wide to wear so I'm not sure what's wrong with these socks. Don't get me wrong, they are soft and don't really fall off but each sock's life span was probably like 2-3 weeks max total. At $5 a pair, I'd think I'll look elsewhere. :\	2015-08-27",
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	43907302	R260EQ65BRIPW1	B00V9GOJ2O	746100730	Zkess Women's Sleeveless Summer Strappy Cut outs One-piece Swimsuit Small Size Green	Apparel	4	1	1	N	Y	It's very sexy but runs very small... ...	It's very sexy but runs very small... But I'm making it work. Also there was one strap missing from one of the sides :\	2015-06-11,
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	919830	RIJRZ601276ZM	B00RKIYC9K	796752879	HeForShe Men Logo Pin Black and Magenta	Apparel	5	0	0	N	Y	Five Stars	Awesome /|\	2015-05-03,
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	45159328	R25458DTM4P6GQ	B00RBSLCHO	75975899	BCBGeneration Women's Cold Shoulder Button Down	Apparel	2	1	2	N	Y	Cute	Its a really nice shirt/style but i was sent the wrong size =\	2015-05-20,
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	11467498	RNHZW3DZ9GZ5Y	B00R1LJC6E	779653185	Allegra K Women's Plaids Single Breasted Belted Mini Christmas A Line Shirt Dress	Apparel	2	2	4	N	Y	material is nice but the buttons are very weak	Overall the buyer should be aware that this is very small in the shoulders and if you have anything larger then a B it will fit tight in the chest area. the belt is made of this stretchy faux leather material and sits well on the waist. material is nice but the buttons are very weak. the plastic button on my dress broke very easily. the hassle of returning this item is too much so I will keep it.. maybe I'll be able to wear it if I lose some weight :( on my shoulders....somehow :\	2015-02-18,
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	26387875	R2YA6KDRQ51DMN	B00PTBIWEG	594181576	Gioberti Boys Sherpa Lined Zip Up Fleece Hoodie Jacket	Apparel	3	0	2	N	Y	Great jacket, except the zipper and the seams :\	This Jacket was one of my son's favorite gifts for Christmas, very soft on the inside and snuggly warm, however, the zipper is plastic and broke In the first week and the pockets started to come apart at the seam. I am so pleased with the jacket, but so sad that the zipper and the sewing around the pockets are not quality.	2015-01-09",
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	40362022	R132X6FMGC53EX	B00PKL33HQ	655140322	NoBull Woman Apparel Women's Check Meowt Racerback Tank Top	Apparel	5	0	0	N	Y	Super cute!	Love it! Very comfortable. Great to workout in. The only downside is me, I'm a super sweaty person so the colour shows up my intense sweat patches :-\	2015-04-21",
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	12035266	R1YDS0AGKDEJI3	B00P6TO5E2	815388106	J.TOMSON Womens Deep Draping Halter Top	Apparel	3	2	2	N	Y	Because, I live in Phoenix.	Yes, this is trash. But it's already 80 degrees here in February, and I'm post-menopausal. I have to make this work. This, and other equally trashy things, so that I don't die of heat-stroke in July. Period. One doesn't simply use a light cover-up, in Phoenix, in July. What you do, is wish that you were 20-something, so that you could wear as little as possible while avoiding jail time. This will definitely require a lot of double-sided, invisible lingerie tape. I hope not to sweat it off. I believe I will constantly be shifting the gathers around in an attempt to disguise my nipples. (Btw, why is 2 ft of cleavage acceptable, but no nipples?...just curious..) None of the photos of these things show necklaces being worn with it, but there is a gaping hole down the middle, so I will definitely be filling the gap with some sort of necklace. The champagne color is a nice neutral taupe, kind of a stone color; not too light or dark, and not on the yellow side. This would be so easy to make, that I'm embarrassed to say I'd rather buy it, than spend the time, but that is the case. :\	2015-02-17",
2019-09-17 15:03:41.365463+01:00,ext_gpload_reusable_f4192898_d953_11e9_a9e8_06fe645668b0,gpfdist://mdw:8000//home/gpadmin/data/amazon_reviews_us*.tsv.gz [/home/gpadmin/data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	10112805	R6AW51YPP64YI	B00OD3PGJK	860452344	Pulse Extended Plus Size Women's 3in1 Geo Ski Jacket Coat	Apparel	1	1	2	N	Y	Returned!	This coat runs small, and did not look good on me at all. Sending back! :\	2014-11-19",
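
All ten sample rows above share a pattern: the raw record contains a backslash immediately followed by a tab (for example the `:\` just before the review date). One plausible explanation, assuming the load ran with the text format's default backslash escape character, is that the backslash escapes the following tab, so two columns merge and the last column, `review_date`, is left without data. The sketch below is a hypothetical helper for counting such lines in a source file; the two sample rows are abridged and made up, modeled on the output above.

```python
def has_backslash_before_tab(raw_line):
    """Return True if any tab delimiter is directly preceded by a backslash."""
    return "\\\t" in raw_line

def count_suspect_lines(lines):
    """Count the lines in which a backslash sits directly before a tab."""
    return sum(1 for line in lines if has_backslash_before_tab(line))

# Abridged, made-up rows: the first has a backslash right before the last tab.
sample = [
    "US\tR1\tFive Stars\tAwesome /|\\\t2015-05-03",
    "US\tR2\tFive Stars\tAwesome\t2015-05-03",
]
print(count_suspect_lines(sample))
```

If this hypothesis holds, disabling escape processing in the load specification may let these rows load (the gpload reference describes an `ESCAPE` option that accepts `'OFF'` for text-formatted input). That change has not been tested in this demo.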


## Continue to Part 2 of the Greenplum Demo: **[Basic Table Functions](AWS-GP-demo-2.ipynb)**.