# Greenplum Demo - Part 1

## 1. System Setup
### 1.1 Initialize the database connection and set up global variables

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('AWSGPDBCONN')

# Raw string avoids invalid escape sequences; expects postgresql://user:password@host:port/dbname
cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)

%reload_ext sql
%sql $CONNECTION_STRING

'Connected: gpadmin@gpadmin'
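The regular expression above only matches a connection string of the exact form `postgresql://user:password@host:port/dbname`. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can split the same kind of URL. The helper name `parse_connection_string` and the sample URL are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlsplit, unquote

def parse_connection_string(url):
    """Split a postgresql:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// connection string")
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "server": parts.hostname,
        "port": parts.port,              # parsed as an int
        "name": parts.path.lstrip("/"),  # database name
    }

# Made-up example value, not a real credential
example = parse_connection_string("postgresql://gpadmin:secret@localhost:5432/gpadmin")
```

Unlike the regular expression, this variant raises a clear error for an unexpected scheme instead of failing later on `cs.group(1)`.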

In [14]:
%%sql $DB_USER@$DB_SERVER
SHOW gp_autostats_mode;
ALTER DATABASE gpadmin SET gp_autostats_mode TO 'NONE';
SHOW gp_autostats_mode;

1 rows affected.
Done.
1 rows affected.


gp_autostats_mode
NONE


In [15]:
%%sql $DB_USER@$DB_SERVER
SELECT VERSION();

1 rows affected.


version
"PostgreSQL 8.3.23 (Greenplum Database 5.20.1 build commit:03ff833f877a23469ca41aab0b2dfc58c48520ad) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jun 28 2019 08:56:11"


## 2. The Amazon Customer Reviews Dataset

More than 130 million customer reviews are available to researchers as part of this release. The data is provided as TSV files in the `amazon-reviews-pds` S3 bucket in the AWS US East Region. Each line in the data files corresponds to an individual review (tab-delimited, with no quote or escape characters). Samples of the data are available in English and French; details on the information in each column can be found in the [dataset index file](https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt).
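To illustrate the file layout described above (tab-delimited, no quote character), here is a minimal sketch that parses such rows with Python's `csv` module. The sample rows and the reduced column set are made up for the example; the real files carry more columns.

```python
import csv
import io

# Made-up header and review row in the tab-delimited layout (column subset only)
sample = (
    "marketplace\tcustomer_id\treview_id\tstar_rating\treview_body\n"
    "US\t123\tR1\t5\tSaid \"great\" and meant it\n"
)

def read_reviews(stream):
    """Yield each review as a dict; quoting is disabled so '"' stays literal data."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

rows = list(read_reviews(io.StringIO(sample)))
```

For a downloaded file, the same function could be fed a text-mode handle such as `gzip.open(path, "rt", encoding="utf-8", newline="")`.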

If you use the AWS Command Line Interface, you can list data in the bucket with the `ls` command: 

```aws s3 ls s3://amazon-reviews-pds/tsv/```

To download data using the AWS Command Line Interface, you can use the `cp` command. For instance, to copy a file such as `amazon_reviews_us_Camera_v1_00.tsv.gz` to your local directory, substitute its name for `<S3 File>` and the destination for `<Local File>` in the following command:

```aws s3 cp s3://amazon-reviews-pds/tsv/<S3 File> <Local File>```
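As a small sketch of how the placeholders in the command above could be filled in programmatically, the following builds the argument list for one file. The helper name `s3_copy_command` is an assumption made for illustration; it only assembles the command and does not run it.

```python
# Bucket prefix taken from the commands above
BUCKET_PREFIX = "s3://amazon-reviews-pds/tsv/"

def s3_copy_command(s3_file, local_path="."):
    """Return the argv list for `aws s3 cp` for one file (hypothetical helper)."""
    return ["aws", "s3", "cp", BUCKET_PREFIX + s3_file, local_path]

cmd = s3_copy_command("amazon_reviews_us_Camera_v1_00.tsv.gz")
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)` on a machine where the AWS CLI is installed and configured.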

### 2.1 Prepare the AWS system and set up the `awscli` library via `pip`

In [3]:
shfilecode = !pygmentize -f html -O full,style=friendly -l shell script/1-1-system-prepare.sh
display_html('\n'.join(shfilecode), raw=True)

In [4]:
!ssh-keygen -R $DB_SERVER
!ssh-keyscan $DB_SERVER >> ~/.ssh/known_hosts
!scp -i ~/.ssh/aws-gp.pem script/1-1-system-prepare.sh $DB_USER@$DB_SERVER:system-prepare.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./system-prepare.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER sudo ./system-prepare.sh

# Host ec2-35-178-54-183.eu-west-2.compute.amazonaws.com found: line 36
# Host ec2-35-178-54-183.eu-west-2.compute.amazonaws.com found: line 37
# Host ec2-35-178-54-183.eu-west-2.compute.amazonaws.com found: line 38
/root/.ssh/known_hosts updated.
Original contents retained as /root/.ssh/known_hosts.old
# ec2-35-178-54-183.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
# ec2-35-178-54-183.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
# ec2-35-178-54-183.eu-west-2.compute.amazonaws.com:22 SSH-2.0-OpenSSH_7.4
1-1-system-prepare.sh                         100%  712   111.6KB/s   00:00    
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1733k  100 1733k    0     0  13.8M      0 --:--:-- --:--:-- --:--:-- 13.8M
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date

  Using cached https://files.pythonhosted.org/packages/81/b7/cef47224900ca67078ed6e2db51342796007433ad38329558f56a15255f5/urllib3-1.25.5-py2.py3-none-any.whl
Collecting pyasn1>=0.1.3 (from rsa<=3.5.0,>=3.1.2->awscli)
  Using cached https://files.pythonhosted.org/packages/a1/71/8f0d444e3a74e5640a3d5d967c1c6b015da9c655f35b2d308a55d907a517/pyasn1-0.4.7-py2.py3-none-any.whl
Collecting futures<4.0.0,>=2.2.0; python_version == "2.6" or python_version == "2.7" (from s3transfer<0.3.0,>=0.2.0->awscli)
  Using cached https://files.pythonhosted.org/packages/d8/a6/f46ae3f1da0cd4361c344888f59ec2f5785e69c872e175a748ef6071cdb5/futures-3.3.0-py2-none-any.whl
Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore==1.12.233->awscli)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: jmespath, docutils, six, python-dateutil, urllib3, botoc

### 2.2 Provide AWS Access Key ID & Secret Access Key

In [None]:
shfilecode = !pygmentize -f html -O full,style=friendly -l bash script/1-2-aws-configure.sh
display_html('\n'.join(shfilecode), raw=True)

In [None]:
import getpass

!scp -i ~/.ssh/aws-gp.pem script/1-2-aws-configure.sh $DB_USER@$DB_SERVER:aws-configure.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./aws-configure.sh

cmd = './aws-configure.sh ' 
cmd = cmd + getpass.getpass("AWS Access Key ID [None]:") 
cmd = cmd + ' ' + getpass.getpass("AWS Secret Access Key [None]:")

!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd
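The cell above concatenates the entered values directly into a remote command line. As a hedged sketch, `shlex.quote` from the standard library can quote each argument so that spaces or shell metacharacters in a value are passed through literally. The helper name `build_remote_command` and the sample values are assumptions for illustration.

```python
import shlex

def build_remote_command(script, args):
    """Join a script path and its arguments into one shell-safe string (hypothetical helper)."""
    return " ".join(shlex.quote(part) for part in [script] + list(args))

# Made-up values; in the notebook these would come from getpass.getpass(...)
cmd_example = build_remote_command("./aws-configure.sh", ["EXAMPLEKEYID", "se cret;value"])
```

Note that the values still appear on the remote process's command line either way; quoting only guards against the shell misinterpreting them.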

### 2.3 Copy source files from AWS S3

For our demo, we download the available files into the `/home/gpadmin/data/` folder, using the `aws s3 cp <S3 File> <Local File>` command described above, as follows:

In [None]:
shfilecode = !pygmentize -f html -O full,style=friendly -l bash script/1-3-aws-s3-copy.sh
display_html('\n'.join(shfilecode), raw=True)

In [None]:
!scp -i ~/.ssh/aws-gp.pem script/1-3-aws-s3-copy.sh $DB_USER@$DB_SERVER:aws-s3-copy.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER chmod +x ./aws-s3-copy.sh
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER sudo ./aws-s3-copy.sh

## 3. Data Loading

### 3.1. Create the Schema (optional) and the Database Table to hold the dataset, as shown below:

In [16]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/2-1-create-db-schema-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

In [17]:
query = !cat script/2-1-create-db-schema-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

Done.
Done.
Done.
Done.


[]
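The cells in this section read a script with `!cat` and pass `''.join(query)` to `%sql`. Because the lines are joined without a separator, a script whose lines do not begin or end with whitespace would have its tokens fused together. A minimal sketch of a more forgiving join follows; the helper name and the sample statement (including the table name `demo.reviews`) are made up for the example.

```python
def sql_from_lines(lines):
    """Join script lines into one line of SQL, keeping tokens separated (hypothetical helper)."""
    stripped = (line.strip() for line in lines)
    return " ".join(line for line in stripped if line)

# Made-up script lines, as `!cat` would return them (one list item per line)
lines = ["SELECT COUNT(*)", "FROM demo.reviews", "", "WHERE star_rating = 5;"]

joined_plain = "".join(lines)        # tokens run together
joined_safe = sql_from_lines(lines)  # tokens stay separated
```

Neither approach copes with `--` line comments, since everything after the first comment marker would be commented out once the script is flattened to one line.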

In [18]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/2-2-count-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/2-2-count-table.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
0


### 3.2. Load the Input Dataset using the `gpload` Utility

**gpload** is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload executes a load by invoking the Greenplum Database parallel file server (*gpfdist*), creating an external table definition based on the source data defined, and executing an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database. 

You can declare more than one file as input, as long as the data has the same format in all of the files specified. If the files are compressed with gzip or bzip2 (that is, they have a `.gz` or `.bz2` extension), they are uncompressed automatically, provided that `gunzip` or `bunzip2` is in your path. You can also declare the schema of the source data files, perform basic transformations, define custom delimiter and escape characters, and more. For the full list of available options, see the gpload utility reference in the [Pivotal Greenplum Database Documentation](https://gpdb.docs.pivotal.io/latest) (*Pivotal Greenplum Documentation* > *Utility Guide* > *Management Utility Reference* > *gpload*).

The operation, including any SQL commands specified in the SQL collection of the YAML control file, is performed as a single transaction to prevent inconsistent data when multiple load operations run simultaneously against a target table.
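As a rough illustration of the control-file layout described above, the following sketch renders a small gpload control file from Python. It is not the file used in this demo: the target table `demo.amzn_reviews`, the `ERROR_LIMIT` value and the helper name are assumptions, while the host `mdw`, port `8000` and file pattern are borrowed from the error-log output later in this notebook. No `DELIMITER` is set because tab is the default for the text format.

```python
import textwrap

# Skeleton of a gpload control file; values in braces are filled in below
CONTROL_TEMPLATE = textwrap.dedent("""\
    ---
    VERSION: 1.0.0.1
    DATABASE: {database}
    USER: {user}
    HOST: {host}
    PORT: {port}
    GPLOAD:
      INPUT:
        - SOURCE:
            LOCAL_HOSTNAME:
              - {host}
            PORT: 8000
            FILE:
              - {file_glob}
        - FORMAT: text
        - HEADER: true
        - ERROR_LIMIT: {error_limit}
        - LOG_ERRORS: true
      OUTPUT:
        - TABLE: {table}
        - MODE: insert
      PRELOAD:
        - REUSE_TABLES: true
    """)

def render_control_file(table, file_glob, database="gpadmin", user="gpadmin",
                        host="mdw", port=5432, error_limit=50000):
    """Fill in the template (hypothetical helper; defaults are illustrative)."""
    return CONTROL_TEMPLATE.format(database=database, user=user, host=host, port=port,
                                   file_glob=file_glob, error_limit=error_limit, table=table)

control_text = render_control_file("demo.amzn_reviews",
                                   "/var/tmp_s3_data/amazon_reviews_us*.tsv.gz")
```

With `LOG_ERRORS` enabled and an `ERROR_LIMIT` set, malformed rows are recorded in the error log rather than aborting the whole load, up to the given limit.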

For our demo, we use the **gpload_amzn_reviews.yaml** control file, shown below:

In [19]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l yaml script/3-2-gpload-amzn-reviews.yaml
display_html('\n'.join(sqlfilecode), raw=True)

#### 3.2.1. Delete error log information for existing tables in the current database.

In [20]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/3-1-delete-error-log-info.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/3-1-delete-error-log-info.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


gp_truncate_error_log
True


#### 3.2.2. Copy GPLoad YAML file across to Database Server and Execute

In [21]:
!scp -i ~/.ssh/aws-gp.pem script/3-2-gpload-amzn-reviews.yaml $DB_USER@$DB_SERVER:gpload_amzn_reviews.yaml
!ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER gpload -d $DB_NAME -f ./gpload_amzn_reviews.yaml 2>&1 \
    | tee ./gpload_amzn_reviews.log

3-2-gpload-amzn-reviews.yaml                  100%  356    40.2KB/s   00:00    


### 3.3. Check `gpload` execution

Check the `gpload` execution output (shown above and also available in `/home/gpadmin/script/gpload_amzn_reviews.log`) to confirm that the data loaded successfully and to identify any messages that require attention or action:

#### 3.3.1. Check that the data has been properly loaded by confirming the row count reported above:

In [23]:
sqlfilecode = !pygmentize -f html -O full,style=friendly -l postgres script/3-3-count-amzn-reviews.sql
display_html('\n'.join(sqlfilecode), raw=True)

query = !cat script/3-3-count-amzn-reviews.sql
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
103145273


#### 3.3.2. Count the rows with data formatting errors, if any were identified in the `gpload` execution log:

In [24]:
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|";OFS=" "} {print $3}'"'"'\
    | awk '"'"'{print $1, "COUNT(*)", $3, $4, $5, $6, $7, $8}'"'"''
query = !ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd
%sql $DB_USER@$DB_SERVER {''.join(query)}

1 rows affected.


count
3714
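The shell pipeline above picks the `WARN|select` line out of the log, takes its third `|`-separated field, and swaps `*` for `COUNT(*)`. A minimal Python sketch of the same transformation follows. The log lines in `sample_log` are made up, and their exact layout is an assumption inferred from the pipeline rather than copied from a real log.

```python
def error_log_queries(log_text):
    """Derive (count_query, sample_query) pairs from WARN lines that carry a SELECT."""
    queries = []
    for line in log_text.splitlines():
        if "WARN|select" not in line:
            continue
        # The third '|'-separated field holds the SELECT statement
        select_stmt = line.split("|")[2].strip()
        tokens = select_stmt.split()
        # Mirror the awk step: keep token 1 and tokens 3..8, replacing '*' with COUNT(*)
        count_query = " ".join([tokens[0], "COUNT(*)"] + tokens[2:8])
        sample_query = select_stmt + " LIMIT 10"
        queries.append((count_query, sample_query))
    return queries

# Made-up log excerpt for illustration
sample_log = (
    "2019-09-23 14:20:00|INFO|gpload session started\n"
    "2019-09-23 14:25:00|WARN|select * from gp_read_error_log('ext_example') "
    "where cmdtime > to_timestamp('1569244183.34')\n"
)
count_query, sample_query = error_log_queries(sample_log)[0]
```

The second element of each pair corresponds to the `LIMIT 10` variant used to inspect sample rows.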


#### 3.3.3. Inspect a sample of 10 rows with data formatting errors, if any were identified in the `gpload` execution log:

In [25]:
cmd = 'cat /home/gpadmin/gpload_amzn_reviews.log\
    | grep -e '"'"'WARN|select'"'"'\
    | awk '"'"'BEGIN{FS="|"} {print $3, "LIMIT 10"}'"'"' ' 
query = !ssh -i ~/.ssh/aws-gp.pem $DB_USER@$DB_SERVER $cmd
%sql {''.join(query)}

 * postgresql://gpadmin:***@ec2-35-178-54-183.eu-west-2.compute.amazonaws.com:5432/gpadmin
10 rows affected.


cmdtime,relname,filename,linenum,bytenum,errmsg,rawdata,rawbytes
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	25891418	R2HCJ9YE9GI2G4	B00NC5K2DA	838295580	ARJOSA® Women's 1/2 Flouncing Sleeve Crewneck Casual T-shirt Blouse Top	Apparel	1	0	0	N	Y	and i am very disappointed..	I received this today, and i am very disappointed... What i got was different from he picture the picture shows a short dress like, what i got was a long shirt, the sleeves were shorter I usually wear 8 US And this was a one-size so i didnt expect any problems. Its a straight long shirt... That i dont think i will be wearing =\	2015-08-07",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	37016098	R12ELBYX02OK62	B00FJ4S1SU	429981199	Evil Smiley Face, T-shirt	Apparel	4	0	0	N	Y	Evil Smiley Face = 1 Happy Teenm\	Great quality T and my daughter loved it. The only reason I am giving 4 instead of 5 stars is because the evil smiley face on the T was a much larger than shown in the picture.	2015-02-11",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Apparel_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	12669055	R1SLARMB63P7QM	B001RX9TII	755639242	3 - Prs. Merino Wool Blend Socks- Size 10-13-unisex	Apparel	4	0	0	N	Y	Decent	When I got them in the mail, I was pretty excited to have some nice socks to wear with my work boots. And they are really comfy and keep your feet warm. But I also noticed something else. The packaging was all messed up. I ended up with two pairs of socks, and one single sock. What I ordered said it came with 3 pairs, not 2 pairs and one single. :-\	2012-10-27",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	16313817	R29GC5JDFPLLQN	B00KWP3SA2	808268078	OZNaturals Facial Cleanser - This Natural Face Wash Is A Superior Cleanser That Deep Cleans & Unclogs Pores With Ocean Minerals, Vitamin E and Rose Hip Oil For That Healthy, Youthful Glow!	Beauty	5	0	0	N	Y	I'm a dude and I love this stuff\	Love it!! This is my new go to face product. I even use it with my Mia face ionizer. My pours are really opened up by this stuff and I feel like my face can breathe again!! I will definitely be buying more.	2015-07-06",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	1007025	R1AMWMV3O55YYZ	B0042SLRGM	237408223	FusionBeauty LipFusion Micro-Injected Collagen Lip Plump Color Shine, Objects Of Desire, Rhinestone	Beauty	2	0	0	N	Y	I thought it would make your lips more plumped, ...	I thought it would make your lips more plumped, but it just makes your lips feel numb. And then it goes away =\	2015-02-06",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	15953541	R24NX7FU6U3HWM	B005LN670S	823112185	WONDERSTRUCK For Women By TAYLOR SWIFT Eau De Parfum Spray	Beauty	5	0	0	N	N	we're I am everybody love it very much	It's very nice everybody loves it<br /> l0\	2015-01-11,
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	4408452	REXSTOEERDIJB	B004DLXAO0	302789113	Just Cavalli I Love Him By Roberto Cavalli for Men Eau-de-toillete Spray, 2 Ounce	Beauty	2	0	0	N	Y	Two Stars	No me gusto..huele mal.. =\	2014-08-14",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""",US	2575724	R2G82U8E7FM9YH	B000N5WT2O	557918006	Tiger Balm Ultra Sports Rub	Beauty	4	0	0	N	Y	Eeeeeh!	Like any other minty lotion. This is more like a pomade though. It's alright! Not as good as some people make it seem. =\	2014-07-31,
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	6164409	R3CDS7NM6FLVYE	B002RL8IUE	518090681	Fran Wilson MOODmatcher Lipstick, 10pc Collection	Beauty	3	1	1	N	Y	Meh	These are okay, but not quite what I was expecting. They all turned out to be a different shade of pink, but the colors do turn out nicely. A little warning to a future buyer, they do stain. \	2014-01-08",
2019-09-23 14:09:43.343364+01:00,ext_gpload_reusable_688b2826_de03_11e9_a8a9_067c34561bce,gpfdist://mdw:8000//var/tmp_s3_data/amazon_reviews_us*.tsv.gz [/var/tmp_s3_data/amazon_reviews_us_Beauty_v1_00.tsv.gz],,,"missing data for column ""review_date""","US	40160017	RY1IWGXCVW4QS	B004V8C1CW	890593083	Oxy Hydrating Body Wash	Beauty	4	1	3	N	N	Very hydrating!!	I really do like this body wash. My nearest grocery store was selling them for $2.03 (I know, right?!). I bought 3 bottles and just finished my second bottle. I love the way it makes my skins feels moisturized--the skin drinks it up!<br />On the other hand, it's one of those things where you think to yourself, \\""Wow, it is helping my acne!\\"" But then a week later you're like, \\""Oh wait.\\"" :\	2012-08-23",
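One plausible reading of the rejected rows above: in every sample, a backslash sits immediately before the final tab. Greenplum's text format treats backslash as the escape character by default, so the tab that follows would be read as data, merging the last two fields and leaving `review_date` without a value. The sketch below models that effect in plain Python. It is a simplified illustration (it does not handle doubled backslashes), and the sample row is made up.

```python
import re

EXPECTED_COLUMNS = 15  # fields per review row in the TSV files

def split_with_backslash_escape(line):
    """Split on tabs, treating a tab preceded by a backslash as data (simplified model)."""
    return re.split(r"(?<!\\)\t", line)

# Made-up row whose review body ends in a backslash
fields = ["US", "1", "R1", "B1", "2", "title", "Beauty", "4", "0", "0",
          "N", "Y", "Meh", "they do stain. \\", "2014-01-08"]
line = "\t".join(fields)

naive = line.split("\t")                     # every tab is a delimiter
escaped = split_with_backslash_escape(line)  # backslash swallows the last tab
```

If this is indeed the cause, the `ESCAPE` setting in the gpload control file (which accepts `'OFF'` for text-formatted input) may be worth trying, since the dataset is described as having no escape characters.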


## Continue to Part 2 of Greenplum Demo: **[Basic Table Functions](AWS-GP-demo-2.ipynb)**.