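# Step 0. System and Connection Check
- Start by running `gpstate` to confirm the cluster is up. Use Jupyter, DBeaver or pgAdmin to run the queries.
- Check that *gp_autostats_mode* is set to **NONE**. This avoids automatic `ANALYZE` runs during loading, and Step 4 relies on the table having no statistics when `EXPLAIN` is first run. Note that `SET` changes the value for the current session only.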

In [1]:
%reload_ext sql
import os
#connection_string = os.getenv('GPDBCONN')
#connection_string = 'postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin'
%sql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin

u'Connected: gpadmin@gpadmin'

In [2]:
%%sql
SHOW gp_autostats_mode;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
1 rows affected.


gp_autostats_mode
ON_NO_STATS


In [3]:
%%sql
SET gp_autostats_mode = 'NONE';

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
Done.


[]
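
The check-then-set pattern in cells 2 and 3 can be wrapped in a small helper. This is a minimal sketch, assuming a DB-API 2.0 style cursor (for example one from psycopg2); the names `ensure_autostats_none` and `FakeCursor` are hypothetical and not part of any Greenplum tooling. The fake cursor only exists so the sketch runs without a database.

```python
def ensure_autostats_none(cursor):
    """Return the previous gp_autostats_mode and switch the session to NONE.

    `cursor` is assumed to be any DB-API 2.0 cursor connected to Greenplum.
    """
    cursor.execute("SHOW gp_autostats_mode;")
    previous = cursor.fetchone()[0]
    if previous.lower() != "none":
        # SET only affects the current session, not other connections.
        cursor.execute("SET gp_autostats_mode = 'NONE';")
    return previous


class FakeCursor:
    """Stand-in for a real cursor so the sketch runs without a database."""

    def __init__(self, mode):
        self.mode = mode
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        return (self.mode,)


cur = FakeCursor("ON_NO_STATS")
previous = ensure_autostats_none(cur)
print(previous, cur.statements)
```

With a real connection, pass the connection's cursor instead of `FakeCursor`.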

In [4]:
%%sql
SELECT version();

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
1 rows affected.


version
"PostgreSQL 8.3.23 (Greenplum Database 5.17.0 build commit:fc9a9d4cad8dd4037b9bc07bf837c0b958726103) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Feb 13 2019 15:26:34"


# Step 1. Create Table with Distribution, Partitioning and Compression
- The data file is available at: https://drive.google.com/file/d/1vMm8kHUaxT80ktU8X0y2OG5IpoMp-saz/view?usp=sharing
- Download the file from Google Drive, upload it to your GPDB instance under */home/gpadmin/data/* and unzip it.

In [7]:
%%sql
--DROP SCHEMA IF EXISTS demo;
--CREATE SCHEMA demo;

DROP TABLE IF EXISTS demo.fact_crimes;

CREATE TABLE demo.fact_crimes
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
distributed by (id)
PARTITION BY RANGE(crime_date) 
(
    PARTITION yr START ('2000-01-01'::date) END ('2019-03-30'::date)
    EVERY ('1 year'::interval)
    WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=3) 
);

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
Done.
Done.


[]
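
To see which child partitions the `START ... END ... EVERY ('1 year')` clause is expected to produce, the ranges can be enumerated outside the database. This is a rough sketch under two assumptions that should be verified against your Greenplum version: ranges are start-inclusive and end-exclusive, and the final range is clipped at `END` when the span is not a whole number of years. The helper name `yearly_partitions` is hypothetical.

```python
from datetime import date


def yearly_partitions(start, end):
    """List (lower, upper) ranges for START..END EVERY '1 year'.

    Assumption: the last range is clipped at `end`.
    Only safe for start dates other than Feb 29.
    """
    ranges = []
    lower = start
    while lower < end:
        upper = min(lower.replace(year=lower.year + 1), end)
        ranges.append((lower, upper))
        lower = upper
    return ranges


parts = yearly_partitions(date(2000, 1, 1), date(2019, 3, 30))
for i, (lo, hi) in enumerate(parts, start=1):
    # Child tables follow the <table>_1_prt_<name>_<n> naming pattern.
    print(f"demo.fact_crimes_1_prt_yr_{i}: [{lo}, {hi})")
```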

In [8]:
%%sql
SELECT COUNT(*) FROM demo.fact_crimes;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
1 rows affected.


count
0


# Step 2. Load Data via gpload

**Yaml file (gpload_f.yaml)**

``` yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/crimes_all.txt
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.fact_crimes
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```

- Run gpload. Note: **gpload** is a wrapper for fast parallel data loading: it creates (or reuses) external tables, starts a gpfdist process, and runs an `INSERT ... SELECT` from the external table into the target table:

In [9]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'gpload -d gpadmin -f /home/gpadmin/gpload_f.yaml > /home/gpadmin/gpload_f.log 2>&1'



- Check data is properly loaded:

In [10]:
%%sql
SELECT COUNT(*) FROM demo.fact_crimes;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
1 rows affected.


count
5017691


In [11]:
%%sql
SELECT * FROM demo.fact_crimes LIMIT 10;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
10 rows affected.


id,case_number,crime_date,block,iucr,primary_type,description,location_desc,arrest,domestic,beat,district,ward,community_area,fbi_code,x_coord,y_coord,crime_year,record_update_date,latitude,longitude,location,historical,zipcode,community,census,wards,boundaries,policedistrict,policebeats
9947454,HY135904,2018-01-31 23:38:00,054XX S INDIANA AVE,620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,False,231,2,3,40,05,1178528.0,1868845.0,2015,2018-02-10 15:50:01,41.795415492,-87.620863574,"(41.795415492, -87.620863574)",12,21192,7,401,9,10,24,125
9947480,HY135912,2018-01-31 23:36:00,010XX N WALLER AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,1511,15,29,25,14,1138165.0,1906324.0,2015,2018-02-10 15:50:01,41.899086038,-87.767972922,"(41.899086038, -87.767972922)",52,4299,26,671,7,5,25,70
9950622,HY138330,2018-01-31 23:30:00,016XX W 18TH ST,820,THEFT,$500 AND UNDER,SIDEWALK,False,False,1234,12,25,31,06,1165614.0,1891465.0,2015,2018-02-10 15:50:01,41.857771113,-87.667576714,"(41.857771113, -87.667576714)",8,14920,33,343,26,43,15,121
9948231,HY136898,2018-01-31 23:30:00,012XX N LA SALLE DR,820,THEFT,$500 AND UNDER,RESIDENCE,False,False,1821,18,43,8,06,1174912.0,1908513.0,2015,2018-02-10 15:50:01,41.90434894,-87.632937036,"(41.90434894, -87.632937036)",51,14926,37,17,11,54,14,198
9947434,HY135906,2018-01-31 23:20:00,058XX S LAFAYETTE AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,232,2,20,40,06,1176978.0,1866245.0,2015,2018-02-10 15:50:01,41.788315946,-87.626625798,"(41.788315946, -87.626625798)",53,21559,7,164,4,11,24,268
9947548,HY135894,2018-01-31 23:15:00,084XX S ESCANABA AVE,460,BATTERY,SIMPLE,STREET,False,False,423,4,10,46,08B,1196953.0,1849492.0,2015,2018-02-10 15:50:01,41.741870702,-87.553942987,"(41.741870702, -87.553942987)",47,21202,42,226,47,25,19,238
9947444,HY135893,2018-01-31 23:08:00,070XX S STATE ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,SIDEWALK,True,False,322,3,6,69,18,1177509.0,1858457.0,2015,2018-02-10 15:50:01,41.766932865,-87.624914075,"(41.766932865, -87.624914075)",31,22260,67,513,32,11,18,211
9947494,HY135885,2018-01-31 23:03:00,010XX N KARLOV AVE,560,ASSAULT,SIMPLE,RESIDENTIAL YARD (FRONT/BACK),False,False,1111,11,37,23,08A,1148861.0,1906680.0,2015,2015-08-17 15:03:40,41.899862799,-87.728677214,"(41.899862799, -87.728677214)",4,4299,24,99,45,5,16,68
9947761,HY136383,2018-01-31 23:00:00,043XX N LAWNDALE AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,1723,17,39,16,14,1150947.0,1928505.0,2015,2018-02-10 15:50:01,41.95971185,-87.720442222,"(41.95971185, -87.720442222)",28,21538,16,364,12,39,1,9
9947464,HY135913,2018-01-31 22:56:00,067XX S GREEN ST,5111,OTHER OFFENSE,GUN OFFENDER: ANNUAL REGISTRATION,STREET,True,False,723,7,6,68,26,1171800.0,1860108.0,2015,2018-02-10 15:50:01,41.771590534,-87.645791404,"(41.771590534, -87.645791404)",17,21559,66,410,32,11,17,202


- Show data distribution across segments:

In [12]:
%%sql
SELECT gp_segment_id, count(*) FROM demo.fact_crimes GROUP BY 1 ORDER BY 1;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
2 rows affected.


gp_segment_id,count
0,2505720
1,2511971


- Show row count per partition:

In [13]:
%%sql
SELECT tableoid::regclass, count(*) FROM demo.fact_crimes GROUP BY 1 ORDER BY 1;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
19 rows affected.


tableoid,count
demo.fact_crimes_1_prt_yr_1,264089
demo.fact_crimes_1_prt_yr_2,264089
demo.fact_crimes_1_prt_yr_3,264089
demo.fact_crimes_1_prt_yr_4,264089
demo.fact_crimes_1_prt_yr_5,264089
demo.fact_crimes_1_prt_yr_6,264089
demo.fact_crimes_1_prt_yr_7,264089
demo.fact_crimes_1_prt_yr_8,264089
demo.fact_crimes_1_prt_yr_9,264089
demo.fact_crimes_1_prt_yr_10,264089


# Step 3. Compression
Show compression ratio difference by creating three copies of this table:
1. Heap
2. Row with compression
3. Column with compression

In [14]:
%%sql
CREATE TABLE demo.fact_crimes_heap
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
distributed by (id);


CREATE TABLE demo.fact_crimes_row_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3)
distributed by (id);


CREATE TABLE demo.fact_crimes_col_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=3)
distributed by (id);

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
Done.
Done.
Done.


[]
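
The three definitions above differ only in their storage clause. As an alternative to repeating the column list, the compressed variants could be generated from the heap table with a `LIKE` clause. This is a sketch only: it assumes your Greenplum version accepts `LIKE` together with `WITH` storage options and `DISTRIBUTED BY`, which should be checked before use. The helper `like_ddl` is hypothetical, and the script only prints the statements.

```python
# Storage options copied from the CREATE TABLE statements above.
STORAGE_VARIANTS = {
    "demo.fact_crimes_row_comp":
        "appendonly=true, orientation=row, compresstype=zlib, compresslevel=3",
    "demo.fact_crimes_col_comp":
        "appendonly=true, orientation=column, compresstype=zlib, compresslevel=3",
}


def like_ddl(new_table, storage, template="demo.fact_crimes_heap"):
    """Build a CREATE TABLE statement that copies columns from `template`."""
    return (
        f"CREATE TABLE {new_table} (LIKE {template})\n"
        f"WITH ({storage})\n"
        f"DISTRIBUTED BY (id);"
    )


statements = [like_ddl(name, opts) for name, opts in STORAGE_VARIANTS.items()]
print("\n\n".join(statements))
```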

## Populate the three tables:
- Load heap table with gpload (gpload_h.yaml):

```yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/crimes_all.txt
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.fact_crimes_heap
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```
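
The control files used in this walkthrough differ only in the target table, so they can be rendered from one template. This is a minimal sketch using plain string formatting; the helper `render_control_file` is hypothetical, and the keys and values are copied from the YAML above.

```python
CONTROL_TEMPLATE = """\
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - {data_file}
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: {table}
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
"""


def render_control_file(table, data_file="/home/gpadmin/data/crimes_all.txt"):
    """Return gpload control-file text for the given target table."""
    return CONTROL_TEMPLATE.format(table=table, data_file=data_file)


heap_yaml = render_control_file("demo.fact_crimes_heap")
print(heap_yaml)
```

Write the returned text to a file such as `gpload_h.yaml` and pass it to `gpload -f` as in the cell below.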

In [15]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'gpload -d gpadmin -f /home/gpadmin/gpload_h.yaml > /home/gpadmin/gpload_h.log 2>&1'


**Note:** The heap table loaded from the same source file in under 33 seconds; load times for heap and compressed tables differ.

- Load the **demo.fact_crimes_row_comp** table with data from **demo.fact_crimes** (see the statement below), and check the timing:

In [37]:
%%sql
DELETE FROM demo.fact_crimes_row_comp;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
0 rows affected.


[]

`INSERT INTO demo.fact_crimes_row_comp SELECT * FROM demo.fact_crimes;`

In [38]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_row_comp.sql '

Timing is on.
INSERT 0 5017691
Time: 24905.288 ms


- Load the **demo.fact_crimes_col_comp** table with data from **demo.fact_crimes** (see the statement below), and check the timing:

In [None]:
%%sql
DELETE FROM demo.fact_crimes_col_comp;

`INSERT INTO demo.fact_crimes_col_comp SELECT * FROM demo.fact_crimes;`

In [26]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_col_comp.sql '

Timing is on.
INSERT 0 5017691
Time: 17623.734 ms
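
The two timings above can be compared as load throughput. This is a minimal sketch using the row count and the `Time:` values reported by psql; each figure comes from a single run, so treat the result as indicative rather than as a benchmark.

```python
ROWS = 5_017_691  # rows inserted, from the "INSERT 0 5017691" lines above

# Elapsed times in milliseconds, as reported by psql.
timings_ms = {
    "demo.fact_crimes_row_comp": 24905.288,
    "demo.fact_crimes_col_comp": 17623.734,
}

throughput = {table: ROWS / (ms / 1000.0) for table, ms in timings_ms.items()}

for table, rows_per_second in throughput.items():
    print(f"{table}: {rows_per_second:,.0f} rows/s")
```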


## Check the size of each of the three tables:

In [39]:
%%sql
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_heap'))::TEXT, 'demo.fact_crimes_heap' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_row_comp'))::TEXT AS TABLESIZE, 'demo.fact_crimes_row_comp' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_col_comp')) AS TABLESIZE, 'demo.fact_crimes_col_comp' AS TABLENAME;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
3 rows affected.


pg_size_pretty,tablename
1249 MB,demo.fact_crimes_heap
564 MB,demo.fact_crimes_row_comp
986 MB,demo.fact_crimes_col_comp


**Notes:**
- The heap table has no compression (1249 MB here). It is best for staging tables or when frequent updates/deletes are needed.
- The row-oriented table has the best compression in this test (564 MB). It is best for frequent inserts and `SELECT`s on all or most of the columns.
- The column-oriented table (986 MB) is also smaller than the heap table, but larger than the row-oriented table. It is best for static partitions/tables and `SELECT`s on fewer columns.
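
The sizes reported by the query can be expressed as compression ratios relative to the heap table. This is a minimal sketch using the three `pg_size_pretty` values from the output above (rounded to whole megabytes, so the ratios are approximate).

```python
# Table sizes in MB, copied from the query output above.
sizes_mb = {
    "demo.fact_crimes_heap": 1249,
    "demo.fact_crimes_row_comp": 564,
    "demo.fact_crimes_col_comp": 986,
}

baseline = sizes_mb["demo.fact_crimes_heap"]

# Ratio > 1 means the table is smaller than the uncompressed heap table.
ratios = {table: baseline / size for table, size in sizes_mb.items()}

for table, ratio in ratios.items():
    saved = 1 - sizes_mb[table] / baseline
    print(f"{table}: {ratio:.2f}x, {saved:.0%} smaller than heap")
```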

In [40]:
# Step 4. EXPLAIN Plans and Statistics

In [41]:
%%sql
EXPLAIN SELECT location_desc
	, count(case_number)
FROM
	demo.fact_crimes
WHERE
	crime_date >= '2014-01-01'
	AND crime_date <= '2014-12-31'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

 * postgresql://gpadmin:***@13.64.71.99:5432/gpadmin
23 rows affected.


QUERY PLAN
Limit (cost=0.00..431.00 rows=1 width=16)
-> Gather Motion 2:1 (slice2; segments: 2) (cost=0.00..431.00 rows=1 width=16)
Merge Key: (count((count(case_number))))
-> Sort (cost=0.00..431.00 rows=1 width=16)
Sort Key: (count((count(case_number))))
-> GroupAggregate (cost=0.00..431.00 rows=1 width=16)
Group By: location_desc
-> Sort (cost=0.00..431.00 rows=1 width=16)
Sort Key: location_desc
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=16)


In [54]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'



                                                                                                 QUERY PLAN                                                                                                  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..431.00 rows=1 width=16)
   ->  Gather Motion 2:1  (slice2; segments: 2)  (cost=0.00..431.00 rows=1 width=16)
         Merge Key: (count((count(case_number))))
         ->  Sort  (cost=0.00..431.00 rows=1 width=16)
               Sort Key: (count((count(case_number))))
               ->  GroupAggregate  (cost=0.00..431.00 rows=1 width=16)
                     Group By: location_desc
                     ->  Sort  (cost=0.00..431.00 rows=1 width=16)
                           Sort Key: location_desc
                           ->  Redistribute Motion 2:2

**Notes:**
- Copy the `EXPLAIN` plan created above and paste it into: http://planchecker.cfapps.io/
- The Planchecker app will recommend collecting statistics. Highlight this as a recommendation the database provides for optimization.
- Use the `ANALYZE` command to collect statistics for the optimizer; missing or stale statistics can lead to bad plans.
- Use the `analyzedb` utility and schedule it to run frequently (e.g. every day); it collects statistics only on tables/partitions that changed since the last run.
- The same utility can also be easily stopped and resumed.
- There is no need for the DBA to define separate statistics-collection policies for different types of tables/partitions.
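
A quick way to spot a plan that was built without statistics is to look at its row estimates: in the plan above every node reports `rows=1`. The following is a rough heuristic sketch, not a replacement for Planchecker; the function names are hypothetical, and a genuine `rows=1` estimate is possible for small results, so treat a positive result only as a hint to run `ANALYZE`.

```python
import re

# A few lines of the plan shown above (before statistics were collected).
PLAN_BEFORE_ANALYZE = """\
Limit  (cost=0.00..431.00 rows=1 width=16)
  ->  Gather Motion 2:1  (slice2; segments: 2)  (cost=0.00..431.00 rows=1 width=16)
        ->  Sort  (cost=0.00..431.00 rows=1 width=16)
              ->  GroupAggregate  (cost=0.00..431.00 rows=1 width=16)
"""

ROWS_RE = re.compile(r"rows=(\d+)")


def row_estimates(plan_text):
    """Extract every rows=N estimate from EXPLAIN output, in order."""
    return [int(match.group(1)) for match in ROWS_RE.finditer(plan_text)]


def looks_unanalyzed(plan_text):
    """Heuristic: True when every node estimates at most one row."""
    estimates = row_estimates(plan_text)
    return bool(estimates) and all(n <= 1 for n in estimates)


print(row_estimates(PLAN_BEFORE_ANALYZE))
print(looks_unanalyzed(PLAN_BEFORE_ANALYZE))
```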

In [56]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -c "analyze demo.fact_crimes"'



ANALYZE


In [57]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'



                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..474.93 rows=5 width=20)
   ->  Gather Motion 2:1  (slice2; segments: 2)  (cost=0.00..474.93 rows=10 width=20)
         Merge Key: (count((count(case_number))))
         ->  Limit  (cost=0.00..474.93 rows=5 width=20)
               ->  Sort  (cost=0.00..474.93 rows=60 width=20)
                     Sort Key: (count((count(case_number))))
                     ->  HashAggregate  (cost=0.00..474.89 rows=60 width=20)
                           Group By: location_desc
                           ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..474.89 rows=