# Greenplum Demo (Part 4)

This is Part 4 of Greenplum Demo, ***Table Storage Models***. 

- If you missed Part 1 (*Setup, Describe Input Dataset & Data Loading*) or wish to repeat, then click [here](GP-demo-1.ipynb).
- If you missed Part 2 (*Basic Table Functions*) or wish to repeat, then click [here](GP-demo-2.ipynb).
- If you missed Part 3 (*MPP Fundamentals and Partitioning*) or wish to repeat, then click [here](GP-demo-3.ipynb).

In [1]:
import os, re
from IPython.display import display_html

import pygments.lexers
from pygments import highlight
from pygments.formatters import HtmlFormatter

CONNECTION_STRING = os.getenv('GPDBCONN')

cs = re.match(r'^postgresql://(\S+):(\S+)@(\S+):(\S+)/(\S+)$', CONNECTION_STRING)

DB_USER   = cs.group(1)
DB_PWD    = cs.group(2)
DB_SERVER = cs.group(3)
DB_PORT   = cs.group(4)
DB_NAME   = cs.group(5)
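
The regular expression above assumes the connection string always carries a user, password, host, port, and database name. As an alternative sketch, the standard library's `urllib.parse.urlsplit` can split the same kind of URL. The `parse_conn` helper name and the sample credentials below are illustrative assumptions, not part of the demo environment.

```python
from urllib.parse import urlsplit

def parse_conn(url):
    """Split a postgresql:// URL into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError("expected a postgresql:// connection string")
    return {
        "user": parts.username,
        "password": parts.password,
        "server": parts.hostname,
        "port": parts.port,               # int, or None when omitted
        "name": parts.path.lstrip("/"),   # database name without the slash
    }

# Made-up credentials, for illustration only.
conn = parse_conn("postgresql://demo_user:demo_pwd@localhost:5432/demo_db")
print(conn["server"], conn["port"], conn["name"])
```

Unlike the regular expression, this returns the port as an integer and raises a clear error when the scheme is not `postgresql`.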

In [2]:
%reload_ext sql
%sql $CONNECTION_STRING

u'Connected: gpadmin@gpadmin'

## 7. Greenplum Database Table Storage Models

Greenplum Database supports several storage models and a mix of storage models. When you create a table, you choose how to store its data. This topic explains the options for table storage and how to choose the best storage model for your workload.

- [Heap Storage](#7.1-Heap-Storage)
- [Append-Optimized (AO) Storage](#7.2-Append-Optimized-Storage)
- [Choosing Row- or Column-Oriented Storage](#7.3-Choosing-Row-or-Column-Oriented-Storage)
- Using Compression (Append-Optimized Tables Only)
- Checking the Compression and Distribution of an Append-Optimized Table
- Altering the Table Storage Model

### 7.1 Heap Storage

- By default, Greenplum Database uses the same heap storage model as PostgreSQL. 
- Heap table storage works best with OLTP-type workloads where the data is often modified after it is initially loaded. 
- `UPDATE` and `DELETE` operations require storing row-level versioning information to ensure reliable database transaction processing. 
- Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded.

#### How to create a Heap Table:

In [9]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-1-create-heap-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

### 7.2 Append-Optimized Storage

- Append-optimized table storage works best with denormalized fact tables in a data warehouse environment. Denormalized fact tables are typically the largest tables in the system. 
- Fact tables are usually loaded in batches and accessed by read-only queries. 
- Moving large fact tables to an append-optimized storage model eliminates the storage overhead of the per-row update visibility information, saving about 20 bytes per row. This allows for a leaner and easier-to-optimize page structure. 
- The storage model of append-optimized tables is optimized for bulk data loading. Single row `INSERT` statements are not recommended.
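
To put the figure of about 20 bytes per row in perspective, the saving scales linearly with the row count. The row counts in this sketch are made-up round numbers, not measurements from the demo dataset.

```python
BYTES_SAVED_PER_ROW = 20  # approximate figure quoted in the list above

def visibility_overhead_saved(row_count, per_row=BYTES_SAVED_PER_ROW):
    """Approximate bytes saved by dropping per-row visibility information."""
    return row_count * per_row

# Hypothetical table sizes, chosen only to show the scaling.
for rows in (1_000_000, 100_000_000, 1_000_000_000):
    saved_mb = visibility_overhead_saved(rows) / 1_000_000
    print(f"{rows:>13,} rows -> about {saved_mb:,.0f} MB saved")
```

For a hypothetical fact table of 100 million rows, this works out to roughly 2 GB (decimal) before any compression is applied.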

#### How to create an Append-Optimized (AO) Table:

In [13]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-2-create-ao-table.sql
display_html('\n'.join(sqlfilecode), raw=True)
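
The contents of the script are not reproduced in this notebook. As a hedged sketch, append-optimized storage is requested through the `WITH` clause of `CREATE TABLE`, using the same `appendonly=true` option that appears in the table definitions later in this part. The helpers below only assemble SQL text; the table `demo.fact_example` and its columns are hypothetical.

```python
def storage_clause(**options):
    """Render storage options as a WITH (...) clause."""
    rendered = ", ".join(f"{key}={value}" for key, value in options.items())
    return f"WITH ({rendered})"

def create_table_sql(table, column_defs, dist_key, **options):
    """Assemble CREATE TABLE text; no options means a default heap table."""
    parts = [f"CREATE TABLE {table} ({', '.join(column_defs)})"]
    if options:
        parts.append(storage_clause(**options))
    parts.append(f"DISTRIBUTED BY ({dist_key});")
    return "\n".join(parts)

ao_sql = create_table_sql(
    "demo.fact_example",                  # hypothetical table name
    ["id INT", "amount NUMERIC(12,2)"],   # hypothetical columns
    "id",
    appendonly="true",
)
print(ao_sql)
```

Calling `create_table_sql` without any keyword options omits the `WITH` clause entirely, which corresponds to the default heap model described in section 7.1.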

### 7.3 Choosing Row or Column-Oriented Storage

Greenplum provides a choice of storage orientation models: row, column, or a combination of both:

- **Row-oriented storage**: good for OLTP-type workloads with many iterative transactions, and for queries that need many columns of a single row at once, because a whole row can be retrieved efficiently.
- **Column-oriented storage**: good for data warehouse workloads with aggregations of data computed over a small number of columns, or for single columns that require regular updates without modifying other column data.

For most general purpose or mixed workloads, row-oriented storage offers the best combination of flexibility and performance. However, there are use cases where a column-oriented storage model provides more efficient I/O and storage. Consider the following requirements when deciding on the storage orientation model for a table:

- **Updates of table data**. If you load and update the table data frequently, choose a row-oriented heap table. Column-oriented table storage is only available on append-optimized tables.
- **Frequent `INSERT`s**. If rows are frequently inserted into the table, consider a row-oriented model. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk.
- **Number of columns requested in queries**. If you typically request all or the majority of columns in the `SELECT` list or `WHERE` clause of your queries, consider a row-oriented model. Column-oriented tables are best suited to queries that aggregate many values of a single column where the `WHERE` or `HAVING` predicate is also on the aggregate column.
- **Number of columns in the table**. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small. Column-oriented tables can offer better query performance on tables with many columns where you access a small subset of columns in your queries.
- **Compression**. Column data has the same data type, so storage size optimizations are available in column-oriented data that are not available in row-oriented data. For example, many compression schemes use the similarity of adjacent data to compress. However, the greater the adjacent compression achieved, the more difficult random access becomes, because data must be uncompressed before it can be read.
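
The considerations above can be condensed into a rough rule of thumb. The function below is only an illustrative sketch: the parameter names and the 0.5 cutoff for "the majority of columns" are assumptions made for this example, not Greenplum guidance.

```python
def suggest_storage(frequent_updates, frequent_inserts, fraction_of_columns_read):
    """Return a coarse storage suggestion from three workload traits.

    fraction_of_columns_read is the typical share of a table's columns
    that a query touches, between 0.0 and 1.0 (assumed cutoff: 0.5).
    """
    if frequent_updates:
        # Frequent UPDATE/DELETE points to a row-oriented heap table.
        return "row-oriented heap"
    if frequent_inserts or fraction_of_columns_read >= 0.5:
        # Write-heavy or wide-select workloads favor row orientation.
        return "row-oriented"
    # Batch-loaded data queried on a few columns suits column orientation,
    # which is only available on append-optimized tables.
    return "column-oriented append-optimized"

print(suggest_storage(True, False, 0.9))
print(suggest_storage(False, False, 0.1))
```

A real decision would also weigh table size and compression needs, which this sketch deliberately leaves out.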

#### How to create a Column-Oriented Table:

In [14]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-3-create-column-oriented-table.sql
display_html('\n'.join(sqlfilecode), raw=True)

### 7.4 Using Compression (Append-Optimized Tables Only)

There are two types of in-database compression available in Greenplum Database for append-optimized tables:
- Table-level compression is applied to an entire table.
- Column-level compression is applied to a specific column. You can apply different column-level compression algorithms to different columns.

#### How to create a Compressed Table (Table-level Compression):

In [18]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-4-create-compressed-table-1.sql
display_html('\n'.join(sqlfilecode), raw=True)

#### How to create a Compressed Table (Column-level Compression):

In [19]:
sqlfilecode = !pygmentize -f html -O full,style=colorful -l postgres script/7-4-create-compressed-table-2.sql
display_html('\n'.join(sqlfilecode), raw=True)
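
The script is not reproduced here. As a sketch of the idea, a column definition on a column-oriented append-optimized table can carry its own `ENCODING` clause, so different columns can use different compression settings. The table `demo.fact_example`, the choice of columns, and the specific compression settings below are hypothetical and chosen only for illustration.

```python
def column_def(name, sqltype, compresstype=None, compresslevel=None):
    """Render one column, optionally with a per-column ENCODING clause."""
    text = f"{name} {sqltype}"
    if compresstype is not None:
        opts = [f"compresstype={compresstype}"]
        if compresslevel is not None:
            opts.append(f"compresslevel={compresslevel}")
        text += f" ENCODING ({', '.join(opts)})"
    return text

columns = [
    column_def("id", "INT"),                              # no column-level setting
    column_def("description", "VARCHAR(75)", "zlib", 5),  # zlib at level 5
    column_def("crime_year", "SMALLINT", "rle_type"),     # run-length encoding
]
sql = (
    "CREATE TABLE demo.fact_example (\n  "
    + ",\n  ".join(columns)
    + "\n)\nWITH (appendonly=true, orientation=column)\nDISTRIBUTED BY (id);"
)
print(sql)
```

Columns without an `ENCODING` clause fall back to whatever the table-level `WITH` clause specifies.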

## Populate the three tables:
- Load heap table with gpload (gpload_h.yaml):

```yaml
VERSION: 1.0.0.1
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /home/gpadmin/data/crimes_all.txt
    - FORMAT: text
    - DELIMITER: '|'
    - LOG_ERRORS: true
    - ERROR_LIMIT: 50000
   OUTPUT:
    - TABLE: demo.fact_crimes_heap
    - MODE: insert
   PRELOAD:
    - TRUNCATE: true
    - REUSE_TABLES: true
```

In [None]:
%%sql

DROP TABLE IF EXISTS demo.fact_crimes_heap;

CREATE TABLE demo.fact_crimes_heap
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
distributed by (id);

DROP TABLE IF EXISTS demo.fact_crimes_row_comp;

CREATE TABLE demo.fact_crimes_row_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3)
distributed by (id);


DROP TABLE IF EXISTS demo.fact_crimes_col_comp;

CREATE TABLE demo.fact_crimes_col_comp
(
  id INT
  , case_number VARCHAR (20)
  , crime_date TIMESTAMP
  , block VARCHAR(50)
  , IUCR VARCHAR(10)
  , primary_type VARCHAR(50)
  , description VARCHAR(75)
  , location_desc VARCHAR (75)
  , arrest VARCHAR(5)
  , domestic VARCHAR(5)
  , beat VARCHAR(7)
  , district VARCHAR(7)
  , ward SMALLINT
  , community_area VARCHAR(10)
  , fbi_code VARCHAR(5)
  , x_coord FLOAT
  , y_coord FLOAT
  , crime_year SMALLINT
  , record_update_date TIMESTAMP
  , latitude FLOAT
  , longitude FLOAT
  , location VARCHAR (60),
  historical int null,
  zipcode int null,
  community int null, 
  census int null,
  wards int null,
  boundaries int null, 
  policedistrict int null, 
  policebeats int null	)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=3)
distributed by (id);

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'gpload -d gpadmin -f /home/gpadmin/gpload_h.yaml > /home/gpadmin/gpload_h.log 2>&1'


**Note:** The heap table loaded the data from the same source file in under 33 seconds. Loading performance differs between heap and compressed tables.

- Load the **demo.fact_crimes_row_comp** table with data from the **heap** table above, and check the timing:

In [None]:
%%sql
DELETE FROM demo.fact_crimes_row_comp;

`INSERT INTO demo.fact_crimes_row_comp SELECT * FROM demo.fact_crimes;`
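
One simple way to check timing from the notebook is to wrap the load in a wall-clock timer. The sketch below times any callable with `time.perf_counter`. Here `run_load` is a hypothetical stand-in for whatever actually executes the `INSERT ... SELECT` statement, so the example runs without a database.

```python
import time

def timed(action):
    """Run action() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

def run_load():
    # Hypothetical placeholder: replace with the real statement execution.
    time.sleep(0.01)
    return "INSERT done"

result, seconds = timed(run_load)
print(f"{result} in {seconds:.3f}s")
```

The same wrapper can be reused for the column-oriented load so that both timings are measured the same way.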

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_row_comp.sql '

- Load the **demo.fact_crimes_col_comp** table with data from the **heap** table above, and check the timing:

In [None]:
%%sql
DELETE FROM demo.fact_crimes_col_comp;

`INSERT INTO demo.fact_crimes_col_comp SELECT * FROM demo.fact_crimes;`

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f insert_into_col_comp.sql '

## Check the size of each of the three tables:

In [None]:
%%sql
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_heap')) AS TABLESIZE, 'demo.fact_crimes_heap' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_row_comp')) AS TABLESIZE, 'demo.fact_crimes_row_comp' AS TABLENAME
UNION ALL
SELECT pg_size_pretty(pg_relation_size('demo.fact_crimes_col_comp')) AS TABLESIZE, 'demo.fact_crimes_col_comp' AS TABLENAME;
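
Because the branches of the size query differ only in the table name, the same statement can be generated for any list of tables. This sketch only builds SQL text. The helper name `size_query` is an assumption, while the table names are the three created above.

```python
def size_query(tables):
    """Build a UNION ALL query reporting pg_relation_size per table."""
    branches = [
        f"SELECT pg_size_pretty(pg_relation_size('{t}')) AS tablesize, "
        f"'{t}' AS tablename"
        for t in tables
    ]
    return "\nUNION ALL\n".join(branches) + ";"

query = size_query([
    "demo.fact_crimes_heap",
    "demo.fact_crimes_row_comp",
    "demo.fact_crimes_col_comp",
])
print(query)
```

The table names are interpolated directly into the string, so this is only suitable for trusted, hard-coded names like the ones in this demo.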

**Notes:**
- The heap table has no compression. It is best for staging tables or when frequent updates/deletes are needed.
- The row-oriented table has the best compression. It is best for frequent inserts and `SELECT`s on all or most of the columns.
- The column-oriented table also compresses better than the heap table, but not better than the row-oriented table. It is best for static partitions/tables and `SELECT`s on fewer columns.
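
To compare the tables numerically rather than by eye, the raw byte counts returned by `pg_relation_size` can be turned into ratios against the heap table. The byte counts below are placeholders, not results from this demo; substitute the values your own cluster reports.

```python
def compression_ratio(heap_bytes, compressed_bytes):
    """How many times smaller the compressed table is than the heap table."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return heap_bytes / compressed_bytes

# Placeholder sizes in bytes (hypothetical, for illustration only).
sizes = {
    "heap": 2_000_000_000,
    "row_comp": 400_000_000,
    "col_comp": 500_000_000,
}

for name in ("row_comp", "col_comp"):
    ratio = compression_ratio(sizes["heap"], sizes[name])
    print(f"{name}: {ratio:.1f}x smaller than heap")
```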

In [None]:
# Step 4. EXPLAIN Plans & Statistics

In [None]:
%%sql
EXPLAIN SELECT location_desc
	, count(case_number)
FROM
	demo.fact_crimes
WHERE
	crime_date >= '2014-01-01'
	AND crime_date < '2015-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'



**Notes:**
- Copy the `EXPLAIN` plan produced above and paste it into http://planchecker.cfapps.io/
- The Planchecker app provides recommendations about collecting statistics. Highlight this as an optimization recommendation that the database provides.
- Use the `ANALYZE` command to collect statistics for the optimizer. Missing or stale statistics can produce bad plans.
- Use the `analyzedb` utility and schedule it to run frequently (for example, every day) to collect statistics only on the tables/partitions that have changed since the last run.
- The same utility can also be easily stopped and resumed.
- There is no need for the DBA to define different statistics collection policies for different types of tables/partitions.
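
As a sketch of the scheduling advice above, the snippet below assembles an `analyzedb` command line for one schema so that it can be handed to a scheduler such as cron. The database `gpadmin` and schema `demo` come from this notebook. The helper name is hypothetical, and the flags used here (`-d`, `-s`, `-t`, `-a`) should be confirmed against `analyzedb --help` on your Greenplum version.

```python
import shlex

def analyzedb_command(database, schema=None, table=None):
    """Build an analyzedb argv list for a schema or a single table."""
    argv = ["analyzedb", "-d", database, "-a"]  # -a: assumed "do not prompt"
    if table is not None:
        argv += ["-t", table]                   # schema-qualified table name
    elif schema is not None:
        argv += ["-s", schema]
    return argv

argv = analyzedb_command("gpadmin", schema="demo")
# Quote each argument so the line is safe to paste into a shell or crontab.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The function only builds the command; running it is left to the scheduler on the Greenplum master host.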

In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -c "analyze demo.fact_crimes"'



In [None]:
!ssh -i /root/gpdb-gcp.key gpadmin@13.64.71.99 'psql postgresql://gpadmin:z3huyg3gyfll2@13.64.71.99:5432/gpadmin -f explain_select.sql'

