## TPC-H Data

In [0]:
DROP TABLE IF EXISTS partsupp;
DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS supplier;
DROP TABLE IF EXISTS part;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS nation;
DROP TABLE IF EXISTS region;

CREATE TABLE region (
  R_REGIONKEY bigint NOT NULL,
  R_NAME varchar(25),
  R_COMMENT varchar(152));

CREATE TABLE nation (
  N_NATIONKEY bigint NOT NULL,
  N_NAME varchar(25),
  N_REGIONKEY bigint,
  N_COMMENT varchar(152));

create table customer (
  C_CUSTKEY bigint NOT NULL,
  C_NAME varchar(25),
  C_ADDRESS varchar(40),
  C_NATIONKEY bigint,
  C_PHONE varchar(15),
  C_ACCTBAL decimal(18,4),
  C_MKTSEGMENT varchar(10),
  C_COMMENT varchar(117));

create table orders (
  O_ORDERKEY bigint NOT NULL,
  O_CUSTKEY bigint,
  O_ORDERSTATUS varchar(1),
  O_TOTALPRICE decimal(18,4),
  O_ORDERDATE Date,
  O_ORDERPRIORITY varchar(15),
  O_CLERK varchar(15),
  O_SHIPPRIORITY Integer,
  O_COMMENT varchar(79));

create table part (
  P_PARTKEY bigint NOT NULL,
  P_NAME varchar(55),
  P_MFGR  varchar(25),
  P_BRAND varchar(10),
  P_TYPE varchar(25),
  P_SIZE integer,
  P_CONTAINER varchar(10),
  P_RETAILPRICE decimal(18,4),
  P_COMMENT varchar(23));

create table supplier (
  S_SUPPKEY bigint NOT NULL,
  S_NAME varchar(25),
  S_ADDRESS varchar(40),
  S_NATIONKEY bigint,
  S_PHONE varchar(15),
  S_ACCTBAL decimal(18,4),
  S_COMMENT varchar(101));

create table lineitem (
  L_ORDERKEY bigint NOT NULL,
  L_PARTKEY bigint,
  L_SUPPKEY bigint,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY decimal(18,4),
  L_EXTENDEDPRICE decimal(18,4),
  L_DISCOUNT decimal(18,4),
  L_TAX decimal(18,4),
  L_RETURNFLAG varchar(1),
  L_LINESTATUS varchar(1),
  L_SHIPDATE date,
  L_COMMITDATE date,
  L_RECEIPTDATE date,
  L_SHIPINSTRUCT varchar(25),
  L_SHIPMODE varchar(10),
  L_COMMENT varchar(44));

create table partsupp (
  PS_PARTKEY bigint NOT NULL,
  PS_SUPPKEY bigint NOT NULL,
  PS_AVAILQTY integer,
  PS_SUPPLYCOST decimal(18,4),
  PS_COMMENT varchar(199));

## COPY Data From S3

In [0]:
COPY region FROM 's3://redshift-immersionday-labs/data/region/region.tbl.lzo'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

COPY nation FROM 's3://redshift-immersionday-labs/data/nation/nation.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy supplier from 's3://redshift-immersionday-labs/data/supplier/supplier.json' manifest
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy customer from 's3://redshift-immersionday-labs/data/customer/customer.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy part from 's3://redshift-immersionday-labs/data/part/part.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy partsupp from 's3://redshift-immersionday-labs/data/partsupp/partsupp.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

copy lineitem from 's3://redshift-immersionday-labs/data/lineitem-part/'
iam_role default
region 'us-west-2' gzip delimiter '|' COMPUPDATE PRESET;
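
As a quick sanity check after the loads, the row count of each table can be listed in one result set. This is a minimal sketch that only uses the table names created above; the `table_name` and `row_count` aliases are illustrative, and no expected counts are given because they depend on the files in the bucket.

```sql
-- Count the rows loaded into each table by the COPY commands above.
select 'region' as table_name, count(*) as row_count from region
union all select 'nation',   count(*) from nation
union all select 'supplier', count(*) from supplier
union all select 'customer', count(*) from customer
union all select 'part',     count(*) from part
union all select 'orders',   count(*) from orders
union all select 'partsupp', count(*) from partsupp
union all select 'lineitem', count(*) from lineitem
order by row_count;
```

A count of zero would suggest that the corresponding COPY did not load any rows.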

## Compression Analysis

> When COMPUPDATE is PRESET, the COPY command chooses the compression encoding for each column if the target table is empty; even if the columns already have encodings other than RAW.

Quoted from [Compression encodings](https://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html)

> When you use CREATE TABLE, ENCODE AUTO is disabled when you specify compression encoding for any column in the table. If ENCODE AUTO is disabled, Amazon Redshift automatically assigns compression encoding to columns for which you don't specify an ENCODE type as follows:

- Columns that are defined as sort keys are assigned RAW compression.
- Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
- Columns that are defined as SMALLINT, INTEGER, BIGINT, DECIMAL, DATE, TIMESTAMP, or TIMESTAMPTZ data types are assigned AZ64 compression.
- Columns that are defined as CHAR or VARCHAR data types are assigned LZO compression.

In [0]:
select "column", type, encoding from pg_table_def
where tablename = 'customer'

In [0]:
ANALYZE COMPRESSION customer

In [0]:
drop table if exists customertest;
create table customertest (
  C_CUSTKEY bigint NOT NULL encode raw,
  C_NAME varchar(25),
  C_ADDRESS varchar(40),
  C_NATIONKEY bigint,
  C_PHONE varchar(15),
  C_ACCTBAL decimal(18,4),
  C_MKTSEGMENT varchar(10),
  C_COMMENT varchar(117))
diststyle AUTO;

In [0]:
select "column", type, encoding from pg_table_def
where tablename = 'customertest'

In [0]:
-- this takes about 55 seconds
copy customertest from 's3://packt-redshift-cookbook/customer/'
iam_role default region 'eu-west-1'
csv gzip COMPUPDATE PRESET;

In [0]:
select "column", type, encoding from pg_table_def
where tablename = 'customertest'
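
To compare the encodings column by column, the catalog queries above can be combined into one side-by-side listing. This is a sketch that assumes `customer` and `customertest` are both in a schema on the current `search_path` (`pg_table_def` only lists tables from those schemas) and share the same column names; the output aliases are illustrative.

```sql
-- Show the encoding of each column in customer next to the
-- encoding of the same column in customertest.
select c."column",
       c.type,
       c.encoding as customer_encoding,
       t.encoding as customertest_encoding
from pg_table_def c
join pg_table_def t on t."column" = c."column"
where c.tablename = 'customer'
  and t.tablename = 'customertest';
```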

## Data Distribution 

> Amazon Redshift automatically manages the distribution style for the table, and for small tables, it creates a distribution style of ALL. With the ALL distribution style, the data for this table is stored on every compute node slice as 0. The distribution style of ALL is well-suited for small dimension tables, which enables join performance optimization for large tables with smaller dimension tables.

First, let's check how the data is distributed across nodes and slices. Both the customer and orders tables are AUTO(EVEN) distributed across nodes.

In [0]:
select * from svv_table_info
where "table"='customer'
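
The columns of `svv_table_info` that matter most for distribution can also be selected explicitly, which is easier to read than the full `select *` output. A minimal sketch, assuming the `customer` and `orders` tables loaded earlier:

```sql
-- diststyle: distribution style of the table, for example AUTO(EVEN) or KEY(column)
-- tbl_rows:  total number of rows in the table
-- skew_rows: ratio of rows in the slice with the most rows to the slice with the fewest
-- size:      table size in 1 MB data blocks
select "table", diststyle, tbl_rows, skew_rows, size
from svv_table_info
where "table" in ('customer', 'orders');
```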

Second, analyze a join query. The query takes about 18 seconds on a cluster with 2 nodes.

In [0]:
explain
select
    c_name, o_totalprice
from
    customer, orders
where
    customer.c_custkey = orders.o_custkey 
limit 10;

DS_BCAST_INNER means that a copy of the entire inner table is broadcast to all of the compute nodes. This occurs because the rows of both tables must be brought together on the same slice before they can be joined.

Let's improve performance by distributing the data by key, which co-locates rows that share the same customer key (c_custkey in customer, o_custkey in orders). This takes about 2 min 47 s.

In [0]:
CREATE TABLE customer_distkey
DISTKEY (c_custkey)
AS
SELECT * FROM customer;
-- this takes 70 seconds
CREATE TABLE orders_distkey
DISTKEY (o_custkey)
AS
SELECT * FROM orders;

In [0]:
explain
select
    c_name, o_totalprice
from
    customer_distkey, orders_distkey
where
    customer_distkey.c_custkey = orders_distkey.o_custkey
limit 10;

DS_DIST_NONE means that no redistribution of data is required. Corresponding slices are already co-located on the compute nodes, because both tables are distributed on the join key.
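
To confirm which column each of the new tables is distributed on, the `distkey` flag in `pg_table_def` can be inspected. This is a sketch that assumes both tables live in a schema on the current `search_path`:

```sql
-- List the distribution key column of each key-distributed table.
-- distkey is true only for the column the table is distributed on.
select tablename, "column", type
from pg_table_def
where tablename in ('customer_distkey', 'orders_distkey')
  and distkey = true;
```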

## Sort Key

Before sorting by time:

In [0]:
SET enable_result_cache_for_session TO OFF;
select count(*) as num_order, sum(o_totalprice) as revenue
from orders
where o_orderdate between '1998-08-02'::timestamp and '1998-09-03'::timestamp

In [0]:
-- this takes 160 seconds
create table sorted_orders
SORTKEY (o_orderdate)
as
select * from orders

After sorting by time. The performance improves because the sorted_orders table is sorted by the o_orderdate column. A sort key can be defined when creating a new table or set by altering an existing table:

```sql
alter table orders alter sortkey (o_orderdate);
```

In [0]:
SET enable_result_cache_for_session TO OFF;
select count(*) as num_order, sum(o_totalprice) as revenue
from sorted_orders
where o_orderdate between '1998-08-02'::timestamp and '1998-09-03'::timestamp
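
To check the sort key of each table, `svv_table_info` can be queried again. A minimal sketch, assuming the `orders` and `sorted_orders` tables created above:

```sql
-- sortkey1: first column of the table's sort key
-- unsorted: percentage of rows that are not yet in sorted order
select "table", sortkey1, unsorted
from svv_table_info
where "table" in ('orders', 'sorted_orders');
```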