## Setup steps
Here are the steps to set up this lab:
- Install missing dependencies and restart the notebook
- Create the notebook variables
- Create the `db_ybu` database

### Install missing dependencies and restart the notebook
Run the following cell to ensure that the notebook dependencies are available.
Once run, this cell doesn't need to be executed again.

In [1]:
!pip install ipython-sql
!pip install psycopg2-binary
!pip install sqlalchemy 



### Create the notebook variables 

> IMPORTANT!
> 
> Do NOT skip running this cell. 
> 

The following Python cell creates and stores variables that all the notebooks in this lab will use. You can view these variables in the Jupyter tab.

- To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.
- Verify the accuracy of the output values.

In [2]:
# Env variables for Notebook
import os

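# read env_vars.env, skipping blank lines and comment lines
env_vars = !cat env_vars.env
for var in env_vars:
    if not var.strip() or var.lstrip().startswith('#'):
        continue
    # split on the first '=' only, so values may contain '='
    key, value = var.split('=', 1)
    os.environ[key] = value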
 

# Comment out Local
MY_YB_PATH=os.environ.get('MY_YB_PATH_LOCAL')
MY_GITPOD_WORKSPACE_URL=os.environ.get('MY_GITPOD_WORKSPACE_URL_LOCAL')
MY_SUDO=os.environ.get('MY_SUDO')

# Gitpod specific
# MY_YB_PATH=os.environ.get('MY_YB_PATH')
# MY_GITPOD_WORKSPACE_URL=os.environ.get('GITPOD_WORKSPACE_URL')

# env_vars defines the following
MY_DB_NAME=os.environ.get('MY_DB_NAME')
MY_HOST_IPv4_01=os.environ.get('MY_HOST_IPv4_01')
MY_HOST_IPv4_02=os.environ.get('MY_HOST_IPv4_02')
MY_HOST_IPv4_03=os.environ.get('MY_HOST_IPv4_03')
MY_TSERVER_WEBSERVER_PORT=os.environ.get('MY_TSERVER_WEBSERVER_PORT')
MY_DATA3_DDL_FILE=os.environ.get('MY_DATA3_DDL_FILE')
MY_DATA3_DML_FILE=os.environ.get('MY_DATA3_DML_FILE')
MY_UTIL_FUNCTIONS_FILE=os.environ.get("MY_UTIL_FUNCTIONS_FILE")
MY_UTIL_YBTSERVER_METRICS_FILE=os.environ.get("MY_UTIL_YBTSERVER_METRICS_FILE")
MY_SUDO=os.environ.get('MY_SUDO')
# Current directory of project and related child folders
MY_NOTEBOOK_DIR=os.getcwd()
MY_NOTEBOOK_DATA_FOLDER=MY_NOTEBOOK_DIR +'/data'
MY_NOTEBOOK_UTILS_FOLDER=MY_NOTEBOOK_DIR + '/utils'

print(MY_NOTEBOOK_DATA_FOLDER, MY_NOTEBOOK_UTILS_FOLDER)
# Store the notebook values for other notebooks to use

%store MY_DB_NAME
%store MY_YB_PATH
%store MY_GITPOD_WORKSPACE_URL
%store MY_HOST_IPv4_01
%store MY_HOST_IPv4_02
%store MY_HOST_IPv4_03
%store MY_NOTEBOOK_DIR
%store MY_TSERVER_WEBSERVER_PORT
%store MY_NOTEBOOK_DATA_FOLDER
%store MY_NOTEBOOK_UTILS_FOLDER
%store MY_DATA3_DDL_FILE
%store MY_DATA3_DML_FILE
%store MY_UTIL_FUNCTIONS_FILE
%store MY_UTIL_YBTSERVER_METRICS_FILE
%store MY_SUDO



/Users/markkim/Documents/YBU_repos/jupyter/YSQL/data /Users/markkim/Documents/YBU_repos/jupyter/YSQL/utils
Stored 'MY_DB_NAME' (str)
Stored 'MY_YB_PATH' (str)
Stored 'MY_GITPOD_WORKSPACE_URL' (str)
Stored 'MY_HOST_IPv4_01' (str)
Stored 'MY_HOST_IPv4_02' (str)
Stored 'MY_HOST_IPv4_03' (str)
Stored 'MY_NOTEBOOK_DIR' (str)
Stored 'MY_TSERVER_WEBSERVER_PORT' (str)
Stored 'MY_NOTEBOOK_DATA_FOLDER' (str)
Stored 'MY_NOTEBOOK_UTILS_FOLDER' (str)
Stored 'MY_DATA3_DDL_FILE' (str)
Stored 'MY_DATA3_DML_FILE' (str)
Stored 'MY_UTIL_FUNCTIONS_FILE' (str)
Stored 'MY_UTIL_YBTSERVER_METRICS_FILE' (str)
Stored 'MY_SUDO' (str)
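
The `env_vars.env` parsing in the cell above can also be written as a small pure-Python function, which is easier to check in isolation. This is a minimal sketch, assuming the file holds simple `KEY=VALUE` lines; the function name `parse_env_lines` and the sample lines (including `EXAMPLE_URL`) are hypothetical and are not the lab's real file contents.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines into a dict, ignoring blank lines and comments."""
    parsed = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so a value may itself contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected KEY=VALUE, got: {raw!r}")
        parsed[key.strip()] = value.strip()
    return parsed


# Hypothetical sample lines (an assumption for illustration only).
sample = [
    "# database settings",
    "MY_DB_NAME=db_ybu",
    "",
    "MY_HOST_IPv4_01=127.0.0.1",
    "EXAMPLE_URL=http://127.0.0.1:7000/?a=b",
]
env = parse_env_lines(sample)
print(env["MY_DB_NAME"], env["EXAMPLE_URL"])
```

Splitting on the first `=` only keeps values such as URLs with query strings intact.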


In [3]:
%%bash -s "$MY_SUDO"  # ifconfig aliases
MY_SUDO=${1}

if ifconfig lo0 | grep 127.0.0.[2-7] > /dev/null
then
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.2
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.3
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.4
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.5
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.6
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.7
fi

echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.2
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.3
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.4
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.5
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.6
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.7

echo ${MY_SUDO} | sudo -S ifconfig lo0

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
	inet 127.0.0.1 netmask 0xff000000 
	inet6 ::1 prefixlen 128 
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
	inet 127.0.0.2 netmask 0xff000000 
	inet 127.0.0.3 netmask 0xff000000 
	inet 127.0.0.4 netmask 0xff000000 
	inet 127.0.0.5 netmask 0xff000000 
	inet 127.0.0.6 netmask 0xff000000 
	inet 127.0.0.7 netmask 0xff000000 
	nd6 options=201<PERFORMNUD,DAD>


Password:

In [4]:
%%bash -s "$MY_YB_PATH" "$MY_TSERVER_WEBSERVER_PORT"  # yb-ctl create
YB_PATH=${1}
TSERVER_WEBSERVER_PORT=${2}

cd $YB_PATH

### Grep port 9000 for conflict
# lsof -nP -iTCP -sTCP:LISTEN | grep 9000

# Stop running cluster
if  pgrep -x "yb-tserver" > /dev/null 
then
    ./bin/yb-ctl stop
    sleep 1
fi

# Destroy cluster
if echo `./bin/yb-ctl status` | grep "Node Count"  > /dev/null 
then
    ./bin/yb-ctl destroy
    sleep 1
fi

# Create cluster
./bin/yb-ctl --rf 3 create --tserver_flags "yb_num_shards_per_tserver=1,ysql_num_shards_per_tserver=1,ysql_beta_features=true,webserver_port="${TSERVER_WEBSERVER_PORT}  \
--master_flags "yb_num_shards_per_tserver=1,ysql_num_shards_per_tserver=1" \
--num_shards_per_tserver=1  \
--placement_info "cloud1.region1.zone1,cloud2.region2.zone2,cloud3.region3.zone3" 

# Output status
./bin/yb-ctl status

Stopping cluster.
Destroying cluster.
Creating cluster.
Waiting for cluster to be ready.
....
----------------------------------------------------------------------------------------------------
| Node Count: 3 | Replication Factor: 3                                                            |
----------------------------------------------------------------------------------------------------
| JDBC                : jdbc:postgresql://127.0.0.1:5433/yugabyte                                  |
| YSQL Shell          : bin/ysqlsh                                                                 |
| YCQL Shell          : bin/ycqlsh                                                                 |
| YEDIS Shell         : bin/redis-cli                                                              |
| Web UI              : http://127.0.0.1:7000/                                                     |
| Cluster Data        : /Users/markkim/yugabyte-data                                              

### Create the `db_ybu` database with `ysqlsh`
Run the following cell to connect to the local host using `ysqlsh`, create the `db_ybu` database, and then list the databases.

In [5]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  # create database
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# drop and create
./bin/ysqlsh -d yugabyte -c "drop database if exists "${DB_NAME}";"  
./bin/ysqlsh -d yugabyte -c "create database "${DB_NAME}";" 

# list dbs
./bin/ysqlsh -d yugabyte -c "\l"

NOTICE:  database "db_ybu" does not exist, skipping


DROP DATABASE
CREATE DATABASE
                                   List of databases
      Name       |  Owner   | Encoding | Collate |    Ctype    |   Access privileges   
-----------------+----------+----------+---------+-------------+-----------------------
 db_ybu          | yugabyte | UTF8     | C       | en_US.UTF-8 | 
 postgres        | postgres | UTF8     | C       | en_US.UTF-8 | 
 system_platform | postgres | UTF8     | C       | en_US.UTF-8 | 
 template0       | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                 |          |          |         |             | postgres=CTc/postgres
 template1       | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                 |          |          |         |             | postgres=CTc/postgres
 yugabyte        | postgres | UTF8     | C       | en_US.UTF-8 | 
(6 rows)



### Create a partitioned table, add associated partitions, and load data using DDL and DML scripts
In this section of the notebook, you will:
- Create tables with a DDL script
- Load data with a DML script
- Verify the creation of tables and data
- View the DDL for `order_changes`
- Create a partition

##### Create tables, load data, and review relations
Run the following cell to execute the DDL and DML scripts using `ysqlsh`.

In [6]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA3_DDL_FILE" "$MY_DATA3_DML_FILE"   # Order_changes Partition Table
YB_PATH=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
DATA_DDL_FILE=${4}
DATA_DML_FILE=${5}

ORDER_DDL_PATH=${DATA_FOLDER}/${DATA_DDL_FILE}
ORDER_DML_PATH=${DATA_FOLDER}/${DATA_DML_FILE}

cd $YB_PATH

# DDL file
./bin/ysqlsh -d ${DB_NAME} -f ${ORDER_DDL_PATH} >&/dev/null
sleep 1;

# DML file
./bin/ysqlsh -d ${DB_NAME} -f ${ORDER_DML_PATH} >&/dev/null
sleep 1;

# Describe relations
./bin/ysqlsh -d ${DB_NAME} -c "\d"

                 List of relations
 Schema |         Name          | Type  |  Owner   
--------+-----------------------+-------+----------
 public | order_changes         | table | yugabyte
 public | order_changes_2022_02 | table | yugabyte
 public | order_changes_2022_03 | table | yugabyte
 public | order_changes_default | table | yugabyte
(4 rows)



##### View Table Definitions
Run the following cell using `ysqlsh` to view a table definition.

> Note
> 
> SQL magic does not support PostgreSQL `psql` commands. In order to execute `psql` commands, the notebook uses `bash` and `ysqlsh`.



In [9]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# ./bin/ysqlsh -d ${DB_NAME} -c "\dt"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes"
./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_2022_02"
./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_2022_03"
./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_default"

          Table "public.order_changes_2022_02"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 user_id     | integer |           | not null | 
 account_id  | integer |           | not null | 
 change_date | date    |           | not null | 
 description | text    |           |          | 
Partition of: order_changes FOR VALUES FROM ('2022-02-01') TO ('2022-03-01')
Indexes:
    "order_changes_2022_02_pkey" PRIMARY KEY, lsm (user_id HASH, account_id ASC, change_date ASC)

          Table "public.order_changes_2022_03"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 user_id     | integer |           | not null | 
 account_id  | integer |           | not null | 
 change_date | date    |           | not null | 
 description | text    |           |          | 
Partition of: order_changes FOR VALUES FROM ('2022-03-01') TO ('2022-04-01')
Indexes:
    "order_chan

### Connect to YugabyteDB using the PostgreSQL Driver for Python
The following cells require:
- Python 3.8+ and psycopg2
- SQLAlchemy

Create a connection string to connect to YugabyteDB through the Python application.

In [10]:
%config SqlMagic.autocommit=True

In [11]:
# Connect to db_ybu
import psycopg2
import sqlalchemy as alc
from sqlalchemy import create_engine

# env_var.env
db_host=MY_HOST_IPv4_01
db_name=MY_DB_NAME

connection_str='postgresql+psycopg2://yugabyte@'+db_host+':5433/'+db_name
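
The string concatenation above can be wrapped in a small helper so that the host, port, and database name are explicit parameters. This is a minimal sketch: the function name `build_connection_url` is hypothetical, and the defaults (`yugabyte` user, port `5433`) are taken from the connection string used in this notebook.

```python
def build_connection_url(host, db_name, user="yugabyte", port=5433):
    """Return a SQLAlchemy-style URL that selects the psycopg2 driver."""
    return f"postgresql+psycopg2://{user}@{host}:{port}/{db_name}"


# Values as used in this lab: the first node's address and the db_ybu database.
connection_url = build_connection_url("127.0.0.1", "db_ybu")
print(connection_url)
```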

#### Load SQL magic extension
>IMPORTANT!
>
> To use SQL magic, you must run the following cell that loads the notebook extension.
> 
The connection string is used to create a connection for SQL magic.

In [12]:
%reload_ext sql
# creates connection for sql magic
%sql {connection_str}


#### Verify Table DDL DML operations
Run the cell below to view the row counts for the tables.

#### Review Partitions

In the following result sets, the value of the `change_date` attribute determines which partition stores each row. The partitioned (parent) table is virtual and stores no data itself; the partitions do.

In [17]:
%%sql

-- SELECT * FROM order_changes
-- SELECT * FROM order_changes_2022_03
-- SELECT * FROM order_changes_2022_02
SELECT * FROM order_changes_default
-- SELECT * FROM order_changes_2022_04

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.


user_id,account_id,change_date,description
3,2002,2022-01-25,add sprinklers


#### Indexed Relations

By creating an index on the partitioned or parent table, a matching index is also created on any partitions that exist now or in the future. An index or unique constraint declared on a partitioned table is “virtual” in the same way that the partitioned table is: the actual data is in child indexes on the individual partition tables.

In [18]:
%%sql

CREATE INDEX ON order_changes (change_date);

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]
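
To anticipate which child indexes this statement creates, the names can be predicted with a short sketch. It assumes the default PostgreSQL-style naming pattern `<table>_<columns>_idx`, which matches the `order_changes_2022_04_change_date_idx` index shown later in this notebook; it ignores name truncation and the numeric suffixes added on name collisions. The function name `expected_child_index_names` is hypothetical.

```python
def expected_child_index_names(partitions, columns):
    """Map each partition name to the index name it is expected to receive."""
    suffix = "_".join(columns)
    return {partition: f"{partition}_{suffix}_idx" for partition in partitions}


# Partitions that exist when the index is created in this lab.
partitions = [
    "order_changes_2022_02",
    "order_changes_2022_03",
    "order_changes_default",
]
names = expected_child_index_names(partitions, ["change_date"])
for partition, index_name in names.items():
    print(partition, "->", index_name)
```

The actual names can be confirmed with `\d` on each partition through `ysqlsh`.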

#### Partition Maintenance

It is common for the set of partitions that make up a table to change over time. Partitions are frequently dropped to dispose of old data and created to hold new data.

Partitions simplify the removal of old data with the following command:

In [22]:
%%sql

DROP TABLE order_changes_2022_02;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

Alternatively, you can remove a partition from the partitioned table but retain access to its data as a standalone table. This is useful in case a report or aggregation of the data is still needed.

This is done by detaching the partition from the partitioned table with the following statement:
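
As a minimal sketch, the detach statement can be composed in Python and then run through SQL magic or `ysqlsh`. It assumes the standard PostgreSQL `ALTER TABLE ... DETACH PARTITION` syntax; the helper name `detach_partition_sql` is hypothetical, and `order_changes_2022_03` is used only as an example partition.

```python
import re

# Accept only simple lowercase identifiers, since the names are
# interpolated directly into the SQL text.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def detach_partition_sql(parent, partition):
    """Build a statement that detaches `partition` from `parent`."""
    for name in (parent, partition):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unexpected identifier: {name!r}")
    return f"ALTER TABLE {parent} DETACH PARTITION {partition};"


statement = detach_partition_sql("order_changes", "order_changes_2022_03")
print(statement)
```

After a detach, the partition is expected to remain queryable as an ordinary table, while queries on `order_changes` no longer include its rows.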

In [19]:
%%sql
DROP TABLE IF EXISTS order_changes_2022_04;

CREATE TABLE order_changes_2022_04 PARTITION OF order_changes FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.
Done.


[]

In [20]:
%%sql

INSERT INTO order_changes VALUES(4, 4001, '2022-04-11', 'add drink');

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.


[]

In [25]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# ./bin/ysqlsh -d ${DB_NAME} -c "\d+ order_changes"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d+"
./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_2022_04"


          Table "public.order_changes_2022_04"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 user_id     | integer |           | not null | 
 account_id  | integer |           | not null | 
 change_date | date    |           |          | 
 description | text    |           |          | 
Partition of: order_changes FOR VALUES FROM ('2022-04-01') TO ('2022-05-01')
Indexes:
    "order_changes_2022_04_change_date_idx" lsm (change_date HASH)



In [27]:
%%sql

-- SELECT * FROM order_changes
SELECT * FROM order_changes_2022_04


 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.


user_id,account_id,change_date,description
4,4001,2022-04-11,add drink


## Tablespaces and Geo Row Partitioning

In a distributed cloud-native database such as YugabyteDB, the location of tables and indexes plays a very important role in achieving optimal performance for any workload.

Because distance affects node-to-node communication, it is useful to specify, at the table level, how a table's data should be spread across the cluster. This way, you can move tables closer to their clients and decide which tables actually need to be geo-distributed.


In this section, we first create a table that is partitioned on the `geo_partition` attribute, which holds the name of a region.

In [28]:
%%sql

DROP TABLE IF EXISTS transactions;

CREATE TABLE transactions (
  user_id       INT NOT NULL,
  account_id    INT NOT NULL,
  geo_partition TEXT,
  account_type  TEXT NOT NULL,
  amount        NUMERIC NOT NULL,
  created_at    TIMESTAMP DEFAULT NOW()
) PARTITION BY LIST (geo_partition);

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.
Done.


[]

## Tablespaces

YSQL tablespaces are entities that specify the number of replicas for a set of tables or indexes, and how these replicas should be distributed across a set of clouds, regions, and zones.

In the following demo, we will create tablespaces that are each assigned to a specific location, determined by the placement block.

In [29]:
%%sql

CREATE TABLESPACE tblspace_us WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud1", "region": "region1", "zone": "zone1", "min_num_replicas": 1}]}'
)

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

In [30]:
%%sql

CREATE TABLESPACE tblspace_eu WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud2", "region": "region2", "zone": "zone2", "min_num_replicas": 1}]}'
)

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

In [31]:
%%sql

CREATE TABLESPACE tblspace_ap WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud3", "region": "region3", "zone": "zone3", "min_num_replicas": 1}]}'
)

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]
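
The three statements above differ only in the tablespace name and placement block, so they can be generated from a small table of locations. This is a minimal sketch: the helper names `replica_placement` and `create_tablespace_sql` are hypothetical, and the check that the `min_num_replicas` values sum to no more than `num_replicas` is an assumption about what the database accepts.

```python
import json


def replica_placement(blocks, num_replicas):
    """Serialize a placement policy to the JSON text used by replica_placement."""
    total_min = sum(block["min_num_replicas"] for block in blocks)
    # Assumption: the per-block minimums must fit within num_replicas.
    if total_min > num_replicas:
        raise ValueError("sum of min_num_replicas exceeds num_replicas")
    return json.dumps({"num_replicas": num_replicas, "placement_blocks": blocks})


def create_tablespace_sql(name, cloud, region, zone):
    """Build a single-replica CREATE TABLESPACE statement for one location."""
    block = {"cloud": cloud, "region": region, "zone": zone, "min_num_replicas": 1}
    placement = replica_placement([block], num_replicas=1)
    return f"CREATE TABLESPACE {name} WITH (replica_placement='{placement}');"


# Locations match the --placement_info values used when the cluster was created.
LOCATIONS = {
    "tblspace_us": ("cloud1", "region1", "zone1"),
    "tblspace_eu": ("cloud2", "region2", "zone2"),
    "tblspace_ap": ("cloud3", "region3", "zone3"),
}
statements = [create_tablespace_sql(name, *loc) for name, loc in LOCATIONS.items()]
for statement in statements:
    print(statement)
```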

In [32]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

./bin/ysqlsh -d ${DB_NAME} -c "\db+"

                                                                                                                  List of tablespaces
    Name     |  Owner   | Location | Access privileges |                                                                               Options                                                                               |   Size    | Description 
-------------+----------+----------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------
 pg_default  | postgres |          |                   |                                                                                                                                                                     | 608 bytes | 
 pg_global   | postgres |          |                   |                                                                              

The cell above uses `bash` and `ysqlsh` to run the `\db+` meta-command, which lists the tablespaces. `ysqlsh` is the YugabyteDB counterpart of `psql`, the PostgreSQL interactive terminal.

#### Create Partitions

Each partition is created with a `TABLESPACE` clause, which assigns its geographic location, and a `FOR VALUES IN` list, which names the `geo_partition` values that the partition accepts.

In [33]:
%%sql /* Partition transactions_us - US */

CREATE TABLE transactions_us PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('US') TABLESPACE tblspace_us

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

In [34]:
%%sql /* Partition transactions_eu - EU */

CREATE TABLE transactions_eu PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('EU') TABLESPACE tblspace_eu

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

In [35]:
%%sql /* Partition transactions_ap - India */

CREATE TABLE transactions_ap PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('India') TABLESPACE tblspace_ap

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

#### Add records to the transactions table

Note that each new row is routed by its `geo_partition` value to a specific table partition. Because each partition's tablespace is assigned to a geographic location, the row is stored in that location. This supports data residency: complying with the regulatory requirements of the country or region in which the data resides. For example, think about the user data stored at TikTok and the regulatory standards that restrict storing that data outside US borders.

Data localization also plays a role in performance. Keeping the data closer to the client reduces network latency, which improves the response time of the database. Understanding which data is needed where, coupled with the ability to place data in a particular location, is an important tool in distributed SQL systems.

In [36]:
%%sql

INSERT INTO transactions  VALUES (1, 100, 'US', 'customer', 100, now());
INSERT INTO transactions  VALUES (2, 200, 'EU', 'customer', 200, now());
INSERT INTO transactions  VALUES (3, 300, 'India', 'customer', 300, now());

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.
1 rows affected.
1 rows affected.


[]

## Validate SQL operations

Review the rows to see how the `geo_partition` attribute determines the table partition.

In [41]:
%%sql

-- SELECT * FROM transactions
-- SELECT * FROM transactions_us
-- SELECT * FROM transactions_eu
-- SELECT * FROM transactions_ap
-- SELECT tableoid::regclass, user_id, account_id, geo_partition FROM transactions
SELECT tableoid::regclass, user_id, account_id, geo_partition  FROM transactions_us
-- SELECT tableoid::regclass, user_id, account_id, geo_partition FROM transactions_eu
-- SELECT tableoid::regclass, user_id, account_id, geo_partition  FROM transactions_ap

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.


tableoid,user_id,account_id,geo_partition
transactions_us,1,100,US


#### Object Identifiers

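Object identifiers (OIDs) are used internally by PostgreSQL as primary keys for various system tables. OIDs are not added to user-created tables unless `WITH OIDS` is specified when the table is created or the `default_with_oids` configuration variable is enabled. In the preceding query, the `tableoid` system column holds the OID of the table that physically stores each row, and the `::regclass` cast displays that OID as a table name, which is why the result shows `transactions_us`.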

---
# All done!
In this lab, you completed the following:

- Setup
  - Created the `db_ybu` database with `ysqlsh`
  - Created tables and loaded data using DDL and DML scripts
  - Connected to the database using a PostgreSQL driver for Python
  - Created a partitioned table
  - Created tablespaces
  - Created a geo-located row