## Setup steps
Here are the steps to set up this lab:
- Install missing dependencies and restart the notebook
- Create the notebook variables
- Create loopback IP addresses
- Spin up cluster locally
- Create the `db_ybu` database

### Install missing dependencies and restart the notebook
Run the following cell to ensure that the notebook dependencies are installed.

In [None]:
!pip install ipython-sql
!pip3 install psycopg2-binary==2.8.6
!pip install sqlalchemy

### Create the notebook variables 

> IMPORTANT!
> 
> Do NOT skip running this cell. 
> 

The following Python cell creates and stores variables that all the notebooks in this lab will use. You can view these variables in the Jupyter tab.

- To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.
- Verify the accuracy of the output values

In [3]:
# Env variables for Notebook
import os

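# read env_vars.env
env_vars = !cat env_vars.env
for var in env_vars:
    # Skip blank lines and comments; split only on the first '='
    if not var.strip() or var.lstrip().startswith('#'):
        continue
    key, value = var.split('=', 1)
    os.environ[key] = value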
 

# Comment out Local
# MY_YB_PATH=os.environ.get('MY_YB_PATH_LOCAL')
# MY_GITPOD_WORKSPACE_URL=os.environ.get('MY_GITPOD_WORKSPACE_URL_LOCAL')
# MY_SUDO=os.environ.get('MY_SUDO')

# Gitpod specific
MY_YB_PATH=os.environ.get('MY_YB_PATH')
MY_GITPOD_WORKSPACE_URL=os.environ.get('GITPOD_WORKSPACE_URL')

# env_vars defines the following
MY_DB_NAME=os.environ.get('MY_DB_NAME')
MY_HOST_IPv4_01=os.environ.get('MY_HOST_IPv4_01')
MY_HOST_IPv4_02=os.environ.get('MY_HOST_IPv4_02')
MY_HOST_IPv4_03=os.environ.get('MY_HOST_IPv4_03')
MY_TSERVER_WEBSERVER_PORT=os.environ.get('MY_TSERVER_WEBSERVER_PORT')
MY_DATA3_DDL_FILE=os.environ.get('MY_DATA3_DDL_FILE')
MY_DATA3_DML_FILE=os.environ.get('MY_DATA3_DML_FILE')
MY_UTIL_FUNCTIONS_FILE=os.environ.get("MY_UTIL_FUNCTIONS_FILE")
MY_UTIL_YBTSERVER_METRICS_FILE=os.environ.get("MY_UTIL_YBTSERVER_METRICS_FILE")

# Current directory of project and related child folders
MY_NOTEBOOK_DIR=os.getcwd()
MY_NOTEBOOK_DATA_FOLDER=MY_NOTEBOOK_DIR +'/data'
MY_NOTEBOOK_UTILS_FOLDER=MY_NOTEBOOK_DIR + '/utils'

# Store the notebook values for other notebooks to use
%store MY_DB_NAME
%store MY_YB_PATH
%store MY_GITPOD_WORKSPACE_URL
%store MY_HOST_IPv4_01
%store MY_HOST_IPv4_02
%store MY_HOST_IPv4_03
%store MY_NOTEBOOK_DIR
%store MY_TSERVER_WEBSERVER_PORT
%store MY_NOTEBOOK_DATA_FOLDER
%store MY_NOTEBOOK_UTILS_FOLDER
%store MY_DATA3_DDL_FILE
%store MY_DATA3_DML_FILE
%store MY_UTIL_FUNCTIONS_FILE
%store MY_UTIL_YBTSERVER_METRICS_FILE

/Users/markkim/Documents/YBU_repos/jupyter/YSQL/data /Users/markkim/Documents/YBU_repos/jupyter/YSQL/utils
Stored 'MY_DB_NAME' (str)
Stored 'MY_YB_PATH' (str)
Stored 'MY_GITPOD_WORKSPACE_URL' (NoneType)
Stored 'MY_HOST_IPv4_01' (str)
Stored 'MY_HOST_IPv4_02' (str)
Stored 'MY_HOST_IPv4_03' (str)
Stored 'MY_NOTEBOOK_DIR' (str)
Stored 'MY_TSERVER_WEBSERVER_PORT' (str)
Stored 'MY_NOTEBOOK_DATA_FOLDER' (str)
Stored 'MY_NOTEBOOK_UTILS_FOLDER' (str)
Stored 'MY_DATA3_DDL_FILE' (str)
Stored 'MY_DATA3_DML_FILE' (str)
Stored 'MY_UTIL_FUNCTIONS_FILE' (str)
Stored 'MY_UTIL_YBTSERVER_METRICS_FILE' (str)
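
The parsing step of the cell above can also be sketched in plain Python, without the `!cat` shell escape. This is a minimal sketch, not the notebook's own method: the `parse_env_lines` helper and the sample lines are hypothetical, and the real `env_vars.env` file may use conventions (quoting, `export` prefixes) that this sketch does not handle.

```python
def parse_env_lines(lines):
    """Return a dict of KEY=VALUE pairs, skipping blank lines and comments."""
    result = {}
    for raw in lines:
        line = raw.strip()
        # Ignore empty lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip()
    return result


# Hypothetical file content, for illustration only
sample = [
    "# sample env file",
    "MY_DB_NAME=db_ybu",
    "",
    "MY_URL=http://127.0.0.1:7000/?a=b",
]
parsed = parse_env_lines(sample)
print(parsed)
```

To use it on the real file, pass an open file handle, for example `parse_env_lines(open("env_vars.env"))`, and copy the result into `os.environ` with `os.environ.update(parsed)`.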


### Create the `db_ybu` database with `ysqlsh`
Run the following cell to connect to the local host using `ysqlsh`, create the `db_ybu` database, and then list the databases.

In [4]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  # create database
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# drop and create
./bin/ysqlsh -d yugabyte -c "drop database if exists ${DB_NAME};"
./bin/ysqlsh -d yugabyte -c "create database ${DB_NAME};"

# list dbs
./bin/ysqlsh -d yugabyte -c "\l"

bash: line 4: cd: /home/gitpod/yugabyte/: No such file or directory
bash: line 7: ./bin/ysqlsh: No such file or directory
bash: line 8: ./bin/ysqlsh: No such file or directory
bash: line 11: ./bin/ysqlsh: No such file or directory


CalledProcessError: Command 'b'YB_PATH=${1}\nDB_NAME=${2}\n\ncd $YB_PATH\n\n# drop and create\n./bin/ysqlsh -d yugabyte -c "drop database if exists "${DB_NAME}";"  \n./bin/ysqlsh -d yugabyte -c "create database "${DB_NAME}";" \n\n# list dbs\n./bin/ysqlsh -d yugabyte -c "\\l"\n'' returned non-zero exit status 127.
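
The output above shows what happens when `MY_YB_PATH` points to a directory that does not exist: every `ysqlsh` call fails. A quick check before running the bash cells might look like the following sketch. The `ysqlsh_path` helper is hypothetical, and it assumes only what the cells already assume, namely that `ysqlsh` lives in `bin/` under `MY_YB_PATH`.

```python
import os


def ysqlsh_path(yb_path):
    """Return the path to bin/ysqlsh under yb_path, or None if it is missing."""
    candidate = os.path.join(yb_path, "bin", "ysqlsh")
    return candidate if os.path.isfile(candidate) else None


# MY_YB_PATH is read from the environment, as in the variables cell
yb_path = os.environ.get("MY_YB_PATH", "")
found = ysqlsh_path(yb_path)
print(found or f"ysqlsh not found under {yb_path!r}; check MY_YB_PATH")
```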

## Connect to YugabyteDB using the PostgreSQL Driver for Python
The following cells require:
- Python 3.8+ and psycopg2

### Create tables and load data using DDL and DML scripts
In this section of the notebook, you will:
- Create tables with a DDL script
- Load data with a DML script
- Verify the creation of tables and data
- View the DDL for `order_changes`

##### Create tables, load data, and review relations
Run the following cell to execute the DDL and DML scripts using `ysqlsh`.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA3_DDL_FILE" "$MY_DATA3_DML_FILE"   # order_changes
YB_PATH=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
DATA_DDL_FILE=${4}
DATA_DML_FILE=${5}

ORDER_DDL_PATH=${DATA_FOLDER}/${DATA_DDL_FILE}
ORDER_DML_PATH=${DATA_FOLDER}/${DATA_DML_FILE}
echo $ORDER_DDL_PATH
echo $ORDER_DML_PATH

cd $YB_PATH

# DDL file
./bin/ysqlsh -d ${DB_NAME} -f ${ORDER_DDL_PATH} >&/dev/null
sleep 1;

# DML file
./bin/ysqlsh -d ${DB_NAME} -f ${ORDER_DML_PATH} >&/dev/null
sleep 1;

# Describe relations
./bin/ysqlsh -d ${DB_NAME} -c "\d"

##### View DDL for Table partitions
Run the following cell using `ysqlsh` to view a table definition.

> Note
> 
> SQL magic does not support PostgreSQL `psql` commands. In order to execute `psql` commands, the notebook uses bash and `ysqlsh`.



In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

./bin/ysqlsh -d ${DB_NAME} -c "\dt"
./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_2022_02"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_2022_03"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d order_changes_default"

##### Set Autocommit

Set autocommit to `True` so that each statement is committed immediately. Otherwise, the tablespace creation later in this notebook fails with a transaction block error.

In [None]:
%config SqlMagic.autocommit=True

In [None]:
# Connect to db_ybu
# Inspiration from https://medium.com/analytics-vidhya/postgresql-integration-with-jupyter-notebook-deb97579a38d
import psycopg2
import sqlalchemy as alc
from sqlalchemy import create_engine

# env_var.env
db_host=MY_HOST_IPv4_01
db_name=MY_DB_NAME

connection_str='postgresql+psycopg2://yugabyte@'+db_host+':5433/'+db_name

# engine = create_engine(connection_str)
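
The connection string in the cell above is built by string concatenation. As a sketch of the same idea with the parts made explicit, the hypothetical helper below assembles the URL and percent-encodes the user and database names. The host `127.0.0.1` is a stand-in for `MY_HOST_IPv4_01`; the user `yugabyte` and port `5433` are taken from the cell above.

```python
from urllib.parse import quote


def build_connection_url(host, db_name, user="yugabyte", port=5433):
    """Build a SQLAlchemy-style URL for the psycopg2 driver."""
    # quote() with safe="" escapes characters such as '/' or '@' in names
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}"
        f"@{host}:{port}/{quote(db_name, safe='')}"
    )


# Stand-in values; in the notebook these come from the stored variables
url = build_connection_url("127.0.0.1", "db_ybu")
print(url)
```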

#### Load SQL magic extension
>IMPORTANT!
>
> To use SQL magic, you must run the following cell that loads the notebook extension.

In [None]:
%reload_ext sql
# creates connection for sql magic
%sql {connection_str}

#### Show table row counts
Run the cell below to view the rows in the `order_changes` table. To view the rows of a single partition, run one of the commented-out queries instead.

In [None]:
%%sql

SELECT * FROM order_changes
-- SELECT * FROM order_changes_2022_03 
-- SELECT * FROM order_changes_2022_02 
-- SELECT * FROM order_changes_default

#### Indexed Relations

By creating an index on the partitioned or parent table, a matching index is also created on any partitions that exist now or in the future. An index or unique constraint declared on a partitioned table is “virtual” in the same way that the partitioned table is: the actual data is in child indexes on the individual partition tables.

In [None]:
%%sql

CREATE INDEX ON order_changes (change_date)

#### Partition Maintenance

It is common to have a dynamic set of partitions that define a table. Partitions are frequently dropped and created to dispose of old information and add new info. 

Partitions simplify the removal of old data with the following command:

In [None]:
%%sql

DROP TABLE order_changes_2022_02;

Alternatively, you can remove a partition from the partitioned table but retain access to its data as a standalone table, in case a report or aggregation of the data is still necessary.

To do this, detach the partition from the partitioned table with an `ALTER TABLE ... DETACH PARTITION` statement.
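
As a minimal sketch, the detach statement can be built in Python and then run in a `%%sql` cell. The `detach_partition_sql` helper is hypothetical. The example uses `order_changes_2022_03` because `order_changes_2022_02` was dropped above; it assumes that partition exists, as the commented-out queries earlier in this notebook suggest.

```python
import re

# Accept only plain, unquoted SQL identifiers
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def detach_partition_sql(parent, partition):
    """Return an ALTER TABLE ... DETACH PARTITION statement."""
    for name in (parent, partition):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"ALTER TABLE {parent} DETACH PARTITION {partition};"


stmt = detach_partition_sql("order_changes", "order_changes_2022_03")
print(stmt)
```

After a detach, the partition remains as an ordinary table that you can still query by name.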

## Tablespaces and Geo Row Partitioning

In [None]:
%%sql

CREATE TABLE transactions (
  user_id       INT NOT NULL,
  account_id    INT NOT NULL,
  geo_partition TEXT,
  account_type  TEXT NOT NULL,
  amount        NUMERIC NOT NULL,
  created_at    TIMESTAMP DEFAULT NOW()
) PARTITION BY LIST (geo_partition)


## Tablespaces

A tablespace defines where the data of the tables assigned to it is stored. In this example, each tablespace is assigned to a node in a particular cloud, region, and zone.

In [None]:
%%sql

CREATE TABLESPACE tblspace_us WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud1", "region": "region1", "zone": "zone1", "min_num_replicas": 1}]}'
)

In [None]:
%%sql

CREATE TABLESPACE  tblspace_eu WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud2", "region": "region2", "zone": "zone2", "min_num_replicas": 1}]}'
)

In [None]:
%%sql

CREATE TABLESPACE tblspace_ap WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [{"cloud": "cloud3", "region": "region3", "zone": "zone3", "min_num_replicas": 1}]}'
)

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

./bin/ysqlsh -d ${DB_NAME} -c "\db+"
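
The three `CREATE TABLESPACE` statements above differ only in the tablespace name and the cloud, region, and zone values. A sketch of generating them from Python follows. The `create_tablespace_sql` helper is hypothetical, and it mirrors only the `replica_placement` fields used in this notebook.

```python
import json


def create_tablespace_sql(name, cloud, region, zone, num_replicas=1):
    """Return a CREATE TABLESPACE statement with a single placement block."""
    placement = {
        "num_replicas": num_replicas,
        "placement_blocks": [
            {
                "cloud": cloud,
                "region": region,
                "zone": zone,
                "min_num_replicas": num_replicas,
            }
        ],
    }
    # json.dumps produces the JSON document that replica_placement expects
    return (
        f"CREATE TABLESPACE {name} "
        f"WITH (replica_placement='{json.dumps(placement)}')"
    )


stmt = create_tablespace_sql("tblspace_us", "cloud1", "region1", "zone1")
print(stmt)
```

Building the JSON with `json.dumps` avoids hand-editing nested quotes and brackets when a placement value changes.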

#### Create Table Partitions

Each partition determines which rows it holds based on the value of `geo_partition`. Because the table is partitioned by LIST, not RANGE, a row is assigned to a partition only if its `geo_partition` value appears in that partition's list of values.

In [None]:
%%sql /* Table Reads */

CREATE TABLE transactions_us PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('US') TABLESPACE tblspace_us

In [None]:
%%sql /* Table Reads */

CREATE TABLE transactions_eu PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('EU') TABLESPACE tblspace_eu

In [None]:
%%sql /* Table Reads */

CREATE TABLE transactions_ap PARTITION OF transactions
    (user_id, account_id, geo_partition, account_type, amount, created_at,
    PRIMARY KEY (user_id HASH, account_id, geo_partition))
  FOR VALUES IN ('India') TABLESPACE tblspace_ap
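
The routing rule that these list partitions express can be modeled in a few lines of Python. This is an illustrative model only, not how the database implements partitioning; the `PARTITIONS` mapping and `route_row` function are hypothetical names that mirror the three `FOR VALUES IN` clauses above.

```python
# Partition name -> set of geo_partition values it accepts
PARTITIONS = {
    "transactions_us": {"US"},
    "transactions_eu": {"EU"},
    "transactions_ap": {"India"},
}


def route_row(geo_partition, partitions=PARTITIONS):
    """Return the partition whose value list contains geo_partition."""
    for table, values in partitions.items():
        if geo_partition in values:
            return table
    # No default partition is defined, so unmatched values are rejected
    raise LookupError(f"no partition accepts value {geo_partition!r}")


print(route_row("US"))
print(route_row("India"))
```

Because no default partition is defined in this notebook, a value outside the three lists has no partition to go to, which the model represents with `LookupError`.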

#### Add records to the transactions table

Note that each new record is routed to a specific table partition by its `geo_partition` value. Because each partition's tablespace is assigned to a geographic location, this placement offers data residency: it helps comply with the regulatory requirements that the laws of a country or region impose on data stored there. For example, consider the user data stored by TikTok and the regulatory standards that prevent that data from being stored outside a country's boundaries.

Data localization also plays a role in performance. Keeping the data closer to the client reduces network latency, which improves the response time of the database. Understanding which data is needed where, coupled with the ability to place data in a particular location, is an important tool in distributed SQL systems.

In [None]:
%%sql

INSERT INTO transactions  VALUES (1, 100, 'US', 'customer', 100, now())
-- INSERT INTO transactions  VALUES (2, 200, 'EU', 'customer', 200, now())
-- INSERT INTO transactions  VALUES (3, 300, 'India', 'customer', 300, now())

## Validate SQL operations

Review the rows to see how the `geo_partition` attribute determines the table partition. In this case, we set the location of each tablespace, then assigned each tablespace to a partition that accepts a specific `geo_partition` value, so that value determines which tablespace stores a row. This is geo row partitioning.

In [None]:
%%sql

SELECT * FROM transactions
-- SELECT * FROM transactions_us
-- SELECT * FROM transactions_eu
-- SELECT * FROM transactions_ap
-- SELECT tableoid::regclass, user_id, account_id, geo_partition FROM transactions
-- SELECT tableoid::regclass, user_id, account_id, geo_partition  FROM transactions_us
-- SELECT tableoid::regclass, user_id, account_id, geo_partition FROM transactions_eu
-- SELECT tableoid::regclass, user_id, account_id, geo_partition  FROM transactions_ap



#### Object Identifiers

The commented-out queries in the previous cell select `tableoid::regclass`. They use the object identifier (OID) of the table that stores each row to show which partition the row belongs to.
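
Once a `tableoid::regclass` query has returned its rows, they can be grouped by partition in Python. The sketch below is hypothetical: `rows_per_partition` is not part of the notebook, and `sample_rows` is illustrative data that assumes all three `INSERT` statements above were run, not captured output.

```python
from collections import Counter

# Query from the validation cell; the first column names the partition
QUERY = (
    "SELECT tableoid::regclass, user_id, account_id, geo_partition "
    "FROM transactions"
)


def rows_per_partition(rows):
    """Count rows by the partition name in the first column."""
    return Counter(str(row[0]) for row in rows)


# Illustrative rows only, shaped like the result of QUERY
sample_rows = [
    ("transactions_us", 1, 100, "US"),
    ("transactions_eu", 2, 200, "EU"),
    ("transactions_ap", 3, 300, "India"),
]
counts = rows_per_partition(sample_rows)
print(counts)
```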

---
# All done!
In this lab, you completed the following:

- Setup
  - Created the `db_ybu` database with `ysqlsh`
  - Created tables and loaded data using DDL and DML scripts
  - Connected to the database using a PostgreSQL driver for Python

