# Lab Requirements and Setup

This lab consists of several Jupyter notebooks and runs in Gitpod using VS Code.  Follow the instructions for requirements and setup.

## About Jupyter notebooks
A notebook consists of one or more cells. In VS Code, notebook cells are editable.

There are two types of cells: markdown and code. This is a markdown cell.

You run a code cell by simply selecting the play icon in the cell's left gutter. For code cells, you can modify the code for execution. Certain labs contain challenges or experiments that require you to do just that - modify a code cell and re-run it!

### Requirements
Here are the requirements for this lab:
- Launch using a Gitpod workspace
- Run a three-node YugabyteDB cluster using `yb-ctl`

> Note
>  
> Although a three-node cluster is up and running, Gitpod does not support visiting loopback addresses over a web UI, even if exposed on a different port.
> The web user interface is available only at 127.0.0.1. To see all available ports in Gitpod, in the terminal, run `gp ports list`.

#### Notebook keyboard shortcuts
The Jupyter extension for Gitpod supports the following keyboard shortcuts:
| Keystroke | Description |
|--|--|
| ESC | Change the cell mode |
| A | Add a cell above |
| B | Add a cell below |
| J or down arrow key | Move the selection to the cell below |
| K or up arrow key | Move the selection to the cell above |
| Ctrl+Enter | Run the currently selected cell |
| Shift+Enter | Run the currently selected cell and insert a new cell immediately below (focus moves to new cell) |
| Alt+Enter | Run the currently selected cell and insert a new cell immediately below (focus remains on current cell) |
| dd | Delete a selected cell |
| z | Undo the last change | 
| M | Switch the cell type to Markdown |
| Y | Switch the cell type to code |
| L | Enable/Disable line numbers |


## Setup steps
Here are the steps to set up this lab:
- Install missing dependencies and restart the notebook
- Create the notebook variables
- Create the `db_ybu` database

### Install missing dependencies and restart the notebook
Run the following cell to ensure that the notebook dependencies are available to the notebook. 

In [None]:
!pip install ipython-sql
!pip install psycopg2-binary
!pip install sqlalchemy 

> Important!
> 
> Restart the Notebook.
> 
> Do NOT skip this step.
> 
> After restarting the notebook, you can continue running notebook cells below, at **Create the notebook variables**.


### Create the notebook variables 

> IMPORTANT!
> 
> Do NOT skip running this cell. 
> 

The following Python cell creates and stores variables that all the notebooks in this lab will use. You can view these variables in the Jupyter tab.

- To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.
- Verify the accuracy of the output values

In [1]:
# Env variables for Notebook
import os

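# Read env_vars.env (one KEY=VALUE pair per line)
env_vars = !cat env_vars.env
for var in env_vars:
    if '=' not in var:
        continue  # skip blank or malformed lines
    # Split on the first '=' only, so values may contain '='
    key, value = var.split('=', 1)
    os.environ[key] = value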
 

# Local settings: comment these out when running in Gitpod
MY_YB_PATH=os.environ.get('MY_YB_PATH_LOCAL')
MY_GITPOD_WORKSPACE_URL=os.environ.get('MY_GITPOD_WORKSPACE_URL_LOCAL')
MY_SUDO=os.environ.get('MY_SUDO')

# Gitpod specific
# MY_YB_PATH=os.environ.get('MY_YB_PATH')
# MY_GITPOD_WORKSPACE_URL=os.environ.get('GITPOD_WORKSPACE_URL')

# env_vars defines the following
MY_DB_NAME=os.environ.get('MY_DB_NAME')
MY_HOST_IPv4_01=os.environ.get('MY_HOST_IPv4_01')
MY_HOST_IPv4_02=os.environ.get('MY_HOST_IPv4_02')
MY_HOST_IPv4_03=os.environ.get('MY_HOST_IPv4_03')
MY_TSERVER_WEBSERVER_PORT=os.environ.get('MY_TSERVER_WEBSERVER_PORT')
MY_DATA_DDL_FILE=os.environ.get("MY_DATA_DDL_FILE")
MY_DATA_DML_FILE=os.environ.get("MY_DATA_DML_FILE")
print(MY_DATA_DDL_FILE, MY_DATA_DML_FILE)
MY_UTIL_FUNCTIONS_FILE=os.environ.get("MY_UTIL_FUNCTIONS_FILE")
MY_UTIL_YBTSERVER_METRICS_FILE=os.environ.get("MY_UTIL_YBTSERVER_METRICS_FILE")

# Current directory of project and related child folders
MY_NOTEBOOK_DIR=os.getcwd()
MY_NOTEBOOK_DATA_FOLDER=MY_NOTEBOOK_DIR +'/data'
MY_NOTEBOOK_UTILS_FOLDER=MY_NOTEBOOK_DIR + '/utils'

print(MY_NOTEBOOK_DATA_FOLDER, MY_NOTEBOOK_UTILS_FOLDER)
# Store the notebook values for other notebooks to use

%store MY_DB_NAME
%store MY_YB_PATH
%store MY_GITPOD_WORKSPACE_URL
%store MY_HOST_IPv4_01
%store MY_HOST_IPv4_02
%store MY_HOST_IPv4_03
%store MY_NOTEBOOK_DIR
%store MY_TSERVER_WEBSERVER_PORT
%store MY_NOTEBOOK_DATA_FOLDER
%store MY_NOTEBOOK_UTILS_FOLDER
%store MY_DATA_DDL_FILE
%store MY_DATA_DML_FILE
%store MY_UTIL_FUNCTIONS_FILE
%store MY_UTIL_YBTSERVER_METRICS_FILE
%store MY_SUDO

company_ddl.sql company_dml.sql
/Users/markkim/Documents/YBU_repos/jupyter/YSQL/data /Users/markkim/Documents/YBU_repos/jupyter/YSQL/utils
Stored 'MY_DB_NAME' (str)
Stored 'MY_YB_PATH' (str)
Stored 'MY_GITPOD_WORKSPACE_URL' (str)
Stored 'MY_HOST_IPv4_01' (str)
Stored 'MY_HOST_IPv4_02' (str)
Stored 'MY_HOST_IPv4_03' (str)
Stored 'MY_NOTEBOOK_DIR' (str)
Stored 'MY_TSERVER_WEBSERVER_PORT' (str)
Stored 'MY_NOTEBOOK_DATA_FOLDER' (str)
Stored 'MY_NOTEBOOK_UTILS_FOLDER' (str)
Stored 'MY_DATA_DDL_FILE' (str)
Stored 'MY_DATA_DML_FILE' (str)
Stored 'MY_UTIL_FUNCTIONS_FILE' (str)
Stored 'MY_UTIL_YBTSERVER_METRICS_FILE' (str)
Stored 'MY_SUDO' (str)
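
The parsing loop in the cell above relies on the IPython `!cat` shell escape. The same step can be sketched in plain Python, assuming `env_vars.env` holds simple `KEY=VALUE` lines. The helper name `parse_env_lines` and the sample lines are hypothetical and not part of the lab files.

```python
def parse_env_lines(lines):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    parsed = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        parsed[key.strip()] = value.strip()
    return parsed


# Illustrative input; the real values come from env_vars.env.
sample = ["MY_DB_NAME=db_ybu", "", "# a comment", "MY_HOST_IPv4_01=127.0.0.1"]
print(parse_env_lines(sample))
```

Splitting on the first `=` only keeps values such as URLs with query strings intact.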


In [2]:
%%bash -s "$MY_SUDO"  # ifconfig aliases
MY_SUDO=${1}

if ifconfig lo0 | grep '127\.0\.0\.[2-7]' > /dev/null
then
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.2
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.3
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.4
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.5
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.6
    echo ${MY_SUDO} | sudo -S ifconfig lo0 delete 127.0.0.7
fi

echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.2
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.3
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.4
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.5
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.6
echo ${MY_SUDO} | sudo -S ifconfig lo0 alias 127.0.0.7

echo ${MY_SUDO} | sudo -S ifconfig lo0

lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
	inet 127.0.0.1 netmask 0xff000000 
	inet6 ::1 prefixlen 128 
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
	inet 127.0.0.2 netmask 0xff000000 
	inet 127.0.0.3 netmask 0xff000000 
	inet 127.0.0.4 netmask 0xff000000 
	inet 127.0.0.5 netmask 0xff000000 
	inet 127.0.0.6 netmask 0xff000000 
	inet 127.0.0.7 netmask 0xff000000 
	nd6 options=201<PERFORMNUD,DAD>


Password:

In [3]:
%%bash -s "$MY_YB_PATH" "$MY_TSERVER_WEBSERVER_PORT"  # yb-ctl create
YB_PATH=${1}
TSERVER_WEBSERVER_PORT=${2}

cd $YB_PATH

### Grep port 9000 for conflict
# lsof -nP -iTCP -sTCP:LISTEN | grep 9000

# Stop running cluster
if  pgrep -x "yb-tserver" > /dev/null 
then
    ./bin/yb-ctl stop
    sleep 1
fi

# Destroy cluster
if ./bin/yb-ctl status | grep "Node Count" > /dev/null
then
    ./bin/yb-ctl destroy
    sleep 1
fi

# Create cluster
./bin/yb-ctl --rf 3 create  \
--tserver_flags "yb_num_shards_per_tserver=1,ysql_num_shards_per_tserver=1,ysql_beta_features=true,webserver_port="${TSERVER_WEBSERVER_PORT}  \
--master_flags "yb_num_shards_per_tserver=1,ysql_num_shards_per_tserver=1" \
--num_shards_per_tserver=1  \
--placement_info "azure.region1.zone1,azure.region1.zone2,azure.region1.zone3" 

# Output status
./bin/yb-ctl status

Stopping cluster.
Destroying cluster.
Creating cluster.
Waiting for cluster to be ready.
----------------------------------------------------------------------------------------------------
| Node Count: 3 | Replication Factor: 3                                                            |
----------------------------------------------------------------------------------------------------
| JDBC                : jdbc:postgresql://127.0.0.1:5433/yugabyte                                  |
| YSQL Shell          : bin/ysqlsh                                                                 |
| YCQL Shell          : bin/ycqlsh                                                                 |
| YEDIS Shell         : bin/redis-cli                                                              |
| Web UI              : http://127.0.0.1:7000/                                                     |
| Cluster Data        : /Users/markkim/yugabyte-data                                               |
--

### Create the `db_ybu` database with `ysqlsh`
Run the following cell to connect to the local host using `ysqlsh`, create the `db_ybu` database, and then list the databases.

In [4]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  # create database
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# drop and create
./bin/ysqlsh -d yugabyte -c "drop database if exists "${DB_NAME}";"  
./bin/ysqlsh -d yugabyte -c "create database "${DB_NAME}";" 

# list dbs
./bin/ysqlsh -d yugabyte -c "\l"

NOTICE:  database "db_ybu" does not exist, skipping


DROP DATABASE
CREATE DATABASE
                                   List of databases
      Name       |  Owner   | Encoding | Collate |    Ctype    |   Access privileges   
-----------------+----------+----------+---------+-------------+-----------------------
 db_ybu          | yugabyte | UTF8     | C       | en_US.UTF-8 | 
 postgres        | postgres | UTF8     | C       | en_US.UTF-8 | 
 system_platform | postgres | UTF8     | C       | en_US.UTF-8 | 
 template0       | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                 |          |          |         |             | postgres=CTc/postgres
 template1       | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                 |          |          |         |             | postgres=CTc/postgres
 yugabyte        | postgres | UTF8     | C       | en_US.UTF-8 | 
(6 rows)



### Create tables and load data using DDL and DML scripts
In this section of the notebook, you will:
- Create tables with a DDL script
- Load data with a DML script
- Verify the creation of tables and data
- View the DDL for the `dept` and `emp` tables

##### Create tables, load data, and review relations
Run the following cell to execute the DDL and DML scripts using `ysqlsh`.

In [5]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DDL_FILE" "$MY_DATA_DML_FILE"   # Company data
YB_PATH=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
DATA_DDL_FILE=${4}
DATA_DML_FILE=${5}

#ls $DATA_FOLDER

COMPANY_DDL_PATH=${DATA_FOLDER}/${DATA_DDL_FILE}
COMPANY_DML_PATH=${DATA_FOLDER}/${DATA_DML_FILE}

cd $YB_PATH

# DDL file
./bin/ysqlsh -d ${DB_NAME} -f ${COMPANY_DDL_PATH} >&/dev/null
sleep 1;

# DML file
./bin/ysqlsh -d ${DB_NAME} -f ${COMPANY_DML_PATH} >&/dev/null
sleep 1;

# Describe relations
# ./bin/ysqlsh -d ${DB_NAME} -c "\d"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d emp"
# ./bin/ysqlsh -d ${DB_NAME} -c "\d dept"

##### View DDL for the `dept` and `emp` tables
Run the following cell using `ysqlsh` to view the table definitions.

> Note
> 
> SQL magic does not support PostgreSQL `psql` commands. In order to execute `psql` commands, the notebook uses bash and `ysqlsh`.



In [7]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"  

YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# ./bin/ysqlsh -d ${DB_NAME} -c "\dt"
./bin/ysqlsh -d ${DB_NAME} -c "\d dept"
./bin/ysqlsh -d ${DB_NAME} -c "\d emp"
./bin/ysqlsh -d ${DB_NAME} -c "\d emp_empno_seq"

                  Table "public.dept"
   Column    |  Type   | Collation | Nullable | Default 
-------------+---------+-----------+----------+---------
 deptno      | integer |           | not null | 
 dname       | text    |           |          | 
 loc         | text    |           |          | 
 description | text    |           |          | 
Indexes:
    "pk_dept" PRIMARY KEY, lsm (deptno ASC)

                               Table "public.emp"
   Column   |  Type   | Collation | Nullable |             Default              
------------+---------+-----------+----------+----------------------------------
 empno      | integer |           | not null | generated by default as identity
 ename      | text    |           | not null | 
 job        | text    |           |          | 
 mgr        | integer |           |          | 
 hiredate   | date    |           |          | 
 sal        | integer |           |          | 
 comm       | integer |           |          | 
 deptno     | inte

## Connect to YugabyteDB using the PostgreSQL Driver for Python
The following cells require:
- Python 3.8+ and psycopg2

In [8]:
# Connect to db_ybu
# Inspiration from https://medium.com/analytics-vidhya/postgresql-integration-with-jupyter-notebook-deb97579a38d
import psycopg2
import sqlalchemy as alc
from sqlalchemy import create_engine

# env_var.env
db_host=MY_HOST_IPv4_01
db_name=MY_DB_NAME

connection_str='postgresql+psycopg2://yugabyte@'+db_host+':5433/'+db_name

# engine = create_engine(connection_str)
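
The connection string in the cell above is assembled by string concatenation. A minimal sketch of the same step as a small helper follows; `build_connection_str` is a hypothetical name, and the default user `yugabyte` and port `5433` are taken from the cell above.

```python
from urllib.parse import quote


def build_connection_str(host, db_name, user="yugabyte", port=5433):
    """Return a SQLAlchemy-style URL for the psycopg2 driver."""
    # Percent-encode the user and database name in case they contain
    # characters that are special in URLs.
    safe_user = quote(user, safe="")
    safe_db = quote(db_name, safe="")
    return f"postgresql+psycopg2://{safe_user}@{host}:{port}/{safe_db}"


print(build_connection_str("127.0.0.1", "db_ybu"))
```

For host `127.0.0.1` and database `db_ybu`, this produces the same URL that SQL magic reports when it connects.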

#### Load SQL magic extension
>IMPORTANT!
>
> To use SQL magic, you must run the following cell that loads the notebook extension.

In [9]:
%reload_ext sql
# creates connection for sql magic
%sql {connection_str}

#### Update salaries and return the new values
Run the cell below to update salaries and view the new values.

A SQL update can compute the new value and return it without the need to query again. The following adds 100 to the salaries of all employees who are not managers and shows the new values.

In [10]:
%%sql /* row counts */

update emp set sal=sal+100
where job != 'MANAGER'
returning ename,sal as new_salary;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
11 rows affected.


ename,new_salary
SMITH,900
ADAMS,1200
WARD,1350
KING,5100
FORD,3100
MARTIN,1350
JAMES,1050
ALLEN,1700
MILLER,1400
SCOTT,3100


#### Scenario 2. Join
List all employees earning more than their managers using a self-join query.

##### Description
A self-join is a regular join in which the table is joined with itself. The following SQL statement matches employees with their managers and keeps those who earn more than their manager.

In [11]:
%%sql

SELECT 
  employee.ename,
  employee.sal,
  manager.ename as "manager ename",
  manager.sal as "manager sal"
FROM
  emp employee
JOIN emp manager ON
  employee.mgr = manager.empno
WHERE
  manager.sal<employee.sal
ORDER BY employee.sal;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
2 rows affected.


ename,sal,manager ename,manager sal
FORD,3100,JONES,2975
SCOTT,3100,JONES,2975
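
To make the self-join logic concrete, here is a minimal Python sketch of the same matching step. The rows are a hand-made sample, not the lab's full `emp` table, and the variable names are illustrative.

```python
# Hand-made sample rows; not the lab's full emp table.
emp = [
    {"empno": 7839, "ename": "KING", "mgr": None, "sal": 5100},
    {"empno": 7566, "ename": "JONES", "mgr": 7839, "sal": 2975},
    {"empno": 7788, "ename": "SCOTT", "mgr": 7566, "sal": 3100},
    {"empno": 7876, "ename": "ADAMS", "mgr": 7788, "sal": 1200},
]

# Index the "manager" side of the join by empno.
by_empno = {row["empno"]: row for row in emp}

earning_more = []
for employee in emp:
    # Inner join: rows without a matching manager are skipped.
    manager = by_empno.get(employee["mgr"])
    if manager is not None and manager["sal"] < employee["sal"]:
        earning_more.append(
            (employee["ename"], employee["sal"], manager["ename"], manager["sal"])
        )

earning_more.sort(key=lambda r: r[1])  # ORDER BY employee.sal
print(earning_more)
```

The same list plays both roles, employee and manager, which is what the two aliases of `emp` express in the SQL statement.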


#### Scenario 4. Indexes
Create and analyze an index on the fly.

##### Description
Create a new table and a specific index to avoid table scans and sorts.

##### SQL Statement
Step 1: Create a new demo table with randomly generated rows

The GENERATE_SERIES function can generate rows. The following uses it to create a table with 42 rows, each with a random value from 0 to 10.

In [12]:
%%sql 

create table demo as select generate_series(1,42) num, round(10*random()) val;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
42 rows affected.


[]

Step 2: Create the index “demo_val” on the demo table

To find the rows for a given value efficiently, with the numbers already in order, the following creates an index on “val” (hashed for distribution) and “num” in ascending order.

In [13]:
%%sql

create index demo_val on demo(val hash,num asc);

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

Step 3: Gather optimizer statistics on table demo

The query planner chooses the best access path when provided with statistics about the data stored in the table. The following gathers those statistics.

In [14]:
%%sql 

analyze demo;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

Step 4: Query the Top-3 numbers for a specific value

The following displays the first three numbers, in ascending order, for the value 5.

In [15]:
%%sql

select * from demo where val=5 order by num asc fetch first 3 rows only;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
3 rows affected.


num,val
5,5.0
11,5.0
17,5.0


Step 5: Verify with explain analyze that the query uses the index

When defining an index for a specific access pattern, the developer should verify that the index is actually used. The following shows that an “Index Only Scan” was used, without the need for an additional “Sort” operation.

In [16]:
%%sql

explain analyze select * from demo where val=5 order by num fetch first 3 rows only;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
7 rows affected.


QUERY PLAN
Limit (cost=0.00..3.19 rows=3 width=12) (actual time=2.837..2.843 rows=3 loops=1)
-> Index Only Scan using demo_val on demo (cost=0.00..4.47 rows=4 width=12) (actual time=2.836..2.839 rows=3 loops=1)
Index Cond: (val = '5'::double precision)
Heap Fetches: 0
Planning Time: 0.105 ms
Execution Time: 2.896 ms
Peak Memory Usage: 8 kB


Step 6: Clean up the table for this exercise.

To leave the database in the same state as before this exercise, the following removes the demo table created earlier.

In [17]:
%%sql

drop table if exists demo;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.


[]

### 2. Built-in Functions
Learn powerful functions for performing complex database operations with ease.

#### Scenario 1. Window Functions
Compare the interval between employee hires, by department, using the LAG function.

LAG is a window function that provides access to the row before the current one. The following SQL statement uses WINDOW to define groups of employees by department, in order of their hiring date. LAG is used to access the previous row in this group, to compute the number of days between two consecutive hires. FORMAT builds text from column values, and COALESCE handles the first hire, for which there is no previous row in the group. Without window functions, this query would have required reading the same table twice.


In [18]:
%%sql

select
dname,ename,job,
coalesce (
  'hired '||to_char(hiredate -
    lag(hiredate) over (per_dept_hiredate),'999')||' days after '||
    lag(ename) over (per_dept_hiredate),
    format('(1st hire in %L)',dname)
) as "last hire in dept"
from emp join dept using(deptno)
window per_dept_hiredate
  as (partition by dname order by hiredate)
order by dname,hiredate;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
14 rows affected.


dname,ename,job,last hire in dept
ACCOUNTING,CLARK,MANAGER,(1st hire in 'ACCOUNTING')
ACCOUNTING,KING,PRESIDENT,hired 161 days after CLARK
ACCOUNTING,MILLER,CLERK,hired 67 days after KING
RESEARCH,SMITH,CLERK,(1st hire in 'RESEARCH')
RESEARCH,JONES,MANAGER,hired 106 days after SMITH
RESEARCH,FORD,ANALYST,hired 245 days after JONES
RESEARCH,SCOTT,ANALYST,hired 371 days after FORD
RESEARCH,ADAMS,CLERK,hired 34 days after SCOTT
SALES,ALLEN,SALESMAN,(1st hire in 'SALES')
SALES,WARD,SALESMAN,hired 2 days after ALLEN
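
The LAG logic can be sketched in plain Python as a loop that remembers the previous row of each department. This is only an illustration: the rows are a small sample chosen to be consistent with the outputs in this lab, not the full `emp` table.

```python
from datetime import date
from itertools import groupby

# Sample (dname, ename, hiredate) rows; a subset for illustration only.
rows = [
    ("ACCOUNTING", "KING", date(1981, 11, 17)),
    ("SALES", "WARD", date(1981, 2, 22)),
    ("ACCOUNTING", "CLARK", date(1981, 6, 9)),
    ("SALES", "ALLEN", date(1981, 2, 20)),
    ("ACCOUNTING", "MILLER", date(1982, 1, 23)),
]

# PARTITION BY dname ORDER BY hiredate
rows.sort(key=lambda r: (r[0], r[2]))

report = []
for dname, group in groupby(rows, key=lambda r: r[0]):
    previous = None  # LAG yields NULL for the first row of each partition
    for _, ename, hiredate in group:
        if previous is None:
            note = f"(1st hire in '{dname}')"  # the COALESCE fallback
        else:
            days = (hiredate - previous[2]).days
            note = f"hired {days} days after {previous[1]}"
        report.append((dname, ename, note))
        previous = (dname, ename, hiredate)

for line in report:
    print(line)
```

Each row is read once, and the only extra state is the previous row of the current partition, which is why the window function avoids a second pass over the table.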


#### Scenario 2. Regexp Matching
List all employees with @gmail or .org in their email addresses.

##### Description
The `~` operator matches a string expression against a regular expression. The following lists employees with an e-mail ending in ‘.org’ or a domain starting with ‘gmail.’.

In [19]:
%%sql

select * from emp
where email ~ any ( ARRAY[ '@.*\.org$' , '@gmail\.' ] );

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
3 rows affected.


empno,ename,job,mgr,hiredate,sal,comm,deptno,email,other_info
7876,ADAMS,CLERK,7788,1983-01-12,1200,,20,ADAMS@acme.org,
7566,JONES,MANAGER,7839,1981-04-02,2975,,20,JONES@gmail.com,
7900,JAMES,CLERK,7698,1981-12-03,1050,,30,JAMES@acme.org,
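
The two patterns can be tried outside the database with Python's `re` module. This is a sketch only: it assumes the patterns behave the same in both regular-expression dialects, which holds for these simple cases. The sample addresses are taken from the outputs in this lab.

```python
import re

# The two patterns from the query above, as Python regular expressions.
patterns = [re.compile(r"@.*\.org$"), re.compile(r"@gmail\.")]

# Sample addresses taken from the outputs in this lab.
emails = ["ADAMS@acme.org", "JONES@gmail.com", "FORD@acme.com", "JAMES@acme.org"]

# `email ~ any (array[...])` is true when at least one pattern matches
# anywhere in the string, which corresponds to re.search rather than re.match.
matches = [e for e in emails if any(p.search(e) for p in patterns)]
print(matches)
```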


#### Scenario 3. Arithmetic Date Intervals
Find employees with overlapping evaluation periods.

The interval data type allows you to store and manipulate a period of time in years, months, days, and so on. The following example compares overlapping evaluation periods. A WITH clause defines the evaluation period length depending on the job.

In [20]:
%%sql

with emp_evaluation_period as (
 select ename,deptno,hiredate,
 hiredate + case when job in ('MANAGER','PRESIDENT')
 then interval '3 month' else interval '4 weeks'
 end evaluation_end from emp)
select * from emp_evaluation_period e1
 join emp_evaluation_period e2
 on (e1.ename>e2.ename) and (e1.deptno=e2.deptno)
where (e1.hiredate,e1.evaluation_end)
 overlaps (e2.hiredate,e2.evaluation_end);

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
3 rows affected.


ename,deptno,hiredate,evaluation_end,ename_1,deptno_1,hiredate_1,evaluation_end_1
MILLER,10,1982-01-23,1982-02-20 00:00:00,KING,10,1981-11-17,1982-02-17 00:00:00
TURNER,30,1981-09-08,1981-10-06 00:00:00,MARTIN,30,1981-09-28,1981-10-26 00:00:00
WARD,30,1981-02-22,1981-03-22 00:00:00,ALLEN,30,1981-02-20,1981-03-20 00:00:00
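
The overlap test itself is a small comparison. The following Python sketch models it for two periods, assuming each period's start is before its end; the `overlaps` helper is a simplified stand-in for the SQL OVERLAPS operator, not its full definition. The dates come from the WARD and ALLEN row above.

```python
from datetime import date, timedelta


def overlaps(start1, end1, start2, end2):
    """Return True when two half-open periods [start, end) share any time.

    Simplified model of SQL OVERLAPS; assumes start < end for both periods.
    """
    return start1 < end2 and start2 < end1


# A 4-week evaluation period, as in the non-manager branch of the CASE expression.
ward_start = date(1981, 2, 22)
ward_end = ward_start + timedelta(weeks=4)
allen_start = date(1981, 2, 20)
allen_end = allen_start + timedelta(weeks=4)

print(ward_end, allen_end)
print(overlaps(ward_start, ward_end, allen_start, allen_end))

# Two periods that merely touch do not overlap under the half-open rule.
touching = overlaps(date(1981, 1, 1), date(1981, 2, 1), date(1981, 2, 1), date(1981, 3, 1))
print(touching)
```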


#### Scenario 4. CROSSTABVIEW
Display total salary per job and department as a cross-table.

##### Description
CROSSTABVIEW is a client command that displays rows as columns. The following sums the salaries across jobs and departments. Because SQL magic does not support client commands, the cell returns the aggregated rows rather than a cross-table.

In [21]:
%%sql

select job, dname, sum(sal)
from emp join dept using(deptno)
group by dname, job

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
9 rows affected.


job,dname,sum
PRESIDENT,ACCOUNTING,5100
CLERK,ACCOUNTING,1400
SALESMAN,SALES,6000
MANAGER,ACCOUNTING,2450
MANAGER,RESEARCH,2975
MANAGER,SALES,2850
CLERK,SALES,1050
ANALYST,RESEARCH,6200
CLERK,RESEARCH,2100
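
Since the cell above returns plain rows, the pivot into a cross-table can be done on the client side. Here is a minimal Python sketch using the rows from the output above; the layout and the `-` marker for empty cells are illustrative choices, not what the client command prints.

```python
# (job, dname, sum) rows copied from the query output above.
rows = [
    ("PRESIDENT", "ACCOUNTING", 5100),
    ("CLERK", "ACCOUNTING", 1400),
    ("SALESMAN", "SALES", 6000),
    ("MANAGER", "ACCOUNTING", 2450),
    ("MANAGER", "RESEARCH", 2975),
    ("MANAGER", "SALES", 2850),
    ("CLERK", "SALES", 1050),
    ("ANALYST", "RESEARCH", 6200),
    ("CLERK", "RESEARCH", 2100),
]

# Column headers: the distinct department names.
departments = sorted({dname for _, dname, _ in rows})

# One dict per job, keyed by department.
crosstab = {}
for job, dname, total in rows:
    crosstab.setdefault(job, {})[dname] = total

# Print one row per job and one column per department; "-" marks empty cells.
print(f"{'job':<10}" + "".join(f"{d:>12}" for d in departments))
for job in sorted(crosstab):
    cells = "".join(f"{str(crosstab[job].get(d, '-')):>12}" for d in departments)
    print(f"{job:<10}{cells}")
```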


#### Scenario 5. NTILE Function
Split e-mails into 3 groups and format them.

To send e-mails to all employees in different batches, you will split them into 3 groups using the NTILE function, format them with the FORMAT function, and aggregate them into a comma-separated list with the STRING_AGG function.
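
As a rough illustration of how NTILE distributes rows, here is a minimal Python sketch. The `ntile` helper is hypothetical and models only the bucket assignment, not the FORMAT or STRING_AGG steps. It assumes the rows are already ordered, and the employee numbers are placeholders.

```python
def ntile(rows, n):
    """Assign bucket numbers 1..n to already-ordered rows, like NTILE(n).

    Buckets differ in size by at most one; the larger buckets come first.
    """
    base, extra = divmod(len(rows), n)
    result = []
    index = 0
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        for row in rows[index:index + size]:
            result.append((bucket, row))
        index += size
    return result


# 14 placeholder employee numbers, ordered as ORDER BY empno would order them.
empnos = list(range(1, 15))
buckets = ntile(empnos, 3)
sizes = [sum(1 for b, _ in buckets if b == k) for k in (1, 2, 3)]
print(sizes)
```

With 14 rows and 3 groups, the sketch yields groups of 5, 5, and 4 rows.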

In [22]:
%%sql

with groups as (
 select ntile(3) over (order by empno) group_num
 ,* 
 from emp
)
select string_agg(format('<%s> %s',ename,email),', ') 
from groups group by group_num;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
3 rows affected.


string_agg
"<ADAMS> ADAMS@acme.org, <JAMES> JAMES@acme.org, <FORD> FORD@acme.com, <MILLER> MILLER@acme.com"
"<BLAKE> BLAKE@hotmail.com, <CLARK> CLARK@acme.com, <SCOTT> SCOTT@acme.com, <KING> KING@aol.com, <TURNER> TURNER@acme.com"
"<SMITH> SMITH@acme.com, <ALLEN> ALLEN@acme.com, <WARD> WARD@compuserve.com, <JONES> JONES@gmail.com, <MARTIN> MARTIN@acme.com"


### Advanced Features
Expand your YSQL skills by completing the following 5 scenarios.

#### Scenario 1. GIN Index on Document
List employees that know SQL.

##### Description
The skills are stored in a semi-structured JSON document. We can query them with the @>, ?, ?& and ?| operators and, for best performance, index them.

##### SQL Statement
Step 1: Create a GIN index on the JSON document

GIN indexes can provide fast access to elements inside a JSON document. The following creates an index on the ‘skills’ attribute within the ‘other_info’ JSON column.

In [None]:
%%sql 

create index emp_skills on emp using gin((other_info->'skills'));

Step 2: Query the JSON attribute list

SQL queries can navigate into the JSON document with -> and check if an array contains a value with @>. The following searches for employees with the “SQL” skill.

In [None]:
%%sql 

select * from emp where other_info->'skills' @> '"SQL"' ;
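
The containment check in the query above can be pictured with a short Python sketch. The documents below are hypothetical, since the lab's actual `other_info` layout is not shown here, and `has_skill` is an illustrative helper that only covers the case of an array of strings.

```python
import json

# Hypothetical other_info documents; the lab's actual JSON layout may differ.
employees = {
    "EMP_A": '{"skills": ["SQL", "C"]}',
    "EMP_B": '{"skills": ["Java"]}',
    "EMP_C": None,  # NULL other_info
}


def has_skill(other_info, skill):
    """Mimic other_info->'skills' @> '"<skill>"' for an array of strings."""
    if other_info is None:
        return False
    skills = json.loads(other_info).get("skills")
    return isinstance(skills, list) and skill in skills


with_sql = [name for name, doc in employees.items() if has_skill(doc, "SQL")]
print(with_sql)
```

The sketch scans every document, which is the work that an index on the ‘skills’ attribute is meant to avoid.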

Step 3: Use explain to verify that the index is used

Thanks to the GIN index, this search doesn’t need to read all documents. The following shows that the execution plan uses an indexed access path.

In [None]:
%%sql

explain select * from emp where other_info->'skills' @> '"SQL"' ;

#### Scenario 2. Text Search
Build a search index on department descriptions.

##### Description
SQL queries can search text for specific words using the to_tsvector() function, which extracts a list of words that can be compared. We will find all department descriptions that contain the words 'responsible' and 'services'.

##### SQL Statement
Step 1: Create a text search index on the description column

GIN indexes can provide fast access to words inside a text. The following creates an index on the vector of words extracted from the department description with the 'simple' text search configuration.

In [None]:
%%sql

create index dept_description_text_search on dept using gin (( to_tsvector('simple',description) ));

Step 2: Query on description for matching words

The following compares the vector of words extracted from the department description ('simple' configuration) with a search pattern, to find the departments whose description contains both ‘responsible’ and ‘services’.

In [None]:
%%sql

select * from dept where to_tsvector('simple',description) @@ to_tsquery('simple','responsible & services');
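
The matching rule of this query can be approximated in a few lines of Python. This is a rough stand-in, assuming the 'simple' configuration only lowercases and splits the text into words. The helper names and the sample descriptions are hypothetical.

```python
import re


def simple_tokens(text):
    """Rough stand-in for to_tsvector('simple', text): lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all(text, words):
    """Mimic `tsvector @@ to_tsquery('simple', 'a & b')` for plain words."""
    return set(words) <= simple_tokens(text)


# Hypothetical descriptions; the lab's dept.description values are not shown here.
descriptions = {
    "DEPT_A": "Responsible for accounting services.",
    "DEPT_B": "Responsible for research.",
}
found = [d for d, text in descriptions.items()
         if matches_all(text, ["responsible", "services"])]
print(found)
```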

Step 3: Use explain to verify that the index is used

Thanks to the GIN index, this search doesn’t need to read all rows and text. The following shows that the execution plan uses an indexed access path.

In [None]:
%%sql

explain select * from dept where to_tsvector('simple',description) @@ to_tsquery('simple','responsible & services');

#### Scenario 3. Stored Procedures
Transfer commission from one employee to another.

##### Description
A stored procedure can encapsulate procedural logic into an atomic operation. We will create one in PL/pgSQL, named "commission_transfer", that transfers commission “amount” from “empno1” to “empno2”.

##### SQL Statement
Step 1: Create the procedure for the commission transfer between employees

The procedure has two SQL operations: decrease the commission of “empno1” and add it to “empno2”. It also includes error checking that raises a custom exception if “empno1” doesn’t have the amount to be transferred.

In [None]:
%%sql

create or replace procedure commission_transfer(empno1 int, empno2 int, amount int) as $$
begin
update emp set comm=comm-commission_transfer.amount
  where empno=commission_transfer.empno1 and comm>commission_transfer.amount;
if not found then raise exception 'Cannot transfer % from %',amount,empno1; end if;
update emp set comm=comm+commission_transfer.amount
  where emp.empno=commission_transfer.empno2;
if not found then raise exception 'Cannot transfer to %',empno2; end if;
end;
$$ language plpgsql;

Step 2: Call the procedure with employee IDs and the amount to be transferred

Once defined, the stored procedure is called with values for all parameters. This transfers 100 from employee 7521 to 7654

In [None]:
%%sql

call commission_transfer(7521,7654,100);

Step 3: List all employees who have received commission

The following displays all employees having a commission, to verify that 100 has been transferred.

In [None]:
%%sql

SELECT * from emp where comm is not null;

Step 4: Call the procedure with an amount that is not allowed, to invoke the error handling

The following attempts to transfer 1000000, more than what 7521 has. It raises the “Cannot transfer” error defined in the procedure and automatically reverts all intermediate changes to return to a consistent state.

In [None]:
%%sql

call commission_transfer(7521,7654,1000000);

#### Scenario 4. Triggers
Record the last update time of each row automatically.

##### Description
We will add a column to hold the last update time, and declare a trigger on the departments table to update it automatically.

##### SQL Statement
Step 1: Add a column to store the last update time

The structure of a SQL table can evolve. With the goal of recording the last update, the following adds a “last_update” column to the department table.

In [None]:
%%sql

alter table dept add last_update timestamptz;

Step 2: Add a function “dept_last_update” to set the last update time

A stored function declares some procedural logic that returns a value. The following returns the “new” state for a trigger after setting “last_update” to the current time. It uses the built-in function transaction_timestamp(), which returns the current date and time at the start of the current transaction.

In [None]:
%%sql

create or replace function dept_last_update() returns trigger as $$
begin
  new.last_update:=transaction_timestamp();
  return new;
end;
$$ language plpgsql;

Step 3: Create a trigger “dept_last_update” to call the function “dept_last_update()” on each table update

The previous function can be called automatically. The following trigger executes it on each row update for the departments table.

In [None]:
%%sql

create trigger dept_last_update
before update on dept
for each row
execute procedure dept_last_update();

Step 4: Display the current state of the table

In order to verify the automatic logging of the last update time, the following displays the current state of departments before any update.

In [None]:
%%sql

select deptno,dname,loc,last_update from dept;

#### Scenario 5. Materialized Views
Pre-compute analytics for reporting, with a materialized view.

##### Description
In order to get fast on-demand reports, we create a materialized view to store pre-joined and pre-aggregated data. This view will store the total salary per department, the number of employees, and the list of jobs in the department.

##### SQL Statement
Step 1: Create the materialized view

In [None]:
%%sql

create materialized view report_sal_per_dept as
select 
deptno,dname,
sum(sal) sal_per_dept,
count(*) num_of_employees,
string_agg(distinct job,', ') distinct_jobs
from dept join emp using(deptno)
group by deptno,dname
order by deptno;

Step 2: Create an index on the materialized view

Indexes can be created on a materialized view. This one allows fast queries on a range of total salary.

In [None]:
%%sql

create index report_sal_per_dept_sal on report_sal_per_dept(sal_per_dept desc);

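Step 3: Query the materialized view

The following queries the pre-computed report for departments with a total salary of at most 10000. A refresh can be scheduled on a daily basis to re-compute the view in the background with a simple command.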

In [None]:
%%sql

select *
from report_sal_per_dept
where sal_per_dept<=10000
order by sal_per_dept;

---
# All done!
In this lab, you completed the following:

- Setup
  - Created the `db_ybu` database with `ysqlsh`
  - Created utils
  - Created tables and loaded data using DDL and DML scripts
  - Connected to the database using a PostgreSQL driver for Python

Next, run the following cell to open `02_Demystifying_table_sharding_tablets_and_data_distribution.ipynb`.