<div style="width:100%; background-color: #000041"><a target="_blank" href="http://university.yugabyte.com"><img src="assets/YBU_Logo.webp" /></a></div><br>

> **YugabyteDB YCQL Development**
>
> Enroll for free at [Yugabyte University](https://university.yugabyte.com/courses/yugabytedb-ycql-development).
>

# Query-driven data model: Query plans
In this notebook, you will learn about YCQL query plans.

### Import the notebook variables 

> Requirements:
>
> You must first create the variables in the `01_Setup.ipynb` notebook.
>

The following Python cell reads the stored variables created in the `01_Setup.ipynb` notebook. 

- To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.

In [None]:
%store -r MY_DB_NAME
%store -r MY_YB_PATH
%store -r MY_YB_PATH_DATA
%store -r MY_GITPOD_WORKSPACE_URL
%store -r MY_HOST_IPv4_01
%store -r MY_HOST_IPv4_02
%store -r MY_HOST_IPv4_03
%store -r MY_NOTEBOOK_DIR
%store -r MY_TSERVER_WEBSERVER_PORT
%store -r MY_NOTEBOOK_DATA_FOLDER
%store -r MY_YB_MASTER_HOST_GITPOD_URL
%store -r MY_YB_TSERVER_HOST_GITPOD_URL
%store -r MY_DATA_DDL_FILE
%store -r MY_DATA_DML_FILE

### Create the `ks_ybu` keyspace
Run the following cells to connect to the YugabyteDB cluster using `ycqlsh`. Then, complete the following tasks:
- Create the `ks_ybu` keyspace if it does not exist


Create the keyspace, `ks_ybu`.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" # Create the keyspace, ks_ybu. DB_NAME=ks_ybu.
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -e "
  create keyspace if not exists $DB_NAME;
  "

Confirm the keyspace creation.

In [None]:
%%bash -s "$MY_YB_PATH"  "$MY_DB_NAME" # Describe the keyspace, ks_ybu. DB_NAME=ks_ybu.
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -e "
  describe keyspace $DB_NAME;
  "

Drop the existing tables in the keyspace.


In [None]:
%%bash -s "$MY_YB_PATH"  "$MY_DB_NAME" # Drop the table
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

#  DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  drop table if exists tbl_employees;
  "

### Run the DDL and DML scripts
Run the following cell to run both the DDL and DML files.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DDL_FILE"  "$MY_DATA_DML_FILE"  
# Wishlist
YB_PATH=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
DATA_DDL_FILE=${4}
DATA_DML_FILE=${5}

WISHLIST_DDL_PATH=${DATA_FOLDER}/${DATA_DDL_FILE}
WISHLIST_DML_PATH=${DATA_FOLDER}/${DATA_DML_FILE}

cd $YB_PATH/bin

# DDL file
./ycqlsh -k ${DB_NAME} -f ${WISHLIST_DDL_PATH} 
sleep 1;

# DML file
./ycqlsh -k ${DB_NAME} -f ${WISHLIST_DML_PATH} 
sleep 1;

Describe the tables.

In [None]:
%%bash -s "$MY_YB_PATH"  "$MY_DB_NAME" # describe the tables from the DDL script
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  describe tables;
  "

Validate that the data was loaded properly from the DDL and DML scripts.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DML_FILE"   
# Wishlist
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  select * from tbl_products_by_brand;
  "

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DML_FILE"   
# Wishlist
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  select * from tbl_products_by_category;
  "

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DML_FILE"   
# Wishlist
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  select * from tbl_products_by_wishlist;
  "

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" "$MY_NOTEBOOK_DATA_FOLDER" "$MY_DATA_DML_FILE"   
# Wishlist
YB_PATH=${1}
DB_NAME=${2}
cd $YB_PATH/bin

# shell variable substitution, DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  select * from tbl_wishlists_by_user;
  "

---
## Execution plan

In YCQL, an execution plan, known as a query plan, is similar in style to a YSQL query plan. A query plan consists of execution nodes. A plan node represents an action and may include one or more sub-actions. An action refers to a specific, internal operation. Nodes can be nested. Nested nodes are executed from the inside out. This means that the innermost node is executed before an outer node. This can be best thought of as a nested function call where the inner node returns its results to the outer node, often in a loop.

To see an example of a query plan, run the following cell:

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"   # Query plan: Sequential scan with filter
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH/bin

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  explain select sum(quantity) as total_items
  from tbl_products_by_wishlist
  where quantity > 1;
"    

In the output above, the query plan shows that this is an aggregate query. An aggregate query exhibits a nested node. The innermost node is the sequential scan operation. The sequential scan applies a filter sub-action for the conditional expression, `quantity > 1`. The sequential scan operation returns the subtotal value to the outermost operation, the aggregation operation.
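
The inside-out execution order can be illustrated with a toy model in Python. This is a minimal sketch and not the YCQL executor: the `SeqScan` and `Aggregate` classes and the sample rows are hypothetical, and for simplicity the scan node streams filtered rows to the aggregate node instead of a subtotal.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Row = dict


@dataclass
class SeqScan:
    """Innermost node: reads every row and keeps those that pass the filter."""
    rows: Iterable[Row]
    row_filter: Callable[[Row], bool]

    def execute(self) -> Iterator[Row]:
        for row in self.rows:
            if self.row_filter(row):  # the Filter sub-action
                yield row


@dataclass
class Aggregate:
    """Outermost node: consumes the child node's rows and sums one column."""
    child: SeqScan
    column: str

    def execute(self) -> int:
        return sum(row[self.column] for row in self.child.execute())


# Hypothetical sample rows; these are not the notebook's data.
sample_rows = [
    {"product_id": 1, "quantity": 1},
    {"product_id": 2, "quantity": 3},
    {"product_id": 3, "quantity": 2},
]

# The inner node is built first and handed to the outer node.
plan = Aggregate(
    child=SeqScan(sample_rows, lambda row: row["quantity"] > 1),
    column="quantity",
)
total_items = plan.execute()
print(total_items)  # 5
```

Calling `plan.execute()` on the outer node drives the inner node, which mirrors the nested function call analogy.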


### Query plan: Sequential scan
A query that results in a sequential scan of a very large table can be very costly in YugabyteDB. For a sequential scan operation, each tablet must perform a seek operation. A seek operation requires CPU, memory, and disk operations. A seek operation consists of reading data from an SST file. The SST file contains the DocKey and document value. The SST file format consists of data and meta blocks. A sequential scan reads all the data blocks within an SST file. 

Coordinating the tablet operations and gathering tablet results require additional network, CPU, and memory consumption. The topology of a cluster can increase the network latency related to coordinating tablet operations and gathering tablet results. For these reasons, a sequential scan consumes numerous computing resources and often results in very poor query performance, especially for very large tables with billions of rows.
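
The fan-out cost can be sketched with a small Python model. This is an illustration under stated assumptions, not a measurement: the tablet names and rows are hypothetical, and the counters only stand in for the network, CPU, and disk work described above.

```python
# Hypothetical layout: tablet name -> rows stored as (category, product_name).
tablets = {
    "tablet-a": [("H20", "Talc 5"), ("H20", "Soap 1")],
    "tablet-b": [("A10", "Lamp 2")],
    "tablet-c": [("B77", "Desk 9"), ("C03", "Pen 4")],
}


def sequential_scan(tablets):
    """Visit every tablet, read every row, and gather the results."""
    tablets_contacted = 0
    rows_read = 0
    results = []
    for rows in tablets.values():
        tablets_contacted += 1  # one coordinated request per tablet
        for row in rows:
            rows_read += 1  # every row is read, wanted or not
            results.append(row)
    return results, tablets_contacted, rows_read


results, tablets_contacted, rows_read = sequential_scan(tablets)
print(tablets_contacted, rows_read)  # 3 5
```

Both counters grow with the size of the table and the number of tablets, which is why the cost becomes prohibitive for very large tables.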

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"   # Query plan: Sequential scan  
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH/bin

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  explain select * from tbl_products_by_category;
"  

### Query plan: Range scan
A range scan operation often returns multiple rows. Unlike a primary key lookup, it is not guaranteed to return at most one row. The equality operator on the partition key column indicates that the operation is a single-partition query. All products that belong to a specific category reside on a single tablet. A seek operation consists of reading data from the SST files of a tablet. The number of seeks for a range scan is often far fewer than that of a sequential scan. In this case, the number of seeks may vary by the number of products related to the category. Each seek operation requires CPU, memory, and disk usage.

In most cases, a range scan query is efficient and scalable. However, for a popular product category with millions of related products, this query may be problematic. One tablet may be accessed significantly more than the other tablets. This is known as a hot tablet or hot shard query. Depending on query frequency and the volume of data, a hot tablet query can quickly turn into a hot node problem. A hot node is a node that consistently utilizes a high percentage of computing resources.
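
A hot tablet can be illustrated with a short Python sketch. The assumptions are loud ones: `zlib.crc32` is only a stand-in for YugabyteDB's actual partition hash function, the three-tablet layout is hypothetical, and the request mix is invented to show skew.

```python
import zlib
from collections import Counter

NUM_TABLETS = 3  # assumption: three tablets splitting the hash space evenly


def tablet_for(partition_key: str) -> int:
    """Map a partition key to a tablet index.

    Stand-in hash only; this is NOT the hash function YugabyteDB uses.
    """
    hash16 = zlib.crc32(partition_key.encode("utf-8")) & 0xFFFF
    return hash16 * NUM_TABLETS // 0x10000


# Hypothetical workload: one very popular category dominates the requests.
requests = ["H20"] * 90 + ["A10"] * 4 + ["B77"] * 3 + ["C03"] * 3

# Every request for the same category is routed to the same tablet.
load = Counter(tablet_for(category) for category in requests)
hot_tablet, hot_count = load.most_common(1)[0]
hot_share = hot_count / len(requests)
print(f"tablet {hot_tablet} serves {hot_share:.0%} of the requests")
```

Because a partition key always hashes to the same tablet, adding nodes does not spread the load of a single popular category.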


In [None]:
%%bash -s "$MY_YB_PATH"  "$MY_DB_NAME"  # Query plan: Range scan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH/bin

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  explain select category, product_name, product_id, brand, price, discount, description, gtin
  from tbl_products_by_category 
  where category = 'H20';
"  

### Query plan: Primary key lookup
In the following example, the where expression contains three conditional expressions. Each conditional expression contains a primary key column and an equality operator. Because all three parts of the primary key are in the where expression and include equality operators, the query is a primary key lookup query. A primary key lookup is guaranteed to return zero or one row.
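
The rule of thumb behind the three plan types in this notebook can be written as a small Python function. This is a simplified sketch, not the YCQL planner's logic: it ignores secondary indexes, `IN` lists, and inequality operators, and it assumes a key layout of `category` as the partition key with `product_name` and `product_id` as clustering columns.

```python
def classify_access(partition_cols, clustering_cols, equality_cols):
    """Classify a query by which key columns have equality conditions."""
    equality = set(equality_cols)
    if not set(partition_cols) <= equality:
        # The partition is unknown, so every tablet must be scanned.
        return "sequential scan"
    if set(clustering_cols) <= equality:
        # The whole primary key is fixed: zero or one row.
        return "primary key lookup"
    # The partition is fixed, but several rows in it may match.
    return "range scan"


# Assumed key layout for tbl_products_by_category.
PARTITION = ["category"]
CLUSTERING = ["product_name", "product_id"]

print(classify_access(PARTITION, CLUSTERING, []))
print(classify_access(PARTITION, CLUSTERING, ["category"]))
print(classify_access(PARTITION, CLUSTERING, ["category", "product_name", "product_id"]))
```

The three calls correspond to the three `explain` statements in this notebook, in order.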


In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME"   # Query plan: Primary key lookup
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH/bin

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  explain select category, product_name, product_id, brand, price, discount, description, gtin
  from tbl_products_by_category 
  where category = 'H20'
    and product_name = 'Talc 5'
    and product_id = 62362;
"    

### Primary key lookup: Tablet
A primary key lookup operation uses the DocKey to determine the location of the row. The hash-encoded partition key value in the DocKey identifies the tablet that the query reads from. Within that hash value, the clustering keys order the data on disk. The primary key lookup performs a single seek in the SST file of the related tablet in RocksDB. This is known as a single-key query. The Key-Conditions and Filter sub-actions are semantic artifacts of the query planner. There are no related operations for these sub-actions. 
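
The difference between a single seek and a partition scan can be sketched in Python with a sorted list of keys. This is a toy model under stated assumptions: the tuple layout only approximates a DocKey, `zlib.crc32` is a stand-in for the real partition hash, and every row except the one from the query predicate is hypothetical.

```python
import bisect
import zlib


def hash16(partition_key: str) -> int:
    """Stand-in 16-bit hash; NOT the hash function YugabyteDB uses."""
    return zlib.crc32(partition_key.encode("utf-8")) & 0xFFFF


def doc_key(category, product_name, product_id):
    """Toy key: partition hash first, then partition and clustering columns."""
    return (hash16(category), category, product_name, product_id)


# Keys are kept sorted, as they would be inside an SST file.
keys = sorted(
    doc_key(*primary_key)
    for primary_key in [
        ("H20", "Talc 5", 62362),  # row from the query predicate
        ("H20", "Soap 1", 10001),  # hypothetical
        ("A10", "Lamp 2", 10002),  # hypothetical
    ]
)


def point_lookup(keys, category, product_name, product_id):
    """Primary key lookup: one binary-search seek, zero or one row."""
    target = doc_key(category, product_name, product_id)
    i = bisect.bisect_left(keys, target)
    return keys[i] if i < len(keys) and keys[i] == target else None


def partition_scan(keys, category):
    """Range scan: seek to the partition prefix, then read forward."""
    prefix = (hash16(category), category)
    i = bisect.bisect_left(keys, prefix)
    rows = []
    while i < len(keys) and keys[i][:2] == prefix:
        rows.append(keys[i])
        i += 1
    return rows


print(point_lookup(keys, "H20", "Talc 5", 62362))
print(partition_scan(keys, "H20"))
```

Placing the hash first in the key is what lets a lookup jump straight to its partition, while the clustering columns determine the order of the rows inside it.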

#### Select a YB-TServer host
Set the host variable for one of the nodes. All three nodes in the cluster are running a Tablet Server (YB-TServer). You can comment/uncomment lines 7-9 as needed.


In [None]:
%%bash -s "$MY_HOST_IPv4_01" "$MY_HOST_IPv4_02" "$MY_HOST_IPv4_03" --out MY_HOST_IPv4
HOST_IPv4_01=$( echo "${1}" | tr -d " ")
HOST_IPv4_02=$( echo "${2}" | tr -d " ")
HOST_IPv4_03=$( echo "${3}" | tr -d " ")

# change the hosts for different tablet leaders
#MY_HOST_IPv4=$HOST_IPv4_01
#MY_HOST_IPv4=$HOST_IPv4_02
MY_HOST_IPv4=$HOST_IPv4_03

echo ${MY_HOST_IPv4}

Store the selected host variable.

In [None]:
%store MY_HOST_IPv4
print(MY_HOST_IPv4)

Save the table name as a variable.

In [None]:
MY_OBJECT_NAME="tbl_products_by_category"
%store MY_OBJECT_NAME
print(MY_OBJECT_NAME)

Get the table_id for the table using `curl` and `jq`.

> Note:
> If you are running locally, this cell requires `jq`. 
> To install for your local OS, try the following:
> - Ubuntu: 
>   - `sudo apt-get install jq`
> - OS X:
>   - `brew install jq`
> 

In [None]:
%%bash -s "$MY_OBJECT_NAME" "$MY_HOST_IPv4"  "$MY_DB_NAME"  "$MY_TSERVER_WEBSERVER_PORT"  --out MY_TABLE_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
DB_NAME=$( echo "${3}" | tr -d " ")
TSERVER_WEBSERVER_PORT=$( echo "${4}" | tr -d " ")


MY_URL="http://${HOST_IPv4}:${TSERVER_WEBSERVER_PORT}/metrics"

MY_TABLE_ID=`curl -s --compressed ${MY_URL} | jq -r 'limit(1;  .[] | select(.attributes.namespace_name=="'${DB_NAME}'" and .type=="tablet" and .attributes.table_name=="'${OBJECT_NAME}'") |  .attributes.table_id) '`

echo ${MY_TABLE_ID}

Store the table_id for the table.

In [None]:
%store MY_TABLE_ID
print(MY_TABLE_ID)

Get the tablet_id for the tablet leader on the selected node host.

In [None]:
%%bash -s "$MY_OBJECT_NAME" "$MY_HOST_IPv4" --out MY_TABLET_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:8200/metrics"

TABLET_ID=`curl -s --compressed ${MY_URL} | jq --raw-output ' .[] | select(.attributes.namespace_name=="ks_ybu" and .type=="tablet" and .attributes.table_name=="'$OBJECT_NAME'") | {tablet_id: .id, metrics: .metrics[] | select(.name == ("is_raft_leader") ) | select(.value == 1) } | select(.tablet_id) | {tablet_id} | .tablet_id '`

echo ${TABLET_ID}
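
For readers less familiar with `jq`, the two filters above can be approximated in Python. This is a rough equivalent, not a replacement for the cells: the JSON shape is inferred from the `jq` expressions, and the sample payload and its IDs are hypothetical.

```python
def _tablets_of(metrics, namespace, table):
    """Yield the tablet entities that belong to one table in one keyspace."""
    for entity in metrics:
        attributes = entity.get("attributes", {})
        if (
            entity.get("type") == "tablet"
            and attributes.get("namespace_name") == namespace
            and attributes.get("table_name") == table
        ):
            yield entity


def first_table_id(metrics, namespace, table):
    """Mirror of the table_id filter: take the first match, like limit(1; ...)."""
    for entity in _tablets_of(metrics, namespace, table):
        return entity["attributes"]["table_id"]
    return None


def leader_tablet_ids(metrics, namespace, table):
    """Mirror of the tablet_id filter: keep tablets where is_raft_leader is 1."""
    return [
        entity["id"]
        for entity in _tablets_of(metrics, namespace, table)
        if any(
            metric.get("name") == "is_raft_leader" and metric.get("value") == 1
            for metric in entity.get("metrics", [])
        )
    ]


# Hypothetical payload shaped like the parsed /metrics response.
table_attrs = {
    "namespace_name": "ks_ybu",
    "table_name": "tbl_products_by_category",
    "table_id": "table-111",
}
sample_metrics = [
    {"type": "server", "id": "server-1", "attributes": {}, "metrics": []},
    {"type": "tablet", "id": "tablet-aaa", "attributes": table_attrs,
     "metrics": [{"name": "is_raft_leader", "value": 1}]},
    {"type": "tablet", "id": "tablet-bbb", "attributes": table_attrs,
     "metrics": [{"name": "is_raft_leader", "value": 0}]},
]

print(first_table_id(sample_metrics, "ks_ybu", "tbl_products_by_category"))
print(leader_tablet_ids(sample_metrics, "ks_ybu", "tbl_products_by_category"))
```

Note that `leader_tablet_ids` returns a list, because a node can be the leader for more than one tablet of the same table.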

Store the tablet_id for the tablet leader.

In [None]:
%store MY_TABLET_ID
print(MY_TABLET_ID)

Flush the in-memory writes (MemTable) for the given table_id to an SST file on disk.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_HOST_IPv4" "$MY_TABLE_ID"  # Flush the table by table_id
YB_PATH=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
TABLE_ID=$( echo "${3}" | tr -d " ")
cd $YB_PATH/bin

./yb-admin -init_master_addrs ${HOST_IPv4}:7100 flush_table_by_id ${TABLE_ID} 600

Dump and decode the SST file in human-readable form.

> Note:
>
> If the following cell does **NOT** dump the SST file, it is most likely that no rows have been written to this tablet. To resolve this issue, you need to select a different Tablet Server host. Return to **Select a YB-TServer host** and select a different node host.

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_YB_PATH_DATA" "$MY_TABLE_ID" "$MY_TABLET_ID" # Dump the SST file for the tablet
YB_PATH=$( echo "${1}" | tr -d " ")
YB_PATH_DATA=$( echo "${2}" | tr -d " ")
TABLE_ID=$( echo "${3}" | tr -d " ")
TABLET_ID=$( echo "${4}" | tr -d " ")

cd $YB_PATH/bin/

TABLE_ID_PATH=${YB_PATH_DATA}/node-1/disk-1/yb-data/tserver/data/rocksdb/table-${TABLE_ID}/tablet-${TABLET_ID}

# ls -l  ${TABLE_ID_PATH}

./sst_dump --command=scan --file=${TABLE_ID_PATH} --output_format=decoded_regulardb 

> Question:
>
> The query predicate is:
> 
> ```where category = 'H20' and product_name = 'Talc 5' and product_id = 62362```
> 
> How can you find the partition key hash in the SST dump for the query predicate?
> 
> Hint:
>
> Modify the "Select a YB-TServer host" cell to use a different YB-TServer. Then, rerun the cells.

---
TODO: Next notebook