<div style="width:100%; background-color: #121017;"><a target="_blank" href="http://university.yugabyte.com?utm_source=gitpod&utm_medium=notebook"><img src="assets/YBU_Logo.png" /></a></div><br>

> **YugabyteDB YCQL Development**
>
> Enroll for free at [Yugabyte University](https://university.yugabyte.com/courses/yugabytedb-ycql-development?utm_source=gitpod&utm_medium=notebook).
>
<br>
This notebook file is:

`04_QDDM_secondary_indexes.ipynb`



# Secondary indexes
A secondary index is a data structure that contains some of the columns of the index table and an index key that supports one or more data access patterns. In this notebook, you will learn how to create secondary indexes that not only improve query performance but also let you remove unnecessary tables from the data model.

## 🛠️ Requirements
Here are the requirements for this notebook:
- ✅ Create the notebook variables in `01_Introduction.ipynb`, which you did previously
- ✅ Create the `ks_ybu` keyspace in `02_Language_fundamentals.ipynb`, which you did previously
- ✅ Complete `03_QDDM_query_plans.ipynb`, which you did previously
- ☑️ Select the **Python 3.11.8** kernel for the notebook, *which you need to do right now!!!*
- ☑️ Import the notebook variables, *which you must do next*
- ☑️ Confirm the existence of the `ks_ybu` keyspace and the child tables, *which you must do next*

### Select your notebook kernel
- In the Notebook toolbar, click **Select Kernel**.
<br>
<img width=50% src="assets/01_01_Select_Kernel_Toolbar.png" />

- Next, in the dropdown, select **Python 3.11.8**.
<br>
<img width=50% src="assets/01_02_Select_Kernel_Dropdown.png" />

> 👉 **IMPORTANT!** 👈
> 
> You must select **Python 3.11.8**. 
> 
> Do **NOT** select _Python 3.12_ or higher!!! 
>


That's it!

## ⛑️ Getting help
The best way to get help from the Yugabyte University team is to post your question on YugabyteDB Community Slack in the #training channels. To sign up, visit [YugabyteDB Community Slack](https://join.slack.com/t/yugabyte-db/shared_invite/zt-xbd652e9-3tN0N7UG0eLpsace4t1d2A?utm_source=gitpod&utm_medium=notebook).

## 👣 Setup steps
Here are the steps to set up this lab:
- Create the notebook styles
- Import the notebook variables

### 👇 Create the notebook styles

In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

### 👇 Import the notebook variables 

> 👉 **IMPORTANT!** 👈
> 
> Do **NOT** skip running this cell. 
> 

The following Python cell restores the variables that you stored in `01_Introduction.ipynb`. All the notebooks in this lab use these variables. You can view them in the Jupyter tab.

- To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.
- Verify the accuracy of the output values

👇 👇 👇 

In [None]:
# Use %store -r to read the variables stored in 01_Introduction.ipynb
%store -r

### Confirm the existence of the `ks_ybu` keyspace and the child tables
You created...
- the keyspace in the `02_Language_fundamentals.ipynb` notebook
- tables in the  `03_QDDM_query_plans.ipynb` notebook

Run the following cell to describe the keyspace.

In [None]:
%%bash -s "$NB_YB_PATH_BIN"  "$NB_DB_NAME" # describe the keyspace, ks_ybu
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

# DB_NAME=ks_ybu
./ycqlsh -r -e "
  describe keyspace $DB_NAME;
"

> 🤔 Question:
>  
> Does the `ks_ybu` keyspace exist?
> 
> If not, go back to the  `02_Language_fundamentals.ipynb` notebook and create the `ks_ybu` keyspace!
>

> 🤔 Question:
>  
> Does the `ks_ybu` keyspace have tables for products and wishlists?
> 
> If not, go back to the `03_QDDM_query_plans.ipynb` notebook and create the tables!

---
## Secondary index: `Index Scan` query plan
A secondary index is a data structure that often contains some of the columns of the index table and an index key that supports one or more data access patterns. You can often create a secondary index so that a given query plan uses the secondary index.

An `index scan` query plan uses a secondary index. However, after accessing the index, the query accesses the index table. This type of query plan is often better than a `sequence scan` query plan.

To begin, examine the query plan without the implementation of a secondary index.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand='Bear';
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand=?;
"  

The query plan reveals that the query uses a sequence scan.

### Create the secondary index

> 👉 **IMPORTANT!** 👈
>
> You can only create a secondary index for a table with the `transactions` property enabled.
> 

To begin, drop the secondary index if it exists.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # drop if index exists
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  drop index if exists idx_products_by_category_brand;
"

sleep 1;

Run the following cell to create an index on the `brand` column of the `tbl_products_by_category` table.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # create the secondary index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  create index if not exists idx_products_by_category_brand 
  on tbl_products_by_category (brand);
"

Use the `describe` keyword to verify that the index was created for the given table. 


In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # describe table to view index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  describe tbl_products_by_category;
"

The index key consists of a partition key and zero or more clustering keys. In the previous example, you can see that the partition key of the index is `brand`. The index also inherits the primary key columns of the index table as clustering keys: `category`, `product_name`, and `product_id`. 
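
The way the index key is assembled can be sketched in a few lines of Python. This is only an illustrative model, not YugabyteDB code: the helper name `index_key_columns` is hypothetical, and the rule that a primary key column already used as the index partition key is not repeated is an assumption of the sketch.

```python
# Illustrative model only; this is not how YugabyteDB is implemented.
# Primary key columns of tbl_products_by_category, as listed in the text above.
TABLE_PRIMARY_KEY = ["category", "product_name", "product_id"]


def index_key_columns(indexed_column, table_primary_key):
    """Return (partition_key, clustering_keys) for a single-column, non-unique index."""
    partition_key = [indexed_column]
    # Assumption of this sketch: a primary key column is not repeated
    # if it is already the partition key of the index.
    clustering_keys = [c for c in table_primary_key if c != indexed_column]
    return partition_key, clustering_keys


partition_key, clustering_keys = index_key_columns("brand", TABLE_PRIMARY_KEY)
print("partition key:  ", partition_key)
print("clustering keys:", clustering_keys)
```

For the `brand` index, the sketch yields `brand` as the partition key and the three primary key columns as clustering keys, which matches the `describe` output discussed above.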

### Index backfill
By default, YugabyteDB will automatically backfill an index. You can check the status of the index backfill operation using the YB-Master web UI at `http://yb-master-url:7000/tasks`. You can also grep the HTML output of the web UI. 

In [None]:
%%bash -s "$NB_HOST_IPv4_01" 
HOST_IPv4=$( echo "${1}" | tr -d " ")
MY_URL="http://${HOST_IPv4}:7000/tasks"

curl -s  ${MY_URL} | html2text

There are multiple tasks associated with the creation of the index. For an index on a very large table, you will want to check for these task names associated with the given index:
- `Backfill Table` or  `Backfill Index Table` 
- `Mark backfill done.`


> 👉 **IMPORTANT!** 👈
>
> In certain cases, the backfill may fail. 
>
> The state will show `kFailed` instead of `kComplete`.
> 
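
If you capture the text of the tasks page in a Python variable, a small helper can count the task states for you. The following is a minimal sketch: the helper name `count_task_states` is hypothetical, and the sample text is made up for illustration because the real layout of the `/tasks` page may differ.

```python
# Minimal sketch: count the task states mentioned in the text of the /tasks page.
TASK_STATES = ("kComplete", "kFailed")


def count_task_states(tasks_text):
    """Return a dict that maps each known state keyword to its number of occurrences."""
    return {state: tasks_text.count(state) for state in TASK_STATES}


# Made-up sample text, NOT real YB-Master output.
sample_text = "\n".join([
    "Backfill Index Table  kComplete",
    "Mark backfill done.  kComplete",
])

counts = count_task_states(sample_text)
print(counts)
if counts["kFailed"] > 0:
    print("At least one task failed; review the backfill.")
```

Counting keywords in the whole text, rather than parsing rows, keeps the sketch independent of how the page lays out its table.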

### View the query plan for the secondary index
Run the following cell to verify whether the secondary index on `brand` makes the query on `tbl_products_by_category` more efficient.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  explain select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand = 'Bear';
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand = ?;
"  

#### `Index Scan`
With the introduction of a secondary index for the table, there is a new query plan. This query plan contains one node. The action is an index scan. Key Conditions is a sub-action that specifies the use of the partition key for the index. The index contains the partition key for the index and the primary key columns of the index table. After accessing the index, the query accesses the table, because only the table contains the `description` column. The action to access the index table is similar to a primary key lookup: since the index contains the primary key columns, the query can look up the rows in the table using the primary key values.
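
The two steps of an index scan can be modeled with two Python dictionaries. This is a toy in-memory sketch, not the actual execution engine: the rows are made-up sample data and the function name `index_scan` is hypothetical.

```python
# Toy model of an index scan. The rows below are made-up sample data.
# Table: primary key (category, product_name, product_id) -> other columns.
table = {
    ("Camping", "Tent 2P", 101): {"brand": "Bear", "description": "two-person tent"},
    ("Camping", "Stove", 102): {"brand": "Fox", "description": "gas stove"},
    ("Hiking", "Pack 40L", 103): {"brand": "Bear", "description": "40 liter pack"},
}

# Index: partition key (brand) -> primary keys of the matching table rows.
index = {}
for primary_key, columns in table.items():
    index.setdefault(columns["brand"], []).append(primary_key)


def index_scan(brand):
    """Step 1: read primary keys from the index. Step 2: look up each row in the table."""
    rows = []
    for primary_key in index.get(brand, []):       # step 1: index access
        columns = table[primary_key]                # step 2: primary key lookup
        category, product_name, product_id = primary_key
        rows.append((category, brand, product_name, columns["description"], product_id))
    return rows


for row in index_scan("Bear"):
    print(row)
```

The second dictionary access is the part that a covering index removes, because the `description` value is only available in the table.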

### Index tablet and `index scan` query
YugabyteDB stores a secondary index in the same way as it stores a table: as tablets. The underlying storage is also the same. It is DocDB. 

#### Select a YB-TServer host
<a id="select-a-yb-tserver-host-1"> </a>
Set the host variable for one of the nodes. All three nodes in the cluster are running a Tablet Server (YB-TServer). You can comment/uncomment lines 7-9 as needed.

In [None]:
%%bash -s "$NB_HOST_IPv4_01" "$NB_HOST_IPv4_02" "$NB_HOST_IPv4_03" --out NB_HOST_IPv4
HOST_IPv4_01=$( echo "${1}" | tr -d " ")
HOST_IPv4_02=$( echo "${2}" | tr -d " ")
HOST_IPv4_03=$( echo "${3}" | tr -d " ")

# change the hosts for different tablet leaders
MY_HOST_IPv4=$HOST_IPv4_01
#MY_HOST_IPv4=$HOST_IPv4_02
#MY_HOST_IPv4=$HOST_IPv4_03

echo ${MY_HOST_IPv4}

Store the selected host variable.

In [None]:
%store NB_HOST_IPv4
print(NB_HOST_IPv4)

Save the `OBJECT_NAME` as a variable.

In [None]:
NB_OBJECT_NAME="idx_products_by_category_brand"
%store NB_OBJECT_NAME
print(NB_OBJECT_NAME)

Grep the `INDEX_ID` for the index using `curl` and `jq`.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4"  "$NB_DB_NAME"  "$NB_TSERVER_WEBSERVER_PORT"  --out NB_INDEX_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
DB_NAME=$( echo "${3}" | tr -d " ")
TSERVER_WEBSERVER_PORT=$( echo "${4}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:${TSERVER_WEBSERVER_PORT}/metrics"

MY_INDEX_ID=`curl -s --compressed ${MY_URL} | jq -r 'limit(1;  .[] | select(.attributes.namespace_name=="'${DB_NAME}'" and .type=="tablet" and .attributes.table_name=="'${OBJECT_NAME}'") |  .attributes.table_id) '`

echo ${MY_INDEX_ID}

Store the `INDEX_ID` for the index.

In [None]:
%store NB_INDEX_ID
print(NB_INDEX_ID)

Get the `TABLET_ID` for the tablet leader for the selected node host.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4" --out NB_INDEX_TABLET_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:8200/metrics"

MY_INDEX_TABLET_ID=`curl -s --compressed ${MY_URL} | jq --raw-output ' .[] | select(.attributes.namespace_name=="ks_ybu" and .type=="tablet" and .attributes.table_name=="'$OBJECT_NAME'") | {tablet_id: .id, metrics: .metrics[] | select(.name == ("is_raft_leader") ) | select(.value == 1) } | select(.tablet_id) | {tablet_id} | .tablet_id '`

echo ${MY_INDEX_TABLET_ID}

Store the `TABLET_ID` for the tablet leader.

In [None]:
%store NB_INDEX_TABLET_ID
print(NB_INDEX_TABLET_ID)

Flush the WAL file to an SST file for the given `INDEX_ID`.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_HOST_IPv4" "$NB_INDEX_ID"  # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
cd $YB_PATH

./yb-admin -init_master_addrs ${HOST_IPv4}:7100 flush_table_by_id ${INDEX_ID} 600

Dump and decode the SST file in human-readable form.

> 📝 Note
>
> If the following does **NOT** dump the SST file, it is most likely that there are not any rows written to this tablet. To resolve this issue, you need to select a different Tablet Server host. 
> 
> Return to [Select a YB-TServer host](#select-a-yb-tserver-host-1) and select a different node host: comment out line 7 (add a `#` sign) and uncomment line 8 or line 9 (remove the `#` sign).

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_YB_PATH_DATA" "$NB_INDEX_ID" "$NB_INDEX_TABLET_ID" # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
YB_PATH_DATA=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
INDEX_TABLET_ID=$( echo "${4}" | tr -d " ")

cd $YB_PATH/

INDEX_ID_PATH=${YB_PATH_DATA}/node1/data/yb-data/tserver/data/rocksdb/table-${INDEX_ID}/tablet-${INDEX_TABLET_ID}


./sst_dump --command=scan --file=${INDEX_ID_PATH} --output_format=decoded_regulardb 

The DocKey consists of the partition key hash, the partition key, and the clustering keys. 

The index scan query begins by accessing a single tablet for the index. The seek operation reads data from the SST file for the index tablet using the partition key hash. Because the brand column contains non-unique values, there may be multiple seeks in the related SST file. The seek operation gathers the DocKeys. The DocKeys contain the primary key values for the index table. Using this list, a second operation accesses the index table tablets. 

When there is more than a single index table tablet, this is a batch operation. Often, the index and table tablets reside on different nodes. This means that the batch operation requires one or more remote procedure calls to one or more nodes in the cluster. The goal with the batch operation is to optimize the number of seek operations for a given table tablet by using a list of primary keys that fall within the tablet hash value range. Although this query is not as costly as a sequential scan query, it does require accessing one index tablet and at least one tablet for the index table. 
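
The batching idea can be sketched as grouping primary keys by the tablet whose hash range they fall into. Several things here are assumptions for illustration: three tablets that split the 16-bit hash space evenly, `category` as the partition key of the table, and `zlib.crc32` as a stand-in for the real hash function, which YugabyteDB implements differently. All function names are hypothetical.

```python
import bisect
import zlib

# Assumption: three tablets split the hash space 0x0000-0xFFFF evenly.
# Each entry is the inclusive upper bound of one tablet's hash range.
TABLET_UPPER_BOUNDS = [0x5555, 0xAAAA, 0xFFFF]


def toy_hash(partition_value):
    """Stand-in 16-bit hash. YugabyteDB uses its own hash function."""
    return zlib.crc32(partition_value.encode("utf-8")) & 0xFFFF


def tablet_for(partition_value):
    """Return the position of the tablet whose hash range contains the value."""
    return bisect.bisect_left(TABLET_UPPER_BOUNDS, toy_hash(partition_value))


def batch_by_tablet(primary_keys):
    """Group primary keys so that each table tablet receives a single batch."""
    batches = {}
    for primary_key in primary_keys:
        category = primary_key[0]  # assumed partition key of the table
        batches.setdefault(tablet_for(category), []).append(primary_key)
    return batches


# Made-up primary keys, as if gathered from the index tablet.
keys_from_index = [
    ("Camping", "Tent 2P", 101),
    ("Hiking", "Pack 40L", 103),
    ("Camping", "Lantern", 104),
]
for tablet, batch in sorted(batch_by_tablet(keys_from_index).items()):
    print(f"tablet {tablet}: {batch}")
```

Rows that share a partition key value always land in the same batch, so a tablet is contacted once per query instead of once per row.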

## Secondary index: `Index Only Scan` query plan

The previous index required the query to access the tablets for both the index and the table. A covering index is a secondary index that contains every column the query needs, so the query reads only the index and never the index table. 
 
You can define one or more `include` columns to cover a query with the index alone.  There are some restrictions for defining an `include` column in a secondary index. The `include` column needs to be a column with a basic data type.

### Create the secondary index

To begin, drop the secondary index if it exists.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # drop secondary index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  drop index if exists idx_products_by_category_brand_inc;
"

sleep 1;

Run the following cell to create the index that uses the `include` clause.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # create secondary index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  create index if not exists idx_products_by_category_brand_inc 
    on tbl_products_by_category (brand) 
    include (description);
"

Use the `describe` keyword to verify that the index was created for the given table. 


In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # describe table to view index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  describe tbl_products_by_category;
"

There are now two indexes for the table. 

The index key for the `idx_products_by_category_brand_inc` index consists of a partition key and zero or more clustering keys. The partition key is `brand`. The clustering keys are `category`, `product_name`, and `product_id`. In addition, the index stores the `description` column, which lets the index alone cover the query.


### View the query plan that uses the covering index
Run the following to view the plan.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand = 'Bear';
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select  category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where brand = ?;
"  

#### `Index Only Scan`
The introduction of a new covering index results in a new query plan. This query plan contains one node. The action is an `index only scan`. Key Conditions is a sub-action that specifies the use of the partition key for the index. The equality operator indicates that the internal operation is to locate a specific partition key on a single tablet. The index contains the partition key for the index, the primary key columns of the index table, and any `include` columns from the index table. In this example, the `include` column is the description column from the index table.
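
As a rule of thumb, an index covers a query when every selected column is available in the index. The following sketch expresses that check with Python sets. It is a simplification for reasoning about the data model, not the query planner's actual logic, and the function names are hypothetical.

```python
# Simplified rule of thumb, not the planner's real logic.
TABLE_PRIMARY_KEY = ["category", "product_name", "product_id"]
SELECT_COLUMNS = ["category", "brand", "product_name", "description", "product_id"]


def columns_in_index(partition_key, table_primary_key, include=()):
    """Columns that can be read from the index without touching the table."""
    return set(partition_key) | set(table_primary_key) | set(include)


def expected_plan(select_columns, available_columns):
    """Return the scan type this sketch expects for the selected columns."""
    if set(select_columns) <= available_columns:
        return "Index Only Scan"
    return "Index Scan"


plain = columns_in_index(["brand"], TABLE_PRIMARY_KEY)
covering = columns_in_index(["brand"], TABLE_PRIMARY_KEY, include=["description"])

print("idx_products_by_category_brand:    ", expected_plan(SELECT_COLUMNS, plain))
print("idx_products_by_category_brand_inc:", expected_plan(SELECT_COLUMNS, covering))
```

The only difference between the two indexes is the `description` column, and that one column decides whether the table must be read.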

### Index tablet and `index only scan` query
To learn more about why this is an  `index only scan` query, take a look at the SST files for an index tablet.

#### Select a YB-TServer host
<a id="select-a-yb-tserver-host-2"> </a>
Set the host variable for one of the nodes. All three nodes in the cluster are running a Tablet Server (YB-TServer). You can comment/uncomment lines 7-9 as needed.

In [None]:
%%bash -s "$NB_HOST_IPv4_01" "$NB_HOST_IPv4_02" "$NB_HOST_IPv4_03" --out NB_HOST_IPv4
HOST_IPv4_01=$( echo "${1}" | tr -d " ")
HOST_IPv4_02=$( echo "${2}" | tr -d " ")
HOST_IPv4_03=$( echo "${3}" | tr -d " ")

# change the hosts for different tablet leaders
MY_HOST_IPv4=$HOST_IPv4_01
#MY_HOST_IPv4=$HOST_IPv4_02
#MY_HOST_IPv4=$HOST_IPv4_03

echo ${MY_HOST_IPv4}

Store the selected host variable.

In [None]:
%store NB_HOST_IPv4
print(NB_HOST_IPv4)

Save the `OBJECT_NAME` as a variable.

In [None]:
NB_OBJECT_NAME="idx_products_by_category_brand_inc"
%store NB_OBJECT_NAME
print(NB_OBJECT_NAME)

Grep the `INDEX_ID` for the index using `curl` and `jq`.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4"  "$NB_DB_NAME"  "$NB_TSERVER_WEBSERVER_PORT"  --out NB_INDEX_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
DB_NAME=$( echo "${3}" | tr -d " ")
TSERVER_WEBSERVER_PORT=$( echo "${4}" | tr -d " ")


MY_URL="http://${HOST_IPv4}:${TSERVER_WEBSERVER_PORT}/metrics"

MY_INDEX_ID=`curl -s --compressed ${MY_URL} | jq -r 'limit(1;  .[] | select(.attributes.namespace_name=="'${DB_NAME}'" and .type=="tablet" and .attributes.table_name=="'${OBJECT_NAME}'") |  .attributes.table_id) '`

echo ${MY_INDEX_ID}

Store the `INDEX_ID` for the index.

In [None]:
%store NB_INDEX_ID
print(NB_INDEX_ID)

Get the `TABLET_ID` for the tablet leader for the selected node host.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4" --out NB_INDEX_TABLET_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:8200/metrics"

MY_INDEX_TABLET_ID=`curl -s --compressed ${MY_URL} | jq --raw-output ' .[] | select(.attributes.namespace_name=="ks_ybu" and .type=="tablet" and .attributes.table_name=="'$OBJECT_NAME'") | {tablet_id: .id, metrics: .metrics[] | select(.name == ("is_raft_leader") ) | select(.value == 1) } | select(.tablet_id) | {tablet_id} | .tablet_id '`

echo ${MY_INDEX_TABLET_ID}

Store the `TABLET_ID` for the tablet leader.

In [None]:
%store NB_INDEX_TABLET_ID
print(NB_INDEX_TABLET_ID)

Flush the WAL file to an SST file for the given `INDEX_ID`.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_HOST_IPv4" "$NB_INDEX_ID"  # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
cd $YB_PATH

./yb-admin -init_master_addrs ${HOST_IPv4}:7100 flush_table_by_id ${INDEX_ID} 600

Dump and decode the SST file in human-readable form.

> 📝 Note 
>
> If the following does **NOT** dump the SST file, it is most likely that there are not any rows written to this tablet. To resolve this issue, you need to select a different Tablet Server host. 
> 
> Return to [Select a YB-TServer host](#select-a-yb-tserver-host-2) and select a different node host: comment out line 7 (add a `#` sign) and uncomment line 8 or line 9 (remove the `#` sign).

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_YB_PATH_DATA" "$NB_INDEX_ID" "$NB_INDEX_TABLET_ID" # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
YB_PATH_DATA=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
INDEX_TABLET_ID=$( echo "${4}" | tr -d " ")

cd $YB_PATH/

INDEX_ID_PATH=${YB_PATH_DATA}/node1/data/yb-data/tserver/data/rocksdb/table-${INDEX_ID}/tablet-${INDEX_TABLET_ID}


./sst_dump --command=scan --file=${INDEX_ID_PATH} --output_format=decoded_regulardb 

A query plan with an `index only scan` operation accesses a single tablet for the index. The seek operation reads data from the SST file for the index tablet. Depending on the number of products for a given brand, there may be multiple seeks. However, because brand is the partition key for this index, the query locates the tablet with the partition key hash almost immediately. The partition key hash for the index is the DocKey hash. The DocKey consists of the index partition key hash, the index partition key, and clustering keys. The clustering keys are the primary key columns of the index table. The DocKey maps to subdocuments. The subdocuments in this example are any include columns, and in this case, just `description`. Each subdocument contains a column value. 

The covering index efficiently processes the query without needing to access the data for the index table.
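
The mapping from a DocKey to its subdocuments can be pictured as a Python dictionary. In this toy sketch, the hash values and rows are made up, and the linear filter stands in for the seek operations that the real storage layer performs. Note that no table structure appears anywhere in the code.

```python
# Toy covering index: DocKey -> subdocuments for the include columns.
# The hash values and rows are made up for illustration.
BEAR_HASH = 0x1A2B  # hypothetical 16-bit hash of "Bear"
FOX_HASH = 0x7C3D   # hypothetical 16-bit hash of "Fox"

covering_index = {
    # (hash, brand, category, product_name, product_id): {include column: value}
    (BEAR_HASH, "Bear", "Camping", "Tent 2P", 101): {"description": "two-person tent"},
    (BEAR_HASH, "Bear", "Hiking", "Pack 40L", 103): {"description": "40 liter pack"},
    (FOX_HASH, "Fox", "Camping", "Stove", 102): {"description": "gas stove"},
}


def index_only_scan(brand):
    """Build complete result rows from the index entries alone."""
    rows = []
    # A real scan seeks by partition key hash; this sketch filters linearly.
    for doc_key, subdocuments in covering_index.items():
        _hash, idx_brand, category, product_name, product_id = doc_key
        if idx_brand == brand:
            rows.append((category, idx_brand, product_name,
                         subdocuments["description"], product_id))
    return rows


for row in index_only_scan("Bear"):
    print(row)
```

Every selected column comes either from the DocKey or from a subdocument, so the result is complete without a second lookup.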

> 🤔 Question:
> 
> With a covering index for the query, which table can be dropped?
>
> 🙋 Answer:
>
> You can drop `tbl_products_by_brand`.

## Unique index
A unique index creates a unique constraint for the index column in the index table. This is especially useful when maintaining data integrity for rows of data that must have unique values such as identifiers, phone numbers, or emails. In this example, the unique index constraint prevents the insertion of a row with a duplicate global trade identification number.

To begin, first drop the index if it exists.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # if exists, drop index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  drop index if exists idx_products_by_category_unq;
"

Next, create the unique index.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # if not exists, create index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  create unique index if not exists idx_products_by_category_unq
  on tbl_products_by_category (gtin);
"

Review the index creation.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # describe table to view index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  desc tbl_products_by_category;
"  

The following `insert` statement will generate an error since there already is a row with `gtin=006236226326`:

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # insert statement that throws error
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  insert into tbl_products_by_category (category, product_name, product_id, brand, price, discount, description, gtin) 
  values ('H20','Talc 9',62569,'Yeah',29.99,8,'3 liter','006236226326');
"

The thrown exception (error) is:
```
Duplicate value disallowed by unique index idx_products_by_category_unq
```

### Index tablet for unique index
YugabyteDB stores a secondary unique index in the same way as it does for a table. The data structure is also similar. It is DocDB. However, the implementation is a bit different for unique indexes. To see how it differs, you will need to first flush the WAL file for the index tablet and then dump the SST file for the tablet.

#### Select a YB-TServer host
<a id="select-a-yb-tserver-host-3"> </a>
Set the host variable for one of the nodes. All three nodes in the cluster are running a Tablet Server (YB-TServer). You can comment/uncomment lines 7-9 as needed.

In [None]:
%%bash -s "$NB_HOST_IPv4_01" "$NB_HOST_IPv4_02" "$NB_HOST_IPv4_03" --out NB_HOST_IPv4
HOST_IPv4_01=$( echo "${1}" | tr -d " ")
HOST_IPv4_02=$( echo "${2}" | tr -d " ")
HOST_IPv4_03=$( echo "${3}" | tr -d " ")

# change the hosts for different tablet leaders
MY_HOST_IPv4=$HOST_IPv4_01
#MY_HOST_IPv4=$HOST_IPv4_02
#MY_HOST_IPv4=$HOST_IPv4_03

echo ${MY_HOST_IPv4}

Store the selected host variable.

In [None]:
%store NB_HOST_IPv4
print(NB_HOST_IPv4)

Save the `OBJECT_NAME` as a variable.

In [None]:
NB_OBJECT_NAME="idx_products_by_category_unq"
%store NB_OBJECT_NAME
print(NB_OBJECT_NAME)

Grep the `INDEX_ID` for the index using `curl` and `jq`.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4"  "$NB_DB_NAME"  "$NB_TSERVER_WEBSERVER_PORT"  --out NB_INDEX_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
DB_NAME=$( echo "${3}" | tr -d " ")
TSERVER_WEBSERVER_PORT=$( echo "${4}" | tr -d " ")


MY_URL="http://${HOST_IPv4}:${TSERVER_WEBSERVER_PORT}/metrics"

MY_INDEX_ID=`curl -s --compressed ${MY_URL} | jq -r 'limit(1;  .[] | select(.attributes.namespace_name=="'${DB_NAME}'" and .type=="tablet" and .attributes.table_name=="'${OBJECT_NAME}'") |  .attributes.table_id) '`

echo ${MY_INDEX_ID}

Store the `INDEX_ID` for the index.

In [None]:
%store NB_INDEX_ID
print(NB_INDEX_ID)

Get the `TABLET_ID` for the tablet leader for the selected node host.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4" --out NB_INDEX_TABLET_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:8200/metrics"

MY_INDEX_TABLET_ID=`curl -s --compressed ${MY_URL} | jq --raw-output ' .[] | select(.attributes.namespace_name=="ks_ybu" and .type=="tablet" and .attributes.table_name=="'$OBJECT_NAME'") | {tablet_id: .id, metrics: .metrics[] | select(.name == ("is_raft_leader") ) | select(.value == 1) } | select(.tablet_id) | {tablet_id} | .tablet_id '`

echo ${MY_INDEX_TABLET_ID}

Store the `TABLET_ID` for the tablet leader.

In [None]:
%store NB_INDEX_TABLET_ID
print(NB_INDEX_TABLET_ID)

Flush the WAL file to an SST file for the given `INDEX_ID`.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_HOST_IPv4" "$NB_INDEX_ID"  # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
cd $YB_PATH

./yb-admin -init_master_addrs ${HOST_IPv4}:7100 flush_table_by_id ${INDEX_ID} 600

Dump and decode the SST file in human-readable form.

> 📝 Note 
>
> If the following does **NOT** dump the SST file, it is most likely that there are no rows written to this tablet. To resolve this issue, you need to select a different Tablet Server host. 
> 
> Return to [Select a YB-TServer host](#select-a-yb-tserver-host-3) and select a different node host: comment out line 7 (add a `#` sign) and uncomment line 8 or line 9 (remove the `#` sign).

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_YB_PATH_DATA" "$NB_INDEX_ID" "$NB_INDEX_TABLET_ID" # Import file path of Yugabyte and DB name
YB_PATH=$( echo "${1}" | tr -d " ")
YB_PATH_DATA=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
INDEX_TABLET_ID=$( echo "${4}" | tr -d " ")

cd $YB_PATH/

INDEX_ID_PATH=${YB_PATH_DATA}/node1/data/yb-data/tserver/data/rocksdb/table-${INDEX_ID}/tablet-${INDEX_TABLET_ID}

# ls -l  ${INDEX_ID_PATH}

./sst_dump --command=scan --file=${INDEX_ID_PATH} --output_format=decoded_regulardb 

The DocKey for the index consists solely of the hash encoded partition key and the index value itself. The subdocuments of the DocKey contain the other document values.

### View the query plan for the unique index
Run the following to view the plan.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where gtin = '006236226326';
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where gtin = ?;
"  

The DocKey for the index consists solely of the hash encoded partition key and the index value itself. The query reads from the exact tablet that contains the partition key hash. However, the primary key columns of the index table are not included in the DocKey. This means that each column in the primary key of the index table maps to an individual subdocument. Any included columns for the unique index have the same SubDocKey column mapping. Each subdocument for a column has a document value. 

To get the primary key for the index table, the query must read the related subdocuments. A second operation then accesses the tablet of the index table. Often, the index and table tablets reside on different nodes. This means that the operation requires one or more remote procedure calls to one or more nodes in the cluster. In other words, the index key lookup query must access the tablets for both the index and the table.
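
Because the key of a unique index entry holds only the indexed value, a dictionary keyed by `gtin` is a reasonable mental model. The sketch below is illustrative only: the class and function names are hypothetical, the existing row is made up, and treating a repeated insert of the same primary key as allowed is an assumption of the sketch.

```python
class DuplicateValueError(Exception):
    """Raised by this sketch when a gtin already belongs to a different row."""


# Toy unique index: gtin -> subdocuments holding the primary key columns.
unique_index = {}


def index_insert(gtin, category, product_name, product_id):
    """Add an index entry, rejecting a gtin that already maps to another row."""
    new_value = {"category": category, "product_name": product_name,
                 "product_id": product_id}
    existing = unique_index.get(gtin)
    # Assumption: writing the same primary key again is not a duplicate.
    if existing is not None and existing != new_value:
        raise DuplicateValueError(f"gtin {gtin} is already indexed")
    unique_index[gtin] = new_value


def primary_key_for(gtin):
    """Read the subdocuments to rebuild the primary key for the table lookup."""
    subdocuments = unique_index[gtin]
    return (subdocuments["category"], subdocuments["product_name"],
            subdocuments["product_id"])


# Made-up existing row that already owns this gtin.
index_insert("006236226326", "Soap", "Bar 3", 1001)

try:
    index_insert("006236226326", "H20", "Talc 9", 62569)
except DuplicateValueError as error:
    print("rejected:", error)

print("primary key:", primary_key_for("006236226326"))
```

The `primary_key_for` step corresponds to reading the subdocuments before the second operation accesses the table tablet.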

That said, it is possible to create a unique index that includes columns from the index table. A covering index eliminates the need for a query to access both the index and table tablets. 

### 🏆 Challenge: Create a unique covering index
In the code cells below, create a unique covering index. The challenge is to answer the following question:

> 🤔 Question:
>  
> What is the resulting query plan for the unique covering index? Here is the explain query:
>
> 
```
  explain select category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where gtin=?;
```

First, drop the existing unique index.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # if exists, drop index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

# DB_NAME=ks_ybu

./ycqlsh -r -k $DB_NAME -e "
  drop index if exists idx_products_by_category_unq;
"

To create the unique, covering index, complete the create index statement below and then run the cell.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # you must complete this code
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  create unique index if not exists idx_products_by_category_unq_inc
  on ...
"

> ☝️ Have an error above? 
> 
> You need to edit and complete the code above and run it again!

Verify the new query plan that uses the unique, covering index.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where gtin = '006236226326';
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select category, brand, product_name, description, product_id 
  from tbl_products_by_category 
  where gtin = ?;
"  

> 🙋 Answer:
>
> The query plan indicates an `Index Only Key Lookup` action.
>
> 
```
create unique index if not exists idx_products_by_category_unq_inc
  on tbl_products_by_category (gtin)
  include (category, brand, product_name, description, product_id);
```

## Partial Index

A partial index contains only the rows that satisfy the where expression in the index predicate. The predicate column in the sub-expression can be an integer type, a boolean, or text. The supported operators are equal, not equal, greater than, less than, greater than or equal to, or less than or equal to. 

The logical implication holds if all sub-expressions of the index predicate are present in the where expression of the select query.
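
The rule above can be sketched as a subset check over sub-expressions. In this minimal sketch, each sub-expression is a `(column, operator, value)` tuple, the predicate `discount > 9` serves as the example, and the function name is hypothetical. The comparison is literal, as the rule is stated: `discount > 10` does not count as containing `discount > 9` here.

```python
def query_implies_index_predicate(index_predicate, query_where):
    """True when every sub-expression of the index predicate appears in the query."""
    return set(index_predicate) <= set(query_where)


# Example predicate of a partial index: where discount > 9
INDEX_PREDICATE = [("discount", ">", 9)]

query_a = [("discount", ">", 9), ("brand", "=", "Bear")]   # contains the predicate
query_b = [("discount", ">", 5)]                             # different sub-expression
query_c = [("brand", "=", "Bear")]                           # predicate missing

for name, where in (("query_a", query_a), ("query_b", query_b), ("query_c", query_c)):
    print(name, query_implies_index_predicate(INDEX_PREDICATE, where))
```

Only `query_a` repeats the index predicate, so it is the only one of the three for which the check holds.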

In this exercise, you will create a partial index.

Drop the index if it exists.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # drop index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH


./ycqlsh -r -k $DB_NAME -e "
  drop index if exists idx_products_by_category_high_discount;
"

Create the partial index.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # create index
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  create index idx_products_by_category_high_discount
  on tbl_products_by_category
  (discount)
  include (brand, description, price) 
  where discount > 9;
"

To view the DDL for the index, describe the table.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # describe table
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

# DB_NAME=ks_ybu
./ycqlsh -r -k $DB_NAME -e "
  desc tbl_products_by_category;
"

### Index tablet for partial index
In the following cells, you can flush the WAL file and dump the SST file for a partial index tablet.

#### Select a YB-TServer host
<a id="select-a-yb-tserver-host-4"> </a>
Set the host variable for one of the nodes. All three nodes in the cluster are running a Tablet Server (YB-TServer). You can comment/uncomment lines 7-9 as needed.

> Important!
>
> You will most likely need to return here and change the host, because very few rows qualify for the partial index.

In [None]:
%%bash -s "$NB_HOST_IPv4_01" "$NB_HOST_IPv4_02" "$NB_HOST_IPv4_03" --out NB_HOST_IPv4
HOST_IPv4_01=$( echo "${1}" | tr -d " ")
HOST_IPv4_02=$( echo "${2}" | tr -d " ")
HOST_IPv4_03=$( echo "${3}" | tr -d " ")

# change the hosts for different tablet leaders
MY_HOST_IPv4=$HOST_IPv4_01
#MY_HOST_IPv4=$HOST_IPv4_02
#MY_HOST_IPv4=$HOST_IPv4_03

echo ${MY_HOST_IPv4}

Store the selected host variable.

In [None]:
%store NB_HOST_IPv4
print(NB_HOST_IPv4)

Save the index name in the `NB_OBJECT_NAME` variable.

In [None]:
NB_OBJECT_NAME="idx_products_by_category_high_discount"
%store NB_OBJECT_NAME
print(NB_OBJECT_NAME)

Get the `INDEX_ID` for the index by querying the YB-TServer metrics endpoint with `curl` and filtering the result with `jq`.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4"  "$NB_DB_NAME"  "$NB_TSERVER_WEBSERVER_PORT"  --out NB_INDEX_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
DB_NAME=$( echo "${3}" | tr -d " ")
TSERVER_WEBSERVER_PORT=$( echo "${4}" | tr -d " ")


MY_URL="http://${HOST_IPv4}:${TSERVER_WEBSERVER_PORT}/metrics"

MY_INDEX_ID=`curl -s --compressed ${MY_URL} | jq -r 'limit(1;  .[] | select(.attributes.namespace_name=="'${DB_NAME}'" and .type=="tablet" and .attributes.table_name=="'${OBJECT_NAME}'") |  .attributes.table_id) '`

echo ${MY_INDEX_ID}

Store the `INDEX_ID` for the index.

In [None]:
%store NB_INDEX_ID
print(NB_INDEX_ID)

Get the `TABLET_ID` of the tablet leader on the selected node host.

In [None]:
%%bash -s "$NB_OBJECT_NAME" "$NB_HOST_IPv4" --out NB_INDEX_TABLET_ID
OBJECT_NAME=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")

MY_URL="http://${HOST_IPv4}:8200/metrics"

MY_INDEX_TABLET_ID=`curl -s --compressed ${MY_URL} | jq --raw-output ' .[] | select(.attributes.namespace_name=="ks_ybu" and .type=="tablet" and .attributes.table_name=="'$OBJECT_NAME'") | {tablet_id: .id, metrics: .metrics[] | select(.name == ("is_raft_leader") ) | select(.value == 1) } | select(.tablet_id) | {tablet_id} | .tablet_id '`

echo ${MY_INDEX_TABLET_ID}
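
For readers less familiar with `jq`, the filter above can be read as the following Python sketch. It assumes the `/metrics` response is a JSON array of entities with the fields the `jq` expression reads (`type`, `id`, `attributes`, `metrics`). The function name `leader_tablet_ids` and the sample data are hypothetical, not real cluster output.

```python
import json


def leader_tablet_ids(metrics: list[dict], namespace: str, table_name: str) -> list[str]:
    """Return ids of the table's tablets whose is_raft_leader metric equals 1."""
    leaders = []
    for entity in metrics:
        attrs = entity.get("attributes") or {}
        if entity.get("type") != "tablet":
            continue  # skip server-level and other non-tablet entities
        if attrs.get("namespace_name") != namespace or attrs.get("table_name") != table_name:
            continue  # skip tablets of other keyspaces or tables
        is_leader = any(
            m.get("name") == "is_raft_leader" and m.get("value") == 1
            for m in entity.get("metrics", [])
        )
        if is_leader:
            leaders.append(entity["id"])
    return leaders


# Hypothetical sample shaped like the fields the jq filter reads.
sample = json.loads("""
[
  {"type": "tablet", "id": "tablet-aaa",
   "attributes": {"namespace_name": "ks_ybu",
                  "table_name": "idx_products_by_category_high_discount"},
   "metrics": [{"name": "is_raft_leader", "value": 1}]},
  {"type": "tablet", "id": "tablet-bbb",
   "attributes": {"namespace_name": "ks_ybu",
                  "table_name": "idx_products_by_category_high_discount"},
   "metrics": [{"name": "is_raft_leader", "value": 0}]},
  {"type": "server", "id": "yb.tabletserver", "attributes": {}, "metrics": []}
]
""")

print(leader_tablet_ids(sample, "ks_ybu", "idx_products_by_category_high_discount"))
# ['tablet-aaa']
```

An empty result corresponds to the case where the selected host is not the leader for any tablet of the index.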

Store the `TABLET_ID` for the tablet leader.

In [None]:
%store NB_INDEX_TABLET_ID
print(NB_INDEX_TABLET_ID)

Flush the WAL file to an SST file for the given `INDEX_ID`.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_HOST_IPv4" "$NB_INDEX_ID"  # flush index
YB_PATH=$( echo "${1}" | tr -d " ")
HOST_IPv4=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
cd $YB_PATH

./yb-admin -init_master_addrs ${HOST_IPv4}:7100 flush_table_by_id ${INDEX_ID} 600

Dump and decode the SST file in human-readable form.

> 📝 Note 
>
> If the following does **NOT** dump the SST file, most likely no rows have been written to this tablet. To resolve this issue, select a different Tablet Server host.
> 
> Return to [Select a YB-TServer host](#select-a-yb-tserver-host-4) and select a different node host: comment out line 7 (add a `#` sign) and uncomment line 8 or line 9 (remove the `#` sign).

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_YB_PATH_DATA" "$NB_INDEX_ID" "$NB_INDEX_TABLET_ID" # dump SST file
YB_PATH=$( echo "${1}" | tr -d " ")
YB_PATH_DATA=$( echo "${2}" | tr -d " ")
INDEX_ID=$( echo "${3}" | tr -d " ")
INDEX_TABLET_ID=$( echo "${4}" | tr -d " ")

cd $YB_PATH/

INDEX_ID_PATH=${YB_PATH_DATA}/node1/data/yb-data/tserver/data/rocksdb/table-${INDEX_ID}/tablet-${INDEX_TABLET_ID}

# ls -l  ${INDEX_ID_PATH}

./sst_dump --command=scan --file=${INDEX_ID_PATH} --output_format=decoded_regulardb 

The DocKey for the partial index consists of the index partition key hash, the index partition key, and the clustering keys. The clustering keys are the primary key columns of the indexed table. The include columns exist as subdocument values. The partial index contains only the rows of the indexed table that satisfy the where expression in the index predicate.
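
The layout described above can be pictured with a small Python model. This is only an illustrative sketch: the hash is a stand-in (not YugabyteDB's actual hash function), the primary key columns `("category", "product_id")` are an assumption about `tbl_products_by_category`, the sample rows are made up, and the names `IndexDocKey`, `partition_hash`, and `index_entry` are hypothetical.

```python
from dataclasses import dataclass
from zlib import crc32


@dataclass(frozen=True)
class IndexDocKey:
    hash_code: int          # hash of the index partition key
    partition_key: tuple    # indexed column value(s), here (discount,)
    clustering_keys: tuple  # primary key column values of the indexed table


def partition_hash(values: tuple) -> int:
    """Stand-in 16-bit hash for illustration only."""
    return crc32(repr(values).encode()) & 0xFFFF


def index_entry(row: dict, primary_key: tuple[str, ...]):
    """Return (DocKey, included column values), or None if the row fails the predicate."""
    if not row["discount"] > 9:  # index predicate: where discount > 9
        return None
    key = IndexDocKey(
        hash_code=partition_hash((row["discount"],)),
        partition_key=(row["discount"],),
        clustering_keys=tuple(row[c] for c in primary_key),
    )
    included = {c: row[c] for c in ("brand", "description", "price")}
    return key, included


# Made-up rows; the primary key columns are assumed for this example.
pk = ("category", "product_id")
high = {"category": "tools", "product_id": 1, "brand": "Acme",
        "description": "hammer", "price": 12.5, "discount": 15}
low = dict(high, product_id=2, discount=5)

print(index_entry(high, pk))  # DocKey plus the include columns
print(index_entry(low, pk))   # None: the row is not written to the partial index
```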

### View the query plan for the partial index
Run the following to view the plan.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select category, brand, product_name, description, product_id, discount
  from tbl_products_by_category 
  where discount > 9 ;
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select category, brand, product_name, description, product_id, discount
  from tbl_products_by_category 
  where discount > 9 ;
"  

> 🤔 Question:
>  
> What happens when the query contains a similar where expression, such as `where discount > 8`?
>

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"   # new query plan
YB_PATH=${1}
DB_NAME=${2}  
cd $YB_PATH

./ycqlsh -r -k $DB_NAME -e "
  select category, brand, product_name, description, product_id, discount
  from tbl_products_by_category 
  where discount > 8 ;
"  

./ycqlsh -r -k $DB_NAME -e "
  explain select category, brand, product_name, description, product_id, discount
  from tbl_products_by_category 
  where discount > 8 ;
"  

> 🙋 Answer:
>  
> Although `discount > 8` looks similar to the index predicate, the query plan results in a sequential scan and does not use the partial index. YugabyteDB populates the partial index based on the literal value in the predicate of the index DDL (`discount > 9`), so the index is considered only when that sub-expression appears as is in the where expression of the query. In addition, `discount > 8` also matches a discount of exactly 9, which the index does not contain.
>
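
A quick way to see why the index cannot serve the second query is to compare the rows each predicate selects. The sketch below uses made-up discount values, not the actual contents of `tbl_products_by_category`.

```python
# Hypothetical discount values for illustration.
discounts = [5, 9, 10, 15]

index_rows = [d for d in discounts if d > 9]  # rows stored in the partial index
query_rows = [d for d in discounts if d > 8]  # rows the query must return

# Rows the query needs that the partial index does not hold.
missing = [d for d in query_rows if d not in index_rows]

print(index_rows)  # [10, 15]
print(query_rows)  # [9, 10, 15]
print(missing)     # [9]
```

Because the index holds only a subset of the rows that `discount > 8` selects, reading from it alone would return an incomplete result.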

---
# 🌟🌟🌟🌟 Great work! 
In this notebook, you completed the following:

- Secondary indexes
  - Requirements
  - Secondary index: Index Scan query plan
  - Secondary index: Index Only Scan query plan
  - Unique index
  - Partial index


## 😊 Next up!
Continue your learning by opening the next notebook, `05_JSONB.ipynb`. 

In [None]:
%%bash
gp open '05_JSONB.ipynb'