<div style="width:100%; background-color: #000041"><a target="_blank" href="http://university.yugabyte.com"><img src="assets/YBU_Logo.png" /></a></div><br>

> **YugabyteDB YSQL Development**
>
> Enroll for free at [Yugabyte University](https://university.yugabyte.com/courses/yugabytedb-ysql-development).
>
<br>
This notebook file is:

`05_Using_GIN_Indexes.ipynb`


# About Generalized Inverted Indexes (YBGIN)

In YugabyteDB, tables and secondary indexes exist as tablets. A tablet is a customized RocksDB database that stores a key-value document. For a table, the tablet key-value store maps primary keys to columns. The key-value store of a secondary index tablet maps the index keys to the primary keys of the table (and any additional columns, such as for a covering index). In DocDB, the distributed document storage of YugabyteDB, the key-value stores for tablets exist as LSM trees in SST files.

A regular secondary index can help improve the efficiency of a query that utilizes the index. However, for container data types such as `array`, `jsonb`, and `tsvector`, a regular index often is of limited use. Here is a basic example that explains why:

`where this_int_array = '{1,2,3}'` uses an equality operator and the query optimizer will select the regular index for the `this_int_array` container column.

`where this_int_array @> '{3}'` uses the contains operator and the query optimizer will not use the regular index for the `this_int_array` container column.

For queries that search container data types, a **Generalized Inverted Index (GIN)** is an ideal index. Conceptually, a GIN index keeps one index entry per distinct key (element), and that entry maps the key to all of the rows that contain it. For this reason, a GIN index often best supports container data types, especially for applications that utilize these columns in search queries.

YSQL supports the creation of a GIN index using the `GIN` or `YBGIN` keywords for these column data types:
- `array`
- `jsonb`
- `tsvector`

Here are some examples of how a GIN index indexes the elements inside a container:

|Data type|Use case| Example | GIN elements|
|-|-|-|-|
| `array`| element search |  `'{1,5,1}'` | 1, 5 |
|  `jsonb` | key/value/path search |  `'{"k":"v"}'` | k, v |
|  `tsvector` | full text search |  `to_tsvector('The quick brown fox')` | brown, fox, quick |


Here's how you create a YBGIN index for a supported data type:
```
create index idx_gin_example on tbl_example using YBGIN (this_int_array);
```
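To make the element table above concrete, here is a minimal Python sketch of an inverted index over arrays. It only illustrates the idea and does not reflect how YBGIN stores entries in DocDB. The names `build_inverted_index` and `contains_all`, and the sample rows, are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each distinct array element to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, elements in rows.items():
        for element in set(elements):  # duplicates in one array collapse to one key
            index[element].add(row_id)
    return index


def contains_all(index, wanted):
    """Row ids whose array contains every element in `wanted` (similar to `@>`).

    Simplification: an empty `wanted` list is not handled like SQL `@> '{}'`.
    """
    postings = [index.get(element, set()) for element in wanted]
    return set.intersection(*postings) if postings else set()


# Hypothetical rows: row id -> int array
rows = {1: [1, 5, 1], 2: [2, 3, 6], 3: [1, 6, 1]}
index = build_inverted_index(rows)

print(sorted(index))                     # [1, 2, 3, 5, 6]
print(sorted(contains_all(index, [6])))  # [2, 3]
```

The lookup for `6` touches a single key and returns the matching rows directly, which is why a contains search benefits from this structure while an index on the whole array value does not.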

A YBGIN index utilizes **range sharding**. The LSM tree sorts the specified key values for range sharding.

> Note
> 
> PostgreSQL stores the key in the index entry, with the mapping details in a posting tree, using a B-Tree index. YugabyteDB uses an LSM tree index. 
> 
> An LSM tree is not a tree structure, but rather an algorithm that converts discrete random write requests into batched sequential write requests. To improve write performance for the LSM tree, RocksDB utilizes a Write-Ahead Log (WAL) and a memtable (a skiplist that lives in memory). The sorted contents of the memtable persist to disk as Sorted Sequence Table (SSTable or SST) files.
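
The write path described in the note can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `TinyLSM` and its `memtable_limit` parameter are hypothetical, a plain dictionary stands in for the skiplist, and the WAL and compaction are left out entirely.

```python
class TinyLSM:
    """Toy LSM store: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}   # stand-in for the in-memory skiplist
        self.sst_runs = []   # each run is a sorted list of (key, value) pairs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # random writes land in memory first
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable run."""
        if self.memtable:
            self.sst_runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sst_runs):  # the newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None


store = TinyLSM(memtable_limit=3)
for key, value in [("c", 1), ("a", 2), ("b", 3), ("a", 4)]:
    store.put(key, value)

print(store.sst_runs)  # [[('a', 2), ('b', 3), ('c', 1)]]
print(store.get("a"))  # 4
```

The three out-of-order writes become one sorted batch on flush, and the later write to `a` shadows the older value without rewriting the flushed run.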



## GIN Operator classes
Operator classes define semantics for index columns of a particular data type and a particular index access method. An operator class specifies that a particular operator family is applicable to a particular indexable column data type. As the following table shows, YugabyteDB supports all the GIN operator classes included in the core PostgreSQL distribution. 

| Name  | Indexed Data Type | Indexable Operators | 
|-|-|-|
| array_ops | anyarray | <li>`&&`  <li>`<@`  <li>`=`  <li>`@>` |
| jsonb_ops | jsonb |existence operators <li>`?` <li>`?&` <li>`?\|` <br> containment operators<li>`@>` <li>`@?` <li>`@@`|
| jsonb_path_ops | jsonb |containment operators <li>`@>` <li>`@?` <li>`@@` |
| tsvector_ops | tsvector | <li>`@@` <li>`@@@` | 

In this notebook, using Explain Plans, built-in functions, and custom utilities for YB-TServer metrics, you will learn about how and when to utilize GIN Indexes in YugabyteDB.

## 🛠️ Requirements
Here are the requirements for this notebook:
- ✅ Create the notebook variables in `01_Lab_Setup.ipynb`, which you previously did
- ✅ Create the `db_ybu` database, which you previously did
- ☑️ Import the notebook variables, *which you must do next*
- ☑️ Connect to the `db_ybu` database, *which you must do next*
- ☑️ Run through a series of DDL and DML scenarios
  -  Basics of DDL and DML
  -  Built-in Functions
  -  Advanced Features


### Select your notebook kernel
- In the Notebook toolbar, click **Select Kernel**.
<br>
<img width=50% src="assets/01_01_Select_Kernel_Toolbar.png" />

- Next, in the dropdown, select **Python 3.12** or higher.
<br>
<img width=50% src="assets/01_02_Select_Kernel_Dropdown.png" />

That's it!

## ⛑️ Getting help
The best way to get help from the Yugabyte University team is to post your question on YugabyteDB Community Slack in the #training or #yb-university channels. To sign up, visit [https://communityinviter.com/apps/yugabyte-db/register](https://communityinviter.com/apps/yugabyte-db/register).

# 👣 Setup steps
Here are the steps to set up this lab:
- Import the notebook variables
- Connect to the `db_ybu` database
- Load the SQL Magic extension for the connection
- Create the prepared statements
- Load the example data

### 👇 Import the notebook variables and style the notebook

> 👉 **IMPORTANT!** 👈
> 
> Do **NOT** skip running the following cells. 
>
> The following Python cell reads the stored variables created in the `01_Lab_Setup.ipynb` notebook. To run the script, select Execute Cell (Play Arrow) in the left gutter of the cell.  The cell after that styles the notebook.

👇 👇 👇 

In [None]:
# Use %store -r to read 01_Lab_Setup variables
%store -r 
%config SqlMagic.named_parameters=True

**Update the styling of the notebook**.

In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

## Connect to the `db_ybu` database
Run all the cells in this section:
- Connect using Python and PostgreSQL driver
- Load the SQL magic extension
- Create the prepared statements


In [None]:
# Connect using Python 3.12.1
import psycopg2
import sqlalchemy as alc
from sqlalchemy import create_engine

db_host = NB_HOST_IPv4_01
db_name = NB_DB_NAME

connection_str = 'postgresql+psycopg2://yugabyte@'+db_host+':5433/'+db_name

engine = create_engine(connection_str)

### Load the SQL magic extension

In [None]:
%reload_ext sql

# SQL magic for python connection string
%sql engine

### Create the prepared statements

> 👉 **IMPORTANT!** 👈
>   
> In order to create the prepared statements for the SQL magic connection, you must run the following cell!!!
> 
> **Do not skip this step**.
> 

In [None]:
#%% python, but prepared statements as sql magic
if (NB_GITPOD_WORKSPACE_URL is None):
    a = %sql select fn_yb_create_stmts()
else:
    WORKSPACE_URL = NB_GITPOD_WORKSPACE_URL.replace('https://','https://7000-')
    a = %sql select fn_yb_create_stmts(:WORKSPACE_URL)

print (a)

Confirm that the following query returns a count of 3 (for three prepared statements).

In [None]:
%%sql 
select count(*) from pg_prepared_statements where 1=1 and name in ('stmt_util_metrics_snap_tablet','stmt_util_metrics_snap_table','stmt_util_metrics_snap_reset')

### Load the example tables for GIN
Run the following cell to execute the DDL and DML scripts using `ysqlsh`. This will create the following tables:

- `tbl_vectors`
- `tbl_arrays`
- `tbl_jsonbs`

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME" "$NB_NOTEBOOK_DATA_FOLDER" "$NB_GIN_EXAMPLES"   # GIN
YB_PATH_BIN=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
GIN_EXAMPLES=${4}


#ls $DATA_FOLDER

GIN_EXAMPLE_PATH=${DATA_FOLDER}/${GIN_EXAMPLES}

cd $YB_PATH_BIN
echo $GIN_EXAMPLE_PATH
# GIN Examples file
./ysqlsh -d ${DB_NAME} -f ${GIN_EXAMPLE_PATH} >&/dev/null

sleep 1;


# Describe relations
./ysqlsh -d ${DB_NAME} -c "\d+ tbl_arrays"

---
# GIN Indexes 
This begins the lab for the GIN indexes.
## q1 | array



In [None]:
%%sql
drop table if exists tbl_arrays;

create table if not exists tbl_arrays (a int[], k serial primary key);

drop index if exists idx_arrays;

create index nonconcurrently idx_arrays on tbl_arrays using ybgin (a);

Insert multiple values.

In [None]:
%%sql
insert into tbl_arrays values
    ('{1,1,6}'),
    ('{1,6,1}'),
    ('{2,3,6}'),
    ('{2,5,8}'),
    ('{null}'),
    ('{}'),
    (null);
    
insert into tbl_arrays select '{0}' from generate_series(1, 1000);

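> Note:
>
> SQL magic has required escaping curly braces by doubling them, though a recent fix resolved this. The cells below keep the escaped form:
> - for `{6}`, write `{{ 6 }}`
> - the same applied to expressions such as `doc @> '{"year": 1950}'`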

In [None]:
%%sql

select * from tbl_arrays WHERE a @> '{{ 6 }}';

> Note:
> 
> There's an issue with `explain analyze` and SQL magic that breaks the connection in this Preview release, so the following cell omits `analyze`. If desired, you can run `explain analyze` in a bash cell using `ysqlsh`. If you run into this error:
> 
> `(psycopg2.OperationalError) server closed the connection unexpectedly`
> 
> you will need to go back to the Setup steps and rerun the following cells:
> - Load the SQL magic extension
> - Create the prepared statements
>
> Then, return here.
>  

In [None]:
%%sql
explain 
select * 
from tbl_arrays
where 1=1
and  a @> '{{ 6 }}';

Because of the SQL magic issue, `explain analyze` cannot run after the prepared statement for this query, so the metrics snapshot runs in a separate cell.

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(5);
select * from tbl_arrays WHERE a @> '{{ 6 }}';

Review the metrics snapshot.

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

---
## q2 | JSONB

In [None]:
%%sql
drop table if exists tbl_jsonbs;

create table tbl_jsonbs (j jsonb, k serial PRIMARY KEY);

drop index if exists idx_jsonbs;

create index NONCONCURRENTLY idx_jsonbs ON tbl_jsonbs USING ybgin (j);

In [None]:
%%sql
insert into tbl_jsonbs values ('{"some":"body"}');
insert into tbl_jsonbs values ('{"some":["where","how"]}');
insert into tbl_jsonbs values ('{"some":{"nested":"jsonb"}, "and":["another","element","not","a","number"]}');

> Note
> 
> SQL magic has an issue with `{ "a":5 }`. Adding spaces, as in `{ "a" : 5 }`, resolves it.
>
> Also, once a cell has an issue, you need to create a new cell with the resolution, because some type of caching appears to be in effect.

In [None]:
%%sql
insert into tbl_jsonbs values ('{ "a" : 5 }')

Validate the `insert` with a `select` query.

In [None]:
%%sql
select * from tbl_jsonbs where j ? 'some';

In [None]:
%%sql

explain analyze select * from tbl_jsonbs where j ? 'some';

Because of the SQL magic issue, run the metrics snapshot separately from `explain analyze`.

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(2);
select * from tbl_jsonbs where j ? 'some';

Review the snapshot.

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

--- 
## q3 | Text Search (tsvector)



In [None]:
%%sql

drop table if exists tbl_vectors;

create table tbl_vectors (v tsvector, k serial primary key);

drop index if exists idx_vectors;

create index nonconcurrently idx_vectors on tbl_vectors using ybgin (v);

insert into tbl_vectors values
    (to_tsvector('simple', 'the quick brown fox')),
    (to_tsvector('simple', 'jumps over the')),
    (to_tsvector('simple', 'lazy dog'));

-- add some filler rows to make sequential scan more costly.
insert into tbl_vectors select to_tsvector('simple', 'filler') from generate_series(1, 1000);

Validate the `insert` with a `select` query.

In [None]:
%%sql
select * from tbl_vectors where v @@ to_tsquery('simple', 'the');

In [None]:
%%sql
explain select * from tbl_vectors where v @@ to_tsquery('simple', 'the');

Run the metrics separately from the explain.

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(2);
select * from tbl_vectors where v @@ to_tsquery('simple', 'the');

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

--- 
### Limitations
See https://docs.yugabyte.com/preview/explore/indexes-constraints/gin/#limitations



---

## q4 | Optimize JSONB queries with GIN indexes

In this exercise, you will create a table that contains information about the contributors to YugabyteDB. The data comes from the GitHub API and is in JSON format. The goal of the exercise is to demonstrate how a GIN index benefits query performance when searching the JSONB data type.

### Load the GitHub contributors list
The following bash script loads the contributor list for the YugabyteDB open source project, available at:

https://github.com/yugabyte/yugabyte-db

The GitHub API endpoint that returns the JSON payload is:

https://docs.github.com/en/rest/repos/repos#list-repository-contributors


Run the following cell to create the table, `tbl_commits`, and load the data into the table. Look closely, and you will see the creation of a temporary table, too.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME" "$NB_NOTEBOOK_DATA_FOLDER" "$NB_GITHUB_DATA_FILE"
YB_PATH_BIN=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
GITHUB_DATA_FILE=${4}

GITHUB_DATA_FILE_PATH=${DATA_FOLDER}/${GITHUB_DATA_FILE}

cd $YB_PATH_BIN
echo ${GITHUB_DATA_FILE_PATH} 

./ysqlsh -d ${DB_NAME} -c "drop table if exists tmp_commits;";

./ysqlsh -d ${DB_NAME} -c "drop table if exists tbl_commits;";

./ysqlsh -d ${DB_NAME} -c "create table tmp_commits (j jsonb);";

./ysqlsh -d ${DB_NAME} -c "create table tbl_commits (j jsonb);";

./ysqlsh -d ${DB_NAME} -c "\copy tmp_commits (j) FROM "${GITHUB_DATA_FILE_PATH}";"

./ysqlsh -d ${DB_NAME} -c "insert into tbl_commits select jsonb_array_elements(j) from tmp_commits;"

#./ysqlsh -d ${DB_NAME} -c "select jsonb_array_elements(j) from tmp_commits "

Verify the load.

In [None]:
%%sql
select '' _ 
 , j
 , j -> 'author'
 , j -> 'author' ->> 'login'
-- , (j -> 'author' ->> 'login') as login, j
from tbl_commits
where 1=1 
-- and j ? 'author'
and (j -> 'author' ->> 'login') = 'mbautin'
;

### q4a | Query JSONB without a GIN index
To view the Explain Plan for the query, run the following cell:

In [None]:
%%sql

-- explain (costs off, analyze, verbose) 
explain 
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
and (j -> 'author' ->> 'login') = 'mbautin'
;

#### q4a | Explain Plan (above ^^)

The Explain Plan shows that this query uses a sequential scan, that is, a full table scan.
- `Seq Scan on public.tbl_commits (actual time=1.871..4.389 rows=2 loops=1)`

The Filter expression removes all the rows except those that meet the condition.

- `Filter: (((tbl_commits.j -> 'author'::text) ->> 'login'::text) = 'mbautin'::text)`

The filter removes 28 rows, leaving the two rows that match.

- `Rows Removed by Filter: 28`

To view the tablet metrics, run the following cells:

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(3);
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
and (j -> 'author' ->> 'login') = 'mbautin'
;

Review the snapshot metrics.

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

#### q4a | Metrics (above ^^)

The metrics reveal how costly this query is without a GIN index:

| row_name | rocksdb_number_db_seek | rocksdb_number_db_next |
|--|--|--|
| db_ybu tbl_commits link_table_id tablet_id_unq_1 leader | 1 | 17 |
| db_ybu tbl_commits link_table_id tablet_id_unq_2 leader | 1 | 29 |
| db_ybu tbl_commits link_table_id tablet_id_unq_3 leader | 1 | 11 |

There is just one seek for each tablet, and each seek is followed by about 20 reads (`next` operations) on average, which amounts to reading every row in the table.


### q4b | Search JSONB using a YBGIN index
Of the two operator classes for type `jsonb`, `jsonb_ops` is the default in PostgreSQL. This exercise first implements a `jsonb_ops` index and then a `jsonb_path_ops` index. The table below highlights the differences between them.

| Operator class | How data is stored |
| - | - |
| `jsonb_ops` | Adds every JSON key and value from the indexed column into the index as a key. For example, `'{"abc": [123, true]}'` maps to three GIN keys: `\001abc`, `\004123`, `\003t`. The flag bytes here indicate the types key, numeric, and boolean, respectively. Because it produces a large number of index rows, it is less efficient in terms of both time and space. |
| `jsonb_path_ops` | Encodes the full JSON path for every leaf value into the index key. For example, with `'{"abc": [123, true]}'`, there are two paths: `"abc" -> 123` and `"abc" -> true`. Then, there are two GIN keys based on those paths. Because it produces significantly fewer keys, lookups are very efficient for full-path queries, since each path needs just one lookup. |



#### q4b | Create the GIN index with `jsonb_ops`
-- USING GIN ((j -> 'author') jsonb_path_ops);

In [None]:
%%sql
drop index if exists idx_commits_ybgin_j;

select pg_sleep(1);

create index if not exists idx_commits_ybgin_j on tbl_commits using gin (j jsonb_ops);

> Important!
> 
> In Data Definition Language statements, YugabyteDB will interpret the `GIN` keyword as `YBGIN`.

Review the DDL for index in the following cell and observe `USING ybgin`:

In [None]:
%%sql
select pg_get_indexdef('idx_commits_ybgin_j':: regclass);

Alternatively, use the `\d+` command to view the definition in the table:

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"  # \d+
YB_PATH_BIN=${1}
DB_NAME=${2}

cd $YB_PATH_BIN

./ysqlsh -d ${DB_NAME} -c "\d+ tbl_commits"

##### View the Index details in the YB-Master web UI
You can view the details of the `idx_commits_ybgin_j` index in the YB-Master web UI. Run the cell below and open the URL in your web browser.

In [None]:
#%% python, but prepared statements as sql magic
THIS_INDEX_NAME = 'idx_commits_ybgin_j'
THIS_SCHEMA_NAME = 'public'
DB_NAME = NB_DB_NAME

## Comment out if local
view_gitpod_url = %sql select fn_get_table_id_url(:NB_YB_MASTER_HOST_GITPOD_URL,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_gitpod_url
print (view_gitpod_url)

## Uncomment if local
# view_local_url = %sql select fn_get_table_id_url(:NB_HOST_IPv4_01,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_local_url
# print (view_local_url)

##### Review the Column section
The column section shows details about each column in the index. Here is the section for `idx_commits_ybgin_j`:


| Column | ID	| Type |
|--------|------|------|
| j    | 0	| string NOT NULL NOT A PARTITION KEY | 
| ybidxbasectid	   | 1	| binary NOT NULL NOT A PARTITION KEY | 


<br/>

The indexed column `j` is of type `string`.

> Important!
>  
>  YugabyteDB creates an internal, hidden column, `ybidxbasectid`, for the indexed row. `ybidxbasectid` is similar to the internal, hidden column, `ybctid`, for a row of a table. Both `ybctid` and  `ybidxbasectid` are virtual columns that represent the
>  DocDB-encoded key for the tuple. 
> 
> Using  `\d` or `\d+` will not show the `ybidxbasectid` column. It is also not possible to query the `ybidxbasectid` value.

##### Review the Tablet section
The Tablet section shows the details for the existing tablets. Here is the section for `idx_commits_ybgin_j`: 

| Tablet ID |	Partition	| SplitDepth	| State	| Hidden	| Message	| RaftConfig|
|--|--|--|--|--|--|--|
| some_uuid_1<br>`1e2c3ef228534d3cbbf59c9fa6968d88	` |	`range: [<start>, <end>)` |	0	| Running|	false| Tablet reported with an active leader	|<li>FOLLOWER: 127.0.0.1 <li>FOLLOWER: 127.0.0.3<li>LEADER: 127.0.0.1 |

YugabyteDB will automatically split this tablet based on the size of the table on disk. There are various global flags that determine this behavior.

### q4b | Query JSONB with a GIN index
To view the Explain Plan for the query, run the following cell:

In [None]:
%%sql
-- SET enable_indexscan = on;
-- SET enable_seqscan = OFF;


-- explain (costs off, analyze, verbose) 
explain 
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
-- uses index scan with gin and jsonb_ops
-- and j ? 'author'
and j @> '{"author": {"login": "mbautin"}}'
;


In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(3);
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
-- uses index scan with gin and jsonb_ops
-- and j ? 'author'
and j @> '{"author": {"login": "mbautin"}}'
;

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

##### q4b | Metrics (above ^^)

In the initial query, the `Index Scan` accesses the index tablet:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu  idx_commits_ybgin_j  link_table_id tablet_id_unq_1 leader	 | 1| 	30 |

There is one result for the predicate expression. This results in one seek of the index tablet offset. There are 30 reads from the offset because that is the total number of rows.

Because the query returns all columns, the query also reads from the tablet leaders for `tbl_commits`, a table with hash sharding.


| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_commits link_table_id tablet_id_unq_1	leader | 8 | 15 |
| db_ybu tbl_commits link_table_id tablet_id_unq_2	leader | 11 | 21 |
| db_ybu tbl_commits link_table_id tablet_id_unq_3	leader | 11 | 21 |

What the Metrics report reveals is that even though the query uses the GIN index, the query must still seek in all of the tablets for `tbl_commits`: 8 + 11 + 11 = 30 seeks, one for each of the 30 rows.



### q4c | YBGIN index for JSONB using the `jsonb_path_ops` operator class

In the previous exercise, the GIN index stored values as separate string entries: field, field1, value1, value2, etc. 

 Depending on the predicate, the GIN index will combine multiple index entries to satisfy the specific query conditions, or, not use the index at all.

The `jsonb_path_ops` operator class optimizes the GIN index structure and stores index data as individual entries using an internal hash function: `fn_hash(field, value1)`, `fn_hash(field, value2)`, and so on.
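
The difference between the two operator classes can be sketched in Python. This is an illustration under assumptions, not the real key encoding: the functions `jsonb_ops_keys` and `jsonb_path_ops_keys` are hypothetical, `zlib.crc32` is only a stand-in for the internal hash function, and the rule that real `jsonb_ops` applies to string array elements is ignored.

```python
import json
import zlib


def jsonb_ops_keys(value):
    """Collect every object key and every scalar value as separate index keys."""
    keys = set()
    if isinstance(value, dict):
        for k, v in value.items():
            keys.add(("key", k))
            keys |= jsonb_ops_keys(v)
    elif isinstance(value, list):
        for item in value:
            keys |= jsonb_ops_keys(item)
    else:
        keys.add(("value", json.dumps(value)))
    return keys


def jsonb_path_ops_keys(value, path=()):
    """Collect one hashed index key per leaf value, built from the path to it."""
    if isinstance(value, dict):
        keys = set()
        for k, v in value.items():
            keys |= jsonb_path_ops_keys(v, path + (k,))
        return keys
    if isinstance(value, list):
        keys = set()
        for item in value:
            keys |= jsonb_path_ops_keys(item, path)
        return keys
    leaf = json.dumps([list(path), value])
    return {zlib.crc32(leaf.encode("utf-8"))}  # stand-in hash, not the real one


doc = json.loads('{"abc": [123, true]}')
print(len(jsonb_ops_keys(doc)))       # 3
print(len(jsonb_path_ops_keys(doc)))  # 2

# A containment query only needs the hashed paths of the search document.
query = json.loads('{"abc": [123]}')
print(jsonb_path_ops_keys(query) <= jsonb_path_ops_keys(doc))  # True
```

Because each hashed key already combines the path and the value, a containment search reduces to checking that the query's keys are present, instead of combining separate matches for keys and values.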


The `jsonb_path_ops` class supports containment queries. To begin this exercise, create the GIN index using the operator class.

In [None]:
%%sql
drop index if exists idx_commits_j_path_ops;

select pg_sleep(1);

create index if not exists idx_commits_j_path_ops on tbl_commits using gin (j jsonb_path_ops);

> Important!
> 
> In Data Definition Language statements, YugabyteDB will interpret the `GIN` keyword as `YBGIN`.

Review the DDL for index in the following cell and observe `USING ybgin`:

In [None]:
%%sql
select pg_get_indexdef('idx_commits_j_path_ops':: regclass);


Alternatively, use the `\d+` command to view the definition in the table:

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"  # \d+
YB_PATH_BIN=${1}
DB_NAME=${2}

cd $YB_PATH_BIN

./ysqlsh -d ${DB_NAME} -c "\d+ tbl_commits"

#### View the Index details in the YB-Master web UI
You can view the details of the `idx_commits_j_path_ops` index in the YB-Master web UI. Run the cell below and open the URL in your web browser.

In [None]:
#%% python, but prepared statements as sql magic
THIS_INDEX_NAME = 'idx_commits_j_path_ops'
THIS_SCHEMA_NAME = 'public'
DB_NAME = NB_DB_NAME

## Comment out if local
view_gitpod_url = %sql select fn_get_table_id_url(:NB_YB_MASTER_HOST_GITPOD_URL,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_gitpod_url
print (view_gitpod_url)

## Uncomment if local
# view_local_url = %sql select fn_get_table_id_url(:NB_HOST_IPv4_01,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_local_url
# print (view_local_url)

#### Review the Column section
The column section shows details about each column in the index. Here is the section for  `idx_commits_j_path_ops`:


| Column | ID	| Type |
|--------|------|------|
| j    | 0	| int32 NOT NULL NOT A PARTITION KEY | 
| ybidxbasectid	   | 1	| binary NOT NULL NOT A PARTITION KEY | 


<br/>

The indexed column `j` is of type `int32`. The internal hash function for `jsonb_path_ops` returns an integer value.

> Important!
>  
>  YugabyteDB creates an internal, hidden column, `ybidxbasectid`, for the indexed row. `ybidxbasectid` is similar to the internal, hidden column, `ybctid`, for a row of a table. Both `ybctid` and  `ybidxbasectid` are virtual columns that represent the
>  DocDB-encoded key for the tuple. 
> 
> Using  `\d` or `\d+` will not show the `ybidxbasectid` column. It is also not possible to query the `ybidxbasectid` value.

#### Review the Tablet section
The Tablet section shows the details for the existing tablets. Here is the section for  `idx_commits_j_path_ops`: 

| Tablet ID |	Partition	| SplitDepth	| State	| Hidden	| Message	| RaftConfig|
|--|--|--|--|--|--|--|
| some_uuid_1<br>`1e2c3ef228534d3cbbf59c9fa6968d88	` |	`range: [<start>, <end>)` |	0	| Running|	false| Tablet reported with an active leader	|<li>FOLLOWER: 127.0.0.1 <li>FOLLOWER: 127.0.0.3<li>LEADER: 127.0.0.1 |

YugabyteDB will automatically split this tablet based on the size of the table on disk. There are various global flags that determine this behavior.

### q4c | Query JSONB with a GIN index using `jsonb_path_ops`
To view the Explain Plan for the query, run the following cell:

In [None]:
%%sql

explain 
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
-- uses index scan with gin and jsonb_path_ops
and j @> '{"author": {"login": "mbautin"}}'
;

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(3);
select '' _ 
 , j -> 'author'
 , j -> 'author' ->> 'login'
from tbl_commits
where 1=1 
-- uses index scan with gin and jsonb_path_ops
and j @> '{"author": {"login": "mbautin"}}'
;

Review the metrics.

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

##### q4c | Metrics (above ^^)

In the initial query, the `Index Scan` accesses the index tablet:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu  idx_commits_j_path_ops   link_table_id tablet_id_unq_1 leader	 | 1| 	2 |

There is one result for the predicate expression. This results in one seek of the index tablet offset. There are 2 reads from the offset because that is the number of matching rows.

The query also reads two rows, one from each tablet leader for `tbl_commits`, a table with hash sharding.


| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_commits link_table_id tablet_id_unq_1	leader | 2 | 2 |
| db_ybu tbl_commits link_table_id tablet_id_unq_2	leader | 2 | 2|

This is for the two columns in the query expression.
```
 , j -> 'author'
 , j -> 'author' ->> 'login'
 ```
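
To compare the three approaches side by side, the sample numbers from the q4a, q4b, and q4c metrics tables in this notebook can be totaled in Python. The values are copied from those tables and are illustrative only, since your own runs will report different numbers. The `totals` helper is hypothetical.

```python
# (seeks, nexts) per tablet, copied from the sample metrics tables in this notebook
metrics = {
    "q4a seq scan": {"table": [(1, 17), (1, 29), (1, 11)]},
    "q4b jsonb_ops": {"index": [(1, 30)], "table": [(8, 15), (11, 21), (11, 21)]},
    "q4c jsonb_path_ops": {"index": [(1, 2)], "table": [(2, 2), (2, 2)]},
}


def totals(per_tablet):
    """Sum seeks and nexts across the index and table tablets."""
    seeks = sum(s for rows in per_tablet.values() for s, _ in rows)
    nexts = sum(n for rows in per_tablet.values() for _, n in rows)
    return seeks, nexts


for name, per_tablet in metrics.items():
    seeks, nexts = totals(per_tablet)
    print(f"{name}: seeks={seeks}, nexts={nexts}")
# q4a seq scan: seeks=3, nexts=57
# q4b jsonb_ops: seeks=31, nexts=87
# q4c jsonb_path_ops: seeks=5, nexts=6
```

On this small sample, the `jsonb_ops` index does more work than the sequential scan, while the `jsonb_path_ops` index touches only the matching entries.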


--- 

### Optional | Books

In [None]:
%%sql
drop table if exists tbl_books;
create table tbl_books(k serial primary key, doc jsonb not null);

insert into tbl_books(doc) values
  ('
    { "ISBN"    : 4582546494267,
      "title"   : "Macbeth", 
      "author"  : {"given_name": "William", "family_name": "Shakespeare"},
      "year"    : 1623
    }
  '), ('
    { "ISBN"    : 8760835734528,
      "title"   : "Hamlet",
      "author"  : {"given_name": "William", "family_name": "Shakespeare"},
      "year"    : 1603,
      "editors" : ["Lysa", "Elizabeth"]
    }
  '), ('
    { "ISBN"    : 7658956876542,
      "title"   : "Oliver Twist",
      "author"  : {"given_name": "Charles", "family_name": "Dickens"},
      "year"    : 1838,
      "genre"   : "novel",
      "editors" : ["Mark", "Tony", "Britney"]
    }
  '), ('
    { "ISBN"    : 9874563896457,
      "title"   : "Great Expectations",
      "author"  : {"family_name": "Dickens"},
      "year"    : 1950,
      "genre"   : "novel",
      "editors" : ["Robert", "John", "Melisa", "Elizabeth"]
    }
  '), ('
    { "ISBN"    : 8647295405123,
      "title"   : "A Brief History of Time",
      "author"  : {"given_name": "Stephen", "family_name": "Hawking"},
      "year"    : 1988,
      "genre"   : "science",
      "editors" : ["Melisa", "Mark", "John", "Fred", "Jane"]
    }
  '), ('
    { "ISBN"     : 6563973589123,
      "year"     : 1989,
      "genre"    : "novel",
      "title"    : "Joy Luck Club",
      "author"   : {"given_name": "Amy", "family_name": "Tan"},
      "editors"  : ["Ruilin", "Aiping"]
    }
');

In [None]:
%%sql
create index idx_books_gin_json_ops on tbl_books using gin(doc);


In [None]:
%%sql
create index idx_books_gin_json_path_ops on tbl_books using gin(doc jsonb_path_ops);

See 
https://docs.yugabyte.com/preview/api/ysql/datatypes/type_json/functions-operators/key-or-value-existence-operators/


`?` operator

Purpose: If the left-hand JSON value is an object, test if it has a key-value pair with a key whose name is given by the right-hand scalar text value. If the left-hand JSON value is an array, test if it has a string value given by the right-hand scalar text value.

`?|` operator

Purpose: If the left-hand JSON value is an object, test if it has at least one key-value pair where the key name is present in the right-hand list of scalar text values. If the left-hand JSON value is an array, test if it has at least one string value that is present in the right-hand list of scalar text values.

`?&` operator

Purpose: If the left-hand JSON value is an object, test if every value in the right-hand list of scalar text values is present as the name of the key of a key-value pair. If the left-hand JSON value is an array, test if every value in the right-hand list of scalar text values is present as a string value in the array.

`@>` and `<@` operators

Purpose: The `@>` operator tests if the left-hand JSON value contains the right-hand JSON value. The `<@` operator tests if the right-hand JSON value contains the left-hand JSON value.
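
The operator descriptions above can be mirrored with plain Python values. This is a simplified sketch: the helper names `exists`, `exists_any`, `exists_all`, and `contains` are hypothetical, and some special cases are left out (for example, a top-level string on the left-hand side, or an array that contains a bare scalar). The sample document is an abbreviated copy of one row from `tbl_books`.

```python
def exists(left, key):
    """`?`: key of an object, or string element of an array."""
    if isinstance(left, dict):
        return key in left
    if isinstance(left, list):
        return any(isinstance(item, str) and item == key for item in left)
    return False


def exists_any(left, keys):
    """`?|`: at least one of `keys` exists."""
    return any(exists(left, key) for key in keys)


def exists_all(left, keys):
    """`?&`: every one of `keys` exists."""
    return all(exists(left, key) for key in keys)


def contains(left, right):
    """`@>`: simplified structural containment."""
    if isinstance(left, dict) and isinstance(right, dict):
        return all(k in left and contains(left[k], v) for k, v in right.items())
    if isinstance(left, list) and isinstance(right, list):
        return all(any(contains(l, r) for l in left) for r in right)
    if isinstance(left, (dict, list)) or isinstance(right, (dict, list)):
        return False
    if isinstance(left, bool) or isinstance(right, bool):
        return left is right  # keep true/false distinct from 1/0
    return left == right


book = {
    "title": "Great Expectations",
    "year": 1950,
    "genre": "novel",
    "editors": ["Robert", "John", "Melisa", "Elizabeth"],
}

print(exists(book, "year"))                             # True
print(exists(book, "Melisa"))                           # False (nested, not a top-level key)
print(exists(book["editors"], "Melisa"))                # True
print(exists_any(book, ["publisher", "title"]))         # True
print(exists_all(book, ["publisher", "title"]))         # False
print(contains(book, {"year": 1950}))                   # True
print(contains(book, {"editors": ["John", "Melisa"]}))  # True
```

Note that the existence operators only look at the top level of the left-hand value, which is why searching for a nested editor name needs containment or a path expression instead.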

In [None]:
%%sql

set enable_indexscan = on;
-- set enable_seqscan = OFF;

execute  stmt_util_metrics_snap_reset;

explain (costs off, analyze,verbose) 
select
  ((doc->>'ISBN')::bigint)   as isbn,
  ((doc->>'editors')::text)  as editors
from tbl_books
where 1=1 
-- and doc ? 'Melisa'
 and doc @> '{"year": 1950}'
 order by 1;

---

<div style="width:100%; background-color: #000041"><a target="_blank" href="http://university.yugabyte.com"><img src="assets/jeop_logo_large.webp" /></a></div>


## q5 | Text Search for Jeopardy

In this exercise, you will create a table and load data into a table for 38 seasons of Jeopardy questions and answers. The table uses a container data type column, `tsvector`. The `tsvector` data type supports text search.

Jeopardy is a game where contestants see the answer to a question and must respond in the form of a question. For this reason, the `answer_vec` column uses the `tsvector` data type.

After creating the table and loading the data, you will then run a baseline query to view the query explain plan and the associated metrics.

Next, you will create a GIN index for the table. After creating the GIN index, you will run the same baseline query, review the explain plan, and the associated metrics.


In [None]:
%%sql 
drop function if exists fn_trig_answer_vec CASCADE;
drop trigger if exists trig_answer_vec ON tbl_jeopardy CASCADE;
drop index if exists idx_ygbin_jeopardy;
drop table if exists tbl_jeopardy;

### Create the table, `tbl_jeopardy`
Run the following cell to create the table for the Jeopardy data. The table uses `rowid` as the primary key.

In [None]:
%%sql

create table tbl_jeopardy (
  rowid serial primary key,
  round int,
  value int,
  daily_double bool,
  category text,
  comments text,
  answer text,
  answer_vec tsvector,
  question text,
  air_date date,
  notes text) ;


To help load the data into the table, run the following cell to create a trigger function and a trigger for the table.

In [None]:
%%sql

create or replace function fn_trig_answer_vec()
returns trigger 
language plpgsql 
AS $$
begin
    new.answer_vec = to_tsvector('english', new.answer);
    RETURN new;
end$$;

create trigger trig_answer_vec
  before insert or update on tbl_jeopardy
  for each row execute function fn_trig_answer_vec();

### Load the Jeopardy data set
The following bash script loads the Jeopardy data file. The file contains the answers (you must form the question) from 38 seasons of Jeopardy.

> Note:
>
> The following data load will take about 5 minutes or so. It is 38 seasons of Jeopardy and over 200K items.
> 
> So, what's the question for this answer:
>
> Debuting in 1996, this database gets its name from being a developmental successor to the Ingres database from the University of California, Berkeley.

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME" "$NB_NOTEBOOK_DATA_FOLDER" "$NB_JEOPARDY_DATA_FILE"
YB_PATH_BIN=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
JEOPARDY_DATA_FILE=${4}

JEOPARDY_DATA_FILE_PATH=${DATA_FOLDER}/${JEOPARDY_DATA_FILE}

cd $YB_PATH_BIN
echo ${JEOPARDY_DATA_FILE_PATH} 

./ysqlsh -d ${DB_NAME} -c "SET yb_index_state_flags_update_delay TO '0s'";

./ysqlsh -d ${DB_NAME} -c "\copy tbl_jeopardy (round, value, daily_double, category, comments, answer, question, air_date, notes) FROM '${JEOPARDY_DATA_FILE_PATH}' WITH DELIMITER E'\t' CSV HEADER;"


You can review the table definition by running the following cell:

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"  # \d+
YB_PATH_BIN=${1}
DB_NAME=${2}

cd $YB_PATH_BIN

./ysqlsh -d ${DB_NAME} -c "\d+ tbl_jeopardy"

### Review a few rows of the table

In [None]:
%%sql
select * from tbl_jeopardy limit 10;

#### Create a helper function to query the table
YugabyteDB supports several built-in functions to use with the `tsvector` data type. In the following cell, you will use these related functions:

- `to_tsquery`
- `ts_headline`
- `ts_rank`

In [None]:
%%sql
drop function if exists fn_sel;

create or replace function fn_sel(filter text) returns SETOF record language plpgsql AS $$
declare rec record;
declare query tsquery = to_tsquery(filter);
begin
    for rec in (select question as q, ts_headline(answer, query, 'StartSel=<, StopSel=>') as a
        from tbl_jeopardy
        where 1=1 
        and answer_vec @@ query
        order by ts_rank(answer_vec, query) desc) loop
        return next rec;
    end loop;
    return;
end$$
;

#### Get rows using the helper function
Run the following example to use the helper function.

In [None]:
%%sql

select * 
from fn_sel('databases') AS (q text, a text);    

### q5a | Query using text search without a GIN index
`to_tsquery()` is a system function that converts text to a `tsquery`.
See https://www.postgresql.org/docs/11/functions-textsearch.html

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(1);

In [None]:
%%sql
explain (costs off, analyze, verbose) 
-- select * from tbl_jeopardy where 1=1 and answer_vec @@ to_tsquery ('databases')
select * from tbl_jeopardy where 1=1 and answer_vec @@ to_tsquery ('!data & !database & data:*')  
; 

#### q5a | Explain Plan (above ^^)

The Explain Plan shows that this query uses a sequential scan, that is, a full table scan.
- `Seq Scan on public.tbl_jeopardy (actual time=601.915..7022.337 rows=1 loops=1)`

The Filter expression removes every row except the one that meets the condition.

- `Filter: (tbl_jeopardy.answer_vec @@ to_tsquery('!data & !database & data:*'::text))`

The filter removes all but one row.

- ` Rows Removed by Filter: 402415`

To view the tablet metrics, run the following cell:

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

#### q5a | Metrics (above ^^)

The metrics reveal how costly this query is without a GIN index:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_1 leader	 | 131 | 	1476914	|
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_2 leader	 | 132 | 	1480357	|
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_3 leader	 | 131 | 	1473603	|

Each tablet serves over a hundred seeks (131 or 132), and the nearly 1.5 million `next` operations per tablet show that the query reads every tablet in full.


### q5b | Text search using a YBGIN index
First, you will need to create the index itself. You can use either the `USING GIN` or `USING YBGIN` keywords. The following cell uses the `GIN` keyword to illustrate PostgreSQL compatibility. This will take about 1 minute.

In [None]:
%%sql
drop index if exists idx_jeopardy_ybgin_answer_vec;

select pg_sleep(1);

create index if not exists idx_jeopardy_ybgin_answer_vec on tbl_jeopardy using gin (answer_vec);

Review the DDL for the index in the following cell and observe `USING ybgin`:

In [None]:
%%sql
select pg_get_indexdef('idx_jeopardy_ybgin_answer_vec'::regclass);

Alternatively, use the `\d+` command to view the definition in the table:

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"  # \d+
YB_PATH_BIN=${1}
DB_NAME=${2}

cd $YB_PATH_BIN

./ysqlsh -d ${DB_NAME} -c "\d+ tbl_jeopardy"

> Important!
> 
> In Data Definition Language statements, YugabyteDB will interpret the `GIN` keyword as `YBGIN`.

#### View the Index details in the YB-Master web UI
You can view the details of the `idx_jeopardy_ybgin_answer_vec` index in the YB-Master web UI. Run the cell below and open the URL in your web browser.

In [None]:
#%% python, but prepared statements as sql magic
THIS_INDEX_NAME = 'idx_jeopardy_ybgin_answer_vec'
THIS_SCHEMA_NAME = 'public'
DB_NAME = NB_DB_NAME

## Comment out if local
view_gitpod_url = %sql select fn_get_table_id_url(:NB_YB_MASTER_HOST_GITPOD_URL,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_gitpod_url
print (view_gitpod_url)

## Uncomment if local
# view_local_url = %sql select fn_get_table_id_url(:NB_HOST_IPv4_01,7000,:DB_NAME,:THIS_SCHEMA_NAME,:THIS_INDEX_NAME ) as view_local_url
# print (view_local_url)

#### Review the Column section
The column section shows details about each column in the index. Here is the section for  `idx_jeopardy_ybgin_answer_vec`:


| Column | ID	| Type |
|--------|------|------|
| answer_vec    | 0	| string NOT NULL NOT A PARTITION KEY | 
| ybidxbasectid	   | 1	| binary NOT NULL NOT A PARTITION KEY | 


<br/>

> Important!
>  
>  YugabyteDB creates an internal, hidden column, `ybidxbasectid`, for the indexed row. `ybidxbasectid` is similar to the internal, hidden column, `ybctid`, for a row of a table. Both `ybctid` and `ybidxbasectid` are virtual columns that represent the
>  DocDB-encoded key for the tuple.
> 
> Using  `\d` or `\d+` will not show the `ybidxbasectid` column. It is also not possible to query the `ybidxbasectid` value.

#### Review the Tablet section
The Tablet section shows the details for the existing tablets. Here is the section for  `idx_jeopardy_ybgin_answer_vec`: 

| Tablet ID |	Partition	| SplitDepth	| State	| Hidden	| Message	| RaftConfig|
|--|--|--|--|--|--|--|
| some_uuid_1<br>`1e2c3ef228534d3cbbf59c9fa6968d88	` |	`range: [<start>, <end>)` |	0	| Running|	false| Tablet reported with an active leader	|<li>FOLLOWER: 127.0.0.1 <li>FOLLOWER: 127.0.0.3<li>LEADER: 127.0.0.1 |

YugabyteDB will automatically split this tablet based on the size of the table on disk. The following global flags determine this behavior:

```

--tablet_force_split_threshold_bytes=107374182400 --> 102400 MB
--tablet_split_high_phase_shard_count_per_node=24
--tablet_split_high_phase_size_threshold_bytes=10737418240 --> 10240 MB
--tablet_split_low_phase_shard_count_per_node=8
--tablet_split_low_phase_size_threshold_bytes=536870912 --> 512 MB
--tablet_split_size_threshold_bytes=0
```

The low phase indicates the threshold for the initial splits of the tablet. With more data volume, the threshold increases from 512 MB in the low phase to 10 GB in the high phase.
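The byte values in the flag list are easier to compare once converted to megabytes. The following is a minimal Python sketch of that arithmetic, assuming binary megabytes (1 MB = 1024 × 1024 bytes). The dictionary copies the three size thresholds from the list above, and `to_mb` is a hypothetical helper name, not a YugabyteDB function.

```python
# Size thresholds copied from the flag list above (values in bytes).
SPLIT_THRESHOLDS_BYTES = {
    "tablet_split_low_phase_size_threshold_bytes": 536870912,
    "tablet_split_high_phase_size_threshold_bytes": 10737418240,
    "tablet_force_split_threshold_bytes": 107374182400,
}

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def to_mb(size_bytes):
    """Convert a byte count to whole binary megabytes (hypothetical helper)."""
    return size_bytes // BYTES_PER_MB


thresholds_mb = {name: to_mb(value) for name, value in SPLIT_THRESHOLDS_BYTES.items()}

for name, size_mb in thresholds_mb.items():
    print(f"{name}: {size_mb} MB ({size_mb / 1024:g} GB)")
```

The three thresholds work out to 512 MB, 10 GB, and 100 GB respectively.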

#### q5b | Query using the GIN index


First, run `explain` for the query. You will run the query itself afterward.

In [None]:
%%sql

explain select * from tbl_jeopardy 
where 1=1 
and answer_vec @@ to_tsquery ('!data & !database & data:*')  
; 

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(1);

##### q5b | Explain Plan (above ^^)

The Explain Plan shows that this query uses the index and returns 1 row:
- `Index Scan using idx_jeopardy_ybgin_answer_vec on public.tbl_jeopardy (actual time=9.587..9.589 rows=1 loops=1)`

The Index Condition reflects the query predicate.

- ` Index Cond: (tbl_jeopardy.answer_vec @@ to_tsquery('!data & !database & data:*'::text))`

Notice the Index Recheck:
- `Rows Removed by Index Recheck: 169`

Unlike PostgreSQL, YugabyteDB does not use a Bitmap Index Scan and Heap Tables for GIN related queries. In PostgreSQL, `Rows Removed by Index Recheck` typically indicates that the bitmap exceeded working memory (`work_mem`) and became lossy, so PostgreSQL must recheck the rows it fetches. In YugabyteDB, `Index Recheck` simply indicates that the YB-TServer processing the query rechecks the rows retrieved from DocDB and removes any rows that do not meet the query condition, that is, the `Index Cond`.
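The scan-then-recheck behavior can be pictured with a toy inverted index. The following Python sketch is a conceptual model only, not how DocDB stores or scans a YBGIN index. The row ids and lexemes are invented for illustration, and `prefix_candidates` and `recheck` are hypothetical helper names. The index is scanned for every key that starts with `data` (the `data:*` part of the predicate), and the full predicate is then re-evaluated for each candidate row.

```python
from collections import defaultdict

# Toy rows: row id -> set of lexemes (stand-ins for a tsvector). Invented data.
rows = {
    1: {"data", "scienc"},
    2: {"databas", "postgr"},
    3: {"datafil", "load"},
    4: {"index", "scan"},
}

# Inverted index: one entry per lexeme, listing every row that contains it.
inverted = defaultdict(set)
for row_id, lexemes in rows.items():
    for lexeme in lexemes:
        inverted[lexeme].add(row_id)


def prefix_candidates(prefix):
    """Collect row ids for every index key that starts with the prefix."""
    found = set()
    for key in sorted(inverted):
        if key.startswith(prefix):
            found |= inverted[key]
    return found


def recheck(row_id):
    """Re-evaluate the whole predicate: not data, not databas, and data:*."""
    lexemes = rows[row_id]
    has_prefix = any(lexeme.startswith("data") for lexeme in lexemes)
    return "data" not in lexemes and "databas" not in lexemes and has_prefix


candidates = prefix_candidates("data")
matches = sorted(row_id for row_id in candidates if recheck(row_id))
removed_by_recheck = len(candidates) - len(matches)

print("candidates:", sorted(candidates))
print("matches:", matches)
print("rows removed by recheck:", removed_by_recheck)
```

In this toy data set, the prefix scan yields three candidate rows and the recheck keeps only one. This mirrors the plan above, where the scan finds 170 candidates and the recheck removes 169 of them.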

To view the tablet metrics, first run the query and then execute the prepared statement.

In [None]:
%%sql
select * from tbl_jeopardy 
where 1=1 
and answer_vec @@ to_tsquery ('!data & !database & data:*') ;

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

##### q5b | Metrics (above ^^)

In the initial query, the `Index Scan` accesses the index tablet:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu  idx_jeopardy_ybgin_answer_vec link_table_id tablet_id_unq_1 leader	 | 1| 	170 |

There is one result for the predicate expression. The scan performs one seek in the index tablet, followed by 170 reads from that offset, notably because of the prefix condition `data:*`.

Because the query returns all columns, the query also reads from the tablet leaders for `tbl_jeopardy`, a table with hash sharding.


| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_1	leader | 58 | 638 |
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_2	leader | 51 | 561 |
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_3	leader | 61 | 671 |

The Metrics report reveals that even though the query uses the GIN index, the query must still seek in all of the tablets for `tbl_jeopardy`.

The YB-TServer for the client connection (this notebook connection) removes the rows that do not meet the index condition. In other words, the tablet servers together return all 170 candidate rows (58 + 51 + 61 seeks), and the recheck then discards 169 of them.



---

## q6 | PG_TRGM extension

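A trigram is a group of three consecutive characters taken from a string. Each word is padded with spaces before the trigrams are extracted. For example, the set of trigrams in the string "cat" is:
 - `  c` (two leading spaces)
 - ` ca` (one leading space)
 - `cat`
 - `at ` (one trailing space)


The set of trigrams in the string “foo|bar” is:
- `  f`
- ` fo`
- `foo`
- `oo `
- `  b`
- ` ba`
- `bar`
- `ar `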
  
With trigrams, you can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

> Note:
> 
> `pg_trgm` ignores non-alphanumerical characters when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.
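
The padding rule in the note can be sketched in a few lines of Python. This is an approximation of the extraction rule described above, not the `pg_trgm` source code. `extract_trigrams` is a hypothetical helper name, and the regular expression is an assumption about what counts as an alphanumeric character.

```python
import re


def extract_trigrams(text):
    """Return the distinct trigrams of text, following the padding rule above."""
    trigrams = set()
    # Lowercase the input and treat each run of letters or digits as one word.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one space behind
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return sorted(trigrams)


print(extract_trigrams("cat"))
print(extract_trigrams("foo|bar"))
```

The `|` in "foo|bar" acts only as a word separator, so the string contributes the padded trigrams of "foo" and "bar" and none that span both words.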

The `pg_trgm` extension for PostgreSQL has several functions. With the `pg_trgm` extension, YugabyteDB supports the following functions:

| Function | Returns | Description |
|-|-|-|
| `similarity ( text, text )` |  `real` |  Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical). |
| `show_trgm ( text ) ` | `text[]` | Returns an array of all the trigrams in the given string. (In practice this is seldom useful except for debugging.) |
| `word_similarity ( text, text )` | ` real ` |  Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below. |
| `strict_word_similarity ( text, text ) `| ` real `  | Same as word_similarity, but forces extent boundaries to match word boundaries. Since we don't have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string. |
| `show_limit () `| ` real `  |  Returns the current similarity threshold used by the % operator. This sets the minimum similarity between two words for them to be considered similar enough to be misspellings of each other, for example. (Deprecated; instead use SHOW pg_trgm.similarity_threshold.) |


To install the extension, run the following:

In [None]:
%%sql 
create extension if not exists pg_trgm;

Now that you've installed the `pg_trgm` extension,  consider the following example:

In [None]:
%%sql
select show_trgm('help'), show_trgm('helpers'),  similarity('help', 'helpers'), word_similarity('help', 'helpers');

The most similar extent of an ordered set of trigrams in the second string is `['  h', ' he', 'hel', 'elp']`. It shares four trigrams with the five trigrams of the first string. As a result, the word similarity is 4/5 = 0.8.

The `word_similarity()` function returns a value that can be approximately understood as the greatest similarity between the first string and any substring of the second string. However, this function does not add padding to the boundaries of the extent. Thus, the number of additional characters present in the second string is not considered, except for the mismatched word boundaries.
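To see where these numbers come from, the following Python sketch approximates both functions. It assumes that similarity is the number of shared trigrams divided by the number of distinct trigrams in either set, and it computes word similarity by brute force over every continuous extent of the second string. The real `pg_trgm` implementation differs in its details, so results may not match for every input. All function names here are hypothetical Python stand-ins.

```python
import re


def ordered_trigrams(text):
    """List trigrams in reading order, padding each word as pg_trgm describes."""
    result = []
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one space behind
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def jaccard(first, second):
    """Shared trigrams divided by all distinct trigrams in either collection."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(first, second):
    """Approximate similarity(): compare the complete trigram sets."""
    return jaccard(ordered_trigrams(first), ordered_trigrams(second))


def word_similarity(first, second):
    """Approximate word_similarity(): best score over every extent of second."""
    target = ordered_trigrams(first)
    candidates = ordered_trigrams(second)
    best = 0.0
    for start in range(len(candidates)):
        for end in range(start + 1, len(candidates) + 1):
            best = max(best, jaccard(target, candidates[start:end]))
    return best


print(round(similarity("help", "helpers"), 3))
print(round(word_similarity("help", "helpers"), 3))
```

For 'help' and 'helpers', four trigrams are shared out of nine distinct trigrams overall, so the similarity is 4/9. The best extent shares four trigrams out of five, which gives the word similarity of 0.8.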

The `strict_word_similarity(text, text)` function selects an extent of words in the second string.

In [None]:
%%sql
select '' _
, similarity('she helps', 'she is helping') as similarity
, word_similarity('she helps', 'she is helping') as word_similarity
, strict_word_similarity('she helps', 'she is helping') as strict_word_similarity
;

The `strict_word_similarity()` function is useful for finding the similarity to whole words, while `word_similarity()` is more suitable for finding the similarity for parts of words.

The `pg_trgm` extension for PostgreSQL has several operators. YugabyteDB supports them in the select list of a `SELECT` statement, but not in the `WHERE` clause.

| Operator	| Returns	| Description |
|-|-|-|
|`text % text`|	boolean|	Returns true if its arguments have a similarity that is greater than the current similarity threshold set by `pg_trgm.similarity_threshold`.|
|`text <% text`| boolean |	Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by the `pg_trgm.word_similarity_threshold` parameter. |
|`text %> text`| boolean |	Commutator of the `<%` operator. |
|`text <<% text`|	boolean |	Returns true if its second argument has a continuous extent of an ordered trigram set that matches word boundaries, and its similarity to the trigram set of the first argument is greater than the current strict word similarity threshold set by the `pg_trgm.strict_word_similarity_threshold` parameter. |
|`text %>> text`|	boolean	|Commutator of the `<<%` operator. |
|`text <-> text`|	real |	Returns the “distance” between the arguments, that is one minus the `similarity()` value. |
|`text <<-> text`|	real |	Returns the “distance” between the arguments, that is one minus the `word_similarity()`value. |
|`text <->> text`|	real |	Commutator of the `<<->` operator. |
|`text <<<-> text`|	real |	Returns the “distance” between the arguments, that is one minus the `strict_word_similarity()` value. |
|` text <->>> text`| real | Commutator of the `<<<->` operator. |


In [None]:
%%sql
select '' _
, 'she helps' <-> 'she is helping' as distance
, 'she helps' <->> 'she is helping' as word_distance
, 'she helps' <->>> 'she is helping' as strict_word_distance
;

### q6a | gin_trgm_ops
In PostgreSQL, the `pg_trgm` module provides GiST and GIN index operator classes. YugabyteDB currently only supports the **GIN** index operator class.

> Note
>
> GiST is currently not a supported operator class in YugabyteDB.
> 

With the `pg_trgm` module in YugabyteDB, you can use the GIN index operator class to create an index over a text column. The GIN index operator class supports the related similarity functions as well as the trigram-based index queries that use:
-  `LIKE` (case-sensitive pattern match)
-  `ILIKE` (case-insensitive pattern match)
-  `~` (case-sensitive regular expression match)
-  `~*` (case-insensitive regular expression match)


In [None]:
%%sql
select answer, show_trgm(answer) from tbl_jeopardy limit 10;

In [None]:
%%sql
drop index if exists idx_jeopardy_ybgin_trgm_ops;
select pg_sleep(1);
create index idx_jeopardy_ybgin_trgm_ops on tbl_jeopardy USING gin (question gin_trgm_ops);

In [None]:
%%sql
select pg_get_indexdef('idx_jeopardy_ybgin_trgm_ops'::regclass);

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME"  # \d+
YB_PATH_BIN=${1}
DB_NAME=${2}

cd $YB_PATH_BIN

./ysqlsh -d ${DB_NAME} -c "\d+ tbl_jeopardy"

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;
select pg_sleep(1);

In [None]:
%%sql
-- explain (costs off, analyze, verbose) 
explain
select '' _
,  question as val
  , category
  , answer
  , question 
from tbl_jeopardy
where 1=1
-- and  question LIKE '%Data'
-- and  question ILIKE '%Data'
-- and question ~ 'data'
and question ~* 'data'
; 

##### q6a | Explain Plan (above ^^)

The Explain Plan shows that this query uses the index:
- `idx_jeopardy_ybgin_trgm_ops`

The Index Condition reflects the query predicate.

- `(question ~* 'data'::text)`

Notice the Index Recheck (not visible without analyze):
- `Rows Removed by Index Recheck: 169`

Unlike PostgreSQL, YugabyteDB does not use a Bitmap Index Scan and Heap Tables for GIN related queries. In PostgreSQL, `Rows Removed by Index Recheck` typically indicates that the bitmap exceeded working memory (`work_mem`) and became lossy. In YugabyteDB, `Index Recheck` simply indicates the removal of the 169 rows that do not meet the Index Condition.

To view the tablet metrics, run the following cell:

In [None]:
%%sql
select '' _
,  question as val
  , category
  , answer
  , question 
from tbl_jeopardy
where 1=1
-- and  question LIKE '%Data'
-- and  question ILIKE '%Data'
-- and question ~ 'data'
and question ~* 'data'

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

##### q6a | Metrics (above ^^)

In the initial query, the `Index Scan` accesses the index tablet:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu  idx_jeopardy_ybgin_trgm_ops link_table_id tablet_id_unq_1 leader	 | 1| 	292 |

The scan performs one seek in the index tablet, followed by 292 reads from that offset.

Because the query returns columns that are not in the index, the query also reads from the tablet leaders for `tbl_jeopardy`, a table with hash sharding.


| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_1	leader | 91 | 1168 |
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_2	leader | 102 | 1311 |
| db_ybu tbl_jeopardy link_table_id tablet_id_unq_3	leader | 102 | 1311 |



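The similarity operator `%` does not appear to be supported in the `WHERE` clause with a YBGIN index. The query below returns the following error:

`ybgin index method cannot use more than one required scan entry: got 3`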

In [None]:
%%sql
execute  stmt_util_metrics_snap_reset;

In [None]:
%%sql

-- explain (costs off, analyze, verbose) 
select '' _
,  question <-> 'data' as val
  , category
  , answer
  , question 
from tbl_jeopardy
where 1=1
and question % 'dat'
; 

In [None]:
%%sql
execute  stmt_util_metrics_snap_table;

---
#### unaccent
`unaccent` is a text search dictionary that removes accents (diacritic signs) from lexemes. It's a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search.

In [None]:
%%sql
create extension unaccent;


In [None]:
%%sql
create text search configuration fr (copy=french);

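`ALTER TEXT SEARCH CONFIGURATION` is not supported, so the following cells may return an error. For the PostgreSQL syntax, see
https://www.postgresql.org/docs/11/sql-altertsconfig.html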

In [None]:
%%sql 
alter text search configuration fr alter mapping for hword, hword_part, word with unaccent, french_stem;

In [None]:
%%bash -s "$NB_YB_PATH_BIN" "$NB_DB_NAME" "$NB_NOTEBOOK_DATA_FOLDER" "$NB_GIN_EXAMPLES"   # GIN
YB_PATH_BIN=${1}
DB_NAME=${2}
DATA_FOLDER=${3}
GIN_EXAMPLES=${4}


#ls $DATA_FOLDER

DDL_1="alter text search configuration fr alter mapping for hword, hword_part, word with unaccent,french_stem;"

cd $YB_PATH_BIN
echo $DDL_1
# GIN Examples file
./ysqlsh -d ${DB_NAME} -c "${DDL_1}"

sleep 1;

The text search configuration removes the `stop words` before the GIN index stores the lexemes:
- `de`
- `la`
- `en`

This is `stop word` pruning. To demonstrate, create a table with a column of the type `tsvector`. Then, create a GIN index. Insert a few rows, and review the results.
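Before running the cells, it may help to see the combined effect of accent removal and stop word pruning in plain Python. The following sketch is an approximation: it covers only the three stop words listed above and does not apply the `french_stem` stemming step, so the stored lexemes in YugabyteDB will look different. `unaccent_text` and `index_terms` are hypothetical helper names.

```python
import re
import unicodedata

# Only the three stop words named in the text; a real dictionary has many more.
STOP_WORDS = {"de", "la", "en"}


def unaccent_text(text):
    """Strip diacritic marks by decomposing characters and dropping the marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Lowercase, remove accents, split into words, and drop the stop words."""
    words = re.findall(r"[^\W_]+", unaccent_text(text).lower())
    return [word for word in words if word not in STOP_WORDS]


print(index_terms("Café de la Presse"))
print(index_terms("Hôtel en bord de mer!"))
```

With accents removed before the terms are stored, a search for `cafe` or `hotel` can match the accented originals, which is what the queries later in this section rely on.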

In [None]:
%%sql
create table tbl_fr (v tsvector);

create index on tbl_fr using gin (v);
insert into tbl_fr values (to_tsvector ('fr', 'Café de la Presse')) ;
insert into tbl_fr values (to_tsvector ('fr', 'Hôtel en bord de mer!')) ;

View the results.

In [None]:
%%sql
select * from tbl_fr;

In [None]:
%%sql
--execute  stmt_util_metrics_snap_reset;

-- explain (costs off, analyze, verbose) 
select * from tbl_fr 
where 1=1 
and v @@ to_tsquery('fr','cafe')
;

In [None]:
%%sql

explain select * from tbl_fr 
where 1=1 
and v @@ to_tsquery('fr','hotel')
;

### q7 | Cleanup
Drop index

In [None]:
%%sql
drop index if exists idx_cities_city_name_hash;
drop index if exists idx_cities_city_name_range;

---
# 🌟🌟🌟🌟🌟  All  done! 
In this notebook, you completed the following:

- Created GIN indexes
- Viewed Explain Plans and metrics reports for various queries to see how a GIN index benefits container queries


## 😊 Next up!
Continue your learning by opening the next notebook, `06_YSQL_Development_Advanced.ipynb`.

Or, to open the notebook from GitPod, run the following:

In [None]:
%%bash
gp open '06_YSQL_Development_Advanced.ipynb'