# Demystifying table sharding, tablets, and data distribution
As a distributed SQL database, YugabyteDB stores data differently than a stand-alone, monolithic database. Because data is stored differently, YugabyteDB reads data differently as well.

YugabyteDB stores data in tables.

A table consists of tablets. A tablet represents a table shard which contains a set of rows for the logical table. Under the hood, each tablet is a customized RocksDB instance. A tablet leader has a peer group known as tablet followers, and this group of tablet peers exists as a Raft consensus group. 

In this lab, using Explain Plans, built-in functions, and custom utilities, you will learn how YugabyteDB stores a table's data as table shards, known as tablets. You will also learn how YugabyteDB reads tablet data.


## Requirements
Before running the cells in this notebook, you must first edit and execute all the cells in the first notebook:
- `01_Lab_Setup.ipynb`

## Connect to `db_ybu` using the PostgreSQL Driver for Python
Run all the cells in this section:
- Connect using Python and PostgreSQL driver
- Load the SQL magic extension
- Create prepared statements for using utility metrics


In [1]:
# Connect using Python 3.7.9+
import psycopg2
import sqlalchemy as alc
from sqlalchemy import create_engine

# Inspiration from https://medium.com/analytics-vidhya/postgresql-integration-with-jupyter-notebook-deb97579a38d
# Use %store -r to read 01_Lab_Requirements_Setup variables

%store -r MY_DB_NAME
%store -r MY_YB_PATH
%store -r MY_HOST_IPv4_01
%store -r MY_HOST_IPv4_02
%store -r MY_HOST_IPv4_03

db_host=MY_HOST_IPv4_01
db_name=MY_DB_NAME


connection_str='postgresql+psycopg2://yugabyte@'+db_host+':5433/'+db_name

engine = create_engine(connection_str)

In [2]:
%reload_ext sql

# Example format
%sql {connection_str}

### Create prepared utility statements

In [20]:
%%sql 
select fn_yb_create_stmts() ;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
1 rows affected.


fn_yb_create_stmts
2022-07-14 21:02:48.055882-07:00


## q1 | Create a table with no Primary Key (PK)
For this first example, you will create a table without explicitly defining a primary key. You will then review an Explain plan for a query of the table as well as query metrics.

In [22]:
%%sql /* create table no PK and insert rows */
drop table if exists tbl_no_pk;

create table if not exists tbl_no_pk (k int, v text);

insert into tbl_no_pk (k,v)
select g.id, format('%s%s',chr(97+CAST(random() * 25 AS INTEGER)),chr(97+CAST(random() * 25 AS INTEGER)))
from generate_series(1, 1000) AS g (id);

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
Done.
Done.
1000 rows affected.


[]

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" #\d+
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

./bin/ysqlsh -d ${DB_NAME} -c "\d+ tbl_no_pk"

### q1a | Query the table for a value of k

The following cell shows the Explain Plan for a query. The first statement resets the metrics that are captured after the query runs. The query itself has a `where` clause predicate on `k`.

In [25]:
%%sql /* explain plan */
execute stmt_util_metrics_snap_reset;
explain (analyze, costs off, verbose, timing on) 
select '' _
    , k
    , v
from tbl_no_pk 
where 1=1 
and  k=1000
-- limit 30
;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
0 rows affected.
7 rows affected.


QUERY PLAN
Seq Scan on public.tbl_no_pk (actual time=6.063..8.916 rows=1 loops=1)
"Output: ''::text, k, v"
Filter: (tbl_no_pk.k = 1000)
Rows Removed by Filter: 999
Planning Time: 0.035 ms
Execution Time: 8.946 ms
Peak Memory Usage: 0 kB


#### q1a | Explain plan (above ^^)
Because YugabyteDB reuses the PostgreSQL query layer, it is possible to view an Explain Plan for a query. A query optimizer generates the Explain Plan for query execution.
The Explain plan for this query shows:
- `Seq Scan on public.tbl_no_pk (actual time=6.063..8.916 rows=1 loops=1)`

A `Seq Scan` is a full table scan, meaning that potentially all the rows in the table will be read.

- `Rows Removed by Filter: 999`

999 rows were removed, leaving the 1 remaining row.

The Explain Plan implies how the query accesses tablet data, but does **NOT** show the metrics for tablet access. Using a custom utility, you can view the metrics for the query execution in the notebook cell below.

In [26]:
%%sql /* metrics */
execute stmt_util_metrics_snap_table;

 * postgresql+psycopg2://yugabyte@127.0.0.1:5433/db_ybu
3 rows affected.


[dbname | relname | tableid | tabletid | isLeader],rocksdb_number_db_seek,rocksdb_number_db_next,rows_inserted
db_ybu | tbl_no_pk | 000040000000300080000000000045cd | 5ede8fcd0afa40488b8de7c30aea9603 | L,1,938,
db_ybu | tbl_no_pk | 000040000000300080000000000045cd | ddec4f9f54d940e2937f2d20fb63a052 | L,1,1064,
db_ybu | tbl_no_pk | 000040000000300080000000000045cd | e84c0745259a444496e1309dddf5c8c6 | L,1,995,


#### q1a | Metrics (above ^^)

The Explain Plan revealed that the query read 1000 rows and filtered out 999 of them to return a 1-row result.

The Metrics report shows that the query accessed 3 tablets for the table (`tbl_no_pk`). Each tablet has 1 seek of an offset, and then per offset, 900+ reads.

Here is how to read the output of the Metrics report:
- row_name `db_ybu | tbl_no_pk | 000040000000300080000000000045cd | 5ede8fcd0afa40488b8de7c30aea9603 | L`
  - database --> `db_ybu`
  - table or index or materialized view --> `tbl_no_pk`
  - table_id --> `00004...`
  - tablet_id --> `5ede8fcd0afa40488b8de7c30aea9603`
  - isLeader --> `L`
- rocksdb_number_db_seek
  - The number of Seek() RocksDB API calls, which means the number of seeks for offsets 
- rocksdb_number_db_next
  - The number of Next() RocksDB API calls, which means the number of reads from the offset
- rows_inserted
  - number of rows inserted 
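
To make the report layout concrete, here is a minimal Python sketch that splits each report line into the fields described above and totals the RocksDB counters. The `MetricsRow` class and `parse_metrics_row` function are hypothetical helpers written for this illustration, not part of the lab utilities, and the sketch assumes the line format is exactly as printed in the q1a metrics output.

```python
from dataclasses import dataclass


@dataclass
class MetricsRow:
    dbname: str
    relname: str
    table_id: str
    tablet_id: str
    is_leader: bool
    seeks: int
    nexts: int


def parse_metrics_row(line: str) -> MetricsRow:
    """Parse one metrics line (hypothetical helper; format assumed from the output above)."""
    # The last three comma-separated values are the counters; rows_inserted is empty here.
    row_name, seeks, nexts, _rows_inserted = line.rsplit(",", 3)
    # The row_name itself is pipe-separated.
    dbname, relname, table_id, tablet_id, role = [f.strip() for f in row_name.split("|")]
    return MetricsRow(dbname, relname, table_id, tablet_id, role == "L", int(seeks), int(nexts))


# The three lines copied from the q1a metrics output
report = [
    "db_ybu | tbl_no_pk | 000040000000300080000000000045cd | 5ede8fcd0afa40488b8de7c30aea9603 | L,1,938,",
    "db_ybu | tbl_no_pk | 000040000000300080000000000045cd | ddec4f9f54d940e2937f2d20fb63a052 | L,1,1064,",
    "db_ybu | tbl_no_pk | 000040000000300080000000000045cd | e84c0745259a444496e1309dddf5c8c6 | L,1,995,",
]

rows = [parse_metrics_row(line) for line in report]
total_seeks = sum(r.seeks for r in rows)
total_nexts = sum(r.nexts for r in rows)
print(f"tablets={len(rows)} seeks={total_seeks} nexts={total_nexts}")
```

Summing the counters this way gives a quick per-query view of how much work the tablets did in total, which is the number to compare between the queries in this lab.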



### Tables and Tablets
To better understand the Metrics report, you can use the YB-Master web ui to view the tablets that processed the query in the output above.

Each leader has a unique identifier, a tablet_id, which is the last value in the `row_name`, something like `7ef9496faf1045a0bf5002d50eac446`.

From the row_name column, you can copy the URL (`http://127.0.0.x:7000/table?id=00000x`) into a web browser and view the table details, including the tablet leader details.

To view the table and the tablets, follow these steps:
- From the metrics output above, in the `row_name`, copy the `http://127.0.0.x:7000/table?id=00000x` value
- Paste the URL into a browser window, such as Chrome


### Column section
> Important!
>
> When you create a table without a Primary Key (PK), YugabyteDB creates a hidden primary key for you. You will see this hidden PK in the YB-Master web ui for the table.

- `ybrowid` is a random Universally Unique Identifier (UUID)
- `ybrowid` is a hidden PK, and is not accessible
- `ybrowid` does not appear in `\d` or `\d+` output

| Column | ID	| Type |
|--------|------|------|
| ybrowid    | 0	| binary NOT NULL PARTITION KEY | 
| k	         | 1	| int32 NULLABLE NOT A PARTITION KEY | 
| v	         | 2	| string NULLABLE NOT A PARTITION KEY | 

### Tablet section

| Tablet ID |	Partition	| SplitDepth	| State	| Hidden	| Message	| RaftConfig|
|--|--|--|--|--|--|--|
| some_uuid_1 |	`hash_split: [0x5555, 0xAAAA)` |	0	| Running|	false| Tablet reported with an active leader	|<li>FOLLOWER: 127.0.0.1 <li>FOLLOWER: 127.0.0.3<li>LEADER: 127.0.0.2  |
| some_uuid_2	| `hash_split: [0xAAAA, 0xFFFF)`	| 0 |  Running |false |	Tablet reported with an active leader |<li>FOLLOWER: 127.0.0.1 <li>LEADER: 127.0.0.3 <li>FOLLOWER: 127.0.0.2 |
| some_uuid_3 <br>(tablet leader where the row lives) |	`hash_split: [0x0000, 0x5555)` |	0 |	Running | 	false	| Tablet reported with an active leader |	<li>LEADER: 127.0.0.1<li>FOLLOWER: 127.0.0.3<li>FOLLOWER: 127.0.0.2 |

Only 3 tablets are listed because of the combination of settings you used to create the local cluster with `yb-ctl`:
- `rf 3` --> a replication factor of 3, which resulted in a cluster with 3 nodes
- `yb_num_shards_per_tserver=1,ysql_num_shards_per_tserver=1` --> each tserver creates only 1 tablet leader for a table

In other words, `3 nodes` x `1 number_shard_per_tserver` = `3 tablet leaders` and `6 tablet followers` (with a replication factor of 3, each tablet leader requires 2 tablet followers).
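
The arithmetic above can be written out as a short Python sketch. The `tablet_counts` function and its parameter names are made up for this illustration, and the sketch assumes every tablet leader has exactly `replication_factor - 1` followers, as described for this local cluster.

```python
def tablet_counts(num_tservers: int, shards_per_tserver: int, replication_factor: int):
    """Return (leaders, followers) for one table (hypothetical helper for this lab)."""
    # One tablet leader per shard on each tserver
    leaders = num_tservers * shards_per_tserver
    # Each leader is replicated to (replication_factor - 1) followers
    followers = leaders * (replication_factor - 1)
    return leaders, followers


leaders, followers = tablet_counts(num_tservers=3, shards_per_tserver=1, replication_factor=3)
print(f"{leaders} tablet leaders, {followers} tablet followers")
```

Changing `shards_per_tserver` to 2 in this sketch would give 6 leaders and 12 followers, which shows why the lab uses a value of 1 to keep the tablet list short.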


Each tablet has a Partition value, a `hash_split` that describes the hash values:
```
hash_split: [0x0000, 0x5555)
hash_split: [0x5555, 0xAAAA)
hash_split: [0xAAAA, 0xFFFF)
```

Converting the hex values to integers gives the following ranges:
```
0-21845
21845-43690
43690-65535
```
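
If you want to reproduce the conversion yourself, the following Python sketch parses a `hash_split` string as it appears in the Partition column and returns the two bounds as integers. The `parse_hash_split` function is a hypothetical helper, and the regular expression assumes the exact `hash_split: [0x...., 0x....)` layout shown above.

```python
import re

# Matches the Partition text shown in the Tablet section, e.g. "hash_split: [0x0000, 0x5555)"
HASH_SPLIT_PATTERN = re.compile(r"hash_split: \[0x([0-9A-Fa-f]+), 0x([0-9A-Fa-f]+)\)")


def parse_hash_split(partition: str):
    """Return (start, end) of a hash_split range as integers (hypothetical helper)."""
    match = HASH_SPLIT_PATTERN.fullmatch(partition.strip())
    if match is None:
        raise ValueError(f"unexpected partition format: {partition!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


partitions = [
    "hash_split: [0x0000, 0x5555)",
    "hash_split: [0x5555, 0xAAAA)",
    "hash_split: [0xAAAA, 0xFFFF)",
]
int_ranges = [parse_hash_split(p) for p in partitions]
for start, end in int_ranges:
    print(f"{start}-{end}")
```

The square bracket and parenthesis in the notation suggest that the lower bound is included in a range and the upper bound is not, which is why the same number can end one range and start the next.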



### yb_hash_code()
When YugabyteDB processes an insert query for a row of data into a table with a hash sharding strategy, YugabyteDB uses the shard key (shown as `PARTITION KEY`) to distribute the data among the tablet leaders. An internal hashing algorithm maps the shard key value to a hash code in the range `0x0000` to `0xFFFF`, and the tablet whose `hash_split` range contains that hash code is the tablet leader destination for the inserted row.

YSQL has a built-in function that shows the integer form of a hashed value, `yb_hash_code(val)`.

However, because YugabyteDB does not expose `ybrowid`, it is not possible to use yb_hash_code() to infer the location of a row, in terms of `k` or `v`, within a given tablet.
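
The `hash_split` ranges listed earlier suggest that locating a row comes down to finding the range that contains its hash code. Here is a minimal Python sketch of that lookup, assuming the three ranges from this local cluster. The `find_hash_split` function is hypothetical, and treating `0xFFFF` itself as part of the last range is an assumption made for the illustration.

```python
from bisect import bisect_right

# Lower bounds and labels of the three hash_split ranges from this local cluster
RANGE_STARTS = [0x0000, 0x5555, 0xAAAA]
RANGE_LABELS = ["[0x0000, 0x5555)", "[0x5555, 0xAAAA)", "[0xAAAA, 0xFFFF)"]


def find_hash_split(hash_code: int) -> str:
    """Return the label of the range that contains hash_code (hypothetical helper)."""
    if not 0 <= hash_code <= 0xFFFF:
        raise ValueError("hash code must fit in 16 bits")
    # bisect_right counts how many lower bounds are <= hash_code;
    # the range that owns the code is the last of those.
    index = bisect_right(RANGE_STARTS, hash_code) - 1
    return RANGE_LABELS[index]


for code in (0, 21844, 21845, 50000):
    print(code, "->", find_hash_split(code))
```

A hash code of 21845 (`0x5555`) lands in the second range rather than the first, because each range includes its lower bound and excludes its upper bound.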

## q2 | Alter the table and add a Primary Key
To illuminate how YugabyteDB distributes data, you can create a primary key on the `k` column of the table. Later, you will use the `yb_hash_code()` built-in function and the primary key to show where various rows live on a given tablet.

Because there is not an existing primary key (PK) for `tbl_no_pk`, it is possible to add one.

> IMPORTANT
>
> Once you create a primary key on a table, you cannot alter it.

The primary key definition determines the sharding strategy:
- `PRIMARY KEY (k)` for hash  [DEFAULT]
- `PRIMARY KEY (k hash)` for hash
- `PRIMARY KEY (k asc)` for range

> NOTE
> 
> In this scenario, when you add a primary key to a table that does not have one, YugabyteDB needs to recreate the entire table behind the scenes and distribute the data according to the PK sharding algorithm, range or hash. The previous table keeps its table id and shows the table name as table_name_temp_old. The newly created table has a new table id.

In [None]:
%%sql /* alter table rename AND add PK */

drop table if exists tbl_pk_hash;
-- rename table
alter table if exists tbl_no_pk rename to tbl_pk_hash;
-- add PK as hash
alter table tbl_pk_hash add primary key (k);

In [None]:
%%bash -s "$MY_YB_PATH" "$MY_DB_NAME" #\d+
YB_PATH=${1}
DB_NAME=${2}

cd $YB_PATH

./bin/ysqlsh -d ${DB_NAME} -c "\d+ tbl_pk_hash"

### q2a | yb_hash_code()
YSQL has a built-in function that shows the integer form of a hashed value, `yb_hash_code(val)`. Using some helper functions, you can view where the hashed value lives for the PK value.

In the web ui, you can then confirm that the PK has a hash code that lives in the hash_split range of the tablet leader in question.  

The following query shows:
- k, the primary key value
- the yb_hash_code() integer for k
- the hash_split range where the yb_hash_code() integer value lives

In [None]:
%%sql /* query with hash buckets */
select '' as _
  , k
 -- , v
  , yb_hash_code(k::int) as k_hash_code
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0x0000, 0x5555)'::text) as col_0x0000_0x5555
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0x5555, 0xAAAA)'::text) as col_0x5555_0xAAAA
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0xAAAA, 0xFFFF)'::text) as col_0xAAAA_0xFFFF
from tbl_pk_hash
where 1=1 
-- and k=123
order by k asc
-- OFFSET 330
limit 30
;


### q2 | Observations
The results show that the 30 returned rows are distributed among all of the hash_split ranges, meaning that each tablet leader is returning results.

#### q2b | Experiment
Question: 
- What happens when you change the SQL code above and comment out the `ORDER BY` clause and re-run the query?

Answer:
- The results are in the order of `k_hash_code`, not `k`
- YugabyteDB only returns `k_hash_code` ordered results, and these 30 rows are from the first hash_split range, `0-21845`
- The `k_hash_code` ordering is how the internal LSM-Tree sorted the table inserts.
- For the given tablet, the LSM-tree persists to disk as Sorted Sequence Tables (SST files).
- The reads come from the SST file for the tablet. For this reason, you can rerun the query over and over again and still see the same "order" (`k_hash_code`) of "unordered" (`k`) results.
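
To see why the "unordered" results are repeatable, here is a small Python simulation. It assumes rows are stored sorted by hash code, as described in the answer above. The `stand_in_hash_code` function is a hypothetical 16-bit hash built on SHA-256 purely for illustration. It is NOT the algorithm behind `yb_hash_code()`, so the specific keys it returns will differ from what you see in the notebook.

```python
import hashlib


def stand_in_hash_code(k: int) -> int:
    """Hypothetical 16-bit hash of k. NOT the algorithm behind yb_hash_code()."""
    digest = hashlib.sha256(str(k).encode()).digest()
    return int.from_bytes(digest[:2], "big")


def scan_without_order_by(keys, limit):
    """Return the first `limit` (hash_code, k) pairs in stored order."""
    # Sorting by (hash_code, k) mimics a store that keeps rows in hash-code order
    stored = sorted((stand_in_hash_code(k), k) for k in keys)
    return stored[:limit]


first_30 = scan_without_order_by(range(1, 1001), limit=30)
for hash_code, k in first_30[:5]:
    print(f"k={k} stand_in_hash_code={hash_code}")
```

Running the scan again returns the same 30 keys in the same order, and their hash codes all sit at the low end of the hash space, which mirrors the behavior described for the first hash_split range.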


Question: 
- With the `ORDER BY` clause commented out, what happens when you uncomment one of the `OFFSET` lines and run the query again?

Answer:
- YugabyteDB returns `k_hash_code` ordered results starting at the offset. With `OFFSET 330`, the 30 rows should span two hash_split ranges, indicating that the hash sharding is more or less evenly split among the tablets.


### q2c | View the explain plan and metrics
This is exactly the same query as above. This time, however, you will generate an explain plan and then view the metrics.

Just like the section before, you will rerun the queries to answer the experiment questions.

In [None]:
%%sql /* explain plan */
execute stmt_util_metrics_snap_reset;
explain (costs off, analyze, verbose) 
select '' as _
  , k
 -- , v
  , yb_hash_code(k::int) as k_hash_code
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0x0000, 0x5555)'::text) as col_0x0000_0x5555
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0x5555, 0xAAAA)'::text) as col_0x5555_0xAAAA
  , fn_find_hash_code_in_partition_hex_range(yb_hash_code( k::int),'hash_split: [0xAAAA, 0xFFFF)'::text) as col_0xAAAA_0xFFFF
from tbl_pk_hash
where 1=1 
-- and k=123
order by k asc
-- order by k desc
-- OFFSET 330
limit 30
;

##### q2c | Explain plan (above ^^)
For the initial query with the `order by k asc` clause, the Explain plan shows that this is a `Seq Scan` query, and no index is being used:

`-> Seq Scan on public.tbl_pk_hash (actual time=2.626..7.200 rows=1000 loops=1)`

A `Seq Scan` results in potentially accessing all the rows in the table. This means all tablet leaders will most likely process the query.

In [None]:
%%sql
execute stmt_util_metrics_snap_table;

#### q2c | Metrics (above ^^)
In the initial query, the metrics show that the `Seq Scan` query accesses all tablet leaders:

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_1	 | 1 | 	689 |
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_2	| 1	| 641 |
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_3	| 1	| 667 |

#### q2c | Experiment
Question: 
- When you change the SQL code above and comment out the `ORDER BY` clause and re-run the query, what observations can you make about the Explain plan and the Metrics?

Answer:
- Even though the Explain plan shows that this is a `Seq Scan` query, the table scan is only for the tablet that contains the hash_split range where the 30 rows live:
  - `-> Seq Scan on public.tbl_pk_hash (actual time=1.072..5.176 rows=30 loops=1)`

| row_name | rocksdb_number_db_seek | rocksdb_number_db_next |
|--|--|--|
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_1	 | 1 | 	62 |

- YugabyteDB returns the unordered results from the first hash_split range, `0-21845`.
- The "unordered" results are actually returned as they were ordered in the LSM-Tree as Sorted Sequence Tables (SST files). For this reason, you can rerun the query over and over again and still see the same "order" of "unordered" results. 


Question: 
- With the `ORDER BY` clause commented out, what happens when you uncomment one of the `OFFSET` lines and run the query again? How does the Explain plan compare to the Metrics?


Answer:
- YugabyteDB returns results from the beginning of the offset, which, if the hash sharding is more or less evenly split, should span two hash_split ranges. Even though the Explain plan shows that this is a `Seq Scan` query, the table scan is only for the tablets that contain those hash_split ranges, which means full scans for two tablet leaders:
  - `-> Seq Scan on public.tbl_pk_hash (actual time=4.039..55.580 rows=360 loops=1)`

| row_name| 	rocksdb_number_db_seek | 	rocksdb_number_db_next | 
|--|--|--|
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_1	 | 1 | 	689 |
| db_ybu tbl_pk_hash link_table_id tablet_id_unq_2	| 1	| 641 |
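
The following Python sketch works through why an `OFFSET 330` with `limit 30` reaches a second tablet while a plain `limit 30` does not. It assumes the scan walks the tablets in hash_split order and that the 1000 rows are split across the three tablets as 333, 334, and 333. Those per-tablet counts are an assumption for illustration, not values read from the cluster, and `tablets_touched` is a hypothetical helper.

```python
from itertools import accumulate


def tablets_touched(rows_per_tablet, offset, limit):
    """Return indexes of the tablets whose rows overlap [offset, offset + limit)."""
    ends = list(accumulate(rows_per_tablet))  # running row count after each tablet
    starts = [0] + ends[:-1]                  # running row count before each tablet
    window_start, window_end = offset, offset + limit
    return [
        index
        for index, (start, end) in enumerate(zip(starts, ends))
        if start < window_end and end > window_start
    ]


rows_per_tablet = [333, 334, 333]  # assumed even split of the 1000 rows

no_offset = tablets_touched(rows_per_tablet, offset=0, limit=30)
with_offset = tablets_touched(rows_per_tablet, offset=330, limit=30)
print("limit 30:", no_offset)
print("offset 330 limit 30:", with_offset)
```

Under these assumptions, the window for rows 331 to 360 starts near the end of the first tablet and finishes in the second, which is consistent with the two tablets that appear in the Metrics table above.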


### To Do Range Sharding


---
## That's it for table sharding, tablets, and data distribution
- Number of shards per tablet server
- Web ui of tablet server for a table's tablets
- NO PK, Range Sharding PK, and Hash Sharding PK
- yb_hash_code()
