<a href="https://colab.research.google.com/github/ainfanzon/Cockroach_IAM_Workshop/blob/main/GCP_Colab_notebooks/Exercise_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://drive.google.com/uc?id=1XYr9Tyrz31a5kZdo601xD1QWz_YM8-H3">

### CockroachDB is a distributed SQL database that is __*highly scalable*__, __*resilient*__, and __*easy to use*__.

# Identity and Access Management Workshop.
---
## CockroachDB Overview.

This section explains ranges, range sets, replicas, and leaders (a.k.a. leaseholders), which are the building blocks for
[Replication and Rebalancing](https://www.cockroachlabs.com/docs/stable/demo-replication-and-rebalancing).

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
  border-collapse: collapse;
}
</style>
</head>
<body>

<table style="width:100%">
  <tr>
      <td align="right">
          <img src="https://drive.google.com/uc?id=1roJY0K02x6gDV96uXT24ivR_f2gnf2Pb" width="350" height="345">
      </td>
      <td style="width:5%" align="center">
          &emsp;
      </td>
      <td align="left">
          <img src="https://drive.google.com/uc?id=19C2KDL00TdFwcZQX2epcqzMUi4evEMio" width="525" height="275">
      </td>
  </tr>
</table>

</body>
</html>

You will:

1. Start a three-node cluster.
1. Verify the cluster deployment.
1. Load data and verify replication.

Key definitions:

| Term | Definition |
| --- | --- |
| Range | Data is stored in a sorted map of key-value pairs. This keyspace is divided into contiguous chunks called ranges, such that every key is found in exactly one range. |
| Replica | A copy of a range stored on a node. By default, there are three replicas (the replication factor) of each range, each on a different node. |
| Range Set | A collection of ranges. Each set has a leader and a number of replicas based on the replication factor. |
| Leader | A range in the range set is elected as the leader. The leader is responsible for managing replication to the other ranges in the set (followers). |
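
The Range definition above can be illustrated with a minimal sketch. The split points and the replica placement below are hypothetical values chosen for illustration; they are not how CockroachDB picks range boundaries. The sketch only shows that a sorted keyspace cut into contiguous chunks maps every key to exactly one range.

```python
import bisect

# Hypothetical split points: each entry is the first key of a range.
# Range i covers keys from RANGE_STARTS[i] up to, but not including, RANGE_STARTS[i + 1].
RANGE_STARTS = ["", "g", "n", "t"]

# Hypothetical placement: with a replication factor of 3 on a three-node
# cluster, every node holds one replica of every range.
REPLICAS = {i: [1, 2, 3] for i in range(len(RANGE_STARTS))}


def find_range(key: str) -> int:
    """Return the index of the single range whose span contains `key`."""
    return bisect.bisect_right(RANGE_STARTS, key) - 1


for key in ["apple", "grape", "zebra"]:
    idx = find_range(key)
    print(f"{key!r} -> range {idx}, replicas on nodes {REPLICAS[idx]}")
```

Because the ranges are contiguous and sorted, a binary search over the start keys is enough to locate the range for any key.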



---

## 1. Start a three-node cluster.

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
  border-collapse: collapse;
}
</style>
</head>
<body>

<table style="width:100%">
  <tr>
      <td align="right">
          <img src="https://drive.google.com/uc?id=1k_qixt3JcQTD13Zhi_jycORL046zrYR7" width="850" height="250">
      </td>
  </tr>
</table>

</body>
</html>

To start a cluster with three nodes, execute the steps below:

- On your laptop, open a terminal window and connect to the GCP Compute Engine instance using ssh.

<br>
<html>
<body>
<table style="width:100%" border="1">
  <tr>
      <th>AWS</th>
      <th>GCP</th>
  </tr>
  <tr>
      <td align="right">
          ssh -i "&lt;Key_Pair.pem&gt;" &lt;UserId&gt;@&lt;Public IP&gt;
      </td>
      <td align="right">
          ssh &lt;UserId&gt;@&lt;Public IP&gt;
      </td>
  </tr>
</table>
</body>
</html>
<br>

- To start a three-node cluster, run the __**cockroach start**__ command once per node and then initialize the cluster (see the example below). For the lab, there is a script (`strt_crdb.sh`) you can execute instead. The script is located in the `/home/cockroach/scripts/` directory.

> <code>
  ./strt_crdb.sh
</code>

> <p>cockroach start<br>
&emsp;&emsp;--insecure<br>
&emsp;&emsp;--listen-addr=&lt;ip address&gt;:&lt;sql listening port&gt;<br>
&emsp;&emsp;--join=&lt;ip address&gt;:&lt;sql listening port&gt;, ... ,&lt;ip address&gt;:&lt;sql listening port&gt;<br>
&emsp;&emsp;--http-addr=&lt;ip address&gt;:&lt;http listening port&gt;<br>
&emsp;&emsp;--locality=region=us-west,zone=us-west-1a<br>
&emsp;&emsp;--store=/home/cockroach/data/cr_data_1<br>
&emsp;&emsp;--background<br>
<br>
cockroach init --insecure --host &lt;ip address&gt;
</p>

You should see a `Cluster successfully initialized` message.
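
As a rough sketch of how the per-node commands differ, the snippet below assembles one `cockroach start` argument list per node on a single host. It uses only flags from the example above and the ports shown in the verification step (SQL 26257 to 26259, HTTP 8080 to 8082). The function name, the numbered store directories for nodes 2 and 3, and the omission of `--locality` are assumptions; this is not the content of `strt_crdb.sh`.

```python
from typing import List


def build_start_commands(ip: str, nodes: int = 3,
                         sql_port: int = 26257, http_port: int = 8080) -> List[List[str]]:
    """Build one `cockroach start` argument list per node on a single host."""
    # Every node receives the same --join list with all SQL listen addresses.
    join = ",".join(f"{ip}:{sql_port + i}" for i in range(nodes))
    commands = []
    for i in range(nodes):
        commands.append([
            "cockroach", "start",
            "--insecure",
            f"--listen-addr={ip}:{sql_port + i}",
            f"--join={join}",
            f"--http-addr={ip}:{http_port + i}",
            # Assumption: one store directory per node, numbered like cr_data_1.
            f"--store=/home/cockroach/data/cr_data_{i + 1}",
            "--background",
        ])
    return commands


for cmd in build_start_commands("10.0.1.2"):
    print(" ".join(cmd))
```

Only the listen port, the HTTP port, and the store directory change from node to node; the `--join` list is identical everywhere.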

---

## 2. Verify the cluster deployment

Verify there are three instances of the `cockroach` process running on different ports.

- List all active `cockroach` processes. The command below displays the `process id` and the full command used at launch time.

> `pgrep -a cockroach`

&emsp;NOTE: Each process runs on the same IP address but on a different port. The command below displays the listening address of each process.

> <code>
pgrep -a cockroach | awk '{ print $5}'<br><br>
--listen-addr=10.0.1.2:26257<br>
--listen-addr=10.0.1.2:26258<br>
--listen-addr=10.0.1.2:26259
</code>

- Similarly, there is a different port for the DB Console of each node.

> <code>
pgrep -a cockroach | awk '{ print $7}'<br><br>
--http-addr=10.0.1.2:8080<br>
--http-addr=10.0.1.2:8081<br>
--http-addr=10.0.1.2:8082
</code>
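
The `awk` field numbers (`$5`, `$7`) depend on the order in which the flags were passed at launch. A small sketch of the same extraction that looks flags up by name instead is shown below. The sample text imitates the shape of `pgrep -a cockroach` output; the PIDs and the helper names are made up for illustration.

```python
from typing import Dict, List, Tuple

# Sample lines shaped like `pgrep -a cockroach` output; the PIDs are made up.
JOIN = "10.0.1.2:26257,10.0.1.2:26258,10.0.1.2:26259"
SAMPLE = "\n".join(
    f"{pid} cockroach start --insecure --listen-addr=10.0.1.2:{sql} "
    f"--join={JOIN} --http-addr=10.0.1.2:{http}"
    for pid, sql, http in [(1201, 26257, 8080), (1215, 26258, 8081), (1229, 26259, 8082)]
)


def parse_flags(line: str) -> Dict[str, str]:
    """Map each --name=value argument on one line to {name: value}."""
    flags = {}
    for token in line.split():
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    return flags


def node_addresses(pgrep_output: str) -> List[Tuple[str, str]]:
    """Return (listen address, DB Console address) for every process line."""
    result = []
    for line in pgrep_output.splitlines():
        flags = parse_flags(line)
        if "listen-addr" in flags:
            result.append((flags["listen-addr"], flags.get("http-addr", "")))
    return result


for listen, http in node_addresses(SAMPLE):
    print(listen, http)
```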

- Open another browser tab to display the cockroach **DB Console**:

&emsp;&emsp;__**NOTE:**__ Replace &lt;IP Address&gt; with your VM's PUBLIC IP address (on GCP, the Compute Engine external IP)
<br>

> <code>
http://&lt;IP Address&gt;:8080/#/overview/list
</code>

- Verify there are NO under-replicated ranges.

Display additional information by connecting to a node using __`psycopg2`__ and the *__VM External IP__* address.

In the cell below, replace the __**host**__ value with your PUBLIC IP address.

In [None]:
import psycopg2
import pandas as pd

from IPython.display import IFrame, display, HTML, Markdown

pd.set_option('display.max_colwidth', None)

conn = psycopg2.connect(
        database = 'defaultdb'
      , user = 'root'
      , host = 'your IP address'      # Use the GCP Compute Engine external IP address
      , port = '26257'
      , sslmode = 'disable'
)
cursor = conn.cursor()
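
The query cells in this notebook build their column names from `cursor.description`. A minimal sketch of that pattern as a reusable helper is shown below. The helper name `rows_as_dicts` is hypothetical, and the demo uses the standard library's `sqlite3` as a stand-in so that it runs without a cluster; the same function should work with the `psycopg2` cursor created above, since both follow the Python DB-API.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Return all remaining rows of an executed query as dicts keyed by column name."""
    # DB-API cursors expose column metadata in `description`; item 0 is the column name.
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in demo: an in-memory SQLite database instead of the CockroachDB cluster.
demo = sqlite3.connect(":memory:").cursor()
demo.execute("SELECT 1 AS node_id, 'LIVE' AS status")
rows = rows_as_dicts(demo)
print(rows)
```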

- Execute the SQL below to display the cluster's:
    - Node id
    - Advertised Address
    - Version
    - Up Time
    - Number of Ranges
    - Number of Leaders
    - Server Status and,
    - Membership Status.

In [None]:
cursor.execute("""
SELECT gn.node_id AS "Node ID"
     , gn.advertise_sql_address AS "Advertised Address"
     , gn.build_tag AS "Version"
     , current_timestamp() AT TIME ZONE 'UTC' - gn.started_at AS "Up Time"
     , "ranges" AS "Ranges"
     , leases AS "Leaders"
     , CASE WHEN is_live THEN 'LIVE' ELSE 'DEAD' END AS "Server Status"
     , gl.membership AS "Membership Status"
FROM crdb_internal.gossip_nodes AS gn join crdb_internal.gossip_liveness AS gl USING(node_id)
""")
result_set = cursor.fetchall()
df_result_set = pd.DataFrame(result_set, columns=[desc[0] for desc in cursor.description])
df_result_set.set_index('Node ID', inplace=True)
df_result_set

| Node ID | Advertised Address | Version | Up Time | Ranges | Leaders | Server Status | Membership Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 10.14.0.235:26257 | v24.2.3 | 0 days 00:40:09.460621 | 54 | 17 | LIVE | active |
| 2 | 10.14.0.235:26259 | v24.2.3 | 0 days 00:40:08.896847 | 54 | 20 | LIVE | active |
| 3 | 10.14.0.235:26258 | v24.2.3 | 0 days 00:40:08.726597 | 54 | 17 | LIVE | active |
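
A few sanity checks on this result can be sketched in plain Python. The figures are copied from the output above. The last check rests on an assumption stated in the comment: each range has exactly one leaseholder at a time, so the per-node leader counts should add up to the number of ranges.

```python
# Per-node figures copied from the query result above.
nodes = {
    1: {"ranges": 54, "leaders": 17, "status": "LIVE"},
    2: {"ranges": 54, "leaders": 20, "status": "LIVE"},
    3: {"ranges": 54, "leaders": 17, "status": "LIVE"},
}

# Every node should report LIVE.
all_live = all(n["status"] == "LIVE" for n in nodes.values())

# With three replicas per range on three nodes, each node holds a replica of
# every range, so all nodes should report the same range count.
same_range_count = len({n["ranges"] for n in nodes.values()}) == 1

# Assumption: one leaseholder per range, so leaders should sum to the range count.
total_leaders = sum(n["leaders"] for n in nodes.values())

print(all_live, same_range_count, total_leaders)
```

On a busy cluster the counts can be briefly out of step while leases move, so treat a small mismatch as a prompt to re-run the query rather than as an error.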


### A few points to note.

- How many ranges does each node have?
- Are there any under-replicated ranges?
- How many ranges are unavailable?
- Are all the nodes active?
<br><br>
---

## 3. Create and populate the IAM database.

The next step is to create and populate the Identity and Access Management (IAM) database.

<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
  border-collapse: collapse;
}
</style>
</head>
<body>

<table style="width:100%">
  <tr>
      <td align="right">
          <img src="https://drive.google.com/uc?id=1hhcjsCJ7TO7nhUmR2JRoBhi7BZ-L5SIh" width="550" height="400">
      </td>
  </tr>
</table>

</body>
</html>
<br>

Follow the steps below to create a database and load the data:

- On your laptop, open a second terminal window and connect using ssh (see above).

- On the first terminal, change to the **/home/cockroach/dump** directory and start the **Python** HTTP server.

> ```
> cd /home/cockroach/dump
> python -m http.server 3000
> ```

&emsp;You should see that the HTTP server is running on port 3000:

> <code>
Serving HTTP on 0.0.0.0 port 3000 (http://0.0.0.0:3000/) ...
</code>

- On the second terminal execute the **iam.sql** script to create the schema and populate the database.

    - First update the script with your PRIVATE IP address<br>
<code>
sed -E -i s/HOST_IP/$(hostname -I | awk '{print $1}')/ /home/cockroach/sql/iam.sql
</code><br>

    - Then execute the SQL script<br>
```cockroach sql --host $(hostname -I | awk '{print $1}') -u root -d default -f /home/cockroach/sql/iam.sql --insecure```
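
To make the substitution step concrete, here is a small Python sketch of what the `sed` command does. Two details are mirrored: `awk '{print $1}'` keeps only the first address printed by `hostname -I`, and `sed s/HOST_IP/.../` without the `g` flag replaces only the first match on each line. The function names and the sample line are hypothetical; the real contents of `iam.sql` are not shown in this notebook.

```python
import re


def first_address(hostname_output: str) -> str:
    """Mimic `hostname -I | awk '{print $1}'`: keep the first whitespace-separated field."""
    return hostname_output.split()[0]


def replace_host_ip(text: str, ip: str) -> str:
    """Replace the first HOST_IP on each line, like sed's s/HOST_IP/<ip>/ without g."""
    return "\n".join(
        # A function replacement avoids any special handling of characters in `ip`.
        re.sub("HOST_IP", lambda _match: ip, line, count=1)
        for line in text.split("\n")
    )


# Hypothetical inputs for illustration only.
ip = first_address("10.0.1.2 172.17.0.1 ")
sample = "-- data served from http://HOST_IP:3000/"
print(replace_host_ip(sample, ip))
```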

- Execute the same query as before to compare the number of ranges and their distribution across the nodes in the cluster.

In [None]:
cursor.execute("""
SELECT gn.node_id AS "Node ID"
     , gn.advertise_sql_address AS "Advertised Address"
     , gn.build_tag AS "Version"
     , current_timestamp() AT TIME ZONE 'UTC' - gn.started_at AS "Up Time"
     , "ranges" AS "Ranges"
     , leases AS "Leaders"
     , CASE WHEN is_live THEN 'LIVE' ELSE 'DEAD' END AS "status"
     , gl.membership
FROM crdb_internal.gossip_nodes AS gn join crdb_internal.gossip_liveness AS gl USING(node_id)
""")
conn.commit()
result_set = cursor.fetchall()
df_result_set = pd.DataFrame(result_set, columns=[desc[0] for desc in cursor.description])
df_result_set.set_index('Node ID', inplace=True)
df_result_set

| Node ID | Advertised Address | Version | Up Time | Ranges | Leaders | status | membership |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 10.14.0.235:26257 | v24.2.3 | 0 days 01:17:55.164081 | 83 | 29 | LIVE | active |
| 2 | 10.14.0.235:26259 | v24.2.3 | 0 days 01:17:54.600307 | 83 | 29 | LIVE | active |
| 3 | 10.14.0.235:26258 | v24.2.3 | 0 days 01:17:54.430057 | 83 | 25 | LIVE | active |


### A few points to note.

- What is the difference in the number of ranges?
- What is the difference in the number of Leaders?
- Compare with the number of ranges in the DB Console.
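
The first two comparisons can be worked through with a short sketch. The numbers are copied from the two query results in this notebook (before and after loading the IAM data); your own cluster will likely show different values.

```python
# (ranges, leaders) per node, copied from the two query results in this notebook.
before = {1: (54, 17), 2: (54, 20), 3: (54, 17)}
after = {1: (83, 29), 2: (83, 29), 3: (83, 25)}

# Per-node change in range count and leader count.
deltas = {
    node_id: (after[node_id][0] - before[node_id][0],
              after[node_id][1] - before[node_id][1])
    for node_id in sorted(before)
}

for node_id, (d_ranges, d_leaders) in deltas.items():
    print(f"node {node_id}: ranges +{d_ranges}, leaders +{d_leaders}")
```

Every node gains the same number of ranges because each node holds a replica of every range, while the new leaders are spread unevenly across the nodes.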

---

# Appendix

Workshop CRDB user ID and password

> <p>uid = roachie<br>
pwd = roachfan
</p>

List CRDB process id and process name.

> <code>pgrep -l cockroach</code>

List the listening address of each `cockroach` process.

> <code>pgrep -a cockroach | awk '{ print $5}'</code>

Kill ALL CRDB processes

> <code>kill -9  $(pgrep cockroach)</code>

Remove all CRDB files

> <code>rm -fR /home/cockroach/data/*</code>

Replace the `HOST_IP` placeholder in a file with the host's IP address

> ```sed -E -i s/HOST_IP/$(hostname -I | awk '{print $1}')/ iam.sql```