## Cassandra Query Language (CQL)

### Review commands from previous lecture

In [1]:
!nodetool status

Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.24.0.4  149.94 KiB  16      100.0%            b5d7c159-2e4b-47e2-b12c-eceaaf2dd3a4  rack1
UN  172.24.0.2  165.59 KiB  16      100.0%            a3a9a70e-d4a3-467d-9b34-1bfcdfdef42b  rack1
UN  172.24.0.3  177.09 KiB  16      100.0%            39d76fc9-d6be-4a5b-8420-0da3ce3a0343  rack1



In [2]:
from cassandra.cluster import Cluster
import pandas as pd

In [3]:
cluster = Cluster(["demo-db-1", "demo-db-2", "demo-db-3"])
cass = cluster.connect()

When configuring `Cluster`, we don't need to list every node; that would be impractical with hundreds of nodes. A few contact points are enough, because the driver discovers the rest of the cluster from them.

#### If you didn't manually create the `banking` keyspace, execute the cell below

In [5]:
cass.execute("""
create keyspace banking with 
replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
""")

<cassandra.cluster.ResultSet at 0x76a35825f190>

Let's use the `banking` keyspace.

In [6]:
cass.execute("use banking")

<cassandra.cluster.ResultSet at 0x76a35829ee00>

### Cassandra table creation

Let's create a `loans` table. This first attempt has no primary key, so Cassandra rejects it.

In [7]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT,
    loan_id UUID,
    amount INT,
    state TEXT,
)
""")

InvalidRequest: Error from server: code=2200 [Invalid query] message="No PRIMARY KEY specifed for table 'loans' (exactly one required)"

#### What is UUID? 

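- UUID stands for "Universally Unique Identifier".
- It is a 128-bit value that is, for practical purposes, unique across all machines, so any node can generate one without coordinating with the others.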
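As a rough illustration of what a UUID looks like, Python's standard `uuid` module can generate one. This is a sketch in plain Python rather than in Cassandra, and the sample size of 100,000 is an arbitrary choice.

```python
import uuid

# Generate a random (version 4) UUID: 128 bits, 122 of them random.
loan_id = uuid.uuid4()
print(loan_id)          # text form: 36 characters, hex digits grouped 8-4-4-4-12
print(loan_id.version)  # 4

# Collisions are so unlikely that IDs can be generated without coordination;
# here 100,000 freshly generated IDs are all distinct.
ids = {uuid.uuid4() for _ in range(100_000)}
print(len(ids))
```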

#### Primary key specification

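Syntax: `PRIMARY KEY((partition_key), clustering_key1, clustering_key2, ...)`

The partition key decides which partition (and therefore which nodes) a row belongs to; the clustering keys order the rows inside that partition.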

In [8]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT,
    loan_id UUID,
    amount INT,
    state TEXT,
    PRIMARY KEY((bank_id), amount, loan_id)
)
""")

<cassandra.cluster.ResultSet at 0x76a3283c9180>

Let's take a peek at the table.

In [9]:
cass.execute("describe table loans").one()
# one() returns the first (here, the only) row of the result

Row(keyspace_name='banking', type='table', name='loans', create_statement="CREATE TABLE banking.loans (\n    bank_id int,\n    amount int,\n    loan_id uuid,\n    bank_name text,\n    state text,\n    PRIMARY KEY (bank_id, amount, loan_id)\n) WITH CLUSTERING ORDER BY (amount ASC, loan_id ASC)\n    AND additional_write_policy = '99p'\n    AND bloom_filter_fp_chance = 0.01\n    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}\n    AND cdc = false\n    AND comment = ''\n    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}\n    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}\n    AND memtable = 'default'\n    AND crc_check_chance = 1.0\n    AND default_time_to_live = 0\n    AND extensions = {}\n    AND gc_grace_seconds = 864000\n    AND max_index_interval = 2048\n    AND memtable_flush_period_in_ms = 0\n    AND min_index_interval = 1

In [10]:
print(cass.execute("describe table loans").one().create_statement)

CREATE TABLE banking.loans (
    bank_id int,
    amount int,
    loan_id uuid,
    bank_name text,
    state text,
    PRIMARY KEY (bank_id, amount, loan_id)
) WITH CLUSTERING ORDER BY (amount ASC, loan_id ASC)
    AND additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';


### Drop table and recreate

In [11]:
cass.execute("drop table if exists loans")

<cassandra.cluster.ResultSet at 0x76a3582d60b0>

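**Note:** This is the final `create` statement for the `loans` table. `bank_name` is now a static column, and rows are clustered by `amount` in descending order.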

In [12]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT static,
    loan_id UUID,
    amount INT,
    state TEXT,
    PRIMARY KEY ((bank_id), amount, loan_id)
) WITH CLUSTERING ORDER BY (amount DESC)
""")

<cassandra.cluster.ResultSet at 0x76a32839dd50>

In [13]:
print(cass.execute("describe table loans").one().create_statement)

CREATE TABLE banking.loans (
    bank_id int,
    amount int,
    loan_id uuid,
    bank_name text static,
    state text,
    PRIMARY KEY (bank_id, amount, loan_id)
) WITH CLUSTERING ORDER BY (amount DESC, loan_id ASC)
    AND additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';


### `INSERT` data

- `INSERT` is actually an upsert: it creates the row if it doesn't exist and overwrites the given columns if it does.

In [14]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name)
VALUES (544, 'test1')
""")

<cassandra.cluster.ResultSet at 0x76a32845a650>

In [15]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state
0,544,,,test1,


In [16]:
# INSERT is actually update or insert
cass.execute("""
INSERT INTO loans (bank_id, bank_name)
VALUES (544, 'test2')
""")

<cassandra.cluster.ResultSet at 0x76a32840bd30>

In [17]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state
0,544,,,test2,


##### **Observation**: 

We can insert data with just the partition key.
The clustering key is not required as long as the insert only sets static columns, that is, it provides no data for the per-row (repeating) columns.
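The upsert behaviour seen in the two inserts above can be modelled with a plain dictionary. This is only a toy sketch: the `table` dict and the `insert` helper are hypothetical names, and the model ignores static columns, write timestamps, and replication.

```python
# Toy model of a table: one dict entry per key.
# Assumption: this ignores static columns, timestamps, and replication.
table = {}

def insert(key, **columns):
    """Create the row if it is missing, otherwise overwrite the given columns."""
    table.setdefault(key, {}).update(columns)

insert((544,), bank_name="test1")
insert((544,), bank_name="test2")  # same key: overwrites, no error, no duplicate

print(len(table))                  # 1
print(table[(544,)]["bank_name"])  # test2
```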

Let's try to add just `loan_id`. This shouldn't work because `amount` and `loan_id` together form the clustering key.

In [18]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, loan_id)
VALUES (544, 'test2', UUID())
""")

InvalidRequest: Error from server: code=2200 [Invalid query] message="Some clustering keys are missing: amount"

### `UUID()` function

`UUID()` generates a new random UUID on the server.

In [19]:
cass.execute("""
INSERT INTO loans (bank_id, amount, loan_id)
VALUES (544, 300, UUID())
""")

<cassandra.cluster.ResultSet at 0x76a32860a0b0>

In [20]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state
0,544,300,aa0573f3-0cb4-4288-b505-02ce7b3d4c32,test2,


### `NOW()` versus `UUID()`

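- Both return UUIDs.
- `UUID()` returns a random (version 4) UUID, while `NOW()` returns a time-based (version 1) UUID built from a timestamp, a sequence number, and a node identifier (traditionally the MAC address).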
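The difference can be inspected with Python's standard `uuid` module. The sketch below parses two `loan_id` values that appear in this notebook's query output and assumes, consistent with those values, that `UUID()` produced the version 4 ID and `NOW()` the version 1 ID.

```python
import uuid
from datetime import datetime, timezone

# loan_id values copied from the SELECT output in this notebook.
from_uuid_fn = uuid.UUID("aa0573f3-0cb4-4288-b505-02ce7b3d4c32")  # inserted with UUID()
from_now_fn = uuid.UUID("a905f780-f35e-11ee-ace3-eb4804f1355e")   # inserted with NOW()

print(from_uuid_fn.version)  # 4: random
print(from_now_fn.version)   # 1: time-based

# A version 1 UUID stores a count of 100 ns intervals since 1582-10-15.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000
unix_seconds = (from_now_fn.time - GREGORIAN_TO_UNIX_100NS) / 1e7
created = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(created.year)           # 2024

# The last 48 bits identify the node that generated the ID.
print(hex(from_now_fn.node))
```

Because the timestamp is embedded, version 1 IDs generated later compare as later in time, which a random version 4 ID cannot offer.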

In [21]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, state)
VALUES (544, 'mybank', 400, NOW(), 'WI')
""")

<cassandra.cluster.ResultSet at 0x76a328253550>

In [22]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state
0,544,400,a905f780-f35e-11ee-ace3-eb4804f1355e,mybank,WI
1,544,300,aa0573f3-0cb4-4288-b505-02ce7b3d4c32,mybank,


##### **Observation:** Why did it modify "bank_name" column for the first loan that we previously inserted?

- Recall that `bank_name` is a static column, so it has exactly one value per partition. The new insert overwrote that value for the whole partition.
- The partition key `bank_id` and the static column `bank_name` each have a single value per partition. A `SELECT *` query simply repeats that value on every row, which makes the output easier to read.
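The static-column behaviour can be mimicked with a small in-memory model. This is a simplified sketch, not Cassandra's storage format; `partitions`, `insert`, `select_all`, and the `loan-a`/`loan-b` IDs are hypothetical names chosen for illustration.

```python
# Toy model: each partition holds one dict of static values plus its rows.
# Assumption: a simplification for illustration, not Cassandra's storage format.
partitions = {}

def insert(bank_id, amount, loan_id, bank_name=None, state=None):
    part = partitions.setdefault(bank_id, {"static": {}, "rows": {}})
    if bank_name is not None:
        part["static"]["bank_name"] = bank_name  # one value per partition
    part["rows"][(amount, loan_id)] = {"state": state}

def select_all():
    # Repeat the partition-level values on every row, as SELECT * does.
    return [
        (bank_id, amount, loan_id, part["static"].get("bank_name"), row["state"])
        for bank_id, part in partitions.items()
        for (amount, loan_id), row in part["rows"].items()
    ]

insert(544, 300, "loan-a", bank_name="test2")
insert(544, 400, "loan-b", bank_name="mybank", state="WI")
for row in select_all():
    print(row)  # both rows now show bank_name 'mybank'
```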

Inserting a new loan into a new partition.

In [23]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, state)
VALUES (999, 'UWCU', 500, NOW(), 'IL')
""")

<cassandra.cluster.ResultSet at 0x76a328611210>

In [24]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state
0,544,400,a905f780-f35e-11ee-ace3-eb4804f1355e,mybank,WI
1,544,300,aa0573f3-0cb4-4288-b505-02ce7b3d4c32,mybank,
2,999,500,da482d30-f35f-11ee-90ed-71336e409d36,UWCU,IL


**Observation:** Clustering keys only sort data within a single partition; there is no ordering across partitions.
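A small sketch of the same idea in plain Python: rows are first grouped by partition key and then sorted inside each group with `amount DESC, loan_id ASC`, matching the table's clustering order. The `loan_id` values are shortened placeholders, not real UUIDs.

```python
from collections import defaultdict

# (bank_id, amount, loan_id) tuples; loan_id values are shortened placeholders.
rows = [(544, 300, "aa05"), (999, 500, "da48"), (544, 400, "a905"), (999, 300, "3f33")]

# Group by partition key first ...
partitions = defaultdict(list)
for bank_id, amount, loan_id in rows:
    partitions[bank_id].append((amount, loan_id))

# ... then sort inside each partition: amount DESC, loan_id ASC.
for bank_id, part in partitions.items():
    part.sort(key=lambda r: (-r[0], r[1]))
    print(bank_id, part)
```

Amounts are descending within each bank, but the 500 in partition 999 still comes after the 400 and 300 of partition 544.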

### Custom types

Syntax: `CREATE TYPE <name> (field1 type1, field2 type2, ...)`

In [25]:
cass.execute("""
CREATE TYPE FullName (first text, last text)
""")

<cassandra.cluster.ResultSet at 0x76a32849ca90>

### `alter` existing table

Let's add a `username` column of type `FullName`.

In [26]:
cass.execute("""
alter table loans add (username FullName)
""")

<cassandra.cluster.ResultSet at 0x76a32860a050>

In [27]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, username)
VALUES (999, 'UWCU', 500, NOW(), {first:'Meenakshi', last:'Syamkumar'})
""")

<cassandra.cluster.ResultSet at 0x76a35829ee60>

In [28]:
pd.DataFrame(cass.execute("""
SELECT username, username.first, username.last 
FROM loans
"""))

Unnamed: 0,username,username_first,username_last
0,,,
1,,,
2,,,
3,"(Meenakshi, Syamkumar)",Meenakshi,Syamkumar


### Prepared statements

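Prepared statements exist in both SQL and CQL: the query text is parsed once, and the `?` placeholders are filled in with values on each execution.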
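For comparison, here is the SQL side of the same idea using Python's built-in `sqlite3` module. It is a sketch with a simplified, hypothetical `loans` schema; `sqlite3` has no explicit `prepare` call, but it binds `?` parameters in the same way and caches compiled statements internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (bank_id INTEGER, amount INTEGER, first TEXT, last TEXT)")

# One statement text with ? placeholders, reused with different values.
insert_sql = "INSERT INTO loans (bank_id, amount, first, last) VALUES (999, ?, ?, ?)"
conn.execute(insert_sql, (300, "Viyan", "Meero"))
conn.execute(insert_sql, (500, "Meenakshi", "Syamkumar"))

# Values are bound as data, never spliced into the query text, so even a
# hostile-looking string is stored verbatim.
conn.execute(insert_sql, (100, "x'); DROP TABLE loans; --", "n/a"))

count = conn.execute("SELECT COUNT(*) FROM loans").fetchone()[0]
print(count)  # 3
```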

In [29]:
uwcu_insert = cass.prepare("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, username)
VALUES (999, 'UWCU', ?, NOW(), {first:?, last:?})
""")

In [30]:
cass.execute(uwcu_insert, (300, "Viyan", "Meero"))

<cassandra.cluster.ResultSet at 0x76a3286939d0>

In [32]:
pd.DataFrame(cass.execute("select * from loans"))

Unnamed: 0,bank_id,amount,loan_id,bank_name,state,username
0,544,400,a905f780-f35e-11ee-ace3-eb4804f1355e,mybank,WI,
1,544,300,aa0573f3-0cb4-4288-b505-02ce7b3d4c32,mybank,,
2,999,500,da482d30-f35f-11ee-90ed-71336e409d36,UWCU,IL,
3,999,500,4a7510a0-f360-11ee-ace3-eb4804f1355e,UWCU,,"(Meenakshi, Syamkumar)"
4,999,300,3f33e580-f36b-11ee-ace3-eb4804f1355e,UWCU,,"(Viyan, Meero)"


#### Configuration options for prepared statements

In [34]:
# uwcu_insert.<VARIOUS_CONFIG>

### GROUP BYs

#### What is the average loan amount per bank?

In [35]:
pd.DataFrame(cass.execute("""
SELECT bank_id, bank_name, AVG(amount)
FROM loans
GROUP BY bank_id
"""))

Unnamed: 0,bank_id,bank_name,system_avg_amount
0,544,mybank,350
1,999,UWCU,433


#### What is the average loan amount per state?

In [36]:
pd.DataFrame(cass.execute("""
SELECT state, AVG(amount)
FROM loans
GROUP BY state
"""))

InvalidRequest: Error from server: code=2200 [Invalid query] message="Group by is currently only supported on the columns of the PRIMARY KEY, got state"

**Observation**: `GROUP BY` only works on the partition key, optionally followed by clustering columns of the primary key.<br>
**Observation**: Cassandra is designed for transaction processing, not analytics.
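For a table this small, one workaround is to pull the rows to the client and aggregate there. The sketch below hard-codes the `(bank_id, amount, state)` values from the `SELECT *` output above; in a live session they would come from `cass.execute(...)` instead. This approach does not scale to large tables, which is why the notes turn to Spark next.

```python
from collections import defaultdict

# (bank_id, amount, state) values copied from the SELECT * output in these notes.
rows = [(544, 400, "WI"), (544, 300, None), (999, 500, "IL"),
        (999, 500, None), (999, 300, None)]

totals = defaultdict(lambda: [0, 0])  # state -> [sum of amounts, row count]
for _bank_id, amount, state in rows:
    totals[state][0] += amount
    totals[state][1] += 1

avg_by_state = {state: s / n for state, (s, n) in totals.items()}
print(avg_by_state)
```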

### Spark solution

In [38]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("cs544")
         .config('spark.jars.packages', \
                 'com.datastax.spark:spark-cassandra-connector_2.12:3.4.0')
         .config("spark.sql.extensions", \
                 "com.datastax.spark.connector.CassandraSparkExtensions")
         .getOrCreate())

### Reading data into Spark

### Approach 1: individual DataFrame

```python
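# Fill in each "????" placeholder; the parentheses allow the chained calls
# to span several lines.
(spark.read.format("org.apache.spark.sql.cassandra")
 .option("spark.cassandra.connection.host", "????")
 .option("keyspace", "????")
 .option("table", "????")
 .load())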
```

### Approach 2: catalogs

- A catalog is a set of tables that Spark can see; the tables can be managed by Spark itself or by some other system.

In [39]:
spark.conf.set("spark.sql.catalog.mycat", \
               "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.mycat.spark.cassandra.connection.host", \
               "demo-db-1,demo-db-2,demo-db-3")

### Spark SQL

Syntax: `FROM <catalog>.<keyspace>.<table>`

In [40]:
spark.sql("""
SELECT *
FROM mycat.banking.loans
""")

DataFrame[bank_id: int, amount: int, loan_id: string, state: string, username: struct<first:string,last:string>, bank_name: string]

In [41]:
spark.sql("""
SELECT *
FROM mycat.banking.loans
""").toPandas()

Unnamed: 0,bank_id,amount,loan_id,state,username,bank_name
0,999,500,da482d30-f35f-11ee-90ed-71336e409d36,IL,,UWCU
1,999,500,4a7510a0-f360-11ee-ace3-eb4804f1355e,,"(Meenakshi, Syamkumar)",UWCU
2,999,300,3f33e580-f36b-11ee-ace3-eb4804f1355e,,"(Viyan, Meero)",UWCU
3,544,400,a905f780-f35e-11ee-ace3-eb4804f1355e,WI,,mybank
4,544,300,aa0573f3-0cb4-4288-b505-02ce7b3d4c32,,,mybank


#### What is the average loan amount per state?

In [42]:
spark.sql("""
SELECT state, AVG(amount)
FROM mycat.banking.loans
GROUP BY state
""").toPandas()

Unnamed: 0,state,avg(amount)
0,,366.666667
1,WI,400.0
2,IL,500.0


We could write this result out to HDFS, Hive, or any other destination that Spark supports.

In [None]:
# spark.sql("""
# SELECT *
# FROM mycat.banking.loans
# """).write.....

### Spark - Hash Partitioning Demo

Hash partitioning with `hash(key) % N` is not elastic: changing the number of partitions `N` reassigns most keys.

In [43]:
import string

In [45]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [44]:
df = pd.DataFrame({"letter": list(string.ascii_uppercase)})
df.head()

Unnamed: 0,letter
0,A
1,B
2,C
3,D
4,E


In [46]:
df["partition1"] = df["letter"].apply(lambda letter: hash(letter) % 4)
df.head()

Unnamed: 0,letter,partition1
0,A,3
1,B,3
2,C,2
3,D,3
4,E,2


In [48]:
df["partition2"] = df["letter"].apply(lambda letter: hash(letter) % 5)
df.head()

Unnamed: 0,letter,partition1,partition2
0,A,3,1
1,B,3,0
2,C,2,4
3,D,3,3
4,E,2,3


Let's compare partition1 and partition2 results.

In [50]:
df["partition1"] == df["partition2"]

0     False
1     False
2     False
3      True
4     False
5      True
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13     True
14    False
15     True
16    False
17    False
18     True
19    False
20    False
21    False
22    False
23    False
24     True
25    False
dtype: bool

In [49]:
(df["partition1"] == df["partition2"]).mean()

0.2692307692307692

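**Observation**: Only 7 of the 26 letters (about 27%) kept the same partition number when going from 4 to 5 partitions. The exact letters vary between runs because Python randomizes string hashes per process.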
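The same experiment can be repeated reproducibly on more keys. The sketch below swaps Python's per-process `hash()` for a SHA-256-based stand-in (the `stable_hash` helper and the `key-i` names are hypothetical) and also shows why roughly one key in five stays put.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic stand-in for hash(), which Python salts per process for str."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
stayed = sum(stable_hash(k) % 4 == stable_hash(k) % 5 for k in keys)
fraction = stayed / len(keys)
print(fraction)  # close to 0.2

# Why 0.2: h % 4 == h % 5 only when h % 20 is 0, 1, 2, or 3.
exact = sum(h % 4 == h % 5 for h in range(20)) / 20
print(exact)     # 0.2
```

So growing from 4 to 5 partitions moves about 80% of the keys, even though only 20% of the capacity is new.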