## Cassandra Query Language (CQL)

### Review commands from previous lecture

In [None]:
!nodetool status

In [None]:
from cassandra.cluster import Cluster
import pandas as pd

In [None]:
cluster = Cluster(["demo-db-1", "demo-db-2", "demo-db-3"])
cass = cluster.connect()

For `Cluster` configuration, we don't need to list every node name; with hundreds of nodes that would be impractical. A few contact points are enough, because the driver discovers the rest of the cluster from them.

#### If you didn't manually create the `banking` keyspace, execute the cell below

In [None]:
cass.execute("""
create keyspace banking with 
replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
""")

Let's use the `banking` keyspace.

In [None]:
cass.execute("use banking")

### Cassandra table creation

Let's create the `loans` table. This first attempt has no primary key, which Cassandra requires, so expect it to fail.

In [None]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT,
    loan_id UUID,
    amount INT,
    state TEXT,
)
""")

#### What is UUID? 

- It stands for "Universally Unique Identifier".
- Designed to be unique across all machines, without a central authority handing out IDs.

#### Primary key specification

Syntax: `PRIMARY KEY(partition_key, cluster_key)`

In [None]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT,
    loan_id UUID,
    amount INT,
    state TEXT,
    PRIMARY KEY ((bank_id), amount, loan_id)
)
""")

Let's take a peek at the table.

In [None]:
# one() returns the first row of the result set (or None if it is empty)

In [None]:
print(cass.execute("describe table loans").one())

### Drop table and recreate

In [None]:
cass.execute("drop table if exists loans")

**Note:** Final `create` statement for the `loans` table. `bank_name` is now `static`, so it holds one value per partition.

In [None]:
cass.execute("""
create table loans(
    bank_id INT,
    bank_name TEXT static,
    loan_id UUID,
    amount INT,
    state TEXT,
    PRIMARY KEY ((bank_id), amount, loan_id)
)
""")

In [None]:
print(cass.execute("describe table loans").one().create_statement)

### `INSERT` data

- `INSERT` is actually an upsert: it creates the row if it does not exist and overwrites the given columns if it does.

In [None]:
cass.execute("""

""")

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

In [None]:
# INSERT is actually update or insert
cass.execute("""
INSERT INTO loans (bank_id, bank_name)
VALUES (544, 'test1')
""")

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

##### **Observation**: 

We can insert data with only the partition key (plus static columns).
The cluster key is required only when the insert sets one of the non-static columns, which repeat for every row in the partition.

Let's try to add just `loan_id`. This shouldn't work because `amount` and `loan_id` together form the cluster key, so both must be supplied.

In [None]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, loan_id)
VALUES (544, 'test2', UUID())
""")

### `UUID()` function

Generates a random (version 4) UUID on the server side.

In [None]:
cass.execute("""
INSERT INTO loans (bank_id, amount, loan_id)
VALUES (544, 300, UUID())
""")

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

### `NOW()` versus `UUID()`

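- Both return UUIDs: `UUID()` returns a random (version 4) UUID, `NOW()` returns a time-based (version 1) `timeuuid`.
- `NOW()` is "more" unique in the sense that it is built from a timestamp, a sequence number, and a node identifier such as the MAC address, rather than from random bits alone.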
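
The two kinds of UUID can be compared outside Cassandra with Python's standard `uuid` module. This is a minimal sketch, assuming `uuid.uuid4()` as a stand-in for CQL `UUID()` and `uuid.uuid1()` as a stand-in for CQL `NOW()`; it does not talk to the cluster, and the variable names are only illustrative.

```python
import uuid

# Stand-in for CQL UUID(): random bits, version 4
random_id = uuid.uuid4()

# Stand-in for CQL NOW(): timestamp + clock sequence + node ID, version 1
time_id = uuid.uuid1()

print(random_id, "-> version", random_id.version)
print(time_id, "-> version", time_id.version)

# Only the version 1 UUID carries a timestamp
# (count of 100-nanosecond intervals since 1582-10-15)
print("embedded timestamp:", time_id.time)
```

Either kind can be stored in a `UUID` column such as `loan_id`.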

In [None]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, state)
VALUES (544, 'Chase', 400, NOW(), 'WI')
""")

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

##### **Observation:** Why did the insert change `bank_name` for the first loan that we previously inserted?

- Recall that `bank_name` is a static column. It can only have one value per partition, so writing it for one row changes it for the whole partition.
- The partition key `bank_id` and the static column `bank_name` each have a single value per partition, but a `SELECT *` query repeats that value on every row, which makes the output more readable.

Inserting a new loan into a new partition.

In [None]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, state)
VALUES (999, 'UWCU', 500, NOW(), 'IL')
""")

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

**Observation:** Cluster keys only sort data within a single partition.
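
The behavior seen so far (upserts, one static value per partition, sorting by cluster key inside a partition) can be mimicked with a small in-memory model. This is a toy sketch, not how Cassandra stores data: the class `ToyLoans` and the string loan IDs are hypothetical, and partitions come back in insertion order here, whereas Cassandra orders them by token.

```python
from collections import defaultdict


class ToyLoans:
    """Toy model of the loans table: PRIMARY KEY ((bank_id), amount, loan_id)."""

    def __init__(self):
        self.static = {}               # bank_id -> bank_name (one per partition)
        self.rows = defaultdict(dict)  # bank_id -> {(amount, loan_id): state}

    def insert(self, bank_id, bank_name=None, amount=None, loan_id=None, state=None):
        if bank_name is not None:
            # Static column: overwrites the value for the whole partition
            self.static[bank_id] = bank_name
        if amount is not None and loan_id is not None:
            # Upsert: same cluster key overwrites, new cluster key adds a row
            self.rows[bank_id][(amount, loan_id)] = state

    def select_all(self):
        result = []
        for bank_id, rows in self.rows.items():
            # Rows are sorted by cluster key, but only within their partition
            for (amount, loan_id), state in sorted(rows.items()):
                result.append((bank_id, amount, loan_id, self.static.get(bank_id), state))
        return result


table = ToyLoans()
table.insert(544, bank_name="test1")
table.insert(544, amount=300, loan_id="a")
table.insert(544, bank_name="Chase", amount=400, loan_id="b", state="WI")
table.insert(999, bank_name="UWCU", amount=500, loan_id="c", state="IL")
table.insert(544, amount=100, loan_id="d")

rows = table.select_all()
for row in rows:
    print(row)
```

Every row of partition `544` reports `Chase`, and the loan with amount `100` appears first in that partition even though it was inserted last.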

### Custom types

Syntax: `CREATE TYPE <name> (field1 type1, field2 type2, ...)`

In [None]:
cass.execute("""
create type FullName (first TEXT, last TEXT)
""")

### `alter` existing table

Let's add a `username` column of type `FullName`.

In [None]:
cass.execute("""
alter table loans add username FullName
""")

In [None]:
cass.execute("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, username)
VALUES (999, 'UWCU', 500, NOW(), {first:'Meenakshi', last:'Syamkumar'})
""")

In [None]:
pd.DataFrame(cass.execute("""
SELECT 
FROM loans
"""))

### Prepared statements

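Prepared statements exist in both SQL and CQL: the statement is parsed once with `?` placeholders, then executed repeatedly with different bound values.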
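
The SQL side of that claim can be tried without a cluster. This is a minimal sketch that swaps in SQLite (Python's standard `sqlite3` module) for Cassandra; the flattened `first`/`last` columns and the in-memory database are assumptions made only for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in; this is SQL, not CQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (bank_id INT, amount INT, first TEXT, last TEXT)")

# One statement with ? placeholders, reused with different parameter tuples
insert_sql = "INSERT INTO loans (bank_id, amount, first, last) VALUES (999, ?, ?, ?)"
for params in [(300, "Viyan", "Meero"), (500, "Meenakshi", "Syamkumar")]:
    conn.execute(insert_sql, params)

rows = conn.execute("SELECT amount, first, last FROM loans ORDER BY amount").fetchall()
print(rows)
```

Binding values through placeholders also avoids building query strings by hand, which is the usual source of injection bugs.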

In [None]:
uwcu_insert = cass.prepare("""
INSERT INTO loans (bank_id, bank_name, amount, loan_id, username)
VALUES (999, 'UWCU', ?, NOW(), {first: ?, last: ?})
""")

In [None]:
cass.execute(uwcu_insert, (300, "Viyan", "Meero"))

In [None]:
pd.DataFrame(cass.execute("select * from loans"))

#### Configuration options for prepared statements

In [None]:
# uwcu_insert.<VARIOUS_CONFIG>

### GROUP BYs

#### What is the average loan amount per bank?

In [None]:
pd.DataFrame(cass.execute("""
select bank_id, avg(amount) as avg_amount from loans group by bank_id
"""))

#### What is the average loan amount per state?

In [None]:
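# Expected to fail: `state` is not part of the primary key
pd.DataFrame(cass.execute("""
select state, avg(amount) as avg_amount from loans group by state
"""))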

**Observation**: CQL can only `GROUP BY` the partition key, or the partition key followed by cluster key columns in primary-key order.<br>
**Observation**: Cassandra is designed for transaction processing, not analytics.
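
One way to see why the two questions differ: grouping by the partition key can be answered one partition at a time, while grouping by `state` has to combine rows from every partition. The sketch below uses made-up amounts and a plain dictionary as a hypothetical stand-in for partitions; it is not Cassandra's execution model.

```python
from collections import defaultdict
from statistics import mean

# Made-up data: partition key (bank_id) -> list of (amount, state) rows
partitions = {
    544: [(300, "WI"), (400, "WI")],
    999: [(500, "IL"), (700, "WI")],
}

# Group by partition key: each partition is aggregated on its own
avg_per_bank = {
    bank_id: mean(amount for amount, _ in rows)
    for bank_id, rows in partitions.items()
}

# Group by state: rows for one state are spread over several partitions,
# so every partition must be scanned and the results combined
amounts_by_state = defaultdict(list)
for rows in partitions.values():
    for amount, state in rows:
        amounts_by_state[state].append(amount)
avg_per_state = {state: mean(amounts) for state, amounts in amounts_by_state.items()}

print(avg_per_bank)
print(avg_per_state)
```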

### Spark solution

In [None]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("cs544")
         .config('spark.jars.packages', \
                 'com.datastax.spark:spark-cassandra-connector_2.12:3.4.0')
         .config("spark.sql.extensions", \
                 "com.datastax.spark.connector.CassandraSparkExtensions")
         .getOrCreate())

### Reading data into Spark

### Approach 1: individual DataFrame

```python
spark.read.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host", "????")
.option("keyspace", ????)
.option("table", ????)
.load()

```

### Approach 2: catalogs

- A catalog is a set of tables that `Spark` can see. The tables can be managed either by `Spark` itself or by some other system (here, Cassandra).

In [None]:
spark.conf.set("spark.sql.catalog.mycat", \
               "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.mycat.spark.cassandra.connection.host", \
               "demo-db-1,demo-db-2,demo-db-3")

### Spark SQL

Syntax: `FROM <catalog>.<keyspace>.<table>`

In [None]:
spark.sql("""

""")

In [None]:
spark.sql("""
SELECT *
FROM mycat.banking.loans
""").toPandas()

#### What is the average loan amount per state?

In [None]:
spark.sql("""
SELECT state, AVG(amount) AS avg_amount
FROM mycat.banking.loans
GROUP BY state
""").toPandas()

We could also write this data out elsewhere, for example to HDFS or a Hive table.

In [None]:
# spark.sql("""
# SELECT *
# FROM mycat.banking.loans
# """).write.....

### Spark - Hash Partitioning Demo

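Hash partitioning with a modulo is not elastic: changing the number of partitions moves most keys to a different partition.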

In [None]:
import string

In [None]:
string.ascii_uppercase

In [None]:
df = pd.DataFrame({"letter": list(string.ascii_uppercase)})
df.head()

In [None]:
df["partition1"] = df["letter"].apply(lambda letter: hash(letter) % 4)
df.head()

In [None]:
df["partition2"] = df["letter"].apply(lambda letter: hash(letter) % 5)
df.head()

Let's compare partition1 and partition2 results.

In [None]:
df["partition1"] == df["partition2"]

In [None]:
(df["partition1"] == df["partition2"]).mean()

**Observation**: Only a few of the letters stayed in the same partition.
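
Python's built-in `hash()` is randomized per process for strings, so the exact numbers above change between kernel restarts. The sketch below repeats the comparison with a stable hash; `zlib.crc32`, the helper `stable_partition`, and the partition counts 4 and 5 are assumptions chosen for illustration.

```python
import string
import zlib


def stable_partition(key: str, num_partitions: int) -> int:
    """Assign a key to a partition using a hash that is the same on every run."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


letters = list(string.ascii_uppercase)
before = {letter: stable_partition(letter, 4) for letter in letters}
after = {letter: stable_partition(letter, 5) for letter in letters}

# Letters whose partition number did not change when going from 4 to 5 partitions
stayed = [letter for letter in letters if before[letter] == after[letter]]
fraction_stayed = len(stayed) / len(letters)
print(f"{len(stayed)} of {len(letters)} letters kept their partition")

# For evenly spread hash values h, h % 4 == h % 5 only when h % 20 is 0, 1, 2, or 3
expected_fraction = sum(h % 4 == h % 5 for h in range(20)) / 20
print("expected fraction for a uniform hash:", expected_fraction)
```

With only 26 keys the measured fraction can differ noticeably from the uniform-hash estimate, but in both cases most keys move.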