# Apache Cassandra

![png](http://www.siliconweek.es/wp-content/uploads/2011/01/cassandra-database.jpg)

## Design

Designing a relational database schema begins with the object types, their attributes,
and how they map to tables. Queries are usually an afterthought.

Designing a Cassandra database schema focuses early on the queries and makes their speed a first
priority. We structure Cassandra tables to support queries rather than to represent domain object types.

Doing it right from the start means we scale and keep consistent speed. A model can be fast now
and still fail to scale well, so scalability needs to be considered up front.

##  Data Suited for Cassandra

* Most domains
* Transactional or operational data
* Demands of web, mobile, and internet of things (IoT)
* High availability!
* Scale!

### BigData

* Volume (petabytes of data, trillions of entities)
* Velocity (real-time, streams, millions of transactions per second)
* Variety (un-, semi-, structured)
* ...what relational databases cannot handle



## Problems Apps Face

* Scalability - Apps constantly add users and videos
* Reliability - Apps must always be available
* Ease of use - Apps must be easy to manage and maintain



## Solutions Attempted
### Relational Database Problems

* Single points of failure - A master/slave architecture has a single point of failure if the master goes offline
* Scaling complexity - Scaling a relational database is doable, but difficult
* Reliability issues
* Difficult to serve users worldwide - It is harder to distribute the data worldwide so that it sits close to the users who need it



## Apps and Cassandra
### Why Cassandra

* Peers instead of master/slave
* Linear scale performance
* Always on reliability
* Data can be stored geographically close to clients



# Load Data

### Keyspaces

* Top-level namespace/container
* Similar to a relational database schema

In [1]:
%load_ext cql

In [2]:
%%cql
CREATE KEYSPACE demo 
WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};

'No results.'

* Replication parameters required

### USE

* USE switches between keyspaces

USE killrvideo;


In [3]:
%cql USE demo;

'No results.'

### Tables

* Keyspaces contain tables
* Tables contain data

In [4]:
%%cql
CREATE TABLE table1 (
    column1 text,
    column2 text,
    column3 int,
    PRIMARY KEY (column1)
);

'No results.'

In [5]:
%%cql
CREATE TABLE users (
    user UUID,
    email text,
    name text,
    PRIMARY KEY (user)
);

'No results.'

### Primary Keys

![png](./images/cassandra_fig1.png)

* Uniquely identify rows

### Basic Data Types

* text
    * UTF8 encoded string
    * varchar is same as text


* int
    * Signed
    * 32 bits

### UUID & TIMEUUID

* Universally Unique Identifier
    * Ex: 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328
    * Generate via uuid()


* TIMEUUID embeds a TIMESTAMP value
    * Ex: 1be43390-9fe4-11e3-8d05-425861b86ab6
    * Sortable - You can order on a TIMEUUID to produce time-ordered data.
    * Generate via now()
    * CQL’s dateOf() function extracts the time portion of a TIMEUUID.
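The relationship between the two identifier types can be sketched outside Cassandra with Python's standard `uuid` module. This is only an illustration, not how Cassandra generates its values: `uuid.uuid4()` stands in for CQL's `uuid()`, `uuid.uuid1()` stands in for `now()`, and the helper `timeuuid_to_datetime` is a hypothetical name playing the role of `dateOf()`.

```python
import uuid
from datetime import datetime, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC time embedded in a version-1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("only version-1 UUIDs embed a timestamp")
    unix_seconds = (value.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


random_id = uuid.uuid4()  # random UUID, comparable to CQL's uuid()
time_id = uuid.uuid1()    # time-based UUID, comparable to CQL's now()

print(random_id)
print(time_id, "->", timeuuid_to_datetime(time_id))
```

A random (version 4) UUID carries no time information, which is why the helper rejects it.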



### TIMESTAMP

* Stores date and time
* 64-bit integer
* Milliseconds since January 1, 1970 at 00:00:00 GMT
* Displayed in cqlsh as yyyy-mm-dd HH:mm:ssZ
* Written as a literal in cqlsh: '1979-07-24 08:30:15'
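To make the "milliseconds since the epoch" encoding concrete, here is a small Python sketch that converts a date and time to that 64-bit integer and back. The function names are hypothetical, and the example literal is read as UTC, which is an assumption; a literal without a time zone may be interpreted in a different zone by the server.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Milliseconds between the Unix epoch and a timezone-aware datetime."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis: int) -> datetime:
    """Inverse of to_epoch_millis, returning a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# The example literal 1979-07-24 08:30:15, read as UTC (an assumption).
sample = datetime(1979, 7, 24, 8, 30, 15, tzinfo=timezone.utc)
millis = to_epoch_millis(sample)
print(millis, from_epoch_millis(millis))
```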

### COPY

* Imports/exports CSV (comma-separated values)

    COPY table1 FROM 'tabledata.csv';

* Header parameter skips the first line in the file

    COPY table1 (column1, column2, column3) FROM 'tabledata_headers.csv'
    WITH HEADER=true;

In [6]:
!cqlsh --keyspace demo -e "COPY table1(column1, column2, column3) FROM 'data/tabledata.csv';"


3 rows imported in 0.715 seconds.


In [7]:
!cqlsh --keyspace demo -e "COPY table1 FROM 'data/tabledata_headers.csv' WITH HEADER=true;"


2 rows imported in 0.534 seconds.
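The effect of the header option can be mimicked in a few lines of Python. This is a sketch of the idea only, not of how cqlsh implements COPY; the file contents and the `read_rows` helper are made up for illustration.

```python
import csv
import io

# Hypothetical file contents; the first line names the columns.
CSV_WITH_HEADER = """column1,column2,column3
abc,def,1
ghi,jkl,2
"""


def read_rows(text, header=False):
    """Parse CSV text and drop the first line when header is True."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[1:] if header else rows


print(read_rows(CSV_WITH_HEADER, header=True))
print(len(read_rows(CSV_WITH_HEADER)))
```

Without the header flag the column names would be treated as a data row, which is why a file with a header line needs `HEADER=true` on import.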


### SELECT

In [8]:
%%cql
SELECT * FROM table1;

column1,column2,column3
sdfa,dfghdfg,12
dfcx,vnxzcvxzc,20
fadfa,asdfsad,34
vnbrwt,vcnxc<zfsdfd,22
sfgf,sdfgfsdgfsdfgsdfgdsf,56


In [9]:
%%cql
SELECT column1, column2, column3 from table1;

column1,column2,column3
sdfa,dfghdfg,12
dfcx,vnxzcvxzc,20
fadfa,asdfsad,34
vnbrwt,vcnxc<zfsdfd,22
sfgf,sdfgfsdgfsdfgsdfgdsf,56


In [10]:
%%cql
SELECT count(*) 
from table1;



count
5


In [11]:
%%cql
SELECT *
from table1
LIMIT 2;

column1,column2,column3
sdfa,dfghdfg,12
dfcx,vnxzcvxzc,20


# killrvideo


In [12]:
%%cql
CREATE KEYSPACE killrvideo
WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};

'No results.'

In [13]:
%cql USE killrvideo;

'No results.'

In [14]:
%%cql
CREATE TABLE videos (
  video_id TIMEUUID,
  added_date TIMESTAMP,
  description TEXT,
  title TEXT,
  user_id UUID,
  PRIMARY KEY (video_id)
);

'No results.'

In [15]:
!cqlsh --keyspace killrvideo -e "COPY videos FROM 'data/2-videos.csv' WITH HEADER=true;"


430 rows imported in 7.229 seconds.


### Query videos by title and year

* Partitions
* Partition keys
* Composite partition keys

### Query

Let’s try the following queries on the videos table:

In [16]:
%%cql
SELECT *
FROM videos where title = 'The Original Grumpy cat';

InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "

In [17]:
%%cql
SELECT *
FROM videos where date < '2015-05-01';

InvalidRequest: code=2200 [Invalid query] message="Undefined name date in where clause ('date < '2015-05-01'')"

### Videos

![png](./images/cassandra_fig2.png)

### Cassandra’s Physical Storage Strategy

![png](./images/cassandra_fig3.png)

* **Partition** - Maps a partition key to a sequence of cells
* **Cell** - a key-value pair



### CQL Tables
![png](./images/cassandra_fig4.png)

* CQL displays partition data in a tabular format called a table.
* CQL tables appear similar to relational tables, but the similarity ends there.

### Determining Partition Keys

* CQL’s PRIMARY KEY clause determines partitioning criteria


![png](./images/cassandra_fig5.png)

### Partition Storage

* Cassandra distributes partitions across nodes
* WHERE on any field other than partition key would require a scan of all partitions on all nodes
* Inefficient access pattern


![png](./images/cassandra_fig6.png)

### WHERE and Partition Keys

* We can filter with WHERE on a partition key value
* Cassandra uses a hashing algorithm to quickly determine which node(s) contain the desired partition


![png](./images/cassandra_fig7.png)
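The difference between the two access patterns can be sketched with a toy cluster in Python. Everything here is a simplification under stated assumptions: the node names are invented, MD5 with a modulo stands in for Cassandra's real partitioner and token ranges, and `nodes_touched` exists only to make the cost visible.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Hash the partition key and map the resulting token to one node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


class ToyCluster:
    def __init__(self):
        # One dictionary of partitions per node.
        self.storage = {node: {} for node in NODES}
        self.nodes_touched = 0

    def insert(self, video_id, row):
        self.storage[node_for(video_id)][video_id] = row

    def get_by_key(self, video_id):
        """Partition-key lookup: one hash, one node."""
        self.nodes_touched = 1
        return self.storage[node_for(video_id)].get(video_id)

    def scan_by_title(self, title):
        """Non-key filter: every partition on every node is inspected."""
        self.nodes_touched = len(NODES)
        return [row for node in NODES
                for row in self.storage[node].values()
                if row["title"] == title]


cluster = ToyCluster()
cluster.insert("video-1", {"title": "The Original Grumpy Cat"})
cluster.insert("video-2", {"title": "Another video"})
print(cluster.get_by_key("video-1"), cluster.nodes_touched)
print(cluster.scan_by_title("Another video"), cluster.nodes_touched)
```

The key lookup touches a single node regardless of cluster size, while the title filter grows with the number of nodes and partitions.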

### Composite Partition Keys


![png](./images/cassandra_fig8.png)

* Multiple columns may make up the partition key
* Extra set of parentheses required
* Further columns can follow the partition key; these determine the clustering columns (discussed later)


### Upserts

In this example, **email** is the primary key.   
Inserting a row whose key matches an existing row simply updates the existing value(s). Cassandra does not read before writing for INSERTs and will overwrite/update non-key values.



In [18]:
%%cql
CREATE TABLE users (
  email text,
  password text,
  userid int,
  PRIMARY KEY (email)
);

'No results.'

In [19]:
%%cql
INSERT INTO users(email, password, userid)
VALUES ('casandra.rock.star@datastax.com', 'abc', 42)

'No results.'

In [20]:
%cql SELECT * FROM users WHERE email='casandra.rock.star@datastax.com'

email,password,userid
casandra.rock.star@datastax.com,abc,42


In [21]:
%cql SELECT * FROM users WHERE email='casandra.rock.star@datastax.com'

email,password,userid
casandra.rock.star@datastax.com,abc,42


In [22]:
%%cql
UPDATE users
SET password = 'lol',
    userid = 50
WHERE email = 'casandra.admin@datastax.com'

'No results.'

In [23]:
%cql SELECT * FROM users WHERE email= 'casandra.admin@datastax.com'

email,password,userid
casandra.admin@datastax.com,lol,50


Again, by default and for speed, Cassandra does not read before writing, so an UPDATE on a key that does not exist yet simply inserts the row. Notice that the inserted values here come from both the SET and the WHERE clauses.   
We will later show you how to have Cassandra prevent upserts, but it is best to first see how to use Cassandra to its full potential by structuring your applications to work with it gracefully.
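The upsert behaviour can be imitated with a plain dictionary in Python. This is a minimal sketch of the semantics only; `ToyTable` is a hypothetical class, not part of any Cassandra driver, and the replacement password `xyz` is made up.

```python
class ToyTable:
    """Rows keyed by one primary-key column; INSERT and UPDATE share a write path."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = {}

    def _write(self, key, values):
        # No existence check decides the outcome: the row is created if
        # missing and the given columns are overwritten either way.
        row = self.rows.setdefault(key, {self.key_column: key})
        row.update(values)

    def insert(self, **values):
        self._write(values[self.key_column], values)

    def update(self, key, **assignments):
        self._write(key, assignments)


users = ToyTable("email")
users.insert(email="casandra.rock.star@datastax.com", password="abc", userid=42)
# INSERT on an existing key overwrites the non-key values.
users.insert(email="casandra.rock.star@datastax.com", password="xyz", userid=42)
# UPDATE on a key that does not exist yet creates the row.
users.update("casandra.admin@datastax.com", password="lol", userid=50)
print(users.rows)
```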



### Clustering Columns

* Come after partition key within PRIMARY KEY clause
* Data displays the same as before


![png](./images/cassandra_fig9.png)


* Clustering columns separate and order the CQL rows within a partition

![png](./images/cassandra_fig10.png)

### Side by Side Comparison

* The structure on the left makes single-row partitions (one CQL row per partition). The right structure will store several CQL rows per partition, grouped by the video year.
* The structure on the right embeds the video name with each column name in each cell’s key.
* We built the right structure specifically to service querying on videos in a given year.
* Single row partitions are sometimes called skinny whereas multi-row partitions are sometimes called wide.

![png](./images/cassandra_fig11.png)
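The skinny versus wide distinction can be illustrated by grouping the same rows under two different partition keys. The sketch below uses the sample values from the `videos_ordered` cells in this notebook; `partition_by` is a hypothetical helper, and real partitions hold cells rather than Python lists.

```python
from collections import defaultdict

# Sample rows (the same values used for videos_ordered in this notebook).
VIDEOS = [
    {"id": 1, "name": "Interstellar", "runtime": 98, "year": 2014},
    {"id": 2, "name": "Mockingjay", "runtime": 113, "year": 2014},
    {"id": 3, "name": "Insurgent", "runtime": 119, "year": 2015},
]


def partition_by(rows, key_columns):
    """Group row names into partitions keyed by the partition-key columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        partitions[key].append(row["name"])
    return dict(partitions)


skinny = partition_by(VIDEOS, ["id"])    # like PRIMARY KEY (id)
wide = partition_by(VIDEOS, ["year"])    # like PRIMARY KEY ((year), name)
print(skinny)
print(wide)
```

With `id` as the partition key every partition holds one row; with `year` the two 2014 videos share a partition, which is what makes a query by year a single-partition read.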

### Clustering Column Ordering

* Clustering column values stored sorted
* Default is ascending but you can specify descending

![png](./images/cassandra_fig12.png)

In [24]:
%%cql
CREATE TABLE videos_ordered (
    id int,
    name text,
    runtime int,
    year int,
    PRIMARY KEY ((year), name)
) WITH CLUSTERING ORDER BY (name DESC)

'No results.'

In [25]:
%cql INSERT INTO videos_ordered(id, name, runtime, year) VALUES(1, 'Interstellar', 98, 2014)
%cql INSERT INTO videos_ordered(id, name, runtime, year) VALUES(2, 'Mockingjay', 113, 2014)
%cql INSERT INTO videos_ordered(id, name, runtime, year) VALUES(3, 'Insurgent', 119, 2015)

'No results.'

In [26]:
%%cql
SELECT *
FROM videos_ordered
WHERE year = 2014 and name = 'Mockingjay'

year,name,id,runtime
2014,Mockingjay,2,113


![png](./images/cassandra_fig13.png)
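How sorted clustering values help inside one partition can be sketched in Python with the standard `bisect` module. This is a toy model under assumptions: `ToyPartition` is a hypothetical class, it supports a single clustering column, and it keeps values ascending internally and reverses them on read, whereas Cassandra stores rows in the declared order.

```python
import bisect


class ToyPartition:
    """Rows of one partition, kept sorted by a single clustering column."""

    def __init__(self, descending=False):
        self.descending = descending
        self._keys = []  # clustering values, always ascending internally
        self._rows = []  # rows in the same order as _keys

    def upsert(self, clustering_value, row):
        index = bisect.bisect_left(self._keys, clustering_value)
        if index < len(self._keys) and self._keys[index] == clustering_value:
            self._rows[index] = row  # same clustering value: overwrite
        else:
            self._keys.insert(index, clustering_value)
            self._rows.insert(index, row)

    def get(self, clustering_value):
        """Binary search for one row instead of scanning the partition."""
        index = bisect.bisect_left(self._keys, clustering_value)
        if index < len(self._keys) and self._keys[index] == clustering_value:
            return self._rows[index]
        return None

    def rows(self):
        return list(reversed(self._rows)) if self.descending else list(self._rows)


partition_2014 = ToyPartition(descending=True)  # CLUSTERING ORDER BY (name DESC)
partition_2014.upsert("Interstellar", {"id": 1, "runtime": 98})
partition_2014.upsert("Mockingjay", {"id": 2, "runtime": 113})
print([row["id"] for row in partition_2014.rows()])
print(partition_2014.get("Mockingjay"))
```

Because the values are kept sorted, both ordered reads and lookups by clustering value avoid a full pass over the partition.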