## Non-Relational Databases

##### Terminology:

NoSQL and Non-Relational are interchangeable terms.

NoSQL = <font color=yellow>Not Only SQL</font> 


NB - a "node" is simply a server or system. Adding a node, as commonly mentioned in the context of linear scaling, just means adding another server.

##### Some notes on when not to use SQL:

- Need high availability in the data: <font color=green>the system is always up, with no downtime</font>
- Have large amounts of data
- Need linear scalability: <font color=green>adding more nodes to the system increases performance linearly</font>
- Need low latency: <font color=green>a shorter delay before the data is transferred once the instruction for the transfer has been received</font>
- Need fast reads and writes

### Apache Cassandra

Understanding the architecture: https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archTOC.html 

Cassandra Architecture: https://www.tutorialspoint.com/cassandra/cassandra_architecture.htm 

In depth Apache Cassandra Data Model docs: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlIntro.html 

###  <font color=yellow>CAP</font> Theorem

- <font color=yellow>Consistency:</font> Every read from the database gets the latest (and correct) piece of data or an error

- <font color=yellow>Availability:</font> Every request is received and a response is given -- without a guarantee that the data is the latest update

- <font color=yellow>Partition Tolerance:</font> The system continues to work regardless of losing network connectivity between nodes



##### <font color=green>Is Eventual Consistency the opposite of what is promised by SQL database per the ACID principle?</font>
Not exactly, because the two uses of "consistency" mean different things. Much has been written about how Consistency is interpreted in the ACID principle and in the CAP theorem. Consistency in the ACID principle refers to the requirement that only transactions that abide by constraints and database rules are written into the database; otherwise the database keeps its previous state. In other words, the data should be correct across all rows and tables. Consistency in the CAP theorem, however, refers to every read from the database getting the latest piece of data or an error.

##### <font color=green>Which of these combinations is desirable for a production system - Consistency and Availability, Consistency and Partition Tolerance, or Availability and Partition Tolerance?</font>
As the CAP Theorem Wikipedia entry says, "The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability." So there is no such thing as Consistency and Availability in a distributed database since it must always tolerate network issues. You can only have Consistency and Partition Tolerance (CP) or Availability and Partition Tolerance (AP). Remember, relational and non-relational databases do different things, and that's why most companies have both types of database systems.
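
The trade-off can be made concrete with a toy model. The sketch below is not how any real database is implemented; `Replica`, `write`, and the `"CP"`/`"AP"` mode flag are hypothetical names invented for illustration. It only shows that, once a partition hides one replica, a write is either rejected (keeping replicas consistent) or accepted (leaving replicas in disagreement).

```python
class Replica:
    """One copy of a single value, tagged with a version number."""

    def __init__(self):
        self.value, self.version = None, 0


def write(replicas, reachable, value, mode):
    """Write `value` to the reachable replicas (toy model, hypothetical API).

    mode="CP": refuse the write unless every replica is reachable.
    mode="AP": accept the write on whatever replicas are reachable.
    """
    if mode == "CP" and len(reachable) < len(replicas):
        raise RuntimeError("partition: write rejected to stay consistent")
    new_version = max(r.version for r in reachable) + 1
    for r in reachable:
        r.value, r.version = value, new_version


a, b = Replica(), Replica()
write([a, b], [a, b], "v1", mode="AP")  # healthy network: both replicas updated
write([a, b], [a], "v2", mode="AP")     # partition: only `a` is reachable
ap_views = (a.value, b.value)           # the two replicas now disagree

try:
    write([a, b], [a], "v3", mode="CP")  # same partition, CP behaviour
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"
```

In AP mode the system keeps answering but `a` and `b` hold different values; in CP mode the write fails, which is the "or an error" part of the Consistency definition.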

##### <font color=green>Does Cassandra meet just Availability and Partition Tolerance in the CAP theorem?</font>
According to the CAP theorem, a database can only guarantee two out of the three properties in CAP. Supporting Availability and Partition Tolerance makes sense for Apache Cassandra, since staying up and tolerating network failures between nodes are its biggest requirements; the price is eventual rather than immediate consistency.

##### <font color=green>If Apache Cassandra is not built for consistency, won't the analytics pipeline break?</font>
Suppose the analysis is about a trend over time, e.g., how many friends John has on Twitter. If one person is missing from the count because of "eventual consistency" (the data may not be up to date in all locations), that's OK. In theory, it can be an issue, but only if the data is not being updated constantly. If the pipeline pulls data from a node that has not yet received an update, it won't see that change. Remember, Apache Cassandra is about Eventual Consistency.
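
The "one friend short, then caught up" behaviour can be sketched as a toy model. Everything here is an assumption made for illustration: the two dictionaries stand in for nodes, `ts` is a made-up timestamp field, and `sync` is a hypothetical stand-in for whatever background process eventually copies the newest value everywhere.

```python
def read(replica):
    """Return whatever this replica currently believes the count is."""
    return replica["friends"]


def sync(replicas):
    """Copy the newest value (highest timestamp) to every replica."""
    newest = max(replicas, key=lambda r: r["ts"])
    friends, ts = newest["friends"], newest["ts"]
    for r in replicas:
        r["friends"], r["ts"] = friends, ts


node_a = {"friends": 100, "ts": 1}
node_b = {"friends": 100, "ts": 1}

# A new friend is recorded on node_a only; node_b has not heard about it yet.
node_a.update(friends=101, ts=2)

stale = read(node_b)     # the pipeline reads node_b before the update arrives
sync([node_a, node_b])   # replicas converge
fresh = read(node_b)     # the same read now reflects the update
```

For a trend analysis, the difference between `stale` and `fresh` is a count that is off by one for a short while, which is the acceptable case the answer above describes.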

#### Data Modelling in Apache Cassandra:
- Denormalization is not just okay; it's a must
- Denormalization must be done for fast reads
- Apache Cassandra has been optimized for fast writes
- ALWAYS think Queries first
- One table per query is a great strategy
- Apache Cassandra does **NOT** allow for JOINs between tables
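
The "one table per query" idea can be sketched in plain Python, with dictionaries standing in for tables. This is only an illustration of denormalization, not Cassandra code; the sample rows and the names `albums_by_year` and `albums_by_artist` are made up for the example.

```python
from collections import defaultdict

# Sample rows, used only for illustration.
albums = [
    {"year": 1970, "artist": "The Beatles", "album": "Let It Be"},
    {"year": 1965, "artist": "The Beatles", "album": "Rubber Soul"},
    {"year": 1965, "artist": "The Who", "album": "My Generation"},
]

# Query 1: "all albums released in a given year"
albums_by_year = defaultdict(list)
# Query 2: "all albums by a given artist"
albums_by_artist = defaultdict(list)

for row in albums:
    # The same row is written twice, once per query it has to serve.
    # Writes are duplicated so that each read is a single lookup, no JOIN.
    albums_by_year[row["year"]].append(row)
    albums_by_artist[row["artist"]].append(row)

in_1965 = [r["album"] for r in albums_by_year[1965]]
by_beatles = [r["album"] for r in albums_by_artist["The Beatles"]]
```

Each question is answered from the structure built for it, which is why the queries have to be known before the tables are designed.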


##### <font color=green>I see certain downsides of this approach, since in a production application, requirements change quickly and I may need to improve my queries later. Isn't that a downside of Apache Cassandra?</font>
In Apache Cassandra, you want to model your data to your queries, and if your business needs call for quickly changing requirements, you need to create a new table to process the data. That is a requirement of Apache Cassandra. If your business needs call for ad-hoc queries, these are not a strength of Apache Cassandra. However, keep in mind that it is easy to create a new table that will fit your new query.

## Please Note - This code won't work on my personal setup without a fully installed Cassandra setup (which currently cannot be supported by Python 3.8, which I have)

As such, the following are demo notes for code that would work on the right backend setup.

Go to NoSQL Demo 1.ipynb

### Primary Key

- Must be Unique
- The PRIMARY KEY is made up of either just the PARTITION KEY or may also include additional CLUSTERING COLUMNS
- A simple PRIMARY KEY is just one column that is also the PARTITION KEY. A composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value and in your retrieval queries
- The PARTITION KEY will determine the distribution of the data across the system
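
How a partition key decides where a row lives can be sketched as follows. This is a simplification under stated assumptions: the node names are hypothetical, `hashlib.md5` with a modulo is only a stand-in hash, and a real cluster uses its own partitioner and token ranges rather than this arithmetic.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Map a partition key to a node by hashing it (stand-in hash function)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# (year, album) rows, with `year` playing the role of the PARTITION KEY.
rows = [(1965, "Rubber Soul"), (1965, "My Generation"), (1970, "Let It Be")]

placement = {}
for year, album in rows:
    placement.setdefault(node_for(year), []).append(album)
```

Rows that share a partition key always hash to the same node, so both 1965 albums end up together; which node that is depends only on the hash.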

useful link for some docs on the subject: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSimplePrimaryKeyConcept.html#useSimplePrimaryKeyConcept 

### Clustering Columns

- The clustering columns sort the data in ascending order, e.g., alphabetical order. Note: the video mistakenly says descending order.
- More than one clustering column can be added (or none!)
- From there, the clustering columns sort in the order in which they were added to the primary key
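
The sorting behaviour can be imitated in plain Python. In this sketch (an illustration, not Cassandra code) a dictionary key plays the partition key, and a tuple `(artist, album)` plays two clustering columns; the helper `insert` and the sample rows are made up for the example.

```python
from collections import defaultdict


def insert(table, partition_key, clustering):
    """Add a row and keep the partition sorted by its clustering columns."""
    table[partition_key].append(clustering)
    # Tuples compare element by element, so rows are ordered by the first
    # clustering column, then by the second within equal first values.
    table[partition_key].sort()


table = defaultdict(list)
insert(table, 1965, ("The Who", "My Generation"))
insert(table, 1965, ("The Beatles", "Rubber Soul"))
insert(table, 1965, ("The Beatles", "Help!"))

order = table[1965]
```

Regardless of insertion order, the partition reads back sorted ascending by artist and then by album, mirroring the order in which the clustering columns were listed.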


#### Additional Resources 
useful link for some docs on the subject: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompoundPrimaryKeyConcept.html

Partition Key vs Clustering Keys : https://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra 

### WHERE clause

- Data Modeling in Apache Cassandra is query focused, and that focus needs to be on the WHERE clause
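- The WHERE clause should include the partition key; a query that filters on other columns without it is rejected with an error (unless ALLOW FILTERING is added, which should be avoided)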


#### Additional Resources 
AVOID using "ALLOW FILTERING": this DataStax post explains ALLOW FILTERING and why you should not use it.
https://www.datastax.com/dev/blog/allow-filtering-explained-2 


#### Why do we need to use a WHERE statement since we are not concerned about analytics? Is it only for debugging purposes?
The WHERE clause is what allows us to do fast reads. With Apache Cassandra, we are talking about big data (think terabytes of data), so the design is optimized for reads. Data is spread across all the nodes. By using the WHERE clause on the partition key, we know which node to go to, get the data from that node, and serve it back. For example, imagine we have 10 years of data on 10 nodes or servers, partitioned by year so that each year's data sits on its own node. With WHERE year = 1, we know immediately which node to visit to pull the data from.
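
The ten-years-on-ten-nodes example can be sketched as a toy lookup. The names `owner`, `storage`, and `select` are hypothetical, and the one-year-per-node layout is the simplifying assumption from the paragraph above, not how a real cluster is guaranteed to place data.

```python
# Partition key (year) -> the node that owns it; one year per node.
owner = {year: f"node-{year}" for year in range(1, 11)}

# What each node stores: a single placeholder row for its year.
storage = {owner[year]: [f"row for year {year}"] for year in range(1, 11)}


def select(year=None):
    """Return (rows, number_of_nodes_contacted) for a toy query."""
    if year is not None:
        nodes = [owner[year]]   # WHERE year = ...: go straight to one node
    else:
        nodes = list(storage)   # no partition key: every node must be asked
    rows = [row for node in nodes for row in storage[node]]
    return rows, len(nodes)


rows, contacted = select(year=1)
all_rows, all_contacted = select()
```

With the partition key in the query, one node is contacted; without it, all ten are, and that gap grows with the size of the cluster.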