## Data Engineering: A brief history

Knowing a bit of the history of data engineering will help you understand how certain tools were developed, and give you more context when choosing between tools and frameworks. There are two readings we suggest.

1) [On the Evolution of Data Engineering](https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37): This short read (~5 minutes) focuses on the recent change from managing SQL databases to working with massive datasets in real time. It was written by Julien Kervizic, an experienced analytics expert from the Netherlands.

2) [Data Engineering Introduction and Epochs](https://learn.panoply.io/hubfs/Data%20Engineering%20-%20Introduction%20and%20Epochs.pdf): This slightly longer read (~20 minutes) goes further back in time to the birth of computers. It walks through four "epochs" of data engineering, and the major advances over the past 70 years. It was written by Panoply, a data engineering platform provider.


## Data Engineering Tools
Feel free to look at these resources and infographic describing the different tools used by data engineers.

- https://www.burtchworks.com/2018/09/10/the-rise-of-data-engineering-common-skills-and-tools/
- https://www.analyticsindiamag.com/data-engineering-101-top-tools-and-framework-resources/
- https://joviam.com/this-infographic-of-big-data-tools-will-blow-your-mind-infographic/
- https://datafloq.com/big-data-open-source-tools/os-home/

# Data Model

Databases: A database is a structured repository or collection of data that is stored and retrieved electronically for use in applications. Data can be stored, updated, or deleted from a database.

Database Management System (DBMS): The software used to access the database by the user and application is the database management system. Check out these few links describing a DBMS in more detail.

- [Introduction to DBMS](https://www.geeksforgeeks.org/introduction-of-dbms-database-management-system-set-1/)
- [DBMS as per Wikipedia](https://en.wikipedia.org/wiki/Database#Database_management_system)

## What is a Data Model?
Data modeling at a high level is all about an abstraction that organizes elements of data and how they relate to each other.

### Process
The process of data modeling is to organize data into a database system so that your data is persisted and easily usable.

![Screen%20Shot%202020-05-24%20at%209.23.41%20pm.png](attachment:Screen%20Shot%202020-05-24%20at%209.23.41%20pm.png)
Entity mapping diagram

Physical data modeling transforms the logical data model into a database by using DDL (data definition language).
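
As a minimal sketch of that last step, the snippet below turns a small logical model (a customer who places orders) into DDL statements and runs them. It uses Python's built-in `sqlite3` module as a stand-in for a database server, and the `customer` and `orders` tables and their columns are hypothetical examples, not part of the course material.

```python
import sqlite3

# Logical model (hypothetical): one customer places many orders.
# Physical model: the DDL below creates one table per entity.
DDL = """
CREATE TABLE IF NOT EXISTS customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT
);
CREATE TABLE IF NOT EXISTS orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    amount      REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.executescript(DDL)

# List the tables that the DDL created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
conn.close()
```

The same `CREATE TABLE` statements would need small type changes (for example `varchar` instead of `TEXT`) to run on another database system.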

### Common Questions
#### Why can't everything be stored in a giant Excel spreadsheet?

- There are limits to the amount of data that can be stored in an Excel sheet, whereas a database organizes the elements into tables (rows and columns). Reading and writing at a large scale is also not practical with an Excel sheet, so it's better to use a database to handle most business functions.
#### Does data modeling happen before you create a database, or is it an iterative process?

- It's definitely an iterative process. Data engineers continually reorganize, restructure, and optimize data models to fit the needs of the organization.
#### How is data modeling different from machine learning modeling?

- Machine learning includes a lot of data wrangling to create the inputs for machine learning models, but data modeling is more about how to structure data to be used by different people within an organization. You can think of data modeling as the process of designing data and making it available to machine learning engineers, data scientists, business analysts, etc., so they can make use of it easily.



## Why is data modeling important?

1. Key points about Data Modeling

    - Data Organization: The organization of the data for your applications is extremely important and makes everyone's life easier.
    - Use cases: Having a well thought out and organized data model is critical to how that data can later be used. Queries that could have been straightforward and simple might become complicated queries if data modeling isn't well thought out.
    - Starting early: Thinking and planning ahead will help you be successful. This is not something you want to leave until the last minute.
    - Iterative Process: Data modeling is not a fixed process. It is iterative as new requirements and data are introduced. Having flexibility will help as new information becomes available.

2. Example of Why Data Modeling is Important:

Let's take an example from Udacity. Here, a Udacity data engineer would help structure the data so it can be used by different people within Udacity for further analysis and also shared with the learner on the website. For instance, when we want to track the students' progress within a Nanodegree program, we want to aggregate data across students and projects within a Nanodegree. In a relational database, this requires the data to be structured in ways that each student's data is tracked across all Nanodegree programs that s/he has ever enrolled in. The data also needs to track the student's progress within each of those Nanodegree programs.

The data model is critical for accurately representing each data object. For instance, a data table would track a student's progress on project submissions, i.e., whether they passed or failed a specific rubric requirement. Furthermore, the data model should ensure that a student's progress is updated and aggregated to provide an indicator of whether the student passed all the rubric requirements and successfully finished the project. Data modeling is critical to track all of these pieces of data so the tables are speaking to each other, updating the tables correctly (e.g., updating a student's progress on a project submission), and meeting defined rules (e.g., project completed when all rubric requirements are passed).
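
The rule "project completed when all rubric requirements are passed" can be sketched as a single aggregation. This is only an illustration under stated assumptions: the `rubric_result` table, its columns, and the sample rows are hypothetical, and `sqlite3` stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rubric_result (
        student_id  INTEGER,
        project_id  INTEGER,
        rubric_item TEXT,
        passed      INTEGER,  -- 1 = passed, 0 = failed
        PRIMARY KEY (student_id, project_id, rubric_item)
    )
""")
conn.executemany(
    "INSERT INTO rubric_result VALUES (?, ?, ?, ?)",
    [
        (1, 10, "creates table", 1),
        (1, 10, "inserts rows", 1),
        (2, 10, "creates table", 1),
        (2, 10, "inserts rows", 0),
    ],
)

# A project counts as completed only when every rubric item passed,
# i.e. when the smallest "passed" value for that submission is 1.
completed = conn.execute("""
    SELECT student_id, project_id, MIN(passed) AS completed
    FROM rubric_result
    GROUP BY student_id, project_id
    ORDER BY student_id
""").fetchall()
print(completed)
conn.close()
```

Here student 1 has completed project 10 and student 2 has not, because one rubric item is still failing.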

## Relational and NoSQL Databases

Relational and NoSQL databases approach data modeling differently.

### Relational Model:
This model organizes data into one or more tables (or relations) of columns and rows, with a unique key identifying each row. Generally, each table represents one entity type (such as customer or product).

A software system used to maintain relational databases is a relational database management system (RDBMS).

SQL (Structured Query Language) is the language used across almost all relational database systems for querying and maintaining the database.

### The basics
- Database/Schema is a collection of tables

- Tables/Relations are a group of rows sharing the same labeled elements, such as a customer table

- Columns/Attributes are labeled elements, such as name, email, city

- Rows/Tuples are single items, such as an individual's record


### Advantages of Using a Relational Database
- **Flexibility for writing SQL queries:** SQL is the most common database query language.
- **Modeling the data not modeling queries**
- **Ability to do JOINS**
- **Ability to do aggregations and analytics**
- **Secondary Indexes available** : You have the advantage of being able to add another index to help with quick searching.
- **Smaller data volumes:** If you have a smaller data volume (and not big data) you can use a relational database for its simplicity.
- **ACID Transactions:** Allows you to meet a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, and thus maintain data integrity.
- **Easier to change to business requirements**
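
As a rough illustration of the JOIN, aggregation, and secondary index points in the list above, here is a small sketch. It uses `sqlite3` from the standard library as a stand-in for a relational database, and the `customer` and `purchase` tables and their rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    -- A secondary index on a non-key column to speed up lookups by city.
    CREATE INDEX idx_customer_city ON customer (city);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Amanda", "NYC"), (2, "Toby", "NYC"), (3, "Max", "Austin")])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 5.0), (3, 3, 7.5)])

# JOIN the two tables, then aggregate the purchases per city.
totals = conn.execute("""
    SELECT c.city, COUNT(p.purchase_id) AS purchases, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
print(totals)
conn.close()
```

The query was not planned when the tables were designed, which is the sense in which a relational model lets you model the data rather than the queries.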


### ACID Transactions
Properties of database transactions intended to guarantee validity even in the event of errors or power failures.

- Atomicity: The whole transaction is processed or nothing is processed. A commonly cited example of an atomic transaction is a money transfer between two bank accounts. Transferring money from one account to the other is made up of two operations: first you withdraw money from one account, and second you deposit the withdrawn money into the second account. An atomic transaction, i.e., one in which either all operations occur or none occur, keeps the database in a consistent state. This ensures that if either of those two operations (withdrawing money from the 1st account or depositing it into the 2nd account) fails, the money is neither lost nor created. See [Wikipedia](https://en.wikipedia.org/wiki/Atomicity_%28database_systems%29) for a detailed description of this example.

- Consistency: Only transactions that abide by constraints and rules are written into the database, otherwise the database keeps the previous state. The data should be correct across all rows and tables. Check out additional information about consistency on [Wikipedia](https://en.wikipedia.org/wiki/Consistency_%28database_systems%29).

- Isolation: Transactions are processed independently and securely; order does not matter. A low level of isolation enables many users to access the data simultaneously, but this also increases the possibility of concurrency effects (e.g., dirty reads or lost updates). On the other hand, a high level of isolation reduces the chance of concurrency effects, but it also uses more system resources and can cause transactions to block each other. For example, while one user's transaction is in progress, other users' conflicting transactions may be blocked until it completes. Source: [Wikipedia](https://en.wikipedia.org/wiki/Isolation_%28database_systems%29)

- Durability: Completed transactions are saved to database even in cases of system failure. A commonly cited example includes tracking flight seat bookings. So once the flight booking records a confirmed seat booking, the seat remains booked even if a system failure occurs. Source: [Wikipedia](https://en.wikipedia.org/wiki/ACID).
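
The bank transfer example can be sketched in a few lines. This is a minimal illustration, assuming `sqlite3` as a stand-in database; the `account` table, the `transfer` and `balances` helpers, and the balances themselves are hypothetical. The `CHECK` constraint plays the role of a consistency rule, and the rollback shows atomicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # The deposit runs first so that a failed withdrawal has to undo it.
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
final = balances(conn)
print(ok, failed, final)
conn.close()
```

After the failed transfer, bob's balance does not include the 500 that was briefly deposited, so no money was created.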

## When Not to Use a Relational Database
- **Have large amounts of data:** Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. You are limited by how much you can scale and how much data you can store on one machine. You cannot add more machines like you can in NoSQL databases.
- **Need to be able to store different data type formats:** Relational databases are not designed to handle unstructured data.
- **Need high throughput -- fast reads:** While ACID transactions bring benefits, they also slow down the process of reading and writing data. If you need very fast reads and writes, using a relational database may not suit your needs.
- **Need a flexible schema:** Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
- **Need high availability:** Because relational databases are not distributed (and even when they are, they have a coordinator/worker architecture), they have a single point of failure. When that database goes down, a fail-over to a backup system occurs and takes time.
- **Need horizontal scalability:** Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data.

## PostgreSQL

- open source object-relational database system
- Uses and builds on SQL language



The following information on setting up PostgreSQL on your local machine is completely optional for this course and for users who feel comfortable setting up the environment on their local machine to complete the exercises.

Here is additional information on how to install and set up Postgres locally in case you want to follow along with the demo on your local machine. This [link](https://www.codementor.io/@engineerapart/getting-started-with-postgresql-on-mac-osx-are8jcopb) provides directions for MacOS. It goes through configuring Postgres, creating users, and creating databases using the psql utility. It will help further explain the Python driver and also help you in running the demos locally.

In addition, here is a short tutorial on psycopg2. This [link](https://pynative.com/python-postgresql-tutorial/) gives a good starter tutorial in case you are curious about it.
Here are the two demo files. Feel free to follow along. Just download these and open up the Jupyter Notebook files.

### Walk through the basics of PostgreSQL. You will need to complete the following tasks:
- Create a table in PostgreSQL
- Insert rows of data
- Run a simple SQL query to validate the information
    
#### Import the library 
*Note:* An error might pop up after this command has executed. If it does, read it carefully before ignoring it.

In [4]:
import psycopg2

### Create a connection to the database

In [30]:
try: 
    conn = psycopg2.connect("host=localhost dbname=studentdb user=edifierxuhao password=****")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)

### Use the connection to get a cursor that can be used to execute queries.

In [31]:
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)

### TO-DO: Set automatic commit to be true so that each action is committed without having to call conn.commit() after each command. 

In [32]:
conn.set_session(autocommit = True)

### TO-DO: Create a database to do the work in. 

In [33]:
## TO-DO: Add the database name within the CREATE DATABASE statement. You can choose your own db name.
try: 
    cur.execute("create database mydb")
except psycopg2.Error as e:
    print(e)

#### TO-DO: Add the database name in the connect statement. Let's close our connection to the default database, reconnect to the database we just created, and get a new cursor.

In [34]:
## TO-DO: Add the database name within the connect statement
try: 
    conn.close()
except psycopg2.Error as e:
    print(e)
    
try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=****")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)
    
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)

conn.set_session(autocommit=True)

### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 

`song_title
artist_name
year
album_name
single`



In [35]:
## TO-DO: Finish writing the CREATE TABLE statement with the correct arguments
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS music_library (song_title varchar, artist_name varchar, year int, album_name varchar, single boolean);")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

### TO-DO: Insert the following two rows in the table
`First Row:  "Across The Universe", "The Beatles", 1970, "Let It Be", False`

`Second Row: "Think For Yourself", "The Beatles", 1965, "Rubber Soul", False`

In [36]:
## TO-DO: Finish the INSERT INTO statement with the correct arguments

try: 
    cur.execute("INSERT INTO music_library (song_title,artist_name, year, album_name, single) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 ("Across The Universe", "The Beatles", 1970,"Let It Be",False))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_library (song_title,artist_name, year, album_name, single) \
                  VALUES (%s, %s, %s, %s, %s)",
                  ("Think For Yourself", "The Beatles",1965,"Rubber Soul",False))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### TO-DO: Validate your data was inserted into the table. 

In [38]:
## TO-DO: Finish the SELECT * Statement 
try: 
    cur.execute("SELECT * FROM music_library;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

('Across The Universe', 'The Beatles', 1970, 'Let It Be', False)
('Think For Yourself', 'The Beatles', 1965, 'Rubber Soul', False)


### And finally close your cursor and connection. 

In [39]:
cur.close()
conn.close()
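
The walkthrough above needs a running Postgres server. As a hedged sketch, the same connect, cursor, execute, and fetch pattern can be rehearsed with the standard library's `sqlite3` module, which follows the same Python DB-API. The differences are assumptions of this stand-in, not psycopg2 behavior: `sqlite3` uses `?` placeholders where psycopg2 uses `%s`, and it stores the boolean as an integer.

```python
import sqlite3
from contextlib import closing

songs = [
    ("Across The Universe", "The Beatles", 1970, "Let It Be", False),
    ("Think For Yourself", "The Beatles", 1965, "Rubber Soul", False),
]

# closing() guarantees conn.close() runs even if a statement fails.
with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS music_library "
                "(song_title TEXT, artist_name TEXT, year INTEGER, "
                "album_name TEXT, single BOOLEAN)")
    # One parameterized statement, executed once per row.
    cur.executemany("INSERT INTO music_library VALUES (?, ?, ?, ?, ?)", songs)
    conn.commit()

    # A cursor is iterable, so no explicit fetchone() loop is needed.
    cur.execute("SELECT song_title, year FROM music_library ORDER BY year")
    rows = [row for row in cur]

print(rows)
```

Passing the values as parameters, rather than formatting them into the SQL string, lets the driver handle quoting.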

## NoSQL Databases

NoSQL databases have a simpler design, simpler horizontal scaling, and finer control of availability. The data structures they use differ from those in relational databases and make some operations faster.

NoSQL = Not Only SQL; NoSQL and NonRelational are interchangeable terms.

### Common Types of NoSQL Databases
- Apache Cassandra (Partition Row store)
- MongoDB (Document Store)
- DynamoDB (key-value store)
- Apache HBase (Wide Column Store)
- Neo4j (Graph Database)

We will use Apache Cassandra to explain the concepts of data modeling for NoSQL Databases

### The basics of Apache Cassandra
- Keyspace is a collection of tables
- Table is a group of partitions
- Row is a single item

![Screen%20Shot%202020-05-24%20at%2011.34.07%20pm.png](attachment:Screen%20Shot%202020-05-24%20at%2011.34.07%20pm.png)

### Common Questions:
#### What type of companies use Apache Cassandra?
All kinds of companies. For example, Uber uses Apache Cassandra for their entire backend. Netflix uses Apache Cassandra to serve all their videos to customers. Good use cases for NoSQL (and more specifically Apache Cassandra) are:

1. Transaction logging (retail, health care)
2. Internet of Things (IoT)
3. Time series data
4. Any workload that is heavy on writes to the database (since Apache Cassandra is optimized for writes).

#### Would Apache Cassandra be a hindrance for my analytics work? If yes, why?
Yes, if you are trying to do analysis, such as using GROUP BY statements. Since Apache Cassandra requires data modeling based on the queries you want to run, you can't do ad-hoc queries. However, you can add clustering columns to your data model and create new tables to serve new queries.
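
The idea of creating a new table for each query can be sketched without a database at all. The snippet below is a toy model in plain Python, not Apache Cassandra code: the dictionaries `albums_by_year` and `albums_by_artist` are hypothetical stand-ins for two tables, and the third album row was added only to make the lookup more interesting.

```python
from collections import defaultdict

albums = [
    {"year": 1970, "artist_name": "The Beatles", "album_name": "Let It Be"},
    {"year": 1965, "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1965, "artist_name": "The Who", "album_name": "My Generation"},
]

# One "table" per query: each is keyed by the column its query filters on.
albums_by_year = defaultdict(list)    # serves: WHERE year = ?
albums_by_artist = defaultdict(list)  # serves: WHERE artist_name = ?

for album in albums:
    # The same record is written twice (denormalization) so that each
    # query becomes a single key lookup instead of a scan or a JOIN.
    albums_by_year[album["year"]].append(album["album_name"])
    albums_by_artist[album["artist_name"]].append(album["album_name"])

print(albums_by_year[1965])
print(albums_by_artist["The Beatles"])
```

The trade-off is extra writes and storage in exchange for fast, predictable reads, which is why the queries have to be known up front.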

## When to use a NoSQL Database
- Need to be able to store different data type formats: NoSQL was also created to handle different data configurations: structured, semi-structured, and unstructured data. JSON, XML documents can all be handled easily with NoSQL.
- Large amounts of data: Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be able to be horizontally scalable. The more servers/systems you add to the database the more data that can be hosted with high availability and low latency (fast reads and writes).
- Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data
- Need high throughput: While ACID transactions bring benefits they also slow down the process of reading and writing data. If you need very fast reads and writes using a relational database may not suit your needs.
- Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
- Need high availability: Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time.

## When NOT to use a NoSQL Database?
- **When you have a small dataset:** NoSQL databases were made for big datasets, not small ones. They will work, but they weren't created for that.
- **When you need ACID Transactions:** If you need a consistent database with ACID transactions, then most NoSQL databases will not be able to serve this need. Most NoSQL databases are eventually consistent and do not provide ACID transactions. However, there are exceptions: some non-relational databases, like MongoDB, can support ACID transactions.
- **When you need the ability to do JOINS across tables:** NoSQL databases such as Apache Cassandra do not support JOINS, because a JOIN would result in full table scans.
- **If you want to be able to do aggregations and analytics**
- **If you have changing business requirements:** Ad-hoc queries are possible but difficult, as the data model was designed to fit particular queries.
- **If your queries are not available and you need the flexibility:** You need to know your queries in advance. If they are not available, or you need flexibility in how you query your data, you might need to stick with a relational database.


### Caveats to NoSQL and ACID Transactions
There are some NoSQL databases that offer some form of ACID transaction. As of v4.0, MongoDB added multi-document ACID transactions within a single replica set. With their later version, v4.2, they have added multi-document ACID transactions in a sharded/partitioned deployment.

- Check out this documentation from MongoDB on [multi-document ACID transactions](https://www.mongodb.com/collateral/mongodb-multi-document-acid-transactions)
- Here is another [link documenting MongoDB's ability to handle ACID transactions](https://www.mongodb.com/blog/post/mongodb-multi-document-acid-transactions-general-availability)

Another example of a NoSQL database supporting ACID transactions is MarkLogic. 

- Check out this [post](https://www.marklogic.com/blog/how-marklogic-supports-acid-transactions/) from their blog on how MarkLogic supports ACID transactions.

The following information on setting up Cassandra on your local machine is completely optional for this course and for users who feel comfortable setting up the environment on their local machine to complete the exercises. Please note, Apache Cassandra is easier to install on MacOS than a Windows machine.

Installing Apache Cassandra to run locally on your machine:
[Cassandra Documentation](https://cassandra.apache.org/doc/latest/getting_started/installing.html)

Again, if you want to follow along, here is the demo notebook showcased in the video above.


```shell
pip install cassandra-driver
```

In [40]:
import cassandra

Then, let's create a connection to the database

In [42]:
from cassandra.cluster import Cluster

try:
    cluster = Cluster(['localhost'])
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's Test our Connection

In [43]:
try:
    session.execute('''select * from music_library''')
except Exception as e:
    print(e)

Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"


### Create a key space

In [46]:
try:
    session.execute('''
    CREATE KEYSPACE IF NOT EXISTS udacity
    WITH REPLICATION = 
    {'class':'SimpleStrategy','replication_factor':1}'''
                   )
except Exception as e:
    print(e)

### Connect to our keyspace

Compared to PostgreSQL, we do not need to create a new connection; we set the keyspace on the existing session.

In [47]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### We are working with Apache Cassandra, a NoSQL database. We can't model our data and create our table without more information.

### Think about what queries you will be performing on this data.

#### We want to be able to get every album that was released in a particular year. 
`select * from music_library WHERE YEAR=1970`

*To do that:* 
We need to be able to do a WHERE on YEAR. 

YEAR will become my partition key, and artist name will be my clustering column to make each Primary Key unique. **Remember there are no duplicates in Apache Cassandra.**

- **Table Name:** music_library
- **column 1:** Album Name
- **column 2:** Artist Name
- **column 3:** Year
- PRIMARY KEY(year, artist name)
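
To make the roles of the partition key and the clustering column more concrete, here is a toy model in plain Python. It is a sketch for intuition only, not how Apache Cassandra is implemented: `ToyTable` and its methods are hypothetical names, and a nested dictionary stands in for partitions.

```python
class ToyTable:
    """Toy stand-in for a table with PRIMARY KEY (year, artist_name)."""

    def __init__(self):
        # partition key -> {clustering column -> row}
        self.partitions = {}

    def insert(self, year, artist_name, album_name):
        # Writing to an existing primary key replaces the old row,
        # so the table never holds two rows with the same key.
        partition = self.partitions.setdefault(year, {})
        partition[artist_name] = (year, artist_name, album_name)

    def select_where_year(self, year):
        # Only the requested partition is read; rows come back
        # ordered by the clustering column.
        partition = self.partitions.get(year, {})
        return [partition[artist] for artist in sorted(partition)]


table = ToyTable()
table.insert(1970, "The Beatles", "Let It Be")
table.insert(1970, "The Beatles", "Let It Be")   # same primary key: overwrites
table.insert(1965, "The Beatles", "Rubber Soul")
result = table.select_where_year(1970)
print(result)
```

A query on `year` touches a single partition, which is why the WHERE clause has to use the partition key.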

### Now to translate this information into a Create Table Statement.
More information on Data Types can be found [here](https://datastax.github.io/python-driver)


In [48]:
query = 'CREATE TABLE IF NOT EXISTS music_library'
query = query + '(year int, artist_name text, album_name text, PRIMARY KEY(year, artist_name))'
try:
    session.execute(query)
except Exception as e:
    print(e)


In [51]:
query = 'select count(*) from music_library'
try:
    count = session.execute(query)
except Exception as e:
    print(e)
    
print(count.one())

Row(count=0)


Let's insert two rows

In [53]:
query = 'INSERT INTO music_library (year, artist_name, album_name)'
query = query + ' VALUES (%s, %s, %s)'

try:
    session.execute(query,(1970, 'The Beatles', 'Let it Be'))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, 'The Beatles', 'Rubber Soul'))
except Exception as e:
    print(e)

In [54]:
query = 'select * from music_library'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

for row in rows:
    print(row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let it Be The Beatles


Apache Cassandra never allows duplicate rows: an INSERT with an existing primary key overwrites the earlier row, so even if I run the same INSERT many times, I still get only one record.

In [55]:
query = 'select * from music_library WHERE year = 1970'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

for row in rows:
    print(row.year, row.album_name, row.artist_name)

1970 Let it Be The Beatles


In [57]:
# For the sake of the demo, I will drop the table
query = 'DROP TABLE music_library'

try:
    rows = session.execute(query)
except Exception as e:
    print(e)

In [58]:
session.shutdown()
cluster.shutdown()

### Walk through the basics of Apache Cassandra. Complete the following tasks:
- Create a table in Apache Cassandra
- Insert rows of data
- Run a simple CQL query to validate the information

#### Import Apache Cassandra python package

In [59]:
import cassandra

### Create a connection to the database

In [60]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['localhost']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)
 

### TO-DO: Create a keyspace to do the work in 

In [63]:
## TO-DO: Create the keyspace
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e)

### TO-DO: Connect to the Keyspace

In [64]:
## To-Do: Add in the keyspace you created
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 

`song_title
artist_name
year
album_name
single`

### TO-DO: You need to create a table to be able to run the following query: 
`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

In [65]:
## TO-DO: Complete the query below
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + '(song_title text, artist_name text, year int, album_name text, single Boolean, PRIMARY KEY(year, artist_name))'

try:
    session.execute(query)
except Exception as e:
    print(e)



### TO-DO: Insert the following two rows in your table
`First Row:  "Across The Universe", "The Beatles", 1970, "Let It Be", False`

`Second Row: "Think For Yourself", "The Beatles", 1965, "Rubber Soul", False`

In [67]:
## Add in query and then run the insert statement
query = "INSERT INTO music_library (song_title, artist_name, year, album_name, single)" 
query = query + " VALUES (%s, %s, %s, %s, %s)"

try:
    session.execute(query, ("Across The Universe","The Beatles", 1970,"Let It Be",False))
except Exception as e:
    print(e)
    
try:
    session.execute(query, ("Think For Yourself","The Beatles", 1965,"Rubber Soul",False))
except Exception as e:
    print(e)

### TO-DO: Validate your data was inserted into the table.

In [68]:
## TO-DO: Complete and then run the select statement to validate the data was inserted into the table
query = 'SELECT * FROM music_library'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let It Be The Beatles


### TO-DO: Validate the Data Model with the original query.

`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

In [74]:
##TO-DO: Complete the select statement to run the query 
query = "SELECT * FROM music_library WHERE year=1970 AND artist_name = 'The Beatles'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.album_name, row.artist_name)

1970 Let It Be The Beatles


### And Finally close the session and cluster connection

In [75]:
session.shutdown()
cluster.shutdown()