## Data Engineering: A brief history

Knowing a bit of the history of data engineering will help you understand how certain tools were developed, and give you more context when choosing between tools and frameworks. There are two readings we suggest.

1) [On the Evolution of Data Engineering](https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37): This short read (~5 minutes) focuses on the recent change from managing SQL databases to working with massive datasets in real time. It was written by Julien Kervizic, an experienced analytics expert from the Netherlands.

2) [Data Engineering Introduction and Epochs](https://learn.panoply.io/hubfs/Data%20Engineering%20-%20Introduction%20and%20Epochs.pdf): This slightly longer read (~20 minutes) goes further back in time to the birth of computers. It walks through four "epochs" of data engineering, and the major advances over the past 70 years. It was written by Panoply, a data engineering platform provider.


## Data Engineering Tools
Feel free to look at these resources and infographic describing the different tools used by Data Engineers.

- https://www.burtchworks.com/2018/09/10/the-rise-of-data-engineering-common-skills-and-tools/
- https://www.analyticsindiamag.com/data-engineering-101-top-tools-and-framework-resources/
- https://joviam.com/this-infographic-of-big-data-tools-will-blow-your-mind-infographic/
- https://datafloq.com/big-data-open-source-tools/os-home/

# Data Model

Databases: A database is a structured repository or collection of data that is stored and retrieved electronically for use in applications. Data can be stored, updated, or deleted from a database.

Database Management System (DBMS): The software used to access the database by the user and application is the database management system. Check out these few links describing a DBMS in more detail.

- [Introduction to DBMS](https://www.geeksforgeeks.org/introduction-of-dbms-database-management-system-set-1/)
- [DBMS as per Wikipedia](https://en.wikipedia.org/wiki/Database#Database_management_system)

## What is a Data Model?
Data modeling at a high level is all about an abstraction that organizes elements of data and how they relate to each other.

### Process
The goal of the process is to organize data into a database system so that the data is persisted and easily usable.

![Screen%20Shot%202020-05-24%20at%209.23.41%20pm.png](attachment:Screen%20Shot%202020-05-24%20at%209.23.41%20pm.png)
Entity mapping diagram

Physical data modeling transforms the logical data model into a physical database using DDL (Data Definition Language).
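As a rough illustration of that last step, the sketch below turns a small logical model into physical tables with DDL. It uses Python's built-in `sqlite3` module as a stand-in for a real RDBMS so that it runs without a server; the `customer` and `purchase` tables and their columns are hypothetical and not taken from the course.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a real RDBMS.
conn = sqlite3.connect(":memory:")

# DDL: turn the hypothetical logical entities "customer" and "purchase"
# into physical tables with types, keys, and constraints.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
""")

# List the tables that now exist in the physical schema.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
conn.close()
```

The same `CREATE TABLE` statements would need small type adjustments (for example `varchar` or `numeric`) before running on PostgreSQL.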

### Common Questions
#### Why can't everything be stored in a giant Excel spreadsheet?

- There are limits to the amount of data that can be stored in an Excel sheet, so a database helps organize the elements into tables (rows and columns, etc.). Also, reading and writing at a large scale is not possible with an Excel sheet, so it's better to use a database to handle most business functions.
#### Does data modeling happen before you create a database, or is it an iterative process?

- It's definitely an iterative process. Data engineers continually reorganize, restructure, and optimize data models to fit the needs of the organization.
#### How is data modeling different from machine learning modeling?

- Machine learning includes a lot of data wrangling to create the inputs for machine learning models, but data modeling is more about how to structure data to be used by different people within an organization. You can think of data modeling as the process of designing data and making it available to machine learning engineers, data scientists, business analytics, etc., so they can make use of it easily.



## Why is data modeling important?

1. Key points about Data Modeling

    - Data Organization: The organization of the data for your applications is extremely important and makes everyone's life easier.
    - Use cases: Having a well thought out and organized data model is critical to how that data can later be used. Queries that could have been straightforward and simple might become complicated queries if data modeling isn't well thought out.
    - Starting early: Thinking and planning ahead will help you be successful. This is not something you want to leave until the last minute.
    - Iterative Process: Data modeling is not a fixed process. It is iterative as new requirements and data are introduced. Having flexibility will help as new information becomes available.

2. Example of Why Data Modeling is Important:

Let's take an example from Udacity. Here, a Udacity data engineer would help structure the data so it can be used by different people within Udacity for further analysis and also shared with the learner on the website. For instance, when we want to track the students' progress within a Nanodegree program, we want to aggregate data across students and projects within a Nanodegree. In a relational database, this requires the data to be structured in ways that each student's data is tracked across all Nanodegree programs that s/he has ever enrolled in. The data also needs to track the student's progress within each of those Nanodegree programs.

The data model is critical for accurately representing each data object. For instance, a data table would track a student's progress on project submissions, i.e., whether they passed or failed a specific rubric requirement. Furthermore, the data model should ensure that a student's progress is updated and aggregated to provide an indicator of whether the student passed all the rubric requirements and successfully finished the project. Data modeling is critical to track all of these pieces of data so the tables are speaking to each other, updating the tables correctly (e.g., updating a student's progress on a project submission), and meeting defined rules (e.g., project completed when all rubric requirements are passed).
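The rule "project completed when all rubric requirements are passed" can be sketched as an aggregation over a table of rubric results. This is only an illustrative model, not Udacity's actual schema: the `rubric_result` table, its columns, and the sample rows are all hypothetical, and `sqlite3` stands in for a production database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE rubric_result (
    student_id  INTEGER NOT NULL,
    project_id  INTEGER NOT NULL,
    requirement TEXT    NOT NULL,
    passed      INTEGER NOT NULL CHECK (passed IN (0, 1)),
    PRIMARY KEY (student_id, project_id, requirement)
)""")

# Hypothetical sample data: student 1 passes everything, student 2 fails one item.
rows = [
    (1, 10, "schema defined", 1),
    (1, 10, "queries run", 1),
    (2, 10, "schema defined", 1),
    (2, 10, "queries run", 0),
]
conn.executemany("INSERT INTO rubric_result VALUES (?, ?, ?, ?)", rows)

# A project counts as passed only if every requirement passed, i.e. MIN(passed) = 1.
progress = conn.execute("""
    SELECT student_id, project_id, MIN(passed) AS project_passed
    FROM rubric_result
    GROUP BY student_id, project_id
    ORDER BY student_id
""").fetchall()
print(progress)
conn.close()
```

Because the rule is derived from the stored rows rather than kept in a separate flag, updating one rubric result automatically changes the aggregated project status.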

## Relational and NoSQL Databases

Relational and NoSQL databases approach data modeling differently.

### Relational Model:
This model organizes data into one or more tables (or relations) of columns and rows, with a unique key identifying each row. Generally, each table represents one entity type (such as customer or product).

A software system used to maintain relational databases is a relational database management system (RDBMS).

SQL (Structured Query Language) is the language used across almost all relational database systems for querying and maintaining the database.

### The basics
- Database/Schema is a collection of tables

- Tables/Relations are groups of rows sharing the same labeled elements, such as a customer table

- Columns/Attributes are labeled elements, such as name, email, city

- Rows/Tuples are single items, such as an individual's record


### Advantages of Using a Relational Database
- **Flexibility for writing in SQL queries:** SQL is the most common database query language.
- **Modeling the data not modeling queries**
- **Ability to do JOINS**
- **Ability to do aggregations and analytics**
- **Secondary Indexes available** : You have the advantage of being able to add another index to help with quick searching.
- **Smaller data volumes:** If you have a smaller data volume (and not big data) you can use a relational database for its simplicity.
- **ACID Transactions:** Allows you to meet a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, and thus maintain data integrity.
- **Easier to change to business requirements**
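Three of the advantages listed above (JOINs, aggregations, and secondary indexes) can be seen together in a short sketch. It assumes hypothetical `customer` and `purchase` tables and uses the standard-library `sqlite3` module as a stand-in for a relational database such as PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
-- Secondary index: speeds up lookups by city, which is not the primary key.
CREATE INDEX idx_customer_city ON customer (city);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Ana", "Lisbon"), (2, "Ben", "Oslo")])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?)",
                 [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5)])

# JOIN the two tables, then aggregate spending per customer.
totals = conn.execute("""
    SELECT c.name, SUM(p.amount)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(totals)
conn.close()
```

Note that the tables were modeled around the data (customers and purchases), and the query was written afterwards; this is the "modeling the data, not modeling queries" point from the list.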


### ACID Transactions
Properties of database transactions intended to guarantee validity even in the event of errors or power failures.

- Atomicity: The whole transaction is processed or nothing is processed. A commonly cited example of an atomic transaction is a money transfer between two bank accounts. The transfer is made up of two operations. First, you withdraw money from one account, and second, you deposit the withdrawn money into the second account. An atomic transaction, i.e., one in which either all operations occur or none occur, keeps the database in a consistent state. This ensures that if either of those two operations (withdrawing money from the first account or depositing it into the second account) fails, the money is neither lost nor created. See [Wikipedia](https://en.wikipedia.org/wiki/Atomicity_%28database_systems%29) for a detailed description of this example.

- Consistency: Only transactions that abide by constraints and rules are written into the database, otherwise the database keeps the previous state. The data should be correct across all rows and tables. Check out additional information about consistency on [Wikipedia](https://en.wikipedia.org/wiki/Consistency_%28database_systems%29).

- Isolation: Transactions are processed independently and securely; order does not matter. A low level of isolation enables many users to access the data simultaneously, but it also increases the possibility of concurrency effects (e.g., dirty reads or lost updates). A high level of isolation reduces the chance of concurrency effects, but it uses more system resources and can cause transactions to block each other. Source: [Wikipedia](https://en.wikipedia.org/wiki/Isolation_%28database_systems%29). For example, while one user's transaction is in progress, other users touching the same data may be blocked until it finishes.

- Durability: Completed transactions are saved to database even in cases of system failure. A commonly cited example includes tracking flight seat bookings. So once the flight booking records a confirmed seat booking, the seat remains booked even if a system failure occurs. Source: [Wikipedia](https://en.wikipedia.org/wiki/ACID).
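The bank-transfer example for atomicity can be sketched in a few lines. This is a minimal illustration under stated assumptions: `sqlite3` stands in for a full RDBMS, and the `account` table, the `transfer` helper, and the balances are hypothetical. The `CHECK` constraint plays the role of a consistency rule (no negative balances).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the earlier credit was undone too.
        return False

ok = transfer(conn, "a", "b", 30)       # succeeds
failed = transfer(conn, "a", "b", 500)  # would overdraw "a", so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
conn.close()
```

In the failed transfer the credit to `b` runs first and is then rolled back when the debit violates the constraint, so no money is created.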

## When Not to Use a Relational Database
- **Have large amounts of data:** Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. You are limited by how much you can scale and how much data you can store on one machine. You cannot add more machines like you can in NoSQL databases.
- **Need to be able to store different data type formats:** Relational databases are not designed to handle unstructured data.
- **Need high throughput -- fast reads:** While ACID transactions bring benefits, they also slow down the process of reading and writing data. If you need very fast reads and writes, using a relational database may not suit your needs.
- **Need a flexible schema:** Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
- **Need high availability:** Because relational databases are not distributed (and even when they are, they have a coordinator/worker architecture), they have a single point of failure. When that database goes down, a fail-over to a backup system occurs and takes time.
- **Need horizontal scalability:** Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data.

## PostgreSQL

- Open-source object-relational database system
- Uses and builds on SQL language



The following information on setting up PostgreSQL on your local machine is completely optional for this course and for users who feel comfortable setting up the environment on their local machine to complete the exercises.

Here is additional information on how to install and set up Postgres locally in case you want to follow along with the demo on your local machine. This [link](https://www.codementor.io/@engineerapart/getting-started-with-postgresql-on-mac-osx-are8jcopb) provides directions for macOS. It goes through configuring Postgres, creating users, and creating databases using the psql utility. It will help further explain the Python driver and also help you in running the demos locally.

In addition, here is a short tutorial on psycopg2. This [link](https://pynative.com/python-postgresql-tutorial/) gives a good starter tutorial in case you are curious about it.
Here are the two demo files. Feel free to follow along. Just download these and open up the Jupyter Notebook files.

### Walk through the basics of PostgreSQL. You will need to complete the following tasks:
- Create a table in PostgreSQL
- Insert rows of data
- Run a simple SQL query to validate the information
    
#### Import the library 
*Note:* An error might pop up after this command has executed. If it does, read it carefully before ignoring it.

In [4]:
import psycopg2

### Create a connection to the database

In [30]:
try: 
    conn = psycopg2.connect("host=localhost dbname=studentdb user=edifierxuhao password=****")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)

### Use the connection to get a cursor that can be used to execute queries.

In [31]:
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)

### TO-DO: Set automatic commit to be true so that each action is committed without having to call conn.commit() after each command. 

In [32]:
conn.set_session(autocommit = True)

### TO-DO: Create a database to do the work in. 

In [33]:
## TO-DO: Add the database name within the CREATE DATABASE statement. You can choose your own db name.
try: 
    cur.execute("create database mydb")
except psycopg2.Error as e:
    print(e)

#### TO-DO: Add the database name in the connect statement. Let's close our connection to the default database, reconnect to the Udacity database, and get a new cursor.

In [34]:
## TO-DO: Add the database name within the connect statement
try: 
    conn.close()
except psycopg2.Error as e:
    print(e)
    
try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=****")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)
    
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)

conn.set_session(autocommit=True)

### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 

`song_title
artist_name
year
album_name
single`



In [35]:
## TO-DO: Finish writing the CREATE TABLE statement with the correct arguments
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS music_library (song_title varchar, artist_name varchar, year int, album_name varchar, single boolean);")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

### TO-DO: Insert the following two rows in the table
`First Row:  "Across The Universe", "The Beatles", 1970, "Let It Be", False`

`Second Row: "Think For Yourself", "The Beatles", 1965, "Rubber Soul", False`

In [36]:
## TO-DO: Finish the INSERT INTO statement with the correct arguments

try: 
    cur.execute("INSERT INTO music_library (song_title,artist_name, year, album_name, single) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 ("Across The Universe", "The Beatles", 1970,"Let It Be",False))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_library (song_title,artist_name, year, album_name, single) \
                  VALUES (%s, %s, %s, %s, %s)",
                  ("Think For Yourself", "The Beatles",1965,"Rubber Soul",False))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### TO-DO: Validate your data was inserted into the table. 

In [38]:
## TO-DO: Finish the SELECT * Statement 
try: 
    cur.execute("SELECT * FROM music_library;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

('Across The Universe', 'The Beatles', 1970, 'Let It Be', False)
('Think For Yourself', 'The Beatles', 1965, 'Rubber Soul', False)


### And finally close your cursor and connection. 

In [39]:
cur.close()
conn.close()

## NoSQL Databases

NoSQL databases have a simpler design, simpler horizontal scaling, and finer control of availability. The data structures they use differ from those in relational databases and make some operations faster.

NoSQL = Not Only SQL; NoSQL and NonRelational are interchangeable terms.

Common Types of NoSQL Databases
- Apache Cassandra (Partition Row store)
- MongoDB (Document Store)
- DynamoDB (key-value store)
- Apache HBase (Wide Column Store)
- Neo4j (Graph Database)

We will use Apache Cassandra to explain the concepts of data modeling for NoSQL Databases

### The basics of Apache Cassandra
- Keyspace is a collection of tables
- Table is a group of partitions
- Row is a single item

![Screen%20Shot%202020-05-24%20at%2011.34.07%20pm.png](attachment:Screen%20Shot%202020-05-24%20at%2011.34.07%20pm.png)

### Common Questions:
#### What type of companies use Apache Cassandra?
All kinds of companies. For example, Uber uses Apache Cassandra for their entire backend. Netflix uses Apache Cassandra to serve all their videos to customers. Good use cases for NoSQL (and more specifically Apache Cassandra) are :

1. Transaction logging (retail, health care)
2. Internet of Things (IoT)
3. Time series data
4. Any workload that is heavy on writes to the database (since Apache Cassandra is optimized for writes).

#### Would Apache Cassandra be a hindrance for my analytics work? If yes, why?
Yes, if you are trying to do analysis, such as using GROUP BY statements. Since Apache Cassandra requires data modeling based on the query you want, you can't do ad-hoc queries. However you can add clustering columns into your data model and create new tables.

## When to use a NoSQL Database
- Need to be able to store different data type formats: NoSQL was also created to handle different data configurations: structured, semi-structured, and unstructured data. JSON, XML documents can all be handled easily with NoSQL.
- Large amounts of data: Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be able to be horizontally scalable. The more servers/systems you add to the database the more data that can be hosted with high availability and low latency (fast reads and writes).
- Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data
- Need high throughput: While ACID transactions bring benefits they also slow down the process of reading and writing data. If you need very fast reads and writes using a relational database may not suit your needs.
- Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
- Need high availability: Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time.

## When NOT to use a NoSQL Database?
- **When you have a small dataset:** NoSQL databases were made for big datasets, not small ones; they will work with a small dataset, but they weren't created for that.
- **When you need ACID Transactions:** If you need a consistent database with ACID transactions, then most NoSQL databases will not be able to serve this need. Most NoSQL databases are eventually consistent and do not provide ACID transactions. However, there are exceptions. Some non-relational databases like MongoDB can support ACID transactions.
- **When you need the ability to do JOINS across tables:** NoSQL does not allow JOINS, because they would result in full table scans.
- **If you want to be able to do aggregations and analytics**
- **If you have changing business requirements:** Ad-hoc queries are possible but difficult, as the data model was designed to fit particular queries.
- **If your queries are not available and you need the flexibility :** You need your queries in advance. If those are not available or you will need to be able to have flexibility on how you query your data you might need to stick with a relational database
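The "queries first" constraint in the last two points can be pictured with plain Python dictionaries. This is only a toy analogy, not how any NoSQL engine is implemented: each dictionary plays the role of a table built for exactly one query, and the variable names are hypothetical.

```python
from collections import defaultdict

albums = [
    {"year": 1970, "artist_name": "The Beatles", "album_name": "Let It Be"},
    {"year": 1965, "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1965, "artist_name": "The Who", "album_name": "My Generation"},
]

# One "table" per query, each keyed by the column that query filters on.
albums_by_year = defaultdict(list)    # serves: albums released in a given year
albums_by_artist = defaultdict(list)  # serves: albums by a given artist
for album in albums:
    albums_by_year[album["year"]].append(album["album_name"])
    albums_by_artist[album["artist_name"]].append(album["album_name"])

print(albums_by_year[1965])
print(albums_by_artist["The Beatles"])
```

Each lookup is fast because the data is already organized for it, but a question nobody planned for (say, albums by title prefix) has no matching structure and would need a new table, which is the flexibility trade-off described above.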


### Caveats to NoSQL and ACID Transactions
There are some NoSQL databases that offer some form of ACID transaction. As of v4.0, MongoDB added multi-document ACID transactions within a single replica set. With their later version, v4.2, they have added multi-document ACID transactions in a sharded/partitioned deployment.

- Check out this documentation from MongoDB on [multi-document ACID transactions](https://www.mongodb.com/collateral/mongodb-multi-document-acid-transactions)
- Here is another [link documenting MongoDB's ability to handle ACID transactions](https://www.mongodb.com/blog/post/mongodb-multi-document-acid-transactions-general-availability)

Another example of a NoSQL database supporting ACID transactions is MarkLogic. 

- Check out this [link](https://www.marklogic.com/blog/how-marklogic-supports-acid-transactions/) from their blog that offers ACID transactions.

The following information on setting up Cassandra on your local machine is completely optional for this course and for users who feel comfortable setting up the environment on their local machine to complete the exercises. Please note, Apache Cassandra is easier to install on MacOS than a Windows machine.

Installing Apache Cassandra to run locally on your machine:
[Cassandra Documentation](https://cassandra.apache.org/doc/latest/getting_started/installing.html)

Again, if you want to follow along, here is the demo notebook showcased in the video above.


```shell
pip install cassandra-driver
```

In [40]:
import cassandra

Then, let's create a connection to the database

In [42]:
from cassandra.cluster import Cluster

try:
    cluster = Cluster(['localhost'])
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's Test our Connection

In [43]:
try:
    session.execute('''select * from music_library''')
except Exception as e:
    print(e)

Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"


### Create a key space

In [46]:
try:
    session.execute('''
    CREATE KEYSPACE IF NOT EXISTS udacity
    WITH REPLICATION = 
    {'class':'SimpleStrategy','replication_factor':1}'''
                   )
except Exception as e:
    print(e)

### Connect to our keyspace

Compared to PostgreSQL, we do not need to create a new connection; we only set the keyspace on the existing session.

In [47]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

We are working with Apache Cassandra, a NoSQL database, so we can't model our data and create our table without more information.

### Think about what queries you will be performing on this data.

#### We want to be able to get every album that was released in a particular year. 
`select * from music_library WHERE YEAR=1970`

*To do that:* 
We need to be able to do a WHERE on YEAR. 

YEAR will become my partition key, and artist name will be my clustering column to make each Primary Key unique. **Remember there are no duplicates in Apache Cassandra.**

- **Table Name:** music_library
- **column 1:** Album Name,
- **column 2:** Artist Name,
- **column 3:** Year,
- PRIMARY KEY(year, artist name)
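To build intuition for what `PRIMARY KEY(year, artist name)` implies, here is a toy in-memory model written in plain Python. It is a deliberately simplified sketch, not Cassandra's storage engine: the `ToyTable` class and its methods are hypothetical, and it only mimics two behaviors, namely grouping rows by partition key and overwriting a row when the full primary key repeats.

```python
class ToyTable:
    """Toy stand-in for a table with PRIMARY KEY (year, artist_name)."""

    def __init__(self):
        # partition key (year) -> {clustering column (artist_name) -> row}
        self.partitions = {}

    def insert(self, year, artist_name, album_name):
        partition = self.partitions.setdefault(year, {})
        # Same primary key => the existing row is overwritten, not duplicated.
        partition[artist_name] = {"year": year, "artist_name": artist_name,
                                  "album_name": album_name}

    def select_where_year(self, year):
        # Only one partition is touched; rows come back sorted by clustering column.
        partition = self.partitions.get(year, {})
        return [partition[name] for name in sorted(partition)]


table = ToyTable()
table.insert(1970, "The Beatles", "Let It Be")
table.insert(1965, "The Beatles", "Rubber Soul")
table.insert(1970, "The Beatles", "Let It Be")  # repeated insert, same primary key

result = table.select_where_year(1970)
print(result)
```

One consequence worth noticing in this model: two different albums by the same artist in the same year would collide on the primary key, so a real design might add the album name as a further clustering column.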

### Now to translate this information into a Create Table Statement.
More information on Data Types can be found [here](https://datastax.github.io/python-driver)


In [48]:
query = 'CREATE TABLE IF NOT EXISTS music_library'
query = query + '(year int, artist_name text, album_name text, PRIMARY KEY(year, artist_name))'
try:
    session.execute(query)
except Exception as e:
    print(e)


In [51]:
query = 'select count(*) from music_library'
try:
    count = session.execute(query)
except Exception as e:
    print(e)
    
print(count.one())

Row(count=0)


Let's insert two rows

In [53]:
query = 'INSERT INTO music_library (year, artist_name, album_name)'
query = query + 'VALUES (%s, %s, %s)'

try:
    session.execute(query,(1970, 'The Beatles', 'Let it Be'))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, 'The Beatles', 'Rubber Soul'))
except Exception as e:
    print(e)

In [54]:
query = 'select * from music_library'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

for row in rows:
    print(row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let it Be The Beatles


Apache Cassandra does not store duplicate rows: an INSERT with an existing primary key overwrites the row, so no matter how many times I run the same INSERT, I get only one record.

In [55]:
query = 'select * from music_library WHERE year = 1970'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

for row in rows:
    print(row.year, row.album_name, row.artist_name)

1970 Let it Be The Beatles


In [57]:
# For the sake of the demo, I will drop the table
query = 'DROP TABLE music_library'

try:
    rows = session.execute(query)
except Exception as e:
    print(e)

In [58]:
session.shutdown()
cluster.shutdown()

### Walk through the basics of Apache Cassandra. Complete the following tasks:
- Create a table in Apache Cassandra
- Insert rows of data
- Run a simple CQL query to validate the information

#### Import Apache Cassandra python package

In [59]:
import cassandra

### Create a connection to the database

In [60]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['localhost']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)
 

### TO-DO: Create a keyspace to do the work in 

In [63]:
## TO-DO: Create the keyspace
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e)

### TO-DO: Connect to the Keyspace

In [64]:
## To-Do: Add in the keyspace you created
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Create a Song Library that contains a list of songs, including the song name, artist name, year, album it was from, and if it was a single. 

`song_title
artist_name
year
album_name
single`

### TO-DO: You need to create a table to be able to run the following query: 
`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

In [65]:
## TO-DO: Complete the query below
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + '(song_title text, artist_name text, year int, album_name text, single Boolean, PRIMARY KEY(year, artist_name))'

try:
    session.execute(query)
except Exception as e:
    print(e)



### TO-DO: Insert the following two rows in your table
`First Row:  "Across The Universe", "The Beatles", 1970, "Let It Be", False`

`Second Row: "Think For Yourself", "The Beatles", 1965, "Rubber Soul", False`

In [67]:
## Add in query and then run the insert statement
query = "INSERT INTO music_library (song_title, artist_name, year, album_name, single)" 
query = query + " VALUES (%s, %s, %s, %s, %s)"

try:
    session.execute(query, ("Across The Universe","The Beatles", 1970,"Let It Be",False))
except Exception as e:
    print(e)
    
try:
    session.execute(query, ("Think For Yourself","The Beatles", 1965,"Rubber Soul",False))
except Exception as e:
    print(e)

### TO-DO: Validate your data was inserted into the table.

In [68]:
## TO-DO: Complete and then run the select statement to validate the data was inserted into the table
query = 'SELECT * FROM music_library'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let It Be The Beatles


### TO-DO: Validate the Data Model with the original query.

`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

In [74]:
##TO-DO: Complete the select statement to run the query 
query = "SELECT * FROM music_library WHERE year=1970 AND artist_name = 'The Beatles'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.album_name, row.artist_name)

1970 Let It Be The Beatles


### And Finally close the session and cluster connection

In [75]:
session.shutdown()
cluster.shutdown()

# Relational Data Models

Students will learn the fundamentals of how to do relational data modeling by focusing on normalization, denormalization, fact/dimension tables, and different schema models.

## Databases
### Rule 1: The information rule:
All information in a relational database is represented explicitly at the logical level and in exactly one way – by values in tables.

More information on Codd's 12 Rules can be found here:
[Wikipedia link](https://en.wikipedia.org/wiki/Codd%27s_12_rules)

- Rule 0: The foundation rule:

    For any system that is advertised as, or claimed to be, a relational data base management system, that system must be able to manage data bases entirely through its relational capabilities.
- Rule 1: The information rule:

    All information in a relational data base is represented explicitly at the logical level and in exactly one way – by values in tables.
- Rule 2: The guaranteed access rule:

    Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
- Rule 3: Systematic treatment of null values:

    Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.
- Rule 4: Dynamic online catalog based on the relational model:

    The data base description is represented at the logical level in the same way as ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data.
- Rule 5: The comprehensive data sublanguage rule:

    A relational system may support several languages and various modes of terminal use (for example, the fill-in-the-blanks mode). However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all of the following items:
    - Data definition.
    - View definition.
    - Data manipulation (interactive and by program).
    - Integrity constraints.
    - Authorization.
    - Transaction boundaries (begin, commit and rollback).
- Rule 6: The view updating rule:

    All views that are theoretically updatable are also updatable by the system.
- Rule 7: Possible for high-level insert, update, and delete:

    The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data.
- Rule 8: Physical data independence:

    Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods.
- Rule 9: Logical data independence:

    Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
- Rule 10: Integrity independence:

    Integrity constraints specific to a particular relational data base must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
- Rule 11: Distribution independence:

    The end-user must not be able to see that the data is distributed over various locations. Users should always get the impression that the data is located at one site only.
- Rule 12: The nonsubversion rule:

    If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time).

### Importance of Relational Databases:
- **Standardization of data model:** Once your data is transformed into the rows-and-columns format, it is standardized and you can query it with SQL.
- **Flexibility in adding and altering tables:** Relational databases give you the flexibility to add tables, alter tables, and add and remove data.
- **Data Integrity:** Data integrity is the backbone of using a relational database.
- **Structured Query Language (SQL):** A standard, predefined language can be used to access the data.
- **Simplicity:** Data is systematically stored and modeled in tabular format.
- **Intuitive Organization:** The spreadsheet-like format is intuitive to work with when modeling data in relational databases.

### OLAP vs OLTP
#### Online Analytical Processing (OLAP):
Databases optimized for these workloads allow for complex analytical and ad hoc queries, including aggregations. These types of databases are optimized for reads.

#### Online Transactional Processing (OLTP):
Databases optimized for these workloads allow for less complex queries in large volume. The types of queries for these databases are read, insert, update, and delete.

The key to remembering the difference between OLAP and OLTP is analytics (A) vs. transactions (T). If you want to get the price of a shoe, you are using OLTP (this involves little or no aggregation). If you want to know the total number of shoes a particular store sold, you need OLAP (since this requires aggregation).

Additional Resource on the difference between OLTP and OLAP:
This [Stack Overflow post](https://stackoverflow.com/questions/21900185/what-are-oltp-and-olap-what-is-the-difference-between-them) describes it well.
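
The shoe example can be sketched in a few lines. This is only an illustration: it uses Python's built-in `sqlite3` with an in-memory database so it runs without a server, and the `shoe_sales` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for a real database (an assumption for portability).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, invented for this illustration.
cur.execute(
    "CREATE TABLE shoe_sales (sale_id INTEGER, store TEXT, shoe TEXT, "
    "price REAL, quantity INTEGER)"
)
cur.executemany(
    "INSERT INTO shoe_sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Downtown", "Runner", 80.0, 2),
        (2, "Downtown", "Sandal", 30.0, 1),
        (3, "Uptown", "Runner", 80.0, 3),
    ],
)

# OLTP-style query: look up a single value, no aggregation.
cur.execute("SELECT price FROM shoe_sales WHERE shoe = ? LIMIT 1", ("Runner",))
runner_price = cur.fetchone()[0]

# OLAP-style query: aggregate over many rows.
cur.execute("SELECT SUM(quantity) FROM shoe_sales WHERE store = ?", ("Downtown",))
downtown_total = cur.fetchone()[0]

print(runner_price, downtown_total)
conn.close()
```

The first query touches one row and returns it as stored; the second has to scan and combine rows, which is the kind of work OLAP systems are tuned for.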

## Normalization: To reduce data redundancy and increase data integrity

Reduce repeated data (less redundancy), and make sure the answer you get back from the database is the correct answer (more integrity).

## Denormalization: Must be done in read heavy workloads to increase performance

## Normal Form:

### Objectives of Normal Form:
1. To free the database from unwanted insertion, update, and deletion dependencies (if data changes, you only need to update it once)
2. To reduce the need for refactoring the database as new types of data are introduced (to add a new feature, you do not need to redesign the database, only add a new column or a new table)
3. To make the relational model more informative to users
4. To make the database neutral to the query statistics
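
The first objective can be made concrete with a small sketch. It assumes Python's built-in `sqlite3` (in memory) rather than PostgreSQL, and the table `flat_transactions` is a hypothetical, unnormalized layout invented for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized (hypothetical): the cashier name is repeated on every row.
cur.execute("CREATE TABLE flat_transactions (transaction_id INTEGER, cashier_name TEXT)")
cur.executemany(
    "INSERT INTO flat_transactions VALUES (?, ?)",
    [(1, "Sam"), (2, "Sam"), (3, "Bob")],
)

# Normalized: the name is stored once and referenced by id.
cur.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, employee_name TEXT)")
cur.execute("CREATE TABLE transactions (transaction_id INTEGER, cashier_id INTEGER)")
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Sam"), (2, "Bob")])
cur.executemany("INSERT INTO transactions VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# Renaming a cashier touches every copy in the flat table...
cur.execute("UPDATE flat_transactions SET cashier_name = 'Samuel' WHERE cashier_name = 'Sam'")
flat_rows_touched = cur.rowcount

# ...but only one row in the normalized design.
cur.execute("UPDATE employees SET employee_name = 'Samuel' WHERE employee_id = 1")
normalized_rows_touched = cur.rowcount

print(flat_rows_touched, normalized_rows_touched)
conn.close()
```

With more rows the gap grows: the flat table must rewrite every transaction handled by that cashier, while the normalized design still updates a single row.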


### Process
![Screen%20Shot%202020-05-26%20at%2010.42.33%20pm.png](attachment:Screen%20Shot%202020-05-26%20at%2010.42.33%20pm.png)

1. How to reach First Normal Form (1NF):

    - Atomic values: each cell contains a single value, not a list or collection
    - Be able to add data without altering tables
    - Separate different relations into different tables
    - Keep relationships between tables together with foreign keys
2. Second Normal Form (2NF):

    - Have reached 1NF
    - All columns in the table must rely on the Primary Key
3. Third Normal Form (3NF):

    - Must be in 2nd Normal Form
    - No transitive dependencies
    - Remember, the transitive dependency you are trying to avoid is this: to get from A -> C, you should not have to go through B.
    
    **When to use 3NF:**

    - When you want to update data, you want to be able to do it in just one place. We want to avoid updating the Customers Detail table (in the example in the lecture slide).
    
As a matter of fact, normal forms go up to sixth normal form (6NF), but 4NF through 6NF are mostly used in academic study.
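
One way to look for a transitive dependency in sample data is to check which columns determine which. The helper `determines` below is a hypothetical name written for this sketch, and a data sample can only refute a dependency, never prove it holds in general.

```python
def determines(rows, lhs, rhs):
    """Return True if every value of column `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample rows modeled on the music store example used in these notes.
rows = [
    {"transaction_id": 1, "cashier_id": 1, "cashier_name": "Sam"},
    {"transaction_id": 2, "cashier_id": 1, "cashier_name": "Sam"},
    {"transaction_id": 3, "cashier_id": 2, "cashier_name": "Bob"},
]

# A -> B and B -> C both hold, so C depends on A only through B.
a_to_b = determines(rows, "transaction_id", "cashier_id")
b_to_c = determines(rows, "cashier_id", "cashier_name")
# The reverse direction fails: one cashier handles several transactions.
c_to_a = determines(rows, "cashier_name", "transaction_id")

print(a_to_b, b_to_c, c_to_a)
```

When both `A -> B` and `B -> C` hold and B is not a key, moving B and C into their own table removes the transitive dependency.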

### Lesson 2 Exercise 1: Creating Normalized Tables

In [1]:
import psycopg2

__Create a connection to the database, get a cursor, and set autocommit to true__

In [2]:
try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=******")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)
conn.set_session(autocommit=True)

#### Let's imagine we have a table called Music Store. 

`Table Name: music_store
column 0: Transaction Id
column 1: Customer Name
column 2: Cashier Name
column 3: Year 
column 4: Albums Purchased`


## Now to translate this information into a CREATE Table Statement and insert the data

![Screen%20Shot%202020-05-26%20at%2011.17.15%20pm.png](attachment:Screen%20Shot%202020-05-26%20at%2011.17.15%20pm.png)

In [3]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS music_store (Transaction_Id int,\
                                                         Customer_Name VARCHAR,\
                                                         Cashier_Name VARCHAR,\
                                                         year int,\
                                                         Albums_Purchased text[])")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (1,'Amanda','Sam',2000,['Rubber Soul','Let it Be']))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (2,'Toby','Sam',2000,['My Generation']))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (3,'Max','Bob',2018,['Meet the Beatles','Help!']))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
    
try: 
    cur.execute("SELECT * FROM music_store;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

(1, 'Amanda', 'Sam', 2000, ['Rubber Soul', 'Let it Be'])
(2, 'Toby', 'Sam', 2000, ['My Generation'])
(3, 'Max', 'Bob', 2018, ['Meet the Beatles', 'Help!'])


#### Moving to 1st Normal Form (1NF)

### TO-DO: This data has not been normalized. To get this data into 1st normal form, you need to remove any collections or lists of data and break up the list of albums into individual rows. 



In [4]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS music_store2 (Transaction_Id int,\
                                                         Customer_Name VARCHAR,\
                                                         Cashier_Name VARCHAR,\
                                                         year int,\
                                                         Albums_Purchased VARCHAR)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store2 (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (1,'Amanda','Sam',2000,'Rubber Soul'))

except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO music_store2 (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (1,'Amanda','Sam',2000,'Let it Be'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store2 (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (2,'Toby','Sam',2000,'My Generation'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store2 (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (3,'Max','Bob',2018,'Meet the Beatles'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO music_store2 (Transaction_Id,Customer_Name,Cashier_Name,year,Albums_Purchased) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (3,'Max','Bob',2018,'Help!'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("SELECT * FROM music_store2;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(1, 'Amanda', 'Sam', 2000, 'Rubber Soul')
(1, 'Amanda', 'Sam', 2000, 'Let it Be')
(2, 'Toby', 'Sam', 2000, 'My Generation')
(3, 'Max', 'Bob', 2018, 'Meet the Beatles')
(3, 'Max', 'Bob', 2018, 'Help!')
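
The same move to 1NF can be done programmatically rather than by typing each row. This is a sketch under assumptions: it uses the built-in `sqlite3` (in memory) instead of PostgreSQL, so placeholders are `?` rather than `%s`, and `executemany` replaces the repeated `INSERT` blocks.

```python
import sqlite3

# Rows as they appear in the unnormalized music_store table.
music_store = [
    (1, "Amanda", "Sam", 2000, ["Rubber Soul", "Let it Be"]),
    (2, "Toby", "Sam", 2000, ["My Generation"]),
    (3, "Max", "Bob", 2018, ["Meet the Beatles", "Help!"]),
]

# One output row per album, so every cell holds a single value.
first_normal_form = [
    (tid, customer, cashier, year, album)
    for tid, customer, cashier, year, albums in music_store
    for album in albums
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE music_store2 (transaction_id INTEGER, customer_name TEXT, "
    "cashier_name TEXT, year INTEGER, albums_purchased TEXT)"
)
# Insert all rows in one call.
cur.executemany("INSERT INTO music_store2 VALUES (?, ?, ?, ?, ?)", first_normal_form)

cur.execute("SELECT COUNT(*) FROM music_store2")
row_count = cur.fetchone()[0]
print(row_count)
conn.close()
```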


#### Moving to 2nd Normal Form (2NF)
You have now moved the data into 1NF, which is the first step toward 2nd Normal Form. The table is not yet in 2nd Normal Form: while each record in the table is unique, the transaction id, which we want to use as the Primary Key, is not unique. 

### TO-DO: Break up the table into two tables, transactions and albums sold. 



In [5]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS transactions (transaction_Id int,\
                                                          Customer_Name VARCHAR,\
                                                          Cashier_Name VARCHAR,\
                                                          year int)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS albums_sold (Id int,\
                                                         transaction_Id int,\
                                                         Albums_Purchased VARCHAR);")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions (transaction_Id,Customer_Name,Cashier_Name,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (1, 'Amanda', 'Sam', 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO transactions (transaction_Id,Customer_Name,Cashier_Name,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (2, 'Toby', 'Sam', 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions (transaction_Id,Customer_Name,Cashier_Name,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (3, 'Max', 'Bob', 2018))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (Id,transaction_Id,Albums_Purchased) \
                 VALUES (%s, %s, %s)", \
                 (1,1,'Rubber Soul'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO albums_sold (Id,transaction_Id,Albums_Purchased) \
                 VALUES (%s, %s, %s)", \
                 (2,1,'Let it Be'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (Id,transaction_Id,Albums_Purchased) \
                 VALUES (%s, %s, %s)", \
                 (3,2,'My Generation'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (Id,transaction_Id,Albums_Purchased) \
                 VALUES (%s, %s, %s)", \
                 (4,3,'Meet the Beatles'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO albums_sold (Id,transaction_Id,Albums_Purchased) \
                 VALUES (%s, %s, %s)", \
                 (5,3,'Help!'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

print("Table: transactions\n")
try: 
    cur.execute("SELECT * FROM transactions;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

print("\nTable: albums_sold\n")
try: 
    cur.execute("SELECT * FROM albums_sold;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)
row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

Table: transactions

(1, 'Amanda', 'Sam', 2000)
(2, 'Toby', 'Sam', 2000)
(3, 'Max', 'Bob', 2018)

Table: albums_sold

(1, 1, 'Rubber Soul')
(2, 1, 'Let it Be')
(3, 2, 'My Generation')
(4, 3, 'Meet the Beatles')
(5, 3, 'Help!')
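
The split into `transactions` and `albums_sold` can also be derived from the 1NF rows in plain Python. This is a sketch only; the variable names are chosen for the example, and the album ids are simply assigned in row order.

```python
# The 1NF rows from music_store2.
rows_1nf = [
    (1, "Amanda", "Sam", 2000, "Rubber Soul"),
    (1, "Amanda", "Sam", 2000, "Let it Be"),
    (2, "Toby", "Sam", 2000, "My Generation"),
    (3, "Max", "Bob", 2018, "Meet the Beatles"),
    (3, "Max", "Bob", 2018, "Help!"),
]

# Keep each transaction once. dict.fromkeys removes duplicates and preserves order.
transactions = list(
    dict.fromkeys((tid, customer, cashier, year) for tid, customer, cashier, year, _ in rows_1nf)
)

# One row per album, with a new id and a reference back to its transaction.
albums_sold = [
    (album_id, row[0], row[4]) for album_id, row in enumerate(rows_1nf, start=1)
]

print(transactions)
print(albums_sold)
```

After the split, `transaction_id` is unique in `transactions`, and `albums_sold` points back to it as a foreign key.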


### TO-DO: Do a `JOIN` on these tables to get all the information in the original first Table. 

In [10]:
try: 
    cur.execute("SELECT * FROM transactions t JOIN albums_sold a ON t.transaction_Id = a.transaction_Id ;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()



(1, 'Amanda', 'Sam', 2000, 1, 1, 'Rubber Soul')
(1, 'Amanda', 'Sam', 2000, 2, 1, 'Let it Be')
(2, 'Toby', 'Sam', 2000, 3, 2, 'My Generation')
(3, 'Max', 'Bob', 2018, 4, 3, 'Meet the Beatles')
(3, 'Max', 'Bob', 2018, 5, 3, 'Help!')
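
`SELECT *` on a join returns `transaction_id` twice. A possible variation, sketched here with the built-in `sqlite3` (in memory) instead of PostgreSQL, names the columns explicitly and then regroups the rows to recover the original list of albums per transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, customer_name TEXT, "
    "cashier_name TEXT, year INTEGER)"
)
cur.execute("CREATE TABLE albums_sold (id INTEGER, transaction_id INTEGER, albums_purchased TEXT)")
cur.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, "Amanda", "Sam", 2000), (2, "Toby", "Sam", 2000), (3, "Max", "Bob", 2018)],
)
cur.executemany(
    "INSERT INTO albums_sold VALUES (?, ?, ?)",
    [(1, 1, "Rubber Soul"), (2, 1, "Let it Be"), (3, 2, "My Generation"),
     (4, 3, "Meet the Beatles"), (5, 3, "Help!")],
)

# Name the columns explicitly so transaction_id is not returned twice.
cur.execute("""
    SELECT t.transaction_id, t.customer_name, t.cashier_name, t.year, a.albums_purchased
    FROM transactions t
    JOIN albums_sold a ON t.transaction_id = a.transaction_id
    ORDER BY a.id
""")

# Group the joined rows back into one list of albums per transaction.
rebuilt = {}
for tid, customer, cashier, year, album in cur.fetchall():
    rebuilt.setdefault((tid, customer, cashier, year), []).append(album)

for key, albums in rebuilt.items():
    print(key + (albums,))
conn.close()
```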


#### Moving to 3rd Normal Form (3NF)
Check the table for any transitive dependencies. 
_HINT:_ _Transactions_ can move _Cashier Name_ to its own table, called _Employees_, which will leave us with 3 tables. 


### TO-DO: Create the third table named *employees* to move to 3rd NF. 



In [11]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS transactions2 (transaction_Id int,\
                                                           Customer_Name VARCHAR,\
                                                           Cashier_ID int,\
                                                           year int)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS employees (employee_id int, employee_name VARCHAR);")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("INSERT INTO transactions2 (transaction_Id,Customer_Name,Cashier_ID,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (1, 'Amanda', 1, 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO transactions2 (transaction_Id,Customer_Name,Cashier_ID,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (2, 'Toby', 1, 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions2 (transaction_Id,Customer_Name,Cashier_ID,year) \
                 VALUES (%s, %s, %s, %s)", \
                 (3, 'Max', 2, 2018))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO employees (employee_id,employee_name) \
                 VALUES (%s, %s)", \
                 (1,'Sam'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO employees (employee_id,employee_name) \
                 VALUES (%s, %s)", \
                 (2,'Bob'))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)    

print("Table: transactions2\n")
try: 
    cur.execute("SELECT * FROM transactions2;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

print("\nTable: albums_sold\n")
try: 
    cur.execute("SELECT * FROM albums_sold;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

print("\nTable: employees\n")
try: 
    cur.execute("SELECT * FROM employees;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

Table: transactions2

(1, 'Amanda', 1, 2000)
(2, 'Toby', 1, 2000)
(3, 'Max', 2, 2018)

Table: albums_sold

(1, 1, 'Rubber Soul')
(2, 1, 'Let it Be')
(3, 2, 'My Generation')
(4, 3, 'Meet the Beatles')
(5, 3, 'Help!')

Table: employees

(1, 'Sam')
(2, 'Bob')


### TO-DO: Complete the last two `JOIN` on these 3 tables so we can get all the information we had in our first Table. 

In [12]:
try: 
    cur.execute("SELECT * FROM transactions2 t JOIN employees e ON \
                               t.Cashier_ID = e.employee_id JOIN \
                               albums_sold a ON t.transaction_Id =a.transaction_Id;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(1, 'Amanda', 1, 2000, 1, 'Sam', 1, 1, 'Rubber Soul')
(1, 'Amanda', 1, 2000, 1, 'Sam', 2, 1, 'Let it Be')
(2, 'Toby', 1, 2000, 1, 'Sam', 3, 2, 'My Generation')
(3, 'Max', 2, 2018, 2, 'Bob', 4, 3, 'Meet the Beatles')
(3, 'Max', 2, 2018, 2, 'Bob', 5, 3, 'Help!')


### Your output for the above cell should contain the following values (the column order depends on the order in which the tables are joined):

(1, 'Amanda', 1, 2000, 1, 1, 'Rubber Soul', 1, 'Sam')<br>
(1, 'Amanda', 1, 2000, 2, 1, 'Let it Be', 1, 'Sam')<br>
(2, 'Toby', 1, 2000, 3, 2, 'My Generation', 1, 'Sam')<br>
(3, 'Max', 2, 2018, 4, 3, 'Meet the Beatles', 2, 'Bob')<br>
(3, 'Max', 2, 2018, 5, 3, 'Help!', 2, 'Bob')<br>
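
With `SELECT *`, the column order follows the order of the tables in the `JOIN`. Listing the columns explicitly makes the result independent of that order. The sketch below assumes the built-in `sqlite3` (in memory) in place of PostgreSQL and compares two join orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE transactions2 (transaction_id INTEGER, customer_name TEXT, "
    "cashier_id INTEGER, year INTEGER)"
)
cur.execute("CREATE TABLE employees (employee_id INTEGER, employee_name TEXT)")
cur.execute("CREATE TABLE albums_sold (id INTEGER, transaction_id INTEGER, albums_purchased TEXT)")
cur.executemany(
    "INSERT INTO transactions2 VALUES (?, ?, ?, ?)",
    [(1, "Amanda", 1, 2000), (2, "Toby", 1, 2000), (3, "Max", 2, 2018)],
)
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Sam"), (2, "Bob")])
cur.executemany(
    "INSERT INTO albums_sold VALUES (?, ?, ?)",
    [(1, 1, "Rubber Soul"), (2, 1, "Let it Be"), (3, 2, "My Generation"),
     (4, 3, "Meet the Beatles"), (5, 3, "Help!")],
)

columns = "t.transaction_id, t.customer_name, e.employee_name, t.year, a.albums_purchased"

# employees joined first, then albums_sold.
employees_first = f"""
    SELECT {columns} FROM transactions2 t
    JOIN employees e ON t.cashier_id = e.employee_id
    JOIN albums_sold a ON t.transaction_id = a.transaction_id
    ORDER BY a.id
"""
# albums_sold joined first, then employees.
albums_first = f"""
    SELECT {columns} FROM transactions2 t
    JOIN albums_sold a ON t.transaction_id = a.transaction_id
    JOIN employees e ON t.cashier_id = e.employee_id
    ORDER BY a.id
"""

rows_a = cur.execute(employees_first).fetchall()
rows_b = cur.execute(albums_first).fetchall()
print(rows_a == rows_b)
conn.close()
```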



### Drop the tables before closing the connection. 

In [13]:
try: 
    cur.execute("DROP table music_store")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table music_store2")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table albums_sold")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table employees")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table transactions")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table transactions2")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
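
The six `DROP` statements repeat the same pattern, so a loop is a more compact option. The sketch below uses the built-in `sqlite3` (in memory) as a stand-in for PostgreSQL; `DROP TABLE IF EXISTS` is valid in both and avoids an error when a table was never created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create only two of the tables, to show that missing ones are skipped quietly.
cur.execute("CREATE TABLE music_store (transaction_id INTEGER)")
cur.execute("CREATE TABLE employees (employee_id INTEGER)")

tables = ["music_store", "music_store2", "albums_sold",
          "employees", "transactions", "transactions2"]

for table in tables:
    # Table names cannot be passed as query parameters; this is acceptable
    # here only because the names are fixed and trusted, not user input.
    cur.execute(f"DROP TABLE IF EXISTS {table}")

# sqlite_master is SQLite's catalog of schema objects.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
remaining = cur.fetchall()
print(remaining)
conn.close()
```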

### And finally close your cursor and connection. 

In [14]:
cur.close()
conn.close()

# Denormalization: The process of trying to improve the read performance of a database at the expense of losing some write performance by adding redundant copies of data.

`JOIN`s on the database allow for outstanding flexibility but can be slow, especially on large tables. If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables. You get your data into normalized form, and then you proceed with denormalization. So, denormalization comes after normalization.

Denormalization may need more space in the system.

1. The designer is in charge of keeping the data consistent.
2. Reads will be faster (`SELECT`).
3. Writes will be slower (`INSERT`, `UPDATE`, `DELETE`).

Let's take a moment to make sure you understand what was in the demo regarding denormalized vs. normalized data. These are important concepts, so make sure to spend some time reflecting on these.

**Normalization** is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible.

**Denormalization** is trying to increase performance by reducing the number of joins between tables (as joins can be slow). Data integrity will take a bit of a potential hit, as there will be more copies of the data (to reduce JOINS).
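
The integrity cost can be illustrated with two copies of the same value that drift apart. The tables below are illustrative only, and the sketch assumes the built-in `sqlite3` (in memory) rather than PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# amount_spent is stored twice: once in sales, once copied into transactions.
cur.execute("CREATE TABLE sales (transaction_id INTEGER, amount_spent INTEGER)")
cur.execute(
    "CREATE TABLE transactions (transaction_id INTEGER, customer_name TEXT, amount_spent INTEGER)"
)
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 40), (2, 19)])
cur.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)", [(1, "Amanda", 40), (2, "Toby", 19)]
)

# A write that updates only one copy leaves the other one stale.
cur.execute("UPDATE sales SET amount_spent = 50 WHERE transaction_id = 1")

# A consistency check the designer would have to run or enforce themselves.
cur.execute("""
    SELECT t.transaction_id
    FROM transactions t
    JOIN sales s ON t.transaction_id = s.transaction_id
    WHERE t.amount_spent <> s.amount_spent
""")
mismatched = cur.fetchall()
print(mismatched)
conn.close()
```

This is why every write in a denormalized design has to update all copies, which is also where the slower writes come from.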

### Example of Denormalized Data:
As you saw in the earlier demo, this denormalized table contains a column with the Artist name that includes duplicated rows, and another column with a list of songs.

![Screen%20Shot%202020-05-30%20at%201.23.47%20pm.png](attachment:Screen%20Shot%202020-05-30%20at%201.23.47%20pm.png)

### Example of Normalized Data:
Now for normalized data, Amanda used 3NF. You see a few changes:

1) No row contains a list of items. For example, the list of songs has been replaced with each song having its own row in the Song table.

2) Transitive dependencies have been removed. For example, album ID is the PRIMARY KEY for the album year in the Album table. Similarly, each of the other tables has a unique primary key that can identify the other values in the table (e.g., song id and song name within the Song table).

![Screen%20Shot%202020-05-30%20at%201.33.58%20pm.png](attachment:Screen%20Shot%202020-05-30%20at%201.33.58%20pm.png)

## Walk through the basics of modeling data from normalized form to denormalized form. We will create tables in PostgreSQL, insert rows of data, and do simple JOIN SQL queries to show how these multiple tables can work together. 

#### Where you see ##### you will need to fill in code. This exercise will be more challenging than the last. Use the information provided to create the tables and write the insert statements.

#### Remember the examples shown are simple, but imagine these situations at scale with large datasets, many users, and the need for quick response time. 


### Import the library 
Note: An error might pop up after this command has executed. If it does, read it carefully before ignoring it. 

In [1]:
import psycopg2

### Create a connection to the database, get a cursor, and set autocommit to true

In [3]:
try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=******")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)
conn.set_session(autocommit=True)

#### Let's start with our normalized (3NF) database set of tables we had in the last exercise, but we have added a new table `sales`. 

`Table Name: transactions2 
column 0: transaction_id
column 1: customer_name
column 2: cashier_id
column 3: year `

`Table Name: albums_sold
column 0: album_id
column 1: transaction_id
column 2: album_name` 

`Table Name: employees
column 0: employee_id
column 1: employee_name `

`Table Name: sales
column 0: transaction_id
column 1: amount_spent
`


### TO-DO: Add all Create statements for all Tables and Insert data into the tables

In [17]:



# TO-DO: Add all Create statements for all tables
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS transactions2 (transaction_id int,\
                                                          customer_name VARCHAR,\
                                                          cashier_id int,\
                                                          year int)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS albums_sold (album_id int,\
                                                          transaction_id int,\
                                                          album_name VARCHAR)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS employees (employee_id int,\
                                                          employee_name VARCHAR)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS sales (transaction_id int,\
                                                          amount_spent int)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)

      
# TO-DO: Insert data into the tables    
    
    
    
try: 
    cur.execute("INSERT INTO transactions2 (transaction_id, customer_name, cashier_id, year) \
                 VALUES (%s, %s, %s, %s)", \
                 (1, "Amanda", 1, 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO transactions2 (transaction_id, customer_name, cashier_id, year) \
                 VALUES (%s, %s, %s, %s)", \
                 (2, "Toby", 1, 2000))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions2 (transaction_id, customer_name, cashier_id, year) \
                 VALUES (%s, %s, %s, %s)", \
                 (3, "Max", 2, 2018))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (album_id, transaction_id, album_name) \
                 VALUES (%s, %s, %s)", \
                 (1, 1, "Rubber Soul"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO albums_sold (album_id, transaction_id, album_name) \
                 VALUES (%s, %s, %s)", \
                 (2, 1, "Let It Be"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (album_id, transaction_id, album_name) \
                 VALUES (%s, %s, %s)", \
                 (3, 2, "My Generation"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO albums_sold (album_id, transaction_id, album_name) \
                 VALUES (%s, %s, %s)", \
                 (4, 3, "Meet the Beatles"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO albums_sold (album_id, transaction_id, album_name) \
                 VALUES (%s, %s, %s)", \
                 (5, 3, "Help!"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO employees (employee_id, employee_name) \
                 VALUES (%s, %s)", \
                 (1, "Sam"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO employees (employee_id, employee_name) \
                 VALUES (%s, %s)", \
                 (2, "Bob"))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)    
    
try: 
    cur.execute("INSERT INTO sales (transaction_id, amount_spent) \
                 VALUES (%s, %s)", \
                 (1, 40))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)    
    
try: 
    cur.execute("INSERT INTO sales (transaction_id, amount_spent) \
                 VALUES (%s, %s)", \
                 (2, 19))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e) 

try: 
    cur.execute("INSERT INTO sales (transaction_id, amount_spent) \
                 VALUES (%s, %s)", \
                 (3, 45))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e) 

#### TO-DO: Confirm using the Select statement the data were added correctly

In [18]:
print("Table: transactions2\n")
try: 
    cur.execute("SELECT * FROM transactions2;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

print("\nTable: albums_sold\n")
try: 
    cur.execute("SELECT * FROM albums_sold;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

print("\nTable: employees\n")
try: 
    cur.execute("SELECT * FROM employees;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()
    
print("\nTable: sales\n")
try: 
    cur.execute("SELECT * FROM sales;")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

Table: transactions2

(1, 'Amanda', 1, 2000)
(2, 'Toby', 1, 2000)
(3, 'Max', 2, 2018)

Table: albums_sold

(1, 1, 'Rubber Soul')
(2, 1, 'Let It Be')
(3, 2, 'My Generation')
(4, 3, 'Meet the Beatles')
(5, 3, 'Help!')

Table: employees

(1, 'Sam')
(2, 'Bob')

Table: sales

(1, 40)
(2, 19)
(3, 45)


### Let's say you need to do a query that gives:

`transaction_id
 customer_name
 cashier_name
 year 
 albums sold
 amount sold` 

### TO-DO: Complete the statement below to perform a 3 way `JOIN` on the 4 tables you have created. 

In [21]:
try: 
    cur.execute("SELECT t.transaction_id, t.customer_name, e.employee_name,t.year, a.album_name,s.amount_spent\
                 FROM transactions2 t JOIN albums_sold a\
                 ON t.transaction_id = a.transaction_id\
                 JOIN employees e\
                 ON e.employee_id = t.cashier_id\
                 JOIN sales s\
                 ON t.transaction_id = s.transaction_id")
    
    
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(1, 'Amanda', 'Sam', 2000, 'Rubber Soul', 40)
(1, 'Amanda', 'Sam', 2000, 'Let It Be', 40)
(2, 'Toby', 'Sam', 2000, 'My Generation', 19)
(3, 'Max', 'Bob', 2018, 'Meet the Beatles', 45)
(3, 'Max', 'Bob', 2018, 'Help!', 45)


#### Great we were able to get the data we wanted.

### But we had to perform a 3-way `JOIN` to get there. While it's great we had that flexibility, we need to remember that `JOINS` can be slow, and if we have a read-heavy workload that requires low-latency queries, we want to reduce the number of `JOINS`.  Let's think about denormalizing our normalized tables.

### With denormalization you want to think about the queries you are running and how to reduce the number of JOINS even if that means duplicating data. The following are the queries you need to run.

### Query 1 : `select transaction_id, customer_name, amount_spent FROM <min number of tables>` 
It should generate the amount spent on each transaction 
### Query 2: `select cashier_name, SUM(amount_spent) FROM <min number of tables> GROUP BY cashier_name` 
It should generate the total sales by cashier 
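
As a baseline, here is roughly what the two queries look like against the normalized tables, before any denormalization. The sketch assumes the built-in `sqlite3` (in memory) in place of PostgreSQL; the table contents mirror the rows inserted earlier in this exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE transactions2 (transaction_id INTEGER, customer_name TEXT, "
    "cashier_id INTEGER, year INTEGER)"
)
cur.execute("CREATE TABLE employees (employee_id INTEGER, employee_name TEXT)")
cur.execute("CREATE TABLE sales (transaction_id INTEGER, amount_spent INTEGER)")
cur.executemany(
    "INSERT INTO transactions2 VALUES (?, ?, ?, ?)",
    [(1, "Amanda", 1, 2000), (2, "Toby", 1, 2000), (3, "Max", 2, 2018)],
)
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Sam"), (2, "Bob")])
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 40), (2, 19), (3, 45)])

# Query 1 on normalized tables: one JOIN.
query_1 = """
    SELECT t.transaction_id, t.customer_name, s.amount_spent
    FROM transactions2 t
    JOIN sales s ON t.transaction_id = s.transaction_id
    ORDER BY t.transaction_id
"""
# Query 2 on normalized tables: two JOINs plus an aggregation.
query_2 = """
    SELECT e.employee_name, SUM(s.amount_spent)
    FROM employees e
    JOIN transactions2 t ON e.employee_id = t.cashier_id
    JOIN sales s ON t.transaction_id = s.transaction_id
    GROUP BY e.employee_name
    ORDER BY e.employee_name
"""

amount_per_transaction = cur.execute(query_1).fetchall()
sales_per_cashier = cur.execute(query_2).fetchall()
print(amount_per_transaction)
print(sales_per_cashier)
conn.close()
```

The denormalized tables in the following steps aim to return the same results with no `JOIN` at all.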

###  Query 1: `select transaction_id, customer_name, amount_spent FROM <min number of tables>`

One way to do this would be to do a JOIN on the `sales` and `transactions2` table but we want to minimize the use of `JOINS`.  

To reduce the number of tables, first add `amount_spent` to the `transactions` table so that you will not need to do a JOIN at all. 

`Table Name: transactions 
column 0: transaction Id
column 1: Customer Name
column 2: Cashier Id
column 3: Year
column 4: amount_spent`

![table19.png](attachment:table19.png)

### TO-DO: Add the tables as part of the denormalization process

In [22]:
# TO-DO: Create all tables
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS transactions (transaction_id int,\
                                                          customer_name VARCHAR,\
                                                          cashier_id int,\
                                                          year int,\
                                                          amount_spent int)")

except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)



#Insert data into all tables 
    
try: 
    cur.execute("INSERT INTO transactions (transaction_id,customer_name,cashier_id,year,amount_spent) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (1,'Amanda',1,2000,40))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions (transaction_id,customer_name,cashier_id,year,amount_spent) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (2,'Toby',1,2000,19))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO transactions (transaction_id,customer_name,cashier_id,year,amount_spent) \
                 VALUES (%s, %s, %s, %s, %s)", \
                 (3,'Max',2,2018,45))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### Now you should be able to run a simplified query to get the information you need. No `JOIN` is needed.

In [23]:
try: 
    cur.execute("SELECT transaction_id,customer_name,amount_spent\
                 FROM transactions")
        
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(1, 'Amanda', 40)
(2, 'Toby', 19)
(3, 'Max', 45)


#### Your output for the above cell should be the following:
(1, 'Amanda', 40)<br>
(2, 'Toby', 19)<br>
(3, 'Max', 45)

### Query 2: `select cashier_name, SUM(amount_spent) FROM <min number of tables> GROUP BY cashier_name` 

To avoid using any `JOINS`, first create a new table with just the information we need. 

`Table Name: cashier_sales
col: Transaction Id
Col: Cashier Name
Col: Cashier Id
col: Amount_Spent
`

![table20.png](attachment:table20.png)

### TO-DO: Create a new table with just the information you need.

In [24]:
# Create the tables

try: 
    cur.execute("CREATE TABLE IF NOT EXISTS cashier_sales (transaction_id int,\
                                                          cashier_name VARCHAR,\
                                                          cashier_id int,\
                                                          amount_spent int)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)


#Insert into all tables 
    
try: 
    cur.execute("INSERT INTO cashier_sales (transaction_id,cashier_name,cashier_id,amount_spent) \
                 VALUES (%s, %s, %s, %s)", \
                 (1,'Sam',1,40 ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO cashier_sales (transaction_id,cashier_name,cashier_id,amount_spent) \
                 VALUES (%s, %s, %s, %s)", \
                 (2,'Sam',1,19 ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO cashier_sales (transaction_id,cashier_name,cashier_id,amount_spent) \
                 VALUES (%s, %s, %s, %s)", \
                 (3,'Bob',2,45 ))

except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### Run the query

In [25]:
try: 
    cur.execute("SELECT cashier_name, SUM(amount_spent)\
                 FROM cashier_sales\
                 GROUP BY cashier_name")
        
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
   print(row)
   row = cur.fetchone()

('Sam', 59)
('Bob', 45)


#### Your output for the above cell should be the following:
('Sam', 59)<br>
('Bob', 45)



#### We have successfully taken normalized tables and denormalized them in order to speed up performance and allow simpler queries to be executed.

### Drop the tables

In [26]:
try: 
    cur.execute("DROP table transactions2")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table albums_sold")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table employees")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table sales")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table transactions")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
try: 
    cur.execute("DROP table cashier_sales")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)

### And finally close your cursor and connection. 

In [27]:
cur.close()
conn.close()

## Dimensional Modeling
Compared with entity-relationship (ER) modeling, dimensional modeling is simpler, more expressive, and easier to understand. It rests on three basic concepts: facts, dimensions, and measures.
![0*4Qddq2ncN_1KBRCR.png](attachment:0*4Qddq2ncN_1KBRCR.png)

Dimensional modeling is primarily used to support OLAP and decision making, while ER modeling is the best fit for OLTP, where results consist of detailed information about entities rather than an aggregated view.

It provides four types of operations: **Drill down, Roll up, Slice and Dice**.

Drill down and roll up move the view down and up along the levels of a dimensional hierarchy, giving more refined or bird's-eye views. With drill-down, users navigate to higher levels of detail. With roll-up, users zoom out to a summarized level of data.

Slice and dice are the operations for browsing the data through the visualized cube. Slicing cuts through the cube by fixing one dimension to a single value, so that users can focus on a specific perspective. Dicing selects a smaller sub-cube by restricting two or more dimensions, so that users can be more specific with the data analysis.
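
The four operations can be sketched as plain SQL queries. The example below is a minimal illustration that uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs without a server. The `sales` table, its columns, and its rows are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical sales cube with dimensions time (year, month), state and product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INT, month INT, state TEXT, product TEXT, units INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (2018, 1, "CA", "album", 3),
        (2018, 2, "CA", "single", 5),
        (2018, 1, "WA", "album", 2),
        (2019, 1, "CA", "album", 4),
        (2019, 2, "WA", "single", 1),
    ],
)

# Roll up: summarize at the coarser "year" level.
roll_up = conn.execute(
    "SELECT year, SUM(units) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# Drill down: move to the finer "year, month" level.
drill_down = conn.execute(
    "SELECT year, month, SUM(units) FROM sales "
    "GROUP BY year, month ORDER BY year, month"
).fetchall()

# Slice: fix one dimension (state) to a single value.
slice_ca = conn.execute(
    "SELECT year, product, SUM(units) FROM sales WHERE state = 'CA' "
    "GROUP BY year, product ORDER BY year, product"
).fetchall()

# Dice: restrict several dimensions at once to look at a smaller sub-cube.
dice = conn.execute(
    "SELECT year, state, product, units FROM sales "
    "WHERE state IN ('CA', 'WA') AND product = 'album' AND year = 2018 "
    "ORDER BY state"
).fetchall()

print(roll_up)
print(drill_down)
print(slice_ca)
print(dice)
```

Roll up and drill down differ only in how many hierarchy levels appear in the `GROUP BY`, while slice and dice differ only in how many dimensions the `WHERE` clause restricts.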

## Fact and Dimension Tables
- Fact and dimension tables work together to create an organized data model.
- Fact and dimension tables are not created differently in the DDL; the distinction is conceptual, but it is extremely important for organization.

### Fact tables
Fact tables consist of the measurements, metrics or facts of a business process.

### Dimension
A dimension is a structure that categorizes facts and measures so that users can answer business questions. Typical dimensions are people, products, place, and time.

### Implementing Different Schemas
Two of the most popular data mart schemas for data warehouses (because of their simplicity) are:
1. Star Schema
2. Snowflake Schema

#### Citations for slides:

- https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
- https://en.wikipedia.org/wiki/Fact_table

The following image shows the relationship between the fact and dimension tables for the example shown in the video. As you can see in the image, the unique primary key for each Dimension table is included in the Fact table.

In this example, it helps to think about the Dimension tables providing the following information:

- Where was the product bought? (Dim_Store table)
- When was the product bought? (Dim_Date table)
- What product was bought? (Dim_Product table)

The Fact table provides the metric of the business process (here Sales).

- How many units of products were bought? (Fact_Sales table)

![dimension-fact-tables.png](attachment:dimension-fact-tables.png)
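
To make the relationship concrete, here is a minimal sketch of the four tables, again using Python's built-in `sqlite3` module rather than PostgreSQL. The table names follow the example above, but the column names and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

-- The fact table holds the measure plus one key per dimension table.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date,
    store_id   INTEGER REFERENCES dim_store,
    product_id INTEGER REFERENCES dim_product,
    units_sold INTEGER
);

INSERT INTO dim_store   VALUES (1, 'CA'), (2, 'WA');
INSERT INTO dim_date    VALUES (1, 2018, 1), (2, 2018, 2);
INSERT INTO dim_product VALUES (1, 'Rubber Soul'), (2, 'Let It Be');
INSERT INTO fact_sales  VALUES (1, 1, 1, 3), (1, 2, 2, 1), (2, 1, 2, 4);
""")

# "How many units were bought, and where?" combines the fact table
# (the measure) with the store dimension (the context).
units_by_state = conn.execute("""
    SELECT s.state, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.state
    ORDER BY s.state
""").fetchall()

print(units_by_state)
```

Swapping `dim_store` for `dim_date` or `dim_product` in the `JOIN` answers the "when" and "what" questions in the same way.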

If you are familiar with **Entity Relationship Diagrams** (ERD), you will find the depiction of STAR and SNOWFLAKE schemas in the demo familiar. ERDs show the data model in a concise way that is also easy to interpret. ERDs can be used for any data model, and are not confined to STAR or SNOWFLAKE schemas. Commonly available tools can be used to generate ERDs. However, more important than creating an ERD is learning about the data through conversations with the data team, so that, as a data engineer, you have a strong understanding of the data you are working with.

More information about ER diagrams can be found at this [Wikipedia](https://en.wikipedia.org/wiki/Entity–relationship_model) page.

## Star Schema
Star schema is the simplest style of data mart schema. The star schema consists of one or more fact tables referencing any number of dimension tables.

This is one of the most widely used schemas in the industry.

### Why use star schema?
- It gets its name from the physical model resembling a star shape.
- A fact table is at its center.
- Dimension tables surround the fact table, representing the star's points.

![600px-%D0%9F%D1%80%D0%B8%D0%BA%D0%BB%D0%B0%D0%B4_%D1%81%D1%85%D0%B5%D0%BC%D0%B8_%D0%B7%D1%96%D1%80%D0%BA%D0%B8.png](attachment:600px-%D0%9F%D1%80%D0%B8%D0%BA%D0%BB%D0%B0%D0%B4_%D1%81%D1%85%D0%B5%D0%BC%D0%B8_%D0%B7%D1%96%D1%80%D0%BA%D0%B8.png)

Reference for image in slides: https://en.wikipedia.org/wiki/Star_schema

### Additional Resources
Check out this [Wikipedia page on Star schemas](https://en.wikipedia.org/wiki/Star_schema).

### Benefits
- Denormalized
- Simplifies queries
- Fast Aggregations

### Drawbacks
- Issues that come with denormalization
- Data integrity
- Decreased query flexibility
- Many-to-many relationships


## Snowflake Schema
A snowflake schema is a logical arrangement of tables in a multidimensional database, represented by centralized fact tables that are connected to multiple dimensions.

A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborated, with multiple levels of relationships and child tables that have multiple parents.
![0*-jCX6AtuIi2v_eTE.png](attachment:0*-jCX6AtuIi2v_eTE.png)

### Snowflake vs Star
- Star Schema is a special, simplified case of the snowflake schema.
- Star schema does not allow for one-to-many relationships, while the snowflake schema does.
- Snowflake schema is more normalized than star schema, but only in 1NF or 2NF.
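
The difference can be sketched by modeling the same store dimension both ways. The example below uses Python's built-in `sqlite3` module instead of PostgreSQL so that it runs anywhere. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (store_id INTEGER, units_sold INTEGER);
INSERT INTO fact_sales VALUES (1, 3), (2, 1), (3, 4);

-- Star: one flat dimension table; the state name is repeated for every store.
CREATE TABLE dim_store_star (store_id INTEGER PRIMARY KEY, city TEXT, state_name TEXT);
INSERT INTO dim_store_star VALUES
    (1, 'San Diego', 'California'),
    (2, 'Seattle', 'Washington'),
    (3, 'Fresno', 'California');

-- Snowflake: the state moves into its own table and is referenced by key.
CREATE TABLE dim_state (state_id INTEGER PRIMARY KEY, state_name TEXT);
CREATE TABLE dim_store_snow (
    store_id INTEGER PRIMARY KEY,
    city     TEXT,
    state_id INTEGER REFERENCES dim_state
);
INSERT INTO dim_state VALUES (1, 'California'), (2, 'Washington');
INSERT INTO dim_store_snow VALUES (1, 'San Diego', 1), (2, 'Seattle', 2), (3, 'Fresno', 1);
""")

# Star schema: one JOIN from the fact table to the dimension.
star = conn.execute("""
    SELECT d.state_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_store_star d ON f.store_id = d.store_id
    GROUP BY d.state_name ORDER BY d.state_name
""").fetchall()

# Snowflake schema: an extra JOIN to reach the normalized state table.
snowflake = conn.execute("""
    SELECT st.state_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_store_snow d ON f.store_id = d.store_id
    JOIN dim_state st ON d.state_id = st.state_id
    GROUP BY st.state_name ORDER BY st.state_name
""").fetchall()

print(star)
print(snowflake)
```

Both queries return the same totals. The snowflake version stores each state name once but needs one more `JOIN`, which is the trade-off between the two schemas.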

### Additional Resources
Check out this Wikipedia page on [Snowflake schemas](https://en.wikipedia.org/wiki/Snowflake_schema).

This [Medium post](https://medium.com/@BluePi_In/deep-diving-in-the-world-of-data-warehousing-78c0d52f49a) provides a nice comparison, and examples, of Star and Snowflake Schemas. Make sure to scroll down halfway through the page.

### This exercise will be more challenging than the last. Use the information provided to create the tables and write the insert statements. 

### Import the library 
Note: An error might pop up after this command has executed. If it does, read it carefully before ignoring it.

In [28]:
import psycopg2


### Create a connection to the database

In [29]:
try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=******")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)

### Next use that connection to get a cursor that we will use to execute queries.

In [30]:
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)

#### For this demo we will use automatic commit so that each action is committed without having to call conn.commit() after each command. The ability to roll back and commit transactions is a feature of relational databases.

In [31]:
conn.set_session(autocommit=True)
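
To see what autocommit saves you from, here is a minimal sketch of manual commit and rollback. It uses Python's built-in `sqlite3` module rather than psycopg2 so that it runs without a database server; psycopg2 connections offer `commit()` and `rollback()` methods in the same way. The `accounts` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")

conn.execute("INSERT INTO accounts VALUES ('Amanda', 100)")
conn.commit()    # make the inserted row permanent

conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'Amanda'")
conn.rollback()  # discard the uncommitted update

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'Amanda'"
).fetchone()[0]
print(balance)
```

The update never reaches the table because it was rolled back before being committed. With autocommit enabled, every statement is committed as soon as it runs, so there is nothing left to roll back.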

### Imagine you work at an online Music Store. There will be many tables in our database, but let's just focus on 4 tables around customer purchases. 

![starSchema.png](attachment:starSchema.png)

### From this representation you can start to see the makings of a "STAR". You will have one fact table (the center of the star) and 3  dimension tables that are coming from it.

### TO-DO: Create the Fact table and insert the data into the table

In [33]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS customer_transactions(customer_id int,\
                                                                  store_id int,\
                                                                  spent numeric)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
#Insert into all tables 
try: 
    cur.execute("INSERT INTO customer_transactions (customer_id,store_id,spent) \
                 VALUES (%s, %s, %s)", \
                 (1,1,20.50 ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO customer_transactions (customer_id,store_id,spent) \
                 VALUES (%s, %s, %s)", \
                 (2,1,35.21 ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### TO-DO: Create the Dimension tables and insert data into those tables.

In [34]:
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS customer(customer_id int,\
                                                     name VARCHAR,\
                                                     rewards BOOLEAN)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS items_purchased(customer_id int,\
                                                            item_number int,\
                                                            item_name VARCHAR)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("CREATE TABLE IF NOT EXISTS store(store_id int,\
                                                  state VARCHAR)")
except psycopg2.Error as e: 
    print("Error: Issue creating table")
    print (e)
    
try: 
    cur.execute("INSERT INTO customer (customer_id,name,rewards) \
                 VALUES (%s, %s, %s)", \
                 (1,'Amanda',True ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO customer (customer_id,name,rewards) \
                 VALUES (%s, %s, %s)", \
                 (2,'Toby',False ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO items_purchased (customer_id,item_number,item_name) \
                 VALUES (%s, %s, %s)", \
                 (1,1,'Rubber Soul' ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO items_purchased (customer_id,item_number,item_name) \
                 VALUES (%s, %s, %s)", \
                 (2,3,'Let It Be' ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)
    
try: 
    cur.execute("INSERT INTO store (store_id,state) \
                 VALUES (%s, %s)", \
                 (1,'CA' ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

try: 
    cur.execute("INSERT INTO store (store_id,state) \
                 VALUES (%s, %s)", \
                 (2,'WA' ))
except psycopg2.Error as e: 
    print("Error: Inserting Rows")
    print (e)

### Now you can easily run the following queries on this data, thanks to the fact/dimension tables and the star schema
 
#### Query 1: Find all the customers that spent more than 30 dollars: who they are, which store they bought from, the location of the store, what they bought, and whether they are rewards members.

#### Query 2: How much did Customer 2 spend?

### Query 1:

In [37]:
try: 
    cur.execute("SELECT c.name, s.store_id,s.state,i.item_name,c.rewards\
                 FROM customer_transactions ct JOIN customer c\
                 ON ct.customer_id = c.customer_id\
                 JOIN items_purchased i\
                 ON ct.customer_id = i.customer_id\
                 JOIN store s\
                 ON ct.store_id = s.store_id\
                 WHERE ct.spent > 30")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

('Toby', 1, 'CA', 'Let It Be', False)


### Query 2: 

In [39]:
try: 
    cur.execute("SELECT customer_id,SUM(spent)\
                 FROM customer_transactions\
                 WHERE customer_id = 2\
                 GROUP BY customer_id")
except psycopg2.Error as e: 
    print("Error: select *")
    print (e)

row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(2, Decimal('35.21'))


### Summary: You can see from this elegant schema that we were able to 1) get "facts/metrics" from our fact table (how much each store sold), and 2) get information about our customers that allows more in-depth analytics, answering business questions by utilizing our fact and dimension tables.

### TO-DO: Drop the tables

In [40]:
try: 
    cur.execute("DROP table customer_transactions")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)

try: 
    cur.execute("DROP table customer")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
    
try: 
    cur.execute("DROP table items_purchased")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)
    
try: 
    cur.execute("DROP table store")
except psycopg2.Error as e: 
    print("Error: Dropping table")
    print (e)

### And finally close your cursor and connection. 

In [41]:
cur.close()
conn.close()

## Data Definition and Constraints
The CREATE statement in SQL has a few important constraints that are highlighted below.

## NOT NULL
The **NOT NULL** constraint indicates that the column cannot contain a null value.

Here is the syntax for adding a **NOT NULL** constraint to the CREATE statement:

```sql
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int, 
    spent numeric
);
```

You can add **NOT NULL** constraints to more than one column. Usually this occurs when you have a **COMPOSITE KEY**, which will be discussed further below.

Here is the syntax for it:

```sql
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int NOT NULL, 
    spent numeric
);
```
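
To see the constraint in action, here is a minimal sketch that tries to insert a row with a missing `customer_id`. It uses Python's built-in `sqlite3` module rather than PostgreSQL so that it runs without a server; the inserted values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_transactions (
        customer_id int NOT NULL,
        store_id int NOT NULL,
        spent numeric
    )
""")

# A complete row is accepted.
conn.execute("INSERT INTO customer_transactions VALUES (1, 1, 20.50)")

# A row with a NULL in a NOT NULL column is rejected.
try:
    conn.execute("INSERT INTO customer_transactions VALUES (NULL, 1, 35.21)")
    rejected = False
except sqlite3.IntegrityError as e:
    rejected = True
    print("Rejected:", e)

row_count = conn.execute("SELECT COUNT(*) FROM customer_transactions").fetchone()[0]
print(row_count)
```

Only the first row ends up in the table. In psycopg2 the same violation would surface as a `psycopg2.Error`, which the `try`/`except` blocks in the notebooks above would catch.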

## UNIQUE
The **UNIQUE** constraint is used to specify that the data across all the rows in one column are unique within the table. The **UNIQUE** constraint can also be used for multiple columns, so that the combination of the values across those columns will be unique within the table. In this latter case, the values within 1 column do not need to be unique. 

Let's look at an example.

```sql
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL UNIQUE, 
    store_id int NOT NULL UNIQUE, 
    spent numeric 
);
```

Another way to write a **UNIQUE** constraint is to add a table constraint using commas to separate the columns.

```sql
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int NOT NULL, 
    spent numeric,
    UNIQUE (customer_id, store_id, spent)
);
```
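
The two forms behave differently, which the following minimal sketch tries to show. It uses Python's built-in `sqlite3` module rather than PostgreSQL, and the table names `per_column` and `combined` as well as the helper `try_insert` are hypothetical, introduced only for this comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each column must be unique on its own.
CREATE TABLE per_column (
    customer_id int NOT NULL UNIQUE,
    store_id    int NOT NULL UNIQUE
);
-- Only the combination of the two columns must be unique.
CREATE TABLE combined (
    customer_id int NOT NULL,
    store_id    int NOT NULL,
    UNIQUE (customer_id, store_id)
);
""")

def try_insert(table, row):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = {
    # (1, 2) repeats customer_id 1, and (2, 1) repeats store_id 1.
    "per_column": [try_insert("per_column", r) for r in [(1, 1), (1, 2), (2, 1)]],
    # (1, 2) is a new combination; only the exact repeat is rejected.
    "combined": [try_insert("combined", r) for r in [(1, 1), (1, 2), (1, 2)]],
}
print(results)
```

With column constraints, a value may appear only once per column. With the table constraint, individual values may repeat as long as the whole combination does not.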

## PRIMARY KEY
The **PRIMARY KEY** constraint is usually defined on a single column, and every table should contain a primary key. The values in this column uniquely identify the rows in the table, so they must be both unique and not null; the **PRIMARY KEY** constraint has the unique and not null constraints built into it by default. If a group of columns is defined as a primary key, it is called a **composite key**. That means the combination of values in these columns uniquely identifies the rows in the table.

Let's look at the following example:

```sql
CREATE TABLE IF NOT EXISTS store (
    store_id int PRIMARY KEY, 
    store_location_city text,
    store_location_state text
);
```

Here is an example for a group of columns serving as **composite key**.

```sql
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int, 
    store_id int, 
    spent numeric,
    PRIMARY KEY (customer_id, store_id)
);
```

To read more about these constraints, check out the [PostgreSQL documentation](https://www.postgresql.org/docs/9.4/ddl-constraints.html).

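## Foreign Keys

A foreign key constraint specifies that the values in a column (or a group of columns) must match the values appearing in some row of another table. In the following example, `order_items` references both `products` and `orders`, which models a many-to-many relationship between them: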

```sql
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    shipping_address text,
    ...
);

CREATE TABLE order_items (
    product_no integer REFERENCES products,
    order_id integer REFERENCES orders,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
```
Notice that the primary key overlaps with the foreign keys in the last table.

We know that the foreign keys disallow creation of orders that do not relate to any products. But what if a product is removed after an order is created that references it? SQL allows you to handle that as well. Intuitively, we have a few options:

- Disallow deleting a referenced product

- Delete the orders as well

- Something else?

To illustrate this, let's implement the following policy on the many-to-many relationship example above: when someone wants to remove a product that is still referenced by an order (via order_items), we disallow it. If someone removes an order, the order items are removed as well:

```sql
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    shipping_address text,
    ...
);

CREATE TABLE order_items (
    product_no integer REFERENCES products ON DELETE RESTRICT,
    order_id integer REFERENCES orders ON DELETE CASCADE,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
```
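
The effect of the two `ON DELETE` policies can be sketched as follows. The example uses Python's built-in `sqlite3` module rather than PostgreSQL so that it runs without a server. Note two assumptions: SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and the elided columns of `orders` are simply left out. The inserted rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement
conn.executescript("""
CREATE TABLE products (product_no integer PRIMARY KEY, name text, price numeric);
CREATE TABLE orders (order_id integer PRIMARY KEY, shipping_address text);
CREATE TABLE order_items (
    product_no integer REFERENCES products ON DELETE RESTRICT,
    order_id integer REFERENCES orders ON DELETE CASCADE,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
INSERT INTO products VALUES (1, 'Rubber Soul', 9.99);
INSERT INTO orders VALUES (10, '758 Main Street');
INSERT INTO order_items VALUES (1, 10, 2);
""")

# RESTRICT: the product is still referenced by an order item, so it stays.
try:
    conn.execute("DELETE FROM products WHERE product_no = 1")
    product_delete_blocked = False
except sqlite3.IntegrityError as e:
    product_delete_blocked = True
    print("Blocked:", e)

# CASCADE: deleting the order also deletes its order items.
conn.execute("DELETE FROM orders WHERE order_id = 10")
items_left = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
products_left = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(items_left, products_left)
```

After the order is gone, nothing references the product any more, so a second attempt to delete it would succeed.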


## Check Constraints

A check constraint is the most generic constraint type. It allows you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For instance, to require positive product prices, you could use:

```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);
```

As you see, the constraint definition comes after the data type, just like default value definitions. Default values and constraints can be listed in any order. A check constraint consists of the key word CHECK followed by an expression in parentheses. The check constraint expression should involve the column thus constrained, otherwise the constraint would not make too much sense.

You can also give the constraint a separate name. This clarifies error messages and allows you to refer to the constraint when you need to change it. The syntax is:

```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CONSTRAINT positive_price CHECK (price > 0)
);
```

So, to specify a named constraint, use the key word CONSTRAINT followed by an identifier followed by the constraint definition. (If you don't specify a constraint name in this way, the system chooses a name for you.)

A check constraint can also refer to several columns. Say you store a regular price and a discounted price, and you want to ensure that the discounted price is lower than the regular price:

```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);
```

The first two constraints should look familiar. The third one uses a new syntax. It is not attached to a particular column, instead it appears as a separate item in the comma-separated column list. Column definitions and these constraint definitions can be listed in mixed order.

We say that the first two constraints are column constraints, whereas the third one is a table constraint because it is written separately from any one column definition. Column constraints can also be written as table constraints, while the reverse is not necessarily possible, since a column constraint is supposed to refer to only the column it is attached to. (PostgreSQL doesn't enforce that rule, but you should follow it if you want your table definitions to work with other database systems.) The above example could also be written as:

```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric,
    CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);
```

or even:

```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0 AND price > discounted_price)
);

```

It's a matter of taste.

Names can be assigned to table constraints in the same way as column constraints:
```sql
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric,
    CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0),
    CONSTRAINT valid_discount CHECK (price > discounted_price)
);
```
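
Here is a minimal sketch of how such constraints reject bad rows. It uses Python's built-in `sqlite3` module rather than PostgreSQL so that it runs anywhere; the helper `accepted` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        product_no integer,
        name text,
        price numeric CHECK (price > 0),
        discounted_price numeric CHECK (discounted_price > 0),
        CONSTRAINT valid_discount CHECK (price > discounted_price)
    )
""")

def accepted(row):
    """Return True if the row passes all check constraints."""
    try:
        conn.execute("INSERT INTO products VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError as e:
        print("Rejected:", e)
        return False

ok = accepted((1, "album", 10, 8))                # satisfies every check
negative_price = accepted((2, "single", -1, None))  # violates price > 0
bad_discount = accepted((3, "poster", 5, 7))      # violates valid_discount
print(ok, negative_price, bad_discount)
```

The second row leaves `discounted_price` as NULL. A check whose expression evaluates to NULL is treated as satisfied, so only the `price > 0` check fails for that row.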

## Upsert
In RDBMS language, the term upsert refers to the idea of inserting a new row in an existing table, or updating the row if it already exists in the table. The action of updating or inserting has been described as "upsert".

The way this is handled in PostgreSQL is by using the `INSERT` statement in combination with the `ON CONFLICT` clause.

## INSERT
The **INSERT** statement adds in new rows within the table. The values associated with specific target columns can be added in any order.

Let's look at a simple example. We will use a customer address table as an example, which is defined with the following **CREATE** statement:

```sql
CREATE TABLE IF NOT EXISTS customer_address (
    customer_id int PRIMARY KEY, 
    customer_street varchar NOT NULL,
    customer_city text NOT NULL,
    customer_state text NOT NULL
);
```

Let's try to insert data into it by adding a new row:

```sql
INSERT INTO customer_address
VALUES
    (432, '758 Main Street', 'Chicago', 'IL');
```

Now let's assume that the customer moved and we receive a new address for the same `customer_id`. A plain `INSERT` would fail because `customer_id` is the primary key, and we do not want to add a new customer id either.

If we want the insert to be skipped on such a conflict, leaving the existing row unchanged, we can use the **ON CONFLICT DO NOTHING** clause.

```sql
INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)
VALUES
    (432, '923 Knox Street', 'Albany', 'NY')
ON CONFLICT (customer_id)
DO NOTHING;
```
Now, let's imagine we want to add more details in the existing address for an existing customer. This would be a good candidate for using the **ON CONFLICT DO UPDATE** clause.

```sql
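-- All NOT NULL columns must be supplied, even though only the street is updated.
INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)
VALUES
    (432, '923 Knox Street, Suite 1', 'Albany', 'NY')
ON CONFLICT (customer_id)
DO UPDATE
    SET customer_street = EXCLUDED.customer_street;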
```
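
If the customer really has moved, you would probably want to overwrite every address column on conflict. The following is a minimal sketch of that variant. It uses Python's built-in `sqlite3` module rather than PostgreSQL, assuming a bundled SQLite of version 3.24 or newer, which accepts the same `ON CONFLICT ... DO UPDATE` syntax. The `UPSERT` constant is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_address (
        customer_id int PRIMARY KEY,
        customer_street varchar NOT NULL,
        customer_city text NOT NULL,
        customer_state text NOT NULL
    )
""")

# Insert the row, or overwrite the whole address if the customer already exists.
UPSERT = """
    INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (customer_id) DO UPDATE SET
        customer_street = excluded.customer_street,
        customer_city   = excluded.customer_city,
        customer_state  = excluded.customer_state
"""

conn.execute(UPSERT, (432, "758 Main Street", "Chicago", "IL"))  # no conflict: insert
conn.execute(UPSERT, (432, "923 Knox Street", "Albany", "NY"))   # conflict: update

rows = conn.execute("SELECT * FROM customer_address").fetchall()
print(rows)
```

The table still holds a single row for customer 432, now with the new address. `excluded` refers to the row that was proposed for insertion.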

We recommend checking out these two links to learn other ways to insert data into the tables.

- [PostgreSQL tutorial](https://www.postgresqltutorial.com/postgresql-upsert/)
- [PostgreSQL documentation](https://www.postgresql.org/docs/9.5/sql-insert.html)

In [50]:
import psycopg2

try: 
    conn = psycopg2.connect("host=localhost dbname=mydb user=edifierxuhao password=******")
except psycopg2.Error as e: 
    print("Error: Could not make connection to the Postgres database")
    print(e)
    
try: 
    cur = conn.cursor()
except psycopg2.Error as e: 
    print("Error: Could not get cursor to the Database")
    print(e)
    
conn.set_session(autocommit=True)



In [78]:

cur.execute("CREATE TABLE IF NOT EXISTS customer_address (\
    customer_id int PRIMARY KEY, \
    customer_street varchar NOT NULL,\
    customer_city text NOT NULL,\
    customer_state text NOT NULL\
);")


In [79]:
cur.execute("INSERT into customer_address(customer_id,customer_street,customer_city,customer_state)\
VALUES(432, '758 Main Street', 'Chicago', 'IL')")

In [80]:
cur.execute("INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)\
            VALUES (432, '923 Knox Street', 'Albany', 'NY') \
            ON CONFLICT (customer_id)\
            DO NOTHING;")


In [81]:
cur.execute("SELECT * FROM customer_address")


row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(432, '758 Main Street', 'Chicago', 'IL')


In [83]:
cur.execute("INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)\
             VALUES (432, '923 Knox Street, Suite 1', 'Albany', 'NY')\
             ON CONFLICT (customer_id) \
             DO UPDATE \
             SET customer_street  = EXCLUDED.customer_street;")


In [85]:
cur.execute("SELECT * FROM customer_address")


row = cur.fetchone()
while row:
    print(row)
    row = cur.fetchone()

(432, '923 Knox Street, Suite 1', 'Chicago', 'IL')


In [86]:
cur.execute('DROP table customer_address')

In [87]:
cur.close()
conn.close()

## What we learned:
- What makes a database a relational database, and Codd’s 12 rules of relational database design
- The difference between the two types of database workloads, OLAP and OLTP
- The process of database normalization and the normal forms.
- Denormalization and when it should be used.
- Fact vs dimension tables as a concept and how to apply that to our data modeling
- How the star and snowflake schemas use the concepts of fact and dimension tables to make getting value out of the data easier.