# Normal forms and relational algebra

``jeep@cphbusiness.dk``

# Agenda

* Recap
* Algebra
* Relational algebra
* Table constraints
* Normal forms: BCNF, 4NF and 5NF

# Learning objectives
## Knowledge
The student must have knowledge of:

 * Various database types and the underlying models
 * A specific database system’s storage organisation and query execution
 * A specific database system’s optimisation possibilities – including advantages and disadvantages
 * Database-specific security problems and their solutions
 * Concepts and issues when handling big data
 * The particular issues raised by having many simultaneous transactions, including in connection with distributed databases
 * Relational algebra (including its relationship to execution plans)

## Skills
The student can:

 * Transform logical data models into physical models in various database types
 * Implement database optimisation
 * Use parts of the administration tool to assist in the optimisation and tuning of existing databases, including the incorporation of a specific DBMS’ execution plans
 * Use a specific database system’s tools for handling simultaneous transactions
 * Use the programming and other facilities provided by a modern DBMS


## Competencies
The student can:
 
 * Analyse the application domain in order to select a database type
 * Divide responsibility for tasks between the application and DBMS during system development, to ensure the best possible implementation.


# Algebra

"The study of mathematical symbols and their rules" - [Wikipedia](https://en.wikipedia.org/wiki/Algebra)

* Allows us to work with
  * Symbols
  * Variables, placeholders
  * Abstractions

## Logical *and*, *or* and *not*

* Logical **and** $\land$ 
  * $True \land False = False$
* Logical **or** $\lor$
  * $True \lor False = True$
* Logical **not** $\neg$
  * $\neg True = False$
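
The three operators can be tried out directly in a notebook cell. The snippet below is a small sketch that builds the full truth table with Python's built-in `and`, `or` and `not`; the variable names are only illustrative.

```python
# Build the truth table for "and" and "or" over all combinations of a and b.
truth_table = [
    (a, b, a and b, a or b)
    for a in (True, False)
    for b in (True, False)
]

for a, b, conj, disj in truth_table:
    print(f"{a} and {b} = {conj};  {a} or {b} = {disj}")

# "not" takes a single operand.
negations = {a: (not a) for a in (True, False)}
print(negations)
```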

## Sets in algebra

"Collection of distinct objects" - <a href="https://en.wikipedia.org/wiki/Set_(mathematics)">Wikipedia</a>

* Empty set: $Ø = \{\}$
* Non-empty set: $\{1\}$
* Non-empty set: $\{Red, Green\}$
* Distinct
  * Set of the list $[1, 2, 2, 3]$ is $\{1, 2, 3\}$
  * Set of $[Red, Green, Yellow, Green]$ is $\{Red, Green, Yellow\}$
* Order doesn't matter
  * $\{Red, Green\} = \{Green, Red\}$
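
Python's built-in `set` type behaves the same way, so the properties above can be checked in a notebook cell. This is only an illustration with made-up variable names.

```python
# Converting a list to a set drops duplicates.
numbers = set([1, 2, 2, 3])
colours = set(["Red", "Green", "Yellow", "Green"])

# The empty set must be written set(); {} is an empty dict in Python.
empty = set()

# Order does not matter when comparing sets.
same = {"Red", "Green"} == {"Green", "Red"}

print(numbers, colours, empty, same)
```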

## Different sets

* The empty set, $Ø = \{\}$
* All the students in this class
* All the natural numbers, $\mathbb{N} = \{0, 1, 2, 3, 4, \dots\}$
* All the integers, $\mathbb{Z} = \{\dots, -2, -1, 0, 1, 2, \dots\}$
* All the rational numbers, $\mathbb{Q} = \{0, -2, 10, \frac{-2832}{123}, \frac{72}{1923892}, \frac{67821}{1298732}, \dots\}$
* All the real numbers, $\mathbb{R} = \{0, -2, 10, \frac{-2832}{123}, \frac{72}{1923892}, \frac{67821}{1298732}, \sqrt 2, \pi, \dots\}$

## Set membership

* $4 \in \{1, 2, 3, 4, 5\}$
  * 4 **is a member of** the set $\{1, 2, 3, 4, 5\}$
* $7 \notin \{1, 2, 3, 4, 5\}$
  * 7 **is not a member of** the set $\{1, 2, 3, 4, 5\}$

## Exercise on sets


![Number system](images/numbersystem.png)

* $4 \in \mathbb{R}$
* $\pi \in \mathbb{Q}$
* $\frac{2}{3} \in \mathbb{R}$
* $\sqrt 2 \notin \mathbb{Q}$
* $-5 \in \mathbb{N}$
* $\log(7) \notin \mathbb{Q}$

# More symbols!

* Such that: $\mid$
  * $n \in \mathbb{Z} \mid n = 2$: $n$ is a member of the integers **such that** $n = 2$
* For all: $\forall$
  * $\forall n \in \mathbb{R}$: For all $n$ in all the real numbers
  * $\forall n \in \mathbb{R} \mid n + 0 = n$: For all $n$ in the real numbers, $n + 0 = n$
* There exists: $\exists$
  * $\exists n \in \mathbb{R}$: There exists a number $n$ in all the real numbers
  * $\exists n \in \mathbb{R} \mid n * n = n$: There exists a number $n$ in all the real numbers such that $n * n = n$
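
On a finite collection, $\forall$ and $\exists$ correspond to Python's `all` and `any`. The sketch below uses a small range of integers as a stand-in for $\mathbb{R}$ (an assumption made purely so the check can run; it proves nothing about all real numbers).

```python
# A finite sample standing in for the real numbers (assumption for illustration).
sample = range(-10, 11)

# "There exists n such that n * n = n"
exists_fixed_point = any(n * n == n for n in sample)
witnesses = [n for n in sample if n * n == n]

# "For all n, n + 0 = n"
forall_identity = all(n + 0 == n for n in sample)

# A "for all" claim fails as soon as one counterexample exists.
forall_positive = all(n > 0 for n in sample)

print(exists_fixed_point, witnesses, forall_identity, forall_positive)
```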

## Constructing sets

* $\{1, 2, 4, 6, 7\}$
* $\{1, 2, 3, \dots, 99, 100\}$
* $\{e_0, e_1, \dots, e_{n-1}, e_{n}\}$

### Set-builder notation

* $\{a \in \mathbb{Z}\}$
  * $\{\dots, -2, -1, 0, 1, 2, \dots\}$
  * The set of all $a$, where $a$ is an integer
* $\{b \in \mathbb{Z} \mid b = 10\}$
  * $\{10\}$
  * The set of all $b$, where $b$ is an integer **such that** $b = 10$
* $\{x \mid x \in \mathbb{N}, x = -10\}$
  * $Ø$
  * The set of all $x$, where $x$ is a natural number **such that** $x = -10$; no natural number is negative, so the set is empty
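
Set-builder notation maps closely onto Python's set comprehensions. The sketch below assumes a finite window of integers, because an infinite set cannot be enumerated; the names `window` and `naturals` are illustrative only.

```python
# Finite stand-ins for the integers and the natural numbers (assumption).
window = range(-20, 21)
naturals = [n for n in window if n >= 0]

# {b in Z | b = 10}
ten = {b for b in window if b == 10}

# {x | x in N, x = -10}: no natural number satisfies the condition.
nothing = {x for x in naturals if x == -10}

# The part before "for" may also be an expression, e.g. the first five even numbers.
evens = {2 * k for k in range(5)}

print(ten, nothing, evens)
```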

## Joining sets

* Union
  * $\{1, 2\} \cup \{3, 4\} = \{1, 2, 3, 4\}$
  * $\{1, 2\} \cup \{3, 4\} = \{3, 4\} \cup \{1, 2\}$
  
![Union](https://upload.wikimedia.org/wikipedia/commons/thumb/3/30/Venn0111.svg/800px-Venn0111.svg.png)

## Intersecting sets

* Intersection
  * $\{1, 2\} \cap \{3, 4\} = \{\}$
  * $\{1\} \cap \{1, 2\} = \{1\}$
  
![Intersection](https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Venn0001.svg/800px-Venn0001.svg.png)

## Subtracting sets 1/3

* Relative complement (set difference)
  * $A \setminus B$: the elements of $A$ that are not in $B$
  
![Complement](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Venn0100.svg/800px-Venn0100.svg.png)

## Subtracting sets 2/3

* Relative complement (set difference)
  * $B \setminus A$: the elements of $B$ that are not in $A$
  
![Complement](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Venn0010.svg/800px-Venn0010.svg.png)

## Subtracting sets 3/3

* We now have two sets
  * $A \setminus B$
  * $B \setminus A$

* What happens when we join those two together?

![Symmetric difference](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Venn0110.svg/800px-Venn0110.svg.png)

## Symmetric difference

* The difference between two sets, joined together
  * $A = \{1, 2, 3\}$ and $B = \{3, 4, 5\}$
  * $A \setminus B = \{1, 2\}$
  * $B \setminus A = \{4, 5\}$
  * $(A \setminus B) \cup (B \setminus A) = A \Delta B = \{1, 2, 4, 5\}$
  
![Symmetric difference](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Venn0110.svg/800px-Venn0110.svg.png)
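
Python sets provide operators for union, intersection, difference and symmetric difference, so the example with $A$ and $B$ can be replayed in a notebook cell. A minimal sketch:

```python
A = {1, 2, 3}
B = {3, 4, 5}

union = A | B                 # A ∪ B
intersection = A & B          # A ∩ B
a_minus_b = A - B             # A \ B
b_minus_a = B - A             # B \ A
symmetric_difference = A ^ B  # A Δ B

# The symmetric difference is the union of the two differences.
print(symmetric_difference == (a_minus_b | b_minus_a))
```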

## Set operations

![Set operations](images/setops.png)

## Subsets

* A set can be a part of another set
  * $\{1\} \subseteq \{1, 2, 3\}$
  * $\{2, 3\} \subseteq \{1, 2, 3\}$
  * $\{1, 2, 3\} \subseteq \{1, 2, 3\}$
  
* Or it cannot
  * $\{0\} \nsubseteq \{1, 2, 3\}$
  
* The empty set is a subset of all sets
  * $Ø \subseteq \{1, 2, 3\}$
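
In Python, `<=` on sets means "is a subset of", which allows a quick check of the statements above. A small sketch:

```python
S = {1, 2, 3}

part = {1} <= S            # a proper part of S
whole = {1, 2, 3} <= S     # every set is a subset of itself
outside = {0} <= S         # 0 is not in S
empty = set() <= S         # the empty set is a subset of every set

# issubset is the method form of the same operator.
method_form = {2, 3}.issubset(S)

print(part, whole, outside, empty, method_form)
```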

## Subset example

![Number system](images/numbersystem.png)

* $\{\pi\} \subseteq \mathbb{R}$
* $\{\pi\} \nsubseteq \mathbb{N}$

## Cartesian product

* An operator that can combine two sets into tuples of all possible combinations
  * Written $×$

* $A = \{1, 2, 3\}$
* $B = \{Apple, Orange\}$

* $A × B = \{(1, Apple), (1, Orange), (2, Apple), (2, Orange), (3, Apple), (3, Orange)\}$

* $B × A = \{(Apple, 1), (Orange, 1), (Apple, 2), (Orange, 2), (Apple, 3), (Orange, 3)\}$

* $A × B = \{(a, b) \mid a \in A, b \in B\}$

* $R\times S =\{(r_1,r_2,\dots,r_n,s_1,s_2,\dots,s_m)|(r_1,r_2,\dots,r_n)\in R, (s_1,s_2,\dots,s_m)\in S\}$
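
The standard library function `itertools.product` computes the Cartesian product. The sketch below reproduces $A × B$ and $B × A$, and then the relational version, where the tuples of two relations are concatenated. The relations `R` and `S` are made-up examples.

```python
from itertools import product

A = {1, 2, 3}
B = {"Apple", "Orange"}

AxB = set(product(A, B))
BxA = set(product(B, A))

# The product is not commutative: the pairs come out in the opposite order.
print(len(AxB), len(BxA), AxB == BxA)

# For relations, each result tuple is a tuple of R followed by a tuple of S.
R = {(1, "a"), (2, "b")}      # hypothetical relation with two attributes
S = {("x",), ("y",)}          # hypothetical relation with one attribute
RxS = {r + s for r, s in product(R, S)}
print(sorted(RxS))
```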

## Relational algebra

* Our tables (relations) are actually sets of tuples

| CPR | Name | Address |
|---------|---------|-------------|
|140298-1234|Thomas|Copenhagen|
|041297-5367|Nikoline|Aarhus|
|151197-2352|Claus|Dragør|
|050596-1142|Martin|Copenhagen|

* The heading (schema) of $students$ is the attribute set $\{CPR, Name, Address\}$
  * Each row is a tuple with one value per attribute, e.g. $(140298\text{-}1234, Thomas, Copenhagen)$

* Written $R(CPR, Name, Address)$

## Relational algebra, take 2

The idea that we can use algebra (symbols and symbolic meaning) to talk about relations

* Invented by Edgar Codd
* The theoretical basis for SQL implementations
* Incredibly useful when reasoning about and understanding your queries

## Algebraic operations on relations

* $\{CPR\} \cup \{Name, Address\}$
  * $\{CPR, Name, Address\}$



* $\{CPR, Name\} \cap \{CPR, Address\}$
  * $\{CPR\}$

* $\{CPR, Name\} \setminus \{Name, Address\}$
  * $\{CPR\}$

## Relational algebra lingo

* Selections: $\sigma_{\varphi}(R)$
  * Selects the rows (tuples) that satisfy the condition $\varphi$
  * ``SELECT ... FROM ... WHERE``
* Projections: $\Pi_{a_1, a_2, \dots, a_n}(R)$
  * Column selection in a relation; duplicates disappear because the result is a set
  * ``SELECT DISTINCT name, address FROM ...``
* Joins: $A \bowtie B$
  * We'll get back to that
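
Selection and projection can be sketched over a plain Python set of tuples. The helper functions `select` and `project` below are hypothetical names written for this illustration, not part of any library; the data is the student table from the earlier slide.

```python
from collections import namedtuple

Student = namedtuple("Student", ["cpr", "name", "address"])

students = {
    Student("140298-1234", "Thomas", "Copenhagen"),
    Student("041297-5367", "Nikoline", "Aarhus"),
    Student("151197-2352", "Claus", "Dragør"),
    Student("050596-1142", "Martin", "Copenhagen"),
}

def select(relation, predicate):
    """Selection: keep the tuples for which the predicate is true."""
    return {t for t in relation if predicate(t)}

def project(relation, *attributes):
    """Projection: keep only the named attributes; the set removes duplicates."""
    return {tuple(getattr(t, a) for a in attributes) for t in relation}

# Roughly: SELECT * FROM students WHERE address = 'Copenhagen'
in_copenhagen = select(students, lambda t: t.address == "Copenhagen")

# Roughly: SELECT DISTINCT address FROM students
addresses = project(students, "address")

print(len(in_copenhagen), addresses)
```

Note that the projection returns three addresses for four students, because a relation is a set and Copenhagen appears only once.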

# Getting some data to play with

``docker run -p 5432:5432 --name data jegp/soft2018-data``

* Try to log in: ``docker exec -it data bash -c "psql -U appdev"``

## Running PostgreSQL in Jupyter

``docker run -p 8888:8888 --name jupyter --link data -v `pwd`:/home/jovyan -it jegp/soft2018-jupyter``

* Don't forget to link the containers!
* Click on the link in your terminal

## Exporting Jupyter files to your host machine

* When you start Jupyter, the notebooks will be saved inside the container
* To export them you need to map a folder on your host machine with a folder on your running container
  * Note: The Unix command for finding the present working directory is ``pwd``
  
  
* To expose a volume you will have to map a folder from a host to the container, just like with ports
* The syntax is ``-v host-folder:container-folder``


* In our case you'll have to map your current directory to the ``/home/jovyan`` folder inside the container:
   * ```pwd`:/home/jovyan``
   * The full command is available in the above cell
 

## Sql magic

* In Jupyter, a line starting with ``%`` or ``%%`` is a *magic* command that is handled by an extension instead of being run as Python
  * ``%%`` applies to the whole cell (multi-line)
  * ``%`` applies to a single line
* In our case we can use it to run SQL:
  
      %sql SELECT 1;

## Jupyter can now connect to your database!

    In [3]: %load_ext sql
    The sql extension is already loaded. To reload it, use:
      %reload_ext sql

    In [12]: c = %sql SELECT * FROM public.tweet LIMIT 100;
    100 rows affected.

    In [14]: type(c)
    sql.run.ResultSet

    In [7]: %sql SELECT * FROM information_schema.tables;
    240 rows affected.


| table_catalog | table_schema | table_name | table_type | self_referencing_column_name | reference_generation | user_defined_type_catalog | user_defined_type_schema | user_defined_type_name | is_insertable_into | is_typed | commit_action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| appdev | chinook | invoice | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | customer | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | mediatype | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | playlist | BASE TABLE | | | | | | YES | NO | |
| appdev | eav | support | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | genre | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | invoiceline | BASE TABLE | | | | | | YES | NO | |
| appdev | pg_catalog | pg_type | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | track | BASE TABLE | | | | | | YES | NO | |
| appdev | chinook | playlisttrack | BASE TABLE | | | | | | YES | NO | |


## Guide to connect to your database

* ``%load_ext sql``
* ``%sql postgresql://appdev@data:5432/appdev``
* ``%sql SELECT * from tweet LIMIT 10;``

## Check out all the tables!

* ``SELECT table_name FROM information_schema.tables;``

* Wait! ``information_schema.``?!
  * A PostgreSQL database contains ``schema``s, which in turn contain tables
  * ``public`` is the default schema, hence ``public.tweet`` $=$ ``tweet``

# Database normalisation

Definition:
  * "The process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity" - [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization)

1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
3. To make the relational model more informative to users;
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

     - E.F. Codd, "Further Normalization of the Data Base Relational Model"


## Database anomalies

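* Update anomaly
  * The same fact is stored in several places and only some of the copies are updated, leaving the data inconsistent

* Insertion anomaly
  * A fact cannot be recorded without also supplying unrelated data, so fields must be left ``null`` or the row cannot be inserted at all

* Deletion anomaly
  * Deleting one fact forces you to delete other, unrelated facts stored in the same row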

## Database normalisation hierarchy

![Normalisation hierarchy](images/normalisation.png)

## Normal form 1

**Definition**: All fields are atomic:

1. There are no duplicated rows in the table.
2. Each cell is single-valued (no repeating groups or arrays).
3. Entries in a column (field) are of the same kind.

**Motivation**: Duplication makes it hard to change (update and delete anomalies)

## Normal form 2

**Definition**: Non-key attributes are dependent on all of the key.

  1. Relation is in 1st NF
  2. Eliminate redundant data: if an attribute depends on only part of a composite (multi-attribute) key, move it to a separate table

**Motivation**: Imagine you wanted to give a trick a different ID: update anomaly.

## Normal form 3

**Definition**: No transitive dependencies (no non-key attribute depends on the key through another non-key attribute)

1. Relation is in 2NF
2. If attributes do not contribute to a description of the key, remove them to a separate table.

**Motivation**: Avoids redundant data

## Functional dependencies

* $X \rightarrow Y$
  * When each $X$ value is associated with precisely one $Y$ value
  
* A functional dependency can be *trivial*
  * When $Y \subseteq X$, e.g. $\{Name, Address\} \rightarrow \{Name\}$
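
Whether a functional dependency holds in a given table instance can be checked mechanically. The function `fd_holds` below is a hypothetical helper written for this illustration; it uses the student table from the relational algebra slides. Keep in mind that an instance can only refute a dependency: whether it holds in general is a statement about the domain, not about the current rows.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every lhs value maps to exactly one rhs value in rows."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen and returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True

students = [
    {"CPR": "140298-1234", "Name": "Thomas", "Address": "Copenhagen"},
    {"CPR": "041297-5367", "Name": "Nikoline", "Address": "Aarhus"},
    {"CPR": "151197-2352", "Name": "Claus", "Address": "Dragør"},
    {"CPR": "050596-1142", "Name": "Martin", "Address": "Copenhagen"},
]

cpr_to_name = fd_holds(students, ["CPR"], ["Name"])
# Copenhagen is associated with both Thomas and Martin.
address_to_name = fd_holds(students, ["Address"], ["Name"])
# Always holds, because the right-hand side is contained in the left-hand side.
trivial = fd_holds(students, ["Name", "Address"], ["Name"])

print(cpr_to_name, address_to_name, trivial)
```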

## Keys

* One does not simply key

* A **superkey** is a set of attributes that *uniquely* identifies a tuple
  * If your table contains no duplicate rows, the set of all attributes is a pretty trivial superkey


* A **minimal superkey** is a superkey from which no attribute can be removed without losing uniqueness
  * This is also called a **candidate key**
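
The difference between a superkey and a minimal superkey can be explored on a concrete table. The sketch below uses two hypothetical helpers, `is_unique` and `minimal_unique_sets`, on the student table from earlier. It only looks at the rows that happen to be in the table, so it finds attribute sets that are unique in this instance; real keys are a design decision about all possible rows.

```python
from itertools import combinations

def is_unique(rows, attrs):
    """True if no two rows agree on all of the given attributes."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)

def minimal_unique_sets(rows, attributes):
    """Unique attribute sets that contain no smaller unique set."""
    found = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            contains_smaller = any(set(k) <= set(attrs) for k in found)
            if not contains_smaller and is_unique(rows, attrs):
                found.append(attrs)
    return found

students = [
    {"CPR": "140298-1234", "Name": "Thomas", "Address": "Copenhagen"},
    {"CPR": "041297-5367", "Name": "Nikoline", "Address": "Aarhus"},
    {"CPR": "151197-2352", "Name": "Claus", "Address": "Dragør"},
    {"CPR": "050596-1142", "Name": "Martin", "Address": "Copenhagen"},
]

all_attributes_unique = is_unique(students, ["CPR", "Name", "Address"])
address_unique = is_unique(students, ["Address"])
candidates = minimal_unique_sets(students, ["CPR", "Name", "Address"])

print(all_attributes_unique, address_unique, candidates)
```

`Name` shows up as unique here only because no two students in this small table share a name, which is exactly why keys should not be read off from sample data alone.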

## Boyce-Codd normal form, or normal form 3.5

**Definition**: For every functional dependency $X \rightarrow Y$ in the relation, at least one of the following holds:

1. $X \rightarrow Y$ is a trivial functional dependency
2. $X$ is a *superkey*

**Motivation**: Less redundancy

## Boyce-Codd normal form example

| Court | Start Time | End Time | Rate Type |
| :--------------- |----------|-----------|---------------|
|1|09:30|10:30|SAVER|
|1|11:00|12:00|SAVER|
|1|14:00|15:30|STANDARD|
|2|10:00|11:30|PREMIUM-B|
|2|11:30|13:30|PREMIUM-B|
|2|15:00|16:30|PREMIUM-A|

* Every attribute is part of some candidate key (there are no non-prime attributes), so the table cannot violate 2NF or 3NF.
  * But what's the problem?

* List of superkeys:
  * {Court, Start Time}, {Court, End Time}, {Rate Type, Start Time}, {Rate Type, End Time}, {Court, Start Time, End Time}, {Rate Type, Start Time, End Time}, {Court, Rate Type, Start Time}, {Court, Rate Type, End Time}, {Court, Rate Type, Start Time, End Time} (trivial superkey)


* Only the first four are *candidate keys*
  * The others are not minimal because, e.g., $\{Court, Start Time\} \subset \{Court, Start Time, End Time\}$

## Boyce-Codd normal form example fix

| Rate Type | Court | Member Flag |
|--------------|-----|-----|
|SAVER|1|Yes|
|STANDARD|1|No|
|PREMIUM-A|2|Yes|
|PREMIUM-B|2|No|
 
|Member Flag|Court|Start Time|End Time|
|------|----|--------|--------|
|Yes|1|09:30|10:30|
|Yes|1|11:00|12:00|
|No|1|14:00|15:30|
|No|2|10:00|11:30|
|No|2|11:30|13:30|
|Yes|2|15:00|16:30|
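
A decomposition is only useful if the original table can be recovered by joining the parts. The sketch below re-creates the two tables above as Python sets of tuples and joins them on Court and Member Flag; the variable names are illustrative.

```python
# (rate_type, court, member_flag)
rate_types = {
    ("SAVER", 1, "Yes"),
    ("STANDARD", 1, "No"),
    ("PREMIUM-A", 2, "Yes"),
    ("PREMIUM-B", 2, "No"),
}

# (member_flag, court, start_time, end_time)
bookings = {
    ("Yes", 1, "09:30", "10:30"),
    ("Yes", 1, "11:00", "12:00"),
    ("No", 1, "14:00", "15:30"),
    ("No", 2, "10:00", "11:30"),
    ("No", 2, "11:30", "13:30"),
    ("Yes", 2, "15:00", "16:30"),
}

# The original table: (court, start_time, end_time, rate_type)
original = {
    (1, "09:30", "10:30", "SAVER"),
    (1, "11:00", "12:00", "SAVER"),
    (1, "14:00", "15:30", "STANDARD"),
    (2, "10:00", "11:30", "PREMIUM-B"),
    (2, "11:30", "13:30", "PREMIUM-B"),
    (2, "15:00", "16:30", "PREMIUM-A"),
}

# Join the two tables on the attributes they share: Court and Member Flag.
joined = {
    (court, start, end, rate)
    for flag, court, start, end in bookings
    for rate, rate_court, rate_flag in rate_types
    if (court, flag) == (rate_court, rate_flag)
}

print(joined == original)
```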


## Normal form 4

* Introduced in 1977 by Ronald Fagin
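* In practice, a table in normal form 4 is usually also in normal form 5, but this is not guaranteed

* **Definition**: Every non-trivial multivalued dependency is a dependency on a superkey
  * For every non-trivial multivalued dependency X ↠ Y, X is a superkey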

**Motivation**: De-duplication

## Normal form 4 pizza example

| Restaurant | Pizza Variety | Delivery Area |
| -------------|-------------------|--------------|
|A1 Pizza|Thick Crust|Springfield|
|A1 Pizza|Thick Crust|Shelbyville|
|A1 Pizza|Thick Crust|Capital City|
|A1 Pizza|Stuffed Crust|Springfield|
|A1 Pizza|Stuffed Crust|Shelbyville|
|A1 Pizza|Stuffed Crust|Capital City|
|Elite Pizza|Thin Crust|Capital City|
|Elite Pizza|Stuffed Crust|Capital City|
|Vincenzo's Pizza|Thick Crust|Springfield|
|Vincenzo's Pizza|Thick Crust|Shelbyville|
|Vincenzo's Pizza|Thin Crust|Springfield|
|Vincenzo's Pizza|Thin Crust|Shelbyville|

* What's the problem?

* Solution: Two tables with $\{Restaurant, Pizza Variety\}$ and $\{Restaurant, Delivery Area\}$
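
The split into two tables loses no information, which can be checked by projecting the table and joining the projections again. A small sketch with illustrative variable names:

```python
table = {
    ("A1 Pizza", "Thick Crust", "Springfield"),
    ("A1 Pizza", "Thick Crust", "Shelbyville"),
    ("A1 Pizza", "Thick Crust", "Capital City"),
    ("A1 Pizza", "Stuffed Crust", "Springfield"),
    ("A1 Pizza", "Stuffed Crust", "Shelbyville"),
    ("A1 Pizza", "Stuffed Crust", "Capital City"),
    ("Elite Pizza", "Thin Crust", "Capital City"),
    ("Elite Pizza", "Stuffed Crust", "Capital City"),
    ("Vincenzo's Pizza", "Thick Crust", "Springfield"),
    ("Vincenzo's Pizza", "Thick Crust", "Shelbyville"),
    ("Vincenzo's Pizza", "Thin Crust", "Springfield"),
    ("Vincenzo's Pizza", "Thin Crust", "Shelbyville"),
}

# Project onto {Restaurant, Pizza Variety} and {Restaurant, Delivery Area}.
varieties = {(restaurant, variety) for restaurant, variety, _ in table}
areas = {(restaurant, area) for restaurant, _, area in table}

# Join the two projections on Restaurant.
rejoined = {
    (restaurant, variety, area)
    for restaurant, variety in varieties
    for other_restaurant, area in areas
    if restaurant == other_restaurant
}

print(len(table), len(varieties), len(areas), rejoined == table)
```

Twelve rows shrink to six plus six, and adding a new delivery area for A1 Pizza becomes one new row instead of one row per pizza variety.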

## Normal form 5

* Also known as project-join normal form; Ronald Fagin 1979
* **Definition**: Every non-trivial join dependency in a table is implied by the candidate keys


## Join dependency

* A join dependency $\{A, B, \dots, Z\}$  is implied by the candidate key(s) if and only if each of $A$, $B$, $\dots$, $Z$ is a superkey

## Normal form 5

* Also known as project-join normal form
* **Definition**: Every non-trivial join dependency in a table is implied by the candidate keys

**Motivation**: Avoid redundancy

# The Chinook dataset

https://github.com/lerocha/chinook-database

* Mostly based on iTunes data

      SELECT * from information_schema.tables WHERE table_schema = 'chinook';
      SELECT * from information_schema.columns WHERE table_schema = 'chinook';

## Set operations on the chinook dataset

* ``SELECT * FROM chinook.invoice WHERE billingcountry = 'Norway'``

* ``(SELECT * FROM chinook.invoice WHERE billingcountry = 'Norway') union (SELECT * FROM chinook.invoice WHERE total > 10)``
  * Is there a smarter way to write this?

* ``(SELECT * FROM chinook.invoice WHERE billingcountry = 'Norway') intersect (SELECT * FROM chinook.invoice WHERE total > 10)``
  * What is the smarter way to write this?

* ``(SELECT * FROM chinook.invoice WHERE billingcountry = 'Norway') except (SELECT * FROM chinook.invoice WHERE total > 10)``
  * What is the smarter way to write this?
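
When both operands read from the same table, the set operations can often be folded into a single ``WHERE`` clause. The sketch below illustrates this with Python's built-in `sqlite3` module and a tiny made-up `invoice` table (the rows are invented and this is not the Chinook data). The equivalence relies on the rows being distinct and the columns being non-null; with duplicates or ``null`` values the results can differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice ("
    "invoiceid INTEGER PRIMARY KEY, "
    "billingcountry TEXT NOT NULL, "
    "total REAL NOT NULL)"
)
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [(1, "Norway", 3.96), (2, "Norway", 15.86),
     (3, "Germany", 13.86), (4, "Germany", 1.98)],
)

def ids(sql):
    """Run a query and return the sorted invoice ids."""
    return sorted(row[0] for row in conn.execute(sql))

norway = "SELECT invoiceid FROM invoice WHERE billingcountry = 'Norway'"
pricey = "SELECT invoiceid FROM invoice WHERE total > 10"
base = "SELECT invoiceid FROM invoice WHERE "

union_ids = ids(f"{norway} UNION {pricey}")
or_ids = ids(base + "billingcountry = 'Norway' OR total > 10")

intersect_ids = ids(f"{norway} INTERSECT {pricey}")
and_ids = ids(base + "billingcountry = 'Norway' AND total > 10")

except_ids = ids(f"{norway} EXCEPT {pricey}")
and_not_ids = ids(base + "billingcountry = 'Norway' AND NOT total > 10")

print(union_ids == or_ids, intersect_ids == and_ids, except_ids == and_not_ids)
```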

# Assignment for next week

* **Deadline**: 13th of March 12:00
* **Review deadline**: 14th of March 23:59
* Please send a link to a GitHub repository containing your notebook
  * Jupyter notebooks are shown directly on the GitHub webpage

1. Set up the database as described in these slides
2. Use the set operations (``union``, ``intersect`` or ``except``) on the Chinook dataset to find the following
   1. The union of all the tracks with genreid 18 and all the tracks with genreid 20
      1. Write a line of text in the notebook: What did you find?
   2. The intersection of all the invoices that are cheaper than 10 dollars and all the invoices that are more expensive than 5 dollars
      1. Write a line of text in the notebook: What did you find?
   3. The set of all customers from the USA minus the set of all customers with an email ending in 'yahoo.com'
      1. Write a line of text in the notebook: What did you find?
   4. The union of the set of all albums containing something by Mozart and the set of all albums containing something by Bach
      1. Write a line of text in the notebook: What did you find?