# DBMS and normal forms

``jeep@cphbusiness.dk``

## NF1 motivation

With no duplicate data we cannot

## Agenda

* Recap and hand-in evaluation
* Recap on teaching format
* A brief history of how not to do databases
  * ACID properties
* PostgreSQL
  * Connecting to PostgreSQL
  * Creating tables in PostgreSQL
* Formal SQL introduction 
  * Manifesting ERDs in SQL
  * Keys and surrogate keys
* Normalisation
  * Constraints
  * Normal forms 1, 2 and 3
* Assignment for next week

## Learning outcomes

### Knowledge
* Various database types and the underlying models
* (The particular issues raised by having many simultaneous transactions, including in connection with distributed databases)

### Skills
* Transform logical data models into physical models in various database types
* Use the programming and other facilities provided by a modern DBMS


# Teaching format

![Dunning-Kruger effect](https://i.imgur.com/jbo2gy5.jpg)

## A brief history of how not to do databases

Databases typically contain important information.

Imagine building a database in 1980: you have one single file, shared among many users who can each execute an arbitrary number of queries on your database. What could possibly go wrong?

## Problem 1: Concurrent executions

What if user A mutates the database but, before A is done, user B starts using A's partial work?

  * Dirty reads or lost updates: data corruption
  
Example: money transfer
  * Two queries: take and put money

**Solution**: atomicity
  * We need to promise that each transaction either happens completely or does not happen at all
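
A minimal sketch of the money transfer as one transaction. The ``accounts`` table, the owners and the amounts are made up for illustration.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE accounts (
  owner   VARCHAR PRIMARY KEY,
  balance int NOT NULL
);
INSERT INTO accounts (owner, balance) VALUES ('A', 100), ('B', 50);

BEGIN;                                                         -- start the transaction
UPDATE accounts SET balance = balance - 30 WHERE owner = 'A';  -- take
UPDATE accounts SET balance = balance + 30 WHERE owner = 'B';  -- put
COMMIT;                                                        -- both updates become permanent together
```

Ending the block with ``ROLLBACK;`` instead of ``COMMIT;`` would undo both updates, so the money is never taken without also being put.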

![Do or do not](https://media.giphy.com/media/26FmQ6EOvLxp6cWyY/giphy.gif)

## Problem 2: Inconsistency

So, tables cannot be corrupted. But what about their relations? They can still break!

**Solution**: Consistency
  * We have to promise to always keep the database in a consistent state
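
One way a DBMS keeps relations between tables from breaking is to refuse statements that would leave a dangling reference. A small sketch, assuming two hypothetical tables ``customers`` and ``orders``:

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE customers (
  id int PRIMARY KEY
);
CREATE TABLE orders (
  id       int PRIMARY KEY,
  customer int NOT NULL REFERENCES customers(id)
);
INSERT INTO customers (id) VALUES (1);
INSERT INTO orders (id, customer) VALUES (10, 1);

-- Rejected with an error: order 10 would point at a customer that no longer exists.
DELETE FROM customers WHERE id = 1;
```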

## Problem 3: Concurrent execution

At what point should a query be exposed?

Example:
  * Imagine that I create a table for orders and start filling in data. But before the table is finished another user wants to list all orders.

**Solution**: Isolation
  * Strict isolation: changes become visible only when the entire query is done
  * Relaxed isolation: parts of the work can be accessed before the rest is done
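
In PostgreSQL the isolation level can be chosen per transaction. The sketch below only shows the syntax. Which level counts as "strict" or "relaxed" is a matter of degree; PostgreSQL's default, ``READ COMMITTED``, still never shows uncommitted data from other transactions.

```sql
-- Strict: the transaction behaves as if it ran alone.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SHOW transaction_isolation;   -- reports the level of the current transaction
COMMIT;

-- More relaxed (the PostgreSQL default): each statement sees
-- whatever other transactions have committed so far.
BEGIN ISOLATION LEVEL READ COMMITTED;
SHOW transaction_isolation;
COMMIT;
```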

## Problem 4: Data integrity

Disks are not reliable. Power can go down, bits can flip.

**Solution: Durability**
  * We need to promise our users that data is stored in fail-safe and non-volatile memory.

## DBMS

A database is no longer just a database. We need guarantees to make it work properly.

At the very least we need:

* Atomicity
* Consistency
* Isolation and
* Durability

Also known as ACID

## PostgreSQL

``postgresql.org``

PostgreSQL is a powerful, open source object-relational database system.

And naturally ACID compliant.

## Common databases

* MySQL/MariaDB
* Oracle
* PostgreSQL

## Why PostgreSQL?

* Open-source
* Extremely robust
* Front-line of research
* Well documented
* Great interoperability

## Mastering PostgreSQL
![Mastering PostgreSQL](images/book_fontaine.png)

## Running PostgreSQL

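``docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret -d postgres:alpine``

``docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret -d -v postgres-data:/var/lib/postgresql/data --name psql postgres:alpine``

* The official image will not initialise a database without ``POSTGRES_PASSWORD``; ``secret`` is a placeholder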

## Connecting to PostgreSQL

* Normal psql connection command 
  * ``psql``
* With user ``postgres``
  * ``psql -U postgres``

* Connecting to a shell in a Docker container
  * ``docker exec -it psql bash``

* Connecting to PostgreSQL through the shell in the Docker container
  * ``docker exec -it psql bash -c "psql -U postgres"``

# Formal SQL introduction

* First appeared in 1974
  * Based on Edgar F. Codd's relational model from 1970
* Standardised since the 1980s
  * Now [SQL:2016](https://en.wikipedia.org/wiki/SQL:2016)
  * Not all databases follow the standard
    * Example: Case-sensitivity

* SQL syntax recap
* Data types
* Table creation
* Table keys
* Modeling relations

# Structured query language (SQL)
<div style="float:right; width: 45%"><br/><br/><img alt="SQL" style="width:100%;" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/b0bfef3c941c1a88d3990bd1472653e60cf02d0a" /></div>
* Statements
  * May also be a query
* Clauses
  * ``select``-clause, ``where``-clause etc.
* Expressions
  * ``population + 1``, ``'Boris Jeltsin'``
* Predicates
  * ``name = 'USA'``

* Statements must end with ``;``
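
The parts named above can be pointed out in a single statement. The ``countries`` table and its row are made up for illustration.

```sql
-- Hypothetical table, used only to label the parts of a statement.
CREATE TABLE countries (
  name       VARCHAR,
  population int
);
INSERT INTO countries (name, population) VALUES ('USA', 300);

-- One statement (here a query), made of three clauses:
SELECT name, population + 1   -- select-clause; population + 1 is an expression
FROM countries                -- from-clause
WHERE name = 'USA';           -- where-clause; name = 'USA' is a predicate
```

Note that string literals use single quotes; double quotes are reserved for identifiers such as table and column names.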

## SQL data types

* Character Types

  * Character (CHAR)
  * Character Varying (VARCHAR)
  * Character Large Object (CLOB)

* Binary Types
  * Binary (BINARY)
  * Binary Varying (VARBINARY)
  * Binary Large Object (BLOB)

* Numeric Types
  * Exact Numeric Types (NUMERIC, DECIMAL, SMALLINT, INTEGER, BIGINT)
  * Approximate Numeric Types (FLOAT, REAL, DOUBLE PRECISION)

* Datetime Types (DATE, TIME, TIMESTAMP)
* Interval Type (INTERVAL)
* Boolean
* XML
* null<sup>TM</sup>
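
A sketch of how these types look in a PostgreSQL table definition. The table and column names are made up. PostgreSQL has no ``CLOB``, ``BLOB``, ``BINARY`` or ``VARBINARY``; ``TEXT`` and ``BYTEA`` are used here as the closest equivalents.

```sql
-- Hypothetical table that only exists to show the type names.
CREATE TABLE type_demo (
  code        CHAR(3),            -- fixed-length character
  title       VARCHAR(100),       -- variable-length character with a limit
  body        TEXT,               -- stands in for CLOB
  payload     BYTEA,              -- stands in for the binary types
  price       NUMERIC(10, 2),     -- exact numeric
  quantity    INTEGER,            -- exact numeric
  ratio       DOUBLE PRECISION,   -- approximate numeric
  created_on  DATE,
  created_at  TIMESTAMP,
  duration    INTERVAL,
  active      BOOLEAN
);
```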

## SQL relations

* We needed a language that can model
  * Entities
  * Attributes
  * Relationships

* In SQL
  * Entities $=$ Tables $=$ Relations
  * Attributes $=$ Columns
  * Relationships $\approx$ foreign keys

## Creating entities

* ``CREATE TABLE [name] ();``
* ``DROP TABLE [name];``

## .. with attributes

    CREATE TABLE [name] (
      [attribute1] [type],
      [attribute2] [type]
    );

Example:

    CREATE TABLE actors (
      name VARCHAR,
      age int
    );

## Inspecting tables in PostgreSQL

  * ``\dt``: lists tables
  * ``\d [name]``: describes the structure of a table

## Identifying tuples/rows

How do we find the needle in the haystack? 
  * By some kind of identifier, a **key**

Example:

    CREATE TABLE actors (
      name VARCHAR PRIMARY KEY,
      age int
    );

What if you want to identify a tuple by more than one column?

Example:

    CREATE TABLE actors (
      name VARCHAR,
      age int,
      PRIMARY KEY (name, age)
    );

## A note on primary keys

Primary keys are unique: you cannot have two entries with the same primary key combination. They also **cannot** be null.

* Unique: ``UNIQUE (name, age)``
* Not-null: ``name VARCHAR NOT NULL``

Example:

    CREATE TABLE actors (
      name VARCHAR NOT NULL,
      age int NOT NULL,
      UNIQUE (name, age)
    );
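
A sketch of how the two constraints behave on insert. The table name ``actors_unique`` and the rows are made up so the example does not clash with the ``actors`` table above.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE actors_unique (
  name VARCHAR NOT NULL,
  age  int NOT NULL,
  UNIQUE (name, age)
);

INSERT INTO actors_unique (name, age) VALUES ('Dwayne Johnson', 46);  -- accepted
INSERT INTO actors_unique (name, age) VALUES ('Dwayne Johnson', 47);  -- accepted: the combination differs
INSERT INTO actors_unique (name, age) VALUES ('Dwayne Johnson', 46);  -- rejected: duplicate (name, age)
INSERT INTO actors_unique (name, age) VALUES (NULL, 46);              -- rejected: name is NOT NULL
```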

## Describing relationships in SQL

Relationships require that you can point to a specific row. With keys, now we can!

Say we have a table:

    CREATE TABLE actors (
      name VARCHAR PRIMARY KEY,
      age int
    );

Then we can **reference** the ``name`` column in another table:
    
    CREATE TABLE role (
      actor VARCHAR REFERENCES actors(name),
      movie VARCHAR REFERENCES movies(title)
    );
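
A complete sketch that can be run in an empty database. The ``movies`` table is not shown in these slides, so its definition here is an assumption, and the age is made up.

```sql
CREATE TABLE actors (
  name VARCHAR PRIMARY KEY,
  age  int
);
CREATE TABLE movies (
  title VARCHAR PRIMARY KEY   -- assumed definition
);
CREATE TABLE role (
  actor VARCHAR REFERENCES actors(name),
  movie VARCHAR REFERENCES movies(title)
);

-- Referenced rows must exist before a role can point at them.
INSERT INTO actors (name, age) VALUES ('The Rock', 46);
INSERT INTO movies (title) VALUES ('G.I.Joe');
INSERT INTO role (actor, movie) VALUES ('The Rock', 'G.I.Joe');

-- Follow the reference from role to actors.
SELECT r.movie, a.name, a.age
FROM role AS r
JOIN actors AS a ON a.name = r.actor;
```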

## Constraints

* ``REFERENCES``, ``UNIQUE``, ``NOT NULL`` and ``PRIMARY KEY`` are **constraints**
* They **cannot** be broken: a statement that violates one is rejected with an error
  * Similar to ``assert``
* Example:
  * ``INSERT INTO role (actor, movie) VALUES ('The Rock', 'G.I.Joe'); -- Error unless both referenced rows exist``

## Normal forms

We talked about how databases would keep the data safe and consistent.

But what about performance and data constraints?

* Functional dependencies
* Database optimisation problems
* Normal forms

## Functional dependency

* A constraint between two sets of attributes in a table.
* Column B is functionally dependent on column A if and only if each value in A is associated with precisely one value in B
* Or column A is said to **functionally determine** column B
* Written A → B

* Example: CPR → Name and CPR → Address hold, but Address → Name does not (two people live in Copenhagen)

| CPR | Name | Address |
|---------|---------|-------------|
|140298-1234|Thomas|Copenhagen|
|041297-5367|Nikoline|Aarhus|
|151197-2352|Claus|Dragør|
|050596-1142|Martin|Copenhagen|
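
Whether a dependency is contradicted by the stored data can be checked with a grouping query. A sketch, assuming the table above is stored as a hypothetical table ``people``. Data can only refute a dependency; that it really holds is a statement about the domain.

```sql
-- Hypothetical table holding the rows from the example.
CREATE TABLE people (
  cpr     VARCHAR,
  name    VARCHAR,
  address VARCHAR
);
INSERT INTO people (cpr, name, address) VALUES
  ('140298-1234', 'Thomas',   'Copenhagen'),
  ('041297-5367', 'Nikoline', 'Aarhus'),
  ('151197-2352', 'Claus',    'Dragør'),
  ('050596-1142', 'Martin',   'Copenhagen');

-- CPR → Name: lists every CPR paired with more than one name (none here).
SELECT cpr FROM people GROUP BY cpr HAVING COUNT(DISTINCT name) > 1;

-- Address → Name: lists Copenhagen, which is paired with both Thomas and Martin.
SELECT address FROM people GROUP BY address HAVING COUNT(DISTINCT name) > 1;
```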

## Database optimisation

We need our databases to run fast. Functional dependencies matter because they help us find and eliminate redundancies.

The question is how we can affect the quality of our design:

  * Relations
  * Attributes
  * Constraints

# Database normalisation

Definition:
  * "The process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity" - [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization)

1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
3. To make the relational model more informative to users;
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

Source: E.F. Codd, "Further Normalization of the Data Base Relational Model"


## Database normalisation hierarchy

![Normalisation hierarchy](images/normalisation.png)

## Normal form 1

**Definition**: All fields are atomic:

1. There are no duplicated rows in the table.
2. Each cell is single-valued (no repeating groups or arrays).
3. Entries in a column (field) are of the same kind.


Example:

| CPR | Name | Address |
|---------|---------|-------------|
|050596-1142|Martin|Copenhagen|
|140298-1234|Thomas|Copenhagen <br>Aarhus|
|041297-5367|null|Aarhus|
|151197-2352|Claus|Dragør|
|050596-1142|Martin|Copenhagen|

**Motivation**: Duplication makes data hard to change (update and delete anomalies)
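
One possible way to bring the example into 1NF, sketched with hypothetical table names. The row with the missing name is left out because the example does not say what the name should be.

```sql
CREATE TABLE person (
  cpr  VARCHAR PRIMARY KEY,    -- a key rules out duplicated rows
  name VARCHAR NOT NULL
);
CREATE TABLE person_address (
  cpr     VARCHAR REFERENCES person(cpr),
  address VARCHAR,
  PRIMARY KEY (cpr, address)   -- one row per address, so every cell is single-valued
);

INSERT INTO person (cpr, name) VALUES
  ('050596-1142', 'Martin'),
  ('140298-1234', 'Thomas'),
  ('151197-2352', 'Claus');
INSERT INTO person_address (cpr, address) VALUES
  ('050596-1142', 'Copenhagen'),
  ('140298-1234', 'Copenhagen'),
  ('140298-1234', 'Aarhus'),
  ('151197-2352', 'Dragør');
```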

## Normal form 2

**Definition**: Every non-key attribute depends on the whole key.

  1. Relation is in 1st NF
  2. Eliminate redundant data: if an attribute depends on only part of a composite (multi-column) key, move it to a separate table

Example (what is the key?):

| Puppy Number | Trick ID | Trick Name | Skill Level |
|-----|------|------------|------|
| 52 | 27 | “roll over” | 9 |
| 53 | 16 | “Nose Stand” | 9 |
| 54 | 27 | “roll over” | 5 |

**Motivation**: Imagine you wanted to give a trick a different ID or name: every row that mentions the trick must change (update anomaly).
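
A sketch of one possible 2NF decomposition, assuming the key is (Puppy Number, Trick ID) and that Trick Name depends on Trick ID alone. The table and column names are made up.

```sql
CREATE TABLE tricks (
  trick_id   int PRIMARY KEY,
  trick_name VARCHAR NOT NULL          -- depends on trick_id only
);
CREATE TABLE puppy_tricks (
  puppy_number int,
  trick_id     int REFERENCES tricks(trick_id),
  skill_level  int,                    -- depends on the whole key
  PRIMARY KEY (puppy_number, trick_id)
);

INSERT INTO tricks (trick_id, trick_name) VALUES (27, 'roll over'), (16, 'Nose Stand');
INSERT INTO puppy_tricks (puppy_number, trick_id, skill_level) VALUES
  (52, 27, 9),
  (53, 16, 9),
  (54, 27, 5);

-- Renaming a trick now touches exactly one row.
UPDATE tricks SET trick_name = 'Roll Over' WHERE trick_id = 27;
```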

## Normal form 3

**Definition**: No transitive dependencies

## Transitive dependency

* A functional dependency X → Z is transitive if X → Y and Y → Z, unless Y → X

* Example: Book → Author and Author → Author_age, so Book → Author_age is transitive

| Book |Author| Author_age|
|------------------|------------------|-------|
|Game of Thrones|George R. R. Martin|66|
|Harry Potter|J. K. Rowling|49|
|Dying of the Light|George R. R. Martin|66|

## Normal form 3

**Definition**: No transitive dependencies

1. Relation is in 2NF
2. If attributes do not contribute to a description of the key, move them to a separate table.
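
A sketch of the book example in 3NF, with hypothetical table names. The author's age moves to a table keyed by the author, so it is stored once per author instead of once per book.

```sql
CREATE TABLE authors (
  name VARCHAR PRIMARY KEY,
  age  int
);
CREATE TABLE books (
  title  VARCHAR PRIMARY KEY,
  author VARCHAR REFERENCES authors(name)
);

INSERT INTO authors (name, age) VALUES
  ('George R. R. Martin', 66),
  ('J. K. Rowling', 49);
INSERT INTO books (title, author) VALUES
  ('Game of Thrones', 'George R. R. Martin'),
  ('Harry Potter', 'J. K. Rowling'),
  ('Dying of the Light', 'George R. R. Martin');

-- The original three-column table can be rebuilt with a join.
SELECT b.title, b.author, a.age AS author_age
FROM books AS b
JOIN authors AS a ON a.name = b.author;
```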

## Assignment

Given an SQL table of Twitter data, construct a set of relations normalised to 3NF

* **Deadline**: 6th of March 12:00
* **Review**: 7th of March 23:59
* If you are interested, the data is available here: http://followthehashtag.com/datasets/free-twitter-dataset-usa-200000-free-usa-tweets/

The table to normalise:

    create table tweet
    (
      id         bigint primary key,
      date       date,
      hour       time,
      uname      text,
      nickname   text,
      bio        text,                -- User biography
      message    text,
      favs       bigint,              -- Number of users that favourited this tweet
      rts        bigint,              -- Number of times this tweet has been retweeted
      latitude   double precision,
      longitude  double precision,
      country    text,                -- The country where this tweet was tweeted from
      place      text,                -- The name of the location this tweet was tweeted from (if any)
      picture    text,                -- A picture in this tweet (if any)
      followers  bigint,              -- Number of followers the user has
      following  bigint,              -- Number of users the user follows
      listed     bigint,              -- ID of the list this tweet belongs to (if any)
                                      -- In Twitter a tweet can be on a list (conversation) started by a user
      lang       text,                -- Tweet language (not user)
      url        text                 -- Tweet URL
    );
