
## Conceptual Design at the Covert Academy

Welcome to the Covert Academy, the world's premier educational institution for aspiring secret agents! Our state-of-the-art facility is hidden beneath the streets of London and equipped with the latest in espionage technology. We offer a wide range of classes to train our students in the arts of surveillance, infiltration, disguise, and more.

To keep track of our students and classes, we need a well-designed database. Let's walk through the process of conceptual design for this database.

### What is Conceptual Design?

Conceptual design is the first step in creating a database. It involves **formulating business rules**, which define how the database should work and what constraints it should have. It also involves creating a **preliminary list of entities and attributes**, which represent the key concepts and data elements in the system. Finally, it involves creating a **preliminary Entity-Relationship Diagram (ERD)** to visually represent the entities and their relationships.

### Formulating Business Rules

**Business rules** are concise, unambiguous statements that define or constrain some aspect of the database. They are derived through a combination of **descriptive processes**, such as interviewing stakeholders to understand their needs and requirements, and **normative processes**, which involve making decisions about how the system should work.

For the Covert Academy, we have the following business rules:

1.  Each **student** must be enrolled in at least one **class**, but may be enrolled in multiple classes.
2.  Each **class** must have at least one **student** enrolled, but may have many students.

To write effective business rules:

-   Keep them concise and unambiguous
-   Ensure they are testable (you can determine if the system is complying with the rule)
-   Involve all relevant stakeholders in formulating and reviewing them
-   Prioritize them (some may be must-haves, others may be nice-to-haves)
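
To make "testable" concrete, here is a minimal sketch of how rules 1 and 2 could be checked against sample data. The function name `find_rule_violations` and the sample enrollments are hypothetical, invented for illustration; they are not part of the Academy's actual system.

```python
def find_rule_violations(students, classes, enrollments):
    """Return the students and classes that break business rules 1 and 2.

    enrollments is a collection of (student, class) pairs.
    """
    enrolled_students = {student for student, _ in enrollments}
    populated_classes = {cls for _, cls in enrollments}
    return {
        # Rule 1: every student must be enrolled in at least one class.
        "students_without_class": sorted(set(students) - enrolled_students),
        # Rule 2: every class must have at least one student.
        "classes_without_student": sorted(set(classes) - populated_classes),
    }


# Hypothetical sample data, for illustration only.
students = ["James", "Natasha"]
classes = ["Spy Gadgets", "Disguise 101"]
enrollments = [("James", "Spy Gadgets"), ("James", "Disguise 101")]

violations = find_rule_violations(students, classes, enrollments)
print(violations)
```

Here Natasha is reported as violating rule 1, because she appears in no enrollment pair. A rule that cannot be checked in this way is usually too vague to implement.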

### Preliminary List of Entities and Attributes

Based on our business rules, we can identify our preliminary **entities**, which are the key concepts or objects in our system, and their **attributes**, which are the data elements that describe each entity.

For the Covert Academy, our entities and attributes are:

**Student**

-   Name
-   Codename
-   Nationality
-   Specialization

**Class**

-   Name
-   Description
-   Instructor
-   Location

Note that at this stage, we don't include data types, identifier columns, or join tables. We're just focusing on the core entities and their key attributes.

### Preliminary ERD

Finally, we can create a preliminary ERD to visualize our entities and their relationship. Here's what it looks like for the Covert Academy:

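```python
import base64
from IPython.display import Image, display


def mm(graph):
    """Render a Mermaid diagram by sending its source to the mermaid.ink service."""
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))


# "}|" on the left and "|{" on the right both mean "one or more":
# each student needs at least one class, and each class at least one student.
mm("""
    erDiagram
    STUDENT }|--|{ CLASS : enrolls
    """)
```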

This diagram shows that there is a many-to-many relationship between **STUDENT** and **CLASS** (a student can enroll in many classes, a class can have many students).

We'll refine this ERD in the next step, logical modeling, where we'll resolve the many-to-many relationship with a join table. But this preliminary ERD gives us a good starting point to visualize our system.


## What is Logical Modeling?
In the previous section, we laid the conceptual foundation for the Covert Academy's database. Now it's time to take that conceptual model and transform it into a **logical model** using the **relational model**.

Logical modeling is the process of taking a conceptual model and adapting it to fit a specific logical data model. In our case, we'll be using the **relational model**, which organizes data into **tables** (also known as relations) with **rows** (also known as tuples) and **columns** (also known as attributes).

In the relational model:

-   Each table should represent a single entity or concept
-   Each row in a table represents a specific instance of that entity
-   Each column in a table represents an attribute of that entity
-   Each table should have a **primary key**, a unique identifier for each row

### Resolving Many-to-Many Relationships

In our conceptual model, we had a many-to-many relationship between **STUDENT** and **CLASS**. However, in the relational model, we can't directly represent many-to-many relationships. Instead, we need to introduce a **join table**.

A join table is a table that breaks down a many-to-many relationship into two one-to-many relationships. It does this by having foreign keys to both of the original tables.

For the Covert Academy, our join table will be called **ENROLLMENT**. It will have the following structure:

**ENROLLMENT**

-   StudentID (Foreign Key to STUDENT)
-   ClassID (Foreign Key to CLASS)
-   EnrollmentDate

Now, instead of a direct many-to-many relationship, we have:

-   A one-to-many relationship between **STUDENT** and **ENROLLMENT**
-   A one-to-many relationship between **CLASS** and **ENROLLMENT**
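
As a rough preview of how this works in practice, the sketch below builds the three tables in a throwaway in-memory SQLite database and follows both one-to-many relationships to recover the student/class pairs. The SQL used here is only introduced later in this chapter, the simplified column lists are an assumption made to keep the example short, and the sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, not covert_academy.db
conn.executescript("""
    CREATE TABLE STUDENT (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE CLASS (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE ENROLLMENT (
        StudentID INTEGER REFERENCES STUDENT(ID),
        ClassID INTEGER REFERENCES CLASS(ID),
        EnrollmentDate TEXT
    );
    INSERT INTO STUDENT VALUES (1, 'James'), (2, 'Natasha');
    INSERT INTO CLASS VALUES (1, 'Spy Gadgets'), (2, 'Disguise 101');
    -- Each ENROLLMENT row links exactly one student to exactly one class.
    INSERT INTO ENROLLMENT VALUES
        (1, 1, '2023-01-01'), (1, 2, '2023-01-15'), (2, 1, '2023-01-01');
""")

# Follow both one-to-many relationships to recover the many-to-many pairs.
pairs = conn.execute("""
    SELECT s.Name, c.Name
    FROM ENROLLMENT AS e
    JOIN STUDENT AS s ON s.ID = e.StudentID
    JOIN CLASS AS c ON c.ID = e.ClassID
    ORDER BY s.Name, c.Name
""").fetchall()
print(pairs)
```

James appears in two ENROLLMENT rows and Spy Gadgets appears in two ENROLLMENT rows, yet no single row ever holds more than one student or more than one class.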

### Choosing Primary Keys

Each table in our database needs a primary key. A primary key is a unique identifier for each row in a table. There are two main options for primary keys:

1.  **Natural Key**: A natural key uses one of the entity's existing attributes. For example, we could use a student's codename as the primary key for the STUDENT table. Natural keys can be convenient because they don't require an additional column. However, they are problematic if the value ever needs to change, since every foreign key that refers to it must change too.
2.  **Surrogate Key**: A surrogate key is an artificial key that is created specifically to be the primary key. It's usually a simple integer or a universally unique identifier (UUID). Surrogate keys are often preferable because they are guaranteed to be unique and they never need to change.

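For the Covert Academy, we'll use surrogate keys for the STUDENT and CLASS tables. We'll call these `ID` columns. The ENROLLMENT join table doesn't need one: each of its rows is identified by the combination of StudentID and ClassID.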
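
The following sketch, run against a throwaway in-memory SQLite database, illustrates why a surrogate key is stable: SQLite assigns the `ID` itself, and changing a descriptive attribute leaves it untouched. The replacement codename `'Agent 7'` is a made-up value used only for this demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ID INTEGER PRIMARY KEY, Name TEXT, Codename TEXT)")

# No ID is supplied: SQLite assigns the surrogate key itself.
first = conn.execute(
    "INSERT INTO STUDENT (Name, Codename) VALUES ('James', '007')"
).lastrowid
second = conn.execute(
    "INSERT INTO STUDENT (Name, Codename) VALUES ('Natasha', 'Black Widow')"
).lastrowid

# Changing a descriptive attribute leaves the surrogate key untouched,
# so any rows that reference this student by ID remain valid.
conn.execute("UPDATE STUDENT SET Codename = 'Agent 7' WHERE ID = ?", (first,))
rows = conn.execute("SELECT ID, Codename FROM STUDENT ORDER BY ID").fetchall()
print(first, second, rows)
```

Had Codename been the primary key, the same update would also have required changing every row in other tables that referred to the old codename.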

### Updated ERD

With these changes in mind, here's our updated ERD:

```python
mm("""
erDiagram
    STUDENT ||--o{ ENROLLMENT : has
    ENROLLMENT }o--|| CLASS : is_for
""")
```

The tables are as follows:

**STUDENT** (ID (Primary Key), Name, Codename, Nationality, Specialization)

**CLASS** (ID (Primary Key), Name, Description, Instructor, Location)

**ENROLLMENT** (StudentID (Foreign Key to STUDENT), ClassID (Foreign Key to CLASS), EnrollmentDate), where StudentID and ClassID together form the primary key

## Normalization: Keeping Your Data in Line

Before we move on to physical modeling, let's take a moment to discuss a crucial concept in database design: **normalization**. (As it turns out, our tables are already in 3NF. However, this won't always be the case!)

Normalization is the process of organizing data in a database to avoid data redundancy and improve data integrity. It involves dividing large tables into smaller tables and defining relationships between them based on the rules of **normal forms**.

Normalization is important because it:

1.  Minimizes data redundancy
2.  Avoids data anomalies (such as update and deletion anomalies)
3.  Simplifies data management
4.  Reduces data inconsistencies

There are several normal forms, each with its own set of rules. The most common are 1NF, 2NF, and 3NF. Let's dive into each one!

### First Normal Form (1NF)

A table is in 1NF if:

1.  Each column contains atomic values (indivisible values)
2.  There are no repeating groups of columns

For example, consider this non-normalized STUDENT table:

| ID | Name | Codename | Nationality | Specialization |
| --- | --- | --- | --- | --- |
| 1 | James | 007 | British | Espionage, Combat |
| 2 | Natasha | Black Widow | Russian | Combat, Infiltration |

This table is not in 1NF because the Specialization column contains multiple values. To bring it to 1NF, we would create a separate SPECIALIZATION table and establish a one-to-many relationship:

**STUDENT**

| ID | Name | Codename | Nationality |
| --- | --- | --- | --- |
| 1 | James | 007 | British |
| 2 | Natasha | Black Widow | Russian |

**SPECIALIZATION**

| ID | StudentID | Specialization |
| --- | --- | --- |
| 1 | 1 | Espionage |
| 2 | 1 | Combat |
| 3 | 2 | Combat |
| 4 | 2 | Infiltration |
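
The same transformation can be sketched in a few lines of Python. The function name `to_first_normal_form` is hypothetical, and the sketch assumes that multiple specializations are always separated by commas, as they are in the sample table.

```python
def to_first_normal_form(students):
    """Split a multi-valued Specialization column into one row per value."""
    student_rows, specialization_rows = [], []
    for student in students:
        # Keep the single-valued attributes in STUDENT.
        student_rows.append(
            {key: value for key, value in student.items() if key != "Specialization"}
        )
        for value in student["Specialization"].split(","):
            specialization_rows.append({
                "ID": len(specialization_rows) + 1,  # surrogate key for the new table
                "StudentID": student["ID"],
                "Specialization": value.strip(),
            })
    return student_rows, specialization_rows


unnormalized = [
    {"ID": 1, "Name": "James", "Codename": "007", "Nationality": "British",
     "Specialization": "Espionage, Combat"},
    {"ID": 2, "Name": "Natasha", "Codename": "Black Widow", "Nationality": "Russian",
     "Specialization": "Combat, Infiltration"},
]
student_rows, specialization_rows = to_first_normal_form(unnormalized)
for row in specialization_rows:
    print(row)
```

Each printed row holds exactly one specialization, which is what "atomic" means here.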

### Second Normal Form (2NF)

A table is in 2NF if:

1.  It is in 1NF
2.  All non-key columns depend on the whole primary key, not just on part of a composite key

Consider this 1NF ENROLLMENT table:

| StudentID | ClassID | EnrollmentDate | ClassName |
| --- | --- | --- | --- |
| 1 | 1 | 2023-01-01 | Spy Gadgets |
| 1 | 2 | 2023-01-15 | Disguise 101 |
| 2 | 1 | 2023-01-01 | Spy Gadgets |

This table is not in 2NF because ClassName is dependent on ClassID, which is only part of the primary key (StudentID, ClassID). To bring it to 2NF, we move ClassName to the CLASS table:

**ENROLLMENT**

| StudentID | ClassID | EnrollmentDate |
| --- | --- | --- |
| 1 | 1 | 2023-01-01 |
| 1 | 2 | 2023-01-15 |
| 2 | 1 | 2023-01-01 |

**CLASS**

| ID | ClassName |
| --- | --- |
| 1 | Spy Gadgets |
| 2 | Disguise 101 |
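
As an illustration of the same decomposition, the sketch below splits the 1NF rows into the two 2NF tables using plain Python data structures. The variable names are chosen for this example only.

```python
rows = [
    {"StudentID": 1, "ClassID": 1, "EnrollmentDate": "2023-01-01", "ClassName": "Spy Gadgets"},
    {"StudentID": 1, "ClassID": 2, "EnrollmentDate": "2023-01-15", "ClassName": "Disguise 101"},
    {"StudentID": 2, "ClassID": 1, "EnrollmentDate": "2023-01-01", "ClassName": "Spy Gadgets"},
]

# ClassName depends only on ClassID, so it moves to CLASS and is stored once per class.
class_table = {row["ClassID"]: row["ClassName"] for row in rows}

# ENROLLMENT keeps only the columns that depend on the full (StudentID, ClassID) key.
enrollment_table = [
    (row["StudentID"], row["ClassID"], row["EnrollmentDate"]) for row in rows
]

print(class_table)
print(enrollment_table)
```

"Spy Gadgets" was stored twice before the split and is stored once afterwards, so renaming the class now means changing a single value.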

### Third Normal Form (3NF)

A table is in 3NF if:

1.  It is in 2NF
2.  There are no transitive dependencies

A transitive dependency occurs when a non-key column depends on another non-key column, so that it depends on the primary key only indirectly.

Consider this 2NF CLASS table:

| ID | ClassName | InstructorID | InstructorName |
| --- | --- | --- | --- |
| 1 | Spy Gadgets | 1 | Q |
| 2 | Disguise 101 | 2 | M |

This table is not in 3NF because InstructorName depends on InstructorID, a non-key column, and therefore depends on the primary key (ID) only transitively. To bring it to 3NF, we move InstructorName to a new INSTRUCTOR table:

**CLASS**

| ID | ClassName | InstructorID |
| --- | --- | --- |
| 1 | Spy Gadgets | 1 |
| 2 | Disguise 101 | 2 |

**INSTRUCTOR**

| ID | Name |
| --- | --- |
| 1 | Q |
| 2 | M |
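
One way to spot candidate dependencies in existing data is to test whether one column's value always fixes another's. The helper below is a hypothetical sketch, and the third sample row is invented so that one instructor teaches two classes. Keep in mind that sample data can only refute a dependency; whether a dependency really holds is a question about the business rules.

```python
def determines(rows, lhs, rhs):
    """Return True if every value of column lhs maps to a single value of column rhs."""
    seen = {}
    for row in rows:
        # setdefault records the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


class_rows = [
    {"ID": 1, "ClassName": "Spy Gadgets", "InstructorID": 1, "InstructorName": "Q"},
    {"ID": 2, "ClassName": "Disguise 101", "InstructorID": 2, "InstructorName": "M"},
    # Hypothetical third row, added so that one instructor teaches two classes.
    {"ID": 3, "ClassName": "Advanced Gadgets", "InstructorID": 1, "InstructorName": "Q"},
]

print(determines(class_rows, "InstructorID", "InstructorName"))
print(determines(class_rows, "InstructorName", "ClassName"))
```

The first check is consistent with InstructorID determining InstructorName, which is the dependency that 3NF moves into the INSTRUCTOR table. The second check fails because Q is associated with two different class names.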

And there you have it! By applying these normal forms, we ensure our database is well-structured, minimizes redundancy, and avoids anomalies.

## Physical Modeling: Bringing the Database to Life

Welcome back, future data masters! We've conceptually designed our database and transformed it into a logical model using the relational model and normalization techniques. Now it's time for the exciting part - actually creating our database!

### What is Physical Modeling?

Physical modeling is the process of taking the logical model and implementing it in a specific database management system (DBMS). This involves defining the actual tables, columns, data types, and constraints in the database using the Data Definition Language (DDL) of the chosen DBMS.

For the Covert Academy's database, we'll be using **SQLite**.

### What is SQLite?

SQLite is a lightweight, file-based DBMS. It's a popular choice for many applications because:

1.  It's serverless (the database is just a file)
2.  It's self-contained (no external dependencies)
3.  It's cross-platform (works on all major operating systems)
4.  It's free to use (the source code is in the public domain)

SQLite, like most modern DBMSs, uses **SQL (Structured Query Language)** for defining and manipulating databases.

### What is ANSI Standard SQL?

ANSI (American National Standards Institute) is an organization that defines standards for various industries, including database systems. ANSI SQL is a standard that defines the SQL language.

Most modern DBMSs, including SQLite, follow the ANSI SQL standard to a large extent. This means that the core SQL syntax is the same across different DBMSs. However, each DBMS also has its own extensions and quirks.

### ANSI Standard SQL Data Types

ANSI SQL defines a set of standard data types. The most common ones are:

1.  **INTEGER**: A whole number.
2.  **REAL**: A floating-point number.
3.  **VARCHAR(n)**: A string of characters with a maximum length of n.
4.  **BLOB**: Binary Large Object, used for storing large amounts of binary data.
5.  **DATE**: A date value (YYYY-MM-DD).
6.  **TIME**: A time value (HH:MM:SS).
7.  **TIMESTAMP**: A combination of date and time (YYYY-MM-DD HH:MM:SS).

SQLite supports most of these data types, although it has some of its own quirks:

-   SQLite has a small number of underlying "storage classes" (NULL, INTEGER, REAL, TEXT, and BLOB) that it uses to store all values, regardless of the declared SQL data type.
-   SQLite is "dynamically typed": it determines the storage class of each value as it is inserted and, unlike most RDBMSs, doesn't strictly enforce column data types. A declared type only gives the column a preferred "type affinity", and length limits such as the `n` in `VARCHAR(n)` are not enforced.
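
The sketch below shows this behavior using Python's built-in `sqlite3` module and SQLite's `typeof()` function. The table `demo` and its columns are hypothetical, created in a throwaway in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (n INTEGER, s VARCHAR(5))")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [
        (42, "short"),
        ("123", "far longer than five characters"),  # numeric-looking text becomes an integer
        ("abc", 7),  # non-numeric text stays text; the number 7 is stored as text
    ],
)

rows = conn.execute(
    "SELECT n, typeof(n), s, typeof(s) FROM demo ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Every insert succeeds: the text `'abc'` is stored in the INTEGER column as text, and the 31-character string fits in the `VARCHAR(5)` column. Most other DBMSs would reject or truncate these values, so don't rely on SQLite's declared types alone to validate data.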

## The CREATE TABLE Statement

Now that we know about data types, we can start creating our tables! In SQL, we use the `CREATE TABLE` statement for this.

The general syntax is:

```sql
CREATE TABLE table_name (
    column1 datatype constraint,
    column2 datatype constraint,
    ...
    PRIMARY KEY (one or more columns)
);
```

For example, let's create the STUDENT table:

```python
# Connect to SQLite using the SQL magic
%load_ext sql
%sql sqlite:///covert_academy.db
```

```sql
%%sql
CREATE TABLE STUDENT (
    ID INTEGER PRIMARY KEY,  -- The primary key; SQLite assigns the next integer when no ID is given
    Name VARCHAR(100) NOT NULL,  -- The student's name, cannot be NULL
    Codename VARCHAR(50),  -- The student's codename, can be NULL
    Nationality VARCHAR(50),  -- The student's nationality
    Specialization VARCHAR(100)  -- The student's specialization
);
```

This creates a table named STUDENT with five columns. Note the use of ANSI standard data types and the comments explaining each column.

We can create the CLASS and ENROLLMENT tables similarly:

```sql
%%sql
CREATE TABLE CLASS (
    ID INTEGER PRIMARY KEY,  -- The primary key
    Name VARCHAR(100) NOT NULL,  -- The class name, cannot be NULL
    Description VARCHAR(200),  -- The class description
    Instructor VARCHAR(100),  -- The class instructor
    Location VARCHAR(100)  -- The class location
);

CREATE TABLE ENROLLMENT (
    StudentID INTEGER,  -- Foreign key to STUDENT table
    ClassID INTEGER,  -- Foreign key to CLASS table
    EnrollmentDate DATE,  -- The date of enrollment
    PRIMARY KEY (StudentID, ClassID),  -- Composite primary key
    FOREIGN KEY (StudentID) REFERENCES STUDENT(ID),  -- Foreign key constraint
    FOREIGN KEY (ClassID) REFERENCES CLASS(ID)  -- Foreign key constraint
);
```

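Note how in the ENROLLMENT table, we define a composite primary key (StudentID, ClassID) and also specify the foreign key constraints. Be aware that SQLite only enforces foreign key constraints on connections where the feature has been switched on with `PRAGMA foreign_keys = ON`, and in most builds it is off by default.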
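
SQLite's handling of foreign keys deserves a quick experiment. The sketch below uses Python's built-in `sqlite3` module with throwaway in-memory databases; the helper name `orphan_insert_succeeds` and the trimmed-down schema (only the STUDENT reference is kept) are assumptions made for this example.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE STUDENT (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE ENROLLMENT (
        StudentID INTEGER,
        ClassID INTEGER,
        PRIMARY KEY (StudentID, ClassID),
        FOREIGN KEY (StudentID) REFERENCES STUDENT(ID)
    );
"""


def orphan_insert_succeeds(enforce):
    """Try to enroll a student who does not exist; report whether SQLite allowed it."""
    conn = sqlite3.connect(":memory:")
    # The pragma applies per connection and must be set outside a transaction.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.executescript(SCHEMA)
    try:
        conn.execute("INSERT INTO ENROLLMENT VALUES (999, 1)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


print(orphan_insert_succeeds(enforce=False))
print(orphan_insert_succeeds(enforce=True))
```

With enforcement off, the row for the nonexistent student 999 is accepted even though the constraint is declared. With enforcement on, the same insert raises `sqlite3.IntegrityError`.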

## Column Constraints

Before we start inserting data into our database, let's discuss some important concepts that help maintain the integrity and consistency of our data: **constraints**.

Column constraints are rules applied to individual columns in a table. They restrict the type of data that can be stored in a column. The most common constraints are:

1.  **CHECK**: Ensures that a column's value satisfies a boolean expression.
2.  **DEFAULT**: Specifies a default value for a column when no value is provided.
3.  **NOT NULL**: Ensures that a column cannot have a NULL value.
4.  **UNIQUE**: Ensures that each value in a column is unique across the whole table.

Let's see how we can apply these constraints to our tables.

### The DROP TABLE Statement

But first, let's discuss how to drop a table. The `DROP TABLE` statement is used to remove a table definition and all its data. The syntax is simple:

```sql
DROP TABLE table_name;
```

This statement is irreversible, so use it with caution!

Now, let's recreate our tables with some constraints.

### Recreating the STUDENT Table

```sql
%%sql
DROP TABLE IF EXISTS STUDENT;  -- Drop the table if it already exists

CREATE TABLE STUDENT (
    ID INTEGER PRIMARY KEY,
    Name VARCHAR(100) NOT NULL,
    Codename VARCHAR(50) UNIQUE,  -- Codenames must be unique
    Nationality VARCHAR(50) DEFAULT 'Unknown',  -- Default nationality is 'Unknown'
    Specialization VARCHAR(100),
    Age INTEGER CHECK (Age >= 18)  -- Students must be at least 18 years old
);
```

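Here, we've added a UNIQUE constraint to the Codename column and a DEFAULT constraint to the Nationality column. We've also added a new Age column with a CHECK constraint to ensure that all students are at least 18 years old. Because a CHECK constraint only rejects a row when its expression is false, a NULL Age is still accepted unless the column is also declared NOT NULL.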
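
To see these constraints at work, the sketch below recreates the STUDENT table in a throwaway in-memory database and attempts four inserts. The helper `try_insert` and the sample students other than James are hypothetical, invented to trigger each constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE STUDENT (
        ID INTEGER PRIMARY KEY,
        Name VARCHAR(100) NOT NULL,
        Codename VARCHAR(50) UNIQUE,
        Nationality VARCHAR(50) DEFAULT 'Unknown',
        Specialization VARCHAR(100),
        Age INTEGER CHECK (Age >= 18)
    )
""")


def try_insert(sql):
    """Run one INSERT and return 'ok' or SQLite's constraint error message."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as error:
        return str(error)


results = [
    try_insert("INSERT INTO STUDENT (Name, Codename, Age) VALUES ('James', '007', 35)"),
    # Duplicate codename: violates UNIQUE.
    try_insert("INSERT INTO STUDENT (Name, Codename, Age) VALUES ('Alec', '007', 34)"),
    # Under 18: violates CHECK.
    try_insert("INSERT INTO STUDENT (Name, Codename, Age) VALUES ('Kim', 'Shadow', 16)"),
    # No name supplied: violates NOT NULL.
    try_insert("INSERT INTO STUDENT (Codename, Age) VALUES ('Ghost', 30)"),
]
for result in results:
    print(result)

# Nationality was never supplied for James, so the DEFAULT value is stored.
nationality = conn.execute(
    "SELECT Nationality FROM STUDENT WHERE Codename = '007'"
).fetchone()[0]
print(nationality)
```

Only the first insert succeeds. The exact wording of the error messages varies between SQLite versions, but each one names the kind of constraint that was violated.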

### Recreating the CLASS Table
For the CLASS table, we add a NOT NULL constraint to the Instructor column, two new columns (StartDate and EndDate), and a table-level CHECK constraint to ensure that the StartDate is always before the EndDate.

```sql
%%sql
DROP TABLE IF EXISTS CLASS;

CREATE TABLE CLASS (
    ID INTEGER PRIMARY KEY,
    Name VARCHAR(100) NOT NULL,
    Description VARCHAR(200),
    Instructor VARCHAR(100) NOT NULL,  -- Every class must have an instructor
    Location VARCHAR(100),
    StartDate DATE,
    EndDate DATE,
    CHECK (StartDate < EndDate)  -- The start date must be before the end date
);
```

### Recreating the ENROLLMENT Table
For the ENROLLMENT table, we've added a DEFAULT constraint to set the EnrollmentDate to the current date if no value is provided.

```sql
%%sql
DROP TABLE IF EXISTS ENROLLMENT;

CREATE TABLE ENROLLMENT (
    StudentID INTEGER,
    ClassID INTEGER,
    EnrollmentDate DATE DEFAULT CURRENT_DATE,  -- Default is the current date
    PRIMARY KEY (StudentID, ClassID),
    FOREIGN KEY (StudentID) REFERENCES STUDENT(ID),
    FOREIGN KEY (ClassID) REFERENCES CLASS(ID)
);
```

And there we have it! Our Covert Academy database is now set up with constraints to ensure data integrity.
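
As a quick check of the ENROLLMENT design, the sketch below uses a throwaway in-memory database to show the default date being filled in and the composite primary key rejecting a duplicate enrollment. The foreign key clauses are left out here as a simplifying assumption, so that the example needs only one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ENROLLMENT (
        StudentID INTEGER,
        ClassID INTEGER,
        EnrollmentDate DATE DEFAULT CURRENT_DATE,
        PRIMARY KEY (StudentID, ClassID)
    )
""")

# No EnrollmentDate is supplied, so SQLite fills in CURRENT_DATE.
conn.execute("INSERT INTO ENROLLMENT (StudentID, ClassID) VALUES (1, 1)")
# The same student in a different class is a different key, so this is allowed.
conn.execute("INSERT INTO ENROLLMENT (StudentID, ClassID) VALUES (1, 2)")

stored_date = conn.execute(
    "SELECT EnrollmentDate FROM ENROLLMENT WHERE ClassID = 1"
).fetchone()[0]
print(stored_date)  # today's date in UTC, as YYYY-MM-DD text

try:
    # Same (StudentID, ClassID) pair again: the composite primary key rejects it.
    conn.execute("INSERT INTO ENROLLMENT (StudentID, ClassID) VALUES (1, 1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)
```

Note that SQLite evaluates `CURRENT_DATE` in UTC, so near midnight the stored date may differ from the local calendar date.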

### Syntax for Creating Tables: A Quick Reference

As we've seen, SQL provides a variety of tools for defining the structure and constraints of our database tables. Let's summarize the syntax for some of the key concepts we've encountered.

#### Defining a Primary Key

To define a primary key for a table, you can use the `PRIMARY KEY` constraint. This can be done in two ways:

1.  As part of the column definition:

    ```sql
    CREATE TABLE table_name (
        column_name datatype PRIMARY KEY,
        ...
    );
    ```

2.  As a separate table constraint:

    ```sql
    CREATE TABLE table_name (
        column_name datatype,
        ...,
        PRIMARY KEY (column_name)
    );
    ```

If the primary key is composed of multiple columns (a composite key), you must use the second form and list all the columns in the primary key:

```sql
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    ...,
    PRIMARY KEY (column1, column2)
);
```

#### Defining a Foreign Key

To define a foreign key, you can either add a `REFERENCES referenced_table(referenced_column)` clause to the column definition or, as shown here, declare a separate `FOREIGN KEY` table constraint:

```sql
CREATE TABLE table_name (
    ...,
    foreign_key_column datatype,
    ...,
    FOREIGN KEY (foreign_key_column) REFERENCES referenced_table(referenced_column)
);
```

#### The NOT NULL Constraint
To specify that a column cannot hold NULL values, use the `NOT NULL` constraint:

```sql
CREATE TABLE table_name (
    column_name datatype NOT NULL,
    ...
);
```

#### The UNIQUE Constraint

To ensure that all values in a column are different, use the `UNIQUE` constraint:

```sql
CREATE TABLE table_name (
    column_name datatype UNIQUE,
    ...
);
```

#### The CHECK Constraint
To ensure that a column's value satisfies a boolean expression, use the `CHECK` constraint:

```sql
CREATE TABLE table_name (
    ...,
    column_name datatype CHECK (boolean_expression),
    ...
);
```

#### The DEFAULT Constraint

To specify a default value for a column when no value is provided, use the `DEFAULT` constraint:

```sql
CREATE TABLE table_name (
    column_name datatype DEFAULT default_value,
    ...
);
```

These are the basic building blocks for defining the structure and integrity of your database tables. With these tools, you can ensure that your data is consistent, valid, and maintains the relationships you've defined.