# Level 6: Database Design & Normalization

Good database design is crucial for data integrity, performance, and maintainability. This notebook covers the foundational principles of designing a relational database, including Entity-Relationship (ER) modeling and the process of **normalization**, which helps to reduce data redundancy and improve data integrity.

## 6.1 Entity-Relationship (ER) Modeling

ER modeling is a conceptual way of designing a database. We think about the main **entities** (things we want to store data about) and the **relationships** between them.

- **Entities** usually become **tables** (e.g., `Students`, `Courses`, `Instructors`).
- **Attributes** of entities become **columns** in those tables (e.g., a `Student` has a `name` and `email`).
- **Relationships** define how tables are linked together:
    - **One-to-One (1:1):** Each record in Table A relates to at most one record in Table B, and vice versa (e.g., `User` and `UserProfile`).
    - **One-to-Many (1:M):** One record in Table A can relate to many records in Table B. (e.g., one `Author` can have many `Books`). This is the most common relationship.
    - **Many-to-Many (M:M):** Many records in Table A can relate to many records in Table B. (e.g., `Students` and `Courses`). This requires a third table, called an **association** or **junction table**, to implement.
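
As a minimal sketch of how a one-to-many relationship turns into tables, the snippet below models the `Author`/`Books` example with a foreign key, using Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- One-to-many: each book row points at exactly one author,
-- while one author may be referenced by many book rows.
CREATE TABLE books (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES authors(author_id)
);
""")

# Hypothetical sample data: one author with two books.
conn.execute("INSERT INTO authors VALUES (1, 'Author One')")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Book A", 1), (2, "Book B", 1)],
)

books_by_author = conn.execute(
    "SELECT COUNT(*) FROM books WHERE author_id = 1"
).fetchone()[0]

# A book that references a non-existent author violates the foreign key.
try:
    conn.execute("INSERT INTO books VALUES (3, 'Orphan Book', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(books_by_author, orphan_rejected)
conn.close()
```

Adding a `UNIQUE` constraint to `books.author_id` would turn the same foreign key into a one-to-one link, because each author could then be referenced by at most one row.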

## 6.2 Normalization

Normalization is the process of organizing the columns and tables of a relational database to minimize data redundancy. We'll walk through an example, taking a single, unnormalized table to the Third Normal Form (3NF).

### Unnormalized Form (UNF)
Let's imagine a flat file or spreadsheet for tracking project assignments. It might look like this:

| project_id | project_name | employees (id, name, email) |
|---|---|---|
| 101 | Project Alpha | (1, 'Alice', 'a@a.com'), (2, 'Bob', 'b@b.com') |
| 102 | Project Beta | (1, 'Alice', 'a@a.com'), (3, 'Charlie', 'c@c.com') |

**Problems:**
- The `employees` column contains multiple values (it's not atomic).
- Employee information (like Alice's email) is repeated.
- If we delete Project Beta, we lose the information that Charlie exists (a **deletion anomaly**).

### First Normal Form (1NF)
**Rule:** Ensure all values in a column are atomic (indivisible) and each row is unique. We can achieve this by breaking up the repeating group of employees so that every project-employee pair gets its own row.

| project_id | project_name | employee_id | employee_name | employee_email |
|---|---|---|---|---|
| 101 | Project Alpha | 1 | Alice | a@a.com |
| 101 | Project Alpha | 2 | Bob | b@b.com |
| 102 | Project Beta | 1 | Alice | a@a.com |
| 102 | Project Beta | 3 | Charlie | c@c.com |
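
The flattening step can be sketched in plain Python. This assumes the unnormalized data is held as nested tuples; the names `unf_rows` and `to_1nf` are hypothetical helpers, not part of any library.

```python
# Unnormalized rows: each project carries a nested list of
# (employee_id, employee_name, employee_email) tuples.
unf_rows = [
    (101, "Project Alpha", [(1, "Alice", "a@a.com"), (2, "Bob", "b@b.com")]),
    (102, "Project Beta", [(1, "Alice", "a@a.com"), (3, "Charlie", "c@c.com")]),
]


def to_1nf(rows):
    """Emit one flat row per (project, employee) pair."""
    return [
        (project_id, project_name, emp_id, emp_name, emp_email)
        for project_id, project_name, employees in rows
        for emp_id, emp_name, emp_email in employees
    ]


flat_rows = to_1nf(unf_rows)
for row in flat_rows:
    print(row)
```

Every value in `flat_rows` is now a single scalar, and the pair `(project_id, employee_id)` identifies each row uniquely.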

**Problems Solved:** Values are now atomic.

**New Problems:** Heavy redundancy. `project_name` and the employee details are repeated on every row they apply to. This can lead to **update anomalies** (e.g., if Alice changes her email, we have to update it in multiple places, and missing one leaves the data inconsistent).
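
To make the update anomaly concrete, here is a small sketch that loads the 1NF rows into an in-memory SQLite table and then changes Alice's email on only one of her rows. The table name `assignments_1nf` and the new address `alice@new.com` are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assignments_1nf (
        project_id     INTEGER,
        project_name   TEXT,
        employee_id    INTEGER,
        employee_name  TEXT,
        employee_email TEXT,
        PRIMARY KEY (project_id, employee_id)
    )
""")
conn.executemany(
    "INSERT INTO assignments_1nf VALUES (?, ?, ?, ?, ?)",
    [
        (101, "Project Alpha", 1, "Alice", "a@a.com"),
        (101, "Project Alpha", 2, "Bob", "b@b.com"),
        (102, "Project Beta", 1, "Alice", "a@a.com"),
        (102, "Project Beta", 3, "Charlie", "c@c.com"),
    ],
)

# A careless update that touches only one of Alice's two rows.
conn.execute(
    "UPDATE assignments_1nf SET employee_email = 'alice@new.com' "
    "WHERE employee_id = 1 AND project_id = 101"
)

# Alice now has two different emails on record.
alice_emails = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT employee_email FROM assignments_1nf "
        "WHERE employee_id = 1 ORDER BY employee_email"
    )
]
print(alice_emails)
conn.close()
```

Nothing in the 1NF table prevents this inconsistency, which is the motivation for the next step.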

### Second Normal Form (2NF)
**Rules:**
1. Be in 1NF.
2. Remove partial dependencies. This means that every non-key attribute must depend on the *whole* primary key, not just part of it.

Our primary key here is a composite key: `(project_id, employee_id)`.
- `project_name` depends only on `project_id` (partial dependency).
- `employee_name` and `employee_email` depend only on `employee_id` (partial dependency).
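
One way to spot such dependencies in sample data is to check whether equal values on the left-hand side always come with equal values on the right-hand side. The helper `determines` below is a hypothetical sketch. Note that a check against sample rows can only refute a dependency, never prove that it holds for all future data.

```python
def determines(rows, lhs, rhs):
    """Return True if equal values in columns `lhs` always imply
    equal values in columns `rhs` within the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # Remember the first value seen for this key; any later mismatch
        # means the dependency does not hold.
        if seen.setdefault(key, value) != value:
            return False
    return True


columns = ("project_id", "project_name", "employee_id",
           "employee_name", "employee_email")
rows = [
    dict(zip(columns, values))
    for values in [
        (101, "Project Alpha", 1, "Alice", "a@a.com"),
        (101, "Project Alpha", 2, "Bob", "b@b.com"),
        (102, "Project Beta", 1, "Alice", "a@a.com"),
        (102, "Project Beta", 3, "Charlie", "c@c.com"),
    ]
]

# Each attribute group depends on only one half of the composite key.
name_on_project = determines(rows, ["project_id"], ["project_name"])
details_on_employee = determines(
    rows, ["employee_id"], ["employee_name", "employee_email"]
)
# For contrast: a project does not determine a single employee.
employee_on_project = determines(rows, ["project_id"], ["employee_id"])

print(name_on_project, details_on_employee, employee_on_project)
```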

We need to split the table into three:

**`projects` table:**
```sql
CREATE TABLE projects (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
```

**`employees` table:**
```sql
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    employee_email TEXT NOT NULL UNIQUE
);
```

**`project_assignments` table (Junction Table):**
```sql
CREATE TABLE project_assignments (
    project_id INTEGER,
    employee_id INTEGER,
    PRIMARY KEY (project_id, employee_id),
    FOREIGN KEY (project_id) REFERENCES projects(project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);
```
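
As a quick check that nothing was lost in the split, the sketch below builds the three tables in an in-memory SQLite database, loads the sample data from the earlier tables, and joins them back into the flat 1NF view. It also deletes Project Beta to show that Charlie's record survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the junction table's references

conn.executescript("""
CREATE TABLE projects (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    employee_email TEXT NOT NULL UNIQUE
);
CREATE TABLE project_assignments (
    project_id INTEGER,
    employee_id INTEGER,
    PRIMARY KEY (project_id, employee_id),
    FOREIGN KEY (project_id) REFERENCES projects(project_id),
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);
INSERT INTO projects VALUES (101, 'Project Alpha'), (102, 'Project Beta');
INSERT INTO employees VALUES
    (1, 'Alice', 'a@a.com'), (2, 'Bob', 'b@b.com'), (3, 'Charlie', 'c@c.com');
INSERT INTO project_assignments VALUES (101, 1), (101, 2), (102, 1), (102, 3);
""")

# Rebuild the flat view by joining through the junction table.
flat_view = conn.execute("""
    SELECT p.project_id, p.project_name,
           e.employee_id, e.employee_name, e.employee_email
    FROM project_assignments AS pa
    JOIN projects  AS p ON p.project_id  = pa.project_id
    JOIN employees AS e ON e.employee_id = pa.employee_id
    ORDER BY p.project_id, e.employee_id
""").fetchall()
for row in flat_view:
    print(row)

# Deleting Project Beta (assignments first, then the project)
# no longer removes Charlie from the database.
conn.execute("DELETE FROM project_assignments WHERE project_id = 102")
conn.execute("DELETE FROM projects WHERE project_id = 102")
charlie = conn.execute(
    "SELECT employee_name FROM employees WHERE employee_id = 3"
).fetchone()
print(charlie)
conn.close()
```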

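**Problems Solved:** Data redundancy is greatly reduced. Employee and project information is stored only once.

**New Problems:** What if we add an employee's department? Suppose the `employees` table also stored `department_id` and `department_name`. Then `employee_id` determines `department_id`, and `department_id` in turn determines `department_name` (e.g., Alice -> her department's ID -> `Engineering`). A non-key attribute that depends on the key only through another non-key attribute is a **transitive dependency**.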

### Third Normal Form (3NF)
**Rules:**
1. Be in 2NF.
2. Remove transitive dependencies. This means no non-key attribute should depend on another non-key attribute.

In our `employees` table, if we added `department_name` and `department_head`, these would depend on the `department_id`, not directly on the `employee_id`. To fix this, we create a `departments` table.

**Final Schema (3NF):**
```sql
CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL,
    department_head TEXT  -- depends on department_id, so it lives here
);

CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    employee_email TEXT NOT NULL UNIQUE,  -- carried over from the 2NF schema
    department_id INTEGER,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);

-- projects and project_assignments tables remain the same
```
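
The payoff of removing the transitive dependency can be sketched as follows: renaming a department touches a single row, and every employee sees the new name through a join. The schema is trimmed to the columns needed here, and the department ID `10`, Bob's membership in Engineering, and the new name `Platform Engineering` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    department_id INTEGER,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
INSERT INTO departments VALUES (10, 'Engineering');
INSERT INTO employees VALUES (1, 'Alice', 10), (2, 'Bob', 10);
""")

# The department name is stored once, so a rename is a one-row update.
cursor = conn.execute(
    "UPDATE departments SET department_name = 'Platform Engineering' "
    "WHERE department_id = 10"
)
rows_touched = cursor.rowcount

# Every employee picks up the new name through the join.
names = conn.execute("""
    SELECT e.employee_name, d.department_name
    FROM employees AS e
    JOIN departments AS d ON d.department_id = e.department_id
    ORDER BY e.employee_id
""").fetchall()

print(rows_touched, names)
conn.close()
```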

## 6.3 Indexes

An **index** is an auxiliary data structure that the database engine maintains alongside a table to speed up data retrieval. Simply put, each index entry pairs a column value with a pointer to the row that holds it.

**How it works:** When you create an index on a column, the database stores that column's values in a sorted structure (a B-tree in SQLite and most other relational databases), each with a pointer back to its table row. When a query filters on that column in a `WHERE` clause, the database can find matching rows in logarithmic time by searching the index instead of doing a slow full-table scan.

**Trade-off:** Indexes speed up reads that filter on the indexed column, but they take extra storage and **slow down data modification** (`INSERT`, `UPDATE`, `DELETE`) because every index must be updated along with the table.

Primary keys are automatically indexed.

```python
# Let's create an index on the employees' name column
import sqlite3
conn = sqlite3.connect(':memory:') # Use an in-memory database
cursor = conn.cursor()

cursor.execute("CREATE TABLE employees (id INTEGER, name TEXT);")

# Create the index
cursor.execute("CREATE INDEX idx_employee_name ON employees(name);")

print("Index 'idx_employee_name' created successfully.")

conn.close()
```

```
Index 'idx_employee_name' created successfully.
```
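
To see whether SQLite would actually use the index, one option is to compare the output of `EXPLAIN QUERY PLAN` before and after creating it. The sketch below does that for a lookup by name; the helper `plan` is hypothetical, and the exact wording of the plan text varies between SQLite versions, so the code only prints it rather than relying on a specific string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")


def plan(sql):
    """Return the detail strings (last column) of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT id FROM employees WHERE name = 'Alice'"

# Without an index, the only option is to scan the whole table.
plan_before = plan(query)

conn.execute("CREATE INDEX idx_employee_name ON employees(name)")

# With the index in place, the plan should mention idx_employee_name.
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
conn.close()
```

Checking the plan this way is a cheap sanity test before adding an index to a table that also receives many writes.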
