# Database Normalization

**Database normalization** is a set of principles for designing databases with clarity and logical rigor. A normalized design makes the mapping between real-world entities and their representations in the database explicit.

## Core Principle

The fundamental principle of normalization is that **each table should represent one distinct entity class**.

```{note}
In a normalized design, each row of a given table describes a distinct entity, and no two rows in that table represent different types of entities.
```

## Why Normalization Matters

Each entity type needs its own table because different entity types have different:
- **Identification systems**: How they are uniquely identified
- **Attributes**: What properties they have  
- **Relationships**: How they connect to other entities
- **Business rules**: What constraints apply to them

## Key Requirements

1. **Clear Entity Representation**: The table name must clearly indicate what entity type is represented by the table's rows (using singular form)
2. **Primary Key**: Each table must have a primary key that uniquely identifies each entity
3. **Relevant Attributes**: Secondary attributes must directly describe the entities of the table's class
4. **No Mixed Entities**: Avoid mixing different entity types in the same table

## Example: E-commerce Shopping Cart

Let's examine a common unnormalized design using DataJoint notation. Consider a table representing the items in a shopping cart for an e-commerce site:

```
:: ShoppingCart
order_number : int
item : int
---
purchase_date : date
buyer_full_name : varchar(16)
buyer_address : varchar(1000)
buyer_email : varchar(120)
item_description : varchar(1000)
item_price : numeric(8, 2)
item_quantity : int
total_amount : numeric(8, 2)
```

Here, the first line, starting with `::`, specifies the table name. Subsequent lines describe the table's columns (entity attributes), with a colon `:` separating each attribute name from its data type. The dashes `---` separate the primary key attributes above from the secondary attributes below.

Designs like this are typical of DataJoint newcomers. What is wrong with it? The typical novice mistake is to put too much information in one table. This table describes several entities at once: orders, items, and buyers.

How would you fix this design?
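
Before fixing it, one concrete symptom of the mixed design can be reproduced in a few lines. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for a DataJoint/MySQL backend, keeps just four of the columns, and uses made-up sample rows. The table name `shopping_cart` is likewise an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shopping_cart (
        order_number INTEGER,
        item INTEGER,
        buyer_email TEXT,
        item_description TEXT,
        PRIMARY KEY (order_number, item)
    )
""")

# One order with two items: the buyer's email is stored twice.
conn.executemany(
    "INSERT INTO shopping_cart VALUES (?, ?, ?, ?)",
    [
        (1, 10, "ann@example.com", "kettle"),
        (1, 20, "ann@example.com", "teapot"),
    ],
)

# An update that reaches only one of the rows leaves the order
# contradicting itself about who the buyer is.
conn.execute(
    "UPDATE shopping_cart SET buyer_email = 'ann@new.example.com' "
    "WHERE order_number = 1 AND item = 10"
)

emails = conn.execute(
    "SELECT DISTINCT buyer_email FROM shopping_cart WHERE order_number = 1"
).fetchall()
print(len(emails), "different emails recorded for order 1")
```

Nothing in the table definition prevents this state, because the buyer's email is stored per cart line rather than per order.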

## Normalized Solution

Database normalization requires splitting this into multiple tables, each representing a distinct entity type: 

### 1. Item Table

(DataJoint)
```python
import datajoint as dj

schema = dj.Schema("shop")  # schema name is a placeholder

@schema
class Item(dj.Manual):
    definition = """
    item : int
    ---
    item_description : varchar(1000)
    """
```
(Equivalent SQL)
```sql
CREATE TABLE item (
    item INT,
    item_description VARCHAR(1000) NOT NULL,
    PRIMARY KEY (item)
);
```

### 2. Order Table

(DataJoint)
```python
@schema
class Order(dj.Manual):
    definition = """
    order_number : int
    ---
    purchase_date : date
    buyer_full_name : varchar(16)
    buyer_address : varchar(1000)
    buyer_email : varchar(120)
    total_amount : numeric(8, 2)
    """
```

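(Equivalent SQL; `order` is a reserved word, so the table name is quoted with MySQL-style backticks)
```sql
CREATE TABLE `order` (
    order_number INT PRIMARY KEY,
    purchase_date DATE NOT NULL,
    buyer_full_name VARCHAR(16) NOT NULL,
    buyer_address VARCHAR(1000) NOT NULL,
    buyer_email VARCHAR(120) NOT NULL,
    total_amount NUMERIC(8, 2) NOT NULL
);
```
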
### 3. OrderItem Table (Junction Table)

We specify the items in each order in a separate table, `OrderItem`. This table associates each item, along with its price and quantity, with the order.

```
:: OrderItem
-> Order
-> Item
---
item_price : numeric(8, 2)
item_quantity : int
```

Note the use of the dependencies `-> Order` and `-> Item`. Dependencies include the primary key attributes of the referenced tables in the new table. Without them, we would need to replicate those attributes explicitly:

```
:: OrderItem
order_number : int     # use dependency instead
item : int             # use dependency instead
---
item_price : numeric(8, 2)
item_quantity : int
```
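
To see the three tables working together, here is a minimal sketch that again uses Python's built-in `sqlite3` module instead of DataJoint, so that it runs without a database server. The dependencies `-> Order` and `-> Item` are approximated with `REFERENCES` clauses. The table name `order_item`, the reduced set of buyer columns, and the sample data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.executescript("""
    CREATE TABLE item (
        item INTEGER PRIMARY KEY,
        item_description TEXT NOT NULL
    );
    CREATE TABLE "order" (               -- quoted because ORDER is a reserved word
        order_number INTEGER PRIMARY KEY,
        purchase_date TEXT NOT NULL,
        buyer_email TEXT NOT NULL
    );
    CREATE TABLE order_item (
        order_number INTEGER NOT NULL REFERENCES "order" (order_number),
        item INTEGER NOT NULL REFERENCES item (item),
        item_price REAL NOT NULL,
        item_quantity INTEGER NOT NULL,
        PRIMARY KEY (order_number, item)
    );
""")

conn.executemany("INSERT INTO item VALUES (?, ?)", [(10, "kettle"), (20, "teapot")])
conn.execute('INSERT INTO "order" VALUES (?, ?, ?)', (1, "2024-01-15", "ann@example.com"))
conn.executemany(
    "INSERT INTO order_item VALUES (?, ?, ?, ?)",
    [(1, 10, 25.0, 1), (1, 20, 15.0, 2)],
)

# The foreign key rejects an order line that points at a nonexistent item.
try:
    conn.execute("INSERT INTO order_item VALUES (1, 99, 5.0, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A join reassembles the wide "shopping cart" rows on demand.
cart = conn.execute("""
    SELECT o.order_number, o.buyer_email, i.item_description,
           oi.item_price, oi.item_quantity
    FROM order_item AS oi
    JOIN "order" AS o ON o.order_number = oi.order_number
    JOIN item AS i ON i.item = oi.item
    ORDER BY i.item
""").fetchall()
total = sum(price * quantity for *_, price, quantity in cart)
print(orphan_rejected, len(cart), total)
```

The wide rows of the original `ShoppingCart` table are still available, but as the result of a query rather than as stored, duplicated data.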

## Benefits of Normalized Design

The normalized design provides several advantages:

1. **Eliminates Redundancy**: Each item description is stored once per item and buyer information once per order, instead of on every cart line
2. **Ensures Consistency**: Changes to item descriptions automatically apply everywhere
3. **Prevents Anomalies**: No risk of inconsistent data across related records
4. **Improves Performance**: Smaller tables with focused indexes make updates and lookups cheaper, although queries that span entities require joins
5. **Enhances Maintainability**: Clear separation of concerns
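
The consistency benefit can be illustrated with a short sketch, again using `sqlite3` as a stand-in backend with hypothetical table names and sample data. Because the description lives in exactly one row of `item`, a single `UPDATE` is reflected in every order line that references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        item INTEGER PRIMARY KEY,
        item_description TEXT NOT NULL
    );
    CREATE TABLE order_item (
        order_number INTEGER NOT NULL,
        item INTEGER NOT NULL REFERENCES item (item),
        PRIMARY KEY (order_number, item)
    );
    INSERT INTO item VALUES (10, 'kettle');
    INSERT INTO order_item VALUES (1, 10), (2, 10), (3, 10);
""")

# One UPDATE on the single stored copy of the description...
conn.execute("UPDATE item SET item_description = 'electric kettle' WHERE item = 10")

# ...is what every order line sees through the join.
descriptions = conn.execute("""
    SELECT oi.order_number, i.item_description
    FROM order_item AS oi
    JOIN item AS i ON i.item = oi.item
    ORDER BY oi.order_number
""").fetchall()
print(descriptions)
```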

## Classical Normal Forms

The normalization process follows specific rules called **normal forms**:

- **First Normal Form (1NF)**: Eliminate repeating groups and ensure atomic values
- **Second Normal Form (2NF)**: Remove partial dependencies (all non-key attributes depend on the entire primary key)
- **Third Normal Form (3NF)**: Remove transitive dependencies (non-key attributes don't depend on other non-key attributes)

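DataJoint's conventions, such as explicit primary keys and dependencies, encourage designs that satisfy these forms, although they cannot by themselves rule out an unnormalized table like the `ShoppingCart` example above.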
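
The link between these forms and the shopping cart example can be checked mechanically. The sketch below defines a hypothetical helper, `determines`, that tests whether one set of columns functionally determines another within some sample rows. It is an illustration, not part of DataJoint, and a check on sample data can only refute a dependency, never prove that it holds in general.

```python
def determines(rows, lhs, rhs):
    """Return True if, within `rows`, equal values in the `lhs` columns
    always come with equal values in the `rhs` columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up rows from the unnormalized table, keyed by (order_number, item).
cart_rows = [
    {"order_number": 1, "item": 10, "item_description": "kettle", "item_quantity": 1},
    {"order_number": 1, "item": 20, "item_description": "teapot", "item_quantity": 2},
    {"order_number": 2, "item": 10, "item_description": "kettle", "item_quantity": 3},
]

# item_description depends on `item` alone, which is only part of the key:
# a partial dependency, so the wide table violates 2NF.
partial_dependency = determines(cart_rows, ["item"], ["item_description"])

# item_quantity is not determined by either half of the key on its own...
quantity_by_item = determines(cart_rows, ["item"], ["item_quantity"])
quantity_by_order = determines(cart_rows, ["order_number"], ["item_quantity"])
# ...but it is determined by the full key, so it belongs in OrderItem.
quantity_by_full_key = determines(cart_rows, ["order_number", "item"], ["item_quantity"])

print(partial_dependency, quantity_by_item, quantity_by_order, quantity_by_full_key)
```

In these sample rows, `item_description` follows from `item` alone, which is why it moves to the `Item` table, while `item_quantity` needs both key attributes and stays in `OrderItem`.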