# Metadata

**L1 Taxonomy** - SQL in Python

**L2 Taxonomy** - Database Normalization

**Subtopic** - Integrating Python Libraries for SQL Database Normalization

**Use Case** - Implement a Python script using Pandas to normalize a denormalized CSV dataset into third normal form and export the results to SQL tables.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
pandas
```


# Prompt

## **Problem Overview**

You are given a denormalized CSV file representing customer orders. However, instead of each product being on a separate row, each row may contain **multiple products encoded in a single field** using a custom format.

Your task is to implement a Python program that **normalizes** this dataset into **Third Normal Form (3NF)** and exports it into a SQLite database. You must correctly:

* Parse and normalize multiple products per order row,
* Assign stable, deduplicated primary keys for customers and products,
* Maintain foreign key integrity in the resulting normalized schema.

This problem simulates real-world input from a semi-structured system where embedded lists must be flattened and normalized.


## **Input Format**

* A CSV file named `denormalized_data.csv` with the following headers:

```
order_id,customer_name,customer_email,products,order_date
```

* The `products` field contains one or more product entries **delimited by semicolons (`;`)**. Each product is encoded as:

```
product_name:quantity@price
```

## **Example `products` Field**

```
"Laptop:1@1000;Mouse:2@20"
```
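
The encoding above can be split with plain string operations. The following is a minimal sketch, not the reference implementation; the helper name `parse_products_field` is hypothetical, and raising `ValueError` on a malformed entry is one of the two behaviours the edge cases allow (the other is to log and skip).

```python
from typing import List, Tuple


def parse_products_field(raw: str) -> List[Tuple[str, int, float]]:
    """Split a `products` value into (product_name, quantity, price) tuples."""
    items = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        # Split from the right so the last "@" and ":" act as the separators.
        name_qty, at_sep, price = entry.rpartition("@")
        name, colon_sep, quantity = name_qty.rpartition(":")
        if not at_sep or not colon_sep or not name.strip():
            raise ValueError(f"Malformed product entry: {entry!r}")
        # int() and float() raise ValueError themselves on empty or non-numeric text.
        items.append((name.strip(), int(quantity), float(price)))
    return items


parsed = parse_products_field("Laptop:1@1000;Mouse:2@20")
print(parsed)  # [('Laptop', 1, 1000.0), ('Mouse', 2, 20.0)]
```

Entries such as `"Mouse:2"` or `"Laptop@1000"` raise `ValueError` in this sketch because one of the two separators is missing.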


## **Output Format**

Your function must write a SQLite database to `db_path` (for example, `normalized_data.db`), containing the following **four tables** in 3NF:

1. **customers**
   Columns: `customer_id`, `customer_name`, `customer_email`

2. **products**
   Columns: `product_id`, `product_name`

3. **orders**
   Columns: `order_id`, `customer_id`, `order_date`

4. **order_items**
   Columns: `order_id`, `product_id`, `quantity`, `price`
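
One way to declare these four tables in SQLite is sketched below. The column names come from the list above; the column types, the `NOT NULL` constraints, and the composite primary key on `order_items` are assumptions added for illustration.

```python
import sqlite3

# Assumed DDL: only the column names are taken from the specification.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    customer_email TEXT NOT NULL
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date TEXT NOT NULL
);
CREATE TABLE order_items (
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    quantity INTEGER NOT NULL,
    price REAL NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)  # ['customers', 'order_items', 'orders', 'products']
conn.close()
```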


## **Examples**

### Input CSV Row:

```
1001,John Doe,john@example.com,"Laptop:1@1000;Mouse:2@20",2023-01-15
```

### Normalized Tables:

**customers**

```
1,John Doe,john@example.com
```

**products**

```
1,Laptop
2,Mouse
```

**orders**

```
1001,1,2023-01-15
```

**order_items**

```
1001,1,1,1000
1001,2,2,20
```
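
As a sanity check on the example, joining the four normalized tables should reproduce the information in the original CSV row. The sketch below loads the rows shown above into an in-memory SQLite database; the simplified column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT, customer_email TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER, price REAL);
INSERT INTO customers VALUES (1, 'John Doe', 'john@example.com');
INSERT INTO products VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO orders VALUES (1001, 1, '2023-01-15');
INSERT INTO order_items VALUES (1001, 1, 1, 1000), (1001, 2, 2, 20);
""")

# Rebuild one line per product by following the foreign keys.
rows = conn.execute("""
    SELECT o.order_id, c.customer_name, p.product_name, oi.quantity, oi.price
    FROM order_items AS oi
    JOIN orders AS o ON o.order_id = oi.order_id
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products AS p ON p.product_id = oi.product_id
    ORDER BY p.product_id
""").fetchall()
print(rows)
# [(1001, 'John Doe', 'Laptop', 1, 1000.0), (1001, 'John Doe', 'Mouse', 2, 20.0)]
conn.close()
```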

# Requirements


## **Explicit Requirements**

* Parse the `products` field correctly and normalize it.
* Use `pandas` for CSV processing and `sqlite3` or `SQLAlchemy` for SQL export.
* Assign consistent, stable IDs to `customer_id` and `product_id`.
* Populate and link all four normalized tables using foreign keys.
* Treat customers as unique by (`customer_name`, `customer_email`).
* Treat products as unique by `product_name`.


## **Implicit Requirements**

* Maintain **referential integrity** using foreign key constraints.
* Ensure the solution is **deterministic** (same input gives same IDs).
* Sanitize and trim any whitespace in parsed fields.
* Handle malformed product strings explicitly: raise a descriptive error, or log and skip the entry.
* Allow for empty product lists (skip row or handle gracefully).
* Use snake_case for all table and column names.
* Price and quantity must be cast to numeric types and defaulted to zero if missing.
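
For the referential-integrity requirement, note that SQLite only enforces foreign keys on connections where `PRAGMA foreign_keys = ON` has been issued (enforcement is off by default in standard builds). The sketch below uses a reduced, assumed two-table schema to show an orphaned order being rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # must be set per connection
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'John Doe');
""")

conn.execute("INSERT INTO orders VALUES (1001, 1)")  # customer 1 exists: accepted
try:
    conn.execute("INSERT INTO orders VALUES (1002, 99)")  # customer 99 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
conn.close()
```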


## **Edge Cases**

* `products` field is empty or null: skip or handle gracefully without crashing.
* Product entry is malformed, e.g., `"Mouse:2"` or `"Laptop@1000"`: raise validation error or log and skip.
* Multiple orders with the same customer.
* Multiple products with the same name across different orders.
* CSV file with only headers and no data.
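
Two of these cases can be observed directly in pandas, as the sketch below shows with in-memory CSV text: a header-only file yields an empty DataFrame that still carries the column names, and an empty `products` cell is read as a missing value.

```python
import io

import pandas as pd

header = "order_id,customer_name,customer_email,products,order_date\n"

# Header-only input: no rows, but the columns are still available for validation.
header_only_df = pd.read_csv(io.StringIO(header))
print(header_only_df.empty)  # True

# Empty products cell: pandas reads it as NaN, so test with pd.isna before splitting.
one_row_df = pd.read_csv(io.StringIO(header + "1002,Alice,a@example.com,,2023-01-16\n"))
products_value = one_row_df.iloc[0]["products"]
print(pd.isna(products_value))  # True
```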


## **Function Signature**

```python
def normalize_csv_to_3nf(csv_path: str, db_path: str) -> None:
    """
    Reads a denormalized CSV file, parses composite product entries,
    normalizes the data into 3NF, and exports it to a SQLite database.

    Parameters:
    csv_path (str): Path to the input CSV file.
    db_path (str): Path to the output SQLite database file.

    Returns:
    None
    """
```


## **Constraints**

* Input CSV file size: up to 500 MB.
* Maximum number of unique customers: 100,000.
* Maximum number of unique products: 500,000.
* Maximum orders: 1 million.
* Python version: 3.8+
* Allowed libraries: `pandas`, `sqlite3`, or `sqlalchemy`.
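
With inputs near the 500 MB limit, loading the whole file at once may use a lot of memory. One option, sketched here as an assumption rather than a requirement of the task, is to stream the file with the `chunksize` parameter of `pd.read_csv` and accumulate only what is needed (here, the set of unique customers). The sample data and the chunk size of 2 are illustrative.

```python
import io

import pandas as pd

csv_text = (
    "order_id,customer_name,customer_email,products,order_date\n"
    '1001,John Doe,john@example.com,"Laptop:1@1000;Mouse:2@20",2023-01-15\n'
    '1002,Alice,a@example.com,"Pen:1@5",2023-01-16\n'
    '1003,John Doe,john@example.com,"Book:2@15",2023-01-17\n'
)

unique_customers = set()
row_count = 0
# With chunksize set, read_csv returns an iterator of DataFrames instead of one frame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)
    unique_customers.update(zip(chunk["customer_name"], chunk["customer_email"]))

print(row_count, len(unique_customers))  # 3 2
```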

In [None]:
# code

import pandas as pd
import sqlite3
from typing import Tuple, List, Set
import os


def generate_customer_id_map(customers_df: pd.DataFrame) -> pd.DataFrame:
    """Assigns deterministic integer IDs to customers sorted by name and email."""
    customers_df = customers_df.drop_duplicates()
    customers_df = customers_df.sort_values(by=['customer_name', 'customer_email']).reset_index(drop=True)
    customers_df['customer_id'] = customers_df.index + 1
    return customers_df


def generate_product_id_map(products_set: Set[Tuple[str]]) -> pd.DataFrame:
    """Creates a DataFrame of unique products with deterministic product IDs."""
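    # Sort on (lowercased name, exact name) so that names differing only in case
    # still receive the same IDs on every run; set iteration order is not stable.
    products_list = sorted(products_set, key=lambda x: (x[0].lower(), x[0]))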
    products_df = pd.DataFrame(products_list, columns=['product_name'])
    products_df['product_id'] = products_df.index + 1
    return products_df


def parse_products_column(df: pd.DataFrame) -> Tuple[List[Tuple[int, str, int, float]], Set[Tuple[str]]]:
    """
    Parses the 'products' column to extract order_items and unique products.

    Returns:
        - A list of (order_id, product_name, quantity, price) tuples
        - A set of unique (product_name,) tuples
    """
    order_items = []
    products_set = set()

    for index, row in df.iterrows():
        order_id = row['order_id']
        products_str = row.get('products', '')
        if pd.isna(products_str) or not isinstance(products_str, str):
            continue

        for item in products_str.split(';'):
            item = item.strip()
            if not item:
                continue

            try:
                name_qty, price = item.split('@')
                product_name, quantity = name_qty.split(':')
                product_name = product_name.strip()
                quantity = int(quantity.strip())
                price = float(price.strip())

                if not product_name:
                    continue

                products_set.add((product_name,))
                order_items.append((order_id, product_name, quantity, price))

            except (ValueError, AttributeError):
                # Skip malformed entries but could log these if needed
                continue

    return order_items, products_set


def normalize_csv_to_3nf(csv_path: str, db_path: str) -> None:
    """
    Reads a denormalized CSV file, parses composite product entries,
    normalizes the data into 3NF, and exports it to a SQLite database.

    Parameters:
        csv_path (str): Path to the input CSV file.
        db_path (str): Path to the output SQLite database file.

    Returns:
        None
    """
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"CSV file not found: {csv_path}")

    df = pd.read_csv(csv_path)

    required_columns = {'order_id', 'customer_name', 'customer_email', 'order_date', 'products'}
    missing = required_columns - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns in CSV: {', '.join(sorted(missing))}")

    if df.empty:
        # Create empty tables with schema only
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        cursor.execute("PRAGMA foreign_keys = ON;")
        cursor.executescript("""
        CREATE TABLE IF NOT EXISTS customers (
            customer_id INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL,
            customer_email TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS products (
            product_id INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            order_date TEXT NOT NULL,
            FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        );
        CREATE TABLE IF NOT EXISTS order_items (
            order_id INTEGER NOT NULL,
            product_id INTEGER NOT NULL,
            quantity INTEGER NOT NULL,
            price REAL NOT NULL,
            PRIMARY KEY (order_id, product_id),
            FOREIGN KEY (order_id) REFERENCES orders(order_id),
            FOREIGN KEY (product_id) REFERENCES products(product_id)
        );
        """)
        conn.commit()
        conn.close()
        return

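    # Trim surrounding whitespace in text fields so that values such as
    # "Jill" and "  Jill " map to the same customer
    for col in ('customer_name', 'customer_email', 'order_date'):
        df[col] = df[col].apply(lambda v: v.strip() if isinstance(v, str) else v)

    # Normalize customers
    customers_df = df[['customer_name', 'customer_email']].copy()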
    customers_df = generate_customer_id_map(customers_df)

    # Normalize orders
    orders_df = df[['order_id', 'customer_name', 'customer_email', 'order_date']].drop_duplicates()
    orders_df = orders_df.merge(customers_df, on=['customer_name', 'customer_email'], how='left')
    orders_df = orders_df[['order_id', 'customer_id', 'order_date']]

    # Parse products and order_items
    order_items_raw, products_set = parse_products_column(df)
    products_df = generate_product_id_map(products_set)

    # Create order_items DataFrame by joining product IDs
    order_items_df = pd.DataFrame(order_items_raw, columns=['order_id', 'product_name', 'quantity', 'price'])
    order_items_df = order_items_df.merge(products_df, on='product_name', how='left')
    order_items_df = order_items_df[['order_id', 'product_id', 'quantity', 'price']]

    # Fill missing numeric values safely
    order_items_df['quantity'] = order_items_df['quantity'].fillna(0).astype(int)
    order_items_df['price'] = order_items_df['price'].fillna(0.0).astype(float)

    # Connect to SQLite and create tables
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = ON;")

    cursor.executescript("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        customer_email TEXT NOT NULL
    );

    CREATE TABLE IF NOT EXISTS products (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL
    );

    CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date TEXT NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
    );

    CREATE TABLE IF NOT EXISTS order_items (
        order_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity INTEGER NOT NULL,
        price REAL NOT NULL,
        PRIMARY KEY (order_id, product_id),
        FOREIGN KEY (order_id) REFERENCES orders(order_id),
        FOREIGN KEY (product_id) REFERENCES products(product_id)
    );
    """)

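    # Export data to database. if_exists='append' keeps the keys declared above;
    # 'replace' would drop those tables and recreate them without constraints.
    # Clear rows left by a previous run, children first, so reruns stay idempotent.
    for table in ('order_items', 'orders', 'products', 'customers'):
        cursor.execute(f"DELETE FROM {table}")
    # A product repeated within one order would violate the composite key; keeping
    # the first occurrence is an assumption, since the spec does not define this case.
    order_items_df = order_items_df.drop_duplicates(subset=['order_id', 'product_id'])
    customers_df.to_sql('customers', conn, if_exists='append', index=False)
    products_df.to_sql('products', conn, if_exists='append', index=False)
    orders_df.to_sql('orders', conn, if_exists='append', index=False)
    order_items_df.to_sql('order_items', conn, if_exists='append', index=False)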

    conn.commit()
    conn.close()

In [None]:
# tests

"""Unit tests for the normalize_csv_to_3nf function.

This module contains test cases to verify CSV to 3NF database normalization,
including edge cases and data validation scenarios.
"""

import os
import sqlite3
import unittest

import pandas as pd

from main import normalize_csv_to_3nf


class TestNormalizeCSVTo3NF(unittest.TestCase):
    """Test suite for normalize_csv_to_3nf function."""

    def setUp(self):
        """Initialize test files and paths before each test."""
        self.csv_path = "test_denormalized_data.csv"
        self.db_path = "test_normalized_data.db"
        self.cleanup_files()

    def tearDown(self):
        """Clean up test files after each test."""
        self.cleanup_files()

    def cleanup_files(self):
        """Remove test files if they exist."""
        for path in [self.csv_path, self.db_path]:
            try:
                if os.path.exists(path):
                    os.remove(path)
            except PermissionError:
                pass

    def read_table(self, table_name):
        """Read table from database."""
        with sqlite3.connect(self.db_path) as conn:
            query = f"SELECT * FROM {table_name}"
            return pd.read_sql_query(query, conn)

    def test_single_valid_order(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write("1001, John Doe, john@example.com, Laptop:1@1000,"
                    " 2023-01-15\n")

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("orders")
        self.assertEqual(len(df), 1)
        self.assertEqual(df.iloc[0]["order_id"], 1001)

    def test_empty_products_field_skips_order(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write("1002,Alice,a@example.com,,2023-01-16\n")

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(len(df), 0)

    def test_multiple_products_parsed_correctly(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1003,Bob,b@example.com,"Laptop:1@1000;Mouse:2@20",'
                    "2023-01-17\n")

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(len(df), 2)

    def test_product_missing_quantity_skipped_or_zero(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1004,Carl,c@example.com,"Keyboard:@100",2023-01-18\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertTrue(df.empty)

    def test_product_missing_price_skipped_or_zero(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1005,Diana,d@example.com,"Monitor:2@",2023-01-19\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertTrue(df.empty)

    def test_no_crash_on_missing_columns(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,products,order_date\n")
            f.write('1006,Eve,"Tablet:1@200",2023-01-20\n')

        with self.assertRaises(Exception):
            normalize_csv_to_3nf(self.csv_path, self.db_path)

    def test_trailing_semicolon_is_ignored(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1007,Fred,f@example.com,"Mouse:2@20;",2023-01-21\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(len(df), 1)

    def test_malformed_product_skipped(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1008,Gary,g@example.com,"BadProduct",2023-01-22\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(len(df), 0)

    def test_customers_have_unique_ids(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1009,Hank,h@example.com,"Pen:1@5",2023-01-23\n')
            f.write('1010,Hank,h@example.com,"Book:2@15",2023-01-24\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("customers")
        self.assertEqual(len(df), 1)

    def test_products_have_unique_ids(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1011,Ivy,i@example.com,"Pen:1@5;Pen:2@6",2023-01-25\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("products")
        self.assertEqual(len(df), 1)

    def test_multi_order_same_customer_and_product(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1012,John Doe,john@example.com,"Pen:1@5",2023-01-26\n')
            f.write('1013,John Doe,john@example.com,"Pen:3@15",2023-01-27\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        self.assertEqual(len(self.read_table("customers")), 1)
        self.assertEqual(len(self.read_table("products")), 1)

    def test_order_id_is_preserved(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('9999,Zack,z@example.com,"Phone:1@999",2023-01-28\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("orders")
        self.assertEqual(df.iloc[0]["order_id"], 9999)

    def test_whitespace_trimmed_in_fields(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1014,  Jill ,  jill@example.com  ," Lamp :1@50 ",'
                    "2023-01-29\n")

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("customers")
        self.assertIn("Jill", df.iloc[0]["customer_name"])

    def test_case_sensitive_customers(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1015,Kate,kate@example.com,"Cup:1@10",2023-01-30\n')
            f.write('1016,kate,kate@example.com,"Cup:1@10",2023-01-31\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("customers")
        self.assertEqual(len(df), 2)

    def test_product_name_case_insensitive_uniqueness(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('1017,Liam,liam@example.com,"Chair:1@30",2023-02-01\n')
            f.write('1018,Liam,liam@example.com,"chair:1@35",2023-02-02\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("products")
        self.assertEqual(len(df), 2)

    def test_large_order_list_limit(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            for i in range(1000, 1020):
                date = f"2023-02-{i % 28 + 1:02d}"
                f.write(f'{i},Bulk,b@example.com,"Item{i}:1@10",{date}\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("orders")
        self.assertEqual(len(df), 20)

    def test_valid_product_with_zero_quantity(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('2001,Mira,m@example.com,"Box:0@10",2023-02-10\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(df.iloc[0]["quantity"], 0)

    def test_valid_product_with_zero_price(self):
        with open(self.csv_path, "w") as f:
            f.write("order_id,customer_name,customer_email,products,"
                    "order_date\n")
            f.write('2002,Nico,n@example.com,"Board:2@0",2023-02-11\n')

        normalize_csv_to_3nf(self.csv_path, self.db_path)
        df = self.read_table("order_items")
        self.assertEqual(df.iloc[0]["price"], 0)


if __name__ == "__main__":
    unittest.main()


# Model Breaking Proof

#### Model Breaking Task URL: https://rlhf-v3.turing.com/prompt/0eceb2dc-a0c1-4bdf-a312-408b9453caa7?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F67752%2Fview

#### Model code:

```python
import pandas as pd
import sqlite3
import hashlib

def normalize_csv_to_3nf(csv_path: str, db_path: str) -> None:
    # Load the CSV file into a DataFrame
    df = pd.read_csv(csv_path)

    # Extract unique customers
    customers = df[['customer_name', 'customer_email']].drop_duplicates().reset_index(drop=True)
    # Hash customer_name and customer_email for unique and consistent IDs
    customers['customer_id'] = customers.apply(lambda row: hashlib.md5((row['customer_name'] + row['customer_email']).encode()).hexdigest(), axis=1) # <Issue>: Uses MD5 hashes as primary keys, which violates the explicit requirement for consistent and stable IDs.

    # Separate orders
    orders = df[['order_id', 'order_date']]
    # Ensure order_id field is unique (dropping duplicates if any)
    orders = orders.drop_duplicates().reset_index(drop=True)

    # Get customer_id for each order by merging
    orders = orders.merge(df[['order_id', 'customer_name', 'customer_email']], on='order_id')
    orders = orders.merge(customers, on=['customer_name', 'customer_email'], how='left')
    orders = orders[['order_id', 'customer_id', 'order_date']]
    
    # Extract and normalize products and order_items
    products_set = set()
    order_items = []
    
    for _, row in df.iterrows():
        order_id = row['order_id']
        if pd.isna(row['products']):
            continue
        products = row['products'].split(';')
        for prod in products:
            details = prod.split('@')
            if len(details) != 2:
                continue  # Skip malformed entries
            name_qty = details[0].split(':')
            if len(name_qty) != 2:
                continue  # Skip malformed entries
            product_name, quantity = name_qty[0].strip(), name_qty[1].strip()
            price = details[1].strip()
            try:
                quantity = int(quantity)
                price = float(price)
            except ValueError:
                continue  # Skip entries with invalid quantity or price
            
            product_id = hashlib.md5(product_name.encode()).hexdigest()
            products_set.add((product_id, product_name))
            order_items.append((order_id, product_id, quantity, price))
    
    # Create a DataFrame from products set
    products = pd.DataFrame(list(products_set), columns=['product_id', 'product_name'])
    
    # Create a DataFrame for order_items
    order_items_df = pd.DataFrame(order_items, columns=['order_id', 'product_id', 'quantity', 'price'])
    
    # Create the SQLite database and tables
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Creating the tables
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS customers (
            customer_id TEXT PRIMARY KEY,
            customer_name TEXT,
            customer_email TEXT
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            product_id TEXT PRIMARY KEY,
            product_name TEXT
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS orders (
            order_id INTEGER PRIMARY KEY,
            customer_id TEXT,
            order_date TEXT,
            FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS order_items (
            order_id INTEGER,
            product_id TEXT,
            quantity INTEGER,
            price REAL,
            FOREIGN KEY (order_id) REFERENCES orders(order_id),
            FOREIGN KEY (product_id) REFERENCES products(product_id)
        )
    ''') # <Issue>: The order_items table does not include a primary key. At a minimum, (order_id, product_id) should be defined as a composite primary key.

    # Insert data into database
    customers.to_sql('customers', conn, if_exists='replace', index=False)
    products.to_sql('products', conn, if_exists='replace', index=False)
    orders.to_sql('orders', conn, if_exists='replace', index=False)
    order_items_df.to_sql('order_items', conn, if_exists='replace', index=False)

    # Close the database connection
    conn.commit()
    conn # <Issue>: The generated response is incomplete; conn.close() was never called.
```
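
To illustrate the missing-primary-key issue flagged above, the following sketch (an illustration, not part of the model's response) declares `order_items` with a composite primary key on (`order_id`, `product_id`) and shows SQLite rejecting a second row for the same pair. The column types and `NOT NULL` constraints are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity INTEGER NOT NULL,
        price REAL NOT NULL,
        PRIMARY KEY (order_id, product_id)
    )
""")
conn.execute("INSERT INTO order_items VALUES (1001, 1, 1, 1000)")
conn.execute("INSERT INTO order_items VALUES (1001, 2, 2, 20)")  # different product: accepted
try:
    conn.execute("INSERT INTO order_items VALUES (1001, 1, 5, 1000)")  # same pair again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
print(duplicate_rejected, row_count)  # True 2
conn.close()
```

Without the composite key, the third insert would succeed and the table could hold conflicting rows for the same order line.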