# Chapter 6 Lab



## 🧪 Lab: Cleaning Up Bill’s Surf Shop Orders

Bill runs a gnarly surf shop on the East Coast, but he’s not exactly a data guy. His order file is riddled with issues that make it hard to analyze or trust. Your task is to take this messy dataset and bring it up to par.

This lab builds directly on the skills you practiced in this chapter: defining data quality rules, implementing them in Python, and then re-implementing them using a structured Chat Completions API prompt. Your goal is to write both a Python solution and an AI-driven solution that produce the same cleaned dataset.

You’ll be working with the file `bills_surf_shop_orders.csv` (located in the `ch06/setup/` folder).

---

### 🧼 Fields to Clean and Standardize

Below are the fields in the dataset, along with the known issues you’ll need to detect and resolve:

---

#### `order_id`

* ✅ This should be clean already — it’s just the row index. No action needed.

---

#### `email`

* ✅ Valid values look like standard emails.
* ❌ Some entries are badly formatted (`example@.co`, `555-1234`, `None`, etc.).
* 🔧 Detect and optionally flag or remove invalid email formats.
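
One way to implement the detection rule is a small predicate you can pass to `.apply()`. This is a minimal sketch, not the lab's official solution: the name `is_valid_email` and the pattern are illustrative assumptions, and the pattern is deliberately simple rather than a full standards-compliant check.

```python
import re

# Assumed pattern: local part, "@", a domain label, then one or more ".label" parts.
EMAIL_RE = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

def is_valid_email(value):
    """Return True only for strings that fully match the simple pattern."""
    if not isinstance(value, str):
        return False  # None and NaN (a float) count as invalid
    return EMAIL_RE.fullmatch(value.strip()) is not None

samples = ["katie.smith@hotmail.com", "example@.co", "example@", "555-1234", None]
flags = [is_valid_email(s) for s in samples]
print(flags)
```

With pandas, `df['email'].apply(is_valid_email)` gives a boolean mask that you can store as a flag column or use to filter rows.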

---

#### `age`

* ✅ Should be a number between 18 and 100.
* ❌ Some rows are missing `age`.
* 🔧 Detect nulls and consider strategies for imputation or row exclusion.
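
If you choose imputation over row exclusion, median imputation is one simple option. The sketch below works on a plain list so it runs without pandas; the helper names `is_missing` and `impute_ages` are hypothetical, and treating out-of-range ages the same as missing ones is an assumption you may not want to keep.

```python
import math
from statistics import median

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def impute_ages(ages, low=18, high=100):
    """Replace missing or out-of-range ages with the median of the valid ones."""
    def ok(a):
        return not is_missing(a) and low <= a <= high
    valid = [a for a in ages if ok(a)]
    fill = median(valid) if valid else None
    return [a if ok(a) else fill for a in ages]

ages = [38, None, 60, float("nan"), 43, 130]
imputed = impute_ages(ages)
print(imputed)
```

Imputing keeps more rows for analysis but invents values; dropping rows is the safer choice when `age` matters to the downstream question.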

---

#### `purchase_amount`

* ✅ Most rows have a positive number here.
* ❌ Some rows are missing or contain negative values.
* 🔧 Identify negative amounts or nulls — decide whether to drop, impute, or flag.
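
Before deciding whether to drop or impute, it can help to label each value first. This is a sketch under assumptions: the function name `check_amount` and the three status labels are made up for illustration, and zero is grouped with negative values because it is not a positive amount.

```python
import math

def check_amount(value):
    """Classify a purchase amount as 'ok', 'missing', or 'non_positive'."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    if value <= 0:
        return "non_positive"
    return "ok"

amounts = [516.46, None, -42.10, 0.0, float("nan")]
statuses = [check_amount(a) for a in amounts]
print(statuses)
```

Keeping the status as its own column lets you count each kind of problem before you remove anything.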

---

#### `customer_id`

* ✅ Should be consistent per customer (used for joins).
* 🔍 No cleaning needed for now — but check for nulls if you want bonus points.

---

#### `store_location`

* ✅ Should be one of: `Florida`, `North Carolina`, `South Carolina`, `New Jersey`, `Rhode Island`, or their abbreviations: `FL`, `NC`, `SC`, `NJ`, `RI`.
* ❌ Some rows contain invalid entries (e.g., `Flrda`, `Carolinas`).
* 🔧 Create a mapping and standardize locations.
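
A mapping for this rule might look like the sketch below. It uses only the five states listed above; the names `STATE_NAMES`, `LOCATION_LOOKUP`, and `standardize_location` are illustrative, and the case-insensitive matching is an added assumption rather than a stated requirement.

```python
# Valid abbreviations and full names, taken from the list above.
STATE_NAMES = {
    "FL": "Florida",
    "NC": "North Carolina",
    "SC": "South Carolina",
    "NJ": "New Jersey",
    "RI": "Rhode Island",
}

# Build one lookup that accepts either form, ignoring case.
LOCATION_LOOKUP = {}
for abbr, name in STATE_NAMES.items():
    LOCATION_LOOKUP[abbr.lower()] = name
    LOCATION_LOOKUP[name.lower()] = name

def standardize_location(value):
    """Return the full state name, or None when the value is not recognized."""
    if not isinstance(value, str):
        return None
    return LOCATION_LOOKUP.get(value.strip().lower())

raw = ["FL", "florida ", "Flrda", "Carolinas", "NJ", None]
standardized = [standardize_location(v) for v in raw]
print(standardized)
```

Unrecognized values come back as `None` instead of being guessed, so misspellings like `Flrda` stay visible for manual review.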

---

#### `optional_note`

* ❌ This column is always `null` and not useful.
* 🔧 Drop this column entirely.

---

#### `purchase_date`

* ❌ Values come in different formats (`12/31/2023`, `2023/01/01`, `15-01-2023`, etc.).
* 🔧 Standardize to the format `YYYY-MM-DD`.
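
One way to standardize mixed formats with the standard library is to try a list of candidate formats in order. This is a sketch, and the format list is an assumption based on the examples above. Note the ambiguity it has to resolve: a value like `01-02-2023` fits both month-first and day-first layouts, and this version reads it as month-first because that format is tried first.

```python
from datetime import datetime

# Assumed candidate formats, tried in order (month-first before day-first).
DATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d/%m/%Y", "%m-%d-%Y", "%d-%m-%Y"]

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    if not isinstance(value, str):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

raw_dates = ["12/31/2023", "2023/01/01", "15-01-2023", "not a date", None]
normalized = [normalize_date(d) for d in raw_dates]
print(normalized)
```

An explicit format list is stricter than a general-purpose parser: anything outside the list becomes `None`, which makes unexpected formats easy to spot.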

---

#### `SKU`

* ✅ Valid format is three uppercase letters followed by three digits (e.g., `SUR123`).
* ❌ Some entries are missing or malformed.
* 🔧 Extract and validate the pattern.
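
The SKU rule translates directly into a regular expression. The sketch below is a strict variant: it accepts a value only when the whole string matches, and it does not uppercase the input first. The name `clean_sku` is hypothetical.

```python
import re

# Three uppercase letters followed by three digits, nothing else.
SKU_RE = re.compile(r"[A-Z]{3}\d{3}")

def clean_sku(value):
    """Return the SKU if the whole value matches the pattern, else None."""
    if not isinstance(value, str):
        return None
    candidate = value.strip()
    return candidate if SKU_RE.fullmatch(candidate) else None

skus = ["SUR123", "abc12", "123ABC", "a12bc3", " CPT535 ", None]
cleaned = [clean_sku(s) for s in skus]
print(cleaned)
```

Whether to uppercase before matching, or to extract a valid SKU from inside a longer string, is a judgment call; either choice is defensible as long as both of your solutions apply the same rule.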

---

#### `description`

* ✅ Generally good quality text (e.g., `blue surfboard - 6ft soft top`).
* ❌ Some entries are missing or inconsistently formatted.
* 🔧 Consider truncating to 20 characters for reporting purposes.
* 🗺 Map the description field to a high-level product category (e.g., "board", "wetsuit", "accessory") using a simple dictionary. You can define this yourself based on values you observe in the dataset.
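
The truncation and category rules interact: if you truncate first, a keyword near the end of a long description can be cut off before you look for it. The sketch below assigns the category from the full text and truncates afterward. It is a keyword-based variation on the dictionary idea; the keywords, labels, and the name `categorize` are all assumptions you should replace with values from the actual dataset.

```python
# Hypothetical keyword rules; the first matching keyword wins, so order matters.
CATEGORY_KEYWORDS = {
    "wetsuit": "wetsuit",
    "surfboard": "board",
    "wax": "accessory",
    "leash": "accessory",
}

def categorize(description, max_len=20):
    """Return (truncated_description, category) for one description value."""
    if not isinstance(description, str) or not description.strip():
        return None, "unknown"
    text = description.strip()
    lowered = text.lower()
    category = "unknown"
    for keyword, label in CATEGORY_KEYWORDS.items():
        if keyword in lowered:
            category = label
            break
    # Truncate only after the category has been decided.
    return text[:max_len], category

result = categorize("blue surfboard - 6ft soft top")
print(result)
```

An exact-match dictionary is simpler when descriptions come from a fixed list; keyword matching is more forgiving when the text is free-form.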

---

#### `first_name` + `last_name`

* ✅ Usually clean.
* 🔧 Combine into a `full_name` column.

---

#### `product_name`

* ✅ Each entry corresponds to a surf-related item.
* 🔧 Map these to standard product categories (e.g., `surfboard`, `wetsuit`, `accessory`).

---


### 🔧 Your Task

#### Step 1: Clean the Dataset Using Python

Use only `pandas` and standard Python functions to clean the dataset according to the rules above. You may find yourself using:

* `.apply()` or `.map()`
* regular-expression matching (the `re` module, or `.str.match()` and `.str.extract()`)
* `dropna()` or `drop()`
* custom functions for standardizing formats

Make sure your cleaned DataFrame has:

* A standardized format across all fields
* No invalid rows for `purchase_amount` or `email`
* A `full_name` column
* No `optional_note` column
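
A short checker can confirm the checklist above before you move on. This is a sketch that works on a list of dictionaries (for example, the result of `df.to_dict(orient="records")`); the function name `check_cleaned` and the message wording are hypothetical, and the email test is intentionally crude.

```python
def check_cleaned(rows):
    """Return a list of human-readable problems found in the cleaned rows."""
    problems = []
    for row in rows:
        oid = row.get("order_id")
        if "optional_note" in row:
            problems.append(f"order {oid}: optional_note column still present")
        if not row.get("full_name"):
            problems.append(f"order {oid}: full_name missing")
        amount = row.get("purchase_amount")
        # NaN fails the > 0 comparison, so it is reported as well.
        if not (isinstance(amount, (int, float)) and amount > 0):
            problems.append(f"order {oid}: purchase_amount not positive")
        if "@" not in (row.get("email") or ""):
            problems.append(f"order {oid}: email missing or invalid")
    return problems

good = {"order_id": 1, "email": "katie.smith@hotmail.com",
        "purchase_amount": 516.46, "full_name": "Katie Smith"}
bad = {"order_id": 2, "email": None, "purchase_amount": -5.0,
       "full_name": "", "optional_note": None}
problems = check_cleaned([good, bad])
print(len(problems))
```

Because the checker does not depend on how the rows were produced, you can run it on the output of both the Python solution and the AI-driven solution.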

---

#### Step 2: Clean the Dataset Using Chat Completions

Now it's time to clean the same dataset using the **structured response format** technique you learned in Listings 6.4 and 6.6. The idea is to define exactly what kind of cleaned output you want, use a `BaseModel` data class to structure the response, and have the model return parsed values you can drop directly into your DataFrame.

#### ⚙️ Your AI-Powered Cleaning Workflow:

1. **Define a new data class** using `pydantic.BaseModel` that includes all the fields you want to clean or standardize (e.g., `normalized_date`, `cleaned_sku`, `full_name`, `product_category`, etc.).

2. **Write a detailed prompt** explaining the cleaning rules. For each field, describe what valid data looks like and what the model should return when it's missing, invalid, or malformed.

   Your prompt should ask the model to:

   * Normalize `purchase_date` to `YYYY-MM-DD` format
   * Clean or nullify invalid `SKU` values
   * Truncate `description` to 20 characters
   * Combine `first_name` and `last_name` into `full_name`
   * Map `description` or `product_name` to a standardized `product_category`
   * Return `None` for invalid emails or negative/missing purchase amounts

3. **❗ Important: Process each record individually.**

   In earlier sections, we passed the entire dataset at once using `df.to_dict(orient="records")`. However, with larger or messy datasets, this can cause errors if the model skips records or returns inconsistent response lengths.

   **Instead, loop through the dataset one row at a time.** Use `tqdm` to show progress and feed each record through the API call with its own message and prompt. This gives you:

   * Stable response lengths (one output per input)
   * Cleaner debugging when something goes wrong
   * A safer way to accumulate results row by row

4. **Build your final DataFrame** from the list of cleaned records. Drop any rows that are still missing critical fields like `email` or `purchase_amount`.
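
The row-by-row pattern in step 3 can be sketched without any API call at all. In the version below, `clean_fn` is a placeholder for whatever function wraps your Chat Completions request, and `fake_clean` is a hypothetical stand-in used only to make the example runnable. Carrying `order_id` into each result is an added suggestion: it lets you join cleaned rows back to the originals even when some rows fail.

```python
def clean_records(records, clean_fn):
    """Run clean_fn on each record; collect results and failures separately."""
    cleaned, failed = [], []
    for record in records:
        try:
            result = dict(clean_fn(record))
        except Exception as exc:
            # Remember which row failed instead of silently losing it.
            failed.append((record.get("order_id"), str(exc)))
            continue
        result["order_id"] = record.get("order_id")
        cleaned.append(result)
    return cleaned, failed

def fake_clean(record):
    """Stand-in for the API call (hypothetical); fails when names are missing."""
    if record.get("first_name") is None:
        raise ValueError("missing name")
    return {"full_name": f"{record['first_name']} {record['last_name']}"}

records = [
    {"order_id": 1, "first_name": "Katie", "last_name": "Smith"},
    {"order_id": 2, "first_name": None, "last_name": None},
    {"order_id": 3, "first_name": "Amanda", "last_name": "Shaw"},
]
cleaned, failed = clean_records(records, fake_clean)
print(cleaned, failed)
```

In a notebook, wrapping `records` in `tqdm(...)` adds the progress bar, and `pd.DataFrame(cleaned)` turns the results into the final DataFrame.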


---

### 💬 Discussion Questions

* Which approach was easier to build and test?
* Which was more flexible when you needed to change rules?
* How confident are you in the AI-generated cleaning?
* Could you imagine building a reusable framework from this?


## Feel free to use the provided CSV in `../setup`. For an extra challenge, you can generate a new file with similar attributes using the code below.

In [2]:
import pandas as pd
import random
from faker import Faker

fake = Faker()

# Step 1: Generate 50 consistent customer profiles
num_customers = 50
customers = []
for _ in range(num_customers):
    first = fake.first_name()
    last = fake.last_name()
    email = f"{first.lower()}.{last.lower()}@{fake.free_email_domain()}"
    age = random.randint(18, 65)
    customer_id = fake.uuid4()
    customers.append({
        "customer_id": customer_id,
        "first_name": first,
        "last_name": last,
        "email": email,
        "age": age
    })

# Step 2: Define surf-themed items with fixed SKUs
surf_items = [
    "Longboard surfboard", "Shortboard surfboard", "Soft-top surfboard",
    "Wetsuit - full", "Wetsuit - shorty", "Rash guard",
    "Board shorts", "Surf wax", "Leash", "Fins",
    "Surf helmet", "Dry bag", "Waterproof watch", "Beach towel",
    "Surf poncho", "Roof rack straps", "Wax comb", "Traction pad",
    "Sunscreen", "Ear plugs"
]
sku_valid = [f"{fake.lexify('???').upper()}{fake.numerify('###')}" for _ in surf_items]
item_sku_map = dict(zip(surf_items, sku_valid))
sku_item_map = {v: k for k, v in item_sku_map.items()}

sku_invalid = ['abc12', '123ABC', 'a12bc3', None]

# Step 3: Set up locations and date formats
store_locations = ['New York', 'NY', 'New Jersey', 'NJ', 'Florida', 'FL', 'Massachusetts', 'MA', 'Virginia', 'VA']
valid_dates = ['12/31/2023', '2023-01-01', '01-15-2023']
invalid_dates = ['31/12/2023', '2023/01/01', '15-01-2023']

invalid_emails = ['example@.co', 'example@', '555-1234', None]

# Step 4: Build 100 order records
num_orders = 100
records = []
for i in range(1, num_orders + 1):
    customer = random.choice(customers)
    email = random.choice(invalid_emails) if random.random() < 0.25 else customer['email']
    age = None if random.random() < 0.1 else customer['age']
    purchase_amount = round(random.uniform(25, 800), 2) if random.random() > 0.15 else random.choice([None, -round(random.uniform(10, 200), 2)])
    location = random.choice(store_locations)
    purchase_date = random.choice(valid_dates) if random.random() > 0.2 else random.choice(invalid_dates)

    if random.random() < 0.2:
        sku = random.choice(sku_invalid)
        description = None
    else:
        description, sku = random.choice(list(item_sku_map.items()))

    records.append([
        i, email, age, purchase_amount,
        customer['customer_id'], location, None,
        purchase_date, sku, description,
        customer['first_name'], customer['last_name']
    ])

# Step 5: Create and export DataFrame
df = pd.DataFrame(records, columns=[
    "order_id", "email", "age", "purchase_amount", "customer_id",
    "store_location", "optional_note", "purchase_date",
    "SKU", "description", "first_name", "last_name"
])

df.to_csv("../setup/bills_surf_shop_orders.csv", index=False)


## Step 1: Clean the Dataset Using Python

In [11]:
import pandas as pd
import re
from dateutil import parser  #A

# Load the dataset  #B
df = pd.read_csv("../setup/bills_surf_shop_orders.csv")  #C

# Standardize purchase_date to YYYY-MM-DD  #D
def normalize_date(val):  #E
    try:
        return parser.parse(val).strftime('%Y-%m-%d')  #F
    except Exception:
        return None  #G

df['purchase_date'] = df['purchase_date'].apply(normalize_date)  #H

# Validate SKU format (3 uppercase letters + 3 digits)  #I
df['SKU'] = df['SKU'].str.upper().str.extract(r'([A-Z]{3}\d{3})', expand=False)  #J

# Truncate description to 20 characters  #K
df['description'] = df['description'].str[:20]  #L

# Create full_name column from first and last names  #M
df['full_name'] = (df['first_name'].fillna('') + ' ' + df['last_name'].fillna('')).str.strip()  #N

# Normalize store_location using mapping; values missing from the map
# (such as 'Virginia' or 'VA' from the generator above) become NaN  #O
location_map = {
    'FL': 'Florida', 'Florida': 'Florida',
    'NC': 'North Carolina', 'North Carolina': 'North Carolina',
    'SC': 'South Carolina', 'South Carolina': 'South Carolina',
    'NY': 'New York', 'New York': 'New York',
    'NJ': 'New Jersey', 'New Jersey': 'New Jersey',
    'MA': 'Massachusetts', 'Massachusetts': 'Massachusetts'
}
df['store_location'] = df['store_location'].map(location_map)  #P


# Drop optional_note column  #Q
df = df.drop(columns=['optional_note'], errors='ignore')  #R

# Drop rows with null or invalid email format  #S
email_pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"  #T
df = df[df['email'].fillna('').str.match(email_pattern)]  #U

# Filter out negative or missing purchase_amounts  #V
df = df[df['purchase_amount'].fillna(0) > 0]  #W

# Drop rows with missing age or customer_id  #X
df = df.dropna(subset=['age', 'customer_id'])  #Y

# Map each description to a high-level product category
category_map = {
    'Leash': 'accessories',
    'Surf wax': 'accessories',
    'Wax comb': 'accessories',
    'Sunscreen': 'protection',
    'Shortboard surfboard': 'boards',
    'Longboard surfboard': 'boards',
    'Soft-top surfboard': 'boards',
    'Board shorts': 'apparel',
    'Wetsuit - shorty': 'apparel',
    'Wetsuit - full': 'apparel',
    'Rash guard': 'apparel',
    'Surf poncho': 'apparel',
    'Fins': 'accessories',
    'Traction pad': 'accessories',
    'Beach towel': 'accessories',
    'Dry bag': 'accessories',
    'Roof rack straps': 'accessories',
    'Ear plugs': 'protection',
    'Surf helmet': 'protection',
    'Waterproof watch': 'electronics'
}

# Apply to your DataFrame
df['product_category'] = df['description'].map(category_map).fillna('unknown')


# Output cleaned DataFrame  #Z
display(df)  #AA


| | order_id | email | age | purchase_amount | customer_id | store_location | purchase_date | SKU | description | first_name | last_name | full_name | product_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | katie.smith@hotmail.com | 38.0 | 516.46 | 7e430391-da3b-4ab8-bd2e-63e22a6b97bb | Florida | 2023-01-15 | CPT535 | Leash | Katie | Smith | Katie Smith | accessories |
| 3 | 4 | amanda.shaw@hotmail.com | 60.0 | 40.93 | 5ccbe607-4190-49cd-a729-8fdf272fd647 | New York | 2023-01-15 | ZBA521 | Surf wax | Amanda | Shaw | Amanda Shaw | accessories |
| 4 | 5 | betty.hicks@hotmail.com | 43.0 | 114.53 | 38f0c359-adce-46ae-b1cd-8dbaa3453a00 | Massachusetts | 2023-01-15 | ICZ284 | Wax comb | Betty | Hicks | Betty Hicks | accessories |
| 5 | 6 | stephanie.sanders@gmail.com | 46.0 | 503.94 | 68884de5-2364-49a0-a29b-289a836d25e3 | | 2023-12-31 | | | Stephanie | Sanders | Stephanie Sanders | unknown |
| 6 | 7 | ashley.smith@yahoo.com | 32.0 | 386.81 | f4f04583-8c20-4bfa-a597-c2c2ec7363c8 | | 2023-01-15 | | | Ashley | Smith | Ashley Smith | unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 95 | michelle.miller@gmail.com | 30.0 | 153.66 | 1fef2f15-047b-4ee6-9723-1ef67983e7a2 | Massachusetts | 2023-01-15 | ALE909 | Soft-top surfboard | Michelle | Miller | Michelle Miller | boards |
| 95 | 96 | edward.hunt@hotmail.com | 62.0 | 672.24 | 41473f03-e37c-4fff-8d91-7a4963d48b8d | New Jersey | 2023-01-15 | | | Edward | Hunt | Edward Hunt | unknown |
| 96 | 97 | alicia.carroll@gmail.com | 25.0 | 708.20 | 50f5bcb6-067e-48d1-bf2e-c967887372f7 | Florida | 2023-12-31 | VVL444 | Rash guard | Alicia | Carroll | Alicia Carroll | apparel |
| 97 | 98 | sarah.reeves@yahoo.com | 41.0 | 544.23 | 4c938301-38ec-491c-9771-f881fa9e5f11 | New Jersey | 2023-01-01 | | | Sarah | Reeves | Sarah Reeves | unknown |


## Step 2: Clean the Dataset Using Chat Completions (OpenAI)

In [25]:
import pandas as pd
import openai
import os
from dotenv import load_dotenv
from pydantic import BaseModel
from typing import Optional
from tqdm.notebook import tqdm  #A

# Load API key from .env file  #B
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")  #C

# Define the data model for cleaned output  #D
class SurfShopCleaning(BaseModel):  #E
    normalized_date: Optional[str]  #F
    cleaned_sku: Optional[str]  #G
    truncated_description: Optional[str]  #H
    full_name: str  #I
    standardized_location: Optional[str]  #J
    filtered_email: Optional[str]  #K
    filtered_purchase_amount: Optional[float]  #L
    product_category: Optional[str]  #M

# Load the raw dataset  #N
df = pd.read_csv("../setup/bills_surf_shop_orders.csv")  #O
records = df.to_dict(orient="records")  #P

# Create an empty list to collect cleaned rows  #Q
cleaned_rows = []  #R

# Loop through each record using tqdm for a progress bar  #S
for record in tqdm(records, desc="Cleaning Records"):  #T
    row_prompt = (
        "You are a data cleaning assistant. Clean this single record and return:\n"
        "- normalized_date: Convert purchase_date to YYYY-MM-DD or null\n"
        "- cleaned_sku: Valid 3-uppercase-letter + 3-digit SKU or null\n"
        "- truncated_description: First 20 characters of description or null\n"
        "- full_name: Combine first_name and last_name\n"
        "- standardized_location: Normalize store_location to full name or null\n"
        "- filtered_email: Return only if valid format, else null\n"
        "- filtered_purchase_amount: Return only if positive and non-null, else null\n"
        "- product_category: Map description to one of: boards, apparel, protection, electronics, accessories. Use 'unknown' if unclear.\n"
        "Return the result as a JSON object matching the SurfShopCleaning structure."
    )

    try:
        # Make the API call  #U
        completion = openai.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": row_prompt},  #V
                {"role": "user", "content": str(record)}  #W
            ],
            response_format=SurfShopCleaning  #X
        )

        cleaned = completion.choices[0].message.parsed.dict()  #Y
        cleaned_rows.append(cleaned)  #Z

    except Exception as e:
        print(f"Error processing row {record['order_id']}: {e}")
        continue

# Convert to DataFrame  #AA
df_cleaned = pd.DataFrame(cleaned_rows)  #AB

# Drop rows missing critical fields  #AC
df_cleaned = df_cleaned.dropna(subset=["filtered_email", "filtered_purchase_amount"])  #AD

# Display the cleaned dataset  #AE
display(df_cleaned)  #AF



| | normalized_date | cleaned_sku | truncated_description | full_name | standardized_location | filtered_email | filtered_purchase_amount | product_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-15 | CPT535 | Leash | Katie Smith | Florida | katie.smith@hotmail.com | 516.46 | accessories |
| 1 | 2023-12-31 | CUC263 | Traction pad | Erika Mcdaniel | Virginia | erika.mcdaniel@yahoo.com | 0.00 | accessories |
| 2 | 2023-01-15 | CPT535 | Leash | Edward Hunt | Massachusetts | edward.hunt@hotmail.com | 0.00 | accessories |
| 3 | 2023-01-15 | ZBA521 | Surf wax | Amanda Shaw | New York | amanda.shaw@hotmail.com | 40.93 | accessories |
| 4 | 2023-01-15 | ICZ284 | Wax comb | Betty Hicks | Massachusetts | betty.hicks@hotmail.com | 114.53 | accessories |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 2023-01-15 | .null | | Edward Hunt | New Jersey | edward.hunt@hotmail.com | 672.24 | unknown |
| 96 | 2023-12-31 | VVL444 | Rash guard | Alicia Carroll | Florida | alicia.carroll@gmail.com | 708.20 | apparel |
| 97 | 2023-01-01 | /ull | /ull | Sarah Reeves | New Jersey | sarah.reeves@yahoo.com | 544.23 | unknown |
| 98 | 2023-01-01 | AAS588 | Surf poncho | Danielle Cooke | Virginia | danielle.cooke@hotmail.com | 656.46 | apparel |
