# Day 1 Exercise: Cleaning Messy Cafe Sales Data

**Name:** _Bálint Décsi_  
**Date:** October 8, 2025

---

## Objective

Transform a messy cafe sales dataset into a tidy format, designate and validate a primary key, and create summary tables.

## Dataset

**File:** `../data/day1/dirty_cafe_sales.csv`  
**Rows:** 10,000 cafe transactions  
**Data Dictionary:** See `../data/day1/README.md`

## Deliverable

This notebook should **"Restart & Run All"** successfully when you're done!

---

## Section 1: Setup and Data Loading

### TODO 1: Import libraries

In [None]:
# TODO 1: Import pandas and numpy
# Uncomment the lines below and run this cell:

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# print("✅ Libraries imported successfully!")

### TODO 2: Load the data

In [None]:
# TODO 2: Load the data
# Uncomment the lines below and run this cell:

df = pd.read_csv('../../data/day1/dirty_cafe_sales.csv')
# print(f"✅ Data loaded: {len(df):,} rows")

---

## Section 2: Initial Exploration

Before cleaning, let's understand what we have.

### TODO 3: Display basic information

In [None]:
# TODO 3: Display the shape of the dataframe
# Uncomment and run:

print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

In [None]:
# TODO 3 (continued): Display the first 10 rows
# Uncomment and run:

df.head(10)

In [None]:
# TODO 3 (continued): Display column names and types
# Uncomment and run:

print("Column Types:")
print(df.dtypes)

### TODO 4: Check for missing values

In [None]:
# TODO 4: Count missing values (NaN) in each column
# Uncomment and run:

print("Missing Values (NaN) per column:")
print(df.isnull().sum())

### TODO 5: Check for sentinel values

Look for "ERROR" and "UNKNOWN" in the data.

In [None]:
# TODO 5: Count "ERROR" values in each column
# Uncomment and run:

print("'ERROR' values per column:")
print((df == 'ERROR').sum())

In [None]:
# TODO 5 (continued): Count "UNKNOWN" values in each column
# Uncomment and run:

print("'UNKNOWN' values per column:")
print((df == 'UNKNOWN').sum())
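
The two sentinel counts can also be viewed side by side. The sketch below is a minimal illustration on made-up data; the `toy` frame and the `sentinel_counts` name are assumptions for the example, not part of the exercise dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the cafe dataset
toy = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", "UNKNOWN"],
    "Location": ["UNKNOWN", "In-store", None, "ERROR"],
})

# One row per column of `toy`, one count column per sentinel value
sentinel_counts = pd.DataFrame({
    sentinel: toy.eq(sentinel).sum() for sentinel in ["ERROR", "UNKNOWN"]
})
print(sentinel_counts)
```

Putting both counts in one table makes it easier to see whether a column uses one sentinel more than the other.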

### Reflection: What Issues Did You Find?

**TODO:** Write 2-3 sentences describing the data quality issues you observed.

_All of the values are stored as strings, which is undesirable. NaN counts are highest for `Payment Method` and `Location`. Moreover, both "ERROR" and "UNKNOWN" are used in all of the columns. These sentinel values should be standardized._

---

## Section 3: Is This Data Tidy?

### TODO 6: Evaluate against tidy data principles

**The Three Rules:**
1. Each variable is a column
2. Each observation is a row
3. Each value is a cell
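
As a point of comparison, here is a small sketch of a table that breaks rule 1 and one way to reshape it. The `wide` frame is invented for illustration and does not reflect the layout of the cafe dataset.

```python
import pandas as pd

# Hypothetical untidy table: the variable "item" is spread across column names
wide = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2"],
    "Coffee": [2, 0],
    "Cake": [1, 3],
})

# melt() gathers those columns into one "Item" variable and one "Quantity" variable
tidy = wide.melt(id_vars="Transaction ID", var_name="Item", value_name="Quantity")
print(tidy)
```

After reshaping, each row is one (transaction, item) pair and each variable has its own column.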

**Questions to answer in markdown:**

1. What is the unit of observation in this dataset? (What does each row represent?)

_Transaction._

2. Does each variable have its own column?

_Yes._

3. Is this dataset tidy? Why or why not?

_`Total Spent` is calculated from two other columns, so it could be considered redundant. Moreover, the data could be made more granular by making each item within a transaction the unit of observation._

---

## Section 4: Identify and Validate Primary Key

### TODO 7: Identify the primary key candidate

In [None]:
# TODO 7: Check if 'Transaction ID' is unique
# Uncomment and run:

is_unique = df['Transaction ID'].is_unique
print(f"Is 'Transaction ID' unique? {is_unique}")
print(f"Total rows: {len(df):,}")
print(f"Unique Transaction IDs: {df['Transaction ID'].nunique():,}")

In [None]:
# TODO 7 (continued): Check for any NULL values in 'Transaction ID'
# Uncomment and run:

null_count = df['Transaction ID'].isnull().sum()
print(f"NULL Transaction IDs: {null_count}")

In [None]:
# TODO 7 (continued): If there are duplicates, find them
# Uncomment and run:

duplicates = df[df.duplicated(subset=['Transaction ID'], keep=False)]
print(f"Duplicate rows: {len(duplicates)}")
if len(duplicates) > 0:
    print("\nShowing first few duplicates:")
    display(duplicates.head())

### TODO 8: Write validation assertions

Once you've confirmed (or fixed) the primary key, write assertions to prove it.

In [None]:
# TODO 8: Add assertions to validate primary key
# Uncomment and run (these will error if checks fail):

assert df['Transaction ID'].is_unique, "❌ Duplicate transaction IDs found"
assert df['Transaction ID'].notna().all(), "❌ NULL transaction IDs found"
print("✅ Transaction ID is a valid primary key")
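
When an assertion fails, it helps to know how many rows are affected. The following is a minimal sketch of a reporting helper; `check_primary_key` and the `toy` frame are hypothetical names introduced for this example.

```python
import pandas as pd

def check_primary_key(frame: pd.DataFrame, column: str) -> dict:
    """Return counts of the two conditions that would disqualify a primary key."""
    return {
        "nulls": int(frame[column].isna().sum()),
        # keep=False flags every row that shares its value with another row
        "duplicates": int(frame[column].duplicated(keep=False).sum()),
    }

# Hypothetical data with one repeated ID and one missing ID
toy = pd.DataFrame({"Transaction ID": ["TXN_1", "TXN_2", "TXN_2", None]})
report = check_primary_key(toy, "Transaction ID")
print(report)
```

A column qualifies as a primary key only when both counts are zero.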

### Reflection: Primary Key

**TODO:** Explain what you found and any decisions you made.

_`Transaction ID` is a good primary key. I haven't found any issues: it is unique and non-NULL._

---

## Section 5: Handle Missing Values

### TODO 9: Standardize missing value representations

Convert "ERROR", "UNKNOWN", and empty strings to NaN.

In [None]:
# TODO 9: Replace sentinel values with NaN
# Uncomment and run:

df = df.replace(['ERROR', 'UNKNOWN', ''], np.nan)
print("✅ Replaced 'ERROR', 'UNKNOWN', and empty strings with NaN")
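
An alternative is to declare the sentinel strings when the file is read, using the `na_values` parameter of `pd.read_csv`. The sketch below uses an in-memory CSV with made-up rows rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
raw = io.StringIO(
    "Transaction ID,Item,Quantity\n"
    "TXN_1,Coffee,2\n"
    "TXN_2,ERROR,UNKNOWN\n"
)

# na_values lists extra strings that read_csv should parse as NaN
toy = pd.read_csv(raw, na_values=["ERROR", "UNKNOWN"])
print(toy.isna().sum())
print(toy.dtypes)
```

With the sentinels removed at load time, `Quantity` is parsed as a numeric column directly. The drawback is that the original sentinel counts are no longer visible for exploration.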

In [None]:
# TODO 9 (continued): Check missing values again after standardization
# Uncomment and run:

print("Missing values after standardization:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum():,}")

### TODO 10: Decide how to handle missing values

**Options:**
- Drop rows with missing values in critical columns
- Fill with default values
- Keep as NaN (document impact on analysis)
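
The three options can be sketched on a toy frame as follows. The data and the `"Not recorded"` label are assumptions made for illustration, not a recommendation for the real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Total Spent": [4.0, np.nan, 6.0],
    "Location": ["Takeaway", np.nan, np.nan],
})

# Option 1: drop rows that are missing a critical column
dropped = toy.dropna(subset=["Total Spent"])

# Option 2: fill with an explicit default label (the label is hypothetical)
filled = toy.fillna({"Location": "Not recorded"})

# Option 3: keep NaN and report how much is missing
missing_share = toy["Location"].isna().mean()

print(len(dropped), filled["Location"].tolist(), round(missing_share, 2))
```

Each option changes what later summaries mean: dropping shrinks the sample, filling adds a new category, and keeping NaN leaves those rows out of grouped results by default.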

**Your strategy:**

_[Write your strategy here. Example: "I will keep NULL for `Payment Method` and `Location` because it represents around 30% of rows. However, I decided to exclude rows with NULL in `Total Spent`, as it is a crucial feature."]_

In [None]:
# TODO 10: Implement your missing value strategy
# This is a decision point - choose your approach!
# Chosen option: drop rows with a missing `Total Spent` and keep NaN in the other columns
# (documented in the strategy above)

df = df.dropna(subset=['Total Spent'])
print("✅ Missing value strategy: excluded rows with NULL in `Total Spent`, as it is a crucial feature.")

In [None]:
# Check the result
print("Missing values after dropping rows without `Total Spent`:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum():,}")

---

## Section 6: Fix Type Issues

### TODO 11: Convert Quantity to integer

In [None]:
# TODO 11: Convert Quantity to integer
# Uncomment and run:

df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce').astype('Int64')
print("✅ Quantity converted to Int64 (allows NaN)")
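
The reason for `Int64` (capital I) rather than `int64` is that the plain NumPy integer type cannot hold missing values. A minimal sketch on made-up strings:

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one missing, one unparseable
raw = pd.Series(["2", "5", None, "abc"])

# errors='coerce' turns unparseable strings into NaN, which forces float64
as_float = pd.to_numeric(raw, errors="coerce")

# The nullable 'Int64' dtype keeps whole numbers and marks gaps as <NA>
as_int = as_float.astype("Int64")

print(as_float.dtype, as_int.dtype)
print(as_int.tolist())
```

Note that `errors='coerce'` silently converts bad input to missing values, so it is worth comparing missing counts before and after the conversion.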

### TODO 12: Convert prices to float

In [None]:
# TODO 12: Convert 'Price Per Unit' to float
# Uncomment and run:

df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'], errors='coerce')
print("✅ 'Price Per Unit' converted to float64")

In [None]:
# TODO 12 (continued): Convert 'Total Spent' to float
# Uncomment and run:

df['Total Spent'] = pd.to_numeric(df['Total Spent'], errors='coerce')
print("✅ 'Total Spent' converted to float64")

### TODO 13: Convert Transaction Date to datetime

In [None]:
# TODO 13: Parse Transaction Date as datetime
# Uncomment and run:

df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')
print("✅ 'Transaction Date' converted to datetime64")
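
Because `errors='coerce'` replaces unparseable dates with `NaT`, one way to see how many values were lost is to compare missingness before and after parsing. A small sketch with invented date strings:

```python
import pandas as pd

# Hypothetical raw values: one valid date, one impossible date, one missing
raw = pd.Series(["2023-01-15", "2023-13-40", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to parse
failed = raw.notna() & parsed.isna()
print(f"Unparseable dates: {failed.sum()}")
```

Running a check like this before overwriting the column keeps the raw strings available for inspection.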

### TODO 14: Verify types

In [None]:
# TODO 14: Display dtypes to verify conversions worked
# Uncomment and run:

print("Updated Column Types:")
print(df.dtypes)

### TODO 15: Write type assertions

In [None]:
# TODO 15: Add assertions to validate types
# Uncomment and run:

assert df['Quantity'].dtype in ['int64', 'Int64'], "❌ Quantity should be integer"
assert df['Price Per Unit'].dtype == 'float64', "❌ Price should be float"
assert df['Transaction Date'].dtype == 'datetime64[ns]', "❌ Date should be datetime"
print("✅ All types are correct!")
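
Comparing against exact dtype strings can be brittle, since the datetime unit (for example `ns`) may differ between pandas versions. A possible alternative, sketched here on a toy frame, is to use the helper functions in `pandas.api.types`:

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical frame with the same column names and intended types
toy = pd.DataFrame({
    "Quantity": pd.array([1, None], dtype="Int64"),
    "Price Per Unit": [2.5, 3.0],
    "Transaction Date": pd.to_datetime(["2023-01-15", "2023-02-01"]),
})

# These helpers accept both NumPy and nullable pandas dtypes
assert ptypes.is_integer_dtype(toy["Quantity"]), "Quantity should be integer"
assert ptypes.is_float_dtype(toy["Price Per Unit"]), "Price should be float"
assert ptypes.is_datetime64_any_dtype(toy["Transaction Date"]), "Date should be datetime"
print("Type checks passed on the toy frame")
```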

---

## Section 7: Validate Data Integrity

### TODO 16: Check if Total Spent = Quantity × Price Per Unit

In [None]:
# TODO 16: Calculate expected total
# Uncomment and run:

df['Calculated Total'] = df['Quantity'] * df['Price Per Unit']
print("✅ Calculated expected totals")

In [None]:
# TODO 16 (continued): Compare with actual Total Spent
# This uses np.isclose() for float comparison (allows tiny rounding differences)
# Uncomment and run:

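mask = df['Total Spent'].notna() & df['Calculated Total'].notna()
# 'Calculated Total' inherits the nullable dtype from Quantity, so cast to plain float64 for NumPy
mismatches = ~np.isclose(
    df.loc[mask, 'Total Spent'].astype('float64'),
    df.loc[mask, 'Calculated Total'].astype('float64'),
    rtol=1e-05  # Relative tolerance for floating point comparison
)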
print(f"Mismatches found: {mismatches.sum()} out of {mask.sum()} rows with data")
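
To see why a tolerance is needed at all, consider this small standalone sketch. The numbers are chosen for illustration and are not taken from the dataset.

```python
import numpy as np

expected = 0.1 * 3   # binary floating point cannot represent 0.1 exactly
recorded = 0.3

exact_match = expected == recorded            # exact comparison
close_match = np.isclose(expected, recorded)  # comparison within a tolerance

print(exact_match, close_match)
```

An exact `==` would report a mismatch here even though the two amounts are equal for any practical purpose.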

### TODO 17: Check for impossible values

In [None]:
# TODO 17: Check for negative or zero prices
# Uncomment and run:

bad_prices = df[df['Price Per Unit'] <= 0]
print(f"Rows with price <= 0: {len(bad_prices)}")
if len(bad_prices) > 0:
    display(bad_prices[['Transaction ID', 'Item', 'Price Per Unit']].head())

In [None]:
# TODO 17 (continued): Check for zero or negative quantities
# Uncomment and run:

bad_qty = df[df['Quantity'] <= 0]
print(f"Rows with quantity <= 0: {len(bad_qty)}")
if len(bad_qty) > 0:
    display(bad_qty[['Transaction ID', 'Item', 'Quantity']].head())

### Reflection: Data Integrity

**TODO:** What did you find? How did you handle integrity issues?

_I haven't found any integrity issues._

---

## Section 8: Create Summary Tables

Now that data is clean, answer some business questions!

### TODO 18: Total sales by payment method

In [None]:
# TODO 18: Calculate total revenue and transaction count by payment method
# Uncomment and run (this one is fully worked as an example):

payment_summary = df.groupby('Payment Method').agg({
    'Total Spent': 'sum',
    'Transaction ID': 'count'
}).round(2)

payment_summary.columns = ['Total Revenue', 'Transaction Count']
payment_summary = payment_summary.sort_values('Total Revenue', ascending=False)

print("Sales by Payment Method:")
display(payment_summary)
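
One caveat: `groupby` leaves out rows whose key is NaN by default, so revenue from transactions with a missing payment method does not appear in the table. The sketch below, on made-up data, shows the effect of the `dropna=False` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical data where half of the payment methods are missing
toy = pd.DataFrame({
    "Payment Method": ["Cash", np.nan, "Cash", np.nan],
    "Total Spent": [4.0, 6.0, 2.0, 3.0],
})

# Default: rows with a missing key are silently left out
default_totals = toy.groupby("Payment Method")["Total Spent"].sum()

# dropna=False keeps a separate NaN group so that revenue stays visible
all_totals = toy.groupby("Payment Method", dropna=False)["Total Spent"].sum()

print(default_totals.sum(), all_totals.sum())
```

Whether to show the NaN group is a reporting decision, but the grand total should be checked against the ungrouped sum either way.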

### TODO 19: Most popular items

In [None]:
# TODO 19: Find most popular items by quantity sold
# Pattern: df.groupby('Column')['Metric'].sum().sort_values(ascending=False)
# Uncomment and adapt:

popular_items = df.groupby('Item')['Quantity'].sum().sort_values(ascending=False)
print("Most Popular Items (by quantity):")
display(popular_items.head(10))

In [None]:
# TODO 19 (continued): Find highest revenue items
# Use the same pattern but with 'Total Spent' instead of 'Quantity'
# Uncomment and adapt:

revenue_items = df.groupby('Item')['Total Spent'].sum().sort_values(ascending=False).round(2)
print("Highest Revenue Items:")
display(revenue_items.head(10))

### TODO 20: Location comparison

In [None]:
# TODO 20: Compare transaction volume and average transaction value by location
# This uses .agg() with multiple functions (like TODO 18)
# Uncomment and run:

location_summary = df.groupby('Location').agg({
    'Transaction ID': 'count',
    'Total Spent': ['sum', 'mean']
}).round(2)

location_summary.columns = ['Transaction Count', 'Total Revenue', 'Avg Transaction Value']
print("Sales by Location:")
display(location_summary)
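
Passing a dictionary with a list of functions produces multi-level column names, which is why the columns are renamed by hand above. Named aggregation is an alternative that sets the output names directly. A minimal sketch on a toy frame (the data is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3"],
    "Location": ["In-store", "Takeaway", "In-store"],
    "Total Spent": [4.0, 6.0, 2.0],
})

# Each keyword is an output column: name=(source column, aggregation function)
toy_summary = toy.groupby("Location").agg(
    transaction_count=("Transaction ID", "count"),
    total_revenue=("Total Spent", "sum"),
    avg_transaction_value=("Total Spent", "mean"),
).round(2)
print(toy_summary)
```

This avoids relying on the order of the generated columns when assigning names.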

---

## Section 9: Final Validation

### TODO 21: Run all validations

In [None]:
# TODO 21: Gather all your assertions in one cell to prove data quality
# Uncomment and run:

print("Running final validation...\n")

# Primary key
assert df['Transaction ID'].is_unique, "❌ Duplicate transaction IDs"
assert df['Transaction ID'].notna().all(), "❌ NULL transaction IDs"
print("✅ Primary key validated")

# Types
assert df['Quantity'].dtype in ['int64', 'Int64'], "❌ Quantity type wrong"
assert df['Price Per Unit'].dtype == 'float64', "❌ Price type wrong"
assert df['Transaction Date'].dtype == 'datetime64[ns]', "❌ Date type wrong"
print("✅ Types validated")

# Data ranges (only check non-null values)
assert (df['Quantity'].dropna() > 0).all(), "❌ Invalid quantities found"
assert (df['Price Per Unit'].dropna() > 0).all(), "❌ Invalid prices found"
print("✅ Data ranges validated")

print("\n✅ All validations passed!")
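
A chain of `assert` statements stops at the first failure. If a full list of problems is more useful, the checks can be collected first and reported together. This is only a sketch: `run_checks` and the `toy` frame are hypothetical, and the set of checks is a subset of the ones above.

```python
import pandas as pd

def run_checks(frame: pd.DataFrame) -> list[str]:
    """Return the names of all failed checks instead of stopping at the first one."""
    checks = {
        "unique Transaction ID": frame["Transaction ID"].is_unique,
        "no NULL Transaction ID": frame["Transaction ID"].notna().all(),
        "positive Quantity": (frame["Quantity"].dropna() > 0).all(),
    }
    return [name for name, passed in checks.items() if not passed]

# Hypothetical data with a repeated ID and a negative quantity
toy = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_1"],
    "Quantity": [2, -1],
})
failures = run_checks(toy)
print(failures)
```

An empty list means every check passed.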

---

## Section 10: Documentation

### TODO 22: Document your data cleaning process

Write a brief summary (8-10 sentences) of:
1. What problems you found
2. What decisions you made
3. What the implications are for analysis
4. What a stakeholder should know about this data

---

## Data Cleaning Summary

_This is a dataset of store transactions with unique IDs for each observation._

### Issues Found
- _data types: everything is stored as strings_,
- _missing data: almost 30% for `Payment Method` and `Location`_,
- _sentinel values are used, e.g. "UNKNOWN"_.

### Actions Taken
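- _converted columns to integer, float and datetime types_,
- _kept the rows with missing `Payment Method` or `Location` (around 30%) and dropped only the rows with missing `Total Spent`_,
- _standardized sentinel values to NaN_.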

### Assumptions Made
- _since both columns mentioned above with a high percentage of NULLs have a fairly uniform distribution, the missing values might have a systematic cause_,
- _the other columns seem to contain valid data, which has been checked with type conversion assertions_,
- _the data collection team should be asked whether there is a difference between "ERROR" and "UNKNOWN"_.

### Implications for Analysis
- _be aware that although I haven't dropped the rows with missing `Payment Method` or `Location`, I haven't imputed them with any value either_,
- _I strongly suggest finding one or more good candidate values to use there_.

### Data Quality Assessment
- _overall, 100% of the remaining data is now usable, as long as the NaN values in the two columns are taken into account in further analysis_.

---

## Congratulations!

You've successfully cleaned a real messy dataset using tidy data principles!

**Final check:** Can you **"Restart & Run All"** successfully? That's the gold standard!

**Reflection:** What was the hardest part? What did you learn?