# 03 – Data Quality Assessment and Cleaning

This notebook documents the **data quality** and **cleaning** steps for the
project. It complements the script `scripts/clean_data.py` and provides
exploratory checks and summaries.

Goals:

- Profile the cleaned datasets:
  - `coffee_sales_clean.csv`
  - `coffee_shop_clean.csv`
- Assess data quality:
  - Missing values
  - Potential outliers
  - Duplicate records
- Describe and justify key cleaning decisions:
  - Dropping invalid or missing quantities/prices
  - Cleaning `hour_of_day` and `money` in the shop data

In [None]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"

SALES_CLEAN = PROCESSED_DIR / "coffee_sales_clean.csv"
SHOP_CLEAN = PROCESSED_DIR / "coffee_shop_clean.csv"

PROJECT_ROOT, DATA_DIR, PROCESSED_DIR

## 1. (Optional) Regenerate cleaned data

If you have updated `scripts/clean_data.py`, you can re-run it here to
regenerate the cleaned CSV files.

Otherwise, you can skip this cell and use the existing cleaned files in
`data/processed/`.

In [None]:
# OPTIONAL: Uncomment to regenerate cleaned data
# !python ../scripts/clean_data.py

## 2. Load cleaned datasets

We now load:

- `data/processed/coffee_sales_clean.csv`
- `data/processed/coffee_shop_clean.csv`

These should reflect the cleaning rules in `scripts/clean_data.py`.

In [None]:
sales = pd.read_csv(SALES_CLEAN)
shop = pd.read_csv(SHOP_CLEAN)

print("Sales (clean) shape:", sales.shape)
print("Shop  (clean) shape:", shop.shape)

In [None]:
sales.head()

In [None]:
shop.head()

## 3. Schema and missing values

We inspect:

- Column names and data types (`info()`).
- Missing values per column (`isna().sum()`).

This helps verify that critical fields (e.g., `transaction_qty`,
`unit_price`, `hour_of_day`, `money`) have been cleaned appropriately.
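
Raw counts from `isna().sum()` are easier to compare across tables when paired with a percentage. The helper below is a minimal sketch of that idea; `missing_summary` and the toy frame are hypothetical names and values used only for illustration, not part of `scripts/clean_data.py`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for the real data (values are made up)
demo = pd.DataFrame({
    "transaction_qty": [1, 2, None, 4],
    "unit_price": [3.0, 2.5, 4.0, 3.5],
})
demo_missing = missing_summary(demo)
print(demo_missing)
```

In the notebook, the same helper could be called as `missing_summary(sales)` or `missing_summary(shop)` once the data is loaded.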

In [None]:
print("=== Sales info ===")
sales.info()  # info() prints its report directly and returns None

print("\n=== Missing values in sales ===")
print(sales.isna().sum())

In [None]:
print("=== Shop info ===")
shop.info()  # info() prints its report directly and returns None

print("\n=== Missing values in shop ===")
print(shop.isna().sum())

## 4. Duplicate checks

We verify that there are no unexpected duplicate rows, especially on
key identifiers:

- `transaction_id` in the sales data.
- (`hour_of_day`, `coffee_name`) or just `hour_of_day` in the shop data,
  depending on how the dataset is structured.
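
The composite-key variant mentioned above can be sketched as follows. The toy frame and its values are invented for illustration, and the sketch assumes the shop data has a `coffee_name` column.

```python
import pandas as pd

# Toy shop-like rows; the values are invented for illustration
demo_shop = pd.DataFrame({
    "hour_of_day": [8, 8, 9, 9],
    "coffee_name": ["Latte", "Latte", "Latte", "Espresso"],
    "money": [3.5, 3.5, 3.5, 2.0],
})

key = ["hour_of_day", "coffee_name"]

# Rows that repeat an earlier (hour_of_day, coffee_name) pair
dup_pairs = demo_shop.duplicated(subset=key).sum()

# Rows identical to an earlier row in every column
dup_full = demo_shop.duplicated().sum()

# keep=False marks every member of a duplicate group, which helps inspection
dup_rows = demo_shop[demo_shop.duplicated(subset=key, keep=False)]

print(f"Duplicate (hour_of_day, coffee_name) rows: {dup_pairs}")
print(f"Fully duplicated rows: {dup_full}")
```

If the shop table is transaction-level rather than one row per hour, repeated keys are expected, so the count is a description of the data and not necessarily an error.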

In [None]:
# Sales duplicates by transaction_id
if "transaction_id" in sales.columns:
    dup_sales = sales.duplicated(subset=["transaction_id"]).sum()
    print(f"Duplicate transaction_id rows in sales: {dup_sales}")
else:
    print("transaction_id not found in sales columns.")

# Shop duplicates by hour_of_day
# (only meaningful if the shop table has one row per hour)
if "hour_of_day" in shop.columns:
    dup_shop_hour = shop.duplicated(subset=["hour_of_day"]).sum()
    print(f"Duplicate hour_of_day rows in shop: {dup_shop_hour}")
else:
    print("hour_of_day not found in shop columns.")

## 5. Numeric distributions and potential outliers

We look at basic descriptive statistics and simple checks to identify:

- Unreasonable values (e.g., non-positive quantities or prices).
- Very large values that might represent outliers or data entry errors.

At a minimum, for this project we:

- Confirm that `transaction_qty` and `unit_price` are positive.
- Confirm that `hour_of_day` is between 0 and 23.
- Confirm that `money` has reasonable ranges.
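
One common way to make "very large values" concrete is the interquartile-range (IQR) rule. The sketch below is an assumption about how such a check could look, not a step taken from `scripts/clean_data.py`; `iqr_outliers`, the multiplier `k=1.5`, and the toy values are all illustrative.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values standing in for a numeric column such as money
demo_money = pd.Series([3.0, 3.5, 4.0, 3.2, 3.8, 40.0], name="money")
mask = iqr_outliers(demo_money)
print(demo_money[mask])
```

Flagged rows are candidates for manual review only; a large value can be a legitimate bulk order, so the rule should not be used to drop rows automatically.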

In [None]:
numeric_cols_sales = ["transaction_qty", "unit_price"]
numeric_cols_sales = [c for c in numeric_cols_sales if c in sales.columns]

print("Sales numeric summary:")
sales[numeric_cols_sales].describe()

In [None]:
numeric_cols_shop = ["hour_of_day", "money"]
numeric_cols_shop = [c for c in numeric_cols_shop if c in shop.columns]

print("Shop numeric summary:")
shop[numeric_cols_shop].describe()

In [None]:
# Check for non-positive qty/price in sales
if {"transaction_qty", "unit_price"}.issubset(sales.columns):
    invalid_qty = (sales["transaction_qty"] <= 0).sum()
    invalid_price = (sales["unit_price"] <= 0).sum()
    print(f"Non-positive transaction_qty rows: {invalid_qty}")
    print(f"Non-positive unit_price rows: {invalid_price}")

In [None]:
# Check hour_of_day range in shop
if "hour_of_day" in shop.columns:
    bad_hours = shop[
        (shop["hour_of_day"] < 0) | (shop["hour_of_day"] > 23)
    ]
    print(f"Rows with invalid hour_of_day: {len(bad_hours)}")

In [None]:
# Basic check for negative money values
if "money" in shop.columns:
    negative_money = (shop["money"] < 0).sum()
    print(f"Rows with negative money: {negative_money}")

## 6. Categorical summaries

We also examine categorical distributions to understand the structure
of the data and potentially detect unexpected categories.

Examples:

- `store_location`
- `product_category`
- `Time_of_Day`
- `Weekday`
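
Detecting unexpected categories can be made explicit by comparing observed values against an expected set. This is a minimal sketch: `EXPECTED_WEEKDAYS` is a hypothetical set that would need to match the labels actually used in the data, and `unexpected_values` is an illustrative helper.

```python
import pandas as pd

# Hypothetical allowed values; replace with the labels actually used in the data
EXPECTED_WEEKDAYS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}


def unexpected_values(series: pd.Series, expected: set) -> set:
    """Return the non-null values in `series` that are not in `expected`."""
    return set(series.dropna().unique()) - expected


# Toy column with a casing variant and an invalid label
demo_weekday = pd.Series(["Mon", "Tue", "mon", "Sun", None, "Funday"])
extra = unexpected_values(demo_weekday, EXPECTED_WEEKDAYS)
print(sorted(extra))
```

Casing variants such as `mon` show up as unexpected here, which is often the first sign that a normalization step is missing.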

In [None]:
if "store_location" in sales.columns:
    print("Top store_location values:")
    print(sales["store_location"].value_counts().head(), "\n")

if "product_category" in sales.columns:
    print("Top product_category values:")
    print(sales["product_category"].value_counts().head(), "\n")

if "Time_of_Day" in shop.columns:
    print("Time_of_Day distribution:")
    print(shop["Time_of_Day"].value_counts(), "\n")

if "Weekday" in shop.columns:
    print("Weekday distribution:")
    print(shop["Weekday"].value_counts(), "\n")

## 7. Summary of data quality and cleaning decisions

Based on the script (`scripts/clean_data.py`) and the checks in this
notebook, the following key cleaning steps are applied:

### Sales data (`coffee_sales_clean.csv`)

- **Whitespace trimming**: All string columns are stripped of leading/trailing whitespace.
- **Duplicate removal**: Rows with duplicate `transaction_id` values are dropped.
- **Type conversion**:
  - `transaction_qty` and `unit_price` are converted to numeric types.
- **Invalid values**:
  - Rows with missing or non-positive `transaction_qty` or `unit_price`
    are removed, since they prevent meaningful revenue calculation.
- **Result**:
  - A cleaned sales table where each transaction has a unique ID and
    valid quantity/price values suitable for aggregation.
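
The sales steps listed above can be sketched in pandas as follows. This is an illustration under assumptions, not the contents of `scripts/clean_data.py`: the function name `clean_sales_sketch` and the toy rows are hypothetical, and the real script may order or implement the steps differently.

```python
import pandas as pd


def clean_sales_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the sales cleaning rules (not the real script)."""
    df = raw.copy()

    # Trim whitespace from string values in non-numeric columns
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Drop rows that repeat an earlier transaction_id
    df = df.drop_duplicates(subset=["transaction_id"])

    # Convert to numeric; unparseable values become NaN
    for col in ["transaction_qty", "unit_price"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Keep strictly positive values; NaN compares as False and is dropped too
    valid = (df["transaction_qty"] > 0) & (df["unit_price"] > 0)
    return df[valid].reset_index(drop=True)


# Toy input with made-up values covering each rule
raw_sales = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3, 4],
    "transaction_qty": ["2", "1", "1", "0", "x"],
    "unit_price": [" 3.50", "2.00", "2.00", "4.00", "3.00"],
    "store_location": [" Store A ", "Store A", "Store A", "Store B", "Store B"],
})
sales_sketch = clean_sales_sketch(raw_sales)
print(sales_sketch)
```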

### Shop data (`coffee_shop_clean.csv`)

- **Whitespace trimming**: All string columns are stripped of leading/trailing whitespace.
- **Type conversion**:
  - `hour_of_day` and `money` are converted to numeric types.
- **Range and missing-value checks**:
  - Rows with `hour_of_day` outside the range 0–23 are dropped.
  - Rows with missing `money` are dropped, so that average money per hour
    can be computed.
- **Result**:
  - A cleaned shop table with valid hourly information that can be used
    to build the time-of-day profile in the integration step.
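
The shop-side conversion and filtering rules can be sketched the same way. The toy rows below are invented, and the snippet is an assumption about how the rules might be expressed, not a copy of `scripts/clean_data.py`.

```python
import pandas as pd

# Toy input with made-up values: out-of-range hours and unusable money entries
raw_shop = pd.DataFrame({
    "hour_of_day": ["8", "25", "9", "-1", "10"],
    "money": ["3.5", "2.0", None, "4.0", "abc"],
})

shop_sketch = raw_shop.copy()

# Convert to numeric; unparseable values become NaN
for col in ["hour_of_day", "money"]:
    shop_sketch[col] = pd.to_numeric(shop_sketch[col], errors="coerce")

# between() is inclusive on both ends and evaluates to False for NaN
in_range = shop_sketch["hour_of_day"].between(0, 23)
has_money = shop_sketch["money"].notna()

shop_sketch = shop_sketch[in_range & has_money].reset_index(drop=True)
print(shop_sketch)
```

Coercing first and filtering second means a non-numeric `money` entry is treated the same as a missing one, which matches the rule that rows without a usable `money` value are dropped.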

These decisions are documented here and referenced in the main project
`README.md` under the **Data quality** and **Data cleaning** sections.