# 03 – Data Quality Assessment and Cleaning

This notebook documents the **data quality** and **cleaning** steps for the
project. It complements the script `scripts/clean_data.py` and provides
exploratory checks and summaries.

Goals:

- Profile the cleaned datasets:
  - `coffee_sales_clean.csv`
  - `coffee_shop_clean.csv`
- Assess data quality:
  - Missing values
  - Potential outliers
  - Duplicate records
- Describe and justify key cleaning decisions:
  - Dropping invalid or missing quantities/prices
  - Cleaning `hour_of_day` and `money` in the shop data

In [1]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"

SALES_CLEAN = PROCESSED_DIR / "coffee_sales_clean.csv"
SHOP_CLEAN = PROCESSED_DIR / "coffee_shop_clean.csv"

PROJECT_ROOT, DATA_DIR, PROCESSED_DIR

(PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data'),
 PosixPath('/Users/ujjwal/Downloads/IS-477-Project-Ujjwal/data/processed'))

## 1. (Optional) Regenerate cleaned data

If you have updated `scripts/clean_data.py`, you can re-run it here to
regenerate the cleaned CSV files.

Otherwise, you can skip this cell and use the existing cleaned files in
`data/processed/`.

In [2]:
# OPTIONAL: Uncomment to regenerate cleaned data
# !python ../scripts/clean_data.py

## 2. Load cleaned datasets

We now load:

- `data/processed/coffee_sales_clean.csv`
- `data/processed/coffee_shop_clean.csv`

These should reflect the cleaning rules in `scripts/clean_data.py`.

In [3]:
sales = pd.read_csv(SALES_CLEAN)
shop = pd.read_csv(SHOP_CLEAN)

print("Sales (clean) shape:", sales.shape)
print("Shop  (clean) shape:", shop.shape)

Sales (clean) shape: (149116, 11)
Shop  (clean) shape: (3547, 11)


In [4]:
sales.head()

|   | transaction_id | transaction_date | transaction_time | transaction_qty | store_id | store_location | product_id | unit_price | product_category | product_type | product_detail |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/1/23 | 7:06:11 | 2 | 5 | Lower Manhattan | 32 | 3.0 | Coffee | Gourmet brewed coffee | Ethiopia Rg |
| 1 | 2 | 1/1/23 | 7:08:56 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg |
| 2 | 3 | 1/1/23 | 7:14:04 | 2 | 5 | Lower Manhattan | 59 | 4.5 | Drinking Chocolate | Hot chocolate | Dark chocolate Lg |
| 3 | 4 | 1/1/23 | 7:20:24 | 1 | 5 | Lower Manhattan | 22 | 2.0 | Coffee | Drip coffee | Our Old Time Diner Blend Sm |
| 4 | 5 | 1/1/23 | 7:22:41 | 2 | 5 | Lower Manhattan | 57 | 3.1 | Tea | Brewed Chai tea | Spicy Eye Opener Chai Lg |


In [5]:
shop.head()

|   | hour_of_day | cash_type | money | coffee_name | Time_of_Day | Weekday | Month_name | Weekdaysort | Monthsort | Date | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | card | 38.7 | Latte | Morning | Fri | Mar | 5 | 3 | 01/03/2024 | 15:50.5 |
| 1 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 19:22.5 |
| 2 | 12 | card | 38.7 | Hot Chocolate | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 20:18.1 |
| 3 | 13 | card | 28.9 | Americano | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 46:33.0 |
| 4 | 13 | card | 38.7 | Latte | Afternoon | Fri | Mar | 5 | 3 | 01/03/2024 | 48:14.6 |


## 3. Schema and missing values

We inspect:

- Column names and data types (`info()`).
- Missing values per column (`isna().sum()`).

This helps verify that critical fields (e.g., `transaction_qty`,
`unit_price`, `hour_of_day`, `money`) have been cleaned appropriately.

In [6]:
print("=== Sales info ===")
print(sales.info())

print("\n=== Missing values in sales ===")
print(sales.isna().sum())

=== Sales info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149116 entries, 0 to 149115
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   transaction_id    149116 non-null  int64  
 1   transaction_date  149116 non-null  object 
 2   transaction_time  149116 non-null  object 
 3   transaction_qty   149116 non-null  int64  
 4   store_id          149116 non-null  int64  
 5   store_location    149116 non-null  object 
 6   product_id        149116 non-null  int64  
 7   unit_price        149116 non-null  float64
 8   product_category  149116 non-null  object 
 9   product_type      149116 non-null  object 
 10  product_detail    149116 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 12.5+ MB
None

=== Missing values in sales ===
transaction_id      0
transaction_date    0
transaction_time    0
transaction_qty     0
store_id            0
store_location      0
product_id          0
unit_price          0
product_category    0
product_type        0
product_detail      0
dtype: int64

In [7]:
print("=== Shop info ===")
print(shop.info())

print("\n=== Missing values in shop ===")
print(shop.isna().sum())

=== Shop info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3547 entries, 0 to 3546
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   hour_of_day  3547 non-null   int64  
 1   cash_type    3547 non-null   object 
 2   money        3547 non-null   float64
 3   coffee_name  3547 non-null   object 
 4   Time_of_Day  3547 non-null   object 
 5   Weekday      3547 non-null   object 
 6   Month_name   3547 non-null   object 
 7   Weekdaysort  3547 non-null   int64  
 8   Monthsort    3547 non-null   int64  
 9   Date         3547 non-null   object 
 10  Time         3547 non-null   object 
dtypes: float64(1), int64(3), object(7)
memory usage: 304.9+ KB
None

=== Missing values in shop ===
hour_of_day    0
cash_type      0
money          0
coffee_name    0
Time_of_Day    0
Weekday        0
Month_name     0
Weekdaysort    0
Monthsort      0
Date           0
Time           0
dtype: int64


## 4. Duplicate checks

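We check for unexpected duplicate rows, focusing on the columns that
could act as identifiers:

- `transaction_id` in the sales data, which should be unique per row.
- `hour_of_day` in the shop data. This column is not a unique key: many
  purchases fall in the same hour, so a high duplicate count on it (or on
  the pair `hour_of_day`, `coffee_name`) is expected and does not by
  itself indicate a data problem.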

In [8]:
# Sales duplicates by transaction_id
if "transaction_id" in sales.columns:
    dup_sales = sales.duplicated(subset=["transaction_id"]).sum()
    print(f"Duplicate transaction_id rows in sales: {dup_sales}")
else:
    print("transaction_id not found in sales columns.")

# Shop duplicates by hour_of_day
if "hour_of_day" in shop.columns:
    dup_shop_hour = shop.duplicated(subset=["hour_of_day"]).sum()
    print(f"Duplicate hour_of_day rows in shop: {dup_shop_hour}")
else:
    print("hour_of_day not found in shop columns.")

Duplicate transaction_id rows in sales: 0
Duplicate hour_of_day rows in shop: 3530
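
The count of 3530 follows directly from the data shape: 3547 rows minus 3530 flagged rows leaves 17 distinct hours, so the figure reflects repeated hours rather than repeated records. A full-row comparison is usually a more meaningful duplicate check for a table without an ID column. The sketch below contrasts the two on a small made-up frame; `count_full_row_duplicates` and `toy_shop` are hypothetical names used only for illustration and are not part of `scripts/clean_data.py`.

```python
import pandas as pd


def count_full_row_duplicates(df: pd.DataFrame) -> int:
    """Count rows that exactly repeat an earlier row across all columns."""
    return int(df.duplicated(keep="first").sum())


# Hypothetical toy data standing in for the shop table (not real records)
toy_shop = pd.DataFrame(
    {
        "hour_of_day": [10, 10, 12, 12],
        "coffee_name": ["Latte", "Latte", "Americano", "Latte"],
        "Date": ["01/03/2024", "01/03/2024", "01/03/2024", "01/03/2024"],
        "Time": ["15:50.5", "15:50.5", "19:22.5", "20:18.1"],
    }
)

# Duplicates on a single low-cardinality column: every repeated hour counts
hour_only = int(toy_shop.duplicated(subset=["hour_of_day"]).sum())

# Duplicates across all columns: only exact repeats count
full_row = count_full_row_duplicates(toy_shop)

print(f"Duplicates on hour_of_day only: {hour_only}")
print(f"Full-row duplicates: {full_row}")
```

In the notebook itself, the same idea would be `shop.duplicated().sum()`. Whether exact repeats in the shop data are errors or legitimate identical purchases would still need to be judged case by case.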


## 5. Numeric distributions and potential outliers

We look at basic descriptive statistics and simple checks to identify:

- Unreasonable values (e.g., non-positive quantities or prices).
- Very large values that might represent outliers or data entry errors.

At a minimum, we:

- Confirm that `transaction_qty` and `unit_price` are strictly positive.
- Confirm that `hour_of_day` is between 0 and 23.
- Confirm that `money` is non-negative and falls within a plausible range.

In [9]:
numeric_cols_sales = ["transaction_qty", "unit_price"]
numeric_cols_sales = [c for c in numeric_cols_sales if c in sales.columns]

print("Sales numeric summary:")
sales[numeric_cols_sales].describe()

Sales numeric summary:


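|       | transaction_qty | unit_price |
|-------|-----------------|------------|
| count | 149116.0        | 149116.0   |
| mean  | 1.438276        | 3.382219   |
| std   | 0.542509        | 2.658723   |
| min   | 1.0             | 0.8        |
| 25%   | 1.0             | 2.5        |
| 50%   | 1.0             | 3.0        |
| 75%   | 2.0             | 3.75       |
| max   | 8.0             | 45.0       |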
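
The summary shows a `unit_price` maximum of 45.0 against a 75th percentile of 3.75, which is worth a closer look. One common way to flag such values is the interquartile-range (IQR) rule; with the quartiles reported above, the upper fence would be 3.75 + 1.5 × 1.25 = 5.625. The sketch below applies that rule to made-up prices. The helper `iqr_outlier_mask`, the multiplier `k = 1.5`, and `toy_prices` are assumptions for illustration, not something taken from the cleaning script.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical prices, chosen only to demonstrate the rule
toy_prices = pd.Series([2.5, 3.0, 3.0, 3.75, 4.5, 45.0], name="unit_price")

mask = iqr_outlier_mask(toy_prices)
flagged = toy_prices[mask].tolist()
print(f"Flagged values: {flagged}")
```

A flagged value is a candidate for review, not necessarily an error: a high price may simply belong to a legitimate premium product, so inspecting the matching `product_detail` rows is a sensible next step before dropping anything.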


In [10]:
numeric_cols_shop = ["hour_of_day", "money"]
numeric_cols_shop = [c for c in numeric_cols_shop if c in shop.columns]

print("Shop numeric summary:")
shop[numeric_cols_shop].describe()

Shop numeric summary:


|       | hour_of_day | money     |
|-------|-------------|-----------|
| count | 3547.0      | 3547.0    |
| mean  | 14.185791   | 31.645216 |
| std   | 4.23401     | 4.877754  |
| min   | 6.0         | 18.12     |
| 25%   | 10.0        | 27.92     |
| 50%   | 14.0        | 32.82     |
| 75%   | 18.0        | 35.76     |
| max   | 22.0        | 38.7      |


In [11]:
# Check for non-positive qty/price in sales
if {"transaction_qty", "unit_price"}.issubset(sales.columns):
    invalid_qty = (sales["transaction_qty"] <= 0).sum()
    invalid_price = (sales["unit_price"] <= 0).sum()
    print(f"Non-positive transaction_qty rows: {invalid_qty}")
    print(f"Non-positive unit_price rows: {invalid_price}")

Non-positive transaction_qty rows: 0
Non-positive unit_price rows: 0


In [12]:
# Check hour_of_day range in shop
if "hour_of_day" in shop.columns:
    bad_hours = shop[
        (shop["hour_of_day"] < 0) | (shop["hour_of_day"] > 23)
    ]
    print(f"Rows with invalid hour_of_day: {len(bad_hours)}")

Rows with invalid hour_of_day: 0


In [13]:
# Basic check for negative money values
if "money" in shop.columns:
    negative_money = (shop["money"] < 0).sum()
    print(f"Rows with negative money: {negative_money}")

Rows with negative money: 0


## 6. Categorical summaries

We also examine categorical distributions to understand the structure
of the data and potentially detect unexpected categories.

Examples:

- `store_location`
- `product_category`
- `Time_of_Day`
- `Weekday`

In [14]:
if "store_location" in sales.columns:
    print("Top store_location values:")
    print(sales["store_location"].value_counts().head(), "\n")

if "product_category" in sales.columns:
    print("Top product_category values:")
    print(sales["product_category"].value_counts().head(), "\n")

if "Time_of_Day" in shop.columns:
    print("Time_of_Day distribution:")
    print(shop["Time_of_Day"].value_counts(), "\n")

if "Weekday" in shop.columns:
    print("Weekday distribution:")
    print(shop["Weekday"].value_counts(), "\n")

Top store_location values:
store_location
Hell's Kitchen     50735
Astoria            50599
Lower Manhattan    47782
Name: count, dtype: int64 

Top product_category values:
product_category
Coffee                58416
Tea                   45449
Bakery                22796
Drinking Chocolate    11468
Flavours               6790
Name: count, dtype: int64 

Time_of_Day distribution:
Time_of_Day
Afternoon    1205
Morning      1181
Night        1161
Name: count, dtype: int64 

Weekday distribution:
Weekday
Tue    572
Mon    544
Fri    532
Thu    510
Wed    500
Sat    470
Sun    419
Name: count, dtype: int64 
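
Reading value counts by eye works for a handful of categories, but an explicit comparison against an expected set makes unexpected labels (typos, case variants, new categories) easier to catch. The following is a minimal sketch on made-up values; `unexpected_values` is a hypothetical helper, and the expected set is an assumption based only on the three `Time_of_Day` labels observed above.

```python
import pandas as pd


def unexpected_values(values: pd.Series, expected: set[str]) -> list[str]:
    """Return the sorted distinct non-null values that are not in `expected`."""
    observed = set(values.dropna().unique())
    return sorted(observed - expected)


# Assumption: these three labels are the only valid ones
EXPECTED_TIME_OF_DAY = {"Morning", "Afternoon", "Night"}

# Hypothetical column containing a case variant, a new label, and a missing value
toy_time_of_day = pd.Series(["Morning", "Night", "afternoon", "Evening", None])

surprises = unexpected_values(toy_time_of_day, EXPECTED_TIME_OF_DAY)
print(f"Unexpected Time_of_Day labels: {surprises}")
```

An empty list means every observed label is in the expected set. The comparison is case-sensitive on purpose, so that variants such as `afternoon` surface instead of being silently merged.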



## 7. Summary of data quality and cleaning decisions

Based on the script (`scripts/clean_data.py`) and the checks in this
notebook, the following key cleaning steps are applied:

### Sales data (`coffee_sales_clean.csv`)

- **Whitespace trimming**: All string columns are stripped of leading/trailing whitespace.
- **Duplicate removal**: Rows with duplicate `transaction_id` values are dropped.
- **Type conversion**:
  - `transaction_qty` and `unit_price` are converted to numeric types.
- **Invalid values**:
  - Rows with missing or non-positive `transaction_qty` or `unit_price`
    are removed, since they prevent meaningful revenue calculation.
- **Result**:
  - A cleaned sales table where each transaction has a unique ID and
    valid quantity/price values suitable for aggregation.
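
The script itself is not reproduced in this notebook. As a rough illustration only, the sales steps listed above might look like the sketch below. The function name `clean_sales_sketch`, the order of the steps, and the toy input are all assumptions; the actual logic in `scripts/clean_data.py` may differ.

```python
import pandas as pd


def clean_sales_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the sales cleaning steps (not the real script)."""
    df = raw.copy()

    # Whitespace trimming: strip string values in every non-numeric column
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Duplicate removal: keep the first row for each transaction_id
    df = df.drop_duplicates(subset=["transaction_id"], keep="first")

    # Type conversion: entries that cannot be parsed become NaN
    for col in ["transaction_qty", "unit_price"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Invalid values: NaN compares as False, so missing values are dropped too
    valid = (df["transaction_qty"] > 0) & (df["unit_price"] > 0)
    return df[valid].reset_index(drop=True)


# Hypothetical raw rows: one duplicate ID, one zero quantity, one unparseable quantity
toy_sales = pd.DataFrame(
    {
        "transaction_id": [1, 2, 2, 3, 4],
        "transaction_qty": ["2", "1", "1", "0", "abc"],
        "unit_price": ["3.0", " 2.5 ", " 2.5 ", "4.5", "3.1"],
        "store_location": [" Astoria", "Astoria ", "Astoria ", "Astoria", "Astoria"],
    }
)

cleaned_sales = clean_sales_sketch(toy_sales)
print(cleaned_sales)
```

Only transactions 1 and 2 survive in this toy example: the repeated ID is dropped, and the zero and unparseable quantities are removed by the validity filter.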

### Shop data (`coffee_shop_clean.csv`)

- **Whitespace trimming**: All string columns are stripped of leading/trailing whitespace.
- **Type conversion**:
  - `hour_of_day` and `money` are converted to numeric types.
- **Invalid values**:
  - Rows with `hour_of_day` outside the range 0–23 are dropped.
  - Rows with missing `money` are dropped, so that the average `money`
    per hour can be computed.
- **Result**:
  - A cleaned shop table with valid hourly information that can be used
    to build the time-of-day profile in the integration step.
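
As with the sales data, the shop steps can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of `scripts/clean_data.py`: the name `clean_shop_sketch` and the toy rows are hypothetical, and the sketch also drops rows whose `hour_of_day` is missing, which the list above does not state explicitly.

```python
import pandas as pd


def clean_shop_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the shop cleaning steps (not the real script)."""
    df = raw.copy()

    # Whitespace trimming: strip string values in every non-numeric column
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

    # Type conversion: entries that cannot be parsed become NaN
    for col in ["hour_of_day", "money"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Keep hours in 0..23 inclusive (assumption: a missing hour is also dropped)
    valid_hour = df["hour_of_day"].between(0, 23)

    # Keep rows where money is present
    has_money = df["money"].notna()

    return df[valid_hour & has_money].reset_index(drop=True)


# Hypothetical raw rows: two out-of-range hours and one missing money value
toy_shop_raw = pd.DataFrame(
    {
        "hour_of_day": [10, 24, 13, -1, 7],
        "money": ["38.7", "28.9", None, "33.8", " 27.92"],
        "coffee_name": [" Latte", "Americano", "Latte", "Cocoa", "Espresso "],
    }
)

cleaned_shop = clean_shop_sketch(toy_shop_raw)
print(cleaned_shop)
```

Two rows remain in this toy example (hours 10 and 7), which is enough to compute a per-hour mean with `cleaned_shop.groupby("hour_of_day")["money"].mean()`.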

These decisions are documented here and referenced in the main project
`README.md` under the **Data quality** and **Data cleaning** sections.