# ☕ Café Sales Data Cleaning

Welcome to my notebook!  

This project is about **cleaning a messy café sales dataset**.  
The dataset contains multiple issues such as:  

- Missing values in numeric and categorical columns  
- Wrong or placeholder values like `ERROR` or `UNKNOWN`  
- Inconsistent transaction dates  
- Plausibility errors (e.g., Quantity × Price Per Unit not matching Total Spent)  

### Goals of this notebook:
1. Load and inspect the raw dataset  
2. Identify missing values and inconsistencies  
3. Apply systematic **data cleaning steps**  
4. Recalculate values when possible  
5. Export a clean dataset and a list of invalid rows  
6. Summarize fixes in a **cleaning report**  

This notebook is written in a style similar to Kaggle kernels: **Markdown explanations alternating with code cells**.  


## Setup

Let's start by importing pandas and adjusting its display settings.

In [None]:
import pandas as pd

pd.set_option('display.max_rows', 100)    # show up to 100 rows
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.expand_frame_repr', False)  # don't wrap wide frames across multiple lines

## Load the Dataset

We read the raw CSV file with error handling to catch common issues like missing files or parsing errors.

In [None]:
try:
    df = pd.read_csv("dirty_cafe_sales.csv")
    print("File read successfully.")
except FileNotFoundError:
    print("Error: file 'dirty_cafe_sales.csv' not found.")
    df = None
except pd.errors.EmptyDataError:
    print("Error: file is empty.")
    df = None
except pd.errors.ParserError:
    print("Error: file could not be read (parse error).")
    df = None
except Exception as e:
    print(f"An unknown error occurred: {e}")
    df = None

## First Look at the Data

We check the first rows, dataset info, and descriptive statistics.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## Random Sample

To get a better overview, let's look at 100 random rows.

In [None]:
df.sample(100, random_state=42)  # fixed seed so the same rows appear on every run

## Missing Values

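We count missing values per column and show all rows that contain at least one NaN. Placeholder strings such as `ERROR` or `UNKNOWN` are not NaN, so they are not counted at this stage.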

In [None]:
missing_values_count = df.isnull().sum()
missing_values_count

In [None]:
df[df.isna().any(axis=1)]
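
Because the NaN count above ignores placeholder strings, it can understate how much data is really missing. The following is a minimal sketch on made-up toy data (not the café file) that counts NaN and placeholders together; the `placeholders` list and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up), mixing real NaN/None with placeholder strings
toy = pd.DataFrame({
    "Item": ["Coffee", "ERROR", None, "Tea"],
    "Payment Method": ["Cash", "UNKNOWN", "Card", None],
})

placeholders = ["ERROR", "UNKNOWN"]  # assumed placeholder tokens

nan_count = toy.isna().sum()                      # real NaN/None only
placeholder_count = toy.isin(placeholders).sum()  # placeholder strings only
effective_missing = nan_count + placeholder_count

print(effective_missing)
```

On the real dataset, the same idea would apply to `df` in place of `toy`.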

## Data Cleaning Steps

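We convert the numeric columns with `pd.to_numeric(..., errors='coerce')`, which turns unparseable entries such as `ERROR` or `UNKNOWN` into NaN. All NaN values are then set to `0.0`, which the next step treats as the marker for "missing".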

In [None]:
num_cols = ["Quantity", "Price Per Unit", "Total Spent"]
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df[num_cols] = df[num_cols].fillna(0.0)
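
To make the effect of the conversion visible, here is a small sketch on a made-up Series (the values are illustrative, not taken from the dataset). It shows how `errors="coerce"` turns placeholders into NaN before they are filled with the `0.0` marker.

```python
import pandas as pd

# Made-up raw column: valid numbers, placeholders, and a real missing value
raw = pd.Series(["2", "ERROR", "3.5", None, "UNKNOWN"])

converted = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
n_unparseable = int(converted.isna().sum())      # count them before filling
filled = converted.fillna(0.0)                   # 0.0 now marks "missing"

print(filled.tolist())
print("missing or unparseable:", n_unparseable)
```

One consequence of this design is that a genuine zero can no longer be told apart from a missing value, so counting NaN before the `fillna` call is the safer moment to record how many entries were affected.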

### Recalculating numeric fields
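Whenever exactly one of the three values is missing (stored as `0.0`) and the other two are positive, we recalculate it from Total Spent = Quantity × Price Per Unit. Rows with two or more missing values cannot be recovered this way and keep their zeros.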

In [None]:
fixes = {"Price Per Unit": 0, "Total Spent": 0, "Quantity": 0, 
         "Item": 0, "Payment Method": 0, "Location": 0, "Transaction Date": 0}

mask_price_missing = (df["Price Per Unit"] == 0.0) & (df["Total Spent"] > 0) & (df["Quantity"] > 0)
fixes["Price Per Unit"] = mask_price_missing.sum()
df.loc[mask_price_missing, "Price Per Unit"] = df["Total Spent"] / df["Quantity"]

mask_total_missing = (df["Total Spent"] == 0.0) & (df["Price Per Unit"] > 0) & (df["Quantity"] > 0)
fixes["Total Spent"] = mask_total_missing.sum()
df.loc[mask_total_missing, "Total Spent"] = df["Price Per Unit"] * df["Quantity"]

mask_quantity_missing = (df["Quantity"] == 0.0) & (df["Price Per Unit"] > 0) & (df["Total Spent"] > 0)
fixes["Quantity"] = mask_quantity_missing.sum()
df.loc[mask_quantity_missing, "Quantity"] = df["Total Spent"] / df["Price Per Unit"]
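
The three recalculation rules can be checked on a tiny made-up frame. This is only a sketch with illustrative numbers: rows 0 to 2 each miss one value, while row 3 misses two and should therefore stay unchanged.

```python
import pandas as pd

# Made-up rows; 0.0 marks a missing value
toy = pd.DataFrame({
    "Quantity":       [2.0, 3.0, 0.0, 0.0],
    "Price Per Unit": [0.0, 1.5, 4.0, 0.0],
    "Total Spent":    [4.0, 0.0, 8.0, 5.0],
})

# Price Per Unit = Total Spent / Quantity
m = (toy["Price Per Unit"] == 0) & (toy["Total Spent"] > 0) & (toy["Quantity"] > 0)
toy.loc[m, "Price Per Unit"] = toy.loc[m, "Total Spent"] / toy.loc[m, "Quantity"]

# Total Spent = Price Per Unit * Quantity
m = (toy["Total Spent"] == 0) & (toy["Price Per Unit"] > 0) & (toy["Quantity"] > 0)
toy.loc[m, "Total Spent"] = toy.loc[m, "Price Per Unit"] * toy.loc[m, "Quantity"]

# Quantity = Total Spent / Price Per Unit
m = (toy["Quantity"] == 0) & (toy["Price Per Unit"] > 0) & (toy["Total Spent"] > 0)
toy.loc[m, "Quantity"] = toy.loc[m, "Total Spent"] / toy.loc[m, "Price Per Unit"]

print(toy)
```

Restricting the right-hand side to the masked rows, as done here, avoids computing divisions for rows that are never assigned.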

### Cleaning categorical values
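We replace the placeholders `ERROR` and `UNKNOWN` (and NaN) in Item, Payment Method, and Location with an explicit "unknown" label, infer the item from its unit price for three known prices, and convert Transaction Date to a datetime column in which invalid dates become NaT.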

In [None]:
placeholders = ["ERROR", "UNKNOWN"]

# Count placeholder and missing items before they are overwritten
before_items = df["Item"].isin(placeholders).sum() + df["Item"].isna().sum()

df["Item"] = df["Item"].replace(placeholders, pd.NA).fillna("Unknown Item")

# Infer the item from its unit price for the three prices handled here
price_to_item = {5: "Salad", 2: "Coffee", 1.5: "Tea"}
for price, item in price_to_item.items():
    mask = (df["Price Per Unit"] == price) & (df["Item"] == "Unknown Item")
    df.loc[mask, "Item"] = item

# Inferred rows are a subset of before_items, so adding them again would double-count
fixes["Item"] = before_items

before_payment = df["Payment Method"].isin(["ERROR", "UNKNOWN"]).sum() + df["Payment Method"].isna().sum()
fixes["Payment Method"] = before_payment
df["Payment Method"] = df["Payment Method"].replace(["ERROR", "UNKNOWN"], pd.NA).fillna("Unknown")

before_location = df["Location"].isin(["ERROR", "UNKNOWN"]).sum() + df["Location"].isna().sum()
fixes["Location"] = before_location
df["Location"] = df["Location"].replace(["ERROR", "UNKNOWN"], pd.NA).fillna("Unknown")

df["Transaction Date"] = df["Transaction Date"].replace(["ERROR", "UNKNOWN"], pd.NA)
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
# Count after conversion so placeholders and unparseable dates are included.
# These dates cannot be reconstructed; they stay NaT.
fixes["Transaction Date"] = df["Transaction Date"].isna().sum()

df = df.sort_values(by="Transaction Date")
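
The price-based item inference in the cell above can also be written as a lookup. The sketch below uses a made-up frame and a hypothetical `price_to_item` dictionary holding the same three price/item pairs; it assumes each listed price belongs to exactly one item.

```python
import pandas as pd

# Made-up rows; "Unknown Item" marks items that were missing or placeholders
toy = pd.DataFrame({
    "Item": ["Unknown Item", "Coffee", "Unknown Item", "Unknown Item"],
    "Price Per Unit": [5.0, 2.0, 1.5, 3.0],
})

# Hypothetical lookup with the three price/item pairs from the cell above
price_to_item = {5.0: "Salad", 2.0: "Coffee", 1.5: "Tea"}

unknown = toy["Item"] == "Unknown Item"
# Prices that are not in the lookup map to NaN and keep the "Unknown Item" label
inferred = toy.loc[unknown, "Price Per Unit"].map(price_to_item)
toy.loc[unknown, "Item"] = inferred.fillna("Unknown Item")
n_inferred = int(inferred.notna().sum())

print(toy)
print("items inferred from price:", n_inferred)
```

A price that is absent from the lookup (3.0 in this toy frame) is deliberately left as `Unknown Item` rather than guessed.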

### Plausibility Check
We check whether Quantity × Price Per Unit equals Total Spent (both rounded to two decimals) and export the rows that fail the check.

In [None]:
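# Rows that still contain a 0.0 placeholder cannot be verified, so they are flagged as invalid
df["Check_Total_OK"] = (
    ((df["Quantity"] * df["Price Per Unit"]).round(2) == df["Total Spent"].round(2))
    & (df[["Quantity", "Price Per Unit", "Total Spent"]] > 0).all(axis=1)
)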
invalid_rows = df[~df["Check_Total_OK"]]
invalid_count = len(invalid_rows)

df.to_csv("clean_cafe_sales.csv", index=False)
invalid_rows.to_csv("transactions_invalid.csv", index=False)
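
As an alternative to rounding both sides, the comparison could use a tolerance. The sketch below runs on made-up numbers and uses `numpy.isclose`; the tolerance of half a cent (`atol=0.005`) is an assumption, not a value from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: two consistent transactions and one inconsistent one
toy = pd.DataFrame({
    "Quantity":       [2.0, 3.0, 1.0],
    "Price Per Unit": [1.5, 0.1, 4.0],
    "Total Spent":    [3.0, 0.3, 5.0],
})

expected = toy["Quantity"] * toy["Price Per Unit"]
# 3 * 0.1 is not exactly 0.3 in floating point, so compare with a tolerance
toy["Check_Total_OK"] = np.isclose(expected, toy["Total Spent"], atol=0.005)
invalid = toy[~toy["Check_Total_OK"]]

print(invalid)
```

Both approaches treat floating-point noise as harmless; the tolerance version just makes the accepted deviation explicit.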

## Cleaning Report
We summarize how many values were fixed and how many invalid rows remain.

In [None]:
for key, val in fixes.items():
    print(f"{key} fixed/filled: {val}")
print(f"Invalid rows (plausibility check failed): {invalid_count}")
print("Files saved: clean_cafe_sales.csv, transactions_invalid.csv")
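
If the report should be easier to scan or to export, the counts could be put into a small table. This is a sketch with made-up counts and a made-up row total; `fixes_demo` and `n_rows` are hypothetical stand-ins for `fixes` and `len(df)`.

```python
import pandas as pd

# Made-up numbers, standing in for the real `fixes` dictionary and row count
fixes_demo = {"Item": 12, "Payment Method": 30, "Location": 8}
n_rows = 200

report = pd.DataFrame({"fixed": pd.Series(fixes_demo)})
# Share of all rows affected, in percent
report["share_pct"] = (report["fixed"] / n_rows * 100).round(1)

print(report)
```

A table like this could be written out with `report.to_csv(...)` alongside the two exported files.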

## Inspect Cleaned Data
Finally, let's look at some samples from the cleaned dataset.

In [None]:
# Parse the date column again; CSV stores it as plain text
df2 = pd.read_csv("clean_cafe_sales.csv", parse_dates=["Transaction Date"])
df2.sample(100)

In [None]:
df2.head(100)

In [None]:
df2.iloc[9450:9461]

# 📊 Summary & Next Steps

### What we achieved:
- Converted and cleaned numeric columns (`Quantity`, `Price Per Unit`, `Total Spent`)  
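- Recalculated missing numeric values and inferred missing items from the unit price where possible  
- Replaced placeholder entries in `Item`, `Payment Method`, and `Location`, and converted `Transaction Date` to datetime (invalid dates become NaT)  
- Sorted the dataset chronologically by `Transaction Date`  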
- Added a plausibility check for transactions  
- Exported two files:  
  - `clean_cafe_sales.csv` → cleaned dataset  
  - `transactions_invalid.csv` → invalid rows  

### Next steps:
- Perform **exploratory data analysis (EDA)** on the cleaned dataset  
- Visualize sales trends (by date, item, location, payment method)  
- Build a **predictive model** (e.g., forecasting daily sales)  

---

✅ With this notebook, recruiters and data teams can see how I approach **real-world messy data**: carefully, systematically, and with reproducible code.  
