# 🧼 Transaction Data Preprocessing

This notebook prepares the transaction data for market basket analysis.

We'll:
- Load the raw CSV
- Clean and transform the data
- Group items by transaction to prepare for frequent itemset mining
- Save the grouped baskets for the next notebook

## 📥 Load Dataset

We'll load the CSV file containing transactional data where each row represents a single item in a transaction.

In [3]:
import pandas as pd

# Load the data
df = pd.read_csv("../data/market_basket_transactions.csv")

# Show basic structure
df.head()

|   | TransactionID | CustomerID | Item |
|---|---------------|------------|------|
| 0 | T0001 | 82 | Bread |
| 1 | T0002 | 95 | Eggs |
| 2 | T0002 | 95 | Tomatoes |
| 3 | T0002 | 95 | Butter |
| 4 | T0003 | 95 | Beef |
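
Before cleaning, it can help to confirm that the file really has the long format described above, with one row per purchased item. The following is a minimal sketch under stated assumptions: it uses a small in-memory CSV as a stand-in for the real file (rows copied from the preview), and `EXPECTED_COLUMNS` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import io
import pandas as pd

# Stand-in for ../data/market_basket_transactions.csv (rows copied from the preview)
sample_csv = io.StringIO(
    "TransactionID,CustomerID,Item\n"
    "T0001,82,Bread\n"
    "T0002,95,Eggs\n"
    "T0002,95,Tomatoes\n"
    "T0002,95,Butter\n"
    "T0003,95,Beef\n"
)
sample_df = pd.read_csv(sample_csv)

# Hypothetical constant: the columns the rest of the notebook relies on
EXPECTED_COLUMNS = {"TransactionID", "CustomerID", "Item"}
missing = EXPECTED_COLUMNS - set(sample_df.columns)
assert not missing, f"Missing columns: {missing}"

# In long format, a transaction with several items spans several rows
items_per_transaction = sample_df.groupby("TransactionID").size()
print(items_per_transaction)
```

A transaction that appears on several rows is expected here, not a data error; those rows are what the grouping step later collapses into one basket.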


## 🔍 Data Cleaning

- Strip stray whitespace from the column names
- Check for missing values and drop duplicate rows
- Make sure the ID columns are stored as strings

In [4]:
# Strip leading/trailing whitespace from column names
df.columns = df.columns.str.strip()

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Drop exact duplicate rows (same transaction, customer, and item)
df = df.drop_duplicates()

# Treat IDs as labels rather than numbers: cast both to string
df["TransactionID"] = df["TransactionID"].astype(str)
df["CustomerID"] = df["CustomerID"].astype(str)

Missing values:
 TransactionID    0
CustomerID       0
Item             0
dtype: int64
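
The cell above reports missing values, but it drops duplicates silently and does not display the column dtypes. A minimal sketch of making those checks visible is shown below on a toy frame. The data, the `toy_df` name, and the item-label normalization are all illustrative assumptions; the real dataset may not need any of this.

```python
import pandas as pd

# Toy data: one exact duplicate row and one messy item label
toy_df = pd.DataFrame({
    "TransactionID": ["T0001", "T0002", "T0002", "T0002"],
    "CustomerID": [82, 95, 95, 95],
    "Item": ["Bread", "Eggs", "Eggs", " butter "],
})

# Count exact duplicate rows before dropping them
n_duplicates = int(toy_df.duplicated().sum())
print("Duplicate rows:", n_duplicates)
toy_df = toy_df.drop_duplicates()

# Inspect dtypes: CustomerID is read as an integer unless cast
print(toy_df.dtypes)
toy_df["CustomerID"] = toy_df["CustomerID"].astype(str)

# Assumption: item labels may carry stray whitespace or inconsistent casing
toy_df["Item"] = toy_df["Item"].str.strip().str.title()
print(toy_df)
```

Counting duplicates first gives a record of how many rows were removed, which is useful when the same item can legitimately appear twice in one transaction and dropping it would change the data.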


## 🛒 Group Items by Transaction

Group all items purchased together under the same `TransactionID` to create the baskets required for association rule mining.

In [5]:
# Create a list of items per transaction
grouped_df = df.groupby("TransactionID")["Item"].apply(list).reset_index()

grouped_df.head()

|   | TransactionID | Item |
|---|---------------|------|
| 0 | T0001 | [Bread] |
| 1 | T0002 | [Eggs, Tomatoes, Butter] |
| 2 | T0003 | [Beef] |
| 3 | T0004 | [Apples, Bread, Beef, Chicken, Milk] |
| 4 | T0005 | [Tomatoes, Bread, Eggs, Bananas, Apples] |
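
Some frequent itemset implementations expect a boolean transaction-by-item matrix rather than lists of items. One possible way to build that encoding with pandas alone is sketched below. The toy data and the `basket_matrix` name are assumptions for illustration, and the next notebook may well use a different encoding.

```python
import pandas as pd

# Toy long-format data: one row per item in a transaction
toy_df = pd.DataFrame({
    "TransactionID": ["T0001", "T0002", "T0002", "T0002", "T0003"],
    "Item": ["Bread", "Eggs", "Tomatoes", "Butter", "Beef"],
})

# Same grouping as above: one list of items per transaction
baskets = toy_df.groupby("TransactionID")["Item"].apply(list).reset_index()
print(baskets)

# Alternative encoding: one row per transaction, one boolean column per item
basket_matrix = pd.crosstab(toy_df["TransactionID"], toy_df["Item"]).astype(bool)
print(basket_matrix)
```

The list form is compact and easy to read, while the boolean matrix makes it cheap to count how often items occur together. Both are derived from the same long-format rows.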


## ✅ Save Preprocessed Data

Save the grouped data to a new CSV file for use in the next notebook.

In [6]:
grouped_df.to_csv("../data/market_basket_grouped.csv", index=False)
print("Grouped transactions saved successfully.")

Grouped transactions saved successfully.
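
One thing to keep in mind when reloading this file: CSV has no list type, so `to_csv` writes each basket as its string representation, and `read_csv` returns a string rather than a list. The sketch below shows one way to restore the lists using `ast.literal_eval`. It uses an in-memory buffer and toy data instead of the real file, and the `grouped` and `reloaded` names are illustrative.

```python
import ast
import io
import pandas as pd

# Toy grouped data with a list-valued column
grouped = pd.DataFrame({
    "TransactionID": ["T0001", "T0002"],
    "Item": [["Bread"], ["Eggs", "Tomatoes", "Butter"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
grouped.to_csv(buffer, index=False)
buffer.seek(0)

# After reading back, each basket is a string, not a list
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "Item"]))

# ast.literal_eval safely parses the string back into a Python list
reloaded["Item"] = reloaded["Item"].apply(ast.literal_eval)
print(reloaded.loc[1, "Item"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.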
