## Step 3: Parse Grocery Transactions into Basket Format

The recommendation model requires grocery transactions to be represented as lists of items, where each list corresponds to a single shopping basket.

In this step, the raw dataset is cleaned and transformed into a transaction-based structure suitable for market basket analysis. This representation mirrors the grocery lists created in Task 1, but across multiple customers.


In [1]:
# Step 3: Prepare transaction data

import pandas as pd

# Load dataset
file_path = r"C:\coding5final\coding5\data\raw\Groceries_dataset.csv"
df = pd.read_csv(file_path, header=None)

df.head()



|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Member_number | Date | itemDescription |
| 1 | 1808 | 21-07-2015 | tropical fruit |
| 2 | 2552 | 05-01-2015 | whole milk |
| 3 | 2300 | 19-09-2015 | pip fruit |
| 4 | 1187 | 12-12-2015 | other vegetables |


### Assign column names

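The dataset has three columns: customer ID, date, and item description. Because the file was read with `header=None`, pandas labeled them `0`, `1`, `2` and kept the file's own header line (`Member_number`, `Date`, `itemDescription`) as data row 0. We assign descriptive column names so that later transformations are easy to read. Note that renaming alone does not remove that header row, so it is carried into the steps below.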


In [2]:
df.columns = ["CustomerID", "Date", "Item"]

df.head()



|   | CustomerID | Date | Item |
|---|---|---|---|
| 0 | Member_number | Date | itemDescription |
| 1 | 1808 | 21-07-2015 | tropical fruit |
| 2 | 2552 | 05-01-2015 | whole milk |
| 3 | 2300 | 19-09-2015 | pip fruit |
| 4 | 1187 | 12-12-2015 | other vegetables |
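
The previews above still show the file's own header line as data row 0, because `header=None` tells pandas that the file has no header. One way to keep that row out of the baskets is to let pandas consume the header and supply the new names in the same call. The following is a minimal sketch on an in-memory sample built from the rows shown above. `sample_csv` and `df_clean` are hypothetical names, and with the real data the file path would take the place of the `StringIO` object.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (rows copied from the preview above)
sample_csv = io.StringIO(
    "Member_number,Date,itemDescription\n"
    "1808,21-07-2015,tropical fruit\n"
    "2552,05-01-2015,whole milk\n"
    "2300,19-09-2015,pip fruit\n"
    "1187,12-12-2015,other vegetables\n"
)

# header=0 consumes the first line as the header row;
# names= then replaces those labels with our own
df_clean = pd.read_csv(
    sample_csv,
    header=0,
    names=["CustomerID", "Date", "Item"],
)

print(df_clean.shape)
print(df_clean.head())
```

With this approach the first data row is an actual purchase, and `CustomerID` is read as a numeric column rather than as text.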


### Define unique transaction IDs

A transaction is defined as all items purchased by a single customer on a single date, so several store visits by the same customer on the same day are treated as one basket.
We create a new column, `TransactionID`, by combining `CustomerID` and `Date` to uniquely identify each shopping basket.


In [3]:
df["TransactionID"] = (
    df["CustomerID"].astype(str) + "_" + df["Date"].astype(str)
)

df.head()


|   | CustomerID | Date | Item | TransactionID |
|---|---|---|---|---|
| 0 | Member_number | Date | itemDescription | Member_number_Date |
| 1 | 1808 | 21-07-2015 | tropical fruit | 1808_21-07-2015 |
| 2 | 2552 | 05-01-2015 | whole milk | 2552_05-01-2015 |
| 3 | 2300 | 19-09-2015 | pip fruit | 2300_19-09-2015 |
| 4 | 1187 | 12-12-2015 | other vegetables | 1187_12-12-2015 |
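
The `Date` values in the preview are day-first strings such as `21-07-2015`. The string key works as it is, but if dates are later sorted or compared, parsing them with an explicit format avoids day/month ambiguity. The sketch below assumes the `DD-MM-YYYY` layout seen above and uses made-up rows. `toy` and `ParsedDate` are hypothetical names. It also checks that the string key identifies the same groups as the (`CustomerID`, `Date`) pair.

```python
import pandas as pd

# Illustrative rows only; not taken from the real dataset
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2],
    "Date": ["21-07-2015", "05-01-2015", "05-01-2015"],
    "Item": ["tropical fruit", "whole milk", "pip fruit"],
})

# Explicit day-month-year format, so 05-01-2015 is 5 January, not 1 May
toy["ParsedDate"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Same key construction as in the cell above
toy["TransactionID"] = toy["CustomerID"].astype(str) + "_" + toy["Date"].astype(str)

# The string key should identify exactly the same groups as the column pair
n_pairs = toy.groupby(["CustomerID", "Date"]).ngroups
n_ids = toy["TransactionID"].nunique()
print(n_pairs, n_ids)
```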


### Group items into baskets

Next, we group all items by `TransactionID` to create shopping baskets. 
Each transaction will now correspond to a list of items purchased together, which is the required format for market basket analysis.


In [4]:
transactions = (
    df.groupby("TransactionID")["Item"]
    .apply(list)
)

transactions.head()


TransactionID
1000_15-03-2015    [sausage, whole milk, semi-finished bread, yog...
1000_24-06-2014                    [whole milk, pastry, salty snack]
1000_24-07-2015                       [canned beer, misc. beverages]
1000_25-11-2015                          [sausage, hygiene articles]
1000_27-05-2015                           [soda, pickled vegetables]
Name: Item, dtype: object
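
Note that `apply(list)` keeps every row, so an item recorded twice in the same transaction appears twice in its list. Whether that matters depends on the downstream model. For presence/absence style analysis, one option is to collapse repeats while grouping. This is a sketch on made-up rows, not a step the notebook performs, and `toy` and `unique_baskets` are hypothetical names.

```python
import pandas as pd

# Illustrative rows only; "whole milk" is repeated within transaction A
toy = pd.DataFrame({
    "TransactionID": ["A_01-01-2015", "A_01-01-2015", "A_01-01-2015", "B_02-01-2015"],
    "Item": ["whole milk", "whole milk", "pastry", "soda"],
})

# dict.fromkeys drops repeated items while keeping first-seen order
unique_baskets = (
    toy.groupby("TransactionID")["Item"]
    .apply(lambda items: list(dict.fromkeys(items)))
)

print(unique_baskets.tolist())
```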

### Convert grouped transactions to list-of-lists

Frequent-itemset mining algorithms commonly used for basket-based recommendations (e.g., Apriori, FP-Growth) start from a collection of transactions, where each transaction is itself a list of items.
We convert the grouped transactions into this list-of-lists format so that they can be encoded and passed to the model in the next step.


In [5]:
basket_list = transactions.tolist()

basket_list[:5]


[['sausage', 'whole milk', 'semi-finished bread', 'yogurt'],
 ['whole milk', 'pastry', 'salty snack'],
 ['canned beer', 'misc. beverages'],
 ['sausage', 'hygiene articles'],
 ['soda', 'pickled vegetables']]

### Verify transaction list

Finally, we perform basic sanity checks:

1. Count the number of transactions
2. Inspect a sample basket

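These checks give a quick confirmation that the data has the expected structure before moving on to model encoding and recommendation generation. Because the CSV header line was loaded as a data row, the basket count below includes one spurious basket that contains only `itemDescription`.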


In [6]:
print("Number of baskets:", len(basket_list))
print("Sample basket:", basket_list[0])


Number of baskets: 14964
Sample basket: ['sausage', 'whole milk', 'semi-finished bread', 'yogurt']
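
A slightly broader check can also look for empty baskets, summarize basket sizes, and filter out any basket that came from the CSV header line. The sketch below runs on a tiny hand-written list. `toy_baskets`, `header_labels`, and `real_baskets` are hypothetical names, and the same logic could be applied to `basket_list`.

```python
from collections import Counter

toy_baskets = [
    ["sausage", "whole milk", "semi-finished bread", "yogurt"],
    ["whole milk", "pastry", "salty snack"],
    ["itemDescription"],  # what the header line becomes when read as data
]

# Column labels from the raw file; a basket containing one is not a purchase
header_labels = {"Member_number", "Date", "itemDescription"}
real_baskets = [b for b in toy_baskets if not header_labels.intersection(b)]

# Basket-size distribution and a count of empty baskets
size_counts = Counter(len(b) for b in real_baskets)
empty = [b for b in real_baskets if not b]

print(len(toy_baskets), len(real_baskets), dict(size_counts), len(empty))
```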


In [7]:
import pickle
import os

# Ensure the processed folder exists
os.makedirs(r"C:\coding5final\coding5\data\processed", exist_ok=True)

# Save basket_list for Step 4
with open(r"C:\coding5final\coding5\data\processed\basket_list.pkl", "wb") as f:
    pickle.dump(basket_list, f)

print("basket_list saved successfully in processed folder!")


basket_list saved successfully in processed folder!
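
Before relying on the saved file in Step 4, it can be worth confirming that it loads back unchanged. The sketch below does a round trip in a temporary directory so that it can run anywhere. `toy_baskets`, `out_path`, and `reloaded` are hypothetical names, and the real notebook would point at its own `processed` folder instead.

```python
import pickle
import tempfile
from pathlib import Path

toy_baskets = [["whole milk", "pastry"], ["soda"]]

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the processed/basket_list.pkl layout inside a throwaway folder
    out_path = Path(tmp) / "processed" / "basket_list.pkl"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("wb") as f:
        pickle.dump(toy_baskets, f)

    with out_path.open("rb") as f:
        reloaded = pickle.load(f)

print(reloaded == toy_baskets)
```

Building paths with `pathlib` rather than a hard-coded absolute string also makes the notebook easier to run on another machine.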
