# Practical Exam: Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.


## Data Description

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

# Task 1

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table below.

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

In [19]:
# Write your answer to Task 1 here 
import pandas as pd

# Load raw data
raw_data = pd.read_csv("raw_customer_data.csv")

# --- Cleaning ---
clean_data = raw_data.copy()

# customer_id → integer (unique, no missing)
clean_data["customer_id"] = clean_data["customer_id"].astype(int)

# time_spent → float, fill missing with median
clean_data["time_spent"] = clean_data["time_spent"].astype(float)
clean_data["time_spent"] = clean_data["time_spent"].fillna(clean_data["time_spent"].median())

# pages_viewed → int, fill missing with mean (the int cast below truncates a fractional mean)
clean_data["pages_viewed"] = clean_data["pages_viewed"].fillna(clean_data["pages_viewed"].mean())
clean_data["pages_viewed"] = clean_data["pages_viewed"].astype(int)

# basket_value → float, fill missing with 0
clean_data["basket_value"] = clean_data["basket_value"].fillna(0.0).astype(float)

# device_type → string, fill missing with "Unknown"
clean_data["device_type"] = clean_data["device_type"].fillna("Unknown").astype(str)

# customer_type → string, fill missing with "New"
clean_data["customer_type"] = clean_data["customer_type"].fillna("New").astype(str)

# purchase → binary int (0/1)
clean_data["purchase"] = clean_data["purchase"].astype(int)

# Final output
clean_data


| | customer_id | time_spent | pages_viewed | basket_value | device_type | customer_type | purchase |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 23.097867 | 7 | 50.574647 | Mobile | Returning | 0 |
| 1 | 2 | 57.092144 | 3 | 56.891022 | Mobile | Returning | 1 |
| 2 | 3 | 44.187643 | 14 | 8.348296 | Mobile | Returning | 0 |
| 3 | 4 | 36.320851 | 10 | 43.481489 | Mobile | New | 1 |
| 4 | 5 | 10.205100 | 16 | 0.000000 | Mobile | Returning | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 496 | 21.847781 | 6 | 39.954545 | Mobile | New | 1 |
| 496 | 497 | 35.435711 | 15 | 64.972694 | Desktop | Returning | 1 |
| 497 | 498 | 35.037329 | 10 | 0.000000 | Unknown | New | 1 |
| 498 | 499 | 58.489294 | 5 | 73.736271 | Mobile | Returning | 1 |
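
The criteria table for Task 1 can be turned into a quick self-check before submitting. The sketch below is one possible way to do that, not part of the exam's grading: the helper name `check_clean` and the toy values are made up for illustration, and the allowed category sets are read directly from the criteria table.

```python
import pandas as pd

def check_clean(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df["customer_id"].is_unique:
        problems.append("customer_id is not unique")
    if not pd.api.types.is_integer_dtype(df["pages_viewed"]):
        problems.append("pages_viewed is not an integer column")
    if not set(df["device_type"]) <= {"Mobile", "Desktop", "Tablet", "Unknown"}:
        problems.append("unexpected device_type value")
    if not set(df["customer_type"]) <= {"New", "Returning"}:
        problems.append("unexpected customer_type value")
    if not set(df["purchase"]) <= {0, 1}:
        problems.append("purchase is not binary")
    return problems

# Toy frame with made-up values that satisfies the criteria
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [23.1, 57.0, 44.2],
    "pages_viewed": [7, 3, 14],
    "basket_value": [50.5, 0.0, 8.3],
    "device_type": ["Mobile", "Unknown", "Desktop"],
    "customer_type": ["Returning", "New", "New"],
    "purchase": [0, 1, 0],
})
print(check_clean(toy))

# Same frame with an inconsistently cased category, which the check should flag
bad = toy.assign(device_type=["mobile", "Unknown", "Desktop"])
print(check_clean(bad))
```

Calling `check_clean(clean_data)` after the cleaning cell would surface issues such as inconsistent category spellings, which filling missing values alone does not address.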


# Task 2
The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale numerical features (`time_spent`, `pages_viewed`, `basket_value`) to 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.
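
To make the two transformations concrete, here is a small sketch on a made-up three-row frame (the values are illustrative and do not come from `model_data.csv`). It applies min-max scaling by hand, `(x - min) / (max - min)`, and shows how `pd.get_dummies` produces the `variable_name_category_name` column names.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "time_spent": [10.0, 30.0, 50.0],
    "device_type": ["Mobile", "Desktop", "Mobile"],
})

# Min-max scaling maps the smallest value to 0 and the largest to 1
col = toy["time_spent"]
toy["time_spent"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding: prefix + "_" + category, e.g. device_type_Desktop
encoded = pd.get_dummies(
    toy, columns=["device_type"], prefix=["device_type"], dtype=int
)

print(encoded["time_spent"].tolist())  # [0.0, 0.5, 1.0]
print(list(encoded.columns))
```

Note that this hand-written formula would divide by zero on a constant column, so it is meant only to illustrate the arithmetic.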


In [20]:
# Write your answer to Task 2 here
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load data
df = pd.read_csv("model_data.csv")

# --- Scale numerical features ---
num_cols = ["time_spent", "pages_viewed", "basket_value"]
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# --- One-hot encode categorical features ---
cat_cols = ["device_type", "customer_type"]
# dtype=int keeps the dummy columns as 0/1; pandas 2.x returns booleans by default
df_encoded = pd.get_dummies(df, columns=cat_cols, prefix=cat_cols, dtype=int)

# --- Final feature set ---
model_feature_set = df_encoded.copy()

# Output
model_feature_set


| | customer_id | time_spent | pages_viewed | basket_value | purchase | device_type_Desktop | device_type_Mobile | device_type_Tablet | device_type_Unknown | customer_type_New | customer_type_Returning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 501 | 0.664167 | 0.500000 | 0.000000 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 502 | 0.483681 | 0.222222 | 0.524981 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 503 | 0.231359 | 0.111111 | 0.457291 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 504 | 0.792944 | 0.277778 | 0.000000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 505 | 0.649210 | 0.166667 | 0.484283 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 996 | 0.510473 | 1.000000 | 0.459799 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 496 | 997 | 0.908229 | 0.000000 | 0.000000 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 497 | 998 | 0.039019 | 0.333333 | 0.202147 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 498 | 999 | 0.944895 | 0.888889 | 0.369052 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |


# Task 3

Now that the preparatory work is done, create and train a neural network that allows the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for the hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases. 
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model. 
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.
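
As a sanity check on the architecture described above, the following sketch builds a network of that shape and confirms its size and output range. The input width of 9 is an assumption for illustration (three scaled numeric features plus six one-hot columns); the real width depends on the columns in `input_model_features.csv`. The name `sketch_model` is hypothetical and chosen so it does not clash with `purchase_model`.

```python
import torch
from torch import nn

input_dim = 9  # assumed feature count, for illustration only

sketch_model = nn.Sequential(
    nn.Linear(input_dim, 8),  # hidden layer with 8 units
    nn.ReLU(),
    nn.Linear(8, 1),          # single output unit
    nn.Sigmoid(),             # squashes the output into the 0-1 range
)

# Weights plus biases of both linear layers
n_params = sum(p.numel() for p in sketch_model.parameters())
expected = input_dim * 8 + 8 + 8 * 1 + 1
print(n_params, expected)

# A batch of 4 random rows yields one probability per row
with torch.no_grad():
    probs = sketch_model(torch.rand(4, input_dim))
print(probs.shape)
```

Because the output passes through a sigmoid, each value can be read as a purchase probability and thresholded (for example at 0.5) to obtain a 0/1 prediction.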


In [21]:
# Task 3A — Create network model (PyTorch)
# Builds a model with one hidden layer of EXACTLY 8 units, ReLU hidden, Sigmoid output
import pandas as pd
import torch
from torch import nn

# Determine input dimension from training features file
train_df = pd.read_csv("input_model_features.csv")
feature_cols = [c for c in train_df.columns if c not in ["customer_id", "purchase"]]
input_dim = len(feature_cols)

# Seed before building the model so the initial weights are reproducible
torch.manual_seed(42)

# Define the model with required activations and sizes
purchase_model = nn.Sequential(
    nn.Linear(input_dim, 8),  # exactly 8 hidden units
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)

# Display the model (optional)
purchase_model

# Task 3B — Train purchase_model on input_model_features.csv and predict on validation_features.csv
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# -----------------------------
# 1) Load prepared feature data
# -----------------------------
train_df = pd.read_csv("input_model_features.csv")
val_df   = pd.read_csv("validation_features.csv")

TARGET = "purchase"
IDCOL  = "customer_id"

# Training feature columns (exclude ID and target)
feature_cols = [c for c in train_df.columns if c not in [IDCOL, TARGET]]

# Align validation columns with training feature space
missing_in_val = [c for c in feature_cols if c not in val_df.columns]
extra_in_val   = [c for c in val_df.columns if c not in feature_cols + [IDCOL]]

# Add missing validation columns (unseen categories, etc.) as zeros
for c in missing_in_val:
    val_df[c] = 0

# Drop any extra columns in validation not present in training features
if extra_in_val:
    val_df = val_df.drop(columns=extra_in_val)

# Reorder validation columns to match training order
val_df = val_df[[IDCOL] + feature_cols]

# NumPy arrays (converted to tensors below)
X_train = train_df[feature_cols].fillna(0).astype(np.float32).values
y_train = train_df[TARGET].astype(np.float32).values.reshape(-1, 1)
X_val   = val_df[feature_cols].fillna(0).astype(np.float32).values
val_ids = val_df[IDCOL].values

# -----------------------------
# 2) Dataset & DataLoader
# -----------------------------
torch.manual_seed(42)
np.random.seed(42)

train_ds = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

input_dim = X_train.shape[1]

# -----------------------------
# 3) Model (reuse if exists; else rebuild identically to 3A)
# -----------------------------
if 'purchase_model' not in globals():
    purchase_model = nn.Sequential(
        nn.Linear(input_dim, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )

# Loss & optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(purchase_model.parameters(), lr=1e-3)

# -----------------------------
# 4) Train
# -----------------------------
purchase_model.train()
EPOCHS = 200
for epoch in range(EPOCHS):
    for xb, yb in train_loader:
        optimizer.zero_grad()
        preds = purchase_model(xb)
        loss = criterion(preds, yb)
        loss.backward()
        optimizer.step()

# -----------------------------
# 5) Predict on validation set
# -----------------------------
purchase_model.eval()
with torch.no_grad():
    val_probs = purchase_model(torch.from_numpy(X_val)).numpy().ravel()
    val_pred_labels = (val_probs >= 0.5).astype(int)

# -----------------------------
# 6) Output DataFrame
# -----------------------------
validation_predictions = pd.DataFrame({
    "customer_id": val_ids,
    "purchase": val_pred_labels
})

# Final output for grading
validation_predictions


| | customer_id | purchase |
|---|---|---|
| 0 | 1801 | 1 |
| 1 | 1802 | 1 |
| 2 | 1803 | 1 |
| 3 | 1804 | 1 |
| 4 | 1805 | 1 |
| ... | ... | ... |
| 195 | 1996 | 1 |
| 196 | 1997 | 1 |
| 197 | 1998 | 1 |
| 198 | 1999 | 1 |
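
Every row visible in the preview above has `purchase` equal to 1. The preview shows only a few rows, so this may not hold for the full frame, but it is worth checking whether the model predicts a single class for everyone. The sketch below shows one way to compare the predicted positive rate with the positive rate in the training labels. The helper `positive_rate` and the toy series are hypothetical stand-ins for `validation_predictions["purchase"]` and `train_df["purchase"]`.

```python
import pandas as pd

def positive_rate(labels: pd.Series) -> float:
    """Share of rows labelled 1."""
    return float((labels == 1).mean())

# Made-up stand-ins for the training labels and the predicted labels
train_labels = pd.Series([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
predicted = pd.Series([1] * 10)

base_rate = positive_rate(train_labels)
predicted_rate = positive_rate(predicted)
single_class = predicted.nunique() == 1

print(f"training positive rate:  {base_rate:.2f}")
print(f"predicted positive rate: {predicted_rate:.2f}")
print(f"predicts a single class: {single_class}")
```

If the predicted rate sits far above the training rate, options worth considering include inspecting the raw probabilities instead of the thresholded labels, or holding out part of the training data to measure accuracy directly.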
