Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior. We will help build this prediction system, and our work will directly affect RetailTech's growth strategy and its goal of increasing revenue.


## Data Description

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table in the Data Description section above.

In [1]:
import pandas as pd

clean_data = pd.read_csv('raw_customer_data.csv')


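# time_spent: replace missing values with the column median (rounded to 2 decimals)
clean_data['time_spent'] = clean_data['time_spent'].fillna(round(clean_data['time_spent'].median(),2))

# pages_viewed: replace missing values with the column mean, then cast to integer
# (astype(int) truncates any fractional part of the filled mean)
clean_data['pages_viewed'] = clean_data['pages_viewed'].fillna(round(clean_data['pages_viewed'].mean(),2))
clean_data['pages_viewed'] = clean_data['pages_viewed'].astype(int)

# basket_value: replace missing values with 0
clean_data['basket_value'] = clean_data['basket_value'].fillna(0)

# device_type: replace missing values with "Unknown"
clean_data['device_type'] = clean_data['device_type'].fillna('Unknown')

# customer_type: replace missing values with "New"
clean_data['customer_type'] = clean_data['customer_type'].fillna('New')

# purchase: cast the 0/1 target to boolean
clean_data['purchase'] = clean_data['purchase'].astype(bool)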

#clean_data.isna().sum()
clean_data.dtypes

#clean_data['purchase'].value_counts()

customer_id        int64
time_spent       float64
pages_viewed       int32
basket_value     float64
device_type       object
customer_type     object
purchase            bool
dtype: object
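
As a quick sanity check, the criteria from the data description can be turned into a handful of automated checks. The following is a minimal sketch on a small made-up frame; the helper name `check_clean` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_clean(df: pd.DataFrame) -> list:
    """Return a list of rule violations; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df["customer_id"].is_unique:
        problems.append("customer_id is not unique")
    if not df["device_type"].isin(["Mobile", "Desktop", "Tablet", "Unknown"]).all():
        problems.append("unexpected device_type value")
    if not df["customer_type"].isin(["New", "Returning"]).all():
        problems.append("unexpected customer_type value")
    # Accepts 0/1 as well as False/True, since True == 1 and False == 0 in Python
    if not set(df["purchase"].dropna().unique()) <= {0, 1}:
        problems.append("purchase is not binary")
    return problems


# Hypothetical toy data that already satisfies the criteria
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [12.5, 3.0, 7.25],
    "pages_viewed": [4, 1, 9],
    "basket_value": [0.0, 59.9, 20.0],
    "device_type": ["Mobile", "Unknown", "Tablet"],
    "customer_type": ["New", "Returning", "New"],
    "purchase": [0, 1, 1],
})

print(check_clean(toy))
```

Calling `check_clean(clean_data)` after the cleaning cell would apply the same checks to the real data, assuming the column names match the table.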

The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale numerical features (`time_spent`, `pages_viewed`, `basket_value`) to 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.
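
Scaling to a 0-1 range is usually read as min-max scaling, which scikit-learn provides as `MinMaxScaler`. The sketch below illustrates the two steps on a small made-up frame; the toy values and the names `numeric_cols`, `categorical_cols`, and `features` are assumptions for illustration, not taken from `model_data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with the same column names as the task description
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [10.0, 20.0, 30.0],
    "pages_viewed": [2, 4, 6],
    "basket_value": [0.0, 50.0, 100.0],
    "device_type": ["Mobile", "Desktop", "Mobile"],
    "customer_type": ["New", "Returning", "New"],
    "purchase": [0, 1, 1],
})

numeric_cols = ["time_spent", "pages_viewed", "basket_value"]
categorical_cols = ["device_type", "customer_type"]

features = toy.copy()

# Min-max scaling maps each column's minimum to 0 and its maximum to 1
features[numeric_cols] = MinMaxScaler().fit_transform(features[numeric_cols])

# get_dummies names the new columns <variable_name>_<category_name>
features = pd.get_dummies(features, columns=categorical_cols)

print(features.columns.tolist())
print(features[numeric_cols])
```

By contrast, `StandardScaler` centers each column at zero with unit variance, so its output can contain negative values.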


In [2]:
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("model_data.csv")

# Standardize the numerical features. Note that StandardScaler rescales each
# column to zero mean and unit variance; it does not map values to the 0-1 range.
numeric_cols = ['time_spent', 'pages_viewed', 'basket_value']
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# One-hot encode the categorical features; get_dummies names the new columns
# <variable_name>_<category_name>, e.g. device_type_Desktop
model_feature_set = pd.get_dummies(df, columns=['device_type', 'customer_type'])


model_feature_set.head()


| | customer_id | time_spent | pages_viewed | basket_value | purchase | device_type_Desktop | device_type_Mobile | device_type_Tablet | device_type_Unknown | customer_type_New | customer_type_Returning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 501 | 0.682273 | -0.038919 | -1.339318 | 1 | True | False | False | False | True | False |
| 1 | 502 | 0.040076 | -1.011895 | 0.938511 | 1 | False | True | False | False | False | True |
| 2 | 503 | -0.857722 | -1.401086 | 0.644815 | 0 | False | True | False | False | False | True |
| 3 | 504 | 1.14048 | -0.8173 | -1.339318 | 1 | False | False | False | True | True | False |
| 4 | 505 | 0.629054 | -1.206491 | 0.761926 | 1 | False | False | True | False | True | False |


Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases. 
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model. 
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.
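
The architecture requirements above can be sketched compactly with `nn.Sequential`. This is only an illustration under stated assumptions: the input size of 9 is a placeholder for the number of feature columns, the name `sketch_model` is hypothetical, and the random batch stands in for real feature rows.

```python
import torch
import torch.nn as nn

n_features = 9  # assumption: number of feature columns after dropping the ID and target

sketch_model = nn.Sequential(
    nn.Linear(n_features, 8),  # hidden layer with 8 units
    nn.ReLU(),                 # ReLU activation for the hidden layer
    nn.Linear(8, 1),           # single output unit
    nn.Sigmoid(),              # squashes the output into a probability
)

# Random stand-in batch of 4 rows, used only to check shapes and value ranges
batch = torch.randn(4, n_features)
with torch.no_grad():
    probs = sketch_model(batch)

# Thresholding at 0.5 turns probabilities into 0/1 purchase predictions
labels = (probs >= 0.5).int()

print(probs.shape)
print(labels.flatten().tolist())
```

Because the last layer is a sigmoid, the outputs can be passed directly to `nn.BCELoss` during training.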


In [58]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
train_data = pd.read_csv('input_model_features.csv')
validation_data = pd.read_csv('validation_features.csv')

# Verify column names
print("Training columns:", train_data.columns.tolist())
print("Validation columns:", validation_data.columns.tolist())

# Separate features, target, and IDs
# Training data
customer_ids_train = train_data['customer_id']
X_train = train_data.drop(columns=['purchase', 'customer_id'])  # Keep only features
y_train = train_data['purchase']

# Validation data (unseen)
customer_ids_val = validation_data['customer_id']
X_val = validation_data.drop(columns=['customer_id'])  # Keep only features

# Check feature dimensions
print(f"\nTraining features shape: {X_train.shape}")
print(f"Validation features shape: {X_val.shape}")

# Hold out 20% of the training data as a test set for monitoring accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

# Convert to tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)
X_val_tensor = torch.tensor(X_val.values, dtype=torch.float32)

# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
val_loader = DataLoader(X_val_tensor, batch_size=32, shuffle=False)

# Model definition
class PurchaseModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, 8)  # Hidden layer with 8 units
        self.layer2 = nn.Linear(8, 1)           # Output layer
        self.relu = nn.ReLU()                    # ReLU activation
        self.sigmoid = nn.Sigmoid()              # Sigmoid output
        
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.sigmoid(self.layer2(x))
        return x

# Initialize model
input_size = X_train.shape[1]
print(f"\nNumber of features: {input_size}")
purchase_model = PurchaseModel(input_size)

# Training setup
criterion = nn.BCELoss()
optimizer = optim.Adam(purchase_model.parameters(), lr=0.001)

# Training loop
print("\nStarting training...")
for epoch in range(100):
    # Training phase
    purchase_model.train()
    train_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = purchase_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    # Evaluation phase on the held-out test split
    purchase_model.eval()
    test_preds, test_labels = [], []
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = purchase_model(inputs)
            test_preds.extend((outputs >= 0.5).int().flatten().tolist())
            test_labels.extend(labels.int().flatten().tolist())
    
    # Calculate metrics
    train_loss /= len(train_loader)
    test_acc = accuracy_score(test_labels, test_preds)
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch {epoch+1:3d} | Train Loss: {train_loss:.4f} | Test Acc: {test_acc:.4f}')

# Predict on completely unseen validation data
print("\nGenerating predictions for validation data...")
purchase_model.eval()
val_preds = []
with torch.no_grad():
    for inputs in val_loader:
        outputs = purchase_model(inputs)
        val_preds.extend((outputs >= 0.5).int().flatten().tolist())

# Create final output DataFrame
validation_predictions = pd.DataFrame({
    'customer_id': customer_ids_val,
    'purchase': val_preds
})

# Verify output
print("\nFirst 5 predictions:")
print(validation_predictions.head())
print(f"\nPrediction distribution:\n{validation_predictions['purchase'].value_counts()}")


Training columns: ['customer_id', 'time_spent', 'pages_viewed', 'basket_value', 'purchase', 'device_type_Desktop', 'device_type_Mobile', 'device_type_Tablet', 'device_type_Unknown', 'customer_type_New', 'customer_type_Returning']
Validation columns: ['customer_id', 'time_spent', 'pages_viewed', 'basket_value', 'device_type_Desktop', 'device_type_Mobile', 'device_type_Tablet', 'device_type_Unknown', 'customer_type_New', 'customer_type_Returning']

Training features shape: (800, 9)
Validation features shape: (200, 9)

Number of features: 9

Starting training...
Epoch  10 | Train Loss: 0.4920 | Test Acc: 0.7688
Epoch  20 | Train Loss: 0.4841 | Test Acc: 0.7688
Epoch  30 | Train Loss: 0.4813 | Test Acc: 0.7688
Epoch  40 | Train Loss: 0.4790 | Test Acc: 0.7688
Epoch  50 | Train Loss: 0.4767 | Test Acc: 0.7688
Epoch  60 | Train Loss: 0.4748 | Test Acc: 0.7688
Epoch  70 | Train Loss: 0.4731 | Test Acc: 0.7688
Epoch  80 | Train Loss: 0.4718 | Test Acc: 0.7688
Epoch  90 | Train Loss: 0.4708 | T