# 👩‍💻 E-commerce Customer Purchase Prediction: End-to-End ML Pipeline

## 📋 Overview
In this capstone project, you'll step into the role of a data scientist working for an e-commerce company. Your task is to build a complete machine learning pipeline that predicts whether a customer will make a purchase based on their behavior and demographic information. This represents a real business challenge that can help companies optimize their marketing strategies, personalize customer experiences, and increase conversion rates.

- This project integrates key machine learning skills, including data preprocessing, exploratory data analysis, feature engineering, model training, evaluation, and hyperparameter tuning.
- You'll work with realistic e-commerce customer data that includes demographics, browsing behavior, and purchase history.
- Your final model will provide actionable insights that could directly impact business decisions.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- Build a complete end-to-end machine learning pipeline from raw data to optimized model
- Apply data preprocessing techniques to handle real-world data challenges
- Train and evaluate multiple classification models using appropriate metrics
- Optimize model performance through hyperparameter tuning
- Interpret model results and translate them into business insights


## Task 1: Data Loading and Exploration
As the first step in your project, you need to understand the dataset you're working with. You'll load the e-commerce customer data and perform initial exploration to understand the structure, identify patterns, and detect any data quality issues.

**Steps:**
1. Load the dataset `customer_purchase_data.csv` into a pandas DataFrame named `df`.


2. Display:

    - The dataset shape and the first few rows.
    - Information about data types and missing values using `df.info()`.
    - A statistical summary using `df.describe()`.
    - Missing values for each column.      
    
    
3. Explore the target variable `PurchaseStatus`:

    - Display value counts (both absolute and normalized).
    - Plot the distribution using a countplot.    
    

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('customer_purchase_data.csv')


In [2]:
# Explore the dataset

# Display basic information
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Check data types and missing values
print("\nData types and non-null counts:")
df.info()  # info() prints its report directly and returns None

# Statistical summary
print("\nStatistical summary:")
print(df.describe())



Dataset shape: (1500, 9)

First 5 rows:
   Age  Gender   AnnualIncome  NumberOfPurchases  ProductCategory  \
0   40       1   66120.267939                  8                0   
1   20       1   23579.773583                  4                2   
2   27       1  127821.306432                 11                2   
3   24       1  137798.623120                 19                3   
4   31       1   99300.964220                 19                1   

   TimeSpentOnWebsite  LoyaltyProgram  DiscountsAvailed  PurchaseStatus  
0           30.568601               0                 5               1  
1           38.240097               0                 5               0  
2           31.633212               1                 0               1  
3           46.167059               0                 4               1  
4           19.823592               0                 0               1  

Data types and non-null counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to

## Grading of Lab Assignments:
This assignment is graded by the test cases that appear throughout this notebook between the `# BEGIN TESTS` and `# END TESTS` comments.

Each task has a number of test cells. For example, the three cells below confirm that the data has been loaded into a DataFrame named `df` as expected.

Run all of these test cells as you work through the project to confirm that you pass the tests and are on the right track. Once you have passed all the tests in the notebook, or are happy with your results, click the `Submit Assignment` button in the top right corner for final submission and grading.

Good luck!

In [3]:
# BEGIN TESTS
assert isinstance(df, pd.DataFrame), "there should be a pandas DataFrame named df"
assert df.shape[0] > 0 and df.shape[1] > 0, "df should not be empty"
print("✅ TEST PASSED!")
# END TESTS

# Note: There may be hidden tests that learners cannot see, so as not to give away the full solution.


✅ TEST PASSED!


In [4]:
# BEGIN TEST
assert df.shape[0] == 1500, "df should contain 1500 rows"
assert df.shape[1] == 9, "df should contain 9 columns"
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


In [5]:
# BEGIN TEST
assert 'PurchaseStatus' in df.columns, "df should contain the target column PurchaseStatus"
assert df['PurchaseStatus'].nunique() == 2, "The PurchaseStatus column should only have 2 unique values (0 or 1)"
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


## ✅ Success Checklist
- You've loaded the dataset successfully
- You've identified any data quality issues like missing values or outliers
- You've analyzed the distribution of features and the target variable
- You've created visualizations that provide insights into the data

## 💡 Key Points
- Exploratory data analysis is a critical first step in any machine learning project
- Understanding the data distribution helps inform preprocessing decisions
- Checking the target variable for class imbalance matters because imbalance affects model selection and evaluation strategies
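
To make the class-imbalance point concrete, the sketch below computes the share of each class and the accuracy a model would reach by always predicting the most common class. The `target` values are made up for illustration and are not taken from `customer_purchase_data.csv`; on the lab data you would use `df['PurchaseStatus']` instead.

```python
import pandas as pd

# Hypothetical target column: 6 non-purchases and 4 purchases (illustrative only)
target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="PurchaseStatus")

# Share of each class in the target
class_share = target.value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any trained model should beat this baseline; if it does not, accuracy alone is hiding a problem and metrics such as recall or ROC-AUC deserve a closer look.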

## Task 2: Data Preprocessing and Feature Engineering
Now that you understand the dataset, it's time to prepare it for machine learning. You'll handle any data quality issues, encode categorical variables, scale numerical features, and create any additional features that might improve your model.

**Steps:**
1. Handle Missing Values:
 - Check for missing values in all columns.
 - For `numerical_features` (Age, AnnualIncome, NumberOfPurchases, TimeSpentOnWebsite, DiscountsAvailed):
    - Fill missing values using the median of each column.
 - For `categorical_features` (Gender, ProductCategory, LoyaltyProgram):
    - Fill missing values using the mode (most frequent value) of each column.
    
    
2. Create a New Feature:
 - Add a feature called `AvgSpendingPerPurchase` calculated as:
    - `AvgSpendingPerPurchase = AnnualIncome / NumberOfPurchases`
 - Where `NumberOfPurchases` is 0, use 1 as the denominator to avoid division by zero.
 
 
3. Update the Feature Lists:
 - After creating the new feature, append `AvgSpendingPerPurchase` to the list of numerical features.
 

4. Build a Preprocessing Pipeline named `preprocessor`:
 - Use `ColumnTransformer` to preprocess your data:
    - Scale numerical features using StandardScaler.
    - One-hot encode categorical features using OneHotEncoder(drop='first').
    

5. Split the Dataset:
 - Separate the target variable `PurchaseStatus` from the feature matrix, giving features `X` and target `y`.
 - Use `train_test_split` with `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)`
        

6. Fit and Transform:
 - Fit the preprocessor on the training data only.
 - Transform both training and test sets using the fitted preprocessor.
 - Ensure the processed outputs contain no missing values and are in numerical form.
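
As an optional variation on the steps above, imputation can live inside the preprocessing pipeline so that the median and mode are learned from the training data only. The sketch below is one possible arrangement, not the lab's required solution: the toy frames, the `toy_*` names, and the use of `SimpleImputer` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames (not the lab data) with missing values
toy_train = pd.DataFrame({
    "Age": [25.0, 40.0, np.nan, 31.0],
    "ProductCategory": [0.0, 2.0, 1.0, np.nan],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 52.0],
    "ProductCategory": [1.0, 2.0],
})

# Numerical columns: fill with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(drop="first")),
])

toy_preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["Age"]),
        ("cat", categorical_steps, ["ProductCategory"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Statistics are learned from the training frame only, then reused on the test frame
toy_train_processed = toy_preprocessor.fit_transform(toy_train)
toy_test_processed = toy_preprocessor.transform(toy_test)

learned_median = toy_preprocessor.named_transformers_["num"].named_steps["impute"].statistics_[0]
print("Median learned from training data:", learned_median)
print("Processed shapes:", toy_train_processed.shape, toy_test_processed.shape)
```

The missing `Age` in `toy_test` is filled with the training median rather than a value computed from the test set, which keeps test information out of the fitted preprocessor.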
 

In [6]:
# Import additional libraries
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Step 1: Check if missing values exist and handle them

# Define feature types
categorical_features = ['Gender', 'ProductCategory', 'LoyaltyProgram']
numerical_features = ['Age', 'AnnualIncome', 'NumberOfPurchases',
                      'TimeSpentOnWebsite', 'DiscountsAvailed']

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
Age                   0
Gender                0
AnnualIncome          0
NumberOfPurchases     0
ProductCategory       0
TimeSpentOnWebsite    0
LoyaltyProgram        0
DiscountsAvailed      0
PurchaseStatus        0
dtype: int64


In [7]:
# Step 2: Create a new feature AvgSpendingPerPurchase
if 'AnnualIncome' in df.columns and 'NumberOfPurchases' in df.columns:
    # Avoid division by zero
    df['AvgSpendingPerPurchase'] = df['AnnualIncome'] / df['NumberOfPurchases'].replace(0, 1)
    numerical_features.append('AvgSpendingPerPurchase')


In [8]:
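# Step 3: Append AvgSpendingPerPurchase to the list of numerical features
# The Step 2 cell above already appends it, so guard against adding a duplicate
if 'AvgSpendingPerPurchase' not in numerical_features:
    numerical_features.append('AvgSpendingPerPurchase')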


In [9]:
# Step 4: Create preprocessor with pipelines for categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])


In [10]:
# Step 5: Split the Dataset as outlined in the instructional steps above
# Split the data into features and target
X = df.drop('PurchaseStatus', axis=1)
y = df['PurchaseStatus']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [11]:
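# Step 6: Fit on the training data only, then transform both sets
preprocessor.fit(X_train)
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)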


In [12]:
# BEGIN TEST
assert len(categorical_features) == 3, "there should be 3 categorical features"
assert len(numerical_features) == 6, "there should be 6 numerical features"
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


In [13]:
# BEGIN TEST: Check that there are no missing values in df
assert df.isnull().sum().sum() == 0
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


In [14]:
# BEGIN TEST: Check that AvgSpendingPerPurchase was created and added
assert 'AvgSpendingPerPurchase' in df.columns
assert 'AvgSpendingPerPurchase' in numerical_features
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


In [15]:
# BEGIN TEST: Check train-test split shapes
assert X_train.shape[0] > 0 and X_test.shape[0] > 0
assert y_train.shape[0] > 0 and y_test.shape[0] > 0
print("✅ TEST PASSED!")
# END TEST

✅ TEST PASSED!


## ✅ Success Checklist
- You've successfully handled any missing values in the dataset
- Categorical features are properly encoded
- Numerical features are appropriately scaled
- You've created meaningful engineered features
- Data is properly split into training and testing sets

## 💡 Key Points
- Feature engineering can significantly improve model performance
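- Consistent preprocessing (imputing gaps, encoding categories, scaling) improves model stability; note that `StandardScaler` rescales features but does not remove outliers
- Standardizing numerical features puts them on a comparable scale, so features with large ranges (such as `AnnualIncome`) do not dominate scale-sensitive models like logistic regression
- Stratifying the train-test split preserves the class proportions of the target in both the training and test sets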
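
The effect of `stratify` can be checked directly. The sketch below uses made-up labels (70 zeros and 30 ones, not the lab data) and the same `test_size` and `random_state` as the lab; the `labels`, `features`, `y_tr`, and `y_te` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 zeros and 30 ones (illustrative only)
labels = np.array([0] * 70 + [1] * 30)
features = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Positive share overall:", labels.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

Without `stratify`, the positive share in a small test set can drift away from the overall share purely by chance, which makes evaluation metrics noisier.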

## Task 3: Model Training and Baseline Evaluation
In this task, you'll train multiple classification models to predict customer purchase behavior. You'll establish baseline performance metrics to later compare with your optimized models.


**Steps:**
1. Define and train the following models using the processed training data:
    - Logistic Regression
    - Decision Tree Classifier
    - Random Forest Classifier
    

2. Use each model to predict on the test set.


3. Calculate the following metrics for each model:
    - Accuracy
    - Precision
    - Recall
    - F1 Score
    - ROC-AUC Score
    

4. Store each model’s results in a dictionary named `results`, keyed by model name, with metric names as the inner keys. For example:
    ```python
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC-AUC': roc_auc,
        'Model': model,
        'Predictions': y_pred,
        'Probabilities': y_pred_proba
    }
    ```

In [None]:
# Import modeling libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Create a dictionary to store models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"\nTraining {name}...")    
      
    # Step 1: Train the model
    # your code here
    
    
    # Step 2: Make predictions
    # your code here
    
        
    # Step 3: Calculate metrics
    # your code here
    
    
    # Step 4: Store results
    # complete the code
    results[name] = {
        
    }


In [None]:
# BEGIN TEST 1: Ensure all models are evaluated
assert len(results) == 3
assert all(name in results for name in ['Logistic Regression', 'Decision Tree', 'Random Forest'])
print("✅ TEST PASSED!")
# END TEST

In [None]:
# BEGIN TEST 2: Check structure of each result
required_keys = {'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC'}
for res in results.values():
    assert required_keys.issubset(res.keys()), "The results dictionary is required to contain the keys 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC'"
print("✅ TEST PASSED!")
# END TEST

In [None]:
# BEGIN TEST 3: Ensure all required metric values are within range
required_keys = {'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC'}
for res in results.values():
    for key in required_keys:
        assert 0.6 <= res[key] <= 1, f"{key} value {res[key]} is out of range"
print("✅ TEST PASSED!")
# END TEST

## ✅ Success Checklist
- You've successfully trained multiple classification models
- You've evaluated each model using appropriate metrics
- You've visualized model performance using confusion matrices and ROC curves
- You've identified the strengths and weaknesses of each model

## 💡 Key Points
- Different metrics reveal different aspects of model performance
- The confusion matrix helps understand the types of errors a model makes
- ROC curves help visualize the trade-off between true positive rate and false positive rate
- The best model depends on the specific business objectives and costs associated with different types of errors
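
To see how the metrics relate to the confusion matrix, the sketch below derives them by hand from the four cell counts and compares them with the scikit-learn functions. The `y_true` and `y_hat` lists are made up for illustration and are unrelated to the lab's models.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels and predictions (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Metrics computed by hand from the confusion matrix cells
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)   # of predicted purchases, how many were real
manual_recall = tp / (tp + fn)      # of real purchases, how many were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("Accuracy :", manual_accuracy, accuracy_score(y_true, y_hat))
print("Precision:", manual_precision, precision_score(y_true, y_hat))
print("Recall   :", manual_recall, recall_score(y_true, y_hat))
print("F1 Score :", manual_f1, f1_score(y_true, y_hat))
```

A false positive here would mean marketing to a customer who does not buy, while a false negative means missing a likely buyer; which error is costlier decides whether precision or recall should carry more weight.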


## Task 4: Feature Importance and Model Interpretation (Optional)
Understanding which features drive your model's predictions is crucial for business insights. In this task, you'll analyze feature importance and interpret what your model has learned about customer purchase behavior. Run the code below to see a plot of Random Forest Feature Importances. Explore further as an optional task.


In [None]:
from sklearn.inspection import permutation_importance, partial_dependence

# In this case we will focus on the RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_processed, y_train)

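# Recover the one-hot feature names from the fitted preprocessor
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = numerical_features + list(cat_feature_names)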

importances = rf_model.feature_importances_
importances_df = pd.Series(importances, index=feature_names).sort_values(ascending=True)

# Plot
importances_df.plot(kind='barh', figsize=(10, 8))
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

## 🔍 Reflect
1. Analyze the feature importance plot:
    - Which features have the strongest influence on purchase predictions?
    - Do different models identify similar important features?
    - Are there any surprises in the feature importance results?


2. Consider 3-5 business insights based on the feature importance analysis:
    - What customer behaviors or attributes are most predictive of purchases?
    - How might the business use these insights for targeted marketing?

## ✅ Success Checklist

- You've identified the most important features for predicting purchases
- You've translated model insights into actionable business recommendations

## 💡 Key Points

- Feature importance helps understand what drives model predictions
- Different models may identify different important features
- Permutation importance is model-agnostic and often more reliable than built-in feature importance
- Partial dependence plots reveal how features affect predictions, which is valuable for business interpretation
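
Permutation importance is mentioned above but not demonstrated, so here is a minimal sketch of how it could be computed with `sklearn.inspection.permutation_importance`. It runs on synthetic data from `make_classification`, not the lab dataset, and the `X_demo`, `forest`, and `perm` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, of which 2 are informative (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_fit, y_fit)

# Shuffle each feature on held-out data and measure the drop in score
perm = permutation_importance(forest, X_val, y_val, n_repeats=5, random_state=42)

# Report features from most to least important
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

In the lab, the same call could be applied to `rf_model` with `X_test_processed` and `y_test`, assuming those variables exist in your notebook. Measuring importance on held-out data is what distinguishes this approach from the impurity-based `feature_importances_`, which are computed from the training data.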


## 💻 Exemplar Solution
After completing this activity (or if you get stuck!), take a moment to review the exemplar solution. This sample solution can offer insights into different techniques and approaches.
Reflect on what you can learn from the exemplar solution to improve your coding skills.
Remember, multiple solutions can exist for some problems; the goal is to learn and grow as a programmer by exploring various approaches.
Use the exemplar solution as a learning tool to enhance your understanding and refine your approach to coding challenges.

<details>

<summary><strong>Click HERE to see an exemplar solution</strong></summary>    
    
```python    
###################
# TASK 1
###################    
    
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('customer_purchase_data.csv')

# Display basic information
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Check data types and missing values
print("\nData types and non-null counts:")
df.info()  # info() prints its report directly and returns None

# Statistical summary
print("\nStatistical summary:")
print(df.describe())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Check target variable distribution
print("\nTarget variable distribution:")
print(df['PurchaseStatus'].value_counts())
print(df['PurchaseStatus'].value_counts(normalize=True).round(3))

# Visualize target distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='PurchaseStatus', data=df)
plt.title('Purchase Status Distribution')
plt.xlabel('Purchase Status (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()
    
###################
# TASK 2
###################  

# Import additional libraries
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Define feature types
categorical_features = ['Gender', 'ProductCategory', 'LoyaltyProgram']
numerical_features = ['Age', 'AnnualIncome', 'NumberOfPurchases',
                      'TimeSpentOnWebsite', 'DiscountsAvailed']


# Check if missing values exist and handle them
if df.isnull().sum().sum() > 0:
    # For numerical features, fill with median
    for col in numerical_features:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].median())

    # For categorical features, fill with mode
    for col in categorical_features:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].mode()[0])

# Feature engineering
# Create a new feature: Average spending per purchase
if 'AnnualIncome' in df.columns and 'NumberOfPurchases' in df.columns:
    # Avoid division by zero
    df['AvgSpendingPerPurchase'] = df['AnnualIncome'] / df['NumberOfPurchases'].replace(0, 1)
    numerical_features.append('AvgSpendingPerPurchase')

# Create preprocessor with pipelines for categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

# Split the data into features and target
X = df.drop('PurchaseStatus', axis=1)
y = df['PurchaseStatus']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the preprocessor on the training data
preprocessor.fit(X_train)

# Apply preprocessing to create processed training and testing data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Check the shape of the processed data
print("Processed training data shape:", X_train_processed.shape)
print("Processed testing data shape:", X_test_processed.shape)

# If using OneHotEncoder, get the feature names
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = numerical_features + list(cat_feature_names)
print("\nFeature names after preprocessing:")
print(feature_names)

# Visualize correlation matrix of numerical features
plt.figure(figsize=(10, 8))
corr_matrix = df[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

###################
# TASK 3
###################  
    
# Import modeling libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# Create a dictionary to store models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"\nTraining {name}...")

    # Train the model
    model.fit(X_train_processed, y_train)

    # Make predictions
    y_pred = model.predict(X_test_processed)
    y_pred_proba = model.predict_proba(X_test_processed)[:, 1]

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    # Store results
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC-AUC': roc_auc,
        'Model': model,
        'Predictions': y_pred,
        'Probabilities': y_pred_proba
    }

    # Print results
    print(f"{name} Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    # Create confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['No Purchase', 'Purchase'],
                yticklabels=['No Purchase', 'Purchase'])
    plt.title(f'Confusion Matrix - {name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.tight_layout()
    plt.show()

# Plot ROC curves for all models
plt.figure(figsize=(10, 8))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_test, result['Probabilities'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {result['ROC-AUC']:.3f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for All Models')
plt.legend()
plt.grid(True)
plt.show()

# Create a dataframe to compare model performances
comparison_df = pd.DataFrame({
    model_name: {
        metric: results[model_name][metric]
        for metric in ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
    }
    for model_name in results.keys()
})

print("\nModel Comparison:")
print(comparison_df)

# Visualize model comparison
# DataFrame.plot creates its own figure, so pass figsize directly
comparison_df.plot(kind='bar', figsize=(12, 8))
plt.title('Model Performance Comparison')
plt.xlabel('Metric')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(title='Model')
plt.grid(axis='y')
plt.tight_layout()
plt.show()

    
```