# Online Shoppers Purchasing Intention Dataset - Data Science Analysis

**Dataset Source:** https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset

## Dataset Overview
- **Total Sessions:** 12,330
- **Features:** 17 (10 numerical, 7 categorical); the target, Revenue, is the 8th categorical attribute
- **Target:** Revenue (Purchase vs No Purchase)
- **Class Distribution:** 84.5% No Purchase (10,422), 15.5% Purchase (1,908)
- **Task:** Classification
- **Missing Values:** None
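
The class percentages above follow directly from the two session counts. As a quick arithmetic check (plain Python, using only the counts listed in the overview), the sketch below also derives the imbalance ratio and the accuracy that a trivial "always predict No Purchase" baseline would reach, which is a useful floor when reading accuracy numbers later on:

```python
# Session counts taken from the overview above
no_purchase = 10_422
purchase = 1_908
total = no_purchase + purchase

pct_no_purchase = 100 * no_purchase / total
pct_purchase = 100 * purchase / total
imbalance_ratio = no_purchase / purchase

print(f"Total sessions: {total}")
print(f"No Purchase: {pct_no_purchase:.1f}% | Purchase: {pct_purchase:.1f}%")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")

# A model that always predicts "No Purchase" is right on every negative session
baseline_accuracy = no_purchase / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```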

In [None]:
# Install required packages
!pip install ucimlrepo xgboost -q

In [None]:
# Fetch dataset from UCI Repository
from ucimlrepo import fetch_ucirepo

online_shoppers_purchasing_intention_dataset = fetch_ucirepo(id=468)

# Data (as pandas dataframes)
X = online_shoppers_purchasing_intention_dataset.data.features
y = online_shoppers_purchasing_intention_dataset.data.targets

print("Dataset loaded successfully!")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# 1. Import Libraries

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    roc_auc_score, roc_curve,
    accuracy_score, precision_score, 
    recall_score, f1_score
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries imported successfully!")

# 2. Data Exploration & Understanding

In [None]:
# Combine features and target into a single dataframe
df = X.copy()
# `y` is returned as a one-column DataFrame; squeeze it to a Series before assigning
df['Revenue'] = y.squeeze()

print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nFirst 5 rows:")
df.head()

In [None]:
# Display data information
print("=" * 70)
print("DATA INFORMATION")
print("=" * 70)
df.info()

print("\n" + "=" * 70)
print("MISSING VALUES")
print("=" * 70)
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values found!")

print("\n" + "=" * 70)
print("DUPLICATE ROWS")
print("=" * 70)
print(f"Number of duplicate rows: {df.duplicated().sum()}")

In [None]:
# Descriptive statistics for numerical features
print("=" * 100)
print("DESCRIPTIVE STATISTICS FOR NUMERICAL FEATURES")
print("=" * 100)
df.describe().T

In [None]:
# Target variable distribution
print("=" * 70)
print("TARGET VARIABLE (REVENUE) DISTRIBUTION")
print("=" * 70)
print("\nCount:")
print(df['Revenue'].value_counts())
print("\nPercentage:")
print(df['Revenue'].value_counts(normalize=True) * 100)

# Visualize target distribution
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df['Revenue'].value_counts().plot(kind='bar', ax=ax[0], color=['#FF6B6B', '#4ECDC4'])
ax[0].set_title('Revenue Distribution (Count)', fontsize=14, fontweight='bold')
ax[0].set_xlabel('Revenue', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_xticklabels(['No Purchase', 'Purchase'], rotation=0)
for i, v in enumerate(df['Revenue'].value_counts()):
    ax[0].text(i, v + 200, str(v), ha='center', fontweight='bold')

# Pie chart
df['Revenue'].value_counts().plot(kind='pie', ax=ax[1], autopct='%1.1f%%', 
                                   colors=['#FF6B6B', '#4ECDC4'], startangle=90)
ax[1].set_title('Revenue Distribution (Percentage)', fontsize=14, fontweight='bold')
ax[1].set_ylabel('')

plt.tight_layout()
plt.show()

# Class imbalance ratio
imbalance_ratio = df['Revenue'].value_counts()[False] / df['Revenue'].value_counts()[True]
print(f"\n⚠️  Class Imbalance Ratio: {imbalance_ratio:.2f}:1 (No Purchase : Purchase)")

# 3. Exploratory Data Analysis (EDA)

In [None]:
# Distribution of numerical features
numerical_cols = ['Administrative', 'Administrative_Duration', 'Informational', 
                  'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
                  'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']

fig, axes = plt.subplots(5, 2, figsize=(15, 20))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(df[col], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

plt.suptitle('Distribution of Numerical Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
df_corr = df.copy()
df_corr['Revenue'] = df_corr['Revenue'].astype(int)

# Select numerical columns
numerical_features = df_corr.select_dtypes(include=[np.number]).columns.tolist()
correlation_matrix = df_corr[numerical_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap of Numerical Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Features most correlated with Revenue
print("\n" + "=" * 70)
print("FEATURES MOST CORRELATED WITH REVENUE")
print("=" * 70)
revenue_corr = correlation_matrix['Revenue'].sort_values(ascending=False)
print(revenue_corr)
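
The heatmap above uses Pearson correlation, which can be distorted by skewed features with long tails. As a cross-check, a rank-based Spearman correlation is a common alternative. The sketch below is only an illustration on synthetic data (the `toy` frame and its values are made up, not the real sessions); it shows the two methods side by side using pandas' built-in `corr(method=...)`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the session data (not the real dataset):
# a right-skewed feature that is monotonically related to the target.
rng = np.random.default_rng(42)
page_values = rng.exponential(scale=10.0, size=500)
revenue = (page_values + rng.normal(0, 5, size=500) > 15).astype(int)
toy = pd.DataFrame({"PageValues": page_values, "Revenue": revenue})

# Same call, two methods: linear (Pearson) vs rank-based (Spearman)
pearson = toy.corr(method="pearson").loc["PageValues", "Revenue"]
spearman = toy.corr(method="spearman").loc["PageValues", "Revenue"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the notebook's data the equivalent call would be `df_corr[numerical_features].corr(method='spearman')`.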

In [None]:
# Compare key features between purchasers and non-purchasers
key_features = ['PageValues', 'ProductRelated_Duration', 'ExitRates', 'BounceRates']

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df.boxplot(column=feature, by='Revenue', ax=axes[idx])
    axes[idx].set_title(f'{feature} by Revenue', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Revenue (Purchase)', fontsize=10)
    axes[idx].set_ylabel(feature, fontsize=10)
    
plt.suptitle('Key Features Comparison: Purchasers vs Non-Purchasers', 
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Categorical features vs Revenue
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Month vs Revenue
pd.crosstab(df['Month'], df['Revenue'], normalize='index').plot(
    kind='bar', ax=axes[0, 0], color=['#FF6B6B', '#4ECDC4'])
axes[0, 0].set_title('Purchase Rate by Month', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Month', fontsize=10)
axes[0, 0].set_ylabel('Proportion', fontsize=10)
axes[0, 0].legend(['No Purchase', 'Purchase'])
axes[0, 0].tick_params(axis='x', rotation=45)

# VisitorType vs Revenue
pd.crosstab(df['VisitorType'], df['Revenue'], normalize='index').plot(
    kind='bar', ax=axes[0, 1], color=['#FF6B6B', '#4ECDC4'])
axes[0, 1].set_title('Purchase Rate by Visitor Type', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Visitor Type', fontsize=10)
axes[0, 1].set_ylabel('Proportion', fontsize=10)
axes[0, 1].legend(['No Purchase', 'Purchase'])
axes[0, 1].tick_params(axis='x', rotation=45)

# Weekend vs Revenue
pd.crosstab(df['Weekend'], df['Revenue'], normalize='index').plot(
    kind='bar', ax=axes[1, 0], color=['#FF6B6B', '#4ECDC4'])
axes[1, 0].set_title('Purchase Rate by Weekend', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Weekend', fontsize=10)
axes[1, 0].set_ylabel('Proportion', fontsize=10)
axes[1, 0].legend(['No Purchase', 'Purchase'])
axes[1, 0].set_xticklabels(['Weekday', 'Weekend'], rotation=0)

# TrafficType distribution (top 10)
top_traffic = df['TrafficType'].value_counts().head(10).index
df_top_traffic = df[df['TrafficType'].isin(top_traffic)]
pd.crosstab(df_top_traffic['TrafficType'], df_top_traffic['Revenue'], normalize='index').plot(
    kind='bar', ax=axes[1, 1], color=['#FF6B6B', '#4ECDC4'])
axes[1, 1].set_title('Purchase Rate by Traffic Type (Top 10)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Traffic Type', fontsize=10)
axes[1, 1].set_ylabel('Proportion', fontsize=10)
axes[1, 1].legend(['No Purchase', 'Purchase'])

plt.tight_layout()
plt.show()

# 4. Data Preprocessing

In [None]:
# Create preprocessing copy
df_processed = df.copy()

# Separate features and target
X_prep = df_processed.drop('Revenue', axis=1)
y_prep = df_processed['Revenue']

print("Original Features Shape:", X_prep.shape)
print("Target Shape:", y_prep.shape)

In [None]:
# Encode categorical variables
print("Encoding categorical variables...\n")

# One-hot encoding for Month
month_encoded = pd.get_dummies(X_prep['Month'], prefix='Month', drop_first=True)
print(f"Month encoded: {month_encoded.shape[1]} features")

# One-hot encoding for VisitorType
visitor_encoded = pd.get_dummies(X_prep['VisitorType'], prefix='VisitorType', drop_first=True)
print(f"VisitorType encoded: {visitor_encoded.shape[1]} features")

# Convert Weekend to integer
X_prep['Weekend'] = X_prep['Weekend'].astype(int)

# Drop original categorical columns
X_prep = X_prep.drop(['Month', 'VisitorType'], axis=1)

# Concatenate encoded features
X_prep = pd.concat([X_prep, month_encoded, visitor_encoded], axis=1)

# Convert target to integer
y_prep = y_prep.astype(int)

print(f"\n✓ Preprocessing complete!")
print(f"Final Features Shape: {X_prep.shape}")
print(f"Total Features: {X_prep.shape[1]}")
print(f"\nFeature Names:\n{X_prep.columns.tolist()}")
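
The cell above one-hot encodes `Month` and `VisitorType`, while `OperatingSystems`, `Browser`, `Region` and `TrafficType` stay as integers. Those four are integer codes that the dataset description counts among the categorical attributes, so a model may read an ordering into them that is not there. One possible variant, sketched here on a toy frame with hypothetical values, is to pass such columns to `pd.get_dummies` as well:

```python
import pandas as pd

# Toy rows with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "PageValues": [0.0, 12.5, 3.1, 0.0],
    "Browser": [1, 2, 2, 4],
    "TrafficType": [1, 3, 1, 2],
})

# Treat the integer codes as nominal categories rather than quantities
code_cols = ["Browser", "TrafficType"]
encoded = pd.get_dummies(toy, columns=code_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether this helps is an empirical question: it widens the feature matrix considerably, and tree-based models often cope with integer codes as they are.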

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_prep, y_prep, test_size=0.2, random_state=42, stratify=y_prep
)

print("=" * 70)
print("TRAIN-TEST SPLIT")
print("=" * 70)
print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")
print(f"\nTraining target distribution:\n{y_train.value_counts()}")
print(f"\nTesting target distribution:\n{y_test.value_counts()}")
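
The `stratify=y_prep` argument is what keeps the purchase rate the same in both splits. The effect can be seen in isolation with a label vector that has the same class counts as the dataset; the feature column here is a dummy placeholder, so this is a sketch of the mechanism rather than a re-run of the split above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Label vector with the same class counts as the dataset (features are dummies)
y_demo = np.array([0] * 10_422 + [1] * 1_908)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, all three rates should agree to within rounding
rate_all = y_demo.mean()
rate_train = y_tr.mean()
rate_test = y_te.mean()
print(f"Purchase rate - all: {rate_all:.4f}, train: {rate_train:.4f}, test: {rate_test:.4f}")
```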

In [None]:
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("✓ Feature scaling complete!")
print(f"Scaled training set: {X_train_scaled.shape}")
print(f"Scaled testing set: {X_test_scaled.shape}")
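
Fitting the scaler on the training split only, as done above, avoids leaking test statistics into training. If cross-validation is added later, the same care is needed inside every fold, and wrapping the scaler and model in a scikit-learn pipeline is one way to get that. The following is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the real sessions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 85% / 15%), not the real sessions
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# The scaler is re-fit on the training part of every fold, so no statistics
# from the validation part leak into the scaling step.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f}")
```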

# 5. Model Building & Training

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
    'Support Vector Machine': SVC(kernel='rbf', random_state=42, probability=True)
}

print("=" * 70)
print("MODELS TO TRAIN")
print("=" * 70)
for i, name in enumerate(models.keys(), 1):
    print(f"{i}. {name}")

In [None]:
# Train and evaluate all models
results = []
trained_models = {}

print("\n" + "=" * 80)
print("TRAINING MODELS")
print("=" * 80)

for i, (name, model) in enumerate(models.items(), 1):
    print(f"\n[{i}/{len(models)}] Training {name}...", end=" ")
    
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc
    })
    
    trained_models[name] = model
    
    print(f"✓ Done!")
    print(f"    Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")

print("\n" + "=" * 80)
print("✓ ALL MODELS TRAINED SUCCESSFULLY!")
print("=" * 80)

# 6. Model Evaluation & Comparison

In [None]:
# Create results dataframe
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False).reset_index(drop=True)

print("\n" + "=" * 100)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 100)
print(results_df.to_string(index=False))
print("=" * 100)

# Best model
best_model_name = results_df.iloc[0]['Model']
print(f"\n🏆 BEST MODEL (based on F1-Score): {best_model_name}")
print(f"   F1-Score: {results_df.iloc[0]['F1-Score']:.4f}")
print(f"   Accuracy: {results_df.iloc[0]['Accuracy']:.4f}")
print(f"   Precision: {results_df.iloc[0]['Precision']:.4f}")
print(f"   Recall: {results_df.iloc[0]['Recall']:.4f}")
print(f"   ROC-AUC: {results_df.iloc[0]['ROC-AUC']:.4f}")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    results_sorted = results_df.sort_values(metric, ascending=True)
    
    bars = ax.barh(results_sorted['Model'], results_sorted[metric], 
                   color=colors[idx], alpha=0.7)
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'Model Comparison - {metric}', fontsize=14, fontweight='bold')
    ax.set_xlim(0, 1)
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(results_sorted[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center', fontsize=9, fontweight='bold')

plt.suptitle('Model Performance Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Detailed evaluation of best model
best_model = trained_models[best_model_name]
y_pred_best = best_model.predict(X_test_scaled)
y_pred_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]

# Confusion Matrix and ROC Curve
cm = confusion_matrix(y_test, y_pred_best)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_best)
roc_auc = roc_auc_score(y_test, y_pred_proba_best)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0], 
            xticklabels=['No Purchase', 'Purchase'],
            yticklabels=['No Purchase', 'Purchase'],
            cbar_kws={'label': 'Count'})
axes[0].set_title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_xlabel('Predicted', fontsize=12)

# ROC Curve
axes[1].plot(fpr, tpr, color='darkorange', lw=2, 
             label=f'ROC curve (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', 
             label='Random Classifier (AUC = 0.500)')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title(f'ROC Curve - {best_model_name}', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right", fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Classification Report
print("\n" + "=" * 70)
print(f"CLASSIFICATION REPORT - {best_model_name}")
print("=" * 70)
print(classification_report(y_test, y_pred_best, 
                          target_names=['No Purchase', 'Purchase']))
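
To make the link between the confusion matrix and the reported metrics explicit, the sketch below recomputes accuracy, precision, recall and F1 from the four cells of a 2x2 matrix. The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
# Hypothetical counts for illustration only (not results from this notebook).
# With scikit-learn, `tn, fp, fn, tp = cm.ravel()` unpacks a 2x2 matrix this way.
tn, fp, fn, tp = 2000, 84, 150, 232

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted purchases, how many were real
recall = tp / (tp + fn)     # of the real purchases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

With counts like these, accuracy looks high mainly because of the large true-negative cell, which is why F1 is the ranking metric on an imbalanced target.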

# 7. Feature Importance Analysis

In [None]:
# Feature importance for tree-based models
tree_based_models = ['Random Forest', 'Gradient Boosting', 'XGBoost', 'Decision Tree']

fig, axes = plt.subplots(2, 2, figsize=(18, 14))
axes = axes.ravel()

for idx, model_name in enumerate(tree_based_models):
    if model_name in trained_models:
        model = trained_models[model_name]
        
        if hasattr(model, 'feature_importances_'):
            importances = model.feature_importances_
            
            # Create dataframe
            feature_imp_df = pd.DataFrame({
                'Feature': X_train.columns,
                'Importance': importances
            }).sort_values('Importance', ascending=False).head(15)
            
            # Plot
            bars = axes[idx].barh(range(len(feature_imp_df)), 
                                 feature_imp_df['Importance'], 
                                 color='steelblue', alpha=0.7)
            axes[idx].set_yticks(range(len(feature_imp_df)))
            axes[idx].set_yticklabels(feature_imp_df['Feature'])
            axes[idx].set_xlabel('Importance', fontsize=11, fontweight='bold')
            axes[idx].set_title(f'Top 15 Features - {model_name}', 
                               fontsize=12, fontweight='bold')
            axes[idx].invert_yaxis()
            axes[idx].grid(axis='x', alpha=0.3)
            
            # Add value labels
            for i, v in enumerate(feature_imp_df['Importance']):
                axes[idx].text(v, i, f' {v:.3f}', va='center', fontsize=8)

plt.suptitle('Feature Importance Analysis - Tree-Based Models', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
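
The `feature_importances_` attribute plotted above is impurity-based and computed on the training data, which can favor high-cardinality or continuous features. Permutation importance on held-out data is a frequently used cross-check. The sketch below uses synthetic data in which only the first column carries signal (an assumption made so the expected result is obvious); `permutation_importance` comes from `sklearn.inspection`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 drives the label, columns 1-3 are noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
top_feature = int(np.argmax(result.importances_mean))

print("Mean importance per column:", result.importances_mean.round(3))
print("Most important column index:", top_feature)
```

Applied to this notebook, the call would take a fitted model from `trained_models` together with `X_test_scaled` and `y_test`.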

# 8. Key Insights & Conclusions

## Dataset Summary
- **Total Sessions**: 12,330
- **Features**: 17 (10 numerical, 7 categorical)
- **Target**: Revenue (Purchase vs No Purchase)
- **Class Distribution**: 84.5% No Purchase, 15.5% Purchase (Imbalanced)

## Key Findings from EDA
1. **PageValues** shows the strongest correlation with purchase behavior
2. **ProductRelated_Duration** - Customers who purchase spend more time on product pages
3. **ExitRates** and **BounceRates** are noticeably lower for sessions that end in a purchase
4. **Returning visitors** have notably higher purchase rates compared to new visitors
5. **November and May** show higher purchase rates (likely due to holiday shopping seasons)
6. **Weekend** sessions convert at a slightly different rate than weekday sessions
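
The findings about visitor type and month rest on per-group purchase rates, which can be read off numerically as well as from the bar charts. A minimal sketch of that check, on a toy frame with made-up values (only the column names mirror the dataset):

```python
import pandas as pd

# Toy sessions with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "Returning_Visitor", "New_Visitor",
                    "New_Visitor", "Returning_Visitor", "New_Visitor"],
    "Month": ["Nov", "May", "Nov", "May", "Nov", "May"],
    "Revenue": [True, False, True, False, False, False],
})

# The mean of a boolean column is the share of True values, i.e. the purchase rate
rate_by_visitor = toy.groupby("VisitorType")["Revenue"].mean()
rate_by_month = toy.groupby("Month")["Revenue"].mean().sort_values(ascending=False)

print(rate_by_visitor)
print(rate_by_month)
```

Running the same two `groupby` lines on the notebook's `df` gives the exact rates behind findings 4 and 5, which is worth doing before acting on them.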

## Model Performance Summary
- **Best performing models**: Ensemble methods (Random Forest, XGBoost, Gradient Boosting)
- **Key metrics**: Models evaluated on Accuracy, Precision, Recall, F1-Score, and ROC-AUC
- **Class imbalance** was left uncorrected; stratified splitting only keeps the 84.5% / 15.5% class proportions the same in the training and test sets

## Most Important Features for Prediction
Based on feature importance analysis across tree-based models:
1. **PageValues** - Consistently the most important predictor
2. **ProductRelated_Duration** - Time spent browsing products
3. **ExitRates** - Likelihood of leaving the site
4. **BounceRates** - Single-page visit metrics
5. **Month** - Seasonal shopping patterns

## Business Recommendations

### 1. Optimize Page Value Metrics
- Focus on improving page value scores as they are the strongest predictor
- Analyze high-value pages and replicate successful elements

### 2. Enhance Product Page Engagement
- Improve product page content and user experience
- Add engaging elements to increase time spent on product pages
- Implement better product recommendations

### 3. Reduce Exit and Bounce Rates
- Implement exit-intent popups with special offers
- Improve page load times and mobile responsiveness
- Enhance navigation and internal linking

### 4. Target Returning Visitors
- Develop loyalty programs to encourage repeat visits
- Personalize content for returning customers
- Implement email marketing for customer retention

### 5. Seasonal Optimization
- Increase marketing budget during peak months (November, May)
- Prepare inventory and special promotions for high-conversion periods
- Plan campaigns around shopping holidays

### 6. Traffic Source Analysis
- Analyze which traffic types convert best
- Allocate marketing budget to high-converting traffic sources
- Optimize campaigns for different traffic types

## Technical Considerations
- **Class Imbalance**: Consider using techniques like SMOTE, class weights, or ensemble methods for production
- **Real-time Prediction**: The model can be deployed for real-time purchase intention prediction
- **Model Updates**: Retrain periodically to adapt to changing user behavior patterns
- **Feature Engineering**: Additional features could be created from existing ones (e.g., ratios, interactions)
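
Of the imbalance techniques mentioned above, class weights are the least invasive because they need no resampling. The sketch below shows how "balanced" weights would come out for the class counts in this dataset, using scikit-learn's `compute_class_weight`. The negative-to-positive ratio at the end is the value commonly passed to XGBoost's `scale_pos_weight` parameter; whether either setting improves F1 here is untested in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the class counts from the dataset summary
y_demo = np.array([0] * 10_422 + [1] * 1_908)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
class_weight = {0: float(weights[0]), 1: float(weights[1])}

# Ratio of negatives to positives, commonly used for XGBoost's scale_pos_weight
scale_pos_weight = float((y_demo == 0).sum() / (y_demo == 1).sum())

print(f"Class weights: {class_weight}")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
```

In the model dictionary this would translate to, for example, `LogisticRegression(class_weight='balanced', max_iter=1000)` or `RandomForestClassifier(class_weight='balanced')`.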

## Next Steps
1. Hyperparameter tuning for the best models
2. Handle class imbalance with advanced techniques
3. Feature engineering for additional insights
4. Deploy the model in a production environment
5. Monitor model performance over time
6. A/B testing of business recommendations
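
As a starting point for the first step, hyperparameter tuning can be sketched with `GridSearchCV`, scored on F1 to stay consistent with the model ranking used earlier. The data below is synthetic and the grid is deliberately tiny; both are assumptions for illustration, not a recommended search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 85% / 15%), not the real sessions
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

On the notebook's data the search would be fit on `X_train_scaled` and `y_train`, keeping the test set untouched until the final evaluation.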