# Dataset

Dataset source: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset

Online Shoppers Purchasing Intention Dataset
Of the 12,330 sessions in the dataset, 84.5% (10,422) were negative class samples that did not end with shopping, and the rest (1,908) were positive class samples ending with shopping.

- **Dataset Characteristics**: Multivariate
- **Subject Area**: Business
- **Associated Tasks**: Classification, Clustering
- **Feature Type**: Integer, Real
- **Instances**: 12,330
- **Features**: 17

Dataset Information

The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.

- **Has Missing Values?**: No

In [None]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
online_shoppers_purchasing_intention_dataset = fetch_ucirepo(id=468)

# data (as pandas dataframes)
X = online_shoppers_purchasing_intention_dataset.data.features
y = online_shoppers_purchasing_intention_dataset.data.targets

# metadata
print(online_shoppers_purchasing_intention_dataset.metadata)

# variable information
print(online_shoppers_purchasing_intention_dataset.variables)


{'uci_id': 468, 'name': 'Online Shoppers Purchasing Intention Dataset', 'repository_url': 'https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/468/data.csv', 'abstract': 'Of the 12,330 sessions in the dataset,\n84.5% (10,422) were negative class samples that did not\nend with shopping, and the rest (1908) were positive class\nsamples ending with shopping.', 'area': 'Business', 'tasks': ['Classification', 'Clustering'], 'characteristics': ['Multivariate'], 'num_instances': 12330, 'num_features': 17, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Revenue'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Thu Jan 11 2024', 'dataset_doi': '10.24432/C5F88Q', 'creators': ['C. Sakar', 'Yomi Kastro'], 'intro_paper': {'ID': 367, 'type': 'NATIVE', 'title': 'Real-time prediction of online shoppers’ p

In [None]:
# Feature importance for tree-based models
tree_based_models = ['Random Forest', 'Gradient Boosting', 'XGBoost', 'Decision Tree']

fig, axes = plt.subplots(2, 2, figsize=(18, 14))
axes = axes.ravel()

for idx, model_name in enumerate(tree_based_models):
    if model_name in trained_models:
        model = trained_models[model_name]
        
        # Get feature importances
        if hasattr(model, 'feature_importances_'):
            importances = model.feature_importances_
            
            # Create dataframe for better visualization
            feature_imp_df = pd.DataFrame({
                'Feature': X_train.columns,
                'Importance': importances
            }).sort_values('Importance', ascending=False).head(15)
            
            # Plot
            axes[idx].barh(range(len(feature_imp_df)), feature_imp_df['Importance'], 
                          color='steelblue', alpha=0.7)
            axes[idx].set_yticks(range(len(feature_imp_df)))
            axes[idx].set_yticklabels(feature_imp_df['Feature'])
            axes[idx].set_xlabel('Importance', fontsize=11)
            axes[idx].set_title(f'Top 15 Feature Importances - {model_name}', 
                               fontsize=12, fontweight='bold')
            axes[idx].invert_yaxis()
            axes[idx].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

# 6. Feature Importance Analysis

In [None]:
# Detailed evaluation of the best model
best_model = trained_models[best_model_name]
y_pred_best = best_model.predict(X_test_scaled)
y_pred_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0], 
            xticklabels=['No Purchase', 'Purchase'],
            yticklabels=['No Purchase', 'Purchase'])
axes[0].set_title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_xlabel('Predicted', fontsize=12)

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_best)
roc_auc = roc_auc_score(y_test, y_pred_proba_best)

axes[1].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title(f'ROC Curve - {best_model_name}', fontsize=14, fontweight='bold')
axes[1].legend(loc="lower right")
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Classification Report
print(f"\nClassification Report - {best_model_name}:")
print("=" * 60)
print(classification_report(y_test, y_pred_best, target_names=['No Purchase', 'Purchase']))

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    results_sorted = results_df.sort_values(metric, ascending=True)
    
    ax.barh(results_sorted['Model'], results_sorted[metric], color=colors[idx], alpha=0.7)
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'Model Comparison - {metric}', fontsize=14, fontweight='bold')
    ax.set_xlim(0, 1)
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(results_sorted[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Create results dataframe
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('F1-Score', ascending=False).reset_index(drop=True)

print("Model Performance Comparison:")
print("=" * 100)
print(results_df.to_string(index=False))
print("=" * 100)

# Identify best model
best_model_name = results_df.iloc[0]['Model']
print(f"\n🏆 Best Model (based on F1-Score): {best_model_name}")
print(f"   F1-Score: {results_df.iloc[0]['F1-Score']:.4f}")
print(f"   Accuracy: {results_df.iloc[0]['Accuracy']:.4f}")
print(f"   ROC-AUC: {results_df.iloc[0]['ROC-AUC']:.4f}")

# 5. Model Evaluation & Comparison

In [None]:
# Train and evaluate all models
results = []
trained_models = {}

print("Training models...\n")
print("=" * 80)

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc
    })
    
    # Store trained model
    trained_models[name] = model
    
    print(f"✓ {name} trained successfully!")
    print(f"  Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")

print("\n" + "=" * 80)
print("All models trained successfully!")

In [None]:
# Define models to train
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
    'Support Vector Machine': SVC(kernel='rbf', random_state=42, probability=True)
}

print(f"Total models to train: {len(models)}")
print("Models:", list(models.keys()))

In [None]:
# Import classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

print("All models imported successfully!")

# 4. Model Building & Training

In [None]:
# Feature Scaling - StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Scaled training set shape:", X_train_scaled.shape)
print("Scaled testing set shape:", X_test_scaled.shape)
print("\nFirst few rows of scaled training data:")
print(X_train_scaled.head())
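The scaling cell fits `StandardScaler` on the training split only and reuses those statistics for the test split. A minimal sketch of that arithmetic in plain Python, assuming toy values; the `fit_stats` and `standardize` helpers and the numbers are illustrative and not taken from the dataset:

```python
from statistics import fmean, pstdev

def fit_stats(column):
    """Return (mean, population standard deviation) for one training column."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std; a zero std is treated as 1 so constant columns map to 0."""
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]

# Hypothetical toy values standing in for a single feature
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mean, std = fit_stats(train_col)                 # statistics come from the training split only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)   # the test split reuses the training statistics
```

Fitting on the training split alone keeps information about the test distribution out of the preprocessing step.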

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_prep, y_prep, test_size=0.2, 
                                                      random_state=42, stratify=y_prep)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("\nTraining set target distribution:")
print(y_train.value_counts())
print("\nTesting set target distribution:")
print(y_test.value_counts())
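Passing `stratify=y_prep` keeps the class proportions roughly equal in the training and testing sets. A rough pure-Python sketch of the idea, assuming a hypothetical `stratified_split` helper; scikit-learn's own implementation differs in detail:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class shares in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts from the dataset description: 10,422 negative, 1,908 positive
labels = [0] * 10422 + [1] * 1908
train_idx, test_idx = stratified_split(labels)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
overall_positive_share = sum(labels) / len(labels)
```

Sampling each class separately is what keeps `test_positive_share` close to `overall_positive_share`, which matters here because the positive class is small.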

In [None]:
# Encode categorical variables
# Month: One-hot encoding (dtype=int yields 0/1 columns instead of booleans)
month_encoded = pd.get_dummies(X_prep['Month'], prefix='Month', drop_first=True, dtype=int)

# VisitorType: One-hot encoding
visitor_encoded = pd.get_dummies(X_prep['VisitorType'], prefix='VisitorType', drop_first=True, dtype=int)

# Weekend: Convert boolean to integer (0, 1)
X_prep['Weekend'] = X_prep['Weekend'].astype(int)

# Drop original categorical columns and add encoded ones
X_prep = X_prep.drop(['Month', 'VisitorType'], axis=1)
X_prep = pd.concat([X_prep, month_encoded, visitor_encoded], axis=1)

# Convert target variable to integer
y_prep = y_prep.astype(int)

print("Processed dataset shape:", X_prep.shape)
print("\nNew feature columns:")
print(X_prep.columns.tolist())
print(f"\nTotal features: {X_prep.shape[1]}")
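With `drop_first=True`, one dummy column per categorical variable is dropped, so the dropped level becomes the baseline and is encoded as all zeros. A small sketch of that behavior in plain Python, assuming levels are ordered by sorting; the `one_hot_drop_first` helper is hypothetical and the sample values are illustrative:

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, omitting the first level in sorted order."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level is the implicit baseline
    columns = [f"{prefix}_{level}" for level in kept]
    rows = [[int(v == level) for level in kept] for v in values]
    return columns, rows

columns, rows = one_hot_drop_first(
    ["Returning_Visitor", "New_Visitor", "Other", "New_Visitor"],
    prefix="VisitorType",
)
# columns -> ['VisitorType_Other', 'VisitorType_Returning_Visitor']
# a 'New_Visitor' row is encoded as [0, 0]
```

Dropping one level avoids a redundant column, since its value is implied by the remaining dummies.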

In [None]:
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Separate features and target
X_prep = df_processed.drop('Revenue', axis=1)
y_prep = df_processed['Revenue']

print("Original dataset shape:", X_prep.shape)
print("Target variable shape:", y_prep.shape)
print("\nFeature columns:")
print(X_prep.columns.tolist())

# 3. Data Preprocessing

In [None]:
# Analyze categorical features vs Revenue
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Month vs Revenue
pd.crosstab(df['Month'], df['Revenue'], normalize='index').plot(kind='bar', ax=axes[0, 0], 
                                                                  color=['#FF6B6B', '#4ECDC4'])
axes[0, 0].set_title('Purchase Rate by Month', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Month', fontsize=10)
axes[0, 0].set_ylabel('Proportion', fontsize=10)
axes[0, 0].legend(['No Purchase', 'Purchase'])
axes[0, 0].tick_params(axis='x', rotation=45)

# VisitorType vs Revenue
pd.crosstab(df['VisitorType'], df['Revenue'], normalize='index').plot(kind='bar', ax=axes[0, 1],
                                                                        color=['#FF6B6B', '#4ECDC4'])
axes[0, 1].set_title('Purchase Rate by Visitor Type', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Visitor Type', fontsize=10)
axes[0, 1].set_ylabel('Proportion', fontsize=10)
axes[0, 1].legend(['No Purchase', 'Purchase'])
axes[0, 1].tick_params(axis='x', rotation=45)

# Weekend vs Revenue
pd.crosstab(df['Weekend'], df['Revenue'], normalize='index').plot(kind='bar', ax=axes[1, 0],
                                                                    color=['#FF6B6B', '#4ECDC4'])
axes[1, 0].set_title('Purchase Rate by Weekend', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Weekend', fontsize=10)
axes[1, 0].set_ylabel('Proportion', fontsize=10)
axes[1, 0].legend(['No Purchase', 'Purchase'])
axes[1, 0].set_xticklabels(['Weekday', 'Weekend'], rotation=0)

# TrafficType distribution
traffic_revenue = df.groupby('TrafficType')['Revenue'].value_counts(normalize=True).unstack()
traffic_revenue.plot(kind='bar', ax=axes[1, 1], color=['#FF6B6B', '#4ECDC4'])
axes[1, 1].set_title('Purchase Rate by Traffic Type', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Traffic Type', fontsize=10)
axes[1, 1].set_ylabel('Proportion', fontsize=10)
axes[1, 1].legend(['No Purchase', 'Purchase'])

plt.tight_layout()
plt.show()

In [None]:
# Compare key features between purchasers and non-purchasers
key_features = ['PageValues', 'ProductRelated_Duration', 'ExitRates', 'BounceRates']

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df.boxplot(column=feature, by='Revenue', ax=axes[idx])
    axes[idx].set_title(f'{feature} by Revenue', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Revenue (Purchase)', fontsize=10)
    axes[idx].set_ylabel(feature, fontsize=10)
    
plt.suptitle('Key Features Comparison: Purchasers vs Non-Purchasers', 
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis
# Convert boolean Revenue to numeric for correlation
df_corr = df.copy()
df_corr['Revenue'] = df_corr['Revenue'].astype(int)

# Select only numerical columns for correlation
numerical_features = df_corr.select_dtypes(include=[np.number]).columns.tolist()
correlation_matrix = df_corr[numerical_features].corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap of Numerical Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show features most correlated with Revenue
print("Features most correlated with Revenue:")
revenue_corr = correlation_matrix['Revenue'].sort_values(ascending=False)
print(revenue_corr)

In [None]:
# Distribution of numerical features
numerical_cols = ['Administrative', 'Administrative_Duration', 'Informational', 
                  'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
                  'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']

fig, axes = plt.subplots(5, 2, figsize=(15, 20))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(df[col], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# 2. Exploratory Data Analysis (EDA)

In [None]:
# Target variable distribution
print("Target Variable (Revenue) Distribution:")
print(df['Revenue'].value_counts())
print("\nPercentage Distribution:")
print(df['Revenue'].value_counts(normalize=True) * 100)

# Visualize target distribution
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df['Revenue'].value_counts().plot(kind='bar', ax=ax[0], color=['#FF6B6B', '#4ECDC4'])
ax[0].set_title('Revenue Distribution (Count)', fontsize=14, fontweight='bold')
ax[0].set_xlabel('Revenue', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].set_xticklabels(['No Purchase (False)', 'Purchase (True)'], rotation=0)

# Pie chart
df['Revenue'].value_counts().plot(kind='pie', ax=ax[1], autopct='%1.1f%%', 
                                   colors=['#FF6B6B', '#4ECDC4'], startangle=90)
ax[1].set_title('Revenue Distribution (Percentage)', fontsize=14, fontweight='bold')
ax[1].set_ylabel('')

plt.tight_layout()
plt.show()

# Calculate class imbalance ratio
imbalance_ratio = df['Revenue'].value_counts()[False] / df['Revenue'].value_counts()[True]
print(f"\nClass Imbalance Ratio: {imbalance_ratio:.2f}:1 (No Purchase : Purchase)")

In [None]:
# Analyze categorical features
categorical_cols = ['Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue']

print("Categorical Features Analysis:")
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())
    print("-" * 40)

In [None]:
# Descriptive statistics for numerical features
print("Descriptive Statistics for Numerical Features:")
df.describe().T

In [None]:
# Display data information
print("Data Types and Non-Null Counts:")
df.info()  # info() prints its report directly and returns None
print("\n" + "="*50)
print("\nMissing Values:")
print(df.isnull().sum())
print("\n" + "="*50)
print("\nDuplicate Rows:", df.duplicated().sum())

In [None]:
# Combine features and target into a single dataframe for easier analysis
df = X.copy()
df['Revenue'] = y

print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

# 1. Data Exploration & Understanding

# 7. Key Insights & Conclusions

## Dataset Summary
- **Total Sessions**: 12,330
- **Features**: 17 (10 numerical, 7 categorical); the 8th categorical column, Revenue, is the target
- **Target**: Revenue (Purchase vs No Purchase)
- **Class Distribution**: 84.5% No Purchase, 15.5% Purchase (Imbalanced)
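Because of this imbalance, accuracy alone can look strong for a model that never predicts a purchase. A short worked example using the class counts stated in the dataset description (10,422 negative, 1,908 positive); the always-negative baseline is a hypothetical classifier used only for illustration:

```python
# Class counts from the dataset description
n_negative = 10_422
n_positive = 1_908
n_total = n_negative + n_positive

imbalance_ratio = n_negative / n_positive

# Hypothetical baseline that predicts "No Purchase" for every session
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

baseline_accuracy = (true_positives + true_negatives) / n_total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall: {baseline_recall:.3f}")
```

The baseline reaches about 84.5% accuracy while finding no purchasers at all, which is why the notebook ranks models by F1-Score rather than accuracy.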

## Key Findings from EDA
1. **PageValues** shows strong correlation with purchase behavior
2. **ProductRelated_Duration** - customers who purchase spend more time on product pages
3. **ExitRates** and **BounceRates** are lower for customers who make purchases
4. **Returning visitors** have higher purchase rates than new visitors
5. **November and May** show higher purchase rates (likely due to shopping seasons)

## Model Performance
- Multiple classification algorithms were trained and evaluated
- Best performing models typically include ensemble methods (Random Forest, XGBoost, Gradient Boosting)
- Key metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
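Accuracy, Precision, Recall, and F1-Score all derive from the four cells of the confusion matrix. A minimal sketch, assuming hypothetical counts (they are not results from this notebook) and a hypothetical `classification_metrics` helper:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
example_metrics = classification_metrics(tp=60, fp=20, fn=40, tn=880)
```

ROC-AUC is different in kind: it is computed from predicted probabilities across all thresholds, which is why the training loop calls `predict_proba`.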

## Important Features for Prediction
Based on feature importance analysis:
1. **PageValues** - Most important predictor
2. **ProductRelated_Duration** - Time spent on product pages
3. **ExitRates** - Exit rate metrics
4. **Month** - Seasonal effects
5. **BounceRates** - Bounce rate metrics
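Since `Month` is one-hot encoded before training, the models report a separate importance for each `Month_*` column. One way to compare it with the other features is to sum the dummy columns back into a single score. A sketch with hypothetical importance values (not output from this notebook); the `group_importances` helper is an assumption for illustration:

```python
from collections import defaultdict

def group_importances(importances, prefixes):
    """Sum importances of one-hot columns that share a prefix such as 'Month_'."""
    grouped = defaultdict(float)
    for column, value in importances.items():
        # Map 'Month_Nov' to 'Month'; leave columns without a known prefix unchanged
        parent = next((p for p in prefixes if column.startswith(p + "_")), column)
        grouped[parent] += value
    return dict(sorted(grouped.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical values for illustration only
importances = {
    "PageValues": 0.40,
    "ExitRates": 0.10,
    "Month_Nov": 0.05,
    "Month_May": 0.03,
    "VisitorType_Returning_Visitor": 0.02,
}
grouped = group_importances(importances, prefixes=["Month", "VisitorType"])
```

Summing is a reasonable aggregate for impurity-based importances because they are additive across columns; other importance measures may need a different treatment.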

## Recommendations
1. **Focus on PageValues optimization** - This is the strongest predictor
2. **Improve product page engagement** - Longer duration correlates with purchases
3. **Reduce exit and bounce rates** - Implement strategies to keep visitors engaged
4. **Target returning visitors** - They have higher conversion rates
5. **Optimize for peak seasons** - Focus marketing efforts in November and May