## TrendsMarketPlace
## Business Use Case
# Enhancing E-Commerce Conversion Rates Through Predictive Modeling

#### Description:
In the competitive e-commerce space, conversion rate is a critical success metric. This business use case uses the "Online Shoppers Intention Dataset" to develop predictive models that identify which visitors are likely to make a purchase and which are likely to leave the website without buying.

The insights derived from these models can be utilized to:

1. Personalize Customer Experience: Identify high-intent buyers in real time and offer targeted promotions or assistance.
2. Reduce Cart Abandonment: Proactively engage visitors likely to leave the site without completing a transaction through tailored incentives.
3. Improve Marketing ROI: Optimize ad spending by focusing on segments with higher purchasing intent.
4. Enhance Operational Efficiency: Provide data-driven insights for better decision-making in marketing, customer service, and website design.

#### Impact:
By implementing these predictive models, e-commerce platforms can achieve:

1. Increased revenue by boosting purchase conversion rates.
2. Reduced operational costs by targeting the right customers at the right time.
3. Enhanced customer satisfaction and retention through personalized interventions.


##### Team Members: Jacob Battles | Garrett Kierzek | Abdihakim Bashe | Divyansh Sen | Utkarsh Joshi


#### Importing Libraries 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import optuna
import pymysql as sql

import warnings
# Suppress warnings
warnings.filterwarnings("ignore")

#### Connecting to the MySQL database
Database: TrendMarketplace

Table: online_shoppers_intention


In [2]:
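import os

db_config = {
    "host": "localhost",
    "user": "root",
    "database": "trendmarketplace",
    # Read the password from the environment instead of hard-coding it.
    # MYSQL_PASSWORD is an assumed variable name; set it before running.
    "password": os.environ.get("MYSQL_PASSWORD", "")
}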

# Load data from the database
def load_data(query="SELECT * FROM online_shoppers_intention;"):
    connection = sql.connect(**db_config)
    try:
        df = pd.read_sql(query, con=connection)
    finally:
        connection.close()
    return df
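
The loading step can be exercised without a running MySQL server. The sketch below is not part of the original notebook: it swaps in an in-memory SQLite database as a stand-in, and the helper names (`load_data_from`, `make_demo_connection`), the two sample rows, and the `PageValues` and `Month` columns are assumptions made for illustration. It keeps the same open, query, always-close pattern as `load_data`.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_data_from(connection_factory, query):
    """Open a DB-API connection, run the query, and always close the connection."""
    with closing(connection_factory()) as connection:
        return pd.read_sql(query, con=connection)


def make_demo_connection():
    # Hypothetical stand-in for the MySQL table: two made-up sessions.
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE online_shoppers_intention "
        "(PageValues REAL, Month TEXT, Revenue INTEGER)"
    )
    connection.executemany(
        "INSERT INTO online_shoppers_intention VALUES (?, ?, ?)",
        [(0.0, "Feb", 0), (12.5, "Nov", 1)],
    )
    connection.commit()
    return connection


demo_df = load_data_from(make_demo_connection, "SELECT * FROM online_shoppers_intention;")
print(demo_df.shape)
```

Passing a connection factory rather than a fixed config keeps the query logic testable while the real MySQL connection stays in one place.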

#### Preprocess Data
 1. Remove rows with null values
 2. Apply StandardScaler to numerical columns and OneHotEncoder to categorical columns

In [3]:
# Preprocess data
def preprocess_data(df):
    # Handle missing values
    df = df.dropna()

    # Separate features and target
    X = df.drop('Revenue', axis=1)
    y = df['Revenue']

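    # Identify categorical and numerical columns.
    # Columns of any other dtype (e.g. bool) match neither selector and are
    # dropped by ColumnTransformer's default remainder='drop'.
    categorical_cols = X.select_dtypes(include=['object']).columns
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns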

    # Preprocessing pipelines
    num_transformer = StandardScaler()
    cat_transformer = OneHotEncoder(handle_unknown='ignore')

    # Combine preprocessors
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_transformer, numerical_cols),
            ('cat', cat_transformer, categorical_cols)
        ])

    return X, y, preprocessor



### Using the Optuna library to optimize the hyperparameters of the ML models
#### Models:
1. Random Forest
2. Logistic Regression
3. SVM
4. Multi-Layer Perceptron

In [4]:
# Define model functions with Optuna integration
def optimize_random_forest(trial, X, y, preprocessor):
    n_estimators = trial.suggest_int("n_estimators", 50, 300, step=50)
    max_depth = trial.suggest_int("max_depth", 5, 30, step=5)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 10, step=1)

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )

    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
    return np.mean(scores)

def optimize_logistic_regression(trial, X, y, preprocessor):
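    # suggest_loguniform is deprecated since Optuna 3.0; suggest_float(log=True) is the replacement
    C = trial.suggest_float("C", 1e-4, 1e2, log=True)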
    solver = trial.suggest_categorical("solver", ["liblinear", "lbfgs"])
    penalty = trial.suggest_categorical("penalty", ["l2"])

    model = LogisticRegression(
        C=C,
        solver=solver,
        penalty=penalty,
        max_iter=500,
        random_state=42
    )

    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
    return np.mean(scores)

def optimize_svm(trial, X, y, preprocessor):
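    # suggest_loguniform is deprecated since Optuna 3.0; suggest_float(log=True) is the replacement
    C = trial.suggest_float("C", 1e-4, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e-1, log=True)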

    model = SVC(
        C=C,
        gamma=gamma,
        kernel="rbf",
        random_state=42
    )

    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
    return np.mean(scores)

def optimize_mlp(trial, X, y, preprocessor):
    hidden_layer_sizes = trial.suggest_int("hidden_layer_sizes", 50, 200, step=50)
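    # suggest_loguniform is deprecated since Optuna 3.0; suggest_float(log=True) is the replacement
    alpha = trial.suggest_float("alpha", 1e-4, 1e-1, log=True)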

    model = MLPClassifier(
        hidden_layer_sizes=(hidden_layer_sizes,),
        alpha=alpha,
        max_iter=500,
        random_state=42
    )

    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
    return np.mean(scores)


### Comparing results with and without Optuna

In [6]:
def main():
    # Load the data
    df = load_data()

    # Preprocess the data
    X, y, preprocessor = preprocess_data(df)
    print('\n--------------------------------------------------------------------------------\n')    
    # Evaluate before optimization
    print("Evaluating models before optimization...\n")
    models = {
        "Random Forest": RandomForestClassifier(random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=500, random_state=42),
        "Support Vector Machine": SVC(kernel="rbf", random_state=42),
        "MLP Classifier": MLPClassifier(max_iter=500, random_state=42)
    }
    for model_name, model in models.items():
        pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
        print(f"{model_name} Accuracy: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
        
    print('\n--------------------------------------------------------------------------------\n')    

    # Optimize models using Optuna
    
    print("\nOptimizing models using Optuna...\n")
    study_rf = optuna.create_study(direction="maximize")
    study_rf.optimize(lambda trial: optimize_random_forest(trial, X, y, preprocessor), n_trials=20)

    study_lr = optuna.create_study(direction="maximize")
    study_lr.optimize(lambda trial: optimize_logistic_regression(trial, X, y, preprocessor), n_trials=20)

    study_svm = optuna.create_study(direction="maximize")
    study_svm.optimize(lambda trial: optimize_svm(trial, X, y, preprocessor), n_trials=20)

    study_mlp = optuna.create_study(direction="maximize")
    study_mlp.optimize(lambda trial: optimize_mlp(trial, X, y, preprocessor), n_trials=20)

    # Print optimized results
    print("\nOptimized Results:\n")
    print(f"Random Forest Best Accuracy: {study_rf.best_value:.4f}")
    print(f"Logistic Regression Best Accuracy: {study_lr.best_value:.4f}")
    print(f"SVM Best Accuracy: {study_svm.best_value:.4f}")
    print(f"MLP Classifier Best Accuracy: {study_mlp.best_value:.4f}")
   
    print('\n--------------------------------------------------------------------------------\n')    

    

if __name__ == "__main__":
    main()



--------------------------------------------------------------------------------

Evaluating models before optimization...
Random Forest Accuracy: 0.9012 (+/- 0.0050)
Logistic Regression Accuracy: 0.8843 (+/- 0.0036)
Support Vector Machine Accuracy: 0.8932 (+/- 0.0053)
MLP Classifier Accuracy: 0.8821 (+/- 0.0028)

--------------------------------------------------------------------------------


Optimizing models using Optuna...


[I 2024-12-04 15:50:40,439] A new study created in memory with name: no-name-dadafa23-e6a3-488e-b295-1f434f12ce49
[I 2024-12-04 15:50:51,563] Trial 0 finished with value: 0.9012165450121655 and parameters: {'n_estimators': 200, 'max_depth': 30, 'min_samples_split': 3}. Best is trial 0 with value: 0.9012165450121655.
[I 2024-12-04 15:50:53,211] Trial 1 finished with value: 0.8927818329278183 and parameters: {'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 9}. Best is trial 0 with value: 0.9012165450121655.
[I 2024-12-04 15:51:03,409] Trial 2 finished with value: 0.9018653690186538 and parameters: {'n_estimators': 200, 'max_depth': 30, 'min_samples_split': 2}. Best is trial 2 with value: 0.9018653690186538.
[I 2024-12-04 15:51:15,941] Trial 3 finished with value: 0.902514193025142 and parameters: {'n_estimators': 250, 'max_depth': 30, 'min_samples_split': 5}. Best is trial 3 with value: 0.902514193025142.
[I 2024-12-04 15:51:20,377] Trial 4 finished with value: 0.9044606650446066 and parameters: {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 2}. Best is trial 4 with value: 0.90


Optimized Results:
Random Forest Best Accuracy: 0.9045
Logistic Regression Best Accuracy: 0.8844
SVM Best Accuracy: 0.8947
MLP Classifier Best Accuracy: 0.8994

--------------------------------------------------------------------------------
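
Accuracy figures like these are easier to interpret next to a majority-class baseline, since a model that always predicts the more common class can already score well when the classes are unbalanced. The notebook does not report the class balance of `Revenue`, so the sketch below uses synthetic data with an assumed 85/15 split. It compares a `DummyClassifier` against a logistic regression on both accuracy and balanced accuracy; `mean_scores` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption, not the project's table)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "balanced_accuracy"]


def mean_scores(model):
    """Mean cross-validated value of each metric in `scoring`."""
    result = cross_validate(model, X_demo, y_demo, cv=skf, scoring=scoring)
    return {name: float(np.mean(result[f"test_{name}"])) for name in scoring}


baseline = mean_scores(DummyClassifier(strategy="most_frequent"))
logistic = mean_scores(LogisticRegression(max_iter=500))

print("Majority-class baseline:", baseline)
print("Logistic regression:    ", logistic)
```

The baseline's balanced accuracy is 0.5 by construction, so the gap between a model's balanced accuracy and 0.5 is a more direct measure of what it has learned than raw accuracy alone.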



In [7]:
# User inputs

def main():        
    # Load the data
    df = load_data()

    # Preprocess the data
    X, y, preprocessor = preprocess_data(df)
    # Run models with user-defined parameters
    print("\nRunning models with user-defined parameters...")
    print('\n--------------------------------------------------------------------------------\n')
    print('\n Please enter the parameters')
    rf_n_estimators = int(input("Enter the number of estimators for Random Forest (e.g 100): "))
    lr_max_iter = int(input("Enter the max iterations for Logistic Regression (e.g., 200): "))
    svm_kernel = input("Enter the kernel for SVM (e.g., 'linear', 'rbf'): ")
    mlp_hidden_layer_sizes = int(input("Enter the hidden layer size for MLP (e.g., 100): "))
    print('\n\n')

    user_models = {
        "Random Forest": RandomForestClassifier(n_estimators=rf_n_estimators, random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=lr_max_iter, random_state=42),
        "Support Vector Machine": SVC(kernel=svm_kernel, random_state=42),
        "MLP Classifier": MLPClassifier(hidden_layer_sizes=(mlp_hidden_layer_sizes,), max_iter=500, random_state=42)
    }
    for model_name, model in user_models.items():
        pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')
        print(f"{model_name} Accuracy with User Parameters: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
        
    print('\n--------------------------------------------------------------------------------\n')
    

if __name__ == "__main__":
    main()



Running models with user-defined parameters...

--------------------------------------------------------------------------------


 Please enter the parameters
Enter the number of estimators for Random Forest (e.g 100): 400
Enter the max iterations for Logistic Regression (e.g., 200): 100
Enter the kernel for SVM (e.g., 'linear', 'rbf'): rbf
Enter the hidden layer size for MLP (e.g., 100): 200



Random Forest Accuracy with User Parameters: 0.9030 (+/- 0.0035)
Logistic Regression Accuracy with User Parameters: 0.8843 (+/- 0.0036)
Support Vector Machine Accuracy with User Parameters: 0.8932 (+/- 0.0053)
MLP Classifier Accuracy with User Parameters: 0.8807 (+/- 0.0061)

--------------------------------------------------------------------------------
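
The prompts above pass user input straight to `int()` and `SVC`, so a typo only surfaces as an error deep inside model fitting. A minimal sketch of validating the values first is shown below. `parse_positive_int`, `parse_kernel`, and `VALID_SVM_KERNELS` are hypothetical helpers, not part of the notebook, and the kernel list covers the built-in string kernels of `SVC` other than `'precomputed'`.

```python
VALID_SVM_KERNELS = ("linear", "poly", "rbf", "sigmoid")


def parse_positive_int(raw, name):
    """Convert raw text to an integer >= 1, with a readable error message."""
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"{name} must be a whole number, got {raw!r}") from None
    if value < 1:
        raise ValueError(f"{name} must be at least 1, got {value}")
    return value


def parse_kernel(raw):
    """Normalize a kernel name, tolerating surrounding quotes and capitals."""
    kernel = raw.strip().strip("'\"").lower()
    if kernel not in VALID_SVM_KERNELS:
        raise ValueError(f"kernel must be one of {VALID_SVM_KERNELS}, got {raw!r}")
    return kernel


# Example with fixed strings in place of input()
print(parse_positive_int(" 400 ", "n_estimators"))
print(parse_kernel("'RBF'"))
```

In the cell above, each `input(...)` call could be wrapped with the matching helper, for example `parse_kernel(input(...))`.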

