**Classify Prospective Leads Using Machine Learning**

Lead scoring is a methodology that sales and marketing teams use to rank leads (potential customers) by assigning each one a value based on behavior that signals interest in a product or service.

In [11]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Preprocessing libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# Model selection libraries
from sklearn.model_selection import train_test_split, cross_val_score

# Model building libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import xgboost as xgb
from xgboost import XGBClassifier  # Direct import of XGBClassifier for convenience

# Model evaluation libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Visualization libraries
import plotly.figure_factory as ff
import plotly.graph_objects as go

**Load Dataset**

I use the Lead Scoring dataset from Kaggle, but comparable lead data could also be collected from a website.

You can access the dataset using this link: https://www.kaggle.com/datasets/amritachatterjee09/lead-scoring-dataset.

In [42]:
# Dataset from Kaggle: https://www.kaggle.com/datasets/amritachatterjee09/lead-scoring-dataset

# Load the CSV; the first row holds the column names
df = pd.read_csv('Lead Scoring.csv', header=0)

**Fill Missing Values**

Several columns in the dataset contain missing values, so we fill them before modeling: numeric columns with the column mean, and non-numeric columns with the placeholder 'Missing'.

In [44]:
# Identify numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

# Fill missing numeric data with their mean
df[numeric_cols.columns] = numeric_cols.fillna(numeric_cols.mean())

# Identify non-numeric columns
non_numeric_cols = df.select_dtypes(exclude=['int64', 'float64'])

# Fill missing non-numeric data with the placeholder 'Missing'
df[non_numeric_cols.columns] = non_numeric_cols.fillna('Missing')
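
Before filling, it can help to see how much of each column is actually missing. The following is a minimal sketch on a small made-up frame; `toy` and `missing_summary` are hypothetical names, and the values are not taken from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lead data (hypothetical values)
toy = pd.DataFrame({
    "TotalVisits": [2.0, np.nan, 5.0, 3.0],
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
    "Converted": [1, 0, 0, 1],
})

# Count and share of missing values per column, largest first
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Running the same summary on `df` before the fill step would show which columns are affected most, which may inform whether a mean fill is a reasonable choice for them.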

**Feature Encoding**

Many machine learning models cannot handle non-numeric data directly and require numeric input, so we convert the categorical columns to numbers.

In [45]:
# Encode non-numeric columns as integers using label encoding.

# Create a copy of the DataFrame to avoid modifying the original data
encoded_df = df.copy()

# Identify non-numeric columns
non_numeric_cols = encoded_df.select_dtypes(exclude=['int64', 'float64'])

# Initialize LabelEncoder
le = LabelEncoder()

# Apply LabelEncoder to each non-numeric column
for column in non_numeric_cols.columns:
    # Fit and transform the data
    # Use `astype(str)` to ensure proper conversion for mixed types
    encoded_df[column] = le.fit_transform(non_numeric_cols[column].astype(str))



In [46]:
encoded_df.columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about X Education',
       'What is your current occupation',
       'What matters most to you in choosing a course', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score',
       'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

**Model Preparation**

This part preprocesses the data and splits it into training and testing sets for the models.

In [47]:
# Setting 'Lead Number' as the index
encoded_df.set_index('Lead Number', inplace=True)

# Dropping unnecessary column 'Prospect ID'
encoded_df.drop(['Prospect ID'], axis=1, inplace=True)

# All categorical columns are already encoded, so we can proceed with the train-test split
# (train_test_split was imported in the first cell)

# Split the data
X = encoded_df.drop('Converted', axis=1)
y = encoded_df['Converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Checking the shape of the train and test sets
X_train.shape, X_test.shape

((7392, 34), (1848, 34))
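
The classes are not evenly balanced (the test set later shows 741 converted leads out of 1848), so a stratified split may be worth considering to keep that ratio the same in both sets. The sketch below uses synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 40% positives (not the Kaggle dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.4).astype(int)

# stratify=y_demo keeps the class ratio (almost) identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share overall:", y_demo.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```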

**Model Selection**

We compare several models to find the best predictor of lead conversion: logistic regression, random forest, gradient boosting, an SVM classifier, and XGBoost.

In [48]:
#Model Selection

def logistic_regression(X_train, y_train):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    return model

def random_forest(X_train, y_train):
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model

def gradient_boosting(X_train, y_train):
    model = GradientBoostingClassifier()
    model.fit(X_train, y_train)
    return model

def svm_classifier(X_train, y_train):
    model = SVC()
    model.fit(X_train, y_train)
    return model

def xgboost_classifier(X_train, y_train):
    model = XGBClassifier()
    model.fit(X_train, y_train)
    return model
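
A single train-test split gives one accuracy number per model. Cross-validation, using the `cross_val_score` function imported in the first cell, gives a mean and a spread instead. This is a minimal sketch on synthetic data with only two of the candidate models; `candidates` and `cv_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded lead data (not the Kaggle dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated accuracy for each candidate
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the cross-validated means of two models are within a standard deviation of each other, the ranking from a single split is probably not decisive.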

**Model Evaluation**

In [49]:
# Model evaluation function
def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions)
    return accuracy, report

# Running and evaluating each model
models = {
    "Logistic Regression": logistic_regression,
    "Random Forest": random_forest,
    "Gradient Boosting": gradient_boosting,
    "SVM": svm_classifier,
    "XGBoost": xgboost_classifier
}

# Compare the models
for name, model_func in models.items():
    print(f"Running {name}...")
    model = model_func(X_train, y_train)
    accuracy, report = evaluate_model(model, X_test, y_test)
    print(f"{name} Accuracy: {accuracy}")
    print(f"Classification Report for {name}:\n{report}\n")

Running Logistic Regression...
Logistic Regression Accuracy: 0.8273809523809523
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1107
           1       0.81      0.74      0.77       741

    accuracy                           0.83      1848
   macro avg       0.82      0.81      0.82      1848
weighted avg       0.83      0.83      0.83      1848


Running Random Forest...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Random Forest Accuracy: 0.9318181818181818
Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.93      0.96      0.94      1107
           1       0.94      0.89      0.91       741

    accuracy                           0.93      1848
   macro avg       0.93      0.92      0.93      1848
weighted avg       0.93      0.93      0.93      1848


Running Gradient Boosting...
Gradient Boosting Accuracy: 0.9383116883116883
Classification Report for Gradient Boosting:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1107
           1       0.95      0.89      0.92       741

    accuracy                           0.94      1848
   macro avg       0.94      0.93      0.94      1848
weighted avg       0.94      0.94      0.94      1848


Running SVM...
SVM Accuracy: 0.7148268398268398
Classification Report for SVM:
              precision    recall  f1-score   support
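
The iteration-limit warning in the output above points at `max_iter` and the logistic regression documentation, and it suggests scaling the data. The sketch below follows that suggestion for the two scale-sensitive models, using the `StandardScaler` imported in the first cell. It runs on synthetic data, so `X_demo`, `y_demo`, and `scaled_models` are hypothetical, and whether scaling improves the scores on the lead data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded lead data (not the Kaggle dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)
# Put the features on very different scales, as raw counts and durations would be
X_demo = X_demo * [1, 10, 100, 1000, 1, 10, 100, 1000]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, inside each pipeline
scaled_models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

scaled_scores = {}
for name, pipeline in scaled_models.items():
    pipeline.fit(X_tr, y_tr)
    scaled_scores[name] = pipeline.score(X_te, y_te)
    print(f"{name} accuracy with scaling: {scaled_scores[name]:.3f}")
```

Tree-based models such as random forest and gradient boosting are largely insensitive to feature scale, so they would not need this step.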

     

**Fit Data into Model**

In [50]:
# Select XGBoost as the final model.

# Building the XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Getting probability predictions and rounding to two decimal places
xgb_prob_predictions = [round(prob, 2) for prob in xgb_model.predict_proba(X_test)[:, 1]]  # Assuming binary classification

# Function to categorize probabilities
def categorize_probability(prob):
    if 0.1 <= prob <= 0.3:
        return 'Low Probability'
    elif 0.4 <= prob <= 0.6:
        return 'Medium Probability'
    elif 0.7 <= prob <= 1.0:
        return 'High Probability'
    else:
        return 'Undefined'  # For probabilities outside the specified ranges

# Apply categorization to probability predictions
probability_categories = [categorize_probability(prob) for prob in xgb_prob_predictions]

# Convert probabilities to 0 or 1 based on threshold using list comprehension
xgb_predictions = [int(prob > 0.5) for prob in xgb_prob_predictions]

# Evaluate the model using the binary predictions
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
xgb_report = classification_report(y_test, xgb_predictions)

# Print the results
print("XGBoost Model Accuracy:", xgb_accuracy)
print("Classification Report:\n", xgb_report)

# Adding predictions to the test set for review
X_test['Predicted_Probability'] = xgb_prob_predictions
X_test['Probability_Category'] = probability_categories
X_test['Predicted_Conversion'] = xgb_predictions
X_test['Actual_Conversion'] = y_test

XGBoost Model Accuracy: 0.9475108225108225
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.97      0.96      1107
           1       0.95      0.92      0.93       741

    accuracy                           0.95      1848
   macro avg       0.95      0.94      0.95      1848
weighted avg       0.95      0.95      0.95      1848



Parameters: { "use_label_encoder" } are not used.
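
Note that the ranges in `categorize_probability` leave gaps. Because the probabilities are rounded to two decimals, values below 0.10 and values such as 0.35 or 0.65 fall through to 'Undefined'. A gap-free variant could look like the sketch below. The function name `categorize_probability_no_gaps` and the cut points 0.4 and 0.7 are assumptions for illustration, not the thresholds used to produce the results in this notebook.

```python
def categorize_probability_no_gaps(prob, low_cut=0.4, high_cut=0.7):
    """Map a probability in [0, 1] to one of three buckets with no gaps.

    low_cut and high_cut are hypothetical thresholds; adjust them to the
    sales team's own definition of low, medium, and high.
    """
    if prob < low_cut:
        return 'Low Probability'
    if prob < high_cut:
        return 'Medium Probability'
    return 'High Probability'


# Every value in [0, 1] now receives a category
for p in [0.0, 0.01, 0.35, 0.5, 0.65, 0.7, 1.0]:
    print(p, categorize_probability_no_gaps(p))
```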



**Result**

In [51]:
# Recoding the 'Predicted_Class' and 'Actual_Class'
X_test['Predicted_Class'] = X_test['Predicted_Conversion'].replace({0: 'Not Converted', 1: 'Converted'})
X_test['Actual_Class'] = X_test['Actual_Conversion'].replace({0: 'Not Converted', 1: 'Converted'})

# Resetting the index of X_test
X_test_reset = X_test.reset_index()

# Joining df with X_test_reset
# (DataFrame.join aligns on the index, which here is the positional RangeIndex of both frames)
result_df = df.join(X_test_reset, rsuffix='_drop', how='inner')

# Dropping columns from X_test that have the same names (these columns have '_drop' suffix)
result_df = result_df[[col for col in result_df.columns if not col.endswith('_drop')]]

# Dropping the 'Converted' column
result_df.drop('Converted', axis=1, inplace=True)

# Show DataFrame Leads Scoring result
result_df

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,...,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity,Predicted_Probability,Probability_Category,Predicted_Conversion,Actual_Conversion,Predicted_Class,Actual_Class
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0.0,0,0.00,Page Visited on Website,...,15.0,No,No,Modified,0.98,High Probability,1,1,Converted,Converted
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,5.0,674,2.50,Email Opened,...,15.0,No,No,Email Opened,0.01,Undefined,0,0,Not Converted,Not Converted
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,2.0,1532,2.00,Email Opened,...,20.0,No,Yes,Email Opened,0.00,Undefined,0,0,Not Converted,Not Converted
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,1.0,305,1.00,Unreachable,...,17.0,No,No,Modified,0.04,Undefined,0,0,Not Converted,Not Converted
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,2.0,1428,1.00,Converted to Lead,...,18.0,No,No,Modified,0.00,Undefined,0,0,Not Converted,Not Converted
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1843,e3d7f9af-1593-41e8-9682-c6ddfb9ee1b7,641612,Landing Page Submission,Direct Traffic,No,No,2.0,203,2.00,Email Opened,...,18.0,No,Yes,Email Opened,0.99,High Probability,1,1,Converted,Converted
1844,b3c94e33-69da-4163-a97e-a4825e233f95,641603,Landing Page Submission,Google,Yes,No,2.0,102,2.00,Email Bounced,...,18.0,No,No,Email Bounced,0.99,High Probability,1,1,Converted,Converted
1845,ff31e840-8fc7-45d0-a399-59accdb1f7dc,641602,API,Google,No,No,3.0,481,3.00,Email Opened,...,18.0,No,No,Email Opened,0.00,Undefined,0,0,Not Converted,Not Converted
1846,1451a32b-16de-482e-825e-8cef8c017064,641585,API,Google,No,No,9.0,195,2.25,Unreachable,...,15.0,No,No,Unreachable,0.94,High Probability,1,1,Converted,Converted
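
One caveat about the join above: `DataFrame.join` aligns on the index by default, and after `reset_index()` both frames carry a positional 0..n-1 index. The inner join therefore pairs the first rows of `df` with the test rows by position, not by 'Lead Number', which may explain why the displayed Lead Numbers run in file order. The sketch below contrasts the two behaviors on made-up data; `raw`, `test_scores`, `positional`, and `keyed` are hypothetical names.

```python
import pandas as pd

# Toy "raw" data with a default RangeIndex, like df after read_csv
raw = pd.DataFrame({
    "Lead Number": [101, 102, 103, 104, 105],
    "Lead Source": ["Google", "Olark Chat", "Google", "Direct Traffic", "Google"],
})

# Toy test-set predictions indexed by 'Lead Number', like X_test
test_scores = pd.DataFrame(
    {"Predicted_Probability": [0.91, 0.12]},
    index=pd.Index([104, 102], name="Lead Number"),
)

# Positional join (the pattern used above): row 0 of raw meets the first test row
positional = raw.join(test_scores.reset_index(), rsuffix="_drop", how="inner")
print(positional)

# Key-based merge: each prediction is attached to its own lead
keyed = raw.merge(test_scores.reset_index(), on="Lead Number", how="inner")
print(keyed)
```

In `positional`, lead 101 receives the probability that belongs to lead 104. In `keyed`, only leads 102 and 104 appear, each with its own score. Applying the same idea to the notebook would mean merging only the prediction columns of `X_test_reset` into `df` on 'Lead Number'.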


**Conclusion**

The resulting table provides the following information for each lead:

1. Predicted Probability: the predicted probability of conversion, a value between 0 and 1.
2. Probability Category: the probability bucket (low, medium, or high); values outside the defined ranges are labeled “Undefined”.
3. Predicted Conversion: 0 for not converted and 1 for converted.
4. Predicted Class: “Converted” or “Not Converted”.

These results support more data-driven decision-making and can make sales efforts more effective.