Effective lead management is central to driving sales and revenue. When the volume of potential customers is high, sales teams often struggle to identify which leads are most likely to convert. This project builds an AI-powered lead scoring system that uses machine learning to assess and prioritize leads by their likelihood of conversion. The model analyzes lead attributes, including demographics, engagement behavior, and previous interactions, so that sales teams can focus their efforts on high-potential opportunities. The intended result is a more efficient sales process and improved conversion rates.

## Step 1: Data Generation

To simulate a dataset for lead scoring, we can create synthetic data that includes lead attributes like demographics, engagement metrics, and previous interactions.

In [1]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Define the number of leads
num_leads = 1000

# Generate synthetic data
data = {
    'lead_id': range(1, num_leads + 1),
    'age': np.random.randint(22, 65, size=num_leads),  # Lead age
    'company_size': np.random.choice(['Small', 'Medium', 'Large'], size=num_leads),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], size=num_leads),
    'email_open_rate': np.random.uniform(0, 1, size=num_leads),  # Fraction of emails opened (0 to 1)
    'website_visits': np.random.randint(0, 100, size=num_leads),  # Number of website visits
    'days_since_last_contact': np.random.randint(0, 30, size=num_leads),  # Days since last contact
    'previous_interactions': np.random.randint(0, 10, size=num_leads),  # Previous interactions with sales team
    'converted': np.random.choice([0, 1], size=num_leads, p=[0.7, 0.3])  # Target variable (0: not converted, 1: converted)
}

# Create DataFrame
leads_df = pd.DataFrame(data)

# Save to CSV (optional)
leads_df.to_csv('synthetic_leads_data.csv', index=False)
print(leads_df.head())


   lead_id  age company_size    industry  email_open_rate  website_visits  \
0        1   60        Small  Healthcare         0.181319              56   
1        2   50        Large        Tech         0.387674              46   
2        3   36       Medium     Finance         0.402327              99   
3        4   64        Small  Healthcare         0.874262              38   
4        5   29       Medium      Retail         0.510110              15   

   days_since_last_contact  previous_interactions  converted  
0                        9                      2          0  
1                       20                      3          1  
2                       23                      7          0  
3                        1                      5          1  
4                        8                      1          0  
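Note that `converted` in the cell above is drawn independently of every other column, so no model can learn a real relationship from this dataset. If a signal-bearing target is wanted for experimentation, one option is to derive the conversion probability from a few features through a logistic function. The sketch below is only an illustration: the coefficients, the `demo_df` name, and the choice of features are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# A reduced feature set, generated the same way as in the notebook
demo_df = pd.DataFrame({
    'email_open_rate': rng.uniform(0, 1, size=n),
    'website_visits': rng.integers(0, 100, size=n),
    'days_since_last_contact': rng.integers(0, 30, size=n),
})

# Hypothetical coefficients, chosen only so that engagement raises
# and staleness lowers the chance of conversion
logit = (
    -2.0
    + 2.0 * demo_df['email_open_rate']
    + 0.02 * demo_df['website_visits']
    - 0.05 * demo_df['days_since_last_contact']
)
conversion_prob = 1 / (1 + np.exp(-logit))

# Draw each label from a Bernoulli distribution with its own probability
demo_df['converted'] = rng.binomial(1, conversion_prob.to_numpy())

open_rate_corr = demo_df[['email_open_rate', 'converted']].corr().iloc[0, 1]
print(f"Conversion rate: {demo_df['converted'].mean():.2f}")
print(f"Correlation with email_open_rate: {open_rate_corr:.2f}")
```

With a target built this way, the EDA plots and the model have an actual pattern to pick up.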


In [3]:
leads_df.info()  # Summarize column dtypes and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   lead_id                  1000 non-null   int64  
 1   age                      1000 non-null   int64  
 2   company_size             1000 non-null   object 
 3   industry                 1000 non-null   object 
 4   email_open_rate          1000 non-null   float64
 5   website_visits           1000 non-null   int64  
 6   days_since_last_contact  1000 non-null   int64  
 7   previous_interactions    1000 non-null   int64  
 8   converted                1000 non-null   int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 70.4+ KB


In [7]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go


In [8]:
# Load the synthetic leads dataset
leads_df = pd.read_csv('synthetic_leads_data.csv')

In [9]:
# Display basic information about the dataset
leads_df.info()  # info() prints its summary itself and returns None
print(leads_df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   lead_id                  1000 non-null   int64  
 1   age                      1000 non-null   int64  
 2   company_size             1000 non-null   object 
 3   industry                 1000 non-null   object 
 4   email_open_rate          1000 non-null   float64
 5   website_visits           1000 non-null   int64  
 6   days_since_last_contact  1000 non-null   int64  
 7   previous_interactions    1000 non-null   int64  
 8   converted                1000 non-null   int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 70.4+ KB
           lead_id          age  email_open_rate  website_visits  \
count  1000.000000  1000.000000      1000.000000     1000.000000   
mean    500.500000    43.014000         0.507965       49.658000   
std     288.819436    12.3337

In [10]:
# 1. Distribution of Numeric Features
fig = px.histogram(leads_df, x='age', nbins=20, title='Age Distribution')
fig.show()

fig = px.histogram(leads_df, x='email_open_rate', nbins=20, title='Email Open Rate Distribution')
fig.show()

fig = px.histogram(leads_df, x='website_visits', nbins=20, title='Website Visits Distribution')
fig.show()

fig = px.histogram(leads_df, x='days_since_last_contact', nbins=30, title='Days Since Last Contact Distribution')
fig.show()

fig = px.histogram(leads_df, x='previous_interactions', nbins=10, title='Previous Interactions Distribution')
fig.show()

In [12]:
# 2. Categorical Feature Analysis
# In pandas 2.x, value_counts().reset_index() yields the columns
# 'company_size' (the category) and 'count' (its frequency).
fig = px.bar(leads_df['company_size'].value_counts().reset_index(),
              x='company_size',
              y='count',
              title='Company Size Distribution',
              labels={'company_size': 'Company Size', 'count': 'Count'})
fig.show()

In [15]:
fig = px.bar(leads_df['industry'].value_counts().reset_index(),
              x='industry', # Use 'industry' column for x-axis which contains unique industries
              y='count', # Use 'count' column for y-axis which contains counts
              title='Industry Distribution',
              labels={'industry': 'Industry', 'count': 'Count'}) # Set axis labels
fig.show()

In [16]:
# 3. Target Variable Distribution
fig = px.pie(leads_df, names='converted', title='Conversion Rate Distribution', hole=0.3)
fig.show()


In [18]:
# 4. Correlation Heatmap
# Select only numeric columns for correlation calculation
numerical_leads_df = leads_df.select_dtypes(include=['number'])

# Calculate correlation matrix on numerical data
correlation_matrix = numerical_leads_df.corr()

fig = px.imshow(correlation_matrix,
                title='Correlation Heatmap',
                color_continuous_scale='RdBu',
                zmin=-1, zmax=1)
fig.show()

In [19]:
# 5. Boxplots to analyze numerical features against the target variable
fig = px.box(leads_df, x='converted', y='age', title='Age vs Conversion')
fig.show()

fig = px.box(leads_df, x='converted', y='email_open_rate', title='Email Open Rate vs Conversion')
fig.show()

fig = px.box(leads_df, x='converted', y='website_visits', title='Website Visits vs Conversion')
fig.show()

fig = px.box(leads_df, x='converted', y='days_since_last_contact', title='Days Since Last Contact vs Conversion')
fig.show()

fig = px.box(leads_df, x='converted', y='previous_interactions', title='Previous Interactions vs Conversion')
fig.show()
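The boxplots above compare numeric features against the target. For the categorical columns, a grouped conversion rate gives the equivalent view. The sketch below assumes a small hand-made `sample_df` standing in for `leads_df` (the values are illustrative only), and it should be run before the label encoding in Step 2 replaces the category names with integers.

```python
import pandas as pd

# Tiny hand-made sample standing in for leads_df (illustrative values)
sample_df = pd.DataFrame({
    'company_size': ['Small', 'Small', 'Medium', 'Medium', 'Large', 'Large'],
    'industry': ['Tech', 'Retail', 'Tech', 'Finance', 'Finance', 'Tech'],
    'converted': [0, 1, 0, 0, 1, 1],
})

# Number of leads and share of converted leads per company size
rate_by_size = (
    sample_df.groupby('company_size')['converted']
    .agg(leads='count', conversion_rate='mean')
    .sort_values('conversion_rate', ascending=False)
)
print(rate_by_size)
```

The same pattern applies to `industry` by changing the grouping column.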

## Step 2: Data Preprocessing

Next, we'll clean and preprocess the data to prepare it for training. This includes handling categorical variables, normalizing features, and splitting the dataset.

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


In [22]:
# Load data (if from CSV)
# leads_df = pd.read_csv('synthetic_leads_data.csv')

# Encode categorical variables
label_encoders = {}
for column in ['company_size', 'industry']:
    le = LabelEncoder()
    leads_df[column] = le.fit_transform(leads_df[column])
    label_encoders[column] = le

In [23]:
# Separate features and target variable
X = leads_df.drop(columns=['lead_id', 'converted'])
y = leads_df['converted']


In [24]:
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [25]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [26]:
print(f"Training data shape: {X_train.shape}, Testing data shape: {X_test.shape}")

Training data shape: (800, 7), Testing data shape: (200, 7)
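The cells above fit the scaler on the full dataset before splitting, which lets test-set statistics leak into preprocessing, and label encoding imposes an arbitrary order on nominal categories. One possible alternative, shown here as a sketch rather than as the notebook's method, is a scikit-learn `Pipeline` with a `ColumnTransformer` that one-hot encodes the categorical columns and is fitted on the training split only. The reduced `raw_df` and all variable names are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small stand-in dataset with the same kinds of columns as leads_df
rng = np.random.default_rng(0)
n = 200
raw_df = pd.DataFrame({
    'age': rng.integers(22, 65, size=n),
    'company_size': rng.choice(['Small', 'Medium', 'Large'], size=n),
    'industry': rng.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], size=n),
    'email_open_rate': rng.uniform(0, 1, size=n),
    'converted': rng.choice([0, 1], size=n, p=[0.7, 0.3]),
})

categorical_cols = ['company_size', 'industry']
numeric_cols = ['age', 'email_open_rate']

# One-hot encode categories, standardize numeric columns
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('num', StandardScaler(), numeric_cols),
])

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])

X_raw = raw_df.drop(columns='converted')
y_raw = raw_df['converted']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Preprocessing statistics are learned from the training split only
pipeline.fit(X_tr, y_tr)
proba = pipeline.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Because the pipeline accepts raw rows, new leads can be scored with `pipeline.predict_proba(...)` directly, without repeating the encoding and scaling steps by hand.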


## Step 3: Model Selection and Training

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [28]:
# Initialize the Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [29]:
# Train the model
model.fit(X_train, y_train)

In [30]:
# Make predictions
y_pred = model.predict(X_test)

In [31]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[138   3]
 [ 54   5]]





|                     | Predicted Negative | Predicted Positive |
|---------------------|-------------------:|-------------------:|
| **Actual Negative** |                138 |                  3 |
| **Actual Positive** |                 54 |                  5 |

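**Performance Metrics**

From the confusion matrix (TN = 138, FP = 3, FN = 54, TP = 5), we can derive the following performance metrics. Precision, recall, and F1 are computed for the positive (converted) class.

Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{5 + 138}{5 + 138 + 3 + 54} = \frac{143}{200} = 71.5\%$$

Precision:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{5}{5 + 3} = 62.5\%$$

Recall (Sensitivity):

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{5}{5 + 54} \approx 8.5\%$$

F1-Score:

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.625 \times 0.0847}{0.625 + 0.0847} \approx 14.9\%$$

The accuracy is close to the 70.5% share of non-converted leads in the test set (141 of 200), while the model identifies only 5 of the 59 leads that actually converted.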

In [32]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.98      0.83       141
           1       0.62      0.08      0.15        59

    accuracy                           0.71       200
   macro avg       0.67      0.53      0.49       200
weighted avg       0.69      0.71      0.63       200
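The class-1 row of the report can be checked by hand from the confusion matrix printed earlier. The short sketch below recomputes the four metrics with NumPy. It relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, so `ravel()` yields TN, FP, FN, TP in that order.

```python
import numpy as np

# Confusion matrix from the evaluation cell: rows = actual, columns = predicted
cm = np.array([[138, 3],
               [54, 5]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

This gives an accuracy of 0.715, precision of 0.625, recall of 0.085, and F1 of 0.149, in line with the rounded values in the classification report.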



## Step 4: Lead Scoring

Now we can use the trained model to score new leads.

In [35]:
# Example new leads
new_leads_data = pd.DataFrame({
    'lead_id': [1001, 1002],
    'age': [30, 45],
    'company_size': ['Medium', 'Large'],
    'industry': ['Tech', 'Finance'],
    'email_open_rate': [0.8, 0.3],
    'website_visits': [20, 5],
    'days_since_last_contact': [2, 15],
    'previous_interactions': [3, 1]
})


In [38]:
# Function to score new leads
def score_leads(new_leads):
    # Select only the columns used during training
    training_columns = ['age', 'company_size', 'industry', 'email_open_rate', 'website_visits', 'days_since_last_contact', 'previous_interactions']
    new_leads_subset = new_leads[training_columns].copy()

    new_leads_encoded = new_leads_subset.copy()
    for column in ['company_size', 'industry']:
        new_leads_encoded[column] = label_encoders[column].transform(new_leads_encoded[column])

    new_leads_scaled = scaler.transform(new_leads_encoded)
    scores = model.predict_proba(new_leads_scaled)[:, 1]  # Probability of conversion
    return scores

In [39]:
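# Score the example leads and attach the probabilities as a new column
new_leads_data['conversion_probability'] = score_leads(new_leads_data)
print(new_leads_data[['lead_id', 'conversion_probability']])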

   lead_id  conversion_probability
0     1001                    0.35
1     1002                    0.18
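To turn these probabilities into something a sales team can act on, the scores can be ranked and grouped into tiers. The sketch below uses the two scores printed above. The tier names and the cut-offs at 0.2 and 0.5 are hypothetical and would need to be chosen from real conversion data.

```python
import pandas as pd

# Scores taken from the output above
scored = pd.DataFrame({
    'lead_id': [1001, 1002],
    'conversion_probability': [0.35, 0.18],
})

# Hypothetical cut-offs: [0, 0.2] Cold, (0.2, 0.5] Warm, (0.5, 1] Hot
bins = [0.0, 0.2, 0.5, 1.0]
labels = ['Cold', 'Warm', 'Hot']
scored['tier'] = pd.cut(
    scored['conversion_probability'],
    bins=bins,
    labels=labels,
    include_lowest=True,
)

# Highest-scoring leads first
ranked = scored.sort_values('conversion_probability', ascending=False)
ranked = ranked.reset_index(drop=True)
print(ranked)
```

With these cut-offs, lead 1001 lands in the Warm tier and lead 1002 in the Cold tier.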


In [40]:
import plotly.graph_objects as go

# Compare predicted and actual labels for each row of the test set.
# The x-axis is the row position within the test set, not the lead_id.
positions = np.arange(len(y_test))
fig = go.Figure(data=[
    go.Bar(name='Predicted Conversion', x=positions, y=y_pred),
    go.Bar(name='Actual Conversion', x=positions, y=y_test),
])

fig.update_layout(title='Predicted vs Actual Conversion',
                  barmode='group',
                  xaxis_title='Test-set row',
                  yaxis_title='Conversion (0 or 1)')
fig.show()

This project develops a lead scoring model using machine learning, specifically a Random Forest Classifier, to predict the likelihood of a lead converting into a customer.  The model is trained on synthetic lead data, encompassing various attributes like demographics (age, company size, industry), engagement metrics (email open rate, website visits), and interaction history (days since last contact, previous interactions).

**Project Summary:**

1. **Data Generation:** A dataset of 1,000 synthetic leads is created with randomly drawn attributes and a conversion label assigned at random with a 30% positive rate.

2. **Exploratory Data Analysis (EDA):**  Histograms, bar charts, pie charts, and a correlation heatmap are used to visualize the distribution of lead characteristics and identify potential relationships between variables. Box plots compare the distribution of numerical features for converted and non-converted leads.  This step provides insights into lead behavior and informs feature selection.

3. **Data Preprocessing:** Categorical features are label-encoded, and all features are then standardized. Tree-based models such as Random Forest do not require scaling, but it keeps the preprocessing reusable for other model types. The dataset is then split 80/20 into training and testing sets.

4. **Model Training and Evaluation:** A Random Forest Classifier is trained on the preprocessed training data. The model's predictive performance is assessed on the test set using a confusion matrix, accuracy, precision, recall, F1-score, and a classification report. On the 200 test leads the model reaches about 71.5% accuracy but only about 8.5% recall for converted leads. A bar chart of predicted versus actual conversions complements these metrics.

5. **Lead Scoring:** A function is created to use the trained model to generate lead scores (conversion probabilities) for new leads.  This allows sales and marketing teams to prioritize higher-potential leads.


**Real-Life Applications:**

* **Sales and Marketing Prioritization:** The model allows sales and marketing teams to focus their efforts on leads with a higher probability of conversion, improving resource allocation and increasing sales efficiency.

* **Lead Qualification:**  It can automate the process of lead qualification, identifying promising leads early in the sales cycle and reducing wasted time on low-potential prospects.

* **Personalized Marketing Campaigns:**  Lead scores can be used to tailor marketing messages and offers to specific lead segments, improving engagement and conversion rates.

* **Sales Forecasting:**  By analyzing lead scores and conversion trends, businesses can gain a better understanding of their sales pipeline and develop more accurate sales forecasts.

* **Customer Relationship Management (CRM) Integration:** The lead scoring model can be integrated into a CRM system to automate lead scoring and provide sales representatives with real-time insights into lead potential.
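For the CRM integration scenario, the fitted preprocessing objects and the model have to be saved so that a separate scoring job can load them. A minimal sketch with `joblib` follows. The artifact file name, the stand-in data, and the `_demo` variable names are assumptions for illustration, and a real integration would also need to persist the label encoders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in training data (3 numeric features, binary target)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestClassifier(n_estimators=20, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

# Bundle everything needed at scoring time into one artifact
artifact_path = os.path.join(tempfile.mkdtemp(), 'lead_scoring_artifacts.joblib')
joblib.dump({'scaler': scaler_demo, 'model': model_demo}, artifact_path)

# Later, for example inside a scheduled scoring job
loaded = joblib.load(artifact_path)
scores = loaded['model'].predict_proba(
    loaded['scaler'].transform(X_demo[:5])
)[:, 1]
print(scores)
```

Saving the scaler together with the model keeps the scoring job from applying preprocessing that differs from what was used in training.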


**Potential Improvements:**

* **Use Real-World Data:** Replacing the synthetic data with real lead data from a company's own systems is the most important step, since the synthetic conversion labels are random and carry no learnable signal.

* **Feature Engineering:** Creating new features from existing ones or incorporating external data sources can improve the model's predictive power.

* **Model Tuning:** Optimizing model hyperparameters using techniques like grid search or randomized search can lead to better performance.

* **Regular Model Retraining:** Periodically retraining the model with new data will ensure its accuracy and relevance over time.
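The model tuning suggestion could be sketched with scikit-learn's `GridSearchCV` as follows. The parameter grid, the F1 scoring choice, and the `make_classification` stand-in data are assumptions for illustration; on the project's data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: 7 features and roughly a 70/30 class split
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=3,
    weights=[0.7, 0.3], random_state=42,
)

# Hypothetical search space; 'balanced' reweights the minority class
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1',  # F1 on the positive class rather than plain accuracy
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on F1 instead of accuracy is one way to keep the search from favoring models that mostly predict the majority class.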


This project provides a working end-to-end template for a lead scoring system that, once trained on real data, can help sales and marketing teams prioritize their effort.
