# Abstract & Imports

Abstract:

This notebook explores the use of a Random Forest Classifier within the sales team of a B2B SaaS company. Sales teams hold large amounts of data on current clients and prospective clients (leads). This notebook uses subsets of that data to predict which leads would be most worthwhile to pursue. In practice, such a model could lower the workload of a sales team by narrowing the pool of leads to the more attainable prospects.

Note: All datasets used in this notebook have been simulated. The primary goal of this notebook is to showcase a proof of concept for the model.

In [2]:
import os
import time
from datetime import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import sklearn

import statsmodels.api as sm

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
os.getcwd()

'/content'

In [4]:
data_path = '/content/drive/MyDrive/Colab Notebooks/Customer Predictor/Data'

# A Look at the Data

client_leads_list.csv contains each lead that our sales team has acted upon.

The columns contain information as follows:
*    Company Industry: The industry that the company works in
*    Employee Count: The number of employees working for the company
*    Company HQ Location: The location of the company's HQ
*    Year Founded: The year the company was founded
*    POC Title: The title of the point of contact for the lead
*    Method of Sourcing: Whether this was an inbound or outbound lead
*    Client: Whether this lead was converted into a customer (1) or not (0)
*    Train/Test: Whether this row belongs to the model's training set or testing set

In [5]:
leads_list = pd.read_csv(os.path.join(data_path, 'client_leads_list.csv'))

In [6]:
leads_list

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing,Client,Train/Test
0,Sports,11-100,"Los Angeles, CA",2019,Senior Manager,Inbound,0,Train
1,Technology,11-100,"Los Angeles, CA",2017,President,Outbound,0,Train
2,Media/Entertainment,11-100,"Seattle, WA",2020,President,Outbound,0,Train
3,Ecommerce,101-500,"San Fransisco, CA",2017,President,Outbound,1,Train
4,Consulting,1-10,"Chicago, IL",2016,Manager,Outbound,1,Train
...,...,...,...,...,...,...,...,...
294,Technology,11-100,"Los Angeles, CA",2011,Manager,Outbound,0,Test
295,Ecommerce,11-100,"Los Angeles, CA",2020,Senior Vice President,Inbound,1,Test
296,Ecommerce,11-100,"New York City, NY",2012,Founder & CEO,Outbound,0,Test
297,Ecommerce,1-10,"New York City, NY",2021,Vice President,Outbound,0,Test


In [7]:
leads_list.dtypes

Unnamed: 0,0
Company Industry,object
Employee Count,object
Company HQ Location,object
Year Founded,int64
POC Title,object
Method of Sourcing,object
Client,int64
Train/Test,object


In [8]:
# The random forest classifier only accepts numeric input, so we convert each categorical column to integer codes
leads_list['industry_code'] = leads_list['Company Industry'].astype('category').cat.codes
leads_list['num_employees_code'] = leads_list['Employee Count'].astype('category').cat.codes
leads_list['location_code'] = leads_list['Company HQ Location'].astype('category').cat.codes
leads_list['title_code'] = leads_list['POC Title'].astype('category').cat.codes
leads_list['source_code'] = leads_list['Method of Sourcing'].astype('category').cat.codes
leads_list

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing,Client,Train/Test,industry_code,num_employees_code,location_code,title_code,source_code
0,Sports,11-100,"Los Angeles, CA",2019,Senior Manager,Inbound,0,Train,13,2,3,9,0
1,Technology,11-100,"Los Angeles, CA",2017,President,Outbound,0,Train,14,2,3,7,1
2,Media/Entertainment,11-100,"Seattle, WA",2020,President,Outbound,0,Train,10,2,7,7,1
3,Ecommerce,101-500,"San Fransisco, CA",2017,President,Outbound,1,Train,2,1,6,7,1
4,Consulting,1-10,"Chicago, IL",2016,Manager,Outbound,1,Train,1,0,2,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,Technology,11-100,"Los Angeles, CA",2011,Manager,Outbound,0,Test,14,2,3,6,1
295,Ecommerce,11-100,"Los Angeles, CA",2020,Senior Vice President,Inbound,1,Test,2,2,3,10,0
296,Ecommerce,11-100,"New York City, NY",2012,Founder & CEO,Outbound,0,Test,2,2,5,5,1
297,Ecommerce,1-10,"New York City, NY",2021,Vice President,Outbound,0,Test,2,0,5,11,1
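A note on how `cat.codes` numbers the categories: codes follow the sorted order of the labels as strings, which is why `101-500` receives code 1 and `11-100` receives code 2 in the output above. The following is a minimal sketch, assuming we would rather have codes that follow company size. The `size_order` list is an assumption made for illustration, built from employee bands visible in the data, and is not part of the original notebook.

```python
import pandas as pd

# Toy column using employee bands that appear in the notebook's data
sizes = pd.Series(['11-100', '1-10', '101-500', '501-1000', '1-10'])

# Default behaviour: categories are sorted as strings (lexicographically)
default_codes = sizes.astype('category').cat.codes

# Assumed ordering from smallest to largest company (hypothetical, for illustration)
size_order = ['1-10', '11-100', '101-500', '501-1000']
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
ordered_codes = pd.Series(ordered.codes)

print(default_codes.tolist())  # codes by string order
print(ordered_codes.tolist())  # codes by company size
```

Tree-based models split on numeric thresholds, so an encoding that respects the natural order of the bands may give the trees more meaningful splits than an arbitrary one.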


# Random Forest Classifier Model

In [9]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50,
                            min_samples_split=10,
                            random_state=448)

train = leads_list[leads_list['Train/Test'] == 'Train']

test = leads_list[leads_list['Train/Test'] == 'Test']

predictors = ['Year Founded', 'industry_code', 'num_employees_code', 'location_code', 'title_code', 'source_code']

In [10]:
rf.fit(train[predictors], train['Client'])

In [11]:
preds = rf.predict(test[predictors])

In [12]:
# Check that accuracy and precision on the test rows are acceptable before using the model to predict
from sklearn.metrics import accuracy_score

acc = accuracy_score(test['Client'], preds)

acc

0.75

In [13]:
from sklearn.metrics import precision_score

precision_score(test['Client'], preds)

0.6470588235294118
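Accuracy and precision do not show how many real customers the model misses. Recall and a confusion matrix cover that gap. The sketch below uses hypothetical labels for illustration only; it does not reproduce the notebook's test set or its results.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only; not the notebook's test set
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall = TP / (TP + FN): the share of real customers the model caught
recall = recall_score(y_true, y_pred)

print(cm)
print(recall)
```

In this notebook the equivalent calls would take `test['Client']` and `preds` as their two arguments.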

# Prediction

leads_list_needs_prediction.csv contains 100 data points for leads that our sales team has yet to act on. We will run it through our model to see which leads it recommends acting upon.

The columns are the same as in our previous dataset, except that Client and Train/Test are absent.

In [14]:
prospect_list = pd.read_csv(os.path.join(data_path, 'leads_list_needs_prediction.csv'))
prospect_list

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing
0,Food & Beverage,1-10,"Los Angeles, CA",2012,Manager,Outbound
1,Ecommerce,501-1000,"Seattle, WA",2021,Senior Vice President,Outbound
2,Consulting,11-100,"Los Angeles, CA",2023,Manager,Outbound
3,Ecommerce,1-10,"Los Angeles, CA",2009,COO,Outbound
4,Technology,1-10,"New York City, NY",2013,Director,Outbound
...,...,...,...,...,...,...
95,Consulting,11-100,"Los Angeles, CA",2013,COO,Outbound
96,Consulting,11-100,"Los Angeles, CA",2015,Vice President,Outbound
97,Ecommerce,11-100,"Los Angeles, CA",2014,COO,Outbound
98,Ecommerce,11-100,"Boston, MA",2018,President,Outbound


In [15]:
prospect_list.dtypes

Unnamed: 0,0
Company Industry,object
Employee Count,object
Company HQ Location,object
Year Founded,int64
POC Title,object
Method of Sourcing,object


In [16]:
prospect_list['industry_code'] = prospect_list['Company Industry'].astype('category').cat.codes
prospect_list['num_employees_code'] = prospect_list['Employee Count'].astype('category').cat.codes
prospect_list['location_code'] = prospect_list['Company HQ Location'].astype('category').cat.codes
prospect_list['title_code'] = prospect_list['POC Title'].astype('category').cat.codes
prospect_list['source_code'] = prospect_list['Method of Sourcing'].astype('category').cat.codes
prospect_list

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing,industry_code,num_employees_code,location_code,title_code,source_code
0,Food & Beverage,1-10,"Los Angeles, CA",2012,Manager,Outbound,6,0,3,6,1
1,Ecommerce,501-1000,"Seattle, WA",2021,Senior Vice President,Outbound,2,4,6,9,1
2,Consulting,11-100,"Los Angeles, CA",2023,Manager,Outbound,1,3,3,6,1
3,Ecommerce,1-10,"Los Angeles, CA",2009,COO,Outbound,2,0,3,2,1
4,Technology,1-10,"New York City, NY",2013,Director,Outbound,13,0,4,4,1
...,...,...,...,...,...,...,...,...,...,...,...
95,Consulting,11-100,"Los Angeles, CA",2013,COO,Outbound,1,3,3,2,1
96,Consulting,11-100,"Los Angeles, CA",2015,Vice President,Outbound,1,3,3,10,1
97,Ecommerce,11-100,"Los Angeles, CA",2014,COO,Outbound,2,3,3,2,1
98,Ecommerce,11-100,"Boston, MA",2018,President,Outbound,2,3,1,7,1
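One caveat with the encoding above: the codes are derived from the categories present in this file alone, so they may not match the codes the model was trained on. In the outputs shown, Technology is coded 14 in the training data but 13 here, and 11-100 is coded 2 there but 3 here. A minimal sketch of one way to keep the mapping consistent, assuming the category list is learned from the training data and reused; the toy frames and the `industry_dtype` name are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two files; values are illustrative only
train_df = pd.DataFrame({'Company Industry': ['Sports', 'Technology', 'Ecommerce', 'Consulting']})
new_df = pd.DataFrame({'Company Industry': ['Technology', 'Ecommerce', 'Food & Beverage']})

# Learn the category list once, from the training data
industry_dtype = pd.CategoricalDtype(categories=sorted(train_df['Company Industry'].unique()))

# Apply the same dtype to both frames so each label maps to one code everywhere
train_df['industry_code'] = train_df['Company Industry'].astype(industry_dtype).cat.codes
new_df['industry_code'] = new_df['Company Industry'].astype(industry_dtype).cat.codes

print(train_df)
print(new_df)  # labels never seen in training receive code -1
```

The same pattern would apply to each of the other encoded columns. Rows with a code of -1 carry a label the model never saw, so their predictions deserve extra scrutiny.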


In [17]:
predictors = ['Year Founded', 'industry_code', 'num_employees_code', 'location_code', 'title_code', 'source_code']

prospect_list['Predicted Outcome'] = rf.predict(prospect_list[predictors])

prospect_list

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing,industry_code,num_employees_code,location_code,title_code,source_code,Predicted Outcome
0,Food & Beverage,1-10,"Los Angeles, CA",2012,Manager,Outbound,6,0,3,6,1,0
1,Ecommerce,501-1000,"Seattle, WA",2021,Senior Vice President,Outbound,2,4,6,9,1,0
2,Consulting,11-100,"Los Angeles, CA",2023,Manager,Outbound,1,3,3,6,1,0
3,Ecommerce,1-10,"Los Angeles, CA",2009,COO,Outbound,2,0,3,2,1,0
4,Technology,1-10,"New York City, NY",2013,Director,Outbound,13,0,4,4,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
95,Consulting,11-100,"Los Angeles, CA",2013,COO,Outbound,1,3,3,2,1,0
96,Consulting,11-100,"Los Angeles, CA",2015,Vice President,Outbound,1,3,3,10,1,0
97,Ecommerce,11-100,"Los Angeles, CA",2014,COO,Outbound,2,3,3,2,1,0
98,Ecommerce,11-100,"Boston, MA",2018,President,Outbound,2,3,1,7,1,0


In [18]:
# The subset of leads that our model flags as worth acting upon
prospect_list[prospect_list['Predicted Outcome'] == 1]

Unnamed: 0,Company Industry,Employee Count,Company HQ Location,Year Founded,POC Title,Method of Sourcing,industry_code,num_employees_code,location_code,title_code,source_code,Predicted Outcome
13,Ecommerce,1-10,"Los Angeles, CA",2006,Director,Outbound,2,0,3,4,1,1
14,Finance,11-100,"Los Angeles, CA",2021,COO,Inbound,5,3,3,2,0,1
17,Consulting,1-10,"Chicago, IL",2017,Director,Inbound,1,0,2,4,0,1
22,Consulting,1-10,"Chicago, IL",2012,Senior Manager,Inbound,1,0,2,8,0,1
30,Ecommerce,1-10,"Los Angeles, CA",2020,COO,Outbound,2,0,3,2,1,1
31,Ecommerce,11-100,"Los Angeles, CA",2022,Manager,Inbound,2,3,3,6,0,1
36,Marketing,1-10,"Los Angeles, CA",2018,COO,Inbound,8,0,3,2,0,1
45,Consulting,1-10,"Los Angeles, CA",2021,Senior Manager,Outbound,1,0,3,8,1,1
48,Ecommerce,1-10,"Los Angeles, CA",2020,COO,Outbound,2,0,3,2,1,1
52,Ecommerce,1-10,"San Fransisco, CA",2022,Director,Outbound,2,0,5,4,1,1


# Pitfalls & Areas of Improvement/Exploration

The most obvious area for improvement is to use real-world data instead of simulated data, in order to understand whether this model would be beneficial for an actual sales team to employ.

Our model returned 18% of leads as worthwhile to pursue, so we need to worry about Type II errors (false negatives): leads the model rejects that would in fact have converted. No company can afford to leave money on the table in this way, so correcting for these errors should be a priority.
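One possible way to trade some precision for fewer missed leads is to lower the probability cutoff used to flag a lead. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the leads table, and the 0.3 cutoff is an arbitrary example value rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the leads table (illustration only)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=448)
model.fit(X_train, y_train)

# Probability of the positive class (column 1) for each test row
proba = model.predict_proba(X_test)[:, 1]

# predict() behaves approximately like a 0.5 cutoff; a lower cutoff flags more leads
default_preds = (proba >= 0.5).astype(int)
lenient_preds = (proba >= 0.3).astype(int)  # 0.3 is an arbitrary example value

print(recall_score(y_test, default_preds), recall_score(y_test, lenient_preds))
```

Lowering the cutoff can only keep recall the same or raise it, at the cost of more false positives, so the right value depends on how expensive a wasted sales call is compared with a missed customer.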

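It may also be worth evaluating other ML algorithms, such as Decision Trees or K-Means, in place of the Random Forest Classifier. Note that K-Means is an unsupervised clustering method, so it would have to be used differently, for example to segment leads rather than to predict conversion directly.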
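A comparison between candidate classifiers could be sketched with cross-validation. This is a minimal example under stated assumptions: the data is synthetic, the `candidates` dictionary is a hypothetical helper, and recall is chosen as the scoring metric only because missed leads are the concern raised above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the leads table (illustration only)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical set of models to compare
candidates = {
    'decision_tree': DecisionTreeClassifier(random_state=448),
    'random_forest': RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=448),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validated recall for each candidate
    scores = cross_val_score(model, X, y, cv=5, scoring='recall')
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```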