<img src="images/upGrad.png" alt="upGrad" align="Left" style="width: 200px;"/>
<img src="images/IIITB.jpeg" alt="IIITB" align="Right" style="width: 200px;"/>

# Lead Scoring Case Study

<b>Authors:</b> Anish Mahapatra, Karthik Premanand

<i>Machine Learning I > Group Case Study I </i>

# Problem Statement

The company requires a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The target lead conversion rate is around 80%.

### Goals of the Case Study:
- Build a logistic regression model that assigns each lead a score between 0 and 100
- The company has presented some additional problems; the model should be able to adapt to these if the company's requirements change in the future
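
As a minimal sketch of the first goal, a conversion probability can be scaled to a 0-100 score. This assumes the score is simply the probability multiplied by 100 and rounded; the helper name `to_lead_score` and the sample probabilities are hypothetical and do not come from the fitted model.

```python
import numpy as np

def to_lead_score(probabilities):
    """Scale conversion probabilities in [0, 1] to integer scores in [0, 100]."""
    probs = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return np.rint(probs * 100).astype(int)

# Hypothetical predicted probabilities, e.g. the positive-class column
# of a fitted logistic regression's predict_proba output
sample_probs = [0.03, 0.5, 0.87, 1.0]
lead_scores = to_lead_score(sample_probs)
print(lead_scores)
```

A higher score then corresponds directly to a higher predicted conversion chance, which is the ordering the problem statement asks for.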

### Evaluation Rubric:
- Data Quality Checks Performed
- Dummy Variables Created
- Feature Engineering, if required
- Clean Analytical Dataset
- Tuning model parameters
- Correct variable selection technique
- Model Evaluation
- Model explained properly
- Commented code that includes a brief explanation of the important variables and the model in simple terms

In [48]:
# Importing the required packages

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
import warnings

In [49]:
# Setting the maximum number of displayed columns and rows to 500
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Ignoring warnings
warnings.filterwarnings("ignore")

In [50]:
# Reading the .csv as a pandas dataframe
leadsData = pd.read_csv("Data/Leads.csv")

In [51]:
# Sense Check of the Data
leadsData.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Select,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Select,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [52]:
# Viewing the shape of the data
leadsData.shape

(9240, 37)

In [53]:
# Information about the data
leadsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
Prospect ID                                      9240 non-null object
Lead Number                                      9240 non-null int64
Lead Origin                                      9240 non-null object
Lead Source                                      9204 non-null object
Do Not Email                                     9240 non-null object
Do Not Call                                      9240 non-null object
Converted                                        9240 non-null int64
TotalVisits                                      9103 non-null float64
Total Time Spent on Website                      9240 non-null int64
Page Views Per Visit                             9103 non-null float64
Last Activity                                    9137 non-null object
Country                                          6779 non-null object
Specialization                                   7802 

We have 9240 rows and 37 columns of string(30), integer(3) and float(4) data.
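
The dtype breakdown above can be tallied directly rather than read off `info()`. The sketch below uses a hypothetical three-row stand-in for `leadsData` (the column names are from the dataset, the frame itself is illustrative only).

```python
import pandas as pd

# Hypothetical three-row stand-in for leadsData
toy = pd.DataFrame({
    "Lead Origin": ["API", "API", "Landing Page Submission"],
    "Converted": [0, 0, 1],
    "TotalVisits": [0.0, 5.0, 2.0],
})

# Count how many columns there are of each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```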

Analyzing the data and the data dictionary, here are a few observations regarding the data:

**Unique Identifiers (to be analyzed)**:
1. Prospect ID
2. Lead Number
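
One way to analyze the identifier columns is to check that they contain no repeated values. This is a sketch on the first three rows shown above, not on the full dataset; the variable name `id_is_unique` is an illustrative choice.

```python
import pandas as pd

# First three rows of the two identifier columns, copied from the head() output
toy = pd.DataFrame({
    "Prospect ID": [
        "7927b2df-8bba-4d29-b9a2-b6e0beafe620",
        "2a272436-5132-4136-86fa-dcc88c88f482",
        "8cc8c611-a219-4f35-ad23-fdfd2656bd8a",
    ],
    "Lead Number": [660737, 660728, 660727],
})

# True for a column means no value is repeated in it
id_is_unique = {col: toy[col].is_unique for col in ["Prospect ID", "Lead Number"]}
print(id_is_unique)
```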

**Most recent notable activity (one-hot encoding)**:
1. Last Notable Activity

**One-hot encoding can be performed on**: 
1. Lead Origin 
2. Lead Source
3. Last Activity
4. Country
5. Specialization
6. How did you hear about X Education
7. What is your current occupation
8. What matters most to you in choosing a course
9. Tags
10. Lead Profile
11. City
12. Asymmetrique Activity Index
13. Asymmetrique Profile Index
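
A minimal sketch of one-hot encoding with `pd.get_dummies`, shown on a hypothetical three-row frame. Using `drop_first=True` is an assumption made here to avoid perfectly collinear dummy columns in a regression; the notebook text above does not prescribe it.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Lead Origin": ["API", "Landing Page Submission", "API"],
    "TotalVisits": [0.0, 2.0, 5.0],
})

# drop_first=True drops the first level ("API") so the dummies are not collinear
encoded = pd.get_dummies(toy, columns=["Lead Origin"], drop_first=True)
print(encoded.columns.tolist())
```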

**Ordinal Categorical Variables**: 
1. Lead Quality

**Binary Conversion of Yes and No can be used**:
1. Do Not Email
2. Do Not Call
3. Search
4. Magazine
5. Newspaper Article
6. X Education Forums
7. Newspaper
8. Digital Advertisement
9. Through Recommendations
10. Receive More Updates About Our Courses
11. Update me on Supply Chain Content
12. Get updates on DM Content
13. A free copy of Mastering The Interview
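
The Yes/No conversion could be sketched as a simple mapping. The frame below is a hypothetical stand-in with two of the listed columns; `binary_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in with two of the Yes/No columns
toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Search": ["No", "No", "Yes"],
})

binary_cols = ["Do Not Email", "Search"]
for col in binary_cols:
    # Any value other than "Yes"/"No" would become NaN and should be inspected
    toy[col] = toy[col].map({"Yes": 1, "No": 0})
print(toy)
```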

**Not useful**: 
1. I agree to pay the amount through cheque (Reason: All "No" - no variance)
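
Columns with no variance can also be found programmatically instead of by inspection. A sketch on a hypothetical frame, where `constant_cols` is an illustrative name:

```python
import pandas as pd

# Hypothetical stand-in: the first column is constant, the second is not
toy = pd.DataFrame({
    "I agree to pay the amount through cheque": ["No", "No", "No"],
    "A free copy of Mastering The Interview": ["No", "No", "Yes"],
})

# A column holding a single distinct value carries no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```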

**Dependent Feature**: 
1. Converted

**Numeric Variables**:
1. TotalVisits
2. Total Time Spent on Website
3. Page Views Per Visit
4. Asymmetrique Activity Score
5. Asymmetrique Profile Score


In [55]:
df = leadsData.copy(deep = True)

### Missing Value Analysis

In [56]:
# Calculating the percent of missing values in the dataframe
percentMissing = (df.isnull().sum() / len(df)) * 100

# Collecting the column names and missing-value percentages in a dataframe (easier to scan, given the large number of columns)
missingValuesDf = pd.DataFrame({'Column Name': df.columns,
                                 'Percent of data missing': percentMissing})

In [57]:
# Viewing the missing-value percentages
missingValuesDf

Unnamed: 0,Column Name,Percent of data missing
Prospect ID,Prospect ID,0.0
Lead Number,Lead Number,0.0
Lead Origin,Lead Origin,0.0
Lead Source,Lead Source,0.38961
Do Not Email,Do Not Email,0.0
Do Not Call,Do Not Call,0.0
Converted,Converted,0.0
TotalVisits,TotalVisits,1.482684
Total Time Spent on Website,Total Time Spent on Website,0.0
Page Views Per Visit,Page Views Per Visit,1.482684


The rule of thumb followed here is to drop any column with more than roughly 10-13% of its data missing; in practice, 13% is used as the cutoff.

The columns that exceed this threshold are:
1. Country (26%)
2. Specialization (15%)
3. How did you hear about X Education (23.88%)
4. What is your current occupation (29%)
5. What matters most to you in choosing a course (29%)
6. Tags (36%)
7. Lead Quality (51%)
8. Lead Profile (29%)
9. City (15%)
10. Asymmetrique Activity Index (45%)
11. Asymmetrique Profile Index (45%)
12. Asymmetrique Activity Score (45%)
13. Asymmetrique Profile Score (45%)

Dropping these 13 of the 37 columns on account of missing values will leave us with 24 columns.
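
Instead of listing the columns by hand, the same rule can be applied programmatically. The sketch below runs on a hypothetical four-row frame; `threshold` and `cols_to_drop` are illustrative names, and the 13% cutoff mirrors the rule of thumb stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with varying amounts of missing data
toy = pd.DataFrame({
    "Lead Source": ["Google", "Olark Chat", "Google", "Direct Traffic"],
    "Lead Quality": ["Might be", np.nan, np.nan, np.nan],
    "City": ["Mumbai", np.nan, "Mumbai", "Mumbai"],
})

threshold = 13  # percent of missing values above which a column is dropped
percent_missing = toy.isnull().mean() * 100
cols_to_drop = percent_missing[percent_missing > threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Deriving the list from the threshold keeps the dropped columns in sync with the stated rule if the cutoff is changed later.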

In [58]:
df.shape

(9240, 37)

#### Dropping the columns that have over 13% missing values

In [59]:
# Dropping the selected columns
df = df.drop(columns=[
    'Country', 'How did you hear about X Education', 'What is your current occupation',
    'What matters most to you in choosing a course', 'Tags', 'Lead Quality', 'Lead Profile', 'City',
    'Asymmetrique Activity Index', 'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
    'Asymmetrique Profile Score', 'Specialization',
])

In [60]:
# Viewing the shape of the data
df.shape

(9240, 24)

In [61]:
# Calculating the percent of missing values in the dataframe
percentMissing = (df.isnull().sum() / len(df)) * 100

# Collecting the column names and missing-value percentages in a dataframe (easier to scan, given the large number of columns)
missingValuesDf = pd.DataFrame({'Column Name': df.columns,
                                 'Percent of data missing': percentMissing})

In [62]:
missingValuesDf

Unnamed: 0,Column Name,Percent of data missing
Prospect ID,Prospect ID,0.0
Lead Number,Lead Number,0.0
Lead Origin,Lead Origin,0.0
Lead Source,Lead Source,0.38961
Do Not Email,Do Not Email,0.0
Do Not Call,Do Not Call,0.0
Converted,Converted,0.0
TotalVisits,TotalVisits,1.482684
Total Time Spent on Website,Total Time Spent on Website,0.0
Page Views Per Visit,Page Views Per Visit,1.482684


We have now removed the columns that have more than 13% of their data missing.

In [63]:
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,No,No,No,No,No,No,No,No,No,No,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,No,No,No,No,No,No,No,No,No,No,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,No,No,No,No,No,No,No,No,No,No,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,No,No,No,No,No,No,No,No,No,No,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,No,No,No,No,No,No,No,No,No,No,No,No,Modified


In [64]:
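# Reordering the columns so that the target 'Converted' comes last;
# 'I agree to pay the amount through cheque' is left out because it has no variance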
cols = ['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source', 'Do Not Email', 'Do Not Call', 'TotalVisits',
        'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity', 'Search', 'Magazine', 'Newspaper Article',
        'X Education Forums', 'Newspaper', 'Digital Advertisement', 'Through Recommendations',
        'Receive More Updates About Our Courses', 'Update me on Supply Chain Content', 'Get updates on DM Content',
        'A free copy of Mastering The Interview', 'Last Notable Activity','Converted']

In [66]:
df = df[cols]

In [67]:
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,A free copy of Mastering The Interview,Last Notable Activity,Converted
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0.0,0,0.0,Page Visited on Website,No,No,No,No,No,No,No,No,No,No,No,Modified,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,5.0,674,2.5,Email Opened,No,No,No,No,No,No,No,No,No,No,No,Email Opened,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,2.0,1532,2.0,Email Opened,No,No,No,No,No,No,No,No,No,No,Yes,Email Opened,1
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,1.0,305,1.0,Unreachable,No,No,No,No,No,No,No,No,No,No,No,Modified,0
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,2.0,1428,1.0,Converted to Lead,No,No,No,No,No,No,No,No,No,No,No,Modified,1
