# Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on its website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if it acquires 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If it successfully identifies this set of leads, the lead conversion rate should go up, because the sales team will focus on communicating with the promising leads rather than calling everyone.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
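
To make the arithmetic concrete, the sketch below uses ten made-up leads (three of which converted, matching the 30% baseline) and compares the overall conversion rate with the rate among leads above a score cutoff. The scores, the outcomes, and the cutoff of 75 are illustrative assumptions, not values from the X Education data.

```python
# Hypothetical leads as (lead_score, converted) pairs; all values are made up.
leads = [(92, 1), (85, 1), (81, 0), (77, 1), (64, 0),
         (58, 0), (41, 0), (35, 0), (22, 0), (10, 0)]


def conversion_rate(rows):
    """Return the share of leads in `rows` that converted."""
    return sum(flag for _, flag in rows) / len(rows)


overall = conversion_rate(leads)

# Treat leads at or above an arbitrary cutoff as "hot" and contact only those.
hot = [row for row in leads if row[0] >= 75]
hot_rate = conversion_rate(hot)

print(f"all leads: {overall:.0%}, hot leads: {hot_rate:.0%}")
```

If the scores rank leads well, the rate among hot leads sits well above the overall rate, which is the effect the company is after.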

# Importing libraries

In [251]:
import warnings

warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 100)

# Reading the dataset

In [252]:
lead_ds=pd.read_csv("Leads.csv")
lead_ds.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Select,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Select,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


# Inspecting the dataset

In [253]:
temp_ds=lead_ds.copy()  # independent backup; plain assignment would only alias lead_ds
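
Plain assignment binds a second name to the same DataFrame, whereas `.copy()` takes an independent snapshot. A minimal sketch on a made-up frame shows the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df           # second name for the same object
backup = df.copy()   # independent copy of the data

# Modify a column through the original name.
df["a"] = df["a"] * 10

print(alias is df)             # the alias is the very same object
print(alias["a"].tolist())     # so it reflects the change
print(backup["a"].tolist())    # the copy keeps the original values
```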

In [254]:
lead_ds.columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about X Education',
       'What is your current occupation',
       'What matters most to you in choosing a course', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score',
       'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

In [255]:
lead_ds.shape

(9240, 37)

In [256]:
lead_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [257]:
lead_ds.isna().sum()

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

# Replacing the Select value with null

In [258]:
def updateSelectAsNan(col):
    lead_ds[col]=lead_ds[col].replace("Select",np.nan)

In [259]:
for column in lead_ds.columns:
    updateSelectAsNan(column)
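
The same substitution can be done in a single call, since `DataFrame.replace` works across all columns at once. A minimal sketch on a made-up frame (the column values here are illustrative, not taken from `Leads.csv`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", "Select"],
    "City": ["Mumbai", "Select", "Thane & Outskirts"],
    "TotalVisits": [0.0, 5.0, 2.0],
})

# Replace the placeholder "Select" with NaN in every column; returns a new frame.
cleaned = toy.replace("Select", np.nan)

print(cleaned.isna().sum())
```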

In [260]:
lead_ds.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,,,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,,,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,,,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,,,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [261]:
lead_ds["Specialization"].isna().sum()

3380

# Dropping all columns in which 40% or more of the values are null

In [262]:
def dropColumn(col,data,colList):
    # Flag the column when at least 40% of its values are null
    if data[col].isna().sum()/len(data) >= 0.4 :
        colList.append(col)

In [263]:
colList=[]
for column in lead_ds.columns:
    dropColumn(column,lead_ds,colList)

In [264]:
colList

['How did you hear about X Education',
 'Lead Quality',
 'Lead Profile',
 'Asymmetrique Activity Index',
 'Asymmetrique Profile Index',
 'Asymmetrique Activity Score',
 'Asymmetrique Profile Score']

In [265]:
lead_ds=lead_ds.drop(colList,axis=1)
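
The null share of each column can also be read directly from `isna().mean()`, which avoids the helper function and the explicit row count. This is an alternative sketch on a made-up frame, using the same 40% threshold:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 3 of 5 values missing
    "sparse": [np.nan, 1.0, 2.0, 3.0, 4.0],             # 1 of 5 values missing
    "complete": [1, 2, 3, 4, 5],                        # nothing missing
})

# Fraction of null values per column.
null_share = toy.isna().mean()

# Columns at or above the threshold are dropped.
to_drop = null_share[null_share >= 0.4].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(trimmed.columns.tolist())
```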

In [266]:
lead_ds.shape

(9240, 30)

# Dropping a few additional fields

In [267]:
dropList=["Last Activity","Last Notable Activity","Tags"]
lead_ds=lead_ds.drop(dropList,axis=1)
lead_ds.shape

(9240, 27)

In [268]:
lead_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 27 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [269]:
lead_ds.head(3)

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,,,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,No,No,,No,No
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,India,,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,No,No,,No,No
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,India,Business Administration,Student,Better Career Prospects,No,No,No,No,No,No,No,No,No,No,Mumbai,No,Yes


In [270]:
def updateYesNo(col):
    lead_ds[col]=lead_ds[col].replace("Yes",1)
    lead_ds[col]=lead_ds[col].replace("No",0)

In [271]:
for column in lead_ds.columns:
    updateYesNo(column)
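
The loop above calls `replace` on every column, including the numeric ones. An alternative sketch, assuming the binary columns contain only the strings "Yes" and "No", detects those columns first and maps them explicitly. The toy frame is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Search": ["No", "No", "Yes"],
    "Lead Origin": ["API", "API", "Landing Page Submission"],
})

# A column counts as binary when its non-null values are a subset of {"Yes", "No"}.
binary_cols = [
    c for c in toy.columns
    if set(toy[c].dropna().unique()) <= {"Yes", "No"}
]

# Map the two labels to 1/0 in the binary columns only.
toy[binary_cols] = toy[binary_cols].apply(lambda s: s.map({"Yes": 1, "No": 0}))

print(binary_cols)
print(toy)
```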

In [272]:
lead_ds.head(3)

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,0,0,0,0.0,0,0.0,,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,0,0,0,5.0,674,2.5,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,0,0,1,2.0,1532,2.0,India,Business Administration,Student,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1


In [273]:
lead_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 27 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   int64  
 5   Do Not Call                                    9240 non-null   int64  
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [274]:
ls=list(lead_ds.values)

In [275]:
np.isnan(ls[0][2])

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
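
The error arises because `np.isnan` only accepts numeric input, and `ls[0][2]` is the string `'API'`. `pd.isna` accepts any scalar, so it is one way to test mixed-type values without a try/except. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

values = ["API", 0.0, np.nan, None]

# pd.isna handles strings, numbers, NaN and None alike.
flags = [bool(pd.isna(v)) for v in values]

# np.isnan rejects non-numeric input such as strings.
try:
    np.isnan("API")
    raised = False
except TypeError:
    raised = True

print(flags, raised)
```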

In [276]:
#np.isnan(ls[0][11])
rowNaCountList=[]
def populateRowNaCount(data,k):
    # Count the NaN entries in one row; non-numeric values raise TypeError and are skipped
    rowNaCountList.append(0)
    for i in range(data.size):
        try:
            if np.isnan(data[i]):
                rowNaCountList[k]+=1
        except TypeError:
            pass

In [277]:
k=0
for data in ls:
    populateRowNaCount(data,k)
    k+=1

In [278]:
max(rowNaCountList)

7

In [279]:
hmap=[0,0,0,0,0,0,0,0]
for data in rowNaCountList:
    hmap[data]+=1

In [280]:

hmap


[3812, 559, 2382, 959, 635, 880, 12, 1]
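
The per-row null counts and their histogram can also be computed without explicit loops. The sketch below assumes missing values are stored as NaN (as `read_csv` produces them) and uses a made-up three-column frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", np.nan, np.nan, "India"],
    "City": ["Mumbai", np.nan, "Mumbai", np.nan],
    "TotalVisits": [2.0, np.nan, 1.0, 3.0],
})

# Number of null cells in each row.
nulls_per_row = toy.isna().sum(axis=1)

# How many rows have 0, 1, 2, ... nulls.
histogram = nulls_per_row.value_counts().sort_index()

print(nulls_per_row.tolist())
print(histogram.to_dict())
```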

In [281]:
sum(hmap)

9240

In [282]:
rowNaCountList.index(7)

680

In [283]:
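# Note: drop() returns a new DataFrame; the result is not assigned here, so lead_ds itself is unchanged
lead_ds.drop([680],axis=0)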

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,0,0,0,0.0,0,0.00,,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,0,0,0,5.0,674,2.50,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,0,0,1,2.0,1532,2.00,India,Business Administration,Student,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,0,0,0,1.0,305,1.00,India,Media and Advertising,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,0,0,1,2.0,1428,1.00,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9235,19d6451e-fcd6-407c-b83b-48e1af805ea9,579564,Landing Page Submission,Direct Traffic,1,0,1,8.0,1845,2.67,Saudi Arabia,IT Projects Management,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
9236,82a7005b-7196-4d56-95ce-a79f937a158d,579546,Landing Page Submission,Direct Traffic,0,0,0,2.0,238,2.00,India,Media and Advertising,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
9237,aac550fe-a586-452d-8d3c-f1b62c94e02c,579545,Landing Page Submission,Direct Traffic,1,0,0,2.0,199,2.00,India,Business Administration,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
9238,5330a7d1-2f2b-4df4-85d6-64ca2f6b95b9,579538,Landing Page Submission,Google,0,0,1,3.0,499,3.00,India,Human Resource Management,,,0,0,0,0,0,0,0,0,0,0,Other Metro Cities,0,0


In [284]:
lead_ds.shape

(9240, 27)
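
The shape is still (9240, 27) because `DataFrame.drop` returns a new frame by default rather than modifying the caller. A small sketch on a made-up frame shows the difference between discarding and reassigning the result:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30]})

# The returned frame is discarded, so `toy` keeps all three rows.
toy.drop([1], axis=0)
n_before = len(toy)

# Reassigning the result actually removes the row.
toy = toy.drop([1], axis=0)
n_after = len(toy)

print(n_before, n_after, toy.index.tolist())
```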

# Dropping rows where more than 3 and fewer than 7 columns are null

In [285]:
rowNaCountList[5]

5

In [286]:
# Collect the index labels of rows with 4, 5 or 6 null columns
k=[]
for i in range(len(rowNaCountList)):
    if rowNaCountList[i]>3 and rowNaCountList[i]<7:
        k.append(i)

In [287]:
lead_ds.drop(k,axis=0,inplace=True)



In [288]:
lead_ds.shape

(7713, 27)
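
The 1,527 removed rows match the `hmap` counts for 4, 5 and 6 nulls (635 + 880 + 12). The same filter can be written as a boolean mask over the per-row null counts. This is an alternative sketch on a made-up 8-column frame, not the notebook's own code; like the loop above, it keeps rows with 7 or more nulls:

```python
import numpy as np
import pandas as pd

# Three rows, eight columns, all filled to begin with.
toy = pd.DataFrame(np.ones((3, 8)), columns=[f"c{i}" for i in range(8)])
toy.iloc[1, :4] = np.nan   # row 1 gets 4 nulls
toy.iloc[2, :7] = np.nan   # row 2 gets 7 nulls

nulls = toy.isna().sum(axis=1)

# between() is inclusive on both ends, so this removes rows with 4, 5 or 6 nulls.
filtered = toy[~nulls.between(4, 6)]

print(nulls.tolist())
print(filtered.index.tolist())
```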

In [289]:
lead_ds.isna().sum()

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                         7
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                        43
Total Time Spent on Website                         0
Page Views Per Visit                               43
Country                                          1505
Specialization                                   1873
What is your current occupation                  1273
What matters most to you in choosing a course    1292
Search                                              0
Magazine                                            0
Newspaper Article                                   0
X Education Forums                                  0
Newspaper                   

In [290]:
lead_ds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7713 entries, 0 to 9239
Data columns (total 27 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    7713 non-null   object 
 1   Lead Number                                    7713 non-null   int64  
 2   Lead Origin                                    7713 non-null   object 
 3   Lead Source                                    7706 non-null   object 
 4   Do Not Email                                   7713 non-null   int64  
 5   Do Not Call                                    7713 non-null   int64  
 6   Converted                                      7713 non-null   int64  
 7   TotalVisits                                    7670 non-null   float64
 8   Total Time Spent on Website                    7713 non-null   int64  
 9   Page Views Per Visit                           7670 

In [291]:
lead_ds.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,0,0,0,0.0,0,0.0,,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,0,0,0,5.0,674,2.5,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,0,0,1,2.0,1532,2.0,India,Business Administration,Student,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,0,0,0,1.0,305,1.0,India,Media and Advertising,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,0,0,1,2.0,1428,1.0,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0


In [295]:
lead_ds.drop(["Prospect ID"],axis=1,inplace=True)

In [298]:
# Note: fillna() returns a new DataFrame; without reassignment the nulls remain in lead_ds
lead_ds.fillna("UnKnown")

Unnamed: 0,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview
0,660737,API,Olark Chat,0,0,0,0.0,0,0.0,UnKnown,UnKnown,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,UnKnown,0,0
1,660728,API,Organic Search,0,0,0,5.0,674,2.5,India,UnKnown,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,UnKnown,0,0
2,660727,Landing Page Submission,Direct Traffic,0,0,1,2.0,1532,2.0,India,Business Administration,Student,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
3,660719,Landing Page Submission,Direct Traffic,0,0,0,1.0,305,1.0,India,Media and Advertising,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
4,660681,Landing Page Submission,Google,0,0,1,2.0,1428,1.0,India,UnKnown,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9235,579564,Landing Page Submission,Direct Traffic,1,0,1,8.0,1845,2.67,Saudi Arabia,IT Projects Management,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0
9236,579546,Landing Page Submission,Direct Traffic,0,0,0,2.0,238,2.0,India,Media and Advertising,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
9237,579545,Landing Page Submission,Direct Traffic,1,0,0,2.0,199,2.0,India,Business Administration,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1
9238,579538,Landing Page Submission,Google,0,0,1,3.0,499,3.0,India,Human Resource Management,UnKnown,UnKnown,0,0,0,0,0,0,0,0,0,0,Other Metro Cities,0,0


In [307]:
# Group Lead Source values that occur fewer than 900 times under "Other"
field="Lead Source"
kdict=dict(lead_ds[field].value_counts())
for k,v in kdict.items():
    if v<900:
        lead_ds[field]=lead_ds[field].replace(k,"Other")

In [313]:
# Group Specialization values that occur fewer than 300 times under "Other"
field="Specialization"
kdict=dict(lead_ds[field].value_counts())
for k,v in kdict.items():
    if v<300:
        lead_ds[field]=lead_ds[field].replace(k,"Other")
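
Since the two cells above differ only in the column name and the threshold, the grouping could be factored into a small helper. The function name `lump_rare` and the toy data are hypothetical, introduced here only for illustration:

```python
import pandas as pd


def lump_rare(series, min_count, other="Other"):
    """Return a copy of `series` with categories seen fewer than `min_count` times replaced by `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; missing values are left untouched.
    return series.where(~series.isin(rare), other)


toy = pd.Series(["Google"] * 3 + ["Direct Traffic"] * 2 + ["bing", "blog"])
lumped = lump_rare(toy, min_count=2)

print(lumped.value_counts().to_dict())
```

With such a helper, each column would need a single line, for example `lead_ds[field] = lump_rare(lead_ds[field], 900)`.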


In [315]:
field="City"
lead_ds[field].value_counts()

Mumbai                         3208
Thane & Outskirts               752
Other Cities                    684
Other Cities of Maharashtra     447
Other Metro Cities              377
Tier II Cities                   74
Name: City, dtype: int64

In [316]:
lead_ds.isna().sum()

Lead Number                                         0
Lead Origin                                         0
Lead Source                                         7
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                        43
Total Time Spent on Website                         0
Page Views Per Visit                               43
Country                                          1505
Specialization                                   1873
What is your current occupation                  1273
What matters most to you in choosing a course    1292
Search                                              0
Magazine                                            0
Newspaper Article                                   0
X Education Forums                                  0
Newspaper                                           0
Digital Advertisement       

In [318]:
dummy = pd.get_dummies(lead_ds[['Lead Origin','Specialization' ,'Lead Source',  'What is your current occupation','Country','What matters most to you in choosing a course','City']], drop_first=True)
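
Two behaviours of `get_dummies` matter here: `drop_first=True` removes the first category of each column (it becomes the implicit baseline), and rows with a missing value get zeros in every dummy column unless `dummy_na=True` is passed. A minimal sketch on a made-up column, with `dtype=int` added so the output prints as 0/1:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"City": ["Mumbai", "Thane & Outskirts", np.nan, "Mumbai"]})

# "Mumbai" sorts first and is dropped; the NaN row is encoded as all zeros.
dummies = pd.get_dummies(toy[["City"]], drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies["City_Thane & Outskirts"].tolist())
```

A consequence is that a row with a missing category is indistinguishable from a row in the baseline category.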

In [319]:
df_final = pd.concat([lead_ds, dummy], axis=1)
df_final

Unnamed: 0,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Country,Specialization,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,City,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Lead Origin_Landing Page Submission,Lead Origin_Lead Add Form,Lead Origin_Lead Import,Specialization_Business Administration,Specialization_Finance Management,Specialization_Human Resource Management,Specialization_IT Projects Management,Specialization_Marketing Management,Specialization_Operations Management,Specialization_Other,Specialization_Supply Chain Management,Lead Source_Google,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Other,What is your current occupation_Housewife,What is your current occupation_Other,What is your current occupation_Student,What is your current occupation_Unemployed,What is your current occupation_Working Professional,Country_Not India,What matters most to you in choosing a course_Flexibility & Convenience,What matters most to you in choosing a course_Other,City_Other Cities,City_Other Cities of Maharashtra,City_Other Metro Cities,City_Thane & Outskirts,City_Tier II Cities
0,660737,API,Olark Chat,0,0,0,0.0,0,0.00,,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,660728,API,Organic Search,0,0,0,5.0,674,2.50,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,660727,Landing Page Submission,Direct Traffic,0,0,1,2.0,1532,2.00,India,Business Administration,Student,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,660719,Landing Page Submission,Direct Traffic,0,0,0,1.0,305,1.00,India,Other,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,660681,Landing Page Submission,Google,0,0,1,2.0,1428,1.00,India,,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9235,579564,Landing Page Submission,Direct Traffic,1,0,1,8.0,1845,2.67,Not India,IT Projects Management,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0
9236,579546,Landing Page Submission,Direct Traffic,0,0,0,2.0,238,2.00,India,Other,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
9237,579545,Landing Page Submission,Direct Traffic,1,0,0,2.0,199,2.00,India,Business Administration,Unemployed,Better Career Prospects,0,0,0,0,0,0,0,0,0,0,Mumbai,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
9238,579538,Landing Page Submission,Google,0,0,1,3.0,499,3.00,India,Human Resource Management,,,0,0,0,0,0,0,0,0,0,0,Other Metro Cities,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [334]:
df_final["TotalVisits"]=df_final["TotalVisits"].fillna(0)
df_final["Page Views Per Visit"]=df_final["Page Views Per Visit"].fillna(0)

In [335]:
df_final.isna().sum()

Lead Number                                                                0
Do Not Email                                                               0
Do Not Call                                                                0
Converted                                                                  0
TotalVisits                                                                0
Total Time Spent on Website                                                0
Page Views Per Visit                                                       0
Search                                                                     0
Magazine                                                                   0
Newspaper Article                                                          0
X Education Forums                                                         0
Newspaper                                                                  0
Digital Advertisement                                                      0

In [336]:
# errors='ignore' keeps this cell safe to re-run after the original categorical columns are already gone
df_final.drop(['Lead Origin','Specialization' ,'Lead Source',  'What is your current occupation','Country','What matters most to you in choosing a course','City'],axis=1,inplace=True,errors='ignore')

In [337]:
df_final.head()

Unnamed: 0,Lead Number,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Lead Origin_Landing Page Submission,Lead Origin_Lead Add Form,Lead Origin_Lead Import,Specialization_Business Administration,Specialization_Finance Management,Specialization_Human Resource Management,Specialization_IT Projects Management,Specialization_Marketing Management,Specialization_Operations Management,Specialization_Other,Specialization_Supply Chain Management,Lead Source_Google,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Other,What is your current occupation_Housewife,What is your current occupation_Other,What is your current occupation_Student,What is your current occupation_Unemployed,What is your current occupation_Working Professional,Country_Not India,What matters most to you in choosing a course_Flexibility & Convenience,What matters most to you in choosing a course_Other,City_Other Cities,City_Other Cities of Maharashtra,City_Other Metro Cities,City_Thane & Outskirts,City_Tier II Cities
0,660737,0,0,0,0.0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,660728,0,0,0,5.0,674,2.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,660727,0,0,1,2.0,1532,2.0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,660719,0,0,0,1.0,305,1.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,660681,0,0,1,2.0,1428,1.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


# Model building

In [338]:
from sklearn.model_selection import train_test_split

In [339]:
X = df_final.drop(columns=['Converted'])
X.head()

Unnamed: 0,Lead Number,Do Not Email,Do Not Call,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Lead Origin_Landing Page Submission,Lead Origin_Lead Add Form,Lead Origin_Lead Import,Specialization_Business Administration,Specialization_Finance Management,Specialization_Human Resource Management,Specialization_IT Projects Management,Specialization_Marketing Management,Specialization_Operations Management,Specialization_Other,Specialization_Supply Chain Management,Lead Source_Google,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Other,What is your current occupation_Housewife,What is your current occupation_Other,What is your current occupation_Student,What is your current occupation_Unemployed,What is your current occupation_Working Professional,Country_Not India,What matters most to you in choosing a course_Flexibility & Convenience,What matters most to you in choosing a course_Other,City_Other Cities,City_Other Cities of Maharashtra,City_Other Metro Cities,City_Thane & Outskirts,City_Tier II Cities
0,660737,0,0,0.0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,660728,0,0,5.0,674,2.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,660727,0,0,2.0,1532,2.0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,660719,0,0,1.0,305,1.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,660681,0,0,2.0,1428,1.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [340]:
y=df_final['Converted']
y.head()

0    0
1    0
2    1
3    0
4    1
Name: Converted, dtype: int64

In [341]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=10)

In [342]:
from sklearn.preprocessing import MinMaxScaler
# Scale the three numeric features
scaler = MinMaxScaler()
X_train[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']] = scaler.fit_transform(X_train[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']])
X_train.head()

Unnamed: 0,Lead Number,Do Not Email,Do Not Call,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Update me on Supply Chain Content,Get updates on DM Content,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Lead Origin_Landing Page Submission,Lead Origin_Lead Add Form,Lead Origin_Lead Import,Specialization_Business Administration,Specialization_Finance Management,Specialization_Human Resource Management,Specialization_IT Projects Management,Specialization_Marketing Management,Specialization_Operations Management,Specialization_Other,Specialization_Supply Chain Management,Lead Source_Google,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Other,What is your current occupation_Housewife,What is your current occupation_Other,What is your current occupation_Student,What is your current occupation_Unemployed,What is your current occupation_Working Professional,Country_Not India,What matters most to you in choosing a course_Flexibility & Convenience,What matters most to you in choosing a course_Other,City_Other Cities,City_Other Cities of Maharashtra,City_Other Metro Cities,City_Thane & Outskirts,City_Tier II Cities
554,654519,0,0,0.051793,0.165557,0.118182,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0
2296,637500,0,0,0.011952,0.654683,0.054545,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0
811,651923,0,0,0.003984,0.133156,0.018182,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
8125,587802,0,0,0.043825,0.003551,0.025091,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
5254,609470,0,0,0.015936,0.13138,0.072727,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
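
To make the scaling step concrete, here is a minimal sketch of what min-max scaling computes, assuming the minimum and range of each column are learned from the training rows only and then reused on any other split. The toy frames and the helper names `fit_min_max` and `apply_min_max` are illustrative and not part of the notebook.

```python
import pandas as pd

def fit_min_max(train, cols):
    """Learn per-column minimum and range from the training rows only."""
    col_min = train[cols].min()
    col_range = train[cols].max() - col_min
    return col_min, col_range

def apply_min_max(frame, cols, col_min, col_range):
    """Return a copy of `frame` with `cols` rescaled using the training statistics."""
    out = frame.copy()
    out[cols] = (frame[cols] - col_min) / col_range
    return out

# Toy data (hypothetical values, not taken from the leads dataset)
toy_train = pd.DataFrame({'TotalVisits': [0.0, 5.0, 10.0]})
toy_test = pd.DataFrame({'TotalVisits': [2.5, 20.0]})

col_min, col_range = fit_min_max(toy_train, ['TotalVisits'])
scaled_train = apply_min_max(toy_train, ['TotalVisits'], col_min, col_range)
scaled_test = apply_min_max(toy_test, ['TotalVisits'], col_min, col_range)
print(scaled_test)
```

Test values that fall outside the training range map outside [0, 1], which is expected when the statistics come from the training split alone.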


In [343]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [344]:
from sklearn.feature_selection import RFE

In [345]:
# Run RFE to keep the 15 top-ranked features
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)

In [350]:
k=list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [352]:
flist=[]
for data in k:
    if data[2]==1:
        flist.append(data)

In [353]:
flist

[('Lead Number', True, 1),
 ('Do Not Email', True, 1),
 ('Total Time Spent on Website', True, 1),
 ('A free copy of Mastering The Interview', True, 1),
 ('Lead Origin_Landing Page Submission', True, 1),
 ('Lead Origin_Lead Add Form', True, 1),
 ('Specialization_IT Projects Management', True, 1),
 ('Specialization_Other', True, 1),
 ('Lead Source_Google', True, 1),
 ('Lead Source_Olark Chat', True, 1),
 ('Lead Source_Organic Search', True, 1),
 ('Lead Source_Other', True, 1),
 ('What is your current occupation_Unemployed', True, 1),
 ('What is your current occupation_Working Professional', True, 1),
 ('Country_Not India', True, 1)]

In [354]:
col = X_train.columns[rfe.support_]

In [355]:
X_train = X_train[col]

In [356]:
import statsmodels.api as sm

In [357]:
X_train_sm = sm.add_constant(X_train)
logm1 = sm.GLM(y_train, X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

Dep. Variable:,Converted,No. Observations:,5399.0
Model:,GLM,Df Residuals:,5383.0
Model Family:,Binomial,Df Model:,15.0
Link Function:,logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2559.4
Date:,"Wed, 19 Jan 2022",Deviance:,5118.7
Time:,19:52:35,Pearson chi2:,5820.0
No. Iterations:,6,,
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,-4.1577,0.970,-4.288,0.000,-6.058,-2.257
Lead Number,2.289e-06,1.54e-06,1.484,0.138,-7.33e-07,5.31e-06
Do Not Email,-1.3988,0.167,-8.384,0.000,-1.726,-1.072
Total Time Spent on Website,4.3958,0.162,27.144,0.000,4.078,4.713
A free copy of Mastering The Interview,0.0844,0.101,0.833,0.405,-0.114,0.283
Lead Origin_Landing Page Submission,-0.0743,0.106,-0.698,0.485,-0.283,0.134
Lead Origin_Lead Add Form,4.1693,0.346,12.053,0.000,3.491,4.847
Specialization_IT Projects Management,0.0274,0.180,0.152,0.879,-0.325,0.380
Specialization_Other,0.0135,0.100,0.136,0.892,-0.182,0.209
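
The coefficients in the summary are on the log-odds scale, which is hard to read directly. A minimal sketch of turning a few of them into odds ratios follows; the three values are copied from the summary above, and the dictionary names are illustrative.

```python
import math

# Coefficients copied from the GLM summary above
coefs = {
    'Do Not Email': -1.3988,
    'Total Time Spent on Website': 4.3958,
    'Lead Origin_Lead Add Form': 4.1693,
}

# exp(coef) is the multiplicative change in the odds of conversion
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.2f}')
```

A ratio below 1 lowers the odds of conversion and a ratio above 1 raises them. Because `Total Time Spent on Website` was min-max scaled, a one-unit increase there spans the full range seen in the training data.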


In [358]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [359]:
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Unnamed: 0,Features,VIF
0,Lead Number,19.68
11,Lead Source_Other,6.58
4,Lead Origin_Landing Page Submission,6.26
5,Lead Origin_Lead Add Form,5.99
12,What is your current occupation_Unemployed,4.75
8,Lead Source_Google,3.25
3,A free copy of Mastering The Interview,3.13
9,Lead Source_Olark Chat,2.58
2,Total Time Spent on Website,2.31
10,Lead Source_Organic Search,1.59
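
The VIF cell above passes `X_train.values` without a constant column. A common recommendation is to include an intercept in the auxiliary regressions, because without one a feature with a large mean (such as `Lead Number`) can show a high VIF even when it is not collinear with the other features. Below is a minimal NumPy sketch of the intercept-based definition, VIF_j = 1 / (1 - R_j^2). The function name `vif_with_intercept` and the toy arrays are illustrative, and this is an alternative calculation, not the one the notebook ran.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Two uncorrelated toy columns: VIF should be close to 1
uncorrelated = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
vif_low = vif_with_intercept(uncorrelated)

# Two nearly identical toy columns: VIF should be large
correlated = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 6]])
vif_high = vif_with_intercept(correlated)
print(vif_low, vif_high)
```

In the notebook this could be called as `vif_with_intercept(X_train.values)` and compared with the table above.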


In [360]:
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

554     0.299160
2296    0.787854
811     0.291778
8125    0.376868
5254    0.111110
5290    0.578023
1906    0.593238
5132    0.218992
2626    0.378377
819     0.994917
dtype: float64

In [363]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

array([0.29915975, 0.78785441, 0.29177832, 0.3768679 , 0.11111047,
       0.57802319, 0.59323754, 0.21899188, 0.37837749, 0.99491733])

In [364]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final.head()

Unnamed: 0,Converted,Conversion_Prob
0,1,0.29916
1,1,0.787854
2,0,0.291778
3,0,0.376868
4,0,0.11111


In [365]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

Unnamed: 0,Converted,Conversion_Prob,Predicted
0,1,0.29916,0
1,1,0.787854,1
2,0,0.291778,0
3,0,0.376868,0
4,0,0.11111,0


In [366]:
from sklearn import metrics

In [367]:
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted )
confusion

array([[2681,  409],
       [ 767, 1542]], dtype=int64)

In [368]:
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted)

0.7821818855343582

In [369]:
# True positives: converted leads predicted as converted
TP = confusion[1,1]
# True negatives: unconverted leads predicted as unconverted
TN = confusion[0,0]
# False positives: unconverted leads predicted as converted
FP = confusion[0,1]
# False negatives: converted leads predicted as unconverted
FN = confusion[1,0]

In [370]:
# Calculating the sensitivity
TP/(TP+FN)

0.667821567778259

In [371]:
# Calculating the specificity
TN/(TN+FP)

0.8676375404530744
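
The three cells above can be folded into one helper so the same arithmetic is reused for any confusion matrix. This is a minimal sketch: the function name `rates_from_confusion` is illustrative, the matrix is the training-set one printed above, and precision is an extra that the notebook does not compute.

```python
import numpy as np

def rates_from_confusion(cm):
    """Unpack a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),   # share of converted leads that were caught
        'specificity': tn / (tn + fp),   # share of unconverted leads correctly rejected
        'precision': tp / (tp + fp),     # share of predicted conversions that were real
    }

# Training-set confusion matrix from the cell above
train_cm = [[2681, 409], [767, 1542]]
train_rates = rates_from_confusion(train_cm)
print(train_rates)
```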

In [372]:
X_test[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']] = scaler.transform(X_test[['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']])

In [373]:
col = X_train.columns

In [374]:
# Select the columns in X_train for X_test as well
X_test = X_test[col]
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)
X_test_sm

Unnamed: 0,const,Lead Number,Do Not Email,Total Time Spent on Website,A free copy of Mastering The Interview,Lead Origin_Landing Page Submission,Lead Origin_Lead Add Form,Specialization_IT Projects Management,Specialization_Other,Lead Source_Google,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Other,What is your current occupation_Unemployed,What is your current occupation_Working Professional,Country_Not India
52,1.0,660069,0,0.174878,1,1,0,0,1,0,0,1,0,1,0,0
8010,1.0,588437,0,0.019530,1,1,0,0,1,0,0,0,0,1,0,0
5189,1.0,610131,0,0.000000,0,0,0,0,0,0,1,0,0,1,0,0
7421,1.0,592608,1,0.094097,1,1,0,0,0,0,0,0,0,1,0,1
5434,1.0,608250,0,0.069685,0,1,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253,1.0,657754,0,0.063915,0,1,0,0,0,1,0,0,0,0,0,0
8812,1.0,582634,0,0.498890,0,1,0,0,0,1,0,0,0,1,0,0
7891,1.0,589309,0,0.468265,0,0,0,0,0,1,0,0,0,1,0,0
1080,1.0,648518,0,0.067909,0,1,0,0,0,1,0,0,0,0,1,0


In [375]:
# Storing prediction of test set in the variable 'y_test_pred'
y_test_pred = res.predict(X_test_sm)
# Converting it to a DataFrame
y_pred_df = pd.DataFrame(y_test_pred)
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)
# Remove index for both dataframes to append them side by side 
y_pred_df.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
# Append y_test_df and y_pred_df
y_pred_final = pd.concat([y_test_df, y_pred_df],axis=1)
# Renaming column 
y_pred_final= y_pred_final.rename(columns = {0 : 'Conversion_Prob'})
y_pred_final.head()

Unnamed: 0,Converted,Conversion_Prob
0,0,0.310533
1,0,0.138889
2,1,0.365458
3,1,0.042549
4,0,0.207514
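
The index resets in the cell above matter because `pd.concat(..., axis=1)` pairs rows by index label, not by position. In the notebook both objects come from the same test split, so the reset mainly tidies the labels; the sketch below shows what would happen if only one side had been reset. The series values are hypothetical and the row labels are borrowed from the test frame only for illustration.

```python
import pandas as pd

# Actual labels that keep their original (shuffled) row labels
actual = pd.Series([0, 1, 1], index=[52, 8010, 5189], name='Converted')
# Predictions carrying a fresh 0..n-1 index
prob = pd.Series([0.31, 0.14, 0.37], name='Conversion_Prob')

# Without a reset on both sides, concat aligns on labels and pads mismatches with NaN
misaligned = pd.concat([actual, prob], axis=1)

# With both sides on a 0..n-1 index, rows pair up by position
aligned = pd.concat([actual.reset_index(drop=True), prob], axis=1)
print(misaligned)
print(aligned)
```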


In [376]:
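# Classify test leads with a 0.35 probability cutoff (the training-set check above used 0.5,
# so the two sets of metrics are not directly comparable)
y_pred_final['final_predicted'] = y_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.35 else 0)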
y_pred_final

Unnamed: 0,Converted,Conversion_Prob,final_predicted
0,0,0.310533,0
1,0,0.138889,0
2,1,0.365458,1
3,1,0.042549,0
4,0,0.207514,0
...,...,...,...
2309,0,0.106315,0
2310,0,0.619661,1
2311,1,0.608977,1
2312,0,0.809476,1
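
The notebook does not show how the 0.35 cutoff was chosen. One common approach, sketched here as an assumption, is to tabulate accuracy, sensitivity and specificity over a grid of cutoffs and pick the point where they balance. The function name `cutoff_table` and the toy data are illustrative.

```python
import numpy as np
import pandas as pd

def cutoff_table(y_true, prob, cutoffs):
    """Accuracy, sensitivity and specificity at each probability cutoff.

    Assumes y_true contains at least one 0 and one 1.
    """
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    rows = []
    for c in cutoffs:
        pred = (prob > c).astype(int)  # same rule as the notebook: 1 if prob > cutoff
        tp = int(((pred == 1) & (y_true == 1)).sum())
        tn = int(((pred == 0) & (y_true == 0)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        rows.append({
            'cutoff': c,
            'accuracy': (tp + tn) / len(y_true),
            'sensitivity': tp / (tp + fn),
            'specificity': tn / (tn + fp),
        })
    return pd.DataFrame(rows)

# Toy labels and probabilities (hypothetical, not the notebook's data)
toy_y = [0, 0, 1, 1, 0, 1]
toy_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60]
table = cutoff_table(toy_y, toy_prob, cutoffs=[0.3, 0.5])
print(table)
```

In the notebook this could be run on the training predictions, for example `cutoff_table(y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob, np.arange(0, 1, 0.05))`, so that the cutoff is chosen without looking at the test set.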


In [377]:
metrics.accuracy_score(y_pred_final['Converted'], y_pred_final.final_predicted)

0.7675021607605877

In [378]:
confusion2 = metrics.confusion_matrix(y_pred_final['Converted'], y_pred_final.final_predicted )
confusion2

array([[931, 364],
       [174, 845]], dtype=int64)

In [379]:
# True positives: converted leads predicted as converted
TP = confusion2[1,1]
# True negatives: unconverted leads predicted as unconverted
TN = confusion2[0,0]
# False positives: unconverted leads predicted as converted
FP = confusion2[0,1]
# False negatives: converted leads predicted as unconverted
FN = confusion2[1,0]

In [380]:
# Calculating the sensitivity
TP/(TP+FN)

0.8292443572129539

In [381]:
# Calculating the specificity
TN/(TN+FP)

0.7189189189189189
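
The problem statement asks for a lead score per lead. A minimal sketch of one way to derive it, assuming the score is simply the predicted probability on a 0 to 100 scale, is shown below. The column names `Lead_Score` and `Hot_Lead` are illustrative, and the probabilities are the first five test-set values printed earlier.

```python
import pandas as pd

# First five test-set probabilities as printed in y_pred_final above
scores = pd.DataFrame({'Conversion_Prob': [0.310533, 0.138889, 0.365458, 0.042549, 0.207514]})

# Assumed convention: lead score = probability on a 0-100 scale, rounded to an integer
scores['Lead_Score'] = (scores['Conversion_Prob'] * 100).round().astype(int)

# Flag hot leads with the same rule as the notebook: probability above 0.35
scores['Hot_Lead'] = scores['Conversion_Prob'] > 0.35
print(scores)
```

Sorting by `Lead_Score` in descending order would give the sales team a call list ordered by predicted conversion chance.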