### Leads Scoring - Logistic Regression


###### Problem Statement
X Education needs help selecting the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires us to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher chance of conversion and customers with a lower lead score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

###### Goals and Objectives
This case study has several goals.

1. Build a logistic regression model that assigns a lead score between 0 and 100 to each lead, which the company can use to target potential leads. A higher score means that the lead is hot, i.e. most likely to convert, whereas a lower score means that the lead is cold and will most likely not convert.

2. The company has presented some additional problems that the model should be able to adjust to if the company's requirements change in the future, so these need to be handled as well. These problems are provided in a separate doc file.

3. Fill in that file based on the logistic regression model built in the first step.

4. Include this in the final PPT, where the recommendations are made.


###### Variables
With 37 predictor variables, I need to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher chance of conversion and customers with a lower lead score have a lower chance.

### Step 1: Importing the Dataset

In [1]:
# Suppressing Warnings

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Importing the dataset

lead = pd.read_csv('Leads.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'Leads.csv'

### Step 2: Inspecting the Dataframe

In [None]:
# Checking the head of our master dataset

lead.head()

In [None]:
# Checking the dimensions of the dataframe

lead.shape

In [None]:
# Checking the statistical aspects of the dataframe

lead.describe()

In [None]:
# Checking the type of each column

lead.info()

In [None]:
# Checking if there are any duplicate values in the dataset

lead[lead.duplicated(keep=False)]

###### Insights:

    There are no duplicate rows in this dataset.
           

### Step 3: Data Cleaning

In [None]:
# Checking columns with one unique value 

lead.nunique()

In [None]:
# Dropping redundant columns like 'Prospect ID','Lead Number','Country',
#'I agree to pay the amount through cheque','A free copy of Mastering The Interview','City'

redund_cols=['Prospect ID','Lead Number','Country','I agree to pay the amount through cheque',
             'A free copy of Mastering The Interview','City']
             
lead=lead.drop(redund_cols, axis=1)

#### Checking for the missing values (column-wise)

In [None]:
# Adding up the missing values (column-wise)

lead.isnull().sum()

In [None]:
# Checking the percentage of missing values

round(100*(lead.isnull().sum()/len(lead.index)),2)

In [None]:
# Dropping columns with 30% or more missing values

cols=lead.columns

for i in cols:
    if((100*(lead[i].isnull().sum()/len(lead.index))) >= 30):
        lead.drop(i, axis=1, inplace = True)
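
The same filter can also be written without dropping columns one by one inside a loop. The following is a minimal sketch on a toy frame; `drop_sparse_columns` and `toy` are hypothetical names used only for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=30.0):
    """Return a copy of df without columns whose missing share is >= threshold_pct (hypothetical helper)."""
    # isnull().mean() gives the fraction of missing values per column
    missing_pct = df.isnull().mean() * 100
    return df.loc[:, missing_pct < threshold_pct].copy()

# Toy data, made up for this illustration
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],               # 0% missing
    'b': [1, np.nan, 3, 4],          # 25% missing
    'c': [np.nan, np.nan, 3, 4],     # 50% missing
})

kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

Only column `c` crosses the 30% threshold here, so it is the only one removed.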

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)),2)

##### Following columns have null values :

- Lead Source
- Total Visits
- Page Views Per Visit
- Last Activity 
- Specialization 
- How did you hear about X Education               
- What is your current occupation                  
- What matters most to you in choosing a course
- Lead Profile
 

In [None]:
lead['Specialization'] = lead['Specialization'].fillna('not provided') 
lead.info()

In [None]:
# Rechecking the percentage of missing values

round(100*(lead.isnull().sum()/len(lead.index)), 2)

##### Following columns have null values :

- Lead Source
- Total Visits
- Page Views Per Visit
- Last Activity
- What is your current occupation                  
- What matters most to you in choosing a course
- Lead Profile

### Univariate Analysis

#### Categorical Attributes Analysis

In [None]:
# Checking the value counts for Lead Source column

lead['Lead Source'].value_counts().head(25) 

###### Insights:
    Google has the highest number of occurrences, hence the missing values need to be imputed with the label 'Google'

In [None]:
# Capitalizing the first character of the values in the column 'Lead Source'

lead['Lead Source']=lead['Lead Source'].str.capitalize()
lead['Lead Source'].value_counts()

In [None]:
sns.countplot(x=lead['Lead Source']).tick_params(axis='x', rotation = 90)

plt.title('Lead Source')
plt.show()

In [None]:
# Checking the value counts for Total Visits column

lead['TotalVisits'].value_counts()

###### Insights:
    0.0 has the highest number of occurrences, hence the missing values need to be imputed with the label '0.0'

In [None]:
sns.countplot(x=lead['TotalVisits']).tick_params(axis='x', rotation = 90)

plt.title('TotalVisits')
plt.show()

In [None]:
# Checking the median of column 

lead['TotalVisits'].median()

In [None]:
# Impute the null values in TotalVisits by the median value which is 3.0

lead['TotalVisits'] = lead['TotalVisits'].replace(np.nan, lead['TotalVisits'].median())

In [None]:
# Checking value counts of Page Views Per Visit column

lead['Page Views Per Visit'].value_counts()

###### Insights:
       0.0 has the highest number of occurrences, hence the missing values need to be imputed with the label '0.0'

In [None]:
# Checking the median of the column

lead['Page Views Per Visit'].median()

In [None]:
sns.countplot(x=lead['Page Views Per Visit']).tick_params(axis='x', rotation = 90)

plt.title('Page Views Per Visit')
plt.show()

In [None]:
# Checking value counts for the column Last Activity

lead['Last Activity'].value_counts()

###### Insights:
       Email Opened has the highest number of occurrences, hence the missing values need to be imputed with the label 'Email Opened'

In [None]:
sns.countplot(x=lead['Last Activity']).tick_params(axis='x', rotation = 90)

plt.title('Last Activity')
plt.show()

In [None]:
# Checking value counts of What is your current occupation
 
lead['What is your current occupation'].value_counts()

In [None]:
sns.countplot(x=lead['What is your current occupation']).tick_params(axis='x', rotation = 90)

plt.title('What is your current occupation')
plt.show()

In [None]:
# Checking value counts of What matters most to you in choosing a course 

lead['What matters most to you in choosing a course'].value_counts()

In [None]:
sns.countplot(x=lead['What matters most to you in choosing a course']).tick_params(axis='x', rotation = 90)

plt.title('What matters most to you in choosing a course')
plt.show()

In [None]:
# Checking the value counts for Lead Profile column

lead['Lead Profile'].value_counts()

In [None]:
sns.countplot(x=lead['Lead Profile']).tick_params(axis='x', rotation = 90)

plt.title('Lead Profile')
plt.show()

In [None]:
# Checking for missing values after imputation

lead.isnull().sum() 

###### Insights:
    None of the columns have missing values now, so we can proceed to the next analysis

In [None]:
lead.head()

In [None]:
# Converting 'Select' values to NaN.

lead['Specialization'] = lead['Specialization'].replace('Select', np.nan)
lead['Lead Profile'] = lead['Lead Profile'].replace('Select', np.nan)

### Step 4: Data Preparation

#### Converting some binary variables (Yes/No) to 0/1

In [None]:
# List of variables to map

varlist = ['Do Not Email', 'Do Not Call','Newspaper Article','X Education Forums',
           'Newspaper','Digital Advertisement','Through Recommendations',
           'Receive More Updates About Our Courses', 'Update me on Supply Chain Content',
           'Get updates on DM Content']

# Defining the map function

def binary_map(x):
    return x.map({'Yes':1, 'No':0})

# Applying the function to the Leads list

lead[varlist] = lead[varlist].apply(binary_map)

In [None]:
lead.head()

###### Insights:
    After converting the binary categories from 'Yes' to 1 and 'No' to 0, 
    dummy variables will be used for the categories with multiple levels.

##### For categorical variables with multiple levels, create dummy features, remove repeated columns

In [None]:
# Creating a dummy variable for some of the categorical variables and dropping the first level.

d_lead_origin=pd.get_dummies(lead['Lead Origin'], prefix='LeadOrigin')
# Drop Add Form
d_lead_origin1 = d_lead_origin.drop(['LeadOrigin_Quick Add Form'], axis=1)
#Add the results to the master dataframe
lead = pd.concat([lead, d_lead_origin1],axis = 1)                     


d_last_activity = pd.get_dummies(lead['Last Activity'], prefix='LastActivity')
# Drop a column
d_last_activity1 = d_last_activity.drop(['LastActivity_Resubscribed to emails'], axis=1)
# Add the results to the master dataframe
lead = pd.concat([lead,d_last_activity1], axis=1)


d_curr_occupation = pd.get_dummies(lead['What is your current occupation'], prefix='CurrentOccupation')
# Drop a column
d_curr_occupation1 = d_curr_occupation.drop(['CurrentOccupation_Businessman'], axis=1)
# Add the results to the master dataframe
lead = pd.concat([lead,d_curr_occupation1], axis=1)


d_last_notable_activity = pd.get_dummies(lead['Last Notable Activity'], prefix='LastNotableActivity')
# Drop a column
d_last_notable_activity1 = d_last_notable_activity.drop(['LastNotableActivity_Resubscribed to emails'], axis=1)
# Add the results to the master dataframe
lead = pd.concat([lead,d_last_notable_activity1], axis=1)

lead.head()
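
The encode, drop one level, concatenate pattern used above repeats for each categorical column, so it could be wrapped in a small function. This is a minimal sketch on toy data; `dummies_without_level` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd

def dummies_without_level(series, prefix, drop_level):
    """One-hot encode a series and drop the dummy column of one chosen level (hypothetical helper)."""
    dummies = pd.get_dummies(series, prefix=prefix)
    # Dropping one level avoids keeping a fully redundant dummy column
    return dummies.drop(columns=[f'{prefix}_{drop_level}'])

# Toy frame standing in for the real data
toy = pd.DataFrame({'Lead Origin': ['API', 'Landing Page Submission', 'Quick Add Form', 'API']})

encoded = dummies_without_level(toy['Lead Origin'], 'LeadOrigin', 'Quick Add Form')
toy = pd.concat([toy, encoded], axis=1)
print(list(encoded.columns))
```

Naming the dropped level explicitly, rather than relying on `drop_first=True`, keeps the choice of reference level visible in the code.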

In [None]:
lead.info()

In [None]:
# Dropping redundant variables

redundant=['Receive More Updates About Our Courses','Specialization',
           'Update me on Supply Chain Content','Get updates on DM Content','Magazine']

lead=lead.drop(redundant, axis=1)

In [None]:
lead.head()

In [None]:
# Checking outliers at 25%,50%,75%,90%,95% and above

lead.describe(percentiles=[.25,.5,.75,.90,.95,.99])

In [None]:
lead.info()

In [None]:
lead.corr(numeric_only=True)

#### Checking for Outliers

In [None]:
# Adding up the missing values (column-wise)

lead.isnull().sum()

In [None]:
# Checking for outliers in the continuous variables

num_lead = lead[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']]

In [None]:
# Check the outliers in all the numeric columns

plt.figure(figsize=(20, 25))
plt.subplot(4,3,1)
sns.boxplot(y = 'TotalVisits', palette='Set2', data = lead)
plt.subplot(4,3,2)
sns.boxplot(y = 'Total Time Spent on Website', palette='Set2', data = lead)
plt.subplot(4,3,3)
sns.boxplot(y = 'Page Views Per Visit', palette='Set2', data = lead)
plt.show()

In [None]:
# Checking for outliers at 25%, 50%, 75%, 90%, 95% and 99%

num_lead.describe(percentiles=[.25,.5,.75,.90,.95,.99])

##### Insights:
    The numbers are gradually increasing.
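
To put a number on what the boxplots show, the share of points outside the whiskers can be computed with the 1.5 × IQR rule, which is the default whisker rule for boxplots. This is a minimal sketch on made-up values; `iqr_outlier_share` and `toy_visits` are hypothetical names.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Fraction of values lying outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return outside.mean()

# Toy values with one obviously extreme observation
toy_visits = pd.Series([0, 1, 2, 2, 3, 3, 3, 4, 5, 50])

share = iqr_outlier_share(toy_visits)
print(share)
```

Whether such points should be capped or removed is a separate modelling decision that this sketch does not make.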

### Step 5: Train - Test Split

In [None]:
# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')

# Importing more modules
import seaborn as sns
%matplotlib inline

# To Scale the dataset
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [None]:
# Separating the target variable from the independent variables
# Putting the target variable 'Converted' in a new series 'y'

y=lead['Converted']    

y.head()

In [None]:
# Putting the independent (predictor) variables in a new dataset called 'X'

X=lead.drop('Converted', axis=1)

X.head()

In [None]:
# Splitting the data into train and test set

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.7,test_size=0.3, random_state=100)

### Step 6: Rescaling the features with MinMax Scaling

In [None]:
# Importing MinMaxScaler method from sklearn - preprocessing library

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits',
                                                                        'Total Time Spent on Website','Page Views Per Visit']])

X_train.head()

In [None]:
scaler = MinMaxScaler()

num_cols=X_train.select_dtypes(include=['float64', 'int64']).columns

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])

X_train.head()

In [None]:
# Now, scaling the 'Total Time Spent on Website' variable with the MinMaxScaler (not a standard scaler), fitting and transforming the X_train dataset

X_train[['Total Time Spent on Website']]=scaler.fit_transform(X_train[['Total Time Spent on Website']])

X_train.head()
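
When the same scaler is later applied to the test set, it has to have been fitted on exactly the columns it is asked to transform. The following is a minimal sketch of the fit-on-train, transform-on-test pattern with toy frames; `toy_train`, `toy_test` and `toy_scaler` are hypothetical names and the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

num_cols = ['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']

# Toy stand-ins for X_train and X_test
toy_train = pd.DataFrame({'TotalVisits': [0.0, 5.0, 10.0],
                          'Total Time Spent on Website': [0.0, 100.0, 200.0],
                          'Page Views Per Visit': [0.0, 2.0, 4.0]})
toy_test = pd.DataFrame({'TotalVisits': [5.0, 20.0],
                         'Total Time Spent on Website': [50.0, 200.0],
                         'Page Views Per Visit': [1.0, 4.0]})

toy_scaler = MinMaxScaler()
# Learn the min and max from the training data only
toy_train[num_cols] = toy_scaler.fit_transform(toy_train[num_cols])
# Reuse the learned min and max on the test data (no refitting)
toy_test[num_cols] = toy_scaler.transform(toy_test[num_cols])
print(toy_test)
```

Test values can fall outside the 0 to 1 range, as the visit count of 20 does here, because the range was learned from the training data alone.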

In [None]:
# Checking the conversion rate

convert = (sum(lead['Converted'])/len(lead['Converted'].index))*100
convert

##### Insights:
        We have a conversion rate of almost 39%.

In [None]:
X_train.info()

### Step 7: Correlations

In [None]:
# Importing matplotlib and seaborn

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Checking the Correlation Matrix

In [None]:
# Checking the correlation of variables

plt.figure(figsize = (30,20))       
plt.title('Correlations')
sns.heatmap(lead.corr(method ='spearman', numeric_only=True))
plt.show()

##### Insights:
    The heatmap shows the correlations, but with this many variables it is difficult to spot the highly correlated ones.
    So, we can proceed with building a model and select features based on the p-values and VIFs.
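
Since a large heatmap is hard to read, the strongest pairwise correlations can also be listed directly. This is a minimal sketch on toy data; `top_correlated_pairs` is a hypothetical helper and the toy columns are made up.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5, method='spearman'):
    """Return the n variable pairs with the largest absolute correlation (hypothetical helper)."""
    corr = df.corr(method=method, numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
    return pairs.head(n)

# Toy data: y increases with x, z is unrelated, label is non-numeric and ignored
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                    'y': [2, 4, 6, 8, 11],
                    'z': [5, 1, 4, 2, 3],
                    'label': list('abcde')})

pairs = top_correlated_pairs(toy, n=3)
print(pairs)
```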

### Step 8: Model Building  

#### Running the Initial Training Model

In [None]:
import statsmodels.api as sm

In [None]:
# Logistic regression model

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

### Step 9: Feature Selection Using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE

# running RFE with 18 variables as output

rfe = RFE(logreg, n_features_to_select=18)
rfe = rfe.fit(X_train, y_train)

In [None]:
# Checking for true and false assigned to the variables after rfe

rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# variables shortlisted by RFE

col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

#### Rebuilding Model - Model 2

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
col = col.drop('LastActivity_Approached upfront')

#### Rebuilding Model - Model 3

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

#### Rebuilding Model - Model 4

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

#### Rebuilding Model - Model 5

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm5 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm5.fit()
res.summary()

#### Rebuilding Model - Model 6

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm6 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm6.fit()
res.summary()

#### Rebuilding Model - Model 7

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm7 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm7.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set

y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
# Reshape

y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

##### Creating a dataframe with the actual conversion flag and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
y_train_pred_final['LeadId'] = y_train.index
y_train_pred_final.head()

##### Adding the predicted class at a 0.5 probability cutoff

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

##### Let's check the confusion matrix and accuracy

In [None]:
from sklearn import metrics

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

#### Checking VIFs

In [None]:
# Importing VIFs library

from statsmodels.stats.outliers_influence import variance_inflation_factor


In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif


##### Insights:
    All features have VIF values less than 5, there is no multicollinearity issue in the dataset.
    Drop the most insignificant feature, i.e. 'What is your current occupation_Housewife', which has a p-value of 0.999.
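
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The following is a minimal NumPy sketch of that definition on synthetic data; `vif_from_r2` is a hypothetical helper, and it adds an intercept to every regression, so its values can differ from `variance_inflation_factor` when the input matrix has no constant column.

```python
import numpy as np

def vif_from_r2(X):
    """VIF per column via 1 / (1 - R^2), regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + remaining features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = vif_from_r2(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two nearly collinear columns get large VIFs while the independent column stays close to 1, which is the pattern the threshold of 5 is meant to catch.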

##### Metrics - Sensitivity, Specificity, False Positive Rate, Positive Predictive Value and Negative Predictive Value

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when leads have not converted

print(FP/ float(TN+FP))

In [None]:
# positive predictive value 

print (TP / float(TP+FP))

In [None]:
# Negative predictive value

print (TN / float(TN+ FN))
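
The five rates above all derive from the same four cells of the confusion matrix, so they can be collected in one place. This is a minimal sketch assuming the scikit-learn layout (rows are actual classes, columns are predicted classes); `binary_rates` and `toy_cm` are hypothetical names and the counts are made up.

```python
import numpy as np

def binary_rates(cm):
    """Rates from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]] (hypothetical helper)."""
    tn, fp, fn, tp = np.asarray(cm, dtype=float).ravel()
    return {
        'sensitivity': tp / (tp + fn),          # share of actual positives found
        'specificity': tn / (tn + fp),          # share of actual negatives found
        'false_positive_rate': fp / (tn + fp),  # equals 1 - specificity
        'positive_predictive_value': tp / (tp + fp),
        'negative_predictive_value': tn / (tn + fn),
    }

# Toy counts for illustration only
toy_cm = [[80, 20],
          [10, 40]]

rates = binary_rates(toy_cm)
print(rates)
```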

### Step 10: Plotting the ROC Curve

An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
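
As a small numeric illustration of these points, the area under the ROC curve can be computed for a handful of made-up scores. This is a minimal sketch; `toy_actual` and `toy_probs` are hypothetical values and not results from the lead data.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted probabilities
toy_actual = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point on the curve
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_actual, toy_probs)

# AUC is the share of (positive, negative) pairs in which the positive gets the higher score
toy_auc = roc_auc_score(toy_actual, toy_probs)
print(toy_auc)
```

Three of the four positive-negative pairs are ranked correctly here, so the area is 0.75; a value of 0.5 would correspond to the 45-degree diagonal.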

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Converted_Prob, 
                                         drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

### Step 11: Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.

cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
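
Instead of reading the crossing point off the plot, the cutoff at which sensitivity and specificity are closest can be searched for on a finer grid. This is a minimal sketch on toy data; `balanced_cutoff` is a hypothetical helper, and it assumes that both classes are present in `actual`.

```python
import numpy as np

def balanced_cutoff(actual, probs, grid=None):
    """Cutoff on the grid where |sensitivity - specificity| is smallest (hypothetical helper)."""
    actual = np.asarray(actual)
    probs = np.asarray(probs)
    if grid is None:
        grid = np.round(np.arange(0.0, 1.0, 0.01), 2)
    best_cut, best_gap = None, np.inf
    for cut in grid:
        pred = (probs > cut).astype(int)
        tp = np.sum((pred == 1) & (actual == 1))
        tn = np.sum((pred == 0) & (actual == 0))
        fp = np.sum((pred == 1) & (actual == 0))
        fn = np.sum((pred == 0) & (actual == 1))
        gap = abs(tp / (tp + fn) - tn / (tn + fp))
        if gap < best_gap:          # keeps the first cutoff with the smallest gap
            best_cut, best_gap = float(cut), gap
    return best_cut

# Toy labels and probabilities for illustration only
toy_actual = [0, 0, 0, 1, 1, 1]
toy_probs = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

cut = balanced_cutoff(toy_actual, toy_probs)
print(cut)
```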

In [None]:
# Let us make the final prediction using 0.37 as the cut off

y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.37 else 0)
y_train_pred_final.head()

In [None]:
# Now let us calculate the lead score

y_train_pred_final['lead_score'] = y_train_pred_final.Converted_Prob.map(lambda x: round(x*100))
y_train_pred_final.head(20)

In [None]:
# checking if 80% cases are correctly predicted based on the converted column.

# get the total of final predicted conversion / non conversion counts from the actual converted rates

checking_df = y_train_pred_final.loc[y_train_pred_final['Converted']==1,['Converted','final_predicted']]
checking_df['final_predicted'].value_counts()

In [None]:
# check the percentage of final_predicted conversions

1965/float(1965+497)

##### Hence we can see that the final prediction correctly identifies about 80% (79.8%) of the leads that actually converted, i.e. the sensitivity at the 0.37 cutoff, which is in line with the X Education CEO's target of around 80%. Hence this is a good model.

##### Overall Metrics - Accuracy, Confusion Matrix, Sensitivity, Specificity, False Positive Rate, Positive Predictive Value, Negative Predictive Value on final prediction on train set

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

In [None]:
# Let us calculate specificity

TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when leads have not converted

print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 

print (TP / float(TP+FP))

In [None]:
# Negative predictive value

print (TN / float(TN+ FN))

##### Metrics - Precision and Recall

In [None]:
#Looking at the confusion matrix again

confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
confusion

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
# precision

precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

In [None]:
# recall

recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

### Precision and recall tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
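
The threshold at which the precision and recall curves cross can also be located numerically. This is a minimal sketch on made-up scores; `toy_actual`, `toy_probs` and `crossover` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities
toy_actual = np.array([0, 0, 1, 0, 1, 1])
toy_probs = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

toy_p, toy_r, toy_thresholds = precision_recall_curve(toy_actual, toy_probs)

# precision and recall have one more entry than thresholds, so drop the last point
gap = np.abs(toy_p[:-1] - toy_r[:-1])
crossover = toy_thresholds[np.argmin(gap)]
print(crossover)
```

On real data the two curves rarely meet exactly, so the threshold with the smallest gap is taken as an approximation of the crossing point.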

### Step 12: Making predictions on the test set

In [None]:
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits',
                                                                        'Total Time Spent on Website','Page Views Per Visit']])

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting the test predictions (y_test_pred) to a dataframe

y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head

y_pred_1.head()

In [None]:
# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

In [None]:
# Putting the index into a 'LeadId' column

y_test_df['LeadId'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 

y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

In [None]:
# Rearranging the columns

y_pred_final = y_pred_final.reindex(columns=['LeadId','Converted','Converted_Prob'])

In [None]:
y_pred_final.head()

In [None]:
# Applying the cutoff threshold of 0.37 chosen from accuracy, sensitivity and specificity

y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.37 else 0)

In [None]:
y_pred_final.head()

In [None]:
# Now let us calculate the lead score

y_pred_final['lead_score'] = y_pred_final.Converted_Prob.map(lambda x: round(x*100))
y_pred_final.head(20)

In [None]:
# checking if 80% cases are correctly predicted based on the converted column.

# get the total of final predicted conversion or non conversion counts from the actual converted rates

checking_test_df = y_pred_final.loc[y_pred_final['Converted']==1,['Converted','final_predicted']]
checking_test_df['final_predicted'].value_counts()

In [None]:
# check the percentage of final_predicted conversions on test data

797/float(797+218)

##### Hence we can see that the final prediction correctly identifies about 79% (78.5%) of the converted leads in the test set (around 1% short of the result on the training data set)

##### Overall Metrics - Accuracy, Confusion Matrix, Sensitivity, Specificity on test set

In [None]:
# Let's check the accuracy.

metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

##### Precision and Recall metrics for the test set 

In [None]:
# precision
print('precision ',precision_score(y_pred_final.Converted, y_pred_final.final_predicted))

# recall
print('recall ',recall_score(y_pred_final.Converted, y_pred_final.final_predicted))

In [None]:

p, r, thresholds = precision_recall_curve(y_pred_final.Converted, y_pred_final.Converted_Prob)

plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

##### Conclusion :
    
    - While we have checked both the Sensitivity-Specificity and the Precision-Recall metrics, we have chosen the
      optimal cutoff based on Sensitivity and Specificity for calculating the final prediction.
    - The Accuracy, Sensitivity and Specificity values on the test set are around 81%, 79% and 82%, which are close to
      the respective values calculated on the training set.
    - Also, with the lead scores calculated on the training set, the final model correctly identifies around 80% of
      the leads that converted.
    - Hence, overall this model seems to be good.