# Activity 6: Explore predictive accuracy

In this activity, we build a Naive Bayes model in Python using a single predictor. We start with the same variable used in the previous activity and compare the results, beginning with how to run Naive Bayes in Python for categorical (qualitative) predictors.

We will look at the variable 'Telephone' together and then you will analyse 'Housing' on your own.

## Telephone

In [1]:
# We need the following packages

import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix as cm
from sklearn.model_selection import train_test_split

# Data preparation. Let's use the same variables:
data = pd.read_csv("german_credit.csv", usecols=["Telephone", "Status"])
data.head(10)

,Telephone,Status
0,Known,Good
1,Unknown,Bad
2,Unknown,Good
3,Unknown,Good
4,Unknown,Bad
5,Known,Good
6,Unknown,Good
7,Known,Good
8,Unknown,Good
9,Unknown,Bad


In [2]:
# Telephone must be converted to 0/1 dummy columns. For Multinomial NB we keep every level (drop_first=False); there is no need to drop one to avoid correlation.
data = pd.concat([data,pd.get_dummies(data['Telephone'], prefix='Telephone', drop_first=False)],axis=1).drop(['Telephone'],axis=1)  

# The target variable also should be coded as 0/1.
data = pd.concat([data,pd.get_dummies(data['Status'], prefix='Status', drop_first=False)],axis=1).drop(['Status'],axis=1)  

# For the rest of our analysis, we want to focus on Status_Bad rather than good
data = data.drop('Status_Good',axis=1)

data.head(10)

,Telephone_Known,Telephone_Unknown,Status_Bad
0,1,0,0
1,0,1,1
2,0,1,0
3,0,1,0
4,0,1,1
5,1,0,0
6,0,1,0
7,1,0,0
8,0,1,0
9,0,1,1
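
To see what the dummy-coding step does in isolation, here is a minimal sketch on a made-up five-row frame (the toy values are invented for illustration and are not taken from german_credit.csv). `dtype=int` is passed explicitly because newer pandas versions return True/False columns by default rather than the 0/1 values shown above; the target is coded with a comparison instead of `get_dummies`, which is an alternative to the approach in the cell, not a replacement for it.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Telephone": ["Known", "Unknown", "Unknown", "Known", "Unknown"],
    "Status": ["Good", "Bad", "Good", "Good", "Bad"],
})

# One 0/1 column per level of Telephone; drop_first=False keeps every level
phone_dummies = pd.get_dummies(toy["Telephone"], prefix="Telephone",
                               drop_first=False, dtype=int)

# Code the target as a single 0/1 column: 1 = Bad, 0 = Good
status_bad = (toy["Status"] == "Bad").astype(int).rename("Status_Bad")

toy_coded = pd.concat([phone_dummies, status_bad], axis=1)
print(toy_coded)
```

Each row has exactly one of the two Telephone columns set to 1, which is the one-hot structure the Naive Bayes models below rely on.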


In [3]:
X=data.drop('Status_Bad', axis=1)
y=data['Status_Bad']

model1 = MultinomialNB()
# ravel() converts y-vector into 1d array as required by sklearn
model1.fit(X, y.values.ravel()) 

y_pred = model1.predict(X) # this produces predicted class

print("Confusion matrix: \n"+str(cm(y,y_pred)))

Confusion matrix: 
[[700   0]
 [300   0]]
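
As a reading aid, the matrix above can be unpacked by hand. sklearn's `confusion_matrix` puts the true classes in rows and the predicted classes in columns, so with 0 = Good and 1 = Bad the layout is [[TN, FP], [FN, TP]]. The sketch below simply re-enters the printed counts; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
matrix = [[700, 0],
          [300, 0]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of correct predictions
recall_bad = tp / (tp + fn)    # share of actual Bad customers that are flagged

print(f"accuracy = {accuracy:.2f}, recall for Bad = {recall_bad:.2f}")
```

Accuracy looks respectable only because 70% of customers are Good; not a single Bad customer is identified.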


In [4]:
# Everyone is classified as 'Good' because the classes are imbalanced, as discussed before
# From a business point of view, this means we would accept everyone, and our PD would stay at 300/1000 = 0.3
# Let's see if we can correct for this

y_prob = model1.predict_proba(X) # predicted probabilities: column 0 is P(Good), column 1 is P(Bad)
y_pred1 = np.where(y_prob[:,1]>=0.3, 1, 0) # classify as Bad when P(Bad) >= 0.3
print("Confusion matrix: \n"+str(cm(y,y_pred1)))

y_pr = pd.DataFrame({'Bad': y, 'P_Bad': y_prob[:,1], 'Bad_hat': y_pred1})
phone_cross = pd.crosstab(index=y_pr["Bad"],  # Produce a two-way table
                         columns=y_pr["P_Bad"], margins=True)
phone_cross

Confusion matrix: 
[[291 409]
 [113 187]]


Bad \ P_Bad,0.28002314048044485,0.31356461080822495,All
0,291,409,700
1,113,187,300
All,404,596,1000
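
The two distinct values of P_Bad in the table can be reproduced by hand from the counts, which is a useful check on what the model is doing. The sketch below assumes `MultinomialNB` was fitted with its default settings (Laplace smoothing with alpha=1 and class priors estimated from the data), as in `model1` above, and applies Bayes' rule to a single one-hot row. The helper name `p_bad` is hypothetical.

```python
# Counts from the two-way table: class -> {level: count}
counts = {
    "Good": {"Known": 291, "Unknown": 409},
    "Bad":  {"Known": 113, "Unknown": 187},
}
alpha = 1.0  # Laplace smoothing, assumed to be the MultinomialNB default

n_total = sum(sum(levels.values()) for levels in counts.values())

def p_bad(level):
    """Posterior P(Bad | Telephone = level) for one one-hot encoded row."""
    joint = {}
    for cls, levels in counts.items():
        n_cls = sum(levels.values())
        prior = n_cls / n_total
        # Smoothed likelihood of this level within the class
        likelihood = (levels[level] + alpha) / (n_cls + alpha * len(levels))
        joint[cls] = prior * likelihood
    return joint["Bad"] / (joint["Good"] + joint["Bad"])

for level in ("Known", "Unknown"):
    p = p_bad(level)
    print(level, round(p, 4), "Bad at 0.5:", p >= 0.5, "Bad at 0.3:", p >= 0.3)
```

Both posteriors sit below 0.5, which is why the default rule classifies everyone as Good; only the 'Unknown' group crosses the 0.3 cut-off.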


In [5]:
# Accuracy is lower: we now misclassify 409 + 113 = 522 customers, compared with 300 in the previous step
# However, from a business point of view we would accept only those predicted Good (404, or about 40%),
# and the PD among accepted applicants (predicted Good) is 113/404 = 0.28,
# which is slightly lower than 0.3, but we would have to reject about 60% of applicants

# We can also calculate the area under the ROC curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y, y_prob[:,1])
roc_auc=auc(false_positive_rate, true_positive_rate)
print("AUC:" +str(roc_auc)) 

AUC:0.5195238095238095
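
Because the score takes only two distinct values, the ROC curve has a single interior point and the AUC can be checked by hand with the trapezoid rule. This sketch uses the counts from the confusion matrix at the 0.3 cut-off; the variable names are illustrative.

```python
# Confusion matrix at the 0.3 cut-off: rows = actual, columns = predicted
tn, fp = 291, 409   # actual Good
fn, tp = 113, 187   # actual Bad

tpr = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)   # false positive rate

# ROC curve through (0, 0), (fpr, tpr), (1, 1); area by the trapezoid rule
roc_points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
auc_by_hand = sum((x1 - x0) * (y0 + y1) / 2
                  for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]))

print(auc_by_hand)
```

The result agrees with the value printed above up to floating-point rounding, and it simplifies to (1 + TPR - FPR) / 2: the AUC exceeds 0.5 only by half the gap between the two rates.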


Not brilliant either: an AUC of about 0.52 is barely above the 0.5 of a random classifier, but we only used one variable. We can also split the sample into training and test sets and calculate the AUC and confusion matrix on the test sample.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2) 
# 70% training and 30% test
# random_state fixes the split, so if we want to compare models, we compare them on the same sample

model2 = BernoulliNB()
model2.fit(X_train, y_train.values.ravel()) 

# this produces predicted probabilities
y_prob2 = model2.predict_proba(X_test) 

# define cut-off at PD = 0.3 instead of 0.5
y_pred2 = np.where(y_prob2[:,1]>=0.3, 1, 0) 
print("Confusion matrix: \n"+str(cm(y_test,y_pred2)))

# Model AUC?
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob2[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
print("AUC:" +str(roc_auc)) 

Confusion matrix: 
[[ 82 119]
 [ 40  59]]
AUC:0.5019598974822855


As expected, measures of predictive accuracy on the test sample are worse than on the whole sample (which also served as the training data). The AUC is barely above 0.5, and the PD among applicants predicted Good is now about 0.33 (40/122), no better than the test-sample base rate of 99/300 = 0.33. 'Telephone' is therefore probably not a very strong predictor.
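
The business reading of a confusion matrix used in this activity (accept everyone predicted Good, then ask how many of them turn out Bad) can be wrapped in a small helper. This is a sketch with a hypothetical function name, applied to the test-sample matrix printed above.

```python
def acceptance_summary(matrix):
    """Return (acceptance rate, PD among accepted) for a 2x2 confusion matrix.

    Rows are actual classes and columns predicted classes, with 0 = Good
    and 1 = Bad; applicants predicted Good are treated as accepted.
    """
    (tn, fp), (fn, tp) = matrix
    accepted = tn + fn                 # everyone predicted Good
    total = tn + fp + fn + tp
    return accepted / total, fn / accepted

test_matrix = [[82, 119],
               [40, 59]]
rate, pd_among_accepted = acceptance_summary(test_matrix)
print(f"accepted {rate:.1%} of applicants, "
      f"PD among accepted = {pd_among_accepted:.3f}")
```

The same helper can be pointed at any of the other matrices in this activity, for example `acceptance_summary([[291, 409], [113, 187]])` for the whole-sample result at the 0.3 cut-off.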

## Housing

Now it is your turn. You should repeat the same analysis for 'Housing'. Is predictive accuracy better for 'Housing' than for 'Telephone'?

In [7]:
data = pd.read_csv("german_credit.csv", usecols=["Housing", "Status"])
data.head(10)

data = pd.concat([data,pd.get_dummies(data['Housing'], prefix='Housing', drop_first=False)],axis=1).drop(['Housing'],axis=1)  

# The target variable also should be coded as 0/1.
data = pd.concat([data,pd.get_dummies(data['Status'], prefix='Status', drop_first=False)],axis=1).drop(['Status'],axis=1)  

# For the rest of our analysis, we want to focus on Status_Bad rather than good
data = data.drop('Status_Good',axis=1)

data.head()

,Housing_Free,Housing_Own,Housing_Rent,Status_Bad
0,0,1,0,0
1,0,1,0,1
2,0,1,0,0
3,1,0,0,0
4,1,0,0,1


In [8]:
X=data.drop('Status_Bad', axis=1)
y=data['Status_Bad']

# First, try without adjusting the probability threshold:

### BEGIN SOLUTION


### END SOLUTION

print("Confusion matrix: \n"+str(cm(y,y_pred)))

Confusion matrix: 
[[700   0]
 [300   0]]


In [9]:
# Now, adjust the probability cut-off:

### BEGIN SOLUTION


### END SOLUTION

housing_cross

NameError: name 'housing_cross' is not defined

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2) 

# Now fit a model on the 70% training set and evaluate it on the 30% test set with an appropriate cut-off

### BEGIN SOLUTION


### END SOLUTION

print("Confusion matrix: \n"+str(cm(y_test,y_pred2)))

# Model AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob2[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
print("AUC:" +str(roc_auc)) 

Confusion matrix: 
[[149  52]
 [ 65  34]]
AUC:0.5407055630936228
