Final Assignment

This assignment is for candidates enrolled in our Programs in Analytics and Statistical Studies (PASS) or seeking ACE credits. Others are welcome to complete it, view the model answers, and compare their own answers against them, but marks will be provided only to those registered in the above program.

Scenario: Universal Bank has begun a program to encourage its existing customers to borrow via a consumer loan program. The bank has promoted the loan to 5000 customers, of whom 480 accepted the offer. The data are available in the file UniversalBank.csv. The bank now wants to develop a model to predict which customers have the greatest probability of accepting the loan, so that it can reduce promotion costs by sending the offer only to a subset of its customers.

Data: This final assignment integrates learning from the earlier assignments. We will use the Personal Loan Offer dataset that was also used in Assignments 4 (K-NN) and 5 (Naive Bayes), as well as in the example in Chapter 9 (CART). We will fit K-NN with k=3, Naive Bayes, and a classification tree, then combine them in an ensemble. Finally, we apply bagging and boosting and compare the results.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
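from sklearn.metrics import accuracy_score
# sklearn.metrics.classification was a private module (removed in scikit-learn 0.24);
# aliasing the public sklearn.metrics module keeps the classification.* calls working
from sklearn import metrics as classification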
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from dmba import plotDecisionTree, regressionSummary, classificationSummary
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
import math
import scipy.stats

no display found. Using non-interactive Agg backend


### Data preparation: Load the data and remove unnecessary columns (ID, ZIP Code). Split the data into training (60%) and validation (40%) sets (use random_state=1).

In [2]:
# Load the data
bank_df = pd.read_csv("dmba/UniversalBank.csv")

# Verify data is loaded correctly
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.head(5)  # view the first 5 observations

Shape (5000, 14)


Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [3]:
# Remove ID and Zip Code columns
bank_df = bank_df.drop(columns=['ID', 'ZIP Code'])

# Re-Verify data is loaded correctly
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.tail(5)  # view the last 5 observations

Shape (5000, 12)


Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
4995,29,3,40,1,1.9,3,0,0,0,0,1,0
4996,30,4,15,4,0.4,1,85,0,0,0,1,0
4997,63,39,24,2,0.3,3,0,0,0,0,0,0
4998,65,40,49,3,0.5,2,0,0,0,0,1,0
4999,28,4,83,3,0.8,1,0,0,0,0,1,1


In [4]:
y = bank_df["Personal Loan"]
X = bank_df.drop(columns=["Personal Loan"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
print('Training set:', X_train.shape, 'Validation set:', X_test.shape)

Training set: (3000, 11) Validation set: (2000, 11)


### PART A (50 points) We will develop K-NN with k=3, Naive Bayes (after binning the continuous predictors) and classification tree, then combine them in an ensemble.

#### Fit models to the data for (1) k-nearest neighbors with k = 3, (2) Naive Bayes and (3) classification trees. Use Personal Loan as the outcome variable. Report the validation confusion matrix for each of the three models. (15 points)

In [5]:
predictors = list(X_train.columns)
scaler = preprocessing.StandardScaler()
scaler.fit(X_train[predictors])

# Transform the predictors
train_X = scaler.transform(X_train[predictors])
train_y = y_train
valid_X = scaler.transform(X_test[predictors])
valid_y = y_test


##### Fit models to the data for (1) k-nearest neighbors with k = 3

In [6]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_X, train_y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

###### Reporting the validation confusion matrix for k-nearest neighbors with k=3. Use Personal Loan as the outcome variable.

In [7]:
knnPred = knn.predict(valid_X)
print(classification.confusion_matrix(valid_y, knnPred), "\n")
print('Accuracy :', classification.accuracy_score(valid_y, knnPred))

[[1793   14]
 [  77  116]] 

Accuracy : 0.9545


##### Fit models to the data for (2) Naive Bayes (after binning the continuous predictors). Use Personal Loan as the outcome variable.

In [8]:
# Convert Online and CreditCard to categories
bank_df.Online = bank_df.Online.astype('category')
bank_df.CreditCard = bank_df.CreditCard.astype('category')
# Re-verify the data frame after the type conversion
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.tail(5)  # view the last 5 observations

Shape (5000, 12)


Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
4995,29,3,40,1,1.9,3,0,0,0,0,1,0
4996,30,4,15,4,0.4,1,85,0,0,0,1,0
4997,63,39,24,2,0.3,3,0,0,0,0,0,0
4998,65,40,49,3,0.5,2,0,0,0,0,1,0
4999,28,4,83,3,0.8,1,0,0,0,0,1,1


In [9]:
# Remove all columns except for Online, CreditCard and Personal Loan
bank_df = bank_df[["Online", "CreditCard", "Personal Loan"]].copy()  # copy so later column assignments do not write to a view

# Convert Online and CreditCard to categories
bank_df.Online = bank_df.Online.astype('category')
bank_df.CreditCard = bank_df.CreditCard.astype('category')

# Re-verify the data frame after dropping columns
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.tail(5)  # view the last 5 observations

Shape (5000, 3)


Unnamed: 0,Online,CreditCard,Personal Loan
4995,1,0,0
4996,1,0,0
4997,0,0,0
4998,1,0,0
4999,1,1,0


In [10]:
#binned_personal_loan = ["Age", "Experience", "Income", "Family", "CCAvg", "Education", "Mortgage"]

In [11]:
bank_df["binned_personal_loan"] = pd.cut(bank_df["Personal Loan"], 20, labels=False)
bank_df.head(20)

Unnamed: 0,Online,CreditCard,Personal Loan,binned_personal_loan
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
3,0,0,0,0
4,0,1,0,0
5,1,0,0,0
6,1,0,0,0
7,0,1,0,0
8,1,0,0,0
9,0,0,1,19


In [12]:
X = bank_df.drop(columns=["Personal Loan", "binned_personal_loan"])
y = bank_df["binned_personal_loan"]

X = pd.get_dummies(X, prefix_sep="---")

In [13]:
X.head()

Unnamed: 0,Online---0,Online---1,CreditCard---0,CreditCard---1
0,1,0,1,0
1,1,0,1,0
2,1,0,1,0
3,1,0,1,0
4,1,0,0,1


In [14]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: binned_personal_loan, dtype: int64

In [15]:
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=1)
print("The shapes of the training and test sets are",X_train.shape, " and", X_test.shape, "respectively.")

The shapes of the training and test sets are (3000, 4)  and (2000, 4) respectively.


In [16]:
# run naive Bayes
nb = MultinomialNB(alpha=0.01)
nb.fit(X_train, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

###### Reporting the validation confusion matrix for Naive Bayes

In [17]:
nbPred = nb.predict(X_test)
print(classification.confusion_matrix(y_test, nbPred), "\n")
print('Accuracy :', classification.accuracy_score(y_test, nbPred))

[[1807    0]
 [ 193    0]] 

Accuracy : 0.9035
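
The task statement asks for Naive Bayes after binning the continuous predictors. The following is a minimal sketch of that preprocessing step, not the notebook's own method: it uses a small synthetic data frame (the values, the outcome rule, and the choice of 5 equal-width bins are all assumptions made for illustration), bins the continuous columns with `pd.cut`, one-hot encodes them, and leaves the 0/1 outcome unbinned.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the bank data (values are made up for illustration)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Income": rng.integers(10, 200, size=200),
    "CCAvg": rng.uniform(0, 10, size=200).round(1),
    "Online": rng.integers(0, 2, size=200),
})
# Hypothetical outcome rule: higher income makes acceptance more likely
toy["Personal Loan"] = (toy["Income"] + rng.normal(0, 30, size=200) > 130).astype(int)

# Start from the predictor that is already categorical
binned = toy[["Online"]].astype("category")

# Bin each continuous predictor into 5 equal-width intervals (bin count is arbitrary)
for col in ["Income", "CCAvg"]:
    binned[col] = pd.cut(toy[col], bins=5, labels=False).astype("category")

# One-hot encode the categorical predictors; the outcome stays as 0/1
X_toy = pd.get_dummies(binned, prefix_sep="_", dtype=int)
y_toy = toy["Personal Loan"]

nb_toy = MultinomialNB(alpha=0.01).fit(X_toy, y_toy)
print(X_toy.shape, nb_toy.classes_)
```

With the real data, the same loop would run over columns such as Age, Experience, Income, CCAvg and Mortgage, and the fitted model would be evaluated on a held-out validation partition rather than on the training rows.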


##### Fit models to the data for (3) classification trees. Use Personal Loan as the outcome variable. 

##### Default Tree

In [18]:
bank_df = pd.read_csv('UniversalBank.csv')
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)

# split into training and validation
X = bank_df.drop(columns=['Personal Loan'])
y = bank_df['Personal Loan']
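# random_state=3 yields a different validation partition than the random_state=1 split used for k-NN and Naive Bayes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=3)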

In [19]:
default_Tree = DecisionTreeClassifier(random_state=1)
default_Tree.fit(X_train, y_train)

names_of_class = [str(s) for s in default_Tree.classes_]

print("Classes: {}".format(', '.join(names_of_class)))
print("Nodes: {}".format(default_Tree.tree_.node_count))

Classes: 0, 1
Nodes: 93


In [20]:
print("Top 10 features of the default tree by importance:")
feature_importance = pd.DataFrame({"features": X_train.columns, "importance": default_Tree.feature_importances_})
# Sort by importance and keep the top 10 features
feature_importance.sort_values(by="importance", ascending=False).head(10)

Top 10 features of the default tree by importance:


Unnamed: 0,features,importance
5,Education,0.411483
2,Income,0.315087
3,Family,0.153984
4,CCAvg,0.042747
8,CD Account,0.034992
1,Experience,0.016844
9,Online,0.011104
0,Age,0.008521
6,Mortgage,0.003224
10,CreditCard,0.002015


###### Reporting the validation confusion matrix for the Default Tree.

In [21]:
classes = default_Tree.classes_
classificationSummary(y_test, default_Tree.predict(X_test), class_names=default_Tree.classes_)

Confusion Matrix (Accuracy 0.9825)

       Prediction
Actual    0    1
     0 1778   15
     1   20  187


#### Create a data frame with the actual outcome, predicted outcome, and probability of being a "1" for each of the three models. Report the first 10 rows of this data frame. (10 points)

In [22]:
y_hats_knn = knnPred
y_hats_nb = nbPred
y_hats_tree = default_Tree.predict(X_test)

In [23]:
new_df = pd.DataFrame()
new_df["actual outcome"]=y_test
new_df["knn predicted outcome"] = y_hats_knn
new_df["nb predicted outcome"] = y_hats_nb
new_df["tree predicted outcome"] = y_hats_tree

In [24]:
new_df.head(10)

Unnamed: 0,actual outcome,knn predicted outcome,nb predicted outcome,tree predicted outcome
2584,1,1,0,1
4338,0,0,0,0
4556,0,0,0,0
3438,0,0,0,0
737,1,0,0,1
2676,0,0,0,0
285,0,0,0,0
2972,0,0,0,0
241,0,0,0,0
519,0,1,0,0
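
The task also asks for the probability of being a "1" from each model. A minimal sketch of how such a data frame could be assembled with `predict_proba` is shown here on synthetic data from `make_classification`; the data, the column names, and the use of `GaussianNB` as a stand-in for the notebook's `MultinomialNB` are assumptions for illustration only. All three models are fitted and scored on the same partition so that the rows line up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the bank data
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=1)
Xtr, Xva, ytr, yva = train_test_split(X_syn, y_syn, test_size=0.4, random_state=1)

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "nb": GaussianNB(),  # stand-in for the notebook's MultinomialNB
    "tree": DecisionTreeClassifier(random_state=1),
}

results = pd.DataFrame({"actual": yva})
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[f"{name}_pred"] = model.predict(Xva)
    # Column 1 of predict_proba is P(class == 1) because classes_ is sorted as [0, 1]
    results[f"{name}_prob"] = model.predict_proba(Xva)[:, 1]

print(results.head(10))
```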


#### Add two columns to this data frame for (1) a majority vote of predicted outcomes, and (2) the average of the predicted probabilities. Using the classifications generated by these two methods derive a confusion matrix for each method and report the overall accuracy. (15 points)

In [25]:
pred_cols = ["knn predicted outcome", "nb predicted outcome", "tree predicted outcome"]

# Majority vote: class 1 wins when at least two of the three models predict a non-zero label
new_df["majority vote"] = ((new_df[pred_cols] > 0).sum(axis=1) >= 2).astype(int)

new_df.head(10)

Unnamed: 0,actual outcome,knn predicted outcome,nb predicted outcome,tree predicted outcome,majority vote
2584,1,1,0,1,1
4338,0,0,0,0,0
4556,0,0,0,0,0
3438,0,0,0,0,0
737,1,0,0,1,0
2676,0,0,0,0,0
285,0,0,0,0,0
2972,0,0,0,0,0
241,0,0,0,0,0
519,0,1,0,0,0
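
The two ensemble rules named in the task can be sketched on a small hand-made frame. The five records, the column names, and the 0.5 cutoff applied to the average probability are hypothetical and chosen only to illustrate the mechanics; they are not results from the bank data.

```python
import pandas as pd

# Hypothetical predictions and probabilities for five validation records
votes = pd.DataFrame({
    "actual":    [1, 0, 0, 1, 0],
    "knn_pred":  [1, 0, 1, 0, 0],
    "nb_pred":   [0, 0, 0, 0, 0],
    "tree_pred": [1, 0, 0, 1, 0],
    "knn_prob":  [0.67, 0.00, 0.67, 0.33, 0.00],
    "nb_prob":   [0.30, 0.05, 0.10, 0.45, 0.08],
    "tree_prob": [1.00, 0.00, 0.00, 1.00, 0.00],
})

pred_cols = ["knn_pred", "nb_pred", "tree_pred"]
prob_cols = ["knn_prob", "nb_prob", "tree_prob"]

# Majority vote: class 1 wins when at least two of the three models predict 1
votes["majority_vote"] = (votes[pred_cols].sum(axis=1) >= 2).astype(int)

# Average probability, classified with a 0.5 cutoff (the cutoff is an assumption)
votes["avg_prob"] = votes[prob_cols].mean(axis=1)
votes["avg_prob_pred"] = (votes["avg_prob"] >= 0.5).astype(int)

# Confusion matrices (rows: actual, columns: predicted) and overall accuracy
for col in ["majority_vote", "avg_prob_pred"]:
    print(pd.crosstab(votes["actual"], votes[col]))
    print(col, "accuracy:", (votes[col] == votes["actual"]).mean())
```

In this toy frame the two rules disagree on the fourth record: only the tree votes for class 1, so the majority vote is 0, while the tree's probability of 1.00 pulls the average above the cutoff.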


#### Compare the error rates for the three individual methods and the two ensemble methods. (10 points)

In [26]:
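# knnPred was made on the random_state=1 partition, so compare it with valid_y from that split
knn_Error = np.mean(knnPred != valid_y)
knn_Error

0.0455

In [27]:
# nbPred holds binned labels (0 or 19) for the same random_state=1 partition; map them back to 0/1
nb_Error = np.mean((nbPred > 0).astype(int) != valid_y)
nb_Error

0.0965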

In [28]:
default_Tree_Error = np.mean(default_Tree.predict(X_test) != y_test)
default_Tree_Error

0.0175

### PART B (30 points) Use Bagging and Boosted Trees and compare their performance with all the methodologies we used in PART A.

##### Bagging

In [29]:
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=1), n_estimators=100, random_state=1)
bagging.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=None, oob_score=False,
         random_state=1, verbose=0, warm_start=False)

##### Report the validation confusion matrix for bagging.

In [30]:
classificationSummary(y_test, bagging.predict(X_test), class_names=classes)

Confusion Matrix (Accuracy 0.9855)

       Prediction
Actual    0    1
     0 1781   12
     1   17  190


In [31]:
bagging_Error = np.mean(bagging.predict(X_test) != y_test)
bagging_Error

0.0145

##### Boosting

In [32]:
boost = AdaBoostClassifier(DecisionTreeClassifier(random_state=1), n_estimators=100, random_state=1)
boost.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best'),
          learning_rate=1.0, n_estimators=100, random_state=1)

##### Report the validation confusion matrix for boosting.

In [33]:
classificationSummary(y_test, boost.predict(X_test), class_names=classes)

Confusion Matrix (Accuracy 0.9840)

       Prediction
Actual    0    1
     0 1779   14
     1   18  189


In [34]:
boost_Error = np.mean(boost.predict(X_test) != y_test)
boost_Error

0.016

##### Random Forest

In [35]:
# Train a random forest classifier using the training set
rfModel = RandomForestClassifier(n_estimators=100, random_state=1)
rfModel.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

##### Report the validation confusion matrix for random forest.

In [36]:
classificationSummary(y_test, rfModel.predict(X_test), class_names=classes)

Confusion Matrix (Accuracy 0.9830)

       Prediction
Actual    0    1
     0 1784    9
     1   25  182


In [37]:
rfModel_Error = np.mean(rfModel.predict(X_test) != y_test)
rfModel_Error

0.017
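
To compare the methods side by side, the error rates can be recomputed from the validation confusion matrices reported above. The sketch below copies those matrices by hand; the `error_rate` helper and the labels are names introduced here for illustration. Note that the k-NN and Naive Bayes matrices come from the random_state=1 partition, while the tree, bagging, boosting and random forest matrices come from the random_state=3 partition, so the comparison is indicative rather than strictly like-for-like.

```python
import numpy as np
import pandas as pd

# Validation confusion matrices copied from the outputs above
# (rows: actual 0/1, columns: predicted 0/1)
matrices = {
    "k-NN (k=3)":    [[1793, 14], [77, 116]],
    "naive Bayes":   [[1807, 0], [193, 0]],
    "default tree":  [[1778, 15], [20, 187]],
    "bagging":       [[1781, 12], [17, 190]],
    "boosting":      [[1779, 14], [18, 189]],
    "random forest": [[1784, 9], [25, 182]],
}

def error_rate(cm):
    """Share of off-diagonal (misclassified) records in a confusion matrix."""
    cm = np.asarray(cm)
    return 1 - np.trace(cm) / cm.sum()

summary = pd.Series(
    {name: error_rate(cm) for name, cm in matrices.items()},
    name="error rate",
).sort_values()

print(summary.round(4))
```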