

In this example, we will be using a bank loan data set to see how we can build a classifier that can help identify potential customers who have a higher probability of purchasing the loan. You may read more about the data set at https://www.kaggle.com/itsmesunil/bank-loan-modelling.

**For this example, please import the necessary packages as you need them.** For good programming style, you should put all import commands in the following cell. However, you may also import a package in the cell where you first call its API.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression


In [2]:
df = pd.read_csv("Bank_Personal_Loan_Modelling.csv")

In [3]:
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

In [4]:
df

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,29,3,40,92697,1,1.9,3,0,0,0,0,1,0
4996,4997,30,4,15,92037,4,0.4,1,85,0,0,0,1,0
4997,4998,63,39,24,93023,2,0.3,3,0,0,0,0,0,0
4998,4999,65,40,49,90034,3,0.5,2,0,0,0,0,1,0


In [5]:
X = df[['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities Account',
       'CD Account', 'Online', 'CreditCard']].values

In [6]:
y = df['Personal Loan'].values

In [7]:
df['Personal Loan'].describe()

count    5000.000000
mean        0.096000
std         0.294621
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: Personal Loan, dtype: float64
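The summary above implies that the classes are imbalanced: a mean of 0.096 over 5000 rows means only a small share of customers accepted the loan. The short sketch below works out the arithmetic, using only the `count` and `mean` values printed above. The variable names are illustrative, and the "always predict 0" baseline is a standard reference point rather than something the data set description prescribes.

```python
# Values copied from the describe() output above
count = 5000
mean = 0.096

# For a 0/1 column, the mean is the fraction of positives
positives = round(count * mean)
negatives = count - positives

# Accuracy of a trivial classifier that always predicts class 0
baseline_accuracy = negatives / count

print("Customers who took the loan:", positives)
print("Customers who did not:", negatives)
print("Accuracy of always predicting 0:", baseline_accuracy)
```

Any accuracy score reported for a classifier on this data is best read against this baseline rather than against 0.5.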

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 5)

##### Article on data splitting functions: https://medium.com/@julie.yin/understanding-the-data-splitting-functions-in-scikit-learn-9ae4046fbd26
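As a quick illustration of what `train_test_split` does with its default settings, the sketch below splits a stand-in array with the same number of rows as the loan data. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the real features. With no `test_size` given, scikit-learn holds out 25% of the rows for testing, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 5000 rows, like the loan data set
X_demo = np.arange(5000 * 2).reshape(5000, 2)
y_demo = np.arange(5000) % 2

# Default split: 75% training rows, 25% test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)
print(X_tr.shape, X_te.shape)  # (3750, 2) (1250, 2)

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=5)
same_split = np.array_equal(X_tr, X_tr2) and np.array_equal(y_te, y_te2)
print("Same split on a second call:", same_split)
```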

## Logistic Regression

Use the LogisticRegression API to implement a logistic regression classifier. More specifically, we follow these steps:
1. Declare a LogisticRegression object. Note that by default, the scikit-learn package adds an L2 regularization term to its logistic regression model. You can effectively disable it by setting a very large C, such as C=1e20, or by setting penalty to 'none' (written as penalty=None in scikit-learn 1.2 and later).
2. Fit the logistic regression model using the training data, that is, X_train and y_train.
3. Evaluate the performance of the logistic regression model on both the training set and the test set.

Logistic Regression Documentation - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
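To see what the regularization strength in step 1 does, the following sketch fits two models on synthetic data: one with a small `C` (strong L2 regularization) and one with `C=1e20` (regularization effectively disabled). The data comes from `make_classification` and is purely illustrative; it is not the loan data, and the specific values of `C` are assumptions chosen to make the contrast visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy two-class data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)

# Small C means a strong L2 penalty; a huge C makes the penalty negligible
strong = LogisticRegression(C=0.01, max_iter=2000).fit(X_demo, y_demo)
weak = LogisticRegression(C=1e20, max_iter=2000).fit(X_demo, y_demo)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print("Coefficient norm with C=0.01:", strong_norm)
print("Coefficient norm with C=1e20:", weak_norm)
```

The strongly regularized model ends up with smaller coefficients, which is the shrinkage effect that a very large `C` switches off.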

In [9]:
# penalty=None disables regularization (use penalty='none' on scikit-learn versions before 1.2)
regr = LogisticRegression(max_iter=2000, penalty=None, solver='lbfgs')

# Fit the model to the training data
regr.fit(X_train,y_train)

# score on train and test data, where we expect to see higher scores for training data
train_score = regr.score(X_train,y_train)
print("Training Data Score:", train_score)
test_score = regr.score(X_test, y_test)
print("Testing Data Score:", test_score)

Training Data Score: 0.9362666666666667
Testing Data Score: 0.9336


## Feature Normalization
Next, let's use feature normalization and polynomial feature expansion to improve the performance of the model. Both transformers are fit on the training set only, and the fitted transformers are then applied to the test set.

**Note that the exact same normalization should be applied to the training and test sets. Therefore, the MinMaxScaler and PolynomialFeatures objects should each be "fit" only once, on the training data.**
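The toy sketch below shows why the scaler should be fit only once. It uses a hypothetical one-column data set (`train_demo` and `test_demo` are made-up values) and compares transforming the test data with the scaler fitted on the training data against refitting a new scaler on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[5.0], [20.0]])

# Correct: fit on the training data, then reuse the same min and max
scaler = MinMaxScaler().fit(train_demo)
test_ok = scaler.transform(test_demo)
print(test_ok.ravel())  # [0.5 2. ]

# Incorrect: refitting on the test data uses a different min and max
test_wrong = MinMaxScaler().fit_transform(test_demo)
print(test_wrong.ravel())  # [0. 1.]
```

In the correct version the value 5.0 maps to 0.5 in both sets, so the model sees a consistent scale. Test values outside the training range can fall outside [0, 1], which is expected.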

In [10]:
minmax = MinMaxScaler()
poly = PolynomialFeatures(degree=2)

minmax.fit(X_train)

X_train_scaled = minmax.transform(X_train)
X_train_poly = poly.fit_transform(X_train_scaled)

In [11]:
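# Rescale the test data with the scaler fitted on the training data **THIS STEP IS CRITICAL**
X_test_scaled = minmax.transform(X_test)
# Use transform, not fit_transform, so that PolynomialFeatures is fit only once
X_test_poly = poly.transform(X_test_scaled)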
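For a sense of what the degree-2 expansion produces, here is a small sketch on a made-up two-feature row, followed by a count of the output columns for ten input features (the number selected for `X` earlier). The names `demo` and `wide` are illustrative placeholders.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One row with features a=2 and b=3
demo = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(demo)
# Columns are: 1, a, b, a^2, a*b, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# With ten input features: 1 bias + 10 linear + 55 degree-2 terms
wide = np.zeros((1, 10))
n_expanded = PolynomialFeatures(degree=2).fit_transform(wide).shape[1]
print(n_expanded)  # 66
```

The larger feature count is one reason the next fit may need more iterations to converge than the first one.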


## Logistic Regression with New Features 
With the normalized and expanded features, we can then re-run the logistic regression to see if that improves the performance. 

In [12]:
# penalty=None disables regularization (use penalty='none' on scikit-learn versions before 1.2)
regr = LogisticRegression(max_iter=4000, penalty=None, solver='lbfgs')

# Retrain using the min-max scaled, polynomial-expanded features
regr.fit(X_train_poly, y_train)

# Score on the training and test data, where we expect to see a higher score for the training data
train_score = regr.score(X_train_poly, y_train)
print("Training Data Score:", train_score)
test_score = regr.score(X_test_poly, y_test)
print("Testing Data Score:", test_score)

Training Data Score: 0.9597333333333333
Testing Data Score: 0.9472
