## Assignment Week 9 - Biswajit Sharma

Import the dataset and ensure that it loaded properly.

In [1]:
# import modules

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Set up display options
pd.options.display.max_columns=80
pd.options.display.max_rows=50
pd.options.display.max_colwidth=80

In [3]:
# read the dataset into a dataframe
df = pd.read_csv("./datasets/Loan_Train.csv")

In [4]:
# view few rows
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


Drop the column “Loan_ID.”

In [5]:
#drop the Id column
df = df.drop(columns="Loan_ID")

In [6]:
# view few rows
df.head()

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


Drop any rows with missing data.

In [7]:
# check dimension of dataframe
df.shape

(614, 12)

In [8]:
#identify columns that have missing values
df.isna().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [9]:
# drop rows if any field in a row has NaN
df = df.dropna(how="any", axis=0)

In [10]:
# check dimension after drop
df.shape

(480, 12)

In [11]:
# check no missing values remain
df.isna().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

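We notice that 134 of the 614 rows (614 - 480) have been dropped due to the presence of missing values. The per-column counts above add up to 149 missing cells, which is more than 134 because some rows are missing more than one field.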
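
As a minimal sketch of how `dropna(how="any")` counts rows rather than cells, the example below uses a small hypothetical frame (the values are illustrative and not taken from the loan data). One row has two missing fields, so three missing cells lead to only two dropped rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the loan data); the second row has two missing fields
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female", "Male"],
        "LoanAmount": [128.0, np.nan, np.nan, 141.0],
        "Loan_Status": ["N", "Y", "Y", "Y"],
    }
)

missing_cells = int(toy.isna().sum().sum())  # total NaN cells across all columns
rows_before = len(toy)

# drop a row if any of its fields is NaN
toy_clean = toy.dropna(how="any", axis=0)
rows_dropped = rows_before - len(toy_clean)

print(f"Missing cells: {missing_cells}, rows dropped: {rows_dropped}")
```

The same comparison on the real dataframe would use the shapes reported before and after the drop.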

Convert the categorical features into dummy variables.

In [12]:
# check column datatypes
df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [13]:
#using pandas get_dummies to encode categorical variables
categorical_feature_names = [cols for cols in df.columns if (df[cols].dtype == "object" and cols != "Loan_Status")]
df_with_dummies = pd.get_dummies(df, columns=categorical_feature_names, drop_first=True, dtype="int")

In [14]:
# view few rows after applying dummies
df_with_dummies.head()

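| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Gender_Male | Married_Yes | Dependents_1 | Dependents_2 | Dependents_3+ | Education_Not Graduate | Self_Employed_Yes | Property_Area_Semiurban | Property_Area_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | N | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Y | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Y | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Y | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Y | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |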
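
The encoded frame has no `Property_Area_Rural` or `Gender_Female` column because `drop_first=True` drops the first category of each feature. The sketch below illustrates this on a hypothetical one-column frame whose category values mirror `Property_Area`; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical toy frame; only the category names mirror the loan data
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# Categories are ordered alphabetically, so "Rural" is the dropped baseline
dummies = pd.get_dummies(toy, columns=["Property_Area"], drop_first=True, dtype="int")
print(dummies)
```

A row with zeros in every remaining dummy column therefore belongs to the dropped baseline category.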


Split the data into a training and test set, where the “Loan_Status” column is the target.

In [15]:
# define feature and target names
target = "Loan_Status"
features = [cols for cols in df_with_dummies.columns if cols != "Loan_Status"]

In [16]:
# using sklearn's train_test_split method to split the dataset
features_train, features_test, target_train, target_test = train_test_split(
    df_with_dummies[features], df_with_dummies[target], test_size=0.2, random_state=1
)
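
To make the split sizes concrete, here is a small sketch that applies the same `test_size=0.2` to placeholder arrays with 480 rows, the row count left after dropping missing values. The arrays `X` and `y` are stand-ins, not the loan features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the cleaned data
X = np.arange(480 * 2).reshape(480, 2)
y = np.array(["Y", "N"] * 240)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 20% of 480 rows are held out for testing
print(X_train.shape, X_test.shape)
```

With a test set this small, each individual prediction moves the accuracy by roughly one percentage point.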

Create a pipeline with a min-max scaler and a KNN classifier.

In [17]:
# create a min-max scaler 
scaler = MinMaxScaler()

In [18]:
# create a default KNN classifier
default_knn = KNeighborsClassifier()

In [19]:
# define pipeline
pipe = Pipeline(
    [
        ("scaler",scaler),
        ("knn", default_knn)
    ]
)
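
One reason to place the scaler inside the pipeline is that its minimum and maximum are learned from the training rows only and then reused on new rows. The sketch below shows that behavior with `MinMaxScaler` on its own, using hypothetical income-like values rather than the loan data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical income-like values, not taken from the loan data
train = np.array([[2000.0], [4000.0], [6000.0]])
test = np.array([[3000.0], [8000.0]])

demo_scaler = MinMaxScaler()
demo_scaler.fit(train)  # learns min=2000 and max=6000 from the training rows only

train_scaled = demo_scaler.transform(train)  # mapped onto [0, 1]
test_scaled = demo_scaler.transform(test)    # reuses the training min and max

print(train_scaled.ravel(), test_scaled.ravel())
```

Because clipping is off by default, a test value above the training maximum is scaled to a number greater than 1.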

Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. 

In [20]:
# fit the pipeline with training set
default_knn_model = pipe.fit(features_train, target_train)

In [21]:
#predict loan status for test set using default KNN classifier
predicted_target_test = default_knn_model.predict(features_test)

In [22]:
# calculate accuracy for test set using a default KNN classifier
print(f"Accuracy: {accuracy_score(target_test, predicted_target_test)}")

Accuracy: 0.7291666666666666


We see that the test-set accuracy of the _default KNN classifier_ is $0.73$.
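
For reference, accuracy is simply the fraction of predictions that match the true labels. The following sketch checks this against `accuracy_score` on a few hypothetical labels chosen only to illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only to illustrate the calculation
y_true = np.array(["Y", "Y", "N", "Y", "N", "Y", "Y", "N"])
y_pred = np.array(["Y", "Y", "Y", "Y", "N", "N", "Y", "N"])

manual_accuracy = float(np.mean(y_true == y_pred))  # fraction of matching labels
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```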

Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10.

In [23]:
#create search space 
grid_search_space = [
    {"knn__n_neighbors": list(range(1, 11))}
]

Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [24]:
# create grid search
gcv = GridSearchCV(
    pipe,
    grid_search_space,
    cv=5,
    verbose=1
)

In [25]:
#fit the grid search
best_model = gcv.fit(features_train, target_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
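
The 50 fits are 10 candidate values of `n_neighbors` times 5 folds. As a sketch of how the per-candidate scores could be inspected, the example below runs the same kind of search on synthetic data from `make_classification` (a stand-in, not the loan dataset) and reads the fitted `cv_results_` attribute. The names `pipe_demo` and `search` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the loan dataset itself is not used here
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe_demo, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
search.fit(X, y)

# One row per candidate, with the mean accuracy across the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ["param_knn__n_neighbors", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Looking at the full table, rather than only the winner, shows how close the candidates are to one another.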


In [26]:
# show best model parameters
best_model.best_estimator_.get_params()

{'memory': None,
 'steps': [('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())],
 'verbose': False,
 'scaler': MinMaxScaler(),
 'knn': KNeighborsClassifier(),
 'scaler__clip': False,
 'scaler__copy': True,
 'scaler__feature_range': (0, 1),
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,
 'knn__p': 2,
 'knn__weights': 'uniform'}

In [27]:
print(f"Best value for n_neighbors is {best_model.best_estimator_.get_params()['knn__n_neighbors']}")

Best value for n_neighbors is 5


We see that the best model uses $5$ neighbors, which is also the default value of `n_neighbors` in `KNeighborsClassifier`.
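
A quick check, sketched below, confirms that an unconfigured `KNeighborsClassifier` already uses this value, so the grid search has selected the same model as the default pipeline.

```python
from sklearn.neighbors import KNeighborsClassifier

# Read the default hyperparameter from a freshly constructed classifier
default_k = KNeighborsClassifier().get_params()["n_neighbors"]
print(f"Default n_neighbors: {default_k}")
```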

Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [28]:
#predict loan status for test set using the best KNN classifier
predicted_best_model_target_test = best_model.predict(features_test)

In [29]:
# calculate accuracy for test set using best KNN classifier
print(f"Accuracy: {accuracy_score(target_test, predicted_best_model_target_test)}")

Accuracy: 0.7291666666666666


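The accuracy is still $0.73$, so there is no improvement over the default model. This is expected: the grid search selected `n_neighbors=5`, which is the default value, so both pipelines are the same model.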

Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [30]:
# define model objects
logistic = LogisticRegression(max_iter=500, solver="liblinear")
random_forest = RandomForestClassifier()
knn = KNeighborsClassifier()

In [31]:
# define pipeline
pipe_multiple = Pipeline(
    steps=[
        ("scaler", scaler),
        ("classifier", knn)
    ]
)   

In [32]:
# set hyperparameters search space for all the classifiers
hyperparameters_search_space = [
    {
        "classifier": [knn],
        "classifier__n_neighbors": list(range(1, 11))
    },
    {
        "classifier": [logistic],
        "classifier__penalty": ["l1", "l2"],
        "classifier__C": np.logspace(0,4,10)
    },
    {
        "classifier": [random_forest],
        "classifier__n_estimators": [10,100,1000],
        "classifier__max_features": [1,2,3]
    }
]


In [33]:
# set up grid search for above hyperparameters
gcv_multiple = GridSearchCV(
    pipe_multiple,
    hyperparameters_search_space,
    cv=5,
    verbose=1,
    n_jobs=-1
)

In [34]:
# fit the grid search
best_model_multiple = gcv_multiple.fit(features_train, target_train)

Fitting 5 folds for each of 39 candidates, totalling 195 fits
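
The 39 candidates come from 10 KNN settings, 2 × 10 logistic regression settings, and 3 × 3 random forest settings. The sketch below reproduces that count with `ParameterGrid`; string placeholders stand in for the estimator objects because only the grid sizes matter here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# String placeholders stand in for the estimator objects; only the counts matter
grid = [
    {"classifier": ["knn"], "classifier__n_neighbors": list(range(1, 11))},
    {
        "classifier": ["logistic"],
        "classifier__penalty": ["l1", "l2"],
        "classifier__C": np.logspace(0, 4, 10),
    },
    {
        "classifier": ["random_forest"],
        "classifier__n_estimators": [10, 100, 1000],
        "classifier__max_features": [1, 2, 3],
    },
]

per_model = [len(ParameterGrid(g)) for g in grid]  # candidates per sub-grid
n_candidates = len(ParameterGrid(grid))            # total across all sub-grids
n_fits = n_candidates * 5                          # 5-fold cross-validation

print(per_model, n_candidates, n_fits)
```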


What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [35]:
#show best model parameters
print(best_model_multiple.best_params_)
best_model_multiple.best_estimator_.get_params()

{'classifier': LogisticRegression(max_iter=500, solver='liblinear'), 'classifier__C': 1.0, 'classifier__penalty': 'l1'}


{'memory': None,
 'steps': [('scaler', MinMaxScaler()),
  ('classifier',
   LogisticRegression(max_iter=500, penalty='l1', solver='liblinear'))],
 'verbose': False,
 'scaler': MinMaxScaler(),
 'classifier': LogisticRegression(max_iter=500, penalty='l1', solver='liblinear'),
 'scaler__clip': False,
 'scaler__copy': True,
 'scaler__feature_range': (0, 1),
 'classifier__C': 1.0,
 'classifier__class_weight': None,
 'classifier__dual': False,
 'classifier__fit_intercept': True,
 'classifier__intercept_scaling': 1,
 'classifier__l1_ratio': None,
 'classifier__max_iter': 500,
 'classifier__multi_class': 'auto',
 'classifier__n_jobs': None,
 'classifier__penalty': 'l1',
 'classifier__random_state': None,
 'classifier__solver': 'liblinear',
 'classifier__tol': 0.0001,
 'classifier__verbose': 0,
 'classifier__warm_start': False}

In [36]:
#predict loan status for test set using best model 
predicted_logistic_target_test = best_model_multiple.predict(features_test)

In [37]:
# calculate accuracy for test set using the best model from the expanded grid search
print(f"Accuracy: {accuracy_score(target_test, predicted_logistic_target_test)}")

Accuracy: 0.7395833333333334


The best model found is _Logistic Regression_, which yielded an accuracy of $0.74$.

### Summary

We observed that the test-set accuracy of the _default_ KNN classifier is $0.73$. We used a grid search with 5-fold cross-validation to identify the best `n_neighbors` value for the KNN classifier on this loan approval dataset. The grid search found that $5$ neighbors gives the best cross-validated performance. Because $5$ is also the default value of `n_neighbors`, the tuned KNN classifier is the same model as the default one and achieves the same accuracy of $0.73$. Therefore, _no improvement_ in _accuracy_ was achieved when using _$5$ neighbors in the KNN classifier over the default KNN classifier_.

The grid search was then expanded to include Logistic Regression and Random Forest classifiers along with the KNN classifier, to identify the best-performing model and its hyperparameters. Logistic Regression was identified as the best model, yielding an accuracy of $0.74$ with hyperparameters `penalty=l1` and `C=1.0`. There is _no significant increase_ in _accuracy_ (about $1$ percentage point, which is one additional correct prediction out of the $96$ test rows) for the selected Logistic Regression classifier over the default and $5$-neighbor KNN classifier models.
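
To put the difference between the two accuracies in perspective, the sketch below converts them back into counts of correct predictions. It assumes a test set of 96 rows, which follows from `test_size=0.2` applied to the 480 rows left after dropping missing values.

```python
n_test = 96  # assumed: 20% of the 480 rows left after dropping missing values

knn_accuracy = 0.7291666666666666       # reported for both KNN models
logistic_accuracy = 0.7395833333333334  # reported for the logistic regression model

# Convert each accuracy back into a number of correct predictions
knn_correct = round(knn_accuracy * n_test)
logistic_correct = round(logistic_accuracy * n_test)

print(f"KNN: {knn_correct}/{n_test}, Logistic Regression: {logistic_correct}/{n_test}")
```

A gap of a single prediction on a test set of this size is well within what a different `random_state` for the split could produce.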