# Week 9 - Hyperparameter Tuning
Objectives:
* Import the [dataset](https://www.kaggle.com/datasets/granjithkumar/loan-approval-data-set) and ensure that it loaded properly.
* Prepare the data for modeling by performing the following steps:
    * Drop the column “Loan_ID.”
    * Drop any rows with missing data.
    * Convert the categorical features into dummy variables.
* Split the data into a training and test set, where the “Loan_Status” column is the target.
* Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).
* Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.
* Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).
* Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.
* Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.
* Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.
* What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.
* Summarize your results.

In [1]:
# importing the dataset
import pandas as pd
df = pd.read_csv("Loan_Train.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [2]:
# drop Loan_ID column
df.drop(columns=['Loan_ID'], inplace = True)

# drop any rows with missing data
df.dropna(axis=0, inplace = True)

# convert categorical features to dummy variables
df = pd.get_dummies(df, columns=['Gender', 'Education', 'Property_Area', 'Loan_Status', 'Married', 'Self_Employed'])

df.head()

Unnamed: 0,Married,Dependents,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Education_Graduate,Education_Not Graduate,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
1,Yes,1,No,4583,1508.0,128.0,360.0,1.0,0,1,1,0,1,0,0,1,0
2,Yes,0,Yes,3000,0.0,66.0,360.0,1.0,0,1,1,0,0,0,1,0,1
3,Yes,0,No,2583,2358.0,120.0,360.0,1.0,0,1,0,1,0,0,1,0,1
4,No,0,No,6000,0.0,141.0,360.0,1.0,0,1,1,0,0,0,1,0,1
5,Yes,2,Yes,5417,4196.0,267.0,360.0,1.0,0,1,1,0,0,0,1,0,1


In [3]:
# for the binary categorical variables, let's remove the negation
# Maintain Education_Graduate, and remove Education Not Graduate
# Maintain Loan_Status_Y (rename to Loan_Status) and remove Loan_Status_N
df.drop(columns=['Education_Not Graduate', 'Loan_Status_N'], inplace=True)
df.rename(columns={'Loan_Status_Y': "Loan_Status"}, inplace=True)

df.head()

Unnamed: 0,Married,Dependents,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Education_Graduate,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
1,Yes,1,No,4583,1508.0,128.0,360.0,1.0,0,1,1,1,0,0,0
2,Yes,0,Yes,3000,0.0,66.0,360.0,1.0,0,1,1,0,0,1,1
3,Yes,0,No,2583,2358.0,120.0,360.0,1.0,0,1,0,0,0,1,1
4,No,0,No,6000,0.0,141.0,360.0,1.0,0,1,1,0,0,1,1
5,Yes,2,Yes,5417,4196.0,267.0,360.0,1.0,0,1,1,0,0,1,1
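As a side note, the manual "drop the negation" step above could probably be folded into the encoding call itself. The sketch below uses a small made-up frame (not the loan data) and the `drop_first=True` option of `pd.get_dummies`, which keeps k-1 indicator columns per categorical feature.

```python
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 3000],
})

# drop_first=True drops the first (alphabetically sorted) level of each feature
encoded = pd.get_dummies(toy, columns=["Education", "Property_Area"], drop_first=True)
print(list(encoded.columns))
```

One difference from the manual approach: `drop_first` removes the alphabetically first level, so here it keeps `Education_Not Graduate` rather than `Education_Graduate`. The information content is the same, only the column's meaning is flipped.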


In [25]:
# Dependents is a str type column, with 0, 1, 2, and 3+ as values
# convert to int type, and 3+ will just be 3
df['Dependents'] = df['Dependents'].apply(lambda x : int(x) if x in ('0', '1', '2') else 3)

#### Split the data into a training and test set, where the “Loan_Status” column is the target.

In [27]:
from sklearn.model_selection import train_test_split
# target variable
y = df['Loan_Status']
# feature variables
x = df.drop(columns=['Loan_Status'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
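The split above is not seeded, so the accuracies reported later will change from run to run. A minimal sketch of a reproducible, class-balanced split is shown below. It uses synthetic data, and the names `X_demo` and `y_demo` and the seed value are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, a 70/30 class balance (made-up values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

With `stratify`, the share of approved loans would stay roughly the same in the training and test sets, which matters when one class is noticeably more common than the other.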

#### Create a pipeline with a min-max scaler and a KNN classifier

In [32]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# create a min-max scaler (rescales each feature to the [0, 1] range)
scaler = MinMaxScaler()

# create a knn classifier
knn = KNeighborsClassifier(n_neighbors=5)

# create a pipeline
pipe = Pipeline([("scaler", scaler), ("knn", knn)])

#### Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.

In [33]:
# fit the pipeline (scaler + KNN) on the training data
pipe.fit(x_train, y_train)

# get predictions
y_pred = pipe.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {}".format(accuracy))

Accuracy: 0.7708333333333334
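To illustrate the note that a pipeline behaves like a regular model, here is a small sketch on synthetic data (`make_classification`, not the loan data). It checks two things: `Pipeline.score` gives the same accuracy as `accuracy_score`, and the scaler inside the pipeline learns its minimum from the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
demo_pipe.fit(X_tr, y_tr)

# two equivalent ways to get test accuracy
acc_from_score = demo_pipe.score(X_te, y_te)
acc_from_metric = accuracy_score(y_te, demo_pipe.predict(X_te))

# the fitted scaler's per-feature minimum comes from the training rows only
scaler_min = demo_pipe.named_steps["scaler"].data_min_
print(acc_from_score == acc_from_metric, np.allclose(scaler_min, X_tr.min(axis=0)))
```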


#### Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. 

In [34]:
# create a space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

#### Fit a grid search with the pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [35]:
# create a grid search
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(x_train, y_train)

# find the best value for k
best_k = classifier.best_estimator_.get_params()['knn__n_neighbors']
print("K-value that produces best model: {}".format(best_k))

K-value that produces best model: 3
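Besides the single best k, it can help to see how close the other candidates were. The sketch below runs the same kind of search on synthetic data and reads the per-candidate cross-validation scores from `cv_results_`. The data and the `demo_` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
demo_search = GridSearchCV(demo_pipe, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
demo_search.fit(X_demo, y_demo)

# one row per candidate k, with its mean and spread across the 5 folds
results = pd.DataFrame(demo_search.cv_results_)[
    ["param_knn__n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print(demo_search.best_params_, demo_search.best_score_)
```

If several values of k score within one standard deviation of each other, the "best" k is not strongly preferred, and a different train/test split could easily pick another one.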


#### Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.

In [36]:
# update the knn classifier
knn = KNeighborsClassifier(n_neighbors=best_k)

# update pipeline
pipe = Pipeline([("scaler", scaler), ("knn", knn)])

# fit the knn classifier to updated pipeline
pipe.fit(x_train, y_train)

# get predictions
y_pred_2 = pipe.predict(x_test)

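# score the tuned model's predictions (y_pred_2), not the default model's y_pred
# note: the output recorded below came from scoring y_pred, so it repeats the default accuracy
accuracy_2 = accuracy_score(y_test, y_pred_2)
print("Accuracy with k={}: {}".format(best_k, accuracy_2))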

Accuracy with k=3: 0.7708333333333334
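Since `GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), the fitted search object can be scored directly instead of rebuilding the pipeline by hand. The sketch below shows this on synthetic data and checks that both routes give the same predictions. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
demo_search = GridSearchCV(demo_pipe, {"knn__n_neighbors": list(range(1, 11))}, cv=5)
demo_search.fit(X_tr, y_tr)

# the search object delegates predict/score to the refitted best pipeline
tuned_pred = demo_search.predict(X_te)
tuned_accuracy = demo_search.score(X_te, y_te)

# rebuilding the pipeline by hand with the best k gives the same predictions
best_k_demo = demo_search.best_params_["knn__n_neighbors"]
manual_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=best_k_demo)),
])
manual_pred = manual_pipe.fit(X_tr, y_tr).predict(X_te)
print(tuned_accuracy, np.array_equal(tuned_pred, manual_pred))
```

Scoring the search object directly also removes the chance of passing the wrong prediction array to `accuracy_score`.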


#### Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [42]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# set random seed
np.random.seed(0)

# update pipeline
pipe = Pipeline([("scaler", scaler), ("classifier", RandomForestClassifier())])

# fit the placeholder random forest pipeline (optional: GridSearchCV refits the pipeline for each candidate)
pipe.fit(x_train, y_train)

expanded_search_space = [{"classifier": [LogisticRegression(max_iter=500,
                                                           solver = "liblinear")],
                         "classifier__penalty" : ['l1', 'l2'],
                         "classifier__C": np.logspace(0, 4, 10)},
                        {"classifier": [RandomForestClassifier()],
                        "classifier__n_estimators": [10, 100, 1000],
                        "classifier__max_features": [1, 2, 3]}]

# create grid search with updated search space
gridsearch_expanded = GridSearchCV(pipe, expanded_search_space, cv=5, verbose=0)

# fit grid search
best_model = gridsearch_expanded.fit(x_train, y_train)
print(best_model.best_estimator_)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 LogisticRegression(max_iter=500, solver='liblinear'))])


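The grid search selected logistic regression. Note that `max_iter=500` and `solver="liblinear"` were fixed when the estimator was created; the hyperparameters actually tuned were `penalty` and `C`. Because scikit-learn's printed representation shows only parameters that differ from their defaults, the selected values appear to be the defaults, `penalty='l2'` and `C=1.0` (the first value in `np.logspace(0, 4, 10)`). The next step checks this model's accuracy on the test set.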
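The winning estimator and its tuned values can also be read from `best_params_` rather than from the printed pipeline. The sketch below uses synthetic data and a smaller grid than the one above so that it runs quickly. The grid sizes, the seeds, and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# the "classifier" step is a placeholder that the search space swaps out
demo_pipe = Pipeline([("scaler", MinMaxScaler()),
                      ("classifier", RandomForestClassifier(random_state=0))])

demo_space = [
    {"classifier": [LogisticRegression(max_iter=500, solver="liblinear")],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 3)},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 50],
     "classifier__max_features": [1, 2, 3]},
]

demo_search = GridSearchCV(demo_pipe, demo_space, cv=5).fit(X_demo, y_demo)

# best_params_ holds the chosen estimator plus its tuned hyperparameters
best = dict(demo_search.best_params_)
best_estimator_name = type(best.pop("classifier")).__name__
print(best_estimator_name, best)
print(round(demo_search.best_score_, 3))
```

Printing `best_params_` this way shows the tuned values explicitly, even when they happen to equal the estimator's defaults.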

#### Find the accuracy of this model on the test set.

In [47]:
# creating the model
logistic_model = LogisticRegression(max_iter=500, solver='liblinear')

# creating the pipeline
pipeline = Pipeline([("scaler", scaler), ("logistic", logistic_model)])

# fit the pipeline
pipeline.fit(x_train, y_train)

# predict using new model
y_pred = pipeline.predict(x_test)

# accuracy
accuracy_3 = accuracy_score(y_test, y_pred)
print("Accuracy score on Logistic Regression model: ", accuracy_3)

Accuracy score on Logistic Regression model:  0.7986111111111112


## Summary

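In this assignment, I imported the loan data and cleaned it by dropping the ID column, removing rows with missing values, and encoding the categorical features. A default KNN model (k=5) reached about 77% accuracy on the test set. A grid search over k from 1 to 10 with 5-fold cross-validation selected k=3. The recorded accuracy for k=3 is the same 77%, but that figure was computed from the default model's predictions (`y_pred`) rather than the tuned model's (`y_pred_2`), so it does not show whether tuning k helped. I then expanded the search to include logistic regression and random forest classifiers with their respective hyperparameters. The search selected logistic regression, which reached 79.9% accuracy on the test set, about 3 percentage points above the default KNN model. Because the train/test split was not seeded, these figures will vary from run to run.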