**Import the dataset and ensure that it loaded properly.**

In [28]:
import pandas as pd

In [29]:
df = pd.read_csv('Loan_Train.csv')

In [30]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


**Prepare the data for modeling by performing the following steps:**
    
    1. Drop the column “Load_ID.”
    2. Drop any rows with missing data.
    3. Convert the categorical features into dummy variables.

In [31]:
df = df.drop(columns='Loan_ID')
df = df.dropna()

In [32]:
df.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [33]:
# Identifies Categorical Columns
categorical_columns = df.select_dtypes(include=['object']).columns
print(categorical_columns)

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Property_Area', 'Loan_Status'],
      dtype='object')


In [34]:
# Creates 0, 1 for Loan_Status column
df['Loan_Status_Nbr'] = df['Loan_Status'].replace(to_replace=['N','Y'], value=[0,1])

In [35]:
df = pd.get_dummies(df, columns=categorical_columns)
df.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status_Nbr,Gender_Female,Gender_Male,Married_No,Married_Yes,...,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
1,4583,1508.0,128.0,360.0,1.0,0,0,1,0,1,...,0,1,0,1,0,1,0,0,1,0
2,3000,0.0,66.0,360.0,1.0,1,0,1,0,1,...,0,1,0,0,1,0,0,1,0,1
3,2583,2358.0,120.0,360.0,1.0,1,0,1,0,1,...,0,0,1,1,0,0,0,1,0,1
4,6000,0.0,141.0,360.0,1.0,1,0,1,1,0,...,0,1,0,1,0,0,0,1,0,1
5,5417,4196.0,267.0,360.0,1.0,1,0,1,0,1,...,0,1,0,0,1,0,0,1,0,1


**Split the data into a training and test set, where the “Loan_Status” column is the target.**

In [36]:
from sklearn.model_selection import train_test_split

In [37]:
# Separate the target from the features
feature = df.drop('Loan_Status_Nbr', axis=1)
target = df['Loan_Status_Nbr']

#Split the data into training and test
feature_train, feature_test, target_train, target_test = train_test_split(feature, target)


**Create a pipeline with a min-max scaler and a KNN classifier (see section 15.3 in the Machine Learning with Python Cookbook).**

In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsClassifier

In [39]:
standardizer = StandardScaler()

In [40]:
knn = KNeighborsClassifier(n_neighbors=5)

In [41]:
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

**Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set. Note: Fitting a pipeline model works just like fitting a regular model.**

In [42]:
model = pipe.fit(feature_test, target_test)

In [43]:
from sklearn import metrics

In [83]:
# Create predictions
prediction = pipe.predict(feature_test)
# Calculate the accuracy
accuracy = 100*metrics.accuracy_score(prediction,target_test)
# Display accuracy 
print('The accuracy of the Model is: ', round(accuracy,2), '%', sep = '')

The accuracy of the Model is: 97.5%


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


**Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10. (see section 15.3 in the Machine Learning with Python Cookbook).**

In [45]:
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

**Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.**

In [46]:
from sklearn.model_selection import GridSearchCV

In [47]:
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(feature_test, target_test)

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
  mode, _ = stats.mo

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


In [48]:
classifier.best_estimator_.get_params()["knn__n_neighbors"]

6

**Find the accuracy of the grid search best model on the test set. Note: It is possible that this will not be an improvement over the default model, but likely it will be.**

In [49]:
# Create predictions
prediction2 = classifier.predict(feature_test)
# Calculate the accuracy
accuracy2 = 100*metrics.accuracy_score(prediction2, target_test)
# Display accuracy 
print('The accuracy of the Decision Tree is: ', round(accuracy2,2), '%', sep = '')

The accuracy of the Decision Tree is: 99.17%


  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


**Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.**

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [51]:
import numpy as np

In [65]:
pipe2 = Pipeline([("classifier", RandomForestClassifier())])

In [73]:
search_space_2 = [{"classifier": [LogisticRegression()],
                  "classifier__penalty": ['l1', 'l2'],
                  "classifier__C": np.logspace(0, 4, 10)},
                 {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]

In [74]:
gridsearch = GridSearchCV(pipe2, search_space_2, cv=5, verbose=0).fit(feature_test, target_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [77]:
gridsearch.best_estimator_

Pipeline(steps=[('classifier', RandomForestClassifier(max_features=1))])

In [78]:
gridsearch.best_params_

{'classifier': RandomForestClassifier(max_features=1),
 'classifier__max_features': 1,
 'classifier__n_estimators': 100}

In [80]:
model2 = pipe2.fit(feature_test, target_test)

In [84]:
# Create predictions
prediction2 = pipe2.predict(feature_test)
# Calculate the accuracy
accuracy2 = 100*metrics.accuracy_score(prediction2,target_test)
# Display accuracy 
print('The accuracy of the Model is: ', round(accuracy2,2), '%', sep = '')

The accuracy of the Model is: 100.0%


Summarize your results.


All the tests conducted demonstrate high accuracy. The best KNN for the model was 6. As the type of model changed, the accuracy increased as well.