## **Problem:**
Use the `credit.csv` dataset to build **classification model** using KNN,NB. The target variable is `default` which is a binary label to indicate of the loan is default (yes, no). Use all other variables for your feature set.


- Comment your code, do not display unnecessary data and keep your notebook clean and readable.

### **Read `credit.csv` into a dataframe `credit_df`. Display the first 5 rows**

- Handle the missing value character '?' as na value.*
- Remove whitespace from column names and replace with underscore

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import files
files.upload();

Saving credit.csv to credit (1).csv


In [None]:
credit_df = pd.read_csv("credit_loan.csv")

In [None]:
credit_df.shape

(1000, 11)

In [None]:
# replace ? with NaN
credit_df = credit_df.replace(to_replace = '?', value = np.nan)
credit_df.head(6)

Unnamed: 0,months_loan_duration,credit history,amount,percent_of_income,years at residence,age,existing loans_count,job,dependents,phone,default
0,6,critical,1169,4.0,4,67,2,skilled,1,yes,no
1,48,good,5951,2.0,2,22,1,skilled,1,no,yes
2,12,critical,2096,2.0,3,49,1,unskilled,2,no,no
3,42,good,7882,2.0,4,45,1,skilled,2,no,no
4,24,poor,4870,,4,53,2,skilled,2,no,yes
5,36,good,9055,2.0,4,35,1,unskilled,2,yes,no


In [None]:
# replace whitespace with underscore
credit_df.columns = [str_clm.strip().replace(' ', '_') for str_clm in credit_df.columns]
credit_df.columns

Index(['months_loan_duration', 'credit_history', 'amount', 'percent_of_income',
       'years_at_residence', 'age', 'existing_loans_count', 'job',
       'dependents', 'phone', 'default'],
      dtype='object')

### **1. Display the following in each cell**
- The balance of your target variable (no vs yes count)
- Missing values per column,
- Total number of rows that have missing values then drop rows with missing values

In [None]:
# target variable balance
credit_df.default.value_counts()

no     700
yes    300
Name: default, dtype: int64

In [None]:
# missing values per column
credit_df.isnull().sum()

months_loan_duration    0
credit_history          3
amount                  0
percent_of_income       4
years_at_residence      6
age                     1
existing_loans_count    0
job                     0
dependents              0
phone                   0
default                 0
dtype: int64

In [None]:
# count number of rows with missing values, then drop them
credit_df.isna().any(axis=1).sum()
credit_df = credit_df.dropna()

### **2. Determine categorical and numerical features and assign each into `numerical_features` and `categorical_features`**

In [None]:
# show dtypes
credit_df.dtypes

months_loan_duration     int64
credit_history          object
amount                   int64
percent_of_income       object
years_at_residence      object
age                     object
existing_loans_count     int64
job                     object
dependents               int64
phone                   object
default                 object
dtype: object

In [None]:
# change numeric object columns to int
credit_df[["percent_of_income", "years_at_residence", "age"]] = credit_df[["percent_of_income", "years_at_residence", "age"]].astype("int64")

In [None]:
# categorical
categorical_features = credit_df.select_dtypes("object").columns
categorical_features

Index(['credit_history', 'job', 'phone', 'default'], dtype='object')

In [None]:
# numerical
numerical_features = credit_df.select_dtypes("int64").columns
numerical_features

Index(['months_loan_duration', 'amount', 'percent_of_income',
       'years_at_residence', 'age', 'existing_loans_count', 'dependents'],
      dtype='object')

### **3. Create the preprocessing pipelines for both numeric and categorical data that does imputation and OneHotEncoder to the appropriate columns.**
Use mean strategy for numerical imputation and most frequent for categorical imputation.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
# numerical pipeline
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy = "mean"))
])


In [None]:
# categorical pipeline
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy = "most_frequent")),
    ("onehot", OneHotEncoder())
])


### **4. Implement ColumnTransformer for both numerical and categorical columns and assigned to a variable called `clms_transformers`**

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
clms_transformers = ColumnTransformer(
    transformers = [
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features )
    ]
)

### **5. Fit and transform the original data frame `credit_df` and assign the final output to `transformed_data_df`**
`hints:`
- *use the clms_transformers object created in the previous to fit and transform on the credit_df.*
- *use extract_feature_names method to get the feature names in order to create the final transformed_data_df*

In [None]:
# fit and transform the column transformer to generate the transformed data
clms_transformed_credit_df = clms_transformers.fit_transform(credit_df)
clms_transformed_credit_df.shape

(988, 20)

In [None]:
# extract_feature_names method
def extract_feature_names(columnTransformerProcessor):
    '''Get feature names from the processed columnTransformer'''

    output_features = []

    for name, pipe, features in columnTransformerProcessor.transformers_:
        if name!='remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i,'categories_'):
                    trans_features.extend(i.get_feature_names_out(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)

    return output_features

In [None]:
extract_feature_names(clms_transformers)

['months_loan_duration',
 'amount',
 'percent_of_income',
 'years_at_residence',
 'age',
 'existing_loans_count',
 'dependents',
 'credit_history_critical',
 'credit_history_good',
 'credit_history_perfect',
 'credit_history_poor',
 'credit_history_very good',
 'job_management',
 'job_skilled',
 'job_unemployed',
 'job_unskilled',
 'phone_no',
 'phone_yes',
 'default_no',
 'default_yes']

In [None]:
transformed_data_df = pd.DataFrame(clms_transformed_credit_df, columns = extract_feature_names(clms_transformers))
transformed_data_df.head()

Unnamed: 0,months_loan_duration,amount,percent_of_income,years_at_residence,age,existing_loans_count,dependents,credit_history_critical,credit_history_good,credit_history_perfect,credit_history_poor,credit_history_very good,job_management,job_skilled,job_unemployed,job_unskilled,phone_no,phone_yes,default_no,default_yes
0,6.0,1169.0,4.0,4.0,67.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
1,48.0,5951.0,2.0,2.0,22.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,12.0,2096.0,2.0,3.0,49.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,42.0,7882.0,2.0,4.0,45.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,36.0,9055.0,2.0,4.0,35.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0


### **6. Partition the dataset into `X` feature matrix and `y` target variable.**

In [None]:
X = transformed_data_df.drop("default_yes", axis = 1)
y = transformed_data_df["default_yes"]

### **7. Partition the data into training and testing data set and apply MinMax Scaling**
- Use train_test_split with 30% for testing size and apply random_state = 23

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

In [None]:
X_train.shape, X_test.shape

((691, 19), (297, 19))

In [None]:
# minmaxscaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
train_scaler = scaler.fit(X_train)

In [None]:
X_train_scaled = train_scaler.transform(X_train)
X_test_scaled = train_scaler.transform(X_test)

### **8. Build a k-nearest neighbors (KNN) classifier and train the model using the using GridSearchCV**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
paramGrid_knn = {"n_neighbors": range(1, 15 + 1)}

In [None]:
from sklearn.model_selection import GridSearchCV
grid_knn = GridSearchCV(knn_model, paramGrid_knn, cv = 10, scoring = "accuracy")
grid_knn.fit(X_train_scaled, y_train)

In [None]:
pd.DataFrame(grid_knn.cv_results_).sort_values("rank_test_score")[["param_n_neighbors", "mean_test_score", "rank_test_score"]]

Unnamed: 0,param_n_neighbors,mean_test_score,rank_test_score
2,3,0.972505,1
4,5,0.968199,2
0,1,0.968137,3
12,13,0.966729,4
14,15,0.966729,4
6,7,0.96528,6
8,9,0.96528,6
10,11,0.96528,6
13,14,0.962381,9
3,4,0.960952,10


In [None]:
grid_knn.best_estimator_

In [None]:
y_pred_knn = grid_knn.predict(X_test_scaled)

In [None]:
compare_prediction_knn = pd.DataFrame({"Actual": y_test, "Predicted": y_pred_knn, "compare": y_test == y_pred_knn})
compare_prediction_knn

Unnamed: 0,Actual,Predicted,compare
979,0.0,0.0,True
166,1.0,1.0,True
107,0.0,0.0,True
778,1.0,1.0,True
843,0.0,0.0,True
...,...,...,...
663,0.0,0.0,True
385,0.0,0.0,True
557,1.0,1.0,True
887,1.0,1.0,True


In [None]:
compare_prediction_knn.value_counts()

Actual  Predicted  compare
0.0     0.0        True       202
1.0     1.0        True        89
        0.0        False        4
0.0     1.0        False        2
dtype: int64

### **9. Build a MultinomialNB (MNB) classifier and train the model using the using GridSearchCV**

In [None]:
from sklearn.naive_bayes import MultinomialNB
cnb_model = MultinomialNB()

In [None]:
# GridSearchCV
paramGrid_mnb = {"fit_prior": [True, False]}

In [None]:
from sklearn.model_selection import GridSearchCV

search_mnb = GridSearchCV(cnb_model,
                      paramGrid_mnb,
                      cv=10,
                      scoring="accuracy"
                      )

In [None]:
# search using fit method
search_fit = search_mnb.fit(X_train_scaled, y_train)

In [None]:
pd.DataFrame(search_fit.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_fit_prior,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002365,0.001371,0.000954,0.000232,True,{'fit_prior': True},0.885714,0.84058,0.855072,0.811594,0.898551,0.884058,0.913043,0.898551,0.869565,0.927536,0.878427,0.033208,2
1,0.001875,0.000455,0.000846,0.000206,False,{'fit_prior': False},1.0,1.0,0.971014,0.971014,1.0,0.985507,0.985507,1.0,0.985507,1.0,0.989855,0.011319,1


In [None]:
pd.DataFrame(search_fit.cv_results_).sort_values("rank_test_score")[["params", "mean_test_score"]]

Unnamed: 0,params,mean_test_score
1,{'fit_prior': False},0.989855
0,{'fit_prior': True},0.878427


In [None]:
y_pred_mnb = search_fit.predict(X_test_scaled)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_mnb)

0.9966329966329966

### **10. Evaluate performance of both models and discuss the results indicating which model provides better performance**

#### KNN Classfication Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_knn, target_names = ["Default: 0", "Default: 1"]))

              precision    recall  f1-score   support

  Default: 0       0.98      0.99      0.99       204
  Default: 1       0.98      0.96      0.97        93

    accuracy                           0.98       297
   macro avg       0.98      0.97      0.98       297
weighted avg       0.98      0.98      0.98       297



#### MNB CLassification Report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_mnb, target_names = ["Default: 0", "Default: 1"]))

              precision    recall  f1-score   support

  Default: 0       1.00      1.00      1.00       204
  Default: 1       1.00      0.99      0.99        93

    accuracy                           1.00       297
   macro avg       1.00      0.99      1.00       297
weighted avg       1.00      1.00      1.00       297



#### Summarize the Performance results:



KNN


*   Of all loans the model predicted wouldn't be defaulted, 98% actually were.
*   Of all the loans that weren't defaulted, the model correctly predicted 99% of them.
*   Of all loans the model predicted would be defaulted, 98% actually were.
*   Of all the loans that defaulted, the model correctly predicted 96% of them.



MNB

*   Of all loans the model predicted wouldn't be defaulted, 100% actually were.
*   Of all the loans that weren't defaulted, the model correctly predicted 100% of them.
*   Of all loans the model predicted would be defaulted, 100% actually were.
*   Of all the loans that defaulted, the model correctly predicted 99% of them.

Comparing the data, MNB had a higher f1-score for both defaulted and non-defaulted loans. Both did a good job at predicting the loans, but MNB was slightly better than KNN.