
## Exercise 9: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your model
- Use the Test File to generate your results
- Use the Sample Submission file to generate the same format
- Use all classification models

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e10/overview



In [69]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Dataset File

In [70]:
dataset_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/train.csv?raw=true'
df = pd.read_csv(dataset_url)

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object(4)

## Test File

In [72]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/test.csv?raw=true'
dt = pd.read_csv(test_url)

In [73]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39098 entries, 0 to 39097
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          39098 non-null  int64  
 1   person_age                  39098 non-null  int64  
 2   person_income               39098 non-null  int64  
 3   person_home_ownership       39098 non-null  object 
 4   person_emp_length           39098 non-null  float64
 5   loan_intent                 39098 non-null  object 
 6   loan_grade                  39098 non-null  object 
 7   loan_amnt                   39098 non-null  int64  
 8   loan_int_rate               39098 non-null  float64
 9   loan_percent_income         39098 non-null  float64
 10  cb_person_default_on_file   39098 non-null  object 
 11  cb_person_cred_hist_length  39098 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.6+ MB


## Sample Submission File

In [74]:
sample_submission_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/main/datasets/loan_approval/sample_submission.csv?raw=true'

sf = pd.read_csv(sample_submission_url)

In [75]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39098 entries, 0 to 39097
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           39098 non-null  int64  
 1   loan_status  39098 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 611.0 KB


In [76]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Map categorical columns to numerical values.
# Any category that is not listed in a mapping becomes NaN after .map().
mapping_dict = {
    'person_home_ownership': {"RENT": 1, "OWN": 0},
    'loan_intent': {"PERSONAL": 1, "MORTGAGE": 2, "MEDICAL": 3, "VENTURE": 4, "EDUCATION": 5},
    'loan_grade': {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5},
    'cb_person_default_on_file': {"Y": 1, "N": 0}
}

for column, mapping in mapping_dict.items():
    df[column] = df[column].map(mapping)

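# Fill missing values with the column mode. Assign the result back instead of
# calling fillna(inplace=True) on a column selection, which operates on a copy.
columns_with_null = ['person_home_ownership', 'loan_intent', 'loan_grade']
for column in columns_with_null:
    df[column] = df[column].fillna(df[column].mode()[0])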

# Prepare features X and target y
X = df.drop(columns=['id', 'loan_status']).values
y = df['loan_status'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Verify that no missing values remain after mapping and filling
print(df.isnull().sum())


id                            0
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
loan_status                   0
dtype: int64
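
The training mapping in this cell and the test-file mapping later in the notebook list different categories, so the same string can end up encoded differently (or as NaN) in the two files. A minimal sketch of one shared encoder follows. The helper name `encode_categories`, the constant `CATEGORY_MAPS`, and the tiny demo frames are assumptions made for illustration; the category lists are simply the union of the two mapping cells, and the helper is meant to run on freshly loaded frames rather than on `df` after it has already been mapped.

```python
import pandas as pd

# Hypothetical shared mapping: the union of the categories that appear in the
# notebook's training and test mapping cells.
CATEGORY_MAPS = {
    "person_home_ownership": {"OWN": 0, "RENT": 1},
    "loan_intent": {
        "PERSONAL": 1, "MORTGAGE": 2, "MEDICAL": 3, "VENTURE": 4,
        "EDUCATION": 5, "HOMEIMPROVEMENT": 6, "DEBTCONSOLIDATION": 7,
    },
    "loan_grade": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6},
    "cb_person_default_on_file": {"N": 0, "Y": 1},
}


def encode_categories(frame, fill_values=None):
    """Return an encoded copy of frame and the fill value used per column.

    Unmapped categories become NaN and are replaced by the column mode, or by
    the values in fill_values when those are supplied (e.g. from the training
    frame).
    """
    out = frame.copy()
    used = {}
    for column, mapping in CATEGORY_MAPS.items():
        out[column] = out[column].map(mapping)
        if fill_values is not None:
            fill = fill_values[column]
        else:
            fill = out[column].mode()[0]
        out[column] = out[column].fillna(fill)
        used[column] = fill
    return out, used


# Tiny made-up frames, only to show the call pattern.
demo_train = pd.DataFrame({
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_intent": ["PERSONAL", "MEDICAL", "MEDICAL", "VENTURE"],
    "loan_grade": ["A", "B", "B", "G"],
    "cb_person_default_on_file": ["N", "Y", "N", "N"],
})
demo_test = pd.DataFrame({
    "person_home_ownership": ["OTHER", "OWN"],
    "loan_intent": ["HOMEIMPROVEMENT", "EDUCATION"],
    "loan_grade": ["G", "C"],
    "cb_person_default_on_file": ["Y", "N"],
})

train_enc, fills = encode_categories(demo_train)
test_enc, _ = encode_categories(demo_test, fill_values=fills)
print(test_enc)
```

Reusing the fill values learned from the training frame keeps the test encoding aligned with what the model saw during fitting.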



## 1. Train a KNN Classifier

In [77]:
df.sample(15)

Unnamed: 0,id,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
16982,16982,28,70000,1.0,5.0,5.0,3.0,9800,14.35,0.14,1,10,0
20659,20659,25,96000,1.0,1.0,4.0,2.0,15000,10.62,0.16,0,4,0
43426,43426,29,28800,1.0,3.0,4.0,3.0,9000,13.49,0.31,0,5,1
10808,10808,22,42000,1.0,6.0,3.0,1.0,2000,5.42,0.05,0,2,0
16081,16081,33,50000,1.0,5.0,1.0,1.0,10000,7.88,0.2,0,9,0
26853,26853,25,72000,1.0,1.0,3.0,3.0,6000,13.35,0.08,1,2,0
44749,44749,22,30000,1.0,6.0,1.0,4.0,8500,14.59,0.28,1,3,1
58381,58381,26,26000,1.0,4.0,1.0,4.0,8400,14.96,0.33,0,2,1
29998,29998,27,150000,1.0,3.0,4.0,2.0,12000,10.25,0.08,0,9,0
17577,17577,21,28150,1.0,0.0,3.0,2.0,8000,10.99,0.3,0,3,0


In [78]:
score_list = {}

In [79]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors=22)
KNN.fit(X_train,y_train)
knn_score = KNN.score(X_test,y_test)
print(f"Score is {knn_score}")

Score is 0.8936569284983518


- Perform cross validation

In [80]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(KNN, X, y, cv=10)
scores

array([0.89326513, 0.89462916, 0.89428815, 0.89360614, 0.89445865,
       0.89427012, 0.89222374, 0.89648704, 0.88864256, 0.89768076])

In [81]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
score_list["KNN Classifier"] = scores.mean()

0.89 accuracy with a standard deviation of 0.00
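
The value `n_neighbors=22` is fixed by hand in this section. One possible way to choose it from data is a small grid search with cross-validation, sketched below. The synthetic `X_demo`, `y_demo` and the candidate list are assumptions so that the block runs on its own; in the notebook the real `X` and `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X, y (an assumption, not the loan data).
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)

# Hypothetical candidate values for the number of neighbours.
candidate_k = [5, 11, 22, 33]

# Score each candidate with 5-fold cross-validation and keep the best one.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidate_k}, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```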


## 2. Train a Logistic Regression Classifier

In [82]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train,y_train)

lr_score = LR.score(X_test,y_test)
print(f"Score is {lr_score}")

Score is 0.8802432647493463


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
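
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of both in one pipeline follows. The synthetic `X_demo`, `y_demo`, the inflated column, and `max_iter=1000` are assumptions for illustration, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (an assumption, not the loan data).
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_demo[:, 1] *= 50_000  # mimic one large-scale column, such as an income field

# Scaling happens inside the pipeline, so each CV fold is scaled using only
# its own training portion.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=5)

print("%0.2f accuracy with a standard deviation of %0.2f" % (lr_scores.mean(), lr_scores.std()))
```

Putting the scaler inside the pipeline, rather than scaling `X` once up front, avoids leaking statistics from the validation folds into training.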


- Perform cross validation

In [83]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LR, X, y, cv=10)
scores

array([0.88098892, 0.8801364 , 0.88184143, 0.88354646, 0.88269395,
       0.87755798, 0.87943383, 0.88250341, 0.88045703, 0.88267394])

In [84]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
score_list["Logistic Regression"] = scores.mean()

0.88 accuracy with a standard deviation of 0.00


## 3. Train a Naive Bayes Classifier

In [85]:
from sklearn.naive_bayes import GaussianNB

nbc = GaussianNB()
nbc.fit(X_train,y_train)
nbc_score = nbc.score(X_test,y_test)

print(f"Score is {nbc_score}")

Score is 0.8827441173127203


- Perform cross validation

In [86]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(nbc, X, y, cv=10)
scores

array([0.88218244, 0.87536232, 0.88439898, 0.88150043, 0.88422847,
       0.88011596, 0.88369714, 0.88540246, 0.88489086, 0.88096862])

In [87]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
score_list["GaussianNBC"] = scores.mean()

0.88 accuracy with a standard deviation of 0.00


## 4. Train an SVM Classifier

In [88]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train,y_train)
svc_score = svc.score(X_test,y_test)

print(f"Score is {svc_score}")

Score is 0.8578492667954984


- Perform cross validation

In [89]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svc, X, y, cv=5)
scores

array([0.85761787, 0.85761787, 0.85761787, 0.85753261, 0.85761787])

In [90]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
score_list["SVC"] = scores.mean()

0.86 accuracy with a standard deviation of 0.00
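
Four of the five SVC folds report the same accuracy to eight decimals, which is the pattern one would expect if the model predicted a single class for every row. One way to check this is to compare against a baseline that always predicts the most frequent class, as sketched below. The imbalanced synthetic data is an assumption so that the block runs on its own; in the notebook the real `X` and `y` would be passed instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for X, y (an assumption, not the loan data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, weights=[0.86, 0.14], random_state=0
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

# Share of the majority class in the labels.
majority_share = np.bincount(y_demo).max() / len(y_demo)

print(f"baseline accuracy: {baseline_scores.mean():.4f}")
print(f"majority share:    {majority_share:.4f}")
```

If a trained model's cross-validation accuracy matches this baseline, accuracy alone says little about whether it has learned anything from the features.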


## 5. Train a Decision Tree Classifier

In [91]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train,y_train)
dtc_score = dtc.score(X_test,y_test)

print(f"Score is {dtc_score}")

Score is 0.8859838581334546


- Perform cross validation

In [92]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(dtc, X, y, cv=10)
scores

array([0.89207161, 0.88201194, 0.88883205, 0.88542199, 0.89002558,
       0.88761937, 0.88386767, 0.88966576, 0.87619372, 0.88335607])

## 6. Train a Random Forest Classifier

In [93]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=50, random_state=1)
rfc.fit(X_train, y_train)
rfc_score = rfc.score(X_test, y_test)
# Note: this stores the hold-out score, not a cross-validation mean like the other entries
score_list["RFC"] = rfc_score

print(f"Score is {rfc_score}")

Score is 0.928782539502103
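
The random forest score above comes from a single hold-out split, while the earlier entries in `score_list` are cross-validation means, so the numbers are not measured in quite the same way. A sketch that scores every model with the same cross-validation setup follows. The synthetic data, the dictionary names, `max_iter=1000`, and the decision tree's `random_state=0` are assumptions; SVC is left out only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X, y (an assumption, not the loan data).
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)

# Hypothetical registry of models to compare.
demo_models = {
    "KNN Classifier": KNeighborsClassifier(n_neighbors=22),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "GaussianNBC": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Same data, same number of folds, same metric for every model.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in demo_models.items()
}

# Print from best to worst.
for name, mean_score in sorted(cv_means.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {mean_score:.2f}")
```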


## 7. Compare the performance of all classification models

In [94]:
# Round to two decimals instead of truncating the string representation
for alg, score in score_list.items():
    print(f"{alg} Score is {score:.2f}")

KNN Classifier Score is 0.89
Logistic Regression Score is 0.88
GaussianNBC Score is 0.88
SVC Score is 0.86
RFC Score is 0.93


In [95]:
# Define the mapping dictionary
mapping_dict_dt = {
    'person_home_ownership': {"RENT": 1, "OWN": 0},
    'loan_intent': { "PERSONAL": 1, "MORTGAGE": 2, "MEDICAL": 3,
        "VENTURE": 4, "EDUCATION": 5, "HOMEIMPROVEMENT": 6,
        "DEBTCONSOLIDATION": 7 },
    'loan_grade': {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6},
    'cb_person_default_on_file': {"Y": 1, "N": 0}
}

# Apply the mappings
for column, mapping in mapping_dict_dt.items():
    dt[column] = dt[column].map(mapping)


## 9. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [96]:
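# Keep the ids for the submission and remove them from the features
# (test_ids avoids shadowing the built-in id function)
test_ids = dt.pop('id')
# Pass a plain array, matching how rfc was fitted on X_train
y_pred = rfc.predict(dt.values)

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'id': test_ids,
    'loan_status': y_pred
})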

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")



Submission file created: submission_file.csv
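
The sample submission stores `loan_status` as `float64`, which suggests that a probability may be expected rather than a hard 0/1 label. A sketch of writing class-1 probabilities with `predict_proba` follows. The synthetic data, the demo ids, and the variable names are assumptions; in the notebook the fitted `rfc` and the preprocessed test features would be used instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training and test features (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
X_fit, y_fit, X_new = X_demo[:200], y_demo[:200], X_demo[200:]

demo_model = RandomForestClassifier(n_estimators=50, random_state=1)
demo_model.fit(X_fit, y_fit)

# Column 1 of predict_proba is the probability of the positive class (label 1).
positive_proba = demo_model.predict_proba(X_new)[:, 1]

# Hypothetical ids, only to give the frame the same two-column layout.
demo_ids = pd.Series(range(200, 300), name="id")
demo_submission = pd.DataFrame({"id": demo_ids, "loan_status": positive_proba})

print(demo_submission.head())
```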
