# Task 10 : Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans. They have presence across all urban, semi urban and rural areas. Customer first applies for home loan and after that company validates the customer eligibility for loan.

Company wants to automate the loan eligibility process (real time) based on customer detail provided while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have provided a dataset to identify the customers segments that are eligible for loan amount so that they can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### You can have the following Classification models built using different ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- ### Build models
- ### Create Parameter Grid
- ### Run GridSearchCV
- ### Choose the best model with the best hyperparameter
- ### Give the best accuracy
- ### Also, benchmark the best accuracy that you could get for every classification algorithm asked above

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

Algorithm   |     Accuracy   |   Hyperparameters
- DT
- KNN
- LR
- SVM
- RF
- anyother

**Table 2 (Best overall)**

Algorithm    |   Accuracy    |   Hyperparameters



### Submission
- Submit Notebook containing all saved ran code with outputs
- Document with the above two tables

In [234]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

In [235]:
# Loading the train & test dataset
url_train = "https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv"
url_test = "https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv"

data_train = pd.read_csv(url_train)
data_test = pd.read_csv(url_test)

data_train

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [236]:
data_train["Loan_Status"].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64

In [237]:
# Train & Test Data Dimensions
print("Train Data Shape", data_train.shape)
print("Test Data Shape", data_test.shape)

Train Data Shape (614, 13)
Test Data Shape (367, 12)


In [238]:
# Checking for null values on train data
data_train.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [239]:
# Checking for null values on test data
data_test.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [240]:
# Replacing "Y" & "N" with 1 & 0
data_train["Loan_Status"] = data_train["Loan_Status"].replace({"Y": 1, "N": 0})
data_train["Loan_Status"].value_counts()

1    422
0    192
Name: Loan_Status, dtype: int64

In [241]:
X_train = data_train.drop(["Loan_ID","Loan_Status","Gender"],axis=1)
X_test = data_test.drop(["Loan_ID","Gender"],axis=1)
y_train = data_train[["Loan_Status"]]

In [242]:
# Dividing Categorical & Numerical Values Columns
categorical_values = X_train.select_dtypes(include="object").columns.tolist()
numerical_values = X_train.select_dtypes(exclude="object").columns.tolist()
print("Categorical Value Columns:",categorical_values)
print("Numerical Value Columns:",numerical_values)

Categorical Value Columns: ['Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
Numerical Value Columns: ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']


In [243]:
X_train_cat = X_train[categorical_values]
X_train_num = X_train[numerical_values]
X_test_cat = X_test[categorical_values]
X_test_num = X_test[numerical_values]

In [244]:
# Imputing Categorical Values
categorical_imputation = SimpleImputer(strategy = "most_frequent")
X_train_cat = pd.DataFrame(categorical_imputation.fit_transform(X_train_cat), columns=categorical_values)
X_test_cat = pd.DataFrame(categorical_imputation.fit_transform(X_test_cat), columns=categorical_values)

In [245]:
# Imputing Numerical Values
numerical_imputation = SimpleImputer(strategy = "mean")
X_train_num = pd.DataFrame(numerical_imputation.fit_transform(X_train_num), columns=numerical_values)
X_test_num = pd.DataFrame(numerical_imputation.fit_transform(X_test_num), columns=numerical_values)

In [246]:
# Changing The Datatype Of Specified Columns To "category"
X_train_cat[categorical_values] = X_train_cat[categorical_values].astype("category")
X_test_cat[categorical_values] = X_test_cat[categorical_values].astype("category") 

In [247]:
# Encoding The Categorical Values To Numerical
encoder = LabelEncoder()
[X_train_cat, X_test_cat] = [df.apply(encoder.fit_transform) for df in [X_train_cat, X_test_cat]]

X_train_cat.head()

Unnamed: 0,Married,Dependents,Education,Self_Employed,Property_Area
0,0,0,0,0,2
1,1,1,0,0,0
2,1,0,0,1,2
3,1,0,1,0,2
4,0,0,0,0,2


In [248]:
X_train = pd.concat([X_train_num, X_train_cat], axis = 1)
X_train_cols = X_train.columns
X_train_cols

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Married', 'Dependents',
       'Education', 'Self_Employed', 'Property_Area'],
      dtype='object')

In [249]:
X_test = pd.concat([X_test_num, X_test_cat], axis = 1)
X_test_cols = X_test.columns
X_test_cols

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Married', 'Dependents',
       'Education', 'Self_Employed', 'Property_Area'],
      dtype='object')

In [250]:
scaler = StandardScaler()
scaler.fit(X_train)

In [251]:
X_train = pd.DataFrame(scaler.transform(X_train), columns = X_train_cols)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test_cols)

In [252]:
# Checking for null values on train data after imputation
X_train.isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Married              0
Dependents           0
Education            0
Self_Employed        0
Property_Area        0
dtype: int64

In [253]:
# Checking for null values on test data after imputation
X_test.isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Married              0
Dependents           0
Education            0
Self_Employed        0
Property_Area        0
dtype: int64

In [254]:
# Splitting The Data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20, random_state=0)

In [255]:
# Function To Evaluate Different Models
def model_evaluation(model, X_test, y_test):
    y_pred = model.predict(X_test)
    return {'acc': metrics.accuracy_score(y_test, y_pred)}

In [256]:
# Evaluating Models Accuracies
lr_classifier = LogisticRegression()
dt_classifier = DecisionTreeClassifier()
rfc_classifier = RandomForestClassifier()
knn_classifier = KNeighborsClassifier()
svm_classifier = SVC()
nb_classifier = GaussianNB()

models_dict = {
    "Logistic Regression": lr_classifier,
    "Decision Tree": dt_classifier,
    "Random Forest": rfc_classifier,
    "KNN": knn_classifier,
    "SVM": svm_classifier,
    "NaiveBayes": nb_classifier
}

results_list = []
for model_name, model in models_dict.items():
    model.fit(X_train, y_train)
    acc = model_evaluation(model, X_test, y_test)['acc']
    results_list.append({"model_name": model_name, "Accuracy": acc})

models_results = pd.DataFrame(results_list)
models_results

  y = column_or_1d(y, warn=True)
  return fit_method(estimator, *args, **kwargs)
  return self._fit(X, y)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Unnamed: 0,model_name,Accuracy
0,Logistic Regression,0.837398
1,Decision Tree,0.666667
2,Random Forest,0.788618
3,KNN,0.780488
4,SVM,0.821138
5,NaiveBayes,0.829268


In [257]:
# Evaluating Models Accuracies With GridSearchCv
lr_params = {
    'C': [0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear']
}
dt_params = {
    "criterion": ['gini','entropy'],
    "max_depth": list(range(1, 10)),
    "min_samples_split": list(range(1, 5))
}
rfc_params = {
    'n_estimators': [10, 100, 1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 20]
}
knn_params = {
    'n_neighbors': list(range(5, 10)), 
    'leaf_size': list(range(10, 50, 10))
}
svm_params = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
}
nb_params = {
    'var_smoothing': np.logspace(0, -9, num = 100)
}

models_params_dict = {
    "Logistic Regression": [lr_classifier, lr_params],
    "Decision Tree": [dt_classifier, dt_params],
    "Random Forest": [rfc_classifier, rfc_params],
    "KNN": [knn_classifier, knn_params],
    "SVM": [svm_classifier, svm_params],
    "NaiveBayes": [nb_classifier, nb_params],
}

results_list = []

for model_name, model in models_params_dict.items():
    grid_search = GridSearchCV(estimator=model[0], param_grid=model[1], cv=5, scoring="accuracy", n_jobs=8)
    grid_search.fit(X_train, y_train)
    
    y_pred = grid_search.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_pred)
    
    result_dict = {
        "Model_name": model_name,
        "Accuracy": acc,
        "HyperParameter": str(grid_search.best_params_)
    }
    
    results_list.append(result_dict)

results_df = pd.DataFrame(results_list)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = colu

90 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/bilaldilbar/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/bilaldilbar/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1144, in wrapper
    estimator._validate_params()
  File "/Users/bilaldilbar/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "/Users/bilaldilbar/anaconda3/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 95,

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return self._fit(X, y)
  return s

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = colu

In [258]:
# Displaying Accuracies
pd.set_option('display.max_colwidth', None)
results_df['HyperParameter'] = results_df['HyperParameter'].astype(str).str.ljust(150)
results_df

Unnamed: 0,Model_name,Accuracy,HyperParameter
0,Logistic Regression,0.837398,"{'C': 0.1, 'solver': 'lbfgs'}"
1,Decision Tree,0.829268,"{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 2}"
2,Random Forest,0.829268,"{'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 1000}"
3,KNN,0.796748,"{'leaf_size': 10, 'n_neighbors': 9}"
4,SVM,0.829268,"{'C': 0.1, 'kernel': 'linear'}"
5,NaiveBayes,0.837398,{'var_smoothing': 1.0}


In [259]:
# Model With Best Accuracy
best_model = results_df.loc[results_df['Accuracy'].idxmax()]
best_model = pd.DataFrame(best_model)
best_model

Unnamed: 0,0
Model_name,Logistic Regression
Accuracy,0.837398
HyperParameter,"{'C': 0.1, 'solver': 'lbfgs'}"
