# Task 10 : Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To that end, it has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- Build models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Also benchmark the best accuracy you could get for every classification algorithm listed above

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|



### Submission
- Submit the notebook with all code executed and its outputs saved
- Document with the above two tables
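
The two tables requested above can be assembled directly from fitted `GridSearchCV` objects. The following is a minimal sketch, not the required solution: it uses synthetic data from `make_classification` as a stand-in for the loan dataset, and the names `candidates`, `table_1`, and `table_2` as well as the two small grids are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the task itself uses the loan dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two small, illustrative grids; the real task covers more algorithms.
candidates = {
    'DT': (DecisionTreeClassifier(random_state=42), {'max_depth': [3, 5, None]}),
    'KNN': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}),
}

rows = []
for name, (estimator, param_grid) in candidates.items():
    grid = GridSearchCV(estimator, param_grid, cv=3)
    grid.fit(X_train, y_train)
    rows.append({
        'Algorithm': name,
        # For classifiers with default scoring, score() is held-out accuracy
        # of the refit best model.
        'Accuracy': grid.score(X_test, y_test),
        'Hyperparameters': grid.best_params_,
    })

# Table 1: one row per algorithm. Table 2: the single best row overall.
table_1 = pd.DataFrame(rows)
table_2 = table_1.sort_values('Accuracy', ascending=False).head(1)

print(table_1.to_string(index=False))
print(table_2.to_string(index=False))
```

Keeping the per-algorithm rows in a plain list of dicts makes it easy to add further algorithms without changing the table-building code.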

In [285]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2, RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier
from typing import List, Optional
import pandas as pd
import numpy as np

In [240]:
class ClassificationFeatureSelector:
    __methods: List[str] = ['pearson', 'mutual_info', 'rfe', 'lin-reg', 'rf', 'lgbm']
    __n_jobs: int

    feature_names: List[str]
    feature_support_: pd.DataFrame
    sorted_features_: List[str]

    def __init__(self,
                 methods='__all__',
                 n_jobs: Optional[int] = None):
        self.__n_jobs = n_jobs
        # Accept a custom list of method names; otherwise keep the class default.
        if methods != '__all__' \
                and isinstance(methods, list) \
                and all(isinstance(m, str) for m in methods):
            self.__methods = methods

    def __cor_selector(self,
                       X: pd.DataFrame,
                       y: pd.DataFrame,
                       number_of_features: int) -> List[bool]:
        feature_names = X.columns.to_list()
        coefficients = [np.corrcoef(X[name].to_numpy(), y.to_numpy())[0, 1] for name in feature_names]
        coefficients = [0 if np.isnan(coef) else coef for coef in coefficients]
        feature_indexes = np.argsort(np.abs(coefficients))[-number_of_features:]
        support = [index in feature_indexes for index, name in enumerate(feature_names)]
        return support

    def __chi2_selector(self,
                        X: pd.DataFrame,
                        y: pd.DataFrame,
                        number_of_features: int) -> List[bool]:
        selector = SelectKBest(score_func=chi2,
                               k=number_of_features)
        selector = selector.fit(X, y)
        return selector.get_support()

    def __rfe_selector(self,
                       X: np.ndarray,
                       y: pd.DataFrame,
                       number_of_features: int):
        model = LogisticRegression()
        selector = RFE(estimator=model,
                       n_features_to_select=number_of_features,
                       step=1,
                       verbose=5)
        selector = selector.fit(X, y)
        return selector.get_support()

    def __embedded_log_reg_selector(self,
                                    X: np.ndarray,
                                    y: pd.DataFrame,
                                    number_of_features: int):
        model = LogisticRegression(n_jobs=self.__n_jobs)
        selector = SelectFromModel(model,
                                   max_features=number_of_features)
        selector = selector.fit(X, y)
        return selector.get_support()

    def __embedded_rf_selector(self,
                               X: np.ndarray,
                               y: pd.DataFrame,
                               number_of_features: int):
        model = RandomForestClassifier(n_estimators=50,
                                       n_jobs=self.__n_jobs,
                                       random_state=42,
                                       max_features=number_of_features)
        selector = SelectFromModel(model,
                                   max_features=number_of_features)
        embedded_selector = selector.fit(X, y)
        return embedded_selector.get_support()

    def __embedded_lgbm_selector(self,
                                 X: np.ndarray,
                                 y: pd.DataFrame,
                                 number_of_features: int):
        model = LGBMClassifier(n_estimators=500,
                               learning_rate=0.05,
                               num_leaves=32,
                               colsample_bytree=0.2,
                               reg_alpha=3,
                               reg_lambda=1,
                               min_split_gain=0.01,
                               min_child_weight=40,
                               n_jobs=self.__n_jobs,
                               random_state=42)
        selector = SelectFromModel(model,
                                   max_features=number_of_features)
        selector = selector.fit(X, y)
        return selector.get_support()

    def sort_features(self,
                      X: pd.DataFrame,
                      y: pd.DataFrame,
                      number_of_features: int):
        feature_names = X.columns.to_list()
        methods_support = {'Feature': feature_names}

        for method in self.__methods:
            print(f'Calculating {method}')
            # The '__all__' sentinel never reaches this loop: __methods is always a list.
            if method == 'pearson':
                methods_support[method] = self.__cor_selector(X, y, number_of_features)
            elif method == 'mutual_info':  # scored with chi2, despite the name
                methods_support[method] = self.__chi2_selector(X, y, number_of_features)
            elif method == 'rfe':
                methods_support[method] = self.__rfe_selector(X.to_numpy(), y, number_of_features)
            elif method == 'lin-reg':  # embedded logistic regression, despite the name
                methods_support[method] = self.__embedded_log_reg_selector(X.to_numpy(), y, number_of_features)
            elif method == 'rf':
                methods_support[method] = self.__embedded_rf_selector(X.to_numpy(), y, number_of_features)
            elif method == 'lgbm':
                methods_support[method] = self.__embedded_lgbm_selector(X.to_numpy(), y, number_of_features)

        pd.set_option('display.max_rows', None)

        feature_selection_df = pd.DataFrame(methods_support)
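        # Count votes over the boolean method columns only (exclude the 'Feature' names).
        feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)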
        feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'],
                                                                ascending=False)

        feature_selection_df.index = range(1, len(feature_selection_df)+1)
        self.feature_support_ = feature_selection_df
        self.sorted_features_ = feature_selection_df['Feature'].tolist()
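
The ranking step at the end of `sort_features` is a simple vote count: each selection method contributes a boolean column, and features are ordered by how many methods selected them. A minimal sketch of that idea on a toy frame (the feature names `a`, `b`, `c` and the two method columns are made up for illustration):

```python
import pandas as pd

# Toy support table: one boolean column per selection method.
support = pd.DataFrame({
    'Feature': ['a', 'b', 'c'],
    'pearson': [True, False, True],
    'rfe': [True, False, False],
})

# Sum only the boolean columns; True counts as 1, False as 0.
support['Total'] = support.drop(columns='Feature').sum(axis=1)

# Most-voted features first; ties are broken by feature name, descending.
ranked = support.sort_values(['Total', 'Feature'], ascending=False)
print(ranked['Feature'].tolist())  # ['a', 'c', 'b']
```

Summing only the method columns avoids asking pandas to reduce over the string `Feature` column.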

In [241]:
df = pd.read_csv('https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv')

In [242]:
df.drop('Loan_ID', inplace=True, axis=1)
# Encode the target: 'Y' -> 1, 'N' -> 0.
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

In [243]:
df.head()

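|   | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|--------|---------|------------|-----------|---------------|-----------------|-------------------|------------|------------------|----------------|---------------|-------------|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | 1 |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | 0 |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | 1 |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | 1 |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | 1 |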


In [244]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             601 non-null    object 
 1   Married            611 non-null    object 
 2   Dependents         599 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      582 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         592 non-null    float64
 8   Loan_Amount_Term   600 non-null    float64
 9   Credit_History     564 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    int64  
dtypes: float64(4), int64(2), object(6)
memory usage: 57.7+ KB


In [245]:
df.isna().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [246]:
df = df.dropna()
df.shape

(480, 12)

In [247]:
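# NOTE: 'CoapplicantIncome' is numeric; pd.get_dummies leaves it unchanged,
# so listing it here only moves the column, it does not encode it.
categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
                        'CoapplicantIncome', 'Property_Area']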

In [248]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

In [249]:
dummies = pd.get_dummies(X[categorical_features])

# Drop the original columns in one call before appending their encoded versions.
X = X.drop(columns=categorical_features)

X = pd.concat([X, dummies], axis=1)
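
By default, `pd.get_dummies` encodes only object and categorical columns and passes numeric columns through untouched, which is why `CoapplicantIncome` survives the encoding step as a single float column. A small sketch on a toy frame (the values are illustrative, not taken from the full dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'CoapplicantIncome': [0.0, 1508.0, 2358.0],
})

# Only the string column is expanded; the numeric column is kept as-is
# and placed before the generated indicator columns.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['CoapplicantIncome', 'Gender_Female', 'Gender_Male']
```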

In [250]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 613
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ApplicantIncome          480 non-null    int64  
 1   LoanAmount               480 non-null    float64
 2   Loan_Amount_Term         480 non-null    float64
 3   Credit_History           480 non-null    float64
 4   CoapplicantIncome        480 non-null    float64
 5   Gender_Female            480 non-null    uint8  
 6   Gender_Male              480 non-null    uint8  
 7   Married_No               480 non-null    uint8  
 8   Married_Yes              480 non-null    uint8  
 9   Dependents_0             480 non-null    uint8  
 10  Dependents_1             480 non-null    uint8  
 11  Dependents_2             480 non-null    uint8  
 12  Dependents_3+            480 non-null    uint8  
 13  Education_Graduate       480 non-null    uint8  
 14  Education_Not Graduate   4

In [251]:
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
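
Fitting the scaler on the full dataset before splitting lets the minimum and maximum of the held-out rows influence the training features. One common way to avoid this is to put the scaler inside a `Pipeline`, so it is refit on the training portion of every cross-validation fold. The sketch below assumes synthetic data and an arbitrary small grid for `C`; the names `pipe`, `search`, `X_demo`, and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The scaler is a pipeline step, so it only ever sees training data.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as '<step name>__<parameter>'.
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```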

In [252]:
selector = ClassificationFeatureSelector(n_jobs=-1)

In [253]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [254]:
score = 0
number_of_features = None
best_features = None
estimator = None

for i in range(1, X.shape[1]):
    print(f'calculating for first {i} features')
    selector.sort_features(X_train, y_train, i)
    X_1 = X[selector.sorted_features_[:i]]
    X_train_test, X_test_test, y_train_test, y_test_test = train_test_split(X_1,
                                                                            y,
                                                                            random_state=42)
    test_model = RandomForestClassifier(n_estimators=50,
                                        n_jobs=-1,
                                        random_state=42)
    test_model = test_model.fit(X_train_test, y_train_test)
    y_pred_test = test_model.predict(X_test_test)
    score_ = accuracy_score(y_test_test, y_pred_test)
    if score_ > score:
        score = score_
        number_of_features = i
        best_features = selector.sorted_features_[:i]
        estimator = test_model

calculating for first 1 features
Calculating pearson
Calculating mutual_info
Calculating rfe
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.
Calculating lin-reg
Calculating rf
Calculating lgbm
calculating for first 2 features
Calculating pearson
Calculating mutual_info
Calculating rfe
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estima

  return reduction(axis=axis, out=out, **passkwargs)

In [271]:
score

0.8083333333333333

In [255]:
best_features

['LoanAmount',
 'Credit_History',
 'CoapplicantIncome',
 'ApplicantIncome',
 'Self_Employed_Yes',
 'Property_Area_Semiurban',
 'Married_Yes',
 'Married_No',
 'Loan_Amount_Term',
 'Gender_Male',
 'Gender_Female',
 'Education_Graduate',
 'Self_Employed_No',
 'Property_Area_Urban',
 'Property_Area_Rural',
 'Education_Not Graduate',
 'Dependents_3+']

In [298]:
estimator.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [259]:
max_depth = np.max([e.get_depth() for e in estimator.estimators_])
max_depth

22

In [262]:
rf = RandomForestClassifier()

In [263]:
random_grid = {
    'n_estimators': [50, 100, 150, 300, 500, 1000],
    # np.linspace yields floats here; recent scikit-learn releases expect an integer max_depth.
    'max_depth': [None] + [num for num in np.linspace(5, max_depth, 4)],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False],
    'random_state': [42]
}

In [294]:
grid_ = RandomizedSearchCV(rf,
                          random_grid,
                          n_iter=20,
                          cv=3,
                          random_state=42,
                          n_jobs=-1)

In [295]:
grid_.fit(X_train, y_train)

Traceback (most recent call last):
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1350, in fit
    multi_class = _check_multi_class(self.multi_class, solver,
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 473, in _check_multi_class
    raise ValueError("Solver %s does not support "
ValueError: Solver liblinear does not support a multinomial backend.

Traceback (most recent call last):
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/

RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_iter=20,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [None, 5.0,
                                                      10.666666666666668,
                                                      16.333333333333336,
                                                      22.0],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [50, 100, 150, 300, 500,
                                                         1000],
                                        'random_state': [42]},
                   random_state=42)

In [296]:
grid_.best_params_

{'random_state': 42,
 'n_estimators': 1000,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'log2',
 'max_depth': 10.666666666666668,
 'bootstrap': True}

In [297]:
y_pred_random = grid_.predict(X_test)
accuracy_score(y_test, y_pred_random)

0.7916666666666666

In [299]:
estimators_grids = [
    {
        'estimator': DecisionTreeClassifier(),
        'param_grid': {
            'max_depth': [None] + [num for num in np.linspace(5, max_depth, 4)],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'max_features': ['auto', 'sqrt', 'log2'],
            'random_state': [42]
        }
    },
    {
        'estimator': KNeighborsClassifier(),
        'param_grid': {
            'n_neighbors': [2],
            'weights': ['uniform', 'distance'],
            'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'leaf_size': [int(num) for num in np.linspace(10, 50, 4)]
        }
    },
    {
        'estimator': LogisticRegression(),
        'param_grid': {
            'penalty': ['l1', 'l2', 'elasticnet', 'none'],
            'dual': [True, False],
            'C': [num for num in np.linspace(0, 1, 4)],
            'class_weight': ['balanced', None],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'multi_class': ['auto', 'ovr', 'multinomial'],
            'random_state': [42]
        }
    },
    {
        'estimator': RandomForestClassifier(),
        'param_grid': {
            'n_estimators': [50, 100, 150, 300, 500, 1000],
            'max_depth': [None] + [num for num in np.linspace(5, max_depth, 4)],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'max_features': ['auto', 'sqrt', 'log2'],
            'bootstrap': [True, False],
            'random_state': [42]
        }
    },
    {
        'estimator': SVC(),
        'param_grid': {
            'C': np.linspace(0.1, 100, 5),
            'gamma': np.linspace(0.001, 1, 5),
            'kernel': ['rbf', 'poly', 'sigmoid']
        }
    },
    {
        'estimator': LGBMClassifier(),
        'param_grid': {
            'n_estimators': [int(num) for num in np.linspace(400, 100, 3)],
            'colsample_bytree': [0.7, 0.8],
            'max_depth': [int(num) for num in np.linspace(15, 25, 3)],
            'num_leaves': [int(num) for num in np.linspace(50, 200, 3)],
            'reg_alpha': [1.1, 1.2, 1.3],
            'reg_lambda': [1.1, 1.2, 1.3],
            'min_split_gain': [0.3, 0.4],
            'subsample': [0.7, 0.8, 0.9],
            'subsample_freq': [20]
        }
    },
]
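
The `LogisticRegression` grid above crosses every solver with every penalty, but not every pair is supported (for example, `newton-cg` accepts only `'l2'` or `'none'`), and `C` must be strictly positive, so `C=0.0` cannot be fitted. `GridSearchCV` also accepts a list of dicts, each searched separately, which restricts the search to compatible combinations. A hedged sketch on synthetic data follows; the specific values are illustrative, and parameter names follow the scikit-learn version used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Each dict is its own sub-grid, so solvers are paired only with
# penalties they support; C values are strictly positive.
lr_param_grid = [
    {'solver': ['lbfgs', 'newton-cg'], 'penalty': ['l2'], 'C': [0.1, 1.0, 10.0]},
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1.0, 10.0]},
]

search = GridSearchCV(LogisticRegression(max_iter=5000, random_state=42),
                      lr_param_grid,
                      cv=3)
search.fit(X_demo, y_demo)

print(len(search.cv_results_['params']))  # 18 candidate combinations
print(search.best_params_)
```

With compatible sub-grids, no candidate fails to fit, so no time is spent on combinations that would only produce errors.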

In [300]:
for estimator_grid in estimators_grids:
    grid = GridSearchCV(estimator_grid['estimator'],
                        estimator_grid['param_grid'],
                        cv=3,
                        n_jobs=2)
    grid.fit(X_train, y_train)
    y_pred_test = grid.predict(X_test)
    score = accuracy_score(y_test, y_pred_test)
    estimator_grid['score'] = score
    estimator_grid['best_params'] = grid.best_params_
    estimator_grid['best_estimator'] = grid.best_estimator_

Traceback (most recent call last):
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "/Users/daniyarkurmanbayev/Documents/GBC/mlenv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/daniyar

In [301]:
sorted_grids = sorted(estimators_grids, key=lambda est: est['score'], reverse=True)
sorted_grids


[{'estimator': LogisticRegression(),
  'param_grid': {'penalty': ['l1', 'l2', 'elasticnet', 'none'],
   'dual': [True, False],
   'C': [0.0, 0.3333333333333333, 0.6666666666666666, 1.0],
   'class_weight': ['balanced', None],
   'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
   'multi_class': ['auto', 'ovr', 'multinomial'],
   'random_state': [42]},
  'score': 0.7916666666666666,
  'best_params': {'C': 0.3333333333333333,
   'class_weight': 'balanced',
   'dual': False,
   'multi_class': 'auto',
   'penalty': 'l1',
   'random_state': 42,
   'solver': 'saga'},
  'best_estimator': LogisticRegression(C=0.3333333333333333, class_weight='balanced', penalty='l1',
                     random_state=42, solver='saga')},
 {'estimator': RandomForestClassifier(),
  'param_grid': {'n_estimators': [50, 100, 150, 300, 500, 1000],
   'max_depth': [None, 5.0, 10.666666666666668, 16.333333333333336, 22.0],
   'min_samples_split': [2, 5, 10],
   'min_samples_leaf': [1, 2, 4],
   'max_featu