## Model Selection

Selection criteria

- Statistical Performance
  - Prediction: RMSE, Deviation
  - Classification: AUC, Accuracy, F1 score
- Business Performance.
- Fairness criteria.
- Interpretability/explainability.
- Ease of automation.
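
The statistical criteria above can be computed with scikit-learn. The following is a minimal sketch on hypothetical toy numbers (not the credit data), assuming the classification criterion is the area under the ROC curve; `roc_auc_score` is not among the notebook's imports and is added here only for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             mean_squared_error, roc_auc_score)

# Hypothetical regression targets and predictions (every error is +/-10)
y_true_reg = np.array([100.0, -50.0, 200.0, 0.0])
y_pred_reg = np.array([110.0, -40.0, 190.0, 10.0])
rmse_value = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Hypothetical class labels and predicted probabilities of class 1
y_true_clf = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7])
y_pred_clf = (scores >= 0.5).astype(int)  # threshold at 0.5

accuracy = accuracy_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
auc = roc_auc_score(y_true_clf, scores)  # uses the scores, not the labels

print(f"RMSE: {rmse_value:.3f}")
print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}, AUC: {auc:.3f}")
```

Note that accuracy and F1 depend on the chosen threshold, while the area under the ROC curve is computed from the ranking of the scores.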

# Part 1 - Data Preparation

In this notebook, we look for the model that best helps kMeans Bank decide whether to extend credit to a customer. You have seen this problem before when you were training kNN models. To find the best model, we will train and evaluate six models, three regression models followed by three classification models:
- Linear regression model
- Decision tree regressor
- kNN regression model
- Logistic regression model
- Decision tree classifier
- kNN classification model

We will start with importing the required packages. These include modules that are needed to handle and manipulate data, metrics that are used to evaluate models, and the models themselves.

**Note for Jupyter Notebook**: 

The code assumes that all libraries/packages are up to date. In particular, check that the *scikit-learn* library is at version 1.1 or later.
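
One way to check the installed version is sketched below. The parsing with a regular expression is an assumption made for illustration; it only compares the major and minor version numbers against 1.1.

```python
import re
import sklearn

# Show the installed scikit-learn version
print("Installed scikit-learn version:", sklearn.__version__)

# Extract the major and minor numbers (e.g. "1.1.3" -> (1, 1))
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
installed = (int(match.group(1)), int(match.group(2)))

# Tuples compare element by element, so (1, 2) >= (1, 1) is True
is_recent_enough = installed >= (1, 1)
print("At least version 1.1:", is_recent_enough)
```

If the check prints `False`, upgrading with `pip install --upgrade scikit-learn` in the notebook's environment is the usual remedy.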

In [1]:
# Import numpy and pandas to work with numbers and dataframes
import pandas as pd
import numpy as np

# Import matplotlib.pyplot to create visualizations
from matplotlib import pyplot as plt

# Import libraries used in basic data manipulation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import GridSearchCV to perform cross validation
from sklearn.model_selection import GridSearchCV

# Import different metrics used to evaluate models
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Import modules needed to train different models
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import Lasso, LassoCV, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree

In [2]:
# Filter warnings
import warnings 
warnings.filterwarnings('ignore')

After importing the required libraries, we load the data and look at the first few rows of the data set. For this step, make sure that the CSV file is in the same folder as the notebook, or pass its full path to `pd.read_csv`.

In [3]:
# Load the data and take a look at it
# Note: Make sure that the CSV file is in the same folder as the Jupyter notebook, or give its full path
credit_df = pd.read_csv("MyCreditData.csv")
credit_df.head()

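| Unnamed: 0 | checking_account | duration | credit_history | purpose | amount | savings_account | employment_duration | installment_rate | other_debtors | present_residence | ... | age | other_installment_plans | housing | number_credits | job | people_liable | telephone | foreign_worker | gender | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 18 | 0 | 2 | 1049 | 4 | 2 | 2 | 2 | 3 | ... | 21 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | female | 242 |
| 1 | 3 | 9 | 0 | 5 | 2799 | 4 | 0 | 1 | 2 | 0 | ... | 36 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | male | 596 |
| 2 | 0 | 12 | 4 | 8 | 841 | 0 | 1 | 1 | 2 | 3 | ... | 23 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | female | 25 |
| 3 | 3 | 12 | 0 | 5 | 2122 | 4 | 0 | 0 | 2 | 0 | ... | 39 | 1 | 0 | 1 | 3 | 1 | 0 | 1 | male | 568 |
| 4 | 3 | 12 | 0 | 5 | 2171 | 4 | 0 | 2 | 2 | 3 | ... | 38 | 0 | 2 | 1 | 3 | 0 | 0 | 1 | male | 782 |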


In [5]:
## Ensure Python reads the categorical variables as categorical.
non_categorical_columns = ['duration', 'amount', 'age', 'profit']
for column in credit_df.columns:
    if column not in non_categorical_columns:
        credit_df[column] = pd.Categorical(credit_df[column])
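
The conversion matters because the categorical variables are stored as integer codes. A minimal sketch on a hypothetical two-column frame (the values are made up; only the column names echo the data set) shows how `pd.get_dummies` treats the same column before and after the conversion:

```python
import pandas as pd

# Hypothetical toy frame: 'housing' is an integer-coded category
toy = pd.DataFrame({"housing": [0, 2, 1, 0], "age": [21, 36, 23, 39]})

# Before the conversion, 'housing' is treated as a plain number and left alone
as_numbers = pd.get_dummies(toy)

# After the conversion, each 'housing' level gets its own indicator column
toy["housing"] = pd.Categorical(toy["housing"])
as_categories = pd.get_dummies(toy)

print(list(as_numbers.columns))
print(list(as_categories.columns))
```

Without the conversion, a model would read the codes 0, 1, 2 as an ordered numeric scale rather than as separate categories.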

In [6]:
# Create the binary dependent variable.
credit_df['is_profitable'] = np.where(credit_df['profit'] > 0, 1, 0)

In [8]:
y = credit_df['profit']
X = credit_df.iloc[:, :-2]

# Build X2 first: get_dummies only encodes categorical columns, so both calls must see the raw X
X2 = pd.get_dummies(X, drop_first=True) # for regression.
X = pd.get_dummies(X, drop_first=False) # for kNN and trees.

X_train, X_val, X2_train, X2_val, y_train, y_val = train_test_split(X, X2, y, test_size=0.3, random_state=1)
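
The effect of `drop_first`, and of splitting two encodings together, can be seen in a small sketch on a hypothetical toy frame (values are made up). It relies on `pd.get_dummies` encoding only object, string, or category columns by default, so a frame that is already encoded passes through unchanged.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "housing": pd.Categorical([0, 2, 1, 0]),
    "amount": [1049, 2799, 841, 2122],
})

full = pd.get_dummies(toy, drop_first=False)    # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level dropped
again = pd.get_dummies(full, drop_first=True)   # nothing left to encode

print(list(full.columns))
print(list(reduced.columns))
print(list(again.columns))

# Passing both frames to one call keeps their rows aligned after shuffling
full_train, full_val, reduced_train, reduced_val = train_test_split(
    full, reduced, test_size=0.5, random_state=1
)
print(list(full_train.index) == list(reduced_train.index))
```

Dropping the first level avoids perfectly collinear indicator columns in regression models, while kNN and trees can use the full set.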

In [10]:
# Standardize the numeric columns, fitting the scaler on the training data only
num_cols = ['duration', 'amount', 'age']
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X2_train[num_cols] = scaler.transform(X2_train[num_cols])  # same rows as X_train, so reuse the fit

X_val[num_cols] = scaler.transform(X_val[num_cols])
X2_val[num_cols] = scaler.transform(X2_val[num_cols])
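
The scaler is fitted on the training rows and then only applied to the validation rows, so no information from the validation set leaks into the preprocessing. A minimal sketch with hypothetical single-column arrays illustrates what this does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and validation values for one numeric feature
train = np.array([[10.0], [20.0], [30.0]])
val = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
val_scaled = scaler.transform(val)          # reuses the training mean and std

print("Training mean learned by the scaler:", scaler.mean_[0])
print("Scaled training values:", train_scaled.ravel())
print("Scaled validation values:", val_scaled.ravel())
```

The scaled validation values need not have mean 0 or standard deviation 1, because they are expressed relative to the training distribution.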

In [11]:
# RMSE helper: root mean squared error between actual and predicted values
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

### Prediction Models