# Preprocessing

In [None]:
import pandas as pd
import numpy as np

from pandas.api.types import is_numeric_dtype

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

from sklearn.metrics import accuracy_score

import joblib

In [None]:
rng = np.random.RandomState(2)

## Understand the business problem

We are a small company making loans in a competitive market. We want to speed up loan decisions and reduce the number of people who default on their loans.

## Select performance measures

- This is Supervised Learning of a Binary Classifier.
- We will measure accuracy.
- We will compare with a majority-class classifier.
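
To make the baseline concrete, here is a minimal sketch of what a majority-class classifier does and how its accuracy is computed. The labels in `y_true` are made up for illustration; they are not taken from the loans dataset.

```python
import numpy as np

# Hypothetical labels (not from loans.csv): 1 = loan approved, 0 = loan refused
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# A majority-class classifier always predicts the most common label
majority_label = np.bincount(y_true).argmax()
y_pred = np.full_like(y_true, majority_label)

# Accuracy is the proportion of predictions that match the true labels
baseline_accuracy = (y_pred == y_true).mean()
print(majority_label, baseline_accuracy)
```

Any model we learn should beat this number; otherwise it has learned nothing useful beyond the class proportions.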


## Acquire a dataset

In [None]:
import os
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
    base_dir = "."
dataset_dir = os.path.join(base_dir, "datasets")

In [None]:
df = pd.read_csv(os.path.join(dataset_dir, "loans.csv"))

## Take a cheeky look

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe(include="all")

In [None]:
for column in df.columns:
    if not is_numeric_dtype(df[column]):
        print(column, df[column].value_counts())
        print()

In [None]:
# Proportion of the dataset whose loan was approved

print((df["Loan_Decision"] == "Y").sum() / df.shape[0])

Observations

1. We have numeric-valued, Boolean-valued and nominal-valued features.
2. Loan_Id is not a feature: it is used purely for identification. It does not describe the applicant.
3. We have 6 features with missing values: Sex, Married, Dependents, Self_Employed, Loan_Amount, Loan_Term. If you think examples with missing values are invalid, you can delete them. Check with your domain expert. I'm going to assume that our domain expert tells us that it is invalid to apply for a loan but to fail to supply a Loan_Amount or a Loan_Term. So we will delete these. Happily, there aren't too many of them.
4. The target is Loan_Decision. Fortunately, it does not have missing values. (If it did, we'd delete those examples.)
5. There are some extreme values. If you think they are invalid (e.g. typos), then delete those rows. Check with your domain expert! I'm going to assume that our domain expert tells us that it is invalid to ask for a Loan_Term of more than 360. So we will delete these.
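
Observations like 3 and 5 can be checked directly. The sketch below uses a tiny made-up DataFrame called `toy` (an assumption, standing in for `df`) to count missing values per column and to count Loan_Term values above 360.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "Sex": ["Male", None, "Female", "Male"],
    "Loan_Amount": [120.0, np.nan, 66.0, 200.0],
    "Loan_Term": [360.0, 180.0, 480.0, np.nan],
    "Loan_Decision": ["Y", "N", "Y", "Y"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Names of the columns that have at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Number of examples whose Loan_Term exceeds 360 (missing values do not count)
too_long = (toy["Loan_Term"] > 360).sum()

print(columns_with_missing, too_long)
```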

## Clean up anything that is simply invalid

In [None]:
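# Loan_Id identifies the applicant but does not describe them
df = df.drop(columns=["Loan_Id"])

# Delete examples that lack a Loan_Amount or a Loan_Term
df = df.dropna(subset=["Loan_Amount", "Loan_Term"])

# Delete examples whose Loan_Term exceeds 360
df = df[df["Loan_Term"] <= 360]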

In [None]:
df.info()

## Split into training set and test set

In [None]:
features = ["Sex", "Married", "Dependents", "Education", "Self_Employed", "Applicant_Income", 
            "Coapplicant_Income", "Loan_Amount", "Loan_Term", "Property_Area"]

numeric_features = ["Applicant_Income", "Coapplicant_Income", "Loan_Amount", "Loan_Term"]
boolean_features = ["Sex", "Married", "Education", "Self_Employed"]
nominal_features = ["Dependents", "Property_Area"]

X = df[features]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Loan_Decision"])

label_encoder.inverse_transform([0, 1])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=df["Loan_Decision"], random_state=rng)

## Exploratory Data Analysis

Do the EDA on a copy of X_train.

In [None]:
X_train_copy = X_train.copy()

We would now do visualizations, compute correlation coefficients, and so on.

I did this but have excluded it from here to save time and space in the lecture.

Among the things I discovered:
1. No correlations between the numeric features, with one exception - applicant income and loan amount are moderately positively correlated.
2. On their own, the numeric-valued features are not predictive of Loan_Decision. We could remove them, but perhaps they are predictive in combination with other features. So, for now, we'll leave them in.
3. Most of the non-numeric features did seem partly predictive of Loan_Decision, e.g. Married people were more likely to get a loan.
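
The EDA itself is not shown here, but the kinds of checks behind findings 1 and 3 might look like the sketch below. It uses a small made-up DataFrame (`toy`) that, unlike `X_train_copy`, also holds the encoded target, so that approval rates can be computed per group. The values are invented and show only the mechanics.

```python
import pandas as pd

# Made-up examples; Loan_Decision is already encoded as 1 (approved) or 0 (refused)
toy = pd.DataFrame({
    "Applicant_Income": [2500, 4000, 6000, 8000, 3000, 5000],
    "Loan_Amount": [90, 120, 150, 210, 100, 160],
    "Married": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Loan_Decision": [1, 0, 1, 1, 0, 1],
})

# Pearson correlation between two numeric features
corr = toy[["Applicant_Income", "Loan_Amount"]].corr()

# Approval rate for each value of a non-numeric feature
approval_rate = toy.groupby("Married")["Loan_Decision"].mean()

print(corr)
print(approval_rate)
```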

## Feature Engineering

In [None]:
# The applicant's total income
X_train_copy["Total_Income"] = X_train_copy["Applicant_Income"] + X_train_copy["Coapplicant_Income"]

# How much the applicant would repay in each time period
X_train_copy["Payments_Per_Period"] = X_train_copy["Loan_Amount"] / X_train_copy["Loan_Term"]

# The payment (above) as a proportion of the applicant's income 
X_train_copy["Proportion_Of_Income"] = X_train_copy["Payments_Per_Period"] / X_train_copy["Total_Income"]

We would now do more visualizations and compute more correlation coefficients to see whether these new features are promising or not.

I did this but have excluded it from here to save time and space in the lecture. In this case, none of them on their own was predictive of the target. But we might include them later, since they might be useful in combination with other features.

## Preprocess the Data

From now on, we work on the original data, not the copy.

### Outliers

Our EDA earlier will have given some insight into the presence of outliers: we will be able to visualize them in the various charts that we plot.

We need to be careful: simple rules-of-thumb may result in too many outliers. For example, one rule-of-thumb is: a numeric value is an outlier if it exceeds a maximum value (e.g. the third quartile plus 1.5 times the inter-quartile-range) or falls below a minimum value (e.g. the first quartile minus 1.5 times the inter-quartile-range). How many values will this treat as outliers?

In [None]:
q1 = X_train[numeric_features].quantile(0.25)
q3 = X_train[numeric_features].quantile(0.75)
iqr =  q3 - q1
((X_train[numeric_features] < q1 - 1.5 * iqr) | (X_train[numeric_features] > q3 + 1.5 * iqr)).sum(axis=0)

From what I saw in the EDA, I suspect that this is too aggressive. I think the numbers are more like 7 outliers for Applicant_Income, 4-6 for Coapplicant_Income, 0-12 for Loan_Amount, and none for Loan_Term.

Despite my reservations, in order to illustrate one solution to outliers, I will define a class that can be included in a pipeline. It clips values to the maximum or minimum. But it can be toggled, so we can try clipping and not clipping as part of the grid search.

By including it in the pipeline, it will apply to both training examples and validation/test examples - this is controversial!

In [None]:
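class Clipper(BaseEstimator, TransformerMixin):
    """Clip numeric values to the IQR-based bounds learned from the training data."""

    def __init__(self, clip=True):
        # clip=False turns this step into a no-op, so grid search can toggle it
        self.clip = clip

    def fit(self, X, y=None):
        if self.clip:
            # Learn the bounds from the data passed to fit (the training folds)
            q1 = X.quantile(0.25)
            q3 = X.quantile(0.75)
            iqr = q3 - q1
            self.min = q1 - 1.5 * iqr
            self.max = q3 + 1.5 * iqr
        return self

    def transform(self, X, y=None):
        if self.clip:
            # Apply the learned bounds column by column
            X = X.clip(self.min, self.max, axis=1)
        return X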
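
To see what clipping does to unseen examples, here is a small sketch that applies the same rule with plain pandas. The two DataFrames (`train` and `new`) are made up. The bounds are computed from `train` only and then applied to `new`, which is what happens to validation and test examples when the clipper is part of a pipeline.

```python
import pandas as pd

# Made-up training values, including one extreme income
train = pd.DataFrame({"Applicant_Income": [10, 20, 30, 40, 1000]})

# Made-up unseen values, e.g. from a validation fold
new = pd.DataFrame({"Applicant_Income": [5, 25, 5000]})

# Bounds are learned from the training data only
q1 = train.quantile(0.25)
q3 = train.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# The unseen values are clipped to the training bounds
clipped_new = new.clip(lower, upper, axis=1)
print(lower["Applicant_Income"], upper["Applicant_Income"])
print(clipped_new)
```

Here the unseen value 5000 is pulled down to the upper bound learned from the training data, while 5 and 25 are left alone.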

### Missing values 

We still have missing values in Sex, Married, Dependents, and Self_Employed. Our domain expert agrees that we should use the mode for all of them.
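
As a quick illustration of mode imputation, the sketch below runs `SimpleImputer` with `strategy="most_frequent"` on a made-up DataFrame (`toy`) that has two of our column names but invented values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one missing entry in each column
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", np.nan, "No"],
    "Dependents": ["0", np.nan, "0", "2"],
}, dtype=object)

# Replace each missing value by the most frequent value (the mode) of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)
print(filled)
```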

### Scaling numeric-valued features

We will let grid search find the best scaler.
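
To get a feel for how the candidate scalers differ, this sketch applies each of them to a single made-up income column that contains one extreme value. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One made-up numeric feature with an extreme value in the last row
income = np.array([[1000.0], [2000.0], [3000.0], [4000.0], [100000.0]])

scaled = {}
for scaler in (MinMaxScaler(), RobustScaler(), StandardScaler()):
    # Each scaler is fitted on, and then applied to, the same column
    scaled[type(scaler).__name__] = scaler.fit_transform(income).ravel()

for name, values in scaled.items():
    print(name, values.round(2))
```

MinMaxScaler squeezes the four ordinary values close to 0 because the extreme value defines the maximum. RobustScaler uses the median and inter-quartile range, so the ordinary values stay spread out.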

### Converting non-numeric features to numeric.

We will use one-hot encoding. But for a feature that has only two values, we retain only one of the columns that one-hot encoding would give us.

### Feature Engineering

We can write classes to insert new features. I will do it for one of the features. The class can be toggled, so we can try with the feature and without.

In [None]:
class InsertProportionOfIncome(BaseEstimator, TransformerMixin):

    def __init__(self, insert=True):
        self.insert = insert
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.insert:
            # Work on a copy so that the caller's DataFrame is not modified
            X = X.copy()
            X["Proportion_Of_Income"] = (X["Loan_Amount"] / X["Loan_Term"]) / (X["Applicant_Income"] + X["Coapplicant_Income"])

            # If the new feature is intended to replace the existing ones,
            # you could drop the existing ones here
            # X = X.drop(["Applicant_Income", "Coapplicant_Income", "Loan_Amount", "Loan_Term"], axis=1)

            # If the new feature can produce np.inf, replace those values by np.nan
            # X = X.replace([np.inf, -np.inf], np.nan)
        return X

In [None]:
preprocessor = ColumnTransformer([
        ("num", Pipeline([("proportion_of_income", InsertProportionOfIncome()),
                          ("outliers", Clipper()),
                          ("scaler", None)
                          ]), 
                numeric_features),
        ("nom", Pipeline([("imputer", SimpleImputer(missing_values=np.nan, strategy="most_frequent")), 
                          ("encoder", OneHotEncoder(drop="if_binary"))]), 
                nominal_features + boolean_features)],
        remainder="drop")

## Model selection

We must choose our validation method, e.g. holdout or k-fold CV. We have a small amount of data, so we'll choose k-fold with k = 10. We will use stratified k-fold (which is what scikit-learn uses by default when we write cv=10 for a classifier).
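
To show what stratification buys us, the sketch below splits some made-up labels (70% positive) with `StratifiedKFold` and computes the proportion of positives in each validation fold. The labels and the placeholder feature matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 70 approved (1) and 30 refused (0)
y_toy = np.array([1] * 70 + [0] * 30)

# The features do not matter for the split, so a placeholder will do
X_toy = np.zeros((len(y_toy), 1))

skf = StratifiedKFold(n_splits=10)

# Proportion of approved loans in each validation fold
fold_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print(fold_rates)
```

Every fold keeps the overall class proportions, so accuracy on each fold is comparable with the majority-class baseline.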

We want to do better than a majority-class classifier.

In [None]:
dummy = DummyClassifier()

dummy.fit(X_train, y_train)

np.mean(cross_val_score(dummy, X_train, y_train, scoring="accuracy", cv=10))

In [None]:
def grid_search(preprocessor, predictor, param_grid, cv, metric):
    model = Pipeline([
                ("preprocessor", preprocessor),
                ("predictor", predictor)
    ])

    gs = GridSearchCV(model, param_grid, scoring=metric, cv=cv, n_jobs=-1)

    gs.fit(X_train, y_train)

    return gs

In [None]:
knn_gs = grid_search(
    preprocessor = preprocessor, 
    predictor = KNeighborsClassifier(),
    param_grid = {
        "preprocessor__num__proportion_of_income__insert": [True, False],
        "preprocessor__num__outliers__clip": [True, False],
        "preprocessor__num__scaler" : [None, MinMaxScaler(), RobustScaler(), StandardScaler()],
        "predictor__n_neighbors": range(1, 11) ,
        "predictor__weights" : ["uniform", "distance"]
    },
    cv = 10,
    metric = "accuracy"
)

knn_gs.best_params_, knn_gs.best_score_

In [None]:
decision_tree_gs = grid_search(
    preprocessor = preprocessor, 
    predictor = DecisionTreeClassifier(random_state=rng),
    param_grid = {
        "preprocessor__num__proportion_of_income__insert": [True, False],
        "preprocessor__num__outliers__clip": [True, False],
        "preprocessor__num__scaler" : [None],
        "predictor__max_depth": range(1, 11)                 
    },
    cv = 10,
    metric = "accuracy"
)

decision_tree_gs.best_params_, decision_tree_gs.best_score_

Both are a bit more accurate than the majority-class classifier, and the decision tree is a little more accurate than kNN. It is also faster than kNN at inference time.

*(Quite frankly, the learned models are not much better than the majority-class classifier. I would check whether my models are underfitting or overfitting and then fix whichever it is.)*
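
One way to check for underfitting or overfitting is to compare training accuracy with validation accuracy. The sketch below does this with `cross_validate` on synthetic data from `make_classification` (an assumption; it is not the loans dataset), for a very shallow tree and for an unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, standing in for a real training set
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   flip_y=0.2, random_state=0)

results = {}
for depth in (1, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(tree, X_toy, y_toy, cv=10, scoring="accuracy",
                            return_train_score=True)
    # Mean accuracy on the training folds and on the validation folds
    results[depth] = (scores["train_score"].mean(), scores["test_score"].mean())

for depth, (train_acc, val_acc) in results.items():
    print(depth, round(train_acc, 3), round(val_acc, 3))
```

A large gap between training and validation accuracy suggests overfitting. Low accuracy on both suggests underfitting.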

## If you're certain you've finished with model selection, then you can use the test set

In [None]:
# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test, decision_tree_gs.best_estimator_.predict(X_test))

## If you decide to deploy, then train on the whole dataset and save

In [None]:
decision_tree_gs.best_estimator_.fit(X, y)

In [None]:
joblib.dump(decision_tree_gs.best_estimator_, os.path.join(base_dir, 'models/loan_approval_model.pkl')) # assumes a folder called models

You can load the model, e.g. in a different Jupyter Notebook, like so:

In [None]:
model = joblib.load(os.path.join(base_dir, 'models/loan_approval_model.pkl'))

Then you can use it for inference.

In [None]:
applicant = {"Sex": "Male",
             "Married": "Yes",
             "Dependents": "0",
             "Education": "Graduate",
             "Self_Employed": "Yes",
             "Applicant_Income": 5818, 
             "Coapplicant_Income": 2160, 
             "Loan_Amount": 184,
             "Loan_Term": 360,
             "Property_Area": "Semiurban"}

In [None]:
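decision = model.predict(pd.DataFrame([applicant]))
# label_encoder is the LabelEncoder fitted earlier in this notebook;
# in a different notebook, you would need to save and load it as well
label_encoder.inverse_transform(decision)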

Note the advantage: we saved not just the decision tree but also all the preprocessing code, which was in a pipeline with the decision tree. So now, during inference, those same preprocessing steps will be applied.
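
The following sketch shows the same idea on a toy scale: a small pipeline (a scaler plus a decision tree, fitted on made-up numbers) is saved to a temporary folder and loaded again. The file name `toy_model.pkl` and the data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up training data with two numeric features
X_toy = np.array([[1.0, 200.0], [2.0, 180.0], [8.0, 20.0], [9.0, 10.0]])
y_toy = np.array([0, 0, 1, 1])

# Preprocessing and predictor live in one pipeline
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("predictor", DecisionTreeClassifier(random_state=0))])
pipeline.fit(X_toy, y_toy)

# Save the whole pipeline, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object contains the fitted scaler as well as the tree
step_names = list(restored.named_steps)
same_predictions = (restored.predict(X_toy) == pipeline.predict(X_toy)).all()
print(step_names, same_predictions)
```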