In EDA (Step-1), we explored the dataset using summary statistics, visualizations & hypothesis tests.

The next step is to clean & preprocess the data before feeding it to ML algorithms. This involves the following steps:
* Data Cleaning
* Feature Engineering
* Handling Outliers
* Handling Missing values
* Encoding Categorical features
* Feature Scaling

In [4]:
# Imports
import pandas as pd

# Load the data

In [5]:
TRAIN_DATA_PATH = "../data/train.csv"
train_df = pd.read_csv(TRAIN_DATA_PATH, index_col=0)
train_df.reset_index(drop=True, inplace=True)
train_df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,...,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address,loan_status
0,18500.0,60 months,10.65,340.24,B,B2,CNMI Government,10+ years,OWN,40000.0,...,0.0,8.0,0.1,27.0,f,INDIVIDUAL,,0.0,"7530 Barnes Flat Apt. 584\r\nWhitetown, NV 30723",Fully Paid
1,13175.0,36 months,16.55,466.78,D,D2,customer service / account rep,4 years,RENT,30000.0,...,1.0,1046.0,15.8,8.0,f,INDIVIDUAL,0.0,0.0,"443 Rice Views Apt. 282\r\nNorth Jameshaven, A...",Fully Paid
2,35000.0,60 months,17.86,886.11,D,D5,Branch Manager,10+ years,MORTGAGE,80000.0,...,0.0,20239.0,57.5,36.0,w,INDIVIDUAL,2.0,0.0,3857 Christopher Courts Suite 005\r\nEast Chri...,Charged Off
3,20400.0,36 months,12.12,678.75,B,B3,California Dept of transportation,10+ years,RENT,65000.0,...,0.0,12717.0,49.4,31.0,f,INDIVIDUAL,0.0,0.0,"840 Parks Viaduct\r\nLake Brittanyside, MT 48052",Fully Paid
4,35000.0,60 months,17.57,880.61,D,D4,Air Traffic Control Specialist,10+ years,RENT,200000.0,...,0.0,14572.0,63.1,8.0,w,INDIVIDUAL,0.0,0.0,"042 Jamie Grove\r\nEast Maryshire, LA 70466",Charged Off


# Data cleaning

In this step, we'll: 
* Convert certain features to a more suitable data type
* Extract relevant information from features
    * Extract the state code from the `address` feature. The pincode could also be extracted, but careful inspection in the EDA step showed that not all pincodes are valid.
* Reduce the cardinality of categorical features by merging rare categories
* Drop irrelevant features
    * Drop `title`, as it contains null values and its information is already captured by the `purpose` feature (which has no missing values)
    * `emp_title` has very high cardinality. Using it would require grouping the titles by occupation (management, medical, education etc.), so we drop it for now.
    * Drop `address` once the state code has been extracted
* Encode the target feature
    * We'll encode the `loan_status` class labels as follows: __Charged Off = 1 & Fully Paid = 0__ (Charged Off is the positive label because it's the minority class)
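
As a quick illustration of the state-code extraction, here is a minimal sketch that applies the pattern `.\s([\w]{2})\s\d{4,5}$` to two addresses taken from the sample rows shown earlier, plus one made-up string without a state or pincode. The variable names are only for this example.

```python
import pandas as pd

# Two addresses from the sample rows plus one invented string that should not match
addresses = pd.Series([
    "7530 Barnes Flat Apt. 584\r\nWhitetown, NV 30723",
    "042 Jamie Grove\r\nEast Maryshire, LA 70466",
    "address without a state or pincode",  # hypothetical, for illustration
])

# Capture the two characters that sit between whitespace and the trailing 4-5 digit code
states = addresses.str.extract(r".\s([\w]{2})\s\d{4,5}$")[0]
print(states.tolist())
```

Rows that do not match the pattern come back as missing values, so it is worth checking `states.isna().sum()` after extraction.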

In [6]:
def clean_data(df):
    # Modifying the new copy of dataframe
    df_copy = df.copy()

    # Converting issue_d & earliest_cr_line to datetime type
    for feat in ["issue_d", "earliest_cr_line"]:    
        df_copy[feat] = pd.to_datetime(df_copy[feat], format="mixed")

    # Extract state code from address feature and dropping address feature later
    df_copy["state"] = df_copy["address"].str.extract(r'.\s([\w]{2})\s\d{4,5}$')[0]

    # Merge categories to reduce cardinality
    df_copy["home_ownership"] = df_copy["home_ownership"].replace(["ANY", "NONE"], "OTHER") # Merging ANY & NONE into OTHER
    df_copy["verification_status"] = df_copy["verification_status"].replace("Source Verified", "Verified") # Merging Source Verified into Verified
    df_copy["purpose"] = df_copy["purpose"].apply(lambda x: x if x in {"credit_card", "debt_consolidation"} else "other") # Merging rarer categories into other
    df_copy["application_type"] = df_copy["application_type"].replace(["JOINT", "DIRECT_PAY"], "NON_INDIVIDUAL")

    # Encoding target_feature
    df_copy["loan_status"] = df_copy["loan_status"].map({"Charged Off": 1, "Fully Paid": 0})

    # Dropping features
    df_copy.drop(columns=["title", "emp_title", "address"], inplace=True)

    return df_copy

In [7]:
cleaned_df = clean_data(train_df)
cleaned_df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,...,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,loan_status,state
0,18500.0,60 months,10.65,340.24,B,B2,10+ years,OWN,40000.0,Verified,...,0.0,8.0,0.1,27.0,f,INDIVIDUAL,,0.0,0,NV
1,13175.0,36 months,16.55,466.78,D,D2,4 years,RENT,30000.0,Verified,...,1.0,1046.0,15.8,8.0,f,INDIVIDUAL,0.0,0.0,0,AL
2,35000.0,60 months,17.86,886.11,D,D5,10+ years,MORTGAGE,80000.0,Verified,...,0.0,20239.0,57.5,36.0,w,INDIVIDUAL,2.0,0.0,1,AR
3,20400.0,36 months,12.12,678.75,B,B3,10+ years,RENT,65000.0,Verified,...,0.0,12717.0,49.4,31.0,f,INDIVIDUAL,0.0,0.0,0,MT
4,35000.0,60 months,17.57,880.61,D,D4,10+ years,RENT,200000.0,Verified,...,0.0,14572.0,63.1,8.0,w,INDIVIDUAL,0.0,0.0,1,LA


# Feature Engineering

This step involves engineering new features from combinations of the original features that are meaningful & likely to help prediction. Based on the analysis in Step-1, we'll engineer the following features (named as in the `FeatureEngineer` transformer defined in the Pipeline section):
* loan_income_ratio = loan_amnt / annual_inc
* emi_ratio = installment / (annual_inc/12)
* credit_age_years = (issue_d - earliest_cr_line) in days / 365
* closed_acc = total_acc - open_acc
* negative_rec = (pub_rec > 0) | (pub_rec_bankruptcies > 0) 
* credit_util_ratio = revol_bal / annual_inc
* mortgage_ratio = mort_acc / total_acc

__Note:__ negative_rec is a binary feature, so it can be treated as a categorical feature
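
To make the formulas concrete, here is a small sketch that computes them for two hypothetical applicants. All values in `toy` are invented for illustration, and the output column names follow the `FeatureEngineer` transformer.

```python
import pandas as pd

# Two hypothetical applicants (all values invented for illustration)
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 24000.0],
    "installment": [300.0, 800.0],
    "annual_inc": [60000.0, 48000.0],
    "issue_d": pd.to_datetime(["2015-01-01", "2016-06-01"]),
    "earliest_cr_line": pd.to_datetime(["2005-01-01", "2010-06-01"]),
    "total_acc": [20.0, 10.0],
    "open_acc": [8.0, 6.0],
    "pub_rec": [0.0, 1.0],
    "pub_rec_bankruptcies": [0.0, 0.0],
    "revol_bal": [6000.0, 12000.0],
    "mort_acc": [2.0, 0.0],
})

feats = pd.DataFrame({
    # Loan amount relative to yearly income
    "loan_income_ratio": toy["loan_amnt"] / toy["annual_inc"],
    # Monthly installment relative to monthly income
    "emi_ratio": toy["installment"] / (toy["annual_inc"] / 12),
    # Age of the credit line at loan issue, in years
    "credit_age_years": (toy["issue_d"] - toy["earliest_cr_line"]).dt.days / 365,
    # Accounts that are no longer open
    "closed_acc": toy["total_acc"] - toy["open_acc"],
    # 1 if the applicant has any derogatory public record or bankruptcy
    "negative_rec": ((toy["pub_rec"] > 0) | (toy["pub_rec_bankruptcies"] > 0)).astype(int),
    # Revolving balance relative to yearly income
    "credit_util_ratio": toy["revol_bal"] / toy["annual_inc"],
    # Share of accounts that are mortgages
    "mortgage_ratio": toy["mort_acc"] / toy["total_acc"],
})
print(feats.round(2))
```

Note that the ratio features become infinite or undefined when the denominator (`annual_inc` or `total_acc`) is 0, so such rows may need separate handling.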

# Handle Outliers

To treat outliers in numerical features, the most commonly used approach is to cap values on either side at fixed bounds. Two techniques for choosing these bounds are:
* __IQR method__: Uses the IQR to find lower & upper bounds and then caps values outside these bounds at the bound values. $$\text{IQR} = Q_3 - Q_1, \qquad \text{lower-bound} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{upper-bound} = Q_3 + 1.5 \times \text{IQR}$$
But for a few features, such as `pub_rec` & `pub_rec_bankruptcies`, the upper bound turns out to be 0. After capping, all values of these features would be 0, thus losing important information.
* __Winsorization__: Uses fixed percentile values, i.e. a percentile in the 0-5th range for the lower bound & in the 95-99th range for the upper bound, and caps values beyond these bounds.

Since all our numerical features are right-skewed, outliers appear only at the upper end, which is why we'll cap only at an upper bound (98th percentile) via Winsorization.
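
The difference between the two methods on a zero-heavy feature can be sketched as follows. The series `pub_rec_like` is synthetic (85% zeros) and only mimics the shape of a feature like `pub_rec`; it is not taken from the dataset.

```python
import pandas as pd

# Synthetic zero-heavy feature: 85 zeros, then a thin right tail (illustrative only)
pub_rec_like = pd.Series([0] * 85 + [1] * 10 + [2] * 3 + [5] * 2, dtype=float)

# IQR method: Q1 and Q3 are both 0 here, so the upper bound collapses to 0
q1 = pub_rec_like.quantile(0.25)
q3 = pub_rec_like.quantile(0.75)
iqr_upper = q3 + 1.5 * (q3 - q1)
iqr_capped = pub_rec_like.clip(upper=iqr_upper)

# Winsorization: cap at the 98th percentile, which keeps most of the tail
p98 = pub_rec_like.quantile(0.98)
winsorized = pub_rec_like.clip(upper=p98)

print("IQR upper bound:", iqr_upper, "| distinct values left:", iqr_capped.nunique())
print("98th percentile:", p98, "| distinct values left:", winsorized.nunique())
```

With the IQR bound every non-zero record is flattened to 0, while percentile capping only trims the most extreme values.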

# Handle Missing values

We'll impute missing values instead of dropping them, to avoid losing information. After cleaning the dataset, the features with missing values are: __revol_util, mort_acc & pub_rec_bankruptcies__ (Numerical) & __emp_length__ (Categorical).

There are multiple approaches to imputing missing values for numerical & categorical features:
* Numerical: Impute with a measure of central tendency, i.e. __Mean or Median__. Mean imputation isn't suitable in this case because the mean is pulled by outliers, whereas the Median is a reliable value to impute.
* Categorical: Impute with the most frequent category, i.e. the __Mode__. 

Another alternative is to fill missing values using nearest neighbors, which works well for numerical features. So the custom `Imputer` in the pipeline supports `KNNImputer` for numerical features (with median imputation as the cheaper option, which `build_pipeline` currently selects via `use_knn_imputation=False`), & for categorical features we impute the Mode of the feature.
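
The imputation options above can be compared on a toy array. This is a minimal sketch: the numbers and the choice of `n_neighbors=2` are assumptions made for the example, not settings from this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numerical data: the second column has one missing value and the first has an outlier
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [100.0, 40.0],
])

# Median imputation: fills with the column median of the observed values
median_filled = SimpleImputer(strategy="median").fit_transform(X_toy)

# KNN imputation: averages the column over the 2 rows closest in the observed features
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

# Mode imputation for a categorical column (object dtype)
cat_toy = np.array([["10+ years"], ["10+ years"], ["4 years"], [np.nan]], dtype=object)
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat_toy)

print("median:", median_filled[2, 1])
print("knn:", knn_filled[2, 1])
print("mode:", cat_filled[3, 0])
```

The KNN estimate depends on which rows are nearest, so it adapts to the row being filled, at the cost of computing pairwise distances.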

# Encoding Categorical features

This step involves converting categorical features into a numerical format. The major encoding techniques are:
* __One-hot encoding__: Represents the categories using a vector of zeros and ones, 1 indicating presence of a particular category and 0 indicating absence. Avoid this encoding if the feature cardinality is high, as it leads to the curse of dimensionality. Also, beware of the dummy variable trap: a feature with k categories is fully described by a binary vector of size k-1, so one column should be dropped (especially for linear models).
* __Ordinal encoding__: Represents each category using an integer, usually ordered. This encoding is used where there is an inherent order among the feature categories.
* __Frequency encoding__: Represents each category by its frequency, i.e. the count of samples with that category in the dataset. This can be a good alternative when the feature has high cardinality.
* __Target encoding__: Represents each category using the average of the target feature for that category. This can be useful when the feature cardinality is high, but it can overfit (leak the target) if it is not fitted on training data only.
* __Weight of evidence encoding__: Represents each category as the log of the ratio between its share of positive-class samples & its share of negative-class samples. This is commonly used for credit risk analysis & binary classification problems (exactly our use-case). This encoding is also observed to work well with Logistic regression. Read about it [here](https://feature-engine.trainindata.com/en/1.8.x/user_guide/encoding/WoEEncoder.html)

In this case, Ordinal encoding suits the features which have some inherent order (i.e. __grade, sub_grade, emp_length__). For the remaining features we can perform either Target encoding or WoE (Weight-of-evidence) encoding, but the encoder needs to be re-fitted on the training portion of each train-validation split in Cross-validation, so that validation labels don't leak into the encoding. Note that the `CatEncoder` step defined in the Pipeline section currently applies the chosen WoE/Target encoder to all categorical features, including the ordered ones.
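
As a worked example of the WoE idea, here is a hand-rolled sketch in pandas. The data, the category names and the fallback value of 0 for unseen categories are assumptions made for illustration; library encoders such as `WOEEncoder` may add smoothing, so their numbers can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical training fold: category "A" is mostly Fully Paid (0), "B" mostly Charged Off (1)
train = pd.DataFrame({
    "purpose": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "loan_status": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Share of all positives / all negatives that fall into each category
pos_share = train.loc[train["loan_status"] == 1, "purpose"].value_counts(normalize=True)
neg_share = train.loc[train["loan_status"] == 0, "purpose"].value_counts(normalize=True)

# WoE = ln(share of positives / share of negatives), learnt on the training fold only
woe = np.log(pos_share / neg_share)

# Apply the learnt mapping to a validation fold; unseen categories fall back to 0 (an assumption)
valid = pd.DataFrame({"purpose": ["A", "B", "C"]})
valid["purpose_woe"] = valid["purpose"].map(woe).fillna(0.0)
print(valid)
```

Because the mapping is computed from `train` alone, repeating these two steps inside every cross-validation split keeps the validation labels out of the encoding.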

# Feature Scaling 

Scaling is good practice to bring the values of all features into a similar range, which is especially beneficial for linear models (Logistic regression) & distance-based models (KNN). There are multiple [scaling techniques](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling):

* Standard-scaling: Shifts & scales the feature distribution to have mean=0 & standard-deviation=1. This works well if the feature is roughly normally distributed. It's sensitive to outliers, as both the mean & the standard deviation are affected by them.
* Minmax-scaling: Scales the input feature distribution to the range [0, 1]. It's sensitive to outliers: a single extreme value squeezes the inliers into a small part of the range.
* Robust-scaling: It's similar to Standard-scaling but uses the median & IQR, statistics which are robust to outliers, instead of the mean & standard-deviation. Hence, it's commonly used for features containing outliers.
* Box-cox transformation: Uses a parametric, monotonic transformation to map a distribution as close to a Gaussian distribution as possible. But this transformation works for strictly positive values only (>0).

In our case, since almost all the numerical features have outliers & some features have zero-values (unsuitable for Box-cox), Robust-scaling looks like a suitable scaling technique.
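
The effect of an outlier on each scaler can be seen on a tiny example. This is a sketch on a made-up column with a single extreme value, not on the loan data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with four inliers and a single large outlier (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x).ravel()  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x).ravel()      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x).ravel()      # (x - median) / IQR

print("standard:", standard.round(2))
print("minmax:  ", minmax.round(3))
print("robust:  ", robust.round(2))
```

Here the median is 3 and the IQR is 2, so Robust-scaling keeps the inliers evenly spread around 0, while Minmax-scaling compresses them into roughly the first 3% of the [0, 1] range.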

# Pipeline

We'll be creating a pipeline for the above mentioned preprocessing steps, so that we can quickly experiment during model building with various choices. The order of preprocessing steps will be as follows:

1. Missing values imputation
2. Handling outliers
3. Feature engineering 
4. Drop irrelevant features
5. Encoding Categorical features
6. Feature Scaling

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.impute import KNNImputer, SimpleImputer
from category_encoders import TargetEncoder, WOEEncoder
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

In [8]:
# Feature types
target_feat = "loan_status"
independent_feat = [col for col in cleaned_df.columns if col!=target_feat]
date_feat = ["issue_d", "earliest_cr_line"]
num_feat = [col for col in independent_feat if cleaned_df[col].dtype=="float"]
cat_feat = [col for col in independent_feat if col not in set(num_feat) and col not in set(date_feat)]
eng_feat = ["loan_income_ratio", "emi_ratio", "credit_age_years", "closed_acc", "credit_util_ratio", "mortgage_ratio"]

# Features for pipelines
final_cat_feat = cat_feat + ["negative_rec"] # Include binary negative_rec after feature engineering
final_num_feat = num_feat + eng_feat

In [None]:
# Custom Feature-engineering transformer
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # No hyperparameters; sklearn expects explicit __init__ parameters rather than **kwargs
        pass

    def fit(self, X, y=None):
        return self # nothing to fit, return self
    
    def transform(self, X):
        X = X.copy()
        # Loan to income ratio
        X["loan_income_ratio"] = (X["loan_amnt"] / X["annual_inc"]).round(2)
        # EMI to monthly income ratio
        X["emi_ratio"] = (X["installment"] / (X["annual_inc"]/12)).round(2)
        # Credit-line age in years
        X["credit_age_years"] = ((X["issue_d"] - X["earliest_cr_line"]).dt.days / 365).round(1)
        # Total closed accounts
        X["closed_acc"] = X["total_acc"] - X["open_acc"]
        # Has negative records
        X["negative_rec"] = ((X["pub_rec"]> 0) | (X["pub_rec_bankruptcies"]>0)).astype("int")
        # Credit utilization ratio
        X["credit_util_ratio"] = (X["revol_bal"] / X["annual_inc"]).round(2)
        # Mortgage accounts ratio
        X["mortgage_ratio"] = (X["mort_acc"] / X["total_acc"]).round(2)

        return X
    
# Custom Outlier handler
class OutlierHandler(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        # Ensure X is a dataframe to access columns
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X should be a pandas Dataframe object")
        
        # Calculating the statistics
        self.bounds_ = {}
        for col in self.features:
            upper_bound = X[col].quantile(0.98)
            self.bounds_[col] = upper_bound
            
        return self
    
    def transform(self, X):
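        # Check if fitted: raises NotFittedError when fit() has not set bounds_
        # (a plain truthiness test would fail with AttributeError before fit()
        # and would wrongly reject a fitted handler with an empty feature list)
        check_is_fitted(self, "bounds_")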
        
        X = X.copy()
        # Capping the outliers on the upper-end
        for col in self.features:
            upper_bound = self.bounds_[col]
            X[col] = X[col].clip(upper=upper_bound)
        return X
    
# Custom Feature dropper
class FeatureDropper(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features

    def fit(self, X, y=None):
        return self # Nothing to fit, return self
    
    def transform(self, X):
        X = X.copy()
        return X.drop(columns=self.features, errors="ignore")
    
# Custom Imputer
class Imputer(BaseEstimator, TransformerMixin):
    def __init__(self, use_knn_imputation=False, num_features=None, cat_features=None):
        self.use_knn_imputation = use_knn_imputation
        self.num_features = num_features
        self.cat_features = cat_features
        
    def fit(self, X, y=None):
        # Ensure X is a dataframe to access columns
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X should be a pandas Dataframe object")

        # Instantiate the imputers
        if self.use_knn_imputation:
            self.num_imputer_ = KNNImputer()
        else:
            self.num_imputer_ = SimpleImputer(strategy="median")
        self.cat_imputer_ = SimpleImputer(strategy="most_frequent")

        # Fit the imputers
        if self.num_features:
            self.num_imputer_.fit(X[self.num_features])
        if self.cat_features:
            self.cat_imputer_.fit(X[self.cat_features])

        return self

    def transform(self, X):
        # Check that fit() has created the imputers (raises NotFittedError otherwise).
        # Checking the attributes on self also works when one feature list is empty,
        # in which case the corresponding inner imputer is never fitted or used.
        check_is_fitted(self, ["num_imputer_", "cat_imputer_"])

        X = X.copy()
        # Transform numerical features
        if self.num_features:
            imputed_nums = self.num_imputer_.transform(X[self.num_features])
            X[self.num_features] = imputed_nums

        # Transform categorical features
        if self.cat_features:
            imputed_cats = self.cat_imputer_.transform(X[self.cat_features])
            X[self.cat_features] = imputed_cats

        return X
    

# Custom Categorical Encoder
class CatEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoder_type="woe", features=None):
        self.encoder_type = encoder_type
        self.features = features

    def fit(self, X, y=None):
        # Ensure X is a dataframe to access columns
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X should be a pandas Dataframe object")
        
        # Instantiate the encoder
        if self.encoder_type == "woe":
            self.cat_encoder_ = WOEEncoder(cols=self.features)
        elif self.encoder_type == "target":
            self.cat_encoder_ = TargetEncoder(cols=self.features)
        else:
            raise ValueError("encoder_type must be either 'woe' or 'target'") 
        
        # Fit the encoder
        if self.features:
            self.cat_encoder_.fit(X, y)
        
        return self

    def transform(self, X):
        # Check that fit() has created the encoder (raises NotFittedError otherwise)
        check_is_fitted(self, "cat_encoder_")

        # Nothing to encode: return a copy of the input unchanged
        # (previously X_trans was undefined in this case)
        if not self.features:
            return X.copy()

        # Encode the categorical features; the encoder returns a new dataframe
        return self.cat_encoder_.transform(X)
    

# Custom Scaler
class Scaler(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        self.features = features

    def fit(self, X, y=None):
        # Ensure X is a dataframe to access columns
        if not isinstance(X, pd.DataFrame):
            raise ValueError("X should be a pandas Dataframe object")
        
        # Instantiate the scaler
        self.scaler_ = RobustScaler()

        # Fit the scaler
        if self.features:
            self.scaler_.fit(X[self.features])

        return self
    
    def transform(self, X):
        # Check that fit() has created the scaler (raises NotFittedError otherwise).
        # Checking the attribute on self also works when no features were given,
        # in which case the inner scaler is never fitted or used.
        check_is_fitted(self, "scaler_")
        
        X = X.copy()

        # Scale the numerical features
        if self.features:
            scaled_nums = self.scaler_.transform(X[self.features])
            X[self.features] = scaled_nums

        return X

In [41]:
# Build pipeline
def build_pipeline(model, use_imputation=True, use_outlier_capping=True, use_encoding=True, encoder_type="woe", use_scaling=True, features_to_drop=None):
    # features_to_drop defaults to None (not a mutable list); None or [] means nothing is dropped
    #-----Imputation-----
    if use_imputation:
        imputer = Imputer(use_knn_imputation=False, 
                          num_features=num_feat,
                          cat_features=cat_feat)
    else:
        imputer = "passthrough"

    #-----Outlier handling-----
    if use_outlier_capping:
        outlier_capper = OutlierHandler(features=num_feat)
    else:
        outlier_capper = "passthrough"

    #-----Categorical encoding-----
    if use_encoding:
        cat_encoder = CatEncoder(encoder_type, features=final_cat_feat)
    else:
        cat_encoder = "passthrough"
    
    #-----Dropping specified features-----
    if features_to_drop:
        feat_dropper = FeatureDropper(features=features_to_drop)
    else:
        feat_dropper = "passthrough"
    
    #-----Scaling-----
    if use_scaling:
        scaler = Scaler(features=final_num_feat)
    else:
        scaler = "passthrough"

    # Preprocessing pipeline
    final_pipeline = Pipeline(steps=[
        ("imputer", imputer),
        ("outlier_capper", outlier_capper),
        ("feature_engineer", FeatureEngineer()),
        ("feature_dropper", feat_dropper),
        ("cat_encoder", cat_encoder),
        ("scaler", scaler),
        ("model", model)
    ])

    return final_pipeline
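
To show how a pipeline built this way can be used for quick experiments, here is a reduced, self-contained sketch. It uses synthetic data from `make_classification` and a two-step pipeline; `build_toy_pipeline`, the `LogisticRegression` model and the ROC-AUC metric are assumptions for the example and stand in for the full `build_pipeline` above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary-classification data (stand-in for the preprocessed loan data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

def build_toy_pipeline(model, use_scaling=True):
    # "passthrough" turns a step into a no-op, so a choice can be toggled by a flag
    scaler = RobustScaler() if use_scaling else "passthrough"
    return Pipeline(steps=[("scaler", scaler), ("model", model)])

scores = {}
for use_scaling in (True, False):
    pipe = build_toy_pipeline(LogisticRegression(max_iter=1000), use_scaling=use_scaling)
    # Every step is re-fitted on the training part of each fold, avoiding leakage
    scores[use_scaling] = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="roc_auc")

for use_scaling, fold_scores in scores.items():
    print(f"use_scaling={use_scaling}: mean ROC-AUC = {fold_scores.mean():.3f}")
```

The same pattern should carry over to `build_pipeline`: pass the whole pipeline to the cross-validation routine rather than preprocessing the data once up front.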

## Things I Learnt
* sklearn's `Pipeline` is useful for automating steps that are __sequential__ in nature, as in our case. We have automated all the preprocessing steps along with fitting the model on the preprocessed data.
* To create custom pipeline steps, we inherit from the following classes:
    * __BaseEstimator__: It provides `get_params()` & `set_params()` (derived from the `__init__` signature), which utilities such as `clone` and grid search rely on. The `fit()` & `transform()` methods are implemented by us according to our requirements.
    * __TransformerMixin__: It automatically provides the inheriting class with `fit_transform()`, given that the class already implements `fit()` & `transform()`.
* While implementing a custom pipeline step, any attribute we set in the `__init__` method should have the same name as the corresponding input argument, as per sklearn's conventions, e.g. `self.features` for the `features` argument. Also, attributes ending with an underscore, e.g. `self.bounds_`, are reserved for fitted attributes (values calculated in the `fit()` method).
* Pipeline vs. ColumnTransformer
    * Each step in a `Pipeline` receives the entire input Dataframe X. We have defined a custom class for each Pipeline step because we want certain steps to work on a subset of the input features and not all of them, e.g. in scaling, we want to scale only the numerical features & not the encoded categorical features.
    * The above issue can also be solved using `ColumnTransformer`, which applies a transformation to a subset of columns (hence the name). If there are multiple transformations inside the `ColumnTransformer`, each one is applied to its own column subset independently (in __parallel__ rather than in sequence, so no transformation sees another's output) and the results are concatenated. But a few issues I faced with `ColumnTransformer` are: 
        * By default, the output of each transformation is a numpy array and not a dataframe, thus losing the column information. This might lead to __column not found errors__ if you try to access columns by name in a later step. (scikit-learn 1.2+ adds `set_output(transform="pandas")`, which may help here.)
        * The `remainder` argument determines what to do with the features which weren't transformed at all in the `ColumnTransformer`. __passthrough__ concatenates these columns untransformed to the transformed results, while __drop__ drops these columns from the final result. When I used the "passthrough" option for the date-related columns, i.e. issue_d & earliest_cr_line, their data type (`datetime64[ns]`) differed from the rest, so they couldn't be concatenated to the transformed results & a data-type mismatch error was raised.
    * Due to the above reasons, I opted for using only `Pipeline` with custom step implementations
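
The conventions above can be seen in a minimal custom transformer. This is an illustrative sketch: `UpperCapper` and its `quantile` parameter are hypothetical names for this example, and the mixin is listed before `BaseEstimator`, which is the order scikit-learn's developer guide recommends.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class UpperCapper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: caps every column at a fitted upper quantile."""

    def __init__(self, quantile=0.98):
        # Attribute name matches the argument name, so get_params()/clone() work
        self.quantile = quantile

    def fit(self, X, y=None):
        # Trailing underscore marks a value learnt during fit()
        self.bounds_ = X.quantile(self.quantile)
        return self

    def transform(self, X):
        check_is_fitted(self, "bounds_")  # raises NotFittedError before fit()
        return X.clip(upper=self.bounds_, axis=1)

demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 100.0]})
capper = UpperCapper(quantile=0.5)

params = capper.get_params()        # provided by BaseEstimator
capped = capper.fit_transform(demo)  # provided by TransformerMixin
fresh = clone(capper)                # same parameters, no fitted state

try:
    UpperCapper().transform(demo)
    unfitted_error = None
except NotFittedError as err:
    unfitted_error = err

print(params, capped["a"].tolist(), hasattr(fresh, "bounds_"))
```

Only `fit()` and `transform()` are written by hand here; parameter handling and `fit_transform()` come from the two base classes.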