## Feature Preprocessing & Model Analysis

### 1. Preprocessing & Model Pipelines

- **Preprocessing Pipeline**: built with `ColumnTransformer` to prepare the dataset for modeling.
    - **SimpleImputer**: fills missing values (a constant `'Missing'` category for categorical columns, the median for numerical columns) so that rows with partial records are not dropped.
    - **OneHotEncoder**: converts categorical attributes into a numeric format that the models can consume.
    - **FeaturesTransformer**: a custom transformer that merges data from the external tables `client_profile_data`, `applications_history_data`, `bki_data`, and `payments_data`, and generates aggregate statistics and feature interactions (products, ratios, differences, and group-wise aggregations).

- **Model Pipeline**: model hyperparameters were tuned via `RandomizedSearchCV` and `BayesSearchCV`, with early stopping used to limit overfitting.
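
As a minimal sketch of the imputation and encoding steps listed above, the snippet below applies the same kind of `ColumnTransformer` to a tiny made-up frame. The frame `toy`, its values, and the name `toy_preprocessing` are assumptions for illustration only; `sparse_threshold=0` is used here simply to force a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numerical column, each with a gap.
toy = pd.DataFrame({
    "GENDER": pd.Series(["F", "M", np.nan, "F"], dtype=object),
    "TOTAL_SALARY": [100.0, np.nan, 300.0, 200.0],
})
categorical_columns = ["GENDER"]
numerical_columns = ["TOTAL_SALARY"]

toy_preprocessing = make_column_transformer(
    # Categorical: fill gaps with a 'Missing' category, then one-hot encode.
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="Missing"),
                   OneHotEncoder(handle_unknown="ignore")),
     categorical_columns),
    # Numerical: fill gaps with the column median.
    (SimpleImputer(strategy="median"), numerical_columns),
    sparse_threshold=0,  # always return a dense array
)

toy_encoded = toy_preprocessing.fit_transform(toy)
print(toy_encoded.shape)  # three one-hot columns (F, M, Missing) plus the salary column
```

The one-hot columns come first because the categorical transformer is listed first, which is also why the notebook later concatenates the encoded names before the numerical ones.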

### 2. Model Performance

**ROC AUC scores on the train, validation, and test splits**:

- **L2 Model (Logistic Regression with L2 Regularization)**
    - Train: 0.715
    - Validation: 0.71
    - Test: 0.731

- **Random Forest**
    - Train: 0.734
    - Validation: 0.699
    - Test: 0.731

- **XGBoost**
    - Train: 0.768
    - Validation: 0.727
    - Test: 0.736

- **LightGBM**
    - Train: 0.797
    - Validation: 0.729
    - Test: 0.757

- **CatBoost**
    - Train: 0.751
    - Validation: 0.724
    - Test: 0.743

LightGBM achieved the highest score on all three splits. LightGBM and XGBoost also show the largest train-validation gaps (0.068 and 0.041 respectively), which indicates some overfitting, yet they still post the two highest validation scores.
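
The overfitting remark can be checked directly from the reported numbers. The short sketch below only re-uses the scores listed in this section; the dictionary layout and variable names are illustrative.

```python
# (train, validation, test) ROC AUC as reported above
scores = {
    "L2": (0.715, 0.710, 0.731),
    "Random Forest": (0.734, 0.699, 0.731),
    "XGBoost": (0.768, 0.727, 0.736),
    "LightGBM": (0.797, 0.729, 0.757),
    "CatBoost": (0.751, 0.724, 0.743),
}

# Train-validation gap as a rough overfitting indicator
gaps = {name: round(train - valid, 3) for name, (train, valid, _) in scores.items()}
best_test = max(scores, key=lambda name: scores[name][2])

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap={gap:.3f}")
print("Best test score:", best_test)
```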

### 3. Feature Importance Analysis

Next, permutation importance was computed for the three strongest models: LightGBM, XGBoost, and CatBoost. Each model was then retrained on only the most influential features, with the following scores:

- **XGBoost**:
    - All Features: 0.736
    - Selected Features: 0.736

- **LightGBM**:
    - All Features: 0.757
    - Selected Features: 0.753

- **CatBoost**:
    - All Features: 0.743
    - Selected Features: 0.741

Training on the pruned feature set improved the validation score, supporting our suspicion that careful selection can exclude irrelevant attributes. However, the scores listed above differ by at most 0.004 between the two feature sets, so we chose LightGBM with all features to avoid discarding potentially useful information.
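
For readers unfamiliar with permutation importance, here is a small self-contained sketch of the selection idea on synthetic data. The dataset, the `LogisticRegression` stand-in model, and all variable names are assumptions made for the example; the notebook applies the same thresholding to its boosted models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: with shuffle=False the first two columns are informative.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Shuffle each column on held-out data and record the average drop in ROC AUC.
result = permutation_importance(
    demo_model, X_hold, y_hold, scoring="roc_auc", n_repeats=5, random_state=0
)

# Keep only features whose mean importance exceeds the threshold.
threshold = 0.0001
keep_mask = result.importances_mean > threshold
demo_best_features = X_hold.columns[keep_mask].tolist()
print(demo_best_features)
```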


In [1]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format

import numpy as np

from typing import Optional

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from skopt import BayesSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier, early_stopping

from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

from sklearn.inspection import permutation_importance
from tqdm import tqdm
import joblib
import pickle


In [2]:
class FeaturesTransformer(BaseEstimator, TransformerMixin):
    """
    Description:
    -------
    Clean up data, create new features for every dataset and merge everything into one dataset.

    Parameters:
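    - client_profile_data: pandas.core.frame.DataFrame, client profile data
    - applications_history_data: pandas.core.frame.DataFrame, applications history data
    - bki_data: pandas.core.frame.DataFrame, BKI data
    - payments_data: pandas.core.frame.DataFrame, payments data
    - percentile_lower: int, optional, default=5, lower percentile used to flag outliers
    - percentile_upper: int, optional, default=95, upper percentile used to flag outliers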
    """
    
    def __init__(self, client_profile_data, applications_history_data, bki_data, payments_data, percentile_lower=5, percentile_upper=95):
        self.client_profile_data = client_profile_data
        self.applications_history_data = applications_history_data
        self.bki_data = bki_data
        self.payments_data = payments_data
        self.columns_ = None
        self.percentile_lower = percentile_lower
        self.percentile_upper = percentile_upper
    
    def fit(self, X, y=None):
        """
        Description:
        -------
        Fit the transformer to the data and calculate various statistics.

        Parameters:
        - X: pandas.core.frame.DataFrame, input data
        - y: None, not used

        Returns:
        - self
        """
        # Merging data to create statistics
        merged_data = self.merge_tables_on_app_num(X, self.client_profile_data)

        # Generate statistics from merged_data
        self.set_groupby_profile_stats(merged_data)
        self.set_percentile_bonds(merged_data) 

        # Clearing the temporary merged data
        del merged_data
        
        return self       

    def transform(self, X, y=None):
        """
        Description:
        -------
        Transform the input data by creating new features and merging datasets.

        Parameters:
        - X: pandas.core.frame.DataFrame, input data
        - y: None, not used

        Returns:
        - X_transformed: pandas.core.frame.DataFrame, transformed data
        """
        
        X = self.create_client_profile_features(X)
        history = self.create_applications_history_features(self.applications_history_data)
        bki = self.create_bki_features(self.bki_data)
        payments = self.create_payments_features(self.payments_data)
        
        # Merge all Datasets/Tables to one dataset
        for table in [history, bki, payments]:
            X = self.merge_tables_on_app_num(X, table)
            
        self.test_ids = list(X["APPLICATION_NUMBER"]) 
        X.drop(columns=["APPLICATION_NUMBER"], inplace=True)
        self.columns_ = X.columns.tolist()
            
        return X

    def get_columns(self):
        return self.columns_

    def merge_tables_on_app_num(self, left_df, right_df):     
        """
        Description:
        -------
            Merge any two tables based on APPLICATION_NUMBER column

        Parameters
        ----------
            left_df: pandas.core.frame.DataFrame
            
            right_df: pandas.core.frame.DataFrame
            
        Returns
        -------
            merged_df: pandas.core.frame.DataFrame

        """  

        df = pd.merge(left_df, right_df, how="left", on="APPLICATION_NUMBER")
        df = df.replace(np.inf, np.nan)
        df = df.replace(-np.inf, np.nan)
            
        return df
        
    def set_groupby_profile_stats(self, X):     
        """
        Description:
        -------
            Calculate group statistics such as TOTAL_SALARY and AMOUNT_CREDIT mean for 
            EDUCATION_LEVEL, GENDER, FAMILY_STATUS and REGION_POPULATION on the train set

        Parameters
        ----------
            X: pandas.core.frame.DataFrame

        """

        aggs = {"TOTAL_SALARY": ["mean"], "AMOUNT_CREDIT": ["mean"]}
        self.stats_education = self.create_numerical_aggs(X, groupby_id="EDUCATION_LEVEL", aggs=aggs, prefix="GROUPBY_EDU_LEVEL_")
        self.stats_gender = self.create_numerical_aggs(X, groupby_id="GENDER", aggs=aggs, prefix="GROUPBY_GENDER_")
        self.stats_region = self.create_numerical_aggs(X, groupby_id="REGION_POPULATION", aggs=aggs, prefix="GROUPBY_REGION_")
        self.stats_fam_status = self.create_numerical_aggs(X, groupby_id="FAMILY_STATUS", aggs=aggs, prefix="GROUPBY_FAM_STATUS_")

    def set_percentile_bonds(self, X):     
        """
        Description:
        -------
            Calculate left and right percentile bounds for 
            TOTAL_SALARY, AMOUNT_CREDIT and AMOUNT_ANNUITY on the train set

        Parameters
        ----------
            X: pandas.core.frame.DataFrame

        """ 
        
        self.salary_left_bond, self.salary_right_bond = np.nanpercentile(X['TOTAL_SALARY'], q=self.percentile_lower), np.nanpercentile(X['TOTAL_SALARY'], q=self.percentile_upper)
        self.credit_left_bond, self.credit_right_bond = np.nanpercentile(X['AMOUNT_CREDIT'], q=self.percentile_lower), np.nanpercentile(X['AMOUNT_CREDIT'], q=self.percentile_upper)
        self.annuity_left_bond, self.annuity_right_bond = np.nanpercentile(X['AMOUNT_ANNUITY'], q=self.percentile_lower), np.nanpercentile(X['AMOUNT_ANNUITY'], q=self.percentile_upper)

        
    def create_numerical_aggs(self, data: pd.DataFrame, groupby_id: str, aggs: dict,
                          prefix: Optional[str] = None, suffix: Optional[str] = None,) -> pd.DataFrame:
        """
        Description:
        -------
            Create aggregations for numeric features

        Parameters
        ----------
            data: pandas.core.frame.DataFrame

            groupby_id: str

            aggs: dict 
                Dictionary with feature's name and the list of aggregation functions to perform

            prefix: str, optional, default = None
                Prefix which will be used to name a new created feature

            suffix: str, optional, default = None
                Suffix which will be used to name a new created feature

        Returns
        -------
            stats: pandas.core.frame.DataFrame

        """
        
        if not prefix:
            prefix = ""
        if not suffix:
            suffix = ""

        data_grouped = data.groupby(groupby_id)
        stats = data_grouped.agg(aggs)
        stats.columns = [f"{prefix}{feature}_{stat}{suffix}".upper() for feature, stat in stats]
        stats = stats.reset_index()

        return stats
    
    def create_client_profile_features(self, X):    
        """
        Description:
        -------
            Create new features for client_profile dataset

        Parameters
        ----------
            X: pandas.core.frame.DataFrame

        Returns
        -------
            X_transformed: pandas.core.frame.DataFrame
        """
        
        X = X.copy() 
        X = self.merge_tables_on_app_num(X, self.client_profile_data)
  
        #FLAG MISSING and OUTLIERS
        
        # Create new columns to flag features with a lot of missing values
        flag_missing_columns = ['OWN_CAR_AGE', 'EXTERNAL_SCORING_RATING_1', 'EXTERNAL_SCORING_RATING_3', 'AMT_REQ_CREDIT_BUREAU_MON']
        for column in flag_missing_columns:
            X['MISSING_' + column] = X[column].isna().astype('int')

        # Flag DAYS_ON_LAST_JOB > 350000 as missing 
        X['MISSING_DAYS_ON_LAST_JOB'] = (X.DAYS_ON_LAST_JOB > 350000).astype('int')
        
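        # Reset MISSING_OWN_CAR_AGE to 0 wherever it equals 1.
        # NOTE: this zeroes the flag itself, not OWN_CAR_AGE; if the intent was to
        # fill missing OWN_CAR_AGE values with 0, target the 'OWN_CAR_AGE' column instead.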
        X.loc[X['MISSING_OWN_CAR_AGE']==1,'MISSING_OWN_CAR_AGE'] = 0

        #Flag outliers for 'TOTAL_SALARY', 'AMOUNT_CREDIT', 'AMOUNT_ANNUITY'
        X['OUTLIER_TOTAL_SALARY'] = ((X['TOTAL_SALARY'] < self.salary_left_bond) | (X['TOTAL_SALARY'] > self.salary_right_bond)).astype('int')
        X['OUTLIER_AMOUNT_CREDIT'] = ((X['AMOUNT_CREDIT'] < self.credit_left_bond) | (X['AMOUNT_CREDIT'] > self.credit_right_bond)).astype('int')
        X['OUTLIER_AMOUNT_ANNUITY'] = ((X['AMOUNT_ANNUITY'] < self.annuity_left_bond) | (X['AMOUNT_ANNUITY'] > self.annuity_right_bond)).astype('int')

        #PROCESS NUMERIC FEATURES
        
        # Turn CHILDRENS into discrete/categorical indicator features
        X['CHILDREN_0']  = (X.CHILDRENS == 0).astype('int')
        X['CHILDREN_1_2'] = ((X['CHILDRENS'] >= 1) & (X['CHILDRENS'] <= 2)).astype('int')
        X['CHILDREN_3+']  = (X.CHILDRENS >= 3).astype('int')

        # Turn FAMILY_SIZE into discrete/categorical indicator features
        X['FAMILY_SIZE_0']  = (X.FAMILY_SIZE == 0).astype('int')
        X['FAMILY_SIZE_1']  = (X.FAMILY_SIZE == 1).astype('int')
        X['FAMILY_SIZE_2']  = (X.FAMILY_SIZE == 2).astype('int')
        X['FAMILY_SIZE_3+']  = (X.FAMILY_SIZE >= 3).astype('int')
        
        #Generate new EDUCATION_LEVEL metrics
        X = X.merge(self.stats_education, how="left", on="EDUCATION_LEVEL")
        X["RATIO_CREDIT_to_MEAN_CREDIT_BY_EDUCATION"] = X["AMOUNT_CREDIT"] / X["GROUPBY_EDU_LEVEL_AMOUNT_CREDIT_MEAN"]
        X["RATIO_SALARY_to_MEAN_SALARY_BY_EDUCATION"] = X["TOTAL_SALARY"] / X["GROUPBY_EDU_LEVEL_TOTAL_SALARY_MEAN"]
        X["DIFF_SALARY_and_MEAN_SALARY_BY_EDUCATION"] = X["TOTAL_SALARY"] - X["GROUPBY_EDU_LEVEL_TOTAL_SALARY_MEAN"]       

        #Generate new GENDER metrics
        X = X.merge(self.stats_gender, how="left", on="GENDER")
        X["RATIO_CREDIT_to_MEAN_CREDIT_BY_GENDER"] = X["AMOUNT_CREDIT"] / X["GROUPBY_GENDER_AMOUNT_CREDIT_MEAN"]
        X["RATIO_SALARY_to_MEAN_SALARY_BY_GENDER"] = X["TOTAL_SALARY"] / X["GROUPBY_GENDER_TOTAL_SALARY_MEAN"]
        X["DIFF_SALARY_and_MEAN_SALARY_BY_GENDER"] = X["TOTAL_SALARY"] - X["GROUPBY_GENDER_TOTAL_SALARY_MEAN"]

        #Generate new REGION_POPULATION metrics
        X = X.merge(self.stats_region, how="left", on="REGION_POPULATION")
        X["RATIO_CREDIT_to_MEAN_CREDIT_BY_REGION"] = X["AMOUNT_CREDIT"] / X["GROUPBY_REGION_AMOUNT_CREDIT_MEAN"]
        X["RATIO_SALARY_to_MEAN_SALARY_BY_REGION"] = X["TOTAL_SALARY"] / X["GROUPBY_REGION_TOTAL_SALARY_MEAN"]
        X["DIFF_SALARY_and_MEAN_SALARY_BY_REGION"] = X["TOTAL_SALARY"] - X["GROUPBY_REGION_TOTAL_SALARY_MEAN"]

        #Generate new FAMILY_STATUS metrics
        X = X.merge(self.stats_fam_status, how="left", on="FAMILY_STATUS")
        X["RATIO_CREDIT_to_MEAN_CREDIT_BY_FAM_STATUS"] = X["AMOUNT_CREDIT"] / X["GROUPBY_FAM_STATUS_AMOUNT_CREDIT_MEAN"]
        X["RATIO_SALARY_to_MEAN_SALARY_BY_FAM_STATUS"] = X["TOTAL_SALARY"] / X["GROUPBY_FAM_STATUS_TOTAL_SALARY_MEAN"]
        X["DIFF_SALARY_and_MEAN_SALARY_BY_FAM_STATUS"] = X["TOTAL_SALARY"] - X["GROUPBY_FAM_STATUS_TOTAL_SALARY_MEAN"]

        # Generate financial metrics
        X['RATIO_CREDIT_to_ANNUITY'] = X['AMOUNT_CREDIT'] / X['AMOUNT_ANNUITY'] 
        X['RATIO_CREDIT_to_SALARY'] = X['AMOUNT_CREDIT'] / X['TOTAL_SALARY'] 
        X['RATIO_SALARY_TO_CREDIT'] = X['TOTAL_SALARY'] / X['AMOUNT_CREDIT'] 
        X['RATIO_ANNUITY_to_SALARY'] = X['AMOUNT_ANNUITY'] / X['TOTAL_SALARY'] 
        X['DIFF_SALARY_and_ANNUITY'] = X['TOTAL_SALARY'] - X['AMOUNT_ANNUITY'] 
        X["FLG_MORE_THAN_50PERCENT_FOR_CREDIT"] = np.where(X["RATIO_ANNUITY_to_SALARY"] > 0.5, 1, 0)
        X["FLG_MORE_THAN_30PERCENT_FOR_CREDIT"] = np.where(X["RATIO_ANNUITY_to_SALARY"] > 0.3, 1, 0)
        X["FLG_PHONE_and_EMAIL"] = np.where((X["FLAG_PHONE"]==1)&(X["FLAG_EMAIL"]==1), 1, 0)

        # Generate scoring metrics
        scoring_columns = ["EXTERNAL_SCORING_RATING_1", "EXTERNAL_SCORING_RATING_2", "EXTERNAL_SCORING_RATING_3"]
        for function_name in ["mean", "nanmedian", 'min', 'max', 'var']:
            feature_name = "EXTERNAL_SCORING_{}".format(function_name)
            # Look the NumPy function up by name instead of calling eval()
            X[feature_name] = getattr(np, function_name)(X[scoring_columns], axis=1)
        X["EXTERNAL_SCORING_PROD"] = X["EXTERNAL_SCORING_RATING_1"] * X["EXTERNAL_SCORING_RATING_2"] * X["EXTERNAL_SCORING_RATING_3"]
        X["EXTERNAL_SCORING_WEIGHTED"] = X["EXTERNAL_SCORING_RATING_1"] * 2 + X["EXTERNAL_SCORING_RATING_2"] * 1 + X["EXTERNAL_SCORING_RATING_3"] * 3
        X["EXPECTED_TOTAL_LOSS_1"] = X["EXTERNAL_SCORING_RATING_1"] * X["AMOUNT_CREDIT"]
        X["EXPECTED_TOTAL_LOSS_2"] = X["EXTERNAL_SCORING_RATING_2"] * X["AMOUNT_CREDIT"]
        X["EXPECTED_TOTAL_LOSS_3"] = X["EXTERNAL_SCORING_RATING_3"] * X["AMOUNT_CREDIT"]
        X["EXPECTED_MONTHLY_LOSS_1"] = X["EXTERNAL_SCORING_RATING_1"] * X["AMOUNT_ANNUITY"]
        X["EXPECTED_MONTHLY_LOSS_2"] = X["EXTERNAL_SCORING_RATING_2"] * X["AMOUNT_ANNUITY"]
        X["EXPECTED_MONTHLY_LOSS_3"] = X["EXTERNAL_SCORING_RATING_3"] * X["AMOUNT_ANNUITY"]

        # Ratio with Age
        X["RATIO_ANNUITY_to_AGE"] = X["AMOUNT_ANNUITY"] / X["AGE"]
        X["RATIO_CREDIT_to_AGE"] = X["AMOUNT_CREDIT"] / X["AGE"]
        X["RATIO_SALARY_to_AGE"] = X["TOTAL_SALARY"] / X["AGE"]
        X["RATIO_AGE_to_SALARY"] = X["AGE"] /X["TOTAL_SALARY"]

        # Ratio with days_on_last_job
        X["RATIO_ANNUITY_to_DAYS_ON_LAST_JOB"] = X["AMOUNT_ANNUITY"] / X["DAYS_ON_LAST_JOB"]
        X["RATIO_CREDIT_to_DAYS_ON_LAST_JOB"] = X["AMOUNT_CREDIT"] / X["DAYS_ON_LAST_JOB"]
        X["RATIO_SALARY_to_DAYS_ON_LAST_JOB"] = X["TOTAL_SALARY"] / X["DAYS_ON_LAST_JOB"]
        X["RATIO_DAYS_ON_LAST_JOB_to_SALARY"] = X["DAYS_ON_LAST_JOB"] /X["TOTAL_SALARY"]
        X["RATIO_AGE_to_DAYS_ON_LAST_JOB"] = X["AGE"] /X["DAYS_ON_LAST_JOB"]
        X["RATIO_AGE_to_OWN_CAR_AGE"] = X["AGE"] /X["OWN_CAR_AGE"]

        # Ratio with FAMILY_SIZE
        X["RATIO_SALARY_TO_PER_FAMILY_SIZE"] = X["TOTAL_SALARY"] / X["FAMILY_SIZE"]

        #BKI metrics
        bki_flags = [flag for flag in X.columns if "AMT_REQ_CREDIT_BUREAU" in flag]
        X["BKI_REQUESTS_COUNT"] = X[bki_flags].sum(axis=1)
        X["BKI_KURTOSIS"] = X[bki_flags].kurtosis(axis=1)
        
        #Categorical metrics
        # Assign the result back; in-place replace on a column accessor may not update X
        X["GENDER"] = X["GENDER"].replace('XNA', 'Missing')
        X["FAMILY_STATUS"] = X["FAMILY_STATUS"].replace('Unknown', 'Missing')
                              
        X = X.drop(["CHILDRENS", "FAMILY_SIZE"], axis=1)
 
        return X
       
    def create_applications_history_features(self, X):
        """
        Description:
        -------
            Create new features for applications_history_data dataset

        Returns
        -------
            X_transformed: pandas.core.frame.DataFrame
        """
        
        # Create new features for previously refused applications
        aggs_refused = {
            'PREV_APPLICATION_NUMBER': ['count'],
            'AMT_APPLICATION': ['mean'],
            'DAYS_DECISION': ['mean']
        }
        mask_refused = X["NAME_CONTRACT_STATUS"] == "Refused"
        stats_refused = self.create_numerical_aggs(X[mask_refused], groupby_id="APPLICATION_NUMBER", aggs=aggs_refused, prefix="PREV_REFUSED_")
        
        # Create new features for previously approved applications
        aggs_approved = {
            'PREV_APPLICATION_NUMBER': ['count'],
            'AMOUNT_CREDIT': ['sum', 'mean'],
        }
        mask_approved = X["NAME_CONTRACT_STATUS"] == "Approved"
        stats_approved = self.create_numerical_aggs(X[mask_approved], groupby_id="APPLICATION_NUMBER", aggs=aggs_approved, prefix="PREV_APPROVED_")

        # Caution: Ensure that APPLICATION_NUMBER is unique in both datasets to prevent many-to-many merges
        res = stats_refused.merge(stats_approved, how='outer', on='APPLICATION_NUMBER')
        
        return res
    
    def create_bki_features(self, X):      
        """
        Description:
        -------
            Create new features for bki dataset

        Returns
        -------
            X_transformed: pandas.core.frame.DataFrame
        """
            
        # Create new features for active applications    
        aggs_active = {
            'CREDIT_DAY_OVERDUE': ['min', 'max', 'mean'],
            'AMT_CREDIT_SUM_OVERDUE': ['mean'],
            'AMT_CREDIT_MAX_OVERDUE': ['mean']
        }
        mask_active = X['CREDIT_ACTIVE'] =='Active'
        stats_active = self.create_numerical_aggs(X[mask_active], groupby_id="APPLICATION_NUMBER", aggs=aggs_active, prefix="BKI_ACTIVE_")

        # Create new features for closed applications 
        aggs_closed = {
            'CREDIT_DAY_OVERDUE': ['mean'],
            'AMT_CREDIT_SUM_OVERDUE': ['mean'],
            'AMT_CREDIT_MAX_OVERDUE': ['mean']
        }
        mask_closed = X['CREDIT_ACTIVE'] =='Closed'
        stats_closed = self.create_numerical_aggs(X[mask_closed], groupby_id="APPLICATION_NUMBER", aggs=aggs_closed, prefix="BKI_CLOSED_")

        # Caution: Ensure that APPLICATION_NUMBER is unique in both datasets to prevent many-to-many merges
        res = stats_active.merge(stats_closed, how='outer', on='APPLICATION_NUMBER')
        
        return res
 
    def create_payments_features(self, X):  
        """
        Description:
        -------
            Create new features for payment dataset

        Returns
        -------
            X_transformed: pandas.core.frame.DataFrame
        """
            
        # Work on a copy so the stored payments table is not mutated on every call
        X = X.copy()
        X["RATIO_DAYS_PAYMENT_to_DAYS_INSTALMENT"] = X["DAYS_ENTRY_PAYMENT"] / X["DAYS_INSTALMENT"]
        X["RATIO_DAYS_INSTALMENT_to_DAYS_PAYMENT"] = X["DAYS_INSTALMENT"] / X["DAYS_ENTRY_PAYMENT"]
        X["DIFF_DAYS_PAYMENT_and_DAYS_INSTALMENT"] = X["DAYS_ENTRY_PAYMENT"] - X["DAYS_INSTALMENT"]
        X["RATIO_AMT_INSTALMENT_to_AMT_PAYMENT"] = X["AMT_INSTALMENT"] / X["AMT_PAYMENT"]
        X["RATIO_AMT_PAYMENT_to_AMT_INSTALMENT"] = X["AMT_PAYMENT"] / X["AMT_INSTALMENT"]
        X["DIFF_AMT_PAYMENT_and_AMT_INSTALMENT"] = X["AMT_PAYMENT"] - X["AMT_INSTALMENT"]
        X["RATIO_DAYS_PAYMENT_to_AMT_PAYMENT"] = X["DAYS_ENTRY_PAYMENT"] / X["AMT_PAYMENT"]
        X["RATIO_DAYS_INSTALMENT_to_AMT_INSTALMENT"] = X["DAYS_INSTALMENT"] / X["AMT_INSTALMENT"]

        aggs = {
            "AMT_PAYMENT": ["mean"],
            "AMT_INSTALMENT": ["mean"],
            "RATIO_DAYS_PAYMENT_to_DAYS_INSTALMENT": ["mean", "std"],
            "RATIO_DAYS_INSTALMENT_to_DAYS_PAYMENT": ["mean", "std"],
            "DIFF_DAYS_PAYMENT_and_DAYS_INSTALMENT": ["mean"],
            "RATIO_AMT_INSTALMENT_to_AMT_PAYMENT": ["mean", "std"],
            "DIFF_AMT_PAYMENT_and_AMT_INSTALMENT": ["mean"],
            "RATIO_DAYS_PAYMENT_to_AMT_PAYMENT": ["mean"],
            "RATIO_DAYS_INSTALMENT_to_AMT_INSTALMENT": ["mean", "std"],
            "RATIO_AMT_PAYMENT_to_AMT_INSTALMENT": ["mean", "std"]
        } 

        res = self.create_numerical_aggs(
            X, groupby_id="APPLICATION_NUMBER", aggs=aggs, prefix="PAYMENT_STAT_"
        )
        
        return res
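
To make the group-statistics step in the class above more concrete, the sketch below reproduces the `create_numerical_aggs` naming scheme and one of the ratio features on a made-up frame. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged client profile table.
toy = pd.DataFrame({
    "EDUCATION_LEVEL": ["Higher", "Higher", "Secondary"],
    "TOTAL_SALARY": [100.0, 300.0, 50.0],
})

# Aggregate per group; the result has two-level (feature, stat) column labels.
aggs = {"TOTAL_SALARY": ["mean"]}
stats = toy.groupby("EDUCATION_LEVEL").agg(aggs)

# Flatten the labels into PREFIX + FEATURE + _ + STAT, upper-cased.
stats.columns = [
    f"GROUPBY_EDU_LEVEL_{feature}_{stat}".upper() for feature, stat in stats.columns
]
stats = stats.reset_index()

# Join the group means back and compare each row with its group mean.
merged = toy.merge(stats, how="left", on="EDUCATION_LEVEL")
merged["RATIO_SALARY_to_MEAN_SALARY_BY_EDUCATION"] = (
    merged["TOTAL_SALARY"] / merged["GROUPBY_EDU_LEVEL_TOTAL_SALARY_MEAN"]
)
print(merged)
```

Because the statistics are computed in `fit` and only joined in `transform`, validation and test rows are compared against group means learned from the training split.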

In [3]:
def select_most_important_features(estimator, X_valid, y_valid, threshold=0.0001, metric='roc_auc'):
    """
    Select the most important features based on permutation importance.

    Parameters:
    - estimator: sklearn-API estimator, model
    - X_valid: pandas.core.frame.DataFrame, validation data features
    - y_valid: pandas.core.frame.Series, validation data target
    - threshold: float, optional, default=0.0001, the threshold for feature importance
    - metric: str, optional, default='roc_auc', scoring name passed to permutation_importance

    Returns:
    - best_features: list, names of the best features
    """
    importances = permutation_importance(estimator, X_valid, y_valid, scoring=metric)
    important_feature_indices = importances.importances_mean > threshold
    best_features = X_valid.columns[important_feature_indices].tolist()

    return best_features

In [19]:
def save_predictions(estimators, name, test_df, test_ids, scores):

    """
    Description
    ----------
        Predict on the test dataset and save results as a csv file

    Parameters
    ----------
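        estimators: fitted estimator exposing predict_proba

        name: str, model name used in the output file name

        test_df: test dataframe
        
        test_ids: APPLICATION_NUMBER for test dataset

        scores: score (or scores) appended to the output file name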

    Returns
    -------
        pred: pd.DataFrame
    """

    pred = estimators.predict_proba(test_df)[:, 1]
    results = pd.DataFrame({
        "APPLICATION_NUMBER": test_ids,
        "TARGET": pred
    })   
    fn = f"../data/results/results_{name}_{scores}.csv"
    results.to_csv(fn, index=False)
    print(f"Result successfully saved to {fn}")

    return results

In [5]:
def predict_and_scores_roc_auc(name, model, X_train, X_valid, X_test, y_train, y_valid, y_test, test):
    
    # Predict class probabilities on the train, validation and test splits
    pred_train = model.predict_proba(X_train)
    pred_valid = model.predict_proba(X_valid)
    pred_test = model.predict_proba(X_test)
    
    # Score roc_auc
    train_score = round(roc_auc_score(y_train, pred_train[:, 1]), 3)
    valid_score = round(roc_auc_score(y_valid, pred_valid[:, 1]), 3)
    test_score = round(roc_auc_score(y_test, pred_test[:, 1]), 3)

    #Count how many observations are marked as 1
    test_count_1= (model.predict(test)==1).sum()
    
    print(
        f'Model: {name}, ',
        f'Train-score: {train_score}, ',
        f'Valid-score: {valid_score}, ',
        f'Test-score: {test_score}, ',
        f'Target count: {test_count_1}'
    )
    
    return train_score, valid_score, test_score

In [6]:
class ModelParam():
    """
    Description:
    --------
    A class for managing hyperparameter grids and fit parameters for different machine learning models.

    Parameters:
    - model_name: str, the name of the machine learning model
    - X: pandas.core.frame.DataFrame, optional, default=None, training features
    - y: pandas.core.frame.Series, optional, default=None, training target
    - X_valid: pandas.core.frame.DataFrame, optional, default=None, validation features used in eval_set
    - y_valid: pandas.core.frame.Series, optional, default=None, validation target used in eval_set
    """
        
    def __init__(self, model_name, X=None, y=None, X_valid=None, y_valid=None):
        self.model_name = model_name
        self.X = X
        self.y = y
        self.X_valid = X_valid
        self.y_valid = y_valid
        self.hyperparam_grid = self.set_grid_hyperparam()
        self.hyperparam_fit = self.set_fit_hyperparam()
        
    def set_grid_hyperparam(self):
        """
        Set the hyperparameter grid for different machine learning models.

        Returns:
        - grid_hyperparams: dict, a dictionary containing hyperparameter grids for various models
        """
        
        grid_hyperparams = {
            'l2': {
                'logisticregression__C': [0.001],
                'logisticregression__penalty': ['l2'],
                'logisticregression__max_iter': [1000]
                
            },
            'rf': {
                'randomforestclassifier__n_estimators': [500],
                'randomforestclassifier__max_depth': [15],
                'randomforestclassifier__max_features': ['sqrt'],
                'randomforestclassifier__min_samples_leaf': [500],
                'randomforestclassifier__bootstrap': [True],
                'randomforestclassifier__criterion': ['gini']
            },
            'xgb': {
                "xgbclassifier__booster": ["gbtree"],
                "xgbclassifier__objective": ["binary:logistic"],
                "xgbclassifier__eval_metric": ["auc"],
                "xgbclassifier__learning_rate": [0.1],
                "xgbclassifier__n_estimators": [10000],
                "xgbclassifier__reg_lambda": [0.1],
                "xgbclassifier__max_depth": [3],
                "xgbclassifier__verbosity": [0],
                "xgbclassifier__early_stopping_rounds": [100],
                "xgbclassifier__colsample_bytree": [0.3] 
            },
            'lgb': {
                "lgbmclassifier__boosting_type": ["gbdt"],
                "lgbmclassifier__objective": ["binary"],
                "lgbmclassifier__learning_rate": [0.01],
                "lgbmclassifier__n_estimators": [1139],
                "lgbmclassifier__verbosity": [-1],
                "lgbmclassifier__max_depth": [8],
                "lgbmclassifier__silent": [True],
                "lgbmclassifier__num_leaves": [10000],
                "lgbmclassifier__bagging_fraction": [0.3],
                "lgbmclassifier__min_data_in_leaf": [1000]
            },
            'cb': {
                "catboostclassifier__n_estimators": [4648],
                "catboostclassifier__loss_function": ["Logloss"],
                "catboostclassifier__eval_metric": ["AUC"],
                "catboostclassifier__learning_rate": [0.06],
                "catboostclassifier__max_bin": [333],
                "catboostclassifier__verbose": [0],
                "catboostclassifier__max_depth": [3],
                "catboostclassifier__l2_leaf_reg": [71.4],
                "catboostclassifier__early_stopping_rounds": [100]
            }
        }
                       
        return grid_hyperparams.get(self.model_name, "none")

    def set_fit_hyperparam(self):
        """
        Set the fit parameters and data splits for different machine learning models.

        Returns:
        - fit_hyperparams: dict, a dictionary containing fit parameters for various models
        """

        X_train, y_train, X_valid, y_valid = self.X, self.y, self.X_valid, self.y_valid
        stopper = early_stopping(stopping_rounds=100, first_metric_only=False)
        
        fit_hyperparams = {
            'xgb': {
                'xgbclassifier__eval_set': [(X_train, y_train), (X_valid, y_valid)], 
                'xgbclassifier__verbose': 0
            },
            'lgb': {
                'lgbmclassifier__eval_set': [(X_train, y_train), (X_valid, y_valid)], 
                "lgbmclassifier__eval_metric": "auc",
                'lgbmclassifier__callbacks': [stopper]
            },
            'cb': {
                'catboostclassifier__eval_set': [(X_train, y_train), (X_valid, y_valid)]
            }
        }

        return fit_hyperparams.get(self.model_name, {})   
    
    def get_grid_hyperparam(self):
        """
        Get the hyperparameter grid for the selected machine learning model.

        Returns:
        - hyperparam_grid: dict, a dictionary containing hyperparameter grid for the model
        """
        return self.hyperparam_grid
    
    def get_fit_hyperparam(self):
        """
        Get the fit parameters for the selected machine learning model.

        Returns:
        - hyperparam_fit: dict, a dictionary containing fit parameters for the model
        """
        return self.hyperparam_fit


## Load and Preprocess Data

In [7]:
# Load data
applications_history_data = pd.read_csv('../data/applications_history.csv')
bki_data = pd.read_csv('../data/bki.csv')
client_profile_data = pd.read_csv('../data/client_profile.csv')
payments_data = pd.read_csv('../data/payments.csv')
test_data = pd.read_csv('../data/test.csv')
train_data = pd.read_csv('../data/train.csv')

In [8]:
train, test = train_data.drop(columns=["TARGET"]), test_data.copy()
target = train_data["TARGET"]

In [9]:
# Split train dataset to train, valid and test to do the holdout validation
X_train, X_valid, y_train, y_valid = train_test_split(
    train, target, test_size=0.1, random_state=1234, stratify=target
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_valid, y_valid, test_size=0.2, random_state=1234, stratify=y_valid
)
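
The two chained `train_test_split` calls above first hold out 10% of the rows and then split that holdout 80/20. The arithmetic below is only a quick check of the resulting shares; the variable names are illustrative.

```python
holdout_share = 0.1          # test_size of the first split
test_share_of_holdout = 0.2  # test_size of the second split

train_share = 1 - holdout_share
valid_share = holdout_share * (1 - test_share_of_holdout)
test_share = holdout_share * test_share_of_holdout

print(f"train={train_share:.2f}, valid={valid_share:.2f}, test={test_share:.2f}")
# prints: train=0.90, valid=0.08, test=0.02
```

With only about 2% of the labeled rows in the test split, small differences between test scores should be read with some caution.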

## Model Selection

#### NOTE: FeaturesTransformer is not part of the Preprocessing Pipeline as we need to retrieve column names for permutation_importance

In [10]:
# Fit the FeaturesTransformer Once and Get Column Names:
features_transformer = FeaturesTransformer(client_profile_data, applications_history_data, bki_data, payments_data)
X_train_transformed = features_transformer.fit_transform(X_train)
all_features = X_train_transformed.columns.tolist()

# Divide the columns into numerical and categorical
categorical_columns = [col for col in all_features if X_train_transformed[col].dtype == 'O']
numerical_columns = [col for col in all_features if col not in categorical_columns]

# Create Your Preprocessing Pipeline with ColumnTransformer
preprocessing = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy='constant', fill_value='Missing'), OneHotEncoder(sparse_output=False)), 
         categorical_columns),
        (SimpleImputer(strategy='median'), numerical_columns),
        remainder='passthrough'
    )
)

X_train = preprocessing.fit_transform(X_train_transformed)
X_valid = preprocessing.transform(features_transformer.transform(X_valid))
X_test = preprocessing.transform(features_transformer.transform(X_test))
test_no_target = preprocessing.transform(features_transformer.transform(test))

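# Retrieve the fitted one-hot encoder and the names of the columns it produces
column_transformer = preprocessing.named_steps['columntransformer']
one_hot_encoder = column_transformer.named_transformers_['pipeline'].named_steps['onehotencoder']
ohe_categories = one_hot_encoder.get_feature_names_out(categorical_columns)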

# Combine the names of the one-hot-encoded columns with the names of the numerical columns
all_columns = list(ohe_categories) + numerical_columns

In [11]:
# Dictionary of model pipelines, keyed by short model name
pipelines = {
    'l2': make_pipeline(StandardScaler(), LogisticRegression()),
    'rf': make_pipeline(RandomForestClassifier()),
    'xgb': make_pipeline(XGBClassifier()),
    'lgb': make_pipeline(LGBMClassifier()),
    'cb': make_pipeline(CatBoostClassifier())
}

In [12]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipelines.items():
    model_param = ModelParam(name, X_train, y_train, X_valid, y_valid) 
    model = RandomizedSearchCV(pipeline, model_param.get_grid_hyperparam(), cv=5, n_jobs=-1, verbose=0)
    
    # Fit on X_train, y_train, passing extra fit-time arguments
    # (e.g. early stopping) when the model defines them
    if model_param.get_fit_hyperparam() == 'none':
        model.fit(X_train, y_train)
    else:
        model.fit(X_train, y_train, **model_param.get_fit_hyperparam())

    # Store the fitted search object in fitted_models[name]
    fitted_models[name] = model

    print(name, 'has been fitted.')

l2 has been fitted.
rf has been fitted.


xgb has been fitted.
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[946]	training's auc: 0.796933	training's binary_logloss: 0.23573	valid_1's auc: 0.729111	valid_1's binary_logloss: 0.255065
lgb has been fitted.
cb has been fitted.
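
The `ModelParam` helper is not shown in this notebook, so the exact search spaces are unknown. The sketch below illustrates how a search space for a pipeline might be written, under two assumptions: the key `logisticregression__C` and its log-uniform range are hypothetical, and `scoring="roc_auc"` is added here because `RandomizedSearchCV` otherwise ranks classifier candidates by accuracy, while the scores reported in this document are ROC AUC.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the real data
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical search space: keys follow "<step name>__<parameter>"
param_distributions = {"logisticregression__C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Step names created by `make_pipeline` are the lowercased class names, which is why the parameter key starts with `logisticregression__`.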


In [24]:
for name, model in fitted_models.items():
    _, _, test_score = predict_and_scores_roc_auc(name, model, X_train, X_valid, X_test, y_train, y_valid, y_test, test_no_target)
    test_ids_ = features_transformer.test_ids
    save_predictions(model, name, test_no_target, test_ids_, test_score)

Model: l2,  Train-score: 0.715,  Valid-score: 0.71,  Test-score: 0.731,  Target count: 43
Result succesfullty saved to ../data/results/results_l2_0.731.csv
Model: rf,  Train-score: 0.734,  Valid-score: 0.699,  Test-score: 0.73,  Target count: 0
Result succesfullty saved to ../data/results/results_rf_0.73.csv
Model: xgb,  Train-score: 0.768,  Valid-score: 0.727,  Test-score: 0.736,  Target count: 379
Result succesfullty saved to ../data/results/results_xgb_0.736.csv
Model: lgb,  Train-score: 0.797,  Valid-score: 0.729,  Test-score: 0.757,  Target count: 266
Result succesfullty saved to ../data/results/results_lgb_0.757.csv
Model: cb,  Train-score: 0.751,  Valid-score: 0.724,  Test-score: 0.743,  Target count: 426
Result succesfullty saved to ../data/results/results_cb_0.743.csv
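
The helper `predict_and_scores_roc_auc` is defined elsewhere and not shown here. A minimal sketch of how per-split ROC AUC scores like those above could be computed follows; the function name `roc_auc_by_split` and the synthetic data are assumptions, not the author's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def roc_auc_by_split(model, splits):
    """Return {split name: ROC AUC}, scoring the positive-class probability."""
    return {
        name: round(float(roc_auc_score(y, model.predict_proba(X)[:, 1])), 3)
        for name, (X, y) in splits.items()
    }


# Synthetic data split into train and validation parts
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = roc_auc_by_split(model, {"train": (X_tr, y_tr), "valid": (X_va, y_va)})
print(scores)
```

ROC AUC is computed from predicted probabilities rather than hard labels, so it stays informative even when a model predicts very few positives at the default threshold (compare the `Target count: 0` reported for `rf`).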


### Best Models with most important features

In [15]:
# Preprocess Categorical and Numerical features
preprocessing_ = make_pipeline(
    FeaturesTransformer(client_profile_data, applications_history_data, bki_data, payments_data),
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy='constant', fill_value='Missing'), OneHotEncoder(sparse_output=False)), 
         make_column_selector(dtype_exclude='number')),
        (SimpleImputer(strategy='median'), make_column_selector(dtype_include='number')),
        remainder='passthrough'
    )
)

# Dictionary of pipelines for the best-scoring models
b_pipelines = {
    'xgb': make_pipeline(XGBClassifier()),
    'lgb': make_pipeline(LGBMClassifier()),
    'cb': make_pipeline(CatBoostClassifier())
}

In [16]:
# Select the most important features for each of the best-scoring models
best_model_features = {}
for model in b_pipelines.keys():
    best_model_features[model] = select_most_important_features(fitted_models[model], 
                                                        pd.DataFrame(X_valid, columns=all_columns), 
                                                        y_valid)
    print(model, ' most important features have been selected')

xgb  most important features have been selected
lgb  most important features have been selected
cb  most important features have been selected
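
The helper `select_most_important_features` is not shown in this notebook. The sketch below is one plausible way to select features with `sklearn.inspection.permutation_importance`; the function name, the `threshold` parameter, the choice of ROC AUC as the scoring metric, and the synthetic data are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def select_by_permutation_importance(model, X_valid, y_valid, threshold=0.0):
    """Return the columns whose mean permutation importance exceeds threshold."""
    result = permutation_importance(
        model, X_valid, y_valid, scoring="roc_auc", n_repeats=5, random_state=0
    )
    importances = pd.Series(result.importances_mean, index=X_valid.columns)
    kept = importances[importances > threshold].sort_values(ascending=False)
    return kept.index.tolist()


# Synthetic data: with shuffle=False the first three columns are informative
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
X_tr, X_va, y_tr, y_va = X.iloc[:800], X.iloc[800:], y[:800], y[800:]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
selected = select_by_permutation_importance(model, X_va, y_va)
print(selected)
```

Permutation importance measures how much the validation score drops when a single column is shuffled, so it needs column names that match the order of the transformed array.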


In [22]:
best_fitted_models = {}
for b_model, b_pipeline in b_pipelines.items():

    # Split data
    X_train_, X_valid_, y_train_, y_valid_ = train_test_split(
        train, target, test_size=0.1, random_state=1234, stratify=target)
    X_valid_, X_test_, y_valid_, y_test_ = train_test_split(
        X_valid_, y_valid_, test_size=0.2, random_state=1234, stratify=y_valid_)

    # Transform data
    X_train_transformed_ = preprocessing_.fit_transform(X_train_)
    X_valid_transformed_ = preprocessing_.transform(X_valid_)
    X_test_transformed_ = preprocessing_.transform(X_test_)
    test_no_target_transformed_ = preprocessing_.transform(test)

    # Wrap the transformed arrays in DataFrames with named columns
    X_train_transformed_ = pd.DataFrame(X_train_transformed_, columns=all_columns)
    X_valid_transformed_ = pd.DataFrame(X_valid_transformed_, columns=all_columns)
    X_test_transformed_ = pd.DataFrame(X_test_transformed_, columns=all_columns)
    test_no_target_transformed_ = pd.DataFrame(test_no_target_transformed_, columns=all_columns)


    # Keep only the most important columns for this model
    transformed_columns = best_model_features[b_model]
    X_train_ = X_train_transformed_[transformed_columns]
    X_valid_ = X_valid_transformed_[transformed_columns]
    X_test_ = X_test_transformed_[transformed_columns]
    test_no_target_ = test_no_target_transformed_[transformed_columns]

    # Fit model
    model_param_ = ModelParam(b_model, X_train_, y_train_, X_valid_, y_valid_)
    model_ = RandomizedSearchCV(b_pipeline, model_param_.get_grid_hyperparam(), cv=5, n_jobs=-1)
    model_.fit(X_train_, y_train_, **model_param_.get_fit_hyperparam())

    # Store the fitted search object in best_fitted_models[b_model]
    best_fitted_models[b_model] = model_

    print(b_model, 'has been fitted.')

    train_score, valid_score, test_score = predict_and_scores_roc_auc(b_model, model_, X_train_, X_valid_, X_test_, y_train_, y_valid_, y_test_, test_no_target_)
    test_ids_ = preprocessing_.named_steps['featurestransformer'].test_ids
    save_predictions(model_, b_model, test_no_target_, test_ids_, test_score)


xgb has been fitted.
Model: xgb,  Train-score: 0.785,  Valid-score: 0.735,  Test-score: 0.736,  Target count: 425
Result succesfullty saved to ../data/results/results_xgb_0.736.csv
Training until validation scores don't improve for 100 rounds
Did not meet early stopping. Best iteration is:
[1114]	training's auc: 0.800794	training's binary_logloss: 0.234106	valid_1's auc: 0.733446	valid_1's binary_logloss: 0.254403
lgb has been fitted.
Model: lgb,  Train-score: 0.801,  Valid-score: 0.733,  Test-score: 0.753,  Target count: 305
Result succesfullty saved to ../data/results/results_lgb_0.753.csv
cb has been fitted.
Model: cb,  Train-score: 0.754,  Valid-score: 0.735,  Test-score: 0.741,  Target count: 477
Result succesfullty saved to ../data/results/results_cb_0.741.csv
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[647]	valid_0's auc: 0.770056	valid_0's binary_logloss: 0.242843	valid_1's auc: 0.723304	valid_1's binary_logloss: 0.255955

In [14]:
# Save the preprocessing pipeline to be used in the script
joblib.dump(preprocessing, '../models/preprocessing_pipeline.pkl')

# Save the best-performing model to be used in the script
with open('../models/model.pkl', 'wb') as f:
    pickle.dump(fitted_models['lgb'].best_estimator_, f)
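
Before relying on saved artifacts in a script, it can help to confirm that a reloaded object reproduces the original predictions. The following sketch does a round trip with `joblib` in a temporary directory; the toy pipeline, the synthetic data, and the file name `pipeline.pkl` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy fitted pipeline standing in for the real preprocessing and model
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipeline.pkl"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object should give identical probabilities
same_predictions = np.array_equal(
    pipeline.predict_proba(X), restored.predict_proba(X)
)
print(same_predictions)
```

Note that the `preprocessing` object saved above does not include `FeaturesTransformer`, so a consuming script would presumably need to apply that transformer separately before calling the loaded pipeline.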