# Home Credit - Credit Risk Model Classification Using LightGBM and CatBoost

This is one of the notebooks I created for Kaggle competitions: [Home Credit - Credit Risk Model Stability](https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability). 

The goal of this competition is to predict which clients are more likely to default on their loans using internal or external data sources that Home Credit has of clients, such as previous applications, bureau information and tax information. All the data are transformed and tokenized to protect the privacy of clients. The evaluation will favor solutions that are stable over time.

I combined other public notebooks with features I constructed myself. With this and metric hacking (allowed by the competition host; the metric hacking part is not in this notebook), I reached rank 25 on the public leaderboard.
Unfortunately, my feature engineering seemed to overfit the public leaderboard data, and my score dropped substantially on the private leaderboard. Still, this was a very valuable experience: it was my first Kaggle competition, and I learned a lot from the people who contributed to it.
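
To make "stable over time" concrete, here is a minimal sketch of the stability metric as I understand it from the competition description: a Gini score (2 * AUC - 1) is computed for each `WEEK_NUM`, a straight line is fitted through the weekly scores, and the mean Gini is penalized by a falling slope and by the spread of the residuals. The function names, the pairwise AUC helper, and the toy data are illustrative assumptions and are not part of this notebook. The weights 88.0 and 0.5 are quoted from the competition description.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. This is O(n_pos * n_neg) in memory, so it is only
    meant for small illustrative inputs.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def gini_stability(week_num, y_true, y_score, w_slope=88.0, w_std=0.5):
    """Mean weekly Gini, penalized for a falling trend and for volatility."""
    week_num = np.asarray(week_num)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    ginis = np.array([
        2 * pairwise_auc(y_true[week_num == w], y_score[week_num == w]) - 1
        for w in np.unique(week_num)
    ])

    x = np.arange(len(ginis))
    slope, intercept = np.polyfit(x, ginis, 1)   # linear trend over weeks
    residuals = ginis - (slope * x + intercept)

    return float(ginis.mean() + w_slope * min(0.0, slope) - w_std * residuals.std())


# Toy data: two clients per week, three weeks.
weeks_toy = [0, 0, 1, 1, 2, 2]
labels_toy = [0, 1, 0, 1, 0, 1]

stable_score = gini_stability(weeks_toy, labels_toy, [0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
decaying_score = gini_stability(weeks_toy, labels_toy, [0.1, 0.9, 0.5, 0.5, 0.9, 0.1])
```

In the toy data the first model separates the classes perfectly every week, while the second goes from perfect to inverted ranking, so the slope penalty dominates its score.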

In [1]:
import gc
import lightgbm as lgb
import numpy as np
import pandas as pd
import polars as pl
import warnings

from catboost import CatBoostClassifier, Pool
from glob import glob
from IPython.display import display
from pathlib import Path
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedGroupKFold
from typing import Any

warnings.filterwarnings("ignore")

# ROOT = Path("/kaggle/input/home-credit-credit-risk-model-stability")
ROOT = Path("C:/Users/zyc71/Data Science Projects/Home Credit - Credit Risk Model Stability/home-credit-credit-risk-model-stability")
TRAIN_DIR = ROOT / "parquet_files" / "train"
TEST_DIR = ROOT / "parquet_files" / "test"

# PERSONAL_PATH = '/kaggle/input/home-credit-feature-information/'
PERSONAL_PATH = 'C:/Users/zyc71/Data Science Projects/Home Credit - Credit Risk Model Stability/'

VERSION = "_t60"

## Aggregation Function

In [2]:
# List of columns for sum aggregation
'''
I reviewed the definition of each column and tried different feature constructions.

Summing the columns listed below proved most useful. For example, debtoverdue_47A is the amount currently past due on a client's
existing credit contract recorded in the credit bureau. I sum the past-due amounts across all contracts for each customer.

Combining this feature construction with metric hacking (which is allowed by the competition host), I reached 25th place on the public leaderboard.
However, this method seems to overfit the public data: on the private data, the score drops significantly.
'''


sum_cols = [
    # applprev_1
    'applprev_1_net_dpd_D', 'annuity_853A', 'credacc_actualbalance_314A', 'credacc_credlmt_575A', 'credamount_590A', 
    'currdebt_94A', 'downpmt_134A', 'mainoccupationinc_437A', 'outstandingdebt_522A', 'revolvingaccount_394A', 
    # tax_registry_a_1
    'amount_4527230A',
    # tax_registry_b_1
    'amount_4917619A',
    # tax_registry_c_1
    'pmtamount_36A',
    # credit_bureau_a_1
    'credlmt_230A', 'credlmt_935A', 'debtoutstand_525A', 'debtoverdue_47A', 'instlamount_768A', 'instlamount_852A', 
    'monthlyinstlamount_332A', 'monthlyinstlamount_674A', 'outstandingamount_354A', 'outstandingamount_362A', 'overdueamount_31A', 
    'overdueamount_659A', 'residualamount_488A', 'residualamount_856A', 'totalamount_6A', 'totalamount_996A', 'totaldebtoverduevalue_178A',
    'totaldebtoverduevalue_718A', 'totaloutstanddebtvalue_39A', 'totaloutstanddebtvalue_668A', 
    'contractsum_5085717L', 'numberofcontrsvalue_258L', 'numberofcontrsvalue_358L', 'numberofinstls_229L', 'numberofinstls_320L', 
    'numberofoutstandinstls_520L', 'numberofoutstandinstls_59L', 'numberofoverdueinstlmax_1039L', 'numberofoverdueinstlmax_1151L', 
    'numberofoverdueinstls_725L', 'numberofoverdueinstls_834L', 'prolongationcount_1120L', 'prolongationcount_599L', 
    # credit_bureau_b_1
    'credlmt_1052A', 'credlmt_228A', 'credlmt_3940954A', 'debtpastduevalue_732A', 'debtvalue_227A', 'installmentamount_644A', 
    'installmentamount_833A', 'instlamount_892A', 'maxdebtpduevalodued_3940955A', 'overdueamountmax_950A', 'residualamount_1093A', 
    'residualamount_127A', 'residualamount_3940956A', 'totalamount_503A', 'totalamount_881A', 
    'credquantity_1099L', 'credquantity_984L', 'numberofinstls_810L', 'pmtnumpending_403L', 
    # debitcard_1
    'last180dayaveragebalance_704A', 'last180dayturnover_1134A', 'last30dayturnover_651A', 
]
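
As a small illustration of the sum features described above, the sketch below totals `debtoverdue_47A` across all bureau contracts of each client. It uses pandas instead of the notebook's Polars expressions to keep the example short, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy credit bureau table: one row per contract, several contracts per client.
bureau_toy = pd.DataFrame(
    {
        "case_id": [1, 1, 2, 2, 2],
        "debtoverdue_47A": [0.0, 150.0, 20.0, None, 30.0],
    }
)

# One row per client: total amount currently past due over all contracts.
# Missing amounts are skipped by pandas' default sum.
summed_toy = (
    bureau_toy.groupby("case_id")["debtoverdue_47A"]
    .sum()
    .rename("sum_debtoverdue_47A")
    .reset_index()
)
```

The `sum_` prefix mirrors the naming used for the aggregated features in this notebook.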

In [3]:
# Each feature name ends with a character that indicates the type of that feature.
# Define a separate aggregation function for each feature type, plus a sum aggregation for the selected list of features.
class Aggregator:
    
    @staticmethod
    def max_expr(df):
        """
        Generates expressions for calculating maximum values for specific columns.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for maximum values.
        """
        cols = [
            col
            for col in df.columns
            if (col[-1] in ("P", "M", "A", "D", "T", "L")) or ("num_group" in col)
        ]

        expr_max = [
            pl.col(col).max().alias(f"max_{col}") for col in cols
        ]

        return expr_max

    
    @staticmethod
    def min_expr(df):
        """
        Generates expressions for calculating minimum values for specific columns.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for minimum values.
        """
        cols = [
            col
            for col in df.columns
            if (col[-1] in ("P", "M", "A", "D", "T", "L")) or ("num_group" in col)
        ]

        expr_min = [
            pl.col(col).min().alias(f"min_{col}") for col in cols
        ]

        return expr_min

    
    @staticmethod
    def mean_expr(df):
        """
        Generates expressions for calculating mean values for specific columns.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for mean values.
        """
        cols = [col for col in df.columns if col.endswith(("P", "A", "D"))]

        expr_mean = [
            pl.col(col).mean().alias(f"mean_{col}") for col in cols
        ]

        return expr_mean

    
    @staticmethod
    def var_expr(df):
        """
        Generates expressions for calculating variance for specific columns.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for variance.
        """
        cols = [col for col in df.columns if col.endswith(("P", "A", "D"))]

        expr_var = [
            pl.col(col).var().alias(f"var_{col}") for col in cols
        ]

        return expr_var

    @staticmethod
    def mode_expr(df):
        """
        Generates expressions for calculating mode values for specific columns.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for mode values.
        """
        cols = [col for col in df.columns if col.endswith("M")]

        expr_mode = [
            pl.col(col).drop_nulls().mode().first().alias(f"mode_{col}") for col in cols
        ]

        return expr_mode

    
    @staticmethod
    def sum_expr(df):
        """
        Generates expressions for calculating sums of the columns listed in sum_cols.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of expressions for sum values.
        """
        cols = [
            col
            for col in df.columns
            if col in sum_cols
        ]

        expr_sum = [
            pl.col(col).sum().alias(f"sum_{col}") for col in cols
        ]

        return expr_sum

    
    @staticmethod
    def get_exprs(df):
        """
        Combines expressions for maximum, mean, variance, and sum calculations.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - list[pl.Expr]: List of combined expressions.
        """
        exprs = (
            Aggregator.max_expr(df) + Aggregator.mean_expr(df) + Aggregator.var_expr(df) + Aggregator.sum_expr(df)
        )

        return exprs
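
The column selection inside `Aggregator` depends only on the last character of each column name. The sketch below reproduces that selection in plain Python so it can be checked without loading any data. The helper name `columns_by_suffix` is hypothetical, and the column names are a handful taken from this dataset.

```python
def columns_by_suffix(columns, suffixes, include_num_group=False):
    """Return the columns whose last character is one of `suffixes`.

    With include_num_group=True, columns containing "num_group" are kept
    as well, as the max/min aggregations above do.
    """
    return [
        col
        for col in columns
        if col.endswith(tuple(suffixes)) or (include_num_group and "num_group" in col)
    ]


demo_columns = [
    "case_id",
    "num_group1",
    "debtoverdue_47A",
    "pmts_dpd_1073P",
    "education_1103M",
    "openingdate_313D",
]

max_cols = columns_by_suffix(demo_columns, "PMADTL", include_num_group=True)
mean_cols = columns_by_suffix(demo_columns, "PAD")   # also used for variance
mode_cols = columns_by_suffix(demo_columns, "M")
```

Note that `get_exprs` combines only the max, mean, variance, and sum expressions, so the min and mode expressions are defined but not used in the aggregation.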

## Utility Function

In [4]:
class Utility:

    @staticmethod
    # Kaggle has limited memory, use this function to optimize memory usage
    def reduce_memory_usage(df, name):
        """
        Reduces memory usage of a DataFrame by converting column types.

        Args:
        - df (pl.DataFrame): DataFrame to optimize.
        - name (str): Name of the DataFrame.

        Returns:
        - pl.DataFrame: Optimized DataFrame.
        """
        print(
            f"Memory usage of dataframe \"{name}\" is {round(df.estimated_size('mb'), 4)} MB."
        )

        int_types = [
            pl.Int8,
            pl.Int16,
            pl.Int32,
            pl.Int64,
            pl.UInt8,
            pl.UInt16,
            pl.UInt32,
            pl.UInt64,
        ]
        float_types = [pl.Float32, pl.Float64]

        for col in df.columns:
            col_type = df[col].dtype
            if col_type in int_types + float_types:
                c_min = df[col].min()
                c_max = df[col].max()

                if c_min is not None and c_max is not None:
                    if col_type in int_types:
                        if c_min >= 0:
                            if (
                                c_min >= np.iinfo(np.uint8).min
                                and c_max <= np.iinfo(np.uint8).max
                            ):
                                df = df.with_columns(df[col].cast(pl.UInt8))
                            elif (
                                c_min >= np.iinfo(np.uint16).min
                                and c_max <= np.iinfo(np.uint16).max
                            ):
                                df = df.with_columns(df[col].cast(pl.UInt16))
                            elif (
                                c_min >= np.iinfo(np.uint32).min
                                and c_max <= np.iinfo(np.uint32).max
                            ):
                                df = df.with_columns(df[col].cast(pl.UInt32))
                            elif (
                                c_min >= np.iinfo(np.uint64).min
                                and c_max <= np.iinfo(np.uint64).max
                            ):
                                df = df.with_columns(df[col].cast(pl.UInt64))
                        else:
                            if (
                                c_min >= np.iinfo(np.int8).min
                                and c_max <= np.iinfo(np.int8).max
                            ):
                                df = df.with_columns(df[col].cast(pl.Int8))
                            elif (
                                c_min >= np.iinfo(np.int16).min
                                and c_max <= np.iinfo(np.int16).max
                            ):
                                df = df.with_columns(df[col].cast(pl.Int16))
                            elif (
                                c_min >= np.iinfo(np.int32).min
                                and c_max <= np.iinfo(np.int32).max
                            ):
                                df = df.with_columns(df[col].cast(pl.Int32))
                            elif (
                                c_min >= np.iinfo(np.int64).min
                                and c_max <= np.iinfo(np.int64).max
                            ):
                                df = df.with_columns(df[col].cast(pl.Int64))
                    elif col_type in float_types:
                        if (
                            c_min > np.finfo(np.float32).min
                            and c_max < np.finfo(np.float32).max
                        ):
                            df = df.with_columns(df[col].cast(pl.Float32))

        print(
            f"Memory usage of dataframe \"{name}\" became {round(df.estimated_size('mb'), 4)} MB."
        )

        return df

    
    @staticmethod
    def to_pandas(df, cat_cols = None):
        """
        Converts a Polars DataFrame to a Pandas DataFrame.

        Args:
        - df (pl.DataFrame): Polars DataFrame to convert.
        - cat_cols (list[str]): List of categorical columns. Default is None.

        Returns:
        - (pd.DataFrame, list[str]): Tuple containing the converted Pandas DataFrame and categorical columns.
        """
        df = df.to_pandas()

        if cat_cols is None:
            cat_cols = list(df.select_dtypes("object").columns)

        df[cat_cols] = df[cat_cols].astype("str")

        return df, cat_cols
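
The integer branch of `reduce_memory_usage` boils down to "pick the narrowest type whose range covers the column's minimum and maximum". Here is a minimal sketch of that decision using NumPy's `iinfo`. The helper name `smallest_int_dtype` is an assumption for illustration and does not exist in the notebook.

```python
import numpy as np


def smallest_int_dtype(c_min, c_max):
    """Name of the narrowest integer dtype that can hold [c_min, c_max].

    Non-negative ranges use unsigned types, otherwise signed types,
    mirroring the branch structure of reduce_memory_usage.
    """
    if c_min >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)

    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return np.dtype(dtype).name
    return None  # range does not fit any candidate


week_dtype = smallest_int_dtype(0, 91)          # small non-negative range
days_dtype = smallest_int_dtype(-20471, 14)     # signed day differences

# The float branch only checks the value range. Float32 keeps about 7
# significant digits, so large values can be rounded by the downcast:
rounded = float(np.float32(16777217.0))
```

The last line shows the trade-off of the Float32 downcast: 16777217 is stored as 16777216, which is usually acceptable for model features but worth keeping in mind for identifiers or exact amounts.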

## Data Loading and Processing

In [5]:
class SchemaGen:
    
    @staticmethod
    # Cast data types
    def change_dtypes(df):
        """
        Changes the data types of columns in the DataFrame.

        Args:
        - df (pl.LazyFrame): Input LazyFrame.

        Returns:
        - pl.LazyFrame: LazyFrame with modified data types.
        """
        for col in df.columns:
            if col == "case_id":
                df = df.with_columns(pl.col(col).cast(pl.UInt32).alias(col))
            elif col in ["WEEK_NUM", "num_group1", "num_group2"]:
                df = df.with_columns(pl.col(col).cast(pl.UInt16).alias(col))
            elif col == "date_decision" or col[-1] == "D":
                df = df.with_columns(pl.col(col).cast(pl.Date).alias(col))
            elif col[-1] in ["P", "A"]:
                df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))
            elif col[-1] in ("M",):
                df = df.with_columns(pl.col(col).cast(pl.String))
        return df

    
    @staticmethod
    # Scan each data file, complete aggregation and concatenate all the results (one data source may have more than one data file)
    def scan_files(glob_path, depth = None):
        """
        Scans Parquet files matching the glob pattern and combines them into a LazyFrame.

        Args:
        - glob_path (str): Glob pattern to match Parquet files.
        - depth (int, optional): Depth level for data aggregation. Defaults to None.

        Returns:
        - pl.LazyFrame: Combined LazyFrame.
        """
        chunks: list[pl.LazyFrame] = []
        for path in glob(str(glob_path)):
            df: pl.LazyFrame = pl.scan_parquet(
                path, low_memory = True, rechunk = True
            ).pipe(SchemaGen.change_dtypes)
            print(f"File {Path(path).stem} loaded into memory.")

            if depth in (1, 2):
                exprs = Aggregator.get_exprs(df)
                df = df.group_by("case_id").agg(exprs)

                del exprs
                gc.collect()

            chunks.append(df)

        df = pl.concat(chunks, how = "vertical_relaxed")

        del chunks
        gc.collect()

        df = df.unique(subset=["case_id"])

        return df

    
    @staticmethod
    # Join all data into one data frame
    def join_dataframes(
        df_base,
        depth_0,
        depth_1,
        depth_2,
    ):
        """
        Joins multiple LazyFrames with a base LazyFrame.

        Args:
        - df_base (pl.LazyFrame): Base LazyFrame.
        - depth_0 (list[pl.LazyFrame]): List of LazyFrames for depth 0.
        - depth_1 (list[pl.LazyFrame]): List of LazyFrames for depth 1.
        - depth_2 (list[pl.LazyFrame]): List of LazyFrames for depth 2.

        Returns:
        - pl.DataFrame: Joined DataFrame.
        """
        for i, df in enumerate(depth_0 + depth_1 + depth_2):
            df_base = df_base.join(df, how = "left", on = "case_id", suffix = f"_{i}")

        return df_base.collect().pipe(Utility.reduce_memory_usage, "df_train")
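
The overall shape of `scan_files` plus `join_dataframes` is: collapse each depth-1 or depth-2 table to one row per `case_id`, then left-join it onto the base table. The sketch below shows that pattern on toy data, using pandas rather than Polars for brevity. The frames and values are made up.

```python
import pandas as pd

# Base table: one row per application.
base_toy = pd.DataFrame({"case_id": [1, 2, 3], "target": [0, 1, 0]})

# Depth-1 table: several rows per application, indexed by num_group1.
depth_1_toy = pd.DataFrame(
    {
        "case_id": [1, 1, 2],
        "num_group1": [0, 1, 0],
        "credlmt_230A": [1000.0, 500.0, 200.0],
    }
)

# Collapse to one row per case_id before joining.
agg_toy = (
    depth_1_toy.groupby("case_id")
    .agg(
        max_credlmt_230A=("credlmt_230A", "max"),
        mean_credlmt_230A=("credlmt_230A", "mean"),
    )
    .reset_index()
)

# A left join keeps every base row; clients without records get missing values.
joined_toy = base_toy.merge(agg_toy, how="left", on="case_id")
```

Because the join is a left join, clients with no rows in a given source keep their base row and simply get nulls in the aggregated columns, which is why the joined training frame contains so many missing values.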

In [6]:
# Filter out some features with too many missing values, too many unique values or only one possible value
def filter_cols(df):
    """
    Filters columns in the DataFrame based on null percentage and unique values for string columns.

    Args:
    - df (pl.DataFrame): Input DataFrame.

    Returns:
    - pl.DataFrame: DataFrame with filtered columns.
    """
    for col in df.columns:
        if col not in ["case_id", "year", "month", "week_num", "target"]:
            null_pct = df[col].is_null().mean()

            if null_pct > 0.95:
                df = df.drop(col)

    for col in df.columns:
        if (col not in ["case_id", "year", "month", "week_num", "target"]) & (
            df[col].dtype == pl.String
        ):
            freq = df[col].n_unique()

            if (freq > 200) | (freq == 1):
                df = df.drop(col)

    return df


# The riskassesment_302T feature has values like "8% - 11%"; transform it into a range width and a midpoint
def transform_cols(df):
    """
    Transforms columns in the DataFrame according to predefined rules.

    Args:
    - df (pl.DataFrame): Input DataFrame.

    Returns:
    - pl.DataFrame: DataFrame with transformed columns.
    """
    if "riskassesment_302T" in df.columns:
        if df["riskassesment_302T"].dtype == pl.Null:
            df = df.with_columns(
                [
                    pl.Series(
                        "riskassesment_302T_rng", df["riskassesment_302T"], pl.UInt8
                    ),
                    pl.Series(
                        "riskassesment_302T_mean", df["riskassesment_302T"], pl.UInt8
                    ),
                ]
            )
        else:
            pct_low = (
                df["riskassesment_302T"]
                .str.split(" - ")
                .apply(lambda x: x[0].replace("%", ""))
                .cast(pl.UInt8)
            )
            pct_high = (
                df["riskassesment_302T"]
                .str.split(" - ")
                .apply(lambda x: x[1].replace("%", ""))
                .cast(pl.UInt8)
            )

            diff = pct_high - pct_low
            avg = ((pct_low + pct_high) / 2).cast(pl.Float32)

            del pct_high, pct_low
            gc.collect()

            df = df.with_columns(
                [
                    diff.alias("riskassesment_302T_rng"),
                    avg.alias("riskassesment_302T_mean"),
                ]
            )

        # drop() returns a new DataFrame, so the result must be assigned back
        df = df.drop("riskassesment_302T")

    return df


# For each datetime feature (already aggregated), calculate the difference between it and decision date.
# In addition, convert the decision date into year and day of month
def handle_dates(df):
    """
    Handles date columns in the DataFrame.

    Args:
    - df (pl.DataFrame): Input DataFrame.

    Returns:
    - pl.DataFrame: DataFrame with transformed date columns.
    """
    for col in df.columns:
        if col.endswith("D"):
            df = df.with_columns(pl.col(col) - pl.col("date_decision"))
            df = df.with_columns(pl.col(col).dt.total_days().cast(pl.Int32))

    df = df.rename(
        {
            "MONTH": "month",
            "WEEK_NUM": "week_num"
        }
    )
            
    df = df.with_columns(
        [
            pl.col("date_decision").dt.year().alias("year").cast(pl.Int16),
            pl.col("date_decision").dt.day().alias("day").cast(pl.UInt8),
        ]
    )

    return df.drop("date_decision")
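
Two of the transformations above are easy to check by hand: turning a string such as "8% - 11%" into a range width and a midpoint, and expressing a date as a number of days relative to the decision date. The sketch below does both with the standard library only. The function names are hypothetical and the dates are made up.

```python
from datetime import date


def parse_pct_range(text):
    """Split a string like "8% - 11%" into (range width, midpoint)."""
    low, high = (int(part.strip().rstrip("%")) for part in text.split(" - "))
    return high - low, (low + high) / 2


def days_from_decision(event_date, decision_date):
    """Signed day difference; negative means the event precedes the decision."""
    return (event_date - decision_date).days


rng_width, rng_mid = parse_pct_range("8% - 11%")
birth_offset = days_from_decision(date(1985, 5, 1), date(2019, 10, 10))
```

Dates before the decision date give negative values, which matches the negative day counts visible in columns such as `dateofbirth_337D` in the training output.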

In [7]:
# Use the helper functions above to scan and process all training parquet files

data_store = {
    "df_base": SchemaGen.scan_files(TRAIN_DIR / "train_base.parquet"),
    "depth_0": [
        SchemaGen.scan_files(TRAIN_DIR / "train_static_cb_0.parquet"),
        SchemaGen.scan_files(TRAIN_DIR / "train_static_0_*.parquet"),
    ],
    "depth_1": [
        SchemaGen.scan_files(TRAIN_DIR / "train_applprev_1_*.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_tax_registry_a_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_tax_registry_b_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_tax_registry_c_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_credit_bureau_a_1_*.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_credit_bureau_b_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_other_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_person_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_deposit_1.parquet", 1),
        SchemaGen.scan_files(TRAIN_DIR / "train_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        SchemaGen.scan_files(TRAIN_DIR / "train_credit_bureau_a_2_*.parquet", 2),
        SchemaGen.scan_files(TRAIN_DIR / "train_credit_bureau_b_2.parquet", 2),
    ],
}

df_train = (
    SchemaGen.join_dataframes(**data_store)
    .pipe(filter_cols)
    .pipe(transform_cols)
    .pipe(handle_dates)
    .pipe(Utility.reduce_memory_usage, "df_train")
)

del data_store
gc.collect()

print(f"Train data shape: {df_train.shape}")
display(df_train.head(10))

File train_base loaded into memory.
File train_static_cb_0 loaded into memory.
File train_static_0_0 loaded into memory.
File train_static_0_1 loaded into memory.
File train_applprev_1_0 loaded into memory.
File train_applprev_1_1 loaded into memory.
File train_tax_registry_a_1 loaded into memory.
File train_tax_registry_b_1 loaded into memory.
File train_tax_registry_c_1 loaded into memory.
File train_credit_bureau_a_1_0 loaded into memory.
File train_credit_bureau_a_1_1 loaded into memory.
File train_credit_bureau_a_1_2 loaded into memory.
File train_credit_bureau_a_1_3 loaded into memory.
File train_credit_bureau_b_1 loaded into memory.
File train_other_1 loaded into memory.
File train_person_1 loaded into memory.
File train_deposit_1 loaded into memory.
File train_debitcard_1 loaded into memory.
File train_credit_bureau_a_2_0 loaded into memory.
File train_credit_bureau_a_2_1 loaded into memory.
File train_credit_bureau_a_2_10 loaded into memory.
File train_credit_bureau_a_2_2 load

case_id,month,week_num,target,assignmentdate_238D,assignmentdate_4527235D,birthdate_574D,contractssum_5085716L,dateofbirth_337D,days120_123L,days180_256L,days30_165L,days360_512L,days90_310L,description_5085714M,education_1103M,education_88M,firstquarter_103L,fourthquarter_440L,maritalst_385M,maritalst_893M,numberofqueries_373L,pmtaverage_3A,pmtaverage_4527227A,pmtcount_4527229L,pmtcount_693L,pmtscount_423L,pmtssum_45A,requesttype_4525192L,responsedate_1012D,responsedate_4527233D,responsedate_4917613D,secondquarter_766L,thirdquarter_1082L,actualdpdtolerance_344P,amtinstpaidbefduel24m_4187115A,annuity_780A,…,max_openingdate_313D,mean_amount_416A,mean_openingdate_313D,max_num_group1_11,max_openingdate_857D,mean_openingdate_857D,sum_last180dayaveragebalance_704A,sum_last180dayturnover_1134A,sum_last30dayturnover_651A,max_collater_typofvalofguarant_298M,max_collater_typofvalofguarant_407M,max_collater_valueofguarantee_1124L,max_collater_valueofguarantee_876L,max_collaterals_typeofguarante_359M,max_collaterals_typeofguarante_669M,max_num_group1_12,max_num_group2,max_pmts_dpd_1073P,max_pmts_dpd_303P,max_pmts_month_158T,max_pmts_month_706T,max_pmts_overdue_1140A,max_pmts_overdue_1152A,max_pmts_year_1139T,max_pmts_year_507T,max_subjectroles_name_541M,max_subjectroles_name_838M,mean_pmts_dpd_1073P,mean_pmts_dpd_303P,mean_pmts_overdue_1140A,mean_pmts_overdue_1152A,var_pmts_dpd_1073P,var_pmts_dpd_303P,var_pmts_overdue_1140A,var_pmts_overdue_1152A,year,day
u32,u32,u8,u8,i16,u8,i16,f32,i32,f32,f32,f32,f32,f32,str,str,str,f32,f32,str,str,f32,f32,f32,f32,f32,f32,f32,str,i8,u8,i8,f32,f32,f32,f32,f32,…,i16,f32,i16,u8,i16,i16,f32,f32,f32,str,str,f32,f32,str,str,u16,u8,f32,f32,f32,f32,f32,f32,f32,f32,str,str,f32,f32,f32,f32,f32,f32,f32,f32,u16,u8
1574279,201910,40,0,,,,,-12580.0,2.0,2.0,2.0,7.0,2.0,"""a55475b1""","""a55475b1""","""a55475b1""",1.0,9.0,"""a55475b1""","""a55475b1""",7.0,,,,,,,"""DEDUCTION_6""",,14.0,,1.0,2.0,0.0,18897.158203,2429.400146,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",0.0,35.0,0.0,0.0,12.0,12.0,0.0,0.0,2020.0,2019.0,"""ab3c25cf""","""ab3c25cf""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019,10
1580043,201910,40,0,,,,,-11091.0,3.0,4.0,0.0,6.0,2.0,"""a55475b1""","""6b2ae0fa""","""a55475b1""",8.0,5.0,"""a7fcb6e5""","""a55475b1""",6.0,,,,,,,"""DEDUCTION_6""",,14.0,,9.0,4.0,0.0,0.0,3153.400146,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",9.0,35.0,0.0,0.0,12.0,12.0,0.0,0.0,2020.0,2020.0,"""ab3c25cf""","""ab3c25cf""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019,13
1587909,201910,41,0,,,,,-11614.0,3.0,3.0,2.0,9.0,2.0,"""a55475b1""","""a55475b1""","""a55475b1""",3.0,7.0,"""3439d993""","""a55475b1""",9.0,,,,,,,"""DEDUCTION_6""",,14.0,,2.0,2.0,0.0,0.0,9822.600586,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",1.0,35.0,21.0,33.0,12.0,12.0,31887.4375,9720.662109,2020.0,2019.0,"""ab3c25cf""","""ab3c25cf""",1.636364,6.157895,3901.07959,2690.657959,21.051136,67.704124,108703648.0,8961506.0,2019,19
1303197,201903,9,0,,,-20216.0,,,,,,,,"""a55475b1""","""a55475b1""","""a55475b1""",,,"""a55475b1""","""a55475b1""",,,,,,6.0,5230.200195,,14.0,,,,,0.0,,2175.800049,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019,8
1612159,201911,43,0,,,,,-12148.0,1.0,2.0,1.0,4.0,1.0,"""a55475b1""","""6b2ae0fa""","""a55475b1""",6.0,4.0,"""3439d993""","""a55475b1""",4.0,,,,,,,"""DEDUCTION_6""",,14.0,,0.0,1.0,0.0,126995.0,13333.0,…,,,,,,,,,,"""a55475b1""","""a55475b1""",6337500.0,219990.0,"""c7a5ad39""","""c7a5ad39""",3.0,35.0,2.0,4.0,12.0,12.0,39502.109375,3186.573975,2020.0,2020.0,"""ab3c25cf""","""ab3c25cf""",0.107143,0.068966,2819.629883,54.940929,0.17328,0.275862,107182048.0,175073.34375,2019,4
1671541,201912,48,0,,,,,-12911.0,1.0,2.0,1.0,3.0,1.0,"""a55475b1""","""a55475b1""","""a55475b1""",0.0,1.0,"""a55475b1""","""a55475b1""",3.0,,,,,,,"""DEDUCTION_6""",,14.0,,2.0,0.0,0.0,10645.600586,1741.400024,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",0.0,23.0,0.0,0.0,12.0,12.0,0.0,0.0,2020.0,2020.0,"""ab3c25cf""","""ab3c25cf""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019,7
1748480,202001,54,0,,,,,-13680.0,0.0,0.0,0.0,0.0,0.0,"""a55475b1""","""a55475b1""","""a55475b1""",0.0,0.0,"""a55475b1""","""a55475b1""",0.0,,,,,,,"""DEDUCTION_6""",,14.0,,0.0,0.0,0.0,10198.0,1376.0,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,,"""a55475b1""","""c7a5ad39""",0.0,11.0,6.0,,12.0,,16.0,,2020.0,,"""a55475b1""","""ab3c25cf""",1.0,,2.666667,,6.0,,42.666668,,2020,14
1863564,202006,75,0,,,,168005.546875,-8446.0,2.0,4.0,0.0,8.0,1.0,"""2fc785b2""","""a55475b1""","""a55475b1""",2.0,3.0,"""a7fcb6e5""","""a55475b1""",8.0,,,,,,,,,,14.0,2.0,1.0,0.0,29976.0,3662.800049,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",3.0,23.0,0.0,0.0,12.0,12.0,0.0,0.0,2021.0,2021.0,"""ab3c25cf""","""ab3c25cf""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2020,15
1932521,202009,89,0,,,,0.0,-20471.0,1.0,1.0,1.0,1.0,1.0,"""2fc785b2""","""a55475b1""","""a55475b1""",3.0,3.0,"""a55475b1""","""a55475b1""",1.0,,,,,,,,,,14.0,5.0,3.0,0.0,90076.710938,2753.199951,…,,,,,,,,,,"""a55475b1""","""a55475b1""",9272000.0,105000.0,"""c7a5ad39""","""c7a5ad39""",6.0,35.0,0.0,1150.0,12.0,12.0,0.0,51958.601562,2021.0,2021.0,"""ab3c25cf""","""daf49a8a""",0.0,318.879303,0.0,16760.066406,0.0,180322.03125,0.0,449149376.0,2020,18
2585425,201906,23,0,,,-10608.0,,-10608.0,5.0,7.0,3.0,14.0,4.0,"""a55475b1""","""6b2ae0fa""","""a55475b1""",13.0,6.0,"""3439d993""","""a55475b1""",14.0,,,,,7.0,8037.5,,14.0,,,7.0,5.0,0.0,0.0,8225.200195,…,,,,,,,,,,"""a55475b1""","""a55475b1""",0.0,0.0,"""c7a5ad39""","""c7a5ad39""",4.0,35.0,0.0,0.0,12.0,12.0,0.0,0.0,2020.0,2020.0,"""ab3c25cf""","""ab3c25cf""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019,17


In [8]:
# Use the helper functions above to scan and process all test parquet files

data_store = {
    "df_base": SchemaGen.scan_files(TEST_DIR / "test_base.parquet"),
    "depth_0": [
        SchemaGen.scan_files(TEST_DIR / "test_static_cb_0.parquet"),
        SchemaGen.scan_files(TEST_DIR / "test_static_0_*.parquet"),
    ],
    "depth_1": [
        SchemaGen.scan_files(TEST_DIR / "test_applprev_1_*.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_tax_registry_a_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_tax_registry_b_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_tax_registry_c_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_credit_bureau_a_1_*.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_credit_bureau_b_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_other_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_person_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_deposit_1.parquet", 1),
        SchemaGen.scan_files(TEST_DIR / "test_debitcard_1.parquet", 1),
    ],
    "depth_2": [
        SchemaGen.scan_files(TEST_DIR / "test_credit_bureau_a_2_*.parquet", 2),
        SchemaGen.scan_files(TEST_DIR / "test_credit_bureau_b_2.parquet", 2),
    ],
}

df_test = (
    SchemaGen.join_dataframes(**data_store)
    .pipe(transform_cols)
    .pipe(handle_dates)
    .select([col for col in df_train.columns if col != "target"])
    .pipe(Utility.reduce_memory_usage, "df_test")
)

del data_store
gc.collect()

print(f"Test data shape: {df_test.shape}")

File test_base loaded into memory.
File test_static_cb_0 loaded into memory.
File test_static_0_0 loaded into memory.
File test_static_0_1 loaded into memory.
File test_static_0_2 loaded into memory.
File test_applprev_1_0 loaded into memory.
File test_applprev_1_1 loaded into memory.
File test_applprev_1_2 loaded into memory.
File test_tax_registry_a_1 loaded into memory.
File test_tax_registry_b_1 loaded into memory.
File test_tax_registry_c_1 loaded into memory.
File test_credit_bureau_a_1_0 loaded into memory.
File test_credit_bureau_a_1_1 loaded into memory.
File test_credit_bureau_a_1_2 loaded into memory.
File test_credit_bureau_a_1_3 loaded into memory.
File test_credit_bureau_a_1_4 loaded into memory.
File test_credit_bureau_b_1 loaded into memory.
File test_other_1 loaded into memory.
File test_person_1 loaded into memory.
File test_deposit_1 loaded into memory.
File test_debitcard_1 loaded into memory.
File test_credit_bureau_a_2_0 loaded into memory.
File test_credit_bureau

In [9]:
# Convert polars data frame to pandas data frame

df_train, cat_cols = Utility.to_pandas(df_train)
df_test, cat_cols = Utility.to_pandas(df_test, cat_cols)

## Model Training

In [10]:
# Define a voting ensemble class that averages the predictions of already fitted models

class VotingModel(BaseEstimator, ClassifierMixin):
    """
    A voting ensemble model that combines predictions from multiple estimators.

    Parameters:
    - estimators (list): List of base estimators.

    Attributes:
    - estimators (list): List of base estimators.

    Methods:
    - fit(X, y=None): Fit the model to the training data.
    - predict(X): Predict class labels for samples.
    - predict_proba(X): Predict class probabilities for samples.
    """

    def __init__(self, estimators):
        """
        Initialize the VotingModel with a list of base estimators.

        Args:
        - estimators (list): List of base estimators.
        """
        super().__init__()
        self.estimators = estimators

    def fit(self, X, y = None):
        """
        Fit the model to the training data.

        Args:
        - X: Input features.
        - y: Target labels (ignored).

        Returns:
        - self: Returns the instance itself.
        """
        return self

    def predict(self, X):
        """
        Average the estimators' predicted class labels.

        Args:
        - X: Input features.

        Returns:
        - numpy.ndarray: Mean of the predicted labels across estimators (a vote fraction, not a hard label).
        """
        y_preds = [estimator.predict(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)

    def predict_proba(self, X):
        """
        Predict class probabilities for samples.

        Args:
        - X: Input features.

        Returns:
        - numpy.ndarray: Predicted class probabilities.
        """
        y_preds = [estimator.predict_proba(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)
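
As a quick sanity check of the averaging behaviour, the sketch below uses a stripped-down stand-in for `VotingModel` together with a hypothetical stub estimator (`ConstantProba` and `MeanProbaEnsemble` are illustrative names, not part of the notebook). It assumes only NumPy and shows that `predict_proba` returns the element-wise mean of the base estimators' probabilities.

```python
import numpy as np


class ConstantProba:
    """Hypothetical stub: always predicts the same positive-class probability."""

    def __init__(self, p_pos):
        self.p_pos = p_pos

    def predict_proba(self, X):
        n = len(X)
        # Column 0 = P(class 0), column 1 = P(class 1)
        return np.column_stack([np.full(n, 1 - self.p_pos), np.full(n, self.p_pos)])


class MeanProbaEnsemble:
    """Stand-in that mirrors VotingModel.predict_proba (plain average)."""

    def __init__(self, estimators):
        self.estimators = estimators

    def predict_proba(self, X):
        y_preds = [est.predict_proba(X) for est in self.estimators]
        return np.mean(y_preds, axis=0)


X_demo = np.zeros((3, 2))  # feature values are irrelevant for the stubs
ensemble_demo = MeanProbaEnsemble([ConstantProba(0.2), ConstantProba(0.6)])
proba_demo = ensemble_demo.predict_proba(X_demo)

# Positive-class scores: the same column the notebook later uses for submission
print(proba_demo[:, 1])
```

Because every estimator gets the same weight, a single poorly calibrated model shifts the ensemble score proportionally; weighted averaging would be a different design choice and is not what the notebook does.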

In [12]:
X = df_train.drop(columns=["target", "case_id", "week_num"])
y = df_train["target"]

weeks = df_train["week_num"]

# Delete df_train to save memory space
del df_train
gc.collect()

cv = StratifiedGroupKFold(n_splits = 8, shuffle = False)

# Parameters for lightgbm
params1 = {
    "boosting_type": "gbdt",
    "colsample_bynode": 0.8, # fraction of features randomly sampled at each tree node
    "colsample_bytree": 0.8, # fraction of features randomly sampled for each tree (iteration)
    "extra_trees": True, # extremely randomized trees: LightGBM checks only one randomly chosen threshold per feature when evaluating a split
    "learning_rate": 0.05,
    "l1_regularization": 0.1,
    "l2_regularization": 10,
    "max_depth": 20,
    "metric": "auc",
    "n_estimators": 2000,
    "num_leaves": 64, # max number of leaves in one tree
    "objective": "binary",
    "random_state": 42,
    "verbose": -1, # controls the level of LightGBM’s verbosity - < 0: Fatal, = 0: Error (Warning), = 1: Info, > 1: Debug
}

# Parameters for lightgbm2
params2 = {
    "boosting_type": "gbdt",
    "colsample_bynode": 0.8,
    "colsample_bytree": 0.8,
    "extra_trees": True,
    "learning_rate": 0.03,
    "l1_regularization": 0.1,
    "l2_regularization": 10,
    "max_depth": 16,
    "metric": "auc",
    "n_estimators": 2000,
    "num_leaves": 72,
    "objective": "binary",
    "random_state": 42,
    "verbose": -1,
}

# Initialize lists to save models and cross validation scores
fitted_models_cat = []
fitted_models_lgb = []

cv_scores_cat = []
cv_scores_lgb = []

# This approach is borrowed from another public notebook:
# in addition to CatBoost, train a LightGBM model on every fold, alternating between two parameter sets (params1 on even folds, params2 on odd folds).
# It initially worked well on the public leaderboard test data, but it may also be one reason the models overfit the public leaderboard.
iter_cnt = 0
for idx_train, idx_valid in cv.split(X, y, groups = weeks):
    X_train, y_train = X.iloc[idx_train], y.iloc[idx_train]
    X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid]

    train_pool = Pool(X_train, y_train, cat_features = cat_cols)
    val_pool = Pool(X_valid, y_valid, cat_features = cat_cols)

    clf = CatBoostClassifier(
        best_model_min_trees = 1200, # The minimal number of trees that the best model should have. If set, the output model contains at least the given number of trees even if the best model is located within these trees.
        boosting_type = "Plain", # "Plain" is the classic gradient boosting scheme
        eval_metric = "AUC",
        iterations = 6000,
        learning_rate = 0.05,
        l2_leaf_reg = 10, # Coefficient at the L2 regularization term of the cost function.
        max_leaves = 64,
        random_seed = 42,
        task_type = "GPU",
        use_best_model = True # Use the validation set to find the iteration with the best metric value; trees after that iteration are discarded.
    )

    clf.fit(train_pool, eval_set = val_pool, verbose = 300)
    fitted_models_cat.append(clf)

    y_pred_valid = clf.predict_proba(X_valid)[:, 1]
    auc_score = roc_auc_score(y_valid, y_pred_valid)
    cv_scores_cat.append(auc_score)

    X_train[cat_cols] = X_train[cat_cols].astype("category")
    X_valid[cat_cols] = X_valid[cat_cols].astype("category")

    if iter_cnt % 2 == 0:
        model = lgb.LGBMClassifier(**params1)
    else:
        model = lgb.LGBMClassifier(**params2)

    model.fit(
        X_train,
        y_train,
        eval_set=[(X_valid, y_valid)],
        # log_evaluation: how often (in iterations) to log the evaluation results.
        # early_stopping: stop training if the validation score has not improved in the last `stopping_rounds` rounds.
        callbacks=[lgb.log_evaluation(100), lgb.early_stopping(100)],
    )
    fitted_models_lgb.append(model)

    y_pred_valid = model.predict_proba(X_valid)[:, 1]
    auc_score = roc_auc_score(y_valid, y_pred_valid)
    cv_scores_lgb.append(auc_score)

    iter_cnt += 1

ensemble_model = VotingModel(fitted_models_cat + fitted_models_lgb)

print(f"\nCV AUC scores for CatBoost: {cv_scores_cat}")
print(f"Maximum CV AUC score for Catboost: {max(cv_scores_cat)}", end="\n\n")


print(f"CV AUC scores for LGBM: {cv_scores_lgb}")
print(f"Maximum CV AUC score for LGBM: {max(cv_scores_lgb)}", end="\n\n")

print('CB CV Score Results')    
print("CV AUC scores: ", cv_scores_cat)
print("Maximum CV AUC score: ", max(cv_scores_cat))
print("Average CV AUC score: ", np.mean(cv_scores_cat))
print("Std CV AUC score: ", np.std(cv_scores_cat))

print('LGB CV Score Results')
print("CV AUC scores: ", cv_scores_lgb)
print("Maximum CV AUC score: ", max(cv_scores_lgb))
print("Average CV AUC score: ", np.mean(cv_scores_lgb))
print("Std CV AUC score: ", np.std(cv_scores_lgb))

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6735594	best: 0.6735594 (0)	total: 198ms	remaining: 19m 47s
300:	test: 0.8488633	best: 0.8488633 (300)	total: 59.6s	remaining: 18m 47s
600:	test: 0.8541675	best: 0.8541675 (600)	total: 1m 58s	remaining: 17m 41s
900:	test: 0.8565263	best: 0.8565263 (900)	total: 2m 56s	remaining: 16m 39s
1200:	test: 0.8581834	best: 0.8581834 (1200)	total: 3m 55s	remaining: 15m 39s
1500:	test: 0.8591421	best: 0.8591421 (1500)	total: 4m 53s	remaining: 14m 39s
1800:	test: 0.8600221	best: 0.8600221 (1800)	total: 5m 51s	remaining: 13m 39s
2100:	test: 0.8606563	best: 0.8606563 (2100)	total: 6m 50s	remaining: 12m 40s
2400:	test: 0.8612165	best: 0.8612165 (2400)	total: 7m 48s	remaining: 11m 42s
2700:	test: 0.8617208	best: 0.8617208 (2700)	total: 8m 47s	remaining: 10m 44s
3000:	test: 0.8621022	best: 0.8621038 (2975)	total: 9m 46s	remaining: 9m 45s
3300:	test: 0.8624181	best: 0.8624243 (3270)	total: 10m 44s	remaining: 8m 47s
3600:	test: 0.8626831	best: 0.8626831 (3600)	total: 11m 42s	remaining: 7m 48s
3

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6833925	best: 0.6833925 (0)	total: 388ms	remaining: 38m 49s
300:	test: 0.8379942	best: 0.8379942 (300)	total: 1m	remaining: 19m 9s
600:	test: 0.8431437	best: 0.8431437 (600)	total: 2m 1s	remaining: 18m 8s
900:	test: 0.8453843	best: 0.8453843 (900)	total: 3m 2s	remaining: 17m 14s
1200:	test: 0.8465820	best: 0.8465820 (1200)	total: 4m 2s	remaining: 16m 10s
1500:	test: 0.8477875	best: 0.8477875 (1500)	total: 5m 2s	remaining: 15m 5s
1800:	test: 0.8486117	best: 0.8486117 (1800)	total: 6m 1s	remaining: 14m 3s
2100:	test: 0.8492302	best: 0.8492397 (2095)	total: 7m 1s	remaining: 13m 2s
2400:	test: 0.8497553	best: 0.8497553 (2400)	total: 8m 2s	remaining: 12m 2s
2700:	test: 0.8501998	best: 0.8501998 (2700)	total: 9m 2s	remaining: 11m 2s
3000:	test: 0.8506450	best: 0.8506450 (3000)	total: 10m 1s	remaining: 10m 1s
3300:	test: 0.8509005	best: 0.8509005 (3300)	total: 11m 2s	remaining: 9m 1s
3600:	test: 0.8512725	best: 0.8512725 (3600)	total: 12m	remaining: 7m 59s
3900:	test: 0.8515257	bes

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6597642	best: 0.6597642 (0)	total: 485ms	remaining: 48m 26s
300:	test: 0.8438950	best: 0.8438950 (300)	total: 1m	remaining: 19m 10s
600:	test: 0.8487698	best: 0.8487698 (600)	total: 2m	remaining: 18m 3s
900:	test: 0.8507326	best: 0.8507326 (900)	total: 3m	remaining: 16m 58s
1200:	test: 0.8519702	best: 0.8519702 (1200)	total: 4m	remaining: 15m 59s
1500:	test: 0.8529012	best: 0.8529012 (1500)	total: 5m 1s	remaining: 15m 4s
1800:	test: 0.8536661	best: 0.8536661 (1800)	total: 6m 2s	remaining: 14m 5s
2100:	test: 0.8541255	best: 0.8541255 (2100)	total: 7m 2s	remaining: 13m 4s
2400:	test: 0.8546107	best: 0.8546107 (2400)	total: 8m 2s	remaining: 12m 3s
2700:	test: 0.8549064	best: 0.8549064 (2700)	total: 9m 2s	remaining: 11m 2s
3000:	test: 0.8551682	best: 0.8551682 (3000)	total: 10m 2s	remaining: 10m 1s
3300:	test: 0.8555513	best: 0.8555513 (3300)	total: 11m 1s	remaining: 9m 1s
3600:	test: 0.8558157	best: 0.8558157 (3600)	total: 12m 2s	remaining: 8m 1s
3900:	test: 0.8560490	best: 0.8

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6765360	best: 0.6765360 (0)	total: 414ms	remaining: 41m 26s
300:	test: 0.8342060	best: 0.8342060 (300)	total: 1m	remaining: 18m 59s
600:	test: 0.8383230	best: 0.8383230 (600)	total: 1m 59s	remaining: 17m 55s
900:	test: 0.8397016	best: 0.8397016 (900)	total: 3m	remaining: 17m 3s
1200:	test: 0.8409257	best: 0.8409257 (1200)	total: 4m 1s	remaining: 16m 6s
1500:	test: 0.8420105	best: 0.8420105 (1500)	total: 5m 1s	remaining: 15m 3s
1800:	test: 0.8428063	best: 0.8428063 (1800)	total: 6m	remaining: 13m 59s
2100:	test: 0.8432276	best: 0.8432276 (2100)	total: 7m	remaining: 13m
2400:	test: 0.8437664	best: 0.8437664 (2400)	total: 8m 1s	remaining: 12m 1s
2700:	test: 0.8441136	best: 0.8441136 (2700)	total: 9m 1s	remaining: 11m 1s
3000:	test: 0.8445557	best: 0.8445557 (3000)	total: 9m 59s	remaining: 9m 59s
3300:	test: 0.8448339	best: 0.8448339 (3300)	total: 10m 59s	remaining: 8m 58s
3600:	test: 0.8451937	best: 0.8452023 (3595)	total: 12m	remaining: 7m 59s
3900:	test: 0.8454791	best: 0.845

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6818588	best: 0.6818588 (0)	total: 407ms	remaining: 40m 40s
300:	test: 0.8445651	best: 0.8445651 (300)	total: 1m 2s	remaining: 19m 43s
600:	test: 0.8495462	best: 0.8495462 (600)	total: 2m 3s	remaining: 18m 26s
900:	test: 0.8516886	best: 0.8516886 (900)	total: 3m 4s	remaining: 17m 22s
1200:	test: 0.8531722	best: 0.8531722 (1200)	total: 4m 3s	remaining: 16m 12s
1500:	test: 0.8540956	best: 0.8540956 (1500)	total: 5m 3s	remaining: 15m 8s
1800:	test: 0.8549441	best: 0.8549441 (1800)	total: 6m 2s	remaining: 14m 4s
2100:	test: 0.8555078	best: 0.8555078 (2100)	total: 7m	remaining: 13m
2400:	test: 0.8561334	best: 0.8561334 (2400)	total: 7m 59s	remaining: 11m 58s
2700:	test: 0.8566350	best: 0.8566350 (2700)	total: 8m 58s	remaining: 10m 57s
3000:	test: 0.8569768	best: 0.8569768 (3000)	total: 9m 57s	remaining: 9m 57s
3300:	test: 0.8573779	best: 0.8573779 (3300)	total: 10m 56s	remaining: 8m 57s
3600:	test: 0.8576860	best: 0.8576872 (3595)	total: 11m 56s	remaining: 7m 57s
3900:	test: 0.85

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6914641	best: 0.6914641 (0)	total: 544ms	remaining: 54m 25s
300:	test: 0.8459468	best: 0.8459468 (300)	total: 1m 2s	remaining: 19m 45s
600:	test: 0.8513044	best: 0.8513044 (600)	total: 2m 3s	remaining: 18m 26s
900:	test: 0.8533852	best: 0.8533852 (900)	total: 3m 3s	remaining: 17m 17s
1200:	test: 0.8547442	best: 0.8547442 (1200)	total: 4m 3s	remaining: 16m 14s
1500:	test: 0.8557436	best: 0.8557436 (1500)	total: 5m 3s	remaining: 15m 10s
1800:	test: 0.8565557	best: 0.8565557 (1800)	total: 6m 5s	remaining: 14m 11s
2100:	test: 0.8571569	best: 0.8571597 (2095)	total: 7m 7s	remaining: 13m 12s
2400:	test: 0.8575456	best: 0.8575456 (2400)	total: 8m 7s	remaining: 12m 10s
2700:	test: 0.8579456	best: 0.8579460 (2695)	total: 9m 8s	remaining: 11m 9s
3000:	test: 0.8584364	best: 0.8584364 (3000)	total: 10m 8s	remaining: 10m 7s
3300:	test: 0.8587411	best: 0.8587411 (3300)	total: 11m 8s	remaining: 9m 6s
3600:	test: 0.8590358	best: 0.8590358 (3600)	total: 12m 8s	remaining: 8m 5s
3900:	test: 0.

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6636217	best: 0.6636217 (0)	total: 518ms	remaining: 51m 47s
300:	test: 0.8486139	best: 0.8486139 (300)	total: 1m	remaining: 19m 5s
600:	test: 0.8537899	best: 0.8537899 (600)	total: 2m 1s	remaining: 18m 6s
900:	test: 0.8556963	best: 0.8556963 (900)	total: 3m 2s	remaining: 17m 11s
1200:	test: 0.8570461	best: 0.8570461 (1200)	total: 4m 1s	remaining: 16m 6s
1500:	test: 0.8580706	best: 0.8580706 (1500)	total: 5m 1s	remaining: 15m 3s
1800:	test: 0.8587388	best: 0.8587388 (1800)	total: 6m 2s	remaining: 14m 4s
2100:	test: 0.8593062	best: 0.8593062 (2100)	total: 7m 2s	remaining: 13m 3s
2400:	test: 0.8598994	best: 0.8598994 (2400)	total: 8m	remaining: 12m
2700:	test: 0.8602836	best: 0.8602836 (2700)	total: 8m 59s	remaining: 10m 58s
3000:	test: 0.8605787	best: 0.8605787 (3000)	total: 9m 57s	remaining: 9m 56s
3300:	test: 0.8608863	best: 0.8608863 (3300)	total: 10m 55s	remaining: 8m 55s
3600:	test: 0.8612337	best: 0.8612337 (3600)	total: 11m 53s	remaining: 7m 55s
3900:	test: 0.8614522	be

Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.6653582	best: 0.6653582 (0)	total: 468ms	remaining: 46m 48s
300:	test: 0.8448931	best: 0.8448931 (300)	total: 1m 2s	remaining: 19m 42s
600:	test: 0.8504478	best: 0.8504478 (600)	total: 2m 3s	remaining: 18m 25s
900:	test: 0.8527124	best: 0.8527124 (900)	total: 3m 2s	remaining: 17m 14s
1200:	test: 0.8540191	best: 0.8540191 (1200)	total: 4m 2s	remaining: 16m 8s
1500:	test: 0.8551530	best: 0.8551530 (1500)	total: 5m	remaining: 15m 1s
1800:	test: 0.8560977	best: 0.8560977 (1800)	total: 6m	remaining: 14m
2100:	test: 0.8567510	best: 0.8567510 (2100)	total: 6m 59s	remaining: 12m 58s
2400:	test: 0.8574413	best: 0.8574413 (2400)	total: 7m 58s	remaining: 11m 57s
2700:	test: 0.8577887	best: 0.8577887 (2700)	total: 8m 58s	remaining: 10m 57s
3000:	test: 0.8581604	best: 0.8581604 (3000)	total: 9m 56s	remaining: 9m 56s
3300:	test: 0.8585659	best: 0.8585659 (3300)	total: 10m 56s	remaining: 8m 57s
3600:	test: 0.8588911	best: 0.8588911 (3600)	total: 11m 56s	remaining: 7m 57s
3900:	test: 0.8591

## Model Saving and Loading

In [11]:
# Save the ensemble model

# ensemble_model = VotingModel(fitted_models_cat + fitted_models_lgb)

# import pickle
# pickle.dump(ensemble_model, open('ensemble_model_' + 'sv' + VERSION + '.pkl', 'wb'))

In [12]:
# Load pre-trained ensemble model to save memory and time

import pickle
with open(PERSONAL_PATH + 'ensemble_model_' + 'sv' + VERSION + '.pkl', 'rb') as f:
    ensemble_model = pickle.load(f)

In [13]:
ensemble_model

## Prediction

In [14]:
# Load the sample submission file (only 10 records), used as the template for the Kaggle submission

df_subm = pd.read_csv(ROOT / "sample_submission.csv")
df_subm = df_subm.set_index("case_id")

In [15]:
# Prepare the test features: drop week_num, index by case_id, and cast categorical columns

X_test = df_test.drop(columns=["week_num"]).set_index("case_id")
X_test[cat_cols] = X_test[cat_cols].astype("category")

In [16]:
# Prediction
y_pred: pd.Series = pd.Series(ensemble_model.predict_proba(X_test)[:, 1], index = X_test.index)

df_subm["score"] = y_pred

display(df_subm)

# df_subm.to_csv("submission.csv")

# del X_test, y_pred, df_subm
# gc.collect()

case_id,score
57543,0.008518
57549,0.044705
57551,0.002777
57552,0.017779
57569,0.162139
57630,0.009587
57631,0.034756
57632,0.007646
57633,0.039108
57634,0.030771
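
The assignment `df_subm["score"] = y_pred` works because both objects are indexed by `case_id`: pandas aligns on index labels rather than on row position. A small sketch with made-up IDs and scores (`df_subm_demo` and `y_pred_demo` are hypothetical) illustrates this:

```python
import pandas as pd

# Hypothetical submission template, in arbitrary case_id order
df_subm_demo = pd.DataFrame(
    {"score": [0.5, 0.5, 0.5]},
    index=pd.Index([30, 10, 20], name="case_id"),
)

# Hypothetical predictions indexed by case_id, in a different order
y_pred_demo = pd.Series(
    [0.1, 0.2, 0.3],
    index=pd.Index([10, 20, 30], name="case_id"),
)

# Values are matched on case_id, not on position
df_subm_demo["score"] = y_pred_demo
print(df_subm_demo)

# A case_id missing from the predictions would show up as NaN here
assert df_subm_demo["score"].notna().all()
```

Checking for NaN before writing the submission file is a cheap guard against an index mismatch between the template and the test features.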
