# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > TABLE OF CONTENTS<br><div>  

* [IMPORT](#1)
* [INTRODUCTION](#2)
* [PREPROCESSING](#3)
* [MODEL DEFINITION](#4)
* [FEATURE ENGINEERING](#5)
* [MODEL TRAINING](#6)  
* [SUBMISSION](#7)    

<a id="1"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > IMPORT<br><div> 

In [24]:
%%time 

import numpy as np;
import pandas as pd;

from sklearn.preprocessing import LabelEncoder,normalize;
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier;
from sklearn.metrics import accuracy_score, log_loss;
from sklearn.impute import SimpleImputer;
from adjdatatools.preprocessing import AdjustedScaler

import imblearn;
from imblearn.over_sampling import RandomOverSampler;
from imblearn.under_sampling import RandomUnderSampler;

from xgboost import XGBClassifier;
from lightgbm import LGBMClassifier;
import inspect;
from collections import defaultdict;
from tabpfn import TabPFNClassifier;
from catboost import CatBoostClassifier;

from tqdm.notebook import tqdm;
from datetime import datetime;
from sklearn.model_selection import KFold as KF, GridSearchCV;
from colorama import Fore, Style, init;
from pprint import pprint;

import warnings;
warnings.filterwarnings('ignore');

from gc import collect;

print();
collect();


CPU times: user 93.2 ms, sys: 4 ms, total: 97.2 ms
Wall time: 96.5 ms




In [2]:
%%time 

from IPython.display import clear_output;

# Color printing    
def PrintColor(text:str, color = Fore.BLUE, style = Style.BRIGHT):
    "Prints color outputs using colorama using a text F-string";
    print(style + color + text + Style.RESET_ALL); 
    
pd.set_option('display.max_columns', 60);
pd.set_option('display.max_rows', 50);

from sklearn import set_config; 
set_config(transform_output = "pandas");

clear_output();
print();


CPU times: user 698 µs, sys: 309 µs, total: 1.01 ms
Wall time: 636 µs


<a id="2"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > INTRODUCTION<br><div> 

**Notebook objective**<br>

This notebook is adapted from the best-scoring public notebooks for the competition **ICR - Identifying Age-Related Conditions**. References:<br>
1. https://www.kaggle.com/code/vadimkamaev/postprocessin-ensemble
2. https://www.kaggle.com/code/aikhmelnytskyy/public-krni-pdi-with-two-additional-models
3. https://www.kaggle.com/code/opamusora/changed-threshold

I also used the explanation from a recent discussion post that corrects the metric implementation used in the references above:
https://www.kaggle.com/competitions/icr-identify-age-related-conditions/discussion/422442<br>

Many thanks to the contributors of these references and all the best to the participants too!<br>

**My contribution**<br>
1. Changed the metric implementation to follow the discussion post referenced above<br>
2. Added a couple of ML models to the ones used in the reference notebooks<br>
3. Added a small configuration class so that users can toggle settings and run simple experiments<br>
4. Added comments and organized the code for readability<br>



In [3]:
%%time 

# Defining a configuration class with key variables to toggle for simple experiments:-

class CFG:
    """
    This class defines several variables to toggle for experiments
    """;
    
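    state          = 42;     # random seed shared by the CV splitters and the over-sampler
    n_splits_outer = 10;     # number of folds for cv_outer
    n_splits_inner = 5;      # number of folds for cv_inner (the split used in training)
    
    # Defines post-processing cutoffs:-    
    postprocess_req= "Y";    # "Y" applies the cutoffs below to the class-0 probabilities
    lw_cutoff      = 0.28;   # class-0 probabilities below this are set to 0
    up_cutoff      = 0.595;  # class-0 probabilities above this are set to 1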
    
print();


CPU times: user 38 µs, sys: 0 ns, total: 38 µs
Wall time: 41.2 µs


<a id="3"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > PREPROCESSING<br><div> 
    
**Key tasks**<br>
1. Data import<br>
2. Defining the correct competition metric function<br>
3. Under-sampling the datasets appropriately<br>

In [4]:
%%time 

# Making the undersampled dataset:-
def MakeRndUndSmpl(df):
    """
    This function builds a class-balanced dataset by randomly under-sampling the majority class (label 0)
    down to the size of the minority class (label 1) and then shuffling the rows.
    
    Input-->   df - pd.DataFrame with a binary 'Class' column
    Returns--> Under-sampled and shuffled dataframe
    """;
    
    # Calculate the number of samples for each label. 
    neg, pos = np.bincount(df['Class']);

    # Choose the samples with class label `1`.
    one_df = df.loc[df['Class'] == 1] ;
    # Choose the samples with class label `0`.
    zero_df = df.loc[df['Class'] == 0];
    # Select `pos` number of negative samples.
    # This makes sure that we have equal number of samples for each label.
    zero_df = zero_df.sample(n=pos);

    # Join both label dataframes.
    undersampled_df = pd.concat([zero_df, one_df]);

    # Shuffle the data and return
    return undersampled_df.sample(frac = 1);

print();
collect();


CPU times: user 78.7 ms, sys: 149 µs, total: 78.8 ms
Wall time: 78.5 ms
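
To make the under-sampling step above concrete, here is a minimal sketch of the same idea on plain Python lists instead of a DataFrame. The helper name `undersample_indices` and the toy labels are assumptions for illustration only; the notebook itself uses `MakeRndUndSmpl` with pandas. On the train data this step reduces 617 rows to 216, i.e. 108 rows per class.

```python
import random

def undersample_indices(labels, seed=42):
    """Return shuffled row indices with as many 0-labelled rows as 1-labelled rows."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(labels) if label == 1]
    neg_idx = [i for i, label in enumerate(labels) if label == 0]
    # Keep every positive row and draw an equally sized sample of negatives
    # (assumes, as in the train data, that negatives outnumber positives).
    kept = pos_idx + rng.sample(neg_idx, k=len(pos_idx))
    rng.shuffle(kept)
    return kept

# Toy labels (hypothetical): 8 negatives and 3 positives.
toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
kept = undersample_indices(toy_labels)
balanced = [toy_labels[i] for i in kept]
```

Fixing the seed makes the sample reproducible between runs, which helps when comparing experiments.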


In [5]:
%%time 

# Defining the competition metric as per Chris Deotte's post:-
def ScoreMetric(ytrue, ypred):
    """
    This function computes the competition metric, balanced log loss, as per Chris Deotte's post
    
    Inputs-->
    ytrue, ypred- np.array - true labels (0/1) and predicted class-1 probabilities
    Returns--> balanced log loss score
    
    Note:- Predictions are clipped to [1e-15, 1 - 1e-15] so that log(0) is never evaluated
    """;
    
    ytrue = np.asarray(ytrue).astype(int);
    nc    = np.bincount(ytrue);
    # Clip manually: the `eps` argument of log_loss is deprecated in newer scikit-learn releases
    ypred = np.clip(ypred, 1e-15, 1 - 1e-15);
    return log_loss(ytrue, ypred, sample_weight = 1 / nc[ytrue]);

print();
collect();


CPU times: user 93.5 ms, sys: 0 ns, total: 93.5 ms
Wall time: 93.1 ms
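
As a cross-check of the metric above, the balanced log loss can also be written out by hand: compute the mean log loss separately for each class and average the two, so both classes count equally regardless of their size. The sketch below uses only the standard library; the function name `balanced_log_loss` and the toy inputs are assumptions for illustration, not part of the competition code.

```python
import math

def balanced_log_loss(y_true, p1, eps=1e-15):
    """Average of the per-class mean log losses for a binary target."""
    losses = {0: [], 1: []}
    for label, p in zip(y_true, p1):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        losses[label].append(-math.log(p if label == 1 else 1 - p))
    # Each class contributes equally, however many rows it has.
    return 0.5 * (sum(losses[0]) / len(losses[0]) + sum(losses[1]) / len(losses[1]))

y_demo = [0, 0, 0, 1]
p_demo = [0.1, 0.2, 0.3, 0.8]            # predicted probability of class 1
score = balanced_log_loss(y_demo, p_demo)
uninformed = balanced_log_loss(y_demo, [0.5] * 4)  # predicting 0.5 everywhere gives ln(2)
```

Weighting every sample by the inverse of its class count, as `ScoreMetric` does, yields the same value because the weights of each class sum to 1.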


In [6]:
%%time 

# Importing the datasets:-
train  = pd.read_csv('icr-identify-age-related-conditions/train.csv')
test   = pd.read_csv('icr-identify-age-related-conditions/test.csv')
sample = pd.read_csv('icr-identify-age-related-conditions/sample_submission.csv')
greeks = pd.read_csv('icr-identify-age-related-conditions/greeks.csv')

# Encoding the category column- EJ:-
first_category = train.EJ.unique()[0];
train.EJ       = train.EJ.eq(first_category).astype('int');
test.EJ        = test.EJ.eq(first_category).astype('int');

# features = [fe for fe in train.columns if fe not in ['CF', 'CB', 'DV', 'BR', 'DF', 'GB', 'AH', 
#                                                      'CW', 'CL', 'BP', 'BD', 'FC', 'GE', 'GF', 
#                                                      'AR', 'GI', 'AX', 'DA', 'Class']]
# train = train[features + ['Class']]
# test = test[features]

# Implementing the under-sampling process:-
train_good        = MakeRndUndSmpl(train);
predictor_columns = [n for n in train.columns if n not in ['Class','Id']];
x                 = train[predictor_columns];
y                 = train['Class'];
cv_outer          = KF(n_splits = CFG.n_splits_outer,  shuffle = True,   random_state = CFG.state);
cv_inner          = KF(n_splits = CFG.n_splits_inner,  shuffle  = True,  random_state = CFG.state);


PrintColor(f"\nTrain and test feature predictors\n");
display(np.array(predictor_columns));

PrintColor(f"\nOriginal and undersampled train data shape = {train.shape} {train_good.shape}\n");

collect();
print();

[1m[34m
Train and test feature predictors
[0m


array(['AB', 'AF', 'AH', 'AM', 'AR', 'AX', 'AY', 'AZ', 'BC', 'BD ', 'BN',
       'BP', 'BQ', 'BR', 'BZ', 'CB', 'CC', 'CD ', 'CF', 'CH', 'CL', 'CR',
       'CS', 'CU', 'CW ', 'DA', 'DE', 'DF', 'DH', 'DI', 'DL', 'DN', 'DU',
       'DV', 'DY', 'EB', 'EE', 'EG', 'EH', 'EJ', 'EL', 'EP', 'EU', 'FC',
       'FD ', 'FE', 'FI', 'FL', 'FR', 'FS', 'GB', 'GE', 'GF', 'GH', 'GI',
       'GL'], dtype='<U3')

Original and undersampled train data shape = (617, 58) (216, 58)

CPU times: user 90.8 ms, sys: 0 ns, total: 90.8 ms
Wall time: 90.2 ms


<a id="4"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > MODEL DEFINITION<br><div> 

In [54]:
%%time 

class Ensemble():
    """
    This class does the following-
    1. Lists the base models of the ensemble, imputes nulls with a SimpleImputer (median) and scales the features with an AdjustedScaler
    2. Defines the fit method that fits every base model to the training data
    3. Averages and re-weights the predicted class probabilities of the base models
    """;
    
    def __init__(self):
        """
        This method initializes the imputation strategy and the classifier choices for the subsequent models
        """;
        
        self.imputer = SimpleImputer(missing_values = np.nan, 
                                     strategy = 'median')
        self.scaler = AdjustedScaler(with_centering=True)

        self.classifiers = \
        {
        "XGBC": XGBClassifier( n_estimators     = 200,
                               max_depth        = 3,
                               learning_rate    = 0.15,
                               subsample        = 0.9,
                               colsample_bytree = 0.85,
                               reg_alpha        = 0.0001,
                               reg_lambda       = 0.85,
                              ),
         
        "LGBMC1":LGBMClassifier(**{ 'device'            : "cpu",
                                    'verbose'           : -1,
                                    'boosting_type'     : 'gbdt',
                                    'random_state'      : 42,
                                    'colsample_bytree'  : 0.4,
                                    'learning_rate'     : 0.10,
                                    'max_depth'         : 3,
                                    'min_child_samples' : 5,
                                    'n_estimators'      : 150,
                                    'num_leaves'        : 40,
                                    'reg_alpha'         : 0.0001,
                                    'reg_lambda'        : 0.65,
                                    'subsample'         : 0.65, 
                                  }
                               ),

        "LGBMC2":LGBMClassifier(**{ 'boosting_type': 'goss',
                                    'colsample_bytree': 0.6204955836777386,
                                    'learning_rate': 0.0491626229304876,
                                    'max_bin': 255,
                                    'max_depth': 10,
                                    'min_child_samples': 12,
                                    'num_leaves': 61,
                                    'reg_alpha': 0.0002196360644948,
                                    'reg_lambda': 6.476059262144023e-06,
                                    'subsample': None,
                                    'subsample_freq': 6,
                                    'n_estimators': 229,
                                    # 'objective': 'binary',
                                    'metric': 'none',
                                    'force_col_wise': False,
                                    'verbose': -1,
                                    'is_unbalance': True,
                                    'class_weight': 'balanced'}
                                ),

        # "CBC":CatBoostClassifier(**{'bagging_temperature': None,
        #                             'boosting_type': 'Plain',
        #                             'bootstrap_type': 'MVS',
        #                             'colsample_bylevel': None,
        #                             'depth': 10,
        #                             'grow_policy': 'Depthwise',
        #                             'l2_leaf_reg': 3.267182921204289,
        #                             'learning_rate': 0.0346243575278611,
        #                             'max_leaves': None,
        #                             'min_data_in_leaf': 31,
        #                             'random_strength': 66.99954134814993,
        #                             'subsample': 0.3659716227495403,
        #                             'n_estimators': 5997,
        #                             'task_type': 'CPU',
        #                             'eval_metric': 'MultiClass',
        #                             'loss_function': 'MultiClass',
        #                             'random_seed': 13062023,
        #                             'verbose': 0,
        #                             'od_type': 'Iter',
        #                             'od_wait': 100,
        #                             'border_count': 254,
        #                             'auto_class_weights': 'Balanced'
        #                             }),
                           
        "TPFN1C": TabPFNClassifier(N_ensemble_configurations = 24, seed = 42, device='cuda:0'),
                           
        "TPFN2C": TabPFNClassifier(N_ensemble_configurations = 64, seed = 42, device='cuda:0'),
        };
    
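    def fit(self,X,y): 
        "This method fits the classifier choices to the train dataset";
        
        y = y.values;
        unique_classes, y = np.unique(y, return_inverse=True);
        self.classes_     = unique_classes;
        # EJ is already encoded to 0/1 during data import, so only raw (non-numeric) input is encoded here.
        # Working on a copy keeps the caller's dataframe unchanged and avoids the pandas copy warning.
        X                 = X.copy();
        if not pd.api.types.is_numeric_dtype(X.EJ):
            first_category = X.EJ.unique()[0];
            X.EJ           = X.EJ.eq(first_category).astype('int');
        X                 = self.imputer.fit_transform(X);
        self.scaler.fit(X)
        X                 = self.scaler.transform(X);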

        for method, classifier in tqdm(self.classifiers.items(), "--Model fit--"):
            if method.upper().startswith("TPFN"):
                classifier.fit(X,y,overwrite_warning = True);
            else:
                classifier.fit(X, y);
     
    def predict_proba(self, x):
        "This method curates predictions from the individual fitted classifiers";
        
        x = self.imputer.transform(x)
        x = self.scaler.transform(x)

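        # Stack the class probabilities of every base model: shape = (n_models, n_samples, n_classes)
        probabilities          = np.stack([classifier.predict_proba(x) for classifier in self.classifiers.values()])

        # Average over the models, then estimate how many samples fall into class 0 vs. all other classes
        averaged_probabilities = np.mean(probabilities, axis=0)
        class_0_est_instances  = averaged_probabilities[:, 0].sum()
        others_est_instances   = averaged_probabilities[:, 1:].sum()
        
        # Calculating weighted probabilities:-
        # divide the class-0 column and the remaining columns by their estimated instance counts,
        # then renormalize every row so that it sums to 1
        class_weights     = np.array([[1/(class_0_est_instances if i==0 else others_est_instances) for i in range(averaged_probabilities.shape[1])]])
        new_probabilities = averaged_probabilities * class_weights
        return new_probabilities / np.sum(new_probabilities, axis=1, keepdims=True)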
    
print();
collect();







CPU times: user 2.27 s, sys: 0 ns, total: 2.27 s
Wall time: 1.61 s
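
The re-weighting at the end of `predict_proba` is easier to follow on a small example. The sketch below repeats the arithmetic on plain Python lists so every step is visible; the function name `reweight_by_class_mass` and the numbers are assumptions for illustration, and the notebook's own implementation is the NumPy code in the `Ensemble` class.

```python
def reweight_by_class_mass(avg_proba):
    """Rescale column 0 and columns 1+ by their total mass, then renormalize each row."""
    class_0_mass = sum(row[0] for row in avg_proba)
    others_mass = sum(sum(row[1:]) for row in avg_proba)
    reweighted = []
    for row in avg_proba:
        scaled = [row[0] / class_0_mass] + [p / others_mass for p in row[1:]]
        total = sum(scaled)
        reweighted.append([p / total for p in scaled])
    return reweighted

# Hypothetical averaged probabilities for 3 rows and 3 classes (class 0 dominates).
avg = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.2, 0.4],
]
new = reweight_by_class_mass(avg)
```

Whenever the summed class-0 probability is larger than the summed probability of the other classes, as in this toy batch, every class-0 probability shrinks, which pushes the predictions towards a balanced class prior. Note that the result depends on the whole batch being predicted, not on each row alone.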


In [18]:
%%time 

# Defining the training function:-
from copy import deepcopy

def TrainMdl(model, x, y, y_meta):
    """
    This function does the following-
    1. Trains the model on each fold of the cv_inner split
    2. Collects the balanced log loss on every validation fold
    3. Keeps a copy of the model from the fold with the lowest validation loss
    
    Inputs- 
    model  - model object to be used for fit and prediction
    x      - predictors
    y      - binary target (Class), used for scoring
    y_meta - multi-class target (Alpha), used for fitting
    
    Returns- best model from the CV strategy and the list of fold losses
    """
    
    outer_results = list()
    best_model    = None
    best_loss     = np.inf
    splits        = cv_inner.get_n_splits()
    
    for split, (train_idx, val_idx) in enumerate(tqdm(cv_inner.split(x), total = splits), start = 1):
        x_train, x_val = x.iloc[train_idx], x.iloc[val_idx]
        y_train, y_val = y_meta.iloc[train_idx], y.iloc[val_idx]
                
        model.fit(x_train, y_train)
        
        y_pred        = model.predict_proba(x_val)
        # Collapse the multi-class output to binary: column 0 is class 0, the remaining columns sum to class 1
        probabilities = np.concatenate((y_pred[:,:1], np.sum(y_pred[:,1:], 1, keepdims=True)), axis=1)
        p0 = probabilities[:,:1]
        
        y_p = 1 - p0
                
        loss = ScoreMetric(y_val, y_p)

        if loss < best_loss:
            # The same model object is refit on every fold, so store a copy of the current fit
            best_model = deepcopy(model)
            best_loss  = loss
            PrintColor(f'-----> Best model saved', color = Fore.GREEN)
            
        outer_results.append(loss)
        num_space = 5 if split <= 9 else 4
        PrintColor(f"--> Fold{split}. {'-' * num_space}> CV = {loss:.5f}")
    
    PrintColor(f"\n ----> Mean CV score = {np.mean(outer_results):.5f} <----\n", color= Fore.MAGENTA)
  
    return best_model, outer_results

CPU times: user 0 ns, sys: 4 µs, total: 4 µs
Wall time: 5.96 µs
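
One detail matters when a single model object is refit on every fold and the best fit should be kept: assigning the object to another name does not copy it, so a later `fit` also changes the "saved" model. The sketch below shows the difference with a tiny stand-in class; `RunningMean` is hypothetical and only mimics an estimator whose `fit` overwrites its learned state.

```python
from copy import deepcopy

class RunningMean:
    """Hypothetical stand-in for a model: `fit` overwrites the learned state."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

model = RunningMean()
folds = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

alias_after_first = None
snapshot_after_first = None
for i, fold in enumerate(folds):
    model.fit(fold)
    if i == 0:
        alias_after_first = model               # same object, changes on the next fit
        snapshot_after_first = deepcopy(model)  # independent copy, frozen at this fit
```

After the loop the alias reflects the second fold while the deep copy still holds the first fit. Copying large models costs memory, so an alternative is to remember the best fold index and refit once at the end.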


<a id="5"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > FEATURE ENGINEERING<br><div> 
    
1. We append the `Epsilon` date from the greeks data to the train data as an additional time feature<br>
2. We over-sample the train data on the greeks `Alpha` label and then prepare the model Xtrain and ytrain data for the training and inference steps

In [9]:
%%time 

# Making use of the greeks dataset for additional information:-
times = greeks.Epsilon.copy();
times[greeks.Epsilon != 'Unknown'] = \
greeks.Epsilon[greeks.Epsilon != 'Unknown'].\
map(lambda x: datetime.strptime(x,'%m/%d/%Y').toordinal());
times[greeks.Epsilon == 'Unknown'] = np.nan;

# Appending the greeks data to the train data:-
train_pred_and_time = pd.concat((train, times), axis=1);
# test.EJ is already encoded with the train mapping during data import, so it is not re-encoded here
test_predictors     = test[predictor_columns].copy();
# Test rows have no Epsilon value: use a date one day after the latest train date
test_pred_and_time  = np.concatenate((test_predictors, np.zeros((len(test_predictors), 1)) + train_pred_and_time.Epsilon.max() + 1), axis=1);

print();
collect();


CPU times: user 80.7 ms, sys: 0 ns, total: 80.7 ms
Wall time: 80 ms
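
The date handling above can be sketched in isolation. Each `Epsilon` string is parsed with the `%m/%d/%Y` format and turned into an integer day count with `toordinal()`, `'Unknown'` becomes NaN for the imputer to fill, and the test rows receive the latest known day plus one. The helper name `epsilon_to_ordinal` and the sample strings are assumptions for illustration.

```python
from datetime import datetime
import math

def epsilon_to_ordinal(value):
    """Convert an 'm/d/YYYY' string to a proleptic Gregorian ordinal; 'Unknown' becomes NaN."""
    if value == 'Unknown':
        return math.nan
    return datetime.strptime(value, '%m/%d/%Y').toordinal()

samples = ['3/19/2019', 'Unknown', '3/20/2019']  # hypothetical Epsilon values
ordinals = [epsilon_to_ordinal(v) for v in samples]

# Value assigned to every test row: one day after the latest known date.
test_fill = max(o for o in ordinals if not math.isnan(o)) + 1
```

Using `max + 1` treats the test samples as collected after all training samples, which is an assumption about the data rather than something the test file states.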


In [10]:
%%time 

ros = RandomOverSampler(random_state = CFG.state);

train_ros, y_ros = ros.fit_resample(train_pred_and_time, greeks.Alpha);
PrintColor('\nOriginal dataset shape\n');
pprint(greeks.Alpha.value_counts());
PrintColor('\nResample dataset shape\n');
pprint( y_ros.value_counts());

x_ros = train_ros.drop(['Class', 'Id'],axis=1);
y_    = train_ros.Class;

collect();
print();

Original dataset shape
Alpha
A    509
B     61
G     29
D     18
Name: count, dtype: int64

Resample dataset shape
Alpha
B    509
A    509
D    509
G    509
Name: count, dtype: int64

CPU times: user 94.9 ms, sys: 0 ns, total: 94.9 ms
Wall time: 94.1 ms
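
The counts above (every `Alpha` class raised to 509 rows, 2036 rows in total) follow from random over-sampling: rows of the smaller classes are repeated at random until each class matches the largest one. The sketch below illustrates the idea with the standard library only; it is not imblearn's implementation, and the function name `random_oversample` and the toy labels are assumptions.

```python
import random
from collections import Counter

def random_oversample(labels, seed=42):
    """Return row indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    resampled = []
    for idx in by_class.values():
        # Keep all original rows, then draw extra rows with replacement up to `target`.
        resampled += idx + rng.choices(idx, k=target - len(idx))
    return resampled

# Toy Alpha labels (hypothetical counts): A=6, B=3, G=2, D=1.
alpha = ['A'] * 6 + ['B'] * 3 + ['G'] * 2 + ['D']
picked = random_oversample(alpha)
counts = Counter(alpha[i] for i in picked)
```

Because the notebook over-samples before the CV split, copies of one row can land in both the training and the validation fold, so the fold scores are likely optimistic.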


<a id="6"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > MODEL TRAINING<br><div> 

In [48]:
%%time

yt = Ensemble()
m, outer_results  = TrainMdl(yt, x_ros, y_, y_ros)
y_.value_counts() / y_.shape[0]

y_pred        = m.predict_proba(test_pred_and_time)
probabilities = np.concatenate((y_pred[:,:1], np.sum(y_pred[:,1:], 1, keepdims=True)), axis=1)
p0            = probabilities[:,:1]

Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters


  0%|          | 0/5 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


--Model fit--:   0%|          | 0/5 [00:00<?, ?it/s]

-----> Best model saved
--> Fold1. -----> CV = 0.01704

-----> Best model saved
--> Fold2. -----> CV = 0.01280

--> Fold3. -----> CV = 0.03071

--> Fold4. -----> CV = 0.01523

--> Fold5. -----> CV = 0.01486

 ----> Mean CV score = 0.01813 <----




CPU times: user 2min 8s, sys: 746 ms, total: 2min 8s
Wall time: 23.3 s
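
The models are fit on the four `Alpha` labels, while the submission needs two classes. Since the class order after `np.unique` is A, B, D, G and `Alpha` A marks the rows without a condition, the binary probabilities are obtained by keeping the first column as class 0 and summing the rest into class 1. A minimal sketch with made-up probabilities (the helper name `collapse_to_binary` is an assumption):

```python
def collapse_to_binary(alpha_proba):
    """Map rows of [P(A), P(B), P(D), P(G)] to [P(class 0), P(class 1)]."""
    return [[row[0], sum(row[1:])] for row in alpha_proba]

# Hypothetical multi-class predictions for two rows.
alpha_proba = [
    [0.70, 0.15, 0.05, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
binary = collapse_to_binary(alpha_proba)
p0 = [row[0] for row in binary]  # class-0 probabilities used for the submission
```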


<a id="7"></a>
# <div style= "font-family: Cambria; font-weight:bold; letter-spacing: 0px; color:#ffffff; font-size:120%; text-align:left;padding:3.0px; background: #006bb3; border-bottom: 8px solid #a6a6a6" > SUBMISSION<br><div> 

In [12]:
%%time 

if CFG.postprocess_req.upper() == "Y":
    PrintColor(f"\nPost-processing predictions with cutoffs = {CFG.lw_cutoff:.2f} {CFG.up_cutoff:.2f}\n");
    p0[p0 > CFG.up_cutoff] = 1;
    p0[p0 < CFG.lw_cutoff] = 0; 
    
else:
    PrintColor(f"Post-processing is not required", color = Fore.RED);
    
submission            = pd.DataFrame(test["Id"], columns = ["Id"]);
submission["class_0"] = p0;
submission["class_1"] = 1 - p0;

submission.to_csv('submission.csv', index = False);
submission_df = pd.read_csv('submission.csv')
display(submission_df);

collect();
print();

Post-processing predictions with cutoffs = 0.28 0.59


| | Id | class_0 | class_1 |
|---|---|---|---|
| 0 | 00eed32682bb | 0.5 | 0.5 |
| 1 | 010ebe33f668 | 0.5 | 0.5 |
| 2 | 02fa521e1838 | 0.5 | 0.5 |
| 3 | 040e15f562a2 | 0.5 | 0.5 |
| 4 | 046e85c7cc7f | 0.5 | 0.5 |



CPU times: user 83.9 ms, sys: 0 ns, total: 83.9 ms
Wall time: 82.9 ms
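
The post-processing step can be summarized as: class-0 probabilities above the upper cutoff are snapped to 1, those below the lower cutoff are snapped to 0, and everything in between is left as predicted. The sketch below shows this on a plain list using the cutoffs from `CFG`; the function name `snap_probabilities` and the demo values are assumptions for illustration.

```python
def snap_probabilities(p0, lower=0.28, upper=0.595):
    """Push confident class-0 probabilities to exactly 0 or 1; leave the rest unchanged."""
    snapped = []
    for p in p0:
        if p > upper:
            snapped.append(1.0)
        elif p < lower:
            snapped.append(0.0)
        else:
            snapped.append(p)
    return snapped

p0_demo = [0.05, 0.28, 0.40, 0.595, 0.90]
class_0 = snap_probabilities(p0_demo)
class_1 = [1 - p for p in class_0]
```

Values equal to a cutoff are left unchanged because both comparisons are strict. A snapped prediction that turns out to be wrong is clipped to 1e-15 by the metric and contributes a log loss of about 34.5 for that row before class weighting, so the cutoffs trade a small gain on confident rows against a large penalty on mistakes.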
