<a href="https://www.kaggle.com/code/ailafelixa/icr-iarc-pre-process-cat-boost-classifier?scriptVersionId=131374407" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Importing relevant packages

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
import warnings
warnings.filterwarnings('ignore')
import copy

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import catboost
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics



## Importing the data

In [2]:
# Load the data
train = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')
greeks = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/greeks.csv')
sample_submission = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv')

## Descriptive and Exploratory Data Analysis

First, let's take a look at the data.

In [3]:
train.head()

Unnamed: 0,Id,AB,AF,AH,AM,AR,AX,AY,AZ,BC,...,FL,FR,FS,GB,GE,GF,GH,GI,GL,Class
0,000ff2bfdfe9,0.209377,3109.03329,85.200147,22.394407,8.138688,0.699861,0.025578,9.812214,5.555634,...,7.298162,1.73855,0.094822,11.339138,72.611063,2003.810319,22.136229,69.834944,0.120343,1
1,007255e47698,0.145282,978.76416,85.200147,36.968889,8.138688,3.63219,0.025578,13.51779,1.2299,...,0.173229,0.49706,0.568932,9.292698,72.611063,27981.56275,29.13543,32.131996,21.978,0
2,013f2bd269f5,0.47003,2635.10654,85.200147,32.360553,8.138688,6.73284,0.025578,12.82457,1.2299,...,7.70956,0.97556,1.198821,37.077772,88.609437,13676.95781,28.022851,35.192676,0.196941,0
3,043ac50845d5,0.252107,3819.65177,120.201618,77.112203,8.138688,3.685344,0.025578,11.053708,1.2299,...,6.122162,0.49706,0.284466,18.529584,82.416803,2094.262452,39.948656,90.493248,0.155829,0
4,044fb8a146ec,0.380297,3733.04844,85.200147,14.103738,8.138688,3.942255,0.05481,3.396778,102.15198,...,8.153058,48.50134,0.121914,16.408728,146.109943,8524.370502,45.381316,36.262628,0.096614,1


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 617 entries, 0 to 616
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Id      617 non-null    object 
 1   AB      617 non-null    float64
 2   AF      617 non-null    float64
 3   AH      617 non-null    float64
 4   AM      617 non-null    float64
 5   AR      617 non-null    float64
 6   AX      617 non-null    float64
 7   AY      617 non-null    float64
 8   AZ      617 non-null    float64
 9   BC      617 non-null    float64
 10  BD      617 non-null    float64
 11  BN      617 non-null    float64
 12  BP      617 non-null    float64
 13  BQ      557 non-null    float64
 14  BR      617 non-null    float64
 15  BZ      617 non-null    float64
 16  CB      615 non-null    float64
 17  CC      614 non-null    float64
 18  CD      617 non-null    float64
 19  CF      617 non-null    float64
 20  CH      617 non-null    float64
 21  CL      617 non-null    float64
 22  CR

Most of our columns are float64. Besides 'Id', one column ('EJ') has the 'object' type, so it is worth checking why.
Columns such as 'BQ' and 'EL' have more than 50 rows with null values. We will need to look at these to decide whether some kind of imputation is needed.

### Is our training dataset balanced or imbalanced?

In [5]:
train['Class'].value_counts()

0    509
1    108
Name: Class, dtype: int64

Our dataset is highly imbalanced: negative cases outnumber positive cases by nearly 5 to 1 (509 vs. 108).
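
To put numbers on the imbalance, the counts printed above can be turned into a ratio and a positive-class share. This is a minimal sketch that only uses those two counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 509, 1: 108}

total = sum(counts.values())
imbalance_ratio = counts[0] / counts[1]   # negatives per positive
positive_share = counts[1] / total        # fraction of positive cases

print(f"Negatives per positive: {imbalance_ratio:.2f}")
print(f"Positive share: {positive_share:.1%}")
```

With these counts there are roughly 4.7 negatives per positive, and positives make up about 17.5% of the rows.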

## Data pre-processing

### Splitting the training dataset into training and validation sets

In [6]:
# First, let's split the dataset into training and validation sets
# As our dataset is imbalanced, we stratify on the target to keep the same class proportions in both sets

X = copy.deepcopy(train.drop('Id', axis=1))
y = X.pop('Class')

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, stratify=y, random_state=42)
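
What `stratify=y` aims for can be illustrated with a simplified, pure-Python sketch: each class is shuffled and cut separately, so both parts keep roughly the original class proportions. The helper `stratified_split` is hypothetical and is not how scikit-learn implements the split; in particular, scikit-learn's handling of rounding may differ in detail.

```python
import random
from collections import Counter

def stratified_split(labels, train_fraction, seed=42):
    """Return (train_idx, valid_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return train_idx, valid_idx

# Toy labels with the same class counts as the training data
labels = [0] * 509 + [1] * 108
train_idx, valid_idx = stratified_split(labels, train_fraction=0.8)

train_counts = Counter(labels[i] for i in train_idx)
valid_counts = Counter(labels[i] for i in valid_idx)
print(train_counts, valid_counts)
```

In this sketch the training part ends up with 407 negatives and 86 positives, so the positive share stays close to the 17.5% seen in the full dataset.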

### Pre-processing the column with the 'object' type

In [7]:
# First, let's see how many distinct values the 'object' column has

train['EJ'].value_counts()


B    395
A    222
Name: EJ, dtype: int64

There are only two distinct values in this column, so we can simply replace them with 1 and 0.

In [8]:
#Transforming the column with 'object' type

X_train['EJ'] = X_train['EJ'].replace({'A': 1, 'B': 0})
X_valid['EJ'] = X_valid['EJ'].replace({'A': 1, 'B': 0})

In [9]:
X_train['EJ'].value_counts()

0    318
1    175
Name: EJ, dtype: int64

## Applying standard scaler

In [10]:
X_train.describe()

Unnamed: 0,AB,AF,AH,AM,AR,AX,AY,AZ,BC,BD,...,FI,FL,FR,FS,GB,GE,GF,GH,GI,GL
count,493.0,493.0,493.0,493.0,493.0,493.0,493.0,493.0,493.0,493.0,...,493.0,492.0,493.0,491.0,493.0,493.0,493.0,493.0,493.0,492.0
mean,0.483672,3451.565811,115.762651,41.105671,9.883313,5.551565,0.063965,10.642512,8.740344,5442.788608,...,10.099123,5.535944,4.028661,0.437762,20.971784,129.42817,14228.741789,31.570198,50.623218,8.463101
std,0.494641,2005.366931,116.261447,75.086262,8.676564,2.595,0.465331,4.195261,72.693664,3296.277834,...,2.973876,12.24672,56.098927,1.448711,10.167604,144.2564,18288.411077,9.889031,36.053497,10.291116
min,0.081187,192.59328,85.200147,3.177522,8.138688,0.699861,0.025578,3.396778,1.2299,2103.14378,...,3.58345,0.173229,0.49706,0.06773,4.102182,72.611063,13.038894,9.432735,0.897628,0.001129
25%,0.252107,2184.24239,85.200147,12.270314,8.138688,4.101717,0.025578,8.173694,1.2299,4158.01198,...,8.495533,0.173229,0.49706,0.06773,14.46461,72.611063,2798.992584,25.113029,23.065696,0.123954
50%,0.350386,3045.933,85.200147,20.505237,8.138688,5.031912,0.025578,10.593662,1.2299,5083.82127,...,9.862757,3.000258,1.12085,0.257374,19.041194,72.611063,7521.784092,30.750344,41.416916,0.337827
75%,0.559763,4376.46594,109.742109,42.221401,8.138688,6.511365,0.038367,13.057744,4.891488,6129.660335,...,11.329215,6.122686,1.50742,0.54184,25.515386,126.7679,19035.70924,37.21,67.931664,21.978
max,6.161666,18964.47278,1910.123198,630.51823,173.534448,38.27088,10.315851,30.192882,1463.693448,53060.59924,...,35.851039,137.932739,1244.22702,31.365763,135.781294,1497.351958,143790.0712,81.210825,191.194764,21.978


As we can see, the features are on very different scales. Let's standardize them before continuing with the other pre-processing steps.

In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)
column_names = X_train.columns.tolist()

# Transform the training data
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=column_names)

# Transform the validation data using the same scaler
X_valid_scaled = pd.DataFrame(scaler.transform(X_valid), columns=column_names)

X_train = X_train_scaled
X_valid = X_valid_scaled
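
The idea behind the scaling step can be sketched for a single column: the mean and standard deviation are learned on the training part only, and the same two numbers are then applied to the validation part via z = (x - mean) / std. The helpers below are illustrative, not scikit-learn code; they use the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation of one column."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in values]

# Toy column: statistics are learned on the training part only
train_col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
valid_col = [5.0, 11.0]

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
valid_scaled = standardize(valid_col, mean, std)

print(mean, std)
print(valid_scaled)
```

Reusing the training statistics for the validation data keeps information from the validation set out of the pre-processing.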

## Handling missing values

In [12]:
# Identify the columns that contain nulls

cols_with_missings = [col for col in X_train.columns
                        if X_train[col].isnull().any()]

print('Columns with nulls: ', cols_with_missings)

#Imputing using SimpleImputer

my_imputer = SimpleImputer(strategy="mean")
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

X_train = imputed_X_train
X_valid = imputed_X_valid

Columns with nulls:  ['BQ', 'CB', 'CC', 'DU', 'EL', 'FC', 'FL', 'FS', 'GL']
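
Mean imputation can be sketched for one column as follows: the fill value is the mean of the observed training entries, and the same value is reused for the validation data. The helper names are hypothetical, and `None` stands in for NaN. Because the data was standardized beforehand, the training mean of each column should be close to 0 (assuming the scaler ignores missing values when fitting), so the fill value here is effectively about 0.

```python
def fit_mean(values):
    """Mean of the observed (non-None) entries of one column."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def impute(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]

# Toy standardized column with gaps (None stands in for NaN)
train_col = [-1.5, None, 0.5, 1.0, None]
valid_col = [None, 2.0]

fill = fit_mean(train_col)            # learned on the training part only
train_filled = impute(train_col, fill)
valid_filled = impute(valid_col, fill)
print(fill, valid_filled)
```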


In [13]:
# Verify that no nulls remain

cols_with_missings = [col for col in X_train.columns
                        if X_train[col].isnull().any()]
print('Columns with nulls: ', cols_with_missings)

Columns with nulls:  []


## Balancing the dataset using SMOTE

In [14]:
smote = SMOTE(sampling_strategy=1.0, random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [15]:
y_train.value_counts()

0    407
1    407
Name: Class, dtype: int64
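
The core idea of SMOTE is to create each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The function below is a simplified illustration of that idea, not the imbalanced-learn implementation; `smote_sample` and the toy points are assumptions made for the example.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating towards a near neighbour."""
    rng = rng or random.Random(42)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (excluding itself)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority-class points in two dimensions
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [3.0, 3.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority points. With `sampling_strategy=1.0`, enough such points are generated for the positive class to match the 407 negatives in the training split.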

## Selecting the model type

In [16]:
# Fit CatBoost model
cat_model = catboost.CatBoostClassifier()
cat_model.fit(X_train, y_train, verbose=False)

# Predict on the validation set and calculate log loss
cat_pred_proba = cat_model.predict_proba(X_valid)[::,1]
cat_ll = metrics.log_loss(y_valid, cat_pred_proba)

# Calculate ROC curve and AUC score for CatBoost
cat_fpr, cat_tpr, _ = metrics.roc_curve(y_valid, cat_pred_proba)
cat_auc = metrics.auc(cat_fpr, cat_tpr)

# Fit LightGBM model
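# verbose=-1 silences LightGBM's logging; recent LightGBM releases no longer accept `verbose` in fit()
lgb_model = lgb.LGBMClassifier(verbose=-1)
lgb_model.fit(X_train, y_train)

# Predict on the validation set and calculate log loss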
lgb_pred_proba = lgb_model.predict_proba(X_valid)[::,1]
lgb_ll = metrics.log_loss(y_valid, lgb_pred_proba)

# Calculate ROC curve and AUC score for LightGBM
lgb_fpr, lgb_tpr, _ = metrics.roc_curve(y_valid, lgb_pred_proba)
lgb_auc = metrics.auc(lgb_fpr, lgb_tpr)

# Fit XGBoost model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train, verbose=False)

# Predict on the validation set and calculate log loss
xgb_pred_proba = xgb_model.predict_proba(X_valid)[::,1]
xgb_ll = metrics.log_loss(y_valid, xgb_pred_proba)

# Calculate ROC curve and AUC score for XGBoost
xgb_fpr, xgb_tpr, _ = metrics.roc_curve(y_valid, xgb_pred_proba)
xgb_auc = metrics.auc(xgb_fpr, xgb_tpr)

# Fit Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict on the validation set and calculate log loss
rf_pred_proba = rf_model.predict_proba(X_valid)[::,1]
rf_ll = metrics.log_loss(y_valid, rf_pred_proba)

# Calculate ROC curve and AUC score for RF
rf_fpr, rf_tpr, _ = metrics.roc_curve(y_valid, rf_pred_proba)
rf_auc = metrics.auc(rf_fpr, rf_tpr)


# Print the results
print("CatBoost LogLoss: {:.2f}".format(cat_ll))
print("CatBoost AUC: {:.2f}".format(cat_auc))
print("-"*40)
print("LightGBM LogLoss: {:.2f}".format(lgb_ll))
print("LightGBM AUC: {:.2f}".format(lgb_auc))
print("-"*40)
print("Random Forest LogLoss: {:.2f}".format(rf_ll))
print("Random Forest AUC: {:.2f}".format(rf_auc))
print("-"*40)
print("XGBoost LogLoss: {:.2f}".format(xgb_ll))
print("XGBoost AUC: {:.2f}".format(xgb_auc))

CatBoost LogLoss: 0.19
CatBoost AUC: 0.97
----------------------------------------
LightGBM LogLoss: 0.30
LightGBM AUC: 0.96
----------------------------------------
Random Forest LogLoss: 0.28
Random Forest AUC: 0.95
----------------------------------------
XGBoost LogLoss: 0.26
XGBoost AUC: 0.96
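
For reference, the two metrics used in this comparison can be written out in a few lines. Log loss is the mean negative log-likelihood of the true labels (lower is better), and AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative (higher is better). The functions below are a plain-Python sketch of these definitions, not the scikit-learn implementations, and the toy labels and scores are made up.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under p_pred."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def roc_auc(y_true, p_pred):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
p_pred = [0.1, 0.4, 0.35, 0.8]
print(round(log_loss(y_true, p_pred), 4), roc_auc(y_true, p_pred))
```

AUC only depends on the ranking of the scores, while log loss also penalizes poorly calibrated probabilities, which is why the two metrics can order the models differently.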


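CatBoost achieves the lowest log loss (0.19) and the highest AUC (0.97) on the validation set, so let's continue with the CatBoost classifier!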

In [17]:
model = catboost.CatBoostClassifier()
model.fit(X_train, y_train, verbose=False)

y_pred_proba = model.predict_proba(X_valid)
log_loss = metrics.log_loss(y_valid, y_pred_proba)
print("Log loss: ", log_loss)

Log loss:  0.19383100548033255


Let's test some new parameters.

In [18]:
model = catboost.CatBoostClassifier(eval_metric='Logloss',
                                    depth=4,
                                    n_estimators=500,
                                    )

model.fit(X_train, y_train, verbose=False)

y_pred_proba = model.predict_proba(X_valid)
log_loss = metrics.log_loss(y_valid, y_pred_proba)
print("Log loss: ", log_loss)

Log loss:  0.1892589357772616


In [19]:
# Initialize the CatBoostClassifier with L2 leaf regularization
model = catboost.CatBoostClassifier(eval_metric='Logloss',
                           depth=4,
                           n_estimators=500,
                           l2_leaf_reg=1,  # L2 regularization coefficient
                           random_seed=1)

# Fit the model to the training data
model.fit(X_train, y_train, verbose=False)

# Get feature importances from the fitted model
feature_importances = np.abs(model.feature_importances_)

# Create a dictionary mapping feature names to importances
importance_dict = dict(zip(X_train.columns, feature_importances))

# Sort the feature importances in descending order
sorted_importances = sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)

# Print the feature importances
for feature, importance in sorted_importances:
    print(feature, importance)
    
y_pred_proba = model.predict_proba(X_valid)
log_loss = metrics.log_loss(y_valid, y_pred_proba)
print("Log loss: ", log_loss)

DU 20.00365591419002
AB 9.262896843034882
BQ 7.060639815757501
CR 5.266313767885252
EB 3.2084207330940253
EU 3.072283645828158
CD  2.8357015756218034
AF 2.8163280959070733
DA 2.7662267635959483
GL 2.6848406441506167
BC 2.557841182421223
CC 2.533165939871479
FL 2.1629631731951275
DY 2.0531424636791975
DL 2.0007255498163787
DN 1.6788064254448603
BN 1.6506982999941369
EE 1.55358316127208
EP 1.4086350965563206
CB 1.3911477937861336
DH 1.3075809411183255
EL 1.2793753892375597
CH 1.2521997537023386
AM 1.1411745802855244
DE 1.1245754348511534
FI 1.1096281573832059
CU 1.0872334218118747
DF 1.0403337351443835
EH 0.9835553210304959
CF 0.9246957760935975
FR 0.9194900777550807
FE 0.8111246538570905
FD  0.7644933498967662
CW  0.7384522834463644
AX 0.670488452709826
EJ 0.6433493397007402
CS 0.6391840930980888
EG 0.5238648573248533
GE 0.4593079650320112
DV 0.452010291482242
BP 0.4387004053765765
GI 0.4339217676194196
AR 0.3941264543808079
BZ 0.3514766211214465
CL 0.3443027857022532
BR 0.3272333220288

In [20]:
# Initialize the CatBoostClassifier with L2 leaf regularization
model = catboost.CatBoostClassifier(eval_metric='Logloss',
                           depth=4,
                           n_estimators=500,
                           l2_leaf_reg=1,  # L2 regularization coefficient
                           random_seed=1)

# Fit the model to the training data
model.fit(X_train, y_train, verbose=False)

# Get feature importances from the fitted model
feature_importances = np.abs(model.feature_importances_)

# Create a dictionary mapping feature names to importances
importance_dict = dict(zip(X_train.columns, feature_importances))

# Set a threshold for importance
threshold = 0.5

# Drop the features with importance below the threshold
selected_features = [feature for feature, importance in importance_dict.items() if importance >= threshold]

# Filter the datasets with the selected features
X_train_filtered = X_train[selected_features]
X_valid_filtered = X_valid[selected_features]

# Print the dropped features
dropped_features = [feature for feature in X_train.columns if feature not in selected_features]
print("Dropped features:")
print(dropped_features)

Dropped features:
['AH', 'AR', 'AY', 'AZ', 'BD ', 'BP', 'BR', 'BZ', 'CL', 'DI', 'DV', 'FC', 'FS', 'GB', 'GE', 'GF', 'GH', 'GI']
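
The threshold-based selection can be checked on a handful of the importances printed earlier. The sketch below copies six rounded values from that output and applies the same `>= 0.5` rule; the reduced dictionary is only for illustration.

```python
# A few importances copied (rounded) from the CatBoost output above
importance_dict = {
    "DU": 20.0037, "AB": 9.2629, "EJ": 0.6433,
    "EG": 0.5239, "GE": 0.4593, "BR": 0.3272,
}
threshold = 0.5

# Keep features at or above the threshold, drop the rest
selected = [f for f, imp in importance_dict.items() if imp >= threshold]
dropped = [f for f in importance_dict if f not in selected]

print("Selected:", selected)
print("Dropped:", dropped)
```

As far as I know, CatBoost's default importances are normalized to sum to 100 across all features, so a threshold of 0.5 would correspond to dropping features that contribute less than about half a percent each.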


In [21]:
model.fit(X_train_filtered, y_train, verbose=False)

y_pred_proba = model.predict_proba(X_valid_filtered)
log_loss = metrics.log_loss(y_valid, y_pred_proba)
print("Log loss: ", log_loss)

Log loss:  0.18022987527094997


In [22]:
X_train = X_train_filtered
X_valid = X_valid_filtered

## Generating the final predictions

In [23]:
#Pre-processing the test dataset
X_test = copy.deepcopy(test)
X_test = X_test.drop('Id', axis=1)
# Encode the 'object' column as numbers, as was done for the training data
X_test['EJ'] = test['EJ'].replace({'A': 1, 'B': 0})

In [24]:
#Imputing features with null values

cols_with_missings = [col for col in X_test.columns
                        if X_test[col].isnull().any()]

print('Columns with nulls: ', cols_with_missings)

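# Imputing using SimpleImputer (note: this fits a new imputer on the test data instead of reusing the one fit on the training data)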

my_imputer = SimpleImputer(strategy="mean")
imputed_test = pd.DataFrame(my_imputer.fit_transform(X_test))

imputed_test.columns = X_test.columns

X_test = imputed_test

Columns with nulls:  []


In [25]:
X_combined = pd.concat([X_train, X_valid], axis=0)

In [26]:
y_combined = pd.concat([y_train, y_valid], axis=0)

In [27]:
model.fit(X_combined,y_combined)

0:	learn: 0.6677553	total: 2.84ms	remaining: 1.42s
1:	learn: 0.6349290	total: 5.61ms	remaining: 1.4s
2:	learn: 0.6109247	total: 8.18ms	remaining: 1.35s
3:	learn: 0.5888601	total: 10.6ms	remaining: 1.31s
4:	learn: 0.5681910	total: 13ms	remaining: 1.29s
5:	learn: 0.5487293	total: 15.5ms	remaining: 1.27s
6:	learn: 0.5343916	total: 17.9ms	remaining: 1.26s
7:	learn: 0.5161711	total: 20.4ms	remaining: 1.25s
8:	learn: 0.5008905	total: 22.9ms	remaining: 1.25s
9:	learn: 0.4915873	total: 25.4ms	remaining: 1.24s
10:	learn: 0.4772918	total: 27.8ms	remaining: 1.23s
11:	learn: 0.4636592	total: 30.2ms	remaining: 1.23s
12:	learn: 0.4464776	total: 32.4ms	remaining: 1.22s
13:	learn: 0.4342310	total: 34.7ms	remaining: 1.2s
14:	learn: 0.4231787	total: 36.9ms	remaining: 1.19s
15:	learn: 0.4107874	total: 39.5ms	remaining: 1.19s
16:	learn: 0.3991171	total: 42.3ms	remaining: 1.2s
17:	learn: 0.3884456	total: 44.7ms	remaining: 1.2s
18:	learn: 0.3797309	total: 47.2ms	remaining: 1.19s
19:	learn: 0.3709960	total: 

<catboost.core.CatBoostClassifier at 0x7ce3c6a62770>

In [28]:
sample_submission.head()

Unnamed: 0,Id,class_0,class_1
0,00eed32682bb,0.5,0.5
1,010ebe33f668,0.5,0.5
2,02fa521e1838,0.5,0.5
3,040e15f562a2,0.5,0.5
4,046e85c7cc7f,0.5,0.5


In [29]:
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=column_names)
X_test = X_test_scaled

In [30]:
X_test_filtered = X_test[selected_features]
X_test = X_test_filtered

In [31]:
pred_probs = model.predict_proba(X_test)

data = {'Id': test.Id, 'class_0': pred_probs[:, 0], 'class_1': pred_probs[:, 1]}
df = pd.DataFrame(data)
print(df.head())

             Id   class_0   class_1
0  00eed32682bb  0.998247  0.001753
1  010ebe33f668  0.998247  0.001753
2  02fa521e1838  0.998247  0.001753
3  040e15f562a2  0.998247  0.001753
4  046e85c7cc7f  0.998247  0.001753


In [32]:
df.to_csv('submission.csv', index=False)
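
Before submitting, it can be worth checking that each row's probabilities sum to 1 and that the columns follow the `sample_submission` layout. The sketch below does this for two rows copied from the preview above, writing to an in-memory buffer instead of a file; it is an optional sanity check under those assumptions, not part of the original pipeline.

```python
import csv
import io

# Two rows copied from the prediction preview above
rows = [
    {"Id": "00eed32682bb", "class_0": 0.998247, "class_1": 0.001753},
    {"Id": "010ebe33f668", "class_0": 0.998247, "class_1": 0.001753},
]

# Each row's class probabilities should sum to 1 (up to rounding)
for row in rows:
    assert abs(row["class_0"] + row["class_1"] - 1.0) < 1e-6

# Write the rows in the sample_submission column order
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Id", "class_0", "class_1"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```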