# Kaggle Challenge using LazyPredict  

Kaggle Playground Challenge Season 3 Episode 10.   

In this notebook I try LazyPredict and CalibratedClassifierCV for the first time.

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (15, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

import warnings
warnings.simplefilter(action='ignore')

import opendatasets as od

# Importing Data

In [2]:
od.download('https://www.kaggle.com/competitions/playground-series-s3e10/data?select=train.csv')

Skipping, found downloaded files in ".\playground-series-s3e10" (use force=True to force download)


In [3]:
od.download('https://www.kaggle.com/datasets/brsdincer/pulsar-classification-for-class-prediction')

Skipping, found downloaded files in ".\pulsar-classification-for-class-prediction" (use force=True to force download)


In [4]:
url = './playground-series-s3e10/'

train = pd.read_csv(url + 'train.csv').drop(columns='id')
test = pd.read_csv(url + 'test.csv').drop(columns='id')
submission = pd.read_csv(url + 'sample_submission.csv')

original = pd.read_csv('./pulsar-classification-for-class-prediction/Pulsar.csv')

# EDA and Feature Engineering

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117564 entries, 0 to 117563
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Mean_Integrated       117564 non-null  float64
 1   SD                    117564 non-null  float64
 2   EK                    117564 non-null  float64
 3   Skewness              117564 non-null  float64
 4   Mean_DMSNR_Curve      117564 non-null  float64
 5   SD_DMSNR_Curve        117564 non-null  float64
 6   EK_DMSNR_Curve        117564 non-null  float64
 7   Skewness_DMSNR_Curve  117564 non-null  float64
 8   Class                 117564 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 8.1 MB


No missing values.

In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78377 entries, 0 to 78376
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Mean_Integrated       78377 non-null  float64
 1   SD                    78377 non-null  float64
 2   EK                    78377 non-null  float64
 3   Skewness              78377 non-null  float64
 4   Mean_DMSNR_Curve      78377 non-null  float64
 5   SD_DMSNR_Curve        78377 non-null  float64
 6   EK_DMSNR_Curve        78377 non-null  float64
 7   Skewness_DMSNR_Curve  78377 non-null  float64
dtypes: float64(8)
memory usage: 4.8 MB


In [7]:
original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Mean_Integrated       17898 non-null  float64
 1   SD                    17898 non-null  float64
 2   EK                    17898 non-null  float64
 3   Skewness              17898 non-null  float64
 4   Mean_DMSNR_Curve      17898 non-null  float64
 5   SD_DMSNR_Curve        17898 non-null  float64
 6   EK_DMSNR_Curve        17898 non-null  float64
 7   Skewness_DMSNR_Curve  17898 non-null  float64
 8   Class                 17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


In [8]:
# ignore_index=True avoids duplicated row labels after stacking the two frames
train = pd.concat([train, original], ignore_index=True)

In [9]:
train.columns

Index(['Mean_Integrated', 'SD', 'EK', 'Skewness', 'Mean_DMSNR_Curve',
       'SD_DMSNR_Curve', 'EK_DMSNR_Curve', 'Skewness_DMSNR_Curve', 'Class'],
      dtype='object')

## New Features  

These features were suggested by ChatGPT. Adding them does not seem to improve performance.

### Range of the Integrated Profile

In [10]:
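# Note: max - min over a whole column is a single scalar, so this feature is
# constant within each dataset (and takes a different value in train and test).
train['int_prof_range'] = train.Mean_Integrated.max() - train.Mean_Integrated.min()
test['int_prof_range'] = test.Mean_Integrated.max() - test.Mean_Integrated.min()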

### Coefficient of Variation

In [55]:
train['coefficient_of_variation'] = train.SD / train.Mean_Integrated 
test['coefficient_of_variation'] = test.SD / test.Mean_Integrated 

### Signal to Noise Ratio 

In [12]:
train['signal_to_noise_ratio'] = train.Mean_Integrated / train.SD
test['signal_to_noise_ratio'] = test.Mean_Integrated / test.SD
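
One possible reason these features do not help can be seen on a small toy frame. The sketch below uses made-up values (not the competition data) and the same three formulas as the cells above. It suggests that the range feature is a constant column and that the signal-to-noise ratio is just the reciprocal of the coefficient of variation, so the two ratio features carry the same information.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (values are made up for illustration)
toy = pd.DataFrame({
    "Mean_Integrated": [100.0, 120.0, 80.0, 140.0],
    "SD": [40.0, 50.0, 35.0, 45.0],
})

# max - min of a column is one scalar, so every row gets the same value
toy["int_prof_range"] = toy["Mean_Integrated"].max() - toy["Mean_Integrated"].min()

# Same formulas as in the notebook cells
toy["coefficient_of_variation"] = toy["SD"] / toy["Mean_Integrated"]
toy["signal_to_noise_ratio"] = toy["Mean_Integrated"] / toy["SD"]

# Number of distinct values in the range feature
n_unique_range = toy["int_prof_range"].nunique()
print("distinct values in int_prof_range:", n_unique_range)

# Check whether the two ratio features are reciprocals of each other
are_reciprocal = np.allclose(
    toy["signal_to_noise_ratio"], 1 / toy["coefficient_of_variation"]
)
print("SNR equals 1 / CV:", are_reciprocal)
```

For tree-based models such as XGBoost, a feature that is a monotone transform of another one usually adds little, which would be consistent with the observation above.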

## Dividing in X and y

In [10]:
X = train.drop(columns='Class')
y = train.Class

In [11]:
y.value_counts()

0    122856
1     12606
Name: Class, dtype: int64
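
The counts above show a clear class imbalance. As a quick worked example, the snippet below takes the two counts from the output and computes the positive rate and the accuracy of a trivial model that always predicts class 0. It is only arithmetic on the printed numbers, meant to show why accuracy alone is a weak metric here.

```python
# Class counts copied from the value_counts() output above
n_negative = 122856
n_positive = 12606

total = n_negative + n_positive
positive_rate = n_positive / total

# Accuracy of a trivial model that always predicts class 0
majority_baseline_accuracy = n_negative / total

print(f"total rows:        {total}")
print(f"positive rate:     {positive_rate:.4f}")
print(f"always-0 accuracy: {majority_baseline_accuracy:.4f}")
```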

## Scaling  

After some trials, RobustScaler seems to be the best scaler.

In [12]:
numerical_cols = X.select_dtypes(np.number).columns.to_list()

In [13]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler().fit(X[numerical_cols])

X[numerical_cols] = scaler.transform(X[numerical_cols])
test[numerical_cols] = scaler.transform(test[numerical_cols])
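
To make clear what RobustScaler does with its default settings, here is a minimal sketch on a made-up feature with one extreme value. It centers on the median and divides by the interquartile range, which is why outliers influence it less than they would a mean/standard-deviation scaler. The array `x` is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Same computation by hand: subtract the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

matches = np.allclose(scaled, manual)
print("sklearn result matches manual computation:", matches)
```

Note that in the cell above the scaler is fitted on all of `X` before the train/validation split, so the validation rows contribute to the median and IQR. Fitting on the training split only would be the stricter choice.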

# Splitting  

A validation set is needed here because the test set does not include the target variable.

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.2, random_state=42)
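
Because the classes are imbalanced, a stratified split is one option worth considering. The sketch below uses synthetic data (not the competition data) with roughly 9% positives and passes `stratify=` to `train_test_split`, so that both splits keep nearly the same positive rate. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 10,000 rows, roughly 9% positives
X_demo = rng.normal(size=(10_000, 8))
y_demo = (rng.random(10_000) < 0.09).astype(int)

# stratify=y_demo keeps the class ratio (almost) identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"positive rate overall:    {y_demo.mean():.4f}")
print(f"positive rate train:      {y_tr.mean():.4f}")
print(f"positive rate validation: {y_va.mean():.4f}")
```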

# Using LazyPredict

In [32]:
import lazypredict
from lazypredict.Supervised import LazyClassifier

In [33]:
clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)

models, predictions = clf.fit(X_train, X_val, y_train, y_val)

 14%|█▍        | 4/29 [00:29<03:09,  7.57s/it]

CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)


 48%|████▊     | 14/29 [00:38<00:18,  1.27s/it]

LabelPropagation model failed to execute
Unable to allocate 87.5 GiB for an array with shape (108369, 108369) and data type float64
LabelSpreading model failed to execute
Unable to allocate 87.5 GiB for an array with shape (108369, 108369) and data type float64


 66%|██████▌   | 19/29 [00:42<00:07,  1.27it/s]

NuSVC model failed to execute
specified nu is infeasible


 90%|████████▉ | 26/29 [01:27<00:22,  7.61s/it]

StackingClassifier model failed to execute
StackingClassifier.__init__() missing 1 required positional argument: 'estimators'


100%|██████████| 29/29 [01:32<00:00,  3.20s/it]


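The results are sorted by ROC AUC, since that is the metric used to evaluate the challenge. Note that the ROC AUC column equals Balanced Accuracy in every row, which suggests it was computed from hard class predictions rather than predicted probabilities, so it is only a rough guide to the leaderboard metric.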

In [38]:
models.sort_values('ROC AUC', ascending=False)

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| LGBMClassifier | 0.99 | 0.96 | 0.96 | 0.99 | 0.45 |
| XGBClassifier | 0.99 | 0.96 | 0.96 | 0.99 | 4.42 |
| BernoulliNB | 0.97 | 0.96 | 0.96 | 0.98 | 0.09 |
| RandomForestClassifier | 0.99 | 0.96 | 0.96 | 0.99 | 22.33 |
| QuadraticDiscriminantAnalysis | 0.98 | 0.96 | 0.96 | 0.98 | 0.11 |
| ExtraTreesClassifier | 0.99 | 0.96 | 0.96 | 0.99 | 4.87 |
| KNeighborsClassifier | 0.99 | 0.96 | 0.96 | 0.99 | 1.7 |
| AdaBoostClassifier | 0.99 | 0.95 | 0.95 | 0.99 | 5.96 |
| BaggingClassifier | 0.99 | 0.95 | 0.95 | 0.99 | 8.31 |
| SVC | 0.99 | 0.95 | 0.95 | 0.99 | 22.07 |
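
In the table, ROC AUC and Balanced Accuracy are identical for every model. That is what one would expect if ROC AUC is computed from 0/1 predictions instead of probabilities. The sketch below illustrates the effect on synthetic data with a plain LogisticRegression as a stand-in model. It is an assumption about how the table was produced, not a statement about LazyPredict's internals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, only for illustration
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_va)                # 0/1 predictions
probabilities = clf.predict_proba(X_va)[:, 1]  # estimated P(class = 1)

auc_from_labels = roc_auc_score(y_va, hard_labels)
auc_from_probabilities = roc_auc_score(y_va, probabilities)
balanced_accuracy = balanced_accuracy_score(y_va, hard_labels)

print(f"ROC AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:          {balanced_accuracy:.4f}")
print(f"ROC AUC from probabilities: {auc_from_probabilities:.4f}")
```

For a binary problem, ROC AUC computed from hard labels reduces to the mean of the true positive rate and the true negative rate, which is exactly balanced accuracy. To compare models on the competition metric, scoring `predict_proba` output on the validation set is the more faithful check.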


# Model 

In [14]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

Some hyperparameter ranges are taken from: https://medium.com/broadhorizon-cmotions/hyperparameter-tuning-for-hyperaccurate-xgboost-model-d6e6b8650a11  

This is only one of the runs made; it is provided as an example.  

I decided to fit RandomizedSearchCV on the entire training set (`X`, `y`) rather than on the training split only. 

In [74]:
classification = RandomizedSearchCV(XGBClassifier(n_jobs=-1, random_state=42), {
    'max_depth': [1,2,3,4,5,6,7,8,9,10],
    'learning_rate': [.001, .005, .01, .05, .1, .2, .3],
    'n_estimators': [300, 500, 1000, 1500],
    'min_child_weight': [1,2,3,4,5,6,7,8,9,10],
    'subsample': [.4, .5, .6, .7, .8, .9, 1],
    'colsample_bytree': [.4, .5, .6, .7, .75, .77, .79 ,.8, .85],
    'gamma': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    'reg_lambda':[0, 0.5, 1, 1.5, 2, 3, 4.5]
}, cv=5, return_train_score=False, scoring='roc_auc', n_iter=5)

classification.fit(X,y)

pd.DataFrame(classification.cv_results_).sort_values('rank_test_score').head(3)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_subsample | param_reg_lambda | param_n_estimators | param_min_child_weight | param_max_depth | param_learning_rate | ... | param_colsample_bytree | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 75.452836 | 4.589257 | 0.218348 | 0.016156 | 0.6 | 1.5 | 1500 | 9 | 5 | 0.01 | ... | 0.8 | {'subsample': 0.6, 'reg_lambda': 1.5, 'n_estim... | 0.995351 | 0.996883 | 0.995511 | 0.996081 | 0.98397 | 0.993559 | 0.004825 | 1 |
| 4 | 17.87039 | 0.1361 | 0.071134 | 0.000813 | 0.9 | 1.0 | 300 | 8 | 8 | 0.3 | ... | 0.85 | {'subsample': 0.9, 'reg_lambda': 1, 'n_estimat... | 0.99302 | 0.995048 | 0.994067 | 0.994451 | 0.981519 | 0.991621 | 0.005094 | 2 |
| 0 | 113.248182 | 5.96696 | 0.435441 | 0.008272 | 0.8 | 0.0 | 1500 | 10 | 9 | 0.1 | ... | 0.6 | {'subsample': 0.8, 'reg_lambda': 0, 'n_estimat... | 0.992911 | 0.995261 | 0.993452 | 0.994193 | 0.981556 | 0.991474 | 0.005021 | 3 |
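
To put `n_iter=5` in perspective, the snippet below counts how many distinct combinations the grid above contains. It copies the value lists from the search cell and multiplies their lengths. This is plain arithmetic, not part of the original run.

```python
from math import prod

# Same value lists as in the RandomizedSearchCV call above
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3],
    "n_estimators": [300, 500, 1000, 1500],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    "colsample_bytree": [0.4, 0.5, 0.6, 0.7, 0.75, 0.77, 0.79, 0.8, 0.85],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "reg_lambda": [0, 0.5, 1, 1.5, 2, 3, 4.5],
}

n_combinations = prod(len(values) for values in param_grid.values())
n_iter = 5    # combinations sampled, as in the search above
cv_folds = 5  # folds per combination, as in the search above

print(f"possible combinations:   {n_combinations:,}")
print(f"combinations sampled:    {n_iter}")
print(f"cross-validation fits:   {n_iter * cv_folds}")
```

With so few samples from such a large grid, repeated runs can land on quite different parameter sets, which fits the remark that this is only one of several runs. Passing `random_state` to RandomizedSearchCV would make a given run reproducible.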


Checking the best params

In [75]:
classification.best_params_

{'subsample': 0.6,
 'reg_lambda': 1.5,
 'n_estimators': 1500,
 'min_child_weight': 9,
 'max_depth': 5,
 'learning_rate': 0.01,
 'gamma': 0.0,
 'colsample_bytree': 0.8}

Writing the model with its best parameters.

In [98]:
model = XGBClassifier(
    n_jobs=-1,
    n_estimators=1500,
    max_depth=5,
    subsample=0.6,
    gamma=0,
    colsample_bytree=0.8,
    learning_rate=0.01,
    reg_lambda=1.5,
    min_child_weight=9,
    random_state=42,
)

# Calibrating Probabilities  

Calibrating the predicted probabilities can lead to a better score.  
Of the calibration methods tried, isotonic gave the best performance in this case.  

Tutorial used for CalibratedClassifierCV: https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/

In [18]:
from sklearn.calibration import CalibratedClassifierCV

In [99]:
calibrated_model = CalibratedClassifierCV(model, method='isotonic', cv=5)

calibrated_model.fit(X,y)

CalibratedClassifierCV(base_estimator=XGBClassifier(base_score=None,
                                                    booster=None,
                                                    callbacks=None,
                                                    colsample_bylevel=None,
                                                    colsample_bynode=None,
                                                    colsample_bytree=0.5,
                                                    early_stopping_rounds=None,
                                                    enable_categorical=False,
                                                    eval_metric=None, gamma=0,
                                                    gpu_id=None,
                                                    grow_policy=None,
                                                    importance_type=None,
                                                    interaction_constraints=None,
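
As a rough illustration of how to check whether calibration helps, the sketch below compares a raw and an isotonic-calibrated model on a held-out split. It uses synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for the tuned XGBClassifier, so the numbers say nothing about the competition data. All names ending in `_demo` are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, only for illustration
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Uncalibrated stand-in model
raw_model = GradientBoostingClassifier(n_estimators=50, random_state=42)
raw_model.fit(X_tr, y_tr)

# Same kind of model wrapped in isotonic calibration with 5-fold CV
calibrated_demo = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    method="isotonic",
    cv=5,
)
calibrated_demo.fit(X_tr, y_tr)

# Compare ranking quality (ROC AUC) and probability quality (Brier score)
results = {}
for name, fitted in [("raw", raw_model), ("isotonic", calibrated_demo)]:
    proba = fitted.predict_proba(X_va)[:, 1]
    results[name] = {
        "roc_auc": roc_auc_score(y_va, proba),
        "brier": brier_score_loss(y_va, proba),
    }
    print(f"{name:>8}: ROC AUC={results[name]['roc_auc']:.4f}  "
          f"Brier={results[name]['brier']:.4f}")
```

ROC AUC depends only on how the predictions are ranked, so calibration tends to change it much less than it changes the Brier score. Any gain on the leaderboard metric should therefore be verified on a validation split rather than assumed.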
                                            

# Submission

Storing predictions

In [96]:
preds = calibrated_model.predict_proba(test)[:,1]

In [93]:
submission['Class'] = preds

In [97]:
submission.to_csv('fifteenth_attempt.csv', index=False)
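
Before uploading, a few sanity checks on the submission frame can catch simple mistakes. The helper below is a hypothetical sketch, not part of the original notebook. It assumes the submission has exactly the columns `id` and `Class`, and that `Class` holds probabilities.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, n_test_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a probability submission."""
    # Column layout is an assumption about the sample submission file
    assert list(submission.columns) == ["id", "Class"], "unexpected columns"
    assert len(submission) == n_test_rows, "row count differs from the test set"
    assert submission["Class"].notna().all(), "missing predictions"
    assert submission["Class"].between(0, 1).all(), "probabilities must be in [0, 1]"


# Toy example with made-up ids and predictions
toy_submission = pd.DataFrame(
    {"id": [117564, 117565, 117566], "Class": [0.02, 0.97, 0.10]}
)
check_submission(toy_submission, n_test_rows=3)
print("submission looks valid")
```

In the notebook this could be called as `check_submission(submission, n_test_rows=len(test))` right before writing the CSV.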

LazyPredict has been very helpful in finding a good model for this case, even though almost all of the top models perform about the same.  

Personally, I decided to use XGBClassifier since it is the one I know best.  

It is important to first find the best model and then tune its hyperparameters. In this case I fitted RandomizedSearchCV on the entire dataset in order to use all the training data provided by Kaggle. The results could be quite different from those obtained on the split used for LazyPredict, but the suggestion provided by LazyPredict still holds.  

So far the best result was obtained without calibrating the probabilities and without adding new features.