The [first place solution](https://www.kaggle.com/competitions/playground-series-s4e10/discussion/543725) was done by [Hardy Xu](https://www.kaggle.com/hardyxu52), and he called it **CatBoost All The Way Down**.

No file was attached to the post, so this file documents his method in full, using his complete post and his responses to comments.

# Original Submission

### 1st Place Solution - CatBoost All The Way Down

Hey Kagglers! I used to be pretty active in these playground competitions, but after the December 2023 competition I took a break from Kaggle. On a whim I decided to start working on this one about 10 days ago, and it's been as much of a thrill as it ever was. Getting 1st place was a surprise, to be sure, but a welcome one!

#### Cross-Validation

I'm sure you've heard this before, but setting up a robust cross-validation scheme for evaluating your predictions is VERY important to doing well in these competitions. I see lots of questions from folks on what kind of feature engineering to do, how best to ensemble models, how to impute data, etc. For the vast majority of these questions, there's no single answer that holds for every dataset. The only way to find out what works for a particular dataset is to try various options and see what performs best, and that's where cross-validation comes in. In these playground competitions, the data is usually split 60-40 between the train and test sets, and 20% of the test set is used for the public leaderboard. That means a CV score measures your performance on 60% of the entire dataset, whereas the public leaderboard measures it on only 8% (20% of the 40% test share), making cross-validation performance a much more reliable indicator of progress than public leaderboard performance. All of the decisions below were made to optimize my cross-validation performance.
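
The arithmetic behind that comparison can be checked directly. This is only a minimal sketch: the function name `dataset_shares` is hypothetical, and the 60-40 and 20% figures are the approximate ones quoted above, not exact properties of any one competition.

```python
from fractions import Fraction

def dataset_shares(train_frac, public_frac_of_test):
    """Return (cv_share, public_lb_share, private_lb_share) of the whole dataset."""
    test_frac = 1 - train_frac
    public = test_frac * public_frac_of_test  # slice of the test set scored publicly
    private = test_frac - public              # remainder decides the final ranking
    return train_frac, public, private

# Fractions keep the arithmetic exact; floats are used only for display.
cv, public, private = dataset_shares(Fraction(60, 100), Fraction(20, 100))
print(float(cv), float(public), float(private))  # 0.6 0.08 0.32
```

Under these figures, a CV score is computed on 7.5 times as many rows as the public leaderboard score.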

#### Data Preprocessing

Shoutout to various members of the community for the tip to treat the numerical features as categorical. What I found most effective was to keep both the numeric feature and a categorical copy of it. I didn't do any other feature engineering, as my experience from past playground competitions has usually been that feature engineering is of little use. I did include the original dataset.
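
A minimal sketch of the "keep both" idea follows, written in plain Python so it runs without extra libraries. The helper name `add_categorical_copies`, the `_cat` suffix, and the column names are all hypothetical; the post does not say how the copies were built. With pandas, the usual equivalent would be something like `df[col + "_cat"] = df[col].astype(str)`.

```python
def add_categorical_copies(rows, numeric_cols, suffix="_cat"):
    """Return new rows where each numeric column also appears as a string-valued copy."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for col in numeric_cols:
            # The string copy lets a model treat each distinct value as its own category,
            # while the original column still carries the numeric ordering.
            new_row[col + suffix] = str(row[col])
        out.append(new_row)
    return out

# Illustrative rows with made-up column names and values.
rows = [{"feature_a": 5000, "feature_b": 23}, {"feature_a": 12000, "feature_b": 41}]
augmented = add_categorical_copies(rows, ["feature_a", "feature_b"])
```

The categorical copies would then be declared as categorical features to the model, while the numeric originals are passed through unchanged.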

#### Modelling

My general approach here is the same as the one I used last competition. For each of XGBoost, LightGBM, and CatBoost, I used Optuna to find 10 different sets of 'optimal hyperparameters' and averaged their predictions to get an overall prediction for each. Shoutout to @omidbaghchehsaraei's post here for the tip to use large max_bin values. I also added a Neural Network that was heavily inspired by @paddykb's notebook here. The performance of each of these models is as follows:

| Model | CV Score | Public LB | Private LB |
| --- | --- | --- | --- |
| LightGBM | .96811 | .97005 | .96637 |
| XGBoost | .96767 | .96989 | .96540 |
| CatBoost | .96972 | .97299 | .96865 |
| NN | .96678 | .97088 | .96577 |
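
The per-model averaging described above (one set of predictions per hyperparameter set, combined with equal weight) can be sketched as follows. The function name and the numbers are illustrative assumptions; only three short prediction sets are shown instead of ten.

```python
def average_predictions(prediction_sets):
    """Average several equally long lists of predicted probabilities row by row."""
    n_sets = len(prediction_sets)
    n_rows = len(prediction_sets[0])
    if any(len(p) != n_rows for p in prediction_sets):
        raise ValueError("all prediction sets must cover the same rows")
    return [sum(p[i] for p in prediction_sets) / n_sets for i in range(n_rows)]

# Three made-up prediction sets for the same three rows.
preds = [[0.10, 0.80, 0.40], [0.20, 0.90, 0.30], [0.30, 0.70, 0.50]]
blended = average_predictions(preds)  # approximately [0.2, 0.8, 0.4]
```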


What I think might have been my secret sauce was that for each of these model predictions, I trained a CatBoost model using the initial model predictions as a baseline. An example of how to do this can be found here. I'm not sure exactly what inspired me to do this; perhaps it was seeing how amazingly well CatBoost performed on this data. To my surprise, CatBoost was able to significantly improve the performance of each of these model predictions, even the ones that were originally generated using CatBoost. The performance of these CatBoost-improved models is as follows:

| Initial Model | CV Score | Public LB | Private LB |
| --- | --- | --- | --- |
| LightGBM | .96856 | .97048 | .96713 |
| XGBoost | .96815 | .97024 | .96611 |
| CatBoost | .96997 | .97334 | .96903 |
| NN | .96732 | .97117 | .96667 |
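
For readers who want to try the baseline idea, here is a rough sketch using CatBoost's `Pool(..., baseline=...)` argument. It is not the author's code or the example linked in the post. It assumes a Logloss objective, in which case the baseline is expected on the raw log-odds scale, so the step-1 probabilities are converted first. The helper names are hypothetical, and the choice to add the test-row baseline back manually at prediction time is an assumption about how the two stages combine.

```python
import math

def prob_to_logit(p, eps=1e-6):
    """Convert a probability to log-odds, clipping to avoid infinities at 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def logit_to_prob(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_catboost_on_baseline(X, y, base_prob, cat_features=None, **params):
    """Train CatBoost starting from earlier predictions instead of the default start."""
    # Imported lazily so the helpers above work without catboost installed.
    from catboost import CatBoostClassifier, Pool
    baseline = [prob_to_logit(p) for p in base_prob]
    pool = Pool(X, label=y, cat_features=cat_features, baseline=baseline)
    model = CatBoostClassifier(**params)
    model.fit(pool)
    return model

def predict_with_baseline(model, X, base_prob, cat_features=None):
    """Add the trees' raw correction to the step-1 log-odds, then map back to probabilities."""
    from catboost import Pool
    pool = Pool(X, cat_features=cat_features)  # no baseline here: only the learned correction
    correction = model.predict(pool, prediction_type="RawFormulaVal")
    return [logit_to_prob(prob_to_logit(p) + float(c)) for p, c in zip(base_prob, correction)]
```

In a cross-validated setup, `base_prob` for the training rows would presumably be out-of-fold predictions, so that the second-stage model does not start from predictions that have already seen the labels.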

I find it impressive that the CatBoost model that used CatBoost predictions as a baseline would have been enough for 3rd place. CatBoost was the king for this comp! The final step was a Neural Network to stack these 4 predictions together. This squeezed out the extra last bit of performance needed to bring the solution to the top.

| CV Score | Public LB | Private LB |
| --- | --- | --- |
| .97059 | .97344 | .96938 |
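
The post does not describe the stacking network's architecture. As a stand-in, the sketch below uses the simplest possible "network": a single sigmoid unit (equivalent to logistic regression) trained by batch gradient descent on the four base predictions. All names and data values are made up for illustration, and a real stacker would be fit on out-of-fold predictions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_log_loss(y, p, eps=1e-12):
    return -sum(t * math.log(max(q, eps)) + (1 - t) * math.log(max(1 - q, eps))
                for t, q in zip(y, p)) / len(y)

def stack_predict(weights, bias, rows):
    return [sigmoid(bias + sum(w * x for w, x in zip(weights, row))) for row in rows]

def fit_stacker(rows, y, lr=0.5, epochs=200):
    """Fit one sigmoid unit on the base-model probabilities with batch gradient descent."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        preds = stack_predict(weights, bias, rows)
        errors = [p - t for p, t in zip(preds, y)]  # gradient of log loss w.r.t. the raw score
        for j in range(k):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, rows)) / n
        bias -= lr * sum(errors) / n
    return weights, bias

# Each row holds one example's four base predictions (illustrative values only).
rows = [[0.9, 0.8, 0.95, 0.85], [0.2, 0.3, 0.1, 0.25],
        [0.7, 0.6, 0.8, 0.65], [0.1, 0.2, 0.05, 0.15]]
y = [1, 0, 1, 0]
weights, bias = fit_stacker(rows, y)
stacked = stack_predict(weights, bias, rows)
```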

## Helpful Comments

Yes, I did use early stopping, although in most instances I early stopped on the optimal Log Loss rather than AUC (the exception was when evaluating hyperparameters). I think this produced better generalization performance, though I'm not entirely sure.
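
To make the distinction between the two metrics concrete, here is a small plain-Python sketch (the helper names and numbers are illustrative, not taken from the post). AUC depends only on how the predictions are ranked, whereas Log Loss also penalizes poorly calibrated probabilities, so the two can point to different stopping rounds.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def early_stop_round(losses, patience):
    """Index of the best (lowest) loss seen before `patience` rounds pass without improvement."""
    best_i = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_i]:
            best_i = i
        elif i - best_i >= patience:
            break
    return best_i

y = [0, 0, 1, 1]
calibrated = [0.1, 0.4, 0.35, 0.8]
shrunk = [p / 10 for p in calibrated]  # same ranking, much worse calibration

same_auc = roc_auc(y, calibrated) == roc_auc(y, shrunk)        # ranking is unchanged
worse_loss = log_loss(y, shrunk) > log_loss(y, calibrated)     # Log Loss notices the shrinkage
best = early_stop_round([0.50, 0.45, 0.44, 0.46, 0.47, 0.43], patience=2)
```

Here `best` is round 2: training stops two rounds after the last improvement, so the later 0.43 is never reached.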

Yes, I did maintain a consistent CV strategy. However, to keep my CV estimates of performance from being overly optimistic, I varied the seed depending on the task: one seed for generating predictions, a different seed for evaluating hyperparameters (also with a different CV scheme in this case), and yet another seed for ensembling predictions.

For step 1 I kept predicted probabilities, not 0/1 predictions. Each model was trained 10 times with different sets of hyperparameters to get 10 sets of predicted probabilities that were averaged together.

For step 2, the predictions were still made on the train data. To clarify, the predictions from the previous step are not used as an additional feature here, but they're used as starting points for the CatBoost predictions. When training CatBoost (and gradient boosting algorithms in general), every row has a default prediction value that the algorithm starts from. By comparing how far off this prediction is from the correct value, the algorithm learns to refine this value as it constructs more trees. Instead of having CatBoost use the default prediction values at the start, I had it use the values obtained from step 1.
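
The mechanics described above can be illustrated with a deliberately tiny toy. This is not CatBoost's algorithm: fixed groups stand in for tree leaves, and every round adds one gradient step per group. The names and data are made up. The point is only that the raw scores start from the step-1 predictions (on the log-odds scale) rather than from a default value, and boosting then refines them.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def mean_log_loss(y, prob, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1.0 - p, eps))
                for t, p in zip(y, prob)) / len(y)

def boost_from_baseline(groups, y, baseline_prob, n_rounds=20, lr=1.0):
    """Toy boosting: each round adds one per-group correction to the raw scores.

    groups[i] plays the role of the leaf that row i falls into.
    """
    raw = [logit(p) for p in baseline_prob]  # start from step-1 predictions, not a default
    for _ in range(n_rounds):
        sums, counts = {}, {}
        for g, t, r in zip(groups, y, raw):
            sums[g] = sums.get(g, 0.0) + (t - sigmoid(r))  # negative gradient of log loss
            counts[g] = counts.get(g, 0) + 1
        raw = [r + lr * sums[g] / counts[g] for g, r in zip(groups, raw)]
    return [sigmoid(r) for r in raw]

groups = ["a", "a", "b", "b", "b"]
y = [1, 1, 0, 0, 1]
step1_prob = [0.6, 0.7, 0.5, 0.4, 0.6]  # illustrative step-1 predictions

refined = boost_from_baseline(groups, y, step1_prob)
before = mean_log_loss(y, step1_prob)
after = mean_log_loss(y, refined)
```

With zero rounds the function simply returns the step-1 predictions, which is what "using the predictions as a starting point" means here; each additional round only adjusts them.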

I used 5-fold cross-validation with 2 repeats, giving me 10 different train-test splits. For each of these splits, I ran an Optuna study to optimize hyperparameters for that split. Taking the optimal hyperparameters from each of those 10 splits gave me 10 different sets of hyperparameters that I used to generate model predictions with a different k-fold setup.
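
A minimal sketch of how 5 folds with 2 repeats yield 10 splits, in plain Python. The function name and seed are hypothetical, and this version is not stratified; the post does not say whether the author's splits were.

```python
import random

def repeated_kfold(n_rows, n_splits=5, n_repeats=2, seed=0):
    """Yield (train_idx, valid_idx) pairs: n_splits folds for each of n_repeats shuffles."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_rows))
        rng.shuffle(order)  # a fresh shuffle per repeat gives different folds
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            valid = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, valid

splits = list(repeated_kfold(n_rows=20, n_splits=5, n_repeats=2, seed=42))
print(len(splits))  # 10
```

Each of the 10 `(train, valid)` pairs would then get its own hyperparameter study, producing one set of tuned hyperparameters per split.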

This is the set I used for CatBoost as an example. It only includes the hyperparameters that were specifically optimized in Optuna.

[{'bootstrap_type': 'Bernoulli',
'depth': 7,
'reg_lambda': 18.072311740276326,
'subsample': 0.75},

{'bootstrap_type': 'Bernoulli',
'depth': 6,
'reg_lambda': 15.55567035414246,
'subsample': 0.65},

{'bootstrap_type': 'Bernoulli',
'depth': 8,
'reg_lambda': 15.20382054295614,
'subsample': 0.85},

{'bootstrap_type': 'Bayesian',
'depth': 8,
'reg_lambda': 30.899963577024543,
'bagging_temperature': 1.4142844438284334},

{'bootstrap_type': 'Bernoulli',
'depth': 9,
'reg_lambda': 7.088272373863902,
'subsample': 0.30000000000000004},

{'bootstrap_type': 'Bernoulli',
'depth': 8,
'reg_lambda': 7.007118903793957,
'subsample': 0.4},

{'bootstrap_type': 'Bayesian',
'depth': 8,
'reg_lambda': 8.520228449144428,
'bagging_temperature': 1.4936853757213577},

{'bootstrap_type': 'Bernoulli',
'depth': 7,
'reg_lambda': 2.255594506488112,
'subsample': 0.6},

{'bootstrap_type': 'Bayesian',
'depth': 6,
'reg_lambda': 8.900380995166119,
'bagging_temperature': 1.6284468341652825},

{'bootstrap_type': 'Bayesian',
'depth': 8,
'reg_lambda': 40.81634255865609,
'bagging_temperature': 0.8294626793120554}]
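
In the list above, `subsample` appears only alongside `'Bernoulli'` and `bagging_temperature` only alongside `'Bayesian'`, which suggests a conditional search space. The sketch below shows one way such a space could be written with Optuna's `suggest_*` methods. The function name and every range and step are assumptions chosen to cover the values shown; the post does not give the actual search space.

```python
def suggest_catboost_params(trial):
    """Conditional search space: the sampling parameter depends on bootstrap_type."""
    params = {
        "bootstrap_type": trial.suggest_categorical("bootstrap_type", ["Bernoulli", "Bayesian"]),
        "depth": trial.suggest_int("depth", 4, 10),                            # assumed range
        "reg_lambda": trial.suggest_float("reg_lambda", 1.0, 50.0, log=True),  # assumed range
    }
    if params["bootstrap_type"] == "Bernoulli":
        # Row subsampling rate; range and step are assumptions.
        params["subsample"] = trial.suggest_float("subsample", 0.3, 0.9, step=0.05)
    else:
        # Bayesian bootstrap is controlled by a temperature instead; range is an assumption.
        params["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0.0, 2.0)
    return params
```

Inside an Optuna objective this would be called as `params = suggest_catboost_params(trial)`, with the returned dictionary merged into whatever fixed CatBoost settings are in use.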