Competition Homepage: https://www.kaggle.com/competitions/icr-identify-age-related-conditions
- Notebook: exp/04_logistics_bayesian_v4.1.py; CV: na / Public LB: 0.25 / Private LB: 0.45 (with interaction terms).
- Notebook: exp/04_logistics_bayesian_v4.2.py; CV: na / Public LB: 0.24 / Private LB: 0.47 (with interaction and quadratic terms).
This was my go-to model when I joined this competition because it is resilient against overfitting, especially on a dataset this small.
The model consists of a linear combination of:
- Metric terms;
- A group (feature EJ) term;
- Interaction terms (basically x*y);
- Quadratic terms (x^2).
To make the model less susceptible to overfitting, the weights of the metric, interaction, and quadratic features are given StudentT priors. Moreover, I added a guess term to account for mislabeled data samples.
Mathematically, the model looks like this:

$$\operatorname{logit}(p_i) = \alpha_{\mathrm{EJ}(i)} + \sum_j \beta_j x_{ij} + \sum_{j<k} \gamma_{jk}\, x_{ij} x_{ik} + \sum_j \delta_j x_{ij}^2,
\qquad
\hat{p}_i = \tfrac{1}{2}\,\pi_{\text{guess}} + (1 - \pi_{\text{guess}})\, p_i,$$

with StudentT priors on the weights $\beta$, $\gamma$, and $\delta$, and $\pi_{\text{guess}}$ being the probability that a label is a pure guess.

The Bayesian model will produce all possible predictions (posterior samples) for each data point $i$. We want to optimize the balanced log loss, which has the following form:

$$\mathcal{L} = -\frac{1}{2}\left( \frac{1}{N_0} \sum_{i:\, y_i = 0} \log(1 - \hat{p}_i) + \frac{1}{N_1} \sum_{i:\, y_i = 1} \log \hat{p}_i \right),$$

where $N_0$ and $N_1$ are the numbers of class-0 and class-1 samples. We can rewrite this as a loss for each data point:

$$\mathcal{L}_i = -\frac{1 - y_i}{2 N_0} \log(1 - \hat{p}_i) - \frac{y_i}{2 N_1} \log \hat{p}_i.$$

With this form, we can find the best point prediction $\hat{p}_i$ by minimizing the expected per-point loss over the posterior,

$$\mathbb{E}[\mathcal{L}_i] = -\frac{1 - q_i}{2 N_0} \log(1 - \hat{p}_i) - \frac{q_i}{2 N_1} \log \hat{p}_i,$$

where $q_i$ is the posterior mean probability that $y_i = 1$. The prediction that optimizes the log loss satisfies:

$$\hat{p}_i^{*} = \frac{N_0\, q_i}{N_0\, q_i + N_1\,(1 - q_i)}.$$
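As a concrete illustration, here is a minimal PyMC sketch of this setup together with the post-processing step above. The prior scales, the Beta prior on the guess term, and the sampler settings are illustrative assumptions rather than the exact values in the notebooks; `X` is assumed to already contain the metric, interaction, and quadratic columns.

```python
import numpy as np
import pymc as pm

def fit_bayesian_logistic(X, y, ej_idx, n_groups=2):
    """X: standardized design matrix (metric + interaction + quadratic columns);
    y: binary target; ej_idx: integer-coded EJ group for each row."""
    with pm.Model():
        # Heavy-tailed StudentT priors shrink the weights but still allow a few
        # large coefficients, which guards against overfitting on a small dataset.
        beta = pm.StudentT("beta", nu=3, mu=0, sigma=1, shape=X.shape[1])
        alpha = pm.Normal("alpha", mu=0, sigma=3, shape=n_groups)  # per-EJ-group intercept

        # Guess term: with probability `guess` the label is a coin flip,
        # which absorbs mislabeled samples.
        guess = pm.Beta("guess", alpha=1, beta=9)

        logit = alpha[ej_idx] + pm.math.dot(X, beta)
        p = pm.Deterministic("p", guess * 0.5 + (1 - guess) * pm.math.sigmoid(logit))
        pm.Bernoulli("obs", p=p, observed=y)

        idata = pm.sample(1000, tune=1000, target_accept=0.95)
    return idata

def optimal_prediction(q, y_train):
    """Map posterior-mean probabilities q to the point prediction that
    minimizes the balanced log loss (the formula derived above)."""
    n0, n1 = np.sum(y_train == 0), np.sum(y_train == 1)
    return q * n0 / (q * n0 + (1 - q) * n1)
```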
- Notebook: exp/11.1e_features_selection_sampler_neuralnetwork.py; F1 CV: 0.91 ± 0.05 / Log Loss CV: 0.27 ± 0.06 / Public LB: 0.29 / Private LB: 0.56
This model is a simple 3-layer neural network with 1024 neurons in the first layer, 512 in the second, and 64 in the last. On its own it didn't work very well with the current setup, so I added several regularization techniques (sketched in code after the list):
- Dropout and Layer normalization;
- Weight decay of 1.0;
- Gaussian noise added to the training data;
- Early stopping;
- Forcing the output of the last layer to be normally distributed and uncorrelated.
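A rough PyTorch sketch of this architecture and the first few regularizers follows. The dropout rate, noise scale, learning rate, and placeholder feature count are illustrative assumptions; the constraint on the 64-d embedding distribution is only indicated by a comment.

```python
import torch
import torch.nn as nn

class EmbeddingMLP(nn.Module):
    """1024 -> 512 -> 64 backbone with LayerNorm + Dropout; the 64-d output is the
    embedding that is later pushed toward a normal, uncorrelated distribution."""
    def __init__(self, n_features, p_drop=0.3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(n_features, 1024), nn.LayerNorm(1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, 512), nn.LayerNorm(512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 64), nn.LayerNorm(64), nn.ReLU(),
        )
        self.head = nn.Linear(64, 1)  # binary logit

    def forward(self, x):
        emb = self.backbone(x)
        return self.head(emb).squeeze(-1), emb

n_features = 128  # placeholder; the real count depends on the engineered feature set
model = EmbeddingMLP(n_features)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

def add_gaussian_noise(x, sigma=0.05):
    """Input-noise regularizer, applied to training batches only."""
    return x + sigma * torch.randn_like(x)
```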
The input data to the model were preprocessed with the following steps (a code sketch follows the list):
- Replace missing values with the column median;
- Normalize the data;
- Create interaction and quadratic terms using all columns;
- Remove correlated features using agglomerative clustering with a correlation threshold of 0.2;
- Data oversampling using SMOTE.
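The sketch below reproduces that pipeline with scikit-learn, SciPy, and imbalanced-learn. The clustering step is my reading of "agglomerative clustering with a 0.2 correlation threshold" (merge features whose absolute correlation exceeds 0.2 and keep one representative per cluster), not the exact notebook code, and for brevity the transformers are fit on the data passed in rather than on training folds only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from imblearn.over_sampling import SMOTE

def preprocess(X_train, y_train, corr_threshold=0.2):
    # 1) Median imputation and standardization
    X = SimpleImputer(strategy="median").fit_transform(X_train)
    X = StandardScaler().fit_transform(X)

    # 2) Interaction (x*y) and quadratic (x^2) terms for all columns
    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

    # 3) Cluster features whose absolute correlation exceeds the threshold
    #    (distance = 1 - |corr|) and keep one representative per cluster
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    clusters = fcluster(
        linkage(squareform(dist, checks=False), method="average"),
        t=1.0 - corr_threshold, criterion="distance",
    )
    keep = [np.where(clusters == c)[0][0] for c in np.unique(clusters)]
    X = X[:, keep]

    # 4) Oversample the minority class
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y_train)
    return X_res, y_res
```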
The model was trained with stratified k-fold cross-validation using 10 folds. The best model from each fold is then used to produce out-of-fold predictions, as outlined below.
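In outline, the out-of-fold loop looks like this; `train_fold` is a hypothetical stand-in for the per-fold training routine with early stopping.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def oof_predictions(X, y, train_fold, n_splits=10, seed=0):
    """train_fold(X_tr, y_tr, X_va, y_va) -> fitted model exposing predict_proba."""
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in skf.split(X, y):
        model = train_fold(X[tr_idx], y[tr_idx], X[va_idx], y[va_idx])
        oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]  # probability of class 1
    return oof
```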
- Notebook: exp/14_neuralnetwork_mixture_model.py; F1 CV: 0.91 ± 0.06 / Log Loss CV: 0.30 ± 0.04 / Public LB: 0.35 / Private LB: 0.54
Upon investigating the performance of the 3-layer neural network model above across the 10 folds, I saw that performance was noticeably worse in folds #6, #7, and #9. Thus, I hypothesized that the performance of the model might improve if the embeddings (the output of the last layer) followed a Gaussian mixture distribution.
Therefore, I implemented a Gaussian mixture model that receives the class membership of each data point (one-hot encoded) and predicts the mean and variance of the 64 embedding dimensions. The mean and variance vectors are then plugged into the unnormalized PDF along with the embeddings to calculate the negative log-likelihood (see the sketch below).
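A minimal sketch of that component, assuming two classes and 64-d embeddings. The write-up does not say exactly which terms the "unnormalized PDF" drops, so here only the constant 0.5*log(2*pi) is omitted; the parametrization via two bias-free linear layers is also my assumption.

```python
import torch
import torch.nn as nn

class ClassConditionalGaussianNLL(nn.Module):
    """Maps one-hot class membership to a mean and log-variance over the 64-d
    embedding space, then scores embeddings by Gaussian negative log-likelihood."""
    def __init__(self, n_classes=2, emb_dim=64):
        super().__init__()
        self.mu = nn.Linear(n_classes, emb_dim, bias=False)       # per-class mean
        self.log_var = nn.Linear(n_classes, emb_dim, bias=False)  # per-class log-variance

    def forward(self, one_hot, emb):
        mu = self.mu(one_hot)            # (batch, 64)
        log_var = self.log_var(one_hot)  # (batch, 64)
        # NLL of a diagonal Gaussian; the constant 0.5*log(2*pi) is dropped
        # ("unnormalized" density).
        nll = 0.5 * ((emb - mu) ** 2 * torch.exp(-log_var) + log_var)
        return nll.sum(dim=1).mean()
```

This NLL can then serve as an auxiliary training loss that pulls the embeddings of each class toward their own Gaussian component.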