## Reverse Engineering the LB dataset

This notebook attempts to reverse engineer the LB dataset, that is, to infer the true values for the portion of the test dataset used to compute the leaderboard score.

I hope the organizers of this competition will not mind this attempt. I think reverse engineering happens in all Kaggle competitions. Because this notebook is public, the results are shared among all participants and do not give anyone an unfair advantage. Moreover, reverse engineering the score is an intellectual challenge in its own right, and it can reveal patterns in the data that a standard approach would miss.

This notebook is inspired by the following discussion and notebook from the Binary Rainfall competition:

- https://www.kaggle.com/competitions/playground-series-s5e3/discussion/568865
- https://www.kaggle.com/code/act18l/lb-probing-hitchhiker-version

## Strategy

- We select the best-scoring submissions on the leaderboard among the public notebooks
- We read the submission file of each of these notebooks
- Since the leaderboard score is a Spearman correlation, only the rank of each predicted value matters. Thus, we convert the predicted values to ranks from 1 to 35
- We use a combinatorial approach, as in the notebooks linked above, to infer the original values
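The rank conversion in the third step can be sketched as follows. This is a minimal illustration on made-up data, assuming the ranking is done within each row across the 35 columns `C1`..`C35`; the `spearman` helper and the random `target` are hypothetical and only serve to show that a monotone transform of the predictions leaves the score unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = [f"C{i}" for i in range(1, 36)]

# Hypothetical prediction row and made-up "truth" for the same 35 columns.
row = pd.Series(rng.normal(size=35), index=cols)
target = pd.Series(rng.normal(size=35), index=cols)

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return np.corrcoef(a.rank(), b.rank())[0, 1]

# Continuous random values have no ties, so the ranks are exactly 1..35.
ranks = row.rank().astype(int)

# A strictly increasing transform, or the ranks themselves, give the same score.
same_after_exp = np.isclose(spearman(row, target), spearman(np.exp(row), target))
same_with_ranks = np.isclose(spearman(row, target), spearman(ranks, target))
```

Because only the ordering survives, two submissions with very different raw values can carry exactly the same information about the hidden targets.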


## Applications

- By applying this method, we can infer the structure of the dataset used to compute the LB score
- We want to understand how the LB score is computed. Is it based on the first rows of the test dataset? On the first columns? Or a mix of the two?
- This information may be used to provide additional data to train a model.
- In addition, this notebook collects the best public submissions for this challenge. Feel free to combine them to build additional models (stacking).
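One simple way to combine the collected submissions is sketched below. Note that this is rank averaging, a lighter blend than stacking proper (no meta-model is trained). The function name `rank_average` and the random frames are hypothetical; the sketch assumes every submission has the same rows and the same columns `C1`..`C35`.

```python
import numpy as np
import pandas as pd

def rank_average(frames):
    """Average the per-row ranks of several prediction tables of identical shape."""
    ranked = [f.rank(axis=1) for f in frames]  # rank across columns within each row
    return sum(ranked) / len(ranked)

rng = np.random.default_rng(1)
cols = [f"C{i}" for i in range(1, 36)]

# Two made-up submissions standing in for real submission.csv files.
sub_a = pd.DataFrame(rng.normal(size=(4, 35)), columns=cols)
sub_b = pd.DataFrame(rng.normal(size=(4, 35)), columns=cols)

blend = rank_average([sub_a, sub_b])
```

Averaging ranks rather than raw values keeps submissions on a common scale, which matters when the metric is itself rank based.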

## Importing scores

In [3]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [4]:
!uv pip install --system --quiet scikit-learn==1.6.1 ortools

### Read score 

Remember to add the source notebooks as Inputs to this notebook.

In [19]:
!ls /kaggle/input/*/submission.csv

/kaggle/input/0-61128-v4-ai-hackathon-pipeline-v4/submission.csv
/kaggle/input/ai-hackathon-pipeline/submission.csv
/kaggle/input/celltypepipeline/submission.csv
/kaggle/input/gpu-celltype-pipeline-feature-engineering-incl/submission.csv
/kaggle/input/gpu-celltype-pipeline-gridsearchcv/submission.csv
/kaggle/input/gpu-celltype-pipeline/submission.csv
/kaggle/input/optuna-eda-isotonic-regression/submission.csv


In [17]:
pd.read_csv("/kaggle/input/0-61128-v4-ai-hackathon-pipeline-v4/submission.csv").iloc[:, 1:36]

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,...,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35
0,-3.163321,0.315301,-0.741791,-0.180956,0.672437,0.898118,0.012866,0.023510,1.850663,0.257863,...,0.000567,1.727454,0.049003,-0.039811,0.000358,0.063324,0.319172,0.000192,-0.002971,0.053911
1,-2.932460,0.298874,-0.686479,-0.167479,0.893317,0.782124,0.016019,0.029567,1.609244,0.283112,...,0.000556,1.505301,0.042701,-0.037713,0.000446,0.061913,0.294717,0.000215,-0.002737,0.049310
2,1.717432,-0.049089,0.421992,0.098908,0.475189,-0.343789,0.008841,0.011537,-0.679399,0.033295,...,-0.000037,-0.661449,-0.018740,0.024574,0.000177,0.019195,-0.078359,0.000052,0.001725,-0.011230
3,0.719556,0.026008,0.184250,0.041867,0.685467,-0.132145,0.012132,0.018874,-0.252003,0.105690,...,0.000100,-0.253619,-0.007177,0.010711,0.000285,0.028716,-0.001256,0.000102,0.000774,0.000966
4,-2.057826,0.233963,-0.477802,-0.117211,0.968669,0.532044,0.016905,0.030606,1.097312,0.260118,...,0.000457,1.024719,0.029072,-0.026630,0.000459,0.054330,0.220761,0.000203,-0.001889,0.036905
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2083,0.983079,0.004227,0.246397,0.056351,0.074716,-0.049958,0.003201,0.000963,-0.071242,0.000058,...,0.000019,-0.098057,-0.002759,0.016657,0.000026,0.024570,-0.007986,0.000019,0.000996,0.001412
2084,-0.466061,0.112392,-0.099149,-0.026751,0.126093,0.320564,0.004292,0.004311,0.683757,0.065614,...,0.000198,0.614638,0.017451,-0.002430,0.000077,0.037651,0.110221,0.000060,-0.000399,0.020800
2085,-1.370933,0.180341,-0.314781,-0.078520,0.274369,0.523027,0.006660,0.009744,1.093744,0.124654,...,0.000319,1.004564,0.028508,-0.014826,0.000157,0.046161,0.181179,0.000100,-0.001263,0.032140
2086,0.672360,0.029950,0.173133,0.039285,0.806663,-0.149802,0.013903,0.022422,-0.290623,0.126449,...,0.000115,-0.287080,-0.008127,0.009598,0.000336,0.029494,-0.000341,0.000118,0.000734,0.000808


In [None]:
preds=np.array([
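    # First entry: the full submission table (no column/row slice is applied here, unlike the entries below).
    pd.read_csv("/kaggle/input/0-61128-v4-ai-hackathon-pipeline-v4/submission.csv"),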
    pd.read_csv("/kaggle/input/rapids-knn-starter-ensemble-lb-0-961-wow/submission_ensemble.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rain-or-shine-rainfall-prediction-with-ml/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/xgboost-starter-ensemble-lb-0-935-wow/submission_ensemble.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/deployment-streamlit-cnn-for-good-resume/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/0-96218-logistic-regression-plus-ensemble/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-pred-logistic-regression-plus-ensemble/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/playgrounds5e3-baseline-v2/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/weathercook-ai-generated/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/87-9-logistic-s5e3-rainfall-probability-in-r/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-prediction-eda-catboost-optuna/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/shap-feature-engineering-lstm-cnn-ensemble/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/ps-s5e3-rainfall-hyperspace-as-feats/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-catboost/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-keras-tensorflow/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/lgb-and-xgb/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/fork-improvement-xgb-lb-0-929/submission_ensemble.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/xtratreeclassifier-v1-updated-multiplier-next-step/submission.csv")['target'][:146],
    pd.read_csv("/kaggle/input/rainfall-dataset-roc-auc-0-87154/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-simple-logistic-regression/submission.csv").rainfall[:146],
    pd.read_csv("/kaggle/input/rainfall-prediction-will-it-rain-tomorrow-87-67/submission.csv").target[:146],
    pd.read_csv("/kaggle/input/s5e3-logisticregression/submission.csv").rainfall[:146],
])
len(preds)

In [None]:
# LB scores, copied and pasted manually from each notebook, in the same order as preds
scores=[
    0.85626,
    0.96111,
    0.86484,
    0.93550,
    0.88710,
    0.96218,
    0.90104,
    0.94851,
    0.88200,
    0.86430,
    0.86698,
    0.87851,
    0.95548,
    0.85679,
    0.80718,
    0.85009,
    0.92947,
    0.86390,
    0.86725,
    0.87396,
    0.84633,
    0.86377
    
]
len(scores)

In [None]:
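%%time
from ortools.sat.python import cp_model

N = 146  # rows probed (the first 146 of each submission)
p = 113  # positives among them
n = 33   # negatives (N - p)

class SolutionCallback(cp_model.CpSolverSolutionCallback):
    """Collects every feasible 0/1 assignment found by the solver."""

    def __init__(self, variables):
        super().__init__()
        self.variables = variables
        self.solutions = []

    def on_solution_callback(self):
        self.solutions.append([self.Value(v) for v in self.variables])

model = cp_model.CpModel()
x = [model.NewIntVar(0, 1, f'x[{i}]') for i in range(N)]  # x[i] = 1 if row i is a positive

model.Add(sum(x) == p)

# One equality per submission, in the rank-sum form of ROC AUC
# (the metric of the Binary Rainfall notebooks this cell follows):
#   2 * (sum of ranks of the positives) = 2 * AUC * n * p + p * (p + 1)
# Ranks are doubled so that tied (half-integer) ranks stay integral.
for m in range(10):  # only the first 10 submissions are used
    r = pd.Series(preds[m]).rank().values
    lhs = sum(x[i] * int(np.around(r[i] * 2)) for i in range(N))
    rhs = int(np.around(scores[m] * n * p * 2)) + p * (p + 1)
    model.Add(lhs == rhs)

solver = cp_model.CpSolver()
s = SolutionCallback(x)
solver.parameters.enumerate_all_solutions = True  # enumerate all solutions, not just the first
solver.Solve(model, s)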



In [None]:
len(s.solutions)
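The equality passed to `model.Add` in the solver cell has the shape of the rank-sum (Mann-Whitney) identity for ROC AUC: with `p` positives and `n` negatives, `AUC = (R - p(p+1)/2) / (n*p)`, where `R` is the sum of the ranks of the positives. The sketch below checks that identity on made-up toy data and replaces the CP-SAT search with a brute-force enumeration, which is only feasible at this tiny size. The helper names and the toy arrays are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def auc_pairwise(y_true, y_pred):
    """ROC AUC by direct counting over (positive, negative) pairs; ties count 0.5."""
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def auc_from_ranks(y_true, y_pred):
    """ROC AUC from the rank sum of the positives (Mann-Whitney identity)."""
    r = pd.Series(y_pred).rank().values
    p = int(y_true.sum())
    n = len(y_true) - p
    return (r[y_true == 1].sum() - p * (p + 1) / 2) / (n * p)

# Toy data (made up): 8 rows, 5 positives.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0.9, 0.1, 0.7, 0.2, 0.4, 0.8, 0.6, 0.5])
score = auc_from_ranks(y_true, y_pred)

# Brute-force stand-in for the solver: keep every labelling with p positives
# whose doubled rank sum equals the value implied by the score.
N, p = len(y_true), int(y_true.sum())
n = N - p
r2 = np.around(pd.Series(y_pred).rank().values * 2).astype(int)
rhs = int(np.around(score * n * p * 2)) + p * (p + 1)
candidates = [c for c in combinations(range(N), p) if r2[list(c)].sum() == rhs]
```

In this toy case a single score leaves more than one candidate labelling, and the true one is among them. Each additional submission contributes one more equality of the same form, which is how the search space gets narrowed.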