Some follow-up notes for [Don’t Let a Data Leak Sink Your Project](https://win-vector.com/2025/05/05/dont-let-a-data-leak-sink-your-project/).

Consider the amount of information leaking from a leaderboard, in the style of [Jacob Whitehill, "Climbing the Kaggle Leaderboard by Exploiting the Log-Loss Oracle", arXiv:1707.01825, 6 Jul 2017](https://arxiv.org/abs/1707.01825).

Define:

  * `m`: the number of labels we expose in our query.
  * `c`: the number of categories per label.
  * `k`: the number of decimal digits of resolution we see.

As a deliberate oversimplification, suppose the returned score lies in the range `[0, 1]` and we can only tell apart values that differ by at least `10**(-k)`. Then there are `10**k + 1` distinguishable values. Let's say this is about `10**k`.

To expose `m` labels we must distinguish all `c**m` possible label assignments, so we need `c**m` distinct values.

So we need `c**m <= 10**k`. Taking logarithms gives `m <= k * log(10) / log(c)`.

Let's see this bound for `k = 6`, `c = 3`.

In [1]:
import numpy as np

k = 6
c = 3

# m-bound
k * np.log(10) / np.log(c)

12.575419645736307
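As a cross-check on the floating-point bound, here is a minimal sketch that finds the largest integer `m` satisfying `c**m <= 10**k` using exact integer arithmetic. The function name `max_labels_counting` is made up for this illustration and is not from Whitehill's paper.

```python
def max_labels_counting(c, k):
    """Largest integer m with c**m <= 10**k, using exact integers.

    Hypothetical helper for this note; assumes c >= 2 and k >= 0.
    """
    if c < 2:
        raise ValueError("need at least 2 categories per label")
    m = 0
    # Keep growing m while one more label still fits in the 10**k values
    while c ** (m + 1) <= 10 ** k:
        m += 1
    return m


print(max_labels_counting(3, 6))  # 12
```

This agrees with rounding the logarithmic bound down: `3**12 = 531441` fits under `10**6`, while `3**13 = 1594323` does not.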

So this *very* simple estimate leads us to expect we can expose about 12 labels per query. However, Whitehill's scheme exposed about 6 labels per query.

I am going to argue that it is plausible that 6 is in fact the right number.

Part of the over-simplification above was assuming we could design a sum that distributes values nearly uniformly over an interval. Experience with randomized algorithms tells us that *if* our weights are in a limited range then we should expect a [concentration effect](https://en.wikipedia.org/wiki/Concentration_inequality), where most of the sums are near each other.

I don't want to fully model this, but we can approximate this effect using a "[birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem)" style argument.

We want to encode `c**m` sums as distinct values. This means every pair of sums must encode distinctly. Let's use our rounding level `10**(-k)` as an estimate of the probability that any given pair of values collides. There are `(c**m choose 2)` pairs, so the expected number of collisions is about `(c**m choose 2) * 10**(-k)`. For a randomized construction to have a good chance of finding a good encoding we would want `(c**m choose 2) * 10**(-k) < 1`. There is no reason to insist on a simple randomized construction, but it is a useful heuristic to say things become plausible where such constructions work.

To solve for `m`, approximate `(c**m choose 2)` by `c**m * c**m / 2`, so the condition becomes `c**m * c**m / 2 < 10**k`. Taking logarithms gives `m < (log(2) + k * log(10)) / (2 * log(c))`.

Plugging in the same numbers gives:

In [2]:
# heuristic m bound
(np.log(2) + k * np.log(10)) / (2 * np.log(c))

6.6031746996538825
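The formula above replaces `(c**m choose 2)` with `c**m * c**m / 2`. As a small sanity check on that approximation, this sketch evaluates the un-approximated condition `(c**m choose 2) < 10**k` with exact integers. The function name `max_labels_birthday` is made up for this illustration.

```python
from math import comb


def max_labels_birthday(c, k):
    """Largest integer m with comb(c**m, 2) < 10**k, using exact integers.

    Hypothetical helper for this note; assumes c >= 2 and k >= 1.
    """
    if c < 2:
        raise ValueError("need at least 2 categories per label")
    m = 0
    # Keep growing m while the number of pairs stays below 10**k
    while comb(c ** (m + 1), 2) < 10 ** k:
        m += 1
    return m


print(max_labels_birthday(3, 6))  # 6
```

For `c = 3`, `k = 6` the exact pair count is `265356` at `m = 6` and `2390391` at `m = 7`, so the approximation does not change the integer answer here.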

That is: it is reasonable to expect there is a query plan that exposes 6 hold-out labels per query. This is only a crude heuristic argument and estimate, but it happens to match Whitehill's experience of being able to design a 6-label strategy for this case.
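For readers who want to poke at the heuristic empirically, here is a toy simulation of a simple randomized construction. It is an assumption-laden sketch, not Whitehill's actual log-loss scheme: each label position contributes a uniformly random weight chosen by its category, scaled so the total stays in `[0, 1]`, and we count how many of the `c**m` sums remain distinct after rounding to `k` digits. The function name `count_distinct_scores` and the weight distribution are illustrative choices.

```python
import numpy as np


def count_distinct_scores(m, c, k, rng):
    """Count distinct rounded scores over all c**m label assignments.

    Toy model (an assumption, not Whitehill's scheme): position i with
    category j contributes weight w[i, j], drawn uniformly from [0, 1/m].
    """
    w = rng.uniform(0.0, 1.0 / m, size=(m, c))
    # Build all c**m sums by adding one label position at a time
    sums = np.zeros(1)
    for i in range(m):
        sums = (sums[:, None] + w[i][None, :]).ravel()
    # Round to the visible resolution and count surviving distinct values
    return np.unique(np.round(sums, k)).size


rng = np.random.default_rng(2025)
k, c = 6, 3
for m in range(4, 11):
    distinct = count_distinct_scores(m, c, k, rng)
    # Columns: m, number of assignments, distinct scores, collision-free?
    print(m, c**m, distinct, distinct == c**m)
```

A single random draw per `m` is noisy, so repeating the loop with several seeds gives a better sense of where collision-free encodings stop being easy to find.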