ML challenge inspired by the aggregate reporting API proposal #137

Closed · BasileLeparmentier opened this issue Apr 28, 2021 · 12 comments

@BasileLeparmentier

Hi everyone,

(Repost of issue csharrison/aggregate-reporting-api#21, as apparently that was not the right repo for it.)

We are delighted to announce that we will be organising a challenge with AdKDD, inspired by the aggregate measurement API and tackling the optimisation use case.

Criteo will provide a dataset and some prize money, and let researchers and data scientists from around the world compete on learning performant bidding models from differentially private reports. Link to the challenge here.

We will be happy to work with the Chrome team to set the appropriate parameters for the differential privacy mechanism, so as to be as close as possible to what real-life operation could look like (e.g. knowing a realistic epsilon level would be beneficial).

We hope to kickstart the challenge in early May, so if you are interested in solving the “optimisation” use case using the aggregate reporting API, please do participate!

Best,

Basile on behalf of the Criteo team

@BasileLeparmentier (Author)

(I apologize for the repost)
Dear @csharrison ,

For the challenge, could you please provide us with an order of magnitude of the epsilon to work with? If you also have an idea of the noise distribution (Gaussian? Laplacian?), that would be great!

We understand very well that this number might change (and we hope the challenge will help inform the choice of epsilon!), but it would really help to have an order of magnitude to set up the challenge.

Many thanks!

@csharrison (Collaborator)

Thanks Basile! I'll work on trying to get this information to you soon.

@ALamraniAlaouiScibids

Hello @BasileLeparmentier,

Thank you for organizing this challenge; we will surely learn a lot and prepare for the future of click/sales optimization!
Would it be possible to release a detailed note on the way the dataset was generated (the query used for the aggregation, the different parameter values, etc.)?

@BasileLeparmentier (Author)

Hi,

Yes, all of this will be described in the challenge!

@BasileLeparmentier (Author)

Dear @csharrison ,

Sorry to bother you, but it would really be great if we could get the value of the epsilon today.

We are launching the challenge on Monday, and we need time to run checks beforehand; if we get the epsilon later than today, it will be very hard for us.

Many thanks,
Basile Leparmentier

@csharrison (Collaborator)

Hey Basile, yep, planning on sending you some information by today; I just need to get a few final sign-offs. Sorry for the delay.

@BasileLeparmentier (Author)

Thx!

@csharrison (Collaborator) commented May 5, 2021

Hey Basile, here are some parameters for you to work with.

  • Total epsilon of 5, resetting every month, with an L1 “sensitivity budget” which limits the total contribution per (browser, week, publisher) across all buckets to some fixed parameter (L1).
    • Note that we are also considering “pacing” this budget over time (e.g. having a 7/6 epsilon per week). You could consider divvying up the budget that way if you like.
  • For the purposes of this challenge you can set the L1 parameter however you like, and allow histogram contributions of any real numbers from 0 to L1. In the API we would expect reports to be dropped after the sensitivity budget is consumed, until the end of the time window.
    • e.g. in the simplest setting, all clicks would contribute L1 to one bucket of the underlying histogram, but then the user would not be able to contribute anything else for the rest of the time window for that publisher.
    • More complex settings are possible by allowing the user to contribute less than L1 to multiple buckets, across multiple events.
  • Noise is added to every aggregate bucket via Laplace(L1 / epsilon); see the sketch after this list.
  • Publisher eTLD+1 is available in the clear, associated with click reports, along with ~hour time granularity.
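
To make the budgeting and noise concrete, here is a minimal Python sketch (my own illustration, not reference code from Chrome or Criteo; the record layout, the `clip_contributions` helper, and the L1 value are assumptions for the example) of enforcing the L1 sensitivity budget per (browser, week, publisher) and adding Laplace(L1 / epsilon) noise per bucket:

```python
import numpy as np
from collections import defaultdict

EPSILON = 5.0   # total monthly budget from the parameters above
L1 = 2 ** 16    # illustrative contribution cap; the challenge lets you pick this

def clip_contributions(contributions, l1=L1):
    """Enforce the L1 'sensitivity budget' for one (browser, week, publisher).

    `contributions` is a list of (bucket, value) pairs in event order; once the
    budget is consumed, later reports are dropped until the window resets.
    """
    kept, spent = [], 0.0
    for bucket, value in contributions:
        remaining = l1 - spent
        if remaining <= 0:
            break  # budget exhausted for this time window
        value = min(value, remaining)
        kept.append((bucket, value))
        spent += value
    return kept

def noisy_histogram(per_user_contributions, epsilon=EPSILON, l1=L1, rng=None):
    """Sum clipped contributions per bucket, then add Laplace(l1/epsilon) noise."""
    rng = rng or np.random.default_rng()
    hist = defaultdict(float)
    for contributions in per_user_contributions:
        for bucket, value in clip_contributions(contributions, l1):
            hist[bucket] += value
    scale = l1 / epsilon
    return {bucket: total + rng.laplace(0.0, scale)
            for bucket, total in hist.items()}
```

In the simplest setting described above, each click contributes (bucket, L1) once, so `clip_contributions` keeps the first event and drops everything else for the rest of the window.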

Aggregation function

There are a few choices of aggregation function you can use. Any of these could be acceptable, but they offer different trade-offs. (A sketch of options (1) and (2) follows this list; a sketch of option (3) follows the next paragraph.)

  1. Unbounded key space + Laplace + thresholding. This is the last bullet in issue 116. Group by every key in the data, add Laplace noise, and threshold at some fixed T. See the algorithm described here (set B0 = 1 and Binf = L1) and the associated code for how to compute this for Laplace histograms. You can use delta = 10^-7 to start.
    • Pros: Doesn’t require prior knowledge of the domain keys.
    • Cons: Introduces bias in the data via the thresholding mechanism.
  2. Bounded key space + Laplace. Similar to the first bullet in issue 116. Same as (1), but the aggregation function takes as input a list of “possible” keys (not learned from the sensitive data itself!). Final histograms are computed by adding Laplace noise to the sums per key, as long as each key was present in the input. Any keys present in the data but not in the input should be dropped. No thresholding or delta term is necessary.
    • Pros: Doesn’t introduce bias by thresholding.
    • Cons: Requires a priori knowledge of the domain. For sparse domains (ones where only a small fraction of the keys show up in real data), the output will be similarly sparse, with lots of noisy 0s.
  3. Dense key space + Laplace + hierarchies. This matches the second bullet in issue 116: it is a “hierarchy of histograms” approach where aggregation keys are specified in some fixed domain (e.g. 32 bits) and are queryable at various bit-prefixes, with Laplace noise added to each bucket. This entails splitting the epsilon budget between the different hierarchies queried, and additionally requires some (dense) encoding of the aggregation keys. It also removes the need for any specific thresholding in the privacy mechanism itself. This technique is more compatible with MPC.
    • Pros: Doesn’t require thresholding. Allows for hierarchical queries.
    • Cons: Requires structured keys in some dense encoding (e.g. 32 bits or less) and a priori knowledge of the domain.
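
For concreteness, here is a rough sketch of options (1) and (2) (my own, not the linked algorithm; in particular, the threshold formula below is a common textbook choice for Laplace histograms and may differ from the constant used in the code linked above):

```python
import numpy as np

def bounded_key_aggregate(sums, domain_keys, l1, epsilon, rng=None):
    """Option (2): only keys from the declared (non-sensitive) domain are
    released; every declared key gets an output, absent ones as noisy zeros."""
    rng = rng or np.random.default_rng()
    scale = l1 / epsilon
    return {k: sums.get(k, 0.0) + rng.laplace(0.0, scale) for k in domain_keys}

def unbounded_key_aggregate(sums, l1, epsilon, delta=1e-7, rng=None):
    """Option (1): group by every key seen in the data, add Laplace noise, and
    release only buckets whose noisy value clears a threshold tau."""
    rng = rng or np.random.default_rng()
    scale = l1 / epsilon
    # Assumed thresholding rule; the linked algorithm (B0 = 1, Binf = L1) may
    # use a different constant.
    tau = l1 + scale * np.log(1.0 / (2.0 * delta))
    released = {}
    for key, total in sums.items():
        noisy = total + rng.laplace(0.0, scale)
        if noisy >= tau:  # this drop step is the source of the bias noted above
            released[key] = noisy
    return released
```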

I’m happy to go into more detail on any of this. Note that we are planning on publishing a doc with more details on how this relates to the attribution API, so stay tuned. In that doc we are planning on suggesting option (3) for its flexibility and compatibility with MPC, but for the purposes of this challenge I think any of these are fine.
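
As an illustration of option (3), here is a sketch of a hierarchy of histograms over a 32-bit key space, splitting epsilon evenly across the prefix levels queried (again my own sketch; the level choices and the even split are assumptions, not part of the proposal):

```python
import numpy as np
from collections import defaultdict

def hierarchical_histograms(sums, prefix_bits_levels, l1, epsilon, rng=None):
    """Option (3): one noisy histogram per bit-prefix level of a 32-bit key
    space. The epsilon budget is split evenly across the levels queried."""
    rng = rng or np.random.default_rng()
    scale = l1 / (epsilon / len(prefix_bits_levels))  # per-level Laplace scale
    out = {}
    for bits in prefix_bits_levels:
        level = defaultdict(float)
        for key, value in sums.items():
            level[key >> (32 - bits)] += value  # roll up to the bit-prefix
        # A truly dense release would also emit noisy zeros for every absent
        # prefix in the 2**bits domain; omitted here to keep the sketch short.
        out[bits] = {p: v + rng.laplace(0.0, scale) for p, v in level.items()}
    return out

# e.g. query coarse 16-bit prefixes first, then the full 32-bit keys:
# hierarchical_histograms(sums, prefix_bits_levels=[16, 32], l1=65536, epsilon=5.0)
```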

Note: Training models using aggregate data may be easier if you allow on-device model training / Federated Learning. We haven’t elaborated on how to do this but we should discuss it if you want to do any follow-up competitions in this space, since it may be something the Privacy Sandbox could support.

@BasileLeparmentier (Author)

Hi Charlie,

Many thanks for your answer!

Is it important for you to stick with Laplacian noise?

As this is a challenge, there are a lot of queries (around 200, which we would also expect in a real-world setting, probably even more). We believe that using the Gaussian mechanism with a delta of 10^-6 or 10^-8 would give better learning results with the same privacy guarantee; see the back-of-the-envelope comparison below.
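
For intuition, a back-of-the-envelope comparison (my own illustration, not part of the proposal): if one report can contribute at most c to each of k ≈ 200 outputs, Laplace noise calibrated to the L1 sensitivity k·c grows linearly in k, while Gaussian noise calibrated to the L2 sensitivity √k·c grows only with √k. The example uses epsilon = 1 because the classic Gaussian formula assumes epsilon < 1:

```python
import numpy as np

def laplace_scale(c, k, epsilon):
    """Laplace scale for k outputs touched by one report, each by at most c
    (calibrated to the L1 sensitivity k * c)."""
    return k * c / epsilon

def gaussian_sigma(c, k, epsilon, delta):
    """Classic Gaussian mechanism (Dwork & Roth), calibrated to the L2
    sensitivity sqrt(k) * c. Strictly valid for epsilon < 1; for larger
    epsilon the analytic Gaussian mechanism would be used instead."""
    l2 = np.sqrt(k) * c
    return np.sqrt(2.0 * np.log(1.25 / delta)) * l2 / epsilon

print(laplace_scale(1.0, 200, 1.0))          # 200.0
print(gaussian_sigma(1.0, 200, 1.0, 1e-6))   # ~74.9
```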

Any thoughts?

Basile

@csharrison (Collaborator)

The Laplace noise fits better with our current design (details pending), where we wouldn't expect so many queries, i.e. one report influencing so many different outputs. However, if you already have all these aggregation dimensions fixed in your design, I think it should be OK to use Gaussian noise. We aren't opposed to Gaussian noise on any principled level.

@csharrison (Collaborator)

FYI, published some more information about aggregate attribution reports here:
https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md

@BasileLeparmentier (Author)

Hi Charlie and others interested,

The ML challenge is now live!

For those who want to participate, please follow this link https://competitions.codalab.org/competitions/31485
