Privacy guarantees of Ranked Granular Report #16
Prompt for discussion in the SPARROW technical workshop on July 16th (see #14):
Can you explain what makes a variable un-protected? It seems that it's supposed to be something that can be known only to one party (either the advertiser or the publisher). But how can the reporting system be sure of that?
It's certainly not enough to say that a value is un-protected if one party comes up with it without talking to the other party. To take an obvious example, the publisher's code knows the time when the page loaded, and the advertiser's code knows the time when the ad rendered, but if these were both un-protected variables, then of course they would immediately let the two parties join up their reports. (As I mentioned in #9, I don't believe that an agreement to not collude or share data is a viable protection.)
I think that without un-protected variables, this is very nearly the same as aggregate reporting, though you're proposing k-anonymity instead of differential privacy. That is, a report containing each row, with k-anonymity used to redact rare values, is very similar to a report that lists each sufficiently-popular event and the number of times that event occurred. (The ranking preference idea here matches up with the Aggregated Reporting section "Grouping multiple keys in one query")
But joining these with un-protected variables seems like a pretty substantial change in the reporting model.
Following our workshop, please find below a summary of the discussions around this issue. The full video can be found here (password: 0Y$y0.R$). We invite participants to comment if they want to add a specific point or correct the summary below.
Unprotected vs protected variable.
What are unprotected variables:
These are variables that cannot be used to identify a user on the publisher website and therefore do not need to be hidden in the report. They include, for instance:
Some work is still needed to reach a consensus that these "unprotected" variables do not introduce a vulnerability in the privacy protections.
Differential privacy vs ranked privacy-preserving granular report
The ranked granular report in SPARROW relies on k-anonymity over protected variables, whereas Chrome's reporting proposals rely on differential privacy.
One of the reasons is the presence of unprotected variables, which are not easily accommodated within a differential privacy framework.
According to Charlie Harrison, Chrome engineers are working on a version of differential privacy that would allow handling such cases, but the framework is not complete yet.
Another reason is that we have reservations about differential privacy as the correct tool for online advertising, as explained here (https://github.com/Pl-Mrcy/privacysandbox-reporting-analyses/blob/master/differential-privacy-for-online-advertising.md).
No consensus was reached during the meeting about the best tool to provide anonymous reporting between differential privacy and k-anonymity. Further discussion about trade-offs will be needed.
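To make the k-anonymity side of this trade-off concrete, here is a minimal sketch of the redaction being discussed. The row format, values, and threshold k are all hypothetical, not taken from the SPARROW spec:

```python
from collections import Counter

# Hypothetical sketch of k-anonymity redaction: a report row is
# published only if its exact value combination occurs at least
# k times in the report; rarer rows are suppressed.

def k_anonymize(rows, k):
    counts = Counter(rows)
    return [row for row in rows if counts[row] >= k]

rows = [("sports", "click")] * 6 + [("finance", "view")] * 2
redacted = k_anonymize(rows, k=5)  # only the six "sports" rows survive
```

Unlike differential privacy, nothing here adds noise: rows either appear exactly or not at all, which is why repeated reports over time need separate analysis.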
Thanks for the summary, Basile.
As @csharrison brought up in the meeting, the Chrome proposal for an event-level conversion measurement API does have a little of the "unprotected variable" nature to it. In that proposal, a report contains an actual event ID which can be associated with the ad auction. But in exchange for making the auction-time signals "unprotected", we place substantial limits on what other information can be joined with the event — just a few bits of information from conversion time, and even those bits involve DP-style noise.
In the same way, the existence of "unprotected variables" in your proposal triggers the need for substantial protections, otherwise it runs the risk of letting a large amount of user-specific information travel from the publisher site to the advertiser site.
To illustrate the risk here, let's consider what it would take for a colluding publisher and advertiser to join actual user identifiers using this report.
Suppose every user on Large Publisher P has a unique user ID. In the signals it contributes to the ad auction, the publisher includes the first bit, first 2 bits, first 3 bits, etc., of the user's ID. These are all protected variables, so probably "first 16 bits" would not pass the k-anonymity threshold, and would be suppressed. But in your proposal, the Gatekeeper would allow a report which contained as many bits as k-anonymity allows.
Now suppose Large Advertiser A also has a unique ID for each of their customers, and associates that with the click ID whenever a customer clicks on their ad and visits their site. If the Gatekeeper's report includes the click ID as an unprotected variable, then Advertiser A can learn many bits of the Publisher P ID for its customers. Of course it's not a unique ID, due to k-anonymity.
But the following week, Publisher P could switch to sending the last 1, 2, 3, etc. bits of its user IDs. Since both IDs are stable over time, a person who clicks on two different ads from P to A leaks twice as many bits of identity. At that rate it doesn't take long to join IDs.
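The two-report attack above can be sketched in a few lines. The 16-bit ID width and the 8 bits surviving k-anonymity each week are made-up numbers chosen for illustration:

```python
# Week 1 leaks a k-anonymous prefix of the publisher-side user ID,
# week 2 a k-anonymous suffix, and the colluding advertiser joins the
# two reports on its own stable customer/click ID.

USER_ID = 0b1011001110001101  # publisher-side ID of one user (16 bits)

def prefix_bits(uid, n, width=16):
    return uid >> (width - n)

def suffix_bits(uid, n):
    return uid & ((1 << n) - 1)

week1 = prefix_bits(USER_ID, 8)  # joined via the week-1 click
week2 = suffix_bits(USER_ID, 8)  # joined via the week-2 click

recovered = (week1 << 8) | week2
assert recovered == USER_ID  # two clicks rebuild the full ID
```

Each individual signal passes k-anonymity (many users share any 8-bit prefix or suffix), yet the join across weeks is exact because both IDs are stable.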
Thank you for this example.
Comparing it to the one you gave for the first version of SPARROW reporting, where a single display was enough to pass user IDs between two colluding actors, we take the fact that this now requires at least two clicks and a very specific configuration as a testament to the tremendous progress we have made!
Let us detail exactly what it would take to run this attack.
Technically, the example you gave, albeit convoluted, is indeed feasible. But we think the requirements to run such an attack are too costly (the attacker would need to buy many impressions) and its payoff too uncertain (both in terms of success rate and coverage) to make it even remotely practical, particularly at scale. The cost of the attack exceeds the potential benefits by several factors. On the other hand, a reporting scheme such as the one we propose allows for practical advertising, which sustains publisher revenue. This is a case of weighing the privacy impact against other considerations (as mentioned by your team during one of the W3C IWABG calls), and here we believe the potential privacy gain is far too small compared to the impact on publisher revenues.
Also to note: a similar, much simpler attack could be mounted on contextual requests, affecting all proposals, where one would just need to pass the user_id in the link descriptor to get exactly the same level of information you described. In that case, why would anyone run a complex attack such as the one you described when there are easier ways to proceed? We also want to point out that similar attacks relying on persistent IDs could be conducted within a differentially private framework. To maintain differential privacy over multiple reports, you would have to greatly increase the added noise, making the reporting unusable.
First, I definitely agree that we've been making tremendous progress! I'm sorry if I haven't made that clear enough.
Second, my point is that the "unprotected variable" idea lets a pretty large amount of information flow across sites.
It's large enough that I described a way to transmit a whole user ID. But even if that particular use is unlikely, it's still a way for the advertiser to get an awful lot of information that is (a) about a specific user of their site, and (b) about behavior that happens while that person is not on their site.
As I wrote in the Potential Privacy Model for the Web at the beginning of the Chrome Privacy Sandbox work, we acknowledge that some use cases rely on one site learning a little bit about some user's off-site behavior.
But the Click Through Conversion Measurement Event-Level API is an example of what "a little bit" of information might look like: it's limited to 3 bits of information, with 5% noise on the value of those bits. By contrast, you're proposing a potentially unbounded amount of cross-site information with no noise.
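For illustration, that event-level noise could look roughly like randomized response on the 3-bit value. The exact mechanism and parameters in the API proposal may differ; this is only a sketch:

```python
import random

# Illustrative sketch (not the exact API mechanism): the 3-bit
# conversion-side value is replaced by a uniformly random 3-bit
# value with probability p (here 5%) before it reaches the report.

def noisy_conversion_value(true_value: int, p: float = 0.05) -> int:
    assert 0 <= true_value < 8, "value must fit in 3 bits"
    if random.random() < p:
        return random.randrange(8)  # random 3-bit replacement
    return true_value
```

With p = 0.05 the advertiser sees the true value about 95% of the time, so individual reports carry some deniability while aggregates remain usable after debiasing.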
We are glad to know that you welcome this progress.
Do you see ways of putting numbers behind "an awful lot" or "a little bit" of information? I do think that quantifying the potential information leak as an "awful lot" is too strong. Besides the interest group (which is already known to the advertiser thanks to its first-party ID), most unprotected variables would relate to the ad, not to the user.
The status of "unprotected variable" would be awarded on a case-by-case basis to make sure it doesn't reveal any sensitive information. We think labels (click and view) are important unprotected variables, and we don't think they are particularly sensitive.
Taking a step back from these technicalities, it seems that we disagree on the appropriate method to solve the last-mile (or, dare I say, last-meter) issues on user privacy. You wish to design a system in which the user is shielded from any attack, however costly and convoluted, via technical means.
From our point of view, the Click Through Conversion Measurement Event-Level API is far too constrained to model the conversion flow. The average conversion rate (conversions / clicks) is in the range of a percent. We would lose crucial information on all intermediate events (page viewed, etc.) and on the conversion itself (price, etc.). Advertisers simply won't be able to use this report.
Pulling on that thread, they will eventually move their money to where they can still measure something: walled gardens, YouTube videos, search ads, etc.
On the June 2nd, 2020 IWABG call, @ablanchard1138 asked you for tentative values for the parameters that would be used in the reports, to which you answered:
That's what I am trying to show here, in the piece I wrote here, and through the scripts we published here to simulate differentially private reporting: the level of noise needed to reach the privacy bar you are requiring is not compatible with actionable reporting for advertisers and publishers.
Let me clarify what I meant when I said unprotected variables seem to me like "a way for the advertiser to get an awful lot of information that is (a) about a specific user of their site, and (b) about behavior that happens while that person is not on their site." Maybe I misunderstand something about your proposal.
(a) Suppose that we somehow ensure that unprotected variables reveal data known only to the advertiser. For example, this could include things like "did the user convert?", or "dollar value of the user's conversion", or even "encrypted form of a unique ID of the user on the advertiser's side".
(b) The unprotected variables can include arbitrary data known to the publisher if it reaches the report's k-anonymity threshold. For example, suppose the publisher's ad network logs the behavior of a user just on the publisher's site, and remembers the IAB Tech Lab Content Taxonomy that each user interacted with the most. Levels 1+2 give 371 categories, but most sites concentrate in many fewer, so it's quite reasonable to think that many people's most-interacted-with-category-per-site would be moderately popular.
Now when a user visits the advertiser's page, what information will the granular report allow the advertiser to learn about this specific user? It seems to me that the advertiser gets to learn their favorite content taxonomies from all web sites where that user saw an ad for that advertiser.
That definitely seems like "an awful lot of information" to me. The behavior I describe is neither costly nor convoluted. Am I misunderstanding some kind of limit that would prevent this, or even make it unlikely?
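The join described above could be as simple as the following. All click IDs, sites, and categories are invented for the sketch; I am assuming the report rows pair an advertiser-known click ID with a publisher-side top category:

```python
# If a report row joins an advertiser-known click ID with the
# publisher-side "most-interacted-with IAB category", the advertiser
# learns per-site behavior for a specific user across every site
# where that user saw its ad.

report_rows = [
    {"click_id": "clk-001", "site": "news.example",   "top_category": "Personal Finance"},
    {"click_id": "clk-001", "site": "sports.example", "top_category": "Fantasy Sports"},
    {"click_id": "clk-002", "site": "news.example",   "top_category": "Parenting"},
]

advertiser_customers = {"clk-001": "customer-42"}  # advertiser's own join key

profile = {}
for row in report_rows:
    customer = advertiser_customers.get(row["click_id"])
    if customer:
        profile.setdefault(customer, {})[row["site"]] = row["top_category"]

# customer-42's off-site interests are now visible to the advertiser.
```

Each category can easily be popular enough per site to pass k-anonymity, yet the joined profile is specific to one user.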
I think there is some misunderstanding here. Please excuse us if it originates from a lack of clarity on our side.
What I am about to say might have to be amended when incorporating the RTB House proposal. For the sake of this conversation, I assume that interest groups are defined as per TURTLEDOVE/SPARROW.
I think that the misunderstanding comes from a confusion about who has access to what information/variable.
An example of the first type is the interest group, and of the second the background color of the ad (the gatekeeper will have to ensure that no publisher variables are used to define it; should any publisher variable be used, the variable would become protected).
All those variables are only available thanks to the browser and are therefore very limited by design.
The gatekeeper has no access to things like "did the user convert?", or "dollar value of the user's conversion", or even "encrypted form of a unique ID of the user on the advertiser's side".
The sentence below is thus inaccurate in the SPARROW reporting we envision.
To compare websites to train stations: the "advertiser" station never gets to know all the stations the user visited in the past (as that is not its business). However, when the user steps onto the platform from a train they chose to board via a click, the station knows where that train came from.
Thanks, Basile. You're right, I had been assuming that the "unprotected variables" alone were enough to join with a user identifier on the publisher site; I didn't realize that you thought this would only become possible after a click.
But the unprotected variables can join over all impressions shown in the same browser, right? The Gatekeeper-chosen "AB test ID" persists over time and across sites. And something like the "background color" could be randomly selected at ad serving time, and would likewise persist across impressions.
So it seems to me that if unprotected variables join up all impressions shown to a single browser, then one click would be enough for the advertiser to learn about all impressions shown in that browser, not just the one impression that was clicked on.
What you are describing cannot be done in the current proposal, except via the ABTestID. This is why we are proposing a strict limit on this ABTestID (we have proposed that it be a number between 1 and 10, mostly stable but with occasional random resets).
Let's assume, for example, that the gatekeeper wants to keep the background color for user_1 constant across websites, for the advertiser to link all displays on user_1 of a specific IG when one click occurs.
As long as there is more than one user in each ABTestID (and as there are only 10 different ABTestIDs, there should always be many users) this should not be possible within our design, and therefore a click should only give information on the impression the user clicked on.
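A quick back-of-the-envelope check of that limit, assuming the 10 possible ABTestID values proposed above and an illustrative population of one million users:

```python
import math

# With only 10 possible ABTestID values, the ID carries at most
# log2(10) bits, far too few to single out a user, and each value
# is shared by a large crowd of users.
num_ids = 10
max_bits = math.log2(num_ids)            # at most ~3.32 bits per stable ID
population = 1_000_000
users_per_bucket = population / num_ids  # ~100,000 users share each value
```

The occasional random resets further reduce how much of even this small budget accumulates over time.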
The leak that you describe could happen if we were to add more variables on the user (e.g. including RTBHouse proposal, or allowing additional information about the number of ads served etc.). The proposal as it currently is doesn't allow it.
To allow more information on the user to be transferred by the browser (e.g. with RTBhouse proposal), we will indeed need to update the proposal accordingly.
K-anonymity at the user-feature level would be the way to go for this specific report: we would have two sets of protected features, publisher-side and advertiser-side, with k-anonymity computed on different scopes.
This would need to be investigated more in depth to make sure it would work.
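One possible reading of this two-scope idea, sketched under our own assumptions (the threshold k, field names, and rows are all hypothetical):

```python
from collections import Counter

# "k-anonymity computed on different scopes": the publisher-side and
# advertiser-side feature tuples are each required to be k-anonymous
# independently, and a row is released only if both checks pass.

K = 5

def release(rows, pub_keys, adv_keys, k=K):
    pub_counts = Counter(tuple(r[f] for f in pub_keys) for r in rows)
    adv_counts = Counter(tuple(r[f] for f in adv_keys) for r in rows)
    return [
        r for r in rows
        if pub_counts[tuple(r[f] for f in pub_keys)] >= k
        and adv_counts[tuple(r[f] for f in adv_keys)] >= k
    ]

rows = [{"site_cat": "news", "converted": "yes"} for _ in range(5)]
rows.append({"site_cat": "rare", "converted": "yes"})  # publisher tuple too rare
released = release(rows, pub_keys=["site_cat"], adv_keys=["converted"])
```

Checking the two scopes separately rather than jointly is the design choice that would need the in-depth investigation mentioned above, since two individually common tuples can still form a rare combination.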