
Consider event-noising as a possible enhancement to privacy for event-level data #84

Closed
csharrison opened this issue Nov 18, 2020 · 7 comments

@csharrison

High level idea

The current explainer for the event-level API discusses adding noise to whether or not the conversion occurred. You could imagine adding this noise, along with some small variations to the API, to achieve a version of local differential privacy. (Note that the initial proposal adds noise to the conversion metadata, but not to the converted-or-not state.) This idea was also brought up (by @btsavage) in the Privacy CG face-to-face meeting earlier this year (minutes).

The general idea here is to, for every impression, enumerate all possible outputs of the API:

  • No conversion
  • Conversion with conversion metadata m = 1
  • etc.

Now, the new way to add noise in this setting would be to, for every impression (a minimal code sketch follows the list):

  • With probability 1-p, the API works without any noise added anywhere
  • With probability p, the API outputs a result by picking an output from the above list with uniform probability
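
For concreteness, here is a minimal sketch of this mechanism in Python (names are illustrative, not part of any proposed API):

import random

def k_ary_randomized_response(true_output, all_outputs, p):
    # With probability 1 - p, report the true output unchanged.
    # With probability p, report an output chosen uniformly at random
    # from the full list of possible outputs (including "no conversion").
    if random.random() < p:
        return random.choice(all_outputs)
    return true_output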

Note that this algorithm can cause true conversions to be dropped, and for non-converting impressions to be falsely marked as converted. However, it allows us to make much stronger privacy claims about the API. Note that if p = 1, the API reveals no new information about the user’s cross site behavior. This is an example of k-ary randomized response. Any one user’s contributions are hidden with this method, while still allowing useful results in aggregate (e.g. with a step to estimate the true / unbiased counts, similar to noise_corrector.py). This mechanism is similar to RAPPOR, which Chrome uses internally to add local privacy to data for later analysis.
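
As a rough sketch of that correction step (this is the standard randomized-response estimator; the actual noise_corrector.py may differ in its details):

def estimate_true_count(observed_count, num_impressions, k, p):
    # Each impression reports a given value spuriously with probability
    # p / k, and reports its true value with probability 1 - p, so:
    #   E[observed] = true_count * (1 - p) + num_impressions * p / k
    # Inverting that expectation gives an unbiased estimate.
    expected_noise = num_impressions * p / k
    return (observed_count - expected_noise) / (1 - p)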

Something to this effect could be used for enhancing the privacy of the event-level API, either as a knob that different browser vendors could tune to their liking, or as a way to unlock more sensitive use-cases like support for view-through conversions.

Analogy to aggregate data

One way to think about local-differential-privacy noise of this kind is that it protects the output for any one particular individual while allowing flexible aggregate queries. In a way this is similar (although not identical) to the privacy guarantees we want in the Aggregation Service. For similar privacy parameters, the “local DP” approach typically produces noisier data (more variance and bias), but allows maximum query flexibility: you can learn aggregate statistics about many different slices of traffic rather than only over pre-registered aggregation keys.
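
To make the variance claim concrete, here is a back-of-the-envelope for the estimator sketched above (assuming true counts are small relative to the noise floor):

import math

def corrected_count_stddev(num_impressions, k, p):
    # Each of N impressions spuriously reports a given value with
    # probability q = p / k, so the raw count carries binomial noise with
    # stddev sqrt(N * q * (1 - q)); debiasing divides by (1 - p), which
    # scales the noise up accordingly.
    q = p / k
    return math.sqrt(num_impressions * q * (1 - q)) / (1 - p)

# e.g. corrected_count_stddev(1_000_000, 9, 0.5) is roughly 460, i.e. a
# noise floor of a few hundred conversions per queried slice.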

This could be useful if aggregation slices are not known a priori, for instance if there is a class of invalid traffic from a botnet that you want to detect in aggregate but you don’t know in advance what the traffic pattern looks like.

cc @johnwilander FYI in case you have opinions here

Problems with this approach

Uniform noise can make low CvR slices hard to analyze

Techniques that introduce many false positive conversions for slices of data that have very low conversion rates risk drowning out any signal.

There are a few possible techniques we could use to help with this problem. One idea is to tune the noise so that, rather than picking uniformly at random with some probability as we do for k-ary randomized response, we pick fake values distributed according to some noisy/rough view of the true distribution of the data. That is, small slices should not get too many false positives added to them. This addresses the basic problem that conversions themselves are rare events: any fixed noise probability p risks drowning out the signal for rare enough events, but noise that is scaled to the number of true conversions can be sure to leave the signal intact. Because the randomized response roughly follows the true distribution of the data, we may be able to randomize at a higher rate than a traditional uniform randomized response.
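
A sketch of what that could look like (rough_weights here is an assumed noisy prior over the outputs, however it is obtained; as noted below, the privacy analysis is exactly the hard part):

import random

def data_dependent_response(true_output, all_outputs, p, rough_weights):
    # Like k-ary randomized response, but fake values are drawn according
    # to a rough/noisy view of the true output distribution rather than
    # uniformly, so low-volume outputs receive proportionally fewer
    # false positives.
    if random.random() < p:
        return random.choices(all_outputs, weights=rough_weights, k=1)[0]
    return true_output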

The privacy of such an approach is non-trivial to analyze, but it seems like a potential way to make the measurement error more consistent across advertisers / traffic slices, especially when we have data where we want to preserve the underlying distribution in the output. While analyzing the API with strict DP definitions has real advantages (and it is possible with this modified scheme as far as I can tell), we may want to consider other ways of analyzing privacy, e.g. techniques that show mechanisms like this do a good job of preventing the joining of first-party identifiers.

Adding noise to events makes authentication even harder

The current ideas around authentication (see #13 for more details as well as SERVICE.md) involve something like blind signing impression / conversion data. That is, we can attach a signature to an impression or conversion without necessarily knowing what information the browser is embedding in that event. This allows us to make statements like “all reports from the API must have come from a conversion event that received a signature”.
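
For intuition, here is a toy RSA blind signature over small integers (purely illustrative: a real design would use a vetted cryptographic library, proper padding, and large keys):

# The signer (e.g. the reporting origin) signs a blinded message without
# learning its contents; the browser unblinds the result into a signature
# that verifies against the original message.
n, e, d = 3233, 17, 413  # tiny RSA key, n = 61 * 53 (never use in practice)

def blind(m, r):
    return (m * pow(r, e, n)) % n

def sign_blinded(blinded_m):
    return pow(blinded_m, d, n)  # the signer never sees m itself

def unblind(blinded_sig, r):
    return (blinded_sig * pow(r, -1, n)) % n

def verify(m, sig):
    return pow(sig, e, n) == m % n

m = 42  # stand-in for (a hash of) the conversion event
r = 7   # blinding factor, known only to the browser
assert verify(m, unblind(sign_blinded(blind(m, r)), r))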

Such a binding is useful if we want to noise the data within a particular event (like the conversion metadata), but if the browser is minting fake conversion events from thin air due to randomized response, achieving the binding is impossible since there is no real conversion event in the first place. The browser is essentially introducing spam conversions that look indistinguishable from fraudulent ones.

I do not know of a perfect solution here that is implementable on-device. I see two possible solutions:

  • Have the local DP style noise added by a trusted server component (which could be implemented in an MPC fashion similar to SERVICE.md), though this introduces complexity to the design.
  • Implement a weaker form of authentication. The simplest technique could look similar to Trust Tokens, where reports will only be sent by users with Trust Tokens for the reporting origin. Alternatively, we could implement the event-binding on just the impression side and not the conversion side.

Neither of these solutions are ideal, so I’d appreciate it if anyone has any thoughts on improvements we could make here! cc @btsavage, @ajknox.

@eriktaubeneck

Adding noise to events makes authentication even harder

...
I do not know of a perfect solution here that is implementable on-device. I see two possible solutions:

  • Have the local DP style noise added by a trusted server component (which could be implemented in an MPC fashion similar to SERVICE.md), though this introduces complexity to the design.
  • Implement a weaker form of authentication. The simplest technique could look similar to Trust Tokens, where reports will only be sent by users with Trust Tokens for the reporting origin. Alternatively, we could implement the event-binding on just the impression side and not the conversion side.

Could we potentially combine these? i.e.

  • At the local level, we use a weak form such as a trust token, which verifies it came from the appropriate key owner but not that it's a verified event.
  • In aggregate, we compute the number of real and fake conversions sent (with DP), where the real events use a stronger form of verification, but it's only provided to the trusted MPC.

With this, you could then at least identify if something fraudulent is happening.

@csharrison

Could we potentially combine these? i.e.

  • At the local level, we use a weak form such as a trust token, which verifies it came from the appropriate key owner but not that it's a verified event.
  • In aggregate, we compute the number of real and fake conversions sent (with DP), where the real events use a stronger form of verification, but it's only provided to the trusted MPC.

With this, you could then at least identify if something fraudulent is happening.

I think this could work, but I'm not sure it buys you much in terms of actionability once you determine some fraud is happening. Another idea in that vein is to use the (stronger auth'd) aggregate API to tune filters for the event-level data, since there might be some information in the aggregate on which event-level conversions may have been the spammy ones.

@eriktaubeneck

For my own sake, I wanted to walk through this at one more level of detail. I’m not sure I follow the filtering idea, but hopefully a more concrete example can help clarify.

Suppose we have an impression/event trail with an impression for shop.example displayed/clicked on publisher.example and an event on shop.example. The relationship between shop.example and publisher.example could be direct, or mediated by adtech.example.

I see four classes of potential attacks here:

  1. attacker.example attempts to create fraudulent impressions appearing to have occurred on publisher.example.
  2. publisher.example attempts to create fraudulent impressions appearing to have occurred on publisher.example.
  3. attacker.example attempts to create fraudulent events appearing to have occurred on shop.example.
  4. shop.example attempts to create fraudulent events appearing to have occurred on shop.example.

If I understand correctly, 1. and 3. would be solved by domain-bound Trust Tokens, and (before applying DP) 2. and 4. would be solved by event binding.

For adding DP, we want to create believable fake events, which are indistinguishable from 4.

At this point, what I’m envisioning is:

First, when the impression happens, publisher.example starts a RawAggregateReport, as described by the aggregation service explainer:

{
  aggregation_key: ["campaign=12"],
  bindings: {
    "impression": {
      nonce: "701ad7b4-4d36-4d05",
      sig: "f915cd47-c52b-4069-9b23-b76f5262a15c"
    }
  }
}

Note: this includes a browser-mediated impression binding. The nonce would only be known to the browser, but publisher.example would create sig.

This is then stored in the browser. Later, we either have a real event reported by shop.example, or we have a fake event, as needed to satisfy our DP requirements.

In the real event case, a domain-tied Trust Token would be bound to the event-level API response, and it would follow the flow described in the original readme / the first comment in this issue. In addition, shop.example would also participate in a browser-mediated event binding. The RawAggregateReport is now updated and sent to the aggregate service.

{
  aggregation_key: ["campaign=12"],
  bindings: {
    "impression": {
      nonce: "701ad7b4-4d36-4d05",
      sig: "f915cd47-c52b-4069-9b23-b76f5262a15c"
    },
    "event": {
      nonce: "edbef1f4-8838-442e",
      sig: "7a44a512-6bab-48ce-a1c2-7f61073251e3"
    }
  }
}

In the fake event case, the event-level API flow is the same (with a domain-tied Trust Token). However, at this point, there is no event binding, and the RawAggregateReport is sent as is (with only the impression binding).

At this point the aggregate API can report on the aggregate counts of real and fake events, at the aggregation_key granularity, so long as the number of events there meets the DP requirements of the aggregate API. If the rate of fake events is significantly higher than the expected rate (given p), then publisher.example can decide how to proceed.
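
As a hypothetical check on those aggregate counts (assuming a simple binomial model of the browser noise, and that the aggregate API's own noise is small at this granularity):

import math

def unbound_rate_suspicious(n_unbound, n_impressions, p_fake, z=3.0):
    # Under honest behavior, unbound ("fake-looking") events should be
    # roughly Binomial(n_impressions, p_fake), where p_fake is the
    # per-impression chance of a browser-minted fake conversion (derivable
    # from p and the output space). Flag slices that sit more than z
    # standard deviations above that expectation.
    expected = n_impressions * p_fake
    stddev = math.sqrt(n_impressions * p_fake * (1 - p_fake))
    return n_unbound > expected + z * stddev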

Fundamentally, I’m not sure you can release any more information than this, since the aggregate API is already designed to release information once the appropriate bounds and noise have been applied. Am I missing something else we could do at this point?

@csharrison

Erik, thanks for this detailed write-up. I think an approach like this can help detect where the fraud is happening via the aggregate API's aggregation_key. This basic idea is what I was getting at in my previous comment: use the aggregate data to find where the spam is, and potentially use that to identify which event-level reports are more likely spam. However, I am not sure this link actually exists, since an attacker can craft unrelated event-level and aggregate messages if they so please. I want to think more about what useful actions can be taken when fraud is detected in aggregate.

If we have time we can discuss this more on the call Monday. One other concern I have with this approach is that it means we need to send aggregate reports for false conversions as well, which significantly increases communication cost if you assume conversions occur at much lower rates than impressions.
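
To put a rough number on that cost (illustrative model: each impression independently emits a fake conversion report with probability p_fake):

def report_blowup(n_impressions, n_conversions, p_fake):
    # Ratio of (real + browser-minted fake) conversion reports to real ones.
    return (n_conversions + n_impressions * p_fake) / n_conversions

# e.g. a 1% conversion rate with p_fake = 0.05:
# report_blowup(1_000_000, 10_000, 0.05) == 6.0, i.e. 6x the report volume.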

@eriktaubeneck

@csharrison I'm planning to hop back in and engage here. Unless you object, I think it would be useful to move the fraud detection discussion over to a new issue, as it seems like this would be an issue for any noise rate > 0, regardless of whether the rate is fixed or variable.

@csharrison

Sounds good to me, feel free to file a new issue and separate this one.

@csharrison

Closing since we ended up implementing this.
