Running a Test to measure revenue loss #35

Closed
benjaminsavage opened this issue Jun 18, 2020 · 6 comments

@benjaminsavage

Hi Michael!

I am really interested in running some actual tests of this proposal. I think it would really help inform some of the design considerations, such as the minimum size of an interest group.

I've spoken with a number of engineers on Facebook's Audience Network team about how we could design such a test, and we immediately encountered a few big open questions that we need to resolve first. I'll post them one at a time to simplify the discussion.

  1. Attribution. Currently, we are able to train machine learning systems by providing them with training data like: "We showed this ad, in this context, to this person and it did/did-not lead to a conversion". With Chrome's proposed Conversion Measurement API this would still be possible. We take large numbers of rows of this type of training data and send it to an ML model to learn. In a TURTLEDOVE world, what attributed conversion data will be available for model training? I assume we would NOT be able to use the conversion measurement API in this context and will only have access to aggregated metrics. If that's the case, what aggregate metrics will we have available for model training? Will the reporting be standardized and automatically generated by the browser, or will we have some degree of control here?

I am not sure what you have in mind, nor which metrics would be most useful, but to kick off a discussion, here are a few examples of things one might attempt to measure (a rough sketch of what such aggregate records could look like follows the list):

  • I served campaign_id = 0x15283750234 when I received a private interest group ad request for an unknown context. It did / did-not result in a conversion.
  • In the last 24 hours, I have served a total of 10,000 advertisements on publisher X via the TURTLEDOVE API. Out of those, 234 of them were ads for advertiser Y. The aggregated reporting API tells me that 14 conversions happened on advertiser Y's website as a result of those 234 ad impressions.
  • In the last 7 days, I have served a total of 100 ads to interest group Z across a variety of publishers. Of those ads, 22 of them were ads for advertiser Y. The aggregated reporting API tells me that 1 conversion happened on advertiser Y's website as a result of those 22 ad impressions.
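To make the shape of the last two examples concrete, here is a minimal sketch of what the aggregate records could look like. All field names and the record format here are my own assumptions for the sake of discussion, not anything the proposal actually specifies:

```js
// Hypothetical record shapes for the aggregate slices above; the real
// aggregated reporting API may expose very different keys and formats.

// Publisher-level slice: impressions and attributed conversions for a
// publisher x advertiser pair over a 24-hour window.
const publisherSlice = {
  window: { hours: 24 },
  keys: { publisher: "publisher-x.example", advertiser: "advertiser-y.example" },
  values: { impressions: 234, attributedConversions: 14 }, // counts would include noise
};

// Interest-group-level slice: the same idea, keyed on the interest group
// instead of the publisher, over a 7-day window.
const interestGroupSlice = {
  window: { days: 7 },
  keys: { interestGroup: "group-z", advertiser: "advertiser-y.example" },
  values: { impressions: 22, attributedConversions: 1 },
};
```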

Thanks in advance for helping us understand these constraints so that we can properly model such an experiment.

@michaelkleber
Collaborator

Great question. As you suspected, we can't allow any event-level reporting (even post-conversion reporting) that joins the contextual and interest-group information about the ad impression.

Aggregate reporting should be fine, and we should be able to aggregate across signals from all of the publisher page, the interest group, and the conversion page (or any subset of those). Adding in @csharrison for visibility.

So, taking your specific examples out of order:

  • In the last 24 hours, I have served a total of 10,000 advertisements on publisher X via the TURTLEDOVE API. Out of those, 234 of them were ads for advertiser Y. The aggregated reporting API tells me that 14 conversions happened on advertiser Y's website as a result of those 234 ad impressions.

Looks good to me. And if you have more information about the people, e.g. signals that publisher X provided about them, you should be able to aggregate on / slice by those attributes as well.

  • In the last 7 days, I have served a total of 100 ads to interest group Z across a variety of publishers. Of those ads, 22 of them were ads for advertiser Y. The aggregated reporting API tells me that 1 conversion happened on advertiser Y's website as a result of those 22 ad impressions.

Looks good too. Of course aggregated measurements will have some noise, and "1 conversion" may be hard to tell apart from 2 conversions.

  • I served campaign_id = 0x15283750234 when I received a private interest group ad request for an unknown context. It did / did-not result in a conversion.

This seems like it could work if we figure out a way to be sure the campaign_id is fixed at the time the ad is chosen in response to the interest-group request, rather than something that could be updated at render time based on contextual signals. It's not entirely obvious how to do that. Is this kind of context-free event-level data really going to be useful? I would have guessed that the lack of signals associated with the interest-group request would lead to relatively little benefit here.

@benjaminsavage
Author

Thanks for the quick answer!

My next question is similar, but relates to billing.

The amount we charge the advertiser is proportional to the likelihood of a conversion. Over longer windows of time, we compare the expected value generated (as reflected in the budget utilized) to the actual value generated (as reflected by the number of attributed conversions). From this, we can see if our systems are properly calibrated.

It's a big problem when those systems are not well calibrated. It means we are either over-estimating how much value will be generated, and charging an advertiser at a higher rate than their desired cost per conversion, or we are under-estimating how much value will be generated, and bidding too low (thereby losing auctions to other advertisers and not buying as many ad impressions as the advertiser ideally would have liked to buy).
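To make the calibration check concrete, here is a minimal sketch, assuming we can recover both the sum of predicted conversion probabilities (the expected value) and the attributed conversion count (the actual value); the function and field names are just illustrative:

```js
// Minimal sketch of the calibration check described above (names illustrative).
// servedAds: [{ predictedConversionProbability: number }, ...]
function calibrationRatio(servedAds, attributedConversions) {
  const expectedConversions = servedAds.reduce(
    (sum, ad) => sum + ad.predictedConversionProbability, 0);
  // A ratio near 1.0 means the predictions are well calibrated; above 1 means we
  // under-estimated conversion likelihood, below 1 means we over-estimated it.
  return attributedConversions / expectedConversions;
}
```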

On our ad network, we observe that the estimated likelihood of a conversion may vary by several orders of magnitude, even for the same user <=> ad combination, just by changing the placement where the ad is delivered. At the time the ad is returned, if that context is not known, we will not be able to do a very good job estimating the likelihood of a conversion, and thus we will not know how much to charge. Going with some sort of network-wide average estimate isn't going to be acceptable, because it will lead to a massive amount of poorly calibrated campaigns and sad advertisers who are over / under bidding for inventory.

As such, we will probably need to adjust billing to be based on some kind of aggregate data that is collected after the ad is actually shown. Our next question is about how this will work.

If we are generating a "bid" using locally executed code, will we be able to log this bid value with the aggregated reporting API? If we can compute the sum, across all of the bids for a given advertiser, we will know how much to charge them for all their ads that run across the network. If we can compute the sum of the bids for a given advertiser <=> publisher combination, that will help us calibrate future bids for that combination. If we can aggregate the bids of all the ads that ran on a given publisher, we will know how much to pay them.
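Sketched below is what those three aggregations could look like if each logged bid carried an advertiser and publisher key; this is just an illustration of the sums we would want, not a claim about how the aggregated reporting API would expose them:

```js
// Illustrative aggregation of logged bids; key names are assumptions.
// loggedBids: [{ advertiser, publisher, bid }, ...]
function aggregateBids(loggedBids) {
  const sums = { byAdvertiser: {}, byAdvertiserPublisher: {}, byPublisher: {} };
  for (const { advertiser, publisher, bid } of loggedBids) {
    sums.byAdvertiser[advertiser] = (sums.byAdvertiser[advertiser] || 0) + bid;
    const pair = `${advertiser}|${publisher}`;
    sums.byAdvertiserPublisher[pair] = (sums.byAdvertiserPublisher[pair] || 0) + bid;
    sums.byPublisher[publisher] = (sums.byPublisher[publisher] || 0) + bid;
  }
  // Advertiser sums drive billing, advertiser|publisher sums drive calibration,
  // and publisher sums drive payouts.
  return sums;
}
```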

Generating a bid that incorporates all three concepts (the advertiser, the publisher context, and person who will be shown that ad - here represented by the private interest group) is another challenge worth discussing.

We can imagine a world where some kind of "baseline bid" is generated based upon the ad chosen for the given private interest group. We can envision sending this "baseline bid" down to the client, and then dynamically updating it based on the actual context where the ad is shown, prior to logging it via the aggregated reporting API.

We think it should be possible to achieve a reasonable level of calibration by simply computing a placement-level-multiplier (based on aggregate historical data about the performance of ads on that placement). We would just take the "baseline bid" and multiply it by this placement-level multiplier to generate the final bid.

The only problem is how to make this placement-level multiplier available to the on-device bid-generation code.

One really terrible option would be to return a truly massive JavaScript object along with the ad bundle, containing a mapping from placement ID to multiplier for every placement that exists on the ad network. Leaving aside how huge this object would be to send around, it would also reveal an awful lot of information about the performance of other publishers on the network.

Another option would be to have some other asynchronous channel by which the browser could just ask the ad-server: "What is your multiplier for this placement?". This request would contain absolutely no information about private interest groups, or specific ads, and the result could be cached for the next few hours at least. Alternatively, one could imagine some other API by which Facebook JavaScript code would write data to some kind of store that provided read-only access during the bid-generation stage. That would allow us to write this placement-level bid-multiplier at some appropriate cadence, and use it at bid-generation time.
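To illustrate the asynchronous-channel variant, here is a minimal sketch, with the caveat that no such endpoint or browser mechanism exists today; the URL, caching policy, and everything else here are hypothetical:

```js
// Hypothetical contextual-only lookup of a placement-level multiplier,
// cached for a few hours. Nothing here is part of the TURTLEDOVE proposal.
const multiplierCache = new Map(); // placementId -> { value, fetchedAt }
const CACHE_TTL_MS = 4 * 60 * 60 * 1000; // "the next few hours at least"

async function getPlacementMultiplier(placementId) {
  const cached = multiplierCache.get(placementId);
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    return cached.value;
  }
  // The request carries only the placement ID: no interest groups, no specific ads.
  const response = await fetch(
    `https://adserver.example/placement-multiplier?placement=${encodeURIComponent(placementId)}`);
  const { multiplier } = await response.json();
  multiplierCache.set(placementId, { value: multiplier, fetchedAt: Date.now() });
  return multiplier;
}
```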

What do you think about this idea?

@michaelkleber
Collaborator

Hi Ben, sorry for the delay this time.

If your desired bid is (baseline from interest group and advertiser) * (placement-level multiplier), then I'd expect you to send those two individual values separately: one in the signals that are part of the interest-group ad request, and one in the signals that get sent back to the browser with the contextual request. There's a comment from a previous issue about getting your desired metadata into the contextual response.
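Roughly, that split could look like this inside the on-device bidding function; the signature and signal names here are placeholders, since nothing in the explainer fixes them yet:

```js
// Sketch only: the bidding-function signature and signal names are placeholders.
function generateBid(interestGroupSignals, contextualSignals) {
  const baselineBid = interestGroupSignals.baselineBid;      // sent with the interest-group ad
  const multiplier = contextualSignals.placementMultiplier;  // sent back with the contextual response
  // The combined value is what gets fed into aggregated reporting below.
  return baselineBid * multiplier;
}
```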

You will certainly be able to feed this bid into aggregated reporting, along with the advertiser+publisher combination, for both billing and calibration.

The only caveat here is that the finer you slice, the fewer events you're aggregating over, so the more noise you need to tolerate. It might also be beneficial to look at aggregates by advertiser alone or by publisher alone. Well, the benefit for billing is obvious, at least; I guess the merits for calibration depend on technical questions like your approach to back-off modeling.

@benjaminsavage
Author

Fantastic. Thank you for this response!

That's a really important insight I hadn't understood before. Thanks for linking to that other comment. That absolutely works for me. I'm happy to send a placement-level multiplier back with the contextual + 1p request for use in the JS bidding function.

Totally understand the noise concerns that come with the aggregated reporting API on thinner and thinner slices. Any data you can give us on the value of epsilon, and on how to simulate the noise added to provide global differential privacy, would be really helpful, so that we can simulate a realistic result and get useful test data.
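For example, if the mechanism turns out to be simple Laplace noise on counts, we would simulate it roughly like this; the mechanism, sensitivity, and epsilon are all assumptions on my part until you can share details:

```js
// Assumed mechanism for simulation only: Laplace noise with sensitivity 1.
function sampleLaplace(scale) {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function noisyCount(trueCount, epsilon) {
  return trueCount + sampleLaplace(1 / epsilon); // scale = sensitivity / epsilon
}

// e.g. the "1 conversion" slice above, simulated at epsilon = 1.0:
// noisyCount(1, 1.0) could plausibly come back anywhere from about -2 to 4.
```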

This is great, two of my top questions answered! Lots to go =).

Next one: What is the delay between the private-interest-group ad request and the time the returned ad is eventually shown? Should I assume it's minutes? hours? less than a day? more than a day? Is there a minimum time? How should I model the distribution of the delay?

@michaelkleber
Collaborator

I welcome suggestions on this! In the original explainer I had:

For example, a browser might [...] a few times a day issue any relevant interest-group ad requests to those networks.

So that would correspond to a cache lifetime, and hence a delay, of at most four to six hours. But I'm definitely not wedded to this answer. It would be interesting if we could model this as ads being downloaded around the beginning of each "browsing session", for example.

It seems like we need to balance two things against each other: Freshness vs. data over-use. If we request ads too infrequently, then a lot can change between serving and rendering time, and the in-browser auction drifts away from optimal over time. But if we request too often, then we run the risk of downloading lots of ads that never get shown.
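For modeling purposes, one simple assumption would be that ads are fetched on a fixed cadence and rendered at a uniformly random point within the cache lifetime; this is just a strawman for your simulation, not anything the explainer specifies:

```js
// Strawman delay model for simulation: uniform over the cache lifetime.
function sampleRenderDelayHours(cacheLifetimeHours = 6) {
  return Math.random() * cacheLifetimeHours;
}
```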

@JensenPaul
Collaborator

Closing this issue as it represents past design discussion that predates more recent proposals. I believe some of this feedback was incorporated into the Protected Audience (formerly known as FLEDGE) proposal. If you feel further discussion is needed, please feel free to reopen this issue or file a new issue.
