Protected Audience AB testing #909

Open
fhoering opened this issue Nov 16, 2023 · 5 comments

@fhoering
Contributor

fhoering commented Nov 16, 2023

Why do we need A/B tests?

  • A/B testing is a key feature for experimentation in order to increase the performance of Protected Audience
  • It allows us to measure the impact of technical changes
  • We must be able to measure long term effects in a consistent way

To give an example of what we mean by long-term effects, let's look at a complex user journey. Assume that we split users per publisher website (because we have access to the hostname in the PA API): on some publisher websites we apply buying strategy A and on others buying strategy B, and we can measure conversions like sales for each ad display.

In retargeting, we show a banner to users multiple times before they buy. For example, a user has added Nike shoes to his basket but has not converted, so we will remind him of the product through ads on several publishers. When he converts, the sale will be attributed to the publisher on which the last ad was shown and not to whatever happened before that. In other words, it is impossible to measure the effect of buying strategy A versus B since we will not have a single identifier across sites.

Existing mechanism with ExperimentGroupId

https://github.com/WICG/turtledove/blob/main/FLEDGE.md#21-initiating-an-on-device-auction

Optionally, perBuyerExperimentGroupIds can be specified to support coordinated experiments with buyers' trusted servers. If specified, this must also be an integer between zero and 65535 (16 bits).
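
For reference, this is roughly how a seller passes the field in the auction configuration per the explainer (the origins, URLs and the value 42 below are illustrative placeholders):

```js
// Illustrative auction config on the publisher page; origins and the group
// id are placeholders. The 16-bit id is forwarded to the listed buyer's
// trusted key/value server to coordinate the experiment.
const auctionConfig = {
  seller: 'https://seller.example',
  decisionLogicURL: 'https://seller.example/decision-logic.js',
  interestGroupBuyers: ['https://buyer.example'],
  perBuyerExperimentGroupIds: {'https://buyer.example': 42},
};
const winningAd = await navigator.runAdAuction(auctionConfig);
```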

The expected workflow has been described here:
Extending FLEDGE to support coordinated experiments by abrik0131 · Pull Request #266 · WICG/turtledove

Our understanding is that this translates to:
[Image: exp_group_id_workflow]

Pros:

  • buyerExperimentGroupId can be dynamically set by the buyer as part of the contextual call, allowing any split (see the comment below; this might no longer apply since asynchronous calls should be used to reduce auction latency)
  • 16 bits (65,536 different values) is big enough to handle multiple experiments at a time
  • this form of A/B testing seems well suited to measuring technical changes
  • analysis can directly be done via ExperimentGroupId as it is propagated to reportWin

Cons:

  • We cannot measure long-term effects (as explained above) because the split must be done based on contextual signals, for example by publisher domain. So we can basically only measure something that is directly attributed to this ad display, and the same Chrome browser might see changes on the very same ad campaign in both population A and B
  • One alternative that would allow measuring long-term effects would be segmenting users based on geolocation. The challenge here is to have populations of the same size and the same user behavior, so it will not be universally applicable but will depend on the use case
  • It might not be applicable to auctions where signals are resolved in an asynchronous way, since in this case the contextual call and the key/value server call run in parallel

Splitting per interest group and 1st party user id

Doing a per-interest-group split seems appealing because, for interest groups created on one advertiser website, one could apply the same changes to the same campaigns for all 1st party users of this advertiser.

This would mainly work for single-advertiser A/B tests where we target users who already went to the advertiser's web page. It would work less well for more complex scenarios on all our traffic where we modify the behavior of multiple campaigns on multiple websites; in this case we have the same drawback as above: the very same user could see behavior changes in both population A and B.

As we would split users during the tagging phase, we cannot guarantee that we really see those users again for a bidding opportunity. So we cannot guarantee an even split: we might only see n% of the users of population A at bidding time and a different share for population B (some more explanation here: Approach 2: Intent-to-Treat).

[Image: user_id_split]

Pros:

  • Could handle single-advertiser scenarios where we consistently know the user via their 1st party id

Cons:

  • For reporting we need to propagate the AB test population from generateBid to reportWin
    • it could be done by encoding the AB test population inside the renderUrl at the expense of k-anonymity; handling 5 AB tests in an independent way would mean 2^5 renderUrls
    • alternatively aggregated reporting (like aggregate ARA) could be used
  • We cannot handle a large number of AB tests in parallel, in any case fewer than with ExperimentGroupId
  • Leakage, as the same user will fall into different populations on different advertiser websites, which will be an issue when the behavior is changed on retargeting campaigns for multiple advertisers or when more upper-funnel campaigns are used
  • Additional bias, because we split users based on tagging behavior and not when we get a bid opportunity

Using shared storage for AB testing

The shared-storage proposal already has a section on how to activate AB tests. The general idea is to create a unique user identifier (seed) for the Chrome browser with generateSeed, then call the window.sharedStorage.selectURL operation, which takes a list of urls, hashes the user identifier to an index in this list and then returns the url for that user. The AB test population would be encoded in the url and, as the number of urls is limited to 8, it would allow 3 bits of entropy for the user population. As different urls can be used for each call and would leak 3 bits every time, some mechanisms are in place to limit the budget per 24h per distinct number of urls (see https://github.com/WICG/shared-storage#budgeting).
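
For illustration, a minimal sketch of that selectURL pattern, following the shared-storage explainer (module name, operation name and URLs are placeholders):

```js
// On the ad tech's page: load the worklet and let it pick one of up to
// 8 URLs; the chosen index encodes the (at most 3-bit) experiment group.
await window.sharedStorage.worklet.addModule('ab-test-worklet.js');
const opaqueUrl = await window.sharedStorage.selectURL('ab-test', [
  {url: 'https://adtech.example/creative?group=0'},
  {url: 'https://adtech.example/creative?group=1'},
  // ... up to 8 entries = 3 bits of entropy
]);
// `opaqueUrl` is an opaque URL intended to be rendered in a fenced frame.

// --- ab-test-worklet.js ---
class ABTestOperation {
  async run(urls) {
    // Persist a per-browser seed so the same browser always maps to the
    // same URL, i.e. the same test group.
    let seed = await sharedStorage.get('ab-seed');
    if (seed === undefined) {
      seed = Math.floor(Math.random() * 1e9).toString();
      await sharedStorage.set('ab-seed', seed);
    }
    return Number(seed) % urls.length;
  }
}
register('ab-test', ABTestOperation);
```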

As of now shared storage can only be called from a browser Javascript context and not from a Protected Audience worklet. This means the url selection can only happen during rendering and not during bidding, and therefore shared storage can only be used for pure creative AB tests and not for Protected Audience bidding AB tests. So we still need a dedicated proposal to activate Protected Audience AB tests.

Proposal - Inject a low entropy global user population into generateBid

For real-world scenarios a global user population would still be needed for AB tests that need to measure complex user behaviors. As injecting any form of user identifier would leak additional information, we propose a low entropy user identifier and some mitigations to prevent using or combining it into a full user identifier.

Chrome could cluster all users into a low entropy UserExperimentGroupId of something like 3 bits. This identifier should be randomly drawn for each ad tech and not shared across all actors, so that our measurements cannot be influenced by the testing of other ad techs.

As attribution is measured for each impression or click, we would like this identifier to be stable for some time, but it should also be shifted for a certain share of users to prevent a large population drift over time. Long-running AB tests will influence users, and user behavior will then change over time, introducing some bias. The usual way to solve this is restarting the AB test, which cannot be done here with such a limited number of buckets. So one idea might be to constantly rotate the population. Constantly rotating the population would also be useful to limit the effectiveness of a coordinated attack among ad techs to identify a user. If 1% of users get reassigned to a new population each day, it would mean that after 14 days roughly 14% of users would have shifted population.

If the labels are rotated every X weeks, it adds further burden to those trying to collude and update their 1st-party ID → global ID mappings
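
To make the assignment and rotation ideas concrete, here is a rough sketch of hypothetical browser-side logic (nothing like this exists in the API today); the hash function, the 100-day rotation period and the per-browser phase are illustrative choices that yield roughly 1% of users changing group on any given day:

```js
// Hypothetical browser-side logic, for illustration only: derive a 3-bit
// UserExperimentGroupId that differs per ad tech and slowly rotates.
// `browserSeed` stands for a secret per-browser value held by Chrome.
function experimentGroup(browserSeed, adTechOrigin, now = Date.now()) {
  const day = Math.floor(now / (24 * 60 * 60 * 1000));
  // Give each browser/ad-tech pair its own rotation phase so that, with a
  // 100-day rotation period, roughly 1% of users change group per day.
  const phase = fnv1a(`${browserSeed}|${adTechOrigin}|phase`) % 100;
  const epoch = Math.floor((day + phase) / 100);
  // Stable within an epoch, uncorrelated across ad techs and across epochs.
  return fnv1a(`${browserSeed}|${adTechOrigin}|${epoch}`) % 8; // 3 bits
}

// Simple non-cryptographic FNV-1a hash, purely for illustration.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}
```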

This new population id would be injected into the generateBid function and also passed to the trusted key/value server (to mirror the current ExperimentGroupId behavior and because many of our computations are still server side; it is secure by design as it will run in a TEE without side effects).

The identifier could only get out of the generateBid function via existing mechanisms that already present privacy/utility trade-offs, for example (a sketch follows the list below):

  • by adding more renderUrls at the expense of k-anonymity
  • by reserving some bits of modelingSignals at the expense of handling fewer advertiserSignals
  • by using aggregated reporting at the expense of DP noise and bucketization
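
For illustration, a sketch of what this could look like in a buyer's bidding function. Note that browserSignals.userExperimentGroupId is the field proposed here and does not exist in the API today, and the modelingSignals bit layout is just one possible convention:

```js
// Sketch of a buyer's generateBid(); `browserSignals.userExperimentGroupId`
// is the proposed field and does NOT exist in the API today.
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  const group = browserSignals.userExperimentGroupId ?? 0; // proposed, 0-7

  // Option 1: one renderUrl per group among otherwise identical ads, at the
  // expense of each renderUrl having to clear k-anonymity separately.
  const ad = interestGroup.ads[group % interestGroup.ads.length];

  // Option 2: reserve the top 3 bits of the 12-bit modelingSignals value
  // (noised and only surfaced to reportWin for the winning bid), leaving
  // 9 bits for whatever the buyer already encodes there.
  const otherSignals = 0; // placeholder for the buyer's own 9-bit payload
  const modelingSignals = (group << 9) | (otherSignals & 0x1ff);

  return {
    bid: 1.0,            // placeholder bid
    render: ad.renderUrl,
    modelingSignals,
  };
}
```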

If we encode the 3 bits into the renderUrl, this proposal seems very aligned with the shared-storage proposal to allow 8 URLs (= 3 bits of entropy) for selectURL to activate creative AB testing (post bidding). In our case, as Chrome would control the seed and the generateSeed function cannot be used, we would not leak more than 3 bits. So introducing any form of budget capping does not seem necessary.

To prevent a cookie-sync scenario where ad techs combine this new id into a full user identifier, Chrome could add an explicit statement to the attestation that forbids ad techs from sharing this id.

By design, as we have few AB test populations, we could only run a limited number of AB tests at the same time, but we could reserve this mechanism for important AB tests and use the ExperimentGroupId mechanism more for technical AB tests.

@remysaissy
Contributor

remysaissy commented Nov 16, 2023

Hello,
I can confirm that we at Teads are facing the same issue.
The situation and the possible options are well summarized, so nothing to add, but we would be very interested in having this issue solved too.
Thanks.

@fhoering
Contributor Author

fhoering commented Nov 29, 2023

This has been discussed in the WICG call from the 29th of November 2023.

There has been a question from @michaelkleber on why the interest group / 1st party user id split scenario would not work.

Let's imagine a scenario where I want to test 2 buying strategies across all my advertisers: one where I always bid 1 EUR (A) and one where I always bid 2 EUR (B).

In today's world I would propose either strategy A or B to users and then measure how many displays, clicks & sales I get. Note that paying less doesn't mean the user will also buy something. What I would like to find is the best buying strategy.

Now let's say 1 Chrome browser does 1 auction. I have 2 advertisers, create 1 interest group per advertiser and then split by advertiser 1st party user id. During the auction each IG participates, and out of all Chrome browsers we would have 25% that see AA scenarios, 25% AB, 25% BA and 25% BB scenarios.
For the AA and BB scenarios it is all good. For the AB & BA scenarios B would always win the auction as 2 EUR is always higher than 1 EUR. So in 75% of cases B would win the auction and in only 25% A, and my split would be heavily unbalanced towards B. As I have competition I will not get exactly 75% of displays on B vs 25% of displays on A. Also, for more complicated buying strategies I will not know that in reality I exposed users 75/25 vs 50/50, so I cannot compensate in some form.
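
A tiny simulation of the example above (ignoring outside competition) illustrates the imbalance:

```js
// Each of the two IGs is independently assigned strategy A (bid 1 EUR) or
// B (bid 2 EUR); the higher bid wins the browser's single auction.
let winsA = 0, winsB = 0;
const trials = 100000;
for (let i = 0; i < trials; i++) {
  const bidIG1 = Math.random() < 0.5 ? 1 : 2;
  const bidIG2 = Math.random() < 0.5 ? 1 : 2;
  if (Math.max(bidIG1, bidIG2) === 2) winsB++; else winsA++;
}
console.log(`A wins ${(100 * winsA / trials).toFixed(1)}% of auctions`); // ~25%
console.log(`B wins ${(100 * winsB / trials).toFixed(1)}% of auctions`); // ~75%
```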

So if I can't apply a unique split inside one auction, this form of split doesn't seem to work at all for cross-advertiser buying strategies, even for retargeting campaigns.

As a side note, splitting by time (hour, day, ...) usually doesn't work because users don't have the same behavior over time (see Black Friday for example).

EDIT: removed cost per sales metrics to simplify the example

@alois-bissuel

Jumping on the subject to double down on what @fhoering explained: the issue here is that we won't be able to measure during the test what will happen when the tested modification is rolled out.

For instance, suppose one user has two interest groups for one adtech, where one (IG1) is in the reference population (no modification of the bidding) and the other (IG2) is in the test population.
Let's assume that you want to test a large change of bid, i.e. a big lowering of the bid on some opportunities which might be less profitable.
During the course of the test, IG1 will always win on these opportunities.
At roll out, IG1 and IG2 will have a more equal chance of winning the opportunity.

Thus, the measure during the test will be impacted by competition within an adtech, which won't happen after roll out.

@michaelkleber
Collaborator

@alois-bissuel I think we talked about this during the 2023-11-29 call. This kind of bidding experiment is one where it makes sense to randomize A/B diversion based on a 1st-party identifier on publisher site. Now all of your IGs will compete against your other IGs using the same strategy on a single page (or even across a single site), so it will be reflective of the effect of rolling it out broadly.

@fhoering
Contributor Author

fhoering commented Mar 28, 2024

In reality, changing the bid strategy is a complex behavior. So it will never be as simple as knowing in advance what effect it will produce, such as the bid always being lower in all cases.

And in the case of a split by publisher 1st party id, I will have the problem that I cannot know which bid strategy produced the user's conversion behavior at the end. For example, he goes to publisher1 (high bid, sees several ads), publisher2 (low bid, no ad), publisher3 (low bid, sees one ad and clicks) => then buys something.
It is what has been shown in slide 7 in https://github.com/WICG/turtledove/blob/main/meetings/2023-11-29-FLEDGE-call-minutes-slides-ab-testing.pdf.

To me this ask still makes sense, and 3 bits seems reasonable and very aligned with the shared storage API. It could be seen as converging all Privacy Sandbox APIs.
