Real Time Monitoring API for FLEDGE #430
Hi Alex, I'd like to pull apart the different types of events you'd like to monitor. Taking them out of order, if you don't mind:
For Fenced Frames that have network access, nothing new is needed for this, right? It's just like everything in the browser today?
In previous FLEDGE calls we had discussed this kind of reporting using https://github.com/WICG/turtledove/blob/main/FLEDGE_extended_PA_reporting.md. Are you saying that aggregated reports like this are not good enough?
These are the tricky ones: if they were to include stack traces, then the worklets could use the error reporting mechanism to smuggle out all the information that we want to avoid leaking, including e.g. joined publisher-site and advertiser-site identities. Have you thought about ways this could be made privacy-safe?
Aggregated reports could be used in principle. However, we are concerned that the aggregated reports will not satisfy our delay requirements of approximately 5 minutes. SRE would need access to the monitoring time series data in real-time, with a maximal delay of approximately 5 minutes for critical metrics during emergency response. If the delay is too large, this would increase the incident detection time and we may incur significant revenue loss by the time we detect that there is an incident. In addition, a larger delay in the monitoring data would make it harder for us to verify whether a mitigation is effective and would delay the time required to mitigate outages.
I understand the privacy concern. Could stack traces be available temporarily? If not, could we get some information about a crash type?
@abrik0131 in theory, if the aggregated report was available immediately with a data freshness that's within that 5 minute delay SLA, does that satisfy the requirement? Are you asking us to only solve for the delay and not the granularity of the data?
Yes, that satisfies the requirement.
Yes, we are interested in the total number of errors/reports. The main issue is getting the data quickly enough.
@abrik0131 is it possible to understand a bit more of the impact of having this data not at 5 minutes, but at, say, 120 minutes (2 hours), or at an intermediate interval (15, 30, or 60 minutes) that you would find palatable to accept?
The error reporting API is intended to quickly detect the most severe types of outages, where in the worst case 100% of remarketing queries would crash or return errors. This would both impact our revenue and impact publishers, which would not receive remarketing ad revenue for the duration of the outage. The longer the delay is, the longer it would take for us to apply mitigations or press red buttons which could mitigate the outage impact. If there was, say, a 1 hour delay, then in the worst case 100% of remarketing queries would fail for an extra hour before we could detect the outage.

The monitoring delay would also impact the time to mitigate an outage in the case where pressing our red buttons fails to mitigate it. Our debugging workflow in that case involves making changes to the serving configuration and then waiting for a signal on whether the change was effective or not, and a longer monitoring delay would increase the time required for each iteration. An extra 1 hour of monitoring delay would therefore correspond to several hours of outage impact. For example, if we have 4 emergency red buttons (config changes, rollbacks, etc.) to press, we would need to press one button at a time and wait 1 hour for a signal to see if the mitigation attempt was successful. This could bring the time to mitigation to 5 hours after detection, or possibly more if other mitigation mechanisms are required.

As to whether a longer delay is acceptable, this would depend on what fallback mechanisms we have available in FLEDGE in the case of crashes or errors. For example, if there was already a fallback in the browser to show the contextual ad, this would reduce the impact compared to the browser showing no ad in the case of a crash. In general, we would strongly prefer the reporting delay to be 5 minutes or less.
@abrik0131 another question: can you please help us understand, from the above summary, which events and data requested are:
We suspect the answer is "none" but it's important to clarify, as
To monitor bidding and scoring JavaScript crashes, is it sufficient for the JavaScript to catch the exceptions and report them via a sampled reporting API like this?
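To make the question concrete, here is a rough sketch of that pattern, assuming the buyer wraps its own bidding logic. computeBid and the reporting endpoint are hypothetical, and forDebuggingOnly is used only as a stand-in for whatever sampled reporting API is available:

```js
// Sketch only: buyer-side exception catching in generateBid, reported through a
// sampled debug reporting call. computeBid() and the error endpoint are
// hypothetical stand-ins.
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  try {
    return computeBid(interestGroup, trustedBiddingSignals);  // buyer's real bidding logic
  } catch (e) {
    // Only a coarse error type is reported, not a stack trace, to avoid the
    // leakage concerns raised above.
    forDebuggingOnly.reportAdAuctionLoss(
        'https://buyer.example/worklet-error?type=' + encodeURIComponent(e.name));
    return {bid: 0};  // treat the error as a no-bid
  }
}
```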
We wanted to repost with our updated understanding of our requirements since the original post is now over nine months old and a lot has changed including the introduction of a proposal for sampled debug reporting. Real-time monitoring is intended for speedy detection of issues from histograms and time series where we largely don’t care about individual data points in isolation but where we do care about sudden changes in aggregated metrics across all auction invocations (e.g. increased latency or crash rate or bid anomalies to name a few). We would like to emphasize that it’s not limited to JavaScript and worklet crashes, which was the focus of our previous discussions. Our requirements are:
Something roughly shaped like the private aggregation API would give us most of our mileage, provided we can guarantee O(minutes) delay for a conceivably very large stream of events. We believe the requirements are complementary to those of Issue 162’s proposal for sampled debug reporting, since for monitoring we want full browser/auction coverage with lower entropy and without locking out browsers. This is in contrast to the new debug sampling proposal’s strengths for root cause determination, where we will likely need only a small number of samples with significantly higher entropy.
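To make the shape concrete, a rough sketch assuming Private-Aggregation-style ergonomics inside the bidding worklet; the bucket ids and the computeBid helper are illustrative only:

```js
// Rough sketch: every auction invocation contributes to an aggregate histogram,
// and the open question is how quickly that aggregate becomes available.
// Bucket ids and computeBid() are made up for illustration.
const BID_INVOCATION_BUCKET = 1n;
const BID_ERROR_BUCKET = 2n;

function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
  // One contribution per invocation lets us watch the invocation rate...
  privateAggregation.contributeToHistogram({bucket: BID_INVOCATION_BUCKET, value: 1});
  try {
    return computeBid(interestGroup, trustedBiddingSignals);  // illustrative helper
  } catch (e) {
    // ...and one per failure lets us watch the error rate; the ratio is the
    // crash/error rate we want to alert on.
    privateAggregation.contributeToHistogram({bucket: BID_ERROR_BUCKET, value: 1});
    throw e;
  }
}
```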
Do you expect buyers to want their own monitoring, independent from permitting sellers to see their failures? (Also note that sampling, the way the debugging API does it, probably makes it easier to do things in realtime, privacy-wise.)
An addition to the types of events we would like to monitor:
5. crashes that occur in buyer reporting worklets
While not real time, this "dashboard" is linked somewhere from DevRel docs: https://pscs.glitch.me/
The Privacy Sandbox team is looking into this issue and seeks feedback from adtechs on the following:
We look forward to feedback from buyers and sellers. Please let us know if you have any questions.
It's mentioned indirectly above, but can we simply borrow the logic of the
@rdgordon-index,
@rushilw-google An additional API for real time monitoring sounds useful to us at RTB House. We are evaluating the questions you posted, but in the meantime I wanted to ask: what would be the ETA for the new API to be available? Whether this API is available before the 3PC phase-out is a crucial factor in how we prioritize on our end.
Understood @JensenPaul -- but to @jonasz's point, my concern is regarding the timeline; a net-new API seems further out than potentially re-purposing the existing debugging framework, which is not being downsampled pre-3PCD.
Low single-digit minutes -- we're extremely sensitive to the impact of any potential interruptions.
Anything that prevents
Thanks @jonasz and @rdgordon-index for your comments. We recognize the tight timeline to 3PC phase-out and we are working on publishing an explainer as soon as possible, where we’ll publish the timeline after due consideration. To reduce the effort at the time of adoption, this API will likely share similar ergonomics to those of the Private Aggregation API, with some differences that allow the SLA to be near real time. We encourage adtechs to continue sharing requirements at this stage as inputs to the design.
I don't think it should be limited to crashes; after all, anyone can have logic errors, and those could be even more expensive than crashes.
Hello @rushilw-google,
Thank you for drafting the explainer. We wanted to raise a certain doubt regarding opt-in:
Thanks,
Hi @michal-kalisz, thanks for your note! Regarding opt-in, we found it helpful to consider four aspects of the options for the opt-in mechanism: privacy implications, performance implications, build complexity/time, and adtech preference beyond these reasons. With that lens, sharing our thoughts on the options below:
We plan to launch Real Time Monitoring with the buyer-seller coordination mechanism while continuing the discussion on other mechanisms, should you have any further thoughts on those.
Much of the proposal has been implemented in Chrome and can be tested in Chrome Canary today using the
Hello, if much of the proposal has been implemented, does that mean that if real-time reports are emitted, we should be able to receive them on the
Has the format been decided yet? Is it a raw vector of 1024 bytes with value 0 or 1, or is it more elaborate?
The Real Time Monitoring API should now be available for testing in 50% of Chrome Canary and Dev channel traffic.
Yes, if the API is enabled via the command line flag or as part of testing on pre-stable Chrome traffic, and your origin is opted in via the auction config.
#1226 specifies the serialization format.
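For reference, a sketch of the auction config opt-in mentioned above; the field names (sellerRealTimeReportingConfig, perBuyerRealTimeReportingConfig) and the 'default-local-reporting' type reflect one reading of the explainer and should be verified against it before use:

```js
// Sketch of opting a seller and a buyer into real time reporting via the auction
// config. Field names and the 'default-local-reporting' type are assumptions
// based on the explainer; the origins are placeholders.
navigator.runAdAuction({
  seller: 'https://seller.example',
  decisionLogicURL: 'https://seller.example/decision-logic.js',
  interestGroupBuyers: ['https://buyer.example'],
  sellerRealTimeReportingConfig: {type: 'default-local-reporting'},
  perBuyerRealTimeReportingConfig: {
    'https://buyer.example': {type: 'default-local-reporting'},
  },
});
```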
Running Canary with the cmdline flag, I confirm the POST requests are sent.
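In case it helps others reproduce this, a minimal receiver for such a local check; the well-known path below is an assumption on my part, so verify the exact path your reporting origin must serve against the real time reporting explainer:

```js
// Minimal local receiver that logs incoming real-time report POSTs.
// The path is an assumption; the port and logging are for local testing only.
const http = require('http');

const ASSUMED_PATH = '/.well-known/interest-group/real-time-report';

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url.startsWith(ASSUMED_PATH)) {
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => {
      const body = Buffer.concat(chunks);
      // Log size plus a short hex preview, just to confirm reports are arriving
      // before wiring up a real decoder for the histogram payload.
      console.log(`real-time report: ${body.length} bytes`, body.subarray(0, 16).toString('hex'));
      res.writeHead(200).end();
    });
  } else {
    res.writeHead(404).end();
  }
}).listen(8080, () => console.log('listening on :8080'));
```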
If you are also interested in receiving low entropy UACH so that you have the option to differentiate Mobile, etc. real-time events, note your interest here.
The Real Time Monitoring API should now be available for testing in 50% of Beta channel traffic.
Hello @qingxinwu. The
The Real Time Monitoring API should now be available for testing in 1% of Stable channel traffic.
@JensenPaul, @qingxinwu I figured out why the
Thanks for the update @JensenPaul! Just to clarify, is the
No, the flag is not needed in order to be part of the 1% traffic. Also see the real time reporting explainer about which reports are sent when the feature is enabled.
Hi @qingxinwu, from this comment, I understand the API is enabled on non-labeled traffic; is that the case? Was this in any documentation? I also notice in my browser that when I force a label, no report is sent, while reports do get sent when I don't force a label. I am asking because we have everything in place on our side to receive reports, and we have a couple of SSPs configured to send us reports, yet we still don't receive any pings. If the API is only enabled on non-labeled traffic, this could be a problem for us, as we can only participate in Protected Audience auctions on labeled traffic.
Hi @ccharnay67, yes, you are correct that this Real Time Monitoring feature was not enabled in Mode A and Mode B facilitated-testing labeled traffic.
@ajvelasquez-privacy-sandbox, it is my understanding that we are beyond the CMA-aligned/mandated market testing period. Is the pattern of new PA features being excluded from labeled instances a CMA mandate?
See also
Thank you for your answers. Would you consider enabling the feature for mode A or B, or maybe even just on one label? The comment you mention, @JacobGo, states that the other feature was held back on labelled traffic "to avoid disrupting ongoing experiments and testing". But this was in May, close to the CMA test, and we believe this argument may not be as strong now. Furthermore, one of the goals of the Real-Time Monitoring API is to be useful when doing experiments and testing; it should allow us to detect issues in such settings. So, in our opinion, it makes a lot of sense to enable it on labelled traffic.
Thank you for bringing this to our attention. We've been considering this and will post an update when we've made a decision.
The Real Time Monitoring API should now be enabled in 100% of Stable channel traffic, starting from 129.0.6668.77.
Hi @qingxinwu, could you please clarify whether it is still limited to non-labeled traffic? We are still not getting any reports on our end.
Yes, it's still limited to non-labeled traffic at this moment.
Client side ad auctions described in FLEDGE create new challenges for detecting failures in real time and reacting to them in order to avoid expensive outages. The main factor contributing to these new challenges is decreased visibility of client side failures. To improve visibility of client side failures we propose the following extension of FLEDGE whose purpose is to support effective monitoring of client side auctions.
Events to be monitored
There are several types of events that we would like to monitor. These are:
For the crashes, i.e. events (1)-(3), we would like Chrome to send the following types of data:
For the failures on URL fetches, we would like Chrome to send the following types of data:
Registering monitoring URLs
Reporting URLs will be provided in the auction config.
sellerErrorReportingUrl param
perBuyerErrorReportingUrl param

Notifications will be routed as follows. Crashes in the seller worklet will be reported to sellerErrorReportingUrl. Crashes in a buyer worklet will be reported to perBuyerErrorReportingUrl for the appropriate buyer. Crashes in Fenced Frames will be reported to perBuyerErrorReportingUrl for the appropriate buyer. Failures on URL fetches will be reported:
a. biddingLogicUrl, dailyUpdateUrl, trustedBiddingSignalsUrl, renderUrl to perBuyerErrorReportingUrl for the appropriate buyer.
b. trustedScoringSignalsUrl to sellerErrorReportingUrl.

A rough sketch of how these params might appear in the auction config is given below.
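For illustration only (treating perBuyerErrorReportingUrl as a map keyed by buyer origin, mirroring existing perBuyer* params, is an assumption about the final shape):

```js
// Illustration of the proposed params in an auction config. The two error
// reporting URLs are the additions proposed here; the surrounding fields are
// the existing FLEDGE ones, and all origins are placeholders.
navigator.runAdAuction({
  seller: 'https://seller.example',
  decisionLogicUrl: 'https://seller.example/decision-logic.js',
  interestGroupBuyers: ['https://buyer.example'],
  sellerErrorReportingUrl: 'https://seller.example/fledge-errors',   // proposed
  perBuyerErrorReportingUrl: {                                       // proposed
    'https://buyer.example': 'https://buyer.example/fledge-errors',
  },
});
```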
Sending monitoring notifications
Upon each event, i.e. a crash in the seller or buyer worklets, or a crash in Fenced Frames, or a timeout, Chrome will send a notification to the registered URL, with the following URL params:
eventType containing the type of the event. For sellers eventType will be omitted. For buyers, three values are possible: 'bidding', 'fencedframes', or 'timeout'.
workletVersion containing the worklet code version.
stackTrace containing the stack trace.
chromeVersion containing the Chrome version.

In case of a failure on a URL fetch, an additional param containing the URL causing the failure will be added:
fetchFailureURL containing the timed out URL.

The notifications will look as follows (illustrative examples are sketched after this list). For the crash in the seller worklet
For the crash in the buyer worklet
For the crash in the fenced frames
For the failure on URL fetch
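As an illustration of the shape only (every value below is made up), the four notifications above might look like, in order:

```
https://seller.example/fledge-errors?workletVersion=42&stackTrace=%3Curl-encoded+stack+trace%3E&chromeVersion=104.0.5112.79
https://buyer.example/fledge-errors?eventType=bidding&workletVersion=42&stackTrace=%3Curl-encoded+stack+trace%3E&chromeVersion=104.0.5112.79
https://buyer.example/fledge-errors?eventType=fencedframes&workletVersion=42&stackTrace=%3Curl-encoded+stack+trace%3E&chromeVersion=104.0.5112.79
https://buyer.example/fledge-errors?eventType=timeout&workletVersion=42&chromeVersion=104.0.5112.79&fetchFailureURL=https%3A%2F%2Fbuyer.example%2Fbidding-logic.js
```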