This repository has been archived by the owner on May 16, 2023. It is now read-only.

Unnecessary delay in exposure notification due to delayed fetching of Diagnosis Keys? #466

Closed
kbobrowski opened this issue Jun 13, 2020 · 129 comments
Assignees
Labels
mirrored-to-jira This item is also tracked internally in JIRA

Comments

@kbobrowski

Current API allows for querying days on which Diagnosis Keys became available:

$ date +%Y-%m-%d
2020-06-13

$ curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date
["2020-06-09","2020-06-10","2020-06-11","2020-06-12"]

Apps use this endpoint to check if these dates are already in cache, and if not to download Diagnosis Keys for missing days using endpoint:

$ curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date/2020-06-12 --output diagnosis_keys_2020-06-12.zip

Let's assume a scenario where someone had contact with an infected person. The infected person uploads their Diagnosis Keys on 2020-06-12, but these keys are bundled into a package that only becomes available for download on 2020-06-13. This introduces a full day of delay, during which the exposed person is unaware of the contact and cannot self-isolate, potentially endangering people around them.

Interestingly, it seems to be possible to get Diagnosis Keys uploaded on the current day, but only via the hour API endpoint:

$ curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date/2020-06-13     
<!-- error -->

$ curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date/2020-06-13/hour
[3,8,10,12]

$ curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date/2020-06-13/hour/12 --output diagnosis_keys_2020-06-13_12.zip

This endpoint seems to be used only in a function which is marked to be dropped before release. Functionality to fetch hours for the current day existed, but it was removed in this pull request.

Is there a reason not to continuously download Diagnosis Keys as they are made available during the current day, instead of waiting until the package for the complete day is ready?

As far as I understand, this would not introduce any privacy issues, since those are already handled by the upload mechanism documented here; it would just let people know faster that they were exposed. The privacy of this solution can also be confirmed by looking at the latest Diagnosis Key in each bundle. For example, in a bundle from the current day (2020-06-13, hour 12) the latest Diagnosis Key has interval number 2653200, meaning it was still generating RPIs yesterday:

$ TZ="utc" date -d@$((2653200 * 600)) +"%Y-%m-%d %H:%M"
2020-06-12 00:00
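The same check can be done in Python (an illustrative sketch; the 600-second interval length is defined by the Exposure Notification specification):

```python
from datetime import datetime, timezone

# The EN framework divides time into 10-minute (600 s) intervals counted
# from the Unix epoch; each Diagnosis Key carries the interval number at
# which it became valid.
EN_INTERVAL_SECONDS = 600

def interval_to_utc(interval_number: int) -> str:
    """Convert an EN interval number to a UTC timestamp string."""
    ts = interval_number * EN_INTERVAL_SECONDS
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")

# Latest key in the 2020-06-13 hour-12 bundle:
print(interval_to_utc(2653200))  # -> 2020-06-12 00:00
```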

Internal Tracking ID: EXPOSUREAPP-1567

@kbobrowski
Author

kbobrowski commented Jun 13, 2020

Further on this point: retrieval of Diagnosis Keys seems to be triggered every 24 hours, which further worsens the worst-case performance of the system in a situation where:

  • An infected person uploads Diagnosis Keys on 01-07 at 01:00
  • The app of a contact person downloads a new batch of Diagnosis Keys on 01-07 at 23:00 (this batch does not yet contain the infected person's Diagnosis Keys)
  • The infected person's Diagnosis Keys become available for download on 02-07 at 00:00
  • The relevant Diagnosis Keys are downloaded by the contact person on 02-07 at 23:00
  • The contact person only learns about their exposure status the next day after waking up (on 03-07)

There seem to be no privacy or technical limitations preventing the contact person from learning about their exposure status as early as the morning of 01-07 (two days earlier).

The change from checking for new Diagnosis Keys every 2 hours to every 24 hours was introduced recently in this pull request.

The solution architecture document states a one-hour interval in Data Format, Data Transfer and Data Processing and Bandwidth Estimations.

@mh-

mh- commented Jun 14, 2020

There seems to be one limitation that would prevent running RPI matching every hour: calls to provideDiagnosisKeys() are limited to 20 per day (for privacy protection reasons, to prevent an app from querying the API about contact with a single user, and obviously there's also the battery consumption aspect...)
But that's far away from running it only once per day.

@kbobrowski
Author

@mh- thanks for the additional info; it seems that the original design already took this into account by capping calls to provideDiagnosisKeys() at 12 per day. So to recap, the original design had:

  • downloading a new bundle every 2 hours
  • bundles made available every 1 hour
  • worst-case additional delay in exposure notification: 3 hours

And now we seem to have:

  • downloading a new bundle every 24 hours
  • bundles made available every 24 hours
  • worst-case additional delay in exposure notification: 48 hours
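The recap above can be expressed as a tiny model (an illustrative simplification: in the worst case, a key uploaded just after a bundle was cut waits one full bundling interval, and the finished bundle then sits one full fetch interval before the app downloads it):

```python
def worst_case_delay_h(bundle_interval_h: int, fetch_interval_h: int) -> int:
    """Worst-case extra notification delay in hours: one full bundling
    interval plus one full client fetch interval."""
    return bundle_interval_h + fetch_interval_h

print(worst_case_delay_h(1, 2))    # original design -> 3
print(worst_case_delay_h(24, 24))  # current design  -> 48
```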

@kbobrowski
Author

This is starting to look more like an Android implementation issue, since iOS updates diagnosis keys every 2 hours and queries both the days endpoint and the hours endpoint for the current day (function definition here, called from here). This would mean that the delay in exposure notification introduced for iOS users is up to 3 hours, while for Android users it is up to 48 hours.

@SebastianWolf-SAP
Member

We will provide further details in the coming days; right now only some preliminary information with a big disclaimer: it will probably become more concrete, or even be corrected, in the coming days once we have final confirmation from the respective colleagues. It's Sunday and many colleagues are taking their well-deserved day off.

To my knowledge, both the Android and iOS apps behave consistently when it comes to updating the OS-internal Exposure Notification framework with diagnosis keys - this is done once per day on all platforms. The code which you saw on iOS also triggers the risk calculation, which can be done more often per day (but always with the same diagnosis keys from the current day). You know, the epidemiological parameters for the risk calculation might change, so two different calculations with the same diagnosis keys might actually yield different results...

The reason for that once-per-day frequency is mainly API rate limiting, as already outlined by @mh-. On Android you have 20 calls per day, on iOS only 15. This rate limiting applies not only to the calls per day, but also to the number of data files which you can send to the OS-internal framework per day.

As we always need to present the complete, epidemiologically relevant data of all diagnosis keys to the framework, we already have 14 signed data files if we retrieve the diagnosis keys once per day for the relevant period of 2 weeks. So there is no chance of updating the diagnosis keys more often than that.

Of course, the backend could provide larger chunks of data (e.g. always the complete data for the respective last 2 weeks), so that the number of files for the relevant 2-week period is smaller, which would then also enable calling the framework more often. But that would certainly increase the overall load on the backend infrastructure, as more data overall would need to be served to clients. It probably also has other drawbacks...

Thus, the current approach is from my perspective a balance between the rate limiting of the API, efficient handling of server infrastructure and actions that make sense from an epidemiological point of view. Even the delay mentioned by @kbobrowski might still be OK when the usual incubation period is taken into account.

But once again: Take this statement with a grain of salt. We will see further updates in the next days and of course keep this issue open for further discussions.

Until then: Enjoy your Sunday!

Mit freundlichen Grüßen/Best regards,
SW
Corona Warn-App Open Source Team

@kbobrowski
Author

kbobrowski commented Jun 14, 2020

@SebastianWolf-SAP thanks for the quick response (on a Sunday!)

What I'm missing is why the background task (whatever it is doing, even if just updating epidemiological parameters) on iOS is triggered every 2 hours while on Android it is triggered every 24 hours, but perhaps this will be explained / corrected later.

I fully understand that rate limiting on iOS and Android is the driving factor for the refresh rate. On Android it limits us to a refresh every 1.2 hours, on iOS every 1.6 hours. Why choose a refresh rate of every 24 hours, though? This seems like quite an arbitrary number (obviously related to the human circadian rhythm, but that seems to have no relevance here). The initial choice of refreshing every 2 hours seemed reasonable.

As we always need to present the complete, epidemiologically relevant data of all diagnosis key to the framework, we already have 14 signed data files if we retrieve the diagnosis keys once per day for the relevant period of 2 weeks. So there is no chance of updating the diagnosis keys more often than that.

I'm missing something here; let me use the following user story to illustrate it: I meet an infected person on 01-07 and become infected that day. That person receives a positive test result on 07-07 and uploads their Diagnosis Keys the same day (I'm already contagious by then). The full 14 days' worth of Diagnosis Keys of the person who infected me will become available at the /date/2020-07-07 endpoint. These Diagnosis Keys also become available at /date/2020-07-07/hour/12. Now the problem is that I cannot yet fetch /date/2020-07-07 (it only becomes available the next day), and I keep infecting my family and friends. I only learn on 08-07 that I was exposed.

Why can I not fetch the fresh /date/2020-07-07/hour/12 package containing the 14 Diagnosis Keys of the person who infected me and subsequently isolate myself? This seems like a design decision that may result in loss of health or life, as in the worst case I could be spreading the virus for about 45 hours longer than if a 2-hour refresh with hourly key fetching were implemented.

To summarize, I think this issue is in fact two separate but closely related issues:

  • frequency of fetching data: the 24-hour interval increases the worst-case delay by 22 hours (assuming the maximum frequency is a fetch every 2 hours). The rate at which the server bundles new Diagnosis Keys is irrelevant here; whether it bundles them every minute or every week, this decision alone still results in a worst-case 22-hour increase in the exposure notification delay

  • rate at which new bundles of Diagnosis Keys are made available: if my understanding is correct, each bundle contains all 14 Diagnosis Keys of each individual who contributed to it, so it seems to make sense to publish bundles as frequently as possible. Every 1 hour seemed like a good decision, but now it seems to be every 24 hours

Each of these issues independently introduces up to 24 hours of worst-case delay in receiving an exposure notification (so combined they may result in a 48-hour delay), compared to the original design which had a worst-case delay of 3 hours.

But let's leave it for working days, enjoy your Sunday as well!

@mh-

mh- commented Jun 14, 2020

@kbobrowski If I understood @SebastianWolf-SAP correctly, his assumption is that you cannot just feed the EN API on the device with a set of DKs that were uploaded in the last 2 hours; you must feed it with all DKs uploaded in the last 14 days, and each such DK file transaction counts towards the rate limit.
You also cannot combine the DK files yourself on the device, because they have to be signed in the backend.
If that is correct, and you do not want the server to deliver full-14-day packages every 2 hours, all you can do is collect daily files and feed the last 14 of them once per day to the framework.

However, this feels like a strange restriction, and it's not really obvious from the API doc; it seems like you can provide a list of key files in each call that counts towards the rate limit:

/**
* Provides a list of diagnosis key files for exposure checking. The files are to
* be synced from the server. Old diagnosis keys (for example older than 14 days)
* will be ignored.
*
* Diagnosis keys will be stored and matching will be performed in the near future, 
* after which you’ll receive a broadcast with the
* {@link #ACTION_EXPOSURE_STATE_UPDATED} action. 
*
* The diagnosis key files must be signed appropriately. Exposure configuration
* options can be provided to tune the matching algorithm. A unique token for this 
* batch can also be provided, which will be used to associate the matches with
* this request as part of {@link #getExposureSummary} and
* {@link #getExposureInformation}. Alternatively, the same token can be passed in
* multiple times to concatenate results. 
*
* After the result Task has returned, keyFiles can be deleted. 
*
* Results for a given token remain for 14 days. 
*/

Task<Void> provideDiagnosisKeys(
    List<File> keyFiles, ExposureConfiguration configuration, String token);

@kbobrowski
Author

he has the assumption that you cannot just feed the EN API on the device with a set of DKs that have been uploaded in the last 2 hours, but that you must feed it with all DKs that have been uploaded in the last 14 days

@mh- I'm not sure if @SebastianWolf-SAP meant this, but I don't believe that this is how the server bundles data. The bundle from 12-06-2020 contains 78 Diagnosis Keys which were live between 2020-05-30 and 2020-06-11, and the bundle from 13-06-2020 contains 84 DKs from the period between 2020-05-31 and 2020-06-12, but not a single DK is shared between these two bundles.

Hourly DKs are also signed, so there is no need to combine and re-sign them, and I also believe that the API works as you described:

it seems like you can provide a list of key files for each call that is counted for the rate limit

@mh-

mh- commented Jun 14, 2020

Bundle from 12-06-2020 contains 78 Diagnosis Keys which were live between 2020-05-30 and 2020-06-11, and bundle from 13-06-2020 contains 84 DKs from time period between 2020-05-31 and 2020-06-12, but there is no single DK shared between these two bundles.

Yes, I think we all agree on this, and that's why you need to deliver all of them to the API in one "session" (using one token), before you can ask it for a meaningful risk scoring.
But since you are apparently able to do it with just one call, I also don't see a reason why this can't be done 15 times per day.

@kbobrowski
Author

kbobrowski commented Jun 14, 2020

OK, so I think I'm starting to understand what's going on here. Information about the paths of downloaded DK bundles is stored in this KeyCacheDao. We have RetrieveDiagnosisKeysTransaction, which fetches keys from the server using the asyncFetchFiles function. This function simply checks whether files for the available dates have already been downloaded, blocks until any missing days are downloaded, and returns a list of files corresponding to all existing .zip files with DKs. These files are then fed by the transaction to the executeAPISubmission function, which passes them one by one to provideDiagnosisKeys. There is an implicit assumption here that provideDiagnosisKeys will be called fewer than 20 times a day, which is ensured by fetching only daily keys, only every 24 hours, and deleting outdated files. As a result, executeAPISubmission is called once a day with a list of at most 14 files, and provideDiagnosisKeys is then called at most 14 times a day (in one short burst, by iterating over the provided list).
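If the Google limit really counts calls rather than files, the batch size determines how many full matching runs fit into the daily quota. A rough sketch (hypothetical helpers, not CWA code):

```python
def api_calls_per_run(num_files: int, batch_size: int) -> int:
    """Calls to provideDiagnosisKeys() needed to submit num_files key
    files when batch_size files are passed per call (ceiling division)."""
    return -(-num_files // batch_size)

DAILY_CALL_LIMIT = 20  # Google's documented per-day limit on method calls

def max_runs_per_day(num_files: int, batch_size: int) -> int:
    """How many complete matching runs fit into the daily call quota."""
    return DAILY_CALL_LIMIT // api_calls_per_run(num_files, batch_size)

print(max_runs_per_day(14, 1))   # batch size 1: 14 calls per run -> 1 run/day
print(max_runs_per_day(14, 14))  # one batched call per run -> 20 runs/day
```

Under these assumptions, submitting all 14 files in a single batched call would allow matching to run many times a day instead of once.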

This seems like a strange design. I thought it was possible to simply feed the Exposure Notification framework with new DKs; they would land in the internal LevelDB database and the EN framework would simply recalculate the risk score. Or just pass a list of all files (daily and hourly bundles) in one call. There is this statement in the docs of executeAPISubmission:

We currently use Batch Size 1 and thus submit multiple times to the API. This means that instead of directly submitting all files at once, we have to split up our file list as this equals a different batch for Google every time.

Why feed only single-element batches? It seems that we can call this function 20 times a day (each time with multiple files), as stated in Google's documentation:

Calls to this method are limited to 20 per day

On the other hand Apple documentation linked by @SebastianWolf-SAP states that:

Apps can’t submit more than 15 key data files every 24 hours

This is a big difference in functionality between limiting "calls to a function" and "data files passed"; is there a typo in one of these documents, or do the Apple and Google frameworks really differ this much?

In any case, if this discrepancy is real, I don't think it is a reason to downgrade the functionality of one app in order to match the lower functionality level of the other. It is in the best interest of iPhone users that all Android users are notified as fast as possible about their exposure status, so that they have less chance of infecting people around them (including iPhone users).

@cfritzsche

@SebastianWolf-SAP wrote:

Even the delay mentioned by @kbobrowski might still be OK when the usual incubation period is taken into account.

I think @kbobrowski is right to point out unnecessarily long delays. One main value this app provides is a faster time-to-notification after a positive test. But noticing symptoms (or receiving an exposure notification) and getting a test result takes 2-3 days even in good times. During that time, contacts are already infectious themselves. If another 2-3 day delay is introduced, the time advantage is essentially lost and the infection has hopped to the next level of contacts. (This is where second-degree notification corona-warn-app/cwa-wishlist#24 would come in handy.)

I hope this delay issue can be clarified in the coming days.

@SebastianWolf-SAP
Member

Wow, there have been quite a few additional comments in the last hours. :) Let me just quickly confirm that we will definitely clarify this issue in the upcoming days with all the required details.

Mit freundlichen Grüßen/Best regards,
SW
Corona Warn-App Open Source Team

@kbobrowski
Author

kbobrowski commented Jun 14, 2020

@SebastianWolf-SAP thanks for keeping us updated :) Just one additional comment: Google's reference Android implementation provides files with Diagnosis Keys in batches and removes these files after confirming that they were submitted to the Exposure Notification framework. I can also see, by using apps from different countries, that Diagnosis Keys are stored inside the EN framework in the app_en_diagnosis_keys directory (simply as export.bin files). This led me to believe that the intended way of interacting with EN is to simply feed it new Diagnosis Keys (in batches, up to 20 times a day) and let it do its job and notify the app in case contact with an infected person was detected.

@kbobrowski
Author

kbobrowski commented Jun 15, 2020

Some final thoughts: what still confuses me is this part of @SebastianWolf-SAP's response:

As we always need to present the complete, epidemiologically relevant data of all diagnosis key to the framework, we already have 14 signed data files if we retrieve the diagnosis keys once per day for the relevant period of 2 weeks. So there is no chance of updating the diagnosis keys more often than that.

This is reflected in the current implementation, where the last 14 daily bundles of DKs are stored in local storage and provided every day to the provideDiagnosisKeys function. What is really confusing is why we need to repeatedly provide the same daily bundle to the framework 14 times over the course of 14 days. The only reason which comes to my mind is that the epidemiological configuration may change and DKs need to be re-evaluated, but this looks like solving the right problem in the wrong place; let me explain:

  • each day a given daily bundle of DKs becomes less epidemiologically relevant as its DKs age; in the extreme case, a 14-day-old bundle contains DKs which were "live" 2 to 4 weeks ago, yet we still feed it to the EN framework to do the matching (do 2-to-4-week-old RPIs even exist anymore in the EN framework's LevelDB?). With this approach only about 50% of the DKs fed to provideDiagnosisKeys are younger than 2 weeks, so there is roughly 2x unnecessary overhead.
  • as a result of this design, downloading new data is limited to once every 24 hours, and only to daily keys, resulting in a worst-case 48-hour delay in exposure notification

Let's consider an alternative approach:

  • apps download data more frequently; for the sake of argument (not to get into the discussion about providing a list of files to provideDiagnosisKeys as opposed to a single file at a time), let's assume a new bundle of DKs is released every 2 hours and fetched every 2 hours. Each of these bundles contains 14 days' worth of DKs of infected persons, enough to do the matching with RPIs. These bundles are not stored on the device - they are just provided 12 times a day to the EN framework and discarded; each time new data arrives, EN calculates the exposure score and notifies the user
  • what happens in case we need to re-evaluate the epidemiological configuration: we can simply ship all DKs from the last 14 days at some extra endpoint which is queried together with the 2-hourly endpoint, or simply bundle these DKs into one big 2-hour update. If this would be too heavy, we can spread it over the course of the 12 packages that are downloaded every day. I expect this would happen only sporadically, when the epidemiological configuration needs to be adjusted.

The latter solution seems to provide the following advantages over the existing one:

  • a decrease of the worst-case notification delay from 48 hours to 4 hours (if we assume the currently available 1-hourly endpoints queried every 2 hours, it can be further reduced to 3 hours, but that would require calling provideDiagnosisKeys with a list of files)
  • no matching of DKs which are outside the last-2-weeks window, contrary to the current approach where 50% of the DKs are older than 2 weeks
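The overhead point can be illustrated with a toy count (assuming, purely for illustration, one key per day of validity in each daily bundle, and each daily bundle cached and re-fed for 14 days):

```python
BUNDLE_DAYS = 14  # days of keys contained in each daily bundle
KEEP_DAYS = 14    # how long each daily bundle stays cached and re-fed

total = stale = 0
for bundle_age in range(KEEP_DAYS):  # 0 = bundle downloaded today
    for key_age_at_download in range(1, BUNDLE_DAYS + 1):
        total += 1
        # key's age today, in days:
        if bundle_age + key_age_at_download > 14:
            stale += 1  # already outside the 2-week matching window

print(total, stale, round(stale / total, 2))  # -> 196 91 0.46
```

Under these assumptions, roughly 46% of the key-days fed to provideDiagnosisKeys fall outside the 2-week window, consistent with the rough 50% figure above.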

@mh-

mh- commented Jun 15, 2020

I guess one aspect in this is also: will the transmission_risk_level that is associated with each Diagnosis Key change with a change of ExposureConfiguration?
If yes, all the Diagnosis Keys would need to be updated when the ExposureConfiguration changes; but if not (if there's a fixed, robust model for determining the transmission_risk_level of a Diagnosis Key), then that's not required.

@kbobrowski
Author

kbobrowski commented Jun 15, 2020

@mh- I don't believe that retrospective updating of transmission_risk_level is in scope, because the current design effectively prevents it - apps store daily bundles for 14 days and there seems to be no mechanism to trigger re-downloading of an old bundle. The apps also feed all of the bundles they have stored to provideDiagnosisKeys every day. Even if some DKs were assigned new transmission_risk_levels and published in a new daily bundle, the same DKs with the old transmission_risk_level would still be provided to EN alongside the updated ones.

In the "alternative approach" described in my previous comment, there would be no problem implementing this feature should it be needed: DKs with an updated transmission risk would just be pushed to a new hourly bundle (no previously downloaded DKs with the same TEK but a different transmission risk level would be passed to provideDiagnosisKeys alongside them).

@tklingbeil
Member

Both frameworks currently limit the number of files we can submit.
As we want to reduce the amount of data downloaded by each client to a minimum, we need to cache the downloaded diagnosis keys - which also need to be signed by the server, so no repackaging on the client side is possible.

The risk calculated for a certain exposure incident might also change over time: not due to the data changing, but due to the number of days it lies in the past (e.g. an exposure event from yesterday weighs differently than one from 10 days ago). To take this into consideration for the risk calculation (and to properly include the new attenuation buckets), we need to feed all keys into the API every day... And this is how we already hit the API rate limit by doing the matching once per day.

@MikeJayDee

If you have a direct line to the Apple and Google Exposure Notification teams, it might be worth highlighting this to them. I can't imagine that calling the API at most once a day was what they had in mind with their rate limits.

@kbobrowski
Author

kbobrowski commented Jun 15, 2020

@tklingbeil thank you for the response, I have some further questions:

  • at the very minimum (without making any changes to the current design), it would be possible to decrease the delay in exposure notification by fetching the new daily bundle of DKs in the early morning hours every day, e.g. at a random time between 01:00 and 04:00, and, if the phone had no internet access during these hours, at the earliest possible time afterwards. This would already reduce the current worst-case delay by almost half (approx. 20-23 hours). Would you consider including such a change?

  • it seems that Google/Apple indeed did not have in mind the specific way of using their APIs which is implemented by CWA; have you reached out to Google/Apple in order to attempt to patch it on their side?

  • I can see the importance of including old bundles in the exposure calculation, to combine multiple exposures and infer a more accurate risk. I also understand that the risk associated with each exposure changes as time passes. But it seems that a similar result can be obtained by storing the individual ExposureSummary instances returned from EN (for the past 2 weeks) and inferring the exposure status to display from them. With this approach there is no need to store past bundles of Diagnosis Keys, and the score calculated this way can easily be updated on a 2-hourly basis (12 calls to EN per day, with a single file each). Would you consider this a viable approach? It would reduce the current exposure notification delay by a factor of 12.

Thank you in advance for your patience. I understand that you are very busy this week, but the current worst-case 48-hour delay seems bound to have serious consequences for someone's life or health, and this seems like something that can be at least partially avoided.
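The first suggestion above could be sketched as follows (illustrative scheduling logic only; next_fetch_time is a hypothetical helper, not the app's actual worker code):

```python
import random
from datetime import datetime, timedelta

def next_fetch_time(now: datetime, rng: random.Random) -> datetime:
    """Pick a random fetch time between 01:00 and 04:00 of the next day.
    If the phone turns out to be offline during that window, the caller
    should fall back to fetching at the earliest possible moment after it."""
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    offset_minutes = rng.randint(60, 240)  # 01:00 .. 04:00
    return next_midnight + timedelta(minutes=offset_minutes)

t = next_fetch_time(datetime(2020, 6, 18, 13, 0), random.Random(0))
print(t)  # some time between 2020-06-19 01:00 and 2020-06-19 04:00
```

Randomizing within the window spreads server load while still delivering the daily bundle hours earlier than a fetch at an arbitrary time of day.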

@christian-kirschnick
Member

Indeed, we did reach out to them already, and they are looking into the issue to see what they can do. They also noted that when they make changes, it will take some time until these are shipped to all phones - so we can't expect any adjustments within the next 2 weeks or so.

@kbobrowski
Author

I've asked Google about it on their exposure-notifications-feedback mailing list:

Hi,

I'd just like to quickly clarify the limits imposed by the Exposure Notifications API on calls to the function provideDiagnosisKeys(). The documentation states that "Calls to this method are limited to 20 per day"; could you confirm whether the limit is imposed on the execution of the function call (such that it would be possible to call it at most 20 times a day, each time passing multiple files with Diagnosis Keys), or whether the limit is imposed on the arguments (the total number of passed files cannot exceed 20)?

Best regards,
Kamil

And I got a response (I also got permission to publish it here):

Hi Kamil,

The limit is imposed on the number of method calls, so providing multiple files per call still results in a single quota being taken.

Thanks,
Jake

@Spacefish

Spacefish commented Jun 18, 2020

This study: https://www.csh.ac.at/wp-content/uploads/2020/06/CSH-Studie-%C3%84rztefunktdienst.pdf concludes that starting isolation 6-12 hours earlier can reduce the infection rate by up to 58%. Against this background, I would be in favor of updating the keys several times a day. Up to 48 hours of additional delay is simply unacceptable here, because then in the worst case 4-5 days pass between test -> notification.

"Our study shows that measures which reduce the effective infectious period by even a few hours for a large share of those infected have a considerable effect"

Proposal:

  • One signed file per hour. That makes 336 files for 14 days.
  • Add a range GET to the API, so the app queries e.g.
    /version/v1/diagnosis-keys/country/DE/datesince/2020-06-14_1
    to receive all key files from 2020-06-14 hour 1 until today.
    To determine which files are needed, the app simply iterates over the files from -14 days up to now, and stops and issues the request as soon as it hits the first missing file.
    A batch request with batch size 336 is sent to the Google API (correspondingly smaller if there are no keys for some hours), every hour during the day and every 4 hours at night (in order not to exceed the 20-requests-per-day limit)
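The client-side part of this proposal might look like this (the /datesince/ endpoint and the first_missing_hour helper are hypothetical, part of the proposal rather than the existing API):

```python
from datetime import datetime, timedelta

def first_missing_hour(cached: set, now: datetime, days_back: int = 14):
    """Walk hourly slots from `days_back` days ago up to now and return
    the first (date, hour) pair not yet in the local cache, or None if
    everything is already cached."""
    t = now - timedelta(days=days_back)
    while t <= now:
        slot = (t.strftime("%Y-%m-%d"), t.hour)
        if slot not in cached:
            return slot
        t += timedelta(hours=1)
    return None

# Example: everything cached up to 2020-06-14 hour 0, now it is hour 5.
cached = {("2020-06-13", h) for h in range(5, 24)} | {("2020-06-14", 0)}
slot = first_missing_hour(cached, datetime(2020, 6, 14, 5, 0), days_back=1)
print(slot)  # ('2020-06-14', 1) -> request /datesince/2020-06-14_1
```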

@christian-kirschnick
Member

@ChristopherSchmitz I guess you can answer this best.

@corneliusroemer
Contributor

corneliusroemer commented Jun 18, 2020

I'm surprised this issue has not gotten more attention; it seems extremely important.
In general, I think it would be good to talk more to Google/Apple about this and raise questions in public on their repos. Furthermore, we should reach out to other countries that are using GAP, like Italy, Switzerland and Latvia (maybe ask the question as an issue on their GitHub repos), and ask whether they know anything about this and how they designed it.

I can't imagine that there isn't a better way of doing this than the current implementation here.

All comments are good; there is only one I find questionable:

Of course, the backend could provide larger chunks of data (e.g. always the complete data for the respective last 2 weeks), so the number of files for the relevant period of 2 weeks is smaller, which would then also enable to call the framework more often, but that would certainly increase the overall load on the backend infrastructure, as overall more data needed to be served to clients. Probably it also have other drawbacks...

Thus, the current approach is from my perspective a balance between the rate limiting of the API, efficient handling of server infrastructure and actions that make sense from an epidemiological point of view. Even the delay mentioned by @kbobrowski might still be OK when the usual incubation period is taken into account.

Is backend load really an issue for this project? Such claims should be quantified - does a change cost 1k, 10k or 100k € extra per day? - and weighed against the potential benefit: earlier notification, earlier testing, reduced transmission, somehow quantified (I don't feel I know enough for now), rather than decided by gut feeling.

How can a delay of 24 hours, or even 12 hours, be OK? I think getting the RKI in on this would be helpful too - @BenzlerJ, maybe? Some assumptions need to be checked here.

I don't want to blame past design decisions; you had to get the app out quickly and managed to do that. But that doesn't stop us from rectifying potentially suboptimal decisions when doing so could save lives.

Given the life-or-death importance of this issue, it would be great to have another update from the maintainers on their current views after 3 days: @SebastianWolf-SAP @tkowark @tklingbeil @cfritzsche @christian-kirschnick I come in peace.

@mh-

mh- commented Jun 18, 2020

The currently released app version 1.0.2 makes exactly 3 attempts (BackgroundConstants.WORKER_RETRY_COUNT_THRESHOLD) within less than 1 minute, once per day (DIAGNOSIS_KEY_RETRIEVAL_TRIES_PER_DAY), to try to download Diagnosis Keys.
If the network connection is broken during that time window, it waits for another day.
I think this should be improved as well.

(@corneliusroemer thanks for the information about 1.0.2)

@corneliusroemer
Contributor

corneliusroemer commented Jun 18, 2020

@mh- Small correction: at least for Android, 1.0.2 was pushed through the Play Store yesterday/today.
Plus: maybe this issue should get relabelled. I think it's about more than a "question" and "documentation"; "enhancement" and "bug" would suit it quite well, I think.

Update: @mh- raises the point that key downloads are only attempted once a day, with 3 tries. The fact that no further attempts are made when the initial attempt fails may be behind all sorts of issues reporting a lack of exposure key updates, e.g.

@mh- @kbobrowski Do you think it would make sense to open a new issue suggesting that more key-retrieval attempts be made throughout the day? It's related to the issue here but can be fixed independently.

@kbobrowski
Author

@corneliusroemer the issues you cited already point maintainers to this issue. Perhaps an issue focusing on this specific problem will be created, but I won't do it for now, since I have not gathered enough information myself. It's important to keep the signal-to-noise ratio high, so until I have enough info I tend not to open new issues. If you or @mh- feel you are in a position to raise a new, specific issue, then it's up to you, of course :)

@kbobrowski
Author

kbobrowski commented Jun 18, 2020

@mh- that's an interesting finding, also quite interesting is this part of the code:

    /**
     * Get maximum calls count to Google API
     *
     * @return Long
     *
     * @see BackgroundConstants.DIAGNOSIS_KEY_RETRIEVAL_TRIES_PER_DAY
     * @see BackgroundConstants.GOOGLE_API_MAX_CALLS_PER_DAY
     */
    fun getDiagnosisKeyRetrievalMaximumCalls() =
        BackgroundConstants.DIAGNOSIS_KEY_RETRIEVAL_TRIES_PER_DAY
            .coerceAtMost(BackgroundConstants.GOOGLE_API_MAX_CALLS_PER_DAY)

it indicates that the initial intention was indeed to call provideDiagnosisKeys() multiple times a day, each time with a list of Diagnosis Keys. Right now, if DIAGNOSIS_KEY_RETRIEVAL_TRIES_PER_DAY were set to anything higher than 1, let's say 2:

  • getDiagnosisKeyRetrievalMaximumCalls() would return 2 (since GOOGLE_API_MAX_CALLS_PER_DAY is set to 20)
  • provideDiagnosisKeys would be called 14 * 2 = 28 times, exceeding the Google limit

so it may not be easily fixable without some deeper changes. The crucial thing regarding this issue is, once again, to understand the limits on provideDiagnosisKeys, and why we need to feed single-element batches.

The value of DIAGNOSIS_KEY_RETRIEVAL_TRIES_PER_DAY was changed from 12 (as one would expect for fetching every 2 hours) to 1 in this pull request.
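The call-count arithmetic above can be sketched in a few lines. This is a hypothetical model in Python, not the app's actual scheduler; the constant names mirror BackgroundConstants as quoted above, and the model assumes each of the 14 daily key files is passed to provideDiagnosisKeys() individually on every retrieval pass:

```python
# Hypothetical model of the call-count arithmetic discussed above.
# Constant names mirror BackgroundConstants in the Android app; this is
# an illustrative sketch, not the app's actual scheduling code.
DAILY_KEY_FILES = 14               # one key package per day, 14-day window
GOOGLE_API_MAX_CALLS_PER_DAY = 20  # ENF rate limit, per the quoted code


def provide_diagnosis_keys_calls(tries_per_day: int) -> int:
    """Total provideDiagnosisKeys() calls per day if every daily file
    is fed to the API individually on each retrieval pass."""
    return DAILY_KEY_FILES * tries_per_day


for tries in (1, 2):
    calls = provide_diagnosis_keys_calls(tries)
    status = "OK" if calls <= GOOGLE_API_MAX_CALLS_PER_DAY else "exceeds limit"
    print(f"{tries} tries/day -> {calls} calls ({status})")
```

With 1 try per day, the 14 calls stay under the limit of 20; with 2 tries, the 28 calls exceed it, which matches the concern above about needing deeper changes (e.g. batching) before raising the retry count.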

@w-flo

w-flo commented Jun 19, 2020

Ideally, the "cwa serial interval", i.e. the delay between receiving an alert, getting a test done, receiving the test result, uploading your keys, and finally the next "generation" of alerts being received, should be less than the COVID-19 serial interval, which appears to be 4-5 days. I have some doubts about the feasibility of this, since laboratories and public authorities might not be fast enough for this to work in all cases. If the purely technical delay between uploading keys and alerts popping up is something like 24 hours on average, 48 hours in a bad case, and even more in the worst case (backend or network issues during the 3-minute daily "attempt to update" window), then it's even more unlikely.

The distribution API URL linked in the issue report seems to be serving 0 keys as of now, more than 72 hours after app launch. Hopefully this is just caused by some initial issues in laboratories or authorities. Otherwise, positive tests results which were communicated to the tested person on Tuesday (test probably done on Sunday/Monday?) would have their alerts received no earlier than Saturday, at least one week after meeting the infected person. By then, the person who receives an alert has probably passed the "most contagious" phase already, and if it takes just a few days longer, they might not even be contagious any more, reducing the app efficiency to 0%.

Edit: Thank you Christian for pointing that out to me (in the next comment). It's a good approach, and I've rushed to an incorrect conclusion there. Looks like that part of my concern is invalid and we should see some keys published soon, as uploads include more and more keys. Sorry for the noise.

I still feel like it's quite important to reduce the not-extremely-uncommon-bad-case of 48 hours delay as much as possible, even if that means the backend needs to be 14x as beefy. Also the "3 attempts in a 3 minute time window once a day" issue is very important, but it looks like this is already being worked on.
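The worst-case timing argument in this comment can be made concrete with a back-of-the-envelope sketch. The numbers are the illustrative figures from the discussion above, not measured values:

```python
# Illustrative back-of-the-envelope model of the purely technical
# alert delay under the pre-1.7 fetch schedule. The durations are the
# figures cited in the discussion, not measurements. All values in hours.
PACKAGING_DELAY_MAX = 24     # keys uploaded today only land in tomorrow's daily package
FETCH_INTERVAL_MAX = 24      # pre-1.7: one download window per day
MISSED_ATTEMPT_PENALTY = 24  # a failed 3-try window defers retrieval by a full day


def technical_delay(missed_attempts: int = 0) -> int:
    """Worst-case technical delay from key upload to alert, in hours."""
    return (PACKAGING_DELAY_MAX
            + FETCH_INTERVAL_MAX
            + missed_attempts * MISSED_ATTEMPT_PENALTY)


print(technical_delay(0))  # 48 h: the "bad case" mentioned above
print(technical_delay(1))  # 72 h: one failed daily attempt on top
```

Even without any failed attempts, the two 24-hour stages alone consume roughly half of the 4-5 day serial interval, which is why shrinking either stage (hourly packages, more frequent fetches) matters.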

@cwa-bot cwa-bot bot moved this from ToDo to Mirrored to Jira in [CM] cwa-documentation Nov 25, 2020
@Ein-Tim
Contributor

Ein-Tim commented Nov 25, 2020

CWA version 1.7.0 (on Android: https://github.com/corona-warn-app/cwa-app-android/releases/tag/v1.7.0) and CWA version 1.7.1 (on iOS: https://github.com/corona-warn-app/cwa-app-ios/releases/tag/v1.7.1) were released today, introducing key fetching every 4 hours (if the phone is connected to the internet via Wi-Fi). 🎉

@cwa-bot cwa-bot bot moved this from Mirrored to Jira to ToDo in [CM] cwa-documentation Nov 25, 2020
@dsarkar dsarkar moved this from ToDo to In Progress in [CM] cwa-documentation Nov 25, 2020
@dsarkar
Member

dsarkar commented Nov 26, 2020

Dear community,

Regarding the update to CWA 1.7: please note that it is a staged rollout. Within the next two days, 100% of users should have received the update.

We would very much appreciate your feedback regarding this issue over the next few days!

Best wishes,
DS


Corona-Warn-App Open Source Team

@cfritzsche

I am seeing background updates spaced 4 hours apart, which is excellent. I haven't yet seen whether the hourly endpoint is checked, though.

Thanks for finally making this work with all parties :-)

@cwa-bot cwa-bot bot moved this from In Progress to ToDo in [CM] cwa-documentation Nov 26, 2020
@dsarkar dsarkar moved this from ToDo to Waiting for Input in [CM] cwa-documentation Nov 26, 2020
@ndegendogo

@cfritzsche you have Android, yes?

On iPhone we can see more details in the ENF log.
So I can confirm that I see the hourly files accumulating over the day, and, for full days in the past, the daily files loading as before.
And, as you said, such a check happens every 4-5 hours.

Thanks!

@cwa-bot cwa-bot bot moved this from Waiting for Input to ToDo in [CM] cwa-documentation Nov 26, 2020
@cfritzsche

cfritzsche commented Nov 26, 2020

Nope, iOS here. Sure I can see that the keys change every few hours. But I can't verify they are fetched from the hourly endpoint.

@ndegendogo

Nope, iOS here.

@cfritzsche did you have a look at your ENF log file? Either export it as json, or click through it directly in the settings.

What I see here:

  1. At the first level I see the timestamps of the checks. These timestamps appear every 4-5 hours.
  2. At the next level I see the list of key files. I identify them by their hash value or by the number of keys.
  3. The first 14 have a large number of keys, like 22000 / 19000 / 9000. They follow the same pattern as the 14 files before CWA 1.7:
    every day the oldest is dropped, 13 are the same as yesterday, and a fresh file is added.
  4. After these come more files with a smaller number of keys, like 190 / 1500 / or even 0. The number of these files increments with every hour.

So the pattern is as expected: hourly files for today plus daily files for the past 14 days.
I did not download the files manually to verify their hash values.
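The observed pattern can be sketched with a small hypothetical helper. The identifiers below are illustrative only, not the actual package names used by the distribution server:

```python
# Hypothetical sketch of the key-file pattern described above:
# 14 daily packages for the past days, plus one package per completed
# hour of the current day. Identifiers are illustrative only.
from datetime import date, timedelta


def expected_key_files(today: date, completed_hours: int) -> list:
    """Daily files for the last 14 days plus hourly files for today."""
    daily = [(today - timedelta(days=d)).isoformat() for d in range(14, 0, -1)]
    hourly = [f"{today.isoformat()}/hour/{h}" for h in range(completed_hours)]
    return daily + hourly


files = expected_key_files(date(2020, 11, 26), completed_hours=10)
print(len(files))  # 24 files: 14 daily + 10 hourly
```

This matches the observation that the first 14 entries are large daily packages and the remainder are smaller hourly packages whose count grows over the day.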

@cfritzsche

Thanks @ndegendogo. Actually, isn't it wasteful to keep checking the old dailies after the first check of the day? Can they still be added to? Then TSI's concerns about infrastructure load were correct after all: the same daily keys are now downloaded by the same devices much more often.

@dsarkar dsarkar moved this from ToDo to In Progress in [CM] cwa-documentation Nov 27, 2020
@ndegendogo

Actually, isn't it wasteful to keep checking the old dailies after the first check of the day?

@cfritzsche well, this is the same behaviour as before 1.7.
I did not monitor the behaviour with regard to downloads. My assumption is that previously downloaded key files can be (and probably are) cached on the device, so they need not contribute to the load on the server infrastructure.

As for the exact reasons why the same files are checked over and over again, I can only guess:

  • the check is always performed with the newest risk-parameter configuration. If these parameters change, the risk assessment may yield a different result than before.
  • the key files are packaged according to the day of upload, not according to the day of contact.
    If you assume that the risk formula has a part that accumulates across contacts at the same time (I am not 100% sure about this one), then it is essential for a correct risk calculation to include all available contacts of a given timeframe.

@cwa-bot cwa-bot bot moved this from In Progress to ToDo in [CM] cwa-documentation Nov 27, 2020
@dsarkar dsarkar moved this from ToDo to Mirrored to Jira in [CM] cwa-documentation Nov 27, 2020
@cfritzsche

@kbobrowski can't we close this issue now with 1.7.1 out?

@cwa-bot cwa-bot bot moved this from Mirrored to Jira to ToDo in [CM] cwa-documentation Dec 3, 2020
@dsarkar dsarkar moved this from ToDo to Waiting for Input in [CM] cwa-documentation Dec 3, 2020
@dsarkar
Member

dsarkar commented Dec 7, 2020

Dear @kbobrowski, and community,

Thanks to all for contributing here. We think that this issue has been mitigated with the latest release of CWA 1.7.1. As @cfritzsche suggests, we are about to close this issue. Thanks again.

Best wishes,
DS


Corona-Warn-App Open Source Team

@dsarkar
Member

dsarkar commented Dec 15, 2020

Dear all,

Thanks again, we will close this issue now.

Best wishes,
DS


Corona-Warn-App Open Source Team

@dsarkar dsarkar closed this as completed Dec 15, 2020
@cwa-bot cwa-bot bot moved this from Waiting for Input to Done in [CM] cwa-documentation Dec 15, 2020
@dsarkar dsarkar removed the hot topic 🔥 Hot Topics from the view point of the Community label Dec 15, 2020
@ndegendogo

As an (well, it's not proof, but let's call it) anecdotal indication of how fast the key distribution part can be now, let me refer to this ticket:

She had a red encounter pushed to her app on Monday, 7th of December, after a colleague of hers was diagnosed with Covid. (Said colleague had first symptoms on Saturday, 28th, got tested a week later and got her results on Monday, 7th.)

This is how it should be!
Thank you! 👍
