Silent (from user perspective) scenario application failure #514
Comments
Although I've patched R5 to avoid returning bogus regional results with no scenario applied, it would still be interesting to identify what was causing the scenario application failures in the first place. It might be something to do with fetching scenarios from S3 or saving them locally on the workers. We will need to dig through the staging worker logs to see.
It turns out that the striping was not due to a limited number of workers failing; it was due to only one of 100 workers actually applying the scenario. All other workers were failing to apply the scenario and falling back on the baseline. This was not silent in the strict sense, since the problem was logged on the worker, but it was invisible to the end user and to the backend. The patch ensures that workers fail hard and refuse to submit work results instead of just failing to apply the scenario. As a result, an affected analysis now stalls forever instead of appearing to complete normally with no scenario applied. Generally we want things to fail fast and loudly whenever something is amiss. Also, throughout R5 I think we should avoid falling back on or initializing with defaults: I recently encountered another problem with a field that was initialized to a default value and never overwritten.
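To illustrate the fail-hard behavior described above, here is a minimal sketch. It is written in Python for brevity and is not R5's actual code (R5 is Java). The names `ScenarioApplicationError`, `resolve_scenario`, and `load_scenario` are hypothetical, and the sketch assumes every task names a scenario.

```python
class ScenarioApplicationError(Exception):
    """Raised when a requested scenario cannot be found or loaded."""


def resolve_scenario(scenario_id, load_scenario):
    """Return the scenario for scenario_id, or raise.

    load_scenario is a hypothetical callable that returns a scenario object,
    returns None when nothing is found, or raises on I/O problems.
    There is deliberately no fallback to an empty or baseline scenario.
    """
    if scenario_id is None:
        raise ScenarioApplicationError("task did not name a scenario")
    try:
        scenario = load_scenario(scenario_id)
    except Exception as exc:
        # Wrap fetch/storage errors so the caller sees one failure type.
        raise ScenarioApplicationError(
            f"could not load scenario {scenario_id!r}"
        ) from exc
    if scenario is None:
        raise ScenarioApplicationError(f"scenario {scenario_id!r} not found")
    return scenario
```

A worker built this way would let the exception propagate and submit nothing for the task, rather than computing baseline results under the scenario's name.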
We believe this issue is solved: the only known cause has been fixed. Re-open this issue if the striping is seen again.
In a recent Analysis run there are stripe artifacts:
This might be the same thing we saw in #483: errors happen when finding/applying scenarios. In the log in #483 we see:
TransportNetworkCache - No scenario provided or loaded. Replacing with empty scenario.
This means that when an error happens, the worker does not simply fail to submit results for those origin points (which would let the backend redeliver and retry them). Instead, it semi-silently applies no scenario and continues to calculate and return results. We need to change that behavior.
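The desired behavior, where failed origin points are redelivered instead of being filled with no-scenario results, could look roughly like the sketch below. It is in Python for brevity and is not R5's backend code. `run_with_redelivery`, `compute`, and the `max_attempts` cap are all assumptions made for illustration.

```python
from collections import deque


def run_with_redelivery(origins, compute, max_attempts=3):
    """Compute one result per origin, redelivering origins whose attempt failed.

    compute(origin) is a hypothetical worker call that returns a result or
    raises when the scenario could not be applied.
    Returns (results, failed), where failed lists origins that never succeeded.
    """
    pending = deque((origin, 1) for origin in origins)
    results, failed = {}, []
    while pending:
        origin, attempt = pending.popleft()
        try:
            results[origin] = compute(origin)
        except Exception:
            if attempt < max_attempts:
                # Put the origin back on the queue for another attempt.
                pending.append((origin, attempt + 1))
            else:
                # Give up visibly; never substitute a baseline result.
                failed.append(origin)
    return results, failed
```

Capping the attempts is a design choice of this sketch: it turns a permanent failure into a visible list of failed origins rather than an endless retry loop.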