Re-attempt failed snapshots and reports #119
Merged
🗣 Description
This PR adds logic to re-attempt any failed snapshots or reports, in a single-threaded manner. The re-attempts are made after the entire group of snapshots or reports has been attempted once via multi-threading. Each failed snapshot or report is only re-attempted once.
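The two-pass flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `run_with_single_retry`, `task`, and `max_workers` are hypothetical, and `task` is assumed to return a truthy value on success.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def _succeeded(task: Callable[[str], bool], item: str) -> bool:
    """Run task(item); treat a falsy result or any exception as a failure."""
    try:
        return bool(task(item))
    except Exception:
        return False


def run_with_single_retry(
    items: Iterable[str],
    task: Callable[[str], bool],
    max_workers: int = 4,  # hypothetical default, not taken from the PR
) -> List[str]:
    """Attempt every item in parallel, then re-attempt failures once, serially.

    Returns the items that still failed after their single re-attempt.
    """
    items = list(items)

    # First pass: attempt the whole group using a pool of worker threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda item: _succeeded(task, item), items))
    failed = [item for item, ok in zip(items, results) if not ok]

    # Second pass: re-attempt each failure exactly once, one at a time,
    # so a memory-hungry job never competes with other worker threads.
    return [item for item in failed if not _succeeded(task, item)]
```

Calling `run_with_single_retry(entity_ids, make_snapshot)` with a hypothetical `make_snapshot` callable would return only the entities whose first attempt and single re-attempt both failed, and those are the ones that would still need to be logged as failures.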
💭 Motivation and context
This change makes the script more robust and able to overcome certain types of non-reproducible failures. We know that some third-party reports are very memory-intensive and will fail with "out of memory" errors when multiple reporting threads are running, which is why I decided to make our re-attempts single-threaded.
Resolves #107.
🧪 Testing
To verify that these changes worked as expected, I first commented out any parts of create_snapshots_reports_scorecard.py that were not necessary for my local testing or related to the changes that I made (e.g. pausing/resuming the commander, creating the sample report, etc.). Then I replaced the snapshot_command and report_command with harmless test and error commands like this:
Once I had the script set up so that it could be run without actually generating any snapshots or reports (each thread only running my test/error commands above), I executed a full run using all of the entities in the production database with read-only credentials, just in case (e.g. ./create_snapshots_reports_scorecard.py --no-dock --no-log --no-pause cyhy-read-only scan-read-only). I confirmed that the output of the full run looked as expected (output below sanitized and shortened to show only the interesting bits):
Although it isn't shown above, I also verified that the new code functioned correctly when dealing with third-party snapshots and reports. I also confirmed that if a re-attempt failed for a snapshot or a report, the failure was still logged the same way as it is in the current code.
✅ Pre-approval checklist
✅ Post-merge checklist