Re-attempt failed snapshots and reports #119

dav3r · 2024-04-10T17:40:42Z

🗣 Description

This PR adds logic to re-attempt any failed snapshots or reports, in a single-threaded manner. The re-attempts are made after the entire group of snapshots or reports has been attempted once via multi-threading. Each failed snapshot or report is only re-attempted once.

💭 Motivation and context

This change makes the script more robust and able to overcome certain types of non-reproducible failures. We know for a fact that some third-party reports are very memory-intensive and will fail with "out of memory" errors when multiple reporting threads are running, so that is why I decided to make our re-attempts single-threaded.

Resolves #107.

🧪 Testing

To verify that these changes worked as expected, I first commented out any parts of create_snapshots_reports_scorecard.py that were not necessary for my local testing or related to the changes that I made (e.g. pausing/resuming the commander, creating the sample report, etc). Then I replaced the snapshot_command and report_command with harmless test and error commands like this:

    # Force a failure for testing purposes 0.1% of the time.  This is higher
    # than our actual failure rate (which is essentially 0% these days), but it
    # allows us to test the failure handling code.
    random_error = random.randint(1, 1000)
    if random_error == 50:
        snapshot_command = ["sh", "foo"]
    else:
        random_sleep = random.randint(1, 3)
        snapshot_command = [
            "sleep",
            str(random_sleep),
        ]

Once I had the script set up so that it could be run without actually generating any snapshots or reports (each thread only running my test/error commands above), I executed a full run using all of the entities in the production database with read-only credentials, just in case (e.g. ./create_snapshots_reports_scorecard.py --no-dock --no-log --no-pause cyhy-read-only scan-read-only). I confirmed that the output of the full run looked as expected (output below sanitized and shortened to show only the interesting bits):

<snip>
2024-04-10 11:43:37,033 INFO - Attempting to regenerate failed snapshots: [u'AAAAA', u'BBBBB', u'CCCCC', u'DDDDD', u'EEEEE', u'FFFFF']
2024-04-10 11:43:40,093 INFO - [MainThread] Successful snapshot: AAAAA (3.04 s)
2024-04-10 11:43:41,165 INFO - [MainThread] Successful snapshot: BBBBB (1.04 s)
2024-04-10 11:43:44,231 INFO - [MainThread] Successful snapshot: CCCCC (3.04 s)
2024-04-10 11:43:46,306 INFO - [MainThread] Successful snapshot: DDDDD (2.04 s)
2024-04-10 11:43:48,365 INFO - [MainThread] Successful snapshot: EEEEE (2.04 s)
2024-04-10 11:43:50,430 INFO - [MainThread] Successful snapshot: FFFFF (2.04 s)
2024-04-10 11:43:50,431 INFO - Time to complete re-attempting failed snapshots: 0.22 minutes
2024-04-10 11:43:50,432 INFO - Time to complete snapshots: 20.49 minutes
<snip>
2024-04-10 11:51:48,099 INFO - Attempting to regenerate failed reports: [u'GGGGG', u'HHHHH', u'IIIII', u'JJJJJ', u'KKKKK', u'LLLLL', u'MMMMM']
2024-04-10 11:51:48,099 INFO - [MainThread] Starting report for: GGGGG
2024-04-10 11:51:51,146 INFO - [MainThread] Successful report generated: GGGGG (3.05 s)
2024-04-10 11:51:51,147 INFO - [MainThread] Starting report for: HHHHH
2024-04-10 11:51:52,196 INFO - [MainThread] Successful report generated: HHHHH (1.05 s)
2024-04-10 11:51:52,197 INFO - [MainThread] Starting report for: IIIII
2024-04-10 11:51:53,239 INFO - [MainThread] Successful report generated: IIIII (1.04 s)
2024-04-10 11:51:53,240 INFO - [MainThread] Starting report for: JJJJJ
2024-04-10 11:51:55,286 INFO - [MainThread] Successful report generated: JJJJJ (2.05 s)
2024-04-10 11:51:55,287 INFO - [MainThread] Starting report for: KKKKK
2024-04-10 11:51:57,334 INFO - [MainThread] Successful report generated: KKKKK (2.05 s)
2024-04-10 11:51:57,335 INFO - [MainThread] Starting report for: LLLLL
2024-04-10 11:51:59,379 INFO - [MainThread] Successful report generated: LLLLL (2.04 s)
2024-04-10 11:51:59,380 INFO - [MainThread] Starting report for: MMMMM
2024-04-10 11:52:00,419 INFO - [MainThread] Successful report generated: MMMMM (1.04 s)
2024-04-10 11:52:00,420 INFO - Time to complete re-attempting failed reports: 0.21 minutes
2024-04-10 11:52:00,421 INFO - Time to complete reports: 8.17 minutes
<snip>
2024-04-10 11:53:13,542 INFO - Number of snapshots generated: 8454
2024-04-10 11:53:13,543 INFO -   Number of third-party and grouping node snapshots generated: 32
2024-04-10 11:53:13,544 INFO - Number of snapshots failed: 0
2024-04-10 11:53:13,544 INFO -   Number of third-party and grouping node snapshots failed: 0
2024-04-10 11:53:13,545 INFO - Number of reports generated: 8227
2024-04-10 11:53:13,545 INFO -   Third-party reports generated: 32
2024-04-10 11:53:13,546 INFO - Number of reports failed: 0
2024-04-10 11:53:13,546 INFO -   Third-party reports failed: 0
<snip>

Although it isn't shown above, I also verified that the new code functioned correctly when dealing with third-party snapshots and reports. I also confirmed that if a re-attempt failed for a snapshot or a report, the failure is still logged the same way as it is now.

✅ Pre-approval checklist

This PR has an informative and human-readable title.
Changes are limited to a single goal - eschew scope creep!
All relevant type-of-change labels have been added.
I have read the CONTRIBUTING document.
These code changes follow cisagov code standards.
All new and existing tests pass.

✅ Post-merge checklist

Deploy this change to Production
Closely review the output of this script the first time it runs in Production to ensure everything worked as expected

Note that the re-attempts are single-threaded. We know that certain third-party reports are very memory-intensive and will fail with "out of memory" errors when multiple reporting threads are running, so that is why we decided to go single-threaded for our re-attempts.

dav3r · 2024-04-10T19:32:32Z

@jsf9k I deployed this change to Production. When you take care of the reports this weekend, please be sure to verify that there were no failed snapshots or reports before you send everything out. Last weekend, there were three failed third-party reports (due to their large memory consumption combined with it being the first time that third-party reports were multi-threaded). I'm hopeful that this PR will enable those three reports to be generated successfully via the new single-threaded re-attempt process.

dav3r added the improvement This issue or pull request will add new or improve existing functionality label Apr 10, 2024

dav3r self-assigned this Apr 10, 2024

dav3r requested review from mcdonnnj, jsf9k, felddy and a team April 10, 2024 17:40

dv4harr10 approved these changes Apr 10, 2024

View reviewed changes

jsf9k approved these changes Apr 10, 2024

View reviewed changes

mcdonnnj approved these changes Apr 10, 2024

View reviewed changes

dav3r merged commit 5c2fa2f into develop Apr 10, 2024
1 check passed

dav3r deleted the improvement/retry-failed-snaps-reports branch April 10, 2024 19:18

This was referenced Apr 15, 2024

Re-attempted snapshots/reports are duplicated in the list of durations #120

Closed

Only display duration in "Longest snapshots/reports" summary for successful snapshots/report #121

Merged

dav3r mentioned this pull request Jun 4, 2024

OSError cannot allocate memory during report generation #64

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-attempt failed snapshots and reports #119

Re-attempt failed snapshots and reports #119

dav3r commented Apr 10, 2024 •

edited

Loading

dav3r commented Apr 10, 2024

Re-attempt failed snapshots and reports #119

Re-attempt failed snapshots and reports #119

Conversation

dav3r commented Apr 10, 2024 • edited Loading

🗣 Description

💭 Motivation and context

🧪 Testing

✅ Pre-approval checklist

✅ Post-merge checklist

dav3r commented Apr 10, 2024

dav3r commented Apr 10, 2024 •

edited

Loading