Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-attempt failed snapshots and reports #119

Merged
merged 1 commit into from
Apr 10, 2024

Conversation

dav3r
Copy link
Member

@dav3r dav3r commented Apr 10, 2024

🗣 Description

This PR adds logic to re-attempt any failed snapshots or reports, in a single-threaded manner. The re-attempts are made after the entire group of snapshots or reports has been attempted once via multi-threading. Each failed snapshot or report is only re-attempted once.

💭 Motivation and context

This change makes the script more robust and able to overcome certain types of non-reproducible failures. We know for a fact that some third-party reports are very memory-intensive and will fail with "out of memory" errors when multiple reporting threads are running, so that is why I decided to make our re-attempts single-threaded.

Resolves #107.

🧪 Testing

To verify that these changes worked as expected, I first commented out any parts of create_snapshots_reports_scorecard.py that were not necessary for my local testing or related to the changes that I made (e.g. pausing/resuming the commander, creating the sample report, etc). Then I replaced the snapshot_command and report_command with harmless test and error commands like this:

    # Force a failure for testing purposes 0.1% of the time.  This is higher
    # than our actual failure rate (which is essentially 0% these days), but it
    # allows us to test the failure handling code.
    random_error = random.randint(1, 1000)
    if random_error == 50:
        snapshot_command = ["sh", "foo"]
    else:
        random_sleep = random.randint(1, 3)
        snapshot_command = [
            "sleep",
            str(random_sleep),
        ]

Once I had the script set up so that it could be run without actually generating any snapshots or reports (each thread only running my test/error commands above), I executed a full run using all of the entities in the production database with read-only credentials, just in case (e.g. ./create_snapshots_reports_scorecard.py --no-dock --no-log --no-pause cyhy-read-only scan-read-only). I confirmed that the output of the full run looked as expected (output below sanitized and shortened to show only the interesting bits):

<snip>
2024-04-10 11:43:37,033 INFO - Attempting to regenerate failed snapshots: [u'AAAAA', u'BBBBB', u'CCCCC', u'DDDDD', u'EEEEE', u'FFFFF']
2024-04-10 11:43:40,093 INFO - [MainThread] Successful snapshot: AAAAA (3.04 s)
2024-04-10 11:43:41,165 INFO - [MainThread] Successful snapshot: BBBBB (1.04 s)
2024-04-10 11:43:44,231 INFO - [MainThread] Successful snapshot: CCCCC (3.04 s)
2024-04-10 11:43:46,306 INFO - [MainThread] Successful snapshot: DDDDD (2.04 s)
2024-04-10 11:43:48,365 INFO - [MainThread] Successful snapshot: EEEEE (2.04 s)
2024-04-10 11:43:50,430 INFO - [MainThread] Successful snapshot: FFFFF (2.04 s)
2024-04-10 11:43:50,431 INFO - Time to complete re-attempting failed snapshots: 0.22 minutes
2024-04-10 11:43:50,432 INFO - Time to complete snapshots: 20.49 minutes
<snip>
2024-04-10 11:51:48,099 INFO - Attempting to regenerate failed reports: [u'GGGGG', u'HHHHH', u'IIIII', u'JJJJJ', u'KKKKK', u'LLLLL', u'MMMMM']
2024-04-10 11:51:48,099 INFO - [MainThread] Starting report for: GGGGG
2024-04-10 11:51:51,146 INFO - [MainThread] Successful report generated: GGGGG (3.05 s)
2024-04-10 11:51:51,147 INFO - [MainThread] Starting report for: HHHHH
2024-04-10 11:51:52,196 INFO - [MainThread] Successful report generated: HHHHH (1.05 s)
2024-04-10 11:51:52,197 INFO - [MainThread] Starting report for: IIIII
2024-04-10 11:51:53,239 INFO - [MainThread] Successful report generated: IIIII (1.04 s)
2024-04-10 11:51:53,240 INFO - [MainThread] Starting report for: JJJJJ
2024-04-10 11:51:55,286 INFO - [MainThread] Successful report generated: JJJJJ (2.05 s)
2024-04-10 11:51:55,287 INFO - [MainThread] Starting report for: KKKKK
2024-04-10 11:51:57,334 INFO - [MainThread] Successful report generated: KKKKK (2.05 s)
2024-04-10 11:51:57,335 INFO - [MainThread] Starting report for: LLLLL
2024-04-10 11:51:59,379 INFO - [MainThread] Successful report generated: LLLLL (2.04 s)
2024-04-10 11:51:59,380 INFO - [MainThread] Starting report for: MMMMM
2024-04-10 11:52:00,419 INFO - [MainThread] Successful report generated: MMMMM (1.04 s)
2024-04-10 11:52:00,420 INFO - Time to complete re-attempting failed reports: 0.21 minutes
2024-04-10 11:52:00,421 INFO - Time to complete reports: 8.17 minutes
<snip>
2024-04-10 11:53:13,542 INFO - Number of snapshots generated: 8454
2024-04-10 11:53:13,543 INFO -   Number of third-party and grouping node snapshots generated: 32
2024-04-10 11:53:13,544 INFO - Number of snapshots failed: 0
2024-04-10 11:53:13,544 INFO -   Number of third-party and grouping node snapshots failed: 0
2024-04-10 11:53:13,545 INFO - Number of reports generated: 8227
2024-04-10 11:53:13,545 INFO -   Third-party reports generated: 32
2024-04-10 11:53:13,546 INFO - Number of reports failed: 0
2024-04-10 11:53:13,546 INFO -   Third-party reports failed: 0
<snip>

Although it isn't shown above, I also verified that the new code functioned correctly when dealing with third-party snapshots and reports. I also confirmed that if a re-attempt failed for a snapshot or a report, the failure is still logged the same way as it is now.

✅ Pre-approval checklist

  • This PR has an informative and human-readable title.
  • Changes are limited to a single goal - eschew scope creep!
  • All relevant type-of-change labels have been added.
  • I have read the CONTRIBUTING document.
  • These code changes follow cisagov code standards.
  • All new and existing tests pass.

✅ Post-merge checklist

  • Deploy this change to Production
  • Closely review the output of this script the first time it runs in Production to ensure everything worked as expected

Note that the re-attempts are single-threaded.  We know that certain third-party reports are very memory-intensive and will fail with "out of memory" errors when multiple reporting threads are running, so that is why we decided to go single-threaded for our re-attempts.
@dav3r dav3r added the improvement This issue or pull request will add new or improve existing functionality label Apr 10, 2024
@dav3r dav3r self-assigned this Apr 10, 2024
@dav3r dav3r requested review from mcdonnnj, jsf9k, felddy and a team April 10, 2024 17:40
@dav3r dav3r merged commit 5c2fa2f into develop Apr 10, 2024
1 check passed
@dav3r dav3r deleted the improvement/retry-failed-snaps-reports branch April 10, 2024 19:18
@dav3r
Copy link
Member Author

dav3r commented Apr 10, 2024

@jsf9k I deployed this change to Production. When you take care of the reports this weekend, please be sure to verify that there were no failed snapshots or reports before you send everything out. Last weekend, there were three failed third-party reports (due to their large memory consumption combined with it being the first time that third-party reports were multi-threaded). I'm hopeful that this PR will enable those three reports to be generated successfully via the new single-threaded re-attempt process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement This issue or pull request will add new or improve existing functionality
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Automatically re-attempt failed snapshots and reports
4 participants