Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize snapshot creation #61

Merged
merged 15 commits into from
Jan 22, 2021

Conversation

dav3r
Copy link
Member

@dav3r dav3r commented Jan 12, 2021

🗣 Description

This PR modifies the weekly reporting script so that snapshot creation is multithreaded, instead of single-threaded.

While I was making changes to this script, I took the opportunity to clean up some function names, variable names, comments, and logging statements to make them clearer and more consistent (69795cc, fd426e5, d208d94). I also removed some old testing code that had been commented-out, but is no longer needed (70ea90c).

Resolves #58.

💭 Motivation and Context

From #58:

Creating snapshots is the biggest chunk of our weekly reporting time (about 700 minutes out of 850 minutes total; that's over 80% of the total time and climbing with each new report recipient we add). By creating multiple snapshots at once, we should be able to significantly reduce the total time it takes to create snapshots, just like we have done with report creation.

Since we have to pause the commander while snapshots and reports are generated, shorter reporting time means more weekly uptime for CyHy processing.

This PR starts us off with 16 threads for snapshot creation, since that is how many we currently use for report creation. That number can be increased or decreased in Production after we observe how long the threaded snapshot creation process takes and if there are any database performance issues.

Related, we may be able to increase the number of report-creation threads and reduce the time it takes to make the reports (currently 133 minutes), since the initial number of 16 threads predates our AWS infrastructure (we currently use an instance with 8 cores to generate reports).

Note that third-party snapshot creation (which must occur after "regular" snapshot creation) has not moved to threads yet. Those third-party snapshots are created relatively quickly (less than 6 minutes currently), so we won't save all that much time by creating them in parallel. Still, I added a TODO to do that eventually.

🧪 Testing

I tested these changes on my local system using the set of Production organizations, except I neutered the code so that no actual snapshots or reports were generated.

Once I was satisfied with that testing, I created a set of 37 dummy orgs (32 "regular" orgs and 5 "third-party reporting" orgs) and imported them to a test environment (thanks @mcdonnnj for letting me use your environment). I then ran the entire script (well, I commented out a couple of unrelated bits of code that would've required more test environment setup than I was willing to do) and confirmed that the new multithreaded code successfully generated snapshots for the regular orgs and the third-party snapshots were also generated successfully (non-threaded, but using code that was modified in this PR).

✅ Checklist

  • This PR has an informative and human-readable title.
  • Changes are limited to a single goal - eschew scope creep!
  • All future TODOs are captured in issues, which are referenced in code comments.
  • All relevant type-of-change labels have been added.
  • I have read the CONTRIBUTING document.
  • These code changes follow cisagov code standards.
  • All new and existing tests pass.

We start with the list of reports to generate, then figure out which 
snapshots need to be generated in order to create those reports.  
Finally, we split up the list of snapshots into sublists and launch a 
snapshot-creation thread to process each sublist.  This is similar to 
how we create reports, though that report code still needs to be cleaned 
up to match how this snapshot code now works.

Note that all snapshots (including snapshots for third-party reports) 
are now created from a common create_snapshot() function, which 
simplifies the code slightly.
This is not a huge concern at the moment since third-party snapshot 
creation does not take very long.
@dav3r dav3r added the improvement This issue or pull request will add new or improve existing functionality label Jan 12, 2021
@dav3r dav3r self-assigned this Jan 12, 2021
@lgtm-com
Copy link

lgtm-com bot commented Jan 12, 2021

This pull request introduces 3 alerts when merging 5bed850 into 6b97eec - view on LGTM.com

new alerts:

  • 2 for Variable defined multiple times
  • 1 for Except block handles 'BaseException'

Copy link
Member

@jsf9k jsf9k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I did leave a comment describing an improvement; we may want to create an issue from that.

extras/create_snapshots_reports_scorecard.py Show resolved Hide resolved
extras/create_snapshots_reports_scorecard.py Outdated Show resolved Hide resolved
Copy link
Contributor

@hillaryj hillaryj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Member

@mcdonnnj mcdonnnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this task done! 💪💪💪Hopefully it will result in greatly reduced snapshot generation runtime. I had some thoughts/suggestions, but nothing is particularly showstopping.

extras/create_snapshots_reports_scorecard.py Outdated Show resolved Hide resolved
extras/create_snapshots_reports_scorecard.py Show resolved Hide resolved
extras/create_snapshots_reports_scorecard.py Show resolved Hide resolved
extras/create_snapshots_reports_scorecard.py Outdated Show resolved Hide resolved
return snapshot_process.returncode


def create_snapshots(org_list, db, cyhy_db_section):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of create_snapshot()/create_snapshots() because of their similarity. Would something like create_snapshots_from_list() be more at-a-glance readable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - this would increase usability to disambiguate the two and keep users from having to rely on the help text to tell them apart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion- see 36a4128.

extras/create_snapshots_reports_scorecard.py Outdated Show resolved Hide resolved
extras/create_snapshots_reports_scorecard.py Outdated Show resolved Hide resolved
Thanks @mcdonnnj and LGTM!

Co-authored-by: Nick M. <50747025+mcdonnnj@users.noreply.github.com>
@lgtm-com
Copy link

lgtm-com bot commented Jan 19, 2021

This pull request introduces 2 alerts when merging 42d9453 into 6b97eec - view on LGTM.com

new alerts:

  • 2 for Variable defined multiple times

@dav3r dav3r requested a review from mcdonnnj January 21, 2021 16:43
@dav3r dav3r requested review from hillaryj and jsf9k January 21, 2021 16:43
@dav3r
Copy link
Member Author

dav3r commented Jan 21, 2021

I re-requested review from you all to make sure you're all good with the recent changes I made based on your excellent feedback. Thanks!

PS: I re-ran the script in a test environment after making these changes and everything still looks good.

Copy link
Member

@jsf9k jsf9k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all still looks great to me! Nice improvements since the last time I reviewed!

Copy link
Contributor

@hillaryj hillaryj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! 🏆

@dav3r
Copy link
Member Author

dav3r commented Jan 22, 2021

This pull request introduces 2 alerts when merging 42d9453 into 6b97eec - view on LGTM.com

new alerts:

  • 2 for Variable defined multiple times

For posterity, it seems that these LGTM alerts are not taking into account the try ... finally block, as explained in this comment.

@mcdonnnj
Copy link
Member

This pull request introduces 2 alerts when merging 42d9453 into 6b97eec - view on LGTM.com
new alerts:

  • 2 for Variable defined multiple times

For posterity, it seems that these LGTM alerts are not taking into account the try ... finally block, as explained in this comment.

They are assigned here

successful_tp_snaps = list()
failed_tp_snaps = list()

and then if they meet the appropriate logic, they are reassigned here
successful_tp_snaps = third_party_report_ids
else:
# Create snapshots needed for third-party reports
successful_tp_snaps, failed_tp_snaps = create_third_party_snapshots(
db, cyhy_db_section, third_party_report_ids
)

I think LGTM expects you do define them exclusively in that if/else logic instead of doing an initial declaration and then overriding as needed. I'm fine with how we're doing it, but wanted to give more explanation for posterity.

Copy link
Member

@mcdonnnj mcdonnnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking even better to me now! Thanks again for parallelizing this job 💪💪💪

@jsf9k
Copy link
Member

jsf9k commented Jan 22, 2021

I think LGTM expects you do define them exclusively in that if/else logic instead of doing an initial declaration and then overriding as needed. I'm fine with how we're doing it, but wanted to give more explanation for posterity.

I detest the leaky scope "feature" of Python.

Copy link
Member

@felddy felddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Nooice!
d2f91f7aeb5d98d222fb083796c1fc58

@dav3r dav3r merged commit ebaacc7 into develop Jan 22, 2021
@dav3r dav3r deleted the improvement/parallelize_snapshot_creation branch January 22, 2021 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement This issue or pull request will add new or improve existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallelize snapshot creation
5 participants