-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelize snapshot creation #61
Conversation
We start with the list of reports to generate, then figure out which snapshots need to be generated in order to create those reports. Finally, we split up the list of snapshots into sublists and launch a snapshot-creation thread to process each sublist. This is similar to how we create reports, though that report code still needs to be cleaned up to match how this snapshot code now works. Note that all snapshots (including snapshots for third-party reports) are now created from a common create_snapshot() function, which simplifies the code slightly.
This is not a huge concern at the moment since third-party snapshot creation does not take very long.
This pull request introduces 3 alerts when merging 5bed850 into 6b97eec - view on LGTM.com new alerts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I did leave a comment describing an improvement; we may want to create an issue from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this task done! 💪💪💪Hopefully it will result in greatly reduced snapshot generation runtime. I had some thoughts/suggestions, but nothing is particularly showstopping.
return snapshot_process.returncode | ||
|
||
|
||
def create_snapshots(org_list, db, cyhy_db_section): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of create_snapshot()
/create_snapshots()
because of their similarity. Would something like create_snapshots_from_list()
be more at-a-glance readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - this would increase usability to disambiguate the two and keep users from having to rely on the help text to tell them apart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion- see 36a4128.
Thanks @mcdonnnj and LGTM! Co-authored-by: Nick M. <50747025+mcdonnnj@users.noreply.github.com>
This pull request introduces 2 alerts when merging 42d9453 into 6b97eec - view on LGTM.com new alerts:
|
Co-authored-by: Nick M. <50747025+mcdonnnj@users.noreply.github.com>
I re-requested review from you all to make sure you're all good with the recent changes I made based on your excellent feedback. Thanks! PS: I re-ran the script in a test environment after making these changes and everything still looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all still looks great to me! Nice improvements since the last time I reviewed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! 🏆
For posterity, it seems that these LGTM alerts are not taking into account the |
They are assigned here cyhy-reports/extras/create_snapshots_reports_scorecard.py Lines 796 to 797 in 01e36ea
and then if they meet the appropriate logic, they are reassigned here cyhy-reports/extras/create_snapshots_reports_scorecard.py Lines 931 to 936 in 01e36ea
I think LGTM expects you do define them exclusively in that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking even better to me now! Thanks again for parallelizing this job 💪💪💪
I detest the leaky scope "feature" of Python. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗣 Description
This PR modifies the weekly reporting script so that snapshot creation is multithreaded, instead of single-threaded.
While I was making changes to this script, I took the opportunity to clean up some function names, variable names, comments, and logging statements to make them clearer and more consistent (69795cc, fd426e5, d208d94). I also removed some old testing code that had been commented-out, but is no longer needed (70ea90c).
Resolves #58.
💭 Motivation and Context
From #58:
This PR starts us off with 16 threads for snapshot creation, since that is how many we currently use for report creation. That number can be increased or decreased in Production after we observe how long the threaded snapshot creation process takes and if there are any database performance issues.
Related, we may be able to increase the number of report-creation threads and reduce the time it takes to make the reports (currently 133 minutes), since the initial number of 16 threads predates our AWS infrastructure (we currently use an instance with 8 cores to generate reports).
Note that third-party snapshot creation (which must occur after "regular" snapshot creation) has not moved to threads yet. Those third-party snapshots are created relatively quickly (less than 6 minutes currently), so we won't save all that much time by creating them in parallel. Still, I added a TODO to do that eventually.
🧪 Testing
I tested these changes on my local system using the set of Production organizations, except I neutered the code so that no actual snapshots or reports were generated.
Once I was satisfied with that testing, I created a set of 37 dummy orgs (32 "regular" orgs and 5 "third-party reporting" orgs) and imported them to a test environment (thanks @mcdonnnj for letting me use your environment). I then ran the entire script (well, I commented out a couple of unrelated bits of code that would've required more test environment setup than I was willing to do) and confirmed that the new multithreaded code successfully generated snapshots for the regular orgs and the third-party snapshots were also generated successfully (non-threaded, but using code that was modified in this PR).
✅ Checklist