Fix: Eliminate redundant full table scans in messages and events collection (phase 1, only events.py) #3479
PredictiveManish wants to merge 4 commits into augurlabs:main from
Conversation
MoralCode left a comment
Code LGTM! I would like to better understand how this has been tested, and whether it still needs testing to confirm it works.
Actually, yes, it still needs testing to confirm! I'll confirm once that's done; I'm facing a small setup issue because I recently purchased a new device, so I'll share results once it's tested.
Marked as waiting since testing is pending (this probably also needs a rebase).
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
b3d0ac6 to 1f128fc
shlokgilda left a comment
LGTM. Waiting to hear back about the testing status.
Can you resolve the merge conflicts?
Signed-off-by: Manish Tiwari <manish.tiwari.09@zohomail.in>
Yes, sure!
Once this has been tested, it can be merged.
Signed-off-by: Manish Tiwari <manish.tiwari.09@zohomail.in>
ae07921
shlokgilda left a comment
The optimization idea is right, but `issue_url_to_id_map` and `pr_url_to_id_map` are never actually defined in `collect()`. They're just passed as undefined variables to `_process_events`, so this will crash.
You need to add these two lines in collect() after repo_id is set, before the loop:
```python
issue_url_to_id_map = self._get_map_from_issue_url_to_id(repo_id)
pr_url_to_id_map = self._get_map_from_pr_url_to_id(repo_id)
```

The helper methods still exist, but nothing calls them now, so they're dead code until this is wired up.
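For context, the pattern being asked for here can be sketched with stand-in data. The class and method names mirror the ones quoted in this thread, but the bodies below are hypothetical stand-ins, not Augur's real implementation (which queries the database):

```python
class BulkGithubEventCollection:
    """Hypothetical sketch of the hoisted-map pattern discussed above."""

    def __init__(self, issues, prs):
        # Stand-ins for database rows: lists of (url, id) pairs.
        self.issues = issues
        self.prs = prs

    def _get_map_from_issue_url_to_id(self, repo_id):
        # In the real task this is one DB query per repo.
        return {url: id_ for url, id_ in self.issues}

    def _get_map_from_pr_url_to_id(self, repo_id):
        return {url: id_ for url, id_ in self.prs}

    def _process_events(self, events, repo_id, issue_url_to_id_map, pr_url_to_id_map):
        # Resolve each event URL via the precomputed maps -- no per-batch scan.
        return [issue_url_to_id_map.get(url) or pr_url_to_id_map.get(url)
                for url in events]

    def collect(self, repo_id, event_urls, event_batch_size=2):
        # Build both maps ONCE, before the batch loop.
        issue_url_to_id_map = self._get_map_from_issue_url_to_id(repo_id)
        pr_url_to_id_map = self._get_map_from_pr_url_to_id(repo_id)

        results, events = [], []
        for url in event_urls:
            events.append(url)
            if len(events) >= event_batch_size:
                results += self._process_events(events, repo_id,
                                                issue_url_to_id_map, pr_url_to_id_map)
                events = []
        if events:  # flush the final partial batch
            results += self._process_events(events, repo_id,
                                            issue_url_to_id_map, pr_url_to_id_map)
        return results
```

Hoisting the two map builds out of the loop turns one scan per batch into one per repo, which is the point of the PR.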
Signed-off-by: Manish Tiwari <manish.tiwari.09@zohomail.in>
shlokgilda left a comment
Left an inline comment. Should be GTG after that.
```python
# making this a decent size since process_events retrieves all the issues and prs each time
if len(events) >= event_batch_size:
    self._process_events(events, repo_id)
```
This call wasn't updated to pass the new map arguments, so it will raise a TypeError at runtime for any repo with more events than `event_batch_size`. It needs to be:

```python
self._process_events(events, repo_id, issue_url_to_id_map, pr_url_to_id_map)
```

Also, the comment right above (`# making this a decent size since process_events retrieves all the issues and prs each time`) is now stale and should be removed.
Description
`augur/tasks/github/events.py`:
- Built `issue_url_to_id_map` and `pr_url_to_id_map` once in `BulkGithubEventCollection.collect()` before the batch loop
- Updated `_process_events()`, `_process_issue_events()`, and `_process_pr_events()` to accept the mappings as parameters
- Removed redundant `_get_map_from_*()` calls from the batch processing methods

Notes for Reviewers
This PR partially fixes #3440: Full table scans on every batch in messages and events collection.
Signed commits