https://github.com/chaoss/augur/blob/49a008ab97c43472339e400cb316a5323110d78d/augur/tasks/github/facade_github/tasks.py#L210-L246
Query error
This query, per the comments and my understanding of the implementation, selects all of the commit author emails and names from the commit table that do not appear in either the contributors table or the contributors_aliases table.
This works fine for new contributors that are not yet in the contributors table.
It fails when a record slips through (for example, commits made with an email that was only later linked to a GitHub account). In that case, the commits with a NULL cmt_ght_author_id are never revisited and linked, even though the contributor IS resolved, just not for the commits that slipped through.
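A minimal sketch of this failure mode, using an in-memory SQLite database. Only cmt_ght_author_id comes from the issue above; the table layouts, the other column names, and the query itself are simplified assumptions for illustration, not the actual query in tasks.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified schema (assumption, not Augur's real one).
CREATE TABLE commits (
    cmt_id INTEGER PRIMARY KEY,
    author_email TEXT,
    cmt_ght_author_id TEXT
);
CREATE TABLE contributors (cntrb_id TEXT PRIMARY KEY, email TEXT);
CREATE TABLE contributors_aliases (cntrb_id TEXT, alias_email TEXT);

-- Commit 1 slipped through: it was never linked to a contributor.
INSERT INTO commits VALUES (1, 'old@example.com', NULL);
-- Commit 2 belongs to a brand-new, unknown contributor.
INSERT INTO commits VALUES (2, 'new@example.com', NULL);
-- The email on commit 1 was resolved to a contributor later on.
INSERT INTO contributors VALUES ('c1', 'old@example.com');
""")

# Email-based check: pick emails that are in neither lookup table.
email_based = """
SELECT DISTINCT c.author_email
FROM commits AS c
WHERE NOT EXISTS (
        SELECT 1 FROM contributors AS k WHERE k.email = c.author_email)
  AND NOT EXISTS (
        SELECT 1 FROM contributors_aliases AS a
        WHERE a.alias_email = c.author_email)
ORDER BY c.author_email
"""
selected_emails = [row[0] for row in conn.execute(email_based)]

# Commits that still have no linked contributor.
unlinked_ids = [row[0] for row in conn.execute(
    "SELECT cmt_id FROM commits WHERE cmt_ght_author_id IS NULL ORDER BY cmt_id")]

print(selected_emails)  # only the new contributor's email is picked up
print(unlinked_ids)     # but commit 1 is still unlinked as well
```

In this toy setup the email-based check selects only new@example.com, while commit 1 keeps its NULL cmt_ght_author_id and is never selected again.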
Last Collection Date
Since we added a filter for last collection date (added by @IsaacMilarky in 8539825bb217c388735dfa1bc43d25dc4cee0d51, PR augurlabs/augur#3253), this problem is worse: anything that slipped through or didn't get properly linked is now systematically ignored because it is older than the last collection date.
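A rough sketch of how a date cutoff hides older unlinked commits. The collected_at column, the last_collection_date value, and the shape of the filter are assumptions for illustration; the filter added in the PR may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified schema (assumption, not Augur's real one).
CREATE TABLE commits (
    cmt_id INTEGER PRIMARY KEY,
    cmt_ght_author_id TEXT,
    collected_at TEXT  -- ISO date string, so string comparison orders correctly
);
-- Commit 1 slipped through during an earlier collection run.
INSERT INTO commits VALUES (1, NULL, '2025-01-10');
-- Commit 2 was collected after the last run.
INSERT INTO commits VALUES (2, NULL, '2025-06-01');
""")

last_collection_date = "2025-03-01"  # illustrative value

# With the cutoff, only commits collected since the last run are considered.
with_filter = [row[0] for row in conn.execute(
    "SELECT cmt_id FROM commits "
    "WHERE cmt_ght_author_id IS NULL AND collected_at >= ? "
    "ORDER BY cmt_id",
    (last_collection_date,),
)]

# Without the cutoff, every unlinked commit is considered.
without_filter = [row[0] for row in conn.execute(
    "SELECT cmt_id FROM commits WHERE cmt_ght_author_id IS NULL ORDER BY cmt_id")]

print(with_filter)
print(without_filter)
```

Here commit 1 is unlinked but predates the cutoff, so it only shows up when the cutoff is dropped.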
When this PR was filed, I called out that the change it was making seemed unrelated to the analyze_commits_in_parallel problem that was being solved at the time (augurlabs/augur#3253 (comment)).
This query is essentially run as a precondition check to establish which records we should attempt contributor resolution on.
Currently our logic is to find all email addresses that haven't been matched yet (looking them up against two sources, the contributors table and the aliases table, see #237).
That is way more complicated than "we should run contributor resolution on all commits we don't currently have linked to a contributor" AKA "is the cmt_ght_author_id NULL?"
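The simpler check could be sketched like this. The function name commits_needing_resolution and the table layout are hypothetical; only cmt_ght_author_id is taken from the issue above.

```python
import sqlite3


def commits_needing_resolution(conn, repo_id):
    """Return ids of commits in repo_id that have no linked contributor."""
    rows = conn.execute(
        "SELECT cmt_id FROM commits "
        "WHERE repo_id = ? AND cmt_ght_author_id IS NULL "
        "ORDER BY cmt_id",
        (repo_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified schema (assumption, not Augur's real one).
CREATE TABLE commits (
    cmt_id INTEGER PRIMARY KEY,
    repo_id INTEGER,
    author_email TEXT,
    cmt_ght_author_id TEXT
);
-- Same email, but only commit 2 was ever linked to the contributor.
INSERT INTO commits VALUES (1, 1, 'old@example.com', NULL);
INSERT INTO commits VALUES (2, 1, 'old@example.com', 'c1');
-- Unlinked commit in a different repo.
INSERT INTO commits VALUES (3, 2, 'other@example.com', NULL);
""")

pending = commits_needing_resolution(conn, 1)
print(pending)
```

Because the check looks at the commit row itself rather than at whether the email is already known, commit 1 is picked up even though its email already belongs to a resolved contributor.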
Note
Migrated from augurlabs/augur#3779
Originally opened by
@MoralCode on 2026-03-19