Skip to content

GitHub plugin: domain accounts row missing for non-contributor authors, leaving issues.creator_id / pull_requests.author_id as orphan FKs #8886

@JAORMX

Description

@JAORMX

The GitHub plugin writes issues.creator_id and pull_requests.author_id for every author it sees, but only creates the corresponding accounts row when that user has also committed code to the tracked repo. Anyone who only filed an issue, or who opened a PR without commits to the repo, ends up with a foreign key that points to nothing. A LEFT JOIN accounts ON issues.creator_id = accounts.id returns NULL even though the login is sitting one column over in issues.creator_name.

Reproduction

  1. Configure the GitHub plugin against a public repo with external issue traffic. stacklok/toolhive is a convenient one.
  2. Pick an issue filed by someone with no commits in that repo. In stacklok/toolhive, #4297 (filed by milichev), #4301, #4359, and #4413 all qualify.
  3. After a full sync, query the domain layer:
SELECT i.url, i.creator_id, i.creator_name, a.user_name
  FROM issues i
  LEFT JOIN accounts a ON a.id = i.creator_id
 WHERE i.url = 'https://github.com/stacklok/toolhive/issues/4297';

creator_id is populated, creator_name is milichev, and a.user_name is NULL. Expected: a.user_name is milichev. The same shape affects pull_requests.author_id for PRs from non-committer authors.

Root cause

backend/plugins/github/tasks/issue_extractor.go emits a GithubRepoAccount for body.User via convertAccount(...) but does not emit a GithubAccount row. _tool_github_accounts is populated separately by account_collector.go, which only iterates /repos/{owner}/{repo}/contributors. Users who never committed never land in _tool_github_accounts.

account_convertor.go::ConvertAccounts converts only what is in _tool_github_accounts:

dal.Select("_tool_github_accounts.*"),
dal.From(&models.GithubAccount{}),
dal.Where("repo_github_id = ? and _tool_github_accounts.connection_id=?", ...),
dal.Join(`left join _tool_github_repo_accounts gra on (...)`),

The LEFT JOIN _tool_github_repo_accounts with a WHERE on the right-table column behaves like an inner join, so users present only in _tool_github_repo_accounts (the rows issue_extractor inserts via convertAccount) get filtered out.

issue_convertor.go::ConvertIssues writes domainIssue.CreatorId = accountIdGen.Generate(connId, issue.AuthorId) unconditionally for any non-zero AuthorId, producing the orphan FK. pr_extractor.go has the same pattern, with the same outcome for PR authors and merged-by users.

The denormalized columns issues.creator_name and pull_requests.author_name are populated correctly from the API response, so the login is recoverable. Any consumer that joins to accounts for author attribution will silently under-attribute external contributors. Bot filters that key on accounts.user_name LIKE '%[bot]' will miss bot-filed issues for the same reason.

Reproduced on v1.0.3-beta10. The same code paths exist on main; the recent refactor of ConvertAccounts from NewDataConverter to NewStatefulDataConverter did not change the filter. Likely present in every prior release.

Suggested fix

Have the issue and PR extractors emit a GithubAccount alongside the GithubRepoAccount. In issue_extractor.go, where body.User is processed:

if body.User != nil {
    githubIssue.AuthorId = body.User.Id
    githubIssue.AuthorName = body.User.Login
    repoUser, err := convertAccount(body.User, data.Options.GithubId, data.Options.ConnectionId)
    if err != nil { return nil, err }
    results = append(results, repoUser)

    results = append(results, &models.GithubAccount{
        ConnectionId: data.Options.ConnectionId,
        Id:           body.User.Id,
        Login:        body.User.Login,
    })
}

Same change in pr_extractor.go for body.User and body.MergedBy. The upsert into _tool_github_accounts is idempotent: if the user later shows up as a contributor, account_extractor overwrites with the richer record. The alternative is to rewrite the ConvertAccounts query to UNION in author IDs from _tool_github_issues and _tool_github_pull_requests, which is larger and changes the contract of that pipeline.

Workaround for consumers

COALESCE(NULLIF(a.user_name, ''), i.creator_name) for issues, or pr.author_name for PRs. The denormalized columns are populated before the orphan-FK chain runs, so the login is always there.

Metadata

Metadata

Assignees

Labels

component/pluginsThis issue or PR relates to pluginspriority/highThis issue is very importanttype/bugThis issue is a bug

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions