feat(github): Add incremental data collection #8858
Merged
klesh merged 24 commits into apache:main on Apr 29, 2026
The since= query parameter passed to the GitHub API in four collectors was formatted using Go's time.Time.String(), which produces a human-readable string (e.g. "2024-01-15 10:30:00 +0000 UTC") rather than the ISO 8601 / RFC 3339 format the GitHub API requires (e.g. "2024-01-15T10:30:00Z"). The GitHub API silently ignores malformed date strings, causing these collectors to perform full re-scans on every incremental run despite appearing to filter correctly.

Fix by using .UTC().Format(time.RFC3339) in all four affected collectors:
- comment_collector.go
- issue_collector.go
- commit_collector.go
- pr_review_comment_collector.go

(cherry picked from commit cf3cb621462d8ae661df5fcb8e1b47c70564cd60)
Switch Extract Events from the legacy full-scan NewApiExtractor to NewStatefulApiExtractor, which filters _raw_github_api_events by created_at >= last_run_start on incremental syncs. This table had 13,828 rows in a representative production run and took ~117s to process on every run regardless of how few new events were collected. After this change incremental runs process only newly collected rows. (cherry picked from commit 84f324f65e8576cd2ff0ca8b74a2588ee2600e12)
Switch Extract Pull Requests from legacy full-scan NewApiExtractor to NewStatefulApiExtractor. The _raw_github_api_pull_requests table had 12,448 rows in a representative production run, taking ~108s on every incremental sync. After this change only newly collected PR rows are processed. SubtaskConfig captures prType and prComponent regex strings so that a scope config change automatically triggers a full re-extract. BeforeExtract deletes GithubPrLabel rows for the current PR before re-inserting them in incremental mode, preventing stale labels from persisting when labels are removed upstream. (cherry picked from commit 80dc76d4e5f1ad6c65d48d42f60f50b87c9dad2d)
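The two ideas in this commit, a config snapshot that forces a full re-extract when it changes and a delete-then-insert step for child label rows, can be sketched without the plugin framework. Every name below (`prExtractConfig`, `needsFullExtract`, `prLabel`, `replaceLabels`) is hypothetical, and the regex strings are made-up examples; the real SubtaskConfig and BeforeExtract hooks may work differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prExtractConfig is a hypothetical snapshot of the settings that change
// what the extractor produces.
type prExtractConfig struct {
	PrType      string `json:"prType"`
	PrComponent string `json:"prComponent"`
}

// needsFullExtract compares the snapshot saved by the previous run with the
// current config. It returns true when they differ (or on the first run),
// together with the new snapshot to save.
func needsFullExtract(savedJSON string, current prExtractConfig) (bool, string, error) {
	b, err := json.Marshal(current)
	if err != nil {
		return false, "", err
	}
	currentJSON := string(b)
	return savedJSON != currentJSON, currentJSON, nil
}

// prLabel is a hypothetical child row keyed by its parent PR.
type prLabel struct {
	PRID int
	Name string
}

// replaceLabels drops every existing label of one PR and re-inserts the
// labels seen upstream, so labels removed upstream do not linger.
func replaceLabels(table []prLabel, prID int, upstream []string) []prLabel {
	kept := make([]prLabel, 0, len(table)+len(upstream))
	for _, l := range table {
		if l.PRID != prID {
			kept = append(kept, l)
		}
	}
	for _, name := range upstream {
		kept = append(kept, prLabel{PRID: prID, Name: name})
	}
	return kept
}

func main() {
	cfg := prExtractConfig{PrType: "type/(.*)$", PrComponent: "component/(.*)$"}
	full, snapshot, _ := needsFullExtract("", cfg)
	fmt.Println(full) // true: first run
	full, _, _ = needsFullExtract(snapshot, cfg)
	fmt.Println(full) // false: config unchanged

	table := []prLabel{{1, "bug"}, {1, "stale"}, {2, "docs"}}
	fmt.Println(replaceLabels(table, 1, []string{"bug"})) // [{2 docs} {1 bug}]
}
```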
…ulApiExtractor Switch Extract Workflow Runs (~7,364 raw rows, ~65s) and Extract PR Commits (~6,982 raw rows, ~60s) from legacy full-scan NewApiExtractor to NewStatefulApiExtractor. Both extractors are simple mappings with no scope-config dependency, so no SubtaskConfig or BeforeExtract needed. (cherry picked from commit ea4f158a65ad0f7a2c23ba7bbc9932059e2ca408)
…Extractor Migrate Extract Jobs (~5,369 rows, ~48s), Extract PR Reviews (~3,073 rows, ~27s), and Extract PR Review Comments (~1,820 rows, ~16s) from legacy full-scan NewApiExtractor to NewStatefulApiExtractor. Also moves prUrlRegex compilation in pr_review_comment_extractor.go from inside the Extract closure (recompiled on every raw row) to before the extractor is created, eliminating redundant regexp compilation. (cherry picked from commit 54d601587bc74ebd0f1103345c545571146739b5)
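The regexp change in this commit follows a common Go pattern: compile once outside the per-row closure and let the closure capture the compiled value. The sketch below assumes a made-up pattern and URL; the actual prUrlRegex in pr_review_comment_extractor.go is not shown in this PR text, and `newPRNumberExtractor` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// newPRNumberExtractor compiles the pattern once and returns a closure that
// reuses it for every row. The pattern here is an illustrative assumption.
func newPRNumberExtractor() func(rawURL string) (string, bool) {
	prURLRegex := regexp.MustCompile(`/pulls/(\d+)$`)
	return func(rawURL string) (string, bool) {
		m := prURLRegex.FindStringSubmatch(rawURL)
		if m == nil {
			return "", false
		}
		return m[1], true
	}
}

func main() {
	extract := newPRNumberExtractor()
	// Example URL only; the closure is called once per raw row.
	n, ok := extract("https://api.github.com/repos/apache/incubator-devlake/pulls/8858")
	fmt.Println(n, ok) // 8858 true
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so sharing one instance across rows does not require extra locking.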
…xtractor Migrate the final seven GitHub extractors to NewStatefulApiExtractor: issue, comment, account, account_org, milestone, commit, commit_stats. issue_extractor gains SubtaskConfig (issue classification regex strings) so scope config changes trigger automatic full re-extraction, and BeforeExtract cleanup for GithubIssueLabel and GithubIssueAssignee rows in incremental mode to prevent stale labels/assignees persisting after upstream removal. All other extractors in this commit are simple migrations with no config-sensitivity or child record cleanup needed. With this commit all 14 GitHub plugin extractors are now incremental. Combined with the collector fixes in earlier commits, incremental collection runs that previously took 9+ minutes in the extract phase will now complete in seconds when few or no new records were collected. (cherry picked from commit 606acea88caef63662733162faa47a6c6d3155cc)
Convert Workflow Runs and Convert Jobs now use NewStatefulDataConverter, skipping records unchanged since last run on incremental pipelines. Jobs are filtered via JOIN on _tool_github_runs.github_updated_at. (cherry picked from commit 75a909efccab1d04bfae4058f4674663b00762a6)
…nverter Convert PR Commits, Convert PR Comments, and Convert PR Reviews now use NewStatefulDataConverter. Child-of-PR records are filtered incrementally via JOIN on _tool_github_pull_requests.github_updated_at; PR comments additionally filter on their own github_updated_at. (cherry picked from commit 17cbc2e4caff7349fa2d3389d7cafcc5d0edee71)
…verter Convert Pull Requests filters by GithubPullRequest.github_updated_at. Convert Reviews and Convert PR Issues filter via JOIN on pull_requests github_updated_at since reviewers and pr_issues have no timestamp of their own. (cherry picked from commit 47437ae49d30b967b443c964a4b1baca34344acc)
Migrate the last 9 converters from DataConverter to StatefulDataConverter so they skip already-processed records on incremental runs:
- issue_convertor: filter on github_updated_at
- issue_comment_convertor: filter on github_updated_at
- issue_label_convertor: JOIN to issues, filter on issues.github_updated_at
- issue_assignee_convertor: JOIN to issues, filter on issues.github_updated_at
- pr_label_convertor: JOIN to pull_requests, filter on pr.github_updated_at
- account_convertor: filter on updated_at
- release_convertor: filter on updated_at
- repo_convertor: filter on updated_at; retain GithubApiRepo struct (used by pr_extractor)
- commit_convertor: filter on authored_date

(cherry picked from commit 7bcc9da4676d34d9cd24e41a68765cf85723ac81)
df32811 to 7b1e8f1
Contributor
Nice improvement. Could you please fix the failed test cases? Thank you.
…he#8859) Signed-off-by: yamoyamoto <yamo7yamoto@gmail.com>
Contributor
Weird, why is it complaining about the
Contributor
Author
I will check.
klesh
approved these changes
Apr 29, 2026
Contributor
klesh
left a comment
LGTM.
Thank you for your contribution.
Does this close any open issues?
No