feat(github): Add incremental data collection #8858

Merged
klesh merged 24 commits into apache:main from AkerBP:feat/githup-incremental-data-collect on Apr 29, 2026
@lrf-nitro
Contributor

⚠️ Pre Checklist

Please complete ALL items in this checklist, and remove before submitting

  • I have read through the Contributing Documentation.
  • I have added relevant tests.
  • I have added relevant documentation.
  • I will add labels to the PR, such as pr-type/bug-fix, pr-type/feature-development, etc.

Summary

  • This change set stabilizes GitHub incremental sync by moving high-volume extractors/converters to stateful incremental mode, hardening workflow-run collection windows, and fixing converter bootstrap/performance edge cases.
  • It also improves observability with per-subtask processed-record counts in completion logs.
  • Key fixes:
    • Stateful extractor + converter migration for GitHub plugin.
    • Time format changed to RFC 3339 for the since parameter in GitHub API collectors.
    • Workflow runs incremental bootstrap fallback + adaptive-window safety retained.
    • Subtask-state bootstrap from collector checkpoints to avoid broad converter first-pass on long-lived projects.
    • Convert Jobs incremental query optimized to avoid expensive run-table join scans on empty deltas.
  • Observed runtime outcomes from the logs:
    • Pre-fix data collection run time approx. 2 hours
    • With incremental handling the processing time was reduced to 2 minutes
  • Net effect: incremental behavior is now consistent in both collection and post-collection processing, with major runtime reduction on previously problematic projects.

Does this close any open issues?

No

Screenshots

image

Other Information

Any other information that is important to this PR.

@dosubot (bot) added the size:XL, component/plugins, pr-type/feature-development, and priority/high labels on Apr 27, 2026
The since= query parameter passed to the GitHub API in four collectors
was formatted using Go's time.Time.String(), which produces a
human-readable string (e.g. "2024-01-15 10:30:00 +0000 UTC") rather
than the ISO 8601 / RFC 3339 format the GitHub API requires
(e.g. "2024-01-15T10:30:00Z").

The GitHub API silently ignores malformed date strings, causing these
collectors to perform full re-scans on every incremental run despite
appearing to filter correctly. Fix by using .UTC().Format(time.RFC3339)
in all four affected collectors:
- comment_collector.go
- issue_collector.go
- commit_collector.go
- pr_review_comment_collector.go

(cherry picked from commit cf3cb621462d8ae661df5fcb8e1b47c70564cd60)
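
The difference between the two formats described above can be reproduced with the standard library alone. This is a minimal sketch, not the collector code itself: `formatSince`, `humanReadable`, and `sampleTime` are illustrative names, and the timestamp is the one used as an example in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// sampleTime is an illustrative checkpoint taken from the commit message example.
var sampleTime = time.Date(2024, 1, 15, 10, 30, 0, 0, time.UTC)

// humanReadable reproduces the old behaviour: time.Time.String().
func humanReadable(t time.Time) string {
	return t.String()
}

// formatSince (hypothetical helper name) renders the value in RFC 3339,
// normalised to UTC, which is the shape the since= parameter expects.
func formatSince(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

func main() {
	fmt.Println(humanReadable(sampleTime)) // 2024-01-15 10:30:00 +0000 UTC
	fmt.Println(formatSince(sampleTime))   // 2024-01-15T10:30:00Z
}
```

Calling `.UTC()` before formatting keeps the output in the `Z` form even when the checkpoint was stored with a local offset.
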
Switch Extract Events from the legacy full-scan NewApiExtractor to
NewStatefulApiExtractor, which filters _raw_github_api_events by
created_at >= last_run_start on incremental syncs. This table had
13,828 rows in a representative production run and took ~117s to
process on every run regardless of how few new events were collected.
After this change incremental runs process only newly collected rows.

(cherry picked from commit 84f324f65e8576cd2ff0ca8b74a2588ee2600e12)
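
The filtering idea behind the stateful extractor (process only raw rows with created_at >= last_run_start) can be sketched in isolation. This is an in-memory illustration under stated assumptions, not the NewStatefulApiExtractor implementation: `rawRow`, `selectRows`, `checkpoint`, and `sampleRows` are hypothetical names, and a nil checkpoint is assumed to mean "no previous state, run in full mode".

```go
package main

import (
	"fmt"
	"time"
)

// rawRow is a hypothetical stand-in for one row of a raw API table.
type rawRow struct {
	ID        int
	CreatedAt time.Time
}

// checkpoint is an illustrative "last run start" timestamp.
var checkpoint = time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)

// selectRows returns the rows an extractor would process.
// A nil lastRunStart is assumed to mean a full run, so every row is returned.
func selectRows(rows []rawRow, lastRunStart *time.Time) []rawRow {
	if lastRunStart == nil {
		return rows
	}
	var out []rawRow
	for _, r := range rows {
		// Equivalent to the SQL predicate created_at >= last_run_start.
		if !r.CreatedAt.Before(*lastRunStart) {
			out = append(out, r)
		}
	}
	return out
}

// sampleRows builds one row before, one at, and one after the checkpoint.
func sampleRows() []rawRow {
	return []rawRow{
		{ID: 1, CreatedAt: checkpoint.Add(-48 * time.Hour)},
		{ID: 2, CreatedAt: checkpoint},
		{ID: 3, CreatedAt: checkpoint.Add(time.Hour)},
	}
}

func main() {
	rows := sampleRows()
	fmt.Println("full run:", len(selectRows(rows, nil)))
	fmt.Println("incremental run:", len(selectRows(rows, &checkpoint)))
}
```
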
Switch Extract Pull Requests from legacy full-scan NewApiExtractor to
NewStatefulApiExtractor. The _raw_github_api_pull_requests table had
12,448 rows in a representative production run, taking ~108s on every
incremental sync. After this change only newly collected PR rows are
processed.

SubtaskConfig captures prType and prComponent regex strings so that a
scope config change automatically triggers a full re-extract. BeforeExtract
deletes GithubPrLabel rows for the current PR before re-inserting them
in incremental mode, preventing stale labels from persisting when labels
are removed upstream.

(cherry picked from commit 80dc76d4e5f1ad6c65d48d42f60f50b87c9dad2d)
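
One way to picture how a captured config can force a full re-extract is to compare a serialized copy of the previous config with the current one. The sketch below assumes that approach; it is not taken from the DevLake source, and `extractorConfig` and `needsFullRun` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractorConfig is a hypothetical config holding the regex strings that
// influence extraction results.
type extractorConfig struct {
	PrType      string `json:"prType"`
	PrComponent string `json:"prComponent"`
}

// needsFullRun reports whether the serialized config stored by the previous
// run differs from the current one. It also returns the new serialized form
// so the caller can persist it for the next run.
func needsFullRun(previous string, current extractorConfig) (bool, string, error) {
	b, err := json.Marshal(current)
	if err != nil {
		return false, "", err
	}
	serialized := string(b)
	return previous != serialized, serialized, nil
}

func main() {
	cfg := extractorConfig{PrType: "feat", PrComponent: "ui"}

	// First run: nothing stored yet, so a full run is required.
	full, stored, _ := needsFullRun("", cfg)
	fmt.Println("first run full:", full)

	// Same config again: incremental mode is safe.
	full, _, _ = needsFullRun(stored, cfg)
	fmt.Println("unchanged config full:", full)

	// A changed regex invalidates previously extracted rows.
	cfg.PrType = "fix"
	full, _, _ = needsFullRun(stored, cfg)
	fmt.Println("changed config full:", full)
}
```
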
…ulApiExtractor

Switch Extract Workflow Runs (~7,364 raw rows, ~65s) and Extract PR
Commits (~6,982 raw rows, ~60s) from legacy full-scan NewApiExtractor
to NewStatefulApiExtractor. Both extractors are simple mappings with
no scope-config dependency, so no SubtaskConfig or BeforeExtract needed.

(cherry picked from commit ea4f158a65ad0f7a2c23ba7bbc9932059e2ca408)
…Extractor

Migrate Extract Jobs (~5,369 rows, ~48s), Extract PR Reviews (~3,073
rows, ~27s), and Extract PR Review Comments (~1,820 rows, ~16s) from
legacy full-scan NewApiExtractor to NewStatefulApiExtractor.

Also moves prUrlRegex compilation in pr_review_comment_extractor.go
from inside the Extract closure (recompiled on every raw row) to before
the extractor is created, eliminating redundant regexp compilation.

(cherry picked from commit 54d601587bc74ebd0f1103345c545571146739b5)
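
The regexp change above is the usual "compile once, match many" pattern. A minimal sketch follows; the pattern and the names `prNumberPattern` and `extractPRNumbers` are hypothetical and are not the actual prUrlRegex from pr_review_comment_extractor.go.

```go
package main

import (
	"fmt"
	"regexp"
)

// prNumberPattern is a hypothetical pattern used only for illustration.
const prNumberPattern = `/pulls/(\d+)$`

// extractPRNumbers compiles the pattern once, before the loop, and reuses the
// compiled value for every URL instead of recompiling per row.
func extractPRNumbers(urls []string) []string {
	re := regexp.MustCompile(prNumberPattern)
	var out []string
	for _, u := range urls {
		if m := re.FindStringSubmatch(u); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	urls := []string{
		"https://api.github.com/repos/o/r/pulls/42",
		"https://api.github.com/repos/o/r/issues/7",
	}
	fmt.Println(extractPRNumbers(urls))
}
```
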
…xtractor

Migrate the final seven GitHub extractors to NewStatefulApiExtractor:
issue, comment, account, account_org, milestone, commit, commit_stats.

issue_extractor gains SubtaskConfig (issue classification regex strings)
so scope config changes trigger automatic full re-extraction, and
BeforeExtract cleanup for GithubIssueLabel and GithubIssueAssignee rows
in incremental mode to prevent stale labels/assignees persisting after
upstream removal.

All other extractors in this commit are simple migrations with no
config-sensitivity or child record cleanup needed.

With this commit all 14 GitHub plugin extractors are now incremental.
Combined with the collector fixes in earlier commits, incremental
collection runs that previously took 9+ minutes in the extract phase
will now complete in seconds when few or no new records were collected.

(cherry picked from commit 606acea88caef63662733162faa47a6c6d3155cc)
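
The BeforeExtract cleanup mentioned above amounts to "delete the parent's child rows, then insert the current set". The sketch below models that with an in-memory map rather than a database; `labelStore` and `replaceLabels` are hypothetical names, not DevLake types.

```go
package main

import "fmt"

// labelStore is a hypothetical in-memory stand-in for a label table,
// keyed by issue ID.
type labelStore map[int][]string

// replaceLabels removes every stored label for the issue and then inserts
// the labels currently reported upstream. Without the delete step, a label
// removed upstream would remain in the store.
func (s labelStore) replaceLabels(issueID int, current []string) {
	delete(s, issueID)
	if len(current) > 0 {
		s[issueID] = append([]string(nil), current...)
	}
}

func main() {
	store := labelStore{1: {"bug", "needs-triage"}}

	// Upstream now reports only "bug" for issue 1.
	store.replaceLabels(1, []string{"bug"})
	fmt.Println(store[1])
}
```
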
Convert Workflow Runs and Convert Jobs now use NewStatefulDataConverter,
skipping records unchanged since last run on incremental pipelines.
Jobs are filtered via JOIN on _tool_github_runs.github_updated_at.

(cherry picked from commit 75a909efccab1d04bfae4058f4674663b00762a6)
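
Filtering child records through the parent's update time can be illustrated without a database. This sketch stands in for the JOIN on _tool_github_runs.github_updated_at; the types, sample data, and the inclusive comparison are assumptions made for illustration, not the converter's actual query.

```go
package main

import (
	"fmt"
	"time"
)

// run and job are hypothetical, trimmed-down stand-ins for the tool-layer rows.
type run struct {
	ID        int
	UpdatedAt time.Time
}

type job struct {
	ID    int
	RunID int
}

// since is an illustrative checkpoint from the previous pipeline run.
var since = time.Date(2026, 4, 20, 0, 0, 0, 0, time.UTC)

var sampleRuns = []run{
	{ID: 1, UpdatedAt: time.Date(2026, 4, 10, 0, 0, 0, 0, time.UTC)},
	{ID: 2, UpdatedAt: time.Date(2026, 4, 25, 0, 0, 0, 0, time.UTC)},
}

var sampleJobs = []job{
	{ID: 10, RunID: 1},
	{ID: 20, RunID: 2},
	{ID: 21, RunID: 2},
}

// jobsToConvert keeps only jobs whose parent run was updated at or after
// the checkpoint (assumed inclusive here).
func jobsToConvert(jobs []job, runs []run, since time.Time) []job {
	changed := make(map[int]bool)
	for _, r := range runs {
		if !r.UpdatedAt.Before(since) {
			changed[r.ID] = true
		}
	}
	var out []job
	for _, j := range jobs {
		if changed[j.RunID] {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	for _, j := range jobsToConvert(sampleJobs, sampleRuns, since) {
		fmt.Println("convert job", j.ID)
	}
}
```
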
…nverter

Convert PR Commits, Convert PR Comments, and Convert PR Reviews now use
NewStatefulDataConverter. Child-of-PR records are filtered incrementally
via JOIN on _tool_github_pull_requests.github_updated_at; PR comments
additionally filter on their own github_updated_at.

(cherry picked from commit 17cbc2e4caff7349fa2d3389d7cafcc5d0edee71)
…verter

Convert Pull Requests filters by GithubPullRequest.github_updated_at.
Convert Reviews and Convert PR Issues filter via JOIN on pull_requests
github_updated_at since reviewers and pr_issues have no timestamp of their own.

(cherry picked from commit 47437ae49d30b967b443c964a4b1baca34344acc)
Migrate the last 9 converters from DataConverter to StatefulDataConverter
so they skip already-processed records on incremental runs:

- issue_convertor: filter on github_updated_at
- issue_comment_convertor: filter on github_updated_at
- issue_label_convertor: JOIN to issues, filter on issues.github_updated_at
- issue_assignee_convertor: JOIN to issues, filter on issues.github_updated_at
- pr_label_convertor: JOIN to pull_requests, filter on pr.github_updated_at
- account_convertor: filter on updated_at
- release_convertor: filter on updated_at
- repo_convertor: filter on updated_at; retain GithubApiRepo struct (used by pr_extractor)
- commit_convertor: filter on authored_date

(cherry picked from commit 7bcc9da4676d34d9cd24e41a68765cf85723ac81)
@lrf-nitro lrf-nitro force-pushed the feat/githup-incremental-data-collect branch from df32811 to 7b1e8f1 Compare April 27, 2026 20:59
@klesh
Contributor

klesh commented Apr 28, 2026

Nice improvement. Could you please fix the failed test cases? Thank you.

@klesh
Contributor

klesh commented Apr 29, 2026

Weird, why is it complaining that the _devlake_collector_latest_state table doesn't exist?

@lrf-nitro
Contributor Author

I will check.

Contributor

@klesh klesh left a comment

LGTM.
Thank you for your contribution.

@klesh klesh merged commit e26c61b into apache:main Apr 29, 2026
10 checks passed