
feat: migrate batch intermediate data from JSON files to tenant DB#130

Merged
coji merged 6 commits into main from feat/batch-json-to-db
Feb 28, 2026
Conversation


@coji coji commented Feb 28, 2026

Summary

  • Migrated batch intermediate data (PRs, commits, reviews, and discussions fetched from the GitHub API) from JSON files to tenant DB tables (github_raw_data, github_raw_tags)
  • Groundwork for removing the Fly.io volume during the Turso migration
  • Unified the ShapedGitHub* model fields from snake_case to camelCase and removed getTenantDbRaw

Changes

  • Added the github_raw_data and github_raw_tags tables to db/tenant.sql
  • Rewrote store.ts to be DB-backed (UPSERT/SELECT via the Kysely query builder)
  • Migrated provider.ts to the new store API (savePrData, saveTags, preloadAll)
  • Converted 9 fields of the ShapedGitHub* models to camelCase, resolving the compatibility issue with CamelCasePlugin
  • Removed getTenantDbRaw (the plugin-free Kysely instance), unifying on getTenantDb
  • Deleted files that are no longer needed: path-builder.ts, path.ts, json.ts, index.ts, batch/scripts/golden-*.ts
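The snake_case → camelCase field unification above can be sketched as a generic key rename. This is an illustrative stand-in, not the project's actual implementation, which renames the fields explicitly in the model definitions:

```typescript
// Hypothetical sketch of the snake_case → camelCase rename applied to the
// ShapedGitHub* models; not the project's actual code.

type JsonRecord = Record<string, unknown>

// Convert one snake_case key to camelCase, e.g. "created_at" → "createdAt".
const toCamelKey = (key: string): string =>
  key.replace(/_([a-z])/g, (_, c: string) => c.toUpperCase())

// Shallowly rename all keys of a record.
const camelCaseKeys = (row: JsonRecord): JsonRecord =>
  Object.fromEntries(Object.entries(row).map(([k, v]) => [toCamelKey(k), v]))

// Example: a stored row before and after the rename.
console.log(camelCaseKeys({ created_at: '2026-02-28', merged_at: null }))
// → { createdAt: '2026-02-28', mergedAt: null }
```

A rename like this is what makes the stored JSON keys agree with what CamelCasePlugin expects, removing the need for a plugin-free Kysely instance.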

Test plan

  • pnpm validate (lint, format, typecheck, build, 52 tests)
  • store.test.ts: save/load round-trip, upsert, empty state, preloadAll, multiple PRs (7 tests)
  • Verified against the production DB that batch fetch saves data to github_raw_data with camelCase keys
  • Verified against the production DB that batch upsert → analyze writes successfully to pull_requests

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • GitHub data (pull requests, commits, reviews, discussions, and tags) is now persisted directly to the database for improved reliability and performance.
  • Refactor

    • Removed file-based JSON storage; data is exclusively database-backed for consistency.
    • Streamlined internal data handling and field naming conventions.
  • Chores

    • Removed snapshot comparison utilities.

Replace file-based JSON storage (path-builder + fs read/write) with
tenant DB tables (github_raw_data, github_raw_tags) for batch PR data.
This removes the Fly.io volume dependency for intermediate data,
preparing for the Turso migration.

- Add github_raw_data and github_raw_tags tables to tenant schema
- Rewrite store.ts to use DB UPSERT/SELECT with raw SQL via getTenantDbRaw
- Add getTenantDbRaw (plugin-free Kysely instance) to preserve JSON keys
- Update provider.ts to use new store API (savePrData, saveTags, preloadAll)
- Delete unused path-builder, path, json, index helper files
- Add store.test.ts with 7 integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 28, 2026


📥 Commits

Reviewing files that changed from the base of the PR and between 9729068 and 60bc081.

📒 Files selected for processing (8)
  • app/services/tenant-db.server.ts
  • app/services/tenant-type.ts
  • batch/provider/github/aggregator.test.ts
  • batch/provider/github/pullrequest.ts
  • batch/provider/github/release-detect.ts
  • batch/provider/github/store.test.ts
  • batch/provider/github/store.ts
  • kysely-codegen.tenant.config.ts
📝 Walkthrough

Migrates GitHub data storage from filesystem JSON files to DB-backed persistence (two new tables: github_raw_data, github_raw_tags); refactors tenant DB caching internals; removes multiple filesystem path/json helpers and golden snapshot scripts; updates GitHub provider and store to use DB-backed APIs and introduces comprehensive store tests; renames many GitHub data fields from snake_case to camelCase.

Changes

Cohort / File(s) Summary
Tenant DB internals
app/services/tenant-db.server.ts
Introduced internal getOrCreateDb(organizationId) and reorganized tenant DB caching/creation; getTenantDb now delegates to the helper; close functions updated to use the new internal variable naming.
Tenant DB types
app/services/tenant-type.ts
Added exported interfaces GithubRawData and GithubRawTags; extended tenant DB interface with githubRawData and githubRawTags.
Schema / Migrations
db/migrations/tenant/20260228064258.sql, db/tenant.sql
Added github_raw_data (PK: repository_id, pull_request_number) and github_raw_tags (PK: repository_id) tables with fetch timestamps and FK constraints to repositories.
Removed FS helpers & tests
batch/helper/json.ts, batch/helper/path.ts, batch/helper/path-builder.ts, batch/helper/path-builder.test.ts, batch/helper/index.ts
Deleted filesystem path and JSON utilities and their re-exports; removed associated tests.
GitHub store (DB-backed)
batch/provider/github/store.ts, batch/provider/github/store.test.ts
Replaced file-based persistence with DB-backed APIs: added savePrData, saveTags, and preloadAll; implemented SQL upserts, preloading cache, and loader methods; added comprehensive tests using temporary SQLite tenant DBs.
GitHub provider
batch/provider/github/provider.ts
Updated provider to use new store API (preloadAll, savePrData, saveTags); removed pathBuilder usage and per-file saves; adjusted logging.
GitHub types & shapes (camelCase)
batch/provider/github/model.ts, batch/provider/github/fetcher.ts, batch/provider/github/aggregator.ts, batch/provider/github/pullrequest.ts, batch/provider/github/release-detect.ts, batch/provider/github/review-response.ts
Renamed many GitHub data fields from snake_case → camelCase (e.g., created_at → createdAt, updated_at → updatedAt, merged_at → mergedAt, etc.) and updated usages and sorting/logic accordingly.
Removed golden scripts & package.json
batch/scripts/golden-snapshot.ts, batch/scripts/golden-compare.ts, package.json
Deleted golden snapshot/compare CLI scripts and removed their npm script entries.

Sequence Diagram(s)

sequenceDiagram
    participant Provider as GitHub Provider
    participant Store as Store Module
    participant DB as Tenant DB
    participant Cache as In-memory Cache

    Note over Provider,DB: DB-backed persistence and preload flow
    Provider->>Store: preloadAll()
    Store->>DB: SELECT * FROM github_raw_data / github_raw_tags
    DB-->>Store: Rows
    Store->>Cache: populate map

    Provider->>Store: savePrData(pr, {commits,reviews,discussions})
    Store->>DB: UPSERT INTO github_raw_data
    DB-->>Store: OK

    Provider->>Store: saveTags(tags)
    Store->>DB: UPSERT INTO github_raw_tags
    DB-->>Store: OK

    Provider->>Store: loader.commits(prNumber)
    alt Cache hit
        Cache-->>Store: return commits
    else Cache miss
        Store->>DB: SELECT WHERE repository_id & pull_request_number
        DB-->>Store: row
        Store->>Cache: cache and return
    end
    Store-->>Provider: PR data
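The preloadAll-then-read flow in the diagram can be sketched as a minimal in-memory store. The row shape and the fetchRows source below are illustrative stand-ins for the real tenant-DB queries:

```typescript
// Minimal sketch of the store's preloadAll + loader pattern, assuming an
// async row source in place of the real Kysely SELECTs. All names here
// are illustrative.

interface RawRow {
  pullRequestNumber: number
  commits: string[] // simplified payload
}

const createStore = (fetchRows: () => Promise<RawRow[]>) => {
  let preloaded: Map<number, RawRow> | null = null

  // preloadAll: one bulk SELECT, cached in memory for the analyze phase.
  const preloadAll = async () => {
    const rows = await fetchRows()
    preloaded = new Map(rows.map((r) => [r.pullRequestNumber, r]))
  }

  // loader: cache hit after preloadAll; the real store falls back to a
  // per-row SELECT on a miss (omitted here).
  const commits = (prNumber: number): string[] =>
    preloaded?.get(prNumber)?.commits ?? []

  return { preloadAll, commits }
}

// Usage
const demo = async () => {
  const store = createStore(async () => [
    { pullRequestNumber: 130, commits: ['abc123'] },
  ])
  await store.preloadAll()
  console.log(store.commits(130)) // [ 'abc123' ]
}
demo()
```

Bulk-loading once and serving reads from a Map avoids one SELECT per PR during analysis.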

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 From files to tables, down the rabbit lane,
No more JSON crumbs in a folder plain.
Preload the cache, let SQL hum,
PRs and tags in rows now come.
A hopping cheer for schema new — hooray for migration fun!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 22.22%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the primary objective of the changeset: migrating batch intermediate data storage from JSON files to the tenant database.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97715c5 and 74c2a15.

⛔ Files ignored due to path filters (1)
  • db/migrations/tenant/atlas.sum is excluded by !**/*.sum
📒 Files selected for processing (12)
  • app/services/tenant-db.server.ts
  • app/services/tenant-type.ts
  • batch/helper/index.ts
  • batch/helper/json.ts
  • batch/helper/path-builder.test.ts
  • batch/helper/path-builder.ts
  • batch/helper/path.ts
  • batch/provider/github/provider.ts
  • batch/provider/github/store.test.ts
  • batch/provider/github/store.ts
  • db/migrations/tenant/20260228064258.sql
  • db/tenant.sql
💤 Files with no reviewable changes (5)
  • batch/helper/path-builder.test.ts
  • batch/helper/path.ts
  • batch/helper/json.ts
  • batch/helper/path-builder.ts
  • batch/helper/index.ts

Comment on lines +40 to 63
const savePrData = async (
  pr: ShapedGitHubPullRequest,
  data: {
    commits: ShapedGitHubCommit[]
    reviews: ShapedGitHubReview[]
    discussions: ShapedGitHubReviewComment[]
  },
) => {
  const prJson = JSON.stringify(pr)
  const commitsJson = JSON.stringify(data.commits)
  const reviewsJson = JSON.stringify(data.reviews)
  const discussionsJson = JSON.stringify(data.discussions)

  await sql`
    INSERT INTO github_raw_data (repository_id, pull_request_number, pull_request, commits, reviews, discussions)
    VALUES (${repositoryId}, ${pr.number}, ${prJson}, ${commitsJson}, ${reviewsJson}, ${discussionsJson})
    ON CONFLICT (repository_id, pull_request_number) DO UPDATE SET
      pull_request = ${prJson},
      commits = ${commitsJson},
      reviews = ${reviewsJson},
      discussions = ${discussionsJson},
      fetched_at = datetime('now')
  `.execute(db)
}

@coderabbitai coderabbitai bot Feb 28, 2026

⚠️ Potential issue | 🟡 Minor

Keep preload cache consistent after writes.

After preloadAll(), loaders read from preloaded (Line 111). savePrData() updates the DB but not the cache, so reads can return stale data in the same store instance.

🔧 Proposed fix
   const savePrData = async (
     pr: ShapedGitHubPullRequest,
     data: {
       commits: ShapedGitHubCommit[]
       reviews: ShapedGitHubReview[]
       discussions: ShapedGitHubReviewComment[]
     },
   ) => {
@@
     await sql`
@@
     `.execute(db)
+
+    if (preloaded) {
+      preloaded.set(pr.number, {
+        pullRequest: pr,
+        commits: data.commits,
+        reviews: data.reviews,
+        discussions: data.discussions,
+      })
+    }
   }

Also applies to: 96-113


@coji (Owner, Author) commented:

In the actual usage flow, savePrData (fetch phase) runs before preloadAll (analyze phase), and there is no path that reads from preloaded after a save on the same instance, so I concluded no fix is needed.

@coderabbitai coderabbitai bot replied:

@coji, thanks for the explanation.

As you say, checking the actual usage flow:

  1. Fetch phase: savePrData() writes data to the DB
  2. Analyze phase: preloadAll() bulk-loads the latest data from the DB
  3. Read: data is served from the cache

Since execution follows this order, there is no path that reads stale data from the cache after a write. preloadAll() always fetches the latest data from the DB, so cache consistency is maintained.

The changes in provider.ts also confirm this design intent. As you point out, explicit cache invalidation is unnecessary in this case 👍

coji and others added 3 commits February 28, 2026 16:44
Unified the snake_case fields of the ShapedGitHub* models to camelCase, resolving the compatibility issue with CamelCasePlugin. Rewrote store.ts to use getTenantDb + the Kysely query builder and removed getTenantDbRaw.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removed the batch/scripts/ directory and the related scripts from package.json. No longer needed now that intermediate data lives in the tenant DB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Inlined getOrCreateDb into getTenantDb, removing the unnecessary indirection
- Used excluded references in onConflict to eliminate duplicate JSON.stringify calls
- Added a comment about automatic parsing via ParseJSONResultsPlugin
- Changed the SQL string interpolation in store.test.ts to prepared statements
- Fixed remaining snake_case in test names and comments to camelCase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
batch/provider/github/store.ts (1)

26-53: ⚠️ Potential issue | 🟡 Minor

Synchronize preloaded after savePrData writes.

After preloadAll() (Line 94), loadRow() reads from cache first (Line 115-117). savePrData() (Line 26-53) updates DB but not preloaded, so same-instance reads can return stale/missing data.

💡 Suggested fix
   const savePrData = async (
@@
   ) => {
@@
       .execute()
+
+    if (preloaded) {
+      preloaded.set(pr.number, {
+        pullRequest: pr,
+        commits: data.commits,
+        reviews: data.reviews,
+        discussions: data.discussions,
+      })
+    }
   }

   const loadRow = async (number: number) => {
-    if (preloaded) {
-      return preloaded.get(number) ?? null
+    if (preloaded) {
+      const cached = preloaded.get(number)
+      if (cached) return cached
     }
@@
-    return parseRow(row)
+    const parsed = parseRow(row)
+    if (preloaded) preloaded.set(number, parsed)
+    return parsed
   }

Also applies to: 94-117

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@batch/provider/github/release-detect.ts`:
- Line 63: Wrap the dynamic RegExp construction around tagCondition in a
try-catch: when creating tagRegexp from tagCondition (the new
RegExp(tagCondition) instance in release-detect.ts), catch SyntaxError or other
exceptions, log or surface a clear error including the invalid pattern, and fall
back to a safe behavior (e.g., treat as no-match / skip processing this config
or exclude the condition) instead of letting the exception propagate; ensure any
downstream use of tagRegexp checks that it was successfully created before
calling test() or exec() so invalid patterns don't crash batch execution.

In `@batch/provider/github/store.test.ts`:
- Around line 20-21: Add an afterAll hook to unstub environment variables so
vi.stubEnv calls don't leak; specifically, after the existing afterEach block
add: afterAll(() => { vi.unstubAllEnvs(); }); This uses the vi.unstubAllEnvs()
helper to restore env state corresponding to the vi.stubEnv(...) calls and
should be placed in the same test file scope (near the existing afterEach) so
all tests are cleaned up.
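The release-detect guard suggested above amounts to wrapping the dynamic RegExp construction and falling back to "no match" on failure. A hedged sketch, with illustrative names rather than the project's actual code:

```typescript
// Sketch of guarding a dynamically built RegExp so an invalid user-supplied
// tagCondition cannot crash the batch run. Names are illustrative.

const compileTagRegexp = (tagCondition: string): RegExp | null => {
  try {
    return new RegExp(tagCondition)
  } catch (e) {
    // Surface the bad pattern and fall back to "no match".
    console.error(`invalid tagCondition pattern: ${tagCondition}`, e)
    return null
  }
}

const matchesTag = (tagCondition: string, tag: string): boolean => {
  const re = compileTagRegexp(tagCondition)
  return re ? re.test(tag) : false
}

console.log(matchesTag('^v\\d+\\.\\d+\\.\\d+$', 'v1.2.3')) // true
console.log(matchesTag('(', 'v1.2.3')) // false (invalid pattern is logged)
```

Checking for null before calling test() ensures a malformed pattern degrades to skipping the condition instead of throwing mid-batch.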


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74c2a15 and 9729068.

📒 Files selected for processing (14)
  • app/services/tenant-db.server.ts
  • batch/provider/github/aggregator.test.ts
  • batch/provider/github/aggregator.ts
  • batch/provider/github/fetcher.ts
  • batch/provider/github/model.ts
  • batch/provider/github/provider.ts
  • batch/provider/github/pullrequest.ts
  • batch/provider/github/release-detect.ts
  • batch/provider/github/review-response.ts
  • batch/provider/github/store.test.ts
  • batch/provider/github/store.ts
  • batch/scripts/golden-compare.ts
  • batch/scripts/golden-snapshot.ts
  • package.json
💤 Files with no reviewable changes (3)
  • package.json
  • batch/scripts/golden-snapshot.ts
  • batch/scripts/golden-compare.ts

coji and others added 2 commits February 28, 2026 17:00
Set the JSON columns of github_raw_data/github_raw_tags to ColumnType<unknown, string, string> via kysely-codegen overrides. They read as unknown on SELECT and take string on INSERT, reducing the `as unknown as` casts in store.ts to plain `as`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
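The ColumnType<unknown, string, string> override described in the commit above can be illustrated with Kysely's three-slot column type. ColumnType is re-declared locally here so the sketch is self-contained; the real one comes from the kysely package:

```typescript
// Illustration of Kysely's ColumnType<Select, Insert, Update> idea, which the
// kysely-codegen override applies to the JSON columns. This local declaration
// is a stand-in for `import { ColumnType } from 'kysely'`.
type ColumnType<S, I, U = I> = {
  readonly __select?: S
  readonly __insert?: I
  readonly __update?: U
}

// With the override, a JSON column reads as unknown and writes as string:
type JsonColumn = ColumnType<unknown, string, string>

// Runtime counterpart: serialize on write, parse-then-narrow on read.
const serialize = (value: unknown): string => JSON.stringify(value)
const parseColumn = <T>(raw: string): T => JSON.parse(raw) as T

const stored = serialize({ number: 130, createdAt: '2026-02-28' })
const row = parseColumn<{ number: number; createdAt: string }>(stored)
console.log(row.number) // 130
```

Splitting the read and write types this way lets INSERTs demand pre-serialized strings while SELECT results stay unknown until explicitly narrowed.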
- release-detect.ts: added a try-catch so the RegExp constructor does not throw on an invalid tagCondition
- store.test.ts: added afterAll(unstubAllEnvs) to prevent vi.stubEnv leaks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coji coji merged commit 8a22a13 into main Feb 28, 2026
6 checks passed
@coji coji deleted the feat/batch-json-to-db branch February 28, 2026 08:05
