#15-create git is&pr preprocessor, update pinecone sync logic by jonathanMLDev · Pull Request #118 · cppalliance/boost-data-collector

jonathanMLDev · 2026-03-17T23:15:47Z

Summary by CodeRabbit

New Features
- Multiple trackers (Clang, Boost library, Boost mailing list) can now sync to Pinecone for indexing and search.
- New preprocessing pipeline to prepare issues and PRs for Pinecone ingestion.
- CLI options to trigger Pinecone syncs and view progress.
Configuration
- Expanded environment configuration for Pinecone (API, index, batching, chunking, embedding models, namespaces).
- Added optional Celery and Slack configuration placeholders.

coderabbitai · 2026-03-17T23:15:55Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d610be53-f4eb-45ae-9018-7695b038082b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


📝 Walkthrough

Adds Pinecone indexing across multiple trackers: new settings, preprocessors for issues/PRs, core GitHub preprocessing, and management-command hooks to run Pinecone syncs (with CLI options and logging) for Clang, Boost Library, and Boost Mailing List trackers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration & Env: `./.env.example`, `config/settings.py` | Added broad GitHub shared tokens block, many Pinecone settings (API keys, index, env, batch/chunking, embedding models, per-app app_type/namespace), Celery and Slack placeholders. |
| Core GitHub Preprocessing: `github_activity_tracker/preprocessors/github_preprocess.py` | New multi-repo preprocessing pipeline: iterators over raw JSON, timestamp parsing, document builders, deduplication, and public preprocess_issues/prs and preprocess_all_issues/prs functions returning Pinecone-ready documents. |
| Clang Tracker: `clang_github_tracker/management/commands/run_clang_github_tracker.py`, `clang_github_tracker/preprocessors/...` | Added CLI flags for pinecone app_type/namespace, new _run_pinecone_sync helper; wired issue/pr preprocessors that delegate to core preprocessing; calls run_cppa_pinecone_sync. |
| Boost Library Tracker: `boost_library_tracker/management/commands/run_boost_library_tracker.py`, `boost_library_tracker/preprocessors/...` | Added pinecone_sync task and public task_pinecone_sync function, CLI options for pinecone app_type/namespace, and preprocessors for boostorg issues/PRs. |
| Boost Mailing List Tracker: `boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py` | Renamed pinecone contract from app_id→app_type, switched CLI flag to --pinecone-app-type, use settings defaults, and pass preprocessor as dotted path. |

Sequence Diagram

sequenceDiagram
    actor User
    participant Cmd as Tracker Command\n(e.g., run_clang_github_tracker)
    participant Sync as Pinecone Sync\n(run_cppa_pinecone_sync)
    participant Pre as Tracker Preprocessor\n(issue/pr_preprocessor)
    participant Core as Core GitHub\nPreprocessing
    participant Raw as Raw JSON Files\n(workspace/raw)
    participant Pine as Pinecone API

    User->>Cmd: run command (--pinecone-app-type)
    Cmd->>Cmd: fetch/sync raw GitHub data
    Cmd->>Sync: _run_pinecone_sync(app_type, namespace, preprocessor_path)
    Sync->>Pre: call preprocess_for_pinecone(failed_ids, final_sync_at)
    Pre->>Core: delegate to preprocess_* (owner[/repo])
    Core->>Raw: iterate JSON files
    Raw-->>Core: return issue/PR JSON
    Core->>Core: build documents, parse timestamps, dedupe
    Core-->>Pre: return documents, chunked_flag
    Pre-->>Sync: return documents
    Sync->>Pine: upsert documents (namespace=app_type)
    Pine-->>Sync: ack
    Sync-->>Cmd: complete
    Cmd-->>User: report success
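
The delegation step in the diagram (tracker preprocessor calling the core pipeline) can be sketched roughly as follows. This is an illustration only: `core_preprocess_issues`, the `RAW_ISSUES` sample data, the `llvm` owner, the document shape, and the assumed meaning of `failed_ids` (IDs to retry) and `final_sync_at` (incremental cursor) are all hypothetical and may not match the PR's actual code.

```python
from datetime import datetime, timezone
from typing import Any, Optional

Document = dict[str, Any]

# Stand-in for the raw JSON files under workspace/raw (hypothetical sample data).
RAW_ISSUES = [
    {"repo": "llvm-project", "number": 1, "title": "Crash in parser",
     "updated_at": "2026-03-10T00:00:00+00:00"},
    {"repo": "llvm-project", "number": 2, "title": "Wrong diagnostic",
     "updated_at": "2026-03-17T00:00:00+00:00"},
    {"repo": "llvm-project", "number": 2, "title": "Wrong diagnostic",
     "updated_at": "2026-03-17T00:00:00+00:00"},  # duplicate on purpose
]


def core_preprocess_issues(
    owner: str, failed_ids: list[str], final_sync_at: Optional[datetime]
) -> tuple[list[Document], bool]:
    """Hypothetical core step: parse timestamps, filter, build documents, dedupe."""
    seen: set[str] = set()
    documents: list[Document] = []
    for raw in RAW_ISSUES:
        doc_id = f"{owner}/{raw['repo']}#{raw['number']}"
        updated_at = datetime.fromisoformat(raw["updated_at"])
        is_retry = doc_id in failed_ids
        is_new = final_sync_at is None or updated_at > final_sync_at
        if doc_id in seen or not (is_retry or is_new):
            continue
        seen.add(doc_id)
        documents.append(
            {"id": doc_id, "text": raw["title"], "metadata": {"repo_name": raw["repo"]}}
        )
    # The second value mirrors the diagram's chunked_flag; False is a placeholder.
    return documents, False


def preprocess_for_pinecone(
    failed_ids: list[str], final_sync_at: Optional[datetime]
) -> tuple[list[Document], bool]:
    """Tracker-level entry point: fixes the owner and delegates to the core step."""
    return core_preprocess_issues("llvm", failed_ids, final_sync_at)


cursor = datetime(2026, 3, 15, tzinfo=timezone.utc)
documents, chunked = preprocess_for_pinecone(failed_ids=[], final_sync_at=cursor)
print([d["id"] for d in documents])
```

Keeping the tracker-level function this thin means each tracker only contributes its owner and settings, while filtering and deduplication live in one place.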

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

#29-add cppa-pinecone-sync app #74: Introduced the cppa_pinecone_sync app/management command that this PR now calls and integrates with.
Add clang_github_tracker app in boost-data-collector project #82 #84: Extended clang_github_tracker with Pinecone preprocessors and sync wiring; closely related to the clang changes here.
Add boost-library-tracker app in boost-data-collector project #54 #67: Prior changes to boost_library_tracker that this PR augments with Pinecone sync and preprocessors.

Suggested labels

enhancement

Suggested reviewers

snowfox1003

Poem

🐰 Hops of code and crumbs of light,

Docs and issues tidy bright,
Pinecone hums, indexes bloom,
Preprocessors clear the room,
A rabbit cheers — syncs take flight!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '#15-create git is&pr preprocessor, update pinecone sync logic' directly addresses the main changes: creating GitHub issue and PR preprocessors and updating Pinecone synchronization logic, which are the core objectives of this changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.21% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

clang_github_tracker/preprocessors/pr_preprocessor.py (1)

23-24: Same recommendation: use Django settings for APP_TYPE.

For consistency with centralized settings:

♻️ Proposed fix

-NAMESPACE = "github-clang"
-APP_TYPE = os.getenv("CLANG_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-clang"
+APP_TYPE = settings.CLANG_GITHUB_PINECONE_APP_TYPE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/preprocessors/pr_preprocessor.py` around lines 23 - 24,
Replace the direct env-var fallback for APP_TYPE with a Django settings-backed
value: instead of assigning APP_TYPE =
os.getenv("CLANG_GITHUB_PINECONE_APP_TYPE", NAMESPACE), read APP_TYPE from
django.conf.settings (e.g., settings.CLANG_GITHUB_PINECONE_APP_TYPE) with
NAMESPACE as the default; update imports to include from django.conf import
settings and ensure NAMESPACE remains the default constant used when the setting
is missing or empty.
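
A minimal sketch of the settings-backed lookup with `NAMESPACE` as the default, as the prompt above describes. A `SimpleNamespace` stands in for `django.conf.settings` so the snippet runs without Django, and `resolve_app_type` is a hypothetical helper name, not something taken from the PR.

```python
from types import SimpleNamespace

NAMESPACE = "github-clang"


def resolve_app_type(settings, default: str = NAMESPACE) -> str:
    """Read the app type from a settings object, falling back to the default.

    getattr covers a setting that is not defined at all; `or` covers a setting
    that is defined but empty ("" or None).
    """
    value = getattr(settings, "CLANG_GITHUB_PINECONE_APP_TYPE", None)
    return value or default


# Stand-ins for django.conf.settings in three states (illustrative values).
configured = SimpleNamespace(CLANG_GITHUB_PINECONE_APP_TYPE="clang-custom")
empty = SimpleNamespace(CLANG_GITHUB_PINECONE_APP_TYPE="")
missing = SimpleNamespace()

print(resolve_app_type(configured))
print(resolve_app_type(empty))
print(resolve_app_type(missing))
```

In a real Django module the same expression would read from `from django.conf import settings`; if `config/settings.py` already normalizes the value, the `or default` part may be unnecessary.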

boost_library_tracker/preprocessors/pr_preprocessor.py (1)

26-27: Same recommendation: use Django settings for APP_TYPE.

For consistency with the sibling issue_preprocessor.py fix and the centralized settings in config/settings.py:

♻️ Proposed fix

-NAMESPACE = "github-boostorg"
-APP_TYPE = os.getenv("BOOST_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-boostorg"
+APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/preprocessors/pr_preprocessor.py` around lines 26 - 27,
Replace the environment-variable fallback for APP_TYPE in pr_preprocessor.py
with the centralized Django setting used elsewhere: import and read the value
from config.settings (the same approach applied in issue_preprocessor.py)
instead of calling os.getenv; update references to NAMESPACE and APP_TYPE so
APP_TYPE defaults to NAMESPACE when the settings value is absent, and ensure the
module imports settings at top and uses settings.BOOST_GITHUB_PINECONE_APP_TYPE
(or the existing setting name in config/settings.py) to keep behavior
consistent.

clang_github_tracker/management/commands/run_clang_github_tracker.py (1)

170-186: Redundant fallback: effective_app_type and effective_namespace duplicate logic already in lines 103-108.

The pinecone_app_type and pinecone_namespace variables are already guaranteed to have values from settings fallback at lines 103-108. The additional or settings.* checks are unnecessary.

♻️ Proposed simplification

-        # Phase: upsert issues and PRs to Pinecone
-        effective_app_type = (
-            pinecone_app_type or settings.CLANG_GITHUB_PINECONE_APP_TYPE
-        )
-        effective_namespace = (
-            pinecone_namespace or settings.CLANG_GITHUB_PINECONE_NAMESPACE
-        )
         _run_pinecone_sync(
-            effective_app_type,
-            effective_namespace,
+            pinecone_app_type,
+            pinecone_namespace,
             "clang_github_tracker.preprocessors.issue_preprocessor.preprocess_for_pinecone",
         )
         _run_pinecone_sync(
-            effective_app_type,
-            effective_namespace,
+            pinecone_app_type,
+            pinecone_namespace,
             "clang_github_tracker.preprocessors.pr_preprocessor.preprocess_for_pinecone",
         )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around
lines 170 - 186, The variables effective_app_type and effective_namespace are
redundantly reapplying the settings fallback already handled earlier; simplify
by removing those local fallbacks and pass the existing pinecone_app_type and
pinecone_namespace directly to _run_pinecone_sync. Locate the block that sets
effective_app_type/effective_namespace and the two _run_pinecone_sync calls in
run_clang_github_tracker.py and replace uses of
effective_app_type/effective_namespace with pinecone_app_type/pinecone_namespace
(or remove the intermediate variables entirely) so the earlier fallback logic is
the single source of truth.
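
One way to keep the fallback in a single place is to resolve the options once and pass the results to every sync call. The sketch below assumes the option keys `pinecone_app_type` and `pinecone_namespace` and uses placeholder settings values; `resolve_pinecone_options` and `planned_syncs` are hypothetical helpers for illustration.

```python
from types import SimpleNamespace

# Stand-in for django.conf.settings; the values are placeholders.
settings = SimpleNamespace(
    CLANG_GITHUB_PINECONE_APP_TYPE="github-clang",
    CLANG_GITHUB_PINECONE_NAMESPACE="github-clang",
)

PREPROCESSORS = (
    "clang_github_tracker.preprocessors.issue_preprocessor.preprocess_for_pinecone",
    "clang_github_tracker.preprocessors.pr_preprocessor.preprocess_for_pinecone",
)


def resolve_pinecone_options(options: dict, settings) -> tuple[str, str]:
    """Apply the CLI-over-settings fallback exactly once."""
    app_type = options.get("pinecone_app_type") or settings.CLANG_GITHUB_PINECONE_APP_TYPE
    namespace = options.get("pinecone_namespace") or settings.CLANG_GITHUB_PINECONE_NAMESPACE
    return app_type, namespace


def planned_syncs(options: dict, settings) -> list[tuple[str, str, str]]:
    """Return the (app_type, namespace, preprocessor) triples that would be synced."""
    app_type, namespace = resolve_pinecone_options(options, settings)
    # No second fallback here: the resolved values are used as-is.
    return [(app_type, namespace, path) for path in PREPROCESSORS]


for call in planned_syncs({"pinecone_app_type": "clang-test"}, settings):
    print(call)
```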

boost_library_tracker/preprocessors/issue_preprocessor.py (1)

28-29: Consider using Django settings for APP_TYPE instead of os.getenv for consistency.

The code reads APP_TYPE directly from os.getenv, but config/settings.py already defines BOOST_GITHUB_PINECONE_APP_TYPE with proper normalization and fallback logic. Using the settings ensures consistent behavior across the codebase.

♻️ Proposed fix

-NAMESPACE = "github-boostorg"
-APP_TYPE = os.getenv("BOOST_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-boostorg"
+APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE

You can also remove the `import os` line if it is no longer needed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/preprocessors/issue_preprocessor.py` around lines 28 -
29, Replace the direct os.getenv usage for APP_TYPE with the Django setting that
centralizes normalization/fallback: import settings from django.conf and set
APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE (keeping the existing
NAMESPACE constant), and remove the unused import os if it becomes redundant;
refer to the NAMESPACE and APP_TYPE symbols and the config/settings.py
definition when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@github_activity_tracker/preprocessors/github_preprocess.py`:
- Around line 181-222: build_pr_document currently omits the repository owner
from the returned metadata, the same gap as in build_issue_document; update the
function signature to accept an owner parameter (e.g., add owner: str to
build_pr_document) and include "owner": owner in the metadata dict it returns so
the PR metadata mirrors the issue metadata (locate build_pr_document, find its
returned metadata block, and add the owner field there).
- Around line 137-178: The metadata currently returned by build_issue_document
is missing the required "owner" field and the function signature doesn't accept
an owner to populate it; update build_issue_document(path, data, repo) to accept
an additional owner: str parameter, add "owner": owner to the metadata dict
(alongside repo_name), and update all call sites that invoke
build_issue_document to pass the repository owner string so the document shape
matches the docstring.
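
A rough sketch of what adding `owner` to the issue document builder might look like. Only the `path`, `data`, `repo`, and `owner` parameters and the `owner`/`repo_name` metadata keys come from the comments above; the `id` format, the `text` field, and the other metadata keys are assumptions made for illustration.

```python
from pathlib import Path
from typing import Any


def build_issue_document(
    path: Path, data: dict[str, Any], repo: str, owner: str
) -> dict[str, Any]:
    """Build a Pinecone-ready document whose metadata carries the repository owner."""
    number = data.get("number")
    title = data.get("title") or ""
    body = data.get("body") or ""
    return {
        "id": f"{owner}/{repo}#issue-{number}",  # hypothetical ID scheme
        "text": f"{title}\n\n{body}".strip(),
        "metadata": {
            "owner": owner,          # the field the review asks for
            "repo_name": repo,
            "number": number,
            "source_file": str(path),  # hypothetical extra field
        },
    }


doc = build_issue_document(
    Path("workspace/raw/issue_42.json"),  # illustrative path
    {"number": 42, "title": "Crash on empty input", "body": None},
    repo="llvm-project",
    owner="llvm",
)
print(doc["metadata"])
```

Every call site would then pass the owner explicitly, and `build_pr_document` could take the same parameter so that issue and PR metadata stay aligned.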


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f824617a-64c6-4215-b7ba-2b307505d570

📥 Commits

Reviewing files that changed from the base of the PR and between f354dba and 291ad2a.

📒 Files selected for processing (13)

.env.example
boost_library_tracker/management/commands/run_boost_library_tracker.py
boost_library_tracker/preprocessors/__init__.py
boost_library_tracker/preprocessors/issue_preprocessor.py
boost_library_tracker/preprocessors/pr_preprocessor.py
boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py
clang_github_tracker/management/commands/run_clang_github_tracker.py
clang_github_tracker/preprocessors/__init__.py
clang_github_tracker/preprocessors/issue_preprocessor.py
clang_github_tracker/preprocessors/pr_preprocessor.py
config/settings.py
github_activity_tracker/preprocessors/__init__.py
github_activity_tracker/preprocessors/github_preprocess.py

jonathanMLDev · 2026-03-18T01:00:02Z

@coderabbitai continue to review

coderabbitai · 2026-03-18T01:00:18Z

@jonathanMLDev: Sure! I'll continue reviewing the PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

jonathanMLDev · 2026-03-18T01:24:31Z

@coderabbitai full review

coderabbitai · 2026-03-18T01:24:38Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

boost_library_tracker/management/commands/run_boost_library_tracker.py (1)
245-275: Consider centralizing _run_pinecone_sync into a shared utility.

The helper is now duplicated across tracker commands with near-identical behavior. Moving it to a common module will reduce drift and make future error-handling changes safer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/management/commands/run_boost_library_tracker.py`
around lines 245 - 275, Extract the duplicated helper _run_pinecone_sync into a
single shared utility module (e.g., a new utils or common module) and replace
the copies in each tracker command with imports from that module; keep the same
signature (app_type: str, namespace: str, preprocessor_dotted_path: str), same
calls to call_command("run_cppa_pinecone_sync", ...), and same logging/error
handling, then update all files that contained the duplicate (including
run_boost_library_tracker's usage) to import the shared _run_pinecone_sync and
remove the local definitions so future changes to error handling or behavior are
centralized.
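
A shared helper along these lines could replace the per-command copies. This is a sketch under assumptions: the keyword names passed to `run_cppa_pinecone_sync` (`app_type`, `namespace`, `preprocessor`), the boolean return value, and the log-and-continue error handling are guesses, and the keyword-only `runner` parameter is added here purely so the snippet can run without Django.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_pinecone_sync(
    app_type: str,
    namespace: str,
    preprocessor_dotted_path: str,
    *,
    runner: Optional[Callable[..., object]] = None,
) -> bool:
    """Invoke the Pinecone sync command once; return True on success."""
    if runner is None:
        # Imported lazily so the module can be loaded without Django installed.
        from django.core.management import call_command

        runner = call_command
    try:
        runner(
            "run_cppa_pinecone_sync",
            app_type=app_type,                      # assumed option name
            namespace=namespace,                    # assumed option name
            preprocessor=preprocessor_dotted_path,  # assumed option name
        )
    except Exception:
        logger.exception("Pinecone sync failed for app_type=%s", app_type)
        return False
    return True


# Usage with a fake runner standing in for call_command.
recorded = []


def fake_runner(name, **kwargs):
    recorded.append((name, kwargs))


def failing_runner(name, **kwargs):
    raise RuntimeError("simulated failure")


ok = run_pinecone_sync("github-boostorg", "github-boostorg", "pkg.mod.func", runner=fake_runner)
failed = run_pinecone_sync("github-boostorg", "github-boostorg", "pkg.mod.func", runner=failing_runner)
print(ok, failed)
```

Each tracker command would import this one function, so a later change to error handling happens in a single module.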

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py`:
- Around line 338-341: The early return when the API fetch yields no emails
prevents the final Pinecone indexing step (_run_pinecone_sync) from running;
instead of returning immediately when the fetch is empty, change the control
flow so you skip the per-email workspace processing branch but continue to the
Pinecone phase. Concretely: remove or replace the early "return" after the
empty-fetch check with logic that sets a flag (e.g., has_indexable_items) if any
items were persisted during workspace processing, or simply proceed to call
_run_pinecone_sync unconditionally; ensure
_run_pinecone_sync(app_type=pinecone_app_type, namespace=pinecone_namespace) is
reachable at the end of the command even when the API returned zero emails.
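
The control-flow change can be illustrated with a stripped-down command body. The three callables are hypothetical stand-ins for the fetch step, the per-email workspace processing, and `_run_pinecone_sync`; the point is only that an empty fetch skips the processing loop without returning before the Pinecone phase.

```python
from typing import Callable, Iterable


def run_tracker(
    fetch_emails: Callable[[], Iterable[dict]],
    process_email: Callable[[dict], None],
    run_pinecone_sync: Callable[[], None],
) -> int:
    """Process fetched emails, then always run the Pinecone phase."""
    emails = list(fetch_emails())
    if emails:
        for email in emails:
            process_email(email)
    # No early return on an empty fetch: items persisted by earlier runs
    # may still be waiting to be indexed.
    run_pinecone_sync()
    return len(emails)


events = []
processed = run_tracker(
    fetch_emails=lambda: [],
    process_email=lambda email: events.append(("process", email["id"])),
    run_pinecone_sync=lambda: events.append(("pinecone", None)),
)
print(processed, events)
```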

In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 171-187: The two calls to _run_pinecone_sync are using the same
effective_app_type (variable effective_app_type), which causes a shared
PineconeSyncStatus cursor and makes PRs potentially skip updates after issues
advance the cursor; fix by invoking _run_pinecone_sync with distinct app_type
keys for issues and PRs (e.g., derive two values like "github-clang-issues" and
"github-clang-prs" instead of effective_app_type) or alternatively merge the two
preprocessors into a single _run_pinecone_sync invocation that returns both
issue and PR documents; update any place that reads/writes the sync cursor
(e.g., update_sync_status / PineconeSyncStatus usage) to use the corresponding
distinct app_type so each stream maintains its own final_sync_at.
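
The shared-cursor problem can be reproduced with a small simulation. Here a plain dict keyed by app_type stands in for `PineconeSyncStatus`, and `run_sync` is a hypothetical simplification that selects documents updated after the stored cursor and then advances it; the real sync logic may differ.

```python
from datetime import datetime, timezone
from typing import Optional


def utc(day: int, hour: int) -> datetime:
    return datetime(2026, 3, day, hour, tzinfo=timezone.utc)


def run_sync(cursors: dict, app_type: str, documents: list[dict], now: datetime) -> list[str]:
    """Select documents newer than the cursor for app_type, then advance the cursor."""
    since: Optional[datetime] = cursors.get(app_type)
    selected = [d["id"] for d in documents if since is None or d["updated_at"] > since]
    cursors[app_type] = now  # plays the role of final_sync_at
    return selected


issues = [{"id": "issue-1", "updated_at": utc(18, 9)}]
prs = [{"id": "pr-7", "updated_at": utc(18, 9)}]

# Shared key: the issue sync advances the cursor past the PR's update time.
shared: dict = {}
run_sync(shared, "github-clang", issues, now=utc(18, 10))
shared_prs = run_sync(shared, "github-clang", prs, now=utc(18, 11))

# Distinct keys: each stream keeps its own cursor.
separate: dict = {}
run_sync(separate, "github-clang-issues", issues, now=utc(18, 10))
separate_prs = run_sync(separate, "github-clang-prs", prs, now=utc(18, 11))

print(shared_prs, separate_prs)
```

With the shared key the PR is skipped because its update time falls before the cursor written by the issue sync; with distinct keys it is picked up.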


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d362c25c-4c66-4e2e-874b-3b80adbb7784

📥 Commits

Reviewing files that changed from the base of the PR and between f354dba and 91c0ccf.

cppalliance#15-create git is&pr preprocessor, update pinecone sync logic

82c458d

jonathanMLDev self-assigned this Mar 17, 2026

jonathanMLDev mentioned this pull request Mar 17, 2026

Github Issue&PR Preprocessing for Pinecone RAG Database #15

Closed

cppalliance#15-fixed lint error

291ad2a

coderabbitai Bot reviewed Mar 18, 2026

View reviewed changes

Comment thread github_activity_tracker/preprocessors/github_preprocess.py

Comment thread github_activity_tracker/preprocessors/github_preprocess.py

cppalliance#15-fix docstring and update issue metadata(add "state_rea…

91c0ccf

…son")

coderabbitai Bot reviewed Mar 18, 2026

View reviewed changes

Comment thread boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py

Comment thread clang_github_tracker/management/commands/run_clang_github_tracker.py

zho and others added 2 commits March 19, 2026 00:37

cppalliance#15-added a logic that use file modified datetime

5c431e2

Merge branch 'develop' into dev-15-preprocessor-for-github-issue-pr

b4a1455

jonathanMLDev requested a review from snowfox1003 March 18, 2026 16:43

snowfox1003 approved these changes Mar 18, 2026

View reviewed changes

snowfox1003 merged commit 0ad6144 into cppalliance:develop Mar 18, 2026
3 checks passed

coderabbitai Bot mentioned this pull request Mar 26, 2026

#126-fixed this app and cppa-pinecone app #128

Merged
