
#15-create git is&pr preprocessor, update pinecone sync logic #118

Merged
snowfox1003 merged 5 commits into cppalliance:develop from jonathanMLDev:dev-15-preprocessor-for-github-issue-pr
Mar 18, 2026

Conversation

@jonathanMLDev
Collaborator

@jonathanMLDev jonathanMLDev commented Mar 17, 2026

Summary by CodeRabbit

  • New Features

    • Multiple trackers (Clang, Boost library, Boost mailing list) can now sync to Pinecone for indexing and search.
    • New preprocessing pipeline to prepare issues and PRs for Pinecone ingestion.
    • CLI options to trigger Pinecone syncs and view progress.
  • Configuration

    • Expanded environment configuration for Pinecone (API, index, batching, chunking, embedding models, namespaces).
    • Added optional Celery and Slack configuration placeholders.

@jonathanMLDev jonathanMLDev self-assigned this Mar 17, 2026
@coderabbitai

coderabbitai Bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d610be53-f4eb-45ae-9018-7695b038082b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds Pinecone indexing across multiple trackers: new settings, preprocessors for issues/PRs, core GitHub preprocessing, and management-command hooks to run Pinecone syncs (with CLI options and logging) for Clang, Boost Library, and Boost Mailing List trackers.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Configuration & Env | ./.env.example, config/settings.py | Added broad GitHub shared tokens block, many Pinecone settings (API keys, index, env, batch/chunking, embedding models, per-app app_type/namespace), Celery and Slack placeholders. |
| Core GitHub Preprocessing | github_activity_tracker/preprocessors/github_preprocess.py | New multi-repo preprocessing pipeline: iterators over raw JSON, timestamp parsing, document builders, deduplication, and public preprocess_issues/prs and preprocess_all_issues/prs functions returning Pinecone-ready documents. |
| Clang Tracker | clang_github_tracker/management/commands/run_clang_github_tracker.py, clang_github_tracker/preprocessors/... | Added CLI flags for pinecone app_type/namespace, new _run_pinecone_sync helper; wired issue/pr preprocessors that delegate to core preprocessing; calls run_cppa_pinecone_sync. |
| Boost Library Tracker | boost_library_tracker/management/commands/run_boost_library_tracker.py, boost_library_tracker/preprocessors/... | Added pinecone_sync task and public task_pinecone_sync function, CLI options for pinecone app_type/namespace, and preprocessors for boostorg issues/PRs. |
| Boost Mailing List Tracker | boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py | Renamed pinecone contract from app_id→app_type, switched CLI flag to --pinecone-app-type, use settings defaults, and pass preprocessor as dotted path. |
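
The Core GitHub Preprocessing entry above names four steps: iterating raw JSON, parsing timestamps, building documents, and deduplicating. The sketch below shows one way those steps could fit together; it is not the code in github_preprocess.py. The function names, the `number`/`title`/`body`/`updated_at` input fields, and the document layout are all assumptions made for illustration.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional, Tuple


def iter_raw_json(root: Path) -> Iterator[Tuple[Path, Dict[str, Any]]]:
    """Yield (path, payload) for each JSON object file under root."""
    for path in sorted(root.rglob("*.json")):
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed files
        if isinstance(payload, dict):
            yield path, payload


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string such as '2026-03-17T10:00:00Z' into aware UTC."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def preprocess_issues(
    root: Path, repo: str, since: Optional[datetime] = None
) -> List[Dict[str, Any]]:
    """Build one document per issue, keeping only the most recent copy.

    `since` must be timezone-aware; items updated at or before it are skipped.
    """
    latest: Dict[str, Tuple[datetime, Dict[str, Any]]] = {}
    for _path, data in iter_raw_json(root):
        number = data.get("number")
        updated = parse_timestamp(data.get("updated_at"))
        if number is None or updated is None:
            continue
        if since is not None and updated <= since:
            continue
        doc_id = f"{repo}#{number}"  # hypothetical ID scheme
        title = data.get("title") or ""
        body = data.get("body") or ""
        document = {
            "id": doc_id,
            "content": f"{title}\n\n{body}".strip(),
            "metadata": {
                "repo_name": repo,
                "number": number,
                "updated_at": updated.isoformat(),
            },
        }
        # Deduplicate: a later updated_at replaces an earlier copy.
        if doc_id not in latest or updated > latest[doc_id][0]:
            latest[doc_id] = (updated, document)
    return [doc for _, doc in sorted(latest.values(), key=lambda pair: pair[0])]
```

Keying the deduplication on repository plus number rather than on number alone keeps documents from different repositories apart when a single run covers several of them.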

Sequence Diagram

sequenceDiagram
    actor User
    participant Cmd as Tracker Command\n(e.g., run_clang_github_tracker)
    participant Sync as Pinecone Sync\n(run_cppa_pinecone_sync)
    participant Pre as Tracker Preprocessor\n(issue/pr_preprocessor)
    participant Core as Core GitHub\nPreprocessing
    participant Raw as Raw JSON Files\n(workspace/raw)
    participant Pine as Pinecone API

    User->>Cmd: run command (--pinecone-app-type)
    Cmd->>Cmd: fetch/sync raw GitHub data
    Cmd->>Sync: _run_pinecone_sync(app_type, namespace, preprocessor_path)
    Sync->>Pre: call preprocess_for_pinecone(failed_ids, final_sync_at)
    Pre->>Core: delegate to preprocess_* (owner[/repo])
    Core->>Raw: iterate JSON files
    Raw-->>Core: return issue/PR JSON
    Core->>Core: build documents, parse timestamps, dedupe
    Core-->>Pre: return documents, chunked_flag
    Pre-->>Sync: return documents
    Sync->>Pine: upsert documents (namespace=app_type)
    Pine-->>Sync: ack
    Sync-->>Cmd: complete
    Cmd-->>User: report success
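
The diagram above has the sync step receive a preprocessor as a dotted path and call it with `failed_ids` and `final_sync_at`, getting back documents plus a chunked flag. A rough sketch of that hand-off follows. `load_callable` and `sync_once` are hypothetical names, and the `upsert` callback stands in for the Pinecone call, which is not shown in this PR conversation.

```python
from __future__ import annotations

import importlib
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

Document = Dict[str, Any]
Preprocessor = Callable[
    [List[str], Optional[datetime]], Tuple[List[Document], bool]
]


def load_callable(dotted_path: str) -> Callable[..., Any]:
    """Resolve 'package.module.function' to the function object."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a dotted path: {dotted_path!r}")
    return getattr(importlib.import_module(module_path), attr)


def sync_once(
    preprocess: Preprocessor,
    upsert: Callable[[List[Document]], None],
    failed_ids: List[str],
    final_sync_at: Optional[datetime],
) -> int:
    """Run one preprocess-then-upsert pass and return the document count."""
    documents, _is_chunked = preprocess(failed_ids, final_sync_at)
    if documents:
        upsert(documents)  # assumed to raise on failure
    return len(documents)
```

In this arrangement a command would call something like `sync_once(load_callable(path), upsert, failed_ids, final_sync_at)`, so each tracker only has to supply a path string and the sync code never imports tracker modules directly.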

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • snowfox1003

Poem

🐰 Hops of code and crumbs of light,

Docs and issues tidy bright,
Pinecone hums, indexes bloom,
Preprocessors clear the room,
A rabbit cheers — syncs take flight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '#15-create git is&pr preprocessor, update pinecone sync logic' directly addresses the main changes: creating GitHub issue and PR preprocessors and updating Pinecone synchronization logic, which are the core objectives of this changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.21%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
clang_github_tracker/preprocessors/pr_preprocessor.py (1)

23-24: Same recommendation: use Django settings for APP_TYPE.

For consistency with centralized settings:

♻️ Proposed fix
-NAMESPACE = "github-clang"
-APP_TYPE = os.getenv("CLANG_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-clang"
+APP_TYPE = settings.CLANG_GITHUB_PINECONE_APP_TYPE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/preprocessors/pr_preprocessor.py` around lines 23 - 24,
Replace the direct env-var fallback for APP_TYPE with a Django settings-backed
value: instead of assigning APP_TYPE =
os.getenv("CLANG_GITHUB_PINECONE_APP_TYPE", NAMESPACE), read APP_TYPE from
django.conf.settings (e.g., settings.CLANG_GITHUB_PINECONE_APP_TYPE) with
NAMESPACE as the default; update imports to include from django.conf import
settings and ensure NAMESPACE remains the default constant used when the setting
is missing or empty.
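
The recommendation above amounts to reading the value from settings and falling back to NAMESPACE when it is missing or empty. A small sketch of that lookup follows. `resolve_app_type` is a hypothetical helper, and a `SimpleNamespace` stands in for `django.conf.settings` so the snippet runs without a configured Django project.

```python
from types import SimpleNamespace
from typing import Any

NAMESPACE = "github-clang"


def resolve_app_type(settings: Any, setting_name: str, default: str) -> str:
    """Return the named setting, or default when it is absent or blank."""
    value = getattr(settings, setting_name, None)
    if isinstance(value, str) and value.strip():
        return value.strip()
    return default


# Stand-in for `from django.conf import settings` (assumption for the demo).
settings = SimpleNamespace(CLANG_GITHUB_PINECONE_APP_TYPE="  clang-prod ")

APP_TYPE = resolve_app_type(settings, "CLANG_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
```

If config/settings.py already normalizes the value, as a later comment in this review says it does for the Boost setting, the plain attribute access from the proposed fix may be enough and the helper is unnecessary.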
boost_library_tracker/preprocessors/pr_preprocessor.py (1)

26-27: Same recommendation: use Django settings for APP_TYPE.

For consistency with the sibling issue_preprocessor.py fix and the centralized settings in config/settings.py:

♻️ Proposed fix
-NAMESPACE = "github-boostorg"
-APP_TYPE = os.getenv("BOOST_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-boostorg"
+APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/preprocessors/pr_preprocessor.py` around lines 26 - 27,
Replace the environment-variable fallback for APP_TYPE in pr_preprocessor.py
with the centralized Django setting used elsewhere: import and read the value
from config.settings (the same approach applied in issue_preprocessor.py)
instead of calling os.getenv; update references to NAMESPACE and APP_TYPE so
APP_TYPE defaults to NAMESPACE when the settings value is absent, and ensure the
module imports settings at top and uses settings.BOOST_GITHUB_PINECONE_APP_TYPE
(or the existing setting name in config/settings.py) to keep behavior
consistent.
clang_github_tracker/management/commands/run_clang_github_tracker.py (1)

170-186: Redundant fallback: effective_app_type and effective_namespace duplicate logic already in lines 103-108.

The pinecone_app_type and pinecone_namespace variables are already guaranteed to have values from settings fallback at lines 103-108. The additional or settings.* checks are unnecessary.

♻️ Proposed simplification
-        # Phase: upsert issues and PRs to Pinecone
-        effective_app_type = (
-            pinecone_app_type or settings.CLANG_GITHUB_PINECONE_APP_TYPE
-        )
-        effective_namespace = (
-            pinecone_namespace or settings.CLANG_GITHUB_PINECONE_NAMESPACE
-        )
         _run_pinecone_sync(
-            effective_app_type,
-            effective_namespace,
+            pinecone_app_type,
+            pinecone_namespace,
             "clang_github_tracker.preprocessors.issue_preprocessor.preprocess_for_pinecone",
         )
         _run_pinecone_sync(
-            effective_app_type,
-            effective_namespace,
+            pinecone_app_type,
+            pinecone_namespace,
             "clang_github_tracker.preprocessors.pr_preprocessor.preprocess_for_pinecone",
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around
lines 170 - 186, The variables effective_app_type and effective_namespace are
redundantly reapplying the settings fallback already handled earlier; simplify
by removing those local fallbacks and pass the existing pinecone_app_type and
pinecone_namespace directly to _run_pinecone_sync. Locate the block that sets
effective_app_type/effective_namespace and the two _run_pinecone_sync calls in
run_clang_github_tracker.py and replace uses of
effective_app_type/effective_namespace with pinecone_app_type/pinecone_namespace
(or remove the intermediate variables entirely) so the earlier fallback logic is
the single source of truth.
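
To illustrate the single-source-of-truth idea, the sketch below resolves the CLI-or-settings fallback once and hands the results to every later step. It uses plain `argparse` rather than a Django management command, and `resolve_pinecone_options` is a hypothetical name. The `--pinecone-app-type` flag appears in this PR's summary; `--pinecone-namespace` is assumed by analogy.

```python
import argparse
from types import SimpleNamespace
from typing import Any, List, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_clang_github_tracker")
    parser.add_argument("--pinecone-app-type", default=None)
    parser.add_argument("--pinecone-namespace", default=None)  # assumed flag name
    return parser


def resolve_pinecone_options(argv: List[str], settings: Any) -> Tuple[str, str]:
    """Apply the settings fallback exactly once, right after parsing."""
    args = build_parser().parse_args(argv)
    app_type = args.pinecone_app_type or settings.CLANG_GITHUB_PINECONE_APP_TYPE
    namespace = args.pinecone_namespace or settings.CLANG_GITHUB_PINECONE_NAMESPACE
    return app_type, namespace


# Stand-in for Django settings (values are illustrative only).
demo_settings = SimpleNamespace(
    CLANG_GITHUB_PINECONE_APP_TYPE="github-clang",
    CLANG_GITHUB_PINECONE_NAMESPACE="github-clang",
)
```

Callers downstream then use the returned pair as-is, with no further `or settings.*` checks.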
boost_library_tracker/preprocessors/issue_preprocessor.py (1)

28-29: Consider using Django settings for APP_TYPE instead of os.getenv for consistency.

The code reads APP_TYPE directly from os.getenv, but config/settings.py already defines BOOST_GITHUB_PINECONE_APP_TYPE with proper normalization and fallback logic. Using the settings ensures consistent behavior across the codebase.

♻️ Proposed fix
-NAMESPACE = "github-boostorg"
-APP_TYPE = os.getenv("BOOST_GITHUB_PINECONE_APP_TYPE", NAMESPACE)
+NAMESPACE = "github-boostorg"
+APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE

You can also remove the import os line if no longer needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/preprocessors/issue_preprocessor.py` around lines 28 -
29, Replace the direct os.getenv usage for APP_TYPE with the Django setting that
centralizes normalization/fallback: import settings from django.conf and set
APP_TYPE = settings.BOOST_GITHUB_PINECONE_APP_TYPE (keeping the existing
NAMESPACE constant), and remove the unused import os if it becomes redundant;
refer to the NAMESPACE and APP_TYPE symbols and the config/settings.py
definition when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@github_activity_tracker/preprocessors/github_preprocess.py`:
- Around line 181-222: build_pr_document omits the repository owner from the
returned metadata, the same gap noted for build_issue_document; update the
function signature to accept an owner parameter (e.g., add owner: str to
build_pr_document) and include "owner": owner in the metadata dict it returns,
so PR metadata carries the same fields as issue metadata (locate
build_pr_document, its returned metadata block, and add the owner field there).
- Around line 137-178: The metadata currently returned by build_issue_document
is missing the required "owner" field and the function signature doesn't accept
an owner to populate it; update build_issue_document(path, data, repo) to accept
an additional owner: str parameter, add "owner": owner to the metadata dict
(alongside repo_name), and update all call sites that invoke
build_issue_document to pass the repository owner string so the document shape
matches the docstring.
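
A sketch of what the two inline comments ask for is shown here for the issue builder; the PR builder would change the same way. Only the `owner` and `repo_name` metadata keys come from the review text. The remaining keys, the `content` layout, and the parameter order are assumptions.

```python
from pathlib import Path
from typing import Any, Dict


def build_issue_document(
    path: Path, data: Dict[str, Any], owner: str, repo: str
) -> Dict[str, Any]:
    """Build a document whose metadata carries both owner and repo_name."""
    title = data.get("title") or ""
    body = data.get("body") or ""
    return {
        "content": f"{title}\n\n{body}".strip(),
        "metadata": {
            "owner": owner,  # the field the review reports as missing
            "repo_name": repo,
            "number": data.get("number"),  # hypothetical field
            "source_path": str(path),  # hypothetical field
        },
    }
```

Every call site would then need to pass the owner string as well, which is the second half of the review comment.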


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f824617a-64c6-4215-b7ba-2b307505d570

📥 Commits

Reviewing files that changed from the base of the PR and between f354dba and 291ad2a.

📒 Files selected for processing (13)
  • .env.example
  • boost_library_tracker/management/commands/run_boost_library_tracker.py
  • boost_library_tracker/preprocessors/__init__.py
  • boost_library_tracker/preprocessors/issue_preprocessor.py
  • boost_library_tracker/preprocessors/pr_preprocessor.py
  • boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py
  • clang_github_tracker/management/commands/run_clang_github_tracker.py
  • clang_github_tracker/preprocessors/__init__.py
  • clang_github_tracker/preprocessors/issue_preprocessor.py
  • clang_github_tracker/preprocessors/pr_preprocessor.py
  • config/settings.py
  • github_activity_tracker/preprocessors/__init__.py
  • github_activity_tracker/preprocessors/github_preprocess.py

Comment thread github_activity_tracker/preprocessors/github_preprocess.py
Comment thread github_activity_tracker/preprocessors/github_preprocess.py
@jonathanMLDev
Collaborator Author

@coderabbitai continue to review

@coderabbitai

coderabbitai Bot commented Mar 18, 2026

@jonathanMLDev: Sure! I'll continue reviewing the PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@jonathanMLDev
Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Mar 18, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
boost_library_tracker/management/commands/run_boost_library_tracker.py (1)

245-275: Consider centralizing _run_pinecone_sync into a shared utility.

The helper is now duplicated across tracker commands with near-identical behavior. Moving it to a common module will reduce drift and make future error-handling changes safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_tracker/management/commands/run_boost_library_tracker.py`
around lines 245 - 275, Extract the duplicated helper _run_pinecone_sync into a
single shared utility module (e.g., a new utils or common module) and replace
the copies in each tracker command with imports from that module; keep the same
signature (app_type: str, namespace: str, preprocessor_dotted_path: str), same
calls to call_command("run_cppa_pinecone_sync", ...), and same logging/error
handling, then update all files that contained the duplicate (including
run_boost_library_tracker's usage) to import the shared _run_pinecone_sync and
remove the local definitions so future changes to error handling or behavior are
centralized.
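
One possible shape for the shared helper is sketched below, keeping the three-argument signature quoted in the comment. The keyword option names passed to `run_cppa_pinecone_sync` and the catch-and-log error handling are assumptions, since the original helper body is not shown here. The extra `call_command` parameter is added only so the function can be exercised without Django.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def run_pinecone_sync(
    app_type: str,
    namespace: str,
    preprocessor_dotted_path: str,
    call_command: Optional[Callable[..., Any]] = None,
) -> bool:
    """Invoke the Pinecone sync command; return True when it succeeds."""
    if call_command is None:
        # Imported lazily so the module can be loaded without Django set up.
        from django.core.management import call_command
    try:
        call_command(
            "run_cppa_pinecone_sync",
            app_type=app_type,  # option names are assumptions
            namespace=namespace,
            preprocessor=preprocessor_dotted_path,
        )
    except Exception:
        logger.exception(
            "Pinecone sync failed (app_type=%s, namespace=%s)", app_type, namespace
        )
        return False
    return True
```

Each tracker command would import this one function, so a change to logging or failure behavior happens in a single place.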
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@boost_mailing_list_tracker/management/commands/run_boost_mailing_list_tracker.py`:
- Around line 338-341: The early return when the API fetch yields no emails
prevents the final Pinecone indexing step (_run_pinecone_sync) from running;
instead of returning immediately when the fetch is empty, change the control
flow so you skip the per-email workspace processing branch but continue to the
Pinecone phase. Concretely: remove or replace the early "return" after the
empty-fetch check with logic that sets a flag (e.g., has_indexable_items) if any
items were persisted during workspace processing, or simply proceed to call
_run_pinecone_sync unconditionally; ensure
_run_pinecone_sync(app_type=pinecone_app_type, namespace=pinecone_namespace) is
reachable at the end of the command even when the API returned zero emails.

In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 171-187: The two calls to _run_pinecone_sync are using the same
effective_app_type (variable effective_app_type), which causes a shared
PineconeSyncStatus cursor and makes PRs potentially skip updates after issues
advance the cursor; fix by invoking _run_pinecone_sync with distinct app_type
keys for issues and PRs (e.g., derive two values like "github-clang-issues" and
"github-clang-prs" instead of effective_app_type) or alternatively merge the two
preprocessors into a single _run_pinecone_sync invocation that returns both
issue and PR documents; update any place that reads/writes the sync cursor
(e.g., update_sync_status / PineconeSyncStatus usage) to use the corresponding
distinct app_type so each stream maintains its own final_sync_at.
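
The shared-cursor problem in the second inline comment can be reproduced with a toy model. `SyncCursorStore` and `sync_stream` below are hypothetical stand-ins for PineconeSyncStatus and the sync command, and they assume the cursor filters out anything not newer than `final_sync_at`. The two per-stream keys are the ones suggested in the comment.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

Document = Dict[str, Any]


class SyncCursorStore:
    """Toy stand-in for PineconeSyncStatus: one timestamp per app_type."""

    def __init__(self) -> None:
        self._cursors: Dict[str, datetime] = {}

    def get(self, app_type: str) -> Optional[datetime]:
        return self._cursors.get(app_type)

    def advance(self, app_type: str, timestamp: datetime) -> None:
        current = self._cursors.get(app_type)
        if current is None or timestamp > current:
            self._cursors[app_type] = timestamp


def sync_stream(
    store: SyncCursorStore, app_type: str, documents: List[Document]
) -> List[Document]:
    """Return documents newer than the cursor, then move the cursor forward."""
    cursor = store.get(app_type)
    fresh = [d for d in documents if cursor is None or d["updated_at"] > cursor]
    if fresh:
        store.advance(app_type, max(d["updated_at"] for d in fresh))
    return fresh


issues = [{"id": "issue-1", "updated_at": datetime(2026, 3, 18, tzinfo=timezone.utc)}]
prs = [{"id": "pr-1", "updated_at": datetime(2026, 3, 17, tzinfo=timezone.utc)}]

# Shared key: the issue run advances the cursor past the PR's timestamp.
shared = SyncCursorStore()
sync_stream(shared, "github-clang", issues)
skipped = sync_stream(shared, "github-clang", prs)

# Distinct keys: each stream keeps its own cursor, so the PR is still picked up.
separate = SyncCursorStore()
sync_stream(separate, "github-clang-issues", issues)
kept = sync_stream(separate, "github-clang-prs", prs)
```

With the shared key `skipped` comes back empty, while with distinct keys `kept` contains the PR, which is the behavior the comment is asking for.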


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d362c25c-4c66-4e2e-874b-3b80adbb7784

📥 Commits

Reviewing files that changed from the base of the PR and between f354dba and 91c0ccf.


@snowfox1003 snowfox1003 merged commit 0ad6144 into cppalliance:develop Mar 18, 2026
3 checks passed