#29 - add cppa-pinecone-sync app (#74)
Conversation
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. (Configuration used: Repository UI · Review profile: CHILL · Plan: Pro)
📝 Walkthrough

A new Django app, `cppa_pinecone_sync`, is introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App as Caller
    participant Sync as sync_to_pinecone
    participant Ingestion as PineconeIngestion
    participant Pinecone as Pinecone API
    participant DB as Database
    App->>Sync: sync_to_pinecone(app_id, namespace, preprocess_fn)
    Sync->>DB: get_failed_ids(app_id)
    DB-->>Sync: [failed_ids]
    Sync->>DB: get_final_sync_at(app_id)
    DB-->>Sync: last_sync_time
    Sync->>App: preprocess_fn(failed_ids, last_sync_time)
    App-->>Sync: raw_docs, is_chunked
    Sync->>Sync: _build_documents_from_raw(raw_docs)
    Sync-->>Sync: [Document objects]
    Sync->>Ingestion: upsert_documents(documents, namespace, is_chunked)
    Ingestion->>Ingestion: Split & validate chunks
    Ingestion->>Pinecone: Batch upsert (dense index)
    Pinecone-->>Ingestion: success/failures
    Ingestion->>Pinecone: Batch upsert (sparse index)
    Pinecone-->>Ingestion: success/failures
    Ingestion-->>Sync: {upserted, failed_documents, errors}
    Sync->>Sync: Extract failed_ids from results
    Sync->>DB: clear_failed_ids(app_id)
    DB-->>Sync: deletion_count
    Sync->>DB: record_failed_ids(app_id, new_failed_ids)
    DB-->>Sync: [created entries]
    Sync->>DB: update_sync_status(app_id, current_time)
    DB-->>Sync: PineconeSyncStatus
    Sync-->>App: {upserted, total, failed_count, failed_ids, errors}
```
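The flow in the diagram can be sketched in plain Python. Everything below is hypothetical: the function and key names mirror the diagram, not the actual `cppa_pinecone_sync` source, and `services`/`ingestion` stand in for the real modules.

```python
def sync_to_pinecone(app_id, namespace, preprocess_fn, services, ingestion, now):
    """Hypothetical orchestration mirroring the sequence diagram above."""
    failed_ids = services.get_failed_ids(app_id)      # retry candidates
    last_sync = services.get_final_sync_at(app_id)    # incremental cutoff
    raw_docs, is_chunked = preprocess_fn(failed_ids, last_sync)
    result = ingestion.upsert_documents(raw_docs, namespace, is_chunked)
    new_failed = [doc["id"] for doc in result["failed_documents"]]
    services.clear_failed_ids(app_id)                 # drop the old retry list
    if new_failed:
        services.record_failed_ids(app_id, new_failed)  # queue new retries
    services.update_sync_status(app_id, now)
    return {
        "upserted": result["upserted"],
        "total": len(raw_docs),
        "failed_count": len(new_failed),
        "failed_ids": new_failed,
        "errors": result["errors"],
    }
```

With stubbed `services` and `ingestion` objects, a caller gets back the summary dict shown at the bottom of the diagram.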
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/Schema.md (1)
729-730: ⚠️ Potential issue | 🟡 Minor

Appendix A is still describing removed `type` fields. Line 729 and Line 730 still reference `type`, but Section 9 now defines `app_id`. Please update these rows to prevent stale schema guidance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/Schema.md` around lines 729 - 730, Update the two table rows for PineconeFailList and PineconeSyncStatus to reflect the new schema: replace references to the removed "type" field with "app_id" and adjust the descriptions accordingly (e.g., "failed_id, app_id" for PineconeFailList and "app_id, final_sync_at, created_at, updated_at" for PineconeSyncStatus) so they match Section 9's definition of app_id.
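A sketch of what the corrected appendix rows could look like, assembled from the field names listed in the prompt above (the exact column layout of the real table is assumed, not taken from the docs):

```markdown
| Model              | Key fields                                    |
| ------------------ | --------------------------------------------- |
| PineconeFailList   | failed_id, app_id                             |
| PineconeSyncStatus | app_id, final_sync_at, created_at, updated_at |
```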
🧹 Nitpick comments (3)
cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py (2)
115-118: Remove redundant exception object from `logging.exception` call.

`logging.exception` automatically includes the exception info; passing `e` explicitly is redundant.

💡 Suggested fix

```diff
 except Exception as e:
-    logger.exception("run_cppa_pinecone_sync failed: %s", e)
+    logger.exception("run_cppa_pinecone_sync failed")
     self.stderr.write(self.style.ERROR(f"Sync failed: {e}"))
     raise
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py` around lines 115 - 118, In the except block of run_cppa_pinecone_sync.py where logger.exception is called, remove the redundant exception object argument (e) because logging.exception already records the exception info; change the call to simply logger.exception("run_cppa_pinecone_sync failed") and keep the existing self.stderr.write(self.style.ERROR(f"Sync failed: {e}")) and raise to preserve stderr output and re-raising behavior.
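A self-contained illustration of why the extra argument is redundant: inside an `except` block, `logging.exception` attaches the active traceback on its own. This is a standalone sketch, not project code.

```python
import io
import logging

# Capture log output in a buffer so we can inspect it.
buf = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.ERROR)

try:
    raise RuntimeError("boom")
except RuntimeError:
    # No need to pass the exception object: exc_info is added automatically.
    logger.exception("run failed")

output = buf.getvalue()
# output contains the message plus the full traceback,
# even though the message string has no %s placeholder.
```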
33-35: Consider using `TypeError` for non-callable check.

`TypeError` is more semantically appropriate when the resolved attribute exists but is not callable.

💡 Suggested fix

```diff
 if not callable(fn):
-    raise ValueError(f"{dotted_path!r} is not callable")
+    raise TypeError(f"{dotted_path!r} is not callable")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py` around lines 33 - 35, Replace the ValueError raised when a resolved attribute isn't callable with a TypeError to reflect the correct semantics: in the block that checks "if not callable(fn): raise ValueError(f\"{dotted_path!r} is not callable\")" (variables fn and dotted_path), change the exception to TypeError and keep the same message so callers get the correct exception type while preserving the diagnostic text.

cppa_pinecone_sync/ingestion.py (1)
290-294: Consider using `logging.exception` for automatic stack trace inclusion.

When catching exceptions in batch processing, `logging.exception` provides more debugging context than `logging.error`.

💡 Suggested improvement

```diff
 except Exception as e:
     error_msg = f"Error upserting batch {batch_num}: {e}"
-    logger.error(error_msg)
+    logger.exception(error_msg)
     errors.append(error_msg)
     failed_docs.extend(self._mark_batch_failed(batch, e))
```

Apply similar changes to lines 399 and 460 for consistent error logging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/ingestion.py` around lines 290 - 294, Replace the current logger.error usage inside the exception handlers with logger.exception so the stack trace is captured; specifically update the except block that builds error_msg ("Error upserting batch {batch_num}: {e}") to call logger.exception(error_msg) instead of logger.error(error_msg) and keep the existing errors.append(error_msg) and failed_docs.extend(self._mark_batch_failed(batch, e)) behavior; apply the same replacement (logger.error -> logger.exception) to the other exception handlers referenced around the symbols at lines handling batch processing (the handlers around batch_num and the comparable blocks near the code sections at the locations noted, e.g., the other except blocks at the sections you flagged near lines 399 and 460) so all batch-failure logs include stack traces.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.env.example:
- Around line 92-95: The PINECONE_CLOUD variable is misplaced under the
Workspace section; move the commented PINECONE_CLOUD line out of the "Workspace"
block and place it with the Pinecone configuration block (near the existing
Pinecone keys around where PINECONE_API_KEY / PINECONE_ENV or other Pinecone
vars are declared), ensuring the Workspace section contains only
workspace-related keys and the Pinecone block contains PINECONE_CLOUD.
In `@cppa_pinecone_sync/ingestion.py`:
- Around line 319-323: The code calls record.pop("table_ids") which will raise a
KeyError when metadata lacks that key; update the removal to be defensive (e.g.,
use pop with a default or check membership) in the code paths that build records
(referencing record, metadata, doc_id) so both upsert_documents and callers that
bypass _build_documents_from_raw don't crash; change the record.pop("table_ids")
invocation to a safe removal (for example: only pop if "table_ids" in record or
use record.pop("table_ids", None)) in the function that builds/appends records.
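The defensive-removal point above boils down to `dict.pop` with a default, which never raises. The record contents here are illustrative, not the real record shape:

```python
# Hypothetical record missing the optional "table_ids" key.
record = {"id": "doc-1", "chunk_text": "example"}

# record.pop("table_ids") would raise KeyError here;
# supplying a default makes the removal safe either way.
table_ids = record.pop("table_ids", None)   # key absent -> returns None

record["table_ids"] = ["t1", "t2"]
table_ids = record.pop("table_ids", None)   # key present -> value removed
```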
In `@cppa_pinecone_sync/tests/test_models.py`:
- Around line 74-86: The test test_pinecone_sync_status_str only asserts the
app_id is present but should also assert the final_sync_at is included in the
string; update the test (in test_pinecone_sync_status_str) to build the expected
timestamp string from the created object's final_sync_at (e.g.,
str(obj.final_sync_at) or obj.final_sync_at.isoformat()) and add an assertion
that this string (or a reliably formatted subset like the date portion) is
contained in s to ensure final_sync_at appears in __str__ output.
In `@cppa_pinecone_sync/tests/test_services.py`:
- Around line 109-112: The assertion in the test for
services.get_final_sync_at(app_id) uses a directional difference ((result -
when).total_seconds() < 1) which will pass for large negative offsets; change it
to assert closeness by using an absolute time delta or bounded range (e.g.,
assert abs((result - when).total_seconds()) < 1 or compare (result - when)
against datetime.timedelta(seconds=1) with absolute value) so the test truly
verifies the timestamps are within one second of each other.
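The directional-comparison pitfall flagged above is easy to demonstrate with stdlib datetimes (synthetic values, not the test's fixtures):

```python
from datetime import datetime, timedelta

when = datetime(2024, 1, 1, 12, 0, 0)
result = when - timedelta(hours=5)  # wildly wrong, but *earlier* than expected

# Directional check: any negative offset passes, hiding the bug.
directional_ok = (result - when).total_seconds() < 1

# Absolute check: catches the five-hour discrepancy.
absolute_ok = abs((result - when).total_seconds()) < 1
```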
In `@docs/Contributing.md`:
- Around line 22-27: Add a new table row for the cppa_pinecone_sync service to
the services table so docs include the new app; specifically insert a row with
the App name `cppa_pinecone_sync`, the File `cppa_pinecone_sync/services.py`,
and a short Notes entry (e.g., "Pinecone sync, vectors, indexes") alongside the
existing entries for `cppa_user_tracker`, `github_activity_tracker`,
`boost_library_tracker`, and `discord_activity_tracker` so the CONTRIBUTING
service-layer list is complete.
In `@docs/Schema.md`:
- Around line 668-670: The note for PineconeSyncStatus uses "source type" but
the schema is keyed by app_id; update the explanatory text to consistently use
app_id terminology: change "per source type" to "per app_id" and clarify that
final_sync_at, created_at, and updated_at are timestamps for the row keyed by
app_id; ensure both PineconeFailList and PineconeSyncStatus descriptions
consistently reference app_id (and the failed_id where applicable) so the
documentation matches the schema semantics.
In `@docs/service_api/cppa_pinecone_sync.md`:
- Around line 11-70: The docs incorrectly use sync_type: str — update all
affected function signatures and parameter tables (get_failed_ids,
clear_failed_ids, record_failed_ids, get_final_sync_at, update_sync_status) to
use app_id: int (or int | None where applicable) and change descriptions to
state "Application ID" instead of "Source type"; also adjust examples/return
type notes if they referenced string IDs so they reflect the integer app_id
contract used by the service/schema.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- .env.example
- config/settings.py
- conftest.py
- cppa_pinecone_sync/__init__.py
- cppa_pinecone_sync/admin.py
- cppa_pinecone_sync/apps.py
- cppa_pinecone_sync/ingestion.py
- cppa_pinecone_sync/management/__init__.py
- cppa_pinecone_sync/management/commands/__init__.py
- cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py
- cppa_pinecone_sync/migrations/0001_initial.py
- cppa_pinecone_sync/migrations/__init__.py
- cppa_pinecone_sync/models.py
- cppa_pinecone_sync/services.py
- cppa_pinecone_sync/sync.py
- cppa_pinecone_sync/tests/__init__.py
- cppa_pinecone_sync/tests/fixtures.py
- cppa_pinecone_sync/tests/test_models.py
- cppa_pinecone_sync/tests/test_services.py
- cppa_pinecone_sync/tests/test_sync.py
- docs/Contributing.md
- docs/Pinecone_preprocess_guideline.md
- docs/Schema.md
- docs/Service_API.md
- docs/service_api/README.md
- docs/service_api/cppa_pinecone_sync.md
- requirements.txt
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 6
♻️ Duplicate comments (2)
docs/Schema.md (1)
670-670: ⚠️ Potential issue | 🟡 Minor

Use `app_id` terminology in the sync-status note. Line 670 still says “last sync for that type,” but this table is keyed by `app_id`. Please align wording to avoid semantic drift.

📘 Suggested docs patch

```diff
-**Note:** **PineconeSyncStatus** - Tracks the last successful sync per app. One row per `app_id`. `final_sync_at` is when the last sync for that type completed; `created_at` and `updated_at` are for the row.
+**Note:** **PineconeSyncStatus** - Tracks the last successful sync per app. One row per `app_id`. `final_sync_at` is when the last sync for that `app_id` completed; `created_at` and `updated_at` are for the row.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/Schema.md` at line 670, Update the PineconeSyncStatus note to use the `app_id` terminology: change the phrase "last sync for that type" to something like "last sync for that app (`app_id`)" so the description clearly reflects that PineconeSyncStatus is keyed by `app_id` and that `final_sync_at` records when the last sync for that app completed; keep `created_at` and `updated_at` as row timestamps.

cppa_pinecone_sync/tests/test_models.py (1)
75-87: ⚠️ Potential issue | 🟡 Minor

Strengthen `__str__` assertion to verify the timestamp value. Line 86 currently checks only field-name text. Add a value check so the test guarantees `final_sync_at` content is actually rendered.

🧪 Suggested test patch

```diff
 s = str(obj)
 assert "1" in s
 assert "final_sync_at" in s
+assert when.date().isoformat() in s
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/tests/test_models.py` around lines 75 - 87, Test currently only asserts the field-name "final_sync_at" appears in the string; update it to assert the actual timestamp value is included. After creating when = timezone.now() and obj = baker.make(..., final_sync_at=when), check that the string s contains a deterministic representation of when (e.g. when.isoformat() or when.strftime(...) matching your model's __str__ formatting) instead of or in addition to "final_sync_at" so the test verifies the timestamp value is rendered for PineconeSyncStatus.
🧹 Nitpick comments (2)
requirements.txt (1)
16-19: Consider pinning or constraining upper bounds for new dependencies.

The loose `>=` constraints allow any future major version, which could introduce breaking changes. With pinecone now at 8.1.0 (vs. the >=3.0 constraint), langchain-core at 1.2.16 (vs. >=0.1), and langchain-text-splitters at 1.1.1 (vs. >=0.0.1), there's clear risk of unintended major version updates. Consider specifying upper bounds (e.g., `pinecone>=3.0,<4` or `langchain-core>=0.1,<2`) to prevent unexpected breakage during dependency updates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` around lines 16 - 19, The requirements file uses open-ended >= version specifiers for pinecone, langchain-core, and langchain-text-splitters; update these entries to include conservative upper bounds to avoid accidental major upgrades (for example change pinecone to a range like >=3.0,<4, langchain-core to >=0.1,<2, and langchain-text-splitters to >=0.0.1,<2) or pin to tested exact versions; modify the lines referencing the package names pinecone, langchain-core, and langchain-text-splitters in requirements.txt accordingly.

cppa_pinecone_sync/sync.py (1)
79-88: Deduplicate failed IDs before persisting them.

`_extract_new_failed_ids` can return duplicates across failed documents, which leads to duplicate fail-list rows and redundant retries.

Proposed fix

```diff
 def _extract_new_failed_ids(result: dict[str, Any]) -> list[str]:
 @@
-    return new_failed_ids
+    # preserve order while removing duplicates
+    return list(dict.fromkeys(new_failed_ids))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/sync.py` around lines 79 - 88, _extract_new_failed_ids may return duplicate IDs across failed_documents; update it to deduplicate before returning by collecting IDs into a seen set (or using an ordered-unique approach like dict.fromkeys) as you build new_failed_ids so duplicates are skipped, then return the list of unique IDs (preserving first-seen order) instead of the raw extended list.
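The ordered-dedupe trick suggested above relies on dicts preserving insertion order (guaranteed since Python 3.7); a quick standalone check with made-up IDs:

```python
new_failed_ids = ["doc-3", "doc-1", "doc-3", "doc-2", "doc-1"]

# dict.fromkeys keeps the first occurrence of each key, in order.
unique_ids = list(dict.fromkeys(new_failed_ids))
# → ["doc-3", "doc-1", "doc-2"]
```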
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.env.example:
- Around line 80-88: The Celery worker/beat comments currently sit inside the
Pinecone block and should be removed or moved; update .env.example by either
deleting the two Celery comment lines ("Run worker: celery -A config worker -l
info" and "Run beat (daily 1 AM PST): celery -A config beat -l info") or
relocating them under a new dedicated "# Celery" section header so they are no
longer orphaned within the Pinecone configuration block.
In `@cppa_pinecone_sync/ingestion.py`:
- Around line 311-320: The current ID generation uses original_doc_id and falls
back to appending text[:50] and len(text) when metadata lacks start_index, which
can yield collisions; update the logic around original_doc_id/doc_id (where
hashlib.md5 is called) to include a truly unique per-chunk component (for
example batch_start_idx plus the local chunk index or a UUID via uuid.uuid4())
when "start_index" is not present, so each record dict with keys "id" and
"chunk_text" is guaranteed unique even for identical text/length; ensure you
update the branch that checks "start_index" in metadata and compute the final
doc_id using the augmented original_doc_id before hashing.
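A sketch of the collision-free fallback described above. The helper name and metadata shape are hypothetical; only the hashing pattern (prefer `start_index`, otherwise mix in a unique per-chunk component) follows the review's description:

```python
import hashlib

def make_chunk_id(original_doc_id: str, metadata: dict, chunk_index: int) -> str:
    # Prefer the chunk's start_index when the splitter provides it.
    if "start_index" in metadata:
        key = f"{original_doc_id}_{metadata['start_index']}"
    else:
        # A per-chunk index (or uuid.uuid4()) stays unique even when
        # two chunks contain identical text of identical length.
        key = f"{original_doc_id}_{chunk_index}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Two chunks with identical text no longer collide:
id_a = make_chunk_id("doc-7", {}, 0)
id_b = make_chunk_id("doc-7", {}, 1)
```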
In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py`:
- Around line 69-83: The validation blocks in the command's handle() use "return
1" which does not produce a non-zero exit in Django 4.2+; replace each "return
1" with raising django.core.management.CommandError carrying the same error
message, and ensure CommandError is imported (from django.core.management import
CommandError); update the two checks that reference app_id, namespace, and
preprocessor_path to raise CommandError("When --app-id is set, both --namespace
and --preprocessor are required.") and CommandError("When --namespace or
--preprocessor is set, --app-id is required.") respectively so the command
runner returns a proper non-zero exit.
In `@cppa_pinecone_sync/models.py`:
- Around line 4-5: Docstrings for the classes PineconeFailList and
PineconeSyncStatus use a Unicode EN DASH (–); replace each EN DASH in those
docstrings with a standard ASCII hyphen-minus (-) so punctuation is consistent
and toolchains that expect ASCII don't choke; update the docstring lines for
PineconeFailList and PineconeSyncStatus accordingly.
In `@cppa_pinecone_sync/sync.py`:
- Around line 140-144: The current flow clears existing failed IDs with
services.clear_failed_ids(app_id) before inserting new_failed_ids which can wipe
the retry list if insertion fails; change this to an atomic swap: either call a
single transactional method (e.g., services.replace_failed_ids(app_id,
new_failed_ids)) that writes the new set atomically, or else insert/merge the
new_failed_ids first via services.record_failed_ids(app_id, new_failed_ids) and
only call services.clear_failed_ids(app_id) to remove old IDs after the insert
succeeds; update the code around _extract_new_failed_ids,
services.record_failed_ids and services.clear_failed_ids accordingly so the
replacement is atomic and the retry list cannot be lost on failure.
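The clear-then-insert hazard and the atomic-swap fix can be shown with stdlib `sqlite3` standing in for the Django ORM; table and column names here are illustrative, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fail_list (app_id INTEGER, failed_id TEXT)")
conn.execute("INSERT INTO fail_list VALUES (1, 'old-1')")
conn.commit()

def replace_failed_ids(conn, app_id, new_ids):
    # One transaction: if the insert raises, the delete rolls back too,
    # so the previous retry list is never lost.
    with conn:
        conn.execute("DELETE FROM fail_list WHERE app_id = ?", (app_id,))
        conn.executemany(
            "INSERT INTO fail_list VALUES (?, ?)",
            [(app_id, fid) for fid in new_ids],
        )

replace_failed_ids(conn, 1, ["new-1", "new-2"])
rows = [r[0] for r in conn.execute(
    "SELECT failed_id FROM fail_list WHERE app_id = 1 ORDER BY failed_id"
)]
```

In Django, the equivalent is wrapping both service calls in `transaction.atomic()`, as the comment suggests.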
In `@docs/Pinecone_preprocess_guideline.md`:
- Line 13: The inline comment for app_id is incorrect: update the usage example
for the app_id variable (symbol: app_id) so it no longer says "stored as
str(app_id) in DB" and instead reflects the actual schema (stored as an
integer), e.g., change the comment to indicate it's stored as an int (or remove
the str(...) mention) so the docs match the schema.
Actionable comments posted: 1
♻️ Duplicate comments (8)
docs/Pinecone_preprocess_guideline.md (1)
13-13: ⚠️ Potential issue | 🟡 Minor

Fix the `app_id` storage note in the example. Line 13 still says `app_id` is stored as `str(app_id)`, which conflicts with the app_id-as-integer schema used elsewhere.

📘 Suggested docs patch

```diff
-    app_id=1,  # e.g. 1, 2, 3 (stored as str(app_id) in DB)
+    app_id=1,  # e.g. 1, 2, 3 (stored as int in DB)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/Pinecone_preprocess_guideline.md` at line 13, Update the inline comment for the example "app_id=1" to reflect that app_id is stored as an integer in the DB (not str); locate the "app_id=1, # e.g. 1, 2, 3 (stored as str(app_id) in DB)" line and change the parenthetical note to indicate integer storage (e.g., "(stored as int in DB)" or remove the str(...) mention) so it matches the integer schema used elsewhere.

docs/Schema.md (1)
670-670: ⚠️ Potential issue | 🟡 Minor

Use `app_id` terminology consistently in the note. Line 670 still says “last sync for that type completed,” which conflicts with the updated `app_id`-keyed model semantics.

📘 Suggested wording fix

```diff
-**Note:** **PineconeSyncStatus** - Tracks the last successful sync per app. One row per `app_id`. `final_sync_at` is when the last sync for that type completed; `created_at` and `updated_at` are for the row.
+**Note:** **PineconeSyncStatus** - Tracks the last successful sync per app. One row per `app_id`. `final_sync_at` is when the last sync for that `app_id` completed; `created_at` and `updated_at` are for that row.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/Schema.md` at line 670, Update the Note for PineconeSyncStatus to use "app_id" terminology consistently: change the phrase "last sync for that type completed" to "last sync for that app completed" (or "last successful sync for that app_id") and ensure fields referenced (final_sync_at, created_at, updated_at) remain described as row-level timestamps for the app_id-keyed model so the note matches the PineconeSyncStatus semantics.

.env.example (1)
80-87: ⚠️ Potential issue | 🟡 Minor

Remove or relocate Celery comments from the Pinecone section.
Lines 86-87 are Celery operational notes but are currently nested under the Pinecone config block, which is misleading.
🔧 Suggested cleanup
```diff
 # PINECONE_API_KEY=pc-xxxx
 # PINECONE_INDEX_NAME=boost-dashboard
 # PINECONE_ENVIRONMENT=us-east-1
 # PINECONE_CLOUD=aws
-# Run worker: celery -A config worker -l info
-# Run beat (daily 1 AM PST): celery -A config beat -l info
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.env.example around lines 80 - 87, The Celery operational notes are incorrectly placed inside the Pinecone env block; remove or relocate the two Celery comment lines ("Run worker: celery -A config worker -l info" and "Run beat (daily 1 AM PST): celery -A config beat -l info") so the Pinecone variables (PINECONE_API_KEY, PINECONE_INDEX_NAME, PINECONE_ENVIRONMENT, PINECONE_CLOUD) remain grouped together; either move those Celery lines into a new dedicated Celery section (or README/deployment docs) or delete them from .env.example to avoid misleading placement.

cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py (1)
15-15: ⚠️ Potential issue | 🔴 Critical

Use `CommandError` for failures and avoid integer returns from `handle()`. At Line 75 and Line 82, `return 1` is not a reliable failure signal for Django commands. Use `CommandError` for invalid argument combinations. Also avoid `return 0` at Line 92 and Line 114; return `None` on success.

Proposed fix

```diff
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
 @@
     if app_id is not None and not (namespace and preprocessor_path):
-        self.stderr.write(
-            self.style.ERROR(
-                "When --app-id is set, both --namespace and --preprocessor are required."
-            )
-        )
-        return 1
+        raise CommandError(
+            "When --app-id is set, both --namespace and --preprocessor are required."
+        )
     if (namespace or preprocessor_path) and app_id is None:
-        self.stderr.write(
-            self.style.ERROR(
-                "When --namespace or --preprocessor is set, --app-id is required."
-            )
-        )
-        return 1
+        raise CommandError(
+            "When --namespace or --preprocessor is set, --app-id is required."
+        )
 @@
-    return 0
+    return
 @@
-    return 0
+    return
```

In Django 4.2/5.0 management commands, does returning 1 from handle() set the process exit code, or should CommandError be raised for non-zero exits?

Also applies to: 69-83, 92-92, 114-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py` at line 15, The command's handle() in class Command should not signal failures via integer returns; replace the invalid-argument early exits that currently "return 1" by raising django.core.management.base.CommandError with a clear message (e.g., where argument combinations are invalid inside handle()), and remove explicit "return 0" success returns so the method returns None on success; update any related conditional branches in run_cppa_pinecone_sync.Command.handle to raise CommandError for errors and simply exit the function (no return value) for success.

cppa_pinecone_sync/tests/test_models.py (1)
74-87: ⚠️ Potential issue | 🟡 Minor

Assert the rendered `final_sync_at` value, not just the literal field name. This currently passes even if the timestamp itself is missing or malformed in `__str__`.

🔧 Proposed fix

```diff
 s = str(obj)
-assert "1" in s
-assert "final_sync_at" in s
+assert "app_id=1" in s
+assert str(obj.final_sync_at.date()) in s
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/tests/test_models.py` around lines 74 - 87, The test test_pinecone_sync_status_str is asserting the literal field name instead of the actual timestamp; update it to assert that the rendered final_sync_at value appears in str(obj) by comparing against the known when value (e.g. str(when) or when.isoformat()) so the __str__ output from PineconeSyncStatus actually includes the timestamp; locate the test_pinecone_sync_status_str function and replace the assert "final_sync_at" in s with an assertion that the stringified when is contained in s.

cppa_pinecone_sync/models.py (1)
4-5: ⚠️ Potential issue | 🟡 Minor

Replace EN DASH with ASCII hyphen in the module docstring.
Keeping ASCII punctuation avoids toolchain ambiguity and resolves the Ruff warning.
🔧 Proposed fix
```diff
-PineconeFailList – records failed sync operations by failed_id and app_id for retry or audit.
-PineconeSyncStatus – tracks the last successful sync per source app_id.
+PineconeFailList - records failed sync operations by failed_id and app_id for retry or audit.
+PineconeSyncStatus - tracks the last successful sync per source app_id.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/models.py` around lines 4 - 5, The module docstring contains EN DASH characters between the class descriptions ("PineconeFailList – records..." and "PineconeSyncStatus – tracks..."); replace each EN DASH (–) with a plain ASCII hyphen (-) in the module docstring so the text reads "PineconeFailList - records..." and "PineconeSyncStatus - tracks..." to satisfy the linter and avoid toolchain ambiguity, leaving the rest of the docstring and the identifiers PineconeFailList and PineconeSyncStatus unchanged.

cppa_pinecone_sync/sync.py (1)
140-144: ⚠️ Potential issue | 🟠 Major

Make failed-ID replacement atomic to prevent retry-list loss.
If `record_failed_ids` fails after `clear_failed_ids`, previous retry IDs are permanently lost.

🔒 Proposed fix
```diff
+from django.db import transaction
@@
-    services.clear_failed_ids(app_id)
     new_failed_ids = _extract_new_failed_ids(result)
-    if new_failed_ids:
-        services.record_failed_ids(app_id, new_failed_ids)
+    with transaction.atomic():
+        services.clear_failed_ids(app_id)
+        if new_failed_ids:
+            services.record_failed_ids(app_id, new_failed_ids)
     if new_failed_ids:
         logger.warning(
             "app_id=%s: %d source IDs recorded as failed", app_id, len(new_failed_ids)
         )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/sync.py` around lines 140 - 144, Currently code calls services.clear_failed_ids(app_id) before services.record_failed_ids(app_id, new_failed_ids), risking loss if record fails; instead make the replacement atomic by either (A) adding a new service method like services.replace_failed_ids(app_id, new_failed_ids) that performs the clear-and-insert in a single DB transaction and call that from this spot, or (B) wrap clear_failed_ids and record_failed_ids in a transaction inside services so they succeed or rollback together; locate this call site (services.clear_failed_ids / services.record_failed_ids / _extract_new_failed_ids) and switch to the atomic replace API (or implement transactional behavior) so previous retry IDs aren’t lost on failure.

cppa_pinecone_sync/ingestion.py (1)
311-319: ⚠️ Potential issue | 🟠 Major

Fallback chunk-ID strategy can collide for repeated text.
When `start_index` is missing, using `text[:50]` + `len(text)` is not unique and can overwrite vectors.

🔧 Proposed fix
```diff
-        for doc in batch:
+        for offset, doc in enumerate(batch):
@@
             if "start_index" in metadata:
                 original_doc_id = f"{original_doc_id}_{metadata['start_index']}"
             else:
-                original_doc_id = f"{original_doc_id}_{text[:50]}_{len(text)}"
+                chunk_key = batch_start_idx + offset
+                original_doc_id = f"{original_doc_id}_{chunk_key}_{len(text)}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/ingestion.py` around lines 311 - 319, The fallback chunk-ID generation using original_doc_id + text[:50] + len(text) can collide for repeated or similar text; update the logic that builds original_doc_id (used to compute doc_id) to include a stable per-chunk unique value such as the record's index/offset within the batch or a hash of the entire chunk (e.g., sha256(text) or uuid4 if nondeterministic IDs are acceptable) instead of just text[:50] and len(text); ensure the branch that checks for "start_index" in metadata still preserves existing behavior and that the final doc_id remains the md5(hex) of the composed original_doc_id.
🧹 Nitpick comments (1)
cppa_pinecone_sync/services.py (1)
39-44: De-duplicate `failed_ids` before `bulk_create` to avoid redundant retry rows.

This keeps retry queues smaller and avoids repeated processing of the same source ID.
♻️ Proposed refactor
```diff
 def record_failed_ids(app_id: int, failed_ids: list[str]) -> list[PineconeFailList]:
     """Bulk-create PineconeFailList entries for each failed_id. Returns created objects."""
     if not failed_ids:
         return []
-    objs = [PineconeFailList(failed_id=fid, app_id=app_id) for fid in failed_ids]
+    unique_failed_ids = list(dict.fromkeys(failed_ids))
+    objs = [
+        PineconeFailList(failed_id=fid, app_id=app_id)
+        for fid in unique_failed_ids
+    ]
     return PineconeFailList.objects.bulk_create(objs)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/services.py` around lines 39 - 44, In record_failed_ids, deduplicate the incoming failed_ids before creating PineconeFailList rows to avoid duplicate retry entries: produce a unique list (preserving order, e.g. via dict.fromkeys) from failed_ids, build objs = [PineconeFailList(failed_id=fid, app_id=app_id) for fid in unique_failed_ids], and pass that list to PineconeFailList.objects.bulk_create; leave the early return for empty input intact and return the bulk_create result.
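The order-preserving de-duplication the refactor relies on is a standard idiom worth noting: `dict.fromkeys` keeps first-seen order (dicts are insertion-ordered since Python 3.7), whereas `set()` would scramble the retry queue:

```python
# dict.fromkeys drops duplicates while preserving first-seen order.
failed_ids = ["doc-3", "doc-1", "doc-3", "doc-2", "doc-1"]
unique_failed_ids = list(dict.fromkeys(failed_ids))
print(unique_failed_ids)  # → ['doc-3', 'doc-1', 'doc-2']
```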
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cppa_pinecone_sync/services.py`:
- Around line 69-71: The save call in services.py inside the "if not created"
block only updates final_sync_at, which prevents the model's auto_now updated_at
from refreshing; modify the obj.save call (in the block that sets
obj.final_sync_at = ts) to include "updated_at" in update_fields (e.g.,
update_fields=["final_sync_at", "updated_at"]) so Django will update the
auto_now timestamp when final_sync_at is changed.
---
Duplicate comments:
In @.env.example:
- Around line 80-87: The Celery operational notes are incorrectly placed inside
the Pinecone env block; remove or relocate the two Celery comment lines ("Run
worker: celery -A config worker -l info" and "Run beat (daily 1 AM PST): celery
-A config beat -l info") so the Pinecone variables (PINECONE_API_KEY,
PINECONE_INDEX_NAME, PINECONE_ENVIRONMENT, PINECONE_CLOUD) remain grouped
together; either move those Celery lines into a new dedicated Celery section (or
README/deployment docs) or delete them from .env.example to avoid misleading
placement.
In `@cppa_pinecone_sync/ingestion.py`:
- Around line 311-319: The fallback chunk-ID generation using original_doc_id +
text[:50] + len(text) can collide for repeated or similar text; update the logic
that builds original_doc_id (used to compute doc_id) to include a stable
per-chunk unique value such as the record's index/offset within the batch or a
hash of the entire chunk (e.g., sha256(text) or uuid4 if nondeterministic IDs
are acceptable) instead of just text[:50] and len(text); ensure the branch that
checks for "start_index" in metadata still preserves existing behavior and that
the final doc_id remains the md5(hex) of the composed original_doc_id.
In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py`:
- Line 15: The command's handle() in class Command should not signal failures
via integer returns; replace the invalid-argument early exits that currently
"return 1" by raising django.core.management.base.CommandError with a clear
message (e.g., where argument combinations are invalid inside handle()), and
remove explicit "return 0" success returns so the method returns None on
success; update any related conditional branches in
run_cppa_pinecone_sync.Command.handle to raise CommandError for errors and
simply exit the function (no return value) for success.
In `@cppa_pinecone_sync/models.py`:
- Around line 4-5: The module docstring contains EN DASH characters between the
class descriptions ("PineconeFailList – records..." and "PineconeSyncStatus –
tracks..."); replace each EN DASH (–) with a plain ASCII hyphen (-) in the
module docstring so the text reads "PineconeFailList - records..." and
"PineconeSyncStatus - tracks..." to satisfy the linter and avoid toolchain
ambiguity, leaving the rest of the docstring and the identifiers
PineconeFailList and PineconeSyncStatus unchanged.
In `@cppa_pinecone_sync/sync.py`:
- Around line 140-144: Currently code calls services.clear_failed_ids(app_id)
before services.record_failed_ids(app_id, new_failed_ids), risking loss if
record fails; instead make the replacement atomic by either (A) adding a new
service method like services.replace_failed_ids(app_id, new_failed_ids) that
performs the clear-and-insert in a single DB transaction and call that from this
spot, or (B) wrap clear_failed_ids and record_failed_ids in a transaction inside
services so they succeed or rollback together; locate this call site
(services.clear_failed_ids / services.record_failed_ids /
_extract_new_failed_ids) and switch to the atomic replace API (or implement
transactional behavior) so previous retry IDs aren’t lost on failure.
In `@cppa_pinecone_sync/tests/test_models.py`:
- Around line 74-87: The test test_pinecone_sync_status_str is asserting the
literal field name instead of the actual timestamp; update it to assert that the
rendered final_sync_at value appears in str(obj) by comparing against the known
when value (e.g. str(when) or when.isoformat()) so the __str__ output from
PineconeSyncStatus actually includes the timestamp; locate the
test_pinecone_sync_status_str function and replace the assert "final_sync_at" in
s with an assertion that the stringified when is contained in s.
In `@docs/Pinecone_preprocess_guideline.md`:
- Line 13: Update the inline comment for the example "app_id=1" to reflect that
app_id is stored as an integer in the DB (not str); locate the "app_id=1, #
e.g. 1, 2, 3 (stored as str(app_id) in DB)" line and change the parenthetical
note to indicate integer storage (e.g., "(stored as int in DB)" or remove the
str(...) mention) so it matches the integer schema used elsewhere.
In `@docs/Schema.md`:
- Line 670: Update the Note for PineconeSyncStatus to use "app_id" terminology
consistently: change the phrase "last sync for that type completed" to "last
sync for that app completed" (or "last successful sync for that app_id") and
ensure fields referenced (final_sync_at, created_at, updated_at) remain
described as row-level timestamps for the app_id-keyed model so the note matches
the PineconeSyncStatus semantics.
---
Nitpick comments:
In `@cppa_pinecone_sync/services.py`:
- Around line 39-44: In record_failed_ids, deduplicate the incoming failed_ids
before creating PineconeFailList rows to avoid duplicate retry entries: produce
a unique list (preserving order, e.g. via dict.fromkeys) from failed_ids, build
objs = [PineconeFailList(failed_id=fid, app_id=app_id) for fid in
unique_failed_ids], and pass that list to PineconeFailList.objects.bulk_create;
leave the early return for empty input intact and return the bulk_create result.
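The atomic clear-and-insert requested for `sync.py` above would use Django's `transaction.atomic()`; the same rollback behavior can be sketched with stdlib `sqlite3`, where the connection's context manager commits on success and rolls back on any exception. Table and column names here are illustrative, not the app's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fail_list (app_id INTEGER, failed_id TEXT)")
conn.execute("INSERT INTO fail_list VALUES (1, 'old-doc')")
conn.commit()

def replace_failed_ids(conn, app_id, new_failed_ids):
    """Clear and re-insert failed IDs in a single transaction: a failed
    insert rolls the delete back, so the old retry list survives."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM fail_list WHERE app_id = ?", (app_id,))
        conn.executemany(
            "INSERT INTO fail_list VALUES (?, ?)",
            [(app_id, fid) for fid in new_failed_ids],
        )

# A bad row aborts the insert; the rollback also undoes the delete.
try:
    replace_failed_ids(conn, 1, ["ok-doc", object()])  # object() cannot be bound
except sqlite3.Error:
    pass
print(conn.execute("SELECT failed_id FROM fail_list").fetchall())  # → [('old-doc',)]

# A clean call replaces the list atomically.
replace_failed_ids(conn, 1, ["new-a", "new-b"])
```

In the Django version, wrapping `clear_failed_ids` and `record_failed_ids` inside one `with transaction.atomic():` block gives the same all-or-nothing guarantee.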
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- `.env.example`
- `config/settings.py`
- `conftest.py`
- `cppa_pinecone_sync/__init__.py`
- `cppa_pinecone_sync/admin.py`
- `cppa_pinecone_sync/apps.py`
- `cppa_pinecone_sync/ingestion.py`
- `cppa_pinecone_sync/management/__init__.py`
- `cppa_pinecone_sync/management/commands/__init__.py`
- `cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py`
- `cppa_pinecone_sync/migrations/0001_initial.py`
- `cppa_pinecone_sync/migrations/__init__.py`
- `cppa_pinecone_sync/models.py`
- `cppa_pinecone_sync/services.py`
- `cppa_pinecone_sync/sync.py`
- `cppa_pinecone_sync/tests/__init__.py`
- `cppa_pinecone_sync/tests/fixtures.py`
- `cppa_pinecone_sync/tests/test_models.py`
- `cppa_pinecone_sync/tests/test_services.py`
- `cppa_pinecone_sync/tests/test_sync.py`
- `docs/Contributing.md`
- `docs/Pinecone_preprocess_guideline.md`
- `docs/Schema.md`
- `docs/Service_API.md`
- `docs/service_api/README.md`
- `docs/service_api/cppa_pinecone_sync.md`
- `requirements.txt`
@coderabbitai resume

✅ Actions performed

Reviews resumed.
🧹 Nitpick comments (1)
cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py (1)
105-108: Remove redundant logging and preserve exception chain.

Lines 106-107 log the same error twice: `logger.exception` already includes the traceback, making the subsequent `logger.error` redundant. Additionally, using `raise ... from e` preserves the exception chain for better debugging.

♻️ Proposed fix
```diff
     except Exception as e:
-        logger.exception("run_cppa_pinecone_sync failed: %s", e)
-        logger.error("Sync failed: %s", e)
-        raise CommandError(f"Sync failed: {e}")
+        logger.exception("run_cppa_pinecone_sync failed")
+        raise CommandError(f"Sync failed: {e}") from e
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py` around lines 105 - 108, Remove the redundant logger.error call and preserve the exception chain: inside the except block that catches Exception as e in run_cppa_pinecone_sync (the except in run method of the management command), keep the existing logger.exception(...) which logs the traceback, remove the subsequent logger.error(...) line, and re-raise the CommandError using "raise CommandError(f\"Sync failed: {e}\") from e" so the original exception is chained to the CommandError for debugging.
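The `raise ... from e` pattern the fix above recommends is plain Python; a self-contained sketch (with a stand-in `CommandError`, since Django's class is not imported here) shows how `__cause__` preserves the chain:

```python
class CommandError(Exception):
    """Stand-in for django.core.management.base.CommandError."""

def run_sync():
    raise ValueError("upstream failure")

def handle():
    try:
        run_sync()
    except Exception as e:
        # logger.exception(...) would log the full traceback here;
        # "from e" sets __cause__, so the original error survives in tracebacks.
        raise CommandError(f"Sync failed: {e}") from e

try:
    handle()
except CommandError as exc:
    caught = exc

assert isinstance(caught.__cause__, ValueError)
print(caught)  # → Sync failed: upstream failure
```

A bare `raise CommandError(...)` inside an `except` block still records the original exception as `__context__`, but `from e` makes the causal link explicit and keeps it even if the re-raise happens outside the handler.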
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py`:
- Around line 105-108: Remove the redundant logger.error call and preserve the
exception chain: inside the except block that catches Exception as e in
run_cppa_pinecone_sync (the except in run method of the management command),
keep the existing logger.exception(...) which logs the traceback, remove the
subsequent logger.error(...) line, and re-raise the CommandError using "raise
CommandError(f\"Sync failed: {e}\") from e" so the original exception is chained
to the CommandError for debugging.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `cppa_pinecone_sync/management/commands/run_cppa_pinecone_sync.py`
- `cppa_pinecone_sync/models.py`
- `cppa_pinecone_sync/services.py`
- `cppa_pinecone_sync/sync.py`
- `docs/Pinecone_preprocess_guideline.md`
Summary by CodeRabbit
New Features
Admin
Management Command
Models & Services
Tests
Configuration Changes
Documentation
Dependencies