feature/boost documentation tracker #101
Conversation
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
📝 Walkthrough

Adds a new Django app, `boost_library_docs_tracker`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(240,248,255,0.5)
    actor User
    participant CLI as Management Command
    participant DB as Database
    participant Fetcher as Fetcher
    participant Converter as HTML→MD
    participant WS as Workspace
    participant Pinecone as Pinecone
    end
    User->>CLI: run_boost_library_docs_tracker (versions, options)
    CLI->>DB: resolve versions & library list
    alt use_local
        CLI->>Fetcher: download & extract source ZIP
        Fetcher->>WS: store extracted files
    end
    loop per library
        CLI->>Fetcher: walk or crawl pages
        Fetcher->>Converter: convert HTML -> Markdown
        Converter-->>Fetcher: markdown
        Fetcher->>WS: save page markdown
        CLI->>DB: upsert BoostDocContent (url, hash)
        CLI->>DB: link content to BoostLibraryVersion
    end
    alt pinecone sync enabled
        CLI->>DB: select un-upserted docs
        CLI->>CLI: preprocess_for_pinecone -> documents
        CLI->>Pinecone: upsert documents
        CLI->>DB: mark docs is_upserted=true
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (inconclusive)
@coderabbitai full review
✅ Actions performed: full review triggered.
Actionable comments posted: 16
🧹 Nitpick comments (3)
boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py (1)
10-40: Consider folding this into `0001_initial` before merge. Since this app is new in the same PR, fresh installs will create `page_content` and `status` only to drop them immediately here. Squashing now keeps the initial schema aligned with the shipped models and avoids unnecessary DDL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py` around lines 10 - 40, This migration creates then immediately drops fields (page_content and status) and adds is_upserted/index; fold these changes into the initial migration by updating 0001_initial to define the final schema (omit page_content and status, include is_upserted BooleanField(db_index=True) and the new Index bl_docs_libver_upserted_ix on ["boost_library_version","is_upserted"]) and then remove or delete 0002_remove_page_content_and_status_add_is_upserted.py so fresh installs never create-and-drop those columns; ensure migration dependencies and the migration history are adjusted (delete 0002 or squash into 0001) so the migration graph remains consistent.

docs/boost_library_docs_tracker.md (1)
22-44: Consider adding a language specifier to the fenced code block. The directory structure code block lacks a language specifier. For consistency and to satisfy linting, consider adding `text` or `plaintext`.

📝 Suggested fix

````diff
-```
+```text
 boost_library_docs_tracker/
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/boost_library_docs_tracker.md` around lines 22 - 44, The fenced directory-listing code block in docs/boost_library_docs_tracker.md is missing a language specifier; update the opening fence from ``` to ```text (or ```plaintext) so the block starts with a language tag (e.g., replace the leading ``` before "boost_library_docs_tracker/" with ```text) to satisfy linting and ensure consistent rendering.

boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (1)
258-275: Consider using `logger.exception` to preserve stack traces. When catching exceptions during crawl/walk operations, `logger.exception` automatically includes the full traceback, which aids debugging. The current `logger.error` only logs the message.

♻️ Suggested improvement

```diff
 except Exception as exc:
-    logger.error("[%s] local walk failed: %s", lib_name, exc)
+    logger.exception("[%s] local walk failed", lib_name)
     self.stdout.write(
         self.style.ERROR(f"  [{lib_name}] local walk error: {exc}")
     )
     return 0
 else:
     self.stdout.write(f"  [{lib_name}] crawling {doc_root_url} ...")
     try:
         pages = fetcher.crawl_library_pages(
             doc_root_url, max_pages=max_pages, delay_secs=0.3
         )
     except Exception as exc:
-        logger.error("[%s] crawl failed: %s", lib_name, exc)
+        logger.exception("[%s] crawl failed", lib_name)
         self.stdout.write(
             self.style.ERROR(f"  [{lib_name}] crawl error: {exc}")
         )
         return 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 258 - 275, The except handlers in run_boost_library_docs_tracker.py that currently call logger.error in the local walk and crawl blocks (the except blocks around the local walk and around fetcher.crawl_library_pages) should use logger.exception to automatically record the traceback; update the two logger.error calls that reference "[%s] local walk failed" and "[%s] crawl failed" to logger.exception while keeping the existing self.stdout.write calls intact so the user-facing messages remain unchanged.
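The behavioral difference is easy to demonstrate outside Django; a minimal sketch, with the logger name and failure message purely illustrative:

```python
import io
import logging

def crawl_and_log() -> str:
    """Capture log output to show that logger.exception records the traceback."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger("boost_docs_demo")
    logger.addHandler(handler)
    try:
        raise RuntimeError("connection reset")  # stand-in for a crawl failure
    except Exception:
        # Equivalent to logger.error(..., exc_info=True)
        logger.exception("[%s] crawl failed", "asio")
    logger.removeHandler(handler)
    return stream.getvalue()

output = crawl_and_log()
```

With `logger.error` the captured output would contain only the message line; `logger.exception` appends the full traceback after it.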
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 305-330: The crawler currently uses the original request URL
(variable url) when resolving and scoping links even after a redirect (final_url
= resp.url), causing sibling pages to be misresolved or filtered; update the
link resolution and scoping to use the canonical response URL: replace url with
final_url in the urljoin call that computes abs_url and also ensure the
startswith check that compares abs_url against doc_root_url uses the canonical
base (derive a canonical doc_root from final_url if necessary) so that link
resolution and the scope check reference the redirected/canonical location
(affecting the loop that uses BeautifulSoup, a["href"], abs_url, visited, and
queue).
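A small illustration of the `urljoin` behavior the fix relies on; the URLs are hypothetical examples, not taken from the crawler:

```python
from urllib.parse import urljoin

# What resp.url would report after a redirect to the directory-index form.
requested = "https://www.boost.org/doc/libs/1_84_0/doc/html/boost_asio"
final_url = requested + "/"

# Resolving against the original request URL drops the library directory,
# so sibling pages are misresolved; the canonical final URL keeps it.
wrong = urljoin(requested, "overview.html")
right = urljoin(final_url, "overview.html")
```

Because a trailing-slash redirect changes the base path, every relative `a["href"]` must be joined against `final_url`, and the in-scope `startswith` check must use the same canonical base.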
- Around line 78-80: Do not assume presence of zip_path or top_dir equals a
successful previous run; instead make downloads and extracts atomic and verify
completion: download into a temporary file (e.g., zip_path.with_suffix('.part'))
and only rename/move to zip_path when the download finishes and optionally
verify size/checksum, and extract into a temporary directory then atomically
rename to top_dir or write a sentinel like top_dir + '.complete' after
successful extraction; update the code paths that currently check
zip_path.exists() and top_dir.exists() to check the completion marker or
checksum and to remove/replace partial artifacts before retrying so
partial/corrupt files are not reused.
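A minimal sketch of the atomic-write half of this suggestion; `atomic_write` is a hypothetical helper, and the real fetcher would stream the HTTP body instead of writing a byte string:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(target: Path, data: bytes) -> None:
    """Write to a sibling .part file, then rename into place.

    os.replace is atomic on the same filesystem, so target either holds
    the complete payload or does not exist; a crash mid-download leaves
    only the .part file, which a retry can discard.
    """
    part = target.with_name(target.name + ".part")
    part.write_bytes(data)  # stand-in for streaming the HTTP response body
    os.replace(part, target)

workspace = Path(tempfile.mkdtemp())
zip_path = workspace / "boost_source.zip"
atomic_write(zip_path, b"PK...")  # placeholder bytes, not a real zip
```

The same rename-on-success pattern applies to extraction: unpack into a temporary directory and rename it to `top_dir` (or drop a sentinel file) only after the extract completes.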
- Around line 247-260: The current substring check using lib_key.split("/")[-1]
is unsafe; instead compute the library root path via source_root /
_library_root_for_key(lib_key) (resolve it) and test ancestry of linked against
that path (e.g., linked.is_relative_to(library_root) or check library_root in
linked.parents or linked == library_root) before enqueueing; update the
conditional that now reads "if lib_key.split... not in linked.as_posix()" to
perform this path-based ancestry check using the computed library_root variable
(ensure you resolve both paths to compare canonical forms).
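A path-ancestry check along these lines can be sketched without Django; `is_within` is a hypothetical helper name, and it avoids `Path.is_relative_to` so it also runs on Python older than 3.9:

```python
from pathlib import Path

def is_within(candidate: Path, root: Path) -> bool:
    """True if candidate is root itself or lives under root (both resolved)."""
    candidate = candidate.resolve()
    root = root.resolve()
    return candidate == root or root in candidate.parents

# Substring tests are fooled by sibling names sharing a prefix;
# path ancestry is not.
root = Path("/tmp/boost/libs/core")
assert is_within(Path("/tmp/boost/libs/core/doc/index.html"), root)
assert not is_within(Path("/tmp/boost/libs/core_extras/doc.html"), root)
```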
In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 161-175: The code only recognizes backtick fences; update all
fence handling to accept both backticks and tildes: change _RE_BAD_FENCE to
match either "```" or "~~~" (e.g. ^([`~]{3,}) *programlisting\s*$), update
_RE_FENCED_BLOCK to capture opening/closing fences that may be backticks or
tildes (use a backref for the fence char and length, e.g.
(?:^|\n)([`~]{3,}[^\n]*\n.*?\1{3,})), and modify the fence toggle logic inside
_join_wrapped_lines() to detect/track either backtick or tilde fences (not just
"```"); also audit and update any other regexes that assume only backticks (the
other fence-related occurrences) to ensure consistent handling of ~~~ fences.
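One way to express such a fence pattern with a backreference; this is a sketch of the idea, not the module's actual `_RE_FENCED_BLOCK`:

```python
import re

# Capture the fence character once, then require the closing fence to
# repeat that same character at least three times.
_RE_FENCED_BLOCK = re.compile(
    r"^(?P<ch>[`~])(?P=ch){2,}[^\n]*\n.*?^(?P=ch){3,}\s*$",
    re.DOTALL | re.MULTILINE,
)

def has_fenced_block(text: str) -> bool:
    return _RE_FENCED_BLOCK.search(text) is not None
```

The backreference keeps mixed fences (backtick open, tilde close) from matching, which a simple `[`~]{3,}` on both ends would wrongly accept.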
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Line 15: Fix the typo in the command module's docstring: change "extrated" to
"extracted" in the string inside run_boost_library_docs_tracker.py (the
management command module/class) so the sentence reads "Extract tree is saved in
workspace/boost_library_docs_tracker/extracted/." Ensure you update the
docstring associated with the command class/function in that file.
- Around line 391-400: The conversion using str.isdigit() can misclassify
non-ASCII numerals; update the failed_ids handling to robustly build
int_failed_ids by iterating over failed_ids and attempting int(...) in a
try/except (ValueError) block, only keeping positive integers, then call
services.set_is_upserted_by_ids(int_failed_ids, False) and log via
logger.warning as before (referencing failed_ids, int_failed_ids,
services.set_is_upserted_by_ids, and logger.warning to locate the change).
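A sketch of the suggested try/except filter (`to_positive_int_ids` is a hypothetical name); note that `"³".isdigit()` is `True` even though `int("³")` raises `ValueError`:

```python
def to_positive_int_ids(failed_ids):
    """Keep only values that parse as positive base-10 integers.

    str.isdigit() accepts characters like superscript digits that int()
    rejects, so attempting the conversion is the robust filter.
    """
    result = []
    for raw in failed_ids:
        try:
            value = int(raw)
        except (TypeError, ValueError):
            continue
        if value > 0:
            result.append(value)
    return result
```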
- Around line 475-488: In the MultipleObjectsReturned except block in
run_boost_library_docs_tracker, avoid calling .first().pk without checking for
None: assign result = BoostLibraryVersion.objects.filter(library__name=lib_name,
version__version=version).first(), then if result is None log an error/warning
(using logger) and either return None or raise BoostLibraryVersion.DoesNotExist;
otherwise return result.pk — this prevents an AttributeError if the row(s) are
deleted between .get() and .first().
- Line 244: The code currently overrides the parsed max_pages value to None when
not in dry_run (max_pages = max_pages if dry_run else None), which disables the
--max-pages limit in real runs; change this so the parsed max_pages is always
respected (remove the conditional override and use the existing max_pages
variable as-is) so the per-library page cap applies in both dry_run and normal
runs, or alternatively update documentation/help strings to explicitly state
that non-dry runs are unlimited if that behavior is intended; refer to the
max_pages variable and dry_run flag in run_boost_library_docs_tracker.py when
making the change.
In `@boost_library_docs_tracker/services.py`:
- Around line 42-54: Normalize the incoming url by stripping whitespace once
before any validation/DB operations and use that normalized value for the
BoostDocContent lookup and creation; specifically, replace raw uses of the
parameter `url` with a local `normalized_url = url.strip()` (validate against
empty after strip), then call `BoostDocContent.objects.get(url=normalized_url)`
and `BoostDocContent.objects.create(url=normalized_url,
content_hash=content_hash, scraped_at=_now())` so the uniqueness invariant is
enforced consistently.
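The normalization step itself is tiny; a sketch with the DB calls omitted (`normalize_url` is a hypothetical helper name):

```python
def normalize_url(url: str) -> str:
    """Strip surrounding whitespace once, up front, and reject empties.

    Using the normalized value for both the lookup and the create keeps
    the uniqueness invariant consistent: " https://x " and "https://x"
    must never produce two rows.
    """
    normalized = (url or "").strip()
    if not normalized:
        raise ValueError("url must be a non-empty string")
    return normalized
```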
- Around line 111-115: The bulk-update path in set_is_upserted_by_ids currently
calls
BoostLibraryDocumentation.objects.filter(pk__in=ids).update(is_upserted=value)
which bypasses model.save() and therefore doesn't refresh updated_at; modify
set_is_upserted_by_ids to also set updated_at (e.g., include
updated_at=timezone.now()) in the same QuerySet.update call so it matches
set_is_upserted's behavior and keeps last-modified timestamps consistent.
- Around line 87-89: When updating obj.page_count in the conditional that checks
created and page_count, include the auto-updated timestamp field in the save
call: add "updated_at" to the update_fields list so the model's updated_at is
persisted alongside page_count (matching the pattern used in set_is_upserted).
Locate the block that sets obj.page_count and calls
obj.save(update_fields=["page_count"]) and change the update_fields to include
"updated_at".
In `@boost_library_docs_tracker/tests/test_models.py`:
- Around line 18-24: The test currently catches a broad Exception when calling
baker.make("boost_library_docs_tracker.BoostDocContent",
url=boost_doc_content.url), which can hide unrelated import/setup errors; change
the assertion to expect the concrete DB uniqueness failure (e.g.
pytest.raises(IntegrityError)) and import IntegrityError (from django.db or
django.db.utils), and if needed wrap the create in a transaction.atomic() block
to ensure the DB raises the integrity error for the duplicate URL on the
BoostDocContent model.
In `@boost_library_docs_tracker/workspace.py`:
- Around line 98-120: resolve_path_from_url currently trusts the URL path and
returns get_converted_root() / f"boost_{url_version}" /
Path(*relative_parts).with_suffix(".md"), which allows path traversal via '..'
and doesn't validate the host; update resolve_path_from_url to first verify the
parsed.netloc is an allowed host (e.g., "www.boost.org" or "boost.org"), then
sanitize and normalize relative_parts (reject empty segments, '.' or '..' and
any absolute segments), build the target path and call .resolve() and ensure the
resolved path is a subpath of get_converted_root() / f"boost_{url_version}"
(compare parents) before returning it; if validation fails, return None so
callers like save_page or load_page_by_url cannot escape the workspace.
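A standalone sketch of the validation order described above; the allowed-host set, the helper signature, and the `.md` suffix handling are assumptions based on the review text, not the module's real code:

```python
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.boost.org", "boost.org"}  # assumed docs hosts

def resolve_path_from_url(url: str, converted_root: Path, url_version: str):
    """Map a docs URL to a workspace path, refusing traversal attempts.

    Returns None when the host is not allowed, a segment is '.' or '..',
    or the resolved path escapes the version directory.
    """
    parsed = urlparse(url)
    if parsed.netloc not in ALLOWED_HOSTS:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if not parts or any(p in (".", "..") for p in parts):
        return None
    base = (converted_root / f"boost_{url_version}").resolve()
    target = (base / Path(*parts)).with_suffix(".md").resolve()
    # Defense in depth: even with segments filtered, confirm ancestry.
    if base != target and base not in target.parents:
        return None
    return target
```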
In `@docs/Schema.md`:
- Around line 699-730: The Schema.md docs tracker section is out of sync with
the shipped models: update the mermaid diagram and narrative to match
boost_library_docs_tracker/models.py and the migration 0002 (remove page_content
and status, add is_upserted). Specifically, remove non-existent
columns/relations (first_version_id, last_version_id,
BoostDocContent.is_upserted, unique content_hash and any
BoostVersion→BoostDocContent FK), remove references to page_content in Appendix
A, and add the actual fields present in code
(BoostLibraryDocumentation.page_count, updated_at,
BoostLibraryDocumentation.is_upserted and any true unique/index constraints
present in models). Ensure the diagram and notes reflect the real composite
unique constraint on (boost_library_version_id, boost_doc_content_id) and the
index on (boost_library_version_id, is_upserted) as implemented.
In `@docs/service_api/boost_library_docs_tracker.md`:
- Around line 12-24: The docs are out of sync with the implementation: update
the API doc for get_or_create_doc_content to reflect the actual signature
get_or_create_doc_content(url: str, content_hash: str) (remove the page_content
parameter and DB side-effect descriptions), remove or replace the non-existent
status helper table (lines describing "created/content_changed/unchanged"), and
instead document the actual public functions exported by the module:
get_or_create_doc_content(url, content_hash), link_content_to_library_version,
set_is_upserted, set_is_upserted_by_ids, and get_docs_for_library_version with
brief parameter/return summaries matching their implementations.
In `@requirements.txt`:
- Around line 11-18: requirements.txt declares pypandoc but the Pandoc binary
itself isn't provisioned; update CI and container/dev setup to install the
pandoc binary so pypandoc's fallback to the CLI works reliably. Add installation
steps for pandoc in your Dockerfile (or base image), in CI workflow scripts
(apt/yum/choco/brew as appropriate), and document the dev setup (README) with a
recommendation like "install pandoc (e.g., apt install pandoc or brew install
pandoc)" so that pypandoc in requirements.txt can always find the pandoc
executable at runtime.
---
Nitpick comments:
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 258-275: The except handlers in run_boost_library_docs_tracker.py
that currently call logger.error in the local walk and crawl blocks (the except
blocks around the local walk and around fetcher.crawl_library_pages) should use
logger.exception to automatically record the traceback; update the two
logger.error calls that reference "[%s] local walk failed" and "[%s] crawl
failed" to logger.exception while keeping the existing self.stdout.write calls
intact so the user-facing messages remain unchanged.
In
`@boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py`:
- Around line 10-40: This migration creates then immediately drops fields
(page_content and status) and adds is_upserted/index; fold these changes into
the initial migration by updating 0001_initial to define the final schema (omit
page_content and status, include is_upserted BooleanField(db_index=True) and the
new Index bl_docs_libver_upserted_ix on ["boost_library_version","is_upserted"])
and then remove or delete 0002_remove_page_content_and_status_add_is_upserted.py
so fresh installs never create-and-drop those columns; ensure migration
dependencies and the migration history are adjusted (delete 0002 or squash into
0001) so the migration graph remains consistent.
In `@docs/boost_library_docs_tracker.md`:
- Around line 22-44: The fenced directory-listing code block in
docs/boost_library_docs_tracker.md is missing a language specifier; update the
opening fence from ``` to ```text (or ```plaintext) so the block starts with a
language tag (e.g., replace the leading ``` before "boost_library_docs_tracker/"
with ```text) to satisfy linting and ensure consistent rendering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 954a344a-e052-44b2-b828-56c2184d7bc5
📒 Files selected for processing (30)
- boost_library_docs_tracker/__init__.py
- boost_library_docs_tracker/admin.py
- boost_library_docs_tracker/apps.py
- boost_library_docs_tracker/fetcher.py
- boost_library_docs_tracker/html_to_md.py
- boost_library_docs_tracker/management/__init__.py
- boost_library_docs_tracker/management/commands/__init__.py
- boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
- boost_library_docs_tracker/migrations/0001_initial.py
- boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py
- boost_library_docs_tracker/migrations/__init__.py
- boost_library_docs_tracker/models.py
- boost_library_docs_tracker/preprocessor.py
- boost_library_docs_tracker/services.py
- boost_library_docs_tracker/tests/__init__.py
- boost_library_docs_tracker/tests/fixtures.py
- boost_library_docs_tracker/tests/test_fetcher.py
- boost_library_docs_tracker/tests/test_models.py
- boost_library_docs_tracker/tests/test_services.py
- boost_library_docs_tracker/workspace.py
- config/settings.py
- conftest.py
- docs/Contributing.md
- docs/Schema.md
- docs/Service_API.md
- docs/boost_library_docs_tracker.md
- docs/service_api/README.md
- docs/service_api/boost_library_docs_tracker.md
- requirements.txt
- workflow/management/commands/run_all_collectors.py
@coderabbitai continue to review
✅ Actions performed: full review triggered.
Actionable comments posted: 7
♻️ Duplicate comments (1)
docs/service_api/boost_library_docs_tracker.md (1)
1-39: ⚠️ Potential issue | 🟠 Major: Documentation is out of sync with the actual `services.py` implementation. This matches the previous review feedback. The documented API does not reflect the actual public functions:

| Documented | Actual in services.py |
| --- | --- |
| `get_or_create_doc_content(url, page_content, content_hash)` | `get_or_create_doc_content(url, content_hash, version_id=None)` |
| `mark_relation_running`, `mark_relation_completed`, etc. | Not implemented |
| `get_pending_docs_for_library_version`, `get_docs_pending_sync` | Not implemented |

The actual public surface includes:
`get_or_create_doc_content`, `set_doc_content_upserted`, `set_doc_content_upserted_by_ids`, `get_unupserted_doc_contents`, `link_content_to_library_version`, `get_docs_for_library_version`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/service_api/boost_library_docs_tracker.md` around lines 1 - 39, The docs are out of sync with services.py: update the markdown to match the actual public API (or vice-versa). Specifically, change the get_or_create_doc_content signature to get_or_create_doc_content(url, content_hash, version_id=None); remove or stop documenting mark_relation_running/mark_relation_completed/mark_relation_failed/get_pending_docs_for_library_version/get_docs_pending_sync (they don't exist); and instead document the actual public functions set_doc_content_upserted, set_doc_content_upserted_by_ids, get_unupserted_doc_contents, link_content_to_library_version, and get_docs_for_library_version, including their parameter lists, return types, and behavior (e.g., upsert semantics, status fields updated). Ensure all function names (get_or_create_doc_content, set_doc_content_upserted, set_doc_content_upserted_by_ids, get_unupserted_doc_contents, link_content_to_library_version, get_docs_for_library_version) are present and described consistently with services.py.
🧹 Nitpick comments (7)
boost_library_docs_tracker/html_to_md.py (1)
20-23: Consider narrowing the exception type. While catching `Exception` works for an optional dependency, `ImportError` would be more precise and avoid masking unexpected issues during import.

Suggested refinement

```diff
 try:
     import pypandoc
-except Exception:  # optional runtime dependency
+except ImportError:  # optional runtime dependency
     pypandoc = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/html_to_md.py` around lines 20 - 23, The try/except around importing pypandoc is too broad; change the except Exception to except ImportError so only missing-import errors are caught and other import-time errors are propagated. Update the import block that sets pypandoc = None (the try: import pypandoc / except ... / pypandoc = None sequence) to catch ImportError specifically.

boost_library_docs_tracker/tests/test_services.py (1)
43-57: Test name may be misleading; clarify the scenario. The test is named `test_get_or_create_doc_content_content_changed_when_url_differs`, but the scenario actually tests: "when the same `content_hash` is found but with a different URL, the URL field is updated." The `change_type` is `"content_changed"` because the URL in the existing record differs.

Consider renaming for clarity:

```python
def test_get_or_create_doc_content_updates_url_for_existing_hash():
    """get_or_create_doc_content returns 'content_changed' and updates url when hash exists with different url."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/tests/test_services.py` around lines 43 - 57, Rename the test function test_get_or_create_doc_content_content_changed_when_url_differs to something clearer like test_get_or_create_doc_content_updates_url_for_existing_hash and update its docstring to state: "get_or_create_doc_content returns 'content_changed' and updates url when hash exists with different url" so the name and docstring accurately reflect that the function tests updating the URL for an existing content_hash (refer to the test function and its assertions calling services.get_or_create_doc_content and checking change_type and obj2.url).

boost_library_docs_tracker/fetcher.py (3)
154-163: Consider extracting magic library key mappings to a constant. The special-case handling for `numeric/`, `enable_if`, and `swap` keys is embedded inline. If more special cases emerge, this could become hard to maintain. Consider extracting to a documented constant or configuration.

♻️ Optional: Extract to constant

```python
# Library keys that map to alternate root directories
_LIBRARY_KEY_OVERRIDES = {
    "enable_if": "core",
    "swap": "core",
}

_LIBRARY_KEY_PREFIXES = {
    "numeric/": "numeric",  # preserves full path
}

def _library_root_for_key(lib_key: str) -> Path:
    key = (lib_key or "").strip().strip("/")
    if not key:
        return Path("libs")
    for prefix, subdir in _LIBRARY_KEY_PREFIXES.items():
        if key.startswith(prefix):
            return Path("libs") / key
    for pattern, target in _LIBRARY_KEY_OVERRIDES.items():
        if pattern in key:
            return Path("libs") / target
    return Path("libs") / key.split("/")[0]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/fetcher.py` around lines 154 - 163, The _library_root_for_key function currently embeds special-case strings inline; extract these mappings into top-level documented constants (e.g. _LIBRARY_KEY_OVERRIDES = {"enable_if":"core","swap":"core"} and _LIBRARY_KEY_PREFIXES = {"numeric/":"numeric"}) and update _library_root_for_key to reference those constants: normalize key as before, check prefixes from _LIBRARY_KEY_PREFIXES first (preserving full path for matching prefixes), then check substring patterns from _LIBRARY_KEY_OVERRIDES to map to alternate subdirs, and fall back to key.split("/")[0]; add brief docstrings for the constants so future special cases are maintainable and discoverable.
36-41: Global session singleton may cause issues in concurrent/multi-threaded contexts. The module-level `_SESSION` singleton is shared across all callers. While this is efficient for single-threaded CLI commands, it could cause issues if this module is ever used in concurrent contexts (e.g., async workers, tests with parallel execution). The `requests.Session` object is not thread-safe.

For a Django management command this is likely fine, but consider documenting this limitation or using thread-local storage if broader usage is anticipated.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/fetcher.py` around lines 36 - 41, The module-level _SESSION singleton used by _get_session can cause problems in concurrent/threaded contexts; replace it with thread-local storage by introducing a threading.local() container (e.g., _LOCAL) and store the requests.Session instance on that container (e.g., _LOCAL.session) inside _get_session so each thread gets its own Session while preserving the User-Agent header, or alternatively change _get_session to always return a new requests.Session if thread-local is not desired; update imports to include threading and ensure references to _SESSION are switched to the new thread-local attribute (look for _get_session and _SESSION in this file).
78-80: Resume-safe checks are commented out. The code that skipped re-downloading and re-extracting when files already exist is commented out. This addresses the past review concern about incomplete artifacts being reused, but now every run re-downloads and re-extracts the full source zip.

If the intent is to keep downloads atomic, consider implementing a completion marker (e.g., `zip_path.with_suffix('.complete')`) or checksum verification so that valid downloads can be reused without risk of partial artifacts.

Also applies to: 134-138
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/fetcher.py` around lines 78 - 80, The resume-safe existence checks around zip_path were commented out, forcing redownloads; restore resume logic by checking for a validated completion marker or checksum before skipping download/extract (e.g., check zip_path.exists() AND zip_path.with_suffix('.complete').exists() or verify a saved checksum), and when downloading write the zip atomically and create the .complete marker (or write checksum) only after successful download and extraction; update the code paths that currently reference zip_path to use this marker/verification so partial artifacts are never treated as complete.

boost_library_docs_tracker/preprocessor.py (1)
103-115: N+1 query pattern in `_get_library_name`. Each call executes a separate query (`doc_content.library_relations.select_related(...).first()`). When processing many records, this results in N additional queries. Since `_select_records` doesn't prefetch `library_relations`, consider prefetching in the queryset or batch-loading library names.

♻️ Proposed optimization using prefetch_related

In `_select_records`, add prefetch for the relation chain:

```diff
+from django.db.models import Prefetch
+from .models import BoostLibraryDocumentation
+
 def _select_records(
     failed_ids: list[str],
     final_sync_at: datetime | None,
 ) -> list[BoostDocContent]:
     ...
     qs = (
         BoostDocContent.objects.filter(query)
         .select_related("first_version", "last_version")
+        .prefetch_related(
+            Prefetch(
+                "library_relations",
+                queryset=BoostLibraryDocumentation.objects.select_related(
+                    "boost_library_version__library"
+                ).order_by("id"),
+            )
+        )
         .order_by("id")
     )
     return list(qs)
```

Then simplify `_get_library_name` to use the prefetched data:

```diff
 def _get_library_name(doc_content: BoostDocContent) -> str:
-    rel = (
-        doc_content.library_relations.select_related("boost_library_version__library")
-        .order_by("id")
-        .first()
-    )
+    # Use prefetched data (already ordered by id)
+    rels = list(doc_content.library_relations.all())
+    rel = rels[0] if rels else None
     if rel is None:
         return ""
     return rel.boost_library_version.library.name
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/preprocessor.py` around lines 103 - 115, The function _get_library_name triggers an N+1 DB query by calling doc_content.library_relations.select_related(...).first() for each BoostDocContent; modify the record selection in _select_records to prefetch the relation chain (prefetch_related("library_relations__boost_library_version__library") or equivalent) or batch-load a mapping of BoostDocContent → library name, then simplify _get_library_name to read from the already-prefetched doc_content.library_relations (e.g., use the first item on the prefetched relation list) so no extra queries are executed when iterating many BoostDocContent instances.boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (1)
265-270: Consider using `logger.exception` to preserve stack traces. Using `logger.error` loses the stack trace information. For debugging failed crawls/walks, having the full traceback is valuable. The static analysis hints also suggest this change.

♻️ Use logger.exception for traceback

```diff
 except Exception as exc:
-    logger.error("[%s] local walk failed: %s", lib_name, exc)
+    logger.exception("[%s] local walk failed: %s", lib_name, exc)
     self.stdout.write(
         self.style.ERROR(f"  [{lib_name}] local walk error: {exc}")
     )
     return 0
 else:
     ...
     except Exception as exc:
-        logger.error("[%s] crawl failed: %s", lib_name, exc)
+        logger.exception("[%s] crawl failed: %s", lib_name, exc)
         self.stdout.write(
             self.style.ERROR(f"  [{lib_name}] crawl error: {exc}")
         )
```

Apply similar changes to lines 323, 333, 346, and 392.

Also applies to: 277-282
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 265 - 270, Replace the current logger.error calls in the exception handlers with logger.exception (or logger.error(..., exc_info=True)) so the traceback is preserved: update the handler that logs "local walk failed" (where logger.error("[%s] local walk failed: %s", lib_name, exc) and self.stdout.write(self.style.ERROR(f" [{lib_name}] local walk error: {exc}")) are used) as well as the other except blocks referenced (the blocks around the ranges noted — 277-282, and the handlers at the locations corresponding to lines 323, 333, 346, and 392) to call logger.exception(...) instead of logger.error(...), keeping the same message content and leaving the existing stdout writes intact so stack traces are included in the logs while user-facing messages remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/actions.yml:
- Around line 79-80: The workflow step "Install system dependencies" currently
runs "sudo apt-get install -y pandoc" without refreshing apt metadata; modify
that step to run "sudo apt-get update -y" (or combine as "sudo apt-get update -y
&& sudo apt-get install -y pandoc") before installing pandoc so the package
index is fresh and installation won't fail intermittently on GitHub runners.
In `@boost_library_docs_tracker/admin.py`:
- Around line 15-28: The admin config references non-existent fields causing
Django checks to fail; update BoostLibraryDocumentationAdmin to only use fields
present on the model by removing references to is_upserted and page_count from
list_display and removing is_upserted from list_filter, and remove updated_at
from readonly_fields so that list_display contains only id,
boost_library_version, boost_doc_content (or whichever of those actually exist)
and readonly_fields contains created_at only; modify the attributes
list_display, list_filter, search_fields, raw_id_fields, and readonly_fields
within the BoostLibraryDocumentationAdmin class accordingly to match the model's
actual fields.
In `@boost_library_docs_tracker/models.py`:
- Around line 11-18: The docstring claims "one row per URL" but the model
actually enforces uniqueness on content_hash rather than url; update the
docstring to accurately state the deduplication key (content_hash) and that
multiple rows may exist for the same URL when content changes, referencing the
fields content_hash, url, first_version_id, last_version_id, and is_upserted so
readers know which attributes track versions and Pinecone state; alternatively,
if the intent is truly one row per URL, change the model's uniqueness constraint
to use url instead of content_hash and adjust any related logic accordingly
(update the docstring to match whichever rule you choose).
In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 159-164: The code marks records via
services.set_doc_content_upserted_by_ids(ids_to_mark, True) inside
_build_documents before calling sync_to_pinecone, which can leave rows
incorrectly marked if Pinecone upsert fails; change the flow so that
_build_documents returns ids_to_mark but does NOT set is_upserted, and move the
call to services.set_doc_content_upserted_by_ids(ids_to_mark, True) to after
sync_to_pinecone completes successfully (or perform both actions in a single
transactional/try-except block where you only set is_upserted=True on success
and revert on failure); update references in the calling management command to
accept the ids list returned from _build_documents and ensure failed_ids
handling still works with this new ordering.
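The suggested ordering can be sketched without Django. `upsert_fn` and `mark_fn` are hypothetical stand-ins for `sync_to_pinecone` and `services.set_doc_content_upserted_by_ids`; the point is that rows are flagged only after the remote call succeeds, and a total failure marks nothing so a rerun retries everything.

```python
def sync_batch(doc_ids, documents, upsert_fn, mark_fn):
    """Upsert first; flip is_upserted only for rows that survived.

    upsert_fn(documents) -> iterable of failed doc ids (empty on full success)
    mark_fn(ids) -> persists is_upserted=True for the given ids
    """
    try:
        failed_ids = set(upsert_fn(documents))
    except Exception:
        # Total failure: mark nothing, so a rerun retries the whole batch.
        return []
    ok_ids = [i for i in doc_ids if i not in failed_ids]
    if ok_ids:
        mark_fn(ok_ids)
    return ok_ids
```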
In `@docs/boost_library_docs_tracker.md`:
- Around line 3-4: The documentation describes an outdated data model and
execution flow; update the doc to reflect the current schema where
BoostDocContent holds metadata and is_upserted, and BoostLibraryDocumentation is
only the join row (library_version, doc_content). Edit all sections that
reference storing extracted text in a separate DB table, relation-level fields
like status/page_count, and restart/sync logic (including lines referenced
around 52-55, 94-108, 114-132) to instead explain: extraction writes into
BoostDocContent with its metadata and is_upserted flag, the join row is
BoostLibraryDocumentation, how restart logic should check
BoostDocContent.is_upserted and re-run upserts accordingly, and how Pinecone
upserts operate per BoostDocContent; ensure field and model names
(BoostDocContent, BoostLibraryDocumentation, is_upserted) are used accurately
throughout.
- Around line 142-145: The examples for run_boost_library_docs_tracker currently
show bare versions; update them to use the repository's stored
BoostVersion.version format (include the boost- prefix) so they match what's
actually stored and accepted — e.g., change the command examples that call
run_boost_library_docs_tracker --version to pass values like boost-1.85.0 /
boost-1.86.0 and add a short note referencing BoostVersion.version to avoid
confusion.
In `@docs/service_api/README.md`:
- Line 12: The README table row misstates the schema: update the docs tracker
summary so BoostDocContent is described as holding sync state (not merely
URL-unique content) and BoostLibraryDocumentation is described as the join row
(not the per-page status holder); specifically edit the table entry referencing
boost_library_docs_tracker.services / boost_library_docs_tracker to mention
BoostDocContent contains sync state fields and BoostLibraryDocumentation is the
join/linking entity between library-version and page.
---
Duplicate comments:
In `@docs/service_api/boost_library_docs_tracker.md`:
- Around line 1-39: The docs are out of sync with services.py: update the
markdown to match the actual public API (or vice-versa). Specifically, change
the get_or_create_doc_content signature to get_or_create_doc_content(url,
content_hash, version_id=None); remove or stop documenting
mark_relation_running/mark_relation_completed/mark_relation_failed/get_pending_docs_for_library_version/get_docs_pending_sync
(they don't exist); and instead document the actual public functions
set_doc_content_upserted, set_doc_content_upserted_by_ids,
get_unupserted_doc_contents, link_content_to_library_version, and
get_docs_for_library_version, including their parameter lists, return types, and
behavior (e.g., upsert semantics, status fields updated). Ensure all function
names (get_or_create_doc_content, set_doc_content_upserted,
set_doc_content_upserted_by_ids, get_unupserted_doc_contents,
link_content_to_library_version, get_docs_for_library_version) are present and
described consistently with services.py.
---
Nitpick comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 154-163: The _library_root_for_key function currently embeds
special-case strings inline; extract these mappings into top-level documented
constants (e.g. _LIBRARY_KEY_OVERRIDES = {"enable_if":"core","swap":"core"} and
_LIBRARY_KEY_PREFIXES = {"numeric/":"numeric"}) and update _library_root_for_key
to reference those constants: normalize key as before, check prefixes from
_LIBRARY_KEY_PREFIXES first (preserving full path for matching prefixes), then
check substring patterns from _LIBRARY_KEY_OVERRIDES to map to alternate
subdirs, and fall back to key.split("/")[0]; add brief docstrings for the
constants so future special cases are maintainable and discoverable.
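A rough sketch of the suggested refactor. The mapping values come from the comment above; the matching order (prefixes first, then substring overrides, then the first path segment) is an assumption about how the real `_library_root_for_key` should behave.

```python
# Keys whose docs live under a different top-level subdir (special cases).
_LIBRARY_KEY_OVERRIDES = {"enable_if": "core", "swap": "core"}
# Key prefixes that map a nested key to a shared parent directory.
_LIBRARY_KEY_PREFIXES = {"numeric/": "numeric"}

def library_root_for_key(key: str) -> str:
    """Resolve a library key to its documentation root directory name."""
    key = key.strip().strip("/").lower()
    for prefix, root in _LIBRARY_KEY_PREFIXES.items():
        if key.startswith(prefix):
            return root
    for token, root in _LIBRARY_KEY_OVERRIDES.items():
        if token in key:
            return root
    return key.split("/")[0]
```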
- Around line 36-41: The module-level _SESSION singleton used by _get_session
can cause problems in concurrent/threaded contexts; replace it with thread-local
storage by introducing a threading.local() container (e.g., _LOCAL) and store
the requests.Session instance on that container (e.g., _LOCAL.session) inside
_get_session so each thread gets its own Session while preserving the User-Agent
header, or alternatively change _get_session to always return a new
requests.Session if thread-local is not desired; update imports to include
threading and ensure references to _SESSION are switched to the new thread-local
attribute (look for _get_session and _SESSION in this file).
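A stdlib sketch of the thread-local variant. `factory` stands in for `requests.Session` so the example stays self-contained; the real `_get_session` would also set the User-Agent header on the new session.

```python
import threading

_LOCAL = threading.local()

def get_session(factory=dict):
    """Return this thread's session, creating it lazily on first use.

    Each thread sees its own `_LOCAL.session`, so sessions are never
    shared across threads; `factory` is a stand-in for requests.Session.
    """
    if getattr(_LOCAL, "session", None) is None:
        _LOCAL.session = factory()
    return _LOCAL.session
```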
- Around line 78-80: The resume-safe existence checks around zip_path were
commented out, forcing redownloads; restore resume logic by checking for a
validated completion marker or checksum before skipping download/extract (e.g.,
check zip_path.exists() AND zip_path.with_suffix('.complete').exists() or verify
a saved checksum), and when downloading write the zip atomically and create the
.complete marker (or write checksum) only after successful download and
extraction; update the code paths that currently reference zip_path to use this
marker/verification so partial artifacts are never treated as complete.
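A minimal sketch of the marker approach, with hypothetical helper names. The `.part` temp file plus rename keeps the write atomic, and the `.complete` marker is created only after a full write, so a stale partial zip is never mistaken for a finished download.

```python
import tempfile
from pathlib import Path

def is_download_complete(zip_path: Path) -> bool:
    """Treat the zip as complete only when its marker file exists too."""
    return zip_path.exists() and zip_path.with_suffix(".complete").exists()

def finish_download(zip_path: Path, data: bytes) -> None:
    """Write the archive atomically, then drop the marker last."""
    tmp = zip_path.with_suffix(".part")
    tmp.write_bytes(data)
    tmp.replace(zip_path)                      # atomic rename on POSIX
    zip_path.with_suffix(".complete").touch()  # only after a full write

# Demo: a stale partial artifact is not treated as complete.
workdir = Path(tempfile.mkdtemp())
zip_path = workdir / "boost.zip"
zip_path.write_bytes(b"partial")
stale_complete = is_download_complete(zip_path)
finish_download(zip_path, b"full archive")
fresh_complete = is_download_complete(zip_path)
```

A checksum file written in place of the empty marker gives the same resume safety plus corruption detection, at the cost of hashing the archive once per check.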
In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 20-23: The try/except around importing pypandoc is too broad;
change the except Exception to except ImportError so only missing-import errors
are caught and other import-time errors are propagated. Update the import block
that sets pypandoc = None (the try: import pypandoc / except ... / pypandoc =
None sequence) to catch ImportError specifically.
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 265-270: Replace the current logger.error calls in the exception
handlers with logger.exception (or logger.error(..., exc_info=True)) so the
traceback is preserved: update the handler that logs "local walk failed" (where
logger.error("[%s] local walk failed: %s", lib_name, exc) and
self.stdout.write(self.style.ERROR(f" [{lib_name}] local walk error: {exc}"))
are used) as well as the other except blocks referenced (the blocks around the
ranges noted — 277-282, and the handlers at the locations corresponding to lines
323, 333, 346, and 392) to call logger.exception(...) instead of
logger.error(...), keeping the same message content and leaving the existing
stdout writes intact so stack traces are included in the logs while user-facing
messages remain unchanged.
In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 103-115: The function _get_library_name triggers an N+1 DB query
by calling doc_content.library_relations.select_related(...).first() for each
BoostDocContent; modify the record selection in _select_records to prefetch the
relation chain
(prefetch_related("library_relations__boost_library_version__library") or
equivalent) or batch-load a mapping of BoostDocContent → library name, then
simplify _get_library_name to read from the already-prefetched
doc_content.library_relations (e.g., use the first item on the prefetched
relation list) so no extra queries are executed when iterating many
BoostDocContent instances.
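The batch-load alternative can be shown without the ORM. In the real code the `(doc_content_id, library_name)` pairs would come from a single prefetch or `values_list` query; the helper name here is hypothetical.

```python
def build_library_name_map(rows):
    """rows: (doc_content_id, library_name) pairs fetched in ONE query.

    Keeps the first name seen per doc, mirroring .first() on the relation,
    so later lookups are dict hits instead of one query per row.
    """
    mapping = {}
    for doc_id, name in rows:
        mapping.setdefault(doc_id, name)
    return mapping
```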
In `@boost_library_docs_tracker/tests/test_services.py`:
- Around line 43-57: Rename the test function
test_get_or_create_doc_content_content_changed_when_url_differs to something
clearer like test_get_or_create_doc_content_updates_url_for_existing_hash and
update its docstring to state: "get_or_create_doc_content returns
'content_changed' and updates url when hash exists with different url" so the
name and docstring accurately reflect that the function tests updating the URL
for an existing content_hash (refer to the test function and its assertions
calling services.get_or_create_doc_content and checking change_type and
obj2.url).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8db6f473-44b0-4a02-96be-70f907849f37
📒 Files selected for processing (33)
- `.github/workflows/actions.yml`
- `README.md`
- `boost_library_docs_tracker/__init__.py`
- `boost_library_docs_tracker/admin.py`
- `boost_library_docs_tracker/apps.py`
- `boost_library_docs_tracker/fetcher.py`
- `boost_library_docs_tracker/html_to_md.py`
- `boost_library_docs_tracker/management/__init__.py`
- `boost_library_docs_tracker/management/commands/__init__.py`
- `boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
- `boost_library_docs_tracker/migrations/0001_initial.py`
- `boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py`
- `boost_library_docs_tracker/migrations/0003_redesign_per_schema_v10.py`
- `boost_library_docs_tracker/migrations/__init__.py`
- `boost_library_docs_tracker/models.py`
- `boost_library_docs_tracker/preprocessor.py`
- `boost_library_docs_tracker/services.py`
- `boost_library_docs_tracker/tests/__init__.py`
- `boost_library_docs_tracker/tests/fixtures.py`
- `boost_library_docs_tracker/tests/test_fetcher.py`
- `boost_library_docs_tracker/tests/test_models.py`
- `boost_library_docs_tracker/tests/test_services.py`
- `boost_library_docs_tracker/workspace.py`
- `config/settings.py`
- `conftest.py`
- `docs/Contributing.md`
- `docs/Schema.md`
- `docs/Service_API.md`
- `docs/boost_library_docs_tracker.md`
- `docs/service_api/README.md`
- `docs/service_api/boost_library_docs_tracker.md`
- `requirements.txt`
- `workflow/management/commands/run_all_collectors.py`
@coderabbitai continue to review
✅ Actions performed: Full review triggered.
✅ Actions performed: Review triggered.
Actionable comments posted: 10
♻️ Duplicate comments (2)
boost_library_docs_tracker/fetcher.py (2)
323-329: ⚠️ Potential issue | 🟠 Major

Use the canonical redirect root for crawl scoping too.

Links are resolved against `final_url`, but the prefix filter still uses the original `doc_root_url`. If the seed redirects from `index.htm` to `index.html` or another canonical location, sibling pages are filtered out and the crawl terminates early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/fetcher.py` around lines 323 - 329, The crawl scope check uses the original doc_root_url but links are resolved against final_url, so when the seed redirected the prefix filter (doc_root_url) can incorrectly exclude valid sibling pages; update the scope comparison to use the canonical root derived from final_url (e.g., compute canonical_doc_root = urljoin(final_url, "/") or otherwise normalize final_url) and replace doc_root_url with that canonical root in the abs_url.startswith(...) check (variables to locate: abs_url, final_url, doc_root_url, visited, queue).
255-258: ⚠️ Potential issue | 🟠 Major

Scope the local crawl by ancestry from the actual start tree.

The substring check still breaks `/doc/...` libraries: those pages usually live under `source_root/doc/...`, so their descendants never contain `lib_key` and the BFS stops at the entry page. It can also admit out-of-tree paths that merely contain the token.

🧭 Safer scoping

```diff
     base_url = f"{BOOST_ORG_BASE}/doc/libs/{url_version}/"
+    doc = (lib_documentation or "").strip()
+    if doc.startswith("/doc/"):
+        scope_root = (source_root / get_start_path(lib_key, lib_documentation).parts[0]).resolve()
+    else:
+        scope_root = (source_root / _library_root_for_key(lib_key)).resolve()
+
     visited: set[Path] = set()
     queue: deque[Path] = deque([start_file.resolve()])
     results: list[tuple[str, str]] = []
@@
             if not href.endswith(".html") and not href.endswith(".htm"):
                 continue
             linked = (file_path.parent / href).resolve()
-            # Stay within the library subtree
-            if lib_key.split("/")[-1] not in linked.as_posix():
+            try:
+                linked.relative_to(scope_root)
+            except ValueError:
                 continue
             if linked not in visited and linked not in queue:
                 queue.append(linked)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/fetcher.py` around lines 255 - 258, The current substring check using lib_key against linked.as_posix() is unsafe; instead ensure linked is actually inside the crawl start tree by testing ancestry. Replace the substring test with a proper Path containment check: compute the actual start root used for this BFS (the directory that file_path started from or the library root) and then verify linked.is_relative_to(start_root) (or manually walk linked.parents to find start_root) before continuing; keep references to the existing variables linked, file_path and lib_key when locating/deriving the start_root.
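The ancestry test itself is plain `pathlib`. A sketch showing why it beats the substring check (paths are illustrative); `Path.is_relative_to` (Python 3.9+) wraps the same `relative_to`/`ValueError` pattern used here.

```python
from pathlib import PurePosixPath

def in_scope(linked: PurePosixPath, scope_root: PurePosixPath) -> bool:
    """True only when `linked` is scope_root or one of its descendants."""
    try:
        linked.relative_to(scope_root)
        return True
    except ValueError:
        return False

scope_root = PurePosixPath("/src/libs/algorithm")
inside = in_scope(PurePosixPath("/src/libs/algorithm/doc/index.html"), scope_root)
# A substring check on "algorithm" would wrongly admit this sibling path:
lookalike = in_scope(PurePosixPath("/src/libs/other/algorithm_notes.html"), scope_root)
```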
🧹 Nitpick comments (3)
requirements.txt (1)
14-14: Consider pinning `pypandoc` version for reproducibility.

Other dependencies specify minimum versions (e.g., `beautifulsoup4>=4.12`), but `pypandoc` has no version constraint. While this is unlikely to cause issues, pinning to a minimum version improves build reproducibility.

💡 Suggested change

```diff
-pypandoc
+pypandoc>=1.11
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` at line 14, Pin the pypandoc dependency in requirements.txt to improve reproducibility: replace the bare "pypandoc" entry with a version-constrained specifier such as "pypandoc>=1.8.1" (or another vetted minimum) to match the project's dependency policy; update the requirements.txt entry for pypandoc accordingly and run your dependency install/test to confirm compatibility.

boost_library_docs_tracker/html_to_md.py (1)
20-23: Consider catching `ImportError` instead of bare `Exception`.

While the fallback behavior is correct, catching `ImportError` would be more precise for module import failures.

Suggested change

```diff
 try:
     import pypandoc
-except Exception:  # optional runtime dependency
+except ImportError:  # optional runtime dependency
     pypandoc = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/html_to_md.py` around lines 20 - 23, The try/except that imports pypandoc should catch ImportError (or ModuleNotFoundError) instead of a bare Exception; update the import block around the pypandoc import (the try importing pypandoc) to use "except ImportError:" (or "except (ImportError, ModuleNotFoundError):" if you prefer) and keep the fallback pypandoc = None so only import failures are swallowed.

boost_library_docs_tracker/tests/test_fetcher.py (1)
51-184: Add coverage for `walk_library_html()`.

`--use-local` is a first-class path in this feature, but this module only exercises `crawl_library_pages()`. A regression in local HTML discovery or canonical URL generation would currently ship untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/tests/test_fetcher.py` around lines 51 - 184, The tests only cover crawl_library_pages but miss walk_library_html, so add a unit test that exercises walk_library_html to validate local HTML discovery and canonical URL generation: create a temporary directory with a root index HTML and linked local pages (including edge cases like redirects via differing filenames and non-HTML files), call walk_library_html with the directory path and the expected root URL prefix, and assert the returned canonical URLs and converted text (use the existing _mock_pandoc/autouse fixtures and the walk_library_html function name) to ensure local pages are discovered, non-HTML files are skipped, and final canonical URLs match expected values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 127-141: The ZIP extraction code using zipfile.ZipFile (variables
zip_path, extract_dir, names, root_name and zf.extractall) must validate each
archive member before extracting to prevent Zip Slip; iterate zf.infolist() and
for each member ensure the member.name is not absolute, does not contain '..',
and that the resolved target path
(extract_dir.joinpath(member.filename).resolve()) is inside
extract_dir.resolve() (or use Path.is_relative_to where available); skip or
raise on invalid entries and avoid using zf.extractall() directly—extract only
validated members and also consider rejecting or safely handling symlinks found
in member external attributes.
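A sketch of the per-member validation with stdlib `zipfile`, assuming the validated members are then extracted one at a time with `zf.extract(info, extract_dir)` instead of `zf.extractall()`. The demo builds a hostile archive in memory to show a traversal entry being dropped.

```python
import io
import tempfile
import zipfile
from pathlib import Path

def safe_members(zf: zipfile.ZipFile, extract_dir: Path):
    """Yield only archive members that resolve inside extract_dir."""
    root = extract_dir.resolve()
    for info in zf.infolist():
        parts = Path(info.filename).parts
        if info.filename.startswith(("/", "\\")) or ".." in parts:
            continue  # absolute path or traversal entry: skip it
        target = (root / info.filename).resolve()
        if target == root or root in target.parents:
            yield info

# Demo: one honest member, one Zip Slip attempt.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("boost/index.html", "<html></html>")
    zf.writestr("../evil.txt", "escaped")
with zipfile.ZipFile(buf) as zf:
    extract_dir = Path(tempfile.mkdtemp())
    kept = [m.filename for m in safe_members(zf, extract_dir)]
```

Symlink members need a separate check on `info.external_attr` (the POSIX mode lives in its high 16 bits), which this sketch does not cover.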
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 67-70: The help text for the --versions CLI flag is misleading;
update the help string (where add_argument defines --versions) to state that if
not provided the command falls back to the newest BoostVersion row in the
database (not GitHub API), and mention the resolver used (_resolve_versions) so
operators know where the default comes from.
- Around line 525-540: The current _key in _sort_versions_by_db re-sorts unknown
versions by their string value, losing the caller's original order; fix it by
capturing the original positions (e.g., index_map = {v: i for i, v in
enumerate(versions)}) and change _key(v: str) to return (created_at is None,
created_at if created_at is not None else index_map[v], v) so known DB versions
sort by version_created_at and unknown versions are appended in the original
caller order; update references to db_versions and _key in the
_sort_versions_by_db function accordingly.
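The stable-sort key is easy to verify with plain data. Timestamps are ints here for brevity; in the real command they would be `version_created_at` values, and `created_at` is a hypothetical lookup dict standing in for the DB rows.

```python
def sort_versions_by_db(versions, created_at):
    """Known versions sort by their DB timestamp; unknown versions sort
    after all known ones, preserving the caller's original order."""
    index_map = {v: i for i, v in enumerate(versions)}
    def key(v):
        ts = created_at.get(v)
        # First element groups known vs unknown, so the second element
        # only ever compares timestamps with timestamps, indices with indices.
        return (ts is None, ts if ts is not None else index_map[v])
    return sorted(versions, key=key)
```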
- Around line 484-485: The index_url construction (variable index_url used with
urljoin to compute doc_root_url) lacks a trailing slash so urljoin treats
lib_key as a filename and can strip it when lib_doc is a relative path; fix by
ensuring index_url ends with a slash (e.g., append "/" to the f-string that
builds index_url or call a helper to ensure a trailing slash) so that
urljoin(index_url, lib_doc) preserves the lib_key component; update the
assignment of index_url (and any related usage before urljoin) to include the
trailing slash while leaving the urljoin(doc_root_url) logic intact.
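The `urljoin` behavior behind this fix is worth seeing directly. A stdlib demo using the algorithm library path from the review (no network involved): without the trailing slash, the last path segment is treated as a filename and replaced.

```python
from urllib.parse import urljoin

# Base treated as a file: "algorithm" is dropped from the result.
without_slash = urljoin(
    "https://www.boost.org/doc/libs/1_90_0/libs/algorithm",
    "doc/html/index.html",
)
# Base treated as a directory: "algorithm" is preserved.
with_slash = urljoin(
    "https://www.boost.org/doc/libs/1_90_0/libs/algorithm/",
    "doc/html/index.html",
)
```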
In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 134-141: The code currently treats any falsy page_content from
workspace.load_page_by_url(...) as "missing" and skips it; change the check to
explicitly test for None (i.e., page_content is None) so that valid empty
strings ("") are accepted and only actual None results are skipped; update the
conditional around page_content in the preprocessor function (reference:
page_content and workspace.load_page_by_url) to use an identity comparison to
None and leave the warning/continue behavior only for None.
- Around line 104-116: _get_library_name collapses a BoostDocContent linked to
multiple BoostLibraryDocumentation rows into a single arbitrary name; instead
collect and return all distinct library names from doc_content.library_relations
(via rel.boost_library_version.library.name) in a deterministic order (e.g.,
ordered by id) so shared docs are indexed under every associated library; rename
to _get_library_names (or change return type to List[str]) and update callers to
accept and emit multiple names; apply the same fix to the similar code at the
145-154 section.
In `@boost_library_docs_tracker/tests/test_services.py`:
- Around line 69-82: The test
test_get_or_create_doc_content_updates_scraped_at_on_unchanged currently uses a
non-strict timestamp assertion; update it to freeze or patch time around the
second call to services.get_or_create_doc_content so you can assert
obj2.scraped_at is strictly greater than old_scraped_at (e.g., use
freezegun.freeze_time or monkeypatch datetime/now, call
get_or_create_doc_content inside the advanced time context, then refresh_from_db
and assert obj2.scraped_at > old_scraped_at) to ensure the timestamp was
actually bumped.
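freezegun is one option; a stdlib alternative injects the clock so the second timestamp is strictly greater by construction. `touch` is a hypothetical stand-in for the `scraped_at` bump in `get_or_create_doc_content`.

```python
import datetime as dt

def touch(record, clock=dt.datetime.now):
    """Stand-in for the scraped_at bump; `clock` is injectable for tests."""
    record["scraped_at"] = clock()
    return record

# Deterministic clocks make a strict "greater than" assertion safe.
t1 = dt.datetime(2024, 1, 1, 12, 0, 0)
t2 = t1 + dt.timedelta(seconds=30)
rec = touch({}, clock=lambda: t1)
old_scraped_at = rec["scraped_at"]
touch(rec, clock=lambda: t2)
```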
In `@docs/boost_library_docs_tracker.md`:
- Around line 159-166: The checklist wrongly instructs integrators to run
`python manage.py makemigrations boost_library_docs_tracker`; since the app
ships committed migrations, remove that step and instead only instruct running
`python manage.py migrate` to apply the existing migrations; keep the other
items (adding "boost_library_docs_tracker" to INSTALLED_APPS and
_WORKSPACE_APP_SLUGS, adding BOOST_DOCS_* settings and env defaults, adding
"run_boost_library_docs_tracker" to COLLECTOR_COMMANDS, and adding
beautifulsoup4 and lxml to requirements.txt) unchanged.
- Around line 88-90: Update the docs to state that the CLI flag --max-pages only
applies when running with --dry-run: clarify that in
boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
the code sets max_pages = max_pages if dry_run else None, so in non-dry-run
(production) runs there is no page cap and --max-pages is ignored; change the
table row for `--max-pages N` to indicate “applies to --dry-run only; no cap in
real runs” and add a short note referencing the dry-run behavior.
- Around line 22-44: The directory tree is missing several files and later
migrations referenced elsewhere; update the diagram to reflect the app's actual
layout by adding entries for html_to_md.py, preprocessor.py, workspace.py,
fetcher variants (local_scrape or pinecone integrations), and include additional
migration files (e.g., later migration .py names) and any service helper files
so readers can find local-scrape and Pinecone pieces; locate and update the
diagram that lists boost_library_docs_tracker (referencing filenames
html_to_md.py, preprocessor.py, workspace.py, services.py, fetcher.py and
migrations/*) so it matches the real repository structure.
---
Duplicate comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 323-329: The crawl scope check uses the original doc_root_url but
links are resolved against final_url, so when the seed redirected the prefix
filter (doc_root_url) can incorrectly exclude valid sibling pages; update the
scope comparison to use the canonical root derived from final_url (e.g., compute
canonical_doc_root = urljoin(final_url, "/") or otherwise normalize final_url)
and replace doc_root_url with that canonical root in the abs_url.startswith(...)
check (variables to locate: abs_url, final_url, doc_root_url, visited, queue).
- Around line 255-258: The current substring check using lib_key against
linked.as_posix() is unsafe; instead ensure linked is actually inside the crawl
start tree by testing ancestry. Replace the substring test with a proper Path
containment check: compute the actual start root used for this BFS (the
directory that file_path started from or the library root) and then verify
linked.is_relative_to(start_root) (or manually walk linked.parents to find
start_root) before continuing; keep references to the existing variables linked,
file_path and lib_key when locating/deriving the start_root.
---
Nitpick comments:
In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 20-23: The try/except that imports pypandoc should catch
ImportError (or ModuleNotFoundError) instead of a bare Exception; update the
import block around the pypandoc import (the try importing pypandoc) to use
"except ImportError:" (or "except (ImportError, ModuleNotFoundError):" if you
prefer) and keep the fallback pypandoc = None so only import failures are
swallowed.
In `@boost_library_docs_tracker/tests/test_fetcher.py`:
- Around line 51-184: The tests only cover crawl_library_pages but miss
walk_library_html, so add a unit test that exercises walk_library_html to
validate local HTML discovery and canonical URL generation: create a temporary
directory with a root index HTML and linked local pages (including edge cases
like redirects via differing filenames and non-HTML files), call
walk_library_html with the directory path and the expected root URL prefix, and
assert the returned canonical URLs and converted text (use the existing
_mock_pandoc/autouse fixtures and the walk_library_html function name) to ensure
local pages are discovered, non-HTML files are skipped, and final canonical URLs
match expected values.
In `@requirements.txt`:
- Line 14: Pin the pypandoc dependency in requirements.txt to improve
reproducibility: replace the bare "pypandoc" entry with a version-constrained
specifier such as "pypandoc>=1.8.1" (or another vetted minimum) to match the
project's dependency policy; update the requirements.txt entry for pypandoc
accordingly and run your dependency install/test to confirm compatibility.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f7215eca-4e82-4d8b-ae0e-924b74be0216
📒 Files selected for processing (33)
- `.github/workflows/actions.yml`
- `README.md`
- `boost_library_docs_tracker/__init__.py`
- `boost_library_docs_tracker/admin.py`
- `boost_library_docs_tracker/apps.py`
- `boost_library_docs_tracker/fetcher.py`
- `boost_library_docs_tracker/html_to_md.py`
- `boost_library_docs_tracker/management/__init__.py`
- `boost_library_docs_tracker/management/commands/__init__.py`
- `boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
- `boost_library_docs_tracker/migrations/0001_initial.py`
- `boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py`
- `boost_library_docs_tracker/migrations/0003_redesign_per_schema_v10.py`
- `boost_library_docs_tracker/migrations/__init__.py`
- `boost_library_docs_tracker/models.py`
- `boost_library_docs_tracker/preprocessor.py`
- `boost_library_docs_tracker/services.py`
- `boost_library_docs_tracker/tests/__init__.py`
- `boost_library_docs_tracker/tests/fixtures.py`
- `boost_library_docs_tracker/tests/test_fetcher.py`
- `boost_library_docs_tracker/tests/test_models.py`
- `boost_library_docs_tracker/tests/test_services.py`
- `boost_library_docs_tracker/workspace.py`
- `config/settings.py`
- `conftest.py`
- `docs/Contributing.md`
- `docs/Schema.md`
- `docs/Service_API.md`
- `docs/boost_library_docs_tracker.md`
- `docs/service_api/README.md`
- `docs/service_api/boost_library_docs_tracker.md`
- `requirements.txt`
- `workflow/management/commands/run_all_collectors.py`
♻️ Duplicate comments (9)
docs/boost_library_docs_tracker.md (3)
166-166: ⚠️ Potential issue | 🟡 Minor

Remove `makemigrations` from the setup steps.

This app already ships migrations, so regenerating them here creates local drift instead of applying the checked-in schema.

♻️ Suggested edit

```diff
-6. Run `python manage.py makemigrations boost_library_docs_tracker` and `python manage.py migrate`.
+6. Run `python manage.py migrate`.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/boost_library_docs_tracker.md` at line 166, Remove the instruction to run `python manage.py makemigrations boost_library_docs_tracker` because the app already ships migrations; update the setup step so it only instructs running `python manage.py migrate` (keep the existing `migrate` command and delete the `makemigrations` command text), ensuring the docs no longer suggest regenerating migrations that cause local drift.
88-90: ⚠️ Potential issue | 🟡 Minor

Document `--max-pages` as `--dry-run`-only.

The command clears the page cap on real runs, so this row currently promises a production limit that never applies.

Based on learnings, in `boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` the logic `max_pages = max_pages if dry_run else None` is intentional; `--max-pages` only applies during `--dry-run`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/boost_library_docs_tracker.md` around lines 88 - 90, The docs table incorrectly implies --max-pages applies to real runs; update the documentation row for `--max-pages` to indicate it only applies during `--dry-run` (reflecting the logic in run_boost_library_docs_tracker.py where `max_pages = max_pages if dry_run else None`), e.g., change the description to state “Per-library page cap for the BFS crawl (applies only during --dry-run)”; reference the `max_pages` variable and `dry_run` flag when making the wording change.
22-44: ⚠️ Potential issue | 🟡 Minor

Refresh the tree and tag the fence as `text`.

This directory sketch still omits modules and migrations the rest of the guide points readers to (`html_to_md.py`, `preprocessor.py`, `workspace.py`, later migrations, etc.), and the unlabeled fence will keep tripping MD040.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/boost_library_docs_tracker.md` around lines 22 - 44, The directory sketch is outdated and the fenced block lacks a language tag; update the tree to include the omitted modules (add entries for html_to_md.py, preprocessor.py, workspace.py and any later migration files referenced elsewhere) and change the Markdown fence from ``` to ```text so the snippet is tagged as text; locate the block in docs/boost_library_docs_tracker.md (the tree listing shown) and replace it with the refreshed listing including those filenames and the fenced code tag.

boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (4)
484-485: ⚠️ Potential issue | 🟠 Major

Keep `lib_key` in the crawl root.

Without a trailing slash, `urljoin()` treats `index_url` as a file path and can replace `lib_key` when `lib_doc` is relative, yielding the wrong documentation root.

🩹 Suggested fix

```diff
-    index_url = f"{base}/doc/libs/{url_version}/libs/{lib_key}"
+    index_url = f"{base}/doc/libs/{url_version}/libs/{lib_key}/"
     doc_root_url = urljoin(index_url, lib_doc) if lib_doc else index_url
```

Expected result: the printed URLs drop `/algorithm/`, confirming the current base is treated as a file path.

```bash
#!/bin/bash
python - <<'PY'
from urllib.parse import urljoin
base = "https://www.boost.org/doc/libs/1_90_0/libs/algorithm"
for rel in ("doc/html/index.html", "index.html", "doc/"):
    print(f"{rel:20} -> {urljoin(base, rel)}")
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 484 - 485, The calculated index_url is missing a trailing slash so urljoin treats it as a file and can drop lib_key when lib_doc is relative; update the construction of index_url (used to compute doc_root_url with urljoin) to ensure the URL ends with a slash (e.g., append "/" to f"{base}/doc/libs/{url_version}/libs/{lib_key}") so that urljoin preserves lib_key as the crawl root when resolving lib_doc.
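The trailing-slash behavior this finding relies on can be checked with nothing but the standard library. A minimal sketch; the URLs are illustrative, not taken from a real run:

```python
from urllib.parse import urljoin

# Without a trailing slash, urljoin() treats the last path segment
# ("algorithm") as a file name and replaces it when joining a relative
# path, so the library key vanishes from the resulting root.
base = "https://www.boost.org/doc/libs/1_90_0/libs/algorithm"
print(urljoin(base, "doc/html/index.html"))
# -> https://www.boost.org/doc/libs/1_90_0/libs/doc/html/index.html

# With the trailing slash, "algorithm" is kept as the crawl root.
print(urljoin(base + "/", "doc/html/index.html"))
# -> https://www.boost.org/doc/libs/1_90_0/libs/algorithm/doc/html/index.html
```

This is why appending `/` to the computed `index_url` is enough to fix the resolution.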
27-30: ⚠️ Potential issue | 🟡 Minor

Align the version docs with the actual resolver.

These examples still use bare versions, and the help text says the default comes from the GitHub API. The command actually resolves the newest `BoostVersion` row from the DB and later looks up the exact stored `BoostVersion.version`, so the examples/help should use `boost-*` values. Based on learnings, `BoostVersion.version` is stored with the `boost-` prefix in this project.

♻️ Suggested edit

```diff
- python manage.py run_boost_library_docs_tracker --versions 1.86.0 1.87.0
+ python manage.py run_boost_library_docs_tracker --versions boost-1.86.0 boost-1.87.0
```

```diff
- "One or more Boost versions to scrape (e.g. 1.86.0 1.87.0). "
- "Defaults to latest release from GitHub API."
+ "One or more Boost versions to scrape (e.g. boost-1.86.0 boost-1.87.0). "
+ "Defaults to the newest BoostVersion row in the database."
```

Also applies to: 67-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 27 - 30, Update the usage/help examples in run_boost_library_docs_tracker to use the actual stored BoostVersion.version format (the "boost-*" prefix) instead of bare numeric versions: change examples like "--versions 1.86.0" to "--versions boost-1.86.0" and any mention in help text that the default comes from GitHub API to indicate it resolves the newest BoostVersion row from the DB and uses BoostVersion.version (e.g., "boost-<semver>"); also apply the same substitution to the other example block referenced (lines ~67-70) so all examples and help strings consistently use the boost-* values.
89-94: ⚠️ Potential issue | 🟡 Minor

Make the help text explicit that `--max-pages` is dry-run only.

`_process_library()` disables the cap outside `--dry-run`, so the current help string overstates what this flag does in production. Based on learnings, `max_pages = max_pages if dry_run else None` is intentional in this command; `--max-pages` only applies during `--dry-run`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 89 - 94, Update the argparse help for the "--max-pages" option to explicitly state it only applies during dry-run mode: change the help string for the option defined with default=DEFAULT_MAX_PAGES (and metavar="N") to indicate that the per-library page cap is enforced only when --dry-run is used because _process_library() sets max_pages = max_pages if dry_run else None; reference the flag name "--max-pages", the DEFAULT_MAX_PAGES constant, and the _process_library() behavior in your wording so users understand this is a dry-run-only constraint.

boost_library_docs_tracker/preprocessor.py (2)
525-540: ⚠️ Potential issue | 🟡 Minor

Preserve caller order for versions missing from the DB.

The docstring says unknown versions are appended in original order, but the current key falls back to `v`, so those entries are re-sorted lexicographically instead.

♻️ Suggested fix

```diff
 def _sort_versions_by_db(versions: list[str]) -> list[str]:
@@
     db_versions = {
         bv.version: bv.version_created_at
         for bv in BoostVersion.objects.filter(version__in=versions).only(
             "version", "version_created_at"
         )
     }
+    original_index = {version: i for i, version in enumerate(versions)}

     def _key(v: str):
         created_at = db_versions.get(v)
-        return (created_at is None, created_at or "", v)
+        if created_at is None:
+            return (1, "", original_index[v])
+        return (0, created_at, original_index[v])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py` around lines 525 - 540, The _sort_versions_by_db function currently uses the version string v as the final sort tie-breaker which causes unknown versions to be re-sorted lexicographically; instead capture the original input order and use that as the tie-breaker so unknown DB versions retain caller order: create an index map from the input list (e.g. index_map = {v: i for i, v in enumerate(versions)}) inside _sort_versions_by_db and change the key function (_key) to return (created_at is None, created_at or "", index_map[v]) so missing versions are appended in their original order; reference: _sort_versions_by_db, db_versions, _key.boost_library_docs_tracker/preprocessor.py (2)
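The suggested key can be exercised without Django. In the sketch below, `db_versions` is a hypothetical stand-in for the `BoostVersion` queryset lookup (version string → creation timestamp), and the function name and sample data are illustrative only:

```python
def sort_versions(versions: list[str], db_versions: dict[str, str]) -> list[str]:
    """Known versions sort by creation date; unknown ones go last, in caller order."""
    original_index = {v: i for i, v in enumerate(versions)}

    def key(v: str):
        created_at = db_versions.get(v)
        if created_at is None:
            # Unknown in DB: sort after known versions, by input position.
            return (1, "", original_index[v])
        return (0, created_at, original_index[v])

    return sorted(versions, key=key)


db = {"boost-1.86.0": "2024-08-14", "boost-1.87.0": "2024-12-12"}
# "zzz-unknown" precedes "aaa-unknown" in the input; a lexicographic
# fallback on the version string would flip them.
print(sort_versions(["zzz-unknown", "boost-1.87.0", "aaa-unknown", "boost-1.86.0"], db))
# -> ['boost-1.86.0', 'boost-1.87.0', 'zzz-unknown', 'aaa-unknown']
```

The index map is the whole fix: the tie-breaker becomes input position instead of the raw string.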
104-116: ⚠️ Potential issue | 🟠 Major

Preserve all library associations in Pinecone metadata.

A single `BoostDocContent` can be linked through multiple `BoostLibraryDocumentation` rows. Picking `.first()` makes the library metadata lossy and order-dependent, so shared pages end up indexed under only one library. Please emit all distinct library names, ideally from prefetched relations so this hot path doesn't add one query per record.

Also applies to: 143-154
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/preprocessor.py` around lines 104 - 116, The helper currently drops multiple library links by calling .first(); change _get_library_name (and the similar block at lines 143-154) to return all distinct library names (e.g., List[str]) instead of a single string, iterate over the prefetched BoostDocContent.library_relations (use select_related("boost_library_version__library") already present) and collect rel.boost_library_version.library.name into a deduplicated list while avoiding extra DB hits, and update any callers to handle a list of names (or a comma-joined string) as metadata for Pinecone.
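Emitting all distinct names needs no extra queries once the relations are prefetched. A minimal sketch over plain strings, assuming each prefetched relation ultimately yields a library name:

```python
def distinct_library_names(names: list[str]) -> list[str]:
    """De-duplicate while preserving first-seen order; dicts keep insertion order."""
    return list(dict.fromkeys(names))


# A page shared by two libraries, reached through three relation rows:
print(distinct_library_names(["algorithm", "range", "algorithm"]))
# -> ['algorithm', 'range']
```

The deduplicated list (or a joined string) can then be stored as the Pinecone metadata value instead of a single `.first()` pick.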
134-141: ⚠️ Potential issue | 🟠 Major

Treat only `None` as a missing workspace page.

An existing converted page can legitimately be `""`. The current falsy check skips those docs forever instead of upserting them once.

🩹 Suggested fix

```diff
-        page_content = workspace.load_page_by_url(doc_content.url)
-        if not page_content:
+        page_content = workspace.load_page_by_url(doc_content.url)
+        if page_content is None:
             logger.warning(
                 "Workspace file missing for url=%s (doc_content_id=%d), skipping.",
                 doc_content.url,
                 doc_content.pk,
             )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_library_docs_tracker/preprocessor.py` around lines 134 - 141, Replace the falsy check on the loaded page with an explicit None comparison so empty strings are treated as valid content: change the conditional that examines the result of workspace.load_page_by_url(doc_content.url) (the page_content variable) from "if not page_content" to "if page_content is None" and keep the existing logger.warning call (referencing logger.warning and doc_content.url/doc_content.pk) and continue behavior for None results.
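The difference between the falsy check and the explicit `None` check, in isolation (the helper names below are hypothetical, not the command's actual code):

```python
def skipped_by_falsy_check(page_content):
    # Original behavior: skips missing files (None) AND legitimately empty pages ("").
    return not page_content


def skipped_by_none_check(page_content):
    # Fixed behavior: only a genuinely missing file (None) is skipped.
    return page_content is None


print(skipped_by_falsy_check(""))   # True  - empty page wrongly skipped forever
print(skipped_by_none_check(""))    # False - empty page still gets upserted
print(skipped_by_none_check(None))  # True  - missing file still skipped
```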
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 484-485: The calculated index_url is missing a trailing slash so
urljoin treats it as a file and can drop lib_key when lib_doc is relative;
update the construction of index_url (used to compute doc_root_url with urljoin)
to ensure the URL ends with a slash (e.g., append "/" to
f"{base}/doc/libs/{url_version}/libs/{lib_key}") so that urljoin preserves
lib_key as the crawl root when resolving lib_doc.
- Around line 27-30: Update the usage/help examples in
run_boost_library_docs_tracker to use the actual stored BoostVersion.version
format (the "boost-*" prefix) instead of bare numeric versions: change examples
like "--versions 1.86.0" to "--versions boost-1.86.0" and any mention in help
text that the default comes from GitHub API to indicate it resolves the newest
BoostVersion row from the DB and uses BoostVersion.version (e.g.,
"boost-<semver>"); also apply the same substitution to the other example block
referenced (lines ~67-70) so all examples and help strings consistently use the
boost-* values.
- Around line 89-94: Update the argparse help for the "--max-pages" option to
explicitly state it only applies during dry-run mode: change the help string for
the option defined with default=DEFAULT_MAX_PAGES (and metavar="N") to indicate
that the per-library page cap is enforced only when --dry-run is used because
_process_library() sets max_pages = max_pages if dry_run else None; reference
the flag name "--max-pages", the DEFAULT_MAX_PAGES constant, and the
_process_library() behavior in your wording so users understand this is a
dry-run-only constraint.
- Around line 525-540: The _sort_versions_by_db function currently uses the
version string v as the final sort tie-breaker which causes unknown versions to
be re-sorted lexicographically; instead capture the original input order and use
that as the tie-breaker so unknown DB versions retain caller order: create an
index map from the input list (e.g. index_map = {v: i for i, v in
enumerate(versions)}) inside _sort_versions_by_db and change the key function
(_key) to return (created_at is None, created_at or "", index_map[v]) so missing
versions are appended in their original order; reference: _sort_versions_by_db,
db_versions, _key.
In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 104-116: The helper currently drops multiple library links by
calling .first(); change _get_library_name (and the similar block at lines
143-154) to return all distinct library names (e.g., List[str]) instead of a
single string, iterate over the prefetched BoostDocContent.library_relations
(use select_related("boost_library_version__library") already present) and
collect rel.boost_library_version.library.name into a deduplicated list while
avoiding extra DB hits, and update any callers to handle a list of names (or a
comma-joined string) as metadata for Pinecone.
- Around line 134-141: Replace the falsy check on the loaded page with an
explicit None comparison so empty strings are treated as valid content: change
the conditional that examines the result of
workspace.load_page_by_url(doc_content.url) (the page_content variable) from "if
not page_content" to "if page_content is None" and keep the existing
logger.warning call (referencing logger.warning and
doc_content.url/doc_content.pk) and continue behavior for None results.
In `@docs/boost_library_docs_tracker.md`:
- Line 166: Remove the instruction to run `python manage.py makemigrations
boost_library_docs_tracker` because the app already ships migrations; update the
setup step so it only instructs running `python manage.py migrate` (keep the
existing `migrate` command and delete the `makemigrations` command text),
ensuring the docs no longer suggest regenerating migrations that cause local
drift.
- Around line 88-90: The docs table incorrectly implies --max-pages applies to
real runs; update the documentation row for `--max-pages` to indicate it only
applies during `--dry-run` (reflecting the logic in
run_boost_library_docs_tracker.py where `max_pages = max_pages if dry_run else
None`), e.g., change the description to state “Per-library page cap for the BFS
crawl (applies only during --dry-run)”; reference the `max_pages` variable and
`dry_run` flag when making the wording change.
- Around line 22-44: The directory sketch is outdated and the fenced block lacks
a language tag; update the tree to include the omitted modules (add entries for
html_to_md.py, preprocessor.py, workspace.py and any later migration files
referenced elsewhere) and change the Markdown fence from ``` to ```text so the
snippet is tagged as text; locate the block in
docs/boost_library_docs_tracker.md (the tree listing shown) and replace it with
the refreshed listing including those filenames and the fenced code tag.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8f65c4eb-c6d0-4575-9fb1-f0e629c0b9e8
📒 Files selected for processing (7)
- .github/workflows/actions.yml
- boost_library_docs_tracker/admin.py
- boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
- boost_library_docs_tracker/models.py
- boost_library_docs_tracker/preprocessor.py
- docs/boost_library_docs_tracker.md
- docs/service_api/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- boost_library_docs_tracker/admin.py
- docs/service_api/README.md
- .github/workflows/actions.yml
Summary by CodeRabbit
New Features
Admin
Documentation
Chores
Tests