
26 feature/boost documentation tracker #101

Merged
snowfox1003 merged 14 commits into cppalliance:develop from henry0816191:26-feature/boost-documentation-tracker
Mar 16, 2026

Conversation


@henry0816191 (Collaborator) commented Mar 6, 2026

Summary by CodeRabbit

  • New Features

    • Boost Library Docs Tracker: scrape/crawl Boost docs (local or remote), convert HTML to Markdown, store deduplicated, versioned pages, workspace helpers, and a management command to run collections with optional Pinecone sync.
  • Admin

    • Admin views for inspecting scraped documentation and related records.
  • Documentation

    • User, service API, and schema docs added; pandoc noted as prerequisite.
  • Chores

    • Added HTML parsing/conversion deps and CI step to install pandoc.
  • Tests

    • New tests and fixtures for fetcher, models, and services.

@henry0816191 henry0816191 self-assigned this Mar 6, 2026
@henry0816191 henry0816191 added the enhancement New feature or request label Mar 6, 2026
coderabbitai Bot commented Mar 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5dd1e18b-d4e9-43e4-9a2f-c2558e6ee9b3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds a new Django app boost_library_docs_tracker providing models, migrations, admin, workspace helpers, fetcher (local ZIP + HTTP crawler), HTML→Markdown conversion, services, preprocessor for Pinecone, a management command to orchestrate scraping/upserts, tests, docs, CI and settings integration.

Changes

  • Models & Migrations / Admin — boost_library_docs_tracker/models.py, boost_library_docs_tracker/migrations/..., boost_library_docs_tracker/admin.py
    Introduces BoostDocContent and BoostLibraryDocumentation models, initial and follow-up migrations (fields, indexes, unique constraints, schema redesign), and admin registrations for both models.
  • App Registration & CI — boost_library_docs_tracker/apps.py, config/settings.py, conftest.py, workflow/management/commands/run_all_collectors.py, .github/workflows/actions.yml
    Adds AppConfig, registers the app in INSTALLED_APPS and workspace slugs, exposes a pytest fixtures plugin, adds a collector command, and adds a CI step to install pandoc.
  • Fetching & Conversion — boost_library_docs_tracker/fetcher.py, boost_library_docs_tracker/html_to_md.py
    Implements ZIP download/extract and an HTTP BFS crawler with URL resolution and logging; adds an HTML→Markdown pipeline (preprocess, pandoc conversion, postprocess) producing cleaned Markdown.
  • Management Command — boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
    New management command handling version resolution, per-version local extract or crawl, workspace writes, DB upserts/linking, optional Pinecone sync, and CLI options.
  • Services & Preprocessor — boost_library_docs_tracker/services.py, boost_library_docs_tracker/preprocessor.py
    Service functions for get-or-create doc content, bulk upsert flags, linking content to library versions, and preprocessing documents for Pinecone (builds document payloads).
  • Workspace I/O — boost_library_docs_tracker/workspace.py
    Workspace and path helpers for zip, extract, and converted Markdown storage; resolves URL→path, saves and loads converted pages.
  • Tests & Fixtures — boost_library_docs_tracker/tests/fixtures.py, boost_library_docs_tracker/tests/test_*.py
    Adds fixtures and unit tests covering the fetcher (mocked HTTP), models, and services behavior and constraints.
  • Docs & Schema — docs/boost_library_docs_tracker.md, docs/service_api/*, docs/Schema.md, docs/Contributing.md, README.md
    New app documentation, service API docs, schema updates, CONTRIBUTING and README notes (pandoc prerequisite).
  • Dependencies — requirements.txt
    Adds parsing/conversion and Pinecone-related dependencies: beautifulsoup4, lxml, pypandoc, pinecone, langchain-text-splitters, langchain-core.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(240,248,255,0.5)
    actor User
    participant CLI as Management Command
    participant DB as Database
    participant Fetcher as Fetcher
    participant Converter as HTML→MD
    participant WS as Workspace
    participant Pinecone as Pinecone
    end

    User->>CLI: run_boost_library_docs_tracker (versions, options)
    CLI->>DB: resolve versions & library list
    alt use_local
        CLI->>Fetcher: download & extract source ZIP
        Fetcher->>WS: store extracted files
    end
    loop per library
        CLI->>Fetcher: walk or crawl pages
        Fetcher->>Converter: convert HTML -> Markdown
        Converter-->>Fetcher: markdown
        Fetcher->>WS: save page markdown
        CLI->>DB: upsert BoostDocContent (url, hash)
        CLI->>DB: link content to BoostLibraryVersion
    end
    alt pinecone sync enabled
        CLI->>DB: select un-upserted docs
        CLI->>CLI: preprocess_for_pinecone -> documents
        CLI->>Pinecone: upsert documents
        CLI->>DB: mark docs is_upserted=true
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested reviewers

  • snowfox1003
  • jonathanMLDev

Poem

"I hopped through docs both near and far,
Saved markdown crumbs from every doc star.
I chewed on HTML, spat out sweet rhyme,
Upserted hashes, one hop at a time.
🐇✨📚"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive. The PR title '26 feature/boost documentation tracker' is vague and does not clearly convey the main changes; it uses a branch naming convention rather than a descriptive summary. Resolution: replace the title with a clear, concise description of the primary change, such as 'Add Boost library documentation tracker with scraping and Pinecone sync capabilities' or 'Introduce documentation content tracking for Boost libraries'.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 81.13%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@henry0816191

@coderabbitai full review


coderabbitai Bot commented Mar 6, 2026

✅ Actions performed

Full review triggered.


coderabbitai Bot left a comment

Actionable comments posted: 16

🧹 Nitpick comments (3)
boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py (1)

10-40: Consider folding this into 0001_initial before merge.

Since this app is new in the same PR, fresh installs will create page_content and status only to drop them immediately here. Squashing now keeps the initial schema aligned with the shipped models and avoids unnecessary DDL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py`
around lines 10 - 40, This migration creates then immediately drops fields
(page_content and status) and adds is_upserted/index; fold these changes into
the initial migration by updating 0001_initial to define the final schema (omit
page_content and status, include is_upserted BooleanField(db_index=True) and the
new Index bl_docs_libver_upserted_ix on ["boost_library_version","is_upserted"])
and then remove or delete 0002_remove_page_content_and_status_add_is_upserted.py
so fresh installs never create-and-drop those columns; ensure migration
dependencies and the migration history are adjusted (delete 0002 or squash into
0001) so the migration graph remains consistent.
docs/boost_library_docs_tracker.md (1)

22-44: Consider adding a language specifier to the fenced code block.

The directory structure code block lacks a language specifier. For consistency and to satisfy linting, consider adding text or plaintext.

📝 Suggested fix
-```
+```text
 boost_library_docs_tracker/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/boost_library_docs_tracker.md` around lines 22 - 44, The fenced
directory-listing code block in docs/boost_library_docs_tracker.md is missing a
language specifier; update the opening fence from ``` to ```text (or
```plaintext) so the block starts with a language tag (e.g., replace the leading
``` before "boost_library_docs_tracker/" with ```text) to satisfy linting and
ensure consistent rendering.
boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (1)

258-275: Consider using logger.exception to preserve stack traces.

When catching exceptions during crawl/walk operations, logger.exception automatically includes the full traceback, which aids debugging. The current logger.error only logs the message.

♻️ Suggested improvement
             except Exception as exc:
-                logger.error("[%s] local walk failed: %s", lib_name, exc)
+                logger.exception("[%s] local walk failed", lib_name)
                 self.stdout.write(
                     self.style.ERROR(f"  [{lib_name}] local walk error: {exc}")
                 )
                 return 0
         else:
             self.stdout.write(f"  [{lib_name}] crawling {doc_root_url} ...")
             try:
                 pages = fetcher.crawl_library_pages(
                     doc_root_url, max_pages=max_pages, delay_secs=0.3
                 )
             except Exception as exc:
-                logger.error("[%s] crawl failed: %s", lib_name, exc)
+                logger.exception("[%s] crawl failed", lib_name)
                 self.stdout.write(
                     self.style.ERROR(f"  [{lib_name}] crawl error: {exc}")
                 )
                 return 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 258 - 275, The except handlers in run_boost_library_docs_tracker.py
that currently call logger.error in the local walk and crawl blocks (the except
blocks around the local walk and around fetcher.crawl_library_pages) should use
logger.exception to automatically record the traceback; update the two
logger.error calls that reference "[%s] local walk failed" and "[%s] crawl
failed" to logger.exception while keeping the existing self.stdout.write calls
intact so the user-facing messages remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 305-330: The crawler currently uses the original request URL
(variable url) when resolving and scoping links even after a redirect (final_url
= resp.url), causing sibling pages to be misresolved or filtered; update the
link resolution and scoping to use the canonical response URL: replace url with
final_url in the urljoin call that computes abs_url and also ensure the
startswith check that compares abs_url against doc_root_url uses the canonical
base (derive a canonical doc_root from final_url if necessary) so that link
resolution and the scope check reference the redirected/canonical location
(affecting the loop that uses BeautifulSoup, a["href"], abs_url, visited, and
queue).
- Around line 78-80: Do not assume presence of zip_path or top_dir equals a
successful previous run; instead make downloads and extracts atomic and verify
completion: download into a temporary file (e.g., zip_path.with_suffix('.part'))
and only rename/move to zip_path when the download finishes and optionally
verify size/checksum, and extract into a temporary directory then atomically
rename to top_dir or write a sentinel like top_dir + '.complete' after
successful extraction; update the code paths that currently check
zip_path.exists() and top_dir.exists() to check the completion marker or
checksum and to remove/replace partial artifacts before retrying so
partial/corrupt files are not reused.
- Around line 247-260: The current substring check using lib_key.split("/")[-1]
is unsafe; instead compute the library root path via source_root /
_library_root_for_key(lib_key) (resolve it) and test ancestry of linked against
that path (e.g., linked.is_relative_to(library_root) or check library_root in
linked.parents or linked == library_root) before enqueueing; update the
conditional that now reads "if lib_key.split... not in linked.as_posix()" to
perform this path-based ancestry check using the computed library_root variable
(ensure you resolve both paths to compare canonical forms).
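The path-ancestry check suggested above can be sketched as follows; the function name is illustrative, and `library_root` would come from `source_root / _library_root_for_key(lib_key)` as described:

```python
from pathlib import Path

def is_within_library_root(linked: Path, library_root: Path) -> bool:
    """True when `linked` is `library_root` itself or lives under it.

    Unlike a substring test on lib_key.split("/")[-1], this cannot be
    fooled by sibling directories whose names merely contain the key
    (e.g. libs/core vs libs/core_ext); both paths are resolved so
    relative and symlinked forms compare canonically.
    """
    linked = linked.resolve()
    library_root = library_root.resolve()
    return linked == library_root or library_root in linked.parents
```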

In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 161-175: The code only recognizes backtick fences; update all
fence handling to accept both backticks and tildes: change _RE_BAD_FENCE to
match either "```" or "~~~" (e.g. ^([`~]{3,}) *programlisting\s*$), update
_RE_FENCED_BLOCK to capture opening/closing fences that may be backticks or
tildes (use a backref for the fence char and length, e.g.
(?:^|\n)([`~]{3,}[^\n]*\n.*?\1{3,})), and modify the fence toggle logic inside
_join_wrapped_lines() to detect/track either backtick or tilde fences (not just
"```"); also audit and update any other regexes that assume only backticks (the
other fence-related occurrences) to ensure consistent handling of ~~~ fences.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Line 15: Fix the typo in the command module's docstring: change "extrated" to
"extracted" in the string inside run_boost_library_docs_tracker.py (the
management command module/class) so the sentence reads "Extract tree is saved in
workspace/boost_library_docs_tracker/extracted/." Ensure you update the
docstring associated with the command class/function in that file.
- Around line 391-400: The conversion using str.isdigit() can misclassify
non-ASCII numerals; update the failed_ids handling to robustly build
int_failed_ids by iterating over failed_ids and attempting int(...) in a
try/except (ValueError) block, only keeping positive integers, then call
services.set_is_upserted_by_ids(int_failed_ids, False) and log via
logger.warning as before (referencing failed_ids, int_failed_ids,
services.set_is_upserted_by_ids, and logger.warning to locate the change).
- Around line 475-488: In the MultipleObjectsReturned except block in
run_boost_library_docs_tracker, avoid calling .first().pk without checking for
None: assign result = BoostLibraryVersion.objects.filter(library__name=lib_name,
version__version=version).first(), then if result is None log an error/warning
(using logger) and either return None or raise BoostLibraryVersion.DoesNotExist;
otherwise return result.pk — this prevents an AttributeError if the row(s) are
deleted between .get() and .first().
- Line 244: The code currently overrides the parsed max_pages value to None when
not in dry_run (max_pages = max_pages if dry_run else None), which disables the
--max-pages limit in real runs; change this so the parsed max_pages is always
respected (remove the conditional override and use the existing max_pages
variable as-is) so the per-library page cap applies in both dry_run and normal
runs, or alternatively update documentation/help strings to explicitly state
that non-dry runs are unlimited if that behavior is intended; refer to the
max_pages variable and dry_run flag in run_boost_library_docs_tracker.py when
making the change.
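The robust id conversion suggested for failed_ids can be sketched like this (the function name is illustrative, not from the codebase):

```python
def to_positive_int_ids(raw_ids) -> list:
    """Keep only values that parse as positive base-10 integers.

    str.isdigit() is True for characters like "²" (superscript two),
    yet int("²") raises ValueError, so isdigit-based filtering can let
    through values that later crash; try/except avoids that entirely.
    """
    result = []
    for raw in raw_ids:
        try:
            value = int(str(raw).strip())
        except ValueError:
            continue
        if value > 0:
            result.append(value)
    return result
```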

In `@boost_library_docs_tracker/services.py`:
- Around line 42-54: Normalize the incoming url by stripping whitespace once
before any validation/DB operations and use that normalized value for the
BoostDocContent lookup and creation; specifically, replace raw uses of the
parameter `url` with a local `normalized_url = url.strip()` (validate against
empty after strip), then call `BoostDocContent.objects.get(url=normalized_url)`
and `BoostDocContent.objects.create(url=normalized_url,
content_hash=content_hash, scraped_at=_now())` so the uniqueness invariant is
enforced consistently.
- Around line 111-115: The bulk-update path in set_is_upserted_by_ids currently
calls
BoostLibraryDocumentation.objects.filter(pk__in=ids).update(is_upserted=value)
which bypasses model.save() and therefore doesn't refresh updated_at; modify
set_is_upserted_by_ids to also set updated_at (e.g., include
updated_at=timezone.now()) in the same QuerySet.update call so it matches
set_is_upserted's behavior and keeps last-modified timestamps consistent.
- Around line 87-89: When updating obj.page_count in the conditional that checks
created and page_count, include the auto-updated timestamp field in the save
call: add "updated_at" to the update_fields list so the model's updated_at is
persisted alongside page_count (matching the pattern used in set_is_upserted).
Locate the block that sets obj.page_count and calls
obj.save(update_fields=["page_count"]) and change the update_fields to include
"updated_at".

In `@boost_library_docs_tracker/tests/test_models.py`:
- Around line 18-24: The test currently catches a broad Exception when calling
baker.make("boost_library_docs_tracker.BoostDocContent",
url=boost_doc_content.url), which can hide unrelated import/setup errors; change
the assertion to expect the concrete DB uniqueness failure (e.g.
pytest.raises(IntegrityError)) and import IntegrityError (from django.db or
django.db.utils), and if needed wrap the create in a transaction.atomic() block
to ensure the DB raises the integrity error for the duplicate URL on the
BoostDocContent model.

In `@boost_library_docs_tracker/workspace.py`:
- Around line 98-120: resolve_path_from_url currently trusts the URL path and
returns get_converted_root() / f"boost_{url_version}" /
Path(*relative_parts).with_suffix(".md"), which allows path traversal via '..'
and doesn't validate the host; update resolve_path_from_url to first verify the
parsed.netloc is an allowed host (e.g., "www.boost.org" or "boost.org"), then
sanitize and normalize relative_parts (reject empty segments, '.' or '..' and
any absolute segments), build the target path and call .resolve() and ensure the
resolved path is a subpath of get_converted_root() / f"boost_{url_version}"
(compare parents) before returning it; if validation fails, return None so
callers like save_page or load_page_by_url cannot escape the workspace.
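A hardened shape for resolve_path_from_url might look like this. It is a sketch: the allowed-host set and the boost_<version> layout come from the comment above, and the real function's version-prefix handling is elided:

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

_ALLOWED_HOSTS = {"www.boost.org", "boost.org"}  # assumption from the comment

def resolve_path_from_url(url: str, converted_root: Path, url_version: str) -> Optional[Path]:
    """Map a doc URL into the workspace, or return None when the host is
    foreign, a segment is a traversal token, or the resolved target would
    escape the per-version root."""
    parsed = urlparse(url)
    if parsed.netloc not in _ALLOWED_HOSTS:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if not parts or any(p in (".", "..") for p in parts):
        return None
    base = (converted_root / f"boost_{url_version}").resolve()
    target = base.joinpath(*parts).with_suffix(".md").resolve()
    if target != base and base not in target.parents:
        return None
    return target
```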

In `@docs/Schema.md`:
- Around line 699-730: The Schema.md docs tracker section is out of sync with
the shipped models: update the mermaid diagram and narrative to match
boost_library_docs_tracker/models.py and the migration 0002 (remove page_content
and status, add is_upserted). Specifically, remove non-existent
columns/relations (first_version_id, last_version_id,
BoostDocContent.is_upserted, unique content_hash and any
BoostVersion→BoostDocContent FK), remove references to page_content in Appendix
A, and add the actual fields present in code
(BoostLibraryDocumentation.page_count, updated_at,
BoostLibraryDocumentation.is_upserted and any true unique/index constraints
present in models). Ensure the diagram and notes reflect the real composite
unique constraint on (boost_library_version_id, boost_doc_content_id) and the
index on (boost_library_version_id, is_upserted) as implemented.

In `@docs/service_api/boost_library_docs_tracker.md`:
- Around line 12-24: The docs are out of sync with the implementation: update
the API doc for get_or_create_doc_content to reflect the actual signature
get_or_create_doc_content(url: str, content_hash: str) (remove the page_content
parameter and DB side-effect descriptions), remove or replace the non-existent
status helper table (lines describing "created/content_changed/unchanged"), and
instead document the actual public functions exported by the module:
get_or_create_doc_content(url, content_hash), link_content_to_library_version,
set_is_upserted, set_is_upserted_by_ids, and get_docs_for_library_version with
brief parameter/return summaries matching their implementations.

In `@requirements.txt`:
- Around line 11-18: requirements.txt declares pypandoc but the Pandoc binary
itself isn't provisioned; update CI and container/dev setup to install the
pandoc binary so pypandoc's fallback to the CLI works reliably. Add installation
steps for pandoc in your Dockerfile (or base image), in CI workflow scripts
(apt/yum/choco/brew as appropriate), and document the dev setup (README) with a
recommendation like "install pandoc (e.g., apt install pandoc or brew install
pandoc)" so that pypandoc in requirements.txt can always find the pandoc
executable at runtime.
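Until provisioning lands everywhere, a startup guard makes a missing binary obvious instead of failing deep inside conversion; a sketch, with the function name and message being illustrative:

```python
import shutil

def ensure_pandoc_available() -> str:
    """Return the resolved pandoc executable path, or raise with install
    guidance. pypandoc only shells out to this binary; it does not bundle
    one unless the pypandoc_binary wheel or download_pandoc() is used."""
    path = shutil.which("pandoc")
    if path is None:
        raise RuntimeError(
            "pandoc binary not found on PATH; install it first, e.g. "
            "`apt install pandoc` or `brew install pandoc`"
        )
    return path
```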


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 954a344a-e052-44b2-b828-56c2184d7bc5

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce6ec1 and f17f42b.

📒 Files selected for processing (30)
  • boost_library_docs_tracker/__init__.py
  • boost_library_docs_tracker/admin.py
  • boost_library_docs_tracker/apps.py
  • boost_library_docs_tracker/fetcher.py
  • boost_library_docs_tracker/html_to_md.py
  • boost_library_docs_tracker/management/__init__.py
  • boost_library_docs_tracker/management/commands/__init__.py
  • boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
  • boost_library_docs_tracker/migrations/0001_initial.py
  • boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py
  • boost_library_docs_tracker/migrations/__init__.py
  • boost_library_docs_tracker/models.py
  • boost_library_docs_tracker/preprocessor.py
  • boost_library_docs_tracker/services.py
  • boost_library_docs_tracker/tests/__init__.py
  • boost_library_docs_tracker/tests/fixtures.py
  • boost_library_docs_tracker/tests/test_fetcher.py
  • boost_library_docs_tracker/tests/test_models.py
  • boost_library_docs_tracker/tests/test_services.py
  • boost_library_docs_tracker/workspace.py
  • config/settings.py
  • conftest.py
  • docs/Contributing.md
  • docs/Schema.md
  • docs/Service_API.md
  • docs/boost_library_docs_tracker.md
  • docs/service_api/README.md
  • docs/service_api/boost_library_docs_tracker.md
  • requirements.txt
  • workflow/management/commands/run_all_collectors.py

Comment thread boost_library_docs_tracker/fetcher.py Outdated
Comment thread boost_library_docs_tracker/fetcher.py
Comment thread boost_library_docs_tracker/fetcher.py
Comment thread boost_library_docs_tracker/html_to_md.py
Comment thread boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py Outdated
Comment thread boost_library_docs_tracker/tests/test_models.py Outdated
Comment thread boost_library_docs_tracker/workspace.py Outdated
Comment thread docs/Schema.md Outdated
Comment thread docs/service_api/boost_library_docs_tracker.md
Comment thread requirements.txt
@henry0816191

@coderabbitai continue to review


coderabbitai Bot commented Mar 6, 2026

@henry0816191 Sure! I'll continue the review of PR #101 right away.

✅ Actions performed

Full review triggered.


coderabbitai Bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (1)
docs/service_api/boost_library_docs_tracker.md (1)

1-39: ⚠️ Potential issue | 🟠 Major

Documentation is out of sync with the actual services.py implementation.

This matches the previous review feedback. The documented API does not reflect the actual public functions:

  • Documented: get_or_create_doc_content(url, page_content, content_hash) → Actual: get_or_create_doc_content(url, content_hash, version_id=None)
  • Documented: mark_relation_running, mark_relation_completed, etc. → Actual: not implemented
  • Documented: get_pending_docs_for_library_version, get_docs_pending_sync → Actual: not implemented

The actual public surface includes: get_or_create_doc_content, set_doc_content_upserted, set_doc_content_upserted_by_ids, get_unupserted_doc_contents, link_content_to_library_version, get_docs_for_library_version.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/service_api/boost_library_docs_tracker.md` around lines 1 - 39, The docs
are out of sync with services.py: update the markdown to match the actual public
API (or vice-versa). Specifically, change the get_or_create_doc_content
signature to get_or_create_doc_content(url, content_hash, version_id=None);
remove or stop documenting
mark_relation_running/mark_relation_completed/mark_relation_failed/get_pending_docs_for_library_version/get_docs_pending_sync
(they don't exist); and instead document the actual public functions
set_doc_content_upserted, set_doc_content_upserted_by_ids,
get_unupserted_doc_contents, link_content_to_library_version, and
get_docs_for_library_version, including their parameter lists, return types, and
behavior (e.g., upsert semantics, status fields updated). Ensure all function
names (get_or_create_doc_content, set_doc_content_upserted,
set_doc_content_upserted_by_ids, get_unupserted_doc_contents,
link_content_to_library_version, get_docs_for_library_version) are present and
described consistently with services.py.
🧹 Nitpick comments (7)
boost_library_docs_tracker/html_to_md.py (1)

20-23: Consider narrowing the exception type.

While catching Exception works for an optional dependency, ImportError would be more precise and avoid masking unexpected issues during import.

Suggested refinement
 try:
     import pypandoc
-except Exception:  # optional runtime dependency
+except ImportError:  # optional runtime dependency
     pypandoc = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/html_to_md.py` around lines 20 - 23, The
try/except around importing pypandoc is too broad; change the except Exception
to except ImportError so only missing-import errors are caught and other
import-time errors are propagated. Update the import block that sets pypandoc =
None (the try: import pypandoc / except ... / pypandoc = None sequence) to catch
ImportError specifically.
boost_library_docs_tracker/tests/test_services.py (1)

43-57: Test name may be misleading—clarify the scenario.

The test is named test_get_or_create_doc_content_content_changed_when_url_differs but the scenario actually tests: "when the same content_hash is found but with a different URL, the URL field is updated." The change_type is "content_changed" because the URL in the existing record differs.

Consider renaming for clarity:

def test_get_or_create_doc_content_updates_url_for_existing_hash():
    """get_or_create_doc_content returns 'content_changed' and updates url when hash exists with different url."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/tests/test_services.py` around lines 43 - 57,
Rename the test function
test_get_or_create_doc_content_content_changed_when_url_differs to something
clearer like test_get_or_create_doc_content_updates_url_for_existing_hash and
update its docstring to state: "get_or_create_doc_content returns
'content_changed' and updates url when hash exists with different url" so the
name and docstring accurately reflect that the function tests updating the URL
for an existing content_hash (refer to the test function and its assertions
calling services.get_or_create_doc_content and checking change_type and
obj2.url).
boost_library_docs_tracker/fetcher.py (3)

154-163: Consider extracting magic library key mappings to a constant.

The special-case handling for numeric/, enable_if, and swap keys is embedded inline. If more special cases emerge, this could become hard to maintain. Consider extracting to a documented constant or configuration.

♻️ Optional: Extract to constant
# Library keys that map to alternate root directories
_LIBRARY_KEY_OVERRIDES = {
    "enable_if": "core",
    "swap": "core",
}
_LIBRARY_KEY_PREFIXES = {
    "numeric/": "numeric",  # preserves full path
}

def _library_root_for_key(lib_key: str) -> Path:
    key = (lib_key or "").strip().strip("/")
    if not key:
        return Path("libs")
    for prefix, subdir in _LIBRARY_KEY_PREFIXES.items():
        if key.startswith(prefix):
            return Path("libs") / key
    for pattern, target in _LIBRARY_KEY_OVERRIDES.items():
        if pattern in key:
            return Path("libs") / target
    return Path("libs") / key.split("/")[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 154 - 163, The
_library_root_for_key function currently embeds special-case strings inline;
extract these mappings into top-level documented constants (e.g.
_LIBRARY_KEY_OVERRIDES = {"enable_if":"core","swap":"core"} and
_LIBRARY_KEY_PREFIXES = {"numeric/":"numeric"}) and update _library_root_for_key
to reference those constants: normalize key as before, check prefixes from
_LIBRARY_KEY_PREFIXES first (preserving full path for matching prefixes), then
check substring patterns from _LIBRARY_KEY_OVERRIDES to map to alternate
subdirs, and fall back to key.split("/")[0]; add brief docstrings for the
constants so future special cases are maintainable and discoverable.

36-41: Global session singleton may cause issues in concurrent/multi-threaded contexts.

The module-level _SESSION singleton is shared across all callers. While this is efficient for single-threaded CLI commands, it could cause issues if this module is ever used in concurrent contexts (e.g., async workers, tests with parallel execution). The requests.Session object is not thread-safe.

For a Django management command this is likely fine, but consider documenting this limitation or using thread-local storage if broader usage is anticipated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 36 - 41, The module-level
_SESSION singleton used by _get_session can cause problems in
concurrent/threaded contexts; replace it with thread-local storage by
introducing a threading.local() container (e.g., _LOCAL) and store the
requests.Session instance on that container (e.g., _LOCAL.session) inside
_get_session so each thread gets its own Session while preserving the User-Agent
header, or alternatively change _get_session to always return a new
requests.Session if thread-local is not desired; update imports to include
threading and ensure references to _SESSION are switched to the new thread-local
attribute (look for _get_session and _SESSION in this file).

78-80: Resume-safe checks are commented out.

The code that skipped re-downloading and re-extracting when files already exist is commented out. This addresses the past review concern about incomplete artifacts being reused, but now every run re-downloads and re-extracts the full source zip.

If the intent is to keep downloads atomic, consider implementing a completion marker (e.g., zip_path.with_suffix('.complete')) or checksum verification so that valid downloads can be reused without risk of partial artifacts.

Also applies to: 134-138

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 78 - 80, The resume-safe
existence checks around zip_path were commented out, forcing redownloads;
restore resume logic by checking for a validated completion marker or checksum
before skipping download/extract (e.g., check zip_path.exists() AND
zip_path.with_suffix('.complete').exists() or verify a saved checksum), and when
downloading write the zip atomically and create the .complete marker (or write
checksum) only after successful download and extraction; update the code paths
that currently reference zip_path to use this marker/verification so partial
artifacts are never treated as complete.
boost_library_docs_tracker/preprocessor.py (1)

103-115: N+1 query pattern in _get_library_name.

Each call executes a separate query (doc_content.library_relations.select_related(...).first()). When processing many records, this results in N additional queries. Since _select_records doesn't prefetch library_relations, consider prefetching in the queryset or batch-loading library names.

♻️ Proposed optimization using prefetch_related

In _select_records, add prefetch for the relation chain:

+from django.db.models import Prefetch
+from .models import BoostLibraryDocumentation
+
 def _select_records(
     failed_ids: list[str],
     final_sync_at: datetime | None,
 ) -> list[BoostDocContent]:
     ...
     qs = (
         BoostDocContent.objects.filter(query)
         .select_related("first_version", "last_version")
+        .prefetch_related(
+            Prefetch(
+                "library_relations",
+                queryset=BoostLibraryDocumentation.objects.select_related(
+                    "boost_library_version__library"
+                ).order_by("id"),
+            )
+        )
         .order_by("id")
     )
     return list(qs)

Then simplify _get_library_name to use the prefetched data:

 def _get_library_name(doc_content: BoostDocContent) -> str:
-    rel = (
-        doc_content.library_relations.select_related("boost_library_version__library")
-        .order_by("id")
-        .first()
-    )
+    # Use prefetched data (already ordered by id)
+    rels = list(doc_content.library_relations.all())
+    rel = rels[0] if rels else None
     if rel is None:
         return ""
     return rel.boost_library_version.library.name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/preprocessor.py` around lines 103 - 115, The
function _get_library_name triggers an N+1 DB query by calling
doc_content.library_relations.select_related(...).first() for each
BoostDocContent; modify the record selection in _select_records to prefetch the
relation chain
(prefetch_related("library_relations__boost_library_version__library") or
equivalent) or batch-load a mapping of BoostDocContent → library name, then
simplify _get_library_name to read from the already-prefetched
doc_content.library_relations (e.g., use the first item on the prefetched
relation list) so no extra queries are executed when iterating many
BoostDocContent instances.
boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (1)

265-270: Consider using logger.exception to preserve stack traces.

Using logger.error loses the stack trace information. For debugging failed crawls/walks, having the full traceback is valuable. The static analysis hints also suggest this change.

♻️ Use logger.exception for traceback
             except Exception as exc:
-                logger.error("[%s] local walk failed: %s", lib_name, exc)
+                logger.exception("[%s] local walk failed: %s", lib_name, exc)
                 self.stdout.write(
                     self.style.ERROR(f"  [{lib_name}] local walk error: {exc}")
                 )
                 return 0
         else:
             ...
             except Exception as exc:
-                logger.error("[%s] crawl failed: %s", lib_name, exc)
+                logger.exception("[%s] crawl failed: %s", lib_name, exc)
                 self.stdout.write(
                     self.style.ERROR(f"  [{lib_name}] crawl error: {exc}")
                 )

Apply similar changes to lines 323, 333, 346, and 392.

Also applies to: 277-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 265 - 270, Replace the current logger.error calls in the exception
handlers with logger.exception (or logger.error(..., exc_info=True)) so the
traceback is preserved: update the handler that logs "local walk failed" (where
logger.error("[%s] local walk failed: %s", lib_name, exc) and
self.stdout.write(self.style.ERROR(f"  [{lib_name}] local walk error: {exc}"))
are used) as well as the other except blocks referenced (the blocks around the
ranges noted — 277-282, and the handlers at the locations corresponding to lines
323, 333, 346, and 392) to call logger.exception(...) instead of
logger.error(...), keeping the same message content and leaving the existing
stdout writes intact so stack traces are included in the logs while user-facing
messages remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/actions.yml:
- Around line 79-80: The workflow step "Install system dependencies" currently
runs "sudo apt-get install -y pandoc" without refreshing apt metadata; modify
that step to run "sudo apt-get update -y" (or combine as "sudo apt-get update -y
&& sudo apt-get install -y pandoc") before installing pandoc so the package
index is fresh and installation won't fail intermittently on GitHub runners.

In `@boost_library_docs_tracker/admin.py`:
- Around line 15-28: The admin config references non-existent fields causing
Django checks to fail; update BoostLibraryDocumentationAdmin to only use fields
present on the model by removing references to is_upserted and page_count from
list_display and removing is_upserted from list_filter, and remove updated_at
from readonly_fields so that list_display contains only id,
boost_library_version, boost_doc_content (or whichever of those actually exist)
and readonly_fields contains created_at only; modify the attributes
list_display, list_filter, search_fields, raw_id_fields, and readonly_fields
within the BoostLibraryDocumentationAdmin class accordingly to match the model's
actual fields.

In `@boost_library_docs_tracker/models.py`:
- Around line 11-18: The docstring claims "one row per URL" but the model
actually enforces uniqueness on content_hash rather than url; update the
docstring to accurately state the deduplication key (content_hash) and that
multiple rows may exist for the same URL when content changes, referencing the
fields content_hash, url, first_version_id, last_version_id, and is_upserted so
readers know which attributes track versions and Pinecone state; alternatively,
if the intent is truly one row per URL, change the model's uniqueness constraint
to use url instead of content_hash and adjust any related logic accordingly
(update the docstring to match whichever rule you choose).

In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 159-164: The code marks records via
services.set_doc_content_upserted_by_ids(ids_to_mark, True) inside
_build_documents before calling sync_to_pinecone, which can leave rows
incorrectly marked if Pinecone upsert fails; change the flow so that
_build_documents returns ids_to_mark but does NOT set is_upserted, and move the
call to services.set_doc_content_upserted_by_ids(ids_to_mark, True) to after
sync_to_pinecone completes successfully (or perform both actions in a single
transactional/try-except block where you only set is_upserted=True on success
and revert on failure); update references in the calling management command to
accept the ids list returned from _build_documents and ensure failed_ids
handling still works with this new ordering.
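
The reordering this entry asks for can be sketched generically; all four callables below are stand-ins for the real helpers named in the prompt (`_build_documents`, `sync_to_pinecone`, `set_doc_content_upserted_by_ids`):

```python
def sync_and_mark(records, build_documents, sync_to_pinecone, mark_upserted):
    """Upsert first, mark second, so a failed sync leaves rows unmarked."""
    documents, ids_to_mark = build_documents(records)
    sync_to_pinecone(documents)  # raises on failure; nothing has been marked yet
    mark_upserted(ids_to_mark, True)  # only reached when the upsert succeeded
    return ids_to_mark
```

If `sync_to_pinecone` raises, `is_upserted` stays False and a retry picks the rows up again.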

In `@docs/boost_library_docs_tracker.md`:
- Around line 3-4: The documentation describes an outdated data model and
execution flow; update the doc to reflect the current schema where
BoostDocContent holds metadata and is_upserted, and BoostLibraryDocumentation is
only the join row (library_version, doc_content). Edit all sections that
reference storing extracted text in a separate DB table, relation-level fields
like status/page_count, and restart/sync logic (including lines referenced
around 52-55, 94-108, 114-132) to instead explain: extraction writes into
BoostDocContent with its metadata and is_upserted flag, the join row is
BoostLibraryDocumentation, how restart logic should check
BoostDocContent.is_upserted and re-run upserts accordingly, and how Pinecone
upserts operate per BoostDocContent; ensure field and model names
(BoostDocContent, BoostLibraryDocumentation, is_upserted) are used accurately
throughout.
- Around line 142-145: The examples for run_boost_library_docs_tracker currently
show bare versions; update them to use the repository's stored
BoostVersion.version format (include the boost- prefix) so they match what's
actually stored and accepted — e.g., change the command examples that call
run_boost_library_docs_tracker --version to pass values like boost-1.85.0 /
boost-1.86.0 and add a short note referencing BoostVersion.version to avoid
confusion.

In `@docs/service_api/README.md`:
- Line 12: The README table row misstates the schema: update the docs tracker
summary so BoostDocContent is described as holding sync state (not merely
URL-unique content) and BoostLibraryDocumentation is described as the join row
(not the per-page status holder); specifically edit the table entry referencing
boost_library_docs_tracker.services / boost_library_docs_tracker to mention
BoostDocContent contains sync state fields and BoostLibraryDocumentation is the
join/linking entity between library-version and page.

---

Duplicate comments:
In `@docs/service_api/boost_library_docs_tracker.md`:
- Around line 1-39: The docs are out of sync with services.py: update the
markdown to match the actual public API (or vice-versa). Specifically, change
the get_or_create_doc_content signature to get_or_create_doc_content(url,
content_hash, version_id=None); remove or stop documenting
mark_relation_running/mark_relation_completed/mark_relation_failed/get_pending_docs_for_library_version/get_docs_pending_sync
(they don't exist); and instead document the actual public functions
set_doc_content_upserted, set_doc_content_upserted_by_ids,
get_unupserted_doc_contents, link_content_to_library_version, and
get_docs_for_library_version, including their parameter lists, return types, and
behavior (e.g., upsert semantics, status fields updated). Ensure all function
names (get_or_create_doc_content, set_doc_content_upserted,
set_doc_content_upserted_by_ids, get_unupserted_doc_contents,
link_content_to_library_version, get_docs_for_library_version) are present and
described consistently with services.py.

---

Nitpick comments:

In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 20-23: The try/except around importing pypandoc is too broad;
change the except Exception to except ImportError so only missing-import errors
are caught and other import-time errors are propagated. Update the import block
that sets pypandoc = None (the try: import pypandoc / except ... / pypandoc =
None sequence) to catch ImportError specifically.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8db6f473-44b0-4a02-96be-70f907849f37

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce6ec1 and 4e9c8b3.

📒 Files selected for processing (33)
  • .github/workflows/actions.yml
  • README.md
  • boost_library_docs_tracker/__init__.py
  • boost_library_docs_tracker/admin.py
  • boost_library_docs_tracker/apps.py
  • boost_library_docs_tracker/fetcher.py
  • boost_library_docs_tracker/html_to_md.py
  • boost_library_docs_tracker/management/__init__.py
  • boost_library_docs_tracker/management/commands/__init__.py
  • boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
  • boost_library_docs_tracker/migrations/0001_initial.py
  • boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py
  • boost_library_docs_tracker/migrations/0003_redesign_per_schema_v10.py
  • boost_library_docs_tracker/migrations/__init__.py
  • boost_library_docs_tracker/models.py
  • boost_library_docs_tracker/preprocessor.py
  • boost_library_docs_tracker/services.py
  • boost_library_docs_tracker/tests/__init__.py
  • boost_library_docs_tracker/tests/fixtures.py
  • boost_library_docs_tracker/tests/test_fetcher.py
  • boost_library_docs_tracker/tests/test_models.py
  • boost_library_docs_tracker/tests/test_services.py
  • boost_library_docs_tracker/workspace.py
  • config/settings.py
  • conftest.py
  • docs/Contributing.md
  • docs/Schema.md
  • docs/Service_API.md
  • docs/boost_library_docs_tracker.md
  • docs/service_api/README.md
  • docs/service_api/boost_library_docs_tracker.md
  • requirements.txt
  • workflow/management/commands/run_all_collectors.py

@henry0816191
Collaborator Author

henry0816191 commented Mar 6, 2026

@coderabbitai continue to review

@coderabbitai

coderabbitai Bot commented Mar 6, 2026

@henry0816191 Sure! I'll continue the review of PR #101 right away.

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai Bot commented Mar 6, 2026

@henry0816191 Sure! Resuming the review of PR #101 now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (2)
boost_library_docs_tracker/fetcher.py (2)

323-329: ⚠️ Potential issue | 🟠 Major

Use the canonical redirect root for crawl scoping too.

Links are resolved against final_url, but the prefix filter still uses the original doc_root_url. If the seed redirects from index.htm to index.html or another canonical location, sibling pages are filtered out and the crawl terminates early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 323 - 329, The crawl
scope check uses the original doc_root_url but links are resolved against
final_url, so when the seed redirected the prefix filter (doc_root_url) can
incorrectly exclude valid sibling pages; update the scope comparison to use the
canonical root derived from final_url (e.g., compute canonical_doc_root =
urljoin(final_url, "/") or otherwise normalize final_url) and replace
doc_root_url with that canonical root in the abs_url.startswith(...) check
(variables to locate: abs_url, final_url, doc_root_url, visited, queue).
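
One way to derive the canonical root from the post-redirect URL, as suggested above; the example URL is illustrative:

```python
from urllib.parse import urljoin


def canonical_doc_root(final_url: str) -> str:
    """Directory of the final (post-redirect) URL, used as the crawl scope prefix."""
    # Joining with "." resolves to the containing directory, trailing slash kept.
    return urljoin(final_url, ".")
```

So if the seed `index.htm` redirects to `.../asio/index.html`, the scope check becomes `abs_url.startswith(canonical_doc_root(final_url))` and sibling pages under `.../asio/` stay in scope.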

255-258: ⚠️ Potential issue | 🟠 Major

Scope the local crawl by ancestry from the actual start tree.

The substring check still breaks /doc/... libraries: those pages usually live under source_root/doc/..., so their descendants never contain lib_key and the BFS stops at the entry page. It can also admit out-of-tree paths that merely contain the token.

🧭 Safer scoping
     base_url = f"{BOOST_ORG_BASE}/doc/libs/{url_version}/"
+    doc = (lib_documentation or "").strip()
+    if doc.startswith("/doc/"):
+        scope_root = (source_root / get_start_path(lib_key, lib_documentation).parts[0]).resolve()
+    else:
+        scope_root = (source_root / _library_root_for_key(lib_key)).resolve()
+
     visited: set[Path] = set()
     queue: deque[Path] = deque([start_file.resolve()])
     results: list[tuple[str, str]] = []
@@
             if not href.endswith(".html") and not href.endswith(".htm"):
                 continue
             linked = (file_path.parent / href).resolve()
-            # Stay within the library subtree
-            if lib_key.split("/")[-1] not in linked.as_posix():
+            try:
+                linked.relative_to(scope_root)
+            except ValueError:
                 continue
             if linked not in visited and linked not in queue:
                 queue.append(linked)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/fetcher.py` around lines 255 - 258, The current
substring check using lib_key against linked.as_posix() is unsafe; instead
ensure linked is actually inside the crawl start tree by testing ancestry.
Replace the substring test with a proper Path containment check: compute the
actual start root used for this BFS (the directory that file_path started from
or the library root) and then verify linked.is_relative_to(start_root) (or
manually walk linked.parents to find start_root) before continuing; keep
references to the existing variables linked, file_path and lib_key when
locating/deriving the start_root.
🧹 Nitpick comments (3)
requirements.txt (1)

14-14: Consider pinning pypandoc version for reproducibility.

Other dependencies specify minimum versions (e.g., beautifulsoup4>=4.12), but pypandoc has no version constraint. While this is unlikely to cause issues, pinning to a minimum version improves build reproducibility.

💡 Suggested change
-pypandoc
+pypandoc>=1.11
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` at line 14, Pin the pypandoc dependency in requirements.txt
to improve reproducibility: replace the bare "pypandoc" entry with a
version-constrained specifier such as "pypandoc>=1.8.1" (or another vetted
minimum) to match the project's dependency policy; update the requirements.txt
entry for pypandoc accordingly and run your dependency install/test to confirm
compatibility.
boost_library_docs_tracker/html_to_md.py (1)

20-23: Consider catching ImportError instead of bare Exception.

While the fallback behavior is correct, catching ImportError would be more precise for module import failures.

Suggested change
 try:
     import pypandoc
-except Exception:  # optional runtime dependency
+except ImportError:  # optional runtime dependency
     pypandoc = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/html_to_md.py` around lines 20 - 23, The
try/except that imports pypandoc should catch ImportError (or
ModuleNotFoundError) instead of a bare Exception; update the import block around
the pypandoc import (the try importing pypandoc) to use "except ImportError:"
(or "except (ImportError, ModuleNotFoundError):" if you prefer) and keep the
fallback pypandoc = None so only import failures are swallowed.
boost_library_docs_tracker/tests/test_fetcher.py (1)

51-184: Add coverage for walk_library_html().

--use-local is a first-class path in this feature, but this module only exercises crawl_library_pages(). A regression in local HTML discovery or canonical URL generation would currently ship untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/tests/test_fetcher.py` around lines 51 - 184, The
tests only cover crawl_library_pages but miss walk_library_html, so add a unit
test that exercises walk_library_html to validate local HTML discovery and
canonical URL generation: create a temporary directory with a root index HTML
and linked local pages (including edge cases like redirects via differing
filenames and non-HTML files), call walk_library_html with the directory path
and the expected root URL prefix, and assert the returned canonical URLs and
converted text (use the existing _mock_pandoc/autouse fixtures and the
walk_library_html function name) to ensure local pages are discovered, non-HTML
files are skipped, and final canonical URLs match expected values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 127-141: The ZIP extraction code using zipfile.ZipFile (variables
zip_path, extract_dir, names, root_name and zf.extractall) must validate each
archive member before extracting to prevent Zip Slip; iterate zf.infolist() and
for each member ensure the member.name is not absolute, does not contain '..',
and that the resolved target path
(extract_dir.joinpath(member.filename).resolve()) is inside
extract_dir.resolve() (or use Path.is_relative_to where available); skip or
raise on invalid entries and avoid using zf.extractall() directly—extract only
validated members and also consider rejecting or safely handling symlinks found
in member external attributes.
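
A self-contained sketch of the validated extraction described above (Python 3.9+ for `Path.is_relative_to`); symlink handling is omitted for brevity:

```python
import zipfile
from pathlib import Path


def safe_extract(zip_path: Path, extract_dir: Path) -> None:
    """Extract only members whose resolved target stays inside extract_dir."""
    dest = extract_dir.resolve()
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            target = (dest / member.filename).resolve()
            # Rejects absolute names and any '..' escape out of dest.
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe zip member: {member.filename}")
            zf.extract(member, dest)
```

Extracting member-by-member after validation avoids handing a hostile archive to `extractall()` unchecked.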

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 67-70: The help text for the --versions CLI flag is misleading;
update the help string (where add_argument defines --versions) to state that if
not provided the command falls back to the newest BoostVersion row in the
database (not GitHub API), and mention the resolver used (_resolve_versions) so
operators know where the default comes from.
- Around line 525-540: The current _key in _sort_versions_by_db re-sorts unknown
versions by their string value, losing the caller's original order; fix it by
capturing the original positions (e.g., index_map = {v: i for i, v in
enumerate(versions)}) and change _key(v: str) to return (created_at is None,
created_at if created_at is not None else index_map[v], v) so known DB versions
sort by version_created_at and unknown versions are appended in the original
caller order; update references to db_versions and _key in the
_sort_versions_by_db function accordingly.
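
A sketch of the order-preserving key described in this entry; `created_at` stands in for the `version_created_at` lookup built from `db_versions`:

```python
from datetime import datetime


def sort_versions_by_db(versions: list[str], created_at: dict[str, datetime]) -> list[str]:
    """Known versions sort by DB timestamp; unknown ones keep the caller's order."""
    index_map = {v: i for i, v in enumerate(versions)}

    def key(v: str):
        ts = created_at.get(v)
        # (unknown-last flag, timestamp-or-original-index, string tiebreaker);
        # the flag keeps datetime and int second elements from being compared.
        return (ts is None, ts if ts is not None else index_map[v], v)

    return sorted(versions, key=key)
```

Unknown versions sort by `index_map[v]`, so they come out in exactly the order the caller passed them.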
- Around line 484-485: The index_url construction (variable index_url used with
urljoin to compute doc_root_url) lacks a trailing slash so urljoin treats
lib_key as a filename and can strip it when lib_doc is a relative path; fix by
ensuring index_url ends with a slash (e.g., append "/" to the f-string that
builds index_url or call a helper to ensure a trailing slash) so that
urljoin(index_url, lib_doc) preserves the lib_key component; update the
assignment of index_url (and any related usage before urljoin) to include the
trailing slash while leaving the urljoin(doc_root_url) logic intact.
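
The trailing-slash behavior this entry relies on can be seen directly with `urljoin`; the boost.org URLs below are illustrative:

```python
from urllib.parse import urljoin

base = "https://www.boost.org/doc/libs/1_85_0/libs/asio"  # no trailing slash
# The last path segment ("asio") is treated as a file and replaced:
print(urljoin(base, "doc/index.html"))
# https://www.boost.org/doc/libs/1_85_0/libs/doc/index.html

base_slash = base + "/"
# With the trailing slash, the library segment is preserved:
print(urljoin(base_slash, "doc/index.html"))
# https://www.boost.org/doc/libs/1_85_0/libs/asio/doc/index.html
```

This is standard RFC 3986 relative-reference resolution, which is why `index_url` must end with `/` before joining a relative `lib_doc`.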

In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 134-141: The code currently treats any falsy page_content from
workspace.load_page_by_url(...) as "missing" and skips it; change the check to
explicitly test for None (i.e., page_content is None) so that valid empty
strings ("") are accepted and only actual None results are skipped; update the
conditional around page_content in the preprocessor function (reference:
page_content and workspace.load_page_by_url) to use an identity comparison to
None and leave the warning/continue behavior only for None.
- Around line 104-116: _get_library_name collapses a BoostDocContent linked to
multiple BoostLibraryDocumentation rows into a single arbitrary name; instead
collect and return all distinct library names from doc_content.library_relations
(via rel.boost_library_version.library.name) in a deterministic order (e.g.,
ordered by id) so shared docs are indexed under every associated library; rename
to _get_library_names (or change return type to List[str]) and update callers to
accept and emit multiple names; apply the same fix to the similar code at the
145-154 section.

In `@boost_library_docs_tracker/tests/test_services.py`:
- Around line 69-82: The test
test_get_or_create_doc_content_updates_scraped_at_on_unchanged currently uses a
non-strict timestamp assertion; update it to freeze or patch time around the
second call to services.get_or_create_doc_content so you can assert
obj2.scraped_at is strictly greater than old_scraped_at (e.g., use
freezegun.freeze_time or monkeypatch datetime/now, call
get_or_create_doc_content inside the advanced time context, then refresh_from_db
and assert obj2.scraped_at > old_scraped_at) to ensure the timestamp was
actually bumped.
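
The comment suggests freezegun; the same strict assertion can be sketched with only the stdlib by patching the timestamp source. `current_time` and `touch` below are hypothetical stand-ins for the service's clock lookup and for `get_or_create_doc_content` bumping `scraped_at`:

```python
from datetime import datetime, timedelta, timezone
from unittest import mock


def current_time() -> datetime:
    """Stand-in for the timestamp source the service consults."""
    return datetime.now(timezone.utc)


def touch(record: dict) -> None:
    """Stand-in for the service bumping scraped_at on an unchanged page."""
    record["scraped_at"] = current_time()


def test_scraped_at_strictly_bumped():
    record: dict = {}
    t0 = datetime(2026, 3, 6, 12, 0, tzinfo=timezone.utc)
    with mock.patch(f"{__name__}.current_time", return_value=t0):
        touch(record)
    old = record["scraped_at"]
    # Advance the clock deterministically, then assert a strict increase.
    with mock.patch(f"{__name__}.current_time", return_value=t0 + timedelta(seconds=1)):
        touch(record)
    assert record["scraped_at"] > old
```

With a controlled clock the test can assert `>` rather than `>=`, which is the stronger check the review asks for.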

In `@docs/boost_library_docs_tracker.md`:
- Around line 159-166: The checklist wrongly instructs integrators to run
`python manage.py makemigrations boost_library_docs_tracker`; since the app
ships committed migrations, remove that step and instead only instruct running
`python manage.py migrate` to apply the existing migrations; keep the other
items (adding "boost_library_docs_tracker" to INSTALLED_APPS and
_WORKSPACE_APP_SLUGS, adding BOOST_DOCS_* settings and env defaults, adding
"run_boost_library_docs_tracker" to COLLECTOR_COMMANDS, and adding
beautifulsoup4 and lxml to requirements.txt) unchanged.
- Around line 88-90: Update the docs to state that the CLI flag --max-pages only
applies when running with --dry-run: clarify that in
boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
the code sets max_pages = max_pages if dry_run else None, so in non-dry-run
(production) runs there is no page cap and --max-pages is ignored; change the
table row for `--max-pages N` to indicate “applies to --dry-run only; no cap in
real runs” and add a short note referencing the dry-run behavior.
- Around line 22-44: The directory tree is missing several files and later
migrations referenced elsewhere; update the diagram to reflect the app's actual
layout by adding entries for html_to_md.py, preprocessor.py, workspace.py,
fetcher variants (local_scrape or pinecone integrations), and include additional
migration files (e.g., later migration .py names) and any service helper files
so readers can find local-scrape and Pinecone pieces; locate and update the
diagram that lists boost_library_docs_tracker (referencing filenames
html_to_md.py, preprocessor.py, workspace.py, services.py, fetcher.py and
migrations/*) so it matches the real repository structure.

---

Duplicate comments:
In `@boost_library_docs_tracker/fetcher.py`:
- Around line 323-329: The crawl scope check uses the original doc_root_url but
links are resolved against final_url, so when the seed redirected the prefix
filter (doc_root_url) can incorrectly exclude valid sibling pages; update the
scope comparison to use the canonical root derived from final_url (e.g., compute
canonical_doc_root = urljoin(final_url, "/") or otherwise normalize final_url)
and replace doc_root_url with that canonical root in the abs_url.startswith(...)
check (variables to locate: abs_url, final_url, doc_root_url, visited, queue).
- Around line 255-258: The current substring check using lib_key against
linked.as_posix() is unsafe; instead ensure linked is actually inside the crawl
start tree by testing ancestry. Replace the substring test with a proper Path
containment check: compute the actual start root used for this BFS (the
directory that file_path started from or the library root) and then verify
linked.is_relative_to(start_root) (or manually walk linked.parents to find
start_root) before continuing; keep references to the existing variables linked,
file_path and lib_key when locating/deriving the start_root.
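The first fetcher comment (the redirect scope bug) comes down to `urljoin` semantics. A minimal sketch with hypothetical Boost URLs: the comment suggests `urljoin(final_url, "/")` or another normalization; this sketch derives the canonical root as the containing directory of `final_url` via `urljoin(final_url, ".")`:

```python
from urllib.parse import urljoin

# Crawl scope was computed from the pre-redirect seed URL...
doc_root_url = "https://www.boost.org/doc/libs/1_86_0/libs/algorithm/"
# ...but the seed redirected, and links are resolved against final_url.
final_url = "https://www.boost.org/doc/libs/1_90_0/libs/algorithm/index.html"

abs_url = urljoin(final_url, "string_algo.html")
print(abs_url.startswith(doc_root_url))        # False: sibling wrongly excluded

# Canonical root derived from final_url: its containing directory.
canonical_doc_root = urljoin(final_url, ".")
print(abs_url.startswith(canonical_doc_root))  # True: sibling stays in scope
```

Any page the seed redirected away from the original prefix would otherwise be silently dropped from `queue` even though it is a legitimate sibling of `final_url`.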
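The second fetcher comment (the substring containment check) can be demonstrated with purely illustrative paths; the directory names below are hypothetical, not the repository's real layout:

```python
from pathlib import Path

start_root = Path("/boost/libs/array/doc")        # root of this BFS crawl
linked = Path("/boost/libs/array_ext/page.html")  # sibling tree, not inside
lib_key = "array"

# Substring test: false positive, because "array" also occurs in "array_ext".
print(lib_key in linked.as_posix())       # True

# Ancestry test (Python 3.9+): only real descendants of start_root pass.
print(linked.is_relative_to(start_root))  # False
inside = start_root / "html" / "index.html"
print(inside.is_relative_to(start_root))  # True
```

On Python < 3.9 the same check can be done by walking `linked.parents` and comparing each ancestor to `start_root`.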

---

Nitpick comments:
In `@boost_library_docs_tracker/html_to_md.py`:
- Around line 20-23: The try/except that imports pypandoc should catch
ImportError (or ModuleNotFoundError) instead of a bare Exception; update the
import block around the pypandoc import (the try importing pypandoc) to use
"except ImportError:" (or "except (ImportError, ModuleNotFoundError):" if you
prefer) and keep the fallback pypandoc = None so only import failures are
swallowed.
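The narrowed import guard looks like this (a sketch of the pattern, not the module's exact code; `convert_html` is an illustrative name):

```python
try:
    import pypandoc  # optional dependency for HTML -> Markdown conversion
except ImportError:  # also covers ModuleNotFoundError, its subclass
    pypandoc = None

def convert_html(html: str) -> str:
    if pypandoc is None:
        raise RuntimeError("pypandoc is not installed; install pypandoc and pandoc")
    return pypandoc.convert_text(html, "gfm", format="html")

# ImportError is the right net: ModuleNotFoundError subclasses it, so the
# single clause still catches a missing module without hiding other bugs.
print(issubclass(ModuleNotFoundError, ImportError))  # True
```

A bare `except Exception:` would also swallow unrelated failures raised during module import (syntax errors in a broken install, environment problems), which is exactly what the review asks to avoid.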

In `@boost_library_docs_tracker/tests/test_fetcher.py`:
- Around line 51-184: The tests only cover crawl_library_pages but miss
walk_library_html, so add a unit test that exercises walk_library_html to
validate local HTML discovery and canonical URL generation: create a temporary
directory with a root index HTML and linked local pages (including edge cases
like redirects via differing filenames and non-HTML files), call
walk_library_html with the directory path and the expected root URL prefix, and
assert the returned canonical URLs and converted text (use the existing
_mock_pandoc/autouse fixtures and the walk_library_html function name) to ensure
local pages are discovered, non-HTML files are skipped, and final canonical URLs
match expected values.

In `@requirements.txt`:
- Line 14: Pin the pypandoc dependency in requirements.txt to improve
reproducibility: replace the bare "pypandoc" entry with a version-constrained
specifier such as "pypandoc>=1.8.1" (or another vetted minimum) to match the
project's dependency policy; update the requirements.txt entry for pypandoc
accordingly and run your dependency install/test to confirm compatibility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f7215eca-4e82-4d8b-ae0e-924b74be0216

📥 Commits

Reviewing files that changed from the base of the PR and between 8ce6ec1 and f417570.

📒 Files selected for processing (33)
  • .github/workflows/actions.yml
  • README.md
  • boost_library_docs_tracker/__init__.py
  • boost_library_docs_tracker/admin.py
  • boost_library_docs_tracker/apps.py
  • boost_library_docs_tracker/fetcher.py
  • boost_library_docs_tracker/html_to_md.py
  • boost_library_docs_tracker/management/__init__.py
  • boost_library_docs_tracker/management/commands/__init__.py
  • boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
  • boost_library_docs_tracker/migrations/0001_initial.py
  • boost_library_docs_tracker/migrations/0002_remove_page_content_and_status_add_is_upserted.py
  • boost_library_docs_tracker/migrations/0003_redesign_per_schema_v10.py
  • boost_library_docs_tracker/migrations/__init__.py
  • boost_library_docs_tracker/models.py
  • boost_library_docs_tracker/preprocessor.py
  • boost_library_docs_tracker/services.py
  • boost_library_docs_tracker/tests/__init__.py
  • boost_library_docs_tracker/tests/fixtures.py
  • boost_library_docs_tracker/tests/test_fetcher.py
  • boost_library_docs_tracker/tests/test_models.py
  • boost_library_docs_tracker/tests/test_services.py
  • boost_library_docs_tracker/workspace.py
  • config/settings.py
  • conftest.py
  • docs/Contributing.md
  • docs/Schema.md
  • docs/Service_API.md
  • docs/boost_library_docs_tracker.md
  • docs/service_api/README.md
  • docs/service_api/boost_library_docs_tracker.md
  • requirements.txt
  • workflow/management/commands/run_all_collectors.py

Comment thread boost_library_docs_tracker/fetcher.py
Comment thread boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py Outdated
Comment thread boost_library_docs_tracker/preprocessor.py
Comment thread boost_library_docs_tracker/preprocessor.py
Comment thread boost_library_docs_tracker/tests/test_services.py Outdated
Comment thread docs/boost_library_docs_tracker.md
Comment thread docs/boost_library_docs_tracker.md
Comment thread docs/boost_library_docs_tracker.md Outdated

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (9)
docs/boost_library_docs_tracker.md (3)

166-166: ⚠️ Potential issue | 🟡 Minor

Remove makemigrations from the setup steps.

This app already ships migrations, so regenerating them here creates local drift instead of applying the checked-in schema.

♻️ Suggested edit
-6. Run `python manage.py makemigrations boost_library_docs_tracker` and `python manage.py migrate`.
+6. Run `python manage.py migrate`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/boost_library_docs_tracker.md` at line 166, Remove the instruction to
run `python manage.py makemigrations boost_library_docs_tracker` because the app
already ships migrations; update the setup step so it only instructs running
`python manage.py migrate` (keep the existing `migrate` command and delete the
`makemigrations` command text), ensuring the docs no longer suggest regenerating
migrations that cause local drift.

88-90: ⚠️ Potential issue | 🟡 Minor

Document --max-pages as --dry-run-only.

The command clears the page cap on real runs, so this row currently promises a production limit that never applies.

Based on learnings, in boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py the logic max_pages = max_pages if dry_run else None is intentional; --max-pages only applies during --dry-run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/boost_library_docs_tracker.md` around lines 88 - 90, The docs table
incorrectly implies --max-pages applies to real runs; update the documentation
row for `--max-pages` to indicate it only applies during `--dry-run` (reflecting
the logic in run_boost_library_docs_tracker.py where `max_pages = max_pages if
dry_run else None`), e.g., change the description to state “Per-library page cap
for the BFS crawl (applies only during --dry-run)”; reference the `max_pages`
variable and `dry_run` flag when making the wording change.

22-44: ⚠️ Potential issue | 🟡 Minor

Refresh the tree and tag the fence as text.

This directory sketch still omits modules and migrations the rest of the guide points readers to (html_to_md.py, preprocessor.py, workspace.py, later migrations, etc.), and the unlabeled fence will keep tripping MD040.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/boost_library_docs_tracker.md` around lines 22 - 44, The directory
sketch is outdated and the fenced block lacks a language tag; update the tree to
include the omitted modules (add entries for html_to_md.py, preprocessor.py,
workspace.py and any later migration files referenced elsewhere) and change the
Markdown fence from ``` to ```text so the snippet is tagged as text; locate the
block in docs/boost_library_docs_tracker.md (the tree listing shown) and replace
it with the refreshed listing including those filenames and the fenced code tag.
boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py (4)

484-485: ⚠️ Potential issue | 🟠 Major

Keep lib_key in the crawl root.

Without a trailing slash, urljoin() treats index_url as a file path and can replace lib_key when lib_doc is relative, yielding the wrong documentation root.

🩹 Suggested fix
-            index_url = f"{base}/doc/libs/{url_version}/libs/{lib_key}"
+            index_url = f"{base}/doc/libs/{url_version}/libs/{lib_key}/"
             doc_root_url = urljoin(index_url, lib_doc) if lib_doc else index_url

Expected result: the printed URLs drop /algorithm/, confirming the current base is treated as a file path.

#!/bin/bash
python - <<'PY'
from urllib.parse import urljoin
base = "https://www.boost.org/doc/libs/1_90_0/libs/algorithm"
for rel in ("doc/html/index.html", "index.html", "doc/"):
    print(f"{rel:20} -> {urljoin(base, rel)}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 484 - 485, The calculated index_url is missing a trailing slash so
urljoin treats it as a file and can drop lib_key when lib_doc is relative;
update the construction of index_url (used to compute doc_root_url with urljoin)
to ensure the URL ends with a slash (e.g., append "/" to
f"{base}/doc/libs/{url_version}/libs/{lib_key}") so that urljoin preserves
lib_key as the crawl root when resolving lib_doc.

27-30: ⚠️ Potential issue | 🟡 Minor

Align the version docs with the actual resolver.

These examples still use bare versions, and the help text says the default comes from the GitHub API. The command actually resolves the newest BoostVersion row from the DB and later looks up the exact stored BoostVersion.version, so the examples/help should use boost-* values.

♻️ Suggested edit
-  python manage.py run_boost_library_docs_tracker --versions 1.86.0 1.87.0
+  python manage.py run_boost_library_docs_tracker --versions boost-1.86.0 boost-1.87.0
-                "One or more Boost versions to scrape (e.g. 1.86.0 1.87.0). "
-                "Defaults to latest release from GitHub API."
+                "One or more Boost versions to scrape (e.g. boost-1.86.0 boost-1.87.0). "
+                "Defaults to the newest BoostVersion row in the database."
Based on learnings, `BoostVersion.version` is stored with the `boost-` prefix in this project.

Also applies to: 67-70

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 27 - 30, Update the usage/help examples in
run_boost_library_docs_tracker to use the actual stored BoostVersion.version
format (the "boost-*" prefix) instead of bare numeric versions: change examples
like "--versions 1.86.0" to "--versions boost-1.86.0" and any mention in help
text that the default comes from GitHub API to indicate it resolves the newest
BoostVersion row from the DB and uses BoostVersion.version (e.g.,
"boost-<semver>"); also apply the same substitution to the other example block
referenced (lines ~67-70) so all examples and help strings consistently use the
boost-* values.

89-94: ⚠️ Potential issue | 🟡 Minor

Make the help text explicit that --max-pages is dry-run only.

_process_library() disables the cap outside --dry-run, so the current help string overstates what this flag does in production.

Based on learnings, max_pages = max_pages if dry_run else None is intentional in this command; --max-pages only applies during --dry-run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 89 - 94, Update the argparse help for the "--max-pages" option to
explicitly state it only applies during dry-run mode: change the help string for
the option defined with default=DEFAULT_MAX_PAGES (and metavar="N") to indicate
that the per-library page cap is enforced only when --dry-run is used because
_process_library() sets max_pages = max_pages if dry_run else None; reference
the flag name "--max-pages", the DEFAULT_MAX_PAGES constant, and the
_process_library() behavior in your wording so users understand this is a
dry-run-only constraint.

525-540: ⚠️ Potential issue | 🟡 Minor

Preserve caller order for versions missing from the DB.

The docstring says unknown versions are appended in original order, but the current key falls back to v, so those entries are re-sorted lexicographically instead.

♻️ Suggested fix
 def _sort_versions_by_db(versions: list[str]) -> list[str]:
@@
     db_versions = {
         bv.version: bv.version_created_at
         for bv in BoostVersion.objects.filter(version__in=versions).only(
             "version", "version_created_at"
         )
     }
+    original_index = {version: i for i, version in enumerate(versions)}
 
     def _key(v: str):
         created_at = db_versions.get(v)
-        return (created_at is None, created_at or "", v)
+        if created_at is None:
+            return (1, "", original_index[v])
+        return (0, created_at, original_index[v])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`
around lines 525 - 540, The _sort_versions_by_db function currently uses the
version string v as the final sort tie-breaker which causes unknown versions to
be re-sorted lexicographically; instead capture the original input order and use
that as the tie-breaker so unknown DB versions retain caller order: create an
index map from the input list (e.g. index_map = {v: i for i, v in
enumerate(versions)}) inside _sort_versions_by_db and change the key function
(_key) to return (created_at is None, created_at or "", index_map[v]) so missing
versions are appended in their original order; reference: _sort_versions_by_db,
db_versions, _key.
boost_library_docs_tracker/preprocessor.py (2)

104-116: ⚠️ Potential issue | 🟠 Major

Preserve all library associations in Pinecone metadata.

A single BoostDocContent can be linked through multiple BoostLibraryDocumentation rows. Picking .first() makes the library metadata lossy and order-dependent, so shared pages end up indexed under only one library. Please emit all distinct library names, ideally from prefetched relations so this hot path doesn’t add one query per record.

Also applies to: 143-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/preprocessor.py` around lines 104 - 116, The
helper currently drops multiple library links by calling .first(); change
_get_library_name (and the similar block at lines 143-154) to return all
distinct library names (e.g., List[str]) instead of a single string, iterate
over the prefetched BoostDocContent.library_relations (use
select_related("boost_library_version__library") already present) and collect
rel.boost_library_version.library.name into a deduplicated list while avoiding
extra DB hits, and update any callers to handle a list of names (or a
comma-joined string) as metadata for Pinecone.
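The multi-library collection can be sketched with dict stand-ins for the prefetched relation rows (the real code would iterate `doc_content.library_relations.all()` and read `rel.boost_library_version.library.name`; the field names in the stubs are illustrative):

```python
def get_library_names(library_relations):
    """Collect distinct library names in deterministic (id) order."""
    names, seen = [], set()
    for rel in sorted(library_relations, key=lambda r: r["id"]):
        name = rel["library_name"]
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names

# A shared doc page linked to two libraries through three relation rows.
rels = [
    {"id": 3, "library_name": "algorithm"},
    {"id": 1, "library_name": "range"},
    {"id": 2, "library_name": "algorithm"},
]
print(get_library_names(rels))  # ['range', 'algorithm']
```

Because the relations are already prefetched with `select_related`, iterating them in Python adds no extra queries, unlike calling `.first()` per record.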

134-141: ⚠️ Potential issue | 🟠 Major

Treat only None as a missing workspace page.

An existing converted page can legitimately be "". The current falsy check skips those docs forever instead of upserting them once.

🩹 Suggested fix
-        page_content = workspace.load_page_by_url(doc_content.url)
-        if not page_content:
+        page_content = workspace.load_page_by_url(doc_content.url)
+        if page_content is None:
             logger.warning(
                 "Workspace file missing for url=%s (doc_content_id=%d), skipping.",
                 doc_content.url,
                 doc_content.pk,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_library_docs_tracker/preprocessor.py` around lines 134 - 141, Replace
the falsy check on the loaded page with an explicit None comparison so empty
strings are treated as valid content: change the conditional that examines the
result of workspace.load_page_by_url(doc_content.url) (the page_content
variable) from "if not page_content" to "if page_content is None" and keep the
existing logger.warning call (referencing logger.warning and
doc_content.url/doc_content.pk) and continue behavior for None results.
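The difference between the falsy check and the identity check is easy to show in isolation (a sketch of the predicate only, not the surrounding loop):

```python
def should_skip(page_content):
    # Old behaviour was `if not page_content:`, which also skips pages whose
    # converted Markdown is legitimately empty. The identity comparison keeps
    # "" in the pipeline and only treats a missing workspace file as absent.
    return page_content is None

print(should_skip(None))                 # True  -> warn and continue
print(should_skip(""))                   # False -> empty page still upserted
print(should_skip("# Boost.Algorithm"))  # False
```

With the falsy check, an empty-but-present page would be skipped on every run and never upserted; with the `is None` check it is upserted once like any other page.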
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py`:
- Around line 484-485: The calculated index_url is missing a trailing slash so
urljoin treats it as a file and can drop lib_key when lib_doc is relative;
update the construction of index_url (used to compute doc_root_url with urljoin)
to ensure the URL ends with a slash (e.g., append "/" to
f"{base}/doc/libs/{url_version}/libs/{lib_key}") so that urljoin preserves
lib_key as the crawl root when resolving lib_doc.
- Around line 27-30: Update the usage/help examples in
run_boost_library_docs_tracker to use the actual stored BoostVersion.version
format (the "boost-*" prefix) instead of bare numeric versions: change examples
like "--versions 1.86.0" to "--versions boost-1.86.0" and any mention in help
text that the default comes from GitHub API to indicate it resolves the newest
BoostVersion row from the DB and uses BoostVersion.version (e.g.,
"boost-<semver>"); also apply the same substitution to the other example block
referenced (lines ~67-70) so all examples and help strings consistently use the
boost-* values.
- Around line 89-94: Update the argparse help for the "--max-pages" option to
explicitly state it only applies during dry-run mode: change the help string for
the option defined with default=DEFAULT_MAX_PAGES (and metavar="N") to indicate
that the per-library page cap is enforced only when --dry-run is used because
_process_library() sets max_pages = max_pages if dry_run else None; reference
the flag name "--max-pages", the DEFAULT_MAX_PAGES constant, and the
_process_library() behavior in your wording so users understand this is a
dry-run-only constraint.
- Around line 525-540: The _sort_versions_by_db function currently uses the
version string v as the final sort tie-breaker which causes unknown versions to
be re-sorted lexicographically; instead capture the original input order and use
that as the tie-breaker so unknown DB versions retain caller order: create an
index map from the input list (e.g. index_map = {v: i for i, v in
enumerate(versions)}) inside _sort_versions_by_db and change the key function
(_key) to return (created_at is None, created_at or "", index_map[v]) so missing
versions are appended in their original order; reference: _sort_versions_by_db,
db_versions, _key.

In `@boost_library_docs_tracker/preprocessor.py`:
- Around line 104-116: The helper currently drops multiple library links by
calling .first(); change _get_library_name (and the similar block at lines
143-154) to return all distinct library names (e.g., List[str]) instead of a
single string, iterate over the prefetched BoostDocContent.library_relations
(use select_related("boost_library_version__library") already present) and
collect rel.boost_library_version.library.name into a deduplicated list while
avoiding extra DB hits, and update any callers to handle a list of names (or a
comma-joined string) as metadata for Pinecone.
- Around line 134-141: Replace the falsy check on the loaded page with an
explicit None comparison so empty strings are treated as valid content: change
the conditional that examines the result of
workspace.load_page_by_url(doc_content.url) (the page_content variable) from "if
not page_content" to "if page_content is None" and keep the existing
logger.warning call (referencing logger.warning and
doc_content.url/doc_content.pk) and continue behavior for None results.

In `@docs/boost_library_docs_tracker.md`:
- Line 166: Remove the instruction to run `python manage.py makemigrations
boost_library_docs_tracker` because the app already ships migrations; update the
setup step so it only instructs running `python manage.py migrate` (keep the
existing `migrate` command and delete the `makemigrations` command text),
ensuring the docs no longer suggest regenerating migrations that cause local
drift.
- Around line 88-90: The docs table incorrectly implies --max-pages applies to
real runs; update the documentation row for `--max-pages` to indicate it only
applies during `--dry-run` (reflecting the logic in
run_boost_library_docs_tracker.py where `max_pages = max_pages if dry_run else
None`), e.g., change the description to state “Per-library page cap for the BFS
crawl (applies only during --dry-run)”; reference the `max_pages` variable and
`dry_run` flag when making the wording change.
- Around line 22-44: The directory sketch is outdated and the fenced block lacks
a language tag; update the tree to include the omitted modules (add entries for
html_to_md.py, preprocessor.py, workspace.py and any later migration files
referenced elsewhere) and change the Markdown fence from ``` to ```text so the
snippet is tagged as text; locate the block in
docs/boost_library_docs_tracker.md (the tree listing shown) and replace it with
the refreshed listing including those filenames and the fenced code tag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8f65c4eb-c6d0-4575-9fb1-f0e629c0b9e8

📥 Commits

Reviewing files that changed from the base of the PR and between 4e9c8b3 and f417570.

📒 Files selected for processing (7)
  • .github/workflows/actions.yml
  • boost_library_docs_tracker/admin.py
  • boost_library_docs_tracker/management/commands/run_boost_library_docs_tracker.py
  • boost_library_docs_tracker/models.py
  • boost_library_docs_tracker/preprocessor.py
  • docs/boost_library_docs_tracker.md
  • docs/service_api/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • boost_library_docs_tracker/admin.py
  • docs/service_api/README.md
  • .github/workflows/actions.yml

@henry0816191 henry0816191 requested a review from snowfox1003 March 9, 2026 03:42
Comment thread boost_library_docs_tracker/fetcher.py
@snowfox1003 snowfox1003 merged commit 656acc4 into cppalliance:develop Mar 16, 2026
2 checks passed

Labels

enhancement New feature or request
