Skip to content

Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)#104

Merged
snowfox1003 merged 23 commits intocppalliance:developfrom
leostar0412:dev-24
Apr 20, 2026
Merged

Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24)#104
snowfox1003 merged 23 commits intocppalliance:developfrom
leostar0412:dev-24

Conversation

@leostar0412
Copy link
Copy Markdown
Collaborator

@leostar0412 leostar0412 commented Mar 10, 2026

Summary

Introduces the WG21 paper tracker: fetches mailings from open-std.org, downloads papers, uploads to GCS, and triggers the Cloud Run conversion job. Includes tests and several robustness improvements.

Changes

New app: wg21_paper_tracker

  • Fetcher: Fetches mailings from the WG21 index; fetches papers per mailing (anchor by id or name); case-insensitive paper_id.
  • Pipeline: Downloads papers with retry (3 attempts) and 30s timeout; uploads to GCS; updates DB; optional Cloud Run job trigger when new papers are uploaded.
  • Models: WG21Mailing, WG21Paper (with year as int, unique_together on (paper_id, year)), WG21PaperAuthor; single migration 0001_initial.
  • Services: get_or_create_mailing, get_or_create_paper (by paper_id + year), mark_paper_downloaded; paper_id/year normalized (lowercase, int year).
  • Management commands:
    • run_wg21_paper_tracker: runs pipeline, triggers Cloud Run if new papers; --dry-run; docstrings added.
    • import_wg21_metadata_from_csv: imports metadata from CSV; handles duplicate paper_id (IntegrityError) by updating existing row; derives year from mailing_date.
  • Admin: WG21Paper list_display and list_filter include year.

Summary by CodeRabbit

  • New Features

    • WG21 Paper Tracker: ingest mailings, papers, authors; admin UI; CSV import; workspace helpers; multi-engine PDF→Markdown conversion with optional Cloud Run trigger; management commands and end-to-end pipeline.
  • New Fields

    • Added author_alias to author profiles; paper/year, mailing linkage, author_order, subgroup, and is_downloaded for WG21 papers.
  • Documentation

    • Cloud Run runbook and updated schema/service API docs.
  • Tests

    • Extensive unit tests for fetcher, pipeline, services, workspace, models, and commands.
  • Chores

    • App registration, migrations, dependencies, Docker packaging, and CI test env update.

@leostar0412 leostar0412 self-assigned this Mar 10, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6af6c8da-d390-4d63-8bc4-7e6dae40f8ab

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new Django app wg21_paper_tracker (models, migrations, admin, services, fetcher, pipeline, workspace, management commands), a containerized Cloud Run PDF→Markdown conversion job with multiple converters, tests, docs, CI/test-settings tweaks, and config constants for GCS/Cloud Run integration.

Changes

Cohort / File(s) Summary
Configuration
config/settings.py, config/test_settings.py, .github/workflows/actions.yml, requirements.txt
Register wg21_paper_tracker in INSTALLED_APPS and WORKSPACE_APP_SLUGS; add WG21/GCP constants (WG21_GCS_BUCKET, GCP_PROJECT_ID, GCP_LOCATION, WG21_CLOUD_RUN_JOB_NAME, WG21_CLOUD_RUN_ENABLED); add TEST_DATABASE_URL handling and CI env var; add external deps.
App bootstrap & admin
wg21_paper_tracker/apps.py, wg21_paper_tracker/admin.py
Add AppConfig and admin registrations (WG21Mailing, WG21Paper, WG21PaperAuthor) with inlines, list_display, filters, search, ordering.
Models & Migrations
wg21_paper_tracker/models.py, wg21_paper_tracker/migrations/0001_initial.py, cppa_user_tracker/models.py, cppa_user_tracker/migrations/...
Introduce WG21Mailing, WG21Paper, WG21PaperAuthor (unique (paper_id, year)); add author_alias to WG21PaperAuthorProfile and corresponding migrations.
Service layer
wg21_paper_tracker/services.py, cppa_user_tracker/services.py
Add transactional upserts and association helpers (mailing/paper creation, placeholder-year promotion, author linking, mark_paper_downloaded) and get_or_create_wg21_paper_author_profile.
Fetching & Workspace
wg21_paper_tracker/fetcher.py, wg21_paper_tracker/workspace.py
Add web-scraping utilities (fetch_all_mailings, fetch_papers_for_mailing) with dedupe and robust HTTP handling; workspace helpers (get_workspace_root, get_raw_dir) with validation and dir creation.
Pipeline & Orchestration
wg21_paper_tracker/pipeline.py, wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py, workflow/management/commands/run_all_collectors.py
Implement run_tracker_pipeline (download, optional GCS upload, DB updates), management commands to run pipeline and optionally trigger Cloud Run job; register command in collector sequence.
Cloud Run conversion app
wg21_paper_tracker/cloud_run_job/main.py, wg21_paper_tracker/cloud_run_job/Dockerfile, wg21_paper_tracker/cloud_run_job/requirements.txt, wg21_paper_tracker/cloud_run_job/converters/*
Add containerized Cloud Run job and converters (Docling, PDFPlumber, OpenAI/OpenRouter OCR flow); main orchestrates GCS listing/download/convert/upload with three-tier converter pipeline and validity filtering.
Importer & Commands
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
Add CSV import command with normalization, dry-run support, IntegrityError handling and statistics tracking.
Tests
wg21_paper_tracker/tests/*, cppa_user_tracker/tests/test_services.py
Add extensive unit tests covering fetcher, models, pipeline, services, workspace, and management commands; network/GCS/converter interactions mocked.
Docs & Schema
docs/Schema.md, docs/operations/WG21_Cloud_Run.md, docs/service_api/cppa_user_tracker.md
Document new schema elements (WG21Mailing, paper year/document_date, author_order, author_alias), Cloud Run operations, and new service API.
Converters export
wg21_paper_tracker/cloud_run_job/converters/__init__.py
Re-export converter APIs: convert_with_docling, convert_with_pdfplumber, convert_with_openai.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Management\nCommand
    participant Fetch as Fetcher
    participant DB as Database
    participant GCS as Google\nCloud Storage
    participant CR as Cloud Run\nJob

    CLI->>Fetch: fetch_all_mailings()
    Fetch-->>CLI: mailings[]

    loop per new mailing
        CLI->>Fetch: fetch_papers_for_mailing(year,date)
        Fetch-->>CLI: papers[]
        CLI->>DB: get_or_create_mailing / get_or_create_paper
        DB-->>CLI: persisted
        CLI->>CLI: _download_file(url)
        CLI->>GCS: upload raw/wg21_paper_tracker/<year>/<mailing>/<file>
        GCS-->>CLI: upload result
    end

    alt total_new_papers > 0 and CloudRun configured
        CLI->>CR: trigger_cloud_run_job(project, location, job)
        CR->>GCS: list raw PDFs
        CR-->>CR: convert_pdf_to_md (Docling → PDFPlumber → OpenAI)
        CR->>GCS: upload converted/wg21_papers/*.md
        GCS-->>CR: upload result
    end

    CLI-->>CLI: return total_new_papers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • jonathanMLDev
  • snowfox1003

Poem

🐰 I hopped through mailings, parsed each link and line,

Chopped PDFs to carrots, then to Markdown so fine,
I matched authors and aliases, stitched emails with care,
Uploaded crumbs to cloud shelves, left traces everywhere,
WG21 tracked — I twitch my whiskers, code light as air.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a WG21 paper tracker with fetch, download, GCS upload, and tests.
Docstring Coverage ✅ Passed Docstring coverage is 83.67% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 10, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

🧹 Nitpick comments (8)
wg21_paper_tracker/tests/test_pipeline.py (1)

49-62: Use tmp_path here instead of /tmp/out.

This writes outside pytest's sandbox and is not portable to Windows or restricted runners. Using tmp_path keeps the test isolated.

♻️ Proposed fix
-def test_download_file_uses_timeout():
+def test_download_file_uses_timeout(tmp_path):
     """_download_file calls requests.get with DOWNLOAD_TIMEOUT."""
     url = "https://example.com/f"
-    filepath = Path("/tmp/out")
+    filepath = tmp_path / "out"
     resp = MagicMock()
     resp.raise_for_status = MagicMock()
     resp.headers = {"content-type": "text/plain"}
     resp.content = b"x"
     resp.apparent_encoding = "utf-8"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 49 - 62, The test
test_download_file_uses_timeout writes to a hardcoded /tmp/out; change it to use
pytest's tmp_path fixture to keep the test sandboxed and portable: update the
test signature to accept tmp_path, replace filepath = Path("/tmp/out") with
filepath = tmp_path / "out" (or similar), and keep the rest of the assertions
the same; this targets the test function test_download_file_uses_timeout and the
_download_file call so no other behavior changes are required.
wg21_paper_tracker/tests/test_models.py (1)

57-76: Add the failing duplicate case for (paper_id, year).

This only proves that different years are allowed. If the unique constraint disappears, this test still passes. Add a same-year duplicate insert that expects IntegrityError.

🧪 Suggested test hardening

Add this import near the top of the file:

from django.db import IntegrityError

Then extend the test like this:

 def test_wg21_paper_unique_together_paper_id_year():
     """WG21Paper allows same paper_id with different year."""
@@
     p2 = WG21Paper.objects.create(
         paper_id="sd-1",
         url="https://example.com/2.pdf",
         title="T2",
         mailing=m2,
         year=2025,
     )
     assert p2.pk is not None
+    with pytest.raises(IntegrityError):
+        WG21Paper.objects.create(
+            paper_id="sd-1",
+            url="https://example.com/3.pdf",
+            title="T3",
+            mailing=m2,
+            year=2025,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_models.py` around lines 57 - 76, Test only
covered different-year case; add a failing same-year duplicate to assert the
unique constraint. Import IntegrityError from django.db at the top of the test
file, then modify test_wg21_paper_unique_together_paper_id_year to attempt
creating another WG21Paper with the same paper_id and same year (use the
existing m1/year=2024) and assert it raises IntegrityError (e.g., with
pytest.raises(IntegrityError)). Keep existing successful different-year creation
(WG21Paper creation for year=2025) unchanged.
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)

220-232: Move import to module level.

Importing get_or_create_wg21_paper_author_profile inside the exception handler is an anti-pattern. If the import fails, it will mask the original IntegrityError with an ImportError, making debugging harder. Move this import to the top of the file.

♻️ Proposed fix

Add to imports at the top of the file (after line 24):

from cppa_user_tracker.services import get_or_create_wg21_paper_author_profile

Then remove the inline import at lines 221-223.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 220 - 232, The inline import of
get_or_create_wg21_paper_author_profile inside the block that creates
WG21PaperAuthor can mask the original IntegrityError; move the import to module
level by adding "from cppa_user_tracker.services import
get_or_create_wg21_paper_author_profile" to the top imports and remove the
inline import inside the loop so get_or_create_wg21_paper_author_profile is
resolved at module import time and won’t convert an IntegrityError into an
ImportError during exception handling.

233-242: Use logging.exception for better diagnostics.

When catching exceptions and logging errors, logger.exception() automatically includes the stack trace without needing to pass the exception object. This aids debugging for unexpected failures.

♻️ Example change for line 235-239
             except Exception as inner:
                 stats["skipped"] += 1
-                logger.error(
+                logger.exception(
                     "Error for paper_id=%s (after IntegrityError): %s",
                     paper_id,
-                    inner,
                 )

Apply similarly to lines 165-169, 210, and 242.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 233 - 242, Replace the logger.error calls in the exception handlers
with logger.exception so stack traces are captured: specifically, in the inner
except Exception as inner block that currently does stats["skipped"] += 1 and
logger.error("Error for paper_id=%s (after IntegrityError): %s", paper_id,
inner), change that to logger.exception with the same message and remove passing
the exception object; likewise change the outer except Exception as e handler
(stats["skipped"] += 1; logger.error("Error for paper_id=%s: %s", paper_id, e))
to use logger.exception, and apply the same replacement to the other similar
catch blocks earlier in the file (the blocks around the IntegrityError handling
and the other exception catches referenced in the review) so all unexpected
exceptions use logger.exception to include stack traces.
wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)

11-20: InputFormat is imported but never used.

The InputFormat symbol is imported from docling.datamodel.base_models but is not used anywhere in the function. Remove it to avoid confusion.

♻️ Proposed fix
 try:
     from docling.document_converter import DocumentConverter  # type: ignore[import-untyped]
-    from docling.datamodel.base_models import InputFormat  # type: ignore[import-untyped]

     DOCLING_AVAILABLE = True
 except ImportError:
     DocumentConverter = None  # type: ignore[assignment,misc]
-    InputFormat = None  # type: ignore[assignment,misc]
     DOCLING_AVAILABLE = False
     logger.warning("Docling not available. Install with: pip install docling")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around
lines 11 - 20, The import of InputFormat from docling.datamodel.base_models is
unused; remove InputFormat from the try-import block so only DocumentConverter
is imported (and keep the DOCLING_AVAILABLE assignment and the except fallback).
Update the top-of-file import statement around DocumentConverter/InputFormat to
drop InputFormat and any related type-ignore comments for it, leaving
DocumentConverter, DOCLING_AVAILABLE, and the logger.warning logic intact.
wg21_paper_tracker/cloud_run_job/main.py (1)

36-42: Clarify the error pattern matching logic.

The conditional if pattern.startswith("error:") or pattern.startswith("exception:") is checking whether the literal pattern string (from the list) starts with these prefixes, not whether the content does. While this works correctly, it's confusing. Consider restructuring for clarity.

♻️ Suggested clarification
-    for pattern in error_patterns:
-        if pattern in first_part:
-            if pattern.startswith("error:") or pattern.startswith("exception:"):
-                return False
-            idx = first_part.find(pattern)
-            if idx < 100:
-                return False
+    # Patterns that invalidate anywhere in first 1000 chars
+    critical_patterns = ["error:", "exception:"]
+    # Patterns that only invalidate if in first 100 chars
+    early_patterns = ["traceback", "failed to", "unable to convert", 
+                      "conversion failed", "error processing"]
+    
+    for pattern in critical_patterns:
+        if pattern in first_part:
+            return False
+    for pattern in early_patterns:
+        idx = first_part.find(pattern)
+        if idx != -1 and idx < 100:
+            return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 36 - 42, The logic in
the loop over error_patterns is confusing because it checks
pattern.startswith("error:")/("exception:") instead of whether the content
(first_part) starts with those prefixes; update the conditional to test
first_part (e.g. first_part.lower().startswith("error:") or
first_part.lower().startswith("exception:")) so the intent is explicit, or
alternatively document/rename entries in error_patterns if they are meant to be
literal sentinel patterns; change the check inside the for-loop that references
error_patterns and first_part accordingly and keep the subsequent idx/threshold
logic unchanged.
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

87-89: Remove redundant exception object from logging.exception.

logger.exception() automatically includes the exception info from the current context. Passing e explicitly is redundant (TRY401).

♻️ Proposed fix
         except Exception as e:
-            logger.exception("WG21 Paper Tracker failed: %s", e)
+            logger.exception("WG21 Paper Tracker failed")
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 87 - 89, In the except block that catches Exception as e in
run_wg21_paper_tracker (the try/except surrounding the WG21 Paper Tracker run),
remove the redundant exception object from the logger.exception call so it reads
logger.exception("WG21 Paper Tracker failed") instead of logger.exception("WG21
Paper Tracker failed: %s", e); keep the existing raise to re-raise the
exception.
wg21_paper_tracker/tests/test_services.py (1)

165-225: Add a same-paper_id/different-year regression for download marking.

This file already proves paper_id can repeat across years, but mark_paper_downloaded is only asserted against a single row. A multi-year case here would catch cross-year updates immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` around lines 165 - 225, Add a new
test that creates two papers with the same paper_id but different years (use
get_or_create_paper and two mailings), call mark_paper_downloaded using the
normalized paper_id for one of them, then refresh both papers and assert that
only the intended year's paper has is_downloaded=True while the other remains
False; name it something like
test_mark_paper_downloaded_respects_year_for_same_paper_id and reuse
mark_paper_downloaded, get_or_create_paper, and get_or_create_mailing to locate
the relevant logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-365: The code uses display_name as the lookup key in
get_or_create_wg21_paper_author_profile but WG21PaperAuthorProfile.display_name
is not unique, causing race-created duplicates and MultipleObjectsReturned; fix
by using a true unique identifier or making the author key unique. Update the
WG21PaperAuthorProfile model to include or mark a unique field (e.g.,
author_key, canonical_name, or external_id) with unique=True and create the
corresponding migration, then change get_or_create_wg21_paper_author_profile to
call WG21PaperAuthorProfile.objects.get_or_create(author_key=..., defaults={
'display_name': display_name_val, ... }) (or perform an atomic
transaction/select_for_update around creation) so lookups use the unique field
instead of display_name.

In `@docs/operations/WG21_Cloud_Run.md`:
- Around line 41-48: The doc shows embedding OPENROUTER_API_KEY via
--set-env-vars in the gcloud run jobs create wg21-convert command which leaks
secrets; update the example to remove OPENROUTER_API_KEY from --set-env-vars and
instead add a short note instructing operators to inject OPENROUTER_API_KEY via
Cloud Run secret injection (or the platform's secret manager/secret-mount flow),
e.g., reference using a secret resource rather than CLI env, and show the safer
pattern around the command that uses only WG21_GCS_BUCKET in --set-env-vars plus
a sentence pointing to the secret-management steps for OPENROUTER_API_KEY.

In `@docs/Schema.md`:
- Around line 637-640: Update the schema documentation to include the
WG21Paper.year field and explicitly state the composite uniqueness constraint
(paper_id, year) wherever WG21Paper is described; update the diagram and the
table rows that list paper_id to show year as a column and change any uniqueness
notes that imply paper_id is globally unique to instead document the composite
unique index on (paper_id, year). Ensure the same change is applied consistently
to the other occurrences mentioned (the other WG21Paper descriptions and summary
rows).

In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 226-256: The code currently treats fallback error stubs as real
output because it only checks markdown_content length; update the conversion
loop (where page_markdown is appended to markdown_parts) to set a boolean flag
(e.g., had_model_output) to True whenever a real page_markdown is produced, and
leave it False for fallback stubs and exceptions; then, before logging success
and returning markdown_content in the end block, check that had_model_output is
True (and that markdown_content.strip() is non-empty) and only then log success
and return content, otherwise log the conversion-as-empty warning and return
None. Ensure the flag is defined before the page loop and reference symbols
page_markdown, markdown_parts, markdown_content, total_pages and pdf_path.name
in your changes.
- Line 1: Add "from __future__ import annotations" at the top of the module so
annotations like Image.Image are not evaluated at import time, and move the
import io to the top-level (outside the try/except) so the fallback path always
has io available; keep the optional pdf2image/Pillow import block setting
PDF2IMAGE_AVAILABLE and binding Image in the except, but with the future-import
and top-level io the functions that reference Image.Image in their annotations
(the functions defined around lines 34, 62 and 94) will no longer cause a
NameError during module import.

In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 55-69: The Markdown table output in the tables loop (in
pdfplumber_converter.py where you iterate "for table in tables" and append rows
into markdown_parts) never emits the required header separator row, so add logic
to treat the first non-empty row of each table as the header and immediately
after appending that first row insert a separator row with the same column count
(e.g., "| --- | --- | ... |"), generating one '---' cell per column (respecting
empty cells) before appending the remaining body rows; update the code around
the table row appends that reference markdown_parts and table to insert this
separator only once per table.

In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 1-24: The image runs as root; create and switch to a dedicated
non-root user to reduce blast radius: add steps in the Dockerfile to create a
sysuser (e.g., "appuser" or UID 1000) with no-login, chown the WORKDIR (/app)
and any written paths to that user, and add a USER appuser directive before CMD
so the container runs unprivileged; keep apt-get and pip install steps running
as root, but ensure files copied into /app are chown'd to the new user (use
chown -R appuser:appuser /app) and then switch to USER appuser after copying and
permission changes.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 19-43: The function is_content_valid accepts None but is annotated
as str; change its type annotation to Optional[str] (or Union[str, None] for
<3.10) and import Optional from typing if not present so the signature matches
actual behavior (update the def is_content_valid parameter type and ensure
typing import exists); no other logic changes required.

In `@wg21_paper_tracker/cloud_run_job/requirements.txt`:
- Line 4: Update the Pillow pin in requirements.txt to require a non-vulnerable
minimum version: replace the existing "Pillow>=10.0.0" requirement with a
minimum of "Pillow>=10.3.0" so the job parsing untrusted PDFs/images excludes
the known vulnerable releases (refer to the requirement entry "Pillow>=10.0.0"
to locate the line to change).

In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The current fetch code swallows network/HTTP errors by
logging and returning [], which hides failures; instead define and raise a typed
fetch exception (e.g., FetchError or HTTPFetchError) and propagate the original
exception (use exception chaining) instead of returning an empty list. Replace
the except block around requests.get(BASE_URL) to raise the new FetchError with
context (include the caught exception and response status when available), and
make the same change for the other fetch block (the similar try/except at lines
65-70) so callers can decide to retry or fail.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 68-70: Accessing settings.GCP_PROJECT_ID, settings.GCP_LOCATION,
and settings.WG21_CLOUD_RUN_JOB_NAME can raise AttributeError if not set; change
those direct attribute accesses (project_id, location, job_name) to use
getattr(settings, 'GCP_PROJECT_ID', <sensible_default>) etc., providing
appropriate defaults or failing fast with a clear error message; update the code
in run_wg21_paper_tracker.py where project_id, location, and job_name are
assigned so it uses getattr(settings, ...) instead of direct attribute access.

In `@wg21_paper_tracker/models.py`:
- Line 6: The current unique_together allows multiple (paper_id, NULL) rows;
replace it with explicit Meta.constraints using two UniqueConstraint entries:
one UniqueConstraint(fields=['paper','year'],
condition=models.Q(year__isnull=False), name='unique_paper_year_not_null') to
enforce uniqueness for non-null years, and a second
UniqueConstraint(fields=['paper'], condition=models.Q(year__isnull=True),
name='unique_paper_null_year') to ensure only one (paper, NULL) row per paper;
update the model Meta (remove unique_together) and import models.Q if needed so
get_or_create_paper(..., year=None) cannot create duplicates.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 200-209: best_paper["filename"] is taken from scraped HTML and
used directly to build local_path (raw_dir / filename) and gcs_path, allowing
path traversal (e.g., "../../etc/passwd") or unexpected blob names; before using
filename in _download_file and _upload_to_gcs, sanitize it by extracting a safe
basename and removing path separators/parent refs (e.g., use
pathlib.Path(filename).name or an equivalent sanitize_name function), reject
empty or suspicious names, and then build local_path and gcs_path from that
sanitized_name (keep mailing_date and raw/wg21_papers prefix as-is) so
_download_file(local_path) and _upload_to_gcs(...) cannot escape the workspace
or create malicious object keys.
- Around line 189-193: The loop currently skips downloads based only on paper_id
and .first(), causing different years with the same ID to be suppressed and
relying on arbitrary ordering; instead, for each paper in papers_by_id[p_id]
check WG21Paper.objects.filter(paper_id=pid, year=paper.year,
is_downloaded=True).exists() (avoid .first()), and increment skipped_downloaded
and skip only that specific paper when the (paper_id, year) tuple is already
downloaded so decisions are deterministic and year-aware.

In `@wg21_paper_tracker/services.py`:
- Around line 86-88: The mark_paper_downloaded function is currently only
filtering by paper_id and therefore updates all WG21Paper rows with that id
across years; change the function signature to accept (or require) a year
parameter and normalize it, then update the WG21Paper.objects.filter(...) call
to include both paper_id and year so you only set is_downloaded=True for the
specific (paper_id, year) row; validate/normalize the inputs (e.g., strip/lower
paper_id and ensure year is an int or valid string) and preserve existing
behavior when inputs are missing by raising or returning early as appropriate.

In `@wg21_paper_tracker/tests/test_fetcher.py`:
- Around line 16-31: Update both the implementation of fetch_all_mailings and
the tests so network errors use requests-specific exceptions: change the
fetch_all_mailings exception handlers that currently catch bare Exception to
catch requests.RequestException (or requests.HTTPError for raise_for_status
paths) and update tests in test_fetcher.py to set side_effect to
requests.RequestException("network error") and to a requests.HTTPError for the
raise_for_status case; ensure patches still target
wg21_paper_tracker.fetcher.requests.get and that response.raise_for_status
side_effect uses the appropriate requests exception class.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 64-74: The tests patching requests.get in
test_download_file_retries_on_failure and
test_download_file_succeeds_on_second_attempt should raise
requests.RequestException rather than a bare Exception so they model real
requests library failures; update the patch side_effects to use
requests.RequestException("connection error") (and the second test's side_effect
sequence) while keeping assertions on _download_file, DOWNLOAD_MAX_RETRIES, and
sleep behavior intact.

In `@wg21_paper_tracker/workspace.py`:
- Around line 18-23: Validate the mailing_date input in get_raw_dir before using
it as a path component: ensure it matches the strict YYYY-MM pattern (e.g., with
a regex like r'^\d{4}-\d{2}$') and reject any value containing path separators
or traversal sequences by raising a ValueError; then only build the path using
get_workspace_path(_RAW_APP_SLUG) and the validated mailing_date so callers
cannot escape the workspace/root directory.

---

Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 11-20: The import of InputFormat from
docling.datamodel.base_models is unused; remove InputFormat from the try-import
block so only DocumentConverter is imported (and keep the DOCLING_AVAILABLE
assignment and the except fallback). Update the top-of-file import statement
around DocumentConverter/InputFormat to drop InputFormat and any related
type-ignore comments for it, leaving DocumentConverter, DOCLING_AVAILABLE, and
the logger.warning logic intact.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 36-42: The logic in the loop over error_patterns is confusing
because it checks pattern.startswith("error:")/("exception:") instead of whether
the content (first_part) starts with those prefixes; update the conditional to
test first_part (e.g. first_part.lower().startswith("error:") or
first_part.lower().startswith("exception:")) so the intent is explicit, or
alternatively document/rename entries in error_patterns if they are meant to be
literal sentinel patterns; change the check inside the for-loop that references
error_patterns and first_part accordingly and keep the subsequent idx/threshold
logic unchanged.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 220-232: The inline import of
get_or_create_wg21_paper_author_profile inside the block that creates
WG21PaperAuthor can mask the original IntegrityError; move the import to module
level by adding "from cppa_user_tracker.services import
get_or_create_wg21_paper_author_profile" to the top imports and remove the
inline import inside the loop so get_or_create_wg21_paper_author_profile is
resolved at module import time and won’t convert an IntegrityError into an
ImportError during exception handling.
- Around line 233-242: Replace the logger.error calls in the exception handlers
with logger.exception so stack traces are captured: specifically, in the inner
except Exception as inner block that currently does stats["skipped"] += 1 and
logger.error("Error for paper_id=%s (after IntegrityError): %s", paper_id,
inner), change that to logger.exception with the same message and remove passing
the exception object; likewise change the outer except Exception as e handler
(stats["skipped"] += 1; logger.error("Error for paper_id=%s: %s", paper_id, e))
to use logger.exception, and apply the same replacement to the other similar
catch blocks earlier in the file (the blocks around the IntegrityError handling
and the other exception catches referenced in the review) so all unexpected
exceptions use logger.exception to include stack traces.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 87-89: In the except block that catches Exception as e in
run_wg21_paper_tracker (the try/except surrounding the WG21 Paper Tracker run),
remove the redundant exception object from the logger.exception call so it reads
logger.exception("WG21 Paper Tracker failed") instead of logger.exception("WG21
Paper Tracker failed: %s", e); keep the existing raise to re-raise the
exception.

In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 57-76: Test only covered different-year case; add a failing
same-year duplicate to assert the unique constraint. Import IntegrityError from
django.db at the top of the test file, then modify
test_wg21_paper_unique_together_paper_id_year to attempt creating another
WG21Paper with the same paper_id and same year (use the existing m1/year=2024)
and assert it raises IntegrityError (e.g., with pytest.raises(IntegrityError)).
Keep existing successful different-year creation (WG21Paper creation for
year=2025) unchanged.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 49-62: The test test_download_file_uses_timeout writes to a
hardcoded /tmp/out; change it to use pytest's tmp_path fixture to keep the test
sandboxed and portable: update the test signature to accept tmp_path, replace
filepath = Path("/tmp/out") with filepath = tmp_path / "out" (or similar), and
keep the rest of the assertions the same; this targets the test function
test_download_file_uses_timeout and the _download_file call so no other behavior
changes are required.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 165-225: Add a new test that creates two papers with the same
paper_id but different years (use get_or_create_paper and two mailings), call
mark_paper_downloaded using the normalized paper_id for one of them, then
refresh both papers and assert that only the intended year's paper has
is_downloaded=True while the other remains False; name it something like
test_mark_paper_downloaded_respects_year_for_same_paper_id and reuse
mark_paper_downloaded, get_or_create_paper, and get_or_create_mailing to locate
the relevant logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 92d9fb12-9816-445e-85e9-0748c89867c7

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 18f07c3.

📒 Files selected for processing (34)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/services.py
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread cppa_user_tracker/services.py
Comment thread docs/operations/WG21_Cloud_Run.md Outdated
Comment thread docs/Schema.md
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/services.py Outdated
Comment thread wg21_paper_tracker/tests/test_fetcher.py
Comment thread wg21_paper_tracker/tests/test_pipeline.py Outdated
Comment thread wg21_paper_tracker/workspace.py Outdated
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 10, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (5)
wg21_paper_tracker/fetcher.py (1)

29-34: Keep the graceful fallback, but log the actual request failure.

These branches intentionally return [], so the log is the only signal maintainers get during upstream outages. Right now the exception details are discarded, which makes diagnosis much harder.

🪵 Proposed logging-only fix
-    except requests.RequestException:
-        logger.error("Failed to fetch WG21 index.")
+    except requests.RequestException:
+        logger.exception("Failed to fetch WG21 index.")
         return []
@@
-    except requests.RequestException:
-        logger.error("Failed to fetch year page %s.", year)
+    except requests.RequestException:
+        logger.exception("Failed to fetch year page %s.", year)
         return []

Based on learnings, returning [] here is intentional, so preserving the exception context in logs is especially important.

Also applies to: 65-70

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/fetcher.py` around lines 29 - 34, The except blocks that
catch requests.RequestException in fetcher.py (the try around
requests.get(f"{BASE_URL}/", timeout=30) and the similar block later) swallow
the exception details; update both handlers to capture the exception (e.g.
except requests.RequestException as e) and log the real error context via
logger.error including the exception (either via exc_info=True or by
interpolating e into the message) while preserving the existing behavior of
returning [] so the graceful fallback remains.
wg21_paper_tracker/cloud_run_job/main.py (1)

67-71: Consider more informative exit on missing environment variable.

Returning silently after logging an error may cause the Cloud Run job to appear successful (exit code 0). Consider raising an exception or using sys.exit(1) to signal failure.

♻️ Proposed fix
+import sys
+
 def main():
     bucket_name = os.getenv("WG21_GCS_BUCKET")
     if not bucket_name:
         logger.error("WG21_GCS_BUCKET env var not set.")
-        return
+        sys.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 67 - 71, The code in
main() logs when WG21_GCS_BUCKET is unset but then returns (exit code 0); change
this to signal failure by either raising an exception or calling sys.exit(1).
Specifically, update the bucket_name check in main() (the logger.error call) so
after logging it calls sys.exit(1) or raises a RuntimeError with a clear message
about WG21_GCS_BUCKET being missing, and ensure you import sys if you choose
sys.exit.
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)

60-68: Remove commented-out code or restore try/except.

The commented-out exception handling around parse_date should either be removed if unnecessary, or restored if parse_date can raise exceptions that need handling.

♻️ Clean up
 def _parse_document_date(date_str: str):
     """Return date or None from CSV date column (e.g. YYYY-MM-DD). Invalid values return None."""
     cleaned = _norm(date_str)
     if not cleaned:
         return None
-    # try:
     return parse_date(cleaned)
-    # except (ValueError, TypeError):
-    #     return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 60 - 68, The _parse_document_date function currently contains
commented-out try/except around parse_date; either remove the dead commented
code or reintroduce error handling: decide whether parse_date can raise
ValueError/TypeError in your environment and if so, wrap the call to
parse_date(cleaned) in a try/except that returns None on ValueError or
TypeError; otherwise delete the commented-out except block and its comments so
_parse_document_date contains only the normalized check and the direct return of
parse_date(cleaned).

164-171: Use logger.exception for better debugging in error handlers.

Using logger.exception instead of logger.error will include the full traceback, which is valuable for debugging batch import failures.

♻️ Proposed fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error(
+                logger.exception(
                     "Error parsing document date for paper_id=%s: %s",
                     paper_id,
                     e,
                 )
                 continue

Apply similar changes at lines 211, 233-237, and 240.

Also applies to: 231-240

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 164 - 171, Replace calls to logger.error inside the except
Exception as e handlers with logger.exception so the traceback is logged;
specifically update the block that currently does stats["skipped"] += 1 followed
by logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
(and the other similar except blocks that increment stats["skipped"]) to call
logger.exception("Error parsing document date for paper_id=%s", paper_id) (or an
analogous message) and keep the stats increment and continue; apply the same
change to the other except handlers currently using logger.error in this file
(the occurrences around the other try/except blocks that increment
stats["skipped"]).
wg21_paper_tracker/services.py (1)

42-46: Consider extracting year normalization to a helper.

The year parsing logic is duplicated between get_or_create_paper (lines 42-46) and mark_paper_downloaded (lines 112-116) with slightly different truthiness checks (if year: vs if year is not None:). While both currently work correctly, extracting this to a shared helper would improve maintainability.

♻️ Suggested helper
def _normalize_year(year: int | str | None) -> int:
    """Normalize year input to int, defaulting to 0."""
    if year is None:
        return 0
    s = (year if isinstance(year, str) else str(year)).strip()[:4]
    return int(s) if s.isdigit() else 0

Also applies to: 112-116

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/services.py` around lines 42 - 46, Extract the duplicated
year-parsing logic into a shared helper (e.g., _normalize_year) and use it from
both get_or_create_paper and mark_paper_downloaded; implement
_normalize_year(year: int | str | None) -> int to return 0 for None/invalid
input, trim strings to first 4 chars, and return int(s) if s.isdigit() else 0,
then replace the inline blocks in get_or_create_paper and mark_paper_downloaded
with year_val = _normalize_year(year).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alas.py`:
- Around line 13-18: Migration 0006 created a field named author_alas (likely a
typo) on model wg21paperauthorprofile; to fix, rename the model field in
models.py from author_alas to author_alias, then delete this migration file
(0006_wg21paperauthorprofile_author_alas.py) and regenerate migrations by
running python manage.py makemigrations so Django produces a migration that adds
author_alias (or a proper RenameField migration if preserving data), and commit
the regenerated migration.

In `@cppa_user_tracker/models.py`:
- Line 168: Rename the typo'd model field author_alas to author_alias in
cppa_user_tracker.models (update the Field declaration in the model class where
author_alas is defined), then regenerate the migration so the DB schema matches:
delete or replace the incorrect migration
0006_wg21paperauthorprofile_author_alas.py and run makemigrations/migrate to
create a migration that reflects the rename (or use makemigrations --name ... to
produce a rename operation), and update any references in code/tests that used
author_alas to use author_alias.

In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 240-246: The except block in openai_converter.py currently appends
raw exception text into markdown_parts (using page_num and str(e)), which may
leak internal details; keep the detailed error in logs (logger.error/exception
with exc_info=True) but replace the persisted content with a generic placeholder
such as "Error processing this page" (no exception text) when appending to
markdown_parts so only non-sensitive, user-facing text is stored.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 38-44: The logic in the loop over error_patterns incorrectly calls
pattern.startswith("error:")/("exception:"), which tests the pattern string
rather than whether the content (first_part) begins with those markers; update
the condition in the for-loop that inspects first_part so you check
first_part.startswith("error:") or first_part.startswith("exception:") (or
equivalently test if idx == 0 after finding pattern in first_part) instead of
pattern.startswith(...); alternatively, if the intent was to always reject the
literal patterns regardless of position, test if pattern in {"error:",
"exception:"} and return False unconditionally—modify the block using variables
error_patterns, first_part, pattern accordingly.

In `@wg21_paper_tracker/cloud_run_job/requirements.txt`:
- Line 4: The requirements file pins Pillow too low (Pillow>=10.3.0) which is
vulnerable; update the Pillow spec in
wg21_paper_tracker/cloud_run_job/requirements.txt to "Pillow>=12.1.1" (or later)
to address GHSA-cfh3-3jmp-rvhc, then rebuild your environment/container and
re-run tests to ensure compatibility with the updated Pillow release.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 204-220: The IntegrityError handler currently recovers by fetching
WG21Paper.objects.filter(paper_id=paper_id).first(), which can pick the wrong
row when multiple years exist; change the lookup to include year (e.g.,
WG21Paper.objects.filter(paper_id=paper_id, year=year).first()) and handle the
case where year is None by using an appropriate disambiguation (such as skipping
update, logging the ambiguity, or selecting the most relevant year) so you
update the correct row in the except block around IntegrityError and adjust
stats/logging (stats["skipped"] and logger.error) accordingly; update the code
paths that set paper.url/title/document_date/mailing/subgroup/year and call
paper.save() only after a correct, unambiguous paper is found.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 64-83: The Cloud Run trigger and the "Downloaded and uploaded" log
must be gated when GCS uploads are disabled: update the logic around
run_tracker_pipeline()/total_new_papers so you only call
trigger_cloud_run_job(project_id, location, job_name) when uploads actually
occurred (e.g., check settings.WG21_GCS_BUCKET is set or, preferably, change
run_tracker_pipeline to return an explicit uploaded_papers count/flag and use
that). Also update the logger.info message to accurately reflect whether papers
were uploaded vs. only downloaded (use uploaded_papers or the WG21_GCS_BUCKET
check to choose the message), and keep using trigger_cloud_run_job and logger.*
as the referenced symbols to locate where to change behavior.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 158-160: Normalize the parsed year once when extracting the
mailing info and reuse that single normalized value for deduplication and
persistence; replace the current ad-hoc conversion (where a missing year becomes
0 in some places and None in others) by computing year_normalized =
int(m_info["year"]) if m_info["year"] else None (or equivalent) where
mailing_date/title/year are extracted, then use year_normalized everywhere
downstream (including the dedupe key construction and the persistence call) and
remove any places that treat missing years as 0 so the same canonical key is
used consistently (also update the other occurrences referenced around the other
similar blocks).
- Around line 219-225: The current branch in pipeline.py sets uploaded=True when
bucket_name is missing, causing is_downloaded to be marked true even though GCS
upload never happened; instead, keep uploaded=False (or introduce a separate
local_only flag) so records requiring GCS remain eligible for backfill.
Concretely, in the block using bucket_name and _upload_to_gcs, do not simulate
success — set uploaded based on an actual upload result (or set a distinct
local_only state) and ensure the code that flips is_downloaded only does so when
uploaded is true; apply the same fix to the similar branch around the second
occurrence (the block that mirrors lines 254-257).

In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 58-76: Update the test_wg21_paper_unique_together_paper_id_year
function to also assert that inserting a WG21Paper with the same paper_id and
the same year fails: after creating the first 2024 WG21Paper (using WG21Mailing
m1), attempt a second WG21Paper with paper_id="sd-1" and year=2024 wrapped in
pytest.raises(IntegrityError) to confirm the (paper_id, year) uniqueness
constraint, while keeping the existing 2025 insert (p2) as the successful
positive case; use WG21Mailing (m2) as needed for the positive 2025 row.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 153-165: The test and service are forcing unknown years to the
sentinel 0 which collapses distinct imports; instead preserve NULL for unknown
years: update get_or_create_paper to stop converting year=None to 0 and pass
None through so the DB stores NULL (ensure the Paper model/column is nullable),
and adjust the test test_get_or_create_paper_year_none_stored_as_zero to expect
paper.year is None (or alternatively enforce a validation at the service
boundary to raise when year is missing if you prefer rejecting None). Refer to
get_or_create_paper and the failing test name when making the change.
- Around line 76-103: The test currently sets mock_profile.return_value to a
single MagicMock so both authors appear to get the same profile; change to
return distinct profile tuples using mock_profile.side_effect = [(profile1,
True), (profile2, True)] where profile1.pk = 1 and profile2.pk = 2, and then
update the mock_get_or_create_paper_author assertions to expect calls with paper
and profile1 for position 1 and paper and profile2 for position 2 (keep
mock_get_or_create_paper_author return as before). This ensures
get_or_create_paper actually links each author to a distinct profile; reference
the test function test_get_or_create_paper_calls_author_profile_for_each_author,
mock_profile, mock_get_or_create_paper_author, get_or_create_paper, and
get_or_create_mailing when making the edits.

---

Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 67-71: The code in main() logs when WG21_GCS_BUCKET is unset but
then returns (exit code 0); change this to signal failure by either raising an
exception or calling sys.exit(1). Specifically, update the bucket_name check in
main() (the logger.error call) so after logging it calls sys.exit(1) or raises a
RuntimeError with a clear message about WG21_GCS_BUCKET being missing, and
ensure you import sys if you choose sys.exit.

In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The except blocks that catch requests.RequestException in
fetcher.py (the try around requests.get(f"{BASE_URL}/", timeout=30) and the
similar block later) swallow the exception details; update both handlers to
capture the exception (e.g. except requests.RequestException as e) and log the
real error context via logger.error including the exception (either via
exc_info=True or by interpolating e into the message) while preserving the
existing behavior of returning [] so the graceful fallback remains.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 60-68: The _parse_document_date function currently contains
commented-out try/except around parse_date; either remove the dead commented
code or reintroduce error handling: decide whether parse_date can raise
ValueError/TypeError in your environment and if so, wrap the call to
parse_date(cleaned) in a try/except that returns None on ValueError or
TypeError; otherwise delete the commented-out except block and its comments so
_parse_document_date contains only the normalized check and the direct return of
parse_date(cleaned).
- Around line 164-171: Replace calls to logger.error inside the except Exception
as e handlers with logger.exception so the traceback is logged; specifically
update the block that currently does stats["skipped"] += 1 followed by
logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
(and the other similar except blocks that increment stats["skipped"]) to call
logger.exception("Error parsing document date for paper_id=%s", paper_id) (or an
analogous message) and keep the stats increment and continue; apply the same
change to the other except handlers currently using logger.error in this file
(the occurrences around the other try/except blocks that increment
stats["skipped"]).

In `@wg21_paper_tracker/services.py`:
- Around line 42-46: Extract the duplicated year-parsing logic into a shared
helper (e.g., _normalize_year) and use it from both get_or_create_paper and
mark_paper_downloaded; implement _normalize_year(year: int | str | None) -> int
to return 0 for None/invalid input, trim strings to first 4 chars, and return
int(s) if s.isdigit() else 0, then replace the inline blocks in
get_or_create_paper and mark_paper_downloaded with year_val =
_normalize_year(year).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: eeb3955d-3988-4c0b-af91-76b020089677

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and f4388ff.

📒 Files selected for processing (38)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alas.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread cppa_user_tracker/models.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/main.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/requirements.txt Outdated
Comment thread wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/tests/test_models.py
Comment thread wg21_paper_tracker/tests/test_services.py Outdated
Comment thread wg21_paper_tracker/tests/test_services.py
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 10, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (3)
wg21_paper_tracker/tests/test_pipeline.py (2)

50-62: Consider using tmp_path fixture for the timeout test.

Line 53 uses a hardcoded /tmp/out path. While the mock prevents actual file writes, using tmp_path would be more consistent with other tests and avoid the static analysis warning.

♻️ Suggested fix
-def test_download_file_uses_timeout():
+def test_download_file_uses_timeout(tmp_path):
     """_download_file calls requests.get with DOWNLOAD_TIMEOUT."""
     url = "https://example.com/f"
-    filepath = Path("/tmp/out")
+    filepath = tmp_path / "out"
     resp = MagicMock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 50 - 62, The test
test_download_file_uses_timeout should use the pytest tmp_path fixture instead
of a hardcoded Path("/tmp/out"); modify the test signature to accept tmp_path
and replace filepath = Path("/tmp/out") with filepath = tmp_path / "out" (or
similar filename), keep the rest of the test unchanged so it still patches
wg21_paper_tracker.pipeline.requests.get and asserts timeout == DOWNLOAD_TIMEOUT
when calling _download_file.

147-166: Deeply nested context managers reduce readability.

Consider using unittest.mock.patch.multiple or contextlib.ExitStack to flatten the nesting, or use pytest-mock's mocker fixture for cleaner syntax.

♻️ Example using ExitStack
+from contextlib import ExitStack
+
 `@pytest.mark.django_db`
 def test_run_tracker_pipeline_downloads_new_papers(tmp_path):
     """run_tracker_pipeline downloads papers for new mailings and returns count."""
     from wg21_paper_tracker.models import WG21Mailing
 
     WG21Mailing.objects.create(mailing_date="2025-01", title="Previous")
     mailings = [...]
     papers = [...]
-    with patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings):
-        with patch(
-            "wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers
-        ):
-            with patch(
-                "wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path
-            ):
-                # ... more nesting
-                            n = run_tracker_pipeline()
+    with ExitStack() as stack:
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket"))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True))
+        n = run_tracker_pipeline()
     assert n == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 147 - 166, The test
uses deeply nested patch context managers around run_tracker_pipeline (patching
fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file,
settings.WG21_GCS_BUCKET, and _upload_to_gcs); flatten these by replacing the
nested with either unittest.mock.patch.multiple or contextlib.ExitStack (or
pytest-mock's mocker) to apply all patches in one block: locate the test
function in tests/test_pipeline.py and replace the nested with a single
patch.multiple call (or an ExitStack that enters each patch) that sets
return_value/side_effect for fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, _upload_to_gcs and sets WG21_GCS_BUCKET to
"test-bucket", then call run_tracker_pipeline() and assert n == 1.
wg21_paper_tracker/models.py (1)

18-18: Prefer tuples for Meta constants.

These values are static config, so tuples fit better than lists here and clear the current Ruff RUF012 warnings without changing behavior.

Also applies to: 45-46, 74-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/models.py` at line 18, The Meta class uses list literals
for static ordering values (e.g., ordering = ["-mailing_date"]) which triggers
RUF012; replace these list literals with equivalent tuples (ordering =
("-mailing_date",)) across all Meta definitions in this file (also update the
other occurrences that currently use lists at the other Meta blocks referenced)
so the behavior is unchanged but the constants are tuples.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/settings.py`:
- Around line 223-225: The WG21_CLOUD_RUN_JOB_NAME default makes the job name
always truthy and causes run_wg21_paper_tracker to try invoking Cloud Run even
when no job is configured; change settings so WG21_CLOUD_RUN_JOB_NAME defaults
to an empty string (no implicit "wg21-convert") and add/use an explicit boolean
flag (e.g., WG21_CLOUD_RUN_ENABLED from env) that run_wg21_paper_tracker checks
along with GCP_PROJECT_ID and WG21_GCS_BUCKET before attempting Cloud Run;
update the code paths that reference WG21_CLOUD_RUN_JOB_NAME and
run_wg21_paper_tracker to gate execution on the new WG21_CLOUD_RUN_ENABLED flag
(and leave existing GCP_PROJECT_ID/WG21_GCS_BUCKET checks intact).

In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 230-261: The current logic treats any single successful page as a
successful conversion and returns partial Markdown; change the success check so
only fully converted documents are considered successful: replace the condition
that checks "if successful_pages > 0 and markdown_content.strip()" with a
stricter check that successful_pages equals total_pages (e.g., "if
successful_pages == total_pages and markdown_content.strip()") and when pages
are missing log a clear warning including successful_pages/total_pages and
return None so partial results are not persisted; update the log messages around
pdf_path.name and the extracted character count to run only on full success.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 81-110: The loop over blobs currently allows any exception from
blob.exists(), blob.download_to_filename(), convert_pdf_to_md, or
md_blob.upload_from_string() to abort the whole job; wrap the per-blob
processing (from checking md_blob.exists() through upload and file deletion) in
a try/except block that catches Exception, logs the blob name and the exception
via logger.error("... %s", blob.name, exc_info=True) and then continues to the
next blob; ensure local_pdf_path.unlink() is executed in a finally block (or
check and remove after the except) so the temp file is always cleaned up even on
errors; keep calls to convert_pdf_to_md, md_blob.exists(),
download_to_filename(), and upload_from_string() inside that try so a single bad
file is skipped without aborting the job.

In `@wg21_paper_tracker/fetcher.py`:
- Around line 29-34: The exception handlers in fetcher.py that catch
requests.RequestException (around the requests.get/response.raise_for_status
block) and the similar handler later (lines ~65-70) currently call logger.error
and drop the original exception details; replace those logger.error(...) calls
with logger.exception(...) so the full traceback and HTTP/timeout details are
preserved when handling failures from requests.get and related code paths.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 75-81: The current except block catches exceptions from
trigger_cloud_run_job(project_id, location, job_name) and only logs them, which
swallows failures; change the handler in run_wg21_paper_tracker.py so that after
logging the failure (logger.error(...)) it re-raises the exception (raise) so
the scheduler marks the run failed and retries/alerts instead of silently
succeeding; locate the try/except around trigger_cloud_run_job and add a
re-raise after the logger.error call.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 205-224: The format priority ordering is inverted: format_priority
currently assigns lower numbers to less-desired formats so
p_list.sort(key=lambda x: format_priority(x["type"])) picks ".adoc" over ".pdf";
fix by making the sort choose the preferred format (PDF) first — either invert
the priorities mapping in format_priority (give "pdf" the lowest value and
"adoc" a higher value) or keep the mapping but change the sort to
p_list.sort(key=..., reverse=True) so best_paper = p_list[0] yields the PDF;
update format_priority and/or the p_list.sort(...) call accordingly.

In `@wg21_paper_tracker/services.py`:
- Around line 42-57: The current code uses year_val in the get_or_create key so
placeholder rows with year=0 block later updates; change to create/get by
paper_id only (call WG21Paper.objects.get_or_create(paper_id=paper_id,
defaults={...}) without year) and then, after obtaining the paper instance, if a
valid numeric year (year_val > 0) is present and paper.year differs, set
paper.year = year_val and save() to promote the placeholder; also update the
existing year-update branch that references year transitions (the branch using
year_val and the code that tries to update year later) to rely on this
single-instance update flow.
- Around line 81-86: The current author-sync loop only upserts current authors
and never removes stale WG21PaperAuthor rows; modify the author-sync to (1)
build the desired list by pairing author_names with author_emails (emails =
author_emails or []), (2) for each pair call
get_or_create_wg21_paper_author_profile(name, email=...) and collect the profile
IDs and desired positions, (3) query existing WG21PaperAuthor rows for the paper
and delete any whose profile ID is not in the desired set (this also handles
empty author_names by deleting all), and (4) ensure you then create/upsert
remaining desired WG21PaperAuthor entries with correct positions via
get_or_create_paper_author(paper, profile, position). Use the existing helpers
get_or_create_wg21_paper_author_profile and get_or_create_paper_author and
ensure ordering is preserved.

---

Nitpick comments:
In `@wg21_paper_tracker/models.py`:
- Line 18: The Meta class uses list literals for static ordering values (e.g.,
ordering = ["-mailing_date"]) which triggers RUF012; replace these list literals
with equivalent tuples (ordering = ("-mailing_date",)) across all Meta
definitions in this file (also update the other occurrences that currently use
lists at the other Meta blocks referenced) so the behavior is unchanged but the
constants are tuples.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 50-62: The test test_download_file_uses_timeout should use the
pytest tmp_path fixture instead of a hardcoded Path("/tmp/out"); modify the test
signature to accept tmp_path and replace filepath = Path("/tmp/out") with
filepath = tmp_path / "out" (or similar filename), keep the rest of the test
unchanged so it still patches wg21_paper_tracker.pipeline.requests.get and
asserts timeout == DOWNLOAD_TIMEOUT when calling _download_file.
- Around line 147-166: The test uses deeply nested patch context managers around
run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs);
flatten these by replacing the nested with either unittest.mock.patch.multiple
or contextlib.ExitStack (or pytest-mock's mocker) to apply all patches in one
block: locate the test function in tests/test_pipeline.py and replace the nested
with a single patch.multiple call (or an ExitStack that enters each patch) that
sets return_value/side_effect for fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, _upload_to_gcs and sets WG21_GCS_BUCKET to
"test-bucket", then call run_tracker_pipeline() and assert n == 1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e5209104-7b70-42e6-b1cc-ce64b45283f1

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 62d5d42.

📒 Files selected for processing (38)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread config/settings.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/main.py Outdated
Comment thread wg21_paper_tracker/fetcher.py
Comment thread wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/services.py Outdated
Comment thread wg21_paper_tracker/services.py
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 10, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (2)
cppa_user_tracker/services.py (1)

383-391: ⚠️ Potential issue | 🟠 Major

Don’t collapse same-name authors when the email doesn’t match.

When email_val is present but no candidate has that email, this falls back to an arbitrary existing profile. That merges distinct WG21 authors who share a display name and also loses the newly discovered email. If you reuse a single candidate, attach the email; if multiple candidates exist and none matches, create a new profile instead.

💡 Minimal fix
     if len(candidates) == 1:
-        return candidates[0], False
+        profile = candidates[0]
+        if email_val and not profile.emails.filter(email=email_val).exists():
+            add_email(
+                profile,
+                email_val,
+                is_primary=not profile.emails.filter(is_active=True).exists(),
+            )
+        return profile, False

     # Two or more: disambiguate by email if provided
     if email_val:
         for p in candidates:
             if p.emails.filter(email=email_val).exists():
                 return p, False
+        profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val)
+        add_email(profile, email_val, is_primary=True)
+        return profile, True
     return candidates[0], False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/services.py` around lines 383 - 391, The current logic in
the profile-disambiguation block returns an arbitrary existing candidate when
email_val is provided but no candidate has that email; instead, if email_val is
present and exactly one candidate exists reuse that candidate and attach the new
email (use the profile object referenced as p/candidates[0] and call the model
method used elsewhere to add/save emails), but if multiple candidates exist and
none match email_val, do not collapse them—create and return a new profile
populated with the discovered email; update the branch around the
candidates/email_val check (the code that inspects candidates, email_val, and
uses p.emails.filter(...).exists()) to implement: 1) if len(candidates)==1 and
email_val add and persist the email then return that profile, 2) if email_val
and no candidate matches and len(candidates)>1 create a new profile with the
email and return it, otherwise preserve the existing fallback behavior.
wg21_paper_tracker/pipeline.py (1)

205-208: ⚠️ Potential issue | 🟠 Major

Restore the intended source-first format ordering.

With min(...), this mapping now makes PDF win again. That changes which artifact is persisted for any paper that has multiple formats and looks like a regression to the previously discussed policy.

♻️ Proposed fix
         def format_priority(ext: str) -> int:
-            # Prefer PDF (Cloud Run converts PDFs); then html, adoc, ps
-            priorities = {"pdf": 1, "html": 2, "adoc": 3, "ps": 4}
+            # Prefer source/markup formats first: adoc, html, ps, then pdf
+            priorities = {"adoc": 1, "html": 2, "ps": 3, "pdf": 4}
             return priorities.get(ext.lower(), 100)

Based on learnings, in wg21_paper_tracker/pipeline.py, format_priority() is intentionally source-first: adoc > html > ps > pdf.

Also applies to: 223-224

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 205 - 208, format_priority()
currently ranks PDF highest (pdf:1) which, combined with the min(...) selection,
makes PDFs win; restore the intended source-first ordering by changing the
priorities mapping in format_priority (and the duplicate mapping at the other
occurrence) to rank source formats first: set "adoc" highest, then "html", then
"ps", then "pdf" (e.g. adoc:1, html:2, ps:3, pdf:4 with the same default), so
the existing selection logic continues to pick source formats before PDF.
🧹 Nitpick comments (4)
docs/service_api/cppa_user_tracker.md (1)

46-48: Document what “the first” means.

The implementation is deterministic here: it orders matches by id and returns the lowest-id profile when email does not disambiguate. Calling that out would make the tie-breaker contract clearer for API consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/service_api/cppa_user_tracker.md` around lines 46 - 48, The docs for
get_or_create_wg21_paper_author_profile lack a precise definition of "the first"
when multiple profiles match; update the description to state the deterministic
tie-breaker: when multiple profiles match and email does not disambiguate, the
function orders matching WG21PaperAuthorProfile rows by id ascending and returns
the lowest-id profile (i.e., the minimum id), so callers can rely on that
deterministic behavior. Ensure the wording references
get_or_create_wg21_paper_author_profile and WG21PaperAuthorProfile so the
tie-breaker is explicit.
workflow/management/commands/run_all_collectors.py (1)

23-23: Consider gating the WG21 collector in the aggregate runner.

run_wg21_paper_tracker immediately runs the WG21 fetch pipeline, so adding it unconditionally here makes every run_all_collectors run depend on WG21 availability, and on GCP as well when Cloud Run triggering is enabled. A setting or --include-wg21 flag would keep unrelated batch runs from failing in environments that are not ready for this collector.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workflow/management/commands/run_all_collectors.py` at line 23, The aggregate
runner currently always includes "run_wg21_paper_tracker", making runs depend on
WG21; change this by gating its inclusion behind a configuration or CLI flag:
add a Django setting (e.g., RUN_WG21_COLLECTOR default False) and/or a
management command option (e.g., add_argument("--include-wg21",
action="store_true")) in the Command class for run_all_collectors.py, then only
append or call run_wg21_paper_tracker when the flag or setting is True; also
update the command help text to document the new flag and ensure default
behavior excludes the WG21 collector so unrelated runs don’t fail.
cppa_user_tracker/tests/test_services.py (1)

578-648: Add one normalization regression for trimmed inputs.

The implementation strips both display_name and email before matching, but these tests only cover already-normalized values. A whitespace case would protect the dedupe path that otherwise regresses into duplicate author profiles.

🧪 Possible test addition
+@pytest.mark.django_db
+def test_get_or_create_wg21_paper_author_profile_strips_inputs():
+    profile, created = services.get_or_create_wg21_paper_author_profile(
+        display_name="  Trimmed Author  ",
+        email="  author@example.com  ",
+    )
+    assert created is True
+    assert profile.display_name == "Trimmed Author"
+    assert profile.emails.filter(email="author@example.com").exists()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/tests/test_services.py` around lines 578 - 648, Add a test
that ensures get_or_create_wg21_paper_author_profile normalizes trimmed inputs:
create a WG21PaperAuthorProfile (e.g., first) and/or add an email via
services.add_email(second, "match@example.com"), then call
services.get_or_create_wg21_paper_author_profile with display_name and/or email
containing leading/trailing whitespace (e.g., "  Same Name  ", " 
match@example.com  ") and assert created is False and the returned profile.id
matches the existing profile (and that
profile.emails.filter(email="match@example.com").exists() when testing email
matching); this covers the trimming/normalization path for
get_or_create_wg21_paper_author_profile and prevents duplicate profiles.
wg21_paper_tracker/tests/test_services.py (1)

221-253: Assert that the non-matching year stays untouched.

These tests prove the target (paper_id, year) flips to downloaded, but they would still pass if mark_paper_downloaded() accidentally updated every year variant for the same paper_id. Seed a second WG21Paper with the same ID and a different year, then assert it remains False after the call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` around lines 221 - 253, Add a test
case that ensures mark_paper_downloaded only updates the specific (paper_id,
year) pair by creating a second WG21Paper via get_or_create_paper with the same
paper_id ("p1000r0") but a different year (e.g., 2024), call
mark_paper_downloaded("p1000r0", year=2025), refresh both paper instances with
refresh_from_db, and assert the paper for 2025 has is_downloaded True while the
paper for the other year remains False; reference the existing test functions
and the mark_paper_downloaded and get_or_create_paper helpers to locate where to
add this assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py`:
- Around line 13-18: The migration currently adds
WG21PaperAuthorProfile.author_alias with default="" and preserve_default=True
which leaves a permanent default in the migration state; change the AddField
call for model_name="wg21paperauthorprofile" and name="author_alias" to use
preserve_default=False so the "" default is only used as a one-time backfill for
existing rows and does not persist as the model's default.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The list_blobs prefix is wrong: raw_prefix is set to
"raw/wg21_paper_tracker/" but the uploader writes to
"raw/wg21_papers/{mailing_date}/{filename}"; update raw_prefix to
"raw/wg21_papers/" so client.list_blobs(bucket, prefix=raw_prefix) will discover
newly uploaded PDFs (leave converted_prefix unchanged), then re-run the job to
verify the converter sees files.
- Around line 69-72: The code currently logs and returns when the
WG21_GCS_BUCKET env var is missing (bucket_name), causing the Cloud Run job to
exit successfully; change this to terminate with a non-zero exit so the
scheduler detects failure: after checking os.getenv("WG21_GCS_BUCKET") and
logging via logger.error, call sys.exit(1) or raise a RuntimeError with a clear
message (e.g., include "WG21_GCS_BUCKET not set") instead of returning; update
the block that references bucket_name in main (or the module entrypoint) to
perform this non-zero exit so the job fails fast and surfaces the
misconfiguration.

In `@wg21_paper_tracker/fetcher.py`:
- Around line 95-106: The scraper currently accepts absolute hrefs directly and
sets paper_url (in the loop over first_cell.find_all("a", href=True)) which
allows off-origin links; change this to always canonicalize the href using
urllib.parse.urljoin(base=url or BASE_URL, href) to produce a resolved_url, then
parse it with urllib.parse.urlparse and only accept it if parsed.scheme is
"http" or "https" and parsed.netloc equals "www.open-std.org" before assigning
to paper_url or appending to the download queue; otherwise skip the link. Ensure
you reference the existing variables paper_pattern, href, paper_url, BASE_URL,
and url (the page base) when implementing the check.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 120-122: Replace the silent failure in
import_wg21_metadata_from_csv.py where the code currently checks
csv_path.exists() and calls logger.error + return: raise
django.core.management.base.CommandError with a clear message (e.g. "CSV file
not found: <path>") instead of returning; update the command entry-point (the
class derived from BaseCommand / its handle() method that references csv_path)
to propagate the exception. Also add a unit test that calls
django.core.management.call_command('import_wg21_metadata_from_csv', csv_path)
and asserts that CommandError is raised when the file is missing. Use the
csv_path variable and the command class/handle() name to locate the change.

In `@wg21_paper_tracker/workspace.py`:
- Around line 24-25: The helper currently forces raw storage under
get_workspace_path(_RAW_APP_SLUG) which ignores the overrideable
settings.RAW_DIR; change the computation of raw_root in workspace.py so it
prefers settings.RAW_DIR (e.g., if settings.RAW_DIR is set/non-empty use
Path(settings.RAW_DIR) as raw_root) and only falls back to
get_workspace_path(_RAW_APP_SLUG) otherwise, then compute path = raw_root /
mailing_date; reference symbols: settings.RAW_DIR, get_workspace_path,
_RAW_APP_SLUG, raw_root, path.

---

Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 383-391: The current logic in the profile-disambiguation block
returns an arbitrary existing candidate when email_val is provided but no
candidate has that email; instead, if email_val is present and exactly one
candidate exists reuse that candidate and attach the new email (use the profile
object referenced as p/candidates[0] and call the model method used elsewhere to
add/save emails), but if multiple candidates exist and none match email_val, do
not collapse them—create and return a new profile populated with the discovered
email; update the branch around the candidates/email_val check (the code that
inspects candidates, email_val, and uses p.emails.filter(...).exists()) to
implement: 1) if len(candidates)==1 and email_val add and persist the email then
return that profile, 2) if email_val and no candidate matches and
len(candidates)>1 create a new profile with the email and return it, otherwise
preserve the existing fallback behavior.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 205-208: format_priority() currently ranks PDF highest (pdf:1)
which, combined with the min(...) selection, makes PDFs win; restore the
intended source-first ordering by changing the priorities mapping in
format_priority (and the duplicate mapping at the other occurrence) to rank
source formats first: set "adoc" highest, then "html", then "ps", then "pdf"
(e.g. adoc:1, html:2, ps:3, pdf:4 with the same default), so the existing
selection logic continues to pick source formats before PDF.

---

Nitpick comments:
In `@cppa_user_tracker/tests/test_services.py`:
- Around line 578-648: Add a test that ensures
get_or_create_wg21_paper_author_profile normalizes trimmed inputs: create a
WG21PaperAuthorProfile (e.g., first) and/or add an email via
services.add_email(second, "match@example.com"), then call
services.get_or_create_wg21_paper_author_profile with display_name and/or email
containing leading/trailing whitespace (e.g., "  Same Name  ", " 
match@example.com  ") and assert created is False and the returned profile.id
matches the existing profile (and that
profile.emails.filter(email="match@example.com").exists() when testing email
matching); this covers the trimming/normalization path for
get_or_create_wg21_paper_author_profile and prevents duplicate profiles.

In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile lack a
precise definition of "the first" when multiple profiles match; update the
description to state the deterministic tie-breaker: when multiple profiles match
and email does not disambiguate, the function orders matching
WG21PaperAuthorProfile rows by id ascending and returns the lowest-id profile
(i.e., the minimum id), so callers can rely on that deterministic behavior.
Ensure the wording references get_or_create_wg21_paper_author_profile and
WG21PaperAuthorProfile so the tie-breaker is explicit.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 221-253: Add a test case that ensures mark_paper_downloaded only
updates the specific (paper_id, year) pair by creating a second WG21Paper via
get_or_create_paper with the same paper_id ("p1000r0") but a different year
(e.g., 2024), call mark_paper_downloaded("p1000r0", year=2025), refresh both
paper instances with refresh_from_db, and assert the paper for 2025 has
is_downloaded True while the paper for the other year remains False; reference
the existing test functions and the mark_paper_downloaded and
get_or_create_paper helpers to locate where to add this assertion.

In `@workflow/management/commands/run_all_collectors.py`:
- Line 23: The aggregate runner currently always includes
"run_wg21_paper_tracker", making runs depend on WG21; change this by gating its
inclusion behind a configuration or CLI flag: add a Django setting (e.g.,
RUN_WG21_COLLECTOR default False) and/or a management command option (e.g.,
add_argument("--include-wg21", action="store_true")) in the Command class for
run_all_collectors.py, then only append or call run_wg21_paper_tracker when the
flag or setting is True; also update the command help text to document the new
flag and ensure default behavior excludes the WG21 collector so unrelated runs
don’t fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d5221e33-cecc-4089-99f2-28b0f07fc98c

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and e3e91c8.

📒 Files selected for processing (38)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread wg21_paper_tracker/cloud_run_job/main.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/main.py Outdated
Comment thread wg21_paper_tracker/fetcher.py
Comment thread wg21_paper_tracker/fetcher.py Outdated
Comment thread wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py Outdated
Comment thread wg21_paper_tracker/workspace.py Outdated
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

♻️ Duplicate comments (1)
wg21_paper_tracker/cloud_run_job/main.py (1)

77-80: ⚠️ Potential issue | 🔴 Critical

Fix the raw GCS prefix before this ships.

The uploader stores PDFs under raw/wg21_papers/..., but this job is listing raw/wg21_paper_tracker/. With the current prefix, the converter never sees newly uploaded papers, so the end-to-end WG21 flow is effectively broken.

🐛 Minimal fix
-    raw_prefix = "raw/wg21_paper_tracker/"
+    raw_prefix = "raw/wg21_papers/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The GCS
listing prefix is incorrect: update the raw_prefix variable currently set to
"raw/wg21_paper_tracker/" to the actual uploader path "raw/wg21_papers/" so that
blobs = client.list_blobs(bucket, prefix=raw_prefix) will find newly uploaded
PDFs; leave converted_prefix as-is and verify the uploader writes to
"raw/wg21_papers/" if needed.
🧹 Nitpick comments (6)
wg21_paper_tracker/tests/test_workspace.py (1)

43-86: Add a regression test for the mailing_date=None path.

get_raw_dir() accepts str | None, but the suite only covers valid strings and invalid inputs. A positive case for get_raw_dir(None, 2025) would lock in the year-only behavior and catch the current wg21_paper_tracker/workspace.py Line 24 fullmatch(None) crash.

🧪 Suggested test
+def test_get_raw_dir_returns_year_dir_when_mailing_date_is_none(mock_workspace_path):
+    with patch("wg21_paper_tracker.workspace.settings") as mock_settings:
+        mock_settings.RAW_DIR = mock_workspace_path
+        path = get_raw_dir(None, 2025)
+
+    expected = mock_workspace_path / "wg21_paper_tracker" / "2025"
+    assert path == expected
+    assert path.is_dir()
+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_workspace.py` around lines 43 - 86, Add a
regression test that ensures get_raw_dir accepts mailing_date=None and returns
the year-only directory instead of crashing: create a new test (e.g.,
test_get_raw_dir_accepts_none_mailing_date) that patches
wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, calls
get_raw_dir(None, 2025), and asserts the returned Path == mock_workspace_path /
"wg21_paper_tracker" / "2025" (and that it exists/is_dir and path.name ==
"2025"); also update get_raw_dir in wg21_paper_tracker.workspace to guard the
re.fullmatch check (avoid calling fullmatch on None) by treating None as the "no
mailing_date" case that yields the year-only directory.
wg21_paper_tracker/tests/test_models.py (1)

35-46: Assert the retained prefix, not just the length.

This only proves the rendered title is at most 60 characters. WG21Paper.__str__() could return the wrong slice—or even an empty title—and this test would still pass.

🧪 Tighten the assertion
-    assert len(str(p).split(": ", 1)[-1]) <= 60
+    rendered = str(p)
+    assert rendered.startswith("p1: ")
+    assert rendered.split(": ", 1)[-1] == long_title[:60]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_models.py` around lines 35 - 46, Update the
test to assert the actual retained prefix of the title, not just its length:
after computing rendered_title = str(p).split(": ", 1)[-1], assert that
rendered_title == long_title[:60] (or at minimum startswith long_title[:60]) so
WG21Paper.__str__ is returning the correct leading slice of the original title;
reference the test function test_wg21_paper_str_truncates_long_title and the
WG21Paper.__str__ output extraction (str(p).split(": ", 1)[-1]) when making the
change.
wg21_paper_tracker/tests/test_fetcher.py (1)

91-206: Add a regression test for off-origin paper links.

fetch_papers_for_mailing() now skips resolved URLs outside www.open-std.org, but none of these tests hit that branch. A small fixture here would keep the ingest/SSRF guard from disappearing in a future refactor.

🛡️ Suggested test
+def test_fetch_papers_for_mailing_skips_off_origin_links():
+    """fetch_papers_for_mailing skips papers hosted outside open-std.org."""
+    html = """
+    <html><body>
+    <span id="mailing2025-01"></span>
+    <table>
+    <tr><td><a href="https://example.com/p1000r0.pdf">p1000r0.pdf</a></td></tr>
+    </table>
+    </body></html>
+    """
+    with patch("wg21_paper_tracker.fetcher.requests.get") as m:
+        resp = MagicMock()
+        resp.text = html
+        resp.raise_for_status = MagicMock()
+        m.return_value = resp
+        result = fetch_papers_for_mailing("2025", "2025-01")
+    assert result == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_fetcher.py` around lines 91 - 206, Add a
regression test that verifies fetch_papers_for_mailing skips papers whose
resolved URLs are off-origin (not under www.open-std.org). Create a test like
the others that patches wg21_paper_tracker.fetcher.requests.get to return HTML
with the mailing anchor (id or name) and a table row containing an absolute href
pointing to another host (e.g., "http://evil.com/p1234.pdf"); call
fetch_papers_for_mailing("2025","2025-01") and assert the result does not
include that paper (e.g., result == []). Reference fetch_papers_for_mailing and
the requests.get patch used in existing tests to locate where to add this new
test.
wg21_paper_tracker/admin.py (1)

19-32: Use list_select_related for the related objects shown in the changelist.

Both admins render foreign keys in list_display, so the changelist will do extra queries for mailing, paper, and profile as row counts grow.

⚡ Suggested tweak
 `@admin.register`(WG21Paper)
 class WG21PaperAdmin(admin.ModelAdmin):
+    list_select_related = ("mailing",)
     list_display = (
         "paper_id",
         "year",
         "title",
         "document_date",
@@
 `@admin.register`(WG21PaperAuthor)
 class WG21PaperAuthorAdmin(admin.ModelAdmin):
+    list_select_related = ("paper", "profile")
     list_display = ("paper", "profile", "created_at")
     search_fields = ("paper__paper_id", "profile__display_name")
     raw_id_fields = ("paper", "profile")

Also applies to: 35-39

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/admin.py` around lines 19 - 32, Add list_select_related to
the admin classes to prefetch related FK objects shown in list_display: in
WG21PaperAdmin add list_select_related = ("mailing", "subgroup") (or include any
other FK fields like "year" if it's a related model) to avoid per-row queries
for mailing/subgroup, and in WG21PaperAuthorAdmin (the admin responsible for
author rows) add list_select_related = ("paper", "profile") so the changelist
uses joins instead of extra queries for paper/profile; place these attributes on
the respective ModelAdmin classes (WG21PaperAdmin and WG21PaperAuthorAdmin).
wg21_paper_tracker/tests/test_services.py (2)

157-170: Assert the placeholder row is promoted, not just stored.

The docstring says this year=None row can be updated later, but the test only checks the intermediate 0 sentinel. Adding a second get_or_create_paper(...) call with the real year would protect the year=0 -> real year promotion path in wg21_paper_tracker/services.py from regressing silently.

🧪 Suggested test strengthening
 `@pytest.mark.django_db`
-def test_get_or_create_paper_year_none_stored_as_zero(db):
+def test_get_or_create_paper_year_none_stored_as_zero():
     """get_or_create_paper with year=None stores 0 so records can be updated later."""
     mailing, _ = get_or_create_mailing("2025-01", "Title")
     paper, _ = get_or_create_paper(
         paper_id="n5034",
         url="https://example.com/n5034.html",
@@
         mailing=mailing,
         year=None,
     )
     assert paper.year == 0
+
+    promoted, created = get_or_create_paper(
+        paper_id="n5034",
+        url="https://example.com/n5034.html",
+        title="Draft",
+        document_date=None,
+        mailing=mailing,
+        year=2025,
+    )
+    assert created is False
+    assert promoted.pk == paper.pk
+    assert promoted.year == 2025
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` around lines 157 - 170, The test
test_get_or_create_paper_year_none_stored_as_zero only asserts that year=None
becomes the sentinel 0 but doesn't exercise the promotion path; update the test
to call get_or_create_paper a second time with the same paper_id/url but with
document_date or year set to the real year (e.g., year=2025), verify it returns
the same Paper instance (identity or id) and assert paper.year is updated from 0
to the real year, ensuring the promotion logic in get_or_create_paper in
wg21_paper_tracker/services.py is exercised.

51-51: Remove the unused db fixture parameters.

These signatures never reference db, so Ruff will keep reporting ARG001 on them.

Also applies to: 77-77, 112-112, 127-127, 173-173, 199-199, 222-222, 240-240

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` at line 51, The listed test
functions include an unused db fixture parameter (e.g.,
test_get_or_create_paper_creates_new) which triggers ARG001; remove the unused
"db" parameter from each test signature so the tests only accept fixtures they
actually use, and run/verify the tests to ensure no other references to "db"
remain in those functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cppa_user_tracker/services.py`:
- Around line 383-391: When reusing an existing WG21PaperAuthorProfile from
candidates, ensure a newly provided email_val is backfilled into that profile:
in the single-candidate branch (when len(candidates) == 1) and in the final
tie-break return (returning candidates[0]), check if email_val is truthy and the
chosen profile does not already have that email (use
p.emails.filter(email=email_val).exists()); if missing, create the email record
(e.g., p.emails.create(email=email_val)) or otherwise add and save the profile
before returning. This ensures name-only profiles get updated when richer
metadata arrives.

In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 234-245: The page-level fallback messages in openai_converter.py
currently use phrases like "Conversion failed" and "Error processing" which
trigger is_content_valid() rejection; update both places that append to
markdown_parts (the else branch that logs "Failed to convert page {page_num}
with OpenAI" and the except block that logs "Error processing page {page_num}")
to use a neutral placeholder such as "## Page {page_num}\n\n*[Content
unavailable for this page]*\n\n" (or similar wording without
"error"/"failed"/"processing") so that page 1 falling into these branches won’t
mark the assembled Markdown as a converter failure. Ensure you change both
markdown_parts.append calls shown in the diff.

In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 62-68: The table cell rendering currently treats any falsy value
as empty in pdfplumber_converter.py (the generator inside the
markdown_parts.append that builds table rows), which removes 0 and False; change
the conditional to only treat None as empty (use "if cell is not None" instead
of "if cell") so that values like 0 and False are preserved by converting them
with str(cell).

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 203-223: The code increments stats["papers_updated"] at the start
of the IntegrityError handler before attempting the lookup/update, which
overstates updates when the recovery fails or the row is skipped; move the
stats["papers_updated"] increment so it runs only after a successful recovery
save (i.e., after paper.save() in the except IntegrityError block where you find
paper via WG21Paper.objects.filter(...).first()), and ensure stats["skipped"] is
incremented when paper is None or any subsequent failure prevents the save.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 197-223: The comments around the papers_by_id grouping and
selection are misleading: format_priority() returns lower numbers for adoc/html
and the min(p_list, key=lambda x: format_priority(x["type"])) picks adoc/html
before pdf, so update the surrounding comments (references: papers_by_id,
format_priority, best_paper, min(...)) to state that adoc > html > ps > pdf is
preferred and that we intentionally pick adoc/html first (not “prioritize PDF”
or “PDF first for conversion”).
- Around line 127-145: The current logic only re-checks latest_mailing, which
lets earlier mailings with transient failures never get retried; instead,
identify and requeue any mailing that appears incomplete by checking
fetch_papers_for_mailing() results for each candidate mailing and append those
with empty/failed results to new_mailings. Concretely, change the filter logic
around new_mailings/latest_mailing: for each mailing in all_mailings with
mailing_date > latest_date (and also include any mailing equal to
latest_mailing), call fetch_papers_for_mailing(m) and if it returns an empty
list (the defined failure mode) treat that mailing as incomplete and ensure it
stays in new_mailings so it will be retried; keep using the existing symbols
new_mailings, latest_mailing, all_mailings, and fetch_papers_for_mailing to
locate and implement the change.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 50-63: Update the test to use pytest's tmp_path fixture instead of
a hard-coded POSIX path: change the test signature to accept tmp_path (def
test_download_file_uses_timeout(tmp_path):), replace filepath = Path("/tmp/out")
with filepath = tmp_path / "out", and remove the now-unused Path import; keep
the rest of the assertions and the patched call to
wg21_paper_tracker.pipeline.requests.get and the reference to DOWNLOAD_TIMEOUT
unchanged.

In `@wg21_paper_tracker/workspace.py`:
- Around line 22-25: get_raw_dir currently calls
_MAILING_DATE_RE.fullmatch(mailing_date) even when mailing_date is None, causing
a TypeError; update the validation to first check for None before applying the
regex (e.g., only call _MAILING_DATE_RE.fullmatch when mailing_date is not None)
so that the existing year-only fallback logic in get_raw_dir still executes when
mailing_date is None.

---

Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The GCS listing prefix is incorrect: update the raw_prefix
variable currently set to "raw/wg21_paper_tracker/" to the actual uploader path
"raw/wg21_papers/" so that blobs = client.list_blobs(bucket, prefix=raw_prefix)
will find newly uploaded PDFs; leave converted_prefix as-is and verify the
uploader writes to "raw/wg21_papers/" if needed.

---

Nitpick comments:
In `@wg21_paper_tracker/admin.py`:
- Around line 19-32: Add list_select_related to the admin classes to prefetch
related FK objects shown in list_display: in WG21PaperAdmin add
list_select_related = ("mailing", "subgroup") (or include any other FK fields
like "year" if it's a related model) to avoid per-row queries for
mailing/subgroup, and in WG21PaperAuthorAdmin (the admin responsible for author
rows) add list_select_related = ("paper", "profile") so the changelist uses
joins instead of extra queries for paper/profile; place these attributes on the
respective ModelAdmin classes (WG21PaperAdmin and WG21PaperAuthorAdmin).

In `@wg21_paper_tracker/tests/test_fetcher.py`:
- Around line 91-206: Add a regression test that verifies
fetch_papers_for_mailing skips papers whose resolved URLs are off-origin (not
under www.open-std.org). Create a test like the others that patches
wg21_paper_tracker.fetcher.requests.get to return HTML with the mailing anchor
(id or name) and a table row containing an absolute href pointing to another
host (e.g., "http://evil.com/p1234.pdf"); call
fetch_papers_for_mailing("2025","2025-01") and assert the result does not
include that paper (e.g., result == []). Reference fetch_papers_for_mailing and
the requests.get patch used in existing tests to locate where to add this new
test.

In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 35-46: Update the test to assert the actual retained prefix of the
title, not just its length: after computing rendered_title = str(p).split(": ",
1)[-1], assert that rendered_title == long_title[:60] (or at minimum startswith
long_title[:60]) so WG21Paper.__str__ is returning the correct leading slice of
the original title; reference the test function
test_wg21_paper_str_truncates_long_title and the WG21Paper.__str__ output
extraction (str(p).split(": ", 1)[-1]) when making the change.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 157-170: The test
test_get_or_create_paper_year_none_stored_as_zero only asserts that year=None
becomes the sentinel 0 but doesn't exercise the promotion path; update the test
to call get_or_create_paper a second time with the same paper_id/url but with
document_date or year set to the real year (e.g., year=2025), verify it returns
the same Paper instance (identity or id) and assert paper.year is updated from 0
to the real year, ensuring the promotion logic in get_or_create_paper in
wg21_paper_tracker/services.py is exercised.
- Line 51: The listed test functions include an unused db fixture parameter
(e.g., test_get_or_create_paper_creates_new) which triggers ARG001; remove the
unused "db" parameter from each test signature so the tests only accept fixtures
they actually use, and run/verify the tests to ensure no other references to
"db" remain in those functions.

In `@wg21_paper_tracker/tests/test_workspace.py`:
- Around line 43-86: Add a regression test that ensures get_raw_dir accepts
mailing_date=None and returns the year-only directory instead of crashing:
create a new test (e.g., test_get_raw_dir_accepts_none_mailing_date) that
patches wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, calls
get_raw_dir(None, 2025), and asserts the returned Path == mock_workspace_path /
"wg21_paper_tracker" / "2025" (and that it exists/is_dir and path.name ==
"2025"); also update get_raw_dir in wg21_paper_tracker.workspace to guard the
re.fullmatch check (avoid calling fullmatch on None) by treating None as the "no
mailing_date" case that yields the year-only directory.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 63584f18-f1ea-46a1-95ea-5e18b05dafb3

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 005278a.

📒 Files selected for processing (40)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread cppa_user_tracker/services.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py Outdated
Comment thread wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/tests/test_pipeline.py Outdated
Comment thread wg21_paper_tracker/workspace.py
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (2)
wg21_paper_tracker/tests/test_services.py (1)

157-169: ⚠️ Potential issue | 🟠 Major

Sentinel value 0 for unknown year may cause collision issues.

The test asserts that year=None is stored as 0. Since the composite unique key is (paper_id, year), two imports of the same paper_id without a resolved year will target the same row (both mapping to year=0), preventing them from being distinguishable.

Consider whether this is the intended behavior. If unknown years should remain distinct, either:

  • Keep year nullable in the model
  • Reject missing years at the service boundary with a validation error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` around lines 157 - 169, The test
and service currently map year=None to 0 causing collisions due to the composite
unique key (paper_id, year); update the behavior so unknown years remain
distinct by removing the sentinel mapping: change get_or_create_paper to stop
coercing document_date/year None to 0 (leave year as None) and ensure the Paper
model's year field allows null (nullable) or adjust the unique constraint to
exclude NULLs as appropriate, and then update tests (e.g.,
test_get_or_create_paper_year_none_stored_as_zero) to assert year is None (or to
expect a validation error if you choose to reject missing years at the service
boundary) so imports of the same paper_id without resolved years do not collide.
cppa_user_tracker/services.py (1)

358-381: ⚠️ Potential issue | 🟠 Major

Race condition on profile creation with non-unique display_name.

WG21PaperAuthorProfile.display_name is indexed but not unique. The check at line 371-375 (finding candidates) and creation at line 378 are not atomic, so concurrent calls with the same display_name can create duplicate profiles.

Consider wrapping the lookup-and-create logic in select_for_update() or using get_or_create with a proper unique constraint on display_name (or an author_alias field if disambiguation is needed).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/services.py` around lines 358 - 381, The current
get_or_create_wg21_paper_author_profile function can race when multiple requests
create a WG21PaperAuthorProfile with the same non-unique display_name; make the
lookup-and-create atomic by wrapping the operation in a transaction
(transaction.atomic) and locking the relevant rows (e.g., use
WG21PaperAuthorProfile.objects.select_for_update().filter(display_name=display_name_val)
inside the transaction) or use WG21PaperAuthorProfile.objects.get_or_create with
a unique constraint on the canonical key (or an author_alias) and handle
IntegrityError with a retry to avoid duplicates; ensure add_email(profile,
email_val, is_primary=True) still runs only for the created profile and return
the existing profile when found.
🧹 Nitpick comments (8)
wg21_paper_tracker/tests/test_models.py (1)

35-46: Tighten the truncation assertion.

Line 46 only checks the rendered title is short enough, so this test would still pass if WG21Paper.__str__ stopped including the title entirely. Assert the expected truncated content as well so the test locks down the actual contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_models.py` around lines 35 - 46, The test
test_wg21_paper_str_truncates_long_title currently only checks length; update it
to assert the actual truncated content from WG21Paper.__str__ instead of just
length: compute rendered = str(p).split(": ", 1)[-1] and assert that rendered
startswith the first 60 characters of long_title (long_title[:60]) so the title
is present and correctly truncated; keep the existing setup (WG21Mailing m,
WG21Paper p) and replace the len(...) assertion with this content check.
wg21_paper_tracker/cloud_run_job/Dockerfile (1)

22-23: Add a .dockerignore next to this Dockerfile to enforce proper build context.

The documented build process correctly limits build context by running docker build from within wg21_paper_tracker/cloud_run_job/ (per docs/operations/WG21_Cloud_Run.md), so the current risk is low. However, adding a .dockerignore would defensively prevent accidental inclusion of unrelated repo content if the build is ever invoked from the repo root without proper context handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/Dockerfile` around lines 22 - 23, Add a
.dockerignore next to the Dockerfile to prevent accidental inclusion of the
whole repo when the Docker build context is wrong: create a .dockerignore in the
same directory as the Dockerfile and exclude large/unrelated items (e.g., .git,
node_modules, .venv, .env, build/ dist/ caches, IDE files) and any parent-folder
paths, and then explicitly allow the files your Docker build needs (so the COPY
--chown=app:app . . step still copies only the app sources, Dockerfile,
requirements.txt/pyproject.toml or package.json and other build inputs
referenced by your Dockerfile). Ensure the patterns cover common heavy folders
and that required build artifacts referenced by the Dockerfile are whitelisted.
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

95-97: Minor: Redundant exception object in logging.exception.

The exception object e is automatically included by logging.exception. You can simplify this.

♻️ Optional cleanup
-        except Exception as e:
-            logger.exception("WG21 Paper Tracker failed: %s", e)
+        except Exception:
+            logger.exception("WG21 Paper Tracker failed.")
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 95 - 97, In the except block that currently reads "except Exception as e:"
remove the redundant exception object from the logging call by replacing
logger.exception("WG21 Paper Tracker failed: %s", e) with logger.exception("WG21
Paper Tracker failed") so the exception is still logged (logger.exception
automatically includes the traceback) and keep the subsequent raise unchanged;
locate this change around the except block where the variable e is defined and
logger.exception is called.
wg21_paper_tracker/tests/test_pipeline.py (1)

124-165: Consider using pytest.ExitStack or contextlib.ExitStack to reduce nesting.

The deeply nested with patch(...) blocks (6 levels) make this test harder to read. Consider using ExitStack or multiple @patch decorators.

♻️ Example using decorators
`@pytest.mark.django_db`
`@patch`("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True)
`@patch`("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket")
`@patch`("wg21_paper_tracker.pipeline._download_file", return_value=True)
`@patch`("wg21_paper_tracker.pipeline.fetch_papers_for_mailing")
`@patch`("wg21_paper_tracker.pipeline.fetch_all_mailings")
`@patch`("wg21_paper_tracker.pipeline.get_raw_dir")
def test_run_tracker_pipeline_downloads_new_papers(
    mock_get_raw_dir, mock_fetch_mailings, mock_fetch_papers,
    mock_download, mock_bucket, mock_upload, tmp_path
):
    # ... setup mocks and assertions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 124 - 165, The test
test_run_tracker_pipeline_downloads_new_papers has six nested with patch(...)
blocks which reduce readability; refactor it to use contextlib.ExitStack or
pytest `@patch` decorators to flatten the nesting. Locate the test function and
replace the nested patches for fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET and _upload_to_gcs with
either a single ExitStack that enters each patcher or with stacked `@patch`
decorators (keeping the same return_value or replacement values) and adjust the
test signature to accept the mock objects in the correct order.
wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)

59-63: Consider using %s formatting or !s conversion for exception in f-string.

Per Ruff RUF010, str(e) in f-strings can be replaced with {e!s} for explicit string conversion.

♻️ Proposed fix
     except Exception as e:
         logger.error(
-            f"Docling conversion failed for {pdf_path.name}: {str(e)}", exc_info=True
+            f"Docling conversion failed for {pdf_path.name}: {e!s}", exc_info=True
         )
         return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around
lines 59 - 63, The exception log in the except block of docling_converter.py
uses str(e) inside an f-string; update the logger.error call to use explicit
string conversion {e!s} (or change to %-formatting with %s) instead of str(e) so
it complies with Ruff RUF010 — locate the except block where
logger.error(f"Docling conversion failed for {pdf_path.name}: {str(e)}",
exc_info=True) and replace the str(e) usage with {e!s} while keeping
pdf_path.name and exc_info=True unchanged.
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (1)

163-170: Use logger.exception to preserve stack traces in error logs.

When catching exceptions in data processing, using logger.exception instead of logger.error automatically includes the stack trace, which aids debugging.

♻️ Proposed fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error(
+                logger.exception(
                     "Error parsing document date for paper_id=%s: %s",
                     paper_id,
                     e,
                 )
                 continue

Apply similar changes to lines 212, 235-239, and 242.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 163 - 170, Replace the logger.error calls inside the except
Exception as e handlers in import_wg21_metadata_from_csv.py with
logger.exception so the stack trace is preserved; specifically update the except
block that logs "Error parsing document date for paper_id=%s" as well as the
other similar except blocks referenced (around the blocks currently at lines
212, 235-239, and 242) to call logger.exception(...) while keeping the existing
stats["skipped"] increment and continue behavior. Ensure the message text
remains descriptive and pass the same context (e.g., paper_id) to the
logger.exception call.
wg21_paper_tracker/services.py (2)

152-158: Consider requiring year in mark_paper_downloaded().

Filtering by (paper_id, year) is the right fix, but the default year=None now normalizes to 0. A caller that forgets to pass the year silently targets only placeholder rows, which is easy to miss because the helper name still reads like a complete action. Making year required would surface that mistake at the call site.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/services.py` around lines 152 - 158, The function
mark_paper_downloaded currently defaults year to None which _normalize_year
converts to 0, allowing callers to silently hit placeholder rows; change the
signature of mark_paper_downloaded to require year (remove the = None default)
and validate/raise a clear exception if year is missing before calling
_normalize_year; update any callers to pass an explicit year and keep the
WG21Paper.objects.filter(paper_id=..., year=year_val).update(...) logic
unchanged so we no longer risk silently updating placeholder rows.

40-50: Normalize document_date at the service boundary.

paper.document_date comes back from Django as a date, but this helper still types document_date as Optional[str] and passes it through untouched even though run_tracker_pipeline() already supplies a parsed date | None. That makes the contract ambiguous, and any caller that sends "YYYY-MM-DD" will keep the row dirty even when the stored value is the same logical date. Normalize once inside the service and use that value everywhere in the update/default paths.

Also applies to: 62-64, 87-89, 101-103, 113-115

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/services.py` around lines 40 - 50, The service should
normalize the incoming document_date to a native date once at the boundary (in
get_or_create_paper) and use that normalized value for all comparisons, defaults
and updates; change the parameter handling in get_or_create_paper to
parse/convert Optional[str] or date inputs into a single date | None variable
and then use that variable in calls that set or compare paper.document_date
(including create_or_update logic), and apply the same normalization pattern to
the other helper functions that accept document_date (the other occurrences
noted in the review) so callers supplying "YYYY-MM-DD" strings won't keep rows
dirty and the contract is unambiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/operations/WG21_Cloud_Run.md`:
- Line 5: Update the wording to clarify that Cloud Run invocation is conditional
rather than automatic: change the sentence referencing run_wg21_paper_tracker
and wg21-convert to state that the Django tracker can be configured to trigger
the wg21-convert Cloud Run job, but this is optional (e.g., GCS-only deployments
or environments that use a different job name will not trigger it), and mirror
this conditional language in the intro and the flow summary so operators
understand the trigger is configurable rather than mandatory.

In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile
currently imply lookup-only behavior but the function also mutates returned
profiles by attaching a provided email when the profile is missing one; update
the table description for get_or_create_wg21_paper_author_profile to explicitly
state that if an email argument is supplied and the resolved/created
WG21PaperAuthorProfile lacks an email, the function will associate that email
with the profile and return the profile plus a boolean indicating creation;
mention the side-effect so callers know it can modify existing profiles.

In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 37-59: pdf_to_images currently calls convert_from_path() which
materializes all pages into PIL Images and keeps them alive across
convert_with_openai, causing memory spikes for large PDFs; change pdf_to_images
(and adjust convert_with_openai usage) to avoid loading the entire PDF by either
calling convert_from_path(..., paths_only=True, output_folder=tmp_dir) to get
per-page file paths and yield/process images one-at-a-time, or implement
batching using convert_from_path with first_page/last_page to convert and
process N pages at a time, ensuring you create a temporary output folder
(tmp_dir) and clean up intermediate files so only the current batch/page is
resident in memory; update references to pdf_to_images and convert_with_openai
to consume images/paths as a generator or in small batches instead of relying on
a full list.

In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 59-63: The migration currently sets the IntegerField "year" with
default=0 which permits invalid rows; change the field definition in the initial
migration (0001_initial.py) for the "year" field
(models.IntegerField(db_index=True, default=0)) to be required by removing
default=0 and ensuring null=False (e.g. models.IntegerField(db_index=True) with
no default and no null=True), so missing parsed years will raise errors instead
of inserting a bogus 0 and thereby keep the uniqueness constraint on (paper_id,
year) meaningful.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 201-205: In the for p in papers loop compute pid = (p["paper_id"]
or "").strip().lower() and if pid is empty, emit a warning (e.g.,
logging.warning or the module's logger) including enough context from p (title
or source) and continue so you never insert an empty key into papers_by_id; this
prevents creating the "" bucket that collapses malformed entries and affects the
downstream (paper_id, year) uniqueness and the downloaded check around the later
code (near the current line 216).

---

Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-381: The current get_or_create_wg21_paper_author_profile
function can race when multiple requests create a WG21PaperAuthorProfile with
the same non-unique display_name; make the lookup-and-create atomic by wrapping
the operation in a transaction (transaction.atomic) and locking the relevant
rows (e.g., use
WG21PaperAuthorProfile.objects.select_for_update().filter(display_name=display_name_val)
inside the transaction) or use WG21PaperAuthorProfile.objects.get_or_create with
a unique constraint on the canonical key (or an author_alias) and handle
IntegrityError with a retry to avoid duplicates; ensure add_email(profile,
email_val, is_primary=True) still runs only for the created profile and return
the existing profile when found.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 157-169: The test and service currently map year=None to 0 causing
collisions due to the composite unique key (paper_id, year); update the behavior
so unknown years remain distinct by removing the sentinel mapping: change
get_or_create_paper to stop coercing document_date/year None to 0 (leave year as
None) and ensure the Paper model's year field allows null (nullable) or adjust
the unique constraint to exclude NULLs as appropriate, and then update tests
(e.g., test_get_or_create_paper_year_none_stored_as_zero) to assert year is None
(or to expect a validation error if you choose to reject missing years at the
service boundary) so imports of the same paper_id without resolved years do not
collide.

---

Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 59-63: The exception log in the except block of
docling_converter.py uses str(e) inside an f-string; update the logger.error
call to use explicit string conversion {e!s} (or change to %-formatting with %s)
instead of str(e) so it complies with Ruff RUF010 — locate the except block
where logger.error(f"Docling conversion failed for {pdf_path.name}: {str(e)}",
exc_info=True) and replace the str(e) usage with {e!s} while keeping
pdf_path.name and exc_info=True unchanged.

In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 22-23: Add a .dockerignore next to the Dockerfile to prevent
accidental inclusion of the whole repo when the Docker build context is wrong:
create a .dockerignore in the same directory as the Dockerfile and exclude
large/unrelated items (e.g., .git, node_modules, .venv, .env, build/ dist/
caches, IDE files) and any parent-folder paths, and then explicitly allow the
files your Docker build needs (so the COPY --chown=app:app . . step still copies
only the app sources, Dockerfile, requirements.txt/pyproject.toml or
package.json and other build inputs referenced by your Dockerfile). Ensure the
patterns cover common heavy folders and that required build artifacts referenced
by the Dockerfile are whitelisted.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 163-170: Replace the logger.error calls inside the except
Exception as e handlers in import_wg21_metadata_from_csv.py with
logger.exception so the stack trace is preserved; specifically update the except
block that logs "Error parsing document date for paper_id=%s" as well as the
other similar except blocks referenced (around the blocks currently at lines
212, 235-239, and 242) to call logger.exception(...) while keeping the existing
stats["skipped"] increment and continue behavior. Ensure the message text
remains descriptive and pass the same context (e.g., paper_id) to the
logger.exception call.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block that currently reads "except Exception
as e:" remove the redundant exception object from the logging call by replacing
logger.exception("WG21 Paper Tracker failed: %s", e) with logger.exception("WG21
Paper Tracker failed") so the exception is still logged (logger.exception
automatically includes the traceback) and keep the subsequent raise unchanged;
locate this change around the except block where the variable e is defined and
logger.exception is called.

In `@wg21_paper_tracker/services.py`:
- Around line 152-158: The function mark_paper_downloaded currently defaults
year to None which _normalize_year converts to 0, allowing callers to silently
hit placeholder rows; change the signature of mark_paper_downloaded to require
year (remove the = None default) and validate/raise a clear exception if year is
missing before calling _normalize_year; update any callers to pass an explicit
year and keep the WG21Paper.objects.filter(paper_id=...,
year=year_val).update(...) logic unchanged so we no longer risk silently
updating placeholder rows.
- Around line 40-50: The service should normalize the incoming document_date to
a native date once at the boundary (in get_or_create_paper) and use that
normalized value for all comparisons, defaults and updates; change the parameter
handling in get_or_create_paper to parse/convert Optional[str] or date inputs
into a single date | None variable and then use that variable in calls that set
or compare paper.document_date (including create_or_update logic), and apply the
same normalization pattern to the other helper functions that accept
document_date (the other occurrences noted in the review) so callers supplying
"YYYY-MM-DD" strings won't keep rows dirty and the contract is unambiguous.

In `@wg21_paper_tracker/tests/test_models.py`:
- Around line 35-46: The test test_wg21_paper_str_truncates_long_title currently
only checks length; update it to assert the actual truncated content from
WG21Paper.__str__ instead of just length: compute rendered = str(p).split(": ",
1)[-1] and assert that rendered startswith the first 60 characters of long_title
(long_title[:60]) so the title is present and correctly truncated; keep the
existing setup (WG21Mailing m, WG21Paper p) and replace the len(...) assertion
with this content check.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 124-165: The test test_run_tracker_pipeline_downloads_new_papers
has six nested with patch(...) blocks which reduce readability; refactor it to
use contextlib.ExitStack or pytest `@patch` decorators to flatten the nesting.
Locate the test function and replace the nested patches for fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET
and _upload_to_gcs with either a single ExitStack that enters each patcher or
with stacked `@patch` decorators (keeping the same return_value or replacement
values) and adjust the test signature to accept the mock objects in the correct
order.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 68568e8a-8113-4e08-9aed-2cf2b329bd64

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 3547652.

📒 Files selected for processing (40)
  • config/settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread docs/operations/WG21_Cloud_Run.md Outdated
Comment thread docs/service_api/cppa_user_tracker.md Outdated
Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/migrations/0001_initial.py
Comment thread wg21_paper_tracker/pipeline.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
cppa_user_tracker/services.py (2)

358-375: ⚠️ Potential issue | 🟠 Major

Concurrent profile creation can produce duplicates.

Since display_name is only indexed (not unique), two concurrent calls with the same name can both see candidates=[] and each create a new profile. Later lookups will see multiple candidates and may silently pick candidates[0], merging data that should stay separate.

Consider wrapping the creation in select_for_update() or using get_or_create with a unique constraint on a canonicalized key (e.g., lowercased display_name + email hash).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/services.py` around lines 358 - 375, The
get_or_create_wg21_paper_author_profile function can create duplicate
WG21PaperAuthorProfile rows under concurrent calls; fix by performing the
existence check and potential creation inside a DB transaction with row-level
locking or by using a canonical unique key and get_or_create: wrap the candidate
lookup/creation in transaction.atomic(), re-query with select_for_update() on
WG21PaperAuthorProfile.objects.filter(display_name=display_name_val) to lock
rows before deciding to create, or alternatively add a canonical unique field
(e.g., canonical_display_name or canonical_display_name_plus_email) on
WG21PaperAuthorProfile and use
WG21PaperAuthorProfile.objects.get_or_create(canonical_field=...,
defaults={...}) so concurrent creates cannot produce duplicates; ensure you
update the function to use the chosen approach and remove the race by
referencing display_name, email, candidates and WG21PaperAuthorProfile in the
change.

393-405: ⚠️ Potential issue | 🟠 Major

Attaching an unmatched email to an arbitrary same-name profile can merge distinct authors.

When multiple display_name matches exist and none already owns email_val, the fallback picks candidates[0] and adds the new email there. This silently merges distinct authors whenever two people share a name. The safer fallback is to create a fresh profile for that email instead of mutating an arbitrary candidate.

Suggested fix
     # Two or more: disambiguate by email if provided
     if email_val:
         for p in candidates:
             if p.emails.filter(email=email_val).exists():
                 return p, False
-    profile = candidates[0]
-    if email_val and not profile.emails.filter(email=email_val).exists():
-        add_email(
-            profile,
-            email_val,
-            is_primary=not profile.emails.filter(is_active=True).exists(),
-        )
-    return profile, False
+        # No existing profile owns this email; create a new one
+        profile = WG21PaperAuthorProfile.objects.create(display_name=display_name_val)
+        add_email(profile, email_val, is_primary=True)
+        return profile, True
+    # No email provided; return first match
+    return candidates[0], False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/services.py` around lines 393 - 405, When multiple
candidates match display_name but none own email_val, do not pick candidates[0]
and attach the email; instead create a fresh profile for that email and attach
the email to it. Replace the fallback block that sets profile = candidates[0]
and calls add_email(...) with logic that constructs a new profile (using the
existing profile-creation helper in this module or Profile model factory), call
add_email(new_profile, email_val, is_primary=...), and return the new profile
(and the created flag, e.g. True) rather than mutating an arbitrary candidate;
keep references to candidates, email_val and add_email to locate and update the
code.
wg21_paper_tracker/cloud_run_job/main.py (2)

77-80: ⚠️ Potential issue | 🔴 Critical

Use the uploader's raw GCS prefix.

Line 77 still points at raw/wg21_paper_tracker/, but the uploader stores PDFs under raw/wg21_papers/.... With the current prefix, newly uploaded papers are never discovered, so the conversion stage never runs.

🐛 Minimal fix
-    raw_prefix = "raw/wg21_paper_tracker/"
+    raw_prefix = "raw/wg21_papers/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The GCS
prefix used to list new uploads is incorrect: update the raw_prefix variable in
main.py from "raw/wg21_paper_tracker/" to the uploader's actual prefix
"raw/wg21_papers/" so client.list_blobs(bucket, prefix=raw_prefix) will discover
newly uploaded PDFs; verify converted_prefix remains "converted/wg21_papers/"
and re-run the listing logic (the change affects the raw_prefix symbol used
before calling client.list_blobs).

39-45: ⚠️ Potential issue | 🟠 Major

Don't reject valid output just because it mentions error: or exception:.

Line 41 bypasses the proximity check, so a legitimate paper that contains those tokens anywhere in the first 1000 characters is treated as a converter failure. That can trigger unnecessary PDFPlumber/OpenAI fallbacks and still drop valid conversions for papers about error handling or exception safety.

💡 Minimal fix
     for pattern in error_patterns:
         if pattern in first_part:
-            if pattern in ("error:", "exception:"):
-                return False
             idx = first_part.find(pattern)
             if idx < 100:
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The current
loop over error_patterns incorrectly treats "error:" and "exception:" as
immediate failures (the `if pattern in ("error:", "exception:"): return False`
branch), bypassing the proximity check; remove that special-case and apply the
same proximity logic to all patterns by finding `idx = first_part.find(pattern)`
and returning False only if `idx` is within the intended proximity threshold
(use 1000 chars as per review or the existing threshold variable), i.e., replace
the special-case with the same `idx` check for `error_patterns`, referencing the
loop variables `error_patterns`, `first_part`, `pattern`, and `idx`.
🧹 Nitpick comments (5)
wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py (1)

97-102: Minor style suggestion: use f-string conversion flag.

Static analysis suggests using the explicit conversion flag !s instead of wrapping with str().

♻️ Optional style improvement
     except Exception as e:
         logger.error(
-            f"PDFPlumber conversion failed for {pdf_path.name}: {str(e)}",
+            f"PDFPlumber conversion failed for {pdf_path.name}: {e!s}",
             exc_info=True,
         )
         return None

The same applies to line 81 (str(e){e!s}).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py` around
lines 97 - 102, The log message uses str(e) which static analysis flags; update
the exception formatting in the PDFPlumber conversion catch blocks to use the
f-string conversion flag (!s) instead of str(), i.e., replace occurrences of
str(e) with {e!s} in the logger.error calls that reference pdf_path.name and the
earlier similar logger call; keep the rest of the logger.error parameters
(exc_info=True) unchanged and ensure you update both places where the exception
variable e is interpolated.
wg21_paper_tracker/migrations/0001_initial.py (1)

28-31: Minor: db_index=True is redundant on mailing_date since unique=True already creates an index.

PostgreSQL automatically creates a B-tree index for unique constraints. This is a minor redundancy and doesn't affect functionality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/migrations/0001_initial.py` around lines 28 - 31, Remove
the redundant db_index=True from the mailing_date field definition (the
models.CharField named "mailing_date" with max_length=7 and unique=True) in the
migration so the unique constraint remains but no duplicate index is requested;
edit the tuple for "mailing_date" in 0001_initial.py to drop db_index=True and
leave unique=True so PostgreSQL creates only the unique index.
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

48-97: Add command-level tests for the new orchestration branches.

This command now owns the dry-run short circuit, Cloud Run gating, and failure propagation, but there’s no coverage here for those branches. A small call_command() suite with mocked run_tracker_pipeline() / trigger_cloud_run_job() would make the intended behavior much safer to change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 48 - 97, Add unit tests that exercise the management command's branches by
calling the command via Django's call_command and mocking run_tracker_pipeline
and trigger_cloud_run_job: test dry-run by passing --dry-run and asserting
run_tracker_pipeline/trigger_cloud_run_job are not called; test no-new-papers by
mocking run_tracker_pipeline to return 0 and asserting trigger_cloud_run_job is
not called; test Cloud Run trigger by mocking run_tracker_pipeline to return >0
and configuring settings (GCP_PROJECT_ID, WG21_CLOUD_RUN_JOB_NAME,
WG21_GCS_BUCKET, WG21_CLOUD_RUN_ENABLED=True) and asserting
trigger_cloud_run_job is called with (project_id, location, job_name); and test
failure propagation by having trigger_cloud_run_job raise and asserting the
command raises the exception. Use mocks/patch for run_tracker_pipeline and
trigger_cloud_run_job and override_settings for the required WG21_* settings to
isolate behavior.
cppa_user_tracker/tests/test_services.py (1)

638-648: Assert the email-association side effect in this fallback case.

This branch only verifies that the first profile is returned. It would still pass if get_or_create_wg21_paper_author_profile() stopped attaching nomatch@example.com to that returned profile, even though the public API now documents that side effect.

🧪 Tighten the assertion
 def test_get_or_create_wg21_paper_author_profile_two_candidates_email_matches_none_returns_first():
     """get_or_create_wg21_paper_author_profile returns first when email provided but no match."""
     first = WG21PaperAuthorProfile.objects.create(display_name="Other Name")
     second = WG21PaperAuthorProfile.objects.create(display_name="Other Name")
     services.add_email(second, "other@example.com", is_primary=True)
     profile, created = services.get_or_create_wg21_paper_author_profile(
         display_name="Other Name",
         email="nomatch@example.com",
     )
     assert created is False
     assert profile.id == first.id
+    assert first.emails.filter(email="nomatch@example.com").exists()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cppa_user_tracker/tests/test_services.py` around lines 638 - 648, The test
must also assert the documented side effect that nomatch@example.com is attached
to the returned profile: after calling
services.get_or_create_wg21_paper_author_profile(...), add an assertion that the
returned profile has that email associated (e.g. assert
services.get_emails(profile) contains "nomatch@example.com" or assert
profile.emails.filter(address="nomatch@example.com").exists()), referencing
WG21PaperAuthorProfile and services.get_or_create_wg21_paper_author_profile to
locate where to add the assertion.
wg21_paper_tracker/tests/test_pipeline.py (1)

146-165: Assert the upload side effect in the happy-path test.

Right now this still passes if run_tracker_pipeline() increments its count before _upload_to_gcs() or stops calling upload entirely. Since GCS upload is the new behavior here, make the mock observable.

🧪 Tighten the assertion
-                    with patch(
-                        "wg21_paper_tracker.pipeline._upload_to_gcs",
-                        return_value=True,
-                    ):
-                        n = run_tracker_pipeline()
+                    with patch(
+                        "wg21_paper_tracker.pipeline._upload_to_gcs",
+                        return_value=True,
+                    ) as upload_mock:
+                        n = run_tracker_pipeline()
     assert n == 1
+    upload_mock.assert_called_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The test
currently patches _upload_to_gcs but doesn't assert it was called; update the
happy-path test to make the mock observable by assigning the patch to a variable
(e.g., mock_upload) when patching "wg21_paper_tracker.pipeline._upload_to_gcs",
run run_tracker_pipeline(), then assert mock_upload was called the expected
number of times and/or with the expected args (e.g.,
mock_upload.assert_called_once() or mock_upload.assert_called_with(...)) to
ensure run_tracker_pipeline() actually performs the GCS upload side effect.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 182-239: The IntegrityError except block can be triggered by a
race creating the mailing, so before attempting to recover/update the paper you
must ensure `mailing` is re-resolved; inside the `except IntegrityError as e:`
(where `get_or_create_paper` and `get_or_create_mailing` are used) fetch or
recreate the mailing (e.g., re-run `get_or_create_mailing(mailing_date,
mailing_title)` or lookup by those fields) and assign it to the `mailing`
variable before using it to update `paper` in the recovery path (ensure you
handle failure to find/create mailing and only then proceed to lookup/update
`WG21Paper`), so the recovery no longer relies on an uninitialized `mailing`
when `get_or_create_mailing` raced.

---

Duplicate comments:
In `@cppa_user_tracker/services.py`:
- Around line 358-375: The get_or_create_wg21_paper_author_profile function can
create duplicate WG21PaperAuthorProfile rows under concurrent calls; fix by
performing the existence check and potential creation inside a DB transaction
with row-level locking or by using a canonical unique key and get_or_create:
wrap the candidate lookup/creation in transaction.atomic(), re-query with
select_for_update() on
WG21PaperAuthorProfile.objects.filter(display_name=display_name_val) to lock
rows before deciding to create, or alternatively add a canonical unique field
(e.g., canonical_display_name or canonical_display_name_plus_email) on
WG21PaperAuthorProfile and use
WG21PaperAuthorProfile.objects.get_or_create(canonical_field=...,
defaults={...}) so concurrent creates cannot produce duplicates; ensure you
update the function to use the chosen approach and remove the race by
referencing display_name, email, candidates and WG21PaperAuthorProfile in the
change.
- Around line 393-405: When multiple candidates match display_name but none own
email_val, do not pick candidates[0] and attach the email; instead create a
fresh profile for that email and attach the email to it. Replace the fallback
block that sets profile = candidates[0] and calls add_email(...) with logic that
constructs a new profile (using the existing profile-creation helper in this
module or Profile model factory), call add_email(new_profile, email_val,
is_primary=...), and return the new profile (and the created flag, e.g. True)
rather than mutating an arbitrary candidate; keep references to candidates,
email_val and add_email to locate and update the code.

In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The GCS prefix used to list new uploads is incorrect:
update the raw_prefix variable in main.py from "raw/wg21_paper_tracker/" to the
uploader's actual prefix "raw/wg21_papers/" so client.list_blobs(bucket,
prefix=raw_prefix) will discover newly uploaded PDFs; verify converted_prefix
remains "converted/wg21_papers/" and re-run the listing logic (the change
affects the raw_prefix symbol used before calling client.list_blobs).
- Around line 39-45: The current loop over error_patterns incorrectly treats
"error:" and "exception:" as immediate failures (the `if pattern in ("error:",
"exception:"): return False` branch), bypassing the proximity check; remove that
special-case and apply the same proximity logic to all patterns by finding `idx
= first_part.find(pattern)` and returning False only if `idx` is within the
intended proximity threshold (use 1000 chars as per review or the existing
threshold variable), i.e., replace the special-case with the same `idx` check
for `error_patterns`, referencing the loop variables `error_patterns`,
`first_part`, `pattern`, and `idx`.

---

Nitpick comments:
In `@cppa_user_tracker/tests/test_services.py`:
- Around line 638-648: The test must also assert the documented side effect that
nomatch@example.com is attached to the returned profile: after calling
services.get_or_create_wg21_paper_author_profile(...), add an assertion that the
returned profile has that email associated (e.g. assert
services.get_emails(profile) contains "nomatch@example.com" or assert
profile.emails.filter(address="nomatch@example.com").exists()), referencing
WG21PaperAuthorProfile and services.get_or_create_wg21_paper_author_profile to
locate where to add the assertion.

In `@wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py`:
- Around line 97-102: The log message uses str(e) which static analysis flags;
update the exception formatting in the PDFPlumber conversion catch blocks to use
the f-string conversion flag (!s) instead of str(), i.e., replace occurrences of
str(e) with {e!s} in the logger.error calls that reference pdf_path.name and the
earlier similar logger call; keep the rest of the logger.error parameters
(exc_info=True) unchanged and ensure you update both places where the exception
variable e is interpolated.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 48-97: Add unit tests that exercise the management command's
branches by calling the command via Django's call_command and mocking
run_tracker_pipeline and trigger_cloud_run_job: test dry-run by passing
--dry-run and asserting run_tracker_pipeline/trigger_cloud_run_job are not
called; test no-new-papers by mocking run_tracker_pipeline to return 0 and
asserting trigger_cloud_run_job is not called; test Cloud Run trigger by mocking
run_tracker_pipeline to return >0 and configuring settings (GCP_PROJECT_ID,
WG21_CLOUD_RUN_JOB_NAME, WG21_GCS_BUCKET, WG21_CLOUD_RUN_ENABLED=True) and
asserting trigger_cloud_run_job is called with (project_id, location, job_name);
and test failure propagation by having trigger_cloud_run_job raise and asserting
the command raises the exception. Use mocks/patch for run_tracker_pipeline and
trigger_cloud_run_job and override_settings for the required WG21_* settings to
isolate behavior.

In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 28-31: Remove the redundant db_index=True from the mailing_date
field definition (the models.CharField named "mailing_date" with max_length=7
and unique=True) in the migration so the unique constraint remains but no
duplicate index is requested; edit the tuple for "mailing_date" in
0001_initial.py to drop db_index=True and leave unique=True so PostgreSQL
creates only the unique index.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The test currently patches _upload_to_gcs but doesn't
assert it was called; update the happy-path test to make the mock observable by
assigning the patch to a variable (e.g., mock_upload) when patching
"wg21_paper_tracker.pipeline._upload_to_gcs", run run_tracker_pipeline(), then
assert mock_upload was called the expected number of times and/or with the
expected args (e.g., mock_upload.assert_called_once() or
mock_upload.assert_called_with(...)) to ensure run_tracker_pipeline() actually
performs the GCS upload side effect.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b12acd34-fd02-4c42-8618-7660b0726e29

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and c33c475.

📒 Files selected for processing (42)
  • .github/workflows/actions.yml
  • config/settings.py
  • config/test_settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py Outdated
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
wg21_paper_tracker/cloud_run_job/main.py (2)

77-80: ⚠️ Potential issue | 🔴 Critical

Use the uploader’s raw prefix here.

main() lists raw/wg21_paper_tracker/, but the pipeline uploads PDFs under raw/wg21_papers/.... With the current prefix, this job never sees newly uploaded files, so the conversion flow is effectively disconnected.

🐛 Minimal fix
-    raw_prefix = "raw/wg21_paper_tracker/"
+    raw_prefix = "raw/wg21_papers/"
     converted_prefix = "converted/wg21_papers/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The
list_blobs call in main() is using the wrong raw prefix
("raw/wg21_paper_tracker/") so newly uploaded PDFs never get picked up; update
the raw_prefix variable used by client.list_blobs to the uploader’s prefix
("raw/wg21_papers/") (and keep converted_prefix as-is) so
client.list_blobs(bucket, prefix=raw_prefix) enumerates the actual uploaded
files.

39-45: ⚠️ Potential issue | 🟠 Major

Don’t reject valid content just because it mentions error: or exception: later in the page.

Line 41 currently returns False for any occurrence of those substrings anywhere in the first 1000 characters. That can discard legitimate papers whose opening text contains headings or examples like Exception: / error:. The special-case should only reject when those markers appear at the start of the content.

💡 Tighten the heuristic
     for pattern in error_patterns:
-        if pattern in first_part:
-            if pattern in ("error:", "exception:"):
-                return False
-            idx = first_part.find(pattern)
-            if idx < 100:
-                return False
+        idx = first_part.find(pattern)
+        if idx == -1:
+            continue
+        if pattern in ("error:", "exception:"):
+            if idx == 0:
+                return False
+            continue
+        if idx < 100:
+            return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The loop over
error_patterns is rejecting content whenever "error:" or "exception:" appears
anywhere early in first_part; tighten it so those two markers only reject if
they start the content: replace the inner special-case with a start check (e.g.,
if pattern in ("error:", "exception:"): if
first_part.lstrip().lower().startswith(pattern): return False) while leaving the
existing idx-based rejection (idx < 100) for other patterns; operate on the same
variables error_patterns, first_part and idx.
🧹 Nitpick comments (5)
wg21_paper_tracker/migrations/0001_initial.py (1)

28-31: Consider removing redundant db_index=True on unique field.

The mailing_date field has unique=True, which automatically creates a B-tree index in PostgreSQL. The explicit db_index=True is redundant. Based on project learning, unique constraints already provide indexing.

♻️ Proposed fix
                 (
                     "mailing_date",
-                    models.CharField(db_index=True, max_length=7, unique=True),
+                    models.CharField(max_length=7, unique=True),
                 ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/migrations/0001_initial.py` around lines 28 - 31, The
migration defines the "mailing_date" models.CharField with both db_index=True
and unique=True; remove the redundant db_index=True from that field declaration
(leave unique=True) so the unique constraint creates the index implicitly;
update the migration operation that contains the "mailing_date" CharField (in
the initial migration class) to drop the db_index argument and regenerate or
confirm migrations if necessary.
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)

241-243: Use logging.exception here as well.

Same issue - logging.error loses the traceback context.

♻️ Proposed fix
         except Exception as e:
             stats["skipped"] += 1
-            logger.error("Error for paper_id=%s: %s", paper_id, e)
+            logger.exception("Error for paper_id=%s", paper_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 241 - 243, The except block in import_wg21_metadata_from_csv.py is
logging exceptions with logger.error which drops traceback; replace the
logger.error("Error for paper_id=%s: %s", paper_id, e) call in the except
Exception as e handler with logger.exception("Error for paper_id=%s", paper_id)
(or logger.exception with an explicit message) so the full traceback is
recorded; keep the stats["skipped"] += 1 line unchanged and do not re-raise
unless intended.

163-170: Use logging.exception for full traceback visibility.

The static analysis correctly flags that logging.error with manual %s formatting loses the traceback. Use logging.exception to capture the full stack trace for debugging.

♻️ Proposed fix
         except Exception as e:
             stats["skipped"] += 1
-            logger.error(
-                "Error parsing document date for paper_id=%s: %s",
+            logger.exception(
+                "Error parsing document date for paper_id=%s",
                 paper_id,
-                e,
             )
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 163 - 170, The except block currently logs the error with
logger.error losing the traceback; replace the logger.error call inside the
except Exception as e in import_wg21_metadata_from_csv.py with
logger.exception(...) so the full stack trace is recorded (keep the same message
and paper_id argument), leaving stats["skipped"] increment and the continue
intact; reference symbols: logger, stats, paper_id, and the except Exception as
e block where document date parsing is handled.
wg21_paper_tracker/tests/test_pipeline.py (1)

146-165: Consider using pytest-mock to reduce nesting depth.

The deeply nested with patch(...) blocks work correctly but reduce readability. Using pytest-mock's mocker fixture would flatten this significantly.

♻️ Example with mocker fixture
`@pytest.mark.django_db`
def test_run_tracker_pipeline_downloads_new_papers(tmp_path, mocker):
    """run_tracker_pipeline downloads papers for new mailings and returns count."""
    from wg21_paper_tracker.models import WG21Mailing

    WG21Mailing.objects.create(mailing_date="2025-01", title="Previous")
    mailings = [...]
    papers = [...]
    
    mocker.patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings)
    mocker.patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers)
    mocker.patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path)
    mocker.patch("wg21_paper_tracker.pipeline._download_file", return_value=True)
    mocker.patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket")
    mocker.patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True)
    
    n = run_tracker_pipeline()
    assert n == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The nested
with-patch blocks in the test should be flattened by using the pytest-mock
"mocker" fixture: in the test function (e.g.,
test_run_tracker_pipeline_downloads_new_papers) replace the multiple with
patch(...) contexts for wg21_paper_tracker.pipeline.fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET,
and _upload_to_gcs with mocker.patch calls that return the same values
(mailings, papers, tmp_path, True, "test-bucket", True respectively), then call
run_tracker_pipeline() and assert the result remains 1; this reduces nesting
while keeping the same mocked behavior.
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

95-97: Remove redundant exception object from logging.exception.

logging.exception automatically includes the exception info from the current context. Passing e is redundant.

♻️ Proposed fix
         except Exception as e:
-            logger.exception("WG21 Paper Tracker failed: %s", e)
+            logger.exception("WG21 Paper Tracker failed.")
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 95 - 97, In the except block inside the RunWG21PaperTracker command (in
run_wg21_paper_tracker.py) remove the redundant exception object from the
logger.exception call: replace logger.exception("WG21 Paper Tracker failed: %s",
e) with logger.exception("WG21 Paper Tracker failed") so the current exception
context is logged automatically; leave the subsequent raise in place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The docs for get_or_create_wg21_paper_author_profile
currently say that when multiple profiles share the same display_name and none
match the supplied email the function "returns the first", but tests and
implementation actually create a new WG21PaperAuthorProfile in that case; update
the description to explicitly state that when multiple name-matching profiles
exist and the provided email matches none, the function will create and return a
new WG21PaperAuthorProfile (and return created=True), and also note the
side-effect that if email is supplied and the resolved or newly created profile
lacks that email it will be associated with the profile.

In `@wg21_paper_tracker/models.py`:
- Line 12: Remove the redundant db_index on the mailing_date field: since
mailing_date is declared with unique=True (models.CharField(..., unique=True)),
drop the db_index=True argument so the field is defined as mailing_date =
models.CharField(max_length=7, unique=True); update any related migrations
accordingly so only the unique index remains.

In `@wg21_paper_tracker/pipeline.py`:
- Around line 231-270: The code assumes each scraped paper dict has required
keys and may KeyError during selection or creation; first filter p_list to only
entries that have a valid "type" key usable by format_priority and required
fields ("url","title","subgroup","authors","filename") with correct types,
logging and skipping any malformed entries (use .get and isinstance checks);
then compute best_paper from the filtered list (so min(..., key=lambda x:
format_priority(x["type"])) cannot KeyError). When building filename and calling
get_or_create_paper, access fields via .get and validate/normalize author list
and subgroup before passing them into get_or_create_paper, and if any required
field is missing or invalid log a warning with pid and continue instead of
indexing.

In `@wg21_paper_tracker/services.py`:
- Around line 154-158: In mark_paper_downloaded, the sanitized paper_id can be
empty and should be rejected before proceeding; after the existing paper_id =
(paper_id or "").strip().lower() line, add a validation that raises ValueError
(e.g., "paper_id is required") if paper_id == "", so malformed/blank IDs never
reach _normalize_year or the update query; keep the existing year validation and
use the same normalized year flow afterward.

---

Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The list_blobs call in main() is using the wrong raw prefix
("raw/wg21_paper_tracker/") so newly uploaded PDFs never get picked up; update
the raw_prefix variable used by client.list_blobs to the uploader’s prefix
("raw/wg21_papers/") (and keep converted_prefix as-is) so
client.list_blobs(bucket, prefix=raw_prefix) enumerates the actual uploaded
files.
- Around line 39-45: The loop over error_patterns is rejecting content whenever
"error:" or "exception:" appears anywhere early in first_part; tighten it so
those two markers only reject if they start the content: replace the inner
special-case with a start check (e.g., if pattern in ("error:", "exception:"):
if first_part.lstrip().lower().startswith(pattern): return False) while leaving
the existing idx-based rejection (idx < 100) for other patterns; operate on the
same variables error_patterns, first_part and idx.

---

Nitpick comments:
In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 241-243: The except block in import_wg21_metadata_from_csv.py is
logging exceptions with logger.error which drops traceback; replace the
logger.error("Error for paper_id=%s: %s", paper_id, e) call in the except
Exception as e handler with logger.exception("Error for paper_id=%s", paper_id)
(or logger.exception with an explicit message) so the full traceback is
recorded; keep the stats["skipped"] += 1 line unchanged and do not re-raise
unless intended.
- Around line 163-170: The except block currently logs the error with
logger.error losing the traceback; replace the logger.error call inside the
except Exception as e in import_wg21_metadata_from_csv.py with
logger.exception(...) so the full stack trace is recorded (keep the same message
and paper_id argument), leaving stats["skipped"] increment and the continue
intact; reference symbols: logger, stats, paper_id, and the except Exception as
e block where document date parsing is handled.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block inside the RunWG21PaperTracker command
(in run_wg21_paper_tracker.py) remove the redundant exception object from the
logger.exception call: replace logger.exception("WG21 Paper Tracker failed: %s",
e) with logger.exception("WG21 Paper Tracker failed") so the current exception
context is logged automatically; leave the subsequent raise in place.

In `@wg21_paper_tracker/migrations/0001_initial.py`:
- Around line 28-31: The migration defines the "mailing_date" models.CharField
with both db_index=True and unique=True; remove the redundant db_index=True from
that field declaration (leave unique=True) so the unique constraint creates the
index implicitly; update the migration operation that contains the
"mailing_date" CharField (in the initial migration class) to drop the db_index
argument and regenerate or confirm migrations if necessary.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The nested with-patch blocks in the test should be
flattened by using the pytest-mock "mocker" fixture: in the test function (e.g.,
test_run_tracker_pipeline_downloads_new_papers) replace the multiple with
patch(...) contexts for wg21_paper_tracker.pipeline.fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET,
and _upload_to_gcs with mocker.patch calls that return the same values
(mailings, papers, tmp_path, True, "test-bucket", True respectively), then call
run_tracker_pipeline() and assert the result remains 1; this reduces nesting
while keeping the same mocked behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f538d9c1-4cc2-44ea-98cf-95ec72879f2b

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 93ee8b7.

📒 Files selected for processing (42)
  • .github/workflows/actions.yml
  • config/settings.py
  • config/test_settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread docs/service_api/cppa_user_tracker.md Outdated
Comment thread wg21_paper_tracker/models.py
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread wg21_paper_tracker/services.py
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/service_api/cppa_user_tracker.md (1)

46-48: ⚠️ Potential issue | 🟡 Minor

Documentation doesn't match implementation for the multiple-match + unmatched-email case.

The description says "otherwise returns the first" when multiple profiles exist and no email matches. However, reviewing cppa_user_tracker/services.py (lines 394-400), when multiple candidates exist with the same display_name and the supplied email matches none, the implementation creates a new profile rather than returning the first candidate.

📘 Suggested doc fix
-| `get_or_create_wg21_paper_author_profile` | `display_name: str`, `email: str \| None = None` | `tuple[WG21PaperAuthorProfile, bool]` | Resolve by display_name (optional email for disambiguation). If no profile exists, creates one and adds email if provided. If one exists, or multiple exist and one matches the email, returns that profile; otherwise returns the first. **Side effect:** if `email` is supplied and the resolved or created profile does not already have that email, the function associates it with the profile (so existing profiles may be updated). Returns the profile and a boolean indicating creation. Use when linking paper authors so that same name + same email link to the same profile. |
+| `get_or_create_wg21_paper_author_profile` | `display_name: str`, `email: str \| None = None` | `tuple[WG21PaperAuthorProfile, bool]` | Resolve by display_name (optional email for disambiguation). If no profile exists, creates one and adds email if provided. If one exists, returns it (and adds email if needed). If multiple exist and one matches the email, returns that profile. If multiple exist and no email is provided, returns the first. If multiple exist and the supplied email matches none, creates a new profile with that email. **Side effect:** if `email` is supplied and the resolved or created profile does not already have that email, the function associates it with the profile (so existing profiles may be updated). Returns the profile and a boolean indicating creation. Use when linking paper authors so that same name + same email link to the same profile. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/service_api/cppa_user_tracker.md` around lines 46 - 48, The
documentation for get_or_create_wg21_paper_author_profile is incorrect about the
multiple-match + unmatched-email case: update the description to state that when
multiple profiles share the same display_name and the supplied email does not
match any existing profile, the implementation creates a new profile (and
returns it with created=True) rather than returning the first existing
candidate; also ensure the "Side effect" text clarifies that supplying an email
may either associate it with an existing resolved profile or lead to creation of
a new profile when no email match is found.
🧹 Nitpick comments (7)
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)

241-243: Use logger.exception for better tracebacks.

Same as above—logger.exception will capture the full stack trace.

🧹 Suggested fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error("Error for paper_id=%s: %s", paper_id, e)
+                logger.exception("Error for paper_id=%s", paper_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 241 - 243, In the except block inside
import_wg21_metadata_from_csv.py (where stats["skipped"] is incremented and the
current line is logger.error("Error for paper_id=%s: %s", paper_id, e)), replace
logger.error with logger.exception so the full traceback is captured; keep the
stats["skipped"] increment and the contextual message including paper_id to
preserve existing metrics and logs while providing the full exception stack
trace for debugging.

163-170: Use logger.exception for better tracebacks.

When catching exceptions, logger.exception automatically includes the stack trace, which aids debugging.

🧹 Suggested fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error(
-                    "Error parsing document date for paper_id=%s: %s",
+                logger.exception(
+                    "Error parsing document date for paper_id=%s",
                     paper_id,
-                    e,
                 )
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 163 - 170, The except block in import_wg21_metadata_from_csv.py
that handles date parsing (inside the function handling CSV import where
stats["skipped"] is incremented) should use logger.exception instead of
logger.error so the stack trace is captured; replace the logger.error call in
the except Exception as e block with logger.exception("Error parsing document
date for paper_id=%s", paper_id) (you can still increment stats["skipped"] and
continue as before).
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

95-96: Remove redundant exception object from logging.exception.

logging.exception automatically includes exception info, so passing e is redundant.

🧹 Suggested fix
         except Exception as e:
-            logger.exception("WG21 Paper Tracker failed: %s", e)
+            logger.exception("WG21 Paper Tracker failed.")
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 95 - 96, The except block in run_wg21_paper_tracker.py currently calls
logger.exception("WG21 Paper Tracker failed: %s", e) which redundantly passes
the exception object; change the call in the except handler to
logger.exception("WG21 Paper Tracker failed") (i.e., remove the "%s" placeholder
and the e argument) so the exception info is logged via logging.exception's
built-in traceback handling; apply the same change if similar logger.exception
usages exist in the same command.
wg21_paper_tracker/tests/test_pipeline.py (1)

146-165: Consider using ExitStack to reduce nesting depth.

The seven nested with patch(...) statements make this test harder to read. contextlib.ExitStack can flatten this.

♻️ Suggested refactor
+from contextlib import ExitStack
+
 `@pytest.mark.django_db`
 def test_run_tracker_pipeline_downloads_new_papers(tmp_path):
     """run_tracker_pipeline downloads papers for new mailings and returns count."""
     from wg21_paper_tracker.models import WG21Mailing

     WG21Mailing.objects.create(mailing_date="2025-01", title="Previous")
     mailings = [...]
     papers = [...]
-    with patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings):
-        with patch(
-            "wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers
-        ):
-            with patch(
-                "wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path
-            ):
-                with patch(
-                    "wg21_paper_tracker.pipeline._download_file", return_value=True
-                ):
-                    with patch(
-                        "wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET",
-                        "test-bucket",
-                    ):
-                        with patch(
-                            "wg21_paper_tracker.pipeline._upload_to_gcs",
-                            return_value=True,
-                        ):
-                            n = run_tracker_pipeline()
+    with ExitStack() as stack:
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket"))
+        stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True))
+        n = run_tracker_pipeline()
     assert n == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 165, The test
currently nests seven with-patch blocks around run_tracker_pipeline (patching
fetch_all_mailings, fetch_papers_for_mailing, get_raw_dir, _download_file,
settings.WG21_GCS_BUCKET, _upload_to_gcs) which is hard to read; refactor the
test to use contextlib.ExitStack to enter all patches in a single flattened
block, creating Patchers for "wg21_paper_tracker.pipeline.fetch_all_mailings",
"wg21_paper_tracker.pipeline.fetch_papers_for_mailing",
"wg21_paper_tracker.pipeline.get_raw_dir",
"wg21_paper_tracker.pipeline._download_file",
"wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", and
"wg21_paper_tracker.pipeline._upload_to_gcs", set their return_value or value as
before, and then call run_tracker_pipeline() inside the ExitStack context before
asserting n == 1.
wg21_paper_tracker/tests/test_services.py (1)

49-70: Remove redundant db fixture parameter.

The @pytest.mark.django_db marker already enables database access. The db parameter is unnecessary and triggers static analysis warnings across multiple tests.

♻️ Suggested change (apply to all similar tests)
 `@pytest.mark.django_db`
 `@patch`("wg21_paper_tracker.services.get_or_create_wg21_paper_author_profile")
-def test_get_or_create_paper_creates_new(mock_profile, db):
+def test_get_or_create_paper_creates_new(mock_profile):

Apply the same change to: test_get_or_create_paper_calls_author_profile_for_each_author, test_get_or_create_paper_normalizes_paper_id_lowercase, test_get_or_create_paper_gets_existing_and_updates, test_get_or_create_paper_year_none_stored_as_zero, test_get_or_create_paper_same_paper_id_different_year_creates_two, test_get_or_create_paper_sets_author_order, test_mark_paper_downloaded_requires_year, test_mark_paper_downloaded_sets_flag, test_mark_paper_downloaded_normalizes_paper_id.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_services.py` around lines 49 - 70, Remove the
redundant `db` fixture parameter from the affected test function signatures
(it’s unnecessary because `@pytest.mark.django_db` already provides DB access).
Update the tests named test_get_or_create_paper_creates_new,
test_get_or_create_paper_calls_author_profile_for_each_author,
test_get_or_create_paper_normalizes_paper_id_lowercase,
test_get_or_create_paper_gets_existing_and_updates,
test_get_or_create_paper_year_none_stored_as_zero,
test_get_or_create_paper_same_paper_id_different_year_creates_two,
test_get_or_create_paper_sets_author_order,
test_mark_paper_downloaded_requires_year, test_mark_paper_downloaded_sets_flag,
and test_mark_paper_downloaded_normalizes_paper_id by removing the trailing ",
db" parameter from their definitions so only their mocked/needed parameters
(e.g., mock_profile) remain; keep the `@pytest.mark.django_db` marker in place and
run tests to confirm no behavior changes.
wg21_paper_tracker/cloud_run_job/converters/docling_converter.py (1)

11-20: Consider removing unused InputFormat import.

InputFormat is imported but never used in the module. If it's not needed for future expansion, removing it would clean up the code.

♻️ Suggested change
 try:
     from docling.document_converter import DocumentConverter  # type: ignore[import-untyped]
-    from docling.datamodel.base_models import InputFormat  # type: ignore[import-untyped]

     DOCLING_AVAILABLE = True
 except ImportError:
     DocumentConverter = None  # type: ignore[assignment,misc]
-    InputFormat = None  # type: ignore[assignment,misc]
     DOCLING_AVAILABLE = False
     logger.warning("Docling not available. Install with: pip install docling")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py` around
lines 11 - 20, The module imports InputFormat but never uses it; remove the
unused import to clean up the code by deleting "from
docling.datamodel.base_models import InputFormat" and any related type: ignore
comments, leaving the try/except to only import DocumentConverter and set
DocumentConverter = None, InputFormat = None assignments should be removed as
well (keep DocumentConverter, DOCLING_AVAILABLE, and the logger.warning
unchanged).
wg21_paper_tracker/services.py (1)

80-109: Potential race condition in placeholder upgrade.

When year_val > 0, the code first queries for an exact match, then queries for a placeholder (year=0). Two concurrent requests with the same paper_id but a real year could both find the placeholder and both attempt to promote it—one succeeds, the other hits an IntegrityError on the (paper_id, year) unique constraint.

Consider using select_for_update() on the placeholder lookup to serialize concurrent promotions:

💡 Suggested fix
             placeholder = WG21Paper.objects.filter(paper_id=paper_id, year=0).first()
+            placeholder = WG21Paper.objects.select_for_update().filter(
+                paper_id=paper_id, year=0
+            ).first()

Alternatively, wrap the promotion in a try/except for IntegrityError and re-fetch the existing row if it was concurrently created.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/services.py` around lines 80 - 109, The
placeholder-to-real-year promotion can race: when year_val > 0, two threads can
both fetch the placeholder (WG21Paper with year=0) and attempt to set
year=year_val causing an IntegrityError on the unique (paper_id, year)
constraint; fix by serializing or handling the race—either acquire a DB row lock
when reading the placeholder using select_for_update() on the WG21Paper
placeholder query (inside a transaction) before updating, or wrap the promotion
block in a try/except catching IntegrityError, and on IntegrityError re-query
WG21Paper for the exact (paper_id, year_val) row and use that (or call
_update_paper) so only one caller creates/promotes the record.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/service_api/cppa_user_tracker.md`:
- Around line 46-48: The documentation for
get_or_create_wg21_paper_author_profile is incorrect about the multiple-match +
unmatched-email case: update the description to state that when multiple
profiles share the same display_name and the supplied email does not match any
existing profile, the implementation creates a new profile (and returns it with
created=True) rather than returning the first existing candidate; also ensure
the "Side effect" text clarifies that supplying an email may either associate it
with an existing resolved profile or lead to creation of a new profile when no
email match is found.

---

Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/converters/docling_converter.py`:
- Around line 11-20: The module imports InputFormat but never uses it; remove
the unused import to clean up the code by deleting "from
docling.datamodel.base_models import InputFormat" and any related type: ignore
comments, leaving the try/except to only import DocumentConverter and set
DocumentConverter = None, InputFormat = None assignments should be removed as
well (keep DocumentConverter, DOCLING_AVAILABLE, and the logger.warning
unchanged).

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 241-243: In the except block inside
import_wg21_metadata_from_csv.py (where stats["skipped"] is incremented and the
current line is logger.error("Error for paper_id=%s: %s", paper_id, e)), replace
logger.error with logger.exception so the full traceback is captured; keep the
stats["skipped"] increment and the contextual message including paper_id to
preserve existing metrics and logs while providing the full exception stack
trace for debugging.
- Around line 163-170: The except block in import_wg21_metadata_from_csv.py that
handles date parsing (inside the function handling CSV import where
stats["skipped"] is incremented) should use logger.exception instead of
logger.error so the stack trace is captured; replace the logger.error call in
the except Exception as e block with logger.exception("Error parsing document
date for paper_id=%s", paper_id) (you can still increment stats["skipped"] and
continue as before).

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-96: The except block in run_wg21_paper_tracker.py currently
calls logger.exception("WG21 Paper Tracker failed: %s", e) which redundantly
passes the exception object; change the call in the except handler to
logger.exception("WG21 Paper Tracker failed") (i.e., remove the "%s" placeholder
and the e argument) so the exception info is logged via logging.exception's
built-in traceback handling; apply the same change if similar logger.exception
usages exist in the same command.

In `@wg21_paper_tracker/services.py`:
- Around line 80-109: The placeholder-to-real-year promotion can race: when
year_val > 0, two threads can both fetch the placeholder (WG21Paper with year=0)
and attempt to set year=year_val causing an IntegrityError on the unique
(paper_id, year) constraint; fix by serializing or handling the race—either
acquire a DB row lock when reading the placeholder using select_for_update() on
the WG21Paper placeholder query (inside a transaction) before updating, or wrap
the promotion block in a try/except catching IntegrityError, and on
IntegrityError re-query WG21Paper for the exact (paper_id, year_val) row and use
that (or call _update_paper) so only one caller creates/promotes the record.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-165: The test currently nests seven with-patch blocks around
run_tracker_pipeline (patching fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, _upload_to_gcs) which is
hard to read; refactor the test to use contextlib.ExitStack to enter all patches
in a single flattened block, creating Patchers for
"wg21_paper_tracker.pipeline.fetch_all_mailings",
"wg21_paper_tracker.pipeline.fetch_papers_for_mailing",
"wg21_paper_tracker.pipeline.get_raw_dir",
"wg21_paper_tracker.pipeline._download_file",
"wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", and
"wg21_paper_tracker.pipeline._upload_to_gcs", set their return_value or value as
before, and then call run_tracker_pipeline() inside the ExitStack context before
asserting n == 1.

In `@wg21_paper_tracker/tests/test_services.py`:
- Around line 49-70: Remove the redundant `db` fixture parameter from the
affected test function signatures (it’s unnecessary because
`@pytest.mark.django_db` already provides DB access). Update the tests named
test_get_or_create_paper_creates_new,
test_get_or_create_paper_calls_author_profile_for_each_author,
test_get_or_create_paper_normalizes_paper_id_lowercase,
test_get_or_create_paper_gets_existing_and_updates,
test_get_or_create_paper_year_none_stored_as_zero,
test_get_or_create_paper_same_paper_id_different_year_creates_two,
test_get_or_create_paper_sets_author_order,
test_mark_paper_downloaded_requires_year, test_mark_paper_downloaded_sets_flag,
and test_mark_paper_downloaded_normalizes_paper_id by removing the trailing ",
db" parameter from their definitions so only their mocked/needed parameters
(e.g., mock_profile) remain; keep the `@pytest.mark.django_db` marker in place and
run tests to confirm no behavior changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3fb307e5-8b22-4b80-933d-1df3e7d059ca

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 93ee8b7.

📒 Files selected for processing (42)
  • .github/workflows/actions.yml
  • config/settings.py
  • config/test_settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

…ne validation, mailing race recovery, logging, and API docs cppalliance#24
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
wg21_paper_tracker/cloud_run_job/main.py (2)

77-80: ⚠️ Potential issue | 🔴 Critical

Still using the wrong raw GCS prefix.

This job lists raw/wg21_paper_tracker/, but the uploader writes PDFs under raw/wg21_papers/. With the current prefix, the conversion job never sees newly uploaded papers, so the end-to-end pipeline is effectively broken.

💡 Suggested fix
-    raw_prefix = "raw/wg21_paper_tracker/"
+    raw_prefix = "raw/wg21_papers/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 77 - 80, The code sets
raw_prefix = "raw/wg21_paper_tracker/" but the uploader stores PDFs under
"raw/wg21_papers/", so client.list_blobs(bucket, prefix=raw_prefix) never finds
new files; change the raw_prefix value to "raw/wg21_papers/" (leave
converted_prefix as-is) so the list_blobs call in main.py uses the correct GCS
prefix and the conversion job sees uploaded PDFs.

39-45: ⚠️ Potential issue | 🟠 Major

Don't reject any paper that merely mentions error: or exception:.

Line 41 now marks content invalid whenever either token appears anywhere in the first 1000 characters. WG21 papers on exceptions/error handling can legitimately contain those strings near the top, so a successful conversion gets discarded and falls through to slower fallbacks or an empty result.

💡 Suggested fix
     for pattern in error_patterns:
-        if pattern in first_part:
-            if pattern in ("error:", "exception:"):
-                return False
-            idx = first_part.find(pattern)
-            if idx < 100:
-                return False
+        idx = first_part.find(pattern)
+        if idx != -1 and idx < 100:
+            return False
     return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The current
loop over error_patterns wrongfully rejects content when pattern is "error:" or
"exception:"; change the branch inside the for-loop that checks pattern in
("error:", "exception:") to skip rejecting (i.e., continue) instead of returning
False so papers mentioning those tokens near the top are not discarded; keep the
existing idx = first_part.find(pattern) and idx < 100 check for all other
patterns (use variables error_patterns, first_part, pattern, idx to locate the
code).
🧹 Nitpick comments (5)
wg21_paper_tracker/tests/test_workspace.py (1)

43-86: Add a regression test for the year-only raw path.

get_raw_dir() now supports mailing_date=None, but this suite only covers dated and invalid inputs. A small get_raw_dir(None, 2025) case would lock in the <raw>/wg21_paper_tracker/2025/ fallback that was just added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_workspace.py` around lines 43 - 86, Add a
regression test that calls get_raw_dir(None, 2025) to verify the year-only
fallback path is created: patch wg21_paper_tracker.workspace.settings.RAW_DIR to
the fixture path, call get_raw_dir with mailing_date=None and year 2025, and
assert the returned Path equals the expected RAW_DIR / "wg21_paper_tracker" /
"2025" and that it exists/is a directory (and optionally assert path.name ==
"2025"); reference the existing test functions (e.g.,
test_get_raw_dir_returns_mailing_date_subdir, test_get_raw_dir_creates_parents)
to place the new test alongside them in tests/test_workspace.py.
wg21_paper_tracker/tests/test_pipeline.py (1)

146-164: Consider using ExitStack or pytest-mock's mocker fixture to flatten nested context managers.

The deeply nested with patch(...) blocks are functional but reduce readability. This is a minor style suggestion.

♻️ Example using contextlib.ExitStack
from contextlib import ExitStack

with ExitStack() as stack:
    stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_all_mailings", return_value=mailings))
    stack.enter_context(patch("wg21_paper_tracker.pipeline.fetch_papers_for_mailing", return_value=papers))
    stack.enter_context(patch("wg21_paper_tracker.pipeline.get_raw_dir", return_value=tmp_path))
    stack.enter_context(patch("wg21_paper_tracker.pipeline._download_file", return_value=True))
    stack.enter_context(patch("wg21_paper_tracker.pipeline.settings.WG21_GCS_BUCKET", "test-bucket"))
    stack.enter_context(patch("wg21_paper_tracker.pipeline._upload_to_gcs", return_value=True))
    n = run_tracker_pipeline()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/tests/test_pipeline.py` around lines 146 - 164, Replace
the deeply nested with-patch blocks in the test around run_tracker_pipeline() by
flattening them using contextlib.ExitStack (or pytest-mock's mocker fixture) to
improve readability; enter_context for each patch targeting
wg21_paper_tracker.pipeline.fetch_all_mailings, fetch_papers_for_mailing,
get_raw_dir, _download_file, settings.WG21_GCS_BUCKET, and _upload_to_gcs, then
call run_tracker_pipeline() inside the single ExitStack context so all patches
are active without nested indentation.
wg21_paper_tracker/pipeline.py (3)

299-305: Move parse_date import to module level.

Importing inside the loop executes the import machinery repeatedly. Moving it to the top of the file with other imports improves clarity and avoids repeated lookup overhead.

♻️ Suggested fix

Add to imports at top of file:

from django.utils.dateparse import parse_date

Then remove line 300 and use parse_date directly:

             doc_date_str = best_paper.get("document_date")
-            from django.utils.dateparse import parse_date
-
             doc_date = None
             if doc_date_str:
                 try:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 299 - 305, The local import of
parse_date inside the loop should be moved to the module-level imports to avoid
repeated import overhead: add "from django.utils.dateparse import parse_date" at
the top of the file alongside other imports, then remove the inline import
inside the block that processes best_paper (the lines around doc_date_str =
best_paper.get("document_date") and the try/if that calls parse_date) so the
code uses the module-level parse_date directly.

94-100: Use logging.exception for final download failure to capture full traceback.

After exhausting retries, the traceback would help diagnose connection issues, SSL errors, etc.

♻️ Suggested fix
             else:
-                logger.error(
+                logger.exception(
                     "Failed to download %s after %d attempts: %s",
                     url,
                     DOWNLOAD_MAX_RETRIES,
                     e,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 94 - 100, Replace the final
logger.error call that logs the failed download (the block referencing logger,
url, DOWNLOAD_MAX_RETRIES and exception variable e) with a call that records the
full traceback — e.g. use logger.exception (or logger.error(..., exc_info=True))
in the download retry/failure handler so the exception trace is captured when
the download in the pipeline download/retry logic fails after
DOWNLOAD_MAX_RETRIES.

49-51: Use logging.exception to preserve traceback for GCS upload failures.

logging.error discards the stack trace; switching to logging.exception (or passing exc_info=True) preserves the full traceback, which is valuable for debugging GCS permission or network issues.

♻️ Suggested fix
     except Exception as e:
-        logger.error("Failed to upload to GCS: %s", e)
+        logger.exception("Failed to upload to GCS: %s", e)
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 49 - 51, The except block that
catches exceptions during GCS upload uses logger.error("Failed to upload to GCS:
%s", e) which discards the traceback; update that handler in pipeline.py (the
except block around the GCS upload logic) to call logger.exception("Failed to
upload to GCS") or logger.error("Failed to upload to GCS", exc_info=True) so the
full traceback is preserved for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@wg21_paper_tracker/cloud_run_job/converters/openai_converter.py`:
- Around line 186-200: convert_page_with_openai currently makes a single
requests.post call and immediately returns None on any exception; add a bounded
retry loop around the requests.post/response.raise_for_status/json parsing (in
function convert_page_with_openai) to handle transient failures (timeouts,
requests.exceptions like Timeout/ConnectionError, HTTP 429 and 5xx responses)
with exponential backoff (e.g., 2 retries with delays like 1s then 2s) before
falling back to returning None; ensure you still call
response.raise_for_status() after each attempt, break and return
markdown_content on success, and keep the existing logger.error with exc_info if
all retries fail.

In `@wg21_paper_tracker/cloud_run_job/Dockerfile`:
- Around line 17-28: The Dockerfile in wg21_paper_tracker/cloud_run_job assumes
the build context is that subdirectory (uses COPY requirements.txt . and CMD
["python","main.py"]) but docs tell users to build from the repo root, causing
wrong files to be copied and the entrypoint to fail; fix by either updating the
documentation to instruct building with the cloud_run_job directory as context
(e.g., docker build -f wg21_paper_tracker/cloud_run_job/Dockerfile
wg21_paper_tracker/cloud_run_job/) or modify the Dockerfile to use
repo-root-aware paths (replace COPY requirements.txt . with COPY
wg21_paper_tracker/cloud_run_job/requirements.txt . and change CMD
["python","main.py"] to CMD
["python","wg21_paper_tracker/cloud_run_job/main.py"]) so both build contexts
work correctly.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 152-160: The code currently computes year from the raw CSV strings
(mailing_date or document_date) before calling _parse_document_date, which
allows invalid dates like "2025-99-99" to produce a spurious year; move the call
to document_date = _parse_document_date(row.get("date", "")) before computing
year and then derive year only from the validated document_date
(document_date[:4]) or, if that is None, from mailing_date but only when
MAILING_DATE_PATTERN.match(mailing_date) succeeds; update the year_str/year
assignment that references mailing_date, MAILING_DATE_PATTERN, document_date,
and paper_id accordingly so year is None for invalid parsed dates.

In `@wg21_paper_tracker/services.py`:
- Around line 80-109: Promote placeholder rows (WG21Paper with year=0) inside a
try/except to handle concurrent unique constraint violations: import
IntegrityError from django.db, wrap the placeholder modification and
placeholder.save() in a try block, on IntegrityError re-fetch the canonical
record with WG21Paper.objects.filter(paper_id=paper_id, year=year_val).first()
(or use update_or_create) and then update that canonical record via
_update_paper or field assignments, setting created=False; ensure fallback logic
covers both the successful placeholder.save() path and the IntegrityError
recovery path so no IntegrityError bubbles up.

---

Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 77-80: The code sets raw_prefix = "raw/wg21_paper_tracker/" but
the uploader stores PDFs under "raw/wg21_papers/", so client.list_blobs(bucket,
prefix=raw_prefix) never finds new files; change the raw_prefix value to
"raw/wg21_papers/" (leave converted_prefix as-is) so the list_blobs call in
main.py uses the correct GCS prefix and the conversion job sees uploaded PDFs.
- Around line 39-45: The current loop over error_patterns wrongfully rejects
content when pattern is "error:" or "exception:"; change the branch inside the
for-loop that checks pattern in ("error:", "exception:") to skip rejecting
(i.e., continue) instead of returning False so papers mentioning those tokens
near the top are not discarded; keep the existing idx = first_part.find(pattern)
and idx < 100 check for all other patterns (use variables error_patterns,
first_part, pattern, idx to locate the code).

---

Nitpick comments:
In `@wg21_paper_tracker/pipeline.py`:
- Around line 299-305: The local import of parse_date inside the loop should be
moved to the module-level imports to avoid repeated import overhead: add "from
django.utils.dateparse import parse_date" at the top of the file alongside other
imports, then remove the inline import inside the block that processes
best_paper (the lines around doc_date_str = best_paper.get("document_date") and
the try/if that calls parse_date) so the code uses the module-level parse_date
directly.
- Around line 94-100: Replace the final logger.error call that logs the failed
download (the block referencing logger, url, DOWNLOAD_MAX_RETRIES and exception
variable e) with a call that records the full traceback — e.g. use
logger.exception (or logger.error(..., exc_info=True)) in the download
retry/failure handler so the exception trace is captured when the download in
the pipeline download/retry logic fails after DOWNLOAD_MAX_RETRIES.
- Around line 49-51: The except block that catches exceptions during GCS upload
uses logger.error("Failed to upload to GCS: %s", e) which discards the
traceback; update that handler in pipeline.py (the except block around the GCS
upload logic) to call logger.exception("Failed to upload to GCS") or
logger.error("Failed to upload to GCS", exc_info=True) so the full traceback is
preserved for debugging.

In `@wg21_paper_tracker/tests/test_pipeline.py`:
- Around line 146-164: Replace the deeply nested with-patch blocks in the test
around run_tracker_pipeline() by flattening them using contextlib.ExitStack (or
pytest-mock's mocker fixture) to improve readability; enter_context for each
patch targeting wg21_paper_tracker.pipeline.fetch_all_mailings,
fetch_papers_for_mailing, get_raw_dir, _download_file, settings.WG21_GCS_BUCKET,
and _upload_to_gcs, then call run_tracker_pipeline() inside the single ExitStack
context so all patches are active without nested indentation.

In `@wg21_paper_tracker/tests/test_workspace.py`:
- Around line 43-86: Add a regression test that calls get_raw_dir(None, 2025) to
verify the year-only fallback path is created: patch
wg21_paper_tracker.workspace.settings.RAW_DIR to the fixture path, call
get_raw_dir with mailing_date=None and year 2025, and assert the returned Path
equals the expected RAW_DIR / "wg21_paper_tracker" / "2025" and that it
exists/is a directory (and optionally assert path.name == "2025"); reference the
existing test functions (e.g., test_get_raw_dir_returns_mailing_date_subdir,
test_get_raw_dir_creates_parents) to place the new test alongside them in
tests/test_workspace.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 896ea723-15e9-4f30-b95e-bd13b4771210

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and 61a6c7f.

📒 Files selected for processing (42)
  • .github/workflows/actions.yml
  • config/settings.py
  • config/test_settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread wg21_paper_tracker/cloud_run_job/converters/openai_converter.py Outdated
Comment thread wg21_paper_tracker/cloud_run_job/Dockerfile Outdated
Comment thread wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py Outdated
Comment thread wg21_paper_tracker/services.py Outdated
@leostar0412
Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 11, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
wg21_paper_tracker/cloud_run_job/main.py (1)

39-45: ⚠️ Potential issue | 🟠 Major

Don't reject every early error: / exception: literal unconditionally.

Lines 41-42 invalidate content whenever either token appears anywhere in the first 1,000 characters. WG21 papers can legitimately contain diagnostics or code snippets with error: / exception: near the top, so a valid conversion can be thrown away and force a worse fallback or an empty upload. Gate those markers by position instead of rejecting them unconditionally.

Suggested fix
     content_lower = content_stripped.lower()
     first_part = content_lower[:1000]
+    if first_part.startswith(("error:", "exception:")):
+        return False
+
     for pattern in error_patterns:
-        if pattern in first_part:
-            if pattern in ("error:", "exception:"):
-                return False
-            idx = first_part.find(pattern)
-            if idx < 100:
-                return False
+        if pattern in {"error:", "exception:"}:
+            continue
+        idx = first_part.find(pattern)
+        if idx != -1 and idx < 100:
+            return False
     return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 39 - 45, The loop over
error_patterns in main.py currently rejects the input unconditionally when it
finds the literal "error:" or "exception:"; change this to a positional check
instead: inside the for pattern in error_patterns loop, compute idx =
first_part.find(pattern) and only return False if idx != -1 and idx is within
the early-position threshold (e.g., idx < 1000) for the "error:" and
"exception:" patterns; keep the existing early-position check for other patterns
(the existing idx < 100 behavior) but unify the logic so no pattern is rejected
solely because it appears anywhere in first_part without checking its index.
Ensure you update the branches around error_patterns, first_part, and the idx
checks accordingly.
🧹 Nitpick comments (5)
wg21_paper_tracker/cloud_run_job/main.py (1)

9-11: Relative imports alone won't fix import brittleness; the underlying issue requires clarification of execution model.

The current from converters... imports work fine for script execution from the cloud_run_job directory (which is the apparent deployment scenario based on the Dockerfile). However, the suggested fix of using .converters... relative imports won't work without also making cloud_run_job a proper Python package (i.e., adding an __init__.py file).

If the goal is to make this code robustly importable in multiple ways, consider either:

  • Keeping the current approach and documenting the expected execution directory, or
  • Adding __init__.py to cloud_run_job/ and updating imports to use relative paths like from .converters...

The latter approach would require verifying that treating cloud_run_job as a package doesn't conflict with other parts of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/cloud_run_job/main.py` around lines 9 - 11, The current
top-level imports in main.py (from converters.docling_converter, from
converters.pdfplumber_converter, from converters.openai_converter) are brittle
depending on how the module is run; either keep the absolute-style imports and
add documentation/instructions that the process must be started with
cloud_run_job as the working package root, or convert to package-relative
imports (from .converters.docling_converter, .pdfplumber_converter,
.openai_converter) and make cloud_run_job a proper package by adding an
__init__.py (then verify no other code breaks by treating cloud_run_job as a
package). Ensure whichever approach you choose is applied consistently and
update deployment/run docs or CI to match.
wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py (1)

95-97: Minor: Redundant exception object in logging.exception.

logging.exception automatically includes the exception info, so passing e is redundant.

♻️ Proposed fix
         except Exception as e:
-            logger.exception("WG21 Paper Tracker failed: %s", e)
+            logger.exception("WG21 Paper Tracker failed.")
             raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py` around
lines 95 - 97, In the except block in run_wg21_paper_tracker.py where
logger.exception("WG21 Paper Tracker failed: %s", e) is called, remove the
redundant exception object from the logging call and just call
logger.exception("WG21 Paper Tracker failed") so the exception info is logged
automatically (leave the surrounding except Exception as e / raise as-is).
wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py (2)

162-168: Consider using logger.exception for better debugging.

When catching exceptions, logger.exception automatically includes the stack trace, which aids debugging. The current logger.error call loses this context.

♻️ Proposed fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error(
+                logger.exception(
                     "Error parsing document date for paper_id=%s: %s",
                     paper_id,
                     e,
                 )
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 162 - 168, The except block that increments stats["skipped"] uses
logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
which omits the stack trace; replace this logger.error call with
logger.exception(...) in the same except Exception as e handler so the error
message for paper_id and the full traceback are logged (keep the
stats["skipped"] increment and the surrounding exception handler intact).

240-242: Use logger.exception for consistent error logging.

Same as above—using logger.exception preserves the stack trace for the outer exception handler.

♻️ Proposed fix
             except Exception as e:
                 stats["skipped"] += 1
-                logger.error("Error for paper_id=%s: %s", paper_id, e)
+                logger.exception("Error for paper_id=%s", paper_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`
around lines 240 - 242, Replace the logger.error call in the exception handler
with logger.exception so the stack trace is preserved; specifically, in
import_wg21_metadata_from_csv.py where the except Exception as e block
increments stats["skipped"] and logs the error for paper_id, change the logging
call that currently uses logger.error("Error for paper_id=%s: %s", paper_id, e)
to use logger.exception with a descriptive message that includes paper_id (e.g.,
logger.exception("Error for paper_id=%s", paper_id)) so the full traceback is
captured.
wg21_paper_tracker/pipeline.py (1)

49-51: Consider using logging.exception for better tracebacks.

When logging errors from caught exceptions, logging.exception automatically includes the traceback, which aids debugging.

♻️ Proposed fix
     except Exception as e:
-        logger.error("Failed to upload to GCS: %s", e)
+        logger.exception("Failed to upload to GCS: %s", e)
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wg21_paper_tracker/pipeline.py` around lines 49 - 51, Replace the current
exception logging call that uses logger.error in the except block with
logger.exception so the traceback is captured; locate the except Exception as e
block in pipeline.py (the GCS upload handler) and change logger.error("Failed to
upload to GCS: %s", e) to logger.exception("Failed to upload to GCS") (or
logger.error(..., exc_info=True)) and keep the existing return False behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 39-45: The loop over error_patterns in main.py currently rejects
the input unconditionally when it finds the literal "error:" or "exception:";
change this to a positional check instead: inside the for pattern in
error_patterns loop, compute idx = first_part.find(pattern) and only return
False if idx != -1 and idx is within the early-position threshold (e.g., idx <
1000) for the "error:" and "exception:" patterns; keep the existing
early-position check for other patterns (the existing idx < 100 behavior) but
unify the logic so no pattern is rejected solely because it appears anywhere in
first_part without checking its index. Ensure you update the branches around
error_patterns, first_part, and the idx checks accordingly.

---

Nitpick comments:
In `@wg21_paper_tracker/cloud_run_job/main.py`:
- Around line 9-11: The current top-level imports in main.py (from
converters.docling_converter, from converters.pdfplumber_converter, from
converters.openai_converter) are brittle depending on how the module is run;
either keep the absolute-style imports and add documentation/instructions that
the process must be started with cloud_run_job as the working package root, or
convert to package-relative imports (from .converters.docling_converter,
.pdfplumber_converter, .openai_converter) and make cloud_run_job a proper
package by adding an __init__.py (then verify no other code breaks by treating
cloud_run_job as a package). Ensure whichever approach you choose is applied
consistently and update deployment/run docs or CI to match.

In `@wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py`:
- Around line 162-168: The except block that increments stats["skipped"] uses
logger.error("Error parsing document date for paper_id=%s: %s", paper_id, e)
which omits the stack trace; replace this logger.error call with
logger.exception(...) in the same except Exception as e handler so the error
message for paper_id and the full traceback are logged (keep the
stats["skipped"] increment and the surrounding exception handler intact).
- Around line 240-242: Replace the logger.error call in the exception handler
with logger.exception so the stack trace is preserved; specifically, in
import_wg21_metadata_from_csv.py where the except Exception as e block
increments stats["skipped"] and logs the error for paper_id, change the logging
call that currently uses logger.error("Error for paper_id=%s: %s", paper_id, e)
to use logger.exception with a descriptive message that includes paper_id (e.g.,
logger.exception("Error for paper_id=%s", paper_id)) so the full traceback is
captured.

In `@wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py`:
- Around line 95-97: In the except block in run_wg21_paper_tracker.py where
logger.exception("WG21 Paper Tracker failed: %s", e) is called, remove the
redundant exception object from the logging call and just call
logger.exception("WG21 Paper Tracker failed") so the exception info is logged
automatically (leave the surrounding except Exception as e / raise as-is).

In `@wg21_paper_tracker/pipeline.py`:
- Around line 49-51: Replace the current exception logging call that uses
logger.error in the except block with logger.exception so the traceback is
captured; locate the except Exception as e block in pipeline.py (the GCS upload
handler) and change logger.error("Failed to upload to GCS: %s", e) to
logger.exception("Failed to upload to GCS") (or logger.error(...,
exc_info=True)) and keep the existing return False behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0c92301a-595c-4725-92d2-d15ad8dab74f

📥 Commits

Reviewing files that changed from the base of the PR and between 236de33 and c246241.

📒 Files selected for processing (42)
  • .github/workflows/actions.yml
  • config/settings.py
  • config/test_settings.py
  • cppa_user_tracker/migrations/0005_alter_slackuser_slack_user_id.py
  • cppa_user_tracker/migrations/0006_wg21paperauthorprofile_author_alias.py
  • cppa_user_tracker/models.py
  • cppa_user_tracker/services.py
  • cppa_user_tracker/tests/test_services.py
  • dev-24error:
  • docs/Schema.md
  • docs/operations/WG21_Cloud_Run.md
  • docs/service_api/cppa_user_tracker.md
  • requirements.txt
  • wg21_paper_tracker/__init__.py
  • wg21_paper_tracker/admin.py
  • wg21_paper_tracker/apps.py
  • wg21_paper_tracker/cloud_run_job/Dockerfile
  • wg21_paper_tracker/cloud_run_job/converters/__init__.py
  • wg21_paper_tracker/cloud_run_job/converters/docling_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/openai_converter.py
  • wg21_paper_tracker/cloud_run_job/converters/pdfplumber_converter.py
  • wg21_paper_tracker/cloud_run_job/main.py
  • wg21_paper_tracker/cloud_run_job/requirements.txt
  • wg21_paper_tracker/fetcher.py
  • wg21_paper_tracker/management/__init__.py
  • wg21_paper_tracker/management/commands/__init__.py
  • wg21_paper_tracker/management/commands/import_wg21_metadata_from_csv.py
  • wg21_paper_tracker/management/commands/run_wg21_paper_tracker.py
  • wg21_paper_tracker/migrations/0001_initial.py
  • wg21_paper_tracker/migrations/__init__.py
  • wg21_paper_tracker/models.py
  • wg21_paper_tracker/pipeline.py
  • wg21_paper_tracker/services.py
  • wg21_paper_tracker/tests/__init__.py
  • wg21_paper_tracker/tests/test_commands.py
  • wg21_paper_tracker/tests/test_fetcher.py
  • wg21_paper_tracker/tests/test_models.py
  • wg21_paper_tracker/tests/test_pipeline.py
  • wg21_paper_tracker/tests/test_services.py
  • wg21_paper_tracker/tests/test_workspace.py
  • wg21_paper_tracker/workspace.py
  • workflow/management/commands/run_all_collectors.py

Comment thread cppa_user_tracker/services.py
Comment thread wg21_paper_tracker/services.py
Comment thread wg21_paper_tracker/services.py
Comment thread wg21_paper_tracker/services.py
Comment thread wg21_paper_tracker/pipeline.py Outdated
Comment thread config/test_settings.py Outdated
@snowfox1003 snowfox1003 merged commit cf8f184 into cppalliance:develop Apr 20, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants