Add clang_github_tracker app in boost-data-collector project #82#84
Add clang_github_tracker app in boost-data-collector project #82#84snowfox1003 merged 9 commits intocppalliance:developfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a new clang_github_tracker Django app to fetch and store raw GitHub activity for a configurable owner/repo, including workspace path helpers, state management, sync logic, a management command, tests, settings, and documentation. Changes
Sequence DiagramsequenceDiagram
participant User
participant Cmd as Management Command
participant State as State Manager
participant Sync as Sync Engine
participant GH as GitHub API
participant Raw as Raw Storage
User->>Cmd: run_clang_github_tracker [--from-date][--to-date][--dry-run]
Cmd->>State: resolve_start_end_dates(from,to)
State->>State: ensure_state_file_exists()
State->>Raw: compute_state_from_raw()
Raw-->>State: latest timestamps
State-->>Cmd: (start_commit, start_issue, start_pr, end_date)
alt dry-run
Cmd-->>User: log resolved dates and dry-run
else
Cmd->>Sync: sync_raw_only(start_*, end_date)
Sync->>GH: fetch commits/issues/PRs
GH-->>Sync: activity payloads
Sync->>Raw: save raw items (commits/issues/prs)
Sync->>State: save_state(latest observed dates)
State->>Raw: persist state.json
Sync-->>Cmd: (commits_saved, issues_saved, prs_saved)
Cmd-->>User: log sync results
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
clang_github_tracker/workspace.py (1)
44-50: Validateownerandrepopath segments before composing filesystem paths.These values are environment-driven; validating against separators/relative traversal prevents accidental directory escape outside the intended workspace subtree.
Suggested hardening
@@ _APP_SLUG = "clang_github_activity" _RAW_APP_SLUG = "github_activity_tracker" @@ OWNER = settings.CLANG_GITHUB_OWNER REPO = settings.CLANG_GITHUB_REPO + + +def _validate_segment(value: str, field: str) -> str: + if ( + not value + or value in {".", ".."} + or "/" in value + or "\\" in value + or ".." in value + ): + raise ValueError(f"Invalid {field} path segment: {value!r}") + return value @@ def get_raw_repo_dir( owner: str = OWNER, repo: str = REPO, *, create: bool = True ) -> Path: """Return workspace/raw/github_activity_tracker/<owner>/<repo>/; creates dirs if missing when create=True.""" - path = get_raw_root() / owner / repo + owner = _validate_segment(owner, "owner") + repo = _validate_segment(repo, "repo") + path = get_raw_root() / owner / repo🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/workspace.py` around lines 44 - 50, The get_raw_repo_dir function builds filesystem paths from environment-driven owner and repo; validate and sanitize these inputs to prevent path-separator or traversal attacks by: (1) reject or normalize any owner/repo containing path separators or traversal tokens (e.g., '/', '\\', '..') — raise ValueError if invalid; (2) construct the candidate path using get_raw_root() / owner / repo, then resolve it (Path.resolve(strict=False)) and ensure the resolved path is inside the resolved get_raw_root() (use Path.is_relative_to or compare parents) to prevent escape; (3) only call path.mkdir(parents=True, exist_ok=True) after validation. Reference get_raw_repo_dir and get_raw_root in your changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 80-83: The handler currently returns early when
clang_state.resolve_start_end_dates(from_date, to_date) yields None, causing a
zero-exit success; change this to signal failure by raising a Django
CommandError (or calling sys.exit(1)) with a clear message instead of returning.
Locate the management command's handle method (class Command in
run_clang_github_tracker.py) and replace the "if resolved is None: return"
branch with a failure path like "if resolved is None: raise CommandError('Failed
to resolve start/end dates: ...')" so schedulers/monitoring receive a non-zero
exit.
In `@clang_github_tracker/state.py`:
- Around line 57-63: load_state is treating an empty dict as a valid state
(returning a dict of keys with None) which prevents ensure_state_file_exists
from triggering recompute; update load_state to treat empty or missing-key
objects as invalid by returning {} unless the loaded dict is non-empty and
contains at least one of KEY_LAST_COMMIT_DATE, KEY_LAST_ISSUE_DATE, or
KEY_LAST_PR_DATE with a non-None value. Concretely, in load_state check that
isinstance(data, dict) and that any(data.get(KEY_LAST_COMMIT_DATE),
data.get(KEY_LAST_ISSUE_DATE), data.get(KEY_LAST_PR_DATE)) is truthy before
returning the mapped dict; otherwise return {} so ensure_state_file_exists will
recompute from raw.
In `@clang_github_tracker/sync_raw.py`:
- Around line 98-148: The try covers all phases so a failure during issues/PRs
loses already-fetched commits; after each successful phase (commits, issues,
prs) call clang_state.save_state(...) with the corresponding
latest_commit/latest_issue/latest_pr (use merge=True) to persist progress;
locate the loops using fetcher.fetch_commits_from_github,
fetcher.fetch_issues_from_github, fetcher.fetch_pull_requests_from_github and
the save_* helpers (save_commit_raw_source, save_issue_raw_source,
save_pr_raw_source) and insert a clang_state.save_state call immediately after
each loop using the current latest_* values so partial progress is preserved.
In `@docs/Workspace.md`:
- Around line 16-17: Update the workspace layout note under
github_activity_tracker to clarify that the tracked repository is configurable:
mention the CLANG_GITHUB_OWNER and CLANG_GITHUB_REPO environment variables (or
config keys) as controlling which <owner>/<repo> is stored, and replace the
hardcoded "llvm/llvm-project" wording with an explanatory phrase like "e.g.,
llvm/llvm-project — configurable via CLANG_GITHUB_OWNER/CLANG_GITHUB_REPO" so
operators know the repo can be changed; update the sentence that references
llvm/llvm-project accordingly in docs/Workspace.md.
---
Nitpick comments:
In `@clang_github_tracker/workspace.py`:
- Around line 44-50: The get_raw_repo_dir function builds filesystem paths from
environment-driven owner and repo; validate and sanitize these inputs to prevent
path-separator or traversal attacks by: (1) reject or normalize any owner/repo
containing path separators or traversal tokens (e.g., '/', '\\', '..') — raise
ValueError if invalid; (2) construct the candidate path using get_raw_root() /
owner / repo, then resolve it (Path.resolve(strict=False)) and ensure the
resolved path is inside the resolved get_raw_root() (use Path.is_relative_to or
compare parents) to prevent escape; (3) only call path.mkdir(parents=True,
exist_ok=True) after validation. Reference get_raw_repo_dir and get_raw_root in
your changes.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
clang_github_tracker/sync_raw.py (1)
146-148: Remove redundant exception object fromlogger.exception.
logger.exception()automatically includes the exception info from the current context. Passingeexplicitly is redundant.Proposed fix
except (ConnectionException, RateLimitException) as e: - logger.exception("clang_github_tracker sync failed: %s", e) + logger.exception("clang_github_tracker sync failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/sync_raw.py` around lines 146 - 148, The except block catching ConnectionException and RateLimitException passes the caught exception variable e to logger.exception redundantly; update the handler in sync_raw.py to call logger.exception with just the message (e.g., logger.exception("clang_github_tracker sync failed")) and remove the extra argument and the unused formatting, leaving the raise as-is so the exception context is preserved; reference the except (ConnectionException, RateLimitException) as e block, logger.exception, and the variable e when making the change.clang_github_tracker/tests/test_commands.py (1)
16-27: Consider adding cleanup/fixture for state file management.The test manually deletes the state file if it exists, but doesn't clean up after itself. This could affect other tests if run order changes. Consider using a pytest fixture with cleanup.
Example fixture approach
`@pytest.fixture` def clean_state_file(): """Ensure state file is removed before and after test.""" workspace = get_workspace_path("clang_github_activity") state_file = workspace / "state.json" if state_file.exists(): state_file.unlink() yield state_file if state_file.exists(): state_file.unlink() `@pytest.mark.django_db` def test_run_clang_github_tracker_dry_run_creates_state_if_missing(caplog, clean_state_file): """With --dry-run and no state file, command creates state from raw scan and resolves dates.""" with caplog.at_level(logging.INFO): call_command(CMD_NAME, "--dry-run", stdout=StringIO(), stderr=StringIO()) assert any("Resolved:" in r.getMessage() for r in caplog.records) assert any("Dry run" in r.getMessage() for r in caplog.records)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 16 - 27, Test leaves a workspace state file behind which can cause order-dependent failures; create a pytest fixture (e.g., clean_state_file) that locates the same state_file via get_workspace_path("clang_github_activity")/ "state.json", removes it before yielding and removes it again after the test, then update test_run_clang_github_tracker_dry_run_creates_state_if_missing to accept the fixture (clean_state_file) and drop the manual unlink logic so the file is cleaned up both before and after the call_command invocation.clang_github_tracker/management/commands/run_clang_github_tracker.py (1)
109-111: Remove redundant exception object fromlogger.exception.
logger.exception()automatically captures and logs the current exception context. Passingeexplicitly is redundant.Proposed fix
except Exception as e: - logger.exception("run_clang_github_tracker failed: %s", e) + logger.exception("run_clang_github_tracker failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around lines 109 - 111, The except block in run_clang_github_tracker.py currently calls logger.exception("run_clang_github_tracker failed: %s", e) which redundantly passes the exception object; change the call to logger.exception("run_clang_github_tracker failed") so that logger.exception uses the implicit exception context, leaving the subsequent raise intact (locate the except Exception as e: handler in the run_clang_github_tracker command code and remove the explicit e argument from logger.exception).clang_github_tracker/state.py (1)
247-254: Retry logic callsensure_state_file_exists()twice but may not be effective.Lines 247-249 call
ensure_state_file_exists()and immediately retry if empty. However, if the first call failed due to a transient error (e.g., file system issue), the second call is unlikely to succeed without any delay or backoff. Consider adding a brief delay or removing the redundant retry if it's not needed.Option 1: Add a brief delay between retries
state = ensure_state_file_exists() if not state: + import time + time.sleep(0.1) # Brief delay before retry state = ensure_state_file_exists() # retry onceOption 2: Document the retry intent
state = ensure_state_file_exists() if not state: - state = ensure_state_file_exists() # retry once + # Retry once in case of transient filesystem issues + state = ensure_state_file_exists()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/state.py` around lines 247 - 254, The current retry calls ensure_state_file_exists() twice back-to-back which is ineffective for transient failures; replace the immediate second call with a simple retry loop: call ensure_state_file_exists() once, and if it returns falsy retry up to 2 more times with a short sleep/backoff (e.g., time.sleep(0.5) then 1.0) before logging via logger.error and returning None; ensure you import time and update the state variable inside the loop so the checks reference the latest result and keep the existing logger.error message if finally unsuccessful.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/settings.py`:
- Around line 143-146: The CLANG_GITHUB_OWNER and CLANG_GITHUB_REPO assignments
accept whitespace-only environment values because .strip() is applied after the
fallback, producing empty strings; change them to strip the env value first and
then fallback if the stripped result is empty — i.e., call env(...) (with the
same default), strip the returned string, and if that stripped string is falsy
set it to "llvm" for CLANG_GITHUB_OWNER and "llvm-project" for
CLANG_GITHUB_REPO; update the assignments for the symbols CLANG_GITHUB_OWNER and
CLANG_GITHUB_REPO accordingly.
---
Nitpick comments:
In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 109-111: The except block in run_clang_github_tracker.py currently
calls logger.exception("run_clang_github_tracker failed: %s", e) which
redundantly passes the exception object; change the call to
logger.exception("run_clang_github_tracker failed") so that logger.exception
uses the implicit exception context, leaving the subsequent raise intact (locate
the except Exception as e: handler in the run_clang_github_tracker command code
and remove the explicit e argument from logger.exception).
In `@clang_github_tracker/state.py`:
- Around line 247-254: The current retry calls ensure_state_file_exists() twice
back-to-back which is ineffective for transient failures; replace the immediate
second call with a simple retry loop: call ensure_state_file_exists() once, and
if it returns falsy retry up to 2 more times with a short sleep/backoff (e.g.,
time.sleep(0.5) then 1.0) before logging via logger.error and returning None;
ensure you import time and update the state variable inside the loop so the
checks reference the latest result and keep the existing logger.error message if
finally unsuccessful.
In `@clang_github_tracker/sync_raw.py`:
- Around line 146-148: The except block catching ConnectionException and
RateLimitException passes the caught exception variable e to logger.exception
redundantly; update the handler in sync_raw.py to call logger.exception with
just the message (e.g., logger.exception("clang_github_tracker sync failed"))
and remove the extra argument and the unused formatting, leaving the raise as-is
so the exception context is preserved; reference the except
(ConnectionException, RateLimitException) as e block, logger.exception, and the
variable e when making the change.
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 16-27: Test leaves a workspace state file behind which can cause
order-dependent failures; create a pytest fixture (e.g., clean_state_file) that
locates the same state_file via get_workspace_path("clang_github_activity")/
"state.json", removes it before yielding and removes it again after the test,
then update test_run_clang_github_tracker_dry_run_creates_state_if_missing to
accept the fixture (clean_state_file) and drop the manual unlink logic so the
file is cleaned up both before and after the call_command invocation.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
clang_github_tracker/management/commands/run_clang_github_tracker.py (1)
109-111: Consider removing redundant exception object fromlogging.exception.
logger.exception()automatically includes the exception traceback, so passingeas a format argument is redundant.♻️ Suggested simplification
except Exception as e: - logger.exception("run_clang_github_tracker failed: %s", e) + logger.exception("run_clang_github_tracker failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around lines 109 - 111, In the except block in run_clang_github_tracker.py (inside the run_clang_github_tracker command handler), remove the redundant exception object passed to logger.exception and log only a descriptive message (e.g., "run_clang_github_tracker failed") since logger.exception already includes the traceback; keep the subsequent raise to re-raise the exception.clang_github_tracker/tests/test_commands.py (1)
48-67: Consider verifying all resolved date parameters for completeness.The test checks
start_commitandend_datebut doesn't verifystart_issueandstart_pr. Sincesync_raw_onlyaccepts separate start dates for each entity type, asserting all four parameters would strengthen coverage.♻️ Suggested enhancement
sync_mock.assert_called_once() call_kw = sync_mock.call_args[1] assert "start_commit" in call_kw + assert "start_issue" in call_kw + assert "start_pr" in call_kw assert "end_date" in call_kw assert any("saved commits=" in r.getMessage() for r in caplog.records)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 48 - 67, Update the test_run_clang_github_tracker_calls_sync_raw_only_when_not_dry_run to also assert that sync_raw_only was called with the other resolved date parameters: after obtaining call_kw from sync_mock.call_args[1], add assertions that "start_issue" and "start_pr" are present (and optionally that their values match the resolved start date or expected parsed dates), in addition to the existing checks for "start_commit" and "end_date" so all four entity start parameters are verified for completeness.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clang_github_tracker/state.py`:
- Around line 60-73: The state loader currently accepts any truthy last_*_date
values (e.g., ints/dicts) which can masquerade as valid state; update the
validation in the function that contains this snippet (ensure_state_file_exists
/ the state-parsing helper) to only accept strings that are valid ISO8601 dates:
for each of KEY_LAST_COMMIT_DATE, KEY_LAST_ISSUE_DATE, KEY_LAST_PR_DATE check
isinstance(value, str) and attempt to parse with datetime.fromisoformat (or
equivalent) and treat parse failures as invalid; only return the dict when at
least one key contains a successfully parsed ISO string, otherwise return {} so
recompute is triggered.
In `@clang_github_tracker/workspace.py`:
- Around line 44-50: The get_raw_repo_dir function currently composes filesystem
paths from owner and repo without validation; before creating or joining the
path in get_raw_repo_dir, validate both owner and repo (e.g., in the
get_raw_repo_dir function) and reject values that are absolute paths, contain
path separators or traversal segments ("/", "\\", "..") or otherwise do not
match an allowed pattern (for example only alphanumeric, hyphen, underscore); if
invalid, raise ValueError; after validation, continue to build path =
get_raw_root() / owner / repo and mkdir when create is True.
---
Nitpick comments:
In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 109-111: In the except block in run_clang_github_tracker.py
(inside the run_clang_github_tracker command handler), remove the redundant
exception object passed to logger.exception and log only a descriptive message
(e.g., "run_clang_github_tracker failed") since logger.exception already
includes the traceback; keep the subsequent raise to re-raise the exception.
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 48-67: Update the
test_run_clang_github_tracker_calls_sync_raw_only_when_not_dry_run to also
assert that sync_raw_only was called with the other resolved date parameters:
after obtaining call_kw from sync_mock.call_args[1], add assertions that
"start_issue" and "start_pr" are present (and optionally that their values match
the resolved start date or expected parsed dates), in addition to the existing
checks for "start_commit" and "end_date" so all four entity start parameters are
verified for completeness.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
clang_github_tracker/workspace.py (1)
44-50:⚠️ Potential issue | 🟠 MajorHarden
owner/repopath segments before path composition.
get_raw_repo_dir()currently accepts raw path components. Values like../xor embedded separators can redirect writes outsideworkspace/raw/github_activity_tracker.🔧 Proposed fix
from pathlib import Path +import re @@ STATE_FILENAME = "state.json" +_SEGMENT_RE = re.compile(r"^[A-Za-z0-9_.-]+$") @@ OWNER = settings.CLANG_GITHUB_OWNER REPO = settings.CLANG_GITHUB_REPO + + +def _validate_repo_segment(value: str, label: str) -> str: + segment = (value or "").strip() + if ( + not segment + or segment in {".", ".."} + or "/" in segment + or "\\" in segment + or not _SEGMENT_RE.fullmatch(segment) + ): + raise ValueError(f"Invalid GitHub {label}: {value!r}") + return segment @@ def get_raw_repo_dir( owner: str = OWNER, repo: str = REPO, *, create: bool = True ) -> Path: """Return workspace/raw/github_activity_tracker/<owner>/<repo>/; creates dirs if missing when create=True.""" - path = get_raw_root() / owner / repo + safe_owner = _validate_repo_segment(owner, "owner") + safe_repo = _validate_repo_segment(repo, "repo") + path = get_raw_root() / safe_owner / safe_repo if create: path.mkdir(parents=True, exist_ok=True) return path🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/workspace.py` around lines 44 - 50, get_raw_repo_dir currently concatenates raw owner/repo strings into a filesystem path, allowing path-traversal (e.g., "../x" or embedded separators); sanitize and validate the owner and repo before composing the path: in get_raw_repo_dir, reject or normalize inputs by ensuring owner and repo are non-empty, do not contain path separators or ".." (check against "/" and os.path.sep and the literal ".."), or replace them with their basename (e.g., Path(owner).name) and raise ValueError on invalid input; then build the path using the sanitized values and proceed with path.mkdir as before; optionally, after creating the directory, assert that the resolved path is a descendant of get_raw_root().resolve() to be extra-safe.
🧹 Nitpick comments (1)
clang_github_tracker/tests/test_commands.py (1)
48-67: Consider adding test for state resolution failure.The current tests cover the happy path. Per project learnings, when
resolve_start_end_dates()returnsNone, the command returns early (exits with 0) rather than raisingCommandError. Consider adding a test to verify this graceful degradation behavior.💡 Example test case
`@pytest.mark.django_db` def test_run_clang_github_tracker_exits_gracefully_on_state_resolution_failure(caplog): """When resolve_start_end_dates returns None, command exits gracefully (no sync, no error).""" with patch( "clang_github_tracker.management.commands.run_clang_github_tracker.resolve_start_end_dates", return_value=None, ): with patch( "clang_github_tracker.management.commands.run_clang_github_tracker.sync_raw_only" ) as sync_mock: # Should not raise call_command(CMD_NAME, stdout=StringIO(), stderr=StringIO()) sync_mock.assert_not_called()Based on learnings: "when resolve_start_end_dates() returns None (state resolution failure), the command returns early (exits with 0) rather than raising CommandError."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 48 - 67, Add a unit test that verifies the management command returns early (does not call sync_raw_only and does not raise) when resolve_start_end_dates() fails (returns None); specifically patch resolve_start_end_dates in the run_clang_github_tracker module to return None, patch sync_raw_only to ensure it is not called, invoke call_command(CMD_NAME, stdout=StringIO(), stderr=StringIO()), and assert sync_mock.assert_not_called() and no exception was raised. Reference resolve_start_end_dates, sync_raw_only, CMD_NAME, and call_command in the test to locate the right functions to patch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@clang_github_tracker/workspace.py`:
- Around line 44-50: get_raw_repo_dir currently concatenates raw owner/repo
strings into a filesystem path, allowing path-traversal (e.g., "../x" or
embedded separators); sanitize and validate the owner and repo before composing
the path: in get_raw_repo_dir, reject or normalize inputs by ensuring owner and
repo are non-empty, do not contain path separators or ".." (check against "/"
and os.path.sep and the literal ".."), or replace them with their basename
(e.g., Path(owner).name) and raise ValueError on invalid input; then build the
path using the sanitized values and proceed with path.mkdir as before;
optionally, after creating the directory, assert that the resolved path is a
descendant of get_raw_root().resolve() to be extra-safe.
---
Nitpick comments:
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 48-67: Add a unit test that verifies the management command
returns early (does not call sync_raw_only and does not raise) when
resolve_start_end_dates() fails (returns None); specifically patch
resolve_start_end_dates in the run_clang_github_tracker module to return None,
patch sync_raw_only to ensure it is not called, invoke call_command(CMD_NAME,
stdout=StringIO(), stderr=StringIO()), and assert sync_mock.assert_not_called()
and no exception was raised. Reference resolve_start_end_dates, sync_raw_only,
CMD_NAME, and call_command in the test to locate the right functions to patch.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
clang_github_tracker/workspace.py (1)
31-36:⚠️ Potential issue | 🟠 MajorReject
.and..path segments to close traversal escape routes.Current validation allows
"."/"..", which can move writes outside the intended<owner>/<repo>subtree.🔒 Suggested hardening
def _sanitize_segment(value: str, label: str) -> str: """Validate owner/repo for use as a path segment; reject traversal, separators, empty.""" value = (value or "").strip() - if not value or not _SEGMENT_RE.fullmatch(value): + if value in {".", ".."} or not _SEGMENT_RE.fullmatch(value): raise ValueError(f"Invalid GitHub {label}: {value!r}") return value🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/workspace.py` around lines 31 - 36, The _sanitize_segment function currently allows "." and ".." which enable path traversal; update _sanitize_segment to explicitly reject "." and ".." after trimming and before using _SEGMENT_RE (i.e., if value in {".", ".."}: raise ValueError(...)); keep the existing ValueError message and continue to validate with _SEGMENT_RE to preserve other checks.
🧹 Nitpick comments (2)
config/settings.py (1)
142-148: Consider aligning runtime workspace pre-creation with test settings.Since the new tracker uses
workspace/clang_github_activity/state.json, addingclang_github_activityto_WORKSPACE_APP_SLUGSwould keep runtime behavior consistent withconfig/test_settings.pyand make workspace layout more predictable.♻️ Suggested consistency tweak
_WORKSPACE_APP_SLUGS = ( "github_activity_tracker", "boost_library_tracker", "boost_usage_tracker", + "clang_github_activity", "cppa_slack_transcript_tracker", "discord_activity_tracker", "boost_mailing_list_tracker", "shared", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings.py` around lines 142 - 148, Add the new tracker app slug to the runtime workspace slug list: update the _WORKSPACE_APP_SLUGS constant to include "clang_github_activity" so the runtime pre-creates workspace/clang_github_activity/state.json the same way tests expect; modify the list where _WORKSPACE_APP_SLUGS is defined to append or include "clang_github_activity" alongside existing slugs to align runtime workspace layout with config/test_settings.py.clang_github_tracker/tests/test_commands.py (1)
29-67: Add one test for invalid date input fallback behavior.A dedicated case with invalid
--from-date/--to-datewould lock in the intended warning-and-fallback flow and prevent future regressions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 29 - 67, Add a new test in the same tests module that patches clang_github_tracker.management.commands.run_clang_github_tracker.sync_raw_only, calls call_command(CMD_NAME, "--from-date=invalid", "--to-date=also-bad", ...) and asserts that (1) caplog captured a warning/error message about invalid dates (e.g., contains "invalid" or "Resolved:" warning) and (2) sync_raw_only is still invoked with the fallback/resolved date arguments (check call_kw contains "start_commit" or "start_date"/"end_date" as used by the command). Ensure the test mirrors the existing tests' structure (use `@pytest.mark.django_db`, caplog.at_level(logging.INFO), StringIO for stdout/stderr) and uses sync_raw_only.assert_called_once() and inspects sync_mock.call_args[1] for the resolved-date keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 17-27: The test
test_run_clang_github_tracker_dry_run_creates_state_if_missing currently only
asserts logs; update it to assert the actual side-effect by checking the state
file created by call_command: after calling call_command(CMD_NAME, "--dry-run",
...) verify state_file.exists() is True and optionally open/read state_file to
assert expected keys/contents (e.g., that resolved dates are present) so the
test fails if the file is not written; keep references to
get_workspace_path("clang_github_activity"), state_file, and the test function
name when making the change.
---
Duplicate comments:
In `@clang_github_tracker/workspace.py`:
- Around line 31-36: The _sanitize_segment function currently allows "." and
".." which enable path traversal; update _sanitize_segment to explicitly reject
"." and ".." after trimming and before using _SEGMENT_RE (i.e., if value in
{".", ".."}: raise ValueError(...)); keep the existing ValueError message and
continue to validate with _SEGMENT_RE to preserve other checks.
---
Nitpick comments:
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 29-67: Add a new test in the same tests module that patches
clang_github_tracker.management.commands.run_clang_github_tracker.sync_raw_only,
calls call_command(CMD_NAME, "--from-date=invalid", "--to-date=also-bad", ...)
and asserts that (1) caplog captured a warning/error message about invalid dates
(e.g., contains "invalid" or "Resolved:" warning) and (2) sync_raw_only is still
invoked with the fallback/resolved date arguments (check call_kw contains
"start_commit" or "start_date"/"end_date" as used by the command). Ensure the
test mirrors the existing tests' structure (use `@pytest.mark.django_db`,
caplog.at_level(logging.INFO), StringIO for stdout/stderr) and uses
sync_raw_only.assert_called_once() and inspects sync_mock.call_args[1] for the
resolved-date keys.
In `@config/settings.py`:
- Around line 142-148: Add the new tracker app slug to the runtime workspace
slug list: update the _WORKSPACE_APP_SLUGS constant to include
"clang_github_activity" so the runtime pre-creates
workspace/clang_github_activity/state.json the same way tests expect; modify the
list where _WORKSPACE_APP_SLUGS is defined to append or include
"clang_github_activity" alongside existing slugs to align runtime workspace
layout with config/test_settings.py.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
clang_github_tracker/tests/test_commands.py (1)
17-27:⚠️ Potential issue | 🟡 MinorAssert the state-file side effect in the dry-run bootstrap test.
Lines [25]-[26] only validate logs, so this test still passes if
state.jsoncreation regresses.Suggested test hardening
with caplog.at_level(logging.INFO): call_command(CMD_NAME, "--dry-run", stdout=StringIO(), stderr=StringIO()) assert any("Resolved:" in r.getMessage() for r in caplog.records) assert any("Dry run" in r.getMessage() for r in caplog.records) + assert state_file.exists()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 17 - 27, The test test_run_clang_github_tracker_dry_run_creates_state_if_missing currently only checks logs and can miss regressions where state.json is not created; after calling call_command(CMD_NAME, "--dry-run", ...) assert that the state_file Path (variable state_file) now exists and optionally contains valid JSON or expected keys (e.g., load with json.load and assert type/dates) to verify the side-effect; use the existing state_file variable and CMD_NAME/call_command to locate where to add these assertions.
🧹 Nitpick comments (2)
clang_github_tracker/tests/test_commands.py (1)
49-67: Strengthen sync-call assertions to verify resolved values, not just kwarg presence.Lines [64]-[66] can produce false positives even if wrong dates are passed to
sync_raw_only.Suggested assertion upgrade
sync_mock.assert_called_once() - call_kw = sync_mock.call_args[1] - assert "start_commit" in call_kw - assert "end_date" in call_kw + call_kw = sync_mock.call_args.kwargs + assert call_kw["start_commit"].isoformat() == "2024-01-01T00:00:00+00:00" + assert call_kw["start_issue"].isoformat() == "2024-01-01T00:00:00+00:00" + assert call_kw["start_pr"].isoformat() == "2024-01-01T00:00:00+00:00" + assert call_kw["end_date"].isoformat() == "2024-01-02T00:00:00+00:00"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/tests/test_commands.py` around lines 49 - 67, The test test_run_clang_github_tracker_calls_sync_raw_only_when_not_dry_run only asserts that sync_raw_only was called with keys present, which can pass with wrong values; update the assertions on the mock (sync_raw_only) to assert the actual resolved values for the relevant kwargs (e.g., verify call_kw["start_commit"] equals the expected resolved start value and call_kw["end_date"] equals the expected resolved end value) after calling call_command(CMD_NAME, "--from-date=2024-01-01", "--to-date=2024-01-02", ...); keep the existing caplog check for "saved commits=" but replace the generic "in call_kw" checks with explicit equality checks against the expected parsed/converted dates or commit identifiers returned by the command.clang_github_tracker/management/commands/run_clang_github_tracker.py (1)
75-78: Consider permissive fallback for inverted date ranges to match project command behavior.Line [76] currently hard-fails; this is stricter than the repo’s graceful date-input handling pattern.
Consistency-oriented alternative
- if from_date and to_date and from_date > to_date: - raise CommandError( - "Invalid date range: from_date must be before or equal to to_date." - ) + if from_date and to_date and from_date > to_date: + logger.warning( + "Invalid date range (--from-date > --to-date); ignoring both and using default date resolution." + ) + from_date = None + to_date = NoneBased on learnings: In Django management commands in this project, invalid start/end date inputs are intentionally treated permissively and fall back to default logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around lines 75 - 78, The current strict check in run_clang_github_tracker.Command.handle that raises CommandError when from_date > to_date should be made permissive: instead of raising, swap the two values (set from_date, to_date = to_date, from_date) and emit a warning via self.stderr.write or the command logger to note the correction; update the block referencing from_date and to_date so downstream logic uses the swapped values (look for the if from_date and to_date check and replace the raise with the swap + warning).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@clang_github_tracker/tests/test_state.py`:
- Around line 27-34: The test patches the wrong function: update the mock to
patch get_raw_repo_dir used by compute_state_from_raw instead of get_raw_root;
specifically patch "clang_github_tracker.state.get_raw_repo_dir" (the function
imported/used by clang_state.compute_state_from_raw), set its return_value to
tmp_path / "raw" / "github_activity_tracker" (or an appropriate Path) and keep
the rest of the assertions unchanged so compute_state_from_raw exercises the
intended no-repo path handling.
---
Duplicate comments:
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 17-27: The test
test_run_clang_github_tracker_dry_run_creates_state_if_missing currently only
checks logs and can miss regressions where state.json is not created; after
calling call_command(CMD_NAME, "--dry-run", ...) assert that the state_file Path
(variable state_file) now exists and optionally contains valid JSON or expected
keys (e.g., load with json.load and assert type/dates) to verify the
side-effect; use the existing state_file variable and CMD_NAME/call_command to
locate where to add these assertions.
---
Nitpick comments:
In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 75-78: The current strict check in
run_clang_github_tracker.Command.handle that raises CommandError when from_date
> to_date should be made permissive: instead of raising, swap the two values
(set from_date, to_date = to_date, from_date) and emit a warning via
self.stderr.write or the command logger to note the correction; update the block
referencing from_date and to_date so downstream logic uses the swapped values
(look for the if from_date and to_date check and replace the raise with the swap
+ warning).
In `@clang_github_tracker/tests/test_commands.py`:
- Around line 49-67: The test
test_run_clang_github_tracker_calls_sync_raw_only_when_not_dry_run only asserts
that sync_raw_only was called with keys present, which can pass with wrong
values; update the assertions on the mock (sync_raw_only) to assert the actual
resolved values for the relevant kwargs (e.g., verify call_kw["start_commit"]
equals the expected resolved start value and call_kw["end_date"] equals the
expected resolved end value) after calling call_command(CMD_NAME,
"--from-date=2024-01-01", "--to-date=2024-01-02", ...); keep the existing caplog
check for "saved commits=" but replace the generic "in call_kw" checks with
explicit equality checks against the expected parsed/converted dates or commit
identifiers returned by the command.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
clang_github_tracker/sync_raw.py (1)
143-145: Minor: Redundant exception object inlogging.exceptioncall.Same as in the management command—
logger.exception()already captures the exception context.♻️ Optional simplification
except (ConnectionException, RateLimitException) as e: - logger.exception("clang_github_tracker sync failed: %s", e) + logger.exception("clang_github_tracker sync failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/sync_raw.py` around lines 143 - 145, The except block catching ConnectionException and RateLimitException in sync_raw.py uses logger.exception("clang_github_tracker sync failed: %s", e) which redundantly passes the exception object; change the call to logger.exception("clang_github_tracker sync failed") (leave the exception classes ConnectionException and RateLimitException and the subsequent raise intact) so the exception context is logged automatically by logger.exception.clang_github_tracker/management/commands/run_clang_github_tracker.py (1)
113-115: Minor: Redundant exception object inlogging.exceptioncall.
logger.exception()automatically includes the exception information from the current context; passingeexplicitly is redundant.♻️ Optional simplification
except Exception as e: - logger.exception("run_clang_github_tracker failed: %s", e) + logger.exception("run_clang_github_tracker failed") raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@clang_github_tracker/management/commands/run_clang_github_tracker.py` around lines 113 - 115, Remove the redundant exception object from the logger.exception call: in the exception handler where the code currently does logger.exception("run_clang_github_tracker failed: %s", e) (inside the run_clang_github_tracker/handle exception block), change it to call logger.exception("run_clang_github_tracker failed") so the exception context is logged automatically, and keep the subsequent raise as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@clang_github_tracker/management/commands/run_clang_github_tracker.py`:
- Around line 113-115: Remove the redundant exception object from the
logger.exception call: in the exception handler where the code currently does
logger.exception("run_clang_github_tracker failed: %s", e) (inside the
run_clang_github_tracker/handle exception block), change it to call
logger.exception("run_clang_github_tracker failed") so the exception context is
logged automatically, and keep the subsequent raise as-is.
In `@clang_github_tracker/sync_raw.py`:
- Around line 143-145: The except block catching ConnectionException and
RateLimitException in sync_raw.py uses logger.exception("clang_github_tracker
sync failed: %s", e) which redundantly passes the exception object; change the
call to logger.exception("clang_github_tracker sync failed") (leave the
exception classes ConnectionException and RateLimitException and the subsequent
raise intact) so the exception context is logged automatically by
logger.exception.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
.env.exampleclang_github_tracker/__init__.pyclang_github_tracker/management/__init__.pyclang_github_tracker/management/commands/__init__.pyclang_github_tracker/management/commands/run_clang_github_tracker.pyclang_github_tracker/state.pyclang_github_tracker/sync_raw.pyclang_github_tracker/tests/__init__.pyclang_github_tracker/tests/test_commands.pyclang_github_tracker/tests/test_state.pyclang_github_tracker/workspace.pyconfig/settings.pyconfig/test_settings.pydocs/Workspace.mdworkflow/management/commands/run_all_collectors.py
jonathanMLDev
left a comment
There was a problem hiding this comment.
Please read the comments and update the code if it is reasonable.
llvm/llvm-project) and saves to raw JSON underworkspace/raw/github_activity_tracker/<owner>/<repo>/— no DB.CLANG_GITHUB_OWNERandCLANG_GITHUB_REPOin.env.workspace/clang_github_activity/state.json; if missing or invalid, state is recomputed from raw (or written withnulldates) with one retry, then the command exits on error.python manage.py run_clang_github_tracker [--dry-run] [--from-date ISO] [--to-date ISO].Summary by CodeRabbit
New Features
Configuration
Documentation
Tests
Chores