# 20-add boost-usage-tracker app (#71)
Conversation
- Add DB update logic from file
- Apply multiprocessing to handle the different rate limits of the GitHub search API and content scraping
- Apply bulk processing by repo for database updates
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. This status message can be disabled in the CodeRabbit configuration.
📝 Walkthrough

Adds a new Django app `boost_usage_tracker` with models, migrations, services, repo/code searchers, post-processing, management commands, CSV/JSON loaders, admin registration, tests, and docs to discover and record Boost C++ usage and yearly repo counts by language.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CLI as Management Cmd
participant RepoSearch as repo_searcher
participant CodeSearch as boost_searcher
participant GitHub as GitHub API
participant PostProc as post_process
participant DB as Database
CLI->>RepoSearch: search_repos_with_date_splitting(start,end,language,min_stars)
RepoSearch->>GitHub: REST /search/repositories (paged)
GitHub-->>RepoSearch: paginated repo results
RepoSearch->>RepoSearch: split ranges if total>1000
RepoSearch-->>CLI: list[RepoSearchResult]
CLI->>CodeSearch: search_boost_include_files_batch(repos)
loop per-repo (batched)
CodeSearch->>GitHub: GraphQL code search for includes
GitHub-->>CodeSearch: file paths
CodeSearch->>GitHub: fetch file content & commit date (REST/GraphQL)
GitHub-->>CodeSearch: file content + commit metadata
CodeSearch-->>CLI: FileSearchResult
end
CLI->>PostProc: process_single_repo(repo, file_results, last_commit_dt)
PostProc->>DB: resolve BoostFile matches (exact & suffix)
DB-->>PostProc: BoostFile or none
PostProc->>DB: create/update BoostExternalRepository
PostProc->>DB: bulk_create_or_update_boost_usage / mark_usages_excepted_bulk
PostProc->>DB: create BoostMissingHeaderTmp for unresolved headers
DB-->>PostProc: commit summaries
PostProc-->>CLI: processing outcome
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py (1)
14-25: ⚠️ Potential issue | 🟠 Major

Pin ORM operations to the migration database alias.

Lines 14-25 use the default manager for reads and writes inside `RunPython`, while line 40 uses `schema_editor.execute()` on the migration-specific connection. In multi-database setups, this creates inconsistent migration state: ORM operations read/write to the default database while the SQL remapping targets the migration database.

Add `db_alias = schema_editor.connection.alias` at the start of `migrate_users_forward()`, then apply `.using(db_alias)` to all ORM operations:

Proposed fix
```diff
 def migrate_users_forward(apps, schema_editor):
     """Create DiscordProfile for each DiscordUser, remap DiscordMessage.author_id."""
     DiscordUser = apps.get_model("discord_activity_tracker", "DiscordUser")
     BaseProfile = apps.get_model("cppa_user_tracker", "BaseProfile")
     DiscordProfile = apps.get_model("cppa_user_tracker", "DiscordProfile")
+    db_alias = schema_editor.connection.alias

     # Build mapping: old DiscordUser.pk → new DiscordProfile.pk
     pk_map = {}
-    for du in DiscordUser.objects.all():
+    for du in DiscordUser.objects.using(db_alias).all().iterator():
         # Create BaseProfile row first (multi-table inheritance)
-        bp = BaseProfile.objects.create(type="discord")
+        bp = BaseProfile.objects.using(db_alias).create(type="discord")
         # Create DiscordProfile row
-        DiscordProfile.objects.create(
+        DiscordProfile.objects.using(db_alias).create(
             baseprofile_ptr_id=bp.pk,
             discord_user_id=du.user_id,
             username=du.username,
             display_name=du.display_name,
             avatar_url=du.avatar_url,
             is_bot=du.is_bot,
         )
         pk_map[du.pk] = bp.pk
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py` around lines 14 - 25, In migrate_users_forward(), pin all ORM operations to the migration DB by first getting db_alias = schema_editor.connection.alias and then call .using(db_alias) on every ORM access and write (e.g., DiscordUser.objects.using(db_alias).all(), BaseProfile.objects.using(db_alias).create(...), DiscordProfile.objects.using(db_alias).create(...)); make sure any other ORM calls in that function also use .using(db_alias) so reads/writes target the same migration-specific connection as schema_editor.execute().
🧹 Nitpick comments (12)
boost_usage_tracker/update_boostusage_from_csv.py (3)
32-35: `_parse_datetime` is duplicated across CSV loaders.

An identical helper exists in `boost_usage_tracker/update_repository_from_csv.py` (lines 48-51). Consider extracting it into a shared utility (e.g., `boost_usage_tracker/utils.py`) to reduce duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_boostusage_from_csv.py` around lines 32 - 35, The helper function _parse_datetime is duplicated in update_boostusage_from_csv.py and update_repository_from_csv.py; extract it into a shared module (e.g., create boost_usage_tracker/utils.py with the _parse_datetime implementation) and update both modules to import _parse_datetime from boost_usage_tracker.utils instead of defining it inline; ensure the signature and behavior (handling None/empty strings and calling parse_datetime) remain identical and run tests/imports to confirm no breaking changes.
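For illustration, the shared helper could be as small as this (a sketch; the module path `boost_usage_tracker/utils.py` is the reviewer's suggestion, and the wrapper around Django's `parse_datetime` is assumed to mirror the existing copies):

```python
# boost_usage_tracker/utils.py -- proposed shared home for the duplicated helper
from datetime import datetime
from typing import Optional

from django.utils.dateparse import parse_datetime


def _parse_datetime(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 datetime from a CSV cell; None/empty returns None."""
    if not value:
        return None
    return parse_datetime(value.strip())
```

Both loaders would then import it from one place, so the copies cannot drift.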
82-88: Silently skipped rows are invisible in the result dict.

When a row has an empty `owner`, `repo_name`, `file_path`, or `boost_header_name`, it's `continue`d without incrementing any counter or logging. The caller has no way to know rows were dropped. Consider adding a `skipped_empty` counter or a debug log.

Proposed fix

```diff
+            if not owner or not repo_name or not file_path or not boost_header_name:
+                result.setdefault("skipped_empty", 0)
+                result["skipped_empty"] += 1
                 continue
```

And initialise `"skipped_empty": 0` in the result dict on Line 68.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_boostusage_from_csv.py` around lines 82 - 88, The CSV loop silently drops rows when owner/repo_name/file_path/boost_header_name are empty; initialize a skipped counter in the result dict (add "skipped_empty": 0 to the result structure constructed where results are initialized) and increment result["skipped_empty"] inside the for row in reader block right before the continue; optionally add a debug log there mentioning the row index/contents to aid troubleshooting.
111-113: `BoostFile` lookup by `github_file__filename` may match across repos.

`GitHubFile.filename` is not globally unique — multiple repos can have a file with the same name. For Boost headers this is unlikely to cause issues in practice, but a more precise filter (e.g., constraining to the Boost library repo) would be safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_boostusage_from_csv.py` around lines 111 - 113, The lookup for boost_header using BoostFile.objects.filter(github_file__filename=...) can return matches from other repos because GitHubFile.filename is not unique; update the query that sets boost_header to also filter by the Boost repo (e.g., add a condition on the related repo field such as github_file__repository__full_name or github_file__repo__full_name == "boostorg/boost" or the appropriate repo identifier) so the filter is constrained to the Boost library repository before calling .first().
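As a sketch of the constrained filter (the relation path `github_file__repository__full_name` and the `boostorg/boost` identifier come from the prompt above and are assumptions about the actual schema):

```python
# Constrain the header lookup to the Boost library repository so a
# same-named file in an unrelated repo can no longer match.
boost_header = BoostFile.objects.filter(
    github_file__filename=boost_header_name,
    github_file__repository__full_name="boostorg/boost",
).first()
```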
boost_usage_tracker/models.py (1)

78-78: `excepted_at` is a `DateField` while all other `*_at` fields are `DateTimeField`.

This loses time-of-day precision and is inconsistent with `created_at`, `updated_at`, and `last_commit_date`. If this is intentional (only the date matters), consider adding a brief comment to explain why. Otherwise, switch to `DateTimeField` for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/models.py` at line 78, The excepted_at field is defined as models.DateField which drops time precision and is inconsistent with other timestamp fields (e.g., created_at, updated_at, last_commit_date); change excepted_at to models.DateTimeField(null=True, blank=True) to preserve time-of-day data (and run the migration), or if the date-only semantics are intentional add a short comment above the excepted_at declaration explaining why it differs.
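If consistency wins, the change itself is a one-liner (sketch; the `null=True, blank=True` options follow the prompt above):

```python
# Align excepted_at with the other *_at timestamp fields.
excepted_at = models.DateTimeField(null=True, blank=True)
```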
boost_usage_tracker/update_githubfile_from_csv.py (2)

26-32: `_parse_bool` is duplicated across CSV loaders.

The identical `_parse_bool` helper exists in `update_repository_from_csv.py` (lines 40-46) and here. Consider extracting it into a shared utility module (e.g., a common `csv_helpers.py`) to avoid drift between copies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_githubfile_from_csv.py` around lines 26 - 32, Duplicate helper _parse_bool found in update_githubfile_from_csv.py and update_repository_from_csv.py; extract it into a new shared module (e.g., csv_helpers.py) and replace both copies with imports. Create csv_helpers.py exporting _parse_bool (or parse_bool) with the same logic, update update_githubfile_from_csv.py and update_repository_from_csv.py to import the function from that module, and remove the duplicated definitions so both files use the single shared helper.
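A possible shape for the shared helper (a sketch only — the set of accepted truthy strings is an assumption; the two existing copies are the source of truth):

```python
# csv_helpers.py -- hypothetical shared module suggested above
_TRUTHY = {"1", "true", "yes", "y", "t"}


def _parse_bool(value: str | None) -> bool:
    """Interpret a CSV cell as a boolean; None/empty counts as False."""
    if value is None:
        return False
    return value.strip().lower() in _TRUTHY
```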
79-106: Silently skipping rows with missing required fields.

Lines 83-84 skip rows where `owner`, `repo_name`, or `file_path` are empty without any logging. The `skipped_no_repo` counter (line 90) is tracked for missing repos, but completely empty rows are invisible. A debug-level log or a counter in the result dict would aid troubleshooting malformed CSVs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_githubfile_from_csv.py` around lines 79 - 106, The loop reading CSV rows currently silently continues when owner, repo_name, or file_path are empty; update the for-row logic (the block using variables owner, repo_name, file_path and the result dict) to increment a new counter like result["skipped_missing_fields"] and emit a debug log (including the row index or the raw row dict) when any required field is missing, so malformed/empty rows are visible alongside the existing skipped_no_repo handling; keep the existing repo lookup and create_or_update_github_file flow unchanged.

boost_usage_tracker/tests/test_services.py (2)

269-269: Unused unpacked variables flagged by Ruff (RUF059).

Same pattern as in test_models.py — prefix unused unpacked variables with `_`:

- Line 269: `usage2` → `_usage2`
- Line 530: `tmp2` → `_tmp2`
- Line 579: `usage` → `_usage`
- Line 595: `tmp` → `_tmp`

Also applies to: 530-530, 579-579, 595-595
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/tests/test_services.py` at line 269, Replace the unused unpacked variables by prefixing them with an underscore where create_or_update_boost_usage and similar calls unpack tuples: change usage2 to _usage2 (the call to services.create_or_update_boost_usage), tmp2 to _tmp2, usage to _usage, and tmp to _tmp so the unused values are explicitly ignored and Ruff RUF059 is satisfied; ensure only the variables actually used remain without the underscore prefix.
474-493: `select_related` test doesn't verify query count.

The test docstring states it verifies `select_related` behavior, but merely accessing `usage.boost_header` and `usage.file_path` doesn't fail when `select_related` is absent — it just triggers lazy loads. Use `django.test.utils.CaptureQueriesContext` or pytest-django's `django_assert_num_queries` fixture to actually assert no additional queries are executed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/tests/test_services.py` around lines 474 - 493, The test test_get_active_usages_for_repo_select_related currently only accesses related fields but doesn't assert query count; update the test to wrap access to usage.boost_header and usage.file_path with a query-assertion context (e.g., django.test.utils.CaptureQueriesContext or pytest-django's django_assert_num_queries) and assert that zero additional queries are executed when calling services.get_active_usages_for_repo(ext_repo) and then accessing those related attributes; keep references to the test function name and the service function get_active_usages_for_repo so the change is scoped to this test.
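With pytest-django the assertion could look like this (a sketch; the `ext_repo` fixture and the expected count of exactly one query are illustrative assumptions):

```python
import pytest

from boost_usage_tracker import services


@pytest.mark.django_db
def test_get_active_usages_for_repo_select_related(django_assert_num_queries, ext_repo):
    # A single query should fetch usages plus their related rows; touching
    # the related attributes afterwards must not trigger extra queries.
    with django_assert_num_queries(1):
        usages = list(services.get_active_usages_for_repo(ext_repo))
        for usage in usages:
            _ = usage.boost_header
            _ = usage.file_path
```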
boost_usage_tracker/tests/test_models.py (1)

156-156: Unused unpacked variables flagged by Ruff (RUF059).

Multiple test functions unpack values they never use. Prefix them with `_` to satisfy the linter and signal intent:

- Line 156: `tmp` → `_tmp`
- Line 199: `usage2` → `_usage2`
- Lines 263/268: `tmp1`, `c1`, `tmp2`, `c2` → `_tmp1`, `_c1`, `_tmp2`, `_c2`
- Line 386: `tmp2` → `_tmp2`
- Lines 448/453: `c1`, `c2` → `_c1`, `_c2`
- Line 478: `tmp` → `_tmp`

Also applies to: 199-199, 263-268
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/tests/test_models.py` at line 156, Several tests unpack multiple values from calls like get_or_create_missing_header_usage and other helpers but leave some variables unused; rename those unused unpacked variables to have a leading underscore to satisfy Ruff (RUF059). For example, in the call to get_or_create_missing_header_usage change usage, tmp, _ = ... to usage, _tmp, _ = ... (or usage, _tmp, _ = get_or_create_missing_header_usage), and apply the same pattern for the other occurrences listed: change usage2 → _usage2; tmp1, c1, tmp2, c2 → _tmp1, _c1, _tmp2, _c2; tmp2 → _tmp2; c1, c2 → _c1, _c2; and tmp → _tmp. Locate these in the test functions surrounding the symbols get_or_create_missing_header_usage and similar unpacking sites and rename the unused variables accordingly.

boost_usage_tracker/post_process.py (2)

218-221: Broad `except Exception` swallows errors silently.

While re-raising `ConnectionException` and `RateLimitException` is correct, the catch-all on line 220 means programming bugs (e.g., `KeyError`, `AttributeError`) will only appear as warnings in the log, and `stats` will be returned with partial/zero data — potentially hiding data-loss issues. Consider logging at `error` level with `exc_info=True` so stack traces are captured.

Proposed improvement
```diff
     except (ConnectionException, RateLimitException):
         raise
     except Exception as e:  # pylint: disable=broad-exception-caught
-        logger.warning("Failed post-processing %s: %s", repo_full_name, e)
+        logger.error("Failed post-processing %s: %s", repo_full_name, e, exc_info=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/post_process.py` around lines 218 - 221, The broad except in post_process (catching Exception) currently logs only a warning and swallows programming errors; change the except Exception as e handler so it logs at error level with full stack trace (use logger.error(..., exc_info=True) and include repo_full_name in the message) and then re-raise the exception so bugs (KeyError/AttributeError etc.) aren't silently hidden; keep the existing except (ConnectionException, RateLimitException) re-raise behavior unchanged.
39-53: Suffix fallback issues N DB queries per unresolved header path.

`_resolve_boost_header` runs up to `len(parts)` queries per path. For deeply nested headers this adds up. Since this is the fallback after a bulk exact-match miss, the volume should be low, but consider adding a local cache across calls if the same unresolved path appears in multiple repos within a batch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/post_process.py` around lines 39 - 53, _resolve_boost_header performs up to O(len(parts)) DB queries per call for the same unresolved header; add a local cache to avoid repeated queries across calls by decorating _resolve_boost_header with an appropriate cache (e.g., functools.lru_cache(maxsize=4096)) or by introducing a module-level dict that maps header_path -> BoostFile|None, keyed by the header_path string; ensure the cache stores either the BoostFile instance or None and is consulted before running the query loop (and consider evicting/clearing the cache between batches if repository state changes are a concern).
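The dict-based variant might look like this (a sketch; the wrapper and reset function names are hypothetical):

```python
# Module-level cache mapping header_path -> BoostFile instance or None.
_header_cache = {}


def _resolve_boost_header_cached(header_path):
    if header_path not in _header_cache:
        _header_cache[header_path] = _resolve_boost_header(header_path)
    return _header_cache[header_path]


def reset_header_cache():
    """Clear between batches in case Boost header rows change."""
    _header_cache.clear()
```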
boost_usage_tracker/update_git_account.py (1)

26-59: Near-identical helper functions duplicated from `db_from_file.py`.

`_load_json_records_from_path` and `_load_all_json_records` are completely duplicated between files. `get_github_account_dir` is nearly identical except it uses `GITHUB_ACCOUNT_SUBDIR` (public) versus `_GITHUB_ACCOUNT_SUBDIR` (private). Consider extracting these shared helpers to a common module to eliminate duplication, or if `db_from_file.py` is the canonical location, import from there (noting the constant naming difference that would need alignment).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_git_account.py` around lines 26 - 59, The three helpers get_github_account_dir, _load_json_records_from_path, and _load_all_json_records are duplicated from db_from_file.py; extract them into a shared module (e.g., boost_usage_tracker.utils or boost_usage_tracker.file_utils) and have update_git_account.py import and use them instead of redefining; ensure the directory-constant naming is reconciled (either rename GITHUB_ACCOUNT_SUBDIR to match _GITHUB_ACCOUNT_SUBDIR or export the canonical constant from db_from_file.py and update callers) and keep the same function signatures for _load_json_records_from_path and _load_all_json_records so callers like get_github_account_dir continue to work.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.env.example:
- Around line 86-89: The "Celery" header was accidentally placed inside the "App
credentials" block and lost its surrounding separator bars; fix by restoring the
standard section header pattern: end the "App credentials" separator block, then
create a new standalone Celery section using the same "===...===" separator
lines above and below the line "# Celery (optional; for scheduled
run_all_collectors)". Ensure the comment text remains unchanged and follows the
exact pattern used by other sections so "Celery" is its own top-level section.
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 265-276: _build_boost_include_query currently trims the repo list
to fit MAX_CODE_SEARCH_QUERY_LEN but the caller search_boost_include_files_batch
never retries the remaining repos, causing silent skips; change
search_boost_include_files_batch to iterate over the full repo_full_names in a
loop: repeatedly call _build_boost_include_query on the yet-unprocessed slice of
repo_full_names (e.g., repo_full_names[i:]), process the returned query and
repos, and advance i by len(repos) until all repos are handled; also handle the
edge case where a single repo produces a query longer than
MAX_CODE_SEARCH_QUERY_LEN by logging/warning and skipping that repo to avoid an
infinite loop.
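As an illustration of the loop this prompt describes (a sketch; `build_query` stands in for `_build_boost_include_query` and is assumed to return both the query and the repos it actually fit):

```python
import logging

logger = logging.getLogger(__name__)


def search_in_batches(repo_full_names, build_query, run_query, max_query_len):
    """Re-batch the unprocessed tail until every repo is handled."""
    results = []
    i = 0
    while i < len(repo_full_names):
        query, batched = build_query(repo_full_names[i:], max_query_len)
        if not batched:
            # A single repo alone exceeds max_query_len: warn and skip it
            # instead of looping forever.
            logger.warning("Skipping oversized repo name: %s", repo_full_names[i])
            i += 1
            continue
        results.extend(run_query(query))
        i += len(batched)
    return results
```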
In `@boost_usage_tracker/db_from_file.py`:
- Around line 96-104: The loop over records assumes each item is a dict and
calls rec.get, which will raise AttributeError for nulls/strings/numbers; before
accessing rec.get in the for rec in records loop, check that rec is a mapping
(e.g., isinstance(rec, dict)) and if not, logger.warning("Skipping non-dict
record: %r", rec) and continue; keep the existing raw_id/try-except conversion
logic (raw_id, gid, logger) unchanged so only non-dict entries are skipped
safely.
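A standalone sketch of that guard (the generator shape is illustrative; the real loop keeps its existing `raw_id` conversion logic):

```python
import logging

logger = logging.getLogger(__name__)


def iter_dict_records(records):
    """Yield only mapping records; warn and skip nulls, strings, and numbers."""
    for rec in records:
        if not isinstance(rec, dict):
            logger.warning("Skipping non-dict record: %r", rec)
            continue
        yield rec
```

The existing loop then becomes `for rec in iter_dict_records(records): ...`, and every `rec.get(...)` call is safe.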
In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py`:
- Around line 374-388: The date parsing for options["until"] and
options["since"] can raise ValueError and should not abort the command; wrap the
datetime.strptime calls for computing until and since in try/except ValueError
blocks (or use helper parsing) so that if parsing options["until"] fails you set
until = now, and if parsing options["since"] fails you set since = until -
timedelta(days=1); update the code that sets the until and since variables
(referencing now, until, since and options) to apply these fallbacks and keep
tzinfo=timezone.utc as before.
In
`@boost_usage_tracker/management/commands/run_update_created_repos_by_language.py`:
- Around line 77-93: The success message block that writes summary using
self.style.SUCCESS should switch to self.style.WARNING when the result indicates
errors; update the writer logic around the block that formats the result (the
call using self.stdout.write(... self.style.SUCCESS(...).format(...))) to choose
self.style.WARNING whenever result contains errors (e.g., non-empty
result.get("errors") or when earlier error paths were taken), and mirror the
same change at the similar summary write later (the other occurrence referenced
by the reviewer). Ensure you preserve the existing formatted fields (requested,
processed, missing, start, end, stars, rows, created, updated) but select
WARNING style conditionally instead of always using SUCCESS.
In `@boost_usage_tracker/management/commands/run_update_db.py`:
- Around line 128-142: The handle method currently returns integers to indicate
success/failure which Django ignores; change it to raise
django.core.management.base.CommandError when result contains errors: after
calling runner(source) and detecting result.get("errors"), aggregate the error
messages from result["errors"] and raise CommandError with that text instead of
writing errors and returning 1; keep the successful path to call
formatter(result) and write the success message via
self.stdout.write(self.style.SUCCESS(msg)) and then return None (or simply end)
so the command exits with success. Ensure you import CommandError and reference
the existing handle, TARGETS, runner, and formatter symbols when updating the
code.
In
`@boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py`:
- Around line 65-68: Typo in the index name: change "bost_missing_tmp_usage_id"
to "boost_missing_tmp_usage_id" both in the migration AddIndex entry that
targets model_name="boostmissingheadertmp" (the migrations.AddIndex(...
index=models.Index(... name="bost_missing_tmp_usage_id") ) and in the model
definition for BoostMissingHeaderTmp where the Index with name
"bost_missing_tmp_usage_id" is declared (likely in Meta.indexes); update both
occurrences so the DB schema and model metadata use
"boost_missing_tmp_usage_id".
In `@boost_usage_tracker/models.py`:
- Around line 127-129: The index name in the Index declaration is
misspelled—update the models.Index entry in the indexes list (the
models.Index(fields=["usage"], name="bost_missing_tmp_usage_id") line) to use
"boost_missing_tmp_usage_id" instead of "bost_missing_tmp_usage_id", and then
regenerate the migration so the corrected name is applied before any migration
is run.
In `@boost_usage_tracker/repo_searcher.py`:
- Around line 157-172: The code can raise UnboundLocalError because second_diff
is only assigned inside the else branch; ensure second_diff is defined before
it's referenced in the final condition. Edit the logic in repo_searcher.py
around the date-splitting block: initialize second_diff (e.g., second_diff = 0)
before the if/else, or restructure so the final condition checks the existence
of second_range/second_diff safely; update the variables used by the subsequent
calls to _process_date_range (and related variables first_range_1/2,
second_range_1/2) so they are always defined or guarded before use.
In `@boost_usage_tracker/services.py`:
- Around line 235-260: The loop currently appends every matched BoostUsage to
to_update even if no fields changed; change the logic in the loop that iterates
(bh_id, fp_id), usage in existing_map.items() to detect real changes before
appending: compute new values for last_commit_date and excepted_at (e.g.,
new_last = key_to_item[key][2], new_excepted = None) and only if
usage.last_commit_date != new_last or usage.excepted_at != new_excepted then set
the fields on usage, set usage.updated_at later, and append to_update; leave
unchanged rows out of to_update so BoostUsage.objects.bulk_update only updates
truly modified rows and updated_count = len(to_update).
- Around line 184-193: When an existing BoostUsage row is re-detected (i.e.
get_or_create returned an existing record via usage, _), the code updates
last_commit_date but doesn't clear the excepted_at flag so the usage can remain
incorrectly excepted; update the branch that handles existing records (where _
is False) to set usage.excepted_at = None (and any related excepted_* fields if
applicable) before saving and include "excepted_at" in the update_fields list
along with "last_commit_date" and "updated_at" so the excepted state is cleared
when a usage is reactivated.
In `@boost_usage_tracker/tests/test_update_created_repos_by_language.py`:
- Around line 14-24: The test
test_update_created_repos_by_language_requires_languages currently uses
patch.dict("os.environ", {}, clear=False) which does not remove
REPO_COUNT_LANGUAGES (LANGUAGES_ENV_KEY) from the environment; update the patch
to guarantee the env var is absent by either using patch.dict("os.environ", {},
clear=True) or explicitly setting the key to an empty value using the module
constant (e.g., patch.dict("os.environ", {LANGUAGES_ENV_KEY: ""}, clear=False)
after importing LANGUAGES_ENV_KEY), so that update_created_repos_by_language
falls back to no languages and the assertions reliably exercise the error path.
In `@boost_usage_tracker/update_created_repos_by_language.py`:
- Around line 58-77: The early error return blocks around parsing languages
(where source_csv, language_names, _parse_languages_csv, and LANGUAGES_ENV_KEY
are used) return a payload missing keys present in the success path; update
those error returns to include the full stable schema (e.g., include
languages_requested, start_year, end_year, years, created, updated,
rows_processed, and errors) so callers always receive the same shape; apply the
same fix to the other early-return blocks mentioned (the ones near the other
_parse_languages_csv/validation checks) to ensure consistent payload across all
paths.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 116-118: The log message incorrectly says "Skipping" even though
get_or_create_github_account was already called and the record upserted; either
validate username before calling get_or_create_github_account (move the username
== "" check above the call that performs the upsert) or change the warning text
to reflect post-upsert state (e.g., "Warning: record has empty username for
owner: %s") in update_git_account.py, referencing the username and owner
variables and keeping created_count/updated_count logic intact.
- Around line 84-99: Loop handling in update_git_account.py assumes owner is a
dict and calls owner.get("id") which raises AttributeError when rec["owner"] is
null; add a guard after setting owner (in the loop over records using variables
rec and owner) to check if owner is None or not a mapping/dict, log a warning
like "Skipping record missing owner or github_account_id: %s" with rec or owner,
and continue before attempting owner.get("id"); keep the existing int conversion
to gid and the warnings for missing/invalid github_account_id_raw unchanged.
In `@boost_usage_tracker/update_repository_from_csv.py`:
- Around line 1-5: The module docstring in update_repository_from_csv.py lists
expected CSV columns but omits the "forks" column which the code later parses
(the variable/name "forks" is read from the CSV). Update the top-of-file
docstring to include "forks" among the CSV columns (e.g., add it to the
comma-separated list alongside owner, repo_name, stars, description,
repo_pushed_at, repo_created_at, repo_updated_at, boost_version,
is_boost_embedded, is_boost_used) so the docstring matches the actual parsing
logic.
In `@config/settings.py`:
- Line 130: In the list of workspace slugs in settings.py the two string
literals "boost_usage_tracker" and "cppa_slack_transcript_tracker" are adjacent
without a comma causing implicit concatenation; add the missing comma between
them so they are two separate entries, ensuring downstream uses like
WORKSPACE_DIR / "cppa_slack_transcript_tracker" / "chrome_profile" reference the
correct directory name.
In `@github_activity_tracker/models.py`:
- Around line 58-60: The fields year, all_repos, and significant_repos currently
allow negative values; update their definitions to enforce non-negative values
at DB and model level by replacing IntegerField with PositiveIntegerField (or
adding MinValueValidator(0) if you need zero allowed) and add a CheckConstraint
in the model Meta (e.g., "year >= 0", "all_repos >= 0", "significant_repos >=
0") to the model(s) that contain these fields (the occurrences around the year,
all_repos, significant_repos definitions at lines ~58-60 and the other similar
block at ~67-72) so the database rejects negative values as well as the ORM.
Ensure constraints have unique names and include migration generation.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (43)
- .env.example
- boost_usage_tracker/__init__.py
- boost_usage_tracker/admin.py
- boost_usage_tracker/apps.py
- boost_usage_tracker/boost_searcher.py
- boost_usage_tracker/db_from_file.py
- boost_usage_tracker/management/__init__.py
- boost_usage_tracker/management/commands/__init__.py
- boost_usage_tracker/management/commands/run_boost_usage_tracker.py
- boost_usage_tracker/management/commands/run_update_created_repos_by_language.py
- boost_usage_tracker/management/commands/run_update_db.py
- boost_usage_tracker/migrations/0001_initial.py
- boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py
- boost_usage_tracker/migrations/__init__.py
- boost_usage_tracker/models.py
- boost_usage_tracker/post_process.py
- boost_usage_tracker/repo_searcher.py
- boost_usage_tracker/services.py
- boost_usage_tracker/tests/__init__.py
- boost_usage_tracker/tests/fixtures.py
- boost_usage_tracker/tests/test_db_from_file.py
- boost_usage_tracker/tests/test_models.py
- boost_usage_tracker/tests/test_services.py
- boost_usage_tracker/tests/test_update_created_repos_by_language.py
- boost_usage_tracker/update_boostusage_from_csv.py
- boost_usage_tracker/update_created_repos_by_language.py
- boost_usage_tracker/update_git_account.py
- boost_usage_tracker/update_githubfile_from_csv.py
- boost_usage_tracker/update_repository_from_csv.py
- config/settings.py
- conftest.py
- discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py
- docs/Contributing.md
- docs/Schema.md
- docs/Service_API.md
- docs/service_api/README.md
- docs/service_api/boost_usage_tracker.md
- github_activity_tracker/migrations/0002_createdreposbylanguage.py
- github_activity_tracker/models.py
- github_activity_tracker/services.py
- github_activity_tracker/tests/test_models.py
- github_activity_tracker/tests/test_services.py
- workflow/management/commands/run_all_collectors.py
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py (1)
14-40: ⚠️ Potential issue | 🟠 Major

Bind ORM operations to the migration database alias to prevent data inconsistency.

Lines 14, 16, and 18 execute ORM queries and writes without `schema_editor.connection.alias`. In multi-database or database-router setups, these operations default to the DEFAULT database instead of the migration's active database, while the raw SQL at line 40 correctly executes on the migration connection. This creates a race condition where new DiscordProfile records may be written to one database while the raw SQL update reads/writes from another, breaking data integrity.

Suggested fix
```diff
 def migrate_users_forward(apps, schema_editor):
     """Create DiscordProfile for each DiscordUser, remap DiscordMessage.author_id."""
     DiscordUser = apps.get_model("discord_activity_tracker", "DiscordUser")
     BaseProfile = apps.get_model("cppa_user_tracker", "BaseProfile")
     DiscordProfile = apps.get_model("cppa_user_tracker", "DiscordProfile")
+    db_alias = schema_editor.connection.alias

     # Build mapping: old DiscordUser.pk → new DiscordProfile.pk
     pk_map = {}
-    for du in DiscordUser.objects.all():
+    for du in DiscordUser.objects.using(db_alias).all():
         # Create BaseProfile row first (multi-table inheritance)
-        bp = BaseProfile.objects.create(type="discord")
+        bp = BaseProfile.objects.using(db_alias).create(type="discord")
         # Create DiscordProfile row
-        DiscordProfile.objects.create(
+        DiscordProfile.objects.using(db_alias).create(
             baseprofile_ptr_id=bp.pk,
             discord_user_id=du.user_id,
             username=du.username,
             display_name=du.display_name,
             avatar_url=du.avatar_url,
             is_bot=du.is_bot,
         )
         pk_map[du.pk] = bp.pk
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py` around lines 14 - 40, The ORM queries and saves must be bound to the migration DB alias: change DiscordUser.objects.all(), BaseProfile.objects.create(...) and DiscordProfile.objects.create(...) to use the migration connection alias (e.g. .using(schema_editor.connection.alias) on the queryset/create calls) so all reads/writes occur on schema_editor.connection.alias rather than the default DB; keep the raw SQL on schema_editor.execute as-is so both ORM operations and the SQL update target the same database.
♻️ Duplicate comments (14)
boost_usage_tracker/tests/test_update_created_repos_by_language.py (1)
14-24: ⚠️ Potential issue | 🟡 Minor

`patch.dict("os.environ", {}, clear=False)` is a no-op — `REPO_COUNT_LANGUAGES` is not cleared.

This was raised in a previous review and remains unresolved. If the env var is set in CI or via a `.env` loader, the test will silently exercise the wrong code path.

Proposed fix — pin the key to empty
```diff
-        with patch.dict("os.environ", {}, clear=False):
+        with patch.dict("os.environ", {"REPO_COUNT_LANGUAGES": ""}, clear=False):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/tests/test_update_created_repos_by_language.py` around lines 14 - 24, The test currently uses patch.dict("os.environ", {}, clear=False) which is a no-op; change the patch to explicitly set the REPO_COUNT_LANGUAGES key to an empty string so the code path that requires languages is exercised. Concretely, update the test to use patch.dict("os.environ", {"REPO_COUNT_LANGUAGES": ""}, clear=False) (or clear=True if you prefer to wipe all env vars) so update_created_repos_by_language is forced to rely on the provided languages_csv and return the expected error; keep references to update_created_repos_by_language and REPO_COUNT_LANGUAGES when making the change.

github_activity_tracker/models.py (1)

58-60: ⚠️ Potential issue | 🟠 Major

Missing DB-level non-negative constraints on `year`, `all_repos`, and `significant_repos`.

This was raised in a previous review and remains unresolved. Negative counts in aggregate metrics would silently corrupt reporting.
Proposed schema hardening
```diff
     class Meta:
         db_table = "github_activity_tracker_createdreposbylanguage"
         ordering = ["language", "year"]
         constraints = [
             models.UniqueConstraint(
                 fields=["language", "year"],
                 name="github_activity_tracker_created_lang_year_uniq",
-            )
+            ),
+            models.CheckConstraint(
+                check=models.Q(year__gte=0),
+                name="github_activity_tracker_created_lang_year_non_negative",
+            ),
+            models.CheckConstraint(
+                check=models.Q(all_repos__gte=0),
+                name="github_activity_tracker_created_lang_all_repos_non_negative",
+            ),
+            models.CheckConstraint(
+                check=models.Q(significant_repos__gte=0),
+                name="github_activity_tracker_created_lang_sig_repos_non_negative",
+            ),
         ]
```

A corresponding migration will also be needed.
Also applies to: 67-72
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@github_activity_tracker/models.py` around lines 58 - 60, Change the IntegerField definitions for year, all_repos, and significant_repos to non-negative types and add a DB-level CheckConstraint: use models.PositiveIntegerField (or IntegerField with MinValueValidator(0)) for year (keep db_index=True) and models.PositiveIntegerField(default=0) for all_repos and significant_repos, then add a Meta.constraints entry with CheckConstraint(expressions=Q(year__gte=0) & Q(all_repos__gte=0) & Q(significant_repos__gte=0), name='githubactivity_nonnegative_counts') to the model to enforce the rule at the DB level; apply the same changes to the other similar fields mentioned (lines 67-72) and generate the corresponding migration.

boost_usage_tracker/models.py (1)

127-129: ⚠️ Potential issue | 🟡 Minor

Fix typo in index name before migrations settle.

Line 128 has `bost_missing_tmp_usage_id`; this should be `boost_missing_tmp_usage_id`.

Suggested fix
```diff
         indexes = [
-            models.Index(fields=["usage"], name="bost_missing_tmp_usage_id"),
+            models.Index(fields=["usage"], name="boost_missing_tmp_usage_id"),
         ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/models.py` around lines 127 - 129, The index name in the model Meta indexes array is misspelled: change the models.Index name value "bost_missing_tmp_usage_id" to the correct "boost_missing_tmp_usage_id" so the index name is accurate before migrations are finalized; update the indexes entry where models.Index(fields=["usage"], name="bost_missing_tmp_usage_id") is declared to use the corrected string.

config/settings.py (1)

130-130: ⚠️ Potential issue | 🔴 Critical

Critical: missing comma causes implicit string concatenation.

At Line 130, two literals are concatenated into one slug, so neither the `boost_usage_tracker` nor the `cppa_slack_transcript_tracker` workspace directory is created correctly.

Suggested fix
```diff
 _WORKSPACE_APP_SLUGS = (
     "github_activity_tracker",
     "boost_library_tracker",
-    "boost_usage_tracker" "cppa_slack_transcript_tracker",
+    "boost_usage_tracker",
+    "cppa_slack_transcript_tracker",
     "discord_activity_tracker",
     "shared",
 )
```

```bash
#!/bin/bash
# Verify the adjacent string literal pattern and impacted workspace slug block.
rg -n -C2 '"boost_usage_tracker"\s*"cppa_slack_transcript_tracker"|_WORKSPACE_APP_SLUGS|chrome_profile' config/settings.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings.py` at line 130, The list of workspace slugs contains adjacent string literals that are implicitly concatenated; add a missing comma between "boost_usage_tracker" and "cppa_slack_transcript_tracker" in the workspace slug block (the _WORKSPACE_APP_SLUGS / workspace slug list in config/settings.py) so each slug is its own entry and both directories are created correctly.

boost_usage_tracker/management/commands/run_update_created_repos_by_language.py (1)

77-95: ⚠️ Potential issue | 🟡 Minor

Use warning style for summary when errors are present.

Line 78 always formats the summary with `SUCCESS`, even when errors were printed and Line 95 returns `1`. This makes failed runs look successful in logs.

Suggested fix
```diff
+        summary_style = (
+            self.style.WARNING if result.get("errors") else self.style.SUCCESS
+        )  # pylint: disable=no-member
         self.stdout.write(
-            self.style.SUCCESS(  # pylint: disable=no-member
+            summary_style(
                 "languages_requested={requested} processed={processed} missing={missing} "
                 "years={start}-{end} stars_min={stars} rows_processed={rows} "
                 "created={created} updated={updated}".format(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/management/commands/run_update_created_repos_by_language.py` around lines 77 - 95, The summary message always uses self.style.SUCCESS in the self.stdout.write call even when result.get("errors") is truthy; change the formatting to use a warning (e.g., self.style.WARNING) when result.get("errors") is present so the printed summary matches the exit code. Locate the self.stdout.write(...) block that formats requested/processed/missing/... using result.get(...) and conditionally choose style.SUCCESS when no errors and style.WARNING when result.get("errors") is truthy, keeping the same message content, then keep the existing return 1 if result.get("errors") else 0 behavior.

boost_usage_tracker/update_git_account.py (2)

84-90: ⚠️ Potential issue | 🟠 Major

Guard `owner` before dereferencing to avoid runtime crash.

At Line 88, `owner.get("id")` can raise `AttributeError` when `rec["owner"]` is `None` or non-object.

Proposed fix
```diff
     for rec in records:
         owner = rec
         if "owner" in rec:
             owner = rec.get("owner")
+        if not isinstance(owner, dict):
+            logger.warning(
+                "Skipping record missing owner or github_account_id: %s", rec
+            )
+            continue
         github_account_id_raw = owner.get("id")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_git_account.py` around lines 84 - 90, The loop dereferences owner with owner.get("id") which can raise AttributeError if owner is None or not a mapping; update the loop in the records processing (the for rec in records block) to first validate owner (the owner variable derived from rec or rec.get("owner")) is a dict-like object (e.g., check owner is not None and isinstance(owner, dict) or hasattr(owner, "get")) before calling owner.get("id"); if the owner is invalid, emit the existing logger.warning (or an enhanced message referencing owner) and continue to the next record so github_account_id_raw is never accessed on a non-object.
116-117: ⚠️ Potential issue | 🟡 Minor

Fix misleading log wording after successful upsert.

Line 117 says “Skipping”, but the account has already been created/updated on Lines 104-111.
Proposed fix
```diff
         if username == "":
-            logger.warning("Skipping record with empty username: %s", owner)
+            logger.warning("Upserted record with empty username: %s", owner)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_git_account.py` around lines 116 - 117, The log message "Skipping record with empty username: %s" is misleading because the account was already created/updated earlier; update the logger call in the block that checks username == "" to reflect that the upsert was completed for this owner (e.g., logger.info or logger.debug with text like "Account created/updated for owner with empty username: %s") and keep the referenced variables owner and username in the message so it clearly identifies the record; modify the logger.warning call to logger.info (or logger.debug) and change the message text accordingly in the same function where owner and username are used.

boost_usage_tracker/update_created_repos_by_language.py (1)

58-77: ⚠️ Potential issue | 🟡 Minor

Keep error returns schema-compatible with success return.

Current early returns omit keys present in normal output (`languages_requested`, `start_year`, `end_year`, etc.), which complicates command/reporting code.

Proposed fix pattern
```diff
+def _result_template(*, start_year: int, end_year: int, stars_min: int) -> dict[str, Any]:
+    return {
+        "languages_requested": [],
+        "languages_processed": [],
+        "languages_missing": [],
+        "start_year": start_year,
+        "end_year": end_year,
+        "stars_min": stars_min,
+        "created": 0,
+        "updated": 0,
+        "rows_processed": 0,
+        "errors": [],
+    }
@@
     if start_year > end_year:
-        return {
-            "created": 0,
-            "updated": 0,
-            "rows_processed": 0,
-            "errors": [
-                f"Invalid year range: start_year({start_year}) > end_year({end_year})"
-            ],
-        }
+        result = _result_template(
+            start_year=start_year,
+            end_year=end_year,
+            stars_min=stars_min,
+        )
+        result["errors"].append(
+            f"Invalid year range: start_year({start_year}) > end_year({end_year})"
+        )
+        return result
```

Also applies to: 95-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_created_repos_by_language.py` around lines 58 - 77, Early return payloads in update_created_repos_by_language.py omit fields present in successful results; make all early error returns include the full schema (e.g., languages_requested, start_year, end_year, created, updated, rows_processed, errors) so callers can rely on a consistent shape. Locate the two early-return blocks around the _parse_languages_csv call (and the other one at lines ~95-100) and update them to populate languages_requested (use language_names or source_csv), start_year and end_year (use existing start_year/end_year variables), plus created/updated/rows_processed set to 0, and append the error message into errors, matching the keys used by the normal success return.

boost_usage_tracker/update_repository_from_csv.py (1)

4-5: ⚠️ Potential issue | 🟡 Minor

Include `forks` in the documented CSV columns.

`forks` is parsed on Lines 108-109 but is still missing from the module docstring column list, so docs and behavior diverge.

Proposed fix
```diff
-CSV columns: owner, repo_name, stars, description, repo_pushed_at, repo_created_at,
+CSV columns: owner, repo_name, stars, forks, description, repo_pushed_at, repo_created_at,
 repo_updated_at, boost_version, is_boost_embedded, is_boost_used.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_repository_from_csv.py` around lines 4 - 5, The module docstring's CSV column list is missing "forks" even though the code parses a forks field (see the forks parsing around the existing parsing logic at lines ~108-109); update the top-of-file docstring CSV columns line to include "forks" so the documented columns match the actual parser behavior, i.e., add forks to the comma-separated list of columns in the module docstring.

boost_usage_tracker/management/commands/run_update_db.py (1)

135-142: ⚠️ Potential issue | 🟠 Major

Use `CommandError` for failure instead of returning integers from `handle()`.

Returning `1`/`0` here is not a reliable CLI failure signal; raise `CommandError` when `result["errors"]` is present.

Proposed fix
```diff
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
@@
         if result.get("errors"):
             for err in result["errors"]:
                 self.stderr.write(self.style.ERROR(err))  # pylint: disable=no-member
-            return 1
+            raise CommandError(f"run_update_db failed for target={target}")
@@
         msg = formatter(result)
         self.stdout.write(self.style.SUCCESS(msg))  # pylint: disable=no-member
-        return 0
+        return None
```

Use this read-only check to verify Django’s behavior in your environment:
```bash
#!/bin/bash
python - <<'PY'
import inspect
from django.core.management.base import BaseCommand
print("=== BaseCommand.execute ===")
print(inspect.getsource(BaseCommand.execute))
print("\n=== BaseCommand.run_from_argv ===")
print(inspect.getsource(BaseCommand.run_from_argv))
PY
```

Expected result: `run_from_argv()` does not map `handle()` integer return values to process exit codes; non-zero exit comes from raised exceptions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/management/commands/run_update_db.py` around lines 135 - 142, Replace the integer return flow in handle() with a raised CommandError: when result.get("errors") is truthy, import CommandError from django.core.management and raise CommandError with a clear message (e.g. join result["errors"] or include relevant details) instead of writing to stderr and returning 1; keep the success path (use formatter(result), self.stdout.write(self.style.SUCCESS(msg))) and simply return None (no numeric return). Ensure you reference the same symbols: result, formatter, self.stderr/self.stdout, and raise CommandError.

boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py (1)

67-67: ⚠️ Potential issue | 🟡 Minor

Fix index name typo before schema lands.

`bost_missing_tmp_usage_id` appears misspelled and will persist in DB metadata.

Proposed fix
```diff
-            index=models.Index(fields=["usage"], name="bost_missing_tmp_usage_id"),
+            index=models.Index(fields=["usage"], name="boost_missing_tmp_usage_id"),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py` at line 67, The index name in the migration's Index declaration is misspelled; update the Index(... name="bost_missing_tmp_usage_id") to the correct spelling (e.g., "boost_missing_tmp_usage_id") in the migration file so the database metadata uses the intended name; locate the Index call in the migration (the models.Index(fields=["usage"], name="bost_missing_tmp_usage_id") line) and change only the name string to the corrected identifier.boost_usage_tracker/services.py (2)
235-260: ⚠️ Potential issue | 🟠 Major

Bulk upsert currently updates rows even when nothing changed.

Line 245 appends all matched usages to `to_update`, so unchanged rows still get `bulk_update`d and `updated_count` is overstated.

🔧 Proposed fix
```diff
 for (bh_id, fp_id), usage in existing_map.items():
     key = (bh_id, fp_id)
     if key not in key_to_item:
         continue
     to_create_keys.discard(key)
     boost_header, file_path, last_commit_date = key_to_item[key]
+    changed = False
     if last_commit_date is not None and usage.last_commit_date != last_commit_date:
         usage.last_commit_date = last_commit_date
+        changed = True
     if usage.excepted_at is not None:
         usage.excepted_at = None
-    to_update.append(usage)
+        changed = True
+    if changed:
+        to_update.append(usage)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/services.py` around lines 235 - 260, The code currently appends every matched usage to to_update causing unchanged rows to be bulk_updated and inflating updated_count; change the logic in the loop that iterates existing_map items (referencing existing_map, key_to_item, usage, to_update) to compare the current fields (last_commit_date and excepted_at) against the values from key_to_item and only append usage to to_update when at least one of those fields actually changes, then set usage.updated_at for those changed records before calling BoostUsage.objects.bulk_update and compute updated_count as the length of the filtered to_update list.
184-193: ⚠️ Potential issue | 🟠 Major

Re-detected missing-header usage can remain incorrectly excepted.

At Line 190-Line 193, existing placeholder usages update `last_commit_date` but never clear `excepted_at`. Re-detected rows can stay inactive.

🔧 Proposed fix
```diff
-    usage, _ = BoostUsage.objects.get_or_create(
+    usage, created_usage = BoostUsage.objects.get_or_create(
         repo=repo,
         boost_header=None,
         file_path=file_path,
         defaults={"last_commit_date": last_commit_date},
     )
-    if not _ and last_commit_date and usage.last_commit_date != last_commit_date:
-        usage.last_commit_date = last_commit_date
-        usage.save(update_fields=["last_commit_date", "updated_at"])
+    if not created_usage:
+        changed = False
+        if last_commit_date is not None and usage.last_commit_date != last_commit_date:
+            usage.last_commit_date = last_commit_date
+            changed = True
+        if usage.excepted_at is not None:
+            usage.excepted_at = None
+            changed = True
+        if changed:
+            usage.save(update_fields=["last_commit_date", "excepted_at", "updated_at"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/services.py` around lines 184 - 193, When updating an existing BoostUsage after get_or_create (the variable usage returned from BoostUsage.objects.get_or_create), clear any stale exception state by setting usage.excepted_at = None before saving if you change last_commit_date; include "excepted_at" in the save(update_fields=[...]) call so the null is persisted. Locate the block that checks "if not _ and last_commit_date and usage.last_commit_date != last_commit_date" and update it to reset excepted_at and save with update_fields=["last_commit_date", "updated_at", "excepted_at"] (ensure excepted_at is set to None).

boost_usage_tracker/management/commands/run_boost_usage_tracker.py (1)
375-388: ⚠️ Potential issue | 🟠 Major

Invalid `--since`/`--until` input still hard-fails command execution.

Line 376 and Line 383 can raise `ValueError`, which aborts the command instead of falling back to defaults.

🔧 Proposed permissive parsing
```diff
         now = datetime.now(timezone.utc)
-        until = (
-            datetime.strptime(options["until"], "%Y-%m-%d").replace(
-                tzinfo=timezone.utc,
-            )
-            if options["until"]
-            else now
-        )
-        since = (
-            datetime.strptime(options["since"], "%Y-%m-%d").replace(
-                tzinfo=timezone.utc,
-            )
-            if options["since"]
-            else until - timedelta(days=1)
-        )
+        def _parse_ymd_or_none(value: str | None) -> datetime | None:
+            if not value:
+                return None
+            try:
+                return datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
+            except ValueError:
+                return None
+
+        until = _parse_ymd_or_none(options["until"]) or now
+        since = _parse_ymd_or_none(options["since"]) or (until - timedelta(days=1))
```

Based on learnings, invalid date inputs in this repository’s management commands should fall back to default logic instead of raising.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py` around lines 375 - 388, The parsing of options["until"] and options["since"] using datetime.strptime can raise ValueError and currently aborts execution; wrap each datetime.strptime call that sets until and since in a try/except ValueError block so that on parse failure you fall back to the existing defaults (for until use now, for since use until - timedelta(days=1)), and optionally emit a warning/log about the invalid input; update the logic around the until and since assignments (the blocks that reference options["until"], options["since"], datetime.strptime, tzinfo=timezone.utc, and the default calculations) to implement this permissive parsing behavior.
🧹 Nitpick comments (5)
boost_usage_tracker/migrations/0001_initial.py (1)
40-40: Verify the field name `excepted_at` is intentional.

`excepted_at` is an unusual name; did you mean `extracted_at`, `exempted_at`, or something else? If it tracks when a record was marked as an exception/exclusion, a more descriptive name like `exempted_at` or `excluded_at` might reduce confusion.
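If the rename does turn out to be warranted, a follow-up migration is enough. Below is a minimal sketch, assuming the field lives on `BoostUsage`, the target name is `exempted_at`, and the app's `0002` migration is the latest dependency; all three are assumptions, not confirmed by this PR:

```python
# Hypothetical rename migration; model_name, the new field name, and the
# dependency below are illustrative assumptions, not part of this PR.
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [
        (
            "boost_usage_tracker",
            "0002_add_boost_missing_header_tmp_and_nullable_boost_header",
        ),
    ]

    operations = [
        # RenameField keeps the column's data; only the name changes.
        migrations.RenameField(
            model_name="boostusage",
            old_name="excepted_at",
            new_name="exempted_at",
        ),
    ]
```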
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/migrations/0001_initial.py` at line 40, The migration defines a DateField named excepted_at which is likely a typo and should be clarified/renamed (e.g., extracted_at, exempted_at, or excluded_at) — locate the field definition in the initial migration and in the corresponding model (search for excepted_at) and decide the intended name; if it's a typo, update the model field name to the correct one and create a new migration that renames the field (use a RenameField operation) rather than editing already-applied historic migrations, or if the project is pre-release and safe to rewrite, update the initial migration and all references to excepted_at to the chosen name (e.g., exempted_at) to keep naming consistent.

boost_usage_tracker/boost_searcher.py (1)
248-252: Silently swallowed exception hides commit-date fetch failures.

The bare `except Exception: pass` loses all diagnostic information. Even though `commit_date` is optional, logging at `debug` level (consistent with the rest of the file) would help troubleshoot issues.

♻️ Proposed fix
```diff
         except Exception:
-            pass
+            logger.debug(
+                "Failed to fetch commit date for %s/%s",
+                repo_full_name,
+                file_path,
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/boost_searcher.py` around lines 248 - 252, The try/except block that parses commits to set commit_date currently swallows all errors (except Exception: pass); change it to catch Exception as e and emit a debug-level log (consistent with other uses in this module) with the error details (e or use exc_info/stack) so failures to parse commits/date are recorded; keep commit_date optional and unchanged on error but do not silently ignore exceptions—update the block around the commits parsing (where commits is inspected and datetime.fromisoformat(...) is called) to log the diagnostic information.

boost_usage_tracker/repo_searcher.py (1)
133-133: Remove commented-out `time.sleep` call.

This looks like a leftover from development. If the rate-limit handling in the client makes this unnecessary, remove the dead comment to avoid confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/repo_searcher.py` at line 133, Remove the leftover commented-out call "# time.sleep(SEARCH_DELAY)" from repo_searcher.py to eliminate dead code and confusion; locate the commented symbol time.sleep(SEARCH_DELAY) in the search loop (where SEARCH_DELAY is referenced) and delete the line, or if a deliberate placeholder is required, replace it with a clear TODO comment referencing SEARCH_DELAY and the surrounding search logic.

github_activity_tracker/services.py (1)
79-79: Use unpacking instead of list concatenation.

Ruff RUF005 flags `update_fields + ["updated_at"]` as less idiomatic.

♻️ Proposed fix
```diff
-        row.save(update_fields=update_fields + ["updated_at"])
+        row.save(update_fields=[*update_fields, "updated_at"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@github_activity_tracker/services.py` at line 79, Replace the list-concatenation passed to row.save with list unpacking: instead of using update_fields + ["updated_at"] when calling the save method on the row object, build the update_fields argument using unpacking so the call to row.save(update_fields=...) uses the expanded original update_fields plus "updated_at"; locate the save call on the variable named row (row.save(...)) in github_activity_tracker/services.py and update the argument accordingly.

boost_usage_tracker/tests/test_models.py (1)
156-156: Use `_` for intentionally unused tuple elements to satisfy lint cleanly.

Several unpacked values are not used; renaming to `_` (or `_tmp`, `_created`) keeps intent explicit and avoids RUF059 noise.

Example cleanup
```diff
-    usage, tmp, _ = services.get_or_create_missing_header_usage(
+    usage, _, _ = services.get_or_create_missing_header_usage(
@@
-    usage2, created2 = services.create_or_update_boost_usage(
+    _, created2 = services.create_or_update_boost_usage(
@@
-    usage1, tmp1, c1 = services.get_or_create_missing_header_usage(
+    usage1, _, _ = services.get_or_create_missing_header_usage(
```

Also applies to: 199-199, 263-263, 268-268, 386-386, 448-448, 453-453, 478-478
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/tests/test_models.py` at line 156, Several tests unpack tuple returns but leave some elements unused (e.g., the call to services.get_or_create_missing_header_usage), which triggers lint RUF059; update those unpackings to use `_` (or `_tmp`, `_created`) for intentionally unused elements so intent is clear and lint clean. Specifically, locate calls such as services.get_or_create_missing_header_usage and the other occurrences referenced (around the noted lines) and change the tuple unpack pattern from e.g. "usage, tmp, _" or "a, b, c" to use "_" for any value that isn’t referenced later in the test; keep the variable name for the used element(s) only and leave the rest as underscores.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py`:
- Around line 215-220: The global watermark last_commit_dt computed from
BoostUsage.objects.aggregate(Max("last_commit_date")) can skip valid updates in
slower repos; replace the global watermark with a per-repository watermark:
compute the max last_commit_date grouped by the repo identifier from BoostUsage
(falling back to CREATION_START_DEFAULT) and pass the repo-specific value into
monitor_content (or any function using last_commit_dt) so each repo uses its own
max last_commit_date instead of the single last_commit_dt; update any call sites
that reference last_commit_dt to accept a repo parameter or lookup the per-repo
watermark by repository id/name.
In `@boost_usage_tracker/tests/test_services.py`:
- Line 269: In tests/test_services.py replace intentionally unused unpacked
variables returned from services.create_or_update_boost_usage with underscore
placeholders (e.g., change "usage2, created2 =
services.create_or_update_boost_usage(...)" to "_, created2 =
services.create_or_update_boost_usage(...)" or "usageX, _ = ..." depending on
which value is unused) for each occurrence (the calls around the
create_or_update_boost_usage invocations flagged). This removes the unused-name
lint warnings (RUF059) while keeping the used return values intact.
In `@boost_usage_tracker/update_boostusage_from_csv.py`:
- Line 4: The CSV contract documents an excepted_at column but the import loop
in update_boostusage_from_csv.py never applies it; update the CSV row parsing to
extract the excepted_at field (parse as datetime or None), and pass that value
into the service call that creates/updates each record (e.g., the boost usage
import loop where owner, repo_name, file_path, boost_header_name, last_commit_ts
are mapped) so the service method (create_or_update_boost_usage /
BoostUsageService.*) receives and persists excepted_at instead of silently
dropping it; alternatively, if you intend to drop the column, remove excepted_at
from the declared CSV columns to keep the contract consistent.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 142-143: The single-file branch doesn't handle JSON parsing or IO
errors like the directory branch does; wrap the call to
_load_json_records_from_path(path) (the block starting with "elif path.is_file()
and path.suffix.lower() == '.json':") in a try/except that catches
json.JSONDecodeError and OSError, log an explanatory error (including the
exception) and gracefully exit/return the same way directory-mode does so the
command doesn't crash on malformed or unreadable JSON.
---
Outside diff comments:
In
`@discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py`:
- Around line 14-40: The ORM queries and saves must be bound to the migration DB
alias: change DiscordUser.objects.all(), BaseProfile.objects.create(...) and
DiscordProfile.objects.create(...) to use the migration connection alias (e.g.
.using(schema_editor.connection.alias) on the queryset/create calls) so all
reads/writes occur on schema_editor.connection.alias rather than the default DB;
keep the raw SQL on schema_editor.execute as-is so both ORM operations and the
SQL update target the same database.
---
Duplicate comments:
In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py`:
- Around line 375-388: The parsing of options["until"] and options["since"]
using datetime.strptime can raise ValueError and currently aborts execution;
wrap each datetime.strptime call that sets until and since in a try/except
ValueError block so that on parse failure you fall back to the existing defaults
(for until use now, for since use until - timedelta(days=1)), and optionally
emit a warning/log about the invalid input; update the logic around the until
and since assignments (the blocks that reference options["until"],
options["since"], datetime.strptime, tzinfo=timezone.utc, and the default
calculations) to implement this permissive parsing behavior.
In
`@boost_usage_tracker/management/commands/run_update_created_repos_by_language.py`:
- Around line 77-95: The summary message always uses self.style.SUCCESS in the
self.stdout.write call even when result.get("errors") is truthy; change the
formatting to use a warning (e.g., self.style.WARNING) when result.get("errors")
is present so the printed summary matches the exit code. Locate the
self.stdout.write(...) block that formats requested/processed/missing/... using
result.get(...) and conditionally choose style.SUCCESS when no errors and
style.WARNING when result.get("errors") is truthy, keeping the same message
content, then keep the existing return 1 if result.get("errors") else 0
behavior.
In `@boost_usage_tracker/management/commands/run_update_db.py`:
- Around line 135-142: Replace the integer return flow in handle() with a raised
CommandError: when result.get("errors") is truthy, import CommandError from
django.core.management and raise CommandError with a clear message (e.g. join
result["errors"] or include relevant details) instead of writing to stderr and
returning 1; keep the success path (use formatter(result),
self.stdout.write(self.style.SUCCESS(msg))) and simply return None (no numeric
return). Ensure you reference the same symbols: result, formatter,
self.stderr/self.stdout, and raise CommandError.
In
`@boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py`:
- Line 67: The index name in the migration's Index declaration is misspelled;
update the Index(... name="bost_missing_tmp_usage_id") to the correct spelling
(e.g., "boost_missing_tmp_usage_id") in the migration file so the database
metadata uses the intended name; locate the Index call in the migration (the
models.Index(fields=["usage"], name="bost_missing_tmp_usage_id") line) and
change only the name string to the corrected identifier.
In `@boost_usage_tracker/models.py`:
- Around line 127-129: The index name in the model Meta indexes array is
misspelled: change the models.Index name value "bost_missing_tmp_usage_id" to
the correct "boost_missing_tmp_usage_id" so the index name is accurate before
migrations are finalized; update the indexes entry where
models.Index(fields=["usage"], name="bost_missing_tmp_usage_id") is declared to
use the corrected string.
In `@boost_usage_tracker/services.py`:
- Around line 235-260: The code currently appends every matched usage to
to_update causing unchanged rows to be bulk_updated and inflating updated_count;
change the logic in the loop that iterates existing_map items (referencing
existing_map, key_to_item, usage, to_update) to compare the current fields
(last_commit_date and excepted_at) against the values from key_to_item and only
append usage to to_update when at least one of those fields actually changes,
then set usage.updated_at for those changed records before calling
BoostUsage.objects.bulk_update and compute updated_count as the length of the
filtered to_update list.
- Around line 184-193: When updating an existing BoostUsage after get_or_create
(the variable usage returned from BoostUsage.objects.get_or_create), clear any
stale exception state by setting usage.excepted_at = None before saving if you
change last_commit_date; include "excepted_at" in the save(update_fields=[...])
call so the null is persisted. Locate the block that checks "if not _ and
last_commit_date and usage.last_commit_date != last_commit_date" and update it
to reset excepted_at and save with update_fields=["last_commit_date",
"updated_at", "excepted_at"] (ensure excepted_at is set to None).
In `@boost_usage_tracker/tests/test_update_created_repos_by_language.py`:
- Around line 14-24: The test currently uses patch.dict("os.environ", {},
clear=False) which is a no-op; change the patch to explicitly set the
REPO_COUNT_LANGUAGES key to an empty string so the code path that requires
languages is exercised. Concretely, update the test to use
patch.dict("os.environ", {"REPO_COUNT_LANGUAGES": ""}, clear=False) (or
clear=True if you prefer to wipe all env vars) so
update_created_repos_by_language is forced to rely on the provided languages_csv
and return the expected error; keep references to
update_created_repos_by_language and REPO_COUNT_LANGUAGES when making the
change.
In `@boost_usage_tracker/update_created_repos_by_language.py`:
- Around line 58-77: Early return payloads in
update_created_repos_by_language.py omit fields present in successful results;
make all early error returns include the full schema (e.g., languages_requested,
start_year, end_year, created, updated, rows_processed, errors) so callers can
rely on a consistent shape. Locate the two early-return blocks around the
_parse_languages_csv call (and the other one at lines ~95-100) and update them
to populate languages_requested (use language_names or source_csv), start_year
and end_year (use existing start_year/end_year variables), plus
created/updated/rows_processed set to 0, and append the error message into
errors, matching the keys used by the normal success return.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 84-90: The loop dereferences owner with owner.get("id") which can
raise AttributeError if owner is None or not a mapping; update the loop in the
records processing (the for rec in records block) to first validate owner (the
owner variable derived from rec or rec.get("owner")) is a dict-like object
(e.g., check owner is not None and isinstance(owner, dict) or hasattr(owner,
"get")) before calling owner.get("id"); if the owner is invalid, emit the
existing logger.warning (or an enhanced message referencing owner) and continue
to the next record so github_account_id_raw is never accessed on a non-object.
- Around line 116-117: The log message "Skipping record with empty username: %s"
is misleading because the account was already created/updated earlier; update
the logger call in the block that checks username == "" to reflect that the
upsert was completed for this owner (e.g., logger.info or logger.debug with text
like "Account created/updated for owner with empty username: %s") and keep the
referenced variables owner and username in the message so it clearly identifies
the record; modify the logger.warning call to logger.info (or logger.debug) and
change the message text accordingly in the same function where owner and
username are used.
In `@boost_usage_tracker/update_repository_from_csv.py`:
- Around line 4-5: The module docstring's CSV column list is missing "forks"
even though the code parses a forks field (see the forks parsing around the
existing parsing logic at lines ~108-109); update the top-of-file docstring CSV
columns line to include "forks" so the documented columns match the actual
parser behavior, i.e., add forks to the comma-separated list of columns in the
module docstring.
In `@config/settings.py`:
- Line 130: The list of workspace slugs contains adjacent string literals that
are implicitly concatenated; add a missing comma between "boost_usage_tracker"
and "cppa_slack_transcript_tracker" in the workspace slug block (the
_WORKSPACE_APP_SLUGS / workspace slug list in config/settings.py) so each slug
is its own entry and both directories are created correctly.
In `@github_activity_tracker/models.py`:
- Around line 58-60: Change the IntegerField definitions for year, all_repos,
and significant_repos to non-negative types and add a DB-level CheckConstraint:
use models.PositiveIntegerField (or IntegerField with MinValueValidator(0)) for
year (keep db_index=True) and models.PositiveIntegerField(default=0) for
all_repos and significant_repos, then add a Meta.constraints entry with
CheckConstraint(expressions=Q(year__gte=0) & Q(all_repos__gte=0) &
Q(significant_repos__gte=0), name='githubactivity_nonnegative_counts') to the
model to enforce the rule at the DB level; apply the same changes to the other
similar fields mentioned (lines 67-72) and generate the corresponding migration.
---
Nitpick comments:
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 248-252: The try/except block that parses commits to set
commit_date currently swallows all errors (except Exception: pass); change it to
catch Exception as e and emit a debug-level log (consistent with other uses in
this module) with the error details (e or use exc_info/stack) so failures to
parse commits/date are recorded; keep commit_date optional and unchanged on
error but do not silently ignore exceptions—update the block around the commits
parsing (where commits is inspected and datetime.fromisoformat(...) is called)
to log the diagnostic information.
In `@boost_usage_tracker/migrations/0001_initial.py`:
- Line 40: The migration defines a DateField named excepted_at which is likely a
typo and should be clarified/renamed (e.g., extracted_at, exempted_at, or
excluded_at) — locate the field definition in the initial migration and in the
corresponding model (search for excepted_at) and decide the intended name; if
it's a typo, update the model field name to the correct one and create a new
migration that renames the field (use a RenameField operation) rather than
editing already-applied historic migrations, or if the project is pre-release
and safe to rewrite, update the initial migration and all references to
excepted_at to the chosen name (e.g., exempted_at) to keep naming consistent.
In `@boost_usage_tracker/repo_searcher.py`:
- Line 133: Remove the leftover commented-out call "# time.sleep(SEARCH_DELAY)"
from repo_searcher.py to eliminate dead code and confusion; locate the commented
symbol time.sleep(SEARCH_DELAY) in the search loop (where SEARCH_DELAY is
referenced) and delete the line, or if a deliberate placeholder is required,
replace it with a clear TODO comment referencing SEARCH_DELAY and the
surrounding search logic.
In `@boost_usage_tracker/tests/test_models.py`:
- Line 156: Several tests unpack tuple returns but leave some elements unused
(e.g., the call to services.get_or_create_missing_header_usage), which triggers
lint RUF059; update those unpackings to use `_` (or `_tmp`, `_created`) for
intentionally unused elements so intent is clear and lint clean. Specifically,
locate calls such as services.get_or_create_missing_header_usage and the other
occurrences referenced (around the noted lines) and change the tuple unpack
pattern from e.g. "usage, tmp, _" or "a, b, c" to use "_" for any value that
isn’t referenced later in the test; keep the variable name for the used
element(s) only and leave the rest as underscores.
In `@github_activity_tracker/services.py`:
- Line 79: Replace the list-concatenation passed to row.save with list
unpacking: instead of using update_fields + ["updated_at"] when calling the save
method on the row object, build the update_fields argument using unpacking so
the call to row.save(update_fields=...) uses the expanded original update_fields
plus "updated_at"; locate the save call on the variable named row
(row.save(...)) in github_activity_tracker/services.py and update the argument
accordingly.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (43)
- .env.example
- boost_usage_tracker/__init__.py
- boost_usage_tracker/admin.py
- boost_usage_tracker/apps.py
- boost_usage_tracker/boost_searcher.py
- boost_usage_tracker/db_from_file.py
- boost_usage_tracker/management/__init__.py
- boost_usage_tracker/management/commands/__init__.py
- boost_usage_tracker/management/commands/run_boost_usage_tracker.py
- boost_usage_tracker/management/commands/run_update_created_repos_by_language.py
- boost_usage_tracker/management/commands/run_update_db.py
- boost_usage_tracker/migrations/0001_initial.py
- boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py
- boost_usage_tracker/migrations/__init__.py
- boost_usage_tracker/models.py
- boost_usage_tracker/post_process.py
- boost_usage_tracker/repo_searcher.py
- boost_usage_tracker/services.py
- boost_usage_tracker/tests/__init__.py
- boost_usage_tracker/tests/fixtures.py
- boost_usage_tracker/tests/test_db_from_file.py
- boost_usage_tracker/tests/test_models.py
- boost_usage_tracker/tests/test_services.py
- boost_usage_tracker/tests/test_update_created_repos_by_language.py
- boost_usage_tracker/update_boostusage_from_csv.py
- boost_usage_tracker/update_created_repos_by_language.py
- boost_usage_tracker/update_git_account.py
- boost_usage_tracker/update_githubfile_from_csv.py
- boost_usage_tracker/update_repository_from_csv.py
- config/settings.py
- conftest.py
- discord_activity_tracker/migrations/0002_migrate_users_to_discord_profile.py
- docs/Contributing.md
- docs/Schema.md
- docs/Service_API.md
- docs/service_api/README.md
- docs/service_api/boost_usage_tracker.md
- github_activity_tracker/migrations/0002_createdreposbylanguage.py
- github_activity_tracker/models.py
- github_activity_tracker/services.py
- github_activity_tracker/tests/test_models.py
- github_activity_tracker/tests/test_services.py
- workflow/management/commands/run_all_collectors.py
Actionable comments posted: 2
♻️ Duplicate comments (4)
boost_usage_tracker/boost_searcher.py (1)
265-276: ⚠️ Potential issue | 🟠 Major

`_build_boost_include_query` lacks query-length validation, causing potential silent failures.

The function builds a query that can exceed GitHub's 255-character limit when multiple repos are provided but does not validate this. When the query exceeds the limit, the API call in `search_boost_include_files_batch` fails silently: the exception is caught (line 506) and an empty list is returned without indication that repos were skipped due to an invalid query.

Consider implementing query-length enforcement before sending to the API: either split repos into batches within `search_boost_include_files_batch` to ensure query length stays under 255 characters, or explicitly warn when truncation occurs.
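A minimal sketch of the batching approach, assuming the 255-character limit and the base query quoted in the AI prompt below; `build_batched_queries` is a hypothetical helper name, and any single `repo:` filter is assumed to fit within the limit on its own:

```python
# Sketch only: split repo filters across as many code-search queries as needed
# so each stays under the assumed 255-character limit.
MAX_CODE_SEARCH_QUERY_LEN = 255
BASE_QUERY = '"#include <boost/" language:C++'


def build_batched_queries(repo_full_names: list[str]) -> list[str]:
    queries: list[str] = []
    current = BASE_QUERY
    for name in repo_full_names:
        part = f" repo:{name}"
        if len(current) + len(part) > MAX_CODE_SEARCH_QUERY_LEN:
            # Flush the full query and start a fresh one for the remaining repos.
            queries.append(current)
            current = BASE_QUERY + part
        else:
            current += part
    if current != BASE_QUERY:
        queries.append(current)
    return queries
```

Callers would then issue one search request per returned query instead of a single over-long one.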
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/boost_searcher.py` around lines 265 - 276, The _build_boost_include_query can produce queries over GitHub's 255-character limit causing silent failures; update the logic so search_boost_include_files_batch never sends an over-long query: modify either _build_boost_include_query to accept a max_length and return one or multiple safe queries (or add batching in search_boost_include_files_batch using _chunked) by computing base_len = len('"#include <boost/" language:C++ ') and ensuring each appended " repo:NAME" segment keeps query length <255, splitting the repo_full_names into smaller chunks accordingly; also add a warning/log (processLogger or logger used in this module) whenever splitting occurs so callers know repos were batched.

boost_usage_tracker/services.py (1)
190-193: ⚠️ Potential issue | 🟠 Major

Re-detected missing-header usage is not reactivated.

For existing rows, Line 190-Line 193 refreshes `last_commit_date` but does not clear `excepted_at`, so a re-detected usage can remain incorrectly excepted.

Suggested fix
```diff
-    if not _ and last_commit_date and usage.last_commit_date != last_commit_date:
-        usage.last_commit_date = last_commit_date
-        usage.save(update_fields=["last_commit_date", "updated_at"])
+    if not _:
+        changed = False
+        if last_commit_date is not None and usage.last_commit_date != last_commit_date:
+            usage.last_commit_date = last_commit_date
+            changed = True
+        if usage.excepted_at is not None:
+            usage.excepted_at = None
+            changed = True
+        if changed:
+            usage.save(update_fields=["last_commit_date", "excepted_at", "updated_at"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/services.py` around lines 190 - 193, When refreshing an existing BoostMissingHeaderTmp usage's last_commit_date in the block that checks `if not _ and last_commit_date and usage.last_commit_date != last_commit_date`, also clear any previous exception by setting `usage.excepted_at = None` and include "excepted_at" in the `usage.save(update_fields=[...])` call so a re-detected missing-header is reactivated; update the `usage` object before saving and ensure `BoostMissingHeaderTmp`'s save updates both "last_commit_date" and "excepted_at".

boost_usage_tracker/update_created_repos_by_language.py (1)
58-65: ⚠️ Potential issue | 🟡 Minor

Keep error-path payload schema consistent with success-path schema.

Line 58 and Line 70 return dicts that omit keys returned elsewhere (`languages_requested`, `languages_processed`, `start_year`, `end_year`, `stars_min`, etc.). Reuse `_result_template(...)` for these early returns to keep a stable contract.

Suggested fix
```diff
     if start_year > end_year:
-        return {
-            "created": 0,
-            "updated": 0,
-            "rows_processed": 0,
-            "errors": [
-                f"Invalid year range: start_year({start_year}) > end_year({end_year})"
-            ],
-        }
+        result = _result_template(
+            start_year=start_year, end_year=end_year, stars_min=stars_min
+        )
+        result["errors"].append(
+            f"Invalid year range: start_year({start_year}) > end_year({end_year})"
+        )
+        return result
@@
     if not language_names:
-        return {
-            "created": 0,
-            "updated": 0,
-            "rows_processed": 0,
-            "errors": [
-                f"No languages provided. Set {LANGUAGES_ENV_KEY} in .env or pass --languages.",
-            ],
-        }
+        result = _result_template(
+            start_year=start_year, end_year=end_year, stars_min=stars_min
+        )
+        result["errors"].append(
+            f"No languages provided. Set {LANGUAGES_ENV_KEY} in .env or pass --languages."
+        )
+        return result
```

Also applies to: 70-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_created_repos_by_language.py` around lines 58 - 65, The early-return error payloads in update_created_repos_by_language.py return ad-hoc dicts that omit keys present in successful responses; replace those returns with the canonical template by calling _result_template(...) (passing start_year, end_year, languages_requested, languages_processed, stars_min, etc.) and then updating the returned dict's "created", "updated", "rows_processed" and "errors" fields (e.g., set created/updated/rows_processed to 0 and populate errors list). Do this for the invalid-year return in the function and the other early-return block around lines 70-77 so all branches return the same schema produced by _result_template.

boost_usage_tracker/update_git_account.py (1)
119-120: ⚠️ Potential issue | 🟡 Minor

Warning message is inaccurate for updated rows.

Line 120 logs `"Created record..."` even when the row was updated. Use neutral wording like `"Upserted record with empty username..."`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_git_account.py` around lines 119 - 120, The warning message incorrectly says "Created record..." even when the row was updated; change the log text to neutral wording like "Upserted record with empty username: %s" so it correctly reflects both creates and updates. Locate the logger.warning call that uses the variables username and owner (the line with if username == "": logger.warning(...)) and replace the message string only, keeping the same formatting and parameters (owner) and keeping the existing logger.warning invocation intact.
🧹 Nitpick comments (3)
boost_usage_tracker/update_repository_from_csv.py (1)
31-52: Consider extracting shared parsing helpers.

`_parse_int`, `_parse_bool`, and `_parse_datetime` are duplicated across `update_boostusage_from_csv.py`, `update_githubfile_from_csv.py`, and this file. A shared `csv_helpers.py` module could reduce duplication.
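A minimal sketch of what the shared module could look like, assuming the duplicated helpers share the usual optional-int/bool/ISO-timestamp semantics; the exact signatures in the PR are not reproduced in this review, so treat these as illustrative:

```python
# Hypothetical boost_usage_tracker/csv_helpers.py; semantics are assumed to
# match the three duplicated private helpers.
from datetime import datetime, timezone


def parse_int(value: str | None, default: int = 0) -> int:
    # Empty cells fall back to the default instead of raising.
    try:
        return int(value) if value not in (None, "") else default
    except ValueError:
        return default


def parse_bool(value: str | None) -> bool:
    return str(value).strip().lower() in {"1", "true", "yes"}


def parse_datetime(value: str | None) -> datetime | None:
    # Accepts ISO-8601 strings, including a trailing "Z"; naive values are
    # assumed to be UTC.
    if not value:
        return None
    try:
        dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
```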
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_repository_from_csv.py` around lines 31 - 52, Duplicate parsing helpers _parse_int, _parse_bool, and _parse_datetime are repeated across modules; extract them into a new shared module (e.g., csv_helpers.py) exporting these functions (remove or keep leading underscores as desired for public use), replace the local definitions in update_repository_from_csv.py, update_boostusage_from_csv.py, and update_githubfile_from_csv.py with imports from csv_helpers, and run tests/linters to ensure no unresolved references (search for _parse_int, _parse_bool, _parse_datetime to update all call sites).

boost_usage_tracker/boost_searcher.py (1)
243-252: Silent exception swallowing hides commit-date fetch failures.

The inner `try-except-pass` at Lines 251-252 silently discards errors when fetching the commit date via REST. While the file content is still returned, failures here (e.g., rate limits, auth issues) go unnoticed.

🔧 Add debug logging for visibility
```diff
         try:
             commits = client.rest_request(
                 f"/repos/{repo_full_name}/commits",
                 params={"path": file_path, "per_page": 1},
             )
             if isinstance(commits, list) and commits:
                 date_str = commits[0]["commit"]["committer"]["date"]
                 commit_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug("Failed to fetch commit date for %s/%s: %s", repo_full_name, file_path, e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/boost_searcher.py` around lines 243 - 252, The try/except around client.rest_request is swallowing errors silently; change the except block to catch Exception as e and log the failure (including repo_full_name and file_path) instead of passing so commit-date fetch failures are visible; keep the same behavior (don’t re-raise) but emit a debug/warning message via the module logger (or processLogger if used) mentioning client.rest_request, commits, and the exception to aid troubleshooting while preserving commit_date fallback logic.

boost_usage_tracker/update_created_repos_by_language.py (1)
30-36: Reuse one GitHub client per run instead of creating one per query.

Line 31 is on the inner hot path and instantiates a client for every count call. Passing a shared client through the loop will reduce overhead and simplify request-level diagnostics.
Also applies to: 123-124
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_created_repos_by_language.py` around lines 30 - 36, The helper _count_items_from_git currently creates a new GitHub client on each call which is expensive; refactor it to accept a client parameter (e.g., client: GitHubClient) or make it optional and only create one client at the outer scope, then pass that same client into every _count_items_from_git call from the main loop in update_created_repos_by_language (and the other call sites noted around the second use). Update the callers to instantiate get_github_client(use="scraping") once before the loop and pass it into _count_items_from_git (or rely on the provided client param) so the client is reused for all repository count queries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@boost_usage_tracker/db_from_file.py`:
- Around line 155-157: The single-file JSON branch calls
_load_json_records_from_path(path) without protection and can raise
json.JSONDecodeError or OSError; wrap that call in the same try/except pattern
used in directory mode so the function preserves its structured result contract
— catch json.JSONDecodeError and OSError (and any other errors directory mode
handles), log or capture the exception, and return the same error-shaped result
(instead of letting the exception propagate); i.e., replace the direct
assignment to records with a try: records = _load_json_records_from_path(path)
except <same exceptions as directory handling>: return the equivalent error
result used elsewhere in this module.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 84-88: Guard against non-dict records before attempting key
access: inside the loop that iterates over records (for rec in records) check if
rec is a dict (e.g., if not isinstance(rec, dict): continue or log and continue)
before evaluating '"owner" in rec' and before calling rec.get("owner"); this
prevents TypeError when records contains non-dict items and preserves the
existing owner extraction logic using the owner variable.
---
Duplicate comments:
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 265-276: The _build_boost_include_query can produce queries over
GitHub's 255-character limit causing silent failures; update the logic so
search_boost_include_files_batch never sends an over-long query: modify either
_build_boost_include_query to accept a max_length and return one or multiple
safe queries (or add batching in search_boost_include_files_batch using
_chunked) by computing base_len = len('"#include <boost/" language:C++ ') and
ensuring each appended " repo:NAME" segment keeps query length <255, splitting
the repo_full_names into smaller chunks accordingly; also add a warning/log
(processLogger or logger used in this module) whenever splitting occurs so
callers know repos were batched.
In `@boost_usage_tracker/services.py`:
- Around line 190-193: When refreshing an existing BoostMissingHeaderTmp usage's
last_commit_date in the block that checks `if not _ and last_commit_date and
usage.last_commit_date != last_commit_date`, also clear any previous exception
by setting `usage.excepted_at = None` and include "excepted_at" in the
`usage.save(update_fields=[...])` call so a re-detected missing-header is
reactivated; update the `usage` object before saving and ensure
`BoostMissingHeaderTmp`'s save updates both "last_commit_date" and
"excepted_at".
In `@boost_usage_tracker/update_created_repos_by_language.py`:
- Around line 58-65: The early-return error payloads in
update_created_repos_by_language.py return ad-hoc dicts that omit keys present
in successful responses; replace those returns with the canonical template by
calling _result_template(...) (passing start_year, end_year,
languages_requested, languages_processed, stars_min, etc.) and then updating the
returned dict's "created", "updated", "rows_processed" and "errors" fields
(e.g., set created/updated/rows_processed to 0 and populate errors list). Do
this for the invalid-year return in the function and the other early-return
block around lines 70-77 so all branches return the same schema produced by
_result_template.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 119-120: The warning message incorrectly says "Created record..."
even when the row was updated; change the log text to neutral wording like
"Upserted record with empty username: %s" so it correctly reflects both creates
and updates. Locate the logger.warning call that uses the variables username and
owner (the line with if username == "": logger.warning(...)) and replace the
message string only, keeping the same formatting and parameters (owner) and
keeping the existing logger.warning invocation intact.
---
Nitpick comments:
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 243-252: The try/except around client.rest_request is swallowing
errors silently; change the except block to catch Exception as e and log the
failure (including repo_full_name and file_path) instead of passing so
commit-date fetch failures are visible; keep the same behavior (don’t re-raise)
but emit a debug/warning message via the module logger (or processLogger if
used) mentioning client.rest_request, commits, and the exception to aid
troubleshooting while preserving commit_date fallback logic.
In `@boost_usage_tracker/update_created_repos_by_language.py`:
- Around line 30-36: The helper _count_items_from_git currently creates a new
GitHub client on each call which is expensive; refactor it to accept a client
parameter (e.g., client: GitHubClient) or make it optional and only create one
client at the outer scope, then pass that same client into every
_count_items_from_git call from the main loop in
update_created_repos_by_language (and the other call sites noted around the
second use). Update the callers to instantiate get_github_client(use="scraping")
once before the loop and pass it into _count_items_from_git (or rely on the
provided client param) so the client is reused for all repository count queries.
In `@boost_usage_tracker/update_repository_from_csv.py`:
- Around line 31-52: Duplicate parsing helpers _parse_int, _parse_bool, and
_parse_datetime are repeated across modules; extract them into a new shared
module (e.g., csv_helpers.py) exporting these functions (remove or keep leading
underscores as desired for public use), replace the local definitions in
update_repository_from_csv.py, update_boostusage_from_csv.py, and
update_githubfile_from_csv.py with imports from csv_helpers, and run
tests/linters to ensure no unresolved references (search for _parse_int,
_parse_bool, _parse_datetime to update all call sites).
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- boost_usage_tracker/boost_searcher.py
- boost_usage_tracker/db_from_file.py
- boost_usage_tracker/management/commands/run_boost_usage_tracker.py
- boost_usage_tracker/management/commands/run_update_created_repos_by_language.py
- boost_usage_tracker/management/commands/run_update_db.py
- boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py
- boost_usage_tracker/models.py
- boost_usage_tracker/repo_searcher.py
- boost_usage_tracker/services.py
- boost_usage_tracker/tests/test_services.py
- boost_usage_tracker/tests/test_update_created_repos_by_language.py
- boost_usage_tracker/update_boostusage_from_csv.py
- boost_usage_tracker/update_created_repos_by_language.py
- boost_usage_tracker/update_git_account.py
- boost_usage_tracker/update_repository_from_csv.py
- config/settings.py
- github_activity_tracker/models.py
🚧 Files skipped from review as they are similar to previous changes (2)
- config/settings.py
- boost_usage_tracker/tests/test_services.py
@coderabbitai review
✅ Actions performed

Review triggered.
♻️ Duplicate comments (2)
boost_usage_tracker/boost_searcher.py (1)
265-271: ⚠️ Potential issue | 🟡 Minor

Query length limit (`MAX_CODE_SEARCH_QUERY_LEN`) is defined but not enforced.

`_build_boost_include_query` constructs queries without checking `MAX_CODE_SEARCH_QUERY_LEN`. If the combined `repo:` filters exceed 255 characters, the GitHub API may reject the query. Consider truncating the repo list to fit within the limit and returning remaining repos for subsequent batches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/boost_searcher.py` around lines 265 - 271, _build_boost_include_query currently concatenates all repo filters into a single query without enforcing MAX_CODE_SEARCH_QUERY_LEN; change it to respect the MAX_CODE_SEARCH_QUERY_LEN constant by computing allowed space after base = '"#include <boost/" language:C++ ' and iterating through parts = [f"repo:{name}"...] appending parts until adding the next part would exceed MAX_CODE_SEARCH_QUERY_LEN, then return the built query and the list of remaining repo names for batching; update the function to use the same return signature (query, remaining_repos) and ensure callers handle remaining_repos for subsequent queries.

boost_usage_tracker/services.py (1)
184-197: ⚠️ Potential issue | 🟠 Major

Missing-header usage re-detection doesn't clear `excepted_at`.

When a missing-header usage is re-detected, Lines 190-192 update `last_commit_date` but don't clear `excepted_at`. This inconsistency with `create_or_update_boost_usage` (Lines 147-151) means re-detected missing-header usages remain incorrectly marked as excepted.

🔧 Proposed fix
```diff
-    if not _ and last_commit_date and usage.last_commit_date != last_commit_date:
-        usage.last_commit_date = last_commit_date
-        usage.save(update_fields=["last_commit_date", "updated_at"])
+    if not _:
+        changed = False
+        if last_commit_date and usage.last_commit_date != last_commit_date:
+            usage.last_commit_date = last_commit_date
+            changed = True
+        if usage.excepted_at is not None:
+            usage.excepted_at = None
+            changed = True
+        if changed:
+            usage.save(update_fields=["last_commit_date", "excepted_at", "updated_at"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/services.py` around lines 184 - 197, The re-detection path that updates last_commit_date for an existing BoostUsage (the block using BoostUsage.objects.get_or_create and checking "if not _ and last_commit_date and usage.last_commit_date != last_commit_date") must also clear the excepted_at timestamp so re-detected missing-header usages are no longer marked excepted; modify that branch in services.py to set usage.excepted_at = None and include "excepted_at" in the usage.save(update_fields=[...]) call (in addition to "last_commit_date" and "updated_at") so the change is persisted, mirroring the behavior in create_or_update_boost_usage.
🧹 Nitpick comments (5)
boost_usage_tracker/update_git_account.py (1)
26-31: Consider extracting shared JSON loading utilities to reduce duplication.

These helper functions (`get_github_account_dir`, `_load_json_records_from_path`, `_load_all_json_records`) are nearly identical to those in `boost_usage_tracker/db_from_file.py` (lines 30-63). Consider consolidating them into a shared utility module to improve maintainability.

Also applies to: 33-59
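A minimal sketch of the consolidation, assuming both call sites want list-of-record-dict semantics for either a single `.json` file or a directory of them; `load_json_records` and the module name are illustrative, not existing code:

```python
# Hypothetical shared helper (e.g., boost_usage_tracker/json_io.py); the
# behavior is an assumed common denominator of the two near-identical loaders.
import json
from pathlib import Path


def load_json_records(path: Path) -> list[dict]:
    """Load records from one JSON file, or from every *.json file in a directory."""
    if path.is_dir():
        records: list[dict] = []
        for child in sorted(path.glob("*.json")):
            records.extend(load_json_records(child))
        return records
    data = json.loads(path.read_text(encoding="utf-8"))
    # Accept either a top-level list of records or a single record object.
    return data if isinstance(data, list) else [data]
```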
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_git_account.py` around lines 26 - 31, Duplicate JSON/dir helper logic in get_github_account_dir, _load_json_records_from_path, and _load_all_json_records should be factored into a shared utility module: extract the implementations into a new helper (e.g., json_io or file_utils) that exposes functions like ensure_workspace_subdir(path_name) and load_json_records(path) and then update get_github_account_dir to call the shared ensure_workspace_subdir and replace local _load_json_records_from_path/_load_all_json_records usages with imports from the new module (adjust signatures to accept the same inputs used by get_github_account_dir and any callers).

boost_usage_tracker/boost_searcher.py (1)
251-252: Log commit date fetch failures instead of silently passing.

The `try-except-pass` block silently swallows errors when fetching commit dates. Consider logging at debug level for observability.

♻️ Proposed fix
```diff
         try:
             commits = client.rest_request(
                 f"/repos/{repo_full_name}/commits",
                 params={"path": file_path, "per_page": 1},
             )
             if isinstance(commits, list) and commits:
                 date_str = commits[0]["commit"]["committer"]["date"]
                 commit_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug("Failed to fetch commit date for %s/%s: %s", repo_full_name, file_path, e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/boost_searcher.py` around lines 251 - 252, Replace the silent "except Exception: pass" that swallows errors when fetching commit dates with an explicit debug-level log: catch the exception as e (e.g., "except Exception as e") and call the module or instance logger (e.g., logger.debug or self.logger.debug) with a clear message including identifying context (commit id/repo/file) and the exception (e.g., "Failed to fetch commit date for %s: %s", <identifier>, e) so failures are observable while preserving existing flow.

boost_usage_tracker/services.py (1)
116-116: Consider using unpacking for list construction.

Per Ruff RUF005, `[*update_fields, "updated_at"]` is preferred over concatenation.

```diff
 if update_fields:
-    ext_repo.save(update_fields=update_fields + ["updated_at"])
+    ext_repo.save(update_fields=[*update_fields, "updated_at"])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/services.py` at line 116, Replace the list concatenation passed to ext_repo.save with list unpacking per Ruff RUF005: when calling ext_repo.save(...), construct the fields argument using [*update_fields, "updated_at"] instead of update_fields + ["updated_at"]; update the call site (ext_repo.save) where update_fields is used to pass the unpacked list.

boost_usage_tracker/update_created_repos_by_language.py (1)
30-36: Client is instantiated on every API call.

`_count_items_from_git` creates a new `GitHubAPIClient` on each invocation. Since the main loop calls this twice per language/year combination (Lines 123-124), this creates many redundant client instances.

♻️ Proposed refactor
```diff
-def _count_items_from_git(query: str) -> int:
-    client = get_github_client(use="scraping")
+def _count_items_from_git(client, query: str) -> int:
     data = client.rest_request(
         "/search/repositories",
         params={"q": query, "per_page": 1},
     )
     return int(data.get("total_count", 0))
```

Then in the main function, create the client once:
```diff
+    client = get_github_client(use="scraping")
     for language_name in ordered_language_names:
         ...
         for year in range(start_year, end_year + 1):
             ...
             try:
-                all_repos = _count_items_from_git(q_all)
-                significant_repos = _count_items_from_git(q_sig)
+                all_repos = _count_items_from_git(client, q_all)
+                significant_repos = _count_items_from_git(client, q_sig)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/update_created_repos_by_language.py` around lines 30 - 36, _count_items_from_git currently calls get_github_client inside the function, causing a new GitHubAPIClient to be created on every invocation; change this by creating the client once and reusing it: modify _count_items_from_git to accept a client parameter (e.g., client: GitHubAPIClient) and remove the internal get_github_client call, then in the main loop that iterates languages/years (the caller that invokes _count_items_from_git twice per combo) instantiate client = get_github_client(use="scraping") once and pass that client into both calls; alternatively, create a module-level reusable client via get_github_client and have _count_items_from_git use that, ensuring only one client is created for all queries.

boost_usage_tracker/management/commands/run_boost_usage_tracker.py (1)
108-110: Use `logger.exception` for rate limit/connection errors.

`logger.error` at Line 109 loses the stack trace. Using `logger.exception` provides better debugging context when these errors occur.

```diff
 except (ConnectionException, RateLimitException) as e:
-    logger.error("Rate limit / connection error during batch search: %s", e)
+    logger.exception("Rate limit / connection error during batch search")
     raise
```

The same applies to Lines 153-155.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py` around lines 108 - 110, Replace the logger.error calls in the exception handlers that catch ConnectionException and RateLimitException with logger.exception so the stack trace is preserved; specifically, in the except block matching "except (ConnectionException, RateLimitException) as e" (the block that currently calls logger.error("Rate limit / connection error during batch search: %s", e) and re-raises) and the similar except block later around lines 153-155, change the call to logger.exception with the same message and continue re-raising the exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 265-271: _build_boost_include_query currently concatenates all
repo filters into a single query without enforcing MAX_CODE_SEARCH_QUERY_LEN;
change it to respect the MAX_CODE_SEARCH_QUERY_LEN constant by computing allowed
space after base = '"#include <boost/" language:C++ ' and iterating through
parts = [f"repo:{name}"...] appending parts until adding the next part would
exceed MAX_CODE_SEARCH_QUERY_LEN, then return the built query and the list of
remaining repo names for batching; update the function to use the same return
signature (query, remaining_repos) and ensure callers handle remaining_repos for
subsequent queries.
In `@boost_usage_tracker/services.py`:
- Around line 184-197: The re-detection path that updates last_commit_date for
an existing BoostUsage (the block using BoostUsage.objects.get_or_create and
checking "if not _ and last_commit_date and usage.last_commit_date !=
last_commit_date") must also clear the excepted_at timestamp so re-detected
missing-header usages are no longer marked excepted; modify that branch in
services.py to set usage.excepted_at = None and include "excepted_at" in the
usage.save(update_fields=[...]) call (in addition to "last_commit_date" and
"updated_at") so the change is persisted, mirroring the behavior in
create_or_update_boost_usage.
---
Nitpick comments:
In `@boost_usage_tracker/boost_searcher.py`:
- Around line 251-252: Replace the silent "except Exception: pass" that swallows
errors when fetching commit dates with an explicit debug-level log: catch the
exception as e (e.g., "except Exception as e") and call the module or instance
logger (e.g., logger.debug or self.logger.debug) with a clear message including
identifying context (commit id/repo/file) and the exception (e.g., "Failed to
fetch commit date for %s: %s", <identifier>, e) so failures are observable while
preserving existing flow.
In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py`:
- Around line 108-110: Replace the logger.error calls in the exception handlers
that catch ConnectionException and RateLimitException with logger.exception so
the stack trace is preserved; specifically, in the except block matching "except
(ConnectionException, RateLimitException) as e" (the block that currently calls
logger.error("Rate limit / connection error during batch search: %s", e) and
re-raises) and the similar except block later around lines 153-155, change the
call to logger.exception with the same message and continue re-raising the
exception.
In `@boost_usage_tracker/services.py`:
- Line 116: Replace the list concatenation passed to ext_repo.save with list
unpacking per Ruff RUF005: when calling ext_repo.save(...), construct the fields
argument using [*update_fields, "updated_at"] instead of update_fields +
["updated_at"]; update the call site (ext_repo.save) where update_fields is used
to pass the unpacked list.
In `@boost_usage_tracker/update_created_repos_by_language.py`:
- Around line 30-36: _count_items_from_git currently calls get_github_client
inside the function, causing a new GitHubAPIClient to be created on every
invocation; change this by creating the client once and reusing it: modify
_count_items_from_git to accept a client parameter (e.g., client:
GitHubAPIClient) and remove the internal get_github_client call, then in the
main loop that iterates languages/years (the caller that invokes
_count_items_from_git twice per combo) instantiate client =
get_github_client(use="scraping") once and pass that client into both calls;
alternatively, create a module-level reusable client via get_github_client and
have _count_items_from_git use that, ensuring only one client is created for all
queries. A sketch of the client-reuse version appears after this list.
In `@boost_usage_tracker/update_git_account.py`:
- Around line 26-31: Duplicate JSON/dir helper logic in get_github_account_dir,
_load_json_records_from_path, and _load_all_json_records should be factored into
a shared utility module: extract the implementations into a new helper (e.g.,
json_io or file_utils) that exposes functions like
ensure_workspace_subdir(path_name) and load_json_records(path) and then update
get_github_account_dir to call the shared ensure_workspace_subdir and replace
local _load_json_records_from_path/_load_all_json_records usages with imports
from the new module (adjust signatures to accept the same inputs used by
get_github_account_dir and any callers). A sketch of such a shared module follows this list.
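
As rough illustrations of the nitpick items above, first the debug-logged best-effort fetch from boost_searcher.py; `_commit_date_or_none` and the client method it calls are hypothetical stand-ins for the real code.

```python
import logging

logger = logging.getLogger(__name__)


def _commit_date_or_none(client, repo_name: str, path: str):
    """Best-effort commit-date lookup that logs failures instead of hiding them."""
    try:
        return client.get_last_commit_date(repo_name, path)  # hypothetical call
    except Exception as e:  # noqa: BLE001 - deliberately broad, best-effort
        logger.debug("Failed to fetch commit date for %s:%s: %s", repo_name, path, e)
        return None
```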
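Next, the client-reuse refactor for update_created_repos_by_language.py, assuming `get_github_client(use="scraping")` (the PR's factory) returns a reusable client; the `search_repository_count` method name and the query strings are illustrative.

```python
def _count_items_from_git(client, query: str) -> int:
    # The client is injected; no per-call get_github_client() construction.
    return client.search_repository_count(query)  # hypothetical method name


def update_created_repos_by_language(languages, years):
    client = get_github_client(use="scraping")  # one client for every query
    for language in languages:
        for year in years:
            base_q = f"language:{language} created:{year}-01-01..{year}-12-31"
            total = _count_items_from_git(client, base_q)
            starred = _count_items_from_git(client, base_q + " stars:>10")
            # ... persist total/starred for (language, year) ...
```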
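Finally, one possible shape for the shared helper module suggested in the update_git_account.py item; the module name `file_utils` and these signatures are assumptions drawn from the prompt's suggestions.

```python
# file_utils.py - candidate shared home for the duplicated JSON/dir helpers
import json
from pathlib import Path


def ensure_workspace_subdir(root: Path, path_name: str) -> Path:
    """Create (if needed) and return a subdirectory under the workspace root."""
    subdir = root / path_name
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir


def load_json_records(path: Path) -> list[dict]:
    """Load a list of JSON records from one file, tolerating a missing file."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return data if isinstance(data, list) else [data]


def load_all_json_records(directory: Path) -> list[dict]:
    """Concatenate records from every *.json file in a directory."""
    records: list[dict] = []
    for file in sorted(directory.glob("*.json")):
        records.extend(load_json_records(file))
    return records
```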
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- boost_usage_tracker/boost_searcher.py
- boost_usage_tracker/db_from_file.py
- boost_usage_tracker/management/commands/run_boost_usage_tracker.py
- boost_usage_tracker/management/commands/run_update_created_repos_by_language.py
- boost_usage_tracker/management/commands/run_update_db.py
- boost_usage_tracker/migrations/0002_add_boost_missing_header_tmp_and_nullable_boost_header.py
- boost_usage_tracker/models.py
- boost_usage_tracker/repo_searcher.py
- boost_usage_tracker/services.py
- boost_usage_tracker/tests/test_services.py
- boost_usage_tracker/tests/test_update_created_repos_by_language.py
- boost_usage_tracker/update_boostusage_from_csv.py
- boost_usage_tracker/update_created_repos_by_language.py
- boost_usage_tracker/update_git_account.py
- boost_usage_tracker/update_repository_from_csv.py
- config/settings.py
- github_activity_tracker/models.py
✅ Files skipped from review due to trivial changes (1)
- boost_usage_tracker/tests/test_services.py
🚧 Files skipped from review as they are similar to previous changes (2)
- config/settings.py
- boost_usage_tracker/update_repository_from_csv.py
@coderabbitai resume
✅ Actions performed

Reviews resumed.
🧹 Nitpick comments (3)
boost_usage_tracker/tests/test_update_created_repos_by_language.py (1)
32-35: Consider prefixing the unused `client` parameter with an underscore to satisfy the linter.

The static analysis tool flags unused `client` arguments. Using `_client` or `_` makes the intent explicit.

🔧 Proposed fix
```diff
-    def fake_count(client, query: str) -> int:
+    def fake_count(_client, query: str) -> int:
         if "stars:>10" in query:
             return 12
         return 120
```

And for the lambda on line 69:

```diff
-    side_effect=lambda client, q: 130 if "stars:>10" not in q else 13,
+    side_effect=lambda _client, q: 130 if "stars:>10" not in q else 13,
```

Also applies to: 69-69
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_usage_tracker/tests/test_update_created_repos_by_language.py` around lines 32-35: rename the unused function parameter and lambda parameter to indicate intentionally unused variables: change fake_count's signature from def fake_count(client, query: str) to def fake_count(_client, query: str) (or use _), and update the anonymous lambda used later (the lambda accepting a client) to use _client or _ as its parameter name; this satisfies the linter while preserving behavior.

boost_usage_tracker/services.py (1)
228-234: Consider filtering existing usages by the keys being processed.

Lines 228-234 load all `BoostUsage` records for the repo with a non-null `boost_header`, but only a subset matching `key_to_item` keys is processed. For repositories with many historical usages, this could load significantly more data than needed.

♻️ Proposed optimization
```diff
     # Build map (boost_header_id, file_path_id) -> usage for existing rows
+    # Filter to only the keys we care about
+    bh_ids = {k[0] for k in key_to_item.keys()}
+    fp_ids = {k[1] for k in key_to_item.keys()}
     existing_map = {
         (u.boost_header_id, u.file_path_id): u
         for u in BoostUsage.objects.filter(
             repo=repo,
             boost_header__isnull=False,
+            boost_header_id__in=bh_ids,
+            file_path_id__in=fp_ids,
         ).select_related("boost_header", "file_path")
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@boost_usage_tracker/services.py` around lines 228-234: the current creation of existing_map loads all BoostUsage for the repo with a non-null boost_header; restrict the query to only the (boost_header_id, file_path_id) pairs you are about to process (the keys in key_to_item) to avoid loading unrelated historical usages. Build the set/list of key pairs from key_to_item and use that to filter the BoostUsage queryset (e.g., construct a combined Q for each (boost_header_id, file_path_id) pair or an equivalent filter) while keeping select_related("boost_header", "file_path"), then build existing_map from that filtered queryset.

boost_usage_tracker/management/commands/run_boost_usage_tracker.py (1)
285-292: Consider updating `updated_at` when bulk-updating star counts.

The bulk update for changed star counts only updates the `stars` field. If `GitHubRepository` has an `updated_at` field that should reflect modification time, it won't be updated here.

♻️ Proposed fix
```diff
 if stars_to_update and not dry_run:
     repos_to_refresh = list(
         GitHubRepository.objects.filter(pk__in=stars_to_update.keys())
     )
+    from django.utils import timezone
+    now = timezone.now()
     for repo_obj in repos_to_refresh:
         repo_obj.stars = stars_to_update[repo_obj.pk]
-    GitHubRepository.objects.bulk_update(repos_to_refresh, ["stars"])
+        repo_obj.updated_at = now
+    GitHubRepository.objects.bulk_update(repos_to_refresh, ["stars", "updated_at"])
     logger.info("Bulk-updated stars for %d repos", len(repos_to_refresh))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py` around lines 285 - 292, The bulk update only writes the stars field so GitHubRepository.updated_at isn't changed; when preparing repos_to_refresh (from stars_to_update and GitHubRepository.objects.filter(...)) set each repo_obj.updated_at = timezone.now() (or the appropriate timestamp) and include "updated_at" in the fields list passed to GitHubRepository.objects.bulk_update(repos_to_refresh, ["stars", "updated_at"]) so the modification time is persisted; ensure you import django.utils.timezone if needed and only set updated_at when not a dry_run.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@boost_usage_tracker/management/commands/run_boost_usage_tracker.py`:
- Around line 285-292: The bulk update only writes the stars field so
GitHubRepository.updated_at isn't changed; when preparing repos_to_refresh (from
stars_to_update and GitHubRepository.objects.filter(...)) set each
repo_obj.updated_at = timezone.now() (or the appropriate timestamp) and include
"updated_at" in the fields list passed to
GitHubRepository.objects.bulk_update(repos_to_refresh, ["stars", "updated_at"])
so the modification time is persisted; ensure you import django.utils.timezone
if needed and only set updated_at when not a dry_run.
In `@boost_usage_tracker/services.py`:
- Around line 228-234: The current creation of existing_map in
boost_usage_tracker/services.py loads all BoostUsage for the repo with a
non-null boost_header; restrict the query to only the (boost_header_id,
file_path_id) pairs you are about to process (the keys in key_to_item) to avoid
loading unrelated historical usages. Build the set/list of key pairs from
key_to_item and use that to filter the BoostUsage queryset (e.g., construct a
combined Q for each (boost_header_id, file_path_id) pair or an equivalent
filter) while keeping select_related("boost_header", "file_path"), then build
existing_map from that filtered queryset.
In `@boost_usage_tracker/tests/test_update_created_repos_by_language.py`:
- Around line 32-35: Rename the unused function parameter and lambda parameter
to indicate intentionally unused variables: change fake_count's signature from
def fake_count(client, query: str) to def fake_count(_client, query: str) (or
use _), and update the anonymous lambda used later (the lambda accepting a
client) to use _client or _ as its parameter name; this satisfies the linter
while preserving behavior.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- boost_usage_tracker/boost_searcher.py
- boost_usage_tracker/management/commands/run_boost_usage_tracker.py
- boost_usage_tracker/services.py
- boost_usage_tracker/tests/test_update_created_repos_by_language.py
- boost_usage_tracker/update_created_repos_by_language.py
Summary by CodeRabbit
- New Features
- Admin
- Models / Migrations
- Services
- Tests
- Documentation