feat: cloud pricing, org hierarchy, GitHub ETL, extensions#15
jeremyeder merged 5 commits into main
Conversation
Adds scripts/fetch_pricing.py which fetches from public AWS pricing bulk files (no auth) and hardcoded Anthropic published rates. Includes discount factor support and ROSA cluster cost estimation.

Also adds:
- Multi-DB attachment pattern (pricing.db, extensible)
- Extension loader for local-only plugins
- Pricing MCP tools (cloud_pricing_lookup, rosa_cluster_costs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hain) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add check-no-secrets.sh: scans staged files for API keys, tokens, passwords, private keys, and bearer tokens
- Add check-protected-files.sh: blocks changes to governance/ paths (configurable via PROTECTED_PATHS env var, bypass with ALLOW_PROTECTED=1)
- Add refresh_catalog.py: scans data/, renames files to date-prefix convention, maintains DATA_CATALOG.yaml, and optionally fetches from Jira and web URLs via --refresh flag
- Update .pre-commit-config.yaml to use external hook scripts
- Update CLAUDE.md with refresh_catalog.py command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
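The commits above mention an extension loader for local-only plugins but don't show it. A minimal sketch of how such a loader could work — the `extensions/` directory name comes from the PR, while the `register(server)` hook, the function name, and the return value are assumptions for illustration, not the PR's actual API:

```python
# Hypothetical extension loader: import every *.py under extensions/ at
# startup and invoke an optional register(server) hook in each module.
# The hook name and signature are assumptions, not the PR's real interface.
import importlib.util
from pathlib import Path


def load_extensions(server, ext_dir: str = "extensions") -> list[str]:
    loaded = []
    for path in sorted(Path(ext_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Only call into modules that opt in by defining register()
        if hasattr(module, "register"):
            module.register(server)
            loaded.append(path.stem)
    return loaded
```

Because loading is keyed on file presence and an opt-in hook, plugins stay local-only: an empty or absent `extensions/` directory is simply a no-op.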
Caution: Review failed. Pull request was closed or merged during review; no actionable comments were generated in the recent review.
Walkthrough

This pull request adds large data-fetching scripts (GitHub and cloud pricing), a catalog refresh utility, MCP server extensions exposing org, pricing, and GitHub query tools, pre-commit hook scripts to detect secrets and protect files, startup extension loading, and test script checks for optional DBs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant MCP as MCP Server
    participant PDB as pricing.db
    participant GDB as github.db
    participant Ext as extensions/*.py
    Client->>MCP: request (e.g., cloud_pricing_lookup / github query / gps://pricing-schema)
    MCP-->>PDB: SELECT schema/queries (if attached)
    MCP-->>GDB: SELECT schema/queries (if attached)
    MCP->>Ext: Invoke extension hooks (if loaded)
    MCP-->>Client: JSON / DDL / error response
```
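The diagram's conditional `pricing.db` / `github.db` lookups reflect the multi-DB attachment pattern described in the walkthrough, which in SQLite is done with `ATTACH DATABASE`. A minimal sketch of that pattern — the helper name and the opt-in dict are illustrative, not the PR's actual code; only the database file names come from the PR:

```python
# Sketch of the multi-DB attachment pattern: optional databases are
# ATTACHed to one connection only if their files exist, so a single
# query can address tables as pricing.<table> or github.<table>.
import sqlite3
from pathlib import Path


def connect_with_optional_dbs(main_db: str, optional: dict[str, str]) -> sqlite3.Connection:
    conn = sqlite3.connect(main_db)
    for alias, path in optional.items():
        if Path(path).exists():  # skip absent optional DBs instead of failing
            # alias comes from a trusted, hardcoded dict, so string
            # concatenation here is safe; the file path is parameterized
            conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
    return conn
```

A query can then join across attached schemas, e.g. `conn.execute("SELECT * FROM pricing.sqlite_master")`, while a missing `github.db` simply leaves that alias unattached.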
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
```python
# Handle collision: if target already exists, skip
new_path = DATA_DIR / new_name
if new_path.exists() and new_path != filepath:
    print(f" SKIP (target exists): {old_name} -> {new_name}")
    renames[old_name] = new_name

...

if dry_run:
    print(f" Would rename: {old_name} -> {new_name}")
else:
    filepath.rename(new_path)
    print(f" Renamed: {old_name} -> {new_name}")
```
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mcp_server.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor. CI failure: Ruff format check failed.
The pipeline indicates this file needs reformatting. Run:

```sh
uv run --extra dev ruff format mcp_server.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mcp_server.py` at line 1, The file contains a formatting mismatch (starts with "#!/usr/bin/env python3"); run the ruff formatter to fix style issues by executing the suggested command (uv run --extra dev ruff format mcp_server.py) or apply equivalent ruff formatting so the file passes the Ruff format check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mcp_server.py`:
- Around line 946-971: The repo search currently issues an unbounded query and
then does an N+1 per-repo language lookup (see the SELECT into repos and the
loop that queries github.gh_repo_language), so modify the function to
accept/validate a limit parameter and add a LIMIT clause to the main repo SQL,
and replace the per-repo language query with a single batched query: collect
repo_id values from the fetched repos, run one SELECT repo_id, language, bytes
FROM github.gh_repo_language WHERE repo_id IN (...) ORDER BY bytes DESC, then
build a langs_by_repo map and assign r["languages"] =
langs_by_repo.get(r["repo_id"], []) for each repo (use the same
params/placeholder style and _rows_to_dicts-like formatting for consistency).
In `@scripts/check-no-secrets.sh`:
- Around line 8-12: The script scripts/check-no-secrets.sh currently greps every
file from the for-loop (for f in "$@") and can produce false positives for known
safe files; add an exclusion check before running grep to skip filenames or
paths such as .env.example, .pre-commit-config.yaml, .gitignore and doc files
(e.g., *.md, *.mdx) so the grep is only run on relevant source/config files, or
alternatively re-introduce the exclude pattern into .pre-commit-config.yaml;
implement the skip by matching $f against an array or case pattern and continue
the loop if matched (preserving the existing echo " $f" and found variable
behavior).
In `@scripts/fetch_github.py`:
- Around line 1-5: Run ruff format on the script to fix formatting issues
flagged by CI: execute "ruff format scripts/fetch_github.py" (or run your
project's formatter) to reformat the file header and body so it passes "ruff
format --check"; after formatting, verify the shebang and comment block remain
intact and commit the updated scripts/fetch_github.py.
- Around line 122-124: The 5xx detection loop is inefficient; replace the
any(str(c) in stderr for c in range(500, 600)) check with a direct pattern or
numeric extraction from stderr (e.g., use a regex like '\b5\d{2}\b' to detect
any 5xx status) or parse an HTTP/status code variable if available, then keep
the same backoff logic using attempt and time.sleep; update the conditional in
the retry block that references stderr and attempt so it reliably matches 5xx
responses without iterating the whole range.
- Around line 186-189: The current fragile concat (in the block checking `if
paginate and body.startswith("["):`) naively replaces "][` and `]\n[` which can
corrupt JSON when those sequences appear in string values; instead, split `body`
into separate JSON array tokens and parse each safely (e.g., locate balanced
brackets or split on `]\n[`/`][` then wrap/ensure valid arrays), json.loads each
array and extend a single result list, or switch to invoking GH with `--jq` to
produce newline-delimited objects and parse line-by-line—update the code that
mutates `body` to implement one of these safe merges using `json.loads` rather
than simple string replace.
In `@scripts/fetch_pricing.py`:
- Around line 1-7: CI failed ruff format check for the Python script starting
with "#!/usr/bin/env python3"; run the code formatter (ruff format) on that
script to apply the style fixes, re-check with `ruff format --check`, and commit
the formatted file so the pipeline passes.
- Around line 66-75: The CLAUDE_PRICING constant is hardcoded without a
verification date; update the declaration for CLAUDE_PRICING to include a
comment above it stating the date (e.g., "verified YYYY-MM-DD") and optionally
the Anthropic pricing URL or source; ensure the comment is placed immediately
above the CLAUDE_PRICING list so future maintainers can see when the values were
last confirmed.
In `@scripts/refresh_catalog.py`:
- Around line 1-5: CI is failing ruff format --check for the script; run the
formatter (ruff format) on the script (scripts/refresh_catalog.py) to fix
formatting issues, stage and commit the updated file so the formatted
shebang/comments and any minor whitespace changes pass the ruff check.
- Around line 127-138: get_birth_date currently uses macOS-only `stat -f %SB`
and should be made cross-platform: update get_birth_date to first try native
Python file metadata (Path(filepath).stat().st_birthtime when present), then if
that isn't available attempt GNU stat (`stat --format=%w <file>`) via
subprocess.run (check output is not "-" and parse YYYY-MM-DD), and only then
fall back to datetime.now().strftime("%Y-%m-%d"); use sys.platform or
shutil.which("stat") to decide behavior and adjust exception handling around
subprocess.run to return the fallback on error. Ensure you reference the
existing get_birth_date function and subprocess.run usage when making changes.
---
Outside diff comments:
In `@mcp_server.py`:
- Line 1: The file contains a formatting mismatch (starts with "#!/usr/bin/env
python3"); run the ruff formatter to fix style issues by executing the suggested
command (uv run --extra dev ruff format mcp_server.py) or apply equivalent ruff
formatting so the file passes the Ruff format check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 496ed0af-6e7a-4ebb-8d81-be5c016a7af3
📒 Files selected for processing (10)
- .claude/CLAUDE.md
- .gitignore
- .pre-commit-config.yaml
- mcp_server.py
- scripts/check-no-secrets.sh
- scripts/check-protected-files.sh
- scripts/fetch_github.py
- scripts/fetch_pricing.py
- scripts/refresh_catalog.py
- scripts/test.sh
```sh
for f in "$@"; do
  if grep -qE '(API_TOKEN|API_SECRET|API_KEY|PASSWORD|PRIVATE_KEY|Bearer |client_secret|-----BEGIN .* KEY-----)' "$f" 2>/dev/null; then
    echo "  $f"
    found=1
  fi
```
🧹 Nitpick | 🔵 Trivial
Missing exclusion logic may cause false positives.
The previous hook excluded files like .env.example, .pre-commit-config.yaml, and .gitignore. This script scans all text files without exclusions, so commits touching documentation that mentions "API_KEY" or "PASSWORD" conceptually will be blocked.
Consider either:
- Adding an exclusion list within the script, or
- Re-adding the `exclude` pattern to `.pre-commit-config.yaml`
Example in-script exclusion

```diff
 found=0
 for f in "$@"; do
+  # Skip known safe files
+  case "$f" in
+    *.example|.pre-commit-config.yaml|.gitignore|*.md) continue ;;
+  esac
   if grep -qE '(API_TOKEN|API_SECRET|API_KEY|PASSWORD|PRIVATE_KEY|Bearer |client_secret|-----BEGIN .* KEY-----)' "$f" 2>/dev/null; then
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
for f in "$@"; do
  # Skip known safe files
  case "$f" in
    *.example|.pre-commit-config.yaml|.gitignore|*.md) continue ;;
  esac
  if grep -qE '(API_TOKEN|API_SECRET|API_KEY|PASSWORD|PRIVATE_KEY|Bearer |client_secret|-----BEGIN .* KEY-----)' "$f" 2>/dev/null; then
    echo "  $f"
    found=1
  fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/check-no-secrets.sh` around lines 8 - 12, The script
scripts/check-no-secrets.sh currently greps every file from the for-loop (for f
in "$@") and can produce false positives for known safe files; add an exclusion
check before running grep to skip filenames or paths such as .env.example,
.pre-commit-config.yaml, .gitignore and doc files (e.g., *.md, *.mdx) so the
grep is only run on relevant source/config files, or alternatively re-introduce
the exclude pattern into .pre-commit-config.yaml; implement the skip by matching
$f against an array or case pattern and continue the loop if matched (preserving
the existing echo " $f" and found variable behavior).
```python
if any(str(c) in stderr for c in range(500, 600)):
    time.sleep(2 ** (attempt + 1))
    continue
```
🧹 Nitpick | 🔵 Trivial
Inefficient 5xx error detection.
Iterating over range(500, 600) and checking string containment is awkward. Consider a regex or direct status code extraction.
```diff
-    if any(str(c) in stderr for c in range(500, 600)):
+    if re.search(r'\b5\d{2}\b', stderr):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/fetch_github.py` around lines 122 - 124, The 5xx detection loop is
inefficient; replace the any(str(c) in stderr for c in range(500, 600)) check
with a direct pattern or numeric extraction from stderr (e.g., use a regex like
'\b5\d{2}\b' to detect any 5xx status) or parse an HTTP/status code variable if
available, then keep the same backoff logic using attempt and time.sleep; update
the conditional in the retry block that references stderr and attempt so it
reliably matches 5xx responses without iterating the whole range.
```python
if paginate and body.startswith("["):
    # gh --paginate concatenates arrays as ][
    # or sometimes just newline-separated arrays
    body = body.replace("]\n[", ",").replace("][", ",")
```
Fragile JSON array concatenation.
The gh --paginate output handling assumes ][ or ]\n[ never appears in JSON string values. While unlikely for GitHub API data, a malformed commit message or issue body containing these sequences could corrupt parsing.
A safer approach would be to use --jq to flatten at the source or to parse each JSON array separately and merge in Python:
Safer alternative

```diff
-    # Paginated output may be multiple JSON arrays concatenated
-    if paginate and body.startswith("["):
-        # gh --paginate concatenates arrays as ][
-        # or sometimes just newline-separated arrays
-        body = body.replace("]\n[", ",").replace("][", ",")
+    # Paginated output: multiple JSON arrays, parse each separately
+    if paginate and body.startswith("["):
+        import re
+        arrays = re.split(r'(?<=\])\s*(?=\[)', body)
+        merged = []
+        for arr_str in arrays:
+            try:
+                merged.extend(json.loads(arr_str))
+            except json.JSONDecodeError:
+                pass
+        return merged
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/fetch_github.py` around lines 186 - 189, The current fragile concat
(in the block checking `if paginate and body.startswith("["):`) naively replaces
"][` and `]\n[` which can corrupt JSON when those sequences appear in string
values; instead, split `body` into separate JSON array tokens and parse each
safely (e.g., locate balanced brackets or split on `]\n[`/`][` then wrap/ensure
valid arrays), json.loads each array and extend a single result list, or switch
to invoking GH with `--jq` to produce newline-delimited objects and parse
line-by-line—update the code that mutates `body` to implement one of these safe
merges using `json.loads` rather than simple string replace.
```python
CLAUDE_PRICING = [
    {"model": "claude-opus-4", "input_per_mtok": 15.00, "output_per_mtok": 75.00},
    {"model": "claude-sonnet-4", "input_per_mtok": 3.00, "output_per_mtok": 15.00},
    {"model": "claude-haiku-4", "input_per_mtok": 0.80, "output_per_mtok": 4.00},
    {"model": "claude-3-5-sonnet", "input_per_mtok": 3.00, "output_per_mtok": 15.00},
    {"model": "claude-3-5-haiku", "input_per_mtok": 0.80, "output_per_mtok": 4.00},
    {"model": "claude-3-opus", "input_per_mtok": 15.00, "output_per_mtok": 75.00},
    {"model": "claude-3-sonnet", "input_per_mtok": 3.00, "output_per_mtok": 15.00},
    {"model": "claude-3-haiku", "input_per_mtok": 0.25, "output_per_mtok": 1.25},
]
```
🧹 Nitpick | 🔵 Trivial
Consider documenting the pricing source date.
The hardcoded Claude pricing should include a comment with the date it was last verified against Anthropic's pricing page, making it easier to track staleness.
```diff
 # Published Anthropic Claude model pricing (per million tokens, USD)
-# Source: https://www.anthropic.com/pricing — update when pricing changes
+# Source: https://www.anthropic.com/pricing
+# Last verified: 2026-04-XX — update when pricing changes
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/fetch_pricing.py` around lines 66 - 75, The CLAUDE_PRICING constant
is hardcoded without a verification date; update the declaration for
CLAUDE_PRICING to include a comment above it stating the date (e.g., "verified
YYYY-MM-DD") and optionally the Anthropic pricing URL or source; ensure the
comment is placed immediately above the CLAUDE_PRICING list so future
maintainers can see when the values were last confirmed.
```python
def get_birth_date(filepath: Path) -> str:
    """Get the file's birth date as YYYY-MM-DD using stat."""
    try:
        result = subprocess.run(
            ["stat", "-f", "%SB", "-t", "%Y-%m-%d", str(filepath)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        return datetime.now().strftime("%Y-%m-%d")
```
🧹 Nitpick | 🔵 Trivial
stat -f %SB is macOS-specific.
This syntax fails on Linux. The fallback to datetime.now() works, but Linux users won't get accurate birth dates. Consider detecting the platform:
Cross-platform approach
```diff
 def get_birth_date(filepath: Path) -> str:
     """Get the file's birth date as YYYY-MM-DD using stat."""
+    import platform
     try:
-        result = subprocess.run(
-            ["stat", "-f", "%SB", "-t", "%Y-%m-%d", str(filepath)],
-            capture_output=True,
-            text=True,
-            check=True,
-        )
-        return result.stdout.strip()
+        if platform.system() == "Darwin":
+            result = subprocess.run(
+                ["stat", "-f", "%SB", "-t", "%Y-%m-%d", str(filepath)],
+                capture_output=True, text=True, check=True,
+            )
+            return result.stdout.strip()
+        else:
+            # Linux: use mtime as fallback (birth time often unavailable)
+            mtime = filepath.stat().st_mtime
+            return datetime.fromtimestamp(mtime).strftime("%Y-%m-%d")
     except subprocess.CalledProcessError:
         return datetime.now().strftime("%Y-%m-%d")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/refresh_catalog.py` around lines 127 - 138, get_birth_date currently
uses macOS-only `stat -f %SB` and should be made cross-platform: update
get_birth_date to first try native Python file metadata
(Path(filepath).stat().st_birthtime when present), then if that isn't available
attempt GNU stat (`stat --format=%w <file>`) via subprocess.run (check output is
not "-" and parse YYYY-MM-DD), and only then fall back to
datetime.now().strftime("%Y-%m-%d"); use sys.platform or shutil.which("stat") to
decide behavior and adjust exception handling around subprocess.run to return
the fallback on error. Ensure you reference the existing get_birth_date function
and subprocess.run usage when making changes.
- Sanitize exception output in Jira fetch to prevent credential leakage in tracebacks (CodeQL clear-text logging fix)
- Add presence-only credential status logging (set/not set, never values)
- Add limit parameter and LIMIT clause to search_github_repos
- Replace N+1 per-repo language queries with single batched query
- Auto-format all scripts with ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary

- `scripts/fetch_pricing.py` fetches from public AWS pricing bulk files (no auth) plus hardcoded Anthropic published Claude model rates. Includes discount factor support and ROSA cluster cost estimation.
- `get_direct_reports`, `get_org_tree`, `get_management_chain` — recursive CTE queries against the person table for traversing manager relationships.
- `scripts/fetch_github.py --org YOUR_ORG` fetches repos, commits, PRs, issues, releases, code frequency. Includes 6 MCP tools for querying GitHub data.
- `extensions/` directory for local-only plugins not tracked upstream (e.g., org-specific integrations).
- `check-no-secrets.sh` scans for API keys/tokens; `check-protected-files.sh` guards the governance/ directory.
- `scripts/refresh_catalog.py --refresh` fetches fresh data from configured sources.

Test plan

- `uv run scripts/fetch_pricing.py --claude-only` adds Claude model pricing to pricing.db
- `uv run scripts/fetch_pricing.py --skip-rosa` populates pricing.db with AWS public pricing
- `uv run scripts/fetch_github.py --org YOUR_ORG` populates github.db
- `uv run mcp_server.py`: the `cloud_pricing_lookup` tool returns results, the `github_org_summary` tool returns results, and `get_org_tree` / `get_direct_reports` / `get_management_chain` work
- `scripts/test.sh` passes

🤖 Generated with Claude Code
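The org-hierarchy tools are described as recursive CTE queries against the person table. A hedged sketch of what a `get_management_chain`-style query could look like — the `person(id, name, manager_id)` schema, the SQL, and the helper signature are assumptions for illustration, not the PR's actual implementation:

```python
# Illustrative recursive CTE: walk from a person up through manager_id
# links to the org root. Schema person(id, name, manager_id) is assumed.
import sqlite3

MANAGEMENT_CHAIN_SQL = """
WITH RECURSIVE chain(id, name, manager_id, depth) AS (
    SELECT id, name, manager_id, 0 FROM person WHERE id = ?
    UNION ALL
    SELECT p.id, p.name, p.manager_id, c.depth + 1
    FROM person p JOIN chain c ON p.id = c.manager_id
)
SELECT name FROM chain ORDER BY depth
"""


def get_management_chain(conn: sqlite3.Connection, person_id: int) -> list[str]:
    # Returns names from the person up to the root (manager_id IS NULL)
    return [row[0] for row in conn.execute(MANAGEMENT_CHAIN_SQL, (person_id,))]
```

The same recursive shape, joined in the opposite direction (`c.id = p.manager_id`), would enumerate a subtree for a `get_org_tree`-style tool.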