feat: production Docker hardening, /health/ readiness, and JSON logging#228
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughImplements a /health/ readiness endpoint, persists collector-group run outcomes, hardens Docker image/runtime and compose overlays with healthchecks, adds Cloud Logging JSON formatting, updates settings/CI/docs, and includes tests covering health and logging behavior. ChangesProduction Deployment Infrastructure
Sequence Diagram(s)sequenceDiagram
participant Client
participant HealthEndpoint as /health/ endpoint
participant HealthChecks as run_health_checks()
participant Database
participant Celery
participant CollectorServices
participant PineconeSyncStatus
Client->>HealthEndpoint: GET /health/ (Bearer token optional)
HealthEndpoint->>HealthChecks: check authentication, run checks
HealthChecks->>Database: verify connectivity, measure latency
Database-->>HealthChecks: ok/error result
alt Database OK
HealthChecks->>Celery: inspect().ping() for active workers
Celery-->>HealthChecks: worker list and count
HealthChecks->>CollectorServices: list_group_statuses()
CollectorServices->>Database: fetch CollectorGroupRunStatus rows
Database-->>CollectorServices: group staleness data
HealthChecks->>PineconeSyncStatus: query sync timestamps
PineconeSyncStatus-->>HealthChecks: app-type sync times
end
HealthChecks-->>HealthEndpoint: aggregated status payload
HealthEndpoint-->>Client: 200 (healthy) or 503 (unhealthy) + JSON
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
… custom health checks in the main service configuration.
…ce health check logic in health.py. Added error handling for collector group status checks and updated test settings for collector freshness enforcement. Improved test coverage for health checks under failure conditions.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
docker-compose.yml (1)
78-87: 💤 Low valueHealthcheck may fail spuriously if any process exits during scan.
The one-liner iterates all
/proc/[0-9]*/cmdlinepaths and opens each file. If any process exits betweenglob()andopen(), aFileNotFoundErrorpropagates and fails the entire check—even if Celery is running fine.With 5 retries this is unlikely to cause real issues, but a try/except would make it deterministic:
♻️ More robust alternative
healthcheck: test: [ "CMD-SHELL", - 'python -c "import glob,sys; sys.exit(0 if any(b\"celery\" in (d:=open(p,\"rb\").read()) and b\"worker\" in d for p in glob.glob(\"/proc/[0-9]*/cmdline\")) else 1)"', + 'python -c "import glob,sys; found=False +for p in glob.glob(\"/proc/[0-9]*/cmdline\"): + try: + d=open(p,\"rb\").read() + if b\"celery\" in d and b\"worker\" in d: found=True; break + except: pass +sys.exit(0 if found else 1)"', ]Alternatively, use
celery inspect pingfor a more semantic check (though it requires broker connectivity).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docker-compose.yml` around lines 78 - 87, The healthcheck one-liner that scans "/proc/[0-9]*/cmdline" can raise FileNotFoundError if a process exits between glob() and open(); update the CMD-SHELL python snippet (the string under healthcheck.test) to catch and ignore FileNotFoundError (and optionally PermissionError) when opening/reading each cmdline so the scan continues rather than aborting, preserving the original logic that returns 0 if any cmdline contains both b"celery" and b"worker"; alternatively, replace the whole test with a broker-aware check such as running "celery inspect ping" if you prefer a semantic check.config/tests/test_health.py (1)
97-97: ⚡ Quick winAdd trailing newline at end of file.
Python files should end with a newline character per PEP 8 convention. Most linters will flag this.
📝 Proposed fix
assert payload["checks"]["collector_groups"] == {} +🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/tests/test_health.py` at line 97, The file config/tests/test_health.py is missing a trailing newline at EOF; open that test file (look near the assertion line containing assert payload["checks"]["collector_groups"] == {}), add a single newline character at the end of the file so it ends with a newline (saving the file will ensure linters/PEP8 are satisfied).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@boost_collector_runner/management/commands/run_scheduled_collectors.py`:
- Around line 221-225: The current logic records a group success whenever
group_id is set and exit_code == 0, which falsely records success when all tasks
were skipped; modify the code around where collectors are executed to track
whether any collector actually ran (e.g., introduce a boolean like any_executed
or an integer executed_count updated inside the collector execution loop), then
change the final block that calls
collector_services.record_group_success(group_id) to require both exit_code == 0
and any_executed (or executed_count > 0); keep the failure path as-is
(record_group_failure with exit_code) when collectors ran or if you still want
failures recorded for attempted runs.
In `@config/health.py`:
- Around line 27-29: Replace all uses of returning str(exc) from the exception
handlers in config/health.py with stable, non-sensitive error responses (e.g.
return {"ok": False, "error": "internal_error"} or a specific error
code/message), and ensure the full exception is recorded via structured logging
(logger.exception(...) or logger.error(..., exc_info=True)) inside the
corresponding except Exception as exc blocks (the three occurrences shown by the
diff and the other blocks at the referenced ranges). Locate the except Exception
as exc blocks in config/health.py and: 1) log the detailed exception with
logger.exception or logger.error(..., exc_info=True); 2) return a generic error
code/message payload to the caller instead of str(exc).
- Around line 64-67: The try/except around load_config currently swallows
parsing/service errors by returning an empty set (causing any_stale to become
False); instead surface failures so readiness fails: in the except block for
load_config (and the similar blocks at the other locations), catch Exception as
e, log the error, and return None (or re-raise) rather than return set() so
callers can detect a None result and mark freshness/readiness as failed; update
callers that check the return to treat None as an error state (e.g., any_stale =
None -> unhealthy).
In `@config/tests/test_health.py`:
- Around line 47-53: The Celery mock in the test uses _check_celery_workers but
returns inconsistent data (workers: [] with responded: 1); update the mock
return so it's realistic and consistent—for example set "responded": 1 and
include a corresponding single worker entry in the "workers" list (or
alternatively set "responded": 0 if keeping an empty list). Locate the mock of
_check_celery_workers in the test and adjust the returned dict so "workers" and
"responded" match.
In `@docs/service_api/boost_collector_runner.md`:
- Line 25: Update the broken markdown link "[Contributing]" that currently
points to "../Contributing.md" so it uses the correct relative path and casing;
replace the target with "../../CONTRIBUTING.md" so the link resolves on
case-sensitive filesystems (change the link target only, leave the link text
"[Contributing]" unchanged).
In `@Makefile`:
- Line 100: The health-check curl invocation in the Makefile should include an
Authorization header when HEALTH_CHECK_TOKEN is set; update the recipe that
contains the $(COMPOSE) exec -T $(APP) sh -c 'curl ...' line to conditionally
add a header like -H "Authorization: Bearer $(HEALTH_CHECK_TOKEN)" when
HEALTH_CHECK_TOKEN is non-empty (e.g. create an AUTH_HDR variable that expands
to the header or empty string and insert it into the curl command), ensuring
quoting/escaping works inside the sh -c command so the token is passed into the
container.
---
Nitpick comments:
In `@config/tests/test_health.py`:
- Line 97: The file config/tests/test_health.py is missing a trailing newline at
EOF; open that test file (look near the assertion line containing assert
payload["checks"]["collector_groups"] == {}), add a single newline character at
the end of the file so it ends with a newline (saving the file will ensure
linters/PEP8 are satisfied).
In `@docker-compose.yml`:
- Around line 78-87: The healthcheck one-liner that scans "/proc/[0-9]*/cmdline"
can raise FileNotFoundError if a process exits between glob() and open(); update
the CMD-SHELL python snippet (the string under healthcheck.test) to catch and
ignore FileNotFoundError (and optionally PermissionError) when opening/reading
each cmdline so the scan continues rather than aborting, preserving the original
logic that returns 0 if any cmdline contains both b"celery" and b"worker";
alternatively, replace the whole test with a broker-aware check such as running
"celery inspect ping" if you prefer a semantic check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9ba58817-f2e2-40c2-a494-c3017e1e3bd6
⛔ Files ignored due to path filters (2)
requirements-dev.lockis excluded by!**/*.lockrequirements.lockis excluded by!**/*.lock
📒 Files selected for processing (30)
.dockerignore.env.example.github/workflows/actions.ymlCONTRIBUTING.mdDockerfileMakefileREADME.mdboost_collector_runner/management/commands/run_scheduled_collectors.pyboost_collector_runner/migrations/0001_initial.pyboost_collector_runner/models.pyboost_collector_runner/services.pyboost_collector_runner/tests/test_services.pyconfig/health.pyconfig/logging_formatters.pyconfig/settings.pyconfig/test_settings.pyconfig/tests/__init__.pyconfig/tests/test_health.pyconfig/tests/test_logging_json.pyconfig/urls.pydocker-compose.prod.ymldocker-compose.ymldocker-entrypoint.shdocker/gunicorn.conf.pydocs/Deployment.mddocs/GCP_Production_Checklist.mddocs/service_api/README.mddocs/service_api/boost_collector_runner.mdrequirements-dev.inrequirements.in
…rization. Update health check error handling to provide stable error messages. Improve logging for database and celery worker checks. Add tests to ensure correct behavior when tasks are skipped and handle exceptions gracefully.
Resolve test_commands.py conflict: keep develop stop-on-failure caplog assertions and --strict YAML test; keep feature on_release skip test. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
config/tests/test_health.py (1)
69-73: ⚡ Quick winAdd an explicit invalid-token assertion for the bearer auth path.
This test currently validates missing token + valid token, but not wrong token. Add one assertion to lock down the unauthorized branch.
Suggested patch
`@override_settings`(HEALTH_CHECK_TOKEN="secret-token") def test_health_view_requires_bearer_when_token_set(api_client): response = api_client.get("/health/") assert response.status_code == 401 + response = api_client.get("/health/", HTTP_AUTHORIZATION="Bearer wrong-token") + assert response.status_code == 401 response = api_client.get("/health/", HTTP_AUTHORIZATION="Bearer secret-token") assert response.status_code in (200, 503)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/tests/test_health.py` around lines 69 - 73, In test_health_view_requires_bearer_when_token_set, add an explicit check for an invalid bearer token by calling api_client.get("/health/", HTTP_AUTHORIZATION="Bearer wrong-token") and asserting the response.status_code == 401 so the unauthorized branch is covered; update the function (test_health_view_requires_bearer_when_token_set) to include this extra assertion alongside the existing missing-token and valid-token checks.config/settings.py (1)
579-605: ⚡ Quick winValidate
LOG_FORMATvalues explicitly.Unknown values currently downgrade silently to text logging. Failing fast (or warning loudly) avoids accidental loss of structured logs in production.
Proposed fix
+from django.core.exceptions import ImproperlyConfigured + # Logging format: text (default) or json (GCP Cloud Logging on stdout) LOG_FORMAT = (env("LOG_FORMAT", default="text") or "text").strip().lower() +if LOG_FORMAT not in {"text", "json"}: + raise ImproperlyConfigured( + f"Invalid LOG_FORMAT={LOG_FORMAT!r}. Expected 'text' or 'json'." + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/settings.py` around lines 579 - 605, LOG_FORMAT is currently normalized but not validated, so unknown values silently fall back to text; update the block around LOG_FORMAT, _LOG_FORMATTERS and _CONSOLE_FORMATTER to validate allowed values and fail fast: define an allowed set (e.g. {"text","json","verbose","simple"}), normalize LOG_FORMAT as already done, then if LOG_FORMAT not in the allowed set raise a ValueError (or log a clear error and raise) naming the invalid value and listing allowed values; keep the existing behavior for valid values: when LOG_FORMAT == "json" add the "cloud_json" entry to _LOG_FORMATTERS and set _CONSOLE_FORMATTER accordingly, and map "text" to the same console formatter as "verbose" if you want to preserve current mapping.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config/health.py`:
- Around line 36-37: The numeric environment parsing for
HEALTH_CELERY_MIN_WORKERS and HEALTH_CELERY_INSPECT_TIMEOUT (variables
min_workers and timeout) is unguarded and can raise on invalid input; wrap the
int/float casts for min_workers and timeout (and any other numeric casts around
lines 87-88) in a try/except catching ValueError/TypeError, fall back to the
existing default values, and log a warning so the /health/ handler returns a
structured unhealthy response instead of bubbling an exception. Ensure you
update the same code paths that perform numeric casts on lines ~87-88 as well so
all numeric settings are protected.
In `@Dockerfile`:
- Around line 13-18: The Dockerfile installs apt packages without pinned
versions (libpq5, git, curl, gosu) which risks unreproducible builds; update the
RUN line to pin each package to a specific version (or use package=version
syntax and/or apt-get install with the distro's known version variables), and
add a comment explaining the chosen versions; reference the Dockerfile RUN step
that installs libpq5, git, curl, and gosu and ensure the pinned versions are
tested to exist in the base image's apt sources before committing.
- Around line 39-40: The HEALTHCHECK command should include an Authorization
bearer header when HEALTH_CHECK_TOKEN is present so unauthenticated probes don't
return 401; modify the Dockerfile HEALTHCHECK (the line starting with
HEALTHCHECK ... CMD curl ...) to conditionally add -H "Authorization: Bearer
$HEALTH_CHECK_TOKEN" (or equivalent) only when the env var is set, falling back
to the current curl command when it's empty, and apply the same conditional
header logic to the compose override healthcheck so both probes behave
identically.
---
Nitpick comments:
In `@config/settings.py`:
- Around line 579-605: LOG_FORMAT is currently normalized but not validated, so
unknown values silently fall back to text; update the block around LOG_FORMAT,
_LOG_FORMATTERS and _CONSOLE_FORMATTER to validate allowed values and fail fast:
define an allowed set (e.g. {"text","json","verbose","simple"}), normalize
LOG_FORMAT as already done, then if LOG_FORMAT not in the allowed set raise a
ValueError (or log a clear error and raise) naming the invalid value and listing
allowed values; keep the existing behavior for valid values: when LOG_FORMAT ==
"json" add the "cloud_json" entry to _LOG_FORMATTERS and set _CONSOLE_FORMATTER
accordingly, and map "text" to the same console formatter as "verbose" if you
want to preserve current mapping.
In `@config/tests/test_health.py`:
- Around line 69-73: In test_health_view_requires_bearer_when_token_set, add an
explicit check for an invalid bearer token by calling api_client.get("/health/",
HTTP_AUTHORIZATION="Bearer wrong-token") and asserting the response.status_code
== 401 so the unauthorized branch is covered; update the function
(test_health_view_requires_bearer_when_token_set) to include this extra
assertion alongside the existing missing-token and valid-token checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1ba6eb73-debb-44cb-90fc-4df64a24cfb1
⛔ Files ignored due to path filters (2)
requirements-dev.lockis excluded by!**/*.lockrequirements.lockis excluded by!**/*.lock
📒 Files selected for processing (31)
.dockerignore.env.example.github/workflows/actions.ymlCONTRIBUTING.mdDockerfileMakefileREADME.mdboost_collector_runner/management/commands/run_scheduled_collectors.pyboost_collector_runner/migrations/0001_initial.pyboost_collector_runner/models.pyboost_collector_runner/services.pyboost_collector_runner/tests/test_commands.pyboost_collector_runner/tests/test_services.pyconfig/health.pyconfig/logging_formatters.pyconfig/settings.pyconfig/test_settings.pyconfig/tests/__init__.pyconfig/tests/test_health.pyconfig/tests/test_logging_json.pyconfig/urls.pydocker-compose.prod.ymldocker-compose.ymldocker-entrypoint.shdocker/gunicorn.conf.pydocs/Deployment.mddocs/GCP_Production_Checklist.mddocs/service_api/README.mddocs/service_api/boost_collector_runner.mdrequirements-dev.inrequirements.in
…rization in Docker configuration. Update health check error handling in health.py to manage invalid settings gracefully. Add tests for invalid configuration scenarios in test_health.py to ensure stability and proper error reporting.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker-compose.prod.yml`:
- Around line 20-24: The Dockerfile HEALTHCHECK uses escaped Compose-style
variables ($${HEALTH_CHECK_TOKEN:-} and $${HEALTH_CHECK_TOKEN}) which get
transformed into $$ and literal braces when evaluated by the container shell;
update the HEALTHCHECK command in the Dockerfile (the HEALTHCHECK/CMD sh -c
block) to use shell variable expansion (${HEALTH_CHECK_TOKEN:-} and
${HEALTH_CHECK_TOKEN}) so the container runtime sees the actual token value and
the Authorization header is constructed correctly at runtime; keep
docker-compose.prod.yml unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9d9aac59-6dcb-4018-a9fb-4a5abdf3da6e
📒 Files selected for processing (4)
Dockerfileconfig/health.pyconfig/tests/test_health.pydocker-compose.prod.yml
…le expansion for HEALTH_CHECK_TOKEN.
… such as stale status, freshness enforcement, and error reporting.
Summary
Hardens the VM Docker Compose production path and adds readiness/observability for scheduled collectors.
python:3.13-slim, non-rootappuser, imageHEALTHCHECK→GET /health/, Gunicorngthreadviadocker/gunicorn.conf.py; newdocker-compose.prod.yml(resource limits,LOG_FORMAT=json,restart: always). Dev compose keeps root entrypoint for volume chown; Celery worker/beat disable the image HTTP healthcheck.GET /health/checks PostgreSQL, Celery worker ping (configurable min count/timeout), and per–daily-group collector freshness from YAML schedule +CollectorGroupRunStatus. Optional bearer token;HEALTH_ENFORCE_COLLECTOR_FRESHNESS(defaulttrue; CI/smoke setsfalseuntil groups have run once).CollectorGroupRunStatusmodel,boost_collector_runner/services.py, migration;run_scheduled_collectors --grouprecords success/failure.LOG_FORMAT=jsonusesCloudLoggingJsonFormatter(severityfield) for GCP-friendly structured logs.pinecone>=6.0,<7); locks recompiled for Python 3.13.docs/GCP_Production_Checklist.md, Deployment updates, generateddocs/service_api/boost_collector_runner.md.Deploy notes
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d, then migrate; verifycurl http://127.0.0.1:8000/health/.HEALTH_ENFORCE_COLLECTOR_FRESHNESS=truein production.env.pinecone-plugin-inferencefrom the venv afteruv pip install -r requirements.lock.Test plan
uv run pytest config/tests/test_health.py boost_collector_runner/tests/test_services.py -vpython scripts/generate_service_docs.py --checkdocker compose -f docker-compose.yml -f docker-compose.ci.yml build && docker compose up -d && migrate && make healthCloses: #211
Summary by CodeRabbit
New Features
Documentation
Chores
Tests