Unified input API: artifact_id + data replaces input_csv#199
Conversation
@claude code review
Claude finished @RafaelPo's task in 2m 1s — Code Review: Unified Input API
Overall this is a well-structured API migration. I found a few issues worth addressing:

1. Bug: Sync context manager in async code

```python
# Current (broken): sync `with` on an async context manager
with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)

# Fix
async with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)
```

This will cause runtime failures when the upload endpoint is called in production.

2. Bug: Duplicate max_inline_rows definition in config.py
| Line | Default | Description |
|---|---|---|
| 16 | 50_000 | "Maximum number of rows allowed in inline JSON data" |
| 78 | 5_000 | "Maximum rows allowed in inline data (list[dict])." |
Pydantic uses the last definition, so the effective limit is 5,000 — a 10x reduction from the original. If that's intentional, remove the first definition. If not, remove the second one (or consolidate with the intended value).
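The "last definition wins" behavior is plain Python class-body semantics, not anything pydantic-specific; a tiny standalone demo (not the project's actual config class) makes it concrete:

```python
# Minimal demo (illustrative names, not the real Settings class):
# a repeated class-body assignment silently shadows the earlier one,
# in pydantic models just as in plain classes.
class Settings:
    max_inline_rows = 50_000  # first definition
    max_inline_rows = 5_000   # second definition wins

print(Settings.max_inline_rows)  # → 5000
```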
3. Bug: Integration tests still reference removed input_csv parameter
test_stdio_content.py:755 and test_stdio_content.py:829 — These integration tests still pass input_csv as a tool parameter, but ScreenInput / AgentInput now use artifact_id / data (with extra="forbid"). These would raise ValidationError if RUN_INTEGRATION_TESTS=1 is set:
```python
# Line 755 — still uses input_csv
"params": {
    "task": "Filter for remote positions",
    "input_csv": jobs_csv,  # ← removed field, will fail
}

# Should be:
"params": {
    "task": "Filter for remote positions",
    "data": [{"company": "Airtable", ...}, ...],
    # or use artifact_id from upload_data
}
```

Same issue at line 829 for the agent pipeline test.
4. Nit: Prefer is not None over truthy check for artifact_id
tools.py:67 — if params.artifact_id: uses a truthy check. While the model validator prevents empty strings from reaching here (UUID validation would fail), is not None is more explicit and consistent with the rest of the codebase:
```python
def _resolve_input(params) -> UUID | pd.DataFrame:
    if params.artifact_id is not None:  # more explicit
        return UUID(params.artifact_id)
    return pd.DataFrame(params.data)
```

5. Minor: Upload endpoint lacks per-user auth
http_config.py:104 — PUT /api/uploads/{upload_id} is registered as a custom route without OAuth middleware. It relies solely on HMAC signature verification. This is likely intentional (designed for curl -T from client sandboxes), but worth confirming that the HMAC-only approach is sufficient since any holder of the signed URL can upload without identity verification.
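As a rough illustration of what an HMAC-only presigned scheme looks like (function names, the message format, and the secret are assumptions here, not the project's actual code), the security property is: the URL is self-authenticating, but carries no user identity.

```python
import hashlib
import hmac
import time

SECRET = b"example-upload-secret"  # stands in for UPLOAD_SECRET

def sign_upload(upload_id: str, expires_at: int) -> str:
    # Sign the upload id and expiry so the URL itself is the credential
    msg = f"{upload_id}:{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_upload(upload_id: str, expires_at: int, signature: str) -> bool:
    if time.time() > expires_at:
        return False  # link expired
    expected = sign_upload(upload_id, expires_at)
    # compare_digest avoids leaking the signature via timing
    return hmac.compare_digest(expected, signature)
```

Anyone who obtains the signed URL before `expires_at` can upload, which is exactly the trade-off worth confirming.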
Summary
| Severity | Count | Items |
|---|---|---|
| Bug | 3 | Sync/async context manager mismatch, duplicate config field, broken integration tests |
| Nit | 1 | Truthy vs is not None check |
| Note | 1 | Upload auth design consideration |
Items 1 and 2 should be fixed before merge. Item 3 is low-risk since those tests are gated by RUN_INTEGRATION_TESTS=1.
@claude Review the code. Focus on security and safety
Claude finished @RafaelPo's task in 1m 33s — Security & Safety Review
I reviewed all 23 changed files with a security and safety focus. The SSRF protections, HMAC signing, consume-on-use Redis patterns, and input validation are well-implemented. Here are the findings:

1. Security: API token stored in Redis upload metadata
```python
meta = json.dumps({
    "upload_id": upload_id,
    "filename": params.filename,
    "expires_at": expires_at,
    "api_token": api_token,  # ← plaintext API token in Redis
})
```

This is a pragmatic design choice (the upload endpoint needs to act on behalf of the user), but worth noting: anyone with Redis access can read users' API tokens for the lifetime of the key.
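If encryption at rest is wanted here, one option (the direction the later hardening commits in this PR describe) is a Fernet key derived from `UPLOAD_SECRET`. A sketch, assuming the `cryptography` package; the helper name is illustrative:

```python
import base64
import hashlib

from cryptography.fernet import Fernet

def fernet_from_secret(secret: str) -> Fernet:
    # Derive the 32-byte url-safe base64 key Fernet expects
    # from the shared deployment secret.
    digest = hashlib.sha256(secret.encode()).digest()
    return Fernet(base64.urlsafe_b64encode(digest))

f = fernet_from_secret("example-upload-secret")
ciphertext = f.encrypt(b"sk-api-token")  # store this in Redis instead
assert f.decrypt(ciphertext) == b"sk-api-token"
```

This keeps Redis contents opaque without introducing a separate key-management dependency, at the cost of tying token confidentiality to `UPLOAD_SECRET`.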
2. Security: Upload endpoint body buffering before Content-Length rejection
```python
content_length_str = request.headers.get("content-length", "")
if content_length_str:
    ...  # early reject if too large
body = await request.body()  # ← full body buffered regardless
if len(body) > settings.max_upload_size_bytes:
    return None, None, JSONResponse({"error": "File too large"}, status_code=413)
```

A client omitting the Content-Length header (e.g. via chunked transfer encoding) skips the early rejection, so the full body is buffered into memory before the size check runs.

3. Security: SSRF protection with DNS rebinding gap
```python
def _validate_url_target(url: str) -> None:
    addrinfos = socket.getaddrinfo(hostname, ...)
    for *_, sockaddr in addrinfos:
        if _is_blocked_ip(sockaddr[0]):
            raise ValueError(...)

async def fetch_csv_from_url(url: str) -> pd.DataFrame:
    _validate_url_target(url)  # ← check DNS here
    async with httpx.AsyncClient(...):  # ← connect here (re-resolves DNS)
        response = await client.get(url)
```

There's a classic TOCTOU gap: the DNS resolution at validation time and the DNS resolution at connection time are separate. An attacker with a short-TTL DNS record could make the first resolution return a public IP and the second return an internal one. This is a well-known limitation that's hard to fully solve without a custom DNS resolver or connecting via IP. The current implementation is reasonable — the redirect hook covers the most common attack vector (open redirects to internal services). Just noting this for awareness.

4. Bug (confirmed from prior review):
| Severity | Count | Items |
|---|---|---|
| Security note | 3 | API token in Redis, chunked upload body buffering, DNS TOCTOU |
| Nit | 1 | Path canonicalization in validate_csv_path |
| Good practice | 10+ | HMAC timing safety, consume-on-use, SSRF blocklist, rate limiting, etc. |
The security posture is solid. The findings above are defense-in-depth considerations rather than exploitable vulnerabilities. The short-TTL consume-on-use pattern for API tokens in Redis is the most material item — worth a brief threat model discussion on whether encryption at rest is warranted for your deployment.
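One way to close the body-buffering gap from item 2 is to read the request stream incrementally and abort as soon as the cap is exceeded. A minimal sketch with a generic async chunk iterator (the real middleware would wrap something like Starlette's `request.stream()`; names here are illustrative):

```python
import asyncio

class PayloadTooLarge(Exception):
    """Map to an HTTP 413 response at the edge."""

async def read_capped(chunks, limit: int) -> bytes:
    # Accumulate chunks, failing as soon as the running total
    # exceeds the cap — no Content-Length header required.
    received = bytearray()
    async for chunk in chunks:
        received += chunk
        if len(received) > limit:
            raise PayloadTooLarge(f"body exceeds {limit} bytes")
    return bytes(received)

async def _demo() -> None:
    async def gen(parts):
        for p in parts:
            yield p

    assert await read_capped(gen([b"ab", b"cd"]), limit=10) == b"abcd"
    try:
        await read_capped(gen([b"x" * 8, b"y" * 8]), limit=10)
    except PayloadTooLarge:
        pass
    else:
        raise AssertionError("expected PayloadTooLarge")

asyncio.run(_demo())
```

This works identically for chunked-encoding clients, since the limit is enforced on bytes actually received rather than on a declared header.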
Processing tools now accept artifact_id (UUID from upload_data) or data (list[dict]) instead of input_csv/input_data/input_json. Adds upload_data tool for URL/file ingestion and request_upload_url for presigned large-file uploads in HTTP mode.

- Phase 1: Simplified _SingleSourceInput and MergeInput models
- Phase 2: upload_data tool (URL + local path + Google Sheets)
- Phase 3: Presigned URL upload system (HMAC, Redis metadata, REST endpoint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adopt _aid_or_dataframe and _input_data_mode properties on models, removing free functions from tools.py
- Add left/right properties to MergeInput
- Use `is not None` for artifact_id checks
- Fix sync context manager in uploads.py (with → async with)
- Remove duplicate max_inline_rows in config.py
- Update integration tests to use data instead of removed input_csv
- Reject empty CSV in upload_data local file path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove auto-generated per-process secret — it breaks when pods don't share state. Now fails fast with a clear error if unset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
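The fail-fast check described above might look like the following sketch (function name and error text are assumptions, not the project's exact code):

```python
import os

def load_upload_secret() -> str:
    # A per-process random fallback would make HMAC signatures
    # unverifiable across pods, so refuse to start without an
    # explicit shared secret.
    secret = os.environ.get("UPLOAD_SECRET", "")
    if not secret:
        raise RuntimeError(
            "UPLOAD_SECRET is not set; presigned upload URLs require a "
            "secret shared by all server processes"
        )
    return secret
```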
- /pub URLs now correctly convert to /export?format=csv
- Headers-only CSV from URL raises a clear "empty CSV" error instead of the misleading "could not parse as CSV or JSON"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
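The /pub rewrite described in this commit might be implemented along these lines (hypothetical helper; the project's actual function and suffix handling may differ):

```python
def to_csv_export_url(url: str) -> str:
    # Rewrite a published Google Sheets URL (/pub or /pubhtml) to the
    # CSV export form; other URLs pass through unchanged.
    base, _, _ = url.partition("?")
    for suffix in ("/pubhtml", "/pub"):
        if base.endswith(suffix):
            return base[: -len(suffix)] + "/export?format=csv"
    return url

print(to_csv_export_url("https://docs.google.com/spreadsheets/d/abc123/pub"))
# → https://docs.google.com/spreadsheets/d/abc123/export?format=csv
```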
Instructions:
- Add _INSTRUCTIONS_STDIO and _INSTRUCTIONS_HTTP to app.py
- HTTP instructions guide agent to use request_upload_url for local files
- server.py sets instructions based on transport mode

Security & correctness (from parallel review):
- SSRF protection: block internal IPs in URL fetching
- __Host- cookie prefix for auth state cookie
- Rate limiter: in-memory fallback when Redis unavailable
- Upload endpoint: use caller's API token, limit CSV rows
- Poll token via Authorization header (not just query param)
- Progress URL no longer leaks poll token in URL

Also:
- Update upload_data docstring and error message for HTTP mode
- Sync manifest.json description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
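The "block internal IPs in URL fetching" item can be built almost entirely on the stdlib `ipaddress` module. A sketch (the project's actual `_is_blocked_ip` may cover more cases):

```python
import ipaddress

def is_blocked_ip(ip_str: str) -> bool:
    # Block loopback, RFC 1918, link-local (incl. the 169.254.169.254
    # cloud metadata endpoint), reserved, multicast, and unspecified
    # ranges before fetching a user-supplied URL.
    ip = ipaddress.ip_address(ip_str)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )

assert is_blocked_ip("169.254.169.254")  # cloud metadata endpoint
assert is_blocked_ip("127.0.0.1") and is_blocked_ip("10.1.2.3")
assert not is_blocked_ip("93.184.216.34")  # a public address
```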
Security hardening: shell injection, headers, Redis TLS, container lockdown

- Fix shell injection in upload curl command via shlex.quote()
- Add SecurityHeadersMiddleware (HSTS, X-Content-Type-Options, X-Frame-Options, Cache-Control, Referrer-Policy) on all HTTP responses
- Add Redis TLS support (REDIS_SSL setting)
- Stream URL fetch with size limit (max_fetch_size_bytes) to prevent OOM
- Validate UPLOAD_SECRET at startup instead of first request
- Warn on missing REDIS_PASSWORD in HTTP mode at startup
- Enable rate limiting in --no-auth mode; cap in-memory fallback at 50K entries
- Container hardening: cap_drop ALL, no-new-privileges, read-only rootfs, CPU limits, REDISCLI_AUTH for healthcheck, pinned Redis image, network isolation
- Add --frozen to Dockerfile uv sync to prevent lockfile drift
- Sanitize SSRF error (no longer leaks resolved IPs)
- Add repr=False on sensitive config fields to prevent accidental logging
- Add Vary: Origin, Access-Control-Max-Age on CORS preflight responses
- Default to 127.0.0.1 in --no-auth mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
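The shell-injection fix amounts to quoting every value interpolated into the curl command. An illustrative helper (not the exact project code):

```python
import shlex

def build_upload_command(upload_url: str, file_path: str) -> str:
    # Quote both values so metacharacters in filenames or in the
    # signed query string (&, ;, spaces) cannot break out of the
    # shell command shown to the user.
    return f"curl -T {shlex.quote(file_path)} {shlex.quote(upload_url)}"

cmd = build_upload_command(
    "https://host/api/uploads/u1?sig=abc&expires=9", "my data.csv"
)
print(cmd)
# → curl -T 'my data.csv' 'https://host/api/uploads/u1?sig=abc&expires=9'
```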
Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore

- Fix SSRF DNS-rebinding TOCTOU: add _SSRFSafeTransport that re-validates hostnames at request time; block GKE metadata hostname; cap max_redirects=5
- Add user-scoped data isolation: record task owner (JWT sub) on submission, check ownership in everyrow_results_http to prevent cross-user access
- Encrypt tokens at rest in Redis using Fernet (derived from UPLOAD_SECRET): task tokens, poll tokens, auth codes, refresh tokens, upload metadata
- Add root .dockerignore with deny-all allowlist to prevent secrets leaking into Docker build context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
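The user-scoped isolation described here records the submitter's JWT `sub` at submission time and checks it on every read. A minimal sketch with an in-memory stand-in for the Redis store (all names assumed):

```python
task_owners: dict[str, str] = {}  # task_id -> JWT sub, stand-in for Redis

def record_owner(task_id: str, jwt_sub: str) -> None:
    task_owners[task_id] = jwt_sub

def check_owner(task_id: str, jwt_sub: str) -> None:
    # Unknown tasks and other users' tasks both read as "not yours",
    # so the error does not reveal whether the task exists.
    if task_owners.get(task_id) != jwt_sub:
        raise PermissionError("task not found or not owned by caller")

record_owner("t1", "user-a")
check_owner("t1", "user-a")  # ok
try:
    check_owner("t1", "user-b")
except PermissionError:
    pass
```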
- Remove --frozen from Dockerfile (incompatible with --no-sources)
- Fix Redis healthcheck: pass REDIS_PASSWORD as env var, remove cap_drop ALL which prevented Redis user switching
- Reject data=[] in _SingleSourceInput and MergeInput validators to prevent wasteful zero-row task submissions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploy workflow should only run on workflow_dispatch. PR checks are handled by the CI workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit ef615fe.
```python
assert body is not None and meta is not None  # type narrowing

# Retrieve and decrypt the user's API token
api_token = decrypt_value(meta.get("api_token", ""))
```
Bug: Calling decrypt_value with an empty string when an api_token is missing from metadata raises an unhandled InvalidToken exception, causing a 500 error instead of a 403.
Severity: HIGH
Suggested Fix
Wrap the call to decrypt_value in a try...except InvalidToken block to catch the exception and handle it as an authorization failure. Alternatively, explicitly check for the existence of the api_token key in the meta dictionary before attempting decryption.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: everyrow-mcp/src/everyrow_mcp/uploads.py#L232

Potential issue: The function `decrypt_value` is called with the result of `meta.get("api_token", "")`. If the `api_token` key is missing from the upload metadata, an empty string is passed to the decryption function. When `UPLOAD_SECRET` is configured, the underlying decryption library raises an `InvalidToken` exception when attempting to decrypt an empty string. This exception is not handled, causing the request to fail with a 500 server error instead of returning the intended 403 authorization error. This behavior is triggered in HTTP mode for any upload that lacks an API token.
Superseded by two focused PRs:
Splitting makes review easier and isolates risk. The deploy-mcp workflow change is already in #206.
Summary
Simplifies processing tool inputs to `artifact_id` + `data: list[dict]`, adds `upload_data` as a universal data ingestion tool, and adds security hardening across the HTTP deployment.

Input API
- `_SingleSourceInput`: `input_csv` + `data: str | list[dict]` → `artifact_id: str` + `data: list[dict]`
- `MergeInput`: `left_csv`/`right_csv` → `left_artifact_id`/`right_artifact_id`
- Tools pass `UUID` or `DataFrame` directly to the SDK (which already accepts both)
- Empty inline data (`data=[]`) is now rejected at validation time
New tools

- `everyrow_upload_data` — fetch from URLs (incl. Google Sheets), local CSV paths (stdio), upload via `create_table_artifact` → returns `artifact_id`
- `everyrow_request_upload_url` (HTTP only) — HMAC-signed presigned URLs for large file uploads
New modules

- `uploads.py` — HMAC signing, `request_upload_url` tool, `PUT /api/uploads/{upload_id}` REST endpoint

Transport-aware server instructions
- HTTP-mode instructions steer the agent to `request_upload_url` for local files instead of file paths

Security hardening
- `_SSRFSafeTransport` to close TOCTOU gap, blocked hostname list (metadata endpoints), redirect validation
- `BodySizeLimitMiddleware` enforces upload size limits even for chunked-encoding requests
- `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy`, HSTS, `Cache-Control` on all responses
- `shlex.quote()` on curl command arguments
- `Path.resolve()` in `validate_csv_path` before validation
- `REDIS_SSL` setting for encrypted connections
- `repr=False` on `redis_password`, `supabase_anon_key`, `upload_secret`, `everyrow_api_key`
- `no-new-privileges`, `cap_drop`, read-only root filesystem, resource limits, isolated network
- `Vary: Origin`, `Access-Control-Max-Age`, `X-Content-Type-Options` on downloads
- `127.0.0.1` binding instead of `0.0.0.0`
- `max_fetch_size_bytes` cap for URL downloads
- `.dockerignore` added to exclude tests, docs, git artifacts from image

Commits
- `e640f40` Unified input API: artifact_id + data replaces input_csv
- `18e95de` Log input source (artifact_id vs data) on task submission
- `d1556cc` Move input resolution to model properties; fix review bugs
- `453d9c5` Require UPLOAD_SECRET env var for multi-pod HMAC signing
- `4adc7bb` Fix Google Sheets /pub URL handling and empty CSV error message
- `9e4539b` Add transport-aware server instructions; security hardening
- `51afb4a` Security hardening: shell injection, headers, Redis TLS, container lockdown
- `d7de387` Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore
- `d9d2847` Fix Docker build, Redis healthcheck, and reject empty inline data

Test plan
- `upload_data` with Google Sheets URL → artifact_id
- `screen` with artifact_id → task submitted and completed
- `screen` with inline data → task submitted and completed
- `upload_data` with local file path rejected in HTTP mode with helpful error

🤖 Generated with Claude Code