Unified input API: artifact_id + data replaces input_csv #199

Closed
RafaelPo wants to merge 11 commits into main from feat/unified-input-api

Conversation


@RafaelPo RafaelPo commented Feb 24, 2026

Summary

Simplifies processing tool inputs to artifact_id + data: list[dict], adds upload_data as a universal data ingestion tool, and adds security hardening across the HTTP deployment.

Input API

  • _SingleSourceInput: input_csv + data: str | list[dict] → artifact_id: str + data: list[dict]
  • MergeInput: left_csv/right_csv → left_artifact_id/right_artifact_id
  • Processing tools pass UUID or DataFrame directly to SDK (which already accepts both)
  • Empty inline data (data=[]) is now rejected at validation time

New tools

  • everyrow_upload_data — fetch from URLs (incl. Google Sheets), local CSV paths (stdio), upload via create_table_artifact → returns artifact_id
  • everyrow_request_upload_url (HTTP only) — HMAC-signed presigned URLs for large file uploads

New modules

  • uploads.py — HMAC signing, request_upload_url tool, PUT /api/uploads/{upload_id} REST endpoint

Transport-aware server instructions

  • MCP server sends workflow instructions appropriate for stdio vs HTTP mode
  • HTTP mode guides agents to use request_upload_url for local files instead of file paths

Security hardening

  • SSRF protection: DNS validation + _SSRFSafeTransport to close TOCTOU gap, blocked hostname list (metadata endpoints), redirect validation
  • Body size limit: ASGI-level BodySizeLimitMiddleware enforces upload size limits even for chunked-encoding requests
  • Security headers: X-Content-Type-Options, X-Frame-Options, Referrer-Policy, HSTS, Cache-Control on all responses
  • Shell injection: shlex.quote() on curl command arguments
  • Path traversal: Path.resolve() in validate_csv_path before validation
  • Redis TLS: REDIS_SSL setting for encrypted connections
  • Secrets in repr: repr=False on redis_password, supabase_anon_key, upload_secret, everyrow_api_key
  • Token encryption: upload metadata API tokens encrypted at rest in Redis (Fernet)
  • Container hardening: no-new-privileges, cap_drop, read-only root filesystem, resource limits, isolated network
  • Rate limiting: in-memory fallback with hard cap (50k entries), applied to all HTTP modes
  • CORS: Vary: Origin, Access-Control-Max-Age, X-Content-Type-Options on downloads
  • No-auth mode: defaults to 127.0.0.1 binding instead of 0.0.0.0
  • UPLOAD_SECRET required: fail-fast if not set in HTTP auth mode (multi-pod signing)
  • Response size limit: streaming fetch with max_fetch_size_bytes cap for URL downloads
  • .dockerignore added to exclude tests, docs, git artifacts from image
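
To illustrate one item on the list above, the security-headers hardening can be done as a small pure-ASGI middleware. This is a sketch, not the PR's actual SecurityHeadersMiddleware, and the header values are plausible defaults rather than the ones shipped:

```python
SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"x-frame-options", b"DENY"),
    (b"referrer-policy", b"no-referrer"),
    (b"strict-transport-security", b"max-age=63072000; includeSubDomains"),
    (b"cache-control", b"no-store"),
]


class SecurityHeadersMiddleware:
    """Append security headers to every HTTP response, without clobbering
    headers the application already set."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                for name, value in SECURITY_HEADERS:
                    if name not in present:
                        headers.append((name, value))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)
```

Being ASGI-level, it applies uniformly to MCP responses, REST endpoints, and error responses alike.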

Commits

  1. e640f40 Unified input API: artifact_id + data replaces input_csv
  2. 18e95de Log input source (artifact_id vs data) on task submission
  3. d1556cc Move input resolution to model properties; fix review bugs
  4. 453d9c5 Require UPLOAD_SECRET env var for multi-pod HMAC signing
  5. 4adc7bb Fix Google Sheets /pub URL handling and empty CSV error message
  6. 9e4539b Add transport-aware server instructions; security hardening
  7. 51afb4a Security hardening: shell injection, headers, Redis TLS, container lockdown
  8. d7de387 Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore
  9. d9d2847 Fix Docker build, Redis healthcheck, and reject empty inline data

Test plan

  • 278 unit tests pass (+ 18 skipped integration tests)
  • Docker build + deploy with Cloudflare tunnel
  • Manual: upload_data with Google Sheets URL → artifact_id
  • Manual: screen with artifact_id → task submitted and completed
  • Manual: screen with inline data → task submitted and completed
  • Manual: upload_data with local file path rejected in HTTP mode with helpful error
  • Health endpoint returns 200 through tunnel

🤖 Generated with Claude Code

@RafaelPo
Contributor Author

@claude code review

@github-actions
Contributor

github-actions bot commented Feb 24, 2026

Claude finished @RafaelPo's task in 2m 1s


Code Review: Unified Input API

  • Read repo conventions
  • Review all 14 changed files
  • Analyze for bugs, security issues, and correctness
  • Post review findings

Overall this is a well-structured API migration. The artifact_id + data pattern is cleaner than the old input_csv/input_data/input_json split, and the upload system (HMAC-signed presigned URLs with consume-on-use Redis metadata) is solid. Tests are comprehensive.

I found a few issues worth addressing:


1. Bug: Sync context manager in async code

uploads.py:208 — The Sentry review is correct. with create_client() as client: enters the synchronous __enter__, which initializes a sync httpx.Client. But create_session and create_table_artifact perform async HTTP calls that need httpx.AsyncClient (initialized by __aenter__).

```python
# Current (broken)
with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)

# Fix
async with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)
```

This will cause runtime failures when the upload endpoint is called in production.



2. Bug: Duplicate max_inline_rows field in Settings

config.py:16-19 and config.py:78-81 — max_inline_rows is defined twice in the same Settings class with different defaults:

| Line | Default | Description |
| --- | --- | --- |
| 16 | `50_000` | "Maximum number of rows allowed in inline JSON data" |
| 78 | `5_000` | "Maximum rows allowed in inline data (list[dict])." |

Pydantic uses the last definition, so the effective limit is 5,000 — a 10x reduction from the original. If that's intentional, remove the first definition. If not, remove the second one (or consolidate with the intended value).
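
The behavior is plain Python class-body semantics, which is why Pydantic only ever sees the later definition: a later assignment to a name silently overwrites the earlier one. An illustrative stand-in (not the real config class):

```python
class Settings:
    # Both "definitions" live in one class body; the second assignment
    # overwrites the first before Pydantic (or anything else) can see it.
    max_inline_rows: int = 50_000
    max_inline_rows: int = 5_000
```

The class namespace and `__annotations__` each end up with a single `max_inline_rows` entry holding the last value, so no error or warning is raised.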



3. Bug: Integration tests still reference removed input_csv parameter

test_stdio_content.py:755 and test_stdio_content.py:829 — These integration tests still pass input_csv as a tool parameter, but ScreenInput / AgentInput now use artifact_id / data (with extra="forbid"). These would raise ValidationError if RUN_INTEGRATION_TESTS=1 is set:

```python
# Line 755 — still uses input_csv
"params": {
    "task": "Filter for remote positions",
    "input_csv": jobs_csv,  # ← removed field, will fail
}

# Should be:
"params": {
    "task": "Filter for remote positions",
    "data": [{"company": "Airtable", ...}, ...],
    # or use artifact_id from upload_data
}
```

Same issue at line 829 for the agent pipeline test.



4. Nit: Prefer is not None over truthy check for artifact_id

tools.py:67 — if params.artifact_id: uses a truthy check. While the model validator prevents empty strings from reaching here (UUID validation would fail), is not None is more explicit and consistent with the rest of the codebase:

```python
def _resolve_input(params) -> UUID | pd.DataFrame:
    if params.artifact_id is not None:  # more explicit
        return UUID(params.artifact_id)
    return pd.DataFrame(params.data)
```

5. Minor: Upload endpoint lacks per-user auth

http_config.py:104 — PUT /api/uploads/{upload_id} is registered as a custom route without OAuth middleware. It relies solely on HMAC signature verification. This is likely intentional (designed for curl -T from client sandboxes), but worth confirming that the HMAC-only approach is sufficient since any holder of the signed URL can upload without identity verification.


Summary

| Severity | Count | Items |
| --- | --- | --- |
| Bug | 3 | Sync/async context manager mismatch, duplicate config field, broken integration tests |
| Nit | 1 | Truthy vs `is not None` check |
| Note | 1 | Upload auth design consideration |

Items 1 and 2 should be fixed before merge. Item 3 is low-risk since those tests are gated by RUN_INTEGRATION_TESTS=1.

@RafaelPo
Contributor Author

@claude Review the code. Focus on security and safety

@github-actions
Contributor

github-actions bot commented Feb 24, 2026

Claude finished @RafaelPo's task in 1m 33s


Security & Safety Review

  • Read repo conventions
  • Review upload system (HMAC signing, presigned URLs, file ingestion)
  • Review input validation and model changes
  • Review middleware and auth changes
  • Review utility functions (URL handling, data parsing)
  • Review Redis store changes
  • Post findings

I reviewed all 23 changed files with a security and safety focus. The SSRF protections, HMAC signing, consume-on-use Redis patterns, and input validation are well-implemented. Here are the findings:


1. Security: API token stored in Redis upload metadata

uploads.py:134-139 — The user's API token is serialized into Redis as part of the upload metadata:

```python
meta = json.dumps({
    "upload_id": upload_id,
    "filename": params.filename,
    "expires_at": expires_at,
    "api_token": api_token,  # ← plaintext API token in Redis
})
```

This is a pragmatic design choice (the upload endpoint needs to act on behalf of the user), but worth noting:

  • The token is stored with a short TTL (upload_url_ttl, default 5 min) and consumed atomically via GETDEL, limiting the exposure window.
  • However, if Redis is compromised or an attacker gains read access, they can extract valid API tokens during the TTL window.
  • Consider whether the token could be encrypted at rest in Redis using the UPLOAD_SECRET as the encryption key, or whether the risk is acceptable given the short TTL and consume-on-use pattern.
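
If encryption at rest were adopted, one common approach is deriving a Fernet-format key from the existing UPLOAD_SECRET. A sketch under the assumption that `cryptography.fernet` would consume the result; a bare SHA-256 is used here where a production version might prefer HKDF:

```python
import base64
import hashlib


def derive_fernet_key(upload_secret: str) -> bytes:
    """Derive a urlsafe-base64-encoded 32-byte key (the format Fernet expects)
    deterministically from the deployment's UPLOAD_SECRET."""
    digest = hashlib.sha256(upload_secret.encode()).digest()
    return base64.urlsafe_b64encode(digest)


# Usage (assumes the third-party cryptography package):
#   from cryptography.fernet import Fernet
#   f = Fernet(derive_fernet_key(settings.upload_secret))
#   encrypted_token = f.encrypt(api_token.encode())
```

Deterministic derivation matters for the multi-pod case: every pod sharing UPLOAD_SECRET can decrypt metadata written by any other pod.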

2. Security: Upload endpoint body buffering before Content-Length rejection

uploads.py:200-218 — The Content-Length check is done before request.body(), which is good. However, if Content-Length is not provided (chunked transfer encoding), the full body is read into memory before the size check at line 217:

```python
content_length_str = request.headers.get("content-length", "")
if content_length_str:
    # ... early reject if too large
    ...

body = await request.body()  # ← full body buffered regardless
if len(body) > settings.max_upload_size_bytes:
    return None, None, JSONResponse({"error": "File too large"}, status_code=413)
```

A client omitting Content-Length (e.g. curl --data-binary with chunked encoding) can force the server to buffer up to the ASGI server's body limit before the size check triggers. With the default max_upload_size_bytes of 50 MB this is manageable, but in a multi-tenant deployment consider adding an ASGI-level body size limit (e.g. via uvicorn --limit-request-body or middleware) as defense-in-depth.
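
A sketch of the suggested defense-in-depth: an ASGI-level middleware that counts bytes as chunks arrive, so a chunked request is cut off at the limit instead of being buffered in full. The PR's actual BodySizeLimitMiddleware presumably returns a 413 response; this sketch just raises to keep it short:

```python
class BodySizeLimitMiddleware:
    """Enforce a request-body size cap at the ASGI layer, chunk by chunk,
    so it works even when Content-Length is absent (chunked encoding)."""

    def __init__(self, app, max_bytes: int):
        self.app = app
        self.max_bytes = max_bytes

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        received = 0

        async def limited_receive():
            nonlocal received
            message = await receive()
            if message["type"] == "http.request":
                received += len(message.get("body", b""))
                if received > self.max_bytes:
                    # A real implementation would send a 413 response here.
                    raise ValueError("request body too large")
            return message

        await self.app(scope, limited_receive, send)
```

Because the check runs per `http.request` message, memory use stays bounded by the chunk size rather than the total body size.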


3. Security: SSRF protection with DNS rebinding gap

utils.py:42-63 — The SSRF protection resolves the hostname and checks against blocked networks before the HTTP request:

```python
def _validate_url_target(url: str) -> None:
    addrinfos = socket.getaddrinfo(hostname, ...)
    for ..., sockaddr in addrinfos:
        if _is_blocked_ip(sockaddr[0]):
            raise ValueError(...)

async def fetch_csv_from_url(url: str) -> pd.DataFrame:
    _validate_url_target(url)           # ← check DNS here
    async with httpx.AsyncClient(...):  # ← connect here (re-resolves DNS)
        response = await client.get(url)
```

There's a classic TOCTOU gap: the DNS resolution at validation time and the DNS resolution at connection time are separate. An attacker with a short-TTL DNS record could make the first resolution return a public IP and the second return 127.0.0.1. The redirect hook (_check_redirect) partially mitigates this for redirects, but not for the initial connection.

This is a well-known limitation that's hard to fully solve without a custom DNS resolver or connecting via IP. The current implementation is reasonable — the redirect hook covers the most common attack vector (open redirects to internal services). Just noting this for awareness.
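
For reference, the blocklist-plus-resolution check described above can be sketched with the stdlib `ipaddress` and `socket` modules. The function names and the exact network list here are assumptions, not the repo's `_is_blocked_ip`; closing the TOCTOU gap then means pinning the connection to one of the validated addresses (e.g. via a custom transport) rather than letting the HTTP client re-resolve the name:

```python
import ipaddress
import socket

# Assumed blocklist: RFC 1918, loopback, link-local, and IPv6 equivalents.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(n)
    for n in (
        "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "169.254.0.0/16",  # link-local, incl. cloud metadata 169.254.169.254
        "::1/128", "fc00::/7", "fe80::/10",
    )
]


def is_blocked_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def resolve_and_check(hostname: str) -> list[str]:
    """Resolve hostname and reject if any address is internal.

    To actually close the TOCTOU gap, a transport would connect to one of
    the addresses returned here instead of re-resolving the hostname.
    """
    infos = socket.getaddrinfo(hostname, None)
    addrs = [info[4][0] for info in infos]
    for ip in addrs:
        if is_blocked_ip(ip):
            raise ValueError("URL resolves to a blocked address")
    return addrs
```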


4. Bug (confirmed from prior review): with instead of async with for client in handle_upload

uploads.py:246-254 — The previous review correctly identified this. The handle_upload endpoint now uses AuthenticatedClient directly (not through create_client()) and calls create_session without the problematic sync context manager. Looking at the current code:

```python
client = AuthenticatedClient(
    base_url=settings.everyrow_api_url,
    token=api_token,
    ...
)
async with create_session(client=client) as session:
    artifact_id = await create_table_artifact(df, session)
```

This is correct — AuthenticatedClient is constructed directly without a context manager, and create_session is properly used as an async context manager. The prior review's issue (item #1) appears to have been already fixed in the latest commit.


5. Nit: validate_csv_path allows path traversal sequences in filename

utils.py:186-198 — validate_csv_path checks that the path is absolute, exists, is a file, and ends in .csv, but doesn't canonicalize/resolve the path. In stdio mode an LLM agent could pass /../../../etc/passwd.csv (if it existed). The is_absolute() + exists() + is_file() + .csv suffix checks make exploitation very unlikely in practice, but Path.resolve() before validation would eliminate symlink-based traversal:

```python
def validate_csv_path(path: str) -> None:
    p = Path(path).resolve()  # ← canonicalize
    # ... rest of checks
```

6. Good practices observed

  • HMAC timing-safe comparison (hmac.compare_digest) — prevents timing attacks on signature verification (uploads.py:78)
  • Consume-on-use Redis patterns — GETDEL for upload metadata, auth codes, and refresh tokens prevents replay attacks
  • SSRF blocklist covers RFC 1918, loopback, link-local, and IPv6 equivalents (utils.py:20-30)
  • Redirect target validation via httpx event hook (utils.py:115-126)
  • Rate limiting on auth endpoints and API requests with Redis + in-memory fallback
  • extra="forbid" on all input models prevents parameter injection
  • UUID validation on all user-supplied IDs
  • Generic error messages in upload endpoint — no internal details leaked
  • __Host- cookie prefix on auth state cookie enforces Secure + path=/
  • Content-Disposition header sanitization on CSV download (routes.py:135)
  • Redis key sanitization via regex (redis_store.py:36-42)
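
The HMAC pattern praised above can be sketched as follows. The signed-message layout (upload_id plus expiry) is a hypothetical illustration, not the PR's exact signing scheme:

```python
import hashlib
import hmac
import time


def sign_upload_url(secret: str, upload_id: str, expires_at: int) -> str:
    """Produce a hex HMAC-SHA256 signature over the upload id and expiry."""
    msg = f"{upload_id}:{expires_at}".encode()
    return hmac.new(secret.encode(), msg, hashlib.sha256).hexdigest()


def verify_signature(secret: str, upload_id: str, expires_at: int, signature: str) -> bool:
    """Check expiry first, then compare signatures in constant time."""
    if time.time() > expires_at:
        return False
    expected = sign_upload_url(secret, upload_id, expires_at)
    # compare_digest avoids leaking how many leading chars matched.
    return hmac.compare_digest(expected, signature)
```

Because `expires_at` is part of the signed message, a client cannot extend its own deadline without invalidating the signature.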

Summary

| Severity | Count | Items |
| --- | --- | --- |
| Security note | 3 | API token in Redis, chunked upload body buffering, DNS TOCTOU |
| Nit | 1 | Path canonicalization in `validate_csv_path` |
| Good practice | 10+ | HMAC timing safety, consume-on-use, SSRF blocklist, rate limiting, etc. |

The security posture is solid. The findings above are defense-in-depth considerations rather than exploitable vulnerabilities. The short-TTL consume-on-use pattern for API tokens in Redis is the most material item — worth a brief threat model discussion on whether encryption at rest is warranted for your deployment.

RafaelPo and others added 9 commits February 24, 2026 19:29
Processing tools now accept artifact_id (UUID from upload_data) or
data (list[dict]) instead of input_csv/input_data/input_json. Adds
upload_data tool for URL/file ingestion and request_upload_url for
presigned large-file uploads in HTTP mode.

Phase 1: Simplified _SingleSourceInput and MergeInput models
Phase 2: upload_data tool (URL + local path + Google Sheets)
Phase 3: Presigned URL upload system (HMAC, Redis metadata, REST endpoint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adopt _aid_or_dataframe and _input_data_mode properties on models,
  removing free functions from tools.py
- Add left/right properties to MergeInput
- Use `is not None` for artifact_id checks
- Fix sync context manager in uploads.py (with → async with)
- Remove duplicate max_inline_rows in config.py
- Update integration tests to use data instead of removed input_csv
- Reject empty CSV in upload_data local file path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove auto-generated per-process secret — it breaks when pods
don't share state. Now fails fast with a clear error if unset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- /pub URLs now correctly convert to /export?format=csv
- Headers-only CSV from URL raises clear "empty CSV" error instead
  of misleading "could not parse as CSV or JSON"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instructions:
- Add _INSTRUCTIONS_STDIO and _INSTRUCTIONS_HTTP to app.py
- HTTP instructions guide agent to use request_upload_url for local files
- server.py sets instructions based on transport mode

Security & correctness (from parallel review):
- SSRF protection: block internal IPs in URL fetching
- __Host- cookie prefix for auth state cookie
- Rate limiter: in-memory fallback when Redis unavailable
- Upload endpoint: use caller's API token, limit CSV rows
- Poll token via Authorization header (not just query param)
- Progress URL no longer leaks poll token in URL

Also:
- Update upload_data docstring and error message for HTTP mode
- Sync manifest.json description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security hardening: shell injection, headers, Redis TLS, container lockdown

- Fix shell injection in upload curl command via shlex.quote()
- Add SecurityHeadersMiddleware (HSTS, X-Content-Type-Options, X-Frame-Options,
  Cache-Control, Referrer-Policy) on all HTTP responses
- Add Redis TLS support (REDIS_SSL setting)
- Stream URL fetch with size limit (max_fetch_size_bytes) to prevent OOM
- Validate UPLOAD_SECRET at startup instead of first request
- Warn on missing REDIS_PASSWORD in HTTP mode at startup
- Enable rate limiting in --no-auth mode; cap in-memory fallback at 50K entries
- Container hardening: cap_drop ALL, no-new-privileges, read-only rootfs,
  CPU limits, REDISCLI_AUTH for healthcheck, pinned Redis image, network isolation
- Add --frozen to Dockerfile uv sync to prevent lockfile drift
- Sanitize SSRF error (no longer leaks resolved IPs)
- Add repr=False on sensitive config fields to prevent accidental logging
- Add Vary: Origin, Access-Control-Max-Age on CORS preflight responses
- Default to 127.0.0.1 in --no-auth mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore

- Fix SSRF DNS-rebinding TOCTOU: add _SSRFSafeTransport that re-validates
  hostnames at request time; block GKE metadata hostname; cap max_redirects=5
- Add user-scoped data isolation: record task owner (JWT sub) on submission,
  check ownership in everyrow_results_http to prevent cross-user access
- Encrypt tokens at rest in Redis using Fernet (derived from UPLOAD_SECRET):
  task tokens, poll tokens, auth codes, refresh tokens, upload metadata
- Add root .dockerignore with deny-all allowlist to prevent secrets leaking
  into Docker build context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove --frozen from Dockerfile (incompatible with --no-sources)
- Fix Redis healthcheck: pass REDIS_PASSWORD as env var, remove
  cap_drop ALL which prevented Redis user switching
- Reject data=[] in _SingleSourceInput and MergeInput validators
  to prevent wasteful zero-row task submissions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RafaelPo RafaelPo force-pushed the feat/unified-input-api branch from 1076043 to 17d401f on February 24, 2026 at 19:32
RafaelPo and others added 2 commits February 24, 2026 19:35
Deploy workflow should only run on workflow_dispatch. PR checks
are handled by the CI workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
assert body is not None and meta is not None  # type narrowing

# Retrieve and decrypt the user's API token
api_token = decrypt_value(meta.get("api_token", ""))
```

Bug: Calling decrypt_value with an empty string when an api_token is missing from metadata raises an unhandled InvalidToken exception, causing a 500 error instead of a 403.
Severity: HIGH

Suggested Fix

Wrap the call to decrypt_value in a try...except InvalidToken block to catch the exception and handle it as an authorization failure. Alternatively, explicitly check for the existence of the api_token key in the meta dictionary before attempting decryption.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: everyrow-mcp/src/everyrow_mcp/uploads.py#L232

Potential issue: The function `decrypt_value` is called with the result of
`meta.get("api_token", "")`. If the `api_token` key is missing from the upload metadata,
an empty string is passed to the decryption function. When `UPLOAD_SECRET` is
configured, the underlying decryption library raises an `InvalidToken` exception when
attempting to decrypt an empty string. This exception is not handled, causing the
request to fail with a 500 server error instead of returning the intended 403
authorization error. This behavior is triggered in HTTP mode for any upload that lacks
an API token.

@RafaelPo RafaelPo removed the request for review from straeter February 24, 2026 19:46
@RafaelPo RafaelPo marked this pull request as draft February 24, 2026 19:47
@RafaelPo
Contributor Author

Superseded by two focused PRs:

Splitting makes review easier and isolates risk. The deploy-mcp workflow change is already in #206.

@RafaelPo RafaelPo closed this Feb 24, 2026