Unified input API: artifact_id + data replaces input_csv #199

Closed
RafaelPo wants to merge 11 commits into main from feat/unified-input-api

Conversation


@RafaelPo RafaelPo commented Feb 24, 2026

Summary

Simplifies processing tool inputs to artifact_id + data: list[dict], adds upload_data as a universal data ingestion tool, and adds security hardening across the HTTP deployment.

Input API

  • _SingleSourceInput: input_csv + data: str | list[dict] → artifact_id: str + data: list[dict]
  • MergeInput: left_csv/right_csv → left_artifact_id/right_artifact_id
  • Processing tools pass UUID or DataFrame directly to SDK (which already accepts both)
  • Empty inline data (data=[]) is now rejected at validation time

New tools

  • everyrow_upload_data — fetch from URLs (incl. Google Sheets), local CSV paths (stdio), upload via create_table_artifact → returns artifact_id
  • everyrow_request_upload_url (HTTP only) — HMAC-signed presigned URLs for large file uploads

New modules

  • uploads.py — HMAC signing, request_upload_url tool, PUT /api/uploads/{upload_id} REST endpoint

Transport-aware server instructions

  • MCP server sends workflow instructions appropriate for stdio vs HTTP mode
  • HTTP mode guides agents to use request_upload_url for local files instead of file paths

Security hardening

  • SSRF protection: DNS validation + _SSRFSafeTransport to close TOCTOU gap, blocked hostname list (metadata endpoints), redirect validation
  • Body size limit: ASGI-level BodySizeLimitMiddleware enforces upload size limits even for chunked-encoding requests
  • Security headers: X-Content-Type-Options, X-Frame-Options, Referrer-Policy, HSTS, Cache-Control on all responses
  • Shell injection: shlex.quote() on curl command arguments
  • Path traversal: Path.resolve() in validate_csv_path before validation
  • Redis TLS: REDIS_SSL setting for encrypted connections
  • Secrets in repr: repr=False on redis_password, supabase_anon_key, upload_secret, everyrow_api_key
  • Token encryption: upload metadata API tokens encrypted at rest in Redis (Fernet)
  • Container hardening: no-new-privileges, cap_drop, read-only root filesystem, resource limits, isolated network
  • Rate limiting: in-memory fallback with hard cap (50k entries), applied to all HTTP modes
  • CORS: Vary: Origin, Access-Control-Max-Age, X-Content-Type-Options on downloads
  • No-auth mode: defaults to 127.0.0.1 binding instead of 0.0.0.0
  • UPLOAD_SECRET required: fail-fast if not set in HTTP auth mode (multi-pod signing)
  • Response size limit: streaming fetch with max_fetch_size_bytes cap for URL downloads
  • .dockerignore added to exclude tests, docs, git artifacts from image
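
To illustrate one item on the list above, the security-headers hardening can be done as a small pure-ASGI middleware. This is a sketch, not the PR's actual SecurityHeadersMiddleware, and the header values are plausible defaults rather than the ones shipped:

```python
SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"x-frame-options", b"DENY"),
    (b"referrer-policy", b"no-referrer"),
    (b"strict-transport-security", b"max-age=63072000; includeSubDomains"),
    (b"cache-control", b"no-store"),
]


class SecurityHeadersMiddleware:
    """Append security headers to every HTTP response, without clobbering
    headers the application already set."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                for name, value in SECURITY_HEADERS:
                    if name not in present:
                        headers.append((name, value))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)
```

Being ASGI-level, it applies uniformly to MCP responses, REST endpoints, and error responses alike.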

Commits

  1. e640f40 Unified input API: artifact_id + data replaces input_csv
  2. 18e95de Log input source (artifact_id vs data) on task submission
  3. d1556cc Move input resolution to model properties; fix review bugs
  4. 453d9c5 Require UPLOAD_SECRET env var for multi-pod HMAC signing
  5. 4adc7bb Fix Google Sheets /pub URL handling and empty CSV error message
  6. 9e4539b Add transport-aware server instructions; security hardening
  7. 51afb4a Security hardening: shell injection, headers, Redis TLS, container lockdown
  8. d7de387 Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore
  9. d9d2847 Fix Docker build, Redis healthcheck, and reject empty inline data

Test plan

  • 278 unit tests pass (+ 18 skipped integration tests)
  • Docker build + deploy with Cloudflare tunnel
  • Manual: upload_data with Google Sheets URL → artifact_id
  • Manual: screen with artifact_id → task submitted and completed
  • Manual: screen with inline data → task submitted and completed
  • Manual: upload_data with local file path rejected in HTTP mode with helpful error
  • Health endpoint returns 200 through tunnel

🤖 Generated with Claude Code

@RafaelPo
Contributor Author

@claude code review

@github-actions
Contributor

github-actions bot commented Feb 24, 2026

Claude finished @RafaelPo's task in 2m 1s


Code Review: Unified Input API

  • Read repo conventions
  • Review all 14 changed files
  • Analyze for bugs, security issues, and correctness
  • Post review findings

Overall this is a well-structured API migration. The artifact_id + data pattern is cleaner than the old input_csv/input_data/input_json split, and the upload system (HMAC-signed presigned URLs with consume-on-use Redis metadata) is solid. Tests are comprehensive.

I found a few issues worth addressing:


1. Bug: Sync context manager in async code

uploads.py:208 — The Sentry review is correct. with create_client() as client: enters the synchronous __enter__, which initializes a sync httpx.Client. But create_session and create_table_artifact perform async HTTP calls that need httpx.AsyncClient (initialized by __aenter__).

```python
# Current (broken)
with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)

# Fix
async with create_client() as client:
    async with create_session(client=client) as session:
        artifact_id = await create_table_artifact(df, session)
```

This will cause runtime failures when the upload endpoint is called in production.



2. Bug: Duplicate max_inline_rows field in Settings

config.py:16-19 and config.py:78-81 — max_inline_rows is defined twice in the same Settings class with different defaults:

| Line | Default | Description |
| --- | --- | --- |
| 16 | `50_000` | "Maximum number of rows allowed in inline JSON data" |
| 78 | `5_000` | "Maximum rows allowed in inline data (list[dict])." |

Pydantic uses the last definition, so the effective limit is 5,000 — a 10x reduction from the original. If that's intentional, remove the first definition. If not, remove the second one (or consolidate with the intended value).
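
The behavior is plain Python class-body semantics, which is why Pydantic only ever sees the later definition: a later assignment to a name silently overwrites the earlier one. An illustrative stand-in (not the real config class):

```python
class Settings:
    # Both "definitions" live in one class body; the second assignment
    # overwrites the first before Pydantic (or anything else) can see it.
    max_inline_rows: int = 50_000
    max_inline_rows: int = 5_000
```

The class namespace and `__annotations__` each end up with a single `max_inline_rows` entry holding the last value, so no error or warning is raised.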



3. Bug: Integration tests still reference removed input_csv parameter

test_stdio_content.py:755 and test_stdio_content.py:829 — These integration tests still pass input_csv as a tool parameter, but ScreenInput / AgentInput now use artifact_id / data (with extra="forbid"). These would raise ValidationError if RUN_INTEGRATION_TESTS=1 is set:

```python
# Line 755 — still uses input_csv
"params": {
    "task": "Filter for remote positions",
    "input_csv": jobs_csv,  # ← removed field, will fail
}

# Should be:
"params": {
    "task": "Filter for remote positions",
    "data": [{"company": "Airtable", ...}, ...],
    # or use artifact_id from upload_data
}
```

Same issue at line 829 for the agent pipeline test.



4. Nit: Prefer is not None over truthy check for artifact_id

tools.py:67 — if params.artifact_id: uses a truthy check. While the model validator prevents empty strings from reaching here (UUID validation would fail), is not None is more explicit and consistent with the rest of the codebase:

```python
def _resolve_input(params) -> UUID | pd.DataFrame:
    if params.artifact_id is not None:  # more explicit
        return UUID(params.artifact_id)
    return pd.DataFrame(params.data)
```

5. Minor: Upload endpoint lacks per-user auth

http_config.py:104 — PUT /api/uploads/{upload_id} is registered as a custom route without OAuth middleware. It relies solely on HMAC signature verification. This is likely intentional (designed for curl -T from client sandboxes), but worth confirming that the HMAC-only approach is sufficient since any holder of the signed URL can upload without identity verification.


Summary

| Severity | Count | Items |
| --- | --- | --- |
| Bug | 3 | Sync/async context manager mismatch, duplicate config field, broken integration tests |
| Nit | 1 | Truthy vs `is not None` check |
| Note | 1 | Upload auth design consideration |

Items 1 and 2 should be fixed before merge. Item 3 is low-risk since those tests are gated by RUN_INTEGRATION_TESTS=1.

@RafaelPo
Contributor Author

@claude Review the code. Focus on security and safety

@github-actions
Contributor

github-actions bot commented Feb 24, 2026

Claude finished @RafaelPo's task in 1m 33s


Security & Safety Review

  • Read repo conventions
  • Review upload system (HMAC signing, presigned URLs, file ingestion)
  • Review input validation and model changes
  • Review middleware and auth changes
  • Review utility functions (URL handling, data parsing)
  • Review Redis store changes
  • Post findings

I reviewed all 23 changed files with a security and safety focus. The SSRF protections, HMAC signing, consume-on-use Redis patterns, and input validation are well-implemented. Here are the findings:


1. Security: API token stored in Redis upload metadata

uploads.py:134-139 — The user's API token is serialized into Redis as part of the upload metadata:

```python
meta = json.dumps({
    "upload_id": upload_id,
    "filename": params.filename,
    "expires_at": expires_at,
    "api_token": api_token,  # ← plaintext API token in Redis
})
```

This is a pragmatic design choice (the upload endpoint needs to act on behalf of the user), but worth noting:

  • The token is stored with a short TTL (upload_url_ttl, default 5 min) and consumed atomically via GETDEL, limiting the exposure window.
  • However, if Redis is compromised or an attacker gains read access, they can extract valid API tokens during the TTL window.
  • Consider whether the token could be encrypted at rest in Redis using the UPLOAD_SECRET as the encryption key, or whether the risk is acceptable given the short TTL and consume-on-use pattern.
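
If encryption at rest were adopted, one common approach is deriving a Fernet-format key from the existing UPLOAD_SECRET. A sketch under the assumption that `cryptography.fernet` would consume the result; a bare SHA-256 is used here where a production version might prefer HKDF:

```python
import base64
import hashlib


def derive_fernet_key(upload_secret: str) -> bytes:
    """Derive a urlsafe-base64-encoded 32-byte key (the format Fernet expects)
    deterministically from the deployment's UPLOAD_SECRET."""
    digest = hashlib.sha256(upload_secret.encode()).digest()
    return base64.urlsafe_b64encode(digest)


# Usage (assumes the third-party cryptography package):
#   from cryptography.fernet import Fernet
#   f = Fernet(derive_fernet_key(settings.upload_secret))
#   encrypted_token = f.encrypt(api_token.encode())
```

Deterministic derivation matters for the multi-pod case: every pod sharing UPLOAD_SECRET can decrypt metadata written by any other pod.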

2. Security: Upload endpoint body buffering before Content-Length rejection

uploads.py:200-218 — The Content-Length check is done before request.body(), which is good. However, if Content-Length is not provided (chunked transfer encoding), the full body is read into memory before the size check at line 217:

```python
content_length_str = request.headers.get("content-length", "")
if content_length_str:
    # ... early reject if too large
    ...

body = await request.body()  # ← full body buffered regardless
if len(body) > settings.max_upload_size_bytes:
    return None, None, JSONResponse({"error": "File too large"}, status_code=413)
```

A client omitting Content-Length (e.g. curl --data-binary with chunked encoding) can force the server to buffer up to the ASGI server's body limit before the size check triggers. With the default max_upload_size_bytes of 50 MB this is manageable, but in a multi-tenant deployment consider adding an ASGI-level body size limit (e.g. via uvicorn --limit-request-body or middleware) as defense-in-depth.
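
A sketch of the suggested defense-in-depth: an ASGI-level middleware that counts bytes as chunks arrive, so a chunked request is cut off at the limit instead of being buffered in full. The PR's actual BodySizeLimitMiddleware presumably returns a 413 response; this sketch just raises to keep it short:

```python
class BodySizeLimitMiddleware:
    """Enforce a request-body size cap at the ASGI layer, chunk by chunk,
    so it works even when Content-Length is absent (chunked encoding)."""

    def __init__(self, app, max_bytes: int):
        self.app = app
        self.max_bytes = max_bytes

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        received = 0

        async def limited_receive():
            nonlocal received
            message = await receive()
            if message["type"] == "http.request":
                received += len(message.get("body", b""))
                if received > self.max_bytes:
                    # A real implementation would send a 413 response here.
                    raise ValueError("request body too large")
            return message

        await self.app(scope, limited_receive, send)
```

Because the check runs per `http.request` message, memory use stays bounded by the chunk size rather than the total body size.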


3. Security: SSRF protection with DNS rebinding gap

utils.py:42-63 — The SSRF protection resolves the hostname and checks against blocked networks before the HTTP request:

```python
def _validate_url_target(url: str) -> None:
    addrinfos = socket.getaddrinfo(hostname, ...)
    for ..., sockaddr in addrinfos:
        if _is_blocked_ip(sockaddr[0]):
            raise ValueError(...)

async def fetch_csv_from_url(url: str) -> pd.DataFrame:
    _validate_url_target(url)           # ← check DNS here
    async with httpx.AsyncClient(...):  # ← connect here (re-resolves DNS)
        response = await client.get(url)
```

There's a classic TOCTOU gap: the DNS resolution at validation time and the DNS resolution at connection time are separate. An attacker with a short-TTL DNS record could make the first resolution return a public IP and the second return 127.0.0.1. The redirect hook (_check_redirect) partially mitigates this for redirects, but not for the initial connection.

This is a well-known limitation that's hard to fully solve without a custom DNS resolver or connecting via IP. The current implementation is reasonable — the redirect hook covers the most common attack vector (open redirects to internal services). Just noting this for awareness.
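
For reference, the blocklist-plus-resolution check described above can be sketched with the stdlib `ipaddress` and `socket` modules. The function names and the exact network list here are assumptions, not the repo's `_is_blocked_ip`; closing the TOCTOU gap then means pinning the connection to one of the validated addresses (e.g. via a custom transport) rather than letting the HTTP client re-resolve the name:

```python
import ipaddress
import socket

# Assumed blocklist: RFC 1918, loopback, link-local, and IPv6 equivalents.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(n)
    for n in (
        "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "169.254.0.0/16",  # link-local, incl. cloud metadata 169.254.169.254
        "::1/128", "fc00::/7", "fe80::/10",
    )
]


def is_blocked_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def resolve_and_check(hostname: str) -> list[str]:
    """Resolve hostname and reject if any address is internal.

    To actually close the TOCTOU gap, a transport would connect to one of
    the addresses returned here instead of re-resolving the hostname.
    """
    infos = socket.getaddrinfo(hostname, None)
    addrs = [info[4][0] for info in infos]
    for ip in addrs:
        if is_blocked_ip(ip):
            raise ValueError("URL resolves to a blocked address")
    return addrs
```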


4. Bug (confirmed from prior review): with instead of async with for client in handle_upload

uploads.py:246-254 — The previous review correctly identified this. The handle_upload endpoint now uses AuthenticatedClient directly (not through create_client()) and calls create_session without the problematic sync context manager. Looking at the current code:

```python
client = AuthenticatedClient(
    base_url=settings.everyrow_api_url,
    token=api_token,
    ...
)
async with create_session(client=client) as session:
    artifact_id = await create_table_artifact(df, session)
```

This is correct — AuthenticatedClient is constructed directly without a context manager, and create_session is properly used as an async context manager. The prior review's issue (item #1) appears to have been already fixed in the latest commit.


5. Nit: validate_csv_path allows path traversal sequences in filename

utils.py:186-198 — validate_csv_path checks that the path is absolute, exists, is a file, and ends in .csv, but doesn't canonicalize/resolve the path. In stdio mode an LLM agent could pass /../../../etc/passwd.csv (if it existed). The is_absolute() + exists() + is_file() + .csv suffix checks make exploitation very unlikely in practice, but Path.resolve() before validation would eliminate symlink-based traversal:

```python
def validate_csv_path(path: str) -> None:
    p = Path(path).resolve()  # ← canonicalize
    # ... rest of checks
```

6. Good practices observed

  • HMAC timing-safe comparison (hmac.compare_digest) — prevents timing attacks on signature verification (uploads.py:78)
  • Consume-on-use Redis patterns — GETDEL for upload metadata, auth codes, and refresh tokens prevents replay attacks
  • SSRF blocklist covers RFC 1918, loopback, link-local, and IPv6 equivalents (utils.py:20-30)
  • Redirect target validation via httpx event hook (utils.py:115-126)
  • Rate limiting on auth endpoints and API requests with Redis + in-memory fallback
  • extra="forbid" on all input models prevents parameter injection
  • UUID validation on all user-supplied IDs
  • Generic error messages in upload endpoint — no internal details leaked
  • __Host- cookie prefix on auth state cookie enforces Secure + path=/
  • Content-Disposition header sanitization on CSV download (routes.py:135)
  • Redis key sanitization via regex (redis_store.py:36-42)
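
The HMAC pattern praised above can be sketched as follows. The signed-message layout (upload_id plus expiry) is a hypothetical illustration, not the PR's exact signing scheme:

```python
import hashlib
import hmac
import time


def sign_upload_url(secret: str, upload_id: str, expires_at: int) -> str:
    """Produce a hex HMAC-SHA256 signature over the upload id and expiry."""
    msg = f"{upload_id}:{expires_at}".encode()
    return hmac.new(secret.encode(), msg, hashlib.sha256).hexdigest()


def verify_signature(secret: str, upload_id: str, expires_at: int, signature: str) -> bool:
    """Check expiry first, then compare signatures in constant time."""
    if time.time() > expires_at:
        return False
    expected = sign_upload_url(secret, upload_id, expires_at)
    # compare_digest avoids leaking how many leading chars matched.
    return hmac.compare_digest(expected, signature)
```

Because `expires_at` is part of the signed message, a client cannot extend its own deadline without invalidating the signature.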

Summary

| Severity | Count | Items |
| --- | --- | --- |
| Security note | 3 | API token in Redis, chunked upload body buffering, DNS TOCTOU |
| Nit | 1 | Path canonicalization in `validate_csv_path` |
| Good practice | 10+ | HMAC timing safety, consume-on-use, SSRF blocklist, rate limiting, etc. |

The security posture is solid. The findings above are defense-in-depth considerations rather than exploitable vulnerabilities. The short-TTL consume-on-use pattern for API tokens in Redis is the most material item — worth a brief threat model discussion on whether encryption at rest is warranted for your deployment.

RafaelPo and others added 9 commits February 24, 2026 19:29
Processing tools now accept artifact_id (UUID from upload_data) or
data (list[dict]) instead of input_csv/input_data/input_json. Adds
upload_data tool for URL/file ingestion and request_upload_url for
presigned large-file uploads in HTTP mode.

Phase 1: Simplified _SingleSourceInput and MergeInput models
Phase 2: upload_data tool (URL + local path + Google Sheets)
Phase 3: Presigned URL upload system (HMAC, Redis metadata, REST endpoint)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adopt _aid_or_dataframe and _input_data_mode properties on models,
  removing free functions from tools.py
- Add left/right properties to MergeInput
- Use `is not None` for artifact_id checks
- Fix sync context manager in uploads.py (with → async with)
- Remove duplicate max_inline_rows in config.py
- Update integration tests to use data instead of removed input_csv
- Reject empty CSV in upload_data local file path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove auto-generated per-process secret — it breaks when pods
don't share state. Now fails fast with a clear error if unset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- /pub URLs now correctly convert to /export?format=csv
- Headers-only CSV from URL raises clear "empty CSV" error instead
  of misleading "could not parse as CSV or JSON"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instructions:
- Add _INSTRUCTIONS_STDIO and _INSTRUCTIONS_HTTP to app.py
- HTTP instructions guide agent to use request_upload_url for local files
- server.py sets instructions based on transport mode

Security & correctness (from parallel review):
- SSRF protection: block internal IPs in URL fetching
- __Host- cookie prefix for auth state cookie
- Rate limiter: in-memory fallback when Redis unavailable
- Upload endpoint: use caller's API token, limit CSV rows
- Poll token via Authorization header (not just query param)
- Progress URL no longer leaks poll token in URL

Also:
- Update upload_data docstring and error message for HTTP mode
- Sync manifest.json description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security hardening: shell injection, headers, Redis TLS, container lockdown

- Fix shell injection in upload curl command via shlex.quote()
- Add SecurityHeadersMiddleware (HSTS, X-Content-Type-Options, X-Frame-Options,
  Cache-Control, Referrer-Policy) on all HTTP responses
- Add Redis TLS support (REDIS_SSL setting)
- Stream URL fetch with size limit (max_fetch_size_bytes) to prevent OOM
- Validate UPLOAD_SECRET at startup instead of first request
- Warn on missing REDIS_PASSWORD in HTTP mode at startup
- Enable rate limiting in --no-auth mode; cap in-memory fallback at 50K entries
- Container hardening: cap_drop ALL, no-new-privileges, read-only rootfs,
  CPU limits, REDISCLI_AUTH for healthcheck, pinned Redis image, network isolation
- Add --frozen to Dockerfile uv sync to prevent lockfile drift
- Sanitize SSRF error (no longer leaks resolved IPs)
- Add repr=False on sensitive config fields to prevent accidental logging
- Add Vary: Origin, Access-Control-Max-Age on CORS preflight responses
- Default to 127.0.0.1 in --no-auth mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security hardening: SSRF TOCTOU, user isolation, token encryption, .dockerignore

- Fix SSRF DNS-rebinding TOCTOU: add _SSRFSafeTransport that re-validates
  hostnames at request time; block GKE metadata hostname; cap max_redirects=5
- Add user-scoped data isolation: record task owner (JWT sub) on submission,
  check ownership in everyrow_results_http to prevent cross-user access
- Encrypt tokens at rest in Redis using Fernet (derived from UPLOAD_SECRET):
  task tokens, poll tokens, auth codes, refresh tokens, upload metadata
- Add root .dockerignore with deny-all allowlist to prevent secrets leaking
  into Docker build context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove --frozen from Dockerfile (incompatible with --no-sources)
- Fix Redis healthcheck: pass REDIS_PASSWORD as env var, remove
  cap_drop ALL which prevented Redis user switching
- Reject data=[] in _SingleSourceInput and MergeInput validators
  to prevent wasteful zero-row task submissions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RafaelPo RafaelPo force-pushed the feat/unified-input-api branch from 1076043 to 17d401f on February 24, 2026 at 19:32
RafaelPo and others added 2 commits February 24, 2026 19:35
Deploy workflow should only run on workflow_dispatch. PR checks
are handled by the CI workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
assert body is not None and meta is not None  # type narrowing

# Retrieve and decrypt the user's API token
api_token = decrypt_value(meta.get("api_token", ""))
```

Bug: Calling decrypt_value with an empty string when an api_token is missing from metadata raises an unhandled InvalidToken exception, causing a 500 error instead of a 403.
Severity: HIGH

Suggested Fix

Wrap the call to decrypt_value in a try...except InvalidToken block to catch the exception and handle it as an authorization failure. Alternatively, explicitly check for the existence of the api_token key in the meta dictionary before attempting decryption.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: everyrow-mcp/src/everyrow_mcp/uploads.py#L232

Potential issue: The function `decrypt_value` is called with the result of
`meta.get("api_token", "")`. If the `api_token` key is missing from the upload metadata,
an empty string is passed to the decryption function. When `UPLOAD_SECRET` is
configured, the underlying decryption library raises an `InvalidToken` exception when
attempting to decrypt an empty string. This exception is not handled, causing the
request to fail with a 500 server error instead of returning the intended 403
authorization error. This behavior is triggered in HTTP mode for any upload that lacks
an API token.

@RafaelPo RafaelPo removed the request for review from straeter February 24, 2026 19:46
@RafaelPo RafaelPo marked this pull request as draft February 24, 2026 19:47
@RafaelPo
Contributor Author

Superseded by two focused PRs:

Splitting makes review easier and isolates risk. The deploy-mcp workflow change is already in #206.

@RafaelPo RafaelPo closed this Feb 24, 2026