Superset streaming export bug #40465

@fl0-m

Bug description

In Superset 6.1.0, the new streaming CSV export pipeline introduced by #35478 ("feat(streaming): Streaming CSV uploads for over 100k records for constant memory usage") bypasses Superset's standard query-preparation pipeline. This produces two distinct regressions, both reproducible against Trino.

Bug 1 — CSV exports crash on Trino with __STREAM_ERROR__

The streaming path in superset/commands/streaming_export/base.py::_execute_query_and_stream sends raw chart SQL directly to engine.execute(text(sql)) without running it through database.mutate_sql_based_on_config() first. The SQL Superset generates for a chart ends with a LIMIT N; line — and Trino's HTTP statement endpoint rejects trailing semicolons as mismatched input ';'. Expecting: <EOF>.
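To make the failure concrete, here is a minimal sketch of the kind of normalization the streaming path is missing. The helper name strip_trailing_terminator is hypothetical and this is not Superset's implementation; it only shows that dropping the trailing terminator is enough to turn the failing statement into the form Trino accepts. It does not handle a trailing SQL comment after the semicolon.

```python
def strip_trailing_terminator(sql: str) -> str:
    """Drop trailing whitespace and semicolons from a single SQL statement.

    Hypothetical helper for illustration only. A semicolon inside a trailing
    string literal is safe because the statement then ends with a quote.
    """
    return sql.rstrip("; \t\r\n")


if __name__ == "__main__":
    failing = 'SELECT category\nFROM orders\nLIMIT 500000;'
    # The last line no longer carries the ';' that Trino rejects.
    print(strip_trailing_terminator(failing).splitlines()[-1])
```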

Because the streaming response has already flushed headers by the time the exception fires, Flask cannot change the status code. The generator instead writes the sentinel string __STREAM_ERROR__: Export failed. Please try again in some time. (63 bytes) into the response body and closes the stream. The user receives an HTTP 200 with that text inside what should have been their CSV file. The frontend has no way to distinguish this from a successful download.
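The following is a rough sketch of that mechanism, not the actual Superset generator. It assumes a generator that can only report a failure by writing into the body, plus a hypothetical client-side check (is_stream_error) that a caller could use to tell the sentinel apart from real CSV data. The names csv_stream, fetch_rows and is_stream_error are mine; the sentinel text is the one quoted above.

```python
from typing import Callable, Iterable, Iterator

STREAM_ERROR_PREFIX = b"__STREAM_ERROR__"
# Sentinel text as observed in the downloaded file (63 bytes).
STREAM_ERROR_BODY = (
    b"__STREAM_ERROR__: Export failed. Please try again in some time."
)


def csv_stream(fetch_rows: Callable[[], Iterable[bytes]]) -> Iterator[bytes]:
    """Yield CSV chunks; on failure, yield the sentinel instead of raising.

    Once a streaming response has started, the status line is already on the
    wire, so the body is the only place left to signal an error.
    """
    try:
        for chunk in fetch_rows():
            yield chunk
    except Exception:
        yield STREAM_ERROR_BODY


def is_stream_error(body: bytes) -> bool:
    """Return True if the last non-empty line of the body is the sentinel."""
    last_line = body.rstrip().rsplit(b"\n", 1)[-1]
    return last_line.startswith(STREAM_ERROR_PREFIX)


if __name__ == "__main__":
    def failing_query() -> Iterable[bytes]:
        raise RuntimeError("mismatched input ';'. Expecting: <EOF>")

    body = b"".join(csv_stream(failing_query))
    print(len(body), is_stream_error(body))
```

A check like this only works if the client buffers or inspects the tail of the download, which is part of why an in-band sentinel is a weak error signal.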

Bug 2 — User impersonation is bypassed

On databases configured with impersonate_user: true (Trino, Presto, etc.), every other Superset execution site acquires the engine via database.get_sqla_engine_with_context(user_name=…) so the end user's identity is forwarded as the X-Trino-User header. The streaming export path acquires its engine without this context and runs every query as the service principal.

Consequences:

  • Audit trail broken — every CSV export, from every user, shows up in the Trino query log as the service account.
  • Resource-group routing broken — exports no longer route to the user's configured Trino resource group.
  • Possible authorization bypass — engines that key per-user authz off X-Trino-User (Ranger, OPA, file-based ACLs, row/column-level security via session-aware views) will see the service account on the streaming path. A Superset user may be able to export data via "Download CSV" that they are not permitted to read via SQL Lab.

Bug 1 is the visible crash. Bug 2 is independently reproducible — even with bug 1 patched, every CSV in the Trino query log is misattributed.
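For anyone who wants to check their own query log for this, a minimal sketch follows. It assumes the query history has already been exported into dictionaries with "user" and "source" keys; that record shape and the function name misattributed_queries are assumptions for illustration, modeled on the fields shown in the Trino UI, not on any Trino API. Note that it will also flag queries that legitimately run as the service account, for example from databases without impersonation.

```python
from typing import Dict, Iterable, List


def misattributed_queries(
    records: Iterable[Dict[str, str]],
    service_user: str,
    source: str = "Apache Superset",
) -> List[Dict[str, str]]:
    """Return records from the given source that ran as the service account."""
    return [
        record
        for record in records
        if record.get("source") == source and record.get("user") == service_user
    ]


if __name__ == "__main__":
    history = [
        {"user": "superset", "source": "Apache Superset", "state": "FAILED"},
        {"user": "analyst@example.com", "source": "Apache Superset", "state": "FINISHED"},
    ]
    for record in misattributed_queries(history, service_user="superset"):
        print(record)
```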

The non-streaming export paths (Excel export, SQL Lab, /api/v1/chart/data JSON renders) are unaffected because they go through the proper pipeline.

How to reproduce the bug

  1. Connect Superset 6.1.0 to a Trino cluster with impersonate_user: true.
  2. Create a dashboard tile or standalone chart backed by a Trino dataset.
  3. As any logged-in OAuth user (not the service principal), click Download > Export to CSV.
  4. Open the downloaded file.
  5. Open the Trino UI / query history and locate the corresponding query.

Expected

  • The CSV contains the chart's data.
  • The Trino query record shows User: <logged-in user>, the user's normal resource group, and the database's default schema.

Actual

  • The downloaded file is 63 bytes and contains only:
    __STREAM_ERROR__: Export failed. Please try again in some time.
    
  • The Trino query record shows:
    • Error Type: USER_ERROR
    • Error Code: SYNTAX_ERROR (1)
    • Message: line N:13: mismatched input ';'. Expecting: <EOF>
    • User: <service principal> (not the end user)
    • Resource Group: n/a
    • Schema: <empty>

Performing the same action with Export to Excel instead of Export to CSV works correctly and shows the end user, the right resource group, the default schema, and a sqlglot-reformatted SQL body.

Side-by-side evidence

Same chart, same user, two consecutive export attempts seconds apart.

Failing CSV export — streaming path

User:            superset
Principal:       superset
Source:          Apache Superset
Catalog:         my_catalog
Schema:          (empty)
Resource Group:  n/a
Status:          USER_ERROR / SYNTAX_ERROR
SQL (last line): LIMIT 500000;
SQL form:        raw, lowercase keywords, DATE '2026-05-20'

Succeeding Excel export — non-streaming path

User:            analyst@example.com               <-- end user via X-Trino-User
Principal:       superset
Source:          Apache Superset
Catalog:         my_catalog
Schema:          my_schema
Resource Group:  analysts
Status:          FINISHED
SQL (last line): LIMIT 500000
SQL form:        uppercased keywords, CAST('2026-05-20' AS DATE)

Both SQL strings are derived from the same chart definition. The differences (trailing ;, missing sqlglot reformat, missing schema context, missing user impersonation) are all consequences of the streaming path skipping mutate_sql_based_on_config() and get_sqla_engine_with_context(user_name=…).
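A hedged sketch of the call order I would expect the streaming path to follow is below. StubDatabase and StubEngine are stand-ins written for this example, not Superset classes. The two method names are the ones referenced in this report, but their bodies here are placeholders and their real signatures and behavior should be checked against the Superset source. The function stream_export is likewise hypothetical.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple

CallLog = List[Tuple[Optional[str], str]]


class StubEngine:
    """Stand-in for a SQLAlchemy engine; records (user, sql) per call."""

    def __init__(self, user_name: Optional[str], log: CallLog) -> None:
        self.user_name = user_name
        self.log = log

    def execute(self, sql: str) -> None:
        self.log.append((self.user_name, sql))


class StubDatabase:
    """Stand-in for the database model; NOT the real Superset class."""

    def __init__(self) -> None:
        self.log: CallLog = []

    def mutate_sql_based_on_config(self, sql: str) -> str:
        # Placeholder body: only drops the trailing terminator.
        return sql.rstrip("; \t\r\n")

    @contextmanager
    def get_sqla_engine_with_context(
        self, user_name: Optional[str] = None
    ) -> Iterator[StubEngine]:
        # Placeholder body: the engine simply remembers the effective user.
        yield StubEngine(user_name, self.log)


def stream_export(database: StubDatabase, sql: str, user_name: str) -> None:
    """Prepare the SQL first, then run it on an engine bound to the end user."""
    prepared = database.mutate_sql_based_on_config(sql)
    with database.get_sqla_engine_with_context(user_name=user_name) as engine:
        engine.execute(prepared)


if __name__ == "__main__":
    db = StubDatabase()
    stream_export(db, "SELECT 1\nLIMIT 500000;", "analyst@example.com")
    print(db.log)
```

The point of the sketch is only the ordering: SQL preparation and user-scoped engine acquisition both happen before anything is sent to the engine.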

Minimal SQL illustrating the difference

What the streaming CSV path sends to Trino (fails):

SELECT category AS category, region AS region, sum(amount) AS "SUM(amount)"
FROM (select date, order_id, region, amount, category
      from my_catalog.my_schema.orders) AS virtual_table
WHERE date >= DATE '2026-05-20' AND date < DATE '2026-05-27'
  AND amount > 100 AND region IS NOT NULL
GROUP BY category, region
ORDER BY "SUM(amount)" DESC
LIMIT 500000;

What the non-streaming Excel path sends to Trino (works):

SELECT
  category AS category,
  region AS region,
  SUM(amount) AS "SUM(amount)"
FROM (
  SELECT date, order_id, region, amount, category
  FROM my_catalog.my_schema.orders
) AS virtual_table
WHERE
  date >= CAST('2026-05-20' AS DATE)
  AND date < CAST('2026-05-27' AS DATE)
  AND amount > 100
  AND NOT region IS NULL
GROUP BY category, region
ORDER BY "SUM(amount)" DESC
LIMIT 500000

Stack trace

ERROR:superset.commands.streaming_export.base:Traceback: Traceback (most recent call last):
  File ".../sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File ".../trino/sqlalchemy/dialect.py", line 442, in do_execute
    cursor.execute(statement, parameters)
  File ".../trino/dbapi.py", line 640, in execute
    self._iterator = iter(self._query.execute())
  File ".../trino/client.py", line 938, in execute
    self._result.rows += self.fetch()
  File ".../trino/client.py", line 958, in fetch
    status = self._request.process(response)
  File ".../trino/client.py", line 727, in process
    raise self._process_error(response["error"], response.get("id"))
trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=SYNTAX_ERROR,
    message="line 24:13: mismatched input ';'. Expecting: <EOF>", query_id=...)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/superset/commands/streaming_export/base.py", line 225, in csv_generator
    yield from self._execute_query_and_stream(sql, database, limit)
  File "/app/superset/commands/streaming_export/base.py", line 168, in _execute_query_and_stream
    ).execute(text(sql))
  ...
sqlalchemy.exc.ProgrammingError: (trino.exceptions.TrinoUserError) TrinoUserError(
    type=USER_ERROR, name=SYNTAX_ERROR,
    message="line 24:13: mismatched input ';'. Expecting: <EOF>", query_id=...)

Trino-side parser stack (from the corresponding query in the Trino UI):

io.trino.sql.parser.ParsingException: line 24:13: mismatched input ';'. Expecting: <EOF>
    at io.trino.sql.parser.ErrorHandler.syntaxError(ErrorHandler.java:108)
    ...
    at io.trino.dispatcher.DispatchManager.createQueryInternal(DispatchManager.java:225)

Environment

  • Superset version: 6.1.0
  • Database engine: Trino 480 (trino-python-client via SQLAlchemy)
  • DB connection setting: impersonate_user: true
  • Python: 3.10
  • Deployment: Helm chart on Kubernetes
  • Auth: OAuth2

Severity

I'd argue this is release-blocker class, for two reasons:

  1. Functional: every dashboard/chart CSV export against Trino or Presto in 6.1.0 is broken, with no in-UI signal of failure (HTTP 200 + sentinel text inside the file).
  2. Security: missing impersonation may silently bypass per-user authorization on deployments that key Trino authz off X-Trino-User. Any deployment using Ranger / OPA / file-based ACLs / RLS views with Superset + Trino should validate before upgrading.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.10

Node version

I don't know

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
