Skip to content

fix(sqllab): require dataset match for raw query access#40409

Draft
sha174n wants to merge 9 commits into
apache:masterfrom
sha174n:fix/sqllab-dataset-scoped-access
Draft

fix(sqllab): require dataset match for raw query access#40409
sha174n wants to merge 9 commits into
apache:masterfrom
sha174n:fix/sqllab-dataset-scoped-access

Conversation

@sha174n
Copy link
Copy Markdown
Contributor

@sha174n sha174n commented May 24, 2026

SUMMARY

raise_for_access in superset/security/manager.py previously had two related blind spots in its database/table/query branch:

  1. The SQL Lab path treated catalog_access / schema_access as sufficient to authorise any table in that catalog/schema, even when no Superset dataset existed for the table. The chart-data branch right below requires can_access_schema(datasource) on a registered SqlaTable, so the SQL Lab path was strictly broader than the chart-data path for the same permission.
  2. The MetaDB pseudo-database (superset/extensions/metadb.py) returns row data via the database + table form of raise_for_access and therefore inherited the same schema_access fallthrough.

This PR introduces an explicit kwarg force_dataset_match: bool = False on raise_for_access. When True, the historical catalog_access / schema_access fallthroughs are skipped and every referenced table must resolve to a Superset dataset the user has datasource_access on (or owns).

The three call sites that actually execute or return raw row data pass force_dataset_match=True:

  • superset/sqllab/validators.py (SQL Lab execute pre-flight)
  • superset/models/sql_lab.py::Query.raise_for_access (used by SQL Lab results, CSV export, and streaming export)
  • superset/extensions/metadb.py (MetaDB autoload path)

All other call sites keep the historical semantics unchanged. This preserves the documented schema_access behaviour for chart-data, dataset CRUD from SQL, /table_metadata/, /select_star/, explore on saved queries, and the like.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A. Backend permission scoping, no UI surface.

TESTING INSTRUCTIONS

pytest tests/integration_tests/security_tests.py::TestSecurityManager -k raise_for_access
pytest tests/unit_tests/security/manager_test.py

Three new tests under force_dataset_match=True:

  • test_raise_for_access_force_dataset_match_allows_dataset: a user with datasource_access on a registered SqlaTable can still query the table via the strict path.
  • test_raise_for_access_force_dataset_match_denies_schema_only: schema_access alone no longer satisfies the strict path.
  • test_raise_for_access_default_keeps_schema_access: the default (False) path still allows schema_access for non-dataset tables, preserving backwards compatibility for chart-data / dataset CRUD.

ADDITIONAL INFORMATION

Backwards compatibility

Deployments that used schema_access as a deliberate "give Gamma full SQL Lab access to one schema" grant will start being denied at the SQL Lab execute path, results-fetch, CSV export, streaming export, and MetaDB autoload. Mitigation:

  • Upgrade those users to database_access, or
  • Register the relevant tables as Superset datasets so datasource_access can be granted explicitly.

All non-SQL-Lab call sites (chart-data, dataset CRUD, table_metadata, select_star, explore, viz, query_context) keep the historical behaviour and are unaffected.

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

In raise_for_access, the SQL Lab branch let catalog_access and
schema_access short-circuit the per-table dataset-existence check.
The chart-data branch right next door requires can_access_schema
on a registered SqlaTable, so the SQL Lab path was strictly broader
than the chart-data path for the same permission.

Tighten the SQL Lab path so every referenced table must resolve to
a Superset dataset the user holds datasource_access on (or owns).
The table-only call sites (/table_metadata/, /select_star/, MetaDB)
keep the historical catalog_access / schema_access fallthrough since
they return schema metadata only, not row data.

Two existing tests updated to mock query_datasources_by_name so they
exercise the new dataset-match path; one regression test added.

Backwards compatibility: deployments that used schema_access as a
deliberate "full SQL Lab access to one schema" grant will start
being denied. Mitigation: grant database_access, or register the
relevant tables as Superset datasets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 80.76923% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.23%. Comparing base (709ef9b) to head (27e04e1).

Files with missing lines Patch % Lines
superset/security/manager.py 86.66% 1 Missing and 1 partial ⚠️
superset/sql/parse.py 75.00% 1 Missing and 1 partial ⚠️
superset/extensions/metadb.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #40409      +/-   ##
==========================================
+ Coverage   64.20%   64.23%   +0.03%     
==========================================
  Files        2592     2592              
  Lines      139226   139243      +17     
  Branches    32326    32332       +6     
==========================================
+ Hits        89385    89439      +54     
+ Misses      48306    48256      -50     
- Partials     1535     1548      +13     
Flag Coverage Δ
hive 39.30% <23.07%> (+0.06%) ⬆️
mysql 58.83% <80.76%> (+0.07%) ⬆️
postgres 58.91% <80.76%> (+0.07%) ⬆️
presto 40.98% <23.07%> (+0.05%) ⬆️
python 60.46% <80.76%> (+0.06%) ⬆️
sqlite 58.50% <80.76%> (+0.02%) ⬆️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…t_match)

Code review of the previous revision flagged that `query is not None`
was too broad as a discriminator: it caught chart-data on saved-query
datasources, dataset create/update from SQL, SQL Lab results/export,
and explore on saved queries, all of which were schema_access-by-design
flows. It also missed MetaDB, which uses the table-only call path but
actually returns row data.

Replace the inferred flag with an explicit kwarg
`force_dataset_match: bool = False` on `raise_for_access`. Default
(False) preserves the historical catalog_access / schema_access
fallthrough for chart-data, dataset CRUD, table_metadata, select_star,
explore, etc. The three call sites that actually execute or return raw
row data set it to True:

  - superset/sqllab/validators.py   (SQL Lab execute pre-flight)
  - superset/models/sql_lab.py      (Query.raise_for_access, used by
                                     results.py / export.py /
                                     streaming_export_command.py)
  - superset/extensions/metadb.py   (closes the same bypass on the
                                     database+table path)

Tests:
  - Reverts the broad updates to test_raise_for_access_query /
    test_raise_for_access_sql so they exercise the default path again.
  - Drops the vacuous regression test that never granted schema_access
    in the first place.
  - Adds three focused tests under force_dataset_match=True:
      * allows when a dataset exists and datasource_access matches
      * denies when only schema_access is held and no dataset exists
      * (default path) still allows schema_access for non-dataset
        tables, proving backwards compatibility.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pull-request-size pull-request-size Bot added size/L and removed size/M labels May 24, 2026
sha174n and others added 7 commits May 24, 2026 21:20
Code review on the previous revision flagged two related ways an
authenticated user could still defeat the new force_dataset_match
gate. Both close the same authorization gap and are confirmed.

1. SqlaTable.query_datasources_by_name drops the schema filter when
   schema is None (see connectors/sqla/models.py:1938 `if schema:`).
   On a backend whose default_schema_for_query() returns None and a
   parser output that did not include a schema, the lookup matches
   SqlaTables across every schema. The engine then resolves the
   actual query against the wrong schema via search_path, while the
   dataset-match check passed against an unrelated row.

   Fix: under force_dataset_match=True, refuse tables whose qualified
   schema is None. Users must pick a schema in the SQL Lab schema
   picker or qualify the table explicitly.

2. sqlglot represents statements it cannot fully model as exp.Command.
   extract_tables_from_statement returns set() for an exp.Command with
   no parseable inner literal (e.g. CALL stored_proc() that builds
   dynamic SQL internally). The per-table check below then sees zero
   tables and the script proceeds, even though the procedure may read
   arbitrary tables at execution time.

   Fix: add SQLScript.has_unparseable_statement; under
   force_dataset_match=True, refuse any script that contains one. The
   default (chart-data / dataset CRUD / metadata previews) is unchanged.

Two new regression tests cover both vectors. Existing tests for the
non-strict path are untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… + cover unparseable check

- security/manager.py: apply default schema in the table branch of
  raise_for_access so MetaDB's 2-part URIs (db.table) keep working
  under force_dataset_match; the strict-deny still fires when no
  default can be resolved.
- tests/integration_tests/security_tests.py: pin the new behavior.
- tests/unit_tests/sql/parse_tests.py: cover has_unparseable_statement
  to reach 100% sql/ coverage (parameterized: SELECT, CALL, mixed, KQL).
- tests/integration_tests/sqllab_tests.py: update test_sql_json_schema_access
  to reflect that schema_access alone no longer grants raw SQL Lab access
  (the legacy fallthrough this PR closes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant