fix(dataset-api): disambiguate get_or_create by schema by Abdulrehman-PIAIC80387 · Pull Request #40494 · apache/superset

Abdulrehman-PIAIC80387 · 2026-05-28T09:26:50Z

SUMMARY

Fixes #30377. Adopts the approach from the stale PR #30379 by @luizcapu (credited below).

POST /api/v1/dataset/get_or_create/ accepts a schema field in the request body but ignores it during the existence check, which calls DatasetDAO.get_table_by_name(database_id, table_name) and therefore matches on database and table name only. This produces two failures:

500 (MultipleResultsFound) — when two or more datasets already exist with the same table_name across different schemas, one_or_none() raises.
False-positive 200 — when one dataset exists in schema A and the caller asks for the same table_name in schema B, the API returns the schema-A dataset and never creates the schema-B one.

The mismatch is invisible in default deployments because most installs only ever have one dataset per table_name.
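The two failure modes can be reproduced without a database. The sketch below is illustrative only: `Dataset`, `one_or_none`, and the in-memory list are hypothetical stand-ins, not Superset code, and the helper only mimics the documented behaviour of SQLAlchemy's `Query.one_or_none()` (raise when more than one row matches).

```python
from __future__ import annotations

from dataclasses import dataclass


class MultipleResultsFound(Exception):
    """Stand-in for sqlalchemy.exc.MultipleResultsFound (illustration only)."""


@dataclass(frozen=True)
class Dataset:
    id: int
    database_id: int
    schema: str | None
    table_name: str


def one_or_none(rows: list[Dataset]) -> Dataset | None:
    # None for no rows, the row for exactly one, an error for more than one.
    if len(rows) > 1:
        raise MultipleResultsFound(f"{len(rows)} rows matched")
    return rows[0] if rows else None


def get_table_by_name(
    datasets: list[Dataset], database_id: int, table_name: str
) -> Dataset | None:
    # Name-only lookup: the schema is never consulted.
    return one_or_none(
        [
            d
            for d in datasets
            if d.database_id == database_id and d.table_name == table_name
        ]
    )


# False positive: only schema_a exists, the caller wants schema_b,
# and the schema_a dataset comes back anyway.
only_a = [Dataset(1, 10, "schema_a", "orders")]
false_positive = get_table_by_name(only_a, 10, "orders")

# Multiple results: both schemas exist, so the lookup itself raises.
both = only_a + [Dataset(2, 10, "schema_b", "orders")]
try:
    get_table_by_name(both, 10, "orders")
    raised = False
except MultipleResultsFound:
    raised = True
```

In the real endpoint the second case surfaces as the HTTP 500 described above.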

Fix

Adds DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name).
get_or_create_dataset now reads body.get("schema") and uses the schema-aware lookup, matching the contract the request schema already advertises.
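A minimal sketch of the schema-aware flow, assuming an in-memory list in place of the metadata database. The function names mirror the ones in this PR, but the bodies here are hypothetical simplifications and not the actual Superset implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Dataset:
    id: int
    database_id: int
    schema: str | None
    table_name: str


def get_table_by_schema_and_name(
    datasets: list[Dataset], database_id: int, schema: str | None, table_name: str
) -> Dataset | None:
    # The schema is part of the match key, so same-named tables in other
    # schemas are no longer candidates.
    key = (database_id, schema, table_name)
    matches = [d for d in datasets if (d.database_id, d.schema, d.table_name) == key]
    if len(matches) > 1:
        raise RuntimeError("multiple rows matched")
    return matches[0] if matches else None


def get_or_create_dataset(datasets: list[Dataset], body: dict) -> tuple[Dataset, bool]:
    # Returns (dataset, created). The schema comes from the request body.
    schema = body.get("schema")
    existing = get_table_by_schema_and_name(
        datasets, body["database_id"], schema, body["table_name"]
    )
    if existing is not None:
        return existing, False
    created = Dataset(
        id=max((d.id for d in datasets), default=0) + 1,
        database_id=body["database_id"],
        schema=schema,
        table_name=body["table_name"],
    )
    datasets.append(created)
    return created, True


store = [Dataset(1, 10, "schema_a", "orders")]
request_b = {"database_id": 10, "schema": "schema_b", "table_name": "orders"}
request_a = {"database_id": 10, "schema": "schema_a", "table_name": "orders"}

first_b, created_b = get_or_create_dataset(store, request_b)
found_a, created_a = get_or_create_dataset(store, request_a)
second_b, created_again = get_or_create_dataset(store, request_b)
```

With the schema in the key, the schema_b request creates a new dataset instead of returning the schema_a one, and repeating it returns the dataset just created.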

TESTING INSTRUCTIONS

pytest tests/integration_tests/datasets/api_tests.py::DatasetApiTests::test_get_or_create_dataset_disambiguates_by_schema -v

Regression test seeds two datasets with the same table_name in different schemas, then asserts both lookups (schema=schema_a and schema=schema_b) return the correct dataset id without raising.

Credit

This PR adopts the approach from #30379 by @luizcapu (Pinterest), which sat without review for ~1 year. @rusackas invited adoption on the issue thread three times in 2025-2026. I left a courtesy ping on the issue and proceeded.

Modern union syntax (str | None) is used throughout; the original PR's Optional[str] reverts are not included per Superset's Python style.

ADDITIONAL INFORMATION

Has associated issue: Dataset get_or_create API fail to create a dataset with the same table_name but different schema #30377
Required feature flags:
Changes UI
Includes DB Migration
Introduces new feature or API
Removes existing feature or API

bito-code-review · 2026-05-28T09:27:02Z

Code Review Agent Run #80ffbd

Actionable Suggestions - 0

Additional Suggestions - 1

superset/datasets/api.py - 1
- Catalog not filtered in table lookup · Line 1091-1102
  
  The "get existing table" path checks `schema` but not `catalog`, while `CreateDatasetCommand.validate()` validates uniqueness considering both via `DatasetDAO.validate_uniqueness(database, table)`. If two datasets share a `table_name` across different catalogs with the same schema, the lookup at line 1099 could return the wrong one, and `CreateDatasetCommand` would then raise `DatasetExistsValidationError` — a confusing UX. Pass `catalog` to `get_table_by_schema_and_name()` for consistency.

Review Details

Files reviewed - 3 · Commit Range: e6483e2..e6483e2
- superset/daos/dataset.py
- superset/datasets/api.py
- tests/integration_tests/datasets/api_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

codecov · 2026-05-28T09:31:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.17%. Comparing base (c73106b) to head (ca8f665).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #40494   +/-   ##
=======================================
  Coverage   64.17%   64.17%           
=======================================
  Files        2592     2592           
  Lines      139299   139304    +5     
  Branches    32347    32347           
=======================================
+ Hits        89395    89399    +4     
- Misses      48367    48368    +1     
  Partials     1537     1537

Flag	Coverage Δ
hive	`39.20% <33.33%> (+<0.01%)`	⬆️
mysql	`58.71% <100.00%> (+<0.01%)`	⬆️
postgres	`58.79% <100.00%> (+<0.01%)`	⬆️
presto	`40.88% <33.33%> (-0.01%)`	⬇️
python	`60.34% <100.00%> (+<0.01%)`	⬆️
sqlite	`58.43% <100.00%> (+<0.01%)`	⬆️
unit	`100.00% <ø> (ø)`

codeant-ai-for-open-source · 2026-05-28T09:58:11Z

+    def get_table_by_schema_and_name(
+        database_id: int, schema: str | None, table_name: str
+    ) -> SqlaTable | None:
+        # Filter by schema as well so callers can disambiguate datasets that
+        # share a ``table_name`` across schemas (#30377).
+        return (
+            db.session.query(SqlaTable)
+            .filter_by(database_id=database_id, schema=schema, table_name=table_name)
+            .one_or_none()
+        )


Suggestion: The new lookup still ignores catalog, even though GetOrCreateDatasetSchema accepts it and dataset uniqueness includes (database_id, catalog, schema, table_name). In multi-catalog databases this can still return the wrong dataset or raise MultipleResultsFound when two datasets share schema/table across catalogs. Include catalog in both method arguments and the filter so the existence check matches the API contract and uniqueness model. [api mismatch]

Severity Level: Critical 🚨

- ❌ /dataset/get_or_create crashes with MultipleResultsFound.
- ❌ Wrong dataset returned when catalogs differ but schema matches.
- ⚠️ Multi-catalog environments misroute datasets across catalogs.
- ⚠️ API contract misaligns with GetOrCreateDatasetSchema.catalog.

Steps of Reproduction ✅

1. Observe the dataset uniqueness model: `SqlaTable` in `superset/connectors/sqla/models.py:1256-1258` defines a SQLAlchemy `UniqueConstraint("database_id", "catalog", "schema", "table_name")`, meaning datasets are distinguished by `(database_id, catalog, schema, table_name)`.
2. Create two datasets for the same physical table across different catalogs but with the same schema and table name using the regular dataset creation API (`DatasetRestApi.post` at `superset/datasets/api.py:310-360`, which calls `CreateDatasetCommand` at `superset/commands/dataset/create.py:42-52` and `DatasetDAO.validate_uniqueness` at `superset/daos/dataset.py:20-40`) with `catalog="cat_a"` and `catalog="cat_b"`, identical `schema` and `table_name`, and the same `database`.
3. Call `POST /api/v1/dataset/get_or_create/`, which is implemented by `DatasetRestApi.get_or_create_dataset` at `superset/datasets/api.py:1045-1107`, with a JSON body matching the existing datasets: `{"database_id": <db_id>, "table_name": "<name>", "schema": "<schema>", "catalog": "cat_a"}`. The request is validated by `GetOrCreateDatasetSchema` (`superset/datasets/schemas.py:360-380`), which accepts `catalog`, but the handler ignores it.
4. Inside `get_or_create_dataset`, the handler computes `schema = body.get("schema")` and calls `DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name)` (`superset/datasets/api.py:1093-1101`), which in turn queries `SqlaTable` in `DatasetDAO.get_table_by_schema_and_name` at `superset/daos/dataset.py:48-58` using only `database_id`, `schema` and `table_name`, omitting `catalog`. If two rows exist across catalogs, this query returns multiple rows and `Query.one_or_none()` raises `MultipleResultsFound` (HTTP 500). If only one catalog row exists but the client intended another catalog, the method still returns that single row, causing a false-positive match from the wrong catalog instead of creating the new dataset.
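The difference between the three-field lookup and the full uniqueness key can be sketched with plain dictionaries. `filter_by` below is a hypothetical helper that only imitates keyword-equality filtering; the rows and ids are made up for illustration.

```python
def filter_by(rows, **criteria):
    # Hypothetical stand-in for keyword-equality filtering: keep rows whose
    # fields equal every supplied keyword.
    return [row for row in rows if all(row[k] == v for k, v in criteria.items())]


rows = [
    {"id": 1, "database_id": 10, "catalog": "cat_a", "schema": "public", "table_name": "orders"},
    {"id": 2, "database_id": 10, "catalog": "cat_b", "schema": "public", "table_name": "orders"},
]

# Without catalog, both rows match, which is the case where one_or_none()
# would raise MultipleResultsFound.
schema_only = filter_by(rows, database_id=10, schema="public", table_name="orders")

# With the full (database_id, catalog, schema, table_name) key, exactly one
# row matches.
full_key = filter_by(
    rows, database_id=10, catalog="cat_a", schema="public", table_name="orders"
)
```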

Addressed in 9ab07e3: get_table_by_schema_and_name now takes catalog and the filter uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request body, falling back to database.get_default_catalog() — matching DatasetDAO.validate_uniqueness.

Abdulrehman-PIAIC80387 · 2026-05-28T10:05:41Z

@codeant-ai-for-open-source addressed both flags in 9ab07e3:

Catalog mismatch (Critical) — get_table_by_schema_and_name now takes catalog and the lookup uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request, falling back to database.get_default_catalog() to match the semantics in DatasetDAO.validate_uniqueness.
Empty-string schema (Major) — request body's schema is now normalised with body.get("schema") or None so empty-string and missing schema resolve consistently against datasets stored with NULL schema.
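The schema normalisation can be illustrated in isolation. This is a sketch under stated assumptions: `normalize_schema` and `find` are hypothetical helpers, the stored row is invented, and the catalog is passed through unchanged rather than resolved to any default.

```python
from __future__ import annotations

# A dataset stored with NULL schema and NULL catalog (illustrative data).
STORED = [
    {"id": 7, "database_id": 10, "catalog": None, "schema": None, "table_name": "orders"}
]


def normalize_schema(body: dict) -> str | None:
    # "", None, and a missing key all collapse to None.
    return body.get("schema") or None


def find(rows, database_id, catalog, schema, table_name):
    # Match on the full four-part uniqueness key.
    key = (database_id, catalog, schema, table_name)
    return next(
        (
            r
            for r in rows
            if (r["database_id"], r["catalog"], r["schema"], r["table_name"]) == key
        ),
        None,
    )


empty_body = {"database_id": 10, "table_name": "orders", "schema": ""}
missing_body = {"database_id": 10, "table_name": "orders"}

# Without normalisation, "" does not equal the stored None and nothing matches.
raw_hit = find(STORED, 10, None, empty_body.get("schema"), "orders")

# With normalisation, both request shapes resolve to the same stored dataset.
empty_hit = find(STORED, 10, None, normalize_schema(empty_body), "orders")
missing_hit = find(STORED, 10, None, normalize_schema(missing_body), "orders")
```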

codeant-ai-for-open-source · 2026-05-28T10:06:33Z

Question:

Catalog mismatch (Critical): get_table_by_schema_and_name now takes catalog and the lookup uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request, falling back to database.get_default_catalog() to match the semantics in DatasetDAO.validate_uniqueness.

Empty-string schema (Major): the request body's schema is now normalised with body.get("schema") or None so empty-string and missing schema resolve consistently against datasets stored with NULL schema.

Answer:

Catalog mismatch: Yes — the new lookup is using the same 4-part uniqueness key as the rest of the dataset logic, which is the right direction. In get_or_create_dataset, catalog is taken from the request when present, otherwise it falls back to database.get_default_catalog(), so it matches the semantics described in DatasetDAO.validate_uniqueness. That means the endpoint should no longer incorrectly match a dataset from a different catalog when the same table_name exists in multiple catalogs.
Empty-string schema: Yes — body.get("schema") or None normalizes both missing schema and "" to None, so the lookup is consistent with datasets stored with a NULL schema. That prevents a blank-string request value from failing to match an existing null-schema dataset, and it keeps the behavior aligned with how Superset generally treats empty schema values.

bito-code-review · 2026-05-28T12:49:24Z

Code Review Agent Run #2b3c2d

Actionable Suggestions - 0

Additional Suggestions - 1

superset/datasets/api.py - 1
- Add catalog disambiguation test · Line 1105-1105
  
  The existing regression test `test_get_or_create_dataset_disambiguates_by_schema` only covers schema disambiguation. Consider adding a parallel test for catalog disambiguation to validate that the new behavior (catalog defaults to `None` instead of database's default) works correctly when two datasets share a table_name across different catalogs.

Review Details

Files reviewed - 3 · Commit Range: e6483e2..ca8f665
- tests/integration_tests/datasets/api_tests.py
- superset/daos/dataset.py
- superset/datasets/api.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

fix(dataset-api): disambiguate get_or_create by schema

e6483e2

pull-request-size Bot added the size/M label May 28, 2026

dosubot Bot added the data:dataset Related to dataset configurations label May 28, 2026

github-actions Bot added the api Related to the REST API label May 28, 2026

test: skip metadata introspection in get_or_create schema regression test

f469404

pull-request-size Bot added size/L and removed size/M labels May 28, 2026

codeant-ai-for-open-source Bot reviewed May 28, 2026

View reviewed changes

fix(dataset-api): include catalog and normalize empty schema in get_or_create lookup

9ab07e3

Abdulrehman-PIAIC80387 added 2 commits May 28, 2026 15:32

fix(dataset-api): don't apply catalog default in get_or_create lookup

31b4f17

test: skip get_or_create schema regression on SQLite (legacy constraint)

ca8f665

Abdulrehman-PIAIC80387 wants to merge 5 commits into apache:master from Abdulrehman-PIAIC80387:fix/dataset-get-or-create-schema-aware-30377
