fix(dataset-api): disambiguate get_or_create by schema#40494
fix(dataset-api): disambiguate get_or_create by schema#40494Abdulrehman-PIAIC80387 wants to merge 5 commits into
Conversation
Code Review Agent Run #80ffbdActionable Suggestions - 0Additional Suggestions - 1
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #40494 +/- ##
=======================================
Coverage 64.17% 64.17%
=======================================
Files 2592 2592
Lines 139299 139304 +5
Branches 32347 32347
=======================================
+ Hits 89395 89399 +4
- Misses 48367 48368 +1
Partials 1537 1537
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| def get_table_by_schema_and_name( | ||
| database_id: int, schema: str | None, table_name: str | ||
| ) -> SqlaTable | None: | ||
| # Filter by schema as well so callers can disambiguate datasets that | ||
| # share a ``table_name`` across schemas (#30377). | ||
| return ( | ||
| db.session.query(SqlaTable) | ||
| .filter_by(database_id=database_id, schema=schema, table_name=table_name) | ||
| .one_or_none() |
There was a problem hiding this comment.
Suggestion: The new lookup still ignores catalog, even though GetOrCreateDatasetSchema accepts it and dataset uniqueness includes (database_id, catalog, schema, table_name). In multi-catalog databases this can still return the wrong dataset or raise MultipleResultsFound when two datasets share schema/table across catalogs. Include catalog in both method arguments and the filter so the existence check matches the API contract and uniqueness model. [api mismatch]
Severity Level: Critical 🚨
- ❌ /dataset/get_or_create crashes with MultipleResultsFound.
- ❌ Wrong dataset returned when catalogs differ but schema matches.
- ⚠️ Multi-catalog environments misroute datasets across catalogs.
- ⚠️ API contract misaligns with GetOrCreateDatasetSchema.catalog.Steps of Reproduction ✅
1. Observe the dataset uniqueness model: `SqlaTable` in
`superset/connectors/sqla/models.py:1256-1258` defines a SQLAlchemy
`UniqueConstraint("database_id", "catalog", "schema", "table_name")`, meaning datasets are
distinguished by `(database_id, catalog, schema, table_name)`.
2. Create two datasets for the same physical table across different catalogs but with the
same schema and table name using the regular dataset creation API (`DatasetRestApi.post`
at `superset/datasets/api.py:310-360`, which calls `CreateDatasetCommand` at
`superset/commands/dataset/create.py:42-52` and `DatasetDAO.validate_uniqueness` at
`superset/daos/dataset.py:20-40`) with `catalog="cat_a"` and `catalog="cat_b"`, identical
`schema` and `table_name`, and the same `database`.
3. Call `POST /api/v1/dataset/get_or_create/` which is implemented by
`DatasetRestApi.get_or_create_dataset` at `superset/datasets/api.py:1045-1107` with a JSON
body matching the existing datasets: `{"database_id": <db_id>, "table_name": "<name>",
"schema": "<schema>", "catalog": "cat_a"}`; the request is validated by
`GetOrCreateDatasetSchema` (`superset/datasets/schemas.py:360-380`) which accepts
`catalog` but the handler ignores it.
4. Inside `get_or_create_dataset`, the handler computes `schema = body.get("schema")` and
calls `DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name)`
(`superset/datasets/api.py:1093-1101`), which in turn queries `SqlaTable` in
`DatasetDAO.get_table_by_schema_and_name` at `superset/daos/dataset.py:48-58` using only
`database_id`, `schema` and `table_name` and omits `catalog`; if two rows exist across
catalogs this query returns multiple rows and `Query.one_or_none()` raises
`MultipleResultsFound` (HTTP 500), and if only one catalog row exists but the client
intended another catalog, the method still returns that single row, causing a
false-positive match from the wrong catalog instead of creating the new dataset.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/daos/dataset.py
**Line:** 428:436
**Comment:**
*Api Mismatch: The new lookup still ignores `catalog`, even though `GetOrCreateDatasetSchema` accepts it and dataset uniqueness includes `(database_id, catalog, schema, table_name)`. In multi-catalog databases this can still return the wrong dataset or raise `MultipleResultsFound` when two datasets share schema/table across catalogs. Include `catalog` in both method arguments and the filter so the existence check matches the API contract and uniqueness model.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fixThere was a problem hiding this comment.
Addressed in 9ab07e3: get_table_by_schema_and_name now takes catalog and the filter uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request body, falling back to database.get_default_catalog() — matching DatasetDAO.validate_uniqueness.
|
@codeant-ai-for-open-source addressed both flags in 9ab07e3:
|
|
Question: 1. catalog mismatch (critical) —
Answer:
|
Code Review Agent Run #2b3c2dActionable Suggestions - 0Additional Suggestions - 1
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
Fixes #30377. Adopts the approach from the stale PR #30379 by @luizcapu (credited below).
POST /api/v1/dataset/get_or_create/accepts aschemafield in the request body but ignores it during the existence check, callingDatasetDAO.get_table_by_name(database_id, table_name). This produces two failures:table_nameacross different schemas,one_or_none()raises.table_namein schema B, the API returns the schema-A dataset and never creates the schema-B one.The mismatch is invisible in default deployments because most installs only ever have one dataset per
table_name.Fix
DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name).get_or_create_datasetnow readsbody.get("schema")and uses the schema-aware lookup, matching the contract the request schema already advertises.TESTING INSTRUCTIONS
Regression test seeds two datasets with the same
table_namein different schemas, then asserts both lookups (schema=schema_aandschema=schema_b) return the correct dataset id without raising.Credit
This PR adopts the approach from #30379 by @luizcapu (Pinterest), which sat without review for ~1 year. @rusackas invited adoption on the issue thread three times in 2025-2026. I left a courtesy ping on the issue and proceeded.
Modern union syntax (
str | None) is used throughout; the original PR'sOptional[str]reverts are not included per Superset's Python style.ADDITIONAL INFORMATION