
fix(dataset-api): disambiguate get_or_create by schema #40494

Open
Abdulrehman-PIAIC80387 wants to merge 5 commits into apache:master from Abdulrehman-PIAIC80387:fix/dataset-get-or-create-schema-aware-30377


@Abdulrehman-PIAIC80387
Contributor

SUMMARY

Fixes #30377. Adopts the approach from the stale PR #30379 by @luizcapu (credited below).

POST /api/v1/dataset/get_or_create/ accepts a schema field in the request body but ignores it during the existence check, calling DatasetDAO.get_table_by_name(database_id, table_name). This produces two failures:

  • 500 (MultipleResultsFound) — when two or more datasets already exist with the same table_name across different schemas, one_or_none() raises.
  • False-positive 200 — when one dataset exists in schema A and the caller asks for the same table_name in schema B, the API returns the schema-A dataset and never creates the schema-B one.

The mismatch is invisible in default deployments because most installs only ever have one dataset per table_name.
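The two failure modes can be reproduced without a database. The following is a minimal sketch, not Superset's code: `Dataset`, `one_or_none`, and the local `MultipleResultsFound` class are hypothetical stand-ins that only mimic the contract of SQLAlchemy's `Query.one_or_none()`, and `get_table_by_name` here takes an in-memory list instead of a session.

```python
from __future__ import annotations

from dataclasses import dataclass


class MultipleResultsFound(Exception):
    """Local stand-in for sqlalchemy.exc.MultipleResultsFound (assumption)."""


@dataclass(frozen=True)
class Dataset:
    # Hypothetical, simplified dataset row
    id: int
    database_id: int
    schema: str | None
    table_name: str


def one_or_none(rows: list[Dataset]) -> Dataset | None:
    # Same contract as Query.one_or_none(): 0 rows -> None,
    # 1 row -> that row, more than 1 row -> raise.
    if len(rows) > 1:
        raise MultipleResultsFound(f"{len(rows)} rows matched")
    return rows[0] if rows else None


def get_table_by_name(
    datasets: list[Dataset], database_id: int, table_name: str
) -> Dataset | None:
    # Name-only lookup: the schema plays no part in the match.
    return one_or_none(
        [
            d
            for d in datasets
            if d.database_id == database_id and d.table_name == table_name
        ]
    )


# Failure 2 (false positive): only schema_a exists, the caller wants schema_b,
# yet the name-only lookup hands back the schema_a dataset.
only_a = [Dataset(1, 1, "schema_a", "orders")]
false_positive = get_table_by_name(only_a, 1, "orders")

# Failure 1 (500): both schemas exist, so the name-only lookup is ambiguous.
both = only_a + [Dataset(2, 1, "schema_b", "orders")]
try:
    get_table_by_name(both, 1, "orders")
    ambiguous = False
except MultipleResultsFound:
    ambiguous = True
```

In the real endpoint the uncaught exception is what surfaces as the 500 described above.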

Fix

  • Adds DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name).
  • get_or_create_dataset now reads body.get("schema") and uses the schema-aware lookup, matching the contract the request schema already advertises.
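As a rough illustration of the intended flow, the sketch below models get-or-create with a schema-aware lookup. `FakeDatasetStore` and its in-memory list are hypothetical; only the method name `get_table_by_schema_and_name` and the `body.get("schema")` read come from this PR, and the real DAO queries `SqlaTable` through a SQLAlchemy session.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    # Hypothetical, simplified dataset row
    id: int
    database_id: int
    schema: str | None
    table_name: str


@dataclass
class FakeDatasetStore:
    # In-memory stand-in for the datasets table (assumption)
    rows: list[Dataset] = field(default_factory=list)

    def get_table_by_schema_and_name(
        self, database_id: int, schema: str | None, table_name: str
    ) -> Dataset | None:
        # The schema is part of the match, so same-named tables in
        # different schemas no longer collide.
        for row in self.rows:
            if (row.database_id, row.schema, row.table_name) == (
                database_id,
                schema,
                table_name,
            ):
                return row
        return None

    def get_or_create(self, body: dict) -> tuple[Dataset, bool]:
        # Returns (dataset, created_flag)
        existing = self.get_table_by_schema_and_name(
            body["database_id"], body.get("schema"), body["table_name"]
        )
        if existing is not None:
            return existing, False
        created = Dataset(
            id=len(self.rows) + 1,
            database_id=body["database_id"],
            schema=body.get("schema"),
            table_name=body["table_name"],
        )
        self.rows.append(created)
        return created, True


store = FakeDatasetStore([Dataset(1, 1, "schema_a", "orders")])
request_b = {"database_id": 1, "schema": "schema_b", "table_name": "orders"}
first, created_first = store.get_or_create(request_b)
second, created_second = store.get_or_create(request_b)
```

The first call creates the schema_b dataset and the second call finds it, which is the behaviour the request schema already implies.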

TESTING INSTRUCTIONS

pytest tests/integration_tests/datasets/api_tests.py::DatasetApiTests::test_get_or_create_dataset_disambiguates_by_schema -v

Regression test seeds two datasets with the same table_name in different schemas, then asserts both lookups (schema=schema_a and schema=schema_b) return the correct dataset id without raising.

Credit

This PR adopts the approach from #30379 by @luizcapu (Pinterest), which sat without review for ~1 year. @rusackas invited adoption on the issue thread three times in 2025-2026. I left a courtesy ping on the issue and proceeded.

Modern union syntax (str | None) is used throughout; the original PR's Optional[str] reverts are not included per Superset's Python style.

ADDITIONAL INFORMATION

@dosubot dosubot Bot added the data:dataset Related to dataset configurations label May 28, 2026
@bito-code-review
Contributor

bito-code-review Bot commented May 28, 2026

Code Review Agent Run #80ffbd

Actionable Suggestions - 0
Additional Suggestions - 1
  • superset/datasets/api.py - 1
    • Catalog not filtered in table lookup · Line 1091-1102
      The "get existing table" path checks `schema` but not `catalog`, while `CreateDatasetCommand.validate()` validates uniqueness considering both via `DatasetDAO.validate_uniqueness(database, table)`. If two datasets share a `table_name` across different catalogs with the same schema, the lookup at line 1099 could return the wrong one, and `CreateDatasetCommand` would then raise `DatasetExistsValidationError` — a confusing UX. Pass `catalog` to `get_table_by_schema_and_name()` for consistency.
Review Details
  • Files reviewed - 3 · Commit Range: e6483e2..e6483e2
    • superset/daos/dataset.py
    • superset/datasets/api.py
    • tests/integration_tests/datasets/api_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset. You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help


@github-actions github-actions Bot added the api Related to the REST API label May 28, 2026
@codecov

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.17%. Comparing base (c73106b) to head (ca8f665).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #40494   +/-   ##
=======================================
  Coverage   64.17%   64.17%           
=======================================
  Files        2592     2592           
  Lines      139299   139304    +5     
  Branches    32347    32347           
=======================================
+ Hits        89395    89399    +4     
- Misses      48367    48368    +1     
  Partials     1537     1537           
Flag Coverage Δ
hive 39.20% <33.33%> (+<0.01%) ⬆️
mysql 58.71% <100.00%> (+<0.01%) ⬆️
postgres 58.79% <100.00%> (+<0.01%) ⬆️
presto 40.88% <33.33%> (-0.01%) ⬇️
python 60.34% <100.00%> (+<0.01%) ⬆️
sqlite 58.43% <100.00%> (+<0.01%) ⬆️
unit 100.00% <ø> (ø)


@pull-request-size pull-request-size Bot added size/L and removed size/M labels May 28, 2026
Comment thread superset/daos/dataset.py
Comment on lines +428 to +436
def get_table_by_schema_and_name(
    database_id: int, schema: str | None, table_name: str
) -> SqlaTable | None:
    # Filter by schema as well so callers can disambiguate datasets that
    # share a ``table_name`` across schemas (#30377).
    return (
        db.session.query(SqlaTable)
        .filter_by(database_id=database_id, schema=schema, table_name=table_name)
        .one_or_none()

Suggestion: The new lookup still ignores catalog, even though GetOrCreateDatasetSchema accepts it and dataset uniqueness includes (database_id, catalog, schema, table_name). In multi-catalog databases this can still return the wrong dataset or raise MultipleResultsFound when two datasets share schema/table across catalogs. Include catalog in both method arguments and the filter so the existence check matches the API contract and uniqueness model. [api mismatch]

Severity Level: Critical 🚨
- ❌ /dataset/get_or_create crashes with MultipleResultsFound.
- ❌ Wrong dataset returned when catalogs differ but schema matches.
- ⚠️ Multi-catalog environments misroute datasets across catalogs.
- ⚠️ API contract misaligns with GetOrCreateDatasetSchema.catalog.
Steps of Reproduction ✅
1. Observe the dataset uniqueness model: `SqlaTable` in
`superset/connectors/sqla/models.py:1256-1258` defines a SQLAlchemy
`UniqueConstraint("database_id", "catalog", "schema", "table_name")`, meaning datasets are
distinguished by `(database_id, catalog, schema, table_name)`.

2. Create two datasets for the same physical table across different catalogs but with the
same schema and table name using the regular dataset creation API (`DatasetRestApi.post`
at `superset/datasets/api.py:310-360`, which calls `CreateDatasetCommand` at
`superset/commands/dataset/create.py:42-52` and `DatasetDAO.validate_uniqueness` at
`superset/daos/dataset.py:20-40`) with `catalog="cat_a"` and `catalog="cat_b"`, identical
`schema` and `table_name`, and the same `database`.

3. Call `POST /api/v1/dataset/get_or_create/` which is implemented by
`DatasetRestApi.get_or_create_dataset` at `superset/datasets/api.py:1045-1107` with a JSON
body matching the existing datasets: `{"database_id": <db_id>, "table_name": "<name>",
"schema": "<schema>", "catalog": "cat_a"}`; the request is validated by
`GetOrCreateDatasetSchema` (`superset/datasets/schemas.py:360-380`) which accepts
`catalog` but the handler ignores it.

4. Inside `get_or_create_dataset`, the handler computes `schema = body.get("schema")` and
calls `DatasetDAO.get_table_by_schema_and_name(database_id, schema, table_name)`
(`superset/datasets/api.py:1093-1101`), which in turn queries `SqlaTable` in
`DatasetDAO.get_table_by_schema_and_name` at `superset/daos/dataset.py:48-58` using only
`database_id`, `schema` and `table_name` and omits `catalog`; if two rows exist across
catalogs this query returns multiple rows and `Query.one_or_none()` raises
`MultipleResultsFound` (HTTP 500), and if only one catalog row exists but the client
intended another catalog, the method still returns that single row, causing a
false-positive match from the wrong catalog instead of creating the new dataset.



Contributor Author


Addressed in 9ab07e3: get_table_by_schema_and_name now takes catalog and the filter uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request body, falling back to database.get_default_catalog() — matching DatasetDAO.validate_uniqueness.
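A minimal sketch of the catalog handling described in this reply, under stated assumptions: `resolve_catalog`, `lookup`, and the dictionary index are hypothetical, and the `or` fallback is a guess at how "falling back to database.get_default_catalog()" is written in the actual handler. It only illustrates matching on the full four-part key.

```python
from __future__ import annotations


def resolve_catalog(body: dict, default_catalog: str | None) -> str | None:
    # Use the catalog from the request when present, otherwise the
    # database default (stand-in for database.get_default_catalog()).
    return body.get("catalog") or default_catalog


def lookup(index: dict, body: dict, default_catalog: str | None) -> int | None:
    # Match on the full (database_id, catalog, schema, table_name) key.
    key = (
        body["database_id"],
        resolve_catalog(body, default_catalog),
        body.get("schema"),
        body["table_name"],
    )
    return index.get(key)


# Two datasets that differ only by catalog (hypothetical ids 10 and 11)
index = {
    (1, "cat_a", "public", "orders"): 10,
    (1, "cat_b", "public", "orders"): 11,
}

explicit = lookup(
    index,
    {"database_id": 1, "catalog": "cat_b", "schema": "public", "table_name": "orders"},
    default_catalog="cat_a",
)
defaulted = lookup(
    index,
    {"database_id": 1, "schema": "public", "table_name": "orders"},
    default_catalog="cat_a",
)
```

With the catalog in the key, each request resolves to exactly one dataset, so neither the ambiguous-match error nor the wrong-catalog match from the review comment can occur in this model.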

Comment thread superset/datasets/api.py Outdated
@Abdulrehman-PIAIC80387
Contributor Author

@codeant-ai-for-open-source addressed both flags in 9ab07e3:

  1. Catalog mismatch (Critical): get_table_by_schema_and_name now takes catalog, and the lookup uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request, falling back to database.get_default_catalog() to match the semantics in DatasetDAO.validate_uniqueness.

  2. Empty-string schema (Major): the request body's schema is now normalised with body.get("schema") or None, so an empty-string schema and a missing schema both resolve consistently against datasets stored with a NULL schema.
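The effect of the normalisation can be checked with the standard-library `sqlite3` module. This is an illustrative sketch only: the `tables` table, the `schema_name` column, and the `find` helper are hypothetical, and the `IS ?` comparison is used here to imitate the `IS NULL` predicate that SQLAlchemy emits when a filter value is `None`.

```python
from __future__ import annotations

import sqlite3


def normalise_schema(body: dict) -> str | None:
    # A missing key and an empty string both become None.
    return body.get("schema") or None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tables (id INTEGER PRIMARY KEY, schema_name TEXT, table_name TEXT)"
)
# A dataset stored with a NULL schema
conn.execute("INSERT INTO tables (schema_name, table_name) VALUES (NULL, 'orders')")


def find(schema: str | None, table_name: str) -> int | None:
    # SQLite's IS operator is a NULL-safe equality test.
    row = conn.execute(
        "SELECT id FROM tables WHERE schema_name IS ? AND table_name = ?",
        (schema, table_name),
    ).fetchone()
    return row[0] if row else None


raw_match = find("", "orders")  # empty string never equals NULL
normalised_match = find(normalise_schema({"schema": ""}), "orders")
missing_match = find(normalise_schema({}), "orders")
```

Without the normalisation, an empty-string schema misses the NULL-schema row; with it, both the empty and the missing case find the same dataset.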

@codeant-ai-for-open-source
Contributor

Question:

  1. Catalog mismatch (Critical): get_table_by_schema_and_name now takes catalog and the lookup uses the full (database_id, catalog, schema, table_name) uniqueness key. The API derives catalog from the request, falling back to database.get_default_catalog() to match the semantics in DatasetDAO.validate_uniqueness.

  2. Empty-string schema (Major): request body's schema is now normalised with body.get("schema") or None so empty-string and missing schema resolve consistently against datasets stored with NULL schema.

Answer:

  1. Catalog mismatch: Yes — the new lookup is using the same 4-part uniqueness key as the rest of the dataset logic, which is the right direction. In get_or_create_dataset, catalog is taken from the request when present, otherwise it falls back to database.get_default_catalog(), so it matches the semantics described in DatasetDAO.validate_uniqueness. That means the endpoint should no longer incorrectly match a dataset from a different catalog when the same table_name exists in multiple catalogs.

  2. Empty-string schema: Yes — body.get("schema") or None normalizes both missing schema and "" to None, so the lookup is consistent with datasets stored with a NULL schema. That prevents a blank-string request value from failing to match an existing null-schema dataset, and it keeps the behavior aligned with how Superset generally treats empty schema values.

@bito-code-review
Contributor

bito-code-review Bot commented May 28, 2026

Code Review Agent Run #2b3c2d

Actionable Suggestions - 0
Additional Suggestions - 1
  • superset/datasets/api.py - 1
    • Add catalog disambiguation test · Line 1105-1105
      The existing regression test `test_get_or_create_dataset_disambiguates_by_schema` only covers schema disambiguation. Consider adding a parallel test for catalog disambiguation to validate that the new behavior (catalog defaults to `None` instead of database's default) works correctly when two datasets share a table_name across different catalogs.
Review Details
  • Files reviewed - 3 · Commit Range: e6483e2..ca8f665
    • tests/integration_tests/datasets/api_tests.py
    • superset/daos/dataset.py
    • superset/datasets/api.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful



Labels

api Related to the REST API data:dataset Related to dataset configurations size/L

Development

Successfully merging this pull request may close these issues.

Dataset get_or_create API fail to create a dataset with the same table_name but different schema

1 participant