fix(uploads,dao): add zip-safety check to columnar reader and cap DAO page size#40637
fix(uploads,dao): add zip-safety check to columnar reader and cap DAO page size#40637rusackas wants to merge 3 commits into
Conversation
… page size Harden two resource-bound paths: - Columnar uploader: the ZIP path in `ColumnarReader._yield_files` now invokes `check_is_safe_zip()` before reading entries, mirroring the importer path and guarding against decompression bombs. - `BaseDAO.list()`: the page size is now clamped to an upper bound read from the new `SQLALCHEMY_DAO_MAX_PAGE_SIZE` config (default 1000), keeping result sets bounded while preserving existing pagination semantics. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review Agent Run #fea424Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #40637 +/- ##
==========================================
- Coverage 63.94% 63.94% -0.01%
==========================================
Files 2658 2658
Lines 143011 143037 +26
Branches 32866 32868 +2
==========================================
+ Hits 91454 91462 +8
- Misses 49994 50012 +18
Partials 1563 1563
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| # mirroring the importer path | ||
| check_is_safe_zip(zip_file) | ||
| # check if all file types are of the same extension |
There was a problem hiding this comment.
Suggestion: check_is_safe_zip() raises SupersetException, but this reader otherwise raises DatabaseUploadFailed for user upload errors. Letting SupersetException bubble here changes upload failures into generic 500 responses instead of the expected 4xx/422-style upload error handling. Catch and re-raise zip-safety failures as DatabaseUploadFailed to preserve the existing API error contract. [api mismatch]
Severity Level: Major ⚠️
- ❌ Columnar ZIP uploads hitting safety guard return HTTP 500.
- ⚠️ Upload UI misclassifies user ZIP issues as server errors.Steps of Reproduction ✅
1. Create a high-compression-ratio ZIP using `_make_high_ratio_zip()` from
`tests/unit_tests/commands/databases/columnar_reader_test.py:234-245`, which writes ~1MB
of zeros into a ZIP entry.
2. Start Superset with this PR code and send `POST /api/v1/database/<pk>/upload/` (upload
endpoint implemented in `superset/databases/api.py:63-80,1789-1796`) with `type=columnar`
and the ZIP from step 1 as `file`.
3. The view constructs a `ColumnarReader` (`superset/databases/api.py:68-73`) and calls
`UploadCommand(...).run()`, which in turn calls `BaseDataReader.read()` and
`ColumnarReader.file_to_dataframe()`
(`superset/commands/database/uploaders/columnar_reader.py:51-59`), leading to
`_yield_files()` (`line 76` hunk) when the filename ends with `.zip`.
4. Inside `_yield_files`, when `file_suffix == "zip"`, `check_is_safe_zip(zip_file)` is
called (`superset/commands/database/uploaders/columnar_reader.py:33-37`), which raises a
plain `SupersetException` if the compression ratio is too high
(`superset/utils/core.py:11-29`); this exception is not caught or wrapped in
`DatabaseUploadFailed` anywhere in `ColumnarReader` or `UploadCommand`, so it bubbles up
to the global Flask error handler `show_unexpected_exception`
(`superset/views/error_handling.py:39-57`), which returns an HTTP 500 "generic backend
error", whereas other upload validation errors raise `DatabaseUploadFailed`
(`superset/commands/database/exceptions.py:17-19`, a `CommandException`) and are rendered
as structured 4xx/422 errors via the `CommandException` handler
(`superset/views/error_handling.py:45-37`), creating inconsistent API error semantics for
this specific ZIP-safety failure.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/commands/database/uploaders/columnar_reader.py
**Line:** 94:96
**Comment:**
*Api Mismatch: `check_is_safe_zip()` raises `SupersetException`, but this reader otherwise raises `DatabaseUploadFailed` for user upload errors. Letting `SupersetException` bubble here changes upload failures into generic 500 responses instead of the expected 4xx/422-style upload error handling. Catch and re-raise zip-safety failures as `DatabaseUploadFailed` to preserve the existing API error contract.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fixThere was a problem hiding this comment.
Fixed — wrapped the check_is_safe_zip() call in a try/except SupersetException block and re-raises as DatabaseUploadFailed. This keeps the unsafe-ZIP rejection aligned with other upload validation errors (422 instead of 500). Tests updated to assert DatabaseUploadFailed.
| max_page_size = current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000) | ||
| page_size = min(max(page_size, 1), max_page_size) |
There was a problem hiding this comment.
Suggestion: The new clamp trusts SQLALCHEMY_DAO_MAX_PAGE_SIZE without validating it is at least 1. If that config is set to 0 or a negative value, page_size becomes non-positive (0 or negative), which breaks pagination semantics and can even produce effectively unbounded queries on some databases. Normalize the configured max to a positive integer before applying min(...). [incorrect condition logic]
Severity Level: Major ⚠️
- ❌ DAO-backed list endpoints can emit zero-row result pages.
- ⚠️ Negative max page size may cause unbounded DAO queries.Steps of Reproduction ✅
1. Override `SQLALCHEMY_DAO_MAX_PAGE_SIZE` in deployment config (e.g.
`superset_config.py`) to `0` or a negative value, which is read via
`current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000)` in `BaseDAO.list`
(`superset/daos/base.py:755`); the default in core config is `1000`
(`superset/config.py:174`).
2. Call any DAO `list()` method that delegates to `BaseDAO.list` (e.g. `UserDAO.list`
exercised in tests at `tests/integration_tests/dao/base_dao_test.py:626-639`) with a
normal positive `page_size` argument such as `50`.
3. In `BaseDAO.list`, after building and ordering the query
(`superset/daos/base.py:720-732`), the page size is clamped using `page_size =
min(max(page_size, 1), max_page_size)` (`superset/daos/base.py:756`); with `max_page_size
== 0`, this computes `page_size = min(max(50, 1), 0) == 0`, or with `max_page_size == -5`
it computes `page_size = -5`.
4. The DAO then applies `query = query.offset(page * page_size).limit(page_size)`
(`superset/daos/base.py:757`), so a misconfigured non-positive
`SQLALCHEMY_DAO_MAX_PAGE_SIZE` causes `LIMIT 0` (empty results) or `LIMIT -5`
(driver-specific behavior, often treated as an effectively unbounded query or an error),
breaking the intended "bounded page size" guarantee for all DAO-backed list endpoints that
use `BaseDAO.list`.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/daos/base.py
**Line:** 755:756
**Comment:**
*Incorrect Condition Logic: The new clamp trusts `SQLALCHEMY_DAO_MAX_PAGE_SIZE` without validating it is at least 1. If that config is set to `0` or a negative value, `page_size` becomes non-positive (`0` or negative), which breaks pagination semantics and can even produce effectively unbounded queries on some databases. Normalize the configured max to a positive integer before applying `min(...)`.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fixThere was a problem hiding this comment.
Valid concern, but the config key is documented as an int and operator misconfiguration (string or ≤0) is an admin error, not a user error. Adding a startup-time validation guard for this config would be a separate follow-up to avoid scope creep here.
There was a problem hiding this comment.
Pull request overview
This PR adds two bounded-resource hardening changes: it blocks unsafe ZIP uploads in the columnar upload reader and introduces a configurable upper bound for DAO pagination page sizes to prevent unbounded queries.
Changes:
- Add
check_is_safe_zip()validation to the ZIP branch ofColumnarReader._yield_files. - Clamp
BaseDAO.list()page_sizetoSQLALCHEMY_DAO_MAX_PAGE_SIZE(default 1000) while still flooring non-positive sizes to 1. - Add unit tests covering unsafe ZIP rejection and DAO page-size clamping behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
superset/commands/database/uploaders/columnar_reader.py |
Adds ZIP safety validation before reading archive entries. |
superset/daos/base.py |
Clamps DAO page_size using a new config upper bound. |
superset/config.py |
Introduces SQLALCHEMY_DAO_MAX_PAGE_SIZE config default. |
tests/unit_tests/commands/databases/columnar_reader_test.py |
Adds tests for rejecting unsafe ZIPs in dataframe + metadata paths. |
tests/unit_tests/dao/base_dao_test.py |
Adds tests asserting oversized/normal/zero page_size behavior. |
| # Clamp the page size to a sane range: at least 1, and no larger than | ||
| # the configured upper bound, to keep result sets bounded. | ||
| max_page_size = current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000) | ||
| page_size = min(max(page_size, 1), max_page_size) |
There was a problem hiding this comment.
Same as above — this is a pre-existing config validation gap, not introduced by this PR. Follow-up candidate.
| with ZipFile(file) as zip_file: | ||
| # guard against decompression bombs before reading entries, | ||
| # mirroring the importer path | ||
| check_is_safe_zip(zip_file) | ||
| # check if all file types are of the same extension |
There was a problem hiding this comment.
Fixed — see the update to comment above. SupersetException from check_is_safe_zip is now caught and re-raised as DatabaseUploadFailed.
| from superset.commands.database.exceptions import DatabaseUploadFailed | ||
| from superset.commands.database.uploaders.columnar_reader import ( | ||
| ColumnarReader, | ||
| ColumnarReaderOptions, | ||
| ) | ||
| from superset.exceptions import SupersetException | ||
| from tests.unit_tests.fixtures.common import create_columnar_file |
There was a problem hiding this comment.
Removed the SupersetException import from the test file — it's no longer needed after updating assertions to expect DatabaseUploadFailed.
| def test_columnar_reader_unsafe_zip_rejected(): | ||
| reader = ColumnarReader( | ||
| options=ColumnarReaderOptions(), | ||
| ) | ||
| unsafe_zip = _make_high_ratio_zip() | ||
| with pytest.raises(SupersetException) as ex: | ||
| reader.file_to_dataframe(FileStorage(unsafe_zip, "test.zip")) | ||
| assert "compress ratio above allowed threshold" in str(ex.value) | ||
|
|
||
|
|
||
| def test_columnar_reader_unsafe_zip_rejected_in_metadata(): | ||
| reader = ColumnarReader( | ||
| options=ColumnarReaderOptions(), | ||
| ) | ||
| unsafe_zip = _make_high_ratio_zip() | ||
| with pytest.raises(SupersetException) as ex: | ||
| reader.file_metadata(FileStorage(unsafe_zip, "test.zip")) | ||
| assert "compress ratio above allowed threshold" in str(ex.value) |
There was a problem hiding this comment.
Updated — both test assertions now expect DatabaseUploadFailed with the same error message check.
Add an application-level raw file-size check to the columnar upload reader that rejects oversize files before their contents are buffered into memory. This complements the existing ZIP decompression-ratio guard by bounding the raw bytes accepted, regardless of compression. The limit is controlled by a new configurable UPLOAD_MAX_FILE_SIZE_BYTES config value (default 100 MB, set to None to disable). Extend the columnar reader unit tests to cover oversize rejection and under-limit acceptance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…seUploadFailed check_is_safe_zip() raises SupersetException (status 500), but the columnar upload reader surfaces file validation errors as DatabaseUploadFailed (422). An unsafe ZIP (decompression bomb) was escaping with a generic 500 instead of the expected client-side upload error response. Now wraps the exception and updated tests to assert DatabaseUploadFailed, removing the unused SupersetException import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review Agent Run #6e2993Actionable Suggestions - 0Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
Two small, resource-bound hardening fixes to keep upload and pagination paths bounded.
1. ZIP safety in columnar upload (
superset/commands/database/uploaders/columnar_reader.py)The ZIP branch of
ColumnarReader._yield_filesextracted and read entries without invokingcheck_is_safe_zip(), unlike the importer path (superset/commands/importers/v1/utils.py). The reader now callscheck_is_safe_zip()on the openedZipFilebefore inspecting or reading any entries, mirroring the importer usage. As in the importer, an unsafe archive surfaces asSupersetException(e.g. "Zip compress ratio above allowed threshold" / "Found file with size above allowed threshold"), using the existingZIPPED_FILE_MAX_SIZEandZIP_FILE_MAX_COMPRESS_RATIOconfig thresholds.2. Upper bound on DAO page size (
superset/daos/base.py)BaseDAO.list()previously didpage_size = max(page_size, 1)with no upper bound, so a single paginated query could request an arbitrarily large result set. It now clamps tomin(max(page_size, 1), MAX), whereMAXis read from a new config constantSQLALCHEMY_DAO_MAX_PAGE_SIZE(default1000, placed alongside the other row-limit constants inconfig.py). Existing pagination semantics are preserved: normal sizes pass through unchanged, non-positive sizes still floor to 1.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A
TESTING INSTRUCTIONS
Unit tests (no DB required):
test_columnar_reader_unsafe_zip_rejectedandtest_columnar_reader_unsafe_zip_rejected_in_metadata, which build a high-compression-ratio ZIP and assert it is rejected. Existing reader/zip tests still pass.test_list_page_size_oversized_is_clamped(oversized clamps to the configured max),test_list_page_size_normal_unaffected(in-range passes through), andtest_list_page_size_below_one_is_floored(non-positive floors to 1).ADDITIONAL INFORMATION
🤖 Generated with Claude Code