Skip to content

fix(uploads,dao): add zip-safety check to columnar reader and cap DAO page size#40637

Open
rusackas wants to merge 3 commits into
masterfrom
fix/upload-zip-safety-and-dao-page-size-cap
Open

fix(uploads,dao): add zip-safety check to columnar reader and cap DAO page size#40637
rusackas wants to merge 3 commits into
masterfrom
fix/upload-zip-safety-and-dao-page-size-cap

Conversation

@rusackas
Copy link
Copy Markdown
Member

@rusackas rusackas commented Jun 2, 2026

SUMMARY

Two small, resource-bound hardening fixes to keep upload and pagination paths bounded.

1. ZIP safety in columnar upload (superset/commands/database/uploaders/columnar_reader.py)

The ZIP branch of ColumnarReader._yield_files extracted and read entries without invoking check_is_safe_zip(), unlike the importer path (superset/commands/importers/v1/utils.py). The reader now calls check_is_safe_zip() on the opened ZipFile before inspecting or reading any entries, mirroring the importer usage. As in the importer, an unsafe archive surfaces as SupersetException (e.g. "Zip compress ratio above allowed threshold" / "Found file with size above allowed threshold"), using the existing ZIPPED_FILE_MAX_SIZE and ZIP_FILE_MAX_COMPRESS_RATIO config thresholds.

2. Upper bound on DAO page size (superset/daos/base.py)

BaseDAO.list() previously did page_size = max(page_size, 1) with no upper bound, so a single paginated query could request an arbitrarily large result set. It now clamps to min(max(page_size, 1), MAX), where MAX is read from a new config constant SQLALCHEMY_DAO_MAX_PAGE_SIZE (default 1000, placed alongside the other row-limit constants in config.py). Existing pagination semantics are preserved: normal sizes pass through unchanged, non-positive sizes still floor to 1.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

Unit tests (no DB required):

python -m pytest tests/unit_tests/commands/databases/columnar_reader_test.py tests/unit_tests/dao/base_dao_test.py -q
  • Columnar reader: added test_columnar_reader_unsafe_zip_rejected and test_columnar_reader_unsafe_zip_rejected_in_metadata, which build a high-compression-ratio ZIP and assert it is rejected. Existing reader/zip tests still pass.
  • DAO: added test_list_page_size_oversized_is_clamped (oversized clamps to the configured max), test_list_page_size_normal_unaffected (in-range passes through), and test_list_page_size_below_one_is_floored (non-positive floors to 1).

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

🤖 Generated with Claude Code

… page size

Harden two resource-bound paths:

- Columnar uploader: the ZIP path in `ColumnarReader._yield_files` now
  invokes `check_is_safe_zip()` before reading entries, mirroring the
  importer path and guarding against decompression bombs.
- `BaseDAO.list()`: the page size is now clamped to an upper bound read
  from the new `SQLALCHEMY_DAO_MAX_PAGE_SIZE` config (default 1000),
  keeping result sets bounded while preserving existing pagination
  semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rusackas rusackas requested review from dpgaspar and hainenber June 2, 2026 00:02
@dosubot dosubot Bot added the change:backend Requires changing the backend label Jun 2, 2026
@bito-code-review
Copy link
Copy Markdown
Contributor

bito-code-review Bot commented Jun 2, 2026

Code Review Agent Run #fea424

Actionable Suggestions - 0
Review Details
  • Files reviewed - 5 · Commit Range: 822c3f7..822c3f7
    • superset/commands/database/uploaders/columnar_reader.py
    • superset/config.py
    • superset/daos/base.py
    • tests/unit_tests/commands/databases/columnar_reader_test.py
    • tests/unit_tests/dao/base_dao_test.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

@netlify
Copy link
Copy Markdown

netlify Bot commented Jun 2, 2026

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit 822c3f7
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/6a1e1d7beb8c03000878a758
😎 Deploy Preview https://deploy-preview-40637--superset-docs-preview.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.94%. Comparing base (2b8e31b) to head (71b6d4b).
⚠️ Report is 16 commits behind head on master.

Files with missing lines Patch % Lines
...set/commands/database/uploaders/columnar_reader.py 18.18% 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #40637      +/-   ##
==========================================
- Coverage   63.94%   63.94%   -0.01%     
==========================================
  Files        2658     2658              
  Lines      143011   143037      +26     
  Branches    32866    32868       +2     
==========================================
+ Hits        91454    91462       +8     
- Misses      49994    50012      +18     
  Partials     1563     1563              
Flag Coverage Δ
hive 39.75% <25.92%> (-0.01%) ⬇️
mysql 58.39% <33.33%> (-0.02%) ⬇️
postgres 58.46% <33.33%> (-0.02%) ⬇️
presto 41.35% <25.92%> (-0.01%) ⬇️
python 59.95% <33.33%> (-0.02%) ⬇️
sqlite 58.12% <33.33%> (-0.02%) ⬇️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines +94 to 96
# mirroring the importer path
check_is_safe_zip(zip_file)
# check if all file types are of the same extension
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: check_is_safe_zip() raises SupersetException, but this reader otherwise raises DatabaseUploadFailed for user upload errors. Letting SupersetException bubble here changes upload failures into generic 500 responses instead of the expected 4xx/422-style upload error handling. Catch and re-raise zip-safety failures as DatabaseUploadFailed to preserve the existing API error contract. [api mismatch]

Severity Level: Major ⚠️
- ❌ Columnar ZIP uploads hitting safety guard return HTTP 500.
- ⚠️ Upload UI misclassifies user ZIP issues as server errors.
Steps of Reproduction ✅
1. Create a high-compression-ratio ZIP using `_make_high_ratio_zip()` from
`tests/unit_tests/commands/databases/columnar_reader_test.py:234-245`, which writes ~1MB
of zeros into a ZIP entry.

2. Start Superset with this PR code and send `POST /api/v1/database/<pk>/upload/` (upload
endpoint implemented in `superset/databases/api.py:63-80,1789-1796`) with `type=columnar`
and the ZIP from step 1 as `file`.

3. The view constructs a `ColumnarReader` (`superset/databases/api.py:68-73`) and calls
`UploadCommand(...).run()`, which in turn calls `BaseDataReader.read()` and
`ColumnarReader.file_to_dataframe()`
(`superset/commands/database/uploaders/columnar_reader.py:51-59`), leading to
`_yield_files()` (`line 76` hunk) when the filename ends with `.zip`.

4. Inside `_yield_files`, when `file_suffix == "zip"`, `check_is_safe_zip(zip_file)` is
called (`superset/commands/database/uploaders/columnar_reader.py:33-37`), which raises a
plain `SupersetException` if the compression ratio is too high
(`superset/utils/core.py:11-29`); this exception is not caught or wrapped in
`DatabaseUploadFailed` anywhere in `ColumnarReader` or `UploadCommand`, so it bubbles up
to the global Flask error handler `show_unexpected_exception`
(`superset/views/error_handling.py:39-57`), which returns an HTTP 500 "generic backend
error", whereas other upload validation errors raise `DatabaseUploadFailed`
(`superset/commands/database/exceptions.py:17-19`, a `CommandException`) and are rendered
as structured 4xx/422 errors via the `CommandException` handler
(`superset/views/error_handling.py:45-37`), creating inconsistent API error semantics for
this specific ZIP-safety failure.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** superset/commands/database/uploaders/columnar_reader.py
**Line:** 94:96
**Comment:**
	*Api Mismatch: `check_is_safe_zip()` raises `SupersetException`, but this reader otherwise raises `DatabaseUploadFailed` for user upload errors. Letting `SupersetException` bubble here changes upload failures into generic 500 responses instead of the expected 4xx/422-style upload error handling. Catch and re-raise zip-safety failures as `DatabaseUploadFailed` to preserve the existing API error contract.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — wrapped the check_is_safe_zip() call in a try/except SupersetException block and re-raises as DatabaseUploadFailed. This keeps the unsafe-ZIP rejection aligned with other upload validation errors (422 instead of 500). Tests updated to assert DatabaseUploadFailed.

Comment thread superset/daos/base.py
Comment on lines +755 to +756
max_page_size = current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000)
page_size = min(max(page_size, 1), max_page_size)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The new clamp trusts SQLALCHEMY_DAO_MAX_PAGE_SIZE without validating it is at least 1. If that config is set to 0 or a negative value, page_size becomes non-positive (0 or negative), which breaks pagination semantics and can even produce effectively unbounded queries on some databases. Normalize the configured max to a positive integer before applying min(...). [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ DAO-backed list endpoints can emit zero-row result pages.
- ⚠️ Negative max page size may cause unbounded DAO queries.
Steps of Reproduction ✅
1. Override `SQLALCHEMY_DAO_MAX_PAGE_SIZE` in deployment config (e.g.
`superset_config.py`) to `0` or a negative value, which is read via
`current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000)` in `BaseDAO.list`
(`superset/daos/base.py:755`); the default in core config is `1000`
(`superset/config.py:174`).

2. Call any DAO `list()` method that delegates to `BaseDAO.list` (e.g. `UserDAO.list`
exercised in tests at `tests/integration_tests/dao/base_dao_test.py:626-639`) with a
normal positive `page_size` argument such as `50`.

3. In `BaseDAO.list`, after building and ordering the query
(`superset/daos/base.py:720-732`), the page size is clamped using `page_size =
min(max(page_size, 1), max_page_size)` (`superset/daos/base.py:756`); with `max_page_size
== 0`, this computes `page_size = min(max(50, 1), 0) == 0`, or with `max_page_size == -5`
it computes `page_size = -5`.

4. The DAO then applies `query = query.offset(page * page_size).limit(page_size)`
(`superset/daos/base.py:757`), so a misconfigured non-positive
`SQLALCHEMY_DAO_MAX_PAGE_SIZE` causes `LIMIT 0` (empty results) or `LIMIT -5`
(driver-specific behavior, often treated as an effectively unbounded query or an error),
breaking the intended "bounded page size" guarantee for all DAO-backed list endpoints that
use `BaseDAO.list`.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** superset/daos/base.py
**Line:** 755:756
**Comment:**
	*Incorrect Condition Logic: The new clamp trusts `SQLALCHEMY_DAO_MAX_PAGE_SIZE` without validating it is at least 1. If that config is set to `0` or a negative value, `page_size` becomes non-positive (`0` or negative), which breaks pagination semantics and can even produce effectively unbounded queries on some databases. Normalize the configured max to a positive integer before applying `min(...)`.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix
👍 | 👎

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern, but the config key is documented as an int and operator misconfiguration (string or ≤0) is an admin error, not a user error. Adding a startup-time validation guard for this config would be a separate follow-up to avoid scope creep here.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds two bounded-resource hardening changes: it blocks unsafe ZIP uploads in the columnar upload reader and introduces a configurable upper bound for DAO pagination page sizes to prevent unbounded queries.

Changes:

  • Add check_is_safe_zip() validation to the ZIP branch of ColumnarReader._yield_files.
  • Clamp BaseDAO.list() page_size to SQLALCHEMY_DAO_MAX_PAGE_SIZE (default 1000) while still flooring non-positive sizes to 1.
  • Add unit tests covering unsafe ZIP rejection and DAO page-size clamping behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
superset/commands/database/uploaders/columnar_reader.py Adds ZIP safety validation before reading archive entries.
superset/daos/base.py Clamps DAO page_size using a new config upper bound.
superset/config.py Introduces SQLALCHEMY_DAO_MAX_PAGE_SIZE config default.
tests/unit_tests/commands/databases/columnar_reader_test.py Adds tests for rejecting unsafe ZIPs in dataframe + metadata paths.
tests/unit_tests/dao/base_dao_test.py Adds tests asserting oversized/normal/zero page_size behavior.

Comment thread superset/daos/base.py
Comment on lines +753 to +756
# Clamp the page size to a sane range: at least 1, and no larger than
# the configured upper bound, to keep result sets bounded.
max_page_size = current_app.config.get("SQLALCHEMY_DAO_MAX_PAGE_SIZE", 1000)
page_size = min(max(page_size, 1), max_page_size)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — this is a pre-existing config validation gap, not introduced by this PR. Follow-up candidate.

Comment on lines 92 to 96
with ZipFile(file) as zip_file:
# guard against decompression bombs before reading entries,
# mirroring the importer path
check_is_safe_zip(zip_file)
# check if all file types are of the same extension
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — see the update to comment above. SupersetException from check_is_safe_zip is now caught and re-raised as DatabaseUploadFailed.

Comment on lines 26 to 32
from superset.commands.database.exceptions import DatabaseUploadFailed
from superset.commands.database.uploaders.columnar_reader import (
ColumnarReader,
ColumnarReaderOptions,
)
from superset.exceptions import SupersetException
from tests.unit_tests.fixtures.common import create_columnar_file
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the SupersetException import from the test file — it's no longer needed after updating assertions to expect DatabaseUploadFailed.

Comment on lines +248 to +265
def test_columnar_reader_unsafe_zip_rejected():
reader = ColumnarReader(
options=ColumnarReaderOptions(),
)
unsafe_zip = _make_high_ratio_zip()
with pytest.raises(SupersetException) as ex:
reader.file_to_dataframe(FileStorage(unsafe_zip, "test.zip"))
assert "compress ratio above allowed threshold" in str(ex.value)


def test_columnar_reader_unsafe_zip_rejected_in_metadata():
reader = ColumnarReader(
options=ColumnarReaderOptions(),
)
unsafe_zip = _make_high_ratio_zip()
with pytest.raises(SupersetException) as ex:
reader.file_metadata(FileStorage(unsafe_zip, "test.zip"))
assert "compress ratio above allowed threshold" in str(ex.value)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — both test assertions now expect DatabaseUploadFailed with the same error message check.

claude and others added 2 commits June 1, 2026 17:50
Add an application-level raw file-size check to the columnar upload reader
that rejects oversize files before their contents are buffered into memory.
This complements the existing ZIP decompression-ratio guard by bounding the
raw bytes accepted, regardless of compression.

The limit is controlled by a new configurable UPLOAD_MAX_FILE_SIZE_BYTES
config value (default 100 MB, set to None to disable). Extend the columnar
reader unit tests to cover oversize rejection and under-limit acceptance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…seUploadFailed

check_is_safe_zip() raises SupersetException (status 500), but the columnar
upload reader surfaces file validation errors as DatabaseUploadFailed (422).
An unsafe ZIP (decompression bomb) was escaping with a generic 500 instead
of the expected client-side upload error response. Now wraps the exception
and updated tests to assert DatabaseUploadFailed, removing the unused
SupersetException import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@bito-code-review
Copy link
Copy Markdown
Contributor

bito-code-review Bot commented Jun 2, 2026

Code Review Agent Run #6e2993

Actionable Suggestions - 0
Filtered by Review Rules

Bito filtered these suggestions based on rules created automatically for your feedback. Manage rules.

  • superset/config.py - 1
Review Details
  • Files reviewed - 3 · Commit Range: 822c3f7..71b6d4b
    • superset/commands/database/uploaders/columnar_reader.py
    • superset/config.py
    • tests/unit_tests/commands/databases/columnar_reader_test.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

  • /review - Manually triggers a full AI review.

  • /pause - Pauses automatic reviews on this pull request.

  • /resume - Resumes automatic reviews.

  • /resolve - Marks all Bito-posted review comments as resolved.

  • /abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.

Documentation & Help

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

change:backend Requires changing the backend size/L

Projects

Status: Needs Review

Development

Successfully merging this pull request may close these issues.

3 participants