@kosiew kosiew commented Feb 4, 2026

Which issue does this PR close?

Rationale for this change

Large DataFrames could ignore the configured max_memory_bytes limit during display.

Previously, the defaults (repr_rows=10, min_rows_display=20) set the guaranteed minimum above the row cap. Because the collection loop also continues while rows_so_far < min_rows, that condition stayed true even after the memory budget was exceeded, so significantly more data was streamed and collected than intended.

This PR resolves that by:

  • Introducing a clearer max_rows setting (replacing repr_rows).
  • Enforcing the invariant that the guaranteed minimum (min_rows_display) cannot exceed the maximum rows cap.
  • Adding a deprecation path for repr_rows so existing users aren’t broken immediately.
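
To illustrate the failure mode, here is a minimal Python sketch of the loop condition, not the actual Rust implementation. It assumes a simplified model in which every batch holds one row of a fixed byte size; the function name `rows_collected` and all numeric values are hypothetical.

```python
def rows_collected(row_bytes, total_rows, max_bytes, max_rows, min_rows):
    """Simplified model: stream one-row batches of `row_bytes` bytes each.

    Keeps collecting while (under the memory budget AND under max_rows)
    OR while the guaranteed minimum has not been reached.
    """
    rows = 0
    size = 0
    while rows < total_rows and (
        (size < max_bytes and rows < max_rows) or rows < min_rows
    ):
        rows += 1
        size += row_bytes
    return rows, size


# Old defaults: the minimum (20) exceeds the cap (10), so the second
# clause keeps the loop running past both the cap and the budget.
old = rows_collected(row_bytes=1000, total_rows=100,
                     max_bytes=2000, max_rows=10, min_rows=20)

# With min_rows <= max_rows the loop stops at the cap.
new = rows_collected(row_bytes=1000, total_rows=100,
                     max_bytes=2000, max_rows=10, min_rows=10)

print(old, new)
```

In this model the guaranteed minimum still takes priority over the memory budget, which is why keeping it at or below the row cap matters.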

What changes are included in this PR?

  • Docs: Update user guide examples to use max_rows instead of repr_rows.

  • Python formatter API:

    • Add max_rows as the primary configuration for limiting displayed rows.

    • Keep repr_rows as a deprecated alias (constructor arg + property), emitting DeprecationWarning.

    • Add centralized validation via _validate_formatter_parameters():

      • All numeric args must be positive.
      • Enforce min_rows_display <= max_rows.
      • Reject ambiguous configs where both repr_rows and max_rows are provided with different values.
    • Store resolved value internally as _max_rows and expose max_rows / deprecated repr_rows properties.

    • Add max_rows to configure_formatter() allowed keys.

  • Rust display/streaming logic:

    • Rename config field repr_rows -> max_rows.
    • Update defaults (min_rows 20 → 10) to avoid violating the min/max relationship.
    • Validate min_rows <= max_rows.
    • Update the streaming loop to keep collecting while both the memory budget and max_rows have not been reached, or while fewer than the guaranteed min_rows have been collected, with clearer comments.
    • Continue to proportionally reduce rows when memory is exceeded while still respecting the minimum rows guarantee.
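
The Python-side validation and alias handling listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the helper name `resolve_max_rows`, its signature, the default values, and the error messages are all hypothetical.

```python
import warnings


def resolve_max_rows(max_rows=None, repr_rows=None,
                     min_rows_display=10, default_max_rows=10):
    """Hypothetical helper: resolve the deprecated alias, then validate."""
    if repr_rows is not None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Reject ambiguous configs where the two options disagree.
        if max_rows is not None and max_rows != repr_rows:
            raise ValueError("repr_rows and max_rows were given different values")
        max_rows = repr_rows
    if max_rows is None:
        max_rows = default_max_rows  # assumed default, for illustration only

    # All numeric arguments must be positive integers.
    for name, value in (("max_rows", max_rows),
                        ("min_rows_display", min_rows_display)):
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")

    # The guaranteed minimum cannot exceed the cap.
    if min_rows_display > max_rows:
        raise ValueError("min_rows_display must be <= max_rows")
    return max_rows
```

Passing the same value through both names is accepted in this sketch, since only differing values are described as ambiguous.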

Are these changes tested?

Yes.

  • Updated existing formatter tests to use max_rows.

  • Added new tests for:

    • Memory-limit boundary conditions (tiny budget, default budget, large budget, and min-rows override).

    • repr_rows backward compatibility:

      • Emits DeprecationWarning when used.
      • Resolves correctly to max_rows.
      • Errors when conflicting with an explicit max_rows.
    • Validation failures for invalid max_rows and for min_rows_display > max_rows.
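
As a rough picture of what the memory-limit boundary cases check, the proportional reduction can be modeled in a few lines of Python. This is a sketch under assumptions rather than the Rust logic itself: the function name `reduced_row_count`, the use of floor division for rounding, and the sample sizes are hypothetical.

```python
def reduced_row_count(rows_so_far, size_so_far, max_bytes, min_rows):
    """Scale the row count down by the fraction of the budget it fits in,
    but never go below the guaranteed minimum (or the rows available)."""
    if size_so_far <= max_bytes:
        return rows_so_far  # within budget: nothing to trim
    reduced = rows_so_far * max_bytes // size_so_far  # floor, by assumption
    guaranteed = min(min_rows, rows_so_far)
    return max(reduced, guaranteed)


# 100 rows totalling 100_000 bytes, with a guaranteed minimum of 10 rows.
tiny = reduced_row_count(100, 100_000, max_bytes=1_000, min_rows=10)
half = reduced_row_count(100, 100_000, max_bytes=50_000, min_rows=10)
large = reduced_row_count(100, 100_000, max_bytes=10**9, min_rows=10)
print(tiny, half, large)
```

The tiny-budget case is the min-rows override: the proportional count would be 1, so the guaranteed minimum wins.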

Are there any user-facing changes?

Yes.

  • New option: max_rows is now the preferred way to cap rows displayed in repr/HTML output.

  • Deprecation: repr_rows is deprecated and will emit a DeprecationWarning.

    • Existing code using repr_rows continues to work.
    • Providing both repr_rows and max_rows with different values raises a ValueError.
  • Behavioral change: Default minimum rows displayed changes from 20 to 10.

  • Docs: Updated examples and clarified that min_rows_display must be <= max_rows.

If the deprecation/rename is considered a public API change, please add the api change label.

LLM-generated code disclosure

This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.

@kosiew kosiew changed the title from "Fix DataFrame display memory limit by introducing max_rows and enforcing min_rows_display <= max_rows" to "Enforce DataFrame display memory limits with max_rows + min_rows constraint (deprecate repr_rows)" Feb 4, 2026
@kosiew kosiew marked this pull request as ready for review February 4, 2026 09:29
@kosiew kosiew self-assigned this Feb 4, 2026

Development

Successfully merging this pull request may close these issues.

Display max memory size is not respected if repr_rows < min_rows