[Improvement]: Optimize target file size after self-optimizing #3645

@cxxiii

Description

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Each merging task is designed to output a single file no larger than the target size. In practice, however, when the total size of the input files is close to the target size, the task may produce multiple output files even though their combined size is still below the target size.

How should we improve?

This issue is likely caused by the Parquet writer's inaccurate size estimation.

The Parquet writer's file size estimate is inaccurate mainly because of compression. The estimate consists of two parts: the data already written to disk, which is compressed and therefore measured accurately, and the data still buffered in memory. The in-memory buffer holds, for each column, both uncompressed data in the column store (not yet flushed to the page store) and compressed data already in the page store. Because the column store data is counted in its uncompressed form, the estimated file size is often larger than the actual file size.
Additionally, files with more columns are more prone to estimation errors. For files with fewer columns, a flush to the page store is often triggered before the flush to disk (which happens when the configured row group size, 128 MB by default, is reached), either because the content of a single column exceeds the configured page size (1 MB) or because the number of inserted rows exceeds pageRowCountLimit (default 2000 rows). Since data written to the page store is compressed, files with fewer columns hold a smaller proportion of uncompressed data in memory, which results in smaller estimation errors.
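
To make the effect concrete, here is a minimal Python sketch of the estimate described above. It is a toy model, not the Parquet writer's code: `ColumnBuffer`, `estimated_size`, `final_size`, `overestimate`, the byte counts, and the fixed compression ratio of 4 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnBuffer:
    raw_bytes: int         # values still in the column store, uncompressed
    compressed_bytes: int  # pages already compressed into the page store


def estimated_size(flushed_bytes: int, columns: List[ColumnBuffer]) -> int:
    """Size as the writer sees it: flushed bytes plus everything buffered."""
    buffered = sum(c.raw_bytes + c.compressed_bytes for c in columns)
    return flushed_bytes + buffered


def final_size(flushed_bytes: int, columns: List[ColumnBuffer], ratio: int) -> int:
    """Size once the buffer is flushed, assuming raw bytes shrink by `ratio`."""
    buffered = sum(c.raw_bytes // ratio + c.compressed_bytes for c in columns)
    return flushed_bytes + buffered


def overestimate(flushed_bytes: int, columns: List[ColumnBuffer], ratio: int) -> int:
    """How many bytes the estimate exceeds the final size by."""
    return estimated_size(flushed_bytes, columns) - final_size(flushed_bytes, columns, ratio)


# Hypothetical narrow file: few columns, most buffered data already in the page store.
narrow = [ColumnBuffer(raw_bytes=400_000, compressed_bytes=6_000_000) for _ in range(4)]
# Hypothetical wide file: many columns, none has filled a page yet.
wide = [ColumnBuffer(raw_bytes=400_000, compressed_bytes=0) for _ in range(200)]

if __name__ == "__main__":
    print("narrow overestimate:", overestimate(0, narrow, ratio=4))
    print("wide overestimate:  ", overestimate(0, wide, ratio=4))
```

In this model the overestimate depends only on how many uncompressed bytes sit in the column store, so the wide file, whose columns never reach the page size, is overestimated far more than the narrow one.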

To resolve this, we can add a condition: when the total size of the input files is less than the target size, the output file size is left unlimited so that the task writes a single file; otherwise, the output file size is capped at the target size.
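
A rough sketch of that condition in Python follows. The function names `output_size_limit` and `should_roll` are hypothetical and do not refer to existing code in the project; `None` stands for "no limit".

```python
from typing import Iterable, Optional


def output_size_limit(input_sizes: Iterable[int], target_size: int) -> Optional[int]:
    """Return the size cap for the output file, or None for unlimited."""
    if sum(input_sizes) < target_size:
        # All inputs fit in one target-sized file, so never roll to a new file.
        return None
    return target_size


def should_roll(estimated_size: int, limit: Optional[int]) -> bool:
    """Decide whether the writer should close the current file and open a new one."""
    return limit is not None and estimated_size >= limit


if __name__ == "__main__":
    mb = 1024 * 1024
    limit = output_size_limit([60 * mb, 60 * mb], target_size=128 * mb)
    # Even if the writer overestimates the size at 130 MB, no second file is opened.
    print(should_roll(130 * mb, limit))
```

With this check, an overestimate can no longer split the output when the inputs fit within the target size, while larger merges keep the existing cap.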

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response
