[python] Support value stats with truncate mode by default#7701
XiaoHongbo-Hope wants to merge 2 commits into
JingsongLi left a comment
The binary column in Java does not generate min/max stats. If Java reads the manifest written in Python and pushes predicates down to the binary column, it may result in incorrect file skipping.
Thanks, fixed
Force-pushed from 488ae1c to 23134b3
  max_seq_number=max_seq_number(),
- options=options)
+ options=options,
+ write_cols=self.write_cols)
The Java KeyValueDataFileWriter does not support writeCols, so is it necessary to pass write_cols?
Removed
JingsongLi left a comment
We should not enable these stats by default. Ideally, this statistical information should be obtained from the format.
Force-pushed from 16e4f9a to 90efd97
Force-pushed from 90efd97 to c1a4ac0
Updated the default config to none
Purpose
Python-written append tables have no value stats in data files, which prevents predicate pushdown from skipping irrelevant files during upsert-by-key reads. This PR enables default value stats for append table pruning. A follow-up PR will use these stats in the upsert_by_key lookup path.

Stats are skipped for us/ns/tz timestamp columns: _serialize_timestamp only supports ms precision (8-byte millis). Java's TIMESTAMP(4-9) uses a compound millis+nanos format that requires a different serialization path. Timezone is also not yet supported in serialization.
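For readers unfamiliar with truncate-mode stats, the idea can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `supports_stats`, `truncate_min`, `truncate_max`, `collect_stats`, and `may_contain` are hypothetical names, the type tags are plain strings rather than real type objects, and the truncation length of 5 is chosen just to keep the example short. It assumes string columns and shows why the truncated max has to be bumped so that it remains a valid upper bound for file skipping.

```python
from typing import List, Optional, Tuple

# Hypothetical type tags for columns that get no min/max stats in this sketch
# (binary columns, matching the review feedback about Java's behavior).
BINARY_TYPES = {"BINARY", "VARBINARY", "BYTES"}


def supports_stats(type_name: str, ts_precision: int = 3, has_tz: bool = False) -> bool:
    """Decide whether min/max stats are collected for a column (assumed rules)."""
    if type_name in BINARY_TYPES:
        return False
    if type_name == "TIMESTAMP":
        # Only ms-or-coarser precision without a timezone is serializable here.
        return not has_tz and ts_precision <= 3
    return True


def truncate_min(value: str, length: int) -> str:
    """A prefix always sorts <= the full value, so it is a valid lower bound."""
    return value[:length]


def truncate_max(value: str, length: int) -> Optional[str]:
    """Truncate, then bump the last bumpable character to stay an upper bound."""
    if len(value) <= length:
        return value
    prefix = value[:length]
    for i in range(len(prefix) - 1, -1, -1):
        code_point = ord(prefix[i])
        if code_point < 0x10FFFF:
            return prefix[:i] + chr(code_point + 1)
    return None  # no finite upper bound representable; treat as "unknown"


def collect_stats(values: List[Optional[str]], length: int) -> Tuple[Optional[str], Optional[str], int]:
    """Return (min, max, null_count) with min/max truncated to `length` chars."""
    non_null = [v for v in values if v is not None]
    null_count = len(values) - len(non_null)
    if not non_null:
        return None, None, null_count
    return truncate_min(min(non_null), length), truncate_max(max(non_null), length), null_count


def may_contain(min_value: Optional[str], max_value: Optional[str], literal: str) -> bool:
    """Equality predicate check: False means the file can be skipped safely."""
    if min_value is not None and literal < min_value:
        return False
    if max_value is not None and literal > max_value:
        return False
    return True


stats = collect_stats(["apple_pie_recipe_long", "banana", None], length=5)
print(stats)
print(may_contain(stats[0], stats[1], "banana"))
print(may_contain(stats[0], stats[1], "cherry"))
```

In this sketch a file whose stats exclude the literal is skipped, while a column with no stats (binary, or us/ns/tz timestamps) would have to be treated as "may contain", which is the conservative behavior the review comments ask for.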
Tests