Estimate header size in DataStatsU8_estimateHuffmanSizeFast (#681) #681

Open
terrelln wants to merge 8 commits into facebook:dev from terrelln:export-D102351635

@terrelln terrelln commented Apr 27, 2026

Summary:

  • Add a header size estimate to DataStatsU8_estimateHuffmanSizeFast()
  • No longer report entropy > 7 as incompressible, because that is not true.

This is still not a good estimate of the Huffman compressed size, and it is more or less useless for comparison against FSE, because it only reports the entropy cost. Stacked diffs will improve it.
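As a rough illustration of what "entropy cost plus a header estimate" can look like, here is a minimal sketch. It is not the code in this diff: `sketch_estimateHuffmanSize`, `sketch_log2`, and the one-byte-per-present-symbol header model are all assumptions made for the example, and the real header format and function signature may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* log2(x) for x >= 1, computed by repeated squaring so that this
 * sketch does not depend on libm. */
static double sketch_log2(double x)
{
    double result = 0.0;
    double bit    = 0.5;
    /* Integer part: halve until x is in [1, 2). */
    while (x >= 2.0) {
        x /= 2.0;
        result += 1.0;
    }
    /* Fractional part: each squaring exposes the next binary digit. */
    for (int i = 0; i < 40; ++i) {
        x *= x;
        if (x >= 2.0) {
            x /= 2.0;
            result += bit;
        }
        bit /= 2.0;
    }
    return result;
}

/* ASSUMPTION: the header costs one byte per symbol that occurs.
 * This is a placeholder model, not the real Huffman header format. */
#define SKETCH_HEADER_BYTES_PER_SYMBOL 1

/* Hypothetical estimator: entropy cost of the payload, rounded up to
 * whole bytes, plus the assumed header cost. */
static size_t sketch_estimateHuffmanSize(const uint8_t* src, size_t srcSize)
{
    size_t hist[256] = { 0 };
    size_t present   = 0;
    double bits      = 0.0;

    assert(src != NULL || srcSize == 0);
    for (size_t i = 0; i < srcSize; ++i) hist[src[i]]++;

    for (int s = 0; s < 256; ++s) {
        if (hist[s] == 0) continue;
        ++present;
        /* Each occurrence of symbol s costs log2(total / count) bits. */
        bits += (double)hist[s]
                * sketch_log2((double)srcSize / (double)hist[s]);
    }

    size_t payloadBytes = (size_t)(bits / 8.0);
    if ((double)payloadBytes * 8.0 < bits) ++payloadBytes; /* round up */
    return payloadBytes + present * SKETCH_HEADER_BYTES_PER_SYMBOL;
}
```

Under this model the estimate is compared against `srcSize` directly, so no fixed entropy threshold is needed to decide whether the input looks compressible.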

Reviewed By: Cyan4973

Differential Revision: D102351635

meta-codesync Bot commented Apr 27, 2026

@terrelln has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102351635.

@meta-codesync meta-codesync Bot changed the title from "Estimate header size in DataStatsU8_estimateHuffmanSizeFast" to "Estimate header size in DataStatsU8_estimateHuffmanSizeFast (#681)" Apr 27, 2026
terrelln added a commit to terrelln/openzl that referenced this pull request Apr 27, 2026
…#681)

Summary:
Pull Request resolved: facebook#681

* Add a header size estimate to `DataStatsU8_estimateHuffmanSizeFast()`
* No longer report `entropy > 7` as incompressible, because that is not true.

This is still not a good estimate of the Huffman compressed size, and it is more or less useless for comparison against `FSE`, because it only reports the entropy cost. Stacked diffs will improve it.

Differential Revision: D102351635
@terrelln terrelln force-pushed the export-D102351635 branch from 3965b02 to 3d1ba79 on April 27, 2026 21:11
terrelln and others added 8 commits May 12, 2026 10:58
Differential Revision: D104843371
Summary:
The LZ encoder capped `matchLength()` at `UINT16_MAX` because sequence match
lengths are stored as `uint16_t`. If the match had been walked backward by at
least `UINT16_MAX` bytes, match finding would resume at a position that had
already been inserted into the hash table. This would produce a match with
`distance <= 0`, and corruption would ensue.
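The arithmetic behind this failure mode can be sketched as follows. This is one reading of the summary above, not the encoder's actual code: `sketch_resumePos` and its parameters are hypothetical, and the model assumes that match finding resumes at the match start plus the capped length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of where match finding resumes after one match.
 *   ip      : position whose hash lookup found the match (already
 *             inserted into the hash table)
 *   backLen : bytes the match was extended backward from ip
 *   fwdLen  : bytes the match extends forward from ip */
static size_t sketch_resumePos(size_t ip, size_t backLen, size_t fwdLen)
{
    assert(backLen <= ip);
    size_t matchStart = ip - backLen;
    size_t matchLen   = backLen + fwdLen;
    /* The cap from the summary: lengths are stored as uint16_t. */
    if (matchLen > UINT16_MAX) matchLen = UINT16_MAX;
    return matchStart + matchLen;
}
```

In this model, `sketch_resumePos(100000, 10, 20)` resumes at 100020, past `ip`. With `backLen >= UINT16_MAX` the cap consumes the whole length budget before reaching `ip`, so the resume position is at or before `ip`: `sketch_resumePos(100000, 70000, 10)` gives 95535, a position that was already inserted.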

Differential Revision: D104838040
Differential Revision: D105332025
Differential Revision: D104873678
Summary:
LZ must be a dynamic graph because it invokes a multi-input node (ZL_NODE_MUX_LENGTHS).
But we need to allow overriding the successors in a serialized graph, for training and configurability.
So do the same thing as FieldLZ, which stores the index of the `customGraph` in a local int param.
When the local int param is set, the corresponding successor is set to that graph.
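The override pattern described above can be sketched roughly like this. None of these names come from OpenZL: `SketchGraphID`, `SketchLocalParams`, `SKETCH_PARAM_UNSET`, and `sketch_pickSuccessor` are hypothetical stand-ins, and falling back to the default on an out-of-range index is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef int SketchGraphID; /* hypothetical graph handle */

#define SKETCH_PARAM_UNSET (-1) /* hypothetical "param not set" marker */

typedef struct {
    const SketchGraphID* customGraphs; /* graphs attached to the config */
    size_t nbCustomGraphs;
    int successorIdx; /* hypothetical local int param: index into customGraphs */
} SketchLocalParams;

/* Returns the overriding custom graph when the local int param is set
 * and in range, otherwise the built-in default successor. */
static SketchGraphID sketch_pickSuccessor(
        const SketchLocalParams* p,
        SketchGraphID defaultGraph)
{
    assert(p != NULL);
    int idx = p->successorIdx;
    if (idx == SKETCH_PARAM_UNSET) return defaultGraph;
    if (idx < 0 || (size_t)idx >= p->nbCustomGraphs) return defaultGraph;
    return p->customGraphs[idx];
}
```

Storing an index rather than a graph handle keeps the override expressible as a plain integer parameter, which is what lets it survive serialization in this sketch.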

Differential Revision: D105367885
Summary: As title

Differential Revision: D105368972
Differential Revision: D102620101
…#681)

Summary:
Pull Request resolved: facebook#681

* Add a header size estimate to `DataStatsU8_estimateHuffmanSizeFast()`
* No longer report `entropy > 7` as incompressible, because that is not true.

This is still not a good estimate of the Huffman compressed size, and it is more or less useless for comparison against `FSE`, because it only reports the entropy cost. Stacked diffs will improve it.

Reviewed By: Cyan4973

Differential Revision: D102351635
@terrelln terrelln force-pushed the export-D102351635 branch from 3d1ba79 to 0bef02a on May 18, 2026 16:53