Better estimate entropy costs (#680)

Open
terrelln wants to merge 9 commits into facebook:dev from terrelln:export-D102352000

@terrelln terrelln commented Apr 27, 2026

Summary:

Huffman Changes

  • Add parameters ZL_ENTROPY_MIN_GAIN_BYTES_PID (default 32) and ZL_ENTROPY_MIN_GAIN_PCT_PID (default 1) to the entropy graph that control the minimum gain in bytes or percent to allow entropy compression.
  • Improve Huffman estimation by building the Huffman CTable in the entropy graph to estimate the compressed size exactly, and if Huffman is selected passing the CTable down to the Huffman node.
  • Improve estimation of Huffman and FSE header sizes based on empirical evidence.
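As a rough illustration of the two ideas above, the sketch below computes an exact Huffman payload size from a histogram and a table of code lengths, then applies a minimum-gain check. The helper names are hypothetical and are not OpenZL APIs. The sketch assumes that both the byte threshold and the percent threshold must be met (the description does not say how they combine), and it ignores header bytes and any stream framing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the OpenZL implementation. */

/* Exact payload size in bytes when every symbol s is coded with nbBits[s] bits.
 * Header bytes and stream framing are deliberately left out. */
static size_t huf_payload_bytes(const uint32_t* counts,
                                const uint8_t* nbBits,
                                size_t nbSymbols)
{
    uint64_t bits = 0;
    for (size_t s = 0; s < nbSymbols; ++s) {
        bits += (uint64_t)counts[s] * nbBits[s];
    }
    return (size_t)((bits + 7) / 8); /* round up to whole bytes */
}

/* Assumption: entropy coding is allowed only if the gain reaches BOTH the
 * byte threshold and the percent threshold. */
static bool entropy_gain_is_enough(size_t srcSize,
                                   size_t estimatedSize,
                                   size_t minGainBytes,
                                   unsigned minGainPct)
{
    assert(minGainPct <= 100);
    if (estimatedSize >= srcSize) {
        return false; /* no gain at all */
    }
    size_t const gain = srcSize - estimatedSize;
    if (gain < minGainBytes) {
        return false;
    }
    /* gain / srcSize >= minGainPct / 100, written without division */
    return (uint64_t)gain * 100 >= (uint64_t)srcSize * minGainPct;
}
```

With the defaults from the description (32 bytes, 1 percent), a 1000-byte input estimated at 990 bytes would be rejected under this reading because the 10-byte gain is below the byte threshold.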

LZ Changes

  • Use ZL_ENTROPY_MIN_GAIN_BYTES_PID to set a more conservative bound on when to use Huffman based on the input size, rather than the size of the Huffman compressed stream.
  • Have ZL_GRAPH_COMPRESS_SMALL_LENGTHS (which is private) forward its parameters to ZL_GRAPH_HUFFMAN so we can control ZL_ENTROPY_MIN_GAIN_BYTES_PID.
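One possible reading of the input-size bound is sketched below; the description does not spell out the actual rule, so this is an assumption. A prefix code over two or more symbols spends at least one bit per input byte, so the Huffman payload can never be smaller than one eighth of the input. If even that best case cannot reach the minimum gain, Huffman can be skipped before any table is built. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; not the OpenZL implementation. */

/* Largest possible saving in bytes for a byte-oriented Huffman code:
 * each input byte costs at least 1 bit, and headers are ignored. */
static size_t max_huffman_gain(size_t srcSize)
{
    size_t const minPayload = (srcSize + 7) / 8;
    return srcSize - minPayload;
}

/* True when the input is large enough that Huffman could, in the best
 * case, save at least minGainBytes. */
static bool input_can_reach_min_gain(size_t srcSize, size_t minGainBytes)
{
    return max_huffman_gain(srcSize) >= minGainBytes;
}
```

Under this reading and the default of 32 bytes, inputs shorter than 37 bytes would never be sent to Huffman.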

Differential Revision: D102352000

@meta-cla meta-cla Bot added the cla signed label Apr 27, 2026

meta-codesync Bot commented Apr 27, 2026

@terrelln has exported this pull request. If you are a Meta employee, you can view the originating Diff in D102352000.

@meta-codesync meta-codesync Bot changed the title Better estimate entropy costs Better estimate entropy costs (#680) Apr 27, 2026
terrelln added a commit to terrelln/openzl that referenced this pull request Apr 27, 2026
@terrelln terrelln force-pushed the export-D102352000 branch 2 times, most recently from ab5b6a9 to 55453d2 Compare April 27, 2026 20:57
terrelln added a commit to terrelln/openzl that referenced this pull request Apr 27, 2026
terrelln added a commit to terrelln/openzl that referenced this pull request Apr 27, 2026
@terrelln terrelln force-pushed the export-D102352000 branch from 55453d2 to fa0fbbc Compare April 27, 2026 21:01
terrelln and others added 9 commits May 12, 2026 10:58
Differential Revision: D104843371
Summary:
The LZ encoder capped `matchLength()` at `UINT16_MAX` because sequence match
lengths are stored as `uint16_t`. If the match had been walked backward at least
`UINT16_MAX` bytes, match finding would resume at a position that had already
been inserted into the hash table. This would produce a match with
`distance <= 0`, and corruption would ensue.
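The arithmetic behind this failure can be sketched as follows. This is not the encoder's code: the function names are hypothetical, and the sketch assumes that every position up to and including the one where the match was found is already in the hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the OpenZL encoder. */

/* Match lengths are stored as uint16_t, so the emitted length is capped. */
static size_t capped_match_length(size_t fullLength)
{
    return fullLength < UINT16_MAX ? fullLength : (size_t)UINT16_MAX;
}

/* Position where match finding resumes after emitting a match that was
 * found at `foundAt`, extended `backward` bytes before it and `forward`
 * bytes from it. */
static size_t resume_position(size_t foundAt, size_t backward, size_t forward)
{
    assert(backward <= foundAt);
    size_t const start = foundAt - backward;
    return start + capped_match_length(backward + forward);
}

/* Assumption: positions <= foundAt are already in the hash table, so
 * resuming there can look up the current position itself or a later one. */
static bool resumes_in_inserted_region(size_t foundAt, size_t backward, size_t forward)
{
    return resume_position(foundAt, backward, forward) <= foundAt;
}
```

For a match found at position 100000 and walked back 70000 bytes, the capped match ends at 30000 + 65535 = 95535, which is before the position that triggered it.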

Differential Revision: D104838040
Differential Revision: D105332025
Differential Revision: D104873678
Summary:
LZ must be a dynamic graph because it invokes a multi-input node (`ZL_NODE_MUX_LENGTHS`).
But we need to allow overriding the successors in a serialized graph for training and configurability.
So do the same thing that FieldLZ does: store the index of the `customGraph` in a local int param.
When the local int param is set, the corresponding successor is set to that graph.
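The override pattern described here might look roughly like the sketch below. The types and the function are hypothetical stand-ins, not OpenZL's graph or parameter APIs: an optional integer parameter carries an index into the list of custom graphs, and the default successor is used when the parameter is absent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not OpenZL types. */
typedef struct {
    int id; /* placeholder for a graph handle */
} GraphId;

typedef struct {
    bool isSet; /* whether the local int param was provided */
    int value;  /* index into the custom graph list */
} OptionalIntParam;

/* Returns the successor for one output: the custom graph selected by the
 * param when it is set, otherwise the built-in default. */
static GraphId pick_successor(GraphId defaultGraph,
                              const GraphId* customGraphs,
                              size_t nbCustomGraphs,
                              OptionalIntParam overrideIdx)
{
    if (!overrideIdx.isSet) {
        return defaultGraph;
    }
    assert(overrideIdx.value >= 0);
    assert((size_t)overrideIdx.value < nbCustomGraphs);
    return customGraphs[overrideIdx.value];
}
```

Storing an index rather than a graph handle keeps the parameter a plain integer, which is what lets it survive serialization in this sketch.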

Differential Revision: D105367885
Summary: As title

Differential Revision: D105368972
Differential Revision: D102620101
Differential Revision: D102351635
Summary:
Pull Request resolved: facebook#680

## Huffman Changes
* Add parameters `ZL_ENTROPY_MIN_GAIN_BYTES_PID` (default 32) and `ZL_ENTROPY_MIN_GAIN_PCT_PID` (default 1) to the entropy graph that control the minimum gain in bytes or percent to allow entropy compression.
* Improve Huffman estimation by building the Huffman CTable in the entropy graph to estimate the compressed size exactly, and if Huffman is selected passing the CTable down to the Huffman node.
* Improve estimation of Huffman and FSE header sizes based on empirical evidence.

## LZ Changes
* Use `ZL_ENTROPY_MIN_GAIN_BYTES_PID` to set a more conservative bound on when to use Huffman based on the input size, rather than the size of the Huffman compressed stream.
* Have `ZL_GRAPH_COMPRESS_SMALL_LENGTHS` (which is private) forward its parameters to `ZL_GRAPH_HUFFMAN` so we can control `ZL_ENTROPY_MIN_GAIN_BYTES_PID`.

Differential Revision: D102352000
@terrelln terrelln force-pushed the export-D102352000 branch from fa0fbbc to 5191aca Compare May 18, 2026 20:54