TANS with separate probability #52
Hello,

In my use case, I have a `Vec<String>` in a structure that I want to compress, but I need to keep O(1) access to the elements in the `Vec`. So I was thinking about using tANS and storing my probability table on the side, changing the struct layout from a "before" to an "after" form (sketched below).

Is this library supposed to support this? From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.
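A minimal sketch of what such a before/after layout could look like (the type names `Index`, `CompressedIndex`, and `ProbabilityTable` here are hypothetical placeholders, not part of `constriction`):

```rust
// Before: plain storage; each message is directly accessible in O(1).
struct Index {
    doc: Vec<String>,
}

// After: each message is compressed individually, and a single
// probability table, shared by all messages, is stored on the side.
struct ProbabilityTable; // placeholder for whatever entropy model is used
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // one compressed bit string per message, still O(1) to look up
    probs: ProbabilityTable,
}
```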
---

Sorry for the delay. I'm not sure I understand the question. Of course you can use the same probabilistic model to independently compress/decompress several messages. And yes, in this case you have to keep the model in memory only once, since compression and decompression don't consume the model; they only need a reference to it. Admittedly, this is a bit obscured by the generic nature of the API. I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, then your struct for the compressed representation could look like this:

```rust
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}
```

And there's nothing that holds you back from encoding or decoding each entry of `doc` individually.
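For instance, a hypothetical `decompress_one` method (assembled from the same pieces as the full example below; it is not part of `constriction`'s API) could decode just the `i`-th entry while leaving all other entries compressed:

```rust
impl CompressedIndex {
    /// Decompresses only the `i`-th message; all other entries stay compressed.
    /// (Hypothetical convenience method, sketched from the full example below.)
    fn decompress_one(&self, i: usize) -> String {
        let data = &self.doc[i]; // O(1) lookup of the compressed bit string.
        let mut coder =
            DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
        core::iter::from_fn(|| {
            let symbol_id = coder.decode_symbol(&self.probs).unwrap();
            self.alphabet.get(symbol_id).copied() // `None` at the EOF token ends the iterator.
        })
        .collect()
    }
}
```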
I'm not sure I understand. Of course you have to provide the probabilities any time you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of this library.

Full example:
```rust
use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();
        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();
                // Start with a special EOF symbol so that `CompressedIndex::decompress` knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();
                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet.get(&character).unwrap();
                    coder.encode_symbol(char_index, &probs).unwrap();
                }
                coder.into_compressed().unwrap_infallible()
            })
            .collect();
        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    self.alphabet.get(symbol_id).copied() // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator.
                })
                .collect()
            })
            .collect();
        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };
    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.]; // The last entry is for the EOF token.
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();
    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
```
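In practice, you'd probably derive the alphabet and the (unnormalized) symbol counts from the corpus itself rather than hard-coding them as in the test. A minimal sketch of such a helper (hypothetical; not part of `constriction`), which also reserves the extra count for the EOF token used above:

```rust
use std::collections::HashMap;

/// Collects the distinct characters of a corpus and their occurrence counts,
/// appending one extra count for the EOF token (one occurrence per message).
/// Hypothetical helper for illustration; not part of `constriction`.
fn alphabet_and_counts(corpus: &[String]) -> (Vec<char>, Vec<f64>) {
    let mut counts_by_char = HashMap::new();
    for message in corpus {
        for character in message.chars() {
            *counts_by_char.entry(character).or_insert(0.0_f64) += 1.0;
        }
    }
    let (alphabet, mut counts): (Vec<char>, Vec<f64>) = counts_by_char.into_iter().unzip();
    counts.push(corpus.len() as f64); // One EOF token per message (cf. the trailing `2.` in the test above).
    (alphabet, counts)
}
```

The resulting `alphabet` and `counts` can then be fed to `from_floating_point_probabilities`, exactly as in the `round_trip` test.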