TANS with separate probability #52

Open
irevoire opened this issue May 17, 2024 · 1 comment

irevoire commented May 17, 2024

Hello,

In my use case, I have a Vec of Strings in a structure that I want to compress.
But I need to keep O(1) access to each element of the Vec, so I was thinking about using tANS and storing my probability table on the side:
Before:

struct Index {
  doc: Vec<String>,
}

After:

struct Index {
  doc: Vec<Vec<u8>>,
  // The same probabilities are used to encode/decode every string of the documents
  prob: [u8; 256],
}

Is this library supposed to support this?
From what I’ve seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.

@robamler
Collaborator

Sorry for the delay. I'm not sure I understand the question. You can certainly use the same probabilistic model to independently compress and decompress several messages. In that case you only have to keep the model in memory once: compression and decompression don't consume the model, they only need a reference to it.

Admittedly, this is a bit obscured by the generic nature of the API. For example, the method AnsCoder::encode_iid_symbols_reverse takes a generic argument model whose type has to implement EncoderModel, so it may indeed seem like you'd have to provide a fresh entropy model every time. But there's a blanket implementation of EncoderModel for any reference &M where M implements EncoderModel, so you only need a single owned EncoderModel and can hand out as many shared references to it as you like. (Some small entropy models also implement Copy; for those, it's usually more performant to pass them by value.)
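The reference-based pattern can be illustrated in isolation with a minimal std-only sketch. Everything in it is a hypothetical stand-in: the `Model` trait, the `Categorical` struct, and the `is_encodable` function are invented for illustration and are not constriction's actual definitions. The sketch only assumes that the library's blanket implementation follows the same general shape.

```rust
/// Hypothetical stand-in for an entropy-model trait (NOT constriction's `EncoderModel`).
trait Model {
    /// Probability assigned to `symbol`, or `None` if it lies outside the alphabet.
    fn probability(&self, symbol: usize) -> Option<f64>;
}

/// An owned model that is built once and shared across many messages.
struct Categorical {
    probs: Vec<f64>,
}

impl Model for Categorical {
    fn probability(&self, symbol: usize) -> Option<f64> {
        self.probs.get(symbol).copied()
    }
}

/// Blanket implementation: a shared reference to any model is itself a model.
impl<M: Model + ?Sized> Model for &M {
    fn probability(&self, symbol: usize) -> Option<f64> {
        (**self).probability(symbol)
    }
}

/// Takes the model *by value*, the way a generic coder method might.
/// Returns `true` if every symbol of `message` has nonzero probability.
fn is_encodable<M: Model>(message: &[usize], model: M) -> bool {
    message
        .iter()
        .all(|&symbol| model.probability(symbol).map_or(false, |p| p > 0.0))
}

/// Checks many messages against one owned model by handing out `&Categorical`,
/// which is `Copy`, so the same reference can be passed any number of times.
fn check_all(messages: &[Vec<usize>], model: &Categorical) -> Vec<bool> {
    messages
        .iter()
        .map(|message| is_encodable(message, model))
        .collect()
}
```

Because `is_encodable` is generic over `M: Model`, passing `&model` instantiates it with `M = &Categorical`, and the owned `Categorical` stays available for the next call.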

I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, then your struct for the compressed representation of Index should probably look something like this:

struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}

And there's nothing that holds you back from encoding or decoding each entry of doc independently, using the shared entropy model probs and the shared alphabet (see full round-trip example below).

> From what I’ve seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.

I'm not sure I understand. Of course you have to provide the probabilities anytime you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of constriction, it's a fundamental theoretical limitation of source coding: one cannot (losslessly) compress data without a probabilistic model of the data source ("source coding theorem").
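To make the role of the model concrete, here is a small std-only sketch of the quantities involved. It assumes the standard definitions of information content (a symbol with probability p ideally costs -log2(p) bits) and Shannon entropy. The function names `normalize`, `ideal_bits`, and `entropy_bits` are made up for this illustration and are not part of constriction.

```rust
/// Normalizes non-negative counts into probabilities that sum to one.
fn normalize(counts: &[f64]) -> Vec<f64> {
    let total: f64 = counts.iter().sum();
    counts.iter().map(|&count| count / total).collect()
}

/// Ideal code length, in bits, of a message of symbol indices under `probs`.
/// Panics if a symbol index is out of range.
fn ideal_bits(message: &[usize], probs: &[f64]) -> f64 {
    message.iter().map(|&symbol| -probs[symbol].log2()).sum()
}

/// Shannon entropy of the model in bits per symbol, i.e., the expected
/// ideal code length if symbols are really drawn from `probs`.
fn entropy_bits(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}
```

The same message gets a different ideal length under different models, which is why the coder needs the whole model for every symbol: under a uniform model over four symbols each symbol costs 2 bits, whereas a skewed model makes frequent symbols cheaper and rare ones more expensive.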

Full Example

use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();

        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();

                // Start with a special EOF symbol so that `CompressedIndex::decompress` knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();

                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet
                        .get(&character)
                        .expect("character is missing from the alphabet");
                    coder.encode_symbol(char_index, &probs).unwrap();
                }

                coder.into_compressed().unwrap_infallible()
            })
            .collect();

        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
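                // ANS operates as a stack, so decoding reads the compressed words from
                // the end; hence the cursor starts at the write end of the borrowed data.
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();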
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    self.alphabet.get(symbol_id).copied() // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator.
                })
                .collect()
            })
            .collect();

        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };

    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    // Occurrence counts per alphabet entry ('d' occurs twice); the last entry is for the EOF token.
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 2., 1., 1., 1., 1., 1., 2.];
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();

    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
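In the test above, the alphabet and the counts are written out by hand. For a real index they would more likely be derived from the documents themselves. The following std-only sketch shows one way to do that; the helper name `alphabet_and_counts` is hypothetical, and appending one EOF count per document is an assumption that mirrors the EOF convention used in `compress`. The resulting `counts` vector is meant to be passed to the same model constructor as in the test.

```rust
use std::collections::BTreeMap;

/// Returns the sorted list of distinct characters in `docs` together with
/// their occurrence counts, followed by one extra trailing count for the EOF
/// token (which occurs exactly once per document).
fn alphabet_and_counts(docs: &[String]) -> (Vec<char>, Vec<f64>) {
    // A `BTreeMap` keeps the characters sorted, so the alphabet is deterministic.
    let mut histogram: BTreeMap<char, u32> = BTreeMap::new();
    for character in docs.iter().flat_map(|doc| doc.chars()) {
        *histogram.entry(character).or_insert(0) += 1;
    }

    // Keys and values iterate in the same order, so index `i` of `counts`
    // belongs to `alphabet[i]`.
    let alphabet: Vec<char> = histogram.keys().copied().collect();
    let mut counts: Vec<f64> = histogram.values().map(|&n| f64::from(n)).collect();

    // EOF token: its symbol index is `alphabet.len()`, as in `compress`.
    counts.push(docs.len() as f64);

    (alphabet, counts)
}
```

Note that the alphabet comes out in sorted order rather than in the order used in the hand-written test, which is fine as long as the same alphabet is stored in the compressed index and used for decoding.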

robamler added a commit that referenced this issue Jun 15, 2024