TANS with separate probability #52
Hello,

In my use case, I have a `Vec<String>` in a structure that I want to compress, but I need to keep O(1) access to the elements in the `Vec`. So I was thinking about using tANS and storing my probability table on the side, changing the struct layout from a "before" to an "after" form (sketched below).

Is this library supposed to support this? From what I've seen, it seems like we need to provide the probabilities for the symbol we're currently compressing.
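A minimal sketch of what such a before/after layout could look like (the type names `Index`, `CompressedIndex`, and `ProbabilityTable` here are hypothetical placeholders, not part of `constriction`):

```rust
// Before: plain storage; each message is directly accessible in O(1).
struct Index {
    doc: Vec<String>,
}

// After: each message is compressed individually, and a single
// probability table, shared by all messages, is stored on the side.
struct ProbabilityTable; // placeholder for whatever entropy model is used
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // one compressed bit string per message, still O(1) to look up
    probs: ProbabilityTable,
}
```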
---

Sorry for the delay. I'm not sure I understand the question. Of course you can use the same probabilistic model to independently compress/decompress several messages. And yes, in this case you have to keep the model in memory only once, since compression and decompression don't consume the model; they only need a reference to it. Admittedly, this is a bit obscured by the generic nature of the API. I'm attaching an example of a full compression/decompression round trip below. But in brief, if I understand correctly what you're trying to achieve, then your struct for the compressed representation could look like this:

```rust
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message (see full example below).
}
```

And there's nothing that holds you back from encoding or decoding each entry of `doc` individually.
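For instance, a hypothetical `decompress_one` method (assembled from the same pieces as the full example below; it is not part of `constriction`'s API) could decode just the `i`-th entry while leaving all other entries compressed:

```rust
impl CompressedIndex {
    /// Decompresses only the `i`-th message; all other entries stay compressed.
    /// (Hypothetical convenience method, sketched from the full example below.)
    fn decompress_one(&self, i: usize) -> String {
        let data = &self.doc[i]; // O(1) lookup of the compressed bit string.
        let mut coder =
            DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
        core::iter::from_fn(|| {
            let symbol_id = coder.decode_symbol(&self.probs).unwrap();
            self.alphabet.get(symbol_id).copied() // `None` at the EOF token ends the iterator.
        })
        .collect()
    }
}
```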
I'm not sure I understand. Of course you have to provide the probabilities any time you encode or decode a symbol (in fact, you have to provide the entire entropy model, not just the probability of the specific symbol you're currently encoding or decoding). That's not a limitation of this library.

Full example:
```rust
use std::collections::HashMap;

use constriction::{
    backends::Cursor,
    stream::{
        model::DefaultContiguousCategoricalEntropyModel, stack::DefaultAnsCoder, Decode, Encode,
    },
    UnwrapInfallible,
};

#[derive(Debug, PartialEq, Eq)]
struct UncompressedIndex {
    doc: Vec<String>,
}

#[derive(Debug)]
struct CompressedIndex {
    doc: Vec<Vec<u32>>, // Note that constriction represents bit strings in 32-bit chunks by default for performance reasons.
    probs: DefaultContiguousCategoricalEntropyModel, // (for example; you can use any entropy model in `constriction::stream::model`)
    alphabet: Vec<char>, // List of all distinct characters that can appear in a message.
}

impl UncompressedIndex {
    fn compress(
        &self,
        probs: DefaultContiguousCategoricalEntropyModel,
        alphabet: Vec<char>,
    ) -> CompressedIndex {
        let inverse_alphabet = alphabet
            .iter()
            .enumerate()
            .map(|(index, &character)| (character, index))
            .collect::<HashMap<_, _>>();
        let doc = self
            .doc
            .iter()
            .map(|message| {
                let mut coder = DefaultAnsCoder::new();
                // Start with a special EOF symbol so that `CompressedIndex::decompress` knows when to terminate:
                coder.encode_symbol(alphabet.len(), &probs).unwrap();
                // Then encode the message, character by character, in reverse order:
                for character in message.chars().rev() {
                    let char_index = *inverse_alphabet.get(&character).unwrap();
                    coder.encode_symbol(char_index, &probs).unwrap();
                }
                coder.into_compressed().unwrap_infallible()
            })
            .collect();
        CompressedIndex {
            doc,
            probs,
            alphabet,
        }
    }
}

impl CompressedIndex {
    fn decompress(&self) -> UncompressedIndex {
        let doc = self
            .doc
            .iter()
            .map(|data| {
                let mut coder =
                    DefaultAnsCoder::from_compressed(Cursor::new_at_write_end(&data[..])).unwrap();
                core::iter::from_fn(|| {
                    let symbol_id = coder.decode_symbol(&self.probs).unwrap();
                    self.alphabet.get(symbol_id).copied() // Returns `None` if `symbol_id` is the EOF token, which terminates the iterator.
                })
                .collect()
            })
            .collect();
        UncompressedIndex { doc }
    }
}

#[test]
fn round_trip() {
    let uncompressed = UncompressedIndex {
        doc: vec!["Hello, World!".to_string(), "Goodbye.".to_string()],
    };
    let alphabet = vec![
        'H', 'e', 'l', 'o', ',', ' ', 'W', 'r', 'd', '!', 'G', 'b', 'y', '.',
    ];
    let counts = [1., 2., 3., 4., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2.]; // The last entry is for the EOF token.
    let probs =
        DefaultContiguousCategoricalEntropyModel::from_floating_point_probabilities(&counts)
            .unwrap();
    let compressed = uncompressed.compress(probs, alphabet);
    let reconstructed = compressed.decompress();
    assert_eq!(uncompressed, reconstructed);
}
```
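In practice, you'd probably derive the alphabet and the (unnormalized) symbol counts from the corpus itself rather than hard-coding them as in the test. A minimal sketch of such a helper (hypothetical; not part of `constriction`), which also reserves the extra count for the EOF token used above:

```rust
use std::collections::HashMap;

/// Collects the distinct characters of a corpus and their occurrence counts,
/// appending one extra count for the EOF token (one occurrence per message).
/// Hypothetical helper for illustration; not part of `constriction`.
fn alphabet_and_counts(corpus: &[String]) -> (Vec<char>, Vec<f64>) {
    let mut counts_by_char = HashMap::new();
    for message in corpus {
        for character in message.chars() {
            *counts_by_char.entry(character).or_insert(0.0_f64) += 1.0;
        }
    }
    let (alphabet, mut counts): (Vec<char>, Vec<f64>) = counts_by_char.into_iter().unzip();
    counts.push(corpus.len() as f64); // One EOF token per message (cf. the trailing `2.` in the test above).
    (alphabet, counts)
}
```

The resulting `alphabet` and `counts` can then be fed to `from_floating_point_probabilities`, exactly as in the `round_trip` test.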