# LZ78 Sequential Probability Assignment: Python Interface for Rust Implementation
This code is associated with the paper [A Family of LZ78-based Universal Sequential Probability Assignments](https://arxiv.org/abs/2410.06589).

## Setup
You need to install Rust and Maturin, and then install the Python bindings for the `lz78` library as an editable Python package.
1. Install Rust: [Instructions](https://www.rust-lang.org/tools/install).
2. If applicable, switch to the desired Python environment.
3. Install Maturin: `pip install maturin`
4. Install the `lz78` Python package: `cd crates/python && maturin develop && cd ../..`

The examples below generate "dummy" text with the `lorem` package and build integer data with `numpy`, so install both: `pip install lorem numpy`.

## Imports

In [1]:
from lz78 import Sequence, LZ78Encoder, CharacterMap, StreamingLZ78Encoder
import numpy as np
from time import time
import lorem

## Sequences
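A `Sequence` wraps the data to be compressed. As the examples below show, it is built either from integer data together with an `alphabet_size`, or from a string together with a `CharacterMap` that maps each character to an integer symbol.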

### Example: Integer Sequence

In [2]:
data = np.random.randint(0, 2, size=(10_000_000,))
int_sequence = Sequence(data, alphabet_size=2)

A limited number of Python list operations work on `Sequence`:

In [3]:
len(int_sequence)

10000000

In [4]:
int_sequence[-20:]

[0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]

### Example: String Sequence

In [5]:
# Generate some dummy data: ten lorem-ipsum paragraphs joined by spaces.
s = " ".join(lorem.paragraph() for _ in range(10))

In [6]:
charmap = CharacterMap(s)

In [7]:
charmap.encode_all("lorem ipsum")

[13, 8, 9, 5, 6, 3, 1, 7, 4, 14, 6]

In [8]:
charmap.decode_all(charmap.encode_all("lorem ipsum"))

'lorem ipsum'

In [9]:
# This raises an error with a helpful message, because "h" is not in the character map.
charmap.encode_all("hello world")

RuntimeError: Character "h" not in mapping
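
The outputs above are consistent with a mapping that assigns each distinct character the next free integer, in order of first appearance. The following is a minimal pure-Python sketch of that idea. `SimpleCharacterMap` is a hypothetical stand-in written for illustration, not the `lz78` implementation, and the first-appearance ordering is an assumption inferred from the outputs shown here.

```python
class SimpleCharacterMap:
    """Hypothetical stand-in for CharacterMap; not the lz78 implementation."""

    def __init__(self, text):
        # Assumption: each distinct character gets the next free index,
        # in order of first appearance in `text`.
        self.char_to_index = {}
        for ch in text:
            if ch not in self.char_to_index:
                self.char_to_index[ch] = len(self.char_to_index)
        # Dicts preserve insertion order, so position i holds the character with index i.
        self.index_to_char = list(self.char_to_index)

    def alphabet_size(self):
        return len(self.index_to_char)

    def encode_all(self, text):
        try:
            return [self.char_to_index[ch] for ch in text]
        except KeyError as err:
            # Mirror the error message shown above for unknown characters.
            raise RuntimeError(f'Character "{err.args[0]}" not in mapping') from None

    def decode_all(self, indices):
        return "".join(self.index_to_char[i] for i in indices)


# Build the map from the first few words of the sample text shown in this section.
simple_map = SimpleCharacterMap("Sit sit sit tempora. Velit ut")
codes = simple_map.encode_all("lorem ipsum")
round_trip = simple_map.decode_all(codes)
```

Because the map is built only from characters it has seen, any text to be encoded later must use the same alphabet, which is why the streaming example in this document constructs its map from an explicit character list.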

In [10]:
charseq = Sequence(s, charmap=charmap)

In [11]:
charseq[:20]

[0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 2, 5, 6, 7, 8, 9, 10, 11]

In [12]:
charseq.get_data()

'Sit sit sit tempora. Velit ut adipisci dolorem quaerat velit voluptatem neque. Magnam voluptatem adipisci magnam sit sed. Eius est sit dolore. Ut neque adipisci velit non voluptatem. Consectetur adipisci neque ut etincidunt dolorem. Numquam dolore quiquia ut magnam ut. Amet dolorem sed consectetur ut neque velit. Non est quisquam labore labore quiquia amet. Ut amet amet quaerat adipisci non porro quiquia. Aliquam aliquam porro dolorem voluptatem. Dolor velit tempora non est sit magnam. Ipsum aliquam aliquam ut. Numquam labore ipsum eius dolorem. Quisquam voluptatem adipisci non ut voluptatem. Ipsum quisquam quaerat ut ut non. Modi quisquam ut dolore eius numquam tempora numquam. Dolor magnam adipisci aliquam adipisci modi amet. Quisquam adipisci non modi non ut. Quisquam neque etincidunt modi. Non consectetur dolor dolor quiquia dolorem dolorem. Etincidunt labore modi labore neque consectetur amet. Porro velit dolore aliquam dolorem. Neque tempora ut magnam eius neque amet. Quisquam q

## LZ78 Block-Wise Compression
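`LZ78Encoder` compresses an entire `Sequence` in a single call to `encode`, and `decode` recovers the original sequence from the encoded result.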
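
For readers unfamiliar with LZ78, the following is a minimal pure-Python sketch of textbook LZ78 parsing and its inverse. The functions `lz78_encode` and `lz78_decode` are hypothetical names used for illustration. The Rust implementation behind `LZ78Encoder` may differ in data structures and in how phrases are serialized to bits.

```python
def lz78_encode(symbols):
    """Parse `symbols` into LZ78 phrases of the form (prefix phrase index, new symbol)."""
    tree = {}      # (parent phrase index, symbol) -> phrase index; index 0 is the empty phrase
    phrases = []
    node = 0       # index of the phrase matched so far
    for sym in symbols:
        child = tree.get((node, sym))
        if child is not None:
            node = child                          # keep extending the current match
        else:
            tree[(node, sym)] = len(phrases) + 1  # register the new, longer phrase
            phrases.append((node, sym))
            node = 0                              # restart matching from the root
    if node != 0:
        # The input ended inside a known phrase: emit it with no new symbol.
        phrases.append((node, None))
    return phrases


def lz78_decode(phrases):
    """Rebuild the symbol list from the phrases produced by lz78_encode."""
    table = [[]]   # table[i] holds the symbols of phrase i; table[0] is empty
    out = []
    for prefix, sym in phrases:
        phrase = table[prefix] + ([] if sym is None else [sym])
        table.append(phrase)
        out.extend(phrase)
    return out


# The first twelve symbols of the string sequence shown earlier ("Sit sit sit ").
example = [0, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3]
example_phrases = lz78_encode(example)
example_decoded = lz78_decode(example_phrases)
```

Each phrase extends a previously seen phrase by one symbol, so repeated patterns are covered by progressively longer phrases.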

In [13]:
data = " ".join(lorem.paragraph() for _ in range(10_000))
charmap = CharacterMap(data)
charseq = Sequence(data, charmap=charmap)
encoder = LZ78Encoder()

In [14]:
# While this cell runs, Jupyter may not show it as running and may look as if
# it is still waiting for the cell to start. This is expected.
tic = time()
encoded = encoder.encode(charseq)
toc = time()
print("encode time", toc - tic, "seconds")

encode time 2.6338725090026855 seconds

In [30]:
encoded.compression_ratio()

0.31895026564598083
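
The exact formula behind `compression_ratio()` is not stated here. One plausible reading, assumed for this sketch, is the number of encoded bits divided by the bits needed to store the raw sequence at `log2(alphabet_size)` bits per symbol. The per-phrase bit cost below is also an assumption, and `estimated_compression_ratio` is a hypothetical helper, not part of the `lz78` package.

```python
import math


def count_lz78_phrases(symbols):
    """Count the phrases in a textbook LZ78 parse of `symbols`."""
    tree = {}
    n_phrases = 0
    node = 0
    for sym in symbols:
        child = tree.get((node, sym))
        if child is not None:
            node = child
        else:
            n_phrases += 1
            tree[(node, sym)] = n_phrases
            node = 0
    # Count a trailing partial phrase, if the input ended mid-match.
    return n_phrases + (1 if node != 0 else 0)


def estimated_compression_ratio(symbols, alphabet_size):
    """Assumed definition: encoded bits / (len(symbols) * log2(alphabet_size))."""
    if alphabet_size < 2 or len(symbols) == 0:
        raise ValueError("need a non-empty sequence and an alphabet of size >= 2")
    symbol_bits = math.ceil(math.log2(alphabet_size))
    # Assumption: phrase k can point at any of k earlier phrases (including the
    # empty one), costing ceil(log2(k)) bits, plus the bits for its new symbol.
    encoded_bits = sum(
        math.ceil(math.log2(k)) + symbol_bits
        for k in range(1, count_lz78_phrases(symbols) + 1)
    )
    raw_bits = len(symbols) * math.log2(alphabet_size)
    return encoded_bits / raw_bits


# A constant sequence is highly compressible, so its ratio is well below 1.
constant_ratio = estimated_compression_ratio([0] * 10_000, alphabet_size=2)
```

Under this reading, a ratio below 1 means the encoding is smaller than the fixed-width representation of the input.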

In [33]:
tic = time()
decoded = encoder.decode(encoded)
toc = time()
print("decode time", toc - tic, "seconds")

decode time 0.998359203338623 seconds


### Streaming

In [27]:
charmap = CharacterMap("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ. ,?")

In [28]:
encoder = StreamingLZ78Encoder(charmap.alphabet_size())

In [29]:
for _ in range(100):
    encoder.encode_block(Sequence(lorem.paragraph(), charmap=charmap))
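
Conceptually, streaming works because the LZ78 parser only needs its phrase tree and its current position in that tree to continue from where the previous block stopped. The sketch below illustrates this with a hypothetical pure-Python class, `StreamingLZ78Parser`. It is not the `StreamingLZ78Encoder` implementation, which may manage its state differently.

```python
class StreamingLZ78Parser:
    """Hypothetical streaming LZ78 parser; state persists across encode_block calls."""

    def __init__(self):
        self.tree = {}     # (parent phrase index, symbol) -> phrase index
        self.phrases = []  # completed phrases as (prefix index, new symbol)
        self.node = 0      # current position in the tree, kept between blocks

    def encode_block(self, symbols):
        for sym in symbols:
            child = self.tree.get((self.node, sym))
            if child is not None:
                self.node = child
            else:
                self.tree[(self.node, sym)] = len(self.phrases) + 1
                self.phrases.append((self.node, sym))
                self.node = 0

    def finish(self):
        """Return all phrases, including a trailing partial phrase if one is pending."""
        if self.node == 0:
            return list(self.phrases)
        return self.phrases + [(self.node, None)]


data_bits = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Parse everything in one call.
one_shot = StreamingLZ78Parser()
one_shot.encode_block(data_bits)

# Parse the same data in blocks of five symbols.
streamed = StreamingLZ78Parser()
for start in range(0, len(data_bits), 5):
    streamed.encode_block(data_bits[start:start + 5])
```

In this sketch the block boundaries do not change the resulting parse, since a match that is in progress at the end of one block simply continues in the next.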

In [30]:
encoder.get_encoded_sequence().compression_ratio()

0.44504818320274353

In [31]:
encoder.decode()[100:110]

[13, 14, 13, 53, 16, 20, 0, 4, 17, 0]

In [32]:
charmap.decode_all(encoder.decode()[100:200])

'non quaerat amet. Ut modi adipisci dolore tempora voluptatem sed dolore. Aliquam amet amet sed ipsum'