Crazy Hutter Play

Somewhere to jot down notes about a general idea I've been noodling around. I have been using pen & paper, but as code might well ensue it makes sense to put things here.

The Hutter Prize is a challenge to compress data a lot. Specifically, create a compressed version (self-extracting archive) of the 100MB file enwik8 (a bunch of pages from Wikipedia) of less than about 16MB. I have absolutely no ambition of approaching the challenge, but it offers a good starting point to what I've been thinking about.

I doubt very much that the kind of thing I'm about to describe will be any good for Hutter, but it is fun to ponder. I'm pretty sure it's fairly closely related to some kinds of encryption, but that's not my field at all, so any similarities between this and real work are purely coincidental.

The General Idea

Begin with a Pseudorandom number generator. Probably based on a Linear-feedback shift register (LFSR). These things are wonderful! I first encountered them in the context of music synthesis, as, given a long enough sequence, their output approximates white noise in a very predictable fashion. It's how a lot of the early electronic games produced explosion noises.

They only need a handful of gates (or a few lines of code), looking something like this :

A 16-bit Fibonacci LFSR. The feedback tap numbers shown correspond to a primitive polynomial in the table, so the register cycles through the maximum number of 65535 states excluding the all-zeroes state. The state shown, 0xACE1 (hexadecimal) will be followed by 0x5670.

Now to 'compress' some data, start with the definition of the LFSR and the seed. Run the thing. Then do extensive searching of its output for patterns matching sequences of the input data.

Say the data is the string "Hello". The output of a sequence generator, mapped to characters, at it's 100th point might be:

...a p v I R n _ k e H e l Q v l g...

There, at position 110 are the first three characters of the data. So record that as a pair (position in the sequence, number of characters), ie (110, 3). Now later in the sequence, say at it's 200th point, there's :

...K s ! m l o s W _ ...

yielding (203, 2).

That's the full 5-character string represented in 4 8-bit numbers! Woo-hoo!

You will need some kind of heuristics for deciding when to stop searching - maybe have lots of searches running in parallel, pick the one that yields the best result within a given max chunk of the sequence). So if a little further on in the sequence you get ...ello... that may be more efficient to store.

Ok, this is pretty bogus as compression. Chances are 'Hel' won't occur so soon, and you end up with pointers that are bigger than the data you are trying to represent. But it kinda works as pseudo-encryption. If the sender and recipient have identical machines (and no-one else does), like Enigma the message will be pretty well hidden.

As stated on Wikipedia, "The goal of the Hutter Prize is to encourage research in artificial intelligence (AI).". Noting that the machine described above has recursion at it's heart, I don't think I'd be going too far out on a limb by pointing out parallels to The Unreasonable Effectiveness of Recurrent Neural Networks. A RNN trained on Wikipedia text could be viewed as a quasi-pseudorandom sequence generator that is optimised for Wikipedia-like text.

2am Encoding

I often suffer from insomnia. This kind of stuff doesn't help. Here's what I ended up trying to do in my head last night.

If, rather than the kind of sequence generator described above you just use the sequence of integers expressed in binary, dropping leading zeroes after the first 0, you get :

0 1 10 11 100 101 110 111 1000 1001...

This too can be compressed using a similar system. If you squeeze the bits together:

01101110010111011110001001...

you can point to chunks along the way (position in the sequence, number of characters) again :

0 (0,1)
1 (1, 1)
2 (2, 2)
3 (3, 2)
...

But, in the bit-squeezing, 3 could have been expressed more economically with (1, 2)

Continuing to 4 bits, this squeezes to :

0110100111100010101011

4 = 100 (4, 3)
5 = 101 (3, 3)
6 = 110 (2, 3)
7 = 111 (7, 3)
8 = 1000 (8, 4)
9 = 1001 (4, 4)
10 = 1010 (10, 4)
11 = 1011 (11, 4)
12 = 1100 (9, 4)
13 = 1101 (1, 4)
14 = 1110 (8, 4)
15 = 1111 (7, 4)

I'm counting from 0 on position and may well have made a mistake or two there

There are two numbers for ever one in the original, but this overhead could maybe be reduced by taking into account whatever the patterns are in the values - the second value looks like it follows a very simple sequence.

Ok, I reckon reality prevents this from ever being anything remotely compression, nor can I see any other practical application right now. But it's more challenging than counting sheep.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Crazy Hutter Play

The General Idea

2am Encoding

See Also

About

Releases

Packages

License

danja/chup

Folders and files

Latest commit

History

Repository files navigation

Crazy Hutter Play

The General Idea

2am Encoding

See Also

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages