# <font color=blue>Connecting Primes and Algorithmic Information Theory</font>
## <font color=blue>Ponder and Prove</font>


The [Wikipedia page on Algorithmic Information Theory](https://en.wikipedia.org/wiki/Algorithmic_information_theory) is very dense.


[Here is a gentler introduction](https://byui-cse.github.io/cse280-course/descriptive-complexity.pdf) by one of the leading lights of this theory.


# <font color=red>**TODO** Explore Huffman Trees and Huffman Codes for Data Compression</font>


How does one go about compressing information as compactly as possible?


How about if the information to be compressed is the first ten million primes?


The first ten of the first ten million primes:


|    |
|---:|
|  2 |
|  3 |
|  5 |
|  7 |
| 11 |
| 13 |
| 17 |
| 19 |
| 23 |
| 29 |


The last ten of the first ten million primes:


|           |
|----------:|
| 179424551 |
| 179424571 |
| 179424577 |
| 179424601 |
| 179424611 |
| 179424617 |
| 179424629 |
| 179424667 |
| 179424671 |
| 179424673 |


As ASCII text stored in a file with one prime per line, the size of this file is slightly over 89 MiB (93484450 bytes, to be exact; 1 MiB = 2^20 bytes).
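The byte count can be sanity-checked without writing the file, since each prime contributes its decimal digits plus one newline byte. Here is a minimal sketch, assuming a single-byte Unix newline; `ascii_size` is a hypothetical helper name, not something from the original notebook.

```python
def ascii_size(primes):
    """Bytes needed to store the primes as ASCII text, one per line."""
    # Each line holds the decimal digits of the prime plus one '\n' byte.
    return sum(len(str(p)) + 1 for p in primes)

first_ten = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(ascii_size(first_ten))  # 4 one-digit and 6 two-digit primes: 4*2 + 6*3 = 26
```

Called on the full list of ten million primes, the same function should reproduce the exact byte count quoted above.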


It is possible to compress this down to just over 5 MiB (5589056 bytes, to be exact).


That's a 94% reduction in size!
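The percentage follows directly from the two byte counts. A short worked calculation (the variable names are illustrative only):

```python
original_bytes = 93484450    # ASCII file, one prime per line
compressed_bytes = 5589056   # compressed size quoted above

saved = 1 - compressed_bytes / original_bytes
ratio = original_bytes / compressed_bytes
print(f"space saved: {saved:.1%}")    # 94.0%
print(f"size ratio:  {ratio:.1f} : 1")  # 16.7 : 1
```

Note that 94% is the fraction of space saved; expressed as a ratio of sizes, the same result is roughly 16.7 to 1.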


Standard compression tools achieve only about a 73% reduction on this ASCII data.
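To get a feel for what a general-purpose compressor does with this kind of text, here is a minimal sketch using Python's standard `zlib` and `lzma` modules. The sample is an assumption made to keep the example fast: it uses only the primes below 10,000, generated by a hypothetical `small_primes` helper. Its savings will not match the 73% figure, which refers to the full file and to tools the text does not name.

```python
import lzma
import zlib

def small_primes(limit):
    """Primes below limit via a basic sieve of Eratosthenes."""
    is_prime = bytearray([1]) * limit
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n starting at n*n.
            is_prime[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if is_prime[n]]

# One prime per line, as in the ASCII file described above.
text = "".join(f"{p}\n" for p in small_primes(10_000)).encode("ascii")

compressors = [("zlib", lambda b: zlib.compress(b, 9)), ("lzma", lzma.compress)]
for name, compress in compressors:
    packed = compress(text)
    saved = 1 - len(packed) / len(text)
    print(f"{name}: {len(text)} -> {len(packed)} bytes, saved {saved:.0%}")
```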


A more clever approach is needed.


Instead of compressing the list of prime numbers, compress a list of the **gaps** between them!


Generating this list is very straightforward:


In [None]:
from sympy import prime, sieve
# ftmp = first ten million primes
ftmp = list(sieve.primerange(2, prime(10000000) + 1))

In [None]:
# Difference between each prime and its predecessor
gaps = [q - p for p, q in zip(ftmp, ftmp[1:])]

Check to see if the list of primes is restorable from the gaps list:

In [None]:
from itertools import accumulate
# Start from the first prime (2) and add each gap in turn
pl = list(accumulate(gaps, initial=2))
pl == ftmp

In Huffman encoding, the more frequently a gap size occurs, the fewer bits are used to encode it: a more frequent gap size never gets a longer code than a less frequent one.
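The relationship between frequency and code length can be seen in a minimal sketch that computes only the length of each Huffman code, using the standard-library `heapq` module. The function name `huffman_code_lengths` and the toy frequencies in `sample` are made up for illustration; they are not the real gap counts.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Map each symbol to the bit length of its Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}  # a lone symbol still needs one bit
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

sample = {2: 5, 4: 5, 6: 9, 8: 3, 10: 1}  # toy data: gap size -> count
lengths = huffman_code_lengths(sample)
total_bits = sum(sample[s] * lengths[s] for s in sample)
print(lengths, total_bits)
```

In this toy example the rarest gap sizes (8 and 10) end up with 3-bit codes while the others get 2 bits. Tracking lengths rather than building the tree is a design shortcut: it is enough to estimate the compressed size, but an actual encoder would also need the code words themselves.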


Counting how many times each gap size occurs is a two-liner:


In [None]:
from collections import Counter
gap_dict = Counter(gaps)
gap_dict # doesn't count as a line of code!

As a correctness check, here are the first ten and the last ten gap counts, ordered by gap size:


|  Gap | Count   |
|-----:|--------:|
|    1 |       1 |
|    2 |  738597 |
|    4 |  738717 |
|    6 | 1297540 |
|    8 |  566151 |
|   10 |  729808 |
|   12 |  920661 |
|   14 |  503524 |
|   16 |  371677 |
|   18 |  667734 |
|      |         |
|  190 |       1 |
|  192 |       3 |
|  194 |       1 |
|  196 |       1 |
|  198 |       6 |
|  202 |       2 |
|  204 |       3 |
|  210 |       4 |
|  220 |       1 |
|  222 |       1 |


Note two things from these partial gap counts:


1. Small even gap sizes (< 100) are well represented; larger ones, up to the maximum of 222 seen here, much less so.
2. Ten million primes aren't enough for **every** even gap size up to that maximum to appear; for example, 200, 206, 208, 212, 214, 216, and 218 do not occur even once.
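The missing gap sizes can be listed mechanically. A minimal sketch, assuming the counts are held in a `Counter` like `gap_dict`; `missing_even_gaps` is a hypothetical helper, and `tail` reproduces only the last ten rows of the table above.

```python
from collections import Counter

def missing_even_gaps(gap_counts):
    """Even gap sizes below the largest observed gap that never occur."""
    largest = max(gap_counts)
    return [g for g in range(2, largest, 2) if g not in gap_counts]

# Tail of the table above: gap size -> count.
tail = Counter({190: 1, 192: 3, 194: 1, 196: 1, 198: 6,
                202: 2, 204: 3, 210: 4, 220: 1, 222: 1})

# Only the range covered by the tail is meaningful here.
print([g for g in missing_even_gaps(tail) if g >= 190])
# [200, 206, 208, 212, 214, 216, 218]
```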


## <font color=red>**TODO** Questions to Answer</font>


1. Is it possible to get better compression using a binary encoding instead of ASCII, that is, storing each prime in 32 bits (4 bytes)?
2. Is a fixed-width encoding of the gaps, whose width depends on how many different gap sizes there are, better or worse than the 32-bit binary encoding?
3. Is the Huffman tree small enough to make graphically viewing it feasible using the [anytree](https://pypi.org/project/anytree) Python module?
4. What does this compression technique have to do with algorithmic information theory?
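As a starting point for the first two questions, the sizes involved can be measured rather than guessed. This is a sketch under stated assumptions, not an answer: `binary_size` and `fixed_width_bits` are hypothetical helpers, little-endian unsigned 32-bit packing is one arbitrary choice, and the fixed-width figure ignores the table needed to map each index back to its gap size.

```python
import struct
from math import ceil, log2

def binary_size(primes):
    """Bytes used when every prime is packed as an unsigned 32-bit integer."""
    return len(struct.pack(f"<{len(primes)}I", *primes))

def fixed_width_bits(gaps):
    """Bits per gap when each distinct gap size gets an equal-length index."""
    distinct = len(set(gaps))
    return max(1, ceil(log2(distinct)))

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
gaps = [q - p for p, q in zip(primes, primes[1:])]  # [1, 2, 2, 4, 2, 4, 2, 4, 6]

print(binary_size(primes))     # 10 primes * 4 bytes = 40
print(fixed_width_bits(gaps))  # 4 distinct sizes {1, 2, 4, 6} -> 2 bits each
```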
