# Problem Statement

Code Design Test: Data Compression Design
Design an algorithm that will compress a given data buffer of bytes. Please describe your design and submit an
implementation in Python.
Your submission will be judged based on
- The number of bytes your output uses if saved to file
- Run time
- Scalability
- Maintainability
- Testability

**Assumptions**
1. data is an array of bytes. Each byte will contain a number from 0 to 127 (0x00 to 0x7F). It is common
for the data in the buffer to have the same value repeated in the series.
2. The compressed data will need to be decompressible. Please ensure that your algorithm allows for a
decompression algorithm to return the buffer to its previous form.

**Example**
```python
data = bytes([0x03, 0x74, 0x04, 0x04, 0x04, 0x35, 0x35, 0x64,
0x64, 0x64, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
0x56, 0x45, 0x56, 0x56, 0x56, 0x09, 0x09, 0x09])
compressed_bytes = byte_compress(data)
```

## Algorithm Design

Two properties of the data can be leveraged to reduce its size:
1. The sequence of bytes consists of numbers in the range 0 to 127, so each value needs only 7 bits.
2. Numbers tend to repeat, which suggests the distribution of values will likely be skewed (i.e. not uniform).
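
To make the two properties concrete, the sketch below estimates the empirical Shannon entropy of the example buffer from the problem statement and compares it with the 7 bits a fixed-width code would spend per value. The helper name `shannon_entropy_bits` is hypothetical and is not part of the `src` package used later in this notebook.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Return the empirical Shannon entropy of `data` in bits per symbol."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Example buffer from the problem statement.
data = bytes([0x03, 0x74, 0x04, 0x04, 0x04, 0x35, 0x35, 0x64,
              0x64, 0x64, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x56, 0x45, 0x56, 0x56, 0x56, 0x09, 0x09, 0x09])

assert max(data) <= 0x7F  # property 1: every value fits in 7 bits
entropy = shannon_entropy_bits(data)
print(f"{entropy:.2f} bits/symbol, versus 7 bits/symbol for a fixed-width code")
```

The further the entropy falls below 7 bits per symbol, the more room a frequency-based code has to beat a fixed-width one.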

Two popular algorithms can be used to compress the data. Each has its own tradeoffs.
1. Fixed-length encoding
2. Huffman encoding

### Choosing an Algorithm
#### Fixed-Length Encoding
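**Pros**
* Simple to implement and test; encoding and decoding are each a single linear pass.
* No code table needs to be stored: values in 0-127 fit in 7 bits, so the output is a predictable 7/8 of the input size.

**Cons**
* Savings are capped at 12.5% regardless of the data.
* Repeated values and skewed distributions are not exploited.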
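
As an illustration of the fixed-length approach, here is a minimal sketch that packs each value into a 7-bit field. It assumes "fixed-length encoding" means 7-bit packing, which the text does not spell out; `pack_7bit` and `unpack_7bit` are hypothetical names, not functions from the `src` package.

```python
def pack_7bit(data: bytes) -> bytes:
    """Pack values in 0..127 into a contiguous stream of 7-bit fields."""
    acc = 0      # bit accumulator
    nbits = 0    # number of not-yet-emitted bits held in acc
    out = bytearray()
    for value in data:
        if value > 0x7F:
            raise ValueError(f"value {value} does not fit in 7 bits")
        acc = (acc << 7) | value
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits still pending
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_7bit(packed: bytes, count: int) -> bytes:
    """Recover `count` 7-bit values from a stream produced by pack_7bit."""
    acc = 0
    nbits = 0
    out = bytearray()
    for byte in packed:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
        if len(out) == count:
            break
    return bytes(out)


data = bytes([0x03, 0x74, 0x04, 0x04, 0x04, 0x35, 0x35, 0x64,
              0x64, 0x64, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x56, 0x45, 0x56, 0x56, 0x56, 0x09, 0x09, 0x09])
packed = pack_7bit(data)
restored = unpack_7bit(packed, len(data))
```

The decoder needs the original value count because the zero padding in the last byte can otherwise be mistaken for an extra `0x00` value, for example when 7 values (49 bits) are stored in 7 bytes.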

#### Huffman Encoding
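**Pros**
* Assigns shorter codes to more frequent values, so skewed data compresses well.
* Produces a prefix-free code, so the bit stream can be decoded unambiguously without separators.

**Cons**
* The code table (or tree) must be stored with the output, which adds overhead that matters most for small inputs.
* Gains little when values are close to uniformly distributed, and every value still costs at least one bit, so long runs are not exploited directly.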

Huffman encoding offers more flexibility and will likely outperform fixed-length encoding when the data is skewed. It will not perform well if the distribution of the data turns out to be close to uniform (i.e. high entropy): its codes are then about as long as fixed-length ones, and the code table that must be stored adds overhead.

## Algorithm Pseudocode

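Huffman encoding requires the construction of a Huffman tree that arranges the symbols so that more frequent symbols sit closer to the root and therefore receive shorter codes. The tree is built by repeatedly merging the two lowest-frequency nodes until a single root remains.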
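
The construction can be sketched with the standard library alone. This is an illustrative sketch, not the implementation behind `HoffmanTree`, `HuffmanEncoder`, or `HuffmanDecoder` below; the function names `build_codes`, `encode`, and `decode` are hypothetical, and bits are kept as a string of `'0'`/`'1'` characters for readability rather than efficiency.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(data: bytes) -> dict:
    """Return a prefix-free code table {symbol: bit string} for `data`."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    # Heap entries are (weight, tiebreak, tree); a tree is either a symbol
    # (leaf) or a (left, right) tuple (internal node).
    heap = [(w, next(tiebreak), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # two lowest-weight subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:  # walk the tree: left edge appends '0', right edge '1'
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


def encode(data: bytes, codes: dict) -> str:
    """Concatenate the code of every byte in `data`."""
    return "".join(codes[b] for b in data)


def decode(bits: str, codes: dict) -> bytes:
    """Invert `encode` by matching prefixes against the code table."""
    inverse = {code: sym for sym, code in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return bytes(out)


data = bytes([0x03, 0x74, 0x04, 0x04, 0x04, 0x35, 0x35, 0x64,
              0x64, 0x64, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
              0x56, 0x45, 0x56, 0x56, 0x56, 0x09, 0x09, 0x09])
codes = build_codes(data)
bits = encode(data, codes)
assert decode(bits, codes) == data
print(f"{len(bits)} bits encoded, versus {8 * len(data)} bits raw")
```

The bit count printed here excludes the code table, which a real file format would also have to store so that the decoder can rebuild it.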

In [37]:
from src.lossless.encoders.hoffman_encoder import HuffmanEncoder
from src.lossless.decoders.huffman_decoder import HuffmanDecoder
from src.core.dist import Dist
from src.core.tree import HoffmanTree
from src.compress import compress_bytes
from src.decompress import decompress_bytes
from src.utils import generate_random_data, number_of_bits
from bitarray import bitarray

data = generate_random_data(1000, p=0.1)
original_bytes = bytes(data)
dist = Dist(data)
tree = HoffmanTree(dist)
encoder = HuffmanEncoder(tree=tree)
decoder = HuffmanDecoder(tree=tree)
final_bits = compress_bytes(data, encoder)
final_bytes = final_bits.tobytes()

decompressed_data = decompress_bytes(final_bytes, decoder)
size_before = number_of_bits(original_bytes)
print(f"Size before compression: {size_before}")

size_after = number_of_bits(final_bytes)
print(f"Size after compression: {size_after}")



if data == decompressed_data:
    print("Success: Decompression returned the original data")
else:
    print("Error: Original bytes do not equal decompressed bytes")

IndexError: bitarray index out of range

In [38]:
a = final_bits
b = final_bits.tobytes()
c = bitarray(endian="big")
c.frombytes(b)

In [44]:
len(c) / 8

868.0
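
The check above confirms that a round trip through `tobytes()` always yields a whole number of bytes; `bitarray.tobytes()` sets any pad bits in the last byte to zero. If the encoded bit stream is not a multiple of 8 bits long, a decoder that reads those pad bits as data can run past the end of a valid code. Whether that is the cause of the `IndexError` above is not established here. One common remedy, sketched below with the standard library only, is to store the pad length in a one-byte header. The names `bits_to_bytes` and `bytes_to_bits` are hypothetical, and the header layout is an assumption rather than the format used by `compress_bytes`.

```python
def bits_to_bytes(bits: str) -> bytes:
    """Serialize a '0'/'1' string as: 1 header byte (pad length) + packed bits."""
    pad = (-len(bits)) % 8            # zero bits needed to reach a byte boundary
    padded = bits + "0" * pad
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + body


def bytes_to_bits(blob: bytes) -> str:
    """Invert bits_to_bytes, dropping the pad bits recorded in the header."""
    if not blob:
        raise ValueError("missing padding header")
    pad, body = blob[0], blob[1:]
    if pad > 7 or (pad and not body):
        raise ValueError("invalid padding header")
    bits = "".join(f"{byte:08b}" for byte in body)
    return bits[:len(bits) - pad]


example_bits = "1011001"              # 7 bits, so 1 pad bit is added
blob = bits_to_bytes(example_bits)
assert bytes_to_bits(blob) == example_bits
```

The header costs one extra byte per buffer. Storing the original symbol count instead is an alternative that also lets the decoder stop at the right place.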

In [55]:
def to_binary(n:int) -> bitarray:
    if n <= 1:
        return bitarray(str(n))
    return to_binary(n // 2) + bitarray(str(n % 2))

In [56]:
to_binary(10)

bitarray('1010')

In [53]:
a = bitarray()
a.frombytes(bytes(1))

In [54]:
a

bitarray('00000000')

In [None]:
tree = HoffmanTree(dist)
encoder = HuffmanEncoder(tree=tree)
decoder = HuffmanDecoder(tree=tree)

In [None]:
final_bytes = compress_bytes(data, encoder)

In [None]:
final_bytes

In [None]:
b = bitarray()
b.frombytes(final_bytes)

In [None]:
[1, 2, 3] == [1, 3, 2]

In [None]:
decompress_bytes(final_bytes, decoder)

In [None]:
before = bitarray()
before.frombytes(original_bytes)

after = bitarray()
after.frombytes(final_bytes)

In [None]:
len(before)

In [None]:
len(after)

In [None]:
original_bytes

In [None]:
final_bytes

In [None]:
bitarray(int.from_bytes(b.tobytes(), 'big'))

In [None]:
data = bytes([0x03, 0x74, 0x04, 0x04, 0x04, 0x35, 0x35, 0x64,
0x64, 0x64, 0x64, 0x00, 0x00, 0x00, 0x00, 0x00,
0x56, 0x45, 0x56, 0x56, 0x56, 0x09, 0x09, 0x09])
b = bitarray()
b.frombytes(data)


In [None]:
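import sys

# Note: sys.getsizeof reports the size of the Python bytes object, including
# interpreter overhead, so it is larger than the payload size len(data).
sys.getsizeof(data)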