Skip to content
This repository has been archived by the owner on Oct 28, 2023. It is now read-only.
/ huffman Public archive

Simple and straigthforward implementation of Huffman coding - as a small exercise

Notifications You must be signed in to change notification settings

christophevg/huffman

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Huffman

Simple and straigthforward implementation of Huffman coding - as a small exercise.

Running the script as is, it performs all steps in the Huffman coding process on a predefined "hello world" string...

$ python -m huffman.coding
88 hello world
[('h', 1), ('e', 1), ('l', 3), ('o', 2), (' ', 1), ('w', 1), ('r', 1), ('d', 1)]
((((((('r', 1), ('d', 1)), 2), (((' ', 1), ('w', 1)), 2)), 4), ((('l', 3), ((((('h', 1), ('e', 1)), 2), ('o', 2)), 4)), 7)), 11)
 11
   4
     2
       ('r', 1)
       ('d', 1)
     2
       (' ', 1)
       ('w', 1)
   7
     ('l', 3)
     4
       2
         ('h', 1)
         ('e', 1)
       ('o', 2)
((('r', 'd'), (' ', 'w')), ('l', (('h', 'e'), 'o')))
{'r': '000', 'd': '001', ' ': '010', 'w': '011', 'l': '10', 'h': '1100', 'e': '1101', 'o': '111'}
32 0.36363636363636365 11001101101011101001111100010001
0001r1d01 1w01l001h1e1o
((('r', 'd'), (' ', 'w')), ('l', (('h', 'e'), 'o')))
{'000': 'r', '001': 'd', '010': ' ', '011': 'w', '10': 'l', '1100': 'h', '1101': 'e', '111': 'o'}
hello world

Command line arguments are considered a string...

$ python -m huffman.coding hello world from cli
160 hello world from cli
[('h', 1), ('e', 1), ('l', 4), ('o', 3), (' ', 3), ('w', 1), ('r', 2), ('d', 1), ('f', 1), ('m', 1), ('c', 1), ('i', 1)]
((((((((('c', 1), ('i', 1)), 2), ((('f', 1), ('m', 1)), 2)), 4), ('l', 4)), 8), ((((('r', 2), ('o', 3)), 5), (((' ', 3), ((((('w', 1), ('d', 1)), 2), ((('h', 1), ('e', 1)), 2)), 4)), 7)), 12)), 20)
 20
   8
     4
       2
         ('c', 1)
         ('i', 1)
       2
         ('f', 1)
         ('m', 1)
     ('l', 4)
   12
     5
       ('r', 2)
       ('o', 3)
     7
       (' ', 3)
       4
         2
           ('w', 1)
           ('d', 1)
         2
           ('h', 1)
           ('e', 1)
(((('c', 'i'), ('f', 'm')), 'l'), (('r', 'o'), (' ', (('w', 'd'), ('h', 'e')))))
{'c': '0000', 'i': '0001', 'f': '0010', 'm': '0011', 'l': '01', 'r': '100', 'o': '101', ' ': '110', 'w': '11100', 'd': '11101', 'h': '11110', 'e': '11111'}
68 0.425 11110111110101101110111001011000111101110001010010100111100000010001
00001c1i01f1m1l001r1o01 001w1d01h1e
(((('c', 'i'), ('f', 'm')), 'l'), (('r', 'o'), (' ', (('w', 'd'), ('h', 'e')))))
{'0000': 'c', '0001': 'i', '0010': 'f', '0011': 'm', '01': 'l', '100': 'r', '101': 'o', '110': ' ', '11100': 'w', '11101': 'd', '11110': 'h', '11111': 'e'}
hello world from cli

Download a large file, e.g. from https://corpus.canterbury.ac.nz/descriptions/ and provide it to the script...

$ make large
wget http://corpus.canterbury.ac.nz/resources/large.zip
--2020-08-22 10:37:09--  http://corpus.canterbury.ac.nz/resources/large.zip
Resolving corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)... 132.181.17.8
Connecting to corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)|132.181.17.8|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://corpus.canterbury.ac.nz/resources/large.zip [following]
--2020-08-22 10:37:09--  https://corpus.canterbury.ac.nz/resources/large.zip
Connecting to corpus.canterbury.ac.nz (corpus.canterbury.ac.nz)|132.181.17.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3256280 (3,1M) [application/zip]
Saving to: ‘large.zip’

large.zip            100%[===================>]   3,10M   366KB/s    in 8,8s    

2020-08-22 10:37:20 (361 KB/s) - ‘large.zip’ saved [3256280/3256280]

unzip large.zip -d large
Archive:  large.zip
  inflating: large/bible.txt         
  inflating: large/E.coli            
  inflating: large/world192.txt      

$ python -m huffman.coding large/bible.txt
32379136 bits
17747595 bits 0.5481182388560337 %

Decoding Performance

My initial implementation for decoding used a dictionary with bitstrings as keys and the corresponding characters as values. It was a similar approach to the encoding logic. Due to not actually decoding the encoded bible, I didn't notice that this took ... like forever ;-)

I changed the implementation to one that takes the encoding tree and traverse it, based on the bits in the encoded stream, adding the character when arriving at a leaf. This proved to be much faster, but still rather slow. Nevertheless, it introduced me to pytext-benchmark:

$ make test
tox
GLOB sdist-make: /Users/xtof/Workspace/huffman/setup.py
py37 inst-nodeps: /Users/xtof/Workspace/huffman/.tox/.tmp/package/1/huffman-0.0.2.zip
py37 installed: attrs==20.1.0,certifi==2020.6.20,chardet==3.0.4,coverage==5.2.1,coveralls==2.1.2,docopt==0.6.2,huffman @ file:///Users/xtof/Workspace/huffman/.tox/.tmp/package/1/huffman-0.0.2.zip,idna==2.10,importlib-metadata==1.7.0,iniconfig==1.0.1,more-itertools==8.4.0,packaging==20.4,pluggy==0.13.1,py==1.9.0,py-cpuinfo==7.0.0,pyparsing==2.4.7,pytest==6.0.1,pytest-benchmark==3.2.3,requests==2.24.0,six==1.15.0,toml==0.10.1,urllib3==1.25.10,zipp==3.1.0
py37 run-test-pre: PYTHONHASHSEED='1916796342'
py37 run-test: commands[0] | coverage run -m '--omit=*/.tox/*,*/distutils/*,*/tests/*' pytest
=================================== test session starts ===================================
platform darwin -- Python 3.7.7, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
cachedir: .tox/py37/.pytest_cache
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/xtof/Workspace/huffman, configfile: tox.ini, testpaths: tests
plugins: benchmark-3.2.3
collected 1 item                                                                          

tests/test_coding.py .                                                              [100%]


------------------------------------------- benchmark: 1 tests ------------------------------------------
Name (time in s)        Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------
test_roundtrip       4.6859  5.0134  4.8253  0.1543  4.7472  0.2770       1;0  0.2072       5           1
---------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=================================== 1 passed in 39.77s ====================================
_________________________________________ summary _________________________________________
  py37: commands succeeded
  congratulations :)

Now this is going to prove to be addictive, giving me a very small piece of code to optimize to the moon and back ;-)

Fixed Key Length Lookup Table

Going back to the dictionary mapping bitstrings to characters, extending the bitstrings to a fixed key length (of the longest key), adding duplicates for all missing bits after the key, proved to be a (though memory consuming) much faster way, slashing the decoding time roughly by two ;-)

So in stead of looking in this table:

{
  "000": "w",
  "001": " ",
  "010": "d",
  "011": "h",
  "10": "l",
  "1100": "e",
  "1101": "r",
  "111": "o"
}

I'm now looking in this table:

{
  "0000": ("w", 3), "0001": ("w", 3),
  "0010": (" ", 3), "0011": (" ", 3),
  "0100": ("d", 3), "0101": ("d", 3),
  "0110": ("h", 3), "0111": ("h", 3),
  "1000": ("l", 2), "1001": ("l", 2), "1010": ("l", 2), "1011": ("l", 2),
  "1100": ("e", 4),
  "1101": ("r", 4),
  "1110": ("o", 3), "1111": ("o", 3)
}

It allows us to consume a fixed amount of bits from the code, ensures that the lookup table will always return a match, including the characters and the actual amount of bits that have to be consumed from the code.

Traversal vs Fixed Length Lookup Table

Building the lookup table takes little time, in comparison with the gain in decoding time.

TODO

  • implement it with "real bits" ;-)

References

About

Simple and straigthforward implementation of Huffman coding - as a small exercise

Topics

Resources

Stars

Watchers

Forks