WBZ

Check the article here: How to Build a Lossless Data Compression and Data Decompression Pipeline

WBZ is a parallel implementation of the bzip2 data compressor in Python. The pipeline applies the Burrows–Wheeler transform (BWT) and move-to-front (MTF) encoding so that the final Huffman coding stage compresses better. For now, the tool focuses only on compressing .csv files and other files in tabular format.
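To make the role of the MTF stage concrete, here is a minimal sketch of move-to-front encoding and decoding. It is not taken from the WBZ source: the function names `mtf_encode` and `mtf_decode` and the tiny alphabet are assumptions chosen for illustration. The input string has the kind of repeated runs that a BWT tends to produce.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol with its current position in a self-adjusting list."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        idx = symbols.index(ch)
        out.append(idx)
        # Move the symbol just seen to the front, so an immediate repeat encodes as 0.
        symbols.insert(0, symbols.pop(idx))
    return out


def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in indices:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


alphabet = "abn"          # hypothetical alphabet, must contain every input symbol
runs = "nnbaaa"           # a string with runs, as BWT output usually has
codes = mtf_encode(runs, alphabet)
print(codes)              # [2, 0, 2, 2, 0, 0]
print(mtf_decode(codes, alphabet) == runs)
```

Runs of the same symbol become runs of zeros, which skews the symbol frequencies and gives a Huffman coder shorter codes to assign.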

Data pipeline compression

How to use the tool

The tool is called WBZ. The first version focuses only on compressing .csv files, and more features are planned. Two example invocations are shown here, followed by a description of each parameter:

python wbz.py -a encode -f 'C:\Users\...\data.csv' -cs 20000 -ch ';'

python wbz.py -a decode -f 'C:\Users\...\data.wbz' -cs 20000 -ch ';'

-a is the action. There are two actions: encode and decode.
-f is the file path. If the action is encode, make sure the chosen path points to a .csv file; if the action is decode, make sure it points to a file with the .wbz extension.
-cs is the chunk size. The Burrows–Wheeler transform (BWT) works on chunks measured in bytes, and this parameter specifies the number of bytes processed by each CPU.
-ch is the special character. Each chunk encoded by the BWT contains a special character that marks an index needed for decoding. Any of the usual column separator characters can serve as the special character, but choose one that is not used as the separator in your file and that does not appear in the column contents. This feature will be removed in future versions of the tool.
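The requirement that the special character must not occur in the data can be seen in a naive BWT sketch. This is a textbook rotation-sort version written for illustration, not the WBZ implementation; the names `bwt_encode` and `bwt_decode` are assumptions, and the approach is far too slow for real chunk sizes.

```python
def bwt_encode(chunk, special):
    """Naive BWT: append a unique marker, sort all rotations, keep the last column."""
    if special in chunk:
        raise ValueError("special character must not appear in the chunk")
    s = chunk + special
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def bwt_decode(encoded, special):
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    n = len(encoded)
    table = [""] * n
    for _ in range(n):
        table = sorted(encoded[i] + table[i] for i in range(n))
    # The row that ends with the marker is the original chunk plus the marker.
    for row in table:
        if row.endswith(special):
            return row[:-1]
    raise ValueError("special character not found in encoded chunk")


chunk = "id,name,city"            # made-up sample using ',' as column separator
encoded = bwt_encode(chunk, ";")  # ';' does not occur in the chunk
print(encoded)
print(bwt_decode(encoded, ";") == chunk)
```

Because the marker occurs exactly once, exactly one row of the rebuilt table ends with it, and that row identifies the original text. If the marker also appeared in the data, more than one row could match, which is why the sketch rejects such input.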

The same chunk size and special character used to encode a file must be used to decode it. They are kept as parameters so that you can tune the trade-off between encoding and decoding speed and the compression rate.
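The chunk-size trade-off can be explored with a small sketch that splits the input into fixed-size chunks and processes them concurrently. Everything here is an assumption for illustration: `zlib` stands in for the BWT, MTF and Huffman stages, threads stand in for whatever worker model WBZ uses, and the sample data is made up.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_chunks(data, chunk_size):
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def encode_chunks(data, chunk_size, workers=4):
    # zlib.compress is only a stand-in for the per-chunk compression stage.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the chunks stay in sequence.
        return list(pool.map(zlib.compress, split_chunks(data, chunk_size)))


def decode_chunks(encoded_chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, encoded_chunks))


# Made-up tabular sample with ';' as the column separator.
data = b"id;name;city\n" + b"1;Ana;Lima\n2;Luis;Quito\n" * 2000

for chunk_size in (1000, 20000):
    encoded = encode_chunks(data, chunk_size)
    assert decode_chunks(encoded) == data
    # Print chunk size, number of chunks, and total compressed bytes.
    print(chunk_size, len(encoded), sum(len(c) for c in encoded))
```

Smaller chunks give more independent work units to spread across workers, while larger chunks give each compression call more context to exploit. For CPU-bound pure-Python stages, a process pool would be the more likely choice than threads.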

Performance

The tests were done with three .csv files of different sizes and varying the chunk size:

data_1000000: One million records (61 MB)
data_500000: Half a million records (31 MB)
data_250000: A quarter of a million records (16 MB)

The compression rate improves with larger chunk sizes.

Compression time grows logarithmically as the chunk size increases.

Regardless of file size, decompression times stay roughly constant and tend to decrease as the chunk size increases.

To-Do List:

Improve the Huffman and BWT encoding times
Improve the encoding of the Huffman table
Compression based on columns
Compress and decompress specific columns in the .csv file
Generate compressed chunks automatically for large files
Distributed compression and decompression

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

Created by Ramses Alexander Coraspe Valdez
Created in 2022

License

This project is licensed under the terms of the Apache License.
