Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression codecs for raw index #6804

GSharayu opened this issue Apr 16, 2021 · 9 comments

@GSharayu commented Apr 16, 2021

When the forward index is not dictionary encoded, we have two choices:

  • store the data as is (RAW)
  • store the data Snappy-compressed, using the Snappy compression codec library

In addition to Snappy, we should add support for other compression codecs, subject to their availability in Java libraries.

Currently we use Snappy compression by default. However, it does not give a good compression ratio for free-text data. LZO is known to provide a better compression ratio and speed for larger char/varchar data.
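For context, a minimal sketch of the current baseline round-trip, assuming the xerial snappy-java library (byte[] convenience API; the class name here is just for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.xerial.snappy.Snappy;

public class SnappyRoundTrip {
  public static void main(String[] args) throws IOException {
    byte[] raw = "free-text values often compress poorly with Snappy"
        .getBytes(StandardCharsets.UTF_8);

    byte[] compressed = Snappy.compress(raw);   // the current default codec
    byte[] restored = Snappy.uncompress(compressed);

    System.out.printf("raw=%d compressed=%d equal=%b%n",
        raw.length, compressed.length, Arrays.equals(raw, restored));
  }
}
```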

So, we should explore other options.

First, we should start with a simple test case that compresses and decompresses a direct byte buffer, and run some functional and performance tests.

See the ZSTD library for Java: https://github.com/luben/zstd-jni
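A minimal sketch of such a round-trip test, assuming zstd-jni's ByteBuffer overloads (Zstd.compress(dst, src, level) and Zstd.decompress(dst, src), which operate on direct buffers); the level and class name are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.github.luben.zstd.Zstd;

public class ZstdDirectBufferRoundTrip {
  public static void main(String[] args) {
    byte[] input = "some free-text column value, repeated, repeated, repeated"
        .getBytes(StandardCharsets.UTF_8);

    // The raw index reader works off-heap, hence direct buffers.
    ByteBuffer src = ByteBuffer.allocateDirect(input.length);
    src.put(input);
    src.flip();

    ByteBuffer dst = ByteBuffer.allocateDirect((int) Zstd.compressBound(input.length));
    int compressedSize = Zstd.compress(dst, src, 3); // level 3 as a starting point
    dst.flip();

    ByteBuffer restored = ByteBuffer.allocateDirect(input.length);
    int restoredSize = Zstd.decompress(restored, dst);

    System.out.printf("raw=%d compressed=%d restored=%d%n",
        input.length, compressedSize, restoredSize);
  }
}
```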

Any new ideas/suggestions?

@siddharthteotia siddharthteotia changed the title Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression encoding for raw index Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression codecs for raw index Apr 16, 2021
@xiangfu0

There is some previous discussion here: #5407

@yupeng9 commented Apr 16, 2021

I'd also suggest adding Gorilla TSZ compression to the list; it was proposed by Facebook's Gorilla project. This compression algorithm is adopted by Uber's m3TSZ, which showed a 40% improvement over standard TSZ in Uber's production workloads.
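Not the m3TSZ implementation, just a toy sketch of the core XOR idea behind Gorilla-style float compression (all names here are illustrative):

```java
public class GorillaXorSketch {
  public static void main(String[] args) {
    // Toy series of gauge values; consecutive samples are close together.
    double[] series = {101.0, 101.0, 101.5, 102.0};
    long prevBits = Double.doubleToRawLongBits(series[0]);
    for (int i = 1; i < series.length; i++) {
      long currBits = Double.doubleToRawLongBits(series[i]);
      long xor = prevBits ^ currBits;
      if (xor == 0) {
        // Gorilla writes a single '0' control bit for a repeated value.
        System.out.println("repeat -> 1 bit");
      } else {
        int lead = Long.numberOfLeadingZeros(xor);
        int trail = Long.numberOfTrailingZeros(xor);
        // Gorilla stores the leading-zero count, the block length, and
        // only the meaningful middle bits, instead of all 64 bits.
        System.out.printf("meaningful bits: %d of 64%n", 64 - lead - trail);
      }
      prevBits = currBits;
    }
  }
}
```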

@xiangfu0

I feel we should separate encoding and compression, and maybe add two new fields to the schema.

For raw index encoding, we can support multiple options for different types, e.g.:

  • INT/LONG: Delta/DoubleDelta/Gorilla
  • FLOAT/DOUBLE: Gorilla

For compression, we can support LZO, LZ4, ZSTD, DEFLATE, GZIP, SNAPPY, etc.
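To make the encoding/compression split concrete, a minimal, self-contained delta-encoding sketch (illustrative only; a real encoding would bit-pack the deltas, and the result could then be fed to any of the codecs above):

```java
import java.util.Arrays;

public class DeltaEncodeSketch {
  // Store the first value plus successive differences; sorted or
  // slowly-increasing INT/LONG columns turn into small deltas.
  static long[] deltaEncode(long[] values) {
    long[] out = new long[values.length];
    out[0] = values[0];
    for (int i = 1; i < values.length; i++) {
      out[i] = values[i] - values[i - 1];
    }
    return out;
  }

  static long[] deltaDecode(long[] deltas) {
    long[] out = new long[deltas.length];
    out[0] = deltas[0];
    for (int i = 1; i < deltas.length; i++) {
      out[i] = out[i - 1] + deltas[i];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] ts = {1618531200L, 1618531260L, 1618531320L};
    long[] enc = deltaEncode(ts); // {1618531200, 60, 60}
    long[] dec = deltaDecode(enc);
    System.out.println(Arrays.equals(ts, dec)); // true
  }
}
```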

@siddharthteotia commented Apr 16, 2021

In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.

The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.

This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
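To illustrate the "faster query processing on encoded data" point, a toy dictionary-encoding sketch (not Pinot's implementation): a predicate is resolved once against the dictionary and then evaluated over integer ids.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    String[] column = {"US", "IN", "US", "BR", "IN", "US"};

    // Build the dictionary and the encoded id column.
    Map<String, Integer> dict = new LinkedHashMap<>();
    int[] ids = new int[column.length];
    for (int i = 0; i < column.length; i++) {
      Integer id = dict.get(column[i]);
      if (id == null) {
        id = dict.size();
        dict.put(column[i], id);
      }
      ids[i] = id;
    }

    // WHERE country = 'US' becomes one dictionary lookup plus an int scan;
    // the string values themselves are never materialized.
    int usId = dict.get("US");
    int matches = 0;
    for (int id : ids) {
      if (id == usId) {
        matches++;
      }
    }
    System.out.println("matches = " + matches); // 3
  }
}
```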

@xiangfu0

> In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
>
> The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
>
> This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.

Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

@siddharthteotia

> > In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
> >
> > The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
> >
> > This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
>
> Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

Yes, this will also be columnar and at the block level, although LZ4 supports some form of streaming/frames. I am not sure why we need to add it to the schema - do you mean configuring it via the table config?
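For the block-level point, a minimal sketch using lz4-java's block API (as opposed to the LZ4 frame format). Note that block-level LZ4 needs the uncompressed length at decompression time, so each chunk's original size would have to be stored alongside it - an assumption about chunk layout, not existing Pinot code:

```java
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4BlockRoundTrip {
  public static void main(String[] args) {
    LZ4Factory factory = LZ4Factory.fastestInstance();

    // Pretend this is one fixed-size chunk of a raw forward index.
    byte[] chunk = new byte[4096];

    LZ4Compressor compressor = factory.fastCompressor();
    byte[] compressed = compressor.compress(chunk);

    // Block-level decompression requires the original length up front.
    LZ4FastDecompressor decompressor = factory.fastDecompressor();
    byte[] restored = decompressor.decompress(compressed, chunk.length);

    System.out.printf("chunk=%d compressed=%d equal=%b%n",
        chunk.length, compressed.length,
        java.util.Arrays.equals(chunk, restored));
  }
}
```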

@xiangfu0

> > > In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
> > >
> > > The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
> > >
> > > This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
> >
> > Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?
>
> Yes, this will also be columnar and at the block level, although LZ4 supports some form of streaming/frames. I am not sure why we need to add it to the schema - do you mean configuring it via the table config?

Because we want to allow tuning compression on a per-column basis, e.g. column1 in Snappy and column2 in LZ4, right?

This info can be stored:

  • either inside the FieldSpec in the schema,
  • or in a new field in the tableConfig, with a map from columns to compression types.

@siddharthteotia

Right. Since all the per-column index/encoding/compression and other tuning info is in the table config, maybe we can continue to have it there. The table config already has a field to capture this.
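Assuming the existing field referred to here is the noDictionaryConfig map under tableIndexConfig (my reading, not confirmed in this thread), the per-column setting might look something like this, with the codec names purely illustrative:

```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["column1", "column2"],
    "noDictionaryConfig": {
      "column1": "SNAPPY",
      "column2": "LZ4"
    }
  }
}
```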

@xiangfu0

> Right. Since all the per-column index/encoding/compression and other tuning info is in the table config, maybe we can continue to have it there. The table config already has a field to capture this.

Got it. For this part, if users change the compression type for a column in the config, do we consider rebuilding the column data?

GSharayu added a commit to GSharayu/pinot that referenced this issue Jun 16, 2021
wuwenw pushed a commit to wuwenw/incubator-pinot that referenced this issue Jun 18, 2021
renjugokulam pushed a commit to shivagowda/incubator-pinot that referenced this issue Jul 21, 2021
renjugokulam added a commit to shivagowda/incubator-pinot that referenced this issue Jul 21, 2021