Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression codecs for raw index #6804

GSharayu opened this issue Apr 16, 2021 · 9 comments

@GSharayu commented Apr 16, 2021

When the forward index is not dictionary encoded, we have two choices:

  • store the data as is (RAW)
  • store the data Snappy-compressed, using the Snappy compression codec library

In addition to Snappy, we should add support for other compression codecs, subject to their availability in Java libraries.

Currently we use Snappy compression by default. However, it does not give a good compression ratio for free-text data. LZO is known to provide a better compression ratio and speed for larger char/varchar data.
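For context, a minimal sketch of the current baseline round-trip, assuming the xerial snappy-java library (byte[] convenience API; the class name here is just for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.xerial.snappy.Snappy;

public class SnappyRoundTrip {
  public static void main(String[] args) throws IOException {
    byte[] raw = "free-text values often compress poorly with Snappy"
        .getBytes(StandardCharsets.UTF_8);

    byte[] compressed = Snappy.compress(raw);   // the current default codec
    byte[] restored = Snappy.uncompress(compressed);

    System.out.printf("raw=%d compressed=%d equal=%b%n",
        raw.length, compressed.length, Arrays.equals(raw, restored));
  }
}
```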

So, we should explore other options.

First, we should start with a simple test case that compresses and decompresses a direct byte buffer, and run some functional and performance tests.

See the ZSTD library for Java: https://github.com/luben/zstd-jni
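A minimal sketch of such a round-trip test, assuming zstd-jni's ByteBuffer overloads (Zstd.compress(dst, src, level) and Zstd.decompress(dst, src), which operate on direct buffers); the level and class name are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.github.luben.zstd.Zstd;

public class ZstdDirectBufferRoundTrip {
  public static void main(String[] args) {
    byte[] input = "some free-text column value, repeated, repeated, repeated"
        .getBytes(StandardCharsets.UTF_8);

    // The raw index reader works off-heap, hence direct buffers.
    ByteBuffer src = ByteBuffer.allocateDirect(input.length);
    src.put(input);
    src.flip();

    ByteBuffer dst = ByteBuffer.allocateDirect((int) Zstd.compressBound(input.length));
    int compressedSize = Zstd.compress(dst, src, 3); // level 3 as a starting point
    dst.flip();

    ByteBuffer restored = ByteBuffer.allocateDirect(input.length);
    int restoredSize = Zstd.decompress(restored, dst);

    System.out.printf("raw=%d compressed=%d restored=%d%n",
        input.length, compressedSize, restoredSize);
  }
}
```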

Any new ideas/suggestions?

@siddharthteotia siddharthteotia changed the title Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression encoding for raw index Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression codecs for raw index Apr 16, 2021
@xiangfu0

There is some previous discussion here: #5407

@yupeng9 commented Apr 16, 2021

I'd also suggest adding Gorilla TSZ compression to the list; it was proposed by Facebook's Gorilla project. This compression algorithm is adopted by Uber's m3TSZ, which showed a 40% improvement over standard TSZ in Uber's production workloads.
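Not the m3TSZ implementation, just a toy sketch of the core XOR idea behind Gorilla-style float compression (all names here are illustrative):

```java
public class GorillaXorSketch {
  public static void main(String[] args) {
    // Toy series of gauge values; consecutive samples are close together.
    double[] series = {101.0, 101.0, 101.5, 102.0};
    long prevBits = Double.doubleToRawLongBits(series[0]);
    for (int i = 1; i < series.length; i++) {
      long currBits = Double.doubleToRawLongBits(series[i]);
      long xor = prevBits ^ currBits;
      if (xor == 0) {
        // Gorilla writes a single '0' control bit for a repeated value.
        System.out.println("repeat -> 1 bit");
      } else {
        int lead = Long.numberOfLeadingZeros(xor);
        int trail = Long.numberOfTrailingZeros(xor);
        // Gorilla stores the leading-zero count, the block length, and
        // only the meaningful middle bits, instead of all 64 bits.
        System.out.printf("meaningful bits: %d of 64%n", 64 - lead - trail);
      }
      prevBits = currBits;
    }
  }
}
```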

@xiangfu0

I feel we should separate encoding and compression, and maybe add two new fields to the schema.

For raw index encoding, we can support multiple options for different types, e.g.:

  • INT/LONG: Delta/DoubleDelta/Gorilla
  • FLOAT/DOUBLE: Gorilla

For compression, we can support LZO, LZ4, ZSTD, DEFLATE, GZIP, SNAPPY, etc.
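To make the encoding/compression split concrete, a minimal, self-contained delta-encoding sketch (illustrative only; a real encoding would bit-pack the deltas, and the result could then be fed to any of the codecs above):

```java
import java.util.Arrays;

public class DeltaEncodeSketch {
  // Store the first value plus successive differences; sorted or
  // slowly-increasing INT/LONG columns turn into small deltas.
  static long[] deltaEncode(long[] values) {
    long[] out = new long[values.length];
    out[0] = values[0];
    for (int i = 1; i < values.length; i++) {
      out[i] = values[i] - values[i - 1];
    }
    return out;
  }

  static long[] deltaDecode(long[] deltas) {
    long[] out = new long[deltas.length];
    out[0] = deltas[0];
    for (int i = 1; i < deltas.length; i++) {
      out[i] = out[i - 1] + deltas[i];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] ts = {1618531200L, 1618531260L, 1618531320L};
    long[] enc = deltaEncode(ts); // {1618531200, 60, 60}
    long[] dec = deltaDecode(enc);
    System.out.println(Arrays.equals(ts, dec)); // true
  }
}
```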

@siddharthteotia commented Apr 16, 2021

In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.

The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.

This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
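To illustrate the "faster query processing on encoded data" point, a toy dictionary-encoding sketch (not Pinot's implementation): a predicate is resolved once against the dictionary and then evaluated over integer ids.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    String[] column = {"US", "IN", "US", "BR", "IN", "US"};

    // Build the dictionary and the encoded id column.
    Map<String, Integer> dict = new LinkedHashMap<>();
    int[] ids = new int[column.length];
    for (int i = 0; i < column.length; i++) {
      Integer id = dict.get(column[i]);
      if (id == null) {
        id = dict.size();
        dict.put(column[i], id);
      }
      ids[i] = id;
    }

    // WHERE country = 'US' becomes one dictionary lookup plus an int scan;
    // the string values themselves are never materialized.
    int usId = dict.get("US");
    int matches = 0;
    for (int id : ids) {
      if (id == usId) {
        matches++;
      }
    }
    System.out.println("matches = " + matches); // 3
  }
}
```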

@xiangfu0

> In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
>
> The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
>
> This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.

Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

@siddharthteotia

> > In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
> >
> > The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
> >
> > This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
>
> Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

Yes, this will also be columnar and at the block level, although LZ4 supports some form of streaming/frames. I am not sure why we need to add it to the schema - do you mean configuring it via the table config?
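For the block-level point, a minimal sketch using lz4-java's block API (as opposed to the LZ4 frame format). Note that block-level LZ4 needs the uncompressed length at decompression time, so each chunk's original size would have to be stored alongside it - an assumption about chunk layout, not existing Pinot code:

```java
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4BlockRoundTrip {
  public static void main(String[] args) {
    LZ4Factory factory = LZ4Factory.fastestInstance();

    // Pretend this is one fixed-size chunk of a raw forward index.
    byte[] chunk = new byte[4096];

    LZ4Compressor compressor = factory.fastCompressor();
    byte[] compressed = compressor.compress(chunk);

    // Block-level decompression requires the original length up front.
    LZ4FastDecompressor decompressor = factory.fastDecompressor();
    byte[] restored = decompressor.decompress(compressed, chunk.length);

    System.out.printf("chunk=%d compressed=%d equal=%b%n",
        chunk.length, compressed.length,
        java.util.Arrays.equals(chunk, restored));
  }
}
```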

@xiangfu0

> > > In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is a storage saving as well.
> > >
> > > The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
> > >
> > > This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.
> >
> > Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?
>
> Yes, this will also be columnar and at the block level, although LZ4 supports some form of streaming/frames. I am not sure why we need to add it to the schema - do you mean configuring it via the table config?

Because we want to allow tuning compression on a per-column basis, e.g. column1 in Snappy and column2 in LZ4, right?

This info can be stored:

  • either inside the FieldSpec in the schema,
  • or in a new field in the tableConfig, with a map from columns to compression types.

@siddharthteotia

Right. Since all the per-column index/encoding/compression and other tuning info is in the table config, maybe we can continue to have it there. The table config already has a field to capture this.
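Assuming the existing field referred to here is the noDictionaryConfig map under tableIndexConfig (my reading, not confirmed in this thread), the per-column setting might look something like this, with the codec names purely illustrative:

```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["column1", "column2"],
    "noDictionaryConfig": {
      "column1": "SNAPPY",
      "column2": "LZ4"
    }
  }
}
```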

@xiangfu0

> Right. Since all the per-column index/encoding/compression and other tuning info is in the table config, maybe we can continue to have it there. The table config already has a field to capture this.

Got it. For this part, if users change the compression type for a column in the config, do we consider rebuilding the column data?

GSharayu added a commit to GSharayu/pinot that referenced this issue Jun 16, 2021
wuwenw pushed a commit to wuwenw/incubator-pinot that referenced this issue Jun 18, 2021
renjugokulam pushed a commit to shivagowda/incubator-pinot that referenced this issue Jul 21, 2021
renjugokulam added a commit to shivagowda/incubator-pinot that referenced this issue Jul 21, 2021