
[SPARK-13361][SQL] Add benchmark codes for Encoder#compress() in CompressionSchemeBenchmark #11236

Closed
wants to merge 1 commit

Conversation

@maropu (Member) commented on Feb 17, 2016

This PR adds benchmark code for Encoder#compress().
It also replaces the existing benchmark results with new ones because the output format of Benchmark has changed.
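
For context, here is a minimal sketch of the shape such a compress() benchmark case takes. `SimpleEncoder` and `benchCompress` are hypothetical stand-ins, not the actual Encoder trait or CompressionSchemeBenchmark code:

```scala
import java.nio.ByteBuffer

// Hypothetical, simplified stand-in for the columnar compression Encoder interface.
trait SimpleEncoder {
  def compress(from: ByteBuffer, to: ByteBuffer): ByteBuffer
}

object CompressBenchmarkSketch {
  // Times repeated bulk compression of the same input buffer and reports the best run,
  // mirroring the shape of the cases added to CompressionSchemeBenchmark.
  def benchCompress(name: String, encoder: SimpleEncoder, input: ByteBuffer, iters: Int): Unit = {
    val out = ByteBuffer.allocate(input.capacity() * 2)
    var best = Long.MaxValue
    (1 to iters).foreach { _ =>
      input.rewind(); out.clear()
      val start = System.nanoTime()
      encoder.compress(input, out)
      best = math.min(best, System.nanoTime() - start)
    }
    println(f"$name%-30s best: ${best / 1e6}%.2f ms")
  }
}
```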

@maropu (Member, Author) commented on Feb 17, 2016

@nongli This discussion comes from #10965.
Before I make a PR to improve the compression performance of columnar caches, I have some questions about InMemoryRelation. In the current code of InMemoryRelation, ColumnBuilder buffers input tuples in heap space and compresses them in bulk. Do you have any plan to use off-heap memory for this compression in ColumnBuilder? IMO we could address this by adding some functionality to the ColumnVector you implemented for Parquet vectorized decoding, and then using the extended ColumnVector to buffer input tuples inside ColumnBuilder. What do you think?
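
To make the off-heap question concrete, here is a rough sketch (not Spark code) of buffering values in an on-heap versus a direct (off-heap) ByteBuffer before a bulk compression pass; `IntBufferBuilder` is purely illustrative:

```scala
import java.nio.ByteBuffer

// Rough illustration only: a builder that appends int values into either an
// on-heap or an off-heap (direct) buffer before a single bulk compress() pass.
class IntBufferBuilder(capacityInInts: Int, offHeap: Boolean) {
  private val buf: ByteBuffer =
    if (offHeap) ByteBuffer.allocateDirect(capacityInInts * 4)
    else ByteBuffer.allocate(capacityInInts * 4)

  def append(v: Int): Unit = buf.putInt(v)

  // In InMemoryRelation, compression happens in bulk over the buffered data;
  // an off-heap buffer would keep this staging area out of the Java heap.
  def build(compress: ByteBuffer => ByteBuffer): ByteBuffer = {
    buf.flip()
    compress(buf)
  }
}
```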

@maropu (Member, Author) commented on Feb 17, 2016

Anyway, I'd like to make PRs to improve compression performance in InMemoryRelation.
A goal of this work is to bring the in-memory cache size closer to the size of Parquet-formatted data.
As a first step, I'd like to use DeltaBinaryPackingValuesReader/Writer from parquet-column in the IntDelta and LongDelta encoders, because this efficient integer compression can be applied widely across types such as SHORT, INT, and LONG. However, I have one technical issue: DeltaBinaryPackingValuesReader/Writer keeps an internal buffer for compressing/decompressing data, so we need to copy the whole output into a Spark internal buffer, which is a kind of overhead. We could avoid this overhead by inlining the Parquet code in Spark, but that raises a maintenance issue.
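
The copy overhead can be pictured roughly like this. `ParquetLikeDeltaWriter` is a hypothetical placeholder for DeltaBinaryPackingValuesWriter (which likewise owns its own output buffer), and the encoding inside it is a toy, not the real delta packing:

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Hypothetical placeholder for a writer (like Parquet's DeltaBinaryPackingValuesWriter)
// that accumulates encoded bytes in its own internal buffer.
class ParquetLikeDeltaWriter {
  private val internal = new ByteArrayOutputStream()
  private var prev = 0
  // Toy encoding: writes only the low byte of each delta, just to model the buffering.
  def writeInteger(v: Int): Unit = { internal.write(v - prev); prev = v }
  def getBytes: Array[Byte] = internal.toByteArray
}

object CopyOverheadSketch {
  // Because the writer owns its buffer, handing the result over to Spark's columnar
  // buffer needs a second full pass over the encoded bytes -- the overhead noted above.
  def copyIntoSparkBuffer(writer: ParquetLikeDeltaWriter, to: ByteBuffer): ByteBuffer = {
    val encoded = writer.getBytes
    to.put(encoded) // extra copy of all compressed data
    to
  }
}
```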

As a second step, I plan to add code that applies general-purpose compression algorithms like LZ4 and Snappy in the final step of ColumnBuilder#build. This is because the byte arrays generated by some type-specific encoders like DictionaryEncoding are still compressible with these algorithms. Parquet also applies compression just before writing data to disk.
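
As a rough sketch of that final step (the hook into ColumnBuilder#build is hypothetical; snappy-java is assumed to be on the classpath):

```scala
import org.xerial.snappy.Snappy

object GeneralPurposeCompressionSketch {
  // Sketch: run a general-purpose codec over the bytes produced by a type-specific
  // encoder (e.g. DictionaryEncoding output) as the last step of building a column.
  def compressColumnBytes(encoded: Array[Byte]): Array[Byte] = {
    val compressed = Snappy.compress(encoded)
    // Keep the compressed form only if it actually saves space.
    if (compressed.length < encoded.length) compressed else encoded
  }
}
```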

Could you give me some suggestions on this?

@maropu (Member, Author) commented on Feb 18, 2016

Jenkins, retest this please.

@SparkQA commented on Feb 18, 2016

Test build #51484 has finished for PR 11236 at commit 7021303.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented on Feb 23, 2016

I tried implementing IntDeltaBinaryPacking in compressionSchemes; it is a simplified version of DeltaBinaryPackingValuesReader/Writer in parquet-column, so that the compressed size can be calculated easily in gatherCompressibilityStats.
maropu@71bb944
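
For reference, a heavily simplified sketch of the delta-binary-packing idea (store the first value, zig-zag the deltas, pack with the minimum bit width); this is not the code in the commit above and skips mini-blocks and the actual bit packing:

```scala
object DeltaPackingSketch {
  // Heavily simplified: store the first value, turn the rest into zig-zag-encoded
  // deltas, and report the bit width the block needs. The real reader/writer in
  // parquet-column (and the commit above) also packs deltas into mini-blocks.
  def deltaPackBlock(values: Array[Int]): (Int, Int, Array[Int]) = {
    require(values.nonEmpty)
    val first = values(0)
    val deltas = new Array[Int](values.length - 1)
    var i = 1
    while (i < values.length) {
      val d = values(i) - values(i - 1)
      deltas(i - 1) = (d << 1) ^ (d >> 31) // zig-zag keeps small negative deltas small
      i += 1
    }
    val maxDelta = if (deltas.isEmpty) 0 else deltas.max
    val bitWidth = 32 - Integer.numberOfLeadingZeros(maxDelta | 1)
    (first, bitWidth, deltas)
  }
}
```

A block's compressed size can then be estimated as roughly 4 + ceil(bitWidth * deltas.length / 8) bytes, which is the kind of estimate gatherCompressibilityStats needs.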

The benchmark results are as follows;

Running benchmark: INT Decode(Lower Skew)
  Running case: PassThrough(1.000)
  Running case: RunLengthEncoding(1.002)
  Running case: DictionaryEncoding(0.500)
  Running case: IntDelta(0.250)
  Running case: IntDeltaBinaryPacking(0.068)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Lower Skew):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
PassThrough(1.000)                        285 /  360        235.7           4.2       1.0X
RunLengthEncoding(1.002)                  700 /  715         95.8          10.4       0.4X
DictionaryEncoding(0.500)                 763 /  782         88.0          11.4       0.4X
IntDelta(0.250)                           684 /  702         98.1          10.2       0.4X
IntDeltaBinaryPacking(0.068)              805 /  811         83.4          12.0       0.4X

Running benchmark: INT Decode(Higher Skew)
  Running case: PassThrough(1.000)
  Running case: RunLengthEncoding(1.337)
  Running case: DictionaryEncoding(0.501)
  Running case: IntDelta(0.250)
  Running case: IntDeltaBinaryPacking(0.182)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Higher Skew):            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
PassThrough(1.000)                        690 /  716         97.3          10.3       1.0X
RunLengthEncoding(1.337)                 1127 / 1148         59.5          16.8       0.6X
DictionaryEncoding(0.501)                 836 /  856         80.2          12.5       0.8X
IntDelta(0.250)                           763 /  778         88.0          11.4       0.9X
IntDeltaBinaryPacking(0.182)              873 /  884         76.9          13.0       0.8X

Encoding/decoding gets a little slower, but the compression ratios get much better.

@maropu (Member, Author) commented on Feb 23, 2016

@nongli ping

@maropu (Member, Author) commented on Feb 25, 2016

@nongli @rxin ping

@nongli (Contributor) commented on Feb 26, 2016

LGTM, thanks for writing these benchmarks.

Moving forward, I agree that ColumnVector is a natural data structure to decode into, but we should probably not add this logic directly into those classes, purely from a code-maintenance point of view. I think exploring the Parquet encodings makes sense, but let's start by benchmarking them and see if they have the right performance characteristics.

@rxin (Contributor) commented on Feb 26, 2016

Merging this into master. Thanks.

@asfgit asfgit closed this in 1b39faf Feb 26, 2016
@maropu maropu deleted the CompressionSpike branch July 5, 2017 11:43