
[SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage #285

Closed. liancheng wants to merge 7 commits from the memColumnarCompression branch.

Conversation

@liancheng (Contributor)

JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

*   `CompressionScheme`

    Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:

    * `RunLengthEncoding`
    * `DictionaryEncoding`

    Algorithms to be implemented include:

    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`

*   `CompressibleColumnBuilder`

    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` with the lowest estimated compression ratio is chosen for each column, based on statistical information gathered while elements are appended into the `ColumnBuilder` (a sketch of this selection appears after this list). However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.

    Memory layout of the final byte buffer is shown below:

    ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
    +---+---+-----+---+---------+
    |   |   | ... |   | ... ... |
    +---+---+-----+---+---------+
     \-----------/ \-----------/
        header         body
    ```
*   `CompressibleColumnAccessor`

    A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.

*   `ColumnStats`

    Used to collect statistical information while loading data into the in-memory columnar table. Optimizations like partition pruning rely on this information.

    Strictly speaking, `ColumnStats`-related code is not part of the compression support. It's included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).
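To make the builder behavior concrete (as referenced above), here is a minimal, self-contained Scala sketch of choosing a scheme and writing out the layout shown in the diagram. All names here (`ColumnBuildSketch`, `Encoder`, `Noop`, `Rle`) are hypothetical illustrations, not the actual Spark SQL classes:

```scala
import java.nio.ByteBuffer

object ColumnBuildSketch {
  trait Encoder {
    def schemeId: Int
    def gatherStats(v: Int): Unit           // fed while elements are appended
    def compressedSize: Int                 // estimated output size in bytes
    def encode(values: Seq[Int], out: ByteBuffer): Unit
  }

  // Scheme 0: pass-through, used when nothing beats the 80% ratio threshold.
  class Noop extends Encoder {
    val schemeId = 0
    private var size = 0
    def gatherStats(v: Int): Unit = size += 4
    def compressedSize: Int = size
    def encode(values: Seq[Int], out: ByteBuffer): Unit = values.foreach(out.putInt)
  }

  // Scheme 1: run-length encoding; each run stored as (value, runLength).
  class Rle extends Encoder {
    val schemeId = 1
    private var runs = 0
    private var last = Option.empty[Int]
    def gatherStats(v: Int): Unit =
      if (last != Some(v)) { runs += 1; last = Some(v) }
    def compressedSize: Int = runs * 8
    def encode(values: Seq[Int], out: ByteBuffer): Unit =
      for ((v, run) <- values.foldLeft(List.empty[(Int, Int)]) {
        case ((pv, n) :: t, x) if pv == x => (pv, n + 1) :: t
        case (acc, x)                     => (x, 1) :: acc
      }.reverse) { out.putInt(v); out.putInt(run) }
  }

  def build(typeId: Int, values: Seq[Int], nullPositions: Seq[Int]): ByteBuffer = {
    val candidates = Seq(new Rle)
    for (e <- candidates; v <- values) e.gatherStats(v)
    val uncompressed = values.size * 4
    val best = candidates.minBy(_.compressedSize)
    // Fall back to no compression unless the best ratio beats 80%.
    val chosen: Encoder =
      if (best.compressedSize.toDouble / uncompressed < 0.8) best else new Noop
    val buf = ByteBuffer.allocate(4 + 4 + 4 * nullPositions.size + 4 + uncompressed)
    buf.putInt(typeId)                 // column type ID (4 bytes)
    buf.putInt(nullPositions.size)     // null count N (4 bytes)
    nullPositions.foreach(buf.putInt)  // null positions (4 x N bytes)
    buf.putInt(chosen.schemeId)        // compression scheme ID (4 bytes)
    chosen.encode(values, buf)         // compressed non-null elements
    buf
  }
}
```

A pass-through scheme keeps the fallback path uniform: decompression can always dispatch on the scheme ID stored in the header, whether or not the column was actually compressed.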

A major refactoring change since PR #205:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code. Primitive setters/getters for (Mutable)`Row`s are moved to `ColumnType`s.

Other changes in this PR's commits:

* Added two more compression schemes (RLE & dictionary encoding)
* Moved compression support code to `columnar.compression`
* Various refactoring
* Added test suites for `RunLengthEncoding` and `DictionaryEncoding`
* Completed `ColumnStatsSuite`
* Bug fix: `RunLengthEncoding` didn't encode the last run (see the sketch after this list)
* Refactored some test-related code for clarity
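The "last run" bug noted above is the classic RLE pitfall: runs are flushed only when the value changes, so the final run needs an explicit flush after the input ends. A hedged sketch with illustrative names, not the actual Spark code:

```scala
// Encode values as (value, runLength) pairs.
def runLengthEncode(values: Seq[Int]): Seq[(Int, Int)] = {
  val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
  var current: Option[(Int, Int)] = None
  for (v <- values) current match {
    case Some((value, count)) if value == v =>
      current = Some((value, count + 1)) // extend the current run
    case _ =>
      current.foreach(runs += _)         // value changed: flush the run
      current = Some((v, 1))
  }
  current.foreach(runs += _) // without this final flush, the last run is dropped
  runs.toSeq
}
```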
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13633/

@@ -55,10 +56,8 @@ class ColumnTypeSuite extends FunSuite {

Contributor:

Minor existing comment: I find this style of testing produces very cryptic failures. When something breaks, all you get is "4 does not equal 8". Furthermore, because the failure is in a loop, the stack trace won't help in figuring out which data type is wrong. Finally, the correct answer for GENERIC is 10 lines away from the check, making it unnecessarily hard to read the test and see what the expected answers are.

I think something like this would be clearer, and the same number of lines of code:

```scala
def checkActualSize[A](t: ColumnType[A], v: A, expectedSize: Int) =
  if (t.actualSize(v) != expectedSize) {
    fail(s"Wrong actualSize for $t, actual: ${t.actualSize(v)}, expected: $expectedSize")
  }

checkActualSize(INT, Int.MaxValue, 4)
...
```

Contributor (Author):

I found `expectResult` is equivalent to this and more concise. Updated all occurrences where I thought it proper.
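For reference, `expectResult` is the ScalaTest assertion (renamed `assertResult` in later ScalaTest releases) that reports the expected and actual values on failure; a hypothetical use matching the test above:

```scala
// Fails with a message like "Expected 4, but got 8" and points at this line.
expectResult(4) {
  INT.actualSize(Int.MaxValue)
}
```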

Contributor:

Oh cool, I did not know about that. Much clearer!

@marmbrus (Contributor) commented Apr 1, 2014

Hi @liancheng,

This looks pretty good :) Thanks for working on it! I made a bunch of comments, but mostly just because I tried to look at this pretty closely. I think the only really important one before we merge is to fix the visibility of `IntColumnStats`.

Just to make sure I understood the code correctly: compression is now turned on by default since `CompressibleColumnBuilder` is mixed into all the `NativeColumnBuilder`s, and when `build()` is called inside of the `InMemoryColumnarTableScan`, the "staging buffer" will get compressed using the codec with the best compression ratio?

If that's correct, then I think we can merge once the above issue is addressed.

There are a few followups that we can do in a separate PR:

  • Add the missing compression codecs
  • Push predicates into the table scan and use statistics where applicable to prune entire partitions (sketched after this comment).
  • We could also create a planning strategy that calculates the answer for applicable aggregates using these statistics.

I think the latter two are going to require some refactoring so that in-memory cached data becomes a logical concept and the actual table scan is created by the planning strategies once we know about predicates and available statistics. Regardless, this should be transparent to the user and easy to add later.
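As a rough illustration of the second follow-up, under assumed semantics rather than any actual Spark API: per-partition statistics let a scan skip partitions whose recorded value range cannot satisfy a predicate.

```scala
// Hypothetical per-partition statistics for an Int column.
case class IntColumnStats(lower: Int, upper: Int, nullCount: Int)

// For a predicate like `col > value`, the partition can be pruned when its
// recorded upper bound proves no row can match.
def canPruneGreaterThan(stats: IntColumnStats, value: Int): Boolean =
  stats.upper <= value
```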

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@liancheng (Contributor, Author)

Thanks for the detailed review :) All issues addressed.

> Just to make sure I understood the code correctly, compression is now turned on by default since the CompressibleColumnBuilder is mixed into all the NativeColumnBuilders, and when build() is called inside of the InMemoryColumnarTableScan the "staging buffer" will get compressed using the codec with the best compression ratio?

Exactly.
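For readers unfamiliar with the pattern being confirmed here, a minimal sketch of stackable-trait wiring in Scala (hypothetical names; heavily simplified from the real builders):

```scala
object StackableBuilderSketch {
  trait SimpleBuilder {
    def build(): Array[Byte] // produces the "staging buffer"
  }

  trait CompressibleBuilder extends SimpleBuilder {
    // `abstract override` wraps the underlying build(): the staging buffer
    // returned by super.build() is (potentially) compressed before returning.
    abstract override def build(): Array[Byte] = compress(super.build())
    private def compress(staging: Array[Byte]): Array[Byte] = staging // stub
  }

  class IntBuilder extends SimpleBuilder {
    def build(): Array[Byte] = new Array[Byte](8)
  }

  // Mixing the trait into every concrete builder turns compression on by default.
  val builder: SimpleBuilder = new IntBuilder with CompressibleBuilder
}
```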

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13684/

@marmbrus (Contributor) commented Apr 2, 2014

LGTM

@pwendell, this is ready to merge.

@pwendell (Contributor) commented Apr 2, 2014

Merged, thanks!

@asfgit asfgit closed this in 1faa579 Apr 2, 2014
@liancheng liancheng deleted the memColumnarCompression branch April 3, 2014 09:28
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Apr 7, 2014
@marmbrus (Contributor)
Note that much of this code was adapted from Shark, including: https://github.com/amplab/shark/blob/master/src/test/scala/shark/memstore2/column/CompressionAlgorithmSuite.scala

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
(The commit message repeats the PR description above.)
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes apache#285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code