
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13997

Currently, the JSON, TEXT and CSV data sources use the CompressionCodecs class to set compression configurations via option("compress", "codec").

I originally made this use the Hadoop 1.x default value (block-level compression). However, the default value in Hadoop 2.x is record-level compression, as described in mapred-site.xml.

Since Spark is dropping Hadoop 1.x support, it makes sense to use the Hadoop 2.x default values.
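
To make the context concrete, here is a minimal sketch of how a user asks one of these text-based writers to compress its output. This assumes a recent Spark build with a SparkSession and the writer-level compression option; the output path is illustrative and not taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

object CompressionOptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compression-option-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Ask the CSV writer to gzip its output files; the data source turns
    // this into Hadoop output-compression settings (the codec plus the
    // block/record "type" discussed in this PR).
    df.write
      .option("compression", "gzip")
      .mode("overwrite")
      .csv("/tmp/compressed-csv-output")

    spark.stop()
  }
}
```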

How was this patch tested?

Via ./dev/run_tests and unit tests.

@rxin
Contributor

rxin commented Mar 18, 2016

We should create a JIRA for changing config values.

@rxin
Contributor

rxin commented Mar 18, 2016

BTW, can you explain more about what this actually means? It seems really bad to compress every record. Maybe this is not what this config is doing?

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53484 has finished for PR 11806 at commit 80749a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [MINOR][SQL] Use Hadoop 2.0 default value for compression in data sources. [SPARK-13997][SQL] Use Hadoop 2.0 default value for compression in data sources. Mar 18, 2016
@HyukjinKwon
Member Author

@rxin According to the Hadoop Definitive Guide, 3rd edition, it looks like these are the right configurations for the unit of compression (record or block).

@HyukjinKwon
Member Author

I think maybe we should leave the configurations at their defaults if users do not specify them. We could use setIfUnset() instead of set().
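
For illustration, a minimal sketch of the difference between the two calls, assuming a plain Hadoop Configuration and the Hadoop 2 output-compression-type key (the exact key Spark touches lives in CompressionCodecs):

```scala
import org.apache.hadoop.conf.Configuration

object SetIfUnsetSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // set() always overwrites whatever the user or *-site.xml configured.
    conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")

    // setIfUnset() only fills in a value when nothing was configured,
    // leaving an explicit user/cluster setting untouched.
    conf.setIfUnset("mapreduce.output.fileoutputformat.compress.type", "BLOCK")

    println(conf.get("mapreduce.output.fileoutputformat.compress.type"))
  }
}
```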

@rxin
Contributor

rxin commented Mar 18, 2016

Am I misunderstanding it? It seems insane to run compression at record level because the overhead is very high.

@HyukjinKwon
Member Author

I see. Should I maybe close this?

@rxin
Contributor

rxin commented Mar 18, 2016

Yea - until we can figure out what it actually means, I'd close this for now.

cc @tomwhite - maybe you can shed some light on what "record" means here?

@HyukjinKwon
Member Author

Actually, I did not understand why the overhead of compression at the record level (I mean a row in Spark, or a key-value record in a Hadoop output format) would be very high. I think the overhead is slightly higher and the compression ratio a bit lower, but it allows random access at the record level.

Maybe I lack the knowledge and experience here. I would really appreciate it if you could help me (and also so I can understand what "record" exactly means).

@rxin
Contributor

rxin commented Mar 18, 2016

The efficiency of compression algorithms usually goes down as the frame (block) size goes down.

@rxin
Contributor

rxin commented Mar 18, 2016

http://www.txtwizard.net/compression

Try this. Put "a" in it. The compression ratio is 5%, i.e. the compressed size is 20x the size of the original text.
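
To illustrate the same point with a runnable snippet (using java.util.zip's gzip rather than whatever that site uses, so the exact sizes differ): compressing a single tiny record is dominated by framing overhead, while compressing a larger block actually shrinks the data.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object CompressionOverheadSketch {
  // Return the gzip-compressed size of the given bytes.
  def gzipSize(bytes: Array[Byte]): Int = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(bytes)
    gz.close()
    bos.size()
  }

  def main(args: Array[String]): Unit = {
    val record = "a".getBytes("UTF-8")          // one tiny "record"
    val block  = ("a" * 10000).getBytes("UTF-8") // many records framed together

    // The 1-byte input blows up to roughly 20x its size, mostly gzip framing.
    println(s"1-byte record     -> ${gzipSize(record)} bytes compressed")
    // The 10000-byte block compresses to a tiny fraction of the input.
    println(s"10000-byte block  -> ${gzipSize(block)} bytes compressed")
  }
}
```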

@HyukjinKwon
Member Author

I see. Thanks! AFAIK, record-level compression does not actually compress each whole record but only the positions of the values. Could I maybe wait a bit until @tomwhite gives some feedback before closing this?

@srowen
Member

srowen commented Mar 18, 2016

It does mean each record is compressed separately. Maybe that makes sense for huge records, or somehow facilitates processing pieces of a block (since the whole block has to be uncompressed to use any of it). However, Tom's book says block compression should be preferred. I don't know why it's not the default. Also summoning @steveloughran.
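
For reference, a hedged sketch of where this RECORD/BLOCK choice surfaces at the Hadoop level, assuming the mapreduce SequenceFileOutputFormat API; this is the underlying knob, not the Spark code path touched by this PR.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

object BlockVsRecordSketch {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration())

    // Turn on output compression for the job's files.
    FileOutputFormat.setCompressOutput(job, true)

    // RECORD compresses each key/value pair on its own;
    // BLOCK buffers many records and compresses them together,
    // which generally gives much better ratios.
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK)

    // The choice ends up in the Hadoop 2 output-compression-type property.
    println(job.getConfiguration.get("mapreduce.output.fileoutputformat.compress.type"))
  }
}
```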

@steveloughran
Contributor

Summary: use an optimised storage format and DataFrames, and worry about compression afterwards.

  1. You need to use a compression codec that lets you seek into a bit of the file before you read; this is what's needed for parallel access to data in different blocks. I don't remember which codecs are best for this or the specific switch to turn it on. LZO? Snappy?
  2. Compression performance? With native libs, decompression can be fast; Snappy has a good reputation there. Without the native libs things will be way slow. With the right settings, the performance costs of decompression are less than the wait times for the uncompressed data (that's on HDD; SSD changes the rules again, and someone needs to do the experiments).
  3. Except for the specific case of the ingest phase, start by converting your data into an optimised format such as Parquet or ORC, which do things like columnar storage and compression. This massively minimises IO when working with a few columns (provided the format/API supports predicate pushdown: use the DataFrames API and not RDDs directly; see the sketch after this list). I assume you can compress these too, though with compression already happening on columns, the gains would be less.
  4. Ingress is a good streaming use case, if you haven't played with it already ...
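
As a rough illustration of point 3, here is a sketch that writes a compressed Parquet file and reads back only a couple of columns with a filter, assuming a local SparkSession; the path and column names are made up.

```scala
import org.apache.spark.sql.SparkSession

object ColumnarFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-format-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val events = Seq((1L, "click", 0.2), (2L, "view", 0.9)).toDF("id", "event", "score")

    // Convert once into a columnar format with per-column compression.
    events.write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet("/tmp/events-parquet")

    // Selecting a subset of columns with a filter lets Parquet skip data
    // via column pruning and predicate pushdown.
    spark.read.parquet("/tmp/events-parquet")
      .select("id", "score")
      .where($"score" > 0.5)
      .show()

    spark.stop()
  }
}
```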

@srowen
Member

srowen commented Mar 18, 2016

Yes, that's all true, but the question is whether it's better to default to BLOCK or RECORD compression. You're maybe saying it doesn't matter, so let's leave it at BLOCK.

@HyukjinKwon
Member Author

@srowen @steveloughran Thank you so much. Could anybody please decide if I should go ahead or not? It's a bit confusing for me. I will follow the decision.

@HyukjinKwon
Member Author

Closing this.

@HyukjinKwon
Member Author

@srowen Oh, wait. Would it be better to change set() to setIfUnset()?

@srowen
Member

srowen commented Mar 18, 2016

I'd leave this unless we have a clear reason to prefer RECORD in some case

@HyukjinKwon
Member Author

Thanks!

@tomwhite
Member

I agree that BLOCK is always to be preferred over RECORD, so leave it at BLOCK. RECORD is the default in Hadoop 1 and 2 (for backwards compatibility reasons), but that doesn't mean it has to be the same in Spark.

@HyukjinKwon
Member Author

@tomwhite Sorry for adding more comments, but does that mean the default value in Hadoop 1.x is RECORD?

@tomwhite
Member

@HyukjinKwon HyukjinKwon deleted the minor-compression branch October 1, 2016 06:42