
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13997

Currently, the JSON, TEXT and CSV data sources use the CompressionCodecs class to set compression configurations via option("compress", "codec").

I originally made this use the Hadoop 1.x default value (block-level compression). However, the default value in Hadoop 2.x is record-level compression, as described in mapred-site.xml.

Since Spark is dropping Hadoop 1.x support, it makes sense to use the Hadoop 2.x default values.
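
To make the context concrete, here is a minimal sketch of how a user asks one of these text-based writers to compress its output. This assumes a recent Spark build with a SparkSession and the writer-level compression option; the output path is illustrative and not taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

object CompressionOptionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compression-option-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Ask the CSV writer to gzip its output files; the data source turns
    // this into Hadoop output-compression settings (the codec plus the
    // block/record "type" discussed in this PR).
    df.write
      .option("compression", "gzip")
      .mode("overwrite")
      .csv("/tmp/compressed-csv-output")

    spark.stop()
  }
}
```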

How was this patch tested?

Via ./dev/run_tests and unit tests.

@rxin
Contributor

rxin commented Mar 18, 2016

We should create a JIRA for changing config values.

@rxin
Contributor

rxin commented Mar 18, 2016

BTW, can you explain more about what this actually means? It seems really bad to compress every record. Maybe this is not what this config is doing?

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53484 has finished for PR 11806 at commit 80749a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [MINOR][SQL] Use Hadoop 2.0 default value for compression in data sources. [SPARK-13997][SQL] Use Hadoop 2.0 default value for compression in data sources. Mar 18, 2016
@HyukjinKwon
Member Author

@rxin According to the Hadoop Definitive Guide, 3rd edition, it looks like these are the right configurations for the unit of compression (record or block).

@HyukjinKwon
Member Author

I think maybe we should leave the configurations at their defaults if users do not specify them. We could use setIfUnset() instead of set().
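
For illustration, a minimal sketch of the difference between the two calls, assuming a plain Hadoop Configuration and the Hadoop 2 output-compression-type key (the exact key Spark touches lives in CompressionCodecs):

```scala
import org.apache.hadoop.conf.Configuration

object SetIfUnsetSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // set() always overwrites whatever the user or *-site.xml configured.
    conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")

    // setIfUnset() only fills in a value when nothing was configured,
    // leaving an explicit user/cluster setting untouched.
    conf.setIfUnset("mapreduce.output.fileoutputformat.compress.type", "BLOCK")

    println(conf.get("mapreduce.output.fileoutputformat.compress.type"))
  }
}
```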

@rxin
Contributor

rxin commented Mar 18, 2016

Am I misunderstanding it? It seems insane to run compression at record level because the overhead is very high.

@HyukjinKwon
Member Author

I see. Should I maybe close this?

@rxin
Contributor

rxin commented Mar 18, 2016

Yea - until we can figure out what it actually means, I'd close this for now.

cc @tomwhite - maybe you can shed some light on what "record" means here?

@HyukjinKwon
Member Author

Actually, I did not understand why the overhead of compression at the record level (I mean a row in Spark, or a key-value record in a Hadoop output format) would be very high. I think the overhead is slightly higher and the compression ratio a bit lower, but it allows random access at the record level.

Maybe I lack the knowledge and experience here. I would really appreciate it if you could help me (and also so I can understand what "record" exactly means).

@rxin
Contributor

rxin commented Mar 18, 2016

The efficiency of compression algorithms usually goes down as the frame (block) size goes down.

@rxin
Contributor

rxin commented Mar 18, 2016

http://www.txtwizard.net/compression

Try this. Put "a" in it. The compression ratio is 5%, i.e. the compressed size is 20x the size of the original text.
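
To illustrate the same point with a runnable snippet (using java.util.zip's gzip rather than whatever that site uses, so the exact sizes differ): compressing a single tiny record is dominated by framing overhead, while compressing a larger block actually shrinks the data.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object CompressionOverheadSketch {
  // Return the gzip-compressed size of the given bytes.
  def gzipSize(bytes: Array[Byte]): Int = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(bytes)
    gz.close()
    bos.size()
  }

  def main(args: Array[String]): Unit = {
    val record = "a".getBytes("UTF-8")          // one tiny "record"
    val block  = ("a" * 10000).getBytes("UTF-8") // many records framed together

    // The 1-byte input blows up to roughly 20x its size, mostly gzip framing.
    println(s"1-byte record     -> ${gzipSize(record)} bytes compressed")
    // The 10000-byte block compresses to a tiny fraction of the input.
    println(s"10000-byte block  -> ${gzipSize(block)} bytes compressed")
  }
}
```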

@HyukjinKwon
Member Author

I see. Thanks! AFAIK, record-level compression does not actually compress each whole record but only the positions of the values. Could I maybe wait a bit until @tomwhite gives some feedback before closing this?

@srowen
Member

srowen commented Mar 18, 2016

It does mean each record is compressed separately. Maybe that makes sense for huge records, or somehow facilitates processing pieces of a block (since the whole block has to be uncompressed to use any of it). However, Tom's book says block compression should be preferred. I don't know why it's not the default. Also summoning @steveloughran.
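
For reference, a hedged sketch of where this RECORD/BLOCK choice surfaces at the Hadoop level, assuming the mapreduce SequenceFileOutputFormat API; this is the underlying knob, not the Spark code path touched by this PR.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

object BlockVsRecordSketch {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration())

    // Turn on output compression for the job's files.
    FileOutputFormat.setCompressOutput(job, true)

    // RECORD compresses each key/value pair on its own;
    // BLOCK buffers many records and compresses them together,
    // which generally gives much better ratios.
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK)

    // The choice ends up in the Hadoop 2 output-compression-type property.
    println(job.getConfiguration.get("mapreduce.output.fileoutputformat.compress.type"))
  }
}
```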

@steveloughran
Contributor

Summary: use an optimised storage format and DataFrames, and worry about compression afterwards.

  1. You need to use a compression codec that lets you seek into a bit of the file before you read; this is what's needed for parallel access to data in different blocks. I don't remember which codecs are best for this or the specific switch to turn it on. LZO? Snappy?
  2. Compression performance? With native libs, decompression can be fast; Snappy has a good reputation there. Without the native libs things will be way slow. With the right settings, the performance costs of decompression are less than the wait times for the uncompressed data (that's on HDD; SSD changes the rules again, and someone needs to do the experiments).
  3. Except for the specific case of the ingest phase, start by converting your data into an optimised format such as Parquet or ORC, which do things like columnar storage and compression. This massively minimises IO when working with a few columns (provided the format/API supports predicate pushdown: use the DataFrames API and not RDDs directly; see the sketch after this list). I assume you can compress these too, though with compression already happening on columns, the gains would be less.
  4. Ingress is a good streaming use case, if you haven't played with it already ...
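
As a rough illustration of point 3, here is a sketch that writes a compressed Parquet file and reads back only a couple of columns with a filter, assuming a local SparkSession; the path and column names are made up.

```scala
import org.apache.spark.sql.SparkSession

object ColumnarFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-format-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val events = Seq((1L, "click", 0.2), (2L, "view", 0.9)).toDF("id", "event", "score")

    // Convert once into a columnar format with per-column compression.
    events.write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet("/tmp/events-parquet")

    // Selecting a subset of columns with a filter lets Parquet skip data
    // via column pruning and predicate pushdown.
    spark.read.parquet("/tmp/events-parquet")
      .select("id", "score")
      .where($"score" > 0.5)
      .show()

    spark.stop()
  }
}
```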

@srowen
Member

srowen commented Mar 18, 2016

Yes, that's all true, but the question is whether it's better to default to BLOCK or RECORD compression. You're maybe saying it doesn't matter, so let's leave it at BLOCK.

@HyukjinKwon
Member Author

@srowen @steveloughran Thank you so much. Could anybody please decide if I should go ahead or not? It's a bit confusing for me. I will follow the decision.

@HyukjinKwon
Member Author

Closing this.

@HyukjinKwon
Member Author

@srowen Oh, wait. Would it be better to change set() to setIfUnset()?

@srowen
Member

srowen commented Mar 18, 2016

I'd leave this unless we have a clear reason to prefer RECORD in some case

@HyukjinKwon
Member Author

Thanks!

@tomwhite
Member

I agree that BLOCK is always to be preferred over RECORD, so leave it at BLOCK. RECORD is the default in Hadoop 1 and 2 (for backwards compatibility reasons), but that doesn't mean it has to be the same in Spark.

@HyukjinKwon
Member Author

@tomwhite Sorry for adding more comments, but does that mean the default value in Hadoop 1.x is RECORD?

@tomwhite
Member

@HyukjinKwon HyukjinKwon deleted the minor-compression branch October 1, 2016 06:42