[SPARK-13997][SQL] Use Hadoop 2.0 default value for compression in data sources. #11806
Conversation
|
We should create a JIRA for changing config values. |
|
BTW can you explain more what this actually means? It seems really bad to compress every record. Maybe this is not what this config is doing? |
|
Test build #53484 has finished for PR 11806 at commit
|
|
@rxin According to the Hadoop Definitive Guide, 3rd edition, these look like the right configurations for the unit of compression (record or block). |
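For anyone following along, here is a minimal sketch of where that setting lives on the Spark side, assuming a Spark 2.x SparkSession and the standard Hadoop property names (the value can be RECORD, BLOCK, or NONE):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: set the SequenceFile compression unit on the Hadoop
// configuration that Spark hands to its output formats.
val spark = SparkSession.builder().appName("compression-type-demo").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Hadoop 2.x property name; accepts RECORD, BLOCK or NONE.
hadoopConf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK")
// Deprecated Hadoop 1.x name, still honoured by older code paths.
hadoopConf.set("mapred.output.compression.type", "BLOCK")
```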
|
I think maybe we should leave the configuration at its default if users do not specify one. It might be able to use |
|
Am I misunderstanding it? It seems insane to run compression at the record level because the overhead is very high. |
|
I see.. Should I maybe close this? |
|
Yea - until we can figure out what it actually means, I'd close this for now. cc @tomwhite - maybe you can shed some light on what "record" means here? |
|
Actually, I did not understand why the overhead of compression at the record level (I mean a row in Spark, a key-value record in a Hadoop output format) would be very high. I think the overhead is slightly higher and the compression ratio a bit lower, but it allows random access at the record level. Maybe I lack the knowledge and experience here. I would really appreciate it if you could help me (and then I can also understand what "record" exactly means). |
|
The efficiency of compression algorithms usually goes down as the frame (block) size goes down. |
|
http://www.txtwizard.net/compression Try this. Put "a" in it. The compression ratio is 5%, i.e. the compressed size is 20x the size of the original text. |
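A quick way to see the same effect locally, as a sketch using the JDK's Deflater rather than a Hadoop codec:

```scala
import java.util.zip.Deflater

// Deflating a single byte produces more output than input because of
// per-frame headers and checksums, which is the overhead being discussed.
val input = "a".getBytes("UTF-8")
val deflater = new Deflater()
deflater.setInput(input)
deflater.finish()
val buf = new Array[Byte](64)
val compressedLen = deflater.deflate(buf)
deflater.end()
println(s"original: ${input.length} byte, compressed: $compressedLen bytes")
// Typically prints something like: original: 1 byte, compressed: 9 bytes
```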
|
I see. Thanks! AFAIK, record level compression does not actually compress each whole record but only |
|
It does mean each record is compressed separately. Maybe that makes sense for huge records, or somehow facilitates processing pieces of a block (since the whole block has to be uncompressed to use any of it). However, Tom's book says block compression should be preferred. I don't know why it's not the default. Also summoning @steveloughran |
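To make the RECORD/BLOCK distinction concrete, here is a hedged sketch of writing a block-compressed SequenceFile directly with the Hadoop 2.x API; the path and codec are arbitrary placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.io.{SequenceFile, Text}

val conf = new Configuration()
val writer = SequenceFile.createWriter(
  conf,
  SequenceFile.Writer.file(new Path("/tmp/block-compressed.seq")),
  SequenceFile.Writer.keyClass(classOf[Text]),
  SequenceFile.Writer.valueClass(classOf[Text]),
  // BLOCK buffers many key-value pairs and compresses them together;
  // RECORD compresses each value on its own as it is appended.
  SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new GzipCodec())
)
writer.append(new Text("key"), new Text("value"))
writer.close()
```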
|
Summary: use an optimised storage format and dataframes, worry about compression afterwards
|
|
Yes that's all true, but the question is whether it's better to default to BLOCK or RECORD compression. You're maybe saying it doesn't matter so let's leave it at BLOCK. |
|
@srowen @steveloughran Thank you so much. Could anybody please decide if I should go ahead or not? For me it's a bit confusing. I will follow the decision. |
|
Closing this. |
|
@srowen Oh, wait. Should I better change |
|
I'd leave this unless we have a clear reason to prefer RECORD in some case. |
|
Thanks! |
|
I agree that BLOCK is always to be preferred over RECORD, so leave it at BLOCK. RECORD is the default in Hadoop 1 and 2 (for backwards compatibility reasons), but that doesn't mean it has to be the same in Spark. |
|
@tomwhite Sorry for adding more comments, but does that mean the default value in Hadoop 1.x is BLOCK? |
|
No, it's RECORD, like it is in Hadoop 2. https://github.com/apache/hadoop/blob/branch-1.2/src/mapred/mapred-default.xml#L736-L742 |
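The shipped default can also be checked programmatically; a small sketch assuming a Hadoop 2.x mapreduce client jar (which bundles mapred-default.xml) is on the classpath:

```scala
import org.apache.hadoop.mapred.JobConf

// JobConf pulls in mapred-default.xml and mapred-site.xml as default resources.
val conf = new JobConf()
println(conf.get("mapreduce.output.fileoutputformat.compress.type")) // RECORD
```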
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13997
Currently, the JSON, TEXT and CSV data sources use the CompressionCodecs class to set compression configurations via option("compress", "codec"). I made this use the Hadoop 1.x default value (block-level compression). However, the default value in Hadoop 2.x is record-level compression, as described in mapred-default.xml.
Since Spark drops Hadoop 1.x support, it makes sense to use the Hadoop 2.x default value.
How was this patch tested?
Via ./dev/run_tests and unit tests.
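For context, a hedged illustration of the user-facing side this change touches: picking a compression codec when writing a text-based source. In recent Spark releases the option key is "compression"; the exact key on the branch this PR targeted may differ.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-option-demo").getOrCreate()
val df = spark.range(100).toDF("id")

df.write
  .option("compression", "gzip") // routed through the CompressionCodecs helper
  .json("/tmp/compressed-json-output")
```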