
[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation #11408

Closed
wants to merge 6 commits into apache:master from maropu:SPARK-13528

Conversation

maropu
Member

@maropu maropu commented Feb 27, 2016

What changes were proposed in this pull request?

This PR makes the short names of compression codecs in `ParquetRelation` consistent with those used elsewhere. It comes from #11324.

How was this patch tested?

Added more tests in `TextSuite`.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52101 has finished for PR 11408 at commit 25e9250.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52102 has finished for PR 11408 at commit 2d7737b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,6 +60,51 @@ private[spark] object CallSite {
val empty = CallSite("", "")
}

/** A utility class to map short compression codec names to fully qualified ones. */
private[spark] class ShortCompressionCodecNameMapper {
Member

I agree with standardizing names, but this seems like a lot of over-engineering, with abstract classes and whatnot just to attempt to make some keys consistent. I think this makes it harder to understand, and would stick to standardizing the keys.

Member Author

Yeah, the original idea was to have consistent short names for compression codecs everywhere in Spark. Is there a simpler way to do that?

Contributor

+1, agree with Sean.

Contributor

I also agree - this abstract class is too much. I think just having lz4/bzip2 etc in different places isn't that big of a deal.
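To make the suggestion concrete, here is a minimal sketch of the plain-map alternative the reviewers favor. The object and value names are assumptions for illustration, not the code that was merged:

```scala
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Hypothetical sketch: standardize the keys with a plain Map where Parquet
// needs them, instead of introducing an abstract mapper class.
object ParquetCodecNames {
  val shortCompressionCodecNames: Map[String, CompressionCodecName] = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO)
}
```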

@maropu
Member Author

maropu commented Feb 27, 2016

@rxin ping

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52125 has finished for PR 11408 at commit da879a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-13528][SQL][Core] Make the short names of compression codecs consistent in spark [SPARK-13528][SQL] Make the short names of compression codecs consistent in spark Mar 1, 2016
@maropu maropu changed the title [SPARK-13528][SQL] Make the short names of compression codecs consistent in spark [SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation Mar 1, 2016
@SparkQA

SparkQA commented Mar 1, 2016

Test build #52230 has finished for PR 11408 at commit c3f0140.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val compressedFiles = tempFile.listFiles()
assert(compressedFiles.exists(_.getName.endsWith(".gz")))
verifyFrame(sqlContext.read.text(tempFile.getCanonicalPath))
Seq("bzip2", "deflate", "gzip").map { codecName =>
Member

Nit: foreach not map. It doesn't matter in practice though.
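To spell the nit out: `map` builds a result collection that the test then throws away, while `foreach` runs the body purely for its side effects. A minimal sketch, with a stand-in body:

```scala
// `map` would allocate a Seq[Unit] only to discard it;
// `foreach` states the side-effecting intent directly.
Seq("bzip2", "deflate", "gzip").foreach { codecName =>
  println(s"run the write/read round-trip for: $codecName") // stand-in for the real test body
}
```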

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52276 has finished for PR 11408 at commit ac6b82c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52297 has finished for PR 11408 at commit 1be5cc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 2, 2016

@HyukjinKwon can you help review this one? It looks ok to me. Maybe you can do yours on top of this, which adds Python and better error messages.

@HyukjinKwon
Member

@rxin Sure. (I will do it by tomorrow.)

@HyukjinKwon
Member

One thing I am not sure about here: DeflateCodec is basically zlib as far as I remember, but deflate is Hadoop's naming convention. ORC, for example, uses zlib rather than deflate. Although I agree it is a good idea to make the names consistent, I am a bit worried about whether they are safe.

Might it be better to make these names private? Or do you think Spark should accept deflate as zlib and zlib as deflate? (In that case, the JSON, TEXT, and CSV data sources would produce .deflate file extensions for zlib, which might have to be handled separately.)

(I am adding that support for ORC/Parquet in #11464.)
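For context, the mapping below illustrates the Hadoop codec classes behind these short names. The map itself is an illustration, not the Spark code under review, but the class names are Hadoop's, and DeflateCodec is the zlib-based codec whose default file extension is .deflate:

```scala
import org.apache.hadoop.io.compress.{BZip2Codec, DeflateCodec, GzipCodec, SnappyCodec}

// Illustrative mapping from short names to Hadoop codec classes.
val hadoopCodecClassNames: Map[String, String] = Map(
  "bzip2" -> classOf[BZip2Codec].getName,
  "deflate" -> classOf[DeflateCodec].getName, // zlib stream; Hadoop writes ".deflate" files
  "gzip" -> classOf[GzipCodec].getName,
  "snappy" -> classOf[SnappyCodec].getName)
```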

@HyukjinKwon
Member

One more thing: I am not sure whether the JSON, CSV, and TEXT data sources also need uncompressed or none (just like Parquet) as short names to explicitly set no compression codec.

Could you give me some feedback, please?

@HyukjinKwon
Member

If consistent short names means lower-case-only names, then I think we can just leave them as corrected here.

@maropu
Member Author

maropu commented Mar 2, 2016

Yes, you're right: gzip and deflate share a common compression algorithm. However, in my opinion, it'd be better to support both short names for Hadoop compatibility, because their file formats differ. Supporting both formats won't confuse Hadoop users.

@maropu
Member Author

maropu commented Mar 2, 2016

Setting no compression codec explicitly is important for users to understand the behaviour, though I'm not sure we need both names...

@rxin
Contributor

rxin commented Mar 2, 2016

I think "none" is enough.

@HyukjinKwon
Member

Thanks. Then zlib for ORC, deflate for TEXT, JSON and CSV, none for TEXT, JSON, CSV and ORC, and none/uncompressed for Parquet?

@rxin
Contributor

rxin commented Mar 2, 2016

Sorry I don't understand what you mean ...

@HyukjinKwon
Member

Let me list the possible compression options for each data source (a usage sketch follows the list).

For JSON, CSV, and TEXT data sources:
  • none - no compression
  • bzip2
  • snappy
  • gzip
  • deflate - this is zlib

For Parquet:
  • none/uncompressed - 'uncompressed' could previously be set via a Spark configuration option
  • gzip
  • lzo
  • snappy

For ORC:
  • none - no compression
  • zlib
  • snappy
  • lzo
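A usage sketch of the short names above; the paths are placeholders, and the `compression` write option for the text-based sources is assumed to be available once the related PRs land:

```scala
// Hypothetical usage; assumes an existing SQLContext named `sqlContext`.
val df = sqlContext.read.json("/path/to/input.json")

df.write.option("compression", "gzip").json("/tmp/out-json")              // JSON/CSV/TEXT: none, bzip2, snappy, gzip, deflate
df.write.option("compression", "snappy").parquet("/tmp/out-parquet")      // Parquet: none/uncompressed, snappy, gzip, lzo
df.write.format("orc").option("compression", "zlib").save("/tmp/out-orc") // ORC: none, zlib, snappy, lzo
```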

@rxin
Contributor

rxin commented Mar 2, 2016

Why do we need to support uncompressed in Parquet? Is it for backward compatibility?

@HyukjinKwon
Member

Yes. Wouldn't users set 'uncompressed', since it was previously possible to set it via the Spark configuration?

@rxin
Contributor

rxin commented Mar 2, 2016

OK, sounds great. I'd keep it as an undocumented option for backward compatibility.

@rxin
Contributor

rxin commented Mar 2, 2016

deflate is pretty confusing. I'd just say zlib? -- actually never mind, let's just say deflate.

@HyukjinKwon
Member

As I said, we might then need to consider handling the file extensions for JSON, TEXT, and CSV, which are .deflate; we might need to manually change them to .zlib.

Do you think we can just leave the extensions as they are?

@rxin
Contributor

rxin commented Mar 2, 2016

We don't specify any extensions right now, do we?

@HyukjinKwon
Member

I remember seeing some tests for compression codecs that check the file extension when the output is compressed. Let me correct them (or simply check them) and then create a new PR based on this.

@HyukjinKwon
Member

Let's talk more in the new PR. I will first try to deal with this myself as far as I can.

@rxin
Contributor

rxin commented Mar 2, 2016

OK I thought about this a little bit more -- I'd just have uncompressed as an undocumented option for all data sources. That way, it is very consistent.

@HyukjinKwon should I merge this pr now?

@HyukjinKwon
Member

Yes, please. Let me make a follow-up.

@rxin
Contributor

rxin commented Mar 2, 2016

Thanks - merging this in master.

@asfgit asfgit closed this in 6250cf1 Mar 2, 2016
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Mar 8, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation

## What changes were proposed in this pull request?
This PR makes the short names of compression codecs in `ParquetRelation` consistent with those used elsewhere. It comes from apache#11324.

## How was this patch tested?
Added more tests in `TextSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes apache#11408 from maropu/SPARK-13528.
@maropu maropu deleted the SPARK-13528 branch July 5, 2017 11:43