
[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation #11408

Closed
wants to merge 6 commits into apache:master from maropu:SPARK-13528

Conversation

maropu
Member

@maropu maropu commented Feb 27, 2016

What changes were proposed in this pull request?

This PR makes the short names of compression codecs in `ParquetRelation` consistent with those used elsewhere. It comes from #11324.

How was this patch tested?

Added more tests in `TextSuite`.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52101 has finished for PR 11408 at commit 25e9250.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52102 has finished for PR 11408 at commit 2d7737b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,6 +60,51 @@ private[spark] object CallSite {
val empty = CallSite("", "")
}

/** A utility class to map short compression codec names to fully qualified ones. */
private[spark] class ShortCompressionCodecNameMapper {
Member

I agree with standardizing names, but this seems like a lot of over-engineering, with abstract classes and whatnot just to attempt to make some keys consistent. I think this makes it harder to understand, and would stick to standardizing the keys.

Member Author

Yeah, the original idea was to have consistent short names for compression codecs everywhere in Spark. Is there a simpler way to do that?

Contributor

+1, agree with Sean.

Contributor

I also agree - this abstract class is too much. I think just having lz4/bzip2 etc in different places isn't that big of a deal.
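To make the suggestion concrete, here is a minimal sketch of the plain-map alternative the reviewers favor. The object and value names are assumptions for illustration, not the code that was merged:

```scala
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Hypothetical sketch: standardize the keys with a plain Map where Parquet
// needs them, instead of introducing an abstract mapper class.
object ParquetCodecNames {
  val shortCompressionCodecNames: Map[String, CompressionCodecName] = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO)
}
```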

@maropu
Member Author

maropu commented Feb 27, 2016

@rxin ping

@SparkQA

SparkQA commented Feb 27, 2016

Test build #52125 has finished for PR 11408 at commit da879a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-13528][SQL][Core] Make the short names of compression codecs consistent in spark [SPARK-13528][SQL] Make the short names of compression codecs consistent in spark Mar 1, 2016
@maropu maropu changed the title [SPARK-13528][SQL] Make the short names of compression codecs consistent in spark [SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation Mar 1, 2016
@SparkQA

SparkQA commented Mar 1, 2016

Test build #52230 has finished for PR 11408 at commit c3f0140.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val compressedFiles = tempFile.listFiles()
assert(compressedFiles.exists(_.getName.endsWith(".gz")))
verifyFrame(sqlContext.read.text(tempFile.getCanonicalPath))
Seq("bzip2", "deflate", "gzip").map { codecName =>
Member

Nit: foreach not map. It doesn't matter in practice though.
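To spell the nit out: `map` builds a result collection that the test then throws away, while `foreach` runs the body purely for its side effects. A minimal sketch, with a stand-in body:

```scala
// `map` would allocate a Seq[Unit] only to discard it;
// `foreach` states the side-effecting intent directly.
Seq("bzip2", "deflate", "gzip").foreach { codecName =>
  println(s"run the write/read round-trip for: $codecName") // stand-in for the real test body
}
```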

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52276 has finished for PR 11408 at commit ac6b82c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52297 has finished for PR 11408 at commit 1be5cc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 2, 2016

@HyukjinKwon can you help review this one? It looks ok to me. Maybe you can do yours on top of this, which adds Python and better error messages.

@HyukjinKwon
Member

@rxin Sure. (I will do it by tomorrow.)

@HyukjinKwon
Member

One thing I am not sure about here: DeflateCodec is basically zlib as far as I remember, but deflate is Hadoop's naming convention. ORC, for example, uses zlib rather than deflate. Although I agree it is a good idea to make the names consistent, I am a bit worried about whether they are safe.

Might it be better to make these names private? Or do you think Spark should accept deflate as zlib and zlib as deflate? (In that case, the JSON, TEXT, and CSV data sources would produce .deflate file extensions for zlib, which might have to be handled separately.)

(I am adding that support for ORC/Parquet in #11464.)
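For context, the mapping below illustrates the Hadoop codec classes behind these short names. The map itself is an illustration, not the Spark code under review, but the class names are Hadoop's, and DeflateCodec is the zlib-based codec whose default file extension is .deflate:

```scala
import org.apache.hadoop.io.compress.{BZip2Codec, DeflateCodec, GzipCodec, SnappyCodec}

// Illustrative mapping from short names to Hadoop codec classes.
val hadoopCodecClassNames: Map[String, String] = Map(
  "bzip2" -> classOf[BZip2Codec].getName,
  "deflate" -> classOf[DeflateCodec].getName, // zlib stream; Hadoop writes ".deflate" files
  "gzip" -> classOf[GzipCodec].getName,
  "snappy" -> classOf[SnappyCodec].getName)
```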

@HyukjinKwon
Member

One more thing: I am not sure whether the JSON, CSV, and TEXT data sources also need uncompressed or none (just like Parquet) as short names to explicitly set no compression codec.

Could you give me some feedback, please?

@HyukjinKwon
Member

If consistent short names means lower-case-only names, then I think we can just leave them as corrected here.

@maropu
Member Author

maropu commented Mar 2, 2016

Yes, you're right: gzip and deflate share a common compression algorithm. However, in my opinion, it'd be better to support both short names for Hadoop compatibility, because their file formats differ. Supporting both formats won't confuse Hadoop users.

@maropu
Member Author

maropu commented Mar 2, 2016

Setting no compression codec explicitly is important for users to understand the behaviour, though I'm not sure we need both names...

@rxin
Contributor

rxin commented Mar 2, 2016

I think "none" is enough.

@HyukjinKwon
Member

Thanks. Then zlib for ORC, deflate for TEXT, JSON and CSV, none for TEXT, JSON, CSV and ORC, and none/uncompressed for Parquet?

@rxin
Contributor

rxin commented Mar 2, 2016

Sorry I don't understand what you mean ...

@HyukjinKwon
Member

Let me list the possible compression options for each data source (a usage sketch follows the list).

For JSON, CSV, and TEXT data sources:
  • none - no compression
  • bzip2
  • snappy
  • gzip
  • deflate - this is zlib

For Parquet:
  • none/uncompressed - 'uncompressed' could previously be set via a Spark configuration option
  • gzip
  • lzo
  • snappy

For ORC:
  • none - no compression
  • zlib
  • snappy
  • lzo
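A usage sketch of the short names above; the paths are placeholders, and the `compression` write option for the text-based sources is assumed to be available once the related PRs land:

```scala
// Hypothetical usage; assumes an existing SQLContext named `sqlContext`.
val df = sqlContext.read.json("/path/to/input.json")

df.write.option("compression", "gzip").json("/tmp/out-json")              // JSON/CSV/TEXT: none, bzip2, snappy, gzip, deflate
df.write.option("compression", "snappy").parquet("/tmp/out-parquet")      // Parquet: none/uncompressed, snappy, gzip, lzo
df.write.format("orc").option("compression", "zlib").save("/tmp/out-orc") // ORC: none, zlib, snappy, lzo
```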

@rxin
Contributor

rxin commented Mar 2, 2016

Why do we need to support uncompressed in Parquet? Is it for backward compatibility?

@HyukjinKwon
Member

Yes. Wouldn't users set 'uncompressed', since it was previously possible to set it via the Spark configuration?

@rxin
Contributor

rxin commented Mar 2, 2016

OK, sounds great. I'd keep it as an undocumented option for backward compatibility.

@rxin
Contributor

rxin commented Mar 2, 2016

deflate is pretty confusing. I'd just say zlib? -- actually never mind, let's just say deflate.

@HyukjinKwon
Member

As I said, we might then need to consider handling the file extensions for JSON, TEXT, and CSV, which are .deflate; we might need to manually change them to .zlib.

Do you think we can just leave the extensions as they are?

@rxin
Contributor

rxin commented Mar 2, 2016

We don't specify any extensions right now, do we?

@HyukjinKwon
Member

I remember seeing some tests for compression codecs that check the file extension when the output is compressed. Let me correct them (or simply check them) and then create a new PR based on this.

@HyukjinKwon
Member

Let's talk more in the new PR. I will first try to deal with this myself as far as I can.

@rxin
Contributor

rxin commented Mar 2, 2016

OK I thought about this a little bit more -- I'd just have uncompressed as an undocumented option for all data sources. That way, it is very consistent.

@HyukjinKwon should I merge this pr now?

@HyukjinKwon
Member

Yes, please. Let me make a follow-up.

@rxin
Contributor

rxin commented Mar 2, 2016

Thanks - merging this in master.

@asfgit asfgit closed this in 6250cf1 Mar 2, 2016
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Mar 8, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation

## What changes were proposed in this pull request?
This PR makes the short names of compression codecs in `ParquetRelation` consistent with those used elsewhere. It comes from apache#11324.

## How was this patch tested?
Added more tests in `TextSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes apache#11408 from maropu/SPARK-13528.
@maropu maropu deleted the SPARK-13528 branch July 5, 2017 11:43