
[SPARK-12871][SQL] Support to specify the option for compression codec. #10805

Closed
wants to merge 14 commits

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-12871
This PR adds an option to specify the compression codec.
It adds the option `codec` as an alias for `compression`, as filed in SPARK-12668.

Note that I did not add configurations for Hadoop 1.x, as this CsvRelation uses the Hadoop 2.x API and I expect Hadoop 1.x support is going to be dropped.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49592 has finished for PR 10805 at commit 5b57fc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

val nullValue = parameters.getOrElse("nullValue", "")

val codec = parameters.getOrElse("codec", parameters.getOrElse("compression", null))
Member Author


I will correct this to look up `compression` first.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49643 has finished for PR 10805 at commit e7ebddd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

val nullValue = parameters.getOrElse("nullValue", "")

val codec = parameters.getOrElse("compression", parameters.getOrElse("codec", null))
Contributor


For this one I'd name it internally `compression` or `compressionCodec`, since `codec` can mean a lot of different things.

Contributor


The other thing is that I'd create short-form names for the common options; e.g. "gzip" should become GzipCodec. You'd need to look into which formats are commonly supported and come up with short names for them. We should also make sure this is case insensitive.

@rxin
Contributor

rxin commented Jan 19, 2016

Yup we are dropping Hadoop 1.x support, so it is OK to have it only for Hadoop 2.x.

@HyukjinKwon
Member Author

I will resolve conflicts and update this soon.

@HyukjinKwon
Member Author

Supported shorten names for compression codecs are below (case insensitive):

bzip2 -> org.apache.hadoop.io.compress.BZip2Codec
gzip -> org.apache.hadoop.io.compress.GzipCodec
lz4 -> org.apache.hadoop.io.compress.Lz4Codec
snappy -> org.apache.hadoop.io.compress.SnappyCodec
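The case-insensitive short-name lookup described above can be sketched roughly as follows. This is a simplified illustration, not the actual Spark source; the object and method names here are assumptions for the sketch:

```scala
// Minimal sketch of a case-insensitive short-name-to-codec mapping,
// mirroring the shortened names listed in the comment above.
object CompressionShortNames {
  private val shortCompressionCodecNames = Map(
    "bzip2"  -> "org.apache.hadoop.io.compress.BZip2Codec",
    "gzip"   -> "org.apache.hadoop.io.compress.GzipCodec",
    "lz4"    -> "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")

  // Lower-case the user-supplied name so the lookup is case insensitive.
  def resolve(name: String): Option[String] =
    shortCompressionCodecNames.get(name.toLowerCase)
}
```

With this shape, `resolve("GZIP")` and `resolve("gzip")` both map to the same full codec class name.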

@@ -44,6 +46,13 @@ private[sql] case class CSVParameters(@transient parameters: Map[String, String]
}
}

// Available compression codec list
val shortCompressionCodecNames = Map(
Contributor


this should go into the object rather than in the case class

@HyukjinKwon
Member Author

Although CSVCompressionCodecs might be shared with the JSON datasource, I will make it shared in a separate PR for JSON.

case e: ClassNotFoundException => None
}
codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
s"is not available. Available codecs are ${shortCompressionCodecNames.keys.mkString(",")}."))
Contributor


add a space after the comma

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49671 has finished for PR 10805 at commit adb9eb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49676 has finished for PR 10805 at commit 6400b76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49679 has finished for PR 10805 at commit 0245eea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 19, 2016

Oh one thing: this doesn't support reading with compression yet, does it?

@HyukjinKwon
Member Author

Oh yes it does. Actually I am reading compressed files in the test I added here.

As you know, it recognises the compression codec by file extension, so if you meant manually setting the compression codec for reading (for files that somehow lack extensions), it does not.
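The extension-based detection being discussed can be sketched conceptually like this. This is an illustration of the idea, not Spark's or Hadoop's actual implementation; the helper name and the extension table are assumptions:

```scala
// Conceptual sketch: infer a compression codec from a file extension,
// the way compressed input files are recognised when reading.
def codecForPath(path: String): Option[String] = {
  val extToCodec = Map(
    ".gz"     -> "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2"    -> "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")
  // Return the codec whose extension the path ends with, if any.
  extToCodec.collectFirst { case (ext, codec) if path.endsWith(ext) => codec }
}
```

A file without a recognised extension yields `None`, which matches the concern above: such files would need the codec specified explicitly.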

@rxin
Contributor

rxin commented Jan 19, 2016

Yeah, I'm thinking we should also support specifying the option for reading, with "auto" as the default, which decides based on extensions.

@HyukjinKwon
Member Author

I see. I will try to figure this out anyway, though I feel this might be a bit too much, as almost all files would have proper extensions; the (almost) only exception might be files initially uploaded by users to a file system.

Maybe I am missing something, though. I don't think users would give files the wrong extensions yet set the compression codec correctly for reading, in particular when they use HDFS, because AFAIK Hadoop supports reading compressed files by extension. I feel like they should just give the files proper extensions.

} catch {
case e: ClassNotFoundException => None
}
codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
Contributor


Should just put this throw inside the catch block and not bother with the Option stuff.
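The suggested simplification could look roughly like this. It is a sketch only: the helper signature is hypothetical, and the codec map passed in here uses stand-in class names rather than the Hadoop codecs from the diff above:

```scala
// Sketch of the reviewer's suggestion: throw directly from the catch
// block instead of wrapping the result in an Option first.
def getCodecClassName(codecName: String,
                      shortCompressionCodecNames: Map[String, String]): String = {
  // Resolve a short name if one matches; otherwise try the name as given.
  val fullName = shortCompressionCodecNames.getOrElse(codecName.toLowerCase, codecName)
  try {
    Class.forName(fullName).getName
  } catch {
    case _: ClassNotFoundException =>
      throw new IllegalArgumentException(s"Codec [$codecName] is not available. " +
        s"Available codecs are ${shortCompressionCodecNames.keys.mkString(", ")}.")
  }
}
```

This removes the `Option`/`getOrElse` indirection while keeping the same error message, including the space after the comma requested earlier.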

@rxin
Contributor

rxin commented Jan 20, 2016

LGTM

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49750 has finished for PR 10805 at commit cd9f742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 20, 2016

I've merged this in master. Thanks.

@asfgit asfgit closed this in 6844d36 Jan 20, 2016
asfgit pushed a commit that referenced this pull request Jan 23, 2016
…c for JSON datasource

https://issues.apache.org/jira/browse/SPARK-12872

This PR makes the JSON datasource able to compress output via an option instead of manually setting Hadoop configurations.
For resolving codecs by name, it is similar to #10805.

As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, `CompressionCodecs`, to allow sharing.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #10858 from HyukjinKwon/SPARK-12872.
falaki pushed a commit to databricks/spark-csv that referenced this pull request Jan 29, 2016
#234
This PR is similar with apache/spark#10805.

This PR adds support for shortened names for compression codecs and adds a `CompressionCodecs` class instead of the implicit function, as its use is not recommended.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #235 from HyukjinKwon/ISSUE-234-shorten-name.
asfgit pushed a commit that referenced this pull request Feb 26, 2016
…ssion codec for TEXT

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13503
This PR makes the TEXT datasource able to compress output via an option instead of manually setting Hadoop configurations.
For resolving codecs by name, it is similar to #10805 and #10858.

## How was this patch tested?

This was tested with unit tests and with `dev/run_tests` for coding style.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11384 from HyukjinKwon/SPARK-13503.
@HyukjinKwon HyukjinKwon deleted the SPARK-12420 branch September 23, 2016 18:28