
[SPARK-12871][SQL] Support to specify the option for compression codec. #10805

Closed
wants to merge 14 commits

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-12871
This PR adds an option to specify the compression codec.
It adds the option `codec` as an alias for `compression`, as filed in SPARK-12668.

Note that I did not add configurations for Hadoop 1.x, as this CsvRelation uses the Hadoop 2.x API and I expect Hadoop 1.x support is going to be dropped.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49592 has finished for PR 10805 at commit 5b57fc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

val nullValue = parameters.getOrElse("nullValue", "")

val codec = parameters.getOrElse("codec", parameters.getOrElse("compression", null))
Member Author


I will correct this to look up `compression` first.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49643 has finished for PR 10805 at commit e7ebddd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

val nullValue = parameters.getOrElse("nullValue", "")

val codec = parameters.getOrElse("compression", parameters.getOrElse("codec", null))
Contributor


For this one I'd name it internally `compression` or `compressionCodec`, since `codec` can mean a lot of different things.

Contributor


The other thing is that I'd create short-form names for the common options; e.g. "gzip" should become GzipCodec. You'd need to look into which formats are commonly supported and come up with short names for them. We should also make sure this is case insensitive.

@rxin
Contributor

rxin commented Jan 19, 2016

Yup we are dropping Hadoop 1.x support, so it is OK to have it only for Hadoop 2.x.

@HyukjinKwon
Member Author

I will resolve conflicts and update this soon.

@HyukjinKwon
Member Author

Supported shorten names for compression codecs are below (case insensitive):

bzip2 -> org.apache.hadoop.io.compress.BZip2Codec
gzip -> org.apache.hadoop.io.compress.GzipCodec
lz4 -> org.apache.hadoop.io.compress.Lz4Codec
snappy -> org.apache.hadoop.io.compress.SnappyCodec
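The case-insensitive short-name lookup described above can be sketched roughly as follows. This is a simplified illustration, not the actual Spark source; the object and method names here are assumptions for the sketch:

```scala
// Minimal sketch of a case-insensitive short-name-to-codec mapping,
// mirroring the shortened names listed in the comment above.
object CompressionShortNames {
  private val shortCompressionCodecNames = Map(
    "bzip2"  -> "org.apache.hadoop.io.compress.BZip2Codec",
    "gzip"   -> "org.apache.hadoop.io.compress.GzipCodec",
    "lz4"    -> "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")

  // Lower-case the user-supplied name so the lookup is case insensitive.
  def resolve(name: String): Option[String] =
    shortCompressionCodecNames.get(name.toLowerCase)
}
```

With this shape, `resolve("GZIP")` and `resolve("gzip")` both map to the same full codec class name.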

@@ -44,6 +46,13 @@ private[sql] case class CSVParameters(@transient parameters: Map[String, String]
}
}

// Available compression codec list
val shortCompressionCodecNames = Map(
Contributor


this should go into the object rather than in the case class

@HyukjinKwon
Member Author

Although CSVCompressionCodecs might be shared with the JSON datasource, I will make it shared in a separate PR for JSON.

case e: ClassNotFoundException => None
}
codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
s"is not available. Available codecs are ${shortCompressionCodecNames.keys.mkString(",")}."))
Contributor


add a space after the comma

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49671 has finished for PR 10805 at commit adb9eb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49676 has finished for PR 10805 at commit 6400b76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49679 has finished for PR 10805 at commit 0245eea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 19, 2016

Oh one thing: this doesn't support reading with compression yet, does it?

@HyukjinKwon
Member Author

Oh yes it does. Actually I am reading compressed files in the test I added here.

As you know, it recognises the compression codec by file extension, so if you meant manually setting the compression codec for reading (for files that somehow lack extensions), it does not.
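The extension-based detection being discussed can be sketched conceptually like this. This is an illustration of the idea, not Spark's or Hadoop's actual implementation; the helper name and the extension table are assumptions:

```scala
// Conceptual sketch: infer a compression codec from a file extension,
// the way compressed input files are recognised when reading.
def codecForPath(path: String): Option[String] = {
  val extToCodec = Map(
    ".gz"     -> "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2"    -> "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")
  // Return the codec whose extension the path ends with, if any.
  extToCodec.collectFirst { case (ext, codec) if path.endsWith(ext) => codec }
}
```

A file without a recognised extension yields `None`, which matches the concern above: such files would need the codec specified explicitly.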

@rxin
Contributor

rxin commented Jan 19, 2016

Yeah, I'm thinking we should also support specifying the option for reading, with "auto" as the default, which decides based on extensions.

@HyukjinKwon
Member Author

I see. I will try to figure this out anyway, though I feel this might be a bit too much, as almost all files would have proper extensions; the (almost) only exception might be files initially uploaded by users to a file system.

Maybe I am missing something, though. I don't think users would give files the wrong extensions yet set the compression codec correctly for reading, in particular when they use HDFS, because AFAIK Hadoop supports reading compressed files by extension. I feel like they should just give the files proper extensions.

} catch {
case e: ClassNotFoundException => None
}
codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
Contributor


Should just put this throw inside the catch block and not bother with the Option stuff.
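The suggested simplification could look roughly like this. It is a sketch only: the helper signature is hypothetical, and the codec map passed in here uses stand-in class names rather than the Hadoop codecs from the diff above:

```scala
// Sketch of the reviewer's suggestion: throw directly from the catch
// block instead of wrapping the result in an Option first.
def getCodecClassName(codecName: String,
                      shortCompressionCodecNames: Map[String, String]): String = {
  // Resolve a short name if one matches; otherwise try the name as given.
  val fullName = shortCompressionCodecNames.getOrElse(codecName.toLowerCase, codecName)
  try {
    Class.forName(fullName).getName
  } catch {
    case _: ClassNotFoundException =>
      throw new IllegalArgumentException(s"Codec [$codecName] is not available. " +
        s"Available codecs are ${shortCompressionCodecNames.keys.mkString(", ")}.")
  }
}
```

This removes the `Option`/`getOrElse` indirection while keeping the same error message, including the space after the comma requested earlier.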

@rxin
Contributor

rxin commented Jan 20, 2016

LGTM

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49750 has finished for PR 10805 at commit cd9f742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 20, 2016

I've merged this in master. Thanks.

@asfgit asfgit closed this in 6844d36 Jan 20, 2016
asfgit pushed a commit that referenced this pull request Jan 23, 2016
…c for JSON datasource

https://issues.apache.org/jira/browse/SPARK-12872

This PR makes the JSON datasource able to compress output via an option instead of manually setting Hadoop configurations.
For resolving codecs by name, it is similar to #10805.

As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, `CompressionCodecs`, to allow sharing.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #10858 from HyukjinKwon/SPARK-12872.
falaki pushed a commit to databricks/spark-csv that referenced this pull request Jan 29, 2016
#234
This PR is similar with apache/spark#10805.

This PR adds support for shortened names for compression codecs and adds a `CompressionCodecs` class instead of the implicit function, as its use is not recommended.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #235 from HyukjinKwon/ISSUE-234-shorten-name.
asfgit pushed a commit that referenced this pull request Feb 26, 2016
…ssion codec for TEXT

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13503
This PR makes the TEXT datasource able to compress output via an option instead of manually setting Hadoop configurations.
For resolving codecs by name, it is similar to #10805 and #10858.

## How was this patch tested?

This was tested with unit tests and with `dev/run_tests` for coding style.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11384 from HyukjinKwon/SPARK-13503.
@HyukjinKwon HyukjinKwon deleted the SPARK-12420 branch September 23, 2016 18:28