[SPARK-12871][SQL] Support to specify the option for compression codec. #10805
Conversation
Test build #49592 has finished for PR 10805 at commit
@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

    val nullValue = parameters.getOrElse("nullValue", "")

    val codec = parameters.getOrElse("codec", parameters.getOrElse("compression", null))
I will correct this to look up `compression` first.
Test build #49643 has finished for PR 10805 at commit
@@ -71,6 +71,8 @@ private[sql] case class CSVParameters(parameters: Map[String, String]) extends L

    val nullValue = parameters.getOrElse("nullValue", "")

    val codec = parameters.getOrElse("compression", parameters.getOrElse("codec", null))
For this one, I'd name it internally `compression` or `compressionCodec`, since `codec` can mean a lot of different things.
The other thing is that I'd create short-form names for the common options, e.g. "gzip" should map to `GzipCodec`. You'd need to look into what the commonly supported formats are and come up with their short names. We should also make sure this is case insensitive.
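A minimal sketch of what such a case-insensitive short-name lookup could look like. The particular set of codecs and the object/method names here are illustrative assumptions, not the final list from the PR:

```scala
// Sketch only: short, case-insensitive names mapped to Hadoop codec classes.
// The exact set of supported codecs is an assumption, not the merged list.
object CodecShortNames {
  val shortCompressionCodecNames: Map[String, String] = Map(
    "bzip2" -> "org.apache.hadoop.io.compress.BZip2Codec",
    "gzip" -> "org.apache.hadoop.io.compress.GzipCodec",
    "lz4" -> "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy" -> "org.apache.hadoop.io.compress.SnappyCodec")

  // Case-insensitive lookup; unknown names fall through unchanged so a fully
  // qualified codec class name can still be passed directly.
  def resolve(name: String): String =
    shortCompressionCodecNames.getOrElse(name.toLowerCase, name)
}
```

Falling through on an unknown name keeps the short names purely a convenience layer over the existing class-name option.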
Yup, we are dropping Hadoop 1.x support, so it is OK to have it only for Hadoop 2.x.
I will resolve conflicts and update this soon.
Supported short names for compression codecs are below (case insensitive):
@@ -44,6 +46,13 @@ private[sql] case class CSVParameters(@transient parameters: Map[String, String]
    }
  }

  // Available compression codec list
  val shortCompressionCodecNames = Map(
this should go into the object rather than in the case class
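The suggested placement, roughly sketched: the map lives in an object so it is allocated once rather than per `CSVParameters` instance. The object name and map contents below are illustrative assumptions:

```scala
// Sketch: the codec map is held in an object, not re-created for every
// CSVParameters instance. Codec entries here are illustrative.
object CSVCompressionCodecs {
  val shortCompressionCodecNames: Map[String, String] = Map(
    "gzip" -> "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2" -> "org.apache.hadoop.io.compress.BZip2Codec")
}

case class CSVParameters(parameters: Map[String, String]) {
  // The case class only references the shared map.
  val codecName: Option[String] =
    parameters.get("compression").map { c =>
      CSVCompressionCodecs.shortCompressionCodecNames.getOrElse(c.toLowerCase, c)
    }
}
```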
Although
      case e: ClassNotFoundException => None
    }
    codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
      s"is not available. Available codecs are ${shortCompressionCodecNames.keys.mkString(",")}."))
add a space after the comma
Test build #49671 has finished for PR 10805 at commit
Test build #49676 has finished for PR 10805 at commit
Test build #49679 has finished for PR 10805 at commit
Oh, one thing: this doesn't support reading with compression yet, does it?
Oh yes it does. Actually, I am reading compressed files in the test I added here. As you know, it recognises the compression codec by file extension, so if you meant manually setting the compression codec for reading (for files that somehow lack extensions), it does not.
Yeah, I'm thinking we should also support specifying this option for reading, with "auto" as the default, which decides based on extensions.
I see. I will try to figure this out anyway, though I somehow feel this might be a bit too much, as almost all files would have proper extensions; I think the (almost) only exception might be files initially uploaded by users to a file system. Maybe I am missing something, though. I don't think users would give wrong extensions to files yet set the compression codec for reading correctly, in particular when they use HDFS, because AFAIK Hadoop supports reading compressed files by extension. I feel like they might have to give proper extensions.
    } catch {
      case e: ClassNotFoundException => None
    }
    codecClassName.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] " +
should just put this throw inside the catch block and not bother with the Option stuff
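The suggested simplification, sketched under assumed surrounding names (the method name and message text are illustrative), throws directly from the catch block instead of threading an Option through:

```scala
// Sketch: throw inside the catch block rather than collecting an Option
// and unwrapping it afterwards. Names and message text are illustrative.
def getCodecClass(codecName: String): Class[_] = {
  try {
    Class.forName(codecName)
  } catch {
    case _: ClassNotFoundException =>
      throw new IllegalArgumentException(s"Codec [$codecName] is not available.")
  }
}
```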
LGTM
Test build #49750 has finished for PR 10805 at commit
I've merged this in master. Thanks.
…c for JSON datasource https://issues.apache.org/jira/browse/SPARK-12872 This PR makes the JSON datasource able to compress output by option instead of manually setting Hadoop configurations. For resolving codecs by name, it is similar to #10805. As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, `CompressionCodecs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10858 from HyukjinKwon/SPARK-12872.
#234 This PR is similar to apache/spark#10805. It adds support for short names for compression codecs and adds a `CompressionCodecs` class instead of the implicit function, as its use is not recommended. Author: hyukjinkwon <gurwls223@gmail.com> Closes #235 from HyukjinKwon/ISSUE-234-shorten-name.
…ssion codec for TEXT ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13503 This PR makes the TEXT datasource able to compress output by option instead of manually setting Hadoop configurations. For resolving codecs by name, it is similar to #10805 and #10858. ## How was this patch tested? This was tested with unit tests and with `dev/run_tests` for coding style. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11384 from HyukjinKwon/SPARK-13503.
https://issues.apache.org/jira/browse/SPARK-12871
This PR adds an option to specify a compression codec. It adds the option `codec` as an alias of `compression`, as filed in SPARK-12668. Note that I did not add configurations for Hadoop 1.x, as this `CsvRelation` is using the Hadoop 2.x API and I guess Spark is going to drop Hadoop 1.x support.
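With the option merged, writing compressed CSV output could look roughly like the following. This is a non-runnable illustration, not code from the PR: the `df` DataFrame, the output path, and the exact option key (`compression`, with `codec` as an alias per the description above) are assumptions based on the discussion:

```scala
// Illustrative sketch, assuming an existing DataFrame `df`:
// a short codec name is resolved to its Hadoop codec class (e.g. GzipCodec)
// and applied to the written files, instead of setting Hadoop configs by hand.
df.write
  .format("csv")
  .option("compression", "gzip") // "codec" is described as an alias
  .save("/tmp/output")
```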