
Roundtrip null values of any type #147

Closed

Conversation

@andy327 (Contributor) commented Sep 15, 2015

This pull request adds functionality to spark-csv so that null values can be written to a CSV file and read back out again as null. Two changes were made to enable this.

First, the com.databricks.spark.csv package previously hardcoded the null string to "null" when saving to a CSV file. It now reads the null token from the passed-in parameters map, under the key "nullToken", which makes it possible to write null values as empty strings. The default remains "null" to preserve the library's previous behavior.
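
For example, with Spark 1.4+'s DataFrameWriter API (a usage sketch: df and the path are placeholders; the option key is the one this PR reads from the parameters map), nulls can be written as empty strings like so:

    df.write
      .format("com.databricks.spark.csv")
      .option("nullToken", "")   // write null values as empty strings
      .option("header", "true")
      .save("people.csv")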

Second, the castTo method in com.databricks.spark.csv.util.TypeCast had an unreachable case statement for when castType was an instance of StringType. As a result, string values could never be read from a file as null. This pull request adds a setting, treatEmptyValuesAsNulls, that allows empty string values in fields marked as nullable to be read as null, as expected. Again, the previous behavior is the default, so this pull request changes behavior only when treatEmptyValuesAsNulls is explicitly set to true. CsvParser and CsvRelation were updated accordingly to carry the new setting.
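
On the read side, the new flag can be passed the same way (again a sketch; sqlContext and the path are placeholders):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("treatEmptyValuesAsNulls", "true")   // empty nullable fields become null
      .option("header", "true")
      .load("people.csv")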

Additionally, a unit test has been added to CsvSuite that round-trips both string and non-string null values: it writes nulls to a file and verifies they are read back as nulls.

@codecov-io commented:

Current coverage is 86.57%

Merging #147 into master will decrease coverage by 0.33% as of 589c204

@@            master    #147   diff @@
======================================
  Files           10      10       
  Stmts          420     432    +12
  Branches       127     131     +4
  Methods          0       0       
======================================
+ Hit            365     374     +9
  Partial          0       0       
- Missed          55      58     +3

Review entire Coverage Diff as of 589c204


@@ -110,6 +110,14 @@ class DefaultSource
     } else {
       throw new Exception("Ignore white space flag can be true or false")
     }
+    val treatEmptyValuesAsNulls = parameters.getOrElse("treatEmptyValuesAsNulls", "false")
+    val treatEmptyValuesAsNullsFlag = if(treatEmptyValuesAsNulls == "false") {
Member:

Style nit: if (treatEmptyValuesAsNulls ...
Leave a space between if and the opening parenthesis, here and everywhere else in the PR.

Contributor (Author):

Ah, yes. I prefer the same, but I was keeping it consistent with the rest of the file's style. Would you prefer I add the space here, add it everywhere else in the file as well, or leave it unchanged?

Member:

Yes. We just merged a PR that unifies the style across the codebase and adds Scala style checks.

@@ -35,8 +35,9 @@ object TypeCast {
    * @param datum string value
    * @param castType SparkSQL type
    */
-  private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true): Any = {
-    if (datum == "" && nullable && !castType.isInstanceOf[StringType]){
+  private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true,
Member:

style:

castTo(
    datum: String, 
    castType: DataType, 
    nullable: Boolean = true,
    treatEmptyValuesAsNulls: Boolean = false)

Contributor:

Here, I think the binary incompatibility problem is that we don't have something like Spark's MiMa exclude generator to handle the private[csv] excludes.

@falaki (Member) commented Sep 16, 2015

@andy327 I left a few comments. Thanks for submitting this. Would you please rebase it?

@andy327 (Contributor, Author) commented Sep 18, 2015

I brought the branch up to date with commits through yesterday. Tests pass, but the Travis build fails for an unrelated reason: the binary compatibility check.

@JoshRosen (Contributor):

Here's the MiMa error:

[error]  * synthetic method tsvFile$default$6()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$6")
[error]  * synthetic method csvFile$default$11()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$11")
[error]  * method tsvFile(java.lang.String,Boolean,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile")
[error]  * method csvFile(java.lang.String,Boolean,Char,Char,java.lang.Character,java.lang.Character,java.lang.String,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.csvFile")
[error]  * synthetic method tsvFile$default$7()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$7")
[error]  * synthetic method csvFile$default$12()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$12")
[error]  * method castTo(java.lang.String,org.apache.spark.sql.types.DataType,Boolean)java.lang.Object in object com.databricks.spark.csv.util.TypeCast does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.util.TypeCast.castTo")

I'll comment inline re: these errors.
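
For reference, the suggested filters would be registered with the sbt MiMa plugin roughly as follows (a sketch only; the setting key has been renamed across plugin versions, e.g. binaryIssueFilters vs. mimaBinaryIssueFilters, so check the version in use). The filter strings are taken verbatim from the log above:

    import com.typesafe.tools.mima.core._
    import com.typesafe.tools.mima.plugin.MimaKeys._

    // build.sbt: suppress reported issues for APIs internal to spark-csv.
    binaryIssueFilters ++= Seq(
      ProblemFilters.exclude[MissingMethodProblem](
        "com.databricks.spark.csv.util.TypeCast.castTo"),
      ProblemFilters.exclude[IncompatibleResultTypeProblem](
        "com.databricks.spark.csv.package#CsvContext.csvFile$default$11")
    )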

@@ -38,6 +38,7 @@ package object csv {
       parserLib: String = "COMMONS",
       ignoreLeadingWhiteSpace: Boolean = false,
       ignoreTrailingWhiteSpace: Boolean = false,
+      treatEmptyValuesAsNulls: Boolean = false,
Contributor:

The MiMa check failed because adding a new argument to this method changes its Java method signature, and it also risks source incompatibility.

@falaki, for this reason I think that, in retrospect, it would have been better to go with a builder pattern here instead of a big Scala method with default arguments.
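
To make the design point concrete, here is an illustrative sketch (not code from this PR) of why a builder sidesteps the problem: adding an option adds a method rather than changing an existing signature, so previously compiled callers keep linking:

    class CsvParser {
      private var treatEmptyValuesAsNulls: Boolean = false

      // A new option later becomes a new withX method; no signature changes.
      def withTreatEmptyValuesAsNulls(flag: Boolean): CsvParser = {
        this.treatEmptyValuesAsNulls = flag
        this
      }

      // ... other withX setters, plus a terminal method such as
      // csvFile(sqlContext: SQLContext, path: String): DataFrame
    }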

Contributor:

/cc @marmbrus and @rxin as well, since I think this API design issue would be of interest to you.

Member:

This is an auxiliary method that exists just for convenience; users are encouraged to use the builder pattern. For that reason, and to keep MiMa compatibility, @andy327 please remove the new option from csvFile. In the future we may remove this method completely.

@falaki (Member) commented Sep 21, 2015

LGTM modulo style and MiMa check fixes.

@andy327 (Contributor, Author) commented Sep 29, 2015

Unfortunately, there is no way to implement the features in this PR without modifying the signature of castTo in TypeCast.scala. One option would be to add a castTo overload that keeps the original three parameters; otherwise this PR would require a change to the MiMa checks. Let me know which you would prefer. Also, are there any other style fixes left?
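
The overload idea would amount to something like this inside TypeCast (a hypothetical compatibility shim, not code that was merged; the next comment makes it unnecessary):

    // Keep the old three-parameter entry point for binary compatibility and
    // delegate to the new four-parameter method with the flag switched off.
    private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean): Any =
      castTo(datum, castType, nullable, treatEmptyValuesAsNulls = false)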

@JoshRosen (Contributor):

I think it's fine to modify that signature in TypeCast.scala, given that it's an internal API. The real problem is that the MiMa checks, as currently implemented, don't properly respect private[csv]. I could fix that by porting some of Spark's MiMa exclude generation code, but that will take a while. In the meantime, I recommend disabling those MiMa checks in Travis (since we didn't have them before) and filing a follow-up task to re-enable or remove them. To guard against binary incompatibilities, we'll have to remember to run MiMa manually and audit the list by hand before making the next release.
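
Running the check by hand is a single sbt task (an assumption about the task name, which varies by sbt-mima-plugin version; newer releases call it mimaReportBinaryIssues):

    sbt mimaReportBinaryIssues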

@falaki (Member) commented Oct 4, 2015

Thanks. Merging this now.
