
Roundtrip null values of any type #147

Closed

Conversation

@andy327 (Contributor) commented Sep 15, 2015

This pull request adds functionality to spark-csv so that null values can be written to a CSV file and read back out again as null. Two changes were made to enable this.

First, the com.databricks.spark.csv package previously hardcoded the null string to "null" when saving to a CSV file. It now reads the null token from the passed-in parameters map, under the key "nullToken", which makes it possible to write null values as empty strings. The default remains "null" to preserve the library's previous behavior.
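
For example, with Spark 1.4+'s DataFrameWriter API (a usage sketch: df and the path are placeholders; the option key is the one this PR reads from the parameters map), nulls can be written as empty strings like so:

    df.write
      .format("com.databricks.spark.csv")
      .option("nullToken", "")   // write null values as empty strings
      .option("header", "true")
      .save("people.csv")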

Second, the castTo method in com.databricks.spark.csv.util.TypeCast had an unreachable case statement for when castType was an instance of StringType. As a result, string values could never be read from a file as null. This pull request adds a setting, treatEmptyValuesAsNulls, that allows empty string values in fields marked as nullable to be read as null, as expected. Again, the previous behavior is the default, so this pull request changes behavior only when treatEmptyValuesAsNulls is explicitly set to true. CsvParser and CsvRelation were updated accordingly to carry the new setting.
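
On the read side, the new flag can be passed the same way (again a sketch; sqlContext and the path are placeholders):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("treatEmptyValuesAsNulls", "true")   // empty nullable fields become null
      .option("header", "true")
      .load("people.csv")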

Additionally, a unit test has been added to CsvSuite that round-trips both string and non-string null values: it writes nulls to a file and verifies they are read back as nulls.

@codecov-io commented:

Current coverage is 86.57%

Merging #147 into master will decrease coverage by 0.33% as of 589c204

@@            master    #147   diff @@
======================================
  Files           10      10       
  Stmts          420     432    +12
  Branches       127     131     +4
  Methods          0       0       
======================================
+ Hit            365     374     +9
  Partial          0       0       
- Missed          55      58     +3

Review entire Coverage Diff as of 589c204


@@ -110,6 +110,14 @@ class DefaultSource
     } else {
       throw new Exception("Ignore white space flag can be true or false")
     }
+    val treatEmptyValuesAsNulls = parameters.getOrElse("treatEmptyValuesAsNulls", "false")
+    val treatEmptyValuesAsNullsFlag = if(treatEmptyValuesAsNulls == "false") {
Member:

Style nit: if (treatEmptyValuesAsNulls ...
Leave a space between if and the opening parenthesis, here and everywhere else in the PR.

Contributor (Author):

Ah, yes. I prefer the same, but I was keeping it consistent with the rest of the file's style. Would you prefer I add the space here, add it everywhere else in the file as well, or leave it unchanged?

Member:

Yes. We just merged a PR that unifies the style across the codebase and adds Scala style checks.

@@ -35,8 +35,9 @@ object TypeCast {
    * @param datum string value
    * @param castType SparkSQL type
    */
-  private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true): Any = {
-    if (datum == "" && nullable && !castType.isInstanceOf[StringType]){
+  private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true,
Member:

style:

castTo(
    datum: String, 
    castType: DataType, 
    nullable: Boolean = true,
    treatEmptyValuesAsNulls: Boolean = false)

Contributor:

Here, I think the binary incompatibility problem is that we don't have something like Spark's MiMa exclude generator to handle the private[csv] excludes.

@falaki (Member) commented Sep 16, 2015

@andy327 I left a few comments. Thanks for submitting this. Would you please rebase it?

@andy327 (Contributor, Author) commented Sep 18, 2015

I brought the branch up to date with commits through yesterday. Tests pass, but the Travis build fails for an unrelated reason: the binary compatibility check.

@JoshRosen (Contributor):

Here's the MiMa error:

[error]  * synthetic method tsvFile$default$6()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$6")
[error]  * synthetic method csvFile$default$11()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$11")
[error]  * method tsvFile(java.lang.String,Boolean,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile")
[error]  * method csvFile(java.lang.String,Boolean,Char,Char,java.lang.Character,java.lang.Character,java.lang.String,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.csvFile")
[error]  * synthetic method tsvFile$default$7()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$7")
[error]  * synthetic method csvFile$default$12()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$12")
[error]  * method castTo(java.lang.String,org.apache.spark.sql.types.DataType,Boolean)java.lang.Object in object com.databricks.spark.csv.util.TypeCast does not have a correspondent in new version
[error]    filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.util.TypeCast.castTo")

I'll comment inline re: these errors.
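
For reference, the suggested filters would be registered with the sbt MiMa plugin roughly as follows (a sketch only; the setting key has been renamed across plugin versions, e.g. binaryIssueFilters vs. mimaBinaryIssueFilters, so check the version in use). The filter strings are taken verbatim from the log above:

    import com.typesafe.tools.mima.core._
    import com.typesafe.tools.mima.plugin.MimaKeys._

    // build.sbt: suppress reported issues for APIs internal to spark-csv.
    binaryIssueFilters ++= Seq(
      ProblemFilters.exclude[MissingMethodProblem](
        "com.databricks.spark.csv.util.TypeCast.castTo"),
      ProblemFilters.exclude[IncompatibleResultTypeProblem](
        "com.databricks.spark.csv.package#CsvContext.csvFile$default$11")
    )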

@@ -38,6 +38,7 @@ package object csv {
       parserLib: String = "COMMONS",
       ignoreLeadingWhiteSpace: Boolean = false,
       ignoreTrailingWhiteSpace: Boolean = false,
+      treatEmptyValuesAsNulls: Boolean = false,
Contributor:

The MiMa check failed because adding a new argument to this method changes its Java method signature, and it also risks source incompatibility.

@falaki, for this reason I think that, in retrospect, it would have been better to go with a builder pattern here instead of a big Scala method with default arguments.
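
To make the design point concrete, here is an illustrative sketch (not code from this PR) of why a builder sidesteps the problem: adding an option adds a method rather than changing an existing signature, so previously compiled callers keep linking:

    class CsvParser {
      private var treatEmptyValuesAsNulls: Boolean = false

      // A new option later becomes a new withX method; no signature changes.
      def withTreatEmptyValuesAsNulls(flag: Boolean): CsvParser = {
        this.treatEmptyValuesAsNulls = flag
        this
      }

      // ... other withX setters, plus a terminal method such as
      // csvFile(sqlContext: SQLContext, path: String): DataFrame
    }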

Contributor:

/cc @marmbrus and @rxin as well, since I think this API design issue would be of interest to you.

Member:

This is an auxiliary method that exists just for convenience; users are encouraged to use the builder pattern. For that reason, and to keep MiMa compatibility, @andy327 please remove the new option from csvFile. In the future we may remove this method completely.

@falaki (Member) commented Sep 21, 2015

LGTM modulo style and MiMa check fixes.

@andy327 (Contributor, Author) commented Sep 29, 2015

Unfortunately, there is no way to implement the features in this PR without modifying the signature of castTo in TypeCast.scala. One option would be to add a castTo overload that keeps the original three parameters; otherwise this PR would require a change to the MiMa checks. Let me know which you would prefer. Also, are there any other style fixes left?
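
The overload idea would amount to something like this inside TypeCast (a hypothetical compatibility shim, not code that was merged; the next comment makes it unnecessary):

    // Keep the old three-parameter entry point for binary compatibility and
    // delegate to the new four-parameter method with the flag switched off.
    private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean): Any =
      castTo(datum, castType, nullable, treatEmptyValuesAsNulls = false)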

@JoshRosen (Contributor):

I think it's fine to modify that signature in TypeCast.scala, given that it's an internal API. The real problem is that the MiMa checks, as currently implemented, don't properly respect private[csv]. I could fix that by porting some of Spark's MiMa exclude generation code, but that will take a while. In the meantime, I recommend disabling those MiMa checks in Travis (since we didn't have them before) and filing a follow-up task to re-enable or remove them. To guard against binary incompatibilities, we'll have to remember to run MiMa manually and audit the list by hand before making the next release.
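
Running the check by hand is a single sbt task (an assumption about the task name, which varies by sbt-mima-plugin version; newer releases call it mimaReportBinaryIssues):

    sbt mimaReportBinaryIssues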

@falaki (Member) commented Oct 4, 2015

Thanks. Merging this now.
