
[SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens #15767

Closed
wants to merge 7 commits into from

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Nov 4, 2016

What changes were proposed in this pull request?

Currently, there are three cases when reading CSV via the datasource in PERMISSIVE parse mode.

  • schema == parsed tokens (from each line)
    No problem to cast the value in the tokens to the field in the schema as they are equal.

  • schema < parsed tokens (from each line)
    It slices the tokens into the number of fields in schema.

  • schema > parsed tokens (from each line)
    It appends null to the parsed tokens so that the values can be safely cast against the schema.

However, when null is appended in the third case, we should take null into account when casting the values.

In case of StringType, it is fine since UTF8String.fromString(datum) produces null when the input is null. Therefore, this case happens only when a schema is explicitly given and it includes data types other than StringType.
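The slice/pad behavior described above can be sketched standalone as follows. This is only an illustration of the three cases; `reconcileTokens` is a hypothetical helper, not Spark's actual internal API:

```scala
object ReconcileDemo {
  // Hypothetical helper mirroring the three cases above: equal lengths pass
  // through, extra tokens are sliced off, and a short row is padded with null
  // so that casting can account for the missing fields.
  def reconcileTokens(tokens: Array[String], schemaSize: Int): Array[String] =
    if (tokens.length >= schemaSize) tokens.take(schemaSize)
    else tokens ++ Array.fill[String](schemaSize - tokens.length)(null)

  def main(args: Array[String]): Unit = {
    // schema > parsed tokens: "1" parses as one token, the schema has two fields
    println(reconcileTokens(Array("1"), 2).mkString("[", ", ", "]"))  // [1, null]
  }
}
```

The padded null is exactly the value that then reaches castTo, which is why the cast path needs to handle null explicitly.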

The code below:

```scala
val path = "/tmp/a"
Seq("1").toDF().write.text(path)
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) :: Nil)
spark.read.schema(schema).option("header", "false").csv(path).show()
```

prints

Before

```
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24)
```

After

```
+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+
```

How was this patch tested?

Unit tests in CSVSuite.scala and CSVTypeCastSuite.scala

@HyukjinKwon HyukjinKwon changed the title [SPARK-18269][SQL] null should be properly read when schema is lager than parsed tokens and types are not string [SPARK-18269][SQL] CSV datasource should read null properly when schema is lager than parsed tokens Nov 4, 2016
@SparkQA

SparkQA commented Nov 4, 2016

Test build #68131 has finished for PR 15767 at commit 3692099.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68132 has finished for PR 15767 at commit e1c58c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68133 has finished for PR 15767 at commit 4132075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -232,7 +232,7 @@ private[csv] object CSVTypeCast {
     nullable: Boolean = true,
     options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    if (datum == null || nullable && datum == options.nullValue) {
```
Member
This might be clearer with parentheses, though I think it's correct. Is it this?

```scala
datum == null || (nullable && datum == options.nullValue)
```
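In Scala, `&&` binds more tightly than `||`, so the two spellings are already equivalent; the parentheses only make that explicit. A minimal standalone sketch (the helper names are illustrative, not Spark's API):

```scala
object PrecedenceDemo {
  // Without parentheses: && binds tighter than ||, so this already parses as
  // datum == null || (nullable && datum == nullValue).
  def isNullNoParens(datum: String, nullable: Boolean, nullValue: String): Boolean =
    datum == null || nullable && datum == nullValue

  // The explicitly parenthesized variant suggested in the review.
  def isNullParens(datum: String, nullable: Boolean, nullValue: String): Boolean =
    datum == null || (nullable && datum == nullValue)

  def main(args: Array[String]): Unit = {
    val inputs = for {
      d <- Seq(null, "", "x")
      n <- Seq(true, false)
    } yield (d, n)
    // The two forms agree on every combination of datum and nullable.
    inputs.foreach { case (d, n) =>
      assert(isNullNoParens(d, n, "") == isNullParens(d, n, ""))
    }
    println("both forms agree")
  }
}
```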

Member Author
Sure, makes sense.

@SparkQA
Copy link

SparkQA commented Nov 5, 2016

Test build #68182 has finished for PR 15767 at commit c0667d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 5, 2016

Can you get rid of those links from the pr description? They become stale immediately after merging this.

```diff
@@ -232,7 +232,8 @@ private[csv] object CSVTypeCast {
     nullable: Boolean = true,
     options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    val isNull = datum == options.nullValue || datum == null
+    if (nullable && isNull) {
```
Contributor
isn't this clearer if you do

```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw some exception saying field is not null but null value found
  }
  null
} else {
  ..
}
```

Contributor
I'd also add a null validation test.

Member Author
Sure, I will.

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68184 has finished for PR 15767 at commit b913eac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw new RuntimeException("null value found but the field is not nullable.")
```
Member
Nit: This could be require(nullable, ...) which would throw a better exception, IllegalArgumentException, too. (Even NPE would be reasonable.) But I don't feel strongly about it.
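For reference, Scala's built-in require throws IllegalArgumentException when its predicate is false, prefixing the supplied message with "requirement failed: ". A minimal standalone illustration:

```scala
object RequireDemo {
  def main(args: Array[String]): Unit = {
    try {
      require(false, "null value found but the field is not nullable.")
    } catch {
      case e: IllegalArgumentException =>
        // Predef.require prefixes the message with "requirement failed: "
        println(e.getMessage)
    }
  }
}
```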

Member Author
Sure, I think this sounds good.

Contributor

@rxin rxin Nov 6, 2016

Sorry I thought more about this - the current error message doesn't give the user a way to know which field is causing the problem. Can you add at least the field name to the error msg?

Member Author

Sure, I can. Will try to neaten this up.

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68198 has finished for PR 15767 at commit e5146e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68205 has finished for PR 15767 at commit e5146e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68206 has finished for PR 15767 at commit 0b15484.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-    if (nullable && datum == options.nullValue) {
+    // datum can be null if the number of fields found is less than the length of the schema
+    if (datum == options.nullValue || datum == null) {
+      require(nullable, "null value found but the field is not nullable.")
```
Contributor

Sorry, it doesn't make sense to throw IllegalArgumentException at runtime here. What's the illegal argument? It is usually used when some parameters are out of bound or invalid. What's happening here is that we found a condition at runtime that invalidates the assumption the user made (that the field is nullable).

Contributor

cc @marmbrus

related to some other topic we talked about - for csv we do throw errors if a field is defined non-nullable by the user, but the data ended up containing nulls.

Member Author

Oh, I thought users set a non-nullable field in the schema inappropriately for data having null values, and that is an illegal argument. FWIW, IllegalArgumentException extends RuntimeException too.
Let me revert it back. I don't feel strongly about this.

Member Author

(Other related PRs with the comment, #15767 (comment), are #15329 and #14124)

Member

@rxin well, the user input is the illegal argument I guess, but I don't feel strongly about it. I was really arguing against RuntimeException which is never really the right thing to throw. NullPointerException? IllegalStateException? take your pick of anything else more specific.

@SparkQA

SparkQA commented Nov 6, 2016

Test build #68222 has finished for PR 15767 at commit e5146e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 6, 2016

Test build #68225 has finished for PR 15767 at commit e5146e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2016

Test build #68236 has finished for PR 15767 at commit aa89171.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 7, 2016

Merging in master/branch-2.1. Thanks.

asfgit pushed a commit that referenced this pull request Nov 7, 2016
…ma is lager than parsed tokens

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15767 from HyukjinKwon/SPARK-18269.

(cherry picked from commit 556a3b7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 556a3b7 Nov 7, 2016
@HyukjinKwon
Member Author

Thank you @rxin!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ma is lager than parsed tokens

Closes apache#15767 from HyukjinKwon/SPARK-18269.
@HyukjinKwon HyukjinKwon deleted the SPARK-18269 branch January 2, 2018 03:43