[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data #16928

Closed

maropu wants to merge 12 commits into apache:master from maropu:SPARK-18699-2

5 participants

maropu commented Feb 14, 2017

What changes were proposed in this pull request?

This PR adds logic to put malformed tokens into a new field when parsing CSV data in permissive mode. In the current master, if the CSV parser hits one of these malformed tokens, it throws the exception below (and the job fails):

Caused by: java.lang.IllegalArgumentException
	at java.sql.Date.valueOf(Date.java:143)
	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
	at scala.util.Try.getOrElse(Try.scala:79)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
	at 

When users load large CSV-formatted data, this kind of job failure can be confusing. So this fix sets NULL for the original columns and puts the malformed tokens into a new field.

How was this patch tested?

Added tests in CSVSuite.
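
For illustration, here is a minimal spark-shell sketch of the intended PERMISSIVE behaviour (the path, sample data, and column names below are made up for this example; the corrupt-record column has to be declared in the user-specified schema):

import org.apache.spark.sql.types._

// Hypothetical sample file: the second line has a value that cannot be cast to DateType.
Seq("2017-02-14,1", "not-a-date,2").toDF().write.text("/tmp/csv_example")

val schema = StructType(
  StructField("d", DateType, true) ::
  StructField("i", IntegerType, true) ::
  StructField("_corrupt_record", StringType, true) :: Nil)

// Instead of failing the whole job, the malformed line is expected to show up in
// _corrupt_record, with the original columns set to null for that row.
spark.read.schema(schema)
  .option("mode", "PERMISSIVE")
  .csv("/tmp/csv_example")
  .show()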

SparkQA commented Feb 14, 2017

Test build #72877 has finished for PR 16928 at commit 266819a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
maropu commented Feb 15, 2017

@HyukjinKwon Could you check this and give me any insight before committers do?

HyukjinKwon commented Feb 15, 2017

@maropu, I just ran some similar tests with the JSON datasource. What do you think about matching it to JSON's behaviour by introducing columnNameOfCorruptRecord?

I ran it with the data and schema below:

Seq("""{"a": "a", "b" : 1}""").toDF().write.text("/tmp/path")
val schema = StructType(StructField("a", IntegerType, true) :: StructField("b", StringType, true) :: StructField("_corrupt_record", StringType, true) :: Nil)

FAILFAST

scala> spark.read.schema(schema).option("mode", "FAILFAST").json("/tmp/path").show()
org.apache.spark.sql.catalyst.json.SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {"a": "a", "b" : 1}

DROPMALFORMED

scala> spark.read.schema(schema).option("mode", "DROPMALFORMED").json("/tmp/path").show()
+---+---+---------------+
|  a|  b|_corrupt_record|
+---+---+---------------+
+---+---+---------------+

PERMISSIVE

scala> spark.read.schema(schema).option("mode", "PERMISSIVE").json("/tmp/path").show()
+----+----+-------------------+
|   a|   b|    _corrupt_record|
+----+----+-------------------+
|null|null|{"a": "a", "b" : 1}|
+----+----+-------------------+
maropu commented Feb 15, 2017

Aha, looks good to me. Just a sec, and I'll modify the code.

@maropu maropu changed the title from [SPARK-18699][SQL] Fill NULL in a field when detecting a malformed token to [SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data Feb 16, 2017

HyukjinKwon commented Feb 16, 2017

cc @cloud-fan, could I ask if you think matching it seems reasonable? I wrote some details below for you to track this issue easily.

Currently, JSON produces

Seq("""{"a": "a", "b" : 1}""").toDF().write.text("/tmp/path")
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", StringType, true) :: 
  StructField("_corrupt_record", StringType, true) :: Nil)
spark.read.schema(schema)
  .option("mode", "PERMISSIVE")
  .json("/tmp/path").show()
+----+----+-------------------+
|   a|   b|    _corrupt_record|
+----+----+-------------------+
|null|null|{"a": "a", "b" : 1}|
+----+----+-------------------+

whereas CSV produces

Seq("""a,1""").toDF().write.text("/tmp/path")
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", StringType, true) :: Nil)
spark.read.schema(schema)
  .option("mode", "PERMISSIVE")
  .csv("/tmp/path").show()
java.lang.NumberFormatException: For input string: "a"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)

To cut it short, the problem arises when parsing itself is fine but a value cannot be converted. In the case of JSON, the value is put into the column specified in columnNameOfCorruptRecord, whereas CSV throws an exception even in PERMISSIVE mode.

It seems there are two ways to fix this.

One is what @maropu initially suggested: permissively fill it with null. My worry here is losing the data and being inconsistent with JSON's parse mode.

The other is what I suggested here: match JSON's behaviour and store the value in the column specified in columnNameOfCorruptRecord. However, note that I guess we will not get this column while inferring a CSV schema in most cases, as I can't imagine a case where the CSV itself is malformed.

JSON schema inference produces this column when the JSON is malformed, as shown below:

Seq("""{"a": "a", "b" :""").toDF().write.text("/tmp/test123")
spark.read.json("/tmp/test123").printSchema
root
 |-- _corrupt_record: string (nullable = true)
HyukjinKwon commented Feb 16, 2017

Let me try my best to review this further once the approach is decided.

maropu commented Feb 16, 2017

I also keep considering other ways to fix this...

SparkQA commented Feb 16, 2017

Test build #72971 has finished for PR 16928 at commit f09a899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 16, 2017

Test build #72969 has finished for PR 16928 at commit 5486f5d.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 16, 2017

Test build #72991 has finished for PR 16928 at commit df39e39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
maropu commented Feb 16, 2017

Jenkins, retest this please.

SparkQA commented Feb 16, 2017

Test build #72993 has finished for PR 16928 at commit df39e39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
cloud-fan commented Feb 16, 2017

definitely we should match the behavior of json

HyukjinKwon commented Feb 16, 2017

Thanks, let me review further by tomorrow.

maropu commented Feb 18, 2017

@HyukjinKwon okay, thanks! I'll check soon

SparkQA commented Feb 18, 2017

Test build #73109 has finished for PR 16928 at commit 873a383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@HyukjinKwon

@maropu, I left some more comments that might be helpful. Let me take a further look in the following few days.

val parsedOptions = new CSVOptions(
  options,
  sparkSession.sessionState.conf.sessionLocalTimeZone,
  sparkSession.sessionState.conf.columnNameOfCorruptRecord)

HyukjinKwon Feb 19, 2017

(It seems CSVOptions is created twice above :)).

maropu Feb 20, 2017

Fixed

@@ -95,6 +104,9 @@ private[csv] class CSVOptions(
  val dropMalformed = ParseModes.isDropMalformedMode(parseMode)
  val permissive = ParseModes.isPermissiveMode(parseMode)
  val columnNameOfCorruptRecord =
    parameters.getOrElse("columnNameOfCorruptRecord", defaultColumnNameOfCorruptRecord)
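
For reference, a minimal sketch of how a caller could override this option per read (the column name `_malformed` and the path are hypothetical; the column still has to be declared in the supplied schema):

import org.apache.spark.sql.types._

val mySchema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("_malformed", StringType, true) :: Nil)

// The per-read option takes precedence over the default corrupt-record column name.
spark.read.schema(mySchema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_malformed")
  .csv("/tmp/path")
  .show()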

HyukjinKwon Feb 19, 2017

Maybe, we should add this in readwriter.py too and document this in readwriter.py, DataFrameReader and DataStreamReader.

maropu Feb 20, 2017

Added doc descriptions in readwriter.py, DataFrameReader, and DataStreamReader.

maropu commented Feb 19, 2017

I'll update in a day, thanks!

SparkQA commented Feb 20, 2017

Test build #73166 has finished for PR 16928 at commit bf286c7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 20, 2017

Test build #73167 has finished for PR 16928 at commit 4df4bc6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 20, 2017

Test build #73168 has finished for PR 16928 at commit fde8cc7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 20, 2017

Test build #73169 has finished for PR 16928 at commit 9b9d043.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 20, 2017

Test build #73170 has finished for PR 16928 at commit 1956c63.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 20, 2017

Test build #73172 has finished for PR 16928 at commit 2fd9275.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 23, 2017

Test build #73322 has finished for PR 16928 at commit c86febe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 23, 2017

Test build #73328 has started for PR 16928 at commit 8d9386a.

SparkQA commented Feb 23, 2017

Test build #73331 has started for PR 16928 at commit 512fb42.

maropu commented Feb 23, 2017

Jenkins, retest this please.

require(schema(corrFieldIndex).nullable)
}
private val dataSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))

HyukjinKwon Feb 23, 2017

I just realised we only use the length of dataSchema now. Could we just use the length if more commits should be pushed?

maropu Feb 23, 2017

ok, I'll update

maropu Feb 23, 2017

I reverted some parts of the code, and dataSchema is now used for more than just its length (https://github.com/apache/spark/pull/16928/files#diff-d19881aceddcaa5c60620fdcda99b4c4R57), so I kept this variable as it is.

SparkQA commented Feb 23, 2017

Test build #73335 has finished for PR 16928 at commit 512fb42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 23, 2017

Test build #73348 has finished for PR 16928 at commit 3d514e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Feb 23, 2017

Test build #73353 has finished for PR 16928 at commit a58ff1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
cloud-fan commented Feb 23, 2017

thanks, merging to master!

asfgit closed this in 09ed6e7 on Feb 23, 2017

Yunni added a commit to Yunni/spark that referenced this pull request Feb 27, 2017

[SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in json formats

## What changes were proposed in this pull request?
This pr comes from apache#16928 and fixed a json behaviour along with the CSV one.

## How was this patch tested?
Added tests in `JsonSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17023 from maropu/SPARK-19695.

Yunni added a commit to Yunni/spark that referenced this pull request Feb 27, 2017

[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data

## What changes were proposed in this pull request?
This pr added a logic to put malformed tokens into a new field when parsing CSV data  in case of permissive modes. In the current master, if the CSV parser hits these malformed ones, it throws an exception below (and then a job fails);
```
Caused by: java.lang.IllegalArgumentException
	at java.sql.Date.valueOf(Date.java:143)
	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
	at scala.util.Try.getOrElse(Try.scala:79)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
	at
```
In case that users load large CSV-formatted data, the job failure makes users get some confused. So, this fix set NULL for original columns and put malformed tokens in a new field.

## How was this patch tested?
Added tests in `CSVSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#16928 from maropu/SPARK-18699-2.
a string type field named ``columnNameOfCorruptRecord`` in an user-defined \
schema. If a schema does not have the field, it drops corrupt records during \
parsing. When inferring a schema, it implicitly adds a \
``columnNameOfCorruptRecord`` field in an output schema.

gatorsmile Jun 4, 2017

@maropu For JSON, we implicitly add the columnNameOfCorruptRecord field during schema inference, when the mode is PERMISSIVE. What is the reason we are not doing the same thing for CSV schema inference?

HyukjinKwon Jun 4, 2017

(Sorry for interrupting) Yeah, it should be consistent and we probably should change it. We should probably also consider records with fewer or more tokens than the schema as malformed records in PERMISSIVE mode, rather than filling in some of them. @cloud-fan raised this issue before and I had a talk with some data analysts. It looked like some agree and others do not, so I just decided not to change the current behaviour for now.

To cut it short, the reason (I assume) is that I could not imagine a simple, common case where parsing the CSV itself fails (as opposed to conversion) in the current implementation. If there is one, we should match the behaviour.

I am currently outside and this is my phone. I will double-check this when I get to my computer, but this should be correct unless I have missed some changes in this code path.


gatorsmile Jun 4, 2017

In CSV, records with fewer or more tokens than the schema are already viewed as malformed records in (at least) 2.2. I did not check the previous versions.

I think we need to implicitly add the column columnNameOfCorruptRecord during schema inference too.
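
For illustration, a minimal sketch of what implicitly appending the column during CSV schema inference could look like (a hypothetical helper, not code from this PR):

import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Append the corrupt-record column to an inferred schema if it is not already present,
// mirroring what JSON schema inference does for malformed records.
def withCorruptRecordField(inferred: StructType, corruptFieldName: String): StructType = {
  if (inferred.fieldNames.contains(corruptFieldName)) {
    inferred
  } else {
    inferred.add(StructField(corruptFieldName, StringType, nullable = true))
  }
}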


gatorsmile Jun 4, 2017

There is more than one issue here. The default of columnNameOfCorruptRecord does not respect the session conf spark.sql.columnNameOfCorruptRecord.


gatorsmile Jun 4, 2017

Will submit a PR soon for fixing both issues.


gatorsmile Jun 5, 2017

Right now, users have to manually add the columnNameOfCorruptRecord column to see these malformed records.


HyukjinKwon Jun 5, 2017

@gatorsmile, I just got to my laptop.

I checked that when there are more tokens than the schema, it fills the malformed column, with the data below:

a,a

(BTW, it looks like it respects spark.sql.columnNameOfCorruptRecord?)

scala> spark.read.schema("a string, _corrupt_record string").csv("test.csv").show()
+---+---------------+
|  a|_corrupt_record|
+---+---------------+
|  a|            a,a|
+---+---------------+
scala> spark.conf.set("spark.sql.columnNameOfCorruptRecord", "abc")

scala> spark.read.schema("a string, abc string").csv("test.csv").show()
+---+---+
|  a|abc|
+---+---+
|  a|a,a|
+---+---+

And I found another bug (when there are fewer tokens than the schema):

with data

a
a
a
a
a
scala> spark.read.schema("a string, b string, _corrupt_record string").csv("test.csv").show()

prints ...

17/06/05 09:45:26 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.NullPointerException
	at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
	at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
	at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:50)
	at org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:43)
	at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:64)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:236)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:230)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

It looks like getCurrentInput produces null because the input has all been parsed.

Another thing I would like to note (just so we don't forget the difference): JSON produces null in the columns and puts the contents in the malformed column.
With the input:

{"a": 1, "b": "a"}
scala> spark.read.json("test.json").show()
+---+---+
|  a|  b|
+---+---+
|  1|  a|
+---+---+
scala> spark.read.schema("a string, b int, _corrupt_record string").json("test.json").show()
+----+----+------------------+
|   a|   b|   _corrupt_record|
+----+----+------------------+
|null|null|{"a": 1, "b": "a"}|
+----+----+------------------+

HyukjinKwon Jun 5, 2017

Oh.. I was writing my comments before seeing yours... Yes, I agree with your comments.


HyukjinKwon Jun 5, 2017

Let me take a shot at fixing the bug I found above (NullPointerException). I think it can be fixed easily (though I am pretty sure the behaviour could be arguable). I will open a PR and cc you to show what it looks like.
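
For illustration only, a rough sketch of the kind of null guard being discussed (assuming getCurrentInput reads the current parsed content from the univocity tokenizer, as the stack trace above suggests; this is not necessarily the fix that was eventually merged):

// In UnivocityParser: currentParsedContent() can return null once the input has been
// fully consumed, so guard it before calling stripLineEnd.
private def getCurrentInput: UTF8String = {
  val content = tokenizer.getContext.currentParsedContent()
  if (content == null) null else UTF8String.fromString(content.stripLineEnd)
}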


maropu Jun 5, 2017

Sorry for my late response. Yeah, I also think these behaviours should be the same. I tried in this PR, but I couldn't, because (as you both already noticed) we can't easily add a new column in the CSV code path. So, I think we probably need some refactoring of DataSource to make this behaviour consistent.


ThySinner pushed a commit to ThySinner/spark that referenced this pull request Jun 9, 2017

[SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in json formats

ThySinner pushed a commit to ThySinner/spark that referenced this pull request Jun 9, 2017

[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data

maropu deleted the maropu:SPARK-18699-2 branch on Jul 5, 2017
