
[SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities #15751

Closed
wants to merge 11 commits

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Nov 3, 2016

What changes were proposed in this pull request?

This PR includes the following fixes:

Case 1: read.json(rdd) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val rdd = spark.sparkContext.parallelize(1 to 100).map(i => s"""{"a": "str$i"}""")
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").json(rdd).show()

Case 2: read.json(path) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val path = "/tmp/aa"
spark.sparkContext.parallelize(1 to 100).map(i => s"""{"a": "str$i"}""").saveAsTextFile(path)
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").json(path).show()

Case 3: read.csv(path) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val path = "/tmp/bb"
spark.sparkContext.parallelize(1 to 100).saveAsTextFile(path)
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").csv(path).show()

Case 4: read.text(path) - throws an exception before execution for unsupported types rather than printing incorrect values.

import org.apache.spark.sql.types._
val path = "/tmp/cc"
spark.sparkContext.parallelize(1 to 100).saveAsTextFile(path)
val schema = new StructType().add("a", LongType)
spark.read.schema(schema).text(path).show()

Currently, this prints:

+-----------+
|          a|
+-----------+
|68719476738|
|68719476738|
|68719476738|
...

whereas the actual content is:

1
2
3
...

Case 5: from_json(...) - throws an AnalysisException for unsupported types (to_json already throws one).

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", CalendarIntervalType)
df.select(from_json($"value", schema)).collect()

Case 6: write.json(path) (a potential issue) - adds the schema check for writing.

Currently, the only type the JSON writer seems not to support is CalendarIntervalType, and that case already appears to be covered, as shown below:

sql("SELECT interval 1 seconds as a").write.json("tmp/123")
Cannot save interval data type into external storage.;
org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.;
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:462)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)

However, it would be safer to add this check the way the other text-based datasources do, since more unsupported types might be added in the future. A rough sketch of such a write-side guard is below.
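
As a rough sketch (hypothetical placement, body abridged; the actual change in the PR may differ), the write-side guard could invoke the same schema check at the start of JsonFileFormat.prepareWrite so the job fails before any task launches:

// Hypothetical sketch inside JsonFileFormat; the surrounding class and the
// writer-factory construction are elided.
override def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType): OutputWriterFactory = {
  // Fail up front, mirroring the read-side check.
  JacksonUtils.verifySchema(dataSchema)
  // ... existing OutputWriterFactory construction ...
}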

In more detail for Cases 1, 2 and 3: there are three parsing modes, FAILFAST, DROPMALFORMED and PERMISSIVE. The original behaviour is:

  • FAILFAST - fails if it meets the unsupported type when parsing

  • DROPMALFORMED - drops records that have non-null values in the unsupported types; otherwise, the value is read as null.

  • PERMISSIVE - reads the values for unsupported types as null.

In the FAILFAST case, we can fail immediately just by checking the schema, before any execution; a sketch of such a check follows.
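
For illustration, here is a minimal, hypothetical sketch of the kind of recursive walk a verifySchema helper can perform; the exact set of rejected types in the PR may differ:

import org.apache.spark.sql.types._

// Hypothetical sketch: walk a schema and reject types that a text-based
// source cannot represent; CalendarIntervalType is the motivating case.
def verifySchema(schema: StructType): Unit = {
  def verifyType(dataType: DataType): Unit = dataType match {
    case CalendarIntervalType | NullType =>
      throw new UnsupportedOperationException(
        s"JSON data source does not support ${dataType.simpleString} data type.")
    case st: StructType => st.fields.foreach(f => verifyType(f.dataType))
    case ArrayType(elementType, _) => verifyType(elementType)
    case MapType(keyType, valueType, _) =>
      verifyType(keyType)
      verifyType(valueType)
    case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    case _ => // atomic types are supported
  }
  schema.fields.foreach(f => verifyType(f.dataType))
}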

How was this patch tested?

Unit tests in JsonSuite, CSVSuite, TextSuite and JsonFunctionsSuite.

override def checkInputDataTypes(): TypeCheckResult = {
  if (StringType.acceptsType(child.dataType)) {
    try {
      JacksonUtils.verifySchema(schema)
Member Author
In this case, we don't have to worry about the parsing mode, because from_json produces null with the default parse mode, FAILFAST.

// mode, it drops records only containing non-null values in unsupported types. We should use
// `requiredSchema` instead of the whole schema `dataSchema` here so as not to break the original
// behaviour.
verifySchema(requiredSchema)
Member Author
Here, it only checks the projected columns so as not to change the existing behaviour (the other columns are not actually checked during parsing anyway); the snippet below illustrates this.
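
For example (hypothetical snippet; the path and data are placeholders), a read should still succeed when the unsupported column is pruned away by the projection:

import org.apache.spark.sql.types._

// The unsupported column `a` is never selected, so only the projected
// (required) schema is verified and the read succeeds.
val schema = new StructType()
  .add("a", CalendarIntervalType) // unsupported, but pruned by the projection
  .add("b", StringType)
spark.read.schema(schema).json("/tmp/example.json").select("b").show()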

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 3, 2016

cc @marmbrus, could you please take a look? This is the schema verification we talked about when adding to_json.

@SparkQA

SparkQA commented Nov 3, 2016

Test build #68067 has finished for PR 15751 at commit 3836698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Let me try to explain in more detail and add some tests for review.

@HyukjinKwon HyukjinKwon changed the title [SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities [WIP][SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities Nov 4, 2016
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities [SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities Nov 4, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Nov 4, 2016

cc @rxin and @marmbrus. The actual change is mainly just adding JacksonUtils.verifySchema(schema), CSVRelation.verifySchema(schema) and TextFileFormat.verifySchema(schema); the rest is test code. Could you please take a look?

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68106 has finished for PR 15751 at commit d247947.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68104 has finished for PR 15751 at commit 370aee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68105 has finished for PR 15751 at commit 5e87b12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"CSV data source does not support ${dataType.simpleString} data type.")
case _ =>
throw new UnsupportedOperationException(
s"CSV data source does not support ${dataType.simpleString} data type.")
Member Author

CSV currently throws UnsupportedOperationException but the text datasource throws AnalysisException. I just matched this to UnsupportedOperationException; I am happy to switch it to AnalysisException if anyone prefers that (a possible variant is sketched below).
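
For reference, the variant would just swap the exception type in the same match arm (hypothetical fragment, assuming it stays inside the sql package where the AnalysisException constructor is visible):

// Hypothetical variant aligning CSV with the text datasource:
case _ =>
  throw new AnalysisException(
    s"CSV data source does not support ${dataType.simpleString} data type.")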

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68111 has finished for PR 15751 at commit d75c11e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68119 has finished for PR 15751 at commit 9ff247b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68121 has finished for PR 15751 at commit aea65e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68123 has finished for PR 15751 at commit aea65e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented Nov 4, 2016

Thanks for working on this! My meta question, however, is: why don't we just support this instead? We can parse the interval type from strings, and I think it would be really easy to write a converter that produces a string for writes.
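
For context, a minimal sketch of the round trip such a converter would rely on, assuming Spark 2.x's CalendarInterval helpers:

import org.apache.spark.unsafe.types.CalendarInterval

// Parse an interval from its string form and render it back, which is
// roughly what read and write converters for text-based sources would do.
val interval = CalendarInterval.fromString("interval 1 seconds")
val rendered = interval.toString // "interval 1 seconds"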

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 5, 2016

@marmbrus Sure, that makes sense. One thing I immediately worry about, though: IIUC, it seems we would also have to remove the check[1] on CalendarIntervalType for writing, and I am worried about other side effects. I can look into this more deeply and then try to create a JIRA.

Anyway, this PR only makes the datasources throw the exceptions up front and introduces some test cases that did not exist before, so I hope it is okay as it is if you are all fine with that.

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L355-L357

@marmbrus
Contributor

marmbrus commented Nov 8, 2016

Sure, we could possibly merge this first, and you are right, we'd need to remove the checks. I'll try to find some time to look this over, as it's a larger patch and I'm focusing on 2.1 bugs right now. If you have time, you might beat me to the better solution :)

@SparkQA

SparkQA commented Nov 8, 2016

Test build #68316 has finished for PR 15751 at commit bae8db8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 8, 2016

Test build #68326 has finished for PR 15751 at commit bae8db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @marmbrus, is there anything you are worried about that I should take a careful look at?

@HyukjinKwon
Member Author

Do we want this change? If it gets approval, I will rebase; it seems to conflict easily.

@HyukjinKwon
Member Author

I will close this for now and make a new one soon.

@HyukjinKwon HyukjinKwon deleted the SPARK-18246 branch January 2, 2018 03:39