
[SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities #15751

Closed
wants to merge 11 commits

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Nov 3, 2016

What changes were proposed in this pull request?

This PR includes the following fixes:

Case 1: read.json(rdd) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val rdd = spark.sparkContext.parallelize(1 to 100).map(i => s"""{"a": "str$i"}""")
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").json(rdd).show()

Case 2: read.json(path) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val path = "/tmp/aa"
spark.sparkContext.parallelize(1 to 100).map(i => s"""{"a": "str$i"}""").saveAsTextFile(path)
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").json(path).show()

Case 3: read.csv(path) - throws an exception before execution for unsupported types.

import org.apache.spark.sql.types._
val path = "/tmp/bb"
spark.sparkContext.parallelize(1 to 100).saveAsTextFile(path)
val schema = new StructType().add("a", CalendarIntervalType)
spark.read.schema(schema).option("mode", "FAILFAST").csv(path).show()

Case 4: read.text(path) - throws an exception before execution for unsupported types rather than printing incorrect values.

import org.apache.spark.sql.types._
val path = "/tmp/cc"
spark.sparkContext.parallelize(1 to 100).saveAsTextFile(path)
val schema = new StructType().add("a", LongType)
spark.read.schema(schema).text(path).show()

Currently, this prints:

+-----------+
|          a|
+-----------+
|68719476738|
|68719476738|
|68719476738|
...

whereas the actual content is:

1
2
3
...

Case 5: from_json(...) - throws an AnalysisException for unsupported types (to_json already throws one).

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", CalendarIntervalType)
df.select(from_json($"value", schema)).collect()

Case 6: write.json(path) (a potential issue) - adds the schema check for writing.

Currently, the only type the JSON writer seems not to support is CalendarIntervalType, and that case already appears to be covered, as shown below:

sql("SELECT interval 1 seconds as a").write.json("tmp/123")
Cannot save interval data type into external storage.;
org.apache.spark.sql.AnalysisException: Cannot save interval data type into external storage.;
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:462)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)

However, it would be safer to add this check the way the other text-based datasources do, since more unsupported types might be added in the future. A rough sketch of such a write-side guard is below.
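
As a rough sketch (hypothetical placement, body abridged; the actual change in the PR may differ), the write-side guard could invoke the same schema check at the start of JsonFileFormat.prepareWrite so the job fails before any task launches:

// Hypothetical sketch inside JsonFileFormat; the surrounding class and the
// writer-factory construction are elided.
override def prepareWrite(
    sparkSession: SparkSession,
    job: Job,
    options: Map[String, String],
    dataSchema: StructType): OutputWriterFactory = {
  // Fail up front, mirroring the read-side check.
  JacksonUtils.verifySchema(dataSchema)
  // ... existing OutputWriterFactory construction ...
}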

In more detail for Cases 1, 2 and 3: there are three parsing modes, FAILFAST, DROPMALFORMED and PERMISSIVE. The original behaviour is:

  • FAILFAST - fails if it meets the unsupported type when parsing

  • DROPMALFORMED - drops records that have non-null values in the unsupported types; otherwise, the value is read as null.

  • PERMISSIVE - reads the values for unsupported types as null.

In the FAILFAST case, we can fail immediately just by checking the schema, before any execution; a sketch of such a check follows.
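
For illustration, here is a minimal, hypothetical sketch of the kind of recursive walk a verifySchema helper can perform; the exact set of rejected types in the PR may differ:

import org.apache.spark.sql.types._

// Hypothetical sketch: walk a schema and reject types that a text-based
// source cannot represent; CalendarIntervalType is the motivating case.
def verifySchema(schema: StructType): Unit = {
  def verifyType(dataType: DataType): Unit = dataType match {
    case CalendarIntervalType | NullType =>
      throw new UnsupportedOperationException(
        s"JSON data source does not support ${dataType.simpleString} data type.")
    case st: StructType => st.fields.foreach(f => verifyType(f.dataType))
    case ArrayType(elementType, _) => verifyType(elementType)
    case MapType(keyType, valueType, _) =>
      verifyType(keyType)
      verifyType(valueType)
    case udt: UserDefinedType[_] => verifyType(udt.sqlType)
    case _ => // atomic types are supported
  }
  schema.fields.foreach(f => verifyType(f.dataType))
}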

How was this patch tested?

Unit tests in JsonSuite, CSVSuite, TextSuite and JsonFunctionsSuite.

override def checkInputDataTypes(): TypeCheckResult = {
  if (StringType.acceptsType(child.dataType)) {
    try {
      JacksonUtils.verifySchema(schema)
Member Author
In this case, we don't have to worry about the parsing mode, because from_json produces null with the default parse mode, FAILFAST.

// mode, it drops records only containing non-null values in unsupported types. We should use
// `requiredSchema` instead of the whole schema `dataSchema` here so as not to break the original
// behaviour.
verifySchema(requiredSchema)
Member Author
Here, it only checks the projected columns so as not to change the existing behaviour (the other columns are not actually checked during parsing anyway); the snippet below illustrates this.
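
For example (hypothetical snippet; the path and data are placeholders), a read should still succeed when the unsupported column is pruned away by the projection:

import org.apache.spark.sql.types._

// The unsupported column `a` is never selected, so only the projected
// (required) schema is verified and the read succeeds.
val schema = new StructType()
  .add("a", CalendarIntervalType) // unsupported, but pruned by the projection
  .add("b", StringType)
spark.read.schema(schema).json("/tmp/example.json").select("b").show()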

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 3, 2016

cc @marmbrus, could you please take a look? This is the schema verification we talked about when adding to_json.

@SparkQA

SparkQA commented Nov 3, 2016

Test build #68067 has finished for PR 15751 at commit 3836698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Let me try to explain in more detail and add some tests for review.

@HyukjinKwon HyukjinKwon changed the title [SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities [WIP][SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities Nov 4, 2016
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities [SPARK-18246][SQL] Throws an exception before execution for unsupported types in Json, CSV and text functionalities Nov 4, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Nov 4, 2016

cc @rxin and @marmbrus. The actual change is mainly just adding JacksonUtils.verifySchema(schema), CSVRelation.verifySchema(schema) and TextFileFormat.verifySchema(schema); the rest is test code. Could you please take a look?

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68106 has finished for PR 15751 at commit d247947.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68104 has finished for PR 15751 at commit 370aee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68105 has finished for PR 15751 at commit 5e87b12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"CSV data source does not support ${dataType.simpleString} data type.")
case _ =>
throw new UnsupportedOperationException(
s"CSV data source does not support ${dataType.simpleString} data type.")
Member Author

CSV currently throws UnsupportedOperationException but the text datasource throws AnalysisException. I just matched this to UnsupportedOperationException; I am happy to switch it to AnalysisException if anyone prefers that (a possible variant is sketched below).
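
For reference, the variant would just swap the exception type in the same match arm (hypothetical fragment, assuming it stays inside the sql package where the AnalysisException constructor is visible):

// Hypothetical variant aligning CSV with the text datasource:
case _ =>
  throw new AnalysisException(
    s"CSV data source does not support ${dataType.simpleString} data type.")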

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68111 has finished for PR 15751 at commit d75c11e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68119 has finished for PR 15751 at commit 9ff247b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68121 has finished for PR 15751 at commit aea65e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68123 has finished for PR 15751 at commit aea65e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented Nov 4, 2016

Thanks for working on this! My meta question, however, is: why don't we just support this instead? We can parse the interval type from strings, and I think it would be really easy to write a converter that produces a string for writes.
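
For context, a minimal sketch of the round trip such a converter would rely on, assuming Spark 2.x's CalendarInterval helpers:

import org.apache.spark.unsafe.types.CalendarInterval

// Parse an interval from its string form and render it back, which is
// roughly what read and write converters for text-based sources would do.
val interval = CalendarInterval.fromString("interval 1 seconds")
val rendered = interval.toString // "interval 1 seconds"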

@HyukjinKwon
Member Author

HyukjinKwon commented Nov 5, 2016

@marmbrus Sure, that makes sense. One thing I immediately worry about, though: IIUC, it seems we would also have to remove the check[1] on CalendarIntervalType for writing, and I am worried about other side effects. I can look into this more deeply and then try to create a JIRA.

Anyway, this PR only makes the datasources throw the exceptions up front and introduces some test cases that did not exist before, so I hope it is okay as it is if you are all fine with that.

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L355-L357

@marmbrus
Contributor

marmbrus commented Nov 8, 2016

Sure, we could possibly merge this first, and you are right, we'd need to remove the checks. I'll try to find some time to look this over, as it's a larger patch and I'm focusing on 2.1 bugs right now. If you have time, you might beat me to the better solution :)

@SparkQA

SparkQA commented Nov 8, 2016

Test build #68316 has finished for PR 15751 at commit bae8db8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 8, 2016

Test build #68326 has finished for PR 15751 at commit bae8db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @marmbrus, is there anything you are worried about that I should take a careful look at?

@HyukjinKwon
Member Author

Do we want this change? If it gets approval, I will rebase; it seems to conflict easily.

@HyukjinKwon
Member Author

I will close this for now and make a new one soon.

@HyukjinKwon HyukjinKwon deleted the SPARK-18246 branch January 2, 2018 03:39