[SPARK-24691][SQL] Dispatch the type support check in FileFormat implementation #21667
Conversation
```diff
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
     sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
 assert(msg.contains("ORC data source does not support calendarinterval data type."))

 // read path
```
In read path, ORC should support `CalendarIntervalType` and `NullType`.
Is there any read path test already?
@HyukjinKwon No, the unit test is about unsupported data types, and ORC supports all data types in the read path.
I mean the tests were negative tests, so I was expecting that we'd have positive tests.
Spark can never write out interval type, so do we really need to support interval type in the read path?
+1 for ^
Test build #92459 has finished for PR 21667 at commit
retest this please
```scala
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
Hm, shouldn't we whitelist them rather than blacklist?
Blacklist is easier. With whitelist, we will have to validate:

```scala
BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
  StringType | BinaryType | DateType | TimestampType | DecimalType
```

Of course we can have a default function to process these. But if we add a new data source which doesn't support all of them, the implementation will be verbose.
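The two styles under discussion can be contrasted in a minimal, self-contained sketch. The `DataType` hierarchy below is a simplified stand-in, not Spark's actual `org.apache.spark.sql.types` classes:

```scala
// Simplified stand-in for the DataType hierarchy (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object CalendarIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType

// Blacklist style: everything is supported unless explicitly rejected,
// so a type added in the future silently passes the check.
def supportBlacklist(dt: DataType, isReadPath: Boolean): Boolean = dt match {
  case CalendarIntervalType if !isReadPath => false // intervals can't be written out
  case _ => true
}

// Whitelist style: nothing is supported unless explicitly accepted,
// so a type added in the future fails the check by default.
def supportWhitelist(dt: DataType, isReadPath: Boolean): Boolean = dt match {
  case StringType | IntegerType => true
  case ArrayType(et)            => supportWhitelist(et, isReadPath)
  case _                        => false
}
```

This is exactly the trade-off the thread settles on later: the blacklist is shorter, the whitelist fails safe when new types appear.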
Might be easier to write, but it doesn't consider the case where we happen to add more types later. It would be better to be explicit about what we support.
I wrote CSV's with whitelisting before, per @hvanhovell's comment a long time ago. I was (and am still) okay either way, but it might be good to leave a cc for him.
I like the whitelist, too. As @HyukjinKwon said, if someone implements a new type, the blacklist passes it through...
Whitelisting for all file formats is a behavior change. There are external file sources like https://github.com/databricks/spark-avro, which we would probably have to update to stay compatible. Currently exceptions are thrown in `buildReader`/`buildReaderWithPartitionValues`/`prepareWrite` for unsupported types, so new types are handled. Overall I prefer blacklist.
I meant the `case`s in the match in each implementation within Spark. I didn't mean the semantics of the API itself.
@HyukjinKwon Sorry, I don't understand. Do you mean the default case is not supported?

```scala
case _ => false
```

But then how do we make all the external formats work?
OK. I meant leaving the default case `true`:

```scala
def supportDataType(...): Boolean = dataType match {
  case _ => true
}
```

and whitelisting each type within each implementation, for example, in CSVFileFormat.scala:

```scala
def supportDataType(...) ...
  case _: StringType | ... => true
  case _ => false
```
blacklist is easier, but whitelist is safer.
Test build #92465 has finished for PR 21667 at commit
```diff
@@ -306,6 +306,7 @@ case class FileSourceScanExec(
   }

 private lazy val inputRDD: RDD[InternalRow] = {
   DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
```
In some formats, this verification is applied twice, right?
ok, you removed the verification in each format implementation ;)
I'm not sure about this change. This is very late (just before execution); is there a better place for this check, so that it happens at analysis phase?
BTW, why do we need to check the schema on the read path? For user-specified schemas?
Yes, for user-specified schema.
```diff
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSON
 import org.apache.spark.sql.catalyst.util.CompressionCodecs
 import org.apache.spark.sql.execution.datasources._
 import org.apache.spark.sql.sources._
-import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.sql.types._
```
If we employ the blacklist, I think it'd be better not to fold these imports.
```scala
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
Also, do we really need this API? All it does is check the type and throw an exception.
```scala
  throw new UnsupportedOperationException(
    s"$format data source does not support ${dataType.simpleString} data type.")
}
dataType match {
```
Wait... why do we do the recursive thing here? What if the top-level type is supported but a nested one is not? For example, Arrow integration in Spark doesn't currently support nested timestamp conversion due to a localization issue.
This is for general purpose, so that developers can skip matching arrays/maps/structs.
I don't know about nested timestamps, but we can override supportDataType to make sure that case is unsupported, right?
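A sketch of what this reply proposes, using simplified stand-ins for Spark's classes (not the real API): the base class recurses through containers so implementations only match on leaf types, while an implementation can still override to forbid a specific nested case. Note that `super`'s recursive calls dispatch virtually, so the override still sees nested occurrences.

```scala
// Simplified stand-in for the DataType hierarchy (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object TimestampType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

class FileFormat {
  // Default: recurse through containers so implementations can skip
  // matching arrays/structs and only decide on leaf types.
  def supportDataType(dt: DataType, isReadPath: Boolean): Boolean = dt match {
    case ArrayType(et)  => supportDataType(et, isReadPath)
    case StructType(fs) => fs.forall(supportDataType(_, isReadPath))
    case _              => true
  }
}

// Hypothetical format that supports top-level timestamps but not
// timestamps nested inside containers (the Arrow-like concern above).
class ArrowLikeFormat extends FileFormat {
  override def supportDataType(dt: DataType, isReadPath: Boolean): Boolean = dt match {
    case ArrayType(TimestampType)                     => false
    case StructType(fs) if fs.contains(TimestampType) => false
    // Fall back to the default recursion; its recursive calls dispatch
    // back into this override, so deeply nested cases are still caught.
    case other => super.supportDataType(other, isReadPath)
  }
}
```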
Yea... then this code bit is not really general purpose anymore... developers have to check the code inside to see whether the nested types are automatically checked or not.
I know... But if developers didn't read inside and didn't handle the case of arrays/maps/structs, the code should still work.
In that case, this code bit becomes rather obsolete... To me, Spark's dev API is too difficult for me to understand :-). Personally, I don't like to be too clever when it comes to API design.
It's tricky to rely on two places to correctly determine the unsupported type. `format.supportDataType` should handle complex types itself, to make the code clearer and easier to maintain.
I see. I will update it.
Reading it again closely, I am actually not super happy about the proposal of introducing an API change if its purpose is just to check the type and throw an exception. Apparently, it looks so. I am less sure how useful it is by looking at the current change. It reduces the size of the code because it blacklists. I would suggest making the API change separate from this PR.
```scala
/**
 * Returns whether this format supports the given [[DataType]] in read/write path.
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
```
FYI, `CalendarIntervalType` isn't completely public yet... cc @cloud-fan.
Yea, it's not. By default, we can't write out interval type. This check is in `DataSource.planForWriting`.
I agree that making it an API is a bit much.
The fixes for the bug all look okay, except the API thing. Mind if I ask to proceed separately with the API change, if that's possible?
Sure, I am actually OK if we can have a different approach other than API.
`supportDataType` in `FileFormat`:
* json -> W: Interval
* orc -> W: Interval, Null
* parquet -> R/W: Interval, Null

on the driver side.
`FileFormat` is internal, so this is not about public API, just about design choice.
Generally it's OK to have a central place for business logic covering different cases. However, here we can't access all `FileFormat` implementations; Hive ORC is in the Hive module. So the only choice is to dispatch the business logic into the implementations.
So +1 on the approach taken by this PR.
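The dispatch pattern endorsed here can be sketched with simplified stand-ins for Spark's classes (not the real API): a small central helper walks the schema and delegates the per-type decision to each `FileFormat` implementation, while the error message is unified in one place.

```scala
// Simplified stand-ins for Spark's classes, for illustration only.
sealed trait DataType { def simpleString: String = toString.toLowerCase }
case object StringType extends DataType
case object CalendarIntervalType extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

trait FileFormat {
  def name: String
  // Each implementation owns the decision of which types it supports.
  def supportDataType(dt: DataType, isReadPath: Boolean): Boolean
}

object DataSourceUtils {
  // Central entry point (driver side): iterates over top-level fields,
  // delegates the check, and throws one unified error message.
  def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit =
    schema.fields.foreach { case (_, dt) =>
      if (!format.supportDataType(dt, isReadPath))
        throw new UnsupportedOperationException(
          s"${format.name} data source does not support ${dt.simpleString} data type.")
    }
}
```

This mirrors the shape of the PR: the Hive module's ORC format can live in a different module and still participate, because the central helper only needs the `FileFormat` trait.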
```scala
 * If the [[DataType]] is not supported, an exception will be thrown.
 * By default all data types are supported.
 */
def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = {}
```
It's better to return a boolean here and let the caller side throw the exception, so that we can unify the error message.
Yes, that was what I did in the first commit. If the unsupported type is inside a struct/array, then the error message is not as accurate as with the current way.
I am OK with reverting to returning `Boolean`, though.
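The trade-off in this exchange can be shown with simplified stand-in types (not Spark's real classes): throwing during the recursion pinpoints the exact nested offender, while a `Boolean` API only lets the caller name the top-level type it passed in.

```scala
// Simplified stand-ins for Spark's classes, for illustration only.
sealed trait DataType { def simpleString: String = toString.toLowerCase }
case object StringType extends DataType
case object CalendarIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType

// Style 1: Unit + throw. The exception is raised at the exact nested
// offender, so the message names e.g. "calendarintervaltype" even when it
// is buried inside an array.
def validateDataType(dt: DataType): Unit = dt match {
  case ArrayType(et)        => validateDataType(et)
  case CalendarIntervalType =>
    throw new UnsupportedOperationException(
      s"data source does not support ${dt.simpleString} data type.")
  case _                    => ()
}

// Style 2: Boolean. The error message is unified at the caller, but the
// caller only knows the top-level type it asked about, e.g.
// "arraytype(calendarintervaltype)" rather than the leaf.
def supportDataType(dt: DataType): Boolean = dt match {
  case ArrayType(et)        => supportDataType(et)
  case CalendarIntervalType => false
  case _                    => true
}
```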
```diff
@@ -148,6 +144,28 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
 override def hashCode(): Int = getClass.hashCode()

 override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]

 override def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = dataType match {
```
How about, in the base class:

```scala
def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
       StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
  case _ => false
}
```

and in json:

```scala
override def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case st: StructType => st.forall { f => validateDataType(f.dataType, isReadPath) }
  case ArrayType...
  ...
  case other => super.validateDataType(other)
}
```
Then the base class could break other existing file formats, right?
What do you mean by break? If people use an internal API (like avro), they are responsible for updating their code for internal API changes in new Spark releases.
@HyukjinKwon @maropu I have updated the code. It is now using whitelist.
Test build #92569 has finished for PR 21667 at commit
Test build #92611 has finished for PR 21667 at commit
```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
       StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
```
Isn't it just `case _: AtomicType => true`?
```diff
@@ -141,6 +136,11 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     }
   }
 }

 override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
`dataType == StringType`
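The reviewer's point, sketched with a hypothetical text-like format and simplified stand-in types: when a format supports exactly one type, the whole pattern match collapses to an equality check.

```scala
// Simplified stand-ins, for illustration only (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType

class TextLikeFormat {
  // Equivalent to `case StringType => true; case _ => false`, but shorter.
  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean =
    dataType == StringType
}
```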
```scala
 * Returns whether this format supports the given [[DataType]] in read/write path.
 * By default all data types are supported.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
```
Who does not override it?
`HiveFileFormat`. Currently I don't know Hive well. If someone can override it for `HiveFileFormat`, please create a follow-up PR.
Then why not remove this default implementation and create `HiveFileFormat#supportDataType` to return true?
Then we also need to update `LibSVMFileFormat` and several file formats in unit tests. I really prefer to have a default behavior here, as `FileFormat` can still work without the new method.
makes sense
```scala
case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)

case _: NullType => true
```
Why does JSON support null type but CSV doesn't?
Currently null type is not handled in `UnivocityParser`.
My major concern is when to apply this check. Ideally this should happen during analysis.
Test build #92639 has finished for PR 21667 at commit
@cloud-fan I can understand your concern. But I can't find better entries. The entry in
retest this please.
Test build #92686 has finished for PR 21667 at commit
retest this please.
Test build #92711 has finished for PR 21667 at commit
This reverts commit 16beafa4f128d3d592bf1f08c8e9b29770ae5a8a.
Force-pushed from 7266611 to 757b82a.
retest this please.
```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case _: AtomicType => true

  case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
```
```scala
// Interval type
var schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil)
spark.range(1).write.format(format).mode("overwrite").save(tempDir)
spark.read.schema(schema).format(format).load(tempDir).collect()
```
When the user-specified schema doesn't match the physical schema, the behavior is undefined. So I don't think this is about backward compatibility; +1 to forbid interval type.
Test build #92864 has finished for PR 21667 at commit
Test build #92863 has finished for PR 21667 at commit
Test build #92867 has finished for PR 21667 at commit
Test build #92913 has finished for PR 21667 at commit
retest this please
Test build #92930 has finished for PR 21667 at commit
thanks, merging to master!
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.
2. Disallow reading `NullType` for ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and did some tests; ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?

With #21389, data source schema is validated on the driver side before launching read/write tasks. However:

1. `DataSourceUtils` is tricky and hard to maintain. On second thought after review, I find that the `OrcFileFormat` in the hive package is not matched, so its validation is wrong.
2. `DataSourceUtils.verifyWriteSchema` and `DataSourceUtils.verifyReadSchema` are not supposed to be called in every file format. We can move them to some upper entry.

So, I propose we add a new method `validateDataType` in `FileFormat`. File format implementations can override the method to specify their supported/non-supported data types.

Although we should focus on the Data Source V2 API, `FileFormat` should remain workable for some time. Adding this new method should be helpful.

How was this patch tested?

Unit test