[SPARK-24691][SQL] Dispatch the type support check in FileFormat implementation #21667
Conversation
```diff
@@ -156,28 +156,6 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
     sql("select testType()").write.mode("overwrite").orc(orcDir)
   }.getMessage
 assert(msg.contains("ORC data source does not support calendarinterval data type."))

 // read path
```
In read path, ORC should support `CalendarIntervalType` and `NullType`.
Is there any read path test already?
@HyukjinKwon No, the unit test is about unsupported data types, and ORC supports all data types in the read path.
I mean the tests were negative tests, so I was expecting that we'd have positive tests.
Spark can never write out interval type, so do we really need to support interval type in the read path?
+1 for ^
Test build #92459 has finished for PR 21667 at commit
retest this please
```scala
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
Hm, shouldn't we whitelist them rather than blacklist?
Blacklist is easier. With whitelist, we will have to validate:

```scala
BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
  StringType | BinaryType | DateType | TimestampType | DecimalType
```

Of course we can have a default function to process these. But if we add a new data source which doesn't support all of them, the implementation will be verbose.
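The two styles under discussion can be contrasted in a minimal, self-contained sketch. The `DataType` hierarchy below is a simplified stand-in, not Spark's actual `org.apache.spark.sql.types` classes:

```scala
// Simplified stand-in for the DataType hierarchy (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object CalendarIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType

// Blacklist style: everything is supported unless explicitly rejected,
// so a type added in the future silently passes the check.
def supportBlacklist(dt: DataType, isReadPath: Boolean): Boolean = dt match {
  case CalendarIntervalType if !isReadPath => false // intervals can't be written out
  case _ => true
}

// Whitelist style: nothing is supported unless explicitly accepted,
// so a type added in the future fails the check by default.
def supportWhitelist(dt: DataType, isReadPath: Boolean): Boolean = dt match {
  case StringType | IntegerType => true
  case ArrayType(et)            => supportWhitelist(et, isReadPath)
  case _                        => false
}
```

This is exactly the trade-off the thread settles on later: the blacklist is shorter, the whitelist fails safe when new types appear.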
Might be easier to write, but it doesn't consider the case where we happen to add more types later. It would be better to be explicit about what we support.
I wrote CSV's with whitelisting before, per @hvanhovell's comment a long time ago. I was (and am still) okay either way, but it might be good to leave a cc for him.
I like the whitelist, too. As @HyukjinKwon said, if someone implements a new type, the blacklist passes it through...
Whitelisting for all file formats is a behavior change. There are external file sources like https://github.com/databricks/spark-avro, which we would probably have to update to stay compatible. Currently exceptions are thrown in `buildReader`/`buildReaderWithPartitionValues`/`prepareWrite` for unsupported types, so new types are handled. Overall I prefer blacklist.
I meant the `case`s in the match in each implementation within Spark. I didn't mean the semantics of the API itself.
@HyukjinKwon Sorry, I don't understand. Do you mean the default case is not supported?

```scala
case _ => false
```

But then how do we make all the external formats work?
OK. I meant leaving the default case `true`:

```scala
def supportDataType(...): Boolean = dataType match {
  case _ => true
}
```

and whitelisting each type within each implementation, for example, in CSVFileFormat.scala:

```scala
def supportDataType(...) ...
  case _: StringType | ... => true
  case _ => false
```
blacklist is easier, but whitelist is safer.
Test build #92465 has finished for PR 21667 at commit
```diff
@@ -306,6 +306,7 @@ case class FileSourceScanExec(
   }

 private lazy val inputRDD: RDD[InternalRow] = {
   DataSourceUtils.verifyReadSchema(relation.fileFormat, relation.dataSchema)
```
In some formats, this verification is applied twice, right?
ok, you removed the verification in each format implementation ;)
I'm not sure about this change. This is very late (just before execution); is there a better place for this check, so that it happens at analysis phase?
BTW, why do we need to check the schema on the read path? For user-specified schemas?
Yes, for user-specified schema.
```diff
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSON
 import org.apache.spark.sql.catalyst.util.CompressionCodecs
 import org.apache.spark.sql.execution.datasources._
 import org.apache.spark.sql.sources._
-import org.apache.spark.sql.types.{StringType, StructType}
+import org.apache.spark.sql.types._
```
If we employ the blacklist, I think it'd be better not to fold these imports.
```scala
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
Also, do we really need this API? All it does is check the type and throw an exception.
```scala
  throw new UnsupportedOperationException(
    s"$format data source does not support ${dataType.simpleString} data type.")
}
dataType match {
```
Wait... why do we do the recursive thing here? What if the top-level type is supported but a nested one is not? For example, Arrow integration in Spark doesn't currently support nested timestamp conversion due to a localization issue.
This is for general purpose, so that developers can skip matching arrays/maps/structs.
I don't know about nested timestamps, but we can override supportDataType to make sure that case is unsupported, right?
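A sketch of what this reply proposes, using simplified stand-ins for Spark's classes (not the real API): the base class recurses through containers so implementations only match on leaf types, while an implementation can still override to forbid a specific nested case. Note that `super`'s recursive calls dispatch virtually, so the override still sees nested occurrences.

```scala
// Simplified stand-in for the DataType hierarchy (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object TimestampType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: Seq[DataType]) extends DataType

class FileFormat {
  // Default: recurse through containers so implementations can skip
  // matching arrays/structs and only decide on leaf types.
  def supportDataType(dt: DataType, isReadPath: Boolean): Boolean = dt match {
    case ArrayType(et)  => supportDataType(et, isReadPath)
    case StructType(fs) => fs.forall(supportDataType(_, isReadPath))
    case _              => true
  }
}

// Hypothetical format that supports top-level timestamps but not
// timestamps nested inside containers (the Arrow-like concern above).
class ArrowLikeFormat extends FileFormat {
  override def supportDataType(dt: DataType, isReadPath: Boolean): Boolean = dt match {
    case ArrayType(TimestampType)                     => false
    case StructType(fs) if fs.contains(TimestampType) => false
    // Fall back to the default recursion; its recursive calls dispatch
    // back into this override, so deeply nested cases are still caught.
    case other => super.supportDataType(other, isReadPath)
  }
}
```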
Yea... then this code bit is not really general purpose anymore... developers have to check the code inside to see whether the nested types are automatically checked or not.
I know... But if developers didn't read inside and didn't handle the case of arrays/maps/structs, the code should still work.
In that case, this code bit becomes rather obsolete... To me, Spark's dev API is too difficult for me to understand :-). Personally, I don't like to be too clever when it comes to API design.
It's tricky to rely on two places to correctly determine the unsupported type. `format.supportDataType` should handle complex types itself, to make the code clearer and easier to maintain.
I see. I will update it.
Reading it again closely, I am actually not super happy about the proposal of introducing an API change if its purpose is just to check the type and throw an exception. Apparently, it looks so. I am less sure how useful it is by looking at the current change. It reduces the size of the code because it blacklists. I would suggest making the API change separate from this PR.
```scala
/**
 * Returns whether this format supports the given [[DataType]] in read/write path.
 *
 * By default all data types are supported except [[CalendarIntervalType]] in write path.
```
FYI, `CalendarIntervalType` isn't completely public yet... cc @cloud-fan.
Yea, it's not. By default, we can't write out interval type. This check is in `DataSource.planForWriting`.
I agree that making it an API is a bit much.
The fixes for the bug all look okay, except the API thing. Mind if I ask to proceed separately with the API change, if that's possible?
Sure, I am actually OK if we can have a different approach other than API.
`supportDataType` in `FileFormat`:
* json -> W: Interval
* orc -> W: Interval, Null
* parquet -> R/W: Interval, Null

on the driver side.
`FileFormat` is internal, so this is not about public API, just about design choice.
Generally it's OK to have a central place for business logic covering different cases. However, here we can't access all `FileFormat` implementations; Hive ORC is in the Hive module. So the only choice is to dispatch the business logic into the implementations.
So +1 on the approach taken by this PR.
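The dispatch pattern endorsed here can be sketched with simplified stand-ins for Spark's classes (not the real API): a small central helper walks the schema and delegates the per-type decision to each `FileFormat` implementation, while the error message is unified in one place.

```scala
// Simplified stand-ins for Spark's classes, for illustration only.
sealed trait DataType { def simpleString: String = toString.toLowerCase }
case object StringType extends DataType
case object CalendarIntervalType extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

trait FileFormat {
  def name: String
  // Each implementation owns the decision of which types it supports.
  def supportDataType(dt: DataType, isReadPath: Boolean): Boolean
}

object DataSourceUtils {
  // Central entry point (driver side): iterates over top-level fields,
  // delegates the check, and throws one unified error message.
  def verifySchema(format: FileFormat, schema: StructType, isReadPath: Boolean): Unit =
    schema.fields.foreach { case (_, dt) =>
      if (!format.supportDataType(dt, isReadPath))
        throw new UnsupportedOperationException(
          s"${format.name} data source does not support ${dt.simpleString} data type.")
    }
}
```

This mirrors the shape of the PR: the Hive module's ORC format can live in a different module and still participate, because the central helper only needs the `FileFormat` trait.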
```scala
 * If the [[DataType]] is not supported, an exception will be thrown.
 * By default all data types are supported.
 */
def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = {}
```
It's better to return a boolean here and let the caller side throw the exception, so that we can unify the error message.
Yes, that was what I did in the first commit. If the unsupported type is inside a struct/array, then the error message is not as accurate as with the current way.
I am OK with reverting to returning `Boolean`, though.
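The trade-off in this exchange can be shown with simplified stand-in types (not Spark's real classes): throwing during the recursion pinpoints the exact nested offender, while a `Boolean` API only lets the caller name the top-level type it passed in.

```scala
// Simplified stand-ins for Spark's classes, for illustration only.
sealed trait DataType { def simpleString: String = toString.toLowerCase }
case object StringType extends DataType
case object CalendarIntervalType extends DataType
case class ArrayType(elementType: DataType) extends DataType

// Style 1: Unit + throw. The exception is raised at the exact nested
// offender, so the message names e.g. "calendarintervaltype" even when it
// is buried inside an array.
def validateDataType(dt: DataType): Unit = dt match {
  case ArrayType(et)        => validateDataType(et)
  case CalendarIntervalType =>
    throw new UnsupportedOperationException(
      s"data source does not support ${dt.simpleString} data type.")
  case _                    => ()
}

// Style 2: Boolean. The error message is unified at the caller, but the
// caller only knows the top-level type it asked about, e.g.
// "arraytype(calendarintervaltype)" rather than the leaf.
def supportDataType(dt: DataType): Boolean = dt match {
  case ArrayType(et)        => supportDataType(et)
  case CalendarIntervalType => false
  case _                    => true
}
```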
```diff
@@ -148,6 +144,28 @@ class JsonFileFormat extends TextBasedFileFormat with DataSourceRegister {
 override def hashCode(): Int = getClass.hashCode()

 override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]

 override def validateDataType(dataType: DataType, isReadPath: Boolean): Unit = dataType match {
```
How about, in the base class:

```scala
def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
       StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
  case _ => false
}
```

and in json:

```scala
override def validateDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case st: StructType => st.forall { f => validateDataType(f.dataType, isReadPath) }
  case ArrayType...
  ...
  case other => super.validateDataType(other)
}
```
Then the base class could break other existing file formats, right?
What do you mean by break? If people use an internal API (like avro), they are responsible for updating their code for internal API changes in new Spark releases.
@HyukjinKwon @maropu I have updated the code. It is now using whitelist.
Test build #92569 has finished for PR 21667 at commit
Test build #92611 has finished for PR 21667 at commit
```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
       StringType | BinaryType | DateType | TimestampType | _: DecimalType => true
```
Isn't it just `case _: AtomicType => true`?
```diff
@@ -141,6 +136,11 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     }
   }
 }

 override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
```
`dataType == StringType`
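The reviewer's point, sketched with a hypothetical text-like format and simplified stand-in types: when a format supports exactly one type, the whole pattern match collapses to an equality check.

```scala
// Simplified stand-ins, for illustration only (not Spark's real classes).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType

class TextLikeFormat {
  // Equivalent to `case StringType => true; case _ => false`, but shorter.
  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean =
    dataType == StringType
}
```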
```scala
 * Returns whether this format supports the given [[DataType]] in read/write path.
 * By default all data types are supported.
 */
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
```
Who does not override it?
`HiveFileFormat`. Currently I don't know Hive well. If someone can override it for `HiveFileFormat`, please create a follow-up PR.
Then why not remove this default implementation and create `HiveFileFormat#supportDataType` to return true?
Then we also need to update `LibSVMFileFormat` and several file formats in unit tests. I really prefer to have a default behavior here, as `FileFormat` can still work without the new method.
makes sense
```scala
case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)

case _: NullType => true
```
Why does JSON support null type but CSV doesn't?
Currently null type is not handled in `UnivocityParser`.
My major concern is when to apply this check. Ideally this should happen during analysis.
Test build #92639 has finished for PR 21667 at commit
@cloud-fan I can understand your concern. But I can't find better entries. The entry in
retest this please.
Test build #92686 has finished for PR 21667 at commit
retest this please.
Test build #92711 has finished for PR 21667 at commit
This reverts commit 16beafa4f128d3d592bf1f08c8e9b29770ae5a8a.
Force-pushed from 7266611 to 757b82a.
retest this please.
```scala
override def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = dataType match {
  case _: AtomicType => true

  case udt: UserDefinedType[_] => supportDataType(udt.sqlType, isReadPath)
```
```scala
// Interval type
var schema = StructType(StructField("a", CalendarIntervalType, true) :: Nil)
spark.range(1).write.format(format).mode("overwrite").save(tempDir)
spark.read.schema(schema).format(format).load(tempDir).collect()
```
When the user-specified schema doesn't match the physical schema, the behavior is undefined. So I don't think this is about backward compatibility; +1 to forbid interval type.
Test build #92864 has finished for PR 21667 at commit
Test build #92863 has finished for PR 21667 at commit
Test build #92867 has finished for PR 21667 at commit
Test build #92913 has finished for PR 21667 at commit
retest this please
Test build #92930 has finished for PR 21667 at commit
thanks, merging to master!
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.
2. Disallow reading `NullType` for ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and did some tests; ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?

With #21389, data source schema is validated on the driver side before launching read/write tasks. However:

1. `DataSourceUtils` is tricky and hard to maintain. On second thought after review, I find that the `OrcFileFormat` in the hive package is not matched, so its validation is wrong.
2. `DataSourceUtils.verifyWriteSchema` and `DataSourceUtils.verifyReadSchema` are not supposed to be called in every file format. We can move them to some upper entry.

So, I propose we add a new method `validateDataType` in `FileFormat`. File format implementations can override the method to specify their supported/non-supported data types.

Although we should focus on the Data Source V2 API, `FileFormat` should remain workable for some time. Adding this new method should be helpful.

How was this patch tested?

Unit test