[SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource #15361
Conversation
@yhuai and @liancheng Do you mind if I ask you to review this, please?
Test build #66386 has finished for PR 15361 at commit
@HyukjinKwon Let me try to make a simple Scala test case that reproduces the issue from the shell. Maybe that will be more helpful.
@kxepal Sure, I will definitely try to reproduce it as soon as you do. Meanwhile, let me double-check this. Thanks.
@HyukjinKwon
@kxepal I will test this by tomorrow and fix it here as well if there are more cases to handle. Thanks for verifying this PR.
@HyukjinKwon Thank you a lot! Staying tuned.
@kxepal, I just tested (built, copied, and pasted) the code below:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Spark Hive Example").enableHiveSupport().getOrCreate()
import spark.implicits._

val sv = org.apache.spark.mllib.linalg.Vectors.sparse(7, Array(0, 42), Array(-127, 128))
val df = Seq(("thing", sv)).toDF("thing", "vector")
df.write.format("orc").save("/tmp/thing.orc")
spark.read.schema(df.schema).format("orc").load("/tmp/thing.orc").show()
```

and the printed output looks fine with the current master branch. Do you mind if I verify this again when we backport to branch-2.0, hopefully after this one is merged?
@HyukjinKwon I'll give it one more try today, but so far it looks like you solved the problem \o/ Thank you!
@kxepal Sure, thanks for confirming!
@HyukjinKwon
Hi @chenghao-intel and @davies, it seems the related code paths were updated by you before. Do you mind taking a look, please?
Yes, please go ahead. :)
@chenghao-intel @davies Are there other things I should test or take care of?
```diff
@@ -246,6 +246,9 @@ private[hive] trait HiveInspectors {
    * Wraps with Hive types based on object inspector.
    */
   protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any = oi match {
+    case _ if dataType.isInstanceOf[UserDefinedType[_]] =>
```
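The body of the new case is truncated in the diff above; per the PR description, the fix finds the converter via `UserDefinedType.sqlType`. A minimal, self-contained sketch of that dispatch (the `converterFor` helper is hypothetical, not the actual `HiveInspectors` code):

```scala
import org.apache.spark.sql.types.{DataType, UserDefinedType}

// Unwrap a UDT to its underlying SQL type before choosing a converter;
// any other type falls through to the normal dispatch.
def converterFor(dataType: DataType): String = dataType match {
  case udt: UserDefinedType[_] => converterFor(udt.sqlType)
  case other                   => s"converter for ${other.simpleString}"
}
```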
- This codepath is shared by many things apart from ORC. Won't those be affected?
- I would put this case at the very end, the reason being that `UserDefinedType`s are not that common compared to other types (esp. primitive types), so putting it below in the match will be better for perf.
> This codepath is shared by many things apart from ORC. Won't those be affected?

It seems this path is used in `hiveUDFs.scala` and `hiveWriterContainers.scala`. Actually, it's fine for a UDT value converter to use the converter for the equivalent type (the inner SQL type) internally; it is a common pattern in other data sources as well.

> I would put this case at the very end. The reason being `UserDefinedType`s are not that common compared to other types (esp. primitive types). So putting it below in the match will be better for perf.

Cool :) (BTW, this would only be executed once per task, as you might already know; a sketch of that point follows below.)
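A tiny illustration of that last point, detached from Spark (the `buildWrapper` factory below is hypothetical, not the real API): factories like `wrapperFor` pay the pattern-match cost once, when the converter closure is built, and each row only invokes the returned closure:

```scala
// The dispatch happens once, when the wrapper is built for the task...
def buildWrapper(typeName: String): Any => Any = typeName match {
  case "string" => (v: Any) => v.asInstanceOf[String].trim // toy conversion
  case "int"    => (v: Any) => v.asInstanceOf[Int] + 0     // toy conversion
  case _        => identity
}

// ...and the per-row cost is just the closure call, with no re-dispatch.
val wrap = buildWrapper("string")
val rows = Seq(" a ", " b ").map(wrap)
```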
Oh, actually, I forgot that this path does its pattern matching on the `ObjectInspector`, not on the `DataType`. I now remember I put this case first for that reason (we should check `dataType` first, regardless of the `ObjectInspector`).

I could definitely avoid this by changing more than this spot, but I think it is fine as it is because this is not on the critical path (it is executed once). A toy sketch of the match shape follows.
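Here is that sketch, with hypothetical stand-in types rather than Spark's: because the match scrutinee is the `ObjectInspector`, the UDT check can only be expressed as a guard on `dataType`, and it must come before the inspector-based cases, or those would match first for a UDT backed by a primitive:

```scala
sealed trait Inspector
case object DoubleInspector extends Inspector

sealed trait TypeDesc
case object DoubleDesc extends TypeDesc
case object UdtDesc extends TypeDesc

// If the guard case were moved below, a UDT whose underlying type is double
// would be swallowed by the DoubleInspector case and mis-converted.
def wrapperFor(oi: Inspector, dataType: TypeDesc): String = oi match {
  case _ if dataType == UdtDesc => "unwrap the UDT, recurse on its sqlType"
  case DoubleInspector          => "wrap a plain double"
}
```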
```scala
val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
val udtDF = data.toDF("id", "vectors")
udtDF.write.orc(path.getAbsolutePath)
val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)
```
Curious: how does this work? I mean, you added support for `UserDefinedType` on the wrapper side, but on the unwrapper side I don't see `UserDefinedType` being handled.
It seems fine for reading because it refers to the schema from ORC (detecting the fields via field names). The internal projection for a UDT uses `udt.sqlType`, IIUC. (An illustration follows below.)
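To illustrate, continuing the test snippet above (this is my reading of the thread, not verified output): the ORC file stores the column as the UDT's underlying `sqlType`, so supplying the original schema on read restores the UDT column, while letting ORC infer the schema would surface only the underlying type:

```scala
// Assumes `spark`, `udtDF`, and `path` from the test snippet above.
// With the original schema, `vectors` comes back as the UDT:
val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)

// Without it, ORC's own metadata is used, so `vectors` would appear as the
// underlying type (e.g. array<double>); this is an assumption, not tested here.
val inferred = spark.read.orc(path.getAbsolutePath)
inferred.printSchema()
```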
(gentle ping @chenghao-intel @davies)
ping ...
@HyukjinKwon Maybe we can reach someone else with the commit bit? Do you know anyone to ping?
I think the recent related code was committed by @rxin. Do you mind taking a look, please?
@HyukjinKwon Please, do! Thanks a lot for helping here (:
Merging in master/branch-2.1.
[SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource

Closes #15361 from HyukjinKwon/SPARK-17765.

(cherry picked from commit a2d4647)
Signed-off-by: Reynold Xin &lt;rxin@databricks.com&gt;
## What changes were proposed in this pull request?

This PR adds support for `UserDefinedType` when writing out in the ORC data source, instead of throwing `ClassCastException`.

In more detail, `OrcStruct` is created based on the string from `DataType.catalogString`. For a user-defined type, `catalogString` returns `sqlType.simpleString` by default [1]. However, during type dispatching to match the output with the schema, the code tries to cast the type to, for example, `StructType` [2].

So, running the code below (`MyDenseVector` was borrowed from [3]):

```scala
val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
val udtDF = data.toDF("id", "vectors")
udtDF.write.orc("/tmp/test.orc")
```

ends up throwing an exception as below:

```
java.lang.ClassCastException: org.apache.spark.sql.UDT$MyDenseVectorUDT cannot be cast to org.apache.spark.sql.types.ArrayType
  at org.apache.spark.sql.hive.HiveInspectors$class.wrapperFor(HiveInspectors.scala:381)
  at org.apache.spark.sql.hive.orc.OrcSerializer.wrapperFor(OrcFileFormat.scala:164)
  ...
```

So, this PR uses `UserDefinedType.sqlType` to find the correct converter when writing out in the ORC data source.

[1] https://github.com/apache/spark/blob/dfdcab00c7b6200c22883baa3ebc5818be09556f/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L95
[2] https://github.com/apache/spark/blob/d2dc8c4a162834818190ffd82894522c524ca3e5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L326
[3] https://github.com/apache/spark/blob/2bfed1a0c5be7d0718fd574a4dad90f4f6b44be7/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala#L38-L70
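For readers unfamiliar with the UDT API, here is a rough sketch of the shape of the borrowed test UDT in [3] (reconstructed approximately, not copied from `UserDefinedTypeSuite`; note that `UserDefinedType` is not a public API in Spark 2.x, so code like this lives inside Spark's own packages and tests). The key member is `sqlType`, which this PR makes the ORC writer dispatch on:

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

class MyDenseVector(val data: Array[Double]) extends Serializable

class MyDenseVectorUDT extends UserDefinedType[MyDenseVector] {
  // The underlying SQL type: `catalogString` falls back to its simpleString,
  // and after this PR the ORC writer picks the converter for this type.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // Convert to/from the Catalyst representation of `sqlType`.
  override def serialize(vec: MyDenseVector): ArrayData =
    new GenericArrayData(vec.data.map(_.asInstanceOf[Any]))

  override def deserialize(datum: Any): MyDenseVector = datum match {
    case data: ArrayData => new MyDenseVector(data.toDoubleArray())
  }

  override def userClass: Class[MyDenseVector] = classOf[MyDenseVector]
}
```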
## How was this patch tested?

Unit tests in `OrcQuerySuite`.