[SPARK-38437][SQL] Lenient serialization of datetime from datasource#35756
MaxGekk wants to merge 5 commits into `apache:master` from `MaxGekk/dynamic-serializer-java-ts`
Conversation
This PR seems to me to be moving in the wrong direction. Previously we had compile/analysis-time checks which generate code specifically tailored to either SQL or Java-native types. After this PR, we would relax that compile-time check and instead perform per-row runtime checks on the object type. I would expect this to be detrimental to performance, and it generally contradicts the approach of performing more analysis at query compile-time to avoid having to do checks at runtime. LMK if I'm missing anything.
@xkrogen This PR doesn't weaken any compile/analysis-time checks. The goal is to improve user experience with Spark SQL, and make it more flexible with respect to user input. Currently, Spark supports 2 external Java types for Catalyst's timestamp type: `java.sql.Timestamp` and, when `spark.sql.datetime.java8API.enabled` is set to `true`, `java.time.Instant`.
No, it doesn't relax any compile-time checks. We declare that Spark supports both Java types for timestamps depending on the Spark SQL config. After the PR, Spark will accept both, independently of the config. cc @cloud-fan
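For context, the two external Java types represent the same moment in time and round-trip losslessly at microsecond precision. This is an illustrative plain-JDK sketch (not Spark code):

```java
import java.sql.Timestamp;
import java.time.Instant;

public class ExternalTypesDemo {
    public static void main(String[] args) {
        // The same moment represented by both external types Spark accepts
        // for Catalyst's TIMESTAMP, depending on the SQL config.
        Instant instant = Instant.parse("2020-01-01T00:00:00.000123Z");
        Timestamp legacy = Timestamp.from(instant);

        // Round-tripping back to java.time preserves microsecond precision.
        System.out.println(legacy.toInstant().equals(instant)); // prints "true"
    }
}
```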
Can we narrow the scope down to only the data source scan? For Dataset, the type is known at compile-time and we can still generate precise code to process either `java.sql.Timestamp` or `java.time.Instant`.
Sure. I am trying this now. |
Perhaps I should have been more careful with my wording. I agree that no checks are weakened. However, we previously generated code that was specialized to a single type, either Java-native or SQL type. Now, we perform a type check at runtime, meaning we have to do an `instanceof` check per record.

I think the PR looks much more reasonable now that it is scoped to only the Datasource API, and I do understand the problem you're addressing, though it still imposes extra overhead for well-behaved datasources that obey the `spark.sql.datetime.java8API.enabled` config:

```scala
val toRow = RowEncoder(StructType.fromAttributes(output), lenient = true).createSerializer()
```

Not a strong concern; I think it is likely that the performance difference is small, but something to consider.
As long as the data source always returns one type of datetime class, branch prediction will work pretty well on the JVM and the per-record type check should be cheap.
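The per-record check being discussed can be sketched in plain Java as a dispatch on the external type. This is a hypothetical `toMicros` helper illustrating the lenient conversion (not the actual Spark code path, which goes through the `RowEncoder` serializer expressions):

```java
import java.sql.Timestamp;
import java.time.Instant;

public class LenientConvert {
    // Normalize either external type to Catalyst's internal representation:
    // microseconds since the epoch. The instanceof dispatch is the
    // per-record runtime cost under discussion.
    public static long toMicros(Object value) {
        if (value instanceof Instant) {
            Instant i = (Instant) value;
            return Math.addExact(Math.multiplyExact(i.getEpochSecond(), 1_000_000L),
                                 i.getNano() / 1_000L);
        } else if (value instanceof Timestamp) {
            Timestamp t = (Timestamp) value;
            // getTime() is truncated to millis; getNanos() carries the
            // sub-second part, so split out whole seconds first.
            long seconds = Math.floorDiv(t.getTime(), 1_000L);
            return Math.addExact(Math.multiplyExact(seconds, 1_000_000L),
                                 t.getNanos() / 1_000L);
        }
        throw new RuntimeException(
            value.getClass().getName() + " is not a valid external type for schema of timestamp");
    }

    public static void main(String[] args) {
        Instant i = Instant.parse("1970-01-01T00:00:01.000500Z");
        System.out.println(toMicros(i));                 // prints "1000500"
        System.out.println(toMicros(Timestamp.from(i))); // prints "1000500"
    }
}
```

Both branches normalize to the same internal value, which is why the lenient mode is safe for well-behaved datasources.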
```scala
    deserializer,
    ClassTag(cls))
}
def apply(schema: StructType): ExpressionEncoder[Row] = {
```
nit: put a blank line between methods.
The datetime ops are pretty expensive, especially when they need to look up historical data to calculate time zone offsets. Such an op could take milliseconds (or hundreds of microseconds) compared to an `instanceof` check.
Merging to master. Thank you, @xkrogen and @cloud-fan, for the review.
Fair points from both of you on branch prediction and the relative cost of datetime operations.
### What changes were proposed in this pull request?

In the PR, I propose to support the lenient mode in the row serializer used by datasources to convert rows received from scans. Spark SQL will be able to accept:
- `java.time.Instant` and `java.sql.Timestamp` for the `TIMESTAMP` type, and
- `java.time.LocalDate` and `java.sql.Date` for the `DATE` type

independently of the current value of the SQL config `spark.sql.datetime.java8API.enabled`.

### Why are the changes needed?

A datasource might not be aware of the Spark SQL config `spark.sql.datetime.java8API.enabled` if it was developed before the config was introduced in Spark 3.0.0. In that case, it always returns "legacy" timestamps/dates of the types `java.sql.Timestamp`/`java.sql.Date` even if a user enabled the Java 8 API. As Spark expects `java.time.Instant` or `java.time.LocalDate` but gets `java.sql.Timestamp` or `java.sql.Date`, the user observes the exception:

```java
ERROR SparkExecuteStatementOperation: Error executing query with ac61b10a-486e-463b-8726-3b61da58582e, currentState RUNNING, org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 8) (10.157.1.194 executor 0): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.sql.Timestamp is not a valid external type for schema of timestamp
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, instantToMicros, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, loan_perf_date), TimestampType), true, false) AS loan_perf_date#1125
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:239)
```

This PR fixes the issue above. After the changes, users can use legacy datasource connectors with new Spark versions even when they need to enable the Java 8 API.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the affected test suites:
```
$ build/sbt "test:testOnly *CodeGenerationSuite"
$ build/sbt "test:testOnly *ObjectExpressionsSuite"
```
and new tests:
```
$ build/sbt "test:testOnly *RowEncoderSuite"
$ build/sbt "test:testOnly *TableScanSuite"
```

Closes apache#35756 from MaxGekk/dynamic-serializer-java-ts.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
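The pre-patch failure mode can be sketched in plain Java: when the Java 8 API config is enabled, a strict converter expects `java.time.Instant` only, so a legacy connector returning `java.sql.Timestamp` is rejected with the error quoted above. This is a hypothetical illustration, not Spark's actual serializer code:

```java
import java.sql.Timestamp;
import java.time.Instant;

public class StrictConvert {
    // Hypothetical sketch of the pre-patch strict behavior: exactly one
    // external type is accepted, chosen by the Java 8 API config.
    public static long strictToMicros(Object value, boolean java8ApiEnabled) {
        Class<?> expected = java8ApiEnabled ? Instant.class : Timestamp.class;
        if (!expected.isInstance(value)) {
            throw new RuntimeException(value.getClass().getName()
                + " is not a valid external type for schema of timestamp");
        }
        if (value instanceof Instant) {
            Instant i = (Instant) value;
            return Math.addExact(Math.multiplyExact(i.getEpochSecond(), 1_000_000L),
                                 i.getNano() / 1_000L);
        }
        Timestamp t = (Timestamp) value;
        return Math.addExact(Math.multiplyExact(Math.floorDiv(t.getTime(), 1_000L), 1_000_000L),
                             t.getNanos() / 1_000L);
    }

    public static void main(String[] args) {
        try {
            // A legacy connector returning Timestamp under Java 8 API mode
            // fails before this patch.
            strictToMicros(new Timestamp(0L), true);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
            // prints "java.sql.Timestamp is not a valid external type for schema of timestamp"
        }
    }
}
```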
@MaxGekk Hello, I find that when the time zone is not UTC (GMT+0), inserting a timestamp like '1000-01-01 08:00:00' with time zone GMT+8, the select command returns '1000-01-01 00:00:00'; the code side subtracts the time zone offset from the input timestamp. So, which behavior do you think is correct for timestamp values in SQL? (a. insert '1000-01-01 08:00:00', return '1000-01-01 08:00:00'; b. insert '1000-01-01 08:00:00', return '1000-01-01 00:00:00')
What's the session timezone of Spark for your table insertion and reading? Can you give some SQL statements to demonstrate your issue? |
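Pending the exact SQL repro, the offset arithmetic being described can be illustrated with plain `java.time` (an illustration of the behavior, not Spark's code):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class OffsetDemo {
    public static void main(String[] args) {
        // The wall-clock value '1000-01-01 08:00:00' interpreted in GMT+8...
        LocalDateTime wallClock = LocalDateTime.of(1000, 1, 1, 8, 0, 0);
        Instant instant = wallClock.atZone(ZoneOffset.ofHours(8)).toInstant();

        // ...is midnight UTC; reading it back with a UTC session time zone
        // yields '1000-01-01 00:00:00', matching the reported output.
        System.out.println(LocalDateTime.ofInstant(instant, ZoneOffset.UTC));
        // prints "1000-01-01T00:00"
    }
}
```

Whether this shift is a bug depends on whether the session time zone used for reading matches the one used for writing, which is why the question above matters.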
@MaxGekk can you take a look? This looks unexpected. |
The time zone is PST.
If this is an issue, I can fix it.
Is it Avro-specific? Do you observe the same with Parquet? One more thing about the config: I don't see that you modified it.
OK, I will do more tests with the file formats and data, and also the configs for Avro and Parquet.
For sure, the loaded data must be the same as the saved data. We have many tests for saving/loading directly to datasources, but here you save/load via Hive.
Yeah, the datasource is correct. The issue where "in Avro the date "0001-01-01 00:00:00" will be read as "0002-01-01 00:00:00"" comes from our own changes making it incorrect. I found that parquet/orc/avro/rcfile pass the test, while sequencefile/textfile may fail. I can fix it and cc you for review later.