The result is strange when casting `string` to `date` in ORC reading via Spark (Schema Evolution) #1237
Comments
Hi, @sinkinben. Both the Apache Spark and ORC communities recommend using an explicit SQL `CAST`, which returns `null` for invalid dates:
scala> sql("select cast('2022-01-32' as DATE)").show()
+------------------------+
|CAST(2022-01-32 AS DATE)|
+------------------------+
| null|
+------------------------+
scala> sql("select cast('9808-02-30' as DATE)").show()
+------------------------+
|CAST(9808-02-30 AS DATE)|
+------------------------+
| null|
+------------------------+
scala> sql("select cast('2022-06-31' as DATE)").show()
+------------------------+
|CAST(2022-06-31 AS DATE)|
+------------------------+
| null|
+------------------------+
* The reader schema is said to be evolved (or projected) when it is changed after the data is
* written by writers. The following are supported in file-based data sources.
* Note that partition columns are not maintained in files. Here, `column` means non-partition
* column.
*
* 1. Add a column
* 2. Hide a column
* 3. Change a column position
* 4. Change a column type (Upcast)
*
* Here, we consider safe changes without data loss. For example, data type changes should be
* from small types to larger types like `int`-to-`long`, not vice versa.
*
* So far, file-based data sources have the following coverages.
*
* | File Format | Coverage | Note |
* | ------------ | ------------ | ------------------------------------------------------ |
* | TEXT | N/A | Schema consists of a single string column. |
* | CSV | 1, 2, 4 | |
* | JSON | 1, 2, 3, 4 | |
* | ORC | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
* | PARQUET | 1, 2, 3 | |
* | AVRO | 1, 2, 3 | |
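The "safe change" rule above (widen from small types to larger types, never the reverse) can be modeled with a small sketch. This is a hypothetical helper for illustration, not Spark's actual upcast-checking implementation, and it deliberately covers only same-chain numeric widening:

```scala
object SafeUpcast {
  // Widening order for integral and floating-point types; a type change is
  // "safe" (no data loss) only when it moves left-to-right within one chain.
  private val integral = List("byte", "short", "int", "long")
  private val floating = List("float", "double")

  def isSafeUpcast(from: String, to: String): Boolean = {
    def widens(chain: List[String]): Boolean = {
      val (i, j) = (chain.indexOf(from), chain.indexOf(to))
      i >= 0 && j >= 0 && i <= j
    }
    widens(integral) || widens(floating)
  }
}
```

Under this model, `isSafeUpcast("int", "long")` holds while `isSafeUpcast("long", "int")` does not, which is exactly the `int`-to-`long` example given above.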
scala> sql("set spark.sql.orc.impl=hive")
scala> :paste
// Entering paste mode (ctrl-D to finish)
val data = Seq(
("", "2022-01-32"), // pay attention to this, null
("", "9808-02-30"), // pay attention to this, 9808-02-29
("", "2022-06-31"), // pay attention to this, 2022-06-30
)
val cols = Seq("str", "date_str")
val df = spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.write.format("orc").mode("overwrite").save("/tmp/df")
spark.read.format("orc").schema("date_str date").load("/tmp/df").show(false)
// Exiting paste mode, now interpreting.
+--------+
|date_str|
+--------+
|null |
|null |
|null |
+--------+
Since there is a recommended way, I'll close this Q&A issue. We can still continue the discussion on this thread, @sinkinben.
Hi, @dongjoon-hyun, many thanks for your reply. I have made more tests after setting the conf.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val data = Seq(
("", "2002-01-01"),
("", "2022-08-29"),
("", "2022-08-31")
)
val cols = Seq("str", "date_str")
val df=spark.createDataFrame(data).toDF(cols:_*).repartition(1)
df.printSchema()
df.show(100)
df.write.mode("overwrite").orc("/tmp/orc/data.orc")
// Exiting paste mode, now interpreting.
root
|-- str: string (nullable = true)
|-- date_str: string (nullable = true)
+---+----------+
|str| date_str|
+---+----------+
| |2002-01-01|
| |2022-08-29|
| |2022-08-31|
+---+----------+
data: Seq[(String, String)] = List(("",2002-01-01), ("",2022-08-29), ("",2022-08-31))
cols: Seq[String] = List(str, date_str)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [str: string, date_str: string]
scala> sql("set spark.sql.orc.impl=hive")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> var df = spark.read.schema("date_str date").orc("/tmp/orc/data.orc"); df.show()
+--------+
|date_str|
+--------+
| null|
| null|
| null|
+--------+
df: org.apache.spark.sql.DataFrame = [date_str: date]

These three cases are valid dates, but why are they converted to `null`?
Thank you for sharing the background. IMO, you don't need to follow any behaviors of the `hive` implementation.
I created an ORC file by the code as follows. Please note that these three cases are invalid dates. And I read it via:

Why is `2022-01-32` converted to `null`, while `9808-02-30` is converted to `9808-02-29`? Intuitively, they are all invalid dates, so we should get 3 nulls.
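The asymmetry described here matches "smart" date resolution as implemented in `java.time` (an assumption about the underlying behavior, not a reading of the Hive ORC reader's actual code path): a day-of-month within the outer range 1..31 but invalid for that month is clamped to the month's last day, while a day outside 1..31 fails outright. A minimal sketch:

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException, ResolverStyle}

object SmartDates {
  // SMART resolution validates day-of-month against the outer range 1..31,
  // then clamps an in-range but invalid day to the month's last valid day.
  private val fmt =
    DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.SMART)

  def smartParse(s: String): Option[LocalDate] =
    try Some(LocalDate.parse(s, fmt))
    catch { case _: DateTimeParseException => None }

  def main(args: Array[String]): Unit = {
    println(smartParse("2022-01-32")) // None: 32 is outside 1..31 for any month
    println(smartParse("9808-02-30")) // Some(9808-02-29): clamped (9808 is a leap year)
    println(smartParse("2022-06-31")) // Some(2022-06-30): clamped
  }
}
```

Under this rule, `2022-01-32` cannot be resolved at all (hence `null`), while `9808-02-30` and `2022-06-31` are clamped to the last day of their months, which reproduces the observed `9808-02-29` result.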