
[SPARK-32205][SQL] Writing timestamp to mysql should be datetime type #29043

Closed

Conversation

TJX2014
Contributor

@TJX2014 TJX2014 commented Jul 8, 2020

What changes were proposed in this pull request?

  1. Change the JDBC type mapping for Spark's TimestampType from TIMESTAMP to DATETIME in org.apache.spark.sql.jdbc.MySQLDialect#getJDBCType.
  2. Add a unit test via org.apache.spark.sql.test.SQLTestUtils#test.

Why are the changes needed?

Writing a Spark timestamp to MySQL should support the '1000-01-01 00:00:00' to '9999-12-31 23:59:59' range
(see https://dev.mysql.com/doc/refman/5.7/en/datetime.html),
so the column type in MySQL should be DATETIME rather than TIMESTAMP.
Before this patch, when the table is auto-created with a TIMESTAMP column:
sql("select cast('1111-01-01 00:00:01' as timestamp)").toDF("ts").write.mode("append").jdbc("jdbc:mysql://localhost:3306/test", "ts_test3", prop)
we get an exception:
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect datetime value: '1111-01-01 00:00:01' for column 'ts' at row

Does this PR introduce any user-facing change?

Yes. After this patch, users can insert timestamps in the '1000-01-01 00:00:00' to '9999-12-31 23:59:59' range into an auto-created MySQL DATETIME column rather than a TIMESTAMP column, because MySQL's TIMESTAMP type does not cover Spark's timestamp range.

How was this patch tested?

Unit test.

@AmplabJenkins

Can one of the admins verify this patch?

Comment on lines +52 to +56
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
// For more details, please see
// https://dev.mysql.com/doc/refman/5.7/en/datetime.html
case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
case _ => None
Contributor Author


We need DATETIME in MySQL rather than TIMESTAMP.

@TJX2014
Contributor Author

TJX2014 commented Jul 10, 2020

@srowen Could you please help review this PR? :)

@huaxingao
Contributor

@TJX2014 It seems MySQL's TIMESTAMP is affected by the time_zone setting but DATETIME is not (https://www.tech-recipes.com/rx/22599/mysql-datetime-vs-timestamp-data-type/). I think you will lose the time zone info if you use DATETIME for TIMESTAMP.

@srowen
Member

srowen commented Jul 10, 2020

Yeah DATETIME is definitely not the same thing as TIMESTAMP. You need to start with a Spark date-time type, logically.

@TJX2014
Contributor Author

TJX2014 commented Jul 11, 2020

@huaxingao @srowen Sure, DATETIME and TIMESTAMP both seem to take 8 bytes, while DATE takes only 4; the only difference in MySQL seems to be the time zone conversion.

@srowen
Member

srowen commented Jul 11, 2020

... that is a very significant difference in what the 8 bytes mean!

@TJX2014
Contributor Author

TJX2014 commented Jul 11, 2020

@srowen @huaxingao So could we use DATETIME in Spark's MySQL dialect? The time zone conversion seems to be MySQL's concern, because the zone is already taken into account when we write the timestamp to MySQL.

@huaxingao
Contributor

@TJX2014 Actually, regardless of time zone, mapping timestamp to datetime in the MySQL dialect doesn't seem to solve your problem.

Suppose you create a table with a timestamp column:
CREATE TABLE test (c timestamp);
You want to insert the timestamp '1111-01-01 00:00:01' into TABLE test.

Before your fix, when inserting '1111-01-01 00:00:01' into table test, the MySQL JDBC driver range-checks the value and throws an exception, because '1111-01-01 00:00:01' is not a valid timestamp (MySQL TIMESTAMP has a range of '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC).

After your fix, since you change the data type to datetime, the MySQL JDBC driver doesn't throw an exception ('1111-01-01 00:00:01' is a valid datetime value), but when the value reaches table test, the real type of column c is still timestamp, so MySQL will throw an exception because '1111-01-01 00:00:01' is not a valid timestamp value.

I don't have a MySQL database and JDBC driver locally to test this, but in theory it works this way. You may want to check whether you can actually insert the timestamp '1111-01-01 00:00:01'.

@TJX2014
Contributor Author

TJX2014 commented Jul 11, 2020

@huaxingao
Thank you for your detailed response.

Actually, TimestampType in Spark has the range '0001-01-01T00:00:00.000000Z' to '9999-12-31T23:59:59.999999Z' (see org.apache.spark.sql.types.TimestampType, 8 bytes), so the corresponding type in MySQL should be DATETIME (8 bytes), not TIMESTAMP, which takes only 4 bytes and has the smaller range '1970-01-01 00:00:01.000000' to '2038-01-19 03:14:07.999999'.

In a real test:
We currently cannot insert '1111-01-01 00:00:01' from Spark into a TIMESTAMP column in MySQL because it is out of range,
but we can insert the same '1111-01-01 00:00:01' into a DATETIME column, since DATETIME's range contains TIMESTAMP's range in MySQL.
See: https://dev.mysql.com/doc/refman/5.7/en/datetime.html
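The range mismatch above can be checked with a small sketch in plain Scala (the bounds are the ones quoted in this thread; java.time is used only for illustration, no MySQL connection is needed):

```scala
import java.time.LocalDateTime

// Ranges as quoted in the discussion (Spark TimestampType vs. MySQL 5.7 types)
val sparkRange     = (LocalDateTime.parse("0001-01-01T00:00:00"),
                      LocalDateTime.parse("9999-12-31T23:59:59"))
val mysqlTimestamp = (LocalDateTime.parse("1970-01-01T00:00:01"),
                      LocalDateTime.parse("2038-01-19T03:14:07"))
val mysqlDatetime  = (LocalDateTime.parse("1000-01-01T00:00:00"),
                      LocalDateTime.parse("9999-12-31T23:59:59"))

// True when v falls inside the inclusive range r
def fits(v: LocalDateTime, r: (LocalDateTime, LocalDateTime)): Boolean =
  !v.isBefore(r._1) && !v.isAfter(r._2)

// The value from the PR description: valid for Spark and DATETIME, not for TIMESTAMP
val ts = LocalDateTime.parse("1111-01-01T00:00:01")
```

With these bounds, `fits(ts, mysqlTimestamp)` is false while `fits(ts, mysqlDatetime)` is true, which is exactly the truncation failure the PR description shows.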

@TJX2014
Contributor Author

TJX2014 commented Jul 11, 2020

Yeah DATETIME is definitely not the same thing as TIMESTAMP. You need to start with a Spark date-time type, logically.

Hi, @srowen
Thank you for your response.
The range of the date-time type in Spark is [0001-01-01, 9999-12-31], which ignores hours, minutes, seconds, and so on; is that what you mean?

@srowen
Member

srowen commented Jul 11, 2020

I think the point is, you would lose time zone information writing as DATETIME, right?

@TJX2014
Contributor Author

TJX2014 commented Jul 12, 2020

Hi @srowen, the zone info follows MySQL's local time zone; it seems we should not treat this as losing the time zone, because we already take the zone into account when writing. The difference between DATETIME and TIMESTAMP is the time zone conversion, which is part of MySQL's design, rather than a loss of information.

@TJX2014 TJX2014 closed this Jul 12, 2020
@TJX2014 TJX2014 reopened this Jul 12, 2020
@huaxingao
Contributor

@TJX2014
I am still not convinced that it's a good idea to map TimeStamp to DateTime in MySQLDialect. Here is an example similar to yours:

sql("select cast('1970-01-01 00:00:01' as timestamp)").toDF("ts").write.mode("append").jdbc("jdbc:mysql://localhost:3306/test", "ts_test3",prop)

Since the user explicitly casts the string to a timestamp, I would think the user wants to insert '1970-01-01 00:00:01' as a TimeStamp data type. Suppose the current time zone on the MySQL server is America/Los_Angeles. After the data is inserted into MySQL, if the user changes the time zone setting:
SET TIME_ZONE = "america/new_york";
then the user would expect to get '1970-01-01 03:00:01' when retrieving the timestamp. But with this patch we silently change the data type from TimeStamp to DateTime, so after inserting the timestamp and changing the time zone setting, the user will still get '1970-01-01 00:00:01', which is not the correct value.
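The TIMESTAMP behavior described above can be illustrated with plain java.time (a sketch using the instant and zones from the example; no MySQL involved):

```scala
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// '1970-01-01 00:00:01' written while the session zone is America/Los_Angeles
val inserted = ZonedDateTime.of(1970, 1, 1, 0, 0, 1, 0,
  ZoneId.of("America/Los_Angeles"))

// TIMESTAMP semantics: the stored instant is fixed, so reading it back after
// SET TIME_ZONE = 'America/New_York' renders the same instant 3 hours later
val asTimestampInNY =
  inserted.toInstant.atZone(ZoneId.of("America/New_York")).format(fmt)

// DATETIME semantics: the literal wall-clock value is kept; changing the
// session zone has no effect on what is read back
val asDatetime = inserted.toLocalDateTime.format(fmt)
```

Here `asTimestampInNY` is "1970-01-01 03:00:01" while `asDatetime` stays "1970-01-01 00:00:01", matching the two behaviors contrasted above.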

@TJX2014
Contributor Author

TJX2014 commented Jul 12, 2020

@huaxingao
Yeah, this could happen when a user goes through the MySQL client from Spark and executes SET TIME_ZONE = "america/new_york" to use MySQL's time zone conversion between DATETIME and TIMESTAMP. If that setting is made outside of Spark, it is a session-level parameter and will have no effect on Spark's writes. If this is a problem, it seems it should be solved on the MySQL side.

@srowen srowen closed this Aug 16, 2020