[SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column #21834
Conversation
Test build #93381 has finished for PR 21834 at commit

ping
```scala
case LogicalRelation(JDBCRelation(_, parts, _), _, _, _) =>
  val whereClauses = parts.map(_.asInstanceOf[JDBCPartition].whereClause).toSet
  assert(whereClauses === Set(
    """"T" < '2018-07-15 20:50:32.5' or "T" is null""",
```
Add a test case to OracleIntegrationSuite.scala? I am afraid this does not work.
ok
@maropu Do we have a log message that lets users see the generated where clauses? If not, could you add one?

Currently, no. Is it ok if the log level is
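For illustration, a minimal sketch of what such a log line could look like, assuming INFO level and slf4j; the exact message and level are whatever the PR settled on:

```scala
import org.slf4j.LoggerFactory

// Hypothetical sketch: log the generated WHERE clauses once so users can
// see how the table was split across partitions.
object PartitionLogger {
  private val log = LoggerFactory.getLogger(getClass)

  def logClauses(whereClauses: Seq[String]): Unit =
    log.info(s"Number of partitions: ${whereClauses.length}, " +
      s"WHERE clauses of these partitions: ${whereClauses.mkString(", ")}")
}
```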
Test build #93532 has finished for PR 21834 at commit

retest this please

Test build #93537 has finished for PR 21834 at commit

Test build #93542 has finished for PR 21834 at commit
My only concern is the time zone issues. However, they will not affect correctness since the logical partitioning covers the whole range anyway. The worst case is that the partitioning might not start and end on the expected boundaries due to local/remote time zone differences, if any exist. Since this is a new feature, we can address the issues and improve the solution in the following releases.

retest this please

LGTM pending tests

Test build #93778 has finished for PR 21834 at commit

Thanks! Merged to master.

Thanks for the merge!
## What changes were proposed in this pull request?

This pr supported Date/Timestamp in a JDBC partition column (only a numeric column is supported in master). This pr also modified code to verify the partition column type:

```
val jdbcTable = spark.read
  .option("partitionColumn", "text")
  .option("lowerBound", "aaa")
  .option("upperBound", "zzz")
  .option("numPartitions", 2)
  .jdbc("jdbc:postgresql:postgres", "t", options)

// with this pr
org.apache.spark.sql.AnalysisException: Partition column type should be numeric, date, or timestamp, but string found.;
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.verifyAndGetNormalizedPartitionColumn(JDBCRelation.scala:165)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:85)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:317)

// without this pr
java.lang.NumberFormatException: For input string: "aaa"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Long.parseLong(Long.java:589)
  at java.lang.Long.parseLong(Long.java:631)
  at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
```

Closes apache#19999

## How was this patch tested?

Added tests in `JDBCSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#21834 from maropu/SPARK-22814.
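To make the new behavior concrete, here is a hedged usage sketch; the table name `events`, the column `created_at`, and the connection details are made up for illustration:

```scala
import java.util.Properties

// Assumed: a SparkSession `spark` and a Postgres table "events" with a
// timestamp column "created_at". With this PR, lowerBound/upperBound may be
// date/timestamp literals when partitionColumn is a date/timestamp column.
val options = new Properties()
val df = spark.read
  .option("partitionColumn", "created_at")
  .option("lowerBound", "2018-07-01 00:00:00")
  .option("upperBound", "2018-07-15 00:00:00")
  .option("numPartitions", 4)
  .jdbc("jdbc:postgresql:postgres", "events", options)
```

Each of the four partitions then reads a disjoint time slice of `created_at`, so the scans can run in parallel.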
```scala
// Converts a user-supplied bound string into Spark's internal Long
// representation: days since epoch for DateType, microseconds for TimestampType.
private def toInternalBoundValue(value: String, columnType: DataType): Long = columnType match {
  case _: NumericType => value.toLong
  case DateType => DateTimeUtils.fromJavaDate(Date.valueOf(value)).toLong
  case TimestampType => DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(value))
}
```
@maropu The `Timestamp.valueOf` method expects a timestamp in the format `yyyy-[m]m-[d]d hh:mm:ss[.f...]`. Was it selected intentionally, or just because it is the default pattern for `Timestamp`? For example, it cannot parse time zones as `Cast` can:

```
Timestamp.valueOf("1973-02-27 02:30:00.102030+01:00")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
  at java.sql.Timestamp.valueOf(Timestamp.java:251)
```
I think it'd be nice to accept more general patterns for date/timestamp. Can you make a pr to fix that?
Yes, I am going to prepare a PR for that.
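For reference, a minimal sketch of the kind of leniency being discussed; `parseBound` is a hypothetical helper, not the follow-up PR's actual code. It tries the JDBC escape format first and falls back to ISO-8601 with a zone offset:

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime

// Hypothetical lenient bound parser: accept the pattern Timestamp.valueOf
// understands, then fall back to ISO-8601 with an explicit offset.
def parseBound(value: String): Timestamp =
  try Timestamp.valueOf(value)
  catch {
    case _: IllegalArgumentException =>
      Timestamp.from(OffsetDateTime.parse(value.replace(' ', 'T')).toInstant)
  }

parseBound("2018-07-15 20:50:32.5")            // default pattern, parses directly
parseBound("1973-02-27 02:30:00.102030+01:00") // zone offset, via the fallback
```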
Thanks!
@gatorsmile @maropu this currently does not work with pyspark due to these lines: `python/pyspark/sql/readwriter.py`, lines 563 to 564 at commit d9bcacf. It tries to convert `lowerBound` and `upperBound` to `int`. The resulting traceback is:

I think just removing the

EDIT: looks like the pyspark docs explicitly say to use an integer column at the moment.
When I chose INSERTION_DATE as my partitionColumn with the dates below, I get the error: ORA-01861: literal does not match format string

@shatestest

@MaxGekk thanks for the reply, but I get the same error with `.option("dateFormat", "yyyy-MM-dd")`.

@SparkQA, may I know how you tested partitionColumn for date/timestamp? Any sample test cases?

There are tests in the PR, see https://github.com/apache/spark/pull/21834/files#diff-5e0cadf526662f9281aa26315b3750adR442. @shatestest Could you open a JIRA ticket and provide a reproducible example, please?
If anyone finds themselves here looking to do this in pyspark, here is a workaround until support for this feature is added:

```python
import datetime
from itertools import tee

def date_range(start, end, intv):
    diff = (end - start) / intv
    for i in range(intv):
        yield start + diff * i
    yield end

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

partition_column = 'mypartitioncol'
now = datetime.datetime.now(datetime.timezone.utc)
num_partitions = sc.defaultParallelism
lower_bound = now + datetime.timedelta(-30)
upper_bound = now

predicates = []
for start, end in pairwise(date_range(lower_bound, upper_bound, num_partitions)):
    predicates.append(f'{partition_column} >= \'{start.isoformat()}\' AND {partition_column} < \'{end.isoformat()}\'')

df = (spark.read.jdbc(
    url='myjdbcuri',
    table='mytable',
    predicates=predicates,
    properties={'driver': 'org.postgresql.Driver'}))
```

Associated JIRA comment.
@MaxGekk sorry for the delay. I raised a ticket: https://issues.apache.org/jira/browse/SPARK-27723. Please let me know if you need any more info.