[Protocol, Spark] UTC normalize timestamp partition values #3378
Conversation
@@ -377,6 +378,13 @@ object DeltaFileFormatWriter extends LoggingShims {
    }
  }

class PartitionedTaskAttemptContextImpl(conf: Configuration,
Needed a way to actually send the types of the partition columns as part of the task context; that's important for distinguishing the timestamp vs. timestamp_ntz cases in this change.
If there's another way to get the partition column type details without this addition, that'd be awesome, but I didn't see anywhere else to get these details.
@@ -67,15 +67,15 @@ sealed trait TimestampFormatter extends Serializable {

class Iso8601TimestampFormatter(
    pattern: String,
    timeZone: TimeZone,
    timeZone: ZoneId,
I don't think changing this poses a compatibility concern as long as the various `apply` methods remain compatible (my Scala is a bit rusty, so let me know, but I'm unable to initialize this from outside anyway)?
As for why the change:
Spark SQL supports setting any zone ID, like "UTC-08:00" etc. However, if one tries to convert this into the legacy Java `TimeZone` type via `TimeZone.getTimeZone(ZoneId.of("UTC-08:00"))`, then the time zone actually ends up being just "UTC", because the legacy `TimeZone` API can't handle the offset -08:00. So passing a `ZoneId` avoids a correctness issue there.
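The offset-dropping behavior described here can be reproduced with plain JDK APIs, independent of the Spark/Delta code (a standalone sketch; the class name is illustrative):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.util.TimeZone;

public class ZoneIdVsTimeZone {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("UTC-08:00");
        // The ZoneId keeps the -08:00 offset in its rules.
        System.out.println(zone.getRules().getOffset(Instant.EPOCH)); // -08:00

        // Converting to the legacy TimeZone API silently falls back to GMT,
        // because "UTC-08:00" is not a valid legacy custom ID ("GMT-08:00" is).
        TimeZone tz = TimeZone.getTimeZone(zone);
        System.out.println(tz.getID() + " " + tz.getRawOffset()); // GMT 0
    }
}
```

The legacy API documents this fallback: an ID it cannot understand yields the GMT zone, so the -08:00 offset is lost rather than rejected.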
Force-pushed from 80125b9 to 8abff59
@@ -117,7 +117,8 @@ object PartitionSpec {

private[delta] object PartitionUtils {

  val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.S]"
  lazy val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.S]"
  lazy val utcFormatter = TimestampFormatter("yyyy-MM-dd HH:mm:ss.SSSSSSz", ZoneId.of("UTC"))
Normalizes to microseconds like in Delta kernel:
delta/kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/PartitionUtils.java
Line 45 in 573a57f
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
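As a sketch of what that kernel pattern produces, the fixed six-digit fraction always emits microsecond precision, zero-padded (plain `java.time`; the class name is illustrative):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MicrosecondFormat {
    public static void main(String[] args) {
        // Six fixed fractional digits: sub-second values are zero-padded,
        // so the serialized form always carries microsecond precision.
        DateTimeFormatter micros = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        System.out.println(micros.format(LocalDateTime.of(1970, 1, 1, 0, 0, 0, 123_456_000)));
        // 1970-01-01 00:00:00.123456
        System.out.println(micros.format(LocalDateTime.of(1970, 1, 1, 0, 0, 0)));
        // 1970-01-01 00:00:00.000000
    }
}
```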
Force-pushed from 954568b to 40c9bf4
PROTOCOL.md
Outdated
Note: A `timestamp` value in a partition value may be stored in one of the following ways:
1. Without a timezone, where the timestamp should be interpreted using the time zone of the system which wrote to the table.
2. Adjusted to UTC, where the partition value must have the suffix "UTC".

It is highly recommended that modern writers adjust the timestamp value to UTC as outlined in 2.
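A minimal sketch of option 2, assuming a `java.time`-style formatter: the `z` pattern letter renders the short zone name, which for the `UTC` zone region is the literal string `UTC`, matching the suffixed examples in the protocol table (illustrative class name):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class UtcSuffixFormat {
    public static void main(String[] args) {
        // 'z' appends the short zone name; for the "UTC" zone region that
        // is the literal "UTC", giving the suffix required by option 2.
        DateTimeFormatter f =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSz", Locale.ROOT);
        ZonedDateTime t =
            ZonedDateTime.of(2024, 6, 14, 4, 0, 0, 123_456_000, ZoneId.of("UTC"));
        System.out.println(f.format(t)); // 2024-06-14 04:00:00.123456UTC
    }
}
```

Note this relies on using `ZoneId.of("UTC")` rather than `ZoneOffset.UTC`, which would render as `Z` instead.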
If we'd like to do the protocol changes separately, happy to do so, just let me know @scottsand-db @allisonport-db @vkorukanti @lzlfred
@@ -67,7 +70,7 @@ class DelayedCommitProtocol(
  // since there's no guarantee the stats will exist.
  @transient val addedStatuses = new ArrayBuffer[AddFile]

  val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.S]"
  val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.SSSSSS][.S]"
This also addresses an issue with microsecond partitions; not that I'd recommend identity partitioning on microsecond-precision timestamps, given the extremely small files it produces, but it is technically supported by the Delta protocol.
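Spark's `TimestampFormatter` is backed by `java.time` patterns under the hood, so the optional-section behavior of `[.SSSSSS][.S]` can be sketched with a plain `DateTimeFormatter` (illustrative only, not the actual Delta code path):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class OptionalFractionParse {
    public static void main(String[] args) {
        // Bracketed sections are optional: exactly six fractional digits,
        // exactly one digit, or no fraction at all will parse.
        DateTimeFormatter f =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSSSSS][.S]");
        System.out.println(LocalDateTime.parse("2024-06-14 04:00:00.123456", f).getNano()); // 123456000
        System.out.println(LocalDateTime.parse("2024-06-14 04:00:00.1", f).getNano());      // 100000000
        System.out.println(LocalDateTime.parse("2024-06-14 04:00:00", f).getNano());        // 0
    }
}
```

Each `S` run is fixed-width, so the first optional section only matches a full six-digit microsecond fraction; `[.S]` preserves the old single-digit behavior.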
Let's make sure this is explicitly mentioned in the PR description.
@@ -164,7 +183,8 @@ class DelayedCommitProtocol(
   */
  override def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
    val partitionValues = dir.map(parsePartitions).getOrElse(Map.empty[String, String])
    val partitionValues = dir.map(dir => parsePartitions(dir, taskContext))
Scala esthetics nit. You could use currying:
val partitionValues = dir.map(dir => parsePartitions(dir, taskContext))
val partitionValues = dir.map(parsePartitions(taskContext))
And, then the signature of the parsePartitions would look like:
protected def parsePartitions(taskContext: TaskAttemptContext)(dir: String): Map[String, String] = {
You could, but then we'd have to adjust the signature of the called function just to support currying...
  .map(l => Cast(l, StringType).eval())
  .map(Option(_).map(_.toString).orNull))
  .toMap
  .columnNames
I think we want to minimize the number of unrelated changes
@@ -67,7 +70,7 @@ class DelayedCommitProtocol(
  // since there's no guarantee the stats will exist.
  @transient val addedStatuses = new ArrayBuffer[AddFile]

  val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.S]"
  val timestampPartitionPattern = "yyyy-MM-dd HH:mm:ss[.SSSSSS][.S]"
Let's make sure this is explicitly mentioned in the PR description.
// TODO: enable validatePartitionColumns?
val utcTimestampPartitionValues = taskContext.getConfiguration.getBoolean(
This sounds like it's the actual values. Let's prefix with "use" to make it clear that it's a boolean.
val utcTimestampPartitionValues = taskContext.getConfiguration.getBoolean(
val useUtcTimestampPartitionValues = taskContext.getConfiguration.getBoolean(
val dateFormatter = DateFormatter()
val timestampFormatter =
  TimestampFormatter(timestampPartitionPattern, java.util.TimeZone.getDefault)
  TimestampFormatter(timestampPartitionPattern, sessionTimeZone)
This is actually a behavior change as well. The JVM timezone doesn't need to be the same as the session timezone. Given that the old behavior is bad with or without this change, I would suggest leaving it as-is. Then at least the fallback (disabling the config) will work to get back to exactly the old bad behavior, instead of some subtly different bad behavior. :)
That's a great point. I'll leave it as is. I think arguably it should've been session timezone to begin with, but again since it's already released like this for a while, we shouldn't introduce any new behavior.
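To illustrate why the JVM default zone and the session zone can disagree, here is a standalone `java.time` sketch (the "session" zone below is an arbitrary example, not read from any Spark config):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.TimeZone;

public class JvmVsSessionZone {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        Instant instant = Instant.parse("2024-06-14T04:00:00Z");

        // The JVM default zone and a Spark session zone (spark.sql.session.timeZone)
        // are configured independently, so these two strings need not agree.
        String viaJvmDefault = f.withZone(TimeZone.getDefault().toZoneId()).format(instant);
        String viaSessionZone = f.withZone(ZoneId.of("America/Los_Angeles")).format(instant);

        System.out.println(viaJvmDefault);   // depends on the machine's default zone
        System.out.println(viaSessionZone);  // 2024-06-13 21:00:00 (PDT is UTC-7 in June)
    }
}
```

Keeping `TimeZone.getDefault` here preserves the exact legacy output, which is what the fallback config is meant to restore.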
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw DeltaErrors.partitionColumnCastFailed(
    columnValue.toString, dataType.toString, columnName)
if (dataType == DataTypes.TimestampType && utcNormalizeTimestamp) {
Why don't we just pass the `utcFormatter` into `inferPartitionColumnValue`? Heck, why are we passing a `timestampFormatter` and a boolean here when we could just pass in the UTC formatter through `timestampFormatter` from the outside? (If there is a fundamental reason for this, then it needs at least a code comment that explains why we need this extra complexity.)
I think this ended up being answered in #3378 (comment), but let me know if this part is still unclear. Essentially, the `timestampFormatter` here is for reading the partition string values, and the boolean specifies how to format the output partition values.
@@ -164,7 +183,8 @@ class DelayedCommitProtocol(
   */
  override def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
    val partitionValues = dir.map(parsePartitions).getOrElse(Map.empty[String, String])
    val partitionValues = dir.map(dir => parsePartitions(dir, taskContext))
You could, but then we'd have to adjust the signature of the called function just to support currying...
val parsedPartition =
  PartitionUtils
    .parsePartition(
      new Path(dir),
      typeInference = false,
      Set.empty,
      Map.empty,
      partitionCols,
This is an excellent situation where named arguments help with readability. E.g. in the line above (`Set.empty`) I have no idea what we're opting out of. Even here, I'm not sure what `partitionCols` does. (It's actually `userSpecifiedDataTypes`.) It would also help to rename `partitionCols` to `partitionColTypes`, to make it clear what it represents.

We could also use a comment that explains why we need to override the types here. I'm starting to think that this PartitionUtils is really weird because it tries to infer the column type from the data. This code seems to have been borrowed from Spark, where the inference might have served a purpose because the values are parsed from the path. But in Delta we always know the exact data type...

Oh boy. I did some further digging. It seems that we're actually using this only for `newTaskTempFile`, which is used to generate the file name for a new file. And that one gets a path that Spark already pre-generated that includes the partition key values, which we then reverse-engineer and reconstitute to create the real path. And then we also store those values into the partitions that we record in the Delta log, via `addedFiles`.

It would be really useful to add that context to the code here as a comment. It's totally unclear that this is the key point where we determine what the partition keys are for a file...

(FWIW, given that context, it would make sense to me to always pass in the partition columns explicitly, so that we no longer rely on the weird inference logic. At least then this change makes some sort of sense. It would be great if we did not rely on that inference logic anywhere. But I think that's a larger change that we shouldn't do in the same PR.)
+100. I'll add some comments explaining all of this, since it's definitely not an intuitive API to work with, and in the end we probably want to move away from using it, since we know the exact partition column data types in Delta.
val parsedPartition =
  PartitionUtils
    .parsePartition(
      new Path(dir),
      typeInference = false,
      Set.empty,
      Map.empty,
      partitionCols,
      validatePartitionColumns = false,
      java.util.TimeZone.getDefault,
Interestingly, this is the timezone that is to be used to parse the input. It is worth a comment that this is different from the output formatting, and that we expect Spark to produce the partitions using the JVM default timezone. (This probably means that Spark has the same issue with partition columns on any other table.)
I've commented this section in more detail!
PROTOCOL.md
Outdated
@@ -1754,13 +1754,16 @@ Type | Serialization Format
string | No translation required
numeric types | The string representation of the number
date | Encoded as `{year}-{month}-{day}`. For example, `1970-01-01`
timestamp | Encoded as `{year}-{month}-{day} {hour}:{minute}:{second}` or `{year}-{month}-{day} {hour}:{minute}:{second}.{microsecond}` For example: `1970-01-01 00:00:00`, or `1970-01-01 00:00:00.123456`
timestamp | Encoded as `{year}-{month}-{day} {hour}:{minute}:{second}` or `{year}-{month}-{day} {hour}:{minute}:{second}.{microsecond}` For example: `1970-01-01 00:00:00`, or `1970-01-01 00:00:00.123456` or `2024-06-14 04:00:00UTC` or `2024-06-14 04:00:00.123456UTC`
plz add the Z postfix to the protocol.
@@ -19,21 +19,24 @@ package org.apache.spark.sql.delta.files
// scalastyle:off import.ordering.noEmptyLine
import java.net.URI
import java.util.UUID
I would like to ask to revert all the import ordering changes as those are not related to this feature.
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, StringType, TimestampType}

import java.time.ZoneId
java import should go to very top.
protected def parsePartitions(dir: String): Map[String, String] = {
// TODO: timezones?
protected def parsePartitions(dir: String,
    taskContext: TaskAttemptContext): Map[String, String] = {
plz make the indentation 4 spaces.
val utcTimestampPartitionValues = taskContext.getConfiguration.getBoolean(
  DeltaSQLConf.UTC_TIMESTAMP_PARTITION_VALUES.key, true) &&
  taskContext.isInstanceOf[PartitionedTaskAttemptContextImpl] &&
  taskContext.asInstanceOf[PartitionedTaskAttemptContextImpl]
plz avoid as/isInstanceOf; in almost all cases they can be replaced by pattern matching.
val utcTimestampPartitionValues = taskContext match {
  case context: PartitionedTaskAttemptContextImpl if DeltaSQLConf.... =>
  case _ => Map.empty
}
@@ -377,6 +378,13 @@ object DeltaFileFormatWriter extends LoggingShims {
    }
  }

class PartitionedTaskAttemptContextImpl(conf: Configuration,
    taskId: TaskAttemptID,
indentation.
@@ -17,7 +17,6 @@
package org.apache.spark.sql.delta.files

import scala.collection.mutable.ListBuffer
revert changes in this file.
Force-pushed from faf26d9 to 78f1a38
Force-pushed from 78f1a38 to a181752
This change depends on the protocol changes from #3398.
Force-pushed from 61b5669 to ae1cd95
…to UTC and Spark writes these adjusted values
Force-pushed from ae1cd95 to 4028268
LGTM !!!
Which Delta project/connector is this regarding?
Description
Currently, in the Delta protocol, timestamps are not stored with their time zone. This leads to unexpected behavior when querying across systems configured with different time zones (e.g. different Spark sessions). In Spark, for instance, the timestamp value is adjusted to the Spark session time zone and written to the Delta log partition values without any time zone marker. If someone then queries the same timestamp from a session with a different time zone, partition pruning can fail to surface results.
What this change proposes for the Delta Lake protocol is to allow timestamp partition values to be adjusted to UTC and explicitly stored in partition values with a "UTC" suffix. The original approach is still supported for compatibility, but it is recommended that newer writers write with the UTC suffix.
This is also important for Iceberg Uniform conversion, because Iceberg timestamps must be UTC-adjusted. Now that we have a well-defined format for UTC in Delta, we can convert string partition values to Iceberg longs so that Uniform conversion succeeds.
This change updates the Spark-Delta integration to write out the UTC-adjusted values for timestamp types.
This also addresses an issue with microsecond partitions, where previously microsecond-precision partitioning (not recommended, but technically allowed) would not work and values would be truncated to seconds.
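To make the pruning issue concrete, here is a hypothetical sketch (plain `java.time`, no Delta code) showing how the same instant serializes differently per writer session zone under the old scheme, but identically under UTC normalization:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class UtcNormalizationSketch {
    public static void main(String[] args) {
        Instant written = Instant.parse("2024-06-14T04:00:00Z");
        DateTimeFormatter legacy = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        // Old scheme: the partition string depends on the writer's session zone,
        // so a reader in another zone can prune away files that actually match.
        String fromTokyo = legacy.withZone(ZoneId.of("Asia/Tokyo")).format(written);
        String fromLosAngeles = legacy.withZone(ZoneId.of("America/Los_Angeles")).format(written);
        System.out.println(fromTokyo);       // 2024-06-14 13:00:00
        System.out.println(fromLosAngeles);  // 2024-06-13 21:00:00

        // New scheme: normalize to UTC with an explicit "UTC" suffix, so every
        // session serializes the same instant to the same partition value.
        DateTimeFormatter utc = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSz", Locale.ROOT)
            .withZone(ZoneId.of("UTC"));
        System.out.println(utc.format(written)); // 2024-06-14 04:00:00.000000UTC
    }
}
```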
How was this patch tested?
Added unit tests for the following cases:
1.) UTC timestamp partition values round trip across different session TZ
2.) A delta log with a mix of Non-UTC and UTC partition values round trip across the same session TZ
3.) Timestamp without timezone (timestamp_ntz) round trips across timezones (kind of a tautology, but important to make sure that timestamp_ntz does not unintentionally get written as a UTC timestamp)
4.) Timestamp round trips across same session time zone: UTC normalized
5.) Timestamp round trips across same session time zone: session time normalized (this case worked before this change, so it's important that it keeps working after this change)
6.) Mix of microsecond/second-level precision and dates before the epoch (to test that everything works with negative values)
Does this PR introduce any user-facing changes?
Yes, in the sense that new timestamp partition values will be written as normalized UTC values.