[SPARK-46070][SQL] Compile regex pattern in SparkDateTimeUtils.getZoneId outside the hot loop #43976

tanelk · 2023-11-23T10:19:35Z

What changes were proposed in this pull request?

Compile the regex patterns used in SparkDateTimeUtils.getZoneId outside of the method, which can be called for each dataset row.

Why are the changes needed?

String.replaceFirst internally does Pattern.compile(regex).matcher(this).replaceFirst(replacement). Pattern.compile is a very expensive method that should not be called in a loop.

When using a method like from_utc_timestamp with a non-literal timezone, SparkDateTimeUtils.getZoneId is called for each row. In one of my use cases, adding from_utc_timestamp increased the runtime from 15 min to 6 h.
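
As a rough illustration of the point above, the sketch below contrasts the two call styles. The object name `ReplaceFirstSketch`, its method names, and the replacement string are assumptions made for this example, not code from the patch; only the regex is taken from the diff.

```scala
import java.util.regex.Pattern

object ReplaceFirstSketch {
  // Regex taken from the diff; the replacement string is an assumption.
  // In the replacement, "$10$2:" means group 1, a literal 0, group 2, then ":".
  private val regex = "(\\+|\\-)(\\d):"
  private val replacement = "$10$2:"

  // Compiles the regex on every call, because String.replaceFirst
  // runs Pattern.compile(regex) internally each time.
  def perCall(s: String): String = s.replaceFirst(regex, replacement)

  // Compiled once when the object is initialized, then reused by every call.
  private val compiled: Pattern = Pattern.compile(regex)

  def precompiled(s: String): String =
    compiled.matcher(s).replaceFirst(replacement)
}
```

Both methods should return the same string for the same input; the only intended difference is that `precompiled` skips the per-call `Pattern.compile`, which matters when the method runs once per row.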

Does this PR introduce any user-facing change?

Performance improvement.

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

No

MaxGekk · 2023-11-23T11:14:15Z

sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala

+  final val tzRegexShort = Pattern.compile("(\\+|\\-)(\\d):")
+  final val tzRegexLong = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")


Could you assign more specific names like:

Suggested change

-  final val tzRegexShort = Pattern.compile("(\\+|\\-)(\\d):")
-  final val tzRegexLong = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")
+  final val singleHourTz = Pattern.compile("(\\+|\\-)(\\d):")
+  final val singleMinuteTz = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")

The terms Short and Long are slightly confusing since they may not reflect the actual tz string, for example:
Long: 08:3
Short: 8:30:00
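
To make the suggested names concrete, here is a minimal sketch of how the two patterns could be applied. The wrapper object `TzNormalizeSketch`, the `normalize` helper, and both replacement strings are hypothetical; only the patterns and their names come from the suggestion above.

```scala
import java.util.regex.Pattern

object TzNormalizeSketch {
  // Matches a sign followed by a single-digit hour, e.g. the "+8:" in "+8:30:00".
  final val singleHourTz = Pattern.compile("(\\+|\\-)(\\d):")
  // Matches a sign, a two-digit hour, and a single-digit minute at the end, e.g. "+08:3".
  final val singleMinuteTz = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")

  // Hypothetical helper: zero-pads the hour first, then the minute.
  // Both replacement strings are assumptions made for this sketch.
  def normalize(tz: String): String = {
    val hourPadded = singleHourTz.matcher(tz).replaceFirst("$10$2:")
    singleMinuteTz.matcher(hourPadded).replaceFirst("$1$2:0$3")
  }
}
```

Under these assumptions, an input with a single-digit hour is only touched by `singleHourTz`, and an input with a single-digit minute is only touched by `singleMinuteTz`, regardless of how long the whole string is.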

MaxGekk

Waiting for CI.

MaxGekk · 2023-11-23T15:29:48Z

+1, LGTM. Merging to master.
Thank you, @tanelk.

Compile regex pattern outside the hot loop

0877057

github-actions bot added the SQL label Nov 23, 2023

Fix scalastyle

6b9afa6

MaxGekk reviewed Nov 23, 2023

Better variable names

f7f0004

MaxGekk approved these changes Nov 23, 2023

MaxGekk closed this in 2407066 Nov 23, 2023
