Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46070][SQL] Compile regex pattern in SparkDateTimeUtils.getZoneId outside the hot loop #43976

Closed
wants to merge 3 commits into from

Conversation

tanelk
Copy link
Contributor

@tanelk tanelk commented Nov 23, 2023

What changes were proposed in this pull request?

Compile the regex patterns used in SparkDateTimeUtils.getZoneId outside of the method, that can be called for each dataset row..

Why are the changes needed?

String.replaceFirst internally does Pattern.compile(regex).matcher(this).replaceFirst(replacement). Pattern.compile is very expensive method, that should not be called in a loop.

When using method like from_utc_timestamp with non-literal timezone, the SparkDateTimeUtils.getZoneId is called for each loop. In one of my usecases adding from_utc_timestamp increased the runtime from 15min to 6h.

Does this PR introduce any user-facing change?

Performance improvement.

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 23, 2023
Comment on lines 40 to 41
final val tzRegexShort = Pattern.compile("(\\+|\\-)(\\d):")
final val tzRegexLong = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you assign more specific names like:

Suggested change
final val tzRegexShort = Pattern.compile("(\\+|\\-)(\\d):")
final val tzRegexLong = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")
final val singleHourTz = Pattern.compile("(\\+|\\-)(\\d):")
final val singleMinuteTz = Pattern.compile("(\\+|\\-)(\\d\\d):(\\d)$")

Short and Long term slightly confuses sine they could not reflect actual tz for example:
Long: 08:3
Short: 8:30:00

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for CI.

@MaxGekk
Copy link
Member

MaxGekk commented Nov 23, 2023

+1, LGTM. Merging to master.
Thank you, @tanelk.

@MaxGekk MaxGekk closed this in 2407066 Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants