[SPARK-48584][SQL] Perf improvement for unescapePathName by yaooqinn · Pull Request #46938 · apache/spark

yaooqinn · 2024-06-11T11:06:24Z

What changes were proposed in this pull request?

This PR improves the performance of unescapePathName. The algorithm, briefly:

If a path contains no '%', or its first '%' is at a position > path.length - 2, we return the original string itself instead of creating a new StringBuilder and appending char by char.
Otherwise, we loop with two indices: plaintextStartIdx, which starts at 0 and afterwards points to the char following each resolved %xx, and plaintextEndIdx, which points to the next '%'. plaintextStartIdx moves to plaintextEndIdx + 3 if %xx is valid, or to plaintextEndIdx + 1 if %xx is invalid.
Instead of using Integer.parseInt with error capture, we identify the high and low characters manually.
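
The three points above can be sketched roughly as follows. This is a minimal illustration written from the description, not the code merged in this PR: the object name `UnescapeSketch` and the helper `hexValue` are hypothetical, and the fast-path check simply mirrors the condition quoted in the review thread.

```scala
object UnescapeSketch {
  // Hypothetical helper: numeric value of an ASCII hex digit, or -1 if c is not one.
  private def hexValue(c: Char): Int =
    if (c >= '0' && c <= '9') c - '0'
    else if (c >= 'a' && c <= 'f') c - 'a' + 10
    else if (c >= 'A' && c <= 'F') c - 'A' + 10
    else -1

  def unescapePathName(path: String): String = {
    if (path == null || path.isEmpty) return path
    val length = path.length
    var plaintextEndIdx = path.indexOf('%')
    // Fast path: no '%' at all, or the first '%' is the last character.
    // The input is returned as-is and no StringBuilder is allocated.
    if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > length) return path

    val sb = new java.lang.StringBuilder(length)
    var plaintextStartIdx = 0
    // Only enter the loop while a full "%xx" still fits in the string.
    while (plaintextEndIdx != -1 && plaintextEndIdx + 2 < length) {
      // Copy the plain text between the previous escape and this '%'.
      sb.append(path, plaintextStartIdx, plaintextEndIdx)
      val high = hexValue(path.charAt(plaintextEndIdx + 1))
      val low = hexValue(path.charAt(plaintextEndIdx + 2))
      if (high != -1 && low != -1) {
        // Valid %xx: emit the decoded char and skip all three characters.
        sb.append(((high << 4) | low).toChar)
        plaintextStartIdx = plaintextEndIdx + 3
      } else {
        // Invalid %xx: keep the '%' literally and continue right after it.
        sb.append('%')
        plaintextStartIdx = plaintextEndIdx + 1
      }
      plaintextEndIdx = path.indexOf('%', plaintextStartIdx)
    }
    // Copy whatever plain text remains after the last processed '%'.
    sb.append(path, plaintextStartIdx, length)
    sb.toString
  }
}
```

Checking the two hex digits by hand avoids both the substring allocation and the exception handling that an Integer.parseInt-based version would need for malformed input.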

Why are the changes needed?

performance improvement for hotspots

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests in ExternalCatalogUtilsSuite
Benchmark results (9-11x faster)

Was this patch authored or co-authored using generative AI tooling?

no

yaooqinn · 2024-06-12T02:06:49Z

cc @dongjoon-hyun @cloud-fan @LuciferYang @JoshRosen thanks

beliefer · 2024-06-12T07:28:39Z

Could you provide some micro benchmark?

yaooqinn · 2024-06-12T07:33:05Z

> Could you provide some micro benchmark?

What do you mean by some micro benchmark? Are sql/catalyst/benchmarks/EscapePathBenchmark-results.txt and sql/catalyst/benchmarks/EscapePathBenchmark-jdk21-results.txt not sufficient?

beliefer · 2024-06-12T07:35:48Z

Oh, I see.

LuciferYang · 2024-06-12T07:43:53Z

It seems that there is a typo in the pr description:

Instead of using Integer.parseInt with error capture, we identify the high and low chararaters manually.

chararaters -> characters

yaooqinn · 2024-06-12T08:03:11Z

Thank you @LuciferYang

LuciferYang

+1, LGTM
Thanks @yaooqinn

beliefer · 2024-06-12T08:07:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

+    }
+    var plaintextEndIdx = path.indexOf('%')
+    val length = path.length
+    if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length) {


How about if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length || path.lastIndexOf('%') == 0) ?

> || path.lastIndexOf('%')

I'm sorry, I'm having a little trouble understanding your suggestion. Can you please clarify?

yaooqinn · 2024-06-12T08:41:36Z

Thank you @LuciferYang, I added a nit in the last commit and passed the test locally, so I didn't wait for the CI.

Merged to master

### What changes were proposed in this pull request?
This PR follows up #46938 and improves `unescapePathName`.

### Why are the changes needed?
Improve `unescapePathName` by cutting off the slow path.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46957 from beliefer/SPARK-48584_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: beliefer <beliefer@163.com>

…ue (#8793) GlutenURLDecoder.java is copied from OpenJDK and is licensed under GPL v2, which belongs to Category X, so we can't have it in Apache releases. URLDecoder decode/encode is not fully compatible with the Hive catalog path escaping/unescaping, which Spark also follows. Besides, apache/spark#46938 has improved unescapePathName's speed on the Spark side by ~10x, so this PR also helps Gluten gain performance when handling datasets with large numbers of partitions.
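
To illustrate what "not fully compatible" can mean in practice, here is a small sketch of two well-known behaviors of `java.net.URLDecoder`. These may not be the exact cases the Gluten change had in mind; the object name `UrlDecoderDifferences`, the helper `decodeForm`, and the sample strings are made up for the example.

```scala
import java.net.URLDecoder
import scala.util.Try

object UrlDecoderDifferences {
  // URLDecoder implements HTML form decoding, so besides resolving %xx
  // it also turns every '+' into a space.
  def decodeForm(s: String): String = URLDecoder.decode(s, "UTF-8")

  def main(args: Array[String]): Unit = {
    // '+' is rewritten to ' ', which a percent-only unescape would leave alone.
    println(decodeForm("name=a+b%2Fc")) // prints "name=a b/c"

    // A malformed escape makes URLDecoder throw IllegalArgumentException,
    // whereas the unescapePathName described in apache/spark#46938 keeps
    // an invalid %xx as literal text.
    println(Try(decodeForm("name=50%zz")).isFailure) // prints "true"
  }
}
```

A partition value that legitimately contains '+' or a stray '%' would therefore round-trip differently through the two decoders.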

[SPARK-48584][SQL]Perf improvement for unescapePathName

33eea6f

github-actions bot added the SQL label Jun 11, 2024

[SPARK-48584][SQL]Perf improvement for unescapePathName

afc6f28

yaooqinn changed the title ~~[SPARK-48584][SQL]Perf improvement for unescapePathName~~ [SPARK-48584][SQL] Perf improvement for unescapePathName Jun 11, 2024

LuciferYang approved these changes Jun 12, 2024

beliefer reviewed Jun 12, 2024

nit

70a5e4f

yaooqinn closed this in da81d8e Jun 12, 2024

yaooqinn deleted the SPARK-48584 branch June 12, 2024 08:40

beliefer mentioned this pull request Jun 12, 2024

[SPARK-48584][SQL][FOLLOWUP] Improve the unescapePathName. #46957

Closed

yaooqinn mentioned this pull request Feb 20, 2025

[WIP][VL] Fix inconsistency issue of PartitionFile path unescaping & GPL issue apache/incubator-gluten#8793

Merged

yaooqinn wants to merge 3 commits into apache:master from yaooqinn:SPARK-48584

3 participants
