[SPARK-48584][SQL] Perf improvement for unescapePathName #46938
yaooqinn wants to merge 3 commits into apache:master
cc @dongjoon-hyun @cloud-fan @LuciferYang @JoshRosen thanks

Could you provide some micro benchmark?
What do you mean by some micro benchmark? Are sql/catalyst/benchmarks/EscapePathBenchmark-results.txt and sql/catalyst/benchmarks/EscapePathBenchmark-jdk21-results.txt not sufficient?

Oh, I see.

It seems that there is a typo in the PR description: chararaters -> characters

Thank you @LuciferYang
LuciferYang left a comment
+1, LGTM
Thanks @yaooqinn
    }
    var plaintextEndIdx = path.indexOf('%')
    val length = path.length
    if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length) {
How about `if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length || path.lastIndexOf('%') == 0)`?
|| path.lastIndexOf('%')
I'm sorry, I'm having a little trouble understanding your suggestion. Can you please clarify?

Thank you @LuciferYang, I added a nit in the last commit and passed the test locally, so I didn't wait for the CI. Merged to master.
### What changes were proposed in this pull request?
This PR follows up #46938 and improves `unescapePathName`.
### Why are the changes needed?
Improve `unescapePathName` by cutting off the slow path.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
GA.
### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46957 from beliefer/SPARK-48584_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: beliefer <beliefer@163.com>
…ue (#8793)
GlutenURLDecoder.java is copied from OpenJDK and is under GPL v2, which belongs to Category X, so we can't have it in Apache releases. URLDecoder decode/encode is not fully compatible with the Hive catalog path escaping/unescaping, which Spark also follows. Besides, apache/spark#46938 has improved unescapePathName's speed on the Spark side by ~10x, so this PR also helps Gluten gain performance when handling datasets with large partition numbers.
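The incompatibility mentioned above can be illustrated with a small Scala sketch. It only exercises `java.net.URLDecoder` from the JDK; the wrapper object `DecoderContrast` and its `urlDecode` helper are hypothetical names introduced for this example. The contrast drawn in the comments assumes that path unescaping, as described in this PR, rewrites only valid `%xx` sequences and leaves everything else untouched.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark or Gluten.
object DecoderContrast {
  // Wraps URLDecoder.decode so that malformed input shows up as a Left
  // instead of an exception.
  def urlDecode(s: String): Either[String, String] =
    try Right(URLDecoder.decode(s, StandardCharsets.UTF_8.name()))
    catch { case e: IllegalArgumentException => Left(String.valueOf(e.getMessage)) }

  def main(args: Array[String]): Unit = {
    // URLDecoder treats '+' as an encoded space; a path unescaper that only
    // looks at %xx sequences would keep the '+' as-is.
    println(urlDecode("a+b"))
    // URLDecoder rejects a malformed escape, whereas the algorithm described
    // in this PR keeps an invalid %xx as literal text.
    println(urlDecode("a%zzb"))
    // A well-formed escape is handled the same way by both approaches.
    println(urlDecode("a%3Db"))
  }
}
```

Partition values that contain `+` or a stray `%` are where the two decoders would be expected to diverge, which is one reason to reuse the catalog's own unescaping rather than a general URL decoder.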
What changes were proposed in this pull request?
This PR improves perf for unescapePathName with an algorithm briefly described as follows:
- If the path contains no '%', or its first '%' is at position > path.length-2, we return the original identity instead of creating a new StringBuilder to append char by char.
- Otherwise, we track two indices: plaintextStartIdx, which starts from 0 and then points to the next char after resolving %xx, and plaintextEndIdx, which points to the next '%'.
- plaintextStartIdx moves to plaintextEndIdx + 3 if %xx is valid, or moves to plaintextEndIdx + 1 if %xx is invalid.
Why are the changes needed?
performance improvement for hotspots
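
The two-index scan described under "What changes were proposed in this pull request?" can be sketched in Scala roughly as follows. This is a minimal illustration written from that description, not the merged Spark code: the names `UnescapeSketch`, `unescapePathNameSketch`, and `hexValue` are hypothetical, and the manual hex-digit check is an assumption about how a `%xx` pair might be validated.

```scala
// Hypothetical sketch of the described algorithm; not the Spark implementation.
object UnescapeSketch {
  // Value of an ASCII hex digit, or -1 if c is not one (assumed validation rule).
  private def hexValue(c: Char): Int =
    if (c >= '0' && c <= '9') c - '0'
    else if (c >= 'a' && c <= 'f') c - 'a' + 10
    else if (c >= 'A' && c <= 'F') c - 'A' + 10
    else -1

  def unescapePathNameSketch(path: String): String = {
    if (path == null || path.isEmpty) return path
    val length = path.length
    var plaintextEndIdx = path.indexOf('%')
    if (plaintextEndIdx == -1 || plaintextEndIdx > length - 2) {
      // Fast path: no '%', or the first '%' is too close to the end.
      // Return the same String instance without allocating a builder.
      path
    } else {
      val sb = new java.lang.StringBuilder(length)
      var plaintextStartIdx = 0
      // Only a '%' followed by at least two chars can start a %xx escape.
      while (plaintextEndIdx != -1 && plaintextEndIdx + 2 < length) {
        // Copy the plain text between the previous escape and this '%'.
        sb.append(path, plaintextStartIdx, plaintextEndIdx)
        val high = hexValue(path.charAt(plaintextEndIdx + 1))
        val low = hexValue(path.charAt(plaintextEndIdx + 2))
        if (high != -1 && low != -1) {
          // Valid %xx: emit the decoded char and skip all three chars.
          sb.append(((high << 4) | low).toChar)
          plaintextStartIdx = plaintextEndIdx + 3
        } else {
          // Invalid %xx: keep the '%' literally and move one char forward.
          sb.append('%')
          plaintextStartIdx = plaintextEndIdx + 1
        }
        plaintextEndIdx = path.indexOf('%', plaintextStartIdx)
      }
      // Copy whatever plain text remains after the last handled '%'.
      sb.append(path, plaintextStartIdx, length)
      sb.toString
    }
  }
}
```

Under these assumptions, `UnescapeSketch.unescapePathNameSketch("a%3Db")` yields `a=b`, while an input such as `a%zzb` or `100%` comes back unchanged. Copying whole plain-text runs with `append(CharSequence, start, end)` rather than char by char is the part of the description most likely to matter for long partition paths.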
Does this PR introduce any user-facing change?
no
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
no