-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions #37113
Conversation
cc: @cloud-fan , I'm not sure why we don't support url encode\decode function as built-in function, but if we intentionally don't, i will close this title. |
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType} | ||
import org.apache.spark.unsafe.types.UTF8String | ||
|
||
trait UrlBase extends UnaryExpression with CodegenFallback with ImplicitCastInputTypes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the file name should be urlExpressions
@yikf Please run |
thanks~ i will address those comment if this pr is necessary |
do other databases provide these functions? |
Both Trino & Presto have URL-related URL functions, including url_decode/url_encode, reference doc: https://trino.io/docs/current/functions/url.html, However, I can't find a function like url_decode in Hive or mysql, etc. maybe we have to use reflect function instead in hive, It's a bit of a hassle, And often these functions are useful. |
since = "3.4.0", | ||
group = "url_funcs") | ||
// scalastyle:on line.size.limit | ||
case class UrlEncode(child: Expression) extends UrlBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make them extend RuntimeReplaceable
, and use StaticInvoke
as the real expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, please take a look when you have a time, thanks
Can one of the admins verify this patch? |
@yikf Could you update the pr description because I found this pr also includes some refactoring work |
done |
// scalastyle:off line.size.limit | ||
@ExpressionDescription( | ||
usage = """ | ||
_FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does {@code application/x-www-form-urlencoded}
really work? Can you run DESC FUNCTION url_encode
and give a screenshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means it does not work... The SQL doc can't interpret {@code ...}
. Can we just use plain string?
.... into 'application/x-www-form-urlencoded' format ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry i misread this meaning, updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for a minor comment
|
||
object UrlCodec { | ||
def encode(src: UTF8String, enc: UTF8String): UTF8String = { | ||
UTF8String.fromString(URLEncoder.encode(src.toString, enc.toString)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use java.net.URLEncoder
now.
I wonder if org.apache.commons.codec.net.URLCodec
and java.net.URLEncoder
have the same semantics?
Will org.apache.commons.codec.net.URLCodec
have better performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I write a micro bench as follows:
def testEncode(url: String, valuesPerIteration: Int): Unit = {
import org.apache.commons.codec.net.URLCodec
val benchmark = new Benchmark("Test encode", valuesPerIteration, output = output)
benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
java.net.URLEncoder.encode(url, "UTF-8")
}
}
val codec = new URLCodec()
benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
codec.encode(url)
}
}
benchmark.run()
}
def testDecode(value: String, valuesPerIteration: Int): Unit = {
import org.apache.commons.codec.net.URLCodec
val benchmark = new Benchmark("Test decode", valuesPerIteration, output = output)
benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
java.net.URLDecoder.decode(value, "UTF-8")
}
}
val codec = new URLCodec()
benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
for (_ <- 0L until valuesPerIteration) {
codec.decode(value)
}
}
benchmark.run()
}
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val valuesPerIteration = 100000
// Test Contains
testEncode("https://spark.apache.org", valuesPerIteration)
testDecode("https%3A%2F%2Fspark.apache.org", valuesPerIteration)
}
and run it use GA, the result as follows:
Java 8
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 32 33 1 3.1 324.3 1.0X
Use org.apache.commons.codec.net.URLCodec 29 29 0 3.4 291.4 1.1X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 53 54 1 1.9 530.0 1.0X
Use org.apache.commons.codec.net.URLCodec 37 37 0 2.7 365.5 1.5X
Java 11
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 25 28 2 4.0 251.6 1.0X
Use org.apache.commons.codec.net.URLCodec 22 25 1 4.6 219.5 1.1X
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 31 31 1 3.3 306.2 1.0X
Use org.apache.commons.codec.net.URLCodec 26 29 2 3.8 260.4 1.2X
Java 17
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 31 32 1 3.2 308.3 1.0X
Use org.apache.commons.codec.net.URLCodec 26 29 2 3.8 262.6 1.2X
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 40 44 3 2.5 402.0 1.0X
Use org.apache.commons.codec.net.URLCodec 31 33 1 3.2 311.0 1.3X
``` ~~
`org.apache.commons.codec.net.URLCodec` seems a little quick than `java.net.URLEncoder` from benchmark result, so should we change to use `org.apache.commons.codec.net.URLCodec` in this pr? @cloud-fan @Yikf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the above bench test encode
twice, need re-run, please ignore it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update result:
Java 8
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 40 41 1 2.5 404.9 1.0X
Use org.apache.commons.codec.net.URLCodec 36 38 1 2.8 358.5 1.1X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 38 39 1 2.6 379.0 1.0X
Use org.apache.commons.codec.net.URLCodec 31 32 2 3.2 312.8 1.2X
Java 11
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 26 26 0 3.9 256.5 1.0X
Use org.apache.commons.codec.net.URLCodec 24 24 1 4.2 240.5 1.1X
OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 22 22 0 4.6 217.5 1.0X
Use org.apache.commons.codec.net.URLCodec 18 18 0 5.6 179.0 1.2X
Java 17
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 29 33 2 3.4 290.1 1.0X
Use org.apache.commons.codec.net.URLCodec 31 35 2 3.3 305.3 1.0X
OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test decode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder 22 26 2 4.5 223.1 1.0X
Use org.apache.commons.codec.net.URLCodec 25 29 2 4.0 252.3 0.9X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like java.net.URLEncoder
is the future (Java 17)? Assuming that Spark will get to Java 17 eventually, I'd vote for java.net.URLEncoder
to avoid any breaking change when we want to switch to java.net.URLEncoder
in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @LuciferYang for benchmark test result, Seems like the benchmark did not diff much and for the reason that Spark will bump JDK and avoid any breaking change, java.net.URLEncoder
is a better fit, what do you think @LuciferYang , @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
@LuciferYang @cloud-fan Any other comments please? |
friendly ping @cloud-fan |
thanks, merging to master! |
thank you all |
What changes were proposed in this pull request?
Currently, Spark don't support url encode/decode as built-in functions, the user might use reflect instead, It's a bit of a hassle, And often these functions are useful.
This pr aims to two points as follow:
Why are the changes needed?
url encode/decode functions are useful
Does this PR introduce any user-facing change?
yes, add new function as built-in function
How was this patch tested?
add new tests