Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions #37113

Closed
wants to merge 2 commits into from

Conversation

yikf
Copy link
Contributor

@yikf yikf commented Jul 7, 2022

What changes were proposed in this pull request?

Currently, Spark don't support url encode/decode as built-in functions, the user might use reflect instead, It's a bit of a hassle, And often these functions are useful.

This pr aims to two points as follow:

  • add url encode/decode as built-in function support.
  • tidy up url-related functions to one scala file

Why are the changes needed?

url encode/decode functions are useful

Does this PR introduce any user-facing change?

yes, add new function as built-in function

How was this patch tested?

add new tests

@github-actions github-actions bot added the SQL label Jul 7, 2022
@yikf
Copy link
Contributor Author

yikf commented Jul 7, 2022

cc: @cloud-fan , I'm not sure why we don't support url encode\decode function as built-in function, but if we intentionally don't, i will close this title.

import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

trait UrlBase extends UnaryExpression with CodegenFallback with ImplicitCastInputTypes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the file name should be urlExpressions

@LuciferYang
Copy link
Contributor

@yikf Please run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite" to regenerating the result file and update in this pr.

@yikf
Copy link
Contributor Author

yikf commented Jul 7, 2022

@yikf Please run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite" to regenerating the result file and update in this pr.

thanks~ i will address those comment if this pr is necessary

@yikf yikf closed this Jul 7, 2022
@yikf yikf reopened this Jul 7, 2022
@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 8, 2022

do other databases provide these functions?

@yikf
Copy link
Contributor Author

yikf commented Jul 8, 2022

do other databases provide these functions?

Both Trino & Presto have URL-related URL functions, including url_decode/url_encode, reference doc: https://trino.io/docs/current/functions/url.html, However, I can't find a function like url_decode in Hive or mysql, etc. maybe we have to use reflect function instead in hive, It's a bit of a hassle, And often these functions are useful.

since = "3.4.0",
group = "url_funcs")
// scalastyle:on line.size.limit
case class UrlEncode(child: Expression) extends UrlBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make them extend RuntimeReplaceable, and use StaticInvoke as the real expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, please take a look when you have a time, thanks

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@yikf yikf changed the title [WIP] Supports url encode/decode function [SPARK-39741][SQL] Support url encode/decode as built-in function Jul 11, 2022
@yikf yikf marked this pull request as ready for review July 11, 2022 07:50
@LuciferYang
Copy link
Contributor

@yikf Could you update the pr description because I found this pr also includes some refactoring work

@yikf yikf changed the title [SPARK-39741][SQL] Support url encode/decode as built-in function [SPARK-39741][SQL] Support url encode/decode as built-in function and tidy up url-related functions Jul 11, 2022
@yikf
Copy link
Contributor Author

yikf commented Jul 11, 2022

@yikf Could you update the pr description because I found this pr also includes some refactoring work

done

// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """
_FUNC_(str) - Translates a string into {@code application/x-www-form-urlencoded} format using a specific encoding scheme.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does {@code application/x-www-form-urlencoded} really work? Can you run DESC FUNCTION url_encode and give a screenshot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks it work normally,

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means it does not work... The SQL doc can't interpret {@code ...}. Can we just use plain string?

.... into 'application/x-www-form-urlencoded' format ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry i misread this meaning, updated

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for a minor comment


object UrlCodec {
def encode(src: UTF8String, enc: UTF8String): UTF8String = {
UTF8String.fromString(URLEncoder.encode(src.toString, enc.toString))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use java.net.URLEncoder now.
I wonder if org.apache.commons.codec.net.URLCodec and java.net.URLEncoder have the same semantics?
Will org.apache.commons.codec.net.URLCodec have better performance?

Copy link
Contributor

@LuciferYang LuciferYang Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I write a micro bench as follows:

def testEncode(url: String, valuesPerIteration: Int): Unit = {
    import org.apache.commons.codec.net.URLCodec

    val benchmark = new Benchmark("Test encode", valuesPerIteration, output = output)

    benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
      for (_ <- 0L until valuesPerIteration) {
        java.net.URLEncoder.encode(url, "UTF-8")
      }
    }

    val codec = new URLCodec()
    benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
      for (_ <- 0L until valuesPerIteration) {
        codec.encode(url)
      }
    }
    benchmark.run()
  }

  def testDecode(value: String, valuesPerIteration: Int): Unit = {
    import org.apache.commons.codec.net.URLCodec

    val benchmark = new Benchmark("Test decode", valuesPerIteration, output = output)

    benchmark.addCase("Use java.net.URLEncoder") { _: Int =>
      for (_ <- 0L until valuesPerIteration) {
        java.net.URLDecoder.decode(value, "UTF-8")
      }
    }

    val codec = new URLCodec()
    benchmark.addCase("Use org.apache.commons.codec.net.URLCodec") { _: Int =>
      for (_ <- 0L until valuesPerIteration) {
        codec.decode(value)
      }
    }
    benchmark.run()
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

    val valuesPerIteration = 100000

    // Test Contains
    testEncode("https://spark.apache.org", valuesPerIteration)
    testDecode("https%3A%2F%2Fspark.apache.org", valuesPerIteration)
  }

and run it use GA, the result as follows:

Java 8

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               32             33           1          3.1         324.3       1.0X
Use org.apache.commons.codec.net.URLCodec             29             29           0          3.4         291.4       1.1X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               53             54           1          1.9         530.0       1.0X
Use org.apache.commons.codec.net.URLCodec             37             37           0          2.7         365.5       1.5X

Java 11

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               25             28           2          4.0         251.6       1.0X
Use org.apache.commons.codec.net.URLCodec             22             25           1          4.6         219.5       1.1X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               31             31           1          3.3         306.2       1.0X
Use org.apache.commons.codec.net.URLCodec             26             29           2          3.8         260.4       1.2X

Java 17

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               31             32           1          3.2         308.3       1.0X
Use org.apache.commons.codec.net.URLCodec             26             29           2          3.8         262.6       1.2X

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               40             44           3          2.5         402.0       1.0X
Use org.apache.commons.codec.net.URLCodec             31             33           1          3.2         311.0       1.3X
``` ~~

`org.apache.commons.codec.net.URLCodec` seems a little quick than `java.net.URLEncoder` from benchmark result, so should we change to use `org.apache.commons.codec.net.URLCodec` in this pr? @cloud-fan @Yikf 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the above bench test encode twice, need re-run, please ignore it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update result:

Java 8

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               40             41           1          2.5         404.9       1.0X
Use org.apache.commons.codec.net.URLCodec             36             38           1          2.8         358.5       1.1X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test decode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               38             39           1          2.6         379.0       1.0X
Use org.apache.commons.codec.net.URLCodec             31             32           2          3.2         312.8       1.2X

Java 11

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               26             26           0          3.9         256.5       1.0X
Use org.apache.commons.codec.net.URLCodec             24             24           1          4.2         240.5       1.1X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Test decode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               22             22           0          4.6         217.5       1.0X
Use org.apache.commons.codec.net.URLCodec             18             18           0          5.6         179.0       1.2X

Java 17

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test encode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               29             33           2          3.4         290.1       1.0X
Use org.apache.commons.codec.net.URLCodec             31             35           2          3.3         305.3       1.0X

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Test decode:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Use java.net.URLEncoder                               22             26           2          4.5         223.1       1.0X
Use org.apache.commons.codec.net.URLCodec             25             29           2          4.0         252.3       0.9X

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like java.net.URLEncoder is the future (Java 17)? Assuming that Spark will get to Java 17 eventually, I'd vote for java.net.URLEncoder to avoid any breaking change when we want to switch to java.net.URLEncoder in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @LuciferYang for benchmark test result, Seems like the benchmark did not diff much and for the reason that Spark will bump JDK and avoid any breaking change, java.net.URLEncoder is a better fit, what do you think @LuciferYang , @cloud-fan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @cloud-fan

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@yikf
Copy link
Contributor Author

yikf commented Jul 14, 2022

@LuciferYang @cloud-fan Any other comments please?

@LuciferYang
Copy link
Contributor

friendly ping @cloud-fan

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e5c1b82 Jul 15, 2022
@yikf
Copy link
Contributor Author

yikf commented Jul 15, 2022

thank you all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants