[SPARK-16281][SQL] Implement parse_url SQL function #14008

Closed
wants to merge 19 commits

Conversation

Contributor

janplus commented Jul 1, 2016

What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of #13999

How was this patch tested?

Passes the existing tests, including new test cases.
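
For illustration, typical calls look like this (a sketch following the Hive semantics this PR mirrors; the expected result is shown on the line after each query):

spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org
spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
query=1
spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
1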

Contributor Author

janplus commented Jul 1, 2016

cc @rxin and @cloud-fan
Improvements for performance concerns

@@ -285,6 +285,7 @@ object FunctionRegistry {
     expression[StringTrimLeft]("ltrim"),
     expression[JsonTuple]("json_tuple"),
     expression[FormatString]("printf"),
+    expression[ParseUrl]("parse_url"),
Contributor

this should go before printf
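For reference, the registration block would then read (a sketch of the reordered lines from the hunk above):

    expression[StringTrimLeft]("ltrim"),
    expression[JsonTuple]("json_tuple"),
    expression[ParseUrl]("parse_url"),
    expression[FormatString]("printf"),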

Contributor Author

OK, thank you for the review. I'll fix this.

Contributor

rxin commented Jul 1, 2016

@dongjoon-hyun can you help review this one?

@dongjoon-hyun
Member

Oh. Sure. @rxin

@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO\n"
    + "key specifies which query to extract\n"
Member

Hi, @janplus.
There is a limitation in the Scala 2.10 compiler: for extended, "+" breaks the build.
Please use one single """ """ string, like SubstringIndex: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L498
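For illustration, the annotation rewritten as a single triple-quoted string (a sketch; content taken from the usage/extended text quoted above):

@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
    key specifies which query to extract""")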

Contributor Author

Hi, @dongjoon-hyun.
Thank you for the review. I'll fix this.

@dongjoon-hyun
Member

Hi, @janplus.
I've done a first pass.
Thank you for doing this.

Contributor Author

janplus commented Jul 1, 2016

@rxin and @dongjoon-hyun, thanks for your review.
I have added a new commit that does the following:

  1. Put parse_url function in the right order.
  2. Use """ """ instead of + in the extended part to work with Scala 2.10.
  3. Remove unnecessary lazys.
  4. Correct REGEXPREFIX and add a new null test case.
  5. Use NonFatal(_) instead of the specified exception.
  6. Fix the indentation problems.

I have tried not to use varargs, but a separate constructor that accepts two args does not help, as there is no magic key that would make parse_url(url, partToExtract, magic key) be treated as parse_url(url, partToExtract).
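
For context, a Seq[Expression] constructor can still enforce the two-vs-three argument arity at analysis time; a sketch along these lines (the exact error message is illustrative):

override def checkInputDataTypes(): TypeCheckResult = {
  if (children.size > 3 || children.size < 2) {
    TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments")
  } else {
    super.checkInputDataTypes()
  }
}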

if (url == null || partToExtract == null) {
  null
} else {
  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
Contributor

is this optimization mainly for when the url is literal?

Contributor Author

Yes, and also for when the url column contains many repeated values.

Contributor

You can follow XPathBoolean to optimize for the literal case.

Contributor Author

Though we match on the url string, the main purpose is to cache the URL object.
Since we must handle the exceptions caused by invalid urls, the approach of XPathBoolean does not seem suitable.

}
}

def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
Member

Could you make this private?

Contributor Author

OK

Contributor Author

janplus commented Jul 8, 2016

cc @cloud-fan @rxin @liancheng
I did an optimization for the Literal part case, so we don't need to check it for every row. But since we may not assume the part is a Literal in all circumstances, I keep the result null when the part is invalid.

'query=1'
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
'1'""")
case class ParseUrl(children: Seq[Expression])
Contributor

Again, we should not use Seq[Expression] here. We should just have a 3-arg ctor, and then add a 2-arg ctor.

Contributor

Then we should think of a good default value for the 3rd argument. We should avoid using null, as we assume in many places that an expression's children won't be null. How about using the empty string as the default value for key?

Contributor Author

As I explained before, I can hardly find a magic key that would let us treat parse_url(url, part, magic key) as parse_url(url, part). I have doubts about the empty string, e.g.

hive> select parse_url("http://spark/path?=1", "QUERY", "");
1

hive> select parse_url("http://spark/path?=1", "QUERY");
=1

Any suggestion on this?

Contributor

Well, I don't have a strong preference here; Seq[Expression] doesn't look so bad to me. @rxin what do you think?

Contributor

What if we use # as the default value and check on that? It is not a valid URL key, is it?

Contributor

Anyway, I don't have a super strong preference here either. It might be clearer not to use a hacky # value.

Contributor Author

Yes, # is not a valid URL key. And I agree with you on not using a hacky value.

> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
'1'""")
case class ParseUrl(children: Seq[Expression])
  extends Expression with ImplicitCastInputTypes with CodegenFallback {
Contributor

Here -- I don't think it makes a lot of sense to use ImplicitCastInputTypes, since we are talking about urls. Why don't we just use ExpectsInputTypes?
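
For reference, the suggested change only swaps the mixin on the class header quoted above (ImplicitCastInputTypes extends ExpectsInputTypes and additionally lets the analyzer insert casts to the declared input types):

case class ParseUrl(children: Seq[Expression])
  extends Expression with ExpectsInputTypes with CodegenFallback {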

Contributor Author

I am trying to make Spark's behavior match Hive's as closely as possible,
since Hive does an implicit cast for the key, e.g.

hive> select parse_url("http://spark/path?1=v", "QUERY", 1);
v

Should we keep the same behavior in Spark?

Contributor

I think it's OK in this case to not follow. This function is so esoteric that I doubt people will complain. If they do, we can always add the implicit casting later.

Contributor Author

OK, I'll use ExpectsInputTypes.

Contributor

Actually let's just keep it. Might as well since the code is already written.

Contributor Author

Well, I missed this comment and have already finished the change...

Contributor

oh well this works

Contributor Author

janplus commented Jul 8, 2016

cc @rxin @cloud-fan Thank you for the review.
I added a new commit that does the following:

  1. Use ExpectsInputTypes instead of ImplicitCastInputTypes.
  2. Add some cases for invalid-type parameters.
  3. Code style fixes.

// If the url is a constant, cache the URL object so that we don't need to convert url
// from UTF8String to String to URL for every row.
@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) => if (url ne null) getUrl(url) else null
Contributor

it can be case Literal(url: UTF8String, _) if url != null => getUrl(url)
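
Applying that suggestion, the cached field would read (a sketch; the non-literal fallback arm is assumed):

@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) if url != null => getUrl(url)
  case _ => null
}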

Contributor Author

Oh yes, it's simpler.

@cloud-fan
Contributor

LGTM except one minor comment, thanks for working on it!

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
Contributor Author

janplus commented Jul 8, 2016

cc @cloud-fan Thank you.
I have resolved conflicts with master and done some code style fixes as you suggested.

@cloud-fan
Contributor

retest this please


SparkQA commented Jul 8, 2016

Test build #61983 has finished for PR 14008 at commit 95114ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

janplus commented Jul 8, 2016

It seems it failed the org.apache.spark.sql.sources.CreateTableAsSelectSuite create a table, drop it and create another one with the same name test, which is irrelevant to this PR.
Maybe we should retest this?

@cloud-fan
Contributor

retest this please


SparkQA commented Jul 8, 2016

Test build #61987 has finished for PR 14008 at commit 95114ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 8, 2016

Test build #3173 has finished for PR 14008 at commit 95114ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rxin commented Jul 8, 2016

Thanks - merging in master/2.0.

asfgit closed this in f5fef69 Jul 8, 2016
asfgit pushed a commit that referenced this pull request Jul 8, 2016
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of #13999

## How was this patch tested?

Passes the existing tests, including new test cases.

Author: wujian <jan.chou.wu@gmail.com>

Closes #14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@janplus
Contributor Author

janplus commented Jul 9, 2016

Thanks @rxin @dongjoon-hyun @cloud-fan @liancheng
I've learnt a lot from this PR!

janplus deleted the SPARK-16281 branch July 9, 2016 09:39
@dongjoon-hyun
Member

Congratulations on your first commit, @janplus!
I've learned a lot while watching this PR, too. :)
