[SPARK-16281][SQL] Implement parse_url SQL function #14008
Conversation
cc @rxin and @cloud-fan
@@ -285,6 +285,7 @@ object FunctionRegistry {
     expression[StringTrimLeft]("ltrim"),
     expression[JsonTuple]("json_tuple"),
     expression[FormatString]("printf"),
+    expression[ParseUrl]("parse_url"),
this should go before printf
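That is, the registry entries would be reordered to something like the following sketch (surrounding entries as in the diff above):

```scala
expression[StringTrimLeft]("ltrim"),
expression[JsonTuple]("json_tuple"),
expression[ParseUrl]("parse_url"),
expression[FormatString]("printf"),
```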
OK, thank you for the review. I'll fix this.
@dongjoon-hyun can you help review this one?
Oh. Sure. @rxin
@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO\n"
    + "key specifies which query to extract\n"
Hi, @janplus.
There is a limitation in the Scala 2.10 compiler: in `extended`, the `+` breaks the build.
Please use one single `""" """` string, like `SubstringIndex` (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L498).
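A sketch of the single-string form being suggested, using the examples that appear later in this thread:

```scala
@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
    key specifies which query to extract
    Examples:
      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')
      'query=1'
      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
      '1'""")
```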
Hi, @dongjoon-hyun.
Thank you for the review. I'll fix this.
Hi, @janplus.
@rxin and @dongjoon-hyun Thanks for your review.
I have tried not to use varargs, but a separate constructor that accepts two args does not help, as there isn't a magic key to make …
if (url == null || partToExtract == null) {
  null
} else {
  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
Is this optimization mainly for when the `url` is a literal?
Yes. It helps when the `url` column has many identical values.
You can follow `XPathBoolean` to optimize for the literal case.
Though we compare on the `url` string, the main purpose is to cache the `URL` object.
As we must handle the exceptions caused by invalid URLs, the approach of `XPathBoolean` does not seem suitable.
  }
}
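For illustration, a minimal standalone sketch of the caching idea discussed above — the class and member names here are hypothetical, not the PR's exact code:

```scala
import java.net.{MalformedURLException, URL}

// Remember the last raw url string and its parsed URL object, so rows that
// repeat the same url skip re-parsing. An invalid url is cached as null, so
// repeated invalid values also avoid repeated exception handling.
class LastUrlCache {
  private var lastUrlStr: String = null
  private var lastUrl: URL = null

  def getUrl(urlStr: String): URL = {
    if (lastUrlStr == null || urlStr != lastUrlStr) {
      lastUrlStr = urlStr
      lastUrl =
        try new URL(urlStr)
        catch { case _: MalformedURLException => null }
    }
    lastUrl
  }
}
```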
def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = { |
Could you make this `private`?
OK
cc @cloud-fan @rxin @liancheng
      'query=1'
      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
      '1'""")
case class ParseUrl(children: Seq[Expression])
Again, we should not use `Seq[Expression]` here. We should just have a 3-arg ctor, and then add a 2-arg ctor.
Then we should think of a good default value for the 3rd argument. We should avoid using `null`, as we assume the children of an expression won't be null in a lot of places. How about using an empty string as the default value for `key`?
As I explained before, I can hardly find a magic `key` that would let us treat `parse_url(url, part, magic key)` as `parse_url(url, part)`. I have doubts about the empty string, e.g.:

hive> select parse_url("http://spark/path?=1", "QUERY", "");
1
hive> select parse_url("http://spark/path?=1", "QUERY");
=1

Any suggestions on this?
Well, I don't have a strong preference here; `Seq[Expression]` doesn't look so bad to me. @rxin what do you think?
What if we use `#` as the default value and check on that? It is not a valid URL key, is it?
Anyway, I don't have a super strong preference here either. It might be clearer not to use a hacky `#` value.
Yes, # is not a valid URL key. And I agree with you on not using a hacky value.
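To make the ambiguity concrete, here is a hypothetical standalone sketch (not the PR's Catalyst code; the object name and helpers are invented) that reproduces the Hive behavior quoted above. With part `QUERY`, the two-argument form returns the whole query string, while an empty-string key looks up a parameter whose name is literally `""`:

```scala
import java.net.URL
import java.util.regex.Pattern

object ParseUrlSketch {
  private def extractQuery(url: String): Option[String] =
    try Option(new URL(url).getQuery)
    catch { case _: Exception => None }

  // Two-argument form: QUERY returns the whole query string.
  def parseUrl(url: String, part: String): Option[String] = part match {
    case "QUERY" => extractQuery(url)
    case _       => None // other parts omitted in this sketch
  }

  // Three-argument form: QUERY with a key extracts that parameter's value,
  // using the same (&|^)key=value pattern shape Hive uses.
  def parseUrl(url: String, part: String, key: String): Option[String] =
    parseUrl(url, part).flatMap { query =>
      val m = Pattern.compile("(&|^)" + Pattern.quote(key) + "=([^&]*)").matcher(query)
      if (m.find()) Option(m.group(2)) else None
    }

  def main(args: Array[String]): Unit = {
    println(parseUrl("http://spark/path?=1", "QUERY"))     // Some(=1)
    println(parseUrl("http://spark/path?=1", "QUERY", "")) // Some(1)
  }
}
```

So an empty-string default would silently change the semantics of the two-argument call, which is why no default value works here.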
      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
      '1'""")
case class ParseUrl(children: Seq[Expression])
  extends Expression with ImplicitCastInputTypes with CodegenFallback {
Here, I don't think it makes a lot of sense to use `ImplicitCastInputTypes`, since we are talking about URLs. Why don't we just use `ExpectsInputTypes`?
I am trying to make Spark's behavior mostly like Hive's, and Hive does an implicit cast for `key`, e.g.:

hive> select parse_url("http://spark/path?1=v", "QUERY", 1);
v

Should we keep the same behavior in Spark?
I think it's OK in this case not to follow Hive. This function is so esoteric that I doubt people will complain. If they do, we can always add the implicit casting later.
OK, I'll use ExpectsInputTypes.
Actually let's just keep it. Might as well since the code is already written.
Well, I missed this comment and have already finished the change...
oh well this works
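For context, a skeleton of the shape the thread converges on, assuming the Catalyst API of the Spark 2.0 era (the extraction logic in `eval` is elided). The one-line change under debate is the mixed-in trait: `ExpectsInputTypes` only validates that every child is a string, while `ImplicitCastInputTypes` additionally lets the analyzer insert casts, so an integer `key` would be accepted, as in Hive:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}

case class ParseUrl(children: Seq[Expression])
  extends Expression with ExpectsInputTypes with CodegenFallback {

  override def nullable: Boolean = true
  override def dataType: DataType = StringType
  // Every argument (url, partToExtract, and the optional key) is a string.
  override def inputTypes: Seq[AbstractDataType] = Seq.fill(children.size)(StringType)

  override def eval(input: InternalRow): Any = {
    // extraction logic elided in this sketch
    null
  }
}
```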
cc @rxin @cloud-fan Thank you for the review.
// If the url is a constant, cache the URL object so that we don't need to convert url
// from UTF8String to String to URL for every row.
@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) => if (url ne null) getUrl(url) else null
It can be `case Literal(url: UTF8String, _) if url != null => getUrl(url)`.
Oh yes, it's simpler.
LGTM except one minor comment, thanks for working on it!
Conflicts:
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
  sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
  sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
cc @cloud-fan Thank you.
retest this please
Test build #61983 has finished for PR 14008 at commit …
It seems it failed the …
retest this please
Test build #61987 has finished for PR 14008 at commit …
Test build #3173 has finished for PR 14008 at commit …
Thanks - merging in master/2.0.
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove a Hive fallback. A new implementation of #13999.

## How was this patch tested?

Passed the existing tests, including new test cases.

Author: wujian <jan.chou.wu@gmail.com>

Closes #14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Thanks @rxin @dongjoon-hyun @cloud-fan @liancheng
Congratulations on your first commit, @janplus!
What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove a Hive fallback.
A new implementation of #13999.

How was this patch tested?

Passed the existing tests, including new test cases.
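For reference, a usage sketch of the merged function, assuming a `SparkSession` named `spark`; the expected outputs follow the examples quoted earlier in this thread:

```scala
// Extract individual parts of a URL with the new built-in.
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show()
// spark.apache.org
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY')").show()
// query=1
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show()
// 1
```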