[SPARK-16281][SQL] Implement parse_url SQL function #14008

Closed
wants to merge 19 commits

Conversation

Contributor

janplus commented Jul 1, 2016

What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of #13999

How was this patch tested?

Passes the existing tests, including new test cases.
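
For illustration, typical calls look like this (a sketch following the Hive semantics this PR mirrors; the expected result is shown on the line after each query):

spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org
spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
query=1
spark-sql> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
1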

Contributor Author

janplus commented Jul 1, 2016

cc @rxin and @cloud-fan
Improvements for performance concerns

@@ -285,6 +285,7 @@ object FunctionRegistry {
     expression[StringTrimLeft]("ltrim"),
     expression[JsonTuple]("json_tuple"),
     expression[FormatString]("printf"),
+    expression[ParseUrl]("parse_url"),
Contributor

this should go before printf
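For reference, the registration block would then read (a sketch of the reordered lines from the hunk above):

    expression[StringTrimLeft]("ltrim"),
    expression[JsonTuple]("json_tuple"),
    expression[ParseUrl]("parse_url"),
    expression[FormatString]("printf"),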

Contributor Author

OK, thank you for the review. I'll fix this.

Contributor

rxin commented Jul 1, 2016

@dongjoon-hyun can you help review this one?

@dongjoon-hyun
Member

Oh. Sure. @rxin

@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = "Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO\n"
    + "key specifies which query to extract\n"
Member

Hi, @janplus.
There is a limitation in the Scala 2.10 compiler: for extended, "+" breaks the build.
Please use one single """ """ string, like SubstringIndex: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L498
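For illustration, the annotation rewritten as a single triple-quoted string (a sketch; content taken from the usage/extended text quoted above):

@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO
    key specifies which query to extract""")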

Contributor Author

Hi, @dongjoon-hyun.
Thank you for the review. I'll fix this.

@dongjoon-hyun
Member

Hi, @janplus.
I've done a first pass.
Thank you for doing this.

Contributor Author

janplus commented Jul 1, 2016

@rxin and @dongjoon-hyun, thanks for your review.
I have added a new commit that does the following:

  1. Put parse_url function in the right order.
  2. Use """ """ instead of + in the extended part to work with Scala 2.10.
  3. Remove unnecessary lazys.
  4. Correct REGEXPREFIX and add a new null test case.
  5. Use NonFatal(_) instead of the specified exception.
  6. Fix the indentation problems.

I have tried not to use varargs, but a separate constructor that accepts two args does not help, as there is no magic key that would make parse_url(url, partToExtract, magic key) be treated as parse_url(url, partToExtract).
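
For context, a Seq[Expression] constructor can still enforce the two-vs-three argument arity at analysis time; a sketch along these lines (the exact error message is illustrative):

override def checkInputDataTypes(): TypeCheckResult = {
  if (children.size > 3 || children.size < 2) {
    TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments")
  } else {
    super.checkInputDataTypes()
  }
}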

if (url == null || partToExtract == null) {
  null
} else {
  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
Contributor

is this optimization mainly for when the url is literal?

Contributor Author

Yes, and also for when the url column contains many repeated values.

Contributor

You can follow XPathBoolean to optimize for the literal case.

Contributor Author

Though we match on the url string, the main purpose is to cache the URL object.
Since we must handle the exceptions caused by invalid urls, the approach of XPathBoolean does not seem suitable.

}
}

def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
Member

Could you make this private?

Contributor Author

OK

Contributor Author

janplus commented Jul 8, 2016

cc @cloud-fan @rxin @liancheng
I did an optimization for the Literal part case, so we don't need to check it for every row. But since we may not assume the part is a Literal in all circumstances, I keep the result null when the part is invalid.

'query=1'
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
'1'""")
case class ParseUrl(children: Seq[Expression])
Contributor

Again, we should not use Seq[Expression] here. We should just have a 3-arg ctor, and then add a 2-arg ctor.

Contributor

Then we should think of a good default value for the 3rd argument. We should avoid using null, as we assume in many places that an expression's children won't be null. How about using the empty string as the default value for key?

Contributor Author

As I explained before, I can hardly find a magic key that would let us treat parse_url(url, part, magic key) as parse_url(url, part). I have doubts about the empty string, e.g.

hive> select parse_url("http://spark/path?=1", "QUERY", "");
1

hive> select parse_url("http://spark/path?=1", "QUERY");
=1

Any suggestion on this?

Contributor

Well, I don't have a strong preference here; Seq[Expression] doesn't look so bad to me. @rxin what do you think?

Contributor

What if we use # as the default value and check on that? It is not a valid URL key, is it?

Contributor

Anyway, I don't have a super strong preference here either. It might be clearer not to use a hacky # value.

Contributor Author

Yes, # is not a valid URL key. And I agree with you on not using a hacky value.

> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')
'1'""")
case class ParseUrl(children: Seq[Expression])
  extends Expression with ImplicitCastInputTypes with CodegenFallback {
Contributor

Here -- I don't think it makes a lot of sense to use ImplicitCastInputTypes, since we are talking about urls. Why don't we just use ExpectsInputTypes?
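
For reference, the suggested change only swaps the mixin on the class header quoted above (ImplicitCastInputTypes extends ExpectsInputTypes and additionally lets the analyzer insert casts to the declared input types):

case class ParseUrl(children: Seq[Expression])
  extends Expression with ExpectsInputTypes with CodegenFallback {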

Contributor Author

I am trying to make Spark's behavior match Hive's as closely as possible,
since Hive does an implicit cast for the key, e.g.

hive> select parse_url("http://spark/path?1=v", "QUERY", 1);
v

Should we keep the same behavior in Spark?

Contributor

I think it's OK in this case to not follow. This function is so esoteric that I doubt people will complain. If they do, we can always add the implicit casting later.

Contributor Author

OK, I'll use ExpectsInputTypes.

Contributor

Actually let's just keep it. Might as well since the code is already written.

Contributor Author

Well, I missed this comment and have already finished the change...

Contributor

oh well this works

Contributor Author

janplus commented Jul 8, 2016

cc @rxin @cloud-fan Thank you for the review.
I added a new commit that does the following:

  1. Use ExpectsInputTypes instead of ImplicitCastInputTypes.
  2. Add some cases for invalid-type parameters.
  3. Code style fixes.

// If the url is a constant, cache the URL object so that we don't need to convert url
// from UTF8String to String to URL for every row.
@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) => if (url ne null) getUrl(url) else null
Contributor

it can be case Literal(url: UTF8String, _) if url != null => getUrl(url)
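
Applying that suggestion, the cached field would read (a sketch; the non-literal fallback arm is assumed):

@transient private lazy val cachedUrl = children(0) match {
  case Literal(url: UTF8String, _) if url != null => getUrl(url)
  case _ => null
}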

Contributor Author

Oh yes, it's simpler.

@cloud-fan
Contributor

LGTM except one minor comment, thanks for working on it!

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
Contributor Author

janplus commented Jul 8, 2016

cc @cloud-fan Thank you.
I have resolved conflicts with master and done some code style fixes as you suggested.

@cloud-fan
Contributor

retest this please


SparkQA commented Jul 8, 2016

Test build #61983 has finished for PR 14008 at commit 95114ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

janplus commented Jul 8, 2016

It seems it failed the org.apache.spark.sql.sources.CreateTableAsSelectSuite create a table, drop it and create another one with the same name test, which is irrelevant to this PR.
Maybe we should retest this?

@cloud-fan
Contributor

retest this please


SparkQA commented Jul 8, 2016

Test build #61987 has finished for PR 14008 at commit 95114ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 8, 2016

Test build #3173 has finished for PR 14008 at commit 95114ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rxin commented Jul 8, 2016

Thanks - merging in master/2.0.

asfgit closed this in f5fef69 Jul 8, 2016
asfgit pushed a commit that referenced this pull request Jul 8, 2016
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of #13999

## How was this patch tested?

Passes the existing tests, including new test cases.

Author: wujian <jan.chou.wu@gmail.com>

Closes #14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@janplus
Contributor Author

janplus commented Jul 9, 2016

Thanks @rxin @dongjoon-hyun @cloud-fan @liancheng
I've learnt a lot from this PR!

janplus deleted the SPARK-16281 branch July 9, 2016 09:39
@dongjoon-hyun
Member

Congratulations on your first commit, @janplus!
I've learned a lot while watching this PR, too. :)
