Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16287][SQL] Implement str_to_map SQL function #13990

Closed
wants to merge 38 commits into from

Conversation

techaddict
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds str_to_map SQL function in order to remove Hive fallback.

How was this patch tested?

Pass the Jenkins tests with newly added.

@SparkQA
Copy link

SparkQA commented Jun 30, 2016

Test build #61525 has finished for PR 13990 at commit 1f888ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@techaddict techaddict changed the title [SPARK-16287][SQL][WIP] Implement str_to_map SQL function [SPARK-16287][SQL] Implement str_to_map SQL function Jul 3, 2016
@techaddict
Copy link
Contributor Author

cc: @cloud-fan @rxin

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61685 has finished for PR 13990 at commit fa294bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Creates a map after splitting the input text into key/value pairs using delimeters
*/
@ExpressionDescription(
usage = """_FUNC_(text[, delimiter1, delimiter2]) - Creates a map after splitting the text into
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delimiter1 and delimiter2 are not good names. delimiter1 is used to separate key-value pairs from the input text, and delimiter2 is used to separate key and value from each kv pair. Do you have some ideas about the naming?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about pairDelim and pairSeperatorDelim, not very good with naming what do you suggest ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used delimiter1 and delimiter2 because its named that way in hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about pairDelim and keyValueDelim?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yupp sound much better, let me make the change

usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text into
key/value pairs using delimiters.
Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
case class StringToMap(child: Expression, pairDelim: Expression, keyValueDelim: Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about renaming child to text? to make it consistent with the comment: _FUNC_(text[, pairDelim, keyValueDelim])

@SparkQA
Copy link

SparkQA commented Jul 3, 2016

Test build #61690 has finished for PR 13990 at commit 6b2390d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringToMap(child: Expression, pairDelim: Expression, keyValueDelim: Expression)

@SparkQA
Copy link

SparkQA commented Jul 5, 2016

Test build #61767 has finished for PR 13990 at commit ca74f9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: Expression)

.split(delim1.asInstanceOf[UTF8String], -1)
.map{_.split(delim2.asInstanceOf[UTF8String], 2)}

ArrayBasedMapData(array.map(_(0)), array.map(_(1))).asInstanceOf[MapData]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems unnecessary asInstanceOf?

* Creates a map after splitting the input text into key/value pairs using delimeters
*/
@ExpressionDescription(
usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text into
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will mess up the display i think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also we really need an example here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about the display [Usage: str_to_map(text[, pairDelim, keyValueDelim]) - Creates a map after splitting the text into key/value pairs using delimiters. Default delimiters are ',' for pairDelim and '=' for keyValueDelim.]
added example

@rxin
Copy link
Contributor

rxin commented Jul 6, 2016

cc @dongjoon-hyun can you help review this

@SparkQA
Copy link

SparkQA commented Jul 6, 2016

Test build #61806 has finished for PR 13990 at commit f7c03c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Sure, @rxin .


def this(child: Expression) = {
this(child, Literal(","), Literal("="))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @techaddict .
Could you add one more constructor, this(child: Expression, pairDelim: Expression)?

TypeCheckResult.TypeCheckSuccess
} else {
TypeCheckResult.TypeCheckFailure(
s"$prettyName's arguments must be foldable, but got $children.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mistake? 2 delimiters not all arguments

)

// All arguments should be string literals.
val m1 = intercept[AnalysisException]{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove these error tests from here, usually we only test the type checking logic in unit test, not end-to-end test.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62250 has finished for PR 13990 at commit cbc8798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62256 has finished for PR 13990 at commit e701716.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62257 has finished for PR 13990 at commit 8172bd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@techaddict
Copy link
Contributor Author

@cloud-fan anything else, it good to merge ?

TypeCheckResult.TypeCheckSuccess
} else {
TypeCheckResult.TypeCheckFailure(
s"$prettyName's delimiters must be foldable, but got $children.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$children will print something like Seq(xxx, xxx), I think we can just say $prettyName's delimiters must be foldable

@cloud-fan
Copy link
Contributor

Sorry I was OOO last few days, LGTM except some minor comments, thanks for working on it!

@SparkQA
Copy link

SparkQA commented Jul 18, 2016

Test build #62471 has finished for PR 13990 at commit 1e35779.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@techaddict
Copy link
Contributor Author

@cloud-fan Comment addressed, test passed 👍


override def dataType: DataType = MapType(StringType, StringType, valueContainsNull = false)

override def checkInputDataTypes(): TypeCheckResult = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it's simpler to follow XPathExtract to do the type check, i.e. implement ExpectsInputTypes to check the type, and override checkInputDataTypes for the foldable check.

@SparkQA
Copy link

SparkQA commented Jul 21, 2016

Test build #62681 has finished for PR 13990 at commit 8aabd37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 22, 2016
## What changes were proposed in this pull request?
This PR adds `str_to_map` SQL function in order to remove Hive fallback.

## How was this patch tested?
Pass the Jenkins tests with newly added.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13990 from techaddict/SPARK-16287.

(cherry picked from commit df2c6d5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master and 2.0!

@asfgit asfgit closed this in df2c6d5 Jul 22, 2016
@srowen
Copy link
Member

srowen commented Jul 22, 2016

#14315 fixed the odd compile error for this.

Is this really something we should be merging in branch 2.0 now? this looks like part of a new feature, and not even obviously something for 2.0.1.

@cloud-fan
Copy link
Contributor

cloud-fan commented Jul 22, 2016

@srowen please see https://issues.apache.org/jira/browse/SPARK-16275, there is an explanation why we wanna merge them into 2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants