Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8264][SQL]add substring_index function #7533

Closed
wants to merge 8 commits into from

Conversation

zhichao-li
Copy link
Contributor

No description provided.

@SparkQA
Copy link

SparkQA commented Jul 20, 2015

Test build #37824 has finished for PR 7533 at commit e52b379.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #37941 has finished for PR 7533 at commit 4fe5df5.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@SparkQA
Copy link

SparkQA commented Jul 22, 2015

Test build #38006 has finished for PR 7533 at commit 12e108f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@zhichao-li
Copy link
Contributor Author

cc @chenghao-intel @rxin


override def dataType: DataType = StringType
override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
override def nullable: Boolean = strExpr.nullable || delimExpr.nullable || countExpr.nullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to override the nullable method, as the default value is false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's abstract in Expression which Substring_index inherited from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad, actually I mean the function foldable, not the nullable.

@zhichao-li
Copy link
Contributor Author

@chenghao-intel just refactor the code to remedy the problem of invoking numChars to much, but I still think we should add numChars as a field to UTF8String since it's quite an useful function.

@zhichao-li
Copy link
Contributor Author

@rxin @chenghao-intel should I turn these functions(ordinalIndexOf, lastOrdinalIndexOf and subStringIndex) to be public in UTF8String? I guess they would be useful but on one use them except for substring_index UDF.

@SparkQA
Copy link

SparkQA commented Jul 22, 2015

Test build #38058 has finished for PR 7533 at commit ac863e9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@zhichao-li
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 22, 2015

Test build #55 has finished for PR 7533 at commit ac863e9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@SparkQA
Copy link

SparkQA commented Jul 22, 2015

Test build #38063 has finished for PR 7533 at commit ac863e9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

* @group string_funcs
* @since 1.5.0
*/
def substring_index(str: String, delim: String, count: Int): Column =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this version of API. @rxin actually made some clean up, and removed the string (the column name) version API.

@SparkQA
Copy link

SparkQA commented Jul 23, 2015

Test build #38186 has finished for PR 7533 at commit b19b013.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

* @return the n-th last index of the search String,
* <code>-1</code> if no match or <code>null</code> string input
*/
public static int lastOrdinalIndexOf(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better not make it the static function, like the indexOf, it mean we are locating the substring from CURRENT string.

}
bytePos--;
}
throw new RuntimeException("Invalid UTF8 string");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give more verbose info, like this.toString()?

@SparkQA
Copy link

SparkQA commented Jul 24, 2015

Test build #38294 has finished for PR 7533 at commit 67c253a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@SparkQA
Copy link

SparkQA commented Jul 24, 2015

Test build #38298 has finished for PR 7533 at commit 9546991.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

* right) is returned. substring_index performs a case-sensitive match when searching for delim.
*/
public UTF8String subStringIndex(UTF8String delim, int count) {
if (delim == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We dont' need to check the null value here, as it's done in the expression side.

@SparkQA
Copy link

SparkQA commented Jul 24, 2015

Test build #38317 has finished for PR 7533 at commit 515519b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)

@zhichao-li
Copy link
Contributor Author

@rxin it looks a few better now. could you take a look at this?

}
bytePos--;
}
throw new RuntimeException("Invalid UTF8 string: " + toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: throws more concrete exception like IllegalArgumentException or IllegalCharacterException etc.

@chenghao-intel
Copy link
Contributor

LGTM in general, @davies @rxin

* right) is returned. substring_index performs a case-sensitive match when searching for delim.
*/
case class Substring_index(strExpr: Expression, delimExpr: Expression, countExpr: Expression)
extends Expression with ImplicitCastInputTypes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use TernaryExpression ?

@davies
Copy link
Contributor

davies commented Jul 31, 2015

@zhichao-li Do you mind me to take over this one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants