@beliefer
Contributor

@beliefer beliefer commented Sep 29, 2019

What changes were proposed in this pull request?

PostgreSQL and Oracle both provide a to_number function that converts a string to a number.
The implementation and supported syntax differ considerably between the two, so this PR mainly follows PostgreSQL's implementation of to_number.

Several mainstream databases support this syntax:
PostgreSQL:
https://www.postgresql.org/docs/12/functions-formatting.html

Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_NUMBER.html#GUID-D4807212-AFD7-48A7-9AED-BEC3E8809866

Vertica:
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_NUMBER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CFormatting%20Functions%7C_____7

Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/r_TO_NUMBER.html

DB2:
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1544.htm

Teradata:
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/TH2cDXBn6tala29S536nqg

Snowflake:
https://docs.snowflake.net/manuals/sql-reference/functions/to_decimal.html

Exasol:
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/to_number.htm#TO_NUMBER

SingleStore:
https://docs.singlestore.com/v7.3/reference/sql-reference/numeric-functions/to-number/

InterSystems:
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_TONUMBER

The syntax looks like this:

select to_number('12,454.8-', '99G999D9S');
-12454.8
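
As a rough illustration of how such a template can drive parsing, here is a minimal Scala sketch. It is not the code from this PR: ToNumberSketch and toNumber are hypothetical names, only the template characters 9, 0, G, ',', D, '.' and S are handled, and the sketch assumes that each template position consumes at most one input character.

```scala
object ToNumberSketch {
  /** Hypothetical helper: parses `input` using a small subset of
    * PostgreSQL-style template characters (9, 0, G, ',', D, '.', S).
    * Throws NumberFormatException if no digits are collected. */
  def toNumber(input: String, pattern: String): BigDecimal = {
    val digits = new StringBuilder
    var negative = false
    var i = 0 // current position in the input string
    pattern.toUpperCase.foreach { p =>
      if (i < input.length) {
        val c = input.charAt(i)
        p match {
          case '9' | '0' if c.isDigit =>
            digits.append(c); i += 1
          case 'G' | ',' if c == ',' =>
            i += 1 // group separators carry no numeric value
          case 'D' | '.' if c == '.' =>
            digits.append('.'); i += 1
          case 'S' if c == '+' || c == '-' =>
            negative = c == '-'; i += 1
          case _ => () // template position without a matching input char
        }
      }
    }
    val value = BigDecimal(digits.toString)
    if (negative) -value else value
  }
}
```

With the input and template from the example above, ToNumberSketch.toNumber("12,454.8-", "99G999D9S") yields -12454.8. Locale handling, FM, PR, MI, L and a leading sign without an S position are deliberately left out of the sketch.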

Why are the changes needed?

This PR adds a new function.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

@SparkQA

SparkQA commented Sep 29, 2019

Test build #111557 has finished for PR 25963 at commit 07ad71b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ToNumber(strExpr: Expression, patternExpr: Expression)

Member

@MaxGekk MaxGekk left a comment

Could you uncomment the examples there:

-- SELECT '' AS to_number_1, to_number('-34,338,492', '99G999G999');
-- SELECT '' AS to_number_2, to_number('-34,338,492.654,878', '99G999G999D999G999');
-- SELECT '' AS to_number_3, to_number('<564646.654564>', '999999.999999PR');
-- SELECT '' AS to_number_4, to_number('0.00001-', '9.999999S');
-- SELECT '' AS to_number_5, to_number('5.01-', 'FM9.999999S');
-- SELECT '' AS to_number_5, to_number('5.01-', 'FM9.999999MI');
-- SELECT '' AS to_number_7, to_number('5 4 4 4 4 8 . 7 8', '9 9 9 9 9 9 . 9 9');
-- SELECT '' AS to_number_8, to_number('.01', 'FM9.99');
-- SELECT '' AS to_number_9, to_number('.0', '99999999.99999999');
-- SELECT '' AS to_number_10, to_number('0', '99.99');
-- SELECT '' AS to_number_11, to_number('.-01', 'S99.99');
-- SELECT '' AS to_number_12, to_number('.01-', '99.99S');
-- SELECT '' AS to_number_13, to_number(' . 0 1-', ' 9 9 . 9 9 S');
-- SELECT '' AS to_number_14, to_number('34,50','999,99');
-- SELECT '' AS to_number_15, to_number('123,000','999G');
-- SELECT '' AS to_number_16, to_number('123456','999G999');
-- SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99');
-- SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99');
-- SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99');
-- SELECT '' AS to_number_20, to_number('1234.56','L99,999.99');
-- SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99');
-- SELECT '' AS to_number_22, to_number('42nd', '99th');

}

/**
* A function that convert string to numeric.
Member

convert -> converts

Contributor Author

OK. Thanks for the reminder.

* @group string_funcs
* @since 3.0.0
*/
def to_number(x: Column, format: String): Column = withExpr {
Member

Registering the to_number function should be enough. Exposing it in the Scala API is not necessary.

Contributor Author

OK. I will remove it.

@dongjoon-hyun
Member

Hi, @beliefer. Please run dev/scalastyle and fix all of the errors.

@SparkQA

SparkQA commented Sep 30, 2019

Test build #111591 has finished for PR 25963 at commit 7b4ed77.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2019

Test build #111592 has finished for PR 25963 at commit c4a170c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2019

Test build #111602 has finished for PR 25963 at commit a8286cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111870 has finished for PR 25963 at commit 2a7202a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Oct 8, 2019

Retest this please.

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111883 has finished for PR 25963 at commit 2a7202a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Oct 8, 2019

@MaxGekk Could you continue reviewing this PR?

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111889 has finished for PR 25963 at commit 0f51086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Oct 9, 2019

@dongjoon-hyun @wangyum Could you help review this PR?

}

/**
* A function that converts string to numeric.
Member

This comment duplicates the usage part. I don't think it is useful.

Contributor Author

I will keep this line and change the usage part to: Convert strExpr to a number based on the patternExpr.

/**
* A function that converts string to numeric.
*/
// scalastyle:off line.size.limit
Member

The code below doesn't reach the limit. This can be removed.

Contributor Author

Thanks, I will follow your suggestion.

Literal.create(null, IntegerType), Literal.create(null, IntegerType)), null)
}

test("ToNumber") {
Member

The usage part of the expression says that some pattern elements use the locale:

'S': sign anchored to number (uses locale)
'L': currency symbol (uses locale)
'D': decimal point (uses locale)
'G': group separator (uses locale)

Could you test a few locales?

Contributor Author

@beliefer beliefer Oct 11, 2019

@MaxGekk Thanks for this review. I tested the locale on PostgreSQL, but 'L' does not seem to work.

select to_number('USD34234.4350', 'L99999.0000');  // 34234.435
select to_number('EUR34234.4350', 'L99999.0000');  // 34234.435
select to_number('RY34234.4350', 'L99999.0000');  // 34234.435

Although 'RY' is not a valid locale, the result is the same as the others.

Contributor Author

@beliefer beliefer Oct 11, 2019

I think the description of locales is not consistent with the actual behavior. Maybe I should remove the locale comment, as this appears to be a bug in PostgreSQL.

'0': digit position (will not be dropped, even if insignificant)
'.': decimal point (only allowed once)
',': group (thousands) separator
'S': sign anchored to number (uses locale)
Member

Could you clarify where it takes the locale from?

extends BinaryExpression with ImplicitCastInputTypes {

// scalastyle:off caselocale
private lazy val patternStr = patternExpr.eval().asInstanceOf[UTF8String].toUpperCase.toString
Member

Here, you assume that patternExpr is a foldable expression, correct? What happens if it is not? You can see there how to handle both cases:

@transient private lazy val formatter: Option[TimestampFormatter] = {
if (right.foldable) {
Option(right.eval()).map(format => TimestampFormatter(format.toString, zoneId))
} else None
}

Contributor Author

OK. A good suggestion.
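
To make the foldable versus non-foldable distinction concrete, here is a minimal, self-contained Scala sketch of the same idea. Expr, Lit, Col and PatternHolder are hypothetical stand-ins written for this illustration and are not Spark's Catalyst classes: a constant pattern is normalized once and cached, while a per-row pattern is evaluated on every call.

```scala
object FoldableSketch {
  // Hypothetical stand-in for a Catalyst expression; not Spark's real API.
  trait Expr {
    def foldable: Boolean
    def eval(row: Map[String, String]): String
  }

  // A constant: can be evaluated without any input row.
  case class Lit(value: String) extends Expr {
    val foldable = true
    def eval(row: Map[String, String]): String = value
  }

  // A column reference: its value is only known per row.
  case class Col(name: String) extends Expr {
    val foldable = false
    def eval(row: Map[String, String]): String = row(name)
  }

  class PatternHolder(patternExpr: Expr) {
    // Normalized once when the pattern is constant; None otherwise.
    lazy val cached: Option[String] =
      if (patternExpr.foldable) Option(patternExpr.eval(Map.empty)).map(_.toUpperCase)
      else None

    // Falls back to per-row evaluation when nothing could be cached.
    def patternFor(row: Map[String, String]): String =
      cached.getOrElse(patternExpr.eval(row).toUpperCase)
  }
}
```

With a Lit pattern the upper-casing happens once; with a Col pattern cached stays empty and patternFor reads the pattern from each row instead of failing.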


val inputTypeCheck = super.checkInputDataTypes()
if(inputTypeCheck.isSuccess) {
if (patternStr.count(checkDecimalPointNum(_)) > 1) {
Member

Suggested change
- if (patternStr.count(checkDecimalPointNum(_)) > 1) {
+ if (patternStr.count(checkDecimalPointNum) > 1) {

Contributor Author

OK. I learned it.
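
The suggestion relies on Scala converting a method name into a function value where a function is expected, so the placeholder form and the bare method name behave the same. A small sketch, where EtaExpansionSketch and isDecimalPoint are names made up for this illustration and the predicate mirrors the '.' or 'D' check from the patch:

```scala
object EtaExpansionSketch {
  // Same predicate as in the patch: '.' and 'D' both mark the decimal point.
  def isDecimalPoint(c: Char): Boolean = c == '.' || c == 'D'

  val pattern = "999D9D9"

  // Explicit placeholder syntax.
  val withPlaceholder: Int = pattern.count(isDecimalPoint(_))

  // Method name passed directly; the compiler turns it into a function.
  val withMethodName: Int = pattern.count(isDecimalPoint)
}
```

Both values are 2 for this pattern, which is the case the type check is meant to reject.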

}
}

var result = if (integerLen == -1 && wholeLen == -1) {
Member

Suggested change
- var result = if (integerLen == -1 && wholeLen == -1) {
+ val result = if (integerLen == -1 && wholeLen == -1) {

checkEvaluation(ToNumber(Literal("RY34234.4350"), Literal("L99999.0000")), "34234.435")
checkEvaluation(ToNumber(Literal("R34234.4350"), Literal("L99999.0000")), "34234.435")

ToNumber(Literal("454.3.2"), Literal("999D9D9")).checkInputDataTypes() match {
Member

Here you check only one negative test. Could you add a few more negative tests?

Member

For example here is a test coverage report:
[Image: Screen Shot 2019-10-09 at 17 41 31]

Contributor Author

Here you check only one negative test. Could you add a few more negative tests?

OK. I will add more.
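
For illustration, negative cases of this kind can be expressed against a small stand-alone validator. The sketch below is an assumption-laden simplification, not the PR's checkInputDataTypes: PatternValidationSketch and validate are hypothetical names, the decimal-point rule follows the 999D9D9 test above, and the rule that at most one S is allowed is assumed here rather than taken from the patch.

```scala
object PatternValidationSketch {
  /** Hypothetical validator: Right(normalized pattern) or Left(reason). */
  def validate(pattern: String): Either[String, String] = {
    val p = pattern.toUpperCase
    if (p.count(c => c == '.' || c == 'D') > 1) {
      Left("Multiple decimal points in pattern: " + pattern)
    } else if (p.count(_ == 'S') > 1) {
      // Assumed rule for this sketch: at most one sign position.
      Left("Multiple sign positions in pattern: " + pattern)
    } else {
      Right(p)
    }
  }

  // A few negative inputs that a test suite could iterate over.
  val invalidPatterns: Seq[String] = Seq("999D9D9", "9.9.9", "99D.9", "S99.99S")
}
```

A test could then assert that every entry of invalidPatterns is rejected while a pattern such as 99G999D9S is accepted.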

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112216 has finished for PR 25963 at commit 8deaabf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ToNumber(left: Expression, right: Expression)

@beliefer
Contributor Author

beliefer commented Oct 18, 2019

@MaxGekk Could you continue reviewing this PR? cc @dongjoon-hyun

c == '.' || c == 'D'
}

def checkSignNum(c: Char): Boolean = {
Member

This method is used in only one place. Could you inline it there?

Contributor Author

OK. I will remove this method.

override def inputTypes: Seq[DataType] = Seq(StringType, StringType)

override def checkInputDataTypes(): TypeCheckResult = {
def checkDecimalPointNum(c: Char): Boolean = {
Member

This one is used only once. I don't think there is a need to extract such a simple check into a separate function.

Contributor Author

OK. I will remove this method too.

}

object ToNumber {
def transfer(input: UTF8String, pattern: UTF8String): UTF8String = {
Member

Please rename it to convert().

Contributor Author

Good suggestion.

(inputStr, patternStr.replaceAll("FM", ""))
}
val inputChars = newInputStr.toCharArray()
val patternChars = newPatternStr.toIterator
Member

Why do you need to convert it to iterator? String's foreach can iterate over chars as well.

Contributor Author

Yes. I should avoid the conversion.

case ',' | 'G' if Character.isDigit(currentChar) =>
case 'L' =>
while (Character.isLetter(inputChars(indexOfString)) ||
inputChars(indexOfString) == '$') {
Member

What happens if indexOfString >= inputChars.length?

Contributor Author

indexOfString ranges from 0 to inputChars.length - 1; there is already a check for that.
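
One way to make the bound explicit in the loop itself, independent of any check elsewhere, is to test the index before reading the array. This is a sketch under assumptions, not the PR's code: CurrencySkipSketch and skipCurrency are hypothetical names, and the rule that a currency prefix consists of letters or '$' is taken from the while condition quoted above.

```scala
object CurrencySkipSketch {
  /** Returns the index of the first char at or after `start` that is not
    * part of a currency prefix (letters or '$'). Never reads out of bounds. */
  def skipCurrency(input: Array[Char], start: Int): Int = {
    var i = start
    // The length test comes first, so input(i) is only read when valid.
    while (i < input.length && (Character.isLetter(input(i)) || input(i) == '$')) {
      i += 1
    }
    i
  }
}
```

For an input such as "USD" with no digits after the prefix, the loop stops at index 3 instead of throwing ArrayIndexOutOfBoundsException.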

Comment on lines 2413 to 2416
s"""
${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer(
$l, $r);
"""})
Member

Suggested change
- s"""
- ${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer(
- $l, $r);
- """})
+ s"${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer($l, $r);"

Contributor Author

Thanks for your suggestion.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112359 has finished for PR 25963 at commit afa19a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

cc @HyukjinKwon @wangyum

@beliefer
Contributor Author

beliefer commented Oct 30, 2019

@maropu
Member

maropu commented Oct 31, 2019

What's the use case for this function? Is it useful for users? For the example in the PR description, can't we just use an explicit cast like this?

scala> sql("select cast('-12454.8' as double)").show
+------------------------+
|CAST(-12454.8 AS DOUBLE)|
+------------------------+
|                -12454.8|
+------------------------+

@beliefer
Contributor Author

beliefer commented Nov 1, 2019

@maropu to_number can convert a formatted string to a number, which is different from cast.
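
The difference can be seen with the exact input from the PR description. The sketch below uses plain Scala string parsing as a rough stand-in for a cast; Spark's own cast has its own rules and is not called here, and CastVersusFormatSketch is a name made up for this illustration.

```scala
import scala.util.Try

object CastVersusFormatSketch {
  // Canonical numeric text parses directly.
  val plain: Option[Double] = Try("-12454.8".toDouble).toOption

  // The formatted input (group separator, trailing sign) is rejected by the
  // same parse, which is the gap a format-aware function is meant to fill.
  val formatted: Option[Double] = Try("12,454.8-".toDouble).toOption
}
```

Here plain holds -12454.8, while formatted is empty because the group separator and the trailing minus sign are not valid in a plain numeric literal.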

@beliefer
Contributor Author

beliefer commented Nov 22, 2019

@maropu Could you continue to review this PR? Thanks!

@maropu
Member

maropu commented Jan 8, 2020

I'm closing this because of the recent policy for the PostgreSQL dialect. If necessary, please reopen it. Anyway, thanks for the work!

@maropu maropu closed this Jan 8, 2020
@beliefer
Contributor Author

beliefer commented Jan 9, 2020

OK. I will wait for the PostgreSQL dialect policy to change.


5 participants