-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28137][SQL] Add Postgresql function to_number. #25963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #111557 has finished for PR 25963 at commit
|
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you uncomment examples there:
spark/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/numeric.sql
Lines 825 to 846 in 66c9dc3
| -- SELECT '' AS to_number_1, to_number('-34,338,492', '99G999G999'); | |
| -- SELECT '' AS to_number_2, to_number('-34,338,492.654,878', '99G999G999D999G999'); | |
| -- SELECT '' AS to_number_3, to_number('<564646.654564>', '999999.999999PR'); | |
| -- SELECT '' AS to_number_4, to_number('0.00001-', '9.999999S'); | |
| -- SELECT '' AS to_number_5, to_number('5.01-', 'FM9.999999S'); | |
| -- SELECT '' AS to_number_5, to_number('5.01-', 'FM9.999999MI'); | |
| -- SELECT '' AS to_number_7, to_number('5 4 4 4 4 8 . 7 8', '9 9 9 9 9 9 . 9 9'); | |
| -- SELECT '' AS to_number_8, to_number('.01', 'FM9.99'); | |
| -- SELECT '' AS to_number_9, to_number('.0', '99999999.99999999'); | |
| -- SELECT '' AS to_number_10, to_number('0', '99.99'); | |
| -- SELECT '' AS to_number_11, to_number('.-01', 'S99.99'); | |
| -- SELECT '' AS to_number_12, to_number('.01-', '99.99S'); | |
| -- SELECT '' AS to_number_13, to_number(' . 0 1-', ' 9 9 . 9 9 S'); | |
| -- SELECT '' AS to_number_14, to_number('34,50','999,99'); | |
| -- SELECT '' AS to_number_15, to_number('123,000','999G'); | |
| -- SELECT '' AS to_number_16, to_number('123456','999G999'); | |
| -- SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99'); | |
| -- SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99'); | |
| -- SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99'); | |
| -- SELECT '' AS to_number_20, to_number('1234.56','L99,999.99'); | |
| -- SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99'); | |
| -- SELECT '' AS to_number_22, to_number('42nd', '99th'); |
| } | ||
|
|
||
| /** | ||
| * A function that convert string to numeric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convert -> converts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Thanks for your remind.
| * @group string_funcs | ||
| * @since 3.0.0 | ||
| */ | ||
| def to_number(x: Column, format: String): Column = withExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Registering the to_number function should be enough. Exposing it in Scala API is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will remove it.
|
Hi, @beliefer . Please run |
|
Test build #111591 has finished for PR 25963 at commit
|
|
Test build #111592 has finished for PR 25963 at commit
|
|
Test build #111602 has finished for PR 25963 at commit
|
|
Test build #111870 has finished for PR 25963 at commit
|
|
Retest this please. |
|
Test build #111883 has finished for PR 25963 at commit
|
|
@MaxGekk Could you review this PR continuously ? |
|
Test build #111889 has finished for PR 25963 at commit
|
|
@dongjoon-hyun @wangyum Could you help me to review this PR? |
| } | ||
|
|
||
| /** | ||
| * A function that converts string to numeric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comments duplicates the usage part. I don't think it is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will preserve this line and change the usage part as Convert strExprto a number based on thepatternExpr.
| /** | ||
| * A function that converts string to numeric. | ||
| */ | ||
| // scalastyle:off line.size.limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code below doesn't reach the limit. This can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will follow your suggestion.
| Literal.create(null, IntegerType), Literal.create(null, IntegerType)), null) | ||
| } | ||
|
|
||
| test("ToNumber") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage part of the expressions says that some part of patterns can use locale:
'S': sign anchored to number (uses locale)
'L': currency symbol (uses locale)
'D': decimal point (uses locale)
'G': group separator (uses locale)
Could you test a few locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk Thanks for this review. I test the locale on Postgresql but 'L' seems not works.
select to_number('USD34234.4350', 'L99999.0000'); // 34234.435
select to_number('EUR34234.4350', 'L99999.0000'); // 34234.435
select to_number('RY34234.4350', 'L99999.0000'); // 34234.435
Although 'RY ' is not a valid locale , the result is the same as the others .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the description of locales is not consistent with the behavior in fact. Maybe I should remove the comment for locale as it was a bug of Postgresql.
| '0': digit position (will not be dropped, even if insignificant) | ||
| '.': decimal point (only allowed once) | ||
| ',': group (thousands) separator | ||
| 'S': sign anchored to number (uses locale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could clarify from where it takes the locale.
| extends BinaryExpression with ImplicitCastInputTypes { | ||
|
|
||
| // scalastyle:off caselocale | ||
| private lazy val patternStr = patternExpr.eval().asInstanceOf[UTF8String].toUpperCase.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, you assume that patternExpr is foldable expression, correct? What happens if it is not. You can see there how to handle both cases:
Lines 615 to 619 in eecef75
| @transient private lazy val formatter: Option[TimestampFormatter] = { | |
| if (right.foldable) { | |
| Option(right.eval()).map(format => TimestampFormatter(format.toString, zoneId)) | |
| } else None | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. A good suggestion.
|
|
||
| val inputTypeCheck = super.checkInputDataTypes() | ||
| if(inputTypeCheck.isSuccess) { | ||
| if (patternStr.count(checkDecimalPointNum(_)) > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (patternStr.count(checkDecimalPointNum(_)) > 1) { | |
| if (patternStr.count(checkDecimalPointNum) > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I learned it.
| } | ||
| } | ||
|
|
||
| var result = if (integerLen == -1 && wholeLen == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| var result = if (integerLen == -1 && wholeLen == -1) { | |
| val result = if (integerLen == -1 && wholeLen == -1) { |
| checkEvaluation(ToNumber(Literal("RY34234.4350"), Literal("L99999.0000")), "34234.435") | ||
| checkEvaluation(ToNumber(Literal("R34234.4350"), Literal("L99999.0000")), "34234.435") | ||
|
|
||
| ToNumber(Literal("454.3.2"), Literal("999D9D9")).checkInputDataTypes() match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you check only one negative tests. Could you add a little bit more negative tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you check only one negative tests. Could you add a little bit more negative tests.
OK. I will add more.
|
Test build #112216 has finished for PR 25963 at commit
|
|
@MaxGekk Could you review this PR continuously ? cc @dongjoon-hyun |
| c == '.' || c == 'D' | ||
| } | ||
|
|
||
| def checkSignNum(c: Char): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is used only in one place. Could you inline it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will remove this method.
| override def inputTypes: Seq[DataType] = Seq(StringType, StringType) | ||
|
|
||
| override def checkInputDataTypes(): TypeCheckResult = { | ||
| def checkDecimalPointNum(c: Char): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is used only once. I think no need to extract the simple check to a separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I will remove this method too.
| } | ||
|
|
||
| object ToNumber { | ||
| def transfer(input: UTF8String, pattern: UTF8String): UTF8String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, rename it to convert()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion.
| (inputStr, patternStr.replaceAll("FM", "")) | ||
| } | ||
| val inputChars = newInputStr.toCharArray() | ||
| val patternChars = newPatternStr.toIterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to convert it to iterator? String's foreach can iterate over chars as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I should avoid the convert.
| case ',' | 'G' if Character.isDigit(currentChar) => | ||
| case 'L' => | ||
| while (Character.isLetter(inputChars(indexOfString)) || | ||
| inputChars(indexOfString) == '$') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if indexOfString >= inputChars.length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexOfString from 0 to inputChars.length - 1. There exists one check.
| s""" | ||
| ${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer( | ||
| $l, $r); | ||
| """}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| s""" | |
| ${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer( | |
| $l, $r); | |
| """}) | |
| s"${ev.value} = org.apache.spark.sql.catalyst.expressions.ToNumber.transfer($l, $r);" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion.
|
Test build #112359 has finished for PR 25963 at commit
|
|
What's an usecase for this func? Is this useful for users? For example, in the example of this PR description, we can just use an explicit cast like this? |
|
@maropu |
|
@maropu Could you continue to review this PR? Thanks! |
|
I'm closing this because of the recent policy for the PostgreSQL dialect. If necessary, please reopen it. Anyway, thanks for the work! |
|
OK. I will wait variety of the policy for the PostgreSQL dialect. |

What changes were proposed in this pull request?
PostgresqlandOraclehave the functionto_numberto convert a string to number.The implement and support syntax has many different between
PostgresqlandOracle. So, this PR mainly follows the implement ofto_numberinPostgresql.There are some mainstream database support the syntax.
PostgreSQL:
https://www.postgresql.org/docs/12/functions-formatting.html
Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_NUMBER.html#GUID-D4807212-AFD7-48A7-9AED-BEC3E8809866
Vertica
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_NUMBER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CFormatting%20Functions%7C_____7
Redshift
https://docs.aws.amazon.com/redshift/latest/dg/r_TO_NUMBER.html
DB2
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1544.htm
Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/TH2cDXBn6tala29S536nqg
Snowflake:
https://docs.snowflake.net/manuals/sql-reference/functions/to_decimal.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/to_number.htm#TO_NUMBER
Singlestore
https://docs.singlestore.com/v7.3/reference/sql-reference/numeric-functions/to-number/
Intersystems
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_TONUMBER
The syntax like:
Why are the changes needed?
This PR adds a new function.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New UT.