Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28137][SQL] Data Type Formatting Functions: to_number #35060

Closed
wants to merge 32 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Dec 29, 2021

What changes were proposed in this pull request?

Many database support the function to_number to convert a string to number.
The implement of to_number has many different between Postgresql ,Oracle and Phoenix.
So, this PR follows the implement of to_number in Oracle that give a strict parameter verification.
So, this PR follows the implement of to_number in Phoenix that uses BigDecimal.

This PR support the patterns for numeric formatting as follows:

Pattern Description
9 digit position
0 digit position
. (period) decimal point (only allowed once)
, (comma) group (thousands) separator
S sign anchored to number (only allowed once)
$ value with a leading dollar sign (only allowed once)
D decimal point (only allowed once)
G group (thousands) separator

There are some mainstream database support the syntax.
PostgreSQL:
https://www.postgresql.org/docs/12/functions-formatting.html

Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_NUMBER.html#GUID-D4807212-AFD7-48A7-9AED-BEC3E8809866

Vertica
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_NUMBER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CFormatting%20Functions%7C_____7

Redshift
https://docs.aws.amazon.com/redshift/latest/dg/r_TO_NUMBER.html

DB2
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1544.htm

Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/TH2cDXBn6tala29S536nqg

Snowflake:
https://docs.snowflake.net/manuals/sql-reference/functions/to_decimal.html

Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/to_number.htm#TO_NUMBER

Phoenix
http://phoenix.incubator.apache.org/language/functions.html#to_number

Singlestore
https://docs.singlestore.com/v7.3/reference/sql-reference/numeric-functions/to-number/

Intersystems
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_TONUMBER

The syntax like:

select to_number('12,454.8-', '99G999D9S');
-12454.8

Why are the changes needed?

to_number is very useful for formatted currency to number conversion.

Does this PR introduce any user-facing change?

Yes. New feature.

How was this patch tested?

New tests

'.': decimal point (only allowed once)
',': group (thousands) separator
'S': sign anchored to number (uses locale)
'D': decimal point (uses locale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it true? We simply relace D with ., and I don't see we handle locale there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see that we replace D with . during normalization.

Copy link
Contributor Author

@beliefer beliefer Dec 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment references PostgreSQL and Oracle. But other database does not mention locale.
I guess NumberFormat.getNumberInstance(Locale.ROOT) is enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Spark doc should match Spark implementation

_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`.
The format can consist of the following characters:
'9': digit position (can be dropped if insignificant)
'0': digit position (will not be dropped, even if insignificant)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need more documents. I still don't understand the difference between 0 and 9 after reading this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated some comment. Please review again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_number is a parsing function, I think 0 and 9 have no difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why do we document them differently here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why do we document them differently here?

Updated

if (inputTypeCheck.isSuccess) {
if (right.foldable) {
if (builder.normalizedNumberFormat.length == 0) {
TypeCheckResult.TypeCheckFailure(s"Format expression cannot be empty")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this check in builder.check()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

boolean ${ev.isNull} = ${eval.isNull};
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
if (!${ev.isNull}) {
$builderClass $builder = new $builderClass("$numberFormat");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty risky to embed user input in the generated code, as people can inject malicious code.

We should do ctx.addReferenceObj(builder)

_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`.
The format can consist of the following characters:
'9': digit position
'0': digit position too
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'0' or '9': digit position

@github-actions github-actions bot added the DOCS label Jan 4, 2022

|Symbol|Meaning|Examples|
|------|-------|--------|
|**9**|digit position (can be dropped if insignificant)|9999|
Copy link
Contributor

@cloud-fan cloud-fan Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that snowflake does not simply trim the leading/trailing zeros, but replace them with spaces. Is it a common behavior in other databases?

|**D**|decimal point (only allowed once)|99D99|
|**,**|group (thousands) separator|9,999|
|**G**|group (thousands) separator|9G999|
|**-**|sign anchored to number|-9999|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also only allowed once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL throws ERROR: cannot use "S" twice
Oracle throws ORA-01481: 无效的数字格式模型.


Usage notes for numeric formatting:

- 0 specifies a digit position that will always be printed, even if it contains a leading/trailing zero. 9 also specifies a digit position, but if it is a leading zero then it will be replaced by a space, while if it is a trailing zero and fill mode is specified then it will be deleted. (For to_number(), these two pattern characters are equivalent.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although to_char is not implemented yet, can we use to_char as an example to demonstrate the differences between 0 and 9?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK


- 0 specifies a digit position that will always be printed, even if it contains a leading/trailing zero. 9 also specifies a digit position, but if it is a leading zero then it will be replaced by a space, while if it is a trailing zero and fill mode is specified then it will be deleted. (For to_number(), these two pattern characters are equivalent.)

- The pattern characters S, D, and G represent the sign, decimal point, and thousands separator characters. The pattern characters period and comma represent those exact characters, with the meanings of decimal point and thousands separator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't say anything. Can we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

|**9**|digit position (can be dropped if insignificant)|9999|
|**0**|digit position (will not be dropped, even if insignificant)|0999|
|**.**|decimal point (only allowed once)|99.99|
|**D**|decimal point (only allowed once)|99D99|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|**D**|decimal point (only allowed once)|99D99|
|**D**|decimal point, same as **.** (only allowed once)|99D99|

|**.**|decimal point (only allowed once)|99.99|
|**D**|decimal point (only allowed once)|99D99|
|**,**|group (thousands) separator|9,999|
|**G**|group (thousands) separator|9G999|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|**G**|group (thousands) separator|9G999|
|**G**|group (thousands) separator, same as **,**|9G999|

usage = """
_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`.
The format can consist of the following characters:
'0' or '9': digit position
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'0' or '9': digit position
'0' or '9': digit position

'0' or '9': digit position
'.' or 'D': decimal point (only allowed once)
',' or 'G': group (thousands) separator
'-' or 'S': sign anchored to number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it only allowed once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL throws ERROR: cannot use "S" twice
Oracle throws ORA-01481: 无效的数字格式模型.

val builder =
ctx.addReferenceObj("builder", numberFormatBuilder, classOf[NumberFormatBuilder].getName)
val eval = left.genCode(ctx)
ev.copy(code = code"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use multi-line string.

code"""
  |...
  """.stripMargin


/**
* DecimalFormat provides '#' and '0' as placeholder of digit, ',' as grouping separator,
* '.' as decimal separator, '-' as minus, '$' as dollar, but '9', 'G', 'D', 'S'. So we need
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* '.' as decimal separator, '-' as minus, '$' as dollar, but '9', 'G', 'D', 'S'. So we need
* '.' as decimal separator, '-' as minus, '$' as dollar, but not '9', 'G', 'D', 'S'. So we need

* 3. 'D' -> '.'
* 4. 'S' -> '-'
*
* Note: When calling format, we must preserve the digits after decimal point, so the digits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very sure about this behavior, but let's leave the discussion when adding to_char

* '.' or 'D': decimal point (only allowed once)
* ',' or 'G': group (thousands) separator
* '-' or 'S': sign anchored to number
* '$': returns value with a leading dollar sign
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returns? This is def parse

numberDecimalFormat.applyPattern(transformedFormat)
private def parse(
input: UTF8String,
numberFormat: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originalNumberFormat?

val transformedFormat = transform(normalizedNumberFormat)
val numberFormatInstance = NumberFormat.getNumberInstance(Locale.ROOT)
assert(numberFormatInstance.isInstanceOf[DecimalFormat])
val numberDecimalFormat = numberFormatInstance.asInstanceOf[DecimalFormat]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super inefficient to create a NumberFormat instance for every input row.


val transformedFormat = transform(normalizedNumberFormat)
val numberFormatInstance = NumberFormat.getNumberInstance(Locale.ROOT)
assert(numberFormatInstance.isInstanceOf[DecimalFormat])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we just do new DecimalFormat()?


Seq(
("454", "999") -> Decimal(454),
("054", "999") -> Decimal(54),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about "54" and "999"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return Decimal(54). Let's update tests.


val transformedFormat = transform(normalizedNumberFormat)
try {
numberDecimalFormat.applyLocalizedPattern(transformedFormat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do this for every row?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

private val transformedFormat = transform(normalizedNumberFormat)

private lazy val numberDecimalFormat = {
val decimalFormat = new DecimalFormat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we do new DecimalFormat(transformedFormat)


def parse(input: UTF8String): Decimal = {
if (unChecked) {
check()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. This is a static type check, why do you do it at runtime when parsing the first input?


private lazy val numberFormat = right.eval().toString.toUpperCase(Locale.ROOT)
private lazy val numberFormatBuilder = new NumberFormatBuilder(numberFormat)
private lazy val (precision, scale) = numberFormatBuilder.parsePrecisionAndScale()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's simpler to have a def parsedDecimalType: DecimalType and use it here.

}
}

def parsedDecimalType: DecimalType = DecimalType(precision, scale)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I expect is

val parsedDecimalType = {
  val formatSplits = normalizedNumberFormat.split(POINT_SIGN).map(_.filterNot(isSign))
  assert(formatSplits.length <= 2)
  val precision = formatSplits.map(_.length).sum
  val scale = if (formatSplits.length == 2) formatSplits.last.length else 0
  DecimalType(precision, scale)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precision and scale is used in other place.

val format2 = 0.until(input.length).map(_ => '9').mkString
val format3 = 0.until(input.length + 1).map(_ => '0').mkString
val format4 = 0.until(input.length + 1).map(_ => '9').mkString
val format5 = 0.until(input.length + 2).map(_ => '0').mkString
Copy link
Contributor

@cloud-fan cloud-fan Jan 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test the mixed case as I proposed before? val format3 = 0.until(input.length + 2).map(i => i % 2).mkString

It's usually not very useful to go deeper into one factor in the test. To test format string that is longer than the value, the + 1 case should be sufficient, we don't need to test + 2. Ditto to the - 2 case.

// Test '.' and 'D'
checkExceptionInExpression[IllegalArgumentException](
ToNumber(Literal("454.2"), Literal("999")),
"Format '999' used for parsing string to number or formatting number to string is invalid")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refine the error message. 999 is indeed a valid number format string. The problem is the input does not match it.

How about

The input string '$input' does not match the given number format: '$format'

ToNumber(Literal("454.23"), Literal("00.000")),
"Format '00.000' used for parsing string to number or formatting number to string is invalid")
checkExceptionInExpression[IllegalArgumentException](
ToNumber(Literal("454.23"), Literal("00D000")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you really want to test all the cases, please write the test cases only with dot, and run it again after replacing dot with D.

("12,454", "99,999") -> Decimal(12454),
("12,454", "00,000") -> Decimal(12454),
("12,454", "99G999") -> Decimal(12454),
("12,454", "00G000") -> Decimal(12454),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same problem here, please make the tests more readable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and tests more cases, e.g. different number of digits.

SELECT endswith(null, null);

-- to_number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we have extensive unit tests, we can only test some common cases in the end-to-end tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

"The input string '123,456' does not match the given number format: '9G9'")

Seq(
("12,454", "99,999") -> Decimal(12454),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, no need to repeat the test cases manually, we can replace 0 with 9 in the loop body.


// Test ',' and 'G'
checkExceptionInExpression[IllegalArgumentException](
ToNumber(Literal("123,456"), Literal("9G9")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also test if the number of , doesn't match? e.g.

to_number('123,456', '999,999,999')
to_number('123,456,789', '999,999')

("454-", "000-") -> Decimal(-454),
("454-", "000S") -> Decimal(-454),
("-454", "-000") -> Decimal(-454),
("-454", "S000") -> Decimal(-454),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, replace 9 with 0, - with S in the loop body.

import org.apache.spark.sql.types.Decimal
import org.apache.spark.unsafe.types.UTF8String

class NumberUtilsSuite extends SparkFunSuite {
class NumberFormatterSuite extends SparkFunSuite {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if we still need this test suite since we test almost everything in the new test in StringExpressionsSuite

Copy link
Contributor Author

@beliefer beliefer Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just follows StringUtils with StringUtilsSuite. In fact, RegexpExpressionsSuite.scala test almost everything too.

select to_number('454.2', '999.9');
select to_number('454.2', '000.0');
select to_number('454.2', '999D9');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for . and D

select to_number('12,454', '99,999');
select to_number('12,454', '00,000');
select to_number('12,454', '99G999');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

select to_number('$78.12', '$00.00');
select to_number('-454', '-999');
select to_number('-454', 'S999');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

input: UTF8String, numberFormat: String, errorMsg: String): Unit = {
val e = intercept[IllegalArgumentException](parse(input, numberFormat))
private def invalidNumberFormat(numberFormat: String, errorMsg: String): Unit = {
val testNumberFormatter = new TestNumberFormatter(numberFormat)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this TestNumberFormatter? We can simply call check, expect it to return TypeCheckFailure, and check the error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one minor comment, we can address it in the followup PR.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9c02dd4 Jan 20, 2022
@beliefer
Copy link
Contributor Author

@cloud-fan Thank you for your hard work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants