New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28137][SQL] Data Type Formatting Functions: to_number
#35060
Conversation
'.': decimal point (only allowed once) | ||
',': group (thousands) separator | ||
'S': sign anchored to number (uses locale) | ||
'D': decimal point (uses locale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it true? We simply relace D
with .
, and I don't see we handle locale there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see that we replace D
with .
during normalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment references PostgreSQL
and Oracle
. But other database does not mention locale.
I guess NumberFormat.getNumberInstance(Locale.ROOT)
is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Spark doc should match Spark implementation
_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`. | ||
The format can consist of the following characters: | ||
'9': digit position (can be dropped if insignificant) | ||
'0': digit position (will not be dropped, even if insignificant) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need more documents. I still don't understand the difference between 0 and 9 after reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated some comment. Please review again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_number
is a parsing function, I think 0 and 9 have no difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why do we document them differently here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why do we document them differently here?
Updated
if (inputTypeCheck.isSuccess) { | ||
if (right.foldable) { | ||
if (builder.normalizedNumberFormat.length == 0) { | ||
TypeCheckResult.TypeCheckFailure(s"Format expression cannot be empty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do this check in builder.check()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
boolean ${ev.isNull} = ${eval.isNull}; | ||
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)}; | ||
if (!${ev.isNull}) { | ||
$builderClass $builder = new $builderClass("$numberFormat"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty risky to embed user input in the generated code, as people can inject malicious code.
We should do ctx.addReferenceObj(builder)
_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`. | ||
The format can consist of the following characters: | ||
'9': digit position | ||
'0': digit position too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'0' or '9': digit position
docs/sql-ref-number-pattern.md
Outdated
|
||
|Symbol|Meaning|Examples| | ||
|------|-------|--------| | ||
|**9**|digit position (can be dropped if insignificant)|9999| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that snowflake does not simply trim the leading/trailing zeros, but replace them with spaces. Is it a common behavior in other databases?
docs/sql-ref-number-pattern.md
Outdated
|**D**|decimal point (only allowed once)|99D99| | ||
|**,**|group (thousands) separator|9,999| | ||
|**G**|group (thousands) separator|9G999| | ||
|**-**|sign anchored to number|-9999| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also only allowed once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PostgreSQL throws ERROR: cannot use "S" twice
Oracle throws ORA-01481: 无效的数字格式模型
.
docs/sql-ref-number-pattern.md
Outdated
|
||
Usage notes for numeric formatting: | ||
|
||
- 0 specifies a digit position that will always be printed, even if it contains a leading/trailing zero. 9 also specifies a digit position, but if it is a leading zero then it will be replaced by a space, while if it is a trailing zero and fill mode is specified then it will be deleted. (For to_number(), these two pattern characters are equivalent.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although to_char
is not implemented yet, can we use to_char
as an example to demonstrate the differences between 0 and 9?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
docs/sql-ref-number-pattern.md
Outdated
|
||
- 0 specifies a digit position that will always be printed, even if it contains a leading/trailing zero. 9 also specifies a digit position, but if it is a leading zero then it will be replaced by a space, while if it is a trailing zero and fill mode is specified then it will be deleted. (For to_number(), these two pattern characters are equivalent.) | ||
|
||
- The pattern characters S, D, and G represent the sign, decimal point, and thousands separator characters. The pattern characters period and comma represent those exact characters, with the meanings of decimal point and thousands separator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't say anything. Can we remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
docs/sql-ref-number-pattern.md
Outdated
|**9**|digit position (can be dropped if insignificant)|9999| | ||
|**0**|digit position (will not be dropped, even if insignificant)|0999| | ||
|**.**|decimal point (only allowed once)|99.99| | ||
|**D**|decimal point (only allowed once)|99D99| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|**D**|decimal point (only allowed once)|99D99| | |
|**D**|decimal point, same as **.** (only allowed once)|99D99| |
docs/sql-ref-number-pattern.md
Outdated
|**.**|decimal point (only allowed once)|99.99| | ||
|**D**|decimal point (only allowed once)|99D99| | ||
|**,**|group (thousands) separator|9,999| | ||
|**G**|group (thousands) separator|9G999| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|**G**|group (thousands) separator|9G999| | |
|**G**|group (thousands) separator, same as **,**|9G999| |
usage = """ | ||
_FUNC_(strExpr, formatExpr) - Convert `strExpr` to a number based on the `formatExpr`. | ||
The format can consist of the following characters: | ||
'0' or '9': digit position |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'0' or '9': digit position | |
'0' or '9': digit position |
'0' or '9': digit position | ||
'.' or 'D': decimal point (only allowed once) | ||
',' or 'G': group (thousands) separator | ||
'-' or 'S': sign anchored to number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it only allowed once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PostgreSQL throws ERROR: cannot use "S" twice
Oracle throws ORA-01481: 无效的数字格式模型.
val builder = | ||
ctx.addReferenceObj("builder", numberFormatBuilder, classOf[NumberFormatBuilder].getName) | ||
val eval = left.genCode(ctx) | ||
ev.copy(code = code""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use multi-line string.
code"""
|...
""".stripMargin
|
||
/** | ||
* DecimalFormat provides '#' and '0' as placeholder of digit, ',' as grouping separator, | ||
* '.' as decimal separator, '-' as minus, '$' as dollar, but '9', 'G', 'D', 'S'. So we need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* '.' as decimal separator, '-' as minus, '$' as dollar, but '9', 'G', 'D', 'S'. So we need | |
* '.' as decimal separator, '-' as minus, '$' as dollar, but not '9', 'G', 'D', 'S'. So we need |
* 3. 'D' -> '.' | ||
* 4. 'S' -> '-' | ||
* | ||
* Note: When calling format, we must preserve the digits after decimal point, so the digits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very sure about this behavior, but let's leave the discussion when adding to_char
* '.' or 'D': decimal point (only allowed once) | ||
* ',' or 'G': group (thousands) separator | ||
* '-' or 'S': sign anchored to number | ||
* '$': returns value with a leading dollar sign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns
? This is def parse
numberDecimalFormat.applyPattern(transformedFormat) | ||
private def parse( | ||
input: UTF8String, | ||
numberFormat: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originalNumberFormat
?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberUtils.scala
Outdated
Show resolved
Hide resolved
val transformedFormat = transform(normalizedNumberFormat) | ||
val numberFormatInstance = NumberFormat.getNumberInstance(Locale.ROOT) | ||
assert(numberFormatInstance.isInstanceOf[DecimalFormat]) | ||
val numberDecimalFormat = numberFormatInstance.asInstanceOf[DecimalFormat] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's super inefficient to create a NumberFormat
instance for every input row.
|
||
val transformedFormat = transform(normalizedNumberFormat) | ||
val numberFormatInstance = NumberFormat.getNumberInstance(Locale.ROOT) | ||
assert(numberFormatInstance.isInstanceOf[DecimalFormat]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we just do new DecimalFormat()
?
|
||
Seq( | ||
("454", "999") -> Decimal(454), | ||
("054", "999") -> Decimal(54), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about "54"
and "999"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return Decimal(54)
. Let's update tests.
|
||
val transformedFormat = transform(normalizedNumberFormat) | ||
try { | ||
numberDecimalFormat.applyLocalizedPattern(transformedFormat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do this for every row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
private val transformedFormat = transform(normalizedNumberFormat) | ||
|
||
private lazy val numberDecimalFormat = { | ||
val decimalFormat = new DecimalFormat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we do new DecimalFormat(transformedFormat)
|
||
def parse(input: UTF8String): Decimal = { | ||
if (unChecked) { | ||
check() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. This is a static type check, why do you do it at runtime when parsing the first input?
|
||
private lazy val numberFormat = right.eval().toString.toUpperCase(Locale.ROOT) | ||
private lazy val numberFormatBuilder = new NumberFormatBuilder(numberFormat) | ||
private lazy val (precision, scale) = numberFormatBuilder.parsePrecisionAndScale() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's simpler to have a def parsedDecimalType: DecimalType
and use it here.
59965fc
to
b3906ca
Compare
} | ||
} | ||
|
||
def parsedDecimalType: DecimalType = DecimalType(precision, scale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I expect is
val parsedDecimalType = {
val formatSplits = normalizedNumberFormat.split(POINT_SIGN).map(_.filterNot(isSign))
assert(formatSplits.length <= 2)
val precision = formatSplits.map(_.length).sum
val scale = if (formatSplits.length == 2) formatSplits.last.length else 0
DecimalType(precision, scale)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precision
and scale
is used in other place.
val format2 = 0.until(input.length).map(_ => '9').mkString | ||
val format3 = 0.until(input.length + 1).map(_ => '0').mkString | ||
val format4 = 0.until(input.length + 1).map(_ => '9').mkString | ||
val format5 = 0.until(input.length + 2).map(_ => '0').mkString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test the mixed case as I proposed before? val format3 = 0.until(input.length + 2).map(i => i % 2).mkString
It's usually not very useful to go deeper into one factor in the test. To test format string that is longer than the value, the + 1
case should be sufficient, we don't need to test + 2
. Ditto to the - 2
case.
// Test '.' and 'D' | ||
checkExceptionInExpression[IllegalArgumentException]( | ||
ToNumber(Literal("454.2"), Literal("999")), | ||
"Format '999' used for parsing string to number or formatting number to string is invalid") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should refine the error message. 999
is indeed a valid number format string. The problem is the input does not match it.
How about
The input string '$input' does not match the given number format: '$format'
ToNumber(Literal("454.23"), Literal("00.000")), | ||
"Format '00.000' used for parsing string to number or formatting number to string is invalid") | ||
checkExceptionInExpression[IllegalArgumentException]( | ||
ToNumber(Literal("454.23"), Literal("00D000")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you really want to test all the cases, please write the test cases only with dot, and run it again after replacing dot with D
.
("12,454", "99,999") -> Decimal(12454), | ||
("12,454", "00,000") -> Decimal(12454), | ||
("12,454", "99G999") -> Decimal(12454), | ||
("12,454", "00G000") -> Decimal(12454), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same problem here, please make the tests more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and tests more cases, e.g. different number of digits.
SELECT endswith(null, null); | ||
|
||
-- to_number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we have extensive unit tests, we can only test some common cases in the end-to-end tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
Show resolved
Hide resolved
"The input string '123,456' does not match the given number format: '9G9'") | ||
|
||
Seq( | ||
("12,454", "99,999") -> Decimal(12454), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, no need to repeat the test cases manually, we can replace 0
with 9
in the loop body.
|
||
// Test ',' and 'G' | ||
checkExceptionInExpression[IllegalArgumentException]( | ||
ToNumber(Literal("123,456"), Literal("9G9")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also test if the number of ,
doesn't match? e.g.
to_number('123,456', '999,999,999')
to_number('123,456,789', '999,999')
("454-", "000-") -> Decimal(-454), | ||
("454-", "000S") -> Decimal(-454), | ||
("-454", "-000") -> Decimal(-454), | ||
("-454", "S000") -> Decimal(-454), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, replace 9
with 0
, -
with S
in the loop body.
import org.apache.spark.sql.types.Decimal | ||
import org.apache.spark.unsafe.types.UTF8String | ||
|
||
class NumberUtilsSuite extends SparkFunSuite { | ||
class NumberFormatterSuite extends SparkFunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if we still need this test suite since we test almost everything in the new test in StringExpressionsSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just follows StringUtils
with StringUtilsSuite
. In fact, RegexpExpressionsSuite.scala
test almost everything too.
select to_number('454.2', '999.9'); | ||
select to_number('454.2', '000.0'); | ||
select to_number('454.2', '999D9'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for .
and D
select to_number('12,454', '99,999'); | ||
select to_number('12,454', '00,000'); | ||
select to_number('12,454', '99G999'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
select to_number('$78.12', '$00.00'); | ||
select to_number('-454', '-999'); | ||
select to_number('-454', 'S999'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
input: UTF8String, numberFormat: String, errorMsg: String): Unit = { | ||
val e = intercept[IllegalArgumentException](parse(input, numberFormat)) | ||
private def invalidNumberFormat(numberFormat: String, errorMsg: String): Unit = { | ||
val testNumberFormatter = new TestNumberFormatter(numberFormat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need this TestNumberFormatter
? We can simply call check
, expect it to return TypeCheckFailure
, and check the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one minor comment, we can address it in the followup PR.
thanks, merging to master! |
@cloud-fan Thank you for your hard work! |
What changes were proposed in this pull request?
Many database support the function
to_number
to convert a string to number.The implement of
to_number
has many different betweenPostgresql
,Oracle
andPhoenix
.So, this PR follows the implement of
to_number
inOracle
that give a strict parameter verification.So, this PR follows the implement of
to_number
inPhoenix
that uses BigDecimal.This PR support the patterns for numeric formatting as follows:
There are some mainstream database support the syntax.
PostgreSQL:
https://www.postgresql.org/docs/12/functions-formatting.html
Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_NUMBER.html#GUID-D4807212-AFD7-48A7-9AED-BEC3E8809866
Vertica
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_NUMBER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CFormatting%20Functions%7C_____7
Redshift
https://docs.aws.amazon.com/redshift/latest/dg/r_TO_NUMBER.html
DB2
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1544.htm
Teradata
https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/TH2cDXBn6tala29S536nqg
Snowflake:
https://docs.snowflake.net/manuals/sql-reference/functions/to_decimal.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/to_number.htm#TO_NUMBER
Phoenix
http://phoenix.incubator.apache.org/language/functions.html#to_number
Singlestore
https://docs.singlestore.com/v7.3/reference/sql-reference/numeric-functions/to-number/
Intersystems
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_TONUMBER
The syntax like:
Why are the changes needed?
to_number
is very useful for formatted currency to number conversion.Does this PR introduce any user-facing change?
Yes. New feature.
How was this patch tested?
New tests