Add SQL functions to format numbers into human readable format #10635

FrankChen021 · 2020-12-04T09:13:37Z

This PR implements requirements described in #10584

This PR adds 3 SQL functions and 3 corresponding native functions that format numbers in different styles.

| SQL/Native function | Description |
| --- | --- |
| `human_readable_binary_byte_format(value[, precision])` | Returns the value in human-readable IEC format. Supported unit suffix: `B`, `KiB`, `MiB`, `GiB`, `TiB`, `PiB`, `EiB`. `precision` must be in the range of [0,3] (default: 2). |
| `human_readable_decimal_byte_format(value[, precision])` | Returns the value in human-readable SI format. Supported unit suffix: `B`, `KB`, `MB`, `GB`, `TB`, `PB`, `EB`. `precision` must be in the range of [0,3] (default: 2). |
| `human_readable_decimal_format(value[, precision])` | Returns the value in human-readable SI format. Supported unit suffix: `K`, `M`, `G`, `T`, `P`, `E`. `precision` must be in the range of [0,3] (default: 2). |

We provide 3 functions instead of 1 function with a format argument because we think separate functions are simpler for users to call. Internally, however, a single public format function serves all 3 formats.
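As a rough illustration of that design (not the code in this PR), a single formatter parameterized by unit system might look like the sketch below. The class name `HumanReadableSketch`, the `UnitSystem` enum, and the exact output layout (a space before the suffix) are assumptions made for illustration only. Rounding at unit boundaries is deliberately left unhandled.

```java
import java.util.Locale;

/** Hypothetical sketch; not Druid's HumanReadableBytes implementation. */
final class HumanReadableSketch
{
  /** The three unit systems described in the table above. */
  enum UnitSystem
  {
    BINARY_BYTE(1024, " B", " KiB", " MiB", " GiB", " TiB", " PiB", " EiB"),
    DECIMAL_BYTE(1000, " B", " KB", " MB", " GB", " TB", " PB", " EB"),
    DECIMAL(1000, "", " K", " M", " G", " T", " P", " E");

    final long base;
    final String[] suffixes;

    UnitSystem(long base, String... suffixes)
    {
      this.base = base;
      this.suffixes = suffixes;
    }
  }

  /** One entry point shared by all three user-facing functions. */
  static String format(long value, int precision, UnitSystem system)
  {
    if (precision < 0 || precision > 3) {
      throw new IllegalArgumentException("precision must be in the range of [0,3]");
    }
    // Scale the magnitude down until it fits the largest applicable unit.
    double magnitude = Math.abs((double) value);
    int unit = 0;
    while (magnitude >= system.base && unit < system.suffixes.length - 1) {
      magnitude /= system.base;
      unit++;
    }
    String sign = value < 0 ? "-" : "";
    String number = String.format(Locale.ROOT, "%." + precision + "f", magnitude);
    return sign + number + system.suffixes[unit];
  }

  public static void main(String[] args)
  {
    System.out.println(format(1048576L, 2, UnitSystem.BINARY_BYTE));  // 1.00 MiB
    System.out.println(format(1000000L, 0, UnitSystem.DECIMAL_BYTE)); // 1 MB
    System.out.println(format(1500L, 1, UnitSystem.DECIMAL));         // 1.5 K
  }
}
```

Each user-facing function would then be a thin wrapper that fixes the `UnitSystem` argument.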

We've tested this feature in our clusters, and it works as we expect.

Web Console now looks like

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

.../main/java/org/apache/druid/sql/calcite/expression/builtin/SizeFormatOperatorConversion.java

abhishekagarwal87 · 2020-12-04T09:57:11Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

+            newScanQueryBuilder()
+                .dataSource(CalciteTests.DATASOURCE3)
+                .intervals(querySegmentSpec(Filtration.eternity()))
+                .virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING),


should this be?

Suggested change

-                .virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING),
+                .virtualColumns(expressionVirtualColumn("v0", "binary_date_format(45678)", ValueType.STRING),

This is a special case. I also thought it should be a function call expression here. But because the given argument is a constant, the function call is evaluated during the SQL parsing phase, before the query is converted to a native query. I've added some comments in the latest commit here.

👍 Thanks for clarifying

core/src/test/java/org/apache/druid/java/util/common/HumanReadableBytesTest.java

core/src/main/java/org/apache/druid/java/util/common/HumanReadableBytes.java

FrankChen021 · 2020-12-04T10:46:54Z

Hi @abhishekagarwal87, thanks for your review. I will handle your comments next Monday.

vogievetsky · 2020-12-06T05:11:39Z

Do you like how the web console auto compiles in the function docs? :-p

FrankChen021 · 2020-12-07T05:43:17Z

> Do you like how the web console auto compiles in the function docs? :-p

Yes, it's very useful and convenient. I like it.

FrankChen021 · 2020-12-08T01:58:10Z

Hi @asdf2014 @gianm @jihoonson @suneet-s , Could you help to review this PR ?

clintropolis · 2020-12-08T06:02:53Z

core/src/main/java/org/apache/druid/math/expr/Function.java

+    @Override
+    public ExprEval apply(List<Expr> args, Expr.ObjectBinding bindings)
+    {
+      final long bytes = args.get(0).eval(bindings).asLong();


Since we are calling asLong without checking isNumericNull of the ExprEval, we are ignoring the value of druid.generic.useDefaultValueForNull here, which I think is incorrect and this should return null instead.

Thinking out loud, how should this function behave with non-long inputs?

The way this is currently implemented:

Inputs of ExprType.DOUBLE will be cast to an ExprType.LONG before conversion.

For ExprType.STRING inputs, if they are number-ish strings, they will be parsed into long values, but if not, asLong will always be 0.

I don't know that this behavior is incorrect, I just wanted to call it out to think about it.

I do think we want to check for isNumericNull and return ExprEval.of(null) if NullHandling.sqlCompatible() is set, for any input types.

I see in the SQL operator it looks like it strictly validates that the inputs are numeric, while Druid native expressions have traditionally been a bit fast and loose about the inputs they accept and tend to be rather forgiving, so perhaps this is ok that the behavior here doesn't quite match.
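For illustration, the null check suggested above could be sketched as follows. This is a standalone simplification, not Druid code: the `Long` argument stands in for an evaluated `ExprEval` (with `null` playing the role of `isNumericNull()`), the boolean stands in for `NullHandling.sqlCompatible()`, and the class and method names are hypothetical. It also assumes that in default-value mode a null number is treated as 0.

```java
import java.util.function.LongFunction;

/** Hypothetical stand-in for the null-aware branch of the native function. */
final class NullAwareFormatSketch
{
  /**
   * @param input              evaluated argument, or null for a numeric null (assumption)
   * @param sqlCompatibleNulls stand-in for NullHandling.sqlCompatible()
   * @param formatter          the actual number-to-string formatter
   */
  static String apply(Long input, boolean sqlCompatibleNulls, LongFunction<String> formatter)
  {
    if (input == null) {
      // SQL-compatible mode: null in, null out.
      // Default-value mode: the null is replaced by 0 before formatting.
      return sqlCompatibleNulls ? null : formatter.apply(0L);
    }
    return formatter.apply(input);
  }

  public static void main(String[] args)
  {
    LongFunction<String> bytes = v -> v + " B";
    System.out.println(apply(null, true, bytes));  // null
    System.out.println(apply(null, false, bytes)); // 0 B
    System.out.println(apply(2048L, true, bytes)); // 2048 B
  }
}
```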

Good point. I'll make some improvements here.

Hi @clintropolis , null and type handling has been improved in the latest commit. Please check it at your convenience.

clintropolis · 2020-12-08T06:06:57Z

docs/querying/sql.md

+|`BINARY_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KiB`, `MiB`, `GiB`, `TiB`, `PiB`, `EiB`. `precision` must be in the range of [0,3] (default: 2).|
+|`DECIMAL_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [SI](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KB`, `MB`, `GB`, `TB`, `PB`, `EB`. `precision` must be in the range of [0,3] (default: 2).|
+|`DECIMAL_FORMAT(value, [precision])`|Returns the value in human-readable SI format. Supported unit suffix: `K`, `M`, `G`, `T`, `P`, `E`. `precision` must be in the range of [0,3] (default: 2).|


Should these docs live with the 'numeric' functions? I suppose here is ok too...

As far as I can see, the existing numeric functions are all about mathematical computation: their input parameters and results are all integers or floats. The new functions in this PR are a little different, and I didn't know whether it was suitable to put them in the 'numeric functions' section, so I put them in a separate section.

yeah, i guess in the native expression version of this document the section is called math functions or something like that

I think they make sense in the numeric section, because they accept numbers. It's OK that they don't return numbers. It's analogous to TIME_FORMAT, which still belongs in the time section.

FrankChen021 · 2020-12-11T10:53:51Z

The failure of CI has nothing to do with changes in this PR. @clintropolis could you re-trigger CI ?

FrankChen021 · 2020-12-22T02:50:01Z

The branch has been rebased onto the latest master to resolve conflicts. Do you have any more comments, @clintropolis?

clintropolis

Apologies for the delay. I think overall implementation looks good to me after the null-handling adjustments, but I need to do some more research to compare with other databases to properly weigh in on the question posed in the description of if these new functions are best to be split up into separate functions as they are in this PR, or combined into a single function.

Doing a quick survey at a super high level, at least I haven't found any that directly implement these particular functions in this PR, though there do appear to be some similar number to string formatting functions, and the theme so far seems to be to re-use the same function with different arguments (if it has more than 1 function):

postgres to_char
oracle to_char
ms sql server format
mysql format which I guess just does decimal places formatting of numeric types and appears to be the only single purpose function I have found so far, though does support locales

However, most of what I have looked at so far that support these various translations have relatively sophisticated formatting models and work mechanically through these expressed as formatting strings supplied as arguments, which I am unsure if the functions in this PR fit that model exactly. I guess these could be like special named formats, but I am unsure if there is precedent for something like that, or a more proper way to do this.

I guess the advantage of being split up is that we don't have to consider the harder question of how this ties into formatting/type conversion in general. The downside I guess is that if we decide to consolidate them in the future it becomes more work to remove these special functions since we can't just delete them right out, though I guess it isn't much effort should we change our mind to mark them deprecated and live on for some time as aliases for a few release cycles until eventual removal.

clintropolis · 2020-12-22T21:28:56Z

core/src/test/java/org/apache/druid/math/expr/FunctionTest.java

+  @Test
+  public void testSizeFormatWithNoDefaultValueForNull()
+  {
+    NullHandling.updateForTests(false);


All unit tests are run with both values of druid.generic.useDefaultValueForNull, so it isn't necessary to explicitly configure it. What we typically do is try to just write the test to check for the mode and adjust the expectation accordingly, e.g. to use another example from this file

assertExpr("lpad(x, 2, '')", NullHandling.replaceWithDefault() ? null : "fo");

MySQL ships a similar function format_bytes

I struggled between the two approaches at first, and in the end I chose to provide 3 different functions. The reasons are:

Different function names are more meaningful than different arguments to one function. Since there are 3 different unit systems in this PR, naming them in a way that is both short and unambiguous is a big challenge. For example, in FORMAT(number, 'si') and FORMAT(number, 'dec'), si and dec are standard abbreviations and short enough, but they're hard to understand; FORMAT(number, 'binary_byte') is clear enough, but it's not as simple as binary_byte_format(number).

At the underlying layer, there are always different format functions. If we provide a single function on the user side, we have to validate the format specifier and dispatch calls to those different functions. It's a little simpler if different functions are provided.
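To make the second point concrete, here is a minimal sketch of what the single-function alternative would have to do. It is not code from this PR: the class name, the specifier strings other than 'binary_byte', and the fixed-unit stand-in formatters are all hypothetical.

```java
import java.util.Locale;

/** Hypothetical single-entry-point design: validate the specifier, then dispatch. */
final class FormatDispatchSketch
{
  static String format(long value, String spec)
  {
    switch (spec.toLowerCase(Locale.ROOT)) {
      case "binary_byte":
        return scaled(value, 1024.0, " KiB");
      case "decimal_byte":
        return scaled(value, 1000.0, " KB");
      case "decimal":
        return scaled(value, 1000.0, " K");
      default:
        // This check has no counterpart when there are separate functions.
        throw new IllegalArgumentException("Unknown format specifier: " + spec);
    }
  }

  // Stand-in formatter that always uses the first unit above the base one.
  private static String scaled(long value, double divisor, String suffix)
  {
    return String.format(Locale.ROOT, "%.2f%s", value / divisor, suffix);
  }

  public static void main(String[] args)
  {
    System.out.println(format(2048L, "binary_byte"));  // 2.00 KiB
    System.out.println(format(2000L, "decimal_byte")); // 2.00 KB
  }
}
```

With three separate functions, the specifier check disappears and a misspelled name fails when the query is planned rather than at evaluation time.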

But as you mentioned, there are also some drawbacks to this approach. If the goal is to stay consistent with other databases or to expose fewer functions to users, maybe we need to combine these functions.

gianm · 2021-01-08T22:34:15Z

My 2¢ on the naming thing: I like having 3 separate functions, because I think a 1-function model works best if you have a good spec for format strings, which isn't what this patch is about. An example of a good spec is: https://www.postgresql.org/docs/current/functions-formatting.html#FUNCTIONS-FORMATTING-NUMERIC-TABLE

If you just have a few different mutually exclusive options, like we do here, I think it's more SQL-y to have different functions. In the future, we might introduce a postgresql-style number formatting spec, but we don't have to do it today.

A couple of things though:

IMO the name decimal_format is misleading. It sounds like a function that will format a number to a specific amount of decimals. Maybe instead we could call it human_readable_decimal_format. For consistency it may then be nice to call the others human_readable_binary_byte_format and human_readable_decimal_byte_format. The function names are long, but I think that's probably better than being misleading. Any thoughts @FrankChen021, @abhishekagarwal87, @clintropolis?
Please include examples in the documentation of what the formatted numbers will look like.

I'm just writing this as a comment instead of a review, since I didn't read the code.

FrankChen021 · 2021-01-09T13:32:32Z

> IMO the name decimal_format is misleading. It sounds like a function that will format a number to a specific amount of decimals. Maybe instead we could call it human_readable_decimal_format. For consistency it may then be nice to call the others human_readable_binary_byte_format and human_readable_decimal_byte_format. The function names are long, but I think that's probably better than being misleading. Any thoughts @FrankChen021, @abhishekagarwal87, @clintropolis?

Naming is really a big challenge in the world of programming :). Some options are listed below.

Alternative 1 - One function with different format specifier
format(number, 'human_readable_bin_byte')
format(number, 'human_readable_dec_byte')
format(number, 'human_readable_decimal')

It looks like the parameters are still too long.

Alternative 2 - One function with abbreviated format specifier
format(number, 'iec') -- iec is the standard name for binary byte units.
format(number, 'si') -- si is the standard name for decimal byte units.
format(number, 'dec')

If we want to design one function, the challenge is how to choose good names for the format specifiers. But it also leaves room to support other kinds of number formats in the future. Do you have nicer format specifier names?

Alternative 3 - based on @gianm 's suggestion
human_readable_format_bin_byte
human_readable_format_dec_byte
human_readable_format_dec

I think I would prefer the 3rd here. What do you think, @gianm @clintropolis @abhishekagarwal87?

abhishekagarwal87 · 2021-01-09T19:22:29Z

I, for one, don't have a strong opinion. decimal_format here is equivalent to https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/CompactNumberFormat.html which I liked. So decimal_format could be compact_format/compact_number
The rest of the methods can be

compact_decimal_byte_format/compact_decimal_byte_number/compact_number_decimal_byte/compact_number_to_decimal_byte
and
compact_binary_byte_format/compact_binary_byte_number/compact_number_binary_byte/compact_number_to_binary_byte

compact_number / compact_number_to_decimal_byte / compact_number_to_binary_byte seems least ambiguous to me. That's all I can think of :)

Btw Java also has DecimalFormat, which formats the number to any pattern the user passes but doesn't compact it as such.

clintropolis · 2021-02-22T06:30:00Z

@FrankChen021 very sorry for the delayed response, I got a bit busy and haven't had a chance to get back to this PR until now. I think I agree that we shouldn't try to do a unified format function right now, I didn't mean to hoist the responsibility onto this PR (but am glad we had the discussion). I think the 3rd option of using the names for the functions @gianm suggested sgtm.

I think we can get this merged after the conflicts are fixed and the naming adjustments are made. I still think you could adjust the tests to just expect the correct response based on the value of druid.generic.useDefaultValueForNull, since travis runs both ways, which would make NullHandling.updateForTests unnecessary (this comment), but that's more of a nitpick on my end, and I am +1 either way.

FrankChen021 · 2021-02-22T06:37:23Z

@clintropolis Thanks for your response. I didn't have enough time over the past month to handle the comments you left either. I will submit some commits to fix the conflicts, naming style and tests ASAP.

Signed-off-by: frank chen <frank.chen021@outlook.com>

clintropolis

forgot about this one, sorry 😅, lgtm other than minor comment about sql return type

clintropolis · 2021-06-22T10:45:19Z

...a/org/apache/druid/sql/calcite/expression/builtin/HumanReadableFormatOperatorConversion.java

+        .operatorBuilder(StringUtils.toUpperCase(name))
+        .operandTypeChecker(new HumanReadableFormatOperandTypeChecker())
+        .functionCategory(SqlFunctionCategory.STRING)
+        .returnTypeNonNull(SqlTypeName.VARCHAR)


this should use the recently added returnTypeCascadeNullable since it returns null if the input is null (see #11327)

clintropolis

👍 (sorry for conflicts, I think that was my fault)

FrankChen021 · 2021-06-28T08:01:08Z

Hi @abhishekagarwal87 @gianm , do you have any other comments ?

abhishekagarwal87 · 2021-06-29T07:35:49Z

docs/misc/math-expr.md

+
+| function | description |
+| --- | --- |
+| human_readable_binary_byte_format(value[, precision]) | Format a number in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. For example, human_readable_binary_byte_format(1048576) returns `1.00 MiB`. `precision` must be in the range of [0,3] (default: 2). |


@FrankChen021 can you add one example with a custom precision value? Maybe, one of these examples can be modified itself.

abhishekagarwal87 · 2021-06-29T07:36:26Z

@FrankChen021 Just one minor comment on the doc. Rest looks good to me. Thank you for your patience.

docs/misc/math-expr.md

FrankChen021 · 2021-07-23T03:20:26Z

Hi @gianm , do you have any other comments ?

clintropolis · 2021-08-13T04:11:32Z

@FrankChen021 can you fix up conflicts? I think this is ready to merge otherwise

Signed-off-by: frank chen <frank.chen021@outlook.com>

FrankChen021 · 2021-08-13T06:59:58Z

@clintropolis Fixed. Let's wait to see if there are any CI problems.

abhishekagarwal87 reviewed Dec 4, 2020

clintropolis reviewed Dec 8, 2020

clintropolis added Area - Querying Area - SQL labels Dec 14, 2020

FrankChen021 added 9 commits December 22, 2020 09:21

add binary_byte_format/decimal_byte_format/decimal_format

6dd891e

clean code

1ecbbfc

fix doc

a6afe85

fix review comments

f5fab71

add spelling check rules

be3c333

remove extra param

e02809a

improve type handling and null handling

b5a3756

remove extra zeros

e279c18

fix tests and add space between unit suffix and number as most size-format functions do

91e6a55

FrankChen021 force-pushed the size-format branch from dbea904 to 91e6a55 Compare December 22, 2020 02:46

fix tests

5d0fe12

FrankChen021 requested a review from clintropolis December 28, 2020 00:57

clintropolis reviewed Jan 5, 2021

clintropolis added the Design Review label Jan 5, 2021

merge master to resolve conflicts

8e4eede

Signed-off-by: frank chen <frank.chen021@outlook.com>

FrankChen021 reopened this May 8, 2021

Merge master to resolve conflicts

f21bc87

Signed-off-by: frank chen <frank.chen021@outlook.com>

FrankChen021 force-pushed the size-format branch from dcdf54d to f21bc87 Compare May 13, 2021 02:09

Merge branch 'master' into size-format to resolve conflicts

8119545

clintropolis reviewed Jun 22, 2021

FrankChen021 added 3 commits June 23, 2021 16:47

Resolve review comments

fc7cc9b

Update SQL test case to check null handling

fa1f057

Fix intellij inspections

12edaab

clintropolis approved these changes Jun 26, 2021

Merge branch 'master' to resolve conflicts

27d9a41

FrankChen021 force-pushed the size-format branch from f48df63 to 27d9a41 Compare June 26, 2021 16:02

abhishekagarwal87 reviewed Jun 29, 2021

FrankChen021 added 2 commits June 29, 2021 17:12

Add more examples

30eb6ed

Fix example

e773dac

abhishekagarwal87 reviewed Jun 29, 2021

abhishekagarwal87 approved these changes Jun 29, 2021

abhishekagarwal87 added Feature Release Notes labels Jun 29, 2021

FrankChen021 requested a review from gianm July 23, 2021 03:20

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

Merge branch 'master' into size-format

f3c62a3

Signed-off-by: frank chen <frank.chen021@outlook.com>

clintropolis merged commit e40be0a into apache:master Aug 13, 2021

FrankChen021 deleted the size-format branch August 14, 2021 04:31

clintropolis mentioned this pull request Sep 9, 2021

[Draft] 0.22.0 Release Notes #11657

Closed

FrankChen021 mentioned this pull request Sep 10, 2021

Add a SQL function to format number into human readable format #10584

Closed
