Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQL functions to format numbers into human readable format #10635

Merged
merged 26 commits into from
Aug 13, 2021

Conversation

FrankChen021
Copy link
Member

@FrankChen021 FrankChen021 commented Dec 4, 2020

This PR implements requirements described in #10584

In this PR, 3 SQL functions and 3 corresponding native functions are provided to format numbers to different style.

SQL/Native function Description
human_readable_binary_byte_format(value[, precision]) Returns the value in human-readable IEC format. Supported unit suffix: B, KiB, MiB, GiB, TiB, PiB, EiB. precision must be in the range of [0,3] (default: 2).
human_readable_decimal_byte_format(value[, precision]) Returns the value in human-readable SI format. Supported unit suffix: B, KB, MB, GB, TB, PB, EB. precision must be in the range of [0,3] (default: 2).
human_readable_decimal_format(value[, precision]) Returns the value in human-readable SI format. Supported unit suffix: K, M, G, T, P, E. precision must be in the range of [0,3] (default: 2).

The reason why 3 functions are provided to user instead of 1 function with different argument of format is that we think it's more simpler for a user to call these function. However in internal implementation, there's only one public format function exposed for all 3 format.

We've tested this feature in our clusters, and it works as we expect.

image

Web Console now looks like

image


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

newScanQueryBuilder()
.dataSource(CalciteTests.DATASOURCE3)
.intervals(querySegmentSpec(Filtration.eternity()))
.virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be?

Suggested change
.virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING),
.virtualColumns(expressionVirtualColumn("v0", "binary_date_format(45678)", ValueType.STRING),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a special case. I also thought it should a function call expression here. But because the given argument is a constant, this function call has been calculated during SQL parsing phase before converting to native sql. I've added some comments in the latest commit here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for clarifying

@FrankChen021
Copy link
Member Author

Hi @abhishekagarwal87 , Thanks for your review. I will handle your comments on next Monday.

@vogievetsky
Copy link
Contributor

Do you like how the web console auto compiles in the function docs? :-p

@FrankChen021
Copy link
Member Author

Do you like how the web console auto compiles in the function docs? :-p

Yes, it's very useful and convenient. I like it.

@FrankChen021
Copy link
Member Author

Hi @asdf2014 @gianm @jihoonson @suneet-s , Could you help to review this PR ?

@Override
public ExprEval apply(List<Expr> args, Expr.ObjectBinding bindings)
{
final long bytes = args.get(0).eval(bindings).asLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are calling asLong without checking isNumericNull of the ExprEval, we are ignoring the value of druid.generic.useDefaultValueForNull here, which I think is incorrect and this should return null instead.

Thinking out loud, how should this function behave with non-long inputs?

The way this is currently implement:

  • Inputs of ExprType.DOUBLE will be cast to a ExprType.LONG before conversion.
  • For ExprType.STRING inputs, if they are number-ish strings, they will be parsed into long values, but if not asLong will always be 0.

I don't know that this behavior is incorrect, I just wanted to call it out to think about it.

I do think we want to check for isNumericNull and return ExprEval.of(null) if NullHandling.sqlCompatible() is set, for any input types.

I see in the SQL operator it looks like it strictly validates that the inputs are numeric, while Druid native expressions have traditionally been a bit fast and loose about the inputs they accept and tend to be rather forgiving, so perhaps this is ok that the behavior here doesn't quite match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I'll make some improvement here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @clintropolis , null and type handling has been improved in the latest commit. Please check it at your convenience.

Comment on lines 566 to 571
|`BINARY_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KiB`, `MiB`, `GiB`, `TiB`, `PiB`, `EiB`. `precision` must be in the range of [0,3] (default: 2).|
|`DECIMAL_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [SI](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KB`, `MB`, `GB`, `TB`, `PB`, `EB`. `precision` must be in the range of [0,3] (default: 2).|
|`DECIMAL_FORMAT(value, [precision])`|Returns the value in human-readable SI format. Supported unit suffix: `K`, `M`, `G`, `T`, `P`, `E`. `precision` must be in the range of [0,3] (default: 2).|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these docs live with the 'numeric' functions? I suppose here is ok too...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see all existing numeric functions are all about mathematical computation, the input parameter and output result are all type of integer or float. The new functions in this PR are a little bit different from those, I don't whether it's suitable to put them in 'numeric function' section, so I put them in a separated section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i guess in the native expression version of this document the section is called math functions or something like that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they make sense in the numeric section, because they accept numbers. It's OK that they don't return numbers. It's analogous to TIME_FORMAT, which still belongs in the time section.

@FrankChen021
Copy link
Member Author

The failure of CI has nothing to do with changes in this PR. @clintropolis could you re-trigger CI ?

@FrankChen021
Copy link
Member Author

The branch has been rebased onto the latest master to resolve conflicts with master branch. And do you have any more comments ? @clintropolis

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. I think overall implementation looks good to me after the null-handling adjustments, but I need to do some more research to compare with other databases to properly weigh in on the question posed in the description of if these new functions are best to be split up into separate functions as they are in this PR, or combined into a single function.

Doing a quick survey at a super high level, at least I haven't found any that directly implement these particular functions in this PR, though there do appear to be some similar number to string formatting functions, and the theme so far seems to be to re-use the same function with different arguments (if it has more than 1 function):

However, most of what I have looked at so far that support these various translations have relatively sophisticated formatting models and work mechanically through these expressed as formatting strings supplied as arguments, which I am unsure if the functions in this PR fit that model exactly. I guess these could be like special named formats, but I am unsure if there is precedent for something like that, or a more proper way to do this.

I guess the advantage of being split up is that we don't have to consider the harder question of how this ties into formatting/type conversion in general. The downside I guess is that if we decide to consolidate them in the future it becomes more work to remove these special functions since we can't just delete them right out, though I guess it isn't much effort should we change our mind to mark them deprecated and live on for some time as aliases for a few release cycles until eventual removal.

@Test
public void testSizeFormatWithNoDefaultValueForNull()
{
NullHandling.updateForTests(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All unit tests are run with both values of druid.generic.useDefaultValueForNull, so it isn't necessary to explicitly configure it. What we typically do is try to just write the test to check for the mode and adjust the expectation accordingly, e.g. to use another example from this file

assertExpr("lpad(x, 2, '')", NullHandling.replaceWithDefault() ? null : "fo");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL ships a similar function format_bytes

I struggled the two different approaches at first. And at last I chose to do it by 3 different functions. The reasons are,

  1. different function names are more meaningful than different arguments for one function. Since there're 3 different unit systems in this PR, how to name them in a short enough way and without ambiguity is a big challenge. For example, FORMAT(number, 'si'), FORMAT(number, 'dec'), si and dec are standard abbreviation and short enough but they're hard to understand; FORMAT(number, 'binary_byte'), it's clear enough, but it's not so simple compared to binary_byte_format(number)

  2. at the underlying layer, there are always different format functions, and if we provide one function at the user side, we have to do some checks on the format specifier and dispatch calls to those different functions. It's a little bit simple if different functions are provided.

But as you mentioned, there are also some drawbacks in this way. If the standard is to keep consistent with other databases or keep less numbers of functions exposed to users, maybe we need to combine these functions together.

@gianm
Copy link
Contributor

gianm commented Jan 8, 2021

My 2¢ on the naming thing: I like having 3 separate functions, because I think a 1-function model works best if you have a good spec for format strings, which isn't what this patch is about. An example of a good spec is: https://www.postgresql.org/docs/current/functions-formatting.html#FUNCTIONS-FORMATTING-NUMERIC-TABLE

If you just have a few different mutually exclusive options, like we do here, I think it's more SQL-y to have different functions. In the future, we might introduce a postgresql-style number formatting spec, but we don't have to do it today.

A couple of things though:

  • IMO the name decimal_format is misleading. It sounds like a function that will format a number to a specific amount of decimals. Maybe instead we could call it human_readable_decimal_format. For consistency it may then be nice to call the others human_readable_binary_byte_format and human_readable_decimal_byte_format. The function names are long, but I think that's probably better than being misleading. Any thoughts @FrankChen021, @abhishekagarwal87, @clintropolis?
  • Please include examples in the documentation of what the formatted numbers will look like.

I'm just writing this as a comment instead of a review, since I didn't read the code.

@FrankChen021
Copy link
Member Author

  • IMO the name decimal_format is misleading. It sounds like a function that will format a number to a specific amount of decimals. Maybe instead we could call it human_readable_decimal_format. For consistency it may then be nice to call the others human_readable_binary_byte_format and human_readable_decimal_byte_format. The function names are long, but I think that's probably better than being misleading. Any thoughts @FrankChen021, @abhishekagarwal87, @clintropolis?

Naming is really a big challenge in the world of programming :) . There're some ways listed as follows

  • Alternative 1 - One function with different format specifier
    format(number, 'human_readable_bin_byte')
    format(number, 'human_readable_dec_byte')
    format(number, 'human_readable_decimal')

It looks like the parameters are still too long.

  • Alternative 2 - One function with abbreviated format specifier
    format(number, 'si') -- si is the standard word for binary byte.
    format(number, 'iec') -- iec is the standard word for decimal byte
    format(number, 'dec')

If we want to design one function, the challenge is how to give nice names for format specifiers. But it also leaves the extensibility in the future to support other kinds of number format. Do you have nicer format specifier names ?

  • Alternative 3 - based on @gianm 's suggestion
    human_readable_format_bin_byte
    human_readable_format_dec_byte
    human_readable_format_dec

I think I would prefer the 3rd here. What do you think ? @gianm @clintropolis @abhishekagarwal87 ?

@abhishekagarwal87
Copy link
Contributor

I, for one, don't have a strong opinion. decimal_format here is equivalent to https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/CompactNumberFormat.html which I liked. So decimal_format could be compact_format/compact_number
The rest of the methods can be

compact_decimal_byte_format/compact_decimal_byte_number/compact_number_decimal_byte/compact_number_to_decimal_byte
and
compact_binary_byte_format/compact_binary_byte_number/compact_number_binary_byte/compact_number_to_binary_byte

compact_number / compact_number_to_decimal_byte / compact_number_to_binary_byte seems least ambiguous to me. That's all I can think of :)

Btw Java also has DecimalFormat which formats the number to any pattern use passes but doesn't compact it as such.

@clintropolis
Copy link
Member

@FrankChen021 very sorry for the delayed response, I got a bit busy and haven't had a chance to get back to this PR until now. I think I agree that we shouldn't try to do a unified format function right now, I didn't mean to hoist the responsibility onto this PR (but am glad we had the discussion). I think the 3rd option of using the names for the functions @gianm suggested sgtm.

I think we can get this merged after the conflicts fixed and naming adjustments made. I still think you could adjust the tests to just expect the correct response based on the value of druid.generic.useDefaultValueForNull since travis runs both ways, which would make NullHandling.updateForTests not necessary (this comment), but thats more of a nitpick on my end, and am +1 either way.

@FrankChen021
Copy link
Member Author

@clintropolis Thanks for your response. I didn't got enough time to handle comments you left in the past month either. I will submit some commits to fix the conflicts, naming style and tests ASAP.

Signed-off-by: frank chen <frank.chen021@outlook.com>
@FrankChen021 FrankChen021 reopened this May 8, 2021
Signed-off-by: frank chen <frank.chen021@outlook.com>
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot about this one, sorry 😅, lgtm other than minor comment about sql return type

.operatorBuilder(StringUtils.toUpperCase(name))
.operandTypeChecker(new HumanReadableFormatOperandTypeChecker())
.functionCategory(SqlFunctionCategory.STRING)
.returnTypeNonNull(SqlTypeName.VARCHAR)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use the recently added returnTypeCascadeNullable since it returns null if the input is null (see #11327)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 (sorry for conflicts, I think that was my fault)

@FrankChen021
Copy link
Member Author

Hi @abhishekagarwal87 @gianm , do you have any other comments ?


| function | description |
| --- | --- |
| human_readable_binary_byte_format(value[, precision]) | Format a number in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. For example, human_readable_binary_byte_format(1048576) returns `1.00 MiB`. `precision` must be in the range of [0,3] (default: 2). |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FrankChen021 can you add one example with a custom precision value? Maybe, one of these examples can be modified itself.

@abhishekagarwal87
Copy link
Contributor

@FrankChen021 Just one minor comment on the doc. Rest looks good to me. Thank you for your patience.

@FrankChen021
Copy link
Member Author

Hi @gianm , do you have any other comments ?

@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
@clintropolis
Copy link
Member

@FrankChen021 can you fix up conflicts? I think this is ready to merge otherwise

Signed-off-by: frank chen <frank.chen021@outlook.com>
@FrankChen021
Copy link
Member Author

@clintropolis Fixed. Let's wait to see if there's any CI problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants