-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SQL functions to format numbers into human readable format #10635
Conversation
.../main/java/org/apache/druid/sql/calcite/expression/builtin/SizeFormatOperatorConversion.java
Outdated
Show resolved
Hide resolved
newScanQueryBuilder() | ||
.dataSource(CalciteTests.DATASOURCE3) | ||
.intervals(querySegmentSpec(Filtration.eternity())) | ||
.virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be?
.virtualColumns(expressionVirtualColumn("v0", "'44.61KiB'", ValueType.STRING), | |
.virtualColumns(expressionVirtualColumn("v0", "binary_date_format(45678)", ValueType.STRING), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a special case. I also thought it should a function call expression here. But because the given argument is a constant, this function call has been calculated during SQL parsing phase before converting to native sql. I've added some comments in the latest commit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks for clarifying
core/src/test/java/org/apache/druid/java/util/common/HumanReadableBytesTest.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/druid/java/util/common/HumanReadableBytes.java
Outdated
Show resolved
Hide resolved
Hi @abhishekagarwal87 , Thanks for your review. I will handle your comments on next Monday. |
Do you like how the web console auto compiles in the function docs? :-p |
Yes, it's very useful and convenient. I like it. |
Hi @asdf2014 @gianm @jihoonson @suneet-s , Could you help to review this PR ? |
@Override | ||
public ExprEval apply(List<Expr> args, Expr.ObjectBinding bindings) | ||
{ | ||
final long bytes = args.get(0).eval(bindings).asLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are calling asLong
without checking isNumericNull
of the ExprEval
, we are ignoring the value of druid.generic.useDefaultValueForNull
here, which I think is incorrect and this should return null instead.
Thinking out loud, how should this function behave with non-long inputs?
The way this is currently implement:
- Inputs of
ExprType.DOUBLE
will be cast to aExprType.LONG
before conversion. - For
ExprType.STRING
inputs, if they are number-ish strings, they will be parsed into long values, but if notasLong
will always be 0.
I don't know that this behavior is incorrect, I just wanted to call it out to think about it.
I do think we want to check for isNumericNull
and return ExprEval.of(null)
if NullHandling.sqlCompatible()
is set, for any input types.
I see in the SQL operator it looks like it strictly validates that the inputs are numeric, while Druid native expressions have traditionally been a bit fast and loose about the inputs they accept and tend to be rather forgiving, so perhaps this is ok that the behavior here doesn't quite match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I'll make some improvement here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @clintropolis , null and type handling has been improved in the latest commit. Please check it at your convenience.
docs/querying/sql.md
Outdated
|`BINARY_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KiB`, `MiB`, `GiB`, `TiB`, `PiB`, `EiB`. `precision` must be in the range of [0,3] (default: 2).| | ||
|`DECIMAL_BYTE_FORMAT(value, [precision])`|Returns the value in human-readable [SI](https://en.wikipedia.org/wiki/Binary_prefix) format. Supported unit suffix: `B`, `KB`, `MB`, `GB`, `TB`, `PB`, `EB`. `precision` must be in the range of [0,3] (default: 2).| | ||
|`DECIMAL_FORMAT(value, [precision])`|Returns the value in human-readable SI format. Supported unit suffix: `K`, `M`, `G`, `T`, `P`, `E`. `precision` must be in the range of [0,3] (default: 2).| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these docs live with the 'numeric' functions? I suppose here is ok too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see all existing numeric functions are all about mathematical computation, the input parameter and output result are all type of integer or float. The new functions in this PR are a little bit different from those, I don't whether it's suitable to put them in 'numeric function' section, so I put them in a separated section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i guess in the native expression version of this document the section is called math functions or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they make sense in the numeric section, because they accept numbers. It's OK that they don't return numbers. It's analogous to TIME_FORMAT
, which still belongs in the time section.
The failure of CI has nothing to do with changes in this PR. @clintropolis could you re-trigger CI ? |
…ormat functions do
dbea904
to
91e6a55
Compare
The branch has been rebased onto the latest master to resolve conflicts with master branch. And do you have any more comments ? @clintropolis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay. I think overall implementation looks good to me after the null-handling adjustments, but I need to do some more research to compare with other databases to properly weigh in on the question posed in the description of if these new functions are best to be split up into separate functions as they are in this PR, or combined into a single function.
Doing a quick survey at a super high level, at least I haven't found any that directly implement these particular functions in this PR, though there do appear to be some similar number to string formatting functions, and the theme so far seems to be to re-use the same function with different arguments (if it has more than 1 function):
- postgres
to_char
- oracle
to_char
- ms sql server
format
- mysql
format
which I guess just does decimal places formatting of numeric types and appears to be the only single purpose function I have found so far, though does support locales
However, most of what I have looked at so far that support these various translations have relatively sophisticated formatting models and work mechanically through these expressed as formatting strings supplied as arguments, which I am unsure if the functions in this PR fit that model exactly. I guess these could be like special named formats, but I am unsure if there is precedent for something like that, or a more proper way to do this.
I guess the advantage of being split up is that we don't have to consider the harder question of how this ties into formatting/type conversion in general. The downside I guess is that if we decide to consolidate them in the future it becomes more work to remove these special functions since we can't just delete them right out, though I guess it isn't much effort should we change our mind to mark them deprecated and live on for some time as aliases for a few release cycles until eventual removal.
@Test | ||
public void testSizeFormatWithNoDefaultValueForNull() | ||
{ | ||
NullHandling.updateForTests(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All unit tests are run with both values of druid.generic.useDefaultValueForNull
, so it isn't necessary to explicitly configure it. What we typically do is try to just write the test to check for the mode and adjust the expectation accordingly, e.g. to use another example from this file
assertExpr("lpad(x, 2, '')", NullHandling.replaceWithDefault() ? null : "fo");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL ships a similar function format_bytes
I struggled the two different approaches at first. And at last I chose to do it by 3 different functions. The reasons are,
-
different function names are more meaningful than different arguments for one function. Since there're 3 different unit systems in this PR, how to name them in a short enough way and without ambiguity is a big challenge. For example,
FORMAT(number, 'si')
,FORMAT(number, 'dec')
,si
anddec
are standard abbreviation and short enough but they're hard to understand;FORMAT(number, 'binary_byte')
, it's clear enough, but it's not so simple compared tobinary_byte_format(number)
-
at the underlying layer, there are always different format functions, and if we provide one function at the user side, we have to do some checks on the format specifier and dispatch calls to those different functions. It's a little bit simple if different functions are provided.
But as you mentioned, there are also some drawbacks in this way. If the standard is to keep consistent with other databases or keep less numbers of functions exposed to users, maybe we need to combine these functions together.
My 2¢ on the naming thing: I like having 3 separate functions, because I think a 1-function model works best if you have a good spec for format strings, which isn't what this patch is about. An example of a good spec is: https://www.postgresql.org/docs/current/functions-formatting.html#FUNCTIONS-FORMATTING-NUMERIC-TABLE If you just have a few different mutually exclusive options, like we do here, I think it's more SQL-y to have different functions. In the future, we might introduce a postgresql-style number formatting spec, but we don't have to do it today. A couple of things though:
I'm just writing this as a comment instead of a review, since I didn't read the code. |
Naming is really a big challenge in the world of programming :) . There're some ways listed as follows
It looks like the parameters are still too long.
If we want to design one function, the challenge is how to give nice names for format specifiers. But it also leaves the extensibility in the future to support other kinds of number format. Do you have nicer format specifier names ?
I think I would prefer the 3rd here. What do you think ? @gianm @clintropolis @abhishekagarwal87 ? |
I, for one, don't have a strong opinion.
Btw Java also has DecimalFormat which formats the number to any pattern use passes but doesn't compact it as such. |
@FrankChen021 very sorry for the delayed response, I got a bit busy and haven't had a chance to get back to this PR until now. I think I agree that we shouldn't try to do a unified format function right now, I didn't mean to hoist the responsibility onto this PR (but am glad we had the discussion). I think the 3rd option of using the names for the functions @gianm suggested sgtm. I think we can get this merged after the conflicts fixed and naming adjustments made. I still think you could adjust the tests to just expect the correct response based on the value of |
@clintropolis Thanks for your response. I didn't got enough time to handle comments you left in the past month either. I will submit some commits to fix the conflicts, naming style and tests ASAP. |
Signed-off-by: frank chen <frank.chen021@outlook.com>
Signed-off-by: frank chen <frank.chen021@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot about this one, sorry 😅, lgtm other than minor comment about sql return type
.operatorBuilder(StringUtils.toUpperCase(name)) | ||
.operandTypeChecker(new HumanReadableFormatOperandTypeChecker()) | ||
.functionCategory(SqlFunctionCategory.STRING) | ||
.returnTypeNonNull(SqlTypeName.VARCHAR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should use the recently added returnTypeCascadeNullable
since it returns null if the input is null (see #11327)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 (sorry for conflicts, I think that was my fault)
f48df63
to
27d9a41
Compare
Hi @abhishekagarwal87 @gianm , do you have any other comments ? |
docs/misc/math-expr.md
Outdated
|
||
| function | description | | ||
| --- | --- | | ||
| human_readable_binary_byte_format(value[, precision]) | Format a number in human-readable [IEC](https://en.wikipedia.org/wiki/Binary_prefix) format. For example, human_readable_binary_byte_format(1048576) returns `1.00 MiB`. `precision` must be in the range of [0,3] (default: 2). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FrankChen021 can you add one example with a custom precision value? Maybe, one of these examples can be modified itself.
@FrankChen021 Just one minor comment on the doc. Rest looks good to me. Thank you for your patience. |
Hi @gianm , do you have any other comments ? |
@FrankChen021 can you fix up conflicts? I think this is ready to merge otherwise |
Signed-off-by: frank chen <frank.chen021@outlook.com>
@clintropolis Fixed. Let's wait to see if there's any CI problems. |
This PR implements requirements described in #10584
In this PR, 3 SQL functions and 3 corresponding native functions are provided to format numbers to different style.
B
,KiB
,MiB
,GiB
,TiB
,PiB
,EiB
.precision
must be in the range of [0,3] (default: 2).B
,KB
,MB
,GB
,TB
,PB
,EB
.precision
must be in the range of [0,3] (default: 2).K
,M
,G
,T
,P
,E
.precision
must be in the range of [0,3] (default: 2).The reason why 3 functions are provided to user instead of 1 function with different argument of format is that we think it's more simpler for a user to call these function. However in internal implementation, there's only one public format function exposed for all 3 format.
We've tested this feature in our clusters, and it works as we expect.
Web Console now looks like
This PR has: