Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HumanSizeWithPrecision invalid scientific notation #45

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kmirzavaziri
Copy link

Fixes issue #44.

I am not sure if the function HumanSizeWithPrecision with precision of 1 behaves as expected. Please let me know if there is any improvement I can make.

Signed-off-by: Kamyar Mirzavaziri <kmirzavaziri@gmail.com>
@kmirzavaziri kmirzavaziri force-pushed the 44-fix-human-size-with-precision branch from 46c67a9 to 14a2e46 Compare January 14, 2024 16:26
@kmirzavaziri
Copy link
Author

@thaJeztah Would appreciate if you could take a look at this.

@thaJeztah
Copy link
Member

Thanks for working on this! I recall this was a bit fiddly when I last looked at it.

@krissetto perhaps you're interested to review this?

assertEquals(t, "6TB", HumanSizeWithPrecision(float64(5.972*TB), 1))
assertEquals(t, "2PB", HumanSizeWithPrecision(float64(2.22*PB), 1))
assertEquals(t, "1e+04YB", HumanSizeWithPrecision(float64(10000000000000*PB), 1))
assertEquals(t, "0.009GB", HumanSizeWithPrecision(float64(9.5*MB), 1))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused, what is the precision int supposed to signify? I can't understand how "0.009GB" is a correct expectation for 9.5MB with a precision of 1

Copy link
Author

@kmirzavaziri kmirzavaziri Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you, the precision of 1 here means having at most one digit (summing up left and right side of the period), which is impossible to render for any number greater than or equal to 10. I personally prefer producing an error for precisions of 2 or less, but it would be a breaking change. (The reason it works with precision 3 or higher is that we want to render the maximum number of 999 which is three digits. Because: for 1000 or higher, we would raise the unit to the higher level).

For more clarification: For example rendering the value of 12.123456 with precision 3 must result in 12.1 and not 12.123 if I comprehend the intention of the function correctly, and as go formatting works.

So I thought we must try to return meaningful values for precision 3 or higher, and disregard lower precisions. I am not sure what would be the most suitable way to handle lower precisions but I am willing to discuss and implement the result that reflects the mindset of docker.

Note: We can have somehow meaningful return values with precision 2, because any number with three digits (>= 100), can be converted to the higher unit occupying the maximum of 2 digits. But it is impossible to have a correct return value for some numbers with precision 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to look as well; not all of this was ever properly "documented", and some of the behavior most definitely got into the codebase "organically" (parts of this code at least originated from the moby/moby ("docker engine") codebase.

I know that long (long) time ago I opened a ticket related to output that confused me, which was because different number of decimals were used in the output which, combined with confusion about decimal separators (. vs , differing between locales) at least confused "me"; moby/moby#9977 (comment)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Searching within the docker and moby orgs here on GitHub:

  • moby/containerd only seems to use CustomSize() in one spot;
  • In the docker org we seem to be using HumanSizeWithPrecision() in:
    • cli;
    • compose;
    • docker-ce;
    • registry-cli
  • and CustomSize() only in the cs-libnetwork repo.

Not a ton of occurrences, so I'd be happy if we could come together on a better definition for this func and make the required changes wherever they may be needed.

If we agree to continue using the word "precision" to represent the total number of digits in the size representation, as it seems to currently be, instead of the number of digits after the decimal point (which I think would be a much better way of both defining and using the func) then I think we should adjust how the function works so that the output is at least consistent for all precision values used.

All values should be approximated anyway, so if we need to approximate 9.5*MB with a "precision" of 1, then I think the result should be the closest representation using a single digit, so for example 9MB instead of 0.009GB.

From what I've managed to find searching around, we never use precision values of 1 or 2, so I wouldn't worry too much about the "validity" of the output given that it's an approximation. We can always write better documentation on the functions so things are a bit clearer for the users, or outright make 3 the minimum precision value.

eg To be consistent, I think these tests should pass

assertEquals(t, "9MB", HumanSizeWithPrecision(float64(9.5*MB), 1))
assertEquals(t, "9MB", HumanSizeWithPrecision(float64(9.9*MB), 1))
assertEquals(t, "9MB", HumanSizeWithPrecision(float64(100*MB), 1))
assertEquals(t, "1GB", HumanSizeWithPrecision(float64(500*MB), 1))
assertEquals(t, "1GB", HumanSizeWithPrecision(float64(900*MB), 1))
assertEquals(t, "1GB", HumanSizeWithPrecision(float64(1499*MB), 1))

My personal preference would be to fix the definition and use actual number of digits after the decimal point as the meaning of precision, maybe by writing a new func and adjusting existing code that uses the old func to use the new func before removing it entirely.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comprehensive description.

I believe that configuring the total number of digits (as opposed to the number of digits after the decimal point) when calling the function, would make a better functionality, since callers of the function mostly care about the maximum length of the return value (for example rendering a size to put it in a table in docker cli).

Although I strongly agree that the conventional definition of 'precision' typically refers to the number of digits after the decimal point, but Golang standard library uses this word for "width" definition in an exceptional case.

However I do agree that precision is not a good name and it is better to use a name such as width (or even better), but this also results in breaking the function name HumanSizeWith"Precision".

Also if we are sure that no one calls the function using precision 2 or 3, My personal opinion is that it is better to return empty string or behave in a similar way signaling an error (with proper documentation). An error, or a totally wrong answer makes it easier to spot and avoid using such a precision, than a misleading one (such as returning "9MB" for the input of float64(501*MB)) which might potentially result in hours of confusion and debugging.

Small note: one example of misleading present in the test cases would be that 500 is actually closer to 9 than 1000 (abs(500 - 9) = 491 < 500 = abs(1000 - 500)), but the human intuition believes that it must be rounded to 1000.

I am concerned about modifying the function signature and behavior, as I am unsure about the extent to which we should prioritize compatibility with the go-units library. Do we only care to update docker and moby orgs usages, or we prefer not to break any other random user of this module as well? Understanding the policy would aid me in making more informed decisions. And I am open to adding an improved function, updating caller projects gradually, and eventually removing the old one, if feasible and preferred. (As it would make it more consistent across all projects)

Anyway, I'm new to the open source and Docker community, so I'd appreciate your guidance. If you have any thoughts or know of community conventions I may not be aware of, please feel free to share. I'm open to adapting and aligning with best practices.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @krissetto, I'm looking forward to a final decision. Thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kmirzavaziri!

I don't think I'll be able to give a final decision here as I'm also new to Docker and I don't know how we manage these kinds of situations, but I'll leave my personal opinion and some observations here, waiting to be corrected by the powers that be ;)

TL;DR
I would think about creating a new function with more consistent behavior and naming, migrating our internal codebase to use it and marking the old func as deprecated somehow, but leaving it available in case any external codebases are using it. I don't think the "width" of the response should be part of the func signature, but only the precision (and maybe an optional maximum desired unit of size? eg TB). Handling the actual display size of the output seems to me to be a responsibility of whatever client decides to consume this library, and not something we should be dealing with here.


Some observations

I don't really like how we're using the concept of precision currently, as it goes against most of the stdlib fmt use-cases (which consider both width - min number of unicode runes - and precision - chars to the right of the decimal point). What we're actually using is the concept of "significant digits", which Go seems to handle in a peculiar way when using the %.*g syntax.

Take a look at the following test case we have:
assertEquals(t, "0.009GB", WowWithPrecision(float64(9.5*MB), 1))

Here the result ends up being 0.009GB, as 9 is the precision of one (as in 1 significant digit). This is still somewhat ok as it's still human-readable, but I think we should adjust our logic to leave the representation as 9.5MB, which is shorter in width, more human-readable, and actually has a "precision" of 1 (the kind of precision we like).

Checking this test case instead:
assertEquals(t, "1e+04YB", HumanSizeWithPrecision(float64(10000000000000*PB), 1))

I think we have more of a proble. Using the %.*g syntax, Go is converting the output to scientific notation in order to keep 1 significant digit, the 4 at the very end. My guess is that 1e+0.. is not considered as significant because it's part of the standard scientific notation, but I'm sure most-all of us would agree that it's not exactly human readable. would probably be better after a certain file size 😂

If we were to use %.*f instead, the output would become 10000.0YB. 1 digit of (in this case unnecessary) precision, but a relatively "wide" total number since we currently max out our units at YB, and thus cannot divide any further.

In the first test example, using 9.5MB, the func would return 0.0GB using %.*f and a precision of 1, since we are still doing one too many divisions and don't have enough precision available to represent the value at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krissetto Thank you for sharing the observations. I completely agree with your perspective on the situation. The only concern I have regarding the use of "precision" instead of "significant digits", is that it may pose challenges for callers in controlling the length of strings, potentially making it more difficult to address issues like this one.

However, it's worth noting that we still cannot guarantee the maximum length, at least not for precision < 3 or for sizes over 999YB (unless we return inaccurate responses such as - or ). Also there are some workarounds for the callers. And since they cannot rely on the implementation of this function anyway, they must control the length themselves.

Anyway, I am happy in both ways, and believe each can have their own pros and cons. @thaJeztah, based on the context and our conversion, a final decision would be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants