line number approximation of tabular data #6506

Closed
bernt-matthias opened this issue Jul 17, 2018 · 2 comments

@bernt-matthias
Contributor

bernt-matthias commented Jul 17, 2018

For tabular data (GTF in my examples) an approximate number of lines is shown. This approximation is often quite far off, e.g.

  • ~30,000,000 lines for a data set with 28,147,144 lines
  • ~8,700,000 lines for a data set with 8,232,791 lines

I would like to know if there is a technical reason for this (I could not find anything in the code). If not, I suggest one of the following solutions:

  • actually show a shorter representation like ~30 * 10^6
  • show the precise number (which would use the same space)

If I knew where this is implemented and had a suggestion for how to solve it, I would volunteer.
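
For reference, the size of the discrepancy in the two examples above can be worked out directly. This is only a small arithmetic sketch; the helper name `relative_error` is made up for illustration. Note that rounding or truncating the true counts to two significant digits would give 28,000,000 and 8,200,000, so the displayed values cannot come from rounding the true count alone.

```python
def relative_error(estimate: int, actual: int) -> float:
    """Signed relative error of an estimate against the true line count."""
    return (estimate - actual) / actual


# (displayed estimate, line count reported in the issue)
cases = [(30_000_000, 28_147_144), (8_700_000, 8_232_791)]

for estimate, actual in cases:
    error = relative_error(estimate, actual)
    print(f"~{estimate:,} shown for {actual:,} lines: {error:+.1%}")
```

Both displayed values overshoot the real count by roughly 6 to 7 percent.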

@tougai

tougai commented Feb 11, 2019

I have experienced the same problem, i.e. different approximations when extracting the first column of a tabular file with cut (~3,300,000) compared with the original file (~2,800,000). When I look at the files and run wc -l on the command line, I get the same number (2,711,716) for both .dat files.

@bernt-matthias
Contributor Author

Just found the code used for the estimation:

def estimate_file_lines(self, dataset):
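
The body of `estimate_file_lines` is not shown in this thread, so the following is only a minimal sketch of one common way such an estimate can be made, assuming it extrapolates from a sample read at the start of the file. The names `estimate_lines`, `count_lines`, `write_demo_file` and the sample size are hypothetical and not taken from the Galaxy code. It illustrates why two files with the same number of lines can get different estimates: the result depends on how representative the line lengths in the sample are.

```python
import os
import tempfile


def estimate_lines(path: str, sample_size: int = 1024 * 1024) -> int:
    """Extrapolate a line count from the first sample_size bytes (hypothetical helper)."""
    total_size = os.path.getsize(path)
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if not sample:
        return 0
    lines_in_sample = sample.count(b"\n")
    if len(sample) >= total_size:
        # The whole file fit into the sample, so the count is exact.
        return lines_in_sample
    # Scale the sampled line count up to the full file size.
    return int(lines_in_sample * total_size / len(sample))


def count_lines(path: str) -> int:
    """Exact line count, reading the file in chunks (like wc -l)."""
    with open(path, "rb") as fh:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: fh.read(1 << 20), b""))


def write_demo_file(path: str) -> None:
    """Write a file whose first rows are shorter than the later ones."""
    with open(path, "wb") as fh:
        fh.write(b"chr1\t1\n" * 1000)
        fh.write(b"chr1\t1\tlong attribute column\n" * 1000)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    write_demo_file(demo_path)
    print("estimated:", estimate_lines(demo_path, sample_size=4096))
    print("actual:   ", count_lines(demo_path))
```

Because the sampled rows are shorter than the average row, the extrapolation overestimates the line count in this demo; with longer rows at the start it would underestimate instead.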

bernt-matthias added a commit to bernt-matthias/galaxy that referenced this issue May 13, 2022
fixes galaxyproject#6506

for large files the number of lines is estimated and shown
as a rounded number (using two significant digits), e.g
`~8,700,000 lines`.

with this change it will be: `~87 10^5 lines`

this commit also makes roundify really round numbers (as the name
suggests) and not simply cut at two digits, but this could be
reverted if there are concerns wrt speed due to using more math
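
The difference between cutting at two digits and really rounding, and the shorter display format described in the commit message, can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the function names `truncate_sig`, `round_sig` and `compact_label` are hypothetical, inputs are assumed to be non-negative integers, and ties are rounded up.

```python
def truncate_sig(n: int, sig: int = 2) -> int:
    """Keep the first `sig` digits and zero out the rest (simple cut)."""
    digits = len(str(n))
    if digits <= sig:
        return n
    factor = 10 ** (digits - sig)
    return (n // factor) * factor


def round_sig(n: int, sig: int = 2) -> int:
    """Round to `sig` significant digits, rounding halves up."""
    digits = len(str(n))
    if digits <= sig:
        return n
    factor = 10 ** (digits - sig)
    return ((n + factor // 2) // factor) * factor


def compact_label(n: int, sig: int = 2) -> str:
    """Format a rounded estimate as '~<mantissa> 10^<exponent> lines'."""
    rounded = round_sig(n, sig)
    digits = len(str(rounded))
    if digits <= sig:
        return f"~{rounded} lines"
    exponent = digits - sig
    return f"~{rounded // 10 ** exponent} 10^{exponent} lines"


print(truncate_sig(8_768_000))   # 8700000
print(round_sig(8_768_000))      # 8800000
print(compact_label(8_700_000))  # ~87 10^5 lines
```

Integer arithmetic is used here instead of floating-point logarithms, which keeps the extra cost of rounding over cutting very small.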
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this issue May 14, 2022
@dannon dannon closed this as completed in 7abb163 Mar 5, 2024