Consider adding -u and -d options to sort-bed to print unique and duplicated elements #196
Comments
Added in commit de811e7 |
Could you show some time comparisons between this version and v2p4p29 when sorting large files? It looks like there is an extra string copy operation during printing that may have an impact on speed. |
Belay that - I just missed an if statement; it all looks fine. |
Actually, you have a point. There is probably a smarter way to reduce the number of string copy operations that are done, by shuffling pointer addresses between previous, current, and next, from sorted element to sorted element. I'll review it and look at possible changes. |
I was concerned only with the default behavior. I don't think there is a speed impact with that, and I don't see a problem with a little hit when these newer options are used. The only impact on the default behavior might be the local arrays (previous) vs the heap-allocated ones that you are using now. It might be worth testing out (the default) on a very large file. The local arrays count against the stack frame overhead (memory), and would be caught very early on if they were too large. |
My thought was that, though the megarow parameters we use now don't seem to pull the maximum line length past a typical 8 MB stack limit, I worry that people recompiling BEDOPS with larger exponent constants could run into error-message-less segfaults if stack usage grows too large. By using the heap, I thought we might reduce the likelihood of that problem.

I will be doing some more testing of this, anyway. It would be good to know how much of a performance hit the heap usage would be. We only allocate once, and hopefully fragmentation isn't too much of a problem. I could look at using kmalloc if it is, or we can go back to the stack-based way of things if you're uncomfortable with these changes.
|
No, I'm not uncomfortable with this. Give it a quick test on a big input with caches purged to quantify things, ensure the impact is small-to-non-existent, and move forward! |
I shuffled a 100M-element subset of a FIMO 1e-4 scan and sorted it with the current release build (2.4.29-typical), then with the development build (2.4.30-typical). (Timing output omitted.) Average runtimes are very close to identical between the two builds. |
Times of the 2.4.30-megarow build: (timing output omitted.) The megarow build of sort-bed segfaults under the default stack limit. Removing the stack limit gets rid of the segfault. Sort times are very close. Using the heap seems safer, as megarow builds make the stack-based line buffers large enough to overflow a typical stack limit. |
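For reference, a quick way to inspect and (where the hard limit allows) remove the per-process stack limit before running a megarow build; the exact limits available depend on the shell and system configuration:

```shell
# Show the current soft stack limit (in kB, or "unlimited").
ulimit -s
# Try to remove the cap for this shell session; may be refused
# if the hard limit is lower than "unlimited".
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit"
ulimit -s
```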
Functionality idea: A -u option can quickly report unique elements, similar to GNU sort -u functionality, and a -d option can report only duplicated elements, similar to GNU uniq -d. At most, it would only need storage of the previous and current sorted lines before printing anything to standard output.