
Consider adding -u and -d options to sort-bed to print unique and duplicated elements #196

Closed
alexpreynolds opened this issue Sep 10, 2017 · 9 comments


@alexpreynolds (Collaborator)

Functionality idea: A -u option can quickly report unique elements, similar to GNU sort -u functionality, and a -d option can report only duplicated elements, similar to GNU uniq -d.

At most, this would require storing only the previous and current sorted lines before printing anything to standard output.

@alexpreynolds (Collaborator, Author)

Added in commit de811e7

@sjneph (Collaborator) commented Nov 21, 2017

Could you show some time comparisons between this version and v2p4p29 when sorting large files? It looks like there is an extra string copy operation during printing that may have an impact on speed.

@sjneph (Collaborator) commented Nov 21, 2017

Belay that; I just missed an if statement. It all looks fine.

@alexpreynolds (Collaborator, Author)

Actually, you have a point. There is probably a smarter way to reduce the number of string copy operations by shuffling pointer addresses among the previous, current, and next buffers as we move from one sorted element to the next. I'll review the code and look at possible changes.

@sjneph (Collaborator) commented Nov 21, 2017

I was concerned only with the default behavior. I don't think there is a speed impact there, and I don't see a problem with a small hit when these newer options are used. The only impact on the default behavior might be the local arrays (in the previous version) versus the heap-allocated ones you are using now. It might be worth testing the default on a very large file: the local arrays count against the stack frame (memory), so an overflow would be caught very early if they were too large.

@bedops (Owner) commented Nov 21, 2017 via email

@sjneph (Collaborator) commented Nov 22, 2017

No, I'm not uncomfortable with this. Give it a quick test on a big input with caches purged to quantify things, ensure the impact is small to nonexistent, and move forward!

@alexpreynolds (Collaborator, Author)

I shuffled a 100M-element subset of a FIMO 1e-4 scan and sorted it with both the 2.4.29-typical sort-bed (current) and the 2.4.30-typical sort-bed (development).

Typical runs with the current sort-bed take approximately 144 s, e.g.:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c 'sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.54
user 135.56
sys 6.47
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c 'sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.62
user 135.05
sys 6.90
...

Here are runs of the development (2.4.30-typical) sort-bed on the same dataset:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 143.14
user 135.54
sys 6.66
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 146.69
user 138.95
sys 6.92
...

Average runtimes are very close to identical between the two -typical versions.

@alexpreynolds (Collaborator, Author)

Times of 2.4.30-megarow sort-bed on the same shuffled 100M-row subset were very close to 2.4.29-typical sort-bed times, e.g.:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.16
user 136.33
sys 6.79
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.15
user 136.43
sys 6.71
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.07
user 136.37
sys 6.72

The megarow build of sort-bed 2.4.29 segfaulted, so I wasn't able to do an apples-to-apples between-build comparison:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c '/net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
sh: line 1: 10014 Segmentation fault      (core dumped) /net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null
real 0.55
user 0.00
sys 0.00

Removing the stack limit gets rid of the segfault:

# ulimit -s unlimited
# time -p sh -c '/net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.52
user 134.92
sys 6.78

Sort times are very close. Using the heap seems safer: megarow builds set the BED_LINE_LEN constant large enough to push memory usage past the default stack size we use internally, which is likely the same default others use on their Linux hosts.
