
Consider adding -u and -d options to sort-bed to print unique and duplicated elements #196

Closed
alexpreynolds opened this issue Sep 10, 2017 · 9 comments


@alexpreynolds (Collaborator)

Functionality idea: A -u option can quickly report unique elements, similar to GNU sort -u functionality, and a -d option can report only duplicated elements, similar to GNU uniq -d.

At most, this would require storing only the previous and current sorted lines before printing anything to standard output.

@alexpreynolds (Collaborator, Author)

Added in commit de811e7

@sjneph (Collaborator) commented Nov 21, 2017

Could you show some time comparisons between this version and v2p4p29 when sorting large files? It looks like there is an extra string copy operation during printing that may have an impact on speed.

@sjneph (Collaborator) commented Nov 21, 2017

Belay that; I just missed an if statement. It all looks fine.

@alexpreynolds (Collaborator, Author)

Actually, you have a point. There is probably a smarter way to reduce the number of string copy operations by shuffling pointer addresses among the previous, current, and next buffers as we move from one sorted element to the next. I'll review the code and look at possible changes.

@sjneph (Collaborator) commented Nov 21, 2017

I was concerned only with the default behavior. I don't think there is a speed impact there, and I don't see a problem with a small hit when these newer options are used. The only impact on the default behavior might be the local arrays (in the previous version) versus the heap-allocated ones you are using now. It might be worth testing the default on a very large file: the local arrays count against the stack frame (memory), so an overflow would be caught very early if they were too large.

@bedops (Owner) commented Nov 21, 2017 via email

@sjneph (Collaborator) commented Nov 22, 2017

No, I'm not uncomfortable with this. Give it a quick test on a big input with caches purged to quantify things, ensure the impact is small to nonexistent, and move forward!

@alexpreynolds (Collaborator, Author)

I shuffled a 100M-element subset of a FIMO 1e-4 scan and sorted it with both the 2.4.29-typical sort-bed (current) and the 2.4.30-typical sort-bed (development).

Typical runs with the current sort-bed take approximately 144 s, e.g.:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c 'sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.54
user 135.56
sys 6.47
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c 'sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.62
user 135.05
sys 6.90
...

Here are runs of the development (2.4.30-typical) sort-bed on the same dataset:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 143.14
user 135.54
sys 6.66
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 146.69
user 138.95
sys 6.92
...

Average runtimes are very close to identical between the two -typical versions.

@alexpreynolds (Collaborator, Author)

Times of 2.4.30-megarow sort-bed on the same shuffled 100M-row subset were very close to 2.4.29-typical sort-bed times, e.g.:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.16
user 136.33
sys 6.79
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.15
user 136.43
sys 6.71
# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c './bin/sort-bed-megarow ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 144.07
user 136.37
sys 6.72

The megarow build of sort-bed 2.4.29 segfaulted, so I wasn't able to do an apples-to-apples between-build comparison:

# sync; echo 3 > /proc/sys/vm/drop_caches
# time -p sh -c '/net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
sh: line 1: 10014 Segmentation fault      (core dumped) /net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null
real 0.55
user 0.00
sys 0.00

Removing the stack limit gets rid of the segfault:

# ulimit -s unlimited
# time -p sh -c '/net/module/sw/bedops/2.4.29-megarow/bin/sort-bed ~/old.fimo.xfac2015.1e-4.starch.100M.bed.shuf > /dev/null'
real 142.52
user 134.92
sys 6.78

Sort times are very close. Using the heap seems safer: megarow builds set the BED_LINE_LEN constant large enough to push memory usage past the default stack size we use internally, which is likely the same default others use on their Linux hosts.
