Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: subtractBed: Add switch to allow -a file to be read in memory and -b from disk #92

Open
ghuls opened this issue Dec 5, 2013 · 1 comment

Comments

@ghuls
Copy link

ghuls commented Dec 5, 2013

Most of the time that I want to use subtractBed, the b file is very big (e.g.: 2 GiB: 66 million lines) as it contains all known snp positions.

This command is then very slow (file a contains only 10000 positions) and sometimes is unable to run as there is not enough memory to load the b file in memory:

subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the error I get when I run it on a machine with 32GiB of RAM:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

When running the command with strace:

strace -o subtractBed.strace subtractBed -a sample_snps.bed -b allknownsnps.bed

This is the end of the log:

read(3, "17\t67979430\t67979431\tG\tA\tPASS\nch"..., 8191) = 8191
brk(0x6a65bf000)                        = 0x6a65bf000
read(3, "7985837\tG\tC\tMinDP\nchr17\t67985847"..., 8191) = 8191
brk(0x6a65e7000)                        = 0x6a65e7000
read(3, "\tA\tG\tPASS\nchr17\t67992208\t6799220"..., 8191) = 8191
read(3, "\nchr17\t67999172\t67999173\tT\tG\tPAS"..., 8191) = 8191
brk(0x6a660b000)                        = 0x6a65e7000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "std::bad_alloc", 14)          = 14
write(2, "'\n", 2)                      = 2
write(2, "  what():  ", 11)             = 11
write(2, "std::bad_alloc", 14)          = 14
write(2, "\n", 1)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 26915
tgkill(26915, 26915, SIGABRT)           = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT (core dumped) +++
# Line number lose to where the PC runs out of memory: 58469306
$ grep -n -m1 '^chr17'$'\t''67999172' allknownsnps.bed
58469306:chr17  67999172    67999173    T   G   PASS

# Total number of lines in the file: 66007044
$ wc -l allknownsnps.bed
66007044 

It would be nice to use subtractBed like this:

subtractBed -loada -a sample_snps.bed -b allknownsnps.bed

Where -loada loads file a in memory and b from disk.
So if file b is read line by line, it just needs to remove all entries from file a (that is loaded in memory) that are found in file b.

I know it is possible to mimic this behaviour with, the following, but it would be much easier if it was implemented in subtractBed directly:

intersectBed -wb -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin
@ghuls
Copy link
Author

ghuls commented Dec 5, 2013

This is probably technically more correct (no -wb) as this allows to still use additional options for subtractBed:

intersectBed -a allknownsnps.bed -b sample_snps.bed | subtractBed -a sample_snps.bed -b stdin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant