
bamCoverage very slow, regardless of BAM size, --regions or other shortcuts #662

Closed
alexbarrera opened this issue Feb 7, 2018 · 18 comments


@alexbarrera

Hi,

I have noticed that for ChIP-seq BAM files, bamCoverage is extremely slow. The following bamCoverage call over a BAM of ~2k reads (830 KB) takes more than 15 min. I'm using the latest deepTools available in conda (bamCoverage 2.5.7), but the same is true for my previous deepTools version (2.2.4).

$ bamCoverage \
--bam GR.dedup_filtered.sorted.bam \
--binSize 1 \
--verbose \
--extendReads 200 \
--normalizeUsingRPKM \
--outFileFormat bigwig \
--outFileName GR.dedup_filtered.sorted.rpkm.bw \
--numberOfProcessors 1 \
--skipNAs \
--region chr21

The log:

genome partition size for multiprocessing: 1000000
genome partition size for multiprocessing: 500000
genome partition size for multiprocessing: 250000
genome partition size for multiprocessing: 125000
genome partition size for multiprocessing: 62500
genome partition size for multiprocessing: 50000
normalization: RPKM
Final scaling factor: 51041.241323
genome partition size for multiprocessing: 1050000
minFragmentLength: 0
verbose: True
out_file_for_raw_data: None
numberOfSamples: None
bedFile: None
bamFilesList: ['ctrl_001.dedup_filtered.sorted.bam']
ignoreDuplicates: False
numberOfProcessors: 1
samFlag_exclude: None
save_data: False
stepSize: 1
smoothLength: None
center_read: False
defaultFragmentLength: 200
chrsToSkip: []
region: chr21:1
maxPairedFragmentLength: 800
samFlag_include: None
binLength: 1
blackListFileName: None
maxFragmentLength: 0
minMappingQuality: None
zerosToNans: True
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:1-1000001
MainProcess countReadsInRegions_worker: processing 1000000 (7019720.4 per sec) @ chr21:1-1000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:1000001-2000001
MainProcess countReadsInRegions_worker: processing 1000000 (6849643.7 per sec) @ chr21:1000001-2000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:2000001-3000001
MainProcess countReadsInRegions_worker: processing 1000000 (6978364.3 per sec) @ chr21:2000001-3000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:3000001-4000001
MainProcess countReadsInRegions_worker: processing 1000000 (6872517.0 per sec) @ chr21:3000001-4000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:4000001-5000001
MainProcess countReadsInRegions_worker: processing 1000000 (7018087.8 per sec) @ chr21:4000001-5000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:5000001-6000001
MainProcess countReadsInRegions_worker: processing 1000000 (6885299.3 per sec) @ chr21:5000001-6000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:6000001-7000001
MainProcess countReadsInRegions_worker: processing 1000000 (7029426.3 per sec) @ chr21:6000001-7000001
MainProcess,  processing 1 (1209.4 per sec) reads @ chr21:7000001-8000001
MainProcess countReadsInRegions_worker: processing 1000000 (6840316.1 per sec) @ chr21:7000001-8000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:8000001-9000001
MainProcess countReadsInRegions_worker: processing 1000000 (7004621.0 per sec) @ chr21:8000001-9000001
MainProcess,  processing 1 (2336.7 per sec) reads @ chr21:9000001-10000001
MainProcess countReadsInRegions_worker: processing 1000000 (6999360.9 per sec) @ chr21:9000001-10000001
MainProcess,  processing 3 (2292.0 per sec) reads @ chr21:10000001-11000001
MainProcess countReadsInRegions_worker: processing 1000000 (6997504.2 per sec) @ chr21:10000001-11000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:11000001-12000001
MainProcess countReadsInRegions_worker: processing 1000000 (7017395.0 per sec) @ chr21:11000001-12000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:12000001-13000001
MainProcess countReadsInRegions_worker: processing 1000000 (7016808.0 per sec) @ chr21:12000001-13000001
MainProcess,  processing 0 (0.0 per sec) reads @ chr21:13000001-14000001
MainProcess countReadsInRegions_worker: processing 1000000 (7030710.6 per sec) @ chr21:13000001-14000001
MainProcess,  processing 3 (2434.8 per sec) reads @ chr21:14000001-15000001
MainProcess countReadsInRegions_worker: processing 1000000 (7010849.8 per sec) @ chr21:14000001-15000001
MainProcess,  processing 5 (2761.2 per sec) reads @ chr21:15000001-16000001
MainProcess countReadsInRegions_worker: processing 1000000 (6965384.8 per sec) @ chr21:15000001-16000001
MainProcess,  processing 7 (3824.9 per sec) reads @ chr21:16000001-17000001
MainProcess countReadsInRegions_worker: processing 1000000 (7007265.7 per sec) @ chr21:16000001-17000001
MainProcess,  processing 5 (3787.5 per sec) reads @ chr21:17000001-18000001
MainProcess countReadsInRegions_worker: processing 1000000 (6982895.3 per sec) @ chr21:17000001-18000001
MainProcess,  processing 3 (2396.3 per sec) reads @ chr21:18000001-19000001
MainProcess countReadsInRegions_worker: processing 1000000 (7026023.2 per sec) @ chr21:18000001-19000001
MainProcess,  processing 6 (4662.1 per sec) reads @ chr21:19000001-20000001
MainProcess countReadsInRegions_worker: processing 1000000 (7025929.1 per sec) @ chr21:19000001-20000001
MainProcess,  processing 6 (3401.7 per sec) reads @ chr21:20000001-21000001
MainProcess countReadsInRegions_worker: processing 1000000 (6993105.8 per sec) @ chr21:20000001-21000001
MainProcess,  processing 2 (2522.1 per sec) reads @ chr21:21000001-22000001
MainProcess countReadsInRegions_worker: processing 1000000 (6982070.0 per sec) @ chr21:21000001-22000001
MainProcess,  processing 2 (1620.7 per sec) reads @ chr21:22000001-23000001
MainProcess countReadsInRegions_worker: processing 1000000 (6992371.3 per sec) @ chr21:22000001-23000001
MainProcess,  processing 3 (2286.6 per sec) reads @ chr21:23000001-24000001
MainProcess countReadsInRegions_worker: processing 1000000 (7033151.0 per sec) @ chr21:23000001-24000001
MainProcess,  processing 5 (2839.4 per sec) reads @ chr21:24000001-25000001
MainProcess countReadsInRegions_worker: processing 1000000 (7011740.6 per sec) @ chr21:24000001-25000001
MainProcess,  processing 5 (3955.4 per sec) reads @ chr21:25000001-26000001
MainProcess countReadsInRegions_worker: processing 1000000 (7021577.1 per sec) @ chr21:25000001-26000001
MainProcess,  processing 8 (4444.3 per sec) reads @ chr21:26000001-27000001
MainProcess countReadsInRegions_worker: processing 1000000 (7018980.3 per sec) @ chr21:26000001-27000001
MainProcess,  processing 6 (3359.5 per sec) reads @ chr21:27000001-28000001
MainProcess countReadsInRegions_worker: processing 1000000 (7026176.2 per sec) @ chr21:27000001-28000001
MainProcess,  processing 4 (3005.1 per sec) reads @ chr21:28000001-29000001
MainProcess countReadsInRegions_worker: processing 1000000 (6985581.8 per sec) @ chr21:28000001-29000001
MainProcess,  processing 9 (4912.6 per sec) reads @ chr21:29000001-30000001
MainProcess countReadsInRegions_worker: processing 1000000 (6991287.4 per sec) @ chr21:29000001-30000001
MainProcess,  processing 2 (1587.2 per sec) reads @ chr21:30000001-31000001
MainProcess countReadsInRegions_worker: processing 1000000 (7014109.1 per sec) @ chr21:30000001-31000001
MainProcess,  processing 3 (2375.5 per sec) reads @ chr21:31000001-32000001
MainProcess countReadsInRegions_worker: processing 1000000 (6998134.6 per sec) @ chr21:31000001-32000001
MainProcess,  processing 1 (1270.6 per sec) reads @ chr21:32000001-33000001
MainProcess countReadsInRegions_worker: processing 1000000 (7043497.2 per sec) @ chr21:32000001-33000001
MainProcess,  processing 3 (3671.7 per sec) reads @ chr21:33000001-34000001
MainProcess countReadsInRegions_worker: processing 1000000 (7014730.9 per sec) @ chr21:33000001-34000001
MainProcess,  processing 6 (4531.9 per sec) reads @ chr21:34000001-35000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009771.9 per sec) @ chr21:34000001-35000001
MainProcess,  processing 4 (3002.9 per sec) reads @ chr21:35000001-36000001
MainProcess countReadsInRegions_worker: processing 1000000 (7022070.9 per sec) @ chr21:35000001-36000001
MainProcess,  processing 3 (1674.2 per sec) reads @ chr21:36000001-37000001
MainProcess countReadsInRegions_worker: processing 1000000 (6964228.3 per sec) @ chr21:36000001-37000001
MainProcess,  processing 4 (3112.7 per sec) reads @ chr21:37000001-38000001
MainProcess countReadsInRegions_worker: processing 1000000 (7022023.9 per sec) @ chr21:37000001-38000001
MainProcess,  processing 5 (3834.6 per sec) reads @ chr21:38000001-39000001
MainProcess countReadsInRegions_worker: processing 1000000 (7012866.0 per sec) @ chr21:38000001-39000001
MainProcess,  processing 4 (4634.6 per sec) reads @ chr21:39000001-40000001
MainProcess countReadsInRegions_worker: processing 1000000 (7003872.4 per sec) @ chr21:39000001-40000001
MainProcess,  processing 3 (3640.9 per sec) reads @ chr21:40000001-41000001
MainProcess countReadsInRegions_worker: processing 1000000 (7010802.9 per sec) @ chr21:40000001-41000001
MainProcess,  processing 3 (3501.1 per sec) reads @ chr21:41000001-42000001
MainProcess countReadsInRegions_worker: processing 1000000 (6950091.9 per sec) @ chr21:41000001-42000001
MainProcess,  processing 3 (2399.9 per sec) reads @ chr21:42000001-43000001
MainProcess countReadsInRegions_worker: processing 1000000 (7005159.1 per sec) @ chr21:42000001-43000001
MainProcess,  processing 4 (2304.2 per sec) reads @ chr21:43000001-44000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009080.7 per sec) @ chr21:43000001-44000001
MainProcess,  processing 3 (1727.0 per sec) reads @ chr21:44000001-45000001
MainProcess countReadsInRegions_worker: processing 1000000 (7004913.4 per sec) @ chr21:44000001-45000001
MainProcess,  processing 3 (2429.1 per sec) reads @ chr21:45000001-46000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009713.3 per sec) @ chr21:45000001-46000001
MainProcess,  processing 1 (2391.3 per sec) reads @ chr21:46000001-46709983
MainProcess countReadsInRegions_worker: processing 709982 (7033488.0 per sec) @ chr21:46000001-46709983
output file: ctrl_001.dedup_filtered.sorted.rpkm.bw

The run seems to spend most of its time in the first phase, where each iteration of the mapReduce function prints a new "genome partition size for multiprocessing: XXXXX" line.

I have uploaded the BAM and index files to Google Drive.

Many thanks in advance!

@fidelram
Collaborator

fidelram commented Feb 7, 2018 via email

@alexbarrera
Author

@fidelram thanks for your reply. The observation is independent of the architecture this is run on. I have access to a powerful HPC where memory is not an issue, and there is no out-of-memory error. Increasing the binSize does not change the long processing time.

As originally noted, I specify --numberOfProcessors 1, but it doesn't seem to be honored (it still goes through the multiprocessing path).

This does not seem to affect all files: I have computed the coverage of many RNA-seq BAM files without issues. It must be something in these BAMs; however, I'm not sure what could cause a 2,000-read BAM to take such a long time.

Did you get a chance to run the command above without issues?
Thanks

@dpryan79
Collaborator

dpryan79 commented Feb 8, 2018

The performance is so poor because of the small number of reads in the file. deepTools is tuned toward large files and will often perform poorly on very small files because of this. This particular slowdown comes from a tuning step in the normalization code that tries to determine the number of reads that will be filtered out. That normally works by sampling ~1% of the reads in a file, but for such small files it ends up needing to select every read. There's no short-circuit for "just use every read", though, which is why you see multiple "genome partition size" lines in the output, one for each iteration through the file as it attempts to collect enough reads to meet its minimum sampling rate.
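
To make that concrete, here is a schematic, self-contained Python sketch of the behavior, consistent with the log above but not the actual deepTools source; MIN_SAMPLE, the 1% fraction, and the 50000 floor are illustrative assumptions. The tuner keeps shrinking the partition size and rescanning until it has collected a minimum number of reads, which a ~2k-read BAM can never provide.

import random

GENOME_SIZE = 46_709_983   # roughly chr21-sized toy genome
TOTAL_READS = 2_000        # tiny BAM, as in this report
MIN_SAMPLE = 100_000       # hypothetical minimum sample size the tuner wants

# Toy reads: just random start positions.
reads = sorted(random.randrange(GENOME_SIZE) for _ in range(TOTAL_READS))

def sample_reads(partition_size, fraction=0.01):
    """Scan the genome partition by partition, keeping ~1% of each window."""
    sampled = []
    for start in range(0, GENOME_SIZE, partition_size):
        window = [r for r in reads if start <= r < start + partition_size]
        sampled.extend(window[:max(1, int(len(window) * fraction))])
    return sampled

partition = 1_000_000
while True:
    print("genome partition size for multiprocessing:", partition)
    sampled = sample_reads(partition)
    if len(sampled) >= MIN_SAMPLE or partition <= 50_000:
        break  # there is no short-circuit for "just use every read"
    partition = max(partition // 2, 50_000)  # each halving rescans the whole file

With only 2,000 reads the sample can never reach the minimum, so the loop walks through every partition size down to the floor, and each pass rescans the file; that repeated scanning is where the time goes.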

@dpryan79
Collaborator

dpryan79 commented Feb 8, 2018

I'm working on a fix to bypass this, since (A) it's easy enough to detect that all of the reads should be sampled and (B) nothing will be filtered out anyway, so we can skip all of that. That'll appear in deepTools 3, which should be out soon (the code is finished, I just need to wait for pysam to be updated for it all to work).
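
Schematically, the bypass amounts to something like the following (a hypothetical sketch under the same illustrative assumptions as above, not the committed code):

def choose_sampling_strategy(total_reads, min_sample=100_000):
    # If sampling would have to touch every read anyway, skip the iterative
    # tuning loop and just use all of the reads; nothing gets filtered out.
    if total_reads <= min_sample:
        return "use_all_reads"
    return "iterative_one_percent_sampling"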

dpryan79 added a commit that referenced this issue Feb 8, 2018
@dpryan79
Collaborator

dpryan79 commented Feb 8, 2018

This is now changed in develop and will be included in deepTools 3.

dpryan79 closed this as completed Feb 8, 2018
@Pacomito

Hello,
I have a rather similar issue, but for a small BAM file (500 MB) with deepTools v3.1.0, except that the program never finishes (ChIP-seq data):
bamCoverage --bam A1082C1_flagged_rmDup.bam --outFileName A1082C1_flagged_rmDup.bw --numberOfProcessors 8 --normalizeUsing RPKM
The program has already run for more than 10 hours on an HPC (8 processors, 32 GB RAM), and it doesn't finish on my local machine with 6 processors and 32 GB RAM either. Memory consumption doesn't seem to be the issue here, so I don't really know where the problem is.

normalization: RPKM
ignoreDuplicates: False
center_read: False
samFlag_include: None
out_file_for_raw_data: None
binLength: 50
numberOfProcessors: 8
stepSize: 50
bedFile: None
region: None
samFlag_exclude: None
minFragmentLength: 0
minMappingQuality: None
bamFilesList: ['A1082C1_flagged_rmDup.bam']
zerosToNans: False
verbose: False
maxFragmentLength: 0
numberOfSamples: None
save_data: False
defaultFragmentLength: read length
smoothLength: None
maxPairedFragmentLength: 1000
chrsToSkip: []
blackListFileName: None

Thanks ;) !

@dpryan79
Collaborator

@Pacomito Sometimes a thread dies and the program never finishes. Try killing the job and rerunning things.

@Pacomito

@dpryan79 Thanks for the answer. I tried re-running it both on the server and on my local machine, but it has already been running for more than 3 hours with 8 processors, so I don't think the issue is a dying thread this time.
It usually works fine with the same version of deepTools and very similar BAMs.
The only difference in this BAM file is a tag that I added manually (representing the R2 start: XS:i:<number>), but the file is correctly recognized by samtools sort and index, so I don't think it is corrupted.
Do you think it is normal that it takes that long?

@dpryan79
Collaborator

That seems odd, and the addition of an aux tag shouldn't make a difference (deepTools ignores those tags). Can you make the file available to me?

@Pacomito

Thank you. Here is the link to a Drive:
https://drive.google.com/open?id=1OVWb5giiuINmB3g8R3IN-OQdS8d7TEDr

@dpryan79
Collaborator

Did you mean to have 800,000+ 56-base-long contigs in there? That's why it's so slow.
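
For anyone hitting this, a quick way to check is to count the @SQ entries in the header, for example with pysam (already a deepTools dependency; the file name here is a placeholder):

import pysam

with pysam.AlignmentFile("A1082C1_flagged_rmDup.bam", "rb") as bam:
    print("contigs in header:", bam.nreferences)
    tiny = [name for name, length in zip(bam.references, bam.lengths)
            if length < 1000]
    print("contigs shorter than 1 kb:", len(tiny))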

@Pacomito

Pacomito commented Jan 24, 2019

Did you mean to have 800,000+ 56-base-long contigs in there? That's why it's so slow.

Oh, my bad, thank you, I didn't know the @SQ headers were used to compute the BAM coverage.
My file comes from a merge of 3 different BAMs that was then reformatted, so I guess I put in the wrong headers.
Re-running it with the correct @SQ headers (
@SQ SN:chr1 LN:248956422
@SQ SN:chr10 LN:133797422
...)
it worked fine and ran really fast (5-10 min)!
Thanks again

@ascarafia

Oh, my bad, thank you, I didn't know the @SQ headers were used to compute the BAM coverage.
My file comes from a merge of 3 different BAMs that was then reformatted, so I guess I put in the wrong headers.
Re-running it with the correct @SQ headers (
@SQ SN:chr1 LN:248956422
@SQ SN:chr10 LN:133797422
...)
it worked fine and ran really fast (5-10 min)!
Thanks again

@Pacomito Hi! I am also having this problem with a .bam file I obtained from merging other .bam files. However, I did not understand how to correct the problem. I don't know how to fix the @SQ headers in my file or what my headers should look like. You said "Re-running it with the correct @SQ headers". How do you know what the correct headers are? Could you tell me where I can learn to fix this issue?

@dpryan79
Collaborator

dpryan79 commented May 3, 2020

@ascarafia Do you also have hundreds of thousands of small contigs in your headers? If not you have a different problem.

@ascarafia

@ascarafia Do you also have hundreds of thousands of small contigs in your headers? If not you have a different problem.

Hi. Thank you for your answer. I'm not sure how to check the number of contigs. I ran samtools idxstats and got 456 regions. I also tried samtools view -H file.bam | grep "^@SQ" and got the same result. Are those the contigs?

chr1 248956422 3785338 0
chr10 133797422 1131782 0
chr11 135086622 1935364 0
chr11_KI270721v1_random 100316 0 0
chr12 133275309 4398033 0
chr13 114364328 986959 0
chr14 107043718 34137438 0
chr14_GL000009v2_random 201709 0 0
chr14_GL000225v1_random 211173 0 0
chr14_KI270722v1_random 194050 0 0
chr14_GL000194v1_random 191469 0 0
chr14_KI270723v1_random 38115 0 0
chr14_KI270724v1_random 39555 0 0
chr14_KI270725v1_random 172810 0 0
chr14_KI270726v1_random 43739 0 0
...

What puzzles me the most is that I created two .bam files with the same commands and ran bamCoverage on one of them without any problems. It took more than an hour but got the job done. It's the second file that ran for over 12 hours and never got processed. The headers of both look the same.

@dpryan79
Collaborator

dpryan79 commented May 3, 2020

Something crashed on your system. Kill the job and rerun it.

@ascarafia

@dpryan79
I tried this two more times, and both times it crashed again. I am positive it is because of the .bam file. I checked my version of deepTools (2.5.7), and I think maybe it has something to do with what you said before about low read numbers: "The performance is so poor because of the small number of reads in the file."
I will try deepTools 3 to see if this solves the issue.

@dpryan79
Collaborator

dpryan79 commented May 4, 2020

@ascarafia Sounds like a good plan, please open a new issue if you continue running into this problem with deepTools 3.
