inaccurate counts with bedtools intersect -split and -f #750

Closed
benslack19 opened this issue Aug 8, 2019 · 6 comments

Comments

@benslack19

I'm trying to do something that I thought was straightforward but is not giving me the expected answer using bedtools intersect. I had written this to the Google forum (yet to be approved by the moderator) and therefore I kept working on it. As such, I've found a workaround solution which I share at the bottom.

The task I'm attempting is "How many reads from my RNA-seq sample overlap with any exon from my GTF file, given that overlaps are at least 50% of the read's length?" The majority of our read lengths are between 73 and 76 (one read of a paired-end sequencing is trimmed) so I expected overlaps of at least 37 bp to be counted.
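As a quick check of the arithmetic above, here is a minimal sketch that computes the smallest whole number of bases that is at least a given fraction of a read length. The function name `min_overlap_bp` is made up for illustration, and how bedtools itself rounds fractional thresholds is not stated in this thread.

```shell
# Smallest whole number of bases that is at least frac * read_len.
# min_overlap_bp is a hypothetical helper, not part of bedtools.
min_overlap_bp() {
    awk -v len="$1" -v frac="$2" 'BEGIN {
        x = len * frac
        n = int(x)
        if (n < x) n++      # round up when the product is not a whole number
        print n
    }'
}

min_overlap_bp 73 0.5   # prints 37
min_overlap_bp 76 0.5   # prints 38
```

So for reads of 73 to 76 bp, a 50% threshold corresponds to roughly 37 to 38 overlapping bases.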

From the bedtools documentation, I believed one way to make the call was the following:

$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -u -f 0.5 -split | wc -l

The bedtools version I'm using is 2.26 (but I get the same results with 2.27). STAR_BAM_PA is my primary alignments BAM file using STAR then samtools. GTF_REF_EXON refers to the genomic coordinates file from NCBI RefSeq with only the lines containing "exon".

Not surprisingly, removing the -f parameter (so that the default minimum overlap of 1 bp applies) increased this number. To visualize what was actually going on, I removed the wc -l and focused on one transcript and the reads that align there. To my surprise, when using the -f parameter, all of the reported reads fell within an exon, and those that spanned exon-exon boundaries were ignored.

overlap_50_1bp

As you can see, the run without the -f parameter showed reads spanning exon-exon junctions, several of which had one end overlapping an exon. I thought using -split would take care of this, since the data are from RNA-seq. The image below is the same as above but zoomed in between exon 1 and the start of exon 2.

overlap_50_1bp_exon1

I tried the following:

  • both with and without the -split option (results were the same)
  • inputting all 24 permutations of the orders of the 4 optional parameters (-bed -f 0.5 -u -split)
  • removing the -u

None of these yielded the desired results.

My interpretation is that the -f parameter isn't working on the length of the reads in the -abam input file, as the documentation states (since the STAR_BAM_PA file appears to show the original read lengths), but rather on the resulting alignment, which is a partial output of the bedtools intersect call. This misses many reads that span exon-exon junctions. This appears to be a bug to me.
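To make the concern concrete: whether a spliced read passes -f 0.5 depends on which length is used as the denominator of the overlap fraction. The toy calculation below uses made-up numbers and a hypothetical helper, `overlap_fraction`. It only illustrates how the choice of denominator changes the outcome and is not a description of what bedtools does internally.

```shell
# Toy spliced read: two aligned blocks separated by an intron (made-up numbers).
block1=50      # bases aligned to exon 1
block2=25      # bases aligned to exon 2
intron=1000    # gap skipped by the spliced alignment

# overlap_fraction is a hypothetical helper for this illustration.
overlap_fraction() {
    # $1 = overlapping bases, $2 = length used as the denominator
    awk -v ov="$1" -v len="$2" 'BEGIN { printf "%.3f\n", ov / len }'
}

# Denominator = bases actually aligned (sum of the blocks): prints 0.667
overlap_fraction "$block1" "$((block1 + block2))"

# Denominator = full genomic span, intron included: prints 0.047
overlap_fraction "$block1" "$((block1 + intron + block2))"
```

With a 0.5 threshold the same 50 bp overlap passes under the first denominator and fails under the second, which is the kind of discrepancy that would drop junction-spanning reads.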

My workaround is the following:

$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -split \
| cut -f 1,2,3,4,5 \
| awk '$3-$2 > 36.5' \
| awk ' { print $2, $3, $3-$2, $4}' \
| cut -d ' ' -f 4 \
| uniq \
| wc -l

Line 1: The output of the bedtools intersect call with -split shows all overlaps to an exon.
Lines 2 and 3: I filter the results to keep only overlaps greater than 36.5 bp (since most read lengths are at least 73).
Lines 4-6: I count only the reads once if they have this overlap.
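One caveat about the counting step: `uniq` only collapses duplicate lines that are adjacent, so a read whose qualifying overlaps are not next to each other in the output would be counted more than once. The sketch below runs the same filter-then-count idea on made-up rows in the five-column layout kept by the `cut` step, using `sort -u` instead. The rows and both function names are invented for illustration.

```shell
# Made-up rows in the five-column layout the pipeline keeps:
# chrom, start, end, read name, score
toy_overlaps() {
    printf 'chr1\t100\t150\tread1\t255\n'
    printf 'chr1\t120\t140\tread2\t255\n'
    printf 'chr1\t900\t940\tread3\t255\n'
    printf 'chr1\t905\t945\tread1\t255\n'
}

count_reads_with_min_overlap() {
    # $1 = minimum overlap length in bp; rows are read from stdin.
    # sort -u removes duplicate read names even when they are not adjacent.
    awk -v min="$1" '($3 - $2) >= min { print $4 }' | sort -u | wc -l | tr -d ' '
}

toy_overlaps | count_reads_with_min_overlap 37   # prints 2
```

Here read1 has two qualifying overlaps separated by read3, so plain `uniq` would report 3 reads while `sort -u` reports 2. The sort does add memory and time cost on large inputs.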

Of course, it would be nice if something could be recommended (or fixed) in bedtools so that I don't have to do this. (The workaround is more accurate than the bedtools-only call, but it requires hard-coding the read-length filter, and other testing shows it stops with very large BAM files.)

I look forward to feedback.

@arq5x
Owner

arq5x commented Aug 8, 2019

Thanks for reporting this. I am away on vacation but will dig into this when I return.

@arq5x
Owner

arq5x commented Aug 12, 2019

I see the problem and will start on a solution. Thanks very much for reporting this.

arq5x added a commit that referenced this issue Aug 14, 2019
@arq5x
Owner

arq5x commented Aug 14, 2019

This has been fixed in master. Thanks so much for reporting!

@arq5x arq5x closed this as completed Aug 14, 2019
@benslack19
Author

Thank you, Aaron! Can I assume that I can now use the call as I first tried it?
$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -u -f 0.5 -split | wc -l

@arq5x
Owner

arq5x commented Aug 14, 2019

Yep, and you can also use -a instead of -abam, as it will auto-detect the format. That said, -abam works as well.

@benslack19
Author

The visualizations and numbers look good. Thanks!
