-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(a+b+c) is not commutative and overestimates overlap #45
Comments
Good point. This is something tricky that comes up a lot when you have files with multiple features in one that overlap a single feature in another and then try to make a Venn diagram -- which has no category for "multiple hits". The reason you run into this issue is that the For example, what should the 2-way Venn diagram look like for these files? It's not really defined what should go in that middle overlap section of the diagram:
Using $ intersectBed -a a.bed -b b.bed -u | wc -l $ intersectBed -a b.bed -b a.bed -u | wc -l Not using $ intersectBed -a a.bed -b b.bed | wc -l $ intersectBed -a b.bed -b a.bed | wc -l This latter version can result in the sum of the each circle in the Venn diagram being huge and having no relation to the original number of features. Unfortunately, Venn diagrams aren't ideal for overlapping cases like this. In my applications it made sense to use the If you'd like to play around, you can subclass BedTool and overwrite its
Any suggestions on how to improve this? I'd imagine others will run into this as well, so I'll at least have to add this explanation to the docs. |
I don't see how to get around this issue that: (a + b + c) != (b + c + a) but, we could provide a class method like: BedTool.intersect_all(*beds, **kwargs) |
so something like that recent post on the bedtools list: def intersect_all(beds, **kwargs):
"""
Successively intersect all files in `beds`, passing kwargs
to BedTool.intersect().
Note that if kwargs like `u=True` or `v=True` are used, the order
of the list will determine the final results
"""
x = BedTool(beds[0])
for bed in beds[1:]:
x = x.intersect(bed, **kwargs)
return x |
Won't that have the same problem -- it'll depend on the order of the inputs? Maybe first you need to do something smart like cat, then merge to get a base. |
Yep, as noted in the docstring. Really it just passes the buck, leaving decisions up to the caller rather than hard-coding in the overridden But with no kwargs, it should do the plain ol' intersect (without -u), which should be commutative. I'll have to think some more about a cat/merge/intersect strategy. |
I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.
as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "
The text was updated successfully, but these errors were encountered: