Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(a+b+c) is not commutative and overestimates overlap #45

Closed
mlovci opened this issue Oct 27, 2011 · 5 comments
Closed

(a+b+c) is not commutative and overestimates overlap #45

mlovci opened this issue Oct 27, 2011 · 5 comments

Comments

@mlovci
Copy link

mlovci commented Oct 27, 2011

I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.

as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "

@daler
Copy link
Owner

daler commented Oct 27, 2011

Good point. This is something tricky that comes up a lot when you have files with multiple features in one that overlap a single feature in another and then try to make a Venn diagram -- which has no category for "multiple hits".

The reason you run into this issue is that the BedTool.__add__ method uses the u=True argument to intersectBed. (__subtract__ uses v=True to be symmetrical). If you have nested features (2 features in b that overlap a feature in a) then you'll run into this issue.

For example, what should the 2-way Venn diagram look like for these files? It's not really defined what should go in that middle overlap section of the diagram:

a.bed  -------------        --------------------------
b.bed      -----------               -----     ------

Using u=True:

$ intersectBed -a a.bed -b b.bed -u | wc -l
2

$ intersectBed -a b.bed -b a.bed -u | wc -l
3

Not using u=True results in another issue -- the total number of features for a overlapping with b is greater than the number of features in a in the first place:

$ intersectBed -a a.bed -b b.bed | wc -l
3

$ intersectBed -a b.bed -b a.bed | wc -l
3

This latter version can result in the sum of the each circle in the Venn diagram being huge and having no relation to the original number of features.

Unfortunately, Venn diagrams aren't ideal for overlapping cases like this. In my applications it made sense to use the u=True case. Do you think it makes more sense to use the u=False?

If you'd like to play around, you can subclass BedTool and overwrite its __add__ method to do what you want:

class MyBedTool(BedTool):
    def __init__(self, *args, **kwargs):
        BedTool.__init__(self, *args, **kwargs)
    def __add__(self, other):
        return self.intersect(other)

Any suggestions on how to improve this? I'd imagine others will run into this as well, so I'll at least have to add this explanation to the docs.

@brentp
Copy link
Contributor

brentp commented Oct 27, 2011

I don't see how to get around this issue that:

(a + b + c) != (b + c + a)

but, we could provide a class method like:

BedTool.intersect_all(*beds, **kwargs)

@daler
Copy link
Owner

daler commented Oct 27, 2011

so something like that recent post on the bedtools list:

def intersect_all(beds, **kwargs):
    """
    Successively intersect all files in `beds`, passing kwargs
    to BedTool.intersect().

    Note that if kwargs like `u=True` or `v=True` are used, the order 
    of the list will determine the final results
    """
    x = BedTool(beds[0])
    for bed in beds[1:]:
        x = x.intersect(bed, **kwargs)
    return x

@brentp
Copy link
Contributor

brentp commented Oct 27, 2011

Won't that have the same problem -- it'll depend on the order of the inputs?

Maybe first you need to do something smart like cat, then merge to get a base.
Then intersect each successively with base?

@daler
Copy link
Owner

daler commented Oct 27, 2011

Yep, as noted in the docstring. Really it just passes the buck, leaving decisions up to the caller rather than hard-coding in the overridden __add__.

But with no kwargs, it should do the plain ol' intersect (without -u), which should be commutative. I'll have to think some more about a cat/merge/intersect strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants