Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected (wrong?) results in terms of total length for the intersection of two sets #34

Closed
amizeranschi opened this issue Jul 3, 2020 · 7 comments

Comments

@amizeranschi
Copy link

amizeranschi commented Jul 3, 2020

Hello,

I've been testing Intervene on two of the sample tests. I've generated an UpSet plot and saved the corresponding BED files for the intersection and set differences and then computed the total length of the features from each of the files.

I would expect that the summed total lengths would equal that of the initial files, but this is not the case. Any assistance in clarifying this issue would be much appreciated, as I was planning on using Intervene on a larger number of files.

I've also manually computed the intersection and set differences using bedops and, this time, the summed total lengths were as I'd expect. Finally, I've plotted a Venn diagram of the intersections using the R package bedr and its results match those obtained with bedops. I'm including the diagram below.

Below is a minimal reproducible example for what I did:


## set up a temporary directory
temp_dir=/path/to/temp_dir
mkdir $temp_dir
cd $temp_dir

wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K27me3.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me1.bed

## run Intervene and save the BED files for the overlaps
intervene upset --output overlap_counts_2 --save-overlaps -i H3K27me3.bed H3K4me1.bed

## compute the length of the UpSet diagram elements (subsets)
cd $temp_dir/overlap_counts_2/sets

ln -s $temp_dir/H3K4me1.bed H3K4me1.bed
ln -s $temp_dir/H3K27me3.bed H3K27me3.bed

## manually compute the intersection and set differences using BEDOPS
bedops --intersect H3K27me3.bed H3K4me1.bed > BEDOPS_11_H3K27me3_H3K4me1.bed
bedops --difference H3K27me3.bed BEDOPS_11_H3K27me3_H3K4me1.bed > BEDOPS_10_H3K27me3.bed
bedops --difference H3K4me1.bed BEDOPS_11_H3K27me3_H3K4me1.bed > BEDOPS_01_H3K4me1.bed

## compute the total length (in bp) for each BED file and write these to a text file
echo -e "BED file\t$Total length (bp)" > lengths.txt

for file_name in *.bed; do
  subset_length=$(cat $file_name | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
  echo -e "${file_name}\t${subset_length}" >> lengths.txt
done

cat lengths.txt

These are the results I got:

BED file	 length (bp)
01_H3K4me1.bed	202168680
10_H3K27me3.bed	11936247
11_H3K27me3_H3K4me1.bed	208156989
BEDOPS_01_H3K4me1.bed	323296156
BEDOPS_10_H3K27me3.bed	135056952
BEDOPS_11_H3K27me3_H3K4me1.bed	83578607
H3K27me3.bed	218635559
H3K4me1.bed	406874763

Here, I would expect that the total length of e.g. H3K27me3.bed would be equal to the summed total lengths of 10_H3K27me3.bed and 11_H3K27me3_H3K4me1.bed, but this wasn't the case. However, the results produced with bedops did match this expectation.

I have used the following R code to produce a Venn diagram using the R package bedr:

install.packages("bedr")
library(bedr)

## read the BED files
regions.a = read.table("/path/to/H3K4me1.bed", 
               header = FALSE, stringsAsFactors = FALSE)

regions.b = read.table("/path/to/H3K27me3.bed", 
               header = FALSE, stringsAsFactors = FALSE)

bedr.plot.region(
  input = list(
    a = regions.a,
    b = regions.b
  ),
  feature = "bp",
  fraction.overlap = 0.000000001
)

Below is the resulting Venn diagram. The total bp length values are identical to those computed above by bedops.

venn

@amizeranschi
Copy link
Author

Hello again,

I have extended this exercise to three sets and I am running into the same issue. The results of Intervene don't match those of other tools. Also, when computing the total length of regions from one set from the different elements (subsets) that should compose it, the values don't add up as they should, but they do add up in the case of other tools (bedops and bedr were tested).

Why is this the case?

Here is my code:

## set up a temporary directory
temp_dir=/path/to/temp_dir
mkdir $temp_dir
cd $temp_dir

wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me1.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me2.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me3.bed


## run Intervene and save the BED files for the overlaps
intervene upset --output overlap_counts_3 --save-overlaps -i H3K4me1.bed H3K4me2.bed H3K4me3.bed

## compute the length of the UpSet diagram elements (subsets)
cd $temp_dir/overlap_counts_3/sets

ln -s $temp_dir/H3K4me1.bed H3K4me1.bed
ln -s $temp_dir/H3K4me2.bed H3K4me2.bed
ln -s $temp_dir/H3K4me3.bed H3K4me3.bed

## manually compute the intersection and set differences using BEDOPS
bedops --intersect H3K4me2.bed H3K4me1.bed H3K4me3.bed > BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed
bedops --intersect H3K4me2.bed H3K4me1.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_110_H3K4me2_H3K4me1.bed
bedops --intersect H3K4me2.bed H3K4me3.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_011_H3K4me2_H3K4me3.bed
bedops --intersect H3K4me1.bed H3K4me3.bed | bedops --difference - BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS_101_H3K4me1_H3K4me3.bed
bedops --everything BEDOPS_101_H3K4me1_H3K4me3.bed BEDOPS_011_H3K4me2_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me3.bed - > BEDOPS_001_H3K4me3.bed
bedops --everything BEDOPS_110_H3K4me2_H3K4me1.bed BEDOPS_011_H3K4me2_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me2.bed - > BEDOPS_010_H3K4me2.bed
bedops --everything BEDOPS_110_H3K4me2_H3K4me1.bed BEDOPS_101_H3K4me1_H3K4me3.bed BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me1.bed - > BEDOPS_100_H3K4me1.bed

## write the total lengths to a file
echo -e "Set element\t$Total length (bp)" > lengths.txt

for file_name in *.bed; do
  subset_length=$(cat $file_name | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
  echo -e "${file_name}\t${subset_length}" >> lengths.txt
done

cat lengths.txt

This is the result:

Set element	 length (bp)
001_H3K4me3.bed	1586566
010_H3K4me2.bed	1945128
011_H3K4me2_H3K4me3.bed	4000088
100_H3K4me1.bed	56723387
101_H3K4me1_H3K4me3.bed	1503887
110_H3K4me1_H3K4me2.bed	86126244
111_H3K4me1_H3K4me2_H3K4me3.bed	275359610
BEDOPS_001_H3K4me3.bed	19700303
BEDOPS_010_H3K4me2.bed	23749016
BEDOPS_011_H3K4me2_H3K4me3.bed	21965441
BEDOPS_100_H3K4me1.bed	290903957
BEDOPS_101_H3K4me1_H3K4me3.bed	11191800
BEDOPS_110_H3K4me2_H3K4me1.bed	54706745
BEDOPS_111_H3K4me2_H3K4me1_H3K4me3.bed	50072261
H3K4me1.bed	406874763
H3K4me2.bed	150493463
H3K4me3.bed	102929805

Here, again, I would expect the total length of regions in H3K4me1.bed to be equal to the sums of total lengths of regions from the files 100_H3K4me1.bed, 101_H3K4me1_H3K4me3.bed, 110_H3K4me1_H3K4me2.bed and 111_H3K4me1_H3K4me2_H3K4me3.bed. However, this is not true for the files produced by Intervene, although it is true for those produced by bedops.

Plotting the 3 BED files as a Venn diagram with the R package bedr produces, again, identical results as those of bedops:

3-sets

@asntech
Copy link
Owner

asntech commented Jul 15, 2020

@amizeranschi can you tell me what are you using as a back-end for bedr? BEDTools or BEDOPS?

Can you try bedr with feature = "interval" instead of feature = "bp"?

@asntech
Copy link
Owner

asntech commented Jul 15, 2020

@amizeranschi And also you can set either regions of A or B to output in the intervene by providing wa or wb to --bedtools-options This difference could be due to this. And I am not sure how feature = "bp" works in bedr.

@amizeranschi
Copy link
Author

@asntech I think bedr uses bedtools as an engine, by default. I haven't changed this. The exact R code that I used is pasted above.

The reason I used feature = "bp" in bedr is because this answers exactly the question that I am looking for: namely, what is the length, in bp, of the overlapping regions between multiple BED files.

I have also computed this manually with bedops, in two steps, as shown above. First, I computed the overlapping regions using bedops --intersect and bedops --difference. Then, I computed the total length (in bp) of the respective BED files, using bedops --merge and GNU awk.

Note: the --intersect operation is commutative in bedops, so the order of the input files does not matter:

$ bedops --intersect H3K27me3.bed H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607

$ bedops --intersect H3K4me1.bed H3K27me3.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607

I believe the BED intersection operation that you originally had in mind with Intervene is more similar, if not identical, to the --element-of operation in bedops: https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#element-of-e-element-of. It would be very helpful if you could implement my use case as well.

@amizeranschi
Copy link
Author

Update: it looks like the bedtools intersect and bedtools subtract operations give the same results as the corresponding bedops operations, when used with u=False and v=False:

$ bedops --intersect H3K27me3.bed H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607

$ bedops --intersect H3K4me1.bed H3K27me3.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607

$ bedtools intersect -a H3K27me3.bed -b H3K4me1.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607

$ bedtools intersect -a H3K4me1.bed -b H3K27me3.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
83578607



$ bedops --difference H3K27me3.bed BEDOPS_11_H3K27me3_H3K4me1.bed | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
135056952

$ bedtools subtract -a H3K27me3.bed -b BEDOPS_11_H3K27me3_H3K4me1.bed | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
135056952

So, it looks like implementing a way to set u=False and v=False for pybedtools in Intervene would solve my issue.

@amizeranschi
Copy link
Author

Update2:

After looking at it in more depth, things were a bit more complicated than I initially expected. However, I've managed to manually compute Intervene's UpSet diagram elements for my use case with bedtools, in the same way I've done earlier with bedops.

I'd be grateful if you could add support for automating this use case in Intervene, as bedr only supports Venn diagrams (no UpSet support) for a maximum of 5 BED files.

I'm including my updated code for 3 BED files below, in the hope that it will be helpful.

In a nutshell, the bedtools multiinter command should be used for computing set intersections and cat *.bed | bedtools subtract for set unions and differences.

## set up a temporary directory
temp_dir=/path/to/temp_dir
mkdir $temp_dir
cd $temp_dir

wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me1.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me2.bed
wget https://raw.githubusercontent.com/asntech/intervene/master/intervene/example_data/ENCODE_hESC/H3K4me3.bed

## run Intervene and save the BED files for the overlaps
intervene upset --output overlap_counts_3 --save-overlaps -i H3K4me1.bed H3K4me2.bed H3K4me3.bed

## compute the length of the UpSet diagram elements (subsets)
cd $temp_dir/overlap_counts_3/sets

## prefix all the file names for Intervene's original output
prefix="INTERVENE_"
for filename in *.bed
do
  mv "$filename" "$prefix$filename"
done

ln -s $temp_dir/H3K4me1.bed H3K4me1.bed
ln -s $temp_dir/H3K4me2.bed H3K4me2.bed
ln -s $temp_dir/H3K4me3.bed H3K4me3.bed

## manually compute the intersection and set differences using BEDOPS
bedops --intersect H3K4me2.bed H3K4me1.bed H3K4me3.bed > BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed
bedops --intersect H3K4me2.bed H3K4me1.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____110_H3K4me2_H3K4me1.bed
bedops --intersect H3K4me2.bed H3K4me3.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____011_H3K4me2_H3K4me3.bed
bedops --intersect H3K4me1.bed H3K4me3.bed | bedops --difference - BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed > BEDOPS____101_H3K4me1_H3K4me3.bed
bedops --everything BEDOPS____101_H3K4me1_H3K4me3.bed BEDOPS____011_H3K4me2_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me3.bed - > BEDOPS____001_H3K4me3.bed
bedops --everything BEDOPS____110_H3K4me2_H3K4me1.bed BEDOPS____011_H3K4me2_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me2.bed - > BEDOPS____010_H3K4me2.bed
bedops --everything BEDOPS____110_H3K4me2_H3K4me1.bed BEDOPS____101_H3K4me1_H3K4me3.bed BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed | bedops --difference H3K4me1.bed - > BEDOPS____100_H3K4me1.bed

## manually compute the intersection and set differences using BEDTOOLS
bedtools multiinter -i H3K4me2.bed H3K4me1.bed H3K4me3.bed | grep "1,2,3" | cut -f 1-3 > BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed
bedtools multiinter -i H3K4me2.bed H3K4me1.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__110_H3K4me2_H3K4me1.bed
bedtools multiinter -i H3K4me2.bed H3K4me3.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__011_H3K4me2_H3K4me3.bed
bedtools multiinter -i H3K4me1.bed H3K4me3.bed | grep "1,2" | cut -f 1-3 | bedtools subtract -a - -b BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed > BEDTOOLS__101_H3K4me1_H3K4me3.bed
cat BEDTOOLS__101_H3K4me1_H3K4me3.bed BEDTOOLS__011_H3K4me2_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me3.bed -b - > BEDTOOLS__001_H3K4me3.bed
cat BEDTOOLS__110_H3K4me2_H3K4me1.bed BEDTOOLS__011_H3K4me2_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me2.bed -b - > BEDTOOLS__010_H3K4me2.bed
cat BEDTOOLS__110_H3K4me2_H3K4me1.bed BEDTOOLS__101_H3K4me1_H3K4me3.bed BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed | bedtools subtract -a H3K4me1.bed -b - > BEDTOOLS__100_H3K4me1.bed

## write the total lengths to a file
echo -e "Set element\t$Total length (bp)" > lengths.txt

for file_name in *.bed; do
  subset_length=$(cat $file_name | sort-bed - | bedops --merge - | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
  echo -e "${file_name}\t${subset_length}" >> lengths.txt
done

cat lengths.txt

This yields:

$ cat lengths.txt
Set element	 length (bp)
BEDOPS____001_H3K4me3.bed	19700303
BEDOPS____010_H3K4me2.bed	23749016
BEDOPS____011_H3K4me2_H3K4me3.bed	21965441
BEDOPS____100_H3K4me1.bed	290903957
BEDOPS____101_H3K4me1_H3K4me3.bed	11191800
BEDOPS____110_H3K4me2_H3K4me1.bed	54706745
BEDOPS____111_H3K4me2_H3K4me1_H3K4me3.bed	50072261
BEDTOOLS__001_H3K4me3.bed	19700303
BEDTOOLS__010_H3K4me2.bed	23749016
BEDTOOLS__011_H3K4me2_H3K4me3.bed	21965441
BEDTOOLS__100_H3K4me1.bed	290903957
BEDTOOLS__101_H3K4me1_H3K4me3.bed	11191800
BEDTOOLS__110_H3K4me2_H3K4me1.bed	54706745
BEDTOOLS__111_H3K4me2_H3K4me1_H3K4me3.bed	50072261
H3K4me1.bed	406874763
H3K4me2.bed	150493463
H3K4me3.bed	102929805
INTERVENE_001_H3K4me3.bed	1586566
INTERVENE_010_H3K4me2.bed	1945128
INTERVENE_011_H3K4me2_H3K4me3.bed	4000088
INTERVENE_100_H3K4me1.bed	56723387
INTERVENE_101_H3K4me1_H3K4me3.bed	1503887
INTERVENE_110_H3K4me1_H3K4me2.bed	86126244
INTERVENE_111_H3K4me1_H3K4me2_H3K4me3.bed	275359610

@amizeranschi
Copy link
Author

Also, regarding:

@amizeranschi can you tell me what are you using as a back-end for bedr? BEDTools or BEDOPS?

I reran bedr with the three sample BED files mentioned above and saw the following within its output:

   bedtools multiinter -i /tmp/Rtmpbdtb2K/i_5cef7f415bae.bed /tmp/Rtmpbdtb2K/_5cef45258826.bed /tmp/Rtmpbdtb2K/_5cef3fb99ba6.bed  -names  H3K4me1 H3K4me2 H3K4me3 -g /path/to/bedr/genomes/human.hg19.genome -filler . 

So it looks like it is indeed using bedtools, specifically the multiinter command, which I've also tried running above.

@asntech asntech closed this as completed Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants