clarify bed versus bedgraph format #236

Closed
cokelaer opened this issue Jun 5, 2019 · 2 comments

cokelaer commented Jun 5, 2019

Once this is done, we can fix other issues such as bam2bed, renaming bedgraph into bed or vice versa, etc.

As also pointed out in this issue from sequana_coverage, sequana/sequana#555,
the BED format is made of 4 columns: chrom name, start pos, end pos, value.

The difference between BED and BEDGRAPH is that BEDGRAPH has 4 columns whereas BED can have up to 12 columns.

In bam2bedgraph, bedtools genomecov -bg -ibam creates a bedgraph:

chr1	75	176	1
chr1	447	547	1
chr1	547	548	2
chr1	548	648	1
chr1	661	690	1
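
To make the interval semantics of the lines above concrete, here is a minimal Python sketch that parses them. It assumes the usual bedGraph convention (0-based start, exclusive end), so each interval covers end - start bases. The names `BedGraphRecord` and `parse_bedgraph` are made up for this illustration and are not part of sequana or bedtools.

```python
from typing import Iterable, Iterator, NamedTuple


class BedGraphRecord(NamedTuple):
    """One bedGraph line (hypothetical helper type for this sketch)."""
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    value: float  # coverage value


def parse_bedgraph(lines: Iterable[str]) -> Iterator[BedGraphRecord]:
    """Yield one record per data line, skipping blank and header-like lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield BedGraphRecord(chrom, int(start), int(end), float(value))


# The five lines shown in the comment above.
example = (
    "chr1\t75\t176\t1\n"
    "chr1\t447\t547\t1\n"
    "chr1\t547\t548\t2\n"
    "chr1\t548\t648\t1\n"
    "chr1\t661\t690\t1\n"
)
records = list(parse_bedgraph(example.splitlines()))

# Number of bases with non-zero coverage: sum of interval lengths.
covered = sum(r.end - r.start for r in records)
```

Regions with zero coverage do not appear in this output at all, which is one practical difference from the per-position output discussed next.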

In bam2bed, every position (even those with no coverage) is reported with bedtools genomecov -d,
and therefore the end position is not reported. We get for instance:

chr1	1	0
chr1	2	0
chr1	3	0
...
chr1	75	1
chr1	75	1

But this is not a BED file. Actually, this is rather a COV (for coverage) format, but there is no such standard. It is also the same output as the one given by samtools depth.
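
To show how the two outputs relate, here is a minimal Python sketch that collapses per-position rows (chrom, position, depth) into bedGraph-style intervals. It assumes positions are 1-based, as printed by bedtools genomecov -d, and that rows are sorted by chromosome and position. The function name `cov_to_bedgraph` and the `keep_zero` parameter are hypothetical, chosen for this example only.

```python
def cov_to_bedgraph(rows, keep_zero=False):
    """Merge consecutive positions with equal depth into intervals.

    rows: iterable of (chrom, pos, depth), pos 1-based, sorted.
    Returns a list of (chrom, start, end, depth) with 0-based start
    and exclusive end. Zero-depth runs are dropped unless keep_zero.
    """
    out = []
    run = None  # current run as [chrom, start, end, depth]
    for chrom, pos, depth in rows:
        contiguous = (
            run is not None
            and run[0] == chrom
            and run[3] == depth
            and run[2] == pos - 1  # previous end equals this 0-based start
        )
        if contiguous:
            run[2] = pos  # extend the exclusive end by one base
        else:
            if run is not None and (keep_zero or run[3] != 0):
                out.append(tuple(run))
            run = [chrom, pos - 1, pos, depth]
    if run is not None and (keep_zero or run[3] != 0):
        out.append(tuple(run))
    return out


# Small made-up input, not taken from a real BAM file.
rows = [
    ("chr1", 1, 0),
    ("chr1", 2, 0),
    ("chr1", 3, 1),
    ("chr1", 4, 1),
    ("chr1", 5, 2),
]
intervals = cov_to_bedgraph(rows)
intervals_with_zero = cov_to_bedgraph(rows, keep_zero=True)
```

The per-position form is simpler to index but grows with genome length, while the interval form is compact and carries the end coordinate that BED-like tools expect.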

cokelaer added this to the v0.4 milestone Jun 5, 2019

cokelaer commented

We introduce a new format to store the coverage. The class bam2bed is now called bam2cov to avoid ambiguity. The bam2bedgraph is still available.

blaiseli commented Jul 5, 2019

the BED format is made of 4 columns: chrom name, start pos, end pos, value

Sorry for this late comment. Note that in the BED format, the 4th column, if present, should be a name, not a value, which makes it different from BEDGRAPH.

According to https://genome.ucsc.edu/FAQ/FAQformat.html#format1:

The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

And the order is chrom, chromStart, chromEnd, [name, score, strand, ...]

When I need something BED-compliant, with strand information, I sometimes have to introduce dummy values for name and score. If I want something with a score, I sometimes need to introduce dummy values for name.

And, if we really strictly want to have a BED-compliant format, the score, if present, should be between 0 and 1000 (same reference as above).
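
As an illustration of the points above, here is a minimal Python sketch that builds a 6-column BED line with placeholder name and score, and checks the 0 to 1000 score range. Using "." as the dummy name and 0 as the dummy score is an assumption (a common convention, not something mandated in this thread), and the function names `bed6_line` and `looks_like_bedgraph` are hypothetical.

```python
def bed6_line(chrom, start, end, strand, name=".", score=0):
    """Return a tab-separated BED6 line.

    name and score default to dummy values so that the strand column
    can be populated while respecting the binding field order.
    """
    if strand not in ("+", "-", "."):
        raise ValueError(f"invalid strand: {strand!r}")
    if not 0 <= score <= 1000:
        raise ValueError(f"score must be between 0 and 1000, got {score}")
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def looks_like_bedgraph(line):
    """Heuristic only: exactly 4 columns and a numeric 4th column.

    A BED4 file whose names happen to be numeric would also match,
    so this cannot replace knowing how the file was produced.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return False
    try:
        float(fields[3])
    except ValueError:
        return False
    return True


line = bed6_line("chr1", 75, 176, "+")
```

For example, `looks_like_bedgraph("chr1\t75\t176\t1")` is true, whereas a line with a textual name in the 4th column, or the 6-column line built above, is not flagged as bedGraph.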
