Start should be 1-based, not 0-based #251

Closed
teemukataja opened this Issue Dec 11, 2018 · 12 comments

Contributor

teemukataja commented Dec 11, 2018

In the Beacon specification, the start key is described as 0-based, whereas the VCF specification describes the position as 1-based: "POS - position: The reference position, with the 1st base having position 1."

I verified this using the IGV genome browser. Upon further research, other genomic file types are also reported to use the 1-based system.
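As a minimal illustration of the mismatch described above, the sketch below converts a 1-based VCF POS into a 0-based start value. The function name `vcf_pos_to_zero_based_start` is hypothetical and is not part of the Beacon or VCF specifications; the sketch also assumes an ordinary position of 1 or greater and does not handle VCF's special telomeric positions.

```python
def vcf_pos_to_zero_based_start(pos: int) -> int:
    """Convert a 1-based VCF POS to a 0-based start coordinate.

    Assumption: pos is an ordinary base position (>= 1). Special
    telomeric positions allowed by VCF are out of scope here.
    """
    if pos < 1:
        raise ValueError("this sketch only handles positions >= 1")
    return pos - 1


# The first base of a reference is POS 1 in VCF and start 0 in a 0-based system.
first_base_start = vcf_pos_to_zero_based_start(1)
```

Under these assumptions, a client that reads POS from a VCF file would subtract 1 before sending it as a 0-based start.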

Contributor

mbaudis commented Dec 11, 2018

The decision was made early on to follow GA4GH standards, which are 0-based, half-open.

(The lack of clear documentation of "GA4GH standards" strikes again ...).

So, 0-based it should be.

mbaudis closed this Dec 11, 2018

Contributor Author

teemukataja commented Dec 11, 2018

I would like to understand this use case and couldn't find anything in the past issues. Can you point me to where I can find information on this decision, if no such standards document exists?

Contributor

mbaudis commented Dec 11, 2018

Contributor Author

teemukataja commented Dec 11, 2018

Thank you.

Contributor Author

teemukataja commented Dec 11, 2018

@mbaudis

The VMC data model, on page 16, suggests that nucleotides follow a 1-based counting convention.

Upon reading more about bases and interbases, I strongly feel that Beacon should follow the standards of genomic files such as VCF, which uses the 1-based system. Because the 1-based system gives the position of the base of interest, I think it is a better fit for the role of Beacon. Interbases might be better for applying data science to datasets, but the role of Beacon is to find those datasets first.

Contributor

mbaudis commented Dec 11, 2018

Interbase coordinates

I'm not married to any concept, but there have been endless discussions already. Also, this is a clear case where Beacon just has to pick up whatever format is selected as a "GA4GH standard". File formats and browsers all use different coordinate systems.

Quote:

Moving from UCSC browser/tools to Ensembl browser/tools or back
* Ensembl uses 1-based coordinate system
* UCSC uses 0-based coordinate system
* Some file formats are 1-based (GFF, SAM, VCF) and others are 0-based (BED, BAM)
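Moving an interval between the two families of formats listed above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes the 1-based formats use closed (inclusive) intervals while the 0-based formats use half-open intervals, as in GFF and BED respectively.

```python
from typing import Tuple


def one_based_closed_to_zero_based_half_open(start: int, end: int) -> Tuple[int, int]:
    """E.g. GFF-style (start, end) -> BED-style (start, end)."""
    # Only the start shifts; the inclusive end equals the exclusive end.
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start: int, end: int) -> Tuple[int, int]:
    """E.g. BED-style (start, end) -> GFF-style (start, end)."""
    return start + 1, end


# The first 100 bases of a sequence in both conventions.
bed_style = one_based_closed_to_zero_based_half_open(1, 100)
gff_style = zero_based_half_open_to_one_based_closed(*bed_style)
```

Only the start coordinate changes in either direction, which is why off-by-one errors between these formats tend to show up at the start of a feature.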

Pinging @andrewyatz @reece ...


andrewyatz commented Dec 11, 2018

You pinged?

With reference to GA4GH, there is additional context available in @jmarshall's comment on a PR of mine for refget. The use of 0-based, inclusive coordinates is now a convention of GA4GH specifications. It certainly isn't a standard. If it ever was one, that has been left behind in the pre-Infinity-War-like snap of GA4GH, but the vague notion that we prefer 0-based, inclusive pervades.

Contributor

mbaudis commented Dec 11, 2018


jmarshall commented Dec 11, 2018

"I would like to understand this use case"

It is clear that 0-based half-inclusive intervals are the appropriate representation to use for arithmetic (and hence machine communications). If this doesn't seem clear to you, reread the epic threads linked to in the comment (samtools/hts-specs#327 (comment)) that @andrewyatz pointed to.

[So it's obviously the right representation for APIs; whether it's GA4GH's policy to use this representation is a separate question.]
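The arithmetic argument can be made concrete with a small sketch. The helper names below (`length`, `overlaps`, `split`) are hypothetical and chosen for illustration; intervals are assumed to be `(start, end)` tuples in 0-based half-open form.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def length(iv: Interval) -> int:
    """Length is a plain subtraction, with no +1 correction."""
    start, end = iv
    return end - start


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap only if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def split(iv: Interval, at: int) -> Tuple[Interval, Interval]:
    """Split at a boundary; the two pieces share the coordinate `at`."""
    start, end = iv
    if not start <= at <= end:
        raise ValueError("split point must lie within the interval")
    return (start, at), (at, end)


whole = (0, 100)
left, right = split(whole, 40)
```

Here the two pieces abut without overlapping, their lengths sum to the length of the whole, and an empty interval such as `(7, 7)` has length 0, all without any off-by-one adjustment.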

As Beacon is a web service API, its purpose is machine communications; therefore 0-based half-inclusive is the natural representation. This statement about purpose is a bit more quibble-able, so GA4GH codified this choice as a policy, or a “standard” if you will. (Those of us who were there at the time remember this use of the word “standard” with relief, as it was the end of endless discussions!) This is reflected in secondary sources such as the htsget spec:

We use the following pan-GA4GH standards:

  • 0 start, half open coordinates

Tragically the primary sources (some GA4GH press release or minutes of some meeting), if any, have been obfuscated by subsequent web site reorganisations…


reece commented Dec 11, 2018

Humans use 1-based inclusive. That shouldn't and won't change.

Interbase coordinates are conceptually cleaner than inclusive coordinates (regardless of base), especially when distinguishing insertions and deletions, and for edits at the termini. APIs should use interbase.

I can't think of any technical benefit for 0-based inclusive coordinates.


jmarshall commented Dec 12, 2018

(For the avoidance of doubt,) “interbase” and “zero-based half-inclusive” are two names for the same representation (and the latter name is more formally “zero-based half-open” I guess). I think in @andrewyatz's comment he was meaning the latter but inadvertently elided the “half-”.


reece commented Dec 12, 2018

@jmarshall: Funny, I removed a point clarifying this because I thought it was a distraction. I guess I should have left it in.

Although interbase and 0-based, right-open are numerically equivalent, they're semantically distinct. Interbase provides important conceptual clarity.

0-based, right-open refers to residues, which makes it awkward to refer to insertion points at the termini because you have to refer to imaginary residues. Also, with residue-based coordinates, insertions use exclusive coordinates but deletions and substitutions use inclusive coordinates. That is, 5_6 refers to the space between 5 and 6 for an insertion, but refers to 5 and 6 inclusively for a deletion or MNV.
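The 5_6 example can be sketched in interbase terms, where an insertion is a zero-width interval and a deletion spans the removed residues. The function names and the toy sequence below are hypothetical, chosen only to illustrate the distinction.

```python
from typing import Tuple

Interval = Tuple[int, int]  # interbase (start, end)


def insertion_after(residue: int) -> Interval:
    """Insertion point after the given 1-based residue (0 means before the first)."""
    return (residue, residue)


def deletion_of(first: int, last: int) -> Interval:
    """Interbase span covering 1-based residues first..last inclusive."""
    return (first - 1, last)


def apply_edit(seq: str, iv: Interval, replacement: str) -> str:
    """Replace the span between two interbase positions with `replacement`."""
    start, end = iv
    return seq[:start] + replacement + seq[end:]


seq = "ACGTACGT"  # toy sequence, residues 1..8
inserted = apply_edit(seq, insertion_after(5), "NN")  # between residues 5 and 6
deleted = apply_edit(seq, deletion_of(5, 6), "")      # residues 5 and 6 removed
prepended = apply_edit(seq, insertion_after(0), "NN") # insertion at the 5' terminus
```

In this sketch the same `(start, end)` pair serves insertions, deletions, and edits at the termini, with no imaginary residues and no switch between inclusive and exclusive readings.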
