Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What characters should be allowed in sequence names? #2

Closed
nsheff opened this issue Nov 11, 2020 · 16 comments
Closed

What characters should be allowed in sequence names? #2

nsheff opened this issue Nov 11, 2020 · 16 comments

Comments

@nsheff
Copy link
Member

nsheff commented Nov 11, 2020

The seqcol algorithm will include the names of the sequences within the strings that we hash to form the seqcol digests. For example, chr1. You can read the draft specification here. What should be the allowable set of characters included in these sequence names?

The current proposal is to follow the specification defined for SAM headers: https://samtools.github.io/hts-specs/SAMv1.pdf#subsubsection.1.2.1

In short that is:

Reference sequence names may contain any printable ASCII characters in the range[!-~]apart frombackslashes, commas, quotation marks, and brackets—i.e., apart from ‘\ , "‘’ () [] {} <>’—and may notstart with ‘*’ or ‘=’.4

In theory, we can be more relaxed because seqcol itself doesn't have the same restrictions of a sam file. So, we could define a more permissive set of allowable characters for seqcol purposes. The advantage would be that seqcol could then be used for a wider array of possible sequence collections, even ones that wouldn't work in a sam file; the disadvantage is that perhaps we lose an opportunity to encourage people to standardize, which could prevent some headaches downstream.

@nsheff nsheff changed the title What characters should be allowd in sequence names? What characters should be allowed in sequence names? Nov 12, 2020
@andrewyatz
Copy link
Collaborator

andrewyatz commented Nov 13, 2020

Sorry @jmarshall but I think you also have some input here. @daviesrob pointed us towards the SAM spec and I think this would be a good starting point to say we set ourselves against this. Only concern might be VCF compatibility?

@jmarshall
Copy link
Member

VCF uses the same rules as SAM for sequence names.

This has been brought to TASC (see ga4gh/TASC#5) with a view to harmonising sequence name rules across GA4GH. As yet no-one from other working groups has identified any similar rules from their formats or protocols with which any harmonisation would be necessary, so I intend soon to propose that the SAM rule be designated as the “GA4GH standard sequence name identifier”.

Formats and protocols that use e.g. JSON have the benefit that they only particularly need to restrict quote characters (") or have escaping rules for them. So they could indeed be more permissive. However that would just lead users to have problems later, when producing SAM or VCF files, rather than having the problem made apparent up front. So IMHO it is better to standardise on the lowest common denominator, to avoid submarine headaches.

@nsheff
Copy link
Member Author

nsheff commented Nov 25, 2020

Thanks for the comments. After further discussion today in the refget API meeting, it seems that we are approaching a consensus that we should standardize on SAM sequence name rules.

Before a final decision, we wanted to reach out to a few other communities (plant genomics, microbiomes, proteomics), particularly looking for use cases that may not rely on the SAM toolchain, to see if there's an existing community with a use case that would argue for making a more lenient specification.

@andrewyatz can you ping people who may be able to weigh in?

forum post at biobakery

@andrewyatz
Copy link
Collaborator

Already pinged them 👍

@nsheff
Copy link
Member Author

nsheff commented Dec 11, 2020

Alright, I've now talked to multiple people from the plant and microbial genomics communities, and have repeatedly heard that following the SAM spec will not be a problem.

I have not yet talked to anyone from the proteomics community, though. @andrewyatz has anyone raised any issues with you? I'd like to finalize this point soon so our discussions in other issues can become more concrete.

@andrewyatz
Copy link
Collaborator

My plant, microbes and proteomics contacts have all said they see no issues

@sveinugu
Copy link
Collaborator

Just felt like throwing in a hot potato here:

UTF8?

@nsheff
Copy link
Member Author

nsheff commented Jan 6, 2021

Just felt like throwing in a hot potato here:

UTF8?

Not sure what you're getting at...

@sveinugu
Copy link
Collaborator

sveinugu commented Jan 15, 2021

Just that non-ascii letters tend to appear in the strangest places, breaking things. There is really no reason for a global standard to require that sequences are named with only ASCII letters, except that most tools would probably fail if people do (which is a good reason). So I am not very serious here. But one could potentially explicitly add the possibility of URL-encoding UTF-8 characters and reserve the %-character for that. To be truly global! Then one can use emoji for naming sequences, which is (let's face it) what we all really want to do anyway! 😉 (edit: had to have a real emoji here)

@jmarshall
Copy link
Member

jmarshall commented Jan 15, 2021

If Unicode characters were allowed in these sequence names, then I hope it's obvious that UTF-8 would be the correct encoding for them 😄 — UTF-8 was carefully constructed so that its representations of non-ASCII characters do not collide with any ASCII characters, so there would be no need to further encode them to avoid colliding with commas or US/RS or any other ASCII delimiter characters.

So there's no technical reason to disallow UTF-8-encoded Unicode characters.

It's basically another facet of the design choice of whether to allow arbitrary characters here in this field in this protocol or whether to align with other common restrictions on sequence names.

BTW e.g. SAM and VCF allow UTF-8 in particular description field values (SAM) and in most field values but not INFO/FORMAT keys (VCF), though not in their equivalents of sequence names.

@andrewyatz
Copy link
Collaborator

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

@sveinugu
Copy link
Collaborator

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

Definitely. Which is why I suggested url-encoding them as that should not create any trouble in any tools, however there will be trouble when such sequence names are visualized in e.g. a GUI, so I guess that might be a bad idea. Which means UTF-8 is probably out of the question. Just wanted to raise the issue though.

@daviesrob
Copy link
Contributor

Another reason to ba careful about Unicode that that there's more than one way to write the same visual character, so you have to introduce normalization rules if you want to test for equivalence. For example "Ç" is u+0037 and "Ç" is u+0043u+0327 (or at least was when I pasted it in).

Adding normalization rules makes the specification much more complicated, and means implementers have to depend on extra libraries to implement it. And that's ignoring sorting and collation, which would need their own set of rules to get a consistent outcome.

@jmarshall
Copy link
Member

@daviesrob: Good point — that's a very good technical reason to disallow arbitrary Unicode in fields that you need to do comparisons/sorting/etc on, while perhaps allowing it in free-form text fields (like descriptions) that the computer doesn't need to do anything with other than display to the user (of which, by definition, the unhashed digest string doesn't have any).

@nsheff
Copy link
Member Author

nsheff commented Jan 20, 2021

See #9

@nsheff
Copy link
Member Author

nsheff commented Feb 1, 2022

This issue was decided with an ADR entry in PR #9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants