What characters should be allowed in sequence names? #2

nsheff · 2020-11-11T21:44:28Z

The seqcol algorithm will include the names of the sequences within the strings that we hash to form the seqcol digests. For example, chr1. You can read the draft specification here. What should be the allowable set of characters included in these sequence names?

The current proposal is to follow the specification defined for SAM headers: https://samtools.github.io/hts-specs/SAMv1.pdf#subsubsection.1.2.1

In short that is:

Reference sequence names may contain any printable ASCII characters in the range[!-~]apart frombackslashes, commas, quotation marks, and brackets—i.e., apart from ‘\ , "‘’ () [] {} <>’—and may notstart with ‘*’ or ‘=’.4

In theory, we can be more relaxed because seqcol itself doesn't have the same restrictions of a sam file. So, we could define a more permissive set of allowable characters for seqcol purposes. The advantage would be that seqcol could then be used for a wider array of possible sequence collections, even ones that wouldn't work in a sam file; the disadvantage is that perhaps we lose an opportunity to encourage people to standardize, which could prevent some headaches downstream.

The text was updated successfully, but these errors were encountered:

andrewyatz · 2020-11-13T17:39:58Z

Sorry @jmarshall but I think you also have some input here. @daviesrob pointed us towards the SAM spec and I think this would be a good starting point to say we set ourselves against this. Only concern might be VCF compatibility?

jmarshall · 2020-11-13T18:29:08Z

VCF uses the same rules as SAM for sequence names.

This has been brought to TASC (see ga4gh/TASC#5) with a view to harmonising sequence name rules across GA4GH. As yet no-one from other working groups has identified any similar rules from their formats or protocols with which any harmonisation would be necessary, so I intend soon to propose that the SAM rule be designated as the “GA4GH standard sequence name identifier”.

Formats and protocols that use e.g. JSON have the benefit that they only particularly need to restrict quote characters (") or have escaping rules for them. So they could indeed be more permissive. However that would just lead users to have problems later, when producing SAM or VCF files, rather than having the problem made apparent up front. So IMHO it is better to standardise on the lowest common denominator, to avoid submarine headaches.

nsheff · 2020-11-25T16:16:40Z

Thanks for the comments. After further discussion today in the refget API meeting, it seems that we are approaching a consensus that we should standardize on SAM sequence name rules.

Before a final decision, we wanted to reach out to a few other communities (plant genomics, microbiomes, proteomics), particularly looking for use cases that may not rely on the SAM toolchain, to see if there's an existing community with a use case that would argue for making a more lenient specification.

@andrewyatz can you ping people who may be able to weigh in?

forum post at biobakery

andrewyatz · 2020-11-25T16:24:10Z

Already pinged them 👍

nsheff · 2020-12-11T13:53:22Z

Alright, I've now talked to multiple people from the plant and microbial genomics communities, and have repeatedly heard that following the SAM spec will not be a problem.

I have not yet talked to anyone from the proteomics community, though. @andrewyatz has anyone raised any issues with you? I'd like to finalize this point soon so our discussions in other issues can become more concrete.

andrewyatz · 2020-12-11T16:15:32Z

My plant, microbes and proteomics contacts have all said they see no issues

sveinugu · 2020-12-15T00:34:44Z

Just felt like throwing in a hot potato here:

UTF8?

nsheff · 2021-01-06T13:58:33Z

Just felt like throwing in a hot potato here:

UTF8?

Not sure what you're getting at...

sveinugu · 2021-01-15T13:24:52Z

Just that non-ascii letters tend to appear in the strangest places, breaking things. There is really no reason for a global standard to require that sequences are named with only ASCII letters, except that most tools would probably fail if people do (which is a good reason). So I am not very serious here. But one could potentially explicitly add the possibility of URL-encoding UTF-8 characters and reserve the %-character for that. To be truly global! Then one can use emoji for naming sequences, which is (let's face it) what we all really want to do anyway! 😉 (edit: had to have a real emoji here)

jmarshall · 2021-01-15T14:01:51Z

If Unicode characters were allowed in these sequence names, then I hope it's obvious that UTF-8 would be the correct encoding for them 😄 — UTF-8 was carefully constructed so that its representations of non-ASCII characters do not collide with any ASCII characters, so there would be no need to further encode them to avoid colliding with commas or US/RS or any other ASCII delimiter characters.

So there's no technical reason to disallow UTF-8-encoded Unicode characters.

It's basically another facet of the design choice of whether to allow arbitrary characters here in this field in this protocol or whether to align with other common restrictions on sequence names.

BTW e.g. SAM and VCF allow UTF-8 in particular description field values (SAM) and in most field values but not INFO/FORMAT keys (VCF), though not in their equivalents of sequence names.

andrewyatz · 2021-01-15T14:07:08Z

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

sveinugu · 2021-01-15T14:16:54Z

I like the way you're saying this @jmarshall. Alignment with other common restrictions is a good reason

Definitely. Which is why I suggested url-encoding them as that should not create any trouble in any tools, however there will be trouble when such sequence names are visualized in e.g. a GUI, so I guess that might be a bad idea. Which means UTF-8 is probably out of the question. Just wanted to raise the issue though.

daviesrob · 2021-01-15T15:02:50Z

Another reason to ba careful about Unicode that that there's more than one way to write the same visual character, so you have to introduce normalization rules if you want to test for equivalence. For example "Ç" is u+0037 and "Ç" is u+0043u+0327 (or at least was when I pasted it in).

Adding normalization rules makes the specification much more complicated, and means implementers have to depend on extra libraries to implement it. And that's ignoring sorting and collation, which would need their own set of rules to get a consistent outcome.

jmarshall · 2021-01-15T15:14:52Z

@daviesrob: Good point — that's a very good technical reason to disallow arbitrary Unicode in fields that you need to do comparisons/sorting/etc on, while perhaps allowing it in free-form text fields (like descriptions) that the computer doesn't need to do anything with other than display to the user (of which, by definition, the unhashed digest string doesn't have any).

nsheff · 2021-01-20T13:37:12Z

See #9

nsheff · 2022-02-01T21:51:18Z

This issue was decided with an ADR entry in PR #9.

nsheff changed the title ~~What characters should be allowd in sequence names?~~ What characters should be allowed in sequence names? Nov 12, 2020

nsheff mentioned this issue Dec 9, 2020

What information is included within the string-to-digest? #8

Closed

jmarshall mentioned this issue Dec 11, 2020

What characters should we use for delimiters? #1

Closed

jmarshall mentioned this issue Feb 9, 2021

Possible bug in htslib/bcftools 1.1: [E::bcf_hdr_add_sample_len] Duplicated sample name samtools/bcftools#1408

Closed

sveinugu mentioned this issue Sep 22, 2021

Sequence collection, ordered? or unordered? #5

Closed

nsheff closed this as completed Feb 1, 2022

jmarshall mentioned this issue Apr 29, 2023

sam: Clarify the usage of UTF-8 characters in header samtools/hts-specs#719

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What characters should be allowed in sequence names? #2

What characters should be allowed in sequence names? #2

nsheff commented Nov 11, 2020 •

edited

Loading

andrewyatz commented Nov 13, 2020 •

edited

Loading

jmarshall commented Nov 13, 2020

nsheff commented Nov 25, 2020 •

edited

Loading

andrewyatz commented Nov 25, 2020

nsheff commented Dec 11, 2020

andrewyatz commented Dec 11, 2020

sveinugu commented Dec 15, 2020

nsheff commented Jan 6, 2021

sveinugu commented Jan 15, 2021 •

edited

Loading

jmarshall commented Jan 15, 2021 •

edited

Loading

andrewyatz commented Jan 15, 2021

sveinugu commented Jan 15, 2021

daviesrob commented Jan 15, 2021

jmarshall commented Jan 15, 2021

nsheff commented Jan 20, 2021

nsheff commented Feb 1, 2022

What characters should be allowed in sequence names? #2

What characters should be allowed in sequence names? #2

Comments

nsheff commented Nov 11, 2020 • edited Loading

andrewyatz commented Nov 13, 2020 • edited Loading

jmarshall commented Nov 13, 2020

nsheff commented Nov 25, 2020 • edited Loading

andrewyatz commented Nov 25, 2020

nsheff commented Dec 11, 2020

andrewyatz commented Dec 11, 2020

sveinugu commented Dec 15, 2020

nsheff commented Jan 6, 2021

sveinugu commented Jan 15, 2021 • edited Loading

jmarshall commented Jan 15, 2021 • edited Loading

andrewyatz commented Jan 15, 2021

sveinugu commented Jan 15, 2021

daviesrob commented Jan 15, 2021

jmarshall commented Jan 15, 2021

nsheff commented Jan 20, 2021

nsheff commented Feb 1, 2022

nsheff commented Nov 11, 2020 •

edited

Loading

andrewyatz commented Nov 13, 2020 •

edited

Loading

nsheff commented Nov 25, 2020 •

edited

Loading

sveinugu commented Jan 15, 2021 •

edited

Loading

jmarshall commented Jan 15, 2021 •

edited

Loading