Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential Problem assigning B.1.623 and B.1.621 #115

Closed
rmcolq opened this issue Jun 17, 2021 · 13 comments
Closed

Potential Problem assigning B.1.623 and B.1.621 #115

rmcolq opened this issue Jun 17, 2021 · 13 comments
Milestone

Comments

@rmcolq
Copy link
Contributor

rmcolq commented Jun 17, 2021

Copied over from cov-lineages/pangolin#253

Dear Pangolin team,

When assigning lineages to patient samples with pangolin I stumbled over a cluster that looks like B.1.621 but was assigned B.1.623.
The mutations we find definitely fit the B.1.621 in GISAID and not B.1.623, so either the lineage definition was changed or something else went wrong. I uploaded the sequences to USHER (https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace) and got a strange result there too. Usher classified the sequence as B.1.621, but the closest sequences where B.1.623 (I guess classified by pangolin in GISAID).
An example of the AA changes our sequences have :
ORF1ab:T1055A; ORF1ab:T1538I; ORF1ab:N2147D; ORF1ab:T2952I; ORF1ab:T3255I; ORF1ab:Q3729R; ORF1ab:P4715L; ORF1ab:P5743S; ORF1ab:P5815S; ORF1ab:A6532V; ORF1ab:S6713L; S:T95I; S:Y144S; S:R346K; S:E484K; S:N501Y; S:D614G; S:P681H; S:D950N; ORF3a:Q57H; ORF3a:V256fs; ORF3a:V259L; ORF8:T11K; ORF8:P38S; ORF8:S67F; N:T205I

Of these the following fit B.1.621 in https://outbreak.info/situation-reports?pango=B.1.621
ORF1ab:T1055A; ORF1ab:T1538I; ORF1ab:T3255I; ORF1ab:Q3729R; ORF1ab:P4715L; ORF1ab:P5743S; S:T95I; S:Y144S; S:R346K; S:E484K; S:N501Y; S:D614G; S:P681H; S:D950N; ORF3a:Q57H; ORF8:T11K; ORF8:P38S; ORF8:S67F; N:T205I
and the following B.1.623 in https://outbreak.info/situation-reports?pango=B.1.623
ORF1ab:P4715L; S:N501Y; S:D614G; S:P681H; ORF3a:Q57H; with 9 AA changes missing

Attached please find a fasta with 10 of the sequences.
I use:
pangolin==3.0.3
lineages-2020-05-19-2
pangoLEARN-2021-06-15
scorpio-0.3
constellations-0.0.5
pango-designation v1.2.13
All the best and thanks again fro the great tools you provide,

Sorry, just checked on GISAID (17/06/2021, 1:24:19 PM), and one of the closest sequences according to USHER, EPI_ISL_2220456, is also classified as B.1.623 (version 3.1.1, lineages version 2021-06-15), although the mutations eg:
Spike D614G, Spike D950N, Spike E484K, Spike ins145N, Spike N501Y, Spike P681H, Spike R346K, Spike T95I, Spike Y144T, Spike Y145S
would rather fit B.1.621 #62
than B.1.623 #25
Lukas

b.1.623_run_994_10.fasta.gz
Dr. Lukas Endler
Senior Postdoc

CeMM
Research Center for Molecular Medicine
of the Austrian Academy of Sciences
Lazarettgasse 14, AKH BT 25.3
1090 Vienna, Austria
Phone +43-1/40160-70055
lendler@cemm.oeaw.ac.at
www.cemm.at

@jcw349
Copy link

jcw349 commented Jun 18, 2021

Hi,

We are also seeing all of our B.1.623 sequences with B.1.621 mutations listed above clustering together and growing rapidly over time. These "B.1.623" are all branching off of the same ancestral node in B.1.621 and are genetically unrelated to the canonical B.1.623. I think this might be an emerging sublineage of B.1.621, suggesting B.1.621.1 for the sublineage.

Jade Wang

@rmcolq
Copy link
Contributor Author

rmcolq commented Jun 18, 2021

Hi @jcw349, @luenling - are any of your sequences on GISAID? If so, some accessions/sequence IDs would be awesome!

@ha-ruiz75
Copy link

Hello, all.

I have also experienced this issue in our sequencing results.
I have compiled a list of GISAID sequences with <5% Ns that should be assigned as B.1.621 but are assigned to other lineages (mostly B.1.623).

List

genomes_should_be_B_1_621.txt

@jcw349
Copy link

jcw349 commented Jun 18, 2021

@rmcolq
We are working in collaboration with Pandemic Response Lab (PRL). So far only 1 sequence made it onto GISAID, EPI_ISL_1828364. I uploaded some more today will let you know if they make it through.

In the mean time. This is the phylogenetic tree we put together. The leaf names are a substring of GISAID virus names if the genomes make it onto GISAID.

The "Assigned_Pangolin_Lineage" indicates the lineage to which the genome was assigned (B.1.621 or B.1.623). As you can see a subset genomes assigned to B.1.623 (blue) are a sublineage of B.1.621 (yellow with red branch). This sublineage are genomically similar and they all have S:E484K, S:N501Y, as well as other mutations Dr. Endler pointed out. The other B.1.623 on the other hand do not all have these mutations and are genetically distinct from the B.1.621 clade.

@luenling
Copy link

Dear Rachel @rmcolq ,
I just uploaded some of our sequences to GISAID, so hopefully I can send you the IDs in the next days.
Thanks for looking into it and all the best,
Lukas

@luenling
Copy link

luenling commented Jun 20, 2021

Dear Rachel @rmcolq ,

Attached a list of 48 GISAID IDs and virus names (tsv with first 4 columns from GISAID) of our sequences that were assigned B.1.623 and should be B.1.621. They are all from a cluster in Vorarlberg so not much diversity there.

All the best,
Lukas

b.1.621_cluster_Vorarlberg_CeMM.txt

@jcw349
Copy link

jcw349 commented Jun 23, 2021

Hi,
I realized our sequences may have been rejected by GISAID, without error messages from them, due to frameshift insertions and deletions.

According to cov-glue, they have:
Novel non-codon-aligned insertion in S: 21990-TAC-21991
Novel frameshifting deletion in ORF 3a: nucleotides 26158-26161

I will double checking our depth of coverage and attempt to resubmit. Is there another way for us to send you the fasta files in the mean time?

Thank you,
Jade W.

@rmcolq
Copy link
Contributor Author

rmcolq commented Jun 24, 2021

@chrisruis I took an initial look at this, but it really isn't a simple case and I don't have the experience to resolve it. It looks like when the ids provided are uploaded to https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace they mostly cluster together on the big tree, but the nearby lineages are not nicely coloured by pangolin lineage - probably haven't sampled enough of the diversity in the region in pango-designations to date. There are several subtrees, the clearest of which is this:
Screenshot 2021-06-18 at 16 37 08
Can you take a look?

@AngieHinrichs
Copy link
Member

@rmcolq it looks the default subtree size of 50 was used -- it's really meant for uploading only a handful of sequences and seeing the names of sequences that they're most closely related to. For bigger tree questions like this, I would use a much larger subtree size like 1000 (the max is currently 5000) -- it gives a lot more context, and you can then click on branches of interest to zoom in and see the leaf labels.

Regarding the appearance of B.1 and B.1.429 calls in the branch that's mostly B.1.621 -- I usually interpret that as either pangoLEARN being affected by noise / new mutations that have appeared in other lineages, or perhaps lots of Ns (or contamination, new mutations etc) leading to multiple equally parsimonious placements for that sequence. UShER picked the placement that you see, but that might not be the only plausible placement for that sequence. Unfortunately there's no way to tell about Ns aside from examining the original sequence at this point. (If I had tons of time, I would find a way to get more of that info into the metadata...)

One more thing: https://hgwdev-angie.gi.ucsc.edu/cgi-bin/hgPhyloPlace is unstable -- it's my sandbox and may be completely broken when I'm working on something new. https://genome.ucsc.edu/cgi-bin/hgPhyloPlace is the official location, although its tree & metadata may lag by a day (or sometimes a few days) and its code lags by ~one week to one month relative to the more-stable-than-hgwdev-angie test server hgwdev.gi.ucsc.edu. :)

@AngieHinrichs
Copy link
Member

I pasted in both Héctor's and Lukas's sample IDs with a subtree size of 1000 so you can see more B.1.621 sequences (purple) around the cluster that is very consistently being misidentified as B.1.623 (shades of blue):

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pango-designation-115.json?c=pango_lineage

image

@rmcolq
Copy link
Contributor Author

rmcolq commented Jun 25, 2021

That looks much clearer! Thanks @AngieHinrichs. It appears that there are 2 extra ORF1a mutations (N2147D and T2952I) on the branch leading to the cluster, but otherwise it clearly fits well within the B.1.621 lineage. I don't see evidence of these new samples being corresponding to an introduction to a new place. Are you all happy with a lineage definition of B.1.621 for your samples (in which case I will pick some to add to the designations list)?
Alternatively, do you have any evidence suggesting it should be a new lineage (https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/)?

@jcw349
Copy link

jcw349 commented Jun 25, 2021

Hi,
I think B.1.621 is a more appropriate assignment than B.1.623, however, I think this might be a sub-lineage of B.1.621.

We didn't get to put a phylogenetic tree together with all the sequences listed by others in this thread, but based on our data, it seems like the B.1.623 group that needs to be reclassified (green) and is growing faster in NYC than other B.1.621 (blue) (see Counts over time figure). It looks like in Angie Hinrichs tree, B.1.623 and "uploaded sample" are also mostly falling into 1 branch as well.

Counts over time:
image

single branch in B.1.621
In the phylogenetic tree, misclassified group (purple branch) is a single branch with large number of non-identical sequences, that stem from the B.1.621 (red branch)
image

NYC PHL and PRL are working on resubmitting these sequences to GISAID. As mentioned earlier, there are novel insertions and deletions in this cluster that are causing GISAID to reject our sequences. Sorry for the delays!

Thank you!
Jade W.

@rmcolq
Copy link
Contributor Author

rmcolq commented Jun 30, 2021

I've added as many of these as I could to a newly designated lineage B.1.621.1, released in pango-designation v1.2.29. This will filter into the training over the next week.

@rmcolq rmcolq closed this as completed Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants