Pangolin lineage calling despite "N" at defining SNP position #492

Open
MantaRay87 opened this issue Oct 17, 2022 · 3 comments

Comments

@MantaRay87

Not even sure if pangolin itself is the right repository to mention this issue/question or if it has been explained before.

While using some of our Illumina sequence data to compare it with our ONT sequences I noticed the following:

With pangolin-data v1.14 the consensus sequences were called as BA.5.2, but with the new pangolin-data v1.15.1 they are called as BA.5.2.28. The lineage-defining SNP of BA.5.2.28 is ORF1a:T1437I (from pango-designation issue #1133). However, the consensus sequences in question are missing an amplicon over that stretch of the genome, so the position is "N" in the consensus. How and why does pangolin assign this lineage anyway, and how can I be sure it is correct?

We have the same issue with the ONT data: lineage BA.5.2.21 is called after the update, but here too the sequences have "N" at the lineage-defining position.
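For reference, the amino-acid change can be mapped to genome coordinates to confirm that the defining position really is masked. The sketch below assumes ORF1a starts at nucleotide 266 of the Wuhan-Hu-1 reference (NC_045512.2) and that the consensus is already in reference coordinates (aligned, no indels). The function names are illustrative and not part of pangolin.

```python
# 1-based start of ORF1a on the Wuhan-Hu-1 reference (NC_045512.2).
ORF1A_START = 266


def codon_to_genome_range(codon_number, orf_start=ORF1A_START):
    """Return the 1-based (first, last) nucleotide positions of a codon."""
    first = orf_start + 3 * (codon_number - 1)
    return first, first + 2


def masked_positions(aligned_seq, first, last):
    """List 1-based positions in [first, last] that are 'N'.

    Assumes aligned_seq uses reference coordinates (no insertions/deletions).
    """
    return [
        pos
        for pos in range(first, last + 1)
        if aligned_seq[pos - 1].upper() == "N"
    ]


first, last = codon_to_genome_range(1437)
print(first, last)  # 4574 4576
```

ORF1a:T1437I therefore falls in nucleotides 4574 to 4576. If `masked_positions` returns any of those positions for a consensus, the defining site cannot have contributed direct evidence to the call.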

@AngieHinrichs
Member

Thanks for reporting this @MantaRay87. I think I might know the cause but it would help to have some more info:

  • Can you include a few complete lines of output? Assuming you're using the default analysis mode (UShER), the comment column indicates whether there were multiple equally optimally parsimonious placements (EPPs) in the lineage tree.
  • Can you share an example sequence or a few IDs if the sequences are in public repositories?

When a sequence has an 'N' or other IUPAC ambiguous base, UShER imputes the value based on matches at other positions. When there are multiple EPPs, AFAIK UShER picks the node with the most descendants, which would make it pick BA.5.2 when the sequence matches both BA.5.2 and BA.5.2.28 because of the N at 4575. However, when adding UShER mode to pangolin, I added a bit of logic to override UShER's favorite when there is a plurality of matches in one lineage. Unfortunately, I have seen some cases in which one lineage gets a plurality of matches only because of Ns matching multiple branches within the lineage, so that "voting" might be harming more than helping, given how common amplicon dropout is now. Without looking at complete output for your sequences (or ideally, running pangolin on the sequences myself), I can't be sure that's causing the odd assignments, but that's my best guess.
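To make the difference between the two choices concrete, here is a minimal sketch of the behaviour described above. It is not pangolin's or UShER's actual code: the function names, the placement records, and the descendant counts are all hypothetical, chosen only to show how several placements inside one sublineage can outvote a single placement on the parent.

```python
from collections import Counter


def most_descendants_choice(placements):
    """Stand-in for the described default: the node with the most descendants."""
    return max(placements, key=lambda p: p["descendants"])["lineage"]


def plurality_choice(placements):
    """Return the lineage with strictly the most placements, or None on a tie."""
    ranked = Counter(p["lineage"] for p in placements).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def assign(placements):
    """Let the plurality vote override the default when it has a clear winner."""
    voted = plurality_choice(placements)
    return voted if voted is not None else most_descendants_choice(placements)


# Hypothetical placements for a sequence with N at the defining position:
# one on the large BA.5.2 node, three on small branches inside BA.5.2.28.
placements = [
    {"lineage": "BA.5.2", "descendants": 5000},
    {"lineage": "BA.5.2.28", "descendants": 40},
    {"lineage": "BA.5.2.28", "descendants": 25},
    {"lineage": "BA.5.2.28", "descendants": 10},
]

print(most_descendants_choice(placements))  # BA.5.2
print(assign(placements))                   # BA.5.2.28
```

In this toy case the three BA.5.2.28 placements exist only because the masked site is compatible with several branches inside that lineage, which is the failure mode suspected here.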

@MantaRay87
Author

Hey @AngieHinrichs thanks for the fast reply. I sure can:

Pango_output.csv

The corresponding GISAID IDs can be found here:
GISAID_IDs.csv

The ONT sequences, in which all 4 samples are called as BA.5.2.21, are not public. If you also need the output and sequences for those, let me know :)

@AngieHinrichs
Member

Perfect, thanks. Yes, it is what I guessed: UShER picks BA.5.2 for all four of those sequences, but since there are more matches within the BA.5.2.28 branch due to the Ns, the "voting" picks BA.5.2.28. I probably should remove the "voting" from pangolin. I will try to find time soon to see how many sequences' assignments would change, and evaluate whether it looks like an improvement overall.

In the meantime, the conflict and comment columns of the lineage_report.csv output might be useful for flagging these. conflict is the proportion of placements that fall in a lineage other than the one that won the "voting" and was assigned. For the four example sequences, it is either 0.25 (when the comment column has "Usher placements: BA.5.2(1/4) BA.5.2.28(3/4)") or 0.4 (when the comment has "Usher placements: BA.5.2(2/5) BA.5.2.28(3/5)").
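A rough sketch of that flagging step, using only the Python standard library. It assumes the report has taxon, lineage and conflict columns and that the placement text looks like the examples above. The name of the comment column is an assumption: the code tries "note" and then "comment", so adjust it to match your report. The sample rows and taxon names are made up for illustration.

```python
import csv
import io
import re

# Matches tokens such as "BA.5.2.28(3/4)" in the placement text.
PLACEMENT_RE = re.compile(r"([\w.]+)\((\d+)/(\d+)\)")


def parse_placements(text):
    """Turn 'Usher placements: BA.5.2(1/4) BA.5.2.28(3/4)' into {lineage: (n, total)}."""
    return {
        lineage: (int(n), int(total))
        for lineage, n, total in PLACEMENT_RE.findall(text)
    }


def flag_conflicted(report_handle, threshold=0.0):
    """Return (taxon, lineage, conflict, placements) for rows above threshold."""
    flagged = []
    for row in csv.DictReader(report_handle):
        conflict = float(row.get("conflict") or 0.0)
        if conflict > threshold:
            # Column name is an assumption; try both spellings.
            text = row.get("note") or row.get("comment") or ""
            flagged.append(
                (row["taxon"], row["lineage"], conflict, parse_placements(text))
            )
    return flagged


# Made-up report excerpt for demonstration only.
sample = io.StringIO(
    "taxon,lineage,conflict,note\n"
    "sample_A,BA.5.2.28,0.25,Usher placements: BA.5.2(1/4) BA.5.2.28(3/4)\n"
    "sample_B,BA.5.2,0.0,Usher placements: BA.5.2(1/1)\n"
)

for taxon, lineage, conflict, placements in flag_conflicted(sample):
    print(taxon, lineage, conflict, placements)
```

With a real report, pass an open file handle, for example `open("lineage_report.csv", newline="")`, in place of the `io.StringIO` sample. Flagged rows whose placements include the parent lineage are candidates for checking whether the defining site is masked.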
