Remove 2339 names that map to seqs on multiple lineage branches #2466

Merged
AngieHinrichs merged 2 commits into cov-lineages:master on Jan 23, 2024

AngieHinrichs
Member

About 48k names in lineages.csv map to multiple sequences in the UCSC UShER tree. Some are deduplication failures; others arise because the same name is used in multiple submissions to the same repository, so one name ends up attached to multiple accessions. Of those 48k, 2339 names map to sequences that sit on different lineage branches in the UShER tree, which makes it unclear which accession/sequence the name is meant to refer to. This PR removes those 2339 names from lineages.csv so that we can rely on more uniquely mapped names.
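
A minimal sketch of the filtering idea described above, not the script actually used for this PR. It assumes a hypothetical tab-separated file with one row per tree sequence (name, then lineage), treats differing lineage labels as a stand-in for "different lineage branches", and assumes the name is the first column of lineages.csv. The function names `ambiguous_names` and `drop_names` are made up for illustration.

```shell
# Hypothetical input on stdin: "name<TAB>lineage", one row per sequence in the tree.
# Prints every name that occurs with more than one distinct lineage.
ambiguous_names() {
  awk -F'\t' '
    !seen[$1 FS $2]++ { n[$1]++ }   # count distinct lineages per name
    END { for (name in n) if (n[name] > 1) print name }
  ' | sort
}

# Reads a CSV on stdin and drops rows whose first column is listed in file $1.
# The first row (header) is always kept. Naive comma splitting: assumes names
# contain no commas or quoting.
drop_names() {
  awk -F',' -v list="$1" '
    BEGIN { while ((getline line < list) > 0) drop[line] = 1 }
    NR == 1 || !($1 in drop)
  '
}
```

For example, `ambiguous_names < name_lineage.tsv > ambiguous.txt` followed by `drop_names ambiguous.txt < lineages.csv > lineages.pruned.csv`, where all file names are placeholders.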

@AngieHinrichs
Member Author

@corneliusroemer I will merge this tomorrow unless you would like some time to check it first.

@corneliusroemer
Contributor

Thanks for this!

I ran a quick check to see whether this would drop any lineage into precarious territory in terms of the number of designations left. That's not the case, so no objection from me.

Counts before and after pruning for affected lineages, shown as [-before-]{+after+}:
A.1          [-2699-]{+2698+}
AP.1         [-395-]{+394+}
AY.100       [-16626-]{+16622+}
AY.103       [-18103-]{+18101+}
AY.120       [-15515-]{+15514+}
AY.122       [-68322-]{+68321+}
AY.134       [-801-]{+799+}
AY.25        [-3853-]{+3849+}
AY.25.1      [-41309-]{+41307+}
AY.4         [-34453-]{+34451+}
AY.4.2.1     [-1289-]{+1287+}
AY.4.2.2     [-3241-]{+3240+}
AY.44        [-19992-]{+19985+}
AY.46.6      [-5208-]{+5207+}
AY.75        [-5173-]{+5170+}
AY.9         [-2836-]{+2832+}
AY.98.1      [-10420-]{+10419+}
AZ.2.1       [-95-]{+94+}
B            [-4001-]{+3999+}
B.1          [-46228-]{+46223+}
B.1.1        [-22790-]{+22778+}
B.1.1.220    [-184-]{+183+}
B.1.1.228    [-165-]{+164+}
B.1.1.519    [-485-]{+476+}
B.1.1.7      [-69636-]{+69634+}
B.1.160.14   [-200-]{+198+}
B.1.160.16   [-508-]{+503+}
B.1.160.26   [-194-]{+193+}
B.1.177      [-37243-]{+37241+}
B.1.177.43   [-254-]{+252+}
B.1.177.44   [-917-]{+897+}
B.1.177.57   [-2262-]{+2261+}
B.1.177.8    [-819-]{+817+}
B.1.221      [-5443-]{+5442+}
B.1.222      [-404-]{+403+}
B.1.225      [-40-]{+39+}
B.1.36.2     [-65-]{+63+}
B.3.1        [-491-]{+490+}
B.4          [-301-]{+300+}
BA.1         [-66001-]{+65813+}
BA.1.1       [-104527-]{+104388+}
BA.1.1.1     [-15936-]{+15907+}
BA.1.1.10    [-1143-]{+1142+}
BA.1.1.11    [-1983-]{+1978+}
BA.1.1.12    [-1927-]{+1926+}
BA.1.1.13    [-4146-]{+4137+}
BA.1.1.14    [-6372-]{+6368+}
BA.1.1.15    [-6296-]{+6290+}
BA.1.1.18    [-34614-]{+34558+}
BA.1.1.3     [-291-]{+290+}
BA.1.1.4     [-792-]{+791+}
BA.1.1.8     [-381-]{+380+}
BA.1.10      [-1251-]{+1249+}
BA.1.12      [-1862-]{+1859+}
BA.1.13      [-1518-]{+1516+}
BA.1.14      [-3727-]{+3689+}
BA.1.14.1    [-1941-]{+1940+}
BA.1.15      [-35340-]{+35155+}
BA.1.15.1    [-19949-]{+19914+}
BA.1.16      [-7651-]{+7636+}
BA.1.17      [-11367-]{+11352+}
BA.1.17.2    [-29853-]{+29779+}
BA.1.18      [-20829-]{+20791+}
BA.1.19      [-2661-]{+2628+}
BA.1.20      [-17377-]{+17323+}
BA.1.21      [-4525-]{+4510+}
BA.2         [-111257-]{+110927+}
BA.2.12      [-2460-]{+2459+}
BA.2.12.1    [-39747-]{+39743+}
BA.2.14      [-502-]{+490+}
BA.2.23      [-6280-]{+6279+}
BA.2.3       [-19173-]{+19164+}
BA.2.36      [-3000-]{+2999+}
BA.2.37      [-3967-]{+3966+}
BA.2.45      [-438-]{+434+}
BA.2.47      [-848-]{+845+}
BA.2.51      [-308-]{+290+}
BA.2.7       [-508-]{+505+}
BA.2.9       [-42851-]{+42138+}
BA.3         [-204-]{+203+}
BA.4         [-3634-]{+3633+}
BA.4.1       [-8368-]{+8366+}
BA.4.1.1     [-1349-]{+1348+}
BA.4.1.6     [-530-]{+522+}
BA.4.6.5     [-3080-]{+3078+}
BA.5         [-2464-]{+2461+}
BA.5.1       [-23524-]{+23500+}
BA.5.1.1     [-1190-]{+1187+}
BA.5.1.22    [-11797-]{+11796+}
BA.5.1.23    [-9531-]{+9530+}
BA.5.1.24    [-5166-]{+5163+}
BA.5.1.25    [-2847-]{+2845+}
BA.5.1.30    [-1000-]{+999+}
BA.5.1.35    [-1854-]{+1852+}
BA.5.2       [-10669-]{+10665+}
BA.5.2.1     [-14519-]{+14516+}
BA.5.2.28    [-1957-]{+1955+}
BA.5.2.3     [-594-]{+593+}
BA.5.2.32    [-370-]{+368+}
BA.5.2.57    [-1477-]{+1476+}
BA.5.2.9     [-3582-]{+3580+}
BA.5.5       [-6044-]{+6042+}
BD.1         [-4404-]{+4393+}
BE.1         [-1838-]{+1837+}
BE.1.1       [-6607-]{+6604+}
BE.1.1.2     [-3495-]{+3494+}
BE.12        [-317-]{+315+}
BF.10        [-1351-]{+1349+}
BF.14        [-2520-]{+2516+}
BF.26        [-4018-]{+4015+}
BF.28        [-5746-]{+5745+}
BF.36        [-2097-]{+2096+}
BF.7         [-10188-]{+10187+}
BM.1.1       [-301-]{+299+}
BQ.1         [-8917-]{+8916+}
BQ.1.1.31    [-242-]{+241+}
BQ.1.1.65    [-252-]{+251+}
BQ.1.12      [-2523-]{+2519+}
CL.1.3       [-198-]{+197+}
CP.1         [-67-]{+66+}
DE.1         [-437-]{+436+}
DE.2         [-45-]{+44+}
FL.2.3.1     [-7-]{+6+}
FL.21        [-283-]{+282+}
GF.1         [-275-]{+274+}
GJ.1.2       [-940-]{+935+}
GY.1.1       [-40-]{+39+}
HK.20.1      [-63-]{+62+}
HK.27        [-472-]{+471+}
HK.27.1.1    [-134-]{+128+}
HK.3.2       [-921-]{+920+}
XD           [-22-]{+21+}
XG           [-140-]{+130+}
XH           [-41-]{+40+}
XV           [-42-]{+41+}
JN.6         [-37-]{+36+}
W.4          [-508-]{+507+}
XBB          [-725-]{+724+}
XBB.1.16.18  [-432-]{+430+}
XBB.1.16.19  [-412-]{+411+}
XBB.1.30     [-147-]{+146+}
XBB.1.31.2   [-122-]{+121+}
XBB.1.5.18   [-482-]{+481+}
XBB.1.5.19   [-545-]{+544+}
XBB.1.5.32   [-729-]{+712+}
XBB.1.5.48   [-1770-]{+1769+}

I've had a look at the names of the duplicates. It looks like these are not exact name duplicates but rather the GenBank and GISAID strain names, respectively.

We could in theory add a rule that gives GISAID precedence over GenBank to resolve ambiguities.

But the differences are so small that it might not be worth the trouble.

Commands I used:

Run this on the master and pr branches:

git checkout master; csv2tsv lineages.csv | tsv-summarize -H -g lineage --count | sort > preprune.tsv
git checkout pr; csv2tsv lineages.csv | tsv-summarize -H -g lineage --count | sort > postprune.tsv
csvdiff -o word-diff -s '\t' preprune.tsv postprune.tsv | tsv-pretty
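
The same comparison can be reduced to a plain per-lineage delta with awk alone. This is a sketch under the assumption that both files have the two-column layout (lineage, count) produced by the commands above; `count_changes` is a hypothetical helper name, and lineages that vanish entirely from the second file are not reported.

```shell
# Usage: count_changes BEFORE.tsv AFTER.tsv
# Prints "lineage<TAB>before<TAB>after<TAB>removed" for every lineage whose count changed.
count_changes() {
  awk -F'\t' '
    $2 !~ /^[0-9]+$/ { next }                      # skip header rows, wherever sort put them
    FILENAME == ARGV[1] { before[$1] = $2; next }  # first file: remember the old count
    ($1 in before) && before[$1] != $2 {
      print $1 "\t" before[$1] "\t" $2 "\t" (before[$1] - $2)
    }
  ' "$1" "$2"
}
```

With the files from the commands above this would be called as `count_changes preprune.tsv postprune.tsv`.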

@AngieHinrichs
Member Author

Thanks @corneliusroemer for taking a look! Besides the duplicates, there are some interesting cases from GenBase (CNCB) where the same name is reused for many sequences like this:

CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA029524.1|2023-05-22       XBB.1.16.19
CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA043238.1|2023-08-18       EG.5.1.1
CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA045181.1|2023-09-08       HK.3.1
CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA047965.1|2023-10-04       HK.3
CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA049553.1|2023-10-19       HK.3.1
CHN/MSCDC-10/2023       CHN/MSCDC-10/2023|C_AA051947.1|2023-10-26       HK.13

I pinged CNCB about that, asking if they could ask the submitters to use distinct names, but no response so far. Meanwhile, GISAID curators seem to be adding disambiguating suffixes for those (MSCDC-10, MSCDC-10-2, MSCDC-10-3, and so on). So in that case it would be better to go with the GISAID names than the public-database names.
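
Reused names like the one above can be listed with a short awk sketch. It assumes rows in the layout shown (tab-separated: name, then `name|accession|date`, then lineage), which may not match the real intermediate files; `reused_names` is a hypothetical helper name.

```shell
# Input on stdin: "name<TAB>name|accession|date<TAB>lineage".
# Prints "name<TAB>number of accessions<TAB>lineages" for names with more than one accession.
reused_names() {
  awk -F'\t' '
    {
      split($2, part, "|")                          # part[2] is the accession
      if (!seen_acc[$1, part[2]]++) n_acc[$1]++     # count distinct accessions per name
      if (!seen_lin[$1, $3]++)                      # collect lineages in first-seen order
        lins[$1] = (lins[$1] == "" ? $3 : lins[$1] "," $3)
    }
    END {
      for (name in n_acc)
        if (n_acc[name] > 1) print name "\t" n_acc[name] "\t" lins[name]
    }
  ' | sort
}
```

Run on the six rows above, this would report CHN/MSCDC-10/2023 with 6 accessions and the lineages XBB.1.16.19, EG.5.1.1, HK.3.1, HK.3 and HK.13.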

@AngieHinrichs merged commit acc65d1 into cov-lineages:master on Jan 23, 2024
3 checks passed