New masks to consider due to amplicon 64 issues #15

Open
theosanderson opened this issue May 19, 2022 · 4 comments

@theosanderson

I belatedly saw this message from @BioWilko: https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473/17

Hi, in regards to our post about erroneous mutations in ARTIC V4/4.1 we have now discovered a significantly higher number of affected genomes when ambiguous bases are considered (31,567 in COG-UK dataset). Assuming sequencing centres use the updated versions of the scheme BED file, new sequences should not be affected but I think you should consider adding the following positions to the problematic sites mask:
19209 G/K
19210 G/R
19212 G/R
19214 G/R
19217 A/M

I agree that it makes sense to add these to the mask - I can see some issues on the UShER tree that result from these (not hundreds, but tens) [@AngieHinrichs for info]
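
For anyone who wants to try the effect of masking these five positions before they land in the VCF, here is a minimal sketch. It assumes the sequence is already aligned to the reference without insertions, so that string index + 1 equals the reference coordinate; the function name `mask_sites` and the toy sequence are made up for illustration and are not part of this repository.

```python
# Positions (1-based, reference coordinates) taken from the quoted message above.
AMPLICON_64_SITES = [19209, 19210, 19212, 19214, 19217]


def mask_sites(aligned_seq, sites, mask_char="N"):
    """Return a copy of aligned_seq with the given 1-based positions masked.

    Assumption: aligned_seq is in reference coordinates (no insertions
    relative to the reference), so position p sits at index p - 1.
    """
    bases = list(aligned_seq)
    for pos in sites:
        if 1 <= pos <= len(bases):  # silently skip positions past the end
            bases[pos - 1] = mask_char
    return "".join(bases)


# Toy input only: a run of 'A' long enough to cover the listed positions.
toy = "A" * 19220
masked = mask_sites(toy, AMPLICON_64_SITES)
print(masked[19205:19220])  # window covering positions 19206..19220
```

Masking with N discards the call at that site whether it was G, K or R, which is the point here: the ambiguous calls are exactly what causes trouble.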

@theosanderson changed the title from "New masks to consider" to "New masks to consider due to amplicon 64 issues" on May 19, 2022
@AngieHinrichs

+1

Thanks @theosanderson for the heads-up. Anecdotally, I've seen a few other sets of adjacent (or at least close) mutations that cause trouble in the Omicron branches of the tree, although I haven't got a nice analysis with evidence like @BioWilko's to explain them! I can provide lists of sequences in case anyone would like to take a look.

  • T27039A, C27040A (usually with A27038T)
  • T26491C, A26492T, T26497C
  • C21304A, G21305A (often but not always with C21302T)
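
If someone wants to scan a sequence's mutation list for groups like the ones above, a rough sketch follows. It parses labels of the form `T27039A` and reports runs of positions that lie close together. The 10 nt window (`max_gap`) is an arbitrary illustrative choice, not a threshold anyone in this thread proposed, and the function names are hypothetical.

```python
import re

# Matches substitution labels such as "T27039A": ref base, position, alt base.
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_mutation(label):
    """Split a label like 'T27039A' into (ref, position, alt)."""
    m = MUTATION_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognised mutation label: {label}")
    ref, pos, alt = m.groups()
    return ref, int(pos), alt


def cluster_positions(labels, max_gap=10):
    """Group mutation positions whose neighbours are at most max_gap apart.

    Only groups with two or more positions are returned.
    """
    positions = sorted(parse_mutation(label)[1] for label in labels)
    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return [c for c in clusters if len(c) > 1]


# The mutations listed above, pooled as if they came from a single sequence.
observed = [
    "T27039A", "C27040A", "A27038T",
    "T26491C", "A26492T", "T26497C",
    "C21304A", "G21305A", "C21302T",
]
clusters = cluster_positions(observed)
print(clusters)
```

This only flags proximity; it says nothing about whether a cluster is an artefact or real, which still needs the kind of evidence mentioned above.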

@AngieHinrichs

Hi @LiXingguangBrandonStark -- I haven't used mask_alignment_using_vcf.py nor did I write it (from github history it looks like @conorwalker is the main author), but if you cd to the ProblematicSites_SARS-CoV2/src/ directory and then run

python3 mask_alignment_using_vcf.py

it outputs brief usage instructions:

usage: mask_alignment_using_vcf.py [-h] [-m] [-c] [-b] [-d]
                                   [-n MASK_CHARACTER] [-r REFERENCE_ID] -v
                                   VCF -i INPUT_FASTA -o OUTPUT_FASTA
mask_alignment_using_vcf.py: error: the following arguments are required: -v/--vcf, -i/--input_fasta, -o/--output_fasta

(I use different tools to mask VCF instead of fasta, using the file problematic_sites_sarsCov2.vcf.)
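
Going only by the usage message above, a call with the three required arguments could be assembled as in this sketch. The FASTA file names are placeholders, and the relative VCF path assumes the command is run from src/ with the VCF one directory up; I have not run the script, and the optional flags are left out because their meaning is not shown in the usage output.

```python
import subprocess


def build_mask_command(vcf, input_fasta, output_fasta,
                       script="mask_alignment_using_vcf.py"):
    """Build the argument list using only the required flags (-v, -i, -o)."""
    return ["python3", script, "-v", vcf, "-i", input_fasta, "-o", output_fasta]


# Placeholder file names; replace with your own alignment and output path.
cmd = build_mask_command(
    "../problematic_sites_sarsCov2.vcf",
    "alignment.fasta",
    "alignment.masked.fasta",
)
print(" ".join(cmd))

# To actually run it (from the src/ directory), uncomment:
# subprocess.run(cmd, check=True)
```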

@W-L
Owner

W-L commented Oct 27, 2022

Hi @LiXingguangBrandonStark!
Did you clone this repository? (git clone https://github.com/W-L/ProblematicSites_SARS-CoV2.git) You can then find the vcf for masking sites at ./ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf and the script to mask alignments in FASTA format at ./ProblematicSites_SARS-CoV2/src/mask_alignment_using_vcf.py, with usage instructions as posted by @AngieHinrichs (Thank you!) If you encounter issues using the files, please feel free to open a new issue.

@W-L
Owner

W-L commented Oct 27, 2022

The VCF has a FILTER column that gives a recommendation for each site: either mask it before downstream analyses, or interpret results with caution because the site may be misleading. You can find more info about this in the original post on virological.org. The files in subset_vcf split the sites from the main VCF into these two categories.
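
As a rough illustration of reading that column, the sketch below groups site positions by their FILTER value. It relies only on the standard VCF column order (FILTER is the seventh tab-separated field). The example records are invented, and the labels `mask` and `caution` are assumed here for illustration, so check the actual file for the exact values.

```python
from collections import defaultdict


def sites_by_filter(vcf_lines):
    """Map each FILTER value to the list of POS values that carry it."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header row
        fields = line.rstrip("\n").split("\t")
        pos, filt = int(fields[1]), fields[6]  # POS is column 2, FILTER column 7
        groups[filt].append(pos)
    return dict(groups)


# Invented records for illustration only; not taken from the real VCF.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t100\t.\tA\t.\t.\tmask\t.",
    "MN908947.3\t200\t.\tC\t.\t.\tcaution\t.",
    "MN908947.3\t300\t.\tG\t.\t.\tmask\t.",
]
groups = sites_by_filter(example)
print(groups)
```

To use it on the real file, pass an open file handle, for example `sites_by_filter(open("problematic_sites_sarsCov2.vcf"))`.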
