Commit 701bedf

Add instructions to clean up FASTA record names
IQ-TREE replaces apostrophes with underscores, causing a mismatch between strain names from input sequences/metadata and the output tree and breaking the downstream workflow. This commit adds instructions for users to replace the apostrophes themselves prior to analysis, to avoid this issue.
1 parent 292de92 commit 701bedf

1 file changed, with 13 additions and 1 deletion.

Diff for: README.md

@@ -198,14 +198,26 @@ When you have downloaded sequences for all batches, concatenate them together in
 cat gisaid_epiflu_sequences/gisaid_epiflu_sequence_*.fasta > gisaid_downloads.fasta
 ```
 
+Some strain names contain characters that IQ-TREE does not allow and which it will convert to underscores in its output trees.
+For example, the apostrophe in the name "Cote d'Ivoire" will be replaced with an underscore.
+To avoid mismatches between strain names caused by this IQ-TREE replacement, we replace those characters in the initial FASTA file at the beginning of the analysis using [seqkit's replace command](https://bioinf.shenwei.me/seqkit/usage/#replace).
+
+```bash
+# Install seqkit. Optionally, use "mamba install" instead of "conda install".
+conda install -c conda-forge -c bioconda seqkit
+
+# Replace apostrophes with underscores in the FASTA record names.
+seqkit replace -p "(')" -r "_" gisaid_downloads.fasta > gisaid_downloads.renamed.fasta
+```
+
 Use augur to parse out the metadata and sequences into separate files.
 Store these files in a directory with the same name as the natural samples in this analysis.
 
 ```bash
 # Write out sequences and metadata for the validation sample.
 mkdir -p data/natural/natural_sample_1_with_90_vpm
 augur parse \
---sequences gisaid_downloads.fasta \
+--sequences gisaid_downloads.renamed.fasta \
 --output-sequences data/natural/natural_sample_1_with_90_vpm/filtered_sequences.fasta \
 --output-metadata data/natural/natural_sample_1_with_90_vpm/strains_metadata.tsv \
 --fields strain accession collection_date passage_category submitting_lab
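
The renaming step added in this change can be sanity-checked on a small file before running it on the full download. The following is a minimal sketch and not part of the commit: it uses `sed` and `grep` instead of seqkit, and the temporary directory, `example.fasta`, and the record name are hypothetical stand-ins for `gisaid_downloads.fasta` and its contents.

```bash
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)

# Hypothetical two-line FASTA with an apostrophe in the record name.
printf ">A/Cote d'Ivoire/1/2020\nACGT\n" > "$workdir/example.fasta"

# Replace apostrophes with underscores on header lines only (lines starting with ">").
sed "/^>/ s/'/_/g" "$workdir/example.fasta" > "$workdir/example.renamed.fasta"

# Count header lines that still contain an apostrophe.
# grep exits non-zero when the count is 0, so "|| true" keeps the script going.
remaining=$(grep -c "^>.*'" "$workdir/example.renamed.fasta" || true)
echo "Headers still containing an apostrophe: $remaining"
```

The same `grep -c` check can be pointed at `gisaid_downloads.renamed.fasta` after the seqkit command has run; a count of 0 suggests the record names will no longer be altered by the apostrophe replacement described in the commit message.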
