# Data processing, sampling and KwARG

This notebook provides the commands used to process the data, obtain the samples and run KwARG, illustrated on the South Africa (November) dataset. The commands can be run from the command line on most Unix-based systems; to run a command outside the notebook, delete the '!' prefixing the line.

External tools used:

- [MAFFT](https://mafft.cbrc.jp/alignment/software/)
- [SeqKit](https://bioinf.shenwei.me/seqkit/)

The quality control criteria as detailed in Section S1.2 were applied by running the sequences through the Nextclade tool available [here](https://clades.nextstrain.org/).

The list of problematic sites to mask, compiled by De Maio et al. (2020), was downloaded as `problematic_sites_sarsCov2.vcf` from [here](https://github.com/W-L/ProblematicSites_SARS-CoV2) (version of 4 March 2021).

# Data processing

### Source

The data was downloaded from [GISAID](https://www.gisaid.org/) on 28 December 2020, filtering for sequences:

- collected in November 2020 in South Africa;
- marked as complete (>29,900 nucleotides) and excluding low coverage sequences (>5% ambiguous nucleotides);

giving 326 sequences in total. The downloaded files are named:

- `SA_data.fasta` (the sequences in fasta format)
- `SA_sequencing_info.tsv` (sequencing metadata)

The sequencing metadata was also downloaded separately for only those sequences labelled as belonging to strain 501Y.V2, in the file

- `SA_newstrain_sequencing_info.tsv`

The data is not shared here, as per GISAID's terms of use, but the GISAID accession IDs are provided in the file `SA_id.txt`, which can be used to recreate the dataset used in our analysis.

In [1]:
!grep ">" SA_data.fasta | awk -F"|" '{ print $2 }' > SA_id.txt;
!head SA_id.txt

EPI_ISL_660159
EPI_ISL_660160
EPI_ISL_660161
EPI_ISL_660162
EPI_ISL_660163
EPI_ISL_660164
EPI_ISL_660221
EPI_ISL_660222
EPI_ISL_660225
EPI_ISL_660228
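
To illustrate what the `grep`/`awk` pair above does, here is a minimal sketch on a toy FASTA file. The headers and accession IDs are made up; the only assumption is that headers follow the `name|accession|date` layout implied by the command.

```sh
# Toy FASTA with GISAID-style headers (made-up names and accession IDs).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>hCoV-19/South Africa/toy-1/2020|EPI_ISL_000001|2020-11-03
ACGTACGT
>hCoV-19/South Africa/toy-2/2020|EPI_ISL_000002|2020-11-17
ACGTTCGT
EOF

# Keep header lines only, split them on "|", and print the second field.
ids=$(grep ">" "$tmpdir/toy.fasta" | awk -F"|" '{ print $2 }')
printf '%s\n' "$ids"

rm -rf "$tmpdir"
```

A header without any `|` (such as a bare reference header) would yield an empty line, since it has no second field.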


### Alignment

The sequences in each dataset were aligned using MAFFT to the reference sequence in `ref_seq.fasta` (GISAID accession EPI_ISL_402125, GenBank ID MN908947.3):
```sh
mafft --auto --thread -1 --keeplength --quiet --mapout --preservecase --addfragments SA_data.fasta ref_seq.fasta > SA_alignment.fasta
```
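
Because `--keeplength` keeps the alignment at the length of the reference, every record in `SA_alignment.fasta` should have the same total length. The following is a hedged sketch of a sanity check for this, not part of the original pipeline; the toy file and the expected length of 8 are illustrative only.

```sh
# Toy alignment: "seq2" is deliberately one character short.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy_alignment.fasta" <<'EOF'
>ref
ACGTAC
GT
>seq1
ACGT-CGT
>seq2
ACGTACG
EOF

# Sum the sequence-line lengths per record and report any record
# whose total differs from the expected alignment length.
bad=$(awk -v want=8 '
  function report() { if (name != "" && len != want) print name, len }
  /^>/ { report(); name = substr($0, 2); len = 0; next }
  { len += length($0) }
  END { report() }
' "$tmpdir/toy_alignment.fasta")
printf '%s\n' "$bad"

rm -rf "$tmpdir"
```

For the real alignment, `want` would be set to the reference length (29,903 according to the `seqkit stats` output in this notebook).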

All symbols other than A, C, G and T (in upper or lower case) are replaced with 'N' on the sequence lines; header lines are left unchanged:

In [2]:
!sed '/^>/! s/[^actgACTG]/N/g' SA_alignment.fasta > SA_alignment_cleaned.fasta
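
A small sketch of the effect of this `sed` expression on a toy record (made-up header and sequence). Note that gap characters (`-`) and IUPAC ambiguity codes are both turned into `N`, while lower-case bases keep their case.

```sh
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>toy-1|EPI_ISL_000001
ACGTRYacgt--NKAC
EOF

# The address "/^>/!" restricts the substitution to non-header lines.
cleaned=$(sed '/^>/! s/[^actgACTG]/N/g' "$tmpdir/toy.fasta")
printf '%s\n' "$cleaned"
# >toy-1|EPI_ISL_000001
# ACGTNNacgtNNNNAC

rm -rf "$tmpdir"
```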

Sequences labelled as having long stretches of ambiguous nucleotides were removed:

In [3]:
!grep "Stretches" SA_sequencing_info.tsv | awk '{print $1 " " $2 "|" $3 "|" $4}' > Ns_id.txt;
!sed -i.bak "s/\ /_/g" Ns_id.txt;
!sed -i.bak "s/\ /_/g" SA_alignment_cleaned.fasta;
!seqkit grep -n -v -f Ns_id.txt SA_alignment_cleaned.fasta > SA_alignment_good.fasta
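
`seqkit grep -n -v -f` drops every record whose full header matches a line of the pattern file, which is presumably why spaces are replaced by underscores in both files first, so that the names agree exactly. The sketch below imitates that filtering step with plain `awk` on toy data; it is an illustration of the idea, not SeqKit's implementation, and all names are made up.

```sh
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>toy_1|EPI_ISL_000001|2020-11-03
ACGT
>toy_2|EPI_ISL_000002|2020-11-17
ACGA
>toy_3|EPI_ISL_000003|2020-11-20
ACGC
EOF
printf '%s\n' 'toy_2|EPI_ISL_000002|2020-11-17' > "$tmpdir/drop.txt"

# First file: remember the names to drop. Second file: at each header,
# decide whether to keep the record, then print lines while "keep" is set.
kept=$(awk '
  NR == FNR { drop[$0] = 1; next }
  /^>/ { keep = !(substr($0, 2) in drop) }
  keep
' "$tmpdir/drop.txt" "$tmpdir/toy.fasta")
printf '%s\n' "$kept"

rm -rf "$tmpdir"
```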

To check the size of the resulting dataset:

In [4]:
!seqkit stats SA_alignment_good.fasta

file                     format  type  num_seqs    sum_len  min_len  avg_len  max_len
SA_alignment_good.fasta  FASTA   DNA        279  8,342,937   29,903   29,903   29,903


### Splitting by strain

The sequencing metadata file `SA_newstrain_sequencing_info.tsv` is used to extract the sequences which are labelled as belonging to strain 501Y.V2.

In [5]:
!awk '{print $1 " " $2 "|" $3 "|" $4}' SA_newstrain_sequencing_info.tsv | grep "hCoV" > SA_newstrain_id.txt;
!sed -i.bak "s/\ /_/g" SA_newstrain_id.txt

In [6]:
!seqkit grep -n -f SA_newstrain_id.txt SA_alignment_good.fasta > SA_newstrain.fasta;
!seqkit grep -n -v -f SA_newstrain_id.txt SA_alignment_good.fasta > SA_oldstrain.fasta

Exact duplicate sequences are then removed from each file (none were found):

In [7]:
!seqkit rmdup -DSA_olddeleted.txt -s SA_oldstrain.fasta > SA_oldstrain_nondup.fasta;
!seqkit rmdup -DSA_newdeleted.txt -s SA_newstrain.fasta > SA_newstrain_nondup.fasta;
!rm SA_newstrain.fasta;
!rm SA_oldstrain.fasta

[INFO][0m 0 duplicated records removed
[INFO][0m 0 duplicated records removed


In [8]:
!seqkit stats SA_oldstrain_nondup.fasta;
!seqkit stats SA_newstrain_nondup.fasta

file                       format  type  num_seqs    sum_len  min_len  avg_len  max_len
SA_oldstrain_nondup.fasta  FASTA   DNA        102  3,050,106   29,903   29,903   29,903
file                       format  type  num_seqs    sum_len  min_len  avg_len  max_len
SA_newstrain_nondup.fasta  FASTA   DNA        177  5,292,831   29,903   29,903   29,903
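
As a quick consistency check on the reported figures, the two strain files should together account for the 279 sequences and 8,342,937 nucleotides of `SA_alignment_good.fasta`. The arithmetic below uses only the numbers printed by `seqkit stats` in this notebook.

```sh
# Counts and alignment length as reported by seqkit stats.
old_n=102
new_n=177
aln_len=29903

total_n=$((old_n + new_n))
old_len=$((old_n * aln_len))
new_len=$((new_n * aln_len))
total_len=$((total_n * aln_len))

echo "old: $old_len, new: $new_len"
echo "total: $total_n sequences, $total_len nucleotides"
# old: 3050106, new: 5292831
# total: 279 sequences, 8342937 nucleotides
```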


### Sample generation

The reference sequence was deleted from `SA_oldstrain_nondup.fasta` prior to running the following:

In [9]:
!cat ref_seq.fasta > SA_sample.fasta;
!seqkit sample -s 43984291 -p 0.6 SA_oldstrain_nondup.fasta | seqkit shuffle -s 92834717 | seqkit head -n 25 >> SA_sample.fasta;
!seqkit sample -s 23849817 -p 0.6 SA_newstrain_nondup.fasta | seqkit shuffle -s 34876261 | seqkit head -n 25 >> SA_sample.fasta;

[INFO][0m sample by proportion
[INFO][0m read sequences ...
[INFO][0m 62 sequences outputted
[INFO][0m 62 sequences loaded
[INFO][0m shuffle ...
[INFO][0m output ...
[INFO][0m read sequences ...
[INFO][0m sample by proportion
[INFO][0m 113 sequences outputted
[INFO][0m 113 sequences loaded
[INFO][0m shuffle ...
[INFO][0m output ...


The accession numbers of the sampled sequences are given in `SA_sample_id.txt`:

In [10]:
!grep ">" SA_sample.fasta | awk -F"|" '{ print $2 }' > SA_sample_id.txt;
!cat SA_sample_id.txt


EPI_ISL_660225
EPI_ISL_660257
EPI_ISL_736993
EPI_ISL_660643
EPI_ISL_660229
EPI_ISL_736985
EPI_ISL_736926
EPI_ISL_696462
EPI_ISL_660655
EPI_ISL_660625
EPI_ISL_660231
EPI_ISL_678608
EPI_ISL_660163
EPI_ISL_660232
EPI_ISL_700488
EPI_ISL_660652
EPI_ISL_660622
EPI_ISL_660651
EPI_ISL_678612
EPI_ISL_696509
EPI_ISL_678595
EPI_ISL_660222
EPI_ISL_696468
EPI_ISL_660230
EPI_ISL_660626
EPI_ISL_736958
EPI_ISL_696481
EPI_ISL_660637
EPI_ISL_678632
EPI_ISL_736932
EPI_ISL_678641
EPI_ISL_700422
EPI_ISL_696503
EPI_ISL_700470
EPI_ISL_736983
EPI_ISL_736936
EPI_ISL_700487
EPI_ISL_736935
EPI_ISL_700443
EPI_ISL_736939
EPI_ISL_700554
EPI_ISL_696505
EPI_ISL_696518
EPI_ISL_700589
EPI_ISL_736959
EPI_ISL_696453
EPI_ISL_696521
EPI_ISL_736964
EPI_ISL_736928
EPI_ISL_678629


These are the accession numbers given in Table S1.

### Masking problematic sites

This is done using the provided script `SA_find_multiallelic.R` (for other datasets, the list of sites to mask was amended as detailed in the manuscript).

In [12]:
!awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' SA_sample.fasta > SA_sample_temp.fasta;
!sed 's/./& /g' < SA_sample_temp.fasta > SA_sample_test.fasta;
!rm SA_sample_temp.fasta;
!Rscript SA_find_multiallelic.R SA_sample_test.fasta SA_sample_masked.fasta SA_sample_positions.txt

[1] SNP sites: 206
[1] Plus masked: 29
[1] Plus multi-allelic: 0
[1] SNP sites:
  [1]   117   168   174   203   210   241   355   362   376   550   598  1042
 [13]  1072  1172  1205  1248  1263  1269  1337  1427  1593  1968  2692  2780
 [25]  2781  2782  2937  3117  3182  3340  3472  3505  3904  3923  4078  4093
 [37]  4510  4615  4668  5230  5425  5495  5503  5794  5857  5950  6525  6618
 [49]  6624  6651  6701  6726  6762  7064  7113  7279  7390  7420  7425  7844
 [61]  8068  8655  8660  8964  9073  9430 10138 10156 10279 10540 10623 10681
 [73] 10912 11230 11401 11447 11534 11629 11653 11854 11875 11886 11896 12071
 [85] 12085 12253 12503 12769 13122 13812 14583 14763 14928 14937 15003 15222
 [97] 15952 15970 16490 16804 17193 17334 17533 17679 17876 17898 17999 18028
[109] 18085 18175 18395 18495 18555 18910 19062 19283 19542 19602 19656 20233
[121] 20387 20718 21024 21099 21614 21762 21801 21979 21997 22022 22205 22206
[133] 22214 22813 22992 23012 23031 23063 23407
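
The first two commands of the cell above reshape the FASTA file before it is passed to the R script: the `awk` step joins wrapped sequence lines so that each record occupies exactly one line, and the `sed` step inserts a space after every character (headers included). A sketch on a toy file with made-up records:

```sh
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>s1
ACG
TAC
>s2
GGT
ACA
EOF

# Join wrapped sequence lines: one header line, one sequence line per record.
flat=$(awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' "$tmpdir/toy.fasta")
printf '%s\n' "$flat"
# >s1
# ACGTAC
# >s2
# GGTACA

# Put a space after every character so each base becomes its own field.
spaced=$(printf '%s\n' "$flat" | sed 's/./& /g')
printf '%s\n' "$spaced"

rm -rf "$tmpdir"
```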

In [13]:
!cat SA_sample_positions.txt

117 168 174 203 210 241 355 362 376 550 598 1042 1072 1172 1205 1248 1263 1269 1337 1427 1593 1968 2692 2780 2781 2782 2937 3117 3182 3340 3472 3505 3904 3923 4078 4093 4510 4615 4668 5230 5425 5495 5503 5794 5857 5950 6525 6618 6624 6651 6701 6726 6762 7064 7113 7279 7390 7420 7425 7844 8068 8655 8660 8964 9073 9430 10138 10156 10279 10540 10623 10681 10912 11230 11401 11447 11534 11629 11653 11854 11875 11886 11896 12071 12085 12253 12503 12769 13122 13812 14583 14763 14928 14937 15003 15222 15952 15970 16490 16804 17193 17334 17533 17679 17876 17898 17999 18028 18085 18175 18395 18495 18555 18910 19062 19283 19542 19602 19656 20233 20387 20718 21024 21099 21614 21762 21801 21979 21997 22022 22205 22206 22214 22813 22992 23012 23031 23063 23407 23534 23664 23710 23836 23854 24023 24062 24133 24337 24398 24781 25139 25145 25171 25186 25241 25273 25303 25427 25511 25521 25561 25566 25613 25630 25635 25672 25705 25770 25814 25855 25904 25970 25977 26174 26262 26456 26501 26563 26586 266
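
The masking itself is performed by the R script, which is not reproduced here. Purely as an illustration of the idea, the sketch below reads positions from a made-up VCF fragment and overwrites them with `N` in a toy sequence. Two assumptions are worth stating plainly: that the position is in the second tab-separated column with the filter label (`mask` or `caution`) in the seventh, and that masking means replacing the base with `N`. The actual script may differ on both counts.

```sh
tmpdir=$(mktemp -d)

# Made-up VCF fragment (columns: CHROM POS ID REF ALT QUAL FILTER INFO).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'toy_ref\t2\t.\tC\t.\t.\tmask\t.\n'
  printf 'toy_ref\t4\t.\tT\t.\t.\tcaution\t.\n'
  printf 'toy_ref\t7\t.\tG\t.\t.\tmask\t.\n'
} > "$tmpdir/toy_sites.vcf"

# Collect the positions flagged "mask" into one space-separated string.
sites=$(awk -F'\t' '!/^#/ && $7 == "mask" { print $2 }' "$tmpdir/toy_sites.vcf" | tr '\n' ' ')

# Overwrite those 1-based positions with N in a one-line toy sequence.
masked=$(printf 'ACGTACGT\n' | awk -v sites="$sites" '
  BEGIN { n = split(sites, pos, " ") }
  {
    for (i = 1; i <= n; i++)
      $0 = substr($0, 1, pos[i] - 1) "N" substr($0, pos[i] + 1)
    print
  }
')
printf '%s\n' "$masked"
# ANGTACNT

rm -rf "$tmpdir"
```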

### Labelling

The sequences are given sequential labels (to make the ARGs easier to view):

In [14]:
!seqkit head -n 1 SA_sample_masked.fasta > SA_sample_masked_id.fasta;
!seqkit range -r 2:26 SA_sample_masked.fasta > f1.f;
!seqkit range -r 27:51 SA_sample_masked.fasta > f2.f;
!awk '/^>/{print ">O" ++i; next}{print}' < f1.f >> SA_sample_masked_id.fasta;
!awk '/^>/{print ">N" ++i; next}{print}' < f2.f >> SA_sample_masked_id.fasta;
!rm f1.f;
!rm f2.f
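
The relabelling relies on a counter that `awk` increments at each header line; since the two files are processed by separate `awk` calls, numbering restarts at 1 for each prefix. A sketch on toy files (made-up names), which also writes out a label-to-name table. That table is an addition for illustration and is not produced by the commands above.

```sh
tmpdir=$(mktemp -d)
cat > "$tmpdir/f1.f" <<'EOF'
>toy_old_a|EPI_ISL_000001
ACGT
>toy_old_b|EPI_ISL_000002
ACGA
EOF
cat > "$tmpdir/f2.f" <<'EOF'
>toy_new_a|EPI_ISL_000003
ACGC
EOF

# Replace each header by a prefix plus a running index; copy other lines.
{
  awk '/^>/{print ">O" ++i; next}{print}' "$tmpdir/f1.f"
  awk '/^>/{print ">N" ++i; next}{print}' "$tmpdir/f2.f"
} > "$tmpdir/relabelled.fasta"
labels=$(grep ">" "$tmpdir/relabelled.fasta")
printf '%s\n' "$labels"
# >O1
# >O2
# >N1

# Optional: keep a table mapping each new label to the original name.
map=$(awk '/^>/ { print "O" (++i) "\t" substr($0, 2) }' "$tmpdir/f1.f")
printf '%s\n' "$map"

rm -rf "$tmpdir"
```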

# KwARG

Detailed instructions for using KwARG and obtaining the desired outputs can be found [here](https://github.com/a-ignatieva/kwarg). KwARG is run on the masked sample using the following command:

```sh
kwarg -T50,30 -Q500 -S-1,1.9,1.8,1.7,1.6,1.5,1.4,1.3,1.2,1.1,1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.01,1 -M-1,1.91,1.81,1.71,1.61,1.51,1.41,1.31,1.21,1.11,1.01,0.91,0.81,0.71,0.61,0.51,0.41,0.31,0.21,0.11,0.02,1.1 -R1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-1 -C2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,-1 -k -n -f < SA_sample_masked_id.fasta > SA_kwarg_out.txt
```
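
With comma-separated lists this long, it is easy to drop or duplicate an entry when editing the command. The helper below (`count_entries` is a hypothetical name, not part of KwARG) counts the entries in a list so that the lengths can be compared before a long run. It is shown here on the `-S` and `-M` lists copied from the command above, and can be applied to the `-R` and `-C` lists in the same way.

```sh
# Cost lists copied from the kwarg command above (option letters removed).
S_costs='-1,1.9,1.8,1.7,1.6,1.5,1.4,1.3,1.2,1.1,1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.01,1'
M_costs='-1,1.91,1.81,1.71,1.61,1.51,1.41,1.31,1.21,1.11,1.01,0.91,0.81,0.71,0.61,0.51,0.41,0.31,0.21,0.11,0.02,1.1'

# Hypothetical helper: number of comma-separated entries in its argument.
count_entries() {
  printf '%s\n' "$1" | awk -F, '{ print NF }'
}

n_S=$(count_entries "$S_costs")
n_M=$(count_entries "$M_costs")
echo "-S: $n_S entries, -M: $n_M entries"
```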
