# Exploring Early SARS-CoV-2 Mutations
Read chapter 23 in my book.

In section 23.3.3 we deviate from the book because of Jupyter. Instead of a local Jmol installation we use a web-based version.

In section 23.3.4 we are using a different tool, SamTools instead of IGV.

<figure style="width:50%; display: block; margin-left: auto; margin-right: auto;">
  <img src="ace2-spike.png" alt="ace2-spike.png" >
  <figcaption>Mutations in the receptor binding motif have the highest impact on pathogenicity of SARS-CoV-2</figcaption> 
</figure>

---

We are working in the directory *SARS-CoV-2*.

## Download Programs

We need two AWK scripts:

In [None]:
wget 'https://github.com/awkologist/CompBiol3/raw/main/23_SARS-CoV-2/fasta2tbl'

Make the file executable ...

In [None]:
chmod u+x ./fasta2tbl

In [None]:
wget 'https://github.com/awkologist/CompBiol3/raw/main/23_SARS-CoV-2/compare-cov2.awk'

## Download Virus Sequences

Download reference genome:

In [None]:
efetch -db nuccore -id NC_045512 -format fasta > wuhan-1.fasta

We create a copy in tab-delimited format:

In [None]:
./fasta2tbl wuhan-1.fasta > wuhan-1.tab
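The code of *fasta2tbl* is not reproduced in this notebook. As a rough sketch of what such a conversion does, the following assumes that the output has one line per record, with the header and the joined sequence separated by a tab. This is an assumption about the script, not its actual code, and *demo.fasta* is a made-up toy file:

```shell
# Made-up toy FASTA with wrapped sequence lines (not real virus data)
printf '>seq1 demo\nACGT\nTTGA\n>seq2 demo\nGGCC\n' > demo.fasta

# Sketch only: join the sequence lines of each record and print
# "header<TAB>sequence"; the real fasta2tbl may differ in detail.
awk '/^>/ { if (hdr != "") print hdr "\t" seq   # flush the previous record
            hdr = substr($0, 2); seq = ""; next }
          { seq = seq $0 }                        # append a wrapped line
     END  { if (hdr != "") print hdr "\t" seq }' demo.fasta > demo.tab

cat demo.tab
```

With one record per line, tools such as `cut`, `grep`, and `awk` can address whole genomes as single fields.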

Download from [NCBI](https://www.ncbi.nlm.nih.gov/sars-cov-2/) viruses from Europe, from human hosts, without ambiguous characters, complete nucleotide sequences, and a sequence length of exactly 29,903 nt. They are downloaded via the web browser as *sequences.fasta*. Move them into your current working directory.

Detailed instructions are in chapter "**23.2.3 Data and GitHub Repository**".

Move the file (ca. 100 MB) from your local computer to JupyterHub. This takes a while; check in the JupyterHub file browser that the complete file has been uploaded.

Here we print the header for the reference genome:

In [None]:
cut -f 1 wuhan-1.tab
awk -F"\t" '{print $1}' ./wuhan-1.tab

Rename the file to *cov2-len-29903.fasta*

In [None]:
mv sequences.fasta cov2-len-29903.fasta

In [None]:
grep -c ">" cov2-len-29903.fasta

## Convert FASTA to TAB and Edit Header

In [None]:
head -1 cov2-len-29903.fasta

In [None]:
./fasta2tbl cov2-len-29903.fasta | sed 's/|/\t/g' > cov2-len-29903.tab

In [None]:
wc -l cov2-len-29903.tab

In [None]:
cut -f 1 cov2-len-29903.tab | head -2

In [None]:
cut -f 1,3 cov2-len-29903.tab | head -2

In [None]:
cut -f 1-3 cov2-len-29903.tab | head -2
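To see what the header editing does, here is a sketch on a single made-up record. The header layout (accession, location, and date separated by `|`) is an assumption for illustration; the output of `head -1` above shows the real layout. `tr` is used as a portable equivalent of the `sed 's/|/\t/g'` step:

```shell
# Hypothetical record: header fields separated by "|", then a tab and the sequence
printf 'XX000001.1 |Germany: Example City|2020-03-01\tACGT\n' > toy-header.tab

# Replace every "|" by a tab so that cut can address the header fields
tr '|' '\t' < toy-header.tab > toy-split.tab

cut -f 1-3 toy-split.tab
```

After this step the toy record has four tab-separated fields, the last one being the sequence.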

## Analyze Data

In [None]:
cut -f 2 cov2-len-29903.tab | sed 's/:.*//' | sort | uniq -c 
# cut -f 2 cov2-len-29903.tab | sort | uniq -c 
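The counting idiom used here can be tried on a few made-up location entries. The values below are invented and only mimic a "Country: Region" layout:

```shell
# Made-up location column: "Country: Region" or just "Country"
printf 'Germany: A\nItaly\nGermany: B\nSpain: C\n' > toy-geo.txt

# Strip everything from the first colon onward, then count identical countries
sed 's/:.*//' toy-geo.txt | sort | uniq -c > toy-counts.txt

cat toy-counts.txt
```

`uniq -c` only merges adjacent identical lines, which is why the `sort` step comes first.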

In [None]:
egrep "Germany" cov2-len-29903.tab | cut -f 1-3

In [None]:
awk -f compare-cov2.awk -v ref=wuhan-1.tab -v seq=cov2-len-29903.tab -v id2=MT358638
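The code of *compare-cov2.awk* is not shown in this notebook. The core idea of comparing a sequence against the reference position by position can be sketched as follows. The file names and toy sequences are made up, and the real script reports more than this sketch does:

```shell
# Toy reference and toy variant in "name<TAB>sequence" format (made-up data)
printf 'ref\tACGTACGT\n' > toy-ref.tab
printf 'var\tACGAACGC\n' > toy-seq.tab

# Read the reference first (NR == FNR), then report every position
# at which a sequence differs from it
awk -F"\t" 'NR == FNR { ref = $2; next }
  { n = length(ref)
    for (i = 1; i <= n; i++) {
      r = substr(ref, i, 1); s = substr($2, i, 1)
      if (r != s) print $1, i, r ">" s
    } }' toy-ref.tab toy-seq.tab > toy-diff.txt

cat toy-diff.txt
```

This direct comparison only works because all sequences have the same length, which is why the download was restricted to exactly 29,903 nt.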

In [None]:
for i in MT358638 MT358639 MT358640 MT358641 MT358642 MT358643; do awk -f compare-cov2.awk -v ref=wuhan-1.tab -v seq=cov2-len-29903.tab -v id2=$i; done

In [None]:
awk -f compare-cov2.awk -v ref=wuhan-1.tab -v seq=cov2-len-29903.tab -v id2=OK075090

In [None]:
# with head -50, only 50 sequences are processed instead of all (ca. 3,000)
for i in $(cut -f 1 cov2-len-29903.tab | head -50 | sed 's/\..*//'); do awk -f compare-cov2.awk -v ref=wuhan-1.tab -v seq=cov2-len-29903.tab -v id2=$i; done > result.txt

In [None]:
egrep -c ">" result.txt

In [None]:
egrep -c "SPIKE" result.txt

In [None]:
egrep -c "Motif" result.txt

In [None]:
egrep "Motif" result.txt | head -5

In [None]:
egrep "Motif" result.txt | cut -d ' ' -f 7 | sort | uniq -c
echo
egrep "Motif" result.txt | cut -d ' ' -f 7 

Go to an online version of [Jmol](https://lampz.tugraz.at/~hadley/ss1/molecules/moleculeviewer/viewer.php) in your browser of choice. There you can open the Jmol terminal as shown in this video:

<figure style="width:50%; display: block; margin-left: auto; margin-right: auto;">
  <img src="jmol_im_www.gif" alt="jmol_im_www.gif" >
  <figcaption>Using the online version of Jmol.</figcaption>
</figure>

Copy/paste the following code into the Jmol terminal:

```
load =7DF4
spacefill off; wireframe off
cartoon
select :A; color lightgray # ACE2
select :C; color gray # Spike 2
select :D; color darkgray # Spike 3
select :B; color lightblue # Spike 1
select 319-541:B; color blue # RBDomain
select 437-508:B; color red # RBMotif
```

Open the PDB structure 7DF4 with the above script directly in [jmol.php](https://chemapps.stolaf.edu/jmol/jmol.php?pdbid=7df4&script=wireframe&nbsp;off;spacefill&nbsp;off;cartoon;select&nbsp;:A;color&nbsp;lightgray;select&nbsp;:C;color&nbsp;gray;select&nbsp;:D;color&nbsp;darkgray;select&nbsp;:B;color&nbsp;lightblue;select&nbsp;319-541:B;color&nbsp;blue;select&nbsp;437-508:B;color&nbsp;red)

Note to myself: Spaces are replaced by ```&nbsp;``` // No space after semicolon // No comments

In [None]:
egrep "Motif" result.txt | awk '{print $6}' | sort | uniq -c | sed 's/:/ /' | awk '{if($1<10){print "select "$3":B; spacefill 100; color yellow"}else{print "select "$3":B; spacefill 300; color yellow"}}' | tee -a 7DF4.script

You can now add these Jmol commands to the Jmol terminal and execute them.
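The threshold logic of that pipeline can be tried on made-up data. The input below is a hypothetical list of mutated residue numbers, one line per observation; it does not reproduce the real layout of *result.txt*:

```shell
# Hypothetical observations: residue 501 seen 12 times, residue 484 seen twice
awk 'BEGIN { for (i = 0; i < 12; i++) print 501; print 484; print 484 }' > toy-residues.txt

# Count each residue; residues seen fewer than 10 times get small spheres,
# all others large spheres (same threshold as in the cell above)
sort toy-residues.txt | uniq -c |
  awk '{ size = ($1 < 10) ? 100 : 300
         print "select " $2 ":B; spacefill " size "; color yellow" }' > toy.script

cat toy.script
```

Each output line is a complete Jmol command, so the file can be pasted into the Jmol terminal as is.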

## Mapping Genomic Variance

In [None]:
minimap2 -x asm5 -a -o sequences.sam wuhan-1.fasta cov2-len-29903.fasta

Print SAM without sequence:

In [None]:
awk -F"\t" '{ORS=""; for(i=1; i<=NF; i++){if($i!~/[ATCG]{10,}/){print $i" "}}print "\n"}' sequences.sam | head -10

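The 6th field, the CIGAR string, describes how each sequence aligns to the reference. "29903M" means that all 29903 nucleotides align without insertions, deletions, or clipping (see table 23.1 in my book); in the SAM format, `M` covers both identical and substituted bases. Therefore, we print only lines with a different CIGAR string: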

In [None]:
awk -F"\t" '$6!="29903M"{ORS=""; for(i=1; i<=NF; i++){if($i!~/[ATCG]{10,}/){print $i" "}}print "\n"}' sequences.sam | head -10

In [None]:
awk -F"\t" '$6!="29903M"{ORS=""; for(i=1; i<=NF; i++){if($i!~/[ATCG]{10,}/){print $i" "}}print "\n"}' sequences.sam | wc -l
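To read a CIGAR string that is not simply "29903M", it helps to split it into its operations. The following is a minimal sketch in AWK; the second CIGAR string is a made-up example of a sequence with a 9 nt deletion, not taken from the data:

```shell
# One real-looking and one made-up CIGAR string
printf '29903M\n21990M9D7904M\n' > toy-cigar-in.txt

# Peel off "<length><operation>" pairs from the front of the string
awk '{
  cigar = $0; out = ""
  while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
    len = substr(cigar, 1, RLENGTH - 1)   # the number
    op  = substr(cigar, RLENGTH, 1)       # the operation letter
    out = out " " op "=" len
    cigar = substr(cigar, RLENGTH + 1)    # continue with the rest
  }
  print $0 ":" out
}' toy-cigar-in.txt > toy-cigar.txt

cat toy-cigar.txt
```

In the made-up example, 21990 + 7904 aligned bases plus 9 deleted reference bases again add up to 29903 reference positions.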

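Now we write the full lines, including the sequences, to *sequences_mutated.sam*. The SAM header lines pass the filter too, since they have no CIGAR field. The second command previews the sequence names and CIGAR strings:

In [None]:
awk -F"\t" '$6!="29903M"{print $0}' sequences.sam > sequences_mutated.sam
awk -F"\t" '$6!="29903M"{print $1, $6}' sequences.sam | sed '1,2d' | head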

Convert SAM file to binary format and sort 

In [None]:
samtools view -b sequences_mutated.sam | samtools sort - -o sequences_mutated.sorted.bam

Create an index for the *sequences_mutated.sorted.bam* file for visualization with IGV:

In [None]:
samtools index sequences_mutated.sorted.bam

We can now look at the variants with the Samtools viewer ```tview```:

To do so, switch to the **terminal** and run ```samtools tview sequences_mutated.sorted.bam wuhan-1.fasta -p NC_045512.2:22871-23084```

This opens the following screen. The top sequence represents the reference stored in *wuhan-1.fasta*. Each following line represents one virus genome sequence. A dot means that the nucleotide at this position matches the reference. For mutated nucleotides, the nucleotide is shown. You can scroll left and right with the cursor keys and can apply commands as shown in the help. ```Q``` brings you back to the terminal prompt.

<figure style="width:50%; display: block; margin-left: auto; margin-right: auto;">
  <img src="samtools-tview.png" alt="samtools-tview.png" >
  <figcaption>The terminal-based alignment viewer tview of SamTools.</figcaption>
</figure>
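The dot notation described above can be imitated with a few lines of AWK. This is only a toy rendering with made-up sequences, not how `tview` works internally:

```shell
# Made-up reference and three made-up sequences of equal length
ref=ACGTACGT
printf 'ACGTACGT\nACGAACGT\nACGTACGC\n' > toy-reads.txt

# First line: the reference; then one line per sequence with "." for a
# base identical to the reference and the base itself otherwise
echo "$ref" > toy-view.txt
awk -v ref="$ref" '{
  line = ""
  for (i = 1; i <= length(ref); i++) {
    base = substr($0, i, 1)
    line = line ((base == substr(ref, i, 1)) ? "." : base)
  }
  print line
}' toy-reads.txt >> toy-view.txt

cat toy-view.txt
```

Columns that contain a letter in many rows point to positions that are frequently mutated.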

## IGV Approach with Xvfb
Xvfb is a virtual frame buffer for the graphical X11-server. That means you can open graphical windows without displaying them. Instead, snapshots can be stored.

In [None]:
bcftools mpileup --max-depth 4000 -a AD -f wuhan-1.fasta -o seq-4000.bcf sequences_mutated.sorted.bam

Now call the significantly variant positions with the multiallelic caller (`-m`) and assume a ploidy of two. Save only variant sites (`-v`) in a VCF-formatted file (`-O v`).

In [None]:
bcftools call -O v -v -m --ploidy 2 -o seq-4000.vcf seq-4000.bcf

Convert the original file with all variant positions (without statistical model) to VCF:

In [None]:
bcftools view -O v -o seq-4000-all.vcf seq-4000.bcf

Extract all positions that have at least one variant nucleotide.

In [None]:
awk '$5~/[ACGT]/||$0~/^#/{print $0}' seq-4000-all.vcf > seq-4000-snv.vcf
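The effect of this filter can be checked on a tiny, made-up VCF-like file. The sketch assumes that records without an alternative base carry only the placeholder allele `<*>` in the ALT column, as written by `bcftools mpileup`; the positions and bases are invented:

```shell
# Made-up, minimal VCF-like lines (tab-separated, only the first 5 columns)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\n'
  printf 'NC_045512.2\t100\t.\tA\t<*>\n'
  printf 'NC_045512.2\t241\t.\tC\tT,<*>\n'
} > toy-all.vcf

# Same filter as above: keep header lines and records whose ALT column
# contains at least one nucleotide
awk '$5~/[ACGT]/||$0~/^#/{print $0}' toy-all.vcf > toy-snv.vcf

cat toy-snv.vcf
```

The record at position 100 is dropped because its ALT column contains no nucleotide, while both header lines and the record at position 241 are kept.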

Download GenBank file of reference genome for annotation

In [None]:
efetch -db nuccore -id NC_045512 -format genbank > wuhan-1.gb

Edit GenBank file to be compatible with the FASTA version. 

In [None]:
sed -i 's/NC_045512/NC_045512.2/' wuhan-1.gb

Create GFF3 File *spikeprotein.gff3*:
```
##gff-version 3
##track name="Spike Protein" gffTags=on
NC_045512.2     .       CDS     21563   25383   .       +       .       ID=Spike;Name=SpikeProtein;Color=blue
NC_045512.2     .       CDS     22517   23183   .       +       .       ID=Domain;Name=RBDomain;Color=green
NC_045512.2     .       CDS     22871   23084   .       +       .       ID=Motif;Name=RBMotif;Color=red
```
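The genome coordinates in this file correspond to the residue ranges used in the Jmol script (319-541 for the RBD, 437-508 for the RBM). As a small worked example, the following computes the first nucleotide of the codon for a given residue; the helper name `codon_start` is made up for this sketch:

```shell
# Start of the spike CDS on the genome, taken from the GFF3 file above
cds_start=21563

# First nucleotide of the codon for a given (1-based) residue number
codon_start() { echo $(( cds_start + ($1 - 1) * 3 )); }

{
  echo "RBD 319-541: $(codon_start 319)-$(codon_start 541)"
  echo "RBM 437-508: $(codon_start 437)-$(codon_start 508)"
} > toy-coords.txt

cat toy-coords.txt
```

With this convention the end coordinate is the first base of the last codon; the codon itself extends two nucleotides further.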

Call IGV with ```xvfb-run``` and the following IGV script file called *igv.bat*:

```
new
snapshotDirectory snapshots
genome wuhan-1.gb
load spikeprotein.gff3
# load sequences_mutated.sorted.bam
load seq-4000.vcf 
load seq-4000-snv.vcf
snapshot region_full.png
# RBM:
goto NC_045512.2:22871-23084
snapshot region_RBM.png
# RBD:
goto NC_045512.2:22517-23183
snapshot region_RBD.png
exit
```

In [None]:
# igv -g wuhan-1.gb -l NC_045512.2:22871-23084 sequences.sorted.bam seq-4000.vcf seq-4000-snv.vcf spikeprotein.gff3
xvfb-run -a igv -b igv.bat # notebooks/AngBioInfo_2025/AngBioInfo_2025/SARS-CoV-2

**Attention**: Images do not update automatically:

<figure style="text-align: center;">
  <img src="snapshots/region_full.png" style="width: 30%; display: inline-block; margin-right: 2%;">
  <img src="snapshots/region_RBD.png" style="width: 30%; display: inline-block;">
  <img src="snapshots/region_RBM.png" style="width: 30%; display: inline-block;">
  <figcaption>IGV window snapshots showing the complete virus genome (left), the RBD (middle), and the RBM (right).</figcaption>
</figure>
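<script>
  // Append a timestamp to each snapshot URL so the browser reloads the images
  document.querySelectorAll('img[src^="snapshots/"]').forEach(function (img) {
    img.src += '?t=' + new Date().getTime();
  });
</script>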
