# Coronavirus demo

The following is a short demonstration of using the **Epi2me Labs** notebook environment to perform some basic analysis of a 2019-nCoV dataset.

***To run all of this example, please remember to connect to [your local](https://colab.research.google.com/github/epi2me-labs/resources/blob/master/epi2me-labs-server.ipynb#connecting) Epi2me Labs server.***

## Downloading the data

First we will download some nanopore data from the SRA:

In [2]:
!wget https://sra-download.ncbi.nlm.nih.gov/traces/sra45/SRZ/010948/SRR10948550/HKU-SZ-002a.fastq

--2020-02-11 16:08:13--  https://sra-download.ncbi.nlm.nih.gov/traces/sra45/SRZ/010948/SRR10948550/HKU-SZ-002a.fastq
Resolving sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)... 130.14.250.28, 130.14.250.24, 130.14.250.25
Connecting to sra-download.ncbi.nlm.nih.gov (sra-download.ncbi.nlm.nih.gov)|130.14.250.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 310422386 (296M) [application/octet-stream]
Saving to: ‘HKU-SZ-002a.fastq’


2020-02-11 16:09:28 (3.95 MB/s) - ‘HKU-SZ-002a.fastq’ saved [310422386/310422386]



We also download two reference sequences:

In [3]:
# the isolate we downloaded data for above
!wget -O HKU-SZ-002a.ref.fasta 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN938384.1&rettype=FASTA'
# a second isolate
!wget -O Wuhan-Hu-1.ref.fasta 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN908947.3&rettype=FASTA'

--2020-02-11 16:09:38--  https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN938384.1&rettype=FASTA
Resolving eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘HKU-SZ-002a.ref.fasta’

HKU-SZ-002a.ref.fas     [ <=>                ]  29.66K  --.-KB/s    in 0.09s   

2020-02-11 16:09:39 (325 KB/s) - ‘HKU-SZ-002a.ref.fasta’ saved [30367]

--2020-02-11 16:09:39--  https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN908947.3&rettype=FASTA
Resolving eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving t

## A little look at the data

First we can align the nanopore reads to the reference sequence of the second isolate:

In [4]:
!run mini_align -r Wuhan-Hu-1.ref.fasta -i HKU-SZ-002a.fastq -p reads2ref -t 4

Constructing minimap index.
[M::mm_idx_gen::0.008*0.57] collected minimizers
[M::mm_idx_gen::0.017*0.79] sorted minimizers
[M::main::0.050*0.41] loaded/built the index for 1 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.050*0.42] distinct minimizers: 5587 (99.93% are singletons); average occurrences: 1.004; average spacing: 5.332
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -I 16G -x map-ont -d Wuhan-Hu-1.ref.fasta.mmi Wuhan-Hu-1.ref.fasta
[M::main] Real time: 0.053 sec; CPU: 0.022 sec; Peak RSS: 0.004 GB
[M::main::0.016*0.33] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.016*0.35] mid_occ = 3
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.017*0.36] distinct minimizers: 5587 (99.93% are singletons); average occurrences: 1.004; average spacing: 5.332
[M::worker_pipeline::5.808*1.38] mapped 425717 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -x map-o

As a first analysis, let's plot a graph depicting the coverage of the reads against the reference. To do this we run the following program:

In [7]:
!run coverage_from_bam reads2ref.bam -s 10

[16:10:28 - root] Processing region MN908947.3:0-29900


This produces a text file with coverage data. The `-s` option asks for the coverage to be calculated at 10-base intervals along the reference. The output can be read using the `pandas` library and plotted using the interactive `plotly` library:

In [9]:
import pandas
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = 'plotly_white'
pio.renderers.default = 'colab'

# read the data from text file using pandas
df1 = pandas.read_table("reads2ref_MN908947.3_0_29903.depth.txt")
# create a new figure
fig = go.Figure()
# add a line representing the total depth...
fig.add_trace(
    go.Scatter(x=df1['pos'], y=df1['depth'], mode='lines', name='total'))
# ...another for the + strand depth
fig.add_trace(
    go.Scatter(x=df1['pos'], y=df1['depth_fwd'], mode='lines', name='+'))
# ...and one for the - strand
fig.add_trace(
    go.Scatter(x=df1['pos'], y=df1['depth_rev'], mode='lines', name='-'))
# Add some nice titles
fig.update_layout(
    title='Genome Coverage',
    xaxis={'title':'genome position'},
    yaxis={'title':'genome coverage'})

fig.show()
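
Alongside the plot, a few summary numbers can make the coverage easier to interpret. The following is a minimal sketch using only the Python standard library. It assumes the depth file is tab-separated with the `pos` and `depth` columns used above. The function name `summarise_depth`, the `min_depth` cut-off of 20, and the toy values are assumptions made for illustration, not part of the original workflow.

```python
import csv
import io
import statistics


def summarise_depth(handle, min_depth=20):
    """Summarise a tab-separated depth table with 'pos' and 'depth' columns.

    `min_depth` is an arbitrary illustrative cut-off, not a recommended value.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(int(r["pos"]), float(r["depth"])) for r in reader]
    depths = [depth for _, depth in rows]
    # positions whose total depth falls below the cut-off
    low = [pos for pos, depth in rows if depth < min_depth]
    return {
        "mean": statistics.mean(depths),
        "median": statistics.median(depths),
        "low_positions": low,
    }


# Toy stand-in for the depth file; the values are made up for illustration.
example = io.StringIO(
    "pos\tdepth\tdepth_fwd\tdepth_rev\n"
    "0\t10\t6\t4\n"
    "10\t40\t22\t18\n"
    "20\t100\t55\t45\n"
)
summary = summarise_depth(example)
print(summary)
```

To apply this to the real data, pass an open file handle for the depth file instead of the `io.StringIO` object.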

Next we can calculate some further statistics of the alignments, with a second program:

In [10]:
!run stats_from_bam reads2ref.bam > reads.stats

Mapped/Unmapped/Short/Masked: 1001/0/0/0


To plot a histogram depicting the accuracy of the reads with respect to the reference sequence we can run:

In [11]:
from plotly import express as px
df2 = pandas.read_table("reads.stats")
fig = px.histogram(df2, x="acc", title="Read accuracy")
fig.update_layout(xaxis={'title':'accuracy'})
fig.show()
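
The histogram can be complemented with a numeric summary of the same column. This is a small sketch under stated assumptions: it relies only on the `acc` column used above and on the file being tab-separated. The function name `accuracy_summary`, the `name` column, and the toy values are hypothetical and only serve to make the example self-contained.

```python
import csv
import io
import statistics


def accuracy_summary(handle, column="acc"):
    """Return count, min, median and max of one numeric column."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = sorted(float(row[column]) for row in reader)
    return {
        "n": len(values),
        "min": values[0],
        "median": statistics.median(values),
        "max": values[-1],
    }


# Toy stand-in for reads.stats; the 'name' column and values are invented.
example = io.StringIO(
    "name\tacc\n"
    "read1\t88.5\n"
    "read2\t92.0\n"
    "read3\t95.5\n"
)
stats = accuracy_summary(example)
print(stats)
```

With the real file, `open("reads.stats")` would take the place of the in-memory example.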

## Variant Calling with medaka

The following little workflow demonstrates minimal SNP calling by using the alignments of the reads to the reference sequence created above.

First we run `medaka`'s RNN on the alignments produced above to create a set of base probabilities in an `.hdf` file:

In [12]:
!rm -rf reads2ref.hdf
!run medaka consensus reads2ref.bam reads2ref.hdf --threads 4

[16:11:16 - Predict] Processing region(s): MN908947.3:0-29903
[16:11:16 - Predict] Setting tensorflow threads to 4.
[16:11:16 - Predict] Processing 1 long region(s) with batching.
[16:11:16 - Predict] Using model: /opt/conda/envs/ont/lib/python3.6/site-packages/medaka/data/r941_min_high_g344_model.hdf5.
[16:11:16 - ModelLoad] Building model with cudnn optimization: False
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #213: KMP_AFFINITY: x2APIC ids not unique - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-3
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 4 packages x 1 cores/pkg x 1 threads/core (4 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to

These probabilities can be processed into a Variant Call Format file:

In [13]:
!run medaka snp Wuhan-Hu-1.ref.fasta reads2ref.hdf ont.snp.vcf --threshold 0.9

[16:11:49 - DataIndex] Loaded 1/1 (100.00%) sample files.
[16:11:49 - SNPs] Processing MN908947.3:0-.


The threshold parameter here controls the reporting of minor variants. Setting a value close to one will filter most minor calls.

To view the variants we can use `bcftools` to keep only those with a quality score above 10 and output a simple table:

In [14]:
!run bcftools query -i 'QUAL>10' -f '%CHROM\ %POS\ %REF>%ALT\ %QUAL\n' ont.snp.vcf

MN908947.3 8782 C>T 63.216
MN908947.3 28144 T>C 43.71
MN908947.3 29095 C>T 58.097
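
If the table is needed for further work in Python, it can be parsed into structured records. The sketch below assumes the whitespace-separated layout printed above (`CHROM POS REF>ALT QUAL`). The `Snp` type and the `parse_snp_table` function are illustrative names, not part of `bcftools` or `medaka`.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float


def parse_snp_table(text):
    """Parse lines of the form 'CHROM POS REF>ALT QUAL' into Snp records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, pos, change, qual = line.split()
        ref, alt = change.split(">")
        records.append(Snp(chrom, int(pos), ref, alt, float(qual)))
    return records


# The three calls reported in the table above.
table = """MN908947.3 8782 C>T 63.216
MN908947.3 28144 T>C 43.71
MN908947.3 29095 C>T 58.097
"""
snps = parse_snp_table(table)
best = max(snps, key=lambda s: s.qual)
print(best)
```

Here `best` is the call with the highest quality score, which for this table is the C>T change at position 8782.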
