
# Variant calling

## Installing bwa

In [6]:
!cp "/content/drive/MyDrive/Colab Notebooks/5030/bwa/bwa-0.7.17.tar.bz2" bwa.tar.bz2
!bzip2 -d /content/bwa.tar.bz2
!tar -xf bwa.tar

In [None]:
%cd bwa-0.7.17
!make

## Installing samtools

In [None]:
%cd ..
!wget https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2
!bzip2 -d /content/samtools-1.15.tar.bz2
!tar -xf /content/samtools-1.15.tar

In [None]:
%cd /content/samtools-1.15
!./configure
!make
!make install

## Installing bcftools

In [None]:
%cd ..
!wget https://github.com/samtools/bcftools/releases/download/1.15/bcftools-1.15.tar.bz2
!bzip2 -d /content/bcftools-1.15.tar.bz2
!tar -xf /content/bcftools-1.15.tar

In [None]:
%cd /content/bcftools-1.15
!./configure
!make
!make install

## Downloading the reference genome E. coli REL606

In [49]:
%cd ..

/content


In [None]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
!gzip -d /content/GCA_000017985.1_ASM1798v1_genomic.fna.gz

## Getting the reference genome from Google Drive (another way)

In [20]:
!cp "/content/drive/MyDrive/Colab Notebooks/5030/L7/ecoli_rel606.fasta" .
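
Before indexing, it can be worth confirming that the FASTA holds what is expected. The sketch below tallies the sequence length of each record; `fasta_lengths` is a hypothetical helper written for this notebook (not part of bwa or samtools), and the example records are made up.

```python
def fasta_lengths(lines):
    """Return {record name: sequence length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The record name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

# Made-up records, only to show the shape of the result.
toy_fasta = [">seq1 toy record", "ACGT", "AC", ">seq2", "GG"]
print(fasta_lengths(toy_fasta))
```

To run it on the real file, pass an open handle, e.g. `fasta_lengths(open("/content/ecoli_rel606.fasta"))`. The VCF header near the end of this notebook reports a single contig, `CP000819.1`, of length 4629812, so that is the value to compare against.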

## Indexing the reference genome

In [22]:
%cd /content/bwa-0.7.17

/content/bwa-0.7.17


In [23]:
!./bwa index "/content/ecoli_rel606.fasta"

[bwa_index] Pack FASTA... 0.05 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.61 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.03 sec
[bwa_index] Construct SA from BWT and Occ... 0.58 sec
[main] Version: 0.7.17-r1188
[main] CMD: ./bwa index /content/ecoli_rel606.fasta
[main] Real time: 2.401 sec; CPU: 2.314 sec


In [24]:
# the index files written by bwa index now sit next to the FASTA in /content
!ls ..

bwa-0.7.17		ecoli_rel606.fasta.bwt
bwa.tar			ecoli_rel606.fasta.pac
drive			ecoli_rel606.fasta.sa
ecoli_rel606.fasta	GCA_000017985.1_ASM1798v1_genomic.fna
ecoli_rel606.fasta.amb	sample_data
ecoli_rel606.fasta.ann
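
The listing above shows five index files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). A minimal sketch for checking that all five are present before running the alignment; the helper name `missing_bwa_index_files` is hypothetical, and the extension list is taken from the listing rather than from the bwa documentation.

```python
import os

# Extensions seen in the directory listing after `bwa index` finished.
BWA_INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")

def missing_bwa_index_files(fasta_path):
    """Return the expected index files that do not exist next to fasta_path."""
    return [
        fasta_path + ext
        for ext in BWA_INDEX_EXTENSIONS
        if not os.path.exists(fasta_path + ext)
    ]

# Path used in this notebook; an empty list means the index looks complete.
missing = missing_bwa_index_files("/content/ecoli_rel606.fasta")
print("index complete" if not missing else f"missing: {missing}")
```

If anything is reported missing, re-running the `bwa index` cell is the simplest fix.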


## Aligning reads to the reference genome

In [25]:
# getting the fastq files
!cp "/content/drive/MyDrive/Colab Notebooks/5030/L7/SRR2584866_1.trim.sub.fastq" ..
!cp "/content/drive/MyDrive/Colab Notebooks/5030/L7/SRR2584866_2.trim.sub.fastq" ..

In [None]:
!./bwa mem /content/ecoli_rel606.fasta \
"/content/SRR2584866_1.trim.sub.fastq" \
"/content/SRR2584866_2.trim.sub.fastq" > ../SRR2584866.aligned.sam

In [31]:
!ls ..

bwa-0.7.17		ecoli_rel606.fasta.pac
bwa.tar			ecoli_rel606.fasta.sa
drive			GCA_000017985.1_ASM1798v1_genomic.fna
ecoli_rel606.fasta	sample_data
ecoli_rel606.fasta.amb	SRR2584866_1.trim.sub.fastq
ecoli_rel606.fasta.ann	SRR2584866_2.trim.sub.fastq
ecoli_rel606.fasta.bwt	SRR2584866.aligned.sam
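
The SAM file written by `bwa mem` is tab-separated text, so a quick sanity check is possible without samtools. The sketch below counts mapped and unmapped primary records using the FLAG column (bit `0x4` marks an unmapped read; `0x100` and `0x800` mark secondary and supplementary records). `count_mapped` is a hypothetical helper and the example records are made up.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts of primary records in SAM text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped

# Made-up records: one properly paired read pair and one unmapped read.
toy_sam = [
    "@SQ\tSN:CP000819.1\tLN:4629812",
    "read1\t99\tCP000819.1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII",
    "read1\t147\tCP000819.1\t200\t60\t5M\t=\t100\t-105\tACGTA\tIIIII",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
print(count_mapped(toy_sam))  # (2, 1)
```

On the real output, pass an open handle: `count_mapped(open("/content/SRR2584866.aligned.sam"))`.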


## Convert SAM to BAM with samtools

In [None]:
%cd ..

In [39]:
!samtools view -S -b "/content/SRR2584866.aligned.sam" \
> "/content/SRR2584866.aligned.bam"

## Sorting and indexing

In [40]:
!samtools sort -o "/content/SRR2584866.aligned.sorted.bam" \
"/content/SRR2584866.aligned.bam"

In [41]:
!samtools index "/content/SRR2584866.aligned.sorted.bam"

## Calculating read coverage with bcftools mpileup

In [51]:
!bcftools mpileup -O b -o /content/SRR2584866_raw.bcf \
-f /content/ecoli_rel606.fasta \
/content/SRR2584866.aligned.sorted.bam

[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250


## Variant calling

In [52]:
!bcftools call --ploidy 1 -m -v \
-o /content/SRR2584866_variants.vcf \
"/content/SRR2584866_raw.bcf"

## Final filtering

In [54]:
!vcfutils.pl varFilter /content/SRR2584866_variants.vcf > /content/SRR2584866_variants_final.vcf
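
To see what survived filtering, the VCF can be summarized in plain Python. This sketch counts SNPs and indels by looking for the `INDEL` flag that bcftools writes into the INFO column (declared in the VCF header). `summarize_vcf` is a hypothetical helper and the example records are made up.

```python
def summarize_vcf(lines):
    """Count SNP and indel records in an iterable of VCF text lines."""
    counts = {"snps": 0, "indels": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info_keys = fields[7].split(";")  # INFO is the eighth column
        if "INDEL" in info_keys:
            counts["indels"] += 1
        else:
            counts["snps"] += 1
    return counts

# Made-up records: one substitution and one deletion.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "CP000819.1\t100\t.\tA\tG\t50\t.\tDP=10\tGT:PL\t1:80,0",
    "CP000819.1\t200\t.\tAT\tA\t40\t.\tINDEL;IDV=5;IMF=0.5;DP=10\tGT:PL\t1:70,0",
]
print(summarize_vcf(toy_vcf))  # {'snps': 1, 'indels': 1}
```

Running it on both `/content/SRR2584866_variants.vcf` and `/content/SRR2584866_variants_final.vcf` shows how many calls the `varFilter` step removed.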

In [55]:
!head -100 /content/SRR2584866_variants_final.vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15
##bcftoolsCommand=mpileup -O b -o /content/SRR2584866_raw.bcf -f /content/ecoli_rel606.fasta /content/SRR2584866.aligned.sorted.bam
##reference=file:///content/ecoli_rel606.fasta
##contig=<ID=CP000819.1,length=4629812>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias