# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Analyses 1 - Alignment using BLAST**
Welcome to the fifth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial covers running BLAST alignments of your own sequences against the WormBase genome, EST and protein data.
Let's get started!

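For this tutorial, we will use the wrappers for the NCBI BLAST+ applications in the Biopython library. The wrappers only build and run command lines, so the BLAST+ programs themselves (`makeblastdb`, `blastn`, `blastp`, `blastx`, `tblastn`) need to be installed and available on your PATH.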

We will start with installing and importing the required libraries.

In [None]:
!pip install biopython wget

In [None]:
import Bio
import wget
import gzip 
import shutil
from Bio.Blast.Applications import NcbimakeblastdbCommandline
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast.Applications import NcbiblastxCommandline
from Bio.Blast.Applications import NcbitblastnCommandline
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

### Creating a BLAST database

We first need to create our own BLAST database from the C. elegans reference genome (or any other reference genome you need).
From the FTP site, we download the reference genome and then build the BLAST database with the wrapper for the NCBI BLAST+ `makeblastdb` command line.

Download the reference genome for C. elegans, BioProject PRJNA13758, WormBase release WS280. The link points at the `current-development-release` directory, so `wormbase_id` must match the release currently published there.
Check out tutorial 1 for more information on customizing your download.

In [None]:
#Generate the link for the reference genome
species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
descriptor = 'genomic'
extension = 'fa'
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'
#Download the reference genome
wget.download(link)
downloaded_file = species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension
#Unzip the reference genome to get the .fa file
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
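
The download cell above is repeated later in this notebook with only `descriptor` changed. A minimal sketch of how the link building and unzipping could be factored into helpers is given below. The names `wormbase_filename`, `wormbase_url` and `gunzip` are hypothetical (they are not part of the tutorial series), and the `release_dir` default simply mirrors the path used above.

```python
import gzip
import shutil

FTP_ROOT = 'ftp://ftp.wormbase.org/pub/wormbase/releases'  # same root as the link above

def wormbase_filename(species, bioproject, wormbase_id, descriptor, extension='fa'):
    """Name of the uncompressed file, e.g. c_elegans.PRJNA13758.WS280.genomic.fa"""
    return '.'.join([species, bioproject, wormbase_id, descriptor, extension])

def wormbase_url(species, bioproject, wormbase_id, descriptor, extension='fa',
                 release_dir='current-development-release'):
    """Build the FTP link to the gzipped file, following the pattern used above."""
    filename = wormbase_filename(species, bioproject, wormbase_id, descriptor, extension)
    return f'{FTP_ROOT}/{release_dir}/species/{species}/{bioproject}/{filename}.gz'

def gunzip(gz_path):
    """Decompress gz_path (which must end in .gz) next to itself; return the new path."""
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a .gz file')
    out_path = gz_path[:-3]
    with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path

print(wormbase_url('c_elegans', 'PRJNA13758', 'WS280', 'genomic'))
```

With helpers like these, the download step could be written as `downloaded_file = gunzip(wget.download(wormbase_url(...)))`, since `wget.download` returns the name of the file it saved.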

Create a commandline for the NCBI BLAST+ program makeblastdb, and then run the command.
Here, we create a nucleotide database using the reference genome downloaded in the previous cell.

In [None]:
command = NcbimakeblastdbCommandline(dbtype="nucl", 
                                     parse_seqids=True, #Boolean switch: adds -parse_seqids
                                     input_file=downloaded_file, 
                                     out='worm_genome', 
                                     title='worm_genome')
command

In [None]:
command()

We have now created a new local nucleotide database, `worm_genome`, that we can use for our BLAST (blastn) alignments!
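
For reference, calling the command object simply runs the BLAST+ program `makeblastdb` with the options shown when the object is displayed. Recent Biopython releases mark these wrappers as deprecated and suggest building the command yourself and running it with the `subprocess` module, so a minimal equivalent sketch is given below. `makeblastdb_args` is a hypothetical helper name, and the sketch only runs the program if both the executable and the input file are found.

```python
import os
import shutil
import subprocess

def makeblastdb_args(input_file, dbtype, out, title=None, parse_seqids=True):
    """Return the argument list for the BLAST+ makeblastdb program."""
    args = ['makeblastdb', '-in', input_file, '-dbtype', dbtype, '-out', out]
    if title is not None:
        args += ['-title', title]
    if parse_seqids:
        args.append('-parse_seqids')  # a switch: present or absent, takes no value
    return args

args = makeblastdb_args('c_elegans.PRJNA13758.WS280.genomic.fa', 'nucl',
                        out='worm_genome', title='worm_genome')
print(' '.join(args))

# Only attempt to run when the executable and the input file are both available
if shutil.which('makeblastdb') and os.path.exists(args[2]):
    subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.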

### Running a BLAST query

We run blastn on the example FASTA file `data/nucl_example.fa` against the local BLAST database. The other parameters can be changed to suit your needs.

Create a commandline for the NCBI BLAST+ program blastn and then run the command.

In [None]:
command = NcbiblastnCommandline(query="data/nucl_example.fa", #The sequence to search with
                                db="worm_genome", #The database to BLAST against
                                evalue=1e+0, #Expectation value cutoff
                                out="blastn.xml", #Output file for alignment
                                reward=1, #Reward for a nucleotide match 
                                penalty=-3, #Penalty for a nucleotide mismatch
                                outfmt='"5"', #Alignment view. -> 5 for XML
                                word_size=15, #Word size for wordfinder algorithm
                                gapopen=5, #Cost to open a gap
                                gapextend=2 #Cost to extend a gap
                               )

In [None]:
command()
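
To make the scoring parameters above concrete, the sketch below computes the raw score of a hypothetical alignment under the affine gap model usually described for BLAST, in which a gap of length k costs `gapopen + k * gapextend`. The function name `raw_score` and the example counts are illustrative assumptions. BLAST additionally converts raw scores into bit scores and E-values, which is not shown here.

```python
def raw_score(matches, mismatches, gap_lengths,
              reward=1, penalty=-3, gapopen=5, gapextend=2):
    """Raw nucleotide alignment score with affine gap costs.

    gap_lengths lists the length of each gap in the alignment.
    The defaults mirror the blastn parameters used above.
    """
    score = matches * reward + mismatches * penalty
    for k in gap_lengths:
        score -= gapopen + k * gapextend
    return score

# 50 matches, 2 mismatches and one gap of 3 bases:
# 50*1 + 2*(-3) - (5 + 3*2) = 33
print(raw_score(50, 2, [3]))
```

With these settings a single mismatch cancels three matches, which is why a reward/penalty pair of 1/-3 favours long, nearly identical alignments.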

We have now generated an XML file with the results of the BLAST alignment. Next, we parse it to get the output in a readable format.

For this we use the NCBIXML module, which makes it easy to parse BLAST XML output.

In [None]:
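#Parse the XML output; NCBIXML.read expects results for exactly one query
with open("blastn.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

#Print every HSP, labelled with the first word of the hit title
for alignment, description in zip(blast_record.alignments, blast_record.descriptions):
    for hsp in alignment.hsps:
        print('Chromosome: ' + description.title.split(' ')[0])
        print(hsp)
        print('\n')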
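
Printing each HSP gives the full alignment text. When only a summary is needed, the fields of each alignment and HSP can be collected into rows instead. The sketch below is one way to do this. `hsp_rows` is a hypothetical helper, the `max_evalue` threshold is an arbitrary choice, and the demo uses small stand-in objects with made-up values that carry the same attribute names as the Biopython record objects (`hit_id`, `hsps`, `expect`, `bits`, `identities`, `align_length`, `sbjct_start`, `sbjct_end`), so that it runs without a BLAST output file.

```python
from types import SimpleNamespace

def hsp_rows(blast_record, max_evalue=1e-5):
    """Collect one dict per HSP whose E-value is at most max_evalue."""
    rows = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect > max_evalue:
                continue
            rows.append({
                'hit': alignment.hit_id,
                'evalue': hsp.expect,
                'bits': hsp.bits,
                'pct_identity': round(100.0 * hsp.identities / hsp.align_length, 1),
                'hit_start': hsp.sbjct_start,
                'hit_end': hsp.sbjct_end,
            })
    # Best (lowest E-value) hits first
    return sorted(rows, key=lambda row: row['evalue'])

# Stand-in record with made-up values, only to show the shape of the output
demo = SimpleNamespace(alignments=[
    SimpleNamespace(hit_id='I', hsps=[
        SimpleNamespace(expect=2e-30, bits=120.0, identities=95, align_length=100,
                        sbjct_start=1000, sbjct_end=1099),
        SimpleNamespace(expect=0.5, bits=20.0, identities=15, align_length=20,
                        sbjct_start=5000, sbjct_end=5019),
    ]),
])
for row in hsp_rows(demo):
    print(row)
```

With a real result, pass the object returned by `NCBIXML.read` in place of `demo`.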

### Example with blastn using ESTs 

Next, we repeat the blastn alignment against the WormBase ESTs. The steps are the same as before: download the file, build a database, run the query and parse the results.

In [None]:
#Generate the link for the ests file
species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
descriptor = 'ests'
extension = 'fa'
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'
#Download the ests file
wget.download(link)
downloaded_file = species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension
#Unzip the ests file to get the .fa file
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
command = NcbimakeblastdbCommandline(dbtype="nucl", 
                                     parse_seqids=True, #Boolean switch: adds -parse_seqids
                                     input_file=downloaded_file, 
                                     out='worm_ests', 
                                     title='worm_ests')
command

In [None]:
command()

In [None]:
command = NcbiblastnCommandline(query="data/nucl_example.fa", #The sequence to search with
                                db="worm_ests", #The database to BLAST against
                                evalue=1e+0, #Expectation value cutoff
                                out="blastn_ests.xml", #Output file for alignment
                                reward=1, #Reward for a nucleotide match 
                                penalty=-3, #Penalty for a nucleotide mismatch
                                outfmt='"5"', #Alignment view. -> 5 for XML
                                word_size=15, #Word size for wordfinder algorithm
                                gapopen=5, #Cost to open a gap
                                gapextend=2 #Cost to extend a gap
                               )

In [None]:
command()

In [None]:
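#Parse the XML output; NCBIXML.read expects results for exactly one query
with open("blastn_ests.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

#Print every HSP, labelled with the first word of the hit title (the EST identifier)
for alignment, description in zip(blast_record.alignments, blast_record.descriptions):
    for hsp in alignment.hsps:
        print('EST: ' + description.title.split(' ')[0])
        print(hsp)
        print('\n')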

### Example with blastp 

Next, we run blastp, which aligns a protein query against a protein database built from the WormBase protein sequences.

In [None]:
#Generate the link for the reference protein sequence
species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
descriptor = 'protein'
extension = 'fa'
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'
#Download the reference protein sequence
wget.download(link)
downloaded_file = species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension
#Unzip the reference protein sequence to get the .fa file
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
command = NcbimakeblastdbCommandline(dbtype="prot", 
                                     parse_seqids=True, #Boolean switch: adds -parse_seqids
                                     input_file=downloaded_file, 
                                     out='worm_protein', 
                                     title='worm_protein')
command

In [None]:
command()

In [None]:
command = NcbiblastpCommandline(query="data/prot_example.fa", #The sequence to search with
                                db="worm_protein", #The database to BLAST against
                                evalue=1e+0, #Expectation value cutoff
                                out="blastp.xml", #Output file for alignment
                                outfmt='"5"', #Alignment view. -> 5 for XML
                                gapopen=11, #Cost to open a gap
                                gapextend=1 #Cost to extend a gap
                               )

In [None]:
command()

In [None]:
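#Parse the XML output; NCBIXML.read expects results for exactly one query
with open("blastp.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

#Print every HSP, labelled with the first word of the hit title (the protein identifier)
for alignment, description in zip(blast_record.alignments, blast_record.descriptions):
    for hsp in alignment.hsps:
        print('Protein: ' + description.title.split(' ')[0])
        print(hsp)
        print('\n')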

### Example with blastx

Next, we run blastx, which aligns a nucleotide query against a protein database.
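
To compare a nucleotide query with protein sequences, blastx translates the query in all six reading frames (three on each strand) before searching. The sketch below illustrates that translation step in plain Python using the standard genetic code. It is an illustration of the idea only, not BLAST's implementation. The names `translate` and `six_frames`, the frame labels and the example sequence are assumptions made for this example.

```python
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
# Standard genetic code: codons enumerated with T, C, A, G at each position
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def translate(seq):
    """Translate complete codons; codons not in the table (e.g. with N) become 'X'."""
    return ''.join(CODON_TABLE.get(seq[i:i + 3], 'X')
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Map a frame label (+1..+3 forward, -1..-3 reverse strand) to its translation."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

for frame, protein in sorted(six_frames('ATGGCCTAA').items()):
    print(frame, protein)
```

Each of the six translations is then treated as a protein query, which is why blastx accepts the same gap costs as blastp.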

In [None]:
#We will use the previously generated worm_protein database!!
command = NcbiblastxCommandline(query="data/nucl_example.fa", #The sequence to search with
                                db="worm_protein", #The database to BLAST against
                                evalue=1e+0, #Expectation value cutoff
                                out="blastx.xml", #Output file for alignment
                                outfmt='"5"', #Alignment view. -> 5 for XML
                                gapopen=11, #Cost to open a gap
                                gapextend=1 #Cost to extend a gap
                               )

In [None]:
command()

In [None]:
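#Parse the XML output; NCBIXML.read expects results for exactly one query
with open("blastx.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

#Print every HSP, labelled with the first word of the hit title (the protein identifier)
for alignment, description in zip(blast_record.alignments, blast_record.descriptions):
    for hsp in alignment.hsps:
        print('Protein: ' + description.title.split(' ')[0])
        print(hsp)
        print('\n')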

### Example with tblastn

Next, we run tblastn, which aligns a protein query against a nucleotide database translated in all six reading frames.

In [None]:
#We will use the previously generated worm_genome database!!
command = NcbitblastnCommandline(query="data/prot_example.fa", #The sequence to search with
                                 db="worm_genome", #The database to BLAST against
                                 evalue=1e+0, #Expectation value cutoff
                                 out="tblastn.xml", #Output file for alignment
                                 outfmt='"5"', #Alignment view. -> 5 for XML
                                 gapopen=11, #Cost to open a gap
                                 gapextend=1 #Cost to extend a gap
                                )

In [None]:
command()

In [None]:
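#Parse the XML output; NCBIXML.read expects results for exactly one query
with open("tblastn.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

#Print every HSP, labelled with the first word of the hit title
for alignment, description in zip(blast_record.alignments, blast_record.descriptions):
    for hsp in alignment.hsps:
        print('Chromosome: ' + description.title.split(' ')[0])
        print(hsp)
        print('\n')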
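
The four programs used in this tutorial differ only in the type of query and database they expect. As a compact reminder, the sketch below maps each combination to a program name. `pick_blast_program` is a hypothetical helper, and tblastx (translated query against translated database) is left out because it was not covered here.

```python
# (query type, database type) -> BLAST+ program, as used in this tutorial
PROGRAMS = {
    ('nucl', 'nucl'): 'blastn',
    ('prot', 'prot'): 'blastp',
    ('nucl', 'prot'): 'blastx',   # the query is translated
    ('prot', 'nucl'): 'tblastn',  # the database is translated
}

def pick_blast_program(query_type, db_type):
    """Return the program for a 'nucl' or 'prot' query and database."""
    try:
        return PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(
            f'unsupported combination: {query_type!r} query, {db_type!r} database'
        ) from None

for (query_type, db_type), program in PROGRAMS.items():
    print(f'{query_type} query vs {db_type} database: {program}')
```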

This is the end of the first tutorial for WormBase data analysis! This tutorial dealt with running BLAST alignments against WormBase data.

In the next tutorial, we will use BLAT, an alignment tool that is typically faster than BLAST, for similar analyses!