# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Analyses 2 - Alignment using BLAT**
Welcome to the sixth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial covers running BLAT alignments on your own data locally. (The BLAT binaries used here run on Linux systems only.)
Let's get started!

We start by importing the required Python libraries and installing BLAT (from the UCSC Genome Browser) locally.

BLAT is an alignment tool like BLAST. We download the gfServer/gfClient version of BLAT, which behaves the same as the web version.

In [1]:
import os
import itertools
import Bio 
import math
import pandas as pd
from Bio import SearchIO 
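get_ipython().system = os.system  # Route `!` shell commands through os.system; each such cell then shows the exit status (0 on success)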

#### Install BLAT, start gfServer so it is ready to receive queries, and use gfClient to send a BLAT query

Download the binaries for the BLAT program

In [2]:
!mkdir BLAT # there are Python commands to create directories too
!rsync -a rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/blat/ BLAT/
!chmod +x BLAT/gfServer BLAT/gfClient BLAT/blat

0
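As the comment in the cell above notes, the directory can also be created from Python. A minimal sketch using the standard library follows; the helper name `ensure_dir` is hypothetical and not part of the original notebook. With `exist_ok=True` the call can be repeated safely, whereas a plain `mkdir` fails if the directory already exists.

```python
import os

def ensure_dir(path):
    """Create `path` (and any missing parent directories) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)

# Same effect as `mkdir BLAT`, but safe to re-run.
ensure_dir("BLAT")
```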

Download the .2bit genome file for C. elegans

In [3]:
!wget https://hgdownload.soe.ucsc.edu/goldenPath/ce11/bigZips/ce11.2bit

0

Run the server on the downloaded .2bit file

In [4]:
!BLAT/gfServer start 127.0.0.1 1234 -stepSize=5 ce11.2bit &

0
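Because the server is started in the background (`&`), the next cell can run before gfServer has finished indexing the genome. The sketch below polls the port until it accepts a TCP connection. It is an illustration only: `wait_for_port` is a hypothetical helper, and it assumes that a bare connect-and-close does not disturb the server. BLAT's own check is `gfServer status 127.0.0.1 1234`.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Host and port match the gfServer command above:
# ready = wait_for_port("127.0.0.1", 1234)
```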

Query the server with a BLAT request for the example file - nucl_example.fa

In [5]:
!BLAT/gfClient -minScore=10 -minIdentity=0 127.0.0.1 1234 . data/nucl_example.fa out.psl

0
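Before parsing, it can help to confirm that the PSL file actually contains alignments. The following sketch counts data rows by looking for lines with at least 21 whitespace-separated fields whose first field is an integer (the match count), so it does not depend on the exact number of header lines. The helper `count_psl_hits` and the sample row are made up for illustration.

```python
def count_psl_hits(lines):
    """Count alignment rows: data rows start with the integer match count."""
    hits = 0
    for line in lines:
        fields = line.split()
        if len(fields) >= 21 and fields[0].isdigit():
            hits += 1
    return hits

# Illustrative input: a shortened header plus one made-up alignment row.
sample = [
    "psLayout version 3\n",
    "\n",
    "match\tmis-\trep.\t...\n",
    "-----------------------\n",
    "12\t0\t0\t0\t0\t0\t0\t0\t+\tquery1\t450\t63\t75\tchrDemo\t20000\t1000\t1012\t1\t12,\t63,\t1000,\n",
]
print(count_psl_hits(sample))  # 1
```

With the real output, the same function can be applied to the open file, for example `with open('out.psl') as f: print(count_psl_hits(f))`.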

#### Parse the output of the BLAT program and generate a useful dataframe

In [6]:
psl = 'out.psl' 
qresult = SearchIO.read(psl, 'blat-psl')

The cell below describes a class that takes in a row of the psl file as input and tokenizes it to extract all information that the BLAT output can provide.

You do not need to make any changes to this cell.

In [7]:
class Psl(object):
       
    def __init__(self, s):
        fields = s.strip().split()
        num_fields = len(fields)
        matches, mismatches, repmatches, ncount, qnuminsert, qbaseinsert, tnuminsert, tbaseinsert, strand, qname, \
        qsize, qstart, qend, tname, tsize, tstart, tend, blockcount, blocksizes, qstarts, tstarts = fields[0:21]
        self.matches = int(matches)
        self.mismatches = int(mismatches)
        self.repmatches = int(repmatches)
        self.ncount = int(ncount)
        self.qnuminsert = int(qnuminsert)
        self.qbaseinsert = int(qbaseinsert)
        self.tnuminsert = int(tnuminsert)
        self.tbaseinsert = int(tbaseinsert)
        self.strand = strand
        self.qname = qname
        self.qsize = int(qsize)
        self.qstart = int(qstart)
        self.qend = int(qend)
        self.tname = tname
        self.tsize = int(tsize)
        self.tstart = int(tstart)
        self.tend = int(tend)
        self.blockcount = int(blockcount)
        self.blocksizes = [int(x) for x in blocksizes.split(',')[0:-1]]
        self.qstarts = [int(x) for x in qstarts.split(',')[0:-1]]
        self.tstarts = [int(x) for x in tstarts.strip().split(',')[0:-1]]
        
    def __lenmul(self): #In case the sequence is protein, we need the length multiplier value to be 3, else 1
        if self.__isProtein(): #Call the method; the bare attribute is always truthy
            return 3
        else:
            return 1

    def __isProtein(self): #We find out if the input sequence is a protein or nucleotide sequence
        lastblock = self.blockcount - 1
        tstrand = self.strand[1:2] #Second strand character (target strand); empty for nucleotide queries
        lastend = self.tstarts[lastblock] + 3 * self.blocksizes[lastblock]
        return (tstrand == '+' and self.tend == lastend) or \
               (tstrand == '-' and self.tstart == self.tsize - lastend)
    
    def __calcMilliBad(self, ismrna): #Get the mismatch rate in parts per thousand
        qalisize = self.__lenmul() * self.qspan()
        alisize = min(qalisize, self.tspan())
        millibad = 0
        if alisize <= 0: return 0
        sizediff = qalisize - self.tspan() #Query span minus target span, as in UCSC's pslCalcMilliBad
        if sizediff < 0:
            if ismrna:
                sizediff = 0
            else:
                sizediff = -sizediff
        insertfactor = self.qnuminsert
        if not ismrna: insertfactor += self.tnuminsert
        total = self.__lenmul() *\
            (self.matches + self.repmatches + self.mismatches)
        if total != 0:
            millibad = (1000 * (self.mismatches * self.__lenmul() + insertfactor + \
                                    round(3*math.log(1 + sizediff)))) / total
        return millibad

    def qspan(self): #Get span of alignment for input sequence
        return self.qend - self.qstart
    
    def tspan(self): #Get span of alignment for target sequence
        return self.tend - self.tstart
    
    def score(self): #Calculate the score as in the web version of BLAT
        return self.matches + (self.repmatches / 2) - self.mismatches - self.qnuminsert - self.tnuminsert
    
    def calcPercentIdentity(self): #Calculate the percent identity as in the web version of BLAT
        return 100.0 - self.__calcMilliBad(True) * 0.1
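To make the arithmetic concrete, here is a simplified, standalone restatement of the score and percent-identity calculations for a nucleotide query (size multiplier of 1, query treated as mRNA). The function names and the input numbers are hypothetical; this is a sketch of the formulas, not a replacement for the class above.

```python
import math

def psl_score(matches, repmatches, mismatches, qnuminsert, tnuminsert):
    """Score for a nucleotide query (size multiplier of 1)."""
    return matches + (repmatches / 2) - mismatches - qnuminsert - tnuminsert

def psl_percent_identity(matches, repmatches, mismatches, qnuminsert, qspan, tspan):
    """Percent identity for a nucleotide query, treating the query as mRNA."""
    if min(qspan, tspan) <= 0:
        return 100.0
    # For mRNA, extra target span (e.g. introns) is not penalised.
    sizediff = max(qspan - tspan, 0)
    total = matches + repmatches + mismatches
    if total == 0:
        return 100.0
    millibad = 1000 * (mismatches + qnuminsert + round(3 * math.log(1 + sizediff))) / total
    return 100.0 - millibad * 0.1

# Made-up alignment: 95 matching and 5 mismatching bases, no gaps.
print(psl_score(95, 0, 5, 0, 0))                    # 90.0
print(psl_percent_identity(95, 0, 5, 0, 100, 100))  # 95.0
```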

We will now create a dataframe that will contain the BLAT output in a way that is readable and easy-to-understand.

In [8]:
columns = ['Query Name', 'Score', 'Sequence start in query', 'Sequence end in query', 'Size of query', 'Percent Identity', 'Target Name', 'Strand', 'Sequence start in target', 'Sequence end in target', 'Span of Target']
rows = []  # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
with open('out.psl') as f:
    for line in itertools.islice(f, 6, None):  # skip the PSL header lines
        p = Psl(line)
        rows.append({'Query Name':p.qname, 'Score':p.score(), 'Sequence start in query':p.qstart+1, 'Sequence end in query':p.qend, 'Size of query':p.qsize, 'Percent Identity':"%.1f" % p.calcPercentIdentity(), 'Target Name':p.tname, 'Strand':p.strand, 'Sequence start in target':p.tstart+1, 'Sequence end in target':p.tend, 'Span of Target':p.tspan()})
BLAT_output = pd.DataFrame(rows, columns=columns)
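The `+1` applied to the start columns in the cell above reflects that PSL coordinates are 0-based and half-open, while the table reports 1-based, inclusive positions. The sketch below illustrates the conversion; the helper names and the input values are hypothetical, chosen so that the result is a displayed interval of 64 to 75.

```python
def psl_to_one_based(start, end):
    """Convert a 0-based, half-open PSL interval to a 1-based, inclusive one."""
    return start + 1, end

def span(start, end):
    """Length of a 0-based, half-open interval."""
    return end - start

# Hypothetical PSL values for a 12-base alignment.
qstart, qend = 63, 75
print(psl_to_one_based(qstart, qend))  # (64, 75)
print(span(qstart, qend))              # 12
```

Only the start shifts: in a half-open interval the end is already one past the last base, which equals the 1-based position of that last base.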

In [9]:
BLAT_output

|   | Query Name | Score | Sequence start in query | Sequence end in query | Size of query | Percent Identity | Target Name | Strand | Sequence start in target | Sequence end in target | Span of Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C37F5.1 | 12.0 | 64 | 75 | 450 | 100.0 | chrII | + | 10508110 | 10508121 | 12 |
| 1 | C37F5.1 | 16.0 | 43 | 58 | 450 | 100.0 | chrIII | + | 9948381 | 9948396 | 16 |
| 2 | C37F5.1 | 16.0 | 55 | 70 | 450 | 100.0 | chrIV | + | 13774041 | 13774056 | 16 |
| 3 | C37F5.1 | 20.0 | 97 | 117 | 450 | 100.0 | chrV | + | 8584771 | 8584792 | 22 |
| 4 | C37F5.1 | 18.0 | 55 | 72 | 450 | 100.0 | chrV | + | 18094249 | 18094266 | 18 |
| 5 | C37F5.1 | 18.0 | 336 | 353 | 450 | 100.0 | chrI | - | 12635080 | 12635097 | 18 |
| 6 | C37F5.1 | 18.0 | 27 | 44 | 450 | 100.0 | chrIII | - | 2535570 | 2535587 | 18 |
| 7 | C37F5.1 | 20.0 | 16 | 35 | 450 | 100.0 | chrIII | - | 5767252 | 5767271 | 20 |
| 8 | C37F5.1 | 447.0 | 1 | 450 | 450 | 100.0 | chrIV | - | 2278432 | 2300286 | 21855 |
| 9 | C37F5.1 | 16.0 | 176 | 191 | 450 | 100.0 | chrIV | - | 5494796 | 5494811 | 16 |


This is the end of the second tutorial for WormBase data analysis! This tutorial dealt with running BLAT alignments locally on worm data.

In the next tutorial, we will use ePCR (in silico PCR), which searches a sequence database with a pair of PCR primers!