# BIO294: bioinformatics for comparative and evolutionary genomics
christoph.stritt@botinst.uzh.ch  
*14 November 2019*

# First steps to building your own bioinformatics toolbox with Python

#### Online documentation
https://www.python.org/doc/  
https://biopython.org/  
http://rosalind.info/problems/locations/


## A simple program to explore gene trees 
### Phytochrome A

![Gene tree for Phytochrome A](phyA.png)

# Preliminaries


#### Download the course data from GitHub to your Desktop

```console
cd ~/Desktop
git clone https://github.com/cstritt/bio294

```

In the data folder you'll find a file containing protein sequences of _Brachypodium distachyon_ and some other species, downloaded from Phytozome (https://phytozome-next.jgi.doe.gov).

Our goal is to create a tree for the Phytochrome A gene **Bradi1g10520**, showing all homologous genes within _B. distachyon_.

Let's take a look at the input files.


```console
cd bio294/data/
ll

zless -S Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa.gz

gunzip *.gz

# Is there some information available for the gene? 
grep Bradi1g10520 Bdistachyon_314_v3.1.defline.txt
grep Bradi1g10520 Bdistachyon_314_v3.1.annotation_info.txt
```
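
The same first look can be taken from Python before running `gunzip`, since the `gzip` module reads compressed files directly. A minimal sketch, assuming the file is still gzip-compressed; the helper name `count_fasta_entries` is made up for illustration and is not part of the course code:

```python
import gzip


def count_fasta_entries(path):
    """Count header lines (starting with '>') in a gzipped FASTA file."""
    n = 0
    # 'rt' opens the compressed file in text mode, so lines are str, not bytes
    with gzip.open(path, 'rt') as f:
        for line in f:
            if line.startswith('>'):
                n += 1
    return n
```

Calling `count_fasta_entries('Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa.gz')` would give the number of protein sequences in the gene space.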

#### Program outline
**A)** Fetch fasta entry for the gene of interest  
**B)** Blast fasta entry against a local database  
**C)** Extract homolog sequences and align them  
**D)** Estimate gene tree  


#### Exercises
In case you are stuck, possible solutions for the exercises are given in the file dirtyTree.py.

## A) Fetch fasta entry for the gene of interest

First, we want to extract the protein sequence of our focal gene and write it to a fasta file, which can then be used with blast.  
This can be done using some bash bricolage, for example:


```console
sed -n -e '/Bradi1g10520/,/^>/ p' Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa | sed -e '$d'
```

Let's try the same with Python. 


In [1]:
gene = 'Bradi1g10520'
path_to_gene_space = 'data/Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa'

switch = 0
gene_seq = ''
with open(path_to_gene_space) as f:
    for line in f:
        
        if line.startswith('>'):
            switch = 1 if gene in line else 0
            
        if switch == 1:
            gene_seq += line  
            
gene_seq

'>Bradi1g10520.1.p pacid=32796434 transcript=Bradi1g10520.1 locus=Bradi1g10520 ID=Bradi1g10520.1.v3.1 annot-version=v3.1\nMSSSRPTQSSSSSSRTRQSSRARILAQTTLDAELNAEYEETGDSFNYSKLVEAQRNTPPEQQGRSEKVIAYLQHIQKGKM\nIQSFGCLLALDEKSFNVIAFSQNAPEMLTMVSHAVPSVDDPPRLGIGTNVRSLFTDQGATALHKALGFADVSLLNPILVQ\nCKTSGKPFYAIVHRATGCLVVDFEPVKPTEFPATAAGALQSYKLAAKAISKIQSLPGGSMEVLCNTVVKEVFDLTGYDRV\nMAYKFHEDDHGEVFAEITKPGLEPYLGLHYPATDIPQAARFLFMKNKVRMICDCRARSIKIVEDESLPFDISLCGSALRA\nPHSCHLQYMENMNSIASLVMAVVVNENEEDDEVEAEQPAQQQKKKKLWGLLVCHHESPRYVPFPLRYACEFLAQVFAVHV\nNKEFELENQLREKSILRMQTILSDMLFREASPLTIISGTPNVMDLVKCDGAALLHGDKVWRLRDAPTESQIRDIASWLSE\nVHRDSTGLSTESLHDAGYPGASALGDMICGMAVAKINSRDILFWFRSHTAAEIKWGGAKHDPSDMDDGRRMHPRLSFKAF\nLEVVKMKSLSWSDYEMDAIHSLQLILRGALNDGIKATKGASLDNQIGDLKLDGLAELQAVTSEMVRLMETATVPILAVDG\nNGLINGWNQKAAELTGLRVDDAIGRHILTLVEESSVPVVQRMLYLALQGKEEKEVRFEVKTHGPKREDGPVILVVNACAS\nRDLHDHVVGVCFVAQDMTVHKLVMDKFTRIEGDYKAIIHNPNPLIPPIFGTDEFGWCSEWNTAMTMLTGWHRDEVVDKML\nLGEVFDSSSASCLLKNKDAFISLCVLINSALAGEETDKAPFGFFDRNGKYIECLLSAN

In [2]:
# Write to file          
with open(gene + '.fasta', 'w') as g:
    g.write(gene_seq)

In [8]:
# The same can be done more simply with Biopython
import sys
from Bio import SeqIO

# Display some information about the SeqIO module
help(SeqIO)

Help on package Bio.SeqIO in Bio:

NAME
    Bio.SeqIO - Sequence input/output as SeqRecord objects.

DESCRIPTION
    Bio.SeqIO is also documented at SeqIO_ and by a whole chapter in our tutorial:
    
      - `HTML Tutorial`_
      - `PDF Tutorial`_
    
    .. _SeqIO: http://biopython.org/wiki/SeqIO
    .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
    
    Input
    -----
    The main function is Bio.SeqIO.parse(...) which takes an input file handle
    (or in recent versions of Biopython alternatively a filename as a string),
    and format string.  This returns an iterator giving SeqRecord objects:
    
    >>> from Bio import SeqIO
    >>> for record in SeqIO.parse("Fasta/f002", "fasta"):
    ...     print("%s %i" % (record.id, len(record)))
    gi|1348912|gb|G26680|G26680 633
    gi|1348917|gb|G26685|G26685 413
    gi|1592936|gb|G29385|G29385 471
    
    Note that the parse(

In [13]:
# Fetch the gene sequence
for seq_record in SeqIO.parse(path_to_gene_space, "fasta"):
    if gene in seq_record.id:
        SeqIO.write(seq_record, sys.stdout, 'fasta')

# What's the difference between the two outputs?

>Bradi1g10520.1.p pacid=32796434 transcript=Bradi1g10520.1 locus=Bradi1g10520 ID=Bradi1g10520.1.v3.1 annot-version=v3.1
MSSSRPTQSSSSSSRTRQSSRARILAQTTLDAELNAEYEETGDSFNYSKLVEAQRNTPPE
QQGRSEKVIAYLQHIQKGKMIQSFGCLLALDEKSFNVIAFSQNAPEMLTMVSHAVPSVDD
PPRLGIGTNVRSLFTDQGATALHKALGFADVSLLNPILVQCKTSGKPFYAIVHRATGCLV
VDFEPVKPTEFPATAAGALQSYKLAAKAISKIQSLPGGSMEVLCNTVVKEVFDLTGYDRV
MAYKFHEDDHGEVFAEITKPGLEPYLGLHYPATDIPQAARFLFMKNKVRMICDCRARSIK
IVEDESLPFDISLCGSALRAPHSCHLQYMENMNSIASLVMAVVVNENEEDDEVEAEQPAQ
QQKKKKLWGLLVCHHESPRYVPFPLRYACEFLAQVFAVHVNKEFELENQLREKSILRMQT
ILSDMLFREASPLTIISGTPNVMDLVKCDGAALLHGDKVWRLRDAPTESQIRDIASWLSE
VHRDSTGLSTESLHDAGYPGASALGDMICGMAVAKINSRDILFWFRSHTAAEIKWGGAKH
DPSDMDDGRRMHPRLSFKAFLEVVKMKSLSWSDYEMDAIHSLQLILRGALNDGIKATKGA
SLDNQIGDLKLDGLAELQAVTSEMVRLMETATVPILAVDGNGLINGWNQKAAELTGLRVD
DAIGRHILTLVEESSVPVVQRMLYLALQGKEEKEVRFEVKTHGPKREDGPVILVVNACAS
RDLHDHVVGVCFVAQDMTVHKLVMDKFTRIEGDYKAIIHNPNPLIPPIFGTDEFGWCSEW
NTAMTMLTGWHRDEVVDKMLLGEVFDSSSASCLLKNKDAFISLCVLINSALAGEETDKAP
FGFFDRNGKYIECLLSANRKENAGGL

***Exercises:***  
1) Rewrite the code above as a function, using either the approach with Biopython or the one without.  
2) Implement the function as a program which can be executed from the command line (hint: use the sys module).

## B) Blast fasta entry against a local database

To find homologs, we blast the gene sequence against the gene space, i.e. against the same file from which the gene was extracted. On the command line, this would be
```console
blastp -query Bradi1g10520.fasta -subject Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa

```

Biopython provides a blast interface which allows running blast from within Python. Results are written to the file given by the `out` parameter.

In [3]:
from Bio.Blast.Applications import NcbiblastpCommandline

# help(NcbiblastpCommandline)
blast_cline = NcbiblastpCommandline(query=gene + ".fasta",
                                    subject=path_to_gene_space,
                                    outfmt='6 pident length evalue sseqid',
                                    out=gene + ".blastout")

# Print command
print(blast_cline)
# Execute command
blast_cline()

blastp -out Bradi1g10520.blastout -outfmt "6 pident length evalue sseqid" -query Bradi1g10520.fasta -subject data/Bdistachyon_314_v3.1.protein_primaryTranscriptOnly.fa


('', '')
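
Recent Biopython releases mark the command-line wrappers (such as `NcbiblastpCommandline`) as deprecated and recommend calling the program through `subprocess` instead. A minimal sketch of the same blastp call, assuming `blastp` is on the PATH; the helper names `build_blastp_command` and `run_blastp` are hypothetical and not part of Biopython or the course code:

```python
import shutil
import subprocess


def build_blastp_command(query, subject, out,
                         outfmt='6 pident length evalue sseqid'):
    """Return the blastp call as an argument list (no shell quoting needed)."""
    return ['blastp',
            '-query', query,
            '-subject', subject,
            '-outfmt', outfmt,
            '-out', out]


def run_blastp(query, subject, out):
    """Run blastp; raises if the program is missing or exits with an error."""
    if shutil.which('blastp') is None:
        raise FileNotFoundError('blastp not found on PATH')
    cmd = build_blastp_command(query, subject, out)
    subprocess.run(cmd, check=True)
```

With the variables defined earlier, `run_blastp(gene + '.fasta', path_to_gene_space, gene + '.blastout')` would write the same tabular output file as the wrapper.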

***Exercise:*** Take a look at the output file. How many hits does it contain?

### Digression: Blast against the NCBI online database

In [None]:
""" Instead of blasting against a local fasta file, 
it's also possible to blast against the NCBI database.
"""

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
  
with open(gene + '.fasta') as f:
    fasta_string = f.read()
result_handle = NCBIWWW.qblast("blastp", "refseq_protein", fasta_string)

blast_records = NCBIXML.parse(result_handle)
blast_records = list(blast_records)

# Can you find the homolog sequences in the blast_records list?

## C) Extract homolog sequences and align them

***Exercise:***  
Filter the output of the local blast: create a list with the names of the genes with a percentage identity > 30, alignment length > 150, and an e-value < 0.0001 (Panchy et al. 2016). How 'robust' are these thresholds, i.e. how does the number of 'homologs' change when the thresholds change?

In [6]:
import dirtyTree

min_perc_id = 30
min_aln_len = 150
max_eval = 0.0001

help(dirtyTree.filter_blast_results)
homologs = dirtyTree.filter_blast_results(gene + '.blastout',
                                          min_perc_id=min_perc_id,
                                          min_aln_len=min_aln_len,
                                          max_eval=max_eval)
homologs

Help on function filter_blast_results in module dirtyTree:

filter_blast_results(path_to_blast_results, min_perc_id=30, min_aln_len=150, max_eval=0.0001)
    Returns names of genes which have passed the filtering



['Bradi1g08400.1.p',
 'Bradi1g10510.1.p',
 'Bradi1g10520.1.p',
 'Bradi1g64360.1.p']
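
The implementation of `dirtyTree.filter_blast_results` is not shown here. A minimal sketch of what such a filter might look like, assuming the four tab-separated columns requested in the blast call (`pident`, `length`, `evalue`, `sseqid`); the function name `filter_hits` is made up, and this is not necessarily how `dirtyTree` does it:

```python
def filter_hits(path_to_blast_results, min_perc_id=30, min_aln_len=150,
                max_eval=0.0001):
    """Return sorted, unique subject IDs of hits that pass all thresholds."""
    passed = set()
    with open(path_to_blast_results) as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            # Column order follows the assumed outfmt string
            pident, length, evalue, sseqid = line.rstrip('\n').split('\t')
            if (float(pident) > min_perc_id
                    and int(length) > min_aln_len
                    and float(evalue) < max_eval):
                passed.add(sseqid)
    return sorted(passed)
```

To probe how robust the thresholds are, the function can be called in a loop, for example `for t in (20, 30, 40): print(t, len(filter_hits(gene + '.blastout', min_perc_id=t)))`.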

In [9]:
# Extract all homolog sequences from the gene space and write them to a file
recs = []

for seq_record in SeqIO.parse(path_to_gene_space, "fasta"):
    
    if seq_record.id in homologs:
        recs.append(seq_record)

SeqIO.write(recs, gene + ".homologs.fasta", 'fasta')

4

In [10]:
""" Run the muscle aligner through the Biopython interface
"""
from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
help(MuscleCommandline)

Help on class MuscleCommandline in module Bio.Align.Applications._Muscle:

class MuscleCommandline(Bio.Application.AbstractCommandline)
 |  MuscleCommandline(cmd='muscle', **kwargs)
 |  
 |  Command line wrapper for the multiple alignment program MUSCLE.
 |  
 |  http://www.drive5.com/muscle/
 |  
 |  Notes
 |  -----
 |  Last checked against version: 3.7, briefly against 3.8
 |  
 |  References
 |  ----------
 |  Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high
 |  accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.
 |  
 |  Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with
 |  reduced time and space complexity. BMC Bioinformatics 5(1): 113.
 |  
 |  Examples
 |  --------
 |  >>> from Bio.Align.Applications import MuscleCommandline
 |  >>> muscle_exe = r"C:\Program Files\Aligments\muscle3.8.31_i86win32.exe"
 |  >>> in_file = r"C:\My Documents\unaligned.fasta"
 |  >>> out_file = r"C:\My Documents\aligned.fasta"
 |  >>> muscle_cli

In [11]:
# Define the command
cline = MuscleCommandline(input=gene + ".homologs.fasta", 
                          out=gene + ".homologs.aligned.fasta")

# Show the command
print(cline)
# Run it
cline()

# Load the alignment
aln = AlignIO.read(gene + '.homologs.aligned.fasta', 'fasta')
print(aln)

muscle -in Bradi1g10520.homologs.fasta -out Bradi1g10520.homologs.aligned.fasta
SingleLetterAlphabet() alignment with 4 rows and 1204 columns
------------MSSSRPTQSS-------------SSSSR----...--- Bradi1g10520.1.p
------------MSSSRPTHSF-------------GFSNS----...--- Bradi1g10510.1.p
MASGSRATPTRSPSSARPAASRQAEQQHHYTQSSGGSTSRAGGG...--- Bradi1g64360.1.p
--------MSSSRSNNRATCSR-------------GSS------...SNF Bradi1g08400.1.p


In [16]:
# Access information in the alignment
print(aln[0])

ID: Bradi1g10520.1.p
Name: Bradi1g10520.1.p
Description: Bradi1g10520.1.p pacid=32796434 transcript=Bradi1g10520.1 locus=Bradi1g10520 ID=Bradi1g10520.1.v3.1 annot-version=v3.1
Number of features: 0
Seq('------------MSSSRPTQSS-------------SSSSR--------------...---', SingleLetterAlphabet())


***Exercise:***  
How many gaps are there in each aligned sequence? 

In [17]:
# Trim alignment
import strumenti
aln_trimmed = strumenti.trim_alignment(aln, max_prop_missing=0)
print(aln_trimmed)

SingleLetterAlphabet() alignment with 4 rows and 892 columns
MSSSRPTQSSSSSTRQSSRARILAQTTLDAELNAEYEETGDSFN...ACQ Bradi1g10520.1.p
MSSSRPTHSFGFSTRQSSQARISTQTTLDAELNAEYEESGDSFN...SCQ Bradi1g10510.1.p
PSSARPAASRGSTAATESVSKAVAQYTLDAGLHAVFEASGRSFD...LLR Bradi1g64360.1.p
RSNNRATCSRGSSARSKHSERVVAQTPVDARLHAEFEGSQRRFD...LSK Bradi1g08400.1.p
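
`strumenti.trim_alignment` is part of the course code and not shown here. Judging from the output above, `max_prop_missing=0` appears to remove every column that contains a gap. A minimal sketch of that idea on plain strings, assuming `-` marks a gap; `trim_columns` is a hypothetical name, not the course function:

```python
def trim_columns(seqs, max_prop_missing=0.0):
    """Drop alignment columns whose proportion of gaps exceeds the maximum.

    seqs: list of equal-length aligned sequences as strings.
    """
    n = len(seqs)
    # zip(*seqs) iterates over the alignment column by column
    keep = [i for i, column in enumerate(zip(*seqs))
            if column.count('-') / n <= max_prop_missing]
    return [''.join(s[i] for i in keep) for s in seqs]
```

Raising `max_prop_missing` keeps more columns, at the price of retaining sites where some sequences carry no information.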


## D) Estimate gene tree
Biopython is great for parsing phylogenetic trees and contains a neighbor-joining algorithm which we are going to use.

In [18]:
from Bio import Phylo

tree = strumenti.neighbor_joining_tree(aln_trimmed, prot_model='blosum62')
Phylo.draw_ascii(tree, file=sys.stdout, column_width=50)

                     _____ Bradi1g10520.1.p
  __________________|
 |                  |___________ Bradi1g10510.1.p
_|
 |______________________ Bradi1g64360.1.p
 |
 |___________________ Bradi1g08400.1.p
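
`strumenti.neighbor_joining_tree` is not shown here. Neighbor joining works on a matrix of pairwise distances between the aligned sequences. A minimal sketch of such a matrix, using the simple proportion of differing sites (p-distance) rather than the BLOSUM62-based distance used above; `p_distance` and `distance_matrix` are hypothetical helper names:

```python
def p_distance(a, b):
    """Proportion of alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must be aligned to the same length')
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Return {(name_i, name_j): distance} for all unordered pairs.

    seqs: dict mapping sequence name to aligned sequence string.
    """
    names = sorted(seqs)
    return {(m, n): p_distance(seqs[m], seqs[n])
            for i, m in enumerate(names) for n in names[i + 1:]}
```

Neighbor joining then repeatedly joins the pair of sequences that minimises a corrected version of these distances, until all sequences are connected in a tree.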



## Exercises:
1) Implement the code above as a program which can be run from the command line.  
2) Can you find an easy way to include sequences from other species in the analysis?  
3) Create a tree for the NBS-LRR gene Bradi1g00960, including all species in the data folder. How does its homolog tree differ from that of Phytochrome A? Why?  
4) Run the program for some more genes: Bradi1g64360, Bradi1g12340, Bradi1g31280. Instead of typing each command individually on the command line, try to use a for loop.  
5) Add minimum percentage identity as a parameter to the program.  