<img style="float: right;" src="http://www2.le.ac.uk/liscb1.jpg">
# Bioinformatics in Python

Written by Teri Forey

As well as Python libraries that work with numerical data, there are libraries built for biological data, specifically DNA/protein sequences and SAM/BAM alignment files. These are [Biopython](http://biopython.org/) and [pysam](http://pysam.readthedocs.io/en/latest/index.html), respectively.

## 1. Biopython

Biopython is the main Python library for reading and parsing sequence files. It has two main objects: the `Seq` object, which contains the DNA/RNA/protein sequence itself, and the `SeqRecord` object, which contains additional information such as an ID. We'll first look at creating a `Seq` object.

### 1.1 Seq
We start by importing the `Seq` class from `Bio.Seq`.

In [1]:
from Bio.Seq import Seq

We can create a Seq object simply by typing in the sequence as a string.

In [3]:
my_seq = Seq("ATGGCATCGATTCGATGCTAGGCTAGCTAG")
my_seq

Seq('ATGGCATCGATTCGATGCTAGGCTAGCTAG', Alphabet())

When this object is displayed, it shows the sequence and also an Alphabet. At the moment this is just a default, generic alphabet, as Biopython doesn't know whether the sequence is DNA or protein. If you know this information when you create a Seq object, you should supply it. Note that the `Bio.Alphabet` module was removed in Biopython 1.78, so the alphabet examples in this section require an earlier release.

In [10]:
from Bio.Alphabet import generic_dna, generic_protein

my_dna = Seq("ATCGATGCAGGCTAG", generic_dna)
my_dna

Seq('ATCGATGCAGGCTAG', DNAAlphabet())

In [9]:
my_protein = Seq('ATCGATGCAGGCTAG', generic_protein)
my_protein

Seq('ATCGATGCAGGCTAG', ProteinAlphabet())

If these alphabets are set, many biologically nonsensical operations will be caught as errors. For example, you can't combine a DNA sequence with a protein sequence.

In [11]:
my_dna + my_protein

TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
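
The idea behind this check can be sketched in plain Python. The `TypedSeq` class below is hypothetical (it is not part of Biopython, and it is not how Biopython implements alphabets); it only illustrates how tagging a sequence with its molecule type lets `+` reject a combination that makes no biological sense.

```python
class TypedSeq:
    """Hypothetical stand-in for a sequence tagged with its molecule type."""

    def __init__(self, data, kind):
        self.data = data
        self.kind = kind  # e.g. "dna" or "protein"

    def __add__(self, other):
        # Refuse to join sequences of different molecule types
        if self.kind != other.kind:
            raise TypeError("Incompatible types %s and %s" % (self.kind, other.kind))
        return TypedSeq(self.data + other.data, self.kind)


dna = TypedSeq("ATCGATGCAGGCTAG", "dna")
protein = TypedSeq("MKQHK", "protein")  # made-up protein fragment

# Joining two DNA sequences is allowed
print((dna + dna).data)

# Joining DNA and protein raises a TypeError
try:
    dna + protein
except TypeError as err:
    print("Caught:", err)
```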

### 1.2 Seq Methods

The Seq object contains some generic methods that are common to all strings, as well as methods that are unique to nucleotide sequences. A few examples of these are demonstrated below. 

In [13]:
# Use the string method find() to locate the start of a particular sequence
my_dna.find('AGG')

8

In [15]:
# print the reverse complement sequence
my_dna.reverse_complement()

Seq('CTAGCCTGCATCGAT', DNAAlphabet())
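
To make clear what a reverse complement is, here is a minimal plain-Python sketch that does not use Biopython. It assumes the input contains only the unambiguous upper-case bases A, C, G and T; the function name `reverse_complement` is our own choice for this illustration.

```python
# Map each unambiguous DNA base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the resulting string."""
    # str.translate() here swaps characters; it is unrelated to protein translation
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATCGATGCAGGCTAG"))  # CTAGCCTGCATCGAT
```

The result agrees with the `Seq` output shown above for the same sequence.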

In [18]:
# Translate the sequence into protein. 
# Note this will only translate the forward strand and will assume the whole sequence is coding! 
my_dna.translate()

Seq('IDAG*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [19]:
# Translate will not work on a protein sequence
my_protein.translate()

ValueError: Proteins cannot be translated!
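
As a rough illustration of what translation involves, the sketch below builds the standard genetic code as a dictionary and translates a DNA string codon by codon. It is plain Python rather than Biopython's implementation, and it makes some simplifying assumptions: only upper-case A, C, G and T are accepted, translation does not stop at a stop codon, and any trailing bases that do not form a full codon are ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate the forward strand from position 0, three bases at a time."""
    usable = len(dna) - len(dna) % 3  # drop an incomplete final codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate("ATCGATGCAGGCTAG"))  # IDAG*
```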

### 1.3 SeqRecord

A SeqRecord object contains a Seq object, as well as additional information.

In [21]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print(record)

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())


As you can see, we've assigned the sequence, ID, name and description. Printing the record outputs this information in a human readable format.

### 1.4 SeqIO

You are unlikely to create SeqRecords by hand very often; most of the time you'll read them in from a file. For that, you need to import the `SeqIO` module. There's an example GenBank file in `data/`. We'll read it in and print a few pieces of information for each sequence in the file.

In [28]:
from Bio import SeqIO

# We have to specify the filename and the format
for record in SeqIO.parse("data/sequence.gb", 'genbank'):
    print("ID: %s\tName: %s\tSequence Length: %s" % (record.id, record.name, len(record.seq)))

ID: NM_001285987.1	Name: NM_001285987	Sequence Length: 2075


There is only one sequence in this file, so the loop variable `record` still holds it. We can see all of its information by printing the whole record.

In [29]:
print(record)

ID: NM_001285987.1
Name: NM_001285987
Description: Homo sapiens POU class 5 homeobox 1 (POU5F1), transcript variant 5, mRNA
Number of features: 17
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=29-MAR-2018
/accessions=['NM_001285987']
/sequence_version=1
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Overexpression of OCT4 induced by modulation of histone marks plays crucial role in breast cancer progression', ...), Reference(title='Correlation of Cancer Stem-Cell Markers OCT4, SOX2, and NANOG with Clinicopathological Features and Prognosis in Operative Patients with Rectal Cancer', ...), Reference(title='Concise review: isoforms of OCT4 contribute to the confusing diversity in stem cell biology', ...), Reference(title=

There is a lot of information here! Different file formats contain different information, so what you see here won't match what you'd see from a FASTA file, for example. Biopython supports almost all of the sequence file formats you'll come across when working with biological sequences; have a look at its documentation for more information.
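
To show how little a FASTA file carries compared with GenBank, here is a minimal plain-Python reader. It is a sketch for illustration, not Biopython's parser: the function name `read_fasta` and the two example records are made up, and it assumes well-formed input where every record starts with a `>` header line.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


# Made-up example data held in memory instead of a file on disk
example = io.StringIO(
    ">seq1 made-up example record\nATGGCATCG\nATTCGATGC\n>seq2\nTTAGGC\n"
)
for header, sequence in read_fasta(example):
    print("Header: %s\tSequence Length: %s" % (header, len(sequence)))
```

A FASTA record gives you only a header line and the sequence, so fields such as taxonomy or references seen in the GenBank record have no equivalent.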

Importantly when working with a SeqRecord object, you can access the underlying Seq object - and therefore all of the methods that come with it!

In [32]:
record.seq

Seq('GGAAAAAAGGAAAGTGCACTTGGAAGAGATCCAAGTGGGCAACTTGAAGAACAA...AAA', IUPACAmbiguousDNA())

In [33]:
record.seq.reverse_complement()

Seq('TTTTTTTTTTTAAGTGTGTCTATCTACTGTGTCCCAGGCTTCTTTATTTAAGAA...TCC', IUPACAmbiguousDNA())

In [38]:
# Note that to print the whole sequence you need to use print() or convert it to a string
print(record.seq)
str(record.seq)

GGAAAAAAGGAAAGTGCACTTGGAAGAGATCCAAGTGGGCAACTTGAAGAACAAGTGCCAAATAGCACTTCTGTCATGCTGGATGTCAGGGCTCTTTGTCCACTTTGTATAGCCGCTGGCTTATAGAAGGTGCTCGATAAATCTCTTGAATTTAAAAATCAATTAGGATGCCTCTATAGTGAAAAAGATACAGTAAAGATGAGGGATAATCAATTTAAAAAATGAGTAAGTACACACAAAGCACTTTATCCATTCTTATGACACCTGTTACTTTTTTGCTGTGTTTGTGTGTATGCATGCCATGTTATAGTTTGTGGGACCCTCAAAGCAAGCTGGGGAGAGTATATACTGAATTTAGCTTCTGAGACATGATGCTCTTCCTTTTTAATTAACCCAGAACTTAGCAGCTTATCTATTTCTCTAATCTCAAAACATCCTTAAACTGGGGGTGATACTTGAGTGAGAGAATTTTGCAGGTATTAAATGAACTATCTTCTTTTTTTTTTTTCTTTGAGACAGAGTCTTGCTCTGTCACCCAGGCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGTGCGTGCCACCGTGCCCAGCTAATTTTTGTGTTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCATGCTGGTCTTGAACTCCTGACCTCGTGATCTGCCCACCTCGGCCTCCCAAAGTGCTGGAATTATAGGCGTGAGCCACCGCGCCCAGCAAAGAACTTCTAACCTTCATAACCTGACAGGTGTTCTCGAGGCCAGGGTCTCTCTTTCTGTCCTTTCACGATGCTCTGCATCCCTTGGATGTGCCAGTTTCTGGGGGAAGAGTAGTCCTTTGTTACATGCATGAGTCAGTGAACAGGGAATGGGTGAATGACATTTGTGGGTAGGTTATTTCTAGAAGTTAGGTGGGCAGCTTGGAAGG

'GGAAAAAAGGAAAGTGCACTTGGAAGAGATCCAAGTGGGCAACTTGAAGAACAAGTGCCAAATAGCACTTCTGTCATGCTGGATGTCAGGGCTCTTTGTCCACTTTGTATAGCCGCTGGCTTATAGAAGGTGCTCGATAAATCTCTTGAATTTAAAAATCAATTAGGATGCCTCTATAGTGAAAAAGATACAGTAAAGATGAGGGATAATCAATTTAAAAAATGAGTAAGTACACACAAAGCACTTTATCCATTCTTATGACACCTGTTACTTTTTTGCTGTGTTTGTGTGTATGCATGCCATGTTATAGTTTGTGGGACCCTCAAAGCAAGCTGGGGAGAGTATATACTGAATTTAGCTTCTGAGACATGATGCTCTTCCTTTTTAATTAACCCAGAACTTAGCAGCTTATCTATTTCTCTAATCTCAAAACATCCTTAAACTGGGGGTGATACTTGAGTGAGAGAATTTTGCAGGTATTAAATGAACTATCTTCTTTTTTTTTTTTCTTTGAGACAGAGTCTTGCTCTGTCACCCAGGCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGTGCGTGCCACCGTGCCCAGCTAATTTTTGTGTTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCATGCTGGTCTTGAACTCCTGACCTCGTGATCTGCCCACCTCGGCCTCCCAAAGTGCTGGAATTATAGGCGTGAGCCACCGCGCCCAGCAAAGAACTTCTAACCTTCATAACCTGACAGGTGTTCTCGAGGCCAGGGTCTCTCTTTCTGTCCTTTCACGATGCTCTGCATCCCTTGGATGTGCCAGTTTCTGGGGGAAGAGTAGTCCTTTGTTACATGCATGAGTCAGTGAACAGGGAATGGGTGAATGACATTTGTGGGTAGGTTATTTCTAGAAGTTAGGTGGGCAGCTTGGAAG

There are lots of methods and attributes associated with a SeqRecord object. You can use the Python function `dir()` to list their names as strings. Those that start with an underscore ('_') are internal and shouldn't be used directly.

In [40]:
dir(record)

['__add__',
 '__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__le___',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_per_letter_annotations',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'annotations',
 'dbxrefs',
 'description',
 'features',
 'format',
 'id',
 'letter_annotations',
 'lower',
 'name',
 'reverse_complement',
 'seq',
 'translate',
 'upper']
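
Since the names starting with an underscore are rarely what you want, it can help to filter them out. The sketch below does this with a list comprehension. The `Example` class is a made-up stand-in so that the snippet runs without Biopython; the helper `public_names` is our own name and should work the same way on a SeqRecord.

```python
class Example:
    """Made-up stand-in object with one public and one internal attribute."""

    def __init__(self):
        self.id = "demo"
        self._cache = None  # internal by convention, so it is filtered out

    def upper(self):
        return self.id.upper()


def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_names(Example()))  # ['id', 'upper']
```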

## Exercise 1

Read in the file `data/multiple_sequences.gb` and write a for loop that parses the GenBank file, outputting each sequence in FASTA format. Print a count of how many sequences are in the file.

### 1.5 Bio.Align

Another Biopython object that proves very useful in analysing biological data is the `MultipleSeqAlignment` object from `Bio.Align`, which stores alignment data. Just as `SeqIO.parse()` reads SeqRecord objects, you can use `AlignIO.parse()` to read alignments from a file. To demonstrate other functionality we'll create an alignment by hand.

In [45]:
from Bio.Align import MultipleSeqAlignment

aln = MultipleSeqAlignment([
             SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
             SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
             SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
         ])

print(aln)

DNAAlphabet() alignment with 3 rows and 12 columns
ACTGCTAGCTAG Alpha
ACT-CTAGCTAG Beta
ACTGCTAGDTAG Gamma


In [47]:
# You can then print this alignment in a different format
print(aln.format('clustal'))

CLUSTAL X (1.81) multiple sequence alignment


Alpha                               ACTGCTAGCTAG
Beta                                ACT-CTAGCTAG
Gamma                               ACTGCTAGDTAG





In [51]:
# You can also select which rows and columns to print as you would using numpy
print(aln[:2])

print(aln[1: , 3:])

DNAAlphabet() alignment with 2 rows and 12 columns
ACTGCTAGCTAG Alpha
ACT-CTAGCTAG Beta
DNAAlphabet() alignment with 2 rows and 9 columns
-CTAGCTAG Beta
GCTAGDTAG Gamma
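
The row and column selection above can be pictured as ordinary slicing applied twice: once to the list of rows, and once to each row's sequence string. The plain-Python sketch below reproduces that behaviour on the same three sequences. The helper names `slice_alignment` and `column` are hypothetical and not part of Biopython.

```python
# The same alignment as above, stored as (id, sequence) pairs
rows = [
    ("Alpha", "ACTGCTAGCTAG"),
    ("Beta", "ACT-CTAGCTAG"),
    ("Gamma", "ACTGCTAGDTAG"),
]


def slice_alignment(rows, row_slice, col_slice):
    """Select rows first, then cut the same columns out of every selected row."""
    return [(name, seq[col_slice]) for name, seq in rows[row_slice]]


def column(rows, index):
    """Return one alignment column as a string, read from top to bottom."""
    return "".join(seq[index] for _, seq in rows)


# Equivalent in spirit to aln[1:, 3:]
for name, seq in slice_alignment(rows, slice(1, None), slice(3, None)):
    print(seq, name)

# The fourth column, where Beta has a gap
print(column(rows, 3))  # G-G
```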


## 2. pysam

Another library specifically written for biological data is pysam. This library is useful for reading and filtering SAM/BAM format files. 
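
To give a feel for the kind of filtering involved, the sketch below parses a few lines of SAM text in plain Python and keeps only mapped reads above a mapping-quality threshold. It does not use pysam, and the reads are made up for illustration. It relies only on the SAM layout of eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, ...), header lines starting with `@`, and the `0x4` flag bit marking an unmapped read. The function name `filter_reads` and the threshold of 30 are our own choices.

```python
UNMAPPED_FLAG = 0x4  # SAM flag bit set when a read is unmapped

# Made-up SAM content: one header line followed by three reads
SAM_TEXT = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read3\t16\tchr1\t200\t10\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
])


def filter_reads(sam_text, min_mapq=30):
    """Return names of mapped reads whose mapping quality is at least min_mapq."""
    kept = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & UNMAPPED_FLAG:
            continue  # unmapped read
        if mapq >= min_mapq:
            kept.append(name)
    return kept


print(filter_reads(SAM_TEXT))  # ['read1']
```

BAM files are a compressed binary form of the same records, so they cannot be read line by line like this, which is one reason to use a dedicated library.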