# Biopython

The [Biopython Project](https://biopython.org/) is an international association of developers of freely available [Python](https://www.python.org) tools for computational molecular biology.

Install using `pip install biopython`.

In [1]:
import sys
!{sys.executable} -m pip install biopython



Check the installed Biopython version.

In [2]:
import Bio
print(Bio.__version__)

1.79


## Working with sequences

Biopython handles sequences through the `Seq` object. Sequences are essentially strings of letters, and a `Seq` object supports many of the same methods as a plain Python string, but it adds biology-specific methods such as `complement()`, `transcribe()` and `translate()`.

In [3]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)
print(type(my_seq))

AGTACACTGGT
<class 'Bio.Seq.Seq'>


To convert to a plain string, use `str`.

In [10]:
print(type(str(my_seq)))

<class 'str'>


`Seq` objects can be iterated over and indexed like normal Python strings.

In [4]:
from Bio.Seq import Seq
my_seq = Seq("GATCG")
for index, letter in enumerate(my_seq):
    print("%i %s" % (index, letter))

0 G
1 A
2 T
3 C
4 G


Count the non-overlapping occurrences of a substring.

In [5]:
from Bio.Seq import Seq
print(Seq("AGTACACTAGTGGTAAAAAGTTTT").count("AGT"))

3
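
Like `str.count`, `Seq.count` performs a non-overlapping count. The plain-Python sketch below illustrates how that differs from counting overlapping matches. `count_overlapping` is a hypothetical helper written for this example, not a Biopython function.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing matches to overlap."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Step forward by one position so overlapping matches are found too.
        start = sequence.find(sub, start + 1)
    return count


# str.count skips past each match, so "AAAA" contains "AA" only twice.
non_overlapping = "AAAA".count("AA")
# Allowing overlaps finds matches at positions 0, 1 and 2.
overlapping = count_overlapping("AAAA", "AA")
print(non_overlapping, overlapping)
```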


Get the complement and reverse complement of `my_seq`.

In [6]:
print(my_seq)
print(my_seq.complement())
print(my_seq.reverse_complement())

GATCG
CTAGC
CGATC
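
To make the two operations concrete, here is a minimal plain-Python sketch of what they compute for unambiguous DNA. It is an illustration only, not Biopython's implementation, which also handles ambiguity codes and RNA. The names `COMPLEMENT`, `complement` and `reverse_complement` are defined here for the example.

```python
# Map each unambiguous DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, reversed."""
    return complement(dna)[::-1]


print(complement("GATCG"))
print(reverse_complement("GATCG"))
```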


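Calculate GC%. The `GC` function used here matches the Biopython 1.79 release shown above; newer releases replace it with `Bio.SeqUtils.gc_fraction`, which returns a fraction between 0 and 1 rather than a percentage.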

In [7]:
from Bio.Seq import Seq
from Bio.SeqUtils import GC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(GC(my_seq))

46.875
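
The value above can be reproduced by hand. The sketch below is a simplified plain-Python version that counts only the unambiguous letters G and C; `gc_percent` is a hypothetical helper for illustration, and Biopython's own function may treat ambiguity codes differently.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in the sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100 * gc_count / len(upper)


# 15 of the 32 bases are G or C.
gc_value = gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(gc_value)
```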


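Transcription. `transcribe()` works on the coding strand and replaces T with U, so the template strand has to be reverse-complemented first.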

In [13]:
from Bio.Seq import Seq
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
template_dna = coding_dna.reverse_complement()
messenger_rna = coding_dna.transcribe()
print(messenger_rna)
print(template_dna.reverse_complement().transcribe())

AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
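
Both routes give the same mRNA, which a plain-string sketch can show without Biopython. This is an illustration of the idea under the assumption of unambiguous DNA; the variable names are chosen for this example.

```python
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

# The template strand is the reverse complement of the coding strand.
dna_complement = str.maketrans("ACGT", "TGCA")
template_dna = coding_dna.translate(dna_complement)[::-1]

# Shortcut: the mRNA is the coding strand with T replaced by U.
mrna_from_coding = coding_dna.replace("T", "U")

# Longer route: reverse the template so it lines up with the coding strand,
# then pair each DNA base with its RNA partner (A-U, C-G, G-C, T-A).
rna_pairing = str.maketrans("ACGT", "UGCA")
mrna_from_template = template_dna[::-1].translate(rna_pairing)

print(mrna_from_coding)
print(mrna_from_template)
```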


Translation from coding DNA or from mRNA.

In [18]:
print(coding_dna.translate())
print(messenger_rna.translate())

MAIVMGR*KGAR*
MAIVMGR*KGAR*


Stop translation at the first in-frame stop codon.

In [19]:
print(coding_dna.translate(to_stop=True))

MAIVMGR
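
The effect of `to_stop` can be sketched in plain Python with a dictionary for the standard genetic code. This is a simplified illustration, not Biopython's implementation: it assumes unambiguous uppercase DNA, ignores any trailing partial codon, and the names `CODON_TABLE` and `translate` are defined only for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, to_stop=False):
    """Translate DNA codon by codon; '*' marks a stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # drop a trailing partial codon
    for i in range(0, usable_length, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
```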


Print the standard codon table.

In [20]:
from Bio.Data import CodonTable
print(CodonTable.unambiguous_dna_by_name["Standard"])

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

### Parsing sequence file formats

Most popular file formats have parsers integrated into `Bio.SeqIO` and/or `Bio.AlignIO`. Check the Wiki pages for [SeqIO](https://biopython.org/wiki/SeqIO) and [AlignIO](https://biopython.org/wiki/AlignIO), which should include an up-to-date list of supported file types.

For more information on Sequence Records (`SeqRecord` class), see [Chapter 4 of the Biopython tutorial](https://biopython.org/DIST/docs/tutorial/Tutorial.html#sec32).

The `Bio.SeqIO.parse()` function is used to read in sequence data as `SeqRecord` objects and expects two arguments:

1. a _handle_ to read the data from (a file opened for reading, or the output of a command line program), or a filename.
2. a lower case string specifying the sequence format.

The `Bio.SeqIO.parse()` function returns an iterator, which gives `SeqRecord` objects.
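
To show what "a handle in, an iterator of records out" means, here is a deliberately simplified FASTA reader written as a generator. It is a sketch for illustration, not how `Bio.SeqIO` is implemented: `parse_fasta` is a hypothetical function, it yields plain tuples instead of `SeqRecord` objects, and the two records are made-up example data.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) tuples from a FASTA text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# io.StringIO stands in for an open file; any text handle works.
handle = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
records = parse_fasta(handle)
first_record = next(records)       # records are produced one at a time
remaining_records = list(records)  # consume whatever is left
print(first_record)
print(remaining_records)
```

Because records are produced lazily, a large file never has to be held in memory all at once.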

For example, to parse a FASTA file (the `counter` only limits the output shown in the notebook):

In [8]:
from Bio import SeqIO
counter = 0
for seq_record in SeqIO.parse("../data/ls_orchid.fasta", "fasta"):
    if counter == 3:
        break
    print("[ ID ]", seq_record.id)
    print("[ Sequence ]", repr(seq_record.seq))
    print("[ Sequence length ]", len(seq_record), "\n")
    counter += 1

[ ID ] gi|2765658|emb|Z78533.1|CIZ78533
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
[ Sequence length ] 740 

[ ID ] gi|2765657|emb|Z78532.1|CCZ78532
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
[ Sequence length ] 753 

[ ID ] gi|2765656|emb|Z78531.1|CFZ78531
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
[ Sequence length ] 748 
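
An alternative to a manual counter is `itertools.islice`, which takes the first few items from any iterator. The sketch below uses a stand-in generator, the hypothetical `numbered_records`, so that it runs without the data file.

```python
from itertools import islice


def numbered_records():
    """Stand-in for a record iterator: yields record_1, record_2, ... forever."""
    n = 0
    while True:
        n += 1
        yield f"record_{n}"


# islice stops after three items, even though the generator never ends.
first_three = list(islice(numbered_records(), 3))
print(first_three)
```

The same pattern should apply to the iterator returned by `SeqIO.parse`, for example `islice(SeqIO.parse("../data/ls_orchid.fasta", "fasta"), 3)`.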



Parse a gzipped FASTA.

In [39]:
import gzip
from Bio import SeqIO

# Open in text mode ("rt"); the with block closes the file when done.
with gzip.open("../data/ls_orchid.fasta.gz", "rt") as my_handle:
    counter = 0
    for seq_record in SeqIO.parse(my_handle, "fasta"):
        if counter == 3:
            break
        print("[ ID ]", seq_record.id)
        print("[ Sequence ]", repr(seq_record.seq))
        print("[ Sequence length ]", len(seq_record), "\n")
        counter += 1

[ ID ] gi|2765658|emb|Z78533.1|CIZ78533
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
[ Sequence length ] 740 

[ ID ] gi|2765657|emb|Z78532.1|CCZ78532
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
[ Sequence length ] 753 

[ ID ] gi|2765656|emb|Z78531.1|CFZ78531
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
[ Sequence length ] 748 
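
The `"rt"` mode matters because `gzip.open` defaults to binary mode, while a text format such as FASTA is parsed from strings. The standard-library sketch below writes a small made-up record to a temporary file and reads it back in both modes; the file name and record are assumptions for this example.

```python
import gzip
import os
import tempfile

fasta_text = ">example_1\nACGTACGT\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta.gz")

    # Write the compressed file in text mode.
    with gzip.open(path, "wt") as out_handle:
        out_handle.write(fasta_text)

    # Text mode decompresses and decodes, giving str lines.
    with gzip.open(path, "rt") as text_handle:
        text_line = text_handle.readline()

    # Binary mode only decompresses, giving bytes lines.
    with gzip.open(path, "rb") as binary_handle:
        binary_line = binary_handle.readline()

print(type(text_line), type(binary_line))
```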



Save identifiers using a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) (create a new list based on an existing list):

In [29]:
from Bio import SeqIO
identifiers = [seq_record.id for seq_record in SeqIO.parse("../data/ls_orchid.gbk", "genbank")]
print(identifiers[1:6])

['Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1']
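
The same idiom works on any list. As a small illustration, the sketch below pulls the accession out of the FASTA-style identifiers printed earlier, assuming the accession is always the fourth `|`-separated field; that holds for these three IDs but is not a general rule.

```python
fasta_ids = [
    "gi|2765658|emb|Z78533.1|CIZ78533",
    "gi|2765657|emb|Z78532.1|CCZ78532",
    "gi|2765656|emb|Z78531.1|CFZ78531",
]

# List comprehension: one expression builds the new list.
accessions = [fasta_id.split("|")[3] for fasta_id in fasta_ids]

# Equivalent explicit loop, for comparison.
accessions_loop = []
for fasta_id in fasta_ids:
    accessions_loop.append(fasta_id.split("|")[3])

print(accessions)
```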


Parse GenBank file.

In [9]:
from Bio import SeqIO
counter = 0
for seq_record in SeqIO.parse("../data/ls_orchid.gbk", "genbank"):
    if counter == 3:
        break
    print("[ ID ]", seq_record.id)
    print("[ Sequence ]", repr(seq_record.seq))
    print("[ Sequence length ]", len(seq_record), "\n")
    counter += 1

[ ID ] Z78533.1
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
[ Sequence length ] 740 

[ ID ] Z78532.1
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
[ Sequence length ] 753 

[ ID ] Z78531.1
[ Sequence ] Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
[ Sequence length ] 748 



## Useful links

Biopython has two collections of cookbook recipes: the one included with the [Biopython tutorial](https://biopython.org/DIST/docs/tutorial/Tutorial.html#sec370) and the one available on the [Biopython Wiki](https://biopython.org/wiki/Category:Cookbook).