Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Fetching contributors…

Cannot retrieve contributors at this time

15104 lines (12337 sloc) 694.247 kB
% This is the main LaTeX file which is used to produce the Biopython
% Tutorial documentation.
%
% If you just want to read the documentation, you can pick up ready-to-go
% copies in both pdf and html format from:
%
% http://biopython.org/DIST/docs/tutorial/Tutorial.html
% http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
%
% If you want to typeset the documentation, you'll need a standard TeX/LaTeX
% distribution (I use teTeX, which works great for me on Unix platforms).
% Additionally, you need HeVeA (or at least hevea.sty), which can be
% found at:
%
% http://pauillac.inria.fr/~maranget/hevea/index.html
%
% You will also need the pictures included in the document, some of
% which are UMLish diagrams created by Dia
% (http://www.lysator.liu.se/~alla/dia/dia.html).
% These diagrams are available from Biopython git in the original dia
% format, which you can easily save as .png format using Dia itself.
% They are also checked in as the png files, so if you make
% modifications to the original dia files, the png files should also be
% changed.
%
% Once you're all set, you should be able to generate pdf by running:
%
% pdflatex Tutorial.tex (to generate the first draft)
% pdflatex Tutorial.tex (to get the cross references right)
% pdflatex Tutorial.tex (to get the table of contents right)
%
% To generate the html, you'll need HeVeA installed. You should be
% able to just run:
%
% hevea -fix Tutorial.tex
%
% However, on older versions of hevea you may first need to remove the
% Tutorial.aux file generated by LaTeX, then run hevea twice to get
% the references right.
%
% If you want to typeset this and have problems, please report them
% at biopython-dev@biopython.org, and we'll try to get things resolved. We
% always love to have people interested in the documentation!
\documentclass{report}
\usepackage{url}
\usepackage{fullpage}
\usepackage{hevea}
\usepackage{graphicx}
% make everything have section numbers
\setcounter{secnumdepth}{4}
% Make links between references
\usepackage{hyperref}
\newif\ifpdf
\ifx\pdfoutput\undefined
\pdffalse
\else
\pdfoutput=1
\pdftrue
\fi
\ifpdf
\hypersetup{colorlinks=true, hyperindex=true, citecolor=red, urlcolor=blue}
\fi
\begin{document}
\begin{htmlonly}
\title{Biopython Tutorial and Cookbook}
\end{htmlonly}
\begin{latexonly}
\title{
%Hack to get the logo on the PDF front page:
\includegraphics[width=\textwidth]{images/biopython.jpg}\\
%Hack to get some white space using a blank line:
~\\
Biopython Tutorial and Cookbook}
\end{latexonly}
\author{Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, \\
Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczy\'{n}ski}
\date{Last Update -- 1 August 2012 (Biopython 1.60+)}
%Hack to get the logo at the start of the HTML front page:
%(hopefully this isn't going to be too wide for most people)
\begin{rawhtml}
<P ALIGN="center">
<IMG ALIGN="center" SRC="images/biopython.jpg" TITLE="Biopython Logo" ALT="[Biopython Logo]" width="1024" height="288" />
</p>
\end{rawhtml}
\maketitle
\tableofcontents
\chapter{Introduction}
\label{chapter:introduction}
\section{What is Biopython?}
The Biopython Project is an international association of developers of freely available Python (\url{http://www.python.org}) tools for computational molecular biology. The web site \url{http://www.biopython.org} provides an online resource for modules, scripts, and web links for developers of Python-based software for life science research.
Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.
\section{What can I find in the Biopython package}
The main Biopython releases have lots of functionality, including:
\begin{itemize}
\item The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
\begin{itemize}
\item Blast output -- both from standalone and WWW Blast
\item Clustalw
\item FASTA
\item GenBank
\item PubMed and Medline
\item ExPASy files, like Enzyme and Prosite
\item SCOP, including `dom' and `lin' files
\item UniGene
\item SwissProt
\end{itemize}
\item Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
\item Code to deal with popular on-line bioinformatics destinations such as:
\begin{itemize}
\item NCBI -- Blast, Entrez and PubMed services
\item ExPASy -- Swiss-Prot and Prosite entries, as well as Prosite searches
\end{itemize}
\item Interfaces to common bioinformatics programs such as:
\begin{itemize}
\item Standalone Blast from NCBI
\item Clustalw alignment program
\item EMBOSS command line tools
\end{itemize}
\item A standard sequence class that deals with sequences, ids on sequences, and sequence features.
\item Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
\item Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
\item Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
\item Code making it easy to split up parallelizable tasks into separate processes.
\item GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
\item Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list.
\item Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.
\end{itemize}
We hope this gives you plenty of reasons to download and start using Biopython!
\section{Installing Biopython}
All of the installation information for Biopython was separated from
this document to make it easier to keep updated.
The short version is go to our downloads page (\url{http://biopython.org/wiki/Download}),
download and install the listed dependencies, then download and install Biopython.
For Windows we provide pre-compiled click-and-run installers, while for Unix and other
operating systems you must install from source as described in the included README file.
This is usually as simple as the standard commands:
\begin{verbatim}
python setup.py build
python setup.py test
sudo python setup.py install
\end{verbatim}
\noindent (You can in fact skip the build and test, and go straight to the install --
but its better to make sure everything seems to be working.)
The longer version of our installation instructions covers
installation of Python, Biopython dependencies and Biopython itself.
It is available in PDF
(\url{http://biopython.org/DIST/docs/install/Installation.pdf})
and HTML formats
(\url{http://biopython.org/DIST/docs/install/Installation.html}).
\section{Frequently Asked Questions (FAQ)}
\begin{enumerate}
\item \emph{How do I cite Biopython in a scientific publication?} \\
Please cite our application note \cite[Cock {\textit et al.}, 2009]{cock2009}
and/or one of the publications listed on our website describing specific modules within Biopython.
\item \emph{How should I capitalize ``Biopython''? Is ``BioPython'' OK?} \\
The correct capitalization is ``Biopython'', not ``BioPython'' (even though
that would have matched BioPerl, BioJava and BioRuby).
\item \emph{How do I find out what version of Biopython I have installed?} \\
Use this:
\begin{verbatim}
>>> import Bio
>>> print Bio.__version__
...
\end{verbatim}
If the ``\verb|import Bio|'' line fails, Biopython is not installed.
If the second line fails, your version is very out of date.
If the version string ends with a plus, you don't have an official
release, but a snapshot of the in development code.
\item \emph{Where is the latest version of this document?}\\
If you download a Biopython source code archive, it will include the
relevant version in both HTML and PDF formats. The latest published
version of this document (updated at each release) is online:
\begin{itemize}
\item \url{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
\item \url{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
\end{itemize}
If you are using the very latest unreleased code from our repository
you can find copies of the in-progress tutorial here:
\begin{itemize}
\item \url{http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html}
\item \url{http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf}
\end{itemize}
\item \emph{Which ``Numerical Python'' do I need?} \\
For Biopython 1.48 or earlier, you needed the old Numeric module.
For Biopython 1.49 onwards, you need the newer NumPy instead.
Both Numeric and NumPy can be installed on the same machine fine.
See also: \url{http://numpy.scipy.org/}
\item \emph{Why is the} \verb|Seq| \emph{object missing the (back) transcription \& translation methods described in this Tutorial?} \\
You need Biopython 1.49 or later. Alternatively, use the \verb|Bio.Seq| module functions described in Section~\ref{sec:seq-module-functions}.
\item \emph{Why is the} \verb|Seq| \emph{object missing the upper \& lower methods described in this Tutorial?} \\
You need Biopython 1.53 or later. Alternatively, use \verb|str(my_seq).upper()| to get an upper case string.
If you need a Seq object, try \verb|Seq(str(my_seq).upper())| but be careful about blindly re-using the same alphabet.
\item \emph{Why does't the} \verb|Seq| \emph{object translation method support the} \verb|cds| \emph{option described in this Tutorial?} \\
You need Biopython 1.51 or later.
\item \emph{Why doesn't} \verb|Bio.SeqIO| \emph{work? It imports fine but there is no parse function etc.} \\
You need Biopython 1.43 or later. Older versions did contain some related code under the \verb|Bio.SeqIO| name which has since been removed - and this is why the import ``works''.
\item \emph{Why doesn't} \verb|Bio.SeqIO.read()| \emph{work? The module imports fine but there is no read function!} \\
You need Biopython 1.45 or later. Or, use \texttt{Bio.SeqIO.parse(...).next()} instead.
\item \emph{Why isn't} \verb|Bio.AlignIO| \emph{present? The module import fails!} \\
You need Biopython 1.46 or later.
\item \emph{What file formats do} \verb|Bio.SeqIO| \emph{and} \verb|Bio.AlignIO| \emph{read and write?} \\
Check the built in docstrings (\texttt{from Bio import SeqIO}, then \texttt{help(SeqIO)}), or see \url{http://biopython.org/wiki/SeqIO} and \url{http://biopython.org/wiki/AlignIO} on the wiki for the latest listing.
\item \emph{Why don't the } \verb|Bio.SeqIO| \emph{and} \verb|Bio.AlignIO| \emph{input functions let me provide a sequence alphabet?} \\
You need Biopython 1.49 or later.
\item \emph{Why won't the } \verb|Bio.SeqIO| \emph{and} \verb|Bio.AlignIO| \emph{functions} \verb|parse|\emph{,} \verb|read| \emph{and} \verb|write| \emph{take filenames? They insist on handles!} \\
You need Biopython 1.54 or later, or just use handles explicitly (see Section~\ref{sec:appendix-handles}).
It is especially important to remember to close output handles explicitly after writing your data.
\item \emph{Why won't the } \verb|Bio.SeqIO.write()| \emph{and} \verb|Bio.AlignIO.write()| \emph{functions accept a single record or alignment? They insist on a list or iterator!} \\
You need Biopython 1.54 or later, or just wrap the item with \verb|[...]| to create a list of one element.
\item \emph{Why doesn't} \verb|str(...)| \emph{give me the full sequence of a} \verb|Seq| \emph{object?} \\
You need Biopython 1.45 or later. Alternatively, rather than \verb|str(my_seq)|, use \verb|my_seq.tostring()| (which will also work on recent versions of Biopython).
\item \emph{Why doesn't} \verb|Bio.Blast| \emph{work with the latest plain text NCBI blast output?} \\
The NCBI keep tweaking the plain text output from the BLAST tools, and keeping our parser up to date is/was an ongoing struggle.
If you aren't using the latest version of Biopython, you could try upgrading.
However, we (and the NCBI) recommend you use the XML output instead, which is designed to be read by a computer program.
\item \emph{Why doesn't} \verb|Bio.Entrez.read()| \emph{work? The module imports fine but there is no read function!} \\
You need Biopython 1.46 or later.
\item \emph{Why doesn't} \verb|Bio.Entrez.parse()| \emph{work? The module imports fine but there is no parse function!} \\
You need Biopython 1.52 or later.
\item \emph{Why has my script using} \verb|Bio.Entrez.efetch()| \emph{stopped working?} \\
This could be due to NCBI changes in February 2012 introducing EFetch 2.0.
First, they changed the default return modes - you probably want to add \verb|retmode="text"| to
your call.
Second, they are now stricter about how to provide a list of IDs -- Biopython 1.59 onwards
turns a list into a comma separated string automatically.
\item \emph{Why doesn't} \verb|Bio.PDB.MMCIFParser| \emph{work? I see an import error about} \verb|MMCIFlex| \\
From Biopython 1.42 to 1.59, the underlying \verb|Bio.PDB.mmCIF.MMCIFlex| module
was not installed by default. It required a third party tool called flex
(fast lexical analyzer generator). This should work with Biopython 1.60 onwards.
\item \emph{Why doesn't} \verb|Bio.Blast.NCBIWWW.qblast()| \emph{give the same results as the NCBI BLAST website?} \\
You need to specify the same options -- the NCBI often adjust the default settings on the website,
and they do not match the QBLAST defaults anymore. Check things like the gap penalties and expectation threshold.
\item \emph{Why doesn't} \verb|Bio.Blast.NCBIXML.read()| \emph{work? The module imports but there is no read function!} \\
You need Biopython 1.50 or later. Or, use \texttt{Bio.Blast.NCBIXML.parse(...).next()} instead.
\item \emph{Why doesn't my} \verb|SeqRecord| \emph{object have a} \verb|letter_annotations| \emph{attribute?} \\
Per-letter-annotation support was added in Biopython 1.50.
\item \emph{Why can't I slice my} \verb|SeqRecord| \emph{to get a sub-record?} \\
You need Biopython 1.50 or later.
\item \emph{Why can't I add} \verb|SeqRecord| \emph{objects together?} \\
You need Biopython 1.53 or later.
\item \emph{Why doesn't} \verb|Bio.SeqIO.convert()| \emph{or} \verb|Bio.AlignIO.convert()| \emph{work? The modules import fine but there is no convert function!} \\
You need Biopython 1.52 or later. Alternatively, combine the \verb|parse| and \verb|write|
functions as described in this tutorial (see Sections~\ref{sec:SeqIO-conversion} and~\ref{sec:converting-alignments}).
\item \emph{Why doesn't} \verb|Bio.SeqIO.index()| \emph{work? The module imports fine but there is no index function!} \\
You need Biopython 1.52 or later.
\item \emph{Why doesn't} \verb|Bio.SeqIO.index_db()| \emph{work? The module imports fine but there is no \texttt{index\_db} function!} \\
You need Biopython 1.57 or later (and a Python with SQLite3 support).
\item \emph{Where is the} \verb|MultipleSeqAlignment| \emph{object? The} \verb|Bio.Align| \emph{module imports fine but this class isn't there!} \\
You need Biopython 1.54 or later. Alternatively, the older \verb|Bio.Align.Generic.Alignment| class supports some of its functionality, but using this is now discouraged.
\item \emph{Why can't I run command line tools directly from the application wrappers?} \\
You need Biopython 1.55 or later. Alternatively, use the Python \verb|subprocess| module directly.
\item \emph{I looked in a directory for code, but I couldn't find the code that does something. Where's it hidden?} \\
One thing to know is that we put code in \verb|__init__.py| files. If you are not used to looking for code in this file this can be confusing. The reason we do this is to make the imports easier for users. For instance, instead of having to do a ``repetitive'' import like \verb|from Bio.GenBank import GenBank|, you can just use \verb|from Bio import GenBank|.
\item \emph{Why does the code from CVS seem out of date?} \\
In late September 2009, just after the release of Biopython 1.52, we switched from using CVS to git, a distributed version control system. The old CVS server will remain available as a static and read only backup, but if you want to grab the latest code, you'll need to use git instead. See our website for more details.
\end{enumerate}
\noindent For more general questions, the Python FAQ pages \url{http://www.python.org/doc/faq/} may be useful.
\chapter{Quick Start -- What can you do with Biopython?}
\label{chapter:quick-start}
This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this section assume that you have some general working knowledge of Python, and that you have successfully installed Biopython on your system. If you think you need to brush up on your Python, the main Python web site provides quite a bit of free documentation to get started with (\url{http://www.python.org/doc/}).
Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in order to run.
Now that that is all out of the way, let's get into what we can do with Biopython.
\section{General overview of what Biopython provides}
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with ``things'' of interest to biologists working on the computer. In general this means that you will need to have at least some programming experience (in Python, of course!) or at least an interest in learning to program. Biopython's job is to make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing it to Biopython, please go ahead!). So Biopython's job is to make you happy!
One thing to note about Biopython is that it often provides multiple ways of ``doing the same thing.'' Things have improved in recent releases, but this can still be frustrating as in Python there should ideally be one right way to do something. However, this can also be a real benefit because it gives you lots of flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative possibilities, look in the Cookbook (Chapter~\ref{chapter:cookbook}, this has some cools tricks and tips), the Advanced section (Chapter~\ref{chapter:advanced}), the built in ``docstrings'' (via the Python help command, or the \href{http://biopython.org/DIST/docs/api/}{API documentation}) or ultimately the code itself.
\section{Working with sequences}
\label{sec:sequences}
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with a quick introduction to the Biopython mechanisms for dealing with sequences, the \verb|Seq| object, which we'll discuss in more detail in Chapter~\ref{chapter:Bio.Seq}.
Most of the time when we think about sequences we have in my mind a string of letters like `\verb|AGTACACTGGT|'. You can create such \verb|Seq| object with this sequence as follows - the ``$>>>$'' represents the Python prompt followed by what you would type in:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
\end{verbatim}
What we have here is a sequence object with a \emph{generic} alphabet - reflecting the fact we have \emph{not} specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!). We'll talk more about alphabets in Chapter~\ref{chapter:Bio.Seq}.
In addition to having an alphabet, the \verb|Seq| object differs from the Python string in the methods it supports. You can't do this with a plain string:
%cont-doctest
\begin{verbatim}
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
\end{verbatim}
The next most important class is the \verb|SeqRecord| or Sequence Record. This holds a sequence (as a \verb|Seq| object) with additional annotation including an identifier, name and description. The \verb|Bio.SeqIO| module for reading and writing sequence file formats works with \verb|SeqRecord| objects, which will be introduced below and covered in more detail by Chapter~\ref{chapter:Bio.SeqIO}.
This covers the basic features and uses of the Biopython sequence class.
Now that you've got some idea of what it is like to interact with the Biopython libraries, it's time to delve into the fun, fun world of dealing with biological file formats!
\section{A usage example}
\label{sec:orchids}
Before we jump right into parsers and everything else to do with Biopython, let's set up an example to motivate everything we do and make life more interesting. After all, if there wasn't any biology in this tutorial, why would you want you read it?
Since I love plants, I think we're just going to have to have a plant based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip to our local greenhouse, we've suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some \href{http://www.flickr.com/search/?q=lady+slipper+orchid&s=int&z=t}{Lady Slipper Orchids photos on Flickr}, or try a \href{http://images.google.com/images?q=lady%20slipper%20orchid}{Google Image Search}).
Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So let's suppose we're thinking about writing a funding proposal to do a molecular study of Lady Slipper evolution, and would like to see what kind of research has already been done and how we can add to that.
% Brad's links to Millicent Orchids are dead now (March 2007)
% http://www.millicentorchids.com/greenhouse/images/papesq01.jpg
% http://www.millicentorchids.com/greenhouse/indexphoto.htm
After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family and are made up of 5 genera: \emph{Cypripedium}, \emph{Paphiopedilum}, \emph{Phragmipedium}, \emph{Selenipedium} and \emph{Mexipedium}.
That gives us enough to get started delving for more information. So, let's look at how the Biopython tools can help us. We'll start with sequence parsing in Section~\ref{sec:sequence-parsing}, but the orchids will be back later on as well - for example we'll search PubMed for papers about orchids and extract sequence data from GenBank in Chapter~\ref{chapter:entrez}, extract data from Swiss-Prot from certain orchid proteins in Chapter~\ref{chapter:swiss_prot}, and work with ClustalW multiple sequence alignments of orchid proteins in Section~\ref{sec:align_clustal}.
\section{Parsing sequence file formats}
\label{sec:sequence-parsing}
A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers.
We are now going to briefly introduce the \verb|Bio.SeqIO| module -- you can find out more in Chapter~\ref{chapter:Bio.SeqIO}. We'll start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we're just using the NCBI website by hand. Let's just take a look through the nucleotide databases at NCBI, using an Entrez online search (\url{http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide}) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids).
When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta}{\tt ls\_orchid.fasta} and \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{\tt ls\_orchid.gbk}, also included with the Biopython source code under {\tt docs/tutorial/examples/}).
% The GenBank version is actually new - it was created by filtering out the original 94 hits from the
% latest search of over 400 hits.
If you run the search today, you'll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or copy them from \verb|docs/examples/| in the Biopython source code. In Section~\ref{sec:connecting-with-biological-databases} we will look at how to do a search like this from within Python.
\subsection{Simple FASTA parsing example}
\label{sec:fasta-parsing}
If you open the lady slipper orchids FASTA file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta}{\tt ls\_orchid.fasta} in your favourite text editor, you'll see that the file starts like this:
\begin{verbatim}
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
\end{verbatim}
It contains 94 records, each has a line starting with ``$>$'' (greater-than symbol) followed by the sequence on one or more lines. Now try this in Python:
\begin{verbatim}
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
\end{verbatim}
\noindent You should get something like this on your screen:
\begin{verbatim}
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
\end{verbatim}
Notice that the FASTA format does not specify the alphabet, so \verb|Bio.SeqIO| has defaulted to the rather generic \verb|SingleLetterAlphabet()| rather than something DNA specific.
\subsection{Simple GenBank parsing example}
Now let's load the GenBank file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{\tt ls\_orchid.gbk} instead - notice that the code to do this is almost identical to the snippet used above for the FASTA file - the only difference is we change the filename and the format string:
\begin{verbatim}
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
\end{verbatim}
\noindent This should give:
\begin{verbatim}
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
\end{verbatim}
This time \verb|Bio.SeqIO| has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You'll also notice that a shorter string has been used as the \verb|seq_record.id| in this case.
\subsection{I love parsing -- please don't stop talking about it!}
Biopython has a lot of parsers, and each has its own little special niches based on the sequence format it is parsing and all of that. Chapter~\ref{chapter:Bio.SeqIO} covers \verb|Bio.SeqIO| in more detail, while Chapter~\ref{chapter:Bio.AlignIO} introduces \verb|Bio.AlignIO| for sequence alignments.
While the most popular file formats have parsers integrated into \verb|Bio.SeqIO| and/or \verb|Bio.AlignIO|, for some of the rarer and unloved file formats there is either no parser at all, or an old parser which has not been linked in yet.
Please also check the wiki pages \url{http://biopython.org/wiki/SeqIO} and \url{http://biopython.org/wiki/AlignIO} for the latest information, or ask on the mailing list. The wiki pages should include an up to date list of supported file types, and some additional examples.
The next place to look for information about specific parsers and how to do cool things with them is in the Cookbook (Chapter~\ref{chapter:cookbook} of this Tutorial). If you don't find the information you are looking for, please consider helping out your poor overworked documentors and submitting a cookbook entry about it! (once you figure out how to do it, that is!)
\section{Connecting with biological databases}
\label{sec:connecting-with-biological-databases}
One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some on-line databases available from Python scripts. Currently, Biopython has code to extract information from the following databases:
\begin{itemize}
\item \href{http://www.ncbi.nlm.nih.gov/Entrez/}{Entrez} (and \href{http://www.ncbi.nlm.nih.gov/PubMed/}{PubMed}) from the NCBI -- See Chapter~\ref{chapter:entrez}.
\item \href{http://www.expasy.org/}{ExPASy} -- See Chapter~\ref{chapter:swiss_prot}.
\item \href{http://scop.mrc-lmb.cam.ac.uk/scop/}{SCOP} -- See the \verb|Bio.SCOP.search()| function.
\end{itemize}
The code in these modules basically makes it easy to write Python code that interact with the CGI scripts on these pages, so that you can get results in an easy to deal with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.
\section{What to do next}
Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing to do now is finish reading this tutorial, and then if you want start snooping around in the source code, and looking at the automatically generated documentation.
Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peak at the Cookbook (Chapter~\ref{chapter:cookbook}), which may have example code to do something similar to what you want to do.
If you know what you want to do, but can't figure out how to do it, please feel free to post questions to the main Biopython list (see \url{http://biopython.org/wiki/Mailing_lists}). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next person do what you want to do.
Enjoy the code!
\chapter{Sequence objects}
\label{chapter:Bio.Seq}
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we'll introduce the Biopython mechanism for dealing with sequences, the \verb|Seq| object.
Chapter~\ref{chapter:SeqRecord} will introduce the related \verb|SeqRecord| object, which combines the sequence information with any annotation, used again in Chapter~\ref{chapter:Bio.SeqIO} for Sequence Input/Output.
Sequences are essentially strings of letters like \verb|AGTACACTGGT|, which seems very natural since this is the most common way that sequences are seen in biological file formats.
There are two important differences between \verb|Seq| objects and standard Python strings.
First of all, they have different methods. Although the \verb|Seq| object supports many of the same methods as a plain string, its \verb|translate()| method differs by doing biological translation, and there are also additional biologically relevant methods like \verb|reverse_complement()|.
Secondly, the \verb|Seq| object has an important attribute, \verb|alphabet|, which is an object describing what the individual characters making up the sequence string ``mean'', and how they should be interpreted. For example, is \verb|AGTACACTGGT| a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines
and Threonines?
\section{Sequences and Alphabets}
The alphabet object is perhaps the important thing that makes the \verb|Seq| object more than just a string. The currently available alphabets for Biopython are defined in the \verb|Bio.Alphabet| module. We'll use the IUPAC alphabets (\url{http://www.chem.qmw.ac.uk/iupac/}) here to deal with some of our favorite objects: DNA, RNA and Proteins.
\verb|Bio.Alphabet.IUPAC| provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements ``U'' (or ``Sec'' for selenocysteine) and ``O'' (or ``Pyl'' for pyrrolysine), plus the ambiguous symbols ``B'' (or ``Asx'' for asparagine or aspartic acid), ``Z'' (or ``Glx'' for glutamine or glutamic acid), ``J'' (or ``Xle'' for leucine isoleucine) and ``X'' (or ``Xxx'' for an unknown amino acid). For DNA you've got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.
The advantages of having an alphabet class are two fold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking.
Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work.
You can create an ambiguous sequence with the default generic alphabet like this:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()
\end{verbatim}
However, where possible you should specify the alphabet explicitly when creating your sequence objects - in this case an unambiguous DNA alphabet object:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
\end{verbatim}
Unless of course, this really is an amino acid sequence:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()
\end{verbatim}
\section{Sequences act like strings}
In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
... print index, letter
0 G
1 A
2 T
3 C
4 G
>>> print len(my_seq)
5
\end{verbatim}
You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!):
%cont-doctest
\begin{verbatim}
>>> print my_seq[0] #first letter
G
>>> print my_seq[2] #third letter
T
>>> print my_seq[-1] #last letter
G
\end{verbatim}
The \verb|Seq| object has a \verb|.count()| method, just like a string.
Note that this means that like a Python string, this gives a
\emph{non-overlapping} count:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
\end{verbatim}
\noindent For some biological uses, you may actually want an overlapping count
(i.e. $3$ in this trivial example). When searching for single letters, this
makes no difference:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
\end{verbatim}
While you could use the above snippet of code to calculate a GC\%, note that the \verb|Bio.SeqUtils| module has several GC functions already built. For example:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
\end{verbatim}
\noindent Note that using the \verb|Bio.SeqUtils.GC()| function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.
Also note that just like a normal Python string, the \verb|Seq| object is in some ways ``read-only''. If you need to edit your sequence, for example simulating a point mutation, look at the Section~\ref{sec:mutable-seq} below which talks about the \verb|MutableSeq| object.
\section{Slicing a sequence}
A more complicated example, let's get a slice of the sequence:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
\end{verbatim}
Two things are interesting to note. First, this follows the normal conventions for Python strings. So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When you do a slice the first item is included (i.e.~4 in this case) and the last is excluded (12 in this case), which is the way things work in Python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what Python does.
The second thing to notice is that the slice is performed on the sequence data string, but the new object produced is another \verb|Seq| object which retains the alphabet information from the original \verb|Seq| object.
Also like a Python string, you can do slices with a start, stop and \emph{stride} (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:
%cont-doctest
\begin{verbatim}
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
\end{verbatim}
Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a \verb|Seq| object too:
%cont-doctest
\begin{verbatim}
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
\end{verbatim}
\section{Turning Seq objects into strings}
\label{sec:seq-to-string}
If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get:
%cont-doctest
\begin{verbatim}
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
\end{verbatim}
Since calling \verb|str()| on a \verb|Seq| object returns the full sequence as a string,
you often don't actually have to do this conversion explicitly.
Python does this automatically with a print statement:
%cont-doctest
\begin{verbatim}
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
\end{verbatim}
You can also use the \verb|Seq| object directly with a \verb|%s| placeholder when using the Python string formatting or interpolation operator (\verb|%|):
%cont-doctest
\begin{verbatim}
>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print fasta_format_string
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
<BLANKLINE>
\end{verbatim}
\noindent This line of code constructs a simple FASTA format record (without worrying about line wrapping).
Section~\ref{sec:SeqRecord-format} describes a neat way to get a FASTA formatted
string from a \verb|SeqRecord| object, while the more general topic of reading and
writing FASTA format sequence files is covered in Chapter~\ref{chapter:Bio.SeqIO}.
\emph{NOTE:} If you are using Biopython 1.44 or older, using \verb|str(my_seq)|
will give just a truncated representation. Instead use \verb|my_seq.tostring()|
(which is still available in the current Biopython releases for backwards compatibility):
%cont-doctest
\begin{verbatim}
>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
\end{verbatim}
\section{Concatenating or adding sequences}
Naturally, you can in principle add any two Seq objects together - just like you can with Python strings to concatenate them. However, you can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence:
%doctest
\begin{verbatim}
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
\end{verbatim}
If you \emph{really} wanted to do this, you'd have to first give both sequences generic alphabets:
%cont-doctest
\begin{verbatim}
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
\end{verbatim}
Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
\end{verbatim}
\section{Changing case}
Python strings have very useful \verb|upper| and \verb|lower| methods for changing the case.
As of Biopython 1.53, the \verb|Seq| object gained similar methods which are alphabet aware.
For example,
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
\end{verbatim}
These are useful for doing case insensitive matching:
%cont-doctest
\begin{verbatim}
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
\end{verbatim}
Note that strictly speaking the IUPAC alphabets are for upper case
sequences only, thus:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
\end{verbatim}
\section{Nucleotide sequences and (reverse) complements}
\label{sec:seq-reverse-complement}
For nucleotide sequences, you can easily obtain the complement or reverse
complement of a \verb|Seq| object using its built-in methods:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
\end{verbatim}
As mentioned earlier, an easy way to just reverse a \verb|Seq| object (or a
Python string) is slice it with -1 step:
%cont-doctest
\begin{verbatim}
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
\end{verbatim}
In all of these operations, the alphabet property is maintained. This is very
useful in case you accidentally end up trying to do something weird like take
the (reverse)complement of a protein sequence:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!
\end{verbatim}
The example in Section~\ref{sec:SeqIO-reverse-complement} combines the \verb|Seq|
object's reverse complement method with \verb|Bio.SeqIO| for sequence input/ouput.
\section{Transcription}
Before talking about transcription, I want to try and clarify the strand issue.
Consider the following (made up) stretch of double stranded DNA which
encodes a short peptide:
\begin{tabular}{rcl}
\\
& {\small DNA coding strand (aka Crick strand, strand $+1$)} & \\
5' & \texttt{ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG} & 3' \\
& \texttt{|||||||||||||||||||||||||||||||||||||||} & \\
3' & \texttt{TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC} & 5' \\
& {\small DNA template strand (aka Watson strand, strand $-1$)} & \\
\\
& {\LARGE $|$} &\\
& Transcription & \\
& {\LARGE $\downarrow$} &\\
\\
5' & \texttt{AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG} & 3' \\
& {\small Single stranded messenger RNA} & \\
\\
\end{tabular}
The actual biological transcription process works from the template strand, doing a reverse complement (TCAG $\rightarrow$ CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T $\rightarrow$ U.
Now let's actually get down to doing a transcription in Biopython. First, let's create \verb|Seq| objects for the coding and template DNA strands:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
\end{verbatim}
\noindent These should match the figure above - remember by convention nucleotide sequences are normally read from the 5' to 3' direction, while in the figure the template strand is shown reversed.
Now let's transcribe the coding strand into the corresponding mRNA, using the \verb|Seq| object's built in \verb|transcribe| method:
%cont-doctest
\begin{verbatim}
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
\end{verbatim}
\noindent As you can see, all this does is switch T $\rightarrow$ U, and adjust the alphabet.
If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:
%cont-doctest
\begin{verbatim}
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
\end{verbatim}
The \verb|Seq| object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U $\rightarrow$ T substitution and associated change of alphabet:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
\end{verbatim}
\emph{Note:} The \verb|Seq| object's \verb|transcribe| and \verb|back_transcribe| methods
were added in Biopython 1.49. For older releases you would have to use the \verb|Bio.Seq|
module's functions instead, see Section~\ref{sec:seq-module-functions}.
\section{Translation}
\label{sec:translation}
Sticking with the same example discussed in the transcription section above,
now let's translate this mRNA into the corresponding protein sequence - again taking
advantage of one of the \verb|Seq| object's biological methods:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
\end{verbatim}
You can also translate directly from the coding strand DNA sequence:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
\end{verbatim}
You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).
The translation tables available in Biopython are based on those \href{http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi}{from the NCBI} (see the next section of this tutorial). By default, translation will use the \emph{standard} genetic code (NCBI table id 1).
Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:
%cont-doctest
\begin{verbatim}
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
\end{verbatim}
You can also specify the table using the NCBI table number which is shorter, and often included in the feature annotation of GenBank files:
%cont-doctest
\begin{verbatim}
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
\end{verbatim}
Now, you may want to translate the nucleotides up to the first in frame stop codon,
and then stop (as happens in nature):
%cont-doctest
\begin{verbatim}
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
\end{verbatim}
\noindent Notice that when you use the \verb|to_stop| argument, the stop codon itself
is not translated - and the stop symbol is not included at the end of your protein
sequence.
You can even specify the stop symbol if you don't like the default asterisk:
%cont-doctest
\begin{verbatim}
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
\end{verbatim}
Now, suppose you have a complete coding sequence CDS, which is to say a
nucleotide sequence (e.g. mRNA -- after any splicing) which is a whole number
of codons (i.e. the length is a multiple of three), commences with a start
codon, ends with a stop codon, and has no internal in-frame stop codons.
In general, given a complete CDS, the default translate method will do what
you want (perhaps with the \verb|to_stop| option). However, what if your
sequence uses a non-standard start codon? This happens a lot in bacteria --
for example the gene yaaX in \texttt{E. coli} K12:
%TODO - handle line wrapping in doctest?
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
... "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
... "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
... "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
... "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
... generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*')
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
\end{verbatim}
\noindent In the bacterial genetic code \texttt{GTG} is a valid start codon,
and while it does \emph{normally} encode Valine, if used as a start codon it
should be translated as methionine. This happens if you tell Biopython your
sequence is a complete CDS:
%TODO - handle line wrapping in doctest?
\begin{verbatim}
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
\end{verbatim}
In addition to telling Biopython to translate an alternative start codon as
methionine, using this option also makes sure your sequence really is a valid
CDS (you'll get an exception if not).
The example in Section~\ref{sec:SeqIO-translate} combines the \verb|Seq| object's
translate method with \verb|Bio.SeqIO| for sequence input/ouput.
\emph{Note:} The \verb|Seq| object's \verb|translate| method is new in Biopython 1.49.
For older releases you would have to use the \verb|Bio.Seq| module's \verb|translate|
function instead, see Section~\ref{sec:seq-module-functions}. The \texttt{cds} option
was added in Biopython 1.51, and there is no simple way to do this with older versions
of Biopython.
\section{Translation Tables}
In the previous sections we talked about the \verb|Seq| object translation method (and mentioned the equivalent function in the \verb|Bio.Seq| module -- see
Section~\ref{sec:seq-module-functions}).
Internally these use codon table objects derived from the NCBI information at
\url{ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt}, also shown on
\url{http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi} in a much more readable layout.
As before, let's just focus on two choices: the Standard translation table, and the
translation table for Vertebrate Mitochondrial DNA.
%doctest
\begin{verbatim}
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
\end{verbatim}
Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:
%cont-doctest
\begin{verbatim}
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
\end{verbatim}
You can compare the actual tables visually by printing them:
%TODO - handle <BLANKLINE> automatically in doctest?
\begin{verbatim}
>>> print standard_table
Table 1 Standard, SGC0
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L(s)| CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I | ACT T | AAT N | AGT S | T
A | ATC I | ACC T | AAC N | AGC S | C
A | ATA I | ACA T | AAA K | AGA R | A
A | ATG M(s)| ACG T | AAG K | AGG R | G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V | GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
\end{verbatim}
\noindent and:
\begin{verbatim}
>>> print mito_table
Table 2 Vertebrate Mitochondrial, SGC1
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA W | A
T | TTG L | TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L | CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T | AAT N | AGT S | T
A | ATC I(s)| ACC T | AAC N | AGC S | C
A | ATA M(s)| ACA T | AAA K | AGA Stop| A
A | ATG M(s)| ACG T | AAG K | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V(s)| GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
\end{verbatim}
You may find these following properties useful -- for example if you are trying
to do your own gene finding:
%cont-doctest
\begin{verbatim}
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
\end{verbatim}
\section{Comparing Seq objects}
\label{sec:seq-comparison}
Sequence comparison is actually a very complicated topic, and there is no easy
way to decide if two sequences are equal. The basic problem is the meaning of
the letters in a sequence are context dependent - the letter ``A'' could be part
of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of
each \verb|Seq| object to try and capture this information - so comparing two
\verb|Seq| objects means considering both the sequence strings \emph{and} the
alphabets.
For example, you might argue that the two DNA \verb|Seq| objects
\texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and
\texttt{Seq("ACGT", IUPAC.ambiguous\_dna)} should be equal, even though
they do have different alphabets. Depending on the context this could be
important.
This gets worse -- suppose you think \texttt{Seq("ACGT",
IUPAC.unambiguous\_dna)} and \texttt{Seq("ACGT")} (i.e. the default generic
alphabet) should be equal. Then, logically, \texttt{Seq("ACGT", IUPAC.protein)}
and \texttt{Seq("ACGT")} should also be equal. Now, in logic if $A=B$ and
$B=C$, by transitivity we expect $A=C$. So for logical consistency we'd
require \texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and \texttt{Seq("ACGT",
IUPAC.protein)} to be equal -- which most people would agree is just not right.
This transitivity problem would also have implications for using \verb|Seq|
objects as Python dictionary keys.
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
\end{verbatim}
So, what does Biopython do? Well, the equality test is the default for Python
objects -- it tests to see if they are the same object in memory. This is a
very strict test:
%TODO - silence FutureWarning from this doctest?
\begin{verbatim}
>>> seq1 == seq2
False
>>> seq1 == seq1
True
\end{verbatim}
If you actually want to do this, you can be more explicit by using the Python
\verb|id| function,
%cont-doctest
\begin{verbatim}
>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
True
\end{verbatim}
Now, in every day use, your sequences will probably all have the same
alphabet, or at least all be the same type of sequence (all DNA, all RNA, or
all protein). What you probably want is to just compare the sequences as
strings -- so do this explicitly:
%cont-doctest
\begin{verbatim}
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True
\end{verbatim}
\noindent As an extension to this, while you can use a Python dictionary with
\verb|Seq| objects as keys, it is generally more useful to use the sequence a
string for the key. See also Section~\ref{sec:seq-to-string}.
\section{MutableSeq objects}
\label{sec:mutable-seq}
Just like the normal Python string, the \verb|Seq| object is ``read only'', or in Python terminology, immutable. Apart from wanting the \verb|Seq| object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
\end{verbatim}
Observe what happens if you try to edit the sequence:
%TODO - This is not a doctest as Python 2.4 output omits the object name.
\begin{verbatim}
>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment
\end{verbatim}
However, you can convert it into a mutable sequence (a \verb|MutableSeq| object) and do pretty much anything you want with it:
%cont-doctest
\begin{verbatim}
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
\end{verbatim}
Alternatively, you can create a \verb|MutableSeq| object directly from a string:
%doctest
\begin{verbatim}
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
\end{verbatim}
Either way will give you a sequence object which can be changed:
%cont-doctest
\begin{verbatim}
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
\end{verbatim}
Do note that unlike the \verb|Seq| object, the \verb|MutableSeq| object's methods like \verb|reverse_complement()| and \verb|reverse()| act in-situ!
An important technical difference between mutable and immutable objects in Python means that you can't use a \verb|MutableSeq| object as a dictionary key, but you can use a Python string or a \verb|Seq| object in this way.
Once you have finished editing your a \verb|MutableSeq| object, it's easy to get back to a read-only \verb|Seq| object should you need to:
%cont-doctest
\begin{verbatim}
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
\end{verbatim}
You can also get a string from a \verb|MutableSeq| object just like from a \verb|Seq| object (Section~\ref{sec:seq-to-string}).
\section{UnknownSeq objects}
Biopython 1.50 introduced another basic sequence object, the \verb|UnknownSeq| object.
This is a subclass of the basic \verb|Seq| object and its purpose is to represent a
sequence where we know the length, but not the actual letters making it up.
You could of course use a normal \verb|Seq| object in this situation, but it wastes
rather a lot of memory to hold a string of a million ``N'' characters when you could
just store a single letter ``N'' and the desired length as an integer.
%doctest
\begin{verbatim}
>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print unk
????????????????????
>>> len(unk)
20
\end{verbatim}
You can of course specify an alphabet, meaning for nucleotide sequences
the letter defaults to ``N'' and for proteins ``X'', rather than just ``?''.
%cont-doctest
\begin{verbatim}
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print unk_dna
NNNNNNNNNNNNNNNNNNNN
\end{verbatim}
You can use all the usual \verb|Seq| object methods too, note these give back
memory saving \verb|UnknownSeq| objects where appropriate as you might expect:
%cont-doctest
\begin{verbatim}
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print unk_protein
XXXXXX
>>> len(unk_protein)
6
\end{verbatim}
You may be able to find a use for the \verb|UnknownSeq| object in your own
code, but it is more likely that you will first come across them in a
\verb|SeqRecord| object created by \verb|Bio.SeqIO|
(see Chapter~\ref{chapter:Bio.SeqIO}).
Some sequence file formats don't always include the actual sequence, for
example GenBank and EMBL files may include a list of features but for the
sequence just present the contig information. Alternatively, the QUAL files
used in sequencing work hold quality scores but they \emph{never} contain a
sequence -- instead there is a partner FASTA file which \emph{does} have the
sequence.
\section{Working with directly strings}
\label{sec:seq-module-functions}
To close this chapter, for those you who \emph{really} don't want to use the sequence
objects (or who prefer a functional programming style to an object orientated one),
there are module level functions in \verb|Bio.Seq| will accept plain Python strings,
\verb|Seq| objects (including \verb|UnknownSeq| objects) or \verb|MutableSeq| objects:
%doctest
\begin{verbatim}
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'
\end{verbatim}
\noindent You are, however, encouraged to work with \verb|Seq| objects by default.
\chapter{Sequence Record objects}
\label{chapter:SeqRecord}
Chapter~\ref{chapter:Bio.Seq} introduced the sequence classes. Immediately ``above'' the \verb|Seq| class is the Sequence Record or \verb|SeqRecord| class, defined in the \verb|Bio.SeqRecord| module. This class allows higher level features such as identifiers and features to be associated with the sequence, and is used thoughout the sequence input/output interface \verb|Bio.SeqIO| described fully in Chapter~\ref{chapter:Bio.SeqIO}.
If you are only going to be working with simple data like FASTA files, you can probably skip this chapter
for now. If on the other hand you are going to be using richly annotated sequence data, say from GenBank
or EMBL files, this information is quite important.
While this chapter should cover most things to do with the \verb|SeqRecord| object in this chapter, you may also want to read the \verb|SeqRecord| wiki page (\url{http://biopython.org/wiki/SeqRecord}), and the built in documentation (also \href{http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html}{online}):
\begin{verbatim}
>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...
\end{verbatim}
\section{The SeqRecord object}
\label{sec:SeqRecord}
The \verb|SeqRecord| (Sequence Record) class is defined in the \verb|Bio.SeqRecord| module. This class allows higher level features such as identifiers and features to be associated with a sequence (see Chapter~\ref{chapter:Bio.Seq}), and is the basic data type for the \verb|Bio.SeqIO| sequence input/output interface (see Chapter~\ref{chapter:Bio.SeqIO}).
The \verb|SeqRecord| class itself is quite simple, and offers the following information as attributes:
\begin{description}
\item[seq] -- The sequence itself, typically a \verb|Seq| object.
\item[id] -- The primary ID used to identify the sequence -- a string. In most cases this is something like an accession number.
\item[name] -- A ``common'' name/id for the sequence -- a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analagous to the LOCUS id in a GenBank record.
\item[description] -- A human readable description or expressive name for the sequence -- a string.
\item[letter\_annotations] -- Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section~\ref{sec:FASTQ-filtering-example}) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
\item[annotations] -- A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more ``unstructed'' information to the sequence.
\item[features] -- A list of \verb|SeqFeature| objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section~\ref{sec:seq_features}.
\item[dbxrefs] - A list of database cross-references as strings.
\end{description}
\section{Creating a SeqRecord}
Using a \verb|SeqRecord| object is not very complicated, since all of the
information is presented as attributes of the class. Usually you won't create
a \verb|SeqRecord| ``by hand'', but instead use \verb|Bio.SeqIO| to read in a
sequence file for you (see Chapter~\ref{chapter:Bio.SeqIO} and the examples
below). However, creating \verb|SeqRecord| can be quite simple.
\subsection{SeqRecord objects from scratch}
To create a \verb|SeqRecord| at a minimum you just need a \verb|Seq| object:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
\end{verbatim}
Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:
%cont-doctest
\begin{verbatim}
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print simple_seq_r.description
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
\end{verbatim}
Including an identifier is very important if you want to output your \verb|SeqRecord| to a file. You would normally include this when creating the object:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
\end{verbatim}
As mentioned above, the \verb|SeqRecord| has an dictionary attribute \verb|annotations|. This is used
for any miscellaneous annotations that doesn't fit under one of the other more specific attributes.
Adding annotations is easy, and just involves dealing directly with the annotation dictionary:
%cont-doctest
\begin{verbatim}
>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
>>> print simple_seq_r.annotations["evidence"]
None. I just made it up.
\end{verbatim}
Working with per-letter-annotations is similar, \verb|letter_annotations| is a
dictionary like attribute which will let you assign any Python sequence (i.e.
a string, list or tuple) which has the same length as the sequence:
%cont-doctest
\begin{verbatim}
>>> simple_seq_r.letter_annotations["phred_quality"] = [40,40,38,30]
>>> print simple_seq_r.letter_annotations
{'phred_quality': [40, 40, 38, 30]}
>>> print simple_seq_r.letter_annotations["phred_quality"]
[40, 40, 38, 30]
\end{verbatim}
The \verb|dbxrefs| and \verb|features| attributes are just Python lists, and
should be used to store strings and \verb|SeqFeature| objects (discussed later
in this chapter) respectively.
%TODO - Update this to show passing in the annotations etc to __init__ with a warning that
%this requires Biopython 1.51 or later?
\subsection{SeqRecord objects from FASTA files}
This example uses a fairly large FASTA file containing the whole sequence for \textit{Yersinia pestis biovar Microtus} str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the Biopython unit tests under the GenBank folder, or online \href{http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna}{\texttt{NC\_005816.fna}} from our website.
The file starts like this - and you can check there is only one record present (i.e. only one line starting with a greater than symbol):
\begin{verbatim}
>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
\end{verbatim}
Back in Chapter~\ref{chapter:quick-start} you will have seen the function \verb|Bio.SeqIO.parse(...)|
used to loop over all the records in a file as \verb|SeqRecord| objects. The \verb|Bio.SeqIO| module
has a sister function for use on files which contain just one record which we'll use here (see Chapter~\ref{chapter:Bio.SeqIO} for details):
%TODO - line wrapping for doctest?
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... sequence',
dbxrefs=[])
\end{verbatim}
Now, let's have a look at the key attributes of this \verb|SeqRecord|
individually -- starting with the \verb|seq| attribute which gives you a
\verb|Seq| object:
\begin{verbatim}
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
\end{verbatim}
\noindent Here \verb|Bio.SeqIO| has defaulted to a generic alphabet, rather
than guessing that this is DNA. If you know in advance what kind of sequence
your FASTA file contains, you can tell \verb|Bio.SeqIO| which alphabet to use
(see Chapter~\ref{chapter:Bio.SeqIO}).
Next, the identifiers and description:
\begin{verbatim}
>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'
\end{verbatim}
As you can see above, the first word of the FASTA record's title line (after
removing the greater than symbol) is used for both the \verb|id| and
\verb|name| attributes. The whole title line (after removing the greater than
symbol) is used for the record description. This is deliberate, partly for
backwards compatibility reasons, but it also makes sense if you have a FASTA
file like this:
\begin{verbatim}
>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
\end{verbatim}
Note that none of the other annotation attributes get populated when reading a
FASTA file:
\begin{verbatim}
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]
\end{verbatim}
In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines. This means it would be possible to parse this information and extract the GI number and accession for example. However, FASTA files from other sources vary, so this isn't possible in general.
\subsection{SeqRecord objects from GenBank files}
As in the previous example, we're going to look at the whole sequence for \textit{Yersinia pestis biovar Microtus} str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file.
Again, this file is included with the Biopython unit tests under the GenBank folder, or online \href{http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb}{\texttt{NC\_005816.gb}} from our website.
This file contains a single record (i.e. only one LOCUS line) and starts:
\begin{verbatim}
LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008
DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.
ACCESSION NC_005816
VERSION NC_005816.1 GI:45478711
PROJECT GenomeProject:10638
...
\end{verbatim}
Again, we'll use \verb|Bio.SeqIO| to read this file in, and the code is almost identical to that for used above for the FASTA file (see Chapter~\ref{chapter:Bio.SeqIO} for details):
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
\end{verbatim}
You should be able to spot some differences already! But taking the attributes individually,
the sequence string is the same as before, but this time \verb|Bio.SeqIO| has been able to automatically assign a more specific alphabet (see Chapter~\ref{chapter:Bio.SeqIO} for details):
\begin{verbatim}
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA())
\end{verbatim}
The \verb|name| comes from the LOCUS line, while the \verb|id| includes the version suffix.
The description comes from the DEFINITION line:
\begin{verbatim}
>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
\end{verbatim}
GenBank files don't have any per-letter annotations:
\begin{verbatim}
>>> record.letter_annotations
{}
\end{verbatim}
Most of the annotations information gets recorded in the \verb|annotations| dictionary, for example:
\begin{verbatim}
>>> len(record.annotations)
11
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'
\end{verbatim}
The \verb|dbxrefs| list gets populated from any PROJECT or DBLINK lines:
\begin{verbatim}
>>> record.dbxrefs
['Project:10638']
\end{verbatim}
Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as \verb|SeqFeature| objects in the \verb|features| list.
\begin{verbatim}
>>> len(record.features)
29
\end{verbatim}
\noindent We'll talk about \verb|SeqFeature| objects next, in
Section~\ref{sec:seq_features}.
\section{SeqFeature objects}
\label{sec:seq_features}
Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more ``abstract'' information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython \verb|SeqFeature| class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you'll probably have an easier time grasping the structure of the Biopython classes.
\subsection{SeqFeatures themselves}
The first level of dealing with sequence features is the \verb|SeqFeature| class itself. This class has a number of attributes, so first we'll list them and their general features, and then work through an example to show how this applies to a real life example, a GenBank feature table. The attributes of a SeqFeature are:
\begin{description}
\item[location] -- The location of the \verb|SeqFeature| on the sequence that you are dealing with. The locations end-points may be fuzzy -- section~\ref{sec:locations} has a lot more description on how to deal with descriptions.
\item[type] -- This is a textual description of the type of feature (for instance, this will be something like `CDS' or `gene').
\item[ref] -- A reference to a different sequence. Some times features may be ``on'' a particular sequence, but may need to refer to a different sequence, and this provides the reference (normally an accession number). A good example of this is a genomic sequence that has most of a coding sequence, but one of the exons is on a different accession. In this case, the feature would need to refer to this different accession for this missing exon. You are most likely to see this in contig GenBank files.
\item[ref\_db] -- This works along with \verb|ref| to provide a cross sequence reference. If there is a reference, \verb|ref_db| will be set as None if the reference is in the same database, and will be set to the name of the database otherwise.
\item[strand] -- The strand on the sequence that the feature is located on. This may either be $1$ for the top strand, $-1$ for the bottom strand, or $0$ or \texttt{None} for both strands (or if it doesn't matter). Keep in mind that this only really makes sense for double stranded DNA, and not for proteins or RNA.
\item[qualifiers] -- This is a Python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be ``evidence'' and the value might be ``computational (non-experimental).'' This is just a way to let the person who is looking at the feature know that it has not be experimentally (i.~e.~in a wet lab) confirmed. Note that other the value will be a list of strings (even when there is only one string). This is a reflection of the feature tables in GenBank/EMBL files.
\item[sub\_features] -- A very important feature of a feature is that it can have additional \verb|sub_features| underneath it. This allows nesting of features, and helps us to deal with things such as the GenBank/EMBL feature lines in a (we hope) intuitive way.
\end{description}
To show an example of SeqFeatures in action, let's take a look at the following feature from a GenBank feature table:
\begin{verbatim}
mRNA complement(join(<49223..49300,49780..>50208))
/gene="F28B23.12"
\end{verbatim}
To look at the easiest attributes of the \verb|SeqFeature| first, if you got a \verb|SeqFeature| object for this it would have it \verb|type| of 'mRNA', a \verb|strand| of -1 (due to the `complement'), and would have None for the \verb|ref| and \verb|ref_db| since there are no references to external databases. The \verb|qualifiers| for this SeqFeature would be a Python dictionary that looked like \verb|{'gene' : ['F28B23.12']}|.
Now let's look at the more tricky part, how the `join' in the location
line is handled. First, the location for the top level \verb|SeqFeature| (the
one we are dealing with right now) is set as going from
\verb|`<49223' to `>50208'| (see section~\ref{sec:locations} for
the nitty gritty on how fuzzy locations like this are handled).
So the location of the top level object is the entire span of the
feature. So, how do you get at the information in the `join'?
Well, that's where the \verb|sub_features| go in.
The \verb|sub_features| attribute will have a list with two \verb|SeqFeature|
objects in it, and these contain the information in the join. Let's
look at \verb|top_level_feature.sub_features[0]| (the first
\verb|sub_feature|). This object is a \verb|SeqFeature| object with a
\verb|type| of `\verb|mRNA|,' a \verb|strand| of -1 (inherited
from the parent \verb|SeqFeature|) and a location going from
\verb|'<49223' to '49300'|.
So, the \verb|sub_features| allow you to get at the internal information if you want it (i.~e.~if you were trying to get only the exons out of a genomic sequence), or just to deal with the broad picture (i.~e.~you just want to know that the coding sequence for a gene lies in a region). Hopefully this structuring makes it easy and intuitive to get at the sometimes complex information that can be contained in a \verb|SeqFeature|.
\subsection{Locations}
\label{sec:locations}
In the section on SeqFeatures above, we skipped over one of the more difficult parts of features, dealing with the locations. The reason this can be difficult is because of fuzziness of the positions in locations. Before we get into all of this, let's just define the vocabulary we'll use to talk about this. Basically there are two terms we'll use:
\begin{description}
\item[position] -- This refers to a single position on a sequence,
which may be fuzzy or not. For instance, 5, 20, \verb|<100| and
\verb|3^5| are all positions.
\item[location] -- A location is two positions that defines a region of a sequence. For instance 5..20 (i.~e.~5 to 20) is a location.
\end{description}
I just mention this because sometimes I get confused between the two.
The complication in dealing with locations comes in the positions themselves. In biology many times things aren't entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are five types of fuzzy positions, so we have five classes do deal with them:
\begin{description}
\item[ExactPosition] -- As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a a number, and you can get the position by looking at the \verb|position| attribute of the object.
\item[BeforePosition] -- This class represents a fuzzy position
that occurs prior to some specified site. In GenBank/EMBL notation,
this is represented as something like \verb|`<13'|, signifying that
the real position is located somewhere less then 13. To get
the specified upper boundary, look at the \verb|position|
attribute of the object.
\item[AfterPosition] -- Contrary to \verb|BeforePosition|, this
class represents a position that occurs after some specified site.
This is represented in GenBank as \verb|`>13'|, and like
\verb|BeforePosition|, you get the boundary number by looking
at the \verb|position| attribute of the object.
\item[WithinPosition] -- This class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as `(1.5)', to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The \verb|position| attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The \verb|extension| attribute specifies the range to the higher boundary, so in this case it would be 4. So \verb|object.position| is the lower boundary and \verb|object.position + object.extension| is the upper boundary.
%TODO - Fix this, a between position 2^3 becomes Python style [2:2]
\item[BetweenPosition] -- This class deals with a position that
occurs between two coordinates. For instance, you might have a
protein binding site that occurs between two nucleotides on a
sequence. This is represented as \verb|`2^3'|, which indicates that
the real position happens between position 2 and 3. Getting
this information from the object is very similar to
\verb|WithinPosition|, the \verb|position| attribute specifies
the lower boundary (2, in this case) and the \verb|extension|
indicates the range to the higher boundary (1 in this case).
\end{description}
Now that we've got all of the types of fuzzy positions we can have taken care of, we are ready to actually specify a location on a sequence. This is handled by the \verb|FeatureLocation| class. An object of this type basically just holds the potentially fuzzy start and end positions of a feature. You can create a \verb|FeatureLocation| object by creating the positions and passing them in:
%doctest
\begin{verbatim}
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
\end{verbatim}
Note that the details of some of the fuzzy-locations changed in Biopython 1.59,
in particular for BetweenPosition and WithinPosition you must now make it explicit
which integer position should be used for slicing etc. For a start position this
is generally the lower (left) value, while for an end position this would generally
be the higher (right) value.
If you print out a \verb|FeatureLocation| object, you can get a nice representation of the information:
\begin{verbatim}
>>> print my_location
[>5:(8^9)]
\end{verbatim}
We can access the fuzzy start and end positions using the start and end attributes of the location:
%cont-doctest
\begin{verbatim}
>>> my_location.start
AfterPosition(5)
>>> print my_location.start
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print my_location.end
(8^9)
\end{verbatim}
If you don't want to deal with fuzzy positions and just want numbers,
they are actually subclasses of integers so should work like integers:
%cont-doctest
\begin{verbatim}
>>> int(my_location.start)
5
>>> int(my_location.end)
9
\end{verbatim}
For compatibility with older versions of Biopython you can ask for the
\verb|nofuzzy_start| and \verb|nofuzzy_end| attributes of the location
which are plain integers:
%cont-doctest
\begin{verbatim}
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9
\end{verbatim}
Notice that this just gives you back the position attributes of the fuzzy locations.
Similary, to make it easy to create a position without worrying about fuzzy positions, you can just pass in numbers to the \verb|FeaturePosition| constructors, and you'll get back out \verb|ExactPosition| objects:
%cont-doctest
\begin{verbatim}
>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print exact_location
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5
\end{verbatim}
That is all of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!
\subsection{Sequence}
A \verb|SeqFeature| object doesn't directly contain a sequence, instead its location (see Section~\ref{sec:locations}) describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be \texttt{complement(6..18)}, like this:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
\end{verbatim}
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement.
If you are using Biopython 1.59 or later, the feature location's start and end are integer like so this works:
%cont-doctest
\begin{verbatim}
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print feature_seq
AGCCTTTGCCGTC
\end{verbatim}
This is a simple example so this isn't too bad -- however once you have to deal with compound features (joins) this is rather messy. Instead, the \verb|SeqFeature| object has an \verb|extract| method to take care of all this:
%cont-doctest
\begin{verbatim}
>>> feature_seq = example_feature.extract(example_parent)
>>> print feature_seq
AGCCTTTGCCGTC
\end{verbatim}
The \verb|extract| method was added in Biopython 1.53, and in Biopython 1.56 the \verb|SeqFeature| was further extended to give its length as that of the region of sequence it describes.
%cont-doctest
\begin{verbatim}
>>> print example_feature.extract(example_parent)
AGCCTTTGCCGTC
>>> print len(example_feature.extract(example_parent))
13
>>> print len(example_feature)
13
\end{verbatim}
\section{Location testing}
As of Biopython 1.56, you can use the Python keyword \verb|in| with a
\verb|SeqFeature| to see if the base/residue for a parent coordinate is
within the feature or not.
For example, suppose you have a SNP of interest and you want to know which
features this SNP is within, and lets suppose this SNP is at index 4350
(Python counting!). Here is a simple brute force solution where we just
check all the features one by one in a loop:
%doctest ../Tests/GenBank
\begin{verbatim}
>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
... if my_snp in feature:
... print feature.type, feature.qualifiers.get('db_xref')
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
\end{verbatim}
Note that gene and CDS features from GenBank or EMBL files defined with joins
are the union of the exons -- they do not cover any introns.
\section{References}
Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython -- we have a \verb|Bio.SeqFeature.Reference| class that stores the relevant information about a reference as attributes of an object.
The attributes include things that you would expect to see in a reference like \verb|journal|, \verb|title| and \verb|authors|. Additionally, it also can hold the \verb|medline_id| and \verb|pubmed_id| and a \verb|comment| about the reference. These are all accessed simply as attributes of the object.
A reference also has a \verb|location| object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The \verb|location| is a potentially fuzzy location, as described in section~\ref{sec:locations}.
Any reference objects are stored as a list in the \verb|SeqRecord| object's \verb|annotations| dictionary under the key ``references''.
That's all there is too it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.
\section{The format method}
\label{sec:SeqRecord-format}
Biopython 1.48 added a new \verb|format()| method to the \verb|SeqRecord| class which
gives a string containing your record formatted using one of the output file formats
supported by \verb|Bio.SeqIO|, such as FASTA:
\begin{verbatim}
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
print record.format("fasta")
\end{verbatim}
\noindent which should give:
\begin{verbatim}
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
\end{verbatim}
This \verb|format| method takes a single mandatory argument, a lower case string which is
supported by \verb|Bio.SeqIO| as an output format (see Chapter~\ref{chapter:Bio.SeqIO}).
However, some of the file formats \verb|Bio.SeqIO| can write to \emph{require} more than
one record (typically the case for multiple sequence alignment formats), and thus won't
work via this \verb|format()| method. See also Section~\ref{sec:Bio.SeqIO-and-StringIO}.
\section{Slicing a SeqRecord}
\label{sec:SeqRecord-slicing}
One of the new features in Biopython 1.50 was the ability to slice a \verb|SeqRecord|,
to give you a new \verb|SeqRecord| covering just part of the sequence. What is important
here is that any per-letter annotations are also sliced, and any features which fall
completely within the new sequence are preserved (with their locations adjusted).
For example, taking the same GenBank file used earlier:
%doctest ../Tests/GenBank
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
\end{verbatim}
%TODO - support line wrapping in doctest
\begin{verbatim}
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
\end{verbatim}
%cont-doctest
\begin{verbatim}
>>> len(record)
9609
>>> len(record.features)
41
\end{verbatim}
For this example we're going to focus in on the \verb|pim| gene, \verb|YP_pPCP05|.
If you have a look at the GenBank file directly you'll find this gene/CDS has
location string \texttt{4343..4780}, or in Python counting \texttt{4342:4780}.
From looking at the file you can work out that these are the twelfth and
thirteenth entries in the file, so in Python zero-based counting they are
entries $11$ and $12$ in the \texttt{features} list:
%cont-doctest
\begin{verbatim}
>>> print record.features[20]
type: gene
location: [4342:4780](+)
qualifiers:
Key: db_xref, Value: ['GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>
\end{verbatim}
%This one is truncated so can't use for doctest
\begin{verbatim}
>>> print record.features[21]
type: CDS
location: [4342:4780](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
Key: product, Value: ['pesticin immunity protein']
Key: protein_id, Value: ['NP_995571.1']
Key: transl_table, Value: ['11']
Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
\end{verbatim}
Let's slice this parent record from 4300 to 4800 (enough to include the \verb|pim|
gene/CDS), and see how many features we get:
%cont-doctest
\begin{verbatim}
>>> sub_record = record[4300:4800]
\end{verbatim}
%TODO - Line wrapping for doctest?
\begin{verbatim}
>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])
\end{verbatim}
%cont-doctest
\begin{verbatim}
>>> len(sub_record)
500
>>> len(sub_record.features)
2
\end{verbatim}
Our sub-record just has two features, the gene and CDS entries for \verb|YP_pPCP05|:
%cont-doctest
\begin{verbatim}
>>> print sub_record.features[0]
type: gene
location: [42:480](+)
qualifiers:
Key: db_xref, Value: ['GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>
\end{verbatim}
\begin{verbatim}
>>> print sub_record.features[20]
type: CDS
location: [42:480](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
Key: product, Value: ['pesticin immunity protein']
Key: protein_id, Value: ['NP_995571.1']
Key: transl_table, Value: ['11']
Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
\end{verbatim}
\noindent Notice that their locations have been adjusted to reflect the new parent sequence!
While Biopython has done something sensible and hopefully intuitive with the features
(and any per-letter annotation), for the other annotation it is impossible to know if
this still applies to the sub-sequence or not. To avoid guessing, the \texttt{annotations}
and \texttt{dbxrefs} are omitted from the sub-record, and it is up to you to transfer
any relevant information as appropriate.
%cont-doctest
\begin{verbatim}
>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]
\end{verbatim}
The same point could be made about the record \texttt{id}, \texttt{name}
and \texttt{description}, but for practicality these are preserved:
%cont-doctest
\begin{verbatim}
>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
\end{verbatim}
\noindent This illustrates the problem nicely though, our new sub-record is
\emph{not} the complete sequence of the plasmid, so the description is wrong!
Let's fix this and then view the sub-record as a reduced GenBank file using
the \texttt{format} method described above in Section~\ref{sec:SeqRecord-format}:
\begin{verbatim}
>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print sub_record.format("genbank")
...
\end{verbatim}
See Sections~\ref{sec:FASTQ-slicing-off-primer}
and~\ref{sec:FASTQ-slicing-off-adaptor} for some FASTQ examples where the
per-letter annotations (the read quality scores) are also sliced.
\section{Adding SeqRecord objects}
\label{sec:SeqRecord-addition}
One of the new features in Biopython 1.53 was the ability to add \verb|SeqRecord| objects
together, giving a new \verb|SeqRecord|. What is important here is that any common
per-letter annotations are also added, all the features are preserved (with their
locations adjusted), and any other common annotation is also kept (like the id, name
and description).
For an example with per-letter annotation, we'll use the first record in a
FASTQ file. Chapter~\ref{chapter:Bio.SeqIO} will explain the \verb|SeqIO| functions:
%doctest ../Tests/Quality
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.parse("example.fastq", "fastq").next()
>>> len(record)
25
>>> print record.seq
CCCTTCTTGTCTTCAGCGTTTCTCC
\end{verbatim}
%TODO - doctest wrapping
\begin{verbatim}
>>> print record.letter_annotations["phred_quality"]
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 26, 23, 23]
\end{verbatim}
\noindent Let's suppose this was Roche 454 data, and that from other information
you think the \texttt{TTT} should be only \texttt{TT}. We can make a new edited
record by first slicing the \verb|SeqRecord| before and after the ``extra''
third \texttt{T}:
%cont-doctest
\begin{verbatim}
>>> left = record[:20]
>>> print left.seq
CCCTTCTTGTCTTCAGCGTT
>>> print left.letter_annotations["phred_quality"]
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26]
>>> right = record[21:]
>>> print right.seq
CTCC
>>> print right.letter_annotations["phred_quality"]
[26, 26, 23, 23]
\end{verbatim}
\noindent Now add the two parts together:
%cont-doctest
\begin{verbatim}
>>> edited = left + right
>>> len(edited)
24
>>> print edited.seq
CCCTTCTTGTCTTCAGCGTTCTCC
\end{verbatim}
\begin{verbatim}
>>> print edited.letter_annotations["phred_quality"]
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 23, 23]
\end{verbatim}
\noindent Easy and intuitive? We hope so! You can make this shorter with just:
%cont-doctest
\begin{verbatim}
>>> edited = record[:20] + record[21:]
\end{verbatim}
Now, for an example with features, we'll use a GenBank file.
Suppose you have a circular genome:
%doctest ../Tests/GenBank
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
\end{verbatim}
%TODO - doctest wrapping
\begin{verbatim}
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
\end{verbatim}
%cont-doctest
\begin{verbatim}
>>> len(record)
9609
>>> len(record.features)
41
>>> record.dbxrefs
['Project:58037']
\end{verbatim}
%TODO - doctest wrapping
\begin{verbatim}
>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
\end{verbatim}
You can shift the origin like this:
%cont-doctest
\begin{verbatim}
>>> shifted = record[2000:] + record[:2000]
\end{verbatim}
%TODO - doctest wrapping
\begin{verbatim}
>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])
\end{verbatim}
%cont-doctest
\begin{verbatim}
>>> len(shifted)
9609
\end{verbatim}
Note that this isn't perfect in that some annotation like the database cross references
and one of the features (the source feature) have been lost:
%cont-doctest
\begin{verbatim}
>>> len(shifted.features)
40
>>> shifted.dbxrefs
[]
>>> shifted.annotations.keys()
[]
\end{verbatim}
This is because the \verb|SeqRecord| slicing step is cautious in what annotation
it preserves (erroneously propagating annotation can cause major problems). If
you want to keep the database cross references or the annotations dictionary,
this must be done explicitly:
\begin{verbatim}
>>> shifted.dbxrefs = record.dbxrefs[:]
>>> shifted.annotations = record.annotations.copy()
>>> shifted.dbxrefs
['Project:10638']
>>> shifted.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']
\end{verbatim}
Also note that in an example like this, you should probably change the record
identifiers since the NCBI references refer to the \emph{original} unmodified
sequence.
\section{Reverse-complementing SeqRecord objects}
\label{sec:SeqRecord-reverse-complement}
One of the new features in Biopython 1.57 was the \verb|SeqRecord| object's
\verb|reverse_complement| method. This tries to balance easy of use with worries
about what to do with the annotation in the reverse complemented record.
For the sequence, this uses the Seq object's reverse complement method. Any
features are transferred with the location and strand recalculated. Likewise
any per-letter-annotation is also copied but reversed (which makes sense for
typical examples like quality scores). However, transfer of most annotation
is problematical.
For instance, if the record ID was an accession, that accession should not really
apply to the reverse complemented sequence, and transferring the identifier by
default could easily cause subtle data corruption in downstream analysis.
Therefore by default, the \verb|SeqRecord|'s id, name, description, annotations
and database cross references are all \emph{not} transferred by default.
The \verb|SeqRecord| object's \verb|reverse_complement| method takes a number
of optional arguments corresponding to properties of the record. Setting these
arguments to \verb|True| means copy the old values, while \verb|False| means
drop the old values and use the default value. You can alternatively provide
the new desired value instead.
Consider this example record:
%doctest ../Tests/GenBank
\begin{verbatim}
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> print record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations)
NC_005816.1 9609 41 1 11
\end{verbatim}
Here we take the reverse complement and specify a new identifier -- but notice
how most of the annotation is dropped (but not the features):
%cont-doctest
\begin{verbatim}
>>> rc = record.reverse_complement(id="TESTING")
>>> print rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)
TESTING 9609 41 0 0
\end{verbatim}
\chapter{Sequence Input/Output}
\label{chapter:Bio.SeqIO}
In this chapter we'll discuss in more detail the \verb|Bio.SeqIO| module, which was briefly introduced in Chapter~\ref{chapter:quick-start} and also used in Chapter~\ref{chapter:SeqRecord}. This aims to provide a simple interface for working with assorted sequence file formats in a uniform way.
See also the \verb|Bio.SeqIO| wiki page (\url{http://biopython.org/wiki/SeqIO}), and the built in documentation (also \href{http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html}{online}):
\begin{verbatim}
>>> from Bio import SeqIO
>>> help(SeqIO)
...
\end{verbatim}
The ``catch'' is that you have to work with \verb|SeqRecord| objects (see Chapter~\ref{chapter:SeqRecord}), which contain a \verb|Seq| object (see Chapter~\ref{chapter:Bio.Seq}) plus annotation like an identifier and description.
\section{Parsing or Reading Sequences}
\label{sec:Bio.SeqIO-input}
The workhorse function \verb|Bio.SeqIO.parse()| is used to read in sequence data as SeqRecord objects. This function expects two arguments:
\begin{enumerate}
\item The first argument is a {\it handle} to read the data from, or a filename. A handle is typically a file opened for reading, but could be the output from a command line program, or data downloaded from the internet (see Section~\ref{sec:SeqIO_Online}). See Section~\ref{sec:appendix-handles} for more about handles.
\item The second argument is a lower case string specifying sequence format -- we don't try and guess the file format for you! See \url{http://biopython.org/wiki/SeqIO} for a full listing of supported formats.
\end{enumerate}
\noindent As of Biopython 1.49 there is an optional argument \verb|alphabet| to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise \verb|Bio.SeqIO| will default to a generic alphabet.
The \verb|Bio.SeqIO.parse()| function returns an {\it iterator} which gives \verb|SeqRecord| objects. Iterators are typically used in a for loop as shown below.
Sometimes you'll find yourself dealing with files which contain only a single record. For this situation Biopython 1.45 introduced the function \verb|Bio.SeqIO.read()| which takes the same arguments. Provided there is one and only one record in the file, this is returned as a \verb|SeqRecord| object. Otherwise an exception is raised.
\subsection{Reading Sequence Files}
In general \verb|Bio.SeqIO.parse()| is used to read in sequence files as \verb|SeqRecord| objects, and is typically used with a for loop like this:
\begin{verbatim}
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
\end{verbatim}
The above example is repeated from the introduction in Section~\ref{sec:sequence-parsing}, and will load the orchid DNA sequences in the FASTA format file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta}{ls\_orchid.fasta}. If instead you wanted to load a GenBank format file like \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{ls\_orchid.gbk} then all you need to do is change the filename and the format string:
\begin{verbatim}
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print seq_record.id
print seq_record.seq
print len(seq_record)
\end{verbatim}
Similarly, if you wanted to read in a file in another file format, then assuming \verb|Bio.SeqIO.parse()| supports it you would just need to change the format string as appropriate, for example ``swiss'' for SwissProt files or ``embl'' for EMBL text files. There is a full listing on the wiki page (\url{http://biopython.org/wiki/SeqIO}) and in the built in documentation (also \href{http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html}{online}).
Another very common way to use a Python iterator is within a list comprehension (or
a generator expression). For example, if all you wanted to extract from the file was
a list of the record identifiers we can easily do this with the following list comprehension:
\begin{verbatim}
>>> from Bio import SeqIO
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', ..., 'Z78439.1']
\end{verbatim}
\noindent There are more examples using \verb|SeqIO.parse()| in a list
comprehension like this in Section~\ref{seq:sequence-parsing-plus-pylab}
(e.g. for plotting sequence lengths or GC\%).
\subsection{Iterating over the records in a sequence file}
In the above examples, we have usually used a for loop to iterate over all the records one by one. You can use the for loop with all sorts of Python objects (including lists, tuples and strings) which support the iteration interface.
The object returned by \verb|Bio.SeqIO| is actually an iterator which returns \verb|SeqRecord| objects. You get to see each record in turn, but once and only once. The plus point is that an iterator can save you memory when dealing with large files.
Instead of using a for loop, can also use the \verb|.next()| method of an iterator to step through the entries, like this:
\begin{verbatim}
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
first_record = record_iterator.next()
print first_record.id
print first_record.description
second_record = record_iterator.next()
print second_record.id
print second_record.description
\end{verbatim}
Note that if you try and use \verb|.next()| and there are no more results, you'll get the special \verb|StopIteration| exception.
One special case to consider is when your sequence files have multiple records, but you only want the first one. In this situation the following code is very concise:
\begin{verbatim}
from Bio import SeqIO
first_record = SeqIO.parse("ls_orchid.gbk", "genbank").next()
\end{verbatim}
A word of warning here -- using the \verb|.next()| method like this will silently ignore any additional records in the file.
If your files have {\it one and only one} record, like some of the online examples later in this chapter, or a GenBank file for a single chromosome, then use the new \verb|Bio.SeqIO.read()| function instead.
This will check there are no extra unexpected records present.
\subsection{Getting a list of the records in a sequence file}
In the previous section we talked about the fact that \verb|Bio.SeqIO.parse()| gives you a \verb|SeqRecord| iterator, and that you get the records one by one. Very often you need to be able to access the records in any order. The Python \verb|list| data type is perfect for this, and we can turn the record iterator into a list of \verb|SeqRecord| objects using the built-in Python function \verb|list()| like so:
\begin{verbatim}
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
print "Found %i records" % len(records)
print "The last record"
last_record = records[-1] #using Python's list tricks
print last_record.id
print repr(last_record.seq)
print len(last_record)
print "The first record"
first_record = records[0] #remember, Python counts from zero
print first_record.id
print repr(first_record.seq)
print len(first_record)
\end{verbatim}
\noindent Giving:
\begin{verbatim}
Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
\end{verbatim}
You can of course still use a for loop with a list of \verb|SeqRecord| objects. Using a list is much more flexible than an iterator (for example, you can determine the number of records from the length of the list), but does need more memory because it will hold all the records in memory at once.
\subsection{Extracting data}
The \verb|SeqRecord| object and its annotation structures are described more fully in
Chapter~\ref{chapter:SeqRecord}. As an example of how annotations are stored, we'll look at the output from parsing the first record in the GenBank file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{ls\_orchid.gbk}.
\begin{verbatim}
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
first_record = record_iterator.next()
print first_record
\end{verbatim}
\noindent That should give something like this:
\begin{verbatim}
ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
\end{verbatim}
This gives a human readable summary of most of the annotation data for the \verb|SeqRecord|.
For this example we're going to use the \verb|.annotations| attribute which is just a Python dictionary.
The contents of this annotations dictionary were shown when we printed the record above.
You can also print them out directly:
\begin{verbatim}
print first_record.annotations
\end{verbatim}
\noindent Like any Python dictionary, you can easily get a list of the keys:
\begin{verbatim}
print first_record.annotations.keys()
\end{verbatim}
\noindent or values:
\begin{verbatim}
print first_record.annotations.values()
\end{verbatim}
In general, the annotation values are strings, or lists of strings. One special case is any references in the file get stored as reference objects.
Suppose you wanted to extract a list of the species from the \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{ls\_orchid.gbk} GenBank file. The information we want, \emph{Cypripedium irapeanum}, is held in the annotations dictionary under `source' and `organism', which we can access like this:
\begin{verbatim}
>>> print first_record.annotations["source"]
Cypripedium irapeanum
\end{verbatim}
\noindent or:
\begin{verbatim}
>>> print first_record.annotations["organism"]
Cypripedium irapeanum
\end{verbatim}
In general, `organism' is used for the scientific name (in Latin, e.g. \textit{Arabidopsis thaliana}),
while `source' will often be the common name (e.g. thale cress). In this example, as is often the case,
the two fields are identical.
Now let's go through all the records, building up a list of the species each orchid sequence is from:
\begin{verbatim}
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
all_species.append(seq_record.annotations["organism"])
print all_species
\end{verbatim}
Another way of writing this code is to use a list comprehension:
\begin{verbatim}
from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in \
SeqIO.parse("ls_orchid.gbk", "genbank")]
print all_species
\end{verbatim}
\noindent In either case, the result is:
% Try and keep this example output line short enough to fit on one page of PDF output:
\begin{verbatim}
['Cypripedium irapeanum', 'Cypripedium californicum', ..., 'Paphiopedilum barbatum']
\end{verbatim}
Great. That was pretty easy because GenBank files are annotated in a standardised way.
Now, let's suppose you wanted to extract a list of the species from a FASTA file, rather than the GenBank file. The bad news is you will have to write some code to extract the data you want from the record's description line - if the information is in the file in the first place! Our example FASTA format file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta}{ls\_orchid.fasta} starts like this:
\begin{verbatim}
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
\end{verbatim}
You can check by hand, but for every record the species name is in the description line as the second word. This means if we break up each record's \verb|.description| at the spaces, then the species is there as field number one (field zero is the record identifier). That means we can do this:
\begin{verbatim}
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
all_species.append(seq_record.description.split()[1])
print all_species
\end{verbatim}
\noindent This gives:
\begin{verbatim}
['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ..., 'P.barbatum']
\end{verbatim}
The concise alternative using list comprehensions would be:
\begin{verbatim}
from Bio import SeqIO
all_species == [seq_record.description.split()[1] for seq_record in \
SeqIO.parse("ls_orchid.fasta", "fasta")]
print all_species
\end{verbatim}
In general, extracting information from the FASTA description line is not very nice.
If you can get your sequences in a well annotated file format like GenBank or EMBL,
then this sort of annotation information is much easier to deal with.
\section{Parsing sequences from compressed files}
\label{sec:SeqIO_compressed}
In the previous section, we looked at parsing sequence data from a file.
Instead of using a filename, you can give \verb|Bio.SeqIO| a handle
(see Section~\ref{sec:appendix-handles}), and in this section
we'll use handles to parse sequence from compressed files.
As you'll have seen above, we can use \verb|Bio.SeqIO.read()| or
\verb|Bio.SeqIO.parse()| with a filename - for instance this quick
example calculates the total length of the sequences in a multiple
record GenBank file using a generator expression:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> print sum(len(r) for r in SeqIO.parse("ls_orchid.gbk", "gb"))
67518
\end{verbatim}
\noindent
Here we use a file handle instead, using the \verb|with| statement
(Python 2.5 or later) to close the handle automatically:
%This doctest won't work on Python 2.5, even with the __future__ import. Odd.
\begin{verbatim}
>>> from __future__ import with_statement #Needed on Python 2.5
>>> from Bio import SeqIO
>>> with open("ls_orchid.gbk") as handle:
... print sum(len(r) for r in SeqIO.parse(handle, "gb"))
67518
\end{verbatim}
\noindent
Or, the old fashioned way where you manually close the handle:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> print sum(len(r) for r in SeqIO.parse(handle, "gb"))
67518
>>> handle.close()
\end{verbatim}
Now, suppose we have a gzip compressed file instead? These are very
commonly used on Linux. We can use Python's \verb|gzip| module to open
the compressed file for reading - which gives us a handle object:
%This doctest fails on Python 3, http://bugs.python.org/issue13989
\begin{verbatim}
>>> import gzip
>>> from Bio import SeqIO
>>> handle = gzip.open("ls_orchid.gbk.gz", "r")
>>> print sum(len(r) for r in SeqIO.parse(handle, "gb"))
67518
>>> handle.close()
\end{verbatim}
Similarly if we had a bzip2 compressed file (sadly the function name isn't
quite as consistent):
%This doctest fails on Python 3
\begin{verbatim}
>>> import bz2
>>> from Bio import SeqIO
>>> handle = bz2.BZ2File("ls_orchid.gbk.bz2", "r")
>>> print sum(len(r) for r in SeqIO.parse(handle, "gb"))
67518
>>> handle.close()
\end{verbatim}
\noindent
If you are using Python 2.7 or later, the \verb|with|-version works for
gzip and bz2 as well. Unfortunately this is broken on older versions of
Python (\href{http://bugs.python.org/issue3860}{Issue 3860}) and you'd
get an \verb|AttributeError| about \verb|__exit__| being missing.
There is a gzip (GNU Zip) variant called BGZF (Blocked GNU Zip Format),
which can be treated like an ordinary gzip file for reading, but has
advantages for random access later which we'll talk about later in
Section~\ref{sec:SeqIO-index-bgzf}.
\section{Parsing sequences from the net}
\label{sec:SeqIO_Online}
In the previous sections, we looked at parsing sequence data from a file
(using a filename or handle), and from compressed files (using a handle).
Here we'll use \verb|Bio.SeqIO| with a another type of handle, a network
connection, to download and parse sequences from the internet.
Note that just because you \emph{can} download sequence data and parse it into
a \verb|SeqRecord| object in one go doesn't mean this is a good idea.
In general, you should probably download sequences \emph{once} and save them to
a file for reuse.
\subsection{Parsing GenBank records from the net}
\label{sec:SeqIO_GenBank_Online}
Section~\ref{sec:efetch} talks about the Entrez EFetch interface in more detail,
but for now let's just connect to the NCBI and get a few \textit{Opuntia} (prickly-pear)
sequences from GenBank using their GI numbers.
First of all, let's fetch just one record. If you don't care about the
annotations and features downloading a FASTA file is a good choice as these
are compact. Now remember, when you expect the handle to contain one and
only one record, use the \verb|Bio.SeqIO.read()| function:
\begin{verbatim}
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291")
seq_record = SeqIO.read(handle, "fasta")
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))
\end{verbatim}
\noindent Expected output:
\begin{verbatim}
gi|6273291|gb|AF191665.1|AF191665 with 0 features
\end{verbatim}
The NCBI will also let you ask for the file in other formats, in particular as
a GenBank file. Until Easter 2009, the Entrez EFetch API let you use ``genbank''
as the return type, however the NCBI now insist on using the official
return types of ``gb'' (or ``gp'' for proteins) as described on
\href{http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html}
{EFetch for Sequence and other Molecular Biology Databases}.
As a result, in Biopython 1.50 onwards, we support ``gb'' as an
alias for ``genbank'' in \verb|Bio.SeqIO|.
\begin{verbatim}
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291")
seq_record = SeqIO.read(handle, "gb") #using "gb" as an alias for "genbank"
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))
\end{verbatim}
\noindent The expected output of this example is:
\begin{verbatim}
AF191665.1 with 3 features
\end{verbatim}
\noindent Notice this time we have three features.
Now let's fetch several records. This time the handle contains multiple records,
so we must use the \verb|Bio.SeqIO.parse()| function:
\begin{verbatim}
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", \
id="6273291,6273290,6273289")
for seq_record in SeqIO.parse(handle, "gb"):
print seq_record.id, seq_record.description[:50] + "..."
print "Sequence length %i," % len(seq_record),
print "%i features," % len(seq_record.features),
print "from: %s" % seq_record.annotations["source"]
handle.close()
\end{verbatim}
\noindent That should give the following output:
\begin{verbatim}
AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Opuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Opuntia bradtianaa
\end{verbatim}
See Chapter~\ref{chapter:entrez} for more about the \verb|Bio.Entrez| module, and make sure to read about the NCBI guidelines for using Entrez (Section~\ref{sec:entrez-guidelines}).
\subsection{Parsing SwissProt sequences from the net}
\label{sec:SeqIO_ExPASy_and_SwissProt}
Now let's use a handle to download a SwissProt file from ExPASy,
something covered in more depth in Chapter~\ref{chapter:swiss_prot}.
As mentioned above, when you expect the handle to contain one and only one record,
use the \verb|Bio.SeqIO.read()| function:
\begin{verbatim}
from Bio import ExPASy
from Bio import SeqIO
handle = ExPASy.get_sprot_raw("O23729")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.id
print seq_record.name
print seq_record.description
print repr(seq_record.seq)
print "Length %i" % len(seq_record)
print seq_record.annotations["keywords"]
\end{verbatim}
\noindent Assuming your network connection is OK, you should get back:
\begin{verbatim}
O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
\end{verbatim}
\section{Sequence files as Dictionaries}
We're now going to introduce three related functions in the \verb|Bio.SeqIO|
module which allow dictionary like random access to a multi-sequence file.
There is a trade off here between flexibility and memory usage. In summary:
\begin{itemize}
\item \verb|Bio.SeqIO.to_dict()| is the most flexible but also the most
memory demanding option (see Section~\ref{SeqIO:to_dict}). This is basically
a helper function to build a normal Python \verb|dictionary| with each entry
held as a \verb|SeqRecord| object in memory, allowing you to modify the
records.
\item \verb|Bio.SeqIO.index()| is a useful middle ground, acting like a
read only dictionary and parsing sequences into \verb|SeqRecord| objects
on demand (see Section~\ref{sec:SeqIO-index}).
\item \verb|Bio.SeqIO.index_db()| also acts like a read only dictionary
but stores the identifiers and file offsets in a file on disk (as an
SQLite3 database), meaning it has very low memory requirements (see
Section~\ref{sec:SeqIO-index-db}), but will be a little bit slower.
\end{itemize}
See the discussion for an broad overview
(Section~\ref{sec:SeqIO-indexing-discussion}).
\subsection{Sequence files as Dictionaries -- In memory}
\label{SeqIO:to_dict}
The next thing that we'll do with our ubiquitous orchid files is to show how
to index them and access them like a database using the Python \verb|dictionary|
data type (like a hash in Perl). This is very useful for moderately large files
where you only need to access certain elements of the file, and makes for a nice
quick 'n dirty database. For dealing with larger files where memory becomes a
problem, see Section~\ref{sec:SeqIO-index} below.
You can use the function \verb|Bio.SeqIO.to_dict()| to make a SeqRecord dictionary
(in memory). By default this will use each record's identifier (i.e. the \verb|.id|
attribute) as the key. Let's try this using our GenBank file:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"))
\end{verbatim}
There is just one required argument for \verb|Bio.SeqIO.to_dict()|, a list or
generator giving \verb|SeqRecord| objects. Here we have just used the output
from the \verb|SeqIO.parse| function. As the name suggests, this returns a
Python dictionary.
Since this variable \verb|orchid_dict| is an ordinary Python dictionary, we can look at all of the keys we have available:
%cont-doctest
\begin{verbatim}
>>> len(orchid_dict)
94
\end{verbatim}
%Can't use following for doctest due to abbreviation
\begin{verbatim}
>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
\end{verbatim}
If you really want to, you can even look at all the records at once:
\begin{verbatim}
>>> orchid_dict.values() #lots of output!
...
\end{verbatim}
We can access a single \verb|SeqRecord| object via the keys and manipulate the object as normal:
%cont-doctest
\begin{verbatim}
>>> seq_record = orchid_dict["Z78475.1"]
>>> print seq_record.description
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA.
>>> print repr(seq_record.seq)
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
\end{verbatim}
So, it is very easy to create an in memory ``database'' of our GenBank records. Next we'll try this for the FASTA file instead.
Note that those of you with prior Python experience should all be able to construct a dictionary like this ``by hand''. However, typical dictionary construction methods will not deal with the case of repeated keys very nicely. Using the \verb|Bio.SeqIO.to_dict()| will explicitly check for duplicate keys, and raise an exception if any are found.
\subsubsection{Specifying the dictionary keys}
\label{seq:seqio-todict-functionkey}
Using the same code as above, but for the FASTA file instead:
\begin{verbatim}
from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"))
print orchid_dict.keys()
\end{verbatim}
\noindent This time the keys are:
\begin{verbatim}
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']
\end{verbatim}
You should recognise these strings from when we parsed the FASTA file earlier in Section~\ref{sec:fasta-parsing}. Suppose you would rather have something else as the keys - like the accession numbers. This brings us nicely to \verb|SeqIO.to_dict()|'s optional argument \verb|key_function|, which lets you define what to use as the dictionary key for your records.
First you must write your own function to return the key you want (as a string) when given a \verb|SeqRecord| object. In general, the details of function will depend on the sort of input records you are dealing with. But for our orchids, we can just split up the record's identifier using the ``pipe'' character (the vertical line) and return the fourth entry (field three):
\begin{verbatim}
def get_accession(record):
""""Given a SeqRecord, return the accession number as a string.
e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
"""
parts = record.id.split("|")
assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
return parts[3]
\end{verbatim}
\noindent Then we can give this function to the \verb|SeqIO.to_dict()| function to use in building the dictionary:
\begin{verbatim}
from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"), key_function=get_accession)
print orchid_dict.keys()
\end{verbatim}
\noindent Finally, as desired, the new dictionary keys:
\begin{verbatim}
>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
\end{verbatim}
\noindent Not too complicated, I hope!
\subsubsection{Indexing a dictionary using the SEGUID checksum}
To give another example of working with dictionaries of \verb|SeqRecord| objects, we'll use the SEGUID checksum function. This is a relatively recent checksum, and collisions should be very rare (i.e. two different sequences with the same checksum), an improvement on the CRC64 checksum.
Once again, working with the orchids GenBank file:
\begin{verbatim}
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print record.id, seguid(record.seq)
\end{verbatim}
\noindent This should give:
\begin{verbatim}
Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo
Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY
...
Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ
\end{verbatim}
Now, recall the \verb|Bio.SeqIO.to_dict()| function's \verb|key_function| argument expects a function which turns a \verb|SeqRecord| into a string. We can't use the \verb|seguid()| function directly because it expects to be given a \verb|Seq| object (or a string). However, we can use Python's \verb|lambda| feature to create a ``one off'' function to give to \verb|Bio.SeqIO.to_dict()| instead:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> from Bio.SeqUtils.CheckSum import seguid
>>> seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"),
... lambda rec : seguid(rec.seq))
>>> record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
>>> print record.id
Z78532.1
>>> print record.description
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
\end{verbatim}
\noindent That should have retrieved the record {\tt Z78532.1}, the second entry in the file.
\subsection{Sequence files as Dictionaries -- Indexed files}
%\subsection{Indexing really large files}
\label{sec:SeqIO-index}
As the previous couple of examples tried to illustrate, using
\verb|Bio.SeqIO.to_dict()| is very flexible. However, because it holds
everything in memory, the size of file you can work with is limited by your
computer's RAM. In general, this will only work on small to medium files.
For large files, Biopython 1.52 introduced an alternative,
\verb|Bio.SeqIO.index()|, which works a little differently. Although
it still returns a dictionary like object, this does \emph{not} keep
\emph{everything} in memory. Instead, it just records where each record
is within the file -- when you ask for a particular record, it then parses
it on demand.
As an example, let's use the same GenBank file as before:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
\end{verbatim}
%Following is abbr.
\begin{verbatim}
>>> orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
\end{verbatim}
%cont-doctest
\begin{verbatim}
>>> seq_record = orchid_dict["Z78475.1"]
>>> print seq_record.description
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA.
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
\end{verbatim}
\noindent Note that \verb|Bio.SeqIO.index()| won't take a handle,
but only a filename. There are good reasons for this, but it is a little
technical. The second argument is the file format (a lower case string as
used in the other \verb|Bio.SeqIO| functions). You can use many other
simple file formats, including FASTA and FASTQ files (see the example in
Section~\ref{sec:fastq-indexing}). However, alignment
formats like PHYLIP or Clustal are not supported. Finally as an optional
argument you can supply an alphabet, or a key function.
Here is the same example using the FASTA file - all we change is the
filename and the format name:
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta")
>>> len(orchid_dict)
94
>>> orchid_dict.keys()
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']
\end{verbatim}
\subsubsection{Specifying the dictionary keys}
\label{seq:seqio-index-functionkey}
Suppose you want to use the same keys as before? Much like with the
\verb|Bio.SeqIO.to_dict()| example in Section~\ref{seq:seqio-todict-functionkey},
you'll need to write a tiny function to map from the FASTA identifier
(as a string) to the key you want:
\begin{verbatim}
def get_acc(identifier):
""""Given a SeqRecord identifier string, return the accession number as a string.
e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
"""
parts = identifier.split("|")
assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
return parts[3]
\end{verbatim}
\noindent Then we can give this function to the \verb|Bio.SeqIO.index()|
function to use in building the dictionary:
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta", key_function=get_acc)
>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
\end{verbatim}
\noindent Easy when you know how?
\subsubsection{Getting the raw data for a record}
\label{sec:seqio-index-getraw}
The dictionary-like object from \verb|Bio.SeqIO.index()| gives you each
entry as a \verb|SeqRecord| object. However, it is sometimes useful to
be able to get the original raw data straight from the file. For this
reason Biopython 1.54 added a \verb|get_raw()| method which takes a
single argument (the record identifier) and returns a string (extracted
from the file without modification).
A motivating example is extracting a subset of a records from a large
file where either \verb|Bio.SeqIO.write()| does not (yet) support the
output file format (e.g. the plain text SwissProt file format) or
where you need to preserve the text exactly (e.g. GenBank or EMBL
output from Biopython does not yet preserve every last bit of
annotation).
Let's suppose you have download the whole of UniProt in the plain
text SwissPort file format from their FTP site
(\url{ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz})
and uncompressed it as the file \verb|uniprot_sprot.dat|, and you
want to extract just a few records from it:
\begin{verbatim}
>>> from Bio import SeqIO
>>> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
>>> handle = open("selected.dat", "w")
>>> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
... handle.write(uniprot.get_raw(acc))
>>> handle.close()
\end{verbatim}
There is a longer example in Section~\ref{sec:SeqIO-sort} using the
\verb|SeqIO.index()| function to sort a large sequence file (without
loading everything into memory at once).
\subsection{Sequence files as Dictionaries -- Database indexed files}
\label{sec:SeqIO-index-db}
Biopython 1.57 introduced an alternative, \verb|Bio.SeqIO.index_db()|, which
can work on even extremely large files since it stores the record information
as a file on disk (using an SQLite3 database) rather than in memory. Also,
you can index multiple files together (providing all the record identifiers
are unique).
The \verb|Bio.SeqIO.index()| function takes three required arguments:
\begin{itemize}
\item Index filename, we suggest using something ending \texttt{.idx}.
This index file is actually an SQLite3 database.
\item List of sequence filenames to index (or a single filename)
\item File format (lower case string as used in the rest of the
\verb|SeqIO| module).
\end{itemize}
As an example, consider the GenBank flat file releases from the NCBI FTP site,
\url{ftp://ftp.ncbi.nih.gov/genbank/}, which are gzip compressed GenBank files.
As of GenBank release $182$, there are $16$ files making up the viral sequences,
\texttt{gbvrl1.seq}, \ldots, \texttt{gbvrl16.seq}, containing in total almost
one million records. You can index them like this:
\begin{verbatim}
>>> from Bio import SeqIO
>>> files = ["gbvrl%i.seq" % (i+1) for i in range(16)]
>>> gb_vrl = SeqIO.index_db("gbvrl.idx", files, "genbank")
>>> print "%i sequences indexed" % len(gb_vrl)
958086 sequences indexed
\end{verbatim}
That takes about two minutes to run on my machine. If you rerun it then the
index file (here \texttt{gbvrl.idx}) is reloaded in under a second. You can
use the index as a read only Python dictionary - without having to worry
about which file the sequence comes from, e.g.
\begin{verbatim}
>>> print gb_vrl["GQ333173.1"].description
HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial cds.
\end{verbatim}
\subsubsection{Getting the raw data for a record}
Just as with the \verb|Bio.SeqIO.index()| function discussed above in
Section~\ref{sec:seqio-index-getraw}, the dictionary like object also lets you
get at the raw text of each record:
\begin{verbatim}
>>> print gb_vrl.get_raw("GQ333173.1")
LOCUS GQ333173 459 bp DNA linear VRL 21-OCT-2009
DEFINITION HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial
cds.
ACCESSION GQ333173
...
//
\end{verbatim}
\subsection{Indexing compressed files}
\label{sec:SeqIO-index-bgzf}
Very often when you are indexing a sequence file it can be quite large -- so
you may want to compress it on disk. Unfortunately efficient random access
is difficult with the more common file formats like gzip and bzip2. In this
setting, BGZF (Blocked GNU Zip Format) can be very helpful. This is a variant
of gzip (and can be decompressed using standard gzip tools) popularised by
the BAM file format, \href{http://samtools.sourceforge.net/}{samtools}, and
\href{http://samtools.sourceforge.net/tabix.shtml}{tabix}.
To create a BGZF compressed file you can use the command line tool \verb|bgzip|
which comes with samtools. In our examples we use a filename extension
\verb|*.bgz|, so they can be distinguished from normal gzipped files (named
\verb|*.gz|). You can also use the \verb|Bio.bgzf| module to read and write
BGZF files from within Python.
The \verb|Bio.SeqIO.index()| and \verb|Bio.SeqIO.index_db()| can both be
used with BGZF compressed files. For example, if you started with an
uncompressed GenBank file:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
\end{verbatim}
You could compress this (while keeping the original file) at the command
line using the following command -- but don't worry, the compressed file
is already included with the other example files:
\begin{verbatim}
$ bgzip -c ls_orchid.gbk > ls_orchid.gbk.bgz
\end{verbatim}
You can use the compressed file in exactly the same way:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
\end{verbatim}
\noindent
or:
%Don't use doctest as would have to clean up the *.idx file
\begin{verbatim}
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index_db("ls_orchid.gbk.bgz.idx", "ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
\end{verbatim}
The \verb|SeqIO| indexing automatically detects the BGZF compression. Note
that you can't use the same index file for the uncompressed and compressed files.
\subsection{Discussion}
\label{sec:SeqIO-indexing-discussion}
So, which of these methods should you use and why? It depends on what you are
trying to do (and how much data you are dealing with). However, in general
picking \verb|Bio.SeqIO.index()| is a good starting point. If you are dealing
with millions of records, multiple files, or repeated analyses, then look at
\verb|Bio.SeqIO.index_db()|.
Reasons to choose \verb|Bio.SeqIO.to_dict()| over either
\verb|Bio.SeqIO.index()| or \verb|Bio.SeqIO.index_db()| boil down to a need
for flexibility despite its high memory needs. The advantage of storing the
\verb|SeqRecord| objects in memory is they can be changed, added to, or
removed at will. In addition to the downside of high memory consumption,
indexing can also take longer because all the records must be fully parsed.
Both \verb|Bio.SeqIO.index()| and \verb|Bio.SeqIO.index_db()| only parse
records on demand. When indexing, they scan the file once looking for the
start of each record and do as little work as possible to extract the
identifier.
Reasons to choose \verb|Bio.SeqIO.index()| over \verb|Bio.SeqIO.index_db()|
include:
\begin{itemize}
\item Faster to build the index (more noticeable in simple file formats)
\item Slightly faster access as SeqRecord objects (but the difference is only
really noticeable for simple to parse file formats).
\item Can use any immutable Python object as the dictionary keys (e.g. a
tuple of strings, or a frozen set) not just strings.
\item Don't need to worry about the index database being out of date if the
sequence file being indexed has changed.
\end{itemize}
Reasons to choose \verb|Bio.SeqIO.index_db()| over \verb|Bio.SeqIO.index()|
include:
\begin{itemize}
\item Not memory limited -- this is already important with files from second
generation sequencing where 10s of millions of sequences are common, and
using \verb|Bio.SeqIO.index()| can require more than 4GB of RAM and therefore
a 64bit version of Python.
\item Because the index is kept on disk, it can be reused. Although building
the index database file takes longer, if you have a script which will be
rerun on the same datafiles in future, this could save time in the long run.
\item Indexing multiple files together
\item The \verb|get_raw()| method can be much faster, since for most file
formats the length of each record is stored as well as its offset.
\end{itemize}
\section{Writing Sequence Files}
We've talked about using \verb|Bio.SeqIO.parse()| for sequence input (reading files), and now we'll look at \verb|Bio.SeqIO.write()| which is for sequence output (writing files). This is a function taking three arguments: some \verb|SeqRecord| objects, a handle or filename to write to, and a sequence format.
Here is an example, where we start by creating a few \verb|SeqRecord| objects the hard way (by hand, rather than by loading them from a file):
\begin{verbatim}
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
+"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")
rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
+"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
+"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
+"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
+"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
+"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
+"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
id="gi|13925890|gb|AAK49457.1|",
description="chalcone synthase [Nicotiana tabacum]")
my_records = [rec1, rec2, rec3]
\end{verbatim}
\noindent Now we have a list of \verb|SeqRecord| objects, we'll write them to a FASTA format file:
\begin{verbatim}
from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")
\end{verbatim}
\noindent And if you open this file in your favourite text editor it should look like this:
\begin{verbatim}
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT
\end{verbatim}
Suppose you wanted to know how many records the \verb|Bio.SeqIO.write()| function wrote to the handle?
If your records were in a list you could just use \verb|len(my_records)|, however you can't do that when your records come from a generator/iterator. Therefore as of Biopython 1.49, the \verb|Bio.SeqIO.write()| function returns the number of \verb|SeqRecord| objects written to the file.
\emph{Note} - If you tell the \verb|Bio.SeqIO.write()| function to write to a file that already exists, the old file will be overwritten without any warning.
\subsection{Round trips}
Some people like their parsers to be ``round-tripable'', meaning if you read in
a file and write it back out again it is unchanged. This requires that the parser
must extract enough information to reproduce the original file \emph{exactly}.
\verb|Bio.SeqIO| does \emph{not} aim to do this.
As a trivial example, any line wrapping of the sequence data in FASTA files is
allowed. An identical \verb|SeqRecord| would be given from parsing the following
two examples which differ only in their line breaks:
\begin{verbatim}
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA
\end{verbatim}
To make a round-tripable FASTA parser you would need to keep track of where the
sequence line breaks occurred, and this extra information is usually pointless.
Instead Biopython uses a default line wrapping of $60$ characters on output.
The same problem with white space applies in many other file formats too.
Another issue in some cases is that Biopython does not (yet) preserve every
last bit of annotation (e.g. GenBank and EMBL).
Occasionally preserving the original layout (with any quirks it may have) is
important. See Section~\ref{sec:seqio-index-getraw} about the \verb|get_raw()|
method of the \verb|Bio.SeqIO.index()| dictionary-like object for one potential
solution.
\subsection{Converting between sequence file formats}
\label{sec:SeqIO-conversion}
In previous example we used a list of \verb|SeqRecord| objects as input to the \verb|Bio.SeqIO.write()| function, but it will also accept a \verb|SeqRecord| iterator like we get from \verb|Bio.SeqIO.parse()| -- this lets us do file conversion by combining these two functions.
For this example we'll read in the GenBank format file \href{http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk}{ls\_orchid.gbk} and write it out in FASTA format:
\begin{verbatim}
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print "Converted %i records" % count
\end{verbatim}
Still, that is a little bit complicated. So, because file conversion is such a
common task, Biopython 1.52 introduced a helper function letting you replace
that with just:
\begin{verbatim}
from Bio import SeqIO
count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")
print "Converted %i records" % count
\end{verbatim}
The \verb|Bio.SeqIO.convert()| function will take handles \emph{or} filenames.
Watch out though -- if the output file already exists, it will overwrite it!
To find out more, see the built in help:
\begin{verbatim}
>>> from Bio import SeqIO
>>> help(SeqIO.convert)
...
\end{verbatim}
In principle, just by changing the filenames and the format names, this code
could be used to convert between any file formats available in Biopython.
However, writing some formats requires information (e.g. quality scores) which
other files formats don't contain. For example, while you can turn a FASTQ
file into a FASTA file, you can't do the reverse. See also
Sections~\ref{sec:SeqIO-fastq-conversion} and~\ref{sec:SeqIO-fasta-qual-conversion}
in the cookbook chapter which looks at inter-converting between different FASTQ formats.
Finally, as an added incentive for using the \verb|Bio.SeqIO.convert()| function
(on top of the fact your code will be shorter), doing it this way may also be
faster! The reason for this is the convert function can take advantage of
several file format specific optimisations and tricks.
\subsection{Converting a file of sequences to their reverse complements}
\label{sec:SeqIO-reverse-complement}
Suppose you had a file of nucleotide sequences, and you wanted to turn it into a file containing their reverse complement sequences. This time a little bit of work is required to transform the \verb|SeqRecord| objects we get from our input file into something suitable for saving to our output file.
To start with, we'll use \verb|Bio.SeqIO.parse()| to load some nucleotide
sequences from a file, then print out their reverse complements using
the \verb|Seq| object's built in \verb|.reverse_complement()| method (see Section~\ref{sec:seq-reverse-complement}):
\begin{verbatim}
>>> from Bio import SeqIO
>>> for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
... print record.id
... print record.seq.reverse_complement()
\end{verbatim}
Now, if we want to save these reverse complements to a file, we'll need to make \verb|SeqRecord| objects.
We can use the \verb|SeqRecord| object's built in \verb|.reverse_complement()| method (see Section~\ref{sec:SeqRecord-reverse-complement}) but we must decide how to name our new records.
This is an excellent place to demonstrate the power of list comprehensions which make a list in memory:
%doctest examples
\begin{verbatim}
>>> from Bio import SeqIO
>>> records = [rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \
... for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
>>> len(records)
94
\end{verbatim}
\noindent Now list comprehensions have a nice trick up their sleeves, you can add a conditional statement:
%cont-doctest examples
\begin{verbatim}
>>> records = [rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \
... for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec)<700]
>>> len(records)
18
\end{verbatim}
That would create an in memory list of reverse complement records where the sequence length was under 700 base pairs. However, we can do exactly the same with a generator expression - but with the advantage that this does not create a list of all the records in memory at once:
%cont-doctest examples
\begin{verbatim}
>>> records = (rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \
... for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec)<700)
\end{verbatim}
As a complete example:
%not a doctest as would have to remove the output file
\begin{verbatim}
>>> from Bio import SeqIO
>>> records = (rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \
... for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec)<700)
>>> SeqIO.write(records, "rev_comp.fasta", "fasta")
18
\end{verbatim}
There is a related example in Section~\ref{sec:SeqIO-translate}, translating each
record in a FASTA file from nucleotides to amino acids.
\subsection{Getting your SeqRecord objects as formatted strings}
\label{sec:Bio.SeqIO-and-StringIO}
Suppose that you don't really want to write your records to a file or handle -- instead you want a string containing the records in a particular file format. The \verb|Bio.SeqIO| interface is based on handles, but Python has a useful built in module which provides a string based handle.
For an example of how you might use this, let's load in a bunch of \verb|SeqRecord| objects from our orchids GenBank file, and create a string containing the records in FASTA format:
\begin{verbatim}
from Bio import SeqIO
from StringIO import StringIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
fasta_data = out_handle.getvalue()
print fasta_data
\end{verbatim}
This isn't entirely straightforward the first time you see it! On the bright side, for the special case where you would like a string containing a \emph{single} record in a particular file format, Biopython 1.48 added a new \verb|format()| method to the \verb|SeqRecord| class (see Section~\ref{sec:SeqRecord-format}).
Note that although we don't encourage it, you \emph{can} use the \verb|format()| method to write to a file, for example something like this:
\begin{verbatim}
from Bio import SeqIO
out_handle = open("ls_orchid_long.tab", "w")
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
if len(record) > 100:
out_handle.write(record.format("tab"))
out_handle.close()
\end{verbatim}
\noindent While this style of code will work for a simple sequential file format like FASTA or the simple tab separated format used here, it will \emph{not} work for more complex or interlaced file formats. This is why we still recommend using \verb|Bio.SeqIO.write()|, as in the following example:
\begin{verbatim}
from Bio import SeqIO
records = (rec for rec in SeqIO.parse("ls_orchid.gbk", "genbank") if len(rec) > 100)
SeqIO.write(records, "ls_orchid.tab", "tab")
\end{verbatim}
\noindent Making a single call to \verb|SeqIO.write(...)| is also much quicker than
multiple calls to the \verb|SeqRecord.format(...)| method.
\chapter{Multiple Sequence Alignment objects}
\label{chapter:Bio.AlignIO}
This chapter is about Multiple Sequence Alignments, by which we mean a collection of
multiple sequences which have been aligned together -- usually with the insertion of gap
characters, and addition of leading or trailing gaps -- such that all the sequence
strings are the same length. Such an alignment can be regarded as a matrix of letters,
where each row is held as a \verb|SeqRecord| object internally.
We will introduce the \verb|MultipleSeqAlignment| object which holds this kind of data,
and the \verb|Bio.AlignIO| module for reading and writing them as various file formats
(following the design of the \verb|Bio.SeqIO| module from the previous chapter).
Note that both \verb|Bio.SeqIO| and \verb|Bio.AlignIO| can read and write sequence
alignment files. The appropriate choice will depend largely on what you want to do
with the data.
The final part of this chapter is about our command line wrappers for common multiple
sequence alignment tools like ClustalW and MUSCLE.
\section{Parsing or Reading Sequence Alignments}
We have two functions for reading in sequence alignments, \verb|Bio.AlignIO.read()| and \verb|Bio.AlignIO.parse()| which following the convention introduced in \verb|Bio.SeqIO| are for files containing one or multiple alignments respectively.
Using \verb|Bio.AlignIO.parse()| will return an {\it iterator} which gives \verb|MultipleSeqAlignment| objects. Iterators are typically used in a for loop. Examples of situations where you will have multiple different alignments include resampled alignments from the PHYLIP tool \verb|seqboot|, or multiple pairwise alignments from the EMBOSS tools \verb|water| or \verb|needle|, or Bill Pearson's FASTA tools.
However, in many situations you will be dealing with files which contain only a single alignment. In this case, you should use the \verb|Bio.AlignIO.read()| function which returns a single \verb|MultipleSeqAlignment| object.
Both functions expect two mandatory arguments:
\begin{enumerate}
\item The first argument is a {\it handle} to read the data from, typically an open file (see Section~\ref{sec:appendix-handles}), or a filename.
\item The second argument is a lower case string specifying the alignment format. As in \verb|Bio.SeqIO| we don't try and guess the file format for you! See \url{http://biopython.org/wiki/AlignIO} for a full listing of supported formats.
\end{enumerate}
\noindent There is also an optional \verb|seq_count| argument which is discussed in Section~\ref{sec:AlignIO-count-argument} below for dealing with ambiguous file formats which may contain more than one alignment.
Biopython 1.49 introduced a further optional \verb|alphabet| argument allowing you to specify the expected alphabet. This can be useful as many alignment file formats do not explicitly label the sequences as RNA, DNA or protein -- which means \verb|Bio.AlignIO| will default to using a generic alphabet.
\subsection{Single Alignments}
As an example, consider the following annotation rich protein alignment in the PFAM or Stockholm file format:
\begin{verbatim}
# STOCKHOLM 1.0
#=GS COATB_BPIKE/30-81 AC P03620.1
#=GS COATB_BPIKE/30-81 DR PDB; 1ifl ; 1-52;
#=GS Q9T0Q8_BPIKE/1-52 AC Q9T0Q8.1
#=GS COATB_BPI22/32-83 AC P15416.1
#=GS COATB_BPM13/24-72 AC P69541.1
#=GS COATB_BPM13/24-72 DR PDB; 2cpb ; 1-49;
#=GS COATB_BPM13/24-72 DR PDB; 2cps ; 1-49;
#=GS COATB_BPZJ2/1-49 AC P03618.1
#=GS Q9T0Q9_BPFD/1-49 AC Q9T0Q9.1
#=GS Q9T0Q9_BPFD/1-49 DR PDB; 1nh4 A; 1-49;
#=GS COATB_BPIF1/22-73 AC P03619.2
#=GS COATB_BPIF1/22-73 DR PDB; 1ifk ; 1-50;
COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
#=GR COATB_BPIKE/30-81 SS -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH----
Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR COATB_BPM13/24-72 SS ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT--
COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR Q9T0Q9_BPFD/1-49 SS ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--
COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#=GR COATB_BPIF1/22-73 SS XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH---
#=GC SS_cons XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC--
#=GC seq_cons AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA
//
\end{verbatim}
This is the seed alignment for the Phage\_Coat\_Gp8 (PF05371) PFAM entry, downloaded from a now out of date release of PFAM from \url{http://pfam.sanger.ac.uk/}. We can load this file as follows (assuming it has been saved to disk as ``PF05371\_seed.sth'' in the current working directory):
%doctest examples
\begin{verbatim}
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
\end{verbatim}
\noindent This code will print out a summary of the alignment:
%cont-doctest
\begin{verbatim}
>>> print alignment
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
\end{verbatim}
You'll notice in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as \verb|SeqRecord| objects:
%doctest examples
\begin{verbatim}
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print "Alignment length %i" % alignment.get_alignment_length()
Alignment length 52
>>> for record in alignment:
... print "%s - %s" % (record.seq, record.id)
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
\end{verbatim}
You could also use the alignment object's \verb|format| method to show it in a particular file format -- see Section~\ref{sec:alignment-format-method} for details.
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure? Try this:
%cont-doctest
\begin{verbatim}
>>> for record in alignment:
... if record.dbxrefs:
... print record.id, record.dbxrefs
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']
\end{verbatim}
\noindent To have a look at all the sequence annotation, try this:
\begin{verbatim}
>>> for record in alignment:
... print record
\end{verbatim}
Sanger provide a nice web interface at \url{http://pfam.sanger.ac.uk/family?acc=PF05371} which will actually let you download this alignment in several other formats. This is what the file looks like in the FASTA file format:
\begin{verbatim}
>COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
>Q9T0Q8_BPIKE/1-52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
>COATB_BPI22/32-83
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
>COATB_BPM13/24-72
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
>Q9T0Q9_BPFD/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPIF1/22-73
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
\end{verbatim}
\noindent Note the website should have an option about showing gaps as periods (dots) or dashes, we've shown dashes above. Assuming you download and save this as file ``PF05371\_seed.faa'' then you can load it with almost exactly the same code:
\begin{verbatim}
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.faa", "fasta")
print alignment
\end{verbatim}
All that has changed in this code is the filename and the format string. You'll get the same output as before, the sequences and record identifiers are the same.
However, as you should expect, if you check each \verb|SeqRecord| there is no annotation nor database cross-references because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used \verb|Bio.AlignIO| to convert the original Stockholm format file into a FASTA file yourself (see below).
With any supported file format, you can load an alignment in exactly the same way just by changing the format string. For example, use ``phylip'' for PHYLIP files, ``nexus'' for NEXUS files or ``emboss'' for the alignments output by the EMBOSS tools. There is a full listing on the wiki page (\url{http://biopython.org/wiki/AlignIO}) and in the built in documentation (also \href{http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html}{online}):
\begin{verbatim}
>>> from Bio import AlignIO
>>> help(AlignIO)
...
\end{verbatim}
\subsection{Multiple Alignments}
The previous section focused on reading files containing a single alignment. In general however, files can contain more than one alignment, and to read these files we must use the \verb|Bio.AlignIO.parse()| function.
Suppose you have a small alignment in PHYLIP format:
\begin{verbatim}
5 6
Alpha AACAAC
Beta AACCCC
Gamma ACCAAC
Delta CCACCA
Epsilon CCAAAC
\end{verbatim}
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, one of the steps would be to create a set of many resampled alignments using the tool \verb|bootseq|. This would give output something like this, which has been abbreviated for conciseness:
\begin{verbatim}
5 6
Alpha AAACCA
Beta AAACCC
Gamma ACCCCA
Delta CCCAAC
Epsilon CCCAAA
5 6
Alpha AAACAA
Beta AAACCC
Gamma ACCCAA
Delta CCCACC
Epsilon CCCAAA
5 6
Alpha AAAAAC
Beta AAACCC
Gamma AACAAC
Delta CCCCCA
Epsilon CCCAAC
...
5 6
Alpha AAAACC
Beta ACCCCC
Gamma AAAACC
Delta CCCCAA
Epsilon CAAACC
\end{verbatim}
If you wanted to read this in using \verb|Bio.AlignIO| you could use:
\begin{verbatim}
from Bio import AlignIO
alignments = AlignIO.parse("resampled.phy", "phylip")
for alignment in alignments:
print alignment
print
\end{verbatim}
\noindent This would give the following output, again abbreviated for display:
\begin{verbatim}
SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAACCA Alpha
AAACCC Beta
ACCCCA Gamma
CCCAAC Delta
CCCAAA Epsilon
SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAACAA Alpha
AAACCC Beta
ACCCAA Gamma
CCCACC Delta
CCCAAA Epsilon
SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAAAAC Alpha
AAACCC Beta
AACAAC Gamma
CCCCCA Delta
CCCAAC Epsilon
...
SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAAACC Alpha
ACCCCC Beta
AAAACC Gamma
CCCCAA Delta
CAAACC Epsilon
\end{verbatim}
As with the function \verb|Bio.SeqIO.parse()|, using \verb|Bio.AlignIO.parse()| returns an iterator.
If you want to keep all the alignments in memory at once, which will allow you to access them in any order, then turn the iterator into a list:
\begin{verbatim}
from Bio import AlignIO
alignments = list(AlignIO.parse("resampled.phy", "phylip"))
last_align = alignments[-1]
first_align = alignments[0]
\end{verbatim}
\subsection{Ambiguous Alignments}
\label{sec:AlignIO-count-argument}
Many alignment file formats can explicitly store more than one alignment, and the division between each alignment is clear. However, when a general sequence file format has been used there is no such block structure. The most common such situation is when alignments have been saved in the FASTA file format. For example consider the following:
\begin{verbatim}
>Alpha
ACTACGACTAGCTCAG--G
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAGG--
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG
\end{verbatim}
\noindent This could be a single alignment containing six sequences (with repeated identifiers). Or, judging from the identifiers, this is probably two different alignments each with three sequences, which happen to all have the same length.
What about this next example?
\begin{verbatim}
>Alpha
ACTACGACTAGCTCAG--G
>Beta
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG--
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAGG--
>Delta
ACTACGGCTAGCACAGAAG
\end{verbatim}
\noindent Again, this could be a single alignment with six sequences. However this time based on the identifiers we might guess this is three pairwise alignments which by chance have all got the same lengths.
This final example is similar:
\begin{verbatim}
>Alpha
ACTACGACTAGCTCAG--G
>XXX
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>YYY
ACTACGGCAAGCACAGG
>Alpha
--ACTACGAC--TAGCTCAGG
>ZZZ
GGACTACGACAATAGCTCAGG
\end{verbatim}
\noindent In this third example, because of the differing lengths, this cannot be treated as a single alignment containing all six records. However, it could be three pairwise alignments.
Clearly trying to store more than one alignment in a FASTA file is not ideal. However, if you are forced to deal with these as input files \verb|Bio.AlignIO| can cope with the most common situation where all the alignments have the same number of records.
One example of this is a collection of pairwise alignments, which can be produced by the EMBOSS tools \verb|needle| and \verb|water| -- although in this situation, \verb|Bio.AlignIO| should be able to understand their native output using ``emboss'' as the format string.
To interpret these FASTA examples as several separate alignments, we can use \verb|Bio.AlignIO.parse()| with the optional \verb|seq_count| argument which specifies how many sequences are expected in each alignment (in these examples, 3, 2 and 2 respectively).
For example, using the third example as the input data:
\begin{verbatim}
for alignment in AlignIO.parse(handle, "fasta", seq_count=2):
print "Alignment length %i" % alignment.get_alignment_length()
for record in alignment:
print "%s - %s" % (record.seq, record.id)
print
\end{verbatim}
\noindent giving:
\begin{verbatim}
Alignment length 19
ACTACGACTAGCTCAG--G - Alpha
ACTACCGCTAGCTCAGAAG - XXX
Alignment length 17
ACTACGACTAGCTCAGG - Alpha
ACTACGGCAAGCACAGG - YYY
Alignment length 21
--ACTACGAC--TAGCTCAGG - Alpha
GGACTACGACAATAGCTCAGG - ZZZ
\end{verbatim}
Using \verb|Bio.AlignIO.read()| or \verb|Bio.AlignIO.parse()| without the \verb|seq_count| argument would give a single alignment containing all six records for the first two examples. For the third example, an exception would be raised because the lengths differ preventing them being turned into a single alignment.
If the file format itself has a block structure allowing \verb|Bio.AlignIO| to determine the number of sequences in each alignment directly, then the \verb|seq_count| argument is not needed. If it is supplied, and doesn't agree with the file contents, an error is raised.
Note that this optional \verb|seq_count| argument assumes each alignment in the file has the same number of sequences. Hypothetically you may come across stranger situations, for example a FASTA file containing several alignments each with a different number of sequences -- although I would love to hear of a real world example of this. Assuming you cannot get the data in a nicer file format, there is no straight forward way to deal with this using \verb|Bio.AlignIO|. In this case, you could consider reading in the sequences themselves using \verb|Bio.SeqIO| and batching them together to create the alignments as appropriate.
\section{Writing Alignments}
We've talked about using \verb|Bio.AlignIO.read()| and \verb|Bio.AlignIO.parse()| for alignment input (reading files), and now we'll look at \verb|Bio.AlignIO.write()| which is for alignment output (writing files). This is a function taking three arguments: some \verb|MultipleSeqAlignment| objects (or for backwards compatibility the obsolete \verb|Alignment| objects), a handle or filename to write to, and a sequence format.
Here is an example, where we start by creating a few \verb|MultipleSeqAlignment| objects the hard way (by hand, rather than by loading them from a file).
Note we create some \verb|SeqRecord| objects to construct the alignment from.
\begin{verbatim}
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
align1 = MultipleSeqAlignment([
SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
])
align2 = MultipleSeqAlignment([
SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
])
align3 = MultipleSeqAlignment([
SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
])
my_alignments = [align1, align2, align3]
\end{verbatim}
\noindent Now we have a list of \verb|Alignment| objects, we'll write them to a PHYLIP format file:
\begin{verbatim}
from Bio import AlignIO
AlignIO.write(my_alignments, "my_example.phy", "phylip")
\end{verbatim}
\noindent And if you open this file in your favourite text editor it should look like this:
\begin{verbatim}
3 12
Alpha ACTGCTAGCT AG
Beta ACT-CTAGCT AG
Gamma ACTGCTAGDT AG
3 9
Delta GTCAGC-AG
Epislon GACAGCTAG
Zeta GTCAGCTAG
3 13
Eta ACTAGTACAG CTG
Theta ACTAGTACAG CT-
Iota -CTACTACAG GTG
\end{verbatim}
Its more common to want to load an existing alignment, and save that, perhaps after some simple manipulation like removing certain rows or columns.
Suppose you wanted to know how many alignments the \verb|Bio.AlignIO.write()| function wrote to the handle? If your alignments were in a list like the example above, you could just use \verb|len(my_alignments)|, however you can't do that when your records come from a generator/iterator. Therefore as of Biopython 1.49, the \verb|Bio.AlignIO.write()| function returns the number of alignments written to the file.
\emph{Note} - If you tell the \verb|Bio.AlignIO.write()| fun