Skip to content

Commit

Permalink
Document sequence comparison changes
Browse files Browse the repository at this point in the history
  • Loading branch information
peterjc committed Nov 19, 2014
1 parent 06c6c9d commit fa9b64a
Show file tree
Hide file tree
Showing 2 changed files with 55 additions and 33 deletions.
14 changes: 14 additions & 0 deletions Doc/Tutorial/chapter_introduction.tex
Expand Up @@ -186,6 +186,20 @@ \section{Frequently Asked Questions (FAQ)}
\item \url{http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf}
\end{itemize}


\item \emph{What is wrong with my sequence comparisons?} \\
There was a major change in Biopython 1.65 making the \verb|Seq| and
\verb|MutableSeq| classes (and subclasses) use simple string-based
comparison (ignoring the alphabet other than if giving a warning),
which you can do explicitly with \verb|str(seq1) == str(seq2)|.

Older versions of Biopython would use instance-based comparison
for \verb|Seq| objects which you can do explicitly with
\verb|id(seq1) == id(seq2)|.

If you still need to support old versions of Biopython, use these
explicit forms to avoid problems. See Section~\ref{sec:seq-comparison}.

\item \emph{Why is the} \verb|Seq| \emph{object missing the upper \& lower methods described in this Tutorial?} \\
You need Biopython 1.53 or later. Alternatively, use \verb|str(my_seq).upper()| to get an upper case string.
If you need a Seq object, try \verb|Seq(str(my_seq).upper())| but be careful about blindly re-using the same alphabet.
Expand Down
74 changes: 41 additions & 33 deletions Doc/Tutorial/chapter_seq_objects.tex
Expand Up @@ -377,7 +377,7 @@ \section{Nucleotide sequences and (reverse) complements}
object's reverse complement method with \verb|Bio.SeqIO| for sequence input/output.

\section{Transcription}
Before talking about transcription, I want to try and clarify the strand issue.
Before talking about transcription, I want to try to clarify the strand issue.
Consider the following (made up) stretch of double stranded DNA which
encodes a short peptide:

Expand Down Expand Up @@ -667,9 +667,9 @@ \section{Comparing Seq objects}
way to decide if two sequences are equal. The basic problem is the meaning of
the letters in a sequence are context dependent - the letter ``A'' could be part
of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of
each \verb|Seq| object to try and capture this information - so comparing two
\verb|Seq| objects means considering both the sequence strings \emph{and} the
alphabets.
each \verb|Seq| object to try to capture this information - so comparing two
\verb|Seq| objects could mean considering both the sequence strings \emph{and}
the alphabets.

For example, you might argue that the two DNA \verb|Seq| objects
\texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and
Expand All @@ -684,53 +684,61 @@ \section{Comparing Seq objects}
$B=C$, by transitivity we expect $A=C$. So for logical consistency we'd
require \texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and \texttt{Seq("ACGT",
IUPAC.protein)} to be equal -- which most people would agree is just not right.
This transitivity problem would also have implications for using \verb|Seq|
objects as Python dictionary keys.
This transitivity also has implications for using \verb|Seq| objects as
Python dictionary keys.

Now, in everyday use, your sequences will probably all have the same
alphabet, or at least all be the same type of sequence (all DNA, all RNA, or
all protein). What you probably want is to just compare the sequences as
strings -- which you can do explicitly:
%doctest
\begin{verbatim}
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
\end{verbatim}

So, what does Biopython do? Well, the equality test is the default for Python
objects -- it tests to see if they are the same object in memory. This is a
very strict test:
%TODO - silence FutureWarning from this doctest?
\begin{verbatim}
>>> seq1 == seq2
False
>>> seq1 == seq1
>>> seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True
\end{verbatim}

If you actually want to do this, you can be more explicit by using the Python
\verb|id| function,
So, what does Biopython do? Well, as of Biopython 1.65, sequence comparison
only looks at the sequence, essentially ignoring the alphabet:

%cont-doctest
\begin{verbatim}
>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
>>> seq1 == seq2
True
>>> seq1 == "ACGT"
True
\end{verbatim}

Now, in every day use, your sequences will probably all have the same
alphabet, or at least all be the same type of sequence (all DNA, all RNA, or
all protein). What you probably want is to just compare the sequences as
strings -- so do this explicitly:
%cont-doctest
As an extension to this, using sequence objects as keys in a Python dictionary
is now equivalent to using the sequence as a plain string for the key.
See also Section~\ref{sec:seq-to-string}.

Note if you compare sequences with incompatible alphabets (e.g. DNA vs RNA,
or nucleotide versus protein), then you will get a warning but for the
comparison itself only the string of letters in the sequence is used:
%Can we do a doctest with a warning?
\begin{verbatim}
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> dna_seq = Seq("ACGT", generic_dna)
>>> prot_seq = Seq(``ACGT'', generic_protein)
>>> dna_seq == prot_seq
BiopythonWarning: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
True
\end{verbatim}

\noindent As an extension to this, while you can use a Python dictionary with
\verb|Seq| objects as keys, it is generally more useful to use the sequence a
string for the key. See also Section~\ref{sec:seq-to-string}.
\noindent
\emph{WARNING:} Older versions of Biopython instead used to check if the
\verb|Seq| objects were the same object in memory.
This is important if you need to support scripts on both old and new
versions of Biopython. Here make the comparison explicit by wrapping
your sequence objects with either \verb|str(...)| for string based
comparison or \verb|id(...)| for object instance based comparison.

\section{MutableSeq objects}
\label{sec:mutable-seq}
Expand Down

0 comments on commit fa9b64a

Please sign in to comment.