Document sequence comparison changes

biopython · Nov 19, 2014 · fa9b64a · fa9b64a
1 parent 06c6c9d
commit fa9b64a
Show file tree

Hide file tree

Showing 2 changed files with 55 additions and 33 deletions.
diff --git a/Doc/Tutorial/chapter_introduction.tex b/Doc/Tutorial/chapter_introduction.tex
@@ -186,6 +186,20 @@ \section{Frequently Asked Questions (FAQ)}
   \item \url{http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf}
   \end{itemize}
 
+
+  \item \emph{What is wrong with my sequence comparisons?} \\
+  There was a major change in Biopython 1.65 making the \verb|Seq| and
+  \verb|MutableSeq| classes (and subclasses) use simple string-based
+  comparison (ignoring the alphabet other than if giving a warning),
+  which you can do explicitly with \verb|str(seq1) == str(seq2)|.
+
+  Older versions of Biopython would use instance-based comparison
+  for \verb|Seq| objects which you can do explicitly with
+  \verb|id(seq1) == id(seq2)|.
+
+  If you still need to support old versions of Biopython, use these
+  explicit forms to avoid problems. See Section~\ref{sec:seq-comparison}.
+
   \item \emph{Why is the} \verb|Seq| \emph{object missing the upper \& lower methods described in this Tutorial?} \\
   You need Biopython 1.53 or later.  Alternatively, use \verb|str(my_seq).upper()| to get an upper case string.
   If you need a Seq object, try \verb|Seq(str(my_seq).upper())| but be careful about blindly re-using the same alphabet.

diff --git a/Doc/Tutorial/chapter_seq_objects.tex b/Doc/Tutorial/chapter_seq_objects.tex
@@ -377,7 +377,7 @@ \section{Nucleotide sequences and (reverse) complements}
 object's reverse complement method with \verb|Bio.SeqIO| for sequence input/output.
 
 \section{Transcription}
-Before talking about transcription, I want to try and clarify the strand issue.
+Before talking about transcription, I want to try to clarify the strand issue.
 Consider the following (made up) stretch of double stranded DNA which
 encodes a short peptide:
 
@@ -667,9 +667,9 @@ \section{Comparing Seq objects}
 way to decide if two sequences are equal. The basic problem is the meaning of
 the letters in a sequence are context dependent - the letter ``A'' could be part
 of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of
-each \verb|Seq| object to try and capture this information - so comparing two
-\verb|Seq| objects means considering both the sequence strings \emph{and} the
-alphabets.
+each \verb|Seq| object to try to capture this information - so comparing two
+\verb|Seq| objects could mean considering both the sequence strings \emph{and}
+the alphabets.
 
 For example, you might argue that the two DNA \verb|Seq| objects
 \texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and
@@ -684,53 +684,61 @@ \section{Comparing Seq objects}
 $B=C$, by transitivity we expect $A=C$. So for logical consistency we'd
 require \texttt{Seq("ACGT", IUPAC.unambiguous\_dna)} and \texttt{Seq("ACGT",
 IUPAC.protein)} to be equal -- which most people would agree is just not right.
-This transitivity problem would also have implications for using \verb|Seq|
-objects as Python dictionary keys.
+This transitivity also has implications for using \verb|Seq| objects as
+Python dictionary keys.
 
+Now, in everyday use, your sequences will probably all have the same
+alphabet, or at least all be the same type of sequence (all DNA, all RNA, or
+all protein). What you probably want is to just compare the sequences as
+strings -- which you can do explicitly:
 %doctest
 \begin{verbatim}
 >>> from Bio.Seq import Seq
 >>> from Bio.Alphabet import IUPAC
 >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
->>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
-\end{verbatim}
-
-So, what does Biopython do? Well, the equality test is the default for Python
-objects -- it tests to see if they are the same object in memory. This is a
-very strict test:
-%TODO - silence FutureWarning from this doctest?
-\begin{verbatim}
->>> seq1 == seq2
-False
->>> seq1 == seq1
+>>> seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
+>>> str(seq1) == str(seq2)
+True
+>>> str(seq1) == str(seq1)
 True
 \end{verbatim}
 
-If you actually want to do this, you can be more explicit by using the Python
-\verb|id| function,
+So, what does Biopython do? Well, as of Biopython 1.65, sequence comparison
+only looks at the sequence, essentially ignoring the alphabet:
+
 %cont-doctest
 \begin{verbatim}
->>> id(seq1) == id(seq2)
-False
->>> id(seq1) == id(seq1)
+>>> seq1 == seq2
+True
+>>> seq1 == "ACGT"
 True
 \end{verbatim}
 
-Now, in every day use, your sequences will probably all have the same
-alphabet, or at least all be the same type of sequence (all DNA, all RNA, or
-all protein). What you probably want is to just compare the sequences as
-strings -- so do this explicitly:
-%cont-doctest
+As an extension to this, using sequence objects as keys in a Python dictionary
+is now equivalent to using the sequence as a plain string for the key.
+See also Section~\ref{sec:seq-to-string}.
+
+Note if you compare sequences with incompatible alphabets (e.g. DNA vs RNA,
+or nucleotide versus protein), then you will get a warning but for the
+comparison itself only the string of letters in the sequence is used:
+%Can we do a doctest with a warning?
 \begin{verbatim}
->>> str(seq1) == str(seq2)
-True
->>> str(seq1) == str(seq1)
+>>> from Bio.Seq import Seq
+>>> from Bio.Alphabet import generic_dna, generic_protein
+>>> dna_seq = Seq("ACGT", generic_dna)
+>>> prot_seq = Seq(``ACGT'', generic_protein)
+>>> dna_seq == prot_seq
+BiopythonWarning: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
 True
 \end{verbatim}
 
-\noindent As an extension to this, while you can use a Python dictionary with
-\verb|Seq| objects as keys, it is generally more useful to use the sequence a
-string for the key. See also Section~\ref{sec:seq-to-string}.
+\noindent
+\emph{WARNING:} Older versions of Biopython instead used to check if the
+\verb|Seq| objects were the same object in memory.
+This is important if you need to support scripts on both old and new
+versions of Biopython. Here make the comparison explicit by wrapping
+your sequence objects with either \verb|str(...)| for string based
+comparison or \verb|id(...)| for object instance based comparison.
 
 \section{MutableSeq objects}
 \label{sec:mutable-seq}