Merge pull request #1497 from gregcaporaso/remove-cogent-DNA

removes dependence on cogent's DNA, LoadSeqs, Alignment, DenseAlignment
biocore · Apr 11, 2014 · 99ff358
2 parents 416bfbd + 4c24f84
commit 99ff358

Showing 40 changed files with 445 additions and 909 deletions.
diff --git a/ChangeLog.md b/ChangeLog.md
diff --git a/doc/install/install.rst b/doc/install/install.rst
@@ -18,7 +18,7 @@ As a consequence of this 'pipeline' architecture, **QIIME has a lot of dependenc
 How to not install QIIME
 ========================
 
-Because QIIME is hard to install, we have attempted to shift this burden to the QIIME development group rather than our users by providing virtual machines with QIIME and all of its dependencies pre-installed. We, and third-party developers, have also created several automated installation procedures. These alternatives (`summarized here <../index.html#downloading-and-installing-qiime>`_) allow you to bypass the complex installation procedure and have access to a full, working QIIME installation. 
+Because QIIME is hard to install, we have attempted to shift this burden to the QIIME development group rather than our users by providing virtual machines with QIIME and all of its dependencies pre-installed. We, and third-party developers, have also created several automated installation procedures. These alternatives (`summarized here <../index.html#downloading-and-installing-qiime>`_) allow you to bypass the complex installation procedure and have access to a full, working QIIME installation.
 
 **We highly recommend going with one of these solutions if you're new to QIIME, or just want to test it out to see if it will do what you want.**
 
@@ -91,7 +91,7 @@ The next are python packages not included in Canopy Express. Each of these can b
 * pyqi 0.3.1 (`src_pyqi <https://pypi.python.org/packages/source/p/pyqi/pyqi-0.3.1.tar.gz>`_) (license: BSD)
 * scikit-bio (latest development version) (`src_skbio <https://github.com/biocore/scikit-bio>`_) (license: BSD)
 
-Next, there are two non-python dependencies required for the QIIME base package. These should be installed by following their respective install instructions. 
+Next, there are two non-python dependencies required for the QIIME base package. These should be installed by following their respective install instructions.
 
 * uclust 1.2.22q (`src_uclust <http://www.drive5.com/uclust/downloads1_2_22q.html>`_) See :ref:`uclust install notes <uclust-install>`. (licensed specially for Qiime and PyNAST users)
 * fasttree 2.1.3 (`src_fasttree <http://www.microbesonline.org/fasttree/FastTree-2.1.3.c>`_) See `FastTree install instructions <http://www.microbesonline.org/fasttree/#Install>`_ (license: GPL)
@@ -154,17 +154,17 @@ You should see output that looks like the following::
 	................
 	----------------------------------------------------------------------
 	Ran 16 tests in 0.440s
-	
+
 	OK
 
-This indicates that you have a complete QIIME base install. 
+This indicates that you have a complete QIIME base install.
 
 You should next :ref:`run QIIME's unit tests <run-test-suite>`. You will experience some test failures as a result of not having a full QIIME install. If you have questions about these failures, you should post to the `QIIME Forum <http://forum.qiime.org>`_.
 
 QIIME full install (for access to advanced features in QIIME, and non-default processing pipelines)
 ---------------------------------------------------------------------------------------------------
 
-The dependencies described below will support a full QIIME install. These are grouped by the features that each dependency will provide access to. Installation instructions should be followed for each individual package (e.g., from the project's website or README/INSTALL file). 
+The dependencies described below will support a full QIIME install. These are grouped by the features that each dependency will provide access to. Installation instructions should be followed for each individual package (e.g., from the project's website or README/INSTALL file).
 
 Alignment, tree-building, taxonomy assignment, OTU picking, and other data generation steps (required for non-default processing pipelines):
 
@@ -181,8 +181,6 @@ Alignment, tree-building, taxonomy assignment, OTU picking, and other data gener
 * cdbtools (`src_cdbtools <ftp://occams.dfci.harvard.edu/pub/bio/tgi/software/cdbfasta/cdbfasta.tar.gz>`_)
 * muscle 3.8.31 (`src_muscle <http://www.drive5.com/muscle/downloads.htm>`_) (Public domain)
 * rtax 0.984 (`src_rtax <http://static.davidsoergel.com/rtax-0.984.tgz>`_) (license: BSD)
-* pplacer 1.1 (`src_pplacer <http://matsen.fhcrc.org/pplacer/builds/pplacer-v1.1-Linux.tar.gz>`_) (license: GPL)
-* ParsInsert 1.04 (`src_parsinsert <http://downloads.sourceforge.net/project/parsinsert/ParsInsert.1.04.tgz>`_) (license: GPL)
 * usearch v5.2.236 and/or usearch v6.1 (`src_usearch <http://www.drive5.com/usearch/>`_) (license: see http://www.drive5.com/usearch/nonprofit_form.html) **At this stage two different versions of usearch are supported.** usearch v5.2.236 is referred to as ``usearch`` in QIIME, and usearch v6.1 is referred to as ``usearch61``.
 
 Processing sff files:

diff --git a/doc/scripts/insert_seqs_into_tree.rst b/doc/scripts/insert_seqs_into_tree.rst
diff --git a/qiime/adjust_seq_orientation.py b/qiime/adjust_seq_orientation.py
@@ -12,7 +12,7 @@
 
 from os.path import split, splitext
 from skbio.parse.sequences import parse_fasta
-from cogent import DNA
+from skbio.core.sequence import DNA
 
 usage_str = """usage: %prog [options] {-i INPUT_FASTA_FP}
 
@@ -42,7 +42,7 @@ def rc_fasta_lines(fasta_lines, seq_desc_mapper=append_rc):
     """
     for seq_id, seq in parse_fasta(fasta_lines):
         seq_id = seq_desc_mapper(seq_id)
-        seq = DNA.rc(seq.upper())
+        seq = str(DNA(seq.upper()).rc())
         yield seq_id, seq
     return
 

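For readers unfamiliar with the operation being ported in `rc_fasta_lines`, here is a minimal, library-free sketch of reverse complementing in plain Python. It is not the cogent or scikit-bio implementation: the names `reverse_complement` and `rc_records` and the complement table are assumptions made for illustration, and only unambiguous A/C/G/T plus the gap character are handled.

```python
# Illustrative sketch only; not the cogent or scikit-bio implementation.
# The table covers unambiguous bases and the gap character, nothing else.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    try:
        # Upper-case first (as the diff does), then walk the sequence backwards.
        return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as e:
        raise ValueError("Unsupported character in sequence: %s" % e)


def rc_records(records):
    """Yield (seq_id, reverse-complemented seq) for (seq_id, seq) pairs."""
    for seq_id, seq in records:
        yield seq_id, reverse_complement(seq)
```

As in the changed line above, the result is a plain string, so callers that write FASTA output do not need to know which sequence class produced it.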
diff --git a/qiime/align_seqs.py b/qiime/align_seqs.py
@@ -25,24 +25,24 @@
 from os import remove
 from numpy import median
 
-from cogent import LoadSeqs, DNA
-from cogent.core.alignment import DenseAlignment, SequenceCollection, Alignment
-from cogent.core.sequence import DnaSequence as Dna
-from cogent.parse.rfam import MinimalRfamParser, ChangedSequence
-
 import brokit
 from brokit.infernal import cmalign_from_alignment
 import brokit.clustalw
 import brokit.muscle_v38
 import brokit.mafft
 
+from cogent import DNA as DNA_cogent
+from cogent.parse.rfam import MinimalRfamParser, ChangedSequence
 from skbio.app.util import ApplicationNotFoundError
 from skbio.core.exception import RecordError
 from skbio.parse.sequences import parse_fasta
 
 from qiime.util import (FunctionWithParams,
                         get_qiime_temp_dir)
 
+from skbio.core.alignment import SequenceCollection, Alignment
+from skbio.core.sequence import DNASequence
+from skbio.parse.sequences import parse_fasta
 
 # Load PyNAST if it's available. If it's not, skip it if not but set up
 # to raise errors if the user tries to use it.
@@ -115,7 +115,7 @@ def getResult(self, seq_path):
         seqs = self.getData(seq_path)
         params = dict(
             [(k, v) for (k, v) in self.Params.items() if k.startswith('-')])
-        result = module.align_unaligned_seqs(seqs, moltype=DNA, params=params)
+        result = module.align_unaligned_seqs(seqs, moltype=DNA_cogent, params=params)
         return result
 
     def __call__(self, result_path=None, log_path=None, *args, **kwargs):
@@ -131,7 +131,7 @@ def __init__(self, params):
         """Return new InfernalAligner object with specified params.
         """
         _params = {
-            'moltype': DNA,
+            'moltype': DNA_cogent,
             'Application': 'Infernal',
         }
         _params.update(params)
@@ -156,9 +156,10 @@ def __call__(self, seq_path, result_path=None, log_path=None,
         moltype = self.Params['moltype']
 
         # Need to make separate mapping for unaligned sequences
-        unaligned = SequenceCollection(candidate_sequences, MolType=moltype)
-        int_map, int_keys = unaligned.getIntMap(prefix='unaligned_')
-        int_map = SequenceCollection(int_map, MolType=moltype)
+        unaligned = SequenceCollection.from_fasta_records(
+            candidate_sequences.iteritems(), DNASequence)
+        mapped_seqs, new_to_old_ids = unaligned.int_map(prefix='unaligned_')
+        mapped_seq_tuples = [(k, str(v)) for k,v in mapped_seqs.iteritems()]
 
         # Turn on --gapthresh option in cmbuild to force alignment to full
         # model
@@ -174,7 +175,6 @@ def __call__(self, seq_path, result_path=None, log_path=None,
         # are fragments.
         # Also turn on --gapthresh to use same gapthresh as was used to build
         # model
-
         if cmalign_params is None:
             cmalign_params = {}
         cmalign_params.update({'--sub': True, '--gapthresh': 1.0})
@@ -186,20 +186,23 @@ def __call__(self, seq_path, result_path=None, log_path=None,
         # Align sequences to alignment including alignment gaps.
         aligned, struct_string = cmalign_from_alignment(aln=template_alignment,
                                                         structure_string=struct,
-                                                        seqs=int_map,
+                                                        seqs=mapped_seq_tuples,
                                                         moltype=moltype,
                                                         include_aln=True,
                                                         params=cmalign_params,
                                                         cmbuild_params=cmbuild_params)
 
         # Pull out original sequences from full alignment.
-        infernal_aligned = {}
+        infernal_aligned = []
+        # Get a dict of the identifiers to sequences (note that this is a
+        # cogent alignment object, hence the call to NamedSeqs)
         aligned_dict = aligned.NamedSeqs
-        for key in int_map.Names:
-            infernal_aligned[int_keys.get(key, key)] = aligned_dict[key]
+        for n, o in new_to_old_ids.iteritems():
+            aligned_seq = aligned_dict[n]
+            infernal_aligned.append((o, aligned_seq))
 
         # Create an Alignment object from alignment dict
-        infernal_aligned = Alignment(infernal_aligned, MolType=moltype)
+        infernal_aligned = Alignment.from_fasta_records(infernal_aligned, DNASequence)
 
         if log_path is not None:
             log_file = open(log_path, 'w')
@@ -208,7 +211,7 @@ def __call__(self, seq_path, result_path=None, log_path=None,
 
         if result_path is not None:
             result_file = open(result_path, 'w')
-            result_file.write(infernal_aligned.toFasta())
+            result_file.write(infernal_aligned.to_fasta())
             result_file.close()
             return None
         else:
@@ -248,12 +251,8 @@ def __call__(self, seq_path, result_path=None, log_path=None,
         for seq_id, seq in parse_fasta(open(template_alignment_fp)):
             # replace '.' characters with '-' characters
             template_alignment.append((seq_id, seq.replace('.', '-').upper()))
-        try:
-            template_alignment = LoadSeqs(data=template_alignment, moltype=DNA,
-                                          aligned=DenseAlignment)
-        except KeyError as e:
-            raise KeyError('Only ACGT-. characters can be contained in template alignments.' +
-                           ' The offending character was: %s' % e)
+        template_alignment = Alignment.from_fasta_records(
+                    template_alignment, DNASequence, validate=True)
 
         # initialize_logger
         logger = NastLogger(log_path)
@@ -273,25 +272,28 @@ def __call__(self, seq_path, result_path=None, log_path=None,
 
         logger.record(str(self))
 
+        for i, seq in enumerate(pynast_failed):
+            skb_seq = DNASequence(str(seq), identifier=seq.Name)
+            pynast_failed[i] = skb_seq
+        pynast_failed = SequenceCollection(pynast_failed)
+
+        for i, seq in enumerate(pynast_aligned):
+            skb_seq = DNASequence(str(seq), identifier=seq.Name)
+            pynast_aligned[i] = skb_seq
+        pynast_aligned = Alignment(pynast_aligned)
+
         if failure_path is not None:
             fail_file = open(failure_path, 'w')
-            for seq in pynast_failed:
-                fail_file.write(seq.toFasta())
-                fail_file.write('\n')
+            fail_file.write(pynast_failed.to_fasta())
             fail_file.close()
 
         if result_path is not None:
             result_file = open(result_path, 'w')
-            for seq in pynast_aligned:
-                result_file.write(seq.toFasta())
-                result_file.write('\n')
+            result_file.write(pynast_aligned.to_fasta())
             result_file.close()
             return None
         else:
-            try:
-                return LoadSeqs(data=pynast_aligned, aligned=DenseAlignment)
-            except ValueError:
-                return {}
+            return pynast_aligned
 
 
 def compute_min_alignment_length(seqs_f, fraction=0.75):

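The Infernal hunk above swaps sequence identifiers for short generated ones before alignment and then uses `new_to_old_ids` to restore the originals. A rough plain-Python sketch of that round trip follows. The helper names `int_map_sketch` and `restore_ids`, the numbering that starts at 1, and the return types are assumptions for illustration and do not reproduce scikit-bio's actual `int_map` method.

```python
def int_map_sketch(records, prefix=""):
    """Replace ids in (seq_id, seq) pairs with generated ones.

    Returns (mapped, new_to_old): mapped goes from generated id to
    sequence, new_to_old from generated id back to the original id.
    """
    mapped = {}
    new_to_old = {}
    for i, (seq_id, seq) in enumerate(records, start=1):
        new_id = "%s%d" % (prefix, i)
        mapped[new_id] = seq
        new_to_old[new_id] = seq_id
    return mapped, new_to_old


def restore_ids(aligned, new_to_old):
    """Return (original_id, aligned_seq) pairs for the mapped sequences only.

    `aligned` may hold extra entries (e.g. template sequences); they are
    skipped because they have no entry in new_to_old.
    """
    return [(old, aligned[new]) for new, old in new_to_old.items()]


if __name__ == "__main__":
    mapped, new_to_old = int_map_sketch([("seq A", "ACGT")], prefix="unaligned_")
    print(mapped, new_to_old)
```

Iterating over the lookup rather than over the aligner's output is what lets the hunk pull only the candidate sequences back out of an alignment that also includes the template.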
diff --git a/qiime/assign_taxonomy.py b/qiime/assign_taxonomy.py
@@ -23,8 +23,6 @@
 from cStringIO import StringIO
 from collections import Counter, defaultdict
 
-from cogent import LoadSeqs, DNA
-
 from skbio.app.util import ApplicationNotFoundError
 from skbio.parse.sequences import parse_fasta