Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of Gapped in PhyloXML as proxy for if aligned #2944

Open
peterjc opened this issue Jun 2, 2020 · 7 comments
Open

Use of Gapped in PhyloXML as proxy for if aligned #2944

peterjc opened this issue Jun 2, 2020 · 7 comments
Assignees

Comments

@peterjc
Copy link
Member

peterjc commented Jun 2, 2020

For #2046 I am trying to remove the use of the Gapped class, and adding explicit defaulting to "-" as a gap character where necessary.

The to_alignment method tests in test_PhyloXML.py show that the PhyloXML code uses the Gapped alphabet encoded class as a proxy for if the sequence is aligned - possibly meaning it is part of an alignment (but not if it has a gap character).

e.g. this does not work:

$ git diff
diff --git a/Bio/Phylo/PhyloXML.py b/Bio/Phylo/PhyloXML.py
index aaabb0849..5313b6c5c 100644
--- a/Bio/Phylo/PhyloXML.py
+++ b/Bio/Phylo/PhyloXML.py
@@ -1254,7 +1254,7 @@ class Sequence(PhyloElement):
     def from_seqrecord(cls, record, is_aligned=None):
         """Create a new PhyloXML Sequence from a SeqRecord object."""
         if is_aligned is None:
-            is_aligned = isinstance(record.seq.alphabet, Alphabet.Gapped)
+            is_aligned = "-" in record.seq
         params = {
             "accession": Accession(record.id, ""),
             "symbol": record.name,
@@ -1400,10 +1400,7 @@ class Sequence(PhyloElement):
 
     def get_alphabet(self):
         """Get the alphabet for the sequence."""
-        alph = self.alphabets.get(self.type, Alphabet.generic_alphabet)
-        if self.mol_seq and self.mol_seq.is_aligned:
-            return Alphabet.Gapped(alph)
-        return alph
+        return self.alphabets.get(self.type, Alphabet.generic_alphabet
$ git diff
diff --git a/Tests/test_PhyloXML.py b/Tests/test_PhyloXML.py
index 3502e41cf..7025d499e 100644
--- a/Tests/test_PhyloXML.py
+++ b/Tests/test_PhyloXML.py
@@ -685,7 +685,7 @@ class MethodTests(unittest.TestCase):
         self.assertIsInstance(aln, MultipleSeqAlignment)
         self.assertEqual(len(aln), 0)
         # Add sequences to the terminals
-        alphabet = Alphabet.Gapped(Alphabet.generic_dna)
+        alphabet = Alphabet.generic_dna
         for tip, seqstr in zip(tree.get_terminals(), ("AA--TTA", "AA--TTG", "AACCTTC")):
             tip.sequences.append(
                 PX.Sequence.from_seqrecord(
@@ -695,7 +695,7 @@ class MethodTests(unittest.TestCase):
         # Check the alignment
         aln = tree.to_alignment()
         self.assertIsInstance(aln, MultipleSeqAlignment)
-        self.assertEqual(len(aln), 3)
+        self.assertEqual(len(aln), 3, aln)
         self.assertEqual(aln.get_alignment_length(), 7)
 
     # Syntax sugar

The test fails:

======================================================================
FAIL: test_to_alignment (__main__.MethodTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PhyloXML.py", line 698, in test_to_alignment
    self.assertEqual(len(aln), 3, aln)
AssertionError: 2 != 3 : Alignment with 2 rows and 7 columns
AA--TTA :A
AA--TTG :B

----------------------------------------------------------------------

This has left out the ungapped sequence "AACCTTC" which was part of the alignment.

@mdehoon
Copy link
Contributor

mdehoon commented Jun 3, 2020

In test_PhyloXML.py, line 692 you need to add is_aligned=True in the call to PX.Sequence.from_seqrecord.

peterjc added a commit to peterjc/biopython that referenced this issue Jun 3, 2020
@peterjc
Copy link
Member Author

peterjc commented Jun 3, 2020

Change from @mdehoon included on #2951, it works in that the test passes, but is this the right thing to do from the end user perspective? Over to @etal to review...

@etal
Copy link
Contributor

etal commented Jun 6, 2020

It's nice to be able to guess is_aligned from the input SeqRecord if alphabets are used and already carry that information, but I see alphabets are deprecated. Is the alignment/gapped status tracked elsewhere in SeqRecords now, or is it just implied when they come from a MultipleSeqAlignment?

It might be all right to make the is_aligned option mandatory, or default to False. The main use case is to make it easy to read an alignment with AlignIO and then add its sequence records to a tree to serialize as phyloXML, where the individual sequences are stored with each taxon rather than together as a standalone alignment block. I'm not sure what other code is using this within Biopython, besides the unit test.

@peterjc
Copy link
Member Author

peterjc commented Jun 7, 2020

I wasn't aware of any code using Gapped in this way, even my thoughts for a simplified enum for DNA/RNA/Nucleotide/Protein/etc wouldn't cover this. The changes to do with gaps have been making the gap character explicit via an argument when needed.

Note in practise the old Gapped class didn't perfectly map to aligned anyway - you could have sequences with no gaps in an alignment. Being part of a MultipleSeqAlignment does imply that the sequences are aligned (but does not mean they will have gapped characters).

@etal
Copy link
Contributor

etal commented Jun 7, 2020

True, the is_aligned status of a sequence is separate from its alphabet and whether it contains gaps (or Stockholm format's . characters).

I see the MolSeq class in PhyloXML.py allows its is_aligned attribute to be None. Since SeqRecord objects don't carry their own alignment status reliably, it may be best to change the default behavior of from_seqrecord(..., is_aligned=None) to set MolSeq(is_aligned=None) instead of guessing -- i.e. just drop the if is_aligned is None block.

That change might break round-tripping of MultipleSeqAlignment -> Sequence.from_seqrecord() -> Phylogeny.to_alignment(), although I'm not sure if that's supported currently. There could be a missing feature Phylogeny.add_alignment(multiple_seq_alignment) to make it all go smoothly, or clients can continue to handle it on their own.

peterjc added a commit that referenced this issue Jun 7, 2020
* Drop use of Gapped alphabet in PhyloXML

See issue #2944

* Use assertIsNone etc in test_PhyloXML.py

Spotted with flake8-assertive, which also complained
about using assertEqual with True and False which is
more nuanced. See also very similar issues on #2835
@peterjc
Copy link
Member Author

peterjc commented Jun 7, 2020

That sounds worth trying - can I leave it with you?

@peterjc
Copy link
Member Author

peterjc commented Jul 24, 2020

@etal with #2951 merged can we close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants