[patch] Adding GF information from a Stockholm alignment #768

frubino · 2016-02-11T07:44:00Z

Hi,

I needed to parse the Pfam seed alignment and filter by ID/ACC, which are part of the GF set of features, and I noticed that the per-file annotations are not saved. Since the new MultipleSeqAlignment class supports per-alignment annotations via its constructor, and the parser already keeps these lines in a gf dictionary, I only needed to change the line that creates the alignment instance.

Since the other information is stored in different attributes, I used the gf dictionary as the sole per-alignment annotations. Since a feature can span multiple lines, I also joined those lines with a space.

The change is really just one line, which follows.

--- a/Bio/AlignIO/StockholmIO.py
+++ b/Bio/AlignIO/StockholmIO.py
@@ -455,7 +455,7 @@ class StockholmIterator(AlignmentIterator):

                 self._populate_meta_data(id, record)
                 records.append(record)
-            alignment = MultipleSeqAlignment(records, self.alphabet)
+            alignment = MultipleSeqAlignment(records, self.alphabet, annotations=dict((key, ' '.join(value)) for key, value in gf.iteritems()))

             # TODO - Introduce an annotated alignment class?
             # For now, store the annotation a new private property:
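
For readers on Python 3, where `dict.iteritems()` no longer exists, the expression in the patched line can be sketched on its own as below. This is only an illustration of the joining step, not the code that was merged; the helper name `join_gf_lines` and the sample `gf` contents are made up for the example.

```python
def join_gf_lines(gf):
    """Collapse each GF feature's list of lines into one space-joined string."""
    return {key: " ".join(lines) for key, lines in gf.items()}


# Hypothetical parser state: each GF key maps to the list of lines seen for it.
gf = {
    "ID": ["HAT"],
    "AC": ["PF02184.18"],
    "CC": ["First comment line", "continued on a second line"],
}

annotations = join_gf_lines(gf)
print(annotations["ID"])
print(annotations["CC"])
```

Note that after joining, the original line boundaries of multi-line features such as CC can no longer be recovered.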


peterjc · 2016-02-11T09:25:59Z

Good point. See also #357 for recording the per-column annotations.

I don't like discarding the line-split information. With some entries, like CC comments, space joining is fine, but is this really safe in general?

There's a note in the code wondering if we should check #=GF SQ ... sequence count lines. Perhaps also drop it from the annotations dictionary?

To match the GR and GS parsing code, should we have a dictionary like pfam_gr_mapping to map CC to comment etc?

frubino · 2016-02-11T09:55:41Z

To be honest, I went through the code quickly to understand why that information was not there. There are a few more points to check, for sure. I haven't used an alignment in the Stockholm format in a long time, though.

Looking at the code, I think only features that have a one-element list should be converted into strings. That would make checking annotations easier (in my case, the ID is a single value, yet it is stored in a list), while multi-line features are kept as-is.
Dropping the #=GF SQ line after a check would probably be a good idea. On writing, the information can be added again, especially if the alignment has changed in the meantime.
Since there's a comment attribute, yes. Also, regarding those two dictionaries, why not make the mapping a global variable in the module? The risk of the two getting out of sync is reduced, and if it is needed inside the class, a quick copy into a class attribute should do the trick:

PFAM_GR_MAPPING = {"secondary_structure": "SS",
                   "surface_accessibility": "SA",
                   "transmembrane": "TM",
                   "posterior_probability": "PP",
                   "ligand_binding": "LI",
                   "active_site": "AS",
                   "intron": "IN"}
....

class StockholmWriter(SequentialAlignmentWriter):
    pfam_gr_mapping = PFAM_GR_MAPPING.copy()
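
Going back to the first point (converting only one-element lists into strings), a minimal sketch of that rule could look like this. The function name `simplify_gf` and the sample data are hypothetical and only meant to show the idea; nothing here is taken from StockholmIO.py.

```python
def simplify_gf(gf):
    """Return a copy of gf where single-line features become plain strings.

    Features that span several lines stay as lists, so the line-split
    information is preserved for them.
    """
    simplified = {}
    for key, lines in gf.items():
        if len(lines) == 1:
            simplified[key] = lines[0]
        else:
            simplified[key] = list(lines)
    return simplified


# Hypothetical GF data as collected by a parser.
example = {"ID": ["HAT"], "CC": ["line one", "line two"]}
print(simplify_gf(example))
```

The trade-off is that the value type then depends on the data: the same key may be a string in one file and a list in another, so callers have to handle both.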

I could try to make more changes. Is there a unit test for this parser?

peterjc · 2016-02-11T10:15:58Z

Examples (and test cases) might help decide.
Yes. We already write out the #GF SQ line based on the actual alignment. Once the other annotation is officially exposed in the alignment object, using it on output would be the next logical step.
Generating the reverse mapping from the forward mapping would indeed solve the risk of the two getting out of sync.
Yes please. There are generic tests for this in test_AlignIO.py, but adding test_AlignIO_StockholmIO.py to explicitly look at the annotations would be better.
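
As a rough sketch of the third point, the reverse mapping can be derived from the forward one with a dict comprehension, so only one dictionary is maintained by hand. The name `PFAM_GR_CODE_TO_NAME` is made up for this example; the entries are the ones listed earlier in the thread.

```python
# Forward mapping as proposed above: descriptive name -> two-letter GR code.
PFAM_GR_MAPPING = {
    "secondary_structure": "SS",
    "surface_accessibility": "SA",
    "transmembrane": "TM",
    "posterior_probability": "PP",
    "ligand_binding": "LI",
    "active_site": "AS",
    "intron": "IN",
}

# Reverse mapping generated from the forward one: GR code -> descriptive name.
PFAM_GR_CODE_TO_NAME = {code: name for name, code in PFAM_GR_MAPPING.items()}

# Guard against two names sharing a code, which would silently drop an entry.
assert len(PFAM_GR_CODE_TO_NAME) == len(PFAM_GR_MAPPING)

print(PFAM_GR_CODE_TO_NAME["SS"])
```

The length check is cheap and fails at import time if a duplicate code is ever added to the forward mapping.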

frubino · 2016-02-11T15:13:49Z

Pull request #769

mdehoon · 2022-09-24T01:27:11Z

The new alignment parser in Bio.Align stores the per-file GF annotations:

>>> from Bio import Align
>>> alignment = Align.read("example.sth", "stockholm")
>>> alignment.annotations['identifier']
'HAT'
>>> alignment.annotations['accession']
'PF02184.18'
>>> alignment.annotations['definition']
'HAT (Half-A-TPR) repeat'

frubino mentioned this issue Feb 11, 2016

reading/writing Stockholm alignment GF and GC annotation #769

Open

peterjc mentioned this issue Mar 20, 2019

Pfam/Stockholm reader #1977

Closed

peterjc mentioned this issue Oct 19, 2020

Metadata lost for MultipleSeqAlignment class #3314

Open
