[patch] Adding *GF* information from a Stockholm alignment #768

Open
frubino opened this issue Feb 11, 2016 · 5 comments

Comments

@frubino

frubino commented Feb 11, 2016

Hi,

I needed to parse the Pfam seed alignment and filter by ID/ACC, which are part of the GF set of features, and I noticed that the per-file annotations are not saved. Since the new MultipleSeqAlignment class supports per-alignment annotations via its constructor, and the lines are kept in a gf dictionary in the parser, I just needed to change the line that creates the alignment instance.

Since the other information is stored in different attributes, I used the gf dictionary as the sole per-alignment annotations. Since a feature can span multiple lines, I also joined those lines with a space.

The change is really just one line, shown below.

--- a/Bio/AlignIO/StockholmIO.py
+++ b/Bio/AlignIO/StockholmIO.py
@@ -455,7 +455,7 @@ class StockholmIterator(AlignmentIterator):

                 self._populate_meta_data(id, record)
                 records.append(record)
-            alignment = MultipleSeqAlignment(records, self.alphabet)
+            alignment = MultipleSeqAlignment(records, self.alphabet, annotations=dict((key, ' '.join(value)) for key, value in gf.iteritems()))

             # TODO - Introduce an annotated alignment class?
             # For now, store the annotation a new private property:
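
As a rough illustration of what the changed line computes, here is a standalone sketch. The `gf` contents are made-up sample values modelled on a Pfam header, not actual parser output, and `.items()` is used in place of `.iteritems()` so the snippet also runs on Python 3.

```python
# Toy stand-in for the parser's gf dictionary: each #=GF feature maps to
# the list of lines seen for it (sample values, assumed for illustration).
gf = {
    "ID": ["HAT"],
    "AC": ["PF02184.18"],
    "CC": ["First comment line.", "Second comment line."],
}

# Join the lines of each feature with a single space, as the patch does.
annotations = {key: " ".join(value) for key, value in gf.items()}

print(annotations["ID"])
print(annotations["CC"])
```

Note that after joining, the original line breaks of multi-line features such as CC can no longer be recovered.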
@peterjc
Member

peterjc commented Feb 11, 2016

Good point. See also #357 for recording the per-column annotations.

I don't like discarding the line-split information. For some entries, like CC comments, joining with a space is fine, but is this really safe in general?

There's a note in the code wondering if we should check #=GF SQ ... sequence count lines. Perhaps also drop it from the annotations dictionary?

To match the GR and GS parsing code, should we have a dictionary like pfam_gr_mapping to map CC to comment etc?

@frubino
Author

frubino commented Feb 11, 2016

To be honest, I went through the code quickly to understand why that information was not there. There are a few more points to check, for sure. I haven't used an alignment in the Stockholm format in a long time, though.

  1. Looking at the code, I think only features that have a one-element list should be converted into strings. That would make checking annotations easier (in my case there is only one ID, yet it is stored in a list), while multi-line features are kept as-is.

  2. Dropping the #=GF SQ line after a check is probably a good idea. On writing, the information can be added again, especially if the alignment has changed in the meantime.

  3. Since there's a comment attribute, yes. Also, regarding those two dictionaries, why not make the mapping a module-level variable? That reduces the risk of the two getting out of sync, and if it is needed inside the class, a quick copy into a class attribute should do the trick:

PFAM_GR_MAPPING = {"secondary_structure": "SS",
                   "surface_accessibility": "SA",
                   "transmembrane": "TM",
                   "posterior_probability": "PP",
                   "ligand_binding": "LI",
                   "active_site": "AS",
                   "intron": "IN"}
....

class StockholmWriter(SequentialAlignmentWriter):
    pfam_gr_mapping = PFAM_GR_MAPPING.copy()

I could try to make more changes. Is there a unit test for this parser?
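
Points 1 and 2 above could look roughly like the sketch below. The helper name `gf_to_annotations` and its `n_records` argument are hypothetical and not part of StockholmIO; the function only assumes that `gf` maps each feature to a list of strings.

```python
def gf_to_annotations(gf, n_records):
    """Turn a gf dictionary into per-alignment annotations (hypothetical helper)."""
    annotations = {}
    for key, lines in gf.items():
        if key == "SQ":
            # Point 2: verify the declared sequence count, then drop it,
            # since the writer can regenerate it from the alignment.
            declared = int(lines[0])
            if declared != n_records:
                raise ValueError(
                    "#=GF SQ says %d sequences, found %d" % (declared, n_records)
                )
            continue
        # Point 1: a one-element list becomes a plain string;
        # multi-line features stay as lists.
        annotations[key] = lines[0] if len(lines) == 1 else list(lines)
    return annotations


example = gf_to_annotations(
    {"ID": ["HAT"], "SQ": ["2"], "CC": ["line one", "line two"]}, n_records=2
)
```

With this rule a caller can compare `example["ID"]` directly against a string, while the line structure of multi-line features is preserved.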

@peterjc
Member

peterjc commented Feb 11, 2016

  1. Examples (and test cases) might help decide

  2. Yes. We already write out the #=GF SQ line based on the actual alignment. Once the other annotation is officially exposed in the alignment object, using it on output would be the next logical step.

  3. Generating the reverse mapping from the forward mapping would indeed solve the risk of the two getting out of sync.

  4. Yes please. There are generic tests for this in test_AlignIO.py, but adding test_AlignIO_StockholmIO.py to explicitly look at the annotations would be better.
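
For point 3, one way to derive the reverse mapping so that the two directions cannot drift apart is sketched here. `PFAM_GR_MAPPING` follows the snippet earlier in this thread; `PFAM_GR_REVERSE` is only an illustrative name, not an existing attribute.

```python
# Forward table (descriptive name -> two-letter Stockholm code),
# copied from the snippet earlier in the thread.
PFAM_GR_MAPPING = {
    "secondary_structure": "SS",
    "surface_accessibility": "SA",
    "transmembrane": "TM",
    "posterior_probability": "PP",
    "ligand_binding": "LI",
    "active_site": "AS",
    "intron": "IN",
}

# Derive the code -> name direction from the forward table instead of
# maintaining a second hand-written dictionary.
PFAM_GR_REVERSE = {code: name for name, code in PFAM_GR_MAPPING.items()}

# Inverting is only safe when no code appears twice.
assert len(PFAM_GR_REVERSE) == len(PFAM_GR_MAPPING)
```

The same pattern would apply to a GF table (for example mapping CC to comment) if one is introduced.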

@frubino
Author

frubino commented Feb 11, 2016

Pull request #769

@mdehoon
Contributor

mdehoon commented Sep 24, 2022

The new alignment parser in Bio.Align stores the per-file GF annotations:

>>> from Bio import Align
>>> alignment = Align.read("example.sth", "stockholm")
>>> alignment.annotations['identifier']
'HAT'
>>> alignment.annotations['accession']
'PF02184.18'
>>> alignment.annotations['definition']
'HAT (Half-A-TPR) repeat'
