GCG MSF support in Bio.AlignIO #2306

Merged
merged 21 commits into from Nov 5, 2019

@peterjc peterjc commented Oct 19, 2019

This adds support to Bio.AlignIO for parsing GCG MSF files.

  • This work was on behalf of the NMDP, and as copyright holders, they agreed to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

codecov bot commented Oct 19, 2019

Codecov Report

Merging #2306 into master will decrease coverage by 0.02%.
The diff coverage is 74.1%.

@@            Coverage Diff             @@
##           master    #2306      +/-   ##
==========================================
- Coverage   84.95%   84.92%   -0.03%     
==========================================
  Files         321      322       +1     
  Lines       52013    52157     +144     
==========================================
+ Hits        44187    44296     +109     
- Misses       7826     7861      +35
Impacted Files Coverage Δ
Bio/Align/Applications/_TCoffee.py 100% <ø> (ø) ⬆️
Bio/AlignIO/__init__.py 83.6% <100%> (+0.96%) ⬆️
Bio/AlignIO/MsfIO.py 73.91% <73.91%> (ø)
Bio/Pathway/Rep/__init__.py 100% <0%> (ø) ⬆️
Bio/SVDSuperimposer/__init__.py 87.14% <0%> (ø) ⬆️
Bio/KEGG/KGML/KGML_pathway.py 80.3% <0%> (ø) ⬆️
Bio/KEGG/KGML/KGML_parser.py 78.94% <0%> (ø) ⬆️
Bio/Nexus/Nexus.py 72.33% <0%> (ø) ⬆️
Bio/SeqIO/PdbIO.py 96.05% <0%> (ø) ⬆️
Bio/Pathway/Rep/MultiGraph.py 62.72% <0%> (ø) ⬆️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b68b239...00b3ded.

peterjc commented Oct 21, 2019

Maybe I can improve the test coverage - I should set a good example ;)

@pbashyal-nmdp

Thank you Peter. This has saved us from having to use BioPerl to convert IMGT MSF files to Stockholm before importing them into Biopython.

peterjc commented Oct 30, 2019

Almost all the code without tests is error handling for malformed MSF files, and many of the real-world examples which trigger the warnings are too large to include as is...

https://github.com/ANHIG/IMGTHLA/blob/3320/msf/DOA_prot.msf
as of commit 16c09d89398603dcf653cc5476f857f1a21c1d9d
(14 August 2018), i.e. from version 3.32.0 of the DB.

This file has a discrepancy in the alignment length in the
main header (62) versus most of the records (250).

If we accept there should be 250 columns, then the entry
for the last sequence is only 62 without gap padding
(and indeed, has blank lines in the last three blocks).

The initial Biopython GCG MSF parser will accept this file
but with two warnings.
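
The kind of discrepancy described above can be illustrated with a minimal pure-Python sketch. It does not use Biopython and is not the parser's actual logic; `check_lengths`, its arguments, and the sequence names are hypothetical, chosen only to mirror the 62 versus 250 case.

```python
from collections import Counter


def check_lengths(header_length, record_lengths):
    """Return (consensus_length, problems) for MSF-like header values.

    header_length: alignment length claimed in the main header.
    record_lengths: dict mapping sequence name to its declared length.
    """
    # Take the most common per-record length as the likely true length.
    consensus, _ = Counter(record_lengths.values()).most_common(1)[0]
    problems = []
    if header_length != consensus:
        problems.append(
            "header says %i columns but most records say %i"
            % (header_length, consensus)
        )
    for name, length in sorted(record_lengths.items()):
        if length != consensus:
            problems.append(
                "%s has length %i, expected %i" % (name, length, consensus)
            )
    return consensus, problems


# Hypothetical values: header claims 62, most records claim 250.
consensus, problems = check_lengths(62, {"seq1": 250, "seq2": 250, "seq3": 62})
for message in problems:
    print(message)
```

A majority vote like this is exactly the sort of heuristic that can be fragile, which is one argument for raising an exception on inconsistent lengths instead.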

peterjc commented Oct 31, 2019

@pbashyal-nmdp would you prefer this code as is (with the header length heuristics included), or for me to remove the heuristics and raise an exception on length inconsistencies (my preference)?

Cross reference:

ANHIG/IMGTHLA#200 - Truncated sequences in MSF files (most recently affecting their release v3.36.0)

ANHIG/IMGTHLA#201 - Wrong alignment length in MSF header (most recently affecting v3.32.0)

As a compromise, we could remove just the wrong MSF header length heuristic, which is the least elegant and most fragile bit of the code in my view.

Have seen this on older IMGTHLA releases, most recently
v3.32.0 (April 2018) - the six releases since are fine.
https://github.com/ANHIG/IMGTHLA/blob/3300/msf/W_prot.msf
as of commit d99d8aca3f01f7431741a998ea5cc2417d53ac9c
(26 Oct 2017), i.e. from v3.30.0 of the IMGTHLA database.

This file has a discrepancy between the alignment length
(99 columns) and four of the sequences (only 93 letters
without trailing gap padding).

The initial Biopython GCG MSF parser will accept this file
(and apply the missing padding) with a warning.
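
As a rough illustration of "apply the missing padding with a warning", here is a small standalone sketch. It is an assumption about the general approach, not the code in `Bio/AlignIO/MsfIO.py`; the function name `pad_truncated`, the `-` gap character, and the warning text are all made up for the example.

```python
import warnings


def pad_truncated(sequences, length, gap="-"):
    """Right-pad any sequence shorter than `length` with gap characters.

    sequences: dict mapping sequence name to its (possibly short) string.
    length: expected number of alignment columns.
    """
    padded = {}
    truncated = []
    for name, seq in sequences.items():
        if len(seq) > length:
            # Too long cannot be repaired by padding, so treat it as an error.
            raise ValueError(
                "%s is longer than the alignment length %i" % (name, length)
            )
        if len(seq) < length:
            truncated.append(name)
            seq = seq + gap * (length - len(seq))
        padded[name] = seq
    if truncated:
        warnings.warn(
            "%i sequence(s) were truncated and have been gap padded"
            % len(truncated)
        )
    return padded


# Record the warning rather than printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = pad_truncated({"a": "ACDEFG", "b": "ACD"}, 6)
```

Emitting one warning per file rather than one per sequence keeps the output manageable when many records are short.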

peterjc commented Nov 5, 2019

I have removed the heuristic needed for ANHIG/IMGTHLA#201, which is no longer needed from IPD-IMGT/HLA database release 3.33.0 onwards.

With these changes, all the recent IPD-IMGT/HLA database MSF files parse perfectly, bar one warning from C_gen.msf on branch 3360, issue logged as ANHIG/IMGTHLA#200

$ for b in 3330 3340 3350 3360 3370 3380; do git checkout --force $b && ./check_all.py; done
Switched to branch '3330'
Your branch is up to date with 'origin/3330'.
Done, 97 parsed
Switched to branch '3340'
Your branch is up to date with 'origin/3340'.
Done, 97 parsed
Switched to branch '3350'
Your branch is up to date with 'origin/3350'.
Done, 97 parsed
Switched to branch '3360'
Your branch is up to date with 'origin/3360'.
C_gen.msf - 13933Kb - One of more alignment sequences were truncated and have been gap padded
Done, 97 parsed
Switched to branch '3370'
Your branch is up to date with 'origin/3370'.
Done, 98 parsed
Switched to branch '3380'
Your branch is up to date with 'origin/3380'.
Done, 101 parsed

Test script:

#!/usr/bin/env python

import os
import sys
import warnings

from Bio import BiopythonParserWarning
from Bio import AlignIO

count = 0
for f in os.listdir("."):
    if f.endswith(".msf"):
        with warnings.catch_warnings():
            warnings.simplefilter('error', BiopythonParserWarning)
            try:
                align = AlignIO.read(f, "msf")
            except (ValueError, BiopythonParserWarning) as e:
                print("%s - %iKb - %s" % (f, os.stat(f).st_size / 1000, e))
            count += 1

print("Done, %i parsed" % count)

This was also used to pick the smallest possible real example files for the two test cases.
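
One way to turn that kind of run into a search for the smallest example per problem is sketched below. This is not the script actually used; `smallest_per_message` is a hypothetical helper, and `parse` stands in for any callable taking a filename, e.g. a small wrapper around `AlignIO.read(path, "msf")`.

```python
import os
import warnings


def smallest_per_message(paths, parse):
    """Map each distinct warning or error message to the smallest file raising it.

    Returns a dict of message -> (path, size_in_bytes).
    """
    smallest = {}
    for path in paths:
        with warnings.catch_warnings():
            # Promote warnings to exceptions so they can be caught below.
            warnings.simplefilter("error")
            try:
                parse(path)
            except (ValueError, Warning) as err:
                size = os.path.getsize(path)
                key = str(err)
                if key not in smallest or size < smallest[key][1]:
                    smallest[key] = (path, size)
    return smallest
```

Files which parse cleanly never appear in the result, so an empty dict means nothing triggered a warning or `ValueError`.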

peterjc commented Nov 5, 2019

The problematic MSF file C_gen.msf in v3.36.0 has been fixed in ANHIG/IMGTHLA@9926a70

That means the MSF files from IPD-IMGT/HLA database versions v3.33.0 to v3.38.0 (latest) inclusive all parse without errors.

@peterjc peterjc merged commit 5c81ca9 into biopython:master Nov 5, 2019
@peterjc peterjc deleted the msf branch November 5, 2019 17:02

peterjc commented Nov 5, 2019

Thank you @pbashyal-nmdp and the NMDP, I'm sure other people will find a use for the MSF support too.

peterjc added a commit to biopython/biopython.github.io that referenced this pull request Nov 5, 2019
@pbashyal-nmdp

Great work @peterjc! We hope it'll be useful for others as well.
