GCG MSF support in Bio.AlignIO #2306
Upgraded warning to error for unexpected lines after alignment (could be a sign of a problem in the expected length)
Codecov Report
@@ Coverage Diff @@
## master #2306 +/- ##
==========================================
- Coverage 84.95% 84.92% -0.03%
==========================================
Files 321 322 +1
Lines 52013 52157 +144
==========================================
+ Hits 44187 44296 +109
- Misses 7826 7861 +35
|
Maybe I can improve the test coverage - I should set a good example ;) |
Thank you Peter. This has saved us from having to use BioPerl to convert IMGT MSF files to Stockholm before importing them into Biopython. |
Almost all the code without tests is error handling for malformed MSF files, and the real world examples which trigger the warnings are often too large to include as is... |
https://github.com/ANHIG/IMGTHLA/blob/3320/msf/DOA_prot.msf as of commit 16c09d89398603dcf653cc5476f857f1a21c1d9d (14 August 2018), i.e. from version 3.32.0 of the DB. This file has a discrepancy in the alignment length in the main header (62) versus most of the records (250). If we accept there should be 250 columns, then the entry for the last sequence is only 62 without gap padding (and indeed, has blank lines in the last three blocks). The initial Biopython GCG MSF parser will accept this file but with two warnings.
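As a rough illustration of the kind of discrepancy described above, here is a minimal standard-library sketch that compares the length declared on the `MSF:` header line with the `Len:` value on each `Name:` line. The helper name `msf_declared_lengths` and the sample header are made up for this example; this is not the parser code in this pull request, and it assumes the usual GCG MSF header layout.

```python
import re


# Hypothetical helper for illustration only; not the parser in this pull request.
def msf_declared_lengths(text):
    """Return (header_length, {name: record_length}) from an MSF header."""
    header_length = None
    record_lengths = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            break  # end of the header, the alignment blocks follow
        match = re.search(r"\bMSF:\s*(\d+)", line)
        if match:
            header_length = int(match.group(1))
            continue
        match = re.match(r"Name:\s*(\S+)\s+.*?\bLen:\s*(\d+)", line)
        if match:
            record_lengths[match.group(1)] = int(match.group(2))
    return header_length, record_lengths


# Made-up header mimicking the 62 versus 250 discrepancy (not the real file).
example = """\
 MSF: 62  Type: P  Check: 0000  ..

 Name: seq_a  Len: 250  Check: 0000  Weight: 1.00
 Name: seq_b  Len: 250  Check: 0000  Weight: 1.00

//
"""

header_length, record_lengths = msf_declared_lengths(example)
# Names whose declared record length disagrees with the main header.
mismatched = sorted(
    name for name, length in record_lengths.items() if length != header_length
)
print(header_length, mismatched)
```

A strict parser could raise an exception whenever `mismatched` is non-empty, while a lenient one could warn and trust the per-record lengths instead.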
@pbashyal-nmdp would you prefer this code as is (with the header length heuristics included), or for me to remove the heuristics and raise an exception on length inconsistencies (my preference)?
Cross reference:
ANHIG/IMGTHLA#200 - Truncated sequences in MSF files (most recently affecting their release v3.36.0)
ANHIG/IMGTHLA#201 - Wrong alignment length in MSF header (most recently affecting v3.32.0)
As a compromise, we could remove just the wrong MSF header length heuristic, which is the least elegant and most fragile bit of the code in my view. |
Have seen this on older IMGTHLA releases, most recently v3.32.0 (April 2018) - the six releases since are fine.
https://github.com/ANHIG/IMGTHLA/blob/3300/msf/W_prot.msf as of commit d99d8aca3f01f7431741a998ea5cc2417d53ac9c (26 Oct 2017), i.e. from v3.30.0 of the IMGTHLA database. This file has a discrepancy between the alignment length (99 columns) and four of the sequences (only 93 letters without trailing gap padding). The initial Biopython GCG MSF parser will accept this file (and apply the missing padding) with a warning.
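To make "apply the missing padding with a warning" concrete, here is a small sketch using only the standard library. The function `pad_to_length`, the gap character default, and the toy sequences are assumptions for illustration; the actual parser in this pull request may handle this differently.

```python
import warnings


# Hypothetical helper for illustration; not the code in this pull request.
def pad_to_length(records, alignment_length, gap="-"):
    """Right-pad short sequences with gaps, warning once per padded record."""
    padded = {}
    for name, seq in records.items():
        if len(seq) > alignment_length:
            # Too long cannot be repaired by padding, so treat it as an error.
            raise ValueError(
                "%s has %i letters, expected at most %i"
                % (name, len(seq), alignment_length)
            )
        if len(seq) < alignment_length:
            warnings.warn(
                "%s has %i letters, padding to %i"
                % (name, len(seq), alignment_length)
            )
            seq = seq + gap * (alignment_length - len(seq))
        padded[name] = seq
    return padded


# Toy data: one full-length record and one missing its trailing gaps.
records = {"seq_a": "ACDEFGHIKL", "seq_b": "ACDEFG"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    padded = pad_to_length(records, 10)
print(padded["seq_b"], len(caught))
```

Warning rather than failing keeps such files usable, at the cost of silently guessing that the missing columns were trailing gaps.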
I have removed the heuristic needed for ANHIG/IMGTHLA#201 which is not needed with IPD-IMGT/HLA database release 3.33.0 onwards. With these changes, all the recent IPD-IMGT/HLA database MSF files parse perfectly, bar one warning from
Test script:

```python
#!/usr/bin/env python
import os
import sys
import warnings
from Bio import BiopythonParserWarning
from Bio import AlignIO

count = 0
for f in os.listdir("."):
    if f.endswith(".msf"):
        with warnings.catch_warnings():
            # Promote parser warnings to exceptions so they are reported too
            warnings.simplefilter('error', BiopythonParserWarning)
            try:
                align = AlignIO.read(f, "msf")
            except (ValueError, BiopythonParserWarning) as e:
                print("%s - %iKb - %s" % (f, os.stat(f).st_size / 1000, e))
        count += 1
print("Done, %i parsed" % count)
```

This was also used to pick the smallest possible real example files for the two test cases. |
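For anyone unfamiliar with the trick the test script relies on, here is a standalone sketch of promoting one warning category to an exception inside a `warnings.catch_warnings()` block. `DemoParserWarning` and `lenient_parse` are stand-ins invented for this example so that it runs without Biopython or any MSF files.

```python
import warnings


class DemoParserWarning(UserWarning):
    """Stand-in for BiopythonParserWarning (assumption for this sketch)."""


def lenient_parse(text):
    # Hypothetical parser: warns, rather than fails, on a short record.
    if len(text) < 5:
        warnings.warn("record shorter than expected", DemoParserWarning)
    return text


problems = []
for text in ["ACDEF", "ACD"]:
    with warnings.catch_warnings():
        # Inside this block the warning is raised as an exception instead.
        warnings.simplefilter("error", DemoParserWarning)
        try:
            lenient_parse(text)
        except DemoParserWarning as e:
            problems.append((text, str(e)))
print(problems)
```

Because the filter change is scoped to the `with` block, the rest of the program keeps its normal warning behaviour.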
The problematic MSF file That means the MSF files from IPD-IMGT/HLA database versions v3.33.0 to v3.38.0 (latest) inclusive all parse without errors. |
Thank you @pbashyal-nmdp and the NMDP, I'm sure other people will find a use for the MSF support too. |
Great work @peterjc ! We also hope it'll be useful for others as well. |
This adds support to Bio.AlignIO for parsing GCG MSF files. This work was on behalf of the NMDP, and as copyright holders, they agreed to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run flake8 locally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)