Unable to parse certain Genbank format #2844

Closed
philiptzou opened this issue Apr 22, 2020 · 7 comments


philiptzou commented Apr 22, 2020

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)
3.7.5 (default, Nov  7 2019, 10:50:52)
[GCC 8.3.0]
CPython
Linux-4.15.0-88-generic-x86_64-with-Ubuntu-18.04-bionic
1.76

Expected behaviour

Should be able to process AYW00820.1.

Actual behaviour

Traceback (most recent call last):
  ...
  File "/home/philip/.virtualenvs/.../lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "/home/philip/.virtualenvs/.../lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/philip/.virtualenvs/.../lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 466, in feed
    self._feed_header_lines(consumer, self.parse_header())
  File "/home/philip/.virtualenvs/.../lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1802, in _feed_header_lines
    structured_comment_key
KeyError: 'Assembly-Data'

Steps to reproduce

from Bio import SeqIO, Entrez

handle = Entrez.efetch(
  db='protein', id='AYW00820.1',
  rettype='gb', retmode='text')
SeqIO.read(handle, format='genbank')
@peterjc
Member

peterjc commented Apr 23, 2020

@biologyguy are you able to take a look at this since you wrote the structured comment parsing? I wonder if the input is malformed?

LOCUS       AYW00820                1841 aa            linear   VRL 01-SEP-2019
DEFINITION  Nsp1-3 [Mucambo virus].
ACCESSION   AYW00820
VERSION     AYW00820.1
DBSOURCE    accession MF993533.1
KEYWORDS    .
SOURCE      Mucambo virus
  ORGANISM  Mucambo virus
            Viruses; ssRNA viruses; ssRNA positive-strand viruses, no DNA
            stage; Togaviridae; Alphavirus.
REFERENCE   1  (residues 1 to 1841)
  AUTHORS   Araujo,P.A., Casseb,S.M., Ferreira,M.S., Silva,S.P., Silva,F.A.,
            Martins,L.C., Chiang,J.O., Cruz,A.C.R. and Vasconcelos,P.F.C.
  TITLE     Genomic detection and Phylogenetic Analysis of Mucambo virus from
            hematophagus arthropods of Caxiuana Nacional Forest, municipality
            of Melgaco, Para
  JOURNAL   Unpublished
REFERENCE   2  (residues 1 to 1841)
  AUTHORS   Araujo,P.A., Casseb,S.M., Ferreira,M.S., Silva,S.P., Silva,F.A.,
            Martins,L.C., Chiang,J.O., Cruz,A.C.R. and Vasconcelos,P.F.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-SEP-2017) Section of Arbovirology and Hemorrhagic
            Fevers, Evandro Chagas Institute, Highway BR 316, Ananindeua, Para
            67030-000, Brazil
COMMENT     EMAIL: pedroarthuraraujo@yahoo.com.br
            ##Assembly-Data-START##
            Assembly method: IDBA-UD
            Sequencing Technology: MiniSeq
            ##Assembly-Data-END##.
            Method: conceptual translation.
FEATURES             Location/Qualifiers
     source          1..1841
...

The trailing full stop on ##Assembly-Data-END##. looks out of place?

@biologyguy
Contributor

Hi @peterjc,
It's been a while! I hope you and yours are staying safe and healthy during these very 'interesting' days.
So, I'm not sure how best to approach this. The record does seem to be malformed, but the trailing full stop isn't the issue. It's the colons used within the comment, which are supposed to act as the delimiter for key-value pairs. The parser is looking for the pattern " :: ", but I'm not sure whether this is the actual spec-prescribed delimiter or just the most common convention. There doesn't seem to be much guidance from NCBI (https://www.ncbi.nlm.nih.gov/genbank/structuredcomment/).
I can certainly 'fix' the parser by switching the pattern to r'\s*:+\s*', but I worry it could have unintended effects on other records if people have put a colon in their key or value text. No idea if that happens.
I can submit a pull request if you think the regex fix is worth pursuing.
Thanks,
-Steve
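
To make the trade-off above concrete, here is a minimal sketch comparing the strict " :: " delimiter with the proposed regex. It is not Biopython's code; `split_strict` and `split_lenient` are hypothetical helpers written only for this illustration, and the last sample line is invented to show a colon inside a key.

```python
import re

STRICT_DELIMITER = " :: "                      # pattern the parser is described as looking for
LENIENT_DELIMITER = re.compile(r"\s*:+\s*")    # regex proposed in the comment above


def split_strict(line):
    """Split on the first ' :: '; return None if the delimiter is absent."""
    if STRICT_DELIMITER not in line:
        return None
    key, value = line.split(STRICT_DELIMITER, 1)
    return key.strip(), value.strip()


def split_lenient(line):
    """Split on the first run of colons, ignoring surrounding whitespace."""
    parts = LENIENT_DELIMITER.split(line, maxsplit=1)
    if len(parts) != 2:
        return None
    return parts[0].strip(), parts[1].strip()


# Line from the problematic record: only the lenient pattern finds a pair.
print(split_strict("Assembly method: IDBA-UD"))    # None
print(split_lenient("Assembly method: IDBA-UD"))   # ('Assembly method', 'IDBA-UD')

# Invented line with a colon inside the key: the two patterns disagree.
print(split_strict("Note: extra :: value"))        # ('Note: extra', 'value')
print(split_lenient("Note: extra :: value"))       # ('Note', 'extra :: value')
```

The last pair of calls shows the risk mentioned above: once any colon counts as a delimiter, a key that legitimately contains a colon is split in the wrong place.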

@peterjc
Member

peterjc commented Apr 24, 2020

OK - would one of you like to contact the NCBI about this file?

For Biopython, assuming this file is indeed invalid, how about a middle ground: Abort parsing the structured comment and raise a warning, allowing the user to continue (perhaps they don't care about this part of the file).
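
A rough sketch of that middle ground, using only the standard library: when a line inside a `##...-START##` block lacks " :: ", warn, drop the block, and keep its lines as free text. This is an assumption about how such a fallback could look, not the change that went into Biopython; `StructuredCommentWarning` and `parse_structured_comment` are hypothetical names.

```python
import warnings


class StructuredCommentWarning(UserWarning):
    """Hypothetical warning class used only in this sketch."""


def parse_structured_comment(lines):
    """Return (structured, free_text) for a list of COMMENT lines.

    structured maps a block name to a {key: value} dict. A block whose
    lines lack ' :: ' is abandoned with a warning and kept as free text.
    """
    structured = {}
    free_text = []
    block = None        # name of the block currently being read, if any
    block_lines = []    # raw lines of that block, kept for the fallback
    for raw in lines:
        line = raw.strip()
        if block is None:
            if line.startswith("##") and line.endswith("-START##"):
                block = line[2:-len("-START##")]
                block_lines = [line]
                structured[block] = {}
            else:
                free_text.append(line)
            continue
        block_lines.append(line)
        if line.startswith("##") and "-END##" in line:
            block = None                      # block closed normally
        elif " :: " in line:
            key, value = line.split(" :: ", 1)
            structured[block][key.strip()] = value.strip()
        else:
            warnings.warn(
                f"Structured comment '{block}' is malformed; keeping it as plain text",
                StructuredCommentWarning,
            )
            del structured[block]             # abort this block only
            free_text.extend(block_lines)
            block = None
    return structured, free_text


comment = [
    "##Assembly-Data-START##",
    "Assembly method: IDBA-UD",
    "Sequencing Technology: MiniSeq",
    "##Assembly-Data-END##.",
    "Method: conceptual translation.",
]
structured, free_text = parse_structured_comment(comment)
print(structured)   # {}
```

With the comment lines from the record above, the call emits one warning and returns every line as free text, so the rest of the record can still be parsed.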

@biologyguy
Contributor

Assuming tests pass, I've added the check, and the parser now raises a warning. Is that okay?
@philiptzou I'll leave it up to you to alert NCBI :)

@peterjc
Member

peterjc commented Apr 28, 2020

@philiptzou are you able to install Biopython from source? The latest master branch should now work, and we'd appreciate you testing it.

Also, have you emailed the NCBI about this problematic structured annotation?

@philiptzou
Author

> @philiptzou are you able to install Biopython from source? The latest master branch should now work, and we'd appreciate you testing it.
>
> Also, have you emailed the NCBI about this problematic structured annotation?

Thanks! I just noticed the message and I'll try it soon!

@philiptzou
Author

Works like a charm!
