Biojava fails to parse Genbank and EMBL format #843

Closed
innovate-invent opened this issue Jun 17, 2019 · 4 comments · Fixed by #919
innovate-invent commented Jun 17, 2019

BioJava fails to parse anticodon and transl_except feature qualifiers when their values wrap onto a continuation line.
BioJava expects wrapped values to be quoted, but quoting these values is not valid in the flatfile format.

This causes applications like Mauve and Colombo/SigiHMM to emit

This line could not be parsed:                 seq:caa)

and discard large portions of the dataset.

This line is from 15584_genome.embl from Biopython:

FT                   /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT                   seq:caa)

The matching line from 15584_genome.embl from Bioperl:

FT                   /anticodon="(pos:complement(1123552..1123554),aa:Leu,seq:ca
FT                   a)"

The difference between biopython and bioperl is that bioperl quotes anticodons if they wrap.
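
A reader that wants to tolerate both forms could decide whether a value is finished by looking at the value itself rather than at the line break. The following is only a minimal sketch of that idea in Python, not BioJava code; the function name `qualifier_value_is_complete` is hypothetical, and the rule that a doubled quote escapes an embedded quote is taken from the INSDC feature table description.

```python
def qualifier_value_is_complete(value: str) -> bool:
    """Return True when an accumulated qualifier value needs no more continuation lines.

    Hypothetical helper for illustration only.
    """
    value = value.strip()
    if value.startswith('"'):
        # Quoted text: finished once it ends with a closing quote.
        # Embedded quotes are doubled, so a finished value holds an even number of quotes.
        return len(value) > 1 and value.endswith('"') and value.count('"') % 2 == 0
    if value.startswith("("):
        # Unquoted structured value such as /anticodon: finished when parentheses balance.
        depth = 0
        for ch in value:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        return depth == 0
    # Anything else (numbers, bare keywords) is assumed to fit on one line in this sketch.
    return True
```

With this check, the first line of the Biopython output is reported as incomplete because one parenthesis is still open, and the first line of the Bioperl output is reported as incomplete because the closing quote is missing.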

Copying NIH statement here for reference. Some messages were removed/edited for brevity.

Case #: CAS-380254-C8B2Z2 
The document here: http://www.insdc.org/files/feature_table.html does not describe how and where continuation lines should be created. Biopython, Bioperl, and other tools handle this differently, which causes datasets to fail to parse.
For example:
The reference genome for NC_018080 downloaded from RefSeq fails to parse.
Features containing:
/anticodon=(pos:complement(1123552..1123554),aa:Leu,
seq:caa)
are failing to parse, this output is consistent with biopython.
Bioperl wraps at column 80:
/anticodon="(pos:complement(1123552..1123554),aa:Leu,seq:caa
)"
Because there is no official statement on how wrapping should occur I can't raise issues with the appropriate software developer.
Any input on the matter would be appreciated. 
Case Created: 6/6/2019 4:08 PM 


Dear Nolan,
Thanks for the additional info. It is expected that long anticodon qualifiers will wrap if they exceed the 80-character max line length. Furthermore, any of the following are legal within the spec:
/anticodon=(pos:complement(1123552..1123554),aa:Leu,
seq:caa)
/anticodon=(pos:complement(1123552..1123554),aa:Leu,seq:
caa)
/anticodon=(pos:complement(1123552..1123554),
aa:Leu,seq:caa)
What might be happening is that the parsers you're testing may be combining the /anticodon qualifier and its continuation line, but adding an extra space, which would then invalidate the qualifier. That would be a bug in the parser code. For parsing an anticodon qualifier with a continuation line, you have to:
1. Realize that you aren't yet at the end of the qualifier value (terminal right paren)
2. Read the next line, and strip all leading white space
3. Directly append (2) to the qualifier value
For other qualifiers with text content, a parser generally needs to include a space when appending a continuation line, unless the row ends in a hyphen or is an unbroken string (no spaces, such as with a long chemical formula or the protein sequence in a /translation qualifier). But /anticodon may need an exception, depending on how the parser code is written.
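
The joining rules described in this reply can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the BioJava fix: `append_continuation` and its `structured` flag are hypothetical names, and treating "no space in the text so far" as the sign of an unbroken string is a simplifying heuristic.

```python
def append_continuation(value: str, continuation: str, structured: bool) -> str:
    """Append one continuation line to a partially read qualifier value.

    structured=True marks paren-delimited values such as /anticodon,
    which are appended directly with no separator.
    """
    piece = continuation.strip()  # drops the leading (and any trailing) white space
    if structured:
        return value + piece
    # Free text: no space after a trailing hyphen, or when the text read so far
    # is an unbroken string (for example a /translation protein sequence).
    body = value.lstrip('"')
    if value.endswith("-") or " " not in body:
        return value + piece
    return value + " " + piece


# The three wrappings listed as legal in the reply above.
wrappings = [
    ("(pos:complement(1123552..1123554),aa:Leu,", "seq:caa)"),
    ("(pos:complement(1123552..1123554),aa:Leu,seq:", "caa)"),
    ("(pos:complement(1123552..1123554),", "aa:Leu,seq:caa)"),
]
results = {append_continuation(first, "    " + rest, structured=True)
           for first, rest in wrappings}
assert len(results) == 1  # all three wrappings give the same value
```

All three wrappings collapse to `(pos:complement(1123552..1123554),aa:Leu,seq:caa)`. Inserting a space at the join would give three different and invalid values.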
I hope that helps clear up the bug. If you file a bug report, please let us know the ticket for our records.

...

As our flatfile does not have quotes, the quoting is not valid. The quotes are a bug.

Best regards,
-Terence Murphy, Ph.D.
RefSeq and Gene Team Lead
NCBI/NLM/NIH/DHHS


bioperl/bioperl-live#321


heuermh commented Jun 17, 2019

Thank you, @innovate-invent. I've also created an issue for biojava-legacy at biojava/biojava-legacy#50

@prashantVaishla

Hi, I am new to EMBL format parsing.
Could you please share a sample input file or send me a link to one?


MaxGreil commented Feb 4, 2021

Hi @josemduarte,
I would like to take on this issue. Can you please explain what needs to be done here?

@josemduarte
Contributor

Thanks @MaxGreil. The issue is explained quite well above. BioJava should be able to parse a wrapped qualifier as a single record:

FT                   /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT                   seq:caa)

It would be best to capture that in a unit test and develop the fix against that test.

There are more details on parsing recommendations in the email pasted above:

For parsing an anticodon qualifier with a continuation line, you have to:
1. Realize that you aren't yet at the end of the qualifier value (terminal right paren)
2. Read the next line, and strip all leading white space
3. Directly append (2) to the qualifier value
For other qualifiers with text content, a parser generally needs to include a space when appending a continuation line, unless the row ends in a hyphen or is an unbroken string (no spaces, such as with a long chemical formula or the protein sequence in a /translation qualifier). But /anticodon may need an exception, depending on how the parser code is written.
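
As a starting point for such a test, the expected behaviour for the two FT lines above can be pinned down in a few lines. Python is used here only for brevity; the real test would be a Java unit test in BioJava. The function name `read_ft_qualifiers` is hypothetical, and the sketch assumes the usual EMBL column layout (feature key starting at column 6, qualifiers and continuations at column 22). It joins continuation lines by direct append, which suits unquoted values like /anticodon but not free text.

```python
def read_ft_qualifiers(lines):
    """Group EMBL feature-table lines into (name, value) qualifier pairs."""
    qualifiers = []
    current = None  # the qualifier still open for continuation lines, if any
    for line in lines:
        if not line.startswith("FT"):
            current = None
            continue
        if line[2:21].strip():
            # Text in the feature-key columns starts a new feature;
            # its location is ignored in this sketch.
            current = None
            continue
        content = line[21:].strip()
        if content.startswith("/"):
            name, _, value = content[1:].partition("=")
            current = [name, value]
            qualifiers.append(current)
        elif current is not None and content:
            # Continuation line: append directly, with no separator.
            current[1] += content
    return [tuple(q) for q in qualifiers]


indent = "FT" + " " * 19  # qualifier lines begin at column 22
record = [
    indent + "/anticodon=(pos:complement(1123552..1123554),aa:Leu,",
    indent + "seq:caa)",
]
parsed = read_ft_qualifiers(record)
assert parsed == [("anticodon", "(pos:complement(1123552..1123554),aa:Leu,seq:caa)")]
```

The assertion states the behaviour the fix needs to deliver: the two physical lines produce one anticodon qualifier, and no "could not be parsed" line is left over.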
