Biojava fails to parse Genbank and EMBL format #843

innovate-invent · 2019-06-17T17:36:43Z

Biojava fails to parse anticodon and transl_except feature qualifiers when their values wrap across lines.
Biojava expects the values to be quoted, which is not valid for these qualifiers.

This causes applications like Mauve and Colombo/SigiHMM to emit

This line could not be parsed:                 seq:caa)

and discard large portions of the dataset.

These lines are from 15584_genome.embl as written by Biopython:

FT                   /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT                   seq:caa)

The matching lines from 15584_genome.embl as written by Bioperl:

FT                   /anticodon="(pos:complement(1123552..1123554),aa:Leu,seq:ca
FT                   a)"

The difference between Biopython and Bioperl is that Bioperl quotes anticodon values if they wrap.

Copying NIH statement here for reference. Some messages were removed/edited for brevity.

Case #: CAS-380254-C8B2Z2
The document here: http://www.insdc.org/files/feature_table.html does not describe how and where continuation lines should be created. Biopython, Bioperl, and other tools diverge on how this is handled, which causes datasets to fail to parse.
For example:
The reference genome for NC_018080 downloaded from RefSeq fails to parse.
Features containing:
/anticodon=(pos:complement(1123552..1123554),aa:Leu,
seq:caa)
are failing to parse; this output is consistent with Biopython.
Bioperl wraps at column 80:
/anticodon="(pos:complement(1123552..1123554),aa:Leu,seq:caa
)"
Because there is no official statement on how wrapping should occur I can't raise issues with the appropriate software developer.
Any input on the matter would be appreciated.
Case Created: 6/6/2019 4:08 PM

Dear Nolan,
Thanks for the additional info. It is expected that long anticodon qualifiers will wrap if they exceed the 80-character max line length. Furthermore, any of the following are legal within the spec:
/anticodon=(pos:complement(1123552..1123554),aa:Leu,
seq:caa)
/anticodon=(pos:complement(1123552..1123554),aa:Leu,seq:
caa)
/anticodon=(pos:complement(1123552..1123554),
aa:Leu,seq:caa)
What might be happening is that the parsers you're testing may be combining the /anticodon qualifier and its continuation line, but adding an extra space, which would then invalidate the qualifier. That would be a bug in the parser code. For parsing an anticodon qualifier with a continuation line, you have to:
1. Realize that you aren't yet at the end of the qualifier value (terminal right paren)
2. Read the next line, and strip all leading white space
3. Directly append (2) to the qualifier value
For other qualifiers with text content, a parser generally needs to include a space when appending a continuation line, unless the row ends in a hyphen or is an unbroken string (no spaces, such as with a long chemical formula or the protein sequence in a /translation qualifier). But /anticodon may need an exception, depending on how the parser code is written.
I hope that helps clear up the bug. If you file a bug report, please let us know the ticket for our records.

...

As our flatfile does not have quotes, the quoting is not valid. The quotes are a bug.

Best regards,
-Terence Murphy, Ph.D.
RefSeq and Gene Team Lead
NCBI/NLM/NIH/DHHS

bioperl/bioperl-live#321


heuermh · 2019-06-17T21:13:19Z

Thank you, @innovate-invent. I've also created an issue for biojava-legacy at biojava/biojava-legacy#50

prashantVaishla · 2019-08-14T10:01:08Z

Hi, I am new to EMBL format parsing.
Could you please share a sample input file or send me a link to one?

MaxGreil · 2021-02-04T19:29:55Z

Hi @josemduarte,
I would like to take on this issue. Can you please explain what needs to be done here?

josemduarte · 2021-02-04T19:48:51Z

Thanks @MaxGreil. The issue is explained in detail above. BioJava should be able to parse a wrapped qualifier as a single record:

FT                   /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT                   seq:caa)

Best would be to have that in a unit test and develop a fix based on the unit test.

There are more details on parse recommendations in the email pasted above:

For parsing an anticodon qualifier with a continuation line, you have to:
1. Realize that you aren't yet at the end of the qualifier value (terminal right paren)
2. Read the next line, and strip all leading white space
3. Directly append (2) to the qualifier value
For other qualifiers with text content, a parser generally needs to include a space when appending a continuation line, unless the row ends in a hyphen or is an unbroken string (no spaces, such as with a long chemical formula or the protein sequence in a /translation qualifier). But /anticodon may need an exception, depending on how the parser code is written.
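
The three steps quoted above could be sketched roughly as follows. This is only an illustration, not the fix that was merged: QualifierJoiner, openParens and joinParenthesized are hypothetical names, not part of the BioJava API, and the sketch assumes the caller has already removed the "FT" line prefix and collected the qualifier line together with its continuation lines. It covers only parenthesized values such as /anticodon; free-text qualifiers follow the space-joining rule described in the email and are not handled here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the BioJava API.
final class QualifierJoiner {

    // Returns the number of '(' minus the number of ')' in s.
    // A positive result means the parenthesized value is still open.
    static int openParens(CharSequence s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            }
        }
        return depth;
    }

    // Joins a qualifier line and its continuation lines without inserting
    // any separator. Assumes the "FT" prefix has already been removed.
    static String joinParenthesized(List<String> lines) {
        StringBuilder value = new StringBuilder();
        for (String line : lines) {
            // Steps 2 and 3: drop surrounding whitespace, then append directly.
            value.append(line.trim());
            // Step 1: stop once the terminal right paren has been reached.
            if (openParens(value) <= 0) {
                break;
            }
        }
        return value.toString();
    }

    public static void main(String[] args) {
        List<String> wrapped = Arrays.asList(
                "/anticodon=(pos:complement(1123552..1123554),aa:Leu,",
                "                   seq:caa)");
        // Prints: /anticodon=(pos:complement(1123552..1123554),aa:Leu,seq:caa)
        System.out.println(joinParenthesized(wrapped));
    }
}
```

The same call returns the same joined value for each of the three wrap positions listed as legal in the email, which makes it a convenient starting point for the unit test suggested above.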

innovate-invent mentioned this issue Jun 17, 2019

Multi-line GenBank anticodon qualifier parsed with space at line break biopython/biopython#2112

Open

heuermh mentioned this issue Jun 17, 2019

Parse anticodon and transl_except feature qualifiers when they line wrap biojava/biojava-legacy#50

Open

josemduarte added the help wanted label Sep 2, 2019

heuermh mentioned this issue Sep 3, 2019

Advance notice: next release 5.3.0 #848

Closed

innovate-invent mentioned this issue Nov 27, 2019

SIGI-HMM crash with some datasets brinkmanlab/IslandCompare#135

Closed

josemduarte assigned MaxGreil Feb 4, 2021

josemduarte removed the help wanted label Feb 4, 2021

MaxGreil mentioned this issue Mar 1, 2021

Github issue 843 - Genebank parser #919

Merged

MaxGreil mentioned this issue Mar 10, 2021

EMBL file parser is not parsing features #923

Open

josemduarte closed this as completed in #919 Mar 26, 2021
