Github issue 843 - Genebank parser #919

MaxGreil · 2021-03-01T18:13:14Z

Reference Issue

Partly fixes #843

What does this implement/fix? Explain your changes.

Corrected GenbankReader.java to parse anticodon qualifiers when they line wrap.
Wrote new testGithub843 in GenbankReaderTest.java
Added mentioned file NC_018080.gb downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_018080

Additional comments

@josemduarte , can you please tell me where I can get NC_018080 in .embl file format? I couldn't find it.

josemduarte

Thanks @MaxGreil ! this is great progress.

Adding the full 14MB genbank file is a bit too much for a unit test. We should avoid adding a lot of data if not needed. The ideal unit test is a test that goes through a few lines only and asserts that the behaviour is as expected. For instance you can get the few lines excerpt from the genbank record to demonstrate the issue. You can even keep the few lines as a string constant inside the test.

MaxGreil · 2021-03-02T17:40:58Z

Hi @josemduarte , I reduced the size of file NC_018080.gb to contain only one relevant case with qualifier anticodon.

heuermh · 2021-03-02T17:49:15Z

Is the fix for transl_except as simple? Though here aren't any examples of these kind of failing lines in the linked issue.

(edit) Here is the doc, appears that the syntax is similar to that for anticodon

Qualifier       /transl_except=
Definition      translational exception: single codon the translation of which
                does not conform to genetic code defined by Organism and /codon=
Value format    (pos:location,aa:<amino_acid>) where amino_acid is the
                amino acid coded by the codon at the base_range position
Example         /transl_except=(pos:213..215,aa:Trp)
                /transl_except=(pos:1017,aa:TERM)
                /transl_except=(pos:2000..2001,aa:TERM)
                /transl_except=(pos:X22222:15..17,aa:Ala)
Comment         if the amino acid is not on the restricted vocabulary list use
                e.g., '/transl_except=(pos:213..215,aa:OTHER)' with
                '/note="name of unusual amino acid"';
                for modified amino-acid selenocysteine use three letter code
                'Sec'  (one letter code 'U' in amino-acid sequence)
                /transl_except=(pos:1002..1004,aa:Sec);
                for partial termination codons where TAA stop codon is
                completed by the addition of 3' A residues to the mRNA
                either a single base_position or a base_range is used, e.g.
                if partial stop codon is a single base:
                /transl_except=(pos:1017,aa:TERM)
                if partial stop codon consists of two bases:
                /transl_except=(pos:2000..2001,aa:TERM) with
                '/note='stop codon completed by the addition of 3' A residues 
                to the mRNA'.

aalhossary · 2021-03-03T00:19:50Z

ور Get Outlook for Android<https://aka.ms/ghei36>

…

________________________________ From: Michael L Heuer <notifications@github.com> Sent: Wednesday, March 3, 2021 1:49:32 AM To: biojava/biojava <biojava@noreply.github.com> Cc: Subscribed <subscribed@noreply.github.com> Subject: Re: [biojava/biojava] [WIP] Github issue 843 (#919) Is the fix for transl_except as simple? Though here aren't any examples of these kind of failing lines in the linked issue — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub<#919 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AA35TI2BI3RGNUQMYOGUNJDTBUQKZANCNFSM4YNA2ZVA>.

MaxGreil · 2021-03-03T11:52:10Z

I added a fix for qualifier transl_except in the genebank parser according to the documentation.

@heuermh thank you for posting the documentation for qualifier transl_except. Do you know an example file for qualifier transl_except with line wrap? Or is testGithub843 sufficient as it is?

@josemduarte does it make sense to split this PR for issue #843 into 2 separate PRs, i.e. one PR for the genebank parser and one PR for the embl parser?

josemduarte · 2021-03-05T05:52:03Z

biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/GenbankReaderTest.java

+		for (Iterator<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> it = dna.getFeaturesByType("tRNA").iterator(); it.hasNext();) {
+			FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound> tRNAFeature = it.next();
+			String anticodon = tRNAFeature.getQualifiers().get("anticodon").get(0).getValue();
+			assertFalse(anticodon.contains(" "));


How about also asserting that the string is exactly as expected? If I understand it right, it should be: (pos:complement(1123552..1123554),aa:Leu,seq:caa)

josemduarte · 2021-03-05T23:51:44Z

Do you know an example file for qualifier transl_except with line wrap? Or is testGithub843 sufficient as it is?

I would simply fabricate an example for transl_except with wrapping in the same gb file excerpt text you have added.

does it make sense to split this PR for issue #843 into 2 separate PRs, i.e. one PR for the genebank parser and one PR for the embl parser?

Do we already have an EMBL file parser somewhere in BioJava? Sorry I don't know well this part of BioJava. If we do and it's only about fixing it, then please go ahead and make it part of this PR. It will only need an additional unit test (again, it's ok to have fabricated data).

MaxGreil · 2021-03-06T13:42:47Z

I added a line in file NC_018080.gb qualifier transl_except and tested for correct parsing.

However, I have an additional question:

I had a look at the EMBL file parser with test givenAnEmilFileWhenProcessEmilFileThanTheSequenceShouldReturnAsExpected in biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/embl/EmblReaderTest.java.
To my understanding, the FeatureTable in file embl.test is not parsed at all.
Please have a look at biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/embl/EmblRecord.java. Variable featureTable is just a String that gets assign and overwritten by the last value.

@josemduarte , should I open a new issue for this problem?

josemduarte · 2021-03-09T17:07:42Z

I added a line in file NC_018080.gb qualifier transl_except and tested for correct parsing.

Great, thanks @MaxGreil ! I think this looks good now. @heuermh could you double check it too?

I had a look at the EMBL file parser with test givenAnEmilFileWhenProcessEmilFileThanTheSequenceShouldReturnAsExpected in biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/embl/EmblReaderTest.java.
To my understanding, the FeatureTable in file embl.test is not parsed at all.
Please have a look at biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/embl/EmblRecord.java. Variable featureTable is just a String that gets assign and overwritten by the last value.

Thanks for investigating. Indeed that deserves another issue separate from #843

josemduarte · 2021-03-24T23:37:05Z

I'll merge this by the end of the week if there's no other comments

Github issue 843

f70075c

josemduarte requested changes Mar 2, 2021

View reviewed changes

Changes

574a977

Add transl_except to parser

59bb88f

MaxGreil requested a review from josemduarte March 3, 2021 11:52

josemduarte reviewed Mar 5, 2021

View reviewed changes

Test

a52126f

MaxGreil requested a review from josemduarte March 5, 2021 12:00

Test changes

4881428

josemduarte approved these changes Mar 9, 2021

View reviewed changes

MaxGreil mentioned this pull request Mar 10, 2021

EMBL file parser is not parsing features #923

Open

MaxGreil changed the title ~~[WIP] Github issue 843~~ Github issue 843 - Genebank parser Mar 13, 2021

josemduarte merged commit 4d1cf58 into biojava:master Mar 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Github issue 843 - Genebank parser #919

Github issue 843 - Genebank parser #919

MaxGreil commented Mar 1, 2021 •

edited

Loading

josemduarte left a comment

MaxGreil commented Mar 2, 2021

heuermh commented Mar 2, 2021 •

edited

Loading

aalhossary commented Mar 3, 2021 via email

MaxGreil commented Mar 3, 2021

josemduarte Mar 5, 2021

josemduarte commented Mar 5, 2021

MaxGreil commented Mar 6, 2021

josemduarte commented Mar 9, 2021

josemduarte commented Mar 24, 2021

Github issue 843 - Genebank parser #919

Github issue 843 - Genebank parser #919

Conversation

MaxGreil commented Mar 1, 2021 • edited Loading

Reference Issue

What does this implement/fix? Explain your changes.

Additional comments

josemduarte left a comment

Choose a reason for hiding this comment

MaxGreil commented Mar 2, 2021

heuermh commented Mar 2, 2021 • edited Loading

aalhossary commented Mar 3, 2021 via email

MaxGreil commented Mar 3, 2021

josemduarte Mar 5, 2021

Choose a reason for hiding this comment

josemduarte commented Mar 5, 2021

MaxGreil commented Mar 6, 2021

josemduarte commented Mar 9, 2021

josemduarte commented Mar 24, 2021

MaxGreil commented Mar 1, 2021 •

edited

Loading

heuermh commented Mar 2, 2021 •

edited

Loading