Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Github issue 843 - Genebank parser #919

Merged
merged 5 commits into from
Mar 26, 2021

Conversation

MaxGreil
Copy link
Contributor

@MaxGreil MaxGreil commented Mar 1, 2021

Reference Issue

Partly fixes #843

What does this implement/fix? Explain your changes.

Additional comments

@josemduarte , can you please tell me where I can get NC_018080 in .embl file format? I couldn't find it.

Copy link
Contributor

@josemduarte josemduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaxGreil ! this is great progress.

Adding the full 14MB genbank file is a bit too much for a unit test. We should avoid adding a lot of data if not needed. The ideal unit test is a test that goes through a few lines only and asserts that the behaviour is as expected. For instance you can get the few lines excerpt from the genbank record to demonstrate the issue. You can even keep the few lines as a string constant inside the test.

@MaxGreil
Copy link
Contributor Author

MaxGreil commented Mar 2, 2021

Hi @josemduarte , I reduced the size of file NC_018080.gb to contain only one relevant case with qualifier anticodon.

@heuermh
Copy link
Member

heuermh commented Mar 2, 2021

Is the fix for transl_except as simple? Though here aren't any examples of these kind of failing lines in the linked issue.

(edit) Here is the doc, appears that the syntax is similar to that for anticodon

Qualifier       /transl_except=
Definition      translational exception: single codon the translation of which
                does not conform to genetic code defined by Organism and /codon=
Value format    (pos:location,aa:<amino_acid>) where amino_acid is the
                amino acid coded by the codon at the base_range position
Example         /transl_except=(pos:213..215,aa:Trp)
                /transl_except=(pos:1017,aa:TERM)
                /transl_except=(pos:2000..2001,aa:TERM)
                /transl_except=(pos:X22222:15..17,aa:Ala)
Comment         if the amino acid is not on the restricted vocabulary list use
                e.g., '/transl_except=(pos:213..215,aa:OTHER)' with
                '/note="name of unusual amino acid"';
                for modified amino-acid selenocysteine use three letter code
                'Sec'  (one letter code 'U' in amino-acid sequence)
                /transl_except=(pos:1002..1004,aa:Sec);
                for partial termination codons where TAA stop codon is
                completed by the addition of 3' A residues to the mRNA
                either a single base_position or a base_range is used, e.g.
                if partial stop codon is a single base:
                /transl_except=(pos:1017,aa:TERM)
                if partial stop codon consists of two bases:
                /transl_except=(pos:2000..2001,aa:TERM) with
                '/note='stop codon completed by the addition of 3' A residues 
                to the mRNA'.

@aalhossary
Copy link
Member

aalhossary commented Mar 3, 2021 via email

@MaxGreil
Copy link
Contributor Author

MaxGreil commented Mar 3, 2021

I added a fix for qualifier transl_except in the genebank parser according to the documentation.

@heuermh thank you for posting the documentation for qualifier transl_except. Do you know an example file for qualifier transl_except with line wrap? Or is testGithub843 sufficient as it is?

@josemduarte does it make sense to split this PR for issue #843 into 2 separate PRs, i.e. one PR for the genebank parser and one PR for the embl parser?

for (Iterator<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> it = dna.getFeaturesByType("tRNA").iterator(); it.hasNext();) {
FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound> tRNAFeature = it.next();
String anticodon = tRNAFeature.getQualifiers().get("anticodon").get(0).getValue();
assertFalse(anticodon.contains(" "));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about also asserting that the string is exactly as expected? If I understand it right, it should be: (pos:complement(1123552..1123554),aa:Leu,seq:caa)

@josemduarte
Copy link
Contributor

Do you know an example file for qualifier transl_except with line wrap? Or is testGithub843 sufficient as it is?

I would simply fabricate an example for transl_except with wrapping in the same gb file excerpt text you have added.

does it make sense to split this PR for issue #843 into 2 separate PRs, i.e. one PR for the genebank parser and one PR for the embl parser?

Do we already have an EMBL file parser somewhere in BioJava? Sorry I don't know well this part of BioJava. If we do and it's only about fixing it, then please go ahead and make it part of this PR. It will only need an additional unit test (again, it's ok to have fabricated data).

@MaxGreil
Copy link
Contributor Author

MaxGreil commented Mar 6, 2021

I added a line in file NC_018080.gb qualifier transl_except and tested for correct parsing.

However, I have an additional question:

I had a look at the EMBL file parser with test givenAnEmilFileWhenProcessEmilFileThanTheSequenceShouldReturnAsExpected in biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/embl/EmblReaderTest.java.
To my understanding, the FeatureTable in file embl.test is not parsed at all.
Please have a look at biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/embl/EmblRecord.java. Variable featureTable is just a String that gets assign and overwritten by the last value.

@josemduarte , should I open a new issue for this problem?

@josemduarte
Copy link
Contributor

I added a line in file NC_018080.gb qualifier transl_except and tested for correct parsing.

Great, thanks @MaxGreil ! I think this looks good now. @heuermh could you double check it too?

I had a look at the EMBL file parser with test givenAnEmilFileWhenProcessEmilFileThanTheSequenceShouldReturnAsExpected in biojava-core/src/test/java/org/biojava/nbio/core/sequence/io/embl/EmblReaderTest.java.
To my understanding, the FeatureTable in file embl.test is not parsed at all.
Please have a look at biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/embl/EmblRecord.java. Variable featureTable is just a String that gets assign and overwritten by the last value.

Thanks for investigating. Indeed that deserves another issue separate from #843

@MaxGreil MaxGreil changed the title [WIP] Github issue 843 Github issue 843 - Genebank parser Mar 13, 2021
@josemduarte
Copy link
Contributor

I'll merge this by the end of the week if there's no other comments

@josemduarte josemduarte merged commit 4d1cf58 into biojava:master Mar 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Biojava fails to parse Genbank and EMBL format
4 participants