GenBankWriter does not use information in the original header #942

jamesmorris · 2021-07-12T10:15:31Z

Hello,

I have recently been working on adding an accession ID to a few hundred GenBank files using biojava.

I have been using the following approach:

First reading the GenBank format files into a DNASequence object using GenbankReaderHelper.readGenbankDNASequence(inputStream)
Then adding the new accession ID with the setAccession(new AccessionID("new_accession")) method on the DNASequence object
Finally I use GenbankWriterHelper.writeNucleotideSequence(outputStream, sequences, GenbankWriterHelper.CIRCULAR_DNA) to create an updated GenBank file which includes the new accession

This works for inserting the new accession ID; however, information is lost in the locus line because, rather than using information from the original file, a new locus line is created using default settings.

For example if I update a GenBank file that contains the following original locus line:
LOCUS test_locus_name 9291 BP DS-DNA CIRCULAR SYN 13-JUL-1994

The GenBank file that gets written by the writeNucleotideSequence() method will look like this:
LOCUS new_accession 9291 bp DNA circular 12-Jul-2021

We therefore lose the following:

The locus name
The double-stranded prefix (DS-) on the DNA molecule type
The division
The original date
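
To make the list above concrete, here is a minimal sketch in plain Java (no BioJava dependency) that splits the two locus lines from this report into fields and reports which ones differ. The `LocusFields` class and its `parse` and `lostFields` methods are hypothetical names used for illustration only; they are not part of the BioJava API. The sketch also assumes whitespace-separated tokens and treats a missing token as a missing division, which is a simplification of the real column-based GenBank format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of BioJava.
final class LocusFields {
    final String name;
    final int length;
    final String moleculeType;
    final String topology;
    final String division; // empty when the line carries no division token
    final String date;

    LocusFields(String name, int length, String moleculeType,
                String topology, String division, String date) {
        this.name = name;
        this.length = length;
        this.moleculeType = moleculeType;
        this.topology = topology;
        this.division = division;
        this.date = date;
    }

    // Simplified parse: splits on whitespace instead of fixed columns.
    // Expected shape: LOCUS name length bp moltype topology [division] date
    static LocusFields parse(String line) {
        String[] t = line.trim().split("\\s+");
        if (t.length < 7 || t.length > 8 || !t[0].equals("LOCUS")) {
            throw new IllegalArgumentException("Unexpected LOCUS line: " + line);
        }
        // Assumption: with only 7 tokens, the absent one is the division.
        String division = (t.length == 8) ? t[6] : "";
        return new LocusFields(t[1], Integer.parseInt(t[2]), t[4], t[5],
                division, t[t.length - 1]);
    }

    // Names of the fields whose values differ, ignoring letter case.
    static List<String> lostFields(LocusFields original, LocusFields written) {
        List<String> lost = new ArrayList<>();
        if (!original.name.equalsIgnoreCase(written.name)) lost.add("locus name");
        if (!original.moleculeType.equalsIgnoreCase(written.moleculeType)) lost.add("molecule type");
        if (!original.topology.equalsIgnoreCase(written.topology)) lost.add("topology");
        if (!original.division.equalsIgnoreCase(written.division)) lost.add("division");
        if (!original.date.equalsIgnoreCase(written.date)) lost.add("date");
        return lost;
    }

    public static void main(String[] args) {
        LocusFields original =
                parse("LOCUS test_locus_name 9291 BP DS-DNA CIRCULAR SYN 13-JUL-1994");
        LocusFields written =
                parse("LOCUS new_accession 9291 bp DNA circular 12-Jul-2021");
        System.out.println(lostFields(original, written));
    }
}
```

Comparing case-insensitively means that `CIRCULAR` versus `circular` is not counted as a loss, so only the four fields listed above are reported.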

I would argue that there should be another way to write a GenBank file from a DNASequence that can reuse the original header, so that no information is lost in the process of reading and writing the same file.

I would be interested to know what you think about this.

Many thanks,
James

richarda23 · 2021-07-20T21:04:17Z

Hi,
Looking at the GenBank spec and sample record (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), there is only one field to hold the locus name, and it currently takes the accessionId when writing. So there doesn't seem to be a way to store the original and the new one at the same time. Or are you just expecting the 'ACCESSION' row to change?

This is what the spec at the above link says, so I guess the correct behaviour is a little undefined:

However, the 10 characters in the locus name are no longer sufficient to represent the amount of information originally intended to be contained in the locus name. The only rule now applied in assigning a locus name is that it must be unique. For example, for GenBank records that have 6-character accessions (e.g., U12345), the locus name is usually the first letter of the genus and species names, followed by the accession number. For 8-character character accessions (e.g., AF123456), the locus name is just the accession number.

Looking at the GenbankParser, it ignores fields 4 and 5 of the input document's LOCUS row and instead generates them from other information gleaned from the sequence, so it does lose the original MoleculeType and Division information.

There is only one field in which the modification date can be stored. You could argue that you've modified the GenBank record, so the modificationDate should reflect that.

I suppose a solution might be to:

Provide an overloaded GenbankWriter method that takes a configuration parameter object with options like 'keepOriginalModificationDate', 'keepOriginalLocus', etc.
Enhance the parser to preserve the original MoleculeType and Division.
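
As a rough illustration of the configuration-object idea, the sketch below shows how such flags could steer the locus line. Everything in it is hypothetical: `LocusWriteOptions`, `LocusRenderer` and `renderLocus` are made-up names, only the two flag names come from the suggestion above, and none of this is existing BioJava API or the change that was eventually merged. The output is a simplified, space-separated line rather than a column-aligned GenBank LOCUS row.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical configuration object; flag names mirror the suggestion above.
final class LocusWriteOptions {
    boolean keepOriginalLocus = false;
    boolean keepOriginalModificationDate = false;
}

// Hypothetical renderer; not part of BioJava.
final class LocusRenderer {
    private static final DateTimeFormatter GENBANK_DATE =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

    // Builds a simplified LOCUS line, choosing between original and
    // regenerated values according to the options.
    static String renderLocus(String originalLocus, String originalDate,
                              String accession, int length, LocalDate today,
                              LocusWriteOptions options) {
        String name = options.keepOriginalLocus ? originalLocus : accession;
        String date = options.keepOriginalModificationDate
                ? originalDate
                : today.format(GENBANK_DATE);
        return "LOCUS " + name + " " + length + " bp DNA circular " + date;
    }

    public static void main(String[] args) {
        LocusWriteOptions keepBoth = new LocusWriteOptions();
        keepBoth.keepOriginalLocus = true;
        keepBoth.keepOriginalModificationDate = true;

        LocalDate writeDate = LocalDate.of(2021, 7, 12);
        // Default options: accession as locus name, write date as date.
        System.out.println(renderLocus("test_locus_name", "13-JUL-1994",
                "new_accession", 9291, writeDate, new LocusWriteOptions()));
        // Both flags set: original locus name and date are carried over.
        System.out.println(renderLocus("test_locus_name", "13-JUL-1994",
                "new_accession", 9291, writeDate, keepBoth));
    }
}
```

Defaulting both flags to false keeps the current behaviour for existing callers, so preserving the header would be opt-in.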

josemduarte · 2022-10-19T05:53:20Z

Do I understand it right that #1042 fully fixes this issue?

jamesmorris · 2022-10-19T11:31:13Z

Yes, it fixes the issue of being able to maintain the locus line details.

I need to open another ticket for the other features of the GenBank file header that, as I discovered while doing this work, are not correctly reproduced by the GenbankWriter.

josemduarte added a commit that referenced this issue Oct 26, 2022

Merge pull request #1042 from jamesmorris/master

3cece28
