GenBankWriter does not use information in the original header #942

Open
jamesmorris opened this issue Jul 12, 2021 · 3 comments

Comments

@jamesmorris
Contributor

jamesmorris commented Jul 12, 2021

Hello,

I have recently been working on adding an accession ID to a few hundred GenBank files using biojava.

I have been using the following approach:

  1. First reading the GenBank format files into a DNASequence object using GenbankReaderHelper.readGenbankDNASequence(inputStream)
  2. Then adding the new accession ID with the setAccession(new AccessionID("new_accession")) method on the DNASequence object
  3. Finally I use GenbankWriterHelper.writeNucleotideSequence(outputStream, sequences, GenbankWriterHelper.CIRCULAR_DNA) to create an updated GenBank file which includes the new accession

This works for inserting the new accession ID; however, information is lost in the locus line because, rather than reusing information from the original file, the writer creates a new locus line using default settings.

For example if I update a GenBank file that contains the following original locus line:
LOCUS test_locus_name 9291 BP DS-DNA CIRCULAR SYN 13-JUL-1994

The GenBank file that gets written by the writeNucleotideSequence() method will look like this:
LOCUS new_accession 9291 bp DNA circular 12-Jul-2021

We therefore lose the following:

  • The locus name
  • The double stranded prefix on DNA molecule type
  • The division
  • The original date

I would argue that there should be another way to write a GenBank file from a DNASequence that can use the original header, so that no information is lost through the process of reading and writing the same file.

I would be interested to know what you think about this.

Many thanks,
James

@richarda23
Contributor

Hi,
Looking at the GenBank spec and sample record (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), there is only one field to hold the locus name, and it currently takes the accessionId when writing. So there doesn't seem to be a way to store the original and the new one at the same time. Or are you just expecting the 'ACCESSION' row to change?

This is the relevant passage from the above link, so I guess the correct behaviour is a little undefined:

However, the 10 characters in the locus name are no longer sufficient to represent the amount of information originally intended to be contained in the locus name. The only rule now applied in assigning a locus name is that it must be unique. For example, for GenBank records that have 6-character accessions (e.g., U12345), the locus name is usually the first letter of the genus and species names, followed by the accession number. For 8-character character accessions (e.g., AF123456), the locus name is just the accession number.

Looking at the GenbankParser, it ignores fields 4 and 5 of the input document's LOCUS row and instead generates them from other information gleaned from the sequence, so it does lose the original MoleculeType and Division information.

There is only one field in which the modification date can be stored. You could argue that you've modified the GenBank record, so the modificationDate should reflect that.

I suppose a solution might be:

  1. Provide an overloaded GenbankWriter method that takes a configuration parameter object with options like 'keepOriginalModificationDate', 'keepOriginalLocus', etc.

  2. Enhance the parser to preserve the original MoleculeType and Division.
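As a rough sketch of what the first option could look like, the snippet below models a configuration object with the two flags named above and shows how a writer might pick the locus name and date from it. Everything here is hypothetical: `WriteOptions`, `locusName` and `modificationDate` are illustrative names, not existing BioJava classes or methods, and the date pattern is only an assumption based on the output shown earlier in this issue.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class LocusHeaderOptionsSketch {

    /** Hypothetical options object; not part of BioJava. */
    public static final class WriteOptions {
        private boolean keepOriginalLocus;
        private boolean keepOriginalModificationDate;

        public WriteOptions keepOriginalLocus(boolean value) {
            this.keepOriginalLocus = value;
            return this;
        }

        public WriteOptions keepOriginalModificationDate(boolean value) {
            this.keepOriginalModificationDate = value;
            return this;
        }
    }

    /** Uses the parsed locus name when requested and available, else the accession. */
    public static String locusName(WriteOptions options, String originalLocus, String accession) {
        if (options.keepOriginalLocus && originalLocus != null) {
            return originalLocus;
        }
        return accession;
    }

    /** Uses the parsed date when requested and available, else formats the given date. */
    public static String modificationDate(WriteOptions options, String originalDate, LocalDate today) {
        if (options.keepOriginalModificationDate && originalDate != null) {
            return originalDate;
        }
        // Assumed pattern, chosen to resemble "12-Jul-2021" from the written file above.
        return today.format(DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH));
    }

    public static void main(String[] args) {
        WriteOptions keep = new WriteOptions()
                .keepOriginalLocus(true)
                .keepOriginalModificationDate(true);
        System.out.println(locusName(keep, "test_locus_name", "new_accession"));
        System.out.println(modificationDate(keep, "13-JUL-1994", LocalDate.of(2021, 7, 12)));
    }
}
```

With both flags left at their defaults the helpers fall back to the accession and the supplied date, which mirrors the current behaviour described in this issue.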

@josemduarte
Contributor

Do I understand it right that #1042 fully fixes this issue?

@jamesmorris
Contributor Author

Yes, it fixes the issue of being able to maintain the locus line details.

I need to open another ticket about the other features of a GenBank file header that I found, while doing this work, are not correctly reproduced by the GenbankWriter.

josemduarte added a commit that referenced this issue Oct 26, 2022
GenBankWriter does not use information in the original header #942