SeqIO GenBank output errors on locus tag length #526

Closed
lairdm opened this issue Apr 22, 2015 · 3 comments

Comments

lairdm commented Apr 22, 2015

I've found that SeqIO.write() raises an error when the locus tag/name is over 16 characters in a GenBank file. While 10 characters used to be the limit for this field, longer locus tags are now permitted, and according to the specification there's effectively no limit:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB

Is there interest in increasing the limit when writing out sequences? We're seeing a lot of longer locus tags produced by various tools, and BioPerl seems to have no issue with them.

Thanks.

peterjc changed the title from "SeqIO parser errors on locus tag length" to "SeqIO GenBank output errors on locus tag length" on Apr 22, 2015
Member

peterjc commented Apr 22, 2015

Biopython is more flexible on parsing (since there are so many invalid GenBank files out there), but attempts to be strict on output. We are imposing the 16 character limit in order to follow the NCBI GenBank specification.

Quoting ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt (April 15 2015, NCBI-GenBank Flat File Release 207.0)

LOCUS   - A short mnemonic name for the entry, chosen to suggest the
sequence's definition. Mandatory keyword/exactly one record.

...

3.4.4 LOCUS Format

...

The detailed format for the LOCUS line format is as follows:

Positions  Contents
---------  --------
01-05      'LOCUS'
06-12      spaces
13-28      Locus name
29-29      space
...

That is, according to the NCBI specification, the locus name field is limited to 16 characters (columns 13 to 28 only).
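As a rough illustration of the column arithmetic in the table above, here is a minimal Python sketch that builds columns 1 to 29 of a LOCUS line and rejects names that do not fit. The helper name `locus_line_prefix` is hypothetical and this is not Biopython's actual code; it only restates the quoted positions.

```python
# Column positions are 1-based and inclusive, as in the quoted table.
LOCUS_NAME_START = 13
LOCUS_NAME_END = 28
MAX_LOCUS_NAME = LOCUS_NAME_END - LOCUS_NAME_START + 1  # 16 characters


def locus_line_prefix(name):
    """Return columns 1-29 of a LOCUS line (hypothetical helper)."""
    if len(name) > MAX_LOCUS_NAME:
        raise ValueError(
            "Locus name %r is %d characters; columns %d-%d allow only %d"
            % (name, len(name), LOCUS_NAME_START, LOCUS_NAME_END, MAX_LOCUS_NAME)
        )
    # 'LOCUS' (01-05) + spaces (06-12) + padded name (13-28) + space (29)
    return "LOCUS" + " " * 7 + name.ljust(MAX_LOCUS_NAME) + " "
```

A 17-character name raises `ValueError` here, which mirrors the strict-on-output behaviour described above.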

peterjc closed this as completed on Apr 22, 2015
Member

peterjc commented Feb 2, 2017

The 16 character limit implied by using columns 13 to 28 has not changed as of December 15 2016, NCBI-GenBank Flat File Release 217.0. Note they do recommend:

  Although each of these data values can be found at column-specific
positions, we encourage those who parse the contents of the LOCUS
line to use a token-based approach. This will prevent the need for
software changes if the spacing of the data values ever has to be
modified.

However, many of the fields, such as division and date, do not define any way to specify a null value for missing data. Missing values have traditionally been left blank (i.e. spaces), which breaks parsing based on a whitespace separator.
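To make the problem concrete, here is a small Python sketch comparing the two approaches on a simplified fixed-width line. The field widths after column 29, the helper names, and the sample values are all assumptions made for illustration; they are not taken from the NCBI specification.

```python
def build_line(name, length, mol_type, topology, division, date):
    """Build a simplified fixed-width line (illustrative widths only)."""
    return (
        "LOCUS".ljust(12)
        + name.ljust(16) + " "
        + str(length).rjust(11) + " bp "
        + mol_type.ljust(9) + " "
        + topology.ljust(8) + " "
        + division.ljust(3) + " "
        + date
    )


def division_by_token(line):
    """Token-based: assume the division is the seventh token."""
    return line.split()[6]


def division_by_column(line):
    """Column-based: slice the three columns reserved for the division."""
    return line[63:66].strip()


full = build_line("EXAMPLE01", 5028, "DNA", "linear", "PLN", "21-JUN-1999")
blank = build_line("EXAMPLE01", 5028, "DNA", "linear", "", "21-JUN-1999")

print(division_by_token(full), division_by_column(full))
# With a blank division the token approach silently returns the date,
# while the column approach returns an empty string.
print(division_by_token(blank), repr(division_by_column(blank)))
```

With the division left blank, the line splits into seven tokens instead of eight, so every later field shifts one position to the left.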

Member

peterjc commented Feb 2, 2017

See also #802, as of which Biopython will squeeze in up to 26 letters if space allows (by stealing space from the length field).
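A rough sketch of that idea in Python, not the actual change from #802: it assumes the sequence length field occupies columns 30 to 40 (this is not quoted in this thread), so the name, one separating space, and the length share 28 columns in total. The helper name `name_and_length_field` is hypothetical.

```python
# Assumed: name (13-28) + space (29) + length (30-40) = 28 columns in total.
NAME_AND_LENGTH_WIDTH = 28


def name_and_length_field(name, seq_length):
    """Return the 28-column name-plus-length field (hypothetical helper)."""
    digits = str(seq_length)
    # At least one space must separate the name from the length.
    if len(name) + 1 + len(digits) > NAME_AND_LENGTH_WIDTH:
        raise ValueError(
            "Locus name %r does not fit next to length %s" % (name, digits)
        )
    # Right-justify the length in whatever room the name leaves over.
    return name + digits.rjust(NAME_AND_LENGTH_WIDTH - len(name))
```

For names of 16 characters or fewer this gives the same layout as the strict format. A 26-letter name only fits when the length is a single digit, which is where the "if space allows" caveat comes from.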
