SeqIO GenBank output errors on locus tag length #526
Biopython is more flexible on parsing (since there are so many invalid GenBank files out there), but attempts to be strict on output. We are imposing the 16 character limit in order to follow the NCBI GenBank specification. Quoting ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt (April 15 2015, NCBI-GenBank Flat File Release 207.0)
That is, according to the NCBI specification, the locus field is limited to 16 characters (columns 13 to 28 only).
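As a rough illustration of what the 16 character limit means in practice, here is a minimal sketch of a pre-write check. The helper names (`fits_locus_field`, `shorten_locus_name`) and the constant are hypothetical and not part of Biopython; the only fact taken from the specification is that columns 13 to 28 give 16 characters.

```python
# Hypothetical helpers, NOT part of Biopython's API.
# Columns 13 to 28 inclusive give 28 - 13 + 1 = 16 characters.
MAX_LOCUS_LEN = 28 - 13 + 1


def fits_locus_field(name, limit=MAX_LOCUS_LEN):
    """Return True if name is non-empty, has no whitespace, and fits the field."""
    if not name or any(ch.isspace() for ch in name):
        return False
    return len(name) <= limit


def shorten_locus_name(name, limit=MAX_LOCUS_LEN):
    """Naively truncate name to the field width (may create duplicate names)."""
    return name[:limit]


if __name__ == "__main__":
    for candidate in ("SCU49845", "a_very_long_locus_name_from_a_tool"):
        print(candidate, fits_locus_field(candidate), shorten_locus_name(candidate))
```

Truncation is the bluntest possible fix and can make two distinct names collide, so renaming records deliberately before writing is probably the safer choice.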
The 16 character limit implied by using columns 13 to 28 has not changed as of December 15 2016, NCBI-GenBank Flat File Release 217.0. Note they do recommend:
However, many of the fields, like division and date, do not define any way to specify a null value for missing data. Traditionally such fields have been left blank (i.e. spaces), which breaks parsing based on a whitespace separator.
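The whitespace-splitting problem can be seen in a small sketch. The two LOCUS lines below are illustrative only: their spacing is not column-accurate, and `split_locus_fields` is a hypothetical helper rather than anything Biopython does.

```python
# Hypothetical demonstration, not Biopython's parser.
def split_locus_fields(line):
    """Split a LOCUS line on runs of whitespace."""
    return line.split()


# Illustrative lines; spacing is NOT column-accurate.
with_division = "LOCUS       SCU49845     5028 bp    DNA     linear   PLN 21-JUN-1999"
blank_division = "LOCUS       SCU49845     5028 bp    DNA     linear       21-JUN-1999"

full = split_locus_fields(with_division)
short = split_locus_fields(blank_division)

# With the division present, the seventh token is the division code.
print(full[6])
# With the division left blank, the date slides into that position,
# so a purely positional parser would misread it as the division.
print(short[6])
```

A fixed-column parser does not have this problem, since a blank field still occupies its columns, which is one reason a whitespace-separated layout needs every field to be filled in.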
See also #802, as of which Biopython will squeeze in up to 26 letters if space allows (by stealing space from the length field).
I've found SeqIO.write() raises an error when the locus tag/name is over 16 characters in a GenBank file. While 10 characters used to be the limit for this field, longer locus tags are now permitted, and according to the specifications there's effectively no limit:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB
Is there interest in increasing the limit when writing out sequences? We're seeing a lot of longer locus tags produced by various tools, and BioPerl seems to have no issue with these longer locus tags.
Thanks.