Swiss format files do not contain dbxrefs for protein identifiers #372

Open
jgoodson opened this issue Oct 8, 2014 · 2 comments


jgoodson commented Oct 8, 2014

The Swiss-Prot format ("swiss") contains cross-references to other databases. Currently SeqIO stores these in the dbxrefs list of the resulting SeqRecord. During parsing, each "DR" line is read into a list whose first element is the database and whose subsequent elements hold the accession and optional information. The format makes a specific exception for EMBL/GenBank/DDBJ cross-references (database identifier "EMBL"): instead of a single accession, the line stores a reference to the genome and then, on the same line, a reference to the protein/CDS id in the nucleotide database.

http://web.expasy.org/docs/userman.html#DR_EMBL

The protein id comes second, both as defined in the Swiss-Prot format and in practice when using the EBI databases. This information is lost during parsing, since only the first accession in a DR line is kept in dbxrefs. I believe this should be fixed. The dbxrefs list may already contain multiple cross-references to the same database, so the protein accessions could simply be stored as additional cross-references for all DR lines with "EMBL" as the database ID. Alternatively, since this line layout is defined by the format, the protein IDs could be stored under a different database identifier, such as "EMBLpro:X99999" instead of "EMBL:X99999". In either case the current accession would still be stored exactly as it is now, to preserve backwards compatibility.

I want to submit a pull request adding this feature, but I wanted to get opinions on which behavior would be best (just add as additional "EMBL:X99999" references or with an alternative database ID for the protein identifiers).
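To make the two options concrete, here is a minimal standalone sketch. It does not use Biopython's parser; `parse_dr_line` and `dbxrefs_from_dr` are hypothetical helpers, the accessions are made up, and the DR line layout is assumed from the DR_EMBL section of the user manual linked above.

```python
def parse_dr_line(line):
    """Split a Swiss-Prot DR line into its semicolon-separated fields."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the "DR" line code, surrounding whitespace and the trailing period.
    body = line[2:].strip().rstrip(".")
    return [field.strip() for field in body.split(";")]


def dbxrefs_from_dr(fields, protein_db="EMBLpro"):
    """Build dbxref strings for one DR line.

    The first accession is always emitted as "DB:ACCESSION", matching the
    form described in this issue. For EMBL lines the second field is assumed
    to be the protein id and is emitted under `protein_db` (hypothetical name).
    """
    db, primary = fields[0], fields[1]
    refs = ["%s:%s" % (db, primary)]
    # Assumption: "-" marks a missing protein id.
    if db == "EMBL" and len(fields) > 2 and fields[2] != "-":
        refs.append("%s:%s" % (protein_db, fields[2]))
    return refs


# Made-up example line following the assumed DR_EMBL layout.
fields = parse_dr_line("DR   EMBL; X99999; CAA99999.1; -; Genomic_DNA.")

option_same_db = dbxrefs_from_dr(fields, protein_db="EMBL")
option_new_db = dbxrefs_from_dr(fields, protein_db="EMBLpro")
```

With `protein_db="EMBL"` the protein id becomes just another "EMBL:..." entry; with "EMBLpro" it stays distinguishable from the source sequence accession. Either way the existing first entry is unchanged.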

Ideally, I feel dbxrefs would either be stored differently, with enough flexibility to hold the information contained in Swiss-Prot or GenBank cross-reference lines, or at least be stored separately with more descriptive database identifiers indicating which is the source sequence and which is the protein coding sequence. That would certainly break scripts that expect the current source sequence to appear in dbxrefs in the specific format "EMBL:XX9999999999", though, and it is a more complicated question.


peterjc commented Jul 4, 2018

Perhaps the SeqRecord object's .dbxrefs list is not the right place to record this annotation.

Reading https://web.expasy.org/docs/userman.html#DR_EMBL there is a lot of nuance here which would be difficult to recover if we just used a single flat list for the entire record.
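As a rough illustration of what a non-flat representation might look like, here is a sketch that keeps each field of an EMBL cross-reference separate. `EmblCrossRef` and `embl_crossref_from_fields` are hypothetical names, not existing Biopython API, and the field order (accession, protein id, status, molecule type) is an assumption based on the DR_EMBL section linked above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmblCrossRef:
    """One EMBL DR line, kept field by field (hypothetical structure)."""

    nucleotide_accession: str
    protein_id: Optional[str]
    status: Optional[str]
    molecule_type: Optional[str]


def embl_crossref_from_fields(fields):
    """Build an EmblCrossRef from the split fields of a DR line.

    `fields` is expected to look like
    ["EMBL", accession, protein_id, status, molecule_type];
    missing trailing fields and "-" placeholders become None.
    """
    if not fields or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference: %r" % (fields,))
    values = list(fields[1:5])
    values += [None] * (4 - len(values))
    values = [None if v in (None, "-") else v for v in values]
    if values[0] is None:
        raise ValueError("EMBL cross-reference without an accession")
    return EmblCrossRef(*values)


# Made-up accessions, for illustration only.
xref = embl_crossref_from_fields(
    ["EMBL", "X99999", "CAA99999.1", "-", "Genomic_DNA"]
)
```

A structure like this would keep the status and molecule type recoverable, which a flat list of "DB:ACCESSION" strings cannot do.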


peterjc commented Jul 22, 2020

See also discussion on #2708
