Swiss format files do not contain dbxrefs for protein identifiers #372

jgoodson · 2014-10-08T18:06:57Z

The Swiss-Prot format ("swiss" in SeqIO) contains cross-references to other databases, and SeqIO currently stores these in the dbxrefs list of the resulting SeqRecord. During parsing, each "DR" line is stored as a list whose first element is the database name and whose subsequent elements hold the accession and optional information. The format makes a specific exception for EMBL/GenBank/DDBJ cross-references (database identifier "EMBL"): instead of a single accession, the line holds a reference to the source nucleotide sequence followed by a reference to the protein/CDS ID in the nucleotide database.
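To make the layout concrete, here is a minimal sketch of how one DR line maps to such a list and to the single dbxref that is kept today. It is plain Python and not Biopython's actual parser; `split_dr_line` is a hypothetical helper, and the accessions are made-up placeholders rather than real entries.

```python
def split_dr_line(line):
    """Split one Swiss-Prot DR line into a list of fields (illustrative only)."""
    # Drop the two-letter "DR" line code, surrounding whitespace, and the final period.
    body = line[2:].strip().rstrip(".")
    # Fields are separated by semicolons.
    return [field.strip() for field in body.split(";")]


# Placeholder EMBL cross-reference: nucleotide accession first, protein ID second.
fields = split_dr_line("DR   EMBL; X99999; CAA99999.1; -; Genomic_DNA.")
print(fields)

# Only the database name and the first accession end up in the dbxref string.
current_dbxref = f"{fields[0]}:{fields[1]}"
print(current_dbxref)
```

In this sketch the protein ID sits at `fields[2]` and never reaches the dbxref string.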

http://web.expasy.org/docs/userman.html#DR_EMBL

The protein ID comes second, both as defined in the Swiss-Prot format and in practice when using the EBI databases. This information is lost during parsing, since only the first accession on a DR line is kept. I believe this should be fixed. The dbxrefs list may already contain multiple cross-references to the same database, so the protein accessions could simply be stored as additional cross-references for every DR line with "EMBL" as the database ID. Alternatively, since this line layout is defined by the format, the protein IDs could be stored under a separate database identifier, such as "EMBLpro:X99999" instead of "EMBL:X99999". Either way, the current accession would stay stored exactly as it is now, to avoid breaking backwards compatibility.
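The two options can be compared with a small sketch. `embl_dbxrefs` is a hypothetical helper, not existing Biopython code; the input is an already-split DR line with placeholder accessions, and the sketch assumes a protein ID of "-" means that none is available.

```python
def embl_dbxrefs(fields, protein_db="EMBL"):
    """Build dbxref strings for one split DR line (hypothetical helper)."""
    database, accession = fields[:2]
    # Existing behaviour: database name plus first accession, left unchanged.
    refs = [f"{database}:{accession}"]
    # Proposed addition: also keep the protein ID found on EMBL lines.
    # Assumption: "-" in this position means there is no protein ID.
    if database == "EMBL" and len(fields) > 2 and fields[2] != "-":
        refs.append(f"{protein_db}:{fields[2]}")
    return refs


fields = ["EMBL", "X99999", "CAA99999.1", "-", "Genomic_DNA"]

# Option 1: protein ID stored as an additional "EMBL" cross-reference.
print(embl_dbxrefs(fields))
# Option 2: protein ID stored under a distinct identifier such as "EMBLpro".
print(embl_dbxrefs(fields, protein_db="EMBLpro"))
```

In both variants the first entry is the same string that scripts see today, so only code that assumes one "EMBL:" entry per DR line would notice the first option.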

I want to submit a pull request adding this feature, but first I would like opinions on which behavior would be best: adding the protein identifiers as additional "EMBL:X99999" references, or storing them under an alternative database ID.

Ideally, I feel these dbxrefs would either be stored differently, with enough flexibility to hold the information contained in Swiss-Prot or GenBank cross-reference lines, or at least be stored separately with more descriptive database identifiers that indicate which is the source sequence and which is the protein coding sequence. That would certainly break scripts that expect the source sequence to appear in dbxrefs in the specific form "EMBL:XX9999999999", though, so it is a more complicated question.

peterjc · 2018-07-04T15:55:32Z

Perhaps the SeqRecord object's .dbxrefs list is not the right place to record this annotation.

Reading https://web.expasy.org/docs/userman.html#DR_EMBL, there is a lot of nuance here that would be difficult to recover if we just used a single flat list for the entire record.

peterjc · 2020-07-22T12:36:16Z
