The Swiss-Prot format ("swiss" in SeqIO) contains cross-references to other databases. Currently SeqIO stores these in the dbxrefs list of the resulting SeqRecord. During parsing, each "DR" line is read as a list whose first element is the database and whose subsequent elements hold the accession and optional information. The format makes a specific exception for EMBL/GenBank/DDBJ cross-references (database identifier "EMBL"): instead of a single accession, the line holds a reference to the source nucleotide sequence (e.g. the genome) followed by a reference to the protein/CDS ID in the nucleotide database.
http://web.expasy.org/docs/userman.html#DR_EMBL
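To make the line layout concrete, here is a minimal sketch of how an EMBL DR line breaks down into fields. It is not the Biopython parser: `split_dr_line` is a hypothetical helper, and the accession and protein ID in the example line are placeholders rather than real entries. The field order (database, nucleotide accession, protein ID, status, molecule type) is my reading of the manual section linked above.

```python
def split_dr_line(line):
    """Split a Swiss-Prot DR line into [database, field1, field2, ...].

    Hypothetical helper for illustration only; not part of Biopython.
    """
    if not line.startswith("DR "):
        raise ValueError("not a DR line: %r" % line)
    # Drop the two-letter line code and padding, then the trailing period.
    body = line[5:].strip().rstrip(".")
    return [field.strip() for field in body.split(";")]


# Placeholder identifiers, not a real database entry.
example = "DR   EMBL; X99999; CAA00000.1; -; Genomic_DNA."
fields = split_dr_line(example)

database = fields[0]        # "EMBL"
nucleotide_acc = fields[1]  # source nucleotide sequence accession
protein_id = fields[2]      # protein/CDS ID, the part that is currently dropped
```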
The protein ID comes second, both as defined in the Swiss-Prot format and in practice when using the EBI databases. This information is lost during parsing, since only the first accession in a DR line is kept. I believe this should be fixed. The dbxrefs list may already contain multiple cross-references to the same database, so the protein accessions could simply be stored as additional cross-references for every DR line with "EMBL" as the database ID. Alternatively, since this line layout is defined by the format, the protein IDs could be stored under a separate database identifier, such as "EMBLpro:X99999" instead of "EMBL:X99999", alongside the current accession, which would stay stored as it is now to avoid breaking backward compatibility.
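The two options can be compared side by side with a rough sketch. `embl_dbxrefs` is a hypothetical function, not existing Biopython code, "EMBLpro" is just the example prefix from this issue, and the input list uses placeholder identifiers. I am also assuming that a missing protein ID appears as "-" in the DR line.

```python
def embl_dbxrefs(fields, protein_prefix="EMBL"):
    """Build dbxref strings from the fields of an EMBL DR line.

    Hypothetical sketch: fields is [database, nucleotide_acc, protein_id, ...].
    """
    database, nucleotide_acc, protein_id = fields[0], fields[1], fields[2]
    if database != "EMBL":
        raise ValueError("expected an EMBL cross-reference, got %r" % database)
    # The existing entry is kept unchanged in both options.
    refs = ["%s:%s" % (database, nucleotide_acc)]
    # Assumption: "-" marks a missing protein ID, so nothing is added for it.
    if protein_id != "-":
        refs.append("%s:%s" % (protein_prefix, protein_id))
    return refs


# Placeholder identifiers, not a real database entry.
fields = ["EMBL", "X99999", "CAA00000.1", "-", "Genomic_DNA"]

same_prefix = embl_dbxrefs(fields)                               # option 1
separate_prefix = embl_dbxrefs(fields, protein_prefix="EMBLpro")  # option 2
```

With the first option a script cannot tell which "EMBL:" entry is the nucleotide accession without relying on list order; with the second the prefix carries that distinction.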
I would like to submit a pull request adding this feature, but first I want opinions on which behavior would be best: adding the protein IDs as additional "EMBL:X99999" references, or using an alternative database ID for them.
Ideally, I feel these cross-references would be stored differently, with enough flexibility to hold all the information in Swiss-Prot or GenBank cross-reference lines, or at least stored separately under more descriptive database identifiers that indicate which is the source sequence and which is the protein coding sequence. That would certainly break scripts that expect the source sequence to appear in dbxrefs in the specific form "EMBL:XX9999999999", though, so it is a more complicated question.
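Purely as an illustration of what a more flexible representation might look like, here is a sketch of a structured cross-reference that records the role of each accession. Nothing like `CrossRef` exists in Biopython as far as I know; the class, its `role` values, and the identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossRef:
    """Hypothetical structured cross-reference; not a Biopython class."""

    database: str
    accession: str
    role: str  # hypothetical values: "source" or "protein"

    def as_dbxref(self):
        # Render in the "DATABASE:ACCESSION" string form used in dbxrefs today.
        return "%s:%s" % (self.database, self.accession)


# Placeholder identifiers, not a real database entry.
refs = [
    CrossRef("EMBL", "X99999", "source"),
    CrossRef("EMBL", "CAA00000.1", "protein"),
]

# A legacy-style list containing only the source sequence accessions.
legacy_dbxrefs = [ref.as_dbxref() for ref in refs if ref.role == "source"]
```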