Import and export fasta format for RNA, DNA and PEPTIDES #1755

Closed
even1024 opened this issue Feb 28, 2024 · 1 comment · Fixed by #1765

Comments

@even1024
Collaborator

even1024 commented Feb 28, 2024

Background

This task covers import/export of sequences with defined monomers.

In FASTA, each sequence is represented as a header line, optional comment lines, and a plain string of symbols, for example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

The plain string is a combination of the following symbols:

for peptides:
A - Alanine
C - Cysteine
D - Aspartic Acid
E - Glutamic Acid
F - Phenylalanine
G - Glycine
H - Histidine
I - Isoleucine
K - Lysine
L - Leucine
M - Methionine
N - Asparagine
O - Pyrrolysine
P - Proline
Q - Glutamine
R - Arginine
S - Serine
T - Threonine
U - Selenocysteine
V - Valine
W - Tryptophan
Y - Tyrosine

for RNA nucleotides:
A - AMP (Adenosine monophosphate)
C - CMP (Cytidine monophosphate)
G - GMP (Guanosine monophosphate)
U - UMP (Uridine monophosphate)
T - rTMP (Ribothymidine monophosphate)

for DNA nucleotides:
A - dAMP (Deoxyadenosine monophosphate)
C - dCMP (Deoxycytidine monophosphate)
G - dGMP (Deoxyguanosine monophosphate)
U - dUMP (Deoxyuridine monophosphate)
T - TMP (Thymidine monophosphate)

special symbols:
"*" - translation stop
"-" - gap of indeterminate length

Requirements:

  1. A FASTA file can contain one or several sequences.
    Each sequence in FASTA format takes two or more lines of text. The first line is an identifying header; the remaining lines (one or more) hold the sequence itself.
  • The header
    The header line starts with a greater-than symbol (">") and ends with a newline. Allowed characters are "A" to "Z", "a" to "z", "0" to "9", "_", "-", ".", ",", ";" and "|", with spaces between them.
  • The sequence
    • DNA/RNA sequences can contain only the symbols "A", "C", "G", "T", "U", "-".
    • Peptide sequences can contain only the symbols "A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "*", "-".
    • Lower-case letters are accepted and mapped to upper case.
    • A sequence can be written on one line or across several lines.
  • Comments
    A comment line starts with a semicolon (";") and ends with a newline. It may contain any symbol (including ">").
  2. Import FASTA
  • The first symbol in the FASTA file must be ">". If it is not, an error message should be displayed and the import should not be performed.
  • The header should be ignored during import.
  • Comments should be ignored during import.
  • Each sequence should be imported as a separate molecule. A sequence starts at the first symbol of the line after the header line and ends at the last symbol before the next ">" (or at the end of the file).
  • If a sequence contains a disallowed symbol, an error message should be displayed and the import should not be performed.
  • "*" marks the end of a peptide sequence.
  • If "*" occurs between two letters, it should be treated as a break in the peptide chain (no bond is created between the monomers separated by the "*").
  • "-" should be ignored during import.
  • RNA/DNA and peptide sequences are imported according to the rules for the Sequence format.
  • The "U" symbol in a peptide should be recognized as Selenocysteine.
  3. Export to FASTA
  • Each molecule should be exported as a separate sequence with the header "Sequence N", where N is the number of the sequence.
  • The header of each new sequence should start on a new line.
  • Only sequences of the same type (DNA/RNA/Peptide) can be exported to one FASTA file.
  • A complex polymer made up of sequences of different types, or a branched polymer, is not exported, and the system should display an error message.
  • For molecules made up of modified monomers, the natural analog is exported for a modified peptide monomer, and the natural analog of the base is exported for a modified nucleotide.
  • [DRAFT] If a monomer's natural analog is 'X', that monomer should be ignored during export.
  • CHEMs and RNA monomers that are not part of a nucleotide or nucleoside are ignored during export.
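
The import rules above can be illustrated with a small pure-Python sketch. It is not the Indigo implementation: `parse_fasta` is a hypothetical helper, it returns plain strings instead of molecules, and skipping blank lines is an assumption the requirements do not state. Note that "O" is rejected here because it is absent from the allowed peptide symbols, even though the symbol table lists Pyrrolysine.

```python
# Alphabets taken from the requirements; "*" and "-" are handled separately.
NUCLEOTIDE_SYMBOLS = set("ACGTU")
PEPTIDE_SYMBOLS = set("ACDEFGHIKLMNPQRSTUVWY")


def parse_fasta(text, seq_type):
    """Hypothetical helper: return one list of chains per FASTA record.

    Each chain is a string of monomer symbols. A "*" inside a peptide
    record splits it into separate chains (no bond across the break).
    """
    if seq_type not in ("RNA", "DNA", "PEPTIDE"):
        raise ValueError(f"unknown sequence type: {seq_type!r}")
    if not text.startswith(">"):
        raise ValueError('FASTA input must start with ">"')

    allowed = PEPTIDE_SYMBOLS if seq_type == "PEPTIDE" else NUCLEOTIDE_SYMBOLS
    records = []    # one list of symbols per header line
    current = None  # symbols of the record being read

    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # comments are ignored; blank lines too (assumption)
        if line.startswith(">"):
            if current is not None:
                records.append(current)
            current = []  # the header text itself is ignored
            continue
        for symbol in line.upper():  # lower case is mapped to upper case
            if symbol == "-":
                continue  # gaps are ignored on import
            if symbol == "*" and seq_type == "PEPTIDE":
                current.append(symbol)
            elif symbol in allowed:
                current.append(symbol)
            else:
                raise ValueError(f"symbol {symbol!r} is not allowed in {seq_type}")
    if current is not None:
        records.append(current)

    # Split every record on "*" and drop empty pieces such as a trailing "*".
    return [[chain for chain in "".join(r).split("*") if chain] for r in records]
```

Raising on the first disallowed symbol, before anything is returned, mirrors the rule that a failed import must not be partially performed.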

Solution

  1. Implement the following functions for the C API, where type can be one of "RNA", "DNA" or "PEPTIDE":
int indigoLoadFASTA(int source, const char* type);
int indigoLoadFASTAFromString(const char* string, const char* type);
int indigoLoadFASTAFromFile(const char* filename, const char* type);
int indigoLoadFASTAFromBuffer(const char* buffer, int size, const char* type);
int indigoSaveFASTA(int molecule, int output, const char* type);
int indigoSaveFASTAToFile(int molecule, const char* filename, const char* type);
const char* indigoFASTA(int molecule, const char* type);
  2. Add language bindings for Python, Java, C#
    Python binding functions:
    def loadFASTA(self, input_string: str, sequence_type: str):
    def loadFASTAFromFile(self, input_file: str, sequence_type: str):
    def FASTA(self, sequence_type: str):
    def saveFASTA(self, output_file: str, sequence_type: str):

  3. Add the following content types to WASM "loadMoleculeOrReaction" and Indigo service "convert" API:

chemical/x-rna-fasta, chemical/x-dna-fasta, chemical/x-peptide-fasta
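
As a rough illustration of how a caller might route these content types to the `type` argument of the functions listed earlier, here is a minimal sketch. The helper name `fasta_sequence_type` and the case-insensitive matching are assumptions, not part of the proposed API.

```python
# Content types from this issue mapped to the sequence type strings
# accepted by the proposed load/save functions.
FASTA_CONTENT_TYPES = {
    "chemical/x-rna-fasta": "RNA",
    "chemical/x-dna-fasta": "DNA",
    "chemical/x-peptide-fasta": "PEPTIDE",
}


def fasta_sequence_type(content_type):
    """Hypothetical helper: return "RNA", "DNA" or "PEPTIDE", or None
    when the content type is not one of the FASTA types."""
    return FASTA_CONTENT_TYPES.get(content_type.strip().lower())
```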

@olganaz
Collaborator

olganaz commented Mar 1, 2024

This task covers import/export of sequences with defined monomers.
Requirements:

  1. A sequence in FASTA format takes two or more lines of text. The first line is an identifying header; the remaining lines (one or more) hold the sequence itself.
  • The header
    The header line starts with a greater-than symbol (">") and ends with a newline. Allowed characters are "A" to "Z", "a" to "z", "0" to "9", "_", "-", ".", ",", ";" and "|", with spaces between them.
  • The sequence
    DNA/RNA sequences can contain only the symbols "A", "C", "G", "T", "U", "-".
    Peptide sequences can contain only the symbols "A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "*", "-".
    Lower-case letters are accepted and mapped to upper case. A sequence can be written on one line or across several lines.
  • Comments
    A comment line starts with a semicolon (";") and ends with a newline. It may contain any symbol (including ">").
  2. Import FASTA
  • The first symbol in the FASTA file must be ">". If it is not, an error message should be displayed and the import should not be performed.
  • The header should be ignored during import.
  • Comments should be ignored during import.
  • Each sequence should be imported as a separate molecule. A sequence starts at the first symbol of the line after the header line and ends at the last symbol before the next ">" (or at the end of the file).
  • If a sequence contains a disallowed symbol, an error message should be displayed and the import should not be performed.
  • "*" marks the end of a peptide sequence.
  • If "*" occurs between two letters, it should be treated as a break in the peptide chain (no bond is created between the monomers separated by the "*").
  • "-" should be ignored during import.
  • RNA/DNA and peptide sequences are imported according to the rules for the Sequence format.
  • The "U" symbol in a peptide should be recognized as Selenocysteine.
  3. Export to FASTA
  • Each molecule should be exported as a separate sequence with the header "Sequence N", where N is the number of the sequence.
  • Only sequences of the same type (DNA/RNA/Peptide) can be exported to one FASTA file.
  • A complex polymer made up of sequences of different types, or a branched polymer, is not exported, and the system should display an error message.
  • For molecules made up of modified monomers, the natural analog is exported for a modified peptide monomer, and the natural analog of the base is exported for a modified nucleotide.
  • CHEMs and RNA monomers that are not part of a nucleotide or nucleoside are ignored during export.
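
The export rules can be sketched in the same spirit. This is an illustration only, not the Indigo implementation: `export_fasta` is a hypothetical helper that takes sequences already reduced to natural-analog letters, numbers headers from 1 (an assumption), and writes each sequence on a single line because the requirements set no line width.

```python
def export_fasta(sequences):
    """Hypothetical helper: build FASTA text from (seq_type, symbols) pairs.

    seq_type is "RNA", "DNA" or "PEPTIDE"; symbols is a string or a list
    of one-letter natural-analog codes.
    """
    types = {seq_type for seq_type, _ in sequences}
    if len(types) > 1:
        # Mixed types must not end up in one FASTA file.
        raise ValueError("only sequences of the same type can be exported to one FASTA file")

    lines = []
    for number, (_, symbols) in enumerate(sequences, start=1):
        lines.append(f">Sequence {number}")  # header always starts a new line
        lines.append("".join(symbols))
    return ("\n".join(lines) + "\n") if lines else ""
```

For example, two DNA sequences "ACGT" and "TT" would be written as the headers ">Sequence 1" and ">Sequence 2", each followed by its sequence line.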

even1024 linked a pull request on Mar 5, 2024 that will close this issue