Import and export fasta format for RNA, DNA and PEPTIDES #1755

even1024 · 2024-02-28T11:19:35Z

Background

This task covers import/export of sequences with defined monomers.

In FASTA, these sequences are represented as a header, optional comments and a plain sequence string, for example:

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

The plain string is a combination of the following symbols:

for peptides:
A - Alanine
C - Cysteine
D - Aspartic Acid
E - Glutamic Acid
F - Phenylalanine
G - Glycine
H - Histidine
I - Isoleucine
K - Lysine
L - Leucine
M - Methionine
N - Asparagine
O - Pyrrolysine
P - Proline
Q - Glutamine
R - Arginine
S - Serine
T - Threonine
U - Selenocysteine
V - Valine
W - Tryptophan
Y - Tyrosine

for RNA nucleotides:
A - AMP (Adenosine monophosphate)
C - CMP (Cytidine monophosphate)
G - GMP (Guanosine monophosphate)
U - UMP (Uridine monophosphate)
T - rTMP (Ribothymidine monophosphate)

for DNA nucleotides:
A - dAMP (Deoxyadenosine monophosphate)
C - dCMP (Deoxycytidine monophosphate)
G - dGMP (Deoxyguanosine monophosphate)
U - dUMP (Deoxyuridine monophosphate)
T - TMP (Thymidine monophosphate)

* - translation stop
- - gap of indeterminate length

Requirements:

A FASTA file can contain one or several sequences.
Each sequence in FASTA format is expressed in 2 or more lines of text. The first line is an identifying header; the remaining lines (one or more) represent the sequence itself.

The header
The header line starts with a greater-than symbol (">") and ends with a newline. Allowed characters are "A" to "Z", "a" to "z", "0" to "9", "_", "-", ".", ",", ";" and "|", with spaces between them.
The sequence
- DNA/RNA sequences can contain only the symbols: "A", "C", "G", "T", "U", "-"
- Peptide sequences can contain only the symbols: "A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "*", "-"
- Lower-case letters are accepted and are mapped into upper-case.
- One sequence could be represented in one line or in several lines.
Comments
A comment line starts with a semicolon (";") and ends with a newline. It may contain any symbol (including ">").
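
The header and alphabet rules above can be sketched as a small validator. This is only an illustration in plain Python, not the Indigo implementation; the names `is_valid_header` and `invalid_symbols` are hypothetical, and the alphabets are copied from the lists in this section.

```python
import re

# Alphabets copied from the requirements above.
NUCLEOTIDE_SYMBOLS = set("ACGTU-")
PEPTIDE_SYMBOLS = set("ACDEFGHIKLMNPQRSTUVWY*-")

# ">" followed by letters, digits, "_", "-", ".", ",", ";", "|" and spaces.
HEADER_RE = re.compile(r">[A-Za-z0-9_\-.,;| ]*")


def is_valid_header(line: str) -> bool:
    """Return True if the line is a header made only of allowed characters."""
    return HEADER_RE.fullmatch(line) is not None


def invalid_symbols(sequence: str, sequence_type: str) -> set:
    """Return the symbols that are not allowed for the given sequence type.

    Lower-case letters are mapped to upper-case before checking.
    """
    allowed = PEPTIDE_SYMBOLS if sequence_type == "PEPTIDE" else NUCLEOTIDE_SYMBOLS
    return set(sequence.upper()) - allowed
```

An empty set from `invalid_symbols` means the sequence line passes the alphabet check for that type.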

Import FASTA

The first symbol in the FASTA file should be ">"; if it is not, an error message should be displayed and the import should not be performed.
The header should be ignored during import.
Comments should be ignored during import.
Each sequence should be imported as a separate molecule. The sequence starts at the first symbol of the line after the header line and ends at the last symbol before the next ">".
If a sequence contains a disallowed symbol, an error message should be displayed AND the import should not be performed.
"*" means the end of the peptide sequence.
If "*" occurs between two letters, it should be recognized as a break in the peptide chain (no bond should be created between the monomers separated by the "*").
"-" should be ignored during import.
Import of the RNA/DNA and peptide sequences is performed according to the rules for Sequence format.
"U" symbol for peptide should be recognized as Selenocysteine.
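
The import rules above can be sketched as a plain-text parser. This is a minimal sketch under stated assumptions, not the Indigo loader: `parse_fasta` is a hypothetical name, each record is returned as a list of chain strings instead of a molecule, and `ValueError` stands in for the error message the requirements ask for.

```python
NUCLEOTIDE_SYMBOLS = set("ACGTU-")
PEPTIDE_SYMBOLS = set("ACDEFGHIKLMNPQRSTUVWY*-")


def parse_fasta(text: str, sequence_type: str) -> list:
    """Parse FASTA text into records; each record is a list of chains.

    A "*" splits a peptide record into chains that are not bonded to
    each other, and "-" is dropped.
    """
    if not text.startswith(">"):
        raise ValueError('FASTA input must start with ">"')
    allowed = PEPTIDE_SYMBOLS if sequence_type == "PEPTIDE" else NUCLEOTIDE_SYMBOLS
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank lines and comments are ignored
        if line.startswith(">"):
            records.append([])  # new record; the header text is ignored
            continue
        for symbol in line.upper():  # lower-case maps to upper-case
            if symbol not in allowed:
                raise ValueError(f"disallowed symbol {symbol!r}")
            if symbol != "-":  # gaps are ignored during import
                records[-1].append(symbol)
    # Split each record on "*"; a trailing "*" leaves no empty chain.
    return [[c for c in "".join(r).split("*") if c] for r in records]
```

For example, a peptide record `MADQ*TA*` yields two chains, `MADQ` and `TA`, which the importer would place in the same molecule without a bond between them.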

Export to FASTA

Each molecule should be exported as a separate sequence with the header "Sequence N", where N is the number of the sequence.
The header of each new sequence should start on a new line.
Only sequences of the same type (DNA/RNA/Peptide) can be exported in one FASTA file.
A complex polymer made up of sequences of different types, or a branched polymer, is not exported, and the system should display an error message.
For molecules made up of modified monomers (peptides and nucleotides), the natural analog is exported for a modified peptide monomer and the natural analog of the base is exported for a modified nucleotide.
[DRAFT] If a monomer's natural analog is 'X', that monomer should be ignored during export.
CHEMs and RNA monomers that are not part of a nucleotide or nucleoside are ignored during export.
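
The export rules above can be sketched as follows. This is an illustration only: `to_fasta` is a hypothetical helper, it assumes every monomer has already been mapped to its natural-analog letter, and the `line_width` wrapping is an assumption (the requirements only say a sequence may span one or several lines).

```python
def to_fasta(sequences, line_width=80):
    """Serialize (sequence_type, letters) pairs to FASTA text.

    All sequences must share one type; 'X' analogs are dropped,
    following the [DRAFT] rule above.
    """
    types = {sequence_type for sequence_type, _ in sequences}
    if len(types) > 1:
        raise ValueError("only sequences of the same type can share a FASTA file")
    lines = []
    for number, (_, letters) in enumerate(sequences, start=1):
        lines.append(f">Sequence {number}")  # each header starts on a new line
        letters = letters.replace("X", "")  # monomers with analog 'X' are ignored
        for start in range(0, len(letters), line_width):
            lines.append(letters[start:start + line_width])
    return "".join(line + "\n" for line in lines)
```

Calling `to_fasta([("PEPTIDE", "MADQ"), ("PEPTIDE", "AXC")])` produces two records headed `>Sequence 1` and `>Sequence 2`, with the `X` removed from the second.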

Solution

Implement the following functions for the C API, where type is one of "RNA", "DNA" or "PEPTIDE":

int indigoLoadFASTA(int source, const char* type);
int indigoLoadFASTAFromString(const char* string, const char* type);
int indigoLoadFASTAFromFile(const char* filename, const char* type);
int indigoLoadFASTAFromBuffer(const char* buffer, int size, const char* type);
int indigoSaveFASTA(int molecule, int output, const char* type);
int indigoSaveFASTAToFile(int molecule, const char* filename, const char* type);
const char* indigoFASTA(int molecule, const char* type);

Add language bindings for Python, Java and C#.
Python binding functions:
def loadFASTA(self, input_string: str, sequence_type: str):
def loadFASTAFromFile(self, input_file: str, sequence_type: str):
def FASTA(self, sequence_type: str):
def saveFASTA(self, output_file: str, sequence_type: str):
Add the following content types to WASM "loadMoleculeOrReaction" and Indigo service "convert" API:

chemical/x-rna-fasta, chemical/x-dna-fasta, chemical/x-peptide-fasta
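
One way a service could route these content types to the `type` argument of the functions above is a simple lookup. This is a sketch only; `sequence_type_for` is a hypothetical helper and is not part of the Indigo service or WASM API.

```python
# Content types from this issue mapped to the "type" argument values.
FASTA_CONTENT_TYPES = {
    "chemical/x-rna-fasta": "RNA",
    "chemical/x-dna-fasta": "DNA",
    "chemical/x-peptide-fasta": "PEPTIDE",
}


def sequence_type_for(content_type: str) -> str:
    """Return "RNA", "DNA" or "PEPTIDE" for a FASTA content type."""
    key = content_type.strip().lower()
    if key not in FASTA_CONTENT_TYPES:
        raise ValueError(f"not a FASTA content type: {content_type!r}")
    return FASTA_CONTENT_TYPES[key]
```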


olganaz · 2024-03-01T11:00:10Z

This task covers import/export of sequences with defined monomers:
Requirements:

A sequence in FASTA format is expressed in 2 or more lines of text. The first line is an identifying header, the remainder of the lines (one or more) represent the sequence itself.

The header
The header line starts with a greater-than symbol (">") and ends with a newline. Allowed characters are "A" to "Z", "a" to "z", "0" to "9", "_", "-", ".", ",", ";" and "|", with spaces between them.
The sequence
DNA/RNA sequences can contain only the symbols: "A", "C", "G", "T", "U", "-"
Peptide sequences can contain only the symbols: "A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "*", "-"
Lower-case letters are accepted and are mapped into upper-case.
One sequence could be represented in one line or in several lines.
Comments
A comment line starts with a semicolon (";") and ends with a newline. It may contain any symbol (including ">").

Import FASTA

The first symbol in the FASTA file should be ">"; if it is not, an error message should be displayed and the import should not be performed.
The header should be ignored during import.
Comments should be ignored during import.
Each sequence should be imported as a separate molecule. The sequence starts at the first symbol of the line after the header line and ends at the last symbol before the next ">".
If a sequence contains a disallowed symbol, an error message should be displayed AND the import should not be performed.
"*" means the end of the peptide sequence.
If "*" occurs between two letters, it should be recognized as a break in the peptide chain (no bond should be created between the monomers separated by the "*").
"-" should be ignored during import.
Import of the RNA/DNA and peptide sequences is performed according to the rules for Sequence format.
"U" symbol for peptide should be recognized as Selenocysteine.

Export to FASTA

Each molecule should be exported as a separate sequence with the header "Sequence N", where N is the number of the sequence.
Only sequences of the same type (DNA/RNA/Peptide) can be exported in one FASTA file.
A complex polymer made up of sequences of different types, or a branched polymer, is not exported, and the system should display an error message.
For molecules made up of modified monomers (peptides and nucleotides), the natural analog is exported for a modified peptide monomer and the natural analog of the base is exported for a modified nucleotide.
CHEMs and RNA monomers that are not part of a nucleotide or nucleoside are ignored during export.

even1024 added Feature epic: macromolecules labels Feb 28, 2024

even1024 added this to the Indigo-1.19.0-rc.1 milestone Feb 28, 2024

even1024 self-assigned this Feb 28, 2024

even1024 linked a pull request Mar 5, 2024 that will close this issue

#1755 Import and export fasta format for RNA, DNA and PEPTIDES #1765

Merged

7 tasks

even1024 closed this as completed in #1765 Mar 5, 2024

even1024 added a commit that referenced this issue Mar 5, 2024

#1755 Import and export fasta format for RNA, DNA and PEPTIDES (#1765)

e8869c2

This was referenced Mar 19, 2024

Fasta: All Peptides should be saved to FASTA format #1822

Closed

Autotests: Macro: UI for Open/Save As FASTA/Sequence epam/ketcher#4280

Open

This was referenced Mar 27, 2024

Macro: Cannot load Peptides from our Library that are not connected by bonds using FASTA file #1881

Closed

FASTA: Load FASTA or Sequence file that contains few monomers causes their overlaping on the canvas epam/ketcher#4355

Closed

This was referenced Mar 28, 2024

Macro: DNA/RNA sequences should NOT accept '*' symbols epam/ketcher#4358

Open

Macro: System should allow '*' symbol at the beginning of peptide sequence epam/ketcher#4359

Open

This was referenced Mar 28, 2024

FASTA: User should be able to add, save and open FASTA/Sequence files with "X" symbol Peptides. epam/ketcher#4361

Open

Pyl Peptide is changed to wrong letter while we save one to Fasta or Sequence file epam/ketcher#4362

Open

AlexeyGirin mentioned this issue Apr 8, 2024

FASTA: Export branched structures to FASTA should be impossible. epam/ketcher#4334

Open

baranovdv mentioned this issue Apr 14, 2024

Indigo incorrectly calculates RNA coordinates with data from FASTA/Sequence file #1910

Closed

This was referenced May 15, 2024

CHEMs that are not a part of nucleotide or nucleoside are not ignored during export epam/ketcher#4626

Closed

Empty headers appear when exporting to FASTA #1950

Closed

olganaz mentioned this issue Jun 20, 2024

Import/Export of variant monomers from Fasta/Sequence #2015

Closed

AlexeyGirin mentioned this issue Jul 8, 2024

Preview: We shouldn't allow to export canvas with nucleotides with no natural analog to Sequence format (and FASTA) #2035

Closed

AlexeyGirin mentioned this issue Aug 29, 2024

Regress: R1-R2 connected unsplit nucleotides saved in FASTA as separate squences #2280

Closed

AlexeyGirin mentioned this issue Sep 24, 2024

O and U sumbols are not supported in sequence mode epam/ketcher#5621

Closed
