FASTA parsing error #232

VarIr · 2020-09-08T12:02:12Z

While parsing a larger FASTA file, I hit an error for a specific protein sequence (see below).

I could reproduce this behaviour with smaller sequences, down to FU (but not MU). It seems that some sequences containing the character U are recognized as NucleotideSequence instead of ProteinSequence.

Steps to reproduce

import biotite.sequence.io.fasta as fasta

# Download the sequence
fasta_file = fasta.FastaFile.read('sequence.fasta')
avidin_seq, streptavidin_seq = fasta.get_sequences(fasta_file).values()

Error

---------------------------------------------------------------------------
AlphabetError                             Traceback (most recent call last)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/alphabet.py in encode_multiple(self, symbols, dtype)

src/biotite/sequence/codec.pyx in biotite.sequence.codec.encode_chars()

AlphabetError: Symbol 'F' is not in the alphabet

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-223-c3874679f080> in <module>
----> 1 biotite_fasta.get_sequences(s0).values()

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in get_sequences(fasta_file)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

ValueError: FASTA data cannot be converted either to 'NucleotideSequence' nor to 'ProteinSequence'


padix-key · 2020-09-08T18:11:46Z

Thanks for reporting, I can reproduce the error. One problem is that selenocysteine (U) is currently not recognized by the amino acid alphabet. To fix this, U also needs to be added to the substitution matrices. The more interesting case is MU: U should not be recognized by the ambiguous nucleotide alphabet either, yet in this case the conversion works. This requires further investigation. For now, the following works if you accept that U is converted into C:

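import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta

# Build ProteinSequence objects directly, replacing selenocysteine (U) with cysteine (C)
fasta_file = fasta.FastaFile.read('sequence.fasta')
seq_dict = {
    header: seq.ProteinSequence(seq_str.replace("U", "C"))
    for header, seq_str in fasta_file.items()
}
sequences = list(seq_dict.values())
print(type(sequences[0]))
print(sequences[0])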

<class 'biotite.sequence.ProteinSequence'>
FC

VarIr · 2020-09-09T08:11:32Z

Thanks for the work-around, I'll give it a try.

Regarding selenocysteine, I find the error messages confusing: the first points to F when in fact U is the culprit, and the second does not say whether biotite tried to parse the data as a nucleic acid or a protein sequence. Perhaps the first could be improved by mentioning the missing selenocysteine support, and the second by having separate error messages for nucleotide and protein sequences?

padix-key · 2020-09-09T14:52:35Z

The problem is that the first error message is raised by the Alphabet class, which has no knowledge of the type of sequence. In the second message it is not possible to give individual error messages, since the function does not know whether a nucleotide or a protein sequence is expected. A possible solution could be a third error message between them that states which kind of sequence the first one refers to.
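To make the idea concrete, here is a minimal sketch in plain Python. It does not use biotite and is not its implementation: the AlphabetError class, the two symbol sets, and the encode() and convert() helpers are hypothetical stand-ins. The sketch also folds the proposed extra message into the final error rather than raising a separate one, and it does not try to reproduce the MU behaviour mentioned above.

```python
class AlphabetError(Exception):
    """Stand-in for an alphabet-level error that knows nothing about sequence types."""


# Toy alphabets, assumed for illustration only
NUCLEOTIDE_SYMBOLS = set("ACGTRYSWKMBDHVN")        # IUPAC DNA codes
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWYBZX*")  # deliberately without U


def encode(seq_str, symbols):
    """Fail on the first symbol that is not part of the given alphabet."""
    for symbol in seq_str:
        if symbol not in symbols:
            raise AlphabetError(f"Symbol {symbol!r} is not in the alphabet")
    return list(seq_str)


def convert(seq_str):
    """Try both sequence types and report which attempt failed on which symbol."""
    attempts = (("nucleotide", NUCLEOTIDE_SYMBOLS), ("protein", PROTEIN_SYMBOLS))
    errors = []
    for kind, symbols in attempts:
        try:
            encode(seq_str, symbols)
            return kind
        except AlphabetError as err:
            # The caller knows the sequence type, so it can add that context
            errors.append(f"as {kind} sequence: {err}")
    raise ValueError(
        "FASTA data cannot be converted to a nucleotide or a protein sequence ("
        + "; ".join(errors) + ")"
    )
```

With these toy alphabets, convert("FU") fails with a message that names F for the nucleotide attempt and U for the protein attempt, which is the distinction the current pair of messages does not make.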

padix-key · 2020-11-10T12:33:49Z

I decided on the solution that selenocysteine is automatically converted into cysteine when get_sequence() or get_sequences() is called. If the sequence contains selenocysteine, a warning mentioning the automatic conversion is raised. If selenocysteine is explicitly required, a custom Sequence class needs to be created by the user. If you have no objections, I will close this issue when the PR is merged.
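The behaviour described here could look roughly like the following sketch. It is plain Python, not the code from the PR: the function name replace_selenocysteine() and the warning text are assumptions. It only makes sense for a string already known to be a protein sequence, since U stands for uracil in nucleotide data.

```python
import warnings


def replace_selenocysteine(seq_str):
    """Return the protein string with U replaced by C, warning if anything changed."""
    if "U" not in seq_str:
        return seq_str
    warnings.warn(
        "Sequence contains selenocysteine (U), which is converted into cysteine (C)",
        UserWarning,
    )
    return seq_str.replace("U", "C")


print(replace_selenocysteine("FU"))
```

Using a warning instead of an exception keeps the file readable while still telling the user that the returned sequence differs from the input.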

padix-key added the bug label Sep 8, 2020

padix-key mentioned this issue Sep 16, 2020

Correct handling of ambiguous symbols when reading from FASTA/FASTQ #233

Merged

padix-key mentioned this issue Nov 10, 2020

Selenocysteine support for FASTA and GenBank files #246

Merged

padix-key closed this as completed in #246 Nov 10, 2020

dnlbauer mentioned this issue Aug 4, 2021

Selenocysteine strikes again: AlphabetError in BlastWebApp #344

Closed
