Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FASTA parsing error #232

Closed
VarIr opened this issue Sep 8, 2020 · 4 comments · Fixed by #246
Closed

FASTA parsing error #232

VarIr opened this issue Sep 8, 2020 · 4 comments · Fixed by #246
Labels

Comments

@VarIr
Copy link

VarIr commented Sep 8, 2020

While parsing a larger FASTA file, I hit an error for a specific protein sequence (see below).

I could reproduce this behaviour with smaller sequences down to FU (not, however, MU). It seems, some sequences with char U may be recognized as NucleotideSequence instead of ProteinSequence.

Steps to reproduce

# Download the sequence
fasta_file = fasta.FastaFile.read('sequence.fasta')
avidin_seq, streptavidin_seq = fasta.get_sequences(fasta_file).values()

Error

---------------------------------------------------------------------------
AlphabetError                             Traceback (most recent call last)
~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/alphabet.py in encode_multiple(self, symbols, dtype)

src/biotite/sequence/codec.pyx in biotite.sequence.codec.encode_chars()

AlphabetError: Symbol 'F' is not in the alphabet

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-223-c3874679f080> in <module>
----> 1 biotite_fasta.get_sequences(s0).values()

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in get_sequences(fasta_file)

~/miniconda3/envs/biotite/lib/python3.8/site-packages/biotite/sequence/io/fasta/convert.py in _convert_to_sequence(seq_str)

ValueError: FASTA data cannot be converted either to 'NucleotideSequence' nor to 'ProteinSequence'
@padix-key
Copy link
Member

padix-key commented Sep 8, 2020

Thanks for reporting, I can reproduce the error. One problem is that Selenocysteine (U) is currently not recognized by the amino acid alphabet. To fix this, U needs also to be added to substitution matrices. The more interesting case is MU, because U should neither be recognized by the ambiguous nucleotide alphabet, but in this case it works. This requires further investigation. For now the following would work for you, if you accept that U is converted into C:

fasta_file = fasta.FastaFile.read('sequence.fasta')
seq_dict = {
    header: seq.ProteinSequence(seq_str.replace("U", "C"))
    for header, seq_str in fasta_file.items()
}
sequences = list(seq_dict.values())
print(type(sequences[0]))
print(sequences[0])
<class 'biotite.sequence.ProteinSequence'>
FC

@padix-key padix-key added the bug label Sep 8, 2020
@VarIr
Copy link
Author

VarIr commented Sep 9, 2020

Thanks for the work-around, I'll give it a try.

Regarding Selenocysteine, I find the error message confusing (the first msg points to F when in fact U is the culprit; the second does not tell, whether biotite tried to parse it as nucleic acid or protein sequence). Perhaps the first could be improved by mentioning missing Selenocysteine support, and the second by having individual error messages for nucleotide and protein?

@padix-key
Copy link
Member

The problem is that the first error message is raised by the Alphabet class, which has no knowledge about the type of sequence. In the second message it is not possible to give individual error messages, since the function does not know whether a nucleotide or protein sequence is expected. A possible solution could be a third error message in the middle of them, that states to which kind of sequence the first one refers to.

@padix-key
Copy link
Member

I decided for the solution that selenocysteine is automatically converted into cysteine, when get_sequence() or get_sequences() is called. If the sequence contains selenocysteine, a warning is raised mentioning the automatic conversion. If selenocysteine is explicitly required, a custom Sequence class needs to be created by the user. If you have no objections, I will close this issue when the PR is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants