ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto' #10

Open
lzhangUT opened this issue Sep 22, 2021 · 3 comments
Labels
question Further information is requested

Comments

@lzhangUT

lzhangUT commented Sep 22, 2021

Hi,
I was following your tutorial on going from a multiple sequence alignment (MSA) to an HMM.
I have downloaded your example data into my working directory, and I can see the two files (LuxC.faa and LuxC.sto) there:
[FileInfo(path='dbfs:/mnt/LuxC.faa', name='LuxC.faa', size=153510),
FileInfo(path='dbfs:/mnt/LuxC.sto', name='LuxC.sto', size=150686),

When I tried to run this code:

with pyhmmer.easel.MSAFile("/dbfs/mnt/LuxC.sto") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)

It gives me this error:
ValueError: Could not determine format of file: '/dbfs/mnt/LuxC.sto'

I am not sure what went wrong; the installation and the first two commands in the tutorial work fine.
Thanks for your help.

@lzhangUT
Author

lzhangUT commented Sep 22, 2021

However, if I manually copy the content, create the file, and save it into my working directory, the file seems to work and the error is gone.

But I have another issue when running the following code:

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa") as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)

pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
hits = pipeline.search_hmm(query=hmm, sequences=sequences)

ValueError: Could not parse file: Line 2: illegal character -

@althonos
Owner

Hi @lzhangUT ,

In the first snippet, I am not sure what is going wrong, but you can always manually set the file type to "stockholm" since it looks like Easel doesn't find the format properly:

with pyhmmer.easel.MSAFile("/dbfs/mnt/LuxC.sto", format="stockholm") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)
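
Since re-creating the file by hand made the error go away, it may also be worth checking what the downloaded file actually contains before forcing the format. Here is a minimal sketch using only the standard library. `sniff_alignment_file` is a hypothetical helper, not part of pyhmmer. It assumes that a Stockholm file starts with a `# STOCKHOLM` header line, that a FASTA file starts with `>`, and that a web page saved by mistake starts with `<`.

```python
from pathlib import Path


def sniff_alignment_file(path):
    """Guess what a file contains from its first non-blank line.

    Hypothetical helper for illustration, not part of pyhmmer.
    """
    # "utf-8-sig" drops a leading byte-order mark if one is present.
    with Path(path).open("r", encoding="utf-8-sig", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip leading blank lines
            if stripped.startswith("# STOCKHOLM"):
                return "stockholm"
            if stripped.startswith(">"):
                return "fasta"
            if stripped.startswith("<"):
                return "html"  # e.g. a web page saved instead of the raw file
            return "unknown"
    return "empty"
```

Calling `sniff_alignment_file("/dbfs/mnt/LuxC.sto")` should return `"stockholm"` for a valid file; any other result would suggest the download did not produce the raw alignment.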

In the second one, I suppose it's because you are trying to read a multiple alignment file, and by default using a SequenceFile on those will fail. You need to manually allow the gaps:

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa", ignore_gaps=True) as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)
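
To check whether gap characters are really what trips the parser, the sequence lines can be scanned directly. This is a minimal sketch using only the standard library. `find_gap_lines` is a hypothetical helper, not a pyhmmer function, and treating `-` and `.` as the gap characters is an assumption.

```python
def find_gap_lines(path, gap_chars="-."):
    """Return (line_number, line) pairs for sequence lines containing gaps.

    Hypothetical helper for illustration, not part of pyhmmer.
    Header lines (starting with ">") are skipped, because sequence
    identifiers may legitimately contain "-" or ".".
    """
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if text.startswith(">"):
                continue
            if any(char in gap_chars for char in text):
                hits.append((number, text))
    return hits
```

An empty list from `find_gap_lines("/dbfs/mnt/alphafold/LuxC.faa")` would mean the sequence lines contain no gap characters, so the "illegal character" would have to come from something else in the file.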

althonos added the question (Further information is requested) label Sep 23, 2021
@lzhangUT
Author

Hi @althonos,
Thanks for your response.
First of all, I think LuxC.faa is a FASTA file, i.e., a sequence file, not a multiple alignment file.
Second, I was following the tutorial on your GitHub, and the data is from your GitHub as well. Even after I add 'ignore_gaps=True', the same error is still there.

with pyhmmer.easel.SequenceFile("/dbfs/mnt/alphafold/LuxC.faa", ignore_gaps=True) as seq_file:
  seq_file.set_digital(alphabet)
  sequences = list(seq_file)

ValueError: Could not parse file: Line 2: illegal character -
and the error is for the line in **,
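
If the file does turn out to contain gap characters on its sequence lines, one possible workaround is to write a gap-free copy and read that instead. This is a minimal sketch using only the standard library. `strip_gaps` is a hypothetical helper, not part of pyhmmer; the gap characters `-` and `.` are an assumption, and the destination path is chosen by the caller.

```python
def strip_gaps(src_path, dst_path, gap_chars="-."):
    """Copy a FASTA file, removing gap characters from sequence lines.

    Hypothetical helper for illustration, not part of pyhmmer.
    Header lines (starting with ">") are copied unchanged.
    """
    # Translation table that deletes every character in gap_chars.
    table = str.maketrans("", "", gap_chars)
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(line)
            else:
                dst.write(line.translate(table))
```

The cleaned copy can then be opened with `pyhmmer.easel.SequenceFile` in place of the original path.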


2 participants