Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MuscleApp takes extremely long/does not finish #273

Closed
dnlbauer opened this issue Jan 4, 2021 · 2 comments · Fixed by #277
Closed

MuscleApp takes extremely long/does not finish #273

dnlbauer opened this issue Jan 4, 2021 · 2 comments · Fixed by #277

Comments

@dnlbauer
Copy link
Collaborator

dnlbauer commented Jan 4, 2021

I am trying to align 99 sequences with an avg sequence length of 865 via muscle, but the process seems to be either locked or blocked in some way (muscle is visible in htop but no CPU is used).

Example code, mainly copy-pasta from the biotite gallery/tutorial:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.database.entrez as entrez
import biotite.application.blast as blast
import biotite.application.muscle as muscle

# sequence
file = fasta.FastaFile.read("5u6p.fasta")
seq = seq.ProteinSequence(list(file.items())[0][1])

# Find homologous proteins using NCBI Blast
# Search only the UniProt/SwissProt database
blast_app = blast.BlastWebApp("blastp", seq, "swissprot", obey_rules=True)
blast_app.start()
blast_app.join()
alignments = blast_app.get_alignments()

# Get hit IDs for hits with score > 200
hits = []
for ali in alignments:
    if ali.score > 200:
        hits.append(ali.hit_id)
# Get the sequences from hit IDs
hit_seqs = []
for hit in hits:
    file_name = entrez.fetch(hit, ".", "fa", "protein", "fasta")
    fasta_file = fasta.FastaFile.read(file_name)
    hit_seqs.append(fasta.get_sequence(fasta_file))
print(len(hit_seqs)) # 99

app = muscle.MuscleApp(hit_seqs)
app.start()
print(app.get_command())  # muscle -in /tmp/tmpdx_2xr27.fa -out /tmp/tmpk_1i3huh.fa -tree1 /tmp/tmp9qre27e5.tree -tree2 /tmp/tmpe4h0rl3k.tree -seqtype protein
app.join()
alignment = app.get_alignment()
print(alignment)

Interestingly, when I run the command from app.get_command() manually, this produces the expected output file within ~15 sec, while the biotite script still didn't finish after 15 minutes. Finally, switching from MuscleApp to ClustalOmegaApp in the script works and produces an alignment. Therefore, this issue seems to be related to MuscleApp only.

Hopefully, this issue is reproducible on other machines at all! ;)

@padix-key
Copy link
Member

Thank you, I will have a look into this.

@padix-key
Copy link
Member

padix-key commented Jan 6, 2021

I made some investigations and have an idea what happens here:

Compared to Clustal Omega, MUSCLE by default prints its progress to stderr in an extremely verbose manner. I suspect that the amount is so large, that the stderr fills up and MUSCLE waits for the stderr to be read (and hence emptied) by the other end of the pipe. In case of the terminal this is not a problem, as the terminal directly prints all characters it receives from the stderr. However, Biotite reads the stderr/stdout only after the program terminates. This leads to a deadlock with MUSCLE waiting for the stderr to be read and Python wating for MUSCLE to finish.

I will create a PR with two measures to fix this issue:

  • LocalApp.join() will use subprocess.communicate() which continuously reads the output until the application finishes
  • MuscleApp will use the -quiet parameter to stop it from outputting the progress

@padix-key padix-key mentioned this issue Jan 6, 2021
padix-key added a commit that referenced this issue Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants