MuscleApp takes extremely long/does not finish #273

dnlbauer · 2021-01-04T12:47:54Z

I am trying to align 99 sequences with an avg sequence length of 865 via muscle, but the process seems to be either locked or blocked in some way (muscle is visible in htop but no CPU is used).

Example code, mainly copy-pasta from the biotite gallery/tutorial:

import biotite.sequence.io.fasta as fasta
import biotite.sequence as seq
import biotite.database.entrez as entrez
import biotite.application.blast as blast
import biotite.application.muscle as muscle

# sequence
file = fasta.FastaFile.read("5u6p.fasta")
seq = seq.ProteinSequence(list(file.items())[0][1])

# Find homologous proteins using NCBI Blast
# Search only the UniProt/SwissProt database
blast_app = blast.BlastWebApp("blastp", seq, "swissprot", obey_rules=True)
blast_app.start()
blast_app.join()
alignments = blast_app.get_alignments()

# Get hit IDs for hits with score > 200
hits = []
for ali in alignments:
    if ali.score > 200:
        hits.append(ali.hit_id)
# Get the sequences from hit IDs
hit_seqs = []
for hit in hits:
    file_name = entrez.fetch(hit, ".", "fa", "protein", "fasta")
    fasta_file = fasta.FastaFile.read(file_name)
    hit_seqs.append(fasta.get_sequence(fasta_file))
print(len(hit_seqs)) # 99

app = muscle.MuscleApp(hit_seqs)
app.start()
print(app.get_command())  # muscle -in /tmp/tmpdx_2xr27.fa -out /tmp/tmpk_1i3huh.fa -tree1 /tmp/tmp9qre27e5.tree -tree2 /tmp/tmpe4h0rl3k.tree -seqtype protein
app.join()
alignment = app.get_alignment()
print(alignment)

Interestingly, when I run the command from app.get_command() manually, this produces the expected output file within ~15 sec, while the biotite script still didn't finish after 15 minutes. Finally, switching from MuscleApp to ClustalOmegaApp in the script works and produces an alignment. Therefore, this issue seems to be related to MuscleApp only.

Hopefully, this issue is reproducible on other machines at all! ;)

The text was updated successfully, but these errors were encountered:

padix-key · 2021-01-05T16:27:44Z

Thank you, I will have a look into this.

padix-key · 2021-01-06T13:57:04Z

I made some investigations and have an idea what happens here:

Compared to Clustal Omega, MUSCLE by default prints its progress to stderr in an extremely verbose manner. I suspect that the amount is so large, that the stderr fills up and MUSCLE waits for the stderr to be read (and hence emptied) by the other end of the pipe. In case of the terminal this is not a problem, as the terminal directly prints all characters it receives from the stderr. However, Biotite reads the stderr/stdout only after the program terminates. This leads to a deadlock with MUSCLE waiting for the stderr to be read and Python wating for MUSCLE to finish.

I will create a PR with two measures to fix this issue:

LocalApp.join() will use subprocess.communicate() which continuously reads the output until the application finishes
MuscleApp will use the -quiet parameter to stop it from outputting the progress

Fix #273

padix-key mentioned this issue Jan 6, 2021

Fix #273 #277

Merged

padix-key closed this as completed in #277 Jan 7, 2021

padix-key added a commit that referenced this issue Jan 7, 2021

Merge pull request #277 from padix-key/issue-273

31e01c3

Fix #273

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MuscleApp takes extremely long/does not finish #273

MuscleApp takes extremely long/does not finish #273

dnlbauer commented Jan 4, 2021 •

edited

padix-key commented Jan 5, 2021

padix-key commented Jan 6, 2021 •

edited

MuscleApp takes extremely long/does not finish #273

MuscleApp takes extremely long/does not finish #273

Comments

dnlbauer commented Jan 4, 2021 • edited

padix-key commented Jan 5, 2021

padix-key commented Jan 6, 2021 • edited

dnlbauer commented Jan 4, 2021 •

edited

padix-key commented Jan 6, 2021 •

edited