Sequence index is not preserved in multithread mode #20

Closed
zdk123 opened this issue Oct 21, 2022 · 2 comments
Labels
wontfix This will not be worked on

Comments

@zdk123

zdk123 commented Oct 21, 2022

Hit another bug in the output data.

Compare

import pyrodigal
from Bio import SeqIO

records = SeqIO.parse(fasta_file, "fasta")
orf_finder = pyrodigal.OrfFinder(meta=True)
predictions = [orf_finder.find_genes(bytes(record.seq)) for record in records]
[p.__getstate__()['_num_seq'] for p in predictions]

output:

[1, 2, 3, 4, ...]

with the multithreaded version:

from multiprocessing import pool

records = SeqIO.parse(fasta_file, "fasta")
orf_finder = pyrodigal.OrfFinder(meta=True)
with pool.ThreadPool() as p:
    predictions = p.map(lambda r: orf_finder.find_genes(bytes(r.seq)), records)
[p.__getstate__()['_num_seq'] for p in predictions]

output:

[2, 54, 66, 72, ...]

This causes a mismatch between the input sequence order and the ID in the resulting GFF / stats file. Is there any way of fixing this other than modifying the state?

@althonos
Owner

This is not so much a bug as an issue with how Python dispatches the iterable inside ThreadPool.map, I think; there is no guarantee on the order in which the threads receive the items. I think ThreadPool.imap might guarantee iteration order, but it's often slower. There's also the possibility that the first sequence takes longer than the second, causing sequence number 2 to be returned first.
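
A minimal sketch of the effect, using a hypothetical `CountingFinder` stand-in rather than pyrodigal itself. It assumes the finder stamps each result with a shared counter at call time, which is what the `_num_seq` values in the report suggest but is not confirmed here. In this sketch `ThreadPool.map` hands results back in input order, while the stamp taken inside each call reflects the order in which the threads happened to run:

```python
import threading
import time
from multiprocessing.pool import ThreadPool


class CountingFinder:
    """Hypothetical stand-in for a finder that numbers each call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def find(self, item):
        # Simulate variable work so that threads interleave.
        time.sleep(0.001 * (item % 3))
        # Take the next number from the shared counter, in call order.
        with self._lock:
            self._count += 1
            stamp = self._count
        return item, stamp


finder = CountingFinder()
items = list(range(20))

with ThreadPool(4) as tp:
    results = tp.map(finder.find, items)

returned_items = [item for item, _ in results]
stamps = [stamp for _, stamp in results]

# The results line up with the inputs...
print(returned_items == items)
# ...but the call-time stamps are generally not 1, 2, 3, ...
print(stamps)
```

The stamps are always some permutation of 1 to 20; which permutation depends on thread scheduling and can change between runs.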

Since I'm going to change the ID field of the GFF output to become the gene identifier as you suggested in #18 anyway, I don't think there will be a reason to allow changing the _num_seq attribute manually. It shouldn't be used for anything else...

@zdk123
Author

zdk123 commented Oct 21, 2022

I agree with your interpretation, and the other fix will definitely solve this for us too. My only point, I guess, is that _num_seq doesn't get used until the results are written, so it could be corrected manually.
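
One way to sidestep the counter without touching private state is to carry the input position alongside each result and use that when writing output. A sketch, assuming a hypothetical `predict` function standing in for `orf_finder.find_genes` and made-up record IDs and sequences:

```python
from multiprocessing.pool import ThreadPool


def predict(sequence):
    # Hypothetical stand-in for orf_finder.find_genes(sequence).
    return len(sequence)


# Made-up (id, sequence) pairs standing in for parsed FASTA records.
records = [("seq_a", b"ATGC"), ("seq_b", b"ATGCATGC"), ("seq_c", b"AT")]


def run(indexed_record):
    index, (record_id, sequence) = indexed_record
    # Return the input position and ID with the prediction so the
    # output never depends on a call-order counter.
    return index, record_id, predict(sequence)


with ThreadPool(2) as tp:
    # start=1 mirrors the 1-based numbering seen in the sequential run.
    labelled = tp.map(run, enumerate(records, start=1))

for index, record_id, prediction in labelled:
    print(index, record_id, prediction)
```

Because each tuple carries its own index and ID, the same approach would still identify every result if `imap_unordered` were used, only in completion order.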

@zdk123 zdk123 closed this as completed Oct 21, 2022
@althonos althonos added the wontfix This will not be worked on label Oct 24, 2022