Feature/fix issue 36 #39
Hmmm, your suspicions were correct. Performance took an absolute beating. Before:
After:
I didn't bother letting it finish. I'll see if we can think of something else. At that speed, I don't think it's any better than bcftools.
This was me testing on a trio germline dataset.
Hmm, there shouldn't be that much of a difference. Are you sure the file wasn't somehow cached on disk during the first run? And did you compile in release mode both times?
Hmmm, the first time was from a Docker image that I had built (which is release mode); the second time it was built like it is in the
Yeah, the default is debug mode, so if you add
I guess if we're hard set on avoiding additional vectors, we could potentially move this into the
This changes the behavior in that we're going to get the user-defined missing values for these star records.
Ha, you weren't kidding. In release mode:
And to rule out disk caching, I grabbed another file:
Lastly, cleared the disk cache with
Surprisingly or not, it's actually faster. My guess is that this comes from no longer needing to encode the * values and, as @dmiller15 pointed out to me, from skipping the annotation of those records altogether.
My colleague @dmiller15 had a much simpler suggestion: simply skip `*` entries as a start in the annotator tool. It'd likely tidy up our current concerns. Expanding this to skip all non-ACGT alleles (like `N` or `ACN`) could be tackled later, but I'm not sure it needs to happen immediately.
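To make the suggestion concrete, here is a minimal sketch of what such a filter might look like. It is not the code in this PR: the function names `is_acgt_allele` and `annotatable_alt_indices` are hypothetical, and it assumes the ALT alleles are available as plain string slices rather than whatever record type the annotator actually uses.

```rust
/// Hypothetical helper: returns true when every base in the allele is
/// A, C, G or T (case-insensitive). An empty allele is rejected.
fn is_acgt_allele(allele: &str) -> bool {
    !allele.is_empty()
        && allele
            .bytes()
            .all(|b| matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
}

/// Hypothetical helper: returns the indices of the ALT alleles that
/// would be annotated, skipping `*`, `N`, `ACN` and anything else that
/// is not made up purely of A/C/G/T.
fn annotatable_alt_indices(alts: &[&str]) -> Vec<usize> {
    alts.iter()
        .enumerate()
        .filter_map(|(i, alt)| if is_acgt_allele(alt) { Some(i) } else { None })
        .collect()
}
```

With this check, `*` is skipped as a special case of the broader non-ACGT rule, so both steps of the suggestion share one code path. Note that symbolic alleles such as `<DEL>` would also be skipped under this rule, which may or may not be what is wanted.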