Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non ACGT Bug fix kinda #37

Closed
wants to merge 1 commit into from
Closed

Conversation

migbro
Copy link
Contributor

@migbro migbro commented Feb 1, 2024

Going on the premise that GATK's outputs are wildly popular and that in their format, * is a valid alt possibility, even if in practice you would not use it:

  • allow * as acceptable value
  • However, other characters would still cause the bug to reappear, like if there was an N
    Separately, I :revert vec as change in main as it causes compile fail. Also a small update to git ignore to avoid test files

Fixes #36

🔨 allow * as acceptable value
📝 update git ignore to avoid test files
@migbro
Copy link
Contributor Author

migbro commented Feb 1, 2024

Been talking with @dmiller15 about this and don't really like my solution. It'd be preferred to simply skip annotating *alleles - like still output the line as-is. Unfortunately, not really sure how to do that or if it'd require major refactoring. It's something I think a lot of people who use GATK for germline trio calls will encounter

@brentp
Copy link
Owner

brentp commented Feb 2, 2024

I don't think this will work anyway, the encoding uses 2 bits per letter so only ACTG can be encoded.

@migbro
Copy link
Contributor Author

migbro commented Feb 2, 2024

Yeah, that's fair. I'll dump this PR, pull your other bug fix and ruminate some more. We've been working to incorporate this tool as an annotation replacement for bcftools to annotate thousands of germline vcfs with a subset of gnomAD 3.1.1 fields, hence my eagerness to tighten up a couple things. It's speed will save a us a good chunk of time and money collectively

@migbro migbro closed this Feb 2, 2024
@brentp
Copy link
Owner

brentp commented Feb 2, 2024

We've been working to incorporate this tool as an annotation replacement for bcftools to annotate thousands of germline vcfs with a subset of gnomAD 3.1.1 fields, hence my eagerness to tighten up a couple things. It's speed will save a us a good chunk of time and money collectively

Let's figure out how to get it in a state where you can use it.
Can you let me know if it's ok with you to discard variants with "*" in the archive set and just skip them (write out without annotation) in the query set?

@migbro
Copy link
Contributor Author

migbro commented Feb 2, 2024

The short answer is probably yes. This is predicated on, if I understand correctly, encoder is used to process input vcfs in both encode and anno modes. So, I'd think since it's not designed to handle ACGT, it's probably ok to skip encoding non ACGT in the zips generated, as long as if a non ACGT character in the vcf being annotated still shows up as-is with no annotation in the annotated result from anno. Does that make sense?

@migbro
Copy link
Contributor Author

migbro commented Feb 2, 2024

Alternatively, perhaps as a future feature, you could consider expanding to the complete IUPAC notation while keeping the same behavior to skip encoding non-conforming characters to cover nearly all possible use cases. That's probably make the bit encoding more complex and the resulting binaries larger. Not sure how speed would be affected

@migbro migbro deleted the feature/mb-bug-fix branch February 2, 2024 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🐛 non ACGT ALT character
2 participants