-
Notifications
You must be signed in to change notification settings - Fork 31
Fixes to variant qc evaluation #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…in method instead of approx quantile for binning
…hods into fixes_to_variant_qc_evaluation
…in method instead of approx quantile for binning
…in method instead of approx quantile for binning
|
The current PR replaces the quantile approx functionality, but I am also open to adding it in as an option with a warning about unequal binning when there are duplicate boundaries |
gtiao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is much easier and simpler to read -- thanks for putting this in so quickly!
| ( | ||
| n_bins | ||
| * ( | ||
| bin_ht[f"{bin_id}_rank"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are actually 101 bins the way the code is written, because the largest rank (say it's 1002 in a given grouping) will give a value of 1 * n_bins = 100 (because the rank 1002 will be divided by the total number of variants in the grouping, 1002). Basically, you're starting a new bin for the single final variant in each grouping.
I think if you want to only have 100 bins, you'll need to subtract 1 off the {bin_id}_rank in line 121 before dividing by hl.float64(bin_ht.bin_group_variant_counts[bin_id] on line 122. This will handle the other end (the first variant in the grouping) fine, because hl.floor() will still return zero for the first variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree, the rank is actually 0 to (n_variants - 1)
Notebook test:
cleanup_new_binning_function.html.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show what the {bin_id}_rank is in the notebook, and not just the final bin assignment? I agree if the rank is 0 to (n_variants - 1) then the code as written is fine (basically the ranking already has 1 subtracted off it). It's just that I assumed hl.scan.count_where(bin_ht[f"_filter_{bin_id}"]) would be an actual count, which starts a 1 and not 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was too at first, but got the wrong binning when I tested so changed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I wonder if the behavior changes when it's a scan -- I know the sum mechanism for hl.scan() doesn't count the current row, so maybe that's what's happening here.
In any case, the code is working as intended, so no changes required.
gtiao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjustments all look good, just want to understand how the ranking code works
| ( | ||
| n_bins | ||
| * ( | ||
| bin_ht[f"{bin_id}_rank"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show what the {bin_id}_rank is in the notebook, and not just the final bin assignment? I agree if the rank is 0 to (n_variants - 1) then the code as written is fine (basically the ranking already has 1 subtracted off it). It's just that I assumed hl.scan.count_where(bin_ht[f"_filter_{bin_id}"]) would be an actual count, which starts a 1 and not 0.


First pass at a PR to fix variant QC evaluation bugs found by Grace and revert back to ranking then binning instead of using approx quantile.