Skip to content

Conversation

@jkgoodrich
Copy link
Contributor

First pass at a PR to fix variant QC evaluation bugs found by Grace and revert back to ranking then binning instead of using approx quantile.

@jkgoodrich jkgoodrich requested a review from gtiao January 28, 2021 23:25
@jkgoodrich
Copy link
Contributor Author

jkgoodrich commented Jan 28, 2021

The current PR replaces the quantile approx functionality, but I am also open to adding it in as an option with a warning about unequal binning when there are duplicate boundaries

Copy link
Contributor

@gtiao gtiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is much easier and simpler to read -- thanks for putting this in so quickly!

(
n_bins
* (
bin_ht[f"{bin_id}_rank"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually 101 bins the way the code is written, because the largest rank (say it's 1002 in a given grouping) will give a value of 1 * n_bins = 100 (because the rank 1002 will be divided by the total number of variants in the grouping, 1002). Basically, you're starting a new bin for the single final variant in each grouping.

I think if you want to only have 100 bins, you'll need to subtract 1 off the {bin_id}_rank in line 121 before dividing by hl.float64(bin_ht.bin_group_variant_counts[bin_id] on line 122. This will handle the other end (the first variant in the grouping) fine, because hl.floor() will still return zero for the first variant.

Copy link
Contributor Author

@jkgoodrich jkgoodrich Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree, the rank is actually 0 to (n_variants - 1)

Notebook test:
cleanup_new_binning_function.html.zip

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show what the {bin_id}_rank is in the notebook, and not just the final bin assignment? I agree if the rank is 0 to (n_variants - 1) then the code as written is fine (basically the ranking already has 1 subtracted off it). It's just that I assumed hl.scan.count_where(bin_ht[f"_filter_{bin_id}"]) would be an actual count, which starts a 1 and not 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2021-01-29 at 2 09 05 PM

Screen Shot 2021-01-29 at 2 09 12 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was too at first, but got the wrong binning when I tested so changed it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I wonder if the behavior changes when it's a scan -- I know the sum mechanism for hl.scan() doesn't count the current row, so maybe that's what's happening here.

In any case, the code is working as intended, so no changes required.

@jkgoodrich jkgoodrich requested a review from gtiao January 29, 2021 18:31
Copy link
Contributor

@gtiao gtiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjustments all look good, just want to understand how the ranking code works

(
n_bins
* (
bin_ht[f"{bin_id}_rank"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show what the {bin_id}_rank is in the notebook, and not just the final bin assignment? I agree if the rank is 0 to (n_variants - 1) then the code as written is fine (basically the ranking already has 1 subtracted off it). It's just that I assumed hl.scan.count_where(bin_ht[f"_filter_{bin_id}"]) would be an actual count, which starts a 1 and not 0.

@gtiao gtiao merged commit 8b5bf71 into master Jan 29, 2021
@gtiao gtiao deleted the fixes_to_variant_qc_evaluation branch January 29, 2021 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants