Added functionality to compute coverage stats from sparse MT (needed e.g. for browser) #173
Conversation
ch-kr left a comment
looks good to me, just a few comments
context = None
for contig in contigs:
    _context = hl.utils.range_table(ref.contig_length(contig), n_partitions=int(ref.contig_length(contig) / 5000000))
why did you choose 5000000 for this?
no specific reason really... it seemed reasonable given that for each row there's very little data. But happy to reconsider if you have another suggestion!
makes sense. what about adding an n_partitions arg that is set to 50 by default, since len(chromosome 1) / 5000000 should be ~50? no strong feelings about this though.
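For concreteness, a minimal sketch of what that suggested change could look like (the function name, signature, and default value here are assumptions for illustration, not code from this PR):

import hail as hl

def build_context_ht(contigs, reference_genome='GRCh38', n_partitions=50):
    # Hypothetical sketch: expose n_partitions (default 50, roughly
    # len(chr1) / 5,000,000) instead of hard-coding 5 Mb per partition.
    ref = hl.get_reference(reference_genome)
    context = None
    for contig in contigs:
        # One range table per contig with a caller-controlled partition count.
        _context = hl.utils.range_table(
            ref.contig_length(contig), n_partitions=n_partitions
        ).annotate(contig=contig)
        context = _context if context is None else context.union(_context)
    return context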
col_key_fields = list(mt.col_key)
t = mt._localize_entries('__entries', '__cols')
t = t.join(reference_ht.annotate(_in_ref=True), how='outer')
t = t.annotate(__entries=hl.or_else(t.__entries, hl.range(n_samples).map(lambda x: hl.null(t.__entries.dtype.element_type))))
why is this annotate necessary?
So, what _localize_entries does is take the MatrixTable entries and represent them as a row-indexed array field; _unlocalize_entries does the opposite.
After the outer join, __entries at rows that appear in reference_ht but not in mt is NA. This is fine in the Table representation of t, but the next line calls t._unlocalize_entries('__entries', '__cols', col_key_fields), which spreads the __entries array back out into MatrixTable entries. In the MatrixTable representation an individual entry can be NA, but the entries array itself cannot, so __entries must be an array with the same length as the number of samples.
For this reason, we first annotate __entries to be an array of the right length with NA for each element.
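To make the localize / unlocalize round trip concrete, here is a minimal sketch of the pattern described above. It assumes a sparse MatrixTable mt and a locus-keyed reference_ht already exist; it mirrors the snippet under review but is illustrative rather than the PR's exact code.

import hail as hl

n_samples = mt.count_cols()
col_key_fields = list(mt.col_key)

# Localize: entries become a row field '__entries' (one array element per
# sample) and the columns become a global field '__cols'.
t = mt._localize_entries('__entries', '__cols')

# Outer join against the reference sites; rows present only in reference_ht
# get a missing '__entries' array.
t = t.join(reference_ht.annotate(_in_ref=True), how='outer')

# An individual entry may be NA, but the per-row entries array itself may not
# be, so replace a missing array with n_samples missing entries.
t = t.annotate(
    __entries=hl.or_else(
        t.__entries,
        hl.range(n_samples).map(lambda x: hl.null(t.__entries.dtype.element_type)),
    )
)

# Unlocalize: spread '__entries' back out into MatrixTable entries.
mt_with_ref = t._unlocalize_entries('__entries', '__cols', col_key_fields)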
thanks!
Back to you @ch-kr
ch-kr left a comment
looks good to me, I added a note about a potential change for n_partitions but don't have a strong preference for or against the change
Branch force-pushed from c6ac709 to 8ba026a.