Skip to content

Conversation

@lfrancioli
Copy link
Contributor

…e.g. for browser)

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, just a few comments


context = None
for contig in contigs:
_context = hl.utils.range_table(ref.contig_length(contig), n_partitions=int(ref.contig_length(contig) / 5000000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you choose 5000000 for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no specific reason really..seemed reasonable given that for each row there's very little data. But happy to reconsider if you have another suggestion!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. what about adding an n_partitions arg that is set to 50 by default, since len(chromosome 1) / 5000000 should be ~50? no strong feelings about this though.

col_key_fields = list(mt.col_key)
t = mt._localize_entries('__entries', '__cols')
t = t.join(reference_ht.annotate(_in_ref=True), how='outer')
t = t.annotate(__entries=hl.or_else(t.__entries, hl.range(n_samples).map(lambda x: hl.null(t.__entries.dtype.element_type))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this annotate necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what __localize does is to take the entries and represent them as an array, __unlocalize does the opposite.
Now after the outer join, __entries at rows that appear in reference_ht and not in mt have the value NA. This is fine in the Table representation for t, but in the next line we call t._unlocalize_entries('__entries', '__cols', col_key_fields), which will again spread the information in __entries array into the new MatrixTable entries. In the MatrixTable representation, an entry can be NA, but the entry array cannot, so it requires __entries to be an array of the same length as the number of samples.
For this reason, we need to first annotate __entries to be an array of the right length with NA for each element.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@lfrancioli
Copy link
Contributor Author

Back to you @ch-kr

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, I added a note about a potential change for n_partitions but don't have a strong preference for or against the change

col_key_fields = list(mt.col_key)
t = mt._localize_entries('__entries', '__cols')
t = t.join(reference_ht.annotate(_in_ref=True), how='outer')
t = t.annotate(__entries=hl.or_else(t.__entries, hl.range(n_samples).map(lambda x: hl.null(t.__entries.dtype.element_type))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!


context = None
for contig in contigs:
_context = hl.utils.range_table(ref.contig_length(contig), n_partitions=int(ref.contig_length(contig) / 5000000))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. what about adding an n_partitions arg that is set to 50 by default, since len(chromosome 1) / 5000000 should be ~50? no strong feelings about this though.

@lfrancioli lfrancioli force-pushed the sparse_mt_coverage branch from c6ac709 to 8ba026a Compare March 5, 2020 19:06
@lfrancioli lfrancioli merged commit c4cc23d into master Mar 5, 2020
@lfrancioli lfrancioli deleted the sparse_mt_coverage branch March 5, 2020 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants