HighLevelGraph length without materializing layers#7274
Merged
jrbourbeau merged 3 commits intodask:mainfrom Mar 9, 2021
Merged
HighLevelGraph length without materializing layers#7274jrbourbeau merged 3 commits intodask:mainfrom
jrbourbeau merged 3 commits intodask:mainfrom
Conversation
crusaderky
reviewed
Feb 26, 2021
| def __len__(self): | ||
| return len(self.to_dict()) | ||
| def __len__(self) -> int: | ||
| return sum(len(layer) for layer in self.layers.values()) |
Collaborator
There was a problem hiding this comment.
pls add comment explaining how this could double-count keys but we decided not to care as it should always be a broken use case
Member
There was a problem hiding this comment.
I was actually wondering about this when I looked at this PR last night, but decided not to say anything. I'm glad to hear that this already came up and has been considered.
Collaborator
crusaderky
reviewed
Feb 26, 2021
crusaderky
approved these changes
Feb 26, 2021
Closes dask#7271 Calculate the `__len__` of a HighLevelGraph from the sum of the legths of its layers, instead of `len(self.to_dict())`. This is much faster and prevents causing all the layers to materialize. I also changed the `__len__` implementation on `Blockwise`. It was using `_out_numblocks`, which is unused anywhere else in the codebase, and a bit hard to read. As far as I could tell, `_out_numblocks` was equivalent to `{i: self.dims[i] for i in self.output_indices}`. Basically, the length of a Blockwise layer should be equal to the number of output keys (right?).
Co-authored-by: crusaderky <crusaderky@gmail.com>
d99a227 to
a5f454f
Compare
jrbourbeau
approved these changes
Mar 9, 2021
Member
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks for the fix @gjoseph92 and reviewing @crusaderky!
douglasdavis
pushed a commit
to douglasdavis/dask
that referenced
this pull request
Mar 14, 2021
Calculate the `__len__` of a HighLevelGraph from the sum of the legths of its layers, instead of `len(self.to_dict())`. This is much faster and prevents causing all the layers to materialize.
I also changed the `__len__` implementation on `Blockwise`. It was using `_out_numblocks`, which is unused anywhere else in the codebase, and a bit hard to read. As far as I could tell, `_out_numblocks` was equivalent to `{i: self.dims[i] for i in self.output_indices}`. Basically, the length of a Blockwise layer should be equal to the number of output keys
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Calculate the
__len__of a HighLevelGraph from the sum of the legths of its layers, instead oflen(self.to_dict()). This is much faster and prevents causing all the layers to materialize.I also changed the
__len__implementation onBlockwise. It was using_out_numblocks, which is unused anywhere else in the codebase, and a bit hard to read. As far as I could tell,_out_numblockswas equivalent to{i: self.dims[i] for i in self.output_indices}. Basically, the length of a Blockwise layer should be equal to the number of output keys (right?), so I just reused the logic fromget_output_keys, without materializing the keys.For the example in the linked issue, where
_repr_html_took 19sec before, it now takes 5ms.cc @crusaderky
black dask/flake8 dask