HighLevelGraph length without materializing layers #7274

Merged
jrbourbeau merged 3 commits into dask:main from gjoseph92:hlg-len-without-materialize on Mar 9, 2021
Collaborator

@gjoseph92 gjoseph92 commented Feb 25, 2021

Calculate the __len__ of a HighLevelGraph from the sum of the lengths of its layers, instead of len(self.to_dict()). This is much faster and avoids materializing all the layers.
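A minimal, self-contained sketch of the idea, not dask's actual classes: `LazyLayer`, `len_via_dict` and `len_via_sum` are hypothetical names invented for illustration. It assumes each layer can report its size cheaply, and shows that summing layer lengths never triggers materialization, while going through a merged dict does.

```python
from collections.abc import Mapping


class LazyLayer(Mapping):
    """Hypothetical stand-in for a graph layer that knows its size cheaply."""

    def __init__(self, name, n):
        self.name = name
        self.n = n
        self._cache = None  # filled only when the tasks are actually built

    @property
    def materialized(self):
        return self._cache is not None

    def _tasks(self):
        # Building the full task dict is the expensive step we want to avoid.
        if self._cache is None:
            self._cache = {(self.name, i): ("task", i) for i in range(self.n)}
        return self._cache

    def __getitem__(self, key):
        return self._tasks()[key]

    def __iter__(self):
        return iter(self._tasks())

    def __len__(self):
        # Cheap: no tasks are built here.
        return self.n


def len_via_dict(layers):
    """Old approach (sketch): merge every layer into one dict, then count."""
    merged = {}
    for layer in layers.values():
        merged.update(layer)  # iterates the layer, forcing materialization
    return len(merged)


def len_via_sum(layers):
    """New approach (sketch): add up the per-layer lengths."""
    return sum(len(layer) for layer in layers.values())


layers = {"a": LazyLayer("a", 3), "b": LazyLayer("b", 4)}
print(len_via_sum(layers), any(l.materialized for l in layers.values()))   # 7 False
print(len_via_dict(layers), all(l.materialized for l in layers.values()))  # 7 True
```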

I also changed the __len__ implementation on Blockwise. It was using _out_numblocks, which is not used anywhere else in the codebase and is a bit hard to read. As far as I could tell, _out_numblocks was equivalent to {i: self.dims[i] for i in self.output_indices}. Basically, the length of a Blockwise layer should be equal to the number of output keys (right?), so I just reused the logic from get_output_keys, without materializing the keys.
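A rough sketch of why the count can be computed without building the keys, under the assumption that output keys are the Cartesian product of the block counts along each output index. The functions `output_keys` and `output_len` and the sample `dims` are hypothetical and are not copied from the Blockwise implementation.

```python
import itertools
import math


def output_keys(output, output_indices, dims):
    """Materialize every output key as (output, block_i, block_j, ...)."""
    ranges = (range(dims[i]) for i in output_indices)
    return {(output, *idx) for idx in itertools.product(*ranges)}


def output_len(output_indices, dims):
    """Count the same keys without building them: a product of block counts."""
    return math.prod(dims[i] for i in output_indices)


# Assumed example: index "k" is contracted away, so only "i" and "j" are output.
dims = {"i": 3, "j": 4, "k": 5}
output_indices = ("i", "j")

print(len(output_keys("out", output_indices, dims)))  # 12
print(output_len(output_indices, dims))               # 12
```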

For the example in the linked issue, _repr_html_ previously took 19 s; it now takes 5 ms.

cc @crusaderky

-    def __len__(self):
-        return len(self.to_dict())
+    def __len__(self) -> int:
+        return sum(len(layer) for layer in self.layers.values())
Collaborator

Please add a comment explaining that this could double-count keys, but that we decided not to care because it should only ever happen in a broken use case.

Member

I was actually wondering about this when I looked at this PR last night, but decided not to say anything. I'm glad to hear that this already came up and has been considered.

Collaborator

@mrocklin we discussed it in #7271; I'd like to hear your input on it though
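The double-counting concern raised in this thread can be illustrated with plain dicts standing in for layers. This is a toy sketch with made-up keys, not a real dask graph: when the same key appears in two layers, the sum of layer lengths exceeds the number of unique keys in the merged graph.

```python
# Two toy "layers"; the key ("x", 1) appears in both of them.
layers = {
    "a": {("x", 0): 1, ("x", 1): 2},
    "b": {("x", 1): 2, ("y", 0): 3},
}

# Sum of per-layer lengths counts the shared key twice.
summed = sum(len(layer) for layer in layers.values())

# Merging into one dict collapses the duplicate key.
merged = {}
for layer in layers.values():
    merged.update(layer)
unique = len(merged)

print(summed, unique)  # 4 3
```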

gjoseph92 and others added 3 commits February 26, 2021 14:05
Closes dask#7271

Calculate the `__len__` of a HighLevelGraph from the sum of the lengths of its layers, instead of `len(self.to_dict())`. This is much faster and avoids materializing all the layers.

I also changed the `__len__` implementation on `Blockwise`. It was using `_out_numblocks`, which is unused anywhere else in the codebase, and a bit hard to read. As far as I could tell, `_out_numblocks` was equivalent to `{i: self.dims[i] for i in self.output_indices}`. Basically, the length of a Blockwise layer should be equal to the number of output keys (right?).
Co-authored-by: crusaderky <crusaderky@gmail.com>
@gjoseph92 gjoseph92 force-pushed the hlg-len-without-materialize branch from d99a227 to a5f454f Compare February 26, 2021 21:06
Base automatically changed from master to main March 8, 2021 20:19
Member

@jrbourbeau jrbourbeau left a comment

Thanks for the fix @gjoseph92, and for reviewing @crusaderky!

@jrbourbeau jrbourbeau merged commit 479e8cb into dask:main Mar 9, 2021
douglasdavis pushed a commit to douglasdavis/dask that referenced this pull request Mar 14, 2021
Development

Successfully merging this pull request may close these issues.

Displaying Array/DataFrame in notebook materializes all graph layers
