Change read_pixels to avoid passing HATS Catalog object to dask graph
#982
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff, main vs. #982:

| Metric   | main   | #982   |
|----------|--------|--------|
| Coverage | 96.86% | 96.86% |
| Files    | 54     | 54     |
| Lines    | 2554   | 2554   |
| Hits     | 2474   | 2474   |
| Misses   | 80     | 80     |
It looks nicer to just define the paths ahead of the from_map call. Thanks for looking into this, Sean! This all looks good to me. I did notice that this slowed down two of the benchmarks by ~50%, and I'm not sure how much to worry about that for this PR. Perhaps the original implementation is faster for smaller datasets, but at the cost of this scalability issue.
Seconding the concern about scaling - we were originally passing the paths, but moved away from that intentionally, as the string construction was painfully slow for catalogs with 100k+ partitions. The regression in the benchmarks is a real one!
@delucchi-cmu Hmm, okay. On the one hand, the current implementation is causing memory leak issues for specific datasets; on the other hand, it is generally faster. This perhaps merits a deeper investigation into why zubercal specifically is not releasing memory. I'm surprised to hear that passing the paths is slow. You said that's due to string construction: do you mean the actual definition of the strings themselves?
Yes - the string construction! In particular, calling the string construction method once per pixel, instead of using a method that vectorizes the construction.
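To make the per-pixel vs. vectorized distinction concrete, here is a minimal sketch, not the code from this PR. The function names are hypothetical, and the `Norder=/Dir=/Npix=` path layout is assumed purely for illustration. The first function formats one string per pixel in a Python loop; the second builds all strings at once with NumPy's elementwise string operations.

```python
import numpy as np


def paths_per_pixel(orders, pixels):
    """Hypothetical helper: format one path string per pixel in a Python loop."""
    return [
        f"Norder={o}/Dir={(p // 10000) * 10000}/Npix={p}.parquet"
        for o, p in zip(orders, pixels)
    ]


def paths_vectorized(orders, pixels):
    """Hypothetical helper: build every path at once with elementwise string ops."""
    orders = np.asarray(orders, dtype=np.int64)
    pixels = np.asarray(pixels, dtype=np.int64)
    # Assumed layout: pixels are grouped into directories of 10,000.
    dirs = (pixels // 10000) * 10000
    parts = [
        "Norder=", orders.astype(str),
        "/Dir=", dirs.astype(str),
        "/Npix=", pixels.astype(str),
        ".parquet",
    ]
    result = parts[0]
    for part in parts[1:]:
        # np.char.add concatenates strings elementwise and broadcasts scalars.
        result = np.char.add(result, part)
    return result.tolist()
```

Both helpers return the same list of strings, so either could feed the same downstream call. Which one is faster for a given catalog size would need to be measured.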
Thank you, that notebook is helpful. I'm going to poke at this a bit while Sean is out. I wonder if submitting the string construction to dask as its own task layer would be the best of both worlds.
d691f07 to 706cb3c
Please update the PR description.
Title changed from "read_pixels to not pass hats Catalog object to dask graph" to "read_pixels to not pass HATS Catalog object to dask graph"
Title changed from "read_pixels to not pass HATS Catalog object to dask graph" to "read_pixels to avoid passing HATS Catalog object to dask graph"
In working with the Zubercal dataset, users were running into a memory leak during the `read_pixel` dask task. This led to workflows failing because they ran out of memory. This change updates the `read_hats` behavior to avoid passing the `hats` `Catalog` object to the dask task graph, which seems to solve the memory leak problem.

My best guess is that the MOC object that is part of the hats Catalog could be causing this, since we've run into issues before with how they are cloud pickled and how the Rust interaction works in dask distributed, but further investigation is necessary to confirm this.
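As a rough illustration of why the choice of task arguments matters, the stdlib-only sketch below compares the serialized size of per-task arguments when each task receives a whole catalog-like object versus only a precomputed path string. It does not reproduce the leak described above, and everything in it is hypothetical: the `catalog` dict is a stand-in for a real catalog object, and `path_for` is an invented helper, not part of the hats API.

```python
import pickle

# Hypothetical stand-in for a catalog: a base path plus bulky metadata.
catalog = {
    "base_path": "/data/example",
    "pixel_metadata": list(range(100_000)),
}


def path_for(catalog, pixel):
    """Hypothetical helper: derive a file path from the catalog-like object."""
    return f"{catalog['base_path']}/Npix={pixel}.parquet"


pixels = [0, 1, 2]

# Option A: every task receives the whole catalog-like object.
args_with_catalog = [(catalog, pixel) for pixel in pixels]

# Option B: paths are resolved up front, so every task receives one string.
paths = [path_for(catalog, pixel) for pixel in pixels]
args_with_paths = [(path,) for path in paths]

size_with_catalog = sum(len(pickle.dumps(args)) for args in args_with_catalog)
size_with_paths = sum(len(pickle.dumps(args)) for args in args_with_paths)

print(f"serialized bytes, catalog per task: {size_with_catalog}")
print(f"serialized bytes, path per task:    {size_with_paths}")
```

The same idea applies regardless of the scheduler: whatever a task closes over or takes as an argument has to be serialized and shipped with it, so smaller, simpler arguments leave less room for serialization problems.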