Skip to content

Conversation

@smcguire-cmu
Copy link
Contributor

@smcguire-cmu smcguire-cmu commented Aug 6, 2025

In working with the Zubercal dataset, users were running into issues with a memory leak during the read_pixel dask task. This led to workflows failing due to running out of memory. This change updates the read_hats behavior to avoid passing the hats Catalog object to the dask task graph, which seems to solve the memory leak problem.

My best guess is that the MOC object that is part of the hats Catalog could be causing this, since we've ran into issues with how they are cloud pickled before and how the Rust interaction works in dask distributed, but further investigation is necessary to confirm this.

@github-actions
Copy link

github-actions bot commented Aug 6, 2025

Before [e7dd698] After [1504048] Ratio Benchmark (Parameter)
37.6±2ms 37.0±0.6ms 0.99 benchmarks.time_polygon_search
80.9±1ms 78.6±2ms 0.97 benchmarks.time_kdtree_crossmatch
23.6±0.8ms 22.7±0.2ms 0.96 benchmarks.time_box_filter_on_partition
952±5ms 882±10ms 0.93 benchmarks.time_create_midsize_catalog
6.38±0.01s 5.82±0.02s 0.91 benchmarks.time_create_large_catalog

Click here to view all benchmarks.

@codecov
Copy link

codecov bot commented Aug 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.86%. Comparing base (e7dd698) to head (c1ea802).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #982   +/-   ##
=======================================
  Coverage   96.86%   96.86%           
=======================================
  Files          54       54           
  Lines        2554     2554           
=======================================
  Hits         2474     2474           
  Misses         80       80           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@smcguire-cmu smcguire-cmu requested a review from dougbrn August 6, 2025 21:54
Copy link
Contributor

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nicer to just define the paths ahead of the from_map call, thanks for looking into this Sean! This all looks good to me, I did notice that this slowed down two of the benchmarks by ~50%, I'm not sure how much to worry about that for this PR. Perhaps the original implementation is faster for smaller datasets, but at the cost of this scalability issue.

@delucchi-cmu
Copy link
Contributor

Seconding the concern about scaling - we were originally passing the paths, but moved away from that intentionally, as the string construction was painfully slow for catalogs with 100k+ partitions. The regression in the benchmarks is a real one!

@dougbrn
Copy link
Contributor

dougbrn commented Aug 7, 2025

@delucchi-cmu Hmm, okay so on the one hand, the current implementation is causing some memory leak issues for specific datasets. While on the other hand the current implementation is generally faster. Perhaps merits some deeper investigation into why zubercal specifically is not releasing memory. I'm surprised to hear that passing the paths is slow, you said that's due to string construction, like the actual definition of the strings themselves?

@delucchi-cmu
Copy link
Contributor

Yes - the string construction! In particular, calling the string construction method for each pixel, instead of using a method that will vectorize the construction.

https://github.com/lincc-frameworks/notebooks_lf/blob/6ba95d3a1284fa8b84e92ea776cca9d08d928e6c/sprints/2024/04_04/paths_optimization.ipynb

@dougbrn
Copy link
Contributor

dougbrn commented Aug 7, 2025

Thank you, that notebook is helpful. I'm going to poke at this a bit while Sean is out, I wonder if submitting the string construction to dask as it's own task layer will be the best of both worlds

@smcguire-cmu smcguire-cmu marked this pull request as ready for review August 13, 2025 16:55
Copy link
Contributor

@delucchi-cmu delucchi-cmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the PR description.

@smcguire-cmu smcguire-cmu changed the title Change read_hats to load from paths instead of pixels Change read_pixels to not pass hats Catalog object to dask graph Aug 13, 2025
@smcguire-cmu smcguire-cmu changed the title Change read_pixels to not pass hats Catalog object to dask graph Change read_pixels to not pass HATS Catalog object to dask graph Aug 13, 2025
@smcguire-cmu smcguire-cmu changed the title Change read_pixels to not pass HATS Catalog object to dask graph Change read_pixels to avoid passing HATS Catalog object to dask graph Aug 13, 2025
@smcguire-cmu smcguire-cmu merged commit d9a7b95 into main Aug 13, 2025
12 of 13 checks passed
@smcguire-cmu smcguire-cmu deleted the sean/fix-memory-leak branch August 13, 2025 18:29
@delucchi-cmu delucchi-cmu mentioned this pull request Aug 14, 2025
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants