Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not explicitly call gc.collect in CompImageHDU.compressed_data #14576

Merged
merged 1 commit into from Apr 19, 2023

Conversation

Cadair
Copy link
Member

@Cadair Cadair commented Mar 24, 2023

I have a nice pathological benchmark where I load 150 files and read all the data:

Before:

$ python run_timeit.py dkist_visp_tiled_post
Using DKIST dataset (BEJZP) with shape: (4, 18, 150, 996, 2545)
FITS tile shape is (1, 256, 256) (numpy order).
Ran dkist_visp_tiled_post 5 times, average time 27.755185409799743 s

After:

$ python run_timeit.py dkist_visp_tiled_post
Using DKIST dataset (BEJZP) with shape: (4, 18, 150, 996, 2545)
FITS tile shape is (1, 256, 256) (numpy order).
Ran dkist_visp_tiled_post 5 times, average time 4.998688907800533 s

We already discuss the reason why this line exists pretty comprehensively in the docs here: https://docs.astropy.org/en/stable/io/fits/appendix/faq.html#i-am-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files

I am not really sure I see the benefit of this still being here?

@github-actions
Copy link

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions
Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@pllim pllim added this to the v5.3 milestone Mar 24, 2023
@astrofrog
Copy link
Member

I agree that users who run into issues should simply call gc.collect themselves - we shouldn't call it by default as this can have a large performance penalty.

@pllim
Copy link
Member

pllim commented Mar 24, 2023

But people been relying on this indirectly for so long, so now suddenly they have to manually call gc.collect() to get back the same behavior?

@astrofrog
Copy link
Member

@pllim - I'm not sure it would actually change anything for most users, we shouldn't actually have to call garbage collection explicitly and it's surprising that the code even includes this. It might have been added as a 'just in case' but the main use case it might have been intended for is opening many files in a small amount of time which is what @Cadair is doing but it seems to be hindering not helping.

@pllim
Copy link
Member

pllim commented Mar 24, 2023

Is asking over at astropy-dev mailing list necessary, just in case?

@Cadair
Copy link
Member Author

Cadair commented Mar 24, 2023

I can't imagine that even if anyone is relying on this they know it exists, so asking wont help.

@pllim
Copy link
Member

pllim commented Mar 24, 2023

Should at least add a change log then?

@Cadair
Copy link
Member Author

Cadair commented Mar 24, 2023

Should at least add a change log then?

Sure. I was waiting to see if people a) rejected it outright and b) that all the tests passed 😀

@Cadair
Copy link
Member Author

Cadair commented Mar 24, 2023

For what it's worth, I just loaded 10800 FITS files with memmap enabled without this line and I didn't hit any issues. I am not sure when dask would have dropped the references to them, so I don't know how many were open at once, but it certainly wasn't keeping them around forever.

@saimn
Copy link
Contributor

saimn commented Mar 24, 2023

I agree we should not call gc.collect since it will have a performance penalty when processing a lot of files.
For reference, this was added in #3283 / cf51894.

For what it's worth, I just loaded 10800 FITS files with memmap enabled without this line and I didn't hit any issues

Interesting, though problems may arise only in specific configurations, e.g. did you take care of not keeping references to .data ? What the value of ulimit -n on your system ?

I guess the other case to check, where people could be impacted by this change, is reading one or a few big files, keeping / not keeping references to .data, to see if memory if released or not.

@Cadair
Copy link
Member Author

Cadair commented Mar 24, 2023

did you take care of not keeping references to .data ?

I didn't but dask might have done.

What the value of ulimit -n on your system ?

$ ulimit -n
1024

I am not suggesting this is a comprehensive test.

@Cadair Cadair marked this pull request as ready for review March 24, 2023 17:33
@Cadair Cadair requested a review from saimn as a code owner March 24, 2023 17:33
@@ -0,0 +1,2 @@
Do not call ``gc.collect()`` in ``CompImageHDU.compressed_data`` as it has
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically in the deleter right? Maybe rephrase to say when closing a ConpImageHDU since that is when people are most likely to call this indirectly?

@Cadair
Copy link
Member Author

Cadair commented Apr 19, 2023

@astrofrog @saimn I think this is good to go?

@saimn
Copy link
Contributor

saimn commented Apr 19, 2023

I did a few quick tests and I'm confused, it seems we have a memory leak with the new code ?
https://github.com/saimn/misc-astro/blob/master/astropy/FITS%20compression.ipynb
Note, for dev I get the same memory usage with gc.collect or without.
Also the amount of memory that leaks seems to vary with the tiling, it's bigger with the default tiling.

@pllim
Copy link
Member

pllim commented Apr 19, 2023

I am guessing this needs to be rebased on top of #14649 and re-reviewed after that one is merged first?

@saimn
Copy link
Contributor

saimn commented Apr 19, 2023

Running again some tests after #14649, I don't see a difference with or without gc.collect(), so sounds good to me.

Copy link
Contributor

@saimn saimn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Cadair !

@saimn saimn merged commit 80c3854 into astropy:main Apr 19, 2023
25 checks passed
@Cadair Cadair deleted the no_gc_in_tiled_data branch April 19, 2023 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants