Refactor combiner.combine to minimize the number of files open at one time #630
Conversation
Tests are now (mostly) failing as intended; removed an f-string to fix failures on Python 3.4 and 3.5.
Not sure why coverage went down; the lines marked as not covered are actually covered by the new test. Perhaps that is because the test is run in a subprocess.
@MSeifert04 @crawfordsm -- this is ready for a look.
Interesting. Reading through the code was a bit of a revelation... we really have some more problems there. In either case, I like the fact that images are read only once with the new approach (at least if all of them are actually files), even though the time spent reading the files will probably be negligible compared to the combining time - I mean, the "tiling" only happens when we expect each "tiled combine" to actually exceed a few GB. So I don't think it's actually necessary to "keep them around". The best idea here would be to read the file, copy the slice, and then completely dispose of the CCDData (which means the memory is freed and the file handle is closed). That way we (probably) keep only one file handle open at any point (in the memory-mapping case) and don't keep the original alive just because we keep the slice (in the non-memory-mapping case). The copy is actually a bit "annoying", but we have to assume that the copy operation will also be negligible compared to actually combining the images - I mean, we already copy the arrays inside the ``Combiner`` anyway.

Long story short: I'm not really sure how we should proceed. We could simply use your approach for now (it looks good - except that the tests are not picked up, which should be corrected) because it's a definite improvement, and think about a more general refactor later on - I mean, most of the things I've just written are pure speculation (especially the statements containing the word "negligible") and have to be verified before we actually do that.
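A minimal sketch of that read-copy-dispose idea (a hypothetical helper, not the PR's code; the helper name and the ``unit`` keyword are assumptions):

```python
from astropy.nddata import CCDData

def read_tile(path, x_slice, y_slice):
    # Open the file (possibly memory-mapped), copy out only the tile we
    # need, then drop the CCDData so its file handle can be released
    # before the next file is opened.
    ccd = CCDData.read(path, unit='adu')               # unit is an assumption
    tile = CCDData(ccd.data[x_slice, y_slice].copy(),  # copy detaches from the memmap
                   unit=ccd.unit)
    del ccd                                            # frees memory / releases the handle
    return tile
```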
How should that be fixed? By which I mean: "I don't know how to force coverage to pick up the tests that are run in a subprocess, but I would be happy to make that happen if I knew how" 😀
Thanks for the comments, @MSeifert04. I think that as currently written there are a few different outcomes, depending on what the input list contains.
Ah, maybe by avoiding "subprocess" here? Would that be feasible?
Not sure if it's ideal. I mean, you keep a lot of open file handles that could still conflict with the open-files limit, just to avoid opening the same file multiple times. I haven't done any timings, but my feeling is that opening the files multiple times isn't really significant in the runtime of the function. And in most cases the "tiling" won't happen at all - in those cases, having lots of open file handles is unnecessary.
Yes, that's the problem. We either keep the memmap alive (if the file is memory-mappable) or the full array (if it's not). Neither is great.
No, because the tests lower the limit (in Python) on the number of simultaneously open files, and once the limit is lowered, it cannot be raised again.
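For context, the mechanism in question (a sketch, not the test code itself): ``resource.setrlimit`` can lower the open-files limit, but an unprivileged process cannot raise the hard limit back once it has been lowered.

```python
import resource

# Save the current limits, then lower them the way the tests do.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (10, 10))

# Restoring the original limits now fails for an unprivileged process,
# because the hard limit can only ever be lowered:
#   resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))  # ValueError
```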
Going to run some profiles now to inform the discussion here. IIRC, the bulk of the time is spent reading data from disk (at least in the case where 100 files are being combined)...
That would be great. Could you share the results? :)
@MSeifert04 -- see #624 (comment) for one graph. I think the profiling/optimization is separate from this file-handle issue, though; I was being thrown off by something else. I'll add more comments on that issue tomorrow, but it looks like the easy speedups are going to be using the ``nan*`` functions instead of masked arrays, and maybe allowing for ``bottleneck``.
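To illustrate the nan-function idea (a sketch, not the PR's code; the shapes and mask here are arbitrary):

```python
import numpy as np

# A stack of 10 images with some pixels masked out.
stack = np.random.rand(10, 100, 100)
mask = stack > 0.99

# Masked-array median (the existing code path):
masked_median = np.ma.median(np.ma.array(stack, mask=mask), axis=0)

# NaN-aware median: typically much faster, and bottleneck can accelerate
# the nan* functions further.
nan_median = np.nanmedian(np.where(mask, np.nan, stack), axis=0)
```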
A number of comments follow inline below.
I have one more minor change to push, then I think this is ready to merge. Will add a cross-reference to the memory issue shortly.
Incidentally, to do the memory profiling, I installed and ran memory_profiler.
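For reference, one way ``memory_profiler`` can be pointed at a call like this (a sketch; ``file_list`` and the keyword arguments are placeholders, and this is not necessarily how it was invoked above):

```python
from memory_profiler import memory_usage
from ccdproc import combine

file_list = ['img1.fits', 'img2.fits']   # placeholder inputs

# memory_usage samples the process RSS while the callable runs.
usage = memory_usage((combine, (file_list,), {'method': 'median'}),
                     interval=0.1)
print('peak memory (MiB):', max(usage))
```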
Memory use issue is #638 |
I haven't looked through the tests. It looks very good overall; just a few comments that need to be addressed (and some that are more optional).
CHANGES.rst (Outdated)

```diff
@@ -24,6 +24,8 @@ Bug Fixes
 - Function ``median_combine`` now correctly calculates the uncertainty for
   masked ``CCDData``. [#608]

+- Function ``combine`` now avoids opening files unnecessarily. [#629, #630]
```
The new approach actually doesn't avoid opening the files unnecessarily.
I think it still does; without the ``copy``, ``ccd_list`` held a file reference to each file in the list if the files were mem-mapped. The ``copy`` in the update avoids that, so that at most one file is open at a time.
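A small illustration of that point, using a plain numpy memmap (the filename and shape are illustrative):

```python
import numpy as np

# 'image.dat' stands in for any memory-mappable file.
mm = np.memmap('image.dat', dtype='float32', mode='r', shape=(1000, 1000))

view = mm[0:100, 0:100]          # a view: keeps the memmap (and file) alive
tile = mm[0:100, 0:100].copy()   # a plain in-memory array: no file reference

del view, mm                     # only now can the file handle be released;
                                 # `tile` stays valid afterwards
```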
I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....
> I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....

I like that.
ccdproc/combiner.py (Outdated)

```diff
@@ -1,7 +1,6 @@
 # Licensed under a 3-clause BSD style license - see LICENSE.rst

 """This module implements the combiner class."""
-
```
Could be kept. Personally, I like blank lines between module docs and imports. Feel free to ignore this comment.
Oops, that was unintentional...
ccdproc/combiner.py (Outdated)

```diff
@@ -739,7 +738,7 @@ def combine(img_list, output_file=None,
               'func': sigma_clip_func,
               'dev_func': sigma_clip_dev_func}

-    # Finally Run the input method on all the subsections of the image
+   # Finally Run the input method on all the subsections of the image
```
That seems wrong. I would've expected this to throw an IndentationError... But I think we should stick to 4 space indentation even for comments.
Good catch! Haven't figured out why my editor sometimes misses things like this....
[Turns out my editor did catch it...my eyes missed it]
ccdproc/combiner.py (Outdated)

```diff
@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
             ccd.mask[x:xend, y:yend] = comb_tile.mask
         if ccd.uncertainty is not None:
             ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
+        # Clean up any open files
```
The comment is now wrong. At this point there should be no open files anymore.
Updating....
ccdproc/combiner.py (Outdated)

```diff
@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
             ccd.mask[x:xend, y:yend] = comb_tile.mask
         if ccd.uncertainty is not None:
             ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
+        # Clean up any open files
+        del comb_tile
```
Not sure if I really like those ``del``s. Are these still necessary for reducing the memory?
I think so; I can run a profile in a moment to check. The only essential one, I think, is deleting ``tile_combiner``. That will, by design, use roughly the amount of memory the user has set as the cap. If we don't delete it, then on the second pass through the loop we'll (briefly) consume twice that memory on line 761 (or 762, depending on the commit) as we construct the new combiner.
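Schematically, the loop under discussion looks something like this (a simplified sketch, not the PR's exact code; ``tiles``, ``img_list``, and ``ccd`` are placeholders, and ``read_tile`` is the hypothetical per-file helper sketched earlier in the thread):

```python
from ccdproc import Combiner

for x, xend, y, yend in tiles:
    # Read just this tile from every input file; each file handle is
    # released as soon as its tile has been copied out.
    tile_list = [read_tile(path, slice(x, xend), slice(y, yend))
                 for path in img_list]
    tile_combiner = Combiner(tile_list)          # holds ~one memory cap of data
    comb_tile = tile_combiner.median_combine()   # or average_combine, etc.
    ccd.data[x:xend, y:yend] = comb_tile.data
    # Without these, the next pass would briefly hold two caps' worth of
    # memory while the new Combiner is constructed.
    del comb_tile
    del tile_combiner
```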
Thanks for the comments @MSeifert04 -- I have addressed them in code or in the comments above.
Still haven't got the time to review the tests, but the actual code changes look good to me 👍
This will fix #629; the initial commit is two new tests, one of which will fail before the issue is fixed. Once the failures are verified on CI, I'll add a fix.