
Refactor combiner.combine to minimize the number of files open at one time #630

Merged - 19 commits - Oct 19, 2018

Conversation

@mwcraig (Member) commented Jun 23, 2018

This will fix #629; the initial commit adds two new tests, one of which will fail before the issue is fixed. Once the failures are verified on CI, I'll add a fix.

For bugfixes:

  • Did you add an entry to the "CHANGES.rst" file?
  • Did you add a regression test?
  • Does the commit message include a "Fixes #issue_number" (replace "issue_number")?

@mwcraig (Member Author) commented Jun 24, 2018

Tests are now (mostly) failing as intended; I removed an f-string to fix failures on Python 3.4 and 3.5.

@mwcraig (Member Author) commented Jun 24, 2018

Not sure why coverage went down; the lines marked as not covered are actually covered by the new test. Perhaps it's because the test is run in a subprocess.

@mwcraig (Member Author) commented Jun 24, 2018

@MSeifert04 @crawfordsm -- this is ready for a look.

@MSeifert04 (Contributor) commented Jun 30, 2018

Interesting. Reading through the code was a bit of a revelation... we really have some more problems there. Either way, passing memmap=False leads to trouble: (1) we previously held slices that prevented the originals from being deallocated, or (2) with the new approach we explicitly keep them alive. That makes the "memory saving" completely wrong if one chooses memmap=False (or overrides the astropy FITS memmap config entry).

I like the fact that with the new approach images are read only once (at least if all of them are actually files), even though the time spent reading the files will probably be negligible compared to the combining time - the "tiling" only happens when we expect each "tiled combine" to exceed a few GB. So I don't think it's actually necessary to keep them around. The best idea here would be to read the file, copy the slice, and then completely dispose of the CCDData (which means the memory is freed and the file handle is closed). That way we (probably) keep only one file handle open at any point (in the memory-mapping case) and don't keep the original alive through the slice (in the non-memory-mapping case). The copy is a bit annoying, but we have to assume the copy operation will also be negligible compared to actually combining the images. We already copy the arrays inside the Combiner - if we could do that step in combine instead of Combiner.__init__, we wouldn't have any additional overhead compared to the status quo.
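
Roughly what I mean, as a sketch (read_tile is a hypothetical illustration, not actual ccdproc code; it assumes a unit for files whose headers don't provide one):

from astropy.nddata import CCDData

def read_tile(filename, tile_slice):
    # Open the file (possibly memory-mapped) and copy out just one tile.
    ccd = CCDData.read(filename, unit='adu')
    # The explicit copy gives the tile its own buffer, so nothing keeps
    # the full array -- or its file handle -- alive afterwards.
    tile = CCDData(ccd.data[tile_slice].copy(), unit=ccd.unit)
    del ccd
    return tile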

Long story short: I'm not really sure how we should proceed. We could simply use your approach for now (looks good - except that the tests are not picked up, which should be corrected), because it's a definite improvement, and think about a more general refactor later on. Most of what I've just written is pure speculation (especially the statements containing the word "negligible") and has to be verified before we actually do that.

@mwcraig (Member Author) commented Jul 2, 2018

> except that the tests are not picked up, which should be corrected

How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀

@mwcraig (Member Author) commented Jul 2, 2018

Thanks for the comments, @MSeifert04.

I think that as currently written there are a few different outcomes:

Input list is CCDData objects

No change from prior behavior, but also no real memory savings, since all of the CCDData objects are already in memory. This might not strictly be the case if all of the CCDData objects are actually memory mapped, I suppose.

Input list is filenames and all files are opened memmap=True and all files are memory-mappable (e.g. no unsigned int or other bzero/bscale)

Files are opened only once, memmap references are preserved for later use. This is probably the optimal way to go in this case.

Input list is filenames and one or more files is opened with memmap=False or cannot be memory-mapped

As soon as one of those files is hit, memmap_failed is set to True. After that point each file is opened, which means it is read into memory, and a slice is made. As far as I can tell, keeping the slice around actually keeps the full array alive (code snippet below)... so unless we copy, this case might read everything into memory. I think this would have been true before the PR too...

code snippet:

import numpy as np

foo = np.zeros([10, 100])
foo.size  # This is 1000
bar = foo[3:5, 30:40]
del foo
bar.size   # 20, as expected
bar.base.size   # 1000, so it looks like we still have a reference to foo...
bar.base[7:10, 80:90]  # Yep, we can index this no problem
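
For contrast, forcing a copy detaches the slice from the original buffer, so the big array really can be freed:

baz = np.zeros([10, 100])
qux = baz[3:5, 30:40].copy()
qux.base is None   # True: the copy owns its data
del baz            # now the full 1000-element buffer is actually released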

So we need to handle this case better...

I think that in this case what you proposed (slice the CCDData and force a copy of the data) is necessary to avoid having all of the data read into memory. I might try writing a test that fails in this case (i.e. uses way too much memory) to confirm this is the case, then push a fix.

@MSeifert04 (Contributor) commented Jul 3, 2018

>> except that the tests are not picked up, which should be corrected
>
> How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀

Ah, maybe by avoiding "subprocess" here? Would that be feasible?

@MSeifert04 (Contributor) commented Jul 3, 2018

> Input list is filenames and all files are opened memmap=True and all files are memory-mappable (e.g. no unsigned int or other bzero/bscale)
>
> Files are opened only once, memmap references are preserved for later use. This is probably the optimal way to go in this case.

Not sure it's ideal. You keep a lot of open file handles that could still conflict with the open-files limit, just to avoid opening the same file multiple times. I haven't done any timings, but my feeling is that opening the files multiple times isn't really significant in the runtime of the function.

And in most cases the "tiling" won't happen at all - in these cases having lots of open file handles is unnecessary.

@MSeifert04 (Contributor) commented:

> As far as I can tell, keeping the slice around actually keeps the full array alive (code snippet below)... so unless we copy, this case might read everything into memory.

Yes, that's the problem. We either keep the memmap alive (if it's memmappable) or the full array (in case it's not). Neither is great.

@mwcraig (Member Author) commented Aug 8, 2018

>>> except that the tests are not picked up, which should be corrected
>>
>> How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀
>
> Ah, maybe by avoiding "subprocess" here? Would that be feasible?

No, because the tests lower the limit (in Python) on the number of simultaneously open files, and once the limit is lowered, you cannot raise it.
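
The mechanism is the standard library's resource module (Unix only); the tests lower the limit along these lines:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Lower both the soft and the hard limit for this process...
resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))
# ...after which an unprivileged process cannot raise the hard limit
# again; this would raise ValueError, hence the throwaway subprocess:
# resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))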

@mwcraig (Member Author) commented Aug 8, 2018

Going to run some profiles now to inform the discussion here. IIRC, the bulk of the time is spent reading data from disk (at least in the case that 100 files are being combined)...

@MSeifert04 (Contributor) commented:

> Going to run some profiles now to inform the discussion here. IIRC, the bulk of the time is spent reading data from disk (at least in the case that 100 files are being combined)...

That would be great. Could you share the results? :)

@mwcraig (Member Author) commented Aug 9, 2018

@MSeifert04 -- see #624 (comment) for one graph. I think the profiling/optimization is separate from this file issue, though. I was being thrown off by __getitem__ taking so much time; I assumed that was from io.fits, but it is from numpy.ma.

I'll add more comments on that issue tomorrow, but it looks like the easy speedups are going to come from using the nan functions instead of masked arrays, and maybe from allowing bottleneck.
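
The gist of the nan-function idea, as an illustrative sketch (array shapes and mask threshold are made up, not measurements from this PR):

import numpy as np

data = np.random.random((10, 200, 200))
mask = data > 0.99

# Masked route: the np.ma machinery (including its __getitem__) dominates
masked_mean = np.ma.masked_array(data, mask=mask).mean(axis=0)

# NaN route: usually faster, and bottleneck can accelerate it further
nan_mean = np.nanmean(np.where(mask, np.nan, data), axis=0)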

@mwcraig (Member Author) commented Aug 10, 2018

A number of comments:

  • My original approach here (making a list of memory-mapped files) is terrible. It has the effect of breaking the memory limit that combine is trying to impose: as successive slices are read in, the memory-mapped array is "filled in" with the values read, until each one takes up its full size in memory.
    • For 14 files, 2k x 2k, with a memory limit of 32MB, my keep-the-list-of-memmaps approach had used 450MB by the end.
    • With the latest version the maximum memory use is about 128MB. That is still too high, but I believe it is unrelated to the changes here. Will open a separate issue for it.
  • The amount of time spent (re)opening files and copying in the latest approach is small: there was less than a 1 second difference between the latest approach and the earlier one of keeping a list of the memmapped files. In any event, the time spent copying is tiny (about 1 second) compared to the time spent in sigma_clipping (about 26 seconds) and average_combine (about 6 seconds).

I have one more minor change to push then I think this is ready to merge. Will add a cross-reference to the memory issue shortly.

@mwcraig (Member Author) commented Aug 10, 2018

Incidentally, to do the memory profiling, I installed memory_profiler and ran it in ccdproc/tests with:

mprof run run_with_file_number_limit.py --overhead 6 --kind fits --open-by combine-chunk --size 2000 --frequent-gc 50

@mwcraig (Member Author) commented Aug 10, 2018

Memory use issue is #638

@MSeifert04 (Contributor) left a comment:

I haven't looked through the tests. It looks very good overall, just a few comments that need to be addressed (and some that are more optional).

CHANGES.rst Outdated
@@ -24,6 +24,8 @@ Bug Fixes
- Function ``median_combine`` now correctly calculates the uncertainty for
masked ``CCDData``. [#608]

- Function ``combine`` now avoids opening files unnecessarily. [#629, #630]
@MSeifert04 (Contributor):

The new approach actually doesn't avoid opening the files unnecessarily.

@mwcraig (Member Author):

I think it still does; without the copy, ccd_list held a file reference for each file in the list if the files were mem-mapped. The copy in the update avoids that, so at most one file is open at a time.

@mwcraig (Member Author):

I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....

@MSeifert04 (Contributor):

> I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....

I like that.

@@ -1,7 +1,6 @@
# Licensed under a 3-clause BSD style license - see LICENSE.rst

"""This module implements the combiner class."""

@MSeifert04 (Contributor) commented Aug 11, 2018:

Could be kept. Personally, I like blank lines between module docs and imports. Feel free to ignore this comment.

@mwcraig (Member Author):

Oops, that was unintentional...

@@ -739,7 +738,7 @@ def combine(img_list, output_file=None,
'func': sigma_clip_func,
'dev_func': sigma_clip_dev_func}

-    # Finally Run the input method on all the subsections of the image
+   # Finally Run the input method on all the subsections of the image
@MSeifert04 (Contributor):

That seems wrong. I would've expected this to throw an IndentationError... But I think we should stick to 4 space indentation even for comments.

@mwcraig (Member Author):

Good catch! Haven't figured out why my editor sometimes misses things like this....

@mwcraig (Member Author):

[Turns out my editor did catch it...my eyes missed it]

@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
ccd.mask[x:xend, y:yend] = comb_tile.mask
if ccd.uncertainty is not None:
ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
# Clean up any open files
@MSeifert04 (Contributor):

The comment is now wrong. At this point there should be no open files anymore.

@mwcraig (Member Author):

Updating....

@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
ccd.mask[x:xend, y:yend] = comb_tile.mask
if ccd.uncertainty is not None:
ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
# Clean up any open files
del comb_tile
@MSeifert04 (Contributor):

Not sure if I really like those dels. Are these still necessary for reducing the memory?

@mwcraig (Member Author):

I think so; I can run a profile in a moment to check. The only essential one, I think, is deleting tile_combiner. That will, by design, use roughly the amount of memory the user has set as the cap. If we don't delete it, then on the second pass through the loop we'll (briefly) consume twice that memory as we construct the new combiner on line 761 (or thereabouts, depending on the commit).
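
The loop, schematically (hypothetical names, not the exact PR code; read_tile is the illustrative helper sketched earlier in this thread):

from ccdproc import Combiner

for x, xend, y, yend in tiles:
    # Building the combiner allocates roughly the user's memory cap
    tile_combiner = Combiner([read_tile(f, (slice(x, xend), slice(y, yend)))
                              for f in files])
    comb_tile = tile_combiner.average_combine()
    ccd.data[x:xend, y:yend] = comb_tile.data
    # Without the dels, the next iteration's Combiner would briefly
    # coexist with this one, doubling the peak memory use
    del tile_combiner
    del comb_tile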

@mwcraig (Member Author) commented Aug 12, 2018

Thanks for the comments @MSeifert04 -- I have addressed them in code or in the comments above.

@MSeifert04 (Contributor) left a comment:

Still haven't had time to review the tests, but the actual code changes look good to me 👍

@crawfordsm merged commit f457ac5 into astropy:master on Oct 19, 2018
Successfully merging this pull request may close these issues: too many files open (#629).