
Refactor combiner.combine to minimize the number of files open at one time #630

Merged - 19 commits - Oct 19, 2018

Conversation

@mwcraig (Member) commented Jun 23, 2018

This will fix #629; the initial commit adds two new tests, one of which will fail before the issue is fixed. Once the failures are verified on CI, I'll add a fix.

For bugfixes:

  • Did you add an entry to the "CHANGES.rst" file?
  • Did you add a regression test?
  • Does the commit message include a "Fixes #issue_number" (replace "issue_number")?

@mwcraig (Member Author) commented Jun 24, 2018

Tests are now (mostly) failing as intended; I removed an f-string to fix failures on Python 3.4 and 3.5.

@mwcraig (Member Author) commented Jun 24, 2018

Not sure why coverage went down; the lines marked as not covered are actually covered by the new test. Perhaps it's because the test is run in a subprocess.

@mwcraig (Member Author) commented Jun 24, 2018

@MSeifert04 @crawfordsm -- this is ready for a look.

@MSeifert04 (Contributor) commented Jun 30, 2018

Interesting. Reading through the code was a bit of a revelation... we really have some more problems there. Either way, passing memmap=False leads to trouble: (1) we previously held slices that prevented the originals from being deallocated, or (2) with the new approach we explicitly keep them alive. That makes the "memory saving" completely wrong if one chooses memmap=False (or overrides the astropy FITS memmap config entry).

I like the fact that with the new approach images are read only once (at least if all of them are actually files), even though the time spent reading the files will probably be negligible compared to the combining time - the "tiling" only happens when we expect each "tiled combine" to exceed a few GB. So I don't think it's actually necessary to keep them around. The best idea here would be to read the file, copy the slice, and then completely dispose of the CCDData (which means the memory is freed and the file handle is closed). That way we (probably) keep only one file handle open at any point (in the memory-mapping case) and don't keep the original alive through the slice (in the non-memory-mapping case). The copy is a bit annoying, but we have to assume the copy operation will also be negligible compared to actually combining the images. We already copy the arrays inside the Combiner - if we could do that step in combine instead of Combiner.__init__, we wouldn't have any additional overhead compared to the status quo.
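
Roughly what I mean, as a sketch (read_tile is a hypothetical illustration, not actual ccdproc code; it assumes a unit for files whose headers don't provide one):

from astropy.nddata import CCDData

def read_tile(filename, tile_slice):
    # Open the file (possibly memory-mapped) and copy out just one tile.
    ccd = CCDData.read(filename, unit='adu')
    # The explicit copy gives the tile its own buffer, so nothing keeps
    # the full array -- or its file handle -- alive afterwards.
    tile = CCDData(ccd.data[tile_slice].copy(), unit=ccd.unit)
    del ccd
    return tile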

Long story short: I'm not really sure how we should proceed. We could simply use your approach for now (looks good - except that the tests are not picked up, which should be corrected), because it's a definite improvement, and think about a more general refactor later on. Most of what I've just written is pure speculation (especially the statements containing the word "negligible") and has to be verified before we actually do that.

@mwcraig (Member Author) commented Jul 2, 2018

> except that the tests are not picked up, which should be corrected

How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀

@mwcraig (Member Author) commented Jul 2, 2018

Thanks for the comments, @MSeifert04.

I think that as currently written there are a few different outcomes:

Input list is CCDData objects

No change from prior behavior, but also no real memory savings, since all of the CCDData objects are already in memory. This might not strictly be the case if all of the CCDData objects are actually memory mapped, I suppose.

Input list is filenames and all files are opened memmap=True and all files are memory-mappable (e.g. no unsigned int or other bzero/bscale)

Files are opened only once, memmap references are preserved for later use. This is probably the optimal way to go in this case.

Input list is filenames and one or more files is opened with memmap=False or cannot be memory-mapped

As soon as one of those files is hit, memmap_failed is set to True. After that point each file is opened, which means it is read into memory, and a slice is made. As far as I can tell, keeping the slice around actually keeps the full array alive (code snippet below)... so unless we copy, this case might read everything into memory. I think this would have been true before the PR too...

code snippet:

import numpy as np

foo = np.zeros([10, 100])
foo.size  # This is 1000
bar = foo[3:5, 30:40]
del foo
bar.size   # 20, as expected
bar.base.size   # 1000, so it looks like we still have a reference to foo...
bar.base[7:10, 80:90]  # Yep, we can index this no problem
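
For contrast, forcing a copy detaches the slice from the original buffer, so the big array really can be freed:

baz = np.zeros([10, 100])
qux = baz[3:5, 30:40].copy()
qux.base is None   # True: the copy owns its data
del baz            # now the full 1000-element buffer is actually released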

So we need to handle this case better...

I think that in this case what you proposed (slice the CCDData and force a copy of the data) is necessary to avoid having all of the data read into memory. I might try writing a test that fails in this case (i.e. uses way too much memory) to confirm this is the case, then push a fix.

@MSeifert04 (Contributor) commented Jul 3, 2018

>> except that the tests are not picked up, which should be corrected
>
> How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀

Ah, maybe by avoiding "subprocess" here? Would that be feasible?

@MSeifert04 (Contributor) commented Jul 3, 2018

> Input list is filenames and all files are opened memmap=True and all files are memory-mappable (e.g. no unsigned int or other bzero/bscale)
>
> Files are opened only once, memmap references are preserved for later use. This is probably the optimal way to go in this case.

Not sure it's ideal. You keep a lot of open file handles that could still conflict with the open-files limit, just to avoid opening the same file multiple times. I haven't done any timings, but my feeling is that opening the files multiple times isn't really significant in the runtime of the function.

And in most cases the "tiling" won't happen at all - in these cases having lots of open file handles is unnecessary.

@MSeifert04 (Contributor) commented:

> As far as I can tell, keeping the slice around actually keeps the full array alive (code snippet below)... so unless we copy, this case might read everything into memory.

Yes, that's the problem. We either keep the memmap alive (if it's memmappable) or the full array (in case it's not). Neither is great.

@mwcraig (Member Author) commented Aug 8, 2018

>>> except that the tests are not picked up, which should be corrected
>>
>> How should that be fixed? By which I mean "I don't know how to force the coverage to pick up the tests that are run in a subprocess but would be happy to make that happen if I knew how" 😀
>
> Ah, maybe by avoiding "subprocess" here? Would that be feasible?

No, because the tests lower the limit (in Python) on the number of simultaneously open files, and once the limit is lowered, you cannot raise it.
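
The mechanism is the standard library's resource module (Unix only); the tests lower the limit along these lines:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Lower both the soft and the hard limit for this process...
resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))
# ...after which an unprivileged process cannot raise the hard limit
# again; this would raise ValueError, hence the throwaway subprocess:
# resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))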

@mwcraig (Member Author) commented Aug 8, 2018

Going to run some profiles now to inform the discussion here. IIRC, the bulk of the time is spent reading data from disk (at least in the case that 100 files are being combined)...

@MSeifert04 (Contributor) commented:

> Going to run some profiles now to inform the discussion here. IIRC, the bulk of the time is spent reading data from disk (at least in the case that 100 files are being combined)...

That would be great. Could you share the results? :)

@mwcraig (Member Author) commented Aug 9, 2018

@MSeifert04 -- see #624 (comment) for one graph. I think the profiling/optimization is separate from this file issue, though. I was being thrown off by __getitem__ taking so much time; I assumed that was from io.fits, but it is from numpy.ma.

I'll add more comments on that issue tomorrow, but it looks like the easy speedups are going to come from using the nan functions instead of masked arrays, and maybe from allowing bottleneck.
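
The gist of the nan-function idea, as an illustrative sketch (array shapes and mask threshold are made up, not measurements from this PR):

import numpy as np

data = np.random.random((10, 200, 200))
mask = data > 0.99

# Masked route: the np.ma machinery (including its __getitem__) dominates
masked_mean = np.ma.masked_array(data, mask=mask).mean(axis=0)

# NaN route: usually faster, and bottleneck can accelerate it further
nan_mean = np.nanmean(np.where(mask, np.nan, data), axis=0)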

@mwcraig (Member Author) commented Aug 10, 2018

A number of comments:

  • My original approach here (making a list of memory-mapped files) is terrible. It has the effect of breaking the memory limit that combine is trying to impose: as successive slices are read in, the memory-mapped array is "filled in" with the values read, until each one takes up its full size in memory.
    • For 14 files, 2k x 2k, with a memory limit of 32MB, my keep-the-list-of-memmaps approach had used 450MB by the end.
    • With the latest version the maximum memory use is about 128MB. That is still too high, but I believe it is unrelated to the changes here. Will open a separate issue for it.
  • The amount of time spent (re)opening files and copying in the latest approach is small: there was less than a 1 second difference between the latest approach and the earlier one of keeping a list of the memmapped files. In any event, the time spent copying is tiny (about 1 second) compared to the time spent in sigma_clipping (about 26 seconds) and average_combine (about 6 seconds).

I have one more minor change to push then I think this is ready to merge. Will add a cross-reference to the memory issue shortly.

@mwcraig (Member Author) commented Aug 10, 2018

Incidentally, to do the memory profiling, I installed memory_profiler and ran it in ccdproc/tests with:

mprof run run_with_file_number_limit.py --overhead 6 --kind fits --open-by combine-chunk --size 2000 --frequent-gc 50

@mwcraig (Member Author) commented Aug 10, 2018

Memory use issue is #638

@MSeifert04 (Contributor) left a comment:

I haven't looked through the tests. It looks very good overall, just a few comments that need to be addressed (and some that are more optional).

CHANGES.rst Outdated
@@ -24,6 +24,8 @@ Bug Fixes
- Function ``median_combine`` now correctly calculates the uncertainty for
masked ``CCDData``. [#608]

- Function ``combine`` now avoids opening files unnecessarily. [#629, #630]
@MSeifert04 (Contributor):

The new approach actually doesn't avoid opening the files unnecessarily.

@mwcraig (Member Author):

I think it still does; without the copy, ccd_list held a file reference for each file in the list if the files were mem-mapped. The copy in the update avoids that, so at most one file is open at a time.

@mwcraig (Member Author):

I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....

@MSeifert04 (Contributor):

> I suppose it would be more accurate to say that it doesn't keep files open unnecessarily....

I like that.

@@ -1,7 +1,6 @@
# Licensed under a 3-clause BSD style license - see LICENSE.rst

"""This module implements the combiner class."""

@MSeifert04 (Contributor) commented Aug 11, 2018:

Could be kept. Personally, I like blank lines between module docs and imports. Feel free to ignore this comment.

@mwcraig (Member Author):

Oops, that was unintentional...

@@ -739,7 +738,7 @@ def combine(img_list, output_file=None,
'func': sigma_clip_func,
'dev_func': sigma_clip_dev_func}

-    # Finally Run the input method on all the subsections of the image
+   # Finally Run the input method on all the subsections of the image
@MSeifert04 (Contributor):

That seems wrong. I would've expected this to throw an IndentationError... But I think we should stick to 4 space indentation even for comments.

@mwcraig (Member Author):

Good catch! Haven't figured out why my editor sometimes misses things like this....

@mwcraig (Member Author):

[Turns out my editor did catch it...my eyes missed it]

@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
ccd.mask[x:xend, y:yend] = comb_tile.mask
if ccd.uncertainty is not None:
ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
# Clean up any open files
@MSeifert04 (Contributor):

The comment is now wrong. At this point there should be no open files anymore.

@mwcraig (Member Author):

Updating....

@@ -775,6 +779,10 @@ def combine(img_list, output_file=None,
ccd.mask[x:xend, y:yend] = comb_tile.mask
if ccd.uncertainty is not None:
ccd.uncertainty.array[x:xend, y:yend] = comb_tile.uncertainty.array
# Clean up any open files
del comb_tile
@MSeifert04 (Contributor):

Not sure if I really like those dels. Are these still necessary for reducing the memory?

@mwcraig (Member Author):

I think so; I can run a profile in a moment to check. The only essential one, I think, is deleting tile_combiner. That will, by design, use roughly the amount of memory the user has set as the cap. If we don't delete it, then on the second pass through the loop we'll (briefly) consume twice that memory as we construct the new combiner on line 761 (or thereabouts, depending on the commit).
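
The loop, schematically (hypothetical names, not the exact PR code; read_tile is the illustrative helper sketched earlier in this thread):

from ccdproc import Combiner

for x, xend, y, yend in tiles:
    # Building the combiner allocates roughly the user's memory cap
    tile_combiner = Combiner([read_tile(f, (slice(x, xend), slice(y, yend)))
                              for f in files])
    comb_tile = tile_combiner.average_combine()
    ccd.data[x:xend, y:yend] = comb_tile.data
    # Without the dels, the next iteration's Combiner would briefly
    # coexist with this one, doubling the peak memory use
    del tile_combiner
    del comb_tile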

@mwcraig (Member Author) commented Aug 12, 2018

Thanks for the comments @MSeifert04 -- I have addressed them in code or in the comments above.

@MSeifert04 (Contributor) left a comment:

Still haven't had time to review the tests, but the actual code changes look good to me 👍

@crawfordsm merged commit f457ac5 into astropy:master on Oct 19, 2018
Successfully merging this pull request may close these issues: too many files open (#629).