
In dials.integrate track down and prevent (ideally) cause of exit code = -9 #659

Open
graeme-winter opened this issue Nov 23, 2018 · 32 comments

Comments

@graeme-winter
Contributor

Believed to be due to memory allocation in the second step of integration, i.e. once the reference profiles are constructed and it moves on to the actual integration, it falls over. This is probably the most widely reported real bug.

@graeme-winter
Contributor Author

There are multiple possible solutions for this - using threaded integration in preference to multiprocessing is one; better prediction of memory usage would be a second. It really needs a test case that triggers the problem, however.
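For the second idea, a rough estimate can be made from the predicted bounding boxes alone, since the shoebox arrays dominate the memory cost. A minimal sketch in plain Python - not the DIALS code, and the bytes-per-pixel figure is an assumption covering data, background and mask arrays:

BYTES_PER_PIXEL = 4 + 4 + 4  # assumed: float data + float background + int mask per pixel

def estimate_shoebox_memory(bboxes):
    """Estimate shoebox memory in bytes from (x0, x1, y0, y1, z0, z1) bounding boxes."""
    total_pixels = sum(
        (x1 - x0) * (y1 - y0) * (z1 - z0) for x0, x1, y0, y1, z0, z1 in bboxes
    )
    return total_pixels * BYTES_PER_PIXEL

# e.g. ~6000 reflections in a block, each the size of the ~121 x 123 x 42 average
# shoebox reported further down this thread, comes out around 45 GB
print(estimate_shoebox_memory([(0, 121, 0, 123, 0, 42)] * 6000) / 1e9, "GB")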

@jmp1985 jmp1985 self-assigned this Jan 14, 2019
@dagewa
Member

dagewa commented Jul 25, 2019

Possible solution 1 is in the works #833

@dagewa
Member

dagewa commented Mar 12, 2020

This is still one of the most commonly-reported bugs, and in my eyes the most serious.

We struggle to process "normal" data on a "normal" computer. For example, I have a dataset from a CCP4 workshop that is nothing particularly special: P3 cell with a=84.5 Å, c=107.0 Å, P6M detector, 360° at 0.1°/image and a moderately large mosaicity of 0.5°. I can't process this beyond 2.0 Å on my laptop with 16 GB RAM, even using just a single process! Ronan ran it on one of the CCP4 servers to 1.18 Å with nproc=1 and noted it required 31 GB of memory and took 7.5 hours to complete.

XDS has no such problem. We are keeping all integrated data in RAM for no good reason. Once a spot is integrated, it is integrated. Dump it somewhere - or at least dump its shoebox.
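By way of illustration only - this is not the DIALS implementation, and it assumes the reflection table behaves like a dict of columns whose shoebox column pickles cleanly - the "dump it somewhere" idea could be as simple as spilling each block's shoeboxes to a temporary file once they have been integrated:

import pickle
import tempfile

def spill_shoeboxes(reflections, block_id):
    """Write the shoebox column to a temporary file and drop it from memory."""
    fh = tempfile.NamedTemporaryFile(
        suffix=f"_block{block_id}_sbox.pickle", delete=False
    )
    with fh:
        pickle.dump(reflections["shoebox"], fh)  # assumes the column pickles cleanly
    del reflections["shoebox"]  # free the in-memory copy once integrated
    return fh.name

def restore_shoeboxes(reflections, path):
    """Read a previously spilled shoebox column back into the table."""
    with open(path, "rb") as fh:
        reflections["shoebox"] = pickle.load(fh)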

@graeme-winter
Contributor Author

graeme-winter commented Mar 12, 2020 via email

@dagewa
Member

dagewa commented Mar 12, 2020

I don't have any such pointers - I'm just coming at it from a user perspective at the moment. I tried to process data and it failed. I did notice it failed during integration though, so after profile modelling. It integrated a couple of hundred images then died. I no longer have the logs for that run here, though.

@graeme-winter
Contributor Author

graeme-winter commented Mar 12, 2020 via email

@dagewa
Member

dagewa commented Mar 12, 2020

I can't do that for this case. I think a 3600-image fine-sliced dataset of thaumatin would do, though. We can set the mosaicity to 0.5° manually.

@graeme-winter
Contributor Author

Reproduced:

/dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5

# disable virtual memory
ulimit -v 0

dials.import ../insulin-vanilla/ins_10_1_master.h5
dials.find_spots imported.expt nproc=8
dials.index imported.expt strong.refl
dials.refine indexed.expt indexed.refl
# proper values are 0.03 / 0.11
dials.integrate refined.expt refined.refl nproc=8 sigma_m=0.5 sigma_b=0.1

...

Traceback (most recent call last):
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/integrate.py", line 665, in <module>
    script.run()
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/integrate.py", line 353, in run
    reflections = integrator.integrate()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/integrator.py", line 1192, in integrate
    self.reflections, _, time_info = processor.process()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 259, in process
    self.manager.initialize()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 601, in initialize
    self.compute_processors()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 826, in compute_processors
    max_memory / 1e9,
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: 
        Not enough memory to run integration jobs.  This could be caused by a
        highly mosaic crystal model.  Possible solutions include increasing the
        percentage of memory allowed for shoeboxes or decreasing the block size.
        The average shoebox size is 121 x 123 pixels x 42 images - is your crystal
        really this mosaic?
            Total system memory: 17.2 GB
            Shoebox memory limit: 12.9 GB
            Required shoebox memory: 44.4 GB

@graeme-winter
Contributor Author

This was on my MacBook Pro with 16 GB RAM.

@dagewa
Member

dagewa commented Mar 13, 2020

Thanks, ideal test case 👍

@dagewa
Member

dagewa commented Mar 13, 2020

Setting a 12 GB virtual memory limit on Linux, it bombed during integration even with nproc=1:

ulimit -v 12000000
dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1

finally failing here:

 1795 [3882]: *******************************************************
 1796 [3800]: *****************************************************
 1797 [3684]: ****************************************************
 1798 [3584]: **************************************************
 1799 [3487]: *************************************************
 1800 [3394]: ************************************************

Traceback (most recent call last):
  File "/home/fcx32934/sw/cctbx/build/../modules/dials/command_line/integrate.py", line 665, in <module>
    script.run()
  File "/home/fcx32934/sw/cctbx/build/../modules/dials/command_line/integrate.py", line 353, in run
    reflections = integrator.integrate()
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/integrator.py", line 1192, in integrate
    self.reflections, _, time_info = processor.process()
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/processor.py", line 305, in process
    self.manager.accumulate(task())
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/processor.py", line 491, in __call__
    sbox_memory / 1e9,
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: 
        There was a problem allocating memory for shoeboxes.  This could be caused
        by a highly mosaic crystal model.  Possible solutions include increasing the
        percentage of memory allowed for shoeboxes or decreasing the block size.
        The average shoebox size is 122 x 124 pixels x 44 images - is your crystal
        really this mosaic?
        Total system memory: 16.7 GB
        Shoebox memory limit: 12.5 GB
        Required shoebox memory: 51.1 GB

@graeme-winter
Contributor Author

Disabling the checks in the debugging-integrate-memory-use branch to see where this actually fails, rather than computing an estimate and refusing to run.

@graeme-winter
Contributor Author

Oh, not very helpful:

/dls_sw/cluster/GE/UGE8.6.7/TEST/spool/cs04r-sc-com99-09/job_scripts/1178132: line 9: 865167 Segmentation fault      dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1

No useful stack trace 🤔 - the last console output:

 1800 [3394]: ************************************************

 Memory usage:
  Total system memory: 134.845 GB
  Limit shoebox memory: 101.134 GB
  Required shoebox memory: 51.0777 GB

@dagewa
Member

dagewa commented Mar 15, 2020

What are the options for inspecting memory use at a breakpoint in processing? I suspect it's shoeboxes that tip the balance, but it would be good to know for sure.
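For Python-level inspection at a breakpoint, the standard library goes a fair way on its own. A sketch - note that tracemalloc only sees allocations made through Python, so memory held by the C++ shoebox arrays will mostly show up in maxrss rather than in the snapshot:

import resource
import tracemalloc

tracemalloc.start()  # call once, early in the run

def memory_report(top_n=5):
    """Print the process maxrss plus the largest Python-level allocation sites."""
    maxrss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
    print(f"maxrss: {maxrss_kb / 1e6:.2f} GB")
    for stat in tracemalloc.take_snapshot().statistics("lineno")[:top_n]:
        print(stat)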

@ndevenish
Member

A memory profiler might help - valgrind? VTune also has a memory profiler that looks promising; I started running things before I left for leave but didn't get to dig in.

@graeme-winter
Contributor Author

Another option could be https://docs.python.org/3/library/faulthandler.html - just taking a look at this now, may try hacking it in on my branch above.
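(For reference, the same handler can be switched on programmatically rather than via the environment variable - a minimal sketch:)

import faulthandler
import sys

# Enable early, so a segfault inside C++ extensions still produces a
# Python-level traceback on stderr, as in the output below.
faulthandler.enable(file=sys.stderr, all_threads=True)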

@graeme-winter
Contributor Author

graeme-winter commented Mar 16, 2020

# proper values are 0.03 / 0.11
PYTHONFAULTHANDLER=1 \
dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1
Fatal Python error: Segmentation fault

Current thread 0x00007f999f926740 (most recent call first):
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/processor.py", line 520 in __call__
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/processor.py", line 305 in process
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/integrator.py", line 1192 in integrate
  File "/dls/science/users/gw56/git/build/../modules/dials/command_line/integrate.py", line 353 in run
  File "/dls/science/users/gw56/git/build/../modules/dials/command_line/integrate.py", line 665 in <module>

Useful 🙂

@dagewa
Member

dagewa commented Mar 16, 2020

Would guppy help, or is it crippled by boost.python?

@graeme-winter
Contributor Author

Digging a bit here, I think the solution will be to pass the option not to save the shoeboxes down the stack in the processor, so that we don't allocate and fill them outside the minimum scope where they are needed. Looking more at this now.

@dagewa
Member

dagewa commented Mar 16, 2020

Thanks for looking at it. I was wondering about moving the shoeboxes out of the reflection table and into a map structure, leaving only keys into the map in the reflection table. Shoeboxes could then be consumed and deleted after use. But I wasn't 100% sure it is the shoeboxes at fault (though, say, >95%).
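A plain-Python sketch of that idea - illustrative only, not a proposed DIALS API: the reflection table would hold only integer keys, and each shoebox is popped from the side container when consumed, so nothing keeps it alive afterwards.

class ShoeboxStore:
    """Hypothetical side container: shoeboxes live here, not in the reflection table."""

    def __init__(self):
        self._store = {}
        self._next_key = 0

    def add(self, shoebox):
        key = self._next_key
        self._next_key += 1
        self._store[key] = shoebox
        return key  # only this integer goes into the reflection table

    def consume(self, key):
        # pop() drops the last remaining reference, so the shoebox can be freed
        return self._store.pop(key)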

Anthchirp added a commit that referenced this issue Mar 16, 2020
suggested in #659 specifically for debugging, but we should enable this
whenever possible.
@graeme-winter
Contributor Author

[plot: maximum resident set size (maxrss) over the course of processing]

Looking at the maximum resident set size, memory climbs fairly slowly then jumps as we go into the actual integration phase - this is, I think, where things usually go bad. This was "sensible" processing on a small-molecule example so probably not representative - waiting on the smoking gun above to finish running...
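(A trace like the one above needs nothing more than the standard library - a sketch, not necessarily how this particular plot was made:)

import resource
import threading
import time

def sample_maxrss(outfile="maxrss.log", interval=5.0):
    """Append (elapsed seconds, maxrss in kB on Linux) to a log every few seconds."""
    t0 = time.time()

    def _loop():
        with open(outfile, "w") as fh:
            while True:
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                fh.write(f"{time.time() - t0:.1f} {rss}\n")
                fh.flush()
                time.sleep(interval)

    threading.Thread(target=_loop, daemon=True).start()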

@graeme-winter
Contributor Author

OK, troublesome task above (with slightly silly mosaic parameters) has finished and illustrates the issue well -

[plot: memory usage for the troublesome run, up to the point of failure]

From the output -

 Memory usage:
  Total system memory: 134.845 GB
  Limit shoebox memory: 101.134 GB
  Required shoebox memory: 51.0777 GB

This shoebox memory estimate is sound - and that is the problem...

@graeme-winter
Contributor Author

Coming up for some air... it looks like the shoeboxes are more or less deallocated at the C++ level, yet there are still references to reflections["shoebox"] lurking around - sys.getrefcount() says 2, which effectively means 1 once the call's own temporary reference is discounted, so there are some odd things happening.
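A quick illustration of that off-by-one:

import sys

column = ["shoebox-like object"]  # exactly one reference: the name 'column'
print(sys.getrefcount(column))    # prints 2 - our reference plus the call's own argument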

@graeme-winter
Contributor Author

General synopsis at 13:15 hrs: it looks like the shoeboxes are being freed in the C++ code - if we are not writing them out - and deleting the "shoebox" column from the reflection table makes very little difference, as that column holds just six numbers or so and a couple of pointers per reflection.

@graeme-winter
Contributor Author

Though... the memory estimate above seems spot on, which makes me go 🤔

I checked the reference counts and I think the shoeboxes are being freed... before being unlinked from the shoebox structure, the data member had a use_count() of one.

@graeme-winter
Contributor Author

[plot: memory usage during integration of the small-molecule test set]

Testing with a small-molecule set with deliberately large profile model parameters: the memory does not go up constantly, so something is being freed somewhere. I am coming around to the idea that what is causing us problems is the steep rise in memory use between frames 0 and 50 at the start of the actual integration. I am not sure we can avoid this whilst keeping the pixel data in memory.

However: this is not a matter of us doing something dumb, this is a real problem.

@graeme-winter
Contributor Author

Looking at another thread - are our processing / profile parameters sensible? Compare what XDS reports with what DIALS reports, considering that in theory they share the same underlying model -

For reference - XDS says

 ***** SUGGESTED VALUES FOR INPUT PARAMETERS *****
 BEAM_DIVERGENCE=   0.36785  BEAM_DIVERGENCE_E.S.D.=   0.03679
 REFLECTING_RANGE=  0.53704  REFLECTING_RANGE_E.S.D.=  0.07672

DIALS says

Using 10678 / 10679 reflections for sigma calculation
Calculating E.S.D Beam Divergence.
Calculating E.S.D Reflecting Range (mosaicity).
 sigma b: 0.052859 degrees
 sigma m: 0.106186 degrees

Perform integration with both - manually setting the profile parameters to "simulate" the XDS job - and write a little jiffy script to get the differences for matched reflections. If our profile parameters were too large, we would simply see added noise and the average intensity would be the same. If the XDS parameters are too small, the integrated intensity should be systematically lower.

Script:

from dials.array_family import flex
from scitbx.math import five_number_summary


def filter_tukey(data):
    """Remove outliers using Tukey's fences (1.5 x IQR beyond the quartiles)."""
    outliers = flex.bool(len(data), False)
    min_x, q1_x, med_x, q3_x, max_x = five_number_summary(data)
    iqr_x = q3_x - q1_x
    cut_x = 1.5 * iqr_x
    outliers.set_selected(data > q3_x + cut_x, True)
    outliers.set_selected(data < q1_x - cut_x, True)
    return data.select(~outliers)


def compare(a, b):
    """Effectively compute b - a for profile-fitted intensities of matched reflections."""
    data_a = flex.reflection_table.from_file(a)
    data_a = data_a.select(data_a.get_flags(data_a.flags.integrated_prf))

    data_b = flex.reflection_table.from_file(b)
    data_b = data_b.select(data_b.get_flags(data_b.flags.integrated_prf))

    matches = data_a.match(data_b)

    i_a = data_a["intensity.prf.value"].select(matches[0])
    i_b = data_b["intensity.prf.value"].select(matches[1])

    b_a = flex.double(sorted(i_b - i_a))

    # chuck outliers - use Tukey's fences
    b_a = filter_tukey(b_a)

    hist = flex.histogram(b_a)
    for c, v in zip(hist.slot_centers(), hist.slots()):
        print(c, v)


if __name__ == "__main__":
    import sys

    compare(sys.argv[1], sys.argv[2])
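Run with the two integrated reflection files to compare (the script name and file names here are just placeholders):

dials.python compare_prf.py integrated_dials_params.refl integrated_xds_params.refl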

Result:

[histogram: distribution of intensity differences between the two runs]

Yes, there is a systematic difference in favour of the DIALS profile parameters being about right (at least, using the smaller profile parameters makes for a systematic reduction in intensities, which I interpret as missing the tails of the peaks).

@graeme-winter
Contributor Author

Shoeboxes are useful because:

[screenshot illustrating why the shoeboxes are worth keeping]

@rjgildea rjgildea reopened this Apr 20, 2020
@Anthchirp
Member

Just for the record: In DIALS 2.1.3 (which is the version included in CCP4 7.1) a related issue shows up as

Modelling reflection profiles

Traceback (most recent call last):
  File "/usr/local/dials-v2-1-3/build/../modules/dials/command_line/integrate.py", line 669, in <module>
    script.run()
  File "/usr/local/dials-v2-1-3/base/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/dials-v2-1-3/modules/dials/util/__init__.py", line 124, in show_mail_on_error
    e.args = (text + " " + str(e.args[0]),)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd7' in position 281: ordinal not in range(128)

and this should be fixed in DIALS 2.2.3 via #1221

@dagewa dagewa added this to To do in Bugs Apr 29, 2020
jbeilstenedmands added a commit that referenced this issue Sep 4, 2020
#659

In the integrator, calculate the maximum memory needed. If this exceeds the
available memory, then split the reflection table into random subsets and
process by performing multiple passes over the imagesets.

Applies only to regular integrators (not threaded) and will only take effect
in situations where processing currently exits.
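In rough pseudo-Python the splitting amounts to something like the sketch below - this is an illustration of the approach described in the commit message, not the code in the commit itself, and it assumes the dials-style select() taking a flex.size_t index array:

import math
import random

from dials.array_family import flex

def split_for_memory(reflections, required_bytes, available_bytes):
    """Yield random subsets of the table small enough to process one pass at a time."""
    n_passes = max(1, math.ceil(required_bytes / available_bytes))
    indices = list(range(len(reflections)))
    random.shuffle(indices)
    chunk = math.ceil(len(indices) / n_passes)
    for i in range(n_passes):
        subset = sorted(indices[i * chunk:(i + 1) * chunk])
        yield reflections.select(flex.size_t(subset))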
@stale

stale bot commented Oct 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. The label will be removed automatically if any activity occurs. Thank you for your contributions.

@stale stale bot added the stale issues that have not seen activity for a while label Oct 25, 2020
@graeme-winter
Contributor Author

100% not at all stale - still an issue 😞

@stale stale bot removed the stale issues that have not seen activity for a while label Dec 5, 2022