
In dials.integrate track down and prevent (ideally) cause of exit code = -9 #659

Open
graeme-winter opened this issue Nov 23, 2018 · 32 comments

Comments

@graeme-winter
Contributor

Believed to be due to memory allocation in the second step of integration, i.e. once the reference profiles are constructed and it moves on to the actual integration, it falls over. This is probably the most widely reported real bug.

@graeme-winter
Contributor Author

There are multiple possible solutions for this - using threaded integration in preference to multiprocessing is one; better prediction of memory usage would be a second. It really needs a test case that triggers the problem, however.
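For the second idea, a rough estimate can be made from the predicted bounding boxes alone, since the shoebox arrays dominate the memory cost. A minimal sketch in plain Python - not the DIALS code, and the bytes-per-pixel figure is an assumption covering data, background and mask arrays:

BYTES_PER_PIXEL = 4 + 4 + 4  # assumed: float data + float background + int mask per pixel

def estimate_shoebox_memory(bboxes):
    """Estimate shoebox memory in bytes from (x0, x1, y0, y1, z0, z1) bounding boxes."""
    total_pixels = sum(
        (x1 - x0) * (y1 - y0) * (z1 - z0) for x0, x1, y0, y1, z0, z1 in bboxes
    )
    return total_pixels * BYTES_PER_PIXEL

# e.g. ~6000 reflections in a block, each the size of the ~121 x 123 x 42 average
# shoebox reported further down this thread, comes out around 45 GB
print(estimate_shoebox_memory([(0, 121, 0, 123, 0, 42)] * 6000) / 1e9, "GB")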

@jmp1985 jmp1985 self-assigned this Jan 14, 2019
@dagewa
Member

dagewa commented Jul 25, 2019

Possible solution 1 is in the works #833

@dagewa
Member

dagewa commented Mar 12, 2020

This is still one of the most commonly-reported bugs, and in my eyes the most serious.

We struggle to process "normal" data on a "normal" computer. For example, I have a dataset from a CCP4 workshop that is nothing particularly special: P3 cell with a=84.5 Å, c=107.0 Å, P6M detector, 360° at 0.1°/image and a moderately large mosaicity of 0.5°. I can't process this beyond 2.0 Å on my laptop with 16 GB RAM, even using just a single process! Ronan ran it on one of the CCP4 servers to 1.18 Å with nproc=1 and noted it required 31 GB of memory and took 7.5 hours to complete.

XDS has no such problem. We are keeping all integrated data in RAM for no good reason. Once a spot is integrated, it is integrated. Dump it somewhere - or at least dump its shoebox.
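By way of illustration only - this is not the DIALS implementation, and it assumes the reflection table behaves like a dict of columns whose shoebox column pickles cleanly - the "dump it somewhere" idea could be as simple as spilling each block's shoeboxes to a temporary file once they have been integrated:

import pickle
import tempfile

def spill_shoeboxes(reflections, block_id):
    """Write the shoebox column to a temporary file and drop it from memory."""
    fh = tempfile.NamedTemporaryFile(
        suffix=f"_block{block_id}_sbox.pickle", delete=False
    )
    with fh:
        pickle.dump(reflections["shoebox"], fh)  # assumes the column pickles cleanly
    del reflections["shoebox"]  # free the in-memory copy once integrated
    return fh.name

def restore_shoeboxes(reflections, path):
    """Read a previously spilled shoebox column back into the table."""
    with open(path, "rb") as fh:
        reflections["shoebox"] = pickle.load(fh)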

@graeme-winter
Contributor Author

graeme-winter commented Mar 12, 2020 via email

@dagewa
Member

dagewa commented Mar 12, 2020

I don't have any such pointers - I'm just coming at it from a user perspective at the moment. I tried to process data and it failed. I did notice it failed during integration though, so after profile modelling. It integrated a couple of hundred images then died. I no longer have the logs for that run here, though.

@graeme-winter
Contributor Author

graeme-winter commented Mar 12, 2020 via email

@dagewa
Member

dagewa commented Mar 12, 2020

I can't do that for this case. I think a 3600-image fine-sliced dataset of thaumatin would do, though. We can set the mosaicity to 0.5° manually.

@graeme-winter
Contributor Author

Reproduced:

/dls/i04/data/2020/cm26459-1/20200220/TestInsulin/ins_10/ins_10_1_master.h5

# disable virtual memory
ulimit -v 0

dials.import ../insulin-vanilla/ins_10_1_master.h5
dials.find_spots imported.expt nproc=8
dials.index imported.expt strong.refl
dials.refine indexed.expt indexed.refl
# proper values are 0.03 / 0.11
dials.integrate refined.expt refined.refl nproc=8 sigma_m=0.5 sigma_b=0.1

...

Traceback (most recent call last):
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/integrate.py", line 665, in <module>
    script.run()
  File "/Users/graeme/svn/cctbx/build/../modules/dials/command_line/integrate.py", line 353, in run
    reflections = integrator.integrate()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/integrator.py", line 1192, in integrate
    self.reflections, _, time_info = processor.process()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 259, in process
    self.manager.initialize()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 601, in initialize
    self.compute_processors()
  File "/Users/graeme/svn/cctbx/modules/dials/algorithms/integration/processor.py", line 826, in compute_processors
    max_memory / 1e9,
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: 
        Not enough memory to run integration jobs.  This could be caused by a
        highly mosaic crystal model.  Possible solutions include increasing the
        percentage of memory allowed for shoeboxes or decreasing the block size.
        The average shoebox size is 121 x 123 pixels x 42 images - is your crystal
        really this mosaic?
            Total system memory: 17.2 GB
            Shoebox memory limit: 12.9 GB
            Required shoebox memory: 44.4 GB

@graeme-winter
Contributor Author

This was on my MacBook Pro with 16 GB RAM.

@dagewa
Member

dagewa commented Mar 13, 2020

Thanks, ideal test case 👍

@dagewa
Member

dagewa commented Mar 13, 2020

Setting a 12 GB virtual memory limit on Linux, it bombed during integration even with nproc=1:

ulimit -v 12000000
dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1

finally failing here:

 1795 [3882]: *******************************************************
 1796 [3800]: *****************************************************
 1797 [3684]: ****************************************************
 1798 [3584]: **************************************************
 1799 [3487]: *************************************************
 1800 [3394]: ************************************************

Traceback (most recent call last):
  File "/home/fcx32934/sw/cctbx/build/../modules/dials/command_line/integrate.py", line 665, in <module>
    script.run()
  File "/home/fcx32934/sw/cctbx/build/../modules/dials/command_line/integrate.py", line 353, in run
    reflections = integrator.integrate()
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/integrator.py", line 1192, in integrate
    self.reflections, _, time_info = processor.process()
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/processor.py", line 305, in process
    self.manager.accumulate(task())
  File "/home/fcx32934/sw/cctbx/modules/dials/algorithms/integration/processor.py", line 491, in __call__
    sbox_memory / 1e9,
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: 
        There was a problem allocating memory for shoeboxes.  This could be caused
        by a highly mosaic crystal model.  Possible solutions include increasing the
        percentage of memory allowed for shoeboxes or decreasing the block size.
        The average shoebox size is 122 x 124 pixels x 44 images - is your crystal
        really this mosaic?
        Total system memory: 16.7 GB
        Shoebox memory limit: 12.5 GB
        Required shoebox memory: 51.1 GB

@graeme-winter
Contributor Author

Disabling the checks in the debugging-integrate-memory-use branch to see where this actually fails, rather than computing an estimate and refusing to run.

@graeme-winter
Contributor Author

Oh, not very helpful:

/dls_sw/cluster/GE/UGE8.6.7/TEST/spool/cs04r-sc-com99-09/job_scripts/1178132: line 9: 865167 Segmentation fault      dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1

No useful stack trace 🤔 - the last console output:

 1800 [3394]: ************************************************

 Memory usage:
  Total system memory: 134.845 GB
  Limit shoebox memory: 101.134 GB
  Required shoebox memory: 51.0777 GB

@dagewa
Member

dagewa commented Mar 15, 2020

What are the options for inspecting memory use at a breakpoint in processing? I suspect it's shoeboxes that tip the balance, but it would be good to know for sure.
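For Python-level inspection at a breakpoint, the standard library goes a fair way on its own. A sketch - note that tracemalloc only sees allocations made through Python, so memory held by the C++ shoebox arrays will mostly show up in maxrss rather than in the snapshot:

import resource
import tracemalloc

tracemalloc.start()  # call once, early in the run

def memory_report(top_n=5):
    """Print the process maxrss plus the largest Python-level allocation sites."""
    maxrss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
    print(f"maxrss: {maxrss_kb / 1e6:.2f} GB")
    for stat in tracemalloc.take_snapshot().statistics("lineno")[:top_n]:
        print(stat)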

@ndevenish
Member

A memory profiler might help - valgrind? VTune also has a memory profiler that looks promising; I started running things before I left for leave but didn't get to dig in.

@graeme-winter
Contributor Author

Another option could be https://docs.python.org/3/library/faulthandler.html - just taking a look at this now, may try hacking it in on my branch above.
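(For reference, the same handler can be switched on programmatically rather than via the environment variable - a minimal sketch:)

import faulthandler
import sys

# Enable early, so a segfault inside C++ extensions still produces a
# Python-level traceback on stderr, as in the output below.
faulthandler.enable(file=sys.stderr, all_threads=True)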

@graeme-winter
Contributor Author

graeme-winter commented Mar 16, 2020

# proper values are 0.03 / 0.11
PYTHONFAULTHANDLER=1 \
dials.integrate refined.expt refined.refl sigma_m=0.5 sigma_b=0.1
Fatal Python error: Segmentation fault

Current thread 0x00007f999f926740 (most recent call first):
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/processor.py", line 520 in __call__
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/processor.py", line 305 in process
  File "/dls/science/users/gw56/git/modules/dials/algorithms/integration/integrator.py", line 1192 in integrate
  File "/dls/science/users/gw56/git/build/../modules/dials/command_line/integrate.py", line 353 in run
  File "/dls/science/users/gw56/git/build/../modules/dials/command_line/integrate.py", line 665 in <module>

Useful 🙂

@dagewa
Member

dagewa commented Mar 16, 2020

Would guppy help, or is it crippled by boost.python?

@graeme-winter
Contributor Author

Digging a bit here, I think the solution will be to pass the option not to save the shoeboxes down the stack in the processor, so that we don't allocate and fill them outside the minimum scope where they are needed. Looking more at this now.

@dagewa
Member

dagewa commented Mar 16, 2020

Thanks for looking at it. I was wondering about moving the shoeboxes out of the reflection table and into a map structure, leaving only keys into the map in the reflection table. Shoeboxes could then be consumed and deleted after use. But I wasn't 100% sure it is the shoeboxes at fault (though, say, >95%).
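A plain-Python sketch of that idea - illustrative only, not a proposed DIALS API: the reflection table would hold only integer keys, and each shoebox is popped from the side container when consumed, so nothing keeps it alive afterwards.

class ShoeboxStore:
    """Hypothetical side container: shoeboxes live here, not in the reflection table."""

    def __init__(self):
        self._store = {}
        self._next_key = 0

    def add(self, shoebox):
        key = self._next_key
        self._next_key += 1
        self._store[key] = shoebox
        return key  # only this integer goes into the reflection table

    def consume(self, key):
        # pop() drops the last remaining reference, so the shoebox can be freed
        return self._store.pop(key)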

Anthchirp added a commit that referenced this issue Mar 16, 2020
suggested in #659 specifically for debugging, but we should enable this
whenever possible.
@graeme-winter
Contributor Author

[plot: maximum resident set size (maxrss) over the course of processing]

Looking at the maximum resident set size, memory climbs fairly slowly then jumps as we go into the actual integration phase - this is, I think, where things usually go bad. This was "sensible" processing on a small-molecule example so probably not representative - waiting on the smoking gun above to finish running...
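(A trace like the one above needs nothing more than the standard library - a sketch, not necessarily how this particular plot was made:)

import resource
import threading
import time

def sample_maxrss(outfile="maxrss.log", interval=5.0):
    """Append (elapsed seconds, maxrss in kB on Linux) to a log every few seconds."""
    t0 = time.time()

    def _loop():
        with open(outfile, "w") as fh:
            while True:
                rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                fh.write(f"{time.time() - t0:.1f} {rss}\n")
                fh.flush()
                time.sleep(interval)

    threading.Thread(target=_loop, daemon=True).start()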

@graeme-winter
Contributor Author

OK, troublesome task above (with slightly silly mosaic parameters) has finished and illustrates the issue well -

[plot: memory usage for the troublesome run, up to the point of failure]

From the output -

 Memory usage:
  Total system memory: 134.845 GB
  Limit shoebox memory: 101.134 GB
  Required shoebox memory: 51.0777 GB

This shoebox memory estimate is sound - and that is the problem...

@graeme-winter
Contributor Author

Coming up for some air... it looks like the shoeboxes are more or less deallocated at the C++ level, yet there are still references to reflections["shoebox"] lurking around - sys.getrefcount() says 2, which effectively means 1 once the call's own temporary reference is discounted, so there are some odd things happening.
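A quick illustration of that off-by-one:

import sys

column = ["shoebox-like object"]  # exactly one reference: the name 'column'
print(sys.getrefcount(column))    # prints 2 - our reference plus the call's own argument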

@graeme-winter
Contributor Author

General synopsis at 13:15 hrs: it looks like the shoeboxes are being freed in the C++ code - if we are not writing them out - and deleting the "shoebox" column from the reflection table makes very little difference, as that column holds just six numbers or so and a couple of pointers per reflection.

@graeme-winter
Contributor Author

Though... the memory estimate above seems spot on, which makes me go 🤔

I checked the reference counts and I think the shoeboxes are being freed... before being unlinked from the shoebox structure, the data member had a use_count() of one.

@graeme-winter
Contributor Author

[plot: memory usage during integration of the small-molecule test set]

Testing with a small-molecule set with deliberately large profile model parameters: the memory does not go up constantly, so something is being freed somewhere. I am coming around to the idea that what is causing us problems is the steep rise in memory use between frames 0 and 50 at the start of the actual integration. I am not sure we can avoid this whilst keeping the pixel data in memory.

However: this is not a matter of us doing something dumb, this is a real problem.

@graeme-winter
Contributor Author

Looking at another thread - are our processing / profile parameters sensible? Compare what XDS reports with what DIALS reports, considering that in theory they share the same underlying model -

For reference - XDS says

 ***** SUGGESTED VALUES FOR INPUT PARAMETERS *****
 BEAM_DIVERGENCE=   0.36785  BEAM_DIVERGENCE_E.S.D.=   0.03679
 REFLECTING_RANGE=  0.53704  REFLECTING_RANGE_E.S.D.=  0.07672

DIALS says

Using 10678 / 10679 reflections for sigma calculation
Calculating E.S.D Beam Divergence.
Calculating E.S.D Reflecting Range (mosaicity).
 sigma b: 0.052859 degrees
 sigma m: 0.106186 degrees

Perform integration with both - manually setting the profile parameters to "simulate" the XDS job - and write a little jiffy script to get the differences for matched reflections. If our profile parameters were too large, we would simply see added noise and the average intensity would be the same. If the XDS parameters are too small, the integrated intensity should be systematically lower.

Script:

from dials.array_family import flex
from scitbx.math import five_number_summary


def filter_tukey(data):
    """Remove outliers using Tukey's fences (1.5 x IQR beyond the quartiles)."""
    outliers = flex.bool(len(data), False)
    min_x, q1_x, med_x, q3_x, max_x = five_number_summary(data)
    iqr_x = q3_x - q1_x
    cut_x = 1.5 * iqr_x
    outliers.set_selected(data > q3_x + cut_x, True)
    outliers.set_selected(data < q1_x - cut_x, True)
    return data.select(~outliers)


def compare(a, b):
    """Effectively compute b - a for profile-fitted intensities of matched reflections."""
    data_a = flex.reflection_table.from_file(a)
    data_a = data_a.select(data_a.get_flags(data_a.flags.integrated_prf))

    data_b = flex.reflection_table.from_file(b)
    data_b = data_b.select(data_b.get_flags(data_b.flags.integrated_prf))

    matches = data_a.match(data_b)

    i_a = data_a["intensity.prf.value"].select(matches[0])
    i_b = data_b["intensity.prf.value"].select(matches[1])

    b_a = flex.double(sorted(i_b - i_a))

    # chuck outliers - use Tukey's fences
    b_a = filter_tukey(b_a)

    hist = flex.histogram(b_a)
    for c, v in zip(hist.slot_centers(), hist.slots()):
        print(c, v)


if __name__ == "__main__":
    import sys

    compare(sys.argv[1], sys.argv[2])
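Run with the two integrated reflection files to compare (the script name and file names here are just placeholders):

dials.python compare_prf.py integrated_dials_params.refl integrated_xds_params.refl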

Result:

[histogram: distribution of intensity differences between the two runs]

Yes, there is a systematic difference in favour of the DIALS profile parameters being about right (at least, using the smaller profile parameters makes for a systematic reduction in intensities, which I interpret as missing the tails of the peaks).

@graeme-winter
Contributor Author

Shoeboxes are useful because:

[screenshot illustrating why the shoeboxes are worth keeping]

@rjgildea rjgildea reopened this Apr 20, 2020
@Anthchirp
Member

Just for the record: In DIALS 2.1.3 (which is the version included in CCP4 7.1) a related issue shows up as

Modelling reflection profiles

Traceback (most recent call last):
  File "/usr/local/dials-v2-1-3/build/../modules/dials/command_line/integrate.py", line 669, in <module>
    script.run()
  File "/usr/local/dials-v2-1-3/base/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/dials-v2-1-3/modules/dials/util/__init__.py", line 124, in show_mail_on_error
    e.args = (text + " " + str(e.args[0]),)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd7' in position 281: ordinal not in range(128)

and this should be fixed in DIALS 2.2.3 via #1221

@dagewa dagewa added this to To do in Bugs Apr 29, 2020
jbeilstenedmands added a commit that referenced this issue Sep 4, 2020
#659

In the integrator, calculate the maximum memory needed. If this exceeds the
available memory, then split the reflection table into random subsets and
process by performing multiple passes over the imagesets.

Applies only to regular integrators (not threaded) and will only take effect
in situations where processing currently exits.
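In rough pseudo-Python the splitting amounts to something like the sketch below - this is an illustration of the approach described in the commit message, not the code in the commit itself, and it assumes the dials-style select() taking a flex.size_t index array:

import math
import random

from dials.array_family import flex

def split_for_memory(reflections, required_bytes, available_bytes):
    """Yield random subsets of the table small enough to process one pass at a time."""
    n_passes = max(1, math.ceil(required_bytes / available_bytes))
    indices = list(range(len(reflections)))
    random.shuffle(indices)
    chunk = math.ceil(len(indices) / n_passes)
    for i in range(n_passes):
        subset = sorted(indices[i * chunk:(i + 1) * chunk])
        yield reflections.select(flex.size_t(subset))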
@stale

stale bot commented Oct 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. The label will be removed automatically if any activity occurs. Thank you for your contributions.

@stale stale bot added the stale issues that have not seen activity for a while label Oct 25, 2020
@graeme-winter
Contributor Author

100% not at all stale - still an issue 😞

@stale stale bot removed the stale issues that have not seen activity for a while label Dec 5, 2022