ms4alg hangs after lots of thread-related log output #24

Closed
czawora opened this issue Jun 18, 2018 · 3 comments

Comments

@czawora

czawora commented Jun 18, 2018

My run of the sorting algorithm produces the following output and seems to hang during the re-assigning phase (attaching output file). It will stay at this spot for over 15 minutes, even while running with 60 CPUs and 100GB of RAM (processing a 40GB file). Is it normal for the re-assigning phase to take that long?

Could it be related to all the OpenBLAS outputs I get? How should I interpret the OpenBLAS outputs:

OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 6190469 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable

_sort_hang.log

@alexmorley
Collaborator

Can you post the output of
echo $OPENBLAS_NUM_THREADS ?

If this is >2, I suggest adding the following to your .bashrc (or at least running it before you sort):

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1

This will prevent the linear algebra libraries from trying to do multi-threading on their own and instead leave that up to the language you are calling them from.
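If editing .bashrc is not convenient (for example in a batch job script), the same limits can be set from Python. This is only a minimal sketch: the helper name `limit_blas_threads` is made up for illustration and is not part of ms4alg. As far as I know, these variables are read when the BLAS library is loaded, so the call needs to happen before numpy is imported.

```python
import os

# Environment variables named in the suggestion above.
_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(n=1):
    """Cap library-level threading (hypothetical helper, not part of ms4alg).

    Call this before importing numpy, since the BLAS backend is assumed
    to read these variables only once, when it is first loaded.
    """
    for var in _THREAD_VARS:
        os.environ[var] = str(n)
    # Return what was set so the caller can log or verify it.
    return {var: os.environ[var] for var in _THREAD_VARS}


settings = limit_blas_threads(1)
print(settings)
```

Child processes started afterwards (such as multiprocessing workers) inherit the environment, so each worker should stay single-threaded at the BLAS level.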

@czawora
Author

czawora commented Jun 22, 2018

Sweet! That worked, though it revealed a new error. I am attaching the output below.

Clustering for channel 80 (phase1)...
Found 0 clusters for channel 80 (phase1)...
Computing templates for channel 80 (phase1)...

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 513, in run_phase1_sort
    neighborhood_sorter.runPhase1Sort()
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 345, in runPhase1Sort
    self.runSort(mode='phase1')
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 395, in runSort
    templates=compute_templates_from_timeseries_model(X,times,labels,nbhd_channels=nbhd_channels,clip_size=clip_size,chunk_infos=chunk_infos)
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 269, in compute_templates_from_timeseries_model
    K=np.max(labels)
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2320, in amax
    out=out, **kwargs)
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
"""

The above exception was the direct cause of the following exception:


Traceback (most recent call last):
  File "/data/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg_spec.py", line 11, in <module>
    if not PM.run(sys.argv):
  File "/data/zaworaca/anaconda3/lib/python3.6/site-packages/mltools/processormanager/processormanager_impl.py", line 37, in run
    return P(**args)
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/p_ms4alg.py", line 72, in sort
    MS4.sort()
  File "/gpfs/gsfs7/users/zaworaca/mountainlab-js/packages/ml_ms4alg/ms4alg.py", line 591, in sort
    pool.map(run_phase1_sort, neighborhood_sorters)
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/data/zaworaca/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: zero-size array to reduction operation maximum which has no identity

[ Removing temporary directory ... ]
Process returned with non-zero exit code.

This error only appears when I sort an hour-long session (96 channels X 107919471 time points). Sorting only a small subset of that session (96 channels X 18000000 time points) runs successfully with your fix above.

_sort_array_err.log

@alexmorley
Collaborator

OK. This one looks like an actual bug: an error is raised when no clusters are found on a particular channel. Could you open a new issue with the above content? Thanks!
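For reference, the traceback points at `K=np.max(labels)`, which raises when `labels` is empty (the "Found 0 clusters" case). One possible guard is sketched below. This is an assumption about how it could be handled, not the fix that was actually applied, and `count_clusters` is a hypothetical name. It uses the built-in `max` so the sketch runs without numpy, and it should behave the same for a 1-D numpy array.

```python
def count_clusters(labels):
    """Return the largest cluster label, or 0 if nothing was labeled.

    Hypothetical helper: avoids calling a max reduction on an empty
    sequence, which is what raised the ValueError in the traceback.
    """
    if len(labels) == 0:
        return 0
    return int(max(labels))


print(count_clusters([]))         # channel with no clusters
print(count_clusters([1, 3, 2]))  # channel with three clusters
```

The code that follows the `np.max` call would also need to cope with `K == 0`, for example by returning an empty set of templates for that channel.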

alexmorley added the "bug" label Jun 25, 2018
alexmorley removed the "bug" label Jun 25, 2018