Problem with Rabit on a Dask + google-gke architecture #62

Closed
Mestalbet opened this issue Aug 5, 2018 · 1 comment

Comments

@Mestalbet

Hi,

On a simple Dask XGBoost run I get the following error. The sample code looks like:

import dask.dataframe as dd
from dask_ml.xgboost import XGBRegressor
est = XGBRegressor(...)
x = dd.read_csv('somedata.csv')
y = x.y
del x['y']
est.fit(x, y)

And the error is as follows:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-d50d84593355> in <module>()
      5 y = x.y
      6 del x['y']
----> 7 est.fit(x, y)

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in fit(self, X, y)
    239         xgb_options = self.get_xgb_params()
    240         self._Booster = train(client, xgb_options, X, y,
--> 241                               num_boost_round=self.n_estimators)
    242         return self
    243 

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    167     """
    168     return sync(client.loop, _train, client, params, data,
--> 169                 labels, dmatrix_kwargs, **kwargs)
    170 
    171 

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    273             e.wait(10)
    274     if error[0]:
--> 275         six.reraise(*error[0])
    276     else:
    277         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    258             yield gen.moment
    259             thread_state.asynchronous = True
--> 260             result[0] = yield make_coro()
    261         except Exception as exc:
    262             error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
    122     env = yield client._run_on_scheduler(start_tracker,
    123                                          host.strip('/:'),
--> 124                                          len(worker_map))
    125 
    126     # Tell each worker to train on the chunks/parts that it has locally

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1111                             exc_info = None
   1112                     else:
-> 1113                         yielded = self.gen.send(value)
   1114 
   1115                     if stack_context._state.contexts is not orig_stack_contexts:

/opt/conda/lib/python3.6/site-packages/distributed/client.py in _run_on_scheduler(self, function, *args, **kwargs)
   1911                                                      kwargs=dumps(kwargs))
   1912         if response['status'] == 'error':
-> 1913             six.reraise(*clean_exception(**response))
   1914         else:
   1915             raise gen.Return(response['result'])

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

/opt/conda/lib/python3.6/site-packages/dask_xgboost/core.py in start_tracker()
     30     """ Start Rabit tracker """
     31     env = {'DMLC_NUM_WORKER': n_workers}
---> 32     rabit = RabitTracker(hostIP=host, nslave=n_workers)
     33     env.update(rabit.slave_envs())
     34 

/opt/conda/lib/python3.6/site-packages/dask_xgboost/tracker.py in __init__()
    166         for port in range(port, port_end):
    167             try:
--> 168                 sock.bind((hostIP, port))
    169                 self.port = port
    170                 break

OSError: [Errno 99] Cannot assign requested address

Also the result of:

import socket
socket.error is OSError

is True.
Any help will be greatly appreciated.

Thanks.

@hcho3
Contributor

hcho3 commented Aug 5, 2018

Edit line 168 of /opt/conda/lib/python3.6/site-packages/dask_xgboost/tracker.py to read:

print('hostIP = {}, port = {}'.format(hostIP, port))
sock.bind((hostIP, port))

Save the edit and re-run the program. Can you post the diagnostic output?
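
If editing the installed package is inconvenient, the same check can be sketched as a standalone snippet. Errno 99 (EADDRNOTAVAIL on Linux) typically means the address passed to bind() is not assigned to any interface on the machine doing the binding. The helper below, try_bind, is a hypothetical name and not part of dask_xgboost; it only attempts the same sock.bind call the tracker makes and reports the outcome. It assumes an IPv4 address and uses port 0 so the OS picks any free port.

```python
import errno
import socket


def try_bind(host_ip, port=0):
    """Try to bind a TCP socket to (host_ip, port); return (ok, message).

    Hypothetical diagnostic helper, not part of dask_xgboost.
    Port 0 asks the OS to choose any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host_ip, port))
        return True, "bound to {}:{}".format(*sock.getsockname())
    except OSError as exc:
        # Translate the numeric errno into its symbolic name where known.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "errno {} ({}): {}".format(exc.errno, name, exc.strerror)
    finally:
        sock.close()


if __name__ == "__main__":
    # Loopback should normally be bindable on any machine.
    print(try_bind("127.0.0.1"))
```

Since the traceback shows start_tracker running on the scheduler, the check is most informative when executed there, for example with client.run_on_scheduler(try_bind, host), where host is whatever address the scheduler reports (an assumption about your setup). If it fails with errno 99, the scheduler address that the client sees is probably not one the scheduler pod can bind locally.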
