Use dask_ml GridSearchCV on HPC cluster client #3463
Thanks for the report. I'd need to double check, but I don't think you need to use both. Do you have a minimal example that produces the warning about scattering large objects? I notice that you're using
Thank you for your quick reaction.
You can download a minimal example here (130 MB). Also, the issue occurs even if I set the number of workers by hand. I updated the code snippet to fit the provided data-set.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports gives some recommendations on writing minimal bug reports. Ideally, it would be a small snippet with extraneous details removed. Does the data need to be read from sqlite to reproduce the issue, or can you use in-memory dataframes?
Even if you wait for the workers to arrive with
Thank you for sharing these recommendations. Here is the minimal code to reproduce the issue:

```python
# imports added for completeness; per the discussion below, the
# warning appears with dask_ml's GridSearchCV
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from dask_ml.model_selection import GridSearchCV
from dask_jobqueue import PBSCluster
from distributed import Client

columns = [str(col) for col in range(40)]
cv_folds = 5
cv_parameters = {'n_estimators': [50, 100, 150]}

df = pd.DataFrame(index=range(600000), columns=columns)
df = df.fillna(0)
df[columns[0]] = np.random.randint(1, 6, df.shape[0])
df[columns[1]] = np.random.randint(1, 6, df.shape[0])

labels_values = df[columns[0]]
groups = df[columns[1]]
features_values = df[columns[2:-1]]

splitter = list(
    GroupKFold(n_splits=cv_folds).split(features_values, labels_values,
                                        groups))

clf = RandomForestClassifier()
clf = GridSearchCV(clf, cv_parameters, cv=splitter, return_train_score=True)

cluster = PBSCluster(cores=12, memory="60GB")
cluster.scale(5)
client = Client(cluster)
client.wait_for_workers(5)

with joblib.parallel_backend('dask', scatter=[features_values, labels_values]):
    clf.fit(features_values, labels_values)
```

Indeed, thanks to
Good, so we're just dealing with the warning about scattering large bits of data. I don't have access to an HPC cluster. Are you able to swap out the PBSCluster for a
This is starting to sound like dask/dask-ml#516. One problem with pre-scattering is that the large pieces of data in the graph are generated by dask-ml itself, not the user. So we need to make sure that the scattered pieces are just keys to slices of the data, rather than concrete ndarrays.
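The keys-versus-ndarrays distinction can be sketched as follows. This is a minimal illustration, not dask-ml's actual internals; the threaded LocalCluster and the toy array sizes are assumptions for the example:

```python
import numpy as np
from distributed import Client, LocalCluster

if __name__ == "__main__":
    # threaded cluster just for illustration; a real setup would use processes
    with LocalCluster(n_workers=2, processes=False) as cluster, \
            Client(cluster) as client:
        big = np.zeros((10_000, 40))      # stand-in for the feature matrix
        big_future = client.scatter(big)  # one upload; tasks only hold a key

        # each task embeds only the small key, not the full array,
        # so the task graph stays lightweight
        futures = [client.submit(lambda a, i=i: float(a[i].sum()), big_future)
                   for i in range(3)]
        print(client.gather(futures))     # -> [0.0, 0.0, 0.0]
```

If a concrete ndarray were embedded in every task instead, the scheduler would receive many copies of the data, which is exactly what triggers the "large object" warning.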
By replacing the PBSCluster with a LocalCluster, the following error occurs:

```
  File ".../lib/python3.6/site-packages/distributed/process.py", line 202, in _start
    process.start()
  File ".../lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File ".../lib/python3.6/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File ".../lib/python3.6/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File ".../lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File ".../lib/python3.6/multiprocessing/popen_forkserver.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File ".../lib/python3.6/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File ".../lib/python3.6/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
```

and the program is still stuck.
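For reference, the idiom this RuntimeError asks for is to guard cluster creation behind `if __name__ == "__main__":`, since the spawn/forkserver start methods re-import the main module in every child process. A minimal sketch (the worker count and the trivial task are arbitrary):

```python
from distributed import Client, LocalCluster

def main():
    # cluster/client creation must only run in the parent process
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)
    print(client.submit(sum, [1, 2, 3]).result())   # -> 6
    client.close()
    cluster.close()

if __name__ == "__main__":   # skipped when child processes re-import this module
    main()
```

Without the guard, each spawned worker re-executes the module top level, tries to start its own cluster, and raises the bootstrapping error above.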
See #516 (comment)
Also
https://stackoverflow.com/questions/60232708/dask-fails-with-freeze-support-bug/60232709#60232709
Thank you, @mrocklin. The issue can be reproduced with LocalCluster(processes=False, ...)
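For context on why that report is surprising: a threaded LocalCluster never touches the multiprocessing start-method machinery, so a spawn-related failure there would be unexpected. A minimal sketch (worker count and task are arbitrary):

```python
from distributed import Client, LocalCluster

# processes=False runs workers as threads in the current process,
# so no child processes are spawned at all
cluster = LocalCluster(processes=False, n_workers=2)
client = Client(cluster)
print(client.submit(lambda x: x + 1, 41).result())   # -> 42
client.close()
cluster.close()
```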
That's very surprising. I encourage you to reduce that down to a minimal reproducer and share it here if you have the time.
@ArthurVINCENT have you been able to look at this anymore? I'm not seeing any issues with the following:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from distributed import LocalCluster, Client

if __name__ == "__main__":
    columns = [str(col) for col in range(40)]
    cv_folds = 5
    cv_parameters = {'n_estimators': [50, 100, 150]}
    df = pd.DataFrame(index=range(60000), columns=columns)
    df = df.fillna(0)
    df[columns[0]] = np.random.randint(1, 6, df.shape[0])
    df[columns[1]] = np.random.randint(1, 6, df.shape[0])
    labels_values = df[columns[0]]
    groups = df[columns[1]]
    features_values = df[columns[2:-1]]
    splitter = list(
        GroupKFold(n_splits=cv_folds).split(features_values, labels_values,
                                            groups))
    clf = RandomForestClassifier()
    clf = GridSearchCV(clf, cv_parameters, cv=splitter, return_train_score=True)
    cluster = LocalCluster()
    client = Client(cluster)
    with joblib.parallel_backend('dask', scatter=[features_values, labels_values]):
        clf.fit(features_values, labels_values)
```
Actually, the issue/warning appears if you replace the import `from sklearn.model_selection import GridSearchCV` with `from dask_ml.model_selection import GridSearchCV`. Is there any reason to use sklearn instead of dask_ml? I will run this on my huge data-set and keep you informed if everything works.
That's documented in https://ml.dask.org/hyper-parameter-search.html#hyperparameter-drop-in. |
It seems to work as expected on my large dataset. So, I will use the GridSearchCV coming from sklearn instead of dask_ml. Thanks.
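The working combination described in this thread — scikit-learn's GridSearchCV driven through the joblib dask backend — can be condensed as follows. This is a sketch with a toy random dataset; the sizes and parameter grid are arbitrary:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn's, not dask_ml's
from distributed import Client, LocalCluster

if __name__ == "__main__":
    X = np.random.rand(200, 4)
    y = np.random.randint(0, 2, 200)
    search = GridSearchCV(RandomForestClassifier(n_estimators=10),
                          {'max_depth': [2, 3]}, cv=3)
    with LocalCluster(processes=False) as cluster, Client(cluster):
        # the dask joblib backend ships each cross-validation fit
        # to the cluster's workers
        with joblib.parallel_backend('dask'):
            search.fit(X, y)
    print(search.best_params_)
```

Here the estimator and search logic stay pure scikit-learn; only the parallelism is delegated to dask through joblib.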
Hi,
I'm working on a large data-set and I am trying to find my model's hyperparameters with GridSearchCV from dask_ml, as presented in the dask_ml tutorial.
Here is my Python code:
Then dask asked me to use client.scatter to deploy the data on the workers, as follows:
But if I use the backend as in the tutorial (with scatter):
then no workers can be found:
Any suggestions are welcome.
Note: if I use a smaller dataset, everything works.