Multi-threading or -processing doesn't work for simple sklearn Pipeline #70
Comments
Could you post a full example? With

    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    import dask_ml.model_selection

    param_grid = {
        'svc__C': [0.1, 1, 10],
        'svc__gamma': [0.01, 0.1, 0.5],
    }
    X, y = make_classification(n_samples=10_000)
    pipe = make_pipeline(
        StandardScaler(),
        SVC()
    )
    gs = dask_ml.model_selection.GridSearchCV(pipe, param_grid, scheduler='multiprocessing')

With
I observe parallelism. Compare with
|
I tried the same experiment with the code below:
I got: and with:
I got: I guess this works, but not with my own code or data. So I tried to adapt this example to my case; here is what I got with my code and data:
Executing this code, I can see that with |
I think I identified the source of the error by taking your first example and running it with leave-one-out cross-validation in the grid search loop:
LOOCV seems to cause this effect. |
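As a rough illustration of how much leave-one-out changes the workload: the number of fits in a grid search is the number of parameter combinations times the number of CV splits, and leave-one-out uses one split per sample. The sketch below only counts fits in plain Python for the 3 x 3 grid and the 10,000 samples from the first example. The helper name `count_fits` is made up for illustration, the 3-fold baseline is an assumption for comparison, and none of this explains the single-process behaviour itself.

```python
from itertools import product

# Same grid as in the first example of this thread.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 0.5],
}


def count_fits(param_grid, n_splits):
    """Return (n_candidates, total_fits) for a grid and a number of CV splits.

    Hypothetical helper, not part of dask-searchcv or scikit-learn.
    """
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_splits


n_samples = 10_000  # as in make_classification(n_samples=10_000)

# Assumed baseline: 3-fold cross-validation.
n_candidates, fits_3fold = count_fits(param_grid, n_splits=3)

# Leave-one-out: one split per sample.
_, fits_loo = count_fits(param_grid, n_splits=n_samples)

print(n_candidates, fits_3fold, fits_loo)
```

So the same grid goes from a few dozen fits to tens of thousands of very small ones, which may be worth keeping in mind when comparing CPU usage between the two runs.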
Another thing causing this effect: using numpy logspace to define the C and gamma ranges
Only one process runs, at up to 100% CPU. If you simplify param_grid to plain list values, as you suggest, it works! |
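Following the observation that plain list values worked, one way to keep a log-spaced grid without putting NumPy arrays in param_grid is to build the values as an ordinary Python list. This is only a sketch of that workaround: the exponent ranges are made-up examples (the original ranges are not shown in this thread), and `log_grid` is a hypothetical helper. With NumPy, calling `.tolist()` on the result of `np.logspace(...)` should give an equivalent list of Python floats.

```python
def log_grid(start_exp, stop_exp, base=10.0):
    """Return base**e for each integer exponent e in [start_exp, stop_exp].

    Hypothetical helper: the result is a plain list of Python floats,
    not a NumPy array.
    """
    return [float(base) ** e for e in range(start_exp, stop_exp + 1)]


# Example ranges only; adjust to the ranges you actually want to search.
param_grid = {
    "svc__C": log_grid(-1, 1),
    "svc__gamma": log_grid(-3, 0),
}

print(param_grid)
```

Whether this avoids the problem on a given combination of versions is something to check by watching CPU usage again, as above.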
That's awfully strange. If you're comfortable trying to debug this, let me know. Otherwise, I'm busy through the end of this week and most of next week on other projects. Will look afterwards. |
I am a novice in Python programming, so unfortunately I am not comfortable debugging this. I'll leave it to you to look at whenever you can, or to anyone else who feels comfortable with it. |
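It seems to be due to an incompatibility between the latest versions of distributed (1.21.4 or 1.21.3) and the latest version of sklearn (0.19.1). When downgrading to distributed 1.20.2, it seems to work. |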
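Since the behaviour reported in this thread seems to depend on package versions (distributed 1.21.3 or 1.21.4 versus 1.20.2, with sklearn 0.19.1), it can help to include the installed versions when reporting. A minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+) is below. The helper name `installed_version` is made up, and the list of distributions is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent.

    Hypothetical helper built on importlib.metadata.version.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (sklearn is "scikit-learn").
for name in ("distributed", "dask", "scikit-learn", "dask-searchcv"):
    print(name, installed_version(name))
```

Packages that are not installed in the current environment are printed as None rather than raising an error.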
Strange. I'm not sure why that would be, but haven't had any time to investigate myself.
On Thu, Mar 22, 2018 at 4:08 AM, mattvan83 wrote:
It seems to be due to incompatibility of last version of distributed 1.21.4 or 1.21.3 with last version of sklearn 0.19.1. When downgrading to distributed 1.20.2, it seems to work.
|
I wonder if this is linked to the problem I've got.
|
Also, I found I couldn't revert to an older distributed, as it causes an error (you'd probably also have to revert dask etc.). |
I think this is the same as dask/dask-ml#249. Tracking it there. |
This was fixed over in dask/dask-ml#260. It'll be included in the next release of |
Hello,
I am having trouble using this nice tool, dask-searchcv, on a simple Pipeline.
I tried it on a simple sklearn Pipeline (StandardScaler + SVC with an RBF kernel) with n_jobs=-1 and scheduler="threading" or scheduler="multiprocessing", searching a grid over the C and gamma parameters. In every case, only one process was used (out of my 16 available).
By contrast, when I used dask-searchcv on a larger Pipeline that also includes PCA, I got, as expected, either one process at 1600% CPU or 16 processes launched.
I don't understand why dask-searchcv multi-threading or multi-processing doesn't work in the first case.
Any explanation?
Matt