PCA+pipeline+GridSearchCV error #629
Thanks for the copy-pastable example. I'm not able to reproduce locally. Can you reproduce it in a fresh environment?
Hi Tom, thanks for the quick response! I am working within a Singularity container (without write permissions) on an HPC. Here is a link to the image I am using: Dockerfile. Any suggestions on how I might be able to further debug (I struggle when trying to debug segmentation faults)? I will try to reproduce in a fresh environment, but thought I would pass along the image in case you would like to try it.
Thanks. I'm not sure why there would be a segfault, but it likely isn't from Dask. We're just coordinating calls to scikit-learn here. You might watch the dashboard and see if anything strange happens before the worker dies (perhaps suddenly high memory usage and the job scheduler kills the worker? Some HPC systems don't let you easily spill to disk).
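For reference, a minimal sketch (not from the thread) of how one might find the dashboard mentioned above when using a LocalCluster; the worker counts mirror the reproducer later in this issue:

```python
# Sketch (assumption): grab the dashboard URL from the Client so worker
# memory and task activity can be watched while the grid search runs.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=5, threads_per_worker=2)
client = Client(cluster)
print(client.dashboard_link)  # open this URL in a browser before calling .fit()
```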
Hi @TomAugspurger, I can confirm that I am able to run the example successfully in a local conda environment. However, I am still having issues running the example in the Singularity image (DockerHub Image). I get the same errors when I try:
This is probably out of scope for dask-ml, but I thought I should post my update on the issue. If you have any further ideas/directions on how to debug, that would be great; otherwise, feel free to close and I can try with the Singularity project.
Thanks for the update. I'm not especially sure where to go next for debugging... You might try with different schedulers:

import dask
dask.config.set(scheduler="single-threaded")

If 1 passes, that tells us there's (maybe) some issue with communication / coordination between processes. If 1 fails but 2 passes, that tells us there's an issue with using this scikit-learn code from multiple threads.
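For illustration, a hedged sketch of what comparing schedulers can look like; the small SVC grid here is a stand-in rather than this issue's reproducer, and which configuration corresponds to "1" and "2" above is an assumption, not spelled out in the comment:

```python
# Sketch (assumption): run the same dask-ml grid search under the default
# threaded scheduler and then under the single-threaded scheduler.
import dask
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=3)

with dask.config.set(scheduler="threads"):           # local threads (default)
    search.fit(X, y)

with dask.config.set(scheduler="single-threaded"):   # one task at a time
    search.fit(X, y)
```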
Thanks @TomAugspurger. 1 works, but 2 fails (see below). I guess that suggests there is an issue with the communication / coordination between processes. It seems odd that SelectKBest works but PCA does not... I am running this in an HPC environment (via SLURM and JupyterHub) within a Singularity container. When launching the container, I am bind mounting the following.
Threaded Scheduler:

from dask.distributed import Client, LocalCluster
from dask_ml.model_selection import GridSearchCV as dask_GS
from sklearn.model_selection import GridSearchCV as sk_GS
from sklearn.datasets import make_multilabel_classification
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np
import dask
#dask.config.set(scheduler="single-threaded")
#cluster = LocalCluster(n_workers=5,threads_per_worker=2)
#client = Client(cluster)
X, Y = make_multilabel_classification(n_classes=4, n_labels=1,n_features=271,
n_samples=1200,
random_state=1)
Y = Y.sum(axis=1)
N_FEATURES_OPTIONS_pca = np.arange(10)[1::3].tolist()
N_FEATURES_OPTIONS_sel = np.arange(10)[1::3].tolist()
Cs = [1,10,100.]
gammas = [.001,0.01]
pca = PCA(iterated_power='auto')
selection = SelectKBest(f_classif)
svc = svm.SVC()
pipe1 = Pipeline([
('reduce_dim', selection),
('classify', svc)])
pipe2 = Pipeline([
('reduce_dim', pca),
('classify', svc)])
param_grid1 = [{'reduce_dim__k': N_FEATURES_OPTIONS_sel,
'classify__C': Cs,
'classify__gamma': gammas}]
param_grid2 = [{'reduce_dim__n_components': N_FEATURES_OPTIONS_pca,
'classify__C': Cs,
'classify__gamma': gammas}]
#Sklearn Gridsearch with PCA pipeline
sk_clf = sk_GS(pipe2, param_grid2,cv=3,scoring='f1_macro',refit=True)
sk_clf.fit(X,Y)
print(sk_clf.best_score_)
#Dask Gridsearch with SelectKbest
dask_clf1 = dask_GS(pipe1, param_grid1,cv=3,scoring='f1_macro',refit=True)
dask_clf1.fit(X,Y)
print(dask_clf1.best_score_)
#Dask Gridsearch with PCA
dask_clf2 = dask_GS(pipe2, param_grid2,cv=3,scoring='f1_macro',refit=True)
dask_clf2.fit(X,Y)
print(dask_clf2.best_score_)
However, when I run it with the single-threaded scheduler, it fails.

Single Threaded Scheduler:

from dask.distributed import Client, LocalCluster
from dask_ml.model_selection import GridSearchCV as dask_GS
from sklearn.model_selection import GridSearchCV as sk_GS
from sklearn.datasets import make_multilabel_classification
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np
import dask
dask.config.set(scheduler="single-threaded")
cluster = LocalCluster(n_workers=5,threads_per_worker=2)
client = Client(cluster)
X, Y = make_multilabel_classification(n_classes=4, n_labels=1,n_features=271,
n_samples=1200,
random_state=1)
Y = Y.sum(axis=1)
N_FEATURES_OPTIONS_pca = np.arange(10)[1::3].tolist()
N_FEATURES_OPTIONS_sel = np.arange(10)[1::3].tolist()
Cs = [1,10,100.]
gammas = [.001,0.01]
pca = PCA(iterated_power='auto')
selection = SelectKBest(f_classif)
svc = svm.SVC()
pipe1 = Pipeline([
('reduce_dim', selection),
('classify', svc)])
pipe2 = Pipeline([
('reduce_dim', pca),
('classify', svc)])
param_grid1 = [{'reduce_dim__k': N_FEATURES_OPTIONS_sel,
'classify__C': Cs,
'classify__gamma': gammas}]
param_grid2 = [{'reduce_dim__n_components': N_FEATURES_OPTIONS_pca,
'classify__C': Cs,
'classify__gamma': gammas}]
#Sklearn Gridsearch with PCA pipeline
sk_clf = sk_GS(pipe2, param_grid2,cv=3,scoring='f1_macro',refit=True)
sk_clf.fit(X,Y)
print(sk_clf.best_score_)
#Dask Gridsearch with SelectKbest
dask_clf1 = dask_GS(pipe1, param_grid1,scheduler=client,cv=3,scoring='f1_macro',refit=True)
dask_clf1.fit(X,Y)
print(dask_clf1.best_score_)
#Dask Gridsearch with PCA
dask_clf2 = dask_GS(pipe2, param_grid2,scheduler=client,cv=3,scoring='f1_macro',refit=True)
dask_clf2.fit(X,Y)
print(dask_clf2.best_score_)
Shot in the dark: can you try disabling spill to disk? https://jobqueue.dask.org/en/latest/configuration-setup.html#no-local-storage
Hmm, still seeing the same issue when I avoid spilling to disk by:

dask.config.set({'distributed.worker.memory.target': False,
                 'distributed.worker.memory.spill': False})
cluster = LocalCluster(n_workers=5, threads_per_worker=2)
client = Client(cluster)

dask.config.config results in
I did see the following in one of the worker logs: cluster.logs() results in
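As a side note, a small sketch (an assumption, not from the thread) of pulling logs through the Client rather than the cluster object, which can be convenient when the cluster runs remotely:

```python
# Sketch (assumption): fetch recent scheduler and per-worker log records
# through a Client connected to a LocalCluster like the one above.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=5, threads_per_worker=2)
client = Client(cluster)

print(client.get_scheduler_logs())   # recent scheduler log records
print(client.get_worker_logs())      # {worker address: log records}
```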
Hmm I'm not sure what to try next :/
OK, thanks for your help @TomAugspurger. I'll close this, as I think it may be an issue with the Singularity container or related to the HPC system.
Noting this over here from https://github.com/sylabs/singularity/issues/5259 so there's a pointer in case others come across it here. It looks like you are binding the entire
This will almost certainly cause issues, including segfaults, unless the container OS exactly matches the host, because the executables in the container expect to use libraries from the container... not the ones from the host, which will be a different version / built differently. Also, when you run Singularity containers with Python apps, Python packages installed in your
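A small sketch (an assumption, not part of the comment above) of how one might check, from inside the container, where Python, the user site-packages directory, and a few key packages actually resolve to, to spot host libraries or user-installed packages leaking in:

```python
# Sketch (assumption): print where the interpreter, the user site directory,
# and a few scientific packages are actually being imported from.
import site
import sys

import numpy
import scipy
import sklearn

print("python executable:", sys.executable)
print("user site enabled:", site.ENABLE_USER_SITE, "->", site.getusersitepackages())
for mod in (numpy, scipy, sklearn):
    print(f"{mod.__name__} {mod.__version__} imported from {mod.__file__}")
```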
I was able to do some more debugging that might help with diagnosing the issue. Using gdb to examine the segfaults (via the core dump files), I can get the following backtraces. @TomAugspurger, @dctrud, and @ynanyam, any idea whether this is an issue within Singularity (i.e. issues with shared libraries between host and container) or a Python library issue? Thanks! with
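One extra debugging angle (a sketch under assumptions, not something suggested in the thread): enabling faulthandler on every worker prints a Python-level traceback when a worker segfaults, which can then be lined up against the native gdb backtrace:

```python
# Sketch (assumption): turn on faulthandler in each worker process so a
# Python traceback is written on SIGSEGV, alongside any core dump.
import faulthandler

from dask.distributed import Client, LocalCluster


def enable_faulthandler():
    faulthandler.enable()
    return True


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=5, threads_per_worker=2)
    client = Client(cluster)
    client.run(enable_faulthandler)  # runs the function on every worker
```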
Thanks for the additional debugging, but it unfortunately doesn't give me any new guesses :/
There seems to be an issue with the sklearn PCA + Pipeline and dask_ml GridSearchCV. Please see my example below. Apologies if I am totally missing something.
Relevant Versions:
Minimal Example:
The following code shows that
This results in several core dump files and the following error:
Results In:
Dask Distributed worker / scheduler logs
Results of gdb on core dump file: