
Joblib, parallel_backend, and performance #114

Closed
metasyn opened this issue Jan 10, 2018 · 5 comments

metasyn commented Jan 10, 2018

hi y'all - Matthew suggested I ask some questions here.

I'm a little confused by some results I'm seeing, and I'm wondering if you can help me figure out why I seem to be executing grid searches serially.

I modified dask-docker and am using containers on a docker swarm with 64GB RAM and 48 cores each.

search = RandomizedSearchCV(self.estimator, self.estimator.param_space,
                            cv=3, n_iter=20, verbose=10)

if self.estimator.use_dask:
    address = 'tcp://dask-cluster:8786'
    c = Client(address)

    with parallel_backend('dask.distributed', scheduler_host=address,
                          scatter=[X.values, y.values]):
        search.fit(X.values, y.values)

where X is a pd.DataFrame and y is a pd.Series.
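
(Not part of the original post - just a quick sanity check one could run after creating the Client, to confirm the scheduler actually sees all the workers before kicking off the search:)

# Sanity check: how many workers does the scheduler report?
# `c` is the Client created in the snippet above.
info = c.scheduler_info()
print(info['address'])
print(len(info['workers']), 'workers connected')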

When I run the above with a small number of rows, I can see in the status page that lots of tasks get executed in parallel. The blocks of work on the task stream end up looking like a straight vertical line, since they all get dispatched nearly simultaneously.

Once I start increasing the number of rows, I seem to get more and more serial execution, where the status page shows essentially one task being added at a time.

In this screenshot, I started at the full dataset and subtracted 10k rows on each run to see the effect on execution time / parallelism. For some reason, on a few runs (e.g. 80k and 40k) the work gets distributed a little differently?

[screenshot of the task stream from the status page, 2018-01-10]

When the row count is higher, there is never more than one task active at a time. When it is lower, I see more (up to 4) being triggered simultaneously.

Anyhow, my question ultimately is:

  • am I doing something in particular wrong?
  • does this pattern look indicative of something I setup incorrectly?
TomAugspurger (Member) commented:

Thanks, I'll try to debug this later today or tomorrow. Can you say a bit more about

  1. The graph backing X and y. Do they involve disk IO? Do all the workers share a file system?
  2. The estimator you're fitting? This probably isn't the issue, but it may help with debugging.


metasyn commented Jan 12, 2018

1.) X and y are already in memory - no additional disk IO after they've been read from std.io to a dataframe earlier in the process. The workers in the above are 10 containers spread across three physical hosts, so each worker on average shares a filesystem with 2 other workers.

2.) The estimator here was RandomForestClassifier
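
(For completeness, a rough, self-contained sketch of the setup described in this thread - the param_space here is invented for illustration, the real one comes from self.estimator, and the scheduler address is the one from the earlier snippet:)

import numpy as np
import pandas as pd
import dask_ml.joblib  # registers the 'dask.distributed' joblib backend

from dask.distributed import Client
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-ins for the real X (pd.DataFrame) and y (pd.Series)
X = pd.DataFrame(np.random.randn(100_000, 10))
y = pd.Series(np.random.randint(0, 2, size=len(X)))

# Invented parameter space; the real one lives on self.estimator.param_space
param_space = {
    "n_estimators": stats.randint(50, 500),
    "max_depth": stats.randint(2, 20),
}

search = RandomizedSearchCV(RandomForestClassifier(), param_space,
                            cv=3, n_iter=20, verbose=10)

address = 'tcp://dask-cluster:8786'
client = Client(address)

with joblib.parallel_backend('dask.distributed', scheduler_host=address,
                             scatter=[X.values, y.values]):
    search.fit(X.values, y.values)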

TomAugspurger (Member) commented:

Sorry for the delay on this! Let's try to narrow this down to see if it's just the scheduler that's not working properly. Could you set up a cluster / client and try out the following:

import dask.distributed
import pandas as pd
import numpy as np
import dask_ml.joblib  # registers the 'dask.distributed' joblib backend

from sklearn.externals import joblib
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

from sklearn.base import BaseEstimator

class DummyEstimator(BaseEstimator):
    def __init__(self, parameter=None):
        self.parameter = parameter
        
    def fit(self, X, y=None):
        return self
    
    def predict(self, X):
        return np.zeros(len(X))
    
    def score(self, X, y=None):
        return 0

search = RandomizedSearchCV(DummyEstimator(), {"parameter": stats.uniform},
                            cv=3, n_iter=20, verbose=10)

%%time
N = 100_000
X = pd.DataFrame(np.random.randn(N, 10))
y = pd.Series(np.random.uniform(size=N))

# `client` here is a dask.distributed.Client already connected to the cluster
addr = client.scheduler_info()['address']
with joblib.parallel_backend('dask.distributed', addr,
                             scatter=[X.values, y.values]) as pb:
    search.fit(X, y)

For me, that finishes in ~4 seconds, just using a local cluster.
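
(One note for anyone copying the snippet above: `client` is assumed to be an already-connected dask.distributed.Client. A minimal local setup, roughly matching the local cluster mentioned here, would be something like:)

from dask.distributed import Client

client = Client()  # with no arguments this starts a local cluster on this machine
print(client)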

TomAugspurger (Member) commented:

FYI, if you're able to try out joblib master and dask.distributed master, things may have improved in the last couple of weeks. Nothing specific to this issue, but we were making changes to that code and it might have fixed things magically :)


metasyn commented Mar 25, 2018

Hey Tom - sorry, I sorta dropped off the face of the planet. I appreciate your responses - I might not get around to re-checking this for a bit, so I will close this issue for now :)

metasyn closed this as completed Mar 25, 2018