This repository has been archived by the owner on Oct 14, 2018. It is now read-only.

Efficiency for GridSearchCV on large graphs #29

Closed
pinjutien opened this issue Mar 4, 2017 · 46 comments

@pinjutien

Hi,

We have another issue with DaskGridSearchCV. The example uses XGBRegressor (https://github.com/pinjutien/grid-search/tree/master/grid_search_issue2).

In this case, when DaskGridSearchCV is running, nothing shows up in the dashboard (for example, http://173.208.222.74:8866/status/). Neither Client nor Executor shows anything on the dashboard.

Thanks,
Pin-Ju

@bansalr

bansalr commented Mar 4, 2017

I think what you are saying is that top shows the scheduler running and using CPU, but the workers do not show any CPU usage and there is no activity on the dashboard.

@mrocklin
Member

mrocklin commented Mar 4, 2017

I would need to see how you are setting up your workers. Do you have any set up?

If you don't want to deal with setting up a scheduler and workers you might also just create a client without arguments:

from distributed import Client

client = Client()  # sets up scheduler and workers for you

Also, Executor is an old name for Client. I recommend switching.

@pinjutien
Author

pinjutien commented Mar 4, 2017

Yes, we also tried Client(), but it does not work. Here is the detailed information.

  1. set up scheduler and worker
    dask-scheduler --http-port 8899 --port 8877 --bokeh-port 8866 &
    dask-worker 173.208.222.74:8877 &

  2. python test.py (https://github.com/pinjutien/grid-search/tree/master/grid_search_issue2)
    part of test.py:
    scheduler_address = '173.208.222.74:8877'
    #exc = Executor(scheduler_address, set_as_default=True)
    c = Client(scheduler_address, set_as_default=True)
    gs = DaskGridSearchCV(model, grid).fit(train_x, train_y)

Nothing shows up in the Task Stream section of the dashboard (http://173.208.222.74:8866/status/).

Thanks!

@mrocklin
Member

mrocklin commented Mar 4, 2017

Does it run? Does it appear to be using the workers? Is there anything interesting in the logs? What version of bokeh do you have installed?

@pinjutien
Author

pinjutien commented Mar 4, 2017

In this example (model = XGBRegressor), it does not work, but other examples we tried before do work.

So, in this example, the dashboard looks like:

[dashboard screenshot]

and the worker status looks like:

[worker status screenshot]

@mrocklin
Member

mrocklin commented Mar 4, 2017

Does your computation run? Does any computation run? You might be able to isolate your problem by trying a few different things and seeing if they work or don't work. Does client connect? Does it see workers? Can you scatter data out? Can you perform simple computations, etc... To be clear I'm not asking for the answers to these questions. I'm recommending that you try a few things to see what does and does not work. That might help you to identify what is going wrong.
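A minimal sketch (not from the original thread) of these incremental checks, using the scheduler address given earlier:

from distributed import Client

client = Client('173.208.222.74:8877')

print(client)                              # does the client connect?
print(client.scheduler_info()['workers'])  # does it see workers?

futures = client.scatter([1, 2, 3])        # can you scatter data out?
total = client.submit(sum, [1, 2, 3])      # can you run a simple computation?
print(total.result())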

@pinjutien
Author

pinjutien commented Mar 4, 2017

When I see the following, it means the client connected, right?

scheduler:
[scheduler screenshot]

worker:
[worker screenshot]

But if I switch the model from XGBRegressor to DecisionTreeClassifier, it works perfectly.

@mrocklin
Member

mrocklin commented Mar 4, 2017

Yes. Have you tried the quickstart yet? That might be a good place to start: http://distributed.readthedocs.io/en/latest/quickstart.html

@pinjutien
Author

Sure, I tried these things. But the issue is the following:
we tried two different regressors with the dask grid search. One is XGBRegressor. The other is DecisionTreeRegressor.

  • XGBRegressor: nothing shows up in the Task Stream section of the dashboard.
  • DecisionTreeRegressor: works perfectly, i.e. something does show up in the Task Stream
    section of the dashboard.

So my question is why DaskGridSearchCV works with DecisionTreeRegressor but not with XGBRegressor (both use the Client setup you suggested).

Thank you.

@bansalr

bansalr commented Mar 4, 2017

I just noticed that xgboost has a different dask integration. We will try that next. However, it seems that the vanilla way of calling XGBoost from scikit-learn is broken.

So on this one, a summary of what we have tried so far:

  • This is the same scheduler running on the same machine as in issue #28. After solving issue #28, dask-learn works on that machine for models other than XGBoost: it works for the random forest, decision tree classifier, and decision tree regressor.
  • On the data set for issue #29, Pin-Ju tried the decision tree classifier and regressor and those work.
  • On the same data set, XGBoost works from scikit-learn.

So we can conclude that there is some interaction between dask-learn and xgboost that is causing the problem.

Ritesh

@mrocklin
Member

mrocklin commented Mar 5, 2017

So I tried the same computation with the single threaded scheduler

dask.set_options(get=dask.async.get_sync)

And found that it appeared to hang and not make any progress. So I started it, let it run for a while, and then Ctrl-C'd while running pdb to jump to whatever it was working on. I went up a little bit in the stack and ran into a point where it was looping over all candidate parameters. It looks like there are half a million of them.

pdb> pp len(candidate_params)
504000

It's probably just taking a while to set things up.

So I cut down the grid significantly, ran things, and they appeared to run fine. My guess is that this was just spending a while setting up a very large computational graph. I think one lesson here might be to start with a small problem and then scale up after you have it running. It would be interesting to look into how to reduce overhead, but we should first verify that the overhead is large relative to expected costs.

@yzard

yzard commented Mar 5, 2017

Hi, Matt,

I work with Pin-Ju and Ritesh; actually I came up with this parameter range.

  • I tried a small number of candidates, like ~8000, and it works perfectly; I only needed to wait about 10 seconds.
  • With 168,000 candidates, it needs about 5-10 minutes before anything starts to show up on bokeh.
  • With 504,000 candidates, I waited for 30 minutes; the grid search process used about 14 GB of memory and got killed by the kernel, and dask-scheduler produced a log like the one below (which might be caused by the kernel kill):

distributed.scheduler - INFO - Receive client connection: 9ed3268c-015a-11e7-a1ec-0002c9c27334
2017-03-04 22:53:13,176 http route tasks failed
2017-03-04 22:53:13,382 workers http route failed
CRITICAL:/home/zyin/.conda/envs/anaconda_intel/lib/python2.7/site-packages/distributed/protocol/core.pyc:Failed to deserialize
Traceback (most recent call last):
  File "/home/zyin/.conda/envs/anaconda_intel/lib/python2.7/site-packages/distributed/protocol/core.py", line 114, in loads
    fs = merge_frames(head, fs)
  File "/home/zyin/.conda/envs/anaconda_intel/lib/python2.7/site-packages/distributed/protocol/utils.py", line 53, in merge_frames
    assert sum(lengths) == sum(map(len, frames))
AssertionError
distributed.scheduler - INFO - Remove client 9ed3268c-015a-11e7-a1ec-0002c9c27334
distributed.scheduler - INFO - Close client connection: 9ed3268c-015a-11e7-a1ec-0002c9c27334
distributed.core - WARNING - Lost connection to ('173.208.222.74', 12878) while sending result: Stream is clos

@mrocklin
Member

mrocklin commented Mar 5, 2017

From those times it looks like it has an overhead of around 1.5 ms per candidate (roughly 10 seconds for 8,000 candidates). There might be some tuning we can do to take that down a bit. Is it worth it? What are the runtimes in those cases?

In the last case it could be that the graph itself is too large to fit comfortably into memory?

I made a small case with two parameters with three and five options each. This resulted in a dask graph of size 171. This dask graph took about 15MB to serialize. cc @jcrist if he has thoughts here.

@mrocklin
Member

mrocklin commented Mar 5, 2017

Looks like most of the 15MB in the comment above was just the input data.

I've reduced this problem down to the following minimally reproducible example (examples like these are highly preferred).

import pandas as pd
from xgboost.sklearn import XGBRegressor
from dklearn import DaskGridSearchCV
import numpy as np

model = XGBRegressor()
x = pd.DataFrame({i: np.random.random(100) for i in range(10)})
y = pd.Series(np.random.random(100) > 0.5)
grid = {'gamma': [0.5, 0.75, 1,  1.25, 1.5], 'colsample_bytree': [0.25,0.5,0.7, 0.9,1]}

gs = DaskGridSearchCV(model, grid).fit(x, y)

>>> len(gs.dask_graph_)
261

>>> import cloudpickle
>>> sum([len(cloudpickle.dumps(v)) for v in gs.dask_graph_.values()])
107559

The size of the graph is a little bit interesting: 5 by 5 options yields 260 tasks. I can see how, if there were hundreds of thousands of tasks, that might become problematic. I hope that by providing the minimal example above, @jcrist will more easily be able to play with it and provide some feedback on what might be going on.

@yzard

yzard commented Mar 5, 2017

Hi, Matt,

I think the 1.5 ms overhead does not need tuning right now.

Actually, I ran exactly the same test with 504,000 candidates last night before I went to sleep. The strange thing is that this time the kernel didn't kill the grid search process, but the process's memory got freed anyway; the scheduler is holding 1.8 GB of memory and still nothing shows on the status page. I think dask is not working on those tasks.

I wish dask (scheduler/worker/DaskGridSearchCV, etc.) had an option for much more verbose logging; that would make it easier to narrow down the problem. And I don't even know what the correct graph size should be in order to determine a minimal example.

@mrocklin
Member

mrocklin commented Mar 5, 2017

You can increase the verbosity of the log in the ~/.dask/config.yaml file. You could change the logging entry so that it looks like the following:

logging:
  distributed: info
  distributed.scheduler: debug

You could also look at the dashboards at scheduler:8788/ and worker:8789 to get a better sense of what is going on. You can also use the IPython integration to investigate the scheduler and worker state directly.
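As a rough sketch of that IPython integration (assuming the Client.start_ipython_scheduler and Client.start_ipython_workers helpers in distributed):

from distributed import Client

client = Client('173.208.222.74:8877')

# Attach IPython kernels to the scheduler and to each worker so their
# internal state can be inspected interactively.
scheduler_kernel = client.start_ipython_scheduler()
worker_kernels = client.start_ipython_workers()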

@mrocklin
Member

mrocklin commented Mar 5, 2017

Or if you set up a LocalCluster you can just have the scheduler in your main process.

http://distributed.readthedocs.io/en/latest/local-cluster.html
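For example, a minimal sketch along those lines:

from distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=4)  # scheduler runs in this process
client = Client(cluster)

scheduler = cluster.scheduler        # scheduler state is now directly inspectable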

@yzard

yzard commented Mar 5, 2017

Thanks, that's very good to know!

@yzard

yzard commented Mar 7, 2017

Hi, Matt,

I wrote some code that breaks a large parameter range into a list of smaller parameter ranges with a limit on the number of candidates, runs the grid searches one by one, and selects the best score among them; it seems to be working so far. But I still think the number of tasks needs to be investigated.
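A rough sketch of that chunking approach (not the actual code used; chunked_grid_search and split_key are hypothetical names here, and it assumes DaskGridSearchCV exposes best_score_ like sklearn's GridSearchCV):

import numpy as np
from dklearn import DaskGridSearchCV

def chunked_grid_search(model, grid, X, y, split_key, n_chunks=10):
    # Split one parameter's values into smaller pieces, run a grid search
    # per piece, and keep the search with the best score.
    best = None
    for chunk in np.array_split(np.asarray(grid[split_key]), n_chunks):
        sub_grid = dict(grid, **{split_key: list(chunk)})
        gs = DaskGridSearchCV(model, sub_grid).fit(X, y)
        if best is None or gs.best_score_ > best.best_score_:
            best = gs
    return best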

And I changed the logging to debug level; with a large number of candidates I didn't see much useful information, and there is a lot of ping-like logging every second.

I will try LocalCluster when I have some time. Will that logging setting also enable verbose logging for DaskGridSearchCV?

@mrocklin
Member

mrocklin commented Mar 8, 2017

I don't know what logging DaskGridSearchCV does, if any. You'll need to ask @jcrist about most of this (I don't work on dask-learn). The benefit to LocalCluster is that you can investigate the internals of the scheduler and workers if those interest you.

@mrocklin
Member

mrocklin commented Mar 9, 2017

I took a look at profiling the example above with a small set of parameters. There were four parameters with counts 5, 5, 7, and 4. I found that this generated about 9000 tasks and had about 1.7 s of overhead, 1.2 s of which was due to sklearn calls for getting and setting parameters. I wonder if we could avoid this by doing the cloning within the task itself, rather than while building the task. This would also cut down on serialization and tokenization costs because we would only have to push a single sklearn estimator into the graph.

@jcrist I don't have enough experience with the dask-learn codebase to know if this is reasonable. Thoughts? Can we do some of the estimator manipulation work in the tasks rather than at graph construction time?

@mrocklin
Member

mrocklin commented Mar 9, 2017

For profiling I was doing the following:

import pandas as pd
from xgboost.sklearn import XGBRegressor
from dklearn import DaskGridSearchCV
import numpy as np

model = XGBRegressor()
x = pd.DataFrame({i: np.random.random(100) for i in range(10)})
y = pd.Series(np.random.random(100) > 0.5)
grid = {'gamma': [0.5, 0.75, 1, 1.25, 1.5],
        'colsample_bytree': [0.25, 0.5, 0.7, 0.9, 1],
        'max_depth': range(1, 12, 2),
        'learning_rate': [0.01, 0.03, 0.05, 0.07, 0.09, 0.1, 0.2]}

%load_ext snakeviz
%snakeviz gs = DaskGridSearchCV(model, grid).fit(x, y)

@yzard

yzard commented Mar 9, 2017

Matt, I think you mean the following?

  1. Pass the task a raw estimator, X, and y.
  2. Then just serialize each possible candidate.
  3. Clone the raw estimator, fit it with the possible candidate, return the score, and release memory.

Since dask has a graph, 3 can depend on 2, which depends on 1, and then reduce(scores).

@jcrist
Member

jcrist commented Mar 9, 2017

Apologies for the lack of reply here. Constructing the estimators inside the graph (instead of beforehand as we do now) is tricky in a couple of respects.

I initially did the easiest thing, which was to initialize each estimator at graph build time, which handles both those cases nicely.

I should (hopefully) have some time to look at this soon. My first thought is just to reduce the cost of tokenizing each estimator, which should be possible with some tricks. Since we're only serializing unfit estimators, they're not all that expensive to serialize, so I doubt that's the main cost here. But the tokenizing of each candidate could add up. Worst case, we do all of this in the graph and the code gets more complicated. But these are solvable problems.

One question I have is: how many candidates is it normal to grid search with? I made the (possibly naive) assumption that these were usually in the tens to low hundreds, not in the hundreds of thousands as I see above. For smaller numbers of candidates with mildly expensive-to-run tasks, the current implementation should have low relative overhead.

Same question for cv splits. I genuinely don't know how these tools are used in practice.

@mrocklin
Member

At a fine-grained level I mean the following:

from dask import delayed

estimator = delayed(estimator)  # stick the estimator in the graph once
candidates = [delayed(clone_and_set_params)(estimator, params)
              for params in candidate_params]
fits = [delayed(fit)(candidate, data) for candidate in candidates]

This is obviously far simpler than how things actually play out, but the point is that the graph really only has one estimator in it. The workers will move it among themselves as necessary. There are now a bunch of dicts of parameters in the graph instead, along with references to the original estimator, but both the dicts and the references are pretty lightweight. I think that this minimizes overhead on the client and scheduler side and moves that overhead to the workers (who can handle it much more effectively).

@jcrist tokenization was an issue, but I also found a nontrivial cost to just calling sklearn.clone. I suspect that this will be a theme in the near future in other projects (like dask.array) as well. We're likely to become much more careful about small overheads when constructing giant graphs. Today in other work I stopped using pd.DataFrame.memory_usage just because it constructed a pd.Series, which took a few hundred microseconds.

Of course, all of this only makes sense if checking half a million candidates in a grid search is likely to occur. I think that @pinjutien, @bansalr, and @yzard may be able to help us understand what is generally common in practice. I have to admit that I'm also somewhat surprised by this number. I would expect people to try other search strategies after a certain point. I wonder if RandomizedSearchCV would have better performance. I'm also curious whether an active search strategy might be preferred. I don't have any on-the-ground experience with this problem, though. If you all have feedback from the field I'd love to hear it.

@yzard

yzard commented Mar 10, 2017

Thanks for the quick feedback @mrocklin @jcrist

For hyperparameter optimization (there are different approaches, such as grid search and gradient descent):

Neural networks use gradient descent and backpropagation to tweak their hyperparameters. This has good and bad sides: it needs a huge amount of data, gradient descent requires the hyperparameters to be continuous, and it might find a local optimum rather than the global optimum. The good side of gradient descent is that it can tune far more hyperparameters, I would say millions (because of the layers), each of which can vary from -infinity to +infinity.

For the decision-tree-based algorithms we use, some of the hyperparameters are discrete, like tree depth, so the gradient descent approach is not capable of handling those models (you can, by fixing the discrete hyperparameters, but you still need lots of data, and it is limited to random forests).

And why are we using tree-based algorithms? Because for the problems we are trying to solve, the data is discrete and not as plentiful as images (I mean terabytes). For example, there is a feature called typeOfHeat, which can be gas, electrical, propane, etc.; it's not a value from 0 to 1. And the features might not have a continuous relationship with each other. In cat images, for example, adjacent pixels have an obvious continuous relationship. That's why deep learning works so well on images, video, voice, etc., and then shows no improvement on other problems.

So grid search (either random or brute force) is a reasonable way to do hyperparameter optimization for tree-based algorithms, especially gradient boosting trees (xgboost). And still, grid search really just samples points in the hyperparameter space. Theoretically, the number of candidates should not be related to the memory problem.

Back to the sample we provided: the training data is only 15 MB and very small, and cross validation will split the data into, say, 3 folds, meaning 3 identical parts [1, 2, 3]. I think this occupies a lot of memory too; ideally each worker needs only one copy of [1, 2, 3] and one raw estimator. If the data were 1 GB, I am not sure what would happen with the grid search, and that's not actually very big. Usually more data produces much better models.

If some better algorithm comes along, grid search might get retired, but for now I don't see any possibility of that, as long as tree-based methods are widely used.

@mrocklin
Member

Memory issues

The memory problems you were seeing had nothing to do with the size of your data. It had to do with the metadata of the graph. I think that we can reduce that with the changes that @jcrist mentions above.

Alternatives to GridSearch

I rarely put grid search and gradient descent in the same category, just because grid search is more often used on hyperparameters while gradient descent is more often used on parameters (or so I've been told). I think of RandomizedSearchCV as the standard alternative to GridSearchCV.

This sklearn example might be interesting, I haven't run through it yet though: http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html

This blogpost has the argument that I've heard before: https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881#.w121wqyu6

In particular, the image in that blogpost does a decent job of explaining their point quickly, I think.

But at the end of the day, I don't actually do any of this work. GridSearchCV should run efficiently regardless.
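For comparison, a rough sketch (not from the thread) of what random sampling over the same kind of space looks like with scikit-learn's RandomizedSearchCV; n_iter bounds the number of fits regardless of how large the space is:

import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost.sklearn import XGBRegressor

model = XGBRegressor()
param_distributions = {
    'gamma': uniform(0.5, 1.0),             # continuous range instead of a fixed list
    'colsample_bytree': uniform(0.25, 0.75),
    'max_depth': list(range(1, 12, 2)),
}
search = RandomizedSearchCV(model, param_distributions, n_iter=100, cv=3)
# search.fit(train_x, train_y)  # train_x, train_y as earlier in the thread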

@yzard

yzard commented Mar 10, 2017

Thanks, @mrocklin. I am so happy you wrote the dask library, which I had dreamed about: it decouples computation logic from data and makes parallel computing practical and efficient. TensorFlow uses a similar idea, but dask is far more generic, so grid search, random search, and even gradient descent can take advantage of it.

I put grid search and gradient descent search together simply because both of their goals are to search for the parameters that best fit the training data (I will put overfitting to one side); please see this:
https://en.wikipedia.org/wiki/Hyperparameter_optimization

@yzard

yzard commented Mar 10, 2017

@jcrist cross validation is a method for preventing a model from overfitting, see https://en.wikipedia.org/wiki/Overfitting, and there is no clear line that says whether something is overfitting or not. For k-fold cv, choosing 10 folds is pretty common. But k is also a parameter to choose, and more folds normally tend to move closer to overfitting.

To take an extreme example: if we want to cross validate a classifier, have 100 training samples, and use 100 folds, then on each turn you use 99 samples to train the model and 1 to validate it, repeat 100 times, and average the results. The 100 models are all very similar, and at the same time it generates 100 models.

I won't talk about k too much; it is usually tweaked by experience. Usually with more data you use a smaller k, and with less data a higher k, but not extremely high like the example above. For a smaller data set, the purpose is to extract every bit of information from the limited data while avoiding overfitting.

@jcrist
Member

jcrist commented Mar 10, 2017

@yzard: thanks. I understand what grid search and cross validation are. The big question I have is how large common grids are, in number of candidates. This affects how we choose to balance simplicity of implementation with performance. I assumed large grids (like the 500,000-candidate one you show above) would be uncommon, but I may have been mistaken. More insight on common numbers of candidates would be helpful.

@mrocklin
Member

When I look at the example @pinjutien provided here https://github.com/pinjutien/grid-search/blob/master/grid_search_issue2/test.py#L13-L27

grid = {
    'max_depth'         : range(1, 12, 2),                            # 6 options
    'learning_rate'     : [0.01, 0.03, 0.05, 0.07, 0.09, 0.1, 0.2],   # 7 options
    'n_estimators'      : [100, 300, 400, 500, 600, 700, 800, 1000],  # 8 options
    'gamma'             : [0.5, 1, 1.5],                              # 3 options
    #'min_child_weight'
    #'max_delta_step'
    'subsample'         : [0.2, 0.4, 0.66, 0.8, 1],                   # 5 options
    'colsample_bytree'  : [0.25, 0.5, 0.7, 0.9, 1],                   # 5 options
    #'colsample_bylevel': [0.2, 0.5, 0.7, 0.9, 1]
    'reg_alpha'         : np.arange(0.01, 1, 0.05)                    # 20 options
    #'reg_lambda'       : np.arange(0.01, 1, 0.05)
    #'scale_pos_weight'
    #'base_score'
}
# 6 * 7 * 8 * 3 * 5 * 5 * 20 = 504,000 candidates

I'll admit that it does make large grids seem more reasonable.

@mrocklin
Member

At a certain point though this must stop making sense. @amueller any thoughts here on active searching through hyperparameter spaces?

@amueller

Those are not all parameters of DecisionTreeClassifier (gamma and reg_alpha, for example) and the code should fail. It looks like a bug to me that it does not. I guess some workers fail silently?
They might all be valid for XGBoost.

Is the question how large a grid is reasonable? I'm not sure that's an easy question to answer ;)
I'd say 10x10x10x10 is reasonable: the whole point of this thing is to make larger grids feasible. The one above seems quite a bit bigger. How did you get to 500,000?

@mrocklin
Member

Yes, the grid above is for xgboost.

Is the question how large a grid is reasonable?

Yes

I'm not sure that's an easy question to answer ;) I'd say 10x10x10x10 is reasonable: the whole point of this thing is to make larger grids feasible. The one above seems quite a bit bigger. How did you get to 500,000?

There are tradeoffs here in code complexity, feature support, and supporting very large grids. @jcrist and I are curious what the distribution of desired grids would look like. One hypothesis is that after a certain point people will just drop grid search and move to more active strategies (which would be pretty easy to build with dask.distributed's async API by the way if you know anyone). But that might not hold.

Question: is there sufficient appetite for million-candidate parameter grids that we should consider this a reasonable priority?

@yzard

yzard commented Mar 10, 2017

@amueller 500,000 just comes from sampling more points :), about 50 times the reasonable number you mentioned. And XGB supports L1/L2 regularization; you can see I disabled L2 for this example.

@mrocklin
Member

It's pretty easy to expand the logic of "well, I want to add a couple more parameters to the grid search" a few times and get to a point where no cluster can handle it. Even if we reduced the overhead of the dask scheduler to zero, it would still take years to go through an exhaustive list of twenty parameters with ten options each.

The question I think we should consider is where we want to spend finite developer time. How far do we go into reducing scheduler overhead, and when do we switch to exploring different algorithms? @jcrist brought up hyperopt in private conversation.

Any thoughts, @amueller?
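For reference, a rough sketch (not from the thread) of what an active search with hyperopt might look like for the XGBRegressor example above; train_x and train_y are the training data names used earlier:

from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBRegressor

space = {
    'max_depth': hp.choice('max_depth', list(range(1, 12, 2))),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),
    'gamma': hp.uniform('gamma', 0.5, 1.5),
}

def objective(params):
    model = XGBRegressor(**params)
    # hyperopt minimizes, so negate the cross-validated score
    return -cross_val_score(model, train_x, train_y, cv=3).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=200)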

@mrocklin
Member

cc @jbergstra . Not sure if you're still interested in the tools-for-hyper-parameter-optimization business. Feel free to ignore, or better yet forward to an appropriately interested graduate student :)

@bansalr

bansalr commented Mar 13, 2017

@mrocklin, we regularly do grid search with 100k+ folds. We usually do it by using high compute capacity on AWS and letting it run for 10-30 hours (or more).

As I mentioned before, GridSearch is the obvious application of dask to sklearn. However, what we see here is that dask is unable to speed up workloads that sklearn can handle in a matter of 10-30 hours using 30-60 cores.

I would say that we are far from approaching anything that would exhaust a moderately sized cluster of 500-1000 cores. I am sure there will be applications that get to that boundary, but this is not one of them.

Perhaps naive parallelism for GridSearch using dask is the answer. That would provide data hashing to avoid sending multiple copies to workers, but would not calculate common computation blocks and memoize them across the graph. That seems to be the thing breaking things right now.
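A rough sketch (not from the thread) of that naive-parallelism idea, reusing the model, grid, train_x, and train_y names from the earlier test.py:

from distributed import Client
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid, cross_val_score

def fit_and_score(estimator, params, X, y, cv=3):
    est = clone(estimator).set_params(**params)
    return params, cross_val_score(est, X, y, cv=cv).mean()

client = Client('173.208.222.74:8877')
X_f, y_f = client.scatter([train_x, train_y], broadcast=True)  # ship the data once

# One independent task per candidate; no shared intermediate results.
futures = [client.submit(fit_and_score, model, params, X_f, y_f)
           for params in ParameterGrid(grid)]
best_params, best_score = max(client.gather(futures), key=lambda r: r[1])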

Pin-Ju has also had similar results on fairly small problems, where setting cv>3 breaks the graph computation. This is with the AirBnB dataset from Kaggle, where the raw data is 27 MB but the workers balloon to ~3 GB.

@mrocklin
Member

Thanks for the details. I'll let @jcrist follow up. I think he's been doing some tuning.

@jcrist
Member

jcrist commented Mar 13, 2017

Thanks for the details here. A few bits:

Perhaps naive parallelism for GridSearch using dask is the answer. That would provide data hashing to avoid sending multiple copies to workers, but would not calculate common computation blocks and memoize them across the graph. That seems to be the thing breaking things right now.

The reason things are slow right now is that grids of this size are not what I optimized for when writing this. These are solvable problems; they'll just require more complicated code.

Pin-Ju has also had similar results on fairly small problems, where setting cv>3 breaks the graph computation. This is with the AirBnB dataset from Kaggle, where the raw data is 27 MB but the workers balloon to ~3 GB.

Can you explain this? What do you mean by "breaks the graph computation"? If you can provide a reproducible test case, I'm sure this can be fixed.

@bansalr

bansalr commented Mar 13, 2017

I was speaking colloquially when I said "breaks the graph computation". Namely, the example works with scikit-learn, but with dask-learn the workers grow large and run out of memory.

Pin-Ju will post the complete details along with an example this evening.

@amueller

Wasn't it on this thread that I mentioned scikit-optimize and pinged @MechCoder, who is working on this right now?

And the paper by Snoek on hyperparameter tuning has a section on distributing Bayesian optimization. I think James Bergstra is in stealth mode ;)

What are the current boundaries? What size graphs work well and which don't?

mrocklin changed the title from "Dash Board does not show up for XGBRegressor Grid Search" to "Efficiency for GridSearchCV on large graphs" on Mar 13, 2017
@bansalr

bansalr commented Mar 13, 2017

This is from an email Pin-Ju sent me this weekend with some results. He will post the complete example and plots this evening. For cv=3 and workers=7, this fails to complete on a machine with 32 CPUs and 32 GB of RAM. The data set is ~27 MB raw on disk (the AirBnB Kaggle data set).


Hello,

To investigate memory usage vs. the number of workers, I did the following study and something interesting happened:

For one worker, I increased the cv from 2 to 4 and checked the length of dask_graph_.

number of workers: 1
number of grid points: 2000

cv = 2
running time: 237.84990310668945
size of graph: 7322

cv = 3
running time: 359.4740250110626
size of graph: 10980 ~ 7322 * 1.5

cv = 4
running time: 478.22754096984863
size of graph: 14638 ~ 7322 * 2

The increase in graph size closely tracks the increase in the cv number.
It looks like daskgridsearch makes some duplicate copies when doing cross validation!
I will file a daskgridsearch issue about it; let me know your thoughts.

Thanks,
Pin-Ju

ref:

train data set: ~25mb

@jcrist
Member

jcrist commented Mar 13, 2017

Yes, the graph size scales linearly with the number of cv splits. Each candidate/split pair gets a task in the graph - twice as many splits leads to twice as many tasks.

It looks like daskgridsearch makes some duplicate copies when doing cross validation!

This I don't follow: at most, each worker should have one copy of each cv split. I guess I'll wait until the issue is reported :). @pinjutien, please post your findings as a separate issue; this one is already sufficient in scope.

@jcrist
Member

jcrist commented Mar 27, 2017

Graph building was sped up significantly in #37. With a small edit to prevent the compute from actually happening, I ran the following benchmark on that PR. It builds a grid of 500,000 candidates with 3 splits (1,500,000 individually fit estimators) and measures the time to build and serialize the graph.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from dklearn import DaskGridSearchCV
from sklearn.datasets import make_classification
from timeit import default_timer
import pickle

# Make data small so nbytes cost is negligible
X, y = make_classification(2, n_features=5)

model = DecisionTreeClassifier()
# 500,000 candidates
grid = {'max_depth': np.arange(1, 1001),      # 1000
        'random_state': np.arange(100),       # 100
        'min_samples_leaf': np.arange(1, 6)}  # 5

grid_search = DaskGridSearchCV(model, grid)

start = default_timer()
grid_search.fit(X, y)
stop = default_timer()

print("Graph building took %.3f seconds" % (stop - start))
print("Graph has %d tasks" % len(grid_search.dask_graph_))

start = default_timer()
nbytes = sum(map(len, map(pickle.dumps, grid_search.dask_graph_.values())))
stop = default_timer()

print("Serialized graph takes %.3f GB" % (nbytes / 1e9))
print("Serializing graph took %.3f seconds" % (stop - start))

This results in:

(dask) jcrist grid-search $ python dklearn_grid_search_script.py
Graph building took 14.465 seconds
Graph has 3000008 tasks
Serialized graph takes 1.403 GB
Serializing graph took 42.132 seconds

Building the graph is no longer the bottleneck; rather, the cost of serializing has become large. However, I'm not sure if there's any way around this when you're creating roughly 3 million tasks. If each task serializes to ~400 bytes then hitting 1.4 GB serialized isn't hard. Still, a graph this size will take many, many hours to run, so the upfront cost of serializing it should be negligible in the long run.

For the special (and common) case of non-pipeline estimators, the number of tasks could be halved by doing the fit and score in the same task. However, I'm not likely to work on that until other things are done first. This would improve serialization time, at the cost of yet another code path.

@yzard

yzard commented Mar 27, 2017

Thanks a lot, Jim! I will try it out.
