
Slow prediction with dask #5729

Closed
RAMitchell opened this issue May 30, 2020 · 9 comments

@RAMitchell
Member

Prediction with the XGBoost Dask interface is much slower than expected, taking almost 5x the training time.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask import array as da
import xgboost as xgb
from xgboost.dask import DaskDMatrix
import time


def main(client):
    m = 10000000
    n = 100
    X = da.random.random(size=(m, n), chunks=100)
    y = da.random.random(size=(m,), chunks=100)
    dtrain = DaskDMatrix(client, X, y)

    start = time.time()
    output = xgb.dask.train(client,
                            {'tree_method': 'gpu_hist'},
                            dtrain,
                            num_boost_round=500,
                            evals=[(dtrain, 'train')])
    print("Train time: {}".format(time.time() - start))
    bst = output['booster']
    start = time.time()
    prediction = xgb.dask.predict(client, bst, dtrain)
    prediction = prediction.compute()
    print("Predict time: {}".format(time.time() - start))
    return prediction


if __name__ == '__main__':
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            main(client)

Train time: 121s
Predict time: 502s

@trivialfis
Member

Let me take a look.

@trivialfis
Member

trivialfis commented May 30, 2020

Prediction is run for each partition/block so that no concatenation of the input data is needed (hence lower memory usage). When you set the chunk size to 100, you end up running prediction 100,000 times, given that you have 10,000,000 rows.
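
For illustration, here is a minimal sketch (not part of the original report) of how the chunk size translates into block counts, and how rechunking to larger row blocks reduces the number of per-partition prediction calls:

from dask import array as da

m, n = 10_000_000, 100

# chunks=100 chunks every axis at 100 elements, so the 10M rows become
# 100,000 row blocks -- and xgb.dask.predict runs once per block.
X_small = da.random.random(size=(m, n), chunks=100)
print(X_small.numblocks)   # (100000, 1)

# Larger row blocks that keep all columns together mean far fewer calls.
X_large = X_small.rechunk((1_000_000, n))
print(X_large.numblocks)   # (10, 1)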

@cdeotte

cdeotte commented May 30, 2020

I see two issues here. First, the chunk size is too small. If you change to chunks=1_000_000, then training takes 27.2 seconds and prediction takes 151.2 seconds.

The reason prediction is still 5.5x slower than training is that you need to add 'predictor': 'gpu_predictor'. If you add this parameter, then with chunks=1_000_000 training takes 27.2 seconds and prediction takes 11.7 seconds.
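
To show where the parameter goes, here is a sketch of the train call from the repro above with both suggested changes applied (the timings quoted are from the comment above, not re-measured here):

output = xgb.dask.train(
    client,
    {
        'tree_method': 'gpu_hist',
        'predictor': 'gpu_predictor',   # keep prediction on the GPU
    },
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train')],
)
prediction = xgb.dask.predict(client, output['booster'], dtrain)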

@trivialfis
Member

I think the predictor is automatically running on GPU in this example.

@cdeotte

cdeotte commented May 30, 2020

No, it is not. I just ran this on a DGX and posted my results above. By default XGBoost always uses the CPU for prediction. Even if you use 'tree_method': 'gpu_hist', prediction is still on the CPU. You must explicitly set 'predictor': 'gpu_predictor'.

@trivialfis
Member

Got it, I took another look at the code. The CPU predictor might be chosen because the data comes from the host. Thanks for correcting my mistake.

@cdeotte

cdeotte commented May 30, 2020

It is good you made your comment, trivialfis. Many people, including myself until last week, did not know this. As such, I suggest making GPU prediction the default when a user sets tree_method to gpu_hist.

(And yes, maybe the predictor depends on where the data comes from; I'm not sure.)

@trivialfis
Member

trivialfis commented May 30, 2020

It's a trade-off. We added some heuristics to avoid copying data from host to device. For training, the data is not copied even when you are using the GPU. Currently a DGX is not very accessible to the wider public, so we can't make this the default.
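
As a workaround, here is a sketch (assuming the XGBoost version in use accepts the predictor parameter) of forcing the GPU predictor on an already-trained booster, which bypasses the host-data heuristic described above:

# Force the GPU predictor on the trained booster before calling predict;
# this overrides the heuristic that falls back to the CPU for host data.
bst = output['booster']
bst.set_param({'predictor': 'gpu_predictor'})
prediction = xgb.dask.predict(client, bst, dtrain)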

@trivialfis
Member

The prediction speed has been improved quite significantly in recent PRs.
