Logloss increasing with num_rounds in distributed + tree_method=exact #5304

Closed
honzasterba opened this issue Feb 12, 2020 · 3 comments

@honzasterba honzasterba commented Feb 12, 2020

We have observed that in distributed mode with tree_method set to exact, training with the same parameters while only increasing the number of trees makes the logloss metric worse (higher) as more trees are built.

import dask_xgboost
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.preprocessing import DummyEncoder
import sklearn.metrics
import xgboost

client = Client(n_workers=4, threads_per_worker=1)

data = dd.read_csv("http://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/airlines_all.05p.csv", usecols=["Year", "Month", "DayofMonth", "DayOfWeek", "IsDepDelayed"])
data = data.categorize()

de = DummyEncoder()
trn = de.fit_transform(data)

X = trn[list(set(trn.columns) - set(["IsDepDelayed_YES", "IsDepDelayed_NO"]))]
y = trn["IsDepDelayed_YES"]

ntrees = [1, 10, 50, 100, 150, 200]
actual = y.compute()
dtrain = xgboost.DMatrix(X.compute())

for trees in ntrees:
    params = {'objective': "binary:logistic", 'seed':1, 'tree_method': "exact"}
    model = dask_xgboost.train(client, params, X, y, num_boost_round=trees)
    #predictions = dask_xgboost.predict(client, model, X).compute()
    predictions = model.predict(dtrain, ntree_limit=0)
    logloss = sklearn.metrics.log_loss(actual, predictions)
    print("%s = %s" % (trees, logloss))

This gives the following output:

10 = 0.67498968119218
50 = 0.6760841363388266
100 = 0.6782659646792807
150 = 0.6812360284077473
200 = 0.6882596412958003
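
As a quick sanity check on the numbers above, the sketch below confirms that the reported logloss rises at every step. The values are copied from the output in this report; the helper name `is_strictly_increasing` is made up for illustration and is not part of any library.

```python
# Logloss values copied from the output above, keyed by num_boost_round.
reported = {
    10: 0.67498968119218,
    50: 0.6760841363388266,
    100: 0.6782659646792807,
    150: 0.6812360284077473,
    200: 0.6882596412958003,
}


def is_strictly_increasing(values):
    """Return True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Order the losses by number of boosting rounds before comparing neighbours.
losses = [reported[n] for n in sorted(reported)]
print(is_strictly_increasing(losses))
```

On a training set, logloss would normally be expected to fall or flatten as rounds are added, so a strictly increasing sequence points at the training setup rather than at overfitting.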

@RAMitchell RAMitchell commented Feb 12, 2020

I don't think exact is meant to be used as a distributed algorithm. Do you get the expected result for 'tree_method': "hist"?

Perhaps we can log a fatal error in this case.
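
Until such an error exists in the library, a caller could guard against this combination on their own side. The following is only a minimal sketch: `check_tree_method` is a hypothetical helper, not part of xgboost or dask_xgboost, and it assumes the caller knows how many workers the cluster has.

```python
def check_tree_method(params, n_workers):
    """Hypothetical user-side guard; not an xgboost or dask_xgboost API.

    Raises ValueError when 'exact' is requested with more than one worker,
    otherwise returns the params dict unchanged.
    """
    method = params.get("tree_method", "auto")
    if method == "exact" and n_workers > 1:
        raise ValueError(
            "tree_method='exact' is not meant for distributed training; "
            "got n_workers=%d. Consider tree_method='hist' instead." % n_workers
        )
    return params


# Single-worker 'exact' passes through; multi-worker 'hist' also passes.
check_tree_method({"objective": "binary:logistic", "tree_method": "exact"}, n_workers=1)
check_tree_method({"objective": "binary:logistic", "tree_method": "hist"}, n_workers=4)
```

In the reproduction script above, the call would sit just before `dask_xgboost.train`, with `n_workers=4` matching the `Client` setup.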

@cuauty cuauty commented Feb 13, 2020

@RAMitchell Do you mean "exact" doesn't work with multiple workers?

@trivialfis trivialfis commented Feb 13, 2020

That's correct. On the master branch and in 1.0rc there should be a fatal error.
