This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Impossible to reproduce model results #37

Open

sergiocalde94 opened this issue Apr 12, 2019 · 49 comments

@sergiocalde94

I've just opened this issue in the dask repo, but maybe it fits better here...

I'm using dask to implement a data pipeline with dask dataframes and dask-ml on a YARN cluster.

When I build an XGBoost model, the results are always different, even if I manually fix a seed with da.random.seed().

import dask_xgboost as dxgb


params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1}

bst = dxgb.train(client, params, fitted.transform(X), y)

Is it possible to reproduce the results of a dask model, the same way I can locally when using sklearn instead of dask-ml?

@sergiocalde94
Author

The problem appears when I run the model in cluster mode (not local). It's a YARN cluster, as I mentioned before.

@jcrist
Member

jcrist commented Apr 12, 2019

When I build an XGBoost model, the results are always different, even if I manually fix a seed with da.random.seed().

da.random.seed has no effect on dask-xgboost, so that definitely won't work. Currently it looks like we don't support setting a random seed for this library, but we should be able to. I'm not super familiar with xgboost, but it looks like you should be able to set the seed by adding seed to the params, which will be forwarded to every call to xgb.train (though this may be non-optimal; we may want a different seed per task).

You may try this and see if things work (untested).

import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1, 'seed': 1234}

bst = dxgb.train(client, params, fitted.transform(X), y)

Provided X and y are consistently partitioned and seed can be passed this way, I would expect consistent results. XGBoost also has some non-determinism inherent to it (see https://xgboost.readthedocs.io/en/latest/faq.html#slightly-different-result-between-runs). @TomAugspurger would know more, but he may be busy at the moment.
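As a quick sanity check of the "consistently partitioned" condition, here is a minimal sketch. It assumes the collections passed to dxgb.train are dask DataFrame/Series (dask arrays expose chunks instead of divisions), and reuses the fitted, X, and y objects from the snippets in this thread:

data = fitted.transform(X)  # the collection actually passed to dxgb.train

# Confirm both collections are split into the same partitions before training;
# mismatched partitioning changes how rows are grouped per worker/task.
assert data.npartitions == y.npartitions
assert data.divisions == y.divisions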

@sergiocalde94
Author

Sorry, I should have mentioned that I also tested with the seed parameter, and it's still not reproducible :(

@jcrist
Member

jcrist commented Apr 12, 2019

Ok. This may have to do with how we're using xgboost, or it may be inherent to xgboost (as I mentioned above). I'm not the best person to figure this out; Tom likely knows more here.

@sergiocalde94
Author

@TomAugspurger can you reply please? :(

@TomAugspurger
Member

TomAugspurger commented Apr 23, 2019 via email

@DigitalPig

DigitalPig commented Apr 24, 2019

When you say it's not replicable, do you mean the model itself, or its predictions? If it is the predictions, is it the probability or the class label?

One thing to note is that if you don't specify tree_method, the xgboost backend automatically picks approx as its tree method. Maybe you can fix it to exact in the params and try again?

@sergiocalde94
Author

sergiocalde94 commented Apr 24, 2019

Sorry @TomAugspurger ;(

Hi @DigitalPig,

The point is that if I build two xgboost models with exactly the same parameters, I don't get the same model back: the feature importances are different. My preprocessing code is this (df_train is a dask dataframe):

from sklearn.pipeline import Pipeline
import dask.array as da  # added: needed for da.random.seed below
from dask_ml.compose import ColumnTransformer
from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import Categorizer, OneHotEncoder  # added: Categorizer is used below


FILL_MISSING_NUMERICAL = -99
FILL_MISSING_CATEGORICAL = 'Desconocido'


da.random.seed(42)
columns_numeric = df_train.select_dtypes(include='number').columns
columns_categorical = df_train.select_dtypes(exclude='number').columns
columns_categorical = columns_categorical[columns_categorical != 'variable_350']

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=FILL_MISSING_NUMERICAL))])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=FILL_MISSING_CATEGORICAL)),
    ('categorizer', Categorizer()),
    ('onehot', OneHotEncoder(sparse=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_numeric),
        ('cat', categorical_transformer, columns_categorical)])

preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

X = df_train.drop('variable_350', axis=1)
y = df_train['variable_350'].astype(int)

fitted = preprocessing_pipeline.fit(X, y)

and then, if I run this training twice and plot the feature importances, they are different:

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1,
          'tree_method': 'exact', 'seed': 123}

bst = dxgb.train(client, params, fitted.transform(X), y)

import xgboost as xgb  # for plot_importance
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))

ax = xgb.plot_importance(bst, ax=ax, height=0.8, max_num_features=20)
ax.grid(True, axis="y")

first model:

(feature importance plot "model1")

second model:

(feature importance plot "model2")

As you can see, the results are slightly different. Maybe I'm doing something wrong...

Thanks for your replies

@DigitalPig

Which xgboost version are you using? I know recent xgboost changed the default method of variable importance from weight to gain. The plot you show still uses weight here.

Also, I would try taking a downsampled dataset and training it without dask to see if you still get different variable importances (a rough sketch of this check follows below).

Lastly, there are some stochastic options turned on during your training, like colsample_bytree. In theory, if you fix the seed (and the seed gets forwarded everywhere) it shouldn't matter, but I would also try turning them off to see if you still have the same issue.

What about the prediction of these two models?
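A rough sketch of the downsampled, dask-free check suggested above. It reuses df_train and variable_350 from the pipeline shown earlier, keeps only numeric columns to sidestep the preprocessing pipeline, and uses an arbitrary sample fraction:

import xgboost as xgb

# Pull a small sample to a single machine and train plain xgboost twice with the
# same seed; if the importances already differ here, dask is not the culprit.
sample = df_train.sample(frac=0.01, random_state=42).compute()
X_local = sample.drop('variable_350', axis=1).select_dtypes(include='number')
y_local = sample['variable_350'].astype(int)

dtrain = xgb.DMatrix(X_local, label=y_local)
params_local = {'objective': 'binary:logistic', 'max_depth': 5,
                'learning_rate': .05, 'seed': 123}
bst_a = xgb.train(params_local, dtrain, num_boost_round=50)
bst_b = xgb.train(params_local, dtrain, num_boost_round=50)

print(bst_a.get_score(importance_type='weight') ==
      bst_b.get_score(importance_type='weight'))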

@sergiocalde94
Author

@DigitalPig sorry for taking so long to answer, I was on holiday.

I'm using xgboost 0.81, as returned by:

import xgboost as xgb


print(xgb.__version__)

With the random options turned off, the model also returns different importances.

I ran this twice:

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': 1, 'colsample_bytree': 1,
          'learning_rate': .05, 'scale_pos_weight': 1,
          'tree_method': 'exact', 'seed': 123}

bst = dxgb.train(client, params, fitted.transform(X), y)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))

ax = xgb.plot_importance(bst, ax=ax, height=0.8, max_num_features=20)
ax.grid(True, axis="y")

The first execution returns these importances:

(importance plot "first")

And the second time:

(importance plot "second")

BUT when I ran the test on less data (only a subset of 100,000 records), the models returned the same importances, even with the stochastic parameters set to values below 1 (e.g. subsample .8 or colsample_bytree .8).

So maybe it's because of the size of the data??

@sergiocalde94
Author

Any idea why, with more data, dask_xgboost doesn't return reproducible results?

@TomAugspurger
Member

If you remove dask-xgboost from the equation and just use XGBoost, are the results deterministic? Is it still deterministic if you use XGBoost distributed training (again, not using dask to set up the distributed xgboost runtime)?

@sergiocalde94
Author

@TomAugspurger yes! Just using XGBoost, the results are deterministic with all of the data and with n_jobs set to 30 (cores).

@TomAugspurger
Member

TomAugspurger commented May 27, 2019 via email

@sergiocalde94
Author

@TomAugspurger mmm that's on one machine; can I run distributed xgboost with only the xgboost library?

@TomAugspurger
Member

TomAugspurger commented May 27, 2019 via email

@sergiocalde94
Author

Ok I will test it tomorrow! Thanks

@sergiocalde94
Author

Sorry, but I couldn't test it because in our environment we are using a cluster that we are not allowed to configure, and it's not possible to run xgboost distributed without dask.

Any idea how to test it? :(

PS: For me the strangest thing is that with less data the results are reproducible, even though dask is also using the cluster (the dask dashboard shows it).

@TomAugspurger
Member

TomAugspurger commented Jun 3, 2019 via email

@mmccarty
Member

@sergiocalde94 Do you have a minimal example that reproduces this issue? If so, I can take a look.

@kylejn27
Contributor

I was able to reproduce this error. Taking a look at why this is happening

@kylejn27
Contributor

kylejn27 commented Oct 24, 2019

I installed both libraries from source and the error seemed to go away. I did some digging and it seems that it's a problem with version 0.90 of xgboost.

I'm fairly certain that this is the culprit; it was fixed a few days ago in a commit on xgboost master here:
dmlc/xgboost@7e72a12#diff-fd53d68e0037d3512896122d1248d969L1128

@jakirkham
Member

In that case, I would recommend asking whether upstream could make a new release.

@kylejn27
Contributor

kylejn27 commented Oct 25, 2019

Ok, my conclusion was a bit premature. I ran the example that reproduced the error again today, after the issue above was closed, and realized that I had accidentally set n_workers=1 in my distributed client, so it wasn't running in distributed mode. I'm going to continue to look into this problem.

Here is how I reproduced the bug if anybody else was curious: https://github.com/kylejn27/dask-xgb-randomstate-bug

@kylejn27
Contributor

No solution yet, but I have some interesting information. I was tailing the dask worker logs and noticed a trend: if the threads that the workers were running on were the same between executions of the train method, the feature importance graphs were the same (a sketch of one way to check this is below).

I'm not sure if this is an issue with dask-xgboost or dmlc/xgboost, but I was able to reproduce it with the v1.0 dmlc/xgboost native dask integration.

maybe this is expected behavior though? https://xgboost.readthedocs.io/en/latest/faq.html#slightly-different-result-between-runs
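One crude way to see that scheduling difference, sketched here under the assumption that a dask.distributed Client named client is connected and that the worker logs mention the training tasks:

# After each call to train(), pull the recent log lines from every worker and
# compare which workers handled the xgboost work between the two runs.
logs = client.get_worker_logs()
for worker_address, lines in logs.items():
    print(worker_address, len(lines), "log lines")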

@mrocklin
Member

mrocklin commented Nov 4, 2019

cc @RAMitchell in case he has thoughts on what might be going on here.

@mrocklin
Member

mrocklin commented Nov 4, 2019

(or knows someone who can take a look)

@mrocklin
Member

mrocklin commented Nov 7, 2019

cc also @trivialfis

@trivialfis

Yup. We are still struggling with some blocking issues to make a new release.

@mrocklin
Member

mrocklin commented Nov 7, 2019 via email

@trivialfis

Sorry for the long wait. Should be fixed once dmlc/xgboost#4732 is merged.

@jakirkham
Member

@mmccarty, would someone from your team be able to try out Jiaming’s PR ( dmlc/xgboost#4732 )?

@kylejn27
Contributor

kylejn27 commented Dec 2, 2019

@jakirkham I'll test it out

@mmccarty
Member

mmccarty commented Dec 2, 2019

Great! Thank you @trivialfis
Thanks @kylejn27

@trivialfis

I tested it with the year prediction dataset using the dask interface in XGBoost, along with the "exact" tree method used in this issue. More tests are coming.

The root problem is in model serialization. Previously, distributed training was more or less a Java/Scala thing, and the JVM package built a layer on top of the C++ core to handle parameters; see the short tutorial in that PR. I tried to handle them in C++ by walking through the whole library and serializing it into JSON. As a bonus, you can verify the internal configuration by calling save_config on the booster object, which will return a JSON string containing all parameters used in XGBoost. Set verbosity to 3 and it will be pretty-printed. You don't need to set any extra parameter to enable the fix, as I have already used part of it in pickle. Feel free to reach out to me with any questions/issues. With that PR, the C++ core of XGBoost should be well prepared for distributed environments, so the next fix won't take as long. Thanks for the discussions with many details.
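For reference, a minimal sketch of the save_config check described above. It assumes an xgboost build that includes that PR (the 1.0 line); the toy data is only there to get a booster object:

import json
import numpy as np
import xgboost as xgb

# Train a tiny local booster, then dump its full internal configuration as JSON
# to verify which parameters (seed, tree_method, ...) were actually applied.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic', 'seed': 123}, dtrain, num_boost_round=5)

config = json.loads(bst.save_config())
print(json.dumps(config, indent=2))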

@kylejn27
Contributor

kylejn27 commented Dec 2, 2019

@trivialfis @jakirkham I ran a few tests using that branch (ensuring that dask-xgboost was using Jiaming's PR branch as its xgboost dependency) and saw the same issue: ~60% of the time there was a discrepancy in the results.

Here's an example of it failing:
https://github.com/kylejn27/dask-xgb-randomstate-bug/blob/master/dxgb_random_state_bug.ipynb

@trivialfis

trivialfis commented Dec 2, 2019

@kylejn27 I ran your example; X_train is different between runs. I converted it to numpy and saved it to a file so I could run sha256sum on it.

@trivialfis

While X is reproducible

@kylejn27
Contributor

kylejn27 commented Dec 2, 2019

Not sure if I'm getting the same results as you, but maybe I'm doing something wrong. Were you saving the input arrays inside the model code itself?

Sometimes, when running that example, it produces identical models; other times it does not. I've been running the script 4-5 times to confirm whether I'm getting different models.

So after running this I got two different models, but the input parameters seemed to be exactly the same. I tried adding random_state to train_test_split this time too. The same thing occurs with the dask_xgboost version as well, even though I didn't include it below:

# Assumed imports for this snippet (not shown in the original notebook):
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split

X, y = make_classification(n_samples=100000, n_features=20,
                           chunks=1000, n_informative=4,
                           random_state=12)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123)
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
dtest = xgb.dask.DaskDMatrix(client, X_test, y_test) 

bst = xgb.dask.train(client, params, dtrain)['booster']

fig, ax = plt.subplots(figsize=(12, 8))
ax = xgb.plot_importance(bst, ax=ax, height=0.8, max_num_features=len(X_train))
ax.grid(True, axis="y")

from numpy import save
save('x1',X.compute())
save('X_train1',X_train.compute())

bst1 = xgb.dask.train(client, params, dtrain)['booster']

fig, ax = plt.subplots(figsize=(12, 8))
ax = xgb.plot_importance(bst1, ax=ax, height=0.8, max_num_features=len(X_train))
ax.grid(True, axis="y")

from numpy import save
save('x2',X.compute())
save('X_train2',X_train.compute())
!sha256sum x1.npy
!sha256sum x2.npy
!sha256sum X_train1.npy
!sha256sum X_train2.npy

b2bda66f8fdc0e413533fe076b42f276a143aac2442f332974df0e7017a3a799  x1.npy
b2bda66f8fdc0e413533fe076b42f276a143aac2442f332974df0e7017a3a799  x2.npy
a7aa16b5383aae02ff92f95532e94eea73396979681a5523df50788f82a39a50  X_train1.npy
a7aa16b5383aae02ff92f95532e94eea73396979681a5523df50788f82a39a50  X_train2.npy
>>> bst.trees_to_dataframe().equals(bst1.trees_to_dataframe())
False
>>> pd.DataFrame(xgb.dask.predict(client, bst, dtest).compute()).equals(pd.DataFrame(xgb.dask.predict(client, bst1, dtest).compute()))
False

@jakirkham
Member

@trivialfis, any thoughts? 🙂

@trivialfis

Sorry for the late reply. Will get back to this tomorrow or at the weekend. I also need to test the more popular tree methods like hist and GPU hist.

@trivialfis

trivialfis commented Dec 9, 2019

Sorry for all the noise here. It suddenly occurred to me that the exact tree method doesn't support distributed training; see tree_method in https://xgboost.readthedocs.io/en/latest/parameter.html. XGBoost usually prints a warning that the tree method is changed to approx when distributed training is enabled, but in this case tree_method is explicitly specified, so no reconfiguration is performed.
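In practice that means picking a tree method that distributed training supports. A sketch reusing the params dict from earlier in this thread, with hist (approx is the other documented option):

# Same setup as the earlier examples, but with a distributed-capable tree method;
# 'exact' is only supported for single-node training.
params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1,
          'tree_method': 'hist', 'seed': 123}

bst = dxgb.train(client, params, fitted.transform(X), y)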

@kylejn27
Contributor

kylejn27 commented Dec 9, 2019

@trivialfis hmm you're right, I missed that in the docs. I still wasn't able to reproduce model results when I set the tree_method param to hist ... Though I'm not super familiar with the intricacies of the different tree methods, is it expected behavior to see different trees then, or is exact the only method that's supposed to return the same tree?

referencing this in the FAQ:
https://xgboost.readthedocs.io/en/latest/faq.html#slightly-different-result-between-runs

@trivialfis

Setting numeric errors aside, it's supposed to produce exactly the same trees. But sometimes numeric errors can be troublesome; for example, XGBoost uses gradients to handle missing values. It's nice mathematically, but floating point errors sometimes generate artificial gradients that are misinterpreted as gradients from missing values, hence changing the default split direction (I haven't talked to anyone about this yet as it's quite surprising to me). I will continue the tests for hist and GPU hist, and should report back as soon as possible.

@kylejn27
Contributor

kylejn27 commented Dec 9, 2019

That's helpful to know, thanks for looking into this further

@trivialfis

Please note that the above example is rare, as I have only seen it in an artificially generated small dataset with a specific pattern of values. Usually this doesn't happen on other datasets, as the error is normally not even close to being big enough to affect the split direction.

@trivialfis

trivialfis commented Dec 11, 2019

Ran some tests today for both hist and gpu_hist. For a small number of iterations (< 48) with 2 workers and 4 threads each, on the YearPredictionMSD dataset (a dense dataset with 0.5M rows), the model is reproducible. But going higher starts to generate discrepancies. It's not good news, but at least it proves there is no human error in the code generating this discrepancy. Most of the errors come from histogram building, as a result of summation error. I will keep the issue tracked in #37 in the future. Thanks for all the help!
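A tiny illustration of that summation error (plain numpy, not xgboost code): combining per-chunk partial sums, as distributed histogram building does, can give a slightly different float32 result than a single pass.

import numpy as np

rng = np.random.RandomState(0)
g = rng.normal(size=1_000_000).astype(np.float32)

# One global accumulation vs. per-chunk partial sums that are then combined;
# for float32 the two results typically differ in the last bits.
total_single_pass = g.sum()
total_chunked = g.reshape(1000, 1000).sum(axis=1).sum()
print(total_single_pass, total_chunked)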

@mmccarty
Member

Thank you for the update, @trivialfis. Please let us know if you need any assistance.

@dancyfang

Thanks for all the discussion! I recently got on-boarded to dask-xgboost. Can someone summarize the reason for the irreproducibility?
