
Effect of random seed ignored (not a multithreading issue)... #2636

Closed
dataforager opened this issue Aug 24, 2017 · 3 comments
@dataforager
Follow-up on closed issue #113

A colleague of mine and I are trying to independently verify each other's prediction results using the same model parameters, training data and test data. He is using xgboost in R and I'm using xgboost in Python (though I have also tried the R version of the package on my machine for the purposes of this test).

What we've found is that we get the same predicted probabilities from the model on the test data using completely different seeds.

In R, we set the seed either by passing the 'seed' param to xgb.train or by calling R's set.seed() function.

We independently verified (by inspecting the value of .Random.seed after calling set.seed()) that the seed was indeed being changed (it was).

nthread was set to 1 for both training/test runs with the exact same model parameters. Predicted probabilities appear below and are identical for two different seed values.

Predicted probabilities (seed = 1):
0.4745588005
0.9879690409
0.5989014506
0.9906733632
0.5989014506
0.9928959012
0.1146880165
0.9928619266
0.9917168021
0.9958292842

Predicted probabilities (seed = 2):
0.4745588005
0.9879690409
0.5989014506
0.9906733632
0.5989014506
0.9928959012
0.1146880165
0.9928619266
0.9917168021
0.9958292842

In Python, we can confirm the same effect, though the resulting predicted probabilities differ from R's (another issue we suspect is related to our inability to control the random seed):

Again, we tried to either pass the seed as a parameter to xgb.train or set the random seed via numpy.random.seed() or python stdlib's random.seed(). And again, nthread was set to 1.
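(Editor's note: the Python-side analogue of the .Random.seed check described above for R can be sketched as below. This is a minimal stdlib-only illustration, independent of xgboost; it only confirms that the seeding call itself genuinely changes the interpreter's RNG state.)

```python
import random

def rng_state_after_seed(seed):
    """Seed Python's global RNG and return the resulting state,
    so the states produced by two different seeds can be compared."""
    random.seed(seed)
    return random.getstate()

# Different seeds really do produce different RNG states...
state_1 = rng_state_after_seed(1)
state_2 = rng_state_after_seed(2)
assert state_1 != state_2

# ...so identical predictions across seeds cannot be blamed
# on the seeding call failing to take effect.
```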

Predicted probabilities (seed = 1):
0.141121
0.98446
0.805141
0.949947
0.805141
0.979856
0.511622
0.990588
0.985136
0.987054

Predicted probabilities (seed = 2):
0.141121
0.98446
0.805141
0.949947
0.805141
0.979856
0.511622
0.990588
0.985136
0.987054

I've read previous posts related to this (including #113, mentioned above) that attribute it to multithreading. However, since we see the exact same results with nthread set to 1, we don't suspect that is the issue.

We'd appreciate any help we could get with this issue. Thanks a lot.

Pertinent system/package information can be found below.

Operating System: Ubuntu 16.04 LTS
Compiler: GNU gcc/g++ 5.4
Package used (python/R/jvm/C++): python and R
xgboost version used: 0.6a2

For Python:

  1. version: 2.7.12
  2. installation command: pip install xgboost

For R:

  1. sessionInfo(): R version 3.4.1 (2017-06-30)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xgboost_0.6-4

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 Matrix_1.2-10 tools_3.4.1 stringi_1.1.5
[6] grid_3.4.1 data.table_1.10.4 lattice_0.20-35
2. installation command within R session: install.packages("xgboost",dependencies=TRUE)

@khotilov
Member

Some parameter configurations are deterministic and some are random. You provided no details on what configuration you were running.

@Laurae2
Contributor

Laurae2 commented Aug 25, 2017

@dataforager Did you use colsample_bytree, colsample_bylevel, or subsample? If not, you have the expected behavior.
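(Editor's note: a toy sketch of the point above, not xgboost's actual implementation. Exact greedy split finding scans every candidate threshold and never consults the RNG, so it is seed-invariant; row/column subsampling, which subsample and the colsample_* parameters enable, is where the seed actually bites.)

```python
import random

def best_split(xs, ys):
    """Exact greedy split: try every threshold, keep the one
    minimising squared error. No randomness is consulted."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

# Deterministic: the chosen split is identical under any seed.
random.seed(1); split_a = best_split(xs, ys)
random.seed(2); split_b = best_split(xs, ys)
assert split_a == split_b

# Subsampling is where the seed matters: different seeds
# draw different row subsets.
sample_1 = random.Random(1).sample(range(100), 50)
sample_2 = random.Random(2).sample(range(100), 50)
assert sample_1 != sample_2
```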

@dataforager
Author

Thank you both. We double-checked our parameter settings and realized we had left these at their defaults (which are 1.0). So, no surprise we got deterministic behavior (whoops).

Appreciate the quick responses. Will close this.
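(Editor's note: for anyone landing here later, a sketch of the kind of parameter set under which the seed does affect results. The parameter names are standard xgboost parameters; the specific values here are illustrative only.)

```python
# With any of the sampling ratios below 1.0, xgboost's per-tree
# row/column sampling consults the RNG, so different seeds should
# now yield different models and predictions.
params = {
    "objective": "binary:logistic",
    "nthread": 1,
    "subsample": 0.8,         # row subsampling per tree
    "colsample_bytree": 0.8,  # column subsampling per tree
    "seed": 1,                # now actually matters
}
```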

@lock lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018