
[jvm-packages] spark job failed due to too many features? #3348

Closed
alice2008 opened this issue May 29, 2018 · 3 comments

@alice2008 alice2008 commented May 29, 2018

We noticed that the XGBoost Spark job is quite unstable for datasets with many features (>50K features per record). I wonder if this is because of the native synchronization mechanism embedded in the XGBoost library, i.e., every partition needs to load the full feature map and then calculate candidates to find the best split? In that case, if the candidate pool is too large (>50K), it could potentially lead to failure.

The other question I have is: are workers allowed to fail and restart during synchronization? Currently we set 'spark.task.maxFailures': 1. Would it be okay to bump it to 3 attempts, perhaps?

@yanboliang yanboliang (Contributor) commented Jun 4, 2018

> We noticed that the XGBoost Spark job is quite unstable for datasets with many features (>50K features per record). I wonder if this is because of the native synchronization mechanism embedded in the XGBoost library, i.e., every partition needs to load the full feature map and then calculate candidates to find the best split? In that case, if the candidate pool is too large (>50K), it could potentially lead to failure.

Yes, each partition needs to load the full feature map. What do you mean by unstable? What symptoms did you encounter? Can you share the error log?

@CodingCat CodingCat (Member) commented Jun 5, 2018

Regarding spark.task.maxFailures: no. The assumption in xgboost4j-spark is that if one of the workers fails, you have to restart the training.

To avoid starting everything from scratch, you can set checkpoint_path and checkpoint_interval to have checkpoints built during your training (https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/CheckpointManager.scala#L127-L141).
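A minimal sketch of how those two parameters might be wired up with xgboost4j-spark is shown below. Only checkpoint_path and checkpoint_interval come from the comment above; the HDFS path, column names, and the other parameter values are illustrative assumptions, and depending on your xgboost4j-spark version the entry point may differ (older releases used XGBoost.trainWithDataFrame rather than XGBoostClassifier).

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Hypothetical parameter map: only checkpoint_path / checkpoint_interval are
// taken from the comment above; the remaining values are illustrative.
val xgbParams: Map[String, Any] = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 200,
  "num_workers" -> 16,
  // Write a checkpoint to a shared filesystem every 10 rounds so that a
  // failed training run can resume from the last checkpoint instead of
  // starting over from round 0.
  "checkpoint_path" -> "hdfs:///tmp/xgboost-checkpoints", // assumed path
  "checkpoint_interval" -> 10
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features") // assumed column names
  .setLabelCol("label")

// trainingDF: a DataFrame with an assembled vector "features" column.
val model = classifier.fit(trainingDF)
```

A smaller checkpoint_interval loses less work when a worker dies but adds more checkpoint I/O per training run, so it is a trade-off against how often your cluster actually loses executors.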

@hcho3 hcho3 (Collaborator) commented Jul 5, 2018

Any future discussions should be taken to https://discuss.xgboost.ai/.

Alternatively, you may choose to open a new issue to report a bug. For the report, please include an example script that we maintainers can run to reproduce the bug.

@hcho3 hcho3 closed this Jul 5, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018