[Breaking Change][jvm-package] Rewrite XGBoost JVM Package #10415

Open
20 of 26 tasks
wbo4958 opened this issue Jun 14, 2024 · 1 comment · Fixed by #10416

wbo4958 commented Jun 14, 2024

Reason

The XGBoost JVM package is the Java language binding for the XGBoost library. It is supposed to be a lightweight, thin wrapper around XGBoost, but the current implementation is quite heavyweight. For example, it groups the dataset using RDDs for ranking, implements ranking inside XGBoostRegressor, samples the dataset into training and test sets, zips the training and evaluation datasets, and duplicates some code. All of these extras make the XGBoost JVM codebase difficult to read and maintain. In addition, the package lacks support for the latest XGBoost parameters and does not handle dense versus sparse data properly.

Goal

  • Create a new XGBoostRanker for ranking problems.
  • Improve code reuse.
  • Support DART booster
  • Remove "grouping" for ranking problems.
  • Remove the trainTestRatio and its implementation
  • Remove the "zip" of the train and eval datasets (add a new Boolean validation column to indicate whether an instance is for training or for evaluation)
  • Catch up with the latest XGBoost parameters.
  • Add setter/getter for all parameters.
  • Support XGBoost-style parameter names when defining the parameters (like final val baseScore = new DoubleParam(this, "base_score", "The initial )
  • Support dense data when the input is a vector type
  • Support sparse data when the input is a vector type
  • Support array input for both CPU and GPU
  • Support columnar input for CPU ???
  • Remove the current way of linking xgboost4j/xgboost4j-spark to the GPU
  • Use the existing fasterxml.jackson dependency to handle JSON.
  • Avoid repartitioning if the number of input partitions is equal to num_workers.
  • Fix surefire issue Missing surefire report of xgboost4j-spark #10316
  • More scalastyle checking.
  • Remove xgboost4j-gpu and move the implementation into xgboost4j-spark-gpu directly.
  • Remove the cudf dependency
  • Shade the xgboost4j/xgboost4j-spark into a single jar
  • Shade the xgboost4j/xgboost4j-spark/xgboost4j-spark-gpu into a single jar
  • [jvm-packages for GPU] Support numerical feature input #10387
  • Remove the RDD cache.
  • Fix XGBoost4j-spark CrossValidation train FAILED on multi-GPU environment: : Multiple processes running on same CUDA device is not supported! #10200
  • Fix give rabit tracker a port number #8294
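
As a rough illustration of two of the goals above (a Boolean validation column replacing the zipped train/eval datasets, and skipping the repartition when the partition count already equals num_workers), here is a minimal plain-Scala sketch. It deliberately avoids Spark so that it runs on its own. `Instance`, `isVal`, `RewriteSketch`, `splitByValidation` and `needsRepartition` are hypothetical names chosen for this example; they are not identifiers from the XGBoost codebase, and the real implementation may look quite different.

```scala
// Hypothetical row type: one training instance carrying a Boolean
// validation flag (assumption: true means "use for evaluation").
final case class Instance(features: Array[Float], label: Float, isVal: Boolean)

object RewriteSketch {
  // Split a single dataset into (training, evaluation) by the validation
  // flag, instead of zipping two separate datasets together.
  def splitByValidation(rows: Seq[Instance]): (Seq[Instance], Seq[Instance]) = {
    val (eval, train) = rows.partition(_.isVal)
    (train, eval)
  }

  // Repartition only when the current partition count differs from the
  // requested number of workers.
  def needsRepartition(numPartitions: Int, numWorkers: Int): Boolean =
    numPartitions != numWorkers
}
```

In a Spark pipeline the same idea would presumably be expressed as a filter on the validation column and a comparison against the dataset's partition count, but that wiring is not shown in this issue.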