Merge pull request #3 from CodingCat/fix_examples
adjust the API signature as well as the docs
tqchen committed Mar 11, 2016
2 parents 2a6ac6f + a3b2e76 commit 5798710
Showing 28 changed files with 118 additions and 95 deletions.
12 changes: 7 additions & 5 deletions doc/jvm/xgboost4j-intro.md
@@ -24,7 +24,7 @@ Many of these machine learning libraries (e.g. [XGBoost](https://github.com/dmlc/xgboost))
require new computation abstractions and native support (e.g. C++ for GPU computing).
They are also often [much more efficient](http://arxiv.org/abs/1603.02754).

-The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prohibits the smooth connection between these two types of systems, thus brings unnecessary inconvenience to the end user. The common workflow to the user is to utilize the systems like Flink/Spark to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet)) via the file system and then conduct the following machine learning phase. While such process won't hurt performance as much in data processing case(because machine learning takes a lot of time compared to data loading), it create a bit inconvenience for the users.
+The gap between the implementation fundamentals of general data processing frameworks and the more specialized machine learning libraries/systems prevents a smooth connection between the two types of systems, and thus brings unnecessary inconvenience to the end user. The common workflow is to use a system like Flink or Spark to preprocess/clean the data, hand the results to a machine learning system such as [XGBoost](https://github.com/dmlc/xgboost)/[MXNet](https://github.com/dmlc/mxnet) via the file system, and then run the machine learning phase there. While such a process does not hurt overall performance much (because training usually takes far longer than data loading), it creates real inconvenience for users.

We want the best of both worlds, so we can use data processing frameworks like Flink and Spark together with
the best distributed machine learning solutions.
@@ -37,7 +37,7 @@ XGBoost and XGBoost4J adopt the Unix philosophy.
XGBoost **does its best in one thing -- tree boosting** and is **designed to work with other systems**.
We strongly believe that a machine learning solution should not be restricted to a certain language or platform.

-Specifically, users will be able to use distributed XGBoost in both Flink and Spark.
+Specifically, users will be able to use distributed XGBoost in both Flink and Spark, and possibly in more frameworks in the future.
We have designed the API in a portable way so it **can be easily ported to other Dataflow frameworks provided by the Cloud**.
XGBoost4J shares its core with other XGBoost libraries, which means data scientists can use R/Python
to read and visualize models trained in a distributed fashion.
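
As a concrete sketch of that interoperability (assuming the single-machine Scala API shown in the walkthrough below; the paths are placeholders, and `saveModel` is the booster's serialization call):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// train a small model on one machine with the Scala API
val trainMax = new DMatrix("/path/to/data/agaricus.txt.train")
val params = Map("eta" -> 0.1, "max_depth" -> 2, "objective" -> "binary:logistic")
val booster = XGBoost.train(trainMax, params, 2, Map("train" -> trainMax))
// persist the model; the same file can then be loaded from the
// Python or R package to inspect and visualize the trees
booster.saveModel("/path/to/xgboost.model")
```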
@@ -85,10 +85,10 @@ watches += "test" -> testMax

val round = 2
// train a model
-val booster = XGBoost.train(params.toMap, trainMax, round, watches.toMap)
+val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)
```

-In Scala:
+We then evaluate our model:

```scala
val predicts = booster.predict(testMax)
@@ -111,7 +111,7 @@ In Spark, the dataset is represented as the [Resilient Distributed Dataset (RDD)
val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
```

-We move forward to train the models, in Spark:
+We move forward to train the model:

```scala
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
@@ -169,6 +169,8 @@ xgboostModel.predict(testData.map{x => x.vector})

This is the first release of the XGBoost4J package, and we are actively working on more features for the next release. You can watch our progress in the [XGBoost4J Road Map](https://github.com/dmlc/xgboost/issues/935).

+While we are trying our best to keep changes to the APIs minimal, they are still subject to incompatible changes.

## Further Readings

If you are interested in knowing more about XGBoost, you can find rich resources in
60 changes: 31 additions & 29 deletions jvm-packages/README.md
@@ -34,7 +34,7 @@ object XGBoostScalaExample {
// number of iterations
val round = 2
// train the model
-val model = XGBoost.train(paramMap, trainData, round)
+val model = XGBoost.train(trainData, paramMap, round)
// run prediction
val predTrain = model.predict(trainData)
// save model to the file.
@@ -43,34 +43,6 @@ object XGBoostScalaExample {
}
```
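
To come back to a saved model later, something along these lines should work (a sketch: `loadModel` is assumed here as the counterpart of the save step above, and the paths are placeholders):

```scala
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// load the booster persisted by the example above
val model = XGBoost.loadModel("/local/path/to/model")
// and score fresh data with it
val testData = new DMatrix("/path/to/data/agaricus.txt.test")
val predictions: Array[Array[Float]] = model.predict(testData)
```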

-### XGBoost Flink
-```scala
-import ml.dmlc.xgboost4j.scala.flink.XGBoost
-import org.apache.flink.api.scala._
-import org.apache.flink.api.scala.ExecutionEnvironment
-import org.apache.flink.ml.MLUtils
-
-object DistTrainWithFlink {
-  def main(args: Array[String]) {
-    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
-    // read trainining data
-    val trainData =
-      MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
-    // define parameters
-    val paramMap = List(
-      "eta" -> 0.1,
-      "max_depth" -> 2,
-      "objective" -> "binary:logistic").toMap
-    // number of iterations
-    val round = 2
-    // train the model
-    val model = XGBoost.train(paramMap, trainData, round)
-    val predTrain = model.predict(trainData.map{x => x.vector})
-    model.saveModelToHadoop("file:///path/to/xgboost.model")
-  }
-}
-```

### XGBoost Spark
```scala
import org.apache.spark.SparkContext
@@ -101,3 +73,33 @@
}
}
```
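
For scoring, the Spark-trained model is applied to a `DMatrix`, mirroring the `DistTrainWithSpark` example later in this commit; a minimal sketch, reusing the `sc` and `xgboostModel` values from the (collapsed) example above:

```scala
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark.DataUtils._
import org.apache.spark.mllib.util.MLUtils

// collect a (small) test set to the driver; the DataUtils implicits
// wrap the LabeledPoint iterator as a DMatrix
val testSet = MLUtils.loadLibSVMFile(sc, "/path/to/data/agaricus.txt.test").collect().iterator
val predictions = xgboostModel.predict(new DMatrix(testSet))
```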

+### XGBoost Flink
+```scala
+import ml.dmlc.xgboost4j.scala.flink.XGBoost
+import org.apache.flink.api.scala._
+import org.apache.flink.api.scala.ExecutionEnvironment
+import org.apache.flink.ml.MLUtils
+
+object DistTrainWithFlink {
+  def main(args: Array[String]) {
+    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
+    // read training data
+    val trainData =
+      MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
+    // define parameters
+    val paramMap = List(
+      "eta" -> 0.1,
+      "max_depth" -> 2,
+      "objective" -> "binary:logistic").toMap
+    // number of iterations
+    val round = 2
+    // train the model
+    val model = XGBoost.train(trainData, paramMap, round)
+    val predTrain = model.predict(trainData.map{x => x.vector})
+    model.saveModelToHadoop("file:///path/to/xgboost.model")
+  }
+}
+```
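
Note that in the Flink case the number of XGBoost workers follows the execution environment's parallelism (the Rabit tracker is started with `getParallelism`, as the `train` implementation later in this commit shows), so it can be pinned explicitly; a sketch:

```scala
import org.apache.flink.api.scala.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment
// four parallel tasks means four distributed XGBoost workers
env.setParallelism(4)
```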


@@ -67,7 +67,7 @@ public static void main(String[] args) throws IOException, XGBoostError {
int round = 2;

//train a boost model
-Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);

//predict
float[][] predicts = booster.predict(testMat);
@@ -111,7 +111,7 @@ public static void main(String[] args) throws IOException, XGBoostError {
HashMap<String, DMatrix> watches2 = new HashMap<String, DMatrix>();
watches2.put("train", trainMat2);
watches2.put("test", testMat2);
-Booster booster3 = XGBoost.train(params, trainMat2, round, watches2, null, null);
+Booster booster3 = XGBoost.train(trainMat2, params, round, watches2, null, null);
float[][] predicts3 = booster3.predict(testMat2);

//check predicts
@@ -48,7 +48,7 @@ public static void main(String[] args) throws XGBoostError {
watches.put("test", testMat);

//train xgboost for 1 round
-Booster booster = XGBoost.train(params, trainMat, 1, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, 1, watches, null, null);

float[][] trainPred = booster.predict(trainMat, true);
float[][] testPred = booster.predict(testMat, true);
@@ -57,6 +57,6 @@ public static void main(String[] args) throws XGBoostError {
testMat.setBaseMargin(testPred);

System.out.println("result of running from initial prediction");
-Booster booster2 = XGBoost.train(params, trainMat, 1, watches, null, null);
+Booster booster2 = XGBoost.train(trainMat, params, 1, watches, null, null);
}
}
@@ -49,7 +49,7 @@ public static void main(String[] args) throws IOException, XGBoostError {
//set additional eval_metrics
String[] metrics = null;

-String[] evalHist = XGBoost.crossValidation(params, trainMat, round, nfold, metrics, null,
+String[] evalHist = XGBoost.crossValidation(trainMat, params, round, nfold, metrics, null,
null);
}
}
@@ -163,6 +163,6 @@ public static void main(String[] args) throws XGBoostError {

//train a booster
System.out.println("begin to train the booster model");
-Booster booster = XGBoost.train(params, trainMat, round, watches, obj, eval);
+Booster booster = XGBoost.train(trainMat, params, round, watches, obj, eval);
}
}
@@ -56,6 +56,6 @@ public static void main(String[] args) throws XGBoostError {
int round = 2;

//train a boost model
-Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
}
}
@@ -60,7 +60,7 @@ public static void main(String[] args) throws XGBoostError {

//train a booster
int round = 4;
-Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);

float[][] predicts = booster.predict(testMat);

@@ -51,7 +51,7 @@ public static void main(String[] args) throws XGBoostError {

//train a booster
int round = 3;
-Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);

//predict use 1 tree
float[][] predicts1 = booster.predict(testMat, false, 1);
@@ -49,7 +49,7 @@ public static void main(String[] args) throws XGBoostError {

//train a booster
int round = 3;
-Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);

//predict using first 2 tree
float[][] leafindex = booster.predictLeaf(testMat, 2);
@@ -43,7 +43,7 @@ class BasicWalkThrough {

val round = 2
// train a model
-val booster = XGBoost.train(params.toMap, trainMax, round, watches.toMap)
+val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)
// predict
val predicts = booster.predict(testMax)
// save model to model path
@@ -78,7 +78,7 @@ class BasicWalkThrough {
val watches2 = new mutable.HashMap[String, DMatrix]
watches2 += "train" -> trainMax2
watches2 += "test" -> testMax2
-val booster3 = XGBoost.train(params.toMap, trainMax2, round, watches2.toMap, null, null)
+val booster3 = XGBoost.train(trainMax2, params.toMap, round, watches2.toMap, null, null)
val predicts3 = booster3.predict(testMax2)
println(checkPredicts(predicts, predicts3))
}
@@ -39,7 +39,7 @@ class BoostFromPrediction {

val round = 2
// train a model
-val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
+val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)

val trainPred = booster.predict(trainMat, true)
val testPred = booster.predict(testMat, true)
@@ -48,6 +48,6 @@ class BoostFromPrediction {
testMat.setBaseMargin(testPred)

System.out.println("result of running from initial prediction")
-val booster2 = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
+val booster2 = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
}
}
@@ -41,6 +41,6 @@ class CrossValidation {
val metrics: Array[String] = null

val evalHist: Array[String] =
-XGBoost.crossValidation(params.toMap, trainMat, round, nfold, metrics, null, null)
+XGBoost.crossValidation(trainMat, params.toMap, round, nfold, metrics, null, null)
}
}
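
The returned `evalHist` holds one evaluation summary per entry, presumably one per boosting round; printing it is the quickest way to inspect the cross-validation results (a sketch):

```scala
// print the per-round cross-validation summaries
evalHist.zipWithIndex.foreach { case (summary, round) =>
  println(s"round $round: $summary")
}
```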
@@ -150,8 +150,8 @@ class CustomObjective {

val round = 2
// train a model
-val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
-XGBoost.train(params.toMap, trainMat, round, watches.toMap, new LogRegObj, new EvalError)
+val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
+XGBoost.train(trainMat, params.toMap, round, watches.toMap, new LogRegObj, new EvalError)
}

}
@@ -45,7 +45,7 @@ class ExternalMemory {

val round = 2
// train a model
-val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
+val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)

val trainPred = booster.predict(trainMat, true)
val testPred = booster.predict(testMat, true)
@@ -54,6 +54,6 @@ class ExternalMemory {
testMat.setBaseMargin(testPred)

System.out.println("result of running from initial prediction")
-val booster2 = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
+val booster2 = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
}
}
@@ -52,7 +52,7 @@ class GeneralizedLinearModel {
watches += "test" -> testMat

val round = 4
-val booster = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
+val booster = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
val predicts = booster.predict(testMat)
val eval = new CustomEval
println(s"error=${eval.eval(predicts, testMat)}")
@@ -38,7 +38,7 @@ class PredictFirstNTree {

val round = 3
// train a model
-val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
+val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)

// predict use 1 tree
val predicts1 = booster.predict(testMat, false, 1)
@@ -39,7 +39,7 @@ class PredictLeafIndices {
watches += "test" -> testMat

val round = 3
-val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
+val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)

// predict using first 2 tree
val leafIndex = booster.predictLeaf(testMat, 2)
@@ -34,7 +34,7 @@ object DistTrainWithFlink {
// number of iterations
val round = 2
// train the model
-val model = XGBoost.train(paramMap, trainData, round)
+val model = XGBoost.train(trainData, paramMap, round)
val predTest = model.predict(testData.map{x => x.vector})
model.saveModelAsHadoopFile("file:///path/to/xgboost.model")
}
@@ -16,29 +16,34 @@

package ml.dmlc.xgboost4j.scala.example.spark

-import ml.dmlc.xgboost4j.scala.spark.XGBoost
+import ml.dmlc.xgboost4j.scala.DMatrix
+import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils

object DistTrainWithSpark {
def main(args: Array[String]): Unit = {
-if (args.length != 4) {
+if (args.length != 5) {
println(
"usage: program num_of_rounds num_workers training_path model_path")
"usage: program num_of_rounds num_workers training_path test_path model_path")
sys.exit(1)
}
val sc = new SparkContext()
val inputTrainPath = args(2)
-val outputModelPath = args(3)
+val inputTestPath = args(3)
+val outputModelPath = args(4)
// number of iterations
val numRound = args(0).toInt
-val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
+import DataUtils._
+val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath)
+val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath).collect().iterator
// training parameters
val paramMap = List(
"eta" -> 0.1f,
"max_depth" -> 2,
"objective" -> "binary:logistic").toMap
-val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
+val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound, nWorkers = args(1).toInt)
+xgboostModel.predict(new DMatrix(testSet))
// save model to HDFS path
xgboostModel.saveModelAsHadoopFile(outputModelPath)
}
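
Note the new `nWorkers` parameter: the number of distributed workers is now passed explicitly rather than implied by `repartition`-ing the input RDD, which this diff removes. A usage sketch with a fixed worker count:

```scala
// request four distributed workers, independent of the input partitioning
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound, nWorkers = 4)
```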
@@ -56,7 +56,7 @@ object XGBoost {
val trainMat = new DMatrix(dataIter, null)
val watches = List("train" -> trainMat).toMap
val round = 2
-val booster = XGBoostScala.train(paramMap, trainMat, round, watches, null, null)
+val booster = XGBoostScala.train(trainMat, paramMap, round, watches, null, null)
Rabit.shutdown()
collector.collect(new XGBoostModel(booster))
}
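
For context, each parallel task joins a Rabit allreduce ring for the duration of training. The surrounding worker logic presumably looks roughly like the sketch below; `workerEnvs` stands in for the environment map handed out by the tracker and is not shown in this hunk:

```scala
import ml.dmlc.xgboost4j.java.Rabit

// join the allreduce ring before training, and always leave it afterwards
Rabit.init(workerEnvs)
try {
  val booster = XGBoostScala.train(trainMat, paramMap, round, watches, null, null)
  collector.collect(new XGBoostModel(booster))
} finally {
  Rabit.shutdown()
}
```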
@@ -81,13 +81,14 @@
/**
* Train an XGBoost model with Flink.
*
-* @param params The parameters to XGBoost.
* @param dtrain The training data.
+* @param params The parameters to XGBoost.
* @param round Number of rounds to train.
*/
-def train(params: Map[String, Any],
-dtrain: DataSet[LabeledVector],
-round: Int): XGBoostModel = {
+def train(
+dtrain: DataSet[LabeledVector],
+params: Map[String, Any],
+round: Int): XGBoostModel = {
val tracker = new RabitTracker(dtrain.getExecutionEnvironment.getParallelism)
if (tracker.start()) {
dtrain
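With the reordered signature, call sites put the training data first, consistent with the Spark and single-machine APIs; e.g. (mirroring the Flink example earlier in this commit):

```scala
val model = XGBoost.train(trainData, paramMap, round)
```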
