[jvm-packages] XGBoost spark predictions not consistent between SparseVector and DenseVector #3634

Closed
pekaalto opened this issue Aug 26, 2018 · 5 comments

pekaalto commented Aug 26, 2018

Hi,
I have noticed that if I change the underlying feature type from SparseVector to DenseVector, the predictions change wildly.

I suspect the DenseVector path is not working correctly, and that the underlying issue is that DMatrix is not built properly when created from a LabeledPoint with indices = null. Those are created here:

Below is a script that reproduces the issue, together with an investigation of the LabeledPoints.

import com.github.fommil.netlib.BLAS.{getInstance => blas}
import ml.dmlc.xgboost4j.LabeledPoint
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark._
import org.apache.spark.SparkContext
import org.apache.spark.ml.linalg.{DenseVector, SparseVector}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{SQLContext, SparkSession}

import scala.util.Random

// extends App so the script runs as-is
object XGBTests extends App {

  val spark = {
    SparkSession.builder
      .master("local[2]")
      .config("spark.sql.shuffle.partitions", "2")
      .config("spark.default.parallelism", "2")
      .appName("pekkatest")
      .getOrCreate()
  }

  val sc: SparkContext = spark.sparkContext
  val sqlContext: SQLContext = spark.sqlContext

  import sqlContext.implicits._

  case class Obs(features: Array[Float], label: Float)

  def createArtificialData(numFeatures: Int, numObs: Int): Array[Obs] = {
    val weights = (1 to numFeatures).map { _ => Random.nextDouble() }.toArray

    (1 to numObs).map { id =>
      val x = (1 to numFeatures).map { _ =>
        if (Random.nextDouble() > 0.7) Random.nextDouble() + 0.1 else 0.0
      }.toArray
      val y = blas.ddot(numFeatures, x, 1, weights, 1) + 3 * Random.nextGaussian()
      Obs(x.map(_.toFloat), y.toFloat)
    }.toArray
  }

  val numFeatures = 20
  val dataSize = 1000

  val data = createArtificialData(numFeatures, dataSize)

  def obsToDf(data: Array[Obs]) = {
    val sparkData = data.zipWithIndex.map { case (obs, idx) =>
      (idx, new DenseVector(obs.features.map(_.toDouble)), obs.label.toDouble)
    }.toSeq
    sc.parallelize(sparkData).toDF("id", "features", "label").cache()
  }

  val df = obsToDf(data)
  df.show()

  val xgboostModel = {
    new XGBoostRegressor()
      .setNumWorkers(2)
      .setObjective("reg:linear")
      .setTrainTestRatio(0.6)
      .setNumRound(10)
      .setMaxDepth(10)
      .setEvalMetric("rmse")
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setPredictionCol("preds")
      .fit(df)
  }

  val smallDfDense = df.limit(2).cache()

  val toSparseUdf = udf { (x: DenseVector) => x.toSparse }
  val toDenseUdf = udf { (x: SparseVector) => x.toDense }

  val smallDfSparse = smallDfDense.withColumn("features", toSparseUdf($"features"))

  //These should return the same prediction but they don't
  xgboostModel.transform(smallDfDense).show()
  xgboostModel.transform(smallDfSparse).show()

  val values = Array(1.0f, 2.0f)

  val sparseLabeledPoint = LabeledPoint(1.0f, Array(0, 1), values = values)
  val denseLabeledPoint = LabeledPoint(1.0f, null, values = values ++ Array.fill[Float](numFeatures - 2)(0.0f))
  val denseLabeledPointLessZeros = LabeledPoint(1.0f, null, values = values ++ Array.fill[Float](numFeatures - 8)(0.0f))
  
  val dmFromDensePoint = new DMatrix(Seq(
    denseLabeledPoint
  ).toIterator)

  val dmFromSparsePoint = new DMatrix(Seq(
    sparseLabeledPoint
  ).toIterator)

  val dmFromData = new DMatrix(
    values ++ Array.fill[Float](numFeatures - 2)(0.0f),
    nrow = 1, ncol = numFeatures
  )
  
  val dmAll = new DMatrix(Seq(
    sparseLabeledPoint,
    denseLabeledPoint,
    denseLabeledPointLessZeros
  ).toIterator)
  
  
  //These two give consistent results...
  xgboostModel.nativeBooster.predict(dmFromSparsePoint).foreach(a => println(a.head))
  xgboostModel.nativeBooster.predict(dmFromData).foreach(a => println(a.head))
  
  //However this is very different...
  xgboostModel.nativeBooster.predict(dmFromDensePoint).foreach(a => println(a.head))
  
  // Putting all three LabeledPoints into a single DMatrix doesn't change these results.
  // However, the number of trailing zeros in the dense LabeledPoint changes its prediction,
  // which doesn't seem correct.
  xgboostModel.nativeBooster.predict(dmAll).foreach(a => println(a.head))
}
@hcho3 hcho3 changed the title XGBoost spark predictions not consistent between SparseVector and DenseVector [jvm-packages] XGBoost spark predictions not consistent between SparseVector and DenseVector Aug 30, 2018
CodingCat commented Sep 19, 2018

Can you try setting the missing parameter to 0 when predicting with a dense vector?
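
For reference, here is roughly what that looks like (a minimal sketch, assuming the setMissing setter on XGBoostRegressor and reusing df, smallDfDense and smallDfSparse from the repro script above):

val xgboostModelZeroMissing = new XGBoostRegressor()
  .setNumWorkers(2)
  .setObjective("reg:linear")
  .setNumRound(10)
  .setMaxDepth(10)
  .setMissing(0.0f)  // treat explicit zeros as missing, like the sparse path
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("preds")
  .fit(df)

// With missing = 0, these two should give the same predictions
xgboostModelZeroMissing.transform(smallDfDense).show()
xgboostModelZeroMissing.transform(smallDfSparse).show()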

hcho3 commented Oct 9, 2018

@pekaalto Any updates?

pekaalto commented Oct 10, 2018

Oh yes,

Can you try setting the missing parameter to 0 when predicting with a dense vector?

Yes, when that parameter is set, the predictions are the same.

But should this be handled differently somehow? I think for many Spark users the input DataFrame contains a mix of sparse and dense vectors. For example VectorAssembler, which is usually the last step before the model, creates a SparseVector or a DenseVector depending on how many non-zero values there are:

https://github.com/apache/spark/blob/a5925c1631e25c2dcc3c2948cea31e993ce66a97/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L295

So when the default missing is Float.NaN, this can cause very subtle issues with the model.
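
To make that concrete, a small sketch (hypothetical data, reusing the Spark session and implicits from the repro script above) of how VectorAssembler picks the representation per row:

import org.apache.spark.ml.feature.VectorAssembler

val raw = Seq(
  (1.0, 2.0, 3.0, 4.0, 5.0),  // mostly non-zero -> assembled as a DenseVector
  (0.0, 0.0, 0.0, 0.0, 5.0)   // mostly zero     -> assembled as a SparseVector
).toDF("a", "b", "c", "d", "e")

val assembled = new VectorAssembler()
  .setInputCols(Array("a", "b", "c", "d", "e"))
  .setOutputCol("features")
  .transform(raw)

// With missing = Float.NaN, the dense row keeps its explicit zeros while the
// sparse row's zeros are dropped, so identical values can score differently.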

hcho3 commented Oct 10, 2018

Is it common to treat 0 and missing values the same? @CodingCat Would it be worthwhile to change the default value of missing to 0?

CodingCat commented Oct 10, 2018

missing is set to NaN to be consistent with the other bindings:

missing : float, optional
    Value in the data which needs to be present as a missing value. If
    None, defaults to np.nan.

I am not sure it's a good idea to make it different across language APIs... what if a data scientist uses a model trained from Spark in Python?

@CodingCat CodingCat closed this Oct 17, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jan 15, 2019