Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XGBoost4J-Spark] Problem for (de)serialization of Spark implementation of custom objective and eval #7274

Merged
merged 16 commits into from Oct 21, 2021

Conversation

nicovdijk
Copy link
Contributor

Suggested solution to the issue in the title.

@trivialfis
Copy link
Member

Thanks for the PR!

@wbo4958 Could you please take a look when you are available?

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Sep 30, 2021 via email

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 1, 2021

@nicovdijk What issue does your PR try to solve? Could you add more descriptions?

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 1, 2021

@nicovdijk What issue does your PR try to solve? Could you add more descriptions?

@wbo4958: This PR is a solution to issue #7224 with the same name.

Copy link
Contributor

@wbo4958 wbo4958 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicovdijk I just figured out what your PR tries to fix. Overall, You PR can fix this issue. But it's not straight forward for users who would like to customize their obj/eval, it requires users explicitly set the type hint.

Could we have callback way to do it by xgboost?

import ml.dmlc.xgboost4j.scala.{DMatrix, ObjectiveTrait}
import ml.dmlc.xgboost4j.scala.spark.params.CustomObjParam._
import org.apache.commons.logging.LogFactory
import org.json4s.ShortTypeHints
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used. better to remove it

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 5, 2021

@nicovdijk I just figured out what your PR tries to fix. Overall, You PR can fix this issue. But it's not straight forward for users who would like to customize their obj/eval, it requires users explicitly set the type hint.

Could we have callback way to do it by xgboost?

@wbo4958, I would like to automate it, yes, but I'm not sure what is the best way.

  • We could implement adding the type hint in abstract classes SparkCustomObjective and SparkCustomEval, see the new commit. What do you think?

Also:

  • I was also wondering how persistence works in xgboost4j. Is it possible there to save and load including a custom objective and eval? I don't see any persistence test, but maybe I'm not looking in the right place...
  • The type hints are now added when an instance of a custom objective or eval is created. How can I make sure the hints are added when creating the classes? Right now, saving a model, then restarting the java program and trying to load would not work...

PS The comments have been fixed (I don't seem to have a proper linter).

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 10, 2021

XGBoost2MLlibParams https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/params/GeneralParams.scala#L260 will check and set the param, Could we add some matches for CustomObjParam and CustomEvalParam and set the TypeHints here?

@nicovdijk
Copy link
Contributor Author

XGBoost2MLlibParams https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/params/GeneralParams.scala#L260 will check and set the param, Could we add some matches for CustomObjParam and CustomEvalParam and set the TypeHints here?

That works nicely, thanks for the suggestion.

The only problem left is that we need to make sure that the type hint gets added, also before creating an instance for the first time (now it gets added when we save first). Any suggestions?

Maybe one would like to create a dummy instance somewhere when creating the class itself. However, I don't really how this should work for someone using XGBoost..

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 11, 2021

XGBoost2MLlibParams https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/params/GeneralParams.scala#L260 will check and set the param, Could we add some matches for CustomObjParam and CustomEvalParam and set the TypeHints here?

That works nicely, thanks for the suggestion.

The only problem left is that we need to make sure that the type hint gets added, also before creating an instance for the first time (now it gets added when we save first). Any suggestions?

Maybe one would like to create a dummy instance somewhere when creating the class itself. However, I don't really how this should work for someone using XGBoost..

We could add define an Interface (Trait) for this

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 12, 2021

We could add define an Interface (Trait) for this

I'm not sure I understand.. Ideally we should register the type hint before instantiation. However, it seems to me that we cannot do that for someone using XGBoost. They would have to do that themselves.

We would like to do the registration when defining the class (maybe in the companion object?). In python, for instance, you can run some code when importing a module, but in Scala I'm not sure how to accomplish this (objects are lazy). I don't know how a Trait can help me with this. Can you explain more?

The following code demonstrates the issue (I think):

import org.scalatest.FunSuite

class MyTest extends FunSuite {

  trait TestTrait {}

  object TestTrait {
    private var typeHintsAdded = Set[String]()

    def addTypeHint(instance: Any): Unit = {
      val clazz = instance.getClass()
      val className = clazz.getSimpleName()
      if (!typeHintsAdded.contains(className)) {
        typeHintsAdded += className
      }
    }

    def printTypeHints() {
      println(s"Type hints:\n$typeHintsAdded")
    }
  }

  class TestClass extends TestTrait {
    TestTrait.addTypeHint(this)
  }

  object TestClass extends TestTrait {
  }

  test("testing adding type hint while defining class") {
    println("Before creating instance\n")
    TestTrait.printTypeHints()
    val testClass = new TestClass
    println("After creating instance\n")
    TestTrait.printTypeHints()
  }
}

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 12, 2021

If users customize their obj, then they must extends ObjectiveTrait, So what am I think is whether we can add another trait like ABC.scala where define some methods to add TypeHints, and let ObjectiveTrait extends ABC. Then when users try to customize, they will be forced to implement the method adding TypeHint.

@nicovdijk
Copy link
Contributor Author

I would like a general TypeHint adding trait. However, modifying ObjectiveTrait in that way would also force users that do not use Spark to implement that method... Is that a good idea?

Also, I still don't see how that solves the problem about adding type hints before instantiation of the corresponding class. Sorry if I'm slow..

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 12, 2021

would like a general TypeHint adding trait. However, modifying ObjectiveTrait in that way would also force users that do not use Spark to implement that method... Is that a good idea?

Also, I still don't see how that solves the problem about adding type hints before instantiation of the corresponding class. Sorry if I'm slow..

  1. End users customize obj, then they must implement the method adding type hint. So the customized obj has the instance of type hint.
  2. We can add a case to match CustomObjParam and add another method to set the type hint from the customized obj into the CustomObjParam in XGBoost2MLlibParams

Please correct me.

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 13, 2021

would like a general TypeHint adding trait. However, modifying ObjectiveTrait in that way would also force users that do not use Spark to implement that method... Is that a good idea?
Also, I still don't see how that solves the problem about adding type hints before instantiation of the corresponding class. Sorry if I'm slow..

  1. End users customize obj, then they must implement the method adding type hint. So the customized obj has the instance of type hint.
  2. We can add a case to match CustomObjParam and add another method to set the type hint from the customized obj into the CustomObjParam in XGBoost2MLlibParams

Please correct me.

  1. The issue is not that a method has to be implemented (we can implement the method for the end user), but that the user needs to execute that method in order to actually add the TypeHint (see the code in one of my previous comments).
  2. I had already implemented two additional cases in XGBoost2MLlibParams that took care of adding TypeHints without the need for an additional method.

However, the downside of adding TypeHints in XGBoost2MLlibParams is that creating an instance of the CustomObj or CustomEval is no longer enough. One would need to use them as parameters in an XGBoostRegressor or XGBoostClassifier... I think it's better to execute the logic upon instantiation of a custom objective or eval.

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 13, 2021

Hi @nicovdijk, Personally, this still seems like a workaround instead of a solution. I will try to make a commit locally to see if my suggestion can work.

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 13, 2021

@wbo4958, I'm still not really sure what you mean, so I would like to see any suggestion you have, thanks!

By the way, I did not include the type hint trait in the objective trait, because that's in the xgboost4j package which does not serialize the parameters, if I'm correct. I don't think it has the json4s dependency either. But that's definitely possible, if that's what you mean. Is it?

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 14, 2021

@nicovdijk Thx for your feed back. Yeah, I just realize the ObjectiveTrait is only in XGBoost4j.

BTW, Have you tested the scenario just load the model with customized obj/eval without any other train process?

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 14, 2021

@wbo4958, I haven't tested it using XGBoost (the tests create temporary files, to do this test we would need a saved model as resource), but the script I provided earlier demonstrates the issue. So I'm pretty sure that it would fail at the moment.

The user would have to create temporary instances of all custom obj/evals when starting the program, then everything works. Unless we have some initialization logic where we could do this for the user.

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 14, 2021

Ok, Could you refine your PR, then we can review it

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 14, 2021

Ok, Could you refine your PR, then we can review it

@wbo4958 Sorry, what refinements exactly would you like?

@codecov-commenter
Copy link

codecov-commenter commented Oct 15, 2021

Codecov Report

Merging #7274 (3ad1414) into master (4fd149b) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #7274   +/-   ##
=======================================
  Coverage   83.68%   83.68%           
=======================================
  Files          13       13           
  Lines        3885     3885           
=======================================
  Hits         3251     3251           
  Misses        634      634           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4fd149b...3ad1414. Read the comment docs.

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 15, 2021

@wbo4958, I refined 😄 the PR.

There is one last thing I would like to add. We use the Spark API in a Java project. Using traits with implementations is a bit cumbersome then. Can we add an abstract class SparkCustomObjective or JavaCustomObjective (or any other name) to add support for that? It should probably extends IObjective with TypeHintsTrait.

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 15, 2021

@wbo4958, I refined smile the PR.

There is one last thing I would like to add. We use the Spark API in a Java project. Using traits with implementations is a bit cumbersome then. Can we add an abstract class SparkCustomObjective or JavaCustomObjective (or any other name) to add support for that? It should probably extends IObjective with TypeHintsTrait.

In that case, SparkCustomObjective will break the public API.

Anyway, this PR looks good for now. We can merge it for now and we will refine this once we figured out a better way for this.

Thx @nicovdijk

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 15, 2021

@trivialfis Please help to review it, and please change the [XGBoost-spark] to [jvm-packages]

@nicovdijk
Copy link
Contributor Author

There is one last thing I would like to add. We use the Spark API in a Java project. Using traits with implementations is a bit cumbersome then. Can we add an abstract class SparkCustomObjective or JavaCustomObjective (or any other name) to add support for that? It should probably extends IObjective with TypeHintsTrait.

In that case, SparkCustomObjective will break the public API.

Just to be clear: I would add a Scala abstract class that would be easier to use in Java code, nothing more. On the other hand, I guess there is a work-around by manually using the SavedTypeHints object to add TypeHints.

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 15, 2021

There is one last thing I would like to add. We use the Spark API in a Java project. Using traits with implementations is a bit cumbersome then. Can we add an abstract class SparkCustomObjective or JavaCustomObjective (or any other name) to add support for that? It should probably extends IObjective with TypeHintsTrait.

In that case, SparkCustomObjective will break the public API.

Just to be clear: I would add a Scala abstract class that would be easier to use in Java code, nothing more. On the other hand, I guess there is a work-around by manually using the SavedTypeHints object to add TypeHints.

You mean xgboost4j introduces the trait from xgboost4j-spark?

@nicovdijk
Copy link
Contributor Author

nicovdijk commented Oct 15, 2021

You mean xgboost4j introduces the trait from xgboost4j-spark?

No, I only want to add the following to xgboost4j-spark:

abstract class SparkCustomEval extends EvalTrait with TypeHintsTrait

abstract class SparkCustomObjective extends ObjectiveTrait with TypeHintsTrait

Then in Java, you use SparkCustomEval and SparkCustomObjective directly (you cannot have multiple inheritance in Java).

@nicovdijk
Copy link
Contributor Author

@wbo4958, I think I did what you wanted. It was the first time I used rebasing, so please check if the result is correct, thanks.

Copy link
Contributor

@wbo4958 wbo4958 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wbo4958
Copy link
Contributor

wbo4958 commented Oct 20, 2021

@trivialfis Could you help to review it

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me. Thanks for the nice work!

@trivialfis trivialfis merged commit 31a307c into dmlc:master Oct 21, 2021
@abugh
Copy link

abugh commented Nov 24, 2021

@trivialfis @nicovdijk Hi I'm a user of XGBoost and get caught by this issue for a long time. Just as @wbo4958 said, is it possible for XGBoost to solve it so that XGboostModel can be integrated as a stage of a Spark PipelineMode when we are using custom objectives. For now, we must use XGBoostClassificationModel.load(), or else it will cause errors. It's really inconvenient for the users.

@nicovdijk
Copy link
Contributor Author

@abugh, I believe this merged PR does exactly that. Is this not the case?

@wbo4958
Copy link
Contributor

wbo4958 commented Nov 25, 2021

@abugh, Yeah, Please try the newest release

@abugh
Copy link

abugh commented Nov 25, 2021

@nicovdijk But I see that in the jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/PersistenceSuite.scala unit-test, you are still using XGBoostClassificationModel .save() and XGBoostClassificationModel .load() .

How about if we transfer XGBoostClassificationModel into a stage of Spark PipelineModel, just like this val pipelineModel = new Pipeline().setStages(Array(XGBoostClassificationModelxxx)).fit(xxx)? and then use pipelineModel.save and PipelineModel.load to save and load. Do you mean that the newest xgboost support this now? I didn't see such case in unit-test in this PR.
Thanks for replying.

@abugh
Copy link

abugh commented Nov 25, 2021

Now I'm using XGBoost-Spark 1.3.1, with scala 2.1.2 . When I'm using XGBoostRegressionModel with Customized Objectives, I can't even load the model with XGBoostRegressionModel.load() and (I used XGBoostRegressionModel.save() to save the model).
I thought that the customized objectives are the core module of the XGBoost, but seems that it doesn't work for me even when I'm doing the standard operations(XGBoostRegressionModel.load()). Is there anything wrong in my operations? Or is there anyway for me to solve this problem except by upgrading the xgboost?
If I must upgrade the xgboost version, which version at least I need to upgrade to?

@abugh
Copy link

abugh commented Nov 25, 2021

And I see that in the latest version(1.5.1), the unit-test about customized objectives is replaced by using integrated objective.
https://github.com/dmlc/xgboost/blob/release_1.5.1/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/PersistenceSuite.scala

@nicovdijk
Copy link
Contributor Author

@abugh, this PR is not included in the 1.5.0 release or the patch release 1.5.1. It is included in the master branch. I do not know why it's not included yet nor when it will be.

I have even tested saving and loading in a pipeline in master - it works (which is expected, since it's just using the save and load implementation of the xgboost stage).

@abugh
Copy link

abugh commented Nov 25, 2021

@nicovdijk Really thanks for your applying. So do you mean that loading in a pipeline is same to loading in XGBoostRegressionModel? Or else, If loading in XGBoostRegressionModel works well, loading in a pipeline should also work well?

I have even tested saving and loading in a pipeline in master - it works Yeah I think it makes sense, as masters branch has already admit this PR, and as you said, this PR is just used to solve this problem. But if I'm using the release Patch(1.5.1 or 1.3.1)m it doesn't work. Any suggestion you can give for me if I can only use release patch for some reasons?

@abugh
Copy link

abugh commented Nov 25, 2021

Oh! good news. I change to use 1.5.1 release patch, seems it works now when using loading in XGBoostRegressionModel(loading in pipelineMode isn't tested yet, but I guess it works.)

I don't know the reason, as you mentioned that 1.5.1 release patch doesn't admit this PR. But I just share this information to you.

@nicovdijk
Copy link
Contributor Author

@nicovdijk Really thanks for your applying. So do you mean that loading in a pipeline is same to loading in XGBoostRegressionModel? Or else, If loading in XGBoostRegressionModel works well, loading in a pipeline should also work well?

I have even tested saving and loading in a pipeline in master - it works Yeah I think it makes sense, as masters branch has already admit this PR, and as you said, this PR is just used to solve this problem. But if I'm using the release Patch(1.5.1 or 1.3.1)m it doesn't work. Any suggestion you can give for me if I can only use release patch for some reasons?

@trivialfis, do you know if/when the code in this PR gets released?

@abugh
Copy link

abugh commented Nov 26, 2021

And also could you tell me which spark version is needed when using xgboost4j-spark_1.51 with Scala 2.12?
Is 3.2.0 needed? Or just 3.1.x is OK?

@abugh
Copy link

abugh commented Nov 27, 2021

Oh! good news. I change to use 1.5.1 release patch, seems it works now when using loading in XGBoostRegressionModel(loading in pipelineMode isn't tested yet, but I guess it works.)

I don't know the reason, as you mentioned that 1.5.1 release patch doesn't admit this PR. But I just share this information to you.

Unfortunately, seems that the successful run was a coincidence. I can't recurrent it even with the same conditions, and not know the reason.

It would be nice if you can tell me the plan/schedule to release this PR in the release patch. I will wait for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants