Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23455][ML] Default Params in ML should be saved separately in metadata #20633

Closed
wants to merge 5 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 17, 2018

What changes were proposed in this pull request?

We save ML's user-supplied params and default params as one entity in metadata. During loading the saved models, we set all the loaded params into created ML model instances as user-supplied params.

It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading.

The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata.

For backward compatibility, when loading metadata, if it is a metadata file from previous Spark, we shouldn't raise error if we can't find the default param field.

How was this patch tested?

Pass existing tests and added tests.

@SparkQA
Copy link

SparkQA commented Feb 17, 2018

Test build #87521 has finished for PR 20633 at commit 69648d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 17, 2018

cc @jkbradley This is to save default params separately in metadata file. Please help review after 2.3. Thanks!

@viirya
Copy link
Member Author

viirya commented Mar 2, 2018

also cc @MLnick @WeichenXu123 for review.

*/
def getParamValue(paramName: String): JValue = {
def getParamValue(paramName: String, isDefaultParam: Boolean = false): JValue = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the logic be "lookup params first, if not exist then lookup defaultParams" ?
If so, we don't need the isDefaultParam param.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I will change this.

@WeichenXu123
Copy link
Contributor

@viirya Have you run backward compatibility test ? e.g., saving estimator/models via master version spark, and load estimator/models via this PR version spark. To check whether it works fine and all params get saved and restored correctly.
Especially this should be done against QuantileDiscretizer/Bucketizer which you remove temporary fix code.

@viirya
Copy link
Member Author

viirya commented Mar 6, 2018

@WeichenXu123 I've added unit test in DefaultReadWriteSuite/DefaultReadWriteTest to test if this can read old metadata back.

Sounds like the backward compatibility test you suggested should be checked manually. I will test it. Thanks!

@SparkQA
Copy link

SparkQA commented Mar 6, 2018

Test build #87996 has finished for PR 20633 at commit 166cdbb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 6, 2018

retest this please.

@viirya
Copy link
Member Author

viirya commented Mar 6, 2018

@WeichenXu123 Thanks for comment! I've run the backward compatibility test locally against QuantileDiscretizer/Bucketizer. They work fine. After this PR, the saved models from 2.3.0 can be read back and all params are restored correctly.

@WeichenXu123
Copy link
Contributor

LGTM. Thanks! @jkbradley

@SparkQA
Copy link

SparkQA commented Mar 6, 2018

Test build #87998 has finished for PR 20633 at commit 166cdbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2018

cc @dbtsai if you have time to look at this too.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I made a review pass, and it looks like a nice setup.

@@ -791,7 +791,7 @@ trait Params extends Identifiable with Serializable {
* this method gets called.
* @param value the default value
*/
protected final def setDefault[T](param: Param[T], value: T): this.type = {
private[ml] final def setDefault[T](param: Param[T], value: T): this.type = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leave this as protected. It's important that 3rd-party libraries be able to extend MLlib APIs, and this is one they need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we need to call setDefault in Metadata.setParams, I can't leave it as protected. But I think it is important to keep the extensibility. And I think we can't make it as public too. I add object Params and use it to help us update default param of a Params.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense; I like your solution.

val basicMetadata = ("class" -> cls) ~
("timestamp" -> System.currentTimeMillis()) ~
("sparkVersion" -> sc.version) ~
("uid" -> uid) ~
("defaultParamMap" -> jsonDefaultParams) ~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about putting this below paramMap since that's nicer for viewing the JSON?

def getAndSetParams(
instance: Params,
skipParams: Option[List[String]] = None): Unit = {
setParams(instance, false, skipParams)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: It's nice to pass boolean args by name

}
}
// For metadata file prior to Spark 2.4, there is no default section.
case JNothing if isDefault && (major == 2 && minor < 4 || major < 2) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic would be simpler if this check were put in the getAndSetParams method, which could just skip calling setParams(instance, true, skipParams) for Spark 2.3-.

}
}

test("User-suppiled value for default param should be kept after persistence") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suppiled -> supplied

assert(loadedMyParams.get(myParams.intParamWithDefault).get == 100)
}

test("Read metadata without default field prior to 2.4") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like this setup.

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88502 has finished for PR 20633 at commit a2b6917.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 22, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88510 has finished for PR 20633 at commit a2b6917.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 23, 2018

@jkbradley Thanks for your comments! I've addressed them. Please review it again. Thank you.

@viirya
Copy link
Member Author

viirya commented Mar 31, 2018

ping @jkbradley

@@ -905,6 +905,15 @@ trait Params extends Identifiable with Serializable {
}
}

object Params {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make this object package private (for cleaner docs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done.

@@ -791,7 +791,7 @@ trait Params extends Identifiable with Serializable {
* this method gets called.
* @param value the default value
*/
protected final def setDefault[T](param: Param[T], value: T): this.type = {
private[ml] final def setDefault[T](param: Param[T], value: T): this.type = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense; I like your solution.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow response!


// For metadata file prior to Spark 2.4, there is no default section.
val (major, minor) = VersionUtils.majorMinorVersion(sparkVersion)
if (major >= 2 && minor >= 4) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be:
if (major > 2 || (major == 2 && minor >= 4)) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're correct. I'll fix it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Test build #88880 has finished for PR 20633 at commit de93e65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Copy link
Member

dbtsai commented Apr 17, 2018

Can you address the conflicts? Gonna start to review it soon. Thanks.

@viirya
Copy link
Member Author

viirya commented Apr 17, 2018

@dbtsai Thanks! I've solved the conflicts.

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89421 has finished for PR 20633 at commit 80b668a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 19, 2018

ping @jkbradley

@jkbradley
Copy link
Member

jkbradley commented Apr 23, 2018

Sorry for the pause in review.
LGTM
Merging with master (pending fresh tests)

@dbtsai I'm going to merge this since I'm worried it will collect more conflicts, but let's discuss more if needed.

@viirya We'll need to update Python's DefaultParamsReader as well for Spark 2.4 in order to keep it in sync with Scala/Java. R thankfully should not require anything since it only has wrappers. I'll make & link a JIRA: https://issues.apache.org/jira/browse/SPARK-24058 Will you have time to work on that?

Thanks @viirya !

@viirya
Copy link
Member Author

viirya commented Apr 23, 2018

@jkbradley Thanks! I'll work on SPARK-24058.

@SparkQA
Copy link

SparkQA commented Apr 24, 2018

Test build #4156 has finished for PR 20633 at commit 80b668a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 8301375 Apr 24, 2018
@viirya viirya deleted the save-ml-default-params branch December 27, 2023 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants