Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11237][ML] Add pmml export for k-means in Spark ML #20907

Closed

Conversation

holdenk
Copy link
Contributor

@holdenk holdenk commented Mar 26, 2018

What changes were proposed in this pull request?

Adding PMML export to Spark ML's KMeans Model.

How was this patch tested?

New unit test for Spark ML PMML export based on the old Spark MLlib unit test.

@holdenk
Copy link
Contributor Author

holdenk commented Mar 26, 2018

cc @BryanCutler @sethah @MLnick

@SparkQA
Copy link

SparkQA commented Mar 26, 2018

Test build #88599 has finished for PR 20907 at commit 25d6f77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, one question

// Save model data: cluster centers
val data: Array[ClusterData] = instance.clusterCenters.zipWithIndex.map {
case (center, idx) =>
ClusterData(idx, center)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this type change Data -> ClusterData change the schema of the output parquet file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure. I'll manually test we can load the old format first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait no this shouldn't change anything, were saving this with a DataFrame and the schema is the same.
See the schema from 1: res3: org.apache.spark.sql.types.StructType = StructType(StructField(clusterIdx,IntegerType,false), StructField(clusterCenter,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) and the new one org.apache.spark.sql.types.StructType = StructType(StructField(clusterIdx,IntegerType,false), StructField(clusterCenter,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// model.
val pmmlClusteringModel = pmml.getModels.get(0).asInstanceOf[ClusteringModel]
assert(pmmlClusteringModel.getNumberOfClusters === clusterCenters.length)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this missing a call to testPMMLWrite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah :( Thanks for catching that.

@SparkQA
Copy link

SparkQA commented Mar 30, 2018

Test build #88766 has finished for PR 20907 at commit e2dbe15.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 2, 2018

Test build #88823 has finished for PR 20907 at commit e0f9b09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [WIP][SPARK-11237][ML] Add pmml export for k-means in Spark ML [SPARK-11237][ML] Add pmml export for k-means in Spark ML Apr 2, 2018
Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, is this still a wip? Just noticed you changed the title, so is this ready?

@holdenk
Copy link
Contributor Author

holdenk commented Apr 4, 2018

Yup, this is ready. If you think its good feel free to merge it otherwise I'll merge it during my regular Friday review time :)

@viirya
Copy link
Member

viirya commented Apr 23, 2018

LGTM

@viirya
Copy link
Member

viirya commented Apr 23, 2018

One question I have is, how do users know if a model (e.g. KMeansModel after this change) supports pmml & internal formats without looking into source code? I did a search on the current docs, but didn't find any info.

@dbtsai
Copy link
Member

dbtsai commented Apr 23, 2018

LGTM too! +1 on the documentation which can be a followup PR. Merged into master, and thanks.

DB Tsai | Siri Open Source Technologies |  Apple, Inc

@asfgit asfgit closed this in e82cb68 Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants