[SPARK-13672] [ML] Add python examples of BisectingKMeans in ML and MLLIB #11515

Closed
wants to merge 12 commits into from

zhengruifeng
Contributor

JIRA: https://issues.apache.org/jira/browse/SPARK-13672

What changes were proposed in this pull request?

Add two Python examples of BisectingKMeans, one for ml and one for mllib.

How was this patch tested?

manual tests

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52455 has finished for PR 11515 at commit 5ed2a47.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52456 has finished for PR 11515 at commit e6da291.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2016

Test build #52457 has finished for PR 11515 at commit ebce780.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = BisectingKMeans.train(parsedData, 2, maxIterations=5)
Contributor

While trying to run this, got an exception:

TypeError: unbound method train() must be called with BisectingKMeans instance as first argument (got PipelinedRDD instance instead)

train is missing the @classmethod decorator here. You can just add it in this PR.
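For readers unfamiliar with this failure mode, here is a minimal sketch. The classes `WithoutDecorator` and `WithDecorator` are hypothetical stand-ins, not the real pyspark `BisectingKMeans`, and the signature `train(cls, data, k)` is an assumption made for illustration. Calling an undecorated method on the class itself raises a TypeError (the "unbound method" message above under Python 2, a missing-argument message under Python 3), while the decorated version binds `cls` automatically.

```python
class WithoutDecorator(object):
    # Hypothetical stand-in: the first parameter is named cls,
    # but without @classmethod nothing binds it to the class.
    def train(cls, data, k):
        return cls, data, k


class WithDecorator(object):
    # Hypothetical stand-in with the decorator applied.
    @classmethod
    def train(cls, data, k):
        # cls is bound to WithDecorator automatically.
        return cls, data, k


points = [[0.0, 0.0], [9.0, 9.0]]

# Works: the class is passed implicitly as cls.
bound_cls, bound_data, bound_k = WithDecorator.train(points, 2)
print(bound_cls is WithDecorator)

# Fails: called on the class, the arguments do not line up with (cls, data, k).
try:
    WithoutDecorator.train(points, 2)
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError: " + str(exc))
```

The exact error text differs between Python versions, so the sketch only relies on the exception type.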

Contributor Author

The @classmethod decorator has been added.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52751 has finished for PR 11515 at commit cea8ddf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sqlContext = SQLContext(sc)

# $example on$
training = sqlContext.createDataFrame([
Contributor

Could we make this example more consistent with the style of the other one (and the ML kmeans example):

from pyspark.sql.types import Row
from pyspark.mllib.linalg import Vectors
...
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: Row(features=Vectors.dense([float(x) for x in line.split(' ')])))
training = sqlContext.createDataFrame(parsedData)
...

@MLnick
Contributor

MLnick commented Mar 10, 2016

Please add an include_example for the Python example in mllib-clustering.md

@zhengruifeng
Contributor Author

@MLnick I have added an include_example in mllib-clustering.md and made some changes according to your comments.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52890 has finished for PR 11515 at commit 399290c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52892 has finished for PR 11515 at commit d441511.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52894 has finished for PR 11515 at commit 165a4fe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52906 has finished for PR 11515 at commit 165a4fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Mar 11, 2016

Thanks! Merging to master.

@asfgit asfgit closed this in d18276c Mar 11, 2016
@zhengruifeng zhengruifeng deleted the mllib_bkm_pe branch March 11, 2016 07:26
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
JIRA: https://issues.apache.org/jira/browse/SPARK-13672

## What changes were proposed in this pull request?

add two python examples of BisectingKMeans for ml and mllib

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes apache#11515 from zhengruifeng/mllib_bkm_pe.
@rmchurch

This example doesn't seem to work in Spark 2.0.0, and judging from mllib/clustering.py on master, I don't expect it to work in the latest code either. Specifically, the Python BisectingKMeansModel class does not have a save method (the KMeansModel class does), so the last three lines of the following code do not work:

# Build the model (cluster the data)
model = BisectingKMeans.train(parsedData, 2, maxIterations=5)

# Evaluate clustering
cost = model.computeCost(parsedData)
print("Bisecting K-means Cost = " + str(cost))

# Save and load model
path = "target/org/apache/spark/PythonBisectingKMeansExample/BisectingKMeansModel"
model.save(sc, path)
sameModel = BisectingKMeansModel.load(sc, path)
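Until the model class gains persistence support, one defensive option is to check for the method before calling it. The sketch below is an assumption-laden illustration, not part of the Spark example: `save_if_supported`, `ModelWithSave`, `ModelWithoutSave`, and the path are all hypothetical, and the stand-in classes replace real pyspark models so the snippet runs without a Spark installation.

```python
def save_if_supported(model, sc, path):
    """Call model.save(sc, path) only when the model exposes a callable save."""
    save = getattr(model, "save", None)
    if callable(save):
        save(sc, path)
        return True
    return False


class ModelWithSave(object):
    # Hypothetical stand-in for a model that supports persistence.
    def __init__(self):
        self.saved_to = None

    def save(self, sc, path):
        # Record the path instead of writing anything to disk.
        self.saved_to = path


class ModelWithoutSave(object):
    # Hypothetical stand-in for a model class that lacks save.
    pass


path = "target/tmp/model"  # hypothetical path, for illustration only
with_save = ModelWithSave()
saved = save_if_supported(with_save, None, path)
skipped = save_if_supported(ModelWithoutSave(), None, path)
print(saved, skipped)
```

In a real job, `sc` would be the active SparkContext rather than `None`.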

@zhengruifeng
Contributor Author

@rmchurch I think this bug has been resolved in #16515
