[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans #20520

mgaido91 · 2018-02-06T15:16:52Z

What changes were proposed in this pull request?

SPARK-22119 introduced a new parameter for KMeans, i.e. distanceMeasure. This PR also adds it to the Python interface.

How was this patch tested?

Added unit tests.

SparkQA · 2018-02-06T15:39:52Z

Test build #87112 has finished for PR 20520 at commit 65da587.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-02-08T14:47:26Z

cc @BryanCutler @holdenk @jkbradley @srowen

srowen

I don't know the Python API conventions well, but that looks like the right kind of plumbing through of a new parameter.

BryanCutler

Looks good, just some minor comments on the tests

BryanCutler · 2018-02-08T19:48:09Z

python/pyspark/ml/tests.py

+        kmeans.setDistanceMeasure("cosine")
+        model = kmeans.fit(df)
+        result = model.transform(df).rdd.collectAsMap()
+        self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])])


It's a little awkward to collectAsMap and compare like this. Why not just do a regular collect and compare with the data above?

Sorry, but I can't really understand what you mean: here I check that the predictions for the two points have the same value. Since the value is not known in advance, I cannot just check that the DataFrame is equal to something predefined. Am I missing something?

I meant something like this, similar to how it's done in the KMeans pydoc:

>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction

@BryanCutler is it guaranteed that the ordering is not changed? And is the ordering deterministic? I did it like this, and not how you suggest, because I thought that the ordering is not guaranteed. Am I wrong?

Yes, the order is preserved, so you can do the same as the doctest.
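For context on the assertion discussed in this thread, here is a minimal plain-Python sketch of why [1.0, 1.0] and [10.0, 10.0] are expected to get the same prediction under the cosine measure. It assumes cosine distance is defined as 1 minus cosine similarity. The helper names are illustrative and this is not the code Spark runs.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 - cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Ordinary Euclidean distance, for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Points that appear in the test under review.
p, q = [1.0, 1.0], [10.0, 10.0]
r, s = [-1.0, 1.0], [-100.0, 90.0]

# p and q point in the same direction: cosine distance is (numerically) zero,
# even though the Euclidean distance between them is large.
same_direction = cosine_distance(p, q)
far_apart = euclidean_distance(p, q)

# p and r are orthogonal, so under this definition they are far apart.
orthogonal = cosine_distance(p, r)
```

Because only direction matters under this measure, a test can assert that the two points share a cluster without knowing which cluster index they receive.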

BryanCutler · 2018-02-08T19:49:04Z

python/pyspark/ml/tests.py

+                (Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)]
+        df = self.spark.createDataFrame(data, ["features"])
+        kmeans = KMeans(k=3, seed=1)
+        kmeans.setDistanceMeasure("cosine")


minor, but why not just kmeans = KMeans(k=3, distanceMeasure="cosine", seed=1)?

it was just to test that this method is working. Do you think it is better to switch to what you suggested?

It's probably a little more common to put the param in the constructor, which should also be tested. You put setDistanceMeasure in test_kmeans_param, which seems better. Not a big deal though.
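The two styles discussed above, together with the order-preserving collect from the earlier thread, could be sketched as follows. This assumes a PySpark version that includes this change and an existing SparkSession passed in as `spark`. The helper name, the value of `k`, and the four-point data set are illustrative assumptions rather than the PR's test.

```python
def cosine_kmeans_predictions(spark, k=2, seed=1):
    """Fit KMeans with the cosine measure and return predictions in input order.

    `spark` is an existing SparkSession. The data, `k`, and this helper's
    name are illustrative assumptions, not taken from the PR.
    """
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),),
            (Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)]
    df = spark.createDataFrame(data, ["features"])

    # Constructor form, as suggested in the review.
    kmeans = KMeans(k=k, distanceMeasure="cosine", seed=seed)
    # Setter form; sets the same param as the keyword argument above.
    kmeans.setDistanceMeasure("cosine")

    model = kmeans.fit(df)
    # Plain collect(); per the review discussion, row order follows the input.
    rows = model.transform(df).select("features", "prediction").collect()
    return [row.prediction for row in rows]
```

With a live session, a caller might compare the first two entries of the returned list, in the same spirit as the doctest quoted earlier.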

SparkQA · 2018-02-09T09:00:00Z

Test build #87256 has finished for PR 20520 at commit 395ef5d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-09T20:03:36Z

Test build #87270 has finished for PR 20520 at commit cbdbe12.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

LGTM

srowen · 2018-02-10T16:47:01Z

Merged to master

## What changes were proposed in this pull request?

SPARK-22119 introduced a new parameter for KMeans, ie. `distanceMeasure`. The PR adds it also to the Python interface.

## How was this patch tested?

added UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20520 from mgaido91/SPARK-23344.

[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans

65da587

srowen reviewed Feb 8, 2018

BryanCutler reviewed Feb 8, 2018

improve test according to review

395ef5d

address comment

cbdbe12

BryanCutler approved these changes Feb 9, 2018

asfgit closed this in 0783876 Feb 10, 2018
