Contributor

@mgaido91 mgaido91 commented Feb 6, 2018

What changes were proposed in this pull request?

SPARK-22119 introduced a new parameter for KMeans, i.e. distanceMeasure. This PR also adds it to the Python interface.

How was this patch tested?

Added unit tests.

SparkQA commented Feb 6, 2018

Test build #87112 has finished for PR 20520 at commit 65da587.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

mgaido91 commented Feb 8, 2018

Member

@srowen srowen left a comment

I don't know the Python API conventions well, but that looks like the right kind of plumbing through of a new parameter.

Member

@BryanCutler BryanCutler left a comment

Looks good, just some minor comments on the tests

kmeans.setDistanceMeasure("cosine")
model = kmeans.fit(df)
result = model.transform(df).rdd.collectAsMap()
self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])])
Member

It's a little awkward to use collectAsMap and compare like this. Why not just do a regular collect and compare with the data above?

Contributor Author

Sorry, but I don't quite understand what you have in mind. Here I check that the predictions for two points are the same. Since the predicted value is not known in advance, I cannot just check whether the DataFrame is equal to something predefined. Am I missing something?
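
As a side illustration of why the test can expect those two points to share a cluster: under cosine distance, assumed here to be the usual one minus cosine similarity, vectors pointing in the same direction are at distance zero regardless of their magnitude. The helper names below are hypothetical and plain Python is used instead of Spark, so this is only a sketch of the property, not the code under review.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two non-zero vectors.

    Assumption: this is the conventional definition of cosine distance;
    it is a hypothetical helper, not a Spark API.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the ordinary straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The two points compared in the test lie on the same ray from the origin.
p, q = [1.0, 1.0], [10.0, 10.0]

# Cosine distance ignores magnitude, so this is zero up to rounding error.
print("cosine:", cosine_distance(p, q))

# Euclidean distance does depend on magnitude, so the points are far apart.
print("euclidean:", euclidean_distance(p, q))
```

Under Euclidean distance nothing ties these two points together, whereas with cosine distance they are effectively the same direction, which is the behaviour the assertion checks without needing to know the actual cluster index.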

Member

I meant something like this, similar to how it's done in the KMeans pydoc:

    >>> transformed = model.transform(df).select("features", "prediction")
    >>> rows = transformed.collect()
    >>> rows[0].prediction == rows[1].prediction
    True
    >>> rows[2].prediction == rows[3].prediction

Contributor Author

@BryanCutler is it guaranteed that the ordering does not change? And is the ordering deterministic? I did it this way, and not how you suggest, because I thought the ordering was not guaranteed. Am I wrong?

Member

Yes, the order is preserved, so you can do the same as the doctest.

(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)]
df = self.spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=3, seed=1)
kmeans.setDistanceMeasure("cosine")
Member

minor, but why not just kmeans = KMeans(k=3, distanceMeasure="cosine", seed=1)?

Contributor Author

It was just to test that this method works. Do you think it is better to switch to what you suggested?

Member

It's probably a little more common to put the param in the constructor, which should also be tested. You put setDistanceMeasure in test_kmeans_param, which seems better. Not a big deal, though.

SparkQA commented Feb 9, 2018

Test build #87256 has finished for PR 20520 at commit 395ef5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 9, 2018

Test build #87270 has finished for PR 20520 at commit cbdbe12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

LGTM

Member

srowen commented Feb 10, 2018

Merged to master

@asfgit asfgit closed this in 0783876 Feb 10, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20520 from mgaido91/SPARK-23344.