-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans #20520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #87112 has finished for PR 20520 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the Python API conventions well, but that looks like the right kind of plubming through of a new parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just some minor comments on the tests
python/pyspark/ml/tests.py
Outdated
kmeans.setDistanceMeasure("cosine") | ||
model = kmeans.fit(df) | ||
result = model.transform(df).rdd.collectAsMap() | ||
self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little awkward to collectAsMap and compare like this, why not just regular collect and compare with data
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry but I can't really understand what you are thinking about: here I compare that the prediction for two points contains the same value. Since the value is not known in advance I cannot just check if the dataframe is equal to something predefined. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant something like this, similar to how its done in the KMeans pydoc
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler is it guaranteed that the ordering is not changed? And is the ordering deterministic? I did it like this and now how you suggest because I thought that the ordering is not guaranteed, am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the order is preserved so you can do the same as doctest.
python/pyspark/ml/tests.py
Outdated
(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] | ||
df = self.spark.createDataFrame(data, ["features"]) | ||
kmeans = KMeans(k=3, seed=1) | ||
kmeans.setDistanceMeasure("cosine") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but why not just kmeans = KMeans(k=3, distanceMeasure="cosine", seed=1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was just to test that this method is working. Do you think it is better to switch to what you suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably a little more common to put the param in the constructor, which should be tested also. you put setDistanceMeasure
in test_kmeans_param
which seems better, not a big deal though
Test build #87256 has finished for PR 20520 at commit
|
Test build #87270 has finished for PR 20520 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merged to master |
## What changes were proposed in this pull request? SPARK-22119 introduced a new parameter for KMeans, ie. `distanceMeasure`. The PR adds it also to the Python interface. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes apache#20520 from mgaido91/SPARK-23344.
What changes were proposed in this pull request?
SPARK-22119 introduced a new parameter for KMeans, ie.
distanceMeasure
. The PR adds it also to the Python interface.How was this patch tested?
added UTs