[SPARK-37959][ML] Fix the UT of checking norm in KMeans & BiKMeans #35247

Closed
wants to merge 1 commit into from


zhengruifeng
Contributor

What changes were proposed in this pull request?

In KMeansSuite and BisectingKMeansSuite, some lines compute a norm check but never assert it, so the result is discarded:

model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0)

For cosine distance, every cluster center should have norm 1, so the norm check is meaningful.

For Euclidean distance, the norm check is meaningless.

Why are the changes needed?

To enable the norm check for cosine distance and disable it for Euclidean distance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated the test suites.

@github-actions github-actions bot added the ML label Jan 19, 2022
@zhengruifeng
Contributor Author

friendly ping @srowen @huaxingao

@srowen
Member

srowen commented Jan 19, 2022

The assertion says that cluster centers are vectors with norm 1? Yes, I do not see a reason to expect that when the distance metric in use is Euclidean. I wonder how that ever worked? It will not be true in the general case. Is it just something that happens to work in the unit test?

Fine to make the other assertions about the norm include an error term, yes.
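The point about an error term can be illustrated with a minimal sketch in plain Scala. It does not use Spark; `NormTolerance`, `l2Norm`, `approxEqual`, and the `1e-8` tolerance are hypothetical names and values chosen for illustration, not taken from the patch.

```scala
object NormTolerance {
  /** Euclidean (L2) norm of a dense vector. */
  def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  /** True when `actual` is within `absTol` of `expected`. */
  def approxEqual(actual: Double, expected: Double, absTol: Double = 1e-8): Boolean =
    math.abs(actual - expected) <= absTol

  def main(args: Array[String]): Unit = {
    val v = Array(0.1, 0.2, 0.3)
    val n = l2Norm(v)
    val unit = v.map(_ / n) // rescale to unit length

    // The recomputed norm may differ from 1.0 in the last few bits,
    // so compare with a tolerance instead of `==`.
    assert(approxEqual(l2Norm(unit), 1.0))
  }
}
```

An exact `== 1.0` comparison can fail on rounding alone even when the center was normalized correctly, which is why a tolerance is the safer form for this kind of assertion.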

@zhengruifeng
Contributor Author

In each iteration, `val newCenter = distanceMeasureInstance.centroid(sum, weightSum)` is called to get the new center vector, and `CosineDistanceMeasure` normalizes the center, so the final vectors all have norm 1.

This seems to be specific to Spark's implementation; I'm not sure whether other implementations normalize the centers too.
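A rough sketch of the difference described above, in plain Scala rather than Spark's actual classes. `CentroidSketch`, `meanCentroid`, and `unitCentroid` are hypothetical stand-ins for the Euclidean and cosine centroid computations; the real `centroid` implementations may differ in detail.

```scala
object CentroidSketch {
  /** Euclidean (L2) norm of a dense vector. */
  def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  /** Euclidean-style centroid: the weighted sum divided by the total weight. */
  def meanCentroid(sum: Array[Double], weightSum: Double): Array[Double] =
    sum.map(_ / weightSum)

  /**
   * Cosine-style centroid: the mean, then rescaled to unit length.
   * Assumes the mean is not the zero vector.
   */
  def unitCentroid(sum: Array[Double], weightSum: Double): Array[Double] = {
    val mean = meanCentroid(sum, weightSum)
    val n = l2Norm(mean)
    mean.map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    val sum = Array(6.0, 8.0) // e.g. the sum of the points (2, 3) and (4, 5)
    val weightSum = 2.0

    // The plain mean is (3, 4), whose norm is 5, not 1.
    println(l2Norm(meanCentroid(sum, weightSum)))
    // The rescaled centroid has norm 1 up to rounding.
    println(l2Norm(unitCentroid(sum, weightSum)))
  }
}
```

Under these assumptions a norm-1 check only makes sense for the rescaled centroid; for the plain mean it would hold only if the data happened to place a center at distance 1 from the origin.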

@srowen
Member

srowen commented Jan 19, 2022

OK, but not for Euclidean distance? Then I wonder how that test ever passed, unless the data just happens to produce centers at distance 1 from the origin.

@zhengruifeng
Contributor Author

Euclidean distance doesn't have this attribute.

> then I wonder how that test ever passed

they were not assertions, just expressions.
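To make the distinction concrete, here is a small sketch in plain Scala. `BareExpressionSketch` and its members are hypothetical, and the centers are made-up values whose norms are 5 and 2.

```scala
object BareExpressionSketch {
  /** Euclidean (L2) norm of a dense vector. */
  def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Made-up centers; neither has norm 1.
  val centers: Seq[Array[Double]] = Seq(Array(3.0, 4.0), Array(0.0, 2.0))

  /** The norm check as a plain Boolean expression. */
  def allUnitNorm: Boolean = centers.forall(c => l2Norm(c) == 1.0)

  /** Evaluates the check but discards the result, so it can never fail. */
  def checkIgnored(): Unit = {
    allUnitNorm
    ()
  }

  /** Wraps the same check in `assert`, so a false result throws. */
  def checkAsserted(): Unit = assert(allUnitNorm)

  def main(args: Array[String]): Unit = {
    checkIgnored() // completes silently even though the check is false
    val failed = scala.util.Try(checkAsserted()).isFailure
    println(s"asserted check failed: $failed")
  }
}
```

A test containing only the bare expression passes regardless of the data, which matches the explanation that the lines were expressions rather than assertions.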

@srowen
Member

srowen commented Jan 19, 2022

Oh right, I'm not reading. OK yes delete them

@huaxingao huaxingao closed this in 789fce8 Jan 19, 2022
huaxingao pushed a commit that referenced this pull request Jan 19, 2022
### What changes were proposed in this pull request?

In `KMeansSuite` and `BisectingKMeansSuite`, there are some unused lines:

```
model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0)
```

For cosine distance, every cluster center should have norm 1, so the norm check is meaningful.

For Euclidean distance, the norm check is meaningless.

### Why are the changes needed?

To enable the norm check for cosine distance and disable it for Euclidean distance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated the test suites.

Closes #35247 from zhengruifeng/fix_kmeans_ut.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: huaxingao <huaxin.gao11@gmail.com>
(cherry picked from commit 789fce8)
Signed-off-by: huaxingao <huaxin.gao11@gmail.com>
@huaxingao
Contributor

Merged to master/3.2. Thanks!

@zhengruifeng zhengruifeng deleted the fix_kmeans_ut branch January 20, 2022 01:18
@zhengruifeng
Contributor Author

thank you all!

catalinii pushed a commit to lyft/spark that referenced this pull request Feb 22, 2022
catalinii pushed a commit to lyft/spark that referenced this pull request Mar 4, 2022
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
(cherry picked from commit 5cf8108)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
3 participants