Fix severe bug in distance matrix -- found/fixed by Bernd Fritzke. #2

Open · wants to merge 1 commit into master
Conversation


@ctwardy commented Nov 17, 2022

In Nov. 2021, Bernd Fritzke [reported & fixed](https://www.mail-archive.com/r-help@r-project.org/msg263971.html) a severe bug in the distance matrix calculation.  The bug resulted in poor performance equivalent to naive init (rather than k-means++).

Bernd reports:
> In its current form, the results were much worse than those of Hartigan-Wong (the default for k-means in the stats package) for all test problems I tried. However, after fixing the bug, kmeanspp found better results than Hartigan-Wong for all those problems.

In summary:
> The bug concerns a distance computation which should be a matrix of distances of all data vectors and all current codebook vectors, but is not. The code and an example illustrating the problem is shown below. Basically, to subtract a vector from a matrix, one has to convert the vector into a matrix where all rows are just copies of the vector. ... The fix is trivial.
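For illustration, here is a minimal R sketch of the recycling pitfall Bernd describes (the variable names are hypothetical, not the package's actual code):

```r
## 3 data vectors (rows) in 2 dimensions, and one codebook vector.
X <- matrix(1:6, nrow = 3)
center <- c(1, 2)

## Buggy: R recycles `center` down the columns (column-major), so each
## data row has a different, wrong value subtracted from it.
wrong <- rowSums((X - center)^2)

## Fixed: expand the vector into a matrix whose rows are all copies of
## `center`, then subtract row-wise (sweep(X, 2, center) also works).
C <- matrix(center, nrow = nrow(X), ncol = length(center), byrow = TRUE)
right <- rowSums((X - C)^2)

wrong  # 4 16 20  -- the middle squared distance is wrong
right  # 4 10 20
```

With the buggy distances, the k-means++ seeding probabilities are computed from the wrong matrix, which is consistent with the observed behavior that results looked no better than naive random initialization.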


See also this [post by Paul Harrison](https://logarithmic.net/pfh-files/blog/01618034037/k-means-better.html), who also noticed sub-optimal results.  When I [compared these to scikit-learn](https://ctwardy.micro.blog/2021/06/10/on-kmeans-clustering.html), I discovered the results were as if the inits were still naive (random).  

Bernd's analysis would explain that, and his fix appears to resolve it.