Fix severe bug in distance matrix -- found/fixed by Bernd Fritzke. #2

Open · wants to merge 1 commit into master
Conversation


@ctwardy commented Nov 17, 2022

In Nov. 2021, Bernd Fritzke [reported & fixed](https://www.mail-archive.com/r-help@r-project.org/msg263971.html) a severe bug in the distance matrix calculation.  The bug resulted in poor performance equivalent to naive init (rather than k-means++).

Bernd reports:
> In its current form, the results were much worse than those of Hartigan-Wong (the default for k-means in the stats package) for all test problems I tried. However, after fixing the bug, kmeanspp found better results than Hartigan-Wong for all those problems.

In summary:
> The bug concerns a distance computation which should be a matrix of distances of all data vectors and all current codebook vectors, but is not. The code and an example illustrating the problem is shown below. Basically, to subtract a vector from a matrix, one has to convert the vector into a matrix where all rows are just copies of the vector. ... The fix is trivial.
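For illustration, here is a minimal R sketch of the recycling pitfall Bernd describes (the variable names are hypothetical, not the package's actual code):

```r
## 3 data vectors (rows) in 2 dimensions, and one codebook vector.
X <- matrix(1:6, nrow = 3)
center <- c(1, 2)

## Buggy: R recycles `center` down the columns (column-major), so each
## data row has a different, wrong value subtracted from it.
wrong <- rowSums((X - center)^2)

## Fixed: expand the vector into a matrix whose rows are all copies of
## `center`, then subtract row-wise (sweep(X, 2, center) also works).
C <- matrix(center, nrow = nrow(X), ncol = length(center), byrow = TRUE)
right <- rowSums((X - C)^2)

wrong  # 4 16 20  -- the middle squared distance is wrong
right  # 4 10 20
```

With the buggy distances, the k-means++ seeding probabilities are computed from the wrong matrix, which is consistent with the observed behavior that results looked no better than naive random initialization.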


See also this [post by Paul Harrison](https://logarithmic.net/pfh-files/blog/01618034037/k-means-better.html), who also noticed sub-optimal results.  When I [compared these to scikit-learn](https://ctwardy.micro.blog/2021/06/10/on-kmeans-clustering.html), I discovered the results were as if the inits were still naive (random).  

Bernd's analysis would explain that, and his fix appears to resolve it.