Better PAM initialization with BUILD #667

Open · kno10 opened this issue Feb 5, 2021 · 11 comments

Assignees: annoviko
Labels: Investigation (tasks related to investigation of found issues) · Optimization (tasks related to code optimization) · Proposal (tasks that have been proposed by users)

Comments

kno10 commented Feb 5, 2021

The PAM/k-medoids implementation appears to implement SWAP, but not the BUILD part for initializing PAM. Instead, you have to provide good starting medoids yourself.

I tried benchmarking it on a larger data set (the well-known 20news data set in the sklearn.datasets.fetch_20newsgroups_vectorized version, with cosine distance and k=20), and the runtime of pyclustering's kmedoids was extremely high (2000 sec), likely because of the poor random initialization.

With BUILD from the kmedoids package, I can reduce the runtime to 338 sec. Nevertheless, pam from the kmedoids package takes just 37 seconds, including 4.6 seconds for BUILD. The fasterpam variant finishes in 336 msec with random initialization.

It would also be good to get access to the final loss after optimization as well as the number of iterations. Then I could check whether it ran into the maximum iteration limit, and whether the result quality is at least comparable.
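For reference, the setup looks roughly like this (a minimal sketch, not my actual notebook code; the data_type='distance_matrix' input mode follows pyclustering's documented API, while the preprocessing and seed here are illustrative):

import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics.pairwise import cosine_distances
from pyclustering.cluster.kmedoids import kmedoids

# 20news as a sparse document-term matrix; precompute the cosine distance matrix.
X = fetch_20newsgroups_vectorized().data
D = cosine_distances(X)

# Random starting medoids (the only built-in option without BUILD).
rng = np.random.default_rng(0)
initial_medoids = rng.choice(D.shape[0], size=20, replace=False).tolist()

kmedoids_instance = kmedoids(D, initial_medoids, data_type='distance_matrix')
kmedoids_instance.process()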

annoviko (Owner) commented Feb 6, 2021

Hello @kno10,

There is a k-means++-based algorithm that can be used to initialize it, which you refer to in your article:

In the experiments, we will also study whether a clever sampling-based approach similar to k-means++ (Arthur and Vassilvitskii, 2007) that needs only O(nk) time but produces a worse starting point is an interesting alternative.

It could be used to choose the initial medoids, and I expected that only the quality would be affected. Nevertheless, your results make a good case for implementing the PAM BUILD algorithm for initialization. I am not sure whether it is convenient to make it part of the PAM algorithm's interface.

I also have a question for you: did you use an installed pyclustering version or a cloned one? With an installed version, the C++ implementation of the algorithm should be used; with a cloned version, the library falls back to the Python implementation, because there are no binaries in the git repository.

It would also be good to get access to the final loss after optimization as well as the number of iterations

The number of iterations is clear, but I have a question about the final loss after optimization: do you mean total deviation (TD)?
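For illustration, feeding k-means++-selected indices into K-Medoids would look roughly like this (a minimal sketch, assuming kmeans_plusplus_initializer and its return_index option as documented):

from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

# A bundled sample keeps the sketch self-contained.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)

# k-means++ selects spread-out initial centers; return_index=True yields point
# indices, which is the form kmedoids expects for its initial medoids.
initial_medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)
kmedoids_instance = kmedoids(sample, initial_medoids)
kmedoids_instance.process()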

annoviko added the Optimization and Proposal labels on Feb 6, 2021
annoviko added this to "To do" in 0.11.0 Feature Development via automation on Feb 6, 2021
annoviko added the Investigation label on Feb 6, 2021
annoviko moved this from "To do" to "In progress" in 0.11.0 Feature Development on Feb 6, 2021
annoviko self-assigned this on Feb 6, 2021
kno10 (Author) commented Feb 6, 2021

I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.

For that experiment, I used "pip install pyclustering" on Google Colab, whatever that installs by default.

Yes, TD (total deviation) is the loss.

annoviko (Owner) commented Feb 6, 2021

Thank you for the clarification. One more question:

I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.

Was a distance matrix used as an input for the algorithm in the experiment?

annoviko (Owner) commented Feb 12, 2021

Hello @kno10,

I have introduced the PAM BUILD algorithm in line with your article, and I have optimized the C++ version of the K-Medoids (PAM) algorithm (which should be used by default). I have attached a pyclustering package (pyclustering-0.11.0.tar.gz) with all binaries, built from the master branch (not a release version), where these changes are available. Could you please check it again?

By the way, K-Means++ supports a distance matrix now; let me know if you need a code example for that as well.

Here is an example of how to use K-Medoids with PAM BUILD:

from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

# Load the list of points `Tetra` for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)

# Select initial medoids using the PAM BUILD algorithm.
initial_medoids = build(sample, 4).initialize()

# Create an instance of the K-Medoids (PAM) algorithm.
kmedoids_instance = kmedoids(sample, initial_medoids)

# Run cluster analysis.
kmedoids_instance.process()

# Display clustering results to console.
print("Clusters:", kmedoids_instance.get_clusters())
print("Labels:", kmedoids_instance.get_labels())
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())

# Visualize clustering results.
visualizer = cluster_visualizer()
visualizer.append_clusters(kmedoids_instance.get_clusters(), sample)
visualizer.append_cluster(initial_medoids, sample, markersize=120, marker='*', color='dimgray')
visualizer.append_cluster(kmedoids_instance.get_medoids(), sample, markersize=120, marker='*', color='black')
visualizer.show()

Output example:

Clusters: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199], [200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299], [300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399]]
Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Medoids: [0, 100, 201, 300]
Total Deviation: 239.25250279395595
Iterations: 3

annoviko (Owner) commented

@kno10, I would be really pleased if you could provide the test code that you were using for performance testing.

annoviko (Owner) commented Feb 12, 2021

@kno10 , I decided to repeat your experiment using sklearn.datasets.fetch_20newsgroups_vectorized.

The results of the original PAM without a distance matrix (a list of points as input) still aren't good:

Medoids: [6953, 10558, 489, 1034, 6910, 3522, 10246, 2692, 1965, 9385, 6535, 7433, 6540, 2760, 459, 10860, 6233, 4006, 484, 11189]
Total Deviation: 29.767144356027746
Iterations: 102
PAM BUILD: 36.062500 sec.
PAM: 2187.406250 sec.

PAM BUILD implementation: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/pam_build.cpp
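In essence, BUILD greedily picks each medoid so that it reduces the total deviation as much as possible. A condensed Python sketch of the idea (illustrative only, not the linked C++ code):

import numpy as np

def pam_build(D, k):
    # D is a precomputed (n x n) distance matrix.
    # First medoid: the point with the smallest total distance to all others.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    nearest = D[:, medoids[0]].copy()  # distance of each point to its nearest medoid
    for _ in range(k - 1):
        # Next medoid: the point whose addition reduces total deviation the most.
        gains = np.maximum(nearest[:, None] - D, 0.0).sum(axis=0)
        gains[medoids] = -1.0  # never re-pick an existing medoid
        best = int(np.argmax(gains))
        medoids.append(best)
        nearest = np.minimum(nearest, D[:, best])
    return medoids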

I was using the following code to get the data and run PAM on it. Is it similar to what you were doing?

import time
from pyclustering.cluster.kmedoids import kmedoids, build
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups_vectorized

# Fetch the 20news data set once and reduce its dimensionality with truncated SVD.
data = fetch_20newsgroups_vectorized()
X, y = data['data'], data['target']
X = TruncatedSVD().fit_transform(X)

t_start = time.process_time()
initial_medoids = build(X, 20).initialize()
pam_build_time = time.process_time() - t_start

t_start = time.process_time()
kmedoids_instance = kmedoids(X, initial_medoids)
kmedoids_instance.process()
pam_time = time.process_time() - t_start

print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())

print("PAM BUILD: %f sec." % pam_build_time)
print("PAM: %f sec." % pam_time)

kno10 (Author) commented Feb 12, 2021

Please see https://colab.research.google.com/drive/1DNzGbQns5-kiyTVkDvAorcxqXZb5ukEI?usp=sharing
It still appears to take a very long time. It does not appear to use ccore?

annoviko (Owner) commented Feb 13, 2021

@kno10, thanks, your performance test is different and its results look much better. By default, pyclustering always uses the C++ code. I have run it on my local machine and got the following results; could you please take a look at them?

(11314, 130107)
(11314, 11314) took 7.680890083312988 s
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build__amount', '_build__calculate_first_medoid', '_build__calculate_next_medoid', '_build__ccore', '_build__create_distance_calculator', '_build__data', '_build__data_type', '_build__distance_calculator', '_build__initialize_by_ccore', '_build__initialize_by_python', '_build__metric', '_build__verify_arguments', 'initialize']
Build took: {} 37.16072940826416
<pyclustering.cluster.kmedoids.kmedoids object at 0x000001B85C41F780> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_kmedoids__calculate_swap_cost', '_kmedoids__ccore', '_kmedoids__clusters', '_kmedoids__create_distance_calculator', '_kmedoids__data_type', '_kmedoids__distance_calculator', '_kmedoids__distance_first_medoid', '_kmedoids__distance_second_medoid', '_kmedoids__erase_empty_clusters', '_kmedoids__initializer', '_kmedoids__iterations', '_kmedoids__itermax', '_kmedoids__labels', '_kmedoids__medoid_indexes', '_kmedoids__metric', '_kmedoids__pointer_data', '_kmedoids__swap_medoids', '_kmedoids__tolerance', '_kmedoids__total_deviation', '_kmedoids__update_clusters', '_kmedoids__verify_arguments', 'get_cluster_encoding', 'get_clusters', 'get_iterations', 'get_labels', 'get_medoids', 'get_total_deviation', 'predict', 'process']
Runtime: min=90973.16 mean=90973.16 ±nan
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:130: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
Loss:    min= 4994.75 mean= 4994.75 ±nan
N Iter:  min=    2.00 mean=    2.00 ±nan

Data transfer between Python and C++ has a big impact on performance. I wanted the C++ library to be absolutely independent of "Python.h", and I have to pay for this now. I have added additional time counters to the implementation to measure data-packing time and result-unpacking time on the Python side:

Data packing time for C++ on python side: 21.093750 sec.
Results unpacking from C++ on python side: 0.000000 sec.
Build took: {} 31.153797149658203

Thus, for PAM BUILD: 21 seconds go into data packing and about 10 seconds into processing on the C++ side. The same holds for the PAM algorithm itself, so roughly 42 seconds are spent on data transfer alone. Overall, I would therefore expect 90 - 42 = 48 seconds of actual processing.

I think I will try to implement a Python-dependent interface and use templates in C++ in order to optimize this flow while keeping the C++ static library independent of Python.

FastPAM impresses me with its performance; I will implement it when I finish the classical PAM optimizations.

kno10 (Author) commented Feb 13, 2021

The resulting loss looks good, but the other PAM implementations used 3 iterations, not 2. This could be an off-by-one in counting, or it could be due to tolerance > 0 (IMHO, tolerance=0 is a reasonable default for "hard" clustering approaches such as k-medoids and k-means, just not for "soft" Gaussian Mixture Modeling).
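A minimal sketch of what I mean, reusing sample and initial_medoids from the example above, and assuming the tolerance parameter and itermax kwarg act as stopping controls (my reading of the API, not verified):

from pyclustering.cluster.kmedoids import kmedoids

# Assumption: tolerance=0.0 makes the stop condition strict, so SWAP runs until
# no improvement at all is found; itermax caps the number of SWAP passes.
kmedoids_instance = kmedoids(sample, initial_medoids, tolerance=0.0, itermax=200)
kmedoids_instance.process()
print("Iterations:", kmedoids_instance.get_iterations())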
When I try to run the notebook on Google Colab, it still takes forever, and when I interrupt it, it appears to run in Python, not in ccore, judging from the trace (in __calculate_next_medoid(self)). I am not an expert on Python, so I don't know how to resolve this; I can only note that it does not appear to use ccore.
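One way to check would be the following probe, assuming pyclustering exposes its binary status via ccore_library in pyclustering.core.wrapper (this is my understanding of the API, not verified):

from pyclustering.core.wrapper import ccore_library

# True means the bundled C++ binary was found and will be used by default;
# False means the pure-Python fallback runs instead.
print("ccore available:", ccore_library.workable())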

kno10 (Author) commented Feb 13, 2021

The runtime on Google Colab was:

Build took: 2596859.432220459
Runtime: min=8129146.14 mean=8129146.14 ±nan
Loss:    min= 4994.75 mean= 4994.75 ±nan
N Iter:  min=    2.00 mean=    2.00 ±nan

So 43 minutes for BUILD plus 92 minutes for SWAP. You may want to emit a warning when the Python version is used on large data.

annoviko (Owner) commented

@kno10, thanks for these results. I think it is much better to optimize the Python code as well; the problem is that I wasn't using numpy for processing in PAM BUILD or in PAM itself. I think it is doable. I will let you know when I update both implementations of the algorithm.

About the iterations, I am counting them in the following way (pseudo-code):

if itermax > 0:
    update_clusters()

iterations = 0
while iterations < itermax:
    iterations += 1
    cost = swap_medoids()
    if cost != NOTHING_TO_SWAP:
        update_clusters()

So, I would say that I am counting the number of swap_medoids() calls.
