HG-means

Source code of HG-means clustering (Gribel and Vidal, 2019).

HG-means is an efficient hybrid genetic algorithm proposed for the minimum sum-of-squares clustering (MSSC). This population-based metaheuristic uses K-means as a local search in combination with crossover, mutation, and diversification operators.

As HG-means algorithm uses K-means, we included the fundamental source files of the fast K-means implementation of Greg Hamerly (to whom we are grateful for making the source code available) in this repository, under the folder /hamerly. Original files and complete source code of Greg Hamerly K-means can be found at: https://github.com/ghamerly/fast-kmeans.

For the exact crossover, HG-means uses the implementation of Dlib (https://github.com/davisking/dlib) for solving an assignment problem. Dlib files are included in the /dlib-master folder.

HG-means clustering is available as a C++ code, as well as a Python package.

Installation and Run

C++

To run the algorithm in C++, go to /hgmeans folder and try the following sequence of commands:

> make

> ./hgmeans 'dataset_path' pi_min n2 it [nb_clusters] 'w'

Example

> make

> ./hgmeans 'data/iris.txt' 10 5000 1 2 5 10 'w'

This script executes HG-means clustering in the Iris dataset, with 10 solutions in population, a maximum of 5000 iterations, 1 iteration (algorithm repetitions), and 2, 5 and 10 clusters.

Important: You can provide a ground-truth file with the sample-cluster labeling. In this case, make sure that a file with the same name of the dataset and with the '.label' extension is placed in the same folder of the dataset. If this file is provided, HG-means clustering will compute clustering performance metrics. See the section Data format to check the expected data format for datasets and labels files.

Parameters of the algorithm

dataset_path: The path of dataset.

pi_min (default = 10): Population size. Determines the size of the population in the genetic algorithm.

n2 (default = 5000): Maximum number of iterations. Determines the total number of iterations the algorithm will take.

it (default = 1): The number of independent repetitions of the algorithm.

[nb_clusters]: The list with number of clusters. You can pass multiple values, separated by a single space.

w: A flag for saving the results in a file. Use 'w' if you wish to active this feature, or leave it blank. Important: the output files are saved in your current directory, within the folder hgm_out.

Python

HG-means is also available as a Python package. To install HG-means, run the following installation command:

> pip3 install hgmeans

Important: Check your user permissions for pip installation. You may need root user credentials.

For Windows users who do not have a C++ compiler, it may be required an installation of C++ Build tools, which can be downloaded here: https://go.microsoft.com/fwlink/?LinkId=691126

That is it! Now, open your Python interface, import the package and create an instance of HG-means. To execute it, just call the function run() with the corresponding parameters. See an example below:

>>> import hgmeans

>>> demo = hgmeans.PyHGMeans()

>>> demo.runfile('data/iris.txt', 10, 5000, 1, [2,5,10], 'w')

This script executes HG-means clustering in the Iris dataset, with 10 solutions in population, a maximum of 5000 iterations, 1 repetition, and 2, 5 and 10 clusters. Here the number of clusters is passed in an array, so values are separated by commas.

Alternativaly, we can execute HG-means by passing the arrays representing the dataset and labels. In this case, the run() function returns the clustering assignments for each number of clusters. See an example below:

>>> result = demo.run([[1,2,3], [4,5,6], [4,5,7], [4,5,8]], [1,2,2,2], 10, 5000, 1, [2,3], 'w')

Data format

Dataset files. In the first line of a dataset file, the number of data points (n) and the dimensionality of the data (d) is set, separated by a single space. The remaining lines correspond to the coordinates of data points. Each line contains the values of the d features of a sample, where x_ij correspond to the j-th feature of the i-th sample of the data. Each feature value is separated by a single space, as depicted in the scheme below:

n	d
x₁₁	x₁₂	x₁₃	...	x_1d
x₂₁	x₂₂	x₂₃	...	x_2d
....	....	....	...	....
x_n1	x_n2	x_n3	...	x_nd

Some datasets are provided within the folder /data in the HG-means repository.

Labels files. The content of a labels file exhibits the cluster of each sample of the dataset according to the ground-truth, where y_i correspond to the label of the i-th sample:

y₁

y₂

...

y_n

Important: Labels files must have the .label extension. Some labels are provided within the folder /data in the HG-means repository.

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
data		data
dlib-master		dlib-master
hamerly		hamerly
GeneticOperations.cpp		GeneticOperations.cpp
GeneticOperations.h		GeneticOperations.h
HGMeans.cpp		HGMeans.cpp
HGMeans.h		HGMeans.h
LICENSE.md		LICENSE.md
Makefile		Makefile
MathUtils.cpp		MathUtils.cpp
MathUtils.h		MathUtils.h
PbData.h		PbData.h
README.md		README.md
Solution.cpp		Solution.cpp
Solution.h		Solution.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HG-means

Related Article

Installation and Run

C++

Example

Parameters of the algorithm

Python

Data format

About

Contributors 2

Languages

License

danielgribel/hg-means

Folders and files

Latest commit

History

Repository files navigation

HG-means

Related Article

Installation and Run

C++

Example

Parameters of the algorithm

Python

Data format

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages