Description

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. The number of clusters must be specified in advance. The algorithm scales well to a large number of samples and has been used across many application areas.
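
To make the criterion concrete, here is a minimal single-machine sketch of the classic assign-then-update iteration in plain Python. It is only an illustration of the idea, not Paracel's distributed implementation; the function names (`squared_distance`, `assign`, `inertia`, `kmeans`) and the random initialization are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        for p in points
    ]


def inertia(points, centers, labels):
    """Within-cluster sum of squares: the criterion k-means tries to minimize."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


def kmeans(points, k, rounds=100, seed=0):
    """Assign points to centers, then move each center to the mean of its members."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct samples drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(rounds):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)
```

Calling `kmeans(points, k)` returns the final centers and one cluster index per point; passing both to `inertia` gives the value of the criterion for that clustering.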

Usage

  1. Enter Paracel's home directory
    cd paracel;
  2. Generate a training dataset (datagen.py also creates a label file named data.csv.label)
    python ./tool/datagen.py -m kmeans -o data.csv -n 1000 -k 20 --ncenters 50
  3. Set up link library path:
    export LD_LIBRARY_PATH=your_paracel_install_path/lib
  4. Create a JSON file named cfg.json; see the example in the Parameters section below.
  5. Run (4 workers, local mode in the following example)
    ./prun.py -w 4 -p 1 -c cfg.json -m local your_paracel_install_path/bin/kmeans

Parameters

Default parameters are set in a JSON file. For example, create a cfg.json as below (replace your_paracel_install_path with your own install path):

{
  "input" : "data.csv",
  "output" : "./kmeans_result/",
  "type" : "fvec",
  "kclusters" : 50,
  "update_file" : "your_paracel_install_path/lib/libkmeans_update.so",
  "update_functions" : ["local_update_kmeans_clusters", "local_update_kmeans_groups"],
  "rounds" : 100
}

In the configuration file, kclusters is the number of clusters to separate the data into. update_file and update_functions identify the shared library and the registered update functions that the kmeans algorithm needs. rounds is the number of training iterations.
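
If you prefer to generate cfg.json rather than edit it by hand, a small helper along these lines could fill in the install path. This is a sketch, not part of Paracel: `build_kmeans_config` and `write_config` are hypothetical names, and the keys and default values are simply copied from the example above.

```python
import json


def build_kmeans_config(install_path, input_file="data.csv", kclusters=50, rounds=100):
    """Return the example configuration with install_path substituted in."""
    lib_dir = install_path.rstrip("/") + "/lib"
    return {
        "input": input_file,
        "output": "./kmeans_result/",
        "type": "fvec",
        "kclusters": kclusters,
        "update_file": lib_dir + "/libkmeans_update.so",
        "update_functions": [
            "local_update_kmeans_clusters",
            "local_update_kmeans_groups",
        ],
        "rounds": rounds,
    }


def write_config(config, path="cfg.json"):
    """Serialize the configuration dictionary to a JSON file."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

For example, `write_config(build_kmeans_config("/path/to/paracel_install"))` writes a cfg.json in the current directory; the path shown is a placeholder.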

Data Format

Input

The input file uses the fvec format, matching the "type" value in the configuration file.

Output

File centers_0: center coordinates of each group

File kmeans_0: the points belonging to each group
group_id1 point1|point2|...
group_id2 point3|point4|...
...
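
As a rough illustration of reading this layout, the sketch below parses lines of the kmeans_0 form into a dictionary. It assumes the group id is separated from the member list by whitespace and that members are separated by `|`, as the example lines suggest; `parse_groups` is a hypothetical helper, and member tokens are kept as strings because their exact content is not specified here.

```python
def parse_groups(lines):
    """Map each group id to the list of member tokens found on its line."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the first whitespace run separates the id from the members.
        parts = line.split(None, 1)
        group_id = parts[0]
        members = parts[1].split("|") if len(parts) > 1 else []
        groups[group_id] = members
    return groups
```

Since an open file iterates over its lines, `parse_groups(open(path))` works for a result file at `path` as well as for a list of strings.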

Notes

You can compare the clustering result against the label data generated by datagen.py (data.csv.label) to evaluate the training effect.
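
One simple way to make such a comparison is cluster purity: the fraction of points whose group's majority label matches their own. The sketch below is an assumption-laden illustration, not a Paracel tool. It takes `groups` as a dictionary from group id to point ids and `true_labels` as a dictionary from point id to label; loading data.csv.label into that form is left out because its layout is not described here.

```python
from collections import Counter


def purity(groups, true_labels):
    """Fraction of points that carry the majority true label of their group."""
    matched = 0
    total = 0
    for members in groups.values():
        counts = Counter(true_labels[m] for m in members)
        if counts:
            # Count the points that agree with the group's most common label.
            matched += counts.most_common(1)[0][1]
            total += len(members)
    return matched / total if total else 0.0
```

A purity of 1.0 means every group contains points of a single label. Purity tends to rise as the number of clusters grows, so it is best used to compare runs with the same kclusters.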