Tutorial_English
Tutorial of bayon in English
Bayon is a simple and fast clustering tool for large-scale data sets. If you need to analyze large-scale data, bayon is useful for partitioning the data into groups and understanding it.
Bayon supports two hard-clustering methods, repeated bisection clustering and K-means clustering. In the output of these methods, each input document is assigned to exactly one cluster, but you can also get a list of similar clusters for each input document, as in soft clustering, by using the -C, --classify option.
- 2012/06/18: 0.1.1 Released
  - Fixed a build error
- 2010/09/15: 0.1.0 Released
  - Refactoring
- 2010/02/16: 0.0.10 Released
  - Bug fix
- 2010/01/04: 0.0.9 Released
  - Fixed a bug where "make check" failed when a prior version was installed
- 2009/12/24: 0.0.8 Released
  - Reduced memory usage when the -C, --classify option is specified
- 2009/08/12: 0.0.7 Released
  - Added the --vector-size option to specify the max size of each input vector
  - Bug fix
- 2009/07/17: 0.0.6 Released
Please download the latest version.
Download the latest source package, and type the following.
I recommend installing google-sparsehash in advance to make bayon run faster.
% tar xvzf bayon-*.tar.gz
% cd bayon-*
% ./configure
% make
% make check
% sudo make install
When you want to partition input data, run the command-line tool with the following options. The output of clustering is printed to stdout.
The documents in each cluster are arranged in descending order of their similarity values, which are the cosine similarities between the document vectors and the vector of the cluster centroid. Clustering with the -p, --point option prints these similarity values in the result of clustering.
% bayon -n num [options] file
% bayon -l limit [options] file
-n, --number=num number of clusters
-l, --limit=lim limit value of cluster bisection
-p, --point output similarity point
-c, --clvector=file save vectors of cluster centroids
--clvector-size=num max size of output vectors of
cluster centroids (default: 50)
--method=method clustering method (rb, kmeans), default: rb
--seed=seed set seed for random number generator
To specify the number of output clusters, perform clustering with -n, --number option.
(Partition input data into one hundred clusters)
% bayon -n 100 input.tsv > output.tsv
(Partition input data into one hundred clusters
and print the result with similarity values)
% bayon -n 100 -p input.tsv > output.tsv
When it is difficult to decide the number of output clusters in advance, perform clustering with the -l, --limit option, which specifies the limit value of cluster bisection. It's better to set the limit value to 1.0-2.0 to get appropriate clustering results.
(Set the limit value of cluster partition to 1.0)
% bayon -l 1.0 input.tsv > output.tsv
To save the vectors of cluster centroids, perform clustering with the -c, --clvector option. The vectors will be saved into the specified file in text format.
Additionally, using the --clvector-size option, you can specify the max number of key-value pairs kept in each centroid vector. By default up to fifty pairs are saved.
(Save the vectors of cluster centroids into 'centroid.tsv')
% bayon -n 100 -c centroid.tsv input.tsv > output.tsv
(Up to twenty pairs are saved)
% bayon -n 100 -c centroid.tsv --clvector-size 20 input.tsv > output.tsv
If you want to get similar clusters for each input document by using the -C, --classify option, you need to save the vectors of cluster centroids.
By default bayon uses 'Repeated Bisection Clustering' method. If you want to use 'K-means Clustering' method, perform clustering with --method option.
(Partition input data into a hundred clusters using K-means clustering method)
% bayon -n 100 --method kmeans input.tsv > output.tsv
Repeated bisection clustering is faster and more accurate than K-means clustering, so in general it's better to use repeated bisection clustering.
Bayon selects documents randomly during clustering, so you can get different clustering results by specifying a different seed value for the random number generator. By default the same result is printed every time because the seed value is fixed.
(Set seed value to 123456)
% bayon -n 100 --seed 123456 input.tsv > output.tsv
Running the command-line tool with the -C, --classify option, you can get a list of similar clusters for each input document. This option works like soft clustering. The output is printed to stdout.
% bayon -C file [options] file
-C, --classify=file target vectors
--inv-keys=num max size of keys of each vector to be
looked up in inverted index (default: 20)
--inv-size=num max size of inverted index of each key
(default: 100)
--classify-size=num max size of output similar groups
(default: 20)
When you use the -C, --classify option, you need to perform clustering with the -c, --clvector option beforehand to save the vectors of the cluster centroids.
(First, cluster the input data, saving the vectors of the cluster centroids)
% bayon -c centroid.tsv -n 100 input.tsv > cluster.tsv
(Compare the vector of each document with the vector of each cluster centroid)
% bayon -C centroid.tsv input.tsv > classify.tsv
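To limit how many similar clusters are printed for each document, you can combine this with the --classify-size option; the value 5 below is only an illustrative choice, not a recommendation.
(Output at most five similar clusters for each document)
% bayon -C centroid.tsv --classify-size 5 input.tsv > classify.tsv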
When looking up similar clusters for each document with the -C, --classify option, bayon builds inverted indexes of the input cluster vectors to speed up the search. Bayon may run faster or more accurately if you tune the following inverted-index options.
- --inv-keys=num : the max number of vector keys looked up in the inverted indexes
- --inv-size=num : the max number of related clusters stored in the inverted index for each vector key
However, you usually don't need to specify these options; the example below shows how to set them if you do.
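For illustration, a possible invocation that tunes both inverted-index options (the values 10 and 50 are only placeholders):
(Look up at most ten keys per vector and keep at most fifty clusters per key)
% bayon -C centroid.tsv --inv-keys 10 --inv-size 50 input.tsv > classify.tsv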
The following options can be used both when clustering input data and when looking up similar clusters for each input document.
--vector-size max size of each input vector
--idf apply idf to input vectors
-h, --help show help messages
-v, --version show the version and exit
Specifying the --vector-size option, you can set the max number of items in each input vector, which may make bayon run faster and more accurately. An example follows below.
See the next section for details about the --idf option.
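As an illustration, here is a run that truncates each input vector to at most 100 items (the value 100 is only a placeholder):
(Keep at most one hundred items in each input vector)
% bayon -n 100 --vector-size 100 input.tsv > output.tsv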
The file of the list of input documents needs to be in a tab-separated text format, as below. Each line must contain the information about exactly one document: the identifier of the document followed by pairs of keys and values of the document's features. Blank lines must not be inserted.
document_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
document_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
- document_id: the identifier of a document; it must be a unique string containing no tab characters.
- key: the key of a document feature; it must be a string containing no tab characters.
- value: the value of a document feature; it must be a real number.
The value needs to reflect the degree of the feature appropriately, so you should apply a weighting scheme (e.g. tf-idf) to the input data. Specifying the --idf option, bayon adjusts the weights of the input data automatically.
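For example, if the input values are raw counts, you can let bayon apply idf weighting at clustering time (file names as in the earlier examples):
(Apply idf weighting to the input vectors)
% bayon -n 100 --idf input.tsv > output.tsv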
The file of the vectors of cluster centroids needs to be in a tab-separated text format, as below. This file is used as input data to get the similar clusters for each document with the -C, --classify option. Each field (cluster_id, key, value) is the same as in the list of input documents.
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
The file of the vectors of cluster centroids is saved by bayon automatically when the -c, --clvector option is used, so you don't usually need to create it manually.
- Example of the list of input documents
Alex Pop 10 R&B 6 Rock 4
Bob Jazz 8 Reggae 9
Dave Classic 4 World 4
Ted Jazz 9 Metal 2 Reggae 6
Fred Pop 4 Rock 3 Hip-hop 3
Sam Classic 8 Rock 1
In the above example, 'Alex', 'Bob', and 'Ted' are the identifiers of the documents, and 'Alex' has, as features, the genres of music (Pop, R&B, Rock) that he often listens to.
The file of the list of output clusters is in a tab-separated text format as below. Each line contains the information about one cluster, the identifier of the cluster followed by the identifiers of the documents in the cluster.
cluster_id1 \t document_id1 \t document_id2 \t document_id3 \t ...\n
cluster_id2 \t document_id4 \t document_id5 \t document_id6 \t ...\n
...
Clustering with the -p, --point option, the similarity values are printed in the result of clustering. The similarity values are calculated by the cosine similarity measure between the vectors of the documents and the vectors of the cluster centroids.
cluster_id1 \t document_id1 \t point1 \t document_id2 \t point2 \t ...\n
cluster_id2 \t document_id3 \t point3 \t document_id4 \t point4 \t ...\n
...
Clustering with the -c, --clvector option, the list of the vectors of cluster centroids is saved into the specified file. The format of the output file is tab-separated text, as below. Each line contains the information about one cluster: the identifier of the cluster followed by pairs of keys and values of the cluster centroid vector.
cluster_id1 \t key1-1 \t value1-1 \t key1-2 \t value1-2 \t ...\n
cluster_id2 \t key2-1 \t value2-1 \t key2-2 \t value2-2 \t ...\n
...
The list of similar clusters for each document, which is produced by the -C, --classify option, is in a tab-separated text format as below. Each line contains the information about one document: the identifier of the document followed by pairs of cluster identifiers and similarity values.
document_id1 \t cluster_id1 \t point1 \t cluster_id2 \t point2 \t ...\n
document_id2 \t cluster_id3 \t point3 \t cluster_id4 \t point4 \t ...\n
...
- The result of clustering the above example
1 Ted Bob
2 Alex Fred
3 Sam Dave
- The result of clustering the above example with -p, --point option
1 Ted 0.987737 Bob 0.987737
2 Alex 0.928262 Fred 0.928262
3 Sam 0.922401 Dave 0.922401
- The vectors of the cluster centroids
1 Jazz 0.750476 Reggae 0.654458 Metal 0.0920378
2 Pop 0.806401 Rock 0.451887 Hip-hop 0.277129 R&B 0.262137
3 Classic 0.921175 World 0.383297 Rock 0.0672347
- The result of looking up the similar clusters for each document
Alex 2 0.928261 3 0.0218138
Bob 1 0.987737
Dave 3 0.922401
Ted 1 0.987737
Fred 2 0.928262 3 0.034592
Sam 3 0.922401 2 0.0560497
The execution times for clustering and for looking up the similar clusters (classification) are as below. Bayon performed both clustering and classification quickly.
# of input documents | # of clusters | clustering time (sec) | classification time (sec) |
---|---|---|---|
1000 | 10 | 0.17 | 0.088 |
50000 | 1000 | 36 | 13.4 |
500000 | 10000 | 720 | 1385 |