# Table of Contents

* [4 General Functions (Review)](#general_functions)
    * [4.1 Functions to Save and Open Variables](#open_save)
* [5 Document Clustering](#document_clustering)
    * [5.1 Cluster Evaluation](#cluster_evaluation)
    * [5.2 k-Means Clustering](#k_means)
    * [5.3 Density-based Spatial Clustering of Applications with Noise (DBSCAN)](#dbscan)
    * [5.4 Affinity Propagation](#prop)
    * [5.5 Agglomerative Clustering](#agglomerative)
    * [5.6 Balanced Iterative Reducing and Clustering using Hierarchies (Birch)](#birch)

# General Functions <a id='general_functions'></a>

## Functions to Save and Open Variables <a id='open_save'></a>

Because machine learning tasks often take a long time to run, it is good practice to save variables that may be needed in the future. This can be achieved by using the [pickle](https://docs.python.org/3/library/pickle.html) module. This package allows a variable of up to 4 GB to be saved, which is why the 'metrics' variables are saved as individual items instead of as a single dictionary.

In [1]:
%env JOBLIB_TEMP_FOLDER=/home/jovyan/work/tmp

# Save variables to file
import pickle

def save_var(variable_name):
    """ Saves the variable with the provided variable name 
         in the global namespace to the ./vars folder 
         with the provided same name """
    
    with open('./vars/' + variable_name,'wb') as my_file_obj:
        pickle.dump(globals()[variable_name], my_file_obj, protocol=pickle.HIGHEST_PROTOCOL)

def save_var_list(variable_name_list):
    """ Saves each variable with the provided variable name 
         in the global namespace to the ./vars folder 
         with the provided same name """
    for name in variable_name_list:
        with open('./vars/' + name,'wb') as my_file_obj:
            pickle.dump(globals()[name], my_file_obj, protocol=pickle.HIGHEST_PROTOCOL)

def open_var(file_name):
    """ Returns the variable saved with the provided 
         file name located in the ./vars folder"""
    
    # Use a context manager so the file is closed after loading
    with open('./vars/' + file_name,'rb') as file_object:
        loaded_var = pickle.load(file_object)
    
    return loaded_var

def open_var_list(file_name_list):
    """ Loads a variable corresponding to each file name
         in file_name_list to the global namespace. """
    
    for file_name in file_name_list:
        globals()[file_name] = open_var(file_name)
        
mnist = open_var('mnist')

env: JOBLIB_TEMP_FOLDER=/home/jovyan/work/tmp


In [3]:
%%time
# Load Datasets
class Dataset_Part:
    """ Represents a dataset with attributes
         data and target """
    
    data = None
    target = None
    def __init__(self, X, y):
        self.data = X
        self.target = y

open_var_list(['mnist_train', 'mnist_test', 'rcv1_train', 'rcv1_test'])

CPU times: user 16 ms, sys: 76 ms, total: 92 ms
Wall time: 93.4 ms


In [4]:
import matplotlib
import matplotlib.pyplot as plt

def print_digit(dataset, index):
    # Get the sample at the given index
    digit_arr = dataset.data[index]
    # Reshape it to the size of the image
    digit_image = digit_arr.reshape(28,28)

    # Some information
    print(f'\tIndex: {index}\tLabel: {dataset.target[index]:.0f}')
    # Show the image
    plt.imshow(digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()

# [Document Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) <a id='document_clustering'></a>

Clustering is an unsupervised training method, meaning it is performed on data without labels. Because of this, unsupervised learning is capable of finding relations that may not have been previously observed.

## [Cluster Evaluation](http://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure) <a id='cluster_evaluation'></a>

Unsupervised learning uses different evaluation metrics than supervised learning because it makes assumptions with no prior knowledge (i.e., no labels). Since the clusters do not conform to predetermined labels, evaluation metrics such as precision and recall cannot be applied directly. Instead, the following metrics can be used. Each compares the cluster assignments with the ground-truth classes, which are used only for evaluation and never for fitting.

+ [__homogeneity score__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score) - A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. A value of 1.0 represents perfectly homogeneous labeling. Shown in equation (1) from [rosenberg-2007], where $a_{c,k}$ is the number of samples of class $c$ assigned to cluster $k$ and $N$ is the total number of samples.
$$ h = 1 - \frac{H(C \mid K)}{H(C)} $$   
$$H(C \mid K) = - \sum^{\mid C \mid}_{c=1} \sum^{\mid K \mid}_{k=1} \frac{a_{c,k}}{N} \cdot \log \frac{a_{c,k}}{\sum^{\mid C \mid}_{c=1} a_{c,k}} $$  
$$H(C) = - \sum^{\mid C \mid}_{c=1} \frac{\sum^{\mid K \mid}_{k=1} a_{c,k}}{N} \cdot \log \frac{\sum^{\mid K \mid}_{k=1} a_{c,k}}{N} $$
+ [__completeness score__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html) - A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. A value of 1.0 represents perfectly complete labeling. Shown in equation (1) from [rosenberg-2007].
$$ c = 1 - \frac{H(K \mid C)}{H(K)} $$
+ [__V-measure__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html) - V-measure is similar to F1 score in the sense that it is a harmonic mean, except it relates homogeneity and completeness [rosenberg-2007].  
$$ v = 2 \cdot \frac{ \text{homogeneity} \cdot \text{completeness}}{\text{homogeneity} + \text{completeness}} $$
+ [__Rand index__](https://doi.org/10.1007/BF01908075) - Measure of similarity between the predicted and true clusters. The Rand index considers all pairs of samples and counts the number of pairs that are assigned correctly to the same cluster, incorrectly to the same cluster, correctly to separate clusters, and incorrectly to separate clusters [hubert-1985].
  + If C is the ground truth of class assignment and K the clustering, let us define:
    + a as the number of pairs of elements that are in the same set in C and in the same set in K
    + b as the number of pairs of elements that are in different sets in C and in different sets in K
    + $C^{n_{samples}}_2$ as the total number of possible pairs in the dataset (without ordering)
$$ RI = \frac{a + b}{C^{n_{samples}}_2}$$
+ [__adjusted Rand score__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) - The Rand index adjusted for chance. The raw Rand index considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in both the predicted and true clusterings. The adjustment subtracts the value expected from random labelings, so random labelings score close to 0 and identical clusterings score 1.
$$ ARI = \frac{RI - E(RI)}{\max(RI) - E(RI)} $$
  + Note - E(RI) means expected RI, or the RI given random labelings. 
+ [__mutual information score__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html) - The measure of similarity between the predicted and true labels, ignoring permutation. Given two clusterings $U$ and $V$ of the same $N$ samples, let $U_i$ denote the $i$-th cluster of $U$ and $V_j$ the $j$-th cluster of $V$, so that $\mid U_i \cap V_j \mid$ is the number of samples they share.
$$ MI(U,V) = \sum^{\mid U \mid}_{i=1} \sum^{\mid V \mid}_{j=1} \frac{\mid U_i \cap V_j \mid}{N} \cdot \log \left( \frac{N \mid U_i \cap V_j \mid}{ \mid U_i \mid \mid V_j \mid} \right) $$
+ [__adjusted mutual info score__](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html) - The mutual information score adjusted to account for the fact that the mutual information score is typically greater when there are more clusters. The adjusted mutual information score is 1 when the two sets of clusters are the same. Random clusterings have an expected adjusted mutual information score near 0. Shown in equation (3) from [vinh-2010].
$$ AMI(U,V) = \frac{MI(U,V) - E(MI(U, V))} {\max(H(U), H(V)) - E(MI(U, V))} $$
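
As a rough illustration of the formulas above, the sketch below computes homogeneity, completeness, V-measure, and the Rand index directly from a contingency table with NumPy. The function names are hypothetical helpers written for this example (they are not part of scikit-learn), and the toy labels are made up. For real work the `sklearn.metrics` functions linked above are the better choice.

```python
import numpy as np

def contingency(classes, clusters):
    """Table a[c, k]: number of samples of class c placed in cluster k."""
    _, c_idx = np.unique(np.asarray(classes), return_inverse=True)
    _, k_idx = np.unique(np.asarray(clusters), return_inverse=True)
    table = np.zeros((c_idx.max() + 1, k_idx.max() + 1))
    np.add.at(table, (c_idx, k_idx), 1)
    return table

def entropy(counts):
    """Entropy of a distribution given as non-negative counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(table):
    """H(rows | columns) of a contingency table."""
    n = table.sum()
    ratio = table / table.sum(axis=0, keepdims=True)
    mask = table > 0  # skip empty cells, where 0 * log(0) is taken as 0
    return -np.sum(table[mask] / n * np.log(ratio[mask]))

def v_measure_by_hand(classes, clusters):
    """Return (homogeneity, completeness, V-measure)."""
    a = contingency(classes, clusters)
    h_c, h_k = entropy(a.sum(axis=1)), entropy(a.sum(axis=0))
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(a) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(a.T) / h_k
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v

def rand_index_by_hand(classes, clusters):
    """Fraction of sample pairs on which the two labelings agree."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    same_class = classes[:, None] == classes[None, :]
    same_cluster = clusters[:, None] == clusters[None, :]
    upper = np.triu_indices(len(classes), k=1)  # each unordered pair once
    return np.mean(same_class[upper] == same_cluster[upper])

# Toy example: two classes, but every sample gets its own cluster
toy_classes = [0, 0, 1, 1]
toy_clusters = [0, 1, 2, 3]
print(v_measure_by_hand(toy_classes, toy_clusters))
print(rand_index_by_hand(toy_classes, toy_clusters))
```

In the toy example every cluster is pure, so homogeneity is 1, but each class is split across two clusters, so completeness drops to 0.5. This shows why the two scores are combined into the V-measure.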


In [10]:
from sklearn import metrics
from time import time
from multiprocessing import Process, Manager
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23

def log_line(path, message):
    """ Appends one line to the file at path and closes it afterwards """
    with open(path, 'a') as log_file:
        print(message, file=log_file)

def fit_pred(estimator, data, labels, t0, name, est_type):
    """ Fits the estimator, prints its evaluation metrics,
         and saves the fitted estimator to the ./vars folder """
    
    log_line('./output.txt', f'fitting {name} estimator')
    
    estimator.fit(data)

    log_line('./output.txt', f'finished fitting {name} estimator')
    
    # Compare the cluster assignments with the ground-truth labels
    homo = metrics.homogeneity_score(labels, estimator.labels_)
    comp = metrics.completeness_score(labels, estimator.labels_)
    v_meas = metrics.v_measure_score(labels, estimator.labels_)
    adj_rand = metrics.adjusted_rand_score(labels, estimator.labels_)
    adj_mutu = metrics.adjusted_mutual_info_score(labels,  estimator.labels_)
    
    vals = f'{name:9s}\t{(time() - t0):7.2f}s\t{homo:.3f}\t{comp:.3f}\t{v_meas:.3f}\t{adj_rand:.3f}\t{adj_mutu:.3f}'

    print(vals)
    log_line(f'./out/{est_type}.txt', vals)
    
    log_line('./output.txt', f'saving {name} estimator')

    joblib.dump(estimator, f'./vars/{name}.pkl')
    
    log_line('./output.txt', f'finished saving {name} estimator')

    
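def bench_clust(estimator_lst, name_lst, data, labels, est_type):
    """ Fits each estimator in its own process and prints its metrics.
         Fitting happens in the child processes, so the fitted
         estimators must be loaded from the ./vars folder afterwards. """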

    print('%-9s\t   %-5s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s' 
          % ( 'title', 'time', 'homog', 'comp', 'v mes',
              'rand', 'mutu'))
    
    manager = Manager()
    est_lst = manager.list()
    
    processes = []
    for estimator, name in zip(estimator_lst, name_lst):
        t0 = time()
        
        p = Process(target=fit_pred, args=(estimator, data, labels, t0, name, est_type))
        p.start()
        
        processes += [p]

        est_lst += [estimator]
    return est_lst, processes

## [k-Means Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) <a id='k_means'></a>

This algorithm is implemented in scikit-learn as the sklearn.cluster.KMeans class. K-means clustering attempts to separate data into a predetermined number, $k$, of clusters. The aim is to create clusters with equal variance by minimizing inertia, also known as the within-cluster sum of squares. With $x_i$ a sample and $\mu_j$ the centroid of a cluster in the set of clusters $C$, inertia is defined as:

\begin{equation*}
 \sum_{i=0}^n \min_{\mu_j \in C} (\mid \mid x_i - \mu_j \mid \mid^2)
\end{equation*}


To find clusters, k-means uses an iterative process explained by [Zhao et al.](https://doi.org/10.1016/j.neucom.2018.02.072) [zhao-2018] as:

1. Initialize k centroids, one for each cluster. The most basic way to do this is by picking k random samples.
2. Assign each sample to the closest centroid.
3. Recompute centroids with assignments from the previous step.
4. Repeat step 2 and step 3 until convergence.
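
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `k_means_sketch`, the random initialization, the iteration cap, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def k_means_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means; returns (labels, centroids, inertia)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 1: initialize the centroids with k distinct random samples
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Step 2: assign each sample to its closest centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment and inertia (within-cluster sum of squares)
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    inertia = np.sum(dist[np.arange(len(X)), labels] ** 2)
    return labels, centroids, inertia

# Toy data: two well-separated groups of three points
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, inertia = k_means_sketch(toy, k=2)
print(labels, inertia)
```

The result depends on the initial centroids, which is why the benchmarks in this section pass `n_init=10` to keep the best of several runs.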

In [6]:
from sklearn import metrics
from time import time

def bench_k_means(estimator_lst, name_lst, data, labels):
    print('%-9s\t%-6s\t%-12s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s' 
      % ( 'title', 'time', 'inertia', 'homog', 'comp', 'v mes', 'rand', 'mutu', 'silh'))
    for estimator, name in zip(estimator_lst, name_lst):
        t0 = time()
        estimator.fit(data)
        print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
              % (name, (time() - t0), estimator.inertia_,
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels,  estimator.labels_),
                 metrics.silhouette_score(data, estimator.labels_,
                                          metric='euclidean',
                                          sample_size=1000)))

In [7]:
from sklearn.cluster import KMeans

bench_k_means([KMeans(init='k-means++', n_clusters=3, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=5, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=10, n_init=10, n_jobs=-1),
               KMeans(init='random', n_clusters=10, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=15, n_init=10, n_jobs=-1),
               KMeans(init='random', n_clusters=15, n_init=10, n_jobs=-1) ],
              ["k-means++ k=3", "k-means++ k=5", "k-means++ k=10", "random k=10", "k-means++ k=15", "random k=15"],
              data=mnist.data, labels=mnist.target)

title    	time  	inertia     	homog	comp	v mes	rand	mutu	silh
k-means++ k=3	17.52s	213604832818	0.211	0.443	0.286	0.172	0.211	0.058
k-means++ k=5	18.71s	197606834439	0.390	0.578	0.466	0.331	0.390	0.064
k-means++ k=10	27.39s	178432593770	0.496	0.504	0.500	0.367	0.496	0.061
random k=10	24.52s	178432235366	0.496	0.503	0.500	0.365	0.496	0.058
k-means++ k=15	27.78s	167326324983	0.581	0.499	0.537	0.379	0.499	0.060
random k=15	32.69s	167326474664	0.580	0.498	0.536	0.378	0.498	0.059


In [42]:
from sklearn.cluster import KMeans

est_km, proc_km = bench_clust(
              [ KMeans(init='k-means++', n_clusters=3, n_init=10, n_jobs=-1),
                KMeans(init='k-means++', n_clusters=5, n_init=10, n_jobs=-1),
                KMeans(init='k-means++', n_clusters=10, n_init=10, n_jobs=-1),
                KMeans(init='random', n_clusters=10, n_init=10, n_jobs=-1),
                KMeans(init='k-means++', n_clusters=15, n_init=10, n_jobs=-1),
                KMeans(init='random', n_clusters=15, n_init=10, n_jobs=-1) ],
              [ "k-means++ k=3", "k-means++ k=5", "k-means++ k=10", "random k=10", "k-means++ k=15", "random k=15"],
                data=mnist.data, labels=mnist.target, est_type='kMeans')

title    	   time 	homog	comp	v mes	rand	mutu
k-means++ k=3	  49.10s	0.211	0.443	0.286	0.172	0.211
k-means++ k=5	  57.04s	0.390	0.578	0.465	0.331	0.390
random k=10	  61.76s	0.496	0.503	0.500	0.365	0.496
k-means++ k=10	  68.20s	0.496	0.504	0.500	0.366	0.496
k-means++ k=15	  68.49s	0.582	0.499	0.537	0.380	0.499
random k=15	  73.17s	0.580	0.498	0.536	0.378	0.498


In [5]:
for p in proc_km:
    p.terminate()
    p.join()


## [Density-based Spatial Clustering of Applications with Noise (DBSCAN)](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) <a id='dbscan'></a>

The DBSCAN algorithm clusters samples into areas of high density that are surrounded by areas of low density. Because of this, clusters can be any shape and the number of clusters is not predetermined. Clusters are formed by finding regions that satisfy a minimum density, that is, a minimum number of documents per area. Any distance function can be used, and the choice of distance function determines the shape of the clusters [ester-1996].

To form a cluster, DBSCAN searches for points that have at least a minimum number of points within a specified distance, $\varepsilon$. The region within $\varepsilon$ of a point is called its $\varepsilon$-neighborhood. The cluster then expands outward: for each point in the $\varepsilon$-neighborhood, if that point's own $\varepsilon$-neighborhood also contains the minimum number of points, the cluster is updated to include it. A point that is not within $\varepsilon$ of the original central point but is included in the cluster through such a chain of neighborhoods is said to be density-reachable. [Good visualization here](https://cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf).
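
The expansion described above can be sketched as follows. This is a simplified illustration, not the scikit-learn implementation: it assumes Euclidean distance, counts a point as part of its own $\varepsilon$-neighborhood, and builds a full pairwise distance matrix, so it only suits tiny inputs. The name `dbscan_sketch` and the toy data are hypothetical.

```python
import numpy as np

NOISE = -1  # label for samples that belong to no cluster

def dbscan_sketch(X, eps, min_samples):
    """Label each sample with a cluster id, or NOISE."""
    X = np.asarray(X, dtype=float)
    n = len(X)

    # eps-neighborhood of every point (each point is its own neighbor)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, NOISE)
    cluster_id = 0
    for i in range(n):
        # Only an unassigned core point can start a new cluster
        if labels[i] != NOISE or not is_core[i]:
            continue
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    # q is density-reachable from the starting point
                    labels[q] = cluster_id
                    # Only core points keep expanding the cluster
                    if is_core[q]:
                        frontier.append(q)
        cluster_id += 1
    return labels

# Toy data: two dense squares and one isolated point
toy = [[0, 0], [0, 1], [1, 0], [1, 1],
       [10, 10], [10, 11], [11, 10], [11, 11],
       [50, 50]]
print(dbscan_sketch(toy, eps=1.5, min_samples=3))
```

The isolated point never falls inside a dense neighborhood, so it keeps the label -1, matching the way scikit-learn marks noisy samples.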

In [6]:
import numpy as np
import random
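# Indices of every sample with label 2 (despite the variable name `ones`)
ones = np.where(mnist.target == 2.)[0]
result = []      # every sampled pairwise distance
tot_result = 0.  # running sum of the sampled distances
count = 0        # number of sampled pairs
# Compare each sample with 100 randomly chosen samples of the same digit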
for k in ones:
    for i in ones[random.sample(range(len(ones)), 100)]:
        if k != i:
            res = np.linalg.norm(mnist.data[i] - mnist.data[k])
            result += [res]
            tot_result += res
            count += 1

avg_result = tot_result / count
avg_result

2422.3517784359951

In [45]:
from sklearn.cluster import DBSCAN

est_db, proc_db = bench_clust(
          [ DBSCAN(n_jobs=-1),
            DBSCAN(eps=1000, min_samples=10, n_jobs=-1),
            DBSCAN(eps=100, min_samples=10, n_jobs=-1),
            DBSCAN(eps=10, min_samples=100, n_jobs=-1),
            DBSCAN(eps=10, min_samples=1000, n_jobs=-1),
            DBSCAN(eps=2000, min_samples=5, n_jobs=-1),
            DBSCAN(eps=2000, min_samples=10, n_jobs=-1) ],
          [ "auto", "1000,10", "100,10", "10,100", "10,1000", "2000,5", "2000,10" ], 
                    data=mnist.data, labels=mnist.target, est_type='dbscan')

title    	   time 	homog	comp	v mes	rand	mutu
auto     	1014.43s	-0.000	1.000	-0.000	0.000	-0.000
10,100   	1639.32s	-0.000	1.000	-0.000	0.000	-0.000
10,1000  	1656.99s	-0.000	1.000	-0.000	0.000	-0.000
100,10   	2505.80s	-0.000	1.000	-0.000	0.000	-0.000
1000, 10 	8758.63s	0.142	0.538	0.225	0.065	0.141
2000, 10 	9785.88s	0.000	0.128	0.000	0.000	0.000
2000, 5  	9793.47s	0.000	0.129	0.000	0.000	0.000


In [15]:
from sklearn.cluster import DBSCAN

est_db, proc_db = bench_clust(
          [ DBSCAN(eps=500, min_samples=50, n_jobs=-1),
            DBSCAN(eps=500, min_samples=20, n_jobs=-1),
            DBSCAN(eps=500, min_samples=10, n_jobs=-1),
            DBSCAN(eps=500, min_samples=5, n_jobs=-1),
            DBSCAN(eps=750, min_samples=50, n_jobs=-1),
            DBSCAN(eps=750, min_samples=20, n_jobs=-1),
            DBSCAN(eps=750, min_samples=10, n_jobs=-1),
            DBSCAN(eps=750, min_samples=5, n_jobs=-1) ],
          [ "500,50", "500,20",  "500,10", "500,5", "750,50", "750,20",  "750,10",  "750,5"], 
            data=mnist.data, labels=mnist.target, est_type='dbscan')

title    	   time 	homog	comp	v mes	rand	mutu
500,20   	5855.63s	0.053	0.604	0.097	0.008	0.053
500,10   	5863.42s	0.068	0.664	0.124	0.014	0.068
500,5    	5870.39s	0.079	0.684	0.142	0.018	0.079
500,50   	5870.52s	0.021	0.396	0.040	0.000	0.021
750,50   	7657.90s	0.123	0.873	0.216	0.041	0.123
750,5    	7659.11s	0.143	0.804	0.243	0.047	0.142
750,10   	7665.11s	0.134	0.873	0.232	0.045	0.133
750,20   	7675.75s	0.129	0.896	0.225	0.044	0.129


In [13]:
for p in proc_db:
    p.join()

One of the best performing models uses eps=1000 and min_samples=10. To evaluate it further, we need to look at how samples are being labeled. A label of -1 marks a sample as noise, and 53,469 samples have been counted as noise. Also, there are 32 clusters when there should be around 10.

In [68]:
from collections import Counter
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23

clf_db = joblib.load('./vars/1000,10.pkl')

Counter(clf_db.labels_)

Counter({-1: 53469,
         1: 135,
         0: 74,
         9: 10,
         11: 11,
         3: 17,
         4: 39,
         2: 8,
         12: 7,
         6: 5,
         7: 9,
         10: 14,
         5: 9,
         8: 10,
         13: 15936,
         14: 10,
         15: 15,
         16: 11,
         27: 10,
         28: 11,
         29: 9,
         17: 74,
         18: 7,
         19: 11,
         20: 13,
         23: 6,
         22: 16,
         21: 10,
         24: 20,
         31: 7,
         25: 7,
         26: 5,
         30: 5})

## [Affinity Propagation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) <a id='prop'></a>

Affinity propagation uses exemplars instead of centroids: rather than computing a centroid, it finds samples that are most representative of the other samples. In addition, the number of clusters is not predetermined; it is determined by the data.

There are three main formulas used in affinity propagation.

1. The similarity, $s(i,k)$, between two samples. [frey-2007] uses the negative squared Euclidean distance, so more similar samples have a higher (less negative) value.
2. The responsibility, $r(i,k)$, represents "the accumulated evidence for how well-suited point $k$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$" [frey-2007]. Equation (1) from [frey-2007]:
  
 \begin{equation*} 
   r(i, k) \leftarrow s(i,k) - \max_{k' ~ \text{s.t.} ~ k' \neq k} \{a(i, k') + s(i, k')\}
 \end{equation*}

3. The availability, $a(i, k)$, represents "the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar" [frey-2007]. Equation (2) from [frey-2007]:
 
 \begin{equation*}
   a(i, k) \leftarrow \min \{ 0, r(k,k) + \sum_{i' ~ \text{s.t.} ~ i' \notin \{i,k\}} \max \{ 0, r(i',k) \} \}
 \end{equation*}
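
The two update rules above can be sketched with NumPy as shown below. Several details are assumptions made for this illustration rather than statements from the text: the similarity is taken to be the negative squared Euclidean distance, the self-similarity $s(k,k)$ is set to the median of the other similarities, the self-availability $a(k,k)$ is computed as the sum of the positive responsibilities sent to $k$ by the other points, and the messages are damped with a factor of 0.5. All function names are hypothetical.

```python
import numpy as np

def update_responsibility(S, A):
    """r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))."""
    n = len(S)
    rows = np.arange(n)
    AS = A + S
    best = AS.argmax(axis=1)          # column of the largest a + s per row
    first = AS[rows, best]
    AS[rows, best] = -np.inf
    second = AS.max(axis=1)           # largest value once the best is excluded
    R = S - first[:, None]
    R[rows, best] = S[rows, best] - second
    return R

def update_availability(R):
    """a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))."""
    n = len(R)
    diag = np.diag_indices(n)
    Rp = np.maximum(R, 0)
    Rp[diag] = R[diag]                # r(k,k) enters the sum unclipped
    A = Rp.sum(axis=0)[None, :] - Rp  # drop the i' = i term from each column sum
    self_avail = A[diag].copy()       # assumed rule for a(k,k): no min with 0
    A = np.minimum(A, 0)
    A[diag] = self_avail
    return A

def affinity_propagation_sketch(X, damping=0.5, n_iter=200):
    """Return the indices of the samples chosen as exemplars."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    S[np.diag_indices(n)] = np.median(S[~np.eye(n, dtype=bool)])
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(n_iter):
        R = damping * R + (1 - damping) * update_responsibility(S, A)
        A = damping * A + (1 - damping) * update_availability(R)
    # A point is treated as an exemplar when r(k,k) + a(k,k) is positive
    return np.flatnonzero(np.diag(R + A) > 0)

toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(affinity_propagation_sketch(toy))
```

Each update touches every entry of an $n \times n$ matrix, which is where the memory and time cost of the method comes from.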
 
This algorithm does not scale to large datasets. Running the command below uses nearly 200 GB of RAM.
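
A back-of-the-envelope estimate shows where a figure of that size can come from. The sketch assumes 70,000 samples (the DBSCAN label counts shown earlier sum to 70,000), dense 64-bit floating point matrices, and between three and five $n \times n$ matrices held in memory at once (similarity, responsibility, availability, plus temporaries). The matrix count is an assumption, not a measurement of scikit-learn.

```python
def dense_matrix_gb(n_samples, bytes_per_value=8):
    """Size in GB (10**9 bytes) of one dense n_samples x n_samples matrix."""
    return n_samples ** 2 * bytes_per_value / 1e9

n_samples = 70_000
per_matrix = dense_matrix_gb(n_samples)
print(f'one matrix: {per_matrix:.1f} GB')

# Assumed range for the number of matrices alive at the same time
for n_matrices in (3, 5):
    print(f'{n_matrices} matrices: {n_matrices * per_matrix:.1f} GB')
```

Because the cost grows with the square of the number of samples, halving the dataset would cut the estimate to a quarter.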

In [None]:
from sklearn.cluster import AffinityPropagation

est_aff, proc_aff = bench_clust([AffinityPropagation()] ,["auto"], data=mnist.data, labels=mnist.target, est_type='affinity')

In [None]:
for p in proc_aff:
    p.terminate()

## [Agglomerative Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering) <a id='agglomerative'></a>

Agglomerative clustering is a bottom-up hierarchical clustering approach: each sample begins as its own singleton cluster, and the two closest clusters are successively merged. There are several popular linkage criteria.

+ Ward - This approach minimizes variance, like k-means. This is achieved by minimizing the squared differences within the clusters.
+ Complete linkage - This approach minimizes the maximum distance between samples in pairs of clusters.
+ Average linkage - This approach minimizes the average distance between all samples in pairs of clusters.
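
To make the three criteria concrete, the sketch below scores a single candidate merge of two small clusters under each of them. The helper names and the toy clusters are hypothetical. The Ward cost uses the standard identity for the increase in the within-cluster sum of squares, which is an addition for this example and is not spelled out in the text above.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distance between every sample in A and every sample in B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def complete_linkage(A, B):
    """Maximum distance between samples of the two clusters."""
    return pairwise_distances(A, B).max()

def average_linkage(A, B):
    """Average distance between samples of the two clusters."""
    return pairwise_distances(A, B).mean()

def ward_merge_cost(A, B):
    """Increase in the within-cluster sum of squares if A and B are merged."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return len(A) * len(B) / (len(A) + len(B)) * np.dot(diff, diff)

cluster_a = [[0, 0], [0, 2]]
cluster_b = [[3, 0], [3, 2]]
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(ward_merge_cost(cluster_a, cluster_b))
```

At every step the algorithm would evaluate the chosen criterion for each pair of clusters and merge the pair with the smallest value.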

In [46]:
from sklearn.cluster import AgglomerativeClustering

est_agg, proc_agg = bench_clust(
          [ AgglomerativeClustering(n_clusters=3, linkage="ward"),
            AgglomerativeClustering(n_clusters=5, linkage="ward"),
            AgglomerativeClustering(n_clusters=10, linkage="ward"),
            AgglomerativeClustering(n_clusters=10, linkage="complete"),
            AgglomerativeClustering(n_clusters=10, linkage="average"),
            AgglomerativeClustering(n_clusters=15, linkage="ward") ],
          [ "ward n=3", "ward n=5", "ward n=10", "camp n=10", "avg n=10", "ward n=15" ], 
                    data=mnist.data, labels=mnist.target, est_type='agglom')

title    	   time 	homog	comp	v mes	rand	mutu
ward n=5 	2238.60s	0.486	0.779	0.599	0.351	0.486
camp n=10	2242.07s	0.260	0.334	0.292	0.130	0.260
ward n=3 	2250.85s	0.284	0.718	0.407	0.188	0.284
ward n=15	2264.27s	0.718	0.633	0.673	0.460	0.633
ward n=10	2264.96s	0.673	0.691	0.682	0.527	0.673
avg n=10 	2276.93s	0.093	0.690	0.164	0.029	0.093


In [None]:
for p in proc_agg:
    p.join()

## [Gaussian Mixture Models (GMM)](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) <a id='gmm'></a>

## [Balanced Iterative Reducing and Clustering using Hierarchies (Birch)](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html) <a id='birch'></a>

[Paper](https://rdcu.be/XdFp)  
Birch is a memory-efficient, online-learning algorithm provided as an alternative to MiniBatchKMeans. It constructs a tree data structure, with the cluster centroids being read off the leaves.

### Bibtex:

#### K means article [1]

@article{zhao-2018-k-means-a-revisit,
title = "k-means: A revisit",
journal = "Neurocomputing",
volume = "291",
pages = "195 - 206",
year = "2018",
issn = "0925-2312",
doi = "https://doi.org/10.1016/j.neucom.2018.02.072",
url = "http://www.sciencedirect.com/science/article/pii/S092523121830239X",
author = "Zhao, Wan-Lei and Deng, Cheng-Hao and Ngo, Chong-Wah",
keywords = "Clustering, k-means, Incremental optimization"
}

#### DBSCAN article [2] 

@inproceedings{ester-1996-a-density-based-algorithm-for-discovering-clusters-in-large-spatial-databases-with-noise,
 author = {Ester, Martin and Kriegel, Hans-Peter and Sander, J\"{o}rg and Xu, Xiaowei},
 title = {A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise},
 booktitle = {Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
 series = {KDD'96},
 year = {1996},
 location = {Portland, Oregon},
 pages = {226--231},
 numpages = {6},
 url = {http://dl.acm.org/citation.cfm?id=3001460.3001507},
 acmid = {3001507},
 publisher = {AAAI Press},
 keywords = {arbitrary shape of clusters, clustering algorithms, efficiency on large spatial databases, handling noise},
 } 

#### V-Measure article [3]

@inproceedings{rosenberg-2007-proceedings-of-the-2007-joint-conference-on-empirical-methods-in-natural-language-processing-and-computational-natural-language-learning,
  title={V-measure: A conditional entropy-based external cluster evaluation measure},
  author={Rosenberg, Andrew and Hirschberg, Julia},
  booktitle={Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
  year={2007},
  url = {http://aclweb.org/anthology/D/D07/D07-1043.pdf},
}

#### Rand Index article [4]

@Article{hubert-1985-comparing-partitions,
author="Hubert, Lawrence
and Arabie, Phipps",
title="Comparing partitions",
journal="Journal of Classification",
year="1985",
month="Dec",
day="01",
volume="2",
number="1",
pages="193--218",
abstract="The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Miligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. They are generated from corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between {\textpm}1.",
issn="1432-1343",
doi="10.1007/BF01908075",
url="https://doi.org/10.1007/BF01908075"
}

#### Adjusted Mutual Information article [5]

@article{vinh-2010-information-theoretic-measures-for-clusterings-comparison-variants,-properties,-normalization-and-correction-for-chance,
 author = {Vinh, Nguyen Xuan and Epps, Julien and Bailey, James},
 title = {Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance},
 journal = {J. Mach. Learn. Res.},
 issue_date = {3/1/2010},
 volume = {11},
 month = dec,
 year = {2010},
 issn = {1532-4435},
 pages = {2837--2854},
 numpages = {18},
 url = {http://dl.acm.org/citation.cfm?id=1756006.1953024},
 acmid = {1953024},
 publisher = {JMLR.org},
}

#### Affinity Propagation article [6]

@article {frey-2007-clustering-by-passing-messages-between-data-points,
  author = {Frey, Brendan J. and Dueck, Delbert},
  title = {Clustering by Passing Messages Between Data Points},
  volume = {315},
  number = {5814},
  pages = {972--976},
  year = {2007},
  doi = {10.1126/science.1136800},
  publisher = {American Association for the Advancement of Science},
  abstract = {Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such {\textquotedblleft}exemplars{\textquotedblright} can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called {\textquotedblleft}affinity propagation,{\textquotedblright} which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.},
  issn = {0036-8075},
  URL = {http://science.sciencemag.org/content/315/5814/972},
  eprint = {http://science.sciencemag.org/content/315/5814/972.full.pdf},
  journal = {Science}
}
