# Table of Contents

* [4 General Functions (Review)](#general_functions)
    * [4.1 Functions to Save and Open Variables](#open_save)
* [5 Document Clustering](#document_clustering) 
    * [5.1 k-Means Clustering](#k_means)
    * [5.2 Density-based Spatial Clustering of Applications with Noise (DBSCAN)](#dbscan)
    * [5.3 Balanced Iterative Reducing and Clustering using Hierarchies (Birch)](#birch)
    * [5.4 Affinity Propagation](#prop)

# General Functions <a id='general_functions'></a>

## Functions to Save and Open Variables <a id='open_save'></a>

Since it is not uncommon for a machine learning task to take a long time, it is good practice to save variables that may be needed in the future. This can be achieved with the [pickle](https://docs.python.org/3/library/pickle.html) module. Depending on the pickle protocol and platform, a single pickled object may be limited to about 4 GB (protocols below 4 cannot serialize larger objects). This limitation is why the 'metrics' variables are saved as individual items instead of a dictionary.
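
As a minimal sketch of the round trip that pickle provides, the example below writes an object to a file and reads it back. The helper names `save_obj` and `load_obj` and the temporary directory are assumptions made for illustration; they are not the notebook's own helpers, which work on the global namespace and the ./vars folder.

```python
import os
import pickle
import tempfile

def save_obj(obj, path):
    """Serialize obj to path (hypothetical helper, not part of the notebook)."""
    with open(path, 'wb') as file_obj:
        pickle.dump(obj, file_obj, protocol=pickle.HIGHEST_PROTOCOL)

def load_obj(path):
    """Return the object pickled at path (hypothetical helper)."""
    with open(path, 'rb') as file_obj:
        return pickle.load(file_obj)

# A temporary directory stands in for the ./vars folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_var')
    original = {'name': 'example', 'values': [1, 2, 3]}
    save_obj(original, path)
    restored = load_obj(path)

# The restored object is an equal copy, not the same object
print(restored == original, restored is original)
```

Opening the files in a `with` block closes them even if pickling fails, which matters when many large variables are saved in a loop.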

In [1]:
# Save variables to file
import pickle

def save_var(variable_name):
    """ Saves the variable with the provided variable name 
         in the global namespace to the ./vars folder 
         with the provided same name """
    
    with open('./vars/' + variable_name,'wb') as my_file_obj:
        pickle.dump(globals()[variable_name], my_file_obj, protocol=pickle.HIGHEST_PROTOCOL)

def save_var_list(variable_name_list):
    """ Saves each variable with the provided variable name 
         in the global namespace to the ./vars folder 
         with the provided same name """
    for name in variable_name_list:
        with open('./vars/' + name,'wb') as my_file_obj:
            pickle.dump(globals()[name], my_file_obj, protocol=pickle.HIGHEST_PROTOCOL)

def open_var(file_name):
    """ Returns the variable saved with the provided 
         file name located in the ./vars folder"""
    
    # Use a context manager so the file is closed after loading
    with open('./vars/' + file_name,'rb') as file_object:
        loaded_var = pickle.load(file_object)
    
    return loaded_var

def open_var_list(file_name_list):
    """ Loads a variable corresponding to each file name
         in file_name_list to the global namespace. """
    
    for file_name in file_name_list:
        globals()[file_name] = open_var(file_name)

In [2]:
%time mnist = open_var('mnist')

CPU times: user 157 ms, sys: 622 ms, total: 779 ms
Wall time: 2.14 s


In [3]:
%%time
# Load Datasets
class Dataset_Part:
    """ Represents a dataset with attributes
         data and target """
    
    data = None
    target = None
    def __init__(self, X, y):
        self.data = X
        self.target = y

open_var_list(['mnist_train', 'mnist_test', 'rcv1_train', 'rcv1_test'])

CPU times: user 30.7 ms, sys: 959 ms, total: 990 ms
Wall time: 3.32 s


In [4]:
import matplotlib
import matplotlib.pyplot as plt

def print_digit(dataset, index):
    # Get the sample at the given index
    digit_arr = dataset.data[index]
    # Reshape it to the size of the image
    digit_image = digit_arr.reshape(28,28)

    # Some information
    print(f'\tIndex: {index}\tLabel: {dataset.target[index]:.0f}')
    # Show the image
    plt.imshow(digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")
    plt.show()

# [Document Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) <a id='document_clustering'></a>

Clustering is an unsupervised training method, meaning it is performed on data without labels. Because of this, unsupervised learning is capable of finding relations that may not have been previously observed.

## Cluster Evaluation <a id='cluster_evaluation'></a>

Unsupervised learning uses different evaluation metrics than supervised learning. This is because unsupervised learning makes assumptions with no prior knowledge (i.e., no labels). Since the clusters do not correspond to predetermined labels, evaluation metrics such as precision and recall cannot be applied directly. Instead, the following metrics can be used.

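+ __homogeneity score__ - Measures whether each cluster contains only members of a single class. Ranges from 0 to 1 and requires ground-truth labels.
+ __completeness score__ - Measures whether all members of a given class are assigned to the same cluster. Ranges from 0 to 1 and requires ground-truth labels.
+ __silhouette score__ - Compares, for each sample, the mean distance to the other samples in its own cluster with the mean distance to the samples in the nearest neighboring cluster. Ranges from -1 to 1 and does not require ground-truth labels.
+ __V-measure__ - The harmonic mean of homogeneity and completeness [3].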
$$ v = 2 \cdot \frac{ \text{homogeneity} \cdot \text{completeness}}{\text{homogeneity} + \text{completeness}} $$
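
To make the relation between these scores concrete, the sketch below computes homogeneity, completeness, and V-measure from scratch using conditional entropy, the quantity the V-measure article [3] builds on. The function names are hypothetical and the toy labels are made up for illustration; in the notebook itself the equivalent values come from `sklearn.metrics`.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """Conditional entropy H(a | b) in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((n_ab / n) * log(n_ab / b_counts[b_val])
                for (_, b_val), n_ab in joint.items())

def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_classes = entropy(classes)
    h_clusters = entropy(clusters)
    # By convention a score is 1 when the normalizing entropy is zero
    homog = 1.0 if h_classes == 0 else 1.0 - conditional_entropy(classes, clusters) / h_classes
    comp = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(clusters, classes) / h_clusters
    v = 0.0 if homog + comp == 0 else 2 * homog * comp / (homog + comp)
    return homog, comp, v

# Toy example: class 1 is split across two clusters
true_classes = [0, 0, 1, 1]
found_clusters = [0, 0, 1, 2]
h, c, v = homogeneity_completeness_v(true_classes, found_clusters)
print(f'homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}')
```

Every cluster here contains a single class, so homogeneity is 1, while splitting class 1 across two clusters lowers completeness to 2/3 and the V-measure to 0.8.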

In [13]:
from sklearn import metrics
from sklearn.cluster import DBSCAN
from time import time

def bench_clust(estimator_lst, name_lst, data, labels):
    """ Fits each estimator to data and prints the fit time
         and the label-based scores of the resulting clusters.
         Inertia and silhouette are not computed here and are
         printed as 0 placeholders. """

    print('%-9s\t%-6s\t%-12s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s' 
          % ( 'title', 'time', 'inertia', 'homog', 'comp', 'v mes',
              'rand', 'mutu', 'silh'))
    
    for estimator, name in zip(estimator_lst, name_lst):
        t0 = time()
        estimator.fit(data)
        fit_time = time() - t0
        print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
              % (name, fit_time,
                 0,  # placeholder for estimator.inertia_
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels,  estimator.labels_),
                 0.))  # placeholder for metrics.silhouette_score

## [k-Means Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) <a id='k_means'></a>

This algorithm is implemented in scikit-learn as the sklearn.cluster.KMeans class. K-means clustering attempts to separate data into a predetermined number, k, of clusters. The aim is to create clusters of equal variance by minimizing inertia, also known as the within-cluster sum of squares. Inertia is defined as:

\begin{equation*}
 \sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)
\end{equation*}
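
As a small worked example, inertia can be computed by summing, over all samples, the squared Euclidean distance to the nearest centroid. The function name and the toy data below are assumptions made for illustration.

```python
from math import dist

def inertia(samples, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    return sum(min(dist(x, mu) ** 2 for mu in centroids) for x in samples)

# Four toy samples, each exactly 1 unit away from its nearest centroid
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(samples, centroids))  # 4.0
```

Moving a centroid away from the middle of its cluster increases this sum, which is why the centroid update in k-means uses the mean of the assigned samples.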


[comment]: <> (need to reword, too close to source)
To find clusters, k-means uses an iterative process explained by [Zhao et al.](https://doi.org/10.1016/j.neucom.2018.02.072) [1] as:

1. Initialize k centroids, one for each cluster. The most basic way to do this is by picking k random samples.
2. Assign each sample to the closest centroid.
3. Recompute the centroids from the assignments made in the previous step.
4. Repeat steps 2 and 3 until convergence.
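
The steps above can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (random-sample initialization, Euclidean distance, convergence when assignments stop changing); the function name and toy data are hypothetical, and scikit-learn's `KMeans` is considerably more elaborate.

```python
import random
from math import dist

def k_means(samples, k, max_iter=100, seed=0):
    """Return (centroids, assignments) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct random samples as the initial centroids
    centroids = rng.sample(samples, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each sample to the index of its closest centroid
        new_assignments = [
            min(range(k), key=lambda j: dist(x, centroids[j])) for x in samples
        ]
        # Step 4: stop once the assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of its members
        for j in range(k):
            members = [x for x, a in zip(samples, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Two well-separated toy groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
final_centroids, final_labels = k_means(points, k=2)
print(sorted(final_centroids))
print(final_labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.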

In [5]:
from sklearn import metrics
from sklearn.cluster import KMeans
from time import time

def bench_k_means(estimator_lst, name_lst, data, labels):
    print('%-9s\t%-6s\t%-12s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s\t%-4s' 
      % ( 'title', 'time', 'inertia', 'homog', 'comp', 'v mes', 'rand', 'mutu', 'silh'))
    for estimator, name in zip(estimator_lst, name_lst):
        t0 = time()
        estimator.fit(data)
        print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
              % (name, (time() - t0), estimator.inertia_,
                 metrics.homogeneity_score(labels, estimator.labels_),
                 metrics.completeness_score(labels, estimator.labels_),
                 metrics.v_measure_score(labels, estimator.labels_),
                 metrics.adjusted_rand_score(labels, estimator.labels_),
                 metrics.adjusted_mutual_info_score(labels,  estimator.labels_),
                 metrics.silhouette_score(data, estimator.labels_,
                                          metric='euclidean',
                                          sample_size=1000)))

bench_k_means([KMeans(init='k-means++', n_clusters=3, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=5, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=10, n_init=10, n_jobs=-1),
               KMeans(init='random', n_clusters=10, n_init=10, n_jobs=-1),
               KMeans(init='k-means++', n_clusters=15, n_init=10, n_jobs=-1),
               KMeans(init='random', n_clusters=15, n_init=10, n_jobs=-1) ],
              ["k-means++ k=3", "k-means++ k=5", "k-means++ k=10", "random k=10", "k-means++ k=15", "random k=15"],
              data=mnist.data, labels=mnist.target)

title    	time  	inertia     	homog	comp	v mes	rand	mutu	silh
k-means++ k=3	20.23s	213604851206	0.211	0.443	0.286	0.172	0.211	0.057
k-means++ k=5	24.08s	197606838359	0.390	0.578	0.465	0.330	0.390	0.072
k-means++ k=10	32.67s	178462843731	0.482	0.485	0.484	0.362	0.482	0.056
random k=10	34.80s	178432572697	0.496	0.504	0.500	0.367	0.496	0.059
k-means++ k=15	45.41s	167326445668	0.581	0.499	0.537	0.379	0.499	0.062
random k=15	49.65s	167326155420	0.581	0.499	0.537	0.379	0.499	0.071


In [6]:
%%time
from sklearn.cluster import KMeans

clf_kmeans = KMeans(n_clusters=10, random_state=23, n_jobs=-1)
clf_kmeans.fit(mnist_test.data)

CPU times: user 14 µs, sys: 4 µs, total: 18 µs
Wall time: 22.2 µs
CPU times: user 2.41 s, sys: 376 ms, total: 2.78 s
Wall time: 16.1 s


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=-1, precompute_distances='auto',
    random_state=23, tol=0.0001, verbose=0)

In [None]:
%%time
pred_kmeans = clf_kmeans.predict(mnist_test.data)

In [None]:
%%time
scores_kmeans = get_scores(mnist_test.target, pred_kmeans, 'macro')
print_scores(scores_kmeans, 'macro')

## [Density-based Spatial Clustering of Applications with Noise (DBSCAN)](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) <a id='dbscan'></a>

The DBSCAN algorithm groups samples into areas of high density that are surrounded by areas of low density. Because of this, clusters can be any shape and the number of clusters is not predetermined. Clusters are formed by finding regions that satisfy a minimum density, that is, a minimum number of documents per area. Any distance function can be used, and the choice of distance function determines the shape of the neighborhoods and therefore of the clusters [2].

To form a cluster, DBSCAN searches for points that have at least a minimum number of points within a specified distance, $\varepsilon$, of them. The area within $\varepsilon$ of a point is called its $\varepsilon$-neighborhood. Each point in a qualifying $\varepsilon$-neighborhood is then examined in turn, and if its own $\varepsilon$-neighborhood also contains the minimum number of points, the cluster is extended to include that neighborhood. A point that is not within $\varepsilon$ of the starting point but is included in the cluster through such a chain of neighborhoods is said to be density-reachable from it. [Good visualization here](https://cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf).
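
The expansion process described above can be sketched in plain Python. This is a simplified illustration, assuming Euclidean distance, a brute-force neighborhood search, and the convention that a point counts as a member of its own neighborhood; the function names and toy data are hypothetical and this is not scikit-learn's implementation.

```python
from math import dist

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including idx)."""
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one cluster label per point, with -1 marking noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels

# Two dense toy groups plus one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point has no neighbors other than itself, so it never reaches the minimum density and is labeled as noise rather than being forced into a cluster.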

In [None]:
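import numpy as np
import random

# Indices of all samples labeled as the digit 2
twos = np.where(mnist.target == 2.)[0]
result = []
tot_result = 0.
count = 0
# result = np.linalg.norm(mnist.data[twos], 'fro')
for k in twos:
    # Compare each sample with 100 randomly chosen samples of the same digit
    for i in twos[random.sample(range(len(twos)), 100)]:
        if k != i:
            # Euclidean distance between the two images
            res = np.linalg.norm(mnist.data[i] - mnist.data[k])
            result.append(res)
            tot_result += res
            count += 1

# Average distance over all sampled pairs
avg_result = tot_result / count
avg_result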

In [None]:
%%time
from sklearn.cluster import DBSCAN


clf_dbscan = DBSCAN(eps=10, min_samples=10, n_jobs=-1)
clf_dbscan.fit(mnist_test.data)

In [None]:
%%time
clf_dbscan_auto = DBSCAN(n_jobs=-1)
clf_dbscan_auto.fit(mnist_test.data)

In [14]:
from sklearn.cluster import DBSCAN

bench_clust([DBSCAN(n_jobs=-1)] ,["auto"], data=mnist.data, labels=mnist.target)
# bench_clust(clf_dbscan ,name="10, 10", data=mnist_test.data, labels=mnist_test.target)

title    	time  	inertia     	homog	comp	v mes	rand	mutu	silh
auto     	1186.30s	0	-0.000	1.000	-0.000	0.000	-0.000	0.000


## [Balanced Iterative Reducing and Clustering using Hierarchies (Birch)](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html) <a id='birch'></a>

[Paper](https://rdcu.be/XdFp)

## [Affinity Propagation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation) <a id='prop'></a>

Affinity propagation uses exemplars instead of centroids. Instead of computing a centroid for each cluster, it finds the samples that are most representative of the other samples. In addition, the number of clusters is not predetermined but is determined by the data.

[comment]: <> (need to reword, too close to source)
There are three main quantities used in affinity propagation.

1. The similarity of two samples, denoted $s(i,k)$.
2. The responsibility, $r(i,k)$, which is the accumulated evidence that sample $k$ should be the exemplar for sample $i$.
  
 \begin{equation*} 
   r(i, k) \leftarrow s(i,k) - \max_{k' \neq k} \left[ a(i, k') + s(i, k') \right]
 \end{equation*}

3. The availability, $a(i,k)$, which is the accumulated evidence that sample $i$ should choose sample $k$ to be its exemplar.
 
 \begin{equation*}
   a(i, k) \leftarrow \min \left[ 0, r(k,k) + \sum_{i' ~ s.t.~ i' \notin \{i,k\}} \max \left( 0, r(i',k) \right) \right]
 \end{equation*}
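
To show how the two update rules interact, the sketch below performs a single undamped update of the responsibilities and then of the availabilities on three toy samples. Several details are assumptions made for illustration and are not stated in this section: similarity is the negative squared distance, the diagonal preference $s(k,k)$ is set to the median off-diagonal similarity, only positive responsibilities $\max(0, r(i',k))$ enter the sum, and the self-availability $a(k,k)$ is the plain sum of those positive responsibilities. A full implementation would also damp the updates and iterate until the exemplars stop changing.

```python
def update_responsibility(S, A):
    """r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            best_other = max(A[i][kp] + S[i][kp] for kp in range(n) if kp != k)
            R[i][k] = S[i][k] - best_other
    return R

def update_availability(R):
    """a(i,k) from positive responsibilities sent to candidate exemplar k."""
    n = len(R)
    A = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            support = sum(max(0.0, R[ip][k]) for ip in range(n) if ip not in (i, k))
            if i == k:
                A[i][k] = support  # self-availability (assumed rule)
            else:
                A[i][k] = min(0.0, R[k][k] + support)
    return A

# Three toy samples on a line
x = [0.0, 1.0, 5.0]
preference = -16.0  # median of the off-diagonal similarities (assumption)
S = [[preference if i == k else -(x[i] - x[k]) ** 2 for k in range(3)]
     for i in range(3)]

A0 = [[0.0] * 3 for _ in range(3)]  # availabilities start at zero
R1 = update_responsibility(S, A0)
A1 = update_availability(R1)

for row in R1:
    print(row)
for row in A1:
    print(row)
```

After this first step the two nearby samples strongly favor each other as exemplars (responsibility 15 in both directions), while the distant third sample is indifferent between itself and its nearest neighbor.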

### Bibtex:

#### K means article [1]
@article{ZHAO2018195,
title = "k-means: A revisit",
journal = "Neurocomputing",
volume = "291",
pages = "195 - 206",
year = "2018",
issn = "0925-2312",
doi = "https://doi.org/10.1016/j.neucom.2018.02.072",
url = "http://www.sciencedirect.com/science/article/pii/S092523121830239X",
author = "Wan-Lei Zhao and Cheng-Hao Deng and Chong-Wah Ngo",
keywords = "Clustering, k-means, Incremental optimization"
}

#### DBSCAN article [2] 
@inproceedings{Ester:1996:DAD:3001460.3001507,
 author = {Ester, Martin and Kriegel, Hans-Peter and Sander, J\"{o}rg and Xu, Xiaowei},
 title = {A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise},
 booktitle = {Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
 series = {KDD'96},
 year = {1996},
 location = {Portland, Oregon},
 pages = {226--231},
 numpages = {6},
 url = {http://dl.acm.org/citation.cfm?id=3001460.3001507},
 acmid = {3001507},
 publisher = {AAAI Press},
 keywords = {arbitrary shape of clusters, clustering algorithms, efficiency on large spatial databases, handling noise},
 } 

#### V-Measure article [3]

@inproceedings{rosenberg2007v,
  title={V-measure: A conditional entropy-based external cluster evaluation measure},
  author={Rosenberg, Andrew and Hirschberg, Julia},
  booktitle={Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
  year={2007},
  url = {http://aclweb.org/anthology/D/D07/D07-1043.pdf},
}