# Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Category: Prototype based, Density based, Hierarchical

----------------------------------------------------------------------------------------------------------------------
##### K-Means Clustering - Pros: xxxx | Cons: xxxx

##### K-Means++ Clustering - Pros: xxxx | Cons: xxxx

##### K-Modes Clustering - Pros: xxxx | Cons: xxxx

##### K-Medians Clustering - Pros: xxxx | Cons: xxxx

##### K-Medoids Clustering - Pros: xxxx | Cons: xxxx

##### Hierarchical Clustering (Agglomerative) Bottom up - Pros: xxxx | Cons: xxxx

##### Hierarchical Clustering (Divisive) Top Down - Pros: xxxx | Cons: xxxx

##### Fuzzy Clustering - Pros: xxxx | Cons: xxxx

##### DBSCAN Clustering - Pros: xxxx | Cons: xxxx

##### OPTICS Clustering - Pros: xxxx | Cons: xxxx

##### Expectation Maximization (EM) - Pros: xxxx | Cons: xxxx

##### Non Negative Matrix Factorization - Pros: xxxx | Cons: xxxx

##### Latent Dirichlet Allocation (LDA) - Pros: xxxx | Cons: xxxx

----------------------------------------------------------------------------------------------------------------------

## --------------------- K-Means Clustering

#### Wiki Definition: 
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

Lloyd -> Given any set of k centers Z, for each center z in Z, let V(z) denote its neighborhood. That is the set of data points for which z is the nearest neighbor. Each stage of Lloyd's algorithm moves every center point z to the centroid of V(z) and then updates V(z) by recomputing the distance from each point to its nearest center. These steps are repeated until convergence. Note that Lloyd's algorithm can get stuck in locally minimal solutions that are far from the optimal. For this reason it is common to consider heuristics based on local search, in which centers are swapped in and out of an existing solution (typically at random). Such a swap is accepted only if it decreases the average distortion, otherwise it is ignored.
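
The two alternating steps of Lloyd's algorithm can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `lloyd_kmeans`, the choice to start from k randomly picked data points, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: V(z) is the set of points whose nearest center is z.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the centroid of V(z).
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if V(z) is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                    rng.normal(5.0, 0.3, (50, 2))])
centers, labels = lloyd_kmeans(X_demo, k=2)
print(centers)
```

The sketch only converges to a local minimum, which is why library implementations usually rerun it from several starting points and keep the best run.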

Forgy -> Forgy's algorithm is a simple alternating least-squares algorithm consisting of the following steps:

1. Initialize the codebook vectors.
2. Repeat the following two steps until convergence:
   - Read the data, assigning each case to the nearest (using Euclidean distance) codebook vector.
   - Replace each codebook vector with the mean of the cases that were assigned to it.

MacQueen -> An online variant of the same idea: instead of waiting for a full pass over the data, the winning codebook vector is updated right after each case is assigned. Suppose that when processing a given training case, N cases have been previously assigned to the winning codebook vector; the vector is then moved to the running mean of those N + 1 cases.

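Hartigan and Wong -> Given n objects with p variables measured on each object, x(i,j) for i = 1,2,...,n; j = 1,2,...,p, K-means allocates each object to one of K groups or clusters to minimize the within-cluster sum of squares, summed over all K clusters:

$$\mathrm{Sum}(k) = \sum_{i \in C_k} \sum_{j=1}^{p} \left( x(i,j) - \bar{x}(k,j) \right)^2$$

where $C_k$ is the set of objects in cluster k and $\bar{x}(k,j)$ is the mean of variable j over the objects in cluster k.
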
#### Input Data: 
X(Numeric)
#### Initial Parameters: 
K(Number of clusters)
#### Cost Function: 
The within-cluster sum of squared distances between each X and the centroid of its cluster. Given fixed centroids, it is minimized by assigning each X to its nearest centroid; given fixed labels, it is minimized by moving each centroid to the mean of its cluster. The two steps alternate.
#### Process Flow: 
Initialize K centroids and assign each X (observation) to the closest centroid under the distance metric. Compute Avg(x) for each cluster formed by that assignment and use it as the new position of the centroid. Repeat the assignment and update steps until the assignments stop changing.
#### Evaluation Methods: 

#### Tips: 
Choose K based on either business knowledge or the 'elbow method' applied to the cost function value.
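
One way to apply the elbow method is to fit k-means for a range of K and look at how the cost drops. The sketch below assumes scikit-learn and the iris data used elsewhere in this notebook; the range 1 to 8 is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# inertia_ is the total within-cluster sum of squares of the fitted model.
sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_

for k, value in sse.items():
    print(f"k={k}: SSE={value:.1f}")
```

The "elbow" is the K after which the SSE keeps decreasing but only slowly; reading it off is a judgment call rather than a fixed rule.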


In [None]:
# ------------------------------------- R Code
data <- iris[, 3:4]
set.seed(20)
kmean.cluster <- kmeans(data,# - numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). 
                        centers=3,# - either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres. 
                        iter.max = 300,# - the maximum number of iterations allowed. 
                        nstart = 1,# - if centers is a number, how many random sets should be chosen? 
                        algorithm = "Hartigan-Wong",# - one of "Hartigan-Wong" (the default), "Lloyd", "Forgy" or "MacQueen".
                        trace=FALSE)# - only used in the default method ("Hartigan-Wong"): if positive (or true), tracing information on the progress of the algorithm is produced.

kmean.cluster$cluster # A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
kmean.cluster$centers # A matrix of cluster centres.
kmean.cluster$totss # The total sum of squares.
kmean.cluster$withinss # Vector of within-cluster sum of squares, one component per cluster
kmean.cluster$tot.withinss # Total within-cluster sum of squares, i.e. sum(withinss).
kmean.cluster$betweenss # The between-cluster sum of squares, i.e. totss-tot.withinss.
kmean.cluster$size # The number of points in each cluster.
kmean.cluster$iter # The number of (outer) iterations.
kmean.cluster$ifault # integer: indicator of a possible algorithm problem – for experts.


In [7]:
# -------------------------------------- Python Code
from sklearn import datasets
from sklearn.cluster import KMeans
iris = datasets.load_iris()
X = iris.data
y = iris.target

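km = KMeans(n_clusters=3, # k
            init='random', # pick the initial centroids at random from the data
            n_init=10, # number of runs with different initial centroids; the run with the lowest SSE is kept
            max_iter=300, # maximum iterations in each run
            tol=1e-04, # tolerance used to declare convergence
            random_state=0)
Y_km = km.fit_predict(X) # array of cluster labels, one per observation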

## --------------------- K-Means++ Clustering

#### Wiki Definition: 
The classic k-means algorithm uses a random seed to place the initial centroids, which can result in bad clusterings or slow convergence if the initial centroids are chosen poorly. The k-means++ algorithm addresses this by specifying a procedure to initialize the cluster centers, spreading them far away from each other, before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the solution found is guaranteed to be O(log k) competitive with the optimal k-means solution in expectation.
#### Input Data: 
X(Numeric)
#### Initial Parameters: 
K(Number of clusters)
#### Cost Function: 
The same cost as K-Means: the within-cluster sum of squared distances between each X and the centroid of its cluster. Only the initialization differs.
#### Process Flow: 
Choose the first centroid uniformly at random from the observations. For each X, compute the squared distance to its nearest already-chosen centroid. Select the next centroid at random from the observations using a weighted probability (that X's minimum squared distance / sum of all minimum squared distances). Repeat until K centroids are chosen, which tends to produce K centroids that are far away from each other --> Proceed to the classic K-Means process... 
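
The seeding procedure can be sketched as follows. The function name `kmeans_pp_init` and the synthetic three-blob data are assumptions for illustration; this is not how any particular library implements it.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from the rows of X with D^2 weighting."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: one data point chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next center: sample a point with probability d2 / sum(d2).
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)

# Three synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 4.0, 8.0)])
init = kmeans_pp_init(X_demo, k=3)
print(init)
```

Points that are already centers have zero weight, so they are never drawn twice. In scikit-learn, an array like `init` can be passed as `KMeans(init=init, n_init=1)` to continue with the standard iterations.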
#### Evaluation Methods: 

#### Tips: 
Choose K based on either business knowledge or the 'elbow method' applied to the cost function value.


In [None]:
# ------------------------------------- R Code


In [None]:
# ------------------------------------- Python Code
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3, # k
            init='k-means++', # initial centroids placed far away from each other
            n_init=10, # number of runs with different initial centroids; the run with the lowest SSE is kept
            max_iter=300, # maximum iterations in each run
            tol=1e-04, # tolerance used to declare convergence
            random_state=0)
Y_km = km.fit_predict(X) # array of cluster labels, one per observation

## --------------------- K-Modes Clustering

#### Wiki Definition: 
K-modes is an algorithm that extends the k-means paradigm to categorical domains. It uses a dissimilarity measure suited to categorical data (simple matching), replaces means with modes, and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function.
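
The two ingredients named above, the simple-matching dissimilarity and the frequency-based mode update, can be illustrated on a tiny hand-made cluster. The helper names `matching_dissimilarity` and `column_modes` and the toy records are assumptions for this sketch.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def column_modes(records):
    """Most frequent category in each column (ties broken by sort order)."""
    records = np.asarray(records)
    modes = []
    for col in records.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[counts.argmax()])
    return np.array(modes)

# Toy cluster of three categorical records (made up for illustration).
cluster = np.array([
    ["red", "small", "round"],
    ["red", "large", "round"],
    ["blue", "small", "round"],
])
mode = column_modes(cluster)
distances = [matching_dissimilarity(row, mode) for row in cluster]
print(mode)       # ['red' 'small' 'round']
print(distances)  # [0, 1, 1]
```

A full k-modes run alternates these two steps in the same way k-means alternates assignment and mean update.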
#### Input Data: 
X(Categorical)
#### Initial Parameters: 
K(Number of clusters)
#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 
Choose K based on either business knowledge or the 'elbow method' applied to the cost function value.


In [None]:
# ------------------------------------- R Code
install.packages("klaR")
library(klaR)
install.packages("vcd")
library(vcd)

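data("Arthritis") # from vcd: 84 observations of 5 variables
data <- Arthritis[, c("Treatment", "Sex", "Improved")] # keep only the categorical columns (ID and Age are numeric)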

kmode.cluster <- kmodes(data,# - A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.
                        modes=3,# - Either the number of modes or a set of initial (distinct) cluster modes. If a number, a random set of (distinct) rows in data is chosen as the initial modes.
                        iter.max = 10,# - The maximum number of iterations allowed.
                        weighted = FALSE)# - Whether usual simple-matching distance between objects is used, or a weighted version of this distance.

kmode.cluster$cluster # A vector of integers indicating the cluster to which each object is allocated
kmode.cluster$size # The number of objects in each cluster.
kmode.cluster$modes # A matrix of cluster modes.
kmode.cluster$withindiff # The within-cluster simple-matching distance for each cluster.
kmode.cluster$iterations # The number of iterations the algorithm has run.
kmode.cluster$weighted # Whether weighted distances were used or not.


In [None]:
# ------------------------------------- Python Code
# pip install kmodes
import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)


## --------------------- K-Medians Clustering

#### Wiki Definition: 
In statistics and data mining, k-medians clustering is a cluster analysis algorithm. It is a variation of k-means clustering where, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. This has the effect of minimizing error over all clusters with respect to the 1-norm distance metric, as opposed to the square of the 2-norm distance metric (which k-means minimizes).
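
A minimal sketch of the idea, assuming Manhattan (1-norm) distance for assignment and a coordinate-wise median for the update. The function name `kmedians_sketch`, the toy points, and the starting medians are made up for illustration.

```python
import numpy as np

def kmedians_sketch(X, initial_medians, n_iter=50):
    """Alternate 1-norm assignment and coordinate-wise median update."""
    medians = np.asarray(initial_medians, dtype=float)
    for _ in range(n_iter):
        # Assign each point to the nearest median under the 1-norm distance.
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_medians = medians.copy()
        for j in range(len(medians)):
            members = X[labels == j]
            if len(members) > 0:
                # The coordinate-wise median minimizes the summed 1-norm distance.
                new_medians[j] = np.median(members, axis=0)
        if np.array_equal(new_medians, medians):
            break
        medians = new_medians
    return medians, labels

# Two small groups plus one far outlier at (50, 50).
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [50.0, 50.0]])
medians, labels = kmedians_sketch(X_demo, [[0.0, 0.0], [5.0, 5.0]])
print(medians)                              # second median stays near (5, 5)
print(X_demo[labels == 1].mean(axis=0))     # the mean is dragged toward the outlier
```

Comparing the two printed centers shows the practical difference from k-means: the median barely moves when an outlier joins the cluster, while the mean shifts a lot.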
#### Input Data: 
X(Numeric)
#### Initial Parameters: 
K(Number of clusters)
#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 
Choose K based on either business knowledge or the 'elbow method' applied to the cost function value.


In [1]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code
# pip install pyclustering

from pyclustering.cluster.kmedians import kmedians
from pyclustering.utils import read_sample

# load list of points for cluster analysis
path = "sample.txt" # placeholder: replace with the path to your own file of points
data = read_sample(path)
# create instance of K-Medians algorithm
kmedians_instance = kmedians(data, # (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple.
                             [ [0.0, 0.1], [2.5, 2.6] ], # (list): Initial coordinates of medians of clusters that are represented by list: [center1, center2, ...].
                             tolerance = 0.25, # (double): Stop condition: if the maximum change of the cluster centers is less than tolerance, the algorithm stops processing.
                             ccore = False) # (bool): Whether the CCORE library (C++ pyclustering library) should be used instead of Python code.

# run cluster analysis and obtain results
kmedians_instance.process()
clusters = kmedians_instance.get_clusters() # list of clusters, each a list of point indexes



## --------------------- K-Medoids Clustering

#### Wiki Definition: 
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid shift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars) and works with an arbitrary metric of distance between data points. k-medoids is a classical partitioning technique that clusters a data set of n objects into k clusters, with k known a priori. A useful tool for determining k is the silhouette.
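
Because medoids are data points, the whole procedure can run on a precomputed distance matrix. The sketch below uses a simple alternating scheme (assign, then pick the best member as medoid); it is not the swap-based PAM procedure, and the name `kmedoids_sketch`, the toy points, and the starting indexes are assumptions for illustration.

```python
import numpy as np

def kmedoids_sketch(D, initial_medoids, n_iter=100):
    """Alternating k-medoids on a precomputed n x n distance matrix D."""
    medoids = np.array(initial_medoids)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(len(medoids)):
            members = np.flatnonzero(labels == j)
            if len(members) > 0:
                # New medoid: the member with the smallest total distance to the rest.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
# Manhattan distances here, but any dissimilarity matrix would do.
D = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
medoids, labels = kmedoids_sketch(D, initial_medoids=[1, 4])
print(medoids)  # [0 3]
print(labels)   # [0 0 0 1 1 1]
```

Only `D` is needed inside the function, which is what lets k-medoids work with arbitrary dissimilarities rather than coordinates.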
#### Input Data: 
X(Numeric)
#### Initial Parameters: 
K(Number of clusters)
#### Cost Function: 

#### Process Flow: 

#### Evaluation Methods: 

#### Tips: 
Choose K based on business knowledge, the 'elbow method' applied to the cost function value, or the silhouette.
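
A hedged sketch of using the silhouette to compare candidate values of K. scikit-learn does not ship a k-medoids estimator, so `KMeans` stands in as the clustering step here; the silhouette computation itself only needs the data and the labels, so the same loop would apply to labels from any k-medoids implementation.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data

# Mean silhouette coefficient for each candidate K (higher is better).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best K by silhouette:", best_k)
```

The silhouette is defined only for K of at least 2, which is why the loop starts there.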


In [None]:
# ------------------------------------- R Code
install.packages("cluster")
library(cluster)

# - https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
x <- iris[, 3:4] # assumption: example data, the same columns used in the K-Means cell
kmedoids <- pam(x,# data matrix or data frame, or dissimilarity matrix or object, depending on the value of the diss argument.
                k = 3,# positive integer specifying the number of clusters, less than the number of observations. 
                diss = FALSE,# logical flag: if TRUE (default for dist or dissimilarity objects), then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.
                metric = "euclidean", 
                medoids = NULL, 
                stand = FALSE, 
                cluster.only = FALSE,
                do.swap = TRUE,
                pamonce = FALSE, 
                trace.lev = 0)
# keep.diss and keep.data are left at their defaults, which pam() derives from diss, cluster.only and the number of observations.

kmedoids$medoids # the medoids (representative observations) of the clusters
kmedoids$clustering # vector of cluster assignments, one per observation


In [None]:
# ------------------------------------- Python Code
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils import read_sample

# load list of points for cluster analysis
path = "sample.txt" # placeholder: replace with the path to your own file of points
data = read_sample(path)

# create instance of K-Medoids algorithm
kmedoids_instance = kmedoids(data, # (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple.
                             [1, 10], # (list): Indexes of initial medoids (indexes of points in input data).
                             tolerance = 0.25, # (double): Stop condition: if the maximum change in medoid distance is less than tolerance, the algorithm stops processing.
                             ccore = False) # (bool): If True, the CCORE library (C++ pyclustering library) is used for clustering instead of Python code.

# run cluster analysis and obtain results
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters() # list of clusters, each a list of point indexes
medoids = kmedoids_instance.get_medoids() # indexes of the final medoids


## --------------------- Hierarchical Clustering (Agglomerative) Bottom up

#### Wiki Definition: 
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
#### Input Data: 
X(Numeric) / X(Categorical) ~ Distance Metric: Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance, Mahalanobis distance, Hamming distance (categorical), Levenshtein distance (categorical).
#### Initial Parameters: 
NA
#### Cost Function: 
Linkage ~ Minimum (single-linkage) clustering; maximum (complete-linkage) clustering; mean or average linkage clustering (UPGMA); centroid linkage clustering (UPGMC); minimum energy clustering.
#### Process Flow: 
Create a distance matrix using the chosen distance metric, then choose a linkage method and repeatedly merge the two closest clusters, from the bottom up, until a single cluster remains. When done, define the 'cut' of the dendrogram that gives the desired number of groups.
#### Evaluation Methods: 

#### Tips: 
Choose the number of groups (where to cut) based on business knowledge, the 'elbow method' applied to the cost function value, or the silhouette.
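
The three stages (distance matrix, linkage, cut) map directly onto SciPy's hierarchy module. The sketch below assumes SciPy is available and uses made-up two-blob data; the metric and linkage method are arbitrary picks from the lists above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two synthetic blobs of 10 points each (made-up data for illustration).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.2, (10, 2)),
                    rng.normal(3.0, 0.2, (10, 2))])

# Step 1: condensed pairwise distance matrix under the chosen metric.
dist = pdist(X_demo, metric="euclidean")
# Step 2: bottom-up merges using the chosen linkage criterion.
Z = linkage(dist, method="average")
# Step 3: cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`Z` holds one row per merge, so it can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree before deciding where to cut.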


In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- Hierarchical Clustering (Divisive) Top Down

#### Wiki Definition: 
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
#### Input Data: 
X(Numeric) / X(Categorical) ~ Distance Metric: Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance, Mahalanobis distance, Hamming distance (categorical), Levenshtein distance (categorical).
#### Initial Parameters: 
NA
#### Cost Function: 
Linkage ~ Minimum (single-linkage) clustering; maximum (complete-linkage) clustering; mean or average linkage clustering (UPGMA); centroid linkage clustering (UPGMC); minimum energy clustering.
#### Process Flow: 
Create a distance matrix using the chosen distance metric. Start with all observations in one cluster and recursively split clusters from the top down until every observation is on its own or a stopping rule is met. When done, define the 'cut' of the dendrogram that gives the desired number of groups.
#### Evaluation Methods: 

#### Tips: 
Choose the number of groups (where to cut) based on business knowledge, the 'elbow method' applied to the cost function value, or the silhouette.
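
One simple way to get top-down behaviour is to keep bisecting the "worst" cluster with 2-means. This is only one possible divisive strategy, swapped in here for illustration; the function name `bisecting_sketch`, the rule of splitting the cluster with the largest within-cluster sum of squares, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_sketch(X, n_clusters, seed=0):
    """Top-down clustering: repeatedly split one cluster in two."""
    labels = np.zeros(len(X), dtype=int)  # all observations start in cluster 0
    for new_label in range(1, n_clusters):
        # Pick the cluster with the largest within-cluster sum of squares.
        sse = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(new_label)]
        target = int(np.argmax(sse))
        idx = np.flatnonzero(labels == target)
        # Split it with 2-means; one half keeps the old label.
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label
    return labels

# Three synthetic blobs of 20 points each (made-up data for illustration).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0.0, 5.0, 10.0)])
labels = bisecting_sketch(X_demo, n_clusters=3)
print(labels)
```

Stopping after a fixed number of splits plays the role of the 'cut'; the sketch assumes every cluster it tries to split has at least two distinct points.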


In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- Fuzzy Clustering

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- DBSCAN Clustering

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- OPTICS Clustering

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- Non Negative Matrix Factorization

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- Latent Dirichlet Allocation (LDA)

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code



## --------------------- Expectation Maximization (EM)

In [None]:
# ------------------------------------- R Code



In [None]:
# ------------------------------------- Python Code

