## Support Vector Clustering

Clustering algorithms can be parametric or non-parametric. A parametric approach fits a model, as in the k-means algorithm; a non-parametric approach groups points according to some distance or similarity measure, as in hierarchical clustering, or by density estimation. Each has limitations: k-means, for example, requires the number of clusters to be chosen in advance and tends to favour compact, roughly spherical clusters.

SVC is a non-parametric clustering algorithm that makes no assumption about the number or shape of the clusters in the data. It works best for low-dimensional data. If the data is high-dimensional, a preprocessing step, e.g. principal component analysis, is usually required.
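The preprocessing step mentioned above could look roughly like the following. This is a minimal sketch using base R's `prcomp`; the iris measurements stand in for a higher-dimensional dataset, and keeping two components is an arbitrary choice made for illustration.

```r
# Illustrative only: reduce the four numeric iris measurements to two
# principal components before clustering.
data("iris")
features <- iris[, 1:4]

# Centre and scale so that each variable contributes on the same footing
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Keep the first two components as the reduced data set
reduced <- as.data.frame(pca$x[, 1:2])
head(reduced)
```

The `reduced` data frame (a name introduced here) could then be passed to the clustering function in place of the raw data.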

In the Support Vector Clustering (SVC) algorithm, data points are mapped from data space to a high-dimensional feature space using a Gaussian kernel. In feature space we look for the smallest sphere that encloses the image of the data, using the Support Vector Domain Description algorithm. This sphere is mapped back to data space, where it forms a set of contours that enclose the data points. These contours are interpreted as cluster boundaries. Points enclosed by the same contour are assigned to the same cluster.
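For reference, the Gaussian kernel behind this mapping is usually written as below. The symbols are assumptions made here for illustration: $\Phi$ denotes the implicit feature map, and $\gamma$ is the width parameter of the Gaussian.

$$
K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^2\right)
$$

Since $K(x, x) = 1$ for every $x$, each mapped point has unit norm in feature space, so the kernel depends only on distances between points in data space.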

SVC uses Support Vector Domain Description (SVDD) to delineate the region in data space where the input examples are concentrated. It looks for the smallest enclosing sphere in the feature space defined by the kernel function. In feature space the data is described by a sphere, but when mapped back to data space the sphere becomes a set of non-linear contours that enclose the data.
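In symbols, the smallest-enclosing-sphere problem can be sketched as follows. The notation is assumed here and follows the usual presentation of SVDD: $\Phi$ is the feature map, $K$ the kernel function, $a$ the sphere centre, $R$ its radius, and $\beta_j$ the dual coefficients.

$$
\min_{R,\,a} \; R^2 \quad \text{subject to} \quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 \quad \text{for all } j
$$

The centre is a weighted combination of the mapped points, $a = \sum_j \beta_j \Phi(x_j)$ with $\beta_j \ge 0$ and $\sum_j \beta_j = 1$, so the squared feature-space distance of any point $x$ from the centre needs only kernel evaluations:

$$
R^2(x) = K(x, x) - 2 \sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
$$

The contours in data space are then the set $\{x : R(x) = R\}$.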

The boundaries generated by SVDD become increasingly non-linear as the parameter $\gamma$, which governs the width of the Gaussian, is increased, as shown below.

<img src="../images/gamma.png">

As the value of $\gamma$ is increased, the SVDD boundary fits the data more tightly, and at several values of $\gamma$ the enclosing boundary splits, forming an increasing number of components (clusters). The support vectors are the data points that lie on the boundary.
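The effect of $\gamma$ can be explored numerically with a small base-R sketch. It approximates the hard-margin SVDD solution for a Gaussian kernel with a simple Frank-Wolfe iteration, chosen here only because it needs no extra packages; it is not meant to reproduce what svcR does internally. The function name `svdd_sketch`, the iteration count and the `1e-3` threshold are all assumptions made for illustration.

```r
# Squared Euclidean distances between the scaled numeric iris measurements
data("iris")
d2 <- as.matrix(dist(scale(iris[, 1:4])))^2

# Approximate hard-margin SVDD for a Gaussian kernel (K[i, i] = 1).
# The dual reduces to minimising t(beta) %*% K %*% beta over the simplex
# (beta >= 0, sum(beta) = 1), handled here with Frank-Wolfe steps.
svdd_sketch <- function(d2, gamma, iters = 2000) {
  K <- exp(-gamma * d2)
  n <- nrow(K)
  beta <- rep(1 / n, n)
  for (it in seq_len(iters)) {
    # The smallest entry of K %*% beta marks the point farthest
    # from the current centre in feature space
    far <- which.min(as.vector(K %*% beta))
    step <- 2 / (it + 2)
    beta <- (1 - step) * beta
    beta[far] <- beta[far] + step
  }
  centre_norm2 <- as.numeric(t(beta) %*% K %*% beta)
  # Squared feature-space distance of every point from the centre
  dist2 <- 1 - 2 * as.vector(K %*% beta) + centre_norm2
  list(beta = beta, dist2 = dist2, radius2 = max(dist2))
}

for (gamma in c(0.01, 0.1, 1, 10)) {
  fit <- svdd_sketch(d2, gamma)
  cat(sprintf("gamma = %5.2f  radius^2 = %.3f  points with beta > 1e-3: %d\n",
              gamma, fit$radius2, sum(fit$beta > 1e-3)))
}
```

Points whose coefficient `beta` is not negligible approximate the support vectors, so comparing the printed counts across values of `gamma` gives a rough feel for how the description of the data changes.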
<img src="../images/ignore_outliers.png">

SVC can deal with outliers by employing a soft margin constant that allows the sphere in feature space not to enclose all points. For large values of this parameter, we can also deal with overlapping clusters. 
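One common way to write the soft-margin version is the $\nu$-parametrised form below. The notation is assumed here for illustration: $\xi_j$ are slack variables, $N$ is the number of points, $\Phi$ is the feature map, and $a$ and $R$ are the sphere centre and radius. Whether svcR's `Nu` argument corresponds exactly to this $\nu$ is an assumption based on its name.

$$
\min_{R,\,a,\,\xi} \; R^2 + \frac{1}{\nu N} \sum_{j=1}^{N} \xi_j \quad \text{subject to} \quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j, \quad \xi_j \ge 0
$$

In the dual this bounds the coefficients, $0 \le \beta_j \le \frac{1}{\nu N}$ with $\sum_j \beta_j = 1$. Points with $\xi_j > 0$ lie outside the sphere, and $\nu$ is an upper bound on the fraction of such points, so a larger $\nu$ lets more points be treated as outliers.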

We will use the svcR package to perform Support Vector Clustering. Its main function, findSvcModel, computes a clustering model and returns it as an R object.

Parameters of findSvcModel are listed below:

* data.frame, the input data: a data frame in standard use and in loadMat use, or DatMat in Eval use (a matrix given as the only argument)
* MetOpt, optimization control parameter: optimStoch (stochastic optimization) or optimQuad (quadratic optimization)
* MetLab, labelling method: gridLabeling (grid labelling), mstLabeling (MST labelling) or knnLabeling (k-NN labelling)
* KernChoice, kernel choice: KernLinear (Euclidean), KernGaussian (RBF), KernGaussianDist (exponential) or KernDist (matrix data as kernel values)
* Nu, nu parameter
* q, q parameter
* K, number of nearest neighbours for the grid
* G, grid size
* Cx, x component to display (1 for the 1st attribute)
* Cy, y component to display (2 for the 2nd attribute)

In [2]:
# Load the svcR package for support vector clustering
library("svcR")

# Load the built-in iris dataset using the data() command
data("iris")

# findSvcModel() builds an SVC model; its parameters are described above
retA <- findSvcModel(iris, MetOpt = "optimStoch", MetLab = "gridLabeling",
    KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 4,
    G = 5, Cx = 0, Cy = 0)

In [None]:
# Plot the clusters of data points
plot(retA)
findSvcModel.summary(retA)

The plot above shows that, for the iris data, classes 2 and 3 are not well separated. The method captures the class 1 observations well, but from time to time a "bridge" forms between classes 2 and 3 and links them into one cluster.

In [None]:
# Read the glass dataset; the file has no header row
glass_data = read.csv("../../../datasets/glass/glass.txt",header=FALSE,sep=',')
# Assign descriptive column names
header = c("Id", "Refractive_Index","Sodium","Magnesium","Aluminium","Silicon","Potassium","Calcium","Barium","Iron","Type")
names(glass_data)=header
# Preview the first rows
head(glass_data)

In [None]:
retB <- findSvcModel(glass_data, MetOpt = "optimStoch", MetLab = "gridLabeling",
    KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 1,
    G = 5, Cx = 0, Cy = 0)

In [None]:
plot(retB)

In [None]:
findSvcModel.summary(retB)

In [None]:
# Apply k-means clustering to glass_data to compare with the SVC results
km.fit = kmeans(glass_data,6,nstart = 20)

# Display the statistics of the k-means model
km.fit
# Plot the clusters found by k-means on the refractive index and sodium columns
plot(glass_data[,2:3], col=(km.fit$cluster+1), main="K-Means Clustering Results with K=6", pch =20, cex =2)

In [None]:
# Use the multishapes dataset from the factoextra package for clustering with SVC
data("multishapes", package = "factoextra")
# multishapes has 3 columns: x, y and shape. We cluster the observations using only x and y,
# so create a data frame called df from the first two columns
df <- multishapes[, 1:2]

In [None]:
# Apply the clustering to the data frame df
retC <- findSvcModel(df, MetOpt = "optimStoch", MetLab = "gridLabeling",
    KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 1,
    G = 5, Cx = 0, Cy = 0)

In [None]:
# Plot the clusters of observations created for df dataframe using svc
plot(retC)