# CLUSTERING

Clustering is an unsupervised learning technique: the model is not trained on a labeled corpus. It works directly with the features in your data and tries to find patterns and logical groupings in the underlying dataset. Clustering applies to a wide range of use cases, such as finding the relevant documents in a corpus, color quantization, and so on.

K-means clustering is by far the most popular clustering algorithm; however, we'll implement and contrast different clustering techniques and figure out in which situations you might want to use each.

Supervised learning algorithms seek to learn the function f that links the input features to the output labels. You can think of supervised learning as a complex reverse-engineering problem in which the model tries to figure out what exactly this f is. Consider one of the first machine-learning models you've probably worked with, linear regression, which finds the best-fit line via a training process. Linear regression is an example of supervised learning that specifies up front that the function f connecting the input and the output is linear in nature. Machine-learning models, however, range from the very simple to the extremely complicated. The function f could be really complicated, and to reverse engineer it you need a complex model, such as a neural network. You've probably heard that neural networks can do some pretty amazing stuff: they can learn, or reverse engineer, pretty much anything given the right training data. Everything discussed so far applies only to supervised learning techniques, such as classification and regression.

Unsupervised machine-learning techniques do not have y variables or a correctly labeled corpus. You only have raw features in your input data, and everything you do uses just those raw features; there are no labels or predictions associated with them. So broadly, there are two types of ML algorithms that you'll work with. In supervised learning, the labels associated with the training data are used to correct and tweak the algorithm to build a model. In the real world, though, getting labeled data is very difficult, which means you might have to work with the second type, unsupervised learning, where the model has to be set up right to learn hidden structures in the data. Understanding unsupervised learning is important in this context because clustering is a classic example of an unsupervised technique. When you're working with clustering, or any other unsupervised technique, you only have the input x data; you do not have the output predictions or labels. What you are trying to do is model the underlying structure in the data, to understand the data better and to find patterns. Unsupervised algorithms discover these patterns and structure by themselves; they just have to be set up right.

We mentioned earlier that clustering is an example of unsupervised learning, where we set up the model to learn structure in the data. There are no labels that we use to train the model. Before we move on to talking about clustering specifically, let's talk a little bit about use cases of unsupervised learning. 


Now, it's often the case in the real world that data is unlabeled. You might apply an unsupervised learning technique to extract structure from unlabeled data without requiring anyone to label it first.


For example, if you want to identify photos of a specific individual, you might feed a model lots of different photographs, millions of them until it starts identifying similar features. 


Unsupervised learning techniques are also used for latent factor analysis. Let's say you have a huge amount of data, what are the significant factors in the data? 

For example, finding common drivers of 200 separate stocks or shares. That's an example of latent factor analysis. 


Unsupervised learning techniques are also used for clustering.

*** 
Clustering involves finding logical groupings in the underlying data. 
*** 

Let's say you have a large document corpus of newspaper articles, and you want to find those articles that pertain to sports. You could perform clustering on all of these newspaper articles and find that cluster which contains sports-related information. 


Unsupervised learning techniques are also used for anomaly detection. For example, you can use these techniques to flag fraudulent credit card transactions.


Unsupervised learning techniques are also used for quantization, especially with colors. Let's say you have an original image in true color, with 24 bits per pixel, and you want to compress this image before you feed it into an ML model. Reducing the image to a small set of representative colors is an example of unsupervised learning.


We've seen that supervised learning techniques are very powerful, but they need labeled data, and it's hard to find labeled data in the real world. 


Unsupervised learning techniques are often used as pre-training for supervised learning problems, such as classification and regression. 


Of all of the unsupervised learning ML techniques, there are two that are very popular and widely used, autoencoders and clustering. 


The first of these is clustering, which is used to identify patterns in data. These patterns are then used to bring data items together into logical groups.


K-means clustering is a popular and widely-used clustering algorithm, and that's what we'll study first in this module. 

*** 
Autoencoding is another example of a popular unsupervised ML algorithm. Autoencoders are typically used to identify latent factors in the underlying data. 
*** 

Let's say you have a very large dataset, and this dataset could be essentially anything: documents, people, or events. How do you find patterns in this data? How do you make sense of this very large dataset?


One intuitive way is to group these data items based on some common attributes. So data items that belong to the same group have something in common, and data items that are in different groups are different in some way. And this is exactly what clustering tries to do. 


Now clustering is easy to imagine when you're talking about simple data, maybe just coordinates, but what if you want to cluster more complex data such as products sold on Amazon, people on Facebook, or websites indexed by Google. 


You can imagine that each of these entities can be very complex. There is a rich set of attributes or information available about each of them. How do you work with these to cluster them into logical groups? There are too many entities and too many attributes per entity, many of these attributes are not really numeric, and all of this put together results in huge complexity. This is a difficult problem to solve.


Now you might already know that machine-learning models can accept only numeric input. The interesting thing is that any attribute belonging to any entity can be represented by a set of numbers. 


So for a product, some of the numbers that you use to represent it could be the product ID, the timestamp it was sold, and the amount a customer paid. 


For a person, it could be age, height, weight, how frequently she hits like on a particular kind of news, and how frequently she logs on to Facebook or your social media site.


For a web page, you could have the length of a web page or word frequencies within a web page. These could be your numbers to represent that page. 
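As a rough illustration of turning a web page into numbers, the sketch below builds a small feature vector from the page length and a few word frequencies. The page text and the vocabulary are made up for this example; a real pipeline would use a much larger vocabulary and proper tokenization.

```python
from collections import Counter

# Hypothetical page text and vocabulary, chosen only for illustration.
page_text = "The striker scored and the keeper saved but the striker scored again"
vocabulary = ["striker", "scored", "election"]

# Naive tokenization: lowercase and split on whitespace.
words = page_text.lower().split()
counts = Counter(words)

# Feature vector: [page length in words, count of each vocabulary word]
feature_vector = [len(words)] + [counts[word] for word in vocabulary]
print(feature_vector)
```

Every page processed this way yields a vector of the same length, which is what lets different pages be compared with one another.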


Once you have data represented in the form of numbers, every entity, whether it's a product, a user, or a web page, can be represented using a feature vector, and every entity becomes a data point described by that vector. For example, the age, height, and weight of a person can be plotted in a 3-dimensional space. You need one dimension for age, another for height, and a third axis, the Z axis, for the weight of the individual.


Just extend your imagination to more complex features that can be represented using numbers, and to more than 3 dimensions. A set of N numbers represents a point in an N-dimensional space. Once you have points in this N-dimensional space, you can perform clustering and grouping using different techniques.


You can use distance measures to compute the distances between these points to find which points lie in the same cluster, or you could find regions of very high density, and those could be your clusters. These are all examples of different clustering techniques that you could employ. 
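To make the idea of distance between points concrete, here is a minimal sketch that treats three people as feature vectors and computes the Euclidean distance between them. The people and their values are hypothetical, and `euclidean_distance` is a helper written for this example.

```python
import numpy as np

# Hypothetical people described by (age in years, height in cm, weight in kg).
people = np.array([
    [25.0, 170.0, 65.0],
    [27.0, 172.0, 68.0],
    [60.0, 160.0, 80.0],
])


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# The first two people are close together; the third is far from both.
d_01 = euclidean_distance(people[0], people[1])
d_02 = euclidean_distance(people[0], people[2])
print(d_01, d_02)
```

Because the features use different units, one feature can dominate the distance; in practice features are often rescaled to comparable ranges before clustering.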



Let's say you wanted to perform clustering on a set of points where every data point represents a Facebook user. Once clustering has been performed, people in the same group are similar to one another, and people who are in different groups or different clusters are different from one another. Based on the attributes that you've looked at to perform this clustering or the clustering model that you've used, your groupings could be different. 


Now in a real-world use case, it's possible that users who are in the same cluster like the same kind of music, went to the same high school, have the same friends, or enjoy the same kinds of movies.


If you want to run ad campaigns on individuals who have the same taste, here is one way to do it. 


All of the data points in this N-dimensional space are separated by distances, and the distance between two users measures how similar or different they are. Distances between users in the same cluster should be small, that is, users in the same cluster should be similar to one another, and distances between users in different clusters should be large. These are the objectives of clustering.
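One way to check these objectives on a toy dataset is to compare the average distance within clusters against the average distance between clusters. The points and cluster labels below are invented for illustration, and the two averages are just one simple way of summarizing the idea.

```python
import numpy as np

# Toy 2-D points with an assumed cluster assignment (two tight groups).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Boolean masks: pairs in the same cluster, excluding a point paired with itself.
same_cluster = labels[:, None] == labels[None, :]
not_self = ~np.eye(len(points), dtype=bool)

mean_intra = dist[same_cluster & not_self].mean()  # should be small
mean_inter = dist[~same_cluster].mean()            # should be large
print(mean_intra, mean_inter)
```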

*** 
Entities in the same group should be very similar, and entities in different groups should be very different. 

*** 
#### What are some of the other use cases of clustering? 


To find relevant documents in a corpus. Now document archives are generally very rich, and it's hard to identify content relevant to a specific user or query. It's not possible for all of these documents to have labels or keywords. You can clump documents into semantically-similar groups using clustering. 


Clustering is also used for color quantization, where we represent the original image using fewer distinct colors. True color images use 24 bits per pixel, which is huge. Many displays and image formats cannot use this kind of granularity; they use just 8 bits per pixel. Just randomly picking 2 to the power 8, that is 256, colors to represent the original image is not optimal. If your original image is of the sea, too few of the randomly picked colors will be close to the colors that actually appear in the image. This is where you can use clustering to identify the 256 most representative colors for your image, and then quantize each true color to the nearest shade, that is, to the nearest cluster center.
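A minimal sketch of color quantization with scikit-learn's `KMeans` might look like the following. It uses a small synthetic random image in place of a real photo, and a palette of 8 colors rather than 256 so that it runs quickly; both choices are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 32x32 RGB "image" standing in for a real photo (assumption).
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

n_colors = 8  # the text uses 256; a smaller palette keeps the sketch fast

# Treat every pixel as a point in 3-D color space.
pixels = image.reshape(-1, 3).astype(np.float64)

# Cluster the pixels; each cluster center is one representative color.
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.round().astype(np.uint8)

# Replace every pixel with the palette color of its cluster.
quantized = palette[kmeans.labels_].reshape(image.shape)
print(quantized.shape, len(np.unique(quantized.reshape(-1, 3), axis=0)))
```

With a real image you would load the pixel array from a file instead of generating it, and the palette would adapt to the colors that dominate that image.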

# K-Means

The objective of clustering is to maximize intra-cluster similarity and to minimize inter-cluster similarity, and that's exactly what the K-means clustering algorithm tries to achieve. 


In order to understand how K-means clustering works, let's imagine data in two dimensions. To perform K-means clustering, we first initialize K centroids, that is, K means in this data. These K initial centroids can be picked at random, or you can use one of the algorithms that exist for choosing good starting centroids.


Once we have these K centroids, we iterate through all of the points and assign each point to a cluster. We measure the distance between each point and every centroid, and assign the point to the cluster represented by its nearest centroid.


Once we have all of the clusters, we recalculate the mean, or centroid, of each cluster. At this point, the centroids might move a little bit.


Once we have the new means, we reassign each point to the cluster whose centroid is now closest to it. This is the same process that we apply over and over again.


We iterate until all of the points are in their final clusters and the means no longer move. This is how you know the algorithm has terminated: the K-means algorithm has reached convergence.


K-means clustering is an example of a centroid-based clustering algorithm, where every cluster can be represented using a centroid or a reference vector. The term reference vector makes more sense, but because of how we calculate these means by finding the average of all of the points in a cluster, these reference vectors are often called centroids. 



Here is what the pseudocode for K-means clustering looks like.

<img src="../files/Capture7.png" width="200" height="300">

In the very first step, we initialize K centroids. This is an initial solution; algorithms exist to pick these initial centroids well, so that your clustering performs well.


The next two sets of actions are repeated until convergence, that is, until your cluster centers no longer move.


You'll iterate through all of the points that you have in your dataset, and for each data point, you will assign that data point to the nearest cluster by calculating the distance of that point from all of your centroids and finding the nearest one. 

Once all of your points have been assigned to some cluster, you will then update the coordinates of the centroid or the reference vector. 

You'll find new coordinates by averaging all of the data points that belong to that cluster. 

After every iteration, you'll check whether the reference vectors, or centroids, have moved. If they have moved, the algorithm hasn't converged and you keep iterating; if they have not, the centroids have converged and we stop.
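The steps described above can be sketched in NumPy as follows. This is a simplified illustration, not a production implementation: the function name `kmeans`, the random choice of initial centroids from the data, the `max_iters` cap, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, max_iters=100, seed=0):
    """Simplified K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

On this toy dataset the two tight groups end up in separate clusters; which group gets cluster id 0 depends on the random initialization.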


**** 

*** 
Each step of this process involves a design parameter in K-means clustering. These are the hyperparameters of your model. 

When you initialize K centroids, you have to specify the value of K, that is the number of clusters into which you want to group your data. 

The initial value for these centroids can be randomly chosen or there are algorithms that you can use to pick these values. 

In the next step, you have to calculate for each data point which centroid or cluster center is the closest. 

The distance measure is also a hyperparameter. Euclidean distance is the one that is most commonly used, and that's the distance measure that we use when we use our scikit-learn estimator. 

The third design decision for this algorithm is how we update the coordinates of the cluster center. Calculating the cluster center from points in the cluster can be done in different ways, the simple averaging technique is often used.

*** 

**** 
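As a rough illustration of where these design decisions show up in practice, the sketch below fits scikit-learn's `KMeans` estimator on a toy dataset. The data and the specific parameter values are chosen only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with two obvious groups (made up for this example).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(
    n_clusters=2,       # K: the number of clusters to form
    init="k-means++",   # how the initial centroids are picked ("random" also works)
    n_init=10,          # number of different initializations to try
    max_iter=300,       # cap on assign/update iterations per initialization
    random_state=0,     # fixed seed so the run is repeatable
)
labels = model.fit_predict(X)

print(labels)
print(model.cluster_centers_)  # final centroids (means of the assigned points)
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

Note that this estimator does not expose the distance measure as a parameter; it works with Euclidean distance and updates centroids by averaging.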

# Evaluating Clustering Models


Once you've applied a clustering technique or a model to your data, how do you know that the clustering that you've performed is good? There are different techniques that you can use to evaluate your clustering models. Once you have clusters in your data, you can calculate the homogeneity score, the completeness score, the V-measure score, the Adjusted Rand Index, Adjusted Mutual Info score, and the silhouette score for your clusters. 


Of all of these scoring techniques, only the silhouette score can be computed without original categories or labels on your data. 


All of the other scores require labeled data. Once you get a conceptual understanding of what exactly these scores are trying to capture, you'll find that scikit-learn has built-in functions for calculating all of these scores, so implementation is very straightforward. 


Let's start off by discussing homogeneity, completeness, and the V-measure score. You'll find that these are closely related. 


We'll first discuss homogeneity and completeness. Both of these are properties that we want in our clusters. 


Homogeneity basically says that every cluster should contain entities or members that belong to the same class. So within a cluster, if you have entities which belong to a different class, your homogeneity score will be lower. 


On the other hand, the completeness measure says that all members which belong to a particular category or a class should lie in the same cluster.


These two properties are subtly different, and the difference matters: homogeneity and completeness often pull in opposite directions. When you try to increase homogeneity, for example by splitting the data into more and smaller clusters, you'll find that completeness might fall.


Each is a separate score that lies between 0 and 1. Higher values are obviously better. Now if you've worked with classification algorithms, you know that precision and recall are two metrics that we use to evaluate classifiers. 




Note: Homogeneity and completeness differ in their point of view. Homogeneity takes the cluster as its starting point: within each cluster, we check whether every data point has the same class label. Completeness takes the class label as its starting point: for each class label, we check whether all of its data points fall in the same cluster.


<img src="../files/Capture9.png">


In the above diagram, the clustering is perfectly homogeneous, since in each cluster the data points all have the same class label, but it is not complete, because not all data points with the same class label belong to the same cluster.


<img src="../files/Capture10.png">


In the above diagram, the clustering is perfectly complete because all data points of the same class label belong to the same cluster but it is not homogeneous because the 1st cluster contains data points of many class labels.
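The two situations above can be reproduced with scikit-learn's `homogeneity_score` and `completeness_score`. The label lists are tiny made-up examples, not the data behind the diagrams.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# True classes for four hypothetical data points.
labels_true = [0, 0, 1, 1]

# Every cluster is pure, but each class is split across two clusters.
pure_but_split = [0, 1, 2, 3]

# Each class sits in a single cluster, but that cluster mixes both classes.
together_but_mixed = [0, 0, 0, 0]

h_split = homogeneity_score(labels_true, pure_but_split)
c_split = completeness_score(labels_true, pure_but_split)
h_mixed = homogeneity_score(labels_true, together_but_mixed)
c_mixed = completeness_score(labels_true, together_but_mixed)

print("pure but split:    ", h_split, c_split)  # homogeneous, not complete
print("together but mixed:", h_mixed, c_mixed)  # complete, not homogeneous
```

The first clustering scores perfectly on homogeneity but not on completeness, and the second does the opposite.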

Homogeneity and completeness are similar to precision and recall, and we need a third metric that balances the tradeoff between the two. This third metric, which combines homogeneity and completeness into a single value, is the V-measure score, and here is the mathematical formula for the V-measure. It is the harmonic mean of homogeneity and completeness.


<img src="../files/Capture8.png">

The harmonic mean never exceeds the arithmetic mean, so it is pulled toward the lower of the two values, homogeneity and completeness, and it rewards clusterings where both metrics are high rather than just one.
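To see the harmonic mean at work, the sketch below computes 2 * h * c / (h + c) by hand and compares it with scikit-learn's `v_measure_score` using its default settings. The label lists are made up for the example.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Hypothetical true classes and cluster assignments.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = homogeneity_score(labels_true, labels_pred)
c = completeness_score(labels_true, labels_pred)

# Harmonic mean of homogeneity and completeness.
v_manual = 2 * h * c / (h + c)
v_sklearn = v_measure_score(labels_true, labels_pred)

print(h, c)
print(v_manual, v_sklearn)
```

The hand-computed value should match the library's result, and it lies between the lower of the two scores and their arithmetic mean.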


When you're evaluating your clustering models, it's helpful to take homogeneity, completeness, and the V-measure score into account together, because they are a related set of metrics. All of them are bounded scores between 0 and 1. You'll find that these are among the most commonly used metrics for evaluating clustering models because they are easy to interpret.


You just know that higher values are better, and these scores can be applied to any clustering algorithm. One drawback is that you require labeled data: all of the data that you're clustering needs to have the original classes or categories to which it belongs in order to calculate these scores.


Another metric that you can use to evaluate how well your clustering model performed is the Adjusted Rand Index, or ARI. This metric tries to measure the similarity between the original labels assigned to your data and the clusters into which your data points have been grouped. Once again this is a metric that can be calculated only if you have labeled data available. 


The term adjusted here is because it's adjusted for the probability of correct labeling purely by chance. So if you were to randomly take your data and assign them to clusters, what is the probability that you'll end up with this set of clusters? This is named after William Rand, that's why it's called the Adjusted Rand Index. 


When you use scikit-learn's function to calculate the Adjusted Rand Index, the highest possible value is 1, which indicates that the labels and the predicted clusters from your clustering model agree perfectly; your clustering model did very well. Values close to 0 indicate that the labels and the calculated clusters are independent, with no relationship between the two, as you would expect from random assignment. Negative values are also possible and indicate agreement that is worse than chance. Either way, 0 or negative values are indicative of bad clustering.
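A short sketch with scikit-learn's `adjusted_rand_score` shows both ends of this range. The label lists are hypothetical and chosen so that one clustering matches the true classes exactly and the other cuts across them.

```python
from sklearn.metrics import adjusted_rand_score

# True classes for six hypothetical data points.
labels_true = [0, 0, 0, 1, 1, 1]

# Same grouping as the true classes; only the cluster ids are swapped.
perfect = [1, 1, 1, 0, 0, 0]

# Every cluster takes one point from each class, ignoring the true grouping.
poor = [0, 1, 2, 0, 1, 2]

ari_perfect = adjusted_rand_score(labels_true, perfect)
ari_poor = adjusted_rand_score(labels_true, poor)

print(ari_perfect)  # perfect agreement
print(ari_poor)     # worse than chance, so the score is negative
```

The score depends only on which points are grouped together, so renaming the cluster ids does not change it.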


Let's now move on to another evaluation metric, Adjusted Mutual Information, or AMI. This is a measure of the mutual information shared between the true labels and the cluster assignments, adjusted for chance, and once again, like the other techniques that we've seen, it needs labeled data.


Mutual information between two labelings is a measure of how well you can predict one labeling by observing the other. An Adjusted Mutual Information score of 1 is the highest possible value; it means your clusters have been formed well and the actual labels and predicted labels match. Values of 0 or below are bad: they indicate that data points might as well have been assigned to clusters at random, and that the labels and calculated clusters are independent.
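The same kind of check can be sketched with scikit-learn's `adjusted_mutual_info_score`. Again the label lists are invented for illustration: one clustering reproduces the true grouping and the other is unrelated to it.

```python
from sklearn.metrics import adjusted_mutual_info_score

# True classes for six hypothetical data points.
labels_true = [0, 0, 0, 1, 1, 1]

# Same grouping as the true classes; only the cluster ids are swapped.
perfect = [1, 1, 1, 0, 0, 0]

# Cluster membership carries no information about the true class.
unrelated = [0, 1, 2, 0, 1, 2]

ami_perfect = adjusted_mutual_info_score(labels_true, perfect)
ami_unrelated = adjusted_mutual_info_score(labels_true, unrelated)

print(ami_perfect)    # perfect agreement
print(ami_unrelated)  # no shared information beyond what chance would give
```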


And finally, before we move on, let's briefly talk about the silhouette score.


An important advantage of using silhouette scoring to evaluate your cluster model is the fact that you do not need labeled data. Silhouette scoring just works on the features. 


Silhouette score involves the calculation of something known as the silhouette coefficient, and the silhouette coefficient is associated with each sample in your dataset. 

The silhouette coefficient is a measure of how similar an object is to objects in its own cluster, and how different an object is from objects which live in other clusters. 

The overall silhouette score for your clustering model averages the silhouette coefficient for each sample. Since we are only looking at data points, there is no need for labeled data.
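For a single sample, the silhouette coefficient is commonly written as (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below uses scikit-learn's `silhouette_samples` and `silhouette_score` on a made-up dataset; note that only the features and the cluster assignments are passed in, never true labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 2-D data with two well-separated groups (made up for this example).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# One silhouette coefficient per sample, each between -1 and 1.
per_sample = silhouette_samples(X, cluster_labels)

# The overall score is the mean of the per-sample coefficients.
overall = silhouette_score(X, cluster_labels)

print(per_sample)
print(overall)
```

Values close to 1 mean a point sits well inside its own cluster and far from the others, which is what this toy dataset produces.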