# 3. Unsupervised Machine Learning

We use unsupervised machine learning in the following situations:
1. We do not have a label to predict. An example is using an algorithm to look at brain scans to find areas that may raise concern. You don't have labels on the images to indicate which areas are concerning, but you can still determine which areas are most similar to or different from one another.

2. We are not trying to predict a label, but rather want to group or condense our data for some other reason. One example is when you have a very large amount of data and would like to condense it down to a smaller number of features.

There are many methods of unsupervised learning, including clustering, hierarchical and density-based clustering, Gaussian mixture models and cluster validation, principal component analysis (PCA), and random projection and independent component analysis. Broadly, unsupervised machine learning can be classified into clustering and dimensionality reduction.

## 3.1 Clustering

Clustering algorithms attempt to find groupings of similar items. The k-means algorithm is one example.

Three ways to identify the number of clusters in your dataset:

1. Visual inspection of your data.
2. Pre-conceived ideas of the number of clusters.
3. The elbow method, which compares the average distance of each point to the cluster center for different numbers of centers.
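
The elbow method can be sketched with scikit-learn as follows. This is a minimal illustration rather than a definitive recipe: the three-blob dataset is synthetic and made up for this example, and scikit-learn's `inertia_` (the sum of squared distances to the nearest center) stands in for the average distance described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for this example): three blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia of each fit.
ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k should show a sharp bend (the "elbow") at the point where adding more centers stops helping much. Since this data was generated from three blobs, the bend is expected at k = 3.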

### K-means algorithm

K-means is one of the most popular algorithms used for clustering. The 'k' in k-means refers to the number of clusters to form during the execution of the algorithm.

It has the following steps:

1. Randomly place k centroids amongst your data. Then repeat steps 2 and 3 until convergence, keeping the number of centroids the same.
2. Look at the distance from each centroid to each point. Assign each point to the closest centroid.
3. Move the centroid to the center of the points assigned to it.
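
The three steps above can be sketched in NumPy. This is a minimal illustration under a few assumptions that are not in the text: the function name `k_means` is hypothetical, the initial centroids are chosen by picking k distinct data points at random, and a centroid with no assigned points is left where it is.

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal k-means sketch. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct data points as the initial centroids (an assumption).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from every point to every centroid, shape (n_points, k),
        # then assign each point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Made-up data: two small, well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which group ends up labelled 0 and which 1 depends on the random starting centroids, so only the grouping itself is meaningful.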

### Limitations of K-means algorithm

There are some concerns with the k-means algorithm:

1. Concern: The random placement of the centroids may lead to non-optimal solutions.

Solution: Run the algorithm multiple times and choose the centroids that create the smallest average distance of the points to the centroids.

2. Concern: Depending on the scale of the features, you may end up with different groupings of your points.

Solution: Standardize the features, which gives each feature a mean of 0 and a standard deviation of 1, before running the k-means algorithm.
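
Both solutions can be sketched with scikit-learn. The data below is made up for illustration (two features on very different scales). Note that `n_init=10` makes `KMeans` run from ten random starts and keep the run with the lowest inertia, i.e. the smallest sum of squared distances, which is close in spirit to, but not the same as, the smallest average distance mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the groups differ only in feature 0, while feature 1
# is noise on a much larger scale.
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
group_b = np.column_stack([rng.normal(10, 1, 100), rng.normal(0, 1000, 100)])
X = np.vstack([group_a, group_b])

# Concern 2: standardize so each feature has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Concern 1: run k-means from several random starts and keep the best run.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Without the scaling step, the large-scale noise in feature 1 would dominate the distance calculation and the grouping would largely ignore feature 0.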

## 3.2 Hierarchical clustering

### 3.2.1 Single link clustering

In single link clustering, the algorithm measures the distance from each point to every other point and groups the closest points into a cluster. When a point that is not yet part of a cluster needs to be assigned, the algorithm measures its distance to the closest point of each cluster and adds it to the nearest cluster.

In the example below, point 7 is closest to cluster (6, 8), with 6 as the closest point, and is therefore assigned to this cluster.

![Example for Single link clustering](./images/single_link_clustering_example.png "Example for Single link clustering")

Single link clustering performs better than k-means in cases where there is a lot of space between the clusters (e.g. cases 2 and 3 in the figure below), but performs poorly when the points are too close to each other (e.g. cases 1 and 4). It performs just as well as k-means when the points are naturally clustered together (e.g. case 6).

![Single link vs k-means](./images/single_link_vs_k_means.png "Single link vs k-means")
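
As a rough illustration of the first point, the sketch below compares the two algorithms on scikit-learn's noise-free two-moons dataset, where the clusters are separated by clear space but are not compact blobs. The dataset choice and the hypothetical `agreement` helper are assumptions made for this example; they are not taken from the figure above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

# Two interleaving half circles with no noise, 100 points each.
X, true_labels = make_moons(n_samples=200, random_state=0)

def agreement(labels, truth):
    """Fraction of points grouped as in `truth`, ignoring which cluster is called 0 or 1."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("single link agreement:", agreement(single, true_labels))
print("k-means agreement:    ", agreement(kmeans, true_labels))
```

On this dataset single link should recover both moons exactly, because neighbouring points within a moon are far closer to each other than the gap between the moons. K-means cannot, because it splits the plane with a straight boundary and the moons are not linearly separable.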



### 3.2.2 Complete link clustering

Complete link clustering works in a similar fashion to single link clustering, except that when deciding whether to merge two clusters it considers the distance between the two farthest points, one from each cluster. Complete link clustering produces more compact clusters than single link clustering.

### 3.2.3 Average link clustering

In average link clustering, the distance from every point in one cluster to every point in the other cluster is measured. The average of these distances is used when deciding which clusters to merge.
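
Single, complete, and average link differ only in how they turn the point-to-point distances between two clusters into a single number. A minimal NumPy sketch, with made-up points and hypothetical function names, might look like this:

```python
import numpy as np

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance from every point in cluster_a to every point in cluster_b."""
    diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
    return np.linalg.norm(diffs, axis=2)

def single_link(cluster_a, cluster_b):
    # Distance between the two closest points, one from each cluster.
    return pairwise_distances(cluster_a, cluster_b).min()

def complete_link(cluster_a, cluster_b):
    # Distance between the two farthest points, one from each cluster.
    return pairwise_distances(cluster_a, cluster_b).max()

def average_link(cluster_a, cluster_b):
    # Average of all point-to-point distances between the two clusters.
    return pairwise_distances(cluster_a, cluster_b).mean()

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

print(single_link(cluster_a, cluster_b))    # 3.0
print(complete_link(cluster_a, cluster_b))  # 6.0
print(average_link(cluster_a, cluster_b))   # 4.5
```

At each step, a hierarchical clustering algorithm merges the pair of clusters for which the chosen measure is smallest.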

### 3.2.4 Ward's method

Ward's method is the default linkage for agglomerative clustering in scikit-learn. This method attempts to minimise the variance within the clusters as they are formed.

To measure the distance between two clusters, this method first finds the distance from every point in both clusters to the central point between the clusters (yellow X in the figure below). These distances are squared and added together. From this sum, the variance within each cluster is subtracted (the squared distances between the points within a cluster and its center, red X in the figure below) to arrive at the distance measure.

![Ward's method](./images/wards_method_example.png "Ward's method")
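
The distance measure described above can be written out directly. The sketch below is an illustration with made-up points and hypothetical helper names (`sum_sq_to_centroid`, `ward_distance`). It assumes the "variance" terms mean sums of squared distances to a cluster's center.

```python
import numpy as np

def sum_sq_to_centroid(points):
    """Sum of squared distances from each point to the centroid of `points`."""
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

def ward_distance(cluster_a, cluster_b):
    """Increase in within-cluster squared distance caused by merging the two clusters."""
    merged = np.vstack([cluster_a, cluster_b])
    return (sum_sq_to_centroid(merged)
            - sum_sq_to_centroid(cluster_a)
            - sum_sq_to_centroid(cluster_b))

cluster_a = np.array([[0.0, 0.0], [2.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Merged centroid is (3, 0): squared distances 9 + 1 + 1 + 9 = 20.
# Each cluster on its own contributes 1 + 1 = 2, so the result is 20 - 2 - 2.
print(ward_distance(cluster_a, cluster_b))  # 16.0
```

The pair of clusters with the smallest such increase is the pair merged next.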

## 3.3 Density based clustering - DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike hierarchical clustering methods, DBSCAN does not require the number of clusters as input. It only requires the following parameters:

1. Epsilon, i.e. the search distance, or radius around a point.
2. Minimum number of points required to form a cluster.

The algorithm works by visiting each point and looking for other points within its epsilon distance. Once the minimum number of points criterion is satisfied, it forms a cluster. The process is then repeated for the other points. Points that end up in no cluster are treated as noise.
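
A small scikit-learn sketch of this behaviour follows. The points are made up for illustration. In scikit-learn the two parameters are called `eps` and `min_samples`, `min_samples` counts the point itself, and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two tight groups of four points and one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0], [0.3, 0.3],
    [5.0, 5.0], [5.0, 5.3], [5.3, 5.0], [5.3, 5.3],
    [10.0, 0.0],
])

# eps is the search radius; min_samples is the minimum number of points
# (including the point itself) needed within that radius to form a cluster.
model = DBSCAN(eps=0.5, min_samples=3)
labels = model.fit_predict(X)

print(labels)  # [ 0  0  0  0  1  1  1  1 -1]
```

Note that the number of clusters was never specified: the two groups were found from the density of the points alone, and the isolated point was flagged as noise.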

DBSCAN has a number of advantages:
1. No need to supply the number of clusters as a parameter to the algorithm.
2. DBSCAN can effectively deal with a number of cluster shapes and sizes.
3. Performs well in the presence of noise and outliers.

Some of the disadvantages:
1. Faces difficulties with finding clusters of varying densities.
2. Border points that are reachable from two clusters are assigned to whichever cluster reaches them first (first come, first served), so the result can depend on the order in which points are visited.

![Example for DBSCAN](./images/db_scan_example.png "Example for DBSCAN")