# Cluster Analysis

#### Unsupervised Learning Basics
* **Unsupervised learning:** a group of machine learning algorithms that find patterns in unlabeled data.
* Data used in these algorithms has not been labeled, classified, or characterized in any way.
* The objective of the algorithm is to interpret any inherent structure(s) in the data.
* Common unsupervised learning algorithms: clustering, neural networks, anomaly detection

#### Clustering
* The process of grouping items with similar characteristics
* Groups are formed such that items within a group are closer to each other, in terms of some characteristics, than to items in other groups
* A **cluster** is a group of items with similar characteristics
    * For example, Google News articles where similar words and word associations appear together
    * Customer Segmentation
* Clustering algorithms:
    * Hierarchical clustering $\Rightarrow$ Most common
    * K means clustering $\Rightarrow$ Most common
    * Other clustering algorithms: DBSCAN (Density based), Gaussian Methods

### Hierarchical clustering in SciPy
    
```
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

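# Build the linkage matrix with Ward's method, then cut it into at most 3 flat clusters
Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion = 'maxclust')

# Color each point by its cluster label
sns.scatterplot(x='x_coordinate', y='y_coordinate', hue = 'cluster_labels', data=df)
plt.show()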
```

### K-means clustering in SciPy

```
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

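# kmeans picks its initial centroids at random via NumPy, so seed NumPy's generator
# rather than Python's built-in `random` module
from numpy import random
random.seed((1000, 2000))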

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

centroids, _ = kmeans(df, 3) # returns (centroids, distortion); the distortion is discarded via the dummy variable '_'
df['cluster_labels'], _ = vq(df, centroids) # returns (labels, distance of each point to its nearest centroid); the distances are discarded via '_'

sns.scatterplot(x='x_coordinate', y='y_coordinate', hue='cluster_labels', data=df)
plt.show()
```

#### Data preparation for cluster analysis
Why prepare data for clustering?
* Variables may have incomparable units (product dimensions in cm, price in dollars)
* Even if variables have the same unit, they may be significantly different in terms of their scales and variances
* Data in raw form may lead to bias in clustering
* Clusters may be heavily dependent on one variable
* **Solution:** normalization of variables

* **Normalization:** process of rescaling data to a standard deviation of 1: `x_new = x / std(x)`
    * normalization library: `from scipy.cluster.vq import whiten`
    * `scaled_data = whiten(data)`
    * output is an array of the same dimensions as original `data`

**Illustration of the normalization of data:**

```
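from matplotlib import pyplot as plt
from scipy.cluster.vq import whiten

# Example values, assumed here purely for illustration
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)

plt.plot(data, label = "original")
plt.plot(scaled_data, label = "scaled")
plt.legend()
plt.show()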
```
* By default, pyplot plots line graphs
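
The rescaling rule `x_new = x / std(x)` can also be checked numerically. The sketch below uses made-up observations (a length in cm and a price in dollars, both assumptions for illustration) and compares `whiten` with a manual column-wise division.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Made-up observations: column 0 is a length in cm, column 1 a price in dollars
data = np.array([
    [10.0, 1500.0],
    [12.0, 2300.0],
    [11.0, 1800.0],
    [15.0, 3100.0],
    [13.0, 2600.0],
])

scaled_data = whiten(data)

# Manual version of the rule: divide each column by its standard deviation
manual = data / data.std(axis=0)

print(np.allclose(scaled_data, manual))  # do the two computations agree?
print(scaled_data.std(axis=0))           # standard deviation of each scaled column
print(scaled_data.shape == data.shape)   # dimensions are unchanged
```

Note that `whiten` only rescales each column; it does not subtract the mean, so the scaled values are not centered at zero.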

## Hierarchical Clustering

#### Creating a distance matrix using linkage

`scipy.cluster.hierarchy.linkage(observations, method='single', metric='euclidean', optimal_ordering=False)`
* This process computes the distances between clusters as we go from n clusters to one cluster where n is the number of points
* `method`: how to calculate the proximity of clusters
* `metric`: distance metric (Euclidean, Manhattan...)
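* `optimal_ordering`: if `True`, reorders the linkage matrix so that the distance between successive leaves is minimal, which can make the dendrogram easier to read (optional argument)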

* **`method`**:
    * **single:** based on two closest objects (clusters tend to be more dispersed)
    * **complete:** based on two farthest objects
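    * **average:** based on the arithmetic mean of all pairwise distances between objects in the two clusters
    * **centroid:** based on the distance between the centroids (mean points) of the two clusters
    * **median:** like centroid, but the centroid of a merged cluster is taken as the average of the two centroids it was formed from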
    * **ward:** based on the sum of squares (clusters tend to be dense towards the centers)
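
As a rough sketch of how the `method` choices above behave, the snippet below runs `linkage` with each method on the 15 sample points used earlier in these notes and records the distance of the final merge. The variable names (`points`, `final_merge_height`) are illustrative assumptions, not part of SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The 15 sample points used earlier in these notes
x = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack([x, y])

methods = ['single', 'complete', 'average', 'centroid', 'median', 'ward']
final_merge_height = {}
for method in methods:
    Z = linkage(points, method=method, metric='euclidean')
    # Z has one row per merge: (cluster i, cluster j, distance, size of the new cluster)
    assert Z.shape == (len(points) - 1, 4)
    final_merge_height[method] = Z[-1, 2]

for method, height in final_merge_height.items():
    print(f"{method:>8}: last merge at distance {height:.1f}")
```

The same points are merged in every case; only the definition of "distance between clusters" changes, so the merge heights differ from method to method.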
    
* **Create cluster labels with fcluster:**
`scipy.cluster.hierarchy.fcluster(distance_matrix, num_clusters, criterion)`
* `distance_matrix`: output of `linkage` method
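* `num_clusters`: the maximum number of clusters when `criterion='maxclust'`; SciPy names this parameter `t`, a threshold whose meaning depends on `criterion`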
* `criterion`: how to decide thresholds to form clusters
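
A minimal sketch of the call described above, assuming the same 15 sample points used earlier in these notes and Ward linkage. With `criterion='maxclust'`, the second argument is the maximum number of flat clusters to form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 15 sample points used earlier in these notes
x = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack([x, y])

Z = linkage(points, method='ward')

# criterion='maxclust': form at most 3 flat clusters
labels = fcluster(Z, 3, criterion='maxclust')

print(labels)             # one integer label per point, numbered from 1
print(np.unique(labels))  # the distinct cluster labels
```

The labels come back in the same order as the input rows, so they can be stored directly as a new DataFrame column.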

* **Note** that when cluster labels are stored as integers, seaborn plots may show an extra cluster with label 0 in the legend even though no objects belong to it. This can be avoided if you store the cluster labels as strings.
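
One way to store the labels as strings, sketched here with the sample points from earlier in these notes, is to convert the output of `fcluster` before assigning it to the DataFrame column. The column names follow the earlier examples; the plotting call itself is left out.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

df = pd.DataFrame({
    'x_coordinate': [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0],
    'y_coordinate': [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3],
})

Z = linkage(df[['x_coordinate', 'y_coordinate']].to_numpy(), 'ward')

# fcluster returns integer labels; converting them to strings makes the column categorical
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust').astype(str)

print(sorted(df['cluster_labels'].unique()))
```

Passing this column as `hue` to `sns.scatterplot` should then produce a legend that lists only the labels actually present.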

#### Visualize Clusters
* Visualizing clusters may help you make sense of the clusters formed or identify the number of clusters
* Visualizing can serve as an additional step in the validation of clusters formed
* May help you to spot trends in data
* For clustering, we will often use pandas DataFrames to store our data, often adding a separate column for cluster labels

```
import pandas as pd
df = pd.DataFrame({'x':[2, 3, 5, 6, 2], 'y': [1, 1, 5, 5, 2], 'labels': ['A', 'A', 'B', 'B', 'A']})
```
#### Visualizing clusters with matplotlib

```
from matplotlib import pyplot as plt
colors = {'A': 'red', 'B': 'blue'}
df.plot.scatter(x='x', y='y', c = df['labels'].apply(lambda x: colors[x]))
plt.show()
```
* We use the `c` argument of the scatter method to assign a color to each cluster
* However, we first need to manually map each cluster to a color
* Create dictionary `colors` with the cluster labels as keys and respectively associated colors as values

#### Visualizing clusters with seaborn

```
from matplotlib import pyplot as plt
import seaborn as sns
sns.scatterplot(x = 'x', y = 'y', hue = 'labels', data = df)
plt.show()
```

* Two reasons to prefer seaborn:
    * For implementation, using seaborn is more convenient once you have stored cluster labels in your DataFrame
    * You do not need to select colors manually (there is a default palette, which you can change if you choose)
    
#### Determining how many clusters with dendrograms
* **Dendrograms:**
    * dendrograms help show progressions as clusters are merged
    * a dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes
    
```
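from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# df is assumed to hold the whitened coordinates in columns 'x_whiten' and 'y_whiten'
Z = linkage(df[['x_whiten', 'y_whiten']], method = 'ward', metric = 'euclidean')
dn = dendrogram(Z)
plt.show()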
```
* Recall the hierarchical clustering algorithm, where each step was a result of merging the two closest clusters in the earlier step
* The x-axis of a dendrogram represents individual points, whereas the y-axis represents the distance or dissimilarity between clusters
* The inverted U at the top of a dendrogram represents a single cluster of all datapoints
* The height of an inverted U (the length of its two legs) represents the distance between the two child clusters it joins. Therefore, a taller inverted U means that the two child clusters were further away from each other than those joined by a shorter inverted U in the diagram.
* If you draw a horizontal line at any part of the figure, the number of vertical lines it intersects tells you the number of clusters at that stage, and the height at which two of those vertical lines are joined indicates the **intercluster distance.**
* **Note:** There is no "right" metric to determine "how many" clusters are ideal.
* An additional step of visualizing the data in a scatter plot (after visualizing it in a dendrogram) may be helpful before deciding on the number of clusters.
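
The horizontal-line reading of a dendrogram can be sketched in code. The example below assumes the 15 sample points from earlier in these notes and Ward linkage; it counts how many branches a line at height `cut` would cross and compares that with `fcluster` using `criterion='distance'`. The names `cut` and `n_crossed` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# The 15 sample points used earlier in these notes
x = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack([x, y])

Z = linkage(points, method='ward', metric='euclidean')

# Column 2 of Z holds the height (distance) of each merge, i.e. the top of each inverted U
heights = Z[:, 2]

# Place a horizontal line between the third-highest and second-highest merges
cut = (heights[-3] + heights[-2]) / 2

# Every merge below the line has already happened, so the line crosses
# (number of points) - (number of merges below the line) vertical branches
n_crossed = len(points) - int(np.sum(heights <= cut))

# fcluster with criterion='distance' cuts the tree at the same height
labels = fcluster(Z, cut, criterion='distance')
print(n_crossed, len(np.unique(labels)))  # the two counts agree

# no_plot=True returns the dendrogram layout without drawing it
dn = dendrogram(Z, no_plot=True)
print(dn['ivl'])  # order of the leaves along the x-axis
```

Moving `cut` up or down changes how many branches the line crosses, which is the same as choosing a different number of clusters.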

```
# Import the dendrogram function and pyplot
from scipy.cluster.hierarchy import dendrogram
from matplotlib import pyplot as plt

# Create a dendrogram from the output of linkage (distance_matrix)
dn = dendrogram(distance_matrix)

# Display the dendrogram
plt.show()
```
