# Cluster Analysis

#### Unsupervised Learning Basics
* **Unsupervised learning:** a group of machine learning algorithms that find patterns in unlabeled data.
* Data used in these algorithms has not been labeled, classified, or characterized in any way.
* The objective of the algorithm is to interpret any inherent structure(s) in the data.
* Common unsupervised learning algorithms: clustering, neural networks, anomaly detection

#### Clustering
* The process of grouping items with similar characteristics
* Groups are formed so that items within a group are closer to each other, in terms of some characteristics, than to items in other groups
* A **cluster** is a group of items with similar characteristics
    * For example, Google News articles where similar words and word associations appear together
    * Customer Segmentation
* Clustering algorithms:
    * Hierarchical clustering $\Rightarrow$ one of the two most common
    * K-means clustering $\Rightarrow$ one of the two most common
    * Other clustering algorithms: DBSCAN (density-based), Gaussian methods
    
```
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion = 'maxclust')
sns.scatterplot(x='x_coordinate', y='y_coordinate', hue = 'cluster_labels', data=df)
plt.show()
```
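
The cluster labels can also be checked without a plot. The following is a minimal sketch that reuses the coordinates above; the names `points`, `unique_labels`, and `counts` are illustrative and not part of the original notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

# One row per point, one column per coordinate
points = np.column_stack((x_coordinates, y_coordinates))

# Each row of Z records one merge, so n points give n - 1 rows
Z = linkage(points, 'ward')

# Cut the hierarchy into at most 3 flat clusters; labels start at 1
labels = fcluster(Z, 3, criterion='maxclust')

# Count how many points fall into each cluster
unique_labels, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique_labels.tolist(), counts.tolist())))
```

Because the three groups of points are well separated, each label should cover five points.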

### K-means clustering in SciPy

```
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

import numpy as np
np.random.seed((1000, 2000))  # kmeans draws from NumPy's random state, not Python's `random` module

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

centroids, _ = kmeans(df, 3)  # second return value is the distortion, discarded via '_'
df['cluster_labels'], _ = vq(df, centroids)  # second return value holds each point's distance to its centroid, discarded via '_'

sns.scatterplot(x='x_coordinate', y='y_coordinate', hue='cluster_labels', data=df)
plt.show()
```
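
The values discarded with `_` above can be kept and inspected. This is a minimal sketch, not part of the original notes: it assumes the installed SciPy draws from NumPy's global random state when no seed is passed, and the names `points`, `distortion`, and `distances` are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Assumption: kmeans uses NumPy's global random state for its initial centroids
np.random.seed(0)

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack((x_coordinates, y_coordinates))

# kmeans returns the centroids and the distortion (mean distance to the nearest centroid)
centroids, distortion = kmeans(points, 3)

# vq assigns each point to its nearest centroid; labels start at 0
labels, distances = vq(points, centroids)

print(centroids)
print(distortion, distances.mean())
```

Once the algorithm has converged, the mean of `distances` should be very close to `distortion`. Note that `vq` labels start at 0, whereas `fcluster` labels start at 1.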

#### Data preparation for cluster analysis
Why prepare data for clustering?
* Variables may have incomparable units (product dimensions in cm, price in dollars)
* Even if variables have the same unit, they may be significantly different in terms of their scales and variances
* Data in raw form may lead to bias in clustering
* Clusters may be heavily dependent on one variable
* **Solution:** normalization of variables
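
To see why raw units can bias clustering, consider the Euclidean distance between two items. The sketch below uses made-up values: the two products, their lengths in cm, and their prices in dollars are hypothetical.

```python
import numpy as np

# Hypothetical products: [length in cm, price in dollars]
product_a = np.array([10.0, 1000.0])
product_b = np.array([50.0, 1100.0])

# Squared difference per variable, and the resulting Euclidean distance
squared_diff = (product_a - product_b) ** 2
distance = np.sqrt(squared_diff.sum())

# Share of the squared distance contributed by each variable
share = squared_diff / squared_diff.sum()
print(distance)
print(share)
```

Here the price difference accounts for most of the distance even though the length differs by a factor of five, so a distance-based clustering would mostly follow price.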

* **Normalization:** the process of rescaling each variable to a standard deviation of 1: `x_new = x / std(x)`
    * SciPy function for normalization: `from scipy.cluster.vq import whiten`
    * `scaled_data = whiten(data)`
    * output is an array of the same dimensions as original `data`
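
The formula can be checked against `whiten` directly. A minimal sketch, assuming a small hypothetical array whose columns are a length in cm and a price in dollars; the name `manual` is illustrative.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical observations: rows are items, columns are [length in cm, price in dollars]
data = np.array([[10.0, 1000.0],
                 [50.0, 1100.0],
                 [30.0, 1500.0],
                 [20.0, 1200.0]])

scaled_data = whiten(data)

# The same per-column rescaling done by hand: x_new = x / std(x)
manual = data / data.std(axis=0)

print(np.allclose(scaled_data, manual))
print(scaled_data.std(axis=0))
```

After scaling, each column should have a standard deviation of 1. Note that `whiten` only divides by the standard deviation; it does not subtract the mean.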

**Illustration of the normalization of data:**

```
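from matplotlib import pyplot as plt
from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]  # illustrative sample values
scaled_data = whiten(data)  # divide by the standard deviation

plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")
plt.legend()
plt.show()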
```
* By default, `plt.plot` draws a line graph, so each series is shown as a connected line