# Cluster Analysis

#### Unsupervised Learning Basics
* **Unsupervised learning:** a group of machine learning algorithms that find patterns in unlabeled data.
* Data used in these algorithms has not been labeled, classified, or characterized in any way.
* The objective of the algorithm is to interpret any inherent structure(s) in the data.
* Common unsupervised learning algorithms: clustering, neural networks, anomaly detection

#### Clustering
* The process of grouping items with similar characteristics
* Groups are formed so that items within a group are closer to each other, in terms of some characteristics, than to items in other groups
* A **cluster** is a group of items with similar characteristics
    * For example, Google News articles where similar words and word associations appear together
    * Customer Segmentation
* Clustering algorithms:
    * Hierarchical clustering $\Rightarrow$ Most common
    * K means clustering $\Rightarrow$ Most common
    * Other clustering algorithms: DBSCAN (Density based), Gaussian Methods
    
```
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

Z = linkage(df, 'ward')
df['cluster_labels'] = fcluster(Z, 3, criterion = 'maxclust')
sns.scatterplot(x='x_coordinate', y='y_coordinate', hue = 'cluster_labels', data=df)
plt.show()
```

### K-means clustering in SciPy

```
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

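# kmeans draws its initial centers from NumPy's random state, so seed NumPy rather than Python's random module
from numpy import random
random.seed([1000, 2000])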

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate' : y_coordinates})

centroids, _ = kmeans(df, 3) # second return value is the distortion, discarded via the dummy variable '_'
df['cluster_labels'], _ = vq(df, centroids) # second return value is the list of per-point distortions, also discarded

sns.scatterplot(x='x_coordinate', y='y_coordinate', hue='cluster_labels', data=df)
plt.show()
```

#### Data preparation for cluster analysis
Why prepare data for clustering?
* Variables may have incomparable units (product dimensions in cm, price in dollars)
* Even if variables have the same unit, they may be significantly different in terms of their scales and variances
* Data in raw form may lead to bias in clustering
* Clusters may be heavily dependent on one variable
* **Solution:** normalization of variables

* **Normalization:** the process of rescaling data to a standard deviation of 1: `x_new = x / std(x)`
    * normalization function: `from scipy.cluster.vq import whiten`
    * `scaled_data = whiten(data)`
    * the output is an array of the same dimensions as the original `data`
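
As a minimal sketch of what `whiten` does, the snippet below rescales a small two-column array and compares the result with a manual division by each column's standard deviation. The array values are made up for illustration.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Made-up observations: column 0 in cm, column 1 in dollars
data = np.array([[1.9, 2300.0],
                 [1.5, 2500.0],
                 [0.8, 1800.0],
                 [0.4, 600.0],
                 [0.1, 100.0]])

scaled_data = whiten(data)

# Manual equivalent: divide each column by its own standard deviation
manual = data / data.std(axis=0)

print(np.allclose(scaled_data, manual))  # do the two approaches agree?
print(scaled_data.std(axis=0))           # per-column standard deviation after scaling
```

Note that only the spread is rescaled; the data is not centered on zero.
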
**Illustration of the normalization of data:**

```
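from matplotlib import pyplot as plt
from scipy.cluster.vq import whiten

# Example values (illustrative only)
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
plt.plot(data, label = "original")
plt.plot(scaled_data, label = "scaled")
plt.legend()
plt.show()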
```
* By default, pyplot plots line graphs

## Hierarchical Clustering

#### Creating a distance matrix using linkage

`scipy.cluster.hierarchy.linkage(observations, method='single', metric='euclidean', optimal_ordering=False)`
* This process computes the distances between clusters as we go from n clusters to one cluster where n is the number of points
* `method`: how to calculate the proximity of clusters
* `metric`: distance metric (Euclidean, Manhattan...)
* `optimal_ordering`: order data points (optional argument)

* **`method`**:
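    * **single:** based on the two closest objects of the two clusters (clusters tend to be more dispersed)
    * **complete:** based on the two farthest objects of the two clusters
    * **average:** based on the arithmetic mean of the distances between all pairs of objects in the two clusters
    * **centroid:** based on the distance between the centroids (mean points) of the two clusters
    * **median:** like centroid, but the center of a merged cluster is the midpoint of the two centers that were merged
    * **ward:** based on the increase in the within-cluster sum of squares (clusters tend to be dense towards the centers)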
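
To make the effect of `method` concrete, here is a small sketch that runs `linkage` with each method on the same made-up points and prints the distance at which the last two clusters are merged. The point coordinates and the `final_heights` dictionary are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up points: two tight pairs and one distant point
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [5.0, 0.0], [5.0, 1.0],
                   [12.0, 0.0]])

final_heights = {}
for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    # Each row of Z records one merge: the two cluster ids,
    # the distance between them, and the size of the new cluster
    Z = linkage(points, method=method, metric='euclidean')
    final_heights[method] = Z[-1, 2]  # distance of the final merge

for method, height in final_heights.items():
    print(f"{method:>8}: {height:.3f}")
```

With n points, the linkage output always has n - 1 rows, one per merge.
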
    
* **Create cluster labels with fcluster:**
`scipy.cluster.hierarchy.fcluster(distance_matrix, num_clusters, criterion)`
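* `distance_matrix`: output of the `linkage` function
* `num_clusters`: the threshold `t`; with `criterion='maxclust'` it is the maximum number of clusters to form
* `criterion`: how to decide thresholds to form clusters (for example `'maxclust'` or `'distance'`)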
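
A short sketch of `fcluster` with two different criteria, assuming six made-up points that form two obvious groups. The distance threshold of `3.0` is an arbitrary value chosen for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up points: three near (1, 1) and three near (8, 8)
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

Z = linkage(points, method='ward')

# Ask for at most two clusters
labels_maxclust = fcluster(Z, 2, criterion='maxclust')

# Alternatively, cut the hierarchy at a fixed distance threshold
labels_distance = fcluster(Z, 3.0, criterion='distance')

print(labels_maxclust)
print(labels_distance)
```

The labels returned by `fcluster` start at 1, not 0.
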

* **Note** that in all seaborn plots, an extra cluster with label 0 is shown even though no objects are present in it. This can be removed if you store the cluster labels as strings.

#### Visualize Clusters
* Visualizing clusters may help to make sense of clusters formed or identify number of clusters
* Visualizing can serve as an additional step in the validation of clusters formed
* May help you to spot trends in data
* For clustering, we will often use pandas DataFrames to store our data, often adding a separate column for cluster labels

```
df = pd.DataFrame({'x':[2, 3, 5, 6, 2], 'y': [1, 1, 5, 5, 2], 'labels': ['A', 'A', 'B', 'B', 'A']})
```
#### Visualizing clusters with matplotlib

```
from matplotlib import pyplot as plt
colors = {'A': 'red', 'B': 'blue'}
df.plot.scatter(x='x', y='y', c = df['labels'].apply(lambda x: colors[x]))
plt.show()
```
* We use the `c` argument of the scatter method to assign a color to each cluster
* However, we first need to manually map each cluster to a color
* Create dictionary `colors` with the cluster labels as keys and respectively associated colors as values
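
The label-to-color mapping can also be built with `Series.map`, which accepts the dictionary directly. This is a small sketch of that alternative; the name `point_colors` is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})

colors = {'A': 'red', 'B': 'blue'}

# Look up each label in the dictionary to get one color per point
point_colors = df['labels'].map(colors)
print(point_colors.tolist())
```

The resulting Series can be passed as the `c` argument in place of the `apply` call.
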

#### Visualizing clusters with seaborn

```
from matplotlib import pyplot as plt
import seaborn as sns
sns.scatterplot(x = 'x', y = 'y', hue = 'labels', data = df)
plt.show()
```

* Two reasons to prefer seaborn:
    * 1) For implementation, using seaborn is more convenient once you have stored cluster labels in your dataframe
    * 2) You do not need to manually select colors (there is a default palette that you can manually change if you so choose, but is not necessary)
    
#### Determining how many clusters with dendrograms
* **Dendrograms:**
    * dendrograms help show progressions as clusters are merged
    * a dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes
    
```
from scipy.cluster.hierarchy import linkage, dendrogram
from matplotlib import pyplot as plt

Z = linkage(df[['x_whiten', 'y_whiten']], method = 'ward', metric = 'euclidean')
dn = dendrogram(Z)
plt.show()
```
* Recall the hierarchical clustering algorithm, where each step was a result of merging the two closest clusters in the earlier step
* The x-axis of a dendrogram represents individual points, whereas the y-axis represents the distance or dissimilarity between clusters
* The inverted U at the top of a dendrogram represents a single cluster of all data points
* The height of each inverted U represents the distance between its two child clusters. Therefore, a taller inverted U means that the two child clusters were farther from each other than those joined by a shorter inverted U.
* If you draw a horizontal line at any height in the figure, the number of vertical lines it intersects tells you the number of clusters at that stage, and the height at which two of those branches join indicates the **inter-cluster distance.**
* **Note:** There is no "right" metric to determine "how many" clusters are ideal.
* An additional step of visualizing the data in a scatter plot (after visualizing it in a dendrogram) may be helpful before deciding on the number of clusters.
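
The horizontal-line reading of a dendrogram can be sketched numerically: every merge recorded at or below the cut height reduces the cluster count by one. The points and the `cut_height` value below are made up, and `no_plot=True` is used so that the dendrogram data is computed without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Made-up points: a group of three, a pair, and one isolated point
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0],
                   [20.0, 0.0]])

Z = linkage(points, method='ward', metric='euclidean')

# Column 2 of Z holds the height (distance) of each merge
cut_height = 5.0
merges_below_cut = int(np.sum(Z[:, 2] <= cut_height))
n_clusters_at_cut = len(points) - merges_below_cut

# The same cut expressed through fcluster
labels = fcluster(Z, cut_height, criterion='distance')

# Compute the dendrogram layout without plotting it
dn = dendrogram(Z, no_plot=True)

print(n_clusters_at_cut, len(set(labels.tolist())))
print(dn['ivl'])  # leaf labels in left-to-right order
```
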

```
# Import the dendrogram function and pyplot
from scipy.cluster.hierarchy import dendrogram
from matplotlib import pyplot as plt

# Create a dendrogram from the linkage output
dn = dendrogram(distance_matrix)

# Display the dendrogram
plt.show()
```


#### Limitations of hierarchical clustering
**Measuring speed in hierarchical clustering:**
* Use the `timeit` module
* The most time-consuming step is constructing the distance matrix with the `linkage()` function
* To time a statement in IPython or Jupyter, prefix it with the `%timeit` magic command.
```
from scipy.cluster.hierarchy import linkage
import pandas as pd
import random, timeit
points = 100
df = pd.DataFrame({'x': random.sample(range(0, points), points),
                   'y': random.sample(range(0, points), points)}) 
%timeit linkage(df[['x', 'y']], method = 'ward', metric = 'euclidean')
```
* If you plot the runtime against the number of data points, you'll see a roughly quadratic increase in runtime, making `linkage()` infeasible for large datasets
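
Outside IPython, where the `%timeit` magic is unavailable, the same measurement can be sketched with the standard `timeit` module. The data sizes and the repeat count below are arbitrary choices, and the actual timings depend on the machine.

```python
import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

timings = {}
for n_points in (100, 200, 400):
    data = rng.random((n_points, 2))
    # Average over a few runs to smooth out noise
    total = timeit.timeit(
        lambda: linkage(data, method='ward', metric='euclidean'),
        number=3,
    )
    timings[n_points] = total / 3

for n_points, seconds in timings.items():
    print(f"{n_points:>4} points: {seconds:.5f} s per call")
```
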


#### Basics of k-means clustering
* **Why k-means clustering?**
    * A critical drawback of hierarchical clustering: runtime
    * K means runs significantly faster on large datasets
    
#### Step 1: Generate cluster centers
* generate the cluster centers and then assign the cluster labels
* `kmeans(obs, k_or_guess, iter, thresh, check_finite)`
    * `obs`: standardized observations (which have been standardized with the `whiten` function)
    * `k_or_guess`: number of clusters
    * `iter`: number of times to run k-means, keeping the code book with the lowest distortion (default: 20)
    * `thresh`: threshold (default: `1e-05`)
        * the algorithm terminates when the change in distortion since the last k-means iteration is less than or equal to the threshold
    * `check_finite`: boolean value; whether to check if observations contain only finite numbers (and *not* infinite or `NaN` values); default: `True`
* The `kmeans` function returns two values: the cluster centers (also known as "the code book") and the distortion
* `kmeans` runs much faster than hierarchical clustering on large datasets
* In SciPy, `distortion` is the mean (non-squared) Euclidean distance between the observations and their nearest cluster centers; the textbook k-means definition uses the sum of squared distances instead

#### Step 2: Generate cluster labels
* `vq(obs, code_book, check_finite=True)`
    * `obs`: standardized observations which have been standardized with the `whiten` method.
    * `code_book`: cluster centers
    * `check_finite`: see above; default: `True`
* The `vq()` function returns two values: the cluster labels (aka "the code book index") and a list of distortions

#### A note on distortions
* `kmeans` returns a single distortion value
* `vq` returns a list of distortions, one per observation
    * the mean of the list of distortions from the `vq` method should approximately equal the distortion value of the `kmeans` method if the same list of observations is passed through
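
The relationship between the two distortion outputs can be checked on synthetic data. This is a sketch under assumptions: the two blobs are randomly generated stand-ins, and the comparison is only expected to hold approximately.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic stand-in data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
obs = whiten(np.vstack([blob_a, blob_b]))

# kmeans: cluster centers plus a single distortion value
cluster_centers, distortion = kmeans(obs, 2)

# vq: one label and one distortion (distance to its center) per observation
labels, distortion_list = vq(obs, cluster_centers)

print(cluster_centers.shape, len(labels), len(distortion_list))
print(distortion, distortion_list.mean())  # the two values should be close
```
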

```
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns

cluster_centers, _ = kmeans(df[['scaled_x', 'scaled_y']], 3)
df['cluster_labels'], _ = vq(df[['scaled_x', 'scaled_y']], cluster_centers)

sns.scatterplot(x='scaled_x', y='scaled_y', hue='cluster_labels', data=df)
plt.show()
```

**kmeans** $\Rightarrow$ to get **cluster *centers***

**vq** $\Rightarrow$ to get **cluster *labels***

```
# Import the kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

# Generate cluster centers
cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)

# Assign cluster labels
comic_con['cluster_labels'], distortion_list = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', 
                hue='cluster_labels', data = comic_con)
plt.show()
```

#### Elbow method
* There is no *absolute* method to find the right number of clusters (k) in k-means clustering
* Construct an elbow plot to decide the right number of clusters for your dataset; x-axis = number of clusters k, y-axis = distortion.
* **distortion:** a measure of how far points lie from their respective cluster centers (classically the sum of squared distances; SciPy's `kmeans` reports the mean distance)
* Ideally, distortion has an inverse relationship with the number of clusters
    * In other words, **distortion decreases with increasing k.**
    * distortion becomes zero when the number of clusters equals the number of points $\Leftarrow$ This is the underlying logic of the elbow method
* **Elbow method:** line plot of distortion against the number of clusters
    * First, run kmeans clustering with a varying number of clusters on the data and construct an elbow plot which has the number of clusters on the x-axis and the distortion on the y-axis
    * The number of clusters can start at one and go up to the total number of data points
    * The ideal point (the "elbow") is the one beyond which increasing the number of clusters reduces the distortion only marginally
    
```
# Declaring variables for use
distortions = []
num_clusters = range(2, 7)

# Populating distortions for various clusters
for i in num_clusters:
    centroids, distortion = kmeans(df[['scaled_x', 'scaled_y']], i)
    distortions.append(distortion)
    
# Collecting elbow plot data in a DataFrame
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Pass the DataFrame itself, not its name as a string
sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot_data)
plt.show()
```
* Note: The elbow method only gives an indication of the ideal number of clusters
* Occasionally, it may be insufficient to find an optimal k
* For example, the elbow method fails when data is evenly distributed
* Other methods: **average silhouette** and **gap statistic**

```
distortions = []
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
    distortions.append(distortion)

# Create a data frame with two lists - num_clusters, distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
plt.xticks(num_clusters)
plt.show()
```

#### Limitations of kmeans clustering
* kmeans clustering overcomes the biggest drawback of hierarchical clustering (run time)
* However, it comes with its own set of **limitations** for consideration:
    * **procedure for finding the "right" number of k**
        * elbow method
        * silhouette method
        * gap statistic
    * **impact of seeds**
    * **biased towards equal-sized clusters**
    
* **Impact of seeds:**
    * Because the initial cluster centers are chosen at random, the initialization can affect the final clusters
    * Therefore, to get consistent results when running kmeans clustering on the same data set multiple times, it is a good idea to set the seed for random number generation
    * The seed is set through the `seed` function of NumPy's `random` module
    * You can pass a single integer or a 1-D array of integers
    * **Interestingly, the effect of seeds is only seen when the data to be clustered is fairly uniform.** If the data has distinct clusters before clustering is performed, the effect of seeds will not result in any changes in the formation of resulting clusters
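
The reproducibility claim can be sketched as follows: resetting the same seed before each `kmeans` call should give the same cluster centers, even on fairly uniform data. This assumes `kmeans` draws from NumPy's global random state when no explicit seed is passed to it; the data is randomly generated for illustration.

```python
import numpy as np
from numpy import random
from scipy.cluster.vq import kmeans, whiten

# Fairly uniform made-up data, where the starting centers matter most
random.seed(0)
obs = whiten(random.rand(200, 2))

# Same seed before each run
random.seed([1000, 2000])
centers_first, _ = kmeans(obs, 3)

random.seed([1000, 2000])
centers_second, _ = kmeans(obs, 3)

print(np.allclose(centers_first, centers_second))
```
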
    
```
from numpy import random
random.seed(12)
```

* **Bias towards equal-sized clusters:**
    * Can get very non-intuitive results
    * Bias is because the very idea of kmeans clustering is to minimize distortions. 
    * This results in clusters that have similar areas and not necessarily similar numbers of data points
    * For very differently-sized clusters: hierarchical clustering will likely do a better job (if you can afford the run time)
    

* Each technique has its pros and cons
* Consider your data size and patterns before deciding on an algorithm
* Clustering is still an exploratory phase of analysis

```
# Set up a random seed in numpy
from numpy import random
random.seed([1000,2000])

# Fit the data into a k-means algorithm
cluster_centers,_ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)

# Assign cluster labels
fifa['cluster_labels'], _ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)

# Display cluster centers 
print(fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean())

# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_def', y='scaled_phy', hue='cluster_labels', data=fifa)
plt.show()
```

## Dominant colors in images

* Let's use clustering on some **real world problems**
* **Analyze images to determine dominant colors**
* Each pixel consists of three values: each value is a number between 0 and 255, representing: R, G, B
* Pixel color = combination of these RGB values
* Goal: perform k-means clustering on standardized RGB values to find cluster centers
* Uses: Identifying features in satellite images

**Tools to find dominant colors:**
* Convert image to pixels: **`matplotlib.image.imread`**
    * Converts a JPEG image into a matrix which contains the RGB values of each pixel
* Display colors of cluster centers: **`matplotlib.pyplot.imshow`**

* **First step: Convert image to RGB matrix:**

```
import matplotlib.image as img
image = img.imread('sea.jpg')
image.shape
```
* **Note** the output of this call is an M x N x 3 matrix (pronounced "M cross N cross three"), where M and N are the dimensions of the image.
* In this analysis, we look at all pixels collectively and their positions do not matter. Hence, we simply extract all RGB values and store them in their corresponding lists

```
import pandas as pd

r = []
g = []
b = []

for row in image:
    for pixel in row:
        # A pixel contains RGB values 
        temp_r, temp_g, temp_b = pixel 
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)
        
pixels = pd.DataFrame({'red': r, 'green': g, 'blue': b})
pixels.head()
```
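
The later steps rely on standardized columns named `scaled_red`, `scaled_green` and `scaled_blue`. A minimal sketch of how they might be produced is shown here, using a tiny made-up array as a hypothetical stand-in for the output of `img.imread` and a NumPy `reshape` in place of the nested loops.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import whiten

# Hypothetical stand-in for the array returned by img.imread: a 2 x 3 RGB image
image = np.array([[[255, 0, 0], [250, 10, 5], [0, 0, 255]],
                  [[5, 5, 250], [0, 255, 0], [10, 250, 10]]], dtype=np.uint8)

# Flatten the M x N x 3 array to (M*N) x 3: one row per pixel, positions discarded
rgb = image.reshape(-1, 3).astype(float)
pixels = pd.DataFrame(rgb, columns=['red', 'green', 'blue'])

# Standardize all three channels at once, then store them as scaled_* columns
scaled = whiten(rgb)
for idx, channel in enumerate(['red', 'green', 'blue']):
    pixels['scaled_' + channel] = scaled[:, idx]

print(pixels.head())
```
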
Create an elbow plot from the pixel color data

```
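distortions = []
num_clusters = range(1,11)

# Create a list of distortions from the kmeans method
# (columns in red, green, blue order so each cluster center unpacks as R, G, B)
for i in num_clusters:
    cluster_centers, distortion = kmeans(pixels[['scaled_red', 'scaled_green', 'scaled_blue']], i)
    distortions.append(distortion)

# Create a dataframe with two lists: number of clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions': distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data=elbow_plot)
plt.xticks(num_clusters)
plt.show()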
```
* The elbow plot indicates the number of dominant colors in the image
* Recall that the cluster centers obtained are standardized RGB values. A standardized value of a variable is its actual value divided by the standard deviation, so multiplying by the standard deviation recovers the actual value

```
colors = []

# Find standard deviations
r_std, g_std, b_std = pixels[['red', 'green', 'blue']].std()

# Scale actual RGB values in range of 0-1
for cluster_center in cluster_centers:
    scaled_r, scaled_g, scaled_b = cluster_center
    colors.append((
        scaled_r * r_std/255,
        scaled_g * g_std/255,
        scaled_b * b_std/255
    ))
```
* **Display dominant colors:**

```
# Dimensions: 2 x 3 (N x 3 matrix)
print(colors)

# Dimensions: 1 x 2 x 3 (1 x N x 3 matrix)
plt.imshow([colors])
plt.show()
```

#### Batman example:

```
# Import image module of matplotlib
import matplotlib.image as img

# Read batman image and print dimensions
batman_image = img.imread('batman.jpg')
print(batman_image.shape)

# Store RGB values of all pixels in lists r, g and b
r, g, b = [], [], []
for row in batman_image:
    for temp_r, temp_g, temp_b in row:
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)

```

```
distortions = []
num_clusters = range(1, 7)

# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_green', 'scaled_blue']], i)
    distortions.append(distortion)

# Create a data frame with two lists, num_clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters': num_clusters, 'distortions':distortions})

# Create a line plot of num_clusters and distortions
sns.lineplot(x='num_clusters', y='distortions', data = elbow_plot)
plt.xticks(num_clusters)
plt.show()
```

```
# Start from an empty list so colors from earlier runs are not reused
colors = []

# Get standard deviations of each color
r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()

for cluster_center in cluster_centers:
    scaled_r, scaled_g, scaled_b = cluster_center
    # Convert each standardized value to scaled value
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))

# Display colors of cluster centers
plt.imshow([colors])
plt.show()
```

#### Document clustering
* Document clustering uses some concepts from NLP
* Steps:
    * 1) Clean data before processing
    * 2) Determine the importance of the terms in a document (in TF-IDF matrix)
    * 3) Cluster the TF-IDF matrix
    * 4) Find top terms, documents in each cluster
    
#### Step 1: Clean and tokenize data
* Convert text into smaller parts, called tokens, and clean the data for processing

```
from nltk.tokenize import word_tokenize
import re

def remove_noise(text, stop_words=()):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        # Strip every non-alphanumeric character
        token = re.sub('[^A-Za-z0-9]+', '', token)
        # Keep lowercased tokens longer than one character that are not stop words
        if len(token) > 1 and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
```
#### Document matrices and sparse matrices
* Once relevant terms have been extracted, a **document term matrix** is formed 
    * An element of this matrix signifies how many times a term has occurred in each document
    * Most elements are zeroes, hence a **sparse matrix** is formed
    
* A **sparse matrix** stores only the elements that are non-zero
    * Each stored entry typically consists of three parts: the **row** and **column** of the non-zero value, and the **value** itself.
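
The row, column, value representation can be seen directly with SciPy's coordinate (COO) sparse format. The small document term matrix below is made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up document term matrix: 3 documents (rows) x 4 terms (columns)
doc_term = np.array([[2, 0, 0, 1],
                     [0, 0, 3, 0],
                     [0, 1, 0, 0]])

sparse = coo_matrix(doc_term)

# Only the non-zero entries are stored, as (row, column, value) triplets
triplets = list(zip(sparse.row.tolist(), sparse.col.tolist(), sparse.data.tolist()))
for row, col, value in triplets:
    print(row, col, value)

print(sparse.nnz, "stored values out of", doc_term.size)
```
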
    
    
#### TF-IDF: Term frequency - Inverse Document Frequency
* A weighted measure: evaluate how important a word is to a document in a collection
* `max_df` and `min_df` signify the maximum and minimum fraction of documents a word should occur in.
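
To illustrate the weighting idea, here is a sketch of the textbook TF-IDF formula (term frequency times the log of inverse document frequency) in plain Python. It is not the exact computation performed by scikit-learn's `TfidfVectorizer`, which uses its own smoothing and normalization; the function name `tf_idf` and the toy documents are assumptions for this example.

```python
import math

# Toy corpus: each document is already tokenized
docs = [['the', 'cat', 'sat', 'mat'],
        ['the', 'cat', 'ate', 'fish'],
        ['the', 'dog', 'sat', 'log']]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: (share of the document taken by the term) * log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

score_the = tf_idf('the', docs[0], docs)  # appears in every document
score_cat = tf_idf('cat', docs[0], docs)  # appears in two documents
score_mat = tf_idf('mat', docs[0], docs)  # appears in one document

print(score_the, score_cat, score_mat)
```

A term that occurs in every document gets zero weight, while rarer terms are weighted more heavily.
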

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df = 0.8, max_features = 50, min_df = 0.2, tokenizer = remove_noise)
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
```

#### Clustering with sparse matrix
* `kmeans` in SciPy does not work with sparse matrices
* Use `.todense()` to convert sparse matrix to its expanded form

`cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)`
* Note: we do not make an elbow plot, as it will take an erratic form due to the high number of variables
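
The dense conversion step can be sketched on a tiny sparse matrix. This example uses `.toarray()`, which returns a plain NumPy array, whereas `.todense()` returns a `numpy.matrix`; the matrix values are made up and stand in for a real TF-IDF matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.cluster.vq import kmeans

# Made-up stand-in for a TF-IDF matrix: 4 documents x 3 terms
sparse_matrix = csr_matrix(np.array([[1.0, 0.0, 0.0],
                                     [0.9, 0.1, 0.0],
                                     [0.0, 0.0, 1.0],
                                     [0.0, 0.1, 0.9]]))

# Expand to a regular array before passing it to kmeans
dense = sparse_matrix.toarray()
cluster_centers, distortion = kmeans(dense, 2)

print(type(dense).__name__, dense.shape)
print(cluster_centers.shape)  # one row per cluster, one column per term
```
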

* Each **cluster center** is a list with a size equal to the number of terms
* Each value in the cluster center is its importance
* Create a dictionary and print top terms

```
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
terms = tfidf_vectorizer.get_feature_names_out()

for i in range(num_clusters):
    center_terms = dict(zip(terms, list(cluster_centers[i])))
    sorted_terms = sorted(center_terms, key= center_terms.get, reverse= True)
    print(sorted_terms[:3])
```
* The above example is a **very simple** example of NLP.
* Other considerations include:
    * Handling hyperlinks, emoticons, etc.
    * Normalize words to their base forms (run, ran, running $\Rightarrow$ run)
    * `todense()` may not work with large datasets, so you may need to consider an implementation of kmeans that works with sparse matrices.