<img src="data/images/div/lecture-notebook-header.png" />

# Document Clustering

Clustering is an unsupervised machine learning technique used to group similar data points together based on their characteristics or features. In the context of text documents, clustering algorithms organize documents into groups or clusters where documents within the same cluster are more similar to each other compared to documents in other clusters.

Here's how clustering can be applied to organize text documents:

* **Representation of Text Documents:** Before clustering, text documents need to be transformed into numerical representations. Techniques like TF-IDF, word embeddings (such as Word2Vec or GloVe), or document vectors using methods like Doc2Vec can convert text into feature vectors that capture semantic information.

* **Choice of Clustering Algorithm:** Several clustering algorithms can be used for text document organization, such as K-Means, Hierarchical Clustering, DBSCAN, or affinity propagation. Each algorithm has its own way of defining clusters based on distance metrics, density, or connectivity.

* **Clustering Process:** Once the text documents are represented as numerical vectors, the chosen clustering algorithm groups similar documents together. The algorithm identifies patterns or similarities in the feature space and iteratively assigns documents to clusters based on certain criteria.

* **Evaluation and Interpretation:** Clustering algorithms often require parameters (e.g., number of clusters for K-Means) that can impact the clustering results. Evaluation metrics like silhouette score or coherence measures can help assess the quality of clusters. Additionally, interpreting the clusters by analyzing the content of documents within each cluster can provide insights into the underlying structure or themes in the document collection.

* **Applications:** Clustering text documents finds applications in various areas, such as information retrieval, document organization, topic modeling, and recommendation systems. It can be used to automatically group similar articles, classify news topics, organize search results, or create document archives with related content.

Clustering facilitates the organization and exploration of large text document collections by grouping similar documents together, allowing for easier management, analysis, and understanding of textual data. It aids in discovering hidden structures and patterns within text corpora, enabling better information retrieval and knowledge discovery. 
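
The steps listed above can be sketched end to end on a tiny, made-up corpus. The documents and the variable names (`toy_docs`, `toy_vectors`, `toy_labels`) are assumptions for illustration only; the real notebook data is loaded further below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus (made up for illustration): two documents about sports, two about finance
toy_docs = [
    "football match goal striker goal",
    "football team wins match",
    "stock market shares fall",
    "bank shares stock market rally",
]

# Representation: turn each document into a TF-IDF vector
toy_vectors = TfidfVectorizer().fit_transform(toy_docs)

# Algorithm choice and clustering: here K-Means with k=2
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_vectors)

# Interpretation: documents with the same label belong to the same cluster
for label, doc in zip(toy_labels, toy_docs):
    print(label, doc)
```

The cluster IDs themselves carry no meaning; only which documents share an ID matters.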

## Setting up the Notebook

### Required packages

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from tqdm import tqdm

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

### Auxiliary Code

The files `src/plotutil.py` and `src/datautil.py` contain a series of auxiliary methods for plotting the clustering results and for fetching the data.

In [None]:
from src.plotutil import color_func, get_mask, plot_sse_values, plot_silhouette_scores, plot_cluster_wordcloud, plot_dendrogram
from src.datautil import get_articles, get_random_article

### Data Collection

The file `data/datasets/news-articles/news-articles-preprocessed.zip` contains a text file with 6k+ news articles collected from The Straits Times around Oct/Nov 2022. The articles are already somewhat preprocessed (punctuation removal, conversion to lowercase, line break removal, lemmatization). Each line in the text file represents an individual article.

To get the articles, the method `get_articles()` reads this zip file, loops through the text file, and returns all articles in a list. The method also accepts a `search_term` to keep only the articles that contain that term. While it is not used by default in the following, you can try it out to get different results.


In [None]:
articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip')
#articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip', search_term="police")

print("Number of articles: {}".format(len(articles)))

There is also a method `get_random_article()` which, to the surprise of no one, returns a random article from the list of 6k+ articles.

---

## K-Means

The K-Means algorithm is an unsupervised machine learning technique used for clustering data into distinct groups or clusters based on similarity in their features. It aims to partition a dataset into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).

Here's an overview of how the K-Means algorithm works:

* **Initialization:** Begin by choosing K initial cluster centroids randomly from the dataset. These centroids represent the centers of the clusters.

* **Assignment Step:** For each data point in the dataset, calculate the distance (typically using Euclidean distance) between that point and each of the K centroids. Assign the data point to the cluster represented by the nearest centroid.

* **Update Step:** Recalculate the centroids of the K clusters based on the newly assigned data points. The new centroid of each cluster is the mean of all data points assigned to that cluster along each feature dimension.

* **Iterations:** Repeat the assignment and update steps iteratively until convergence. Convergence occurs when the assignment of data points to clusters no longer changes or when a maximum number of iterations is reached.

* **Final Result:** Once the algorithm converges, the data points are clustered into K groups, and each data point belongs to the cluster represented by the nearest centroid.
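
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (dense input, random initialization from the data points, squared Euclidean distance), not the implementation used by scikit-learn; the function name `kmeans_sketch` and the toy data are made up.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm) on a dense array X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance of every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Convergence: stop once no point changes its cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster ran empty
                centroids[j] = members.mean(axis=0)

    return labels, centroids

# Two well-separated toy blobs (made-up data)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

On data this well separated, the result does not depend on the seed; on real document vectors, different seeds can end in different clusterings.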

Key considerations and characteristics of the K-Means algorithm:

* It's sensitive to the initial placement of centroids, and different initializations can lead to different final clusters.
* K-Means aims to minimize the sum of squared distances between data points and their respective cluster centroids.
* The number of clusters, K, is a hyperparameter that needs to be determined based on domain knowledge or using techniques like the elbow method or silhouette score.
* It's an efficient and scalable algorithm, but it may struggle with non-linear or irregularly shaped clusters and is sensitive to outliers.

K-Means clustering is widely used in various domains for tasks like customer segmentation, image segmentation, anomaly detection, and more, where grouping similar data points together is essential for analysis and decision-making.

### Varying the Number of Clusters

Let's first have a look at how the choice of *k* (i.e., the number of clusters) affects the result. Since K-Means on large data may take some time, we limit our articles to those that contain the word *"police"*. Of course, you can try out different keywords or simply consider all articles by omitting the `search_term` parameter.


In [None]:
articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip', search_term="police")

As usual, we need to convert our articles into document vectors; we use TF-IDF weights. Note that we limit the number of considered terms (i.e., the vocabulary size) to 2,000. Again, the main reason is to speed up the computation of all K-Means runs. Given that we only consider a couple of hundred articles, a vocabulary of 2,000 terms is still large enough to be meaningful!

In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range = (1, 1), max_features=2000)

# Transform documents to tf-idf vectors (Term-Document Matrix with tf-idf weights)
tfidf = tfidf_vectorizer.fit_transform(articles)
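
One detail that is easy to miss: by default, `TfidfVectorizer` scales every document vector to unit Euclidean length (`norm='l2'`). For unit vectors, the squared Euclidean distance used by K-Means equals $2 - 2\cos(a, b)$, so "close in Euclidean distance" and "high cosine similarity" mean the same thing here. A small check on made-up sentences (the `toy_*` names are only for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["police arrest suspect", "police investigate case", "market shares rally"]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()

# With the default norm='l2', every row has unit length
row_norms = np.linalg.norm(toy_tfidf, axis=1)
print(row_norms)

# For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
a, b = toy_tfidf[0], toy_tfidf[1]
squared_euclidean = np.sum((a - b) ** 2)
cosine_similarity = a @ b
print(squared_euclidean, 2 - 2 * cosine_similarity)
```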

Now we're ready to perform K-Means clustering. As the goal is to see how the results differ for different values of *k*, we simply execute a loop and compute in each iteration:

* K-Means for the current value of *k*
* The SSE value for the current *k*
* The Silhouette Coefficient for the current *k*


In [None]:
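num_clusters_k = 200

sse, silhouette_scores = [], []

# Dense copy of the TF-IDF matrix (equivalent to tfidf.A), computed once outside the loop
X = tfidf.toarray()

for k in tqdm(range(2, num_clusters_k+1)):
    # Fit K-Means once and get the cluster label of each article
    kmeans_toy = KMeans(n_clusters=k, n_init='auto', random_state=0)
    cluster_labels = kmeans_toy.fit_predict(X)

    # Average Silhouette Coefficient over all articles
    silhouette_avg = silhouette_score(X, cluster_labels)

    sse.append((k, kmeans_toy.inertia_)) # Inertia is the same as SSE
    silhouette_scores.append((k, silhouette_avg))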

The method `plot_sse_values()` plots the SSE values.

In [None]:
plot_sse_values(sse)

As we already know, increasing *k* tends to lower the SSE value (until it converges to its minimum, which depends on the data distribution). More importantly, however, there is no visible "elbow" that would tell us about the best value for *k*.
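
To make the SSE concrete: it is the sum, over all data points, of the squared distance to the nearest centroid, which is the value scikit-learn reports as `inertia_`. The following sketch recomputes it by hand on made-up 2D points (`X_toy` and `sse_by_k` are illustrative names) and compares it with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points, only for illustration
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0],
                  [9.0, 9.0], [9.0, 11.0], [11.0, 9.0]])

sse_by_k = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    # SSE: squared distance of every point to its nearest centroid, summed up
    sq_dists = ((X_toy[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
    sse_manual = sq_dists.min(axis=1).sum()
    sse_by_k[k] = (sse_manual, km.inertia_)
    print(k, round(sse_manual, 3), round(km.inertia_, 3))
```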

Similar to the SSE values, we can also plot the Silhouette Coefficients for all considered values of *k*.


In [None]:
plot_silhouette_scores(silhouette_scores)

The results are quite similar. In a nutshell, there is no obvious indicator of what constitutes the best value for *k*. As we discussed in the lecture and in the supplementary recording, the issue is that our document vectors have many features, i.e., our feature space is high-dimensional. Simply speaking, our data distribution is very unlikely to feature nice, well-separated "blobs" which would yield more useful SSE values and Silhouette Coefficients. This is why in practice the choice of *k* is commonly a very pragmatic decision.
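
As a reminder of what the Silhouette Coefficient measures: for a data point, let $a$ be the mean distance to the other points in its own cluster and $b$ the mean distance to the points of the nearest other cluster; its silhouette value is $(b - a) / \max(a, b)$, and the score is the average over all points. The sketch below computes this by hand on made-up points and compares it with `silhouette_score()`. The helper `silhouette_by_hand()` is a hypothetical name, and it assumes every cluster has at least two points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

# Made-up points with a fixed, hand-picked cluster assignment
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

D = squareform(pdist(X_toy))  # pairwise Euclidean distances

def silhouette_by_hand(D, labels):
    """Average silhouette value; assumes each cluster holds at least two points."""
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = D[i, same].mean()  # mean distance to the rest of its own cluster
        # Mean distance to each other cluster; keep the closest one
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

print(silhouette_by_hand(D, toy_labels), silhouette_score(X_toy, toy_labels))
```

Values close to 1 indicate compact, well-separated clusters; values around 0 indicate overlapping clusters.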

### Inspecting Clusters

In practice, clustering is often used to get a basic understanding of the dataset, e.g., a text corpus. This can be done by clustering the corpus and inspecting each cluster, e.g., by visualizing a cluster w.r.t. its most important words. So let's do this using a basic approach. Since we run K-Means only once in the following, we can be a bit more generous regarding the number of news articles and the number of features. So feel free to play with those parameters.


In [None]:
articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip')
#articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip', search_term="police")

In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=5, smooth_idf=False, max_features=5000)

# Transform documents to tf-idf vectors (Term-Document Matrix with tf-idf weights)
tfidf = tfidf_vectorizer.fit_transform(articles)

print(tfidf.shape)

Let's run K-Means with $k=5$ to get 5 clusters:

In [None]:
%%time

num_clusters_k = 5

kmeans = KMeans(n_clusters=num_clusters_k, n_init='auto', random_state=0).fit(tfidf.A)

To visualize the resulting clusters, we provide you with the method `plot_cluster_wordcloud()` that creates a word cloud using the most important words in the respective cluster, where the importance of a word directly derives from the TF-IDF weight. You're encouraged to have a look at the method as it requires some understanding of what's going on. But here, we are just interested in the word clouds.

The code cell below goes through all cluster IDs from $0..(k-1)$ and plots the word cloud for that cluster.


In [None]:
for cid in range(num_clusters_k):
    plot_cluster_wordcloud(tfidf_vectorizer, kmeans, cid)

---

## AGNES

In contrast to K-Means, which partitions the data into a flat set of clusters, AGNES (Agglomerative Nesting) is a hierarchical clustering algorithm, and one of the most popular ones. Agglomerative Hierarchical Clustering is an approach that starts by considering each data point as a single cluster and then merges pairs of clusters iteratively to form a hierarchy of clusters. It proceeds by following these steps:

* **Initialization:** Treat each data point as an individual cluster. Assign each point to its own cluster, making as many clusters as there are data points.

* **Similarity/Dissimilarity Calculation:** Compute the distance or dissimilarity between each pair of clusters. Various distance metrics (such as Euclidean distance, Manhattan distance, or other similarity measures) can be used to determine the distance between clusters.

* **Merging Clusters:** At each iteration, the two most similar clusters are merged together based on a linkage criterion. Common linkage methods include:

    * Single Linkage: Merge clusters based on the minimum distance between points in the two clusters.
    * Complete Linkage: Merge clusters based on the maximum distance between points in the two clusters.
    * Average Linkage: Merge clusters based on the average distance between points in the two clusters.
    * Ward's Method: Minimizes the variance when merging clusters.

* **Hierarchical Structure Building:** Repeat the process of merging the most similar clusters until all data points belong to a single cluster or until a stopping criterion (e.g., a specific number of clusters or a threshold distance) is met.

* **Dendrogram Formation:** As the clusters are merged, a dendrogram—a tree-like structure—is created, illustrating the sequence of merges and the distance at which they occurred. The dendrogram helps visualize the hierarchical relationship between clusters.

Agglomerative Hierarchical Clustering results in a hierarchical decomposition of the dataset, where the clusters can be observed at different levels of granularity. This method does not require specifying the number of clusters beforehand, making it useful for exploring the structure of the data and identifying natural groupings. It's important to note that Agglomerative Hierarchical Clustering can be computationally expensive for large datasets due to its iterative nature and the need to calculate pairwise distances between data points.
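
To see how the linkage criterion changes the merge distances, the sketch below runs SciPy's `linkage()` on four made-up points on a line. All methods merge the points 0 and 1 first, but the heights of the later merges differ. The data and the name `merge_heights` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up points on a line: 0, 1, 3 and 7
X_toy = np.array([[0.0], [1.0], [3.0], [7.0]])

merge_heights = {}
for method in ["single", "complete", "average", "ward"]:
    # Each row of Z describes one merge: (cluster a, cluster b, distance, new cluster size)
    Z = linkage(X_toy, method=method, metric="euclidean")
    merge_heights[method] = Z[:, 2]
    print(method, np.round(Z[:, 2], 3))
```

With $n$ data points there are always $n - 1$ merges, so each method reports three distances here.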



### Visualizing the Complete Hierarchy

In hierarchical clustering, a dendrogram is a tree-like diagram that displays the arrangement of clusters and their relationships at various stages of the clustering process. It visually represents the merging of clusters and helps illustrate the hierarchy of relationships between data points or clusters.

Key elements of a dendrogram:

* **Vertical Axis:** The vertical axis of the dendrogram represents the dissimilarity or distance between clusters or data points. The height or level at which two clusters merge or a data point joins a cluster indicates the distance at which the merge occurred.

* **Horizontal Axis:** The horizontal axis shows the individual data points or clusters. Each point on this axis represents a data point initially, and as clusters merge, they form branches on the dendrogram.

* **Branches and Merges:** The branches of the dendrogram represent the clusters formed at different stages of the clustering process. The height of each branch's fusion indicates the dissimilarity level at which the clusters were merged.

* **Interpretation:** The structure of the dendrogram allows for interpretation of the relationships between clusters or data points. By observing the vertical lines where clusters merge, one can determine the distance or dissimilarity measure at which these merges occurred, providing insights into the similarity or distance between clusters.

* **Cluster Cutting:** Dendrograms allow for cutting the tree at different heights or distances, resulting in different numbers of clusters. This cutting process enables the determination of the optimal number of clusters based on the problem's requirements or characteristics of the data.

Dendrograms are valuable visualization tools in hierarchical clustering, aiding in understanding the structure and relationships within the dataset. They provide a clear representation of how clusters merge and form a hierarchy, allowing for the identification of clusters at different levels of granularity based on the distance or similarity thresholds.
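
Cutting the tree at a given height can be sketched with SciPy's `fcluster()`. The example uses four made-up points on a line with single linkage; the cut heights are arbitrary values chosen to fall between the merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X_toy = np.array([[0.0], [1.0], [3.0], [7.0]])
Z = linkage(X_toy, method="single")  # merges happen at heights 1, 2 and 4

clusters_per_cut = {}
for cut_height in [0.5, 1.5, 3.0, 5.0]:
    # Points that were merged at or below cut_height end up in the same flat cluster
    flat_labels = fcluster(Z, t=cut_height, criterion="distance")
    clusters_per_cut[cut_height] = len(set(flat_labels))
    print(cut_height, flat_labels)
```

The lower the cut, the more (and smaller) clusters remain; a cut above the last merge yields a single cluster.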

In [None]:
articles = get_articles('data/datasets/news-articles/news-articles-preprocessed.zip', search_term="tesla")

print(len(articles))

Again, feel free to play around with the parameters for converting the news articles into document vectors.

In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range = (1, 1), max_features=2000)

# Transform documents to tf-idf vectors (Term-Document Matrix with tf-idf weights)
tfidf = tfidf_vectorizer.fit_transform(articles)

print(tfidf.shape)

Our document vectors are all we need to run AGNES. As is common with scikit-learn algorithms, this just boils down to calling the `fit()` method. However, we need to make it explicit that the complete hierarchy should indeed be calculated. Adapted from the code on this [scikit-learn page](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html), the method `plot_dendrogram()` plots the dendrogram for a given AGNES clustering.

In [None]:
# setting distance_threshold=0 ensures we compute the full tree.
agnes = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(tfidf.A)

# As labels for the x-axis we use the first 20 characters of each article;
# admittedly it's not very useful, but it's quick and easy
xtics = [ d[:20] for d in articles]

# Plot the dendrogram
plot_dendrogram(agnes, xtics, truncate_mode="level")
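
For the curious: the core of the linked scikit-learn example is the conversion of a fitted `AgglomerativeClustering` model into the linkage matrix that SciPy's `dendrogram()` expects. The sketch below follows that idea on made-up points; `to_linkage_matrix()` is a hypothetical name, and the helper in `src/plotutil.py` may differ in its details.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def to_linkage_matrix(model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering model."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, (left, right) in enumerate(model.children_):
        # Children with an index below n_samples are original points (leaves);
        # larger indices refer to the cluster created by merge (index - n_samples)
        for child in (left, right):
            counts[i] += 1 if child < n_samples else counts[child - n_samples]
    # Columns: child a, child b, merge distance, number of points in the new cluster
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.5], [10.0, 10.0]])
toy_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X_toy)
toy_linkage = to_linkage_matrix(toy_model)
print(toy_linkage)
```

Passing such a matrix to `scipy.cluster.hierarchy.dendrogram()` draws the tree; the `distances_` attribute is only available because the full tree was requested via `distance_threshold=0`.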

Recall that the height of a merge (i.e., of its horizontal bar) reflects the distance between the two merged clusters. If you check the dendrogram above, you can see that some individual articles got merged with a distance of 0. This naturally indicates that our corpus of news articles contains a couple of duplicates.
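
Such duplicates can also be found directly, without a dendrogram, by looking for pairs of document vectors with a distance of 0. The sketch below does this with `pdist()` and `squareform()` on three made-up headlines; for the full corpus a pairwise distance matrix may become too large, so treat this as an illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "tesla recalls cars over software fault",
    "tesla opens new factory",
    "tesla recalls cars over software fault",  # exact copy of the first document
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()

# Full symmetric matrix of pairwise Euclidean distances
D = squareform(pdist(toy_tfidf, metric="euclidean"))

# Report every pair (i, j) with i < j whose distance is numerically zero
rows, cols = np.where(np.triu(np.isclose(D, 0.0), k=1))
duplicate_pairs = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(duplicate_pairs)
```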

### Get *k* Clusters with AGNES

In practice, we can use AGNES similarly to K-Means by specifying a fixed number of *k* clusters. This means that AGNES can stop merging clusters the moment *k* clusters have been formed. The basic output is (again, like with K-Means) a list containing the cluster ID for each of the input documents.


In [None]:
agnes = AgglomerativeClustering(n_clusters=5).fit(tfidf.A)

print(agnes.labels_)

In the output above, the first entry represents the first news article, the second entry represents the second news article, and so on. If the values of 2 articles are identical, this means that these 2 articles belong to the same cluster. A natural step would be, again, to maybe visualize each cluster. However, we cannot use `plot_cluster_wordcloud()` here, since this method is specific to K-Means as it utilizes the final centroids for the visualization. To come up with an alternative visualization is up to you :).

---

## Summary

Clustering is a very common data mining / text mining technique to organize data based on their similarity / distance. This can be used as a solution for a task (e.g., organize all text documents into 20 groups) or as some form of Exploratory Data Analysis (EDA) to get a basic understanding of the given data. Clustering is considered a generic technique as it can always be applied once a meaningful notion of similarity / distance is defined between two data points (e.g., text documents). There are no other requirements.

Clustering methods like K-Means and AGNES (Agglomerative Hierarchical Clustering) are valuable in organizing text documents by grouping them based on similarities in content. Here's a summary of their uses:

* **K-Means Clustering for Text Documents:** K-Means is effective for organizing text documents by partitioning them into K distinct clusters. It helps in:
    * Grouping Similar Documents: K-Means identifies clusters of documents sharing similar content, aiding in document organization and retrieval.
    * Topic Discovery: By categorizing documents into clusters, K-Means can reveal underlying topics or themes within the text corpus.
    * Document Summarization: It enables summarization by selecting representative documents from each cluster, condensing large volumes of text into manageable summaries.

* **AGNES (Agglomerative Hierarchical Clustering) for Text Documents:** AGNES constructs a hierarchy of clusters by iteratively merging similar clusters. Its uses for text documents include:
    * Hierarchical Structure: AGNES forms a hierarchical representation of clusters, providing insights into relationships between documents at different levels of granularity.
    * Dendrogram Visualization: The dendrogram produced by AGNES helps visualize the merging process and offers an intuitive view of document relationships and hierarchy.
    * Flexible Clustering: AGNES allows the determination of clusters at various levels of similarity, offering flexibility in organizing documents based on different levels of detail or abstraction.

In essence, K-Means and AGNES are powerful tools for organizing text documents by clustering them into groups that share common content or themes. K-Means partitions documents into distinct clusters, aiding in summarization and topic discovery, while AGNES constructs hierarchical structures, offering insights into document relationships at different levels of granularity. Both methods contribute significantly to document management, analysis, and understanding within text corpora.
