# Document Clustering with Transformers

![](https://i.imgur.com/7SXKckD.png)

Transfer learning means taking models that have already been trained and fine-tuning or adapting them to our own downstream tasks.

# Finding similar clusters of documents with Transformer embeddings

Here we use pre-trained Transformer models to extract embeddings from documents, then apply unsupervised clustering models to group similar documents together.



# Clustering

The idea of clustering is to group similar documents together based on the similarity of their embeddings.

![](https://i.imgur.com/KaF70Ow.png)
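
As a rough illustration of what "similarity of embeddings" means, the sketch below computes cosine similarity between a few made-up 3-dimensional vectors. The vectors and the `cosine_similarity` helper are hypothetical stand-ins; real sentence embeddings come from a model and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy "embeddings", not produced by any model
eating_1 = np.array([0.9, 0.1, 0.0])
eating_2 = np.array([0.8, 0.2, 0.1])
riding = np.array([0.0, 0.2, 0.9])

sim_same_topic = cosine_similarity(eating_1, eating_2)
sim_other_topic = cosine_similarity(eating_1, riding)
print(sim_same_topic, sim_other_topic)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0. A clustering algorithm groups the high-scoring ones together.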

# Clustering with Transformers and Agglomerative Clustering

Here we use hierarchical clustering with the Agglomerative Clustering algorithm.

In contrast to k-means, we can specify a distance threshold for the clustering: clusters whose distance is below that threshold are merged.

This makes the algorithm useful when the number of clusters is unknown.

With the threshold, we control whether we get many small, fine-grained clusters or a few coarse-grained ones.
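
To make the effect of the threshold concrete, here is a minimal sketch on made-up 2D points (not sentence embeddings). The `points` array and the `count_clusters` helper are assumptions for illustration only. It uses scikit-learn's `AgglomerativeClustering` with its default settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2D points forming three tight pairs far apart from each other
points = np.array([
    [0.0, 0.0], [0.1, 0.0],
    [5.0, 5.0], [5.1, 5.0],
    [10.0, 0.0], [10.1, 0.0],
])

def count_clusters(threshold):
    """Fit with a distance threshold and return how many clusters result."""
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold)
    model.fit(points)
    return model.n_clusters_

cluster_counts = {t: count_clusters(t) for t in (0.05, 1.0, 50.0)}
print(cluster_counts)
```

A very small threshold leaves every point in its own cluster, a moderate one recovers the three pairs, and a very large one merges everything into a single cluster.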

In [2]:
!pip install -U sentence-transformers

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [4]:
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]

In [5]:
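# The model loaded below, all-MiniLM-L12-v2, is a sentence-embedding model
# fine-tuned from the MiniLM base model described here:
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base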

embedder = SentenceTransformer('all-MiniLM-L12-v2')

In [6]:
corpus_embeddings = embedder.encode(corpus)

In [8]:
corpus_embeddings[0], corpus_embeddings[0].shape

(array([-2.29966976e-02, -5.42215668e-02, -4.23413068e-02, -4.47488800e-02,
        -3.54231820e-02, -2.26035714e-02,  6.13889731e-02, -1.12497553e-01,
        -8.98872688e-02, -3.51876067e-03,  4.92821187e-02,  1.29731558e-02,
        -8.50402489e-02,  3.15274554e-03,  5.85234836e-02, -4.41259705e-02,
         9.25420448e-02,  6.59015775e-03,  1.15228839e-01, -1.57639496e-02,
         5.49910143e-02,  4.11766060e-02, -1.19129522e-02, -1.44276749e-02,
         3.10340170e-02, -1.60162505e-02,  4.61587384e-02,  1.46778105e-02,
        -1.98512543e-02,  1.25952065e-02,  5.77107556e-02, -3.58901843e-02,
         3.83329540e-02,  6.27230257e-02, -2.96712387e-02, -3.16205248e-03,
        -1.50016574e-02, -5.01024313e-02,  3.76782604e-02,  2.17826362e-03,
         3.49687226e-02,  4.41790484e-02,  6.48202701e-03,  1.21949818e-02,
        -1.04580205e-02,  4.10641804e-02, -1.24173544e-01,  7.04240799e-02,
         5.61078265e-02,  3.16133052e-02, -1.63872033e-01,  5.93387745e-02,
         7.0

In [9]:
# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
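
One reason to normalize: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so a Euclidean distance threshold behaves like a cosine-similarity threshold. The sketch below checks this identity on random stand-in vectors (an assumption for illustration, not real embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8))  # random stand-ins for embeddings

# Same normalization as above: divide each row by its L2 norm
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

cosine = unit @ unit.T  # cosine similarity of every pair
squared_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_dist, 2 - 2 * cosine)
print(identity_holds)
```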

In [11]:
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

In [12]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])

In [13]:
for i, cluster in clustered_sentences.items():
    print("Cluster ", i+1)
    print(cluster)
    print("")

Cluster  1
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']

Cluster  3
['The girl is carrying a baby.', 'The baby is carried by the woman']

Cluster  5
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']

Cluster  2
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']

Cluster  4
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']
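
The cluster numbers above appear out of order because the labels are arbitrary ids and the dictionary keeps them in order of first appearance. If sorted output is preferred, one option is sketched below with hypothetical labels and placeholder sentences.

```python
from collections import defaultdict

# Hypothetical labels, shaped like clustering_model.labels_ for five sentences
sentences = ["s0", "s1", "s2", "s3", "s4"]
labels = [0, 2, 0, 1, 2]

grouped = defaultdict(list)
for sentence, label in zip(sentences, labels):
    grouped[int(label)].append(sentence)

# Iterate in label order instead of first-appearance order
for label in sorted(grouped):
    print("Cluster", label + 1, grouped[label])

ordered = [grouped[label] for label in sorted(grouped)]
```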



# Large Scale Clustering with Transformers and Fast Clustering

Agglomerative Clustering is quite slow on larger datasets, so it is only practical for up to a few thousand sentences.

Sentence Transformers provides a clustering algorithm tuned for large datasets (50k sentences in less than 5 seconds). In a large list of sentences it searches for local communities: a local community is a set of highly similar sentences.

You can configure the cosine-similarity threshold above which two sentences are considered similar. You can also specify the minimum size of a local community. Together these let you get either large coarse-grained clusters or small fine-grained clusters.
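
The idea of local communities can be sketched in a few lines of NumPy. This is a simplified, greedy illustration under stated assumptions, not the library's actual implementation: the `simple_communities` helper, its overlap rule, and the toy vectors are all made up for this example.

```python
import numpy as np

def simple_communities(embeddings, threshold=0.75, min_community_size=2):
    """Greedy sketch: largest neighbourhoods first, no sentence used twice."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similar = (unit @ unit.T) >= threshold  # boolean neighbour matrix

    # Candidate community for each sentence: all sentences similar to it
    candidates = [np.flatnonzero(row).tolist() for row in similar]
    candidates = [c for c in candidates if len(c) >= min_community_size]
    candidates.sort(key=len, reverse=True)

    communities, used = [], set()
    for candidate in candidates:
        if used.isdisjoint(candidate):  # skip overlapping communities
            communities.append(candidate)
            used.update(candidate)
    return communities

# Made-up 2D vectors: three point one way, two another way, one is alone
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]])
toy_communities = simple_communities(toy, threshold=0.9, min_community_size=2)
print(toy_communities)
```

Raising `threshold` or `min_community_size` in this sketch yields fewer, tighter groups, and isolated vectors are left out entirely.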

## Get the Quora Duplicate Questions Dataset

In [16]:
import os
import csv
import time
from sentence_transformers import util

# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000 # We limit our corpus to only the first 50k questions


# Download the dataset if it does not exist locally
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)

Download dataset



In [17]:
len(corpus_sentences)

50001
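
The count is 50001 rather than 50000 because each row can add two questions before the size check runs. Converting a `set` to a list also gives an order that may differ between runs. If an exact cap and a reproducible order matter, one possible variant is sketched below. The `load_unique_questions` helper and the tiny in-memory TSV are hypothetical, used only for illustration.

```python
import csv
import io

def load_unique_questions(tsv_file, max_size):
    """Collect unique questions in file order, stopping at exactly max_size."""
    seen = {}  # dicts keep insertion order, so this acts as an ordered set
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        for column in ("question1", "question2"):
            if len(seen) >= max_size:
                return list(seen)
            seen.setdefault(row[column], None)
    return list(seen)

# Tiny made-up stand-in for quora_duplicate_questions.tsv
sample = io.StringIO(
    "question1\tquestion2\n"
    "How do I learn Python?\tWhat is the best way to learn Python?\n"
    "How do I learn Python?\tHow can I make money online?\n"
    "What is AI?\tWhat is machine learning?\n"
)
questions = load_unique_questions(sample, max_size=4)
print(questions)
```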

In [18]:
corpus_sentences[:5]

['Which are the best TV series to watch?',
 "Why are women who are on their periods are regarded as 'unclean' such that they are prevented in taking an actived part in rituals (Hinduism)?",
 'What are good online high schools and how do they work?',
 'How do I help the pro-life movement?',
 'Why do I want to be with someone always, otherwise I am feeling depressed?']

## Generate Document Embeddings

In [20]:
print("Encode the corpus. This might take a while")
corpus_embeddings = embedder.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

Encode the corpus. This might take a while


Batches:   0%|          | 0/782 [00:00<?, ?it/s]
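
The progress bar counts batches, not sentences. As a quick sanity check, the total follows from the corpus size and the `batch_size` passed to `encode`:

```python
import math

num_sentences = 50001  # len(corpus_sentences) from the cell above
batch_size = 64        # value passed to embedder.encode

num_batches = math.ceil(num_sentences / batch_size)
print(num_batches)  # 782, matching the progress bar total
```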

## Cluster documents with Community Detection Model

In [21]:
print("Start clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: Only consider clusters that have at least 25 elements
# threshold: Consider sentence pairs with a cosine similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

Start clustering
Clustering done after 3.67 sec
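
The result is a list of clusters, each a list of indices into `corpus_sentences`. Sentences that do not belong to any sufficiently large community are simply absent. If a per-sentence label is more convenient, a conversion could look like the sketch below. The `clusters_to_labels` helper and the example values are hypothetical, not real Quora results.

```python
def clusters_to_labels(clusters, num_sentences):
    """Map each sentence index to its cluster number, or -1 if unclustered."""
    labels = [-1] * num_sentences
    for cluster_id, members in enumerate(clusters):
        for sentence_id in members:
            labels[sentence_id] = cluster_id
    return labels

# Hypothetical output for a 7-sentence corpus (not real Quora results)
example_clusters = [[4, 0, 2], [5, 1]]
example_labels = clusters_to_labels(example_clusters, num_sentences=7)
unclustered = example_labels.count(-1)
print(example_labels, unclustered)
```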


In [22]:
# For each cluster, print the first 3 and last 3 elements
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i+1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])


Cluster 1, #101 Elements 
	 How can I improve my spoken English ability?
	 How can I improve my English speaking ability?
	 How can I improve my spoken English?
	 ...
	 What should I have to do to make my english and communication skills perfect?
	 How can I improve my English speaking skills as well as writing skills?
	 How can I improve my fluency in English to face a more confortable job interview?

Cluster 2, #97 Elements 
	 How can one make money online?
	 How could I make money online?
	 What is a way to make money online?
	 ...
	 What are the ways to make money working from home?
	 How can l earn $100 online daily?
	 Make money online from Nigeria?

Cluster 3, #89 Elements 
	 How will the 500 & 1000 rupee note ban affect India?
	 How will India be affected now that 500 and 1000 rupee notes have been banned?
	 What effect will the rupee 500 and 1000 currency note ban have on the Indian economy?
	 ...
	 What do you think about Modi government banning 500 & 1000 currency note from