<a href="https://colab.research.google.com/github/boscoj2008/transformer-series/blob/main/clustering_distilBERT_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Clustering text using DistilBERT embeddings**



1.   Explore and evaluate a benchmark (traditional) method: TF-IDF features clustered with k-means
2.   Leverage Transformers using the Hugging Face library (not SBERT)
3.   We don't assume any text pre-processing
4.   We don't visualize the clusters in this tutorial (although this could be an element of future work)




## What you will need to replicate this work


*   The latest transformers library
*   A CUDA-capable GPU
*   The dataset, which can be found at the GitHub link in the video description
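
Before running the notebook, it can help to confirm that the requirements above are in place. The snippet below is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper name, and the package list is an assumption based on the imports used later in this notebook.

```python
import importlib.util

def check_requirements(packages=("torch", "transformers", "sklearn")):
    """Return {package_name: bool} telling whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = check_requirements()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")

# Only ask about CUDA when torch is actually installed.
if status["torch"]:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```

If `transformers` is reported missing, the `pip install` cell further down installs it.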



In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

In [2]:
def read_documents(doc_file):
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            label, _, _, doc = line.strip().split(maxsplit=3)
            docs.append(doc)
            labels.append(label)
    return docs, labels
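
`read_documents` relies on `split(maxsplit=3)`: the first whitespace-separated field becomes the label, the next two fields are discarded, and everything after them is kept as the document text. The sketch below shows this on a single made-up line. Only the label and the text are taken from the notebook's output; the two middle fields (`FIELD2`, `FIELD3`) are placeholders, since their real contents are not shown here.

```python
# Hypothetical input line: FIELD2 and FIELD3 stand in for the two ignored columns.
line = "music FIELD2 FIELD3 i was misled and thought i was buying the entire cd and it contains one song\n"

# maxsplit=3 performs at most three splits, so the text stays in one piece.
label, _, _, doc = line.strip().split(maxsplit=3)

print(label)  # music
print(doc)    # i was misled and thought i was buying the entire cd and it contains one song
```

Note that a line with fewer than four fields would raise a `ValueError` during unpacking.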

In [3]:
from collections import Counter

def purity(labels, clustered):
    
    # find the set of cluster ids
    cluster_ids = set(clustered)

    N = len(clustered)
    majority_sum = 0  
    for cl in cluster_ids:
        
        # for this cluster, we compute the frequencies of the different human labels we encounter
        # the result will be something like { 'camera':1, 'books':5, 'software':3 } etc.
        labels_cl = Counter(l for l, c in zip(labels, clustered) if c == cl)

        # we select the *highest* score and add it to the total sum
        majority_sum += max(labels_cl.values())

    # the purity score is the sum of majority counts divided by the total number of items
    return majority_sum / N
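
To make the purity score concrete, here is a small worked example. It is a sketch with made-up labels and cluster ids; the function is a compact restatement of the one above so that the block runs on its own.

```python
from collections import Counter

def purity(labels, clustered):
    # Sum the majority-label count of every cluster, then divide by the number of items.
    majority_sum = 0
    for cl in set(clustered):
        counts = Counter(l for l, c in zip(labels, clustered) if c == cl)
        majority_sum += max(counts.values())
    return majority_sum / len(clustered)

# Toy data: 6 items, 2 clusters.
toy_labels = ['books', 'books', 'music', 'music', 'music', 'camera']
toy_clusters = [0, 0, 0, 1, 1, 1]

# Cluster 0 holds {books: 2, music: 1}, cluster 1 holds {music: 2, camera: 1},
# so the majority counts are 2 + 2 = 4 out of 6 items.
score = purity(toy_labels, toy_clusters)
print(score)

# Putting every item in its own cluster gives a perfect purity of 1.0.
singleton_score = purity(toy_labels, list(range(len(toy_labels))))
print(singleton_score)
```

The second call shows why purity alone can be misleading: it never penalizes using more clusters, which is one reason to report a second metric next to it.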

In [4]:
docs, labels = read_documents('all_sentiment_shuffled.txt')
labels[1], docs[1]

('music',
 'i was misled and thought i was buying the entire cd and it contains one song')

In [5]:
vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = vectorizer.fit_transform(docs)

In [6]:
doc_matrix

<11914x46619 sparse matrix of type '<class 'numpy.float64'>'
	with 579847 stored elements in Compressed Sparse Row format>

In [7]:
clusterer = KMeans(n_clusters=7, verbose=True)

In [8]:
clustered_docs = clusterer.fit_predict(doc_matrix)

Initialization complete
Iteration 0, inertia 22504.002770291212
Iteration 1, inertia 11619.289062978238
Iteration 2, inertia 11568.87672321022
Iteration 3, inertia 11551.094303219983
Iteration 4, inertia 11542.946157560555
Iteration 5, inertia 11538.695223538898
Iteration 6, inertia 11535.550608074009
Iteration 7, inertia 11533.671128427506
Iteration 8, inertia 11533.023842757759
Iteration 9, inertia 11532.746893501882
Iteration 10, inertia 11532.53774396166
Iteration 11, inertia 11532.403205945113
Iteration 12, inertia 11532.317100809867
Iteration 13, inertia 11532.22987449163
Iteration 14, inertia 11532.111512040421
Iteration 15, inertia 11531.962317390986
Iteration 16, inertia 11531.743108302835
Iteration 17, inertia 11531.469185679858
Iteration 18, inertia 11530.687091763999
Iteration 19, inertia 11528.260050398618
Iteration 20, inertia 11521.834044237617
Iteration 21, inertia 11518.404235906626
Iteration 22, inertia 11518.03291439659
Iteration 23, inertia 11517.950449647109
Iterat

In [9]:
purity(labels, clustered_docs)

0.6758435454087628

In [10]:
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(labels, clustered_docs)

0.3037654148205302
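
The adjusted Rand index compares pairs of items: it counts how often two items share both a label and a cluster, then corrects that count for chance. The block below is a minimal pure-Python sketch of the usual pair-counting formula. `adjusted_rand` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the toy inputs are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    n = len(labels_true)
    # Pairs that agree in both the true labels and the predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true label, and pairs that share a predicted cluster.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = ['a', 'a', 'b', 'b']
same_grouping = adjusted_rand(truth, [1, 1, 0, 0])   # cluster ids renamed, grouping identical
mixed_grouping = adjusted_rand(truth, [0, 1, 0, 1])  # every cluster mixes both labels
print(same_grouping, mixed_grouping)
```

The first result is 1.0 because the score ignores how clusters are named, and the second is negative because the grouping agrees with the labels less than chance would. Unlike purity, the score does not improve just by adding clusters.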

# **Let's first define a DistilBERT Transformer using Hugging Face-style code**

In [11]:
!pip install transformers --quiet

In [12]:
# import torch and Huggingface dependencies
from transformers import AutoModel, AutoTokenizer
import torch.nn as nn
import torch

In [13]:
class model(nn.Module):
    def __init__(self, checkpoint, freeze=False, device='cuda'):
        super().__init__()

        self.model = AutoModel.from_pretrained(checkpoint)
        # device the tokenized inputs are moved to ('cuda' or 'cpu')
        self.device = device
        # optionally freeze all transformer weights
        if freeze:
            for param in self.model.parameters():
                param.requires_grad = False

    def forward(self, x):
        # x is the tokenizer output (input_ids and attention_mask)
        x = x.to(self.device)
        with torch.no_grad():
            model_out = self.model(x['input_ids'], x['attention_mask'], return_dict=True)

        # token embeddings, shape (batch, seq_len, hidden_size)
        embds = model_out.last_hidden_state
        # mean-pool over real tokens only: zero out padded positions before summing
        mask = x['attention_mask'].unsqueeze(-1).to(embds.dtype)
        mean_pool = (embds * mask).sum(dim=1) / mask.sum(dim=1)
        return mean_pool
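
Mean pooling turns the per-token vectors of a sentence into one sentence vector by averaging them. When inputs are padded, only positions whose attention mask is 1 should contribute to that average. The sketch below illustrates the arithmetic with plain Python lists standing in for tensors; the numbers and the `mean_pool` helper are made up for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; padded positions are ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: 4 positions with 2-dimensional vectors; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]

# Summing over all positions but dividing by the number of real tokens
# lets the padding vector leak into the result.
unmasked = [sum(vec[d] for vec in tokens) / sum(mask) for d in range(2)]
print(unmasked)  # [6.0, 7.0]
```

When each sentence is encoded on its own, no padding is present and both computations give the same result.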

In [14]:
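checkpoint = 'distilbert-base-uncased'
# use the GPU when one is available; CPU also works but is much slower
device = 'cuda' if torch.cuda.is_available() else 'cpu'
distilbert = model(checkpoint, freeze=True, device=device)
distilbert.to(device)
distilbert.eval()  # inference only
tokenizer = AutoTokenizer.from_pretrained(checkpoint)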

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [15]:
# encode every document and collect one embedding per sentence
final_embeddings = []

final_sentences = docs

batch_sz = 200 # batch_size
for idx in range(0, len(final_sentences), batch_sz):
    batch_sentences = final_sentences[idx:idx+batch_sz]
    # each sentence is tokenized on its own, so no padding tokens are added
    for sent in batch_sentences:
        tokens = tokenizer(sent, truncation='longest_first', return_tensors='pt', return_attention_mask=True, padding=True)
        embeddings = distilbert(tokens)  # shape (1, hidden_size)
        final_embeddings.extend(embeddings)

# stack once after the loop instead of re-stacking on every iteration
all_embeddings = torch.stack(final_embeddings)



In [16]:
clustered_docs = clusterer.fit_predict(all_embeddings.cpu())

Initialization complete
Iteration 0, inertia 148989.88237921125
Iteration 1, inertia 101760.10896039216
Iteration 2, inertia 98777.92178689447
Iteration 3, inertia 97268.44714802766
Iteration 4, inertia 96676.27210337637
Iteration 5, inertia 96418.26924126566
Iteration 6, inertia 96301.79409088155
Iteration 7, inertia 96224.82164974435
Iteration 8, inertia 96169.87869599207
Iteration 9, inertia 96132.38324927574
Iteration 10, inertia 96097.64669235812
Iteration 11, inertia 96067.5215140099
Iteration 12, inertia 96043.15339739135
Iteration 13, inertia 96022.334180791
Iteration 14, inertia 96009.31171732351
Iteration 15, inertia 96000.2510624887
Iteration 16, inertia 95994.28893179823
Iteration 17, inertia 95989.42236599149
Iteration 18, inertia 95985.7820126932
Iteration 19, inertia 95982.3638826581
Iteration 20, inertia 95980.6748636264
Iteration 21, inertia 95978.91098492652
Iteration 22, inertia 95977.39476383687
Iteration 23, inertia 95976.26668277415
Iteration 24, inertia 95975.082

In [17]:
purity(labels, clustered_docs)

0.773375860332382

In [18]:
adjusted_rand_score(labels, clustered_docs)

0.5344747827646887