[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/8.clustering/Clustering_Sentence_Embeddings.ipynb)


This notebook uses SentenceBERT to generate representations of text sequences (sentences, documents) and then clusters those representations with K-means.

In [None]:
!pip install sentence-transformers

In [None]:
# Get movie summaries and book titles to cluster
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/plot_summaries.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/loc/dev.tsv -O book_titles.txt

In [None]:
from math import sqrt

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from tqdm import tqdm


In [None]:
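def read_data(filename):
    # Read a tab-separated file and return a list of (id, text) pairs
    # taken from the first two columns of each line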
    data = []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip().split("\t")
            idd = cols[0]
            summary = cols[1]
            data.append((idd, summary))
    return data

In [None]:
movies = read_data("plot_summaries.txt")
book_titles = read_data("book_titles.txt")

Load the sentence embedding model.

In [None]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

Let's try embedding a sentence. What is the shape of the embedding?

In [None]:
embedding = sentence_model.encode("this is a sentence")
print(embedding.shape)

In [None]:
def cosine_similarity(one, two):
    # Dot product of the two vectors divided by the product of their lengths
    return np.dot(one, two) / (sqrt(np.dot(one, one)) * sqrt(np.dot(two, two)))
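
As a quick sanity check of the function above, the following sketch applies the same formula to hand-picked toy vectors (these vectors are illustrative assumptions, not model output). Vectors pointing in the same direction should score 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.

```python
from math import sqrt

import numpy as np


def cosine_similarity(one, two):
    # Same formula as in the notebook
    return np.dot(one, two) / (sqrt(np.dot(one, one)) * sqrt(np.dot(two, two)))


a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 2.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

sim_same = cosine_similarity(a, b)
sim_orthogonal = cosine_similarity(a, c)
sim_opposite = cosine_similarity(a, d)
print(sim_same, sim_orthogonal, sim_opposite)
```

Because the score ignores vector length, it compares only the direction of two embeddings.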

In [None]:
def get_embeddings(data, model):
    X = []

    # Get sentence embeddings for each doc
    for doc_id, doc in tqdm(data):
        embedding = model.encode(doc)
        X.append(embedding)

    return np.array(X)
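
Encoding one document per call works, but `SentenceTransformer.encode` also accepts a list of texts and batches them internally, which is usually faster. A minimal sketch of that variant follows; `get_embeddings_batched` is a hypothetical name, and the `batch_size` of 32 is an assumption to tune for your hardware.

```python
import numpy as np


def get_embeddings_batched(data, model, batch_size=32):
    # data is a list of (id, text) pairs, as returned by read_data
    docs = [doc for _, doc in data]

    # encode() takes a list of texts and processes it in batches;
    # show_progress_bar stands in for the manual tqdm loop
    embeddings = model.encode(docs, batch_size=batch_size, show_progress_bar=True)

    return np.asarray(embeddings)
```

It is meant as a drop-in alternative: a call such as `get_embeddings_batched(movies[:100], sentence_model)` should return an array with one row per document.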

In [None]:
def run_all(data, model, num_clusters=10):

    embeddings = get_embeddings(data, model)

    # Run K-means
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(embeddings)

    # Group documents by cluster, then print the 5 documents closest to each cluster center
    clusters = {}
    for idx, label in enumerate(kmeans.labels_):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append((idx, data[idx][1]))

    for label in clusters:
        sims = {}
        cluster_center = kmeans.cluster_centers_[label]
        for idx, doc in clusters[label]:
            sim = cosine_similarity(cluster_center, embeddings[idx])
            sims[idx] = sim
        for k, v in sorted(sims.items(), key=lambda item: item[1], reverse=True)[:5]:
            print(k,"%.3f" % v, data[k][1])
        
        print()
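
To see what the ranking step in `run_all` does without downloading a model, the sketch below runs the same idea on synthetic 2-D points standing in for embeddings. The data, the helper name `top_n_per_cluster`, and `n_init=10` are assumptions for illustration. Note that K-means assigns points by Euclidean distance while the ranking uses cosine similarity, so the two orderings can differ unless the vectors are length-normalized.

```python
import numpy as np
from sklearn.cluster import KMeans


def top_n_per_cluster(embeddings, labels, centers, n=5):
    # Return {cluster label: [(row index, cosine similarity), ...]} holding the
    # n members most similar to their cluster center, best first
    result = {}
    for label in np.unique(labels):
        members = np.where(labels == label)[0]
        vecs = embeddings[members]
        center = centers[label]
        sims = vecs @ center / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(center))
        order = np.argsort(-sims)[:n]
        result[int(label)] = [(int(members[i]), float(sims[i])) for i in order]
    return result


# Synthetic stand-in for embeddings: two well-separated groups of 20 points
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=[5.0, 0.0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.3, size=(20, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)
top = top_n_per_cluster(toy, kmeans.labels_, kmeans.cluster_centers_, n=3)

for label, ranked in top.items():
    print(label, ranked)
```

Each printed list holds row indices from a single synthetic group, ordered from most to least similar to that cluster's center.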


# Book titles

In [None]:
run_all(book_titles[:1000], sentence_model, num_clusters=10)

# Movie summaries

In [None]:
run_all(movies[:100], sentence_model, num_clusters=10)

**Q1**: Play around with this method, varying both the number of movies clustered and the number of clusters. How would you rate the coherence and interpretability of these clusters? Try to label some of the clusters, and discuss their overall coherence with your neighbors.
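
When varying the number of clusters, one quantitative complement to reading the clusters by eye is scikit-learn's silhouette score, which is higher when points sit closer to their own cluster than to neighboring ones. The sketch below compares several values of k on synthetic blobs; the data, the candidate range, and `n_init=10` are assumptions, and a high score does not guarantee that a cluster is interpretable to a human reader.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings: three tight, well-separated blobs
X, _ = make_blobs(
    n_samples=150,
    centers=[[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all points, in [-1, 1]; higher is better
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, "%.3f" % score)
print("best k:", best_k)
```

With real data, the array returned by `get_embeddings` would take the place of `X`; treat the resulting scores as one signal alongside your own labels for the clusters.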