<a href="https://colab.research.google.com/github/githubpsyche/rememberly/blob/main/TextSimilarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)


In [2]:
# imports
import torch
import torch.nn.functional as F

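# import the submodule explicitly and also bind `distance`, which later cells use
import scipy.spatial.distance
from scipy.spatial import distance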
from sentence_transformers import SentenceTransformer

# Testing

In [5]:
model = SentenceTransformer("stsb-roberta-base")

100%|██████████| 461M/461M [00:26<00:00, 17.2MB/s]


In [6]:
# dev is set in cell In [3] under "Semantic Text Similarity"; run that cell first
model = model.to(dev)

In [7]:
# test sentences for the SBERT model
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']

In [16]:
embeds = model.encode(sentences, convert_to_tensor=True)

In [13]:
type(embeds)

torch.Tensor

In [18]:
# pdist expects a NumPy array, so move the tensor off the GPU first
distance.pdist(embeds.cpu().numpy(), metric="cosine")

array([0.50537506, 0.96602266, 0.97348109])

In [33]:
F.cosine_similarity(embeds[2:3], embeds[1:2])

tensor([0.0265])
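
The two results above describe the same pair of sentences: pdist returns cosine distances in condensed order, pairs (0, 1), (0, 2), (1, 2), so its last entry belongs to sentences 1 and 2, and 1 - 0.97348109 ≈ 0.0265 matches the F.cosine_similarity output. Below is a minimal pure-Python sketch of that relationship. The helper names and the toy 2-D vectors are made up for illustration; this is not how scipy implements pdist internally, only a small stand-in that produces values in the same pair order.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def condensed_cosine_distances(rows):
    """Distances for every pair (i, j) with i < j, in row-major order."""
    return [cosine_distance(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]

# toy 2-D vectors standing in for sentence embeddings (hypothetical values)
toy_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_distances = condensed_cosine_distances(toy_embeds)  # pairs (0,1), (0,2), (1,2)
toy_similarities = [1.0 - d for d in toy_distances]     # similarity = 1 - distance
```

Orthogonal vectors get distance 1 (similarity 0), which is why the unrelated fox sentence sits near 0.97 in the real output.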

In [59]:
text_units = [["this is one", "of the many", "text units", "we have", "to use"],
              ["this is another"],
              ["do you think", "this could be", "a third?"]]

In [63]:
text_unit_embeds = [model.encode(i, convert_to_tensor=True) for i in text_units]

In [65]:
# convert each embedding tensor to a NumPy array on the CPU before calling pdist
[distance.pdist(i.cpu().numpy(), metric="cosine") for i in text_unit_embeds]

[array([0.66904016, 0.7714568 , 0.3554559 , 0.61600814, 0.85547665,
        0.498224  , 0.68915933, 0.94102719, 0.66680099, 0.66650351]),
 array([], dtype=float64),
 array([0.72241644, 0.63780705, 0.84130491])]

# Semantic Text Similarity

In [3]:
# use the GPU when one is available, since this matters a lot for speed;
# a GCP deployment may need different device handling
if torch.cuda.is_available():
  dev = torch.device("cuda:0")
else:
  dev = torch.device("cpu")

Note: SentenceTransformer(sbert_model_name) downloads and stores the chosen SBERT model once per session, the first time we build a TextSimilarity instance with it. This takes a while, but once a model has been downloaded it can be reused quickly by multiple TextSimilarity instances.

Also note that SBERT is fairly large and takes a while to run, and pdist is O(n^2) in the number of text units, so this will become much slower with longer reading cycles. With sentences as reading cycles I don't expect it to be too bad, but that estimate comes from running on the Colab GPU.
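
To make the quadratic growth concrete, here is a small sketch that counts how many pairwise comparisons a reading cycle of n text units needs. The function name `num_pairs` and the chosen sizes are illustrative only; the formula n * (n - 1) // 2 is the length of the condensed output that pdist returns.

```python
def num_pairs(n):
    """Number of unordered pairs among n text units: n * (n - 1) // 2."""
    return n * (n - 1) // 2

# pairwise comparisons needed for reading cycles of increasing size
pair_counts = {n: num_pairs(n) for n in (1, 3, 5, 50, 500)}
for n, count in pair_counts.items():
    print(f"{n} text units: {count} similarities")
```

A single text unit yields no pairs at all, which is why a one-unit reading cycle produces an empty array in the test output earlier in the notebook.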

In [6]:
# export
class TextSimilarity(torch.nn.Module):
  """
  Computes embeddings for the text units in each reading cycle and calculates
  the pairwise cosine similarities between them
  """
  def __init__(self, sbert_model_name):
    """
    sbert_model_name: sbert model to use, I think "stsb-roberta-base" is good
    """
    super(TextSimilarity, self).__init__()
    self.model = SentenceTransformer(sbert_model_name)
    self.to(dev)

  @staticmethod
  def measure(embedded_reading_cycle):
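    """
    Computes cosine similarity between all pairs of text unit embeddings in a reading cycle
    embedded_reading_cycle: tensor of text unit embeddings, shape (num_text_units, embed_dim)
    """
    # pdist needs a NumPy array on the CPU, so move the tensor off the GPU first
    embeddings = embedded_reading_cycle.detach().cpu().numpy()
    # no customizable metric - only cosine, otherwise 1 - x doesn't make sense
    similarities = 1 - scipy.spatial.distance.pdist(embeddings, metric="cosine")
    # convert to torch tensors
    return torch.tensor(similarities)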

  def forward(self, reading_cycles):
    """
    Computes pairwise cosine similarities between the text units of each reading
    cycle, returned in pdist's condensed (flattened upper-triangle) form
    reading_cycles: list of lists of text unit strings
    """
    # embeds is the list of text_unit embeddings for each reading cycle of shape
    # (num_reading_cycles, num_text_units, embed_dim)
    # where num_text_units varies per reading cycle, and embed_dim = 768 as per BERT
    embeds = [self.model.encode(reading_cycle, convert_to_tensor=True) for reading_cycle in reading_cycles]
    # given n text units in a reading_cycle, pdist computes n * (n - 1) // 2
    # similarities, so measures is a list of num_reading_cycles tensors, each of
    # length num_text_units * (num_text_units - 1) // 2
    measures = [self.measure(emb_reading_cycle) for emb_reading_cycle in embeds]
    # return similarity measures and embeddings
    return measures, embeds

In [7]:
# testing
ts = TextSimilarity("stsb-roberta-base") # takes a while to load
print(ts([["this is one", "of the many", "text units", "we have", "to use"],
              ["this is another"],
              ["do you think", "this could be", "a third?"]])[0])

[tensor([0.3310, 0.2285, 0.6445, 0.3840, 0.1445, 0.5018, 0.3108, 0.0590, 0.3332,
        0.3335], dtype=torch.float64), tensor([], dtype=torch.float64), tensor([0.2776, 0.3622, 0.1587], dtype=torch.float64)]
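
The condensed tensors above do not say which two text units each value compares. Below is a minimal sketch of one way to recover that, assuming the pair order is (0, 1), (0, 2), ..., (n-2, n-1), which is the order pdist uses. The helper `label_pairs` is hypothetical and not part of TextSimilarity; the values are copied from the first tensor printed above.

```python
from itertools import combinations

def label_pairs(text_units, condensed_values):
    """Pair each condensed value with the two text units it compares."""
    pairs = list(combinations(text_units, 2))
    if len(pairs) != len(condensed_values):
        raise ValueError("expected one value per pair of text units")
    return [(a, b, value) for (a, b), value in zip(pairs, condensed_values)]

# first reading cycle and its similarities, copied from the output above
units = ["this is one", "of the many", "text units", "we have", "to use"]
sims = [0.3310, 0.2285, 0.6445, 0.3840, 0.1445,
        0.5018, 0.3108, 0.0590, 0.3332, 0.3335]

labeled = label_pairs(units, sims)
# pick the pair of text units with the highest cosine similarity
most_similar = max(labeled, key=lambda item: item[2])
```

With these numbers the highest similarity is the third entry, which corresponds to the pair ("this is one", "we have").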
