The purpose of this notebook is to test the idea that sentence-level embeddings can detect similarity better than document-level embeddings.  The sentence-level results will be aggregated to document similarity, not by aggregating the embeddings themselves (that would defeat the purpose), but by aggregating the similarity scores between each document's similar sentences.

This type of analysis introduces a new problem: sentence segmentation.  Flair's Syntok will be used for sentence segmentation, as the commonly used punkt from NLTK is dated and performs poorly.

Example:

In doc 1 and doc 2, sentences 1 and 2 were determined to be similar.  Doc 1 sent 1 matched with doc 2 sent 2, and doc 1 sent 2 matched with doc 2 sent 1.  Doc 1 had 10 total sentences and none of the other 8 matched with any other sentences from doc 2.

Option 1: Average Sentence-Level Score (similarity threshold = 0.2)

| doc 1 | doc 2 | sentence similarity |
| --- | --- | --- |
| sent 1 | sent 2 | 0.83 |
| sent 2 | sent 1 | 0.22 |
| doc similarity | | 0.53 |

Option 2: Number of Similar Sentences (similarity threshold = 0.2)

| doc 1 | doc 2 | number of sentences matched |
| --- | --- | --- |
| sent 1 | sent 2 | 1 |
| sent 2 | sent 1 | 1 |
| doc similarity | | 2 |

Of the two options, option 2 makes more sense because it requires setting only one threshold: the one for sentence-level similarity.
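The two aggregation options can be sketched on the example above. This is only an illustration: the `matches` list, the function names, and the choice of `>=` for the threshold comparison are assumptions made for the sketch, not part of the notebook's pipeline.

```python
# Hypothetical matched sentence pairs from the example:
# (doc 1 sentence, doc 2 sentence, similarity score)
matches = [(1, 2, 0.83), (2, 1, 0.22)]
SIMILARITY_THRESHOLD = 0.2


def average_sentence_score(matches, threshold):
    """Option 1: mean similarity over the sentence pairs that pass the threshold."""
    scores = [score for _, _, score in matches if score >= threshold]
    return sum(scores) / len(scores) if scores else 0.0


def count_similar_sentences(matches, threshold):
    """Option 2: number of sentence pairs that pass the threshold."""
    return sum(1 for _, _, score in matches if score >= threshold)


option_1 = average_sentence_score(matches, SIMILARITY_THRESHOLD)
option_2 = count_similar_sentences(matches, SIMILARITY_THRESHOLD)
print(f"option 1: {option_1:.3f}, option 2: {option_2}")
```

Option 1 would still need a second cutoff on the averaged score to decide whether two documents are similar, which is the extra threshold option 2 avoids.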

In [1]:
import re
import faiss
import numpy as np
import pandas as pd
import syntok.segmenter as segmenter
import tensorflow_hub as hub
from sklearn.datasets import fetch_20newsgroups
from itertools import combinations_with_replacement
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

In [2]:
pd.set_option('max_colwidth', 800)

In [3]:
class USEEmbeddingModel:
    def __init__(self):
        """
        Universal Sentence Encoder (USE) is a state of the art semantic similarity model.
        It is preferable to BERT embeddings for semantic similarity because:
            * It was trained specifically for detecting semantic similarity with sentence pairs
            * It has a greater range of values for the embedding dimensions than BERT, allowing it
              to better separate close matches in the embedding space (0.5 - 0.8 vs 0.79 - 0.87 for BERT)
        This class pulls the pre-trained USE model from TensorFlow Hub, then uses it to
        create an embedding for each input text.  Note that USE produces 512-dimensional
        embeddings; 512 is the size of the output vector, not a limit on input tokens.
        """
        self.model_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
        self.model = None

    def _load_model(self):
        print(f"Model {self.model_url} loading")
        self.model = hub.load(self.model_url)
        print(f"Model {self.model_url} loaded")

    @staticmethod
    def batch(iterable, batch_size=1):
        """
        Creates batches of equal size, batch_size

        Example usage:
            data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # list of data

            for x in batch(data, 3):
                print(x)

            # Output

            [0, 1, 2]
            [3, 4, 5]
            [6, 7, 8]
            [9, 10]
        """
        iterable_len = len(iterable)
        for ndx in range(0, iterable_len, batch_size):
            yield iterable[ndx:min(ndx + batch_size, iterable_len)]

    def get_embeddings(self, text_input, batch_size=256):
        """
        Runs text through the model and produces the embeddings

        :param text_input: a list where each item is a document (a comment from this dataset)
        :param batch_size: integer representing how many samples to include in a batch
        """
        self._load_model()
        embeddings = []
        # helper variables to track progress
        nbr_batches = int(np.ceil(len(text_input) / batch_size))
        current_batch = 1

        skipped = []  # these caused an error for whatever reason
        for batch_indices in self.batch(iterable=range(len(text_input)), batch_size=batch_size):
            progress = round(100 * current_batch / nbr_batches, 2)
            if progress % 10 == 0:
                print(f"Embedding progress: {progress}%")
                print(progress)

            # grab the records for this batch
            batch_records = [text_input[idx] for idx in batch_indices]

            try:
                # forward pass over the input
                model_output = self.model(batch_records)

                # save the embeddings
                embeddings.append(model_output.numpy())
            
            except Exception:
                # record the indices of the failed batch so the caller can drop those rows
                skipped += batch_indices

            current_batch += 1

        # stack the per-batch arrays into one (n_texts, embed_dim) array
        embeddings = np.vstack(embeddings)

        return embeddings, skipped

In [4]:
def parse_text_to_sents(text: str):
    sents = [sentence for paragraph in segmenter.analyze(text) for sentence in paragraph]
    sents = [(''.join(str(t) for t in s).strip()) for s in sents]
    return sents

In [229]:
news = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
news_df = pd.DataFrame({"text": news.data, "topic": news.target})
news_df.dropna(subset=["text"], axis=0, inplace=True)
news_df = news_df.sample(frac=0.05)
news_df = news_df.reset_index(drop=False).rename(columns={"index": "doc_id"})
# collapse all runs of whitespace (including tabs and newlines) into a single space
news_df['text'] = news_df['text'].apply(lambda x: re.sub(r"\s+", " ", x))

In [230]:
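news_df['text'] = news_df['text'].apply(parse_text_to_sents)
# NOTE: 'text' holds a list of sentences at this point, so .str.len() counts sentences
# per document; this line keeps only documents with more than 40 sentences
news_df = news_df[news_df['text'].str.len() > 40]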
news_df = news_df.explode('text').reset_index(drop=True).reset_index(drop=False).rename(
    columns={"index": "global_id"}
)
news_df['local_id'] = news_df.groupby(['doc_id'])['global_id'].cumcount()
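If the intent is to drop short sentences by character count, the length filter has to run after `explode`, when each row holds a single sentence. A minimal sketch on toy data follows; the `MIN_CHARS` name, the toy frame, and the `>=` comparison are assumptions for illustration.

```python
import pandas as pd

MIN_CHARS = 40  # hypothetical minimum sentence length, in characters

# toy frame: one row per document, 'text' is a list of sentences
docs = pd.DataFrame({
    "doc_id": [1, 2],
    "text": [
        ["Not to be altered.",
         "This sentence is comfortably longer than forty characters."],
        ["Please credit if quoted."],
    ],
})

# one row per sentence, then filter on the character length of each sentence
sentences = docs.explode("text").reset_index(drop=True)
sentences = sentences[sentences["text"].str.len() >= MIN_CHARS].reset_index(drop=True)
print(sentences)
```

Note that a document whose sentences are all short disappears entirely under this filter, which changes the document count as well as the sentence count.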

In [231]:
print(f"{len(set(news_df.doc_id))} documents and {len(news_df)} sentences")

17 documents and 2208 sentences


In [232]:
news_df.head()

Unnamed: 0,global_id,doc_id,text,topic,local_id
0,0,4811,"THE WHITE HOUSE Office of the Press Secretary (Pittsburgh, Pennslyvania) ______________________________________________________________ For Immediate Release April 17, 1993 RADIO ADDRESS TO THE NATION BY THE PRESIDENT Pittsburgh International Airport Pittsburgh, Pennsylvania 10:06 A.M. EDT THE PRESIDENT: Good morning.",18,0
1,1,4811,"My voice is coming to you this morning through the facilities of the oldest radio station in America, KDKA in Pittsburgh.",18,1
2,2,4811,"I'm visiting the city to meet personally with citizens here to discuss my plans for jobs, health care and the economy.",18,2
3,3,4811,But I wanted first to do my weekly broadcast with the American people.,18,3
4,4,4811,I'm told this station first broadcast in 1920 when it reported that year's presidential elections.,18,4


# Embed Sentences with USE

In [233]:
model = USEEmbeddingModel()
embeddings, skipped = model.get_embeddings(
    text_input=news_df['text'].tolist(),
    batch_size=64
)

Model https://tfhub.dev/google/universal-sentence-encoder/4 loading
Model https://tfhub.dev/google/universal-sentence-encoder/4 loaded
Embedding progress: 20.0%
20.0
Embedding progress: 40.0%
40.0
Embedding progress: 60.0%
60.0
Embedding progress: 80.0%
80.0
Embedding progress: 100.0%
100.0


In [234]:
news_df.drop(news_df.index[skipped], axis=0, inplace=True)

In [235]:
assert len(news_df) == embeddings.shape[0]

# Build FAISS Indices

In [236]:
NLIST = 100
NPROBE = 10
DISTANCE_THRESHOLD = 1.2
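FAISS L2 indexes report squared Euclidean distances, so `DISTANCE_THRESHOLD` is compared against squared values. If the embeddings are unit length (an assumption here; it is worth checking the norms of the actual USE vectors), the squared distance and the cosine similarity are related by ||a - b||² = 2 - 2·cos(a, b). The sketch below checks that identity on random unit vectors; `squared_l2_to_cosine` is a hypothetical helper name.

```python
import numpy as np


def squared_l2_to_cosine(squared_distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - squared_distance / 2.0


# two random unit vectors standing in for sentence embeddings
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 512))
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

squared_distance = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
print(np.isclose(squared_l2_to_cosine(squared_distance), cosine))

# cosine similarity implied by a threshold of 1.2 under the unit-length assumption
implied_cosine = squared_l2_to_cosine(1.2)
print(implied_cosine)
```

Under that assumption, a threshold of 1.2 admits any pair with cosine similarity of roughly 0.4 or higher, which is a fairly permissive cutoff.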

In [237]:
nbr_embeddings, embed_dim = embeddings.shape

In [238]:
# a flat (exact) L2 index serves as the coarse quantizer for the IVF index
quantizer_use = faiss.IndexFlatL2(embed_dim)

In [239]:
ivf_flat_index_use = faiss.IndexIVFFlat(quantizer_use, embed_dim, NLIST)
ivf_flat_index_use.nprobe = NPROBE
ivf_flat_index_use.train(embeddings.astype(np.float32))
ivf_flat_index_use.add_with_ids(embeddings.astype(np.float32), news_df['global_id'].values)



In [240]:
# perform range search for all sentence embeddings
limits, distances, indices = ivf_flat_index_use.range_search(embeddings.astype(np.float32), DISTANCE_THRESHOLD)
limits = limits.tolist()
indices = indices.tolist()

# collect the neighbors returned for each query sentence (these neighborhoods may overlap)
clusters = [indices[start:end] for start, end in zip(limits, limits[1:])]
clusters = {i: set(v) for i, v in enumerate(clusters)}
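`range_search` returns its results in a flattened layout: the neighbors of query `i` are stored in `indices[limits[i]:limits[i + 1]]`. The slicing step above can be illustrated with hand-written toy values (the numbers are made up, not FAISS output):

```python
# toy stand-ins for the values returned by range_search for 3 queries
limits = [0, 2, 5, 6]          # len(limits) == number of queries + 1
indices = [0, 1, 0, 1, 2, 2]   # neighbor ids, concatenated across queries

# slice out the neighbor ids belonging to each query
neighborhoods = {
    query: set(indices[start:end])
    for query, (start, end) in enumerate(zip(limits, limits[1:]))
}
print(neighborhoods)
```

Each query gets its own neighborhood, so the number of neighborhoods always equals the number of queries.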

In [241]:
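# does any sentence appear in more than 1 cluster?
# NOTE: the final print and break sit inside the outer loop, so only cluster 0 is
# inspected and the message below is printed regardless of what was found
seen = []
breaking_cluster = 0
for k, v in clusters.items():
    for d in v:
        if d in seen:
            breaking_cluster = k
            break
        else:
            seen.append(d)
    print("not unique, broke on", breaking_cluster)
    break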

not unique, broke on 0


This is important because FAISS does not assign sentences to unique clusters: a range search returns one neighborhood per query, and those neighborhoods can overlap.  Connected components is required if each sentence can belong to only 1 cluster.
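A check that walks every neighborhood and reports the first one sharing a member with an earlier neighborhood might look like the following sketch. The function name and toy data are hypothetical.

```python
def first_overlapping_cluster(clusters):
    """Return the key of the first cluster that shares a member with an
    earlier cluster, or None if all clusters are disjoint."""
    seen = set()
    for key, members in clusters.items():
        if seen & members:   # any member already seen in an earlier cluster?
            return key
        seen |= members
    return None


toy_clusters = {0: {0, 1}, 1: {0, 1, 2}, 2: {3}}
result = first_overlapping_cluster(toy_clusters)
print("unique" if result is None else f"not unique, broke on {result}")
```

Because every sentence is within the distance threshold of itself, any two sentences that are neighbors of each other will produce overlapping neighborhoods.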

In [242]:
len(clusters)

2208

In [243]:
# create a list of tuples representing the edges in a graph
edges = set([(x, y) for c in clusters.values() for x, y in combinations_with_replacement(c, 2)])

# create a square adjacency matrix from the edge list
nbr_documents = embeddings.shape[0]
matrix_shape = (nbr_documents, nbr_documents)
rows, cols = zip(*edges)
sparse_mat = coo_matrix((np.ones(len(edges)), (rows, cols)), shape=matrix_shape)

# determine the cluster (connected component) assignment for each sentence
nbr_clusters, cluster = connected_components(sparse_mat)
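The effect of connected components on overlapping neighborhoods can be seen on a small toy graph. The neighborhoods below are made up for illustration; the SciPy calls are the same ones used above, with `directed=False` passed explicitly.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# toy neighborhoods: sentence 0 and sentence 2 are not direct neighbors,
# but both are neighbors of sentence 1
neighborhoods = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}, 3: {3}}

# one edge from each query sentence to each of its neighbors
edges = sorted({(q, n) for q, members in neighborhoods.items() for n in members})
rows, cols = zip(*edges)
n = len(neighborhoods)
adjacency = coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

n_components, labels = connected_components(adjacency, directed=False)
print(n_components, labels)
```

Sentences 0 and 2 land in the same component even though they are not within the threshold of each other. This chaining means a component can grow well beyond any single neighborhood, especially with a permissive distance threshold.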

In [244]:
print(nbr_clusters, len(news_df))

637 2208


In [245]:
# the FAISS ids are the global_ids passed to add_with_ids, so entry i of `cluster`
# is the component label for global_id i
global_id_to_cluster_map = dict(enumerate(cluster))
news_df['cluster'] = news_df['global_id'].map(global_id_to_cluster_map)

In [246]:
news_df.head()

Unnamed: 0,global_id,doc_id,text,topic,local_id,cluster
0,0,4811,"THE WHITE HOUSE Office of the Press Secretary (Pittsburgh, Pennslyvania) ______________________________________________________________ For Immediate Release April 17, 1993 RADIO ADDRESS TO THE NATION BY THE PRESIDENT Pittsburgh International Airport Pittsburgh, Pennsylvania 10:06 A.M. EDT THE PRESIDENT: Good morning.",18,0,0
1,1,4811,"My voice is coming to you this morning through the facilities of the oldest radio station in America, KDKA in Pittsburgh.",18,1,1
2,2,4811,"I'm visiting the city to meet personally with citizens here to discuss my plans for jobs, health care and the economy.",18,2,2
3,3,4811,But I wanted first to do my weekly broadcast with the American people.,18,3,3
4,4,4811,I'm told this station first broadcast in 1920 when it reported that year's presidential elections.,18,4,4


In [247]:
# for each ordered pair of documents, count the sentences that share a cluster,
# as a proportion of the total sentences in doc1
doc_ids = set(news_df['doc_id'])
doc_similarity = {}
for doc1 in doc_ids:
    for doc2 in doc_ids:
        if doc1 == doc2:
            continue  # skip self-comparisons
        doc1_c = news_df[news_df['doc_id'] == doc1].groupby('cluster')['global_id'].count().reset_index()
        doc2_c = news_df[news_df['doc_id'] == doc2].groupby('cluster')['global_id'].count().reset_index()
        both = pd.merge(doc1_c, doc2_c, on='cluster', how='inner')
        if len(both) != 0:
            both['shared'] = both.apply(lambda x: min(x['global_id_x'], x['global_id_y']), axis=1)
            perc_shared = both['shared'].sum() / len(news_df[news_df['doc_id'] == doc1])
        else:
            perc_shared = 0
        doc_similarity[str(doc1) + "_" + str(doc2)] = perc_shared
doc_similarity = pd.DataFrame.from_dict(doc_similarity, orient='index').reset_index(drop=False)
doc_similarity = pd.concat([doc_similarity, doc_similarity['index'].str.split("_", expand=True)], axis=1)
doc_similarity.columns = ["index", "perc_similar", "doc_id1", "doc_id2"]
doc_similarity.drop("index", axis=1, inplace=True)
doc_similarity = doc_similarity[doc_similarity['doc_id1'] != doc_similarity['doc_id2']]
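The score computed in the loop above is, for each cluster present in both documents, the smaller of the two sentence counts, summed and divided by the number of sentences in the first document. A compact sketch of the same calculation on toy cluster labels (the function name and data are hypothetical, and a non-empty first document is assumed):

```python
from collections import Counter


def shared_sentence_fraction(doc1_clusters, doc2_clusters):
    """Fraction of doc1's sentences matched to doc2 through shared clusters.

    Each argument is a list with one cluster label per sentence.
    """
    counts1, counts2 = Counter(doc1_clusters), Counter(doc2_clusters)
    shared = sum(min(counts1[c], counts2[c]) for c in counts1.keys() & counts2.keys())
    return shared / len(doc1_clusters)


doc_a = [2, 2, 2, 28, 5]   # cluster label of each sentence in a toy document
doc_b = [2, 2, 7]

a_to_b = shared_sentence_fraction(doc_a, doc_b)
b_to_a = shared_sentence_fraction(doc_b, doc_a)
print(a_to_b, b_to_a)
```

The score is asymmetric because the denominator is the length of the first document, which is why each pair of documents appears twice in `doc_similarity` with different values.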

In [248]:
doc_similarity.head()

Unnamed: 0,perc_similar,doc_id1,doc_id2
1,0.333333,2275,7972
2,0.333333,2275,6950
3,0.380952,2275,1511
4,0.190476,2275,9094
5,0.333333,2275,5447


In [251]:
doc_similarity[doc_similarity['perc_similar'] > 0.5].head()

Unnamed: 0,perc_similar,doc_id1,doc_id2
23,0.729032,7972,4682
25,0.735484,7972,6151
35,0.773481,6950,7972
39,0.745856,6950,5447
40,0.762431,6950,4682


In [260]:
news_df[news_df['doc_id'].isin([7972, 4682, 6151])]

Unnamed: 0,global_id,doc_id,text,topic,local_id,cluster
64,64,4682,"Archive-name: net-privacy/part2 Last-modified: 1993/3/3 Version: 2.1 IDENTITY, PRIVACY, and ANONYMITY on the INTERNET ================================================ (c) 1993 L. Detweiler.",11,0,2
65,65,4682,"Not for commercial use except by permission from author, otherwise may be freely copied.",11,1,2
66,66,4682,Not to be altered.,11,2,2
67,67,4682,Please credit if quoted.,11,3,28
68,68,4682,"SUMMARY ======= Email and account privacy, anonymity, file encryption, academic computer policies, relevant legislation and references, EFF, and other privacy and rights issues associated with use of the Internet and global networks in general.",11,4,2
...,...,...,...,...,...,...
2156,2156,6151,"Source and binaries for HP-UX 8.*/9.0(S300/400/700/800) and Domain 10.4 (68K, DN 10K) are available through the Interworks Users Group; contact Carol Relph at 508-436-5046, fax 508-256-7169, or relph_c@apollo.hp.com. Patches to X11R5 for Solaris 2.1 by Casper H.S. Dik (casper@fwi.uva.nl) et al are on export in contrib/{R5.SunOS5.patch.tar.Z,R5.SunOS5.patch.README}.",5,281,2
2157,2157,6151,Patches to X11R5 for the Sun Type 5 keyboard and the keyboard NumLock are available from William Bailey (dbgwab@arco.com).,5,282,2
2158,2158,6151,"Also: Binaries are available from Unipalm (+44 954 211797, xtech@unipalm.co.uk), probably for the Sun platforms.",5,283,2
2159,2159,6151,"---------------------------------------------------------------------- David B. Lewis faq%craft@uunet.uu.net ""Just the FAQs, ma'am.""",5,284,331


These documents do not look very similar.  Many of the sentences are short and could be adding noise.

After removing short sentences (fewer than 40 characters), the documents still do not look very similar.  It seems that sentence-level comparisons are not a good basis for judging document-level similarity.