# GTE Embeddings Project

## Charlie Perez for the Julian Bond Papers Project

The brief for this project is still loosely defined, but I think we can apply Lucian's framework from "Tracing the Genealogies of Ideas" (https://arxiv.org/pdf/2402.01661v1) to the corpus of Bond's speeches. Lucian's framework took a long time to implement, though, and I haven't found an open-source repo for it yet, so I may need to email him. In the meantime, I want to try to replicate the process, using Lucian's paper as a reference.

Also, credit to the developers of the SentenceTransformers package, which I have used to create and analyze these embeddings:

- https://www.sbert.net/
- https://arxiv.org/abs/1908.10084

and to the GTE-small model that I am using to create these embeddings with SentenceTransformers:

- https://huggingface.co/thenlper/gte-small
- https://arxiv.org/abs/2308.03281

(Yeah, I need to cite these more thoroughly. I'll get around to it.)

I am also taking a class this semester that will cover NLP and unsupervised learning methods for large text corpora, so I expect to learn how to do many of these steps along the way.

## Outline:

Required packages include NLTK, sentence-transformers, and FAISS (plus pandas and PyTorch, which the cells below import).

1. Scrape the Bond speeches that are available from Drupal to give us a corpus to work with.

2. Organize all of these speeches into a suitable data structure (separate into sentences, provide sufficient metadata to link them). Lucian said he eliminated items from the corpus that were works < 1000 characters and sentences < 45 words. I think that in this case, the former restriction makes sense but the latter does not, and we will need to include shorter sentences.

3. Preprocessing, including tokenization with NLTK.

4. Use GTE-small to create sentence-level embeddings, then search for similarities using FAISS.
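
Step 2 could be sketched roughly as below. This is only an illustration: `SentenceRecord`, `build_records`, and the `min_sentence_chars` default are hypothetical names and values, and the regex split is a placeholder for the NLTK tokenization in step 3. The 1000-character document cutoff follows the restriction described in step 2.

```python
import re
from dataclasses import dataclass


@dataclass
class SentenceRecord:
    doc_id: int    # ID of the speech the sentence came from
    position: int  # order of the sentence among those kept from that speech
    text: str


def build_records(doc_id, body, min_doc_chars=1000, min_sentence_chars=1):
    """Split one speech into sentence records that link back to the speech.

    Speeches shorter than min_doc_chars are skipped entirely. Short
    sentences are kept unless min_sentence_chars is raised.
    """
    if len(body) < min_doc_chars:
        return []
    # Placeholder splitter (assumption): a real tokenizer would replace this line.
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    kept = [s for s in sentences if len(s) >= min_sentence_chars]
    return [SentenceRecord(doc_id, i, s) for i, s in enumerate(kept)]
```

Keeping `doc_id` and `position` on every record means any sentence that turns up in a similarity search can be traced back to its speech and its place in it.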

## Test: 

let's see how this works

In [4]:
from sentence_transformers import SentenceTransformer
import os
import pandas as pd

# this is just to get rid of an error message - something about how Windows caches the model I'm downloading from the internet
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [5]:
documents = pd.read_csv('../document-catalog_extended.csv') # I have this from earlier - it's a CSV export of a bunch of speeches from my location project
documents = documents[documents['Workflow Stage'] == 'Early Access']
documents_test = documents.iloc[1:3].reset_index()
documents_test

Unnamed: 0,index,ID,Title,Box,Document Length,Workflow Stage,Image Filename,Image Identifier,Image URL,Document Body
0,5,667,Speeches making observations about the recent ...,Box 3 Folder 31,32.0,Early Access,"PJB667_mss13347-b3-f31_0049.tif\n, PJB667_mss1...","3191384, 3191385, 3191386, 3191387, 3191388, 3...",,"Now that the nation's voters — at least, 54% o..."
1,6,666,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB666_mss13347-b3-f31_0025.tif\n, PJB666_mss1...","3191361, 3191362, 3191363, 3191364, 3191365, 3...",,"Now that the nation's voters — at least, 54% o..."


In [6]:
doc_667 = documents_test['Document Body'][0].split('.')

# temporary solution to getting rid of short sentences

doc_667 = [x for x in doc_667 if len(x) > 50]
doc_667[:3]

["Now that the nation's voters — at least, 54% of those eligible — have gone to the polls and expressed their choice, we have an opportunity to reflect on why that choice was made, what it meant, and what it will mean in the years ahead",
 '\nThe choice, simply put, was between the past performance of one fallible man and the unproved promises of another',
 '\nThose who believed Gallup and Harris knew that the outcome was never in doubt']

I think I will need to start looking into regex lookbehinds so that I don't split on, say, "Dr." or "W.E.B.", but for now this will do.
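
One possible shape for that splitter is sketched here. It is an assumption-heavy illustration, not a finished tokenizer: the `ABBREVIATIONS` set and the `split_sentences` name are made up for this example, and the rule is simply "split after sentence punctuation, then glue back any piece that ends in a known abbreviation or a run of initials".

```python
import re

# Hypothetical abbreviation list; it would need extending for the real corpus.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "rev.", "jr.", "sr."}
# A token made only of initials, such as "W." or "W.E.B."
INITIALS = re.compile(r"^(?:[A-Z]\.)+$")


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, then undo false splits."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        last_token = candidate.split()[-1]
        if last_token.lower() in ABBREVIATIONS or INITIALS.match(last_token):
            # Not a real sentence boundary: hold the text and keep accumulating.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences
```

A known weakness of this approach is that a sentence that genuinely ends in an abbreviation or a single capital letter (for example "So did I.") gets merged with the one after it.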

In [8]:
# gonna leave this function here for the future when I can get good at sentence slicing

def sentence_slicing(text):
    # placeholder: not implemented yet
    raise NotImplementedError

In [45]:
# ok, let's think about this.

documents_test['Document Body'][0]

'Now that the nation\'s voters — at least, 54% of those eligible — have gone to the polls and expressed their choice, we have an opportunity to reflect on why that choice was made, what it meant, and what it will mean in the years ahead.\nThe choice, simply put, was between the past performance of one fallible man and the unproved promises of another.\nThose who believed Gallup and Harris knew that the outcome was never in doubt. We learned\xa0that Jews, worried about quotas and Israel, would abandon regular\xa0Democrats\xa0would\xa0split\xa0over\xa0McGovern; that organized labor didn\'t like him; that\xa0factory workers\xa0white Southern\xa0voters, with George Wallace gone, would go to Nixon; that the wealthy, worried about taxes, would do the same; that the middle class saw safer streets under Nixon; that the "ethnics" wanted to crack down on dissenters and deserters and that college students could not stick to anything over a prolonged period of time and that almost no single identi

In [53]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\charl\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [55]:
test_667 = documents_test['Document Body'][0]
response = sent_tokenize(test_667)
response

["Now that the nation's voters — at least, 54% of those eligible — have gone to the polls and expressed their choice, we have an opportunity to reflect on why that choice was made, what it meant, and what it will mean in the years ahead.",
 'The choice, simply put, was between the past performance of one fallible man and the unproved promises of another.',
 'Those who believed Gallup and Harris knew that the outcome was never in doubt.',
 'We learned\xa0that Jews, worried about quotas and Israel, would abandon regular\xa0Democrats\xa0would\xa0split\xa0over\xa0McGovern; that organized labor didn\'t like him; that\xa0factory workers\xa0white Southern\xa0voters, with George Wallace gone, would go to Nixon; that the wealthy, worried about taxes, would do the same; that the middle class saw safer streets under Nixon; that the "ethnics" wanted to crack down on dissenters and deserters and that college students could not stick to anything over a prolonged period of time and that almost no sin

In [57]:
help(sent_tokenize)

Help on function sent_tokenize in module nltk.tokenize:

sent_tokenize(text, language='english')
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus



In [75]:
import re



documents.head()

Unnamed: 0,ID,Title,Box,Document Length,Workflow Stage,Image Filename,Image Identifier,Image URL,Document Body
3,670,Undated Speech concerning Conditions of Black ...,Box 12 Folder 3,17.0,Early Access,"PJB670_mss13347-b12-f3_0071.tif\n, PJB670_mss1...","3243646, 3243647, 3243648, 3243649, 3243650, 3...",,[This speech includes pages with many differen...
5,667,Speeches making observations about the recent ...,Box 3 Folder 31,32.0,Early Access,"PJB667_mss13347-b3-f31_0049.tif\n, PJB667_mss1...","3191384, 3191385, 3191386, 3191387, 3191388, 3...",,"Now that the nation's voters — at least, 54% o..."
6,666,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB666_mss13347-b3-f31_0025.tif\n, PJB666_mss1...","3191361, 3191362, 3191363, 3191364, 3191365, 3...",,"Now that the nation's voters — at least, 54% o..."
7,665,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB665_mss13347-b3-f31_0001.tif\n, PJB665_mss1...","3186999, 3187000, 3187001, 3187002, 3187003, 3...",,"1\nNow that the nations voters — at least, 54%..."
8,663,Speech about the upcoming presidential electio...,Box 3 Folder 30,19.0,Early Access,"PJB663_mss13347-b3-f30_0036.tif\n, PJB663_mss1...","3187085, 3187086, 3187087, 3187088, 3187089, 3...",,The election approaching on November seventh i...


In [105]:
regex_date_str = r'(19|20)\d{2}'
documents['Year'] = documents['Title'].map(lambda x: re.search(regex_date_str, x).group() if re.search(regex_date_str, x) else "undated")

# common_sentences['Speech_1'] = common_sentences['Index_1'].map(lambda x: doc_667[x])
documents

Unnamed: 0,ID,Title,Box,Document Length,Workflow Stage,Image Filename,Image Identifier,Image URL,Document Body,Year
3,670,Undated Speech concerning Conditions of Black ...,Box 12 Folder 3,17.0,Early Access,"PJB670_mss13347-b12-f3_0071.tif\n, PJB670_mss1...","3243646, 3243647, 3243648, 3243649, 3243650, 3...",,[This speech includes pages with many differen...,1969
5,667,Speeches making observations about the recent ...,Box 3 Folder 31,32.0,Early Access,"PJB667_mss13347-b3-f31_0049.tif\n, PJB667_mss1...","3191384, 3191385, 3191386, 3191387, 3191388, 3...",,"Now that the nation's voters — at least, 54% o...",1972
6,666,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB666_mss13347-b3-f31_0025.tif\n, PJB666_mss1...","3191361, 3191362, 3191363, 3191364, 3191365, 3...",,"Now that the nation's voters — at least, 54% o...",1972
7,665,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB665_mss13347-b3-f31_0001.tif\n, PJB665_mss1...","3186999, 3187000, 3187001, 3187002, 3187003, 3...",,"1\nNow that the nations voters — at least, 54%...",1972
8,663,Speech about the upcoming presidential electio...,Box 3 Folder 30,19.0,Early Access,"PJB663_mss13347-b3-f30_0036.tif\n, PJB663_mss1...","3187085, 3187086, 3187087, 3187088, 3187089, 3...",,The election approaching on November seventh i...,1972
...,...,...,...,...,...,...,...,...,...,...
429,83,Speech concerning black bankers and businesses...,Box 4 Folder 5,11.0,Early Access,"PJB83_0001.tif\n, PJB83_0002.tif\n, PJB83_0003...","2857471, 2857472, 2857473, 2857474, 2857475, 2...",,"Before I begin, let me tell you my financial q...",1972
430,84,Speech concerning challenges facing the Americ...,Box 4 Folder 6,15.0,Early Access,"PJB84_0001.tif\n, PJB84_0002.tif\n, PJB84_0003...","2857482, 2857483, 2857484, 2857485, 2857486, 2...",,Now that the nation has changed had a change o...,1972
431,85,Speech from the Alabama Voter Education Projec...,Box 4 Folder 3,11.0,Early Access,"PJB85_0001.tif\n, PJB85_0002.tif\n, PJB85_0003...","2857497, 2857498, 2857499, 2857500, 2857501, 2...",,Thank you a great deal. As John Lewis has told...,1972
432,86,Speech at Press Conference in Washington conce...,Box 4 Folder 7,9.0,Early Access,"PJB86_0001.tif\n, PJB86_0002.tif\n, PJB86_0003...","2857508, 2857509, 2857510, 2857511, 2857512, 2...",,[Handwritten.]\none the point of today's conce...,1973


In [109]:
# use a name that does not shadow the built-in str
title = documents['Title'][5]
re.search(regex_date_str, title).group()

'1972'

In [113]:
documents['Year'].value_counts()

Year
1975       48
1971       32
1972       29
1976       24
1977       23
1969       15
1973       12
1985       11
1968       11
1965       10
1979        9
2004        7
1991        7
1995        7
1980        7
2003        7
1978        7
2000        6
1998        6
2001        6
1992        6
1988        5
1981        5
1989        5
2002        5
1990        4
1993        4
1974        4
1970        4
1996        3
1983        3
1982        3
1967        3
2005        3
1999        3
undated     2
1997        2
1987        2
1986        2
1994        2
1949        1
1984        1
1964        1
1960        1
Name: count, dtype: int64

In [87]:
doc_666 = documents_test['Document Body'][1].split('.')

# temporary solution to getting rid of short sentences

doc_666 = [x for x in doc_666 if len(x) > 50]
doc_666[:3]

["Now that the nation's voters — at least, 54% of those eligible — have gone to the polls and expressed their choice, we have an opportunity to reflect on why that choice was made, what it meant, and what it will mean in the years ahead",
 '\xa0\nThe choice, simply put, was between the past performance of one fallible man and the unproved promises of another',
 '\nThose who believed Gallup and Harris knew that the outcome was never in doubt']

In [10]:
model = SentenceTransformer("thenlper/gte-small")

embeddings667 = model.encode(doc_667)
print(embeddings667.shape)

embeddings666 = model.encode(doc_666)
print(embeddings666.shape)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings667, embeddings666)
print(similarities)

(166, 384)
(107, 384)
tensor([[1.0000, 0.7993, 0.7778,  ..., 0.7457, 0.7735, 0.8218],
        [0.7993, 1.0000, 0.8007,  ..., 0.7503, 0.7861, 0.7680],
        [0.7778, 0.8007, 1.0000,  ..., 0.7791, 0.7602, 0.7792],
        ...,
        [0.8169, 0.7643, 0.7733,  ..., 0.7581, 0.7800, 0.8178],
        [0.7489, 0.7607, 0.7258,  ..., 0.7496, 0.7716, 0.7609],
        [0.8029, 0.7716, 0.7661,  ..., 0.7850, 0.7978, 0.8117]])


In [11]:
import torch

threshold = 0.90

# Find indices where scores >= threshold
indices = torch.nonzero(similarities >= threshold)

indices[:5]

tensor([[0, 0],
        [1, 1],
        [2, 2],
        [3, 3],
        [4, 4]])

In [12]:
indices = [x.tolist() for x in indices]

In [13]:
common_sentences = pd.DataFrame(indices)
common_sentences.rename(columns={0: 'Index_1', 1: 'Index_2'}, inplace=True)
common_sentences.head()

Unnamed: 0,Index_1,Index_2
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4


In [14]:
scores = []
for index, row in common_sentences.iterrows():
    first = row['Index_1']
    second = row['Index_2']
    scores.append(round(float(similarities[first][second]),3))

In [15]:
scores[:5]

[1.0, 1.0, 1.0, 0.992, 0.994]

In [16]:
common_sentences['Speech_1'] = common_sentences['Index_1'].map(lambda x: doc_667[x])
common_sentences['Speech_2'] = common_sentences['Index_2'].map(lambda x: doc_666[x])
common_sentences['Similarity_score'] = scores
common_sentences.head()

Unnamed: 0,Index_1,Index_2,Speech_1,Speech_2,Similarity_score
0,0,0,"Now that the nation's voters — at least, 54% o...","Now that the nation's voters — at least, 54% o...",1.0
1,1,1,"\nThe choice, simply put, was between the past...","\nThe choice, simply put, was between the pas...",1.0
2,2,2,\nThose who believed Gallup and Harris knew th...,\nThose who believed Gallup and Harris knew th...,1.0
3,3,3,"We learned that Jews, worried about quotas an...","We learned that Jews, worried about quotas an...",0.992
4,4,4,] - ---- -- --- -\nIf the election on November...,\nIf the election on November 7th illuminated ...,0.994


In [17]:
print(common_sentences['Speech_1'][3])
print()
print(common_sentences['Speech_2'][3])

 We learned that Jews, worried about quotas and Israel, would abandon regular Democrats would split over McGovern; that organized labor didn't like him; that factory workers white Southern voters, with George Wallace gone, would go to Nixon; that the wealthy, worried about taxes, would do the same; that the middle class saw safer streets under Nixon; that the "ethnics" wanted to crack down on dissenters and deserters and that college students could not stick to anything over a prolonged period of time and that almost no single identifiable group could be found — except for Black people — to cast votes as a bloc for
2
George McGovern

 We learned that Jews, worried about quotas and Israel, would abandon McGovern; that organized labor didn't like him; that factory workers, with George Wallace gone, would go to Nixon; that the waethly wealthy, worried about taxes, would do the same; that the middle class saw safer streets under Nixon; that the "ethnics" wanted to crack down on dissenters 

### In action

So we can see that this similarity matching works pretty well!

It's kind of cool: the only differences are wording changes that were corrected on the page, such as the misspelling of "wealthy" or the changed wording on the Jewish "abandonment" of McGovern.

In [19]:
least_similar = common_sentences[common_sentences['Similarity_score'] == min(common_sentences['Similarity_score'])]
least_similar

Unnamed: 0,Index_1,Index_2,Speech_1,Speech_2,Similarity_score
22,21,27,"""\n""But if we solve the greatest of our ills,""...","""1\nBut in addition to the Coalition's major g...",0.904


In [20]:
print(common_sentences['Speech_1'][22])
print()
print(common_sentences['Speech_2'][22])

"
"But if we solve the greatest of our ills," the Coalition said, "our paralysis of spirit and will, we can narrow the distance between what we have and what we want

"1
But in addition to the Coalition's major goals, and their definition of "paralysis of will" as the greatest of our ills, there is another goal much more desirable


And here, we can see that this has been much more significantly altered between the two drafts, but carries a similar message.

In [22]:
# need to create a method to build these dataframes

# def organize_dataframe(speech1, speech2, indices):
    

I'm starting to see how this might be simpler with a bit of a more expansive design here. I think the end goal looks something like this:

1. Create and save an embedding object for each Bond speech. Its row count already gives the "size" of the speech in sentences, so we can also report what share of those sentences have a close match in another speech.
2. From the embedding object, we can create a comparison object between two embedding objects.
3. The comparison object will consist of a list object with all its similarity stats and a dataframe that does a more comprehensive comparison.
4. From the product of the comparison objects, we will be able to compare against hundreds of different documents.

I'm going to make these classes local for now, but I may eventually package them. Just not right now.

In [25]:
# necessary packages in full (for now, still building of course)

import torch
from sentence_transformers import SentenceTransformer
import pandas as pd

import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

import re
import itertools

In [215]:
class Speech():

    model = SentenceTransformer("thenlper/gte-small")
    regex_date_str = r'(19|20)\d{2}'

    def __init__(self, data):
        # copy so that adding 'Year' below does not write into a slice of the caller's row
        self.metadata = data[['ID', 'Title']].copy()
        self.doc_body_raw = data['Document Body']
        self.document_body = sent_tokenize(self.doc_body_raw)
        self.document_body = [x for x in self.document_body if len(x) > 30]
        # use the class-level model rather than relying on a global named `model`
        self.embeddings = self.model.encode(self.document_body)

        title = self.metadata['Title']
        match = re.search(self.regex_date_str, title)
        self.metadata['Year'] = match.group() if match else "undated"

class Comparison():

    # reuse the model already loaded by Speech instead of loading it a second time
    model = Speech.model

    def __init__(self, speech1, speech2):
        self.speech1 = speech1
        self.speech2 = speech2
        self.similarities = self.model.similarity(speech1.embeddings, speech2.embeddings)

    def find_similar_phrases(self, lower_bound = 0.000, upper_bound = 1.001):
        # Given a similarity range as floats, return a dataframe of every sentence pair whose score
        # falls in [lower_bound, upper_bound), along with the scores themselves.
        # This can be called directly but is also used by the later methods.

        indices = torch.nonzero((self.similarities >= lower_bound) & (self.similarities < upper_bound))
        # naming the columns up front keeps the dataframe well-formed even when nothing matches
        indices = pd.DataFrame(indices.tolist(), columns=['Index_1', 'Index_2'])
        scores = []
        for index, row in indices.iterrows():
            first = row['Index_1']
            second = row['Index_2']
            scores.append(round(float(self.similarities[first][second]),3))
        indices['Speech_1'] = indices['Index_1'].map(lambda x: self.speech1.document_body[x])
        indices['Speech_2'] = indices['Index_2'].map(lambda x: self.speech2.document_body[x])
        indices['Similarity_score'] = scores
        return indices

    def similarity_stats(self):
        # here I want to find the average semantic similarity across two texts
        # this is not going to be a true average: rather, it will bin sentences by semantic similarity
        # I am thinking of implementing some sort of hash table to prevent a sentence from being compared to too many other sentences
        # I want to store these attributes to be used in show_comparison()

        # For now, let's just have two bins: above 0.95 and between 0.90-0.95
        near_identical = self.find_similar_phrases(lower_bound = 0.95)
        paraphrase = self.find_similar_phrases(lower_bound = 0.90, upper_bound = 0.95)

        # now I want to take near_identical as a percentage of each speech
        copy_rate_1 = round(len(near_identical) / len(self.speech1.document_body),3)
        copy_rate_2 = round(len(near_identical) / len(self.speech2.document_body),3)
        paraphrase_rate_1 = round(len(paraphrase) / len(self.speech1.document_body),3)
        paraphrase_rate_2 = round(len(paraphrase) / len(self.speech2.document_body),3)
        values = {'Near Match': len(near_identical), 'Paraphrase': len(paraphrase), 'NM - % of Speech 1': copy_rate_1,
                 'NM - % of Speech 2': copy_rate_2, 'P - % of Speech 1': paraphrase_rate_1, 'P - % of Speech 2': paraphrase_rate_2}
        return values
        

    def show_comparison(self):
        # here I want to create an array/list-like object to return from the comparison so we can compare across comparisons.
        # Eventually, this will contain more of the metadata, but for now it is limited.
        values = self.similarity_stats()

        # I may not work on this today, but it is important that we label each comparison array/list/whatever
        # with at a minimum the document IDs being compared
        # at a maximum I want IDs, titles, versions, dates

        d1 = {'speech1_ID': self.speech1.metadata['ID'], 'speech2_ID' : self.speech2.metadata['ID'],
             'speech1_year': self.speech1.metadata['Year'], 'speech2_year' : self.speech2.metadata['Year'],
             'speech1_title': self.speech1.metadata['Title'], 'speech2_title' : self.speech2.metadata['Title']}
        d1.update(values)
        return d1
        
    
        

In [61]:
n = len(documents)
print(n)
print(n * (n - 1) // 2)  # number of unordered pairs of distinct documents

358
63903


There are 358 documents. Comparing each of them to every other document takes $\frac{n(n-1)}{2}$ comparisons, which is $O(n^2)$ and already comes to 63,903 comparisons on this dataset. I think this would not be too onerous given the small size, but if we are analyzing the entire corpus of Bond's work it might get quite large. So filtering will be important.
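
As a rough sketch of what that filtering might look like, the generator below enumerates every unordered pair of documents with `itertools.combinations` and can optionally skip pairs whose years are far apart. The year-window rule, the `candidate_pairs` name, and the `max_gap` parameter are assumptions made for illustration, as is representing years as integers with `None` for undated speeches (the `Year` column above holds strings).

```python
from itertools import combinations


def candidate_pairs(years, max_gap=None):
    """Yield (i, j) index pairs, optionally only for speeches within max_gap years.

    years holds one entry per document: an int year, or None when undated.
    Undated documents are never filtered out, since there is no date to compare.
    """
    for i, j in combinations(range(len(years)), 2):
        both_dated = years[i] is not None and years[j] is not None
        if max_gap is not None and both_dated:
            if abs(years[i] - years[j]) > max_gap:
                continue
        yield i, j
```

With no filter, 358 documents produce 358 × 357 / 2 = 63,903 pairs. Any other cheap pre-filter (same box, same folder, overlapping title words) could be slotted into the same loop.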

In [217]:
# testing cell

speech_1 = Speech(documents_test.loc[0])
print(speech_1.metadata['Title'])
# I want a regular expression to pull out as much date information as I can to make this a useful tool I think, but that's late stage.
print()
speech_2 = Speech(documents_test.loc[1])
print(speech_2.embeddings.shape)
print()
speeches_comparison = Comparison(speech_1, speech_2)
print((speeches_comparison.find_similar_phrases(lower_bound = 0.90).head()))
print()
print(speeches_comparison.similarity_stats())

Speeches making observations about the recent election and its effects upon black people, 1972 November (Doc 3 of 3)

(118, 384)

   Index_1  Index_2                                           Speech_1  \
0        0        0  Now that the nation's voters — at least, 54% o...   
1        1        1  The choice, simply put, was between the past p...   
2        2        2  Those who believed Gallup and Harris knew that...   
3        3        3  We learned that Jews, worried about quotas and...   
4        5        4  - ---- -- --- -\nIf the election on November 7...   

                                            Speech_2  Similarity_score  
0  Now that the nation's voters — at least, 54% o...             1.000  
1  The choice, simply put, was between the past p...             1.000  
2  Those who believed Gallup and Harris knew that...             1.000  
3  We learned that Jews, worried about quotas and...             0.991  
4  If the election on November 7th illuminated an...        

In [201]:
print(speech_1.metadata['Year'])

1972


In [203]:
speech_1.metadata

ID                                                     667
Title    Speeches making observations about the recent ...
Year                                                  1972
Name: 0, dtype: object
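
The title printed in the testing cell ("..., 1972 November (Doc 3 of 3)") carries more than the year, so a fuller parser might look something like the sketch below. It assumes that other titles follow the same "year, then month, then (Doc N of M)" pattern, which has not been checked against the whole catalog; `parse_title` and its output keys are hypothetical names.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# A four-digit year starting with 19 or 20, optionally followed by a month name.
DATE_RE = re.compile(rf"\b(?P<year>(?:19|20)\d{{2}})\b(?:\s+(?P<month>{MONTHS}))?")
# Draft numbering such as "(Doc 3 of 3)".
VERSION_RE = re.compile(r"\(Doc (?P<number>\d+) of (?P<total>\d+)\)")


def parse_title(title):
    """Pull year, month, and draft numbering out of a catalog title when present."""
    date = DATE_RE.search(title)
    version = VERSION_RE.search(title)
    return {
        "year": date.group("year") if date else "undated",
        "month": date.group("month") if date else None,
        "doc_number": int(version.group("number")) if version else None,
        "doc_total": int(version.group("total")) if version else None,
    }
```

Fields that are missing from a title come back as `None` (or `"undated"` for the year, matching the convention already used in the `Year` column).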

In [205]:
d1 = {'speech1_ID': speech_1.metadata['ID'], 'speech2_ID' : speech_2.metadata['ID'],
'speech1_year': speech_1.metadata['Year'], 'speech2_year' : speech_2.metadata['Year'],
'speech1_title': speech_1.metadata['Title'], 'speech2_title' : speech_2.metadata['Title']}
d2 = speeches_comparison.similarity_stats()
d1.update(d2)
pd.DataFrame([d1])

Unnamed: 0,speech1_ID,speech2_ID,speech1_year,speech2_year,speech1_title,speech2_title,Near Match,Paraphrase,NM - % of Speech 1,NM - % of Speech 2,P - % of Speech 1,P - % of Speech 2
0,667,666,1972,1972,Speeches making observations about the recent ...,Speeches making observations about the recent ...,112,10,0.619,0.949,0.055,0.085


In [219]:
speeches_comparison.show_comparison()

{'speech1_ID': 667,
 'speech2_ID': 666,
 'speech1_year': '1972',
 'speech2_year': '1972',
 'speech1_title': 'Speeches making observations about the recent election and its effects upon black people, 1972 November (Doc 3 of 3)',
 'speech2_title': 'Speeches making observations about the recent election and its effects upon black people, 1972 November (Doc 2 of 3)',
 'Near Match': 112,
 'Paraphrase': 10,
 'NM - % of Speech 1': 0.619,
 'NM - % of Speech 2': 0.949,
 'P - % of Speech 1': 0.055,
 'P - % of Speech 2': 0.085}

### Thoughts at this point

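This is kind of a fun concept! Worth noting that the two rates for Speech 2 add up to more than 100% (0.949 + 0.085). The rates count matched pairs rather than distinct sentences, so a sentence that matches more than one partner is counted more than once. Speech 1 here is Doc 3 of 3 and Speech 2 is Doc 2 of 3, so Doc 3 clearly takes Doc 2 and adds onto it, probably splitting certain sentences into two or reiterating a point of some sort. Let's quickly look at two documents that aren't the same version, but are close chronologically. We're also going to test creating a loop.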
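
One way to stop a sentence from being counted against several partners, in the spirit of the hash-table idea noted in the `similarity_stats` comments, is a greedy one-to-one pass over the matched pairs. The sketch below is an assumption about how that could work rather than something the notebook does yet; `one_to_one_matches` is a hypothetical helper, and greedy selection is not guaranteed to give the best overall assignment.

```python
def one_to_one_matches(pairs):
    """Greedily keep the best-scoring match for each sentence.

    pairs is an iterable of (index_1, index_2, score) tuples. Pairs are
    visited from highest to lowest score, and a pair is kept only if
    neither of its sentences has already been matched.
    """
    used_1, used_2 = set(), set()
    kept = []
    for i, j, score in sorted(pairs, key=lambda p: p[2], reverse=True):
        if i in used_1 or j in used_2:
            continue
        used_1.add(i)
        used_2.add(j)
        kept.append((i, j, score))
    return kept
```

The rows of a `find_similar_phrases` dataframe could be fed in with `df[['Index_1', 'Index_2', 'Similarity_score']].itertuples(index=False)`. Rates computed from the kept pairs cannot exceed 100% of either speech.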

In [65]:
documents_test_2 = documents.iloc[3:5].reset_index()
documents_test_2

Unnamed: 0,index,ID,Title,Box,Document Length,Workflow Stage,Image Filename,Image Identifier,Image URL,Document Body
0,7,665,Speeches making observations about the recent ...,Box 3 Folder 31,24.0,Early Access,"PJB665_mss13347-b3-f31_0001.tif\n, PJB665_mss1...","3186999, 3187000, 3187001, 3187002, 3187003, 3...",,"1\nNow that the nations voters — at least, 54%..."
1,8,663,Speech about the upcoming presidential electio...,Box 3 Folder 30,19.0,Early Access,"PJB663_mss13347-b3-f30_0036.tif\n, PJB663_mss1...","3187085, 3187086, 3187087, 3187088, 3187089, 3...",,The election approaching on November seventh i...


In [67]:
speeches = []
for index, row in documents_test_2.iterrows():
    new_speech = Speech(row)
    speeches.append(new_speech)

speeches

[<__main__.Speech at 0x1620e4a2d50>, <__main__.Speech at 0x16215142000>]

Excellent! The objects are being stored correctly in the loop.

In [68]:
new_comparison = Comparison(speeches[0], speeches[1])

print((new_comparison.find_similar_phrases(lower_bound = 0.95)))

    Index_1  Index_2                                           Speech_1  \
0         9        9  It meant re-installing in power those who beli...   
1        12        3  He will continue to set the budget and name th...   
2        22       11  It, additionally, means an end for the moment ...   
3        23       12  A short while ago, the National Urban Coalitio...   
4        24       14  "Our ideal is a country where every American g...   
5        26       15       ] 5\nOur reality needs no full recital here.   
6        29       17  In sum, we know that our society is not functi...   
7        31       18  the Coalition said, "our paralysis of spirit a...   
8        32       19  Indeed, we must marshal our good sense and our...   
9        33       20                 There is no sensible alternative."   
10       34       21  In the Coalition view "America must pursue sev...   
11       35       22  It must try to:\nAchieve full employment with ...   
12       36       23  "Pr

In [71]:
print((new_comparison.find_similar_phrases(lower_bound = 0.90, upper_bound = 0.95)))

    Index_1  Index_2                                           Speech_1  \
0         8        8  2\nFor Black people in America, the election r...   
1        10       10  It meant giving a four year free hand to the c...   
2        11        2  3\nMr. Nixon now has four more years to put hi...   
3        13        5  4\nHe will continue to decide whether our week...   
4        18       37  9\nFor too many young Americans, the exch poss...   
5        27       16                We know that cities are in trouble.   
6        28       16  That poverty continues in the midst of wealth,...   
7        35       32  It must try to:\nAchieve full employment with ...   
8        40       53  "1\n[Handwritten line with arrow pointing to p...   
9        40       61  "1\n[Handwritten line with arrow pointing to p...   
10       40       63  "1\n[Handwritten line with arrow pointing to p...   
11       42       27    , and the goal its containment and eradication.   
12       47       22  The

In [73]:
print(new_comparison.similarity_stats())

{'Near Match': 35, 'Paraphrase': 25, 'NM - % of Speech 1': 0.269, 'NM - % of Speech 2': 0.33, 'P - % of Speech 1': 0.192, 'P - % of Speech 2': 0.236}


Isn't that cool! It admittedly picks up some weird stuff, like editorial language, and I need to do a better job of cleaning the data (I'm sure I'm missing out on lots of stuff due to various reasons). But it shows how Bond reuses certain phrases across time. And we can plot this on a much larger scale, which I want to do next.

This is where I think the FAISS similarity search step will come in handy: it will make it easier to make connections across the entire corpus. But we can already begin to see how this can create a history of Bond's speechwriting, and identify things that relate to Bond's core tenets. This is a decent stopping point for now, though.
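
For reference, the corpus-wide search amounts to asking, for one sentence embedding, which other embeddings have the highest cosine similarity. The sketch below does this by brute force in plain Python as a stand-in for a FAISS index; it is not FAISS code, the toy three-dimensional vectors stand in for the 384-dimensional GTE-small embeddings, and `normalize` and `top_k` are hypothetical names.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, corpus, k=3):
    """Return the k (index, score) pairs from corpus most similar to query."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This version assumes no all-zero vectors and scans the whole corpus for every query, which is the cost a dedicated index is meant to reduce.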