## Similarity Analysis of Congress Speeches
### By calculating the Cosine Similarities and Manhattan Distances between the TF-IDF (term frequency-inverse document frequency) and Count Vectors of texts with n-grams
Using *N* speeches of my choosing,<br />
preparing them for analysis (removing punctuation, stop-words),<br />
using Bag of Words and n-grams in addition to tf-idf to find the cosine similarity between them.<br />
Discussing my findings.

In [1]:
""" 
%pip install -U scikit-learn
%pip install nltk 
"""

' \n%pip install -U scikit-learn\n%pip install nltk \n'

In [2]:
""" 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
"""

" \nnltk.download('punkt')\nnltk.download('wordnet')\nnltk.download('averaged_perceptron_tagger')\nnltk.download('vader_lexicon')\n"

In [3]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.pipeline import Pipeline

In [4]:
import requests

text1_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-moseleybraun-il.txt"
text2_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-reid-nv.txt"

text1_get, text2_get = requests.get(text1_url), requests.get(text2_url)
text1, text2 = text1_get.text, text2_get.text

print("text1 head:\n",text1[0:200],"\n\ntext2 head:\n",text2[0:200])

text1 head:
 <DOC>
<DOCNO>105-moseleybraun-il-1-19981009</DOCNO>
<TEXT>
 Ms. MOSELEYBRAUN. Mr. President, I want to note my disappointment that the permanent relief for Haitian refugees that I and many others in t 

text2 head:
 <DOC>
<DOCNO>105-reid-nv-1-19981020</DOCNO>
<TEXT>
 Mr. REID. Mr. President, I rise today to call attention to the outstanding achievements of a Nevadan who has dedicated himself to helping individual


##### Removing punctuation and stop-words in the text

In [5]:
def text_preprocessor(text):
    # Deleting non-word characters by replacing them with blank (' '):
    text= re.sub(r'\W',' ', text)
    # Tokenizing the string text into word substrings, writing them to a list (.lower() makes all characters lower case):
    tokens = word_tokenize(text.lower())
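    # Removing English stopwords from the list (building the stopword set once keeps the lookup fast on long texts):
    english_stopwords = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in english_stopwords]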
    # Keeping words with at least 3 characters in the list:
    tokens = [word for word in tokens if len(word)>=3]
    # Joining the tokens -substrings- in the list back together with blank (' ') between them:
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text 

In [6]:
# Testing the pre-processing function with the first 1000 characters of the first text:
text1_head_tokenized = text_preprocessor(text1[:1000])
text1_head_tokenized

'doc docno 105 moseleybraun 19981009 docno text moseleybraun president want note disappointment permanent relief haitian refugees many others body worked make law dropped treasury appropriations conference report effort began last year debate appropriations bill included language granted certain central americans access suspension deportation procedure haitians granted access may recall supported granting relief affected class central americans along several colleagues senate house fought vigorously additional provisions haitian refugees although unsuccessful effort later introduced 1504 haitian immigrations fairness act 1997 legislation would provide haitian refugees permanent residency status course'

##### Stemming the words in the tokenized text

In [7]:
def stem_words(text):
    # Creating a stemmer instance which uses Porter Stemming Algorithm:
    stemmer = PorterStemmer()
    # Tokenizing the text into words, stemming them:
    stemmed_words = [stemmer.stem(word) for word in word_tokenize(text)]
    # Joining the word stems back and returning:
    return ' '.join(stemmed_words)

# Some alternatives to Porter in NLTK are Snowball (in English) and Lancaster.

In [8]:
# Testing the stemmer function:
text1_stemmed = stem_words(text1_head_tokenized)
text1_stemmed

'doc docno 105 moseleybraun 19981009 docno text moseleybraun presid want note disappoint perman relief haitian refuge mani other bodi work make law drop treasuri appropri confer report effort began last year debat appropri bill includ languag grant certain central american access suspens deport procedur haitian grant access may recal support grant relief affect class central american along sever colleagu senat hous fought vigor addit provis haitian refuge although unsuccess effort later introduc 1504 haitian immigr fair act 1997 legisl would provid haitian refuge perman resid statu cours'

##### Lemmatizing the words in the tokenized and stemmed text
[Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In [9]:
def lemmatize_words(text):
    # Creating a lemmatizer instance:
    lemmatizer = WordNetLemmatizer()
    # Applying the lemmatizer word by word:
    lemmatized_words = [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
    # Joining the words back and returning:
    return ' '.join(lemmatized_words)

In [10]:
# Testing the lemmatizer function:
text1_lemmatized = lemmatize_words(text1_stemmed)
text1_lemmatized

'doc docno 105 moseleybraun 19981009 docno text moseleybraun presid want note disappoint perman relief haitian refuge mani other bodi work make law drop treasuri appropri confer report effort began last year debat appropri bill includ languag grant certain central american access suspens deport procedur haitian grant access may recal support grant relief affect class central american along sever colleagu senat hous fought vigor addit provis haitian refuge although unsuccess effort later introduc 1504 haitian immigr fair act 1997 legisl would provid haitian refuge perman resid statu cours'

#### Putting it all together
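Now that all the pre-processing functions are tested and working, we can apply them to the full bodies of both texts. Note that in the test above, lemmatizing the already-stemmed sample left it unchanged, most likely because stems such as "presid" are not dictionary forms the lemmatizer can look up.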


In [11]:
def text_processor(text):
    step1 = text_preprocessor(text)
    step2 = stem_words(step1)
    step3 = lemmatize_words(step2)
    output = step3
    return output

In [12]:
text1_processed, text2_processed = text_processor(text1), text_processor(text2)

In [13]:
text1_processed[:500]

'doc docno 105 moseleybraun 19981009 docno text moseleybraun presid want note disappoint perman relief haitian refuge mani other bodi work make law drop treasuri appropri confer report effort began last year debat appropri bill includ languag grant certain central american access suspens deport procedur haitian grant access may recal support grant relief affect class central american along sever colleagu senat hous fought vigor addit provis haitian refuge although unsuccess effort later introduc '

#### We represent the processed bodies of text as vectors to analyze them. We use both TF-IDF and Bag-of-Words (Count) approaches.
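
As a rough illustration of the Bag-of-Words idea, the sketch below builds count vectors by hand for two made-up, already pre-processed toy documents (they are not taken from the speeches). The helper `count_vectors` is a hypothetical name, and splitting on whitespace is a simplification of what `CountVectorizer` does when it tokenizes.

```python
from collections import Counter

def count_vectors(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary term; terms missing from a document count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical toy documents, used only for illustration:
toy_docs = ["haitian refuge relief haitian", "relief fund act"]
vocabulary, vectors = count_vectors(toy_docs)
print(vocabulary)  # ['act', 'fund', 'haitian', 'refuge', 'relief']
print(vectors)     # [[0, 0, 2, 1, 1], [1, 1, 0, 0, 1]]
```

Each document becomes a vector over the shared vocabulary, which is what makes the two texts comparable term by term.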

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Calling an instance of TF-IDF Vectorizer with default arguments:
tfidf_vectorizer = TfidfVectorizer()

# Calling an instance of Count Vectorizer with default arguments:
count_vectorizer = CountVectorizer()

In [15]:
# Vectorizing the bodies of texts and putting them together in a matrix:
corpus_tfidf = tfidf_vectorizer.fit_transform([text1_processed, text2_processed])
corpus_count = count_vectorizer.fit_transform([text1_processed, text2_processed])


In [16]:
import pandas as pd

# Transforming the corpus matrix to a dataframe with feature names (words) as index:
corpus_tfidf_matrix = pd.DataFrame(corpus_tfidf.toarray().transpose(), 
                             index=tfidf_vectorizer.get_feature_names_out())

corpus_count_matrix = pd.DataFrame(corpus_count.toarray().transpose(), 
                             index=count_vectorizer.get_feature_names_out())

# Renaming the columns with the names of the senators who gave the speeches:
corpus_tfidf_matrix = corpus_tfidf_matrix.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

corpus_count_matrix = corpus_count_matrix.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

In [17]:
corpus_count_matrix

Unnamed: 0,Moseley-Braun,Reid
000,192,187
060,0,1
063,0,1
083,0,1
097,1,0
...,...,...
zero,13,3
zest,1,0
zombi,1,0
zone,15,1


In [18]:
corpus_tfidf_matrix

Unnamed: 0,Moseley-Braun,Reid
000,0.047383,0.041278
060,0.000000,0.000310
063,0.000000,0.000310
083,0.000000,0.000310
097,0.000347,0.000000
...,...,...
zero,0.003208,0.000662
zest,0.000347,0.000000
zombi,0.000347,0.000000
zone,0.003702,0.000221


In [19]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculating the cosine similarity between the two vectorized texts:
tfidf_result = cosine_similarity(corpus_tfidf_matrix[corpus_tfidf_matrix.columns[0]].values.reshape(1, -1), 
                  corpus_tfidf_matrix[corpus_tfidf_matrix.columns[1]].values.reshape(1, -1))

bow_result = cosine_similarity(corpus_count_matrix[corpus_count_matrix.columns[0]].values.reshape(1, -1), 
                  corpus_count_matrix[corpus_count_matrix.columns[1]].values.reshape(1, -1))

In [20]:
print("Similarity rate with TF-IDF:", tfidf_result, 
      "\nSimilarity rate with Bag-of-Words:", bow_result)

Similarity rate with TF-IDF: [[0.69861398]] 
Similarity rate with Bag-of-Words: [[0.7534336]]


#### Now let us see how the results change if we look at 2-grams and 3-grams cumulatively in addition to single words.

We need to modify the vectorizers in order to achieve this. Previously, we used the vectorizers with default parameters. This means that they only looked at single words instead of groups of two or three consecutive words.
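
To make the idea of n-gram ranges concrete, here is a minimal sketch that lists the word n-grams of a short string by hand. The function `word_ngrams` is a hypothetical helper written for this illustration; it only mimics the effect of the `ngram_range` parameter on whitespace-separated tokens.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """List all word n-grams of the text for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n consecutive tokens over the text.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

toy_text = "haitian refuge perman resid"
print(word_ngrams(toy_text, (1, 2)))
# ['haitian', 'refuge', 'perman', 'resid', 'haitian refuge', 'refuge perman', 'perman resid']
print(word_ngrams(toy_text, (2, 2)))
# ['haitian refuge', 'refuge perman', 'perman resid']
```

A range of (1, 2) keeps the single words and adds the pairs, so the vocabulary grows quickly as the upper bound increases.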

In [21]:
# Calling an instance of TF-IDF Vectorizer with 1 to 2 grams:
tfidf_vectorizer_12 = TfidfVectorizer(ngram_range = (1,2))

# Calling an instance of Count Vectorizer with 1 to 2 grams:
count_vectorizer_12 = CountVectorizer(ngram_range = (1,2))

# I name the vectorizers with the "_12" suffix, indicating the ngram_range parameter values.

Above, we call the vectorizers to look at 1-grams and 2-grams together. We could also look at 2-grams only by setting the ngram_range parameter to (2,2).

In [22]:
# Vectorizing the bodies of texts and putting them together in a matrix:
corpus_tfidf_12 = tfidf_vectorizer_12.fit_transform([text1_processed, text2_processed])
corpus_count_12 = count_vectorizer_12.fit_transform([text1_processed, text2_processed])

In [23]:
# Transforming the corpus matrix to a dataframe with feature names (words) as index:
corpus_tfidf_matrix_12 = pd.DataFrame(corpus_tfidf_12.toarray().transpose(), 
                             index=tfidf_vectorizer_12.get_feature_names_out())

corpus_count_matrix_12 = pd.DataFrame(corpus_count_12.toarray().transpose(), 
                             index=count_vectorizer_12.get_feature_names_out())

# Renaming the columns with the names of the senators who gave the speeches:
corpus_tfidf_matrix_12 = corpus_tfidf_matrix_12.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

corpus_count_matrix_12 = corpus_count_matrix_12.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

In [24]:
corpus_count_matrix_12

Unnamed: 0,Moseley-Braun,Reid
000,192,187
000 000,4,3
000 170,0,1
000 1990,1,0
000 1995,1,0
...,...,...
zone new,1,0
zone peac,1,0
zone urban,1,0
zoster,0,1


In [25]:
corpus_tfidf_matrix_12

Unnamed: 0,Moseley-Braun,Reid
000,0.044707,0.036991
000 000,0.000931,0.000593
000 170,0.000000,0.000278
000 1990,0.000327,0.000000
000 1995,0.000327,0.000000
...,...,...
zone new,0.000327,0.000000
zone peac,0.000327,0.000000
zone urban,0.000327,0.000000
zoster,0.000000,0.000278


Let us define functions for similarity measures to make life easier in the future.

In [26]:
def cosine_result(matrix):
    result = cosine_similarity(matrix[matrix.columns[0]].values.reshape(1, -1), 
                               matrix[matrix.columns[1]].values.reshape(1, -1))
    return result

tfidf_result_12 = cosine_result(corpus_tfidf_matrix_12)
bow_result_12 = cosine_result(corpus_count_matrix_12)

print("Similarity rate with TF-IDF:", tfidf_result,
      "\nSimilarity rate with Bag-of-Words:", bow_result, 
      "\nSimilarity rate with TF-IDF, 1 to 2-grams:", tfidf_result_12, 
      "\nSimilarity rate with Bag-of-Words, 1 to 2-grams:", bow_result_12)

Similarity rate with TF-IDF: [[0.69861398]] 
Similarity rate with Bag-of-Words: [[0.7534336]] 
Similarity rate with TF-IDF, 1 to 2-grams: [[0.65857355]] 
Similarity rate with Bag-of-Words, 1 to 2-grams: [[0.72931358]]


In [27]:
from sklearn.metrics.pairwise import manhattan_distances

def manhattan_result(matrix):
    result = manhattan_distances(matrix[matrix.columns[0]].values.reshape(1, -1), 
                               matrix[matrix.columns[1]].values.reshape(1, -1))
    return result

Let us define another function which takes the vectorizer type, two processed texts, the ngram_range parameter and the similarity measure as input, and returns the cosine similarity or the Manhattan distance between the two texts.

In [28]:
def similarity_pipeline(vectorizer, txt1, txt2, ngram_range = (1,1), similarity = "Cosine"):
    
    # Allowing the option to use Tfidf and Count vectorizers, (1,1) applies as ngram_range if not specified:
    if vectorizer == "Count":
        vectorizer_instance = CountVectorizer(ngram_range = ngram_range)
    elif vectorizer == "Tfidf":
        vectorizer_instance = TfidfVectorizer(ngram_range = ngram_range)
    else:
        raise ValueError("Please choose a valid vectorizer. Valid vectorizers are 'Count' and 'Tfidf'.")

    # Rejecting unknown similarity measures early, instead of failing at the return statement:
    if similarity not in ("Cosine", "Manhattan"):
        raise ValueError("Please choose a valid parameter for similarity. "
                         "Valid similarity measures are 'Cosine' and 'Manhattan'.")

    corpus = vectorizer_instance.fit_transform([txt1, txt2])

    # Loading vectorized texts into a matrix:
    corpus_matrix = pd.DataFrame(corpus.toarray().transpose(), 
                             index=vectorizer_instance.get_feature_names_out())

    # Renaming the columns:
    corpus_matrix = corpus_matrix.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

    # Defining the output:
    if similarity == "Cosine":
        return cosine_result(corpus_matrix)

    return manhattan_result(corpus_matrix)

In [29]:
# Testing the function:
similarity_pipeline("Tfidf", text1_processed, text2_processed, ngram_range = (2,2), similarity= "Manhattan")

array([[122.85798728]])

Now we can compare the results produced by different vectorizers, n-gram ranges and similarity measures.

In [30]:
ngram_ranges = [(1,1), (1,2), (1,3), (1,4), (2,2), (3,3), (4,4)]
vectorizers = ["Tfidf", "Count"]
similarity_measures = ["Cosine", "Manhattan"]
# Labels shown in the results table for each vectorizer:
vectorizer_labels = {"Tfidf": "Tfidf", "Count": "Bag-of-Words"}
results = []
for vec in vectorizers:
    for ngram_range in ngram_ranges:
        for similarity_measure in similarity_measures:
            score = similarity_pipeline(vec, text1_processed, text2_processed, ngram_range=ngram_range, similarity=similarity_measure)
            results.append({"Vectorizer": vectorizer_labels[vec], "Similarity Measure": similarity_measure, "Ngram Range": ngram_range, "Similarity Score": score[0][0]})

df = pd.DataFrame(results)
df

Unnamed: 0,Vectorizer,Similarity Measure,Ngram Range,Similarity Score
0,Tfidf,Cosine,"(1, 1)",0.698614
1,Tfidf,Manhattan,"(1, 1)",16.875941
2,Tfidf,Cosine,"(1, 2)",0.658574
3,Tfidf,Manhattan,"(1, 2)",61.22641
4,Tfidf,Cosine,"(1, 3)",0.626878
5,Tfidf,Manhattan,"(1, 3)",108.87431
6,Tfidf,Cosine,"(1, 4)",0.610308
7,Tfidf,Manhattan,"(1, 4)",155.732114
8,Tfidf,Cosine,"(2, 2)",0.461612
9,Tfidf,Manhattan,"(2, 2)",122.857987


#### Discussion

To qualitatively discuss the similarity between the two senators' speeches some background on who they are is needed.

[**Moseley-Braun**](https://en.wikipedia.org/wiki/Carol_Moseley_Braun) was *the first African-American woman* elected to the U.S. Senate, the first African-American U.S. Senator from the **Democratic Party**, *the first woman to defeat an incumbent U.S. Senator in an election*, and the first female U.S. Senator from Illinois.

[**Harry Mason Reid Jr.**](https://en.wikipedia.org/wiki/Harry_Reid) was an American lawyer and politician who served as a United States senator from Nevada from 1987 to 2017. He *led the Senate **Democratic** Caucus from 2005 to 2017* and was *the Senate Majority Leader from 2007 to 2015*.

Although both senators were from the same party, the dissimilarity between them is most likely rooted in their backgrounds and the identities they stand for.

Moseley-Braun was the first to reach many milestones, while Reid's election and re-election arguably took place under smoother conditions. Moseley-Braun is an African-American woman who became the first female U.S. Senator from her state, while Reid is from an already well-represented identity group: white and male.

The similarity being above 50% might be due to the fact that they are from the same party, but the remaining difference is, at least superficially, because they are vastly different characters and come from very different states.

For a better grounded analysis, we can look at the similarity measures of multiple pairs of senator speeches from the same and different parties and employ a comparative perspective. This approach can reveal patterns more clearly as to what makes two speeches similar and what having similar speeches tells us about the characteristics of the senators in comparison. Furthermore, different methods of vectorizing speeches and different measures of similarity might give qualitatively different results.

For instance, we found here a cosine similarity of about 0.66 (TF-IDF with 1 to 2-grams). A good reference point would be the average level of similarity between senators of the two different parties.

##### What about when we use a different vectorizer and look at different n-gram ranges?

- **Observation 1**: We see that for every n-gram range, Bag-of-Words gives a higher cosine similarity. This is because TF-IDF is more restrictive. While Bag-of-Words simply records how many times each word is used in each text, TF-IDF (Term Frequency-Inverse Document Frequency) measures how often word i (or n-gram i) appears in text j, penalized by the number of texts that also contain word i. In other words, the weight of a word is proportional to its frequency in the document (term frequency) and inversely related to its frequency across the corpus (inverse document frequency). Words that are common across the corpus (i.e., appear in many documents) receive a lower weight, while words that are rare in the corpus receive a higher weight. With only two texts, this means that terms appearing in just one speech are weighted up relative to shared terms, which pulls the cosine similarity down.

- **Observation 2**: We look at n-grams both alone and cumulatively, e.g., 1, 2 and 3-grams together for the (1,3) n-gram range. In the first case, looking at 2-grams or 3-grams alone, we see that for both vectorizers the cosine similarities strictly decrease as we look at longer n-grams. In the cumulative case, the cosine similarities also decrease, but more slowly, because the cumulative ranges still contain the 1-grams, which are far more likely to be shared between the two texts than 2-grams, which in turn are more likely to be shared than 3-grams. For the same reason, a range of 1 to 3-grams results in a higher cosine similarity than 3-grams alone.
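
The first observation can be reproduced on a toy corpus. The sketch below assumes the weighting that scikit-learn documents as the `TfidfVectorizer` default, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 applied to raw term counts. The row normalization is skipped because it does not affect cosine similarity. The helper names `vectorize` and `cosine` and the two toy documents are made up for this illustration.

```python
import math
from collections import Counter

def vectorize(docs, use_idf):
    """Return one weight vector per document over the shared sorted vocabulary."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocabulary:
            weight = float(counts[term])
            if use_idf:
                # Number of documents containing the term:
                doc_freq = sum(term in other for other in tokenized)
                # Smoothed idf, assumed to match scikit-learn's documented default:
                weight *= math.log((1 + n_docs) / (1 + doc_freq)) + 1
            row.append(weight)
        vectors.append(row)
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy documents that share two terms and differ in one:
toy_docs = ["senat bill senat haitian", "senat bill senat zone"]
count_similarity = cosine(*vectorize(toy_docs, use_idf=False))
tfidf_similarity = cosine(*vectorize(toy_docs, use_idf=True))
print("Bag-of-Words:", round(count_similarity, 4))
print("TF-IDF:", round(tfidf_similarity, 4))
```

Under this weighting, terms shared by both documents keep a factor of 1 while terms found in only one document are scaled by ln(3/2) + 1. The dot product is therefore unchanged but both vector norms grow, so the TF-IDF similarity comes out lower than the Bag-of-Words one.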

##### What about a different similarity measure?
We also look at *Manhattan Distance* measures between text vectors. The Manhattan Distance between two vectors is the sum of the absolute differences between their corresponding elements. It is also known as L1 distance. See [Taxicab Geometry](https://en.wikipedia.org/wiki/Taxicab_geometry).

Since it is a distance measure, the lower the value, the more similar the two vectors.
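
As a small worked example, the sketch below computes the Manhattan distance by hand on the last four rows of the count table shown earlier (zero, zest, zombi, zone). The function name `manhattan_distance` is a hypothetical helper, and restricting the vectors to four terms is only for illustration.

```python
def manhattan_distance(u, v):
    """Sum of absolute element-wise differences (L1 distance)."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length.")
    return sum(abs(a - b) for a, b in zip(u, v))

# Counts for the terms 'zero', 'zest', 'zombi', 'zone' from the count table:
moseley_braun_counts = [13, 1, 1, 15]
reid_counts = [3, 0, 0, 1]

distance = manhattan_distance(moseley_braun_counts, reid_counts)
print(distance)  # 10 + 1 + 1 + 14 = 26
```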

In [31]:
df[df["Similarity Measure"] == "Manhattan"]

Unnamed: 0,Vectorizer,Similarity Measure,Ngram Range,Similarity Score
1,Tfidf,Manhattan,"(1, 1)",16.875941
3,Tfidf,Manhattan,"(1, 2)",61.22641
5,Tfidf,Manhattan,"(1, 3)",108.87431
7,Tfidf,Manhattan,"(1, 4)",155.732114
9,Tfidf,Manhattan,"(2, 2)",122.857987
11,Tfidf,Manhattan,"(3, 3)",169.456308
13,Tfidf,Manhattan,"(4, 4)",213.7759
15,Bag-of-Words,Manhattan,"(1, 1)",64225.0
17,Bag-of-Words,Manhattan,"(1, 2)",217514.0
19,Bag-of-Words,Manhattan,"(1, 3)",392813.0


Results with Manhattan Distance as the similarity measure mirror the results with cosine similarity when comparing within the same vectorizer, i.e., with a given vectorizer, the observation about using different n-gram ranges holds here as well.

But since the [*Manhattan Distance*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html#sklearn.metrics.pairwise.manhattan_distances) measure in the **scikit-learn** library is not a standardized measure like *Cosine Similarity*, we cannot compare the results of different vectorizers for the same n-gram range.

For instance, the Manhattan Distance using Bag-of-Words with 1-grams is 64225.0 while the corresponding value using TF-IDF is 16.875941. We cannot infer from this gap that the speeches are thousands of times more dissimilar under Bag-of-Words, because the two vectorizers put the vectors on very different scales.
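
The scale dependence can be checked directly. In the sketch below, both vectors are multiplied by the same arbitrary factor: the cosine similarity stays the same while the Manhattan distance shrinks by that factor. The vectors reuse three rows of the count table (000, zero, zone). The factor 0.001 and the helper names are assumptions made for illustration and do not reproduce the actual TF-IDF weights.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def manhattan(u, v):
    """Sum of absolute element-wise differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Counts for the terms '000', 'zero', 'zone' from the count table:
counts_a = [192, 13, 15]
counts_b = [187, 3, 1]

# Arbitrary rescaling applied to both vectors (illustrative only):
scale = 0.001
scaled_a = [x * scale for x in counts_a]
scaled_b = [x * scale for x in counts_b]

print("Cosine, raw vs. scaled:", cosine(counts_a, counts_b), cosine(scaled_a, scaled_b))
print("Manhattan, raw vs. scaled:", manhattan(counts_a, counts_b), manhattan(scaled_a, scaled_b))
```

This is why the TF-IDF distances (tens to hundreds) and the Bag-of-Words distances (tens of thousands) should only be compared within the same vectorizer.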

#### What else can be done in future projects?

- Corpus-specific uninformative tokens (e.g., "doc", "docno", "text" and "presid", which appear in every speech file) can be eliminated during the text processing stage.
- More steps can be written as functions to avoid unnecessary code repetition.
- A pipeline can be constructed to check which of N texts is the most similar to a given text. For example, among all the senator speeches we have, which one is the most similar to a given senator's, e.g., Senator Biden's.

## Let us now find which of the speech texts we have is the most similar to Senator Biden's.
We use:
- TF-IDF to vectorize,
- Cosine similarity to compare the texts, and
- Cumulative 2-grams, i.e., 1-grams and 2-grams together.
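
The ranking step itself can be sketched independently of the vectorizer. The example below assumes the texts are already vectorized and stored in a dictionary keyed by text ID. The vectors are made-up toy values and `most_similar` is a hypothetical helper, not the code used in the cells that follow.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(reference_key, vectors):
    """Return the key closest to the reference and all pairwise scores."""
    reference = vectors[reference_key]
    scores = {key: cosine(reference, vector)
              for key, vector in vectors.items()
              if key != reference_key}  # Skipping the reference text itself.
    best_key = max(scores, key=scores.get)
    return best_key, scores

# Hypothetical toy vectors standing in for vectorized speeches:
toy_vectors = {
    "text7": [1.0, 2.0, 0.0],
    "text1": [1.0, 1.9, 0.1],
    "text2": [0.0, 0.1, 3.0],
}
best_key, scores = most_similar("text7", toy_vectors)
print(best_key)  # text1
```

Keeping the scores in a dictionary keyed by text ID avoids creating one variable per text and makes it easy to look up the least similar text as well.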

In [32]:
# Getting the URLs that contain the text files:

from bs4 import BeautifulSoup, SoupStrainer

html = requests.get('https://github.com/ariedamuco/ML-for-NLP/tree/main/Inputs/105-extracted-date')

text_links = []

# Putting links to each text file into a list (only <a> tags that actually carry an href attribute are considered):
soup = BeautifulSoup(html.text, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    if link['href'].endswith('.txt'):
        url = "https://raw.githubusercontent.com" + link['href'].replace('/blob/', '/')
        text_links.append(url)

In [33]:
# Putting all the texts into a dictionary:

text_dict = {}

for i, text_url in enumerate(text_links):
    text_get = requests.get(text_url)
    text = text_get.text
    
    key = 'text{}'.format(i+1)
    text_dict[key] = text

# The key for Senator Biden is 'text7'.

In [34]:
# Taking a random subsample of files because otherwise it takes too much time to process.

import random

text7 = text_dict.pop('text7')

keys = list(text_dict.keys())

random_keys = random.sample(keys, 20)

#random_keys.append('text7')

random_text_dict = {key: text_dict[key] for key in random_keys}

random_text_dict['text7'] = text7

random_text_dict.keys()

dict_keys(['text3', 'text2', 'text75', 'text14', 'text54', 'text23', 'text17', 'text31', 'text27', 'text81', 'text49', 'text33', 'text89', 'text5', 'text38', 'text28', 'text70', 'text6', 'text18', 'text20', 'text7'])

**Warning**: Since the subsample is selected randomly, the code produces a different subsample at each run. When I first ran the code, I found the speeches of Senator Lieberman to be the most similar and those of Senator Helms to be the least similar. Hence, the discussion at the end is based on them. The outputs shown below come from a later run, in which the most similar speech is Senator DeWine's.
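
One possible way to make the subsample repeatable is to draw it from a seeded random generator. The sketch below is not what the notebook does: the seed value 42 and the list of 100 keys are arbitrary assumptions used only to show that the same seed yields the same sample.

```python
import random

# Hypothetical list of keys; the real number of text files may differ:
keys = ["text{}".format(i) for i in range(1, 101)]

# Two generators created with the same (arbitrary) seed:
first_sample = random.Random(42).sample(keys, 20)
second_sample = random.Random(42).sample(keys, 20)

print(first_sample == second_sample)  # True
```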

In [35]:
# Storing processed texts in a dictionary:

processed_text_dict = {}

for key, text in random_text_dict.items():
    processed_text = text_processor(text)
    
    processed_text_dict[key] = processed_text

In [56]:
# Re-checking if text7 indeed belongs to Senator Biden:
processed_text_dict['text7'][:100]

'doc docno 105 biden 19981021 docno text biden presid plea senat today pas hatch biden lautenberg sub'

In [37]:
# Vectorizing the subsample of texts:
corpus = tfidf_vectorizer_12.fit_transform(processed_text_dict.values())

In [38]:
corpus_matrix = pd.DataFrame(corpus.toarray().transpose(), 
                             index=tfidf_vectorizer_12.get_feature_names_out())

# Renaming the columns with the textID format i.e., their keys in the dictionary:
corpus_matrix = corpus_matrix.set_axis(list(processed_text_dict.keys()), 
                                       axis = "columns", 
                                       copy = True)

In [39]:
corpus_matrix.describe()

Unnamed: 0,text3,text2,text75,text14,text54,text23,text17,text31,text27,text81,...,text33,text89,text5,text38,text28,text70,text6,text18,text20,text7
count,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,...,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0,779490.0
mean,7.1e-05,9.7e-05,9.2e-05,6.7e-05,7.8e-05,8.4e-05,5.3e-05,8.9e-05,7.2e-05,8.3e-05,...,8.2e-05,8.1e-05,7e-05,0.000103,9.4e-05,7.6e-05,7.6e-05,5.3e-05,7.9e-05,9e-05
std,0.00113,0.001129,0.001129,0.001131,0.00113,0.001129,0.001131,0.001129,0.00113,0.00113,...,0.00113,0.00113,0.00113,0.001128,0.001129,0.00113,0.00113,0.001131,0.00113,0.001129
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.375865,0.251646,0.358107,0.375152,0.330256,0.29131,0.398921,0.277187,0.372142,0.311811,...,0.293204,0.312407,0.356939,0.226659,0.263023,0.352307,0.364171,0.416258,0.355191,0.326052


In [40]:
# Splitting each text into their own vectors:
for i, col_name in enumerate(list(corpus_matrix.columns)):
    globals()["TFIDF_" + str(col_name)] =corpus_matrix[corpus_matrix.columns[i]].values.reshape(1, -1) 

In [41]:
TFIDF_text7

array([[0.01359197, 0.00041845, 0.00044001, ..., 0.00118421, 0.00039474,
        0.00039474]])

In [42]:
# Creating a dataframe to contain pairwise similarities of speeches to Senator Biden's:
cosine_similarities_dict = {'Cosine Similarity': 'NaN', 'Text': (list(corpus_matrix.columns))}
cosine_similarities = pd.DataFrame(data=cosine_similarities_dict)

In [43]:
cosine_similarities

Unnamed: 0,Cosine Similarity,Text
0,,text3
1,,text2
2,,text75
3,,text14
4,,text54
5,,text23
6,,text17
7,,text31
8,,text27
9,,text81


In [44]:
# Calculating cosine similarities and writing them to the dataframe (using .loc avoids chained assignment):
for i, col_name in enumerate(list(corpus_matrix.columns)):
    cosine_similarities.loc[i, 'Cosine Similarity'] = cosine_similarity(TFIDF_text7, globals()["TFIDF_" + str(col_name)])[0][0]

In [45]:
# Dropping Biden's speech:
cosine_similarities = cosine_similarities.drop(index=cosine_similarities[cosine_similarities['Text'] == 'text7'].index)

In [46]:
cosine_similarities

Unnamed: 0,Cosine Similarity,Text
0,0.487042,text3
1,0.436158,text2
2,0.538425,text75
3,0.517219,text14
4,0.481797,text54
5,0.539758,text23
6,0.438911,text17
7,0.564684,text31
8,0.53823,text27
9,0.515049,text81


In [47]:
# Checking who has the closest speech:

cosine_similarities['Cosine Similarity'] = cosine_similarities['Cosine Similarity'].astype(float)
max_index = cosine_similarities['Cosine Similarity'].idxmax()

max_text = cosine_similarities.loc[max_index]['Text']

In [48]:
max_text

'text28'

In [49]:
max_text[-2:]

'28'

In [50]:
processed_text_dict[max_text][:1000]

'doc docno 105 dewin 19981021 docno text dewin presid 1023 ricki ray hemophilia relief fund act would author establish fund compassion payment would made peopl hemophilia contract hiv aid taint blood product earli 1980 peopl victim failur feder govern safeguard blood product failur includ inadequ measur screen high risk donor long delay recal blood product known pose elev risk infect time period specifi legisl approxim 200 victim infect victim victim famili would receiv singl 100 000 payment total author 750 000 would separ appropri relief fund sunset year 1023 pas hous without object suspens calendar may similar legisl senat 358 sponsor bipartisan cosponsor text doc doc docno 105 dewin 19981021 docno text dewin presid ask unanim consent order quorum call rescind text doc doc docno 105 dewin 19981021 docno text dewin presid first let thank major leader passag ricki ray bill occur moment ago bill introduc along senat bob graham senat introduc hous repres repres tauzin certainli work hel

In [51]:
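# Stripping the 'text' prefix instead of slicing two characters, so that keys such as 'text7' also work:
text_links[int(max_text.replace('text', ''))-1]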

'https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-dewine-oh.txt'

#### The Speech Most Similar to Senator Biden's

In my first run, the most similar speech to Biden's in the random subset of texts was that of Senator [Joe Lieberman](https://en.wikipedia.org/wiki/Joe_Lieberman).

It is plausible because:
- Although he went independent in 2006, he had been a **Democrat** since the beginning of his political career.
- Before the 2016 election, he endorsed Hillary Clinton for president, and in 2020 he endorsed **Joe Biden** for president.
- Lieberman says about Joe Biden in [this](https://www.theguardian.com/us-news/2021/nov/24/joe-lieberman-most-republicans-democrats-centrists) article on The Guardian: “Biden is solid. He sees the world realistically and he knows he can’t be Roosevelt or Lyndon Johnson now in part because he doesn’t have the great Democratic majorities that they had.”

“And the country, thank God, is not where it was in the Depression, as bad as the pandemic was. The old Joe, which is the real Joe, will be dominant in the next three years of his presidency.”

##### Let us finally look at the least similar speech in the subsample as a reference

In [52]:
min_index = cosine_similarities['Cosine Similarity'].idxmin()
cosine_similarities.loc[min_index]

Cosine Similarity    0.329111
Text                   text49
Name: 10, dtype: object

In [53]:
min_text = cosine_similarities.loc[min_index]['Text']

In [54]:
processed_text_dict[min_text][:1000]

'doc docno 105 helm 19981021 docno text helm presid follow senat approv resolut ratif chemic weapon convent cwc subsequ ratif treati presid becam necessari unit state enact legisl implement variou domest oblig foreign relat judiciari committe senat immedi fulfil oblig prepar implement legisl treati ratifi may 1997 full senat pas 610 chemic weapon convent implement act 1997 soon thereaft novemb 1997 hous repres pas implement legisl togeth sanction russian firm assist iran ballist missil program regret taken long enact implement legisl law reason expect numer compani challeng constitution treati overturn court unfortun final resolut legal issu surround cwc well full complianc treati delay entir session congress presid clinton opposit unrel missil sanction provis bill inde presid sought delay derail cwc implement legisl throughout entir spring presid alon respons put unit state noncompli delay ultim veto bill june 1998 import frustrat slow pace implement cwc understand congress discharg o

In [55]:
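# Stripping the 'text' prefix instead of slicing two characters, so that keys such as 'text7' also work:
text_links[int(min_text.replace('text', ''))-1]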

'https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-helms-nc.txt'

It is the speech of Senator Jesse Helms of North Carolina.

A quick look at the [Wikipedia page](https://en.wikipedia.org/wiki/Jesse_Helms) on him shows that:
- He was a leader in the conservative movement.
- He was a Republican from 1970 until 2008, when he passed away (previously a Democrat).
- The Wikipedia article reads: "[He] opposed civil rights, disability rights, environmentalism, feminism, gay rights, affirmative action, access to abortions, the Religious Freedom Restoration Act (RFRA), and the National Endowment for the Arts. Helms brought an "aggressiveness" to his conservatism, as in his rhetoric against homosexuality."

At least superficially, he seems to have opposed most of the things Biden stands for today.