## Similarity Analysis of Congress Speeches
### By calculating the Cosine Similarities and Manhattan Distances between the TF-IDF (term frequency-inverse document frequency) and Count Vectors of texts with n-grams
Using *N* speeches of my choosing,<br />
preparing them for analysis (removing punctuation, stop-words),<br />
using Bag of Words and n-grams in addition to tf-idf to find the cosine similarity between them.<br />
Discussing my findings.

In [None]:
""" 
%pip install -U scikit-learn
%pip install nltk 
"""

In [None]:
""" 
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
"""

In [None]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.pipeline import Pipeline

In [None]:
import requests

text1_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-moseleybraun-il.txt"
text2_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-reid-nv.txt"

text1_get, text2_get = requests.get(text1_url), requests.get(text2_url)
text1, text2 = text1_get.text, text2_get.text

print("text1 head:\n",text1[0:200],"\n\ntext2 head:\n",text2[0:200])

In [None]:
""" 
# Loading the chosen speech texts:
text1 = open(r"Congress Speeches\105-moseleybraun-il.txt").read()
text2 = open(r"Congress Speeches\105-reid-nv.txt").read()

# Printing the first 200 characters:
print("text1 head:\n",text1[0:200],"\n\ntext2 head:\n",text2[0:200])
"""

##### Removing punctuation and stop-words in the text

In [None]:
def text_preprocessor(text):
    # Replacing non-word characters with a space (' '):
    text = re.sub(r'\W', ' ', text)
    # Lower-casing the text and tokenizing it into a list of words:
    tokens = word_tokenize(text.lower())
    # Loading the English stopwords once, as a set, for fast membership tests:
    english_stopwords = set(stopwords.words('english'))
    # Removing stopwords and keeping only words with at least 3 characters:
    tokens = [token for token in tokens if token not in english_stopwords and len(token) >= 3]
    # Joining the tokens back together, separated by a space (' '):
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

In [None]:
# Testing the pre-processing function with the first 1000 characters of the first text:
text1_head_tokenized = text_preprocessor(text1[:1000])
text1_head_tokenized

##### Stemming the words in the tokenized text

In [None]:
def stem_words(text):
    # Creating a stemmer instance which uses Porter Stemming Algorithm:
    stemmer = PorterStemmer()
    # Tokenizing the text into words, stemming them:
    stemmed_words = [stemmer.stem(word) for word in word_tokenize(text)]
    # Joining the word stems back and returning:
    return ' '.join(stemmed_words)

# Some alternatives to Porter in NLTK are Snowball (in English) and Lancaster.
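
The stemmers mentioned above can be compared side by side. This is only a small sketch: the word list is made up for illustration, and none of the three stemmers needs extra NLTK downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# The three stemmers available in NLTK that are mentioned in the text:
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Illustrative words (not taken from the speeches):
words = ["running", "speeches", "senators"]

# Stemming every word with every stemmer:
stems = {name: [stemmer.stem(word) for word in words]
         for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```

Printing the three lists next to each other shows where the algorithms disagree, which helps when deciding which one suits a given corpus.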

In [None]:
# Testing the stemmer function:
text1_stemmed = stem_words(text1_head_tokenized)
text1_stemmed

##### Lemmatizing the words in the tokenized and stemmed text
[Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In [None]:
def lemmatize_words(text):
    # Creating a lemmatizer instance:
    lemmatizer = WordNetLemmatizer()
    # Applying the lemmatizer word by word:
    lemmatized_words = [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
    # Joining the words back and returning:
    return ' '.join(lemmatized_words)

In [None]:
# Testing the lemmatizer function:
text1_lemmatized = lemmatize_words(text1_stemmed)
text1_lemmatized

#### Putting it all together
Now that all the pre-processing functions are tested and working, we can apply them to the full bodies of both texts.


In [None]:
def text_processor(text):
    step1 = text_preprocessor(text)
    step2 = stem_words(step1)
    step3 = lemmatize_words(step2)
    output = step3
    return output

In [None]:
text1_processed, text2_processed = text_processor(text1), text_processor(text2)

In [None]:
text1_processed[:500]

#### We represent the processed bodies of text as vectors to analyze them. We use both TF-IDF and Bag-of-Words (Count) approaches.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Calling an instance of TF-IDF Vectorizer with default arguments:
tfidf_vectorizer = TfidfVectorizer()

# Calling an instance of Count Vectorizer with default arguments:
count_vectorizer = CountVectorizer()

In [None]:
# Vectorizing the bodies of texts and putting them together in a matrix:
corpus_tfidf = tfidf_vectorizer.fit_transform([text1_processed, text2_processed])
corpus_count = count_vectorizer.fit_transform([text1_processed, text2_processed])


In [None]:
import pandas as pd

# Transforming the corpus matrix to a dataframe with feature names (words) as index:
corpus_tfidf_matrix = pd.DataFrame(corpus_tfidf.toarray().transpose(), 
                             index=tfidf_vectorizer.get_feature_names_out())

corpus_count_matrix = pd.DataFrame(corpus_count.toarray().transpose(), 
                             index=count_vectorizer.get_feature_names_out())

# Renaming the columns with the names of the senators who gave the speeches:
corpus_tfidf_matrix = corpus_tfidf_matrix.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

corpus_count_matrix = corpus_count_matrix.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

In [None]:
corpus_count_matrix

In [None]:
corpus_tfidf_matrix

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculating the cosine similarity between the two vectorized texts:
tfidf_result = cosine_similarity(corpus_tfidf_matrix[corpus_tfidf_matrix.columns[0]].values.reshape(1, -1), 
                  corpus_tfidf_matrix[corpus_tfidf_matrix.columns[1]].values.reshape(1, -1))

bow_result = cosine_similarity(corpus_count_matrix[corpus_count_matrix.columns[0]].values.reshape(1, -1), 
                  corpus_count_matrix[corpus_count_matrix.columns[1]].values.reshape(1, -1))

In [None]:
print("Similarity rate with TF-IDF:", tfidf_result, 
      "\nSimilarity rate with Bag-of-Words:", bow_result)

#### Now let us see how the results change if we look at 2-grams and 3-grams cumulatively in addition to single words.

We need to modify the vectorizers in order to achieve this. Previously, we used the vectorizers with default parameters. This means that they only looked at single words instead of groups of two or three consecutive words.

In [None]:
# Calling an instance of TF-IDF Vectorizer with 1 to 2 grams:
tfidf_vectorizer_12 = TfidfVectorizer(ngram_range = (1,2))

# Calling an instance of Count Vectorizer with 1 to 2 grams:
count_vectorizer_12 = CountVectorizer(ngram_range = (1,2))

# I name the vectorizers with the "_12" suffix, indicating the ngram_range parameter values.

Above, we set up the vectorizers to look at 1-grams and 2-grams together. We could look at 2-grams only by setting the ngram_range parameter to (2,2).
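
To make the difference between the two settings concrete, here is a minimal sketch of what an n-gram range selects. The helper `word_ngrams` and the sample string are hypothetical and only for illustration. It splits on whitespace, which is simpler than the tokenization the scikit-learn vectorizers apply.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return the word n-grams of `text` for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Sliding a window of n consecutive words over the token list:
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Illustrative processed text (already stemmed):
sample = "senat vote budget bill"

unigrams_and_bigrams = word_ngrams(sample, (1, 2))
bigrams_only = word_ngrams(sample, (2, 2))

print(unigrams_and_bigrams)
print(bigrams_only)
```

With (1,2) the single words stay in the feature set alongside the word pairs, whereas with (2,2) only the word pairs remain.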

In [None]:
# Vectorizing the bodies of texts and putting them together in a matrix:
corpus_tfidf_12 = tfidf_vectorizer_12.fit_transform([text1_processed, text2_processed])
corpus_count_12 = count_vectorizer_12.fit_transform([text1_processed, text2_processed])

In [None]:
# Transforming the corpus matrix to a dataframe with feature names (words) as index:
corpus_tfidf_matrix_12 = pd.DataFrame(corpus_tfidf_12.toarray().transpose(), 
                             index=tfidf_vectorizer_12.get_feature_names_out())

corpus_count_matrix_12 = pd.DataFrame(corpus_count_12.toarray().transpose(), 
                             index=count_vectorizer_12.get_feature_names_out())

# Renaming the columns with the names of the senators who gave the speeches:
corpus_tfidf_matrix_12 = corpus_tfidf_matrix_12.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

corpus_count_matrix_12 = corpus_count_matrix_12.set_axis(["Moseley-Braun","Reid"], 
                                       axis = "columns", 
                                       copy = True)

In [None]:
corpus_count_matrix_12

In [None]:
corpus_tfidf_matrix_12

Let us define functions for similarity measures to make life easier in the future.

In [None]:
def cosine_result(matrix):
    result = cosine_similarity(matrix[matrix.columns[0]].values.reshape(1, -1), 
                               matrix[matrix.columns[1]].values.reshape(1, -1))
    return result

tfidf_result_12 = cosine_result(corpus_tfidf_matrix_12)
bow_result_12 = cosine_result(corpus_count_matrix_12)

print("Similarity rate with TF-IDF:", tfidf_result,
      "\nSimilarity rate with Bag-of-Words:", bow_result, 
      "\nSimilarity rate with TF-IDF, 1 to 2-grams:", tfidf_result_12, 
      "\nSimilarity rate with Bag-of-Words, 1 to 2-grams:", bow_result_12)

In [None]:
from sklearn.metrics.pairwise import manhattan_distances

def manhattan_result(matrix):
    result = manhattan_distances(matrix[matrix.columns[0]].values.reshape(1, -1), 
                               matrix[matrix.columns[1]].values.reshape(1, -1))
    return result

Let us define another function which takes the vectorizer, two processed texts, the ngram_range parameter and the similarity measure as input and returns the chosen measure (cosine similarity or Manhattan distance) between the two texts.

In [None]:
def similarity_pipeline(vectorizer, txt1, txt2, ngram_range = (1,1), similarity = "Cosine"):

    # Allowing the option to use Tfidf and Count vectorizers:
    vectorizer_classes = {"Count": CountVectorizer, "Tfidf": TfidfVectorizer}
    # Allowing the option to use cosine similarity and Manhattan distance:
    measures = {"Cosine": cosine_result, "Manhattan": manhattan_result}

    # Raising an error on invalid parameters instead of reaching the return with no output defined:
    if vectorizer not in vectorizer_classes:
        raise ValueError("Please choose a valid vectorizer. "
                         "Valid vectorizers are 'Count' and 'Tfidf'.")
    if similarity not in measures:
        raise ValueError("Please choose a valid parameter for similarity. "
                         "Valid similarity measures are 'Cosine' and 'Manhattan'.")

    # Calling the vectorizer with desired ngram_range values, (1,1) applies if not specified:
    chosen_vectorizer = vectorizer_classes[vectorizer](ngram_range = ngram_range)
    corpus = chosen_vectorizer.fit_transform([txt1, txt2])

    # Loading vectorized texts into a matrix:
    corpus_matrix = pd.DataFrame(corpus.toarray().transpose(),
                                 index=chosen_vectorizer.get_feature_names_out())

    # Renaming the columns:
    corpus_matrix = corpus_matrix.set_axis(["Moseley-Braun","Reid"],
                                           axis = "columns",
                                           copy = True)

    # Defining the output:
    return measures[similarity](corpus_matrix)

In [None]:
# Testing the function:
similarity_pipeline("Tfidf", text1_processed, text2_processed, ngram_range = (2,2), similarity= "Manhattan")

Now we can compare the results produced by different vectorizers and n-gram specifications.

In [None]:
ngram_ranges = [(1,1), (1,2), (1,3), (1,4), (2,2), (3,3), (4,4)]
vectorizers = ["Tfidf", "Count"]
similarity_measures = ["Cosine", "Manhattan"]
# Display names for the vectorizers in the results table:
vectorizer_labels = {"Tfidf": "Tfidf", "Count": "Bag-of-Words"}
results = []
for vec in vectorizers:
    for ngram_range in ngram_ranges:
        for similarity_measure in similarity_measures:
            score = similarity_pipeline(vec, text1_processed, text2_processed,
                                        ngram_range=ngram_range, similarity=similarity_measure)
            results.append({"Vectorizer": vectorizer_labels[vec],
                            "Similarity Measure": similarity_measure,
                            "Ngram Range": ngram_range,
                            "Similarity Score": score[0][0]})

df = pd.DataFrame(results)
df

#### Discussion

To qualitatively discuss the similarity between the two senators' speeches some background on who they are is needed.

[**Moseley-Braun**](https://en.wikipedia.org/wiki/Carol_Moseley_Braun) was *the first African-American woman* elected to the U.S. Senate, the first African-American U.S. Senator from the **Democratic Party**, *the first woman to defeat an incumbent U.S. Senator in an election*, and the first female U.S. Senator from Illinois.

[**Harry Mason Reid Jr.**](https://en.wikipedia.org/wiki/Harry_Reid) was an American lawyer and politician who served as a United States senator from Nevada from 1987 to 2017. He *led the Senate **Democratic** Caucus from 2005 to 2017* and was *the Senate Majority Leader from 2007 to 2015*.

Although both senators were from the same party, the dissimilarity between them is most likely rooted in their backgrounds and the identities they stand for.

Moseley-Braun was the first to reach many milestones, while Reid's election and re-election arguably took place under smoother conditions. Moseley-Braun, an African-American woman, was the first female U.S. Senator from her state, while Reid belongs to an already well-represented identity group: white and male.

The similarity being above 50% might be due to the fact that they are from the same party, but the remaining difference is, at least superficially, because they are very different characters from very different states.

For a better grounded analysis, we can look at the similarity measures of multiple pairs of senator speeches from the same and different parties and employ a comparative perspective. This approach can reveal patterns more clearly as to what makes two speeches similar and what having similar speeches tells us about the characteristics of the senators in comparison. Furthermore, different methods of vectorizing speeches and different measures of similarity might give qualitatively different results.

For instance, we found here a similarity of 65%. A good reference point would be the average level of similarity between senators of the two different parties.
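
Such a reference point could be computed along the following lines. This is only a sketch: the helper `mean_similarity_by_pairing`, the party labels and the numbers in the matrix are placeholders made up to show the mechanics, not results from the speeches.

```python
from itertools import combinations
from statistics import mean

def mean_similarity_by_pairing(similarity, parties):
    """Average pairwise similarity for same-party and cross-party pairs."""
    same, cross = [], []
    # Visiting every unordered pair of speakers exactly once:
    for i, j in combinations(range(len(parties)), 2):
        if parties[i] == parties[j]:
            same.append(similarity[i][j])
        else:
            cross.append(similarity[i][j])
    return mean(same), mean(cross)

# Placeholder party labels and a placeholder symmetric similarity matrix:
parties = ["D", "D", "R", "R"]
similarity = [
    [1.0, 0.8, 0.4, 0.2],
    [0.8, 1.0, 0.5, 0.3],
    [0.4, 0.5, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]

same_party_mean, cross_party_mean = mean_similarity_by_pairing(similarity, parties)
print("Same party:", same_party_mean, "\nCross party:", cross_party_mean)
```

A single pair's score can then be read against these two averages instead of being judged in isolation.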

##### What about when we use a different vectorizer and look at different n-gram ranges?

- **Observation 1**: For every n-gram range, Bag-of-Words gives a higher cosine similarity than TF-IDF. Bag-of-Words simply records how many times each word (or n-gram) is used in each text. TF-IDF (Term Frequency-Inverse Document Frequency) instead measures how often word i (or n-gram i) appears in text j, penalized by the number of texts that also contain word i. In other words, the weight of a word is proportional to its frequency in the document (term frequency) and inversely related to the number of documents in the corpus that contain it (inverse document frequency). Words that appear in many documents receive a lower weight, while words that are rare in the corpus receive a higher weight. In a two-text corpus, the words the speeches share are therefore down-weighted relative to the words unique to one speech, which lowers the cosine similarity.

- **Observation 2**: We look at n-grams both alone, e.g., (2,2), and cumulatively, e.g., 1, 2 and 3-grams together for the (1,3) n-gram range. In the first case, looking at 2-grams or 3-grams alone, the cosine similarities strictly decrease for both vectorizers as the n-grams get longer. In the cumulative case the cosine similarities also decrease, but more slowly, because shorter n-grams are more likely to be shared between the two texts: it is easier to get a high cosine similarity with 1-grams than with 2-grams, and easier with 2-grams than with 3-grams. For the same reason, a range of 1 to 3 grams gives a higher cosine similarity than 3-grams alone.
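
The first observation can be checked by hand on a toy example. The sketch below assumes the smoothed IDF formula documented for scikit-learn's `TfidfVectorizer` defaults, `ln((1 + n) / (1 + df)) + 1`. The count vectors are made up for illustration, with one shared word and one word unique to each text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def smoothed_idf(n_docs, doc_freq):
    # Assumed formula: smoothed IDF as documented for TfidfVectorizer defaults.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Toy counts over the vocabulary [shared word, only in A, only in B]:
counts_a = [3, 1, 0]
counts_b = [2, 0, 2]

# Number of texts containing each word:
doc_freq = [sum(1 for c in pair if c > 0) for pair in zip(counts_a, counts_b)]
idf = [smoothed_idf(2, df) for df in doc_freq]

# Weighting the counts by IDF (length normalization does not change the cosine):
tfidf_a = [c * w for c, w in zip(counts_a, idf)]
tfidf_b = [c * w for c, w in zip(counts_b, idf)]

count_cosine = cosine(counts_a, counts_b)
tfidf_cosine = cosine(tfidf_a, tfidf_b)
print("Bag-of-Words:", count_cosine, "\nTF-IDF:", tfidf_cosine)
```

Under this formula the shared word keeps a weight of 1 while the unique words are scaled up, so the dot product stays the same and the vector lengths grow, which pulls the TF-IDF cosine below the Bag-of-Words cosine.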

##### What about a different similarity measure?
We also look at the *Manhattan Distance* between the text vectors. The Manhattan Distance, also known as L1 distance, is the sum of the absolute differences between the corresponding elements of two vectors. See [Taxicab Geometry](https://en.wikipedia.org/wiki/Taxicab_geometry).

Since it is a distance measure, the lower the value, the more similar the two vectors.

In [None]:
df[df["Similarity Measure"] == "Manhattan"]

Results with Manhattan Distance mirror the results with cosine similarity when comparing within the same vectorizer: for a given vectorizer, the observation about different n-gram ranges holds here as well.

But since the [*Manhattan Distance*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html#sklearn.metrics.pairwise.manhattan_distances) measure in the **scikit-learn** library is not a normalized measure like *Cosine Similarity*, we cannot compare results across different vectorizers with the same n-gram range.

For instance, if we look at the Manhattan distance using Bag-of-Words with 1-grams and the corresponding distance using TF-IDF, the two scores differ by the order of thousands, but we cannot infer from this gap that the texts are that much more dissimilar under one vectorizer.
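
The scale dependence can be seen on a toy example. The sketch below uses made-up count vectors and compares the Manhattan distance on the raw counts with the distance after rescaling each vector to unit length, which is what `TfidfVectorizer` does to each document by default (`norm='l2'`), while `CountVectorizer` leaves the counts as they are.

```python
import math

def manhattan(u, v):
    """Sum of absolute element-wise differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2_normalize(v):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up count vectors for two texts:
raw_a = [30, 10, 0]
raw_b = [20, 0, 20]

# Distance on raw counts grows with the length of the texts:
raw_distance = manhattan(raw_a, raw_b)

# Distance on unit-length vectors stays small regardless of text length:
unit_distance = manhattan(l2_normalize(raw_a), l2_normalize(raw_b))

print("Raw counts:", raw_distance, "\nUnit-length vectors:", unit_distance)
```

The same pair of texts yields very different distances depending only on the scale of the vectors, which is why the Manhattan scores should be compared within one vectorizer rather than across vectorizers.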

#### What else can be done in future projects?

- Corpus-specific uninformative words, such as the `doc`, `docno` and `text` markup tokens that survive pre-processing, can be eliminated during the text processing stage.
- More steps can be written as functions to avoid unnecessary code repetition.
- A pipeline can be constructed in order to check among N texts, which one is the most similar to a given text. For example, among all the senator speeches we have, which one is the most similar to a given senator's e.g., Senator Biden's.

## Let us now find which of the speech texts we have is the most similar to Senator Biden's.
We use:
- TF-IDF to vectorize,
- Cosine similarity to compare similarities, and,
- Cumulative 2-grams i.e., 1-grams and 2-grams together.
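
Before running this on the speeches, the approach can be sketched compactly on toy data. The helper `most_similar` and the three short texts are hypothetical and only illustrate the mechanics: vectorize all texts together, then compare the target's row against every other row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(texts, target_key, ngram_range=(1, 2)):
    """Rank all other texts by cosine similarity to the target text."""
    keys = list(texts)
    # Vectorizing all texts together so they share one vocabulary:
    matrix = TfidfVectorizer(ngram_range=ngram_range).fit_transform(
        [texts[key] for key in keys])
    target_row = keys.index(target_key)
    # Comparing the target's row against every row of the matrix:
    scores = cosine_similarity(matrix[target_row], matrix).ravel()
    ranked = [(key, float(score)) for key, score in zip(keys, scores)
              if key != target_key]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Made-up, already processed toy texts:
toy_texts = {
    "target": "privaci right citizen protect",
    "close": "protect privaci right consum",
    "far": "highway fund budget vote",
}

ranking = most_similar(toy_texts, "target")
print(ranking)
```

Working on the rows of the sparse matrix directly avoids building one variable per text, at the cost of keeping track of the key order.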

In [None]:
# Getting the URLs that contain the text files:

from bs4 import BeautifulSoup, SoupStrainer

html = requests.get('https://github.com/ariedamuco/ML-for-NLP/tree/main/Inputs/105-extracted-date')

text_links = []

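# Parsing only the anchor tags and keeping those that actually carry an href attribute:
soup = BeautifulSoup(html.text, 'html.parser', parse_only=SoupStrainer('a'))

for link in soup.find_all('a', href=True):
    if link['href'].endswith('.txt'):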
        url = "https://raw.githubusercontent.com" + link['href'].replace('/blob/', '/')
        text_links.append(url)

In [None]:
# Putting all the texts into a dictionary:

text_dict = {}

for i, text_url in enumerate(text_links):
    text_get = requests.get(text_url)
    text = text_get.text
    
    key = 'text{}'.format(i+1)
    text_dict[key] = text

# The key for Senator Biden is 'text7'.

In [None]:
import random

text7 = text_dict.pop('text7')

keys = list(text_dict.keys())

random_keys = random.sample(keys, 10)

#random_keys.append('text7')

random_text_dict = {key: text_dict[key] for key in random_keys}

random_text_dict['text7'] = text7

random_text_dict.keys()

In [None]:
processed_text_dict = {}

for key, text in random_text_dict.items():
    processed_text = text_processor(text)
    
    processed_text_dict[key] = processed_text

In [None]:
list(processed_text_dict.keys())
processed_text_dict['text7'][:10000000000]

In [None]:
""" 
corpus = tfidf_vectorizer_12.fit_transform([text1_processed, text2_processed])
"""
corpus = tfidf_vectorizer_12.fit_transform(processed_text_dict.values())


In [None]:
corpus_matrix = pd.DataFrame(corpus.toarray().transpose(), 
                             index=tfidf_vectorizer_12.get_feature_names_out())

# Renaming the columns with the keys of the texts:
corpus_matrix = corpus_matrix.set_axis(list(processed_text_dict.keys()), 
                                       axis = "columns", 
                                       copy = True)

In [109]:
corpus_matrix.describe()

| | text65 | text71 | text24 | text64 | text98 | text75 | text3 | text33 | text22 | text89 | text7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 | 594771.0 |
| mean | 0.000114 | 9e-05 | 6.6e-05 | 9.1e-05 | 5.6e-05 | 0.000113 | 8.9e-05 | 0.0001 | 9.2e-05 | 0.000101 | 0.00011 |
| std | 0.001292 | 0.001294 | 0.001295 | 0.001293 | 0.001295 | 0.001292 | 0.001294 | 0.001293 | 0.001293 | 0.001293 | 0.001292 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 0.294466 | 0.33474 | 0.367506 | 0.327415 | 0.390923 | 0.315102 | 0.361866 | 0.283449 | 0.299334 | 0.280597 | 0.298016 |


In [None]:
for i, col_name in enumerate(list(corpus_matrix.columns)):
    globals()["TFIDF_" + str(col_name)] =corpus_matrix[corpus_matrix.columns[i]].values.reshape(1, -1) 

In [None]:
TFIDF_text7

In [110]:
cosine_similarities_dict = {'Cosine Similarity': 'NaN', 'Text': (list(corpus_matrix.columns))}
cosine_similarities = pd.DataFrame(data=cosine_similarities_dict)

In [111]:
cosine_similarities

| | Cosine Similarity | Text |
|---|---|---|
| 0 | | text65 |
| 1 | | text71 |
| 2 | | text24 |
| 3 | | text64 |
| 4 | | text98 |
| 5 | | text75 |
| 6 | | text3 |
| 7 | | text33 |
| 8 | | text22 |
| 9 | | text89 |


In [115]:
for i, col_name in enumerate(list(corpus_matrix.columns)):
    # Using .loc avoids chained assignment, which may silently fail to update the dataframe:
    cosine_similarities.loc[i, 'Cosine Similarity'] = cosine_similarity(TFIDF_text7, globals()["TFIDF_" + str(col_name)])[0][0]

In [116]:
cosine_similarities = cosine_similarities.drop(index=cosine_similarities[cosine_similarities['Text'] == 'text7'].index)

In [119]:
cosine_similarities['Cosine Similarity'] = cosine_similarities['Cosine Similarity'].astype(float)


In [120]:
max_index = cosine_similarities['Cosine Similarity'].idxmax()

cosine_similarities.loc[max_index]

Cosine Similarity    0.646906
Text                   text65
Name: 0, dtype: object

In [121]:
processed_text_dict['text65'][:1000]

'doc docno 105 leahi 19981021 docno text leahi presid american peopl grow concern encroach person privaci seem everywher turn new technolog new commun medium new busi servic creat best intent highest expect also pose threat abil keep live live work think without giant corpor govern look shoulder peek keyhol current nation medium ob monica lewinski scandal focus attent abus power independ counsel kenneth starr prosecutor intim familiar enorm power prosecutor wield power gener circumscrib sen honor profession enough bar canon ethic disciplinari rule feder prosecutor rule regul depart justic starr differ view oblig privaci first casualti began investig presid person life use result illeg wiretap state maryland protect resid privat convers tape record without knowledg consent starr condon deliber flout law grant perpetr immun use illicit record persuad attorney gener expand jurisdict begin februari prosecutor starr forc mother travel countri capit sit feder grand juri right counsel present

In [122]:
text_links[65-1]

'https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-leahy-vt.txt'

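Among the randomly sampled texts, the speech most similar to Senator Biden's is that of Senator [Patrick Leahy](https://en.wikipedia.org/wiki/Patrick_Leahy) of Vermont.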