# Gensim Word Embeddings Practice

## Term Co-occurrence in Presidential Inaugural Speeches

Importing libraries for NLP and text preprocessing, together with a corpus of presidential inaugural speeches.

In [1]:
from nltk.corpus import inaugural, stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from collections import Counter
import re
import gensim
import spacy

Initializing the lemmatizer, creating variables to contain the presidential speeches and the English stop words.

In [2]:
english_stopwords = stopwords.words("english") + ["us"]
list_of_speeches = inaugural.fileids()
lemmatizer = WordNetLemmatizer()

The first function greets the user and delivers the list of inaugural speeches contained in the corpus, extracting the president's name and the year using Regex.

In [3]:
def list_presidents_speeches(list_of_speeches):
    print("Hello, I have a bunch of inaugural speeches to analyze! Here they are:\n")
    for speech_title in list_of_speeches:
        # Raw string so that \d and \. reach the regex engine unchanged.
        for year, name in re.findall(r"(\d{4})-(.+)\.txt", speech_title):
            print(f"President {name} delivered his inaugural speech in {year}.")

list_presidents_speeches(list_of_speeches)

Hello, I have a bunch of inaugural speeches to analyze! Here they are:

President Washington delivered his inaugural speech in 1789.
President Washington delivered his inaugural speech in 1793.
President Adams delivered his inaugural speech in 1797.
President Jefferson delivered his inaugural speech in 1801.
President Jefferson delivered his inaugural speech in 1805.
President Madison delivered his inaugural speech in 1809.
President Madison delivered his inaugural speech in 1813.
President Monroe delivered his inaugural speech in 1817.
President Monroe delivered his inaugural speech in 1821.
President Adams delivered his inaugural speech in 1825.
President Jackson delivered his inaugural speech in 1829.
President Jackson delivered his inaugural speech in 1833.
President VanBuren delivered his inaugural speech in 1837.
President Harrison delivered his inaugural speech in 1841.
President Polk delivered his inaugural speech in 1845.
President Taylor delivered his inaugural speech in 1849.
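
As a minimal sketch of the extraction step, the snippet below parses a few file IDs without loading the corpus. The `sample_fileids` list and the `parse_fileid` helper are illustrative assumptions, not part of the notebook; the IDs follow the `<year>-<LastName>.txt` shape implied by the output above.

```python
import re

# Sample file IDs in the "<year>-<LastName>.txt" shape (hard-coded for illustration).
sample_fileids = ["1789-Washington.txt", "1837-VanBuren.txt", "1849-Taylor.txt"]

# Raw string so that \d and \. are passed to the regex engine unchanged.
FILEID_PATTERN = re.compile(r"(\d{4})-(.+)\.txt")


def parse_fileid(fileid):
    """Return (year, name) for a file ID, or None if it does not match."""
    match = FILEID_PATTERN.fullmatch(fileid)
    if match is None:
        return None
    year, name = match.groups()
    return int(year), name


parsed = [parse_fileid(fileid) for fileid in sample_fileids]
for year, name in parsed:
    print(f"President {name} delivered his inaugural speech in {year}.")
```

Using `fullmatch` instead of `findall` makes a non-matching ID explicit (`None`) rather than silently producing no output.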

The second function collects all the speeches delivered by the same president (**and by presidents who share a last name, since the file IDs contain only last names; there is no easy workaround without changing the dataset**) into a single list of tokenized sentences.

In [4]:
def collect_presidents_speeches(president):
    collected_presidents_speeches = []
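    # Raw strings keep the regex escapes intact; re.escape guards against metacharacters in the input.
    regex_title_formula = r"\d{4}-" + re.escape(president) + r"\.txt"
    speeches_by_president = [title for title in list_of_speeches if re.fullmatch(regex_title_formula, title)]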
    for title in speeches_by_president:
        for sentence in inaugural.sents(title):
            collected_presidents_speeches.append(sentence)
    return collected_presidents_speeches
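
The last-name collision can be seen in a small sketch that filters file IDs only, without reading any speech text. The `sample_fileids` list and the `titles_for_last_name` helper are assumptions made for illustration; the two Adams entries correspond to John Adams (1797) and John Quincy Adams (1825).

```python
import re

# A few file IDs, hard-coded for illustration.
sample_fileids = [
    "1797-Adams.txt",
    "1801-Jefferson.txt",
    "1825-Adams.txt",
    "1829-Jackson.txt",
]


def titles_for_last_name(last_name, fileids):
    """Return the file IDs whose name part is exactly last_name."""
    # re.escape treats the user-supplied name as literal text, not as a pattern.
    pattern = re.compile(r"\d{4}-" + re.escape(last_name) + r"\.txt")
    return [fileid for fileid in fileids if pattern.fullmatch(fileid)]


# Both presidents named Adams are returned together.
adams_titles = titles_for_last_name("Adams", sample_fileids)
print(adams_titles)
```

Escaping the name means an input such as `.*` matches nothing instead of matching every file.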

The following three functions clean up the text (make it lowercase, remove stopwords) and lemmatize it using WordNetLemmatizer and the POS tagging in NLTK.

In [5]:
def lemmatize_with_pos(tokenized_string):
    tag_dict = {"J": "a", "R": "r", "V": "v", "N": "n"}
    lemmatized_string = []
    # print(pos_tag(tokenized_string))
    for token, tag in pos_tag(tokenized_string):
        # Look up the first letter of the POS tag directly; the tag itself should not be lemmatized.
        if tag[0].upper() in tag_dict:
            lemmatized = lemmatizer.lemmatize(token, tag_dict[tag[0].upper()])
        else:
            lemmatized = lemmatizer.lemmatize(token)
        lemmatized_string.append(lemmatized)
    return lemmatized_string
    
def clean_up_text(text):
    words_regex = r"\w+"  # raw string: a token must start with a word character
    cleaned_text = []
    for sentence in text:
        cleaned_sentence = []
        for token in sentence:
            word = str(token).lower()
            if re.match(words_regex, word) and word not in english_stopwords:
                cleaned_sentence.append(word)
        cleaned_text.append(cleaned_sentence)
    return cleaned_text

def preprocess_speeches(speeches):
    cleaned_speeches = clean_up_text(speeches)
    lemmatized_speeches = [lemmatize_with_pos(sentence) for sentence in cleaned_speeches]
    return lemmatized_speeches
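
The cleaning and tag-mapping logic can be tried without NLTK data in a rough sketch like the one below. `TOY_STOPWORDS`, `wordnet_pos`, and `clean_sentence` are hypothetical names introduced here; the real notebook uses NLTK's English stop-word list and `pos_tag`, which this sketch does not call.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list plus "us".
TOY_STOPWORDS = {"the", "of", "and", "to", "us", "we"}

# First letter of a Penn Treebank tag -> WordNet POS code (same mapping as tag_dict).
TAG_TO_WORDNET = {"J": "a", "R": "r", "V": "v", "N": "n"}


def wordnet_pos(penn_tag, default="n"):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code.

    Unknown tags fall back to the noun code, which is also the default
    part of speech used by WordNetLemmatizer.lemmatize.
    """
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), default)


def clean_sentence(tokens, stopwords=TOY_STOPWORDS):
    """Lowercase tokens and drop punctuation-only tokens and stop words."""
    cleaned = []
    for token in tokens:
        word = str(token).lower()
        if re.match(r"\w+", word) and word not in stopwords:
            cleaned.append(word)
    return cleaned


cleaned = clean_sentence(["We", "the", "People", "of", "the", "United", "States", ","])
print(cleaned)
print(wordnet_pos("VBD"), wordnet_pos("JJ"), wordnet_pos("DT"))
```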

This function prints a list of the twenty most frequent words found in the corpus of speeches delivered by the same president (**or, rather, by presidents who share the same last name**).

In [6]:
def print_most_frequent_words(processed_speeches):
    all_words = []
    most_common_terms = []
    for sentence in processed_speeches:
        for word in sentence:
            all_words.append(word)
    for term, count in Counter(all_words).most_common(20):
        print(term)
        # Append the term itself; list(term) would split the string into characters.
        most_common_terms.append(term)
    return most_common_terms
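
For reference, the same counting step can be written more compactly by flattening the sentences with `itertools.chain`. This is a sketch on invented toy data; `toy_sentences` and `most_frequent_words` are illustrative names, not part of the notebook.

```python
from collections import Counter
from itertools import chain

# Invented, already-preprocessed sentences (lists of lemmas).
toy_sentences = [
    ["union", "state", "law"],
    ["state", "constitution"],
    ["state", "union"],
]


def most_frequent_words(sentences, n=20):
    """Return the n most frequent terms across all sentences, most frequent first."""
    counts = Counter(chain.from_iterable(sentences))
    return [term for term, _count in counts.most_common(n)]


top_terms = most_frequent_words(toy_sentences, n=2)
print(top_terms)
```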

Finally, the main function asks the user for the last name of the president whose speeches they wish to analyze, prints that president's most common words, and then asks for a word of interest. It trains a skip-gram Word2Vec model on the preprocessed speeches and prints the twenty words whose vectors are closest to that word, that is, words used in similar contexts. If the word is not in the model's vocabulary, the resulting `KeyError` is caught by the `except` clause.

In [7]:
def analyze_presidents_words():
    president = input("Whose inaugural speeches would you like to analyze?\n> ")
    speeches = preprocess_speeches(collect_presidents_speeches(president))
    print(f"\nThese are the words that President {president} used the most:\n")
    print_most_frequent_words(speeches)
    word = input("\nWhat is the word for which you would like to see the closest embeddings?\n> ").lower()
    # gensim >= 4.0 names this parameter vector_size; gensim 3.x called it size.
    president_embeddings = gensim.models.Word2Vec(speeches, vector_size=96, window=5, min_count=1, workers=2, sg=1)
    try:
        most_similar_to_word = president_embeddings.wv.most_similar(word, topn=20)
        print(f"\nHere are the words that President {president} tended to use around the word {word}:\n")
        for term, score in most_similar_to_word:
            print(term)
        return most_similar_to_word
    except KeyError:
        print(f"\nIt looks like President {president} did not use the word {word}.")
        return None

similar_words = analyze_presidents_words()

Whose inaugural speeches would you like to analyze?
> Lincoln

These are the words that President Lincoln used the most:

state
constitution
union
law
shall
government
people
one
right
upon
may
case
war
would
make
must
slave
provision
hold
party

What is the word for which you would like to see the closest embeddings?
> people

Here are the words that President Lincoln tended to use around the word people:

say
give
identity
little
violate
hostile
interest
rather
direct
fault
find
hence
minority
possibility
could
none
everywhere
one
others
confidence
