<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/07_word_representations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites
---

In [None]:
# Install 3rd party packages
!pip -q install python-Levenshtein
!pip -q install "gensim==4.0.0"


# Standard library imports
import glob
from pathlib import Path
from itertools import combinations


# 3rd party imports
import nltk
import pandas as pd
import plotly.express as px
import tensorflow_hub as hub
import tensorflow.compat.v1 as tf
from gensim.models import KeyedVectors
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Get GloVe files and load into gensim
!mkdir -p GloVe
!wget -q https://www.dropbox.com/s/9k39nheab1rhezq/glove.6B.50d.zip?dl=1 -O ./GloVe/glove.6B.50d.zip
!unzip -qq -n -d ./GloVe ./GloVe/glove.6B.50d.zip

glove_model = KeyedVectors.load_word2vec_format("./GloVe/glove.6B.50d.txt", binary=False, no_header=True)


# Load ELMo into tensorflow
tf.disable_eager_execution()
elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)

In [None]:
# Data Prerequisites
!mkdir -p texts

from_raw_texts = False # Stanza takes a while to download, loading from raw texts is slower than just opening a pre-tokenized dataframe
if from_raw_texts:

  # Get text files
  !wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
  !unzip -qq -n -d ./texts/ ./texts/AussieTop100private.zip

  about_dir = Path.cwd() / "texts" / "About"
  pr_dir = Path.cwd() / "texts" / "PR"
  dirs_to_load = [about_dir, pr_dir]


  # Preprocess texts
  !pip -q install stanza
  import stanza
  stanza.download('en')
  nlp = stanza.Pipeline('en', processors='tokenize,pos', tokenize_no_ssplit=True)

  texts = [] 
  nltk.download("stopwords", quiet=True)
  stops = nltk.corpus.stopwords.words('english')+["'s", '&']
  for directory in dirs_to_load:
    for file in glob.glob(f"{directory}/*.txt"):
      with open(file, 'r') as infile: 
        fulltext = " ".join([word.text.lower() for sentence in nlp(infile.read()).sentences for word in sentence.words if word.text.lower() not in stops and word.upos not in ["PUNCT", "SYM", "NUM", 'X']])
        texts.append(fulltext)
  text_df = pd.DataFrame(texts, columns=['text_no_stops'])
  
else:
  # Get pretokenized text dataframe
  !wget -q https://www.dropbox.com/s/o6b8hxeq8zpvxgj/pretokenized_aussie_fbs.csv?dl=1 -O ./texts/pretokenized_aussie_fbs.csv
  text_df = pd.read_csv('./texts/pretokenized_aussie_fbs.csv')

# Module 7 - Word Representations
---


At a fundamental level, computers know two things: on(1) and off(0). Building from this we're able to teach computers other numbers pretty easily by stringing these ons/offs together (e.g., 101101 in binary = 45 in decimal). However, the notion of 'text' or even 'words' is not natively understood by computers. As a result, for computers to understand text, we have to convert that text into numbers. But how to do that?
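As a quick illustration of both points, here is a minimal sketch that converts the binary string from the paragraph above and shows one simple way text already becomes numbers: character code points. The variable names are only illustrative.

```python
# Binary digits are just another way of writing a number
binary_string = "101101"
decimal_value = int(binary_string, 2)  # interpret the string as base 2
print(f"{binary_string} in binary = {decimal_value} in decimal")

# Text is stored as numbers too: each character maps to a code point
sample_word = "text"
code_points = [ord(char) for char in sample_word]
print(f"'{sample_word}' as code points: {code_points}")
```

Code points tell the computer how to store and display characters, but they carry no information about what a word means, so they are not much use for analysis on their own.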

In this module, we'll look at three major models of the way that we present words to the computer for analysis:
* Bag-of-Words model
* GloVe model
* Contextual embedding models (ELMo, BERT)

# 7.1. Bag-of-Words Model
---

In the Bag-of-Words model, each word is represented as a one-hot vector in a vector space with the same dimensionality as the number of unique words in the corpus (i.e., the vocabulary). 

Consider the three words: 'leaders', 'managers', and 'entrepreneurs':

In [None]:
# Displays the one-hot vector representation of three words
words = ["managers", "leaders", "entrepreneurs"]
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(words)
print(f"There are {len(vectorizer.get_feature_names_out())} words in the vocabulary: {vectorizer.get_feature_names_out()}\n")
for text_id, text in enumerate(words):
  print(f"Word {text_id+1}: {words[text_id]:15} --- As a one-hot vector: {doc_term_matrix.toarray()[text_id]}")
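To see what is going on under the hood, here is a minimal hand-rolled sketch of one-hot encoding. It assumes the vocabulary is simply the sorted set of words (scikit-learn also orders its features alphabetically); the `one_hot` helper is hypothetical, not part of any library.

```python
# Build one-hot vectors by hand for a small vocabulary
toy_words = ["managers", "leaders", "entrepreneurs"]
vocabulary = sorted(set(toy_words))  # alphabetical order

def one_hot(target, vocabulary):
  """Return a list with a 1 in the position of `target` and 0 everywhere else."""
  return [1 if vocab_word == target else 0 for vocab_word in vocabulary]

one_hot_vectors = {toy_word: one_hot(toy_word, vocabulary) for toy_word in toy_words}
for toy_word, vector in one_hot_vectors.items():
  print(f"{toy_word:15} -> {vector}")
```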

If words are one-hot vectors, entire documents can then be expressed as combinations of these vectors:

In [None]:
# Shows how one-hot vectors accumulate into word frequencies
sample_texts = ["Entrepreneurs and managers, while similar, also face a number of idiosyncratic challenges.", 
                "Managers and entrepreneurs are similar. While similar, managers and entrepreneurs face idiosyncratic challenges."]
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(sample_texts)
print(f"There are {len(vectorizer.get_feature_names_out())} words in the vocabulary: {vectorizer.get_feature_names_out()}")
for text_id, text in enumerate(sample_texts):
  print(f"\nText {text_id+1}: {sample_texts[text_id]}\n As a Bag of Words: {doc_term_matrix.toarray()[text_id]}\n")
  for word_id, word in enumerate(doc_term_matrix.toarray()[text_id]):
    print(f"Word: {vectorizer.get_feature_names_out()[word_id]:15} - {word}")
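To make the "combination" concrete, the sketch below adds up the one-hot vectors of every token in a document; the result is exactly the word-count vector. It assumes a made-up, already-cleaned sentence and a plain whitespace split, which is simpler than the tokenization `CountVectorizer` performs.

```python
# Sum the one-hot vectors of each token to obtain a bag-of-words count vector
tokens = "managers and entrepreneurs are similar while similar managers and entrepreneurs differ".split()
vocabulary = sorted(set(tokens))

bag_of_words = [0] * len(vocabulary)
for token in tokens:
  token_one_hot = [1 if vocab_word == token else 0 for vocab_word in vocabulary]
  bag_of_words = [count + bit for count, bit in zip(bag_of_words, token_one_hot)]

for vocab_word, count in zip(vocabulary, bag_of_words):
  print(f"Word: {vocab_word:15} - {count}")
```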

This representation is very convenient for basic analyses based on word counts, and it enables us to look at how similar two texts are to each other by comparing the two documents' arrays:

*Note: we'd actually do a bit more preprocessing before comparing the similarity of these two documents, but we'll put that aside for now.*

In [None]:
# Displays the cosine similarity between the two sample texts
print(f"The cosine similarity of the two texts is: {cosine_similarity(doc_term_matrix)[0][1]:0.2}")
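For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch of that formula; the count vectors are made up, and the helper is named `cosine_sim` (a hypothetical name) so it doesn't shadow scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(vector_a, vector_b):
  """Cosine similarity: dot product divided by the product of the vector lengths."""
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  length_a = math.sqrt(sum(a * a for a in vector_a))
  length_b = math.sqrt(sum(b * b for b in vector_b))
  return dot_product / (length_a * length_b)

counts_text1 = [1, 1, 0, 2]  # hypothetical word counts for one text
counts_text2 = [2, 2, 0, 4]  # same proportions of words, but twice as long
counts_text3 = [0, 0, 3, 0]  # shares no words with the other two

print(f"Text 1 vs. text 2: {cosine_sim(counts_text1, counts_text2):0.2f}")
print(f"Text 1 vs. text 3: {cosine_sim(counts_text1, counts_text3):0.2f}")
```

Because the measure depends on the direction of the vectors rather than their length, a text and a twice-as-long text with the same word proportions score as (essentially) identical, while texts with no words in common score zero.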

However, as we start working with larger texts, this representation leads to very sparse matrices (i.e., matrices filled with lots of zeroes). Consider the sample of Australian family business press releases and 'About Us' pages that were loaded by the *prerequisites* code:

In [None]:
# Shows an excerpt of a real (i.e., sparse) document-term matrix 
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(text_df['text_no_stops'])
print(f"There are {len(vectorizer.get_feature_names_out())} words in the vocabulary: {vectorizer.get_feature_names_out()}\n")
print(f"Text 1: {text_df.iloc[0]['text_no_stops']}\n As a Bag of Words: \n{doc_term_matrix.toarray()[0][0:1000]}\n")

And that's just the first 1,000 words (of nearly 7,900) for just the first text in our corpus. Storing and processing that many zeroes is pretty inefficient. 
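To get a feel for the inefficiency, here is a rough sketch comparing a dense row with a sparse alternative that remembers only the non-zero entries. The positions and counts are made up; only the vocabulary size echoes the figure above.

```python
# A hypothetical document-term row: 7,900 vocabulary slots, only a few of them used
vocabulary_size = 7900
dense_row = [0] * vocabulary_size
for position, count in [(12, 3), (450, 1), (6021, 2)]:  # made-up positions and counts
  dense_row[position] = count

# Sparse alternative: remember only the non-zero entries
sparse_row = {position: count for position, count in enumerate(dense_row) if count != 0}

share_of_zeroes = dense_row.count(0) / vocabulary_size
print(f"Dense row stores {len(dense_row)} numbers; {share_of_zeroes:.2%} of them are zero")
print(f"Sparse row stores {len(sparse_row)} entries: {sparse_row}")
```

scikit-learn's `CountVectorizer` already returns a SciPy sparse matrix for this reason; the `.toarray()` calls in this module expand it into the dense form so we can display it.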

Further, intuitively, we know that two words (e.g., 'manager' and 'leader') can have similar meanings. However, this representation treats all individual words as being orthogonal: every pair of different words has a similarity of zero. So while we can compare entire documents, the words that make up those documents generally cannot be compared. Consider:

In [None]:
# Calculates and shows the orthogonality of words in the bag-of-words representation
words = ["managers", "leaders", "entrepreneurs"]
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(words)
print(f"There are {len(vectorizer.get_feature_names_out())} words in the vocabulary: {vectorizer.get_feature_names_out()}\n")
for text_id, text in enumerate(words):
  print(f"Word {text_id+1}: {words[text_id]:15} --- As a one-hot vector: {doc_term_matrix.toarray()[text_id]}")

print(f"\nThe cosine similarity of 'managers' and 'leaders' is: {cosine_similarity(doc_term_matrix)[0][1]:0.2}")

Combine all this with the notion that the Bag of Words model ignores word order and the limitations of this representation become fairly apparent.
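The word-order limitation is easy to demonstrate. In this small sketch (the two sentences are made up), the sentences mean opposite things but produce identical bags of words:

```python
from collections import Counter

sentence_a = "the manager fired the entrepreneur"
sentence_b = "the entrepreneur fired the manager"

# A Counter of tokens is a bag of words: it keeps counts and discards order
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(f"Bag A: {dict(bag_a)}")
print(f"Bag B: {dict(bag_b)}")
print(f"Identical bags of words? {bag_a == bag_b}")
```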

# 7.2. GloVe Word Embeddings
---

Rather than encoding information about word frequency into sparse vectors, GloVe creates *neural word embeddings*: dense (essentially no zeroes), lower-dimensional (e.g., 50-300 vs. 7,900 dimensions) vectors that encode information about how a word is used in natural language.

The GloVe vectors themselves can be created using an *unsupervised machine learning* algorithm on your own corpus of texts; however, in this case, we're going to use the pretrained vectors provided by Stanford, which are based on language use in both Wikipedia and news articles (i.e., Gigaword 5).

Consider the GloVe vector for the word 'managers':

In [None]:
# Displays the GloVe embedding for the word 'managers'
word = 'managers'
print(f"GloVe embedding for '{word}' has {glove_model.get_vector(word).shape[0]} dimensions")
print(f"GloVe embedding for '{word}': \n{glove_model.get_vector(word)}")

Because we have information about how the word is used encoded into the vector, we can now provide that direct comparison of words that we could not do with the Bag of Words model:

In [None]:
# Calculates the cosine similarity of two words with GloVe embeddings
# (SciPy's cosine() returns a distance, so similarity = 1 - distance)
comparison_word = 'leaders'
print(f"The cosine similarity of '{word}' and '{comparison_word}' is: {1 - cosine(glove_model.get_vector(word), glove_model.get_vector(comparison_word))}")

We can also work in the other direction and find the words that are most similar to a specified word in the Wikipedia/News corpus.

In [None]:
# Displays the ten most similar words to 'managers' in GloVe embeddings
glove_model.most_similar(word, topn=10)

In [None]:
# Displays the ten most similar words to 'entrepreneurs' in GloVe embeddings
comparison_word2 = 'entrepreneurs'
glove_model.most_similar(comparison_word2, topn=10)
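Conceptually, a "most similar" lookup just ranks every other word in the vocabulary by its cosine similarity to the query word. Here is a minimal sketch of that idea. The 3-dimensional vectors are made up for illustration (real GloVe vectors are learned from text), and `toy_most_similar` is a hypothetical helper, not gensim's implementation.

```python
import math

# Made-up 3-dimensional "embeddings" for four words
toy_embeddings = {
  "managers":      [0.9, 0.8, 0.1],
  "executives":    [0.8, 0.9, 0.2],
  "entrepreneurs": [0.7, 0.3, 0.6],
  "bananas":       [0.1, 0.0, 0.9],
}

def cosine_sim(vector_a, vector_b):
  """Cosine similarity: dot product divided by the product of the vector lengths."""
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  length_a = math.sqrt(sum(a * a for a in vector_a))
  length_b = math.sqrt(sum(b * b for b in vector_b))
  return dot_product / (length_a * length_b)

def toy_most_similar(query, embeddings, topn=2):
  """Rank every word other than `query` by cosine similarity to it, highest first."""
  scores = [(other, cosine_sim(embeddings[query], vector))
            for other, vector in embeddings.items() if other != query]
  return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for neighbor, score in toy_most_similar("managers", toy_embeddings):
  print(f"{neighbor:15} {score:0.3f}")
```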

Because the usage of each word is encoded as a vector in 50-dimensional space, we can use vector algebra to look for connections between words as well. For example, we'll take a look at a classic example:

Let's start with the vector for 'king', subtract the vector for 'man', and add the vector for 'woman':

`'king' - 'man' + 'woman'`

In [None]:
# Demonstrates simple vector algebra with GloVe embeddings
glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)

That makes sense. How about more abstract concepts?

`'clinton' - 'democrat' + 'republican'`

In [None]:
# Demonstrates more abstract vector algebra with GloVe embeddings
glove_model.most_similar(positive=['clinton', 'republican'], negative=['democrat'], topn=1)

That makes sense, but we're business scholars...let's apply this to a business setting: 

`'businessman' - 'man' + 'woman'`

In [None]:
glove_model.most_similar(positive=['businessman', 'woman'], negative=['man'], topn=1)

`'businessman' + 'innovation'`

In [None]:
glove_model.most_similar(positive=['businessman', 'innovation'], topn=1)
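The arithmetic behind these analogies can be sketched with toy vectors. In the example below, the 2-dimensional vectors are made up so that one axis loosely stands for "royalty" and the other for "gender", and `toy_analogy` is a hypothetical helper. gensim's own implementation differs in its details (for example, in how it normalizes vectors), so treat this as an illustration of the idea only.

```python
import math

# Made-up 2-dimensional vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
  "man":   [0.0, 1.0],
  "woman": [0.0, -1.0],
  "king":  [1.0, 1.0],
  "queen": [1.0, -1.0],
  "apple": [-1.0, 0.1],
}

def cosine_sim(vector_a, vector_b):
  """Cosine similarity: dot product divided by the product of the vector lengths."""
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  length_a = math.sqrt(sum(a * a for a in vector_a))
  length_b = math.sqrt(sum(b * b for b in vector_b))
  return dot_product / (length_a * length_b)

def toy_analogy(positive, negative, vectors):
  """Add the `positive` vectors, subtract the `negative` ones, return the closest remaining word."""
  dimensions = len(next(iter(vectors.values())))
  target = [0.0] * dimensions
  for term in positive:
    target = [t + v for t, v in zip(target, vectors[term])]
  for term in negative:
    target = [t - v for t, v in zip(target, vectors[term])]
  # Exclude the input words themselves from the candidate answers
  candidates = [term for term in vectors if term not in positive + negative]
  return max(candidates, key=lambda term: cosine_sim(target, vectors[term]))

print(toy_analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))
```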

Let's visualize what is happening here by projecting the 50-dimensional space into 3-dimensional space using Principal Components Analysis:

In [None]:
# Plots PCA of the words in the word list into 3-dimensional space
word_list = ['businessman', 'entrepreneur', 'leader', 'manager', 'businesswoman', 'employee', 'employees', 'managers',
             'entrepreneurs',  'executive', 'executives', 'industrialist', 'industrialists', 'leaders', 'businesswomen', 
             'businessmen', 'businessperson', 'businesspeople', 'school', 'schools', 'university', 'universities', 'professor', 
             'professors', 'instructor', 'instructors', 'teacher', 'teachers', 'institute', 'institutes', 'college', 'colleges']
glove_vectors = [glove_model[word] for word in word_list]
project3d = PCA().fit_transform(glove_vectors)[:,:3]
word_projection_map = []
for word, (x,y,z) in zip(word_list, project3d):
  word_projection_map.append({'word': word, 'x': x, 'y': y, 'z':z})
embedding_df = pd.DataFrame(word_projection_map)

fig = px.scatter_3d(embedding_df, x="x", y="y", z="z", text="word")
fig.show()
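If you're curious what the PCA projection is doing, here is a rough sketch of it using NumPy's singular value decomposition. It assumes NumPy is available and uses random numbers as a stand-in for real embeddings; the results should match scikit-learn's `PCA` up to the sign of each axis.

```python
import numpy as np

# Random stand-in for embeddings: 10 hypothetical "words" in 50 dimensions
rng = np.random.default_rng(seed=0)
toy_matrix = rng.normal(size=(10, 50))

# 1. Center each dimension on zero
centered = toy_matrix - toy_matrix.mean(axis=0)

# 2. The right singular vectors of the centered data are the principal axes
_, singular_values, principal_axes = np.linalg.svd(centered, full_matrices=False)

# 3. Project onto the first three axes (the directions with the most variance)
projected = centered @ principal_axes[:3].T

print(f"Original shape: {toy_matrix.shape}, projected shape: {projected.shape}")
print(f"Variance along each kept axis: {projected.var(axis=0).round(2)}")
```

The three kept axes are the directions along which the points vary the most, which is why words that sit close together in the full space tend to stay close together in the plot.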

# 7.3. Contextual Word Embeddings
---

I'm lumping a pretty diverse set of embedding models together here; however, for our purposes in this module, we're going to treat them as the same despite important differences.

In the previous section we looked at GloVe, which (like Word2Vec) learns embeddings based on the context in which words are used in a corpus. This is fantastic, but once these embeddings are learned, each instance of that 'word' is treated the same. Intuition tells us that shouldn't be the case... consider "work":

* The employee went to work every morning,
* I work very hard for my pay,
* This is a masterful work of art,
* I was tired after a long day's work, and
* My cellphone doesn't work.

Each sentence contains the word 'work', but while it's spelled the same, the meaning is very different. Contextual word embeddings (e.g., ELMo, BERT) do not store a single embedding for each word - the embedding of the word is contextualized both during training and when it's being applied.

Consider the example of 'work' in these sentences:


In [None]:
# Uses ELMo embeddings to show how 'work' has different embeddings depending on how it is used.
texts = ["The employee went to work every morning", 
         "I work very hard for my pay", 
         "This is a masterful work of art", 
         "I was tired after a long day's work", 
         "My cellphone doesn't work"]
work_locations = [(text_id, text.split(" ").index("work")) for text_id, text in enumerate(texts)]

work_embeddings = elmo(texts, signature="default", as_dict=True)["elmo"]
with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  elmo_embeddings = session.run(work_embeddings)

print("\nHere are the first 5 elements of the embeddings for the five 'work's:")
for text_id, word_id in work_locations:
  print(f"Sentence {text_id+1}: {elmo_embeddings[text_id][word_id][:5]}")

for (text1, word1), (text2, word2) in combinations(work_locations, 2):
  print(f"\nSimilarity of 'work' in the sentences: '{texts[text1]}' and '{texts[text2]}': ")
  # SciPy's cosine() returns a distance, so similarity = 1 - distance
  print(1 - cosine(elmo_embeddings[text1][word1], elmo_embeddings[text2][word2]))

We can see that even though each sentence uses the word 'work', the embedding of the word changes based on the context in which 'work' is being used.
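ELMo gets its context sensitivity from a bidirectional language model, which is far more than a few lines of code can capture. The toy sketch below is *not* ELMo. It only illustrates the core idea, under the loud assumption that "context" can be faked by averaging a word's static vector with those of its neighbors. All vectors and the `toy_contextual_vector` helper are made up.

```python
# Made-up static vectors; a static model always returns the same vector for "work"
static_vectors = {
  "work": [1.0, 1.0], "art": [0.0, 2.0], "of": [0.5, 0.5], "masterful": [0.0, 1.5],
  "my": [0.5, 0.5], "cellphone": [2.0, 0.0], "doesn't": [0.5, 0.5],
}

def toy_contextual_vector(tokens, position, vectors, window=2):
  """Average the target word's static vector with those of its neighbors within `window` tokens."""
  start, end = max(0, position - window), min(len(tokens), position + window + 1)
  neighborhood = [vectors[token] for token in tokens[start:end]]
  return [sum(values) / len(neighborhood) for values in zip(*neighborhood)]

art_tokens = "masterful work of art".split()
phone_tokens = "my cellphone doesn't work".split()

work_in_art = toy_contextual_vector(art_tokens, art_tokens.index("work"), static_vectors)
work_in_phone = toy_contextual_vector(phone_tokens, phone_tokens.index("work"), static_vectors)

print(f"Static vector for 'work':        {static_vectors['work']}")
print(f"'work' in 'masterful work of art':    {work_in_art}")
print(f"'work' in 'my cellphone doesn't work': {work_in_phone}")
```

The static lookup returns one vector no matter the sentence, while the context-aware version returns a different vector for each use of 'work', which is the behavior we saw from ELMo above.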