# How can we think of text as numbers for quantitative analysis?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'

## Bag-of-Words (BoW)

BoW represents a document as a multiset (a "bag") of words, without regard for word order.  Each distinct word in the corpus is assigned a unique index, and a document is represented as a vector whose entry at each word's index is the number of times that word occurs in the document.

In [None]:
corpus = ["The cat slept and then meowed.", 
          "The tiger slept and then roared.", 
          "The boy ran home and then the boy laughed."]

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

Even though we are using Scikit-Learn's `CountVectorizer` to do the counting, there is no reason we couldn't do it ourselves with a bit of Python.  The Scikit-Learn way is just more convenient.
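As a minimal sketch of that manual route, the following builds the same kind of count table with plain Python. The `tokenize` helper and the `manual_bow` name are illustrative and not part of Scikit-Learn, and the regular expression is an assumption meant to mimic `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

corpus = ["The cat slept and then meowed.",
          "The tiger slept and then roared.",
          "The boy ran home and then the boy laughed."]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an approximation of CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary: one column per distinct word in the corpus.
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})

# One row of counts per document, in vocabulary order.
manual_bow = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    manual_bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in manual_bow:
    print(row)
```

With the default settings, these rows should line up with the Scikit-Learn table, whose columns are also in alphabetical order.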

In [None]:
vectorizer.get_feature_names_out()

In [None]:
pd.DataFrame(X.toarray(), 
             columns=vectorizer.get_feature_names_out())

In [None]:
# for comparison against our corpus:
corpus

## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF extends BoW by accounting for how well each word distinguishes between documents.  The BoW word counts are reweighted by each word's relative rarity across the entire corpus, so that words appearing in few documents count for more than words appearing in most of them.

* Scikit-Learn's TF-IDF calculation is [described here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
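To get a feel for the rarity weighting, here is a small sketch that computes an inverse document frequency for a few words of our corpus with plain Python. It assumes the smoothed form `ln((1 + n) / (1 + df)) + 1` that Scikit-Learn uses by default; the `smoothed_idf` helper and the simple whitespace tokenization are illustrative choices, not Scikit-Learn's own code.

```python
import math

corpus = ["The cat slept and then meowed.",
          "The tiger slept and then roared.",
          "The boy ran home and then the boy laughed."]

# Set of distinct lowercase words per document (trailing period removed).
docs = [set(doc.lower().rstrip(".").split()) for doc in corpus]
n_docs = len(docs)

def smoothed_idf(word):
    # df: number of documents that contain the word at least once.
    df = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

for word in ["the", "slept", "cat"]:
    print(word, round(smoothed_idf(word), 3))
```

"the" occurs in all three documents and gets the smallest weight, while "cat" occurs in only one and gets the largest.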

In [None]:
vectorizer = TfidfVectorizer()

X_tfidf = vectorizer.fit_transform(corpus)

In [None]:
pd.DataFrame(X_tfidf.toarray(), 
             columns=vectorizer.get_feature_names_out())

A lot of mathematical detail goes into getting a well-behaved form of TF-IDF, and reconstructing it from the word counts and frequencies is a messy business.

You can skip the following if you want to, but here is how one would go directly from the matrix of counts to Scikit-Learn's version of the TF-IDF measure.
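For reference, the calculation can be summarized as follows, assuming Scikit-Learn's default settings (smoothed idf and L2 row normalization). Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(d,t)$ is the frequency of $t$ in document $d$; the symbol names are choices made for this summary.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{d,t} = \mathrm{tf}(d,t)\,\mathrm{idf}(t),
\qquad
\hat{w}_{d,t} = \frac{w_{d,t}}{\sqrt{\sum_{t'} w_{d,t'}^{2}}}
```

Scikit-Learn takes $\mathrm{tf}(d,t)$ to be the raw count, whereas the cells here use the count divided by the document length. Both give the same final $\hat{w}_{d,t}$, because the row normalization cancels any positive per-document scale factor.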

In [None]:
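# `vectorizer` now refers to the TfidfVectorizer, but with default settings it
# learns the same vocabulary as the CountVectorizer, so the column names still match X
x_bow = pd.DataFrame(X.toarray(), 
                     columns=vectorizer.get_feature_names_out())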

In [None]:
x_bow

In [None]:
# Term frequencies: each count divided by the total number of words in its document (row)
(x_bow.T / x_bow.T.sum(axis=0)).T

In [None]:
# Getting the number of documents in which each word occurs
(x_bow > 0).sum(axis=0)

In [None]:
tf = (x_bow.T / x_bow.T.sum(axis=0)).T

n_docs = len(x_bow)  # number of documents (3 here)

# The +1 in the numerator and the +1 in the denominator smooth the ratio,
# which avoids a division by zero for a word that occurs in no document.
# The +1 at the end ensures that even words occurring in every document
# still have a non-zero TF-IDF.
idf = np.log((1 + n_docs) / (1 + (x_bow > 0).sum(axis=0))) + 1

tf * idf

... and then one has to apply a cosine (L2) normalization, so that the squares of the elements in each row add up to 1.  This is convenient because the inner (dot) product of two rows is then their cosine similarity. Cosine similarity ranges from -1 to 1 in general, but for non-negative TF-IDF weights it lies between 0 and 1.
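As a small check of that idea, the sketch below normalizes two rows to unit length and takes their dot product. To keep it short, it uses the raw word counts of the first two corpus sentences as a stand-in for TF-IDF weights; the variable names are illustrative.

```python
import numpy as np

# Count rows for the first two corpus sentences, with columns in
# alphabetical vocabulary order (raw counts stand in for TF-IDF weights).
doc0 = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0], dtype=float)
doc1 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# L2-normalize: divide each row by its Euclidean length.
unit0 = doc0 / np.linalg.norm(doc0)
unit1 = doc1 / np.linalg.norm(doc1)

# For unit-length vectors the dot product is the cosine similarity.
cosine = float(np.dot(unit0, unit1))
print(round(cosine, 4))  # 4 shared words / (sqrt(6) * sqrt(6)) = 0.6667
```

Because every entry is non-negative, the result cannot fall below 0.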

In [None]:
tfidf = tf * idf
tfidf = (tfidf.T / np.sqrt((tfidf.T * tfidf.T).sum(axis=0))).T
tfidf

In [None]:
np.dot(tfidf.loc[0], tfidf.loc[1])

## Word Embeddings

Word embeddings represent words as dense vectors in a continuous vector space. Word2Vec, GloVe, and FastText are methods for learning such embeddings, and pre-trained vectors are available for each of them. Below we instead train a small Word2Vec model from scratch on our own corpus.

In [None]:
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

model = Word2Vec(sentences=tokenized_corpus, 
                 vector_size=2,
                 min_count=1)

word_vectors = model.wv

In [None]:
tokenized_corpus

In [None]:
word_vectors.index_to_key

In [None]:
word_vectors['cat']

In [None]:
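# One vector per token of the first document (a list of word vectors, not a single document vector)
vector_for_document = [word_vectors[word] for word in tokenized_corpus[0] if word in word_vectors.key_to_index]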

In [None]:
vector_for_document

The dense vectors let us compute similarity scores between words, e.g., by taking the inner (dot) product.  With a corpus this small, the scores illustrate the mechanics rather than reliable semantics.
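One caveat worth keeping in mind is that a raw dot product depends on the lengths of the vectors as well as on the angle between them. The sketch below uses made-up 2-D vectors (not output from the model) and an illustrative `cosine_similarity` helper to show the difference.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors: b points the same way as a but is 10x longer.
a = np.array([1.0, 2.0])
b = 10 * a
c = np.array([2.0, 1.0])

print(np.dot(a, b), cosine_similarity(a, b))  # large dot product, cosine of 1
print(np.dot(a, c), cosine_similarity(a, c))  # smaller dot product, cosine of 0.8
```

Dividing by the vector lengths removes the effect of magnitude, which is why cosine similarity is the more common choice for comparing embeddings.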

In [None]:
np.dot(word_vectors['cat'], word_vectors['meowed'])

In [None]:
np.dot(word_vectors['cat'], word_vectors['tiger'])

In [None]:
np.dot(word_vectors['cat'], word_vectors['the'])

## Word embedding plotting example

In [None]:
word_vectors.index_to_key

In [None]:
word_embeddings = {word: model.wv[word] for word in word_vectors.index_to_key}

fig, ax = plt.subplots()

for word, wordvec in word_embeddings.items():
    ax.scatter(wordvec[0], wordvec[1])
    ax.annotate(word, (wordvec[0], wordvec[1]))

plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Word Embeddings in 2D Space")
plt.show()

In the above, two dimensions are convenient for plotting, but they are a dramatic reduction: useful embeddings typically live in a much higher-dimensional space, which has to be projected into a lower-dimensional one for visualization.

When the texts become really large, the problem becomes even more dramatic.
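One common way to carry out such a projection is principal component analysis (PCA). The sketch below is an illustration under stated assumptions. The `embeddings` matrix is random data standing in for real 100-dimensional word vectors (with a trained gensim model, `model.wv.vectors` could be used instead). PCA is a technique brought in here for illustration, not something the rest of this notebook uses.

```python
import numpy as np

# Hypothetical stand-in for an (n_words x 100) embedding matrix.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 100))

# PCA via SVD: center the data, then project onto the top two
# right singular vectors (the directions of greatest variance).
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

# Fraction of the total variance retained by the two plotted components.
explained = float((singular_values[:2] ** 2).sum() / (singular_values ** 2).sum())
print(projected.shape, round(explained, 3))
```

The retained-variance fraction gives a rough sense of how much is lost in the 2-D picture.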

In [None]:
# Load the text of "Moby Dick"
nltk.download('gutenberg')  # fetch the Gutenberg corpus if it is not already installed
from nltk.corpus import gutenberg
moby_dick_text = gutenberg.raw('melville-moby_dick.txt')

# Sentence and word tokenization
sentences = sent_tokenize(moby_dick_text)
words = word_tokenize(moby_dick_text)

In [None]:
len(sentences)

In [None]:
len(words)

In [None]:
# only uncomment this if you want lots of output
# moby_dick_text

In [None]:
sentences[55:56]

In [None]:
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in sentences]

In [None]:
tokenized_corpus[55:56]

In [None]:
model = Word2Vec(sentences=tokenized_corpus, 
                 vector_size=100,
                 min_count=1)

word_vectors = model.wv

In [None]:
model.wv.similarity('woman', 'man')

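The similarity score is the cosine of the angle between the vectors representing the two word embeddings.  For comparison, a count-based representation needs one dimension per distinct word in the vocabulary, drawn from the roughly 255,000 tokens of the text, while the word embedding is only 100-dimensional.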

In [None]:
np.dot(model.wv['woman'], 
       model.wv['man']) / (np.linalg.norm(model.wv['woman']) * 
                           np.linalg.norm(model.wv['man']))

In [None]:
model.wv.similarity('sea', 'scarcity')

## Contextual embeddings

Contextualized embeddings take the surrounding words in a sentence into account, so the same word can receive different vectors in different contexts (unlike Word2Vec, which assigns one fixed vector per word).  As examples:

* BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model that can be used to obtain embeddings for words, sentences, and documents.
* GPT (Generative Pre-trained Transformer) also produces context-dependent representations of words.

Transformers take us into the realm of deep learning, which we haven't touched on yet, but which we'll get to in just a few weeks.