# How can we think of text as numbers for quantitative analysis?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

## Bag-of-Words (BoW)

BoW represents a document as a bag (multiset) of words: word order is discarded, but word counts are kept.  Each word in the vocabulary is assigned a unique index, and a document is represented as a vector whose value at each word's index is that word's count in the document.
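To see what discarding word order means, here is a minimal sketch using only the standard library. The two example sentences and the naive whitespace tokenization are illustrative assumptions, not part of the corpus used below. They contain the same words in a different order, so their bags of words are identical.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

# A bag of words is just a word -> count mapping (naive whitespace tokenization)
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_a == bag_b)  # the two sentences are indistinguishable as bags of words
```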

In [None]:
corpus = ["The cat slept and then meowed.", 
          "The tiger slept and then roared.", 
          "The boy ran home and then the boy laughed."]

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

In [None]:
X.toarray()

Although we use Scikit-Learn's `CountVectorizer` here, nothing stops us from computing the same counts ourselves with a bit of Python.  Scikit-Learn is simply more convenient.
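As a rough sketch of what a manual version could look like, the code below rebuilds the count matrix with the standard library only. It assumes the same defaults as `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The names `manual_corpus`, `tokenize`, `vocabulary`, and `bow_matrix` are made up for this illustration.

```python
import re
from collections import Counter

manual_corpus = ["The cat slept and then meowed.",
                 "The tiger slept and then roared.",
                 "The boy ran home and then the boy laughed."]

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters,
    # mimicking CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized_docs = [tokenize(doc) for doc in manual_corpus]

# Sorted vocabulary: one column per distinct word
vocabulary = sorted({word for doc in tokenized_docs for word in doc})

# One row per document, one count per vocabulary word
bow_matrix = []
for doc in tokenized_docs:
    counts = Counter(doc)  # missing words count as 0
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```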

In [None]:
vectorizer.get_feature_names_out()

In [None]:
pd.DataFrame(X.toarray(), 
             columns=vectorizer.get_feature_names_out())

In [None]:
# For comparison, here is our corpus again:
corpus

## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF extends BoW by accounting for the uniqueness of words in distinguishing between documents.  The word counts of BoW are weighted by words' relative rarity across the entire corpus.

* Scikit-Learn's TF-IDF calculation is [described here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

In [None]:
vectorizer = TfidfVectorizer()

X_tfidf = vectorizer.fit_transform(corpus)

In [None]:
pd.DataFrame(X_tfidf.toarray(), 
             columns=vectorizer.get_feature_names_out())

Several mathematical details (smoothing, normalization) go into producing a well-behaved form of TF-IDF, so reproducing Scikit-Learn's numbers from the raw word counts takes a few steps.

You can skip the following if you like, but here is how to go directly from the matrix of counts to Scikit-Learn's version of the TF-IDF measure.
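For reference, the calculation can be written compactly as follows. This is a summary of Scikit-Learn's default settings (`smooth_idf=True`, `norm='l2'`) as described in the documentation linked above. The symbols are notation introduced here: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the frequency of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\text{tf-idf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

With the $n = 3$ documents here, a word found in all three documents gets $\mathrm{idf} = \ln(4/4) + 1 = 1$, while a word found in only one gets $\mathrm{idf} = \ln(4/2) + 1 \approx 1.69$. Because of the final normalization, it does not matter whether $\mathrm{tf}$ is a raw count or a count divided by the document length: any per-document scaling cancels.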

In [None]:
x_bow = pd.DataFrame(X.toarray(), 
             columns=vectorizer.get_feature_names_out())

In [None]:
x_bow

In [None]:
# Getting the term frequencies in each of the three documents
(x_bow.T / x_bow.T.sum(axis=0)).T

In [None]:
# Getting the number of documents in which each word occurs
(x_bow > 0).sum(axis=0)

In [None]:
tf = (x_bow.T / x_bow.T.sum(axis=0)).T

n_docs = len(x_bow)  # number of documents (3 here)

# Smoothed inverse document frequency:
# - the +1 in the numerator and denominator acts as if one extra document
#   contained every word, which avoids division by zero
# - the +1 at the end keeps words that occur in all documents
#   from getting a TF-IDF of zero
idf = np.log((1 + n_docs) / (1 + (x_bow > 0).sum(axis=0))) + 1

tf * idf

... and then one has to apply a cosine (L2) normalization, so that the squares of the elements in each row add up to 1.  This is convenient because the inner (dot) product of two rows is then their cosine similarity.  Cosine similarity in general ranges from -1 to 1, but since TF-IDF values are non-negative, here it lies between 0 and 1.
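The sketch below checks the unit-length property and computes pairwise similarities on a small hand-made count matrix (the numbers and the names `counts` and `l2_normalize` are illustrative assumptions). It also shows that the result is the same whether the term frequencies are raw counts or counts divided by the document length, because the row normalization cancels any per-row scaling.

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 4 words (columns)
counts = np.array([[1., 0., 2., 1.],
                   [1., 1., 0., 1.],
                   [0., 3., 1., 1.]])

def l2_normalize(matrix):
    # Divide each row by its Euclidean length
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF from raw counts vs. from length-normalized term frequencies
tfidf_raw = l2_normalize(counts * idf)
tfidf_rel = l2_normalize(counts / counts.sum(axis=1, keepdims=True) * idf)

# Dot products of unit-length rows are cosine similarities
cosine_sim = tfidf_raw @ tfidf_raw.T

print(np.allclose(tfidf_raw, tfidf_rel))               # same result either way
print(np.allclose((tfidf_raw ** 2).sum(axis=1), 1.0))  # rows have unit length
print(cosine_sim.round(3))
```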

In [None]:
tfidf = tf * idf
tfidf = (tfidf.T / np.sqrt((tfidf.T * tfidf.T).sum(axis=0))).T
tfidf

In [None]:
np.dot(tfidf.loc[0], tfidf.loc[1])

In [None]:
# Cosine similarity matrix for every pair of documents
# (rows are already unit length, so the dot products are cosines):
tfidf @ tfidf.T

## Word Embeddings

Word embeddings represent words as dense vectors in a continuous vector space.  Word2Vec, GloVe, and FastText are common methods for learning such embeddings.  Pre-trained vectors are available for each, but here we train a small Word2Vec model on our own corpus.

In [None]:
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

model = Word2Vec(sentences=tokenized_corpus, 
                 vector_size=2,
                 min_count=1)

word_vectors = model.wv

In [None]:
tokenized_corpus

In [None]:
word_vectors.index_to_key

In [None]:
word_vectors['cat']

In [None]:
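# One vector per token of the first document (a list of word vectors, not a single document vector)
vector_for_document = [word_vectors[word] for word in tokenized_corpus[0] if word in word_vectors.key_to_index]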

In [None]:
vector_for_document

The dense vectors let us compute similarity scores between words, e.g., via the inner (dot) product.
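Note that a raw dot product depends on the lengths of the vectors as well as the angle between them. Dividing by the two norms gives the cosine similarity, which depends only on direction. A small sketch with made-up 2D vectors (`vec_a`, `vec_b`, `vec_c`, and `cosine_similarity` are hypothetical names, not gensim objects):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vec_a = np.array([1.0, 2.0])
vec_b = np.array([2.0, 4.0])   # same direction as vec_a, twice as long
vec_c = np.array([-2.0, 1.0])  # perpendicular to vec_a

# Same direction: large dot product, cosine of (approximately) 1
print(np.dot(vec_a, vec_b), cosine_similarity(vec_a, vec_b))
# Perpendicular: dot product and cosine are both 0
print(np.dot(vec_a, vec_c), cosine_similarity(vec_a, vec_c))
```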

In [None]:
np.dot(word_vectors['cat'], word_vectors['meowed'])

In [None]:
np.dot(word_vectors['cat'], word_vectors['tiger'])

In [None]:
np.dot(word_vectors['cat'], word_vectors['the'])

### Word Embedding Plotting Example

In [None]:
word_vectors.index_to_key

In [None]:
word_embeddings = {word: model.wv[word] for word in word_vectors.index_to_key}

fig, ax = plt.subplots()

for word, wordvec in word_embeddings.items():
  ax.scatter(wordvec[0], wordvec[1])
  ax.annotate(word, (wordvec[0], wordvec[1]))

plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Word Embeddings in 2D Space")
plt.show()

Two dimensions are convenient for plotting, but squeezing word meanings into a 2-dimensional space is a dramatic compression, and a corpus this small gives the model very little to learn from.

With larger texts and vocabularies, the compression becomes even more dramatic.

In [None]:
nltk.download('gutenberg')

In [None]:
# Load the text of "Moby Dick"
from nltk.corpus import gutenberg
moby_dick_text = gutenberg.raw('melville-moby_dick.txt')

# Sentence and word tokenization
sentences = sent_tokenize(moby_dick_text)
words = word_tokenize(moby_dick_text)

In [None]:
len(sentences)

In [None]:
len(words)

In [None]:
# only uncomment this if you want lots of output
# moby_dick_text

In [None]:
sentences[55:56]

In [None]:
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in sentences]

In [None]:
tokenized_corpus[55:56]

In [None]:
model = Word2Vec(sentences=tokenized_corpus, 
                 vector_size=100,
                 min_count=1)

word_vectors = model.wv

In [None]:
model.wv.similarity('woman', 'man')

The similarity score is the cosine of the angle between the two word vectors, which the next cell verifies by hand.  A count-based representation needs one dimension per distinct word in the vocabulary, whereas each word embedding here is only 100-dimensional.

In [None]:
np.dot(model.wv['woman'], 
       model.wv['man']) / (np.linalg.norm(model.wv['woman']) * 
                           np.linalg.norm(model.wv['man']))

In [None]:
model.wv.similarity('sea', 'scarcity')
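Vectors with 100 dimensions cannot be scatter-plotted directly the way the 2-dimensional ones were. One common workaround is to project them onto their top two principal components. The sketch below does this with NumPy's SVD on random stand-in data (`fake_embeddings` is a hypothetical placeholder, not the trained model's vectors). With the real model one might pass `model.wv.vectors` instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a (num_words, 100) embedding matrix
fake_embeddings = rng.normal(size=(50, 100))

# PCA via SVD: center the data, then project onto the top-2 right singular vectors
centered = fake_embeddings - fake_embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
projected_2d = centered @ components[:2].T

# Fraction of the total variance retained by the two plotted components
explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()

print(projected_2d.shape)
print(round(float(explained), 3))
```

The `explained` fraction gives a rough sense of how much structure is lost in the 2D picture.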

## Word Embeddings with `sentence-transformers` and BERT

So far, we've seen how to represent text using traditional methods (e.g. bag-of-words, TF-IDF), as well as how to use a static embedding model like Word2Vec.

Static models such as Word2Vec, GloVe, and FastText assign a single vector per word regardless of context.  Sentence Transformers instead use transformer architectures (like BERT), which take the surrounding context into account, and are fine-tuned on tasks such as semantic similarity using contrastive loss.  The resulting dense embeddings capture semantic similarity: texts with similar meaning end up close together in vector space.

We now consider:
1. How to set up sentence-level embeddings with the `sentence-transformers` library.
2. How to get BERT-based embeddings directly from Hugging Face `transformers`.

## 1. Sentence Embeddings with `sentence-transformers`

The [`sentence-transformers`](https://www.sbert.net/) library wraps a variety of pre-trained transformer models
and makes it very easy to get sentence-level embeddings.

Flow:
- Input: a list of sentences / texts.
- Output: a NumPy array or PyTorch tensor of shape `(num_sentences, embedding_dim)`.
- Models like `"all-MiniLM-L6-v2"` are small, fast, and good general-purpose choices.

First, load the library and model:

In [None]:
from sentence_transformers import SentenceTransformer, util

In [None]:
sent_model_name = "sentence-transformers/all-MiniLM-L6-v2"
sent_model = SentenceTransformer(sent_model_name)

Encode our sentences:

In [None]:
sentences = ["The cat slept and then meowed.",
             "The tiger slept and then roared.",
             "The boy ran home and then the boy laughed.",
             "This sentence has positive sentiment.",
             "This sentence has negative sentiment."]

sentence_embeddings = sent_model.encode(sentences, 
                                        convert_to_tensor=True)

In [None]:
sentence_embeddings.shape

The first ten values of the first embedding vector:

In [None]:
sentence_embeddings[0][:10]

### Measuring Similarity

We can measure similarity between embeddings using cosine similarity:

In [None]:
# Compute pairwise cosine similarities
cosine_sim_matrix = util.cos_sim(sentence_embeddings, sentence_embeddings)

cosine_sim_matrix

In [None]:
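# The plain dot product should give the same matrix here, because this model
# returns unit-length (L2-normalized) embeddings
sentence_embeddings @ sentence_embeddings.T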
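Once a pairwise similarity matrix is available, a typical next step is to find each sentence's nearest neighbor. A minimal sketch with NumPy, using a small hand-made matrix (the values in `toy_sim` are illustrative, not model output):

```python
import numpy as np

# Hypothetical 3x3 cosine similarity matrix (symmetric, ones on the diagonal)
toy_sim = np.array([[1.0, 0.8, 0.3],
                    [0.8, 1.0, 0.2],
                    [0.3, 0.2, 1.0]])

# Exclude self-similarity so a sentence is never its own nearest neighbor
off_diagonal = toy_sim.copy()
np.fill_diagonal(off_diagonal, -np.inf)

# Index of the most similar other sentence, one per row
nearest = off_diagonal.argmax(axis=1)
print(nearest)
```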

## 2. Word / Sentence Embeddings with BERT (Hugging Face `transformers`)

`sentence-transformers` is great for ready-to-use sentence embeddings, but sometimes you may want to:

- Use a specific BERT variant (e.g. `bert-base-uncased`, `distilbert-base-uncased`, domain-specific BERT).
- Control how embeddings are constructed (e.g. average over tokens vs. use the `[CLS]` token).

We can do this directly with the Hugging Face `transformers` library.

Typical procedure for sentence embeddings with BERT:

1. Tokenize text with a BERT tokenizer.
2. Run the tokens through the BERT model to get hidden states.
3. Aggregate token embeddings (e.g. mean pooling across tokens) to get a single vector per sentence.


In [None]:
import torch
from transformers import BertTokenizer, BertModel

Load BERT tokenizer and model:

In [None]:
bert_model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
bert_model = BertModel.from_pretrained(bert_model_name)

Put the model in evaluation mode (this disables dropout, etc.):

In [None]:
bert_model.eval()

Example sentences (the first three from before, for comparison):

In [None]:
sentences = ["The cat slept and then meowed.",
             "The tiger slept and then roared.",
             "The boy ran home and then the boy laughed."
]

Tokenize with padding and truncation so that all sequences have the same length:

In [None]:
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"  # return PyTorch tensors
)

In [None]:
encoded

In [None]:
for k, v in encoded.items():
    print(k, v.shape)

### Getting Embeddings from BERT

Run BERT on our tokenized inputs:

In [None]:
# Short note about "**varname": it unpacks a dict into keyword arguments

data = {'name': 'Alice', 'age': 30}

def greet(name, age): 
    print(f"Hi {name}, you're {age}.")

greet(**data)  # Equivalent to greet(name='Alice', age=30)   

In [None]:
with torch.no_grad():
    outputs = bert_model(**encoded)

In [None]:
outputs

In [None]:
outputs['last_hidden_state'].shape

In [None]:
outputs['pooler_output'].shape

Feeding the tokenized `encoded` into our BERT model outputs:

- `last_hidden_state`: a tensor of shape `(batch_size, sequence_length, hidden_size)`
- Optionally `pooler_output` (for some models) and/or hidden states from each layer.

Two common strategies to get a single vector per sentence:

1. [CLS] token embedding: use `last_hidden_state[:, 0, :]`.
2. Mean pooling: average all token embeddings, masking out padding tokens.

We'll implement mean pooling since it often works well in practice.
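To make the difference between the two strategies concrete, here is a toy NumPy sketch (the tiny arrays `toy_embeddings` and `toy_mask` are made-up stand-ins for BERT's `last_hidden_state` and `attention_mask`). It shows the first-token vector, the masked mean, and how an unmasked mean would be distorted by a padding position.

```python
import numpy as np

# Toy batch: 2 sentences, 3 token positions, hidden size 2
toy_embeddings = np.array([[[1., 2.], [3., 4.], [5., 6.]],
                           [[2., 0.], [4., 2.], [9., 9.]]])
# The second sentence has one padding position (mask value 0)
toy_mask = np.array([[1, 1, 1],
                     [1, 1, 0]])

# Strategy 1: take the vector at position 0 (where BERT places [CLS])
cls_vectors = toy_embeddings[:, 0, :]

# Strategy 2: mean over real tokens only
mask = toy_mask[:, :, None]  # shape (batch, seq_len, 1) for broadcasting
masked_mean = (toy_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# For contrast: a plain mean also averages in the padding vector
naive_mean = toy_embeddings.mean(axis=1)

print(cls_vectors)
print(masked_mean)
print(naive_mean)
```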

In [None]:
token_embeddings = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
token_embeddings.shape

In [None]:
encoded['attention_mask']

In [None]:
# Expand attention mask so it matches token_embeddings shape
input_mask_expanded = encoded["attention_mask"].unsqueeze(-1).expand(token_embeddings.size()).float()
input_mask_expanded.shape

In [None]:
input_mask_expanded

In [None]:
# Sum embeddings along the sequence-length dimension (padding positions contribute zero)
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)

# Count of non-masked tokens, clamped to avoid division by zero
sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)

# Average over the real tokens
bert_sentence_embeddings = sum_embeddings / sum_mask

In [None]:
bert_sentence_embeddings.shape

In [None]:
bert_sentence_embeddings[0][:10]

### Computing Similarity with BERT Embeddings

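Just like with `sentence-transformers`, we can compare BERT-based embeddings with each other.  Note that the raw dot product below is not yet a cosine similarity, because these embeddings are not unit length.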


In [None]:
bert_sentence_embeddings @ bert_sentence_embeddings.T

It is common to L2-normalize the embeddings first, so that the dot product becomes a cosine similarity with values between -1 and 1:

In [None]:
from torch.nn import functional as F

# Normalize embeddings to unit length so that the dot product equals cosine similarity
normalized_embeddings = F.normalize(bert_sentence_embeddings, 
                                    p=2,   # exponent of the norm, here we take an L2 norm
                                    dim=1)

# Pairwise cosine similarity matrix
normalized_embeddings @ normalized_embeddings.T

## When to Use What?

- `sentence-transformers`:
  - High-quality sentence embeddings with minimal effort.
  - Models are fine-tuned on similarity/search tasks.
  - A one-line `.encode()` call, very convenient.
- Raw BERT via `transformers`:
  - Best when you need full control:
    - Custom pooling strategies.
    - Intermediate layers.
    - Domain-specific fine-tuning.
  - Requires more code but is very flexible.
- Start with a good `sentence-transformers` model.
- If needed, switch to a custom BERT (or other transformer) setup and fine-tune it on your own data.
