In [20]:
%%capture
!pip install scikit-learn==1.3.2 pandas==2.1.4

# Text representation, word embeddings, and sentence embeddings

In this lecture, we will learn about text representations. In particular, we will go through a few examples of word embeddings and sentence embeddings. They are tools that have a wide variety of uses. Let's get started.

## Text representation

<img src="images/week3-pnlp-01-pipeline.png">

A representation of the NLP pipeline

### Side topic: what is tokenization?

A simple description of tokenization: we split an English sentence into smaller components called tokens. Tokens are usually words, but they can also be other units such as subwords, characters, or punctuation marks. We will see an example later, and again in later lectures when we talk about modeling.
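
As a rough illustration of the idea, here is a minimal sketch of two naive tokenizers. The function names and the regular expression are my own choices for this example, not a standard; real tokenizers handle many more cases.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays attached to the words.
    return text.split()

def word_tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "Dog bites man, man bites dog!"
print(whitespace_tokenize(sentence))  # ['Dog', 'bites', 'man,', 'man', 'bites', 'dog!']
print(word_tokenize(sentence))        # ['dog', 'bites', 'man', 'man', 'bites', 'dog']
```

Note how the first version treats `man,` and `man` as different tokens, which is why the example corpus in this lecture is lowercased with punctuation removed.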

## Embeddings: current state

At my job, I learned that when you are doing research and exploration, if you can throw the kitchen sink at a problem, you should. Use the latest tool to see if you can solve the problem before going deep into designing your own solutions.

Let's produce some embeddings with `sentence-transformers` from Hugging Face, then we will go back to the basics.

See [hackerllama's notebook](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/)

## What problems are we trying to solve with embeddings?

Computers cannot consume text as is -- we must find a numeric representation of our text in order to process it.

We also have the following questions:

* How similar are two pieces of texts?
* How can we find neighbors? For example, what are the texts most relevant to the one in question?
    - semantic search (vector search) as opposed to keyword search
    - top k neighbors = ranking

By turning texts into vectors, we also get the following benefits:

* We can do vector algebra on texts!
* We can turn unstructured text data into structured features for other models, e.g. for predictive modeling

## Before sentence-transformers?

Let's go from the beginning to see how we arrive at where we are today. Examples from here on are taken from the Practical NLP book.

### One-hot encoding

One of the very first thoughts that comes to mind is: can we build a dictionary from the text, then encode our sentence with it?

One way to think of this is as encoding dummy variables in a regression model: each word is one category of the dummy variable.

Challenge: what if we have tens of thousands of categories?
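
To get a feel for the challenge, here is a small sketch. The vocabulary size and the word position are made-up numbers for illustration. A dense one-hot vector needs one slot per vocabulary word, while a sparse representation only stores where the single 1 is.

```python
# Hypothetical vocabulary of 50,000 words, for illustration only.
vocab_size = 50_000
word_index = 1_234  # made-up position of one word in that vocabulary

# Dense one-hot vector: one slot per vocabulary word, with a single 1.
dense = [0] * vocab_size
dense[word_index] = 1

# Sparse alternative: store only the position of the non-zero entry.
sparse = {word_index: 1}

print(len(dense), sum(dense))  # 50000 slots, only one of them non-zero
print(len(sparse))             # 1 stored entry
```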

In [1]:
# lowercased, punctuation removed text
docs = [
    'dog bites man',
    'man bites dog',
    'dog eats meat',
    'man eats dog food',
    # 'its sunny today',
]

In [2]:
# taken from https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch3/01_OneHotEncoding.ipynb
vocab = {}
count = 0
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            count = count +1
            vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [3]:
# modified from https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch3/01_OneHotEncoding.ipynb

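def get_onehot_vector(sentence):
    # Returns one vector per word in the sentence.
    onehot_encoded = []
    for word in sentence.split():
        temp = [0] * len(vocab)
        if word in vocab:
            # vocab indices start at 1, list positions start at 0
            temp[vocab[word] - 1] = 1
        # words outside the vocabulary stay as an all-zero vector
        onehot_encoded.append(temp)
    return onehot_encoded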

For each piece of text, you get a collection of vectors:

In [4]:
# dummy variables
get_onehot_vector(docs[1])

[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

### Immediate issues

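* The vocabulary is fixed - it is determined by the document set, so unseen words are encoded as all zeros
* The number of vectors depends on the length of the text **- bad for data storage**

This is not very useful: the two documents below produce 3 and 4 vectors respectively.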

In [5]:
len(get_onehot_vector(docs[1]))

3

In [6]:
len(get_onehot_vector(docs[3]))

4

## Bag-of-words

We can take a step further from one-hot encoding and create a bag of words. This collapses each piece of text from a collection of vectors into a single vector.

Personally, I have found it less useful in applications because there are better methods. But let's take a quick example to see what bag-of-words looks like. We will skip N-grams -- please refer to the book if you are interested. An N-gram model is essentially bag-of-words built over sequences of 2 or more consecutive words, which captures phrases and word relations instead of single words.

In [7]:
# Modified from https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch3/02_Bag_of_Words.ipynb
from sklearn.feature_extraction.text import CountVectorizer

#look at the documents list
print("Our corpus: ", docs)

count_vect = CountVectorizer()
#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog': ", bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])
print("BoW representation for 'dog and dog are friends':", temp.toarray())

Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats dog food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog':  [[1 1 0 0 1 0]]
BoW representation for 'dog and dog are friends': [[0 2 0 0 0 0]]


Note that these two documents get the exact same representation -- bag-of-words ignores word order, so it does not take context into account at all.

In [8]:
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog': ", bow_rep[1].toarray())

BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog':  [[1 1 0 0 1 0]]
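
To see how N-grams bring back some word order, here is a minimal pure-Python sketch. The helper `ngram_counts` is a name I made up for this example; it is not part of scikit-learn.

```python
from collections import Counter

def ngram_counts(text, n=2):
    # Count every run of n consecutive words in the text.
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)

a = ngram_counts("dog bites man")
b = ngram_counts("man bites dog")
print(a)  # counts for 'dog bites' and 'bites man'
print(b)  # counts for 'man bites' and 'bites dog'

# Single words cannot tell the two sentences apart, but bigrams can.
print(ngram_counts("dog bites man", n=1) == ngram_counts("man bites dog", n=1))  # True
print(a == b)  # False
```

In scikit-learn, the `ngram_range` parameter of `CountVectorizer` plays a similar role, for example `CountVectorizer(ngram_range=(1, 2))` to count both single words and bigrams.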


## TF-IDF

TF-IDF stands for **term frequency–inverse document frequency.** 

It is a simple idea but still very powerful. It is still used heavily in keyword search and for surfacing important keywords in a collection of documents.

<img src="images/week3-tf-idf-chris-albon.webp">

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(docs)

#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)

IDF for all words in the vocabulary [1.51082562 1.         1.51082562 1.91629073 1.22314355 1.91629073]
----------
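
The IDF values above can be reproduced by hand. With its default `smooth_idf=True`, scikit-learn documents the formula as `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing the term. The sketch below is pure Python; `smoothed_idf` is a helper name I chose for this example.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats dog food"]
vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

def smoothed_idf(word):
    # Document frequency: number of documents that contain the word.
    df = sum(word in doc.split() for doc in docs)
    # Smoothed IDF, following the formula in the scikit-learn documentation.
    return math.log((1 + n_docs) / (1 + df)) + 1

for word in vocab:
    print(f"{word:6s} {smoothed_idf(word):.8f}")
```

`dog` appears in every document, so it gets the lowest possible value of 1, while `food` and `meat` appear only once and get the highest.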


In [10]:
# Note: this line from the book's notebook throws an error. get_feature_names() was removed in
# scikit-learn 1.2; dir(tfidf) shows the replacement, get_feature_names_out().
print("All words in the vocabulary",tfidf.get_feature_names())

AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names'

In [11]:
#All words in the vocabulary.
print("All words in the vocabulary",tfidf.get_feature_names_out())
print("-"*10)

#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

All words in the vocabulary ['bites' 'dog' 'eats' 'food' 'man' 'meat']
----------
TFIDF representation for all documents in our corpus
 [[0.69113141 0.4574528  0.         0.         0.55953044 0.        ]
 [0.69113141 0.4574528  0.         0.         0.55953044 0.        ]
 [0.         0.37919167 0.5728925  0.         0.         0.72664149]
 [0.         0.34399327 0.51971385 0.65919112 0.42075315 0.        ]]
----------
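
Each row above is the term count times the IDF, rescaled so the row has length 1 (`TfidfVectorizer` uses `norm='l2'` by default). Here is a minimal pure-Python sketch of that calculation. `tfidf_row` and `corpus` are names I made up for this example, and the sketch assumes the default settings of `TfidfVectorizer`.

```python
import math
from collections import Counter

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats dog food"]
terms = sorted({word for doc in corpus for word in doc.split()})

def tfidf_row(text):
    counts = Counter(text.split())
    raw = []
    for term in terms:
        df = sum(term in doc.split() for doc in corpus)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
        raw.append(counts[term] * idf)                    # term count times IDF
    # L2 normalization: divide by the Euclidean length of the row.
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

print(terms)
print([round(x, 8) for x in tfidf_row("dog bites man")])
print([round(x, 8) for x in tfidf_row("man eats dog food")])
```

Words that are not in `terms` never enter the loop, so they contribute nothing to the row.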


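This still suffers from the same out-of-vocabulary problem: words that were not in the original corpus are silently ignored, so the two sentences below get identical vectors.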

In [12]:
temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

Tfidf representation for 'dog and man are friends':
 [[0.         0.63295194 0.         0.         0.77419109 0.        ]]


In [13]:
temp = tfidf.transform(["dog and man are friends that play together"])
print("Tfidf representation for 'dog and man are friends that play together':\n", temp.toarray())

Tfidf representation for 'dog and man are friends that play together':
 [[0.         0.63295194 0.         0.         0.77419109 0.        ]]


### Advantages of TF-IDF

* **Fast to compute**
* Fits well with human understanding and current use of text search - rare terms are more specific.

What do I mean by "rare terms are more specific"? We can build a quick text search function by ranking the documents with the TF-IDF score of our search term!

In this example, searching for `Lego` would yield `The Lego Movie` with a higher TF-IDF score than searching for `love` would yield for `Sleepless in Seattle`.

<img src="images/week3-tf-idf-search.png">

Screenshot from Chapter 3 of *Relevant Search* by Doug Turnbull and John Berryman
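
Here is a minimal sketch of such a search function. The one-line movie descriptions are made up for illustration and are not the data from the book, and `idf` and `search` are hypothetical helper names. The scoring is plain term count times smoothed IDF.

```python
import math

# Hypothetical mini-corpus: made-up descriptions, for illustration only.
movies = {
    "The Lego Movie": "lego builder saves the world",
    "Sleepless in Seattle": "love story in seattle",
    "Love Actually": "love stories in london",
    "Crazy Stupid Love": "love advice goes wrong",
}

def idf(term):
    # Smoothed IDF: rare terms get a higher weight.
    df = sum(term in text.split() for text in movies.values())
    return math.log((1 + len(movies)) / (1 + df)) + 1

def search(term):
    # Score each document by term count times IDF, highest score first.
    scores = {title: text.split().count(term) * idf(term) for title, text in movies.items()}
    hits = [(title, score) for title, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(search("lego"))
print(search("love"))
```

`lego` appears in only one description, so its single hit scores higher than any of the three `love` hits.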

**PechaKucha candidate: What is BM25 scaling?**

**PechaKucha candidate: What is Zipf's law?**

## Sparse embeddings vs dense embeddings

At the beginning of class, we saw `sentence-transformers`, which produces a **dense embedding**.

Bag-of-words and TF-IDF produce **sparse embeddings**.

In general, sparse embeddings are great for matching keywords. Dense embeddings are better at capturing **context**.

## Word embeddings

We will not go into how Word2Vec models are trained. Instead, let's play with the results.

* JS demo for vector algebra: https://turbomaze.github.io/word2vecjson/
* https://projector.tensorflow.org/
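
The vector algebra in the first demo can be sketched with toy numbers. The 3-d vectors below are hand-made for illustration (loosely: royalty, male, female) and are not real Word2Vec output. Real embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Hand-made toy vectors, NOT real Word2Vec embeddings.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# king - man + woman, computed dimension by dimension.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Look for the closest remaining word, excluding the three input words.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = min(candidates, key=lambda w: math.dist(target, candidates[w]))
print(best)  # queen
```

This sketch uses Euclidean distance for brevity. Cosine similarity is the more common choice for comparing embeddings.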

## Transformer based embeddings

In transformer-based embeddings, we take the last hidden layer as the numerical representation of the text. Why does this work? We will learn more in the transformer lectures.
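
The last hidden layer holds one vector per token, so we still need to collapse it into a single vector for the whole text. One common choice, assumed here, is mean pooling: average the token vectors dimension by dimension. The numbers below are made up and do not come from a real model.

```python
# Made-up "last hidden layer" output: one 4-d vector per token.
token_vectors = [
    [0.2, 0.4, 0.0, 1.0],  # "dog"
    [0.6, 0.0, 0.2, 0.2],  # "bites"
    [0.1, 0.2, 0.4, 0.0],  # "man"
]

def mean_pool(vectors):
    # Average each dimension across tokens to get one fixed-length vector.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

sentence_vector = mean_pool(token_vectors)
print(len(sentence_vector))  # 4, no matter how many tokens the text has
print([round(x, 2) for x in sentence_vector])
```

Unlike one-hot encoding, the result has the same length for every text, which makes it easy to store and compare.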

## Finding neighbors


<img src="images/week3-cosine-similarity.png">

Formula for cosine similarity, a common measure of "similarity" for embeddings
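
Here is a minimal pure-Python sketch of cosine similarity and a top-k neighbor lookup. It reuses the bag-of-words counts from earlier as the vectors. `cosine_similarity` and `top_k` are helper names written for this example, not library functions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, vectors, k=2):
    # Rank every stored vector by similarity to the query, best first.
    ranked = sorted(vectors, key=lambda name: cosine_similarity(query, vectors[name]), reverse=True)
    return ranked[:k]

# Bag-of-words counts over the vocabulary [bites, dog, eats, food, man, meat].
bow = {
    "dog bites man":     [1, 1, 0, 0, 1, 0],
    "dog eats meat":     [0, 1, 1, 0, 0, 1],
    "man eats dog food": [0, 1, 1, 1, 1, 0],
}
query = [1, 1, 0, 0, 1, 0]  # "man bites dog"

print(top_k(query, bow, k=2))  # ['dog bites man', 'man eats dog food']
```

The same two functions work unchanged on dense embeddings. Only the vectors change.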

## Evaluating embeddings

Before we go too far into work on embeddings, let's stop and think: how do we evaluate embeddings? What are we looking at?

Let's take a look at how researchers generated a dataset for benchmarking algorithms.

**SimLex-999**

* https://fh295.github.io//simlex.html
* https://aclanthology.org/J15-4004.pdf

This dataset contains 999 pairs of English words, labeled by humans on Amazon Mechanical Turk for their similarity (but not relatedness). **Please take a look at their annotation guideline.**

The researchers used a scale of 0-6 to rate similarity, then rescaled it to 0-10 to produce the similarity score.

To reproduce this notebook, [download the dataset from here](https://fh295.github.io//SimLex-999.zip), unzip, then drop `SimLex-999.txt` into the `data/` folder.

In [14]:
import pandas as pd

In [15]:
df_simlex = pd.read_csv('data/SimLex-999.txt', sep='\t')

In [16]:
df_simlex.head(3)

Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19


In [17]:
df_simlex.sample(3, random_state=42)

Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
453,butter,potato,N,1.22,4.9,4.85,4,0.27,0,1.19
793,choose,elect,V,7.62,2.62,2.41,1,1.28,1,1.14
209,bread,flour,N,3.33,4.92,4.97,4,1.42,1,1.25


In [18]:
df_simlex.describe(include='float')

Unnamed: 0,SimLex999,conc(w1),conc(w2),Assoc(USF),SD(SimLex)
count,999.0,999.0,999.0,999.0,999.0
mean,4.561572,3.657087,3.568629,0.751512,1.274505
std,2.614663,1.13105,1.159572,1.344569,0.366278
min,0.23,1.19,1.19,0.0,0.34
25%,2.38,2.62,2.5,0.14,1.075
50%,4.67,3.83,3.66,0.25,1.31
75%,6.75,4.79,4.75,0.68,1.54
max,9.8,5.0,5.0,8.85,2.18


In [19]:
df_simlex.query('SimLex999 > 9.0').head(5)

Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
6,happy,glad,A,9.17,2.56,2.36,1,5.49,1,1.59
8,stupid,dumb,A,9.58,1.75,2.36,1,5.26,1,1.48
16,insane,crazy,A,9.57,1.77,2.37,1,2.09,1,0.92


Problem with evaluation:

Words are context dependent. *Big*, *large*, and *huge* can have different meanings in different contexts. Think about the following sentences:

* My big brother
* My large brother
* My huge brother

But their (human) similarity scores were high -- we must consider the downstream task when evaluating similarities. More to come.

In [None]:
df_simlex[lambda df: df['word1'] == 'large']

**Remember this for the project!!** How you evaluate is often more important than what you did.