## Word Representation

Word Embeddings yay!

- So far we used one-hot encoding for representing words/chars

Motivating example:

Man: $[0,1,....,0]$


Woman: $[0,0,0,0,0,0,0,1,....,0]$


- No way to associate similar words using one-hot encoding.
- Thus no generalization across words/concepts... 
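A quick sanity check of the point above, as a minimal sketch (the toy vocabulary size and the word indices are made up for illustration):

```python
import numpy as np

vocab_size = 10  # toy vocabulary; the notes use 10k

def one_hot(index, size=vocab_size):
    """Return a one-hot vector with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Hypothetical indices, chosen only for this example.
man, woman, apple = one_hot(1), one_hot(7), one_hot(3)

# Every pair of distinct one-hot vectors has inner product 0,
# so "man" is exactly as unrelated to "woman" as it is to "apple".
print(man @ woman, man @ apple)
```

Whatever indices we pick, the inner product between two different words is always zero, so the encoding carries no notion of similarity.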

Solution:

- Learning featurized representation: word embedding
- dim=0 gender    man=1
- dim=1 royal
- dim=2 age
- dim=3 food    apple/orange have high value
- ... dim=300 ....


Now, with these representations, apple and orange will have similar vectors.


Of course, manually choosing these definitions for each dimension is not a feasible option. Instead we let the network learn them.


Use t-SNE to visualize the high-dimensional (300-dim) vectors in 2D.

The reason we say embedding is:

- Imagine a 3D cube
- Each point is the embedding of a word. So we embed each word as a point in that X-dimensional space.

## Using Word Embeddings


How can we plug them into applications?


Example: Named Entity Recognition


    Training:  Sally Johnson is an orange farmer. 

    Test: Robert Lin is an apple farmer (durian cultivator).

If you have little training data you might not have seen durian in your named entity recognition training.


The advantage of word embeddings is that they are learned unsupervised on huge datasets, so the durian embedding will be similar to the orange one.


Setting:

- 1B words for training word embeddings
- 100k words for your named entity recognition task
- So we employ transfer learning to utilize the word embeddings.
- Also note that usually embeddings are much smaller in size (300) compared to OHE (10k vocab).
- (Optional) Continue fine-tuning your word embeddings on the new data. This is only done if you have enough data.


**Finally relation to Face Encoding**

- For face recognition, we used Siamese networks which had image encodings and compared the similarity.

- For word embeddings, we have a fixed vocabulary and learn one fixed embedding per word. Face recognition with CNNs, on the other hand, lets us input any image to the embedding model. So we cannot input "afkdafgdslkafna;sd" and expect to get an embedding.


## Properties of Word Embeddings

There were interesting analogies using word embeddings. Classic example:

$$e_{queen} \approx e_{king} - e_{man} + e_{woman}$$


Implementation:

```
argmax_w sim(e_w, e_king - e_man + e_woman)
```

You can see almost 70% analogy accuracy on an analogy dataset! The similarity is computed on the full 300-dimensional embedding vectors.
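A minimal sketch of the argmax search, assuming cosine similarity as `sim` and hand-made 3-dim toy vectors (the values are invented so that the analogy works out). Excluding the three query words from the candidates is a common convention, not something the lecture states.

```python
import numpy as np

# Toy embeddings (dims: gender, royal, age); values are illustrative only.
emb = {
    "man":   np.array([ 1.0, 0.0, 0.5]),
    "woman": np.array([-1.0, 0.0, 0.5]),
    "king":  np.array([ 1.0, 1.0, 0.7]),
    "queen": np.array([-1.0, 1.0, 0.7]),
    "apple": np.array([ 0.0, 0.0, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """Solve a : b as c : ? by maximizing cosine similarity."""
    target = emb[c] - emb[a] + emb[b]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "woman", "king", emb))
```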


Important note about t-SNE:

- It maps the input vectors to 2D in a veeery non-linear complicated way.
- After t-SNE, you can not use the conventional distance metrics on the t-SNE vectors. They are only for visualization.



#### Common similarity metrics 

Cosine similarity.

$$
sim(u,v) = \dfrac{u^{T}v}{\lvert\lvert u\rvert\rvert_2 \cdot \lvert\lvert v\rvert\rvert_2}
$$


You can also use the squared Euclidean distance, but this is less frequently used. Note that it measures dissimilarity, so smaller means more similar:


$$
\lvert\lvert u-v\rvert\rvert^2 
$$
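To see how the two metrics differ, here is a small sketch on a toy pair of vectors that point in the same direction but have different lengths (the vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(u, v):
    """Higher means more similar; depends only on the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def squared_euclidean(u, v):
    """Lower means more similar; also sensitive to vector length."""
    return float(np.sum((u - v) ** 2))

u = np.array([1.0, 2.0, 3.0])
v = 2 * u  # same direction, twice the length

print(cosine_similarity(u, v))   # angle between u and v is zero
print(squared_euclidean(u, v))   # but the two points are not close
```

Cosine similarity ignores the length difference entirely, while the squared distance does not.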


Some analogy examples:

- Man:Woman as Boy:Girl
- Ottawa:Canada as Nairobi:Kenya
- Currencies of countries

## Embedding Matrix

Imagine again our same 10k vocabulary. 

We will have a 300x10k matrix $E$. Each column is the embedding representation.


Notation:

- $O_{6257}$ used to denote one hot encoding of 'orange'

$$
E \cdot O_{6257} = e_{6257} \quad \text{(embedding of orange, shape } 300 \times 1\text{)}
$$
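A small sketch of the same identity, with a random matrix standing in for a learned $E$ and 6257 treated as a 0-based column index (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, vocab_size = 300, 10_000
E = rng.normal(size=(embed_dim, vocab_size))  # random stand-in for a learned E

orange = 6257                     # index used in the notes
o = np.zeros(vocab_size)
o[orange] = 1.0                   # O_6257

via_matmul = E @ o                # shape (300,)
via_lookup = E[:, orange]         # same vector, no multiplication needed

print(np.allclose(via_matmul, via_lookup))
```

Since the product just selects one column, implementations normally do the lookup directly instead of multiplying by a mostly-zero vector.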


## Learning Word Embeddings: Word2Vec & GloVe

We will learn concrete algorithms to learn embeddings.

Historically people proposed very complex algorithms to learn these embeddings. Over time the algorithms interestingly got simpler aaand simpler.


Andrew goes over the Bengio et al. 2003 paper, which Andrej Karpathy also references in his playlist.

Example:

    I want a glass of orange ___________ . 



- Get the embedding of all words until the blank token.
- Each is a 300-dim vector, so 6 x 300 = 1800 inputs
    - Alternatively we can also limit the context to last 4 words (context window length)
- Input all these embeddings to a neural network.
- Softmax with 10k outputs to predict the correct word!! 
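The steps above can be sketched as a forward pass. This is only a rough sketch with random (untrained) weights and a 4-word context window; the hidden layer size, the tanh activation and the word ids are assumptions, not details from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, window, hidden = 10_000, 300, 4, 128  # hidden size is an assumption

E  = rng.normal(scale=0.01, size=(embed_dim, vocab_size))    # embedding matrix (learned)
W1 = rng.normal(scale=0.01, size=(hidden, window * embed_dim))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(vocab_size, hidden))
b2 = np.zeros(vocab_size)

def predict_next(context_ids):
    """Return a probability distribution over the next word."""
    x = np.concatenate([E[:, i] for i in context_ids])   # 4 x 300 = 1200 inputs
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    logits -= logits.max()                                # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                    # softmax over 10k words

probs = predict_next([11, 42, 7, 6257])   # made-up ids for "a glass of orange"
print(probs.shape, probs.sum())
```

Training would backpropagate the cross-entropy loss into `E` as well, which is how the embeddings get learned.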


This algorithm will be able to learn pretty good embeddings. It will learn that apple and orange are similar etc.



However, the above method is NOT the only way to train the embeddings. Let's look at other approaches.

### Other context/target pairs

example:

    I want a glass of orange juice to go along with my cereal.
    

If your goal is to have a language model, then last N words as context makes sense. However, if you only want good representations given an input sequence you can also use left&right context:

Context:

- Last 4 words: This is the main way if we want to learn a language model
- 4 words on left & right (predict the word in the middle)
- Last 1 word (simpler context)
- Nearby 1 word: Skipgram model which is found to work remarkably well




## Word2Vec

**Skip-gram model**

example:

    I want a glass of orange juice to go along with my cereal.


- Pick a context word, then randomly pick a target word within a window of ±5 words around it.
- Try to predict the target given the context
- Goal is not really to do well on the prediction task but to learn good embeddings.
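A minimal sketch of generating training pairs, assuming the context word is picked first (uniformly here, for simplicity) and the target is drawn from its ±5 window. The helper name `sample_pair` is made up for this example.

```python
import random

sentence = "I want a glass of orange juice to go along with my cereal".split()

def sample_pair(tokens, window=5, rng=random):
    """Pick a context word, then a target within +-window positions of it."""
    c = rng.randrange(len(tokens))
    lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
    t = rng.choice([i for i in range(lo, hi + 1) if i != c])
    return tokens[c], tokens[t]

random.seed(0)
for _ in range(3):
    print(sample_pair(sentence))
```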


Model.

- Vocab size = 10k
- learn mapping from context (c) to target (t) 
- start by OHE for context `c` and input the embedding to a network
- output is the softmax unit to predict `t`

$$
P(t|c) = \dfrac{e^{\Theta_t^T e_c}}{\sum_{j=1}^{10k} e^{\Theta_j^T e_c}}\\
L = - \sum_{i=1}^{10k} y_i \log(\hat{y}_i)
$$

where $\Theta_t$ is the output layer weight vector associated with target word $t$
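The softmax and loss can be sketched as follows, with random stand-ins for the learned parameters and placeholder word ids (all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 300
E     = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # embeddings e_c (columns)
Theta = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # output weights Theta_t (rows)

def skipgram_probs(c):
    """P(t | c) for every t in the vocabulary."""
    scores = Theta @ E[:, c]          # Theta_t^T e_c for all t
    scores -= scores.max()            # stabilize the exponentials
    exp = np.exp(scores)
    return exp / exp.sum()            # this sum runs over all 10k words

def loss(c, t):
    """Cross-entropy with a one-hot y reduces to -log P(t | c)."""
    return float(-np.log(skipgram_probs(c)[t]))

print(loss(c=6257, t=4834))   # ids are placeholders
```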


#### Problem with Softmax Classification

Computation is slow

- For each sample we need to calculate a 10k sum to calculate the denominator. 
- One solution is to use Hierarchical Softmax. Arrange the vocabulary as a tree of maaany binary classifiers, so the cost per example drops from $O(v)$ to $O(\log v)$.
- In practice common words are placed near the top of the tree since they occur more frequently, so they are reached in fewer steps.


Another solution is negative sampling which we will cover in next lecture.


Last comment:


**How to sample the context c?**

Probably you want to avoid frequent context words such as "the, of, a, and, to" since they will appear so many times.

You should use different heuristics to balance less frequent words and frequent words so you can learn good embeddings for your whole vocabulary.

## Negative Sampling

Softmax denominator is slow to compuuute.

Negative sampling lets you do something very similar to skip-gram, but much faster.


    I want a glass of orange juice to go along with my cereal.
    

- Generate a positive pair of context and target. For example: orange (c) and juice (t)
    - label = 1
- Generate k=(5-20) negative example pairs: Randomly pick words from your vocabulary for the target
    - orange (c) and king (t)
    - ....
    - It is fine if your randomly selected `t` actually appears in the sentence. 
    - label for these pairs is 0
- Define a supervised task:
    - Given a (context, target) pair, predict whether its label is 0 or 1.
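A minimal sketch of building the training rows for one positive pair. The tiny vocabulary and the helper name `make_examples` are made up, and negatives are drawn uniformly here just to keep it short.

```python
import random

vocab = ["a", "book", "glass", "juice", "king", "of", "orange", "the", "want"]

def make_examples(context, target, vocab, k=5, rng=random):
    """One positive pair plus k random negatives for the same context."""
    examples = [(context, target, 1)]
    for _ in range(k):
        # Uniform sampling for simplicity; a negative may happen to be a true neighbour.
        examples.append((context, rng.choice(vocab), 0))
    return examples

random.seed(0)
for row in make_examples("orange", "juice", vocab):
    print(row)
```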
    
    
### Model.

Instead of using a softmax now we will have logistic regression for each sample.


$$
P(y=1|c,t) = \sigma(\Theta_t^{T} e_c)
$$

We will have a k:1 ratio of negative to positive labels (y=0 vs y=1).


One way to look at it: instead of making one 10k-class prediction, we formulate the task as 10k binary classification problems, one per target word. On each training step we only update `k+1` of them (1 positive and k negatives) instead of all 10k, so it is much, much faster.
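A sketch of the loss for one training step, assuming plain binary cross-entropy over the `k+1` sampled targets and random stand-in parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(e_c, theta_rows, labels):
    """Binary cross-entropy over the k+1 sampled targets only.

    e_c:        (d,) embedding of the context word
    theta_rows: (k+1, d) output vectors Theta_t of the sampled targets
    labels:     (k+1,) 1 for the true target, 0 for negatives
    """
    p = sigmoid(theta_rows @ e_c)          # P(y=1 | c, t) for each sampled t
    return float(-np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

rng = np.random.default_rng(0)
d, k = 300, 5
e_c = rng.normal(scale=0.1, size=d)
theta_rows = rng.normal(scale=0.1, size=(k + 1, d))
labels = np.array([1, 0, 0, 0, 0, 0])

print(negative_sampling_loss(e_c, theta_rows, labels))
```

Only `k+1` rows of $\Theta$ are touched per step, which is where the speedup over the full softmax comes from.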



#### Selecting the negative examples


Some approaches:

- Sample proportional to occurrence frequency: common words are sampled way too often
- Sample uniformly at random: rare words are sampled too often
- Authors' proposal: somewhere in the middle:

$$
P(w_i) = \dfrac{f(w_i)^{3/4}}{\sum_j f(w_j)^{3/4}}
$$

where $f(w_i)$ is the frequency of $w_i$ in the corpus
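A small sketch of what the 3/4 power does, using made-up counts for three words:

```python
import numpy as np

counts = {"the": 1000, "orange": 50, "durian": 1}   # made-up corpus counts

words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

unigram  = freq / freq.sum()
smoothed = freq ** 0.75 / np.sum(freq ** 0.75)

for w, p_raw, p_34 in zip(words, unigram, smoothed):
    print(f"{w:>7}  unigram={p_raw:.4f}  3/4-power={p_34:.4f}")

# Draw k=5 negatives from the smoothed distribution.
rng = np.random.default_rng(0)
print(rng.choice(words, size=5, p=smoothed))
```

Compared to raw frequencies, the frequent word loses some probability mass and the rare word gains some, which is the "somewhere in the middle" behaviour.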

## GloVe word Vectors

Some momentum recently hehe

Name: Global vectors for word representation!!

The core idea is, instead of predicting a binary label as in skip-gram with negative sampling, let's generalize this co-occurrence idea to all pairs and try to predict the count instead.




First we create a matrix $\mathcal{X}$ storing the counts 

$$
\mathcal{X}_{ij} = \text{number of times j (t) appears in the context of i (c)}
$$

Depending on the definition of `context` we might even have $\mathcal{X}_{ij}=\mathcal{X}_{ji}$. For example, if context means within 10 words on either side, the relationship is symmetric. If context means the word that immediately precedes the target, it is NOT.
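A minimal sketch of counting co-occurrences with a symmetric window (the toy sentence and the window of 2 are arbitrary choices, and the counts are stored in a `Counter` rather than a dense matrix):

```python
from collections import Counter

tokens = "i want a glass of orange juice and a glass of apple juice".split()

def cooccurrence(tokens, window=2):
    """X[(i, j)] = number of times word j appears within `window` words of word i."""
    X = Counter()
    for pos, wi in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                X[(wi, tokens[other])] += 1
    return X

X = cooccurrence(tokens)
print(X[("glass", "of")], X[("of", "glass")])
```

With a window that looks both left and right, every count shows up in both directions, so the table is symmetric.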



GloVe tries to minimize the following expression:

$$
L = \sum_{i=1}^{10k} \sum_{j=1}^{10k} f(\mathcal{X}_{ij}) (\Theta_i^{T} e_j + b_i + b'_j - \log(\mathcal{X}_{ij}))^2
$$

$\log(0)$ is undefined, so we give those pairs zero weight: $f(\mathcal{X}_{ij})=0$ if $\mathcal{X}_{ij}=0$, with the convention $0 \cdot \log(0) = 0$.


Weighting term ($f(\mathcal{X}_{ij})$) also helps with the following:

- For common frequent words 'this, is, of, a' we want to give a large weight, but not so large that they dominate the training.
- Give meaningful amount of weight to rare words so we can learn them too. Words such as Durian.
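A rough sketch of the objective on a tiny random problem. The weighting function below (power 3/4, capped at `x_max=100`) is the form proposed in the GloVe paper, not something spelled out in the lecture. The code uses a separate bias vector `b_prime` for the second index and stores embeddings as rows, both of which are choices made for this example.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 for zero counts, grows sub-linearly, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, Theta, E, b, b_prime):
    """Weighted squared error between Theta_i^T e_j + biases and log X_ij.

    X: (V, V) counts; Theta, E: (V, d) with one row per word; b, b_prime: (V,)
    """
    pred = Theta @ E.T + b[:, None] + b_prime[None, :]
    log_X = np.log(np.where(X > 0, X, 1.0))   # placeholder where X_ij = 0; weight is 0 there
    return float(np.sum(weight(X) * (pred - log_X) ** 2))

rng = np.random.default_rng(0)
V, d = 5, 3
X = rng.integers(0, 20, size=(V, V)).astype(float)
Theta, E = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_prime = np.zeros(V), np.zeros(V)

print(glove_loss(X, Theta, E, b, b_prime))
```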


**Something I don't fully comprehend**

Andrew says $\Theta_i$ and $e_j$ play symmetric roles in this objective. We initialize both matrices randomly at the beginning, and when we finish the training we can average them to get the final embeddings:

$$
e_{j}^{final} = \dfrac{e_j + \Theta_j}{2}
$$

Turns out this simple approach of just fitting the log co-occurrence counts with a weighted squared error actually works quite well.


I think overall Andrew skipped many details about the paper and it was very confusing. Like he didn't even bother mentioning why we have the bias terms.

Paper is god: https://nlp.stanford.edu/pubs/glove.pdf

**Okay Got it!!**

I was confused a lot about the part that $\Theta$ and $e$ are symmetric. It is actually quite straightforward. In this GloVe setup the definition of context is symmetric: word j appears within ±10 words of word i. So the matrix $\mathcal{X}$ is symmetric, and swapping the roles of $\Theta$ and $e$ leaves the objective unchanged. The two parameter sets therefore end up as equivalent estimates that differ only because of their random initialization, which is why averaging them makes sense.

#### A note on the featurization view of word embeddings

When we learn embeddings the GloVe or Word2Vec way, we can NOT expect individual dimensions to correspond to something meaningful. The learned axes will not necessarily be aligned with the concepts we have in mind, and may not even be orthogonal.

Some linear algebra:

$$
(A\Theta_i)^T(A^{-T}e_j) = \Theta_i^{T} e_j
$$

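The above shows that for any invertible matrix $A$, the transformed vectors $A\Theta_i$ and $A^{-T}e_j$ give exactly the same inner product, so the objective cannot tell the two parameterizations apart. The learned axes can therefore differ from the human-interpretable ones we expect. But the analogy idea still works even though individual axes don't line up with concepts.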
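A quick numerical check of the identity, using random vectors and an arbitrary invertible matrix (an upper-triangular one picked just for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta_i, e_j = rng.normal(size=d), rng.normal(size=d)

# Arbitrary invertible matrix: upper triangular with a nonzero diagonal.
A = np.array([[2.0, 1.0, 0.0,  0.0],
              [0.0, 1.0, 3.0,  0.0],
              [0.0, 0.0, 1.0, -1.0],
              [0.0, 0.0, 0.0,  0.5]])

theta_new = A @ theta_i                  # A Theta_i
e_new = np.linalg.inv(A).T @ e_j         # A^{-T} e_j

print(theta_i @ e_j, theta_new @ e_new)  # same inner product, different axes
```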




## Sentiment Classification

Look at a piece of text and predict the sentiment.

Using embeddings, you can train a good sentiment classifier with minimal data yay!!

    Example: The dessert is excellent              -> 4 stars

1. Simple model.

- Take OHE for each word and get the embeddings
- Sum or average the embeddings to get 300dim vector.
- Pass to softmax and output between 1-5

Problem: it lacks word order information. So it fails on negative reviews like "lacking in good service", because the word "good" looks positive.
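A sketch of the simple averaging model with random (untrained) weights and a tiny made-up vocabulary, mainly to show that word order really is lost:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "dessert": 1, "is": 2, "excellent": 3,
         "lacking": 4, "in": 5, "good": 6, "service": 7}
embed_dim, n_classes = 300, 5
E = rng.normal(scale=0.1, size=(embed_dim, len(vocab)))   # stand-in for pretrained embeddings
W = rng.normal(scale=0.1, size=(n_classes, embed_dim))    # softmax layer (untrained here)
b = np.zeros(n_classes)

def sentence_vector(text):
    ids = [vocab[w] for w in text.lower().split()]
    return E[:, ids].mean(axis=1)          # average of the word embeddings

def star_probs(text):
    logits = W @ sentence_vector(text) + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                  # probabilities for 1..5 stars

# Any permutation of the words gives the same input vector.
print(np.allclose(sentence_vector("lacking in good service"),
                  sentence_vector("good service lacking in")))
```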

2. More sophisticated model: RNN

- Take OHE for each word and get the embeddings.
- Feed all embeddings to an RNN.
- Many-to-one architecture: the output at the last word is used to predict the rating.

Embeddings help us generalize to words that don't appear in our sentiment classification fine-tuning dataset which could be small.

## Debiasing Word Embeddings

Problem: Embeddings learn biased representations from the training data!!

Bias here means bias around gender, ethnicity, etc. Awful examples:

    Man:Computer_Programmer as Woman:Homemaker
    Father:Doctor as Mother:Nurse
    
    
    
Embeddings reflect the gender, ethnicity, age, sexual orientation and other biases of the text used to train the model!!!

**How can we avoid this?**

Paper: https://arxiv.org/pdf/1607.06520.pdf 

I LOVED THE IDEA!!!


Three steps:

- Identify the bias direction: One bias at a time. Example: gender
    - We do this by subtracting many pairs as below and averaging them:
        - $e_{he} - e_{she}$
        - $e_{male} - e_{female}$
        - ...
        - average all these differences to get the bias direction.
- Neutralize: For every word that is not definitionally related to that concept (in our case gender), project to get rid of the bias.
    - So definitionally related words for gender are words like 'man' 'father' etc.
    - This stops `doctor` from having any gender-related component. The gender dimension for doctor, soccer player, nurse will all be 0 (meaning neutral).
- Equalize: For all pairs of words where the only difference is gender (grandmother-grandfather or boy-girl), equalize their vectors in the remaining 299 dimensions.
    - This helps us avoid grandmother being closer to nurse than grandfather is.
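The first two steps can be sketched as below, assuming the bias direction is a plain average of pair differences as described in these notes (the paper's actual procedure is more involved). The 3-dim toy vectors are made up, and the equalize step is left out.

```python
import numpy as np

def bias_direction(pairs, emb):
    """Average the difference vectors of definitional pairs, e.g. (he, she)."""
    diffs = [emb[a] - emb[b] for a, b in pairs]
    g = np.mean(diffs, axis=0)
    return g / np.linalg.norm(g)           # unit-length bias direction

def neutralize(e, g):
    """Remove the component of e along the unit bias direction g."""
    return e - (e @ g) * g

# Toy 3-dim embeddings with made-up values.
emb = {
    "he":     np.array([ 1.0, 0.2, 0.1]),
    "she":    np.array([-1.0, 0.2, 0.1]),
    "male":   np.array([ 0.9, 0.0, 0.3]),
    "female": np.array([-0.9, 0.0, 0.3]),
    "doctor": np.array([ 0.4, 0.8, 0.5]),
}

g = bias_direction([("he", "she"), ("male", "female")], emb)
doctor = neutralize(emb["doctor"], g)
print(doctor, doctor @ g)
```

After neutralizing, `doctor` has no component left along the bias direction, while its other components are unchanged.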
    
    

How do we choose which words to neutralize in step 2? 

The authors train a classifier to decide whether a word is definitional or not, and find that many words are not definitional. Hmmm what kind of classifier?

Also for step 3, the set of pairs we equalize is apparently small too.

