# Session 3, part II

## Vector semantics and embeddings

<img src="images/_0.jpg" width="50%">

Learning goals
============


- Appreciating the anatomy of $\texttt{word2vec}$ ⬅️
- Understanding the various steps to train ad hoc embeddings  ⬅️
- Assigning a role to new forms of embeddings

$\texttt{word2vec}$: the basics
===================


**Key features**

+ in 2013, it drew a line between old-school and modern NLP
+ it doesn't require hand-labeled supervision
+ easy and quite fast to train (you can do that with Gensim)
+ it's open-source software (OSS)

**Philosophy**

+ it trains a classifier on a binary prediction task:
  - "is word $\omega$ likely to show up near word $\eta$?"
+ the classification task is 'instrumental' in nature:
  - the point is not predicting the 'next' word
  - the goal is to adjust word vectors 

Let's focus on $\texttt{word2vec}$: Background
==================================

<img src="images/_4.png" width="50%">

$\texttt{word2vec}$ algorithm: Skip-Gram flavor
==================================

**!!!Boundary condition¡¡¡**

+ there are various flavors of $\texttt{word2vec}$
+ here, we focus on the Skip-Gram (SG) flavor

Skip-Gram algorithm
=================

1. treat the target word and a neighboring context word as positive examples
2. randomly sample other words in the lexicon to get negative examples
3. use logistic regression to train a classifier to distinguish those two cases
4. use the weights as the embeddings

SG training data
==============

Given the sentence:

| not in context | c1         | c2 | t       | c3  | c4 | not in context | 
| -------------- |------------|----|---------|-----|----|----------------|
| ... lemon, a   | tablespoon | of | apricot | jam | a  | pinch of ...   |

Let's assume the context words are those within a ±2-word window around the target $t$.
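
A minimal sketch of how such a window turns a sentence into (target, context) pairs. The function name `positive_pairs` and the pre-tokenized word list (punctuation dropped) are assumptions made for illustration only.

```python
def positive_pairs(tokens, target_index, window=2):
    """Return (target, context) pairs within +/- `window` positions of the target."""
    target = tokens[target_index]
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    # Every word in the window, except the target itself, is a context word.
    return [(target, tokens[i]) for i in range(start, end) if i != target_index]


# Hypothetical tokenization of the example sentence.
tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch", "of"]
pairs = positive_pairs(tokens, tokens.index("apricot"), window=2)
print(pairs)
```

With a window of 2, the pairs are exactly the c1 to c4 columns of the table, each matched with `apricot`.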

SG goal
=======
 
Given a tuple $(t, c) = (\text{target}, \text{context})$, such as:

+ $\texttt{(apricot, jam)}$
+ $\texttt{(apricot, aardvark)}$

Return the probability that $c$ is a real context word for $t$:

$P(+|t,c)$

and the probability that it is not:

$P(-|t,c)$

How to compute $P(+|t,c)$?
=====================

Intuition:

+ words are likely to appear near similar words
+ model similarity with the dot product!
+ $\text{similarity}(t,c) \propto t \cdot c$

Problem:

+ the dot product is not a probability: it ranges over $(-\infty, +\infty)$
+ (neither is cosine, which ranges over $[-1, 1]$)

Turning dot product into a probability
===============================

\begin{equation}
P(+|t,c) = \frac{1}{1+e^{-t \cdot c}}
\end{equation}

\begin{equation}
P(-|t,c) = 1 - P(+|t,c)
= \frac{e^{-t \cdot c}}{1+e^{-t \cdot c}}
\end{equation}
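
A minimal sketch of the two formulas in plain Python. The three-dimensional vectors are made-up values chosen for illustration, not trained embeddings, and the helper names are assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def p_positive(t, c):
    """P(+|t,c): the sigmoid of the dot product t . c."""
    return 1.0 / (1.0 + math.exp(-dot(t, c)))


def p_negative(t, c):
    """P(-|t,c) = 1 - P(+|t,c)."""
    return 1.0 - p_positive(t, c)


# Hypothetical toy vectors (real embeddings would have, say, 300 dimensions).
t_apricot = [0.5, -0.2, 0.1]
c_jam = [0.4, -0.1, 0.3]
c_aardvark = [-0.6, 0.3, -0.2]

print(p_positive(t_apricot, c_jam), p_negative(t_apricot, c_jam))
print(p_positive(t_apricot, c_aardvark), p_negative(t_apricot, c_aardvark))
```

A positive dot product gives a probability above 0.5, a negative one gives a probability below 0.5, and the two outputs always sum to 1.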

For all context words
=================

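Assuming the context words are independent of one another, the individual probabilities multiply:

\begin{equation}
P(+|t,c_{1:k}) = \prod_{i=1}^{k} \frac{1}{1+e^{-t \cdot c_{i}}}
\end{equation}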
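
A short sketch of the product over a window of context words, together with its log form (a sum of log-probabilities), which is usually preferred in practice because long products of small numbers underflow. The vectors and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def p_positive_window(t, contexts):
    """Product of P(+|t,c_i) over all context vectors c_1..c_k."""
    prob = 1.0
    for c in contexts:
        prob *= sigmoid(dot(t, c))
    return prob


def log_p_positive_window(t, contexts):
    """Same quantity in log space: the sum of log P(+|t,c_i)."""
    return sum(math.log(sigmoid(dot(t, c))) for c in contexts)


# Hypothetical toy vectors for a target and its k = 4 context words.
t = [0.5, -0.2, 0.1]
contexts = [[0.4, -0.1, 0.3], [0.1, 0.0, 0.2], [0.3, -0.3, 0.1], [0.0, 0.1, 0.1]]

print(p_positive_window(t, contexts))
print(log_p_positive_window(t, contexts))
```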


SG training (1/2)
============== 

Given the sentence:

| not in context | c1         | c2 | t       | c3  | c4 | not in context | 
| -------------- |------------|----|---------|-----|----|----------------|
| ... lemon, a   | tablespoon | of | apricot | jam | a  | pinch of ...   |

We collect the positive examples ($+$) by pairing the target with each word in its window:

| t       | c          |
|---------| -----------|
| apricot | tablespoon |
| apricot | of         |
| apricot | jam        |
| apricot | a          |
| apricot | ...        |

SG training (2/2)
============== 

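Given the list of positive context words $c$:

+ each positive $c$ is matched with a negative $c$ (in practice, often several negatives per positive)
+ negatives are 'noise words': words sampled at random from the lexicon, and therefore assumed not to be real contexts of $t$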
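
A minimal sketch of negative sampling under simplifying assumptions: the lexicon is a made-up word list, noise words are drawn uniformly at random, and the target and its positive context words are excluded from the draw. The function name `sample_negatives` and the parameter `k` (negatives per positive) are hypothetical. The original word2vec implementation samples from a smoothed unigram distribution rather than uniformly.

```python
import random


def sample_negatives(target, positives, lexicon, k=2, seed=0):
    """Pair the target with k randomly drawn noise words per positive context word."""
    rng = random.Random(seed)
    excluded = {target} | set(positives)
    candidates = [w for w in lexicon if w not in excluded]
    # Uniform sampling is a simplification made for this sketch.
    return [(target, rng.choice(candidates)) for _pos in positives for _ in range(k)]


# Hypothetical lexicon and the positive context words from the example window.
lexicon = ["apricot", "tablespoon", "of", "jam", "a",
           "aardvark", "puddle", "where", "coaxial", "seven", "forever"]
positives = ["tablespoon", "of", "jam", "a"]

negatives = sample_negatives("apricot", positives, lexicon, k=2)
print(negatives)
```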

Setup
=====

Let's represent words as vectors of some length (say 300), randomly initialized. 

+ we start with $300 \times V$ random parameters, where $V$ is the size of the vocabulary
+ over the entire training set, we'd like to adjust those word vectors such that we:
  + maximize the similarity of the (target word, context word) pairs $(t,c)$ drawn from the positive data
  + minimize the similarity of the $(t,c)$ pairs drawn from the negative data
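
A hedged sketch of one way to carry out this adjustment: stochastic gradient descent on the standard skip-gram negative-sampling loss, $-\log \sigma(t \cdot c_{pos}) - \sum_{j} \log \sigma(-t \cdot c_{neg_j})$. The loss, the learning rate, the number of negatives, and all function names are assumptions made for this illustration. They are not taken from the slides.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(t, c_pos, c_negs):
    """Loss for one positive pair and its negative pairs (lower is better)."""
    loss = -math.log(sigmoid(dot(t, c_pos)))
    loss -= sum(math.log(sigmoid(-dot(t, c))) for c in c_negs)
    return loss


def sgd_step(t, c_pos, c_negs, lr=0.05):
    """One in-place gradient step on t, c_pos and every vector in c_negs."""
    err_pos = sigmoid(dot(t, c_pos)) - 1.0           # negative: pulls t and c_pos together
    errs_neg = [sigmoid(dot(t, c)) for c in c_negs]  # positive: pushes t and c_neg apart
    # Gradient for t, computed before any context vector is modified.
    grad_t = [
        err_pos * c_pos[i] + sum(e * c[i] for e, c in zip(errs_neg, c_negs))
        for i in range(len(t))
    ]
    for i in range(len(t)):
        c_pos[i] -= lr * err_pos * t[i]
        for e, c in zip(errs_neg, c_negs):
            c[i] -= lr * e * t[i]
    for i in range(len(t)):
        t[i] -= lr * grad_t[i]


rng = random.Random(0)
DIM = 300  # vector length, as in the setup above


def random_vector():
    return [rng.uniform(-0.1, 0.1) for _ in range(DIM)]


t, c_pos = random_vector(), random_vector()
c_negs = [random_vector() for _ in range(2)]  # two negatives: an arbitrary choice

loss_before = sgns_loss(t, c_pos, c_negs)
for _ in range(20):
    sgd_step(t, c_pos, c_negs)
loss_after = sgns_loss(t, c_pos, c_negs)
print(loss_before, loss_after)
```

Each step raises $t \cdot c$ for the positive pair and lowers it for the negative pairs, so the loss shrinks as the vectors move.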

Learning $\texttt{word2vec}$ embeddings
=============================

<img src="images/_6.png" width="80%">

Source: Jurafsky and Martin (2019).