# Introduction

We start with the motivating example:

## Word Vectors

$$ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen} $$

$$ \vec{Paris} - \vec{France} + \vec{Germany} \approx \vec{Berlin} $$

### Visualization

** INSERT VISUALIZATION HERE **

## Magic?

Does this seem like magic to you? It might, but Word2Vec is learning what we call 'embeddings' for words, and embeddings are actually quite simple.
From the analogy above, you might conclude that Word2Vec is learning how to do analogies. That's actually *not* the case!

### What is being trained?

The model instead examines a large body of text and models how words co-occur. The significance of co-occurrence stems from the insight in this phrase:

You shall know a word by its neighbors.
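
To make "neighbors" concrete, here is a minimal sketch that counts which words appear next to each other. The function name `cooccurrence_pairs`, the `window` parameter, and the toy sentence are all assumptions made for illustration; they are not part of Word2Vec itself.

```python
from collections import Counter

def cooccurrence_pairs(tokens, window=2):
    """Yield (center, neighbor) pairs for every neighbor within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# Toy sentence, chosen only for illustration.
tokens = "the cat sat on the mat".split()
counts = Counter(cooccurrence_pairs(tokens, window=1))

for pair, n in counts.most_common():
    print(pair, n)
```

A real corpus would produce millions of such pairs, and the model only ever sees these pairs, never an explicit analogy.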

### Why does it work?

Different words occur together with different frequencies. Once those co-occurrence patterns are encoded in *vector space*, properties like the analogies you see above emerge from the geometry.
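
As a rough illustration of how an analogy can fall out of vector arithmetic, the sketch below uses hand-made 2-D vectors (not learned embeddings). The two axes, the word list, and the helper names `cosine` and `analogy` are hypothetical and exist only to show the mechanics of king - man + woman.

```python
import numpy as np

# Hand-made 2-D vectors, purely illustrative: axis 0 is "royalty", axis 1 is "gender".
vectors = {
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
    "prince":   np.array([0.7,  0.9]),
    "princess": np.array([0.7, -0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # queen
```

In a trained model nobody assigns the axes by hand; directions like these are what the co-occurrence statistics end up encoding.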

# Embeddings

Often used interchangeably, the terms *word vectors* and *word embeddings* mean something very specific. Here is the setup.

## One-Hot Vectors

Say you have a million words in your vocabulary. Now say you have a machine learning model that takes one of those words as input. How would you represent the word as a vector?

Traditionally, you would use a vector with 1 million components, or dimensions. Every dimension is set to $0$ except the one corresponding to the word in question (based on its index in the vocabulary), which is set to $1$.

The vector looks something like

$$ \begin{bmatrix} 0 & 0 & 0 & \cdots & 1 & \cdots & 0 \end{bmatrix} $$
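
A minimal sketch of building such a vector, assuming a tiny hypothetical five-word vocabulary instead of a million words (the helper `one_hot` and the word list are made up for the example):

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec

vocab = ["berlin", "city", "germany", "paris", "france"]  # tiny hypothetical vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

x = one_hot(word_to_index["germany"], len(vocab))
print(x)  # [0. 0. 1. 0. 0.]
```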

## Dense representation

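+ each word in the vocabulary starts as a one-hot vector: one at the word's index and zero everywhere else
+ multiplying an embedding matrix by a one-hot vector picks out a single column, which is the dense vector for that word ($V$ for the input word $c$, $W$ for the output word $o$)

      V c -> v_c
      W o -> w_o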
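
The mapping `V c -> v_c` can be sketched as a matrix-vector product that reduces to a column lookup. The sizes below and the random initialization are assumptions for illustration; real vocabularies and embedding dimensions are far larger.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                    # illustrative sizes
V = rng.normal(size=(dim, vocab_size))    # embedding matrix, one column per word

c = np.zeros(vocab_size)
c[2] = 1.0                                # one-hot vector for the word at index 2

v_c = V @ c                               # matrix-vector product ...
print(np.allclose(v_c, V[:, 2]))          # ... is just a column lookup: True
```

This is why implementations usually skip the multiplication and index into the matrix directly.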
      
## Linear Algebra

### Dot Product

### Orthogonality

## Questions

* Why are one-hot vectors bad?
* Why are dense embeddings good?



## Linear algebra 
+ latent semantic space -> score

cosine similarity <-> how close the vectors are, unit circle diagram

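$$ \mathrm{score} = v_c \cdot w_o $$

+ we need this because one-hot vectors for different words are always _orthogonal_ ($x \cdot y = 0$, an angle of 90 degrees), so every pair of words looks equally unrelated. Dense vectors give us a more tractable space, where the dot product can vary.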
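
A small sketch of the contrast between the two representations. The dense vectors for `cat`, `dog`, and `car` are invented numbers, not learned embeddings, and `cosine` is a helper defined only for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors after scaling to unit length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors for two different words: the dot product is always 0.
cat_onehot = np.array([1.0, 0.0, 0.0])
dog_onehot = np.array([0.0, 1.0, 0.0])
print(cat_onehot @ dog_onehot)  # 0.0

# Hypothetical dense vectors: similarity can take any value in [-1, 1].
cat = np.array([0.9, 0.4])
dog = np.array([0.8, 0.5])
car = np.array([-0.7, 0.6])
print(cosine(cat, dog), cosine(cat, car))
```

With these made-up numbers, `cat` and `dog` point in nearly the same direction while `cat` and `car` do not, which is the kind of graded similarity one-hot vectors cannot express.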

Write the update rule

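$$ v_c \leftarrow v_c - \alpha \frac{\partial L}{\partial v_c} $$

where $\hat y_j$ is the predicted probability of word $j$, $y$ is the one-hot vector of the observed word, and $w_j$ is the output vector of word $j$:

$$ \frac{\partial L}{\partial v_c} = \sum_{j} (\hat y_j - y_j) \, w_j $$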

## Probabilistic

### Example text:

Berlin is a city in East Germany

+ so we want a classifier ...
+ the dot product is in some arbitrary scale, so the score doesn't sum to one and is merely relative
+ if we want probabilities, we exponentiate the score and normalize over all scores so they sum to 1
+ the score is also known as an "unnormalized log probability"

$$ P(O,C) \propto \exp( \mathrm{score} ) $$

and we want to maximize

$$ P(O|C) = \frac{P(O,C)}{\sum_{O'} P(O', C)} = \frac{P(O,C)}{P(C)} $$
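
The exponentiate-and-normalize step is the softmax function. Below is a minimal sketch with randomly initialized vectors; the sizes, the seed, and the names `softmax` and `p_o_given_c` are assumptions for the example.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)     # subtracting the max avoids overflow; the result is unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                    # illustrative sizes
W = rng.normal(size=(dim, vocab_size))    # output vectors w_o, one column per word
v_c = rng.normal(size=dim)                # input vector of the center word

scores = W.T @ v_c                        # one dot-product score per candidate word
p_o_given_c = softmax(scores)

print(p_o_given_c.sum())                  # sums to 1 up to floating-point error
```

The ordering of the scores is preserved: the word with the highest dot product also gets the highest probability.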

## Gradients

### Random initialize - we make no assumptions

### Formulation of objective function

BLACK BOX OPTIMIZATION 

OR

Detailed ->
$-\log p(o \mid c)$ -> the gradient with respect to the scores is $\hat y - y$
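
One way to convince yourself of the $\hat y - y$ result without doing the algebra is a finite-difference check. This is a sketch with arbitrary illustrative scores; `loss`, `analytic`, and `numeric` are names chosen for the example.

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def loss(scores, o):
    """Negative log-probability of the observed word index o."""
    return -np.log(softmax(scores)[o])

scores = np.array([2.0, -1.0, 0.5, 0.0])  # arbitrary illustrative scores
o = 2                                     # index of the observed word
y = np.zeros_like(scores)
y[o] = 1.0

analytic = softmax(scores) - y            # claimed gradient: y_hat - y

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(scores)
for j in range(len(scores)):
    step = np.zeros_like(scores)
    step[j] = eps
    numeric[j] = (loss(scores + step, o) - loss(scores - step, o)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```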

### Linear Algebra intuition for gradient update

Write out derivatives on your own time!

Here are the gradients already worked out:
* Gradient wrt input vector, output vectors
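
For reference, one way to write them out, assuming the softmax model above with loss $L = -\log P(o \mid c)$, predicted probabilities $\hat y_j = P(j \mid c)$, a one-hot target $y$ for the observed word $o$, and $w_j$ as the output vector of word $j$ (the notation of the score equation):

$$ L = -\, v_c \cdot w_o + \log \sum_{j} \exp( v_c \cdot w_j ) $$

$$ \frac{\partial L}{\partial v_c} = -\, w_o + \sum_{j} \hat y_j \, w_j = \sum_{j} (\hat y_j - y_j) \, w_j $$

$$ \frac{\partial L}{\partial w_j} = (\hat y_j - y_j) \, v_c $$

Both follow from the chain rule applied to the score gradient $\hat y - y$.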

In backpropagation, when I do an update, I subtract the gradient. When I add a gradient to a vector, I'm pushing that vector in the direction of the gradient; when I subtract it, I'm pushing the vector the opposite way. (show diagram)

Check out the gradients for $v_c$ and $w_o$.
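
A single gradient-descent step can be sketched as follows. This is a toy setup with small random vectors, not a full training loop; the sizes, the learning rate `alpha`, the initialization scale, and the observed index `o` are all assumptions for the example.

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
vocab_size, dim, alpha = 6, 4, 0.1                    # illustrative sizes and learning rate
W = rng.normal(scale=0.5, size=(dim, vocab_size))     # output vectors, one column per word
v_c = rng.normal(scale=0.5, size=dim)                 # input vector of the center word
o = 3                                                 # index of the observed outside word

y = np.zeros(vocab_size)
y[o] = 1.0
y_hat = softmax(W.T @ v_c)

grad_v_c = W @ (y_hat - y)                # sum_j (y_hat_j - y_j) * w_j
grad_W = np.outer(v_c, y_hat - y)         # column j is (y_hat_j - y_j) * v_c

p_before = y_hat[o]
v_c = v_c - alpha * grad_v_c              # subtract the gradient
W = W - alpha * grad_W
p_after = softmax(W.T @ v_c)[o]

print(p_before, p_after)                  # the observed word becomes more probable
```

Subtracting the gradient adds a multiple of $w_o$ to $v_c$ and removes the probability-weighted average of all output vectors, which is the "pushing towards" picture from the diagram.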

### Visualize

How do related word vectors move towards each other in 3D space?


### Bonus

L2 distance between 2 word vectors

$$ d^2 = \|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2 \, x \cdot z $$

Setting them to unit vectors

$$ d^2 = 2 - 2 \, x \cdot z $$

Then exponentiating for unnormalized probability ->

$$ \exp(-d^2) $$
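
A quick numerical check of this identity, as a sketch with random vectors scaled to unit length (the sizes and seed are arbitrary). It also shows that normalizing $\exp(-d^2)$ gives the same probabilities as a softmax over $2 \, x \cdot z$, since the constant factor $\exp(-2)$ cancels.

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)
Z = rng.normal(size=(5, 4))                        # five candidate vectors, one per row
x = x / np.linalg.norm(x)                          # scale to unit length
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)

d2 = np.sum((Z - x) ** 2, axis=1)                  # squared L2 distances
dots = Z @ x                                       # dot products

print(np.allclose(d2, 2 - 2 * dots))               # d^2 = 2 - 2 x.z for unit vectors
print(np.allclose(softmax(-d2), softmax(2 * dots)))
```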

## Deep Learning

When you have complicated models, you have more transformations to latent spaces.
However, the intuition remains the same; you're dealing with dot products - similarities (negative distances).