# Introduction

We start with the motivating example:

## Word Vectors

### Synonyms

$$ \vec {happiness} \approx \vec {joy} $$

$$ \vec {dog} \approx \vec {puppy} $$

And a seemingly weird example that will make sense to you later:

$$ \vec {one} \approx \vec {two} $$

### Analogies

$$ \vec {king} - \vec {man} + \vec {woman} \approx \vec {queen} $$

$$ \vec {Berlin} - \vec {Germany} + \vec {France} \approx \vec {Paris} $$

### Visualization

**INSERT VISUALIZATION HERE**

## Magic? Nope!

Does this seem like magic to you? It might, but Word2Vec is learning what we call 'embeddings' for words, and embeddings are actually quite simple.
From the analogy above, you might conclude that Word2Vec is learning how to do analogies. That's actually *not* the case!

### What is being trained?

The model instead examines a large body of text and models how words occur together. Consider the meaning of this assertion:

**You shall know a word by its neighbors.**

Put another way, the semantic content of a word arises from its usage, which is best measured by the other words it appears near.

### Why does it work?

Different words occur together in different contexts. When we model words in *vector space*, we make the vectors of words that occur together point in similar directions. Properties like the ones you see above then emerge naturally.


## Boring Notation

**Notation is important!** The ability to formalize and express these concepts in a structured way helps you *truly* grasp the relationship between concepts you *think* you know and form connections to new ones.

$V$ is a matrix

$\vec v_c$ is the $c^{th}$ column vector taken from matrix $V$

$y_i$ is the $i^{th}$ scalar element of the vector $y$

$x$ is a vector

$|W|$ is the size of the vocabulary (the number of distinct words)

$v^Tu$ is the dot product of $v$ and $u$

# Embeddings

Often used interchangeably, the terms *word vectors* and *word embeddings* refer to a vector representation of a word with some (learned) semantic meaning.

## Questions

Some questions that will be answered over the next few sections:

* What is an embedding?
* Why do we want good representations?
* Why are embeddings considered good representations?

## Representations

It's often said that deep learning is about learning good representations. In traditional machine learning fields, representations were often concocted by hand by domain experts and researchers. **We'll see later why good representations matter, and what makes a representation good.**

Say you have a million words in your vocabulary and a machine learning model has one of those words as an input. How would you represent the word as a vector?

## One-Hot Vectors

Traditionally, you would have a vector with 1 million components, or dimensions, and you would set the value of every dimension to $0$ except the dimension corresponding to the word in question (based on an index), which you would set to one. We call this **one-hot**.

*Formally*, the one-hot vector for the word of index $i$ is:

$$ x = \begin{bmatrix} 0 & 0 & \cdots & 1 & \cdots & 0 \end{bmatrix} $$

Where the $i^{th}$ element is 1 and all other elements are 0.
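As a minimal sketch of the definition above, the following builds a one-hot vector in plain Python. The five-word vocabulary and the helper name `one_hot` are made up for illustration, and the code counts indices from 0 rather than from 1.

```python
def one_hot(index, vocab_size):
    """Return a list with 1 at position `index` and 0 everywhere else."""
    if not 0 <= index < vocab_size:
        raise ValueError("index out of range")
    x = [0] * vocab_size
    x[index] = 1
    return x

# Hypothetical five-word vocabulary, used only for illustration.
vocab = ["berlin", "city", "dog", "germany", "puppy"]
word_to_index = {word: i for i, word in enumerate(vocab)}

x = one_hot(word_to_index["dog"], len(vocab))
print(x)  # [0, 0, 1, 0, 0]
```

With a million-word vocabulary the same vector would hold 999,999 zeros, which is why this representation is called sparse.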

## Dense representation

In contrast to one-hot vectors, which are sparse (almost all entries are 0), Word2Vec trains dense vectors, whose entries are generally non-zero and which have far fewer dimensions (say 100 - 300). Every word gets its own vector, and each one points in a different direction. These dense vectors are collected in column form to make an **embedding matrix**.

### Formulation

*Formally*

The **dense representation** of a one-hot encoded word $x$ that represents the $c^{th}$ word in the vocabulary is:

$$ V x =  \vec v_c $$

Where $V$ is the embedding matrix

And $\vec v_c$ is a vector in embedding space

### Explanation

This looks like matrix-vector multiplication, but it's actually even simpler. If you remember your matrix-vector multiplication rules, a matrix $V$ times a vector $x$ is a linear combination of the columns of $V$, weighted by the elements of $x$. In other words, for every column in $V$, multiply the column vector $\vec v_i$ by the scalar $x_i$ and add up the results.

$$ Vx = \sum_{i=1}^{|W|} x_i \vec v_i $$

Since $x$ is one-hot with its single 1 at position $c$, only column $c$ is multiplied by a non-zero element. The product is actually just selecting column $c$ from matrix $V$. $V$ is *precisely* a collection of word vectors in column form.
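To make the "multiplication is really a lookup" point concrete, here is a small sketch in plain Python. The 2-dimensional embedding values are made up, and `matvec` and `column` are illustrative helper names, not part of any library.

```python
def matvec(V, x):
    """Compute Vx as a linear combination of the columns of V.

    V is stored as a list of rows, so V[r][i] is entry r of column i.
    """
    n_rows = len(V)
    result = [0.0] * n_rows
    for i, x_i in enumerate(x):      # one term x_i * v_i per column
        for r in range(n_rows):
            result[r] += x_i * V[r][i]
    return result

def column(V, i):
    """Read column i of V directly, with no arithmetic."""
    return [row[i] for row in V]

# Made-up 2-dimensional embeddings for a 4-word vocabulary (one column per word).
V = [
    [0.1, -0.4,  0.7, 0.2],
    [0.9,  0.3, -0.5, 0.6],
]
c = 2                 # index of the word we want to embed (0-based)
x = [0, 0, 1, 0]      # its one-hot vector

print(matvec(V, x))   # [0.7, -0.5]
print(column(V, c))   # [0.7, -0.5]
```

Both calls return the same vector, but the second does no multiplications at all, which is why an embedding layer can be treated as a table lookup.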

### Examples

To embed the one-hot vector $x$ for word $c$ from embedding matrix $V$:

$$ Vx = \vec v_c $$

To embed the one-hot vector $y$ for word $o$ from embedding matrix $U$:

$$ Uy = \vec u_o $$

## Answers

What is an embedding?

**A dense vector representation of a word**

**A matrix with $|W|$ columns of dense vectors**

## Vector Space Properties

We have a collection of word vectors for words in our vocabulary - how do we know if they are good representations?

It must be the case that a **good word representation** has certain properties. Let's first look at what those properties are and later on demonstrate how those properties can be achieved.

Recall from linear algebra that two vectors of the same dimension can be combined in two ways:

### Vector Sum

The sum of vectors $v$ and $u$ is defined as addition of their individual elements.

$$ w_i = v_i + u_i $$

Geometrically, this is equivalent to placing the tail of $u$ at the head of $v$; the sum runs from the tail of $v$ to the head of $u$:

**Visualization**

### Dot Product

The dot product of vectors $v$ and $u$ is defined as the sum of the products of their corresponding elements.

$$ v^T u = \sum_i v_i u_i $$

#### Similarity

The dot product measures how similar 2 vectors are in direction. It equals $\|v\| \|u\| \cos\theta$, where $\theta$ is the angle between them:

* Vectors that point in the same direction have a high (positive) dot product.
* Vectors that point in opposite directions have a low (negative) dot product.
* Vectors that point in unrelated (orthogonal) directions have 0 dot product.
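The three cases can be checked numerically. The sketch below uses plain Python and hand-picked 2-dimensional vectors (not word vectors); `cosine` divides out the vector lengths so that only the direction matters.

```python
import math

def dot(v, u):
    """Sum of the products of corresponding elements."""
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

def cosine(v, u):
    """Dot product with both lengths divided out: cos of the angle between v and u."""
    return dot(v, u) / (math.sqrt(dot(v, v)) * math.sqrt(dot(u, u)))

v = [1.0, 2.0]
same = [2.0, 4.0]          # same direction as v, twice as long
opposite = [-1.0, -2.0]    # opposite direction
orthogonal = [2.0, -1.0]   # at a right angle to v

print(dot(v, same))        # 10.0
print(dot(v, opposite))    # -5.0
print(dot(v, orthogonal))  # 0.0
print(cosine(v, same), cosine(v, opposite))
```

Note that the raw dot product also grows with vector length, so `dot(v, same)` is larger than `dot(v, v)` even though both pairs point in exactly the same direction.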

### Word2Vec vs One-Hot

Any two distinct one-hot vectors have 0 dot product. This implies that all words are equally unrelated to each other. **This is neither true nor desirable.**

Dense embeddings have high dot products for words that appear together and low dot products for words that do not. If we believe that similar & related words appear together, then we'd like them to point in similar directions and thus have a high dot product.
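A quick way to see the problem with one-hot vectors is to compute every pairwise dot product. This sketch assumes a tiny vocabulary of 4 words; the table it prints is the identity matrix, so each word is similar only to itself.

```python
def one_hot(index, vocab_size):
    return [1 if i == index else 0 for i in range(vocab_size)]

def dot(v, u):
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

vocab_size = 4  # illustrative size only

# pairwise[i][j] is the dot product of the one-hot vectors for words i and j.
pairwise = [
    [dot(one_hot(i, vocab_size), one_hot(j, vocab_size)) for j in range(vocab_size)]
    for i in range(vocab_size)
]

for row in pairwise:
    print(row)
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
```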

### Good Representation

A good representation encodes as much relevant information as possible. It accurately depicts prior knowledge we have about the data, and helps us create a model for the underlying task that takes advantage of that prior knowledge.

We know that words are related to each other in varying degrees, and along varying axes.

#### Examples

* The relationship between "king" and "man" is along the axis of specificity
* The relationship between "man" and "woman" is along the axis of gender
* The relationship between "one" and "two" is along the axis of plurality

Dense embeddings represent the relatedness of words, whereas one-hot encodings do not. A downstream machine learning task can make use of this insight and will not have to re-learn how "man" and "king" are related. When a good representation is learned, the overall task becomes easier.
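To illustrate what "related along an axis" can mean, the sketch below uses hand-built 2-dimensional vectors in which one coordinate plays the role of the "man" to "king" direction and the other plays the role of gender. These numbers are invented for the illustration and are not learned embeddings; real Word2Vec vectors do not have such cleanly labeled coordinates.

```python
import math

def dot(v, u):
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

def cosine(v, u):
    return dot(v, u) / (math.sqrt(dot(v, v)) * math.sqrt(dot(u, u)))

# Hand-built toy vectors, NOT learned embeddings.
# Coordinate 0: the "man" -> "king" direction; coordinate 1: gender.
vectors = {
    "man":      [0.0,  1.0],
    "woman":    [0.0, -1.0],
    "king":     [1.0,  1.0],
    "queen":    [1.0, -1.0],
    "prince":   [0.5,  1.0],
    "princess": [0.5, -1.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # queen
```

Subtracting "man" from "king" leaves only the first coordinate, and adding "woman" then supplies the other gender value, so the result lands on "queen".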

### Visualize

**Insert Visualization AGAIN**

**Point out how vectors are arranged**

## Questions

Why are good representations useful?

**Good representations take advantage of prior knowledge to make the underlying task easier**

Why are dense embeddings good representations?

**Dense embeddings represent words based on their relatedness, so that a downstream task can make use of such relatedness**

## Probabilistic

### Example text:

Berlin is a city in East Germany

+ so we want a classifier ...
+ the dot product is on an arbitrary scale, so the scores don't sum to one and are only meaningful relative to each other
+ if we want probabilities, we exponentiate each score and normalize over all scores so that they sum to 1
+ the score is also known as an "unnormalized log probability"

$$ P(O,C) \propto \exp( \mathrm{score} ) $$

and we want to maximize

$$ P(O|C) = P(O,C) / \sum_{O'} P(O', C) = P(O,C) / P(C) $$
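The exponentiate-then-normalize step is the softmax function. The sketch below assumes the score for an output word $o$ given a center word $c$ is the dot product $u_o^T v_c$; the vectors are made up. Subtracting the largest score before exponentiating does not change the result and avoids overflow for large scores.

```python
import math

def dot(v, u):
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

def softmax(scores):
    """Exponentiate each score and normalize so the results sum to 1."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up vectors: one center-word vector and three candidate output-word vectors.
v_c = [0.5, -1.0]
U = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

scores = [dot(u_o, v_c) for u_o in U]   # unnormalized log probabilities
probs = softmax(scores)                 # P(O = o | C = c) for each o

print(scores)  # [0.5, -1.0, 2.0]
print(probs)
```

The highest score always receives the highest probability, and the probabilities only depend on the differences between scores.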





## Learning

### Random initialization - we make no assumptions

### Formulation of objective function

BLACK BOX OPTIMIZATION 

OR

Detailed ->
$-\log p(o|c)$ -> $\hat y - y$


$$ v_c = v_c - \alpha \frac {\partial L}{\partial v_c}$$

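$$ \frac {\partial L}{\partial v_c} = \sum_{j=1}^{|W|} \hat y_j u_j - u_o $$

where $\hat y_j = p(j|c)$ is the predicted probability of word $j$ and $o$ is the index of the observed output word.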
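As a sanity check on the update rule, the sketch below assumes the softmax model $p(o|c) \propto \exp(u_o^T v_c)$ with loss $L = -\log p(o|c)$, for which the gradient with respect to $v_c$ is $\sum_j \hat y_j u_j - u_o$. It compares that analytic gradient against a finite-difference estimate and takes one gradient step. All vectors, the learning rate, and the helper names are made up for illustration.

```python
import math

def dot(v, u):
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v_c, U, o):
    """L = -log p(o | c), with U holding one output vector per word."""
    y_hat = softmax([dot(u_j, v_c) for u_j in U])
    return -math.log(y_hat[o])

def grad_v_c(v_c, U, o):
    """Analytic gradient of L wrt v_c: sum_j y_hat_j * u_j - u_o."""
    y_hat = softmax([dot(u_j, v_c) for u_j in U])
    grad = [0.0] * len(v_c)
    for y_j, u_j in zip(y_hat, U):
        for d in range(len(v_c)):
            grad[d] += y_j * u_j[d]
    return [g_d - u_od for g_d, u_od in zip(grad, U[o])]

def numeric_grad(v_c, U, o, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for d in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[d] += eps
        minus[d] -= eps
        grad.append((loss(plus, U, o) - loss(minus, U, o)) / (2 * eps))
    return grad

# Made-up example: three output vectors, observed output word o = 1.
U = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
v_c = [0.5, -1.0]
o = 1
alpha = 0.1  # learning rate, chosen arbitrarily

g = grad_v_c(v_c, U, o)
v_new = [v_d - alpha * g_d for v_d, g_d in zip(v_c, g)]  # v_c <- v_c - alpha * dL/dv_c

print(g, numeric_grad(v_c, U, o))
print(loss(v_c, U, o), loss(v_new, U, o))
```

The two gradients should agree to several decimal places, and the loss after the step should be lower than before, meaning $v_c$ has moved so that the observed word $o$ becomes more probable.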

### Linear Algebra intuition for gradient update

Write out derivatives on your own time!

Here are the gradients already worked out:
*Gradient wrt input vector, output vectors*

In backpropagation, when I do an update, I subtract the gradient. When I add a gradient to a vector, I'm pushing that vector in the direction of the gradient; when I subtract it, I'm pushing the vector away from that direction. (show diagram)

Check out the gradients for $v_c$ and $u_o$.

### Visualize

How do related word vectors move towards each other in 3D space?

# Looking forward

## Deep Learning

When you have complicated models, you have more transformations to latent spaces.
However, the intuition remains the same; you're dealing with dot products - similarities (negative distances).

## Bonus Material

Squared L2 distance between 2 word vectors $x$ and $z$:
$$ d^2 = \|x - z\|^2 = x^Tx + z^Tz - 2x^Tz $$

Normalizing both to unit vectors ($x^Tx = z^Tz = 1$):
$$ d^2 = 2 - 2x^Tz $$

Then exponentiating for unnormalized probability ->

$$ \exp(-d^2) $$
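The identity can be verified numerically. The sketch below normalizes two arbitrary, made-up vectors to unit length and checks that the squared distance equals 2 minus twice the dot product. It also checks that the exponentiated negative squared distance is a constant factor, $e^{-2}$, times the exponential of twice the dot product, so after normalization it ranks pairs the same way an exponentiated dot-product score does.

```python
import math

def dot(v, u):
    return sum(v_i * u_i for v_i, u_i in zip(v, u))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [v_i / norm for v_i in v]

# Arbitrary vectors, normalized to unit length.
x = normalize([3.0, 4.0])
z = normalize([1.0, 1.0])

d2 = sum((x_i - z_i) ** 2 for x_i, z_i in zip(x, z))  # squared L2 distance

print(d2, 2 - 2 * dot(x, z))                                 # equal up to rounding
print(math.exp(-d2), math.exp(-2) * math.exp(2 * dot(x, z)))  # equal up to rounding
```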