# XCS224N Natural Language Processing with Deep Learning


# Lecture 3

[CS224N](http://web.stanford.edu/class/cs224n/) / [XCS224N](http://scpd.stanford.edu/search/publicCourseSearchDetails.do?method=load&courseId=93933715) / [Lecture](https://youtu.be/8CWyBNX6eDo) / [Slides](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture03-neuralnets.pdf)

## Week 2-3 Preview

Goal is to learn about:
* Deep, multi-layer neural networks and how they can be trained using backpropagation (matrix calculus)
* NLP classifiers that add context, by taking in windows of words and classifying a center word

## Classification Setup and Notation

* Training dataset consisting of samples $\{x_i, y_i\}^N_{i=1}$
* $x_i$ are inputs (dimension d)
* $y_i$ are labels we try to predict (one of C classes)

Traditional ML/Stats approach:
* Assume $x_i$ are fixed (i.e. set)
* Train softmax/logistic regression weights $W\in \mathbb{R}^{C \times d}$
* By way of training, determine a decision boundary (hyperplane)

<br>

<img src="images/decision_boundary.PNG" />

### Softmax Classifier

$$P(y|x) = \frac{exp(W_{y}.x)}{\sum_{c=1}^{C}exp(W_{c}.x)}$$

1. Take the $y^{th}$ row of W and multiply that row with x (dot product)
$$ W_{y}.x = \sum_{i=1}^{d}W_{yi}x_{i}=f_{y}$$

The weight matrix $W$ has one row per class. Taking the dot product of row $W_y$ with the datapoint vector $x$ gives a score $f_y$ that indicates how likely it is that the example belongs to class $y$.

2. Apply softmax to get normalized probability

$$P(y|x) = \frac{exp(f_{y})}{\sum_{c=1}^{C}exp(f_{c})}=softmax(f)_{y}$$
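The two steps above can be sketched in plain Python. This is a minimal illustration rather than course code: the helper names `scores` and `softmax` and the toy values of `W` and `x` are assumptions made up for the example. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def scores(W, x):
    """Step 1: f_y = W_y . x for every class row W_y of W."""
    return [sum(w_yj * x_j for w_yj, x_j in zip(row, x)) for row in W]

def softmax(f):
    """Step 2: normalize the scores into a probability distribution."""
    m = max(f)  # shift by the max score for numerical stability
    exps = [math.exp(f_c - m) for f_c in f]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (assumed values): C = 3 classes, d = 2 input dimensions.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
x = [2.0, 1.0]

f = scores(W, x)    # [2.0, 1.0, 3.0]
probs = softmax(f)  # sums to 1; the largest score gets the largest probability
```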

In statistics, purists may point out that we are storing more weights than we need: only $C - 1$ weight vectors are required to classify among $C$ classes, i.e. for binary classification we only need one weight vector.

<strong>Goal</strong>
<br>

i) For every training example $(x,y)$, the objective is to <em>maximize the probability of the correct class $y$
</em>
<br>
ii) Or, <em>minimize the negative log probability of that class</em>:
<br>
$$-log p(y|x)=-log(\frac{exp(f_y)}{\sum_{c=1}^{C}exp(f_{c})})$$

### What is "cross entropy" loss/error?

* Originates from information theory
* Let the true probability distribution be $p$
* Let our computed model probability be $q$
* Then cross entropy is:
<br>
$$ H(p,q) = -\displaystyle\sum_{c=1}^{C}p(c) log q(c)$$
<br>
* Assuming a ground truth probability distribution that is 1 at the right class and 0 everywhere else (i.e. a one-hot encoded vector):
```python
# One-hot ground truth: 1 at the true class (index 10), 0 everywhere else
p = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```
<strong>Because of one-hot $p$, the only term left is the negative log probability of the true class</strong>
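
The reduction stated above can be checked numerically. The sketch below is only an illustration: the function name `cross_entropy` and the three-class distributions are assumed toy values, not taken from the lecture.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c); terms with p(c) = 0 contribute nothing."""
    return -sum(p_c * math.log(q_c) for p_c, q_c in zip(p, q) if p_c > 0)

# Assumed toy distributions over C = 3 classes.
p = [0, 0, 1]          # one-hot ground truth: class index 2 is correct
q = [0.2, 0.1, 0.7]    # model's predicted probabilities

loss = cross_entropy(p, q)
nll = -math.log(q[2])  # negative log probability of the true class

# With a one-hot p, the two quantities coincide.
```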

<em>Q: What is the p vector? What does each of the components correspond to? </em>

### Classification over a full dataset
* The cross entropy loss over a full dataset $\{x_i, y_i\}^N_{i=1}$ averages the per-example losses:
<br>
$$J(\theta) = \frac{1}{N} \displaystyle\sum_{i=1}^{N}-log(\frac{e^{f_{y_{i}}}}{\sum_{c=1}^{C}e^{f_{c}}})$$
* Instead of:
<br>
$$f_y = f_y(x) = W_y.x = \displaystyle\sum_{j=1}^{d}W_{yj}x_j$$

* We will write $f$ in matrix notation:
$$f=Wx$$

The $W$ matrix will have <strong>$C$ rows equal to the number of classes</strong>, and <strong>$d$ columns equal to the number of dimensions of the input vector $x$</strong>, thus:
<br>
<br>
$W\in \mathbb{R}^{C \times d}$

For each row $y$ of the $W$ matrix, which corresponds to one class:
* We multiply $W_{yj}$ (the $j$th component of row $y$) by the $j$th component of $x$, for every input dimension $j$
* Then take the sum
* The result is the score $f_y$
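
As a rough sketch of how the matrix form $f = Wx$ feeds into the dataset loss $J(\theta)$, the snippet below computes the mean negative log probability over a tiny dataset. The function names (`matvec`, `nll`, `dataset_loss`) and all numeric values are assumptions for illustration. The per-example loss is written as log-sum-exp minus $f_y$, which is algebraically the same as $-\log$ of the softmax probability.

```python
import math

def matvec(W, x):
    """f = Wx: entry y of f is the dot product of row W_y with x."""
    return [sum(W_yj * x_j for W_yj, x_j in zip(row, x)) for row in W]

def nll(f, y):
    """-log softmax(f)_y, computed as logsumexp(f) - f_y for stability."""
    m = max(f)
    log_sum_exp = m + math.log(sum(math.exp(f_c - m) for f_c in f))
    return log_sum_exp - f[y]

def dataset_loss(W, xs, ys):
    """J(theta): mean negative log probability of the correct class."""
    return sum(nll(matvec(W, x), y) for x, y in zip(xs, ys)) / len(xs)

# Assumed toy data: C = 2 classes, d = 2 dimensions, N = 2 examples.
W = [[1.0, -1.0],
     [-1.0, 1.0]]
xs = [[2.0, 0.0], [0.0, 2.0]]
ys = [0, 1]

J = dataset_loss(W, xs, ys)  # both examples are classified correctly, so J is small
```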

### Traditional ML optimization

For general ML, $\theta$ usually only consists of columns of W:
$$\theta = 
\begin{pmatrix}
  W_{.1} \\
  \vdots \\
  W_{.d} \\
 \end{pmatrix}
 =W(:) \in \mathbb{R}^{Cd}$$
<br>
So we only update the decision boundary via:
$$\nabla_\theta J(\theta) = 
\begin{pmatrix}
  \nabla_{W_{.1}} \\
  \vdots \\
  \nabla_{W_{.d}} \\
 \end{pmatrix}
\in \mathbb{R}^{Cd} $$
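
To make the update concrete, here is a minimal sketch of one gradient step on $W$ for a single example. It relies on the standard result, not derived in these notes, that the gradient of $-\log softmax(Wx)_y$ with respect to $W$ has entries $(q_c - p_c)x_j$, where $q$ is the softmax output and $p$ the one-hot target. The function names, the learning rate `lr`, and the toy values are assumptions.

```python
import math

def softmax(f):
    m = max(f)
    exps = [math.exp(f_c - m) for f_c in f]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, x, y):
    """Negative log probability of class y under softmax(Wx)."""
    f = [sum(w * x_j for w, x_j in zip(row, x)) for row in W]
    return -math.log(softmax(f)[y])

def grad_W(W, x, y):
    """Gradient of the loss w.r.t. W: entry (c, j) is (q_c - [c == y]) * x_j."""
    f = [sum(w * x_j for w, x_j in zip(row, x)) for row in W]
    q = softmax(f)
    return [[(q[c] - (1.0 if c == y else 0.0)) * x_j for x_j in x]
            for c in range(len(W))]

def sgd_step(W, x, y, lr=0.1):
    """Move W a small step against the gradient (lr is an assumed value)."""
    g = grad_W(W, x, y)
    return [[w - lr * g_w for w, g_w in zip(row, g_row)]
            for row, g_row in zip(W, g)]

# Assumed toy setup: C = 2 classes, d = 2 dimensions, W initialized to zeros.
W = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
y = 0

W_new = sgd_step(W, x, y)  # the loss on (x, y) decreases after the step
```

Only the entries of $W$ change here; the input $x$ stays fixed, which is what distinguishes this setup from learning word vectors as well.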

### Neural Network Classifiers (For the Win!)

Softmax (or logistic regression) alone is not very powerful: it only produces linear decision boundaries, so it is of limited help on complex problems.

<br>

Enter NNs - They are able to learn much more complex functions and non-linear decision boundaries.

<br>

<img src="images/decision_boundaries_2.PNG" />

### Classification difference with word vectors

Commonly in NLP deep learning:
* We learn both $W$ and word vectors $x$
* We learn both conventional parameters ($W$) and representations ($x$)
* Word vectors re-represent one-hot vectors, moving them around in an intermediate-layer vector space so that they are easy to classify with a (linear) softmax classifier, via the layer $x = Le$

<br>

$$\nabla_\theta J(\theta) = 
\begin{pmatrix}
  \nabla_{W_{.1}} \\
  \vdots \\
  \nabla_{W_{.d}} \\
  \vdots \\
  \nabla_{x_{aardvark}} \\
  \vdots \\
  \nabla_{x_{zebra}} \\
 \end{pmatrix}
\in \mathbb{R}^{Cd+Vd} $$

The result is a very large number of parameters compared with the traditional setup, where only $W$ is learned.
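
The lookup layer and the parameter count can be illustrated with a small sketch. It assumes $L$ is a $d \times V$ matrix whose columns are the word vectors and $e$ is a one-hot vector over the vocabulary. The function name `lookup` and the sizes chosen for `C`, `d` and `V` are hypothetical values picked for the example.

```python
def lookup(L, e):
    """x = L e: multiplying the d x V matrix L by a one-hot e selects one column."""
    return [sum(L_ij * e_j for L_ij, e_j in zip(row, e)) for row in L]

# Assumed toy sizes: d = 2 dimensions, V = 3 vocabulary words.
L = [[0.1, 0.4, 0.7],
     [0.2, 0.5, 0.8]]
e = [0, 1, 0]     # one-hot vector for the word at index 1
x = lookup(L, e)  # column 1 of L: [0.4, 0.5]

# Parameter counts for assumed sizes C, d, V.
C, d, V = 5, 300, 100_000
params_W_only = C * d                # 1,500: only the classifier weights
params_with_vectors = C * d + V * d  # 30,001,500: W plus every word vector
```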

### An artificial neuron

Understanding softmax models means you can understand the operation of neurons.

<br>

<img src="images/neuron.PNG" />

### A neuron can be a binary logistic regression unit

$f$ = non-linear activation function (e.g. sigmoid), $w$ = weights, $b$ = bias, $h$ = hidden, $x$ = inputs
<br>
<br>
$$h_{w,b}(x) = f(w^Tx+b)$$
<br>
<br>
Sigmoid:
$$f(z)=\frac{1}{1+e^{-z}}$$

<img src="images/neuron_log_regression.PNG" />

<em>Q: What is z in this?</em>
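
A single neuron of this kind can be sketched as follows, taking $z$ to be the pre-activation $w^Tx + b$ that is passed to the sigmoid. The function names and the values of `w`, `b` and `x` are assumptions chosen so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, b, x):
    """h_{w,b}(x) = f(w^T x + b) with a sigmoid activation f."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # pre-activation
    return sigmoid(z)

# Assumed toy values.
w = [0.5, -0.5]
b = 0.0
x = [1.0, 1.0]

h = neuron(w, b, x)  # z = 0.5 - 0.5 + 0.0 = 0.0, so h = sigmoid(0) = 0.5
```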

<strong>Neural networks = running a bunch of logistic regressions at the same time</strong>
<br>If we feed a vector of inputs through a bunch of logistic regression units, we get a vector of outputs.
<br><em>However, we don't have to decide ahead of time what attributes these units are trying to predict</em>

<img src="images/log_units.PNG" />
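
As a hedged sketch of "a bunch of logistic regressions at the same time", the snippet below stacks one weight row and one bias per unit and applies the sigmoid to each pre-activation. The function name `layer` and all values are assumptions. The weights are picked so that every pre-activation is zero, which makes the output easy to verify.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """One logistic unit per row of W: output_i = sigmoid(W_i . x + b_i)."""
    return [sigmoid(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Assumed toy layer: 3 units, 2 inputs.
W = [[1.0, -1.0],
     [0.0, 0.0],
     [2.0, 1.0]]
b = [0.0, 0.0, -3.0]
x = [1.0, 1.0]

a = layer(W, b, x)  # every pre-activation is 0 here, so each output is 0.5
```

Nothing in this code says what each unit should detect; in a trained network that is determined by the learned values of `W` and `b`.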