# Lecture 10

## Machine Learning: Feed-forward Neural Networks

### Multi-Layer Neural Networks
* Basic idea: represent any non-linear function as a composition of soft-threshold functions (form of non-linear regression)
* Lippmann (1987): two hidden layers suffice to represent an arbitrary region (provided there are enough neurons), even discontinuous functions!

### Activation Functions
* One problem with perceptrons is that the **threshold function (step function)** is not differentiable at the threshold and has zero gradient everywhere else, so gradient descent cannot be used
* Alternative is to use the **sigmoid (logistic) function**
$$ g(z) = \frac{1}{1 + e^{-z}}$$
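As a minimal sketch of why the sigmoid works with gradient descent, the code below implements $g(z)$ together with its derivative $g'(z) = g(z)(1 - g(z))$. The function names are illustrative, and the two-branch form is just one common way to avoid overflow for very negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative g'(z) = g(z) * (1 - g(z)); defined and smooth for every z."""
    g = sigmoid(z)
    return g * (1.0 - g)

for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

Unlike the step function, the derivative is non-zero for every finite $z$, although it becomes very small when $|z|$ is large.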

### The logit Function
* Given a probability $p$, the odds are
$$ \text{odds}(p) = \frac{p}{1 - p}
$$

* The log-odds (logit) function is
$$ \text{logit}(p) = \log\left(\frac{p}{1 - p}\right)
$$
* The logit function maps a probability $p \in (0, 1)$ to a real value between $-\infty$ and $\infty$

### Logistic Regression
* The sigmoid function is the *inverse* of the logit function:
$$ p = \text{sigmoid}(\text{logit}(p))
$$
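The inverse relationship follows from solving $z = \log\frac{p}{1-p}$ for $p$, which gives $p = \frac{1}{1 + e^{-z}}$. A quick numerical check, using illustrative function names:

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z: float) -> float:
    """Logistic function; maps a real-valued z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: probability -> log-odds -> probability
for p in (0.1, 0.5, 0.9):
    print(p, logit(p), sigmoid(logit(p)))
```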

### Activation Functions
* Other popular activation functions include $\tanh(z)$ and $\text{ReLU}(z) = \max(0, z)$
![image-2.png](attachment:image-2.png)
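For comparison, here is a small sketch of the three activation functions side by side. It assumes the standard definitions; the function names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z: float) -> float:
    """Output in (-1, 1); a rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1."""
    return math.tanh(z)

def relu(z: float) -> float:
    """Rectified linear unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, z)

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```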

### Softmax
* In order to interpret the output of the final layer as probabilities, we can exponentiate the activation of each output unit and normalize it by the sum of all exponentiated output activations
![image.png](attachment:image.png)
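A minimal sketch of this normalization, assuming the standard softmax definition $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z):
    """Turn a list of real-valued activations into a probability distribution."""
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are all positive and sum to 1, and the largest activation receives the largest probability.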

### Computation Graphs
* We can represent arithmetic expressions (such as those performed by a neural network) as directed acyclic graphs
* Computation graphs depict how the different functions within a neural network are layered and composed
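To make the idea concrete, the sketch below builds a tiny computation graph for a single unit, $\sigma(w \cdot x + b)$, and walks it in reverse to obtain gradients. The `Node` class and helper names are hypothetical and meant only as an illustration, not as any particular library's API.

```python
import math

class Node:
    """A vertex in the graph: a value, its parent nodes, and d(value)/d(parent)."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents
        self.local_grads = local_grads
        self.grad = 0.0  # d(output)/d(this node), filled in by backward()

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, (a,), (s * (1.0 - s),))

def backward(output):
    """Apply the chain rule from the output back to the inputs."""
    order, seen = [], set()
    def visit(n):  # depth-first post-order: parents are listed before children
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)
    visit(output)
    output.grad = 1.0
    for n in reversed(order):  # every node is processed before its parents
        for p, g in zip(n.parents, n.local_grads):
            p.grad += n.grad * g

w, x, b = Node(2.0), Node(0.5), Node(-1.0)
out = sigmoid(add(mul(w, x), b))
backward(out)
print(out.value, w.grad, x.grad, b.grad)
```

Because the graph is acyclic, a reverse topological order always exists, which is what lets gradients flow from the output to every input in a single pass.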

### Learning in Multi-Layer Neural Networks
* The network structure is fixed; learning only adjusts the weights
* Assume **feed-forward** neural networks (no loops)

* **Backpropagation Algorithm:**
    * Given current weights, get network output and compute loss function (assume multiple outputs)
    * Can use gradient descent to update weights and minimize loss
    * Problem: the target is only given for the output units, so we only know how to do this directly for the last layer
    * Solution: propagate error backwards through the network
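The steps above can be sketched for a small network with one hidden layer. This is an illustration under stated assumptions rather than the lecture's exact formulation: sigmoid hidden units, a softmax output layer, negative log-likelihood loss, and no bias terms. All names (`forward`, `backprop_step`, `W1`, `W2`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, W1, W2):
    """Return hidden activations h and output probabilities p."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    p = softmax([sum(w * hj for w, hj in zip(row, h)) for row in W2])
    return h, p

def backprop_step(x, target, W1, W2, eta=0.5):
    """One gradient-descent update in place; returns the loss before the update."""
    h, p = forward(x, W1, W2)
    loss = -math.log(p[target])
    # Last layer: error signal is (predicted probability - one-hot target).
    delta2 = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    # Propagate the error backwards through W2 and the sigmoid derivative.
    delta1 = [sum(W2[k][j] * delta2[k] for k in range(len(W2))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
    # Update each weight by its gradient: (error at the unit) * (input to the weight).
    for k in range(len(W2)):
        for j in range(len(h)):
            W2[k][j] -= eta * delta2[k] * h[j]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] -= eta * delta1[j] * x[i]
    return loss

W1 = [[0.1, -0.2], [0.3, 0.4]]   # hidden layer weights (2 units, 2 inputs)
W2 = [[0.2, -0.1], [-0.3, 0.5]]  # output layer weights (2 classes, 2 hidden units)
losses = [backprop_step([1.0, 0.5], 1, W1, W2) for _ in range(20)]
print(losses[0], losses[-1])
```

Note that `delta1` is computed from the weights as they were before the update, so both layers are updated using gradients from the same forward pass.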

### Negative Log-Likelihood
* Assume target output is a vector and $c(y)$ is the target class for target $y$
* Compute the negative log-likelihood for a single example
$$\text{Loss}(y, h_w(x)) = -\log P(c(y) | x; w) $$
* Empirical error for the entire training data:
$$ E_{\text{train}}(w) = \frac{1}{N} \sum_{i = 1}^N -\log P(c(y^{(i)}) | x^{(i)}; w)
$$
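As a small sketch of these two formulas, the code below assumes the network's output for each example is already a list of class probabilities and that target classes are given as integer indices. Function names are illustrative.

```python
import math

def nll(probs, target_class):
    """Loss for one example: -log of the probability assigned to the target class."""
    return -math.log(probs[target_class])

def empirical_error(batch_probs, target_classes):
    """Average negative log-likelihood over the N training examples."""
    losses = [nll(p, c) for p, c in zip(batch_probs, target_classes)]
    return sum(losses) / len(losses)

batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_classes = [0, 1]
print(empirical_error(batch_probs, target_classes))
```

The loss is 0 when the target class receives probability 1 and grows without bound as that probability approaches 0.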

### Stochastic Gradient Descent (for a single unit)
* Goal: learn parameters that minimize the empirical error
![image.png](attachment:image.png)
* $\eta$ is the learning rate
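A minimal sketch of stochastic gradient descent for a single sigmoid unit, assuming the usual update rule $w \leftarrow w - \eta \nabla_w \text{Loss}$ and a binary negative log-likelihood loss, for which the gradient is $(g(w \cdot x) - y)\,x$. The function names, the OR data set, and the choice of a constant first input acting as a bias are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that the label is 1 under the current weights."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_single_unit(data, eta=0.5, epochs=500, seed=0):
    """Update the weights after every single example, visiting examples in random order."""
    rng = random.Random(seed)
    data = list(data)                  # copy so the caller's list is not shuffled
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = predict(w, x)
            # Gradient of the loss for this example is (p - y) * x.
            w = [wi - eta * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Logical OR; the first input is fixed at 1.0 so its weight plays the role of a bias.
data = [([1.0, 0.0, 0.0], 0), ([1.0, 0.0, 1.0], 1),
        ([1.0, 1.0, 0.0], 1), ([1.0, 1.0, 1.0], 1)]
w = sgd_single_unit(data)
print(w, [round(predict(w, x), 3) for x, _ in data])
```

A larger $\eta$ takes bigger steps per example and can overshoot, while a smaller one converges more slowly.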

### Neural Language Model
* First dense layer of the network learns an **embedding** of each input token
    * Input layer is a concatenation of vectors
    * Tokens share a single weight matrix $E$
    * Embedding weights are trained at the same time as Language Model weights
    * Similar words have similar embeddings
![image.png](attachment:image.png)
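The sketch below illustrates why a dense layer applied to a one-hot token vector acts as an embedding lookup, and how the input layer is formed by concatenating the embeddings of the context tokens. The matrix values, vocabulary size, and function names are made up for illustration; in a real model $E$ would be learned.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the token's vocabulary index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dense_no_bias(v, E):
    """Multiply a row vector v (length |V|) by the matrix E (|V| x d)."""
    d = len(E[0])
    return [sum(v[i] * E[i][j] for i in range(len(E))) for j in range(d)]

def embed_context(token_ids, E):
    """Look up each token's row in the shared matrix E and concatenate the rows."""
    x = []
    for t in token_ids:
        x.extend(E[t])
    return x

# Hypothetical embedding matrix: 3-word vocabulary, 2-dimensional embeddings.
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

print(dense_no_bias(one_hot(2, 3), E))  # same as row 2 of E
print(embed_context([2, 0], E))         # rows 2 and 0 of E, concatenated
```

Because every context position uses the same $E$, a token gets the same embedding regardless of where it appears in the input.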