# Deep Learning

## Linear Boundaries

Equation: $$w_1x_1 + w_2x_2 + b = 0$$
This equation is simplified in vector notation:
$$Wx + b = 0$$
$$W = (w_1, w_2)$$
$$x = (x_1, x_2)$$

x is the input, W is the weights, b is the bias, and y is the label we're trying to predict.

The actual prediction is y hat: $$\hat{y} = \begin{cases} 1 & \text{if } Wx + b \geq 0 \\
                                                          0 & \text{if } Wx + b < 0  \end{cases}$$

The goal is for y hat to match y as closely as possible.
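
A minimal sketch of this prediction rule in Python. The boundary used here (W = (2, 1), b = -4) and the sample points are illustrative assumptions, not values from these notes.

```python
def predict(weights, x, bias):
    """Return 1 if Wx + b >= 0, otherwise 0."""
    score = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

# Hypothetical boundary: 2*x_1 + 1*x_2 - 4 = 0
weights, bias = [2.0, 1.0], -4.0

print(predict(weights, [3.0, 1.0], bias))  # score = 2*3 + 1 - 4 = 3, so 1
print(predict(weights, [0.5, 1.0], bias))  # score = 2*0.5 + 1 - 4 = -2, so 0
```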


## Higher Dimensions
Boundary: a plane in 3 dimensions, an (n-1)-dimensional hyperplane in n dimensions
Equation: $$w_1x_1 + w_2x_2 + \cdots + w_nx_n + b = 0$$

## Perceptrons

![image.png](attachment:image.png)

A step function returns 1 if its input is non-negative and 0 otherwise.

#### AND Perceptron
Only returns true if both inputs are true

#### OR Perceptron
Returns true if any inputs are true

#### NOT Perceptron
Only takes in one input and returns the opposite

#### XOR Perceptron
Only returns true if the inputs have different values
![image.png](attachment:image.png)

#### NAND
Combines AND and NOT: returns false only if both inputs are true
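
The gates above can be sketched as perceptrons with a step activation. The specific weights and biases below are one possible choice (an assumption for illustration; many other values work), and XOR is built by combining OR, NAND and AND, since a single perceptron cannot represent it.

```python
def step(score):
    """Step activation: 1 for non-negative scores, 0 otherwise."""
    return 1 if score >= 0 else 0

def perceptron(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Hypothetical weights and biases; any values giving the same truth tables would do.
def and_gate(a, b):
    return perceptron([1, 1], -2, [a, b])

def or_gate(a, b):
    return perceptron([1, 1], -1, [a, b])

def not_gate(a):
    return perceptron([-1], 0, [a])

def nand_gate(a, b):
    return not_gate(and_gate(a, b))

def xor_gate(a, b):
    # Two layers: OR and NAND feed into AND.
    return and_gate(or_gate(a, b), nand_gate(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), nand_gate(a, b), xor_gate(a, b))
```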

## Perceptron Trick

![image.png](attachment:image.png)

* 1- Start with random weights: w_1, ..., w_n, b
* 2- For every misclassified point (x_1, ..., x_n), with learning rate α:
    * 2.1 if prediction = 0:
        - For i = 1 ... n
            - Change w_i to w_i + αx_i
        - Change b to b + α
    * 2.2 if prediction = 1:
        - For i = 1 ... n
            - Change w_i to w_i - αx_i
        - Change b to b - α
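
The steps above can be sketched in Python as follows. The function names, the AND-style training data, the seed, and the number of epochs are all assumptions made for illustration; the update rule itself follows the list above.

```python
import random

def predict(weights, bias, x):
    score = sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

def perceptron_epoch(weights, bias, points, labels, learn_rate):
    """One pass over the data, moving the line for each misclassified point."""
    for x, y in zip(points, labels):
        prediction = predict(weights, bias, x)
        if prediction == y:
            continue  # correctly classified: nothing to do
        if prediction == 0:
            # Step 2.1: add the scaled point to the weights.
            weights = [w + learn_rate * x_i for w, x_i in zip(weights, x)]
            bias += learn_rate
        else:
            # Step 2.2: subtract the scaled point from the weights.
            weights = [w - learn_rate * x_i for w, x_i in zip(weights, x)]
            bias -= learn_rate
    return weights, bias

def train(points, labels, weights, bias, learn_rate=0.1, epochs=1000):
    for _ in range(epochs):
        weights, bias = perceptron_epoch(weights, bias, points, labels, learn_rate)
    return weights, bias

# Hypothetical, linearly separable data (the AND truth table).
points = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

random.seed(0)  # step 1: random starting weights
start_weights = [random.uniform(-1, 1) for _ in range(2)]
start_bias = random.uniform(-1, 1)

weights, bias = train(points, labels, start_weights, start_bias)
print([predict(weights, bias, x) for x in points])
```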

#### Error Function
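A measure of how far we are from a good solution, for example the distance of the misclassified points from the line.

In order to use gradient descent, the error function cannot be discrete; it needs to be continuous (and differentiable).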

#### Log-loss Error Function
Assigns a large penalty to misclassified points and a small penalty to correctly classified points

### Discrete vs Continuous Prediction
Discrete is yes or no

Continuous is a number between 0 and 1
![image.png](attachment:image.png)

To change from discrete to continuous, we change our activation function from a step function to a sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The prediction is y_hat: $$\hat{y}=\sigma(Wx + b)$$
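
A small sketch of the continuous prediction, assuming the same kind of weights and bias as in the linear boundary section (the numbers are hypothetical). Points on the boundary get probability 0.5, and points far from it get probabilities close to 0 or 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, x):
    """Continuous prediction: y_hat = sigmoid(Wx + b)."""
    return sigmoid(sum(w * x_i for w, x_i in zip(weights, x)) + bias)

print(sigmoid(0))  # 0.5
# Hypothetical boundary 2*x_1 + x_2 - 4 = 0
print(predict_probability([2.0, 1.0], -4.0, [3.0, 1.0]))  # above 0.5
print(predict_probability([2.0, 1.0], -4.0, [0.5, 1.0]))  # below 0.5
```

This sketch does not guard against overflow in `math.exp` for very large negative scores.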


### Softmax
Is the equivalent of the sigmoid activation function, but for problems with 3 or more classes

$$P(\text{class } i) = \frac{e^{z_i}}{e^{z_1} + \cdots + e^{z_n}}$$

![image.png](attachment:image.png)
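
A minimal softmax sketch. The scores are hypothetical, and subtracting the largest score before exponentiating is an added numerical-stability choice that does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of scores z_1..z_n into probabilities that sum to 1."""
    shift = max(scores)  # assumption: shift for numerical stability
    exps = [math.exp(z - shift) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes
print(softmax([2.0, 1.0, 0.0]))
```

With two classes and scores (z, 0), the first probability reduces to the sigmoid of z, which is why softmax is described as the multi-class counterpart of the sigmoid.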

## Maximizing Probabilities

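Multiplying many probabilities gives a tiny number, and changing one factor changes the whole product (bad). Sums are easier to work with (good), and the logarithm turns a product into a sum. Since the log of a probability is negative, we use the sum of the negatives of the natural logs of the probabilities.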

#### Cross-Entropy (Important)
The sum of the negative natural logs of the probabilities.
A smaller number is better.

Formula:
$$ \text{Cross-Entropy} = -\sum_{i = 1}^{m} \left[ y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i) \right]$$

![image.png](attachment:image.png)
If there is a gift in this example (y_i = 1), the second part of the formula is 0; if there is no gift (y_i = 0), the first part is 0.
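
A short sketch of the binary cross-entropy formula. The labels and probabilities are hypothetical values chosen for illustration.

```python
import math

def cross_entropy(labels, probs):
    """Sum of -[y*ln(p) + (1 - y)*ln(1 - p)] over all events."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))

probs = [0.8, 0.7, 0.1]  # hypothetical probabilities that y_i = 1

likely = cross_entropy([1, 1, 0], probs)    # outcome the probabilities favor
unlikely = cross_entropy([0, 0, 1], probs)  # outcome the probabilities disfavor
print(likely, unlikely)  # the likely outcome has the smaller cross-entropy
```

For the first outcome the result equals the negative log of the product 0.8 * 0.7 * 0.9, which shows how the log turns the product of probabilities into a sum.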


#### Multi-Class Cross-Entropy

Formula:
$$ \text{Cross-Entropy} = -\sum_{i = 1}^{n} \sum_{j = 1}^{m} y_{ij} \ln(p_{ij})$$
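
A sketch of the multi-class formula, assuming the labels are one-hot rows (y_ij is 1 for the true class and 0 otherwise) and each row of probabilities sums to 1. The data below is hypothetical.

```python
import math

def multiclass_cross_entropy(one_hot_labels, probs):
    """Sum of -y_ij * ln(p_ij); only the true-class terms are non-zero."""
    return -sum(math.log(p)
                for label_row, prob_row in zip(one_hot_labels, probs)
                for y, p in zip(label_row, prob_row)
                if y == 1)

# Hypothetical data: two rows of one-hot labels over three options
labels = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
print(multiclass_cross_entropy(labels, probs))  # -(ln 0.7 + ln 0.6)
```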

# Logistic Regression

#### Error function formula
$$ \text{error} = -(1-y)\ln(1-\hat{y}) - y\ln(\hat{y})$$
$$ \text{Error function} = -\frac{1}{m} \sum_{i = 1}^{m} \left[ (1-y_i)\ln(1-\hat{y}_i) + y_i\ln(\hat{y}_i) \right]$$
The error function is in terms of W and b, which are the weights and bias of the model:
$$ E(W,b) = - \frac{1}{m} \sum_{i = 1}^{m} \left[ (1-y_i)\ln(1-\sigma(Wx^{(i)}+b)) + y_i\ln(\sigma(Wx^{(i)}+b)) \right]$$
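
A sketch of E(W, b) as a Python function. The name `logistic_error` and the sample data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_error(weights, bias, points, labels):
    """Average log-loss over all m points, as a function of W and b."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = sigmoid(sum(w * x_i for w, x_i in zip(weights, x)) + bias)
        total += (1 - y) * math.log(1 - y_hat) + y * math.log(y_hat)
    return -total / len(points)

# Hypothetical data; with all-zero weights every prediction is 0.5,
# so the error is ln(2) regardless of the labels.
print(logistic_error([0.0, 0.0], 0.0, [[1.0, 2.0], [3.0, 4.0]], [0, 1]))
```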

# Gradient Descent

The derivative of the sigmoid function has a simple form:

$$\sigma'(x)=\sigma(x)(1-\sigma(x))$$

The reason for this is the following; we can calculate it using the quotient rule:
![image.png](attachment:image.png)
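
As a quick sanity check (not a proof), the identity can be compared against a numerical derivative. The central-difference helper and the step size h are assumptions made for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```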


Our goal is to calculate the gradient of E at a point x = (x_1, ..., x_n), given by the partial derivatives

$$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$$

The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,

$$E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$$

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In summary, the gradient is

$$\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$$

#### Gradient Descent Step
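The gradient descent step subtracts a multiple α (the learning rate) of the gradient from each weight:

$$w_i' \leftarrow w_i - \alpha \left[-(y - \hat{y}) x_i\right]$$

which is equivalent to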

$$w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i$$

Similarly, it updates the bias in the following way:

$$b' \leftarrow b + \alpha (y - \hat{y})$$

Note: Since we've taken the average of the errors, the term we are adding should be $$\frac{1}{m} \cdot \alpha$$ instead of α, but as α is a constant, then in order to simplify calculations, we'll just take $$\frac{1}{m} \cdot \alpha $$ to be our learning rate, and abuse the notation by just calling it α.
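
A sketch of a single gradient descent step for one point, following the two update rules above. The function name and the sample values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(weights, bias, x, y, learn_rate):
    """Apply w_i <- w_i + alpha*(y - y_hat)*x_i and b <- b + alpha*(y - y_hat)."""
    y_hat = sigmoid(sum(w * x_i for w, x_i in zip(weights, x)) + bias)
    new_weights = [w + learn_rate * (y - y_hat) * x_i for w, x_i in zip(weights, x)]
    new_bias = bias + learn_rate * (y - y_hat)
    return new_weights, new_bias

# Hypothetical point with label 1; starting from zeros, y_hat = 0.5,
# so each weight moves by 0.1 * 0.5 * x_i and the bias by 0.1 * 0.5.
print(gradient_step([0.0, 0.0], 0.0, [1.0, 2.0], 1, learn_rate=0.1))
```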

### Gradient Descent Algorithm
Pseudocode
![image.png](attachment:image.png)
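
One way the full algorithm could look in Python, repeating the step above over every point for a fixed number of epochs. Starting from zeros (instead of random weights), the AND-style data, the learning rate, and the epoch count are assumptions chosen to keep the sketch reproducible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * x_i for w, x_i in zip(weights, x)) + bias)

def mean_error(weights, bias, points, labels):
    """Average log-loss over all points."""
    total = 0.0
    for x, y in zip(points, labels):
        y_hat = predict(weights, bias, x)
        total += -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return total / len(points)

def train(points, labels, learn_rate=0.5, epochs=1000):
    weights = [0.0] * len(points[0])  # assumption: zeros instead of random values
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            y_hat = predict(weights, bias, x)
            weights = [w + learn_rate * (y - y_hat) * x_i
                       for w, x_i in zip(weights, x)]
            bias += learn_rate * (y - y_hat)
    return weights, bias

# Hypothetical, linearly separable data (the AND truth table).
points = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

weights, bias = train(points, labels)
print(mean_error([0.0, 0.0], 0.0, points, labels))  # error before training
print(mean_error(weights, bias, points, labels))    # error after training
```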

## Perceptron vs Gradient Descent

- The perceptron algorithm changes the weights only for misclassified points, while gradient descent updates the weights for every point.
- With gradient descent, the prediction can be any number between 0 and 1, while a perceptron's prediction can only be 0 or 1.
- If a point is classified correctly, the perceptron does nothing further, while gradient descent attempts to reduce the error further by moving the line farther away from the point.

## Non-Linear Models
![image.png](attachment:image.png)

## Neural Networks
3 layers:
* Input Layer
* Hidden Layer
* Output Layer

#### Feedforward
The process neural networks use to turn the input into an output
![image.png](attachment:image.png)
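
A minimal feedforward sketch for a network with one hidden layer, assuming sigmoid activations in every layer. The layer sizes, weights, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs):
    """Each row of weights, plus its bias, produces one unit's activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def feedforward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = layer(hidden_w, hidden_b, x)  # input layer -> hidden layer
    return layer(out_w, out_b, hidden)     # hidden layer -> output layer

# Hypothetical network: 2 inputs, 2 hidden units, 1 output
hidden_w = [[0.5, -0.5], [0.3, 0.8]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]
print(feedforward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b))
```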

## Backpropagation
Now, we're ready to get our hands into training a neural network. For this, we'll use the method known as backpropagation. In a nutshell, backpropagation will consist of:

* Doing a feedforward operation.
* Comparing the output of the model with the desired output.
* Calculating the error.
* Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
* Using this to update the weights and get a better model.
* Continuing this until we have a model that is good.
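
The steps above can be sketched for a small network with one hidden layer and a single sigmoid output trained with log-loss. This is a sketch under those assumptions, not a general implementation; the network shape, starting weights, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, y_hat):
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Feedforward pass; returns the hidden activations and the prediction."""
    hidden = [sigmoid(sum(w * x_i for w, x_i in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    y_hat = sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
    return hidden, y_hat

def backprop_step(x, y, hidden_w, hidden_b, out_w, out_b, learn_rate=0.1):
    hidden, y_hat = forward(x, hidden_w, hidden_b, out_w, out_b)
    # Error signal at the output (log-loss with a sigmoid output).
    out_delta = y_hat - y
    # Spread the error back to each hidden unit through the output weights,
    # scaled by the sigmoid derivative h * (1 - h).
    hidden_deltas = [out_delta * w * h * (1 - h) for w, h in zip(out_w, hidden)]
    # Gradient descent updates for every weight and bias.
    new_out_w = [w - learn_rate * out_delta * h for w, h in zip(out_w, hidden)]
    new_out_b = out_b - learn_rate * out_delta
    new_hidden_w = [[w - learn_rate * d * x_i for w, x_i in zip(row, x)]
                    for row, d in zip(hidden_w, hidden_deltas)]
    new_hidden_b = [b - learn_rate * d for b, d in zip(hidden_b, hidden_deltas)]
    return new_hidden_w, new_hidden_b, new_out_w, new_out_b

# Hypothetical network (2 inputs, 2 hidden units, 1 output) and one training point
params = ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.0], [1.0, -1.0], 0.0)
x, y = [1.0, 0.0], 1

error_before = log_loss(y, forward(x, *params)[1])
params = backprop_step(x, y, *params)
error_after = log_loss(y, forward(x, *params)[1])
print(error_before, error_after)  # the error on this point should go down
```

Repeating `backprop_step` over all training points for many epochs corresponds to the last two items in the list.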