## Boundary line

![lb1](pics/boundary_line_1.png)

![lb2](pics/line_boundary_2.png)

## Higher Dimensions

![h_d_1](pics/higher_dim.png)

### QUIZ QUESTION

![h_d_q](pics/h_d_quiz.png)

## Perceptrons as Logical Operators

In this lesson, we'll see one of the many great applications of perceptrons: as logical operators! You'll have the chance to create the perceptrons for the most common of these, the **AND**, **OR**, and **NOT** operators. Then we'll see what to do about the elusive **XOR** operator. Let's dive in!

### AND Perceptron

![and-quiz.png](pics/and-quiz.png)

#### What are the weights and bias for the AND perceptron?

Set the weights (`weight1`, `weight2`) and bias (`bias`) to values that will correctly determine the AND operation as shown above.
More than one set of values will work!

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.5
weight2 = 0.5
bias = -1.0


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))


Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -1.0                    0          Yes
       0          1                  -0.5                    0          Yes
       1          0                  -0.5                    0          Yes
       1          1                   0.0                    1          Yes


### OR Perceptron

![or-quiz.png](pics/or-quiz.png)

The OR perceptron is very similar to an AND perceptron. In the image below, the OR perceptron has the same line as the AND perceptron, except the line is shifted down. What can you do to the weights and/or bias to achieve this? Use the following AND perceptron to create an OR Perceptron.

![and-to-or.png](pics/and-to-or.png)

![and_or.png](pics/and_or.png)
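
As a rough sketch of the idea above, the line can be shifted down either by raising the bias or by scaling up both weights. The AND values are the ones used in the AND quiz; the two OR variants and the helper name `perceptron` are illustrative assumptions, and other values work as well.

```python
def perceptron(x1, x2, weight1, weight2, bias):
    """Step-activated perceptron: returns 1 when the linear combination is >= 0."""
    return int(weight1 * x1 + weight2 * x2 + bias >= 0)

test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND values from the AND quiz.
and_params = (0.5, 0.5, -1.0)
# Option 1 (assumed values): keep the weights, increase the bias.
or_by_bias = (0.5, 0.5, -0.5)
# Option 2 (assumed values): keep the bias, increase both weights.
or_by_weights = (1.0, 1.0, -1.0)

and_out = [perceptron(a, b, *and_params) for a, b in test_inputs]
or_out_bias = [perceptron(a, b, *or_by_bias) for a, b in test_inputs]
or_out_weights = [perceptron(a, b, *or_by_weights) for a, b in test_inputs]

print(and_out)         # [0, 0, 0, 1]
print(or_out_bias)     # [0, 1, 1, 1]
print(or_out_weights)  # [0, 1, 1, 1]
```

Both OR variants describe the same boundary line, $x_1 + x_2 - 1 = 0$, which sits below the AND line $x_1 + x_2 - 2 = 0$.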

### NOT Perceptron

Unlike the other perceptrons we looked at, the NOT operation only cares about one input. The operation returns a `0` if the input is `1` and a `1` if it's a `0`. The other inputs to the perceptron are ignored.

In this quiz, you'll set the weights (`weight1`, `weight2`) and bias (`bias`) to values that calculate the NOT operation on the second input and ignore the first input.

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = -1.0
bias = 0.0


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                   0.0                    1          Yes
       0          1                  -1.0                    0          Yes
       1          0                   0.0                    1          Yes
       1          1                  -1.0                    0          Yes


### XOR Perceptron

![xor.png](pics/xor.png)

Now, let's build a multi-layer perceptron from the AND, NOT, and OR perceptrons to create XOR logic!

The neural network below contains 3 perceptrons, A, B, and C. The last one (AND) has been given for you. The input to the neural network is from the first node. The output comes out of the last node.

The multi-layer perceptron below calculates XOR. Each perceptron is a logic operation of AND, OR, and NOT. However, the perceptrons A, B, and C don't indicate their operation. In the following quiz, set the correct operations for the perceptrons to calculate XOR.

![xor_nn.png](pics/xor_nn.png)
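
One assignment that reproduces XOR can be sketched as follows. The wiring (A = AND, B = OR, C = NOT applied to A's output, feeding the final AND) is an assumption about the diagram, and the function names and weights are illustrative.

```python
def step(t):
    """Step activation: 1 if t >= 0, else 0."""
    return int(t >= 0)

def and_gate(a, b):
    return step(0.5 * a + 0.5 * b - 1.0)

def or_gate(a, b):
    return step(0.5 * a + 0.5 * b - 0.5)

def not_gate(a):
    # Single-input NOT: weight -1, bias 0.
    return step(-1.0 * a)

def xor_gate(a, b):
    # XOR(a, b) = AND(OR(a, b), NOT(AND(a, b)))
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))

xor_out = [xor_gate(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_out)  # [0, 1, 1, 0]
```

No single perceptron can produce this output, because the two classes cannot be separated by one straight line; stacking perceptrons in layers is what makes it possible.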

## Perceptron Trick

In the last section you used your logic and your mathematical knowledge to create perceptrons for some of the most common logical operators. In real life, though, we can't be building these perceptrons ourselves. The idea is that we give them the result, and they build themselves. For this, here's a pretty neat trick that will help us.


![perceptronquiz.png](pics/perceptronquiz.png)

**Solution**: Closer

## Time for some math!

Now that we've learned that misclassified points want the line to move closer to them, let's do some math. The figure below shows a mathematical trick that modifies the equation of the line so that it comes closer to a particular point.

![perceptron_trick_1.png](pics/perceptron_trick_1.png)


### Quiz

For the second example, where the line is described by $3x_1 + 4x_2 - 10 = 0$, if the learning rate was set to $0.1$, how many times would you have to apply the perceptron trick to move the line to a position where the blue point, at $(1, 1)$, is correctly classified?

**Solution:** 10
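
The count can be checked with a small sketch. It assumes the blue point belongs in the positive region (score $\geq 0$) and currently sits in the negative one; the function name `steps_until_classified` is hypothetical. `Fraction` is used so that repeated additions of $0.1$ do not accumulate floating-point error.

```python
from fractions import Fraction

def steps_until_classified(w1, w2, b, point, learn_rate):
    """Apply the perceptron trick until the point's score is >= 0."""
    p, q = point
    steps = 0
    while w1 * p + w2 * q + b < 0:
        # Move the line towards the misclassified positive point.
        w1 += learn_rate * p
        w2 += learn_rate * q
        b += learn_rate
        steps += 1
    return steps

n = steps_until_classified(
    Fraction(3), Fraction(4), Fraction(-10), (1, 1), Fraction(1, 10)
)
print(n)  # 10
```

The score at $(1, 1)$ starts at $3 + 4 - 10 = -3$ and rises by $0.3$ per step, so it reaches $0$ after 10 steps.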


## Perceptron Algorithm

And now, with the perceptron trick in our hands, we can fully develop the perceptron algorithm!

There's a small error in the course video for this section: $W_i$ should be updated to $W_i = W_i + \alpha x_i$ (plus or minus depending on the situation).

### Coding the Perceptron Algorithm

Implement the perceptron algorithm.

![points.png](pics/points.png)

Recall that the perceptron step works as follows. For a point with coordinates $(p,q)$, label $y$, and prediction given by the equation $\hat{y} = step(w_1x_1 + w_2x_2 + b)$:

* If the point is correctly classified, do nothing.
* If the point is classified positive, but it has a negative label, subtract $\alpha p$, $\alpha q$, and $\alpha$ from $w_1$, $w_2$, and $b$ respectively.
* If the point is classified negative, but it has a positive label, add $\alpha p$, $\alpha q$, and $\alpha$ to $w_1$, $w_2$, and $b$ respectively.

Feel free to play with the parameters of the algorithm (number of epochs, learning rate, and even the randomizing of the initial parameters) to see how your initial conditions can affect the solution!

import numpy as np
# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])

# TODO: Fill in the code below to implement the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y_hat > y[i]:
            # Classified positive but labelled negative: subtract.
            W[0] -= learn_rate * X[i][0]
            W[1] -= learn_rate * X[i][1]
            b -= learn_rate
        elif y_hat < y[i]:
            # Classified negative but labelled positive: add.
            W[0] += learn_rate * X[i][0]
            W[1] += learn_rate * X[i][1]
            b += learn_rate
    return W, b
    
# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines


## Error function

An error function is a function that tells us how far we are from the solution.

### Log-loss Error Function

![log-loss-error-function.png](pics/log-loss-error-function.png)


### Discrete vs Continuous Predictions


![discrete_continuous_1.png](pics/discrete_continuous_1.png)
![discrete_continuous_1.png](pics/discrete_continuous_2.png)
![discrete_continuous_1.png](pics/discrete_continuous_3.png)
![discrete_continuous_1.png](pics/discrete_continuous_4.png)
![discrete_continuous_1.png](pics/discrete_continuous_5.png)
![discrete_continuous_1.png](pics/discrete_continuous_6.png)
![discrete_continuous_1.png](pics/discrete_continuous_7.png)

## The Softmax Function

Now you'll learn about the softmax function, which is the equivalent of the sigmoid activation function, but when the problem has 3 or more classes.

Let's assume you want to predict whether you will receive a gift or not.

![softmax_1.png](pics/softmax_1.png)

Let us now consider a problem where we want our model to tell us whether an animal is a duck, a beaver, or a walrus.

![softmax_1.png](pics/softmax_2.png)


![softmax_1.png](pics/softmax_quiz_1.png)

**Solution:** *exp*

![softmax_1.png](pics/softmax_3.png)

![softmax_1.png](pics/softmax_4.png)

**Solution:** *yes*

### Coding Softmax 

import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    sf = []
    for l in L:
        sfDum = np.exp(l)/sumExpL
        sf.append(sfDum)
    return sf
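
A vectorized variant is sketched below. Subtracting the maximum score before exponentiating is a common numerical-stability trick (an addition here, not part of the quiz); it leaves the result unchanged because the common factor cancels in the ratio. The name `softmax_stable` and the example scores are illustrative.

```python
import numpy as np

def softmax_stable(L):
    """Softmax of a list of scores, computed without overflowing np.exp."""
    z = np.asarray(L, dtype=float)
    # Shifting by the maximum does not change the ratios exp(z_i) / sum(exp(z)).
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

probs = softmax_stable([2.0, 1.0, 0.0])
print(probs)        # three probabilities, largest for the largest score
print(probs.sum())  # sums to 1 (up to rounding)

# Large scores would overflow a naive np.exp, but work here.
big = softmax_stable([1000.0, 1000.0])
```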

## Maximum Likelihood 

Probability will be one of our best friends as we go through Deep Learning. In this lesson, we'll see how we can use probability to evaluate (and improve!) our models.

With this method, we pick the model that assigns the highest probability to the labels that were actually observed.

![maximum_likelihood_1.png](pics/maximum_likelihood_1.png) 

What we mean by this is that if the model is given by these probability spaces, then the probability that the points are of these colours is 0.0084.

Let's do the same for another model.

![maximum_likelihood_1.png](pics/maximum_likelihood_2.png) 


We can see that the second model is much better, so we should be going for that. 
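
The comparison can be sketched in a few lines. The per-point probabilities below are assumed values: the first set is chosen so that its product matches the 0.0084 quoted above, and the second set is purely illustrative.

```python
import math

# Assumed probabilities that each model assigns to the observed colour of each point.
model_1 = [0.6, 0.2, 0.1, 0.7]  # product matches the 0.0084 quoted above
model_2 = [0.7, 0.9, 0.8, 0.6]  # purely illustrative

# The likelihood of a model is the product of the per-point probabilities.
likelihood_1 = math.prod(model_1)
likelihood_2 = math.prod(model_2)

best = "model 2" if likelihood_2 > likelihood_1 else "model 1"
print(best)  # model 2
```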

### Maximizing Probabilities

In this lesson and quiz, we will learn how to maximize a probability, using some math. Nothing more than high school math, so get ready for a trip down memory lane!

As we saw before, our goal is to minimize the loss function. Could minimizing the loss function and maximizing the probability be connected?

## Cross-Entropy

So to find the best model, all we need to do is maximize the total probability. But that means taking the **product** of all the probabilities, and products are hard to work with. Why?

1. With thousands of probabilities, each between 0 and 1, the product becomes tiny.
2. When a single probability changes, the product can change drastically.

In summary, we really want to stay away from products. Let's use sums instead! How can we do it? With **log**, because the logarithm turns products into sums: $\ln(ab) = \ln(a) + \ln(b)$.

![cross_entropy_1.png](pics/cross_entropy_1.png)
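
A small sketch of why sums of logarithms are easier to handle than products. The probabilities are assumed, illustrative values.

```python
import math

probs = [0.6, 0.2, 0.1, 0.7]  # assumed example probabilities

product = math.prod(probs)
neg_log_sum = sum(-math.log(p) for p in probs)

# The negative log of the product equals the sum of the negative logs.
print(math.isclose(-math.log(product), neg_log_sum))  # True

# With many factors the product underflows, while the sum stays manageable.
many = [0.5] * 2000
tiny_product = math.prod(many)
big_sum = sum(-math.log(p) for p in many)
print(tiny_product)  # 0.0 (underflow)
print(big_sum)       # roughly 2000 * ln(2)
```

Since $-\ln$ is a decreasing function, maximizing the product of probabilities is the same as minimizing the sum of their negative logarithms.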

So we're getting somewhere: there's definitely a connection between probabilities and error functions, and it's called **Cross-Entropy**. This concept is tremendously popular in many fields, including Machine Learning. Let's dive more into the formula, and actually code it!

Essentially, Cross-Entropy says the following: given some events and their corresponding probabilities, if events with a high probability of occurring do occur, the Cross-Entropy is small; if events with a low probability of occurring occur anyway, the Cross-Entropy is large.

$$-\sum_{i = 1}^{m} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$$

import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    pr = [-y * np.log(p) - (1 - y) * np.log(1 - p) for y, p in zip(Y, P)]
    ce = sum(pr)
    return ce
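
A vectorized sketch of the same formula, with a quick comparison of good and bad predictions. The name `cross_entropy_vec` and the example labels and probabilities are illustrative assumptions; probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import numpy as np

def cross_entropy_vec(Y, P):
    """Binary cross-entropy of labels Y (0 or 1) and predicted probabilities P."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

labels = [1, 1, 0]
good_predictions = [0.9, 0.8, 0.1]  # confident and correct
bad_predictions = [0.2, 0.3, 0.9]   # confident and wrong

ce_good = cross_entropy_vec(labels, good_predictions)
ce_bad = cross_entropy_vec(labels, bad_predictions)
print(ce_good < ce_bad)  # True: better predictions give a smaller cross-entropy
```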

### Multi-Class Cross-Entropy

![multi_class_entropy.png](pics/multi_class_entropy.png)

$$\text{Cross-Entropy} = -\sum_{i = 1}^{n}\sum_{j = 1}^{m}y_{ij}\ln(p_{ij})$$

where $m$ is the number of classes.

![multi-entropy-quiz.png](pics/multi-entropy-quiz.png)
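
The double sum can be sketched with NumPy as follows. The layout is an assumption: rows are indexed by $i$, columns by the $m$ classes $j$, `Y` is one-hot, and every entry of `P` is strictly positive. The function name and the example values are illustrative.

```python
import numpy as np

def multiclass_cross_entropy(Y, P):
    """Sum of -y_ij * ln(p_ij) over all rows i and classes j."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return float(-np.sum(Y * np.log(P)))

# Two rows, three classes (assumed values).
Y = [[1, 0, 0],
     [0, 1, 0]]
P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3]]

ce = multiclass_cross_entropy(Y, P)
print(ce)  # equals -(ln(0.7) + ln(0.4))
```

Only the probability assigned to the true class of each row contributes, because all other $y_{ij}$ are zero.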



## Logistic Regression

Now, we're finally ready for one of the most popular and useful algorithms in Machine Learning, and the building block of all that constitutes Deep Learning: the Logistic Regression algorithm. It basically goes like this:

* Take your data
* Pick a random model
* Calculate the error
* Minimize the error, and obtain a better model
* Enjoy!

### Calculating the Error Function

We will use the Cross-Entropy as seen before, but we will take its average rather than its sum.

![error-funciton.png](pics/error-funciton.png)

### Gradient Descent

We learned that in order to minimize the error function, we need to take some derivatives. So let's get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,

![codecogseqn-49.gif](pics/codecogseqn-49.gif)

And now, let's recall that if we have $m$ points labelled $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, the error formula is:

$E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)$

where the prediction is given by $\hat{y_i} = \sigma(Wx^{(i)} + b)$. 

Our goal is to calculate the gradient of $E,$ at a point $x = (x_1, \ldots, x_n)$, given by the partial derivatives

$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$

To simplify our calculations, we'll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,

$$E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$$

In order to calculate the derivative of this error with respect to the weights, we'll first calculate $\frac{\partial}{\partial w_j} \hat{y}$. Recall that $\hat{y} = \sigma(Wx+b)$, so:

![codecogseqn-43.gif](pics/codecogseqn-43.gif)

The last equality holds because the only term in the sum that is not constant with respect to $w_j$ is $w_j x_j$, which clearly has derivative $x_j$.

Now, we can go ahead and calculate the derivative of the error $E$ at a point $x$, with respect to the weight $w_j$.

![codecogseqn-60-2.png](pics/codecogseqn-60-2.png)

A similar calculation will show us that

![codecogseqn-58.gif](pics/codecogseqn-58.gif)
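
For reference, the steps shown in the images above can be written out as follows. This is a reconstruction from the surrounding definitions ($\hat{y} = \sigma(Wx+b)$, the sigmoid derivative $\sigma'(t) = \sigma(t)(1-\sigma(t))$, and the per-point error $E$), not a transcription of the original images.

$$\frac{\partial}{\partial w_j}\hat{y} = \sigma(Wx+b)\,(1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j}(Wx+b) = \hat{y}(1-\hat{y})\,x_j$$

$$\frac{\partial}{\partial w_j}E = -y\,\frac{1}{\hat{y}}\,\hat{y}(1-\hat{y})\,x_j + (1-y)\,\frac{1}{1-\hat{y}}\,\hat{y}(1-\hat{y})\,x_j = -y(1-\hat{y})\,x_j + (1-y)\,\hat{y}\,x_j = -(y-\hat{y})\,x_j$$

$$\frac{\partial}{\partial b}E = -(y-\hat{y})$$

The bias case follows the same steps, with $\frac{\partial}{\partial b}(Wx+b) = 1$ in place of $x_j$.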

This actually tells us something very important. For a point with coordinates $(x_1, \ldots, x_n)$, label $y$, and prediction $\hat{y},$ the gradient of the error function at that point is $\left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right)$. In summary, the gradient is

$\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$.

If you think about it, this is fascinating. **The gradient is actually a scalar times the coordinates of the point!** And what is the scalar? Nothing less than a **multiple of the difference between the label and the prediction**. What significance does this have?

![gd_quiz_1.png](pics/gd_quiz_1.png)

So, a small gradient means we'll change our coordinates by a little bit, and a large gradient means we'll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We'll see it in a bit.

### Gradient Descent Step
Since the gradient descent step simply consists of **subtracting a multiple of the gradient of the error function at every point**, it updates the weights in the following way:

$$w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i],$$ which is equivalent to


$$w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i.$$

Similarly, it updates the bias in the following way:

$$b' \leftarrow b + \alpha (y - \hat{y}).$$

*Note*: Since we've taken the average of the errors, the term we are adding should be $\frac{1}{m} \cdot \alpha $ instead of $\alpha$, but as $\alpha$ is a constant, then in order to simplify calculations, we'll just take $\frac{1}{m} \cdot \alpha$ to be our learning rate, and abuse the notation by just calling it $\alpha.$

![Gradient_descent_algorithm.png](pics/Gradient_descent_algorithm.png)
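
The update rules above can be sketched for a single point as follows. The names `sigmoid`, `point_error`, and `gradient_descent_step`, the starting weights, and the example point are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def point_error(x, y, W, b):
    """Cross-entropy error produced by a single point."""
    y_hat = sigmoid(np.dot(W, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def gradient_descent_step(x, y, W, b, learn_rate=0.1):
    """One update: w_i <- w_i + alpha * (y - y_hat) * x_i, b <- b + alpha * (y - y_hat)."""
    y_hat = sigmoid(np.dot(W, x) + b)
    W_new = W + learn_rate * (y - y_hat) * x
    b_new = b + learn_rate * (y - y_hat)
    return W_new, b_new

x = np.array([1.0, 2.0])   # assumed example point
y = 1                      # its label
W = np.zeros(2)            # starting weights
b = 0.0                    # starting bias

error_before = point_error(x, y, W, b)
W_new, b_new = gradient_descent_step(x, y, W, b)
error_after = point_error(x, y, W_new, b_new)
print(error_after < error_before)  # True: the step reduces the error at this point
```

With zero weights the prediction is $0.5$, so the step adds $0.1 \cdot 0.5 \cdot x$ to the weights and $0.05$ to the bias.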


### Gradient descent vs Perceptron

![gradient_descent_vs_perceptron.png](pics/gradient_descent_vs_perceptron.png) 

The difference between the perceptron algorithm and the gradient descent algorithm is that the perceptron does nothing when a point is classified correctly, whereas gradient descent still pushes the boundary farther away from that point to further reduce the cost function.

## Non-Linear Models

![nl_1.png](pics/nl_1.png)



## Neural Network Architecture

Ok, so we're ready to put these building blocks together, and build great Neural Networks! (Or Multi-Layer Perceptrons, however you prefer to call them.)

![nn_1.png](pics/nn_1.png)

![nn_2.png](pics/nn_2.png)

![nn_3.png](pics/nn_3.png)

![nn_4.png](pics/nn_4.png)

![nn_4.png](pics/nn_5.png)


#### QUESTION 1 OF 2

Let's define the combination of two new perceptrons as w1\*0.4 + w2\*0.6 + b. Which of the following values for the weights and the bias would result in a final probability of 0.88 for the point?

Solution: w1: 3, w2: 5, b:-2.2
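
The solution can be checked numerically, assuming the combination is passed through the sigmoid function to obtain the final probability (the helper name `sigmoid` is illustrative).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1, w2, b = 3, 5, -2.2

# Linear combination of the two perceptron outputs, 0.4 and 0.6.
score = w1 * 0.4 + w2 * 0.6 + b
prob = sigmoid(score)
print(round(prob, 2))  # 0.88
```

The score is $1.2 + 3.0 - 2.2 = 2.0$, and $\sigma(2) \approx 0.88$.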

### Multiple layers

Now, not all neural networks look like the one above. They can be way more complicated! In particular, we can do the following things:

* Add more nodes to the input, hidden, and output layers.
* Add more layers.

We'll see the effects of these changes below.

![nn_4.png](pics/nn_6.png)
![nn_4.png](pics/nn_7.png)
![nn_4.png](pics/nn_8.png)
![nn_4.png](pics/nn_9.png)
![nn_4.png](pics/nn_10.png)

### Multi-Class Classification

And here we elaborate a bit more into what can be done if our neural network needs to model data with more than one output.

![nn_4.png](pics/nn_11.png)
![nn_4.png](pics/nn_12.png)

#### QUESTION 2 OF 2

How many nodes in the output layer would you require if you were trying to classify all the letters in the English alphabet?

Solution: 26

### Feedforward

Feedforward is the process neural networks use to turn the input into an output. Let's study it more carefully, before we dive into how to train the networks.

![nn_13](pics/nn_13.png)
![nn_13](pics/nn_14.png)

![nn_13](pics/nn_15.png)
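
As a minimal sketch, feedforward for a network with one hidden layer is just alternating linear combinations and sigmoid activations. The layer sizes, weights, and the function name `feedforward` are assumptions made for illustration, not values from the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, b1, W2, b2):
    """Turn the input x into an output probability through one hidden layer."""
    hidden = sigmoid(W1 @ x + b1)        # hidden layer activations
    output = sigmoid(W2 @ hidden + b2)   # output layer activation
    return output

x = np.array([0.4, 0.6])                  # assumed input point
W1 = np.array([[1.0, -1.0], [2.0, 0.5]])  # assumed weights: 2 inputs -> 2 hidden nodes
b1 = np.array([0.0, -1.0])
W2 = np.array([[3.0, 5.0]])               # assumed weights: 2 hidden nodes -> 1 output
b2 = np.array([-2.2])

y_hat = feedforward(x, W1, b1, W2, b2)
print(y_hat)  # a single probability between 0 and 1
```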

### Error Function

Just as before, neural networks will produce an error function, which, in the end, is what we'll be minimizing. The figures below show the error function for a neural network.

![nn_13](pics/nn_16.png)
![nn_13](pics/nn_17.png)


### Backpropagation

Now, we're ready to get our hands into training a neural network. For this, we'll use the method known as **backpropagation**. In a nutshell, backpropagation will consist of:

* Doing a feedforward operation.
* Comparing the output of the model with the desired output.
* Calculating the error.
* Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
* Using this to update the weights, and get a better model.
* Continuing this until we have a model that is good.

![nn_13](pics/nn_18.png)
![nn_13](pics/nn_19.png)
![nn_13](pics/nn_20.png)
![nn_13](pics/nn_21.png)
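
The steps listed above can be sketched for a tiny network with one hidden layer, sigmoid activations, and the cross-entropy error. The network shape, starting weights, and all function names are assumptions for illustration; libraries such as Keras handle this automatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    hidden = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ hidden + b2)
    return hidden, y_hat

def error(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def backprop_step(x, y, W1, b1, W2, b2, learn_rate=0.1):
    # Feedforward.
    hidden, y_hat = forward(x, W1, b1, W2, b2)
    # Compare with the label: for sigmoid + cross-entropy,
    # the error derivative at the output node is y_hat - y.
    delta_out = y_hat - y
    # Spread the error backwards to the hidden nodes (chain rule).
    delta_hidden = W2 * delta_out * hidden * (1 - hidden)
    # Update every weight by subtracting a multiple of its gradient.
    W2_new = W2 - learn_rate * delta_out * hidden
    b2_new = b2 - learn_rate * delta_out
    W1_new = W1 - learn_rate * np.outer(delta_hidden, x)
    b1_new = b1 - learn_rate * delta_hidden
    return W1_new, b1_new, W2_new, b2_new

x, y = np.array([1.0, 0.5]), 1                 # assumed training point and label
W1 = np.array([[0.1, -0.2], [0.3, 0.4]])       # assumed starting weights
b1 = np.zeros(2)
W2 = np.array([0.2, -0.1])
b2 = 0.0

error_before = error(y, forward(x, W1, b1, W2, b2)[1])
params = backprop_step(x, y, W1, b1, W2, b2)
error_after = error(y, forward(x, *params)[1])
print(error_after < error_before)  # True: one step reduces the error at this point
```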

### Backpropagation 

Feel free to tune out, since this part gets handled by Keras pretty well. If you'd like to go start training networks right away, go to the next section. But if you enjoy calculating lots of derivatives, let's dive in!

![nn_13](pics/nn_22.png)
![nn_13](pics/nn_23.png)
![nn_13](pics/nn_24.png)


#### Chain Rule
![nn_13](pics/nn_25.png)

![nn_13](pics/nn_26.png)
![nn_13](pics/nn_27.png)
![nn_13](pics/nn_28.png)

### Calculation of the derivative of the sigmoid function

Recall that the sigmoid function has a beautiful derivative, which we can see in the following calculation. This will make our backpropagation step much cleaner.

![sigmoid-derivative.gif](pics/sigmoid-derivative.gif)
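
The identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be checked numerically against a finite-difference approximation. The function names, sample points, and step size `h` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-5, 5, 11)
h = 1e-6

# Central finite difference as an independent estimate of the derivative.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(np.allclose(numeric, sigmoid_prime(z), atol=1e-6))  # True
```

Because the derivative is expressed through the sigmoid's own output, backpropagation can reuse the activations already computed during feedforward.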
