# 2 Classification with Neural Networks
In this chapter, you will understand the workings of a classifier and manually train one that operates on a single value. You will improve the classifier step by step and learn fundamental concepts about classification as you go along.
Finally, you will use automated backpropagation to train a multi-layer neural network to emulate a logic gate.

## 2.1 Introduction
In machine learning and statistics, classification is the problem of identifying to which set of categories (sub-populations) a new observation belongs to, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class or assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). [1]

A classification process requires a dataset that is split into different categories. A classifier can be trained on this dataset by learning the relationship between certain properties of the input data and the corresponding categories. 
To classify new data, the process is similar as in the chapter "Regression", however additional computational steps can be added depending on the application.
A common classification problem that can be solved by neural networks is image recognition (seen in Figure 1).



<img src="images/neural_network_classification.png" />
<p style="text-align: center;">
    Fig. 1 - Image recognition by a neural network
</p>

Run the cells below to import the necessary libraries and define a ReLU, MSE Loss function and a SimpleNeuron Class.

In [1]:
from __future__ import annotations # Used to allow referencing classes that have not yet been defined in type annotations. This will become default behaviour in Python 3.10. Until then, we have to use this line to enable that behaviour
from typing import *

import numpy as np
from ipywidgets import interact, Layout, FloatSlider
import plotly.offline as plotly
import plotly.graph_objs as go
import time
import threading
from typing import *

In [2]:
def relu(input_val: np.ndarray) -> np.ndarray:
    return np.maximum(input_val, 0)

In [3]:
def mean_squared_loss(predictions: np.ndarray, solutions: np.ndarray) -> float:
    total_squared_loss = np.sum(np.subtract(predictions, solutions)**2) #np allows to handle both values and lists
    mean_squared_loss = total_squared_loss/len(predictions)
    return mean_squared_loss

In [4]:
class SimpleNeuron:
    def __init__(self, plot: Interactive2DPlot):
        self.plot = plot #I am assigned the following plot
        self.plot.register_neuron(self) #hey plot, remember me
        
    def set_values(self, weight: float, bias: float):
        self.weight = weight
        self.bias = bias
        self.plot.update() #hey plot, I have changed, redraw my output
        
    def get_weight(self) -> float:
        return self.weight
    
    def get_bias(self) -> float:
        return self.bias

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        self.activation = np.dot(self.weight, x) + self.bias
        return self.activation

In [5]:
# an Interactive Plot monitors the activation of a neuron or a neural network
class Interactive2DPlot:
    def __init__(self, points_red: Dict[str, List[float]], points_blue: Dict[str, List[float]], ranges: Dict[str, Tuple[float, float]], loss_function: Callable[[np.ndarray, np.ndarray], float] = mean_squared_loss, loss_string: str = "Loss", width: int = 800, height: int = 400, margin: Dict[str, int] = { 't': 0, 'l': 170 }, draw_time: float = 0.1):
        self.idle = True
        self.points_red = points_red
        self.points_blue = points_blue
        self.draw_time = draw_time
        self.loss_function = loss_function
        self.loss_string = loss_string

        self.x = np.arange(ranges["x"][0], ranges["x"][1], 0.01)
        self.y = np.arange(ranges["y"][0], ranges["y"][1], 0.01)

        self.layout = go.Layout(
            xaxis=dict(title="Neck height in m", range=ranges["x"]),
            yaxis=dict(title="y", range=ranges["y"]),
            width=width,
            height=height,
            showlegend=False,
            margin=margin,
        )
        self.trace = go.Scatter(x=self.x, y=self.y)

        self.plot_points_red = go.Scatter(
            x=points_red["x"], y=points_red["y"], mode="markers", marker=dict(color='rgb(255, 0, 0)', size=10)
        )
        self.plot_points_blue = go.Scatter(
            x=points_blue["x"],
            y=points_blue["y"],
            mode="markers",
            marker=dict(color='rgb(0, 0, 255)', size=10, symbol="square"),
        )

        self.plot_point_new = go.Scatter(
            x=[], y=[], mode="markers", marker=dict(size=20, symbol="star", color='rgb(0,0,0)')
        )

        self.data = [self.trace, self.plot_points_red, self.plot_points_blue, self.plot_point_new]
        self.plot = go.FigureWidget(self.data, self.layout)

    def register_neuron(self, neuron: SimpleNeuron):
        self.neuron = neuron

    def redraw(self):
        self.idle = False
        time.sleep(self.draw_time)
        self.plot.data[0].y = self.neuron.compute(self.x)
        self.idle = True

    def update(self):
        loss_red = self.loss_function(self.neuron.compute(self.points_red["x"]), self.points_red["y"])
        loss_blue = self.loss_function(self.neuron.compute(self.points_blue["x"]), self.points_blue["y"])
        print(self.loss_string,": {:0.3f}".format((loss_red + loss_blue) / 2))

        if self.idle:
            thread = threading.Thread(target=self.redraw)
            thread.start()

## 2.2 From Regression to Classification

###  2.2.1 Linear Regression

You find yourself working on a farm with sheep and llamas grazing in seperate enclosures. However, last night the shepard forgot to close the gate between the two enclosures. The llamas and sheep now are mixed and have to be seperated again. You immediately come up with a machine learning based solution to separate the sheep from the llamas again: You assume that llamas can be distinguished from sheep by measuring the distance from the top of their head to their spine, since llamas have significantly longer necks. Using a LIDAR scanner, neck heights will be measured autonomously and the animals will be seperated using a food enticement and an electronic turnstile that only lets llamas through.


<img src="images/neck_heights.png" />
<p style="text-align: center;">
    Fig. 2 - Concept of neck height measurement
</p>

To collect sample data, you go out on the field with a measuring tape and measure the neck heights of some sheep and llamas. You specify two categories: '0' for sheep and '1' for llamas. (See table 1)

Most llamas are grown up and have long necks, but there are also some young llamas with smaller necks. However, since their necks are still longer than the sheeps', you figure that this won't be a problem.

|  Animal | Neck height  | Category  |
|---------|--------------|-----------|
| Sheep #1| 0.20m        |0          |
| Sheep #2| 0.23m        |0          |
| Sheep #3| 0.28m        |0          |
| Sheep #4| 0.32m        |0          |
| Sheep #5| 0.35m        |0          |
| Llama #1| 0.55m        |1          |
| Llama #2| 0.68m        |1          |
| Llama #3| 0.74m        |1          |
| Llama #4| 0.83m        |1          |
| Llama #5| 0.95m        |1          |

<p style="text-align: center;">
    Table. 1 - Your data mining results
</p>





#### 2.2.1.1 Training a Linear Regression Neuron by Hand
For the sake of simplicity, you start by using a single neuron as a classifier. Run the two cells below to define the data mining points and to display a plot.

In [6]:
points_sheep = dict(
              x=[ 0.20, 0.23, 0.28, 0.32, 0.35],
              y=[ 0, 0, 0, 0, 0]
             )

points_llamas = dict(
              x=[ 0.55, 0.68, 0.74, 0.83, 0.95],
              y=[ 1,  1, 1, 1, 1]
             )

ranges = dict(x=[-0.1, 1.25], y=[-0.5, 1.4])
slider_layout = Layout(width="90%")

In [7]:
plot1 = Interactive2DPlot(points_sheep, points_llamas, ranges, loss_string="Mean Squared Loss")
neuron1 = SimpleNeuron(plot1)

interact(
    neuron1.set_values,
    weight=FloatSlider(min=-2, max=4, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-1, max=1, step=0.1, layout = slider_layout),
)

plot1.plot

interactive(children=(FloatSlider(value=0.0, description='weight', layout=Layout(width='90%'), max=4.0, min=-2…

FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '38580736-e810-48a9-91b6-6bb13b057066',
 …

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> Change the weight and bias sliders above. What is a weight and bias combination that results in a loss < 0.05?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>weight=1.5    bias=-0.3</div>


***
#### 2.2.1.2 Working our way towards a discrete classifier
Now we want to use our trained neuron to classify new neck heights. To do that, we have to write a program that takes in a neck height and outputs what the trained neuron thinks about it. The classifier will also plot the new neck height. Run the box below to get the values from the task before.

In [8]:
# a duplicate of the last plot, so you don't have to scroll
plot2 = Interactive2DPlot(points_sheep, points_llamas, ranges, loss_string="Mean Squared Loss") 
neuron2 = SimpleNeuron(plot2)
neuron2.set_values(neuron1.get_weight(), neuron1.get_bias()) #get your values from last task

plot2.plot

Mean Squared Loss : 0.500


FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '4c263d48-2da2-4943-89b0-06f9c5862ed1',
 …

<div class="alert alert-block alert-success">
<b>Task:</b> Try to implement a classifier using just a linear neuron. (Yes, an almost futile task, but this will make sense later). <br> Complete the python code below and receive a classification_result.
<ul>
    <li> the classification result shall be the output of neuron2, given the new neck height </li>
    <li> you shouldn't need to add more than 1 line of code </li>
    <li> after executing, take a look at the star in the plot above. It represents the current input/output for the new neck length</li>
</ul>

</div>

In [9]:
new_neck_height = 0.4  # this value shall be varied to answer the questions below

classification_result: float

### STUDENT CODE HERE (1pt)
classification_result=neuron2.compute(new_neck_height)
### STUDENT CODE until HERE

plot2.plot.data[3].x = [new_neck_height] #update plot
plot2.plot.data[3].y = [classification_result] 

print("Result:", classification_result)

Result: 0.0


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> What classification value does the smallest llama have? (run the cell above and change new_neck_height)  
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>0.5250000000000001</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> What classification value does an animal with a neck height of 0.1m have? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>0.14999999999999997</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> What classification value does an animal with a neck height of 0.9m have?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>1.05</div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> Why is the classification value continuous, even though the training data had only two discrete values? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>Because the neuron is returning its classification result without an activationfunktion we could use a step-funktion to get 1s and 0ros </div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> How would you interpret this continuous classification value? Try to describe it in a few words, there is no single correct answer.
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>the higher the classification_result the more likely it is to be a 1/lama the smaller the more likely to be a 0/sheep</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> Your neuron outputs a continuous value, but what we need is a discrete output, that clearly says either "llama" or "sheep". To do this, you add a simple decision to the output of the neuron. The decision should be approximately just as sensitive towards llamas as to sheep. What neuron output (y-value) would you choose as the threshold and why? (no single correct answer)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>the smallest lama has a classification_result=0.5250000000000001 and the biggest sheep has a classification_result=0.22499999999999992 so 0,375 seems in the middle and somehow fine</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> You want to add more data to your model to improve its performance. As you collect more data, you find a small llama with a neck height of 0.40m in your dataset. After you train your model on the new data, your discrete classifier decides that this small llama is a sheep. (Remember: the decision at the end only gets the y-value). Why is it problematic in this case to use a <b>linear</b> regression model for discrete classification? What property of the approximation function should be different?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>with a linear regression model the predicted value is continuous, not probabilistic and it is sensitive to imbalance data. The approximation funktion should not be linear</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> You decide that that manually adding a discrete decision at the end of your network is an unpractical idea. It would be better to improve the linear neuron by adding a heaviside step function as an activation function, just like adding a ReLu function. Then the training could be automated and the right threshold could be found automatically. What is the problem with this approach if we still want to use the Backpropagation algorithm?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>A requirement for backpropagation is a differentiable activation function.The heaviside function is neither differentiable in the classical sense nor is it weakly differentiable.</div>


***
### 2.2.2 Logistic Regression

In machine learning, the go-to assumption for an unknown two-class probability distribution is a logistic distribution.[2]
Its cumulated function is the logistic function, of which the sigmoid function is the most used special case. (See Fig 3.)
The sigmoid function enables a model to capture most natural occuring probability distributions.[3] (Further reading: see section "Further Reading" at the end of document)

In the introduction of Task 2.1, we gave the neck lengths corresponding labels. "0" for sheep and "1" for llama.
Here we can interpret the output of the neuron as the "llama probability": For example: An output of 1 means "100%" llama probability and an output of 0.2 means "20%" llama probability and so on.

<img src="images/sigmoid.png" />
<p style="text-align: center;">
    Fig. 3 - Sigmoid function
</p>


Run the cell below to define a sigmoid function.

In [10]:
def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-x))

<div class="alert alert-block alert-success">
<b>Task:</b> Complete Code and Train Neuron. Change the <code>SigmoidNeuron</code> class below to apply a sigmoid function to the final output.

</div>

In [11]:
class SigmoidNeuron(SimpleNeuron): #inheriting from SimpleNeuron, 
                                   #all functions stay the same unless they are specified here

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        ### STUDENT CODE HERE (1 pt)
        self.activation = sigmoid(np.dot(self.weight, x) + self.bias)
        ### STUDENT CODE until HERE
        return self.activation

In [12]:
classification_plot_sig = Interactive2DPlot(points_llamas, points_sheep, ranges, loss_string="Mean Squared Loss")

our_sig_neuron = SigmoidNeuron(classification_plot_sig)

interact(
    our_sig_neuron.set_values,
    weight=FloatSlider(min=-50, max=200, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-50, max=50, step=0.1, layout = slider_layout),
)

classification_plot_sig.plot

interactive(children=(FloatSlider(value=0.0, description='weight', layout=Layout(width='90%'), max=200.0, min=…

FigureWidget({
    'data': [{'type': 'scatter',
              'uid': '8c666ed1-a35e-4dbc-8288-969a12337ca5',
 …

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> Give one example of an optimal weight and bias combination.
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>weight=30.2  bias=-13.5</div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> What advantage does a classifier have in general that also outputs a probability compared to a classifier that just outputs a binary yes/no value? (a few words) 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>The user can assess the value of the classification</div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> Give one example how we can use the additional probability information to increase the accuracy of our seperation process 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>If the probability is low we can implement another seperation step</div>


## 2.3 Cross Entropy/Logarithmic Loss:
The most common loss function for classification is cross entropy loss, also called logarithmic loss. (In the context of machine learning, they are equal). In the special case of two categories, the loss is called binary cross entropy. The binary cross entropy loss between the ground truth data value $y$ and the predicted value $\hat{y}$ is calculated as follows:

\begin{align}
−[y \cdot log(\hat{y}) + (1 − y) \cdot log(1 − \hat{y})]
\end{align}

In this manner, the average of all data points is calculated w.r.t. this loss.
It turns out that the derivative of a logarithmic loss using one hot encoding (explained below) is just the solution vector subtracted by the network output, which makes it very easy to work with.
**Note:** Cross entropy loss can only be used, if the output values are between 0 and 1.

<img src="images/cross_entropy.png" />
<p style="text-align: center;">
        Fig. 4 - Logarithmic / cross entropy loss function

</p>




<div class="alert alert-block alert-success">
<b>Question (3 pts):</b> Calculate Squared and Cross Entropy Loss. Copy the table and fill out the ??? as an answer below (Markdown is fine to display the table). Use the cells below for calculations. 

| Input         | Llama Probability  |      Squared Loss    | Cross Entropy Loss   |
|---------------|--------------------|----------------------|----------------------|
|    llama(1)   | 0.99               |??????                |????                  |
|    sheep(0)   | 0.6                |????                  |????                  |
|    sheep(0)   | 0.95               |??????                |????                  |
|    sheep(0)   | 0.999999           |????????              |?????                 |


</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>

| Input         | Llama Probability  |      Squared Loss    | Cross Entropy Loss   |
|---------------|--------------------|----------------------|----------------------|
|    llama(1)   | 0.99               |0.0001                |0.0101                |
|    sheep(0)   | 0.6                |0.3600                |0.9163                |
|    sheep(0)   | 0.95               |0.9025                |2.9957                |
|    sheep(0)   | 0.999999           |1.0000                |13.8155               |

</div>











In [13]:
def cross_entropy_loss(predictions: np.ndarray, solutions: np.ndarray) -> float:
    predictions += 1e-15 #in order to prevent log(0)
    total_loss = np.sum(-(solutions*np.log(predictions)+(1-solutions)*np.log(1-predictions)))
    avg_loss = total_loss/len(predictions)
    return avg_loss

In [14]:
predicted = np.array([0.999999]) #insert here
actual = np.array([0]) #insert here


print("mean squared loss: {:0.4f}".format(mean_squared_loss(predicted,actual)))
print("cross entropy loss: {:0.4f}".format(cross_entropy_loss(predicted,actual)))

mean squared loss: 1.0000
cross entropy loss: 13.8155


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> How do the goals of regression and classification generally differ? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>In classification we predict discrete categories or classes. In Regression we predict continuous quantities.</div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> Why do you think cross entropy loss is better suited for classification training algorithms?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>Because a drastically wrong classified subject has nearly the same mean squared loss than a medium wrong classified subject. The cross entropy loss is really different for the two.</div>


## 2.4 One-Hot Encoding
To do classification, categories have to be represented in a way that the classifier can process. Neural networks cannot understand categories directly and need a numeric representation.

### 2.4.1 Disadvantages of Integer Encoding

In the llama classifier, llamas were assigned the value $1$ and sheep the value $0$. One single output neuron would "fire", if a llama was found, and not fire, if a sheep was found. This type of representing categories is called **integer** or **label encoding**

This works reasonably well for binary classification, but what if we want to distinguish between sheep, llamas and shepherd dogs?
Doing this with just one output neuron would result in complications: 
- Dogs would need a label that is numerically higher or lower (for example $2$), implying an order (Dogs > Llamas) where there actually is none.
- it would be necessary to interpret three different states out of one output neuron value

Another disadvantage can be seen in the next question:

<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> Suppose the encodings are: 0 for sheep, 1 for llamas and 2 for dogs. You classified 5 sheep and 5 dogs today. You want your classifier to output the average classification for today. What will the classifier say?
</div>

<div class="alert alert-block alert-success">
<b>Answer:the average is 1 -> lamas </b></div>


### 2.4.2 Composition of One-Hot Encoding

The solution for the shortcomings of integer encoding looks like this:

| Input         | One Hot Encoding  | 
|---------------|--------------------|
|    sheep   | [1,0,0]                |
|    llama   | [0,1,0]               |
|    dog     | [0,0,1]           |



The length of the representation vector is always equal to the amount of categories. Only one element of the vector is 1 for each category ("one-hot").
Using this encoding, we can conveniently use 3 output neurons for 3 different categories, so that the activation of each output neuron represents the classification score for that category.

###  2.4.3 Limits of One-Hot Encoding
One-hot encoding is not an unimprovable solution to represent categories, but rather another tool in the box that happens to work well for many problems, but not for all.

<div class="alert alert-block alert-success">
<b>Question (1 pts):</b> Suppose you would like to train a speech recognition neural network that can classify all English words contained in the Oxford English Dictionary. It does not need to classify whole sentences, just single words. What would be a problem using one-hot encoding?
</div>

<div class="alert alert-block alert-success">
<b>Answer:there are 228132 words in the oxford english dictionary</b></div>


## 2.5 Softmax Activation Function

The sigmoid function works fine for a "yes or no" problem, i.e. binary decisions. But more often than not we want to distinguish between more than two categories. For that, we need a function that takes in **multiple** neuron activations from the last layer of a network and outputs a **probability vector** containing the probabilities for each category. 

The key: **Each input** of this function is **normalized by the other inputs** such that the sum of the output vector is always 1. This activation function is different from ReLU or Sigmoid, because it always applies to the layer as a whole. In practice, it only makes sense as the activation function for the output layer.  Figure 3 shows an example network.

We can realize a softmax activation function by taking each element $x_i$ of the input vector, calculating $\exp(x_i)$ and then normalizing this value by dividing it by the sum of the $\exp$ results of all single input vector elements. Strictly speaking, the $\exp$ is not necessary for this effect - a linear normalization, limited to non-negative values, could also be interpreted as probability. However, the exponential normalization offers properties that improve performance (see "further reading").

\begin{align}
(\text{Softmax}(x))_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
\end{align}


<img src="images/softmax_example_network.png" />
<p style="text-align: center;">
    Fig. 3 - Softmax activation function
</p>

<div class="alert alert-block alert-success">
<b>Question (1 pts):</b> In "logistic regression", we also obtained a probability by applying a sigmoid function on the last layers' output. Why can't we apply a sigmoid function on each output neuron of this network instead of a softmax and get a probability vector?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>if we apply the sigmoid function to each raw output value separately, this means our network can output that all of the classes have low probability, that one class has high probability but the other classes have low probability, or that multiple or all classes have high probability so if there is a dog and a lama in the picture sigmoids works fine. Because in softmax all values added have to be one there is only one correct answer.  </div>


***
## 2.6 Automated Classification Training

### 2.6.1 Introduction

We already have explored automated training using backpropagation in the last chapter. We had one set of points that we had to fit a function as close as possible. The task is similar for classification training. However instead of y-coordinates for points, we now have discrete categories.

You got already a set of neck lengths and the correspoding categories (see table 1). In the field of machine learing, this dataset is called __training data__. It specifies the behaviour that the neural net should have. We will use backpropagation to adjust the weights and biases of the network over and over again until the network outputs the same values to a given set of inputs as in the training data. During backpropagation, the network is figuratively "learning" the training data. 

***
### 2.6.2 Realizing an XOR Gate with a Neural Network

You find yourself working as an engineer at a major electronic component manufacturing company. Your company wants to produce the first XOR gate chip that runs on artificial intelligence. You are given the training data in the form of a truth table:


| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |0          |


<p style="text-align: center;">
    Table. 2 - XOR Truth table
</p>


In this task we will make use of arrays and matrices to ease the handling of the data and the network parameters. We will also utilize a neural network without biases in order to make the algorithm as simple as possible.
The training data consists of a 2D Array of all possible input states and a 1D Array of all corresponding outputs. 

#### 2.6.2.1 Task : Create Training Data

A training set consists of an input set and a solution set. During supervised training, the network is adjusted until its predictions to the input set match the corresponding predetermined solutions.
Complete the training data below using the truth table

<div class="alert alert-block alert-success">
<b>Task:</b> Create Training Data. A training set consists of an input set and a solution set. During supervised training, the network is adjusted until its predictions to the input set match the corresponding predetermined solutions (not always see: Overfitting, but in this case). Complete the training data below using the groundtruth table above. Please initialize the solution '2 dimensional' as well.

</div>

In [26]:
xor_input_set: np.ndarray
xor_solution_set: np.ndarray

# STUDENT CODE HERE (1 pt)
xor_input_set=np.array([[0,0],[0,1],[1,0],[1,1]])
xor_solution_set=np.array([[0],[1],[1],[0]])
# STUDENT CODE until HERE


#### 2.6.2.2 Initializing the Network
Next, the Network has to be defined and initialized. For this task, we use a network with 3 hidden neurons (see Figure 4).

<img src="images/3x2_xor_network.png" />
<p style="text-align: center;">
    Fig. 4 - Neural Network 
</p>

We define $w_{01}, w_{02}, w_{03}, w_{10}, w_{11}, w_{12}$ all at once by just defining a 2x3 weight matrix $w_{l1}$ and do the same for $w_{l2}$. The matrices will be initialized with values between -1 and 1

Run the cell below to define a neural network class that is depicted above.

In [27]:
class NeuralNetwork:
    def __init__(self):
        self.hl_sum = [0, 0, 0]
        self.hl_activation = [0, 0, 0]
        self.ol_sum = [0]
        self.prediction = 0
        self.b = 0
        self.w_i = np.zeros((2, 3))
        self.w_o = np.zeros((3, 1))
        
    def set_conf(self, w_i: np.ndarray, w_o: np.ndarray, b: float):  # w_i and w_o are matrices here
        self.w_i = w_i
        self.w_o = w_o
        self.b = b

    def get_conf(self) -> Dict[str, Union[np.ndarray, float]]:
        configuration = dict()
        configuration['w_i'] = self.w_i
        configuration['w_o'] = self.w_o
        configuration['b'] = self.b
        return configuration

    def get_ex(self) -> Dict[str, float]:
        excitations = dict();
        excitations['hl_sum'] = self.hl_sum
        excitations['hl_activation'] = self.hl_activation
        excitations['ol_sum'] = self.ol_sum
        return excitations
    
    
    def show_conf(self):
        print("weight matrix w_i:")
        print(self.w_i)
        print("\nweight matrix w_o:")
        print(self.w_o)
        print("Bias")
        print(self.b)

    def compute(self, input_set: np.ndarray) -> np.ndarray:
        self.hl_sum = input_set.dot(self.w_i)
        self.hl_activation = relu(self.hl_sum) 
        self.ol_sum = relu(self.hl_activation).dot(self.w_o) + self.b
        self.prediction = sigmoid(self.ol_sum)

        return self.prediction

In [28]:
logic_gate_net = NeuralNetwork()

In [29]:
def initialize_network(net):
    #np.random.seed(3)
    weight_matrix_i = np.random.rand(2,3)  # a 2x3 matrix of weights
    weight_matrix_o = np.random.rand(3,1)  # a 3x1 matrix of weights
    bias = np.random.randn()
    net.set_conf(weight_matrix_i,weight_matrix_o,bias)

In [30]:
initialize_network(logic_gate_net) #just a test initialization to illustrate the weight matrices
logic_gate_net.show_conf()

weight matrix w_i:
[[0.82110642 0.38274138 0.57762102]
 [0.97631337 0.94240176 0.56146912]]

weight matrix w_o:
[[0.54057056]
 [0.35252837]
 [0.51582657]]
Bias
1.2934261623179244


#### 2.6.2.3 Defining Training Process
Finally, run the cells below to implement a backpropagation algorithm. Try to understand the code. See Fig. 4 for explanation of the variable names.

In [31]:
def sigmoid_prime(x: np.ndarray) -> np.ndarray: #the derivative of sigmoid
    return sigmoid(x)*(1-sigmoid(x))

In [32]:
def train(net: NeuralNetwork, input_set: np.ndarray, solution_set: np.ndarray, learning_rate: float, epochs: int):
    for t in range(epochs):
        # Forward pass: compute predicted solution_set
        predictions = net.compute(input_set)
        # Compute and print loss
        log_loss = cross_entropy_loss(predictions, solution_set)
        
        if (t % 5 == 0):  # only output every 5th epoch
            print("Loss after Epoch {}: {:0.4f}".format(t, log_loss))

        #unravel variables here for readability
        ol_sum = net.get_ex()['ol_sum']
        hl_activation = net.get_ex()['hl_activation']
        hl_sum = net.get_ex()['hl_sum']
        w_i = net.get_conf()['w_i']
        w_o = net.get_conf()['w_o']
        b = net.get_conf()['b']
        
        # Backpropagation to compute gradients of w_i and w_o with respect to loss
        # start from the loss at the end and then work towards the front
        grad_ol_sum = sigmoid_prime(ol_sum) * (predictions - xor_solution_set)
        grad_w_o = hl_activation.T.dot(grad_ol_sum)  # Gradient of Loss with respect to w_o
        grad_hl_activation = grad_ol_sum.dot(w_o.T)  # the second layer's error
        grad_hl_sum = hl_sum.copy()  # create a copy to work with
        grad_hl_sum[hl_sum < 0] = 0  # the derivate of ReLU
        grad_w_i = input_set.T.dot(grad_hl_sum * grad_hl_activation)  #

        updated_weight_matrix_i = w_i - learning_rate * grad_w_i
        updated_weight_matrix_o = w_o - learning_rate * grad_w_o
        updated_bias = b - learning_rate * grad_ol_sum.sum()
        net.set_conf(updated_weight_matrix_i, updated_weight_matrix_o,
                       updated_bias)  # Apply updated weights to network

<div class="alert alert-block alert-success">
<b>Task:</b> Choose Hyperparameters and Train
<ul>
<li> Choose an optimal learning rate and number of epochs by trying out values and running the cell below.
<li> If your training data was correct, the network should be ready for use after training.
A successfull training should result in a loss smaller than 0.02.
                                                     
<li><b>Hint:</b> Press Shift+Enter on the cell below and then the "up" arrow key to repeat the training easily.

</ul>
</div>

In [33]:
learning_rate: float
epochs: int
# STUDENT CODE HERE (2 pts)
learning_rate=7
epochs=99
# STUDENT CODE until HERE

initialize_network(logic_gate_net) #initialize again so you can just run this box and train a new network
train(logic_gate_net, xor_input_set, xor_solution_set,learning_rate,epochs)

Loss after Epoch 0: 0.7891
Loss after Epoch 5: 0.6365
Loss after Epoch 10: 0.6714
Loss after Epoch 15: 0.0889
Loss after Epoch 20: 0.0532
Loss after Epoch 25: 0.0419
Loss after Epoch 30: 0.0353
Loss after Epoch 35: 0.0310
Loss after Epoch 40: 0.0279
Loss after Epoch 45: 0.0255
Loss after Epoch 50: 0.0242
Loss after Epoch 55: 0.0233
Loss after Epoch 60: 0.0224
Loss after Epoch 65: 0.0215
Loss after Epoch 70: 0.0206
Loss after Epoch 75: 0.0198
Loss after Epoch 80: 0.0191
Loss after Epoch 85: 0.0185
Loss after Epoch 90: 0.0179
Loss after Epoch 95: 0.0173


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> Why are the losses different each time you run the cell?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>Because the weights and the bias is initialized random</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> What is a good learning rate that reaches a loss < 0.02 in < 100 epochs most of the time?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>learning rate = 7</div>


<div class="alert alert-block alert-success">
<b>Task:</b> Classification Test. Run the cell below and change the sliders and do a validation check on your logic gate.

</div>

In [34]:
def change(input1: float, input2: float):
    input_vector = np.array([input1 * 1, input2 * 1])     # converting bool to float
    prediction = logic_gate_net.compute(input_vector)
    print("\t input: {} \t \t output: {:0.9f}".format(input_vector, prediction[0]))

interact(
    change,
    input1=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
    input2=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
);

interactive(children=(FloatSlider(value=0.0, description='input1', layout=Layout(width='22%'), max=1.0, step=1…

<div class="alert alert-block alert-success">
<b>Task:</b> Continuous Input Test. Change the sliders and observe the changes when the input is varied continuously instead of binary.

</div>

In [35]:
interact(change, input1=0.0, input2=0.0);

interactive(children=(FloatSlider(value=0.0, description='input1', max=1.0), FloatSlider(value=0.0, descriptio…

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> What can you observe when changing the sliders? How would you describe the general relationship between the two inputs and the output (a few words)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>If both inputs have the same value the output is minimal if they are far appart the output reaches its max</div>


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b>  Change the sliders to the training data values e.g.(1.00, 1.00). Does the output match the training data exactly? Why is that the case?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>It does not match the training values perfectly because the weights are chosen randomly so the neural network is sometimes learning that 1 and 1 is something different than 0 </div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> The neural network now can do something more than just predicting the values of the input set that you gave it. What "special ability" has your network gained automatically? (<b>Hint:</b> Think about neural networks in general, the XOR gate is just an example)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>The special ability of the neural network is that its very flexible so it can be used for prettymuch every Problem</div>

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> How can this special ability be useful when applying neural networks to self-driving vehicles?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>The neural network is pretty flexible and in self-driving there are many different problems that the vehicle could face, even some the neural network hasnt learned before</div>


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b>Why does this ability make it easier to use a neural network for self-driving vehicles than traditional rule-based programming. (One pos. and neg. aspect)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>Because you can train the neural network with hours of driving data rather than having to think about every situation that possibly could appear. In the end you cant know for sure that the neural network will act as it should in every situation so it could have learned sth dangerous that just happens rarely and you will never know until it happens.</div>


<div class="alert alert-block alert-success">
<b>Task:</b> Create an OR Gate. Change the code above to train an OR Network and verfy your results with a test.

</div>

| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |1          |


<p style="text-align: center;">
    Table. 3 - OR Truth table
</p>

## 2.7 Neural-Networks using DeepLearningLibraries

In [897]:
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]], dtype = 'float64')

y_train = np.array([[0],
                    [1],
                    [1],
                    [1]], dtype = 'float64')

### 2.7.1 Keras Example

In [898]:
# Load Library and modules
from tensorflow import keras

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import optimizers

# Init the model
model = Sequential([
    Dense(3, input_shape=(2,), activation='relu'),
    Dense(1, activation='sigmoid')
])

# Show optimizer
rmsprop = optimizers.RMSprop(learning_rate=0.01, rho=0.9)

# Compile
model.compile(loss='binary_crossentropy',
              optimizer=rmsprop,
              metrics=['accuracy'])
# Train
model.fit(x_train, y_train, batch_size = 4,
          epochs=200)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x1b217041e88>

<div class="alert alert-block alert-info">
<b>Note:</b> Install pytorch on tf1 or a new environment by using the search tool (graphically) in anaconda navigator. Please do not change Cuda when installing differently! to run the code below.

</div>

### 2.7.2 PyTorch Example
This example is just a reference for how the syntax will look when using PyTorch. You do not need to install PyTorch just to run it.

In [899]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(2, 3, True)
        self.fc2 = nn.Linear(3, 1, True)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.sigmoid(self.fc2(x))
        return x

net = Net()

inputs = torch.from_numpy(x_train).type(torch.FloatTensor)
targets = torch.from_numpy(y_train).type(torch.FloatTensor)

criterion = nn.BCELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

print("Training loop:")
for idx in range(0, 201):
    for input, target in zip(inputs, targets):
        optimizer.zero_grad()   # zero the gradient buffers
        output = net(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()    # Does the update
    if idx % 50 == 0:
        print("Epoch: {: >8}  |  Loss: {}".format(idx, loss.data.numpy()))

Training loop:
Epoch:        0  |  Loss: 0.8618576526641846
Epoch:       50  |  Loss: 0.007643511518836021
Epoch:      100  |  Loss: 0.0003135695878881961
Epoch:      150  |  Loss: 3.922062387573533e-05
Epoch:      200  |  Loss: 7.987054232216906e-06


## 2.8 Outlook: Classification Tests in the Real World

A classic application of neural networks is the classification of images. A commonly used data set is CIFAR-10, which consists of:  
 1. Images of  airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks  (10 Categories)
 2. Labels attached to each image that categorize the image
 
<img src="images/cifar10_plot.png" />
<p style="text-align: center;">
    Fig. 3 - CIFAR-10 dataset[4]
</p>

 
The labels (also called annotations) act as the "solution" for the training set. Each item (airplane, car..) is a separate category. 
During training, the weights and biases in the network are adjusted in just the right way, until it performs the right mathematical operations to correctly classify the given training data. After training, the network can recognize whether the image is a cat, an airplane, etc. This even works for pictures that the network has never seen. You will find out how neural networks can perform image classification in the next class.

### Sources:
[1] Wikipedia, Statistical classification https://en.wikipedia.org/wiki/Statistical_classification, retrieved 01.05.2019

[2]  Brownlee, Jason 2018. Machine Learning Algorithms From Scratch. p. 70

[3]  Gibbs, M.N. (Nov 2000). "Variational Gaussian process classifiers". IEEE Transactions on Neural Networks. p. 1458–1464.

[4] Cifar-10, Cifar-100 Dataset Introduction
Corochann - https://corochann.com/cifar-10-cifar-100-dataset-introduction-1258.html, retrieved 02.02.2019


### Further Reading

The Sigmoid Function in Logistic Regression: http://karlrosaen.com/ml/notebooks/logistic-regression-why-sigmoid/

Why Softmax uses exponential function: https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization

# Feedback and Recap

<div class="alert alert-block alert-success">
<b>Question (3pt):</b>  Please conclude in a few sentences what you learned in this exercise
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>In 6.2 we used neural networks to classify data and learned the bennefits and losses of these.In the end we got a short look on how to use the nn with deeplearning libraries.</div>









