# 1 Regression with Neural Networks

Artificial neural networks are used to solve a wide variety of problems. In this chapter we focus on the problem of regression, since it is the easiest to begin with. We explore the foundations of neural networks by using them to solve example regression tasks. First, the single artificial neuron is introduced. Next, activation functions are explored through interactive tasks. Finally, backpropagation is explained at the end of the chapter.


## 1.1 The Artificial Neuron in Theory


An artificial neuron is a mathematical function inspired by the information processing of a biological nerve cell. Each neuron accepts one or more values as inputs ($x_n$) and outputs one value ($y$). In doing so, it performs a simple mathematical operation (see (1) and Fig. 1):


\begin{align}
f_{\text{neuron}}(x) = \phi\left(\sum_{n=1}^m {x_n w_{n}} + b\right)       \;\;\;\;\;\;\;\;\;\;\;        (1)
\end{align}


- **Inputs** $x_n$ are numerical values given by the data or by other neurons
- **Weights** $w_{n}$ multiply the input values
- A **summation $v$** of the weighted inputs is calculated
- A **constant value** $b$ is added to the sum (the so-called **bias**)
- An **activation function** $\phi$ is applied to the result
- The **output** $y$ can be used as an input for another neuron or as a final output of a network

The resulting output is also called the "activation" $y$ of the neuron. 
Even though this mechanism is very simple, a multitude of simple neurons is able to solve very complex problems.
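
Equation (1) can be sketched in a few lines of NumPy. This is only a minimal illustration, not the neuron class used later in this notebook: the names `neuron_output`, `identity` and `relu_example`, as well as the input and weight values, are made up for this example.

```python
import numpy as np

def identity(v):
    """Activation that returns its input unchanged."""
    return v

def relu_example(v):
    """Example non-linear activation: negative values are clipped to 0."""
    return np.maximum(v, 0)

def neuron_output(x, w, b, phi=identity):
    """Evaluate equation (1): weighted sum of the inputs, plus bias, then activation."""
    v = np.dot(x, w)      # summation of the weighted inputs
    return phi(v + b)     # add the bias, then apply the activation function

x = np.array([1.0, 2.0])    # two illustrative inputs
w = np.array([0.5, -1.0])   # one weight per input
y_linear = neuron_output(x, w, b=0.5)                  # 0.5 - 2.0 + 0.5 = -1.0
y_relu = neuron_output(x, w, b=0.5, phi=relu_example)  # negative sum is clipped to 0
print(y_linear, y_relu)
```

Passing the activation as an argument keeps the sketch close to the formula: swapping `phi` changes the neuron's behaviour without touching the weighted sum.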


<img src="images/neural_network.png" />
<p style="text-align: center;">
    Fig. 1 - General artificial neuron
</p>




<div class="alert alert-block alert-info">
<b>Note:</b>   
<ul>
<li> An artificial neuron is just an abstract concept. Even though we will create <i>neuron objects</i> in this Jupyter Notebook for code reusability, a neuron is not bound to any specific shape or implementation. There are even approaches to realize artificial neurons with <a href="https://www.osapublishing.org/optica/abstract.cfm?uri=optica-6-9-1132">pure optics</a>.
As long as something shows a behaviour that can be described by the mathematical function (1), we can view it as a "neuron".</li>
<li> You could integrate the bias into the sum in (1) by denoting it as $w_0$ with a corresponding $x_0 = 1$.
</li>


</ul>
<br>

</div>
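
The second point of the note can be checked numerically. The following is a small sketch with made-up input, weight and bias values; it only shows that prepending $x_0 = 1$ and $w_0 = b$ gives the same weighted sum as adding the bias explicitly.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # illustrative inputs
w = np.array([0.3, 0.8, -0.4])   # illustrative weights
b = 0.7                          # illustrative bias

explicit = np.dot(x, w) + b      # bias added after the sum, as in equation (1)

x_ext = np.concatenate(([1.0], x))   # prepend x_0 = 1
w_ext = np.concatenate(([b], w))     # prepend w_0 = b
folded = np.dot(x_ext, w_ext)        # bias is now part of the sum

print(explicit, folded)
```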

## 1.2 A Simple Neuron in Practice
For the sake of explanation, we will now examine a neuron with a single input and without an activation function. This neuron is already able to model functions of the form (2). The neuron can be visualized as seen in Figure 2.

\begin{align}
f_{\text{neuron}}(x) & = w \cdot x + b \;\;\;\;\;\;\;\;\;\;\;        (2)
\end{align}



<img src="images/single_neuron_no_activation.png" />
<p style="text-align: center;">
    Fig. 2 - Simple artificial neuron
</p>




Our neuron class will just have one input, one weight and one bias. 
- Upon initialization, it will be connected to an interactive plot
- Its weight and bias can be changed using the `set_values` method.
- Its weight and bias can be read using the `get_weight`/`get_bias` methods.
- When the neuron is changed, it notifies the interactive plot to redraw its output.
- It has a `compute` method that computes the activation based on the weight, bias and input.

Run the cell below to define a neuron class.

<a id='simple_neuron'></a>


##  1.3 The Problem of Regression
In regression analysis, we look for a model function that matches a given set of $N$ data points as **accurately** as possible. A commonly used measure for the accuracy of the approximation is the **least squares approach**. For each data point $(x_n, y_n)$, the distance to the model is the absolute difference between the data point's y-value $y_n$ and the y-value $\hat{f}(x_n)$ predicted by the model function (3).

\begin{align}
d(\hat{f}(x_n), y_n) & = \left|\hat{f}(x_n) - y_n\right| \;\;\;\;\;\;\;\;\;\;\;        (3)
\end{align}

The distances are then squared and summed up. Since we want to compare approximations that may be based on different numbers of data points, we also divide the sum by the total number of data points (**Mean Squared Error**). This will be our **loss** $J$ (4). Our goal is to keep this metric as low as possible: the lower the loss, the better the approximation. Here, the terms "loss" and "error" have the same meaning. Another commonly used term is "cost".

\begin{align}
J & = \frac{1}{N} \sum_{n=0}^{N-1} (\hat{f}(x_n) - y_n)^2 \;\;\;\;\;\;\;\;\;\;\;        (4)
\end{align}


<div class="alert alert-block alert-info">
<b>Note:</b> In eq. (4), $x_n$ is the $n$-th sample of the dataset (so $x_0$ is the first sample), not the $n$-th input feature as in eq. (1).   

</div>
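
As a quick illustration of the Mean Squared Error, here is a vectorized sketch in NumPy. It is not the `loss` function defined later in this notebook; the helper name `mse` and the toy data (points on the line $y = 2x + 1$) are assumptions made for this example.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean of the squared differences between predictions and targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])        # toy targets lying exactly on y = 2x + 1

perfect = mse(2.0 * x + 1.0, y)      # the model hits every point
offset = mse(2.0 * x + 2.0, y)       # every prediction is off by exactly 1
print(perfect, offset)
```

A model that hits every point has a loss of 0, while a constant error of 1 on every point gives a loss of 1, independent of the number of points.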




If we have achieved an accurate regression, we can make **predictions** with it. We will train our neurons to match a given set of points and then use them to predict new points. To do so, we will give the trained neuron new x-values and it will predict y-values.


<img src="images/least_squares_explanation.png" />
<p style="text-align: center;">
    Fig. 3 - Distance to model function visualized
</p>





In [1]:
from __future__ import annotations # Postpones the evaluation of type annotations (PEP 563), so that type hints can reference classes that are defined in a later cell, e.g. Interactive2DPlot
from typing import *



class SimpleNeuron:
    def __init__(self, plot: Interactive2DPlot):
        self.plot = plot #I am assigned the following plot
        self.plot.register_neuron(self) #hey plot, remember me
        
    def set_values(self, weight: float, bias: float):
        self.weight = weight
        self.bias = bias
        self.plot.update() #hey plot, I have changed, redraw my output
        
    def get_weight(self) -> float:
        return self.weight
    
    def get_bias(self) -> float:
        return self.bias

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        self.activation = np.dot(self.weight, x) + self.bias
        return self.activation


We will create a function `loss` that performs the operation (4). It receives a neuron object and a set of points as arguments.
- For each point that we give it, it first separates the x- and y-values.
- It hands the neuron an x-value and asks the neuron to compute a prediction for the y-value (see $\hat{f}(x_n)$).
- Then it takes the difference between the real y-value and the predicted y-value, as in operation (3), resulting in a distance. The sign does not matter, because the distance is squared in the next step.
- It then squares the distance and accumulates the squared distances.
- In the last step, it divides the sum of squared distances by the number of compared points.

Run the cell below to define a loss function.

In [2]:
def loss(neuron: SimpleNeuron, points: Dict[str, List[float]]) -> float:
    sum_squared_dist = 0

    for point_x, point_y in zip(points["x"], points["y"]):  # zip merges both points["x"] and points["y"]

        predicted_point_y = neuron.compute(point_x)
        dist = point_y - predicted_point_y
        squared_dist = dist ** 2
        sum_squared_dist += squared_dist

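    mean_squared_dist = sum_squared_dist / len(points["y"])  # separate name, so the local variable does not shadow the function "loss"
    return mean_squared_dist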

### 1.3.1 Preparing an Interactive Plot

After importing the necessary libraries, we will set up an interactive plot class. It plots the output of a neuron by asking it to compute predictions for a set of x-values, which results in a set of predicted y-values that can be drawn on a plane. If the weight or bias of a neuron is changed, the neuron calls the `update` method of its plot, which prints the current loss and redraws the curve. The plot can also show fixed points. Interactive sliders will be used to directly modify the weight and bias of neuron objects.


<div class="alert alert-block alert-info">
<b>Note:</b> The plot classes are not part of the subject matter for this lab.  

</div>

Run the cells below to import libraries and define an interactive plot.

In [3]:
import numpy as np
import plotly.offline as plotly
import plotly.graph_objs as go
from ipywidgets import interact, Layout, HBox, FloatSlider
import time
import threading

In [4]:
# an Interactive Plot monitors the activation of a neuron or a neural network
class Interactive2DPlot:
    def __init__(self, points: Dict[str, List[float]], ranges: Dict[str, Tuple[float, float]], width: int = 800, height: int = 400, margin: Dict[str, int] = { 't': 0, 'l': 170 }, draw_time: float = 0.05):
        self.idle = True
        self.points = points
        self.x = np.arange(ranges["x"][0], ranges["x"][1], 0.1)
        self.y = np.arange(ranges["y"][0], ranges["y"][1], 0.1)
        self.draw_time = draw_time
        self.layout = go.Layout(
            xaxis=dict(title="Input: x", range=ranges["x"], fixedrange=True),
            yaxis=dict(title="Output: y", range=ranges["y"], fixedrange=True),
            width=width,
            height=height,
            showlegend=False,
            autosize=False,
            margin=margin,
        )
        self.trace = go.Scatter(x=self.x, y=self.y)
        self.plot_points = go.Scatter(x=points["x"], y=points["y"], mode="markers")
        self.data = [self.trace, self.plot_points]
        self.plot = go.FigureWidget(self.data, self.layout)
        # self.plot = plotly.iplot(self.data, self.layout,config={"displayModeBar": False})

    def register_neuron(self, neuron: SimpleNeuron):
        self.neuron = neuron

    def redraw(self):
        self.idle = False
        time.sleep(self.draw_time)
        self.plot.data[0].y = self.neuron.compute(self.x)
        self.idle = True

    def update(self):
        print("Loss: {:0.2f}".format(loss(self.neuron, self.points)))
        if self.idle:
            thread = threading.Thread(target=self.redraw)
            thread.start()

<div class="alert alert-block alert-success">
<b>Task:</b> Train the neuron
<ul>
<li> You are given a set of 3 points and one neuron to do a curve fit. Run the cell below.
<li> <b>Change the weight and bias of the neuron using the sliders to minimize the loss.</b>
    <li><b>Hint:</b> You can also change the sliders with the arrow keys on your keyboard after clicking on the slider.
</ul>
</div>

<a id='points_linreg'></a>

In [5]:
points_linreg = dict(x=[1, 2, 3], y=[1.5, 0.7, 1.2])
ranges_linreg = dict(x=(-4, 4), y=(-4, 4))

linreg_plot = Interactive2DPlot(points_linreg, ranges_linreg)
simple_neuron = SimpleNeuron(linreg_plot)

slider_layout = Layout(width="90%")

interact(
    simple_neuron.set_values, 
    weight=FloatSlider(min=-3, max=3, step=0.1, value = 0, layout=slider_layout),
    bias=FloatSlider(min=-3, max=3, step=0.1, value = 0, layout=slider_layout)
)

linreg_plot.plot


<div class="alert alert-block alert-success">
<b>Question (1pt):</b> What is the optimal weight and bias combination? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> The optimal values are: weight = -0.10 and bias = 1.40 </div>
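
The sliders move in steps of 0.1, so any slider setting is only an approximation of the true optimum. For a straight-line fit, the least-squares optimum can also be computed in closed form. The following sketch uses `np.polyfit` on the three points from the cell above; the helper `mse_line` is made up for this check and is not part of the lab code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # same points as points_linreg
y = np.array([1.5, 0.7, 1.2])

# Degree-1 least-squares fit; coefficients are returned highest degree first
weight, bias = np.polyfit(x, y, deg=1)

def mse_line(w, b):
    """Mean squared error of the line w * x + b on the three points."""
    return np.mean((w * x + b - y) ** 2)

print("closed form:", weight, bias, mse_line(weight, bias))
print("slider answer:", -0.1, 1.4, mse_line(-0.1, 1.4))
```

The closed-form weight and bias lie between slider positions, which is why several neighbouring slider settings show almost the same loss when it is rounded to two decimals.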

### 1.3.2 Preparing a 3D-Plot
We can see that searching for the lowest loss is a **parameter optimization problem**. For now, the problem can be solved manually, but if we want to use neural networks to solve more complex problems, we have to find a way to automate this process.

The loss depends on both the weight and the bias. This relationship can be visualized in three dimensions, which gives us further insight for constructing an algorithm that solves the optimization problem. 
In this 3D view, a logarithmic scale is used to emphasize the topography. We will define a new function to compute the logarithmic loss for a set of points.

The plot will be defined as follows:
- The **X axis** represents the weights. 
- The **Y axis** represents the bias.
- The **Z axis** (height) represents the corresponding loss value at a given weight/bias configuration. For illustration purposes, the logarithm of the MSE Loss is displayed.
- The **black ball** represents the current weight/bias configuration. Its height represents the loss of that configuration.
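
The surface described above can be sketched without any plotting code. The example below is a standalone illustration, assuming the same three points and a weight/bias grid with a step of 0.1; it uses NumPy broadcasting to evaluate the loss for every weight/bias combination at once. The variable names are chosen for this sketch only.

```python
import numpy as np

points = dict(x=[1, 2, 3], y=[1.5, 0.7, 1.2])

weights = np.arange(-2.5, 2.5, 0.1)                 # shape (n_w,)
biases = np.arange(-2.5, 2.5, 0.1)[:, np.newaxis]   # shape (n_b, 1), a column vector

# Broadcasting a row of weights against a column of biases yields one
# squared error per weight/bias combination, i.e. a matrix of shape (n_b, n_w)
sq_sum = np.zeros((biases.shape[0], weights.shape[0]))
for px, py in zip(points["x"], points["y"]):
    sq_sum += (weights * px + biases - py) ** 2

log_loss = np.log10(sq_sum / len(points["x"]))      # height of the surface

# Grid cell with the lowest loss
row, col = np.unravel_index(np.argmin(log_loss), log_loss.shape)
best_w, best_b = weights[col], biases[row, 0]
print(best_w, best_b)
```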

Run the cells below to define a 3D plot.

In [6]:
def log_mse(neuron: SimpleNeuron, points: Dict[str, List[float]]) -> np.ndarray:
    least_squares_loss = loss(neuron, points)
    return np.log10(least_squares_loss)

In [7]:
class Interactive3DPlot:
    def __init__(self, points: Dict[str, List[float]], ranges: Dict[str, Tuple[float, float]], width: int = 600, height: int = 600, draw_time: float = 0.1):
        self.idle = True
        self.points = points
        self.draw_time = draw_time
        self.threading = threading

        self.range_weights = np.arange(  # Array with all possible weight values in the given range
            ranges["x"][0], ranges["x"][1], 0.1
        )
        self.range_biases = np.arange(  # Array with all possible bias values in the given range
            ranges["y"][0], ranges["y"][1], 0.1
        )
        self.range_biases_t = self.range_biases[:, np.newaxis]  # Bias array transposed
        self.range_losses = []  # initialize z axis for 3D surface

        self.ball = go.Scatter3d(  # initialize ball
            x=[], y=[], z=[], hoverinfo="none", mode="markers", marker=dict(size=12, color="black")
        )

        self.layout = go.Layout(
            width=width,
            height=height,
            showlegend=False,
            autosize=False,
            margin=dict(t=0, l=0),
            scene=dict(
                xaxis=dict(title="Weight", range=ranges["x"], autorange=False, showticklabels=True),
                yaxis=dict(title="Bias", range=ranges["y"], autorange=False, showticklabels=True),
                zaxis=dict(title="Loss: log(MSE)", range=ranges["z"], autorange=True, showticklabels=False),
            ),
        )

        self.data = [
            go.Surface(
                z=self.range_losses,
                x=self.range_weights,
                y=self.range_biases,
                colorscale="Viridis",
                opacity=0.9,
                showscale=False,
                hoverinfo="none",
            ),
            self.ball,
        ]

        self.plot = go.FigureWidget(self.data, self.layout)

    def register_neuron(self, neuron: SimpleNeuron):
        self.neuron = neuron
        self.calc_surface()

        # height of 3d surface represents loss of weight/bias combination
        # In the 2D plot, x is an array from e.g. -4 to +4. But the weights and biases only have a single value
        # Here x will be the points to do regression and to calculate the loss on. 
        # The surface is spanned by the arrays of weight and bias.
        
    def calc_surface(self):  
                
        self.neuron.weight = (  #instead of 1 weight and 1 bias, let Neuron have an array of all weights and biases
            self.range_weights
        )
        self.neuron.bias = self.range_biases_t
        self.range_losses = log_mse(  # result: matrix of losses of all weight/bias combinations in the given range
            self.neuron, self.points
        )
        self.plot.data[0].z = self.range_losses

    def update(self):
        if self.idle:
            thread = threading.Thread(target=self.redraw)
            thread.start()

    def redraw(self):  # when updating, only the ball is redrawn
        self.idle = False
        time.sleep(self.draw_time)
        self.ball.x = [self.neuron.weight]
        self.ball.y = [self.neuron.bias]
        self.ball.z = [log_mse(self.neuron, self.points)]
        self.plot.data[1].x = self.ball.x
        self.plot.data[1].y = self.ball.y
        self.plot.data[1].z = self.ball.z
        self.idle = True

In [8]:
class DualPlot:
    def __init__(self, points: Dict[str, List[float]], ranges_3d: Dict[str, Tuple[float, float]], ranges_2d: Dict[str, Tuple[float, float]]):
        self.plot_3d = Interactive3DPlot(points, ranges_3d)
        self.plot_2d = Interactive2DPlot(points, ranges_2d, width=400, height=500, margin=dict(t=200, l=30))

    def register_neuron(self, neuron: SimpleNeuron):
        self.plot_3d.register_neuron(neuron)
        self.plot_2d.register_neuron(neuron)

    def update(self):
        self.plot_3d.update()
        self.plot_2d.update()

<div class="alert alert-block alert-success">
<b>Task:</b> Train the neuron
<ul>
<li> You are given the same set of 3 points and again one neuron to do a curve fit. Run the cell below.
<li> <b>Change the weight and bias of the neuron using the sliders to minimize the loss.</b>
<li> <b>Observe all changes.</b>
    </li>

</ul>

</div>

<div class="alert alert-block alert-info">
<b>Note:</b> You can turn the 3D-Plot by clicking on it and moving your cursor, but you have to stay inside the widget with your cursor. 

</div>

In [9]:
ranges_3d = dict(x=(-2.5, 2.5), y=(-2.5, 2.5), z=(-1, 2.5))  # set up ranges for the 3d plot
plot_task2 = DualPlot(points_linreg, ranges_3d, ranges_linreg)  # create a DualPlot object to manage plotting on two plots
neuron_task2 = SimpleNeuron(plot_task2)  # create a new neuron for this task

interact(
    neuron_task2.set_values,
    weight=FloatSlider(min=-2, max=2, step=0.2, layout=slider_layout),
    bias=FloatSlider(min=-2, max=2, step=0.2, layout=slider_layout),
)

HBox((plot_task2.plot_3d.plot, plot_task2.plot_2d.plot))


<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> In general, what does the optimal weight and bias combination correspond to in the 3D Plot?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> It is the point where the black ball is lowest, i.e. the lowest point of the loss surface (the black ball always lies on the surface). </div>

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> What is the steepness of the valley at the point of optimal weight and bias combination?
</div>

<div class="alert alert-block alert-success">
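<b>Answer:</b> At the optimal weight/bias combination the valley is flat: the steepness (gradient) of the loss surface is 0, because the loss has its minimum there. </div>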
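
The steepness can also be probed numerically. The sketch below is an illustration only: it estimates the slope of the MSE loss with central differences, once at the least-squares optimum (computed here with `np.polyfit`) and once at weight = 0, bias = 0. The helper names `mse_line` and `slope` and the step size `h` are assumptions made for this example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # same points as points_linreg
y = np.array([1.5, 0.7, 1.2])

def mse_line(w, b):
    """Mean squared error of the line w * x + b on the three points."""
    return np.mean((w * x + b - y) ** 2)

def slope(w, b, h=1e-4):
    """Central-difference estimate of the loss slope in weight and bias direction."""
    d_weight = (mse_line(w + h, b) - mse_line(w - h, b)) / (2 * h)
    d_bias = (mse_line(w, b + h) - mse_line(w, b - h)) / (2 * h)
    return d_weight, d_bias

w_opt, b_opt = np.polyfit(x, y, deg=1)   # least-squares optimum
slope_at_optimum = slope(w_opt, b_opt)
slope_at_origin = slope(0.0, 0.0)
print(slope_at_optimum)   # both components are numerically zero
print(slope_at_origin)    # clearly non-zero away from the optimum
```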

***
## 1.4 Activation Functions
So far, our neuron model consisting of a weight and a bias can only mimic linear functions, so all we can do is linear regression. Activation functions expand our capabilities by introducing a non-linearity into the neuron, which lets us model more complex functions. One of the most commonly used activation functions nowadays is the Rectified Linear Unit, also called **ReLU**. It outputs the input value as long as it is greater than 0, and outputs 0 otherwise. We can conveniently describe this function as the maximum of the input value and 0 (5).

\begin{align}
\phi_{\text{relu}}(x) & = \max(0,x)  \;\;\;\;\;\;\;\;\;\;\;    (5)
\end{align}




Run the cell below to define the ReLU function.

In [10]:
def relu(input_val: np.ndarray) -> np.ndarray:
    return np.maximum(input_val, 0)

We can draw a neuron with a ReLU activation function as follows:
<img src="images/single_neuron_relu.png" />
<p style="text-align: center;">
    Fig. 5 - Neuron with ReLU activation function visualized
</p>


Let's create a new class to implement this neuron in Python. We will inherit all properties of a neuron from SimpleNeuron.
We only change the output by first feeding it through our ReLU function:

<div class="alert alert-block alert-success">
<b>Task:</b>  Implement a complete artificial neuron with a ReLU activation function
<ul>
<li> Complete the code below for an artificial neuron by using the relu function from above to calculate its activation, as in Figure 5. </li>
<li>Take a look at the <a href="#simple_neuron">Simple Neuron Class</a> and write a similar compute function</li>
<li>You don't need to re-implement the relu function and should not need to add more than 1 line. </li>

</ul>
</div>

In [11]:
class ReluNeuron(SimpleNeuron): #inherit from SimpleNeuron class
    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        # STUDENT CODE HERE (1 pt)
        self.activation = relu(np.dot(self.weight, x) + self.bias)
        # STUDENT CODE until HERE
        return self.activation

***
### 1.4.1 Task: Nonlinear Climate Control

You find yourself as an engineer at the company "ClimaTronics". Your company wants to implement AI technology to regulate their new air conditioning system "Perfect Climate 9000". Even though the problem can be solved easily with conventional programming, the management department wants you to implement AI to attract investors. You have to fulfill the following requirements that are visualized in the datasheet excerpt:


`The climate control shall remain off for temperatures under 25°C. At a temperature of 30°C, it shall reach 10% of its cooling power. Between 30°C and 40°C, the cooling power shall rise quadratically with the temperature. Cooling power shall reach its maximum at 40°C.`
<img src="images/datasheet.png" />



Run the cell below to display an interactive plot.

In [12]:
points_climate = dict(x=[25.0, 27.5, 30.0, 32.5, 35, 37.5, 40.0], y=[0.0, 2.0, 10.0, 23.7, 43, 68.7, 100.0])

ranges_climate = dict(x=(-4, 45), y=(-4, 105))
climate_plot = Interactive2DPlot(points_climate, ranges_climate)
our_relu_neuron = ReluNeuron(climate_plot)

interact(
    our_relu_neuron.set_values,
    weight=FloatSlider(min=-10, max=10, step=0.1, value=0, layout=slider_layout),
    bias=FloatSlider(min=-200.0, max=200.0, step=1, value=0, layout=slider_layout),
)

climate_plot.plot


<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> When setting the bias to 0.00, how does changing the weight affect the output function? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> Without a bias, the function has the form y = ax, with a being the weight. This is a linear function through the origin whose slope is set by the weight; the ReLU clips it to 0 wherever ax is negative.</div>

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> How does changing the bias affect the output function? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> With a bias, the function has the form y = ax + b, with b being the bias. Changing the bias shifts the point where the function touches the x-axis to the left or to the right.</div>

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> When setting the weight to 1.00 and the bias to -10, at what temperature does the climate control start? 
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>It starts nearly at 10°C</div>

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> When setting the weight to 1.00 and the bias to -20, at what temperature does the climate control start?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>It starts at 20°C</div>

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> When setting the weight to 2.00 and the bias to -20, at what temperature does the climate control start?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>at 10°C</div>
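
The three start temperatures follow one pattern: a ReLU neuron with positive weight switches on where $w \cdot x + b$ crosses zero, i.e. at $x = -b / w$. A small sketch, using a helper named `relu_onset` that is made up for this example:

```python
def relu_onset(weight, bias):
    """Input value at which weight * x + bias crosses zero (requires weight > 0)."""
    if weight <= 0:
        raise ValueError("this sketch only covers positive weights")
    return -bias / weight

# The three weight/bias settings from the questions above
onsets = [relu_onset(w, b) for w, b in [(1.0, -10.0), (1.0, -20.0), (2.0, -20.0)]]
print(onsets)   # start temperatures in degrees Celsius
```

Doubling the weight while keeping the bias fixed halves the start temperature, which is why the bias has to be adjusted together with the weight when fitting the datasheet curve.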

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> What's the best weight/bias configuration that you could find?
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>The best configuration is : weight=7.10 and bias =-199</div>

### 1.4.2 Conclusion
Using just one neuron, we can easily understand and retrace the influence of weight and bias.
But our one-neuron-approximation is not enough to closely approximate the needed quadratic relationship.

***
##  1.5 Neural Networks

The approximation can be improved by using multiple neurons. Instead of just one neuron for our approximation, we construct a neural network. We will use two ReLU neurons and one output neuron that will have weights as well. Now we can decide how we want to weigh the result of the two ReLU neurons in the middle.

### 1.5.1 Hidden Layers
In this neural network, the two neurons in the middle represent a **hidden layer**.

In the last task, the weight and bias had an easily traceable influence on the output.
But by adding more neurons, the relationship between each weight and bias with the output becomes untraceable.
We obtain the weights and biases by simply adapting them until the result turns out to be correct. In this process, we quickly lose track of what exactly we are calculating. It becomes very hard to untangle a single neuron and describe its responsibility in the system. 

The input value is multiplied by the first weights and after adding biases and running it through the activation function, the values are multiplied again by the second weights. Hidden layers can be stacked multiple times after one another. This gives room for multiple calculation steps, allowing more complex functions.

Neural networks with at least one hidden layer have an interesting property: given enough hidden neurons, they can approximate any continuous function on a bounded input range as closely as desired. _(See "Further Reading")_

<img src="images/hidden_layer.png" />



We will create a class for neural networks. The network will have four weights and two biases.


<div class="alert alert-block alert-info">
<b>Note:</b> For the sake of simplicity and code reusability, we will treat neural networks the same way we treated individual neurons in the previous examples. Remember that an artificial neuron is only a mathematical function? A whole neural network can also be fully described by one single function, as is done here when calculating the activation. The neurons don't have to take the concrete shape of individual data objects.

</div>

Run the cell below to define a neural network.

In [13]:
class NeuralNetwork:
    def __init__(self, plot: Interactive2DPlot):
        self.plot = plot #I am assigned the following plot
        self.plot.register_neuron(self) #hey plot, remember me
        
    def set_config(self, w_i1: float, w_o1: float, b1: float, w_i2: float, w_o2: float, b2: float):
        self.w_i1 = w_i1
        self.w_o1 = w_o1
        self.b1 = b1
        self.w_i2 = w_i2
        self.w_o2 = w_o2
        self.b2 = b2
        self.show_config()
        self.plot.update()  # please redraw my output

    def show_config(self):
        print("w_i1:", self.w_i1, "\t| ", "w_o1:", self.w_o1,"\n")
        print("b1:", self.b1, "\t| ", "w_i2:", self.w_i2,"\n")
        print("w_o2:", self.w_o2, "\t| ", "b2:", self.b2,"\n")

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        self.prediction = (relu(self.w_i1 * x + self.b1) * self.w_o1
                         + relu(self.w_i2 * x + self.b2) * self.w_o2)
        return self.prediction

***
###  1.5.2 Task: Nonlinear Climate Control with Neural Network

Run the cell below and adapt the weights and biases to reach a better approximation of the desired curve than in the previous task.

In [14]:
climate_plot_adv = Interactive2DPlot(points_climate, ranges_climate)
our_neural_net = NeuralNetwork(climate_plot_adv)

interact(
    our_neural_net.set_config,
    w_i1=FloatSlider(min=-10, max=10, step=0.1, layout=slider_layout),
    w_o1=FloatSlider(min=-10, max=10, step=0.1,  layout=slider_layout),
    b1=FloatSlider(min=-200.0, max=200.0, step=1,  layout=slider_layout),
    w_i2=FloatSlider(min=-10, max=10, step=0.1, layout=slider_layout),
    w_o2=FloatSlider(min=-10, max=10, step=0.1,  layout=slider_layout),
    b2=FloatSlider(min=-200.0, max=200.0, step=1,layout=slider_layout),
)
climate_plot_adv.plot


<div class="alert alert-block alert-success">
<b>Question (1pt):</b> What is the best configuration you could find? (Copy from above the plot)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b>
        
w_i1: 0.6 	|  w_o1: 6.0 

b1: -16.0 	|  w_i2: 0.7 

w_o2: 9.9 	|  b2: -23.0 

Loss: 4.01

</div>
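
The reported loss can be reproduced without the interactive plot. The sketch below re-implements the network function and the MSE in a standalone way (the names `network` and `mse` are chosen for this example), plugs in the configuration from the answer above, and compares it with the single ReLU neuron configuration from the previous task (weight = 7.10, bias = -199).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

# Same data as points_climate
x = np.array([25.0, 27.5, 30.0, 32.5, 35.0, 37.5, 40.0])
y = np.array([0.0, 2.0, 10.0, 23.7, 43.0, 68.7, 100.0])

def network(x, w_i1, w_o1, b1, w_i2, w_o2, b2):
    """Two ReLU neurons in the hidden layer, weighted and summed at the output."""
    return relu(w_i1 * x + b1) * w_o1 + relu(w_i2 * x + b2) * w_o2

def mse(prediction):
    return float(np.mean((prediction - y) ** 2))

net_loss = mse(network(x, w_i1=0.6, w_o1=6.0, b1=-16.0, w_i2=0.7, w_o2=9.9, b2=-23.0))
single_loss = mse(relu(7.1 * x - 199.0))
print(f"network: {net_loss:.2f}, single ReLU neuron: {single_loss:.2f}")
```

With these values the network loss comes out at about 4.01, far below the loss of the single-neuron configuration, since the second bend lets the curve follow the quadratic rise more closely.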

### 1.5.3 Conclusion
We can conclude that the quadratic relationship can be approximated better by using additional weights and biases. Using two ReLU neurons, we can create a function with two bends.
However, the complexity of finding the optimal weights and biases increases drastically with each variable. The more powerful we want our neural networks to be, the harder the optimization becomes.

***
##  1.6 Backpropagation

Our optimization problem can be solved automatically by gradient descent, with the gradients computed by backpropagation. In this example, we return to the basics and use a simple neuron without an activation function. Backpropagation works by taking the partial derivatives of the loss function with respect to each weight and bias in the network, which can be done using the chain rule of calculus. The network's output $\hat{y} = \hat{f}(x)$, i.e. the y-value predicted by the neural network, is computed in the forward propagation by applying the given rules of calculation (multiply by the weights, add the bias, apply the activation function) until the output is reached. The loss function then compares the predicted value with the ground-truth value. Given this loss, the partial derivatives are calculated in the so-called backward propagation. See for example: [BackpropagationExample](https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html) 

At each point, the weight and bias gradients point in the direction of higher loss. The magnitude of the gradient represents how steeply the loss increases in that direction. 

Suppose we wanted to _maximize_ the loss in Fig. 6: all we would need to do is follow the partial derivatives by adding them to our current weight/bias point. That means decreasing the weight a lot (see the axes in Fig. 6) and decreasing the bias by a lesser amount, since its partial derivative has a smaller magnitude.

However, because we want to go down, we _subtract_ the gradient from our current point. This moves us closer to the minimum. After this step, we are further down and closer to the valley, but not close enough yet. So we simply repeat the steps until we reach the minimum.

The good thing about neural networks is that we can **analytically determine the gradient** at any weight/bias combination for the given data points. We do not have to estimate it numerically, for example by calculating the loss at two nearby weight/bias combinations and dividing the difference by the step distance (a finite-difference approximation). Knowing the gradient formula beforehand makes backpropagation relatively fast. However, for neural networks in general we cannot analytically determine the weight/bias combination that brings the loss function to its minimum. We still have to apply gradient steps iteratively over many steps.
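
To make the difference between the analytic and the numerical approach concrete, here is a sketch for the simple neuron $\hat{f}(x) = w \cdot x + b$ with the MSE loss. The partial derivatives in `analytic_gradient` follow from the chain rule; the function names, the test point and the step size `h` are assumptions made for this illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # same points as points_linreg
y = np.array([1.5, 0.7, 1.2])

def mse_line(w, b):
    return np.mean((w * x + b - y) ** 2)

def analytic_gradient(w, b):
    """Partial derivatives of the MSE, obtained with the chain rule."""
    residual = w * x + b - y
    d_weight = 2.0 * np.mean(residual * x)   # dJ/dw
    d_bias = 2.0 * np.mean(residual)         # dJ/db
    return np.array([d_weight, d_bias])

def numerical_gradient(w, b, h=1e-6):
    """Finite-difference estimate: needs four extra loss evaluations per gradient."""
    d_weight = (mse_line(w + h, b) - mse_line(w - h, b)) / (2 * h)
    d_bias = (mse_line(w, b + h) - mse_line(w, b - h)) / (2 * h)
    return np.array([d_weight, d_bias])

grad_analytic = analytic_gradient(1.0, -0.5)
grad_numerical = numerical_gradient(1.0, -0.5)
print(grad_analytic, grad_numerical)
```

Both approaches agree up to rounding, but the numerical estimate needs additional loss evaluations for every parameter, which becomes expensive for networks with many weights.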

Every step we take is called one **epoch**. (In this case _training steps_ and _epochs_ are equivalent). Because it is hard to determine whether the minimum is reached, we will specify the number of epochs before our descent and simply let the program run.

If the steps derived from the gradients are too big, we will never reach a minimum, because our algorithm moves the ball too far at each step. It will oscillate around the minimum but never arrive at it. In extreme cases, the oscillation can even grow towards infinity. To give us control over the amount of movement, the gradient is multiplied by a factor called the **learning rate** (also called "step size" in gradient descent). By setting it to a suitable value, we can prevent oscillations. However, if the learning rate is too small, the network will take forever to "learn", since the weights and biases change only very slowly.
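
The effect of the learning rate can be tried out on the three points used earlier. The following is a minimal gradient descent sketch for the simple neuron, not the implementation expected in the task below; the function names and the concrete learning rates and epoch counts are chosen only to illustrate the three regimes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # same points as points_linreg
y = np.array([1.5, 0.7, 1.2])

def mse_line(w, b):
    return np.mean((w * x + b - y) ** 2)

def gradient(w, b):
    """Partial derivatives of the MSE with respect to weight and bias."""
    residual = w * x + b - y
    return 2.0 * np.mean(residual * x), 2.0 * np.mean(residual)

def train(learning_rate, epochs, w=0.0, b=0.0):
    """Plain gradient descent: subtract the scaled gradient in every epoch."""
    for _ in range(epochs):
        d_weight, d_bias = gradient(w, b)
        w -= learning_rate * d_weight
        b -= learning_rate * d_bias
    return w, b

w_ok, b_ok = train(learning_rate=0.1, epochs=2000)     # settles near the optimum
w_slow, b_slow = train(learning_rate=0.001, epochs=50) # barely moves in 50 epochs
w_bad, b_bad = train(learning_rate=0.2, epochs=50)     # overshoots more with every step

print("suitable :", w_ok, b_ok, mse_line(w_ok, b_ok))
print("too small:", w_slow, b_slow, mse_line(w_slow, b_slow))
print("too large:", w_bad, b_bad, mse_line(w_bad, b_bad))
```

For this data set the run with a learning rate of 0.2 ends with a higher loss than it started with, while the run with 0.1 approaches the least-squares solution.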

The number of epochs and the learning rate are so-called **hyperparameters**. They influence the training process but are not part of the network itself.



<img src="images/backprop.png" />
<p style="text-align: center;">
    Fig. 6 - Partial derivatives of Loss function
</p>


### 1.6.1 Preparing Backpropagation Plot
We will create a new 3D-Plot that tracks our past weight/bias/loss values as we try to optimize the loss step by step. The black ball will leave a trace of its past values. Run the cell below to enable plotting the backpropagation steps.

In [15]:
plot_backprop = DualPlot(points_linreg, ranges_3d, ranges_linreg)
trace_to_plot = go.Scatter3d(x=[], y=[], z=[], hoverinfo="none", mode="lines", line=dict(width=10, color="grey"))

plot_backprop.plot_3d.data.append(trace_to_plot)  # Expand 3D Plot to also plot traces
plot_backprop.plot_3d.plot = go.FigureWidget(plot_backprop.plot_3d.data, plot_backprop.plot_3d.layout)
plot_backprop.plot_3d.draw_time = 0


def redraw_with_traces(plot_to_update: Interactive2DPlot, neuron: SimpleNeuron, trace_list: Dict[str, List[float]], points: Dict[str, List[float]]):  # executed every update step
    plot_to_update.plot_3d.plot.data[2].x = trace_list["x"]
    plot_to_update.plot_3d.plot.data[2].y = trace_list["y"]
    plot_to_update.plot_3d.plot.data[2].z = trace_list["z"]
    plot_to_update.plot_3d.plot.data[1].x = [neuron.weight]
    plot_to_update.plot_3d.plot.data[1].y = [neuron.bias]
    plot_to_update.plot_3d.plot.data[1].z = [log_mse(neuron, points)]
    plot_to_update.update()


def add_traces(neuron: SimpleNeuron, points: Dict[str, List[float]], trace_list: Dict[str, List[float]]):  # executed every epoch
    trace_list["x"].append(neuron.weight)
    trace_list["y"].append(neuron.bias)
    trace_list["z"].append(log_mse(neuron, points))

***
### 1.6.2 DIY Backpropagation

To do backpropagation, you first have to determine the partial derivatives of the loss function of the "simple neuron" with respect to weight and bias. After that, you have to figure out how to adjust the weight and bias using the gradient scaled by the learning rate.
Further below you can verify your results by training. If you hit the benchmark, your algorithm is correct.

The algorithm has to work with a dict of points in the same form as [points_linreg](#points_linreg).

<div class="alert alert-block alert-success">
<b>Task:</b> Determine the Gradient <b>analytically!!</b>
<ul>
<li> <b>Finish the function below by yourself.</b>
<li> There are multiple solutions to this: your algorithm may adjust the weight and bias in the right direction even if the gradient calculation is wrong.
<li> <b>Benchmark:</b> If you can reach a loss of 0.22 after 100 epochs and a learning rate of 0.03, your solution is correct.
    </li>

</ul>

</div>

<div class="alert alert-block alert-info">
<b>Hint:</b>
<ul>
    <li> Read the above text on backpropagation carefully.
    <li> If you are having trouble figuring out the gradient, try calculating it on paper first.
    <li> Ask yourself: What are the components of the loss function? How does the loss function depend on the weight and bias variables, with respect to which you have to differentiate?
    </li>
</ul>
</div>

In [22]:
def simple_neuron_loss_gradient(neuron: SimpleNeuron, points: Dict[str, List[float]]) -> Dict[str, float]:

    gradient_sum = dict(weight=0, bias=0)  # contains the sum of the weight and bias gradient
    for point_x, point_y in zip(points["x"], points["y"]):  # for each point
        # Hint: point_x and point_y are the current point values

        gradient_sum["weight"] += (  # sum up the gradient for each point
            ### STUDENT CODE HERE (2 pts)
            2 * point_x * (neuron.weight * point_x + neuron.bias - point_y)
            ### STUDENT CODE until HERE
        )

        gradient_sum["bias"] += (
            ### STUDENT CODE HERE (2 pts)
            2 * (neuron.weight * point_x + neuron.bias - point_y)
            ### STUDENT CODE until HERE
        )

    num_points = len(points["x"])  # average the summed gradients over all points
    gradient = dict(weight=gradient_sum["weight"] / num_points, bias=gradient_sum["bias"] / num_points)
    return gradient
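
For reference, the two expressions in the student code match the following derivation. It is a sketch under the assumption that the simple neuron computes $w x_i + b$ without an activation function and that the loss $L$ is the mean squared error over the $N$ points $(x_i, y_i)$; the symbols $L$ and $N$ are introduced here only for illustration.

\begin{align}
L(w, b) &= \frac{1}{N}\sum_{i=1}^{N} \left(w x_i + b - y_i\right)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{N}\sum_{i=1}^{N} x_i \left(w x_i + b - y_i\right) \\
\frac{\partial L}{\partial b} &= \frac{2}{N}\sum_{i=1}^{N} \left(w x_i + b - y_i\right)
\end{align}

Both derivatives follow from the chain rule: the outer square contributes the factor $2\left(w x_i + b - y_i\right)$, and the inner derivative is $x_i$ with respect to $w$ and $1$ with respect to $b$.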

<div class="alert alert-block alert-success">
<b>Task:</b> Adjust the Neuron
<ul>

<li> After finding the gradient you have to adjust the weight and bias of the neuron, based on the partial derivatives and the learning rate. You have to verify your results by training the net down below.
<li> <b>Finish the function below by yourself.</b>
    </li>

</ul>

</div>

<div class="alert alert-block alert-info">
<b>Info:</b>
<ul>
    <li> This is an iterative function used on each neuron once per epoch.
    <li> Use the neuron's current weight and bias as a starting point and adjust them to improve the NN.
    <li> The entered learning rate scales the magnitude of the adjustment.
    <li> Think about the direction of the loss gradient and the direction you want your loss to shift in.
</ul>
</div>

In [23]:
def adjust_neuron(neuron: SimpleNeuron, gradient: Dict[str, float], learning_rate: float):
    ### STUDENT CODE HERE (2 pts)
    neuron.weight -= learning_rate * gradient["weight"]
    neuron.bias -= learning_rate * gradient["bias"]
    ### STUDENT CODE until HERE

### 1.6.3 Defining training process

In [47]:
# do not change
def train(neuron: SimpleNeuron, points: Dict[str, List[float]], epochs: int, learning_rate: float, redraw_step: int, trace_list: Dict[str, List[float]]):
    redraw_with_traces(neuron.plot, neuron, trace_list, points)
    for i in range(1, epochs + 1):  # first Epoch is Epoch no.1
        add_traces(neuron, points, trace_list)
        gradient = simple_neuron_loss_gradient(neuron, points)
        adjust_neuron(neuron, gradient, learning_rate)

        if i % redraw_step == 0:
            print("Epoch:{} \t".format(i), end="")
            redraw_with_traces(neuron.plot, neuron, trace_list, points)  # use the passed-in neuron, not the global one

<div class="alert alert-block alert-success">
<b>Task:</b> Choose Hyperparameters and Train
<ul>

<li> Choose an optimal learning rate and number of epochs by trying out values and running the two cells below</li>

</ul>

</div>

In [70]:
learning_rate = 0.145  # best value found here; the benchmark uses 0.03, change to play around
epochs = 100  # keep this for benchmarking, change to play around
redraw_step = 10  # update plot every n-th epoch. Too slow? Set this to a higher value (e.g. 100)

# these values are taken as parameters by the train function below

neuron_backprop = SimpleNeuron(plot_backprop)
HBox((plot_backprop.plot_3d.plot, plot_backprop.plot_2d.plot))


In [71]:
# run this cell to test the algorithm

np.random.seed(4) # keep this for benchmarking, remove to play around

neuron_backprop.set_values(  # set weight and bias randomly
    (5 * np.random.random() - 2.5), (5 * np.random.random() - 2.5)
)
trace_list1 = dict(x=[], y=[], z=[])

train(neuron_backprop, points_linreg, epochs, learning_rate, redraw_step, trace_list1)

Loss: 18.45
Loss: 18.45
Epoch:10 	Loss: 0.35
Epoch:20 	Loss: 0.22
Epoch:30 	Loss: 0.16
Epoch:40 	Loss: 0.12
Epoch:50 	Loss: 0.11
Epoch:60 	Loss: 0.10
Epoch:70 	Loss: 0.10
Epoch:80 	Loss: 0.10
Epoch:90 	Loss: 0.09
Epoch:100 	Loss: 0.09


**Benchmark:** If you can reach a loss of 0.22 after 100 epochs and a learning rate of 0.03, your solution is correct

**Only answer this after your algorithm has hit the benchmark**

<div class="alert alert-block alert-success">
<b>Question (3 pts):</b> What happens when you set the learning rate to 0.18? Explain this behavior.
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> The algorithm takes steps that are far too big, so it overshoots the minimum every time, and this repeats again and again. </div>

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> What happens when you set the learning rate to 0.182? Explain this behavior.
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> As the learning rate got bigger, the steps got bigger, but this time the loss got bigger too. The same happened in the next epoch, which put us in a kind of destructive resonance. </div>

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> What is the best learning rate you could find? (In terms of: lowest loss after 100 epochs, compared with lr=0.03)
(Anything better than the benchmark loss of 0.22 is correct)
</div>

<div class="alert alert-block alert-success">
<b>Answer:</b> 0.145</div>

## 1.7 Machine/Deep Learning Notation

We have already introduced the learning rate, hyperparameters and epoch. Some further notation will be introduced now.

Consider a training set that you want to feed to your neural network in order to adjust its weights by backpropagation. If you update the weights and biases after every single sample, that is, after each forward and backward pass, this is called **Stochastic Gradient Descent** or online learning. As you might already guess, you can also accumulate the errors over a number of samples given by the so-called **batch size** and perform one backward pass per such subset of the training data, which is usually called **Mini-Batch Gradient Descent**. The proper calculation would use the whole set of training samples for the forward passes and store the errors to compute a single update, which is the regular (full-batch) Gradient Descent. With increasing sample size this often becomes infeasible, which is why mini-batch and stochastic gradient descent are helpful; they do, however, affect the model differently during training.
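The variants differ only in how many samples contribute to one update. A minimal sketch, assuming a hypothetical helper `iterate_batches` (not part of this notebook) that shuffles the samples and yields them in chunks of `batch_size`:

```python
import random
from typing import Iterator, List, Sequence, TypeVar

Sample = TypeVar("Sample")


def iterate_batches(samples: Sequence[Sample], batch_size: int, seed: int = 0) -> Iterator[List[Sample]]:
    """Yield the shuffled samples in chunks of batch_size; one full pass is one epoch."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed only to keep the sketch reproducible
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


samples = list(range(10))  # stand-in for 10 training samples
for name, batch_size in (("one sample", 1), ("subsets of 4", 4), ("whole set", len(samples))):
    updates = sum(1 for _ in iterate_batches(samples, batch_size))
    print("{:<12} batch_size={:<2} -> {} updates per epoch".format(name, batch_size, updates))
```

With 10 samples, a batch size of 1 leads to 10 updates per epoch, a batch size of 4 to 3 updates (the last batch holds only 2 samples), and a batch size of 10 to a single update.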

No matter which gradient descent variant you use, an **epoch** has passed once your entire training data has been used exactly once for updating the weights, in subsets or as a whole.

As you already saw in Task2 - PerformanceEvaluation, a model can overfit or underfit depending on the complexity of the problem/model and the amount of data available. Overfitting in neural networks is prevented by **Regularization**. In Task7 - Convolutional Neural Networks you will have to apply some of these techniques using the deep learning library Keras. The most common techniques besides reducing model complexity or increasing/augmenting the training data are **L1/L2-regularization** and **Dropout**.

If you are not able to train your model properly, this might be because your model is confronted with **Exploding or Vanishing Gradients**. Consider a really deep model containing multiple hidden layers. The loss at the output is used to update the parameters of the whole network. Because the gradient for an early layer is a product of partial derivatives across all later layers, factors $>1$ that are multiplied repeatedly lead to ever larger updates, and training can become unstable (Exploding Gradient). Conversely, weights in earlier layers are barely updated when the partial derivatives are $<1$ and are multiplied over many layers (Vanishing Gradient). To prevent the latter case, which occurs more often in regular training and when sigmoid functions are used, the ReLU activation function is nowadays used as the default activation in intermediate layers. More advanced methods can be used as well, like **residual/skip connections**.
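How quickly such a product shrinks can be checked with a few lines of code. This is a simplified sketch, not a full backpropagation: it ignores the weights and assumes the same input `z` to the activation at every layer, and the helper `chained_factor` is a name made up for this illustration.

```python
import math


def sigmoid_derivative(z: float) -> float:
    """Derivative of the sigmoid; it never exceeds 0.25 (reached at z = 0)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_derivative(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_factor(derivative, z: float, num_layers: int) -> float:
    """Product of the same activation derivative over num_layers layers (weights ignored)."""
    return derivative(z) ** num_layers


for depth in (1, 5, 10, 20):
    print(depth, chained_factor(sigmoid_derivative, 0.0, depth), chained_factor(relu_derivative, 1.0, depth))
```

Even in the best case for the sigmoid (derivative 0.25 at `z = 0`), ten layers already scale the gradient by less than one millionth, while the ReLU factor stays at 1 for positive inputs.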

***
### Further Reading: Neural Networks are Universal Function Approximators

It can be mathematically proven that neural networks can approximate any continuous function, as long as they have at least one hidden layer, use nonlinear activation functions, and use a sufficient (but finite) number of hidden layer neurons.

Kurt Hornik, "Approximation capabilities of multilayer feedforward networks", Neural Networks, Volume 4, Issue 2, 1991, Pages 251-257. https://www.sciencedirect.com/science/article/pii/089360809190009T?via%3Dihub
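
As a small illustration of the idea (not of the proof), a network with a single hidden layer of two ReLU neurons can represent $|x|$ exactly, because $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$. The function names and the hand-picked weights below are assumptions made for this sketch:

```python
def relu(z: float) -> float:
    """ReLU activation: passes positive values, clips negative values to 0."""
    return max(0.0, z)


def tiny_network(x: float) -> float:
    """One hidden layer with two ReLU neurons and a linear output neuron."""
    hidden_1 = relu(1.0 * x + 0.0)   # weight +1, bias 0
    hidden_2 = relu(-1.0 * x + 0.0)  # weight -1, bias 0
    return 1.0 * hidden_1 + 1.0 * hidden_2 + 0.0  # output weights 1 and 1, bias 0


for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, tiny_network(x))
```

More complicated functions need more hidden neurons, and their weights are learned by training rather than picked by hand.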