# Regression and classification - supervised learning
Machine learning is a very broad field. It covers pretty much every application in which a piece of software is not explicitly programmed to perform a specific task but *learns* to do it by being *trained* on example data. This tutorial covers only one aspect of machine learning: *supervised learning*.

Supervised learning is always about predicting something based on some underlying data:

- product interests based on Google searches
- stock prices based on the market development of the last week
- the kind of animal we see in an image based on samples we have stored in a database
    
Machine learning is called *supervised* if the value that is to be predicted is known for the data used for training. If an algorithm is trained to separate images of cats and dogs, someone has labelled the training images beforehand. In the beginning, a machine learning algorithm won't be able to tell them apart, but after it has been *trained* on a lot of data, it may be able to perform this task.

There are many types of algorithms that can be used for supervised learning. Some of them are shown below. In this tutorial, we will focus on one special type of machine learning algorithm: *neural networks*.
<img src="https://miro.medium.com/max/477/1*KFQI59Yv7m1f3fwG68KSEA.jpeg" height=500 style='height: 500px'> (image: https://medium.com/technology-nineleaps/popular-machine-learning-algorithms-a574e3835ebb)

In [14]:
# run this code so that you can enjoy the examples!
# if you run this in Colab, you need to download the examples: uncomment the following line
# !wget -O interactive_examples.py "https://github.com/flome/e4_bsc_python/blob/machine_learning/4.%20Machine%20Learning/interactive_examples.py?raw=true"
from interactive_examples import *
%matplotlib inline

Before we can understand the basics of machine learning and neural networks, we need to explore which problems we want to solve with them and how a computer sees those problems. We will discuss two basic problems: regression and classification.

## Regression

Regression deals with the prediction of continuous variables based on input data. The simplest form of regression is a function fit with one variable, as is often done in the lab exercises:
<img src="https://github.com/flome/e4_bsc_python/blob/machine_learning/4.%20Machine%20Learning/imgs/linear_fit.png?raw=true" height=400 style='height: 400px'>
If we now do a measurement at a new point *x'*, the regression can be used to "predict" the most likely value for *y = f(x')*.
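
To make the idea of "fit, then predict" concrete, here is a minimal sketch in plain Python. The data values and the helper name `fit_line` are made up for illustration, and the closed-form least-squares formulas used here are one common way to obtain a straight-line fit, not necessarily the method behind the plot above.

```python
# Hypothetical measurements (made up for illustration): y is roughly 2*x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

def fit_line(x, y):
    """Return slope and intercept of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(x, y)

# "Predict" the value at a new point x' that was not measured
x_new = 2.5
y_pred = slope * x_new + intercept
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, f({x_new}) = {y_pred:.3f}")
```

The prediction at `x_new` is simply the fitted function evaluated at a point that was not part of the data.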

### Linear regression
The regression method used above is called *linear regression* because the fitted function is a straight line. The regression prediction is determined by a *slope m* and an *intercept b*. In the example below, you can experiment with how to match the data: adjust the slope and intercept so that the *manual linear fit* matches the data well. The plot updates as soon as you stop moving the sliders.

In [8]:
# run this code!
linear_example()


#### Loss functions
It turns out that this is really annoying, because you don't even know what "matching the data well" is supposed to mean! The computer cannot know this either. We need to quantify what "matching the data" means. This is done with *loss* or *cost* functions.

A loss function returns a value based on "how well the fit matches the data". Usually, a *lower loss* corresponds to better agreement with the data. For regression, the *least-squares fit* is the most common way to quantify this. The loss is computed as the mean squared error (m.s.e.), the average squared difference of the data values $y_i$ from the fit line $f(x_i)$:
<p>
<center>
$\mathrm{m.s.e.} = \frac{1}{N} \sum_{i = 1}^{N} \left( y_i - f(x_i) \right)^2$
</center>
</p>
Try your luck again. This time you get to see how well you match the data. How low can you get the loss?
The plots are updated when you stop moving the sliders!
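
The mean squared error can be sketched in a few lines of plain Python. The function name `mean_squared_error` and the data points are assumptions made for illustration; the interactive example may compute its loss differently in detail.

```python
def mean_squared_error(x, y, m, b):
    """Average squared difference between the data y_i and the line f(x_i) = m*x_i + b."""
    residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
    return sum(r ** 2 for r in residuals) / len(residuals)

# Hypothetical data points (made up for illustration), lying exactly on y = 2*x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

loss_perfect = mean_squared_error(x, y, m=2.0, b=1.0)  # line through all points
loss_flat = mean_squared_error(x, y, m=0.0, b=0.0)     # flat line at zero
print(loss_perfect, loss_flat)
```

A line that passes through every point gives a loss of zero. Any other choice of `m` and `b` gives a larger value, which is what makes the loss usable as a score for the fit.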


In [9]:
# run this code!
linear_example_with_loss()


#### Monkey work - automate curve fitting using a minimizer
This is the way that a computer sees a fitting problem. You have probably followed a certain strategy during the loss *minimization*. While it is possible in 2 dimensions (*m* and *b*) to just keep trying which change lowers the loss, this becomes a very difficult problem in more dimensions.

Of course, you don't need to do this by hand for more complex *parameter optimizations*. Computer programs that are designed specifically for this task are called *minimizers* or *optimizers*, and the strategies they implement can be quite different.
Many minimizers implement the concept of *gradient descent*. If the loss function is differentiable with respect to the parameters of our function, we can compute the derivative of the loss. This means that we can estimate how much our loss will change if we move a parameter to higher or lower values! This makes getting to the best result a lot easier!
<p>
<img src="https://blog.paperspace.com/content/images/2018/05/fastlr.png" height=250 style='height: 250px'>
</p>
(image: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/)

How far should we move the parameters once we have computed the direction of the best change? That depends! The step size relative to the loss gradients is called the *learning rate* and is one of the most important settings in an optimization.
<p>
<img src="https://miro.medium.com/max/1200/0*K0ltbXIgtNLEXsXN.png" height=250 style='height: 250px'>
</p>
(image: https://medium.com/octavian-ai/how-to-use-the-learning-rate-finder-in-tensorflow-126210de9489)
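
The effect of the learning rate can be tried out on a toy problem. The sketch below is not taken from the tutorial's examples: the loss $L(w) = w^2$, the helper name `descend` and the three learning rates are assumptions chosen only to show a step size that is too small, one that works well, and one that is too large.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on the toy loss L(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                # dL/dw for L(w) = w**2
        w = w - learning_rate * gradient  # step against the gradient
    return w

for alpha in (0.01, 0.3, 1.1):
    print(f"learning rate {alpha}: w after 20 steps = {descend(alpha):.4g}")
```

With the small learning rate, `w` has barely moved towards the minimum at 0 after 20 steps. With the medium one it is practically there. With the large one every step overshoots the minimum and `w` moves further away instead of closer.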

Check out the new *Descend!* button in the example below. It changes the parameters $m$ and $b$ of the linear function according to a simple gradient descent rule:

<center>
    $ \vec{\lambda}_{i+1} = \vec{\lambda}_i - \alpha \cdot \nabla L(m, b)$
</center>
where $\vec{\lambda}_i = (m, b)$ is the parameter vector after step $i$, $\alpha$ is the learning rate and $L(m, b)$ is the mean squared error loss as a function of the parameters $m$ and $b$.

Finding the optimal parameters should get a lot easier. Test the behaviour at different learning rates. Which learning rate seems to be a good choice? Which are the best parameters found like this? How does the loss compare to the one you could find manually?
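
The update rule above can be sketched for the straight-line fit as follows. This is a minimal illustration under stated assumptions: the data, the function names `mse_gradient` and `gradient_descent`, the learning rate and the number of steps are all made up, and both parameters are updated in the same step, which is not necessarily how `linear_example_gradient_descent` is implemented. The gradients follow from differentiating the mean squared error with respect to $m$ and $b$.

```python
def mse_gradient(x, y, m, b):
    """Gradient of the mean squared error with respect to slope m and intercept b."""
    n = len(x)
    residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
    grad_m = -2.0 / n * sum(xi * r for xi, r in zip(x, residuals))
    grad_b = -2.0 / n * sum(residuals)
    return grad_m, grad_b

def gradient_descent(x, y, learning_rate=0.05, steps=2000):
    """Repeatedly move (m, b) a small step against the loss gradient."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        grad_m, grad_b = mse_gradient(x, y, m, b)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical data (made up for illustration), lying exactly on y = 2*x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

m, b = gradient_descent(x, y)
print(f"m = {m:.4f}, b = {b:.4f}")
```

For this data, the parameters end up very close to the slope 2 and intercept 1 that generated the points.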

*Note: Due to how the program is animated, every click on 'Descend!' updates the plots twice: once for the slope, once for the intercept.*

In [10]:
# run this code!
linear_example_gradient_descent()


You will use different and better minimization methods later on in the tutorial, but this is it for now. We want to look at another type of problem first.

## Classification
Classification works very similarly to regression. But instead of assigning a (mostly) continuous value to a combination of inputs, we want to assign a discrete *class* as the *prediction*. In the regression problem, we were looking for a function that is as similar to the data as possible. For classification, we want a function that divides our data into areas or *classes*. In simple cases, this can in fact be a straight line like the one we used before.
<img src="https://cdn.educba.com/academy/wp-content/uploads/2019/12/Regression-vs-Classification.jpg" height=200 style="height:200px"> (image: https://www.educba.com/regression-vs-classification/)

The following example shows you how you can create a *decision boundary* with a linear function. Every value on one side *belongs* to one class, every value on the other side *belongs* to the other.
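
A linear decision boundary can be sketched as a simple comparison. In this illustration, the function name `predict_class`, the points and the convention that points above the line get class 1 are assumptions; the interactive example may label the two sides differently.

```python
def predict_class(x, y, m, b):
    """Return 1 if the point (x, y) lies above the line y = m*x + b, otherwise 0."""
    return 1 if y > m * x + b else 0

# Hypothetical points in the plane (made up for illustration)
points = [(0.0, 2.0), (1.0, 0.0), (2.0, 5.0), (3.0, 1.0)]

predicted = [predict_class(px, py, m=1.0, b=0.5) for px, py in points]
print(predicted)
```

Changing `m` and `b` moves the boundary and can therefore change which class a point is assigned to.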

In [11]:
# run this code!
linear_classification()


As with regression, finding the *best* decision boundary is a difficult task. What is a good *loss* for a classification task? A measure like the *accuracy* seems to be a natural choice:
<img src="https://miro.medium.com/max/4208/1*Yslau43QN1pEU4jkGiq-pw.png" height=400 style='height: 400px'>
(image: https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28)
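
Accuracy is the fraction of data points whose predicted class matches the true class. A minimal sketch, with made-up labels and an assumed helper name `accuracy`:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of data points for which the predicted class equals the true class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical labels (made up for illustration)
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]

acc = accuracy(true_labels, predicted_labels)
print(f"accuracy = {acc:.2f}")  # 3 of 5 predictions are correct
```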

In [12]:
# run this code!
linear_classification_accuracy()


It turns out that accuracy and similar measures are indeed good *metrics* to evaluate the performance of a classification. They are not good loss functions though, and we have already learned why: the optimizer needs a differentiable loss, and accuracy is not differentiable. It changes in discrete steps depending on whether a value *belongs* to class 1 or class 2. The optimizer won't be happy and we don't want to go back to manual tuning.
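
The step-like behaviour of the accuracy can be seen in a small one-dimensional sketch. The data, the labels 0 and 1, and the helper name `accuracy_at` are assumptions made for illustration; here the "decision boundary" is just a threshold on a single coordinate.

```python
# Hypothetical 1D toy data: class 0 on the left, class 1 on the right
positions = [1.0, 2.0, 3.0, 4.0]
true_labels = [0, 0, 1, 1]

def accuracy_at(threshold):
    """Accuracy when every point to the right of `threshold` is assigned to class 1."""
    predicted = [1 if x > threshold else 0 for x in positions]
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / len(true_labels)

thresholds = [2.1, 2.5, 2.9, 3.1]
accuracies = [accuracy_at(t) for t in thresholds]
print(accuracies)
```

Moving the threshold anywhere between 2 and 3 leaves the accuracy unchanged, so the derivative there is zero and tells the optimizer nothing. As soon as the threshold crosses a data point, the accuracy jumps by a whole step.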

Let's leave behind the need to assign each data point to a class and instead assign a *probability* that it belongs to a class. A very commonly used function for this task is the *sigmoid function*.
<p>
<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Sigmoid-function-2.svg" height=200 style='height: 200px'>
</p>
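
As a rough sketch of how the sigmoid function can turn a position relative to the boundary into a probability, consider the code below. Feeding the vertical offset from the line $y = m \cdot x + b$ into the sigmoid is an assumption made for illustration, as are the function names; the interactive examples may use a differently scaled quantity.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def class_probability(x, y, m, b):
    """Probability for class 1, based on the vertical offset of (x, y) from the line."""
    return sigmoid(y - (m * x + b))

on_boundary = class_probability(1.0, 1.5, m=1.0, b=0.5)   # exactly on the line
far_above = class_probability(1.0, 5.0, m=1.0, b=0.5)     # well above the line
far_below = class_probability(1.0, -2.0, m=1.0, b=0.5)    # well below the line
print(on_boundary, far_above, far_below)
```

Unlike a hard class assignment, this probability changes smoothly when `m` or `b` is changed.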

The sigmoid function takes the value 0.5 on the decision boundary and approaches 1 on one side and 0 on the other as we move away from the boundary. So we can assign probabilities to data points. On its own, this does not yet give us a differentiable loss function: to compute an accuracy, we still need to assign a class label. We would prefer a loss function that takes the probability values for the data points into account. The most commonly chosen loss function for such *binary classifications* (the class is either 0 or 1, or 1 or 2 in the numbering used above) is called *binary crossentropy*. It is often also known as *log loss* and is closely related to *maximum likelihood methods*:

<p>
<center>
    $\mathrm{b.c.e.} = - \frac{1}{N} \sum_{i = 1}^{N} \left( y_i\cdot \log (p_i) + (1-y_i)\cdot \log(1-p_i) \right)$
</center>
</p>

Here, $p_i$ is the predicted probability, obtained from the sigmoid function, that data point $i$ belongs to class 1, and $y_i$ is the correct class label. The loss is designed such that only the term $\log(1-p_i)$ counts if the class $y_i$ is 0, and only $\log(p_i)$ counts if the class $y_i$ is 1.
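
The formula translates almost directly into code. The sketch below uses made-up labels and probabilities and an assumed function name `binary_crossentropy`. It does not guard against probabilities of exactly 0 or 1, for which the logarithm is undefined; practical implementations usually clip the probabilities slightly.

```python
import math

def binary_crossentropy(labels, probabilities):
    """Mean binary crossentropy for labels y_i in {0, 1} and probabilities p_i in (0, 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # for y = 1 only log(p) contributes, for y = 0 only log(1 - p)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical class labels and predicted probabilities (made up for illustration)
labels = [1, 0, 1]
loss_good = binary_crossentropy(labels, [0.9, 0.1, 0.8])    # confident and correct
loss_unsure = binary_crossentropy(labels, [0.5, 0.5, 0.5])  # undecided everywhere
loss_bad = binary_crossentropy(labels, [0.1, 0.9, 0.2])     # confident and wrong
print(f"{loss_good:.3f} {loss_unsure:.3f} {loss_bad:.3f}")
```

Confident correct predictions give a small loss, undecided predictions a medium one, and confident wrong predictions are penalized most strongly.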

<p>
<img src="https://ml-cheatsheet.readthedocs.io/en/latest/_images/cross_entropy.png" height=250 style='height:250px'>
</p>

Let's inspect how the *binary crossentropy* behaves when we move around the decision boundary:


In [13]:
# run this code!
linear_classification_accuracy_and_bce()


## Wrap-up
At this point you have already learned a lot about how a computer is, in principle, able to predict values based on input data. These were very simple examples, but the underlying concepts are the same for many more applications in machine learning. The next tutorials will investigate how we can approximate more complex functions and use them to tackle regression and classification problems in higher dimensions.
You should now be able to answer the following questions:

- what is supervised learning?
- what are the goals of regression and classification?
- why do we need a loss function?
- what is the difference between a good *metric* and a good *loss*?
- what are examples for typical metrics and losses?