# Machine Learning - Introduction
## What is Machine Learning?
There is still a lot of confusion about terms like **artificial intelligence** (or: **A.I.**), **machine learning**, and **deep learning**.
The aim of this course is not to give a full in-depth introduction to all those topics, because each of them clearly demands one (or, more often, several) courses of its own.

However, what this section should show is that the underlying idea of machine learning is often quite approachable, and that some of the respective techniques can be used with relative ease, at least from the code perspective. If you haven't learnt the basics of machine learning before, it is unlikely that this section will make you feel like you have mastered the art. But, again, this is not the main goal.

Machine learning is a subfield of *artificial intelligence* (see {numref}`fig_ai_vs_ml_vs_dl`), which often makes it sound very intimidating at first. In its full complexity, it is indeed hard to master. But it is important to note at this point that machine learning itself can also be seen as *just another tool* that we have as data scientists. In fact, many of its techniques are no more complicated than the methods in dimensionality reduction or clustering that we have seen in the last chapters (some even consider those part of machine learning, but I will skip this discussion for now).

```{figure} ../images/fig_ai_vs_ml_vs_deep_learning.png
:name: fig_ai_vs_ml_vs_dl

Machine Learning is a field of techniques that belongs to *artificial intelligence*. Currently, machine learning is even the main representation of artificial intelligence with *deep learning* being the most prominent subset. Here, however, we will focus on more classical machine learning techniques (no worries, those remain equally important in real-life practice!).
```

The key idea behind machine learning is that we use algorithms that *"learn from data"*. Learning here is nothing magic and typically means something very different from our everyday use of the word "learning" in a human context. Learning here simply means that the rules by which a program decides on its outcome or behavior are no longer *hard-coded* by a person, but they are automatically searched and optimized.

To give you an idea of how simple this "learning" can be: Imagine we have two highly correlated features, for instance the shoe size and the height of many people. We could then automatically find a good linear fit function and use it to make predictions for new data entries. Say you find footprints of size 47 after a robbery, then your new "machine learning model" (yes, that's simply the linear fit!) can predict the height of that person ({numref}`fig_predictions_already_done`**A**). The prediction is probably not perfect, but we have good reasons to believe that it most likely won't be too far off either.
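As a rough illustration of this idea, here is a minimal sketch of such a linear fit. The shoe sizes and heights below are made-up numbers (not real measurements), and the variable names are chosen only for this example.

```python
import numpy as np

# Made-up example data (not real measurements): EU shoe sizes and heights in cm
shoe_size = np.array([37, 38, 39, 40, 41, 42, 43, 44, 45, 46])
height_cm = np.array([160, 165, 168, 172, 175, 179, 182, 186, 189, 193])

# "Training": find the best-fitting line  height = slope * shoe_size + intercept
slope, intercept = np.polyfit(shoe_size, height_cm, deg=1)

# "Prediction": apply the fitted line to a new, unseen shoe size
new_shoe_size = 47
predicted_height = slope * new_shoe_size + intercept
print(f"Predicted height for shoe size {new_shoe_size}: {predicted_height:.0f} cm")
```

Note that size 47 lies slightly outside the range of the example data, so the line is being extrapolated here.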

Or, think of several chat messages that we clustered into "spam", "negative", and "positive", a grouping that is nicely reflected by their positions in a 2D plot after dimensionality reduction ({numref}`fig_predictions_already_done`**B**). If we now receive a new message which ends up clearly within the "positive" cluster, then our obvious first guess is that this is a positive message as well. Again, we could be wrong, but based on the data we have, this simply seems to be the best guess we can make.

So, when is this going to be called *machine learning*?
When we have an algorithm that does this guessing for us based on the data, it is machine learning. If we do the guessing ourselves, it is not machine learning.

```{figure} ../images/fig_predictions_already_done.png
:name: fig_predictions_already_done

Even before coming to this chapter, we have already worked with techniques that would actually allow us to make predictions on the basis of known data points. **A**: when we looked at (high) correlations, this already implied that there is a more or less reliable link between different features. **B**: when we think of clustering and dimensionality reduction, it seems rather obvious that we could make predictions for new data points based on their lower-dimensional position!
```

Before we have a look at some actual machine learning techniques, we should look at one key distinction to make. In machine learning we distinguish two types of tasks that a model can do. **Classification** means that models will predict the class of unknown datapoints based on what they "learned" from **labeled data**. Labeled data means that this is data for which we *know* the true labels (also called: targets or ground truth).

**Regression** means that models will predict one (or multiple) numerical values for unknown datapoints, see {numref}`fig_classification_regression`.

```{figure} ../images/fig_classification_regression.png
:name: fig_classification_regression

In *machine learning* we distinguish **classification** and **regression** tasks. The key difference is that classification models will predict one out of several possible categories while regression models output numerical values (floats).
```

## Training & Testing

Before we start, let's quickly repeat a few key terms:
- **model**: you can think of a machine learning model as a program function. It takes a certain, well-defined input (the **data**, often lazily written as $X$) and generates predictions or **labels** (lazily written: $y$).
- A model is *trained* on data with known labels. This is called *supervised learning*, because the process is guided by these target values (the labels). There is also something called *unsupervised learning*, but we will ignore this for now.
- The prediction of labels is called *prediction* or *inference*. Obviously, this should happen after the model training.
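
In code, these terms could look roughly as follows. This is only a sketch: it assumes Scikit-Learn's `KNeighborsClassifier` as the model (the k-nearest neighbors algorithm is introduced later in this chapter), and the tiny dataset and its labels are made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up dataset: each row of X is [height in cm, weight in kg]
X = [[150, 50], [160, 55], [165, 60], [180, 85], [185, 90], [190, 95]]
# Known labels y, one per row of X (hypothetical categories)
y = ["small", "small", "small", "large", "large", "large"]

model = KNeighborsClassifier(n_neighbors=3)  # the model, still untrained
model.fit(X, y)                              # training on data with known labels

# Inference: predict labels for new data points
X_new = [[155, 52], [188, 92]]
predictions = model.predict(X_new)
print(predictions)
```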

### How do we know how *good* our model is?
The tricky part about applying supervised machine learning models is that we *train* the model using (more or less) well-understood data and later want to use its predictions on *unknown* data. This means data where we either don't know the labels ourselves (say, predicting tomorrow's weather), or where we could know the labels but do not want to, or cannot, create them manually for all cases (for instance, classifying all incoming mail as spam or no-spam). But can we trust the model's predictions?

As the name says, everything the model outputs are *predictions*. For all models we will consider here, this means that every single model output is uncertain and could be wrong. With the tools we have at hand in this section, we cannot even estimate the chance of failing. But, we can measure how reliable our model is *on average*!

For this, we need to do the single most important action in all of machine learning, the **train/test split**.
### Train/Test split
For supervised machine learning we need data ($X$) and corresponding labels ($y$). To avoid blindly applying our model to truly unseen and unknown data, we virtually **always** split our data into training and test data. As the name suggests, the training data can be used to train our machine learning model. The test data is kept and will **never** be used in training, but only to assess the quality of predictions of our model. Typically, we want as much data as possible for the training since more data usually correlates with better models. However, we also have to reserve enough data for the test set to later guarantee a meaningful assessment of our model.


We will later see that this can easily be done by using the `train_test_split` function from `Scikit-Learn`.
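
As a first, minimal sketch of such a split: the data here is randomly generated as a stand-in, and the 80/20 ratio is just one common choice, not a fixed rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 data points with 2 features each, and one label per point
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# Keep 20% of the data aside as test set (the fraction is a choice, not a rule)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Setting `random_state` makes the (otherwise random) split reproducible.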

## Common algorithms

Instead of going much deeper into the theory of what machine learning is, we will simply have a look at a few very common algorithms every well-trained data scientist should know about. Some of them remain standard items in our everyday toolbox and you will most likely apply them every now and then. Others are important because they reflect key concepts in machine learning algorithms and help you gain a more fundamental understanding of what is actually happening under the hood.


### k-nearest neighbors (k-NN)
**$k$-nearest neighbors** is, for very good reasons, one of the most commonly known machine learning algorithms. It is relatively intuitive and simple, yet still powerful enough to find plenty of use cases even today (despite much fancier techniques being on the market).

The algorithm works as follows ({numref}`fig_knn_algorithm`). For any given data point $x$, do the following:
- Search for the $k$ nearest neighbors within the known data.
- For classification: take the most common label of those $k$ data points.
- For regression: take the mean (or median) of those $k$ data points.

That's essentially it, which means it is no more complicated than what was sketched in {numref}`fig_predictions_already_done`B. One could ask if the terms "training" or "learning" are very good in this context. But one could essentially just argue that the data provided for reference is available and hence "learned". The algorithm is clearly a machine learning algorithm because we only define the process. The respective outcomes will be fully dependent on the provided reference data.
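
The steps above can be sketched in a few lines of Python. This is an illustrative from-scratch version, not the implementation used by any library; the function name `knn_predict` and the example data are made up, and the Euclidean distance is assumed as the distance measure.

```python
from collections import Counter
import numpy as np

def knn_predict(X_known, y_known, x, k=3, task="classification"):
    """Predict the label/value of one data point x from its k nearest neighbors."""
    X_known = np.asarray(X_known, dtype=float)
    # Step 1: Euclidean distances from x to all known data points
    distances = np.sqrt(((X_known - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_labels = [y_known[i] for i in nearest]
    if task == "classification":
        # Step 2a: majority vote among the k neighbors
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Step 2b: mean of the k neighbors' values
    return float(np.mean(neighbor_labels))

# Made-up 2D example data
X_known = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
labels = ["positive", "positive", "positive", "spam", "spam", "spam"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(X_known, labels, [2, 2], k=3))
print(knn_predict(X_known, values, [6.5, 6.5], k=3, task="regression"))
```

Note that there is no separate training step in this sketch: the "learned" model is nothing but the stored reference data.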

```{figure} ../images/fig_knn_algorithm.png
:name: fig_knn_algorithm

k-nearest neighbors is a rather intuitive algorithm. It is fully distance based and relies on finding the $k$ nearest neighbors within the reference data. Out of those data points the final prediction is generated, either by majority vote (classification) or by averaging (regression).
```

---

#### Pros, Cons, Caveats
Conceptually, the k-nearest neighbors algorithm is rather simple and intuitive. However, there are a few important aspects to consider when applying this algorithm.

First of all, k-nearest neighbors is a distance-based algorithm. This means that we have to ensure that closer really means "more similar", which is not as simple as it may sound. We have to decide on a *distance metric*, that is, the measure (or function) by which we calculate the distance between data points. We can use common metrics like the Euclidean distance, but there are many different options to choose from.
Even more critical is the proper *scaling* of our features. Just think of an example. We want to predict the shoe size of a person from the person's height (measured in $m$) and weight (measured in $kg$). This means that we have two features here, height and weight. For a prediction on a new person we simply need their height and weight. Then k-NN will compare those values to all known ("learned") data points in our model and find the closest $k$ other people. If we now use the Euclidean distance, the distance $d$ will simply be

$$ d = \sqrt{(w_1 - w_2)^2 + (h_1 - h_2)^2} $$

where $w_1, w_2$ and $h_1, h_2$ are the weights and heights of person 1 and person 2.

Try to answer the following question: What is the problem here?

...?

OK. The issue here is that the weights are in kilograms ($kg$), so we are talking about values like 50, 60, 80, 100. The height, however, is measured in meters ($m$), such that its values are many times smaller. As a result, two people differing by one meter in height (which is a lot) count as no further apart than two people differing by one kilogram in weight (which is close to nothing). Clearly not what we intuitively mean by "nearest neighbors"!
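
A quick numerical check makes this concrete. The three persons below are made-up examples, and `euclidean_distance` is a small helper written only for this illustration.

```python
import math

def euclidean_distance(person_1, person_2):
    """Euclidean distance between two (weight_kg, height_m) tuples."""
    (w1, h1), (w2, h2) = person_1, person_2
    return math.sqrt((w1 - w2) ** 2 + (h1 - h2) ** 2)

# Made-up example persons: (weight in kg, height in m)
reference = (70.0, 1.70)
same_weight_much_taller = (70.0, 2.10)   # 40 cm taller, same weight
same_height_bit_heavier = (72.0, 1.70)   # same height, 2 kg heavier

d_height = euclidean_distance(reference, same_weight_much_taller)
d_weight = euclidean_distance(reference, same_height_bit_heavier)
print(f"{d_height:.2f} vs {d_weight:.2f}")
```

Without scaling, the person who is 40 cm taller ends up "closer" to the reference than the person who is just 2 kg heavier.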

The solution to this is a proper **scaling** of our data. Often, we will simply apply one of the following two scaling methods:
1. MinMax Scaling - this means we linearly rescale our data such that the lowest occurring value becomes 0 and the highest value becomes 1.
2. Standard Scaling - here we rescale our data such that the mean value will be 0 and the standard deviation will be 1.

Both methods might give you values that look awkward at first. Standard scaling, for instance, gives both positive and negative values so that our height values in the example could be -1.04 or +0.27. But don't worry, the scaling is really only meant to be used for the machine learning algorithm itself.
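
Both scaling methods are available in Scikit-Learn as `MinMaxScaler` and `StandardScaler`. The following is a minimal sketch on made-up height and weight values, simply to show what the scaled numbers look like.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: columns are [height in m, weight in kg]
X = np.array([[1.60, 55.0], [1.70, 70.0], [1.80, 80.0], [1.90, 95.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column rescaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(X_minmax.round(2))
print(X_standard.round(2))
```

Each column (feature) is scaled separately, so height and weight end up on comparable ranges. In a real workflow one would typically fit the scaler on the training data only and then apply it to the test data with `transform`.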

Once we have scaled our data, and maybe also picked the right distance metric (or used a good default, which will do for a start), we are technically ready to apply k-NN.

But there are still some questions we need to consider.

The obvious one is: What should we use as $k$?  
This is the model's main parameter and we are free to choose any value we like. There is no simple best choice that always works. In practice, the choice of $k$ will depend on the number of data points we have, but also on the distribution of the data and the number of classes or parameter ranges. We usually want to pick odd values here to avoid draws as much as possible (imagine two nearest neighbors are "spam" and two are "no-spam"). But whether 3, 5, 7, or 13 is the best choice will depend on our specific task at hand.


In machine learning we call such a thing a **fitting parameter**, more commonly known as a **hyperparameter**: a value that is not learned from the data but set by us. We are free to change its value, and it might have a considerable impact on the quality of our predictions, or our "model performance". Ideally we would compare several different models with different parameters and pick the one that performs best.
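
Such a comparison could be sketched as follows. The dataset is synthetic (generated with Scikit-Learn's `make_classification` as a stand-in), and the list of candidate values for $k$ is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in dataset (not from this course)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for k in [1, 3, 5, 7, 13]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # score() returns the fraction of correctly predicted labels (accuracy)
    scores[k] = model.score(X_test, y_test)

for k, accuracy in scores.items():
    print(f"k={k:2d}: accuracy={accuracy:.2f}")
```

One caveat about this sketch: if we pick $k$ based on the test scores, the test set is no longer fully "unseen", so in practice a separate validation split is often used for this comparison.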

Let's consider a situation as in {numref}`fig_knn_caveats`A. Here we see that a change in $k$ can lead to entirely different predictions for certain data points. In general, kNN predictions can be highly unstable close to border regions, and they also tend to be highly sensitive to the local density of data points. The latter can be a problem if we have far more points of one category than of another.

```{figure} ../images/fig_knn_caveats.png
:name: fig_knn_caveats

k-nearest neighbors has a few important caveats. **A**: its predictions can change with changing $k$, and are generally very sensitive to density. **B**: it suffers (like many machine learning models) from overconfidence, which simply means that it will confidently output predictions even for data points that are entirely different from the training data (or even physically impossible).
```

Finally, another common problem with kNN, but also with many other models, is called **over-confidence** ({numref}`fig_knn_caveats`B). The algorithm described here bases its predictions on the $k$ closest neighbors. But for very unusual or even entirely impossible inputs, the algorithm will still find $k$ closest neighbors and make a prediction. So if you ask for the shoe size of a person of 6.20 m and 840 kg, your model might confidently answer your question and say: 48 (if nothing bigger occurred in the data). So much for the "intelligent" in *artificial intelligence* ...
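
This behavior is easy to reproduce. The sketch below uses Scikit-Learn's `KNeighborsRegressor` on a handful of made-up training points; scaling is skipped here for brevity, and $k = 1$ is chosen only to keep the example easy to follow.

```python
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: [height in m, weight in kg] -> EU shoe size
X_train = [[1.60, 55], [1.68, 62], [1.75, 70], [1.82, 80], [1.90, 92], [1.98, 105]]
y_train = [37, 39, 41, 43, 46, 48]

model = KNeighborsRegressor(n_neighbors=1)
model.fit(X_train, y_train)

# A physically impossible "person": 6.20 m tall and 840 kg
impossible_person = [[6.20, 840]]
prediction = model.predict(impossible_person)[0]
print(prediction)
```

The model returns the shoe size of the largest person in the training data without any warning that the input is far outside anything it has seen.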

In summary, k-NN has a number of Pros and Cons:

**Pros**  
- Can be used for classification and regression
- Very intuitive, which also means that the predictions are easy to explain!
- No overfitting (we will soon see what this is)
- Does not make impossible predictions (because it only takes values from the training data)

**Cons**  
- Predictions are sensitive to local density of data points and the choice of $k$
- Can suffer from over-confidence.
- Does not scale well for very large datasets (computing all distances can take very long)

## 

- k-nearest neighbors
- linear regression
- logistic regression
- decision trees
- random forests

```python
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sb
```


### Want to read more on this?

- https://github.com/GeostatsGuy/PythonNumericalDemos/tree/master
- https://inferentialthinking.com/chapters/intro.html#