# Introduction to Machine Learning

## Organization

Introduction

How we connect

## Setup

Different IDEs

environment - libmamba

Download Miniconda **[here](https://www.anaconda.com/docs/getting-started/miniconda/install#quickstart-install-instructions)**

* conda create --name FUKI python=3.10

* conda activate FUKI

* conda install jupyterlab

* conda install tensorflow

In [None]:
### * conda install conda-forge::tensorflow

* conda install anaconda::scikit-learn

* conda install statsmodels

Test computer - GPU support ?

## Python Introduction

Python 101

What is machine learning?

What is Keras?

Keras was created by **[François Chollet](https://en.wikipedia.org/wiki/François_Chollet)** as a high-level API on top of the low-level backend libraries Theano, CNTK and TensorFlow. It was later integrated into TensorFlow and now serves as an easy entry point to the TensorFlow library.

Installing Keras

In [None]:
import os

# Set environment variable OMP_NUM_THREADS
os.environ["OMP_NUM_THREADS"] = "1"

# Introduction and overview of machine learning

**Machine learning** is a branch of artificial intelligence in which information is obtained by using algorithms to analyze data. For example, speech recognition is an algorithmic conversion of acoustic information into text. But how can information be obtained using algorithms and what different approaches are used in machine learning? We want to take a closer look at this in this introductory session.

In this course we aim to provide you with the mathematical knowledge, methods and programming skills you need to conduct your own machine learning projects. Let's start by familiarizing ourselves with some of the fundamental concepts of machine learning, so that we can later put the methods we have already learned into a broader context by applying them to solve problems. The various machine learning algorithms will be briefly characterized here and explained in more detail in the following sessions.

Let's start by asking ourselves some questions ...

## What is machine learning?

**[Machine Learning](https://en.wikipedia.org/wiki/Machine_learning)** (ML) is a branch of artificial intelligence that gives computers the ability to learn from experience and automate tasks without having to be explicitly programmed. The core idea behind ML is that computers can analyze data, recognize patterns and make predictions or decisions based on them.

In contrast to previous approaches, which were based on expert and rule-based systems and were heavily dependent on domain knowledge, machine learning aims to extract generic regularities from high-dimensional data sets. This is achieved by training different models (algorithms) with suitable data and adjusting the **model-specific parameters** to achieve the highest possible **prediction accuracy** measured against an **evaluation metric**.

## How does the machine learning process work?

We have already seen some examples of the machine learning process during the course, but we will summarize these points again to illustrate the system. 

In the following diagram, you can see a schematic overview of the individual steps of the machine learning process:

<img src="./images/ML_scheme.png" alt="drawing" width="80%"/>

## Machine learning process

Let's take a look at the individual steps of the **machine learning process** using an example from **polynomial regression**.

- **Data collection:**

Initial data is required for the training of **machine learning algorithms**. Which data must be collected, and how much, depends on the problem on the one hand and on domain knowledge on the other, which can be used to exclude irrelevant features in advance. It should be mentioned that machine learning, in contrast to human pattern recognition, can achieve good results with high-dimensional data (many features), so a hasty exclusion of features can in some cases lead to poorer predictions. However, the collection of data can be time-consuming and cost-intensive, so a middle way often makes sense.

To get an impression of the general procedures, let's look at an example data set. We generate $349$ data points from a **[continuous uniform distribution](https://en.wikipedia.org/wiki/Continuous_uniform_distribution)** in the interval $[-5 , 5]$ using the `NumPy` function `random.uniform()` for the function $f(x) = x^3$ and add random noise with the function `random.normal()`. In addition, we insert one `NaN` value at the beginning of the `x` and `y` arrays to practise dealing with missing data, which gives $350$ rows in total, and create a `Pandas` dataframe `data_df` from the result:

In [None]:
import numpy as np
import pandas as pd

seed = 124
np.random.seed(seed)

# Number of data points
n = 350

# Generate n - 1 random x values in the range from -5 to 5
x = np.random.uniform(-5, 5, n - 1)

# Generate the corresponding y-values with the function x^3 and random noise
y = x**3 + 20 * np.random.normal(0, 0.125, n - 1)

# Add a NaN value at the beginning of the x and y arrays
x = np.insert(x, 0, np.nan)
y = np.insert(y, 0, np.nan)

data_df = pd.DataFrame({"x": x, "y": y})

data_df

As we can see, the data set contains missing values, which are shown as `NaN` values. The handling of missing or incorrect data is dealt with in the next step during data cleansing.

- **Data cleansing:**

The purpose of data cleansing is to check the validity of the training data. Missing entries, incorrect entries, outliers and duplicates must either be removed or replaced by **[imputation methods](https://en.wikipedia.org/wiki/Imputation_(statistics))**.
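As an illustration of the two options named above, the following minimal sketch contrasts removing rows with a simple mean imputation. The dataframe `toy_df` and its values are hypothetical and are not part of the course data set; mean imputation is only one of many possible imputation methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "y": [2.0, np.nan, 6.0, 8.0]})

# Option 1: drop every row that contains at least one NaN
dropped = toy_df.dropna()

# Option 2: mean imputation, i.e. replace each NaN by the mean of its column
imputed = toy_df.fillna(toy_df.mean())

print(dropped)
print(imputed)
```

Dropping rows is the simplest choice but discards information, whereas imputation keeps all rows at the price of introducing estimated values.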

In the case of our example data set, we remove the missing data using the `dropna()` method:

In [None]:
data_df_clean = data_df.copy()
data_df_clean = data_df_clean.dropna()
data_df_clean

In the next step, we turn to data exploration.

- **Data exploration:** 

Statistical analysis and visualization of data in the context of **[exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA)** make it possible to identify patterns in the data in advance and to examine relationships between individual features. In this way, potential problems in the evaluation can be recognized at an early stage and preliminary work for feature engineering can be carried out.

Let's carry out an exploratory data analysis for the sample data as an example:

First, we assign the dependent variable or **target variable** and the independent variable or **feature variable**. In our example, these simply correspond to the $y$ and $X$ values respectively. In other problems, however, it may be necessary to consider many features for the prediction of the target variable.

In [None]:
X = data_df_clean.x
y = data_df_clean.y

In a first step, we can use the methods `info()` and `describe()` to calculate the characteristic statistical parameters of the data. `info()` lists the data types and the number of non-null entries in the data set, while `describe()` returns the count, mean value, standard deviation, minimum, first, second and third quartile and the maximum of the data.

In [None]:
print(data_df_clean.info())
print(data_df_clean.describe())

We can conclude from this that both columns contain the same number of numerical entries of the `float64` data type. The mean value, the standard deviation, the minimum, the first, second and third quartile as well as the maximum give us information about the central tendency, the dispersion and the shape of the distribution of the data. For example, the $y$ values have a significantly larger standard deviation than the $x$ values. The summary statistics alone do not reveal the form of the relationship between $x$ and $y$, however, which is why we look at the data graphically next.

When exploring data, it can also be useful to visualize it in order to gain an overview of its distribution. We present three common EDA visualization methods below.


The representation as **[scatter plot](https://en.wikipedia.org/wiki/Scatter_plot)**  enables an assessment of the dependency structure in the data by plotting the values of the data set in pairs as points:

In [None]:
import matplotlib.pyplot as plt

_ = plt.scatter(X, y, s=5)

Another frequently used form of representation is the **[histogram](https://en.wikipedia.org/wiki/Histogram)**, which provides an overview of the frequency distribution of the data. The data is divided into classes (*bins*), and the height of each bar corresponds to the number of data points in the respective class:

In [None]:
_ = plt.hist(data_df_clean.y, bins=50)

When displaying the data in a **[boxplot](https://en.wikipedia.org/wiki/Box_plot)**, the median (orange line by default in `Matplotlib`), the first and third quartiles (lower and upper edge of the box) and the "antennae" or whiskers are plotted. The whiskers extend to the most extreme data points that lie within $1.5$ times the interquartile range from the box. Points outside the whiskers are plotted as individual circles:

In [None]:
_ = plt.boxplot([data_df_clean.x, data_df_clean.y])
# Label the two boxes (drawn at positions 1 and 2 by default)
_ = plt.xticks([1, 2], ["$X$ values", "$y$ values"])

More detailed descriptions of the use of diagrams for exploratory data analysis can be found in <cite id="4kev9"><a href="#zotero%7C16738657%2FWPIDC5X6">(Bruce et al., 2021)</a></cite> (pp. 1-47).

- **Train-Test-Split:**

As already discussed in another section, it makes sense to split the available data into a training, validation and test data set in order to assess how well the model generalizes. While the model is trained on the training data, the hyperparameters are tuned by evaluating the predictions on the validation data set. The final evaluation of the model is then carried out using the test data. This splitting ensures a realistic estimation of the model's performance on new, previously unseen data.

We can use the `train_test_split()` function in `scikit-learn` to split the data into a training set and a held-out set. To create an additional validation data set, we can use `train_test_split()` one more time to split the held-out data into validation and test data.

In [None]:
from sklearn.model_selection import train_test_split


# Split data into training and test sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=21
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# test_size=0.3 followed by a 50/50 split gives 70 % / 15 % / 15 %
plt.scatter(X_val, y_val, label=r"15 % validation data", s=5, color="red")
plt.scatter(X_test, y_test, label=r"15 % test data", s=5, color="blue")
plt.scatter(
    X_train, y_train, label=r"70 % training data", s=5, alpha=0.25, color="green"
)
_ = plt.legend()

- **Model selection:** 

The choice of model used depends on the problem being addressed (e.g. **classification**, **regression** or **clustering**) as well as on factors such as the size of the data set and the specific advantages and disadvantages of the models, which we will discuss in more detail <cite id="g1tw5"><a href="#zotero%7C16738657%2FP59K4ZW6">(Géron, 2020)</a></cite> (pp. 31-33). Other factors include the type of data used (**categorical**, **nominal**, **ordinal**, **numeric**), **over** or **underfitting**, the **hyperparameters** used in the model and, finally, the **interpretability of the model**.

Based on the **data exploration**, we assume in our example that a polynomial function describes the distribution of the data points.

We therefore use the following approach:

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n $$

where $a_0,a_1,\cdots,a_n$ are the coefficients and $n$ is the order of the polynomial.

- **Feature Engineering** 

**Feature engineering** is about creating new features and transforming existing data to enable better model prediction results. Examples include scaling data, creating polynomial features or merging highly correlated features for dimensionality reduction. An important application of feature engineering is, for example, the **one-hot encoding** of categorical data.
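One-hot encoding can be sketched with the `Pandas` function `get_dummies()`. The column `color` and its values are hypothetical and serve only as an illustration; `scikit-learn` offers `OneHotEncoder` as an alternative.

```python
import pandas as pd

# Hypothetical categorical feature with three categories
colors_df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(colors_df, columns=["color"], dtype=int)

print(encoded)
```

Each row contains exactly one $1$, which marks the category of that row.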

With reference to our example, we add the powers of the $x$ values up to degree $n$ (i.e. $x^0, x^1, \cdots , x^n$) using the class `PolynomialFeatures` and the argument `degree=n`. If there are several independent variables, for example $[a,b]$, the products of the features are also taken into account, which gives $[1,a,b,a^2,ab,b^2]$ for `degree=2`.

In [None]:
# Import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features for training data
poly_features = PolynomialFeatures(degree=5)
X_poly_train = poly_features.fit_transform(X_train.values.reshape(-1, 1))

In [None]:
# Create polynomial regression model
poly_mod = LinearRegression()
poly_mod.fit(X_poly_train, y_train)

We create a regression curve with predictions of our model `poly_mod` for $1000$ evenly spaced points in the interval $[-5 , 5]$ and plot it together with the original data:

In [None]:
# Create x-values for regression line
x_reg = np.linspace(-5, 5, 1000)

# Create x_reg_poly with the transformer that was fitted on the training data
x_reg_poly = poly_features.transform(x_reg.reshape(-1, 1))

# Make predictions for x_reg
pred_reg = poly_mod.predict(x_reg_poly)

plt.plot(
    x_reg,
    pred_reg,
    color="red",
    linestyle="dashed",
    linewidth=1,
    label="Regression line",
)
plt.legend()
plt.title("Polynomial Regression")

_ = plt.scatter(X, y, s=5)

- **Model evaluation:**

In this step, the quality of the model's predictions is checked. Depending on the problem and the model used, there are different metrics that can be applied. Examples include the **[RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation)**, the **[MSE](https://en.wikipedia.org/wiki/Mean_squared_error)**, the **[coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)** and the **[accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)**.

We will discuss the different metrics for different machine learning algorithms in detail later. For the model evaluation in our example, we opt for the **MSE** to evaluate the predictions for the validation data `X_val` of our model and the actual values `y_val`:

In [None]:
from sklearn.metrics import mean_squared_error

# Create PolynomialFeatures (degree=5) and fit them on the training data
poly_features = PolynomialFeatures(degree=5)
X_poly_train = poly_features.fit_transform(X_train.values.reshape(-1, 1))

# Apply the fitted transformation to the validation data
x_val_poly = poly_features.transform(X_val.values.reshape(-1, 1))

# Create polynomial regression model
poly_mod = LinearRegression()
poly_mod.fit(X_poly_train, y_train)

y_preds = poly_mod.predict(x_val_poly)

mse = mean_squared_error(y_val, y_preds)

print(f"Mean Squared Error (MSE) for polynomial of order 5: {mse:.2f}")

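We can output the parameters of the model with the attributes `intercept_` (the constant term $a_0$, also called intercept or bias) and `coef_` (the coefficients of the polynomial). Note that `PolynomialFeatures` also generates the constant feature $x^0 = 1$, so the first value in `coef_` belongs to this constant column. Because `LinearRegression` fits the intercept separately, this first value is $0$, and the coefficients $a_1, a_2, \cdots , a_n$ follow from the second entry onward.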

In [None]:
print("Polynomial of order 5")
print("Bias (Intercept):", poly_mod.intercept_)
print("Weights (coefficients):", poly_mod.coef_)

- **Hyperparameter tuning:**

Various methods such as **grid search**, **random search** and **cross-validation** are used to optimize **hyperparameters** <cite id="ticvo"><a href="#zotero%7C16738657%2FP59K4ZW6">(Géron, 2020)</a></cite> (pp. 78-82), i.e. parameters that are set before the model is trained rather than learned from the data. With **grid search** the parameter space is scanned systematically, while with **random search** parameter combinations are sampled at random from given value ranges. The different combinations of parameters are evaluated, for example by minimizing a **loss** function, and the best combination is selected. **Cross-validation** uses changing parts of the training data set to validate the model, so that the test data set is only needed for the final evaluation.
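As a preview, a grid search combined with cross-validation could look like the following minimal sketch. The synthetic data (`x_demo`, `y_demo`), the noise level and the pipeline step names `poly` and `reg` are assumptions chosen for this illustration and are independent of the course example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: cubic function with Gaussian noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(-5, 5, 300).reshape(-1, 1)
y_demo = x_demo.ravel() ** 3 + rng.normal(0, 2.5, 300)

# Pipeline: polynomial features followed by linear regression
pipe = Pipeline([("poly", PolynomialFeatures()), ("reg", LinearRegression())])

# Grid of candidate values for the hyperparameter "degree"
param_grid = {"poly__degree": [1, 2, 3, 4, 5]}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated MSE
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(x_demo, y_demo)

best_degree = search.best_params_["poly__degree"]
print("Best degree:", best_degree)
print("Cross-validated MSE:", -search.best_score_)
```

Wrapping the feature transformation and the model in a `Pipeline` ensures that the polynomial features are fitted only on the training folds in each round.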

We will discuss `GridSearchCV` and its possible uses for hyperparameter tuning later in this chapter. For our example, we use a `for` loop to evaluate the model for different values of the hyperparameter, which in this case corresponds to the order of the polynomial features.

In [None]:
# Create list for the individual models, 0 serves as placeholder
modelle = [0]

for i in range(1, 6):
    # Create PolynomialFeatures (degree=i)
    poly_features = PolynomialFeatures(degree=i)

    # Create polynomial features for training data (degree=i)
    X_poly_train = poly_features.fit_transform(X_train.values.reshape(-1, 1))

    # Apply the fitted transformation to the validation data
    X_val_poly = poly_features.transform(X_val.values.reshape(-1, 1))

    # Create polynomial regression model
    poly_mod = LinearRegression()
    poly_mod.fit(X_poly_train, y_train)

    y_preds = poly_mod.predict(X_val_poly)

    mse = mean_squared_error(y_val, y_preds)

    # Add the model to the list
    modelle.append(poly_mod)

    print(f"Mean Squared Error (MSE) of the model of the order: {i} : {mse:.2f}")
    print("=" * 50)

We recognize that the best evaluated model in this case is the one with third-order polynomial features. However, since models three through five are close to each other in their evaluation metric, we compare their performance in making predictions on the test data.

- **Model validation:**

We evaluate the generalization capabilities of our models by making predictions on the previously unused test data and apply the same evaluation metric (**MSE**) as on the validation set:

In [None]:
for i in range(3, 6):
    # Create PolynomialFeatures (degree=i)
    poly_features = PolynomialFeatures(degree=i)

    # Create x_test_poly (degree=i)
    X_test_poly = poly_features.fit_transform(X_test.values.reshape(-1, 1))
    y_preds = modelle[i].predict(X_test_poly)

    mse = mean_squared_error(y_test, y_preds)

    print(f"Mean Squared Error (MSE) of the model of the order: {i} : {mse:.2f}")

Our evaluation metric yields values similar to those on the validation set and rates the third-order model best. If a model scores significantly better on the validation set than on the test set, this may indicate **[overfitting](https://en.wikipedia.org/wiki/Overfitting)**.

In the last step, we display the three models and the underlying function without noise ($f(x) = x^3$) in the range $[-20, 20]$:

In [None]:
X_pred = np.linspace(-20, 20, 1000)
colors = ["red", "blue", "yellow"]
labels = ["3rd order polynomial", "4th order polynomial", "5th order polynomial"]
for i in range(3, 6):
    # Create PolynomialFeatures (degree=i)
    poly_features = PolynomialFeatures(degree=i)

    # Create X_pred_poly (degree=i)
    X_pred_poly = poly_features.fit_transform(X_pred.reshape(-1, 1))
    y_preds = modelle[i].predict(X_pred_poly)
    plt.plot(X_pred, y_preds, color=colors[i - 3], label=labels[i - 3])
plt.plot(X_pred, X_pred**3, color="k", linestyle="--", label="Function without noise")
plt.grid()
_ = plt.legend()

## What types of machine learning are there?

There are three main types of machine learning <cite id="yklgc"><a href="#zotero%7C16738657%2FN6ILIT6K">(Matzka, 2021)</a></cite> (pp. 10-14), <cite id="fs80l"><a href="#zotero%7C16738657%2FGXMJ3Q3D">(Awad &#38; Khanna, 2015)</a></cite> (pp. 5-9):
**[supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)**, **[unsupervised learning](https://en.wikipedia.org/wiki/Machine_learning#Unsupervised_learning)** and **[reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)**. There are also hybrid forms such as semi-supervised learning or active learning, which will not be discussed in this course.

In **supervised learning**, input data and the corresponding output data (labels) are used, i.e. data for which the correct predictions are known, in order to train the model. The goal of supervised learning is to extrapolate generalized rules through training in order to make correct predictions on unknown data.

**Unsupervised learning** pursues the goal of extracting significant correlations from the data without the use of labels. The computer learns to independently identify patterns or structures in data.

The approach of **reinforcement learning** is to let the computer (agent) learn through interaction with an environment by trial and error, where each interaction leads to a new state of the environment that is evaluated by a reward function. In order not to go beyond the scope of this chapter, we cannot discuss reinforcement learning in detail here.

It should be mentioned that **[Deep Learning](https://en.wikipedia.org/wiki/Deep_learning)** based on **[artificial neural networks](https://en.wikipedia.org/wiki/Neural_network_(machine_learning))** can be regarded as another method of machine learning. We will look at neural networks and deep learning in more detail in the chapter Deep Learning with `Keras`.

The following figure shows a schematic breakdown of the different types of machine learning and their main applications.

<img src="./images/ML3_engl.png" alt="drawing" width="80%"/>

## Overview of evaluation metrics

Before we go into the individual forms of machine learning in more detail, this summary provides a brief overview of metrics that are used for different types of machine learning.

The importance of these evaluation metrics lies in the fact that by choosing metrics suitable for the problem under consideration, the performance of different models and hyperparameters can be assessed and improvements made to the model.

The metrics are categorized into **classification**, **regression** and **clustering** based on the applications in which they are used.

**Metrics for classification:**

<cite id="oo4dr"><a href="#zotero%7C16738657%2FN6ILIT6K">(Matzka, 2021)</a></cite> (pp. 140-156)

**[Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision)**: Proportion of correct predictions of class assignments, suitable for balanced class ratios.
    
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FN + FP} $$

In the binary case, with the positive class labeled $1$ and the negative class labeled $0$, $TP$ (true positives) are the correct assignments to the positive class, $TN$ (true negatives) are the correct assignments to the negative class, $FP$ (false positives) are the incorrect assignments to the positive class and $FN$ (false negatives) are the incorrect assignments to the negative class. $FP$ therefore corresponds to a type I error ($\alpha$ error) and $FN$ to a type II error ($\beta$ error).

**[Precision](https://en.wikipedia.org/wiki/Precision_and_recall)**:

Proportion of correctly positively predicted instances out of all positively predicted instances, helpful when it is important to avoid false positives ($FP$).
    
$$\text{precision} = \frac{TP}{TP + FP} $$

**[Recall (Sensitivity, True Positive Rate)](https://en.wikipedia.org/wiki/Precision_and_recall)**: 

Proportion of correctly predicted positive class assignments out of all actually positive cases, important to avoid missing positive class assignments.
    
$$\text{Recall} = \frac{TP}{TP + FN} $$

**[F1 score](https://en.wikipedia.org/wiki/Precision_and_recall#F-measure)**:

The harmonic mean between precision and recall, suitable for unbalanced classes.
    
$$\text{F-Score} = \frac{(\beta^2 +1)\cdot P \cdot R}{\beta^2 \cdot P + R} \rightarrow \text{F1-Score} = \frac{2\cdot P \cdot R}{ P + R} = \frac{2}{P^{-1} + R^{-1}}$$
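The formulas above can be checked with a small worked example. The label vectors `y_true` and `y_pred` are hypothetical, and the sketch assumes the common `scikit-learn` convention that the label $1$ denotes the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 is the positive class, 0 the negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Manual evaluation of the formulas
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# The scikit-learn helper functions should give the same values
print(
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
```

Here $TP = 3$, $TN = 5$, $FP = 1$ and $FN = 1$, so the accuracy is $0.8$, while precision, recall and F1 score are all $0.75$.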

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Create artificial data
X, y = make_classification(n_samples=100000, n_features=20, random_state=42)

# Divide the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_scores = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = roc_auc_score(y_test, y_scores)

# Create the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", lw=2, label="ROC curve (AUC = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="gray", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.legend(loc="lower right")
plt.show()

**[ROC (Receiver Operating Characteristic) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)**:

In the figure above, you can see the graphical representation of the trade-off between **True Positive Rate (Recall)** and the **[False Positive Rate](https://en.wikipedia.org/wiki/False_positive_rate)** of the model at different thresholds for the classification decision. The diagonal gray dashed line corresponds to a random classification with the same true positive rate and false positive rate.

The **false positive rate** = (1 - **[Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)**), as the following applies:

$$FPR + \text{Specificity} = \frac{FP}{(FP + TN)} + \frac{TN}{(FP + TN)} = \frac{(FP + TN)}{(FP + TN)}= 1 $$

$$FPR = 1 - \text{Specificity}$$

In other words, the more positive results are recorded, the more negative results are incorrectly classified as positive. Ideally, the classifier should have a high value for the recall and a low value for the **false positive rate**, which corresponds to a steep left-sided increase in **ROC**. In practice, the **ROC** can help to find a suitable threshold and assess the performance of a classification by selecting the tangent point of a tangent at $45^\circ$ angle to the **ROC curve** as the optimal threshold of the classification <cite id="c62fn"><a href="#zotero%7C16738657%2FWPIDC5X6">(Bruce et al., 2021)</a></cite> (pp. 232-236).
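One way to sketch the threshold selection is to maximize the difference between true positive rate and false positive rate (often called Youden's $J$ statistic), because a tangent with a $45^\circ$ angle touches the ROC curve at the point where this difference is largest. The data, the model settings and all variable names below are assumptions for illustration; the criterion weights both kinds of error equally, which is not appropriate for every application.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical binary classification data
X_roc, y_roc = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_roc, y_roc, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr_roc, tpr_roc, thr_roc = roc_curve(y_te, scores)

# Youden's J = TPR - FPR is largest where a 45-degree tangent touches the ROC curve
j_stat = tpr_roc - fpr_roc
best = np.argmax(j_stat)
best_threshold = thr_roc[best]

print(f"Threshold: {best_threshold:.3f}")
print(f"TPR: {tpr_roc[best]:.3f}, FPR: {fpr_roc[best]:.3f}")
```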

**[AUROC (Area Under ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)**:

The **AUROC** corresponds to the area under the **ROC curve**. An ideal classifier reaches a value of $1$ and a random classifier a value of $0.5$. Values below $0.5$ are possible, but swapping the predicted classes turns such a classifier into one with an **AUROC** above $0.5$, so $0.5$ is effectively the worst case. The **AUROC** can be used as a quality measure when comparing multiple models <cite id="dejcm"><a href="#zotero%7C16738657%2FWPIDC5X6">(Bruce et al., 2021)</a></cite> (p. 234).

**Metrics for regression:**

<cite id="ph4pt"><a href="#zotero%7C16738657%2FN6ILIT6K">(Matzka, 2021)</a></cite> (pp. 156-166)

**[Mean Squared Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error)**:

Average of the squared errors between the predicted and actual values, reacts sensitively to outliers as these are squared in the $MSE$.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat Y_i)^2 $$

**[Root Mean Squared Error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation)**:

Square root of the MSE. This makes it easier to interpret than the $MSE$, as it is available in the same units as the data.

$$\text{RMSE} = \sqrt{\text{MSE}} $$

**[Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error)**:

Average of the absolute errors between the predicted and actual values, robust against outliers.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |\hat Y_i - Y_i| $$

where $n$ is the number of predictions, $\hat Y_i$ is the value of the prediction and $Y_i$ is the observed value.

**[R-squared (coefficient of determination)](https://en.wikipedia.org/wiki/Coefficient_of_determination)**: Proportion of the explained variance in the model compared to the total variance of the dependent variable.

$$R^2 = \frac{\sum_i (\hat y_i - \bar y)^2 }{\sum_i (y_i - \bar y)^2} $$

In detail, $\hat y_i$ is the prediction of the regression model, $\bar y$ is the mean of the observed values and $y_i$ are the observed values. This form applies to least-squares fits with an intercept, for which it equals the more general definition $R^2 = 1 - \sum_i (y_i - \hat y_i)^2 / \sum_i (y_i - \bar y)^2$.
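The regression metrics can be evaluated by hand and compared with the corresponding `scikit-learn` functions. The values in `y_obs` and `y_hat` are hypothetical. For $R^2$, the sketch uses the general form $1 - \sum_i (y_i - \hat y_i)^2 / \sum_i (y_i - \bar y)^2$, which is what `r2_score` computes and which agrees with the explained-variance ratio for least-squares fits with an intercept.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed values and predictions
y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Manual evaluation of the formulas
mse_manual = np.mean((y_obs - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)
mae_manual = np.mean(np.abs(y_hat - y_obs))
r2_manual = 1 - np.sum((y_obs - y_hat) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

print(f"MSE:  {mse_manual:.3f}")
print(f"RMSE: {rmse_manual:.3f}")
print(f"MAE:  {mae_manual:.3f}")
print(f"R^2:  {r2_manual:.3f}")

# Comparison with scikit-learn
print(mean_squared_error(y_obs, y_hat), mean_absolute_error(y_obs, y_hat))
print(r2_score(y_obs, y_hat))
```

The single large error of $1$ contributes two thirds of the MSE ($0.375$) but only half of the MAE ($0.5$), which illustrates the stronger sensitivity of the MSE to outliers.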

**Metrics for clustering:**

**[Silhouette Score](https://en.wikipedia.org/wiki/Silhouette_(clustering))**:

Measure of cohesion within a cluster and how well it is separated from other clusters.

$$
S(o)=
\begin{cases}
  0, & \text{if } o \text{ is the only element in } A \\
  \frac{dist(B,o)-dist(A,o)}{\max[dist(A,o) , dist(B,o)]}, & \text{otherwise}
\end{cases}
$$

where $A$ is the cluster that contains the object $o$, $B$ is the nearest neighboring cluster, $dist(A,o)$ is the average distance from $o$ to all other objects in $A$ and $dist(B,o)$ is the average distance from $o$ to all objects in $B$.

The **silhouette coefficient** of a cluster $C$ is then calculated as the arithmetic mean of $S(o)$ over the $n_C$ objects in $C$:

$$s_C = \frac{1}{n_C} \sum_{o \in C} S(o) $$
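In `scikit-learn`, the individual values $S(o)$ are available through `silhouette_samples()` and their overall mean through `silhouette_score()`. The following sketch uses three hypothetical, well-separated point clouds and the $k$-means algorithm as an example clustering; the cluster centers and all variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: three well-separated point clouds
points, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# S(o) for every object and the mean over all objects
s_per_object = silhouette_samples(points, cluster_labels)
s_mean = silhouette_score(points, cluster_labels)

# Silhouette coefficient s_C of each individual cluster
for c in range(3):
    s_c = s_per_object[cluster_labels == c].mean()
    print(f"Cluster {c}: s_C = {s_c:.3f}")

print(f"Mean over all objects: {s_mean:.3f}")
```

Values close to $1$ indicate compact, well-separated clusters, while values near $0$ or below point to overlapping clusters or wrongly assigned objects.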

## $k$-fold cross-validation

The **[cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))** <cite id="m95lo"><a href="#zotero%7C16738657%2FQRTMKI7W">(Richter, 2019)</a></cite> (pp. 19-20) is a validation method in which the data is first split into a training and a test set. The training set is then divided into $k$ subsets $T_1,...,T_k$ of the same size. In each of $k$ rounds, a different subset $T_i$ is used for validation, while the remaining $k-1$ subsets are used for training. Finally, the average of the $k$ validation results is calculated.

The following figure shows a schematic structure of **cross-validation** for $k=3$.

<img src="./images/cross2_engl.png" alt="drawing" width="60%"/>

The $k$-fold **cross-validation** method offers the advantage of a more reliable model evaluation, as the model is trained and validated several times on different combinations of the training data. This provides a more realistic estimate of model performance and makes **overfitting** easier to detect. Cross-validation is therefore helpful when deciding which model or hyperparameter setting is best.

In addition, cross-validation allows for more efficient use of data in situations where the amount of data is limited, as all data in the training set can be used for both training and evaluation.
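The procedure for $k=3$ can be sketched with the `scikit-learn` class `KFold`. The synthetic linear data (`X_cv`, `y_cv`) and the choice of a linear regression model are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical training data: linear relationship with Gaussian noise
rng = np.random.default_rng(1)
X_cv = rng.uniform(-5, 5, (90, 1))
y_cv = 2.0 * X_cv.ravel() + rng.normal(0, 1.0, 90)

# Divide the training data into k = 3 subsets of equal size
kf = KFold(n_splits=3, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X_cv):
    # Train on k - 1 subsets, validate on the remaining subset
    model_cv = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    preds = model_cv.predict(X_cv[val_idx])
    fold_mse.append(mean_squared_error(y_cv[val_idx], preds))

# Average of the k validation results
cv_mse = np.mean(fold_mse)

print("MSE per fold:", np.round(fold_mse, 3))
print(f"Mean MSE: {cv_mse:.3f}")
```

The helper function `cross_val_score()` performs the same loop in a single call.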

## Supervised learning

In this section, we look at supervised learning. We have already discussed some of the methods used in supervised learning, such as linear, multiple and logistic regression, in the chapter Machine Learning with `scikit-learn`. At this point, we would like to discuss the $K$-nearest neighbor model (KNN) as a further method of supervised learning, discuss general possibilities for optimizing models such as **GridSearch** and **Cross-validation** and explain the idea behind supervised learning in more detail <cite id="nzd29"><a href="#zotero%7C16738657%2FN6ILIT6K">(Matzka, 2021)</a></cite> (pp. 99-166).

Characteristically, supervised learning involves passing labeled data so that the model is able to make predictions based on generalization of the relationships previously observed from this training dataset. During the training process, it is necessary to evaluate (supervise) the predictions of the model to ensure that it generalizes well to new data.

The vector of input values $X$ is called the **feature vector** or **input**, the individual components are called **features**. The output vector ($Y$) is called the **target vector**.

The basic problem in supervised learning is to find a relationship between $X$ and $Y$ in the form $$f(X) = Y $$.

$X$ is usually a real-valued vector, i.e. $X \in \mathbb{R}^d$. For $Y$, either $Y \in \mathbb{R}$ or $Y \in \{1,\cdots,K\}$ with $K \in \mathbb{N}$ applies.

<img src="./images/supervised_engl2.png" alt="drawing" width="80%"/>

Supervised learning is mainly used for two tasks:

If $Y \in \mathbb{R}$, we speak of regression problems, and if $Y \in \{1,\cdots,K\}$, of classification problems.

<img src="./images/reg_vs_class.png" alt="drawing" width="60%"/>

Representatives of models with supervised learning include the following:

**Linear models:**

* *Linear regression*

**Linear regression** is a method for predicting a continuous target variable based on one or more input variables. It models the relationship between the inputs and the target variable as a linear function.
        
* *Polynomial regression*

**Polynomial regression** is an extension of linear regression in which a polynomial function is used to model the dependent variable in relation to the independent variable.
        
* *Logistic regression*

In contrast to linear regression, **logistic regression** is used when the target variable is binary (for example, the classes $0$ and $1$). It models the probability of a particular class occurring and uses the sigmoid function, also known as the logistic function, to transform the continuous predictions of the linear model to the range between $0$ and $1$.
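The sigmoid transformation mentioned above, $\sigma(z) = 1 / (1 + e^{-z})$, can be sketched in a few lines of NumPy. The function name `sigmoid` and the sample values are illustrative assumptions and are not taken from the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of a linear model (any real value is possible)
linear_outputs = np.array([-4.0, 0.0, 4.0])

# Transformed values can be read as probabilities of class 1
probabilities = sigmoid(linear_outputs)
print(probabilities)
```

A linear output of $0$ is mapped to a probability of $0.5$, which is a common decision threshold between the two classes.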

**$K$ Nearest Neighbors (KNN):**

* *Classification with KNN*
        
* *Regression with KNN*

**KNN** is a method of classification or regression in which a point is classified or predicted based on the $K$ nearest neighbors in the feature space. The similarity (distance) between the data points is used.

**Tree-based models for classification:**

* *Decision trees*

A **decision tree** is a hierarchical model that represents a decision in the form of a tree. It helps to make classifications or predictions by making a series of decisions based on features of the data.
        
* *Random Forests*

**Random Forest** is an ensemble method that combines multiple decision trees. Each tree is trained on a random sample of the data and features. The predictions of the trees are combined (by majority vote for classification, by averaging for regression) to obtain a robust prediction.
        
**Decision tree-based algorithms for regression:**

* *Random forest regression*

**Random forests** can also be used for regression. The numerical predictions of the individual decision trees are averaged to give the prediction of the ensemble.
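The averaging over the individual trees can be checked directly. The following is a minimal sketch that assumes `scikit-learn` is installed; the synthetic data, the number of trees and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

X_new = np.array([[2.5], [7.0]])

# Prediction of every single tree, shape (n_trees, n_samples)
tree_predictions = np.array([tree.predict(X_new) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest prediction
manual_average = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(X_new)
print(manual_average, forest_prediction)
```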

**Support Vector Machines (SVM)**:

* *Classification with SVM*
        
* *Regression with SVM*

**SVM** is a classification and regression method. For classification, it uses a separating hyperplane to divide the data into different classes and aims to achieve a maximum margin between the classes.

In the following chapters we will deal with the models: **Decision Trees**, **Random Forest** and **Support Vector Machines**, while in this chapter we will first deal with the **$K$-nearest neighbor algorithm**.

### $K$-nearest neighbors (KNN)

The general idea of the $K$-nearest-neighbors algorithm is based on the following steps:

* Find $K$ observations with similar features in terms of their predictors, where the decision on similarity is made based on distance metrics.

* For classification: Determine which category or class is predominant among the similar observations and assign this category to the new observation.

* For regression: Determine the average value of the similar observations and assign this to the new observation <cite id="e45nb"><a href="#zotero%7C16738657%2FWPIDC5X6">(Bruce et al., 2021)</a></cite> (pp. 248-251).

Note that the way **similarity** is measured (distance measure), the number of **nearest neighbors** ($K$) and the way the features are **scaled** all influence the prediction results.
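The steps above can be sketched from scratch with NumPy. This is a simplified illustration under stated assumptions: the function `knn_predict`, its parameters and the toy data are hypothetical, the Euclidean distance is used as the similarity measure, and no feature scaling is applied.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target of x_new from its k nearest training points."""
    # Euclidean distance from the new observation to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Toy data (made up for illustration): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

x_new = np.array([1.8, 1.6])
predicted_class = knn_predict(X_train, y_class, x_new, k=3)
predicted_value = knn_predict(X_train, y_value, x_new, k=3, task="regression")
print(predicted_class, predicted_value)
```

The new point lies in the first group, so its three nearest neighbors all carry the class `0` and the target values `1.0`, `2.0` and `3.0`.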

### Metrics

As mentioned, the $K$-nearest neighbor algorithm uses metrics to measure the similarity of observations. In general, a function $d(x,y)$ defined on a set of elements is called a **[metric](https://en.wikipedia.org/wiki/Metric_space)** if it fulfills the following properties <cite id="3gftp"><a href="#zotero%7C16738657%2FE3KUYES4">(Lang &#38; Pucker, 2016)</a></cite> (p. 354):

* Non-negativity: $d(x, y) \ge 0 $

* Uniqueness: $d(x, y) = 0$ if and only if $x = y$

* Symmetry: $d(x, y) = d(y, x)$

* **[Triangle inequality](https://en.wikipedia.org/wiki/Triangle_inequality)**: $d(x, z) \le d(x, y) + d(y, z)$
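The four properties can be checked numerically on a small set of points. The sketch below is only a spot check on sample values and not a proof; the helper `satisfies_metric_axioms` and the two candidate functions are hypothetical names chosen for illustration.

```python
import itertools

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the four metric properties on every triple of the given points."""
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) < 0:                         # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):          # d = 0 exactly when x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

def absolute_difference(x, y):
    return abs(x - y)

def squared_difference(x, y):
    return (x - y) ** 2

points = [-2.0, 0.0, 1.0, 2.0]
is_metric_abs = satisfies_metric_axioms(absolute_difference, points)
is_metric_sq = satisfies_metric_axioms(squared_difference, points)
print(is_metric_abs, is_metric_sq)
```

The squared difference fails the triangle inequality, for example for $x = 0$, $y = 1$, $z = 2$, where $d(x, z) = 4$ but $d(x, y) + d(y, z) = 2$.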

Closely linked to the concept of a metric is that of a **[norm](https://en.wikipedia.org/wiki/Euclidean_space#Euclidean_norm)**. A norm assigns to each element of a vector space a number that describes its size; for vectors, this corresponds to the **length** of the vector. Every norm induces a metric via $d(x, y) = ||x - y||$. A norm must generally fulfill the following properties:

* Non-negativity: $||x|| \ge 0$

* Uniqueness: $||x|| = 0$ if and only if $x = 0$

* Scaling: $||\lambda x|| = |\lambda| \, ||x||$, where $\lambda$ is a scalar

* Triangle inequality: $||x + y|| \le ||x|| + ||y||$

It can be shown that the scalar product can be used to define a norm <cite id="lin6r"><a href="#zotero%7C16738657%2FE3KUYES4">(Lang &#38; Pucker, 2016)</a></cite> (p. 357). Applied to vectors, the norm corresponds to the square root of the scalar product of a vector with itself:

$$||\vec x|| = \sqrt{ \vec x \cdot \vec x}$$
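As a small numerical illustration (the example vectors are chosen arbitrarily), the norm computed from the scalar product can be compared with NumPy's `np.linalg.norm`, and the scaling property and the triangle inequality can be checked for sample values:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, -2.0])

# Norm as the square root of the scalar product of x with itself
norm_from_dot = np.sqrt(np.dot(x, x))
norm_builtin = np.linalg.norm(x)
print(norm_from_dot, norm_builtin)  # both 5.0

# Scaling property for an example scalar
lam = -2.0
scaling_holds = np.isclose(np.linalg.norm(lam * x), abs(lam) * np.linalg.norm(x))

# Triangle inequality for the two example vectors
triangle_holds = np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
print(scaling_holds, triangle_holds)
```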

Two of the most common distance measures that fulfill these relations are the following <cite id="z3awk"><a href="#zotero%7C16738657%2FE3KUYES4">(Lang &#38; Pucker, 2016)</a></cite> (p. 356):

* Euclidean distance, which is defined as follows:

Let $\vec a$ and $\vec b$ be two vectors in a $d$-dimensional space, $\vec a,\vec b \in \mathbb{R}^d$. The Euclidean distance between them is:

$$d(\vec a,\vec b) = \sqrt{(a_1 - b_1)^2+(a_2 - b_2)^2+ \cdots + (a_d - b_d)^2} = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$

* The city block metric or **[Manhattan metric](https://en.wikipedia.org/wiki/Taxicab_geometry)** for which the distance between two vectors $\vec a$ and $\vec b$ is given by:

$$d(\vec a,\vec b) = |a_1 - b_1|+|a_2 - b_2|+ \cdots + |a_d - b_d| = \sum_{i=1}^{d} |a_i - b_i|$$
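Both distance measures can be written directly in NumPy. The function names and the example vectors below are illustrative assumptions.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared component differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_distance(a, b):
    """Sum of the absolute component differences."""
    return float(np.sum(np.abs(a - b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# The component differences are (3, 4, 0)
print(euclidean_distance(a, b))  # 5.0
print(manhattan_distance(a, b))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance, since it sums the individual detours along each coordinate axis.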

The terms **metric** and **norm** can also be defined much more generally (for example, for general vector spaces or function spaces). You can find out more about this in <cite id="9uwzo"><a href="#zotero%7C16738657%2FE3KUYES4">(Lang &#38; Pucker, 2016)</a></cite> (pp. 349-372) and <cite id="lgtci"><a href="#zotero%7C16738657%2FEPPNN3A6">(Lenze, 2020)</a></cite> (pp. 339-410).

### Scaling

#### $z$-Standardization

For methods that compare similarities based on distance measures, it is important to pay attention to the scaling of the data: a feature with a large numerical range would otherwise dominate the distance. For example, **[$z$-standardization](https://en.wikipedia.org/wiki/Standard_score)** can be used to bring the predictor variables onto a common scale <cite id="qt0dd"><a href="#zotero%7C16738657%2FWPIDC5X6">(Bruce et al., 2021)</a></cite> (pp. 254-257).

The $z$-standardization is given by:

$$ z = \frac{x-\bar x}{\sigma} $$

where $\bar x$ corresponds to the mean value of the data and $\sigma$ to the standard deviation.
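A minimal NumPy sketch of the formula, using made-up sample values. Note the assumption that `x.std()` computes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean and divide by the standard deviation
z = (x - x.mean()) / x.std()
print(z)

# After standardization the data has mean 0 and standard deviation 1
print(z.mean(), z.std())
```

For these values the mean is $5$ and the standard deviation is $2$, so the first value $2$ is mapped to $z = -1.5$.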

#### Min-max scaling (normalization)

Another common form of scaling is **[min-max scaling](https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization))**, also known as normalization. Here, the features under consideration are mapped to the interval $[0 , 1]$ by subtracting the minimum feature value from all values and dividing by the difference between the maximum and minimum feature value:

$$x^\prime = \frac{x - \min(x)}{\max(x) - \min(x)} $$

To map the values to an arbitrary interval $[a , b]$, you can write the min-max scaling in the following form:

$$x^\prime = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)} $$
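Both variants of the formula can be covered by one small function. This is a sketch with a hypothetical helper `min_max_scale`; it assumes that the maximum and minimum of the data differ, since otherwise the denominator would be zero.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Map the values of x linearly to the interval [a, b]."""
    x_min, x_max = x.min(), x.max()
    return a + (x - x_min) * (b - a) / (x_max - x_min)

x = np.array([10.0, 15.0, 20.0, 30.0])

scaled_unit = min_max_scale(x)               # interval [0, 1]
scaled_custom = min_max_scale(x, -1.0, 1.0)  # interval [-1, 1]
print(scaled_unit)    # [0.   0.25 0.5  1.  ]
print(scaled_custom)  # [-1.  -0.5  0.   1. ]
```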

In `scikit-learn`, various forms of scaling are contained in the `sklearn.preprocessing` module. For example, the $z$-standardization is implemented as `StandardScaler()` and the min-max scaling as `MinMaxScaler()`. The scaling is applied using the `fit_transform()` method.

To avoid potential **[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning))**, the `fit_transform()` method must **only** be applied to the **training data** when scaling. The **test data** is then scaled with the `transform()` method, which reuses the parameters determined on the training data.
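The following sketch shows this pattern with `StandardScaler`. The synthetic data, the split ratio and the random seeds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (assumption): 100 observations, 2 features
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training parameters for the test data (no new fit)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled.mean(axis=0))   # generally close to, but not exactly, 0
```

The scaled test data does not have a mean of exactly zero, because its scaling parameters come from the training data. This is intended: the test data must not influence any step of the model fitting.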

* Processing tabular data with `Pandas`

* Learning the `Python` fundamentals necessary for machine learning

* Linear, polynomial and logistic regression with `scikit-learn`

In [None]:
import time
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical

# Start time (taken before data loading, so the measured duration
# also covers loading, preprocessing and evaluation)
start_time = time.time()

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocess the data: scale pixel values to [0, 1] and one-hot encode the labels
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Build the model
model = Sequential([
    Flatten(input_shape=(28, 28)),          # Input layer
    Dense(128, activation='relu'),          # First hidden layer
    Dense(64, activation='relu'),           # Second hidden layer
    Dense(10, activation='softmax')         # Output layer
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model (the number of epochs and the batch size can be adjusted)
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_data=(x_test, y_test))

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

# End time
end_time = time.time()

# Compute the elapsed time
training_time = end_time - start_time
print(f"Training duration: {training_time:.2f} seconds")

In [1]:
def linear_search(element, list1):
    """Print and return the 1-based position of element in list1 (None if absent)."""
    for position, number in enumerate(list1, start=1):
        if number == element:
            print(f'Element: {number} found at position {position} of the list.')
            return position
    print('Element not found!')
    return None

In [2]:
data = [4, 1, 2, 3]

In [3]:
result = linear_search(4, data)

Element: 4 found at position 1 of the list.


In [4]:
result = linear_search(5, data)

Element not found!
