# Introduction
## Types of Machine Learning
### Supervised ML
Each input has an explicit output (target), such as a label or a number.
The goal is to learn the relation between input and output.

The trained model can be used in an application.

#### Classification
The target (output) is a label, i.e. a discrete value.
Used to predict, e.g., the genre of a song.

#### Regression
The output is a continuous numeric value, like a stock price.
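
To make the difference between the two supervised tasks concrete, here is a minimal sketch. The datasets (`load_iris`, `load_diabetes`) and the KNN model type are assumptions chosen for illustration only; they stand in for the song-genre and stock-price examples above.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: the target is a discrete class label (0, 1 or 2).
X_cls, y_cls = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_cls, y_cls)
predicted_label = clf.predict(X_cls[:1])[0]

# Regression: the target is a continuous number (a disease progression score).
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = KNeighborsRegressor(n_neighbors=5).fit(X_reg, y_reg)
predicted_value = reg.predict(X_reg[:1])[0]

print("classification output:", predicted_label)
print("regression output:", predicted_value)
```

The classifier can only return one of the known labels, while the regressor can return any number in between the targets it has seen.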

### Unsupervised ML
There is no output (target); the goal is to find relations between samples, e.g. by clustering.
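
A minimal clustering sketch, assuming k-means as the algorithm and the iris data as input (both are illustrative choices, not prescribed by these notes). The labels are discarded on purpose, so the algorithm only sees the samples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # the labels are deliberately ignored

# Group the samples into 3 clusters based on feature similarity only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_  # one cluster id per sample

print(cluster_ids[:10])
```

The cluster ids are arbitrary numbers; they say which samples belong together, not what the groups mean.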

### Others
- Reinforcement learning: an agent takes actions in an environment (states) and a reward function scores them; this feedback-update loop continuously updates the model.
- Deep learning: deep neural networks; combines supervised, unsupervised and some other techniques.

## Glossary
- Model: A representation of the relations in the data. In supervised learning: a regression/classification model that is trained on our data.
- Model type/class: The underlying algorithm that is used to create the model, like SVM or KNN.
- Model parameters: What the model learns from the data *on its own* during training.
- Hyperparameters: Parameters of the algorithm that are set before training and can be tuned to change the model's behaviour. They influence
    - how the model learns from the data
    - the model's complexity
- Training: Fitting the model to the data.
- Evaluation/Test: Check how the model performs on test data.
- Features/Predictors/Dimensions: measurable properties of a sample, usually the columns in a CSV file.
- Sample: one data point, e.g. one row in a CSV file.
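
The glossary terms can be mapped onto a few lines of code. This is a sketch under assumptions: logistic regression as the model type, the iris data, and the concrete values for `C`, `max_iter` and `test_size` are all picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 150 samples (rows) with 4 features (columns) each.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Model type: logistic regression. C and max_iter are hyperparameters set by us.
model = LogisticRegression(C=1.0, max_iter=1000)

# Training: fit the model to the training data.
model.fit(X_train, y_train)

# Model parameters: the coefficients the model learned on its own.
print("learned coefficients, shape:", model.coef_.shape)

# Evaluation/Test: check the performance on data not used for training.
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Changing `C` changes how strongly the model is regularized, which is one way a hyperparameter influences the model's complexity.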

## Basic ML Workflow
![Basic ML Workflow](img/ml_workflow.png)

## Type III Error
*Provide the right answer to the wrong question.*

Often occurs when the dataset is not fully understood or contains a pattern that does not occur naturally.
For example, the model learns that every second sample is of type `A`.
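
A small sketch of such an artificial pattern. It uses a different pattern than the alternating one above: the iris file is sorted by class, so the row position alone (a hypothetical, accidentally included "feature") predicts the label almost perfectly. Dataset and model type are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

_, y = load_iris(return_X_y=True)

# The only "feature" is the position of the sample in the file.
row_index = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    row_index, y, test_size=0.3, random_state=0
)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# High accuracy, but the model learned the file order, not anything about flowers.
leak_accuracy = tree.score(X_test, y_test)
print("accuracy using only the row index:", leak_accuracy)
```

The score looks excellent, yet the model answers "where in the file is this row?" instead of "which species is this?", and it would fail on any reshuffled or new data.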

Use common sense and intuition!

[Some examples](https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml?pli=1)

# Visualization
Visualization helps to understand data.
It shows patterns in datasets.

## What to look at first
1. Before you even visualize anything, check the basics: Size of the dataframe? Datatypes? Class distributions? Basic stats per feature? Is data missing (NA, etc.)?
2. Distribution of individual features: consider visualizing
3. Relations between features: consider visualizing

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=["sepal_length", "sepal_width", "petal_length", "petal_width", 'target'])
df["target"] = df["target"].apply(lambda x: iris.target_names[int(x)])
df.shape  # the dimensions of the dataframe/dataset

In [None]:
df.dtypes  # the datatype of each dimension

In [None]:
df.describe()  # shows for each numerical feature the mean, std, max, quantiles
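
The check for missing data from the list above can be sketched as follows. The dataframe is rebuilt here so the block runs on its own; the artificially removed value in `df_with_gap` is a made-up example, since the iris data itself is complete.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
df["target"] = iris.target_names[iris.target]

# Number of missing values (NaN/None) per column; all zero for iris.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Hypothetical gap: remove one value to show how it is reported.
df_with_gap = df.copy()
df_with_gap.loc[0, "sepal_width"] = np.nan
missing_with_gap = df_with_gap.isna().sum()
print(missing_with_gap)
```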

## Classification: class distribution
Balanced datasets make things easier, but in reality classes are often unequally distributed.
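
A minimal sketch for checking the class distribution with pandas. The iris data is perfectly balanced, so an imbalanced subset is constructed artificially here (keeping only 5 samples of two classes); that subset is an assumption for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
labels = pd.Series(iris.target_names[iris.target], name="target")

# Share of each class in the full dataset: one third each.
balanced_share = labels.value_counts(normalize=True)
print(balanced_share)

# Artificially imbalanced subset: all setosa, but only 5 of each other class.
imbalanced = pd.concat([
    labels[labels == "setosa"],
    labels[labels == "versicolor"].head(5),
    labels[labels == "virginica"].head(5),
])
imbalanced_share = imbalanced.value_counts(normalize=True)
print(imbalanced_share)
```

In the imbalanced subset a model that always answers `setosa` would already be right for most samples, which is why the class distribution should be known before interpreting any accuracy.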

## Regression: distribution of target variable
Similar problem: the target values are not equally distributed or do not cover the complete target range.
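
One way to inspect the coverage of the target range is to count samples per equal-width interval. This is a sketch assuming the diabetes dataset from scikit-learn as a regression example and 10 bins as an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes

_, y = load_diabetes(return_X_y=True)  # continuous target, 442 samples

# Count samples per equal-width target interval. Sparse or empty bins mark
# target ranges for which the model sees little or no training data.
counts, edges = np.histogram(y, bins=10)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n:3d} samples")
```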

## Scatterplot
- pairplot, xyplot, etc
- shows relations between two features
- Classification: color, symbol can indicate class label


In [None]:
from matplotlib import pyplot as plt
import seaborn as sn

_, ax = plt.subplots(1, 2, figsize=(10, 5))

# left: sepal dimensions, right: petal dimensions; the color (hue) encodes the class label
sn.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="target", ax=ax[0])
sn.scatterplot(data=df, x="petal_length", y="petal_width", hue="target", ax=ax[1])
plt.show()


## Scatterplot Matrix
Shows pairwise relations (e.g. correlation) between features, but gets messy with high-dimensional datasets.

In [None]:
feature_cols = df.drop("target", axis=1).columns
_, ax = plt.subplots(4, 4, figsize=(20, 20))

for i, col in enumerate(feature_cols):
    for j, inner_col in enumerate(feature_cols):
        if col == inner_col:
            # diagonal: print the feature name, then stop so only the lower triangle is drawn
            ax[i, j].text(0.5, 0.5, col, ha='center', va='center', size=30)
            break
        sn.scatterplot(data=df, x=col, y=inner_col, hue="target", ax=ax[i, j])
plt.show()
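
As a numeric complement to the scatterplot matrix, the pairwise correlations can be computed directly. A minimal sketch, assuming the Pearson correlation (the pandas default) is the measure of interest; the dataframe is rebuilt so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# Pairwise Pearson correlation between all features (values in [-1, 1]).
corr = features.corr()
print(corr.round(2))
```

Unlike the plot, this table stays readable with many features, but it only captures linear relations.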

## Densityplot
- Density of the distribution of a single variable
- Sometimes individual samples are plotted as well (scatter) for a better understanding of the data
- Bandwidth parameter of the kernel: defines the granularity (smoothness) of the density estimate
- Classification: the more the feature densities of different classes overlap, the more similar the classes are with respect to that feature
- Differences in density: the feature possibly captures some differences between the classes

In [None]:
_, ax = plt.subplots(2, 2, figsize=(20, 20))

for i, col in enumerate(df.drop("target", axis=1)):
    # bw_adjust scales the kernel bandwidth: <1 gives a finer, >1 a smoother estimate
    sn.kdeplot(df, x=col, hue="target", ax=ax[i // 2, i % 2], bw_adjust=1.0)
plt.show()
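
The overlap of class densities can also be estimated numerically. The sketch below is an assumption-laden illustration: `density_overlap` is a hypothetical helper, it uses SciPy's `gaussian_kde` with its default bandwidth, and it approximates the shared area under two density curves on a grid.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target


def density_overlap(a, b, bw_method=None, n_grid=512):
    """Approximate shared area under two Gaussian KDE curves (0 = disjoint, 1 = identical)."""
    lo = min(a.min(), b.min()) - 1.0
    hi = max(a.max(), b.max()) + 1.0
    grid = np.linspace(lo, hi, n_grid)
    dens_a = gaussian_kde(a, bw_method=bw_method)(grid)
    dens_b = gaussian_kde(b, bw_method=bw_method)(grid)
    step = grid[1] - grid[0]
    # Integrate the pointwise minimum of both densities over the grid.
    return float(np.minimum(dens_a, dens_b).sum() * step)


setosa, virginica = X[y == 0], X[y == 2]
overlap_sepal_width = density_overlap(setosa[:, 1], virginica[:, 1])
overlap_petal_length = density_overlap(setosa[:, 2], virginica[:, 2])

print("sepal_width overlap:", round(overlap_sepal_width, 3))
print("petal_length overlap:", round(overlap_petal_length, 3))
```

A small overlap (as for `petal_length`) suggests the feature separates the two classes well; a large overlap suggests it carries little information for telling them apart.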

## Histogram
The number of bins determines how many bars are used in a histogram.
See also: Binning

In [None]:
_, ax = plt.subplots(1, 2)
sn.histplot(df, x="sepal_length", bins=4, ax=ax[0])
sn.histplot(df, x="sepal_length", bins=8, ax=ax[1])
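
What the plot does internally can be sketched with `np.histogram`: the value range is cut into equal-width bins and the samples per bin are counted. The bin numbers 4 and 8 mirror the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris

sepal_length = load_iris().data[:, 0]

# Same data, different number of bins: 4 coarse bars vs. 8 finer bars.
counts_4, edges_4 = np.histogram(sepal_length, bins=4)
counts_8, edges_8 = np.histogram(sepal_length, bins=8)

print("4 bins:", counts_4, "edges:", edges_4.round(2))
print("8 bins:", counts_8, "edges:", edges_8.round(2))
```

Every sample lands in exactly one bin, so both count arrays sum to the number of samples; only the resolution changes.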

## Mean, Median, SD and MAD
All can be used for feature derivation

### Mean and Median
Show the "center" of the sample distribution.
The median is considered statistically more robust (to outliers) than the mean.

### SD and median absolute deviation (MAD)
Show the scatter (spread) of the samples. Again, MAD is more robust than SD.

SD Formula:
$$
  SD=\sqrt{\frac{\sum (x_{i}-\mu)^2}{N}}
$$

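MAD Formula:
`X` is the series of samples and `m(X)` its center.
For the general average absolute deviation, `m(X)` can be the arithmetic mean, median or mode; for the median absolute deviation it is the median, and the median of the absolute deviations is taken instead of their mean.

$$
  MAD=\operatorname{median}\left(\vert x_{i} - \operatorname{median}(X)\vert\right)
$$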

**Median and MAD are far less impacted by outliers than mean and SD!**
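
The robustness claim can be checked on a tiny made-up series. In this sketch `median_abs_deviation` is a hypothetical helper implementing the median absolute deviation, and `np.std` is used with its default `ddof=0`, which matches the SD formula above (note that pandas' `.std()` defaults to `ddof=1`).

```python
import numpy as np


def median_abs_deviation(x):
    """Median of the absolute deviations from the median."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # last value replaced by an outlier

print("mean:  ", clean.mean(), "->", with_outlier.mean())            # 3.0 -> 102.0
print("median:", np.median(clean), "->", np.median(with_outlier))    # 3.0 -> 3.0
print("SD:    ", clean.std(), "->", with_outlier.std())              # grows by orders of magnitude
print("MAD:   ", median_abs_deviation(clean), "->", median_abs_deviation(with_outlier))  # 1.0 -> 1.0
```

A single extreme value shifts the mean and inflates the SD, while the median and MAD stay where they were.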


In [None]:
sn.displot(df, x="sepal_length", kind="kde", height=8)
plt.axvline(x=df["sepal_length"].mean(), color='black', label="mean")
plt.axvline(x=df["sepal_length"].median(), color='red', label="median")
plt.legend(loc=0)
plt.show()

In [None]:
_, ax = plt.subplots(1, 2)
mean = df["sepal_length"].mean()
sd = df["sepal_length"].std()
median = df["sepal_length"].median()
mad = (df["sepal_length"] - median).abs().median()  # median absolute deviation

sn.kdeplot(df, x="sepal_length", ax=ax[0])
ax[0].axvline(x=mean, color='black', label="mean")
ax[0].axvline(x=mean + sd, color='green', label="sd")
ax[0].axvline(x=mean - sd, color='green')
ax[0].legend(loc=0)

sn.kdeplot(df, x="sepal_length", ax=ax[1])
ax[1].axvline(x=median, color='red', label="median")
ax[1].axvline(x=median + mad, color='blue', label="mad")
ax[1].axvline(x=median - mad, color='blue')
ax[1].legend(loc=1)
plt.show()

## Boxplot (=box-and-whisker-plot)
Shows the distribution of one variable by means of the variable's quartiles.

Quartiles are a special case of quantiles:
they use 3 cut points that split the data into 4 parts.
Each part holds ~25% of the data samples.
- Q1: 25% below, 75% above
- Q2: median, 50% below, 50% above
- Q3: 75% below, 25% above

There are also percentiles: if X is the 80th percentile, 80% of the samples lie below X.

The dots in a boxplot are outliers that lie beyond the whiskers; the box contains the middle 50% of the data.
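
The quantities a boxplot draws can be computed by hand. This sketch uses a made-up series and assumes the common 1.5 × IQR whisker rule (the matplotlib default) to decide which points are drawn as dots.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Quartiles: the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box, holds the middle 50% of the data

# Points outside these fences are drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, Q2, Q3:", q1, q2, q3)   # 3.25 5.5 7.75
print("IQR:", iqr)                 # 4.5
print("outliers:", outliers)       # [30.]
```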

In [None]:
_, ax = plt.subplots(1, 2)
df.boxplot(ax=ax[0])
q = df["sepal_length"].quantile([0.25, 0.5, 0.75])
sn.kdeplot(df, x="sepal_length", ax=ax[1])
ax[1].axvline(x=q[0.25], color='blue', label="Q1")
ax[1].axvline(x=q[0.5], color='red', label="Q2")
ax[1].axvline(x=q[0.75], color='green', label="Q3")
ax[1].legend(loc=1)

## TODO: Levelplot / Contourplot: Visualization Page 40