# Machine learning

# Learning
* Enables problem solving:
    * the transition of a state with respect to a quality factor
    * state x is changed with some function
    * f(speech) = emotion
    * f(x) = x'
    * e.g. f(x) = a x + b
    * find a and b so that x' is optimal

# How to split your data
In supervised machine learning, you usually need three kinds of data sets:
* train data: to teach the model the relation between data and labels
* dev data: (short for *development*) to tune meta parameters of your model, e.g. 
    * *number of neurons*, 
    * *batch size* or 
    * *learning rate*.
* test data: to evaluate your model ONCE at the end to check on generalization

Of course all this is to prevent [*overfitting*](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Overfitting) on your train and/or dev data.

* If you've used your test data for a while, you might need to find a new set, as chances are high that you overfitted on your test set during experiments.

So what's a good split?

Some rules apply:
* train and dev can be from the same set, but the test set is ideally from a different database.
* if you don't have much data, something like an 80/10/10 % split is normal
* if you have masses of data, use only as much dev and test data as you need to cover your population.
* if you have really little data, use [x-fold cross validation](http://blog.syntheticspeech.de/2022/11/28/how-to-evaluate-your-model/#X_fold_cross_validation) for train and dev; the test set should still be separate
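
As a rough illustration of the rules above, here is a minimal sketch of a random train/dev/test split in plain Python. The function name `split_data`, the file names and the 10 % fractions are assumptions made for this example only; the sketch does not take speakers or data sources into account.

```python
import random

def split_data(samples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the samples and cut them into train, dev and test lists."""
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Hypothetical file names, only to have something to split.
samples = [f"utterance_{i:03d}.wav" for i in range(100)]
train, dev, test = split_data(samples)
print(len(train), len(dev), len(test))
```

The test list is cut off first so that it can be put aside and touched only once at the end.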

## Nkululeko exercise 1


Edit the [demo configuration](https://github.com/felixbur/nkululeko/blob/main/demos/exp_emodb.ini)

Set/keep *emotion* as target, *os* as FEAT type and *xgb* as MODEL type.

Use emodb as both test and train set, but try out all [split methods](https://github.com/felixbur/nkululeko/blob/main/ini_file.md#data)
* specified
* speaker split
* random
* loso
* logo 
* 5_fold_cross_validation

Which works best and why?

## Nkululeko exercise 2
Set the following values in the configuration:
```
[EXP]
epochs = 200
[MODEL] 
type = mlp
layers = {'l1':1024, 'l2':64} 
save = True
[PLOT]
epoch_progression = True
best_model = True
```
Then run the experiment.
Find the epoch progression plot and see at which epoch overfitting starts.

# Evaluation
These slides are about the evaluation of machine learning models. Whether a model is any good obviously depends a lot on how you test it.

## Criteria
Depending on whether you have a classification or a regression problem, you can choose from a multitude of measures.

### Classification
Most of these measures are derived from the confusion matrix:
* **Confusion Matrix** : Matrix with results: rows represent the real values and columns the predictions. 
* In the binary case, the cells are called *True Positive* (TP), *False Negative* (FN: Type 2 error), *False Positive* (FP: Type 1 error) and *True Negative* (TN)
* So in the example (next slide), TP=3, FN=4, FP=3 and TN=3.


<img src=images/Prec-recall.png width=40%>

<img src=images/conf_mat.png width=40%>

The following measurements can be derived from these:

* **Accuracy**: Percentage of correct predictions -> (TP+TN)/(TP+FP+FN+TN).
* **un- / weighted Recall/Sensitivity**: percentage of detected cases -> TP / (TP+FN). For multiple classes, the per-class values can be averaged with or without weighting by class frequency.
* **un- / weighted Precision**: percentage of relevant predictions -> TP / (TP+FP)
* **Specificity**: Like sensitivity, but for the negative examples -> TN / (TN+FP)
* **F1**: Combination of Recall and Precision -> F1 = 2 * (Rec * Prec) / (Rec + Prec)
* **AUC/ROC**: Usually there's a trade-off between sensitivity and specificity. With the *Receiver Operating Characteristic* (ROC) curve and its *Area under curve* (AUC) this can be visualized by plotting the True positive rate (sensitivity) against the False positive rate (100-specificity).
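
The formulas above can be sketched in a few lines of Python. The function name `classification_metrics` is made up for this example, and the counts are the ones quoted for the example confusion matrix (TP=3, FN=4, FP=3, TN=3).

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive the common measures from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)        # also called sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }

metrics = classification_metrics(tp=3, fn=4, fp=3, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the sketch does not guard against empty rows or columns, where some of the denominators would be zero.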

### Regression
* **Pearson's**: Pearson's Correlation Coefficient (PCC) measures the linear correlation of two sets of numbers with the same length. It's a value between -1 and 1, with 1 meaning perfect positive correlation, 0 no correlation and -1 perfect negative correlation. When plotted in 2-d space, PCC=1 means that all points lie on a straight line with positive slope.

* **MAE**: Mean absolute error: given two sets of numbers with the same length as correct and predicted values, one can compute the mean absolute error by summing up the absolute values of the pairwise differences and dividing by the number of samples.

* **CCC** Concordance Correlation Coefficient is a measure quite similar to PCC but tries to penalize rater bias (seeing the two distributions as truth and ratings).
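
To make the difference between the three measures concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are assumptions for illustration; the CCC is computed with the usual formula 2·cov / (var_truth + var_pred + (mean_truth − mean_pred)²), using population variances.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def mae(truth, pred):
    """Mean absolute error of the pairwise differences."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def pcc(truth, pred):
    """Pearson's correlation coefficient."""
    mt, mp = mean(truth), mean(pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / sqrt(var_t * var_p)

def ccc(truth, pred):
    """Concordance correlation coefficient (population statistics)."""
    n = len(truth)
    mt, mp = mean(truth), mean(pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred)) / n
    var_t = sum((t - mt) ** 2 for t in truth) / n
    var_p = sum((p - mp) ** 2 for p in pred) / n
    return 2 * cov / (var_t + var_p + (mt - mp) ** 2)

# Toy values: the predictions follow the truth perfectly but are shifted by 1.
truth = [1.0, 2.0, 3.0, 4.0]
biased = [t + 1.0 for t in truth]
print(pcc(truth, biased), ccc(truth, biased), mae(truth, biased))
```

With the constant shift, the PCC stays at 1 while the CCC drops below 1, which is the bias penalty mentioned above.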

## Approaches
### Train / dev / test splits
Ideally, you have enough data to split it into separate sets:
* **train** for the training
* **dev** to tune meta-parameters
* **test** as a final test set 

Be careful to make sure that they are speaker disjoint, i.e. do not have overlapping speakers, else you can't be sure if you learn general speaker characteristics or speaker idiosyncrasies.

Also, it's a very good idea to have the test set from a completely different data source, so you can have more trust in the generalizability of your model.

More [on the subject here](http://blog.syntheticspeech.de/2022/12/01/how-to-split-you-data/)

### X fold cross validation

* If you are low on data, you might try x-fold cross validation, 
* it means that you split your data into *x* (usually 10) sets of the same size, 
* and then do *x* trainings, 
* each time using a different set as *dev set* and the rest for *train*.
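
The procedure can be sketched as a small index generator in plain Python. The name `k_fold_splits` is an assumption for this example, and the sketch neither shuffles the data nor keeps speakers apart.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, dev_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        dev = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, dev
        start += size

folds = list(k_fold_splits(n_samples=10, k=5))
for train_idx, dev_idx in folds:
    print("dev:", dev_idx, "train:", train_idx)
```

Each sample ends up in the dev set exactly once, so the per-fold results can be averaged over the whole data set.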


### LOSO

* *Leave one Speaker out* 
* is like x-fold cross-validation, but each set consists of all samples of one speaker. 
* If there are many speakers, you might want *Leave one speaker group out*.

Both are supported by [Nkululeko](http://blog.syntheticspeech.de/2021/08/04/machine-learning-experiment-framework/).
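
Independent of any framework, the idea can be sketched in plain Python. This is not Nkululeko's implementation; the function name `loso_splits`, the speaker ids and the file names are assumptions for this example.

```python
from collections import defaultdict

def loso_splits(samples):
    """Yield (held_out_speaker, train, dev), leaving out one speaker per round.

    samples is a list of (speaker_id, item) pairs.
    """
    by_speaker = defaultdict(list)
    for speaker, item in samples:
        by_speaker[speaker].append((speaker, item))
    for held_out in sorted(by_speaker):
        dev = by_speaker[held_out]
        train = [
            sample
            for speaker, group in by_speaker.items()
            if speaker != held_out
            for sample in group
        ]
        yield held_out, train, dev

# Hypothetical speaker ids and file names.
samples = [("spk1", "a.wav"), ("spk1", "b.wav"), ("spk2", "c.wav"), ("spk3", "d.wav")]
splits = list(loso_splits(samples))
for held_out, train, dev in splits:
    print(held_out, len(train), len(dev))
```

For the speaker group variant, the same code would work with a group id in place of the speaker id.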

# Different machine learners
* This section gives an overview of popular machine learners in a nutshell.
* Lots of sites on the internet give great detail on this and you should take a few minutes to check them out.


* In general, all these approaches work by extracting *features* from data and 
* comparing a *test* sample's features with the features derived from a *training* set 
* to predict some *class*  or *value*  [in case of regression](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Classification_vs_Regression).

So they work with two phases: 

* During training, the *parameters* of the approach are learned, thereby creating the *model*.
* At test time, unknown test samples get *predicted* by the model.

In addition, most of these approaches can be customized by 
* *meta-parameters* which also can be learned by some 
* *meta algorithm*, but not during a normal training. 

One thing all of these approaches have in common is that they 
* *model* the world by condensing the real values, 
* i.e. the data, to a simpler form at some point (*feature extraction*), 

so they all can be seen as some kind of dimensionality reduction 

* On the one hand you lose information this way, 
* on the other this is not a problem because you usually are interested in some kind of underlying principle that generated your training data, 
* and not so much in the training data itself.

Still, you have a trade-off between 
* generalizability and
* [specificity](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Bias_vs_variance)
* aka *bias* vs. *variance*

The following list is far from complete; I simply selected the ones that were most commonly used during my professional life.

## Linear regression
Linear regression represents the dependency between a dependent and an independent variable by a straight line. 
* The key question is how to learn the two parameters of the line (*a* and *b* of *y=ax+b*) using the training data. 
* One approach would be gradient descent with a [Perceptron](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Perceptron).

<img src=images/linear_regression.png width=60%>
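
As a minimal sketch of how *a* and *b* could be learned, the following plain Python code runs gradient descent on the mean squared error. The function name `fit_line`, the learning rate, the number of epochs and the toy data are assumptions chosen so that this small example converges; it is not a full Perceptron implementation.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Learn a and b of y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to a and b.
        grad_a = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Toy data generated from y = 2x + 1, so the fit should recover a=2 and b=1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))
```

With larger input values the learning rate would have to be reduced, or the data scaled, for the updates to stay stable.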

## GMMs

* A Gaussian is a way to describe a distribution with two values: mean and variance. 
* One way to distinguish two kinds of things is to distinguish them by the distributions of their features,
* e.g. herrings from trouts by the size of their fins.
* Gaussian mixture models model *one* distribution of each feature by a mix of several Gaussians, hence their name.


<img src=images/gmms.png width=60%>
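
To illustrate how a mixture of Gaussians describes one distribution, here is a minimal sketch in plain Python that only evaluates two given mixtures and picks the more likely one. All weights, means and variances are made-up numbers for the herring/trout example, and learning them from data (usually done with the EM algorithm) is left out.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, variance):
    """Density of a single Gaussian at x."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)

def gmm_pdf(x, components):
    """Density of a mixture; components are (weight, mean, variance) with weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

# Made-up mixtures for the fin size of two kinds of fish.
herring = [(0.6, 3.0, 0.25), (0.4, 4.0, 0.25)]
trout = [(0.5, 6.0, 0.5), (0.5, 7.5, 0.5)]

def classify(fin_size):
    """Pick the kind of fish whose mixture gives the higher density."""
    return "herring" if gmm_pdf(fin_size, herring) > gmm_pdf(fin_size, trout) else "trout"

print(classify(3.5), classify(7.0))
```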

## (Naive) Bayes
* Bayesian statistics is fundamentally different from so-called frequentist statistics, as it takes prior knowledge of the problem into account.
* The Bayesian formula tells us how likely an event (the class we want to distinguish) is, given that another event (the feature that we observe) has occurred.
* During training the Bayes classifier updates its belief about the world, using absolute or estimated frequencies as prior knowledge.
* The approach is called naive because it assumes that all input features are independent of each other, which is not true most of the time.

<img src=images/naive_bayes.png width=20%>
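
A minimal sketch of the idea for categorical features, in plain Python: the class prior is multiplied by the per-feature likelihoods, which is exactly where the independence assumption enters. The function names, the toy features and labels, and the add-one smoothing (which assumes two possible values per feature) are all assumptions for this example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the class with the highest prior times product of feature likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior from the class frequency
        for i, value in enumerate(row):
            # Add-one smoothing, assuming two possible values per feature.
            score *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data: (pitch, tempo) -> emotion.
rows = [("high", "fast"), ("high", "fast"), ("high", "slow"),
        ("low", "slow"), ("low", "slow"), ("low", "fast")]
labels = ["angry", "angry", "angry", "sad", "sad", "sad"]
class_counts, feature_counts = train_naive_bayes(rows, labels)
print(predict(("high", "fast"), class_counts, feature_counts))
```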

## KNN (k nearest neighbor)

* K nearest neighbor is an approach to assign a value to a *test* sample 
* based on its *k* (a given parameter) nearest neighbors in the *feature* space, according to some distance metric: 
* either the most common *class* among them or the average of some property *value*.


<img src=images/knn.png width=40%>
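
The classification case can be sketched in a few lines of plain Python, here with the Euclidean distance and a majority vote. The function name `knn_classify`, the two-dimensional toy points and the class names are assumptions for this example.

```python
from collections import Counter
from math import dist

def knn_classify(query, train_points, train_labels, k=3):
    """Return the most common label among the k training points closest to query."""
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(query, pair[0]),  # Euclidean distance in feature space
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up two-dimensional feature vectors for two classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((1.2, 1.5), train_points, train_labels))
print(knn_classify((6.4, 6.2), train_points, train_labels))
```

For regression, the vote would be replaced by the average of the neighbors' values.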

## Support vector machines
* Support vector machines are algorithms motivated by vector geometry.
* They construct *hyperplanes* in N-dimensional (number of *features*) space by maximizing the margin between data points from different classes.
* With a *kernel function*, which can be parameterized, the data is implicitly mapped to a higher-dimensional space, so that classes that are not linearly separable can still be separated by a hyperplane.
* They can be combined with GMMs if the data is approximated by them.


<img src=images/svm.png width=50%>

## CART (classification and regression trees)
* Perhaps the most straightforward way to categorize data: order it in a tree-like fashion with decisions on the *features* as twigs and the data points as leaves.
* The tree is learned from the training set (and can be probabilistic).
* The big advantage of this model is that it is easily interpretable to humans.


<img src=images/cart.png width=40%>
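
How a single decision of such a tree could be learned is sketched below in plain Python: every possible threshold on one feature is tried and the one with the fewest misclassified training samples is kept. This is a simplification, as CART implementations typically use an impurity measure such as the Gini index and repeat the step recursively; the function names and the toy fin sizes are assumptions for this example.

```python
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def best_split(values, labels):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    best_threshold, best_errors = None, len(labels) + 1
    candidates = sorted(set(values))
    # Try the midpoint between each pair of neighboring feature values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        errors = sum(l != majority(left) for l in left)
        errors += sum(l != majority(right) for l in right)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# Made-up fin sizes for the two kinds of fish.
fin_sizes = [2.0, 2.5, 3.0, 5.0, 5.5, 6.0]
species = ["herring", "herring", "herring", "trout", "trout", "trout"]
threshold, errors = best_split(fin_sizes, species)
print(threshold, errors)
```

The resulting rule ("fin size above the threshold means trout") is the kind of decision a human can read directly from the tree.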

## XGBoost
* A sophisticated algorithm loosely based on CART, 
* as it combines ensembles of trees (as in Random Forests) 
* with boosting: each new tree is trained to correct the errors of the previous ones.


<img src=images/xgboost.png width=40%>

## MLP (Multi-layer perceptron)

As the name suggests, these algorithms are derived from the original [Perceptron](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Perceptron) idea, which is inspired by the human brain.

<img src=images/ann.png width=40%>