# Machine learning

# Learning
* Enables problem solving:
    * the transition of a state with respect to a quality factor
    * state x is changed with some function
    * f(speech) = emotion
    * f(x) = x'
    * e.g. f(x) = a x + b
    * find a and b so that x' is optimal
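
A minimal sketch of the last two points, assuming that "optimal" means the smallest squared error against known target values. The function name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimate of a and b in y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance and variance (both unnormalized, the factor cancels out)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# toy data generated from y = 2 x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

For real models the parameters are usually not found in closed form but by iterative optimization; the idea of searching for the parameters that minimize an error stays the same.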

# How to split your data
In supervised machine learning, you usually need three kinds of data sets:
* train data: to teach the model the relation between data and labels
* dev data: (short for *development*) to tune meta parameters of your model, e.g. 
    * *number of neurons*, 
    * *batch size* or 
    * *learning rate*.
* test data: to evaluate your model ONCE at the end to check on generalization

Of course all this is to prevent [*overfitting*](http://blog.syntheticspeech.de/2022/02/16/kinds-of-machine-learning/#Overfitting) on your train and/or dev data.

* If you've used your test data for a while, 
you might need to find a new set, 
as chances are high that you overfitted 
on your test set during experiments.

So what's a good split?

Some rules apply:
* train and dev can be from the same set, but the test set is ideally from a different database.
* if you don't have much data, an 80/10/10 % split is normal
* if you have masses of data, use only as much dev and test data as is needed to cover your population.
* you can stratify the data, so that each split keeps the same class distribution
* If you have really little data: use [x-fold cross validation](http://blog.syntheticspeech.de/2022/11/28/how-to-evaluate-your-model/#X_fold_cross_validation) for train and dev; the test set should still be kept separate
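
As a rough sketch, a stratified 80/10/10 split could look like this in plain Python. The function name `stratified_split`, the seed and the toy labels are assumptions for illustration; this is not how Nkululeko implements its splits.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return train/dev/test index lists that keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, dev, test = [], [], []
    # split each class separately, so every set gets its share of each class
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(ratios[0] * len(indices)))
        n_dev = int(round(ratios[1] * len(indices)))
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]
    return train, dev, test

# toy data: 70 % happy, 30 % angry
labels = ["happy"] * 70 + ["angry"] * 30
train, dev, test = stratified_split(labels)
print(len(train), len(dev), len(test))
```

Note that this sketch ignores speakers; for speech data the sets should additionally be speaker disjoint.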

## Nkululeko exercise 1


Edit the [demo configuration](https://github.com/felixbur/nkululeko/blob/main/demos/exp_emodb.ini).
Set/keep *emotion* as target, *os* as FEAT type and *xgb* as MODEL type.
Use emodb as test and train set, but try [out all split methods](https://github.com/felixbur/nkululeko/blob/main/ini_file.md#data):
* specified
* speaker split
* balanced
* random
* loso
* logo 
* 5_fold_cross_validation

Which works best and why?

## Nkululeko exercise 2
Set the following options:
```
[EXP]
epochs = 200
[MODEL] 
type = mlp
layers = {'l1':1024, 'l2':64} 
save = True
[PLOT]
epoch_progression = True
best_model = True
```
Run the experiment.
Find the epoch progression plot and see at which epoch overfitting starts.

# Evaluation
These slides are about the evaluation of machine learning models. Obviously, the answer to the question of whether a model is any good depends a lot on how you test it.

## Criteria
Depending on whether you have a classification or a regression problem, you can choose from a multitude of measures.

### Classification
Most of these measures are derived from the confusion matrix:
* **Confusion Matrix** : Matrix with results: rows represent the real values and columns the predictions. 
* In the binary case, the cells are called *True Positive* (TP), *False Negative* (FN: Type 2 error), *False Positive* (FP: Type 1 error) and *True Negative* (TN)
* So in the example (next slide), TP=3, FN=4, FP=3 and TN=3.


<img src=images/Prec-recall.png width=40%>

<img src=images/conf_mat.png width=40%>

The following measurements can be derived from these:

* **Accuracy**: Percentage of correct predictions -> (TP+TN)/(TP+FP+FN+TN).
* **un- / weighted Recall/Sensitivity**: percentage of detected cases -> TP / (TP+FN). Can be weighted by class frequency, for multiple classes 
* **un- / weighted Precision**: percentage of relevant predictions -> TP / (TP+FP)
* **Specificity**: Like sensitivity, but for the negative examples -> TN / (TN+FP)
* **F1**: Combination of Recall and Precision -> F1 = 2 * (Rec* Prec)/ (Rec + Prec) 
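* **AUC/ROC**: Usually there's a tradeoff between detecting more positive cases and raising more false alarms. The *Receiver Operating Characteristic* (ROC) curve visualizes this by plotting the true positive rate (sensitivity) against the false positive rate (100 % - specificity) for varying decision thresholds; its *Area Under the Curve* (AUC) summarizes the curve in a single number.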
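
The formulas above can be sketched in a few lines of Python, here applied to the counts from the example (TP=3, FN=4, FP=3, TN=3). The function name `classification_metrics` is made up for illustration, and the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive the measures listed above from the four confusion matrix cells."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * recall * precision / (recall + precision),
    }

metrics = classification_metrics(tp=3, fn=4, fp=3, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy and F1 both come out at 6/13 (about 0.46), recall at 3/7 (about 0.43), and precision and specificity at 0.5.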

### Regression
* **PCC** Pearson's Correlation Coefficient measures the linear relationship between two sets of numbers of the same length. It's a value between -1 and 1, with 0 meaning no correlation and -1 a perfect negative correlation. When plotted in 2-d space, PCC=1 means that all points lie on a straight line with positive slope.

* **MAE** Mean absolute error: given two sets of numbers of the same length as correct and predicted values, one can compute the mean absolute error by summing up the absolute values of the pairwise differences and dividing by the number of samples.
* **MSE** Mean squared error: like MAE, but the pairwise differences are squared instead of taking their absolute values, so large errors weigh more.
* **CCC** Concordance Correlation Coefficient is a measure quite similar to PCC but tries to penalize rater bias (seeing the two distributions as truth and ratings).
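
To make the difference between these measures concrete, here is a small sketch in plain Python. The helper names and the toy values are made up; the CCC uses the common definition with population variances, which is an assumption as the slides do not give a formula.

```python
import math

def mean(values):
    return sum(values) / len(values)

def mae(truth, pred):
    return mean([abs(t - p) for t, p in zip(truth, pred)])

def mse(truth, pred):
    return mean([(t - p) ** 2 for t, p in zip(truth, pred)])

def covariance(xs, ys):
    """Population covariance; covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def pcc(truth, pred):
    return covariance(truth, pred) / math.sqrt(
        covariance(truth, truth) * covariance(pred, pred))

def ccc(truth, pred):
    # like PCC, but a difference in the means lowers the score
    mean_diff = mean(truth) - mean(pred)
    return 2 * covariance(truth, pred) / (
        covariance(truth, truth) + covariance(pred, pred) + mean_diff ** 2)

truth = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]  # always one too high: a constant bias

print(mae(truth, pred), mse(truth, pred), pcc(truth, pred), ccc(truth, pred))
```

The predictions here are perfectly correlated with the truth (PCC = 1), but the constant bias shows up in MAE and MSE (both 1) and pulls the CCC down to about 0.71.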

## Approaches
### Train / dev / test splits
Ideally, you have enough data to split it into separate sets:
* **train** for the training
* **dev** (or **eval**): to tune meta-parameters
* **test** as a final test set 

Be careful to make sure that they are speaker disjoint, i.e. do not have overlapping speakers, else you can't be sure whether your model learned general characteristics of the task or speaker idiosyncrasies.

Also it's a very good idea to have the test set from a completely different data source, so you could have more trust in the generalizability of your model.

More [on the subject here](http://blog.syntheticspeech.de/2022/12/01/how-to-split-you-data/)
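
A speaker disjoint split can be sketched by splitting the list of speakers instead of the list of samples. The function name `speaker_disjoint_split`, the test fraction and the toy speaker ids are assumptions for illustration.

```python
import random

def speaker_disjoint_split(speakers, test_fraction=0.2, seed=0):
    """Split sample indices so that no speaker occurs in both sets."""
    unique = sorted(set(speakers))
    random.Random(seed).shuffle(unique)
    # choose whole speakers for the test set, at least one
    n_test = max(1, round(test_fraction * len(unique)))
    test_speakers = set(unique[:n_test])
    train = [i for i, s in enumerate(speakers) if s not in test_speakers]
    test = [i for i, s in enumerate(speakers) if s in test_speakers]
    return train, test

# one speaker id per sample
speakers = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s5", "s5", "s5"]
train, test = speaker_disjoint_split(speakers)
print(sorted({speakers[i] for i in test}))
```

Because whole speakers are assigned, the resulting set sizes only approximate the requested fraction when speakers have different numbers of samples.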

### X fold cross validation

* If you are low on data, you might try x fold cross validation, 
* it means that you split your data in *x* (usually 10) sets of equal size, 
* and then do *x* trainings, 
* each time using a different set as *dev set* and the rest for *train*.
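
The procedure can be sketched as a generator that yields one train/dev pair per fold. The function name `k_fold_indices` is made up, and the sketch assumes the samples were shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, dev) index lists; every sample is in dev exactly once."""
    indices = list(range(n_samples))
    # distribute the samples over k folds of (almost) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        dev = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, dev

folds = list(k_fold_indices(n_samples=10, k=5))
for train, dev in folds:
    print("train:", train, "dev:", dev)
```

The final result is then usually reported as the mean of the *x* dev results.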

**Nkululeko exercise**: try this with the MODEL k_fold_cross option

### LOSO

* *Leave one Speaker out* 
* is like X fold cross-validation, but each set consists of all samples of one speaker. 
* If there are many speakers, you might want *Leave one speaker group out*.

Both are supported by [Nkululeko](http://blog.syntheticspeech.de/2021/08/04/machine-learning-experiment-framework/).
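
As a sketch, independent of how Nkululeko does it internally, LOSO folds can be generated by holding out one speaker at a time. The function name `leave_one_speaker_out` and the toy speaker ids are made up for illustration.

```python
def leave_one_speaker_out(speakers):
    """Yield (held_out, train, dev) with all samples of one speaker as dev."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        dev = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, dev

# one speaker id per sample
speakers = ["s1", "s1", "s2", "s3", "s3", "s3"]
loso_folds = list(leave_one_speaker_out(speakers))
for held_out, train, dev in loso_folds:
    print(held_out, "train:", train, "dev:", dev)
```

For *Leave one speaker group out*, the same function can be used after mapping each speaker id to a group id.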

**Nkululeko exercise**: try this with the MODEL logo option

# Meta parameter tuning
* The parameters that configure machine learning algorithms are called *meta parameters* 
* in contrast to the "normal" *parameters* that are learned during training.
* But as they obviously also influence the quality of your predictions, these parameters also must be learned.
* They are *optimized* on the *dev-set*

* Examples are 
    * the C parameter for SVM
    * the subsample ratio for XGB
    * the number of layers and neurons for a neural net

* The naive approach is simply to try them all, systematically,
but this does not scale.

* In general, because the search space for the optimal configuration is usually unbounded, it is better to try a stochastic approach (e.g. random search) or a genetic one.
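
A stochastic approach can be as simple as random search: sample configurations at random and keep the best one. In the sketch below, `dev_score` is a hypothetical stand-in for "train a model and evaluate it on the dev set"; the parameter names and ranges are also made up.

```python
import random

def dev_score(learning_rate, n_neurons):
    """Hypothetical stand-in for training a model and scoring it on dev."""
    return -(learning_rate - 0.01) ** 2 * 1e4 - (n_neurons - 64) ** 2 / 1e3

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {
            # sample the learning rate on a log scale between 1e-4 and 1e-1
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "n_neurons": rng.choice([16, 32, 64, 128, 256]),
        }
        score = dev_score(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

best_config, best_score = random_search(n_trials=50)
print(best_config, best_score)
```

Unlike the systematic approach, the number of trials is fixed in advance and does not grow with the number of meta parameters.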

**Nkululeko exercise**: optimize the parameters [using *tuning_params*](http://blog.syntheticspeech.de/2021/09/03/perform-optimization-with-nkululeko/)

# Data augmentation
* Often (kind of always) there is a lack of training data for supervised learning.
* One way to tackle this is representation learning, which can be done in a self-supervised fashion.
* Another approach is to multiply your labeled training data by adding slightly altered versions of it, 
* that would not change the information that is the aim of the detection, 
    * for example by adding noise to the data or clipping it.
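
A minimal sketch of these two alterations on a toy signal, assuming that "clipping" means limiting the amplitude. The function names, the noise level and the clipping limit are made up for illustration; this is not the Nkululeko implementation.

```python
import random

def add_noise(signal, noise_std=0.01, seed=0):
    """Return a copy of the signal with Gaussian noise added to each sample."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in signal]

def clip(signal, limit=0.5):
    """Return a copy with all amplitudes limited to [-limit, limit]."""
    return [max(-limit, min(limit, x)) for x in signal]

signal = [0.0, 0.3, 0.8, -0.9, 0.1]
label = "angry"

# the altered versions keep the label of the original sample
augmented = [(add_noise(signal), label), (clip(signal), label)]
train_set = [(signal, label)] + augmented
print(len(train_set))
```

Only the training set should be augmented; dev and test data stay untouched.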
    
**Nkululeko exercise**: try this with the [*augment* module](http://blog.syntheticspeech.de/2023/03/13/nkululeko-how-to-augment-the-training-set/)

* A third way is to synthesize data based on the labeled training data, 
* for example with GANs, VAEs or with rule-based simulation. 
* One can distinguish whether only a parameterized form of the samples (i.e. the features) or raw samples are generated.

* Sometimes only samples for a rare class are needed, 
    * in this case techniques like random oversampling (ROS), 
    * Synthetic Minority Oversampling Technique (SMOTE) or 
    * Adaptive Synthetic sampling (ADASYN) can be used.
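
The simplest of these, random oversampling, just duplicates randomly chosen samples of the rare classes. A sketch in plain Python, with the function name `random_oversample` and the toy data made up for illustration:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # draw with replacement until the class reaches the target size
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["neutral", "neutral", "neutral", "neutral", "angry"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

SMOTE and ADASYN go one step further and interpolate new feature vectors between existing samples instead of copying them.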
    
**Nkululeko exercise**: try this with the [FEATS balancing key](http://blog.syntheticspeech.de/2023/11/16/nkululeko-oversample-the-training-set/)

# Feature normalization
* "Normalizing" or scaling feature values means 
* to shift them to a common range, or to a distribution with the same mean and standard deviation 
    * (also called z-transformation).
* You would do that for several reasons:

* Artificial neural nets can handle small numbers best, so they all should be in the range [-1, 1]
* Speakers have their individual ways of speaking, which you are not interested in if you want to learn a general task, e.g. emotion or age. So you would normalize the values for each speaker individually. Of course, in most applications this is not possible, because you don't yet have samples of your test speakers.
* You might want to normalize the sexes, because women typically have a higher pitch. Another way out is to use only relative values and not absolute ones.

* Mind that you shouldn't use your test set for normalization 
* as it really only should be used for test and is supposed to be unknown. 
* That's why you should compute your normalization parameters on the training set; you can then use them to normalize/scale the test set.
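
A small sketch of a z-transformation that follows this rule: mean and standard deviation are estimated on the training set only and then reused for the test set. The function names and the toy pitch values are made up for illustration.

```python
import statistics

def fit_scaler(train_values):
    """Estimate mean and standard deviation on the training set only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def z_transform(values, mean, std):
    """Shift and scale values with previously estimated parameters."""
    return [(v - mean) / std for v in values]

train_pitch = [100.0, 120.0, 140.0, 160.0]  # toy feature values
test_pitch = [110.0, 200.0]

mean, std = fit_scaler(train_pitch)
train_scaled = z_transform(train_pitch, mean, std)
test_scaled = z_transform(test_pitch, mean, std)  # reuses the train parameters
print(train_scaled, test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1; the test values generally do not, and that is intended.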

**Nkululeko exercise**: try this with the [FEATS scaling key](http://blog.syntheticspeech.de/2021/03/26/feature-scaling/)