# Machine Learning in Python for Neuroimaging

This notebook, prepared by Désirée Lussier, provides an introduction to machine learning for BrainHack School 2022. It was adapted from the Brainhack Global MTL 2019 traintrack tutorial (https://github.com/BrainhackMTL/global2019-traintrack/blob/master/machine_learning.ipynb) by Désirée Lussier, which was in turn modified from the MAIN 2019 training given by Alexandre Hutton (https://github.com/main-training/main-training-nilearn-ml/blob/master/01_intro_ml.ipynb; https://github.com/main-training/main-training-nilearn-ml/blob/master/01_intro_ml_slides.odp) and the sklearn tutorial (http://scipy-lectures.org/packages/scikit-learn/index.html).

This training is meant to give a brief overview of the basics of machine learning in Python 3. For more complete training or other specific examples, please see the additional BrainHack School modules (https://school.brainhackmtl.org/modules/), the MAIN courses (https://github.com/main-training), and the documentation for scikit-learn (https://scikit-learn.org/stable/).

# Training Outline

## Machine Learning with Scikit Learn

* Machine learning classification examples
* Model evaluation
* Model complexity

# What is machine learning?
In machine learning, models have parameters that are optimized according to previously seen data.

Machine learning (ML) can be divided into subcategories:

## Supervised Learning
We have observations X that we want to use to predict Y.

* X: data, features, inputs
* Y: target, labels, outputs

The goal is to find a model f that best predicts Y based on X:

Y = f(X)

Models are further divided:
* Classification: Predicting discrete class labels (categories); determining classes for inputs
* Regression: Predicting continuous values
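
The two kinds of supervised model can be sketched as follows. This is a minimal illustration on synthetic data, not part of the original training material: the data, the coefficients 3.0 and -2.0, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # observations: 200 samples, 2 features

# Regression: the target Y is a continuous value (synthetic, assumed here)
y_continuous = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
regressor = LinearRegression().fit(X, y_continuous)

# Classification: the target Y is a discrete class label (0 or 1)
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
classifier = LogisticRegression().fit(X, y_class)

print("regression coefficients:", regressor.coef_)
print("predicted classes for the first 5 samples:", classifier.predict(X[:5]))
```

Both estimators share the same `fit` / `predict` interface; only the type of the target differs.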

## Unsupervised Learning
We have observations X only, with no target Y.

The goal is to extract information about X, e.g., finding a representation or clustering the data.
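
As a small sketch of clustering, the example below runs k-means on two synthetic blobs. The data and the choice of k-means with two clusters are assumptions for illustration; no labels are passed to the model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each (synthetic, unlabeled)
X = np.vstack([
    rng.normal(loc=-5.0, size=(50, 2)),
    rng.normal(loc=5.0, size=(50, 2)),
])

# Only X is given to fit(); the cluster assignments are inferred from the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("cluster centers:")
print(kmeans.cluster_centers_)
```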

ML models are typically developed through some variation of the following steps:
* Parameter training
* Model evaluation
* Model selection
* Model generalization


# Supervised Learning
## Typical ML Pipeline

### Model fitting

Data exploration -> random data split (whole dataset), which yields two subsets:
* Training set -> parameter optimization (on training data only) -> trained model
* Test set -> evaluate the trained model on the test set
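
The flow above can be sketched with scikit-learn as follows. The synthetic dataset, the 75/25 split, and the choice of linear regression are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Random split of the whole dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Parameter optimization uses the training data only
model = LinearRegression().fit(X_train, y_train)

# The held-out test set is used once, to evaluate the trained model
test_r2 = model.score(X_test, y_test)
print("R-squared on the test set:", test_r2)
```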

1. Model (estimator) selection, e.g., linear regression, polynomial regression, multi-layer perceptron (ANN), etc.
2. Loss (cost) function, e.g., mean squared error, mean absolute error, max error, explained variance, R-squared, etc.
3. Optimization of the model parameters. For y = ax + b, the loss is a function of the parameters, Loss = f(a, b), and training searches for the values of a and b that minimize it.

Gradient Descent: 
* Initialize model parameters (a,b) randomly
* Iterate between:
    * Compute loss, e.g., MSE
    * Update model parameters (a,b) in the direction of the negative gradient of the loss, computed on the training data
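
The gradient descent loop can be sketched in plain NumPy for the model y = ax + b with an MSE loss. The synthetic data (true a = 2, b = 1), the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)  # assumed true a=2, b=1

a, b = rng.normal(size=2)  # initialize model parameters randomly
learning_rate = 0.1        # assumed step size

for step in range(2000):
    residual = (a * x + b) - y
    loss = np.mean(residual ** 2)        # compute the loss (MSE)
    grad_a = 2 * np.mean(residual * x)   # d(loss)/da
    grad_b = 2 * np.mean(residual)       # d(loss)/db
    a -= learning_rate * grad_a          # step against the gradient
    b -= learning_rate * grad_b

print("a =", a, "b =", b, "final loss =", loss)
```

The update subtracts the gradient because the gradient points toward increasing loss.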

### Model Validation

1. Use the test data to verify the model
2. Score function, e.g., mean squared error, mean absolute error, max error, explained variance, R-squared, etc.

### k-fold cross-validation
* Helps avoid overfitting
* Improves model generalizability
* Provides a more accurate performance measure
* Supports hyper-parameter tuning

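1. Train model parameters on the training set
2. Evaluate training with the validation set, which is a held-out subset of the training set; in k-fold cross-validation, each of the k folds serves as the validation set in turn
3. Report error on the test set only at the very end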
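
A minimal sketch of these three steps, assuming a synthetic classification dataset, logistic regression as the model, and a small hypothetical grid of values for its regularization strength `C`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold cross-validation on the training set only: each fold is used once
# as the validation set while the other four are used for training
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # assumed candidate values
    cv=5,
)
search.fit(X_train, y_train)

# The test set is touched only once, at the very end
test_accuracy = search.score(X_test, y_test)
print("best C:", search.best_params_["C"])
print("accuracy on the test set:", test_accuracy)
```

`GridSearchCV` refits the best configuration on the whole training set by default, so `search.score` evaluates that refitted model.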

### How many Training Epochs? 
* When the loss starts increasing on the validation set while it is still decreasing on the training set, that's where you should stop training (early stopping)
* Underfitting: A statistical model or a machine learning algorithm cannot capture the underlying trend of the data, i.e., it performs poorly on both training data and testing data -> Low variance but away from the target (high bias)
* Overfitting: A statistical model is said to be overfitted when it fits the training data too closely, noise included, so it performs well on training data but does not make accurate predictions on testing data -> Close to the target on average (low bias) but high variance
* Bias-Variance trade-off: expected error(val/test) = bias² + variance + irreducible noise
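
Underfitting and overfitting can be illustrated by varying model complexity instead of the number of epochs. The sketch below fits polynomials of degree 1, 4, and 15 to noisy samples of a cosine; the target function, noise level, and degrees are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):  # too simple, reasonable, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: train MSE {train_mse[degree]:.4f}, "
          f"test MSE {test_mse[degree]:.4f}")
```

The degree-1 model is expected to have a high error on both sets (underfitting). For the most flexible model, a gap between a low training error and a higher test error is the usual sign of overfitting.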

## How to improve models

### Dimensionality Reduction
* Applied after the train/test split and before parameter optimization, so the reduction is learned from the training set only
* Curse of dimensionality (e.g., more features than samples)
* Intrinsic dimension may actually be small (redundant data)
* Extract "salient" features
* Remove noisy features

1. Feature extraction/engineering
    * Compact representation of the data
    * Maps input features into a lower dimensional space
    * Linear: PCA, ICA, LDA
    * Nonlinear: Isomap, SOM, autoencoders, t-SNE
2. Feature selection
    * Selection of a subset of input features
    * Features are still in original space
    * Variable selection: t-statistic
    * Subset selection: min redundancy, max relevancy (mRMR)
    * Other criteria: BIC, consistency, etc.
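
The difference between the two approaches can be sketched with PCA (feature extraction) and univariate selection (feature selection). The synthetic dataset and the choice of 10 components / 10 features are assumptions. `f_classif` is an ANOVA F-test, which for two classes equals the square of the t-statistic.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=100, n_features=50, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature extraction: learn a 10-dimensional linear map on the training set,
# then apply the same map to the test set
pca = PCA(n_components=10).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Feature selection: keep the 10 original features with the highest F-score
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
kept = selector.get_support(indices=True)

print("PCA output shape:", X_train_pca.shape)
print("indices of the selected original features:", kept)
```

The PCA outputs are new combined features, whereas the selected columns are unchanged copies of the original ones.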

### Regularization
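* Penalties added to the loss function to prevent overfitting!
    1. L1/Lasso: penalizes the sum of absolute parameter values, which pushes parameters to be sparse (many exactly zero)
    2. L2/Ridge: penalizes the sum of squared parameter values, which pushes parameters to be small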
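
The effect of the two penalties on the fitted coefficients can be compared directly. This is a sketch on synthetic data with only 3 informative features out of 20; the `alpha` values are arbitrary assumptions, not tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0
)

ols = LinearRegression().fit(X, y)    # no penalty, for reference
lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=100.0).fit(X, y)  # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("coefficients set exactly to zero by Lasso:", n_zero_lasso)
print("coefficients set exactly to zero by Ridge:", n_zero_ridge)
print("coefficient norm, OLS vs Ridge:",
      np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```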

### Ensemble Methods
* Bagging: bootstrap resampling + aggregation
* Boosting: learners are trained sequentially and adaptively, each one focusing on the errors of the previous ones, to improve the predictions of the ensemble
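
Both ideas are available in scikit-learn. The sketch below uses `BaggingClassifier` (whose default base learner is a decision tree) and `GradientBoostingClassifier` as one possible boosting variant; the dataset and the number of estimators are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging: 50 trees, each fit on a bootstrap resample, predictions aggregated
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: 50 trees fit one after another, each correcting the current ensemble
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print("bagging mean CV accuracy:", bagging_scores.mean())
print("boosting mean CV accuracy:", boosting_scores.mean())
```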

## Classification
* Support Vector Machine (SVM) 
--------------------------
Probabilistic classifiers
* Artificial Neural Networks
* Logistic regression
* Decision Trees
* Random Forests 

### Performance Metrics - Binary Classification
|Prediction\True    |Positive       | Negative      |
|---                |---            |---            |
|Positive           |True positive  |False positive |
|Negative           |False negative |True negative  |

* Score function:
    1. Accuracy: (Tp+Tn)/(Tp+Tn+Fp+Fn)
    2. Precision: Tp/(Tp+Fp)
    3. Recall: Tp/(Tp+Fn)
    4. F1 score: 2Tp/(2Tp+Fp+Fn)
* Precision and Recall are appropriate when classes are imbalanced!
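
The score functions can be checked on a small hand-made example (the labels below are made up for illustration). Note that scikit-learn's `confusion_matrix` puts the true classes on rows and the predicted classes on columns, which is the transpose of the table above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # made-up ground truth
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # made-up predictions

# For labels (0, 1), ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# The hand-computed values agree with the scikit-learn helpers
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```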
----
Probabilistic classifiers
* Score function:
    1. ROC curve
    2. precision-recall curve -> Appropriate when classes are imbalanced!
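
For a probabilistic classifier, both curves are computed from the predicted scores and not from hard class predictions. The sketch below assumes a synthetic imbalanced dataset (about 10% positives) and logistic regression as the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Points of each curve, obtained by sweeping the decision threshold
fpr, tpr, _ = roc_curve(y_test, scores)
precision, recall, _ = precision_recall_curve(y_test, scores)

# Single-number summaries of the two curves
roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print("ROC AUC:", roc_auc)
print("average precision:", avg_precision)
```

The arrays `fpr`/`tpr` and `recall`/`precision` can be passed to any plotting library to draw the curves.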

### Multiclass Prediction
* To extend a binary metric to multiclass problems, the data is treated as a collection of binary problems, one for each class (that class versus all the others).
* The binary metric is then averaged across the set of classes. There are several ways to do this averaging (e.g., macro, micro, weighted), each of which may be useful in some scenario.
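
In scikit-learn, the averaging strategy is selected with the `average` parameter. The three-class labels below are made up for illustration.

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]  # made-up ground truth, 3 classes
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]  # made-up predictions

# One binary (one-vs-rest) precision per class
per_class = precision_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class values
macro = precision_score(y_true, y_pred, average="macro")
# Micro: pool true and false positives over all classes, then compute once
micro = precision_score(y_true, y_pred, average="micro")
# Weighted: mean of the per-class values, weighted by class support
weighted = precision_score(y_true, y_pred, average="weighted")

print("per class:", per_class)
print("macro:", macro, "micro:", micro, "weighted:", weighted)
```

Macro averaging treats every class equally, so rare classes count as much as frequent ones. Micro and weighted averaging give more influence to frequent classes.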