General Notes on Machine Learning
==============

Areas of follow-up:
 - Investigate "pre-training of initial embeddings with unsupervised learning methods such as autoencoders" to speed up training (due to the "vanishing gradient problem" of deep neural networks)
 - Would also like to learn more about this: "Autoencoders attempt to learn a useful layered representation of the data without having to backpropagate through a deep network; standard gradient descent is only used at the end to fine-tune the network."
 - Some interesting work on parameterized classifiers that approximate a likelihood ratio, which allows the parameters themselves to be estimated using the classifier. (Baldi P, et al. Eur. Phys. J. C 76:235 (2016))

Linear Models for Regression
=======

Simply put, a linear model combines the input variables linearly. Used as a classifier, it separates data in a parameter space by applying a linear cut, e.g.

$y(\mathbf x) = a_1 x_1 + a_2 x_2 + \dots + b > 0$

where the $>0$ threshold indicates a binary classification. For a simple case, take a binary output variable that depends on 2 input variables, e.g. points in a plane. Then this classifier draws a line in the 2D plane that separates the two classes as cleanly as possible.
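
A minimal sketch of the linear cut above, in plain Python. The weights, bias, and sample points are made up for illustration; in practice the coefficients would come from a fit.

```python
def linear_score(x, weights, bias):
    """Return y(x) = a_1 x_1 + ... + a_n x_n + b."""
    return sum(a * xi for a, xi in zip(weights, x)) + bias


def classify(x, weights, bias):
    """Binary decision: class 1 if the score is positive, else class 0."""
    return 1 if linear_score(x, weights, bias) > 0 else 0


# Hypothetical boundary: the line x_1 + x_2 = 1 in the plane.
weights, bias = [1.0, 1.0], -1.0
points = [(0.2, 0.3), (0.9, 0.8), (2.0, -0.5)]
labels = [classify(p, weights, bias) for p in points]
```

Points on the side of the line where the score is positive get label 1; everything else gets label 0.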

See **support vector machines** for a more complex version of this.


Logistic Regression
========

Used when your output variable is a probability between 0 and 1 (or a multi-class probability with the sum of classes adding to 1). It uses a **logistic (sigmoid) function** to map a linear combination of the inputs to a probability between 0 and 1. A logistic regression can have more than one input variable.

The logistic regression is really just a fit to data where y=0 or 1; it can be converted to a classifier by applying a threshold cut to the output logistic estimate.
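
As a rough illustration of the two steps described here (probability estimate, then threshold cut), the following sketch assumes the weights and bias have already been fitted. The function names and the default threshold of 0.5 are choices made for this example.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, weights, bias):
    """Estimated probability that the sample x belongs to class 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def predict_class(x, weights, bias, threshold=0.5):
    """Turn the probability estimate into a classifier with a threshold cut."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0
```

Raising the threshold trades recall for precision; 0.5 is only a conventional default.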

Terminology:
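 - **Logit**: The log-odds, $\ln\left(p/(1-p)\right)$. It is the inverse of the logistic function: the logistic function maps the whole real line to (0, 1), and the logit maps (0, 1) back to the real line.
 - **Probit**: Plays the same role as the logit, but is based on the inverse CDF of a standard Gaussian.
 - **Odds ratio**: The odds of an outcome are $p/(1-p)$; the odds ratio is the ratio of the odds under two conditions. In a logistic regression, $e^{a_i}$ is the odds ratio for a one-unit increase in $x_i$.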
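
A small sketch of these terms using only the Python standard library. It takes the logit as the log-odds (the inverse of the logistic function) and the probit as the inverse CDF of a standard Gaussian; the helper names are made up for this example.

```python
import math
from statistics import NormalDist


def odds(p):
    """Odds of an outcome with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds: maps a probability in (0, 1) to the real line."""
    return math.log(odds(p))


def probit(p):
    """Inverse CDF of the standard normal: also maps (0, 1) to the real line."""
    return NormalDist().inv_cdf(p)


def odds_ratio(p1, p2):
    """Ratio of the odds under two conditions."""
    return odds(p1) / odds(p2)
```

Both `logit(0.5)` and `probit(0.5)` are zero; the two functions differ mainly in how fast they grow in the tails.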


Naive Bayes and Likelihood Ratio
========

See statisticalMethods -- these assume that your input variables are conditionally independent given the class (a stronger condition than being uncorrelated).


Support Vector Machines (SVM)
==========

A method of regression or classification that finds the boundary maximizing the margin (distance) between two classes of labeled data, optionally after mapping the data onto a larger parameter space, and uses this boundary as a basis for classification of unlabeled data.

In the simplest case, a line divides two classes in a plane, or a plane divides two classes in a 3D space. By introducing **kernels** (see statisticalMethods), the boundary can assume complex shapes in the original space (while still being a hyperplane in the higher-dimensional space defined by the kernel).

The **support vectors** of the SVM are the data points closest to the boundary between the two classes, which drive the definition of the boundary itself.

SVMs can also be used in the context of clustering.

**Scaling / normalization preprocessing** is an important step for SVMs, which otherwise do not do very well.

Trees and Forests
========

Metrics: Gini Impurity and Entropy
-------

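The **Gini impurity** is a decision-tree metric; it is the probability of misclassifying a sample in a given region if the sample is labeled at random according to the class fractions in that region. For a problem with $J$ classes, where $p_i$ is the fraction of class $i$ in the region, it is given by:

$I_G(p) = \sum_{i=1}^J \left( p_i \sum_{k\neq i} p_k\right) = \sum_{i=1}^J p_i (1 - p_i) = 1 - \sum_{i=1}^J p_i^2$

(as you can see, it is one minus the sum of the squared class fractions, i.e. one minus the probability of a correct random assignment.)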
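
To make the formula concrete, here is a short sketch that computes the Gini impurity of a list of class labels, with the entropy (in bits) alongside for comparison. The function names are chosen for this example.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Fraction p_i of each class present in the region."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini_impurity(labels):
    """1 - sum_i p_i^2."""
    return 1.0 - sum(p * p for p in class_fractions(labels))


def entropy(labels):
    """-sum_i p_i log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in class_fractions(labels) if p > 0)
```

Both metrics are zero for a pure region and largest for an even mix of classes (0.5 and 1 bit, respectively, for two balanced classes).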



According to (??), **Gini impurity** is used for classification problems, while **entropy** is used for exploratory analyses (??)

Decision Tree
--------

Find the most powerful variable and apply the most discriminating cut; in each of the two resulting populations, repeat the process with whichever variable and cut is most discriminating there, and so on. You can **prune** (remove nodes after the tree is built) or **pre-prune** (stop growing the tree early).
 - **Pre-pruning parameters**: `max_depth`, `max_leaf_nodes`, `min_samples_leaf`
 - **Feature importance**: A number between 0 and 1 for each feature, which gives a ranking of the most powerful features. Note that if two variables are highly correlated, one of them may get a lower feature importance than you might expect.
 - **No preprocessing necessary**: Normalization and standardization are not required!
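
The "most discriminating cut" step can be sketched for a single variable as a scan over candidate thresholds, keeping the one with the lowest weighted Gini impurity. This is a simplified illustration with made-up data, not a full tree builder; a real tree repeats the scan over all variables and recurses into each side.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best cut 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no cut at all is the baseline
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot cut between identical values
        threshold = (lo + hi) / 2.0
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, labels)
```

Because only the ordering of the values matters for the cut, rescaling a variable does not change the result, which is why no normalization is needed.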
 

Random Forest (a case of Bootstrap Aggregation or BAgging)
---------

To correct for **over-fitting** in a single decision tree, create an ensemble of trees and insert some randomness into their creation (train each tree on a bootstrap sample of the data points; consider only a random subset of the variables). Then average the collection of results.

 - `n_estimators`: Number of trees
 - `max_features`: Maximum number of features considered when looking for the best split (in scikit-learn this subset is redrawn at every split, not once per tree).
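
The bootstrap-and-average idea can be sketched with a toy base learner. This example covers only the bagging half: it uses one-variable threshold "stumps" instead of full trees, so there is no random feature selection. The helper names and data are invented for illustration.

```python
import random


def fit_stump(xs, ys):
    """Toy base learner: best threshold t for the rule 'predict 1 if x > t'."""
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        err = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict(stumps, x):
    """Majority vote over the ensemble."""
    votes = sum(1 if x > t else 0 for t in stumps)
    return 1 if 2 * votes > len(stumps) else 0


xs = [0.1, 0.4, 0.5, 0.9, 1.5, 1.8, 2.2, 2.6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = fit_bagged_stumps(xs, ys)
```

Individual stumps differ because each sees a different resampling of the data; the vote smooths out those differences.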

Boosted Decision Tree (e.g. Gradient Boosted)
---------

 - **Not good for sparse, high-dimensional data**
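 - Generally faster prediction than random forests, since the individual trees are shallow
 - `learning_rate`: how strongly each new tree corrects the errors of the previous trees (the step size of the gradient descent).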
 - `n_estimators`: Number of trees
 - `max_depth` (of individual trees): Usually keep this at 5 or lower.
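
A rough sketch of gradient boosting for a regression with squared-error loss, where the negative gradient is simply the residual. Each round fits a depth-1 tree (a stump) to the current residuals and adds it with weight `learning_rate`. The helper names and data are assumptions for this example; real implementations support other losses and deeper trees.

```python
def fit_stump(xs, targets):
    """Regression stump on one variable: threshold and leaf means minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]


def stump_value(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def fit_boosted(xs, ys, n_estimators=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted(xs, ys)
```

A smaller `learning_rate` needs more trees to reach the same training error, which is why the two parameters are usually tuned together.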

AdaBoost
----------

Achieves boosting by increasing the **weight** of misclassified events before training the next learner, so that later learners focus on them.
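
One round of the reweighting can be sketched as follows, assuming the standard discrete AdaBoost update with labels in {-1, +1}, weights that sum to 1, and a weighted error strictly between 0 and 1. The function name is made up for this example.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one boosting round."""
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Weight of this learner in the final vote.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified events (y * h = -1) are scaled up, correct ones scaled down.
    new = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]


weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last event is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After the update the misclassified events carry half of the total weight, so the next learner cannot do well by ignoring them.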



Neural Networks
==========

Multilayer Perceptron (MLP)
---------

A stack of **hidden layers** connected by **weights**; each node linearly combines its inputs and applies a nonlinear **activation function**.

Activation functions:
 - **Rectified Linear Unit (ReLU)**
 - **tanh** (hyperbolic tangent)

**Regularization**: you can regularize a NN by constraining the weights to be close to 0.<br>
**Scaling**: Neural nets work better if you scale / normalize the data.
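
A forward pass through a tiny MLP, as a sketch of "linearly combine, then apply an activation function". The weights here are hand-picked, hypothetical values; a trained network would learn them by backpropagation.

```python
import math


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """Feed x through each (weights, biases, activation) layer in turn."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# 2 inputs -> 2 hidden units (ReLU) -> 1 output (tanh)
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, 2.0]], [0.0], math.tanh),
]
output = mlp_forward([2.0, 1.0], layers)
```

Without the activation functions the two layers would collapse into a single linear map, which is why the nonlinearity matters.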



Convolutional Neural Networks
----------
 - Convolutional architecture: a filter window is swept over the (spatially-related) input
 - Pooling: that filtered information is summarized (e.g. takes the maximum or the average)
 - Normalization Layers
 - Residual Layers
 - Graph convolutional networks (as opposed to tree networks)
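
The convolution and pooling steps from the list above can be sketched in one dimension. The signal and the two-element filter are invented for this example; image networks apply the same idea with 2D windows and learned filters.

```python
def conv1d(signal, kernel):
    """Sweep the filter over the input ('valid' cross-correlation, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0, 0, 1, 1, 1, 0, 0, 0]
edge_filter = [-1, 1]  # responds to rising edges
feature_map = conv1d(signal, edge_filter)
pooled = max_pool1d(feature_map, 2)
```

The filter output is positive where the signal steps up and negative where it steps down; pooling keeps the strongest response in each window and halves the length.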


Recurrent Neural Networks
-----------

 - Long short-term memory (LSTM)
 - Gated recurrent unit (GRU)

Adversarial Neural Networks
-------------

Preprocessing
-------

Validation and Metrics
============

Loss Functions
--------
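 - Hinge Loss: $\max(0, 1 - t\,y)$ for a true label $t = \pm 1$ and a raw classifier output $y$; this is the loss used by SVMs.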
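
For reference, a sketch of the hinge loss in its usual form, $\max(0, 1 - t\,y)$, with the true label $t$ in {-1, +1} and $y$ the raw (unthresholded) classifier score. The function name is chosen for this example.

```python
def hinge_loss(y_true, score):
    """Zero when the sample is on the correct side with margin >= 1, linear otherwise."""
    return max(0.0, 1.0 - y_true * score)
```

A correctly classified sample still incurs a loss if its score lies inside the margin, e.g. `hinge_loss(1, 0.5)` is 0.5.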


Metrics
-------
 - **F1**: The **harmonic mean** of the precision and recall
 - **Accuracy**:
 - **Confusion Matrix**:


Cross-validation
-------

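 - **$k$-folds**: Split the data into $k$ folds of (nearly) equal size; train on $k-1$ folds and evaluate on the held-out fold, rotating so that each fold is used for testing exactly once, then average the $k$ scores.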
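
A minimal sketch of how $k$-fold cross-validation partitions the sample indices: each fold serves once as the test set while the remaining folds form the training set. The function name is made up for this example, and the folds are contiguous; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

The model is trained and scored once per fold, and the scores are averaged to estimate performance on unseen data.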

Other Topics
===========

**Terminology**
 - What is **deep learning**: simply: multi-layer neural networks
 - **Training and Inference**: "inference" here just means applying the trained model to new data to make predictions (as is also done when testing).
 - **Type I vs Type II error**: Type-I is false positive; Type-II is false-negative.
 - **Gini impurity**: A measure of the fraction of misclassified events in a given region. Used for optimizing (B)DTs
 - **Entropy**:
 - **Information gain**:
 - **Uncertainty** (from a classifier): The probability value assigned to a classified object by the classifier (how sure is the classifier)
 - **Calibrated** (of a classifier): The probabilities assigned should match reality (think Nate Silver)
 - **Precision**: $n_\text{tp}/(n_\text{tp} + n_\text{fp})$ (true positives / all positive test results)
 - **Recall (sensitivity)**: $n_\text{tp}/(n_\text{tp} + n_\text{fn})$ (true positives / all truly positive elements in the population)
 - **F1 score**: Harmonic mean of the precision and recall, i.e. $\left(\frac{p^{-1} + r^{-1}}{2}\right)^{-1}$
 - **Brier score**: pretty much a mean squared error for classification problems, e.g. with probabilities for binary classification. $BS = \frac{1}{N}\sum_i (r_i - p_i)^2$, where $p_i$ is the predicted probability and $r_i$ the actual outcome (0 or 1); it runs between 0 and 1, and lower is better.
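
The metrics in this list can be computed directly from confusion-matrix counts. The sketch below assumes the standard definitions (precision and recall both have the true positives in the numerator); the function names and counts are illustrative only.

```python
def precision(tp, fp):
    """Fraction of positive test results that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truly positive elements that test positive."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2.0 * p * r / (p + r)


def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((r - p) ** 2 for r, p in zip(outcomes, probabilities)) / len(outcomes)
```

For example, with 8 true positives, 2 false positives, and 8 false negatives, the precision is 0.8, the recall is 0.5, and the F1 score is about 0.62, closer to the smaller of the two.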

**Multicollinearity** in multiple linear regression: 
 - Effectively introduces a degeneracy in the solution (e.g. if $x_2 \propto x_1$, then `y=a*x1+b` and `y=d*x2+b` are equally valid solutions)
 - Inflates the variance of the fitted coefficients, which become very sensitive to small changes in the data
 - May cause the matrix $X^T X$ to be non-invertible (less than full rank)
 - May occur when your inputs have not been properly studied or cleaned for redundancy
 - Statistical tests exist to identify these issues
 - Solutions:
   - Remove one of the variables; start simpler.
   - Principal component analysis (PCA) to orthogonalize your variables
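
The non-invertibility can be seen in a two-variable toy case: when one input is an exact multiple of the other, the determinant of $X^T X$ vanishes, so the least-squares normal equations have no unique solution. The data and helper names are made up for this example.

```python
def gram_matrix(x1, x2):
    """X^T X for a two-column design matrix (no intercept, for brevity)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    return [[s11, s12], [s12, s22]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


x1 = [1.0, 2.0, 3.0, 4.0]
x2_collinear = [2.0 * a for a in x1]      # perfectly collinear with x1
x2_independent = [1.0, -1.0, 1.0, -1.0]   # not a multiple of x1

det_collinear = det2(gram_matrix(x1, x2_collinear))
det_independent = det2(gram_matrix(x1, x2_independent))
```

With nearly (rather than exactly) collinear inputs the determinant is small but nonzero, and the fitted coefficients become unstable instead of undefined.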

**Curse of Dimensionality**: Techniques to avoid it include principal component analysis; clustering, ... ?

**What do people mean by the "Bias vs Variance" tradeoff**: Generally this comes up in the context of overtraining, i.e. picking too many parameters for a fit. A fit with too few parameters has a high bias (it systematically misses structure in the data). A fit with too many parameters will have an unnaturally low bias on the training data, but a very high variance: it changes a lot with the particular training sample and performs poorly on a testing data set. Overtraining can be avoided using e.g. F-tests (for e.g. a regression), or using techniques to avoid over-fitting the data (see learning curves, validation/test sets, etc.)

**What is a generative vs a discriminative model?**
   - **Generative models** model the joint probability density function $p(\mathbf{x},y)$, so they can be used to generate inputs $\mathbf{x}$ given an output $y$ (like cat $\rightarrow$ cat image). **Discriminative models** model the conditional probability $p(y|\mathbf{x})$ directly; they are typically slower to converge but eventually outperform generative models given enough data. (Note that if no probability model is used, then the model is generally considered discriminative.)

**Question: what's the hardest part about Machine learning**:
 - Knowing your inputs is probably the most important element (bad stuff in $\rightarrow$ bad stuff out)
 - Knowing which methods to apply in which situations

**Question: what is your biggest weakness**:
 - Perhaps obvious: I am coming from a **different field**. But of course this can be a strength as well: injecting outside perspective; have seen a huge variety of techniques and solutions; understand the importance of domain knowledge.

**My questions for them**:
 - What is the makeup/organization of the team? How do people generally work in the team - together, separate...?
 - What is the balance of focus, between engineering/power optimization and "business intelligence?"
 - What machine learning tools do you find yourself using the most in this job?
 - What other topics besides machine learning are part of the job? How do you typically spend your days?

**Regularization**: 
 - Generally speaking, methods that can be used to **reduce overfitting**.
 - **Ridge** (L2): Add a penalty term on the squared coefficients to the least squares metric, to prefer coefficients close to 0. (This can fix the issue of multicollinearity, for instance.)
 - **Lasso** (L1): Add a penalty term on the absolute values of the coefficients, which drives some weights to exactly zero, effectively dropping input variables.
 - **Dropout**: Randomly drop nodes of a NN during training, so that the network cannot rely on any single node or on co-adapted groups of nodes.
 - **Data augmentation**: Increase the size of the training data, e.g. by adding transformed copies of existing samples.
 - **Early stopping**: Stop training when the performance on a validation set starts to get worse.
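
The difference between the Ridge and Lasso penalties can be sketched for a one-variable fit without intercept, where both have closed-form solutions. The objective scalings in the docstrings are assumptions of this example (conventions vary between libraries), as are the function names and data.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| (soft thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]  # unregularized least-squares weight is 0.1
w_ridge = ridge_weight(xs, ys, 14.0)
w_lasso = lasso_weight(xs, ys, 2.0)
```

Ridge shrinks the weight toward zero but never reaches it, while Lasso sets it to exactly zero once the penalty exceeds the correlation between input and output.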


Bibliography
================
 - 'Deep Learning and Its Application to LHC Physics'. https://arxiv.org/pdf/1806.11484.pdf