# Machine Learning: Overview of some concepts

## What do we do with data?

- What's the difference between unsupervised, semi-supervised, and supervised learning?
  - Labels for none (unsupervised), some (semi-supervised) or all (supervised) samples
- Why do we normalize data? Is it always necessary?
  - Normalization may be necessary so that all features are weighted equally
  - Necessary for models that rely on Euclidean distances or penalized coefficients (e.g. k-nearest neighbors, regularized regression); may not be necessary for models that look at one feature at a time (e.g. decision tree).
- How do we convert raw data into feature vectors?
  - Depends on data types
  - For example, conversion of text to vectors
- How do we select features?
  - Simple criteria that don't depend on outcomes (e.g. feature variance, removal of correlation such as PCA)
  - Criteria that depend on outcomes (e.g. top 10 by p-value ranking in univariate testing) must be incorporated into the pipeline - i.e. each cross-validation fold needs its own top 10, selected from that fold's training data only, to avoid biasing the evaluation
- How do we deal with categorical variables?
  - Use of integers implies ordering which may be misleading
  - Most common is the use of dummy or one-hot encoding
  - Dummy encoding results in $n-1$ columns for $n$ categories
  - One-hot encoding results in $n$ columns for $n$ categories - have to set intercept to zero to avoid collinearity
  - Collinearity means that one or more variables can be expressed as a linear combination of the others - why is this a problem?
- What is data augmentation?
  - Using synthetic data to increase size of training set
  - Most common in deep learning as the models have very high capacity
- What is an unbalanced data set?
  - Distribution of classes is non-uniform
  - Very, very common issue
- How do we deal with an unbalanced data set?
  - At a minimum, we need a baseline evaluation of the model that accounts for the imbalance - for example, 99% accuracy is not useful when 99% of the samples belong to a single class
  - Use an evaluation metric that is not sensitive to imbalance (e.g. [Kappa](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html))
  - Resample data set
  - Decision tree algorithms are less sensitive to imbalanced data sets
  - Penalize algorithm by increasing cost of mistake for minority classes
  - Generate synthetic samples (data augmentation)
  - Consider minor classes as single group and see if they can be detected with anomaly detection algorithms
  - See also the [imbalanced-learn package](https://github.com/scikit-learn-contrib/imbalanced-learn)
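
To make the point about baselines concrete, here is a minimal sketch in plain Python of the 99% example above: a "model" that always predicts the majority class reaches 99% accuracy, yet Cohen's kappa is zero because that agreement is no better than chance. The helper names `accuracy` and `cohen_kappa` and the toy labels are assumptions made for this illustration; in practice the linked scikit-learn `cohen_kappa_score` would normally be used instead.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal class frequencies
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Assumption for this sketch: return 0.0 in the degenerate case p_e == 1
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0

# Hypothetical data: 99 samples of class 0, 1 sample of class 1
y_true = [0] * 99 + [1]
y_pred = [0] * 100          # always predict the majority class

print(accuracy(y_true, y_pred))     # 0.99
print(cohen_kappa(y_true, y_pred))  # 0.0
```

Any real model on this data set should therefore be compared against the 0.99 accuracy of the trivial baseline, not against 0.5.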

## What types of models are there?

- What is a machine learning model?
  - An implicit or explicit function that takes a vector (features) as input and returns an integer (classification) or real number (regression)
- What is the XOR problem?
  - Find a function to separate two classes - class 0 at (0,0) and (1,1) and class 1 at (0,1) and (1,0)
- How does increasing dimensionality allow a linear model to solve the XOR problem?
  - For example, if we create an extra feature $\vert x^2 - y^2 \vert$, it takes a different value for each class, so we can solve the XOR problem with a linear model

| Class |  x  |  y  | $\vert x^2 - y^2 \vert$ |
| ----- | --- | --- | ----------------------- |
| 0     | 0   | 0   | 0                       |
| 1     | 0   | 1   | 1                       |
| 1     | 1   | 0   | 1                       |
| 0     | 1   | 1   | 0                       |
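
The table can be checked with a small brute-force sketch. It searches a coarse grid of weights for a linear rule `w . features + b > 0` that classifies all four points, first on $(x, y)$ alone and then with the extra feature added. The function name `separable` and the weight grid are assumptions for this illustration, and a grid search is only a demonstration device, not how linear models are normally fitted.

```python
from itertools import product

# XOR data as in the table: (x, y) -> class
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def separable(features, labels, grid):
    """Return (w..., b) from the grid that classifies every point, else None."""
    n = len(features[0])
    for *w, b in product(grid, repeat=n + 1):
        preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
                 for x in features]
        if preds == labels:
            return (*w, b)
    return None

grid = [-1, -0.5, 0, 0.5, 1]  # hypothetical coarse grid of candidate weights
augmented = [(x, y, abs(x**2 - y**2)) for x, y in points]

print(separable(points, labels, grid))                 # None
print(separable(augmented, labels, grid) is not None)  # True
```

The first search fails for any grid, not just this one: the two class-1 points require $w_1 + b > 0$ and $w_2 + b > 0$, while the class-0 points require $b \le 0$ and $w_1 + w_2 + b \le 0$, and these constraints contradict each other.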

- What's the difference between shallow and deep learning?
  - Shallow learning is a term used by deep learning practitioners to refer to classical machine learning techniques
  - Deep learning usually refers to neural networks with many layers
- What machine learning models do you know?
  - Some of the major classes are
    - Nearest neighbor
    - Linear models
    - Kernel methods such as SVM
    - Tree-based
    - Ensembles
    - Neural networks
- Can you briefly explain how each type of model works?
- What is ensemble learning?
- What is the difference between boosting and bagging?
  - Both train N machines and combine their outputs
  - Bagging (bootstrap aggregating) trains each machine on a bootstrap sample (sample with replacement) and averages the predictions over the machines (hence reducing variance and the risk of overfitting)
  - Random forest is like bagging, but we also select a random sample of features at each split - this forces the trees to be more different than with bagging
  - Boosting describes a group of methods that train machines sequentially, giving more weight to samples that earlier machines got wrong and, when combining, more weight to more effective machines
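
As a rough illustration of bagging, the sketch below draws bootstrap samples, fits one base learner per sample, and averages their predictions. The 1-nearest-neighbour base learner, the function names `fit_1nn` and `bagging`, and the toy data are all assumptions chosen to keep the example short; any high-variance learner could take its place.

```python
import random

def fit_1nn(xs, ys):
    """Base learner: 1-nearest-neighbour regressor on 1-D inputs."""
    data = list(zip(xs, ys))
    def predict(x):
        return min(data, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagging(xs, ys, n_machines=25, seed=0):
    """Train n_machines base learners on bootstrap samples; average them."""
    rng = random.Random(seed)
    n = len(xs)
    machines = []
    for _ in range(n_machines):
        # Bootstrap: draw n indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        machines.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    def predict(x):
        # Aggregate by averaging the individual predictions
        return sum(m(x) for m in machines) / len(machines)
    return predict

# Hypothetical noisy data
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
model = bagging(xs, ys)
print(model(2.5))
```

A random forest would additionally restrict each split to a random subset of features, and boosting would replace the independent bootstrap fits with a sequential, reweighted procedure; neither is shown here.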

## How do we select a model?

- What is a hyper-parameter?
  - A setting that is chosen before fitting rather than learned from the data; each value of a hyper-parameter defines a different member of the same class of machine learning algorithms. For example, the degree of a polynomial is a hyper-parameter while the coefficients are regular parameters
- What is the bias-variance trade-off?
  - Show derivation
![img](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)
- What is [regularization](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a)?
  - A statistical technique to reduce variance in the model by penalizing complexity, usually at the cost of some added bias
- Why is regularization necessary?
  - Reduces risk of over-fitting
  - May also perform feature selection (L1 regularization)
- How do we perform regularization?
  - L1 and L2 examples
  - L2: $\lambda \sum \beta_i^2$
  - L1: $\lambda \sum \vert \beta_i \vert$
  - Regularization as constrained optimization
  - Why does L1 result in sparsity?
- How and why do we perform cross-validation?
  - Usually, to estimate out-of-sample errors when doing model selection
  - $k$-fold cross-validation
- What is leave-one-out-cross-validation (LOOCV)?
  - Same as k-fold, when $k = n$  
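
The fold construction behind $k$-fold cross-validation can be sketched in a few lines. The generator name `k_fold_indices` is an assumption for this illustration, and the sketch splits indices in order without shuffling; real data is usually shuffled (or stratified) first.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold CV over n samples."""
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# 3-fold CV on 6 samples: each sample is in the test set exactly once
for train, test in k_fold_indices(6, 3):
    print(train, test)

# LOOCV is the special case k = n: every test fold holds a single sample
loocv = list(k_fold_indices(4, 4))
```

For each fold, the model (including any outcome-dependent preprocessing) is fitted on `train` and scored on `test`, and the scores are averaged to estimate the out-of-sample error.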

## How do we fit a model to data?

- What is model capacity?
  - Amount of complexity the model can encode
  - If capacity is high relative to the amount of data, there is a risk that the model just "memorizes" the training set and fails to generalize
- How do we use a loss function?
  - During model fitting, many ML algorithms choose the model parameters that minimize the loss function on the training data
- What types of loss functions are there?
  - MSE, MAE, cross-entropy
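
The three losses named above can be written out directly. This is a minimal sketch in plain Python; the function names and the clipping constant `eps` are assumptions for this illustration, and the cross-entropy shown is the binary form (labels 0 or 1, predictions as probabilities of class 1).

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than MSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))  # 2.0
print(mae([1.0, 2.0], [1.0, 4.0]))  # 1.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))  # log(2), about 0.693
```

MSE and MAE are typical for regression, while cross-entropy is typical for classification models that output probabilities.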

## How do we evaluate if our model is any good?

- How do we evaluate a model?
  - We need to define a performance metric
  - Make predictions on test set and calculate performance metric
- What is a confusion matrix?
  - A square matrix of counts with true classes on columns and predicted classes on rows (some libraries use the transposed convention)
- How do we define accuracy from a confusion matrix?
  - Trace of matrix over sum of all entries
- How do we construct ROC and PRC curves?
  - Plots to evaluate model for different values of a cutoff
  - See earlier [lectures](http://people.duke.edu/~ccc14/bios-823-2018/S10_Anomaly_Detection.html#Precision-recall-and-ROC-curves)
- What is a Bayes optimal classifier?
  - Theoretical classifier that achieves the Bayes error - i.e. all remaining error is due to intrinsic noise
- Why is it critical to evaluate our model on out-of-sample data?
  - Predictions of in-sample data can be due to memorization rather than generalization
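
The confusion matrix and the accuracy formula above can be sketched as follows, using the same orientation as these notes (predicted class on rows, true class on columns). The function names and the toy labels are assumptions for this illustration. Note that scikit-learn's `confusion_matrix` uses the transposed layout (true on rows); the accuracy is the same either way because the trace and the total do not change under transposition.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Counts with predicted class on rows and true class on columns."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[p][t] += 1
    return m

def accuracy(m):
    """Trace of the confusion matrix over the sum of all entries."""
    trace = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return trace / total

# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, 3)
print(cm)            # [[2, 0, 1], [1, 2, 0], [0, 0, 0]]
print(accuracy(cm))  # 4 correct out of 6
```

The off-diagonal entries show which classes are confused with which: here the single class-2 sample was predicted as class 0.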

## How do we improve ML performance?
  
- What is the Bayes error?
  - Irreducible error - error due to noise in system
- How do we estimate the Bayes error?
  - Generally not possible
  - If it is a problem that humans are good at solving, the best human error may be a good approximation of the Bayes error
- How can we diagnose underfitting?
  - Underfitting implies training error is much larger than the Bayes error
  - Generally seen as training error that keeps falling as we increase model capacity
- What can we do if we are underfitting?
  - Increase model complexity
- How do we diagnose overfitting?
  - Training error stays low (or keeps decreasing) while validation error increases as we increase model capacity
- What can we do if we are overfitting?
  - Increase data
  - Use some form of regularization (L1, L2, bagging, dropout)
  - Decrease model capacity
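
The diagnosis above amounts to tracking training and validation error while sweeping model capacity. The sketch below does this with $k$-nearest-neighbour regression, where a smaller $k$ means higher capacity; the choice of model, the synthetic data ($y = x^2$ plus Gaussian noise), and all function names are assumptions made for this illustration.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model fitted on `train`, scored on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    # Hypothetical data: y = x^2 plus Gaussian noise
    return [(x, x * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(n))]

train, valid = sample(40), sample(40)

# Capacity increases as k decreases
for k in (40, 20, 10, 5, 1):
    print(k, round(mse(train, train, k), 4), round(mse(train, valid, k), 4))
```

At $k = 40$ the model predicts the global mean everywhere and both errors are high (underfitting). At $k = 1$ the training error is exactly zero because each training point is its own nearest neighbour, which is memorization; the validation error is what reveals whether the model generalizes.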
