# Pre-Processing

- Numeric
    - Zero mean, unit variance: $x'= (x-\mu)/ \sigma$
    - In interval $[0,1]: x'=(x -\min)/(\max - \min)$
- Categorical
    - Encoded so that no ordering is implied. For example, if three classes apple, orange and banana are encoded as $1,2,3$ respectively, it looks as if apple comes before orange, which is not meaningful. One-hot encoding avoids this.
    - Only equality testing is meaningful.
- Ordinal
    - Encoded as numbers to preserve ordering
    - $\le, \ge$ operations meaningful
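
The encodings above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`standardize`, `min_max_scale`, `one_hot`, `ordinal_encode`) and the toy data are assumptions made for the example, and constant-valued features (where $\sigma = 0$ or $\max = \min$) are not handled.

```python
from statistics import mean, pstdev

def standardize(xs):
    """Zero mean, unit variance: x' = (x - mu) / sigma."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs):
    """Rescale to the interval [0, 1]: x' = (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def one_hot(values):
    """Categorical: one indicator column per class, so no order is implied."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values]

def ordinal_encode(values, order):
    """Ordinal: integer codes that preserve the given ordering."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

# Toy data (hypothetical)
heights = [150.0, 160.0, 170.0, 180.0]
fruits = ["apple", "orange", "banana", "apple"]
sizes = ["small", "large", "medium"]

z = standardize(heights)
u = min_max_scale(heights)
fruit_vectors = one_hot(fruits)  # columns: apple, banana, orange
size_codes = ordinal_encode(sizes, order=["small", "medium", "large"])
```

With one-hot vectors only equality between categories is expressible, whereas the ordinal codes can be compared with $\le$ and $\ge$.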

# Feature Extraction from Data

- Images
	- Pixel values, Segment and extract features, Handcrafted features: HOG, SIFT, etc.
	- Deep learned features!
- Text
	- Bag of words, Ngrams
	- Deep learned features!
- Speech
	- Mel Frequency Cepstral Coefficients (MFCCs), Other frequency based features
	- Deep learned features!
- Time varying sensor Data
	- Statistical and moment based features (mean, variance, etc.)
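
As a small illustration of the text features listed above, the sketch below counts unigrams and bigrams in a sentence. The function names and the whitespace tokenizer are assumptions for the example; real pipelines usually apply more careful tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, n_max=2):
    """Count every n-gram for n = 1..n_max (n = 1 is the bag of words)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts

features = bag_of_ngrams("the cat sat on the mat")
```

Word order is lost in the unigram counts, while the bigrams keep a little local context such as `"the cat"`.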

# Challenges

- Structured input/Structured output
	- One fix: Attribute = root-to-leaf paths
- Missing data
	- Fix: Fill in the value, Introduce special label, remove instance, remove attribute, Use classifiers that can handle missing values
- Outliers
	- Fix: Remove, Threshold, Visualize!
- Data assumptions
	- Generated how? Sources?
	- Smooth? Linear? Noise?
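
Two of the fixes listed above, filling in a missing value and thresholding outliers, might look like the following sketch. Mean imputation and the clipping bounds are illustrative choices only, and the helper names are hypothetical.

```python
from statistics import mean

def impute_mean(xs):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in xs]

def clip(xs, lo, hi):
    """Threshold outliers by clamping every value into [lo, hi]."""
    return [min(max(x, lo), hi) for x in xs]

readings = [2.0, None, 4.0, 60.0]        # toy data with a gap and an outlier
filled = impute_mean(readings)
clipped = clip(filled, lo=0.0, hi=10.0)
```

Note that the outlier inflates the imputed mean here, which is one reason to visualize the data before choosing a fix.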

# Class Imbalance

Almost all classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.

As a result, examples from the majority class are well classified whereas examples from the minority class tend to be misclassified.

- Are all classifiers sensitive to class imbalance?
    - Decision Trees: very sensitive to class imbalance. This is because the algorithm works globally, not paying attention to specific data points.
    - Multi-Layer Perceptrons (MLPs): less prone to the class imbalance problem. This is because of their flexibility: their solution gets adjusted by each data point in a bottom-up manner as well as by the overall data set in a top-down manner.
    - Support Vector Machines (SVMs): even less prone to the class imbalance problem than MLPs, because they are only concerned with a few support vectors, the data points located close to the boundaries.

## Solution

- Collect more data!
- Change your performance metric:
	- Confusion Matrix, Precision Recall, F1 score, etc.
- Resample dataset
- Generate synthetic samples
- Try penalized models
- Try a different perspective: anomaly/change detection
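
To see why the performance metric matters, the sketch below scores a classifier that always predicts the majority class on a hypothetical 95/5 split. The function names and the data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) with `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [0] * 95 + [1] * 5   # 5% minority class
y_pred = [0] * 100            # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Accuracy looks good at 0.95, yet recall and F1 on the minority class are both 0, which is exactly the failure the error rate hides.
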
<br><br>
- At the data level: Re-sampling
	- Oversampling (Random or Directed)
	- Under-sampling (Random or Directed); not good for model performance
	- Active Sampling
- At the Algorithmic Level:
	- Adjusting the Costs
	- Adjusting the decision threshold / probabilistic estimate at the tree leaf
<br><br>
- Under-sampling (random and directed) is not effective and can even hurt performance.
- Random oversampling helps quite dramatically. Directed oversampling makes a bit of a difference by helping slightly more.
- Cost adjusting is about as effective as Directed oversampling. Generally, however, it is found to be slightly more useful.
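
Random oversampling can be sketched as below: minority examples are duplicated at random until every class is as large as the biggest one. The function name, the fixed seed and the toy data are assumptions for the example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))  # sample with replacement
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Resampling should be applied to the training split only, so that duplicated examples do not leak into the evaluation data.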


# SMOTE

SMOTE = Synthetic Minority Oversampling Technique

- For each minority example $k$, compute its $5$ nearest minority class examples $(i,j,l,n,m)$
- Randomly choose one example, say $i$, out of these $5$ closest points.
- Synthetically generate a new example $k_1$ such that $k_1$ lies between $k$ and $i$
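
A minimal sketch of this procedure follows. It assumes Euclidean distance and one synthetic point per minority example, and it places the new point uniformly at random on the segment between the two examples. The function name `smote` and its parameters are chosen for this example and are not a library API.

```python
import math
import random

def smote(minority, k=5, per_point=1, seed=0):
    """Generate synthetic minority examples by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for idx, base in enumerate(minority):
        # k nearest minority-class neighbours of `base`, excluding itself
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(per_point):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                [b + gap * (n - b) for b, n in zip(base, neighbour)]
            )
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
            [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]]
new_points = smote(minority)
```

Because each synthetic point is a convex combination of two minority examples, it stays inside the region already occupied by the minority class.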


# Using Large Datasets

- At large data scales, the performance of different algorithms converges such that performance differences virtually disappear.
- Given a large enough data set, the algorithm you'd want to use is the one that is computationally less expensive.
- It's only at smaller data scales that the performance differences between algorithms matter.
- CPUs vs GPUs
	- Deep learning has greatly benefited from GPUs
- MapReduce/Hadoop, Apache Spark, Vowpal Wabbit frameworks
	- Many learning algorithms amenable to partitioning of computations

# Generalization Error

- Components of generalization error
- Bias: how much the average model over all training sets differs from the true model
	- Error due to inaccurate assumptions/simplifications made by the model
- Variance: how much models estimated from different training sets differ from each other

- MSE in terms of bias and variance<br><br>
  $$\color{blue} \text{MSE}=\color{red} \underbrace{\text{Bias}^2}_{\text{error due to incorrect assumption}} + \color{green} \underbrace{\text{Variance}}_{\text{error due to variance in training}} + \color{purple} \underbrace{\text{Noise}}_{\text{Unavoidable error}} \tag{1}$$
  <br><br>
  Consider True function $q$ and Estimator $f_i = f(X_i)$ on sample $X_i$<br>
  Bias$^2$ = ${\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack -q\right)}^2$<br>
  Variance = $\mathit{\mathbf{E}}\left\lbrack {\left(f-\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2 \right\rbrack$<br>
  $\mathit{\mathbf{E}}\left\lbrack {\left(f-q\right)}^2 \right\rbrack ={\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack -q\right)}^2 +\mathit{\mathbf{E}}\left\lbrack {\left(f-\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2 \right\rbrack$<br>
  Proof:<br>
  __LHS__: <br>
  $$\begin{align*}{}
  \text{LHS} &=\mathit{\mathbf{E}}\left\lbrack {\left(f-q\right)}^2 \right\rbrack \\
  &=\mathit{\mathbf{E}}\left\lbrack f^2 +q^2 -2f\cdot q\right\rbrack \\
  &=\mathit{\mathbf{E}}\left\lbrack f^2 \right\rbrack +q^2 -2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot q
  \end{align*}$$
  __RHS__:
  $$\begin{align*}{}
  \text{RHS} &={\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack -q\right)}^2 +\mathit{\mathbf{E}}\left\lbrack {\left(f-\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2 \right\rbrack 
  \\
  &={\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2 +q^2 -2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot q+\mathit{\mathbf{E}}\left\lbrack f^2 +{\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2 -2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot f\right\rbrack \\
  &=\cancel{{\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2} +q^2 -2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot q+\mathit{\mathbf{E}}\left\lbrack f^2 \right\rbrack +\cancel{{\left(\mathit{\mathbf{E}}\left\lbrack f\right\rbrack \right)}^2} -\cancel{2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack} \\
  &=\mathit{\mathbf{E}}\left\lbrack f^2 \right\rbrack +q^2 -2\cdot \mathit{\mathbf{E}}\left\lbrack f\right\rbrack \cdot q
  \end{align*}$$
  LHS=RHS hence proved
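
The identity can also be checked numerically. The sketch below simulates a hypothetical estimator $f$ with a systematic offset and Gaussian scatter around a true value $q$ (both invented for the example), and replaces each expectation with an average over the simulated estimates.

```python
import random
from statistics import mean

rng = random.Random(0)
q = 2.0  # true value (hypothetical)

# Simulated estimates: offset of 0.5 (bias) plus unit-variance scatter
estimates = [q + 0.5 + rng.gauss(0.0, 1.0) for _ in range(1000)]

mean_f = mean(estimates)
mse = mean((f - q) ** 2 for f in estimates)
bias_sq = (mean_f - q) ** 2
variance = mean((f - mean_f) ** 2 for f in estimates)

gap = abs(mse - (bias_sq + variance))  # zero up to floating-point rounding
```

The two sides agree up to rounding because the identity holds for any distribution, including the empirical distribution of the simulated estimates.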




## Bias variance tradeoff

From equation $(1)$ we can see that, for a fixed MSE, reducing the variance forces the bias to increase, and vice versa.

- Models with too few parameters are inaccurate because of a __large bias__ (not enough flexibility).
- Models with too many parameters are inaccurate because of a __large variance__ (too much sensitivity to the sample).
- __Underfitting__: Model is too “simple” to represent all the relevant class characteristics
	- High bias and low variance
	- High training error and high test error
- __Overfitting__: Model is too “complex” and fits irrelevant characteristics (noise) in the data
	- Low bias and high variance
	- Low training error and high test error

:::{.callout-tip}
In case of classification, variance dominates bias. Very roughly, this is because we only need to make a discrete decision
rather than get an exact value.
:::

# Measuring Bias and Variance

- Create multiple training sets using bootstrap replicates.
- Apply the learning algorithm to each replicate to obtain a hypothesis.
- Compute the predicted value of each hypothesis on the data points that did not appear in the bootstrap replicate it was trained on.
- Compute the average prediction.
- Estimate the bias.
- Estimate the variance.
- Assume the noise is $0$.
    - If we have multiple data points with the same $x$ value, then we can estimate the noise, but this is generally not available in machine learning.
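
The procedure above can be sketched as follows, using squared error and assuming zero noise. The toy learner `fit_mean` (which always predicts the mean training target), the number of replicates and the data are hypothetical choices for illustration.

```python
import random
from statistics import mean

def fit_mean(train):
    """Hypothetical toy learner: always predict the mean training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg

def bootstrap_bias_variance(data, fit, n_replicates=200, seed=0):
    rng = random.Random(seed)
    n = len(data)
    oob_preds = [[] for _ in range(n)]  # out-of-bag predictions per point
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap replicate
        model = fit([data[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # predict only on points not trained on
                oob_preds[i].append(model(data[i][0]))
    bias_sq, variance = [], []
    for (x, y), preds in zip(data, oob_preds):
        if not preds:
            continue
        avg = mean(preds)                 # average prediction
        bias_sq.append((avg - y) ** 2)    # noise assumed to be 0
        variance.append(mean((p - avg) ** 2 for p in preds))
    return mean(bias_sq), mean(variance)

data = [(x / 10, 2.0 * (x / 10)) for x in range(20)]  # noiseless y = 2x
bias_sq, variance = bootstrap_bias_variance(data, fit_mean)
```

For this deliberately too-simple learner the estimated squared bias is much larger than the variance, which is the underfitting pattern.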

# Some Inferences

- How to reduce variance of classifier
	- Choose a simpler classifier
	- Regularize the parameters
	- Get more training data
- Training Error and Cross Validation
	- Suppose we use the training error to estimate the difference between the true model prediction and the learned model prediction.
	- The training error is _downward biased_: on average it _underestimates_ the generalization error.
	- Cross validation is nearly unbiased; it _slightly_ _overestimates_ the generalization error.
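
The gap between training error and cross-validation error is easy to see with a 1-nearest-neighbour classifier, which memorizes its training set. The sketch below uses a hypothetical 1-D dataset in which every fifth label is flipped as deterministic label noise. All names are chosen for this example.

```python
def one_nn_predict(train, x):
    """Label of the training point closest to x."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def error_rate(train, test):
    return sum(one_nn_predict(train, x) != y for x, y in test) / len(test)

def k_fold_error(data, k=5):
    """Average held-out error over k folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i
                 for item in fold]
        errors.append(error_rate(train, folds[i]))
    return sum(errors) / k

# x = 0..99, label 1 when x >= 50, with every fifth label flipped
data = [(i, int(i >= 50) ^ int(i % 5 == 0)) for i in range(100)]

train_error = error_rate(data, data)  # each point is its own nearest neighbour
cv_error = k_fold_error(data)
```

The training error is exactly 0 because the model has seen every point, while the cross-validation error reflects the label noise.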

# Regularizers

- KNN
	- Choose higher $k$
- Decision Trees
	- Pruning
- Naïve Bayes
	- Parametric models automatically act as regularizers
- SVMs
	- Control the $C$ value

:::{.callout-tip}
If the weight values are large the model is likely to overfit, so we want the weights to be small. Consider a dataset in $2$-D having large variance along the $y$-axis, and a model that overfits this data. To do so, $W$ has to be large: for a very small change in $x$, $y$ has to change by a very large amount (because of how the data is distributed), which can be achieved only if $W$ is large. This is shown in the picture below.
:::
<br>
![](CS5590_images/mspaint_AEgund0GVg.png)<br><br>

# Model-based Machine Learning

- pick a model
- pick a criterion to optimize (aka objective function)
- develop a learning algorithm (aka find $W$ and $b$ that minimize the loss)
- Generally, we don’t want huge weights
	- If weights are large, a small change in a feature can result in a large change in the prediction
	- Also, can give too much weight to any one feature

## Regularization in Model based ML

- A regularizer is an additional criterion added to the loss function to make sure that we don’t overfit
- It’s called a regularizer since it tries to keep the parameters more normal/regular
- It is a bias (inductive bias) on the model that forces the learning to prefer certain types of weights (smaller) over others (larger).
  $$\operatorname*{arg\,min}_{w,b} \sum_{i=1}^n \mathrm{loss}\left(y_i,y_i^{\prime } \right)+\lambda \cdot \boxed{\mathrm{regularizer}\left(w,b\right)}$$
- Types of norm regularizer
  - 1-norm (sum of absolute weights)
    $$r(w,b)=\sum_{w_j} \left\lvert w_j \right\rvert $$
  - 2-norm (square root of the sum of squared weights)
    $$r(w,b)=\sqrt{\sum_{w_j} \left\lvert w_j \right\rvert^2} $$
  - $p$-norm ($p$-th root of the sum of $p$-th powers of the absolute weights)
    $$r(w,b)=\sqrt[p]{\sum_{w_j} \left\lvert w_j \right\rvert^p}=\left\lVert w\right\rVert_p$$
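
As a small numeric illustration, the sketch below evaluates the standard $p$-norm, $\left(\sum_j \lvert w_j \rvert^p\right)^{1/p}$, and adds it to a sum of losses with weight $\lambda$. The function names and the numbers are assumptions for the example, and the bias $b$ is left out of the penalty.

```python
def p_norm(weights, p):
    """(sum_j |w_j|^p)^(1/p); p = 1 and p = 2 give the 1-norm and 2-norm."""
    return sum(abs(w) ** p for w in weights) ** (1.0 / p)

def regularized_objective(losses, weights, lam, p=2):
    """Sum of per-example losses plus lam times the p-norm of the weights."""
    return sum(losses) + lam * p_norm(weights, p)

w = [3.0, -4.0, 0.0]
l1 = p_norm(w, 1)   # |3| + |-4| + |0|
l2 = p_norm(w, 2)   # sqrt(9 + 16 + 0)

objective = regularized_objective([1.0, 2.0], w, lam=0.1, p=1)
```

Increasing `lam` makes large weights more expensive relative to the data loss, which is how the regularizer pushes the solution towards smaller weights.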

:::{.callout-tip}
- Smaller values of $p$ $(p < 2)$ encourage sparser vectors<br>
- Larger values of $p$ discourage large weights more.
- All $p$-norms penalize larger weights.
- $p < 2$ tends to create sparse solutions (i.e. lots of $0$ weights)

See the picture below.
:::    

![](CS5590_images/Acrobat_DOQ241iPvG.png)

- L1 is popular because it tends to result in sparse solutions (i.e. lots of zero weights)
- However, it is not differentiable at zero, so it has no closed-form solution in general and only works with iterative (gradient-descent-style) solvers
- L2 is also popular because for some loss functions it can be solved directly (no gradient descent required, though iterative solvers are still often used)
- Lp is less popular since it doesn’t tend to shrink the weights enough
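
The contrast between L1 and L2 can be seen in the simplest possible case: a single weight, no bias term and squared loss. This is a sketch under those assumptions, and it uses the squared 2-norm $w^2$ as the L2 penalty. The function names are hypothetical. Minimizing $\sum_i (y_i - w x_i)^2 + \lambda w^2$ gives $w = \sum x_i y_i / (\sum x_i^2 + \lambda)$ directly, while the L1 penalty $\lambda \lvert w \rvert$ leads to soft-thresholding.

```python
def ridge_1d(xs, ys, lam):
    """Minimise sum (y - w*x)^2 + lam * w^2 (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    """Minimise sum (y - w*x)^2 + lam * |w| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # shrink towards zero, stop at zero
    return (shrunk if sxy >= 0 else -shrunk) / sxx

xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]  # toy data with y = 0.1 * x

w_ridge = ridge_1d(xs, ys, lam=14.0)  # shrunk, but still non-zero
w_lasso = lasso_1d(xs, ys, lam=4.0)   # driven exactly to zero
```

The L2 solution shrinks the unregularized weight of $0.1$ without ever reaching zero, whereas a large enough L1 penalty sets the weight to exactly zero, which is the source of sparsity.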