# Machine Learning

#### Basics
- Suite of algorithms and techniques that learn from data, find patterns, and make predictions
- Useful with multivariate systems that are too complicated to use classical statistics
- Two main types
    - Supervised Learning
        - learning from labeled examples (categories)
        - kNN, SVM, Decision Trees, Random Forests, Bagging, Boosting, etc.
        - make predictions for new data points
    - Unsupervised Learning
        - learning from unlabeled data to create categories/clusters
        - PCA, MDS, Clustering
        - find patterns in data
        - for data with no categories (maybe don't know what they are), don't have training data, can't label because it's so big, don't know what's in it, or need to identify patterns/structure
- Learn by example
    - training data set
- Dividing the data
    - Production Models
        - 60% train, 20% validation, 20% test
    - Other methods
        - **30%** test data (though I've seen examples with more like 15% test data, depending on size of the dataset)
        - divide training data into equal sized folds for cross validation and hyperparameter estimation
        - use the test data once at the very end to evaluate the model performance
- Loss function
    - this is the 'error function' that the model tries to minimize
    - in linear regression, this may be a least squares function (OLS) (sum of squares)
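
A minimal sketch of the 60/20/20 split described above, using only the standard library. The function name `train_val_test_split` and the fixed seed are assumptions made for illustration, not part of any particular library.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # random selection, reproducible via seed
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # whatever is left (60% by default)
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test list would be set aside and only touched once, at the very end.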

#### Regression vs. Classification
- Both supervised
    - regression -> target predicted variable is continuous
    - classification -> target predicted variable is categorical

## Linear Regression

#### Bias
- Selection Bias
    - where did the data come from, and what are you missing
    - your sample vs. the population
- Publication Bias
    - only positive/significant results published
    - highly influenced by the $\alpha < 0.05$ threshold
- Non-Response Bias
    - people who did respond differ from people who didn't
- Length Bias
    - especially important with time based measurements
    - bias towards individuals (data points) that remain in a specific state longer
        - you miss the two day sample and collect the 20 yr sample, this happens repeatedly
- Calculation
    - difference between your estimation and the "truth"
        - almost impossible to find the "truth", so really can't calculate bias accurately
- MSE (mean squared error)
    - $MSE = variance + bias^2$
    - decreasing bias often means taking more samples or increasing model complexity, which means increasing variance
    - it's a tradeoff, and decreasing bias may increase variance enough to make the MSE very large
    - want to minimize MSE, rather than bias
- Fisher Weighting
    - way to take multiple independent unbiased estimators and combine into one number
    - take a weighted average
        - individual weights add up to one
        - individual weights inversely proportional to the variance
- Nate Silver Weighting
    - take each variable into account based on particular circumstances
        - past reliability, known biases, etc.
- Bonferroni Correction
    - divide $\alpha$ by the number of tests to get the per-test significance threshold

- Regression toward the mean (RTTM)
    - predicting things like son's heights based on father's heights
        - the son won't be exactly the same height as the father, and will likely be between the father's height and the mean
        - combination of luck/skill (luck = chance, skill = effect)
        - example: praising good performance seems to lead to worse performance, and punishing bad performance seems to lead to better performance:
            - RTTM would predict bad to get better and good to get worse in many cases just by chance
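
The Fisher weighting idea above (weights that sum to one and are inversely proportional to the variance) can be sketched as follows. The function name and the example numbers are made up for illustration, and the estimators are assumed to be independent and unbiased.

```python
def inverse_variance_average(estimates, variances):
    """Combine independent unbiased estimates using inverse-variance weights."""
    raw = [1.0 / v for v in variances]        # weight is proportional to 1 / variance
    total = sum(raw)
    weights = [w / total for w in raw]        # normalize so the weights add up to one
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total           # variance of the combined estimate
    return combined, combined_variance, weights

# Two hypothetical estimates of the same quantity; the first is more precise.
combined, combined_variance, weights = inverse_variance_average([10.0, 14.0], [1.0, 4.0])
print(combined, combined_variance, weights)
```

The combined estimate lands closer to the lower-variance input, and its variance is smaller than either input's.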

#### OLS (Ordinary Least Squares)
- defines a linear model (the model is not OLS, that is the procedure to generate the linear model)
- **always plot the residuals!**
    - no pattern is a good thing
    - heteroscedasticity: bigger spread on one side
    - nonlinear: distinct pattern to the residuals (U-shape or similar)
- Goodness of fit
    - $R^2$
        - explained or 'accounted for' variance
        - amount of variance captured by the model
    - bad to use for telling how good a model is at predicting
        - $R^2$ will always go up as you add more variables
        - cross validation is better
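
A small sketch of OLS for a single predictor, with the residuals and $R^2$ discussed above. The data points are made up, and the helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys, slope, intercept):
    """Vertical differences between observed and predicted y values."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def r_squared(ys, resid):
    """Share of the variance in y that the model accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r * r for r in resid)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up, roughly linear data
slope, intercept = fit_line(xs, ys)
resid = residuals(xs, ys, slope, intercept)
print(slope, intercept, r_squared(ys, resid))
print(resid)  # these are the values to plot against x when checking for patterns
```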

#### Linear Regression Tips
- When in doubt, take the log of predictor variables
- Avoid collinearity
    - using independent variables that are highly correlated with each other
    - tends to inflate regression coefficient, create high variances in estimates, and high instability
    - variance inflation factor helps to measure collinearity (1-2 very little to no collin. while over 20 is extreme collin.)

### Logistic Regression
- Basics
    - good for predicting binary outcomes
    - think of the logistic s curve, the bottom is the first outcome with a value of 0, and the second outcome is the top of the curve with an outcome of 1
    - provides a framework for considering/controlling for variables 
- Odds ratio
    - probability of an outcome is $p$
    - odds of the outcome are $p/(1-p)$
        - odds = 10/1
        - probability: 10 = p/(1-p) -> 10 - 10p = p -> 10 = 11p -> p = 10/11 = ~0.91
        - if p is very small, 1-p is about 1 and odds are about equal to prob, but only in this case
    - odds ratio = $\frac{(p_1/(1-p_1))}{(p_2/(1-p_2))}$
        - where $p_1$ is the probability of outcome 1 and $p_2$ is the probability of outcome 2
- Strategy
    - use Maximum Likelihood Estimation (MLE) to estimate parameters from the data
    - define variables
        - y is 0 or 1 (false or true) for the variable you are trying to predict
        - can have multiple x variables
            - can be binary 0/1 (presence/absence, gender, etc.)
            - can be continuous (age, etc.)
    - want the probability of Y given your X variables
        - $p = P(Y=1|X_1, X_2, ..., X_k)$
    - logistic regression model (logit)
        - $logit(p) = ln (\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + ... + \beta_k X_k$
    - once the $\beta$ values are calculated above, you can plugin values for your $X$ variables to calculate probabilities and make predictions
    - interpret results (adjusted odds ratio)
        - Example:
        - model calculates a $\beta$ value of 1.89 for a binary X variable
        - plug into equations and simplify to get an 'adjusted odds ratio' = $e^{\beta_k} = e^{1.89} \approx 6.6$
            - (raise e to the $\beta$ power) to calculate the adjusted odds ratio
        - the '1' state has odds about 6.6 times those of the '0' state
    - model formula
        > $p = \dfrac{1}{1 + e^{-cx}}$ or $p = \dfrac{1}{1 + e^{-(b_0 + b_1x_1 + \dots + b_nx_n)}}$
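
To make the odds, logit, and adjusted odds ratio steps concrete, here is a small sketch. The intercept of -2.0 is a made-up value; only the coefficient 1.89 comes from the example above.

```python
import math

def odds(p):
    """Odds of an outcome with probability p."""
    return p / (1 - p)

def prob_from_odds(o):
    """Invert odds = p / (1 - p) to recover the probability."""
    return o / (1 + o)

def predict_probability(intercept, coefs, xs):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1 / (1 + math.exp(-z))

print(prob_from_odds(10))                       # odds of 10/1 as a probability

b0, b1 = -2.0, 1.89                             # b0 is hypothetical
p0 = predict_probability(b0, [b1], [0])         # X in the '0' state
p1 = predict_probability(b0, [b1], [1])         # X in the '1' state
print(odds(p1) / odds(p0), math.exp(b1))        # the two numbers agree
```

The ratio of the two odds equals $e^{\beta}$ regardless of the intercept, which is why the coefficient alone gives the adjusted odds ratio.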

#### Curse of Dimensionality
- As the number of dimensions increases, the volume of the space grows rapidly, and the data (probability mass) can't keep up
    - this results in an exponential decrease in the fraction of the space near the center as dimension grows
    - e.g. a circle inscribed in a square takes up 0.79 of the square's area, while a sphere inscribed in a cube takes up just 0.52 of the cube's volume
        - by dimension 6 this is just 0.08, and by dimension 10 it is 0.002
- Other descriptions
    - overfitting a model
    - harder to cluster data, individual points seem like their own cluster (overfitting)
    - model parameters can reach 100's or 1000's and even have more parameters than data points
    - **most statistics were designed to describe a central tendency (the middle)**
        - the numbers above are the random chance of a point in that space being 'central'
        - at 100 dimensions this goes down to $1.87 * 10^{-70}$
    - in high dimensions, there is often no choice but extrapolation
        - the space is so sparse that you must look outside the estimate to predict (finding close neighbor points)
- Interpolation is better than Extrapolation in many cases (straight line not always the best answer)
    - interpolation -> new (predicted) data points are within the range of observed points
    - extrapolation -> new (predicted) data points are beyond the range of observed points
- Blessing, not a curse?
    - Gelman, uses a hierarchy to give weights to the variables, and this helps to give order to the model, supposedly aggregating neighbors better within the predictive space
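
The 'central' fractions quoted in this section can be reproduced with the standard formula for the volume of a d-dimensional ball. This is a sketch assuming a ball of radius 1/2 sitting inside a unit cube; the function name is made up.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of a d-dimensional unit cube filled by its inscribed ball."""
    # Ball volume: pi^(d/2) / Gamma(d/2 + 1) * r^d, with r = 1/2 for a unit cube.
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 6, 10, 100):
    print(d, inscribed_ball_fraction(d))
```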

#### Dimensionality Reduction for Classification Systems
- Goal: 
    - Project the high dimension data into a lower dimension subspace that best fits the data
        - reduce the dimensions, while keeping the same distance between points observed in the higher dimensions
        - looking for the 'intrinsic' dimensionality
            - data that are in 2d but basically follow a linear pattern are intrinsically 1d
            - data that are in 3d but basically fall around the same plane are intrinsically 2d, etc.
        - many techniques project the data into a linear plane
- How: PCA (principal components analysis)
    - uses
        - dimensionality reduction
        - compression
        - visualization
    - what it does
        - pc1: minimizes the orthogonal distances
        - pc1: represents the direction of maximum variance (or max spread of the data points)
        - would create a 1d line for 2d data
    - how it differs from linear regression
        - linear regression does not rotate the original axis
            - residuals are measured as $|y_{obs} - y_{pred}|$ (vertical differences)
        - pca uses best fit based on orthogonal projection of the data
            - not diff in y values, but closest distance between the point and the line (perpendicular to the line)
        - linear regression residual values are essentially the hypotenuse of a triangle
        - pca residual values are one side of that same triangle
        - therefore, the resulting best fit 'line' is different
    - algorithm
        - subtract the mean from each data point (center the data around the origin)
        - (typically) scale each dimension by its standard deviation
            - depends on the problem and its application
            - divide each dimension parameter (x, y, z...) by the standard deviation in that direction
            - keeps dimensions with large magnitudes from dominating
        - compute covariance matrix S
            - $S = \frac{1}{N}X^TX$
            - original matrix X transposed times X
            - captures the spread in each dimension and all of the cross products
        - compute the k eigenvectors of S with the largest eigenvalues
            - these are the principal components
        - possible to add weighting by:
            - having fewer samples of lower weighted features
            - using other algorithms that allow for weighted features
    - SVD (singular value decomposition)
        - the way that most people compute PCA
        - much more robust than doing by hand like above
    - Scree plot: plot the principal components (eigenvectors) along the x axis and the variance explained by each on the y axis
        - enough components to explain 80-90% of the variance is usually good
            - or look for where there are diminishing returns by adding more pc's
        - the fraction of variance that eigenvector $i$ explains is its eigenvalue divided by the sum of all eigenvalues: $\lambda_i / \sum_j \lambda_j$
- Intrinsic Dimensionality
    - is the number of pc's you need/use
    - effectively reduces the dimensionality of the data
- Project your data points onto that/those principal components
    - for visualization purposes, you can pick the first two pc's to create a 2d plot of your points
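
The 'by hand' PCA steps above (center, covariance matrix, eigenvectors, project) can be sketched for 2D data in pure Python. This is a minimal illustration, not how a library would do it: it skips the optional scaling step, and it uses the closed-form eigen-decomposition of a symmetric 2x2 matrix rather than SVD. The function name and data are made up.

```python
import math

def pca_2d(points):
    """PCA for 2D points: returns (components, explained_ratios, centered_points)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Covariance matrix S = X^T X / N = [[a, b], [b, c]] for the centered data.
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Eigenvalues of a symmetric 2x2 matrix in closed form (largest first).
    mid, half_gap = (a + c) / 2, math.hypot((a - c) / 2, b)
    eigenvalues = [mid + half_gap, mid - half_gap]

    # Eigenvector for the largest eigenvalue; the second is perpendicular to it.
    if abs(b) > 1e-12:
        vx, vy = b, eigenvalues[0] - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    pc2 = (-pc1[1], pc1[0])

    total = sum(eigenvalues) or 1.0
    ratios = [lam / total for lam in eigenvalues]  # variance explained per component
    return [pc1, pc2], ratios, centered

points = [(0, 1), (1, 3), (2, 5), (3, 7)]  # made-up points lying on y = 2x + 1
components, ratios, centered = pca_2d(points)
# Project each centered point onto pc1 to get its 1D coordinate.
scores = [x * components[0][0] + y * components[0][1] for x, y in centered]
print(components[0], ratios, scores)
```

Because these points fall exactly on a line, the first component carries all of the variance, which is the 'intrinsically 1d' case.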

#### Multidimensional Scaling (MDS)
- Similar in function to PCA with a different goal
    - goal: find a set of points whose pairwise distances match a given distance matrix
    - input is a matrix of pairwise distances between points rather than the raw data
        - decide distance metric, compute all of the pairwise distances, and assemble in distance matrix
    - classical MDS methods
        - given n x n matrix of pairwise distances between each point
        - compute an n x k matrix X of point coordinates from the distances with some linear algebra magic
        - perform PCA on this matrix X
    - more trying to preserve the distances between points rather than minimize the orthogonal distances between the data and the projection

#### Dealing with 'Wide' Data
- wide data = lots of variables, whereas tall data has lots of observations
    - table shape
- Ridge Regression
    - an extension of linear regression
    - minimizes the sum of the squared residuals between observed/model + $\lambda$ times the sum of squared $\beta$'s, where $\lambda$ is a tuning parameter
        - (penalty factor)
        - imposes a penalty on the size of the coefficients, which shrinks them toward 0 (does not remove variables outright)
    - there are other types of regression that also punish too high dimensionality starting with OLS, then adding a penalty factor
- Shrinkage Estimation
    - not out of Seinfeld
    - **Stein's Paradox**
        - use total MSE to calculate performance (standard loss function)
        - try to estimate 3 or more means for predicting 3 or more independent y values (dependent variables)
        - prediction estimates tend to shrink towards a grand mean (sort of like regression towards the mean)
        - makes the most sense when 'independence' may not be complete (i.e. many different baseball players' avgs.)
- LASSO and Sparsity
    - helps to induce sparsity, which reduces dimensions
        - will shrink some of the $\beta$'s all the way to 0 (reduces the number of variables that matter)
    - like Ridge regression, but penalty term is sum of $\lambda * abs(\beta)$
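
To see the difference between the two penalties, here is a sketch for the simplest possible case: one centered predictor and no intercept, where both problems have closed-form solutions. The penalty conventions assumed are $\sum (y - \beta x)^2 + \lambda\beta^2$ for ridge and $\sum (y - \beta x)^2 + \lambda|\beta|$ for LASSO; the function names and data are made up.

```python
import math

def ols_beta(xs, ys):
    """Unpenalized least squares slope (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_beta(xs, ys, lam):
    """Ridge slope: the squared penalty shrinks beta but never to exactly 0."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_beta(xs, ys, lam):
    """LASSO slope: soft-thresholding sets beta to exactly 0 once lam is large enough."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

xs, ys = [-1, 0, 1], [-2, 0, 2]   # made-up data with a true slope of 2
for lam in (0, 2, 8):
    print(lam, ridge_beta(xs, ys, lam), lasso_beta(xs, ys, lam))
```

With several predictors the same behavior shows up per coefficient, which is where the sparsity in LASSO comes from.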

## Classification
- Assigning labels to data
- Examples: 
    - Email -> spam or not spam
    - Text -> which language is it
    - Images -> classify images as a type or search for features
- Greatly enhanced by machine learning

#### Classification Strategy
- Input:
    - training dataset of N data points, each labeled as one of K different classes
- Learn:
    - use the training dataset to learn about the classes
- Evaluate:
    - predict the labels for a test set and compare to the true labels (ground truthing)

#### Classification Explanation
- k number 'features' are used to essentially plot the data in a k dimensional space
- we are looking for the decision boundary(ies) between the different classes
- data points are typically called 'x' and the labels called 'y'
- depends heavily on feature selection
    - crappy features will result in overlap and no real decision boundary
    - good features can result in obvious decision boundaries
    - cross validation helps with feature selection

#### K Nearest Neighbors Classification
- Properties
    - Training is fast
    - Prediction is slow
        - must keep all data points, and compare new data point to all data points
- Simplest way to classify
    - 'majority voting' kind of deal
    - look at the categories of the nearest datapoints: assign predicted categories based on the known categories that are closest in feature values to the prediction data point
    - decisions are impacted by the 'size' of the neighborhood
        - evaluate the performance of the classifier to pick the neighborhood size
        - this is usually not a set size, but instead the value 'k' which is the number of neighbors to 'poll' (typically an odd number so no ties)
    - depends heavily on 'k' or the number of neighbors to use in classification
        - larger k means less variance but more potential bias
        - very low k results in rough decision boundary
        - use an odd k for two-class problems so the vote cannot tie
- 1 nearest neighbor properties
    - rough decision boundary
    - will draw islands around points that are the 'wrong category' in another category's space
        - no bueno when this happens
    - however, it is simple and often quite good for well separated, low dimension data (few features)
        - especially when there are no 'island' points in the 'wrong spot'
    - complexity
        - complexity of adding N points to the training set is order 1 per point (order N in total)
            - just store the additional data points/labels, no other calculations needed
        - complexity of new test data M number of points
            - complexity is M x N, must go through every training point for each new test point
        - weird tradeoff
            - ideally you spend your time during training, so testing is quick, but this version is opposite that
        - error on training set
            - 0 because we always get the nearest neighbor
        - variance/bias tradeoff
            - variance is quite high (adding new training data will affect the decision boundary each time)
            - bias is very low
- k-NN properties (using k nearest neighbors)
    - get rid of islands by increasing k above 1
    - if k is too large, decision boundary can become too smooth
        - leads to lower variance and higher bias
- Choosing the ideal value for k (more info below)
    - k, the 'distance' metric, and the 'majority voting' method are hyperparameters
        - hyperparameter = a parameter whose value is set before training begins
    - how do we measure the 'distance' between points?
        - could be Euclidean, but this makes less sense for increased dimensions, and there are other options
    - how do you decide 'majority voting'
        - usually probabilistically, but there are other methods that might improve the model
    - Usually you would divide the data 60% train, 20% validation, 20% test
        - use the 20% validation to help choose the hyperparameters
    - Cross Validation
        - optimize the three hyperparameters mentioned above
        - train on the training folds, evaluate on the validation fold, choose the k with the lowest validation error
        - how do you choose the number of folds? intuition! but 5 and 10 are typical numbers
            - the size of your dataset heavily influences this, and sometimes 3 fold is all you can do
            - **30%** of the data is typically used as test data (try to select randomly)
                - I've also seen as little as 15% test data, depending on size of the dataset
            1. Training/validating
                - do not want to use entire training (labeled) dataset at once, because you can only optimize it once
                - training multiple times will help choose the correct k that will generalize to other test data
                - best to split the training data into smaller sets (folds)
                - use the last of those training folds (training data) to estimate the hyperparameters, and this actually becomes a validation fold that uses 'validation data'
                - example: you would separate the training data into five folds, four with training data, and one with validation data (order of the folds not important)
                - finish by taking the average of the hyperparameters (or the set that is the most optimal) from each fold/validation, then apply to the test data
            2. Cross Validation
                - pick a fold to be the validation fold, with all other folds as training folds
                    - used as 'test data for the hyperparameters' essentially
                - repeat, but change the validation fold, using all other folds as training folds
                    - do this until every fold has been used as the validation fold once
                - example: 5 'training' folds, would do five different iterations, using each fold as the validation fold once, generating 5 sets of hyperparameters
                - average the parameters with the best performance on the validation data (or pick if that makes more sense, but probably usually avg)
                - test data is NOT used to determine the parameters and is used only at the very end to evaluate the model
                    - results are only able to be generalized if evaluated in this way on test data
            - CIFAR-10 dataset is a good example using color images to classify into categories
                - link:
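
The cross validation procedure above can be sketched with a bare-bones k-NN classifier. Everything here is illustrative: the data are made-up 2D clusters, folds are formed by taking every fifth point, the distance is Euclidean, and voting is a plain majority. A separate test set is assumed to have been held out already.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cross_val_accuracy(xs, ys, k, n_folds=5):
    """Mean validation accuracy of k-NN, using each fold as the validation fold once."""
    fold_accuracies = []
    for fold in range(n_folds):
        val_idx = set(range(fold, len(xs), n_folds))  # every n_folds-th point
        train_x = [x for i, x in enumerate(xs) if i not in val_idx]
        train_y = [y for i, y in enumerate(ys) if i not in val_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in val_idx)
        fold_accuracies.append(hits / len(val_idx))
    return sum(fold_accuracies) / n_folds

# Made-up labeled data: two noisy 2D clusters; classes alternate so folds stay mixed.
rng = random.Random(0)
ys = [i % 2 for i in range(40)]
xs = [(3 * y + rng.gauss(0, 1), 3 * y + rng.gauss(0, 1)) for y in ys]

scores = {k: cross_val_accuracy(xs, ys, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Note that prediction does all the work here (sorting every training point per query), matching the 'training is fast, prediction is slow' property.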

#### k-NN Example Dealing with high dimensional data (like images)
- Turn the data into vectors
    - image example using CIFAR-10 dataset
        - each color image is 32 x 32 pixels
        - each color image is a function of rgb
        - each image is a vector in 32 x 32 x 3 = 3072 dimension space (unrolling/unpacking the matrix)
- $L_1$ distance aka 'Manhattan Distance'
    - The 'distance' between points is the absolute difference in their pixel values
        - sum the absolute differences at each pixel location
            - use the 3072 vector and add up 3072 numbers, where each number is the difference between two images
            - `sum over i from 0 to n-1 of abs(img1[i] - img2[i])` where n = 3072 (0 indexed)
- $L_2$ distance aka 'Euclidean Distance'
    - shortest distance between points plotted in a plane
        - the square root of the sum of the differences squared
            - `sqrt( sum over i from 0 to n-1 of ( img1[i] - img2[i] )^2 )`
            - could apply this to the image vectors above
- More general Lp norms
    - weird formulas, less commonly used than $L_1$ and $L_2$ distances
- In these examples, the features are the pixels and the model doesn't work so well
    - turns out that this doesn't work so well, and the accuracy is less than about 40% using the pixel differences as the feature
    - the pixel method is very simple and doesn't take into account rotation, zoom, lighting, different colors of objects in the same class, etc.
    - neural network architectures work much better (~95% accuracy) and use more features
        - these use sifting and pick certain areas, account for rotation, perspective, lighting, etc.
        - moral of the story is that **feature selection is very important**
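
The two distances can be written as short functions. The three-element vectors below are made-up stand-ins for the 3072-element unrolled image vectors.

```python
import math

def l1_distance(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

img1 = [0, 3, 0]   # stand-in for an unrolled image vector
img2 = [4, 0, 0]
print(l1_distance(img1, img2), l2_distance(img1, img2))
```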

#### k-NN Choosing Ideal k
- Plot the results of many different tests using your fold choice (5 fold, etc.)
    - x var is the value of k
    - y var is the accuracy of the model (correct predictions by classifier / total number of predictions)
    - if using 5 fold validation, for each value of k, you will get 5 values as results
        - plot line graph using error bars through the mean of the 5 points

## Evaluating Model Performance

#### Confusion Matrix
- For binary classification, this is a 2 x 2 matrix
    - columns are prediction 0, prediction 1
    - rows are actually 0, actually 1
    - each cell counts one combination of actual and predicted label
    - shows left to right, top to bottom
        - true negative, false positive
        - false negative, true positive
        - record numbers there (not percentages)
    - accuracy = sum of diagonal (true positive + true negative) / sum of entire matrix
        - can extend for more than two options
    - precision = true positives / (true positives + false positives)
        - high precision means few false positives among the positive predictions
        - if I take a positive prediction, what is the chance it actually is positive?
    - recall = true positives / (true positives + false negatives)
        - aka sensitivity, hit rate
        - high recall means predicted most positives correctly
        - if I take a random positive sample, what is the chance of a correct prediction?
    - F1 score = 2 * (precision * recall) / (precision + recall)
        - aka the harmonic mean of precision and recall
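
A sketch of computing these metrics from 0/1 labels. The function name and the example labels are made up; the zero-denominator fallbacks to 0.0 are a choice made here for simplicity.

```python
def classification_metrics(y_true, y_pred):
    """Count the confusion matrix cells and derive accuracy, precision, recall, F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```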

#### Hyperparameter Tuning
- Hyperparameters are parameters chosen before fitting the model
    - k in k-NN
    - alpha/lambda in Ridge/Lasso regression
- Strategy
    - try a bunch of different values for the hyperparameters
    - fit all of them separately
    - evaluate the performance of all of them
    - choose the best performing one
    - essential to use cross validation for HT (or use a validation set)
        - otherwise, you risk overfitting to the test set and losing generalization
- Implementation 
    - choose a grid of hyperparameters to test (all combinations within a certain meaningful range of each)
    - perform k-fold validation for each point in the grid
    - choose the value or combination of values for the hyperparameter(s) that perform the best
    - GridSearchCV in scikit learn can do this
    - RandomizedSearchCV does a similar process, but samples a fixed number of combinations instead of testing all of them, so it is cheaper to run
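
The grid search loop itself is simple enough to sketch without a library. Here `score_fn` stands in for 'mean k-fold validation score for these hyperparameters'; the toy scoring function and the parameter names `k` and `weight` are made up so the example runs on its own.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn(**params) for every combination in grid; return the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(k, weight):
    """Hypothetical validation score that peaks at k=5, weight=0.5."""
    return -(k - 5) ** 2 - (weight - 0.5) ** 2

best_params, best_score = grid_search(toy_score, {"k": [1, 3, 5, 7], "weight": [0.1, 0.5, 1.0]})
print(best_params, best_score)
```

The number of fits grows as the product of the grid sizes, which is the cost that randomized search avoids.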

## Support Vector Machines (SVM)
- Popular classifier


- Basic info
    - widely used for all sorts of classification problems
    - some people say it's the best off-the-shelf classifier
    - kind of weird because it focuses on the rare data points (ones on the edge of the cluster)
    - only uses the support vectors for prediction, so faster than k-NN for predicting
- Characteristics of SVM
    - When using two classes, the y values for the two classes are -1 and 1: this makes the math easier
    - Hyperplane separates the classes
        - $w^Tx + b = 0$
        - positive results are evaluated as the +1 class, negative results are the -1 class
    - w: weight vector -> orientation of the hyperplane that separates the classes (like a decision boundary)
        - hyperplane can be a straight line for two features
        - with b = 0, the hyperplane passes through the origin
        - w changes the orientation (angle) of the hyperplane
    - b: bias -> lets you 'shift' the hyperplane around, so it doesn't pass through the origin
        - b moves the hyperplane off the origin
    - goal of SVM
        - initialize w & b, then 'wiggle' that hyperplane to optimize it for class predictions
            - uses **Maximum Margin Classification**
                - make the hyperplane as far as possible from the closest points of each class (largest rectangle box around the hyperplane)
                - these closest points are the 'support vectors' in SVM that help draw the hyperplane
                    - still need to collect and train on a large amount of data, because the support vector set are going to be the rare observations, furthest away from the mean of that particular class
                    - $\gamma$ 'gamma' is the width of the maximum margin
                - $x_\perp$ = the projection of a point x on the max margin onto the hyperplane
                    - $x_\perp = x - \gamma \frac{w}{||w||}$ (step back by $\gamma$ along the unit direction of vector w)
                        - $x_\perp$ is on the hyperplane
                    - if x is on the maximum margin, then $\gamma = ||x - x_\perp||$
                    - leads to $w^Tx_{\perp}^{(i)} + b = 0$
                    > and $\gamma^{(i)} = y^{(i)}(\dfrac{w^Tx^{(i)} + b}{||w||})$
                - trying to optimize the largest $\gamma$ (gamma is width of the margin)
                - slack variables
                    - training points that fall on the 'wrong side' of the decision boundary
                    - distance is measured to the margin on the 'correct side' $\xi$
                        - if a yellow point is on the blue side, measure yellow point to yellow side margin
                    - leads to an equation not listed here where coefficient $C$ multiplies the sum of the slack variables
                        - large $C$ means that the model will care a lot about minimizing the value of the slack variables (large C means not a lot of slack allowed)
                        - lower $C$ can lead to a more robust model at the expense of some misclassification
        - once w and b are defined, prediction is easy
            - +1 cat -> $w^Tx + b > 0$ and -1 cat -> $w^Tx + b < 0$
        - no longer need training data
            - just evaluate new datapoints based on the hyperplane equation
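
Once w and b are known, prediction and the margin formula reduce to a dot product. The values of `w` and `b` below are made up for illustration and are not the result of any training.

```python
import math

def decision_value(w, b, x):
    """Evaluate w^T x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """+1 class if the point is on the positive side, otherwise -1 (ties go to -1)."""
    return 1 if decision_value(w, b, x) > 0 else -1

def geometric_margin(w, b, x, y):
    """Signed distance y * (w^T x + b) / ||w||; positive when x is classified correctly."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * decision_value(w, b, x) / norm

w, b = (3.0, 4.0), -5.0          # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
print(predict(w, b, (3.0, 4.0)), geometric_margin(w, b, (3.0, 4.0), 1))
print(predict(w, b, (0.0, 0.0)), geometric_margin(w, b, (0.0, 0.0), -1))
```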
        

#### Perceptron
- Used to calculate the hyperplane equation
    - structure is similar to a neuron
    - you have n features, there are n 'dendrites' of $w_n * x_n$ then add $b$ and sum those
        - the output is the threshold for the 'firing' of the 'neuron'
    - no fewer than 60,000 training data points suggested for training a neural network in this way
- Basis of deep learning models
    - all artificial neural networks (ANN) are intrinsically about
- Hyperparameters and extra dimensions
    - XOR problem
        - can't draw a straight line hyperplane between the classes: points (0,0) and (1,1) are blue, points (0,1) and (1,0) are yellow
            - box shape with alternating colors
        - basic solutions:
            - add a third dimension that will separate the points and use a plane to separate them
                - only add what is necessary to separate the classes
                - running PCA on this will keep necessary dimensions while reducing 'noisy' or not needed dimensions
            - include $x^2$ in the model
                - does not add any additional information or dimensions, but it will 'move' the datapoints such that a separating hyperplane can be drawn
                - this allows the SVM to essentially have non-linear decision surfaces
            - these are essentially the same solution
    - Use Kernel Trick for SVM's as Generic Solution
        - $x = (x_1, x_2)$ (2d point)
        - $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)$
            - quadratic expansion of the 2d point
        - this gets unruly at higher dimensions, so we would pay a cost for adding the dimensions to get a linear hyperplane, but we don't actually have to compute $\phi$: the kernels below give the dot product of $\phi(x)$ and $\phi(z)$ directly from x and z
            - for polynomial kernels
            > $K(x,z) = \phi(x) \cdot \phi(z) = (1 + x \cdot z)^s$
            - Radial Basis Function (RBF)
            > $K(x,z) = \exp(-\gamma ||x-z||^2)$
        - tuning
            - polynomial kernel, tune s (degree of the polynomial)
            - RBF, tune $\gamma$
                - can go up to an infinite number of dimensions
        - Kernel Trick Summary
            - arbitrary many dimensions
            - little computational cost
            - maximal margin helps with curse of dimensionality
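
A quick numerical check of the kernel trick for the quadratic expansion above: the dot product in the expanded 6-dimensional space equals $(1 + x \cdot z)^2$ computed in the original 2 dimensions. The test points are made up, and the default `gamma` of 1.0 in the RBF function is an arbitrary choice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    """Explicit quadratic expansion of a 2D point into 6 dimensions."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def poly_kernel(x, z, s=2):
    """Polynomial kernel (1 + x . z)^s, computed without expanding."""
    return (1 + dot(x, z)) ** s

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(z)), poly_kernel(x, z))   # both routes give the same number
print(rbf_kernel(x, z), rbf_kernel(x, x))
```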

#### SVM Application
- Divide data
- Cross validate
    - number of components for PCA
    - which kernel to use (polynomial or RBF)
    - tune hyperparameters (use exponential steps `[10^-1, 10^0, 10^1]` or `[10^-2, 10^0, 10^2]`)
        - polynomial
            - s (degree of the polynomial)
        - RBF
            - gamma
        - C (slack variable parameter)
            - large C means less slack (leads to overfitting in some cases)
        - always tune gamma/C jointly together because they greatly affect the model's fit (over/under fitting)
- Tips and Tricks
    - normalize or otherwise scale your data first! (mean = 0, std = 1)
        - check if library normalizes by default
        - train & test data should both be normalized, but compute the mean/std from the training data only and apply those same values to the test data, so the test data does not influence them
    - map y to (-1, 1) or (0, 1), I think (-1, 1) is better
    - use RBF kernel by default unless you have another reason not to
    - use exponential sequences for gamma/C

#### Main Differences Between k-NN and SVM
- predictions
    - k-NN keeps/uses all the training data during predictions
    - SVM only worries about the support vectors
- hyperparameters
    - k-NN mainly tunes k (the distance metric and voting method can also be tuned)
    - SVM has C, which kernel, then kernel parameter (gamma or s/degree)

#### Model Testing
- as with k-NN, can plot degrees of freedom (# of params - 1) vs error
    - plot both training error and testing (validation) error with different dof
    - optimum is where testing (validation) error is the lowest
- error metrics
    - positive/negative matrix (similar to confusion matrix) tp, tn, fp, fn
        - true positive rate = $tpr = \frac{tp}{tp + fn}$
        - false positive rate = $fpr = \frac{fp}{fp + tn}$
        - Receiver Operating Characteristics: plot fpr(x) vs. tpr(y)
            - want to be in top left corner (100% tpr, 0% fpr) or as close as possible
            - diagonal line 0,0 to 1,1 is the 50/50 line, coin flip chance (no predictive power)
            - below this line is even worse!, though if it is way below the line, just flip the labels, because the model actually does have predictive power!
            - plot entire line varying over the parameter
                - however, only display the AUC (area under the curve) value
            - plot training data and testing data
        - Recall/precision curves (good for imbalanced data, e.g. only 20% positives in the data)
            - recall = $\frac{tp}{tp + fn}$ (same as true positive rate)
            - precision = $\frac{tp}{tp + fp}$
            - plot recall(x) vs. precision(y)
            - want to be in upper-right corner (high recall/high precision)
                - can actually choose whether or not recall or precision is more important and choose accordingly
        - F-measure (summary measure for precision/recall curves, analogous to AUC)
            - weighted harmonic mean of precision and recall
            - $F_{\beta} = \dfrac{(\beta^2 + 1)\cdot P \cdot R}{\beta^2 \cdot P + R}$
            - usual case $\beta = 1$
            - increasing $\beta$ allocates weight to recall
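
The error metrics above can be sketched directly from the four confusion counts. This is a small illustration only; the counts are hypothetical and the function names are made up for the example.

```python
def rates(tp, tn, fp, fn):
    """Return (tpr, fpr, precision) from confusion counts."""
    tpr = tp / (tp + fn)          # true positive rate, same as recall
    fpr = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    return tpr, fpr, precision

def f_beta(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

tp, tn, fp, fn = 40, 45, 5, 10    # hypothetical counts
tpr, fpr, precision = rates(tp, tn, fp, fn)
f1 = f_beta(precision, tpr)             # usual case, beta = 1
f2 = f_beta(precision, tpr, beta=2.0)   # more weight on recall
```

One (fpr, tpr) pair is a single point on the ROC curve; sweeping the decision threshold and recomputing the counts traces out the full curve.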

#### SVM for Multiple Classes
- 1 vs. all (slower training)
    - idea
        - single out one class, and combine all other classes together for the model
        - do for each class individually
    - implementation
        - train n classifiers for n classes
        - use 1 (positive class) for the single class
        - let all classifiers predict, take classification with greatest margin as the answer
- 1 vs 1 (faster training)
    - idea
        - take only two classes at a time
        - go through every pair of classes
    - implementation
        - train $n(n-1)/2$ classifiers
        - let all classifiers you trained predict, then take majority vote
- evaluation
    - confusion matrix
        - predicted label on x, true label on y
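
A rough sketch of how the two multi-class schemes combine binary classifiers. The classifiers themselves are replaced by hypothetical stand-ins (`toy_winner` and a dictionary of margins), since only the combination logic is being illustrated.

```python
from collections import Counter
from itertools import combinations

classes = ["a", "b", "c", "d"]
pairs = list(combinations(classes, 2))   # n(n-1)/2 pairs for 1 vs 1

def one_vs_one_predict(pairwise_winner, classes):
    """Majority vote over all pairwise classifiers.

    pairwise_winner(ci, cj) returns the class chosen by the binary
    classifier trained on classes ci and cj only.
    """
    votes = Counter(pairwise_winner(ci, cj)
                    for ci, cj in combinations(classes, 2))
    return votes.most_common(1)[0][0]

def one_vs_all_predict(margins):
    """Pick the class whose 1-vs-all classifier has the greatest margin."""
    return max(margins, key=margins.get)

# Hypothetical stand-in: "c" wins every pair it is part of.
def toy_winner(ci, cj):
    return "c" if "c" in (ci, cj) else ci

ovo_label = one_vs_one_predict(toy_winner, classes)
ova_label = one_vs_all_predict({"a": -0.4, "b": 0.1, "c": 1.3, "d": -2.0})
```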

## Decision Trees
- Basics
    - start with a question, answer determines which branch you follow
    - this may lead to an answer or to another question
    - the decisions are intuitive and easy to interpret
        - the w and b from SVM don't tell as much
        - whereas MMSE value over 6 may provide more practical information
    - fast training
    - fast prediction
    - are not that popular
        - they tend to overfit
        - tree pruning can help some with this
- Ideas
    - look at one feature at a time
        - pick a threshold and evaluate a simple Y/N or T/F question: is $x_1 > \theta_1$?
            - make a decision then continue through the tree
        - pick another threshold on another feature $x_2 > \theta_2$
            - make a decision then continue through the tree
    - once finished, the 'feature space' where the points dwell is divided into cells
        - entire classes are assigned to the 'cells'
    - drawbacks
        - by analyzing one feature at a time (greater than/less than), you can only draw straight line splits to create the 'cells'
            - would take a lot of decisions (splits) to create a diagonal line
            - to 'smooth' out the line and not have it too pixelated requires even more splits (decisions)
    - benefits
        - because only 1 feature at a time, you don't need to normalize or scale the data first
        - multi-class problems are very easy with decision trees
            - unlike SVM
        - very fast
        - very intuitive/interpretable
    - most overfit model will have one cell for every training data point
- Disadvantages
    - sensitive to small changes in data
    - prone to overfitting
    - only axis aligned splits (one parameter at a time)

#### DT vs. SVM
- DT is better at
    - multiple classes
    - mixed data types (no standardization needed)
    - handling missing values
    - more robust with outliers
    - insensitive to monotone transformation of inputs
    - scales well (large N)
    - ability to deal with irrelevant inputs
    - interpretability
- SVM is better at
    - ability to extract linear combinations of features
    - **predictive power** generally makes SVM better, though SVD (singular value decomposition) can help DT's

#### Decision Tree Training
- Must learn tree structure
    - which feature to query?
    - which threshold to use?
    - want to optimize 'node purity'
        - creating cells that only contain one category
        - ok to have 4 cells that are all red, this is the prediction space for red
            - one way to have a non-linear decision boundary
    - Gini impurity
        - 0.5 is bad (worst case scenario for two classes)
            - high Gini index
            - high entropy
            - high misclassification error
        - expected error if you randomly choose a sample and predict the class of the entire node based on that sample
        - probability of making an error for each class is prob of that class * prob of an error (sum of the probs of each other class)
        - Gini impurity is the sum of all those probabilities of making an error
        - each class takes a turn as the 'true' class: multiply its prob by the summed probs of the other classes, then add it all up
        - Generalizing (sum of probs of true class * wrong prediction)
            - $C$ number of classes
            - $N$ number of datapoints
            - $N_i$ number of datapoints in class i
            - $I_G$ Gini impurity of a node
            > $I_G = \sum_{i=1}^{C} \frac{N_i}{N} \left(1-\frac{N_i}{N}\right)$
    - optimizing the tree
        - should I make a split here?
            - Gini impurity of parent node vs. Gini impurity of child nodes
        - misclassification
            - is the predicted label the same as the assigned label? count the mismatches and divide by the number of points
            - problem: this is not differentiable
            - can be the same between two different decisions for a split (though one decision produces a pure node and the other does not)
                - Gini index gives more weight to the split that produces a pure node
                - ex. split into nodes 400/200 and 200/0 vs. nodes 300/100 and 100/300
                    - both have 0.25 misclassification
                    - Gini gain for the first split (the one with a pure node) is 0.167 and for the second is 0.125
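
The split comparison above can be checked numerically. This sketch assumes a parent node with 400 points in each of two classes, which is what the child counts imply; the function names are illustrative.

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in counts)

def gini_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

def split_error(children):
    """Misclassification rate when each child predicts its majority class."""
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) - max(ch) for ch in children) / n

parent = (400, 400)                      # assumed class counts in the parent
split_pure = [(200, 400), (200, 0)]      # produces one pure node
split_mixed = [(300, 100), (100, 300)]   # no pure node

gain_pure = gini_gain(parent, split_pure)     # about 0.167
gain_mixed = gini_gain(parent, split_mixed)   # 0.125
```

Both splits have the same misclassification rate, but the Gini gain prefers the split that creates a pure node.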

#### Decision Trees Process (pseudocode)
- Check if finished
    - do all cells only have 1 class in them? (one choice for when to stop)
- For each feature $x_i$
    - calculate gain from splitting on $x_i$
    - let $x_{best}$ be the feature with the highest gain
- Create a decision node that splits on $x_{best}$
    - repeat on the sub-nodes that are left
- When to stop?
    - nodes only contain one class
    - nodes contain fewer than some minimum number of data points
    - max depth is reached
    - node purity is sufficient (above a threshold)
    - you start to overfit (cross validation)
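
The pseudocode above could be turned into a small recursive builder along these lines. This is a sketch, not a production implementation: it assumes numeric features, string class labels, Gini gain as the split criterion, and only three of the stopping rules (pure node, too few points, max depth). All names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return (gain, feature index, threshold) of the best axis-aligned split."""
    best = (0.0, None, None)
    for j in range(len(X[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for theta in sorted({row[j] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[j] <= theta]
            right = [lab for row, lab in zip(X, y) if row[j] > theta]
            children = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            gain = gini(y) - children
            if gain > best[0]:
                best = (gain, j, theta)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_points=2):
    gain, j, theta = best_split(X, y)
    # Stop: pure node, too few points, depth limit, or no split with gain.
    if len(set(y)) == 1 or len(y) < min_points or depth >= max_depth or j is None:
        return Counter(y).most_common(1)[0][0]   # leaf: majority class
    left = [i for i, row in enumerate(X) if row[j] <= theta]
    right = [i for i, row in enumerate(X) if row[j] > theta]
    return (j, theta,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       depth + 1, max_depth, min_points),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       depth + 1, max_depth, min_points))

def predict(tree, row):
    # Internal nodes are tuples; leaves are class labels (strings here).
    while isinstance(tree, tuple):
        j, theta, left, right = tree
        tree = left if row[j] <= theta else right
    return tree

X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 1.0], [7.0, 2.0], [8.0, 9.0]]
y = ["red", "red", "red", "blue", "blue", "blue"]
tree = build_tree(X, y)
```

Because the search is greedy, a split is only made when it has positive gain on its own, so patterns that need two splits to show any gain can be missed.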

#### Tree Pruning
- Helps with overfitting and too many decisions
    - work backwards, removing decisions
    - creates more impure cells
        - use majority voting in those cells for classification
        - can also use probabilities in these cells for classification (ratio of the points from each class)
- Pruning and Complexity
    - bias/variance tradeoff (over/under fitting)
    - plot complexity (number of nodes) vs. prediction error
    - optimize where test (validation) error is the lowest
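
After pruning, an impure cell still has to produce an answer. A small sketch of the two options mentioned above (majority vote, or class ratios used as probabilities); `leaf_prediction` is a hypothetical helper name.

```python
from collections import Counter

def leaf_prediction(labels):
    """Return (majority class, class-ratio probabilities) for one cell."""
    counts = Counter(labels)
    n = len(labels)
    probs = {cls: k / n for cls, k in counts.items()}
    return counts.most_common(1)[0][0], probs

# A pruned cell holding two red points and one blue point.
label, probs = leaf_prediction(["red", "red", "blue"])
```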

## Ensemble Methods and Random Forests
- **Ensemble Methods**
    - Can use these methods for other classifiers (not helpful for linear regression)
    - single DT's don't perform well but are fast
    - can use multiple trees
    - just need to ensure they don't all learn the same way
        - bootstrap the data to simulate collecting more data
        - will create many slightly different trees you can use on new data, taking the average of all the results
        - you cannot use the bootstrap samples as cross validation, because there is no 'unseen' data
            - the bootstrap samples will contain essentially the same data points, with some extra randomness to them
            - probability that a given data point shows up in a bootstrap sample of size N is about 0.632 ($1 - (1 - 1/N)^N \approx 1 - 1/e$), which is very high
            - still need 'unseen' data for cross validation (validation folds)
    - **Bagging** = bootstrap aggregating
        - bootstrap your sample dataset
            - before bootstrapping, first set aside the test data
        - train a classifier on each set
        - predict for new data points
            - run data through all generated classifiers and take the average
        - does a pretty good job of generalizing and catching the variance without introducing too much bias
            - lets you create lots of high variance trees, then taking the average will smooth out the decision boundary
        - two ways to combine the predictions in bagging
            - consensus: (mode of the total number of predictions) - voting
                - can add weights to specific models/learners
            - probability (average the class probabilities from each model)
        - **benefits**
            - reduces overfitting (variance, very wiggly decision boundaries)
            - normally uses only one type of classifier at a time, but works with any of them (DT, SVM, k-NN)
                - however, often used with DT's because they have high variance and they are fast, so can bootstrap quickly
            - does NOT help with linear models
            - easy to parallelize (MLlib included with Spark)
    - **Random Forest**
        - builds on bagging
            - **main difference** between RF and bagging is using subset of features each time
                - often sqrt(M) where M is the total number of features
                - select this subset randomly
                - introduces more randomization to the model
                - makes model better at generalizing and reduces variance (smoother boundary)
                - loses a little interpretability (on how classification was done, thresholds)
            - each tree from a bs sample (remember to only bs the training data set)
            - no pruning, keep bagged trees as is
            - hyperparameter tuning is more intuitive and outlined than for neural nets
                - number of trees, number of features
    - **Boosting**: very popular
        - like bagging except:
            - weak learners evolve over time
                - training sets are not independently chosen (like in bootstrap)
            - votes are weighted
            - better than bagging in some cases
                - though not necessarily better than random forests
            - more prone to overfitting than random forest
        - **process**
            - after building a model (tree) the model is examined and datapoints are (re)weighted
                - points that were misclassified are assigned a higher weight in the next tree
                - this continues as more trees are built
                - trees with better performance are assigned more weight
            - tuning hyperparameters
                - number of trees
                - number of splits (often stumps work well: only one split)
                - parameters controlling how weights evolve with each tree
        - AdaBoost
            - most successful/popular boosting algorithm
            - pseudocode
                - two-class example with -1 and 1 as the class labels
                - start with equal weights on the observations (datapoints)
                - generate a total of M classifiers (G)
                    - fit a classifier with the current weights
                    - compute the error
                    - reweight the observations based on the error
                    - calculate the model weight based on the error
                    - repeat for the next classifier for M total classifiers
                - result is the sign of the sum of all of the models times their weight
                    - weighted majority voting
        - Gradient Boosting (GBM)
- Regression Trees
    - the decision bins are continuous rather than categorical
    - the mean of the values assigned to the bin becomes the decision
    - also use the weighted average of all of the trees as the outcome (weight the learners/models)
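
The AdaBoost pseudocode above can be sketched with decision stumps as the weak learners. This is one common formulation (model weight $\alpha = \frac{1}{2}\ln\frac{1 - err}{err}$, observation weights multiplied by $e^{-\alpha y G(x)}$ and renormalized); the notes do not specify the exact update, so treat those details, the helper names, and the toy data as assumptions.

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump: returns (error, feature, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for theta in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > theta else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, theta, polarity)
    return best

def stump_predict(stump, row):
    _, j, theta, polarity = stump
    return polarity if row[j] > theta else -polarity

def adaboost(X, y, M=3):
    n = len(y)
    w = [1.0 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(M):
        stump = fit_stump(X, y, w)         # fit with the current weights
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # model weight
        ensemble.append((alpha, stump))
        # Misclassified points (y * prediction = -1) get heavier.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the weighted sum of the weak learners (weighted vote)."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, M=3)
predictions = [predict(ensemble, row) for row in X]
```

On this toy set a single stump always misclassifies at least two points, while the weighted vote of three stumps classifies all six correctly.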

## Bayesian Methods
- Bayes rule
    > $P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$
- $P(A|B)$ is the posterior (probability of A given B)
- $P(A)$ is the prior probability of A

- method for updating your belief
    - update probabilities after observing data $B$

- Sequential updating
    - posterior becomes the new prior, then do again
    - posterior becomes the new prior ...
- Bayes rule (likelihood version)
    > $P(\theta|y) = \dfrac{P(y|\theta)P(\theta)}{P(y)}$
- $\theta$ is an unknown, while $y$ is known
- treating the data $y$ as fixed leads to
    > $P(\theta|y) \propto L(\theta)p(\theta)$
    - posterior density is proportional to the likelihood function times the prior density
    - $P(y)$ is hard to calculate, so this works well
- works best with large sample sizes
    - the likelihood function dominates the prior and choice of the prior isn't vital
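
A minimal sketch of sequential updating, using a coarse grid of candidate $\theta$ values so the normalizing constant $P(y)$ is just a sum. The coin-flip setup, the flat prior, and the grid are assumptions chosen for illustration.

```python
def posterior_on_grid(thetas, prior, likelihood):
    """Posterior is proportional to likelihood times prior, then normalized."""
    unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]
    total = sum(unnorm)            # plays the role of P(y)
    return [u / total for u in unnorm]

thetas = [i / 10 for i in range(1, 10)]     # candidate P(heads) values
prior = [1 / len(thetas)] * len(thetas)     # flat prior (an assumption)

def heads(t):
    return t

def tails(t):
    return 1 - t

# Sequential updating: each posterior becomes the next prior.
post = prior
for flip in "HHT":
    post = posterior_on_grid(thetas, post, heads if flip == "H" else tails)

# Same data in one batch gives the same posterior.
batch = posterior_on_grid(thetas, prior, lambda t: t ** 2 * (1 - t))
most_likely = thetas[post.index(max(post))]
```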

#### Naive Bayes
- Assumptions
    - conditional independence for each value
        - so multiply these probabilities when doing calculations (factor them)
        - huge assumption but huge simplification in statistical/computational complexity
            - independence assumption means that the individual probabilities are naive (independent) of each other, so the probability of all of them occurring is the product of their individual probabilities rather than some crazy calculation
        - often unrealistic, but still useful because it reduces the complexity of the model
        - model is wrong, though it is useful
- Pseudocode
    - set up full probability model - joint probabilities of all observed and unobservable quantities in the data 
    - conditioning on observed data and calculating the appropriate posterior distribution
    - evaluate the fit of the model and the implications of the posterior distribution
- Conjugate priors
    - Beta/binomial is the most popular choice for the prior distribution of $\theta$
        - $X|p \sim binom(n,p)$
        - $p \sim Beta(a,b)$
        - $p|X = x \sim Beta(a + x,b + n - x)$
            - this is how you update your beliefs
            - generate new posterior that then becomes the prior
            - $a$ number of prior successes and $b$ the number of prior failures, $n$ is sample size
            - $x$ are new successes, so $n-x$ are new failures
        - the conjugacy of this says that the posterior distribution is also a beta distribution
        - if you don't have much prior data, start by choosing $a$ and $b$ as small numbers
            - increasing $a$ and $b$ gives more weight to your prior
                - if sample size is reasonably large, choosing Beta(0.7,0.5) vs. Beta(1,1) makes essentially no difference
                - likelihood will dominate the prior
    - Normal/normal choice for the prior distribution of $\theta$
        - $y|\mu \sim N(\mu,\sigma^2)$
        - $\mu \sim N(\mu_0,\tau^2)$
        - $\mu|y \sim N \left( (1-B)y + B\mu_0,\dfrac{1}{\frac{1}{\tau^2} + \frac{1}{\sigma^2}} \right)$, where $B = \dfrac{\sigma^2}{\sigma^2 + \tau^2}$ is the weight given to the prior mean
        - $\mu_0$ and $\tau^2$ are the hyperparameters
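
A short sketch of the Beta/binomial update described above, using the standard conjugate result $Beta(a + x, b + n - x)$ and feeding each posterior back in as the next prior. The batch counts and function names are hypothetical.

```python
def beta_binomial_update(a, b, x, n):
    """Posterior Beta parameters after x successes in n trials."""
    return a + x, b + (n - x)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a, b = 1, 1                       # weak prior: small a and b
batches = [(7, 10), (4, 10)]      # hypothetical (successes, trials)
for x, n in batches:
    a, b = beta_binomial_update(a, b, x, n)   # posterior becomes new prior

posterior_mean = beta_mean(a, b)
```

With a small prior such as Beta(1,1), the posterior mean ends up close to the observed success rate, which is the "likelihood dominates the prior" point made above.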