# Machine Learning

#### Basics
- Suite of algorithms and techniques that learn from data, find patterns, and make predictions
- Useful for multivariate systems that are too complicated for classical statistics
- Two main types
    - Supervised Learning
        - learning from labeled examples (categories)
        - kNN, SVM, Decision Trees, Random Forests, Bagging, Boosting, etc.
        - make predictions for new data points
    - Unsupervised Learning
        - learning from unlabeled data to create categories/clusters
        - PCA, MDS, Clustering
        - find patterns in data
        - for data with no known categories, no training data, too much volume to label by hand, unknown contents, or a need to identify patterns/structure
- Learn by example
    - training data set

## Linear Regression

#### Bias
- Selection Bias
    - where did the data come from, and what are you missing
    - your sample vs. the population
- Publication Bias
    - only positive/significant results published
    - highly influenced by the $\alpha < 0.05$ threshold
- Non-Response Bias
    - people who respond are different from people who don't
- Length Bias
    - especially important with time based measurements
    - bias towards individuals (data points) that remain in a specific state longer
        - you tend to miss the two-day sample and collect the 20-year sample, and this happens repeatedly
- Calculation
    - difference between your estimation and the "truth"
        - almost impossible to find the "truth", so really can't calculate bias accurately
- MSE (mean squared error)
    - $MSE = variance + bias^2$
    - decreasing bias often means increasing model complexity, which tends to increase variance
    - it's a tradeoff, and decreasing bias may increase variance enough to make the MSE very large
    - want to minimize MSE, rather than bias
- Fisher Weighting
    - way to take multiple independent unbiased estimators and combine into one number
    - take a weighted average
        - individual weights add up to one
        - individual weights inversely proportional to the variance
- Nate Silver Weighting
    - take each variable into account based on particular circumstances
        - past reliability, known biases, etc.
- Bonferroni Correction
    - divide $\alpha$ by the number of tests to get the per-test significance threshold
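
The weighting and correction rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example numbers are hypothetical, and the weighting assumes the estimates really are independent and unbiased.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalized so they add up to one."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def combine_estimates(estimates, variances):
    """Weighted average of independent, unbiased estimates."""
    weights = inverse_variance_weights(variances)
    return sum(w * e for w, e in zip(weights, estimates))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical numbers, for illustration only.
estimates = [10.0, 12.0]
variances = [1.0, 4.0]

weights = inverse_variance_weights(variances)        # about [0.8, 0.2]
combined = combine_estimates(estimates, variances)   # 0.8 * 10 + 0.2 * 12 = 10.4
per_test_alpha = bonferroni_alpha(0.05, 10)          # about 0.005
```

The noisier estimate (variance 4) gets a quarter of the weight of the tighter one, so the combined value sits closer to 10 than to 12.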

- Regression toward the mean (RTTM)
    - predicting things like son's heights based on father's heights
        - the son won't be exactly the same height as the father, and will likely be between the father's height and the mean
        - combination of luck/skill (luck = chance, skill = effect)
        - example: praising good performance appears to lead to worse performance, and punishing bad performance appears to lead to better performance
            - RTTM would predict bad to get better and good to get worse in many cases just by chance
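
A small simulation can make the luck/skill picture concrete. This is a sketch under assumed conditions: performance is modeled as a fixed skill plus fresh random luck each round, with both drawn from a standard normal distribution. The sample size, seed, and 10% cutoff are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_people = 10_000
skill = [random.gauss(0, 1) for _ in range(n_people)]   # stable effect
round_1 = [s + random.gauss(0, 1) for s in skill]       # skill + luck
round_2 = [s + random.gauss(0, 1) for s in skill]       # same skill, fresh luck

# Select the top 10% of performers based on round 1 only.
cutoff = sorted(round_1)[int(0.9 * n_people)]
top = [i for i in range(n_people) if round_1[i] >= cutoff]

top_mean_round_1 = sum(round_1[i] for i in top) / len(top)
top_mean_round_2 = sum(round_2[i] for i in top) / len(top)
overall_mean = sum(round_1) / n_people

print(overall_mean, top_mean_round_2, top_mean_round_1)
```

The top group's round 2 average lands between the overall mean and its own round 1 average, with no praise or punishment involved: the luck part of the round 1 score does not carry over.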

#### OLS (Ordinary Least Squares)
- fits a linear model (OLS is not the model itself; it is the procedure used to generate the linear model)
- **always plot the residuals!**
    - no pattern is a good thing
    - heteroscedasticity: bigger spread on one side
    - nonlinear: distinct pattern to the residuals (U-shape or similar)
- Goodness of fit
    - $R^2$
        - explained or 'accounted for' variance
        - amount of variance captured by the model
    - bad to use for telling how good a model is at predicting
        - $R^2$ never goes down (and almost always goes up) as you add more variables
        - cross validation is better
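
As a rough sketch of the OLS procedure for a single predictor, the snippet below fits a line, computes the residuals you would plot, and reports $R^2$. The data and function names are made up for illustration, and a real analysis would normally use a statistics library.

```python
def fit_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values; these are what you plot."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def r_squared(y, resid):
    """Share of the variance in y accounted for by the model."""
    mean_y = sum(y) / len(y)
    ss_res = sum(r ** 2 for r in resid)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_ols(x, y)
resid = residuals(x, y, intercept, slope)
r2 = r_squared(y, resid)
print(intercept, slope, r2)
```

With an intercept in the model the residuals sum to zero by construction, so the thing to look for in a residual plot is shape (fanning out, U-shape), not the average.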

#### Linear Regression Tips
- When in doubt, take the log of predictor variables
- Avoid collinearity
    - using independent variables that are highly correlated with each other
    - tends to inflate regression coefficients, create high variances in estimates, and cause high instability
    - variance inflation factor (VIF) helps to measure collinearity (1-2 means very little to no collinearity, while over 20 is extreme collinearity)
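
For intuition, the variance inflation factor of predictor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others. With only two predictors, $R_j^2$ is just their squared correlation, which is the simplified case sketched here. The data and function names are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]   # uncorrelated with x1
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1]   # nearly a copy of x1

vif_low = vif_two_predictors(x1, x2_unrelated)
vif_high = vif_two_predictors(x1, x2_collinear)
print(vif_low, vif_high)
```

The unrelated pair gives a VIF of 1 (no inflation), while the near-copy lands far above the "over 20" range.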

### Logistic Regression
- Basics
    - good for predicting binary outcomes
    - think of the logistic S-curve: the bottom of the curve is the first outcome (value 0) and the top of the curve is the second outcome (value 1)
    - provides a framework for considering/controlling for variables 
- Odds ratio
    - probability of an outcome is $p$
    - odds of the outcome are $p/(1-p)$
        - example: odds = 10/1
        - probability: $10 = p/(1-p) \rightarrow 10 - 10p = p \rightarrow 10 = 11p \rightarrow p = 10/11 \approx 0.91$
        - if $p$ is very small, $1-p$ is about 1 and the odds are about equal to the probability, but only in this case
    - odds ratio = $\frac{(p_1/(1-p_1))}{(p_2/(1-p_2))}$
        - where $p_1$ is the probability of the outcome in group 1 and $p_2$ is the probability of the same outcome in group 2
- Strategy
    - use Maximum Likelihood Estimation (MLE) to estimate parameters from the data
    - define variables
        - y is 0 or 1 (false or true) for the variable you are trying to predict
        - can have multiple x variables
            - can be binary 0/1 (presence/absence, gender, etc.)
            - can be continuous (age, etc.)
    - want the probability of Y given your X variables
        - $p = P(Y=1|X_1, X_2, ..., X_k)$
    - logistic regression model (logit)
        - $logit(p) = ln (\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + ... + \beta_k X_k$
    - once the $\beta$ values are calculated above, you can plug in values for your $X$ variables to calculate probabilities and make predictions
    - interpret results (adjusted odds ratio)
        - Example:
        - model calculates a $\beta$ value of 1.89 for a binary X variable
        - plug into equations and simplify to get an 'adjusted odds ratio' = $e^{\beta_k} = e^{1.89} \approx 6.6$
            - (raise e to the $\beta$ power) to calculate the adjusted odds ratio
        - the odds of the outcome in the '1' state are about 6.6 times the odds in the '0' state, with the other variables held fixed
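
The odds, logit, and adjusted odds ratio relationships above can be checked numerically. This is a sketch only: the intercept of -2.0 is an invented value, the helper names are hypothetical, and the $\beta$ values are assumed to have already been estimated (for example by MLE) elsewhere.

```python
import math


def odds_from_probability(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Inverse of the above: p = odds / (1 + odds)"""
    return odds / (1.0 + odds)


def predict_probability(betas, xs):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    linear = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return 1.0 / (1.0 + math.exp(-linear))


def adjusted_odds_ratio(beta):
    """Odds multiplier for a one-unit increase in that X variable."""
    return math.exp(beta)


p_from_ten_to_one = probability_from_odds(10.0)   # 10/11
small_p_odds = odds_from_probability(0.01)        # close to 0.01

# Hypothetical fitted model: intercept -2.0, one binary X with beta 1.89.
betas = [-2.0, 1.89]
p_state_0 = predict_probability(betas, [0])
p_state_1 = predict_probability(betas, [1])
ratio_from_predictions = (
    odds_from_probability(p_state_1) / odds_from_probability(p_state_0)
)
print(ratio_from_predictions, adjusted_odds_ratio(1.89))
```

The ratio computed from the two predicted probabilities matches $e^{\beta}$ regardless of the intercept, which is why the coefficient can be read directly as an odds ratio.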

#### Curse of Dimensionality
- As the number of dimensions increases, the volume of the space increases rapidly, while the data (the probability mass near the center) can't keep up
    - this results in an exponential decrease in probability with dimension
    - e.g. a circle inscribed in a square takes up 0.79 of the square's area, while a sphere inscribed in a cube takes up just 0.52 of the cube's volume
        - by dimension 6 this is just 0.08, and by dimension 10 it is 0.002
- Other descriptions
    - overfitting a model
    - harder to cluster data, individual points seem like their own cluster (overfitting)
    - model parameters can reach 100's or 1000's and even have more parameters than data points
    - **most statistics were designed to describe a central tendency (the middle)**
        - the numbers above are the random chance of a point in that space being 'central'
        - at 100 dimensions this goes down to $1.87 \times 10^{-70}$
    - in high dimensions, there is often no choice but extrapolation
        - the space is so sparse that you must look outside the range of the observed data to predict (close neighbor points are hard to find)
- Interpolation is better than Extrapolation in many cases (straight line not always the best answer)
    - interpolation -> new (predicted) data points are within the range of observed points
    - extrapolation -> new (predicted) data points are beyond the range of observed points
- Blessing, not a curse?
    - Gelman uses a hierarchy to give weights to the variables, which helps to give order to the model and supposedly aggregates neighbors better within the predictive space
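
The inscribed-ball fractions quoted in this section can be reproduced with the standard formula for the volume of a $d$-dimensional ball, $V_d(r) = \pi^{d/2} r^d / \Gamma(d/2 + 1)$. The sketch below assumes a unit hypercube (volume 1) with a ball of radius 1/2 inside it; the function name is made up.

```python
import math


def inscribed_ball_fraction(d):
    """Fraction of a unit hypercube's volume inside its inscribed ball."""
    radius = 0.5
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d
    return ball_volume  # the cube has volume 1, so this is already a fraction


for d in (2, 3, 6, 10, 100):
    print(d, inscribed_ball_fraction(d))
```

The printed values are roughly 0.79, 0.52, 0.08, 0.002, and $1.87 \times 10^{-70}$, so nearly all of a high-dimensional cube's volume sits in its corners, away from the 'central' region.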

#### Dealing with 'Wide' Data
- wide data = lots of variables, whereas tall data has lots of observations
    - table shape
- Ridge Regression
    - an extension of linear regression
    - minimizes the sum of the squared residuals between observed and modeled values plus $\lambda$ times the sum of the squared $\beta$'s: $\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \beta_j^2$
        - $\lambda$ is a tuning parameter (penalty factor)
        - imposes a penalty on the size of the coefficients, shrinking them toward 0 (this reduces the effective complexity of the model, but does not set coefficients exactly to 0)
    - there are other types of regression that also punish high dimensionality: they start with OLS, then add a penalty factor
- Shrinkage Estimation
    - not out of Seinfeld
    - **Stein's Paradox**
        - use total MSE to calculate performance (standard loss function)
        - applies when estimating 3 or more means at once, for 3 or more independent y values (dependent variables)
        - shrinking each estimate towards a grand mean gives a lower total MSE than using each sample mean on its own (sort of like regression towards the mean)
        - makes the most sense when 'independence' may not be complete (i.e. many different baseball players' avgs.)
- LASSO and Sparsity
    - helps to induce sparsity, which reduces dimensions
        - will shrink some of the $\beta$'s to exactly 0 (reduces the number of variables that matter)
    - like Ridge regression, but the penalty term is $\lambda \sum_j |\beta_j|$

## Classification
- Assigning labels to data
- Examples: 
    - Email -> spam or not spam
    - Text -> which language is it
    - Images -> classify images as a type or search for features
- Greatly enhanced by machine learning

#### Classification Strategy
- Input:
    - training dataset of N data points, each labeled as one of K different classes
- Learn:
    - use the training dataset to learn about the classes
- Evaluate:
    - predict the labels for a test set and compare to the true labels (ground truthing)
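
The input/learn/evaluate loop can be sketched with the simplest possible 'classifier', one that always predicts the most common training label. This baseline is an illustrative stand-in rather than a method from these notes, and the labels are invented. Any real classifier should beat it.

```python
from collections import Counter


def train_majority_classifier(train_labels):
    """'Learn' step for a baseline: remember the most common class."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, true):
    """Evaluate step: fraction of test labels predicted correctly."""
    correct = sum(p == t for p, t in zip(predicted, true))
    return correct / len(true)


# Hypothetical labeled data (K = 2 classes).
train_labels = ["spam", "not spam", "not spam", "not spam", "spam"]
test_labels = ["not spam", "spam", "not spam", "not spam"]

majority = train_majority_classifier(train_labels)
predictions = [majority] * len(test_labels)
test_accuracy = accuracy(predictions, test_labels)   # 3 of 4 correct
```

Keeping the test labels out of the learn step is the important part: accuracy measured on the training data would overstate how well the classifier predicts.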

#### Classification Explanation
- k 'features' are used to essentially plot the data in a k-dimensional space
- we are looking for the decision boundary(ies) between the different classes
- data points are typically called 'x' and the labels called 'y'
- depends heavily on feature selection
    - crappy features will result in overlap and no real decision boundary
    - good features can result in obvious decision boundaries
    - cross validation helps with feature selection

#### Nearest Neighbor Classification
- Simplest way to classify
    - 'majority voting' kind of deal
    - look at the categories of the nearest datapoints: assign predicted categories based on the known categories that are closest in feature values to the prediction data point
    - decisions are impacted by the 'size' of the neighborhood
        - evaluate the performance of the classifier to pick the neighborhood size
        - this is usually not a set size, but instead the value 'k' which is the number of neighbors to 'poll' (typically an odd number so no ties)
- 1 nearest neighbor properties
    - rough decision boundary
    - will draw islands around points that are the 'wrong category' in another category's space
        - no bueno when this happens
    - however, it is simple and often quite good for well separated, low dimension data (few features)
        - especially when there are no 'island' points in the 'wrong spot'
    - complexity
        - complexity of adding a point to the training set is order 1, $O(1)$
            - just store the additional data point/label, so adding N points takes N store operations and no other computation
        - complexity of classifying M new test points
            - complexity is $O(M \times N)$: must go through every training point for each new test point
        - weird tradeoff
            - ideally you spend your time during training, so testing is quick, but this version is opposite that
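
A minimal k-nearest-neighbor sketch, assuming Euclidean distance and simple majority voting (both are choices made here for illustration; the data and function name are invented). 'Training' is just keeping the lists around, and each prediction scans every training point.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    # Distance to every training point: O(N) work per query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-feature data: two clusters plus one 'island' point
# labeled "a" sitting inside the "b" cluster.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.5),
           (6.4, 6.4)]
train_y = ["a", "a", "a", "b", "b", "b", "a"]

query = (6.45, 6.5)
label_k1 = knn_predict(train_x, train_y, query, k=1)   # follows the island point
label_k3 = knn_predict(train_x, train_y, query, k=3)   # outvoted by two "b" neighbors
print(label_k1, label_k3)
```

With `k=1` the query next to the island point is labeled "a", while `k=3` smooths the island away and returns "b", which is the neighborhood-size effect described above.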