# Advanced Regression

<hr>

## CART
*Classification and regression trees*

Suppose a single covariate splits the data into groups that behave differently, for example, data points with age $\leq 25$ and data points with age $> 25$. A decision tree can then fit a separate regression model in each group and estimate that group's coefficients, conditional on age.

<img alt="CART" src="assets/CART.png" width="500">

This offers two direct benefits:

1. Ability to examine each individual leaf's coefficients and explain the covariate's relationship with the target variable


2. Higher predictive accuracy with targeted predictions
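
The age example above can be sketched in a few lines. This is a minimal illustration, not a full CART implementation: the split point is fixed at age 25 rather than learned, and the data, `fit_line`, and `predict` are hypothetical names made up for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: the response rises with age up to 25 and falls afterwards.
ages = [18, 20, 22, 24, 25, 26, 30, 35, 40, 45]
y = [2 * a if a <= 25 else 100 - 2 * a for a in ages]

SPLIT_AGE = 25  # assumed split; a real tree would search for it

# Fit one regression per leaf, conditional on age.
young = [(a, v) for a, v in zip(ages, y) if a <= SPLIT_AGE]
old = [(a, v) for a, v in zip(ages, y) if a > SPLIT_AGE]
leaves = {
    "age <= 25": fit_line([a for a, _ in young], [v for _, v in young]),
    "age > 25": fit_line([a for a, _ in old], [v for _, v in old]),
}

def predict(age):
    """Route the observation to its leaf, then apply that leaf's regression."""
    intercept, slope = leaves["age <= 25" if age <= SPLIT_AGE else "age > 25"]
    return intercept + slope * age

for name, (intercept, slope) in leaves.items():
    print(f"{name}: intercept={intercept:.2f}, slope={slope:.2f}")
```

Each leaf keeps its own intercept and slope, so the opposite trends in the two age groups stay visible instead of cancelling out in a single pooled regression.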

****

**Branching**

*How does branching work?*

In general, the idea is to do the following:

- Use half of the data to begin branching
- Find the best factor to branch
    - Split the data on each factor
    - Calculate the [mutual information](https://github.com/codedarrylcode/mitx-statistical-modeling/blob/master/notebooks/05a-gaussian-processes.ipynb) between factor and target
    - Find the factor that has the largest mutual information
- Accept the branch if its mutual information exceeds the threshold
- Repeat until no remaining split exceeds the threshold
- Use the other half of the data to prune the tree
    - For each branch, calculate the estimation error with and without the branch (*did this really improve the model?*)
    - If branching increases the error, then remove the branch
    

<img alt="Branching" src="assets/branching.png" width="500">

In general, over-branching is likely to cause overfitting, so a branch should be rejected if its improvement does not cross the threshold, i.e. if the benefit does not outweigh the cost of overfitting. A rule of thumb to prevent overfitting is to ensure that each leaf contains at least 5% of the original data.
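
The split search for a single factor can be sketched as follows. This is a simplified illustration under stated assumptions: the target is a discrete class label, candidate thresholds are midpoints between sorted values, and the names `best_split`, `min_leaf_frac`, and `min_gain` are hypothetical. Pruning on the held-out half is not shown.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
        for (va, vb), c in count_ab.items()
    )

def best_split(x, y, min_leaf_frac=0.05, min_gain=0.01):
    """Return (threshold, mutual information) of the best split, or None if rejected."""
    n = len(x)
    min_leaf = max(1, math.ceil(min_leaf_frac * n))  # 5% rule of thumb
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        side = [xi <= threshold for xi in x]
        n_left = sum(side)
        if n_left < min_leaf or n - n_left < min_leaf:
            continue  # a leaf would be too small
        gain = mutual_information(side, y)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    # Accept the branch only if it clears the threshold.
    if best is None or best[1] <= min_gain:
        return None
    return best

# Hypothetical data: the label flips between age 25 and 27.
ages = [18, 21, 23, 24, 25, 27, 31, 38, 44, 52]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(best_split(ages, labels))
```

In a full tree, this search would run over every factor at every node, and the winning factor would define the branch.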

****

## Random Forests

The idea is to introduce randomness and generate many different trees such that the average of these trees outperforms a single tree. Training each tree on a bootstrap sample and aggregating the results is called **bagging** (*bootstrap aggregating*), which is one of the [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) methods.

The general procedure is as follows:

1. Bootstrap $n$ data points with replacement from the original dataset

2. Randomly choose a small subset of factors for branching, e.g. $1 + \log(m)$ factors, where $m$ is the total number of factors

3. No pruning necessary

4. For regression, the average of all predictions is used; for classification, the most common predicted response is used (*the share of trees voting for a class can also serve as a probability*)
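
The building blocks of this procedure can be sketched as below. The tree-growing step itself is omitted; the function names are hypothetical, the logarithm is assumed to be the natural log, and the tree outputs at the end are made-up values standing in for real trees.

```python
import math
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Step 1: draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def choose_factors(n_factors, rng):
    """Step 2: pick roughly 1 + log(number of factors) factors at random."""
    k = min(n_factors, int(1 + math.log(n_factors)))
    return rng.sample(range(n_factors), k)

def aggregate_regression(predictions):
    """Step 4 (regression): average the trees' predictions."""
    return sum(predictions) / len(predictions)

def aggregate_classification(predictions):
    """Step 4 (classification): take the most common predicted class."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = bootstrap_indices(8, rng)   # rows used to grow one tree
factors = choose_factors(10, rng)  # factors considered for a branch

# Hypothetical outputs of three trees for a single observation.
tree_outputs = [4.0, 5.0, 6.0]
tree_votes = ["yes", "no", "yes"]
print(aggregate_regression(tree_outputs), aggregate_classification(tree_votes))
```

Because each tree sees a different bootstrap sample and a different factor subset, the trees disagree in their errors, and averaging reduces the variance of the combined prediction.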

| Benefits      | Drawbacks |
| ----------- | ----------- |
| Better overall estimates      | Harder to explain/interpret results       |
| Averaging across trees somewhat neutralizes overfitting   | Can't give us a specific regression or classification model from the data        |

****

## Logistic Regression
*Sometimes called a logit model*

A standard linear regression model is as follows:

$y = a_0 + a_1 x_1 + \dots + a_j x_j = X^T \alpha$, where $X = (1, x_1, \dots, x_j)$ and $\alpha = (a_0, a_1, \dots, a_j)$ is the vector of coefficients as estimated by MLE

A logistic regression model transforms the above:

$\log (\frac{p}{1-p}) = a_0 + a_1 x_1 + \dots + a_j x_j = X^T \alpha$

$\therefore p = \frac{1}{1 + e^{-X^T \alpha}}$

This transformation creates a generalized linear model (GLM) whose output lies between 0 and 1 and represents the probability that a given observation belongs to the positive class of a binary target variable.
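
As a quick check of the algebra, the log-odds and its inverse can be written directly. The coefficient values and the observation below are hypothetical, chosen only so that the linear predictor comes out at zero.

```python
import math

def log_odds(p):
    """Left-hand side of the model: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the log-odds: p = 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients (a_0, a_1) and one observation with a leading 1.
alpha = [-3.0, 0.1]
x = [1.0, 30.0]

z = sum(a * xi for a, xi in zip(alpha, x))  # linear predictor X^T alpha
p = sigmoid(z)
print(f"linear predictor = {z:.3f}, probability = {p:.3f}")

# Round trip: applying the log-odds to p recovers the linear predictor.
print(abs(log_odds(p) - z) < 1e-9)
```

A linear predictor of zero maps to a probability of 0.5, large positive values approach 1, and large negative values approach 0.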

<img alt="Logistic Regression Curve" src="assets/logistic_regression_curve.png" width="200">

As in linear regression, we can use the following:

- Transformations of input data
- Interaction terms
- Regularization
- Trees / random forests

However, logistic regression takes longer to fit because its coefficients have no closed-form solution and must be estimated iteratively.
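
To illustrate what an iterative fit looks like, here is a minimal sketch that maximizes the log-likelihood by plain gradient ascent for a single covariate. Gradient ascent is a stand-in chosen for readability; statistical libraries generally rely on faster solvers. The data, learning rate, and step count are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Estimate (a0, a1) by gradient ascent on the average log-likelihood."""
    a0, a1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a0 + a1 * x)  # observed minus predicted probability
            g0 += err
            g1 += err * x
        a0 += lr * g0 / n
        a1 += lr * g1 / n
    return a0, a1

# Hypothetical data; the classes overlap so the estimates stay finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1]

a0, a1 = fit_logistic(xs, ys)
print(f"a0={a0:.3f}, a1={a1:.3f}, p(x=2)={sigmoid(a0 + a1 * 2.0):.3f}")
```

Each pass nudges the coefficients in the direction that makes the observed labels more likely, which is why fitting costs many passes over the data rather than a single matrix solve.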

****

## Advanced methods in regression

- **Poisson regression**

    This can be used when the response follows a Poisson distribution, $f(z) = \frac{\lambda^z e^{-\lambda}}{z!}$, for example, count arrivals at an airport security line. The arrival rate might be a function of time, so the model estimates $\lambda(x)$.
    
    
- **Regression splines**

    A spline is a function made of polynomial pieces that connect to each other. The following image shows 4 different splines fitted to the data:
    
    <img alt="Regression splines" src="assets/regression_splines.png" width="300">
    
    This fits different functions to different parts of the dataset while allowing for smooth connections between the parts. In an order-$k$ regression spline, the polynomial pieces are of order $k$. One such algorithm is **MARS** (*multivariate adaptive regression splines*), which is commonly implemented in code packages under the name [*Earth*](https://github.com/scikit-learn-contrib/py-earth).
    
    
- **Bayesian regression**

    Starts with a prior distribution on the regression coefficients and computes the posterior distribution given the data. Most helpful **when there is not much data**: we can leverage experts' opinions to define a prior, or use a very broad prior when no such information is available.
    
    
- **$k$-Nearest-Neighbor Regression**

    Does not estimate a prediction function. Instead, it predicts the response as a weighted average (*commonly weighted by distance*) of the $k$ closest data points.
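
The last method in the list is simple enough to sketch in full. The version below assumes Euclidean distance and inverse-distance weights, which is only one common choice; `knn_predict`, the small `eps` guard against division by zero, and the data are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=3, eps=1e-12):
    """Predict the response at `query` as an inverse-distance weighted average
    of the k nearest training points."""
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    weights = [1 / (d + eps) for d, _ in neighbours]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbours))
    return weighted_sum / sum(weights)

# Hypothetical one-dimensional training data.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]

print(knn_predict(train_x, train_y, (1.5,), k=2))
```

No coefficients are estimated at any point: all of the work happens at prediction time, when the neighbours of the query point are looked up.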
    
****

## Variable Selection

Using too many factors in a model can lead to two main problems:

1. Overfitting
    - When # of factors is close to or larger than # of data points
    - The model might fit too closely to random effects
    
    
2. Difficulty of interpretation
    - Overly-complex models can be hard to interpret, especially when factors are correlated with each other
    
    
****

**Variable selection techniques**

1. **Forward/Backwards/Stepwise Regression**

    Decisions are made step by step, an approach known as a *greedy algorithm*: at each step, take the one thing that looks best without considering future options.

    1. Forward Selection
        - Start with no factors
        - Evaluate each factor one by one and select factors that have a p-value below a given threshold (say, $p \leq 0.15$)
        - Fit model with given set of significant factors
        - Remove factors with high p-value
        - Re-fit model with final set of factors   
        
        <img alt="Forward" src="assets/forward.png" width="300">
    
    2. Backwards Elimination
        - Start with all factors
        - Evaluate factors and find worst factor
        - Remove factors above p-threshold (say, $p > 0.15$)
        - Fit model with final set of factors
        
    3. Stepwise (*Hybrid of forward / backwards*)
        - Start with no factors
        - Evaluate each factor and find best factor (lowest p-value)
        - If factor has $p \leq 0.15$ then add factor into the model
        - Fit model with current set of factors
        - Remove factors with high p-value (say, $p > 0.15$)
        - Fit model with final set of factors and remove factors with a stricter threshold ($p > 0.05$)
        
        <img alt="Stepwise" src="assets/stepwise.png" width="500">

    Alternative metrics such as AIC or BIC can be considered instead of p-value as well.
    
    
2. **Regularized Regression**

    Adds a constraint to the standard regression equation. In standard linear regression, the model aims to minimize the squared error as follows:
    
    $\min \sum_{i=1}^{n} (y_i - \hat y_i)^2$
    
    where each $\hat y_i$ is predicted by $\hat y_i = x_i^T \beta = b_0 + b_1 x_{i1} + \dots + b_p x_{ip}$
    
    In Lasso regression, we add a constraint such that the equation is now as follows:
    
    $\min \sum_{i=1}^{n} (y_i - \hat y_i)^2 + \lambda \sum_{j = 0}^{p} \vert b_j \vert$
    
    where $\lambda \sum_{j = 0}^{p} \vert b_j \vert$ is a penalty term that penalizes large factor weights. Choosing a low $\lambda$ shrinks the penalty term, so the model resembles the original linear regression without penalized weights. **Scaling the data is necessary** for lasso regression to work, because the penalty compares all coefficients on the same scale.
    
    This method can drive some coefficients to exactly zero, which removes those variables from the model's output entirely and therefore helps select the variables that offer the best prediction quality.
    
    A variant of the lasso regression is the *elastic net* which is a combination of the lasso and ridge regression methods such that the equation is now as follows:
    
    $\min \sum_{i=1}^{n} (y_i - \hat y_i)^2 + \lambda \sum_{j = 0}^{p} \vert b_j \vert + (1 - \lambda) \sum_{j = 0}^{p} b_j^2$
    
    The elastic net can lead to better predictive ability, but it needs to be evaluated against the other options.
    
    Generally, lasso regression simplifies models by forcing some coefficients to be exactly zero, while ridge regression shrinks coefficients toward zero and reduces the variance of the estimates. The elastic net captures both the variable selection of lasso and the predictive benefits of ridge.
    
    However, it also inherits the disadvantages of both: lasso arbitrarily rules out some correlated variables, and ridge damps the coefficients of highly influential predictive variables in exchange for lower variance.
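
To see how the lasso penalty produces exact zeros, here is a minimal coordinate-descent sketch using soft-thresholding. It rests on several assumptions that are not in the notes: the factors are standardized and the response is centered, so the intercept is handled by centering and left unpenalized; the solver, the function names, and the data are all hypothetical.

```python
def standardize(col):
    """Scale a factor to mean 0 and standard deviation 1."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso(columns, y, lam, sweeps=200):
    """Minimize sum (y_i - yhat_i)^2 + lam * sum |b_j| by coordinate descent."""
    y_mean = sum(y) / len(y)
    resid = [v - y_mean for v in y]  # residuals when every b_j is 0
    b = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col)
            # Correlation of factor j with the residual that excludes factor j.
            rho = sum(v * (r + b[j] * v) for v, r in zip(col, resid))
            new_b = soft_threshold(rho, lam / 2) / z
            resid = [r - (new_b - b[j]) * v for v, r in zip(col, resid)]
            b[j] = new_b
    return b

# Hypothetical data: y depends strongly on x1 and only weakly on x2.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, -1, 1, -1, 1]
y = [2.3, 3.9, 6.2, 7.8, 9.9, 12.1, 13.8, 16.2]
columns = [standardize(x1), standardize(x2)]

b_unpenalized = lasso(columns, y, lam=0.0)
b_lasso = lasso(columns, y, lam=4.0)
print("lambda = 0:", [round(v, 3) for v in b_unpenalized])
print("lambda = 4:", [round(v, 3) for v in b_lasso])
```

With $\lambda = 0$ both factors keep a nonzero weight; with the larger penalty the weak factor's coefficient is set to exactly zero while the strong factor's coefficient is only shrunk, which is the variable-selection behaviour described above.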

****

# Basic code
A `minimal, reproducible example`