## Flaws of Linear Regression
- The main drawback of simple linear regression models is that they rarely work well in real-world applications.
- While they are easy to implement and well suited to research articles, they are often less useful when the goal is making predictions.
- In this regard, simple linear regression suffers from two major flaws:
    - It's prone to overfitting when there are many input features.
    - It cannot easily express non-linear relationships.

### It's prone to overfitting with many input features. Why does this happen?
- Let's say you have 100 observations in your training dataset.
- Let's say you also have 100 features.
- If you fit a linear regression model with all 100 features, it can perfectly "memorize" the training set.
- With as many coefficients as observations, the model has enough freedom to pass exactly through every training point. It would have zero error on the training data, but perform poorly on unseen data.
- It hasn't learned the true underlying patterns; it has also fit the noise in the training data.
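The memorization effect described above can be sketched on a smaller, hypothetical setup: 10 observations, 10 random features, no intercept, and a target that is pure noise. The helper names below are illustrative only, and the linear system is solved by plain Gaussian elimination to keep the sketch dependency-free. Because there are as many coefficients as observations, the fit reproduces every training target, even though there is nothing real to learn.

```python
import random

def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def predict(X, w):
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) for row in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)
n = 10  # same number of observations and features

def make_data(n_rows):
    X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]  # target is pure noise
    return X, y

X_train, y_train = make_data(n)
X_test, y_test = make_data(n)

# One coefficient per feature; the square system is solved exactly.
w = solve_linear_system(X_train, y_train)
train_mse = mse(y_train, predict(X_train, w))
test_mse = mse(y_test, predict(X_test, w))
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

The training error is zero up to floating-point rounding, while the error on fresh data is not, which is the signature of overfitting.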

- **Regularization** is a technique used to prevent overfitting by artificially penalizing model coefficients.
    - It can discourage large coefficients (by dampening them).
    - It can also remove features entirely (by setting their coefficients to 0).
    - The "strength" of the penalty is tunable. (More on this tomorrow...)

    
## Regularized regression
### Lasso Regression
- Lasso, or LASSO, stands for Least Absolute Shrinkage and Selection Operator.
- Lasso regression penalizes the absolute size of coefficients.
- Practically, this leads to coefficients that can be exactly 0.
- Thus, Lasso offers automatic feature selection because it can completely remove some features.
- Remember, the "strength" of the penalty should be tuned.
- A stronger penalty leads to more coefficients pushed to zero.
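To see why an absolute-size penalty produces exact zeros, here is a minimal sketch under a simplifying assumption: if the features are orthonormal and the objective is half the sum of squared errors plus `penalty` times the sum of absolute coefficients, the Lasso solution is the "soft-thresholded" version of the unpenalized coefficients. In general there is no such closed form and the problem is solved iteratively. The coefficient values below are hypothetical.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero by `penalty`, clipping at exactly 0."""
    magnitude = abs(coefficient) - penalty
    if magnitude <= 0:
        return 0.0
    return magnitude if coefficient > 0 else -magnitude

# Hypothetical unpenalized (least-squares) coefficients for four features.
ols_coefficients = [3.0, -0.4, 0.1, -2.0]

for penalty in [0.0, 0.5, 2.5]:
    lasso = [soft_threshold(b, penalty) for b in ols_coefficients]
    n_zero = sum(1 for w in lasso if w == 0.0)
    print(f"penalty={penalty}: {lasso} ({n_zero} coefficients removed)")
```

Small coefficients are removed first, and raising the penalty removes more of them.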

### Ridge Regression
- Ridge regression penalizes the squared size of coefficients.
- Practically, this leads to smaller coefficients, but it doesn't force them to 0.
- In other words, Ridge offers feature shrinkage.
- Again, the "strength" of the penalty should be tuned.
- A stronger penalty leads to coefficients pushed closer to zero.
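The shrinkage behaviour can be illustrated with the simplest possible case, assumed here for clarity: a single feature, no intercept, and the objective "sum of squared errors plus `penalty` times the squared coefficient". Setting the derivative to zero gives the closed form used below. The data values are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Minimize sum((y - w*x)^2) + penalty * w^2 for one feature, no intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for penalty in [0.0, 10.0, 100.0, 1000.0]:
    print(f"penalty={penalty}: w={ridge_slope(xs, ys, penalty):.4f}")
```

The coefficient keeps shrinking as the penalty grows, but the penalty only enlarges the denominator, so it never reaches exactly 0.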

### Elastic-Net
- Elastic-Net is a compromise between Lasso and Ridge.
- Elastic-Net penalizes a mix of both absolute and squared size.
- The ratio of the two penalty types should be tuned.
- The overall strength should also be tuned.
- Oh and in case you’re wondering, there’s no "best" type of penalty. It really depends on the dataset and the problem. We recommend trying different algorithms that use a range of penalty strengths as part of the tuning process, which we'll cover in detail tomorrow.
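The mixed penalty described above can be written out explicitly. The form below is one common parameterization, not the only one; the symbols are assumptions introduced here: $w$ for the coefficients, $n$ for the number of observations, $\lambda \ge 0$ for the overall strength, and $\alpha \in [0, 1]$ for the ratio between the two penalty types.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \lambda \left( \alpha \, \lVert w \rVert_1 + \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right)
$$

Setting $\alpha = 1$ leaves only the absolute-size (Lasso) penalty, and $\alpha = 0$ leaves only the squared-size (Ridge) penalty.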

## Decision trees
- Decision trees model data as a "tree" of hierarchical branches. They make branches until they reach "leaves" that represent predictions.
- Due to their branching structure, decision trees can easily model nonlinear relationships.
- For example, let's say that for Single Family homes, larger lots command higher prices.
- However, let's say that for Apartments, smaller lots command higher prices (e.g. lot size acts as a proxy for urban vs. rural location).
- This reversal of correlation is difficult for linear models to capture unless you explicitly add an interaction term (which requires anticipating it ahead of time).
- On the other hand, decision trees can capture this relationship naturally.
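The lot-size example can be made concrete with a toy sketch. The four listings and their prices are hypothetical, and the tree is written by hand rather than learned, purely to show the shape of the rule a tree can express. A single pooled slope on lot size averages the two opposite trends away, while a depth-2 tree that first splits on property type handles both.

```python
# Hypothetical listings: (property_type, lot_size_sqft, price_in_thousands).
listings = [
    ("single_family", 2000, 300.0),
    ("single_family", 8000, 500.0),
    ("apartment", 2000, 500.0),
    ("apartment", 8000, 300.0),
]

def pooled_slope(rows):
    """Least-squares slope of price on lot size, ignoring property type."""
    lots = [lot for _, lot, _ in rows]
    prices = [price for _, _, price in rows]
    mean_lot = sum(lots) / len(lots)
    mean_price = sum(prices) / len(prices)
    covariance = sum((l - mean_lot) * (p - mean_price) for l, p in zip(lots, prices))
    variance = sum((l - mean_lot) ** 2 for l in lots)
    return covariance / variance

def tree_predict(property_type, lot_size):
    """A hand-written depth-2 tree: split on type first, then on lot size."""
    if property_type == "single_family":
        return 300.0 if lot_size < 5000 else 500.0  # larger lot, higher price
    return 500.0 if lot_size < 5000 else 300.0      # smaller lot, higher price

print("pooled linear slope:", pooled_slope(listings))
for property_type, lot_size, price in listings:
    print(property_type, lot_size, "actual:", price,
          "tree:", tree_predict(property_type, lot_size))
```

The linear model would need an explicit type-by-lot-size interaction term to recover what the two nested splits express directly.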
- **Flaws**: 
    - Decision trees suffer from a major flaw as well. If you allow them to grow without limit, they can completely "memorize" the training data simply by creating more and more branches.
    - As a result, individual unconstrained decision trees are very prone to overfitting.
- **Prevent Overfitting** 
- **1. ENSEMBLES** Ensembles are machine learning methods for combining predictions from multiple separate models. There are a few different methods for ensembling, but the two most common are:
    - **Bagging**: Bagging attempts to reduce the chance of overfitting complex models.
        - It trains a large number of "strong" learners in parallel.
        - A strong learner is a model that's relatively unconstrained.
        - Bagging then combines all the strong learners together in order to "smooth out" their predictions.
    - **Boosting**: Boosting attempts to improve the predictive flexibility of simple models.
        - It trains a large number of "weak" learners in sequence.
        - A weak learner is a constrained model (e.g. you could limit the max depth of each decision tree).
        - Each one in the sequence focuses on learning from the mistakes of the one before it.
        - Boosting then combines all the weak learners into a single strong learner.
- While bagging and boosting are both ensemble methods, they approach the problem from opposite directions. Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their aggregate complexity.
- **Ensembling is a general term, but when the base models are decision trees, they have special names: random forests and boosted trees!**
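The "smoothing" effect of bagging can be sketched with any unconstrained base model. As an assumption for brevity, the strong learner below is a 1-nearest-neighbour regressor (which memorizes its training set) rather than a decision tree, and the data, including the outlier at `x = 5`, are hypothetical. Each learner is fit on a bootstrap resample, and the predictions are averaged.

```python
import random

def one_nn_predict(train, x):
    """A 'strong' learner: 1-nearest-neighbour memorizes the training set."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def bagged_predict(train, x, n_models=200, seed=0):
    """Average the predictions of learners fit on bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    # The learners are independent, so they could also be trained in parallel.
    for _ in range(n_models):
        # Bootstrap: draw len(train) points with replacement.
        sample = [rng.choice(train) for _ in train]
        predictions.append(one_nn_predict(sample, x))
    return sum(predictions) / len(predictions)

# Hypothetical noisy data: y is roughly 2 * x, with one outlier at x = 5.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 20.0), (6, 12.1), (7, 13.8)]

single = one_nn_predict(train, 5)  # reproduces the outlier exactly
bagged = bagged_predict(train, 5)  # pulled toward the neighbouring points
print("single learner:", single, "bagged:", round(bagged, 2))
```

The single learner returns the memorized outlier, while the bagged average lands lower because some resamples leave the outlier out. Boosting takes the opposite route: it starts from weak learners and adds them one after another.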


- **Random forests** train a large number of "strong" decision trees and combine their predictions through bagging.
    - There are two sources of "randomness" for random forests:
         - Each tree is only allowed to choose from a random subset of features at each split (which makes the trees less correlated with one another).
         - Each tree is only trained on a random sample of observations drawn with replacement (a process called bootstrap resampling).
    - Random forests tend to perform very well right out of the box.
        - They often beat many other models that take up to weeks to develop.
        - They are the perfect "swiss-army-knife" algorithm that almost always gets good results.
        - They don’t have many complicated parameters to tune.
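The two sources of randomness can be sketched in a few lines. This only shows the sampling step, not tree fitting, and the sizes and the `max_features` value are hypothetical. It assumes the common formulation in which rows are drawn with replacement once per tree and the candidate features are redrawn for each split.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Row indices for one tree: sampled with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, max_features, rng):
    """Feature indices a tree may consider at one split (no repeats)."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(42)
n_rows, n_features = 10, 6
max_features = 2  # hypothetical setting

for tree_id in range(3):
    rows = bootstrap_rows(n_rows, rng)
    features = candidate_features(n_features, max_features, rng)
    # Rows never drawn for this tree are often called "out-of-bag".
    left_out = sorted(set(range(n_rows)) - set(rows))
    print(f"tree {tree_id}: rows={rows} out-of-bag={left_out} "
          f"split candidates={features}")
```

Because each tree sees different rows and different candidate features, the trees make different mistakes, and averaging them cancels part of that error.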
        

- **Boosted trees** train a sequence of "weak", constrained decision trees and combine their predictions through boosting.
    - Each tree is allowed a maximum depth, which should be tuned.
    - Each tree in the sequence tries to correct the prediction errors of the one before it.
    - In practice, boosted trees tend to have the highest performance ceilings.
    - They often beat many other types of models after proper tuning.
    - They are more complicated to tune than random forests.
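A minimal sketch of boosting with constrained trees, under stated assumptions: squared-error loss, one numeric feature, trees limited to a single split ("stumps"), and each new stump fit to the residuals of the ensemble built so far. This is one common form of boosting, not the only one. The names `n_trees` and `learning_rate` and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one split) by minimizing squared error."""
    best = None
    # Exclude the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    history = []  # training MSE after each added stump
    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return predict, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]  # hypothetical non-linear target
predict, history = boost(xs, ys)
print(f"training MSE after 1 tree: {history[0]:.1f}, "
      f"after {len(history)} trees: {history[-1]:.3f}")
```

Each stump is a weak model on its own, but the training error falls as more are added. The number of trees, the learning rate, and the maximum depth all interact, which is part of why boosted trees take more tuning than random forests.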