## Why do Ensembles work?

* **Hard Voting Classifier** : Aggregate the predictions of all classifiers and predict the class that gets the most votes

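* This works even if each classifier is a weak learner (only slightly better than random guessing), provided there are enough of them and they are sufficiently independent and diverse

* Suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If their errors were perfectly independent, predicting the majority-voted class would be correct roughly 73 to 75% of the time. In practice, classifiers trained on the same data make correlated errors, so the real gain is smaller.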

* An ensemble generally has a similar bias to a single predictor but a lower variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular)

* **Soft Voting Classifiers** : If all classifiers are able to estimate class probabilities (i.e., they all have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
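
The 1,000-classifier figure quoted above can be checked with a short binomial calculation. This is only a sketch under a strong assumption: every classifier is correct independently with the same probability, which classifiers trained on the same data are not. The function name `majority_vote_accuracy` is made up for illustration. Under that assumption the result for 1,000 voters at 51% comes out at roughly 0.73.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes 0 < p_correct < 1 and fully independent classifiers
    (an idealization). Ties count as a wrong ensemble prediction.
    """
    n, p = n_classifiers, p_correct
    majority = n // 2 + 1  # smallest number of correct votes that wins

    def log_binomial_pmf(k):
        # log of C(n, k) * p^k * (1 - p)^(n - k), kept in log space
        # so that large n does not overflow
        log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_comb + k * log(p) + (n - k) * log(1 - p)

    return sum(exp(log_binomial_pmf(k)) for k in range(majority, n + 1))


# Accuracy of the majority vote as the ensemble grows, each voter at 51%
for n in (1, 10, 100, 1000, 10000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The accuracy keeps climbing as more independent voters are added, which is the law-of-large-numbers argument behind hard voting.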

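* Ensembles whose predictors are trained independently (voting, bagging, pasting) scale well because training can be parallelized across CPU cores. This is an advantage over boosting, which cannot be parallelized across predictors because each predictor is trained on the output of the previous one (sequential learning)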

* How to get diverse classifiers:
    * use different training algorithms
    * use the same algorithm, trained on different random subsets of the training data

* Bagging Vs Pasting:
    * When sampling is performed with replacement : Bagging
    * When sampling is performed without replacement : Pasting
    * Bootstrapping (sampling with replacement) introduces more diversity into the subsets, so bagging ends up with a slightly higher bias than pasting, but the predictors are less correlated, so the ensemble's variance is reduced.

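* Out of bag evaluation:
    * In bagging, each predictor leaves some training instances unsampled.

    * With $m$ instances drawn with replacement from a training set of size $m$, each predictor sees only about 63% of the distinct instances on average: $1 - (1 - 1/m)^m \to 1 - e^{-1} \approx 0.63$.

    * The remaining 37% (the out-of-bag, or oob, instances) are not the same for all predictors.

    * Because a predictor never sees its oob instances during training, it can be evaluated on them without a separate validation set. Averaging these oob evaluations over all predictors gives the oob score, which acts as a validation score for the ensemble.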

* Random Patches Vs Random Subspaces:
    * Sampling can be done with subsets of training data and subsets of features
    * Random Patches : Sampling both training data & features
    * Random Subspaces : Sampling features but keeping all training data
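
A minimal sketch of bagging with out-of-bag evaluation, assuming Scikit-Learn's `BaggingClassifier` and a synthetic `make_moons` dataset chosen purely for illustration (the dataset, sizes, and seeds are not from these notes).

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed only for this example
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=200,
    max_samples=1.0,   # each predictor draws as many instances as the training set holds
    bootstrap=True,    # True -> bagging (with replacement), False -> pasting
    oob_score=True,    # evaluate each predictor on the instances it never saw
    random_state=42,
)
bag_clf.fit(X_train, y_train)

# The oob score is computed from training data only, yet it should land
# in the same region as the accuracy on the held-out test set.
print("oob score :", bag_clf.oob_score_)
print("test score:", bag_clf.score(X_test, y_test))
```

In the same class, `max_features` and `bootstrap_features` control feature sampling, which is how random patches and random subspaces are configured, and `n_jobs` spreads training across cores.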
    
    

## Random Forest

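* Decision Tree : a single model, no ensemble

* Bagging Classifier : ensemble; `max_samples` (training instances drawn per predictor) is tunable up to the size of the training set

* Random Forest : ensemble of decision trees; each split searches only a random subset of the features; `max_samples` as in the Bagging Classifier
* Why RF is better : the extra randomness gives greater tree diversity, which trades a slightly higher bias for a lower variance and usually yields a better model overall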

### Feature Importance:
* Measure how much the tree nodes that use a feature reduce impurity, on average across all trees in the forest.

* Remember that impurity decreases as you move down a tree, so a feature that causes a greater reduction in impurity is more powerful.

* It is a weighted average: each node's weight equals the number of training samples associated with that node. A feature that produces a large impurity reduction on nodes covering a lot of data is the most important!

* The importances are usually scaled so that they sum to 1 across all features
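
As a small illustration of the points above, the sketch below reads the impurity-based importances from a fitted `RandomForestClassifier`. The iris dataset and the hyperparameter values are assumptions made for the example, not part of these notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# feature_importances_ holds the mean impurity decrease per feature,
# weighted by node size and scaled so that the values sum to 1
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")

print("sum of importances:", rnd_clf.feature_importances_.sum())
```

On this dataset the two petal measurements should dominate the two sepal measurements.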

## Extra Trees

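* If you make the individual trees of a random forest even more random, you get extremely randomized trees (Extra Trees)

* Instead of searching for the best possible threshold for each candidate feature at a split, you use random thresholds

* This trades more bias for lower variance

* Faster to train than RF, because finding the optimal threshold at every node is one of the most time-consuming parts of growing a tree

* It is hard to tell in advance which will do better, so try both RF and Extra Trees and compare them with cross-validation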
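
One way to follow the last piece of advice is to cross-validate both models on the same data. The sketch below assumes a synthetic `make_moons` dataset and arbitrary hyperparameters, so the numbers it prints say nothing about which model wins on real data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data, assumed only for this example
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "extra trees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```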


## AdaBoost

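* Initialize each data weight to $w_i = 1/n$, where $n$ is the number of training instances

* Predictor's weight:
    * Weighted error rate of the $j$-th predictor: $ r_j = \frac{\sum_{i \in \text{misclassified}} w_i}{\sum_{i=1}^{n} w_i} $

    * Predictor weight: $ \alpha_j = \eta \ \log \frac{1 - r_j}{r_j} $, where $\eta$ is the learning rate.
    
    * The error rate is the weighted fraction of instances the predictor misclassifies. With the initial uniform weights it is simply the fraction of mistakes.

    * The predictor weight is high if the error rate is low, zero if the predictor is no better than random guessing ($r_j = 0.5$), and negative if it is worse

* Update data weights:
    * Increase the weights of misclassified instances and leave the others unchanged: $ w_i \leftarrow \begin{cases} w_i & \text{if instance } i \text{ was predicted correctly} \\ w_i \exp(\alpha_j) & \text{if instance } i \text{ was misclassified} \end{cases} $

    * Normalize the weights: $ w_i \leftarrow \frac{w_i}{\sum_{k=1}^{n} w_k} $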

* **Prevent Overfitting**:
    * Reduce the number of estimators: fewer sequential rounds of re-weighting leave less opportunity to fit noise. Regularizing the base estimator more strongly also helps.
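
The weight formulas above can be traced by hand for a single round. The sketch below is plain Python with a made-up helper name, `adaboost_round`, and five toy labels. It uses the weighted error rate, which equals the plain fraction of mistakes when the weights are still uniform, and it assumes the error rate is strictly between 0 and 1.

```python
import math


def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost re-weighting round (illustrative helper).

    Returns (error_rate, predictor_weight, new_weights).
    Assumes 0 < error_rate < 1, otherwise the log is undefined.
    """
    wrong = [t != p for t, p in zip(y_true, y_pred)]

    # Weighted error rate: share of the total weight on misclassified instances
    r = sum(w for w, is_wrong in zip(weights, wrong) if is_wrong) / sum(weights)

    # Predictor weight: large when r is small, zero when r == 0.5
    alpha = learning_rate * math.log((1 - r) / r)

    # Boost the misclassified instances, leave the correct ones unchanged
    boosted = [w * math.exp(alpha) if is_wrong else w
               for w, is_wrong in zip(weights, wrong)]

    # Normalize so the weights sum to 1 again
    total = sum(boosted)
    return r, alpha, [w / total for w in boosted]


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0]  # only the last instance is misclassified
weights = [1 / len(y_true)] * len(y_true)

r, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("error rate      :", r)
print("predictor weight:", alpha)
print("new weights     :", new_weights)
```

With one mistake out of five, the error rate is 0.2, the predictor weight is $\log 4$, and after normalization the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.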

## Gradient Boosting

* Similar to AdaBoost: predictors are added sequentially, each one correcting its predecessor

* Instead of tweaking the data weights like AdaBoost, this method fits each new predictor to the residual errors made by the previous predictors

* **Algorithm** :
    * Choose a base estimator : e.g. a Decision Tree with `max_depth=2`

    * 1st Tree : Build with xtrain, ytrain 

    * 2nd Tree : Calculate the residuals y_error = ytrain - y_pred, using the 1st tree's predictions on xtrain. Build with xtrain and y_error

    * 3rd Tree : Calculate the new residuals y_error = y_error - y_pred, using the 2nd tree's predictions on xtrain. Build with xtrain and y_error

    * Prediction = sum(individual tree predictions on new data)

* As you can see, the predictions get progressively better as trees are added.
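
The algorithm above can be sketched with three `DecisionTreeRegressor` instances fitted by hand. The noisy quadratic dataset and the variable names are assumptions made for the example. No learning rate is applied, matching the steps as written.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data, assumed only for this example
rng = np.random.default_rng(42)
X_train = rng.uniform(-0.5, 0.5, size=(100, 1))
y_train = 3 * X_train[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees = []
train_mse = []
residual = y_train.copy()  # the first tree is fit on the targets themselves

for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_train, residual)
    trees.append(tree)
    # What is still unexplained becomes the target of the next tree
    residual = residual - tree.predict(X_train)
    train_mse.append(float(np.mean(residual ** 2)))


def ensemble_predict(X):
    """Sum the predictions of all trees fitted so far."""
    return sum(tree.predict(X) for tree in trees)


X_new = np.array([[0.4]])
print("prediction at x=0.4:", ensemble_predict(X_new))
print("training MSE after 1, 2, 3 trees:", train_mse)
```

The training error shrinks with every tree, because each tree fits the part of the signal that the previous ones left in the residuals.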

![title](Images\Grad_1.PNG)

## GBM Concept: Why does fitting Residuals work?

* Recall a basic diagnostic from linear regression: the residuals of a good fit should be scattered randomly around zero, with no remaining pattern.

* Tree-based models (decision trees are the base models for gradient boosting here) do not rely on that assumption, but the same logic applies: if the residuals still show a pattern around zero, that pattern can be fit by another model.

* So the intuition behind the gradient boosting algorithm is to repeatedly exploit the patterns left in the residuals, strengthening a model with weak predictions step by step.

* Once the residuals show no pattern that can be modeled, we can stop modeling them (continuing would fit noise and lead to overfitting).

* Algorithmically, each new tree reduces the training loss. For squared-error loss the residuals are, up to a constant factor, the negative gradient of the loss with respect to the current predictions, hence the name. We stop adding trees when the validation loss reaches its minimum.

* The ultimate goal can be seen as removing all patterns from the residuals.

![title](Images\GB_1.PNG)

* As trees are added, the residuals shrink and the remaining pattern fades. 


## GBM Hyperparameters:

* $\eta$ : learning rate
    * Determines the impact of each predictor on the final prediction.
    
    * A value of 1 applies each tree's full correction, which tends to overfit and generalize poorly
    * Lower values generalize better but require more trees in the sequence to model all the relations, which can be computationally expensive
    
    * Intuition : prediction = $ y_{tree1} + \eta \ y_{tree2} + .. + \eta \ y_{treen} $, with the same $\eta$ applied to every tree
    * For regression, $y_{tree1}$ is typically just the average of all y's. Hence the second tree is fit to the exact errors between $y_{original}$ and $y_{tree1}$.
    * If $\eta$ is equal to 1 and each tree fits its residuals closely, the training residuals are driven to zero almost immediately, hence the overfitting problem can occur.

* `n_estimators` : 
    * too many trees can overfit; use cross-validation to estimate the best value

* Other tree and sampling parameters:
    * `min_samples_split` in each predictor
    * `min_samples_leaf`
    * `min_weight_fraction_leaf`
    * `max_depth`
    * `max_leaf_nodes`
    * `max_features`
    * `subsample` (fraction of the training data used to fit each tree)
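
The learning-rate intuition above can be made concrete with a deliberately idealized calculation. The sketch assumes that every tree predicts the current residual of a training point exactly, which real depth-limited trees do not, and the helper name `residual_after` is made up for illustration. Under that assumption the residual shrinks by a factor of $(1 - \eta)$ per tree.

```python
def residual_after(n_trees, learning_rate, initial_residual=1.0):
    """Residual left on one training point after n_trees boosting steps.

    Idealization: each tree predicts the current residual exactly,
    and only learning_rate times that prediction is added to the model.
    """
    residual = initial_residual
    for _ in range(n_trees):
        tree_prediction = residual  # perfect fit to the residual (assumed)
        residual -= learning_rate * tree_prediction
    return residual


for eta in (1.0, 0.5, 0.1):
    remaining = [residual_after(k, eta) for k in (1, 5, 20)]
    print(f"eta={eta}: residual after 1, 5, 20 trees -> {remaining}")
```

With $\eta = 1$ the training residual is wiped out by the very first correction, which is the overfitting risk described above. With $\eta = 0.1$ many more trees are needed to reach the same point, which is the extra computational cost.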
    



## Points to note:

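* Early stopping :
    * Useful to prevent overfitting. Plot the validation error against the number of trees and stop building sequential trees once the validation error stops improving and starts to increase 

* XGBoost : Extreme Gradient Boosting has early stopping built in: given a validation set and a patience value (`early_stopping_rounds`), it stops adding trees once the validation metric stops improving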
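
A minimal sketch of early stopping with Scikit-Learn's `GradientBoostingRegressor`, assuming a synthetic dataset and arbitrary hyperparameters. It trains a generous number of trees, uses `staged_predict` to measure the validation error after each one, and then refits with the best count.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data, assumed only for this example
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=200, learning_rate=0.1, random_state=0
)
gbrt.fit(X_train, y_train)

# Validation error of the ensemble after 1, 2, ..., 200 trees
val_errors = [
    mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)
]
best_n_estimators = int(np.argmin(val_errors)) + 1

# Refit with only as many trees as the validation set supports
gbrt_best = GradientBoostingRegressor(
    max_depth=2, n_estimators=best_n_estimators, learning_rate=0.1, random_state=0
)
gbrt_best.fit(X_train, y_train)

print("best number of trees:", best_n_estimators)
print("validation MSE at best:", val_errors[best_n_estimators - 1])
```

The same estimator can also stop on its own during training through its `n_iter_no_change` and `validation_fraction` parameters, which avoids training the surplus trees in the first place.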