# Gradient Boosting vs Random Forest

Gradient Boosting Machines (GBM) and Random Forests (RF) are both ensembles of decision trees. They differ in the order in which the trees are built and in the way the results are combined. It has been shown that GBM can perform better than RF
if its parameters are tuned carefully.

**Gradient Boosting**: GBM builds trees one at a time, where each new tree helps to correct the errors made by the previously trained trees.
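
The sequential idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes a squared-error loss, a single numeric feature, and decision stumps (one-split trees) as weak learners. The names `fit_stump`, `fit_gbm`, `predict_gbm` and `mse` are hypothetical helpers introduced only for this example.

```python
# Minimal gradient boosting for 1-D regression with squared error.
# The weak learner is a decision stump (a tree with a single split).

def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # The largest x is skipped so the right side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def fit_gbm(xs, ys, n_trees=20, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # the new tree targets the current errors
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data: y = x^2 on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
for n in (0, 1, 20):
    print(n, "trees -> training MSE", round(mse(fit_gbm(xs, ys, n_trees=n), xs, ys), 3))
```

On this toy data the training error shrinks as more stumps are added, because each stump is fit to what the previous ones still get wrong.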

### Real-world application

A great application of GBM is *anomaly detection* in supervised learning settings where the data is often highly unbalanced, such as
DNA sequences, credit card transactions or cybersecurity.


Gradient boosting has proven to be a powerful method on real-life datasets for learning-to-rank problems, due to
its two main features:

1. It performs the optimization in function space (rather than in parameter space), which makes the use of custom loss functions much easier.
2. Boosting focuses step by step on difficult examples, which gives a nice strategy for dealing with unbalanced datasets by
strengthening the impact of the positive class.
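
Both points can be illustrated with a small sketch. In function-space boosting, the next tree is fit to the negative gradient of the loss with respect to the current scores, so a custom loss only has to provide that gradient. The example below assumes a class-weighted logistic loss; the `pos_weight` parameter and the function names are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def weighted_log_loss(y, f, pos_weight):
    """Logistic loss on raw score f, with positives weighted by pos_weight."""
    p = sigmoid(f)
    w = pos_weight if y == 1 else 1.0
    return -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))

def negative_gradient(y, f, pos_weight):
    """Pseudo-residual: the target the next tree would be fit to."""
    w = pos_weight if y == 1 else 1.0
    return w * (y - sigmoid(f))

# Unbalanced toy labels: 8 negatives, 2 positives, all current scores at 0.
labels = [0] * 8 + [1] * 2
scores = [0.0] * len(labels)
residuals = [negative_gradient(y, f, pos_weight=4.0) for y, f in zip(labels, scores)]
print(residuals)
```

With `pos_weight` greater than 1, the pseudo-residuals of the positive examples are larger in magnitude, so the next tree is pulled more strongly toward correcting them.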

### Strengths of the model

Since boosted trees are derived by optimizing an objective function, GBM can be used with almost any
objective function for which we can write out the gradient. This includes things like ranking and Poisson regression, which are
harder to achieve with RF.
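
As a rough sketch of what "any objective with a gradient" means in practice, the loop below takes the objective as a plug-in gradient function. To keep it short, the weak learner is assumed to be a constant (a tree with no splits), and the Poisson objective is assumed to use a log link. `boost_constant` and the two gradient functions are hypothetical names for this example.

```python
import math

def squared_negative_gradient(y, f):
    # Loss: 0.5 * (y - f)^2
    return y - f

def poisson_negative_gradient(y, f):
    # Loss: exp(f) - y * f (Poisson negative log-likelihood with log link,
    # dropping the term that does not depend on f)
    return y - math.exp(f)

def boost_constant(ys, negative_gradient, n_rounds=200, learning_rate=0.1):
    """Boost a constant-only model: each round adds the mean pseudo-residual."""
    f = 0.0
    for _ in range(n_rounds):
        step = sum(negative_gradient(y, f) for y in ys) / len(ys)
        f += learning_rate * step
    return f

counts = [0, 1, 3, 4, 2]  # toy count data with mean 2
f_sq = boost_constant(counts, squared_negative_gradient)
f_po = boost_constant(counts, poisson_negative_gradient)
print("squared error fit:", f_sq)
print("Poisson fit (rate):", math.exp(f_po))
```

Only the gradient function changes between the two runs. Under squared error the constant converges to the mean of the targets, and under the Poisson objective the fitted rate `exp(f)` converges to the same mean.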

### Weaknesses of the model

- GBMs are more sensitive to overfitting if the data is noisy.
- Training generally takes longer because trees are built sequentially.
- GBMs are harder to tune than RF. There are typically three parameters: number of trees, depth of trees and learning
rate, and each tree built is generally shallow.

**Random Forest**: RF trains each tree independently, using a random sample of the data. This randomness helps to make
the model more robust than a single decision tree, and less likely to overfit on the training data.
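
A minimal sketch of this idea, assuming bootstrap sampling (sampling with replacement), a single numeric feature, decision stumps as trees, and majority voting to combine them. The helper names (`fit_stump`, `fit_forest`, `predict_forest`) are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample):
    """sample: list of (x, label). Pick the split with the fewest errors."""
    xs = sorted({x for x, _ in sample})
    best = None
    for t in xs[:-1]:  # skip the largest x so the right side is never empty
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        lv, rv = majority(left), majority(right)
        errors = sum(y != lv for y in left) + sum(y != rv for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, lv, rv)
    if best is None:  # degenerate sample with a single distinct x value
        label = majority([y for _, y in sample])
        return (xs[0], label, label)
    return best[1:]

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Each tree sees its own bootstrap sample and ignores the other trees.
        sample = [rng.choice(data) for _ in data]
        forest.append(fit_stump(sample))
    return forest

def predict_forest(forest, x):
    votes = [lv if x <= t else rv for t, lv, rv in forest]
    return majority(votes)

# Toy data: label 0 for x < 10, label 1 otherwise.
data = [(float(x), 0 if x < 10 else 1) for x in range(20)]
forest = fit_forest(data)
print(predict_forest(forest, 1.0), predict_forest(forest, 18.0))
```

Because no tree depends on another, the loop in `fit_forest` could be run in parallel, which is not possible for the sequential boosting procedure.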

### Real-world application

The most prominent application of random forest is *multi-class object detection* in large-scale real-world computer
vision problems. RF methods can handle a large amount of training data efficiently and are inherently suited
for multi-class problems.

Another application is in *bioinformatics*, like medical diagnosis. This method is especially attractive for this 
application in the following cases:

- The real-world data is noisy and contains many missing values, and some of the attributes are categorical or semi-continuous.
- Different data sources need to be integrated, which raises the issue of how to weight them.
- We need high predictive accuracy for a high dimensional problem with highly correlated features.

### Strengths of the model

- RF is much easier to tune than GBM. There are typically two parameters in RF: the number of trees and the number of features to be considered at each node.
- RF is harder to overfit than GBM.
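
The second of the two RF parameters can be pictured as follows: at every node, only a random subset of the features is considered for the split. The snippet is a sketch under assumptions; `candidate_features` is a hypothetical helper, and the square-root rule is a common heuristic for classification that is assumed here, not something the text above prescribes.

```python
import math
import random

def candidate_features(n_features, max_features, rng):
    """Draw the feature indices considered at one node (without replacement)."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(0)
n_features = 16
max_features = int(math.sqrt(n_features))  # assumed heuristic: sqrt of the feature count
for node in range(3):
    print("node", node, "considers features", candidate_features(n_features, max_features, rng))
```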

### Weaknesses of the model

- The main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm
slow for real-time prediction.
- For data including categorical variables with different numbers of levels, random forests are biased in favor
of attributes with more levels. Therefore, the variable importance scores from random forests are not reliable
for this type of data. Methods such as partial permutations have been used to address the problem.
- If the data contain groups of correlated features of similar relevance for the output, then smaller groups are
favored over larger groups.
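
The bias toward attributes with many levels can be reproduced on a toy example. The sketch assumes Gini impurity and multiway splits on categorical features, which is a simplification of what real random forest implementations do; `gini` and `gini_gain` are hypothetical helpers. It does not implement the partial permutation method mentioned above, it only shows where the bias comes from.

```python
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(feature, labels):
    """Impurity decrease from a multiway split on a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - weighted

labels = [0, 1] * 8            # 16 rows, balanced classes
two_levels = [0, 0, 1, 1] * 4  # uninformative feature with 2 levels
row_id = list(range(16))       # uninformative feature with 16 levels (one per row)

print("gain, 2 levels: ", gini_gain(two_levels, labels))
print("gain, 16 levels:", gini_gain(row_id, labels))
```

Neither feature carries any real signal, yet the one with a distinct level per row gets the maximal impurity decrease simply because every group it creates contains a single example.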

**Note**: An important point is how RF and GBM handle *missing data*. Gradient Boosting Trees use CART trees, and
CART trees are also used in Random Forests. With these trees, missing values are typically handled by imputation: with the average, with a rough average/mode,
or with an average/mode based on proximities. However, one can build a GBM or RF with other types of decision trees. The usual replacement
for CART is C4.5. In C4.5 the missing values are not replaced in the data set. Instead, the impurity function takes the
missing values into account by penalizing the impurity score with the ratio of missing values.
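
The C4.5-style penalty can be sketched as follows, under stated assumptions: the split score is an entropy-based information gain computed only on rows where the feature is known, then scaled by the fraction of known rows. Missing values are represented as `None`, and `penalized_gain` is a hypothetical helper. The full C4.5 algorithm does more than this (for example when sending rows with missing values down the tree), so this covers only the scoring step.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def penalized_gain(feature, labels):
    """Information gain on known rows, scaled by the fraction of known rows."""
    known = [(v, y) for v, y in zip(feature, labels) if v is not None]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for v, y in known:
        groups[v].append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    gain_on_known = entropy([y for _, y in known]) - remainder
    return (len(known) / len(feature)) * gain_on_known

labels = [0, 0, 1, 1, 0, 0, 1, 1]
complete = ["a", "a", "b", "b", "a", "a", "b", "b"]
half_missing = ["a", "a", "b", "b", None, None, None, None]

print("gain, no missing values:  ", penalized_gain(complete, labels))
print("gain, half of them missing:", penalized_gain(half_missing, labels))
```

Both features separate the classes perfectly on the rows where they are known, but the one with half of its values missing gets half the score, so no imputation is needed to compare them.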

References
- https://www.quora.com/When-would-one-use-Random-Forests-over-Gradient-Boosted-Machines-GBMs

- https://www.quora.com/What-are-the-advantages-disadvantages-of-using-Gradient-Boosting-over-Random-Forests

- https://www.semanticscholar.org/paper/An-Introduction-to-Random-Forests-for-Multi-class-Gall-Razavi/9035e87ce49b67b751838c7346d36fe481260217?p2df

- https://stats.stackexchange.com/questions/98953/why-doesnt-random-forest-handle-missing-values-in-predictors