## Gradient Boost (GB)
- Gradient boost for regression: using gradient boost to predict a continuous value, e.g. weight (don't confuse this with linear regression).
<br/><br/>
- <strong>AdaBoost vs. Gradient Boost</strong>:
<br/><br/>
    - AdaBoost:
        - AdaBoost starts by building a very short tree, i.e. a stump, from the training data. The amount of say (weight) the stump has on the final output is based on how well it compensated for previous errors.
        - Then AdaBoost <mark>builds new stumps based on previous errors</mark>. 
        - New stumps' performance will vary, i.e. one might do better/worse than the previous stump (this is where the weight varies). 
        - AdaBoost keeps making new stumps until it has made the number of stumps you originally asked for or it has a perfect fit.
<br/><br/>
    - Gradient Boost (GB):
        - On the other hand, Gradient Boost starts by making a single leaf (this leaf represents the initial guess for the weights of all samples). 
        - For example, when we want to predict a continuous value, <mark>the first guess is simply the average value</mark>. Then gradient boost builds a tree based on the errors of that initial guess.
            - There is a similarity with AdaBoost in terms of how this tree is <mark>based on the errors made by the previous tree</mark>. 
            - But unlike AdaBoost, <mark>this tree is usually larger than a stump, although gradient boost still restricts the size of the tree</mark>.
            - People usually set the max. number of leaves to be between 8 and 32.
        - Hence, similar to AdaBoost, Gradient Boost builds fixed size trees based on the previous tree's errors. But unlike AdaBoost, each tree is larger than a stump.
        - Also, Gradient Boost <mark>scales every tree by the same amount</mark>, whereas AdaBoost gives each stump its own weight.
        - Then GB builds another tree based on the errors made by the previous tree and scales it by that same amount.
        - Then GB continues to make new trees until it reaches the number you have initially asked for, or additional trees fail to improve the fit.
<br/><br/>
- Gradient Boost uses a <strong>learning rate</strong> to scale each tree's contribution, which helps against overfitting. The learning rate is a value between 0 and 1. <mark>Scaling a tree results in a small step in the right direction, i.e. decreasing variance</mark>.
<br/><br/>
- According to Jerome Friedman (inventor of Gradient Boost), empirical evidence shows that taking lots of small steps in the right direction (scaling trees) results in better predictions on a testing set (i.e. low variance).
<br/><br/>
- <strong>Gradient Boost for regression Procedure</strong>:
    1. Calculate the average of the target variable we want to predict (here, weight). This average is the initial prediction.
<br/><br/>
    2. Compute the <strong>pseudo residuals</strong> = (observed weight - predicted weight), where the predicted weight is the mean in the first round, then build a tree that predicts these residuals and scale it with the learning rate.
<br/><br/>    
    3. Add the new tree, scaled by the learning rate, to the previous prediction (the initial average plus all earlier scaled trees), then calculate the new residuals.
<br/><br/>    
    4. Keep repeating steps 2 and 3 until you reach the number of trees you initially specified or until making new trees fails to improve the fit (i.e. adding a new tree does not reduce the size of the residuals).
        - It is important to note that the same learning rate is applied to every new tree.
        - <mark>Each time you add a tree to the prediction, the residuals on the training data get smaller</mark>.
<br/><br/>    
    5. Predict weight for the test set using your model.
<br/><br/>
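The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: to keep it short, each tree is a one-split stump on a single feature (gradient boost normally uses larger trees, as noted earlier), the loop always runs for a fixed number of trees rather than stopping when the fit no longer improves, and the `heights`/`weights` data as well as the names `fit_stump` and `fit_gradient_boost` are made up for the example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals.

    Simplification for this sketch: one feature, one split, two leaves.
    """
    best = None
    # Try a split at each distinct x value except the largest.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)  # leaf outputs
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:  # all x values identical: fall back to a single leaf
        constant = mean(residuals)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_gradient_boost(xs, ys, n_trees=100, learning_rate=0.1):
    initial_guess = mean(ys)  # step 1: average of the target
    predictions = [initial_guess] * len(ys)
    trees = []
    for _ in range(n_trees):  # step 4: repeat steps 2 and 3
        # step 2: pseudo residuals = observed - predicted
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # step 3: add the new tree, scaled by the same learning rate each round
        predictions = [p + learning_rate * tree(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):  # step 5: initial guess plus all scaled trees
        return initial_guess + learning_rate * sum(tree(x) for tree in trees)

    return predict


# Hypothetical toy data: height in metres -> weight in kg.
heights = [1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
weights = [56.0, 57.0, 73.0, 76.0, 88.0, 90.0]

model = fit_gradient_boost(heights, weights)
for h, w in zip(heights, weights):
    print(f"height={h:.1f} observed={w:.1f} predicted={model(h):.1f}")
```

Calling `fit_gradient_boost(heights, weights, n_trees=0)` returns a model that always predicts the average, which makes it easy to compare the training residuals before and after boosting.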
- <strong>Gradient Boost for regression Procedure in detail</strong>:
<br/><br/>
    - Input: data $\{(x_i,y_i)\}_{i=1}^n$ and a differentiable loss function, $L(y_i,F(x))$
        - <mark>$L(y_i,F(x)) = \frac{1}{2}(observed-predicted)^2$</mark>
        - <mark>$y_i$: observed values</mark>
        - <mark>$\gamma$: predicted values</mark>
<br/><br/>
    - <strong>Step 1</strong>: Initialize model with a constant value: $F_0(x) = argmin_{\gamma}\sum_{i=1}^{n}L(y_i,\gamma)$, i.e. <mark>average of the observed weights</mark>.
        - i.e. the value of $\gamma$ that minimizes the sum of the losses, where each loss is $\frac{1}{2}(y_i-\gamma)^2$.
        - In other words, we want to find the point that minimizes the sum of the squared residuals ($\times\frac{1}{2}$).
        - Take the derivative of the sum with respect to $\gamma$, set it equal to zero and solve, i.e. the following:
<br/><br/>
$$\text{Solve:}\,0=\frac{\partial}{\partial \gamma}\left[\frac{1}{2}(88-\gamma)^2+\frac{1}{2}(76-\gamma)^2+\frac{1}{2}(56-\gamma)^2+\dots\right]$$
<br/><br/>
        - The solution is the <mark>average of the observed values</mark>.
<br/><br/>
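Written out for $n$ observed values, the Step 1 derivative under the squared loss above works out as follows (a short worked derivation, using only the symbols already defined):

$$\frac{\partial}{\partial\gamma}\sum_{i=1}^{n}\frac{1}{2}(y_i-\gamma)^2=-\sum_{i=1}^{n}(y_i-\gamma)=0\;\Rightarrow\;n\gamma=\sum_{i=1}^{n}y_i\;\Rightarrow\;F_0(x)=\gamma=\frac{1}{n}\sum_{i=1}^{n}y_i$$

The factor $\frac{1}{2}$ in the loss is there only to cancel the 2 that comes down from the square.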
    - <strong>Step 2</strong>: for $m=1$ to $M$: (loop to make trees, conventionally, M=100)
        - (A) Compute the <mark>pseudo-residual</mark>, i.e. the negative gradient (the negative derivative of the loss function w.r.t. the predicted value):
<br/><br/>        
$$r_{im}=-[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}]_{F(x)=F_{m-1}(x)}~~~for~~i=1,...,n$$
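<center>which, for the squared loss above, becomes</center>
$$-\frac{\partial}{\partial\,\text{Predicted}}\frac{1}{2}(\text{Observed}-\text{Predicted})^2 = -[-(\text{Observed}-\text{Predicted})]$$
$$=(\text{Observed}-\text{Predicted}) = (\text{Observed}-F_{m-1}(x_i)) = \text{Pseudo-residual}$$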
<br/><br/>
        - Note: $F_{m-1}$ is the previous prediction.
        - In $r_{im}$, $r$ is the pseudo-residual, $i$ is the sample number and $m$ is the index of the tree being built.
<br/><br/>
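As a quick sanity check on (A), the negative gradient of the half squared loss can be compared numerically with (observed - predicted). The sketch below is only illustrative: it uses the three observed weights written out in Step 1 as a toy subset, takes their mean as the previous prediction, and the helper names `half_squared_loss` and `negative_gradient` are invented for this example.

```python
def half_squared_loss(observed, predicted):
    """The loss used in these notes: 1/2 * (observed - predicted)^2."""
    return 0.5 * (observed - predicted) ** 2


def negative_gradient(loss, observed, predicted, eps=1e-4):
    """Negated central finite-difference slope of the loss w.r.t. the prediction."""
    slope = (loss(observed, predicted + eps)
             - loss(observed, predicted - eps)) / (2 * eps)
    return -slope


# Toy subset: only the three observed weights written out in Step 1.
observed = [88.0, 76.0, 56.0]
# Plays the role of F_{m-1}; here it is the mean of the toy subset.
previous_prediction = sum(observed) / len(observed)

# Analytic pseudo-residuals: observed - previous prediction.
pseudo_residuals = [y - previous_prediction for y in observed]
# Numerical negative gradients of the loss at the previous prediction.
numeric = [negative_gradient(half_squared_loss, y, previous_prediction)
           for y in observed]

for y, r, g in zip(observed, pseudo_residuals, numeric):
    print(f"observed={y:.1f} residual={r:.4f} numeric_gradient={g:.4f}")
```

The two columns agree up to floating-point error, which is why the pseudo-residual for this loss is just the ordinary residual.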
        - (B) Fit a regression tree to the $r_{im}$ values and create terminal regions (leaves) $R_{jm}$, for $j=1...J_m$ ($J_m$ is the number of leaves).
            - In other words, fit a regression tree that predicts the residuals from the features $x_i$, and label its leaves.
        - (C) For $j=1...J_m$ compute (determining output value for each leaf)
            - Similar to Step 1, but <mark>we are taking the previous prediction into account.</mark>
            - Take the derivative and solve by equating it to zero.
$$\gamma_{jm}=argmin_{\gamma}\sum_{x_i\in R_{jm}}L(y_i,F_{m-1}(x_i)+\gamma)$$
<br/><br/>
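For the squared loss used in these notes, the minimization in (C) has a closed form. A short derivation, writing $r_{im}=y_i-F_{m-1}(x_i)$ as in (A) and using $n_{jm}$ (a symbol introduced here for convenience, not part of the original notation) for the number of training samples that land in leaf $R_{jm}$:

$$0=\frac{\partial}{\partial\gamma}\sum_{x_i\in R_{jm}}\frac{1}{2}\left(y_i-F_{m-1}(x_i)-\gamma\right)^2=-\sum_{x_i\in R_{jm}}\left(r_{im}-\gamma\right)\;\Rightarrow\;\gamma_{jm}=\frac{1}{n_{jm}}\sum_{x_i\in R_{jm}}r_{im}$$

So with this loss, each leaf's output value is simply the average of the pseudo-residuals that fall into it.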
        - (D) Update 
$$F_m(x)=F_{m-1}(x)+\nu\sum_{j=1}^{J_m}\gamma_{jm}I(x\in R_{jm})$$
<br/><br/>
            - Add up the output values $\gamma_{jm}$ of all the leaves $R_{jm}$ that the sample $x$ falls into (within a single tree, this is exactly one leaf).
            - $\nu$ is the <strong>Learning Rate</strong>, a value between 0 and 1.
    - <strong>Step 3</strong>: Output $F_M(x)$


Sources:
- https://www.youtube.com/watch?v=3CC4N4z3GJc&t=118s
- https://www.youtube.com/watch?v=2xudPOBz-vs

### Gradient Boost for Classification
<br/><br/>
<br/><br/>
<br/><br/>

Sources:
- https://www.youtube.com/watch?v=jxuNLH5dXCs