# XGBoost Regressor
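1. Loss function = mean squared error, i.e. $\mathcal{L} = \sum\limits_{i=1}^N \dfrac{(y_i - \hat{y}_i)^2}{N}$
2. Negative gradient w.r.t. a single prediction $\hat{y}_i$ (no sum over $i$): $-\dfrac{\partial \mathcal{L}}{\partial \hat{y}_i} = \dfrac{2(y_i-\hat{y}_i)}{N}$
3. Hessian (2nd-order derivative) w.r.t. $\hat{y}_i$: $\dfrac{\partial^2 \mathcal{L}}{\partial \hat{y}_i^2} = \dfrac{2}{N}$; the constant factor $\frac{1}{N}$ only rescales the gains and $\lambda$, so it is usually dropped, giving $g_i = -2(y_i-\hat{y}_i)$ and $h_i = 2$ per sample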
4. model-0 = mean-value predictor
5. calculate residuals $r_i = y_i - \bar{y}$
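6. for the 1st weak learner (a decision tree), nodes are split as follows
    1. **MSE reduction is not used as the split criterion** (unlike a plain regression tree); splits are scored from gradient and Hessian sums instead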
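    2. calculate the similarity score at the current node, $SS = \frac{G_N^2}{H_N+ \lambda}$, where $G_N = \sum_i g_i$ and $H_N = \sum_i h_i$ are the sums of gradients and Hessians over the $n$ samples in the node, and $\lambda$ is the regularisation parameter
        1. with $g_i = -2(y_i-\hat{y}_i) = -2r_i$ and $h_i = 2$ (per-sample squared error, dropping the constant $\frac{1}{N}$): $SS = \dfrac{\left(\sum\limits_{i=1}^n -2r_i\right)^2}{2n + \lambda} = \dfrac{2\left(\sum\limits_{i=1}^n r_i\right)^2}{n+ \lambda/2}$ \
            $\propto$ (sum of residuals from previous learner)$^2$ / (no. of residuals, i.e. samples in the current node, $+\ \lambda/2$); the residuals are summed first and then squared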
    3. for each candidate split (a feature and a threshold), send the node's samples to the left and right child nodes and calculate $SS_L \,,\, SS_R$ from the samples that end up in each child
    4. compute the split gain as $SS_L + SS_R - SS$; the candidate split with the highest gain is used
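
The splitting steps above can be sketched in a few lines of Python. This is a minimal illustration, not XGBoost's implementation: it assumes the per-sample squared error $(y_i - \hat{y}_i)^2$ (so $g_i = -2r_i$, $h_i = 2$), tries every midpoint between sorted feature values, and uses hypothetical helper names (`similarity`, `best_split`) and an arbitrary $\lambda = 1$. The toy feature and target values are the ones from the small dataset used later in these notes.

```python
def similarity(residuals, lam):
    """Similarity score G^2 / (H + lam), assuming g_i = -2*r_i and h_i = 2."""
    G = sum(-2.0 * r for r in residuals)  # sum of gradients in the node
    H = 2.0 * len(residuals)              # sum of Hessians in the node
    return G * G / (H + lam)


def best_split(x, y, lam=1.0):
    """Exhaustive search over midpoints; returns (gain, threshold) of the best split."""
    y_bar = sum(y) / len(y)               # model-0: mean-value predictor
    pairs = sorted((xi, yi - y_bar) for xi, yi in zip(x, y))  # (feature, residual)
    xs = [p[0] for p in pairs]
    rs = [p[1] for p in pairs]
    parent = similarity(rs, lam)          # SS of the node being split
    best = None
    for k in range(1, len(rs)):
        if xs[k] == xs[k - 1]:
            continue                      # no threshold fits between equal values
        gain = similarity(rs[:k], lam) + similarity(rs[k:], lam) - parent
        threshold = (xs[k - 1] + xs[k]) / 2
        if best is None or gain > best[0]:
            best = (gain, threshold)      # ties keep the first candidate found
    return best


x = [2.1, 2.5, 3.2, 3.7, 4.0]
y = [10, 12, 15, 18, 20]
gain, thr = best_split(x, y, lam=1.0)     # lam = 1.0 is an arbitrary choice
print(f"best threshold = {thr:.2f}, gain = {gain:.3f}")
```

With model-0 equal to the mean, the residuals in the root node sum to zero, so the root's similarity score is 0 and the gain there reduces to $SS_L + SS_R$.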

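For reference, the similarity score can be motivated by the second-order approximation of the loss for a single leaf. The sketch below assumes the usual setup in which every sample in the node receives the same leaf output $w$ and an L2 penalty $\frac{1}{2}\lambda w^2$ is added; $w$, $w^*$ and $\tilde{\mathcal{L}}$ are symbols introduced here for illustration, and $g_i$, $h_i$ denote the gradient and Hessian of the loss w.r.t. $\hat{y}_i$.

$$
\tilde{\mathcal{L}}(w) = \sum_{i \in \text{node}} \left( g_i w + \tfrac{1}{2} h_i w^2 \right) + \tfrac{1}{2} \lambda w^2 = G_N w + \tfrac{1}{2} (H_N + \lambda) w^2
$$

$$
\frac{d \tilde{\mathcal{L}}}{d w} = 0 \;\Rightarrow\; w^* = -\frac{G_N}{H_N + \lambda}, \qquad \tilde{\mathcal{L}}(w^*) = -\frac{1}{2} \cdot \frac{G_N^2}{H_N + \lambda} = -\frac{1}{2} SS
$$

So a larger $SS$ corresponds to a lower approximate loss, and the gain $SS_L + SS_R - SS$ measures (up to the factor $\frac{1}{2}$) how much a split reduces it.
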
## Differences from GBM
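1. L1 and L2 regularisation on the leaf weights (the $\lambda$ term in the similarity score expression is the L2 penalty)
    1. a high $\lambda$ makes all 3 terms of the gain ($SS_L$, $SS_R$, $SS$) quite small, so splits yield less gain and leaf outputs shrink towards zero, leading to a higher-bias but lower-variance model
    2. hence it reduces overfitting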
    
2. optimised for speed, efficiency, scalability
    1. XGBoost uses a 2nd-order Taylor series expansion of the loss (gradient and Hessian), rather than the gradient alone
    2. column/block-based parallelisation with histogram-based split finding
        Consider a small dataset with one feature (Feature X), a target variable (Y), and illustrative gradient and Hessian values for each sample. 

        | Sample  | Feature X | Target (Y) | Gradient (g) | Hessian (h) |
        |---------|-----------|------------|--------------|-------------|
        | 1       | 2.1       | 10         | -0.8         | 1.2         |
        | 2       | 2.5       | 12         | -0.6         | 1.1         |
        | 3       | 3.2       | 15         | 0.2          | 0.9         |
        | 4       | 3.7       | 18         | 0.4          | 1           |
        | 5       | 4         | 20         | 0.5          | 1.3         |

        1. discretize feature values into histogram bins 

            | Bin: range | Feature values mapped |
            |---------|-----------|
            | 1: (2.0, 2.7] | 2.1, 2.5 |
            | 2: (2.7, 3.5] | 3.2 |
            | 3: (3.5, 4.2] | 3.7, 4.0 |

        3. sum the gradient (g) and Hessian (h) values of the samples that fall in each bin 

            | Bin: range | Gradient sum (G) | Hessian sum (H) |
            |---------|-----------|--------------|
            | 1: (2.0, 2.7] | -0.8 + -0.6 = -1.4 | 1.2 + 1.1 = 2.3 |
            | 2: (2.7, 3.5] | 0.2 | 0.9 |
            | 3: (3.5, 4.2] | 0.4 + 0.5 = 0.9 | 1 + 1.3 = 2.3 |

        4. Let's compute the gain for splitting after Bin 1: 
            - left ( i.e. bin 1 ): $G_L = -1.4 \,,\, H_L = 2.3$
            - right ( i.e. bin 2 and bin 3 ): $G_R = 0.2  + 0.9 = 1.1 \,,\, H_R = 0.9 + 2.3 = 3.2$
            - $Gain = \frac{G_L^2}{H_L+ \lambda} + \frac{G_R^2}{H_R+ \lambda} - \frac{G_N^2}{H_N+ \lambda} = \frac{(-1.4)^2}{2.3+ \lambda} + \frac{(1.1)^2}{3.2+ \lambda} - \frac{(-1.4 + 1.1)^2}{2.3 + 3.2 + \lambda}$
        5. Similarly, the gain is computed for the other candidate split (after Bin 2: left = bins 1 and 2, right = bin 3), and the split with the highest gain is selected.
        6. Benefits of Histogram-Based Splitting
            1. Speed Improvement:
                1. Instead of checking every possible threshold, only the bin boundaries are evaluated.
                2. Reduces the number of candidate splits per feature from O(n) (raw values) to O(#bins).
            2. Memory Efficiency:
                1. Bin statistics take significantly less memory than raw feature values.
            3. Robustness to Noisy Data:
                1. Since histogram bins aggregate values, small fluctuations in feature values have minimal impact.
            4. **Handling missing values** : While constructing histograms for each feature, XGBoost only considers the non-missing values for that feature, and skips missing values entirely, which helps in two ways:
                1. Reduces Computation: For sparse features, this speeds up histogram updates because it doesn’t need to evaluate missing or zero entries.
                2. Memory Efficiency: By not allocating space or performing operations for missing values, it reduces the amount of memory needed.
        7. When evaluating potential splits for a feature, XGBoost **assigns missing values** to **either** the **left or right child** of a split, whichever maximizes the gain.
            - Best split strategy for missing values: XGBoost tries both assignments (left and right) and chooses the one that results in the best gain.
            - This behavior allows it to handle missing values naturally without the need for imputation or extra preprocessing, unlike many other algorithms.
            - For example, if a feature has 100 samples and only 20 are non-zero or non-missing, XGBoost will only compute the statistics for these 20 samples and ignore the 80 missing/zero entries during the tree construction phase.\
                This reduces the number of operations and speeds up computation because it avoids wasting resources on irrelevant entries.
3. XGBoost can learn how to handle missing values automatically (it determines the best direction to split for missing data).
4. Preferred for larger datasets, since it is computationally optimised
5. Preferred when GBM tends to overfit
6. Often faster inference than GBM
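
The histogram-based split search from the worked example can be reproduced with a short sketch. It is only an illustration under stated assumptions: bins are right-closed with hand-picked upper edges (2.7, 3.5, 4.2), $\lambda = 1$ is an arbitrary choice, and `build_histogram`, `score` and `split_gains` are hypothetical helper names rather than XGBoost functions.

```python
from bisect import bisect_left

# Feature values, gradients and Hessians from the small dataset in the notes.
x = [2.1, 2.5, 3.2, 3.7, 4.0]
g = [-0.8, -0.6, 0.2, 0.4, 0.5]
h = [1.2, 1.1, 0.9, 1.0, 1.3]
upper_edges = [2.7, 3.5, 4.2]  # assumed right-closed bins: (2.0, 2.7], (2.7, 3.5], (3.5, 4.2]


def build_histogram(x, g, h, upper_edges):
    """Accumulate per-bin gradient and Hessian sums in one pass over the samples."""
    G = [0.0] * len(upper_edges)
    H = [0.0] * len(upper_edges)
    for xi, gi, hi in zip(x, g, h):
        b = bisect_left(upper_edges, xi)  # first bin whose upper edge is >= xi
        G[b] += gi
        H[b] += hi
    return G, H


def score(G, H, lam):
    """Similarity score G^2 / (H + lam)."""
    return G * G / (H + lam)


def split_gains(G, H, lam):
    """Gain of splitting after each bin boundary (left = bins up to k, right = the rest)."""
    G_tot, H_tot = sum(G), sum(H)
    parent = score(G_tot, H_tot, lam)
    gains = []
    G_left = H_left = 0.0
    for k in range(len(G) - 1):
        G_left += G[k]                    # running sums, so each boundary costs O(1)
        H_left += H[k]
        gains.append(score(G_left, H_left, lam)
                     + score(G_tot - G_left, H_tot - H_left, lam)
                     - parent)
    return gains


G_bins, H_bins = build_histogram(x, g, h, upper_edges)
gains = split_gains(G_bins, H_bins, lam=1.0)  # lam = 1.0 is an arbitrary choice
for k, gain in enumerate(gains, start=1):
    print(f"split after bin {k}: gain = {gain:.4f}")
```

With these numbers and $\lambda = 1$, the split after bin 1 has the higher gain (about 0.87, versus about 0.57 for the split after bin 2).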

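The default-direction rule for missing values can be sketched in the same style. The function below is a hypothetical helper, not XGBoost's API: it takes the gradient and Hessian sums of the non-missing rows on each side of a split plus the sums of the rows whose feature value is missing, scores both assignments, and keeps the better one. The missing-row sums in the usage lines ($G = -0.5$, $H = 1.0$) are made up for illustration, as is $\lambda = 1$.

```python
def score(G, H, lam):
    """Similarity score G^2 / (H + lam)."""
    return G * G / (H + lam)


def best_default_direction(G_left, H_left, G_right, H_right, G_miss, H_miss, lam):
    """Score one split twice (missing rows sent left, then right); return (gain, direction)."""
    G_tot = G_left + G_right + G_miss
    H_tot = H_left + H_right + H_miss
    parent = score(G_tot, H_tot, lam)

    # Option 1: rows with a missing feature value follow the left branch.
    gain_left = (score(G_left + G_miss, H_left + H_miss, lam)
                 + score(G_right, H_right, lam) - parent)
    # Option 2: rows with a missing feature value follow the right branch.
    gain_right = (score(G_left, H_left, lam)
                  + score(G_right + G_miss, H_right + H_miss, lam) - parent)

    if gain_left >= gain_right:
        return gain_left, "left"
    return gain_right, "right"


# Split after bin 1 from the notes' example, plus hypothetical missing-row statistics.
gain, direction = best_default_direction(
    G_left=-1.4, H_left=2.3,
    G_right=1.1, H_right=3.2,
    G_miss=-0.5, H_miss=1.0,   # made-up values for illustration
    lam=1.0,                   # arbitrary choice
)
print(f"send missing values {direction}, gain = {gain:.4f}")
```

Here the missing rows carry a negative gradient sum, like the left child, so sending them left gives the higher gain. Only the non-missing rows need to be scanned per feature; the missing-row sums can be obtained as the node totals minus the non-missing sums.
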
# XGBoost Classifier