## Trees: Ensemble Methods - Boosting

Boosting is another ensemble technique for creating a collection of predictors. Learners are trained sequentially: early learners fit simple models to the data, and the data is then analyzed for errors. In other words, we fit consecutive trees (each on a random sample) at every step, and the goal of each tree is to reduce the net error left by the prior tree.

When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into a better-performing model.

An ensemble of trees is built one by one, and the individual trees are summed sequentially. The next tree tries to recover the loss (the difference between actual and predicted values) from the previous tree.

 - boosting = base learners with low variance and high bias (weak learners)
 
 ![Boosting Example](./images/boosting.png)

#### AdaBoost = Adaptive Boosting
AdaBoost learns from its mistakes by increasing the weight of misclassified data points.

It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights to incorrectly classified instances.

*AdaBoost usually uses trees with just one node and two leaves. (A tree with one node and two leaves is called a stump.)*

Steps:
<li> 0: Initialize the weights of the data points. (e.g. if the data has 1000 points, each point starts with a weight of 1/1000 = 0.001) </li>
<li> 1: Train a decision tree on the whole dataset, using the current weights </li>
<li> 2: Calculate the weighted error rate (e) of the decision tree. </li>
<li> 3: Calculate this decision tree's weight in the ensemble. The weight of this tree = learning rate * log( (1 - e) / e) </li> 
<br> ** The higher the weighted error of the tree, the less decision power the tree will be given during the later voting. </br>
<br> ** The lower the weighted error of the tree, the higher decision power the tree will be given during the later voting. </br>

<li> 4: Update weights of wrongly classified points. </li> 
<br> the weight of a data point stays the same if the model classified this data point correctly.</br>
<br> the <strong><em>new weight of this data point = old weight*exp(weight of the tree)</em></strong>, if the model classified this data point wrongly </br> 

![sample weight calculation](./images/sample_weight_calc.png)

** In the weight-update formula, the amount of say (alpha) enters the exponent with a negative sign when the sample is correctly classified, so that sample's weight decreases.

** It enters the exponent with a positive sign when the sample is misclassified, so that sample's weight increases.

--- Afterwards, we normalize the weights so that they sum to one.

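<li> 5: Repeat from step 1, training the next tree on the dataset with the new weights </li>
<li> 6: Make the final prediction by a weighted vote of all trees, using each tree's weight from step 3 </li>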
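
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the ten-point 1D dataset, the helper names (`stump_predict`, `fit_stump`, `adaboost`, `predict`) and the choice of three rounds are all assumptions made for the example. It uses the update rule as written in the steps (misclassified points are multiplied by exp(tree weight), correct points are left unchanged, then all weights are normalized).

```python
import math

def stump_predict(x, threshold, polarity):
    """A stump: predict `polarity` left of the threshold, `-polarity` right of it."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Steps 1-2: pick the stump with the lowest weighted error rate."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds=3, learning_rate=1.0):
    n = len(xs)
    weights = [1.0 / n] * n                       # step 0: uniform weights
    ensemble = []
    for _ in range(n_rounds):                     # step 5: repeat
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = learning_rate * math.log((1 - error) / error)  # step 3
        ensemble.append((alpha, threshold, polarity))
        # step 4: boost the weights of misclassified points only
        weights = [w * math.exp(alpha)
                   if stump_predict(x, threshold, polarity) != y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]    # normalize to sum to one
    return ensemble

def predict(ensemble, x):
    """Step 6: weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data (an assumption for illustration): no single stump separates it.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print("stumps (alpha, threshold, polarity):", ensemble)
print("predictions:", predictions)
```

Each individual stump misclassifies several points, but the weighted vote of the three stumps can classify this toy set correctly, which is the sense in which boosting combines weak learners.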

Further reading: https://www.mygreatlearning.com/blog/adaboost-algorithm/
<br> https://www.analyticsvidhya.com/blog/2021/09/adaboost-algorithm-a-complete-guide-for-beginners/#:~:text=AdaBoost%20also%20called%20Adaptive%20Boosting,are%20also%20called%20Decision%20Stumps </br>

#### Gradient Boosting = Gradient Descent + Boosting.
Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. If x(n+1) = x(n) - learning_rate*dF/dx(n) for a small enough learning_rate, then F(x(n)) >= F(x(n+1)) (the idea is to move against the gradient).
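
As a small illustration of the update rule, the sketch below runs gradient descent on F(x) = (x - 3)^2. The function, the starting point and the step count are assumptions chosen for the example, and `gradient_descent` is a hypothetical helper name.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=50):
    """Repeatedly step against the gradient; return every visited x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - learning_rate * grad(xs[-1]))
    return xs

# Illustrative function (not from the text): F(x) = (x - 3)^2, minimum at x = 3
def F(x):
    return (x - 3.0) ** 2

def dF(x):
    return 2.0 * (x - 3.0)

path = gradient_descent(dF, x0=0.0)
values = [F(x) for x in path]
print(f"final x = {path[-1]:.4f}, F(x) = {values[-1]:.8f}")
```

With this learning rate every step lowers F, so `values` is a decreasing sequence and `path` approaches 3.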

Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights of incorrectly classified observations at every iteration like AdaBoost, the Gradient Boosting method fits the new predictor to the residual errors made by the previous predictor.

Say we use mean squared error (MSE) as the loss, defined as:
![Mean squared error](./images/xgb_1.png)

We want predictions that minimize our loss function (MSE). By using gradient descent and updating our predictions based on a learning rate, we can find the values where MSE is minimum.
![gradient boosting](./images/xgb_2.png)

So, we are basically updating the predictions such that the sum of our squared residuals is close to 0 (or at a minimum) and the predicted values are sufficiently close to the actual values.
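
The residual-fitting loop described above can be sketched without any library. This is a simplified illustration under stated assumptions: one feature, regression stumps as base learners, a made-up toy dataset, and hypothetical helper names (`fit_regression_stump`, `gradient_boost`, `predict`). For MSE the negative gradient with respect to each prediction is proportional to the residual, so fitting the residuals is a gradient descent step on the predictions.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_regression_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)            # initial prediction: the mean of y
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # move each prediction a small step towards its residual
        preds = [p + learning_rate * (left_mean if x < threshold else right_mean)
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history

def predict(base, stumps, x, learning_rate=0.1):
    return base + sum(learning_rate * (lm if x < t else rm) for t, lm, rm in stumps)

# Toy data (an assumption for illustration)
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 9.0, 9.8, 11.1, 12.0, 12.9]

base, stumps, history = gradient_boost(xs, ys)
print(f"MSE after round 1: {history[0]:.4f}, after round {len(history)}: {history[-1]:.4f}")
```

The training MSE never increases from one round to the next here, because each stump is a least-squares fit to the current residuals and the learning rate is below 1. Training error alone says nothing about overfitting, which needs a held-out set to assess.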

<strong>Note:</strong>

<li> Gradient Boosting is prone to overfitting.</li>
<li> It requires careful tuning of several hyperparameters.</li>

Example: https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4

In [3]:
! conda install -c conda-forge py-xgboost




In [2]:
#import libraries
import xgboost as xgb
from sklearn.datasets import load_boston  #note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import time
import catboost as cb
import lightgbm as lgb


In [None]:
#import dataset

X,y = load_boston(return_X_y=True)

#train,test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

#xgboost
xgbr = xgb.XGBRegressor(max_depth=5,learning_rate=0.1,n_estimators=100,n_jobs=-1)
start_time = time.time()  #track the model development time

xgbr.fit(X_train,y_train)

end_time = time.time()

y_predict = xgbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time)) 

mean_squared_error(y_test,y_predict) #error

--- 0.05968928337097168 seconds ---


6.583590106471756

In [3]:
#let's try lightgbm
#it grows each tree leaf-wise (best-first), whereas many other boosting implementations grow trees depth-wise (level by level).

#verbose=-1 silences training logs; recent LightGBM versions no longer accept verbose in fit()
lgbr = lgb.LGBMRegressor(learning_rate=0.1,n_estimators=100,max_depth=5,num_leaves=50,verbose=-1)

start_time = time.time()

lgbr.fit(X_train,y_train)

end_time = time.time()

y_predict = lgbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time))

mean_squared_error(y_test,y_predict)    #error

--- 0.09108877182006836 seconds ---


8.069578290965865

In [4]:
#catboost saves you time by preprocessing categorical columns for you.
#it also offers a weighted sampling version of Stochastic Gradient Boosting.

#let's try catboost
cbr = cb.CatBoostRegressor(learning_rate=0.1,n_estimators=100,max_depth=5)

start_time = time.time()

cbr.fit(X_train,y_train,verbose=0)

end_time = time.time()

y_predict = cbr.predict(X_test)

print("--- %s seconds ---" % (end_time - start_time))

mean_squared_error(y_test,y_predict)    #error

--- 0.19351601600646973 seconds ---


9.344821856482579

Exercise: Load the promotion dataset from the data folder, train a model on the dataset and compare results using both random forests and gradient boosting.

<strong>Note: Also make sure to do some data cleaning, upsampling/downsampling, and parameter tuning.</strong>

`n_estimators`
- increasing the number of trees increases model complexity

`max_features`
- the number of features considered at each split
- rule of thumb = sqrt(num_features)
- the best value depends on the ratio of noisy to important variables in the dataset
- a small number of features reduces variance but increases bias
- with many noisy variables, a small max_features lowers the probability of choosing an important variable at a split

`min_samples_leaf`
- increase it a bit (the default is 1) to get smaller trees with less overfitting

`max_depth`
- controls variance: deeper trees fit the training data more closely (lower bias, higher variance)

`subsample`
- The fraction of observations to be selected for each tree. Selection is done by random sampling.
- Values slightly less than 1 make the model more robust by reducing the variance.



## Starting point hyperparameters

*** Heard from a Kaggle Grandmaster

Learning rate = 0.05, 1000 rounds, max depth = 3-5, subsample = 0.8-1.0, colsample_bytree = 0.3 - 0.8, lambda = 0 to 5

Add capacity to combat bias - add rounds

Reduce capacity to combat variance - depth / regularization
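
As a rough sketch, the starting point above could be written down with the parameter names used by XGBoost's scikit-learn wrapper. Where the note gives a range, the value picked inside it is an arbitrary illustrative choice, and mapping "lambda" to `reg_lambda` and "rounds" to `n_estimators` is an assumption about what the note refers to.

```python
# Starting-point values from the note above, using XGBoost's
# scikit-learn style parameter names. Values inside ranges are
# arbitrary picks for illustration, not recommendations.
starting_params = {
    "learning_rate": 0.05,
    "n_estimators": 1000,      # "1000 rounds"
    "max_depth": 4,            # note gives 3-5
    "subsample": 0.8,          # note gives 0.8-1.0
    "colsample_bytree": 0.5,   # note gives 0.3-0.8
    "reg_lambda": 1.0,         # "lambda", note gives 0 to 5
}

try:
    import xgboost as xgb
    model = xgb.XGBRegressor(**starting_params)
except ImportError:
    model = None  # xgboost is not installed; the dict still documents the settings

print(starting_params)
```

From there, the two rules of thumb apply: raise `n_estimators` if the model underfits, and lower `max_depth` or raise `reg_lambda` if it overfits.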

In [2]:
import pandas as pd

In [4]:
df = pd.read_csv('./data/promotion/train.csv')
df.head()

Unnamed: 0,EmployeeNo,Division,Qualification,Gender,Channel_of_Recruitment,Trainings_Attended,Year_of_birth,Last_performance_score,Year_of_recruitment,Targets_met,Previous_Award,Training_score_average,State_Of_Origin,Foreign_schooled,Marital_Status,Past_Disciplinary_Action,Previous_IntraDepartmental_Movement,No_of_previous_employers,Promoted_or_Not
0,YAK/S/00001,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,2,1986,12.5,2011,1,0,41,ANAMBRA,No,Married,No,No,0,0
1,YAK/S/00002,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,2,1991,12.5,2015,0,0,52,ANAMBRA,Yes,Married,No,No,0,0
2,YAK/S/00003,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,2,1987,7.5,2012,0,0,42,KATSINA,Yes,Married,No,No,0,0
3,YAK/S/00004,Commercial Sales and Marketing,First Degree or HND,Male,Agency and others,3,1982,2.5,2009,0,0,42,NIGER,Yes,Single,No,No,1,0
4,YAK/S/00006,Information and Strategy,First Degree or HND,Male,Direct Internal process,3,1990,7.5,2012,0,0,77,AKWA IBOM,Yes,Married,No,No,1,0


In [None]:
# pd.read_csv('./data/promotion/train.csv')