# Boosting
If a data point is predicted incorrectly by the first model, and then probably by all the other models as well, will combining their predictions produce better results? Probably not. That's where boosting comes in.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.

Here's a visualization of the steps:

1. A subset is created from the original dataset.

2. Initially, all data points are given equal weights.

3. A base model is created on this subset.

4. This model is used to make predictions on the whole dataset.
![4](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd2-e1526989487878.png)

5. Errors are calculated using the actual values and predicted values.

6. Observations which are incorrectly predicted are given higher weights (The misclassified blue points above)

7. Another model is created and predictions are made on the dataset (this model tries to correct the errors of the previous model).
![7](https://www.analyticsvidhya.com/wp-content/uploads/2015/11/boosting10.png)

8. Multiple models are created, each correcting the errors of the previous model.

9. The final model (strong learner) is the weighted mean of all the models (weak learners).
![9](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd4-e1526551014644.png)

Thus, the boosting algorithm combines a number of weak learners to form a strong learner.
![final](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2015/11/dd4-e1526551014644.png)
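
The steps above can be sketched as a short loop. This is a minimal illustration of discrete AdaBoost-style reweighting, not the exact procedure behind the figures: it assumes labels encoded as -1/+1, uses depth-1 trees as weak learners, trains on the full (synthetic) dataset with sample weights rather than on a subset, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (an assumption for illustration); labels mapped to -1/+1
X_demo, y01 = make_classification(n_samples=300, n_features=5, random_state=42)
y_demo = np.where(y01 == 1, 1, -1)

n_rounds = 10
weights = np.full(len(y_demo), 1 / len(y_demo))  # all points start with equal weight
learners, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner that pays more attention to heavily weighted points
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this learner, clipped to keep the log finite
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's say in the final vote

    # Raise the weights of misclassified points, lower the rest, then renormalize
    weights = weights * np.exp(-alpha * y_demo * pred)
    weights = weights / weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# The strong learner is the sign of the weighted vote of all weak learners
scores = sum(a * m.predict(X_demo) for a, m in zip(alphas, learners))
ensemble_pred = np.where(scores >= 0, 1, -1)

first_acc = np.mean(learners[0].predict(X_demo) == y_demo)
ensemble_acc = np.mean(ensemble_pred == y_demo)
print("first stump accuracy:", first_acc)
print("ensemble accuracy:   ", ensemble_acc)
```

Comparing the two printed accuracies gives a feel for how much the later, reweighted learners add on top of the first one.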

# Example Implementations
* AdaBoost
* Stochastic Gradient Boosting
* XGBoost
* Light GBM
* CatBoost

## AdaBoost
AdaBoost (Adaptive Boosting) is perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay more or less attention to them in the construction of subsequent models.

We'll implement one using sklearn's `AdaBoostClassifier` with 30 decision trees on the "pima indians" dataset.

In [3]:
# import packages and data
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import AdaBoostClassifier

URL = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

df = pd.read_csv(URL, names=features)

X = df.iloc[:, 0:8]
y = df.iloc[:, 8]

# random_state only applies when shuffle=True; recent scikit-learn versions reject it otherwise
kfold = KFold(n_splits=10)

In [4]:
# instantiate the model (it is trained inside cross_val_score below)
num_trees = 30

abc = AdaBoostClassifier(n_estimators=num_trees, random_state=42)

In [5]:
# Get results
results = cross_val_score(abc, X, y, cv=kfold)

print(results.mean())

0.760457963089542
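
To get a feel for how the number of trees affects the result, `staged_score` reports the accuracy after each boosting round. The sketch below is hedged: it uses a synthetic dataset from `make_classification` and a single train/test split instead of the Pima data and 10-fold setup above, and the `demo_` names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in X and y from above if available)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_abc = AdaBoostClassifier(n_estimators=30, random_state=42).fit(X_tr, y_tr)

# One test-set accuracy per boosting round: entry k uses the first k+1 trees
staged = list(demo_abc.staged_score(X_te, y_te))

for n_trees, score in enumerate(staged, start=1):
    if n_trees == 1 or n_trees % 10 == 0:
        print(f"{n_trees:2d} trees: {score:.3f}")
```

If the curve flattens early, fewer trees would probably do just as well.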


## Stochastic Gradient Boosting
Stochastic Gradient Boosting, also known as Gradient Boosting Machines (GBM), is one of the most sophisticated ensemble techniques, and one that also tends to perform really well.

It works like so:

In [11]:
# Create Data
df = pd.DataFrame(data=[[1, 'Y', 'M', 'A', 51000, 35], [2, 'N', 'F', 'B', 25000, 24], [3, 'Y', 'M', 'A', 74000, 38],
                       [4, 'N', 'F', 'A', 29000, 30], [5, 'N', 'F', 'B', 37000, 33]],
                  columns=['ID', 'Married', 'Gender', 'Current City', 'Monthly Income', 'Age (target)'])
df

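| | ID | Married | Gender | Current City | Monthly Income | Age (target) |
|---|---|---|---|---|---|---|
| 0 | 1 | Y | M | A | 51000 | 35 |
| 1 | 2 | N | F | B | 25000 | 24 |
| 2 | 3 | Y | M | A | 74000 | 38 |
| 3 | 4 | N | F | A | 29000 | 30 |
| 4 | 5 | N | F | B | 37000 | 33 |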


We want to predict `Age`:

1. The mean age is assumed to be the predicted value for all observations in the dataset.
2. The errors are calculated using this mean prediction and the actual values of age.

In [12]:
df['Mean Age (prediction 1)'] = [32,32,32,32,32]
df['Residual 1'] = [3, -8, 6, -2, 1]
df

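| | ID | Married | Gender | Current City | Monthly Income | Age (target) | Mean Age (prediction 1) | Residual 1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Y | M | A | 51000 | 35 | 32 | 3 |
| 1 | 2 | N | F | B | 25000 | 24 | 32 | -8 |
| 2 | 3 | Y | M | A | 74000 | 38 | 32 | 6 |
| 3 | 4 | N | F | A | 29000 | 30 | 32 | -2 |
| 4 | 5 | N | F | B | 37000 | 33 | 32 | 1 |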


3. A tree model is created using the errors calculated above as the target variable. Our objective is to find the best split to minimize the error
4. The predictions by this model are combined with prediction 1

In [13]:
# 'Residual 1' is the new target: the second model is fit to these residuals
df['Prediction 2'] = [3, -5, 3, -5, 3]
df['Combine (mean+pred2)'] = [35, 27, 35, 27, 35]

df

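| | ID | Married | Gender | Current City | Monthly Income | Age (target) | Mean Age (prediction 1) | Residual 1 | Prediction 2 | Combine (mean+pred2) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Y | M | A | 51000 | 35 | 32 | 3 | 3 | 35 |
| 1 | 2 | N | F | B | 25000 | 24 | 32 | -8 | -5 | 27 |
| 2 | 3 | Y | M | A | 74000 | 38 | 32 | 6 | 3 | 35 |
| 3 | 4 | N | F | A | 29000 | 30 | 32 | -2 | -5 | 27 |
| 4 | 5 | N | F | B | 37000 | 33 | 32 | 1 | 3 | 35 |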


5. The value calculated above is the new prediction
6. New errors are calculated using this predicted value and the actual value

In [14]:
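# residual 2 = actual age - combined prediction (e.g. 38 - 35 = 3 for ID 3)
df['Residual 2 (latest target)'] = [0, -3, 3, 3, -2]

df

| | ID | Married | Gender | Current City | Monthly Income | Age (target) | Mean Age (prediction 1) | Residual 1 | Prediction 2 | Combine (mean+pred2) | Residual 2 (latest target) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Y | M | A | 51000 | 35 | 32 | 3 | 3 | 35 | 0 |
| 1 | 2 | N | F | B | 25000 | 24 | 32 | -8 | -5 | 27 | -3 |
| 2 | 3 | Y | M | A | 74000 | 38 | 32 | 6 | 3 | 35 | 3 |
| 3 | 4 | N | F | A | 29000 | 30 | 32 | -2 | -5 | 27 | 3 |
| 4 | 5 | N | F | B | 37000 | 33 | 32 | 1 | 3 | 35 | -2 |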


Steps 2-6 are repeated until the maximum number of iterations is reached.
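
The walkthrough above can be sketched as a small loop that repeatedly fits a tree to the current residuals. This is an illustration under stated assumptions rather than what `GradientBoostingClassifier` does internally: it rebuilds the toy table, one-hot encodes the categorical columns with `pd.get_dummies`, uses depth-1 regression trees, a learning rate of 1.0, and three rounds; the `toy`, `X_toy`, and `y_toy` names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Rebuild the toy data from the walkthrough (ID dropped, it carries no signal)
toy = pd.DataFrame({
    "Married": ["Y", "N", "Y", "N", "N"],
    "Gender": ["M", "F", "M", "F", "F"],
    "Current City": ["A", "B", "A", "A", "B"],
    "Monthly Income": [51000, 25000, 74000, 29000, 37000],
    "Age": [35, 24, 38, 30, 33],
})

# Trees need numeric inputs, so one-hot encode the categorical columns
X_toy = pd.get_dummies(toy.drop(columns="Age"), drop_first=True).astype(float)
y_toy = toy["Age"].to_numpy(dtype=float)

n_rounds = 3          # maximum number of iterations (assumed)
learning_rate = 1.0   # no shrinkage, to mirror the walkthrough

# Step 1: start from the mean age for every observation
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]
stage_updates = []

for _ in range(n_rounds):
    residual = y_toy - prediction                  # errors of the current prediction
    stump = DecisionTreeRegressor(max_depth=1, random_state=42)
    stump.fit(X_toy, residual)                     # the residuals are the new target
    update = stump.predict(X_toy)
    prediction = prediction + learning_rate * update  # combine with the previous prediction
    stage_updates.append(update)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print("first update:", np.round(stage_updates[0], 2))
print("training MSE per stage:", np.round(mse_history, 2))
```

With these settings the first update should come out close to the `Prediction 2` column above, and the training error should shrink (or at least not grow) with every round.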

#### Implementing with sklearn

We'll implement this using sklearn's `GradientBoostingClassifier` with 100 trees on the same "pima indians" dataset (and the same `X`, `y`, and `kfold`) used in the AdaBoost example.

In [6]:
# import and instantiate the model
from sklearn.ensemble import GradientBoostingClassifier

num_trees = 100

gbc = GradientBoostingClassifier(n_estimators=num_trees, random_state=42)

In [7]:
# get results
results = cross_val_score(gbc, X, y, cv=kfold)

print(results.mean())

0.7642857142857143
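
The model above uses every row for every tree. The "stochastic" variant fits each tree on a random fraction of the rows, which in scikit-learn is controlled by the `subsample` parameter. The sketch below is a hedged illustration: the value `0.8`, the synthetic dataset, and the `stochastic_gbc` name are assumptions for the example, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in X and y from above if available)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# subsample < 1.0 means each tree is fit on a random 80% of the training rows
stochastic_gbc = GradientBoostingClassifier(
    n_estimators=100, subsample=0.8, random_state=42
)

scores = cross_val_score(stochastic_gbc, X_demo, y_demo, cv=5)
print("mean CV accuracy:", scores.mean())

# With subsample < 1.0 the fitted model also tracks the out-of-bag improvement
# in loss contributed by each boosting round
stochastic_gbc.fit(X_demo, y_demo)
print("first OOB improvements:", stochastic_gbc.oob_improvement_[:5])
```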


## XGBoost
XGBoost (Extreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It's highly effective, has great predictive power, and is often much faster than other gradient boosting implementations. It also includes a variety of regularization techniques, which reduce overfitting and improve overall performance.
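
As a rough guide to what that regularization looks like, the objective can be written as the training loss plus a complexity penalty $\Omega$ for each of the $K$ trees. In this sketch $T$ is the number of leaves in a tree and $w_j$ are its leaf weights; $\gamma$, $\lambda$, and $\alpha$ are taken to correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters, and writing the L1 term this way is an assumption about how `reg_alpha` enters.

$$
\text{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Larger values of these penalties favor smaller trees with smaller leaf weights.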

Implementation:

In [1]:
# import packages
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [7]:
# import data
URL = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = pd.read_csv(URL, names=features)

X = data.iloc[:, 0:8]
y = data.iloc[:, 8]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [8]:
# instantiate and train the model
model = XGBClassifier()

model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [9]:
# make predictions
y_pred = model.predict(X_test)

predictions = [round(value) for value in y_pred]

In [13]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy * 100.0: .2f}%")

Accuracy:  74.02%


## Light GBM
Light GBM tends to beat most other algorithms, particularly in training time, when the dataset is extremely large.

It's a gradient boosting framework that uses tree-based algorithms and grows trees leaf-wise, while most other implementations grow them level-wise. Like this:
![level](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/11194110/leaf.png)
![leaf](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/11194227/depth.png)

A leaf-wise approach may cause overfitting on smaller datasets, but this can be mitigated by limiting tree depth with the `max_depth` parameter.
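
The difference between the two growth strategies can be illustrated without Light GBM itself. The sketch below is only an analogy and an assumption on my part: it uses scikit-learn's `DecisionTreeRegressor`, which grows best-first (leaf-wise) when `max_leaf_nodes` is set, and compares it with a tree limited only by depth, on made-up data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data, purely for illustration
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=500)

# Depth-limited growth: every branch may be split until depth 3 is reached
depth_limited = DecisionTreeRegressor(max_depth=3, random_state=42)

# Best-first (leaf-wise) growth: always split the leaf with the largest gain,
# stopping at 8 leaves, so some branches can end up much deeper than others
leaf_wise = DecisionTreeRegressor(max_leaf_nodes=8, random_state=42)

# Leaf-wise growth with a depth cap, mirroring the max_depth advice above
capped = DecisionTreeRegressor(max_leaf_nodes=8, max_depth=3, random_state=42)

for name, tree in [("depth-limited", depth_limited),
                   ("leaf-wise", leaf_wise),
                   ("leaf-wise + max_depth", capped)]:
    tree.fit(X_demo, y_demo)
    print(f"{name:22s} depth={tree.get_depth()} leaves={tree.get_n_leaves()}")
```

With the same leaf budget, the uncapped leaf-wise tree is free to grow deeper, which is where the overfitting risk on small datasets comes from.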

## CatBoost
Handling categorical variables is a tedious process, especially when you have a large number of such variables. When your categorical variables have too many labels, performing one-hot encoding on them adds one column per label, which greatly increases the dimensionality and makes the dataset really difficult to work with.

CatBoost can automatically deal with categorical variables and does not require extensive data preprocessing like other machine learning algorithms.
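
To make the dimensionality problem concrete, here is a small sketch of what one-hot encoding does to a single high-cardinality column. The data is made up (a hypothetical `city` column with up to 200 labels), and it uses `pd.get_dummies` rather than anything from CatBoost.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_labels = 1000, 200

# One categorical column with many labels plus one numeric column (both made up)
frame = pd.DataFrame({
    "city": [f"city_{i}" for i in rng.integers(0, n_labels, size=n_rows)],
    "income": rng.normal(size=n_rows),
})

# One-hot encoding replaces the single 'city' column with one column per label
encoded = pd.get_dummies(frame, columns=["city"])

print("before:", frame.shape)
print("after: ", encoded.shape)
```

Two columns turn into roughly two hundred, which is the kind of blow-up that native categorical handling is meant to avoid.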