### Boosting Algorithms
The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.
Boosting is an ensemble technique in which the predictors are built sequentially rather than independently. The variants covered here are:
- AdaBoost (Adaptive Boosting)
- Gradient Tree Boosting
- XGBoost (Extreme Gradient Boosting)

### How Do Boosting Algorithms Work?

Boosting combines weak rules, produced by base learners, into a single strong rule. To find a weak rule, we apply a base learning (ML) algorithm to the training data under a different distribution (a different weighting of the observations) each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process: after many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

This raises a natural question: ‘How do we choose a different distribution for each round?’

The distribution is chosen with the following steps:

**Step 1:** The base learner takes the whole training set and assigns equal weight (attention) to each observation.

**Step 2:** If the first base learner makes prediction errors, we pay more attention to (give higher weight to) the observations it predicted incorrectly. Then we apply the next base learner.

**Step 3:** Repeat Step 2 until the limit on the number of base learners is reached or a sufficiently high accuracy is achieved.

Finally, boosting combines the outputs of the weak learners into a strong learner, which improves the predictive power of the model. At every round, boosting focuses on the examples that were misclassified, or had larger errors, under the preceding weak rules.
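The three steps above can be sketched in a few lines. This is a minimal illustration of the classic binary AdaBoost weight update, not the exact code scikit-learn runs internally; the synthetic data set, the variable names (`X_demo`, `y_demo`, `n_rounds`) and the choice of 10 rounds are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (an assumption for illustration); labels mapped to -1/+1
X_demo, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

n_rounds = 10
weights = np.full(len(y_demo), 1.0 / len(y_demo))  # Step 1: equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # fit a weak learner (depth-1 tree) on the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # weighted error rate, and the say this learner gets in the final vote
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Step 2: raise the weights of misclassified points, lower the rest, renormalise
    weights = weights * np.exp(-alpha * y_demo * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# final strong rule: sign of the weighted vote of all weak rules
votes = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_acc = np.mean(np.sign(votes) == y_demo)
first_stump_acc = np.mean(stumps[0].predict(X_demo) == y_demo)
print('first stump:', first_stump_acc, 'ensemble:', ensemble_acc)
```

Printing the two accuracies side by side shows how much the weighted vote adds on top of a single weak rule for this particular data set.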

### AdaBoost (Adaptive Boosting)

![image.png](attachment:image.png)

**Box 1:** All data points start with equal weights, and a decision stump is applied to classify them as + (plus) or – (minus). The decision stump (D1) has drawn a vertical line on the left side to separate the data points. This vertical line incorrectly predicts three + (plus) as – (minus). We therefore assign higher weights to these three + (plus) and apply another decision stump.

**Box 2:** Here, the three incorrectly predicted + (plus) are drawn larger than the rest of the data points to reflect their higher weights. The second decision stump (D2) tries to predict them correctly: a vertical line on the right side of the box now classifies the three previously misclassified + (plus) correctly. But it introduces new misclassification errors, this time on three – (minus). Again, we assign higher weights to these three – (minus) and apply another decision stump.

**Box 3:** Here, the three – (minus) are given higher weights. A decision stump (D3) is applied to predict these misclassified observations correctly. This time a horizontal line is generated to separate + (plus) and – (minus), driven by the higher weights of the misclassified observations.

**Box 4:** Here, D1, D2 and D3 are combined to form a strong prediction rule that is more complex than any individual weak learner. This combined rule classifies the observations considerably better than any of the individual weak learners.

**AdaBoost (Adaptive Boosting):** It works as described above. It fits a sequence of weak learners on differently weighted versions of the training data. It starts by fitting the original data set with equal weight on each observation. If the first learner predicts an observation incorrectly, that observation is given a higher weight. Being an iterative process, it keeps adding learners until a limit on the number of models or on accuracy is reached.

Mostly, we use decision stumps with AdaBoost. However, we can use any machine learning algorithm as the base learner, provided it accepts weights on the training data set. AdaBoost can be used for both classification and regression problems.
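Since the worked example in this notebook is a classification task, here is a small sketch of the regression side. It uses `AdaBoostRegressor` with its default base learner on a synthetic noisy sine curve; the data and the names `X_reg`, `y_reg` and `reg` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic 1-D regression problem: a sine curve with Gaussian noise (illustrative only)
rng = np.random.RandomState(0)
X_reg = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_reg = np.sin(X_reg).ravel() + rng.normal(scale=0.1, size=200)

# default base learner is a shallow regression tree
reg = AdaBoostRegressor(n_estimators=50, random_state=0)
reg.fit(X_reg, y_reg)

# R^2 on the training data
r2 = reg.score(X_reg, y_reg)
print('training R^2:', r2)
```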

### Code

In [1]:
# Importing packages
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier # for classification
from sklearn.ensemble import AdaBoostRegressor # for regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score # used to evaluate the models below

In [2]:
# reading the data
df = pd.read_csv('HR_comma_sep.csv')
df.head()

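| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |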


In [3]:
df.shape

(14999, 10)

In [4]:
df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')

In [5]:
df.isnull().sum()

satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
Work_accident            0
left                     0
promotion_last_5years    0
sales                    0
salary                   0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   sales                  14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


In [7]:
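# df.isnull().sum() above reported no missing values, so these fills leave the data unchanged;
# they show how numeric columns (mean) and categorical columns (mode) could be imputed
df['last_evaluation'] = df['last_evaluation'].fillna(df['last_evaluation'].mean())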
df['number_project'] = df['number_project'].fillna(df['number_project'].mean())
df['average_montly_hours'] = df['average_montly_hours'].fillna(df['average_montly_hours'].mean())
df['time_spend_company'] = df['time_spend_company'].fillna(df['time_spend_company'].mean())
df['Work_accident'] = df['Work_accident'].fillna(df['Work_accident'].mean())
df['left'] = df['left'].fillna(df['left'].mode()[0])
df['promotion_last_5years'] = df['promotion_last_5years'].fillna(df['promotion_last_5years'].mean())
df['sales'] = df['sales'].fillna(df['sales'].mode()[0])
df['salary'] = df['salary'].fillna(df['salary'].mode()[0])

In [8]:
df.salary.value_counts()

low       7316
medium    6446
high      1237
Name: salary, dtype: int64

In [9]:
df.sales.value_counts()

sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: sales, dtype: int64

In [10]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df['sales'] = lb.fit_transform(df['sales'])

In [11]:
df.sales.value_counts()

7    4140
9    2720
8    2229
0    1227
6     902
5     858
1     787
2     767
3     739
4     630
Name: sales, dtype: int64
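To read the encoded counts above, it helps to see which integer stands for which department. `LabelEncoder` assigns codes in sorted order of the class names, and the fitted encoder exposes that order through `classes_`. The sketch below refits a separate encoder on the ten department names listed earlier; the names `departments`, `enc` and `mapping` are introduced only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# the ten department names that appear in the 'sales' column
departments = ['sales', 'technical', 'support', 'IT', 'product_mng',
               'marketing', 'RandD', 'accounting', 'hr', 'management']

enc = LabelEncoder().fit(departments)

# classes_ is sorted (uppercase letters sort before lowercase ones),
# and the position of each class is its integer code
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# codes can be turned back into names
print(enc.inverse_transform([7, 9, 8]))
```

In the notebook itself the same information is available from `lb.classes_`. Note that the codes carry no real ordering between departments; they are just identifiers.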

In [13]:
# separating the independent and dependent variables
# independent variable
X = df.drop('salary',axis=1)
# dependent variable
y = df['salary']

In [14]:
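from sklearn.model_selection import train_test_split
# split first so that the scaler only ever sees the training rows
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3)

In [15]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit the scaler on the training split, then reuse it on the test split
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)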

(10499, 9) (4500, 9) (10499,) (4500,)


In [17]:
# Use a decision tree as the base estimator; any ML learner that accepts sample weights can be used instead
# (scikit-learn 1.2 and later name this argument `estimator` instead of `base_estimator`)
dt = DecisionTreeClassifier(class_weight="balanced")
clf = AdaBoostClassifier(n_estimators=100, base_estimator=dt,learning_rate=1)

# training the model
clf.fit(x_train,y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced'),
                   learning_rate=1, n_estimators=100)

In [None]:
# predict the target on the train dataset
predict_train = clf.predict(x_train)
print('\nTarget on train data',predict_train) 

# Accuracy score on train dataset
accuracy_train = accuracy_score(y_train,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


In [None]:
# predict the target on the test dataset
predict_test = clf.predict(x_test)
print('\nTarget on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

We can tune the parameters to optimize the performance of the algorithm. The key parameters for tuning are:

**n_estimators:** Controls the number of weak learners.<br>
**learning_rate:** Controls the contribution of each weak learner to the final combination. There is a trade-off between learning_rate and n_estimators: a smaller learning rate usually needs more estimators.<br>
**base_estimator:** Specifies the ML algorithm used as the weak learner.<br>
We can also tune the parameters of the base learner to optimize its performance.
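One common way to explore the trade-off between `n_estimators` and `learning_rate` is a small cross-validated grid search. The sketch below is only an illustration: it runs on a synthetic data set rather than the HR data, and the grid values and the names `param_grid` and `search` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# candidate values for the two main AdaBoost parameters
param_grid = {
    'n_estimators': [25, 50],
    'learning_rate': [0.5, 1.0],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                      cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
print('best cross-validated accuracy:', search.best_score_)
```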

### Gradient Boosting
A Gradient Boosting Machine, or GBM, combines the predictions from multiple decision trees to generate the final prediction. Keep in mind that all the weak learners in a gradient boosting machine are decision trees. But if we are using the same algorithm, how is using a hundred decision trees better than using a single decision tree? How do different decision trees capture different signals from the data?

Here is the trick: each new tree takes into account the errors made by the previous trees, so every successive decision tree is fitted to what the ensemble so far still gets wrong. This means that the individual trees aren’t all the same, and each one captures a different signal from the data. This is how the trees in a gradient boosting machine are built sequentially.
Optionally, the nodes of each tree can also pick the best split from a random subset of the features (the `max_features` setting in scikit-learn), which adds further variety between trees.

Gradient boosting trains many models sequentially. Each new model gradually reduces the loss function of the whole ensemble using a gradient descent procedure. (In a model written as y = ax + b + e, the error term e is the part that needs special attention: it is what the next model tries to explain.) The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.

The principal idea behind this algorithm is to construct each new base learner so that it is maximally correlated with the negative gradient of the loss function of the whole ensemble.
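For squared-error loss, the negative gradient with respect to the current prediction is simply the residual, so the idea can be sketched by repeatedly fitting small regression trees to residuals. This is a simplified illustration under that assumption, not scikit-learn's implementation; the synthetic data, the names (`X_gb`, `y_gb`, `mse_history`) and the settings (50 trees, depth 2, learning rate 0.1) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression problem: noisy quadratic (illustrative only)
rng = np.random.RandomState(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = X_gb.ravel() ** 2 + rng.normal(scale=0.3, size=200)

learning_rate = 0.1
n_trees = 50

# start from a constant prediction: the mean of the target
prediction = np.full_like(y_gb, y_gb.mean())
mse_history = [np.mean((y_gb - prediction) ** 2)]
trees = []

for _ in range(n_trees):
    # negative gradient of 1/2 * (y - F)^2 with respect to F is the residual
    residuals = y_gb - prediction

    # the new base learner is fitted to the negative gradient
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_gb, residuals)

    # take a small step in the direction suggested by the new tree
    prediction = prediction + learning_rate * tree.predict(X_gb)

    trees.append(tree)
    mse_history.append(np.mean((y_gb - prediction) ** 2))

print('MSE before boosting:', mse_history[0])
print('MSE after boosting: ', mse_history[-1])
```

Other differentiable losses follow the same loop; only the expression for the negative gradient changes.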

In Python's scikit-learn library, this is implemented as Gradient Tree Boosting, also called Gradient Boosted Regression Trees (GBRT). It is a generalization of boosting to arbitrary differentiable loss functions and can be used for both regression and classification problems.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
clf.fit(x_train, y_train)

In [None]:
# predict the target on the train dataset
predict_train = clf.predict(x_train)
print('\nTarget on train data',predict_train) 

# Accuracy score on train dataset
accuracy_train = accuracy_score(y_train,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


In [None]:
# predict the target on the test dataset
predict_test = clf.predict(x_test)
print('\nTarget on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

**n_estimators:** It controls the number of weak learners.

**learning_rate:** Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.

**max_depth:** Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.

We can also choose a different loss function (the `loss` parameter) for better performance.
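To see how `n_estimators` interacts with a fixed `learning_rate`, one option is `staged_predict`, which yields the ensemble's predictions after each added tree, so a single fit is enough to inspect every ensemble size. The sketch below uses a synthetic data set and a held-out validation split; the data and the names `X_tr`, `X_val`, `gbm`, `val_acc` and `best_n` are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic data with a held-out validation split (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo,
                                            test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=1, random_state=0)
gbm.fit(X_tr, y_tr)

# validation accuracy after 1, 2, ..., 100 trees
val_acc = [accuracy_score(y_val, pred) for pred in gbm.staged_predict(X_val)]

# smallest number of trees that reaches the best validation accuracy
best_n = val_acc.index(max(val_acc)) + 1
print('best number of trees:', best_n, 'validation accuracy:', max(val_acc))
```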

### Extreme Gradient Boosting
Extreme Gradient Boosting, or XGBoost, is another popular boosting algorithm. In fact, XGBoost is simply an improved version of the GBM algorithm. The working procedure of XGBoost is the same as GBM: the trees are built sequentially, each trying to correct the errors of the previous trees.

But there are certain features that give XGBoost an edge over GBM:

- XGBoost implements parallel processing (at the node level), which makes it faster than GBM.
- XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance. You can select the regularization technique by setting the hyperparameters of the XGBoost algorithm.

In [None]:
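from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# recent xgboost releases expect integer class labels (0, 1, 2, ...),
# so encode the salary labels before fitting
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

model = XGBClassifier()

# fit the model with the training data
model.fit(x_train,y_train_enc)

In [None]:
# predict the target on the train dataset (decoded back to the original labels)
predict_train = le.inverse_transform(model.predict(x_train))
print('\nTarget on train data',predict_train) 

# Accuracy score on train dataset
accuracy_train = accuracy_score(y_train,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)


In [None]:
# predict the target on the test dataset (decoded back to the original labels)
predict_test = le.inverse_transform(model.predict(x_test))
print('\nTarget on test data',predict_test) 

# Accuracy score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)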

We can also tune the hyperparameters of XGBoost to optimize its performance.

In [None]:
Unsupervised Learning

KMeans++

KMeans(init='k-means++')

Hierarchical Clustering

In [None]:
Supervised Learning

Linear Regression
Logistic regression
KNN
Naive Bayes
Decision Tree
Random Forest
SVM
Gradient Boosting,AdaBoost,XGBoost