# Gradient boosting algorithm

Gradient boosting is a popular machine learning algorithm for classification and regression tasks. Boosting is an ensemble learning method that trains models sequentially, with each new model trying to correct the errors of the previous ones, so that several weak learners are combined into a strong learner. The two most popular boosting algorithms are **AdaBoost** and **Gradient Boosting**.
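To make "each new model tries to correct the previous model" concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss, where every tree is fit to the residuals left by the trees before it. The helper names `fit_boosted_trees` and `predict_boosted_trees`, the toy sine data, and all settings are assumptions made for illustration. This is not how scikit-learn implements `GradientBoostingClassifier` internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Hypothetical helper: boost shallow regression trees on squared-error residuals."""
    baseline = float(np.mean(y))              # initial model: predict the mean
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction            # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)                # the new weak learner models the remaining error
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees


def predict_boosted_trees(X, baseline, trees, learning_rate=0.1):
    """Hypothetical helper: sum the baseline and the shrunken tree predictions."""
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction


# Toy data (made up for this sketch): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

baseline, trees = fit_boosted_trees(X_demo, y_demo)
boosted = predict_boosted_trees(X_demo, baseline, trees)

mse_baseline = float(np.mean((y_demo - baseline) ** 2))
mse_boosted = float(np.mean((y_demo - boosted) ** 2))
print('MSE of mean-only model:', round(mse_baseline, 4))
print('MSE after boosting:', round(mse_boosted, 4))
```

The training error of the boosted ensemble should be lower than that of the mean-only model, because each round removes part of the remaining residual.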

## Importing and loading data

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the boosting classifiers
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

from sklearn import metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the data
data = pd.read_csv('datasets/data_cleaned.csv')

# Check the data
data.head()

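|   | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |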


## Model building

### Separating independent and dependent variables

In [3]:
# Independent variables
x = data.drop(['Survived'], axis=1)

# Dependent variable
y = data['Survived']

print(x.shape, y.shape)

(891, 24) (891,)


### Creating the training and testing sets

In [4]:
# Divide into train and test sets (default 25% test size, stratified on the target)
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=101, stratify=y)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(668, 24) (668,)
(223, 24) (223,)
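The 668/223 split above is consistent with the default `test_size` of 25% in `train_test_split`. The `stratify` argument keeps the class proportions the same in both parts. The small sketch below illustrates this on made-up labels (80 zeros and 20 ones); `X_toy` and `y_toy` are assumptions for illustration and are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80% class 0, 20% class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=101, stratify=y_toy
)

# Sizes follow the default 75/25 split
print(len(y_tr), len(y_te))

# Share of class 1 in each part; stratification keeps both close to 0.2
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of the positive class in the test set would vary with the random seed, which can make scores on small datasets harder to compare.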


### Build the Gradient Boosting model

In [5]:
# Create a gradient boosting instance
clf = GradientBoostingClassifier(random_state=96)

# Train the model
clf.fit(train_x, train_y)

In [6]:
# Calculating scores
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))

Training Score: 0.9
Testing Score: 0.812
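The gap between the training and testing scores raises the question of how accuracy evolves as trees are added. One way to look at this is `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below uses synthetic data from `make_classification` as a stand-in, since the notebook's CSV file is not assumed to be available; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=101, stratify=y_syn
)

model = GradientBoostingClassifier(random_state=96).fit(X_tr, y_tr)

# Accuracy after 1, 2, ..., n_estimators trees
train_curve = [accuracy_score(y_tr, pred) for pred in model.staged_predict(X_tr)]
test_curve = [accuracy_score(y_te, pred) for pred in model.staged_predict(X_te)]

for n_trees in (1, 10, 50, 100):
    print(n_trees,
          round(train_curve[n_trees - 1], 3),
          round(test_curve[n_trees - 1], 3))
```

If the training curve keeps rising while the testing curve flattens or drops, that is a hint that later trees are mostly fitting noise.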


### Hyperparameter tuning for Gradient Boosting

#### Model-based hyperparameters

1. **n_estimators:** The number of boosting stages, i.e. the total number of trees.
2. **loss:** The loss function to be minimized.
3. **subsample:** The fraction of observations randomly sampled to fit each tree.
4. **random_state:** The random seed, so that the same random numbers are generated on every run.
5. **learning_rate:** Scales the contribution of each tree to the final outcome. Smaller values generally need more trees, so there is a trade-off with n_estimators.
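As a rough illustration of how `learning_rate` interacts with a fixed `n_estimators`, the sketch below fits the same model with three learning rates and records the scores. The data is synthetic (`make_classification`), and the chosen values are assumptions for demonstration rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=101, stratify=y_syn
)

results = {}
for learning_rate in (0.01, 0.1, 1.0):
    model = GradientBoostingClassifier(
        random_state=96,
        n_estimators=100,             # number of trees kept fixed
        subsample=0.7,                # each tree sees a random 70% of the rows
        learning_rate=learning_rate,  # weight of each tree's contribution
    )
    model.fit(X_tr, y_tr)
    results[learning_rate] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(learning_rate,
          'train:', round(results[learning_rate][0], 3),
          'test:', round(results[learning_rate][1], 3))
```

A very small learning rate may underfit with only 100 trees, while a very large one tends to fit the training set quickly and may generalize worse. The exact numbers depend on the data.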

In [7]:
# Create a gradient boosting instance
clf = GradientBoostingClassifier(random_state=96, n_estimators=200, subsample=0.7)

# Train the model
clf.fit(train_x, train_y)

In [8]:
# Calculating scores
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))

Training Score: 0.945
Testing Score: 0.816


#### Tree-based hyperparameters

1. **max_depth:** The maximum depth to which a tree can grow (stopping criterion).
2. **max_features:** The number of features to consider when searching for the best split.
3. **max_leaf_nodes:** The maximum number of terminal nodes (leaves) in a tree.
4. **min_samples_leaf:** The minimum number of samples required in a terminal node or leaf (stopping criterion).
5. **min_samples_split:** The minimum number of samples a node must contain to be split (stopping criterion).
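The stopping criteria above can be checked on a fitted model by inspecting the individual trees stored in `estimators_`. The sketch below is an illustration on synthetic data with arbitrarily chosen limits; for a binary problem each boosting stage holds a single regression tree, which is what the `stage[0]` indexing relies on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

model = GradientBoostingClassifier(
    random_state=96,
    n_estimators=20,
    max_depth=3,             # no tree deeper than 3 levels
    min_samples_leaf=20,     # every leaf keeps at least 20 samples
    min_samples_split=100,   # only nodes with at least 100 samples are split
)
model.fit(X_syn, y_syn)

# One regression tree per boosting stage in the binary case
trees = [stage[0] for stage in model.estimators_]

max_depth_seen = max(tree.get_depth() for tree in trees)
max_leaves_seen = max(tree.get_n_leaves() for tree in trees)

# In the low-level tree structure, leaves have no left child (marked by -1)
leaf_sizes = np.concatenate(
    [t.tree_.n_node_samples[t.tree_.children_left == -1] for t in trees]
)
split_sizes = np.concatenate(
    [t.tree_.n_node_samples[t.tree_.children_left != -1] for t in trees]
)

print('Deepest tree:', max_depth_seen)
print('Most leaves in a tree:', max_leaves_seen)
print('Smallest leaf:', leaf_sizes.min())
print('Smallest node that was split:', split_sizes.min())
```

Tighter limits produce simpler trees, which usually lowers the training score and can narrow the gap to the testing score.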

In [9]:
# Create a gradient boosting instance
clf = GradientBoostingClassifier(random_state=96, min_samples_split=100, max_depth=4)

# Train the model
clf.fit(train_x, train_y)

In [10]:
# Calculating scores
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))

Training Score: 0.903
Testing Score: 0.839
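The cells above try hyperparameter values one combination at a time. A systematic alternative is a cross-validated grid search with `GridSearchCV`. The sketch below is one possible setup on synthetic data; the grid values and the 3-fold setting are assumptions for illustration, not tuned choices for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=101, stratify=y_syn
)

# Small illustrative grid mixing model-based and tree-based hyperparameters
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4],
    'min_samples_split': [2, 50],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=96),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training set only
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 3))
print('Test accuracy of refit model:', round(search.score(X_te, y_te), 3))
```

Because the search only sees the training set, the test set remains an untouched check on the selected configuration.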


### Build the AdaBoost model 

In [11]:
# Create an AdaBoost instance
clf = AdaBoostClassifier(random_state=96)

# Train the model
clf.fit(train_x, train_y)

In [12]:
# Calculating scores
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))

Training Score: 0.841
Testing Score: 0.798
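Scores from a single train/test split can shift with the random seed, so a comparison between the two boosting models may be steadier under cross-validation. The sketch below compares both classifiers with `cross_val_score` on synthetic data; the 5-fold setting and the data are assumptions, and the resulting numbers say nothing about the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

models = {
    'GradientBoosting': GradientBoostingClassifier(random_state=96),
    'AdaBoost': AdaBoostClassifier(random_state=96),
}

cv_scores = {}
for name, model in models.items():
    # Five accuracy values, one per held-out fold
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(name, 'mean:', round(scores.mean(), 3), 'std:', round(scores.std(), 3))
```

Reporting the mean together with the standard deviation shows whether a difference between the two models is larger than the fold-to-fold variation.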
