# Gradient Boosting

- Works for both regression and classification
- Adds predictors to the ensemble sequentially
- Each new predictor corrects the errors of its predecessors
- Each new predictor is fit to the residual errors of the ensemble built so far
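
The idea in the bullets above can be sketched by hand for a regression problem. This is a minimal illustration, not how scikit-learn implements it: the synthetic data, the names `X_demo`, `y_demo`, and `ensemble_predict`, and the choice of three depth-2 trees are all assumptions made for the example, and no learning rate is applied.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only, not from the Titanic cells).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

trees = []
residual = y_demo.copy()            # the first tree fits the raw targets
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)      # fit the errors left by the ensemble so far
    residual = residual - tree.predict(X_demo)
    trees.append(tree)

def ensemble_predict(X):
    """Sum the predictions of all trees (hypothetical helper for this sketch)."""
    return sum(tree.predict(X) for tree in trees)

mse_first = np.mean((y_demo - trees[0].predict(X_demo)) ** 2)
mse_all = np.mean((y_demo - ensemble_predict(X_demo)) ** 2)
print('MSE with one tree:    %.4f' % mse_first)
print('MSE with three trees: %.4f' % mse_all)
```

Each tree only sees what the previous trees failed to explain, so the training error of the summed ensemble cannot be worse than that of the first tree alone.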

## 3 Elements

1. Loss function to be optimized: The loss function depends on the type of problem being solved. Regression commonly uses mean squared error (MSE), and classification commonly uses logarithmic loss. At each stage, the new learner reduces the loss left unexplained by prior iterations rather than starting from scratch.
2. Weak learner to make predictions: Decision trees are used as the weak learners in gradient boosting.
3. Additive model to add weak learners to minimize the loss function: Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees: each new tree is fit to the negative gradient of the loss with respect to the current predictions. For MSE, that negative gradient is the residual itself.
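
To make the three elements concrete for classification, here is a rough sketch of boosting with logarithmic loss written out by hand. The toy data, the helper names `sigmoid` and `log_loss_manual`, the learning rate of 0.1, and the 50 depth-2 trees are all assumptions for illustration; scikit-learn's `GradientBoostingClassifier` differs in its details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_manual(y, p, eps=1e-12):
    """Mean logarithmic loss for binary labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy binary problem (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)

learning_rate = 0.1
p0 = y_toy.mean()
F = np.full(len(y_toy), np.log(p0 / (1 - p0)))  # start from the prior log-odds
losses = [log_loss_manual(y_toy, sigmoid(F))]
stages = []
for _ in range(50):
    # Element 1: negative gradient of log loss w.r.t. the current scores F.
    neg_grad = y_toy - sigmoid(F)
    # Element 2: a shallow regression tree is the weak learner.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, neg_grad)
    # Element 3: add the new tree; earlier trees are left unchanged.
    F = F + learning_rate * tree.predict(X_toy)
    stages.append(tree)
    losses.append(log_loss_manual(y_toy, sigmoid(F)))

print('Log loss before boosting: %.4f' % losses[0])
print('Log loss after 50 trees:  %.4f' % losses[-1])
```

Note that the weak learner is a regression tree even though the task is classification, because it is fit to real-valued gradients rather than to class labels.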

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [7]:
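df = sns.load_dataset('titanic')
# Keep only the columns used below before dropping missing values; calling
# dropna() on the full frame would also discard every row with a missing
# value in an unused column such as 'deck'.
df = df[['pclass', 'sex', 'age', 'survived']].dropna()
X = df[['pclass', 'sex', 'age']].copy()
# Encode 'sex' as integers (LabelEncoder sorts classes: female -> 0, male -> 1).
le = preprocessing.LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])
y = df['survived'].copy()
# Hold out 10% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)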

In [10]:
gbc_clf = GradientBoostingClassifier()
gbc_clf.fit(X_train, y_train);
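
Because trees are added one at a time, a fitted model can be evaluated after each stage. The sketch below uses `staged_predict` on synthetic data so that it runs on its own; the dataset, the names `demo_clf`, `Xs_train`, `Xs_test`, `ys_train`, `ys_test`, and the hyperparameter values (which match the scikit-learn defaults) are assumptions for illustration, not results from the Titanic cells.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

demo_clf = GradientBoostingClassifier(
    n_estimators=100,    # number of trees added sequentially
    learning_rate=0.1,   # shrinks the contribution of each tree
    max_depth=3,         # depth of each weak learner
    random_state=0,
)
demo_clf.fit(Xs_train, ys_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n trees.
staged_acc = [accuracy_score(ys_test, pred)
              for pred in demo_clf.staged_predict(Xs_test)]
best_n = int(np.argmax(staged_acc)) + 1
print('Best test accuracy %.3f with %d trees' % (max(staged_acc), best_n))
```

Picking the number of trees from test accuracy, as done here for brevity, leaks test information; a separate validation split would be the safer choice in practice.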

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, roc_auc_score

def printScore(clf, X_train, X_test, y_train, y_test, train=True):
    """Print accuracy, classification report, confusion matrix, and ROC AUC.

    Evaluates on the training split when train=True, otherwise on the test split.
    """
    X, y, name = (X_train, y_train, 'Train') if train else (X_test, y_test, 'Test')
    pred = clf.predict(X)
    # ROC AUC is a ranking metric, so score it on the predicted probability of
    # the positive class rather than on hard labels (assumes a binary target
    # and a classifier that implements predict_proba).
    proba = clf.predict_proba(X)[:, 1]
    print('{} Results:\n'.format(name))
    print('Accuracy: %.2f\n' % accuracy_score(y, pred))
    print('Classification Report: \n {} \n'.format(classification_report(y, pred)))
    print('Confusion Matrix: \n {} \n'.format(confusion_matrix(y, pred)))
    print('ROC AUC: {0:.4f}\n'.format(roc_auc_score(y, proba)))

In [None]:
printScore(gbc_clf, X_train, X_test, y_train, y_test)
printScore(gbc_clf, X_train, X_test, y_train, y_test, train=False)