# Boosting 

* Boosting is the practical application of gradual (incremental) learning.
* The idea is to approach the target step by step: each iteration brings the ensemble a little closer to the expected output.
* The models are dependent: each new model is built from the results of the previous iteration.
* In bagging, every model in the ensemble learns the same problem and the models then complement each other (collective learning). In boosting, each model fits a slightly different problem that contributes to the same objective.

* **Drawback**: because training is sequential, the boosting rounds themselves cannot run in parallel. (XGBoost and LightGBM do use multiple cores, but they parallelize the construction of each tree, not the sequence of rounds.)

> Boosting in general supports early stopping. In scikit-learn, `AdaBoostClassifier` does not implement it, while `GradientBoostingClassifier` exposes it through `n_iter_no_change`.
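As a minimal sketch of early stopping, the snippet below uses the `n_iter_no_change`, `validation_fraction` and `tol` parameters of scikit-learn's `GradientBoostingClassifier`. The data is synthetic (generated with `make_classification`) and all parameter values are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (not the Titanic set).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting rounds
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=123,
)
gb.fit(X_train, y_train)

# n_estimators_ holds the number of rounds that were actually fitted.
print("rounds fitted:", gb.n_estimators_)
print("test accuracy:", round(gb.score(X_test, y_test), 4))
```

If the validation score stops improving, `n_estimators_` ends up below the `n_estimators` upper bound.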


# Adaboost

* Proposed by Freund and Schapire in 1997.
* It won the Gödel Prize in 2003, an award in theoretical computer science (the first machine learning model to win it).
* It was the first practical boosting algorithm.
* It enables sequential learning on top of any base estimator (in scikit-learn, the base estimator must accept `sample_weight` in `fit`).
* As the ensemble grows, performance improves: training error typically keeps falling, although test performance can plateau.
* These ensembles are built from weak learners, which are fast to train; the ensemble adjusts progressively as the number of iterations increases.
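To make the sequential reweighting concrete, here is a minimal sketch of discrete AdaBoost for binary labels in {-1, +1}, using decision stumps as weak learners. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names `fit_adaboost` and `predict_adaboost` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=20):
    """Discrete AdaBoost for labels in {-1, +1} with decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weak learner sees the current weights
        pred = stump.predict(X)
        err = np.sum(w[pred != y])             # weighted error (weights sum to 1)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero and log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this learner
        w = w * np.exp(-alpha * y * pred)      # up-weight the misclassified samples
        w = w / w.sum()                        # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def predict_adaboost(stumps, alphas, X):
    """Weighted vote of all weak learners."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)


# Synthetic data for illustration; labels are mapped from {0, 1} to {-1, +1}.
X, y01 = make_classification(n_samples=500, n_features=6, n_informative=4,
                             random_state=123)
y = 2 * y01 - 1

stumps, alphas = fit_adaboost(X, y, n_rounds=20)
y_hat = predict_adaboost(stumps, alphas, X)
acc = np.mean(y_hat == y)
print("training accuracy:", round(acc, 4))
```

Each round depends on the weights produced by the previous one, which is why the rounds cannot be trained independently.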

# Gradient Boosting Machine 

![](boosting.png)

$$y \sim f_1(X)$$

$$y - f_1(X) \sim f_2(X)$$

$$y \sim f_1(X) + f_2(X)$$

$$y \sim f_1(X) + f_2(X) + ... + f_n(X) = \sum_{i = 1}^nf_i(X)$$

This is an additive model: each new term is fitted to the residuals left by the previous fit.
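The residual-fitting scheme in the equations above can be sketched for a regression problem with squared error. This is a simplified illustration, not the algorithm as implemented in any library: the data is synthetic, and the `learning_rate` variable is an addition (a value of 1.0 reproduces the equations as written; libraries usually shrink each term with a smaller value).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, for illustration only.
rng = np.random.default_rng(123)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 1.0            # 1.0 matches y ~ f_1(X) + f_2(X) + ... exactly
models = []
prediction = np.zeros_like(y)  # running sum f_1(X) + ... + f_i(X)
mse_history = []

for _ in range(20):
    residual = y - prediction  # what the current ensemble still misses
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)      # f_i is fitted to the residuals, not to y
    prediction += learning_rate * tree.predict(X)
    models.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("MSE after 1 round:  ", round(mse_history[0], 4))
print("MSE after 20 rounds:", round(mse_history[-1], 4))
```

The training error never increases from one round to the next here, because each tree predicts the mean residual in every leaf.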



# Implementing AdaBoost

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('titanic.csv', index_col = 0)
df

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Signing_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1911-05-17 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1911-07-23 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1911-09-08 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1911-06-26 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1911-10-25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S | 1911-08-17 |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 1911-08-07 |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S | 1912-01-30 |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 1911-08-08 |


In [3]:
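# fillna replaces every missing value in the selected columns with the mean age
# (the mean is computed on the full dataset, before the train/test split)
X = df[['Pclass','Age','SibSp','Parch','Fare']].fillna(df.Age.mean())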
y = df.Survived

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 123)

In [4]:
# Try changing the model parameters
# C = 0.01, 1, 10, 50, 100
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2
adb = AdaBoostClassifier(base_estimator = LogisticRegression(C = 50), n_estimators = 10, random_state = 123)
adb.fit(X_train, y_train)
y_pred = adb.predict(X_test)
y_pred_train = adb.predict(X_train)
print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.85      0.77       379
           1       0.66      0.45      0.54       244

    accuracy                           0.69       623
   macro avg       0.68      0.65      0.65       623
weighted avg       0.69      0.69      0.68       623

              precision    recall  f1-score   support

           0       0.75      0.84      0.79       170
           1       0.65      0.51      0.57        98

    accuracy                           0.72       268
   macro avg       0.70      0.68      0.68       268
weighted avg       0.71      0.72      0.71       268
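The suggestion in the cell above (trying several values of `C`) can be automated with a small loop. The following is a sketch on synthetic data, so the scores it prints are not comparable to the Titanic results. The base estimator is passed positionally, on the assumption that this sidesteps the `base_estimator` / `estimator` keyword rename between scikit-learn versions; `max_iter=1000` is an added setting to help the logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (not the Titanic set).
X, y = make_classification(n_samples=800, n_features=5, n_informative=3,
                           n_redundant=0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

scores = {}
for C in [0.01, 1, 10, 50, 100]:
    # First positional argument is the base estimator in old and new versions.
    adb = AdaBoostClassifier(LogisticRegression(C=C, max_iter=1000),
                             n_estimators=10, random_state=123)
    adb.fit(X_train, y_train)
    scores[C] = (f1_score(y_train, adb.predict(X_train)),
                 f1_score(y_test, adb.predict(X_test)))

# Compare train and test F1 side by side to spot over- or underfitting.
for C, (f1_train, f1_test) in scores.items():
    print(f"C={C:<6} train F1={f1_train:.4f}  test F1={f1_test:.4f}")
```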



In [5]:
# Without regularization this model overfits heavily
# Regularize it strongly with max_depth = 1, 2, 5
# Or regularize it with ccp_alpha = 0.01, 0.3
adb = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth = 1), n_estimators = 100, random_state = 123) 
adb.fit(X_train, y_train)
y_pred = adb.predict(X_test)
y_pred_train = adb.predict(X_train)
print(classification_report(y_train, y_pred_train, digits = 4))
print(classification_report(y_test, y_pred, digits = 4))

              precision    recall  f1-score   support

           0     0.7546    0.8681    0.8074       379
           1     0.7326    0.5615    0.6357       244

    accuracy                         0.7480       623
   macro avg     0.7436    0.7148    0.7215       623
weighted avg     0.7460    0.7480    0.7401       623

              precision    recall  f1-score   support

           0     0.7898    0.8176    0.8035       170
           1     0.6630    0.6224    0.6421        98

    accuracy                         0.7463       268
   macro avg     0.7264    0.7200    0.7228       268
weighted avg     0.7434    0.7463    0.7445       268



In [6]:
%%time
# Does not work as well as a DecisionTree does

adb = AdaBoostClassifier(base_estimator = RandomForestClassifier(max_depth = 1), n_estimators = 50, random_state = 123)
adb.fit(X_train, y_train)
y_pred = adb.predict(X_test)
y_pred_train = adb.predict(X_train)
print(classification_report(y_train, y_pred_train, digits = 4))
print(classification_report(y_test, y_pred, digits = 4))

              precision    recall  f1-score   support

           0     0.7534    0.8707    0.8078       379
           1     0.7351    0.5574    0.6340       244

    accuracy                         0.7480       623
   macro avg     0.7443    0.7140    0.7209       623
weighted avg     0.7463    0.7480    0.7398       623

              precision    recall  f1-score   support

           0     0.7753    0.8118    0.7931       170
           1     0.6444    0.5918    0.6170        98

    accuracy                         0.7313       268
   macro avg     0.7099    0.7018    0.7051       268
weighted avg     0.7274    0.7313    0.7287       268

Wall time: 7.61 s


# Gradient Boosting Machine 

In [7]:
%%time
# Does not work as well as a DecisionTree does
# Try max_depth = 1
# Try subsample = 0.6
# n_estimators = 50 with subsample = 0.6 gave the best result
gbm = GradientBoostingClassifier(n_estimators = 50, subsample = 0.6,  random_state = 123)
gbm.fit(X_train, y_train)
y_pred = gbm.predict(X_test)
y_pred_train = gbm.predict(X_train)
print(classification_report(y_train, y_pred_train, digits = 4))
print(classification_report(y_test, y_pred, digits = 4))

              precision    recall  f1-score   support

           0     0.7590    0.8892    0.8190       379
           1     0.7654    0.5615    0.6478       244

    accuracy                         0.7608       623
   macro avg     0.7622    0.7253    0.7334       623
weighted avg     0.7615    0.7608    0.7519       623

              precision    recall  f1-score   support

           0     0.7730    0.8412    0.8056       170
           1     0.6747    0.5714    0.6188        98

    accuracy                         0.7425       268
   macro avg     0.7238    0.7063    0.7122       268
weighted avg     0.7370    0.7425    0.7373       268

Wall time: 70 ms
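One way to look for a suitable `n_estimators` without refitting the model many times is `staged_predict`, which yields the ensemble's predictions after each boosting round. The sketch below uses synthetic data and a separate validation split (an assumption; the notebook itself only has a train/test split), so the round it reports says nothing about the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=123)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123)

gbm = GradientBoostingClassifier(n_estimators=200, subsample=0.6,
                                 random_state=123)
gbm.fit(X_train, y_train)

# One accuracy value per boosting round, computed from a single fitted model.
val_acc = [accuracy_score(y_val, y_stage)
           for y_stage in gbm.staged_predict(X_val)]
best_round = int(np.argmax(val_acc)) + 1

print("best number of rounds on the validation set:", best_round)
print("validation accuracy at that round:", round(max(val_acc), 4))
```

Selecting the round on the test set instead would leak information into the final evaluation, which is why a validation split is used here.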


# A more advanced model

In [8]:
X = df[['Pclass','Age','SibSp','Parch','Fare', 'Sex','Embarked']].copy()
y = df.Survived.copy()

X[['Pclass','Sex','Embarked']] = X[['Pclass','Sex','Embarked']].astype('category')
X.dtypes

Pclass      category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Sex         category
Embarked    category
dtype: object

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 123)

In [10]:
is_cat = X_train.dtypes == 'category'
is_cat

Pclass       True
Age         False
SibSp       False
Parch       False
Fare        False
Sex          True
Embarked     True
dtype: bool

In [11]:
%%time
cat = Pipeline(steps = [
    ('ord', OrdinalEncoder(cols = 'Pclass')),
    ('ohe', OneHotEncoder(use_cat_names = True))
])

num = Pipeline(steps = [
    ('imp', SimpleImputer(strategy = 'mean')),
    ('sc', StandardScaler()),
    #('pt', PowerTransformer())
    
])

prep = ColumnTransformer(transformers = [
    ('cat', cat, is_cat),
    ('num', num, ~is_cat)
])

pipe = Pipeline(steps = [
    ('prep', prep),
    #('pca', PCA(n_components = 10)),
    ('gb', GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.5, max_depth = 3, random_state = 123))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
y_pred_train = pipe.predict(X_train)

print(classification_report(y_train, y_pred_train, digits = 4))
print(classification_report(y_test, y_pred, digits = 4))

              precision    recall  f1-score   support

           0     0.9279    0.9770    0.9518       435
           1     0.9606    0.8809    0.9190       277

    accuracy                         0.9396       712
   macro avg     0.9443    0.9289    0.9354       712
weighted avg     0.9407    0.9396    0.9391       712

              precision    recall  f1-score   support

           0     0.9196    0.9035    0.9115       114
           1     0.8358    0.8615    0.8485        65

    accuracy                         0.8883       179
   macro avg     0.8777    0.8825    0.8800       179
weighted avg     0.8892    0.8883    0.8886       179

Wall time: 138 ms


In [12]:
%%time
cat = Pipeline(steps = [
    ('ord', OrdinalEncoder(cols = 'Pclass')),
    ('ohe', OneHotEncoder(use_cat_names = True))
])

num = Pipeline(steps = [
    ('imp', SimpleImputer(strategy = 'mean')),
    ('sc', StandardScaler()),
    #('pt', PowerTransformer())
    
])

prep = ColumnTransformer(transformers = [
    ('cat', cat, is_cat),
    ('num', num, ~is_cat)
])

pipe = Pipeline(steps = [
    ('prep', prep),
    #('pca', PCA(n_components = 10)),
    ('gb', GradientBoostingClassifier())  # previously: n_estimators = 50, learning_rate = 0.5, max_depth = 3, random_state = 123
])

params = {
    # 'gb__n_estimators': [50, 100],
    'gb__learning_rate': [0.1, 0.5, 1],
    'gb__max_depth': [1, 3, 5]
}


search = GridSearchCV(pipe, params, cv = StratifiedKFold(n_splits = 5), scoring = 'f1', n_jobs = -1)

Wall time: 998 µs


In [13]:
%%time
search.fit(X_train, y_train)
y_pred = search.predict(X_test)
y_pred_train = search.predict(X_train)

print(classification_report(y_train, y_pred_train, digits = 4))
print(classification_report(y_test, y_pred, digits = 4))

              precision    recall  f1-score   support

           0     0.8891    0.9402    0.9140       435
           1     0.8968    0.8159    0.8544       277

    accuracy                         0.8919       712
   macro avg     0.8930    0.8781    0.8842       712
weighted avg     0.8921    0.8919    0.8908       712

              precision    recall  f1-score   support

           0     0.8707    0.8860    0.8783       114
           1     0.7937    0.7692    0.7813        65

    accuracy                         0.8436       179
   macro avg     0.8322    0.8276    0.8298       179
weighted avg     0.8427    0.8436    0.8430       179

Wall time: 6.48 s
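After a grid search it is usually worth checking which combination won and how close the others were. The sketch below shows the `best_params_`, `best_score_` and `cv_results_` attributes of `GridSearchCV`. It runs on synthetic data with a bare `GradientBoostingClassifier` rather than the pipeline above, so the parameter names carry no `gb__` prefix and the scores are purely illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=123)

params = {"learning_rate": [0.1, 0.5, 1], "max_depth": [1, 3, 5]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=123),
    params,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV F1:", round(search.best_score_, 4))

# cv_results_ has one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
cols = ["param_learning_rate", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked combination is meaningfully better than the runners-up.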


In [14]:
import sklearn; sklearn.show_versions()


System:
    python: 3.7.7 (default, Apr 15 2020, 05:09:04) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\fata2810\AppData\Local\Continuum\anaconda3\envs\MLprojects\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 20.2.2
   setuptools: 49.6.0.post20200814
      sklearn: 0.23.1
        numpy: 1.19.1
        scipy: 1.5.2
       Cython: 0.29.17
       pandas: 1.1.2
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True


> See also: Extreme Gradient Boosting (XGBoost) and LightGBM.