## Ensembles

Ensembles are sets of estimators. The general idea is that the opinion of a group is usually more reliable than the opinion of a single individual. As a didactic comparison, it is as if the model made its decision based on the opinions of several experts.
<br>
Formally, several base estimators are combined into a set of estimators, and that set carries out the learning task.
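The "group of experts" idea can be sketched with a simple majority vote. The example below is only an illustration: the choice of `VotingClassifier` and of the three base estimators is an assumption made for this sketch and is not one of the ensembles compared later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated with the same arguments as in the first cell below.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42)

# Three different base estimators play the role of the "experts"
# (an illustrative choice, not taken from the original notebook).
experts = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Hard voting: the ensemble predicts the class chosen by most experts.
committee = VotingClassifier(estimators=experts, voting="hard")

ensemble_scores = cross_val_score(committee, X, y, cv=5)
print("committee: %.2f [+/- %.2f]" % (ensemble_scores.mean(), ensemble_scores.std()))

# Score each expert on its own for comparison.
for expert_name, expert in experts:
    scores = cross_val_score(expert, X, y, cv=5)
    print("%s: %.2f [+/- %.2f]" % (expert_name, scores.mean(), scores.std()))
```

Whether the committee beats its best member depends on the data and on how different the experts' mistakes are, so the printed numbers should be read as a comparison, not as a guarantee.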

**Advantages:**

* In general, higher accuracy
    * Can turn *weak learners* into a *strong learner*
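As a rough illustration of the weak-to-strong point, the sketch below compares a single decision stump with AdaBoost built on the same kind of stump. Both AdaBoost and the `make_hastie_10_2` dataset are assumptions chosen for this example; neither appears elsewhere in this notebook.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic dataset that a single axis-aligned split
# cannot separate well, so the stump is a genuinely weak learner.
X, y = make_hastie_10_2(n_samples=2000, random_state=42)

# Weak learner: a decision stump (a tree with a single split).
stump = DecisionTreeClassifier(max_depth=1, random_state=42)

# AdaBoost fits stumps one after another, giving more weight to the
# samples that the previous stumps misclassified.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200, random_state=42)

stump_scores = cross_val_score(stump, X, y, cv=5)
boosted_scores = cross_val_score(boosted, X, y, cv=5)

print("single stump: %.2f [+/- %.2f]" % (stump_scores.mean(), stump_scores.std()))
print("boosted stumps: %.2f [+/- %.2f]" % (boosted_scores.mean(), boosted_scores.std()))
```

The base estimator is passed positionally because its keyword name differs between scikit-learn releases.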

**Disadvantages:**

* Loss of model interpretability

In [1]:
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from xgboost.sklearn import XGBClassifier

# Synthetic binary classification problem: 20 features, of which 2 are
# informative and 10 are linear combinations of the informative ones.
X, y = datasets.make_classification(n_samples=1000, n_features=20,
                                   n_informative=2, n_redundant=10,
                                   random_state=42)

# Hold out 30% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                   random_state=42)

# Four tree-based ensembles, all with their default hyperparameters.
classifiers = [XGBClassifier(), ExtraTreesClassifier(), RandomForestClassifier(), GradientBoostingClassifier()]
names = ['Xgboost', 'Extremely Randomized Trees', 'Random Forest', 'Gradient Tree Boosting']

In [2]:
def train_and_test(clf, name):
    # 10-fold cross-validation on the training split; reports mean and std of the accuracy
    scores = cross_val_score(clf, X_train, y_train, cv=10)
    print('Accuracy of %s = %.2f [+/- %.2f]' % (name, scores.mean(), scores.std()))

for name, clf in zip(names, classifiers):
    train_and_test(clf, name)

Accuracy of Xgboost = 0.91 [+/- 0.02]
Accuracy of Extremely Randomized Trees = 0.91 [+/- 0.03]
Accuracy of Random Forest = 0.91 [+/- 0.02]
Accuracy of Gradient Tree Boosting = 0.91 [+/- 0.02]
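The cross-validation above only touches the training split, so the held-out `X_test` and `y_test` are never scored. A minimal sketch of that final check follows, using only Random Forest so that it does not depend on `xgboost`. The `random_state` on the forest and the use of `accuracy_score` are additions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split as in the first cell.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# random_state is an addition here so that the forest is reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Accuracy on samples the model has never seen during training.
test_accuracy = accuracy_score(y_test, forest.predict(X_test))
print("Held-out accuracy of Random Forest = %.2f" % test_accuracy)
```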


## Hyperparameter tuning for Ensembles

In [3]:
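import scipy.stats as st

# Defined here but not used in the search space below.
one_to_left = st.beta(10, 1)
from_zero_positive = st.expon(0, 50)

# Search space: lists are sampled uniformly, scipy.stats distributions
# are sampled through their rvs() method.
params = {
    "n_estimators": [100, 300, 500],
    "max_depth": st.randint(3, 40),          # integers from 3 to 39
    "learning_rate": st.uniform(0.05, 0.4),  # (loc, scale): values in [0.05, 0.45]
    "gamma": st.uniform(0, 10),              # values in [0, 10]
}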

xgb = XGBClassifier() 
xgb

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
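The `scipy.stats` arguments in the search space are easy to misread: `st.uniform(loc, scale)` covers `[loc, loc + scale]`, and `st.randint(low, high)` excludes `high`. The short sketch below draws samples to show the ranges involved; the variable names and the sample size are chosen only for this illustration.

```python
import scipy.stats as st

# The same distributions used in the search space above.
max_depth_dist = st.randint(3, 40)          # integers from 3 to 39 (40 is excluded)
learning_rate_dist = st.uniform(0.05, 0.4)  # loc=0.05, scale=0.4 -> [0.05, 0.45]
gamma_dist = st.uniform(0, 10)              # loc=0, scale=10 -> [0, 10]

# Draw 1000 values from each distribution; a randomized search draws
# one value per parameter for every candidate it tries.
depths = max_depth_dist.rvs(size=1000, random_state=0)
rates = learning_rate_dist.rvs(size=1000, random_state=0)
gammas = gamma_dist.rvs(size=1000, random_state=0)

print("max_depth range:", depths.min(), depths.max())
print("learning_rate range: %.3f to %.3f" % (rates.min(), rates.max()))
print("gamma range: %.3f to %.3f" % (gammas.min(), gammas.max()))
```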

In [4]:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(xgb, params, n_jobs=-1, cv=10, n_iter=10)  
rs.fit(X_train, y_train) 
print(rs.best_params_)
print(rs.best_estimator_)
print(rs.best_score_)

{'gamma': 8.9254628154897002, 'learning_rate': 0.34790503714318155, 'max_depth': 16, 'n_estimators': 100}
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=8.9254628154897002,
       learning_rate=0.34790503714318155, max_delta_step=0, max_depth=16,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
0.917142857143
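The search above is not seeded, so repeated runs can return different best parameters, and `best_score_` is a cross-validation score, not a held-out one. The sketch below shows one possible way to address both. It swaps in `GradientBoostingClassifier` to stay within scikit-learn, and the smaller search space, `n_iter=5`, `cv=3` and the `random_state` values are all assumptions made to keep the example quick.

```python
import scipy.stats as st
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Same data and split as in the first cell.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Assumption: a reduced search space adapted to GradientBoostingClassifier.
search_space = {
    "n_estimators": [50, 100],
    "max_depth": st.randint(2, 6),           # integers from 2 to 5
    "learning_rate": st.uniform(0.05, 0.4),  # values in [0.05, 0.45]
}

# random_state makes the sampled candidates repeatable across runs.
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                            search_space, n_iter=5, cv=3, random_state=42)
search.fit(X_train, y_train)

print(search.best_params_)
print("Cross-validation accuracy: %.3f" % search.best_score_)

# With the default refit=True, the best configuration is retrained on all
# of X_train, and score() evaluates that refitted model.
print("Held-out accuracy: %.3f" % search.score(X_test, y_test))
```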


## Study materials:

* [Gradient Tree Boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [Extremely randomized trees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
* [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [XGBoost](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/)

**Papers for the algorithms cited:**

* Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. **"Boosting and additive trees."** The Elements of Statistical Learning. Springer, 2001. 299-345.
* Geurts, Pierre, Damien Ernst, and Louis Wehenkel. **"Extremely randomized trees."** Machine learning 63.1 (2006): 3-42.
* Breiman, Leo. **"Random forests."** Machine learning 45.1 (2001): 5-32.
* Chen, Tianqi, and Carlos Guestrin. **"XGBoost: A scalable tree boosting system."** Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
    