## 3. Classification with Random Forests and Boosting

### a.)
Repeat Exercise 1 b with a random forest. Split the dataset into a training set and a test set again. Determine the error on the test data and compare it with the "out-of-bag" error.

In [1]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

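# oob_score=True is required so that the fitted forest exposes oob_score_.
# (oob_score without the trailing underscore is only the constructor flag.)
rf_clf = RandomForestClassifier(oob_score=True)
rf_clf.fit(x_train, y_train)

# oob_score_ and score() are accuracies, so the error is 1 - accuracy.
print(f"Out-of-bag error: {1 - rf_clf.oob_score_:.4f}")
print(f"Test error: {1 - rf_clf.score(x_test, y_test):.4f}")

Out-of-bag error: (not recorded in this run)
Test error: 0.0222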
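
The out-of-bag error is estimated from the training samples that a given tree never saw in its bootstrap sample. The following is a minimal sketch of why such samples always exist: it simulates bootstrap draws with NumPy and measures the fraction of samples left out. The sample count 1347 assumes the default 75/25 split of the 1797 digits samples; the number of repetitions and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1347  # assumed training-set size (default 75/25 split of load_digits)

oob_fractions = []
for _ in range(200):
    # A bootstrap sample draws n_samples indices with replacement.
    drawn = rng.integers(0, n_samples, size=n_samples)
    # Samples that were never drawn are "out of bag" for this tree.
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[drawn] = False
    oob_fractions.append(oob_mask.mean())

mean_oob_fraction = float(np.mean(oob_fractions))
print(f"Mean out-of-bag fraction: {mean_oob_fraction:.3f}")
print(f"Theoretical limit 1/e:    {np.exp(-1):.3f}")
```

Roughly a third of the training data is unseen by each tree, which is why the out-of-bag error can serve as a built-in validation estimate and is expected to be close to the test error.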


### b.)
Repeat Exercise 1 b with a boosting method. You can either use the implementation in sklearn or install xgboost (https://github.com/dmlc/xgboost) or catboost (https://catboost.ai/docs/).

In [2]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier()
gb_clf.fit(x_train, y_train)

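# oob_improvement_ is only available when subsample < 1.0, and it is an
# array with one entry per boosting stage, so it is not printed here.
# score() returns the accuracy, so the error is 1 - accuracy.
print(f"Test error: {1 - gb_clf.score(x_test, y_test):.4f}")

Test error: 0.0289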
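
Because boosting adds trees sequentially, it can be instructive to watch the test error as stages are added. The sketch below uses `staged_predict` from sklearn's `GradientBoostingClassifier` for this. The settings `n_estimators=30`, `max_depth=2`, and `random_state=0` are assumptions chosen only to keep the run short; they are not the configuration used above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Small, illustrative settings (assumptions) to keep the run short.
staged_clf = GradientBoostingClassifier(n_estimators=30, max_depth=2, random_state=0)
staged_clf.fit(x_train, y_train)

# staged_predict yields the test predictions after each boosting stage.
staged_errors = [
    float(np.mean(y_pred != y_test)) for y_pred in staged_clf.staged_predict(x_test)
]

print(f"Test error after stage 1:  {staged_errors[0]:.4f}")
print(f"Test error after stage {len(staged_errors)}: {staged_errors[-1]:.4f}")
```

Plotting `staged_errors` against the stage index would show whether more estimators still help or whether the error has already flattened out.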


### c.)
Optional: Find the optimal hyperparameters of the algorithms using the grid search described in 2 d.

##### RandomForest Tuning

In [3]:
import numpy as np

rf_param_grid = {
    "n_estimators": [100 * i for i in range(1, 4)],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy", "log_loss"]
}

In [5]:
from sklearn.model_selection import GridSearchCV

grid_cv = GridSearchCV(RandomForestClassifier(), rf_param_grid, verbose=2, cv=3)
grid_cv.fit(x_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.5s
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.7s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.9s
[CV] END criterion=gini, max_features=sqrt, n_estimators=300; total time=   0.7s
[CV] END criterion=gini, max_features=sqrt, n_estimators=300; total time=   0.7s
[CV] END criterion=gini, max_features=sqrt, n_estimators=300; total time=   1.4s
[CV] END criterion=gini, max_features=log2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=log2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=log2, n_es

In [6]:
grid_cv.score(x_test, y_test)

0.9777777777777777
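
The test score alone does not reveal which parameter combination was selected. As a hedged sketch, the attributes `best_params_`, `best_score_`, and `cv_results_` of a fitted `GridSearchCV` can be inspected as follows. The grid here is deliberately smaller than the one above and `random_state=0` is an assumption, so the numbers will not match the recorded run.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Reduced grid (assumption) so the example finishes quickly.
small_grid = {"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(x_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")

# cv_results_ holds one entry per candidate in the grid.
mean_scores = search.cv_results_["mean_test_score"]
for params, mean_score in zip(search.cv_results_["params"], mean_scores):
    print(params, f"{mean_score:.4f}")

# score() uses the best estimator, refit on the full training set.
print(f"Test accuracy: {search.score(x_test, y_test):.4f}")
```

Note that `best_score_` is a cross-validated accuracy on the training data, whereas `score(x_test, y_test)` evaluates the refit best model on the held-out test set.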

##### GradientBoosting Tuning

In [7]:
gb_params = {
    "loss": ["log_loss"],
    "learning_rate": np.logspace(-7, 1, 10),
    "criterion": ["friedman_mse", "squared_error"]
}

In [8]:
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_params, cv=3, verbose=2)
gb_grid.fit(x_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] END criterion=friedman_mse, learning_rate=1e-07, loss=log_loss; total time=   5.6s
[CV] END criterion=friedman_mse, learning_rate=1e-07, loss=log_loss; total time=   4.6s
[CV] END criterion=friedman_mse, learning_rate=1e-07, loss=log_loss; total time=   4.5s
[CV] END criterion=friedman_mse, learning_rate=7.742636826811278e-07, loss=log_loss; total time=   5.2s
[CV] END criterion=friedman_mse, learning_rate=7.742636826811278e-07, loss=log_loss; total time=   3.8s
[CV] END criterion=friedman_mse, learning_rate=7.742636826811278e-07, loss=log_loss; total time=   5.2s
[CV] END criterion=friedman_mse, learning_rate=5.994842503189409e-06, loss=log_loss; total time=   4.4s
[CV] END criterion=friedman_mse, learning_rate=5.994842503189409e-06, loss=log_loss; total time=   4.7s
[CV] END criterion=friedman_mse, learning_rate=5.994842503189409e-06, loss=log_loss; total time=   5.7s
[CV] END criterion=friedman_mse, learning_rate=4.64

In [9]:
gb_grid.score(x_test, y_test)

0.9733333333333334
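
The learning-rate grid `np.logspace(-7, 1, 10)` spans eight decades. The short sketch below lists the candidates and counts how many fall into the extremes; the thresholds `1e-3` and `1` are illustrative assumptions, not values from the exercise.

```python
import numpy as np

# Same grid as in gb_params: 10 values, log-spaced from 1e-7 to 1e1.
learning_rates = np.logspace(-7, 1, 10)

for lr in learning_rates:
    print(f"{lr:.3e}")

# Illustrative thresholds (assumptions): very small rates barely update the
# model within the default 100 stages, and rates above 1 overshoot each step.
n_tiny = int((learning_rates < 1e-3).sum())
n_large = int((learning_rates > 1).sum())
print(f"Candidates below 1e-3: {n_tiny}")
print(f"Candidates above 1:    {n_large}")
```

If most of the search budget goes to such extreme values, a narrower range such as `np.logspace(-3, 0, 10)` may be a more efficient use of the same number of fits, although that choice is a suggestion and not part of the exercise.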