## 3. Classification with Random Forests and Boosting

### a.)
Repeat Exercise 1 b with a random forest. Split the dataset into a training set and a test set again. Determine the error on the test data and compare it with the "out-of-bag" error.

In [5]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

rf_clf = RandomForestClassifier(oob_score=True)
rf_clf.fit(x_train, y_train)

# oob_score_ is an accuracy, so the out-of-bag error is its complement
print(f"Out of Bag Error: {1 - rf_clf.oob_score_:.4f}")
print(f"Test Error: {1 - rf_clf.score(x_test, y_test):.4f}")

Out of Bag Error: 0.0275
Test Error: 0.0200
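
The comparison asked for in the exercise can be made reproducible with a fixed seed. The following is a minimal sketch, not part of the original solution: `random_state=0` and the names `forest`, `oob_error` and `test_error` are assumptions chosen for illustration. Both quantities are reported as errors so that they can be compared directly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# random_state=0 is an arbitrary choice made here so that reruns give the same forest
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(x_train, y_train)

# oob_score_ is an accuracy estimated on the samples each tree did not see,
# so the out-of-bag error is its complement
oob_error = 1 - forest.oob_score_
test_error = 1 - forest.score(x_test, y_test)

print(f"OOB error:  {oob_error:.4f}")
print(f"Test error: {test_error:.4f}")
```

The out-of-bag estimate uses only the training data, so it can serve as a check on generalization when no separate test set is available.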


### b.)
Repeat Exercise 1 b with a boosting method. You can either use the implementation in sklearn or install xgboost (https://github.com/dmlc/xgboost) or catboost (https://catboost.ai/docs/).

In [6]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier()
gb_clf.fit(x_train, y_train)

print(f"Test Error: {1 - gb_clf.score(x_test, y_test):.4f}")

Test Error: 0.0289
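
Because boosting adds trees one after another, it can help to look at how the test error develops over the stages. The sketch below is an illustration rather than part of the original solution; it assumes a shortened run with `n_estimators=30` and `random_state=0`, and the names `staged_clf` and `staged_errors` are made up for this example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# n_estimators=30 and random_state=0 are illustrative choices to keep the run short
staged_clf = GradientBoostingClassifier(n_estimators=30, random_state=0)
staged_clf.fit(x_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting stage
staged_errors = [
    1 - accuracy_score(y_test, y_pred)
    for y_pred in staged_clf.staged_predict(x_test)
]

for stage in (1, 10, 30):
    print(f"Test error after {stage:2d} stages: {staged_errors[stage - 1]:.4f}")
```

A curve like this can indicate whether more estimators are likely to help before an expensive grid search is started.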


### c.)
Optional: Find the optimal hyperparameters of the algorithms using the grid search described in 2 d.

##### RandomForest Tuning

In [7]:
import numpy as np

rf_param_grid = {
    "n_estimators": [100 * i for i in range(1, 3)],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy", "log_loss"]
}
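
A grid search evaluates every combination of the listed values. As a small sketch of what that means for this grid, `ParameterGrid` from sklearn can enumerate the combinations; the names `candidates` and `n_folds` are introduced here only for illustration, and `n_folds = 3` mirrors the `cv=3` setting used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

rf_param_grid = {
    "n_estimators": [100, 200],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy", "log_loss"],
}

# Every combination of the listed values is one candidate: 2 * 2 * 3
candidates = list(ParameterGrid(rf_param_grid))

# With 3-fold cross-validation each candidate is fitted three times
n_folds = 3
print(f"{len(candidates)} candidates, {len(candidates) * n_folds} fits")
```

The number of fits grows multiplicatively with every added parameter, which is why the grids in this notebook are kept small.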

In [8]:
from sklearn.model_selection import GridSearchCV

grid_cv = GridSearchCV(RandomForestClassifier(), rf_param_grid, verbose=2, cv=3)
grid_cv.fit(x_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.2s
[CV] END criterion=gini, max_features=sqrt, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.3s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_features=sqrt, n_estimators=200; total time=   0.3s
[CV] END criterion=gini, max_features=log2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=log2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=log2, n_estimators=100; total time=   0.1s
[CV] END criterion=gini, max_features=log2, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_features=log2, n_estimators=200; total time=   0.4s
[CV] END criterion=gini, max_features=log2, n_es

In [9]:
grid_cv.best_params_

{'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 100}

In [10]:
grid_cv.score(x_test, y_test)

0.9755555555555555
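
Note that `score` on a fitted `GridSearchCV` with a classifier returns an accuracy by default, not an error. The sketch below shows one way to report the tuned model in the same terms as in part a.). It uses a reduced, hypothetical grid (`small_grid`) and `random_state=0` so that it runs quickly; it is not the search performed above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# Reduced, illustrative grid (not the grid used above) so the sketch runs quickly
small_grid = {"n_estimators": [25, 50], "max_features": ["sqrt", "log2"]}

small_search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
small_search.fit(x_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best candidate;
# score() evaluates the refitted best estimator on the given data
cv_error = 1 - small_search.best_score_
tuned_test_error = 1 - small_search.score(x_test, y_test)

print(f"Best parameters: {small_search.best_params_}")
print(f"CV error:   {cv_error:.4f}")
print(f"Test error: {tuned_test_error:.4f}")
```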

##### GradientBoosting Tuning

In [16]:
gb_params = {
    # "learning_rate": np.logspace(-5, -1, 10),  # takes too long :(
    "n_estimators": [100, 200],
    "max_depth": [1, 2, 3]
}

In [17]:
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_params, cv=3, verbose=2)
gb_grid.fit(x_train, y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] END ......................max_depth=1, n_estimators=100; total time=   1.9s
[CV] END ......................max_depth=1, n_estimators=100; total time=   1.6s
[CV] END ......................max_depth=1, n_estimators=100; total time=   1.6s
[CV] END ......................max_depth=1, n_estimators=200; total time=   3.8s
[CV] END ......................max_depth=1, n_estimators=200; total time=   4.0s
[CV] END ......................max_depth=1, n_estimators=200; total time=   5.2s
[CV] END ......................max_depth=2, n_estimators=100; total time=   3.4s
[CV] END ......................max_depth=2, n_estimators=100; total time=   2.9s
[CV] END ......................max_depth=2, n_estimators=100; total time=   3.0s
[CV] END ......................max_depth=2, n_estimators=200; total time=   6.6s
[CV] END ......................max_depth=2, n_estimators=200; total time=   5.4s
[CV] END ......................max_depth=2, n_est

In [19]:
gb_grid.best_params_

{'max_depth': 2, 'n_estimators': 200}

In [18]:
gb_grid.score(x_test, y_test)

0.9733333333333334