<a href="https://colab.research.google.com/github/albheim/frtn65_notebooks/blob/master/ex6_boosting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# Download dataset from url, save it to sonar.csv
!wget -O sonar.csv https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data 

In [None]:
data = pd.read_csv("sonar.csv", header=None)
data[60] = (data[60] == 'M').astype(int)  # Replace labels M and R with 1 and 0 (integer dtype)

In [None]:
feature_selection = range(0, 60) # All features
#feature_selection = [1, 2, 3, 4, 5] # Only a few features

# Split data into features and labels (labels were already converted to R->0 and M->1 above)
X, y = data.iloc[:, feature_selection].values, data.iloc[:, 60].values

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Write your own boosting loop**

You should now attempt to write your own boosting classifier, using `DecisionTreeClassifier` as the base classifier in an ensemble.

The final ensemble makes its prediction by weighted voting, where each classifier's vote is weighted by its accuracy.

The steps needed are:

1. **Choose a weak classifier as base**   
2. Initialize weights for each sample
3. **For T rounds**
    1. Normalize the weights
    2. For each available feature, train a classifier using only that feature and evaluate its weighted training error
    3. Choose the classifier with the lowest error
    4. **Update the weights of the training samples: increase if classified wrongly by this classifier, decrease if correctly**
4. Form the final strong classifier as a linear combination of the T classifiers, with coefficients that reflect how good each one is as an individual classifier
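
As an illustration of step 4, here is a minimal sketch of a weighted vote over 0/1 predictions. The prediction matrix `preds` and the coefficients `coeffs` are made-up numbers for this example, not results from the sonar data.

```python
import numpy as np

# Hypothetical 0/1 predictions: one row per classifier, one column per sample
preds = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
])

# Hypothetical coefficients, e.g. the accuracy of each classifier
coeffs = np.array([0.9, 0.6, 0.7])

# Weighted average of the votes for each sample, a value between 0 and 1
vote = coeffs @ preds / coeffs.sum()

# Predict class 1 when more than half of the total weight votes for it
y_ens = (vote > 0.5).astype(int)
print(y_ens)
```

Thresholding the weighted average at 0.5 is one reasonable choice for binary labels; ties are resolved towards class 0 here.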

If we use a `DecisionTreeClassifier` with `max_features=None`, meaning that all features are considered at each split, we get steps 3.2 and 3.3 for free.

Training using weights can be done using the keyword `sample_weight` when calling `fit` in sklearn.
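
A small sketch of how `sample_weight` influences the fit and how a weighted training error can be computed. The arrays `X_demo`, `y_demo` and `w_demo` are synthetic stand-ins chosen for this example, not the sonar data, and `max_depth=1` is just one possible setting for a weak base classifier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 6 samples with a single feature
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 1, 0, 1, 1])

# Put most of the weight on the sample at x = 3 (label 0)
w_demo = np.array([0.1, 0.1, 0.1, 0.5, 0.1, 0.1])

# A depth-1 tree (decision stump) trained with the sample weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo, sample_weight=w_demo)

# Weighted training error: share of the total weight that is misclassified
miss = stump.predict(X_demo) != y_demo
weighted_error = np.sum(w_demo * miss) / np.sum(w_demo)
print("weighted training error:", weighted_error)
```

Because the heavily weighted sample dominates the split criterion, the stump should prefer a split that classifies it correctly, even if a lightly weighted sample is misclassified as a result.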

Your task is to try out different settings for the base classifier and different numbers of iterations. You should also implement an update for the weight vector: if the prediction for sample $i$ is correct, $w_i$ should decrease, and if it is wrong, $w_i$ should increase.
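
One possible update, shown only as a sketch, is the exponential reweighting used in AdaBoost-style algorithms. The helper name `update_weights` and the clipping constant are assumptions made for this example; other update rules that increase the weight of misclassified samples would also satisfy the task.

```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """Illustrative AdaBoost-style reweighting step (hypothetical helper)."""
    w = w / np.sum(w)                      # work with normalized weights
    miss = y_pred != y_true                # True where the classifier was wrong
    err = np.sum(w[miss])                  # weighted training error
    # Clip to avoid division by zero or log(0) for a perfect classifier
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # positive when err < 0.5
    # Scale up misclassified samples, scale down correctly classified ones
    w_new = w * np.exp(np.where(miss, alpha, -alpha))
    return w_new / np.sum(w_new)

# Four equally weighted samples, one of them misclassified
w0 = np.ones(4)
y_true_demo = np.array([0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0])
w1 = update_weights(w0, y_true_demo, y_pred_demo)
print(w1)
```

Note that the weights only move in the intended direction when the weighted error is below 0.5. A fully grown tree typically fits the training data perfectly, in which case no weights change after normalization, so a shallow tree such as `max_depth=1` makes the effect of the update easier to see.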

In [None]:
classifiers = []
coeff = []

# Initially all samples are weighted the same
w = np.ones(len(y_train))

n_iterations =  # Try out different numbers 

for i in range(1, n_iterations+1):
    clf = DecisionTreeClassifier(
        # Try different parameters to the classifier
    )

    w = w / np.sum(w) # Normalize weights
    clf.fit(X_train, y_train, sample_weight=w)

    w =  # Use suitable update based on correct/incorrect classifications on training data

    classifiers.append(clf)
    coeff.append(clf.score(X_train, y_train)) # We use training accuracy as ensemble coefficient

    # Find the ensemble prediction and evaluate the accuracy after adding each new classifier
    # (j indexes the classifiers trained so far, to avoid confusion with the loop variable i)
    y_hat = sum(coeff[j] * classifiers[j].predict(X_test) for j in range(len(coeff))) / sum(coeff)
    score = np.mean(np.round(y_hat) == y_test)
    print("{} classifiers: test accuracy {}".format(i, score))

**Compare to `AdaBoostClassifier`**

We evaluate an `AdaBoostClassifier` in a similar fashion and print the test accuracy for each number of classifiers, as in the previous cell. Do you get similar performance from both?

In [None]:
for i in range(1, n_iterations+1):
    clf = AdaBoostClassifier(n_estimators=i, learning_rate=1.0)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{} classifiers: test accuracy {}".format(i, score))
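
Refitting the model for every value of `n_estimators` repeats a lot of work. As an alternative sketch, `AdaBoostClassifier.staged_score` yields the score after each boosting stage from a single fit. The synthetic data from `make_classification` and the variable names below are assumptions for this example, not the sonar data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sonar data (hypothetical example data)
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Fit once with the largest number of estimators of interest
ada = AdaBoostClassifier(n_estimators=10, random_state=0)
ada.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting stage
staged_scores = list(ada.staged_score(X_te, y_te))
for n, s in enumerate(staged_scores, start=1):
    print("{} classifiers: test accuracy {}".format(n, s))
```

Boosting may stop early if an estimator fits the training data perfectly, so the number of stages can be smaller than `n_estimators`.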