## Assignment 3 ##

Welcome to your third assignment, which aims to help you understand ensemble 
models.

**1** Load the breast cancer dataset bundled with sklearn and split it into 
training data (X_train, y_train) and test data (X_test, y_test) with a 70%/30% 
split ratio, using the train_test_split function with random_state set to 0. 
The dataset concerns the diagnosis of breast cancer based on variables computed 
from a digitized image of a fine needle aspirate (FNA) of a breast mass sample. 

In [86]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                            random_state=0)

In [87]:
"""Test that the dataset is read and split correctly"""
assert round(X_train[0][8], 5) == 0.1779
assert round(X_test[0][8], 5) == 0.2116
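
As a quick sanity check on the split sizes, the arithmetic can be sketched without sklearn. This is a minimal sketch under two assumptions that are not stated in the assignment text: that the dataset has 569 samples, and that train_test_split rounds a fractional test size up and gives the remaining samples to the training set. The helper name split_sizes is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    receives the remaining samples.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# Assumed dataset size: 569 samples
n_train, n_test = split_sizes(569, 0.3)
print(n_train, n_test)  # prints: 398 171
```

Under these assumptions the test set holds 171 examples, so each misclassified test example changes the accuracy by roughly 0.0058.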

**2** Implement a deterministic version of the Random Subspaces method, which 
builds as many models as there are input variables, each ignoring a different 
input variable. For example, the first model ignores the first variable, the 
second model ignores the second variable, and so on. Use the clone function 
from sklearn.base to create a copy of the base model in each iteration.
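
The leave-one-variable-out scheme can be illustrated without any library. The sketch below uses two hypothetical helpers, leave_one_out_columns and drop_column, on a toy dataset with 3 input variables. It only shows which columns each model would see, not the model fitting itself.

```python
def leave_one_out_columns(n_features):
    """For each model index i, list the column indices it keeps (all but i)."""
    return [[c for c in range(n_features) if c != i] for i in range(n_features)]

def drop_column(rows, i):
    """Return a copy of rows with column i removed."""
    return [[v for c, v in enumerate(row) if c != i] for row in rows]

# Toy data with 3 input variables (hypothetical values)
X_toy = [[1, 2, 3],
         [4, 5, 6]]

for i, kept in enumerate(leave_one_out_columns(3)):
    print(f"model {i} keeps columns {kept}: {drop_column(X_toy, i)}")
```

The same column has to be dropped again at prediction time, since model i was fitted without variable i.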

In [88]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import clone

class RandomSubspaceDet:
    def __init__(self, estimator=DecisionTreeClassifier()):
        self.estimator = estimator
        self.estimators = []

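    def fit(self, X_train, y_train):
        # Start from an empty ensemble so that refitting does not
        # accumulate models from a previous call
        self.estimators = []
        n_features = X_train.shape[1]

        for i in range(n_features):
            # Fresh, unfitted copy of the base model
            estimator = clone(self.estimator)
            # Model i is trained without input variable i
            X_train_reduced = np.delete(X_train, i, axis=1)
            estimator.fit(X_train_reduced, y_train)
            self.estimators.append(estimator)

        return self

    def predict(self, X):
        n_samples = X.shape[0]
        n_models = len(self.estimators)
        predictions = np.zeros((n_samples, n_models))

        for j, estimator in enumerate(self.estimators):
            # Drop the same variable that model j ignored during training
            X_reduced = np.delete(X, j, axis=1)
            predictions[:, j] = estimator.predict(X_reduced)

        # Majority vote for binary 0/1 labels: round the mean prediction
        return np.round(np.mean(predictions, axis=1))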
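
The last line of predict turns the individual 0/1 predictions into a majority vote by rounding their mean. A minimal sketch of that idea in plain Python follows. The name majority_vote is hypothetical, and the sketch relies on Python's built-in round, which (like np.round) rounds an exact 0.5 to the nearest even integer, so a tie resolves to class 0.

```python
def majority_vote(votes):
    """Majority vote over binary (0/1) predictions via the rounded mean."""
    return round(sum(votes) / len(votes))

print(majority_vote([1, 1, 1, 0]))  # mean 0.75 -> 1
print(majority_vote([1, 0, 0, 0]))  # mean 0.25 -> 0
print(majority_vote([1, 1, 0, 0]))  # mean 0.5, a tie -> 0
```

This shortcut only works for binary labels encoded as 0 and 1, and an ensemble with an even number of models can produce ties.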

In [89]:
"""Test that RandomSubspaceDet is implemented correctly"""
from sklearn.metrics import accuracy_score

rs = RandomSubspaceDet(estimator=DecisionTreeClassifier(random_state=1))
rs.fit(X_train, y_train)
assert round(accuracy_score(y_test, rs.predict(X_test)), 4) == 0.9006

**3** Implement the AdaBoost method as presented in the lesson. 
Use the clone function from sklearn.base to create a copy of the base model in 
each iteration. Utilize the sample_weight parameter of the base model's fit 
method to set the weights of the training examples. 
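
To make the reweighting step concrete, here is a single boosting round traced in plain Python. It is a minimal sketch that assumes the formulation in which the model weight is the log of (1 - error) / error and the weights of misclassified examples are multiplied by the exponential of that weight before renormalizing. The lesson's exact notation may differ. The function name adaboost_round and the toy data are hypothetical.

```python
import math

def adaboost_round(weights, misses):
    """One AdaBoost reweighting step.

    weights: current example weights
    misses:  booleans, True where the weak model misclassified the example
    Returns (new_weights, model_weight, error).
    """
    total = sum(weights)
    # Weighted error: share of the total weight on misclassified examples
    error = sum(w for w, m in zip(weights, misses) if m) / total
    model_weight = math.log((1 - error) / error)
    # Boost the weight of misclassified examples only
    boosted = [w * math.exp(model_weight) if m else w
               for w, m in zip(weights, misses)]
    norm = sum(boosted)
    return [w / norm for w in boosted], model_weight, error

# Five examples with uniform weights; the model misses only the last one
weights = [0.2] * 5
misses = [False, False, False, False, True]
new_weights, alpha, err = adaboost_round(weights, misses)
print(err, alpha, new_weights)
```

In this toy round the error is 0.2 and the model weight is ln 4, about 1.386. The missed example's weight rises from 0.2 to about 0.5, while the other four drop to about 0.125 each.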

In [90]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import clone

class AdaBoost:
    def __init__(self, n_estimators=20, 
                estimator=DecisionTreeClassifier(max_depth=1)):
        self.n_estimators = n_estimators
        self.estimator = estimator
        self.estimator_weights = np.zeros(self.n_estimators)
        self.estimator_errors = np.ones(self.n_estimators)
        self.estimators = []

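    def fit(self, X_train, y_train):
        # Reset state so that refitting does not accumulate old models
        self.estimators = []
        self.estimator_weights = np.zeros(self.n_estimators)
        self.estimator_errors = np.ones(self.n_estimators)

        self.classes = np.unique(y_train)
        self.n_classes = len(self.classes)

        m = X_train.shape[0]
        # Start with uniform example weights
        weight = np.ones(m) / m

        def boost(X, y, weight):
            # Fresh, unfitted copy of the base model
            estimator = clone(self.estimator)
            estimator.fit(X, y, sample_weight=weight)
            y_pred = estimator.predict(X)

            # Boolean mask of misclassified training examples
            misses = y_pred != y

            # Weighted training error of this model
            estimator_error = np.dot(weight, misses) / np.sum(weight)
            # Keep the error away from 0 and 1 to avoid division by zero
            # and log(0) when a model is perfect or always wrong
            eps = np.finfo(float).eps
            clipped_error = np.clip(estimator_error, eps, 1 - eps)
            estimator_weight = np.log((1 - clipped_error) / clipped_error)

            # Increase the weights of misclassified examples, then renormalize
            weight *= np.exp(estimator_weight * misses)
            weight /= np.sum(weight)

            self.estimators.append(estimator)

            return weight, estimator_weight, estimator_error

        for t in range(self.n_estimators):
            weight, estimator_weight, estimator_error = boost(X_train, 
                                                            y_train, weight)
            self.estimator_errors[t] = estimator_error
            self.estimator_weights[t] = estimator_weight

        return self

    def predict(self, X):
        # Column vector of class labels, for broadcasting against predictions
        C = self.classes[:, np.newaxis]

        # For each example, add each model's weight to the class it predicts
        predictions = sum((estimator.predict(X) == C).T * w
                            for estimator, w in zip(self.estimators,
                                                self.estimator_weights))
        predictions /= self.estimator_weights.sum()

        # Return the class with the largest weighted vote
        return self.classes.take(np.argmax(predictions, axis=1))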
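
The weighted vote in predict can be traced for a single example in plain Python. The sketch below uses a hypothetical helper, weighted_vote, with made-up model weights. It shows that one strong model can outvote two weak ones.

```python
def weighted_vote(predictions, model_weights, classes):
    """Return the class with the largest total model weight.

    predictions:   one predicted label per model, for a single example
    model_weights: one weight per model
    classes:       the possible labels
    """
    scores = {c: 0.0 for c in classes}
    for label, w in zip(predictions, model_weights):
        scores[label] += w
    # max scans classes in the given order, so ties go to the first class
    return max(classes, key=lambda c: scores[c])

# Hypothetical weights: the first model is much more reliable
print(weighted_vote([1, 0, 0], [2.0, 0.5, 0.5], classes=[0, 1]))  # prints: 1
```

Dividing the scores by the sum of the model weights, as predict does, rescales them but does not change the winning class as long as that sum is positive.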

In [91]:
"""Test that AdaBoost is implemented correctly"""
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

ab = AdaBoost(n_estimators=20, estimator=DecisionTreeClassifier(max_depth=1, 
            random_state=1))
ab.fit(X_train, y_train)
assert round(accuracy_score(y_test, ab.predict(X_test)), 4) == 0.9591


In [92]:
# Same result with the sklearn class as well
ab = AdaBoostClassifier(n_estimators=20, algorithm="SAMME", 
                        estimator=DecisionTreeClassifier(max_depth=1, 
                                                        random_state=1))
ab.fit(X_train, y_train)
assert round(accuracy_score(y_test, ab.predict(X_test)), 4) == 0.9591