# Assignment 6.2 - More Trees

Please submit your solution to this notebook in the Whiteboard under the corresponding assignment entry, both as an .ipynb file and as a .pdf. <br><br>
Please do **NOT** rename the file!

#### State both names of your group members here:
[Jane and John Doe]

In [1]:
# Paola Gega, Daniel Thompson

---

## Grading Info/Details - Assignment 6.2:

The assignment will be graded semi-automatically, which means that your code will be tested against a set of predefined test cases and qualitatively assessed by a human. This will speed up the grading process for us.

* For passing the test scripts: 
    - Please make sure to **NOT** alter predefined class or function names, as this would cause the test scripts to fail.
    - Please do **NOT** rename the files before uploading to the Whiteboard!

* **(RESULT)** tags indicate checkpoints that will be specifically assessed by a human.

* You will pass the assignment if you pass the majority of test cases and we can at least confirm effort regarding the **(RESULT)**-tagged checkpoints per task.

---

## Task 6.2.1 - Classification Trees

* Implement the Classification Tree Class from scratch using only `NumPy`. The splitting criterion should be Gini impurity. **(RESULT)**
* Run your implementation on the `IRIS` classification dataset. **(RESULT)**
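As a quick sanity check of the splitting criterion, the Gini impurity of a node with class proportions $p_k$ is $1 - \sum_k p_k^2$, and a split is scored by the sample-weighted mean of its two children. The following is only a minimal sketch; the names `gini_impurity` and `split_impurity` are hypothetical helpers chosen for illustration and are not part of the predefined classes.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left, right):
    """Sample-weighted mean of the two children's Gini impurities."""
    n_left, n_right = len(left), len(right)
    return (n_left * gini_impurity(left)
            + n_right * gini_impurity(right)) / (n_left + n_right)

labels = np.array([0, 0, 1, 1])
parent = gini_impurity(labels)                           # two balanced classes: 0.5
perfect = split_impurity(labels[:2], labels[2:])         # both children pure: 0.0
mixed = split_impurity(labels[[0, 2]], labels[[1, 3]])   # children as mixed as the parent: 0.5
print(parent, perfect, mixed)
```

A split that separates the classes drives the score to 0, while a split that leaves both children as mixed as the parent does not improve on the parent's impurity.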

In [None]:
import numpy as np
from sklearn.datasets import load_iris

class DecisionTree:
    """Base class for a decision tree."""

    def __init__(self, max_depth=5, min_samples_split=10):
        """Initialize decision tree."""
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = {}
    
    def fit(self, X, y):
        """Build the decision tree."""
        self.tree = self._build_tree(X, y, 0)

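    def _loss(self, y1, y2):
        """Return the loss of splitting a node into labels y1 and y2 (defined by subclasses)."""
        raise NotImplementedError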

    def _build_tree(self, X, y, depth):
        """Recursively build the tree."""
        n_samples, n_features = X.shape
        # Make node a leaf if stop conditions are met
        if len(np.unique(y)) == 1:
            return {'value': y[0]}
        elif (depth >= self.max_depth or n_samples < self.min_samples_split):
            values, counts = np.unique(y, return_counts=True)
            ind = np.argmax(counts)
            return {'value': values[ind]}
        # Search for optimal split
        best_j = None
        best_z = None
        best_loss = float('inf')
        for j in range(n_features):
            X_j = np.unique(X[:,j])
            for z in (X_j[:-1]+X_j[1:])/2:
                # Calculate impurity of split
                left_indices = X[:, j] <= z
                right_indices = ~left_indices
                loss = self._loss(y[left_indices], y[right_indices])
                if loss < best_loss:
                    best_loss = loss
                    best_j = j
                    best_z = z
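        # No valid split exists if every feature is constant within this node
        if best_j is None:
            values, counts = np.unique(y, return_counts=True)
            return {'value': values[np.argmax(counts)]}
        left_indices = X[:, best_j] <= best_z
        right_indices = ~left_indices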
        left_subtree = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], depth + 1)
        return {'feature_index' : best_j, 
                'threshold' : best_z,
                'left' : left_subtree,
                'right' : right_subtree}
    
    def predict(self, X):
        """Make predictions for X."""
        n_samples = X.shape[0]
        y_pred = np.empty(n_samples, dtype=float)
        for i in range(n_samples):
            node = self.tree
            while 'value' not in node.keys():
                if X[i,node['feature_index']] <= node['threshold']:
                    node = node['left']
                else:
                    node = node['right']
            y_pred[i] = node['value']
        return y_pred

In [4]:
def gini(y):
    """Return the Gini impurity 1 - sum_k p_k^2 of the labels in y."""
    counts = np.unique(y, return_counts=True)[1]
    prob_sq_sum = sum((count / len(y)) ** 2 for count in counts)
    return 1 - prob_sq_sum

class ClassificationTree(DecisionTree):
    """Classification decision tree using Gini impurity."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def _loss(self, y1, y2):
        """Calculate the sample-weighted Gini impurity of a split."""
        total_samples = len(y1) + len(y2)
        gini1 = gini(y1)
        gini2 = gini(y2)
        weighted_gini = (len(y1) * gini1 + len(y2) * gini2) / total_samples
        return weighted_gini

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine

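# Note: the task statement names IRIS, but this cell loads the Wine dataset;
# the accuracies reported in the following cells are for Wine.
data = load_wine()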
X = data.data
y = data.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Train a classification tree that is grown until every leaf is pure (depth and split size are effectively unconstrained)
model = ClassificationTree(max_depth = 150, min_samples_split = 2)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
print("Accuracy on training set:", accuracy_score(y_train, y_train_pred))
y_test_pred = model.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_test_pred))

Accuracy on training set: 1.0
Accuracy on test set: 0.8888888888888888


In [7]:
# Train a classification tree with more restrictive parameters to reduce overfitting
model = ClassificationTree(max_depth = 5, min_samples_split = 20)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
print("Accuracy on training set:", accuracy_score(y_train, y_train_pred))
y_test_pred = model.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_test_pred))

Accuracy on training set: 0.9924812030075187
Accuracy on test set: 0.8888888888888888


## Task 6.2.2 - Random Forests

* Implement Random Forests using only `NumPy`. **(RESULT)**
* Compare the results of your random forest with those of a single `ClassificationTree` on the `IRIS` dataset. **(RESULT)**
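One possible shape for such an implementation is sketched below, under stated assumptions: each tree is fit on a bootstrap sample, each split considers only a random subset of features (by default about the square root of the feature count), and predictions are combined by majority vote. The class name `RandomForestSketch`, the helper functions, and all parameter names are hypothetical; the tree growing is re-implemented here only to keep the block self-contained and does not reuse the `DecisionTree` base class.

```python
import numpy as np

def _gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def _majority(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def _grow(X, y, depth, max_depth, min_samples_split, max_features, rng):
    """Grow one tree; every split only looks at a random subset of features."""
    if len(np.unique(y)) == 1 or depth >= max_depth or len(y) < min_samples_split:
        return {'value': _majority(y)}
    best = None  # (loss, feature_index, threshold)
    for j in rng.choice(X.shape[1], size=max_features, replace=False):
        values = np.unique(X[:, j])
        for z in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= z
            if left.all():  # guard against a midpoint rounding up to the larger value
                continue
            loss = (left.sum() * _gini(y[left])
                    + (~left).sum() * _gini(y[~left])) / len(y)
            if best is None or loss < best[0]:
                best = (loss, j, z)
    if best is None:  # all sampled features are constant in this node
        return {'value': _majority(y)}
    _, j, z = best
    left = X[:, j] <= z
    args = (depth + 1, max_depth, min_samples_split, max_features, rng)
    return {'feature_index': j, 'threshold': z,
            'left': _grow(X[left], y[left], *args),
            'right': _grow(X[~left], y[~left], *args)}

def _predict_one(node, x):
    while 'value' not in node:
        go_left = x[node['feature_index']] <= node['threshold']
        node = node['left'] if go_left else node['right']
    return node['value']

class RandomForestSketch:
    """Bagged trees with per-split feature subsampling and majority voting."""

    def __init__(self, n_trees=25, max_depth=5, min_samples_split=10,
                 max_features=None, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.seed = seed
        self.trees = []

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n_samples, n_features = X.shape
        k = self.max_features or max(1, int(np.sqrt(n_features)))
        k = min(k, n_features)
        self.trees = []
        for _ in range(self.n_trees):
            idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
            self.trees.append(_grow(X[idx], y[idx], 0, self.max_depth,
                                    self.min_samples_split, k, rng))
        return self

    def predict(self, X):
        # votes has shape (n_trees, n_samples); take the majority per sample
        votes = np.array([[_predict_one(tree, x) for x in X] for tree in self.trees])
        return np.array([_majority(column) for column in votes.T])
```

With this interface, `RandomForestSketch(n_trees=25).fit(X_train, y_train).predict(X_test)` can be scored with `accuracy_score` in the same way as the single tree, which makes the requested comparison straightforward.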

In [8]:
class RandomForest:
    """Random Forest Classifier."""
    pass
    # TODO: Implement this class

## Task 6.2.3 - Extra Trees

* Implement Extra Trees using only `NumPy`. **(RESULT)**
* Compare the results between the `Random Forest` and an `Extra Trees` ensemble implementation on the `IRIS` dataset. **(RESULT)**
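The main difference from a random forest lies in how a split is chosen: instead of searching all thresholds, an extremely randomized tree draws one random threshold per candidate feature and keeps the best of those draws. The sketch below illustrates only that step, assuming Gini impurity as the score; `extra_tree_split`, `gini_of`, and the `max_features` parameter are hypothetical names. Extra Trees ensembles are usually fit on the full training set rather than on bootstrap samples, but that is a convention, not something the task statement fixes.

```python
import numpy as np

def gini_of(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_tree_split(X, y, max_features, rng):
    """Return (loss, feature_index, threshold) for the best of the random draws.

    Returns None if every sampled feature is constant in this node.
    """
    best = None
    for j in rng.choice(X.shape[1], size=max_features, replace=False):
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:  # constant feature, nothing to split on
            continue
        z = rng.uniform(lo, hi)  # one random threshold instead of a full search
        left = X[:, j] <= z
        if left.all():  # only possible if rounding pushes z up to hi
            continue
        loss = (left.sum() * gini_of(y[left])
                + (~left).sum() * gini_of(y[~left])) / len(y)
        if best is None or loss < best[0]:
            best = (loss, int(j), float(z))
    return best

rng = np.random.default_rng(0)
X_demo = np.vstack([rng.uniform(0, 1, (30, 3)), rng.uniform(3, 4, (30, 3))])
y_demo = np.array([0] * 30 + [1] * 30)
print(extra_tree_split(X_demo, y_demo, max_features=2, rng=rng))
```

An `ExtraTree` subclass could replace the exhaustive threshold loop of `_build_tree` with a draw like this, and an `ExtraTrees` ensemble could then average or vote over many such trees.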

In [9]:
class ExtraTree(DecisionTree):
    """Extremely Randomized Tree - uses random thresholds instead of optimal ones."""
    pass
    # TODO: Implement this class

In [10]:
class ExtraTrees:
    """Extremely Randomized Trees ensemble."""
    pass
    # TODO: Implement this class

## Congratz, you made it! :)