# Assignment 6.1 - Trees

Please submit your solution of this notebook in the Whiteboard at the corresponding Assignment entry as .ipynb-file and as .pdf. <br><br>
Please do **NOT** rename the file!

#### State both names of your group members here:
[Jane and John Doe]

In [69]:
# Daniel Thompson and Paola Gega

---

## Grading Info/Details - Assignment 6.1:

The assignment will be graded semi-automatically, which means that your code will be tested against a set of predefined test cases and qualitatively assessed by a human. This will speed up the grading process for us.

* For passing the test scripts: 
    - Please make sure to **NOT** alter predefined class or function names, as this would lead to failing of the test scripts.
    - Please do **NOT** rename the files before uploading to the Whiteboard!

* **(RESULT)** tags indicate checkpoints that will be specifically assessed by a human.

* You will pass the assignment if you pass the majority of test cases and we can at least confirm effort regarding the **(RESULT)**-tagged checkpoints per task.

---

## Task 6.1.1 - Regression Trees

* Implement the Regression Tree Class from scratch using only `NumPy`. **(RESULT)**
* Run your implementation on the synthetic regression dataset provided. **(RESULT)**

In [70]:
import numpy as np

In [None]:
def generate_regression_data(n_samples=1000, n_features=8, noise=0.1, random_state=42):
    """Generate synthetic regression data similar to California housing."""
    np.random.seed(random_state)
    
    X = np.random.randn(n_samples, n_features)
    
    # Create target with non-linear relationships
    y = (2.5 * X[:, 0] +                     # Linear relationship
          1.8 * X[:, 1] ** 2 +               # Quadratic (non-linear)
          -1.2 * X[:, 2] * X[:, 3] +         # Interaction between features
          0.5 * np.sin(5 * X[:, 4]) +        # Sinusoidal (periodic pattern)
          0.8 * X[:, 5] +                    # Linear
          -0.3 * X[:, 6] ** 3 +              # Cubic (strong non-linearity)
          1.5 * X[:, 7])                     # Linear
    
    # Add noise
    y += noise * np.random.randn(n_samples)
    
    # Scale to reasonable range
    y = (y - y.min()) / (y.max() - y.min()) * 4 + 1
    
    return X, y


In [None]:
class RegressionTree:
    """A binary decision tree for regression using numpy."""
    
    def __init__(self, max_depth=5, min_samples_split=10):
        """
        Initialize Regression tree.
        
        Parameters:
        -----------
        max_depth : int
            Maximum depth. If max_depth = -1 then there is no limit.
        min_samples_split : int
            Number of samples beneath which we do not continue refining 
            the decision tree. 
        """
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_samples = None
        self.n_features = None
        self.left = None
        self.right = None
        self.feature = None
        self.spl_point = None
        self.pred_val = None
    
    def fit(self, X, y):
        """Build the regression tree."""
        self.n_samples, self.n_features = X.shape
        self._build_tree(X, y)
    
    def _build_tree(self, X, y):
        """Recursively build the tree."""
        # Make node a leaf if stop conditions are met
        if (self.max_depth == 0) or (self.n_samples < self.min_samples_split):
            self.pred_val = np.mean(y)
            return
        # Search for optimal split
        best_j = None
        best_z = None
        best_loss = float('inf')
        for j in range(self.n_features):
            X_j = np.sort(X[:,j])
            for z in (X_j[:-1] + X_j[1:])/2:
                # Calculate impurity of split
                val_l = np.mean(y[X_j <= z])
                val_r = np.mean(y[X_j > z])
                loss = np.mean((y[X_j <= z] - val_l)**2) + np.mean((y[X_j > z] - val_r)**2)
                if loss < best_loss:
                    best_loss = loss
                    best_j = j
                    best_z = z
        # Enter data for a non-leaf node
        self.feature = best_j
        self.spl_point = best_z
        self.left = RegressionTree(max_depth = self.max_depth - 1,
                                   min_samples_split = self.min_samples_split)
        self.left.fit(X[X_j <= z], y[X_j <= z])
        self.right = RegressionTree(max_depth = self.max_depth - 1,
                                   min_samples_split = self.min_samples_split)
        self.right.fit(X[X_j > z], y[X_j > z])
    
    def predict(self, X):
        """Make predictions for X."""
        n_samples, n_features = X.shape
        y_pred = np.empty(n_samples, dtype=float)
        for i in range(n_samples):
            node = self
            while (not node.pred_val):
                if X[i,self.feature] <= self.spl_point:
                    node = node.left
                else:
                    node = node.right
            y_pred[i] = node.pred_val
        return y_pred

In [73]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [74]:
X, y = generate_regression_data()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a regression tree model where every leaf corresponds to a single sample
model = RegressionTree(max_depth= - 1, min_samples_split=2)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
print("MSE on training set:", mean_squared_error(y_train, y_train_pred))
y_test_pred = model.predict(X_test)
print("MSE on test set:", mean_squared_error(y_test, y_test_pred))

MSE on training set: 0.28799318751344305
MSE on test set: 0.20757899158237966


## Task 6.1.2 - Bagging

* Implement Bagging using only `NumPy`. **(RESULT)**
* Compare the results between the bagged run of your `RegressionTree` class on the synthetic dataset. **(RESULT)**

In [75]:
class BaggingRegressor:
    """Bagging ensemble for regression trees."""
    
    def __init__(self):
        pass
        # TODO: Implement this function
    
    def fit(self):
        """Fit the bagging ensemble."""
        pass
        # TODO: Implement this function
    
    def predict(self):
        """Make predictions by averaging all trees."""
        pass
        # TODO: Implement this function

## Congratz, you made it! :)