# Assignment 2: Decision Trees

Fill in your name and student ID here.
- Name: Soh Kai Le
- Student ID: A0273076B (E1122479)

## Overview

In this assignment, we'll implement Decision Trees:

1. **Decision Tree Regression**
    - Compute Region-Residual Sum of Squares (Region-RSS)
    - Decision Tree Regressor
2. **Decision Tree Classification**
    - Calculate Entropy
    - Compute Information Gain
    - Implement the Majority Class function
    - Implement the Best Split function
    - Implement the Recursive Tree Builder Function
    - Implement Prediction Logic for a single data point
    - Decision Tree Classifier
3. **Practical**: Train a DT classifier on the training dataset using scikit-learn

By the end, you'll understand how decision trees work and how to tackle a problem using them. Let's dive in!

## Instructions

1. Fill in your name and student ID at the top of the ipynb file.
2. The parts you need to implement are clearly marked with the following:

    ```
    """ YOUR CODE STARTS HERE """

    """ YOUR CODE ENDS HERE """
    ```

    You must write your code **ONLY** between these two lines.
3. **IMPORTANT**: Make sure that all of the cells are runnable and execute without raising exceptions, even if the answer is incorrect. This will significantly help us in grading your solutions.
4. For Tasks 1 and 2, you are only allowed to use basic Python functions in your code (no `NumPy` or its equivalents), unless otherwise stated. You may reuse any functions you have defined earlier. If you are unsure whether a particular function is allowed, feel free to ask any of the TAs.
5. For Task 3, you may use the `scikit-learn` library.
6. Your solutions will be evaluated against a set of hidden test cases to prevent hardcoding of the answer. You may assume that the test cases are always valid, unless specified otherwise. Partial marks may be given for partially correct solutions.

### Submission Instructions
Items to be submitted:
* **This notebook, NAME-STUID-assignment2.ipynb**: This is where you fill in all your code. Replace "NAME" with your full name and "STUID" with your student ID, which starts with "A", e.g. `"John Doe-A0123456X-assignment2.ipynb"`

Submit your assignment by **Sunday, 14 September 23:59** to Canvas. Points will be deducted for late submissions.


## Task 1 - Decision Tree Regression [4 Points]

### Task 1.1 - Compute Region-Residual Sum of Squares (Region-RSS) [1 Point]

Minimizing $\operatorname{Region-RSS}(l, c)$ helps us find the best split for the decision tree at each step.

$$
\underbrace{\operatorname{Region-RSS}(l,c)}_\text{Assume feature $l$, cutoff $c$}= \underbrace{\operatorname{RSS}(\{X|X_l<c\})}_\text{Left subregion} + \underbrace{\operatorname{RSS}(\{X|X_l\geq c\})}_\text{Right subregion}
$$

To do so, we compute the RSS of each sub-region, where the prediction $\hat{f}(x_i)$ is the mean target value of that sub-region:

$$
\operatorname{RSS}(X) = \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2 = \sum_{i=1}^{n} \left(e_i\right)^2
$$

Implement a function that computes the Region-RSS for a given cutoff, using the target values in the left and right sub-regions (`y_left` and `y_right`), without the use of `numpy`.

**Note**: Make sure that your code is able to handle the case where either `y_left` or `y_right` is an empty list.
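
As a worked check, take `y_left = [3, 4, 5]` and `y_right = [8, 9]`, assuming each sub-region predicts its own mean (the notation $\bar{y}$ for that mean is introduced here for illustration):

$$
\begin{aligned}
\bar{y}_\text{left} &= 4, & \operatorname{RSS}_\text{left} &= (3-4)^2 + (4-4)^2 + (5-4)^2 = 2 \\
\bar{y}_\text{right} &= 8.5, & \operatorname{RSS}_\text{right} &= (8-8.5)^2 + (9-8.5)^2 = 0.5 \\
& & \operatorname{Region-RSS} &= 2 + 0.5 = 2.5
\end{aligned}
$$

An empty sub-region has no residuals, so it contributes 0 to the sum.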

In [1]:
# TASK 1.1
def calculate_regionrss(y_left, y_right):
    """
    TODO: Compute the Region-RSS for the split.
    Avoid using NumPy and use only basic Python functions.

    Args:
        y_left: A list of target values in the left split
        y_right: A list of target values in the right split

    Returns:
        Region-RSS error for a given cutoff
    """

    total_rss = 0

    """ YOUR CODE STARTS HERE """
    # y_left and y_right are the yi values so we first need to find ymean per side then apply the formula
    # so sum over the loop the squared values for both sides. Then total_rss = rss_left + rss_right
    rss_left = 0
    y_left_mean = 0
    rss_right = 0
    y_right_mean = 0

    # Left region

    y_left_length = len(y_left)
    y_right_length = len(y_right)

    if y_left_length != 0: # Ensure that it can handle the case where it is an empty list
      for i in y_left:
        y_left_mean += i

      y_left_mean = y_left_mean / y_left_length

      for j in y_left:
        rss_left += (j - y_left_mean) ** 2
    else:
      pass

    # Same for the right

    if y_right_length != 0: # Ensure that it can handle the case where it is an empty list
      for k in y_right:
        y_right_mean += k

      y_right_mean = y_right_mean / y_right_length

      for l in y_right:
        rss_right += (l - y_right_mean) ** 2
    else:
      pass

    total_rss = rss_left + rss_right

    """ YOUR CODE ENDS HERE """

    return total_rss

# TESTCASES 1.1
import math

assert math.isclose(calculate_regionrss([3, 4, 5], [8, 9]), 2.5, rel_tol=1e-5)
assert math.isclose(calculate_regionrss([1, 1, 1], [1, 1]), 0.0, rel_tol=1e-5)
assert math.isclose(calculate_regionrss([], [1, 1]), 0.0, rel_tol=1e-5)
print('All test cases passed!')

All test cases passed!


### Task 1.2 - Decision Tree Regressor [3 Points]

Our Decision Tree Regressor recursively splits the data into two regions at each node, using binary splits that minimize the Region-RSS. Splitting stops when the maximum depth is reached or no valid split exists. Each leaf predicts the mean target value of its region.
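
To make the split search concrete, here is a minimal sketch that lists every candidate cutoff for a single feature together with its Region-RSS. The helper names `region_rss` and `candidate_splits` are hypothetical and not part of the assignment template. The sketch assumes cutoffs are midpoints between consecutive distinct sorted feature values, which is one reasonable choice rather than the required one.

```python
def region_rss(y_left, y_right):
    """Sum of squared deviations from the mean, added over both sub-regions."""
    total = 0.0
    for region in (y_left, y_right):
        if region:  # an empty sub-region contributes nothing
            mean = sum(region) / len(region)
            total += sum((value - mean) ** 2 for value in region)
    return total


def candidate_splits(X, y):
    """Return (cutoff, region_rss) pairs for a single-feature dataset."""
    pairs = sorted(zip(X, y), key=lambda pair: pair[0])  # sort by feature value
    xs = [pair[0] for pair in pairs]
    ys = [pair[1] for pair in pairs]
    splits = []
    for k in range(1, len(xs)):
        if xs[k - 1] == xs[k]:
            continue  # equal feature values cannot be separated by a threshold
        cutoff = (xs[k - 1] + xs[k]) / 2
        splits.append((cutoff, region_rss(ys[:k], ys[k:])))
    return splits


splits = candidate_splits([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
best_cutoff, best_rss = min(splits, key=lambda split: split[1])
print(splits)
print(best_cutoff, best_rss)
```

For this data the smallest Region-RSS (10) is shared by the cutoffs 2.5 and 3.5. `min` keeps the first of the tied candidates, which mirrors using a strict `<` comparison when tracking the best split.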

In [2]:
# TASK 1.2
class DTRegressor:
    def __init__(self, max_depth=2):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        """
        Args:
            X: A list of feature values (one feature per sample).
            y: A list of target values corresponding to each sample.
        """
        self.tree = self._build_tree(X, y, depth=0)

    def _build_tree(self, X, y, depth):
        """
        TODO: Implement the recursive tree building logic using RSS.
        Hint: Add to right node when it is exactly the same as the split value.

        Args:
            X: A list of feature values (one feature per sample)
            y: A list of target values corresponding to each sample
            depth: The current depth of the tree

        Returns:
            dict or float:
                If it's an internal node, returns a dictionary representing the node:
                - 'split_value': The feature value at which the split occurs.
                - 'left': The left child node (recursively built tree structure or leaf value).
                - 'right': The right child node (recursively built tree structure or leaf value).
                If it's a leaf node (base case) or max_depth reached, returns the mean of the target values
                in that region, which will be the prediction for that leaf.
        """

        res = None

        """ YOUR CODE STARTS HERE """
        # Testing
        # print(calculate_regionrss([2,4,6,8,10], []))

        # For X = [1,2,3,4,5]

        # [], [2,4,6,8,10] gives 40 corresponds to y[:0], split value = 1 / 2 = 0.5
        # [2], [4,6,8,10] gives 20 corresponds to y[:1], split value = (1+2)/2 = 1.5, corresponding index = 1 which is value = 2
        # [2,4], [6,8,10] gives 10 corresponds to y[:2], split value = (2+3)/2 = 2.5, corresponding index = 2 which is value = 3
        # [2,4,6], [8,10] gives 10 corresponds to y[:3], split value = (3+4)/2 = 3.5, corresponding index = 3 which is value = 4
        # [2,4,6,8], [10] gives 10 corresponds to y[:4], split value = (4+5)/2 = 4.5
        # [2,4,6,8,10], [] gives 40 corresponds to y[:5], split value = 5 + 5/2 = 7.5

        # k = 0, 1, 2, 3, 4

        # Best_split_value = (Sorted_X[Best_cutoff - 1] + Sorted_X[Best_cutoff]) / 2


        # Base Case is so as I think there is no valid split if we are only down to 1 value for X, had to shift this to the top in order for it to work as intended
        length = len(X) # len(X) should always be the same as len(y)
        if depth == self.max_depth or length <= 1:
          if length != 0:
            mean_y = sum(y) / len(y)
            return(mean_y)
          else:
            return(0.0) # Return a float so predict_one recognises an empty region as a leaf

        # Need to consider if X is sorted in ascending order else slicing by index would not work, use sort() but not directly since we need to sort and keep corresponding X and y
        # indexes the same so we use a list of tuples

        list_of_tuples = []
        for i in range(length):
          list_of_tuples.append((X[i], y[i]))

        Sorted_list = sorted(list_of_tuples, key=lambda x: x[0])
        Sorted_X = []
        Sorted_y = []

        for j in range(length):
          Sorted_X.append(Sorted_list[j][0])
          Sorted_y.append(Sorted_list[j][1])

        # Then we find the best cutoff by taking the smallest RSS
        Best_RSS = float('inf') # Start at infinity so that any finite RSS is accepted, however large the targets are
        Best_cutoff = -1

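        for k in range(1, length):
          # Skip cutoffs between equal feature values: no threshold can separate them
          if Sorted_X[k - 1] == Sorted_X[k]:
            continue

          y_left = Sorted_y[:k]
          y_right = Sorted_y[k:]
          RSS = calculate_regionrss(y_left, y_right)

          if RSS < Best_RSS: # Not <= for the split_value calculation later
            Best_RSS = RSS
            Best_cutoff = k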

        # Since we looped from 1 to length, we need to consider for cases outside that such as if it is a leaf node. If so then again just return mean_y again
        if Best_cutoff == -1:
          mean_y = sum(y) / len(y)
          return(mean_y)

        # Keep track for the left and right subtrees (outside the loop)
        X_left = Sorted_X[:Best_cutoff]
        X_right = Sorted_X[Best_cutoff:]
        y_left = Sorted_y[:Best_cutoff]
        y_right = Sorted_y[Best_cutoff:]

        # We calculate split value by taking the midpoint, since it seems to be the best method
        Best_split_value = (Sorted_X[Best_cutoff - 1] + Sorted_X[Best_cutoff]) / 2

        left_subtree = self._build_tree(X_left, y_left, depth + 1)
        right_subtree = self._build_tree(X_right, y_right, depth + 1)

        res = {'split_value': Best_split_value, 'left': left_subtree, 'right': right_subtree}

        """ YOUR CODE ENDS HERE """

        return res

    def predict_one(self, x, node=None):
        """
        TODO: Traverse the tree to make a prediction for a single input.
        Hint: Prediction should go to the right child node when it is exactly the same as the split value.

        Args:
            x: A single input feature value.
            node: The current node in the tree (used for recursion)

        Returns:
            A float representing the predicted target value for the input
        """

        prediction = 0

        """ YOUR CODE STARTS HERE """
        # Assume node = {'split_value': 3, 'left': {'split_value': 2, 'left': 2.0, 'right': 4.0}, 'right': {'split_value': 4, 'left': 6.0, 'right': 9.0}} from earlier print test case
        # x = 4, so we compare x vs split_value so 4 >= 3, as per hint it should go to the right child node else left

        if isinstance(node, float):
          # We are at the leaf node so it is just the float value
          prediction = node
        else:
          # We are still traversing, we have node as a dictionary
          split_value = node.get('split_value', 0)
          if x >= split_value:
            right_child_node = node.get('right', None)
            prediction = self.predict_one(x, node = right_child_node)

          else:
            left_child_node = node.get('left', None)
            prediction = self.predict_one(x, node = left_child_node)

        """ YOUR CODE ENDS HERE """

        return prediction

    def predict(self, X):
        """
        TODO: Call predict_one for each input in X and return the predictions.

        Args:
            X: A list of input feature values (one feature per sample)

        Returns:
            A list of predicted target values
        """

        predictions = []

        """ YOUR CODE STARTS HERE """
        res = 0

        for i in X:
          res = self.predict_one(i, node = self.tree)
          predictions.append(res)

        # print(predictions)

        """ YOUR CODE ENDS HERE """

        return predictions

# TESTCASES 1.2
import math

X = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

model = DTRegressor(max_depth=2)
model.fit(X, y)
predictions = model.predict([1.5, 3.5, 5])
assert all(isinstance(p, float) for p in predictions)
assert len(predictions) == 3
assert math.isclose(predictions[0], 4.0)
assert math.isclose(predictions[1], 9.0)
assert math.isclose(predictions[2], 9.0)


X_offset = [0, 1, 2, 3]
y_offset = [5, 7, 9, 11]
model_offset = DTRegressor(max_depth=2)
model_offset.fit(X_offset, y_offset)
predictions_offset = model_offset.predict([0.5, 2.5])
assert all(isinstance(p, float) for p in predictions_offset)
assert len(predictions_offset) == 2
assert math.isclose(predictions_offset[0], 7.0)
assert math.isclose(predictions_offset[1], 11.0)

print('All test cases passed!')

All test cases passed!


## Task 2 - Decision Tree Classification [11 Points]

### Task 2.1 - Calculate Entropy [1 Point]

The entropy of a label set quantifies the amount of uncertainty or impurity in the distribution of class labels. It is defined as:

$$
H(E) = -\sum_{e\in E}{P\left(e\right)\log_{|E|}\left(P\left(e\right)\right)}
$$

where $P(e)$ is the proportion of samples belonging to class $e$, and $|E|$ is the number of unique classes in the label set. Implement `compute_entropy` using the `math` package.

**Note**: Make sure that your code is able to handle the case where there is only 1 unique class.
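
The formula can be sketched compactly as follows. This is an illustrative version only: `entropy_sketch` is a hypothetical name, and it uses `collections.Counter` for brevity, which may not count as a "basic Python function" under the assignment rules.

```python
import math
from collections import Counter


def entropy_sketch(labels, n_unique_classes):
    """Entropy of a label list, using a logarithm with base n_unique_classes."""
    # With a single class there is no uncertainty, and log base 1 is undefined
    if n_unique_classes <= 1 or not labels:
        return 0.0
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total  # proportion of samples in this class
        result -= p * math.log(p, n_unique_classes)
    return result


print(entropy_sketch(['yes', 'no', 'yes', 'yes', 'no'], 2))
print(entropy_sketch(['cat', 'dog', 'cat', 'fish', 'dog', 'cat'], 3))
```

Because the logarithm base equals the number of classes, a perfectly balanced label set has entropy 1 regardless of how many classes there are.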

In [3]:
# TASK 2.1
import math

def compute_entropy(y, n_unique_classes):
    """
    TODO: Compute class proportions and entropy using the formula.

    Args:
        y: A list of class labels (e.g., ['yes', 'no', 'yes', ...])
        n_unique_classes: The number of unique classes in the dataset

    Returns:
        A float representing the entropy of the label distribution
    """

    entropy = 0

    """ YOUR CODE STARTS HERE """
    # first consider for each type in n_unique_classes
    if n_unique_classes == 1:
      return(0)

    else:
      total = len(y)
      unique_values = {}

      for i in y:
        # Find the unique values in y
        unique_values[i] = unique_values.get(i,0) + 1

      unique_value = list(unique_values.keys()) # yes, no

      # Apply formula now
      for j in unique_value:
        P_e = unique_values.get(j, 0) / total
        entropy += P_e * (math.log(P_e, n_unique_classes))

      entropy = entropy * (-1)

    """ YOUR CODE ENDS HERE """

    return entropy

# TESTCASES 2.1
assert math.isclose(compute_entropy(['yes', 'no', 'yes', 'yes', 'no'], n_unique_classes=2), 0.970950, rel_tol=1e-5)
assert math.isclose(compute_entropy(['yes', 'yes', 'yes','no','maybe','maybe','no','maybe'], n_unique_classes= 3), 0.985056, rel_tol=1e-5)
assert math.isclose(compute_entropy(['cat', 'dog', 'cat', 'fish', 'dog', 'cat'], n_unique_classes= 3), 0.920619, rel_tol=1e-5)
assert math.isclose(compute_entropy(['yes', 'yes', 'yes'], n_unique_classes=1), 0, rel_tol=1e-5)

print('All test cases passed!')

All test cases passed!


### Task 2.2 - Compute Information Gain [1 Point]

The information gain of a split of the attribute $A$ is:

$$
\operatorname{IG}(D,A)=H(D) - \sum_{v\in A} {\frac{|D_v|}{|D|}}H(D_v)=H(D) - \sum_{v\in A} P(D_v)H(D_v)
$$

Implement `information_gain` using the function `compute_entropy` defined before.

**Note**: Make sure that your code is able to handle the case where `parent_y` is an empty list. However, you may assume that each list in `list_of_child_ys` is a valid mutually-exclusive split of `parent_y`.
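
As a worked example, consider a parent with three `'yes'` and three `'no'` labels, split into the children `['yes', 'yes']`, `['no', 'no']`, and `['yes', 'no']`, with two classes (so the logarithm base is 2). The symbols $D_1, D_2, D_3$ for the three children are introduced here for illustration:

$$
\begin{aligned}
H(D) &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \\
H(D_1) &= 0, \quad H(D_2) = 0, \quad H(D_3) = 1 \\
\operatorname{IG} &= 1 - \left(\tfrac{2}{6}\cdot 0 + \tfrac{2}{6}\cdot 0 + \tfrac{2}{6}\cdot 1\right) = \tfrac{2}{3}
\end{aligned}
$$

The two pure children contribute nothing to the weighted sum, so only the mixed child reduces the gain.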

In [4]:
# TASK 2.2
def information_gain(parent_y, list_of_child_ys, n_unique_classes=2):
    """
    TODO: Compute the information gain from the parent to the children

    Args:
        parent_y: Target values of the parent node.
        list_of_child_ys: A list where each element is a list of target values for a child node resulting from a split.
        n_unique_classes: The number of unique classes in the dataset

    Returns:
        A float representing the information gain from the split
    """

    gain = 0

    """ YOUR CODE STARTS HERE """
    # IG = Entropy(parent_y) - summation of sublist in (list_of_child_ys) of (proportion of sublist over parent * entropy(sublist))

    total = len(parent_y)
    if total == 0: # Need to ensure case where parent_y is [] but each list in list_of_child_ys is mutually exclusive from parent_y
      gain = 0

    else:
      H_D = compute_entropy(parent_y, n_unique_classes)

      for i in list_of_child_ys:
        gain += (len(i) / total) * compute_entropy(i, n_unique_classes)

      gain = (gain * -1) + H_D

    """ YOUR CODE ENDS HERE """

    return gain

# TESTCASES 2.2
import math

parent = ['yes', 'no', 'yes', 'no']
left = ['yes', 'yes']
right = ['no', 'no']
assert math.isclose(information_gain(parent, [left, right], n_unique_classes = 2), 1.0, rel_tol=1e-5)

parent = ['yes', 'no', 'yes', 'no']
left = ['yes', 'no']
right = ['yes', 'no']
assert math.isclose(information_gain(parent, [left, right], n_unique_classes = 2), 0.0, rel_tol=1e-5)


parent = ['yes', 'no', 'yes', 'no', 'yes', 'no']
child_1 = ['yes', 'yes']
child_2 = ['no', 'no']
child_3 = ['yes', 'no']
assert math.isclose(information_gain(parent, [child_1, child_2, child_3], n_unique_classes=2), 2/3, rel_tol=1e-5)

print('All test cases passed!')

All test cases passed!


### Task 2.3 - Implement the Majority Class function [1 Point]

This function finds the most frequent class label within a list of `y_labels`. When a new data point reaches a leaf node in a classification tree, it's assigned the class that's most common among the training examples in that leaf. This function determines that "majority class." Count the occurrences of each label. If there's a tie for the highest count, return the smallest label (alphabetically first) to ensure consistent results.
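
One way to express the tie-breaking rule is to rank labels by the pair (negative count, label), so that the highest count wins and ties fall to the smallest label. The sketch below uses a hypothetical helper name, `majority_label_sketch`, and assumes all labels in a list are mutually comparable (all strings or all numbers).

```python
def majority_label_sketch(labels):
    """Most frequent label; ties go to the smallest label; None for empty input."""
    if not labels:
        return None
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Negating the count makes the most frequent label the minimum of the key
    return min(counts, key=lambda label: (-counts[label], label))


print(majority_label_sketch(['yes', 'no', 'yes', 'no']))
print(majority_label_sketch([1, 2, 2, 3, 2]))
```

Using `min` with a composite key avoids sorting the whole list of counts when only the top entry is needed.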

In [None]:
# TASK 2.3
def majority_class(y_labels):
        """
        TODO: Implement majority class calculation.

        Args:
            y_labels: A list of class labels

        Returns:
            The majority class label (the one with the highest count), or None if y_labels is empty
            If there is a tie, return the smallest label
        """

        majority_class = None

        """ YOUR CODE STARTS HERE """
        count = {}
        res = []

        # Use a dictionary to count the number of occurrences
        for i in y_labels:
          count[i] = count.get(i,0) + 1
        # print(count)

        # Convert to list then sort by value then alphabetically
        res = list(count.items())

        # res = sorted(res, key = lambda x: (x[1], x[0]), reverse = True) # Failed
        # Need it to sort by the first element of the tuple in descending order, then second element of the tuple in ascending order
        # first element is an int, second element is a str

        res = sorted(res, key = lambda x: (-x[1], x[0])) # Trick is since when we multiply by -1, the ascending order becomes -3 then -2 which is what we want
        # print(res)

        if res: # Leave majority_class as None when y_labels is empty
          majority_class = res[0][0]

        """ YOUR CODE ENDS HERE """

        return majority_class

# TESTCASES 2.3
y = ['yes', 'no', 'yes', 'no', 'yes']
assert majority_class(y) == 'yes'

y = ['yes', 'no', 'yes', 'no']
assert majority_class(y) == 'no'

y = [1, 2, 2, 3, 2]
assert majority_class(y) == 2

print('All test cases passed!')

All test cases passed!


### Task 2.4 - Implement the Best Split function [3 Points]

Complete the `find_best_split` function below using the `information_gain` function defined before.

For a given dataset `X` (features) and `y` (labels), it searches through all possible ways to split the data based on each feature and selects the best one. A decision tree grows by finding splits that best separate the classes. This function chooses the split with the highest Information Gain (IG), leading to purer child nodes and better classification.

This function must handle both categorical and continuous features:
* Categorical Features
    * String Values like `'red'`, `'blue'`, `'apple'`, etc.
    * Perform a **multi-way split**, creating one child node per unique category (at least 2 categories).
    * DO NOT remove the chosen feature from X after splitting.

* Continuous Features
    * Numeric values like `2.0`, `4.5`, etc.
    * Perform **binary splits** at midpoints between consecutive sorted unique values.

For each potential split:
1. Split the data accordingly.
2. Calculate the information gain from the split.
3. Track the split details (feature index, split type, and child group values).

Return the split with the highest information gain. If there is a tie in information gain, prefer the feature with the lowest index.

As this is quite a complex problem, we suggest that you cross-check your outputs manually and add more test cases to make sure that your code really works!
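
The two ways of partitioning the data can be sketched separately from the gain computation. The helpers below, `candidate_thresholds` and `group_by_category`, are hypothetical names used only for illustration. They show how the candidate midpoints of a continuous feature and the per-category groups of a categorical feature could be produced, assuming each row of `X` is a list of feature values.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values of one feature."""
    unique_sorted = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique_sorted, unique_sorted[1:])]


def group_by_category(X, y, feature_index):
    """Multi-way split: one {'X': [...], 'y': [...]} group per category."""
    groups = {}
    for row, label in zip(X, y):
        group = groups.setdefault(row[feature_index], {'X': [], 'y': []})
        group['X'].append(row)  # keep the full row, not only the split feature
        group['y'].append(label)
    return groups


print(candidate_thresholds([2.0, 4.0, 4.0, 6.0]))
print(group_by_category([['red'], ['blue'], ['red']], ['yes', 'no', 'yes'], 0))
```

Duplicate feature values produce no extra threshold, and each group keeps whole rows so that the chosen feature stays available to later splits.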

In [None]:
# TASK 2.4
def find_best_split(X, y, n_unique_classes=2):
    """
    TODO: Implement the best split finding logic for both categorical and continuous features

    Args:
        X: List of feature vectors
        y: List of target labels.
        n_unique_classes: The total number of unique classes in the dataset, used as log base.

    Returns:
        A tuple:
        (
            best_gain,             # float: Highest information gain achieved by any split
            best_feature_idx,      # int: Index of the feature that gives the best split
            best_split_type,       # str: Either 'categorical' or 'continuous'
            best_split_details     # dict: Structure varies based on split type (see below)
        )

        - For a categorical feature, best_split_details should be:
            {
                category_val_1: {'X': [...], 'y': [...]},
                category_val_2: {'X': [...], 'y': [...]},
                ...
            }

        - For a continuous feature, best_split_details should be:
            {
                'split_value': val,        # float: midpoint value used for binary split
                'left_X': [...], 'left_y': [...],
                'right_X': [...], 'right_y': [...]
            }

        If no valid split is found, return:
            (-1.0, None, None, None)
    """

    best_overall_gain = -1.0
    best_overall_feature_index = None
    best_overall_split_type = None
    best_overall_split_details = None

    """ YOUR CODE STARTS HERE """
    # Based on task description, it should work on multiple X categorical variables so including X can be [['red', 1.0], ['blue', 2.0], ['red', 3.0], ...]
    length = len(X)
    # best_overall_gain should remain at -1.0 if there is really no valid split so if X is []
    if length == 0:
      return best_overall_gain, best_overall_feature_index, best_overall_split_type, best_overall_split_details

    else: # I don't think i need to put the rest of the code inside the else, but i will do so just to be safe
      number_of_category = len(X[0])

      for feature_index in range(number_of_category):
        current_split_type = None # Needed since i get unbound error without it

        if isinstance(X[0][feature_index], str):
          # Categorical data
          current_split_type = "categorical"

          # Next we find the number of unique categories
          category_values = {}
          unique_categories = []

          for i in range(length):
            category_value = X[i][feature_index]

            if category_value not in unique_categories:
              unique_categories.append(category_value)
              category_values[category_value] = {'X': [], 'y': []}
            else:
              pass

            category_values[category_value]['X'].append(X[i])
            category_values[category_value]['y'].append(y[i])

          # print(unique_categories) # ['red', 'blue', 'green']
          # print(category_values) # {'red': {'X': [['red'], ['red']], 'y': ['yes', 'yes']}, 'blue': {'X': [['blue'], ['blue']], 'y': ['no', 'no']}, 'green': {'X': [['green']], 'y': ['no']}}
          # Is this the multi-split in "Perform a multi-way split, creating one child node per unique category (at least 2 categories)."

          # Create list_of_child_ys
          list_of_child_ys = []

          for j in unique_categories:
            category_group_dict = category_values.get(j, None) # Also a dictionary, {'X': [['red'], ['red']], 'y': ['yes', 'yes']}
            category_group_values = category_group_dict.get('y', None) # Should be a list, ['yes', 'yes']
            list_of_child_ys.append(category_group_values)

          # Implement multi-split??? since there is some best_overall_feature_index, i think it means that X can be [['red', 1.0], ['blue', 2.0], ['red', 3.0], ...]
          IG = information_gain(y, list_of_child_ys, n_unique_classes)

          if IG > best_overall_gain and IG > 0.0: # If IG is not positive then the split is not valid/meaningful hence we let best_overall_gain = -1.0 still
            best_overall_gain = IG
            best_overall_split_details = category_values
            best_overall_feature_index = feature_index
            best_overall_split_type = current_split_type

          elif IG == best_overall_gain and feature_index < best_overall_feature_index and IG > 0.0: # If IG is not positive then the split is not valid/meaningful hence we let best_overall_gain = -1.0 still:
            # If there is a tie in information gain, prefer the feature with the lowest index.
            best_overall_gain = IG
            best_overall_split_details = category_values
            best_overall_feature_index = feature_index
            best_overall_split_type = current_split_type

          else:
            pass

        else:
          # Continuous data, structure is similar to for categorical
          current_split_type = "continuous"

          list_of_tuples = []
          for k in range(length):
            list_of_tuples.append((X[k][feature_index], y[k], X[k])) # X = [['red', 1.0], ['blue', 2.0], ['red', 3.0], ...]

          Sorted_list = sorted(list_of_tuples, key=lambda x: x[0])
          # print(Sorted_list)

          # Original method, failed at 2.6 since we needed to keep the list-of-lists structure, so the tuple was extended from 2 to 3 elements
          # Sorted_X = []
          # Sorted_y = []

          # for l in range(length):
          #   Sorted_X.append(Sorted_list[l][0])
          #   Sorted_y.append(Sorted_list[l][1])

          # print(Sorted_X) # [2.0, 4.0, 6.0, 8.0, 10.0]

          # Find midpoints for splitting
          # mid_points = []

          for u in range(length - 1): # length = len(X)
            # We consider if there are duplicated values in Sorted_X if so we don't create a midpoint
            if Sorted_list[u][0] != Sorted_list[u+1][0]:
              mid_point = (Sorted_list[u][0] + Sorted_list[u+1][0]) / 2
              # mid_points.append(mid_point) # Not needed since we can handle all in a loop

              # Binary split by mid_point
              left_y = []
              right_y = []
              left_X = []
              right_X = []
              for r in Sorted_list:
                # print(Sorted_X) # Again legacy code
                # print(r)
                X_value = r[0]
                y_value = r[1]
                X_list_form = r[2]

                if X_value < mid_point: # Recall from lecture 3a, for our course we use < for the left subtree
                # Had to change from the sorted feature values to the original rows, since task 2.6 expects X to stay a list of lists
                  left_y.append(y_value) # left_y.append(Sorted_y[r])
                  left_X.append(X_list_form) # left_X.append(Sorted_X[r])
                else:
                  right_y.append(y_value) # right_y.append(Sorted_y[r])
                  right_X.append(X_list_form) # right_X.append(Sorted_X[r]

              # Now we evaluate based on IG
              list_of_child_ys = [left_y, right_y]
              IG = information_gain(y, list_of_child_ys, n_unique_classes)

              # Almost the same as categorical except for best_overall_split_details
              if IG > best_overall_gain and IG > 0.0: # If IG is not positive then the split is not valid/meaningful hence we let best_overall_gain = -1.0 still
                best_overall_gain = IG
                best_overall_feature_index = feature_index
                best_overall_split_type = current_split_type
                best_overall_split_details = {'split_value':mid_point, 'left_X':left_X, 'left_y':left_y, 'right_X':right_X, 'right_y':right_y}

                # {
                #   'split_value': val,        # float: midpoint value used for binary split
                #   'left_X': [...], 'left_y': [...],
                #   'right_X': [...], 'right_y': [...]
                # }
              elif IG == best_overall_gain and feature_index < best_overall_feature_index and IG > 0.0: # If IG is not positive then the split is not valid/meaningful hence we let best_overall_gain = -1.0 still
              # If there is a tie in information gain, prefer the feature with the lowest index.
                best_overall_gain = IG
                best_overall_feature_index = feature_index
                best_overall_split_type = current_split_type
                best_overall_split_details = {'split_value':mid_point, 'left_X':left_X, 'left_y':left_y, 'right_X':right_X, 'right_y':right_y}

              else:
                pass

            else:
              pass

    """ YOUR CODE ENDS HERE """

    return best_overall_gain, best_overall_feature_index, best_overall_split_type, best_overall_split_details

# TESTCASES 2.4
import math

# TEST 1: Categorical Feature Split
X = [['red'], ['blue'], ['red'], ['green'], ['blue']]
y = ['yes', 'no', 'yes', 'no', 'no']
n_unique_classes = 2
gain, feature_idx, split_type, split_details = find_best_split(X, y, n_unique_classes)

assert split_type == 'categorical'
assert feature_idx == 0
assert math.isclose(gain, 0.97095, rel_tol=1e-4)
assert isinstance(split_details, dict)
assert set(split_details.keys()) == {'red', 'blue', 'green'}

# TEST 2: Continuous Feature Split
X = [[2.0], [4.0], [6.0], [8.0], [10.0]]
y = ['yes', 'yes', 'no', 'no', 'no']
n_unique_classes = 2
gain, feature_idx, split_type, split_details = find_best_split(X, y, n_unique_classes)

assert split_type == 'continuous'
assert feature_idx == 0
assert isinstance(split_details, dict)
assert 'split_value' in split_details
assert split_details['split_value'] == 5.0  # Midpoint between 4.0 and 6.0
assert math.isclose(gain, 0.97095, rel_tol=1e-4)

# TEST 3: No Valid Split
X = [['same'], ['same'], ['same']]
y = ['yes', 'yes', 'yes']
n_unique_classes = 1
gain, feature_idx, split_type, split_details = find_best_split(X, y, n_unique_classes)

assert gain == -1.0
assert feature_idx is None
assert split_type is None
assert split_details is None

print('All test cases passed!')

All test cases passed!


### Task 2.5 - Implement the Recursive Tree Builder Function [2 Points]

Complete the function `build_tree_recursive` below using the `find_best_split` and `majority_class` functions defined before.

It works recursively to grow the tree branch by branch. At each step, it decides whether to stop growing (i.e., return a leaf node) or to split the data and create internal nodes with children. Use `majority_class` to label leaf nodes and `find_best_split` to decide the best way to divide data at internal nodes.

Base Cases: Stop Recursion When Any of These is True:

- The maximum depth (`max_depth`) is reached.
- There is no further valid split, according to `find_best_split`
- All `y` labels in the current subset are the same class (pure node).

In these cases, return the majority class of the current `y` subset as the leaf node's prediction.

Recursive Step: If No Base Case is Triggered

1. Call `find_best_split(X, y, n_unique_classes)` to determine:
   - Best feature index
   - Type of feature (`'continuous'` or `'categorical'`)
   - Split details (values and data partitions)

2. Construct a dictionary representing an internal node, which includes:
   - `'feature'`: index of the best splitting feature.
   - `'split_type'`: `'continuous'` or `'categorical'`.
   - For continuous features:
     - `'split_value'`: float value for binary split.
     - `'left'` and `'right'`: recursive child nodes.
   - For categorical features:
     - `'children_map'`: a dictionary mapping each category to a recursive child node.
     - `'default_prediction'`: the majority class at this node (used as fallback during inference).
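
To make the node layout concrete, here is a small hand-written tree in the format described above. The feature indices, the threshold, the category names, and the `count_leaves` helper are illustrative assumptions made for this sketch; they are not part of the assignment's required API.

```python
# Hand-built example tree (illustrative values only, not produced by find_best_split).
example_tree = {
    'feature': 0,
    'split_type': 'continuous',
    'split_value': 5.0,
    'left': 1,                       # leaf: predicts class 1
    'right': {                       # internal node that splits on feature 1
        'feature': 1,
        'split_type': 'categorical',
        'children_map': {'A': 0, 'B': 1},
        'default_prediction': 0,
    },
    'default_prediction': 1,
}


def count_leaves(node):
    """Hypothetical helper: count the leaves of a tree in the format above."""
    if not isinstance(node, dict):
        return 1  # anything that is not a dict is a leaf
    if node['split_type'] == 'continuous':
        return count_leaves(node['left']) + count_leaves(node['right'])
    return sum(count_leaves(child) for child in node['children_map'].values())


print(count_leaves(example_tree))
```

A leaf is any value that is not a dictionary, which is the same convention the base case of the prediction logic relies on.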

In [None]:
# TASK 2.5
def build_tree_recursive(X, y, depth, max_depth, n_unique_classes):
    """
    TODO: Implement the recursive tree building logic using the best split found.

    Args:
        X: Current subset of feature vectors.
        y: Current subset of target labels.
        depth: Current depth of the tree.
        max_depth: Maximum allowed depth for the tree.
        n_unique_classes: The total number of unique classes in the dataset, used as log base.

    Returns:
        Either:
            - A leaf node: the majority class label (int) if the tree should stop splitting.
            - OR a decision node (dict) with the following structure:

        For continuous features:
            {
                'feature': <feature_index>,                # int, index of feature used for split
                'split_type': 'continuous',               # str
                'split_value': <threshold>,               # float, the midpoint used for binary split
                'left': <left_subtree>,                   # recursive subtree or leaf
                'right': <right_subtree>,                 # recursive subtree or leaf
                'default_prediction': <majority_class>    # int, used for fallback predictions
            }

        For categorical features:
            {
                'feature': <feature_index>,                # int
                'split_type': 'categorical',              # str
                'children_map': {
                    <category_val>: <subtree_or_leaf>,    # one child per category
                    ...
                },
                'default_prediction': <majority_class>    # int
            }
    """

    res = None

    """ YOUR CODE STARTS HERE """
    gain, feature_idx, split_type, split_details = find_best_split(X, y, n_unique_classes) # Same call as in the Task 2.4 test cases
    # gain == -1 means find_best_split found no valid split
    unique_categories = dict.fromkeys(y, 0)
    stop = len(unique_categories) # Number of distinct labels; len(set(y)) is equivalent
    maj = majority_class(y)

    # Base case
    if depth == max_depth or gain == -1 or stop == 1:
      return(maj)

    else:
      # Construct a dictionary representing an internal node, which includes:
      # 'feature': index of the best splitting feature.
      # 'split_type': 'continuous' or 'categorical'.
      # For continuous features:
      # 'split_value': float value for binary split.
      # 'left' and 'right': recursive child nodes.
      # For categorical features:
      # 'children_map': a dictionary mapping each category to a recursive child node.
      # 'default_prediction': the majority class at this node (used as fallback during inference).

      # Most basic structure
      internal_node = {'feature':feature_idx, 'split_type':split_type, 'default_prediction':maj}

      if split_type == 'categorical':
        # Recall split_details = {'red': {'X': [['red'], ['red']], 'y': ['yes', 'yes']}, 'blue': {'X': [['blue'], ['blue']], 'y': ['no', 'no']}, 'green': {'X': [['green']], 'y': ['no']}}
        categories = split_details.keys()
        # We create the 'children_map' dictionary
        internal_node['children_map'] = {}

        for i in categories:
          dic = split_details[i] # dic is just the dict of [['red'], ['red']] and so on
          dic_X = dic['X']
          dic_y = dic['y']
          internal_node['children_map'][i] = build_tree_recursive(dic_X, dic_y, depth + 1, max_depth, n_unique_classes)
        return(internal_node)

      elif split_type == 'continuous':
        split_value = split_details['split_value']
        left_X = split_details['left_X']
        right_X = split_details['right_X']
        left_y = split_details['left_y']
        right_y = split_details['right_y']

        internal_node['split_value'] = split_value

        # left_X/left_y feed the 'left' child and right_X/right_y the 'right' child,
        # matching predict_one_instance, which sends value < split_value to the left.
        internal_node['left'] = build_tree_recursive(left_X, left_y, depth + 1, max_depth, n_unique_classes)
        internal_node['right'] = build_tree_recursive(right_X, right_y, depth + 1, max_depth, n_unique_classes)

        return internal_node

      else: # split_type is None: no usable split, so return a leaf instead of None
        return maj

      # Added edge case of X = [] for task 2.4

    """ YOUR CODE ENDS HERE """

    return res

# TESTCASES 2.5

test_cases = [
    # Test Case 1: Base Case - Max depth reached
    {
        "name": "Max depth reached",
        "X": [[1], [2], [3]],
        "y": [0, 1, 0],
        "depth": 2,
        "max_depth": 2,
        "n_unique_classes": 2,
        "expected": 0 # Majority class of [0, 1, 0] is 0
    },
    # Test Case 2: Base Case - All labels are the same
    {
        "name": "All labels same",
        "X": [[1], [2], [3]],
        "y": [0, 0, 0],
        "depth": 0,
        "max_depth": 5,
        "n_unique_classes": 2,
        "expected": 0 # Majority class of [0, 0, 0] is 0
    },
    # Test Case 3: Base Case - Empty X (should return majority of y)
    {
        "name": "Empty X",
        "X": [],
        "y": [0, 1, 0],
        "depth": 0,
        "max_depth": 5,
        "n_unique_classes": 2,
        "expected": 0 # Majority class of [0, 1, 0] is 0
    }
]

# Helper function for running tests
for i, tc in enumerate(test_cases):
    print(f"Running Test Case {i+1}: {tc['name']}")
    result = build_tree_recursive(tc['X'], tc['y'], tc['depth'], tc['max_depth'], tc['n_unique_classes'])

    if "expected_type" in tc:
        if tc["expected_type"] == "continuous_node":
            assert isinstance(result, dict) and result.get('split_type') == 'continuous', \
                f"Test Case {i+1} Failed: Expected continuous node, got {result}"
            print(f"Test Case {i+1} Passed: Correctly identified continuous node.")
        elif tc["expected_type"] == "categorical_node":
            assert isinstance(result, dict) and result.get('split_type') == 'categorical', \
                f"Test Case {i+1} Failed: Expected categorical node, got {result}"
            print(f"Test Case {i+1} Passed: Correctly identified categorical node.")
        elif tc["expected_type"] == "continuous_node_with_leaf_children":
            assert isinstance(result, dict) and result.get('split_type') == 'continuous', \
                f"Test Case {i+1} Failed: Expected continuous node at root, got {result}"
            assert not isinstance(result['left'], dict) and not isinstance(result['right'], dict), \
                f"Test Case {i+1} Failed: Expected leaf children, but found nodes. Left: {result['left']}, Right: {result['right']}"
            print(f"Test Case {i+1} Passed: Correctly built continuous node with leaf children.")
    else:
        assert result == tc['expected'], \
            f"Test Case {i+1} Failed: Expected {tc['expected']}, got {result}"
        print(f"Test Case {i+1} Passed: Correctly returned leaf value {result}.")
    print("-" * 30)

print('All test cases passed!')

Running Test Case 1: Max depth reached
Test Case 1 Passed: Correctly returned leaf value 0.
------------------------------
Running Test Case 2: All labels same
Test Case 2 Passed: Correctly returned leaf value 0.
------------------------------
Running Test Case 3: Empty X
Test Case 3 Passed: Correctly returned leaf value 0.
------------------------------
All test cases passed!


### Task 2.6  - Implement Prediction Logic for a single data point [2 Points]

Complete the `predict_one_instance` function below. It takes a single data point and "walks" it down the decision tree to make a prediction.

Starting from the root node:
- Recursively check the `split_type` (`'continuous'` or `'categorical'`) and the feature used at the current `tree_node`.
- Based on `x_instance`'s value for that feature, decide:
  - whether to go `left` or `right` (for continuous),
  - or which `category` branch to follow (for categorical).

Continue this process until you reach a leaf node, which contains the final predicted class.

Base Case:

- If `tree_node` is not a dictionary, it's a leaf node.
- Return its value directly (the predicted class).

Traversal Logic:

*   Continuous Split:
    - Compare `x_instance[feature_idx]` with `tree_node['split_value']`.
    - If less, recurse into `tree_node['left']`; else recurse into `tree_node['right']`.

*   Categorical Split:
    - Look up `x_instance[feature_idx]` in `tree_node['children_map']`.
    - If found, recurse into the matching child.
    - If unseen category, return `tree_node['default_prediction']`.
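
The traversal rules above can also be followed with a loop instead of recursion. The sketch below illustrates this on a hand-written tree; `walk_tree`, `toy_tree`, and all values in it are hypothetical names chosen for this example, and it is not the required `predict_one_instance` implementation.

```python
def walk_tree(x_instance, tree_node):
    """Follow the traversal rules iteratively until a leaf is reached."""
    node = tree_node
    while isinstance(node, dict):
        value = x_instance[node['feature']]
        if node['split_type'] == 'continuous':
            # Strictly less than the threshold goes left, everything else goes right.
            node = node['left'] if value < node['split_value'] else node['right']
        else:
            children = node['children_map']
            if value not in children:
                # Unseen category: fall back to this node's majority class.
                return node['default_prediction']
            node = children[value]
    return node


# Illustrative tree: a continuous split on feature 0, then a categorical split on feature 1.
toy_tree = {
    'feature': 0,
    'split_type': 'continuous',
    'split_value': 5.0,
    'left': 1,
    'right': {
        'feature': 1,
        'split_type': 'categorical',
        'children_map': {'A': 0, 'B': 1},
        'default_prediction': 0,
    },
    'default_prediction': 1,
}

print(walk_tree([4.0, 'A'], toy_tree))
print(walk_tree([6.0, 'Z'], toy_tree))
```

Note that a value exactly equal to `split_value` goes right, because the comparison is strict.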

In [None]:
# TASK 2.6
def predict_one_instance(x_instance, tree_node):
    """
    TODO: Traverse the decision tree to make a prediction for a single input instance.

    Args:
        x_instance: A single feature vector (e.g., [val1, val2])
        tree_node: The current node of the decision tree (starts with the root)

    Returns:
        The predicted class label
    """

    predicted_label = None

    """ YOUR CODE STARTS HERE """
    # Base case
    if not isinstance(tree_node, dict): # A node that is not a dict is a leaf
      return(tree_node) # Return its value directly (the predicted class).

    else:
      # Follow naming convention from earlier
      split_type = tree_node['split_type']
      feature_idx = tree_node['feature']
      value = x_instance[feature_idx]

      if split_type == 'categorical':
        # Unseen categories fall back to default_prediction, which is itself a leaf value
        matching_child = tree_node['children_map'].get(value, tree_node['default_prediction'])

        # Recursing covers both cases: a leaf (of any label type, not only int) is returned
        # by the base case above, and a subtree is traversed further.
        return(predict_one_instance(x_instance, matching_child))

      elif split_type == 'continuous':
        split_value = tree_node['split_value']

        # Compare x_instance[feature_idx] with tree_node['split_value'].
        if value < split_value:
          # If less, recurse into tree_node['left']; else recurse into tree_node['right'].
          return(predict_one_instance(x_instance, tree_node['left']))

        else:
          return(predict_one_instance(x_instance, tree_node['right']))

      else:
        pass



    """ YOUR CODE ENDS HERE """

    return predicted_label

# TESTCASES 2.6

# Test Case 1: Build a simple tree with a continuous split and predict.
X_tc1 = [[3.0], [7.0], [2.0], [8.0]]
y_tc1 = [0, 1, 0, 1]
n_unique_classes_tc1 = 2
max_depth_tc1 = 1
tree_tc1 = build_tree_recursive(X_tc1, y_tc1, 0, max_depth_tc1, n_unique_classes_tc1)
# Predict an instance that goes left
prediction_tc1_left = predict_one_instance([2.5], tree_tc1)
assert prediction_tc1_left == 0, f"Test Case 1 Failed (left prediction): Expected 0, got {prediction_tc1_left}"
print("Test Case 1 Passed (build & continuous prediction left)")
# Predict an instance that goes right
prediction_tc1_right = predict_one_instance([6.0], tree_tc1)
assert prediction_tc1_right == 1, f"Test Case 1 Failed (right prediction): Expected 1, got {prediction_tc1_right}"
print("Test Case 1 Passed (build & continuous prediction right)")


# Test Case 2: Build a simple tree with a categorical split and predict.
X_tc2 = [['apple'], ['banana'], ['apple'], ['orange']]
y_tc2 = [0, 1, 0, 1]
n_unique_classes_tc2 = 2
max_depth_tc2 = 1
tree_tc2 = build_tree_recursive(X_tc2, y_tc2, 0, max_depth_tc2, n_unique_classes_tc2)
# Predict a known category ('apple')
prediction_tc2_apple = predict_one_instance(['apple'], tree_tc2)
assert prediction_tc2_apple == 0, f"Test Case 2 Failed (apple prediction): Expected 0, got {prediction_tc2_apple}"
print("Test Case 2 Passed (build & categorical prediction 'apple')")
# Predict an unknown category ('grape') - should use default_prediction
prediction_tc2_grape = predict_one_instance(['grape'], tree_tc2)
assert prediction_tc2_grape == majority_class(y_tc2), f"Test Case 2 Failed (grape prediction): Expected default {majority_class(y_tc2)}, got {prediction_tc2_grape}"
print("Test Case 2 Passed (build & categorical prediction 'grape' - default)")

# Test Case 3: Mixed features - build a tree, and predict a continuous feature value path.
X_tc3 = [[10, 'A'], [2, 'B'], [12, 'A'], [3, 'B']]
y_tc3 = [0, 1, 0, 1]
n_unique_classes_tc3 = 2
max_depth_tc3 = 1
tree_tc3 = build_tree_recursive(X_tc3, y_tc3, 0, max_depth_tc3, n_unique_classes_tc3)
# Predict an instance (continuous feature 0 value: 4, which is <= 5)
prediction_tc3 = predict_one_instance([4, 'C'], tree_tc3) # The 'C' doesn't matter for this split
assert prediction_tc3 == 1, f"Test Case 3 Failed: Expected 1, got {prediction_tc3}" # Majority of [1,1] from left branch if split at 5
print("Test Case 3 Passed (mixed features - continuous path)")

print('All test cases passed!')

Test Case 1 Passed (build & continuous prediction left)
Test Case 1 Passed (build & continuous prediction right)
Test Case 2 Passed (build & categorical prediction 'apple')
Test Case 2 Passed (build & categorical prediction 'grape' - default)
Test Case 3 Passed (mixed features - continuous path)
All test cases passed!


### Task 2.7 - Decision Tree Classifier [1 Point]

Complete the class below using the `majority_class`, `build_tree_recursive` and `predict_one_instance` functions defined before. We have already imported them for you on Coursemology.

This class brings together all the previous helper functions to create, train, and use a decision tree for classification tasks.

* `fit(self, X, y)`:
    *   Calculate and store the total number of unique classes from `y` in `self.n_unique_classes`.
    *   Then call `build_tree_recursive` to start building the tree, passing all necessary parameters including `self.max_depth` and `self.n_unique_classes`. Store the resulting tree structure in `self.tree`.
    *   DO NOT return anything!

*   `predict(self, X)`:
    *   If `self.tree` is a leaf node (not a dictionary), simply return a list of that leaf value repeated for each instance in `X`.
    *   Otherwise, iterate through each `x_instance` in the input `X` and call `predict_one_instance` to get a prediction for each. Collect these predictions into a list and return it.
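
The two small building blocks mentioned above (counting unique classes and repeating a leaf value) can be sketched with basic Python only. The variable names and sample data below are assumptions made for illustration.

```python
# Illustrative data; labels may be strings or ints.
y = ['yes', 'no', 'yes', 'maybe']
X = [[1.0], [2.0], [3.0]]

# Number of distinct class labels, as fit() needs to store.
n_unique_classes = len(set(y))

# When the whole tree is a single leaf, every instance gets the same prediction.
leaf_value = 'yes'
predictions = [leaf_value] * len(X)

print(n_unique_classes, predictions)
```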

In [None]:
# TASK 2.7
class DTClassifier:
    def __init__(self, max_depth=2):
        """
        Args:
            max_depth (int): The maximum depth of the tree.
        """
        self.max_depth = max_depth
        self.tree = None
        self.n_unique_classes = None # To store the total number of unique classes

    def fit(self, X, y):
        """
        TODO: Fit the decision tree to the training data. DO NOT return anything!

        Args:
            X: A list of feature vectors for training.
            y: A list of corresponding target labels.
        """

        """ YOUR CODE STARTS HERE """
        # Calculate and store the total number of unique classes from y in self.n_unique_classes
        unique_categories = dict.fromkeys(y, 0)
        total = len(unique_categories)
        self.n_unique_classes = total

        self.tree = build_tree_recursive(X, y, 0, self.max_depth, self.n_unique_classes)

        """ YOUR CODE ENDS HERE """

    def predict(self, X):
        """
        TODO: Predict class labels for each instance in X using the fitted tree

        Args:
            X: A list of feature vectors for prediction.

        Returns:
            A list of predicted class labels.
        """

        predicted_labels = []

        """ YOUR CODE STARTS HERE """
        # If self.tree is a leaf node (not a dictionary), simply return a list of that leaf value repeated for each instance in X.
        if not isinstance(self.tree, dict):
          length = len(X)

          for i in range(length):
            predicted_labels.append(self.tree)

        # Otherwise, iterate through each x_instance in the input X and call predict_one_instance to get a prediction for each. Collect these predictions into a list and return it.
        else:
          for j in X:
            res = predict_one_instance(j, self.tree)
            predicted_labels.append(res)

        """ YOUR CODE ENDS HERE """

        return predicted_labels

# TESTCASES 2.7

# Test Case 1
X = [[1.0], [2.0], [3.0]]
y = [0, 0, 0]
clf = DTClassifier(max_depth=5)
clf.fit(X, y)
X_predict = [[10.0], [20.0]]
predictions = clf.predict(X_predict)
assert predictions == [0,0] , f"Prediction for a single-leaf tree should return that leaf's value for all instances."

# Test Case 2
X_train = [[3.0], [7.0], [2.0], [8.0]]
y_train = [0, 1, 0, 1]
max_depth = 1
clf = DTClassifier(max_depth=max_depth)
clf.fit(X_train, y_train)
X_predict = [[2.5], [6.0], [4.9], [5.1]]
predictions = clf.predict(X_predict)
assert predictions == [0, 1, 0, 1] , f"Prediction for continuous features should correctly traverse the tree."

# Test Case 3
X_train = [['red'], ['blue'], ['red'], ['green']]
y_train = [0, 1, 0, 1]
max_depth = 1
clf = DTClassifier(max_depth=max_depth)
clf.fit(X_train, y_train)
X_predict = [['red'], ['blue'], ['yellow']]
predictions = clf.predict(X_predict)
assert predictions == [0, 1, majority_class(y_train)] , f"Prediction for categorical features should correctly use children_map and handle unseen categories."

print('All test cases passed!')

All test cases passed!


## Task 3 - Practical [2 Points]

Train a DT classifier on the training dataset using `scikit-learn` and tune its hyperparameters to optimize performance.

You will get full marks if your modelling is appropriate and performs well. But remember, you **MUST NOT** use or access X_test and y_test in your code, as this defeats the purpose of a hidden test set. Any model that does so will be given 0 marks.

Make sure that you have installed `scikit-learn` in your python environment.

**HINT**: Set the `random_state` parameter (if it exists) to a fixed constant to make your model reproducible (same result on every run).
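
As a rough illustration of reproducible tuning, the sketch below runs a small cross-validated grid search over a single decision tree, using only the training split. The grid values, the number of folds, and `random_state=0` are arbitrary choices for this example, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=41)

# Small illustrative grid; a fixed random_state keeps every run identical.
param_grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [1, 2, 3]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring='accuracy',
)

# Tuning sees only the training data; X_test and y_test are never touched here.
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

With its default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set, so `search.best_estimator_` can be returned directly as the trained model.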

In [None]:
# TASK 3
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=41)

def train_model(X_train, y_train):
    """
    TODO: Train and return a DT classifier.

    Args:
        X_train: Training feature vectors
        y_train: Training labels

    Returns:
        A trained sklearn model, your model will be used to predict the labels of test data
    """

    model = None

    """ YOUR CODE STARTS HERE """
    # Import packages
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import StratifiedKFold, GridSearchCV

    # Not much data preprocessing is needed for decision trees (standardising or outlier removal), so it is likely that most of the improvement will come from hyperparameter tuning
    # Just do the basic check for missing values

    # print(type(X_train[0][1])) # X_train is a nested array of float type inside
    # print(type(y_train[0])) # y_train is an array of int type
    # Here "array" refers to a NumPy array, which I CANNOT use :(

    # print(len(X_train)) Only 105
    # print(len(y_train)) As expected also same at 105
    # So small sample size

    # Same code from assignment 1
    ## Count missing values, treating None, 'NA', 'Na', and 'null' (but not 0) as missing
    count_X = 0
    count_y = 0

    for sub_list in X_train:
      for i in sub_list:
        if i is None or i == 'NA' or i == 'Na' or i == 'null':
          count_X += 1

    for j in y_train:
      if j is None or j == 'NA' or j == 'Na' or j == 'null':
        count_y += 1

    # print(count_X) # 0
    # print(count_y) # 0
    # As expected the iris data set has no missing data
    # End of checking for missing values

    # Idea: use different ensembles for DT, such as 1. Random Forest, 2. AdaBoost, 3. Bagging, 4. GradientBoosting (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
    ### Checked with prof, we can use ensembles
    # We use GridSearchCV and loop for all of them, keeping the best model
    # IMPORTANT: Based on task description, it is a DT classifier NOT regressor
    # DecisionTree, decided to include it just for completeness

    best_score = 0
    best_model = None
    best_params = None

    DT_ensemble = [
        # RandomForest (RF) from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
        # Decided to try random_state = 3244 instead of 42 for once
        (RandomForestClassifier(random_state = 3244), {
            'n_estimators': [50, 100, 200, 300],
            'max_depth': [None, 2, 3, 4, 5, 6],
            'min_samples_split': [2, 3, 4, 5], # default is 2
            'min_samples_leaf': [1, 2, 3], # default is 1
            'max_features': ['sqrt', 'log2', None],
            'bootstrap': [True, False],
            'class_weight': [None, 'balanced']
        }),
        # AdaBoost from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html, honestly not much to tune
        (AdaBoostClassifier(random_state = 3244), {
            'n_estimators': [50, 100, 200, 300],
            'learning_rate': [0.01, 0.1, 0.5, 1.0]
        }),
        # GradientBoosting from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#gradientboostingclassifier
        (GradientBoostingClassifier(random_state = 3244), {
            'n_estimators': [50, 100, 200, 300],
            'learning_rate': [0.01, 0.1, 0.5, 1.0],
            'max_depth': [None, 2, 3, 4, 5, 6],
            'min_samples_split': [2, 3, 4, 5],
            'min_samples_leaf': [1, 2, 3],
            'max_features': ['sqrt', 'log2', None],
            'subsample': [0.8, 1.0] # subsample < 1.0 is stochastic gradientboosting
        }),
        # Bagging from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
        (BaggingClassifier(random_state = 3244), {
            'n_estimators': [50, 100, 200, 300],
            'max_samples': [0.5, 1.0], # A float is a fraction of X.shape[0]; an int 1 would draw a single sample
            'max_features': [0.5, 1.0], # Same rule: a float is a fraction of the features, an int is a count
            'bootstrap': [True, False],
            'bootstrap_features': [True, False] # Whether features are drawn with replacement.
        }),
        # DecisionTree, decided to include it just for completeness from https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
        (DecisionTreeClassifier(random_state = 3244), {
            'criterion': ['gini', 'entropy', 'log_loss'],
            'max_depth': [None, 2, 3, 4, 5, 6],
            'min_samples_split': [2, 3, 4, 5], # default is 2
            'min_samples_leaf': [1, 2, 3], # default is 1
            'max_features': ['sqrt', 'log2', None],
            'class_weight': [None, 'balanced']
        })
    ]

    # Initialize StratifiedKFold
    stratified_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=3244)

    for i, j in DT_ensemble: # list of tuple where i is the classifier and j is the parameters (as a dict)
      grid_search = GridSearchCV(estimator=i, param_grid = j, cv=stratified_kf, n_jobs = -1, scoring = "accuracy")
      grid_search.fit(X_train, y_train)

      # Update best_score_ and best_model
      if grid_search.best_score_ > best_score:
        best_score = grid_search.best_score_
        best_model = grid_search.best_estimator_

        # Find best hyperparameters so that we don't need to run the code for another 2h+
        best_params = grid_search.best_params_ # Hyperparameters that gave the best model performance

        model = best_model

      else:
        pass

    print("Best Hyperparameters:", best_params)
    print("Best model:", best_model)
    print("Best Score:", best_score)

    ## Manual process to run model using best hyperparameter without gridsearch (Uncomment this and comment out gridsearch if you don't want to wait 1 hour+ when testing)

    # model = BaggingClassifier(bootstrap=False, bootstrap_features=True, max_features=0.5, max_samples=0.5, n_estimators=50, random_state=3244)
    # model.fit(X_train, y_train)

    ## End of manual process to run model using best hyperparameter

    # First run attempt, had to stop

    # Second run attempt, ran out of free computing units so I had to switch to running on my local computer

    # Test 3
    # Best Hyperparameters: {'bootstrap': False, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 0.5, 'n_estimators': 50}
    # Best model: BaggingClassifier(bootstrap=False, bootstrap_features=True, max_features=0.5,
    #               max_samples=0.5, n_estimators=50, random_state=3244)
    # Best Score: 0.980952380952381
    # Model accuracy: 0.89

    """ YOUR CODE ENDS HERE """

    return model


# TESTCASES 3
# Our hidden test cases will use your code to train a model to predict the labels of the test data, not necessarily on the same train-test split.
# Note: If your model is poorly designed or performs poorly, points may be deducted.

model = train_model(X_train, y_train)
# Check if the model can predict
predictions = model.predict(X_test)
assert len(predictions) == len(X_test)
accuracy_score = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy_score:.2f}")

KeyboardInterrupt: 

## END OF ASSIGNMENT