## Assignment 2 ##

Welcome to your second assignment. This assignment aims to help you understand 
tree-based models and model evaluation.

**1** Load the diabetes dataset that ships with sklearn, a regression task 
that predicts disease progression in a patient one year after baseline, and 
store it in a variable named `diabetes`.

In [1]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)

In [2]:
"""Test that the dataset was read correctly"""
assert diabetes.feature_names == ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3',
                                  's4', 's5', 's6']

**2** Split the dataset into a training subset (80%) and a test subset (20%) 
using the `train_test_split` function with a random seed of 42. Store the
training set in variables `X_train` and `y_train`, and the test set in 
variables `X_test` and `y_test`.

In [3]:
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array(diabetes.data)
y = np.array(diabetes.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, 
                                                    random_state=42)

In [4]:
"""Test that the dataset was split correctly"""
assert round(X_train[0][8], 5) == 0.02736
assert round(X_test[0][8], 5) == 0.03243

**3** Using the training data, train two regression trees with splitting 
criteria `squared_error` and `poisson` 
([see section 1.10.7.2](https://scikit-learn.org/stable/modules/tree.html#tree)),
setting the random seed to 42 and the maximum depth to 3. Store their 
predictions on the test data in variables `y_pred1` (squared_error) and 
`y_pred2` (poisson) respectively. Compute the $R^2$ score of each set of 
predictions and store the scores in variables `r2_squared_error` and 
`r2_poisson`.
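
For reference, $R^2$ compares a model's squared error with the squared error of always predicting the mean of the true targets. The sketch below is a plain-Python illustration of that definition, not the sklearn implementation; the helper name `r2_by_hand` is hypothetical.

```python
def r2_by_hand(y_true, y_pred):
    """Illustrative R^2: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_by_hand(y_true, y_true))                # perfect predictions
print(r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

A perfect model scores 1, always predicting the mean scores 0, and a model that does worse than the mean scores below 0.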

In [5]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Same depth and seed for both trees; only the splitting criterion differs.
tree_squared_error = DecisionTreeRegressor(max_depth=3,
                                           criterion="squared_error",
                                           random_state=42)
tree_poisson = DecisionTreeRegressor(max_depth=3, criterion="poisson",
                                     random_state=42)

y_pred1 = tree_squared_error.fit(X_train, y_train).predict(X_test)
y_pred2 = tree_poisson.fit(X_train, y_train).predict(X_test)

r2_squared_error = r2_score(y_test, y_pred1)
r2_poisson = r2_score(y_test, y_pred2)

In [6]:
"""Test that the metrics are computed correctly"""
assert round(r2_squared_error, 3) == 0.329
assert round(r2_poisson, 3) == 0.391
assert len(y_pred1) == len(y_pred2)

**4** Using the 
[tree_](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py) attribute of the model 
below, write code that displays the path the tree follows to predict the first 
case of the test set. If a variable is checked more than once on the path with 
the same comparison sign, keep only the last, narrowest constraint. The 
examples of correct behavior below show the if-then form the path should take.
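
The rule about repeated variables can be illustrated without a tree. In this sketch the conditions met along a path are given as `(feature, sign, threshold)` tuples, an assumed representation chosen only for the example; keying a dict on `(feature, sign)` lets a later test on the same feature and sign overwrite the earlier one. The name `merge_conditions` and the sample path are hypothetical.

```python
def merge_conditions(conditions):
    """Keep the last threshold seen for each (feature, sign) pair."""
    merged = {}
    for feature, sign, threshold in conditions:
        # A repeated key keeps its original position but takes the new value.
        merged[(feature, sign)] = threshold
    return [(feature, sign, threshold)
            for (feature, sign), threshold in merged.items()]


# Hypothetical path: bmi is tested twice with "<=" and once with ">".
path = [("bmi", "<=", 0.07), ("bmi", ">", 0.01), ("bmi", "<=", 0.05)]
print(merge_conditions(path))
```

In a fitted tree, a repeated `<=` test on a feature can only lower its upper bound and a repeated `>` test can only raise its lower bound, so the last constraint is also the narrowest one.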

In [7]:
from sklearn.tree import export_text
model = DecisionTreeRegressor(max_depth=3, random_state=42)
model.fit(X_train, y_train)

In [8]:
def get_path(tree, instance, feature_names):
    """Return the decision path for `instance` as an if-then rule string."""
    tree_ = tree.tree_
    feature = tree_.feature
    threshold = tree_.threshold
    # Maps (feature name, sign) -> threshold. Testing a feature again with
    # the same sign overwrites the earlier entry, keeping the last constraint.
    path_dict = {}
    node_id = 0

    def build_path(path_dict, leaf_value):
        # The output keywords stay in Greek because the tests compare against
        # them: "αν" = "if", "και" = "and", "τότε" = "then".
        conditions = " και ".join(
            "{} {} {}".format(name, sign, round(value, 2))
            for (name, sign), value in path_dict.items()
        )
        return "αν {} τότε {}".format(conditions, round(leaf_value, 2))

    # Internal nodes store a feature index; leaves store -2, so the feature
    # name is only looked up while we are still on an internal node.
    while feature[node_id] != -2:
        feature_name = feature_names[feature[node_id]]
        if instance[feature[node_id]] <= threshold[node_id]:
            sign = "<="
            next_node_id = tree_.children_left[node_id]
        else:
            sign = ">"
            next_node_id = tree_.children_right[node_id]

        path_dict[(feature_name, sign)] = threshold[node_id]
        node_id = next_node_id

    # The loop ends on a leaf, whose stored value is the prediction.
    leaf_value = tree_.value[node_id][0][0]
    return build_path(path_dict, leaf_value)

In [9]:
"""Test that the path is computed correctly"""
assert get_path(model, X_test[0], diabetes.feature_names) == 'αν bmi <= 0.01 και s5 > 0.01 και s4 <= 0.09 τότε 159.57'
assert get_path(model, X_test[1], diabetes.feature_names) == 'αν bmi > 0.01 και bmi <= 0.07 και s6 <= 0.03 τότε 175.8'
assert get_path(model, X_test[6], diabetes.feature_names) == 'αν bmi > 0.07 και s2 > 0.02 τότε 225.75'
assert get_path(model, X_test[10], diabetes.feature_names) == 'αν bmi <= 0.01 και s5 <= 0.01 και s5 > -0.04 τότε 109.92'

**5** Split the diabetes dataset into a training subset of 60% 
(X_train, y_train), a validation subset of 15% (X_val, y_val), and a test 
subset of 25% (X_test, y_test) using a random seed of 42.
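
Because `train_test_split` produces two parts per call, a three-way split takes two calls, and the second fraction has to be expressed relative to what remains after the first. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def second_stage_fraction(val_fraction, test_fraction):
    """Fraction of the non-test remainder that must go to validation."""
    remainder = 1.0 - test_fraction  # share left after removing the test set
    return val_fraction / remainder


# 15% validation and 25% test, both measured against the full dataset.
print(second_stage_fraction(0.15, 0.25))
```

For this exercise the result is approximately 0.2, so the second call should pass a `test_size` of 0.2 on the remaining 75% of the data.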

In [10]:
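# First hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.25,
                                                    random_state=42)
# Then take 20% of the remaining 75% for validation
# (0.2 * 0.75 = 0.15 of the full dataset), leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=.2,
                                                  random_state=42)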

In [11]:
"""Test that the dataset was split correctly"""
assert round(X_train[0][8], 5) == -0.08024
assert round(X_val[0][8], 5) == 0.09924
assert round(X_test[0][8], 5) == 0.03243

**6** Implement a function that, given a regression tree model, tries every 
integer value of the `min_samples_leaf` parameter between a lower bound and an 
upper bound (both inclusive), training the model on the given training set each 
time, and returns the value with the lowest mean squared error on the 
validation set. Additionally, return the mean squared error (rounded to the 
nearest integer) achieved on the test set by the model with the best 
performance on the validation set.
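
The selection logic is independent of the model. The sketch below separates it from sklearn by taking two hypothetical callables, `val_error` and `test_error`, that map a candidate value to an error; starting the best validation error at infinity means the first candidate is always accepted.

```python
def select_by_validation(candidates, val_error, test_error):
    """Return (best candidate, its test error), chosen by validation error."""
    best, best_val, best_test = None, float("inf"), None
    for candidate in candidates:
        current = val_error(candidate)
        # Strict "<" keeps the earliest candidate when errors tie.
        if current < best_val:
            best, best_val = candidate, current
            best_test = test_error(candidate)
    return best, best_test


# Toy errors: the validation error is lowest at 3.
print(select_by_validation(range(1, 6),
                           val_error=lambda k: (k - 3) ** 2,
                           test_error=lambda k: 10 * k))
```

The test error is reported for the chosen candidate but never takes part in the comparison, which keeps the test set out of the tuning decision.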



In [12]:
from sklearn.metrics import mean_squared_error

def best_min_samples_leaf(model, lower, upper):
    """Tune min_samples_leaf on the validation set; report the test MSE.

    Uses the X_train/X_val/X_test splits defined in the cells above.
    """
    best_min_samples = None
    # Start from infinity so the first candidate is always recorded.
    best_mse_val = float("inf")
    best_mse_test = None

    for min_samples_leaf in range(lower, upper + 1):
        model.set_params(min_samples_leaf=min_samples_leaf)
        model.fit(X_train, y_train)

        mse_val = mean_squared_error(y_val, model.predict(X_val))

        # Strict "<" keeps the smallest value when validation errors tie.
        if mse_val < best_mse_val:
            best_mse_val = mse_val
            best_min_samples = min_samples_leaf
            best_mse_test = mean_squared_error(y_test,
                                               model.predict(X_test))

    return (best_min_samples, round(best_mse_test))

In [13]:
"""Test that the function is implemented correctly"""
model = DecisionTreeRegressor(max_depth=3, random_state=42)
assert best_min_samples_leaf(model, 1, 8) == (7, 3358)
assert best_min_samples_leaf(model, 5, 15) == (11, 3271)

**7** Implement a function that, given a regression model, a dataset, and a 
metric, evaluates it using the "leave one out" cross-validation method.
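
In leave-one-out cross-validation each sample serves as the test set exactly once while the model is fitted on all the others. The sketch below shows the mechanics in plain Python, with a mean predictor standing in for the regression model; the function names are hypothetical.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_indices, test_index) for every sample in turn."""
    for test_index in range(n_samples):
        train = [i for i in range(n_samples) if i != test_index]
        yield train, test_index


def loo_mse_of_mean_predictor(y):
    """Leave-one-out squared error of predicting the training mean."""
    errors = []
    for train, test_index in leave_one_out_indices(len(y)):
        # "Fit" on everything except the held-out sample.
        prediction = sum(y[i] for i in train) / len(train)
        errors.append((y[test_index] - prediction) ** 2)
    return sum(errors) / len(errors)


print(loo_mse_of_mean_predictor([1.0, 2.0, 3.0, 6.0]))
```

Each fold scores a single prediction, so the metric has to be defined for one sample: mean squared error is, whereas $R^2$ is not, because the total sum of squares of a single target is zero.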

In [14]:
from sklearn.model_selection import LeaveOneOut
import numpy as np

def leave_one_out(model, X, y, metric):
    """Return the mean of `metric` over all leave-one-out folds."""
    # Accept DataFrames/Series as well as arrays, and index by position.
    X = np.asarray(X)
    y = np.asarray(y)

    scores = []

    for train_index, test_index in LeaveOneOut().split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Refit from scratch on every fold, then score the held-out sample.
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        scores.append(metric(y_test, y_pred))

    return np.mean(scores)

In [15]:
"""Test that the function is implemented correctly"""
model = DecisionTreeRegressor(max_leaf_nodes=6, min_samples_leaf=2,
                              random_state=42)
assert round(leave_one_out(model, diabetes.data, diabetes.target,
                           mean_squared_error), 0) == 3354