# Coursework 1 - Supervised learning

**Replace CID in the file name with your CID**

# Outline


- [Task 1](#task-1): Regression <a name="index-task-1"></a>
  - [(1.1)](#task-11) Random Forest <a name="index-task-11"></a>
    - [(1.1.1)](#task-111) <a name="index-task-111"></a>
    - [(1.1.2)](#task-112) <a name="index-task-112"></a>
    - [(1.1.3)](#task-113) <a name="index-task-113"></a>
  - [(1.2)](#task-12) Multi-layer Perceptron <a name="index-task-12"></a>
    - [(1.2.1)](#task-121) <a name="index-task-121"></a>
    - [(1.2.2)](#task-122) <a name="index-task-122"></a>
    - [(1.2.3)](#task-123) <a name="index-task-123"></a>
- [Task 2](#task-2): Classification <a name="index-task-2"></a>
  - [(2.1)](#task-21) k-Nearest Neighbours <a name="index-task-21"></a>
    - [(2.1.1)](#task-211)  <a name="index-task-211"></a>
    - [(2.1.2)](#task-212) <a name="index-task-212"></a>
    - [(2.1.3)](#task-213) <a name="index-task-213"></a>
    - [(2.1.4)](#task-214) <a name="index-task-214"></a>
  - [(2.2)](#task-22) Logistic regression vs kernel logistic regression <a name="index-task-22"></a>
    - [(2.2.1)](#task-221) <a name="index-task-221"></a>
    - [(2.2.2)](#task-222) <a name="index-task-222"></a>
    - [(2.2.3)](#task-223) <a name="index-task-223"></a>



---



<a name="task-1"></a>

# (1) Task 1: Regression [(index)](#index-task-1)

<a name="task-11"></a>

## (1.1) Random Forest [(index)](#index-task-11)

In [350]:
# import packages
from collections import defaultdict
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

In [351]:
# load the training and test datasets
# (raw string, so the backslash in "\mu" is not parsed as an escape sequence)
target_column = r"Capacitance ($\mu F / cm^2$)"

data = pd.read_csv('nanoelectrodes_capacitance_samples.csv')
X_train = data.drop(target_column, axis='columns')
y_train = data[target_column]

test_data = pd.read_csv('nanoelectrodes_capacitance_test.csv')
X_test = test_data.drop(target_column, axis='columns')
y_test = test_data[target_column]

<a name="task-111"></a>

### (1.1.1) [(index)](#index-task-111)

In [352]:
def loss(y, y_pred):
    return ((y - y_pred) ** 2).sum()

In [353]:
def mse(y, y_pred):
    return ((y - y_pred) ** 2).mean()

In [354]:
def rsq(y, y_pred):
    return 1 - (np.linalg.norm(y - y_pred) ** 2) / (np.linalg.norm(y - y.mean()) ** 2)
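
As a quick sanity check of the three metrics above, the sketch below works through the same formulas by hand on a tiny made-up example. The arrays `y_toy` and `y_hat_toy` are hypothetical and unrelated to the capacitance data.

```python
import numpy as np

# Hypothetical toy targets and predictions (not from the dataset).
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_toy = np.array([1.0, 2.0, 3.0, 6.0])

# Sum of squared errors: only the last prediction is off, by 2.
sse_toy = np.sum((y_toy - y_hat_toy) ** 2)

# Mean squared error: the SSE averaged over the four samples.
mse_toy = sse_toy / y_toy.size

# Total sum of squares around the mean of the targets.
sst_toy = np.sum((y_toy - y_toy.mean()) ** 2)

# Coefficient of determination: R^2 = 1 - SSE / SST.
r2_toy = 1.0 - sse_toy / sst_toy
```

Here SSE = 4, MSE = 1 and SST = 5, so R^2 = 0.2. A model that always predicts the mean of the targets would give R^2 = 0.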

In [355]:
def split_samples(X, y, column, value):
    """
    Split the data on the column-th (numerical) feature: samples whose feature
    value is strictly less than `value` go left, all others go right.

    Arguments:
        X: training features, of shape (N, p).
        y: vector of training labels, of shape (N,).
        column: the column index of the feature used for splitting.
        value: the splitting threshold.
    Returns:
        tuple(np.array, np.array): tuple of the left split data (X_l, y_l).
        tuple(np.array, np.array): tuple of the right split data (X_r, y_r).
    """

    left_mask = (X[:, column] < value)

    # Using the binary masks `left_mask`, we split X and y.
    X_l, y_l = X[left_mask], y[left_mask] 
    X_r, y_r = X[~left_mask], y[~left_mask] 

    return (X_l, y_l), (X_r, y_r)
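
The boolean-mask split used above can be illustrated on a small hand-made array. This is only a sketch: `X_toy`, `y_toy` and the threshold 3.0 are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])
y_toy = np.array([0.1, 0.2, 0.3, 0.4])

# Samples with feature 0 strictly below the threshold go left.
mask = X_toy[:, 0] < 3.0

X_left, y_left = X_toy[mask], y_toy[mask]
# The complement of the mask gives the right child (values >= threshold).
X_right, y_right = X_toy[~mask], y_toy[~mask]
```

Because the comparison is strict, a sample whose feature value equals the threshold ends up in the right child.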

In [356]:
def loss_split_value(X, y, column):
    """
    Find the threshold on `column` that minimizes the summed squared error of the two children.
    Arguments:
        X: training features, of shape (N, p).
        y: vector of training labels, of shape (N,).
        column: the column index of the feature to split on, 0 <= column < p.
    Returns:
        (float, float): the minimal loss and the corresponding splitting threshold.
    """

    unique_vals = np.unique(X[:, column])

    assert len(unique_vals) > 1, f"There must be more than one distinct feature value. Given: {unique_vals}."

    loss_val, threshold = np.inf, None

    # try every distinct value of the feature as a threshold and compute the loss
    for value in unique_vals:
        (X_l, y_l), (X_r, y_r) = split_samples(X, y, column, value)

        # if one of the two sides is empty, skip this split.
        if len(y_l) == 0 or len(y_r) == 0:
            continue

        # sum of squared errors around the mean of each child
        new_loss = loss(y_l, y_l.mean()) + loss(y_r, y_r.mean())
        if new_loss < loss_val:
            loss_val, threshold = new_loss, value

    return loss_val, threshold
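
To make the threshold scan concrete, here is a minimal self-contained sketch on a single feature. The helper `best_threshold_1d` and the toy arrays are hypothetical; they mirror the logic of the cell above rather than replace it.

```python
import numpy as np

def best_threshold_1d(x, y):
    """Hypothetical helper: scan every distinct value of one feature as a threshold."""
    best_loss, best_thr = np.inf, None
    for thr in np.unique(x):
        left, right = y[x < thr], y[x >= thr]
        # A split that leaves one side empty is not a valid split.
        if left.size == 0 or right.size == 0:
            continue
        # Sum of squared errors around the mean of each child.
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_loss:
            best_loss, best_thr = sse, thr
    return best_loss, best_thr

# Toy data with an obvious jump between x = 2 and x = 3.
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.0, 1.0, 5.0, 5.0])
toy_loss, toy_thr = best_threshold_1d(x_toy, y_toy)
```

On this toy data the threshold 3.0 separates the two groups perfectly, so the returned loss is 0. The smallest feature value is never selected, because it would leave the left child empty.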

In [357]:
def loss_split(X, y):
    """
    Choose the best feature and threshold to split on, i.e. the pair with the lowest loss.
    Args:
        X: training features, of shape (N, p).
        y: vector of training labels, of shape (N,).
    Returns:
        (int, float): the best feature index and the threshold used in splitting.
        If the feature index is None, there is no valid split for the current node.
    """

    # Initialize `split_column` to None: if None is returned, there is no valid split at the current node.
    min_loss = np.inf
    split_column = None
    split_val = np.nan
    p = X.shape[1]

    for col in range(p):
        # skip the column if the samples are not separable by it.
        if len(np.unique(X[:, col])) < 2:
            continue
        # named `current_loss` so that the `loss` function is not shadowed
        current_loss, current_split_val = loss_split_value(X, y, col)

        # keep track of the minimum loss together with its column and threshold
        if current_loss < min_loss:
            min_loss = current_loss
            split_column = col
            split_val = current_split_val

    return split_column, split_val

In [358]:
loss_split(X_train.to_numpy(), y_train.to_numpy())

(5, 2.5)

In [359]:
def build_tree(X, y, feature_names, depth,  max_depth=10, min_samples_leaf=10):
    """Build the decision tree according to the data.
    Args:
        X: (np.array) training features, of shape (N, p).
        y: (np.array) vector of training labels, of shape (N,).
        feature_names (list): record the name of features in X in the original dataset.
        depth (int): current depth for this node.
        max_depth (int): maximum depth of the tree.
        min_samples_leaf (int): nodes with at most this many samples become leaves.
    Returns:
        (dict): a dict denoting the decision tree (binary tree). Each internal node has six attributes:
          1. 'feature_name': The column name of the split.
          2. 'feature_index': The column index of the split.
          3. 'value': The value used for the split.
          4. 'mean_value': None for internal nodes. Leaf nodes store only this key, holding the mean of their labels.
          5. 'left': The left sub-tree with the same structure.
          6. 'right': The right sub-tree with the same structure.
    """
    # stop when (i) all labels are the same, (ii) the depth limit is reached, or (iii) the node is too small
    if len(np.unique(y))==1 or depth>=max_depth or len(X)<=min_samples_leaf:
        return {'mean_value': np.mean(y)}

    split_index, split_val = loss_split(X, y)

    # If no valid split at this node, use mean.
    if split_index is None:
        return {'mean_value': np.mean(y)}

    # Split samples (X, y) given column and split-value.
    (X_l, y_l), (X_r, y_r) = split_samples(X, y, split_index, split_val) 
    return {
        'feature_name': feature_names[split_index],
        'feature_index': split_index,
        'value': split_val,
        'mean_value': None,
        'left': build_tree(X_l, y_l, feature_names, depth + 1, max_depth, min_samples_leaf),
        'right': build_tree(X_r, y_r, feature_names, depth + 1, max_depth, min_samples_leaf)
    }

In [360]:
def train(X, y):
    """
    Build the regression tree according to the training data.
    Args:
        X: (pd.DataFrame) training features, of shape (N, p). Each row is a training sample.
        y: (pd.Series) vector of real-valued training targets, of shape (N,).
    Returns:
        (dict): the fitted decision tree.
    """
    feature_names = X.columns.tolist()
    X = X.to_numpy()
    y = y.to_numpy()
    return build_tree(X, y, feature_names, depth=1)

In [361]:
tree = train(X_train, y_train)

In [503]:
def find(tree, x):
    """
    Follow the branches of the fitted decision tree for a single sample.
    Args:
        tree: (dict) the fitted decision tree (or a sub-tree of it).
        x: (np.array) features of a single sample, of shape (p,).
    Returns:
        (float): predicted value for the sample.
    """
    
    if tree['mean_value'] is not None:
        return tree['mean_value']

    if x[tree['feature_index']] < tree['value']: 
        # go to left branch
        return find(tree['left'], x)  
    else:
        # go to right branch
        return find(tree['right'], x)  
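
The nested-dict layout that the lookup above relies on can be seen on a hand-built stump. This is a sketch under assumptions: `toy_tree` and the iterative helper `walk` are hypothetical, and `walk` is only meant as a loop-based equivalent of the recursive lookup.

```python
import numpy as np

# Hypothetical one-split tree: internal nodes carry the split, leaves only 'mean_value'.
toy_tree = {
    'feature_name': 'x0',
    'feature_index': 0,
    'value': 3.0,
    'mean_value': None,
    'left': {'mean_value': 1.0},
    'right': {'mean_value': 5.0},
}

def walk(node, x):
    """Hypothetical iterative traversal: descend until a leaf is reached."""
    while node['mean_value'] is None:
        go_left = x[node['feature_index']] < node['value']
        node = node['left'] if go_left else node['right']
    return node['mean_value']

pred_left = walk(toy_tree, np.array([2.0]))   # 2.0 < 3.0, so the left leaf
pred_right = walk(toy_tree, np.array([3.0]))  # 3.0 is not < 3.0, so the right leaf
```

An iterative traversal avoids Python's recursion limit, although with `max_depth=10` the recursive version is equally safe.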

In [492]:
def predict(tree, X):
    """
    Predict regression results for X.
    Args:
        tree: (dict) the fitted decision tree.
        X: (np.array) testing sample features, of shape (N, p), or (p,) for a single sample.
    Returns:
        (np.array): predicted values, of shape (N,).
    """
    if len(X.shape) == 1:
        return find(tree, X)
    else:
        return np.array([find(tree, x) for x in X])

In [493]:
def tree_results(tree, X, y):
    """
    Return the R^2 and MSE score of the tree on the data.
    Args:
        tree: (dict) the decision tree.
        X: (pd.Dataframe) testing sample features, of shape (N, p).
        y: (pd.Series) vector of testing labels, of shape (N,).
    Returns:
        (float): R^2 score of the tree on the data.
        (float): MSE score of the tree on the data.
    """
    y_pred = predict(tree, X.to_numpy())
    return rsq(y, y_pred), mse(y, y_pred)

In [494]:
print(tree_results(tree, X_train, y_train))
print(tree_results(tree, X_test, y_test))

(0.7503640060081045, 1654.4051506951573)
(0.4896216113769649, 3369.5671069811706)


<a name="task-112"></a>

### (1.1.2) [(index)](#index-task-112)

In [495]:
def loss_split_rf(n_features, X, y):
    """
    Choose the best feature to split on among a random subset of the features.
    Args:
        n_features: number of sampled features.
        X: training features, of shape (N, p).
        y: vector of training labels, of shape (N,).
    Returns:
        (float, int, float): the minimized loss value, the best feature index and the splitting threshold.
        If no sampled feature can separate the samples, the loss is np.inf and the index is None.
    """

    # The added sampling step: draw `n_features` distinct column indices out of the p available.
    columns = np.random.choice(X.shape[1], n_features, replace=False)


    min_loss_val, split_column, split_val = np.inf, None, np.nan

    # Only scan through the sampled columns in `columns`.
    for column in columns:
        # skip the column if the samples are not separable by it.
        if len(np.unique(X[:, column])) < 2:
            continue

        # search for the best splitting value for the given column.
        loss_val, val = loss_split_value(X, y, column)
        if loss_val < min_loss_val:
            min_loss_val, split_column, split_val = loss_val, column, val

    return min_loss_val, split_column, split_val

In [496]:
def build_tree_rf(n_features, X, y, feature_names, depth,  max_depth=10, min_samples_leaf=10):
    """Build one decision tree of the random forest, sampling features at every split.
    Args:
        n_features (int): number of features sampled at each split.
        X: (np.array) training features, of shape (N, p).
        y: (np.array) vector of training labels, of shape (N,).
        feature_names (list): record the name of features in X in the original dataset.
        depth (int): current depth for this node.
        max_depth (int): maximum depth of the tree.
        min_samples_leaf (int): nodes with at most this many samples become leaves.
    Returns:
        (dict): a dict denoting the decision tree (binary tree). Each internal node has six attributes:
          1. 'feature_name': The column name of the split.
          2. 'feature_index': The column index of the split.
          3. 'value': The value used for the split.
          4. 'mean_value': None for internal nodes. Leaf nodes store only this key, holding the mean of their labels.
          5. 'left': The left sub-tree with the same structure.
          6. 'right': The right sub-tree with the same structure.
    """
    # stop when (i) all labels are the same, (ii) the depth limit is reached, or (iii) the node is too small
    if len(np.unique(y)) == 1 or depth>=max_depth or len(X)<=min_samples_leaf:
        return {'mean_value': np.mean(y)}

    else:
        loss_val, split_column, split_val = loss_split_rf(n_features, X, y)

        # If the loss is infinite, the samples are not separable by the sampled features.
        if loss_val == np.inf:
            return {'mean_value': np.mean(y)}
        (X_l, y_l), (X_r, y_r) = split_samples(X, y, split_column, split_val)
        return {
            'feature_name': feature_names[split_column],
            'feature_index': split_column,
            'value': split_val,
            'mean_value': None,
            # recurse with build_tree_rf so that features are also sampled in the sub-trees
            'left': build_tree_rf(n_features, X_l, y_l, feature_names, depth + 1, max_depth, min_samples_leaf),
            'right': build_tree_rf(n_features, X_r, y_r, feature_names, depth + 1, max_depth, min_samples_leaf)
        }

In [497]:
def train_rf(B, n_features, X, y, feature_names):
    """
    Build a random forest of B regression trees, each fitted on a bootstrap sample of the training data.
    Args:
        B: number of decision trees.
        n_features: number of features sampled at each split.
        X: (pd.DataFrame or np.array) training features, of shape (N, p). Each row is a training sample.
        y: (pd.Series or np.array) vector of real-valued training targets, of shape (N,).
        feature_names (list): the names of the features in X.
    Returns:
        (list): the B fitted trees.
    """
    if isinstance(X, pd.DataFrame):
        X = X.to_numpy()
    if isinstance(y, pd.Series):
        y = y.to_numpy()

    N = X.shape[0]
    training_indices = np.arange(N)
    trees = []

    for _ in range(B):
        # Sample the training_indices (with replacement)
        sample = np.random.choice(training_indices, N, replace=True) 

        # Ensure the size of the random sample is the same size of training sample
        assert len(sample) == len(training_indices)

        X_sample = X[sample, :]
        y_sample = y[sample]
        tree = build_tree_rf(n_features, X_sample, y_sample,
                            feature_names, depth=1)
        trees.append(tree)

    return trees
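
The bootstrap step above draws N indices with replacement, so each tree sees only part of the distinct training points. The sketch below illustrates this on a hypothetical sample size `N_toy`. The fixed seed and the out-of-bag index set are additions for the example and are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
N_toy = 10_000                  # hypothetical sample size

# Draw N indices with replacement, as in one bootstrap replicate.
bootstrap_idx = rng.choice(N_toy, size=N_toy, replace=True)

# Fraction of distinct training points that made it into the replicate.
unique_fraction = np.unique(bootstrap_idx).size / N_toy

# Indices that were never drawn ("out-of-bag") for this replicate.
oob_idx = np.setdiff1d(np.arange(N_toy), bootstrap_idx)
```

For large N the fraction of distinct points approaches 1 - 1/e, roughly 63%. The remaining out-of-bag points could, in principle, be used to estimate the error of that tree without a separate validation split.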

In [498]:
def predict_rf(rf, X):
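    """
    Predict regression results for X by averaging the predictions of all trees.
    Args:
        rf: a trained random forest, as returned by train_rf.
        X: (pd.DataFrame or np.array) testing sample features, of shape (N, p), or (p,) for a single sample.
    Returns:
        (np.array): predicted values, of shape (N,) (a single float for one sample).
    """
    # iterating over a DataFrame yields column names rather than rows, so convert to NumPy first
    if isinstance(X, (pd.DataFrame, pd.Series)):
        X = X.to_numpy()

    if len(X.shape) == 1:
        # if we have one sample
        return np.mean([find(tree, X) for tree in rf])
    else:
        # if we have multiple samples
        return np.array([np.mean([find(tree, x) for tree in rf]) for x in X])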

In [499]:
def rf_results(rf, X, y):
    """
    Return the R^2 and MSE score of the random forest on the data.
    Args:
        rf: (list) the random forest.
        X: (pd.Dataframe) testing sample features, of shape (N, p).
        y: (pd.Series) vector of testing labels, of shape (N,).
    Returns:
        (float): R^2 score of the rf on the data.
        (float): MSE score of the rf on the data.
    """
    y_pred = predict_rf(rf, X)
    return rsq(y, y_pred), mse(y, y_pred)

In [500]:
n_features = int(X_train.shape[1]/3)

In [501]:
B = 30 # this will change later after optimisation
# fit the random forest with training data
rf = train_rf(B, n_features, X_train, y_train, X_train.columns.tolist())

In [502]:
# pass NumPy arrays: the prediction loop iterates over rows, whereas iterating over a DataFrame yields column names
print(rf_results(rf, X_train.to_numpy(), y_train))
print(rf_results(rf, X_test.to_numpy(), y_test))

In [453]:
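def fold_indices(N, n_folds):
    """
    Shuffle the indices 0..N-1 and return n_folds disjoint folds of equal size.
    If N is not divisible by n_folds, the last N % n_folds shuffled indices are in no fold.
    """
    fold_size = N // n_folds
    shuffle_index = np.random.permutation(np.arange(N))
    folds = []
    for i in range(n_folds):
        folds.append(shuffle_index[i * fold_size:(i + 1) * fold_size])
    return folds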
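
Because the fold size above is `N // n_folds`, up to `n_folds - 1` samples never appear in a validation fold when N is not divisible by `n_folds`. A possible alternative, assuming every sample should be validated exactly once, is sketched below with `np.array_split`. The function name `fold_indices_all` and its `rng` argument are hypothetical.

```python
import numpy as np

def fold_indices_all(N, n_folds, rng=None):
    """Hypothetical variant: shuffle the indices and split them into n_folds nearly equal parts."""
    rng = np.random.default_rng() if rng is None else rng
    # array_split allows uneven parts, so no index is dropped.
    return np.array_split(rng.permutation(N), n_folds)

# 23 samples in 5 folds: the first folds get one extra sample each.
toy_folds = fold_indices_all(23, 5, rng=np.random.default_rng(0))
toy_sizes = [len(fold) for fold in toy_folds]
```

With 23 samples and 5 folds this gives fold sizes 5, 5, 5, 4 and 4, whereas equal-sized folds of 4 would leave 3 samples out of every validation fold.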

In [454]:
def cross_validation_score(X_train, y_train, n_folds, B):
    """
    Return the mean validation MSE of a random forest with B trees over n_folds folds.
    Relies on the notebook-level variable `n_features` defined above.
    """
    scores = []
    folds = fold_indices(X_train.shape[0], n_folds)
    feature_names = X_train.columns.tolist()
    X_train = X_train.to_numpy()
    y_train = y_train.to_numpy()

    for i in range(len(folds)):
        val_indexes = folds[i]
        # train on every sample that is not in the validation fold
        train_indexes = list(set(range(y_train.shape[0])) - set(val_indexes))

        X_train_i = X_train[train_indexes, :]
        y_train_i = y_train[train_indexes]

        X_val_i = X_train[val_indexes, :]
        y_val_i = y_train[val_indexes]

        rf = train_rf(B, n_features, X_train_i, y_train_i, feature_names)
        # rf_results returns (R^2, MSE); keep the MSE
        scores.append(rf_results(rf, X_val_i, y_val_i)[1])

    # Return the average score
    return np.mean(scores)

In [455]:
def choose_best_B(X_train, y_train, n_folds, B_range):
    """Return the B in B_range with the lowest cross-validated MSE."""
    B_scores = np.zeros((len(B_range),))

    for i, B in enumerate(B_range):
        B_scores[i] = cross_validation_score(X_train, y_train, n_folds, B)
        print(f'CV_MSE@B={B}: {B_scores[i]:.3f}')

    # the scores are MSE values, so the best B minimizes them
    best_B_index = np.argmin(B_scores)
    return B_range[best_B_index]

In [456]:
b_hat = choose_best_B(X_train, y_train, 5, np.arange(1, 31))


<a name="task-113"></a>

### (1.1.3) [(index)](#index-task-113)



---



<a name="task-12"></a>

## (1.2) Multi-layer Perceptron [(index)](#index-task-12)

<a name="task-121"></a>

### (1.2.1) [(index)](#index-task-121)

<a name="task-122"></a>

### (1.2.2) [(index)](#index-task-122)

<a name="task-123"></a>

### (1.2.3) [(index)](#index-task-123)



---



<a name="task-2"></a>

# (2) Task 2: Classification [(index)](#index-task-2)

<a name="task-21"></a>

## (2.1) k-Nearest Neighbours [(index)](#index-task-21)

<a name="task-211"></a>

### (2.1.1) [(index)](#index-task-211)

<a name="task-212"></a>

### (2.1.2) [(index)](#index-task-212)

<a name="task-213"></a>

### (2.1.3) [(index)](#index-task-213)

<a name="task-214"></a>

### (2.1.4) [(index)](#index-task-214)



---



<a name="task-22"></a>

## (2.2) Logistic regression vs kernel logistic regression [(index)](#index-task-22)

<a name="task-221"></a>

### (2.2.1) [(index)](#index-task-221)

<a name="task-222"></a>

### (2.2.2) [(index)](#index-task-222)

<a name="task-223"></a>

### (2.2.3) [(index)](#index-task-223)