# Session 3: Introduction to machine learning algorithms

In this exercise worksheet, you'll implement machine learning algorithm models from scratch to extract biological meaning from sequence data. We will be focusing on the 2 following algorithms:

1. Decision tree to predict whether a breast cancer tumor is malignant or not from its visual properties.
2. Neural network to predict whether a DNA sequence is from a promoter or not.

We'll also focus on the strength and weaknesses of these models, how to assess their results and potential ways to improve them.

### Introduction to numpy: work with arrays and matrices in python


When doing machine learning, we often have matrices containing multiple samples and features. The python package numpy is very helpful to manipulate this type of data. It also has extensive documentation, which is very helpful if you get stuck : https://numpy.org/doc/1.19/

Below is a quick demonstration of its use:

In [1]:
# Numpy allows to easily manipulate arrays in one or more dimensions

import numpy as np

my_list = [
    [1,   2,   3  ],
    [10,  20,  30 ],
    [100, 200, 300],
] # Standard python 2D list
my_array = np.array(my_list) # Numpy aray equivalent

# How to select the first column of the 2D array ?

# Base python version
first_column = [0, 0, 0]
for i, row in enumerate(my_list):
    first_column[i] = row[0]

# Numpy version
first_column = my_array[:, 0] # We can slice the array in two dimensions: [rows, cols]

# How to multiply every element in the array by 10 ?

# Base python version
for i in range(len(my_list)):
    for j in range(len(my_list[0])):
        my_list[i][j] *= 10
        
# Numpy version
my_array *= 10

# Note: What would happen if you tried running my_list * 10 ? 

In [2]:
# You can display various informations on numpy arrays
print("my_array.shape: ", my_array.shape)
print("my_array.size: ", my_array.size)

my_array.shape:  (3, 3)
my_array.size:  9


The example above means operations are vectorized on numpy arrays. This means we do not need to write the loops explicitely. They are still executed implicitely, but in C code, which is much faster ! See for yourself:

In [3]:
big_list = [[3] * 1000] * 1000
big_array = np.array(big_list)
%timeit [[v * 10 for v in row] for row in big_list]
%timeit big_array * 10

29.9 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.37 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Exercise 1: Decision trees

As we saw in the lectures, decision trees are easily interpretable. Here we will implement a decision tree model for classification, from scratch. You will need to implement all the necessary building blocks:

* A cost function allowing to compute the purity of nodes and rate splits using information gain.
* An algorithm to find the feature and value providing best binary split.
* A recursive function that will call itself to split each node into children.

Some of the code is already written to guide you, but you still need to implement the critical part.

**a) Implement a function to compute the entropy of an array. We will later use it to rate splits.**


In [4]:
import numpy as np

def entropy(array, possible_values):
    """
    Compute the entropy of an array of values
    Hint: To get a maximum value of 1 like in the
    example, you need to use a log of base 2 (np.log2)
    >>> entropy(["A", "A", "A"], ["A", "B"])
    0
    >>> entropy([1, 2, 2, 1], [1, 2])
    1
    """
    if not len(array): return 0
    probs = [0] * len(possible_values)
    for i, value in enumerate(possible_values):
        for sample in array:
            if sample == value:
                probs[i] +=1
    probs = [p/sum(probs) for p in probs]
    entropy = 0
    for p in probs:
        if p == 0:
            prod = 0
        else:
            prod = p * np.log2(p)
        entropy -= prod
    return entropy

In [5]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
assert entropy(["A", "A", "A"], ["A", "B"]) == 0
assert entropy([1, 2, 2, 1], [1, 2]) > entropy([1, 2, 1, 1], [1, 2])
assert entropy([1, 2, 2, 1], [1, 2]) > 0
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!



**b) Read the specification of our tree structure in the cell below**

In [15]:
# We can represent out tree with nested dictionaries. Each node is a dictionary with two children {{...},{...}}.
### Example of a tree with 5 nodes (2 internal and 3 terminal):
"""
GRAPHIC REPRESENTATION:

      o1      <- root o1, split on feature 3 at value 0.5
    /  \
   x1    o2   <- internal node o2, split on feature 1 at value 03
       / \
      x2   x3 <- terminal nodes x1, x2 and x3 contain the prediction result
"""

# DICTIONARY REPRESENTATION:

# Each internal node has attributes describing what feature was used for the split
# and what was the optimal value. Internal nodes also have 'left' and 'right' attributes,
# which each contain another dictionary representing the children nodes.

# Terminal nodes instead have data and pred attributes, which indicate which training samples
# are in the node, and what is the prediction.

dummy_tree  = {
    'split_feature': 3,         #
    'split_value': 0.5,         # ROOT o1
    'depth': 1,                 #
    'left':                     
        {
            'depth': 2,           # TERMINAL x1
            'data':[1, 2, 4],     #
            'pred': "A",          #
        },
    'right':
        {
            'split_feature': 1,   # INTERNAL o2
            'split_value': -3,    #
            'depth': 2,           #
            'left':
                {
                    'depth': 3,     #
                    'data': [0],    #  TERMINAL x2
                    'pred': "A",    #
                },
            'right': 
                {
                    'depth': 3,     #
                    'data': [3, 5], # TERMINAL x3
                    'pred': "B",    #
                },
        },
}


**b) Write a function to find the best split on the dataset provided below. It contains measurements from breast cancer tumors, and whether they are malignant (1) of begnin (0). The feature matrix Nxp is stored in `X`. The labels are stored in `y`.**
> Note: Given a dataset X of N samples (rows) and p features (cols), and target values (labels) y, the algorithm should find the combination of feature j and value s which provides the best split. The best split is defined as the one maximizing the information gain IG.
$$IG= entropy(y) - \frac{1}{(N_l+N_r)}\left(N_l * entropy(y_l) + N_l * entropy(y_r)\right)$$
where N represents the number of samples in a node, and {l,r} are the left and right children nodes generated by the split


In [6]:
# Load a dummy dataset
from sklearn import datasets
import numpy as np
dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target
print(
    f"We have N={X.shape[0]} samples, each with p={X.shape[1]} features. The target values are {dataset['target_names']}."
)

We have N=569 samples, each with p=30 features. The target values are ['malignant' 'benign'].


In [8]:
def get_split(X, y):
    """
    Select the best split point on dataset X to separate target y.
    Returns a dictionary (node) with attributes 'split_features' and
    'split_value', describing the split chosen, as well as 'left' and
    'right', storing the content of the children nodes. 'left' and 'right'
    each contain a list of two elements: the feature matrix and the target array,
    i.e. 'left': [Xl, yl], 'right': [Xr, yr].
    
    >>> get_split(np.array([[1, 1, 2],[1,1,4]]),np.array([1,2]))
    {
        'split_feature': 2,
        'split_value': 2,
        'left':  [np.array([[1, 1, 2]]), np.array([1])],
        'right': [np.array([[1, 1, 4]]), np.array([2])],
    }
    """
    y_values = set(y)
    old_entropy = entropy(y, y_values)
    best_value = best_feature = None
    best_score =  -np.inf
    # X.shape[1] gives the number of columns (features) in X
    for feature in range(X.shape[1]):
        # We iterate over the possible values for the current feature
        for value in set(X[:, feature]):
            # Boolean mask of samples higher than value for the feature
            higher = X[:, feature] > value
            # Samples higher than values go to left
            yl = y[higher]
            # The others go to right, notice the "~" to invert the mask
            yr = y[~higher]
            # Compute the entropy for candidate splits
            entropy_left = entropy(yl, y_values)
            entropy_right = entropy(yr, y_values)
            # Their weighted average gives the total new entropy
            new_entropy = (len(yl) * entropy_left + len(yr) * entropy_right) / len(y)
            # Compute information gain from this split
            ig = old_entropy - new_entropy
            # Retain split if it got the highest information gain
            if ig > best_score:
                # Groups with all features + predictions are kept for the children nodes
                best_Xl, best_yl = X[higher, :],  yl
                best_Xr, best_yr = X[~higher, :], yr
                best_feature, best_value, best_score = feature, value, ig
    new_node = {
        'split_feature': best_feature,
        'split_value': best_value,
        'left':  [best_Xl, best_yl],
        'right': [best_Xr, best_yr],
        'ig': best_score,
    }
    return new_node

In [9]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
your_split = get_split(np.array([[1, 1, 2],[1,1,4]]),np.array([1,2]))
req_vals = {
    'split_feature': 2,
    'split_value': 2,
    'left':  [np.array([[1, 1, 2]]), np.array([1])],
    'right': [np.array([[1, 1, 4]]), np.array([2])],
}
assert np.all([your_split[k] == req_vals[k] for k in ['split_value', 'split_feature']]) # Check split
assert np.all(your_split[k][0] == req_vals[k][0] for k in ['left', 'right']) # Check X subsets
assert np.all(your_split[k][1] == req_vals[k][1] for k in ['left', 'right']) # Check y subsets
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!


**c) Implement the function for recursive binary partitioning. It should be called once by `build_tree`, and then call itself until it reaches a base condition (pure node, node too small or maximum depth).**


In [10]:
def most_freq(array):
    """Returns the most frequent value in a numpy array"""
    values, counts = np.unique(array,return_counts=True)
    ind=np.argmax(counts)
    return values[ind]


def recurse_split(node, depth, min_node_size=4, max_depth=3):
    """
    Given a node that is already split, check for base conditions
    min_node_size and max_depth.
    
    - If the base conditions are reached in the splits, make children
      terminal nodes.
    
    - If the current node is already pure, ignore splits
      and make the current node terminal.
    
    - If none of the base condition was reached, compute optimal split
      on the children and recurse further down into the tree.
    """
    Xl, yl = node['left']
    Xr, yr = node['right']
    del node['left'], node['right']
    # Check if all samples went to the same side (one split is empty)
    if not Xl.size or not Xr.size:
        # The current node is pure, make it a terminal node (leaf)
        del node['split_feature'], node['split_value']
        node['data'] = [np.concatenate([Xl, Xr]), np.concatenate([yl, yr])]
        node['pred'] = most_freq(node['data'][1])
        node['depth'] = depth
        return
    # If our tree has reached max depth, make the children nodes terminal
    if depth >= max_depth:
        node['left'] = {'depth': depth + 1, 'data': [Xl, yl], 'pred': most_freq(yl)}
        node['right'] = {'depth': depth + 1, 'data': [Xr, yr], 'pred': most_freq(yr)}
        return
    # process left child. If it is too small, make it terminal
    if Xl.shape[0] <= min_node_size:
        node['left'] = {'depth': depth + 1, 'data': [Xl, yl], 'pred': most_freq(yl)}
    # Otherwise, keep recursing deeper into the tree
    else:
        node['left'] = get_split(Xl, yl)
        recurse_split(node['left'], depth+1, min_node_size, max_depth)
    # process right child. If it is too small, make it terminal
    if Xr.shape[0] <= min_node_size:
        node['right'] = {'depth': depth + 1, 'data': [Xr, yr], 'pred': most_freq(yr)}
    # Otherwise, keep recursing deeper into the tree
    else:
        node['right'] = get_split(Xr, yr)
        recurse_split(node['right'], depth+1, min_node_size, max_depth)
 
    return

def build_tree(X, y, min_node_size=4, max_depth=3):
    """Given a dataset and associated targets, build a decision tree"""
    # Initialize a tree with the root node and resulting best splits
    tree = get_split(X, y)
    tree['depth'] = 1
    # The function does not return anything, it modifies 'tree'
    recurse_split(tree, 1, min_node_size=min_node_size, max_depth=max_depth)
    return tree

In [19]:
###########################################
### TEST YOUR CODE BY RUNNING THIS CELL ###
###########################################
dummy_tree = build_tree(np.array([[0, 1, 2], [1, 1, 2], [1, 2, 2], [0, 2, 2]]), np.array([0, 1, 1, 0]))
assert  dummy_tree['split_feature'] == 0, 'wrong split feature' # Check split feature
assert dummy_tree['split_value']  in [0, 1], 'wrong split threshold' # Check split threshold (either < 1 or <= 0 works)
assert len(dummy_tree['left']['data']) == 2, 'wrong number of samples in leaves' # Check y subsets
assert 'ig' in dummy_tree.keys(), 'Please add an "ig" key to your nodes in get_split'
assert dummy_tree['ig'] == 1, 'Wrong information gain value.'
print(' 0 0 0 \n0 . . 0\n0  v  0\n 0 0 0 ')
print("Congrats !!")

 0 0 0 
0 . . 0
0  v  0
 0 0 0 
Congrats !!


**d) What feature split provides the most information gain ? What does that imply ?**
> Note: You can use the print_tree function to visualise your tree

In [20]:
def print_tree(node, feature_names, depth=0):
    """Print a drawing of your tree"""
    if 'split_feature' in node.keys():
        j = feature_names[node['split_feature']]
        s = node['split_value']
        print(f"{2*depth*' '}[{j} < {s:.3f}]: IG={node['ig']:.3}")
        print_tree(node['left'], feature_names, depth+1)
        print_tree(node['right'], feature_names, depth+1)
    else:
        print('%s[%s]' % ((2*depth*' ', node['pred'])))

In [21]:
print_tree(dummy_tree, feature_names=['0', '1', '2'])

[0 < 0.000]: IG=1.0
  [1]
  [0]


In [22]:
dummy_tree

{'split_feature': 0,
 'split_value': 0,
 'ig': 1.0,
 'depth': 1,
 'left': {'depth': 2, 'data': [array([[1, 1, 2],
          [1, 2, 2]]), array([1, 1])], 'pred': 1},
 'right': {'depth': 2, 'data': [array([[0, 1, 2],
          [0, 2, 2]]), array([0, 0])], 'pred': 0}}

In [23]:
T = build_tree(X[:500], y[:500])

In [24]:
print_tree(T, dataset.feature_names)

[worst perimeter < 106.000]: IG=0.578
  [worst perimeter < 117.200]: IG=0.2
    [fractal dimension error < 0.002]: IG=0.0821
      [0]
      [1]
    [worst smoothness < 0.134]: IG=0.403
      [0]
      [1]
  [worst concave points < 0.134]: IG=0.124
    [worst texture < 27.200]: IG=0.453
      [0]
      [1]
    [area error < 48.840]: IG=0.0331
      [0]
      [1]


**e) Write a function to predict new values using your tree. Try to find a measure of success.**

In [25]:
def predict_values(tree, X):
    """
    Traverse the tree with new unknown observations
    and retrieve the prediction at the leaves
    """
    pred = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        node = tree
        terminal = False
        while not terminal:
            try:
                if X[i, node['split_feature']] < node['split_value']:
                    node = node['left']
                else:
                    node = node['right']
            except KeyError:
                terminal = True
        pred[i] = node['pred']
    return pred

In [28]:
def assess(T, X, y):
    pred = predict_values(T, X)
    correct = y == pred
    return sum(correct) / len(correct)
    
print(f"{100 * assess(T, X[500:], y[500:]):.2f}% of the predictions are correct")

73.91% of the predictions are correct


**f) How well does the tree generalize to new data ? Can you guess why, and how to solve this ?**

In [None]:
depths = [3, 4, 5, 10, 20]
sizes = [10, 20, 50, 100]
params = np.zeros((len(depths) * len(sizes), 3))
# We can try increasing depth !
i = 0
for m in depths:
    for n in sizes:
        params[i, 0] = m
        params[i, 1] = n
        params[i, 2] = assess(build_tree(X[:500], y[:500], min_node_size=n, max_depth=m), X[500:], y[500:])
        i += 1


In [None]:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

mpl.rcParams['legend.fontsize'] = 10

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter(params[:, 0], params[:, 1], params[:, 2], label='prop. correct classif.', c=params[:, 2], cmap='winter')
ax.legend()
ax.set_xlabel("Max depth")
ax.set_ylabel("Min node size")
plt.show()
best_idx = np.flatnonzero(params[:, 2] == max(params[:, 2]))[0]
print(
    f"The best result of {100*params[best_idx, 2]:.2f}% correct"
    f" classification was obtained with max_depth={params[best_idx,0]}"
    f" and min_node_size={params[best_idx, 1]}"
)

## (Optional) Exercise 2: Predict promoters from DNA sequence with a neural network

Promoter regions are regulatory DNA sequences to which specific proteins can bind to trigger the transcription of neighbouring genes. Depending on the exact promoter sequence, proteins will have more or less affinity to it, which allows fine regulation of gene expression.

Here, we want to predict whether DNA sequences are promoters or not. You are given a dataset of 106 DNA sequences, each 57bp long. Some of these sequences, labelled "+" originate from a known promoter, while the others, labelled "-" are from a non promoter region.

Here, you need to make the best possible prediction of promoter state from the DNA sequences using a neural network.

To implement this neural network, we use matrix operations via numpy !
To help you with the theory, you can read this excellent documentation on neural networks:https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html

We use the naming conventions as in the drawing, so you can always refer to it:

![image.png](images/nn_with_matrices_displayed.png)

However, in our network, layer H will have 4 neurons (H1, H2, H3, H4), and layer O a single neuron (O1).

The actual neural network code is already written in the cells below. Your goal is understand and modify it however you want to improve results.

Here are a few hints (all of those terms are explained in the link above):
* We use sigmoid as our activation function, but you could change it.
* We use squared error as our cost functions, there are also other options.
* The learning rate and number of iterations are important parameters (also look at early stopping)
* You can change the number of nodes in the hidden layer by changing the shape of wh.
* A few other words which might be helpful: Regularization, drop out, data augmentation

At the end of the notebook, there is a cell to assess your model using Leave One Out (LOO) cross validation. This is a robust way to measure the performance of a model, where we train the network on all samples except 1. We then try to predict the single sample that was removed. We repeat the operation for each different sample. In the end, we just look at the proportion of correct predicitions.

Good luck !

### Introduction to pandas: More flexibility with dataframes
Pandas is quite similar to numpy,in the sense that it allows to manipulate tabular data and uses vectorized operations. However, it adds more flexibility withdataframes. Dataframes work exactly like in the Rprogramming language, they have columns which have names, and each column can contain different types of data. The pandas package has an excellent documentation here: https://pandas.pydata.org/

Here, we use pandas to load the data. Below is a quick explanation of how it works.

In [28]:
import pandas as pd
import numpy as np
# We can load a dataframe from a text file with pd.read_csv(), or create one directly:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30], 'C': ["x", "y", "z"]})
df.head() # head() shows the first few rows of the dataframe
# We can select a column by name, or by position
df.loc[:, 'A'] # loc selects columns by name
df['A'] # This is a shortcut to do the same thing
df.iloc[:, 0] # iloc selects columns by index (position)
# Instead of taking all rows, we can apply conditions to select a subset
df.loc[df['A'] > 1, ['B', 'C']] # Select columns B and C, but only include rows for which A > 1
df.iloc[[0, 2], :] # Select rows 0 and 2 of all columns
# It is also possible to apply functions on rows or columns
df['meanAB'] = df.loc[:, ['A', 'B']].apply(np.mean, axis=1) # Mean of A and B for each row
df['custom'] = df['C'].apply(lambda x: 'custom_' + x) # Custom (lambda) function on column C
# There are many more powerful features which we will not need here
df.head()

Unnamed: 0,A,B,C,meanAB,custom
0,1,10,x,5.5,custom_x
1,2,20,y,11.0,custom_y
2,3,30,z,16.5,custom_z


In [29]:
### DATA LOADING ###

import numpy as np
import pandas as pd
# Load the table using pandas
dataset = pd.read_csv('data/session_3_promoters.data', header=None, names=['promoter', 'ID', 'seq'])
dataset.head()

Unnamed: 0,promoter,ID,seq
0,+,S10,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1,+,AMPC,\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2,+,AROH,\t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3,+,DEOP2,\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4,+,LEU1_TRNA,\ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...


In [30]:


# Use strip to remove whitespaces (tabs, shown as \t) from the sequences
dataset['seq'] = dataset['seq'].str.strip()

# Neural networks can only read numeric inputs, we need to convert DNA into numbers.
# Convert letters to numerals actg -> 1234
dataset['seq'] = dataset['seq'].apply(lambda x: x.translate(str.maketrans("actg", "1234")))
# Split strings into lists of integers
dataset['seq'] = dataset['seq'].apply(lambda x: [int(b) for b in x])
dataset.head()

Unnamed: 0,promoter,ID,seq
0,+,S10,"[3, 1, 2, 3, 1, 4, 2, 1, 1, 3, 1, 2, 4, 2, 3, ..."
1,+,AMPC,"[3, 4, 2, 3, 1, 3, 2, 2, 3, 4, 1, 2, 1, 4, 3, ..."
2,+,AROH,"[4, 3, 1, 2, 3, 1, 4, 1, 4, 1, 1, 2, 3, 1, 4, ..."
3,+,DEOP2,"[1, 1, 3, 3, 4, 3, 4, 1, 3, 4, 3, 4, 3, 1, 3, ..."
4,+,LEU1_TRNA,"[3, 2, 4, 1, 3, 1, 1, 3, 3, 1, 1, 2, 3, 1, 3, ..."


In [31]:
X = np.array(dataset['seq'].to_list())
# Scale our observations for each column (Will make it easier to fit the model)
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Transform -/+ into 0/1 labels
y = np.array(dataset.promoter == '+', dtype=int)

In [121]:
np.random.seed(1337)
# We will use 90% of our dataset to train the network
TRAIN_SIZE = int(X.shape[0] * 0.9)
train_idx = np.random.choice(range(X.shape[0]), size=TRAIN_SIZE)
X_train, y_train = X[train_idx, :], y[train_idx]
# Save the other part to assess it afterwards
X_test, y_test = np.delete(X, train_idx, axis=0), np.delete(y, train_idx, axis=0)[:, None]

In [122]:
def sigmoid(z):
    return 1/(1+np.exp(-z))


def sigmoid_derivative(z):
    return np.exp(-z) / (1+np.exp(-z))**2


def feedforward(X, wh, wo):
    """
    Given input feature matrix X of shape Nxp and weights
    for hidden (wh)) and output (wo) layers, send X through
    the network to retrieve the folowing values:
    - zh: The linear combination of inputs and weights wh
    - zo: The linear combination of the hidden layer result (H) and the output layer weights (wo)
    - H: The result from the hidden layer, its value is just sigmoid(zh)
    - O: The output from our network, its value is sigmoid(zo)
    """
    zh = X @ wh # Z=XW for hidden layer
    zo = zh @ wo # Same for output layer
    H = sigmoid(zh)
    O = sigmoid(zo)
    return zh, zo, H, O


def backprop(X, targ, wh, wo, lr=0.1):
    """
    The gradient descent process used to train our algorithm.
    Given an input feature matrix, predictions from the network, real (target) values
    and the network weight which produced the predictions, this function computes
    the prediction error, and backpropagates the partial derivative of this error
    according to each weight throughout the network. The resulting gradient give
    the direction in which each weight should be adjusted. The learning rate (lr)
    is a constant determining how much to adjust the weights in that direction.
    """
    # feed forward
    zh, zo, H, pred = feedforward(X, wh, wo)
    # Compute prediction error
    Eo = (pred - targ) * sigmoid_derivative(zo) # Output layer error: Eo = E'(y) * f'(Zo)
    Eh = Eo * (wo.T * sigmoid_derivative(zh))   # Hidden layer error: Eh = E0 * w0 * f'(Zh)
    # Cost derivative for weights
    dWo = Eo.T @ H
    dWh = Eh.T @ X

    # Update weights
    wh -= lr * dWh.T
    wo -= lr * dWo.T
    return wh, wo


def predict(X, wh, wo):
    """Send a new prediction through the trained network to get the prediction"""
    inp = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
    H = sigmoid(inp @ wh)
    O = sigmoid(H @ wo)
    return O

In [131]:
# We add a first column of 1's (bias), so that: a0+a1*x1+... = a0*1+a1*x1+...
inp   = np.concatenate([np.ones((X_train.shape[0], 1)), X_train], axis=1)
# Initialize random weights in the network (we have no idea of good values for now)
wh    = np.random.rand(inp.shape[1], 20) # Weights of hidden layer
wo    = np.random.rand(20, 1) # Weights of output layer, shape[4, 1]
tar   = y_train[:, None] # target values

zh, zo, H, O = feedforward(inp, wh, wo)
# 10 iterations of training
for i in range(100000):
    # Adjust weights according to prediction errors
    wh, wo = backprop(inp, tar, wh, wo, lr=0.2)

In [132]:
# We predict the test samples using the trained weights
O =  predict(X_test, wh, wo)[:, 0]

In [133]:
# Binarise results into 0 / 1 and check which are equal to the real values
correct = (O>0.5) == y_test[:, 0]
print(
    f"The network has {100*(1-sum(correct) / len(y_test)):.2f}% misclassification rate. "
    f"({len(y_test) - sum(correct)}/{len(y_test)} wrong)"
)

The network has 26.67% misclassification rate. (12/45 wrong)


In [159]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.hist([O[O>0.5], O[O<=0.5]])

<IPython.core.display.Javascript object>

(array([[ 0.,  0.,  0.,  0.,  0.,  2.,  2.,  5.,  5., 51.],
        [33.,  1.,  2.,  1.,  3.,  0.,  0.,  0.,  0.,  0.]]),
 array([1.52553737e-16, 1.00000000e-01, 2.00000000e-01, 3.00000000e-01,
        4.00000000e-01, 5.00000000e-01, 6.00000000e-01, 7.00000000e-01,
        8.00000000e-01, 9.00000000e-01, 1.00000000e+00]),
 <a list of 2 BarContainer objects>)

Increasing the number of training iterations improves results

In [181]:
### ASSESS NETWORK WITH CROSS VALIDATION ###
# To make sure the network works properly, we need to measure its success with different subset of testing samples
# Since both the dataset and network are small, we'll use Leave One Out cross validation:
# Take a single sample out from the training set, and try to predict it. We repeat this operation N times
# (one for each sample)
MAX_ITER = 2000
N_HIDDEN = 8
scores = np.zeros(MAX_ITER)


# New LOO sample for cross validation
for n in range(X.shape[0]):
    X_train_loo, y_train_loo = np.delete(X, n, axis=0), np.delete(y, n, axis=0)[:, None]
    X_train_loo = np.concatenate([np.ones((X_train_loo.shape[0], 1)), X_train_loo], axis=1)
    X_test_loo, y_test_loo = X[None, n, :], y[n, None]
    # Reset weights
    wh     = np.random.rand(X_train_loo.shape[1], N_HIDDEN) # Weights of hidden layer
    wo     = np.random.rand(N_HIDDEN,1) # Weights of output layer, shape[4, 1]
    zh, zo, H, O = feedforward(X_train_loo, wh, wo)
    # Train for current LOO round
    for i in range(MAX_ITER):
        # Adjust weights according to prediction errors
        wh, wo = backprop(X_train_loo, y_train_loo, wh, wo)
        # Binarize predictions probs into 1/0
        pred = predict(X_test_loo, wh, wo)[:, 0] > 0.5
        # Single sample can be either correct or wrong. Store result
        scores[i] += pred == y_test_loo
    if not n % (X.shape[0]//10):
        print(f"{100*(n/X.shape[0]):.2f}% rounds completed")

# Convert number of correct guesses into proportion of correct LOO guess
scores /= X.shape[0]

0.00% rounds completed
9.43% rounds completed
18.87% rounds completed
28.30% rounds completed
37.74% rounds completed
47.17% rounds completed
56.60% rounds completed
66.04% rounds completed
75.47% rounds completed
84.91% rounds completed
94.34% rounds completed


In [179]:
%matplotlib notebook
plt.plot(range(MAX_ITER), scores)

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7f454fead8d0>]

In [180]:
print(
    f"Your model has a misclassification rate of {100*(1-scores[-1]):2f}% on LOO cross "
    f"validation after {MAX_ITER} training iterations"
)

Your model has a misclassification rate of 31.132075% on LOO cross validation after 2000 training iterations
