Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "GERARDO DURAN MARTIN"
STUDENT_ID = "200774408"

---

# MTH793P - Coursework 1

This is a template notebook for the computational exercises of [Coursework 1](https://qmplus.qmul.ac.uk/pluginfile.php/2533631/mod_assign/introattachment/0/coursework1.pdf?forcedownload=1) of the module MTH793P, Advanced Machine Learning. Closely follow the instructions in this template in order to complete the assessment and to obtain full marks. Please only modify cells where you are instructed to do so. Failure to comply may result in unexpected errors that can lead to mark deductions.

Author: [Martin Benning](mailto:m.benning@qmul.ac.uk)

Date: 13.01.2021

As usual, we begin by loading the necessary libraries.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse as sp
from scipy.sparse import linalg

Create two lists *nodes* and *edges* and one NumPy array *weights*. The list *nodes* should contain the names of all nodes in the graph of Coursework 1, Question 1, in alphabetical order. The list *edges* should contain one list per edge, holding the indices of the two nodes that the edge connects. For example, the first edge connects node 'Batman' and node 'Jessica Jones', so the list for this edge should be [0, 3], as 'Batman' is the first entry of the list *nodes* and 'Jessica Jones' is the fourth entry. The NumPy array *weights* should contain the weight assigned to each edge, in the same order as *edges*.

In [3]:
nodes = sorted(["Spiderman", "Deadpool", "Jessica Jones", "Wonder woman", "Catwoman", "Batman"])

nodes_dict = {name: val for val, name in enumerate(nodes)}
edges_name = [
    ["Batman", "Jessica Jones"],
    ["Batman", "Catwoman"],
    ["Batman", "Deadpool"],
    ["Batman", "Wonder woman"],
    ["Deadpool", "Jessica Jones"],
    ["Jessica Jones", "Wonder woman"],
    ["Jessica Jones", "Spiderman"],
    ["Catwoman", "Wonder woman"],
    ["Catwoman", "Spiderman"],
    ["Deadpool", "Spiderman"],
    ["Spiderman", "Wonder woman"]
]

edges = [[nodes_dict[n1], nodes_dict[n2]] for n1, n2 in edges_name]

weights = np.array([4, 81, 16, 64, 64, 36, 49, 49, 1, 49, 1])

Display your lists and array with the following cell.

In [4]:
print('We consider a graph with nodes/vertices {n}, edges {e} and weights {w}.'.format( \
                            n = nodes, e = edges, w = weights))

We consider a graph with nodes/vertices ['Batman', 'Catwoman', 'Deadpool', 'Jessica Jones', 'Spiderman', 'Wonder woman'], edges [[0, 3], [0, 1], [0, 2], [0, 5], [2, 3], [3, 5], [3, 4], [1, 5], [1, 4], [2, 4], [4, 5]] and weights [ 4 81 16 64 64 36 49 49  1 49  1].


Write a function **construct_incidence_matrix** that takes the lists *nodes*, *edges* and the NumPy array *weights* as arguments. The function should return the incidence matrix *incidence_matrix* that corresponds to the weighted graph. The construction of the incidence matrix is described in the lecture notes. The returned matrix should ideally be a sparse matrix of the format LIL; for more information on sparse matrices in LIL format, please check [the documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html). In particular, the use of the command *sp.lil_matrix* to initialise the sparse matrix is recommended.

In [5]:
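def construct_incidence_matrix(nodes, edges, weights):
    """
    Return the weighted incidence matrix as a dense
    array of shape (n_edges, n_nodes).
    """
    n_edges = len(edges)
    n_nodes = len(nodes)
    
    Mw = np.zeros((n_edges, n_nodes))
    for e, ((n1, n2), w) in enumerate(zip(edges, weights)):
        # Row e holds -sqrt(w) at the first node and +sqrt(w) at the second
        w = np.sqrt(w)
        Mw[e, n1] = -w
        Mw[e, n2] = w
    
    return Mw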
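
The exercise recommends the LIL format, whereas the function above returns a dense array. The following is a minimal sketch of a sparse variant, assuming the same sign convention ($-\sqrt{w}$ at the first node of an edge, $+\sqrt{w}$ at the second). The name `construct_incidence_matrix_sparse` is hypothetical and not part of the template.

```python
import numpy as np
from scipy import sparse as sp


def construct_incidence_matrix_sparse(nodes, edges, weights):
    """Hypothetical sparse counterpart of construct_incidence_matrix (LIL format)."""
    incidence = sp.lil_matrix((len(edges), len(nodes)))
    for row, ((start, end), weight) in enumerate(zip(edges, weights)):
        root = np.sqrt(weight)
        incidence[row, start] = -root
        incidence[row, end] = root
    return incidence


# Same small example as in the test cell of the template
example = construct_incidence_matrix_sparse(
    ['Batman', 'Catwoman', 'Spiderman'], [[0, 1], [1, 2]], np.array([81, 1]))
print(example.toarray())
```

LIL is convenient here because entries are assigned one at a time; for later arithmetic the matrix can be converted with `example.tocsr()`.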

You can test your function with the following cell.

In [6]:
from numpy.testing import assert_array_equal
incidence_matrix = construct_incidence_matrix(nodes, edges, weights)    
if sp.issparse(incidence_matrix):
    print('The incidence matrix of our graph is \n {i}.'.format(i = \
                    sp.csr_matrix.todense(incidence_matrix.astype(int))))
else:
    print('The incidence matrix of our graph is \n {i}.'.format(i = \
                    incidence_matrix.astype(int)))
# The following code is testing the previous code against a specific example    
test_nodes = ['Batman', 'Catwoman', 'Spiderman']
test_edges = [[0, 1], [1, 2]]
test_weights = np.array([81, 1])
test_incidence_matrix = construct_incidence_matrix(test_nodes, test_edges, test_weights)
if sp.issparse(test_incidence_matrix):
    assert_array_equal(sp.csr_matrix.todense(test_incidence_matrix.astype(int)), \
                       np.array([[-9, 9, 0],[0, -1, 1]]))
else:
    assert_array_equal(test_incidence_matrix.astype(int), \
                       np.array([[-9, 9, 0],[0, -1, 1]]))

The incidence matrix of our graph is 
 [[-2  0  0  2  0  0]
 [-9  9  0  0  0  0]
 [-4  0  4  0  0  0]
 [-8  0  0  0  0  8]
 [ 0  0 -8  8  0  0]
 [ 0  0  0 -6  0  6]
 [ 0  0  0 -7  7  0]
 [ 0 -7  0  0  0  7]
 [ 0 -1  0  0  1  0]
 [ 0  0 -7  0  7  0]
 [ 0  0  0  0 -1  1]].


Next, compute the corresponding graph-Laplacian for the incidence matrix *incidence_matrix* from the previous exercise and store it in a variable named *graph_laplacian*. Follow the definition from the lecture notes.

In [7]:
graph_laplacian = incidence_matrix.T @ incidence_matrix

You can check your result with the following cell.

In [8]:
if sp.issparse(graph_laplacian):
    print('The graph Laplacian of our graph is \n {g}.'.format(g = \
                        sp.csr_matrix.todense(graph_laplacian.astype(int))))
else:
    print('The graph Laplacian of our graph is \n {g}.'.format(g = \
                        graph_laplacian.astype(int)))

The graph Laplacian of our graph is 
 [[165 -81 -16  -4   0 -64]
 [-81 131   0   0  -1 -49]
 [-16   0 129 -64 -49   0]
 [ -4   0 -64 153 -49 -36]
 [  0  -1 -49 -49 100  -1]
 [-64 -49   0 -36  -1 150]].
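
Assuming the definition $L = M_w^\top M_w$ used in the cell above, each diagonal entry of the Laplacian is the sum of the edge weights at that node and each off-diagonal entry is the negated weight of the connecting edge. In the printed matrix, for instance, Batman's diagonal entry is 165 = 81 + 16 + 4 + 64. The sketch below checks this structure on the small three-node example from the test cell; all variable names are chosen for this illustration only.

```python
import numpy as np

# Small example graph: Batman - Catwoman (weight 81), Catwoman - Spiderman (weight 1)
example_edges = [[0, 1], [1, 2]]
example_weights = np.array([81.0, 1.0])

# Weighted incidence matrix with -sqrt(w) and +sqrt(w) in each row
example_incidence = np.zeros((2, 3))
for row, ((start, end), weight) in enumerate(zip(example_edges, example_weights)):
    example_incidence[row, start] = -np.sqrt(weight)
    example_incidence[row, end] = np.sqrt(weight)

example_laplacian = example_incidence.T @ example_incidence

# Alternative construction: degree matrix minus weighted adjacency matrix
adjacency = np.zeros((3, 3))
for (start, end), weight in zip(example_edges, example_weights):
    adjacency[start, end] = adjacency[end, start] = weight
degree = np.diag(adjacency.sum(axis=1))

print(np.allclose(example_laplacian, degree - adjacency))  # both constructions agree
print(np.allclose(example_laplacian.sum(axis=1), 0))       # every row sums to zero
```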


We want to use our graph to determine whether a node in the graph belongs to the class "Marvel" or the class "DC". Suppose we are in a semi-supervised setting, where the node "Deadpool" is already labelled $v_{\text{Deadpool}} = 0$ (class "Marvel") and the node "Catwoman" is labelled as $v_{\text{Catwoman}} = 1$ (class "DC"). Here $v$ is the mathematical notation of the label vector. We follow the instructions in the lecture notes and formulate this as a linear system. We can either define appropriate projection matrices or create sub-matrices from the graph-Laplacian by choosing the correct indices. How you set up the linear system is up to you. Store your linear system in a variable named *linear_system* and the right-hand-side in a variable *right_hand_side*.
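
One way to arrive at such a linear system is sketched here, under the assumption that the lecture notes minimise the quadratic form $v^\top L v$ subject to the known labels. The symbols $P_R$, $P_{R^\perp}$, $y$ and $u$ are notation chosen for this sketch. Write $v = P_R^\top y + P_{R^\perp}^\top u$, where $P_R$ selects the labelled nodes, $P_{R^\perp}$ selects the unlabelled ones, $y$ holds the known labels and $u$ the unknown ones. Since $L$ is symmetric, setting the gradient with respect to $u$ to zero gives

$$
2\, P_{R^\perp} L \left( P_R^\top y + P_{R^\perp}^\top u \right) = 0
\quad\Longleftrightarrow\quad
P_{R^\perp} L P_{R^\perp}^\top u = -\, P_{R^\perp} L P_R^\top y .
$$

Here $y = (1, 0)^\top$ for the nodes "Catwoman" and "Deadpool". The matrix on the left plays the role of *linear_system* and the vector on the right that of *right_hand_side*.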

In [9]:
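def create_projection_matrices(labels, known_nodes, nodes, nodes_dict):
    """
    Return the projection matrices PR and PR⊥ that select
    the known and the unknown nodes, respectively
    
    Parameters
    ----------
    labels: array
        Values of known nodes (not used in the construction)
    known_nodes: list
        Names of known nodes
    nodes: list
        Names of all nodes
    nodes_dict: dict
        Map from node name to its index in nodes
    """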
    n_nodes = len(nodes)
    n_known = len(known_nodes)
    
    # Preserve the order of `nodes`; a set difference has no guaranteed order
    unknown_nodes = [node for node in nodes if node not in known_nodes]
    
    PR = np.zeros((n_known, n_nodes))
    PR_perp = np.zeros((n_nodes - n_known, n_nodes))
    
    for i, node in enumerate(known_nodes):
        pos = nodes_dict[node]
        PR[i, pos] = 1
    
    for i, node in enumerate(unknown_nodes):
        pos = nodes_dict[node]
        PR_perp[i, pos] = 1
    
    return PR, PR_perp

labels = np.array([1, 0])
known_nodes = ["Catwoman", "Deadpool"]

PR, PR_perp = create_projection_matrices(labels, known_nodes, nodes, nodes_dict)

linear_system = PR_perp @ graph_laplacian @ PR_perp.T
right_hand_side = -PR_perp @ graph_laplacian @ PR.T @ labels

print(linear_system, end="\n" * 2)

print(right_hand_side)

[[165.  -4.   0. -64.]
 [ -4. 153. -49. -36.]
 [  0. -49. 100.  -1.]
 [-64. -36.  -1. 150.]]

[81.  0.  1. 49.]
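
The exercise also allows building the system from sub-matrices of the graph-Laplacian instead of projection matrices. A minimal sketch of that route follows, assuming the Laplacian printed earlier and taking the unknown nodes in index order (Batman, Jessica Jones, Spiderman, Wonder woman). The variable names are chosen for this illustration only.

```python
import numpy as np

# Graph Laplacian as printed earlier in the notebook
# (node order: Batman, Catwoman, Deadpool, Jessica Jones, Spiderman, Wonder woman)
laplacian = np.array([
    [165, -81, -16,  -4,   0, -64],
    [-81, 131,   0,   0,  -1, -49],
    [-16,   0, 129, -64, -49,   0],
    [ -4,   0, -64, 153, -49, -36],
    [  0,  -1, -49, -49, 100,  -1],
    [-64, -49,   0, -36,  -1, 150],
], dtype=float)

known_idx = np.array([1, 2])          # Catwoman, Deadpool
known_labels = np.array([1.0, 0.0])   # DC = 1, Marvel = 0
# Remaining indices in ascending order
unknown_idx = np.setdiff1d(np.arange(laplacian.shape[0]), known_idx)

# Rows and columns of the unknown nodes form the system matrix
sub_system = laplacian[np.ix_(unknown_idx, unknown_idx)]
# Coupling between unknown and known nodes, applied to the known labels
sub_rhs = -laplacian[np.ix_(unknown_idx, known_idx)] @ known_labels

print(sub_system)
print(sub_rhs)
```

`np.ix_` builds the index grid needed to pick a block of rows and columns in one step; indexing with two plain index arrays would instead pick individual elements.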


Solve your linear system and store your labels in an array named *remaining_labels*. If *linear_system* is a sparse matrix, make sure to use the equivalent to **linalg.solve** for sparse matrices. Also create an array *all_labels* that contains all labels, as well as a boolean array *thresholded_labels* of the same size as *all_labels*. An entry should be True if the corresponding entry in *all_labels* is larger than 0.5 and False otherwise.

In [10]:
remaining_labels = np.linalg.solve(linear_system, right_hand_side)

all_labels = PR_perp.T @ remaining_labels
all_labels[PR.argmax(axis=1)] = labels

thresholded_labels = all_labels > 0.5

In [11]:
# Nodes classified as DC
np.array(nodes)[thresholded_labels]

array(['Batman', 'Catwoman', 'Wonder woman'], dtype='<U13')

In [12]:
# Nodes classified as Marvel
np.array(nodes)[~thresholded_labels]

array(['Deadpool', 'Jessica Jones', 'Spiderman'], dtype='<U13')

Check your results with the following cell.

In [13]:
print('The computed labels are {a}. Setting all values above 0.5 to one and'.format( \
        a = all_labels), 'the remaining ones to zero yields {t}.'.format(t = \
        thresholded_labels.astype(int)))

The computed labels are [0.77273292 1.         0.         0.22924953 0.12945476 0.71224896]. Setting all values above 0.5 to one and the remaining ones to zero yields [1 1 0 0 0 1].
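
The notebook solves a dense system, so `np.linalg.solve` is sufficient. If the system were stored as a sparse matrix, one routine that could be used instead is `scipy.sparse.linalg.spsolve`. The sketch below applies it to the reduced system, with the unknown nodes ordered as Batman, Jessica Jones, Spiderman, Wonder woman, and compares it with the dense solver. The variable names are chosen for this illustration only.

```python
import numpy as np
from scipy import sparse as sp
from scipy.sparse.linalg import spsolve

# Reduced system for the unknown nodes, ordered as
# Batman, Jessica Jones, Spiderman, Wonder woman
dense_system = np.array([
    [165,  -4,   0, -64],
    [ -4, 153, -49, -36],
    [  0, -49, 100,  -1],
    [-64, -36,  -1, 150],
], dtype=float)
rhs = np.array([81.0, 0.0, 1.0, 49.0])

# spsolve expects CSC or CSR input; other formats trigger an efficiency warning
sparse_system = sp.csc_matrix(dense_system)
sparse_solution = spsolve(sparse_system, rhs)

# The sparse and dense solvers should agree up to rounding
print(np.allclose(sparse_solution, np.linalg.solve(dense_system, rhs)))
```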


We conclude this coursework by computing the second eigenvector, i.e. the eigenvector that corresponds to the second smallest eigenvalue, of the graph-Laplacian *graph_laplacian*. Explore how to compute eigenvalues and eigenvectors, in particular of sparse matrices if you have used them.

In [14]:
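# The Laplacian is symmetric, so eigh applies; unlike eig, it returns the eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(graph_laplacian)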

Display the second smallest eigenvalue and the corresponding eigenvector. What do you observe?

In [15]:
print('The second smallest eigenvalue of the graph Laplacian is {eval}.'.format(eval = \
        eigenvalues[1]), 'The corresponding eigenvector is {evec}.T.'.format(evec = \
        eigenvectors[:, 1].T.real))

The second smallest eigenvalue of the graph Laplacian is 33.16375166078198. The corresponding eigenvector is [ 0.39722732  0.48514106 -0.39542065 -0.30661694 -0.50260786  0.32227706].T.
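
One observation from the printed eigenvector: its entries are positive for Batman, Catwoman and Wonder woman and negative for Deadpool, Jessica Jones and Spiderman, which is the same split as the thresholded labels. The sketch below reproduces this with `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. It assumes the Laplacian printed earlier; the variable names are chosen for this illustration only.

```python
import numpy as np

node_names = ['Batman', 'Catwoman', 'Deadpool', 'Jessica Jones', 'Spiderman', 'Wonder woman']

# Graph Laplacian as printed earlier in the notebook
laplacian = np.array([
    [165, -81, -16,  -4,   0, -64],
    [-81, 131,   0,   0,  -1, -49],
    [-16,   0, 129, -64, -49,   0],
    [ -4,   0, -64, 153, -49, -36],
    [  0,  -1, -49, -49, 100,  -1],
    [-64, -49,   0, -36,  -1, 150],
], dtype=float)

# eigh returns eigenvalues in ascending order, so index 1 is the second smallest
eigvals, eigvecs = np.linalg.eigh(laplacian)
fiedler_value = eigvals[1]
fiedler_vector = eigvecs[:, 1]

# Eigenvectors are only defined up to sign; fix the sign so that Catwoman is positive
if fiedler_vector[node_names.index('Catwoman')] < 0:
    fiedler_vector = -fiedler_vector

same_side_as_catwoman = [name for name, entry in zip(node_names, fiedler_vector) if entry > 0]
print(fiedler_value)
print(same_side_as_catwoman)
```

Because the sign of an eigenvector is arbitrary, only the grouping of the nodes is meaningful, not which group is positive.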
