# Session 4: Markov processes and graphs

In this exercise worksheet, you will use graph representations to tackle biological problems. You will represent a network of protein-protein interactions as a graph and identify clusters of frequently interacting proteins.

You will also implement a hidden Markov model to model the secondary structure of proteins from their amino acid sequences.


# Exercise 1: Graph representation

Given a protein-protein interaction (PPI) network represented as a graph, we want to identify clusters of frequently interacting proteins. We will then infer their potential role or pathway from the annotation terms associated with each protein.

In this example, we use a protein network from [Polo et al., 2018](https://www.nature.com/articles/s41598-018-28739-6). This network contains interactions between proteins modulated by arsenical compounds in bladder, kidney and prostate cancer (data identifier: http://doi.org/10.18119/N9T01M).

The network is split into two arrays:
* An edge array, where each row is an edge between two nodes (proteins)
* A node array, where each row is a protein, with additional information, such as biological role

In [7]:
### DATA LOADING AND CLEANING ###

import numpy as np
import pandas as pd

# Load edges into a numpy matrix
edges = np.loadtxt('data/session_4_network_edges.csv', dtype=str)
# Load the list of nodes and their attributes into a dataframe
nodes = pd.read_csv('data/session_4_network_nodes.csv', sep='\t')
# Annotations stored as strings with "|" separators -> lists are easier to work with
nodes['Xref'] = nodes['Xref'].astype(str).apply(lambda x: x.split('|'))
nodes['Xref ID'] = nodes['Xref ID'].astype(str).apply(lambda x: x.split('|'))

**a) Manipulate the network. How many nodes are there? Which node has the highest degree?**
> Hint: The degree of a node is the number of edges it has. Use the edges table to compute the degree of each node. Be careful: each edge appears only once in the table, but a node can be listed arbitrarily as source or target (since the network is undirected).


In [8]:
print(f"There are {nodes.shape[0]} different nodes")

There are 941 different nodes


In [9]:
# We concatenate source and target nodes to get all nodes involved in an edge
all_node_occurences = np.concatenate([edges[:, 0], edges[:, 1]])
# Use np.unique to get the unique nodes and their number of occurrences
# Could also loop on nodes and count their occurrences with a dictionary
u_nodes, counts = np.unique(all_node_occurences, return_counts=True)
# Argsort returns the index order which would yield sorted count values
# Note: [::-1] is used to reverse the array from ascending (lowest first) to descending (highest first)
sort_order = np.argsort(counts)[::-1]
# We use the first 3 sorted indices to retrieve node names and counts at the same time
print("Top 3 nodes with most edges:")
for i in sort_order[:3]:
    print(f"{u_nodes[i]}: {counts[i]}")

Top 3 nodes with most edges:
P04637: 306
P40337: 208
Q04206: 93
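
The dictionary-based alternative mentioned in the comments above can be sketched as follows. This is a minimal illustration on a made-up edge list (`toy_edges` and its node names are hypothetical, not taken from the dataset), using `collections.Counter` as the counting dictionary.

```python
from collections import Counter

# Toy undirected edge list (hypothetical node names, not from the dataset)
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

# Each edge adds one to the degree of both of its endpoints,
# regardless of which one is listed as source or target
degree = Counter()
for source, target in toy_edges:
    degree[source] += 1
    degree[target] += 1

# most_common sorts nodes by decreasing degree
top_node, top_degree = degree.most_common(1)[0]
print(top_node, top_degree)
```

The sum of all degrees is always twice the number of edges, which is a quick sanity check for this kind of count.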


(Optional) What are the annotation terms common to these 3 nodes?
> Hint: try to use pandas to extract the annotations from the Xref column

In [10]:
from collections import defaultdict
annot_top3 = nodes.loc[nodes['shared name'].isin(u_nodes[sort_order[:3]]), 'Xref'].tolist()
annot_top3 = [a for g in annot_top3 for a in g]
annot_top3_counts = defaultdict(int)
for a in annot_top3:
    annot_top3_counts[a] += 1
for k, v in annot_top3_counts.items():
    if v == 3:
        print(k)
# All three share transcription-related annotations (transcription factor binding, regulation of transcription),
# which is consistent with hubs: transcriptional regulators have many interaction partners

transcription factor binding
negative regulation of apoptotic process
cytosol
positive regulation of transcription, DNA-templated
negative regulation of transcription from RNA polymerase II promoter
nucleus
nucleoplasm


**b) Build an adjacency matrix A from the edge table (adjacency list).**
> Hint: Do not forget our network is undirected, the matrix should be symmetric


In [11]:
# Create an empty matrix of VxV, where V is the number of nodes in the network
A = np.zeros((nodes.shape[0], nodes.shape[0]))
# Get a mapping from node names to numeric indices, from 0 to V-1
nodes2num = {n: i for i, n in enumerate(nodes['shared name'])}
for edge in edges:
    source_name, target_name = edge[0], edge[1]
    source_idx = nodes2num[source_name]
    target_idx = nodes2num[target_name]
    A[source_idx, target_idx] += 1
    # Network is undirected, edges have no directions
    # so we also need to add the reverse direction to 
    # keep the matrix symmetric
    A[target_idx, source_idx] += 1
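
A quick way to check an adjacency matrix built this way is to test that it is symmetric and that its row sums give the node degrees. The sketch below does this on a small made-up network (`toy_names` and `toy_edges` are hypothetical and not part of the dataset).

```python
import numpy as np

# Toy undirected network (hypothetical node names, not from the dataset)
toy_names = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
name_to_idx = {name: i for i, name in enumerate(toy_names)}

# Fill both [i, j] and [j, i] so the matrix stays symmetric
toy_adj = np.zeros((len(toy_names), len(toy_names)))
for source, target in toy_edges:
    i, j = name_to_idx[source], name_to_idx[target]
    toy_adj[i, j] += 1
    toy_adj[j, i] += 1

# An undirected network must give a matrix equal to its transpose
is_symmetric = np.array_equal(toy_adj, toy_adj.T)
# Row sums of the adjacency matrix are the node degrees
degrees = toy_adj.sum(axis=1)
print(is_symmetric, degrees)
```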

In [12]:
# This cell is just to show you what the adjacency matrix looks like
# Can you see some specific patterns? What do they mean?
%matplotlib notebook
import matplotlib.pyplot as plt
plt.imshow(A, cmap='afmhot')
plt.title("Adjacency table visualisation")
plt.xlabel('target node')
plt.ylabel('source node')
plt.colorbar()
plt.show()


**c) Write a function to measure network modularity. We will use it later to assess our clustering.**

In [15]:
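def modularity(adj, clusters):
    """
    Compute the modularity of clusters from an adjacency
    matrix adj representing an undirected network. The clusters
    should be given as lists of indices corresponding to
    rows/cols of the matrix. If there are k clusters, there
    should be a list of k lists, each containing the indices of its nodes.
    Values are in the range [-1/2, 1] (higher = better).
    >>> modularity(np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]]), [[0, 1], [2]])
    0.0
    """
    # Total number of edges (each edge is counted twice in a symmetric matrix)
    E = adj.sum() / 2

    Q = 0
    for k in clusters:
        # Number of edges with both endpoints inside the cluster
        l = adj[np.ix_(k, k)].sum() / 2
        # Number of edges leaving the cluster
        edges_outside = adj[k, :].sum() - 2*l
        # Sum of the degrees of the cluster's nodes
        d = 2*l + edges_outside
        # Edges expected inside the cluster if edges were placed at random
        expected = d**2 / (4*E)
        Q += l - expected

    Q /= E

    # Return a plain Python float so the doctest output does not depend on NumPy's repr
    return float(Q)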

In [16]:
a = np.array([[0, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]])
modularity(a, [[0], [1, 2, 3]])

-0.03125
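
As a check on the value above, the computation performed by `modularity` can be written out by hand. The symbols are notation introduced here for illustration: $E$ is the total number of edges, $l_c$ the number of edges inside cluster $c$, and $d_c$ the sum of the degrees of its nodes.

$$
Q = \frac{1}{E} \sum_{c} \left( l_c - \frac{d_c^2}{4E} \right)
$$

For the toy matrix, $E = 4$. Cluster $\{0\}$ has $l = 0$ and $d = 1$, while cluster $\{1, 2, 3\}$ has $l = 3$ and $d = 7$:

$$
Q = \frac{1}{4}\left[\left(0 - \frac{1}{16}\right) + \left(3 - \frac{49}{16}\right)\right] = \frac{1}{4}\cdot\left(-\frac{2}{16}\right) = -\frac{1}{32} = -0.03125
$$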

**c) Implement Markov clustering to separate nodes into clusters (without knowing how many clusters there are beforehand).**

In [17]:
def expand(w):
    """
    Simulate random walk by multiplying
    the matrix by itself (dot product).
    """
    return w @ w


def inflate(w, r=2):
    """
    Increase the contrast of nodes by raising
    each value in the matrix to its rth power
    """
    return w**r


def prune(w, p=10e-6):
    """
    Prune an input matrix by setting entries below p to 0.
    Note: Be careful to never prune the highest transition
    probability of a node.
    """
    pruned = w.copy()
    for row in range(w.shape[0]):
        thresh = min(w[row, :].max(), p)
        to_prune = w[row, :] < thresh
        pruned[row, to_prune] = 0
    return pruned

    
def check_convergence(mat1, mat2, tol=10e-9):
    """
    Check if values have changed more than a threshold
    between two matrices (previous and current iteration)
    """
    abs_diff = np.abs(mat1 - mat2)
    converged = abs_diff.sum() < tol
    return converged


def to_probs(w):
    """
    Normalize values in a matrix to transition probabilities.
    The resulting matrix should have each row summing to 1.
    """
    denom = w.sum(axis=1)[:, None]
    denom[denom == 0] = 1
    return w / denom


def run_markov_clustering(adj, r=2, p=10e-3, tol=10e-9, max_iter=1000):
    """
    Coordinates the actual clustering process. The iterative
    procedure is run until convergence, or max_iter iterations
    have been performed. First, self-contacts are added to the input
    adjacency matrix. The matrix is then converted to a stochastic
    probability matrix, and at each iteration, the following operations
    are performed:
     - Expansion: Simulate a random walk by multiplying the matrix with itself
     - Inflation: Raise each element in the matrix to its rth power to increase contrast
     - Pruning: Set edges with very low flow to zero, to speed up convergence
     - Convert the matrix back to probabilities
     - Check for convergence by comparing the previous and current matrix
    
    The final probability matrix is returned.
    """
    assert r > 1, "r is too small"
    assert max_iter >= 0, "max_iter cannot be negative"
    assert tol >= 0, "tol cannot be negative"
    converged = False
    w = adj + np.eye(adj.shape[0])
    w = to_probs(w)
    i = 0
    while not converged and i < max_iter:
        last_mat = w.copy()
        w = expand(w)
        w = inflate(w, r=r)
        w = prune(w, p=p)
        w = to_probs(w)
        converged = check_convergence(last_mat, w, tol=tol)
        i += 1
    print(f'{"Converged" if converged else "Did not converge"} after {i} iterations.')
    # Final clean-up using prune's default threshold (independent of the p argument)
    w = prune(w)
    return w




In [49]:
# Run the markov clustering on a dummy network
a = np.array([[0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 1], [1, 0, 1, 0, 1],[0, 0, 1, 1, 0]])
mcl = run_markov_clustering(a, r=2)
mcl

Converged after 6 iterations.


array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]])
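
To see what a single iteration does, the sketch below applies one expansion and one inflation step by hand to the same dummy network. It is a simplified illustration only (no pruning, no convergence check), with variable names chosen here for the example.

```python
import numpy as np

# Same dummy network as above
adj = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

# Add self-loops and normalise each row into transition probabilities
w = adj + np.eye(adj.shape[0])
w = w / w.sum(axis=1, keepdims=True)

# Expansion: two-step random walk probabilities (rows still sum to 1)
expanded = w @ w
# Inflation: squaring favours strong transitions, but rows no longer sum to 1...
inflated = expanded ** 2
# ...so the rows are rescaled back into probabilities
renormalised = inflated / inflated.sum(axis=1, keepdims=True)

print(np.round(renormalised, 3))
```

After inflation and renormalisation, the largest probability in each row is at least as large as it was after expansion, which is why repeated iterations drive each row towards a single attractor.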

In [19]:
# Visualize the input and output matrices of markov clustering
%matplotlib notebook
fig, ax = plt.subplots(1, 2)
ax[0].imshow(a)
ax[1].imshow(mcl)


In [23]:
# Run the clustering algorithm on the whole protein network and visualize the resulting matrix
%matplotlib notebook
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True)
ax[0].imshow(A)
amcl =  run_markov_clustering(A, r=2)
ax[1].imshow(amcl)


Converged after 11 iterations.



**d) Write a function to get the list of clusters and their members from the matrix produced by the Markov clustering. How many clusters are there? What is their average size? What is the largest size?**
> Hint: Nonzero entries on the matrix's diagonal are called attractors. Every node with a nonzero entry in an attractor's column is in that attractor's cluster. In the example below, the input adjacency matrix is on the left, and the output matrix from Markov clustering is on the right. Looking at the attractors (nonzero diagonal entries) in the output, you see that the first attractor at [0, 0] is contacted by the second node at [1, 0], while the second attractor at [3, 3] is contacted by two other nodes at [2, 3] and [4, 3]. This means there are two clusters with the following node membership: [(0, 1), (2, 3, 4)].

![result_visual_markov_clustering](images/markov_clustering_result.png)

In [24]:
def get_clusters(mat):
    """
    Given the output matrix from the Markov
    clustering, extract the clusters and their
    member nodes. The output is a set of tuples, e.g.
    {(0, 1), (2, 3), (4, 5)}
    when there are 3 clusters with 2 nodes each.
    """
    clusters = set()
    # Loop over diagonal elements
    for i, a in enumerate(mat.diagonal()):
        # If a is nonzero (is an attractor)
        if a:
            # Get the indices of rows with nonzero
            # values in the attractor's column
            members = tuple(np.nonzero(mat[:, i])[0])
            clusters.add(members)
    return clusters
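
As a quick check of the attractor logic, the sketch below applies the same idea to the output matrix obtained earlier for the dummy network. It is a compact, self-contained variant written for illustration (`mcl_out`, `attractors` and `toy_clusters` are names chosen here).

```python
import numpy as np

# Output of the Markov clustering on the dummy network, copied from above
mcl_out = np.array([
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
], dtype=float)

# Attractors are the nodes with a nonzero diagonal entry
attractors = np.flatnonzero(mcl_out.diagonal())

# The members of a cluster are the rows with a nonzero entry
# in the attractor's column
toy_clusters = {
    tuple(int(i) for i in np.flatnonzero(mcl_out[:, a]))
    for a in attractors
}
print(sorted(toy_clusters))
```

Storing the clusters in a set removes duplicates when several attractors share the same members.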
            

In [48]:
# Retrieve individual clusters for the protein interaction network
aclusters = get_clusters(amcl)
# Compute the mean and max size of clusters
avg = np.mean([len(c) for c in aclusters])
big = np.max([len(c) for c in aclusters])
print(
    f"There are {len(aclusters)} clusters. The biggest has "
    f"{big} proteins and their average size is {avg:.2f}."
)

There are 47 clusters. The biggest has 181 proteins and their average size is 20.13.


In [47]:
# Quick visualisation of the cluster sizes
%matplotlib notebook
fig = plt.hist([len(c) for c in aclusters], 20); plt.title("Distribution of cluster sizes")


**e) Use modularity to optimize the parameters of the Markov clustering. What parameters yield the best modularity? What are the most frequent annotations in the biggest cluster?**

In [None]:
# Loop over each parameter combination, perform clustering and recompute modularity every time
Q = -2
best_combo = {}
for r in [2, 3, 4, 5]:
    for p in [0, 10e-10, 10e-5, 10e-3]:
        for tol in [10e-10, 10e-5, 10e-3]:
            res = run_markov_clustering(A, r=r, p=p, tol=tol)
            clu = get_clusters(res)
            mod = modularity(A, clu)
            # Retain the parameters which yielded the highest modularity
            if mod > Q:
                Q = mod
                best_combo = {'r': r, 'p': p, 'tol': tol}
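
The same grid search pattern can be written without nested loops using `itertools.product`. The sketch below is only an illustration of that pattern: `toy_score` is a hypothetical stand-in for "run the clustering, then compute modularity", so the best parameters it finds say nothing about the real network.

```python
from itertools import product


def toy_score(r, p, tol):
    """Hypothetical stand-in for 'cluster, then compute modularity'."""
    return -(r - 3) ** 2 - p - tol


# Same parameter values as in the nested loops, written as a dictionary
grid = {
    "r": [2, 3, 4, 5],
    "p": [0, 1e-9, 1e-4, 1e-2],
    "tol": [1e-9, 1e-4, 1e-2],
}

# Start below any reachable score so the first combination is always kept
best_score, best_params = float("-inf"), None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = toy_score(**params)
    # Retain the combination with the highest score seen so far
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

Keeping the grid in a dictionary means the resulting `best_params` can be unpacked directly into a function call with `**best_params`.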
                

In [29]:
print(f"The best combination of parameters is the following, with a modularity of {Q:.2f}:")
best_combo

The best combination of parameters is the following, with a modularity of 0.76:


{'r': 2, 'p': 0.0001, 'tol': 1e-09}

In [30]:
# We recompute the clustering for the whole network, using the parameters which optimize modularity
best_mcl = run_markov_clustering(A, **best_combo)
best_clu = get_clusters(best_mcl)


Converged after 16 iterations.


In [46]:
# We get the list of lengths (number of nodes) for each cluster
best_clu_list = list(best_clu)
clu_len = [len(c) for c in best_clu_list]
# We get the index of the cluster with the most members
biggest_idx = np.argmax(clu_len)
# We retrieve the members of this big cluster
biggest_clu = np.array(best_clu_list[biggest_idx])

In [45]:
# We retrieve the annotations of members for the largest cluster from the nodes table
from collections import defaultdict
annot_big_clu = nodes.Xref[biggest_clu].tolist()
annot_big_clu = [a for g in annot_big_clu for a in g]
annot_big_clu_counts = defaultdict(int)
for a in annot_big_clu:
    annot_big_clu_counts[a] += 1
[f'{k}: {v}' for k, v in sorted(annot_big_clu_counts.items(), key=lambda item: item[1])][::-1]

['cytosol: 101',
 'extracellular exosome: 92',
 'nucleus: 55',
 'cytoplasm: 54',
 'nucleoplasm: 49',
 'RNA binding: 40',
 'membrane: 38',
 'plasma membrane: 32',
 'mitochondrion: 29',
 'neutrophil degranulation: 23',
 'extracellular region: 20',
 'identity: 19',
 'signal transduction: 17',
 'ATP binding: 17',
 'endoplasmic reticulum membrane: 16',
 'perinuclear region of cytoplasm: 16',
 'metal ion binding: 15',
 'GTP binding: 15',
 'integral component of membrane: 14',
 'cadherin binding: 13',
 'endoplasmic reticulum: 13',
 'extracellular space: 13',
 'protein transport: 13',
 'mitochondrial inner membrane: 13',
 'secretory granule lumen: 12',
 'myelin sheath: 12',
 'Small GTP-binding protein: 12',
 'small GTPase mediated signal transduction: 12',
 'focal adhesion: 11',
 'structural constituent of ribosome: 11',
 'mitochondrial matrix: 11',
 'Golgi apparatus: 11',
 'Golgi membrane: 10',
 'SRP-dependent cotranslational protein targeting to membrane: 10',
 'translational initiation: 10'

Besides localisation (cytosol, membrane, nucleus, ...), which is not very informative, the most frequent annotations are "RNA binding", "neutrophil degranulation", "identity" and "signal transduction".
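
The counting and sorting of annotation terms can also be sketched with `collections.Counter`, whose `most_common` method returns terms by decreasing count. The annotation lists below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical annotation lists, one per protein (not from the dataset)
toy_annotations = [
    ["cytosol", "RNA binding"],
    ["cytosol", "nucleus"],
    ["cytosol", "RNA binding", "membrane"],
]

# Flatten the lists and count each term in one pass
counts = Counter(term for terms in toy_annotations for term in terms)

# most_common(n) returns the n most frequent terms with their counts
for term, n in counts.most_common(2):
    print(f"{term}: {n}")
```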

(Optional): It's always nice to have a visual representation of your graph, so here we use an external package to visualise the network and the resulting clusters.

In [209]:
# To run this demo visualisation, you need the markov_clustering package
# You can install it by uncommenting the line below:
#!pip install markov_clustering[drawing]
%matplotlib notebook
import markov_clustering as mc
result = mc.run_mcl(A, inflation=best_combo['r'], pruning_threshold=best_combo['p'])
clusters = mc.get_clusters(result)
mc.draw_graph(result, clusters, node_size=50, with_labels=False, edge_color="silver")
