In [None]:
import volta

In this example, a set of networks is clustered into groups based on the similarity of their nodes, edges and structural properties.
To run this pipeline, it is recommended to have ~8 GB of memory available on your system.

# Load & Pre-Process Networks

The first step of the pipeline consists of loading the chosen data set.
You can store your networks in any common format; however, the VOLTA package requires that the networks are provided as NetworkX Graph objects (refer to its documentation for detailed instructions). Moreover, the networks should be weighted: if you have an unweighted network, assign all edges the same edge weight. The package assumes "weight" to be the default edge weight label, but this can be changed when needed.

An example of how to pre-process a network stored as an edgelist is provided below. Further loading and storing examples are provided in the "import and export of networks" Jupyter notebook.

In [None]:
#location where the raw data files are stored; by default it is set to run from the installation folder
#- if applicable, please CHANGE it to the location of your networks

graph_location = "../networks/edgelists/"


In [None]:
#location where output should be saved
#Please set location
location = ""


In [None]:
import glob
import pandas as pd
import networkx as nx
import numpy as np

In [None]:

labels = []
networks_graphs = []

print("load networks")
#gets all files located in the specified folder that end on .edgelist
#CHANGE the ending if your files end differently
for path in glob.glob(graph_location +"*.edgelist"):
    
    #you can specify that only part of the file name should be used as network name for later identification
    name =  path.split("/")[-1].replace(".rds.edgelist", "")

    #ON WINDOWS SYSTEMS THE PATH SEPARATOR MAY NEED TO BE ADJUSTED
    #IF NECESSARY UNCOMMENT
    #name =  path.split("\\")[-1].replace(".rds.edgelist", "")
    
    
    #read the edgelist file as a dataframe
    fh = pd.read_csv(path, sep="\t")
    #convert it into a NetworkX graph G and specify the column names of the node pairs
    G=nx.from_pandas_edgelist(fh, "V1", "V2")
    
    #if you have an unweighted network assign all edges the same edge weight - here a value of 1 is assigned
    for u, v, d in G.edges(data=True):
        d['weight'] = 1
        
    
    #save the graph objects to a list (only suitable if small networks are processed)
    #this list contains all networks and is the main object used in the examples below
    networks_graphs.append(G)
    labels.append(name)
   

    

    print("loaded", name)
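
The manual separator handling in the loop above can be avoided by letting the standard library parse the path. The following is a minimal sketch, assuming the files end in ".rds.edgelist" as in the example data; network_name is a hypothetical helper and not part of VOLTA, and the file names are made up.

```python
from pathlib import PureWindowsPath


def network_name(path, suffix=".rds.edgelist"):
    """Return the file name of `path` without `suffix` (hypothetical helper)."""
    # PureWindowsPath accepts both "/" and "\\" as separators, so the same
    # call handles paths written in either style.
    file_name = PureWindowsPath(path).name
    if file_name.endswith(suffix):
        file_name = file_name[: -len(suffix)]
    return file_name


print(network_name("../networks/edgelists/drug_a.rds.edgelist"))
print(network_name("..\\networks\\edgelists\\drug_b.rds.edgelist"))
```

Unlike the replace call used in the loop, this only strips the suffix at the end of the file name.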

The NetworkX graph objects are converted into a list-of-lists format.

In [None]:
networks = volta.get_node_similarity.preprocess_graph(networks_graphs, attribute="weight")

Optional: If multiple networks are provided, get the union of nodes between them in order to ensure that all node names are mapped to the correct IDs (if this transformation is applied)

In [None]:
network_lists, mapping = volta.get_node_similarity.preprocess_node_list(networks)

In [None]:
import pickle

In [None]:
#save the mapping for later

with open(location + "node_id_mapping.pckl", "wb") as f:
    pickle.dump(mapping, f, protocol=4)
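
For later use, the saved mapping can be read back with pickle.load. The sketch below is self-contained and uses a temporary directory; it assumes, for illustration only, that the mapping is a dictionary from node names to integer IDs (the actual structure returned by VOLTA may differ), and the node names are made up.

```python
import os
import pickle
import tempfile

# Stand-in for the mapping returned by the preprocessing step (made-up values).
example_mapping = {"geneA": 0, "geneB": 1, "geneC": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = os.path.join(tmp_dir, "node_id_mapping.pckl")

    # Save the mapping the same way as in the cell above.
    with open(mapping_path, "wb") as f:
        pickle.dump(example_mapping, f, protocol=4)

    # Load it again, for example in a later session.
    with open(mapping_path, "rb") as f:
        restored_mapping = pickle.load(f)

# Invert the mapping to translate IDs back to node names.
id_to_name = {node_id: name for name, node_id in restored_mapping.items()}
print(id_to_name[1])
```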

# Node Similarity

In this example, we show how to compute node properties, such as degree centrality, betweenness centrality and closeness centrality, for the nodes shared between the networks (relevant when the networks contain different nodes).

The example uses the volta.get_node_similarity.sort_list_and_get_shared function. It is a wrapper function; if you are only interested in one output, you can call the underlying functions directly. Please refer to the documentation for this.

Outputs:
sorted_nodes contains the node IDs sorted by the selected properties as well as by the mean and median ranking

shared_nodes contains, for each node, the number of networks in which it occurs (this may be useful for networks that do not have the same nodes)

binary contains the shared nodes in a binary representation

centrality_values contains the selected properties for each node


Notes:
You can select the parameters that are most suitable for your analysis. Make sure to disable the asynchronous option (in_async=False) when running in Jupyter notebooks.



In [None]:
sorted_nodes, shared_nodes, binary, centrality_values = volta.get_node_similarity.sort_list_and_get_shared(network_lists, mapping, networks_graphs, labels, degree_centrality=True, closeness_centrality=True, betweenness=True, degree=False, in_async=False)

In order to calculate multiple distances between the networks, the volta.get_node_similarity.estimate_similarities_nodes function can be used. 

Like the previous function, this is a wrapper that calls different distance measures. If you only need some of the distances, you can call the individual functions used by the wrapper directly. Please refer to the documentation for this.

This wrapper returns distance/similarity matrices of jaccard similarity, jaccard distance, percentage of shared nodes, kendall rank correlation of degree centrality, closeness centrality and betweenness centrality and mean / median ranking as well as the corresponding p-values for the top and bottom kendall_x nodes (or as here all nodes), hamming distance and SMC similarity.

In [None]:
j, jd, percentage, kendall_dc_top, b_dc_top, kendall_cc_top, b_cc_top, kendall_betweenness_top, b_b_top, kendall_avg_top, b_avg_top, hamming, kendall_dc_bottom , b_dc_bottom , kendall_cc_bottom , b_cc_bottom , kendall_betweenness_bottom , b_b_bottom , kendall_avg_bottom , b_avg_bottom , smc, kendall_med_top, b_med_top, kendall_med_bottom, b_med_bottom =volta.get_node_similarity.estimate_similarities_nodes(network_lists, sorted_nodes, binary,  kendall_x=len(mapping), is_file=False, in_async=False)
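
To make some of the returned measures concrete, the sketch below computes the Jaccard similarity, Jaccard distance, simple matching coefficient (SMC) and normalized Hamming distance for two binary presence vectors. It uses the textbook definitions on made-up data; VOLTA's implementation may differ in detail.

```python
import numpy as np

# Presence/absence of five nodes in two networks (made-up data).
net_a = np.array([1, 1, 0, 1, 0], dtype=bool)
net_b = np.array([1, 0, 0, 1, 1], dtype=bool)

both = np.sum(net_a & net_b)       # nodes present in both networks
either = np.sum(net_a | net_b)     # nodes present in at least one network
neither = np.sum(~net_a & ~net_b)  # nodes absent from both networks

jaccard_similarity = both / either
jaccard_distance = 1 - jaccard_similarity
smc = (both + neither) / net_a.size             # simple matching coefficient
hamming = np.sum(net_a != net_b) / net_a.size   # normalized Hamming distance

print(jaccard_similarity, jaccard_distance, smc, hamming)
```

With these definitions the SMC and the normalized Hamming distance always sum to 1, whereas the Jaccard measures ignore positions that are absent from both networks.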

These distances can be merged into a single distance matrix or used individually.

Since all networks have the same nodes, we will only use the average rank correlation matrix previously computed, but transform it into a distance - the correlation value c is transformed to a distance with (1-c)/2. The distances will then be plotted as a heatmap. 

In [None]:
import numpy as np

In [None]:
mean_dist = kendall_avg_top.copy()

for index, x in np.ndenumerate(mean_dist):
    d = (1-x)/2
    
    mean_dist[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        mean_dist[index[0]][index[1]] = 0
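
The same transformation can be written as a vectorized NumPy expression. This is a minimal sketch, assuming the correlation matrix is a square NumPy array; the values are made up.

```python
import numpy as np

# Made-up symmetric correlation matrix for three networks.
correlation = np.array([
    [1.0, 0.5, -1.0],
    [0.5, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
])

# Map correlations in [-1, 1] to distances in [0, 1] with d = (1 - c) / 2.
distance = (1 - correlation) / 2

# A network has distance 0 to itself.
np.fill_diagonal(distance, 0)

print(distance)
```

A correlation of 1 maps to distance 0, a correlation of 0 to distance 0.5, and a correlation of -1 to distance 1.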

In [None]:
mean_dist_nodes = mean_dist.copy()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
f = volta.plotting.plot_heatmap(mean_dist, xlabels=labels, ylabels = labels)

# Edge Similarity

In this second example, we will use the same preprocessed networks to compute edge similarity. 

Since the networks used here (when using the example networks) were originally unweighted, we assigned each edge its edge betweenness value as weight and compared the networks based on this value for shared edges. If you are working with weighted networks you can either use the original edge weights or estimate their weighted/ unweighted betweenness values. 

In [None]:
#compute the edge betweenness of each network and assign it to the graph objects
bet = []
for net in networks_graphs:
    edges_betweenness = nx.edge_betweenness_centrality(net)
    bet.append(edges_betweenness)
    #write as new edge attribute to the graph (the graph is modified in place, nothing is returned)
    nx.set_edge_attributes(net, edges_betweenness, "betweenness")
    

We need to convert the networks into the list-of-lists format again, since this time the betweenness values will be used, and assign each edge an ID.

In [None]:
networks = volta.get_edge_similarity.preprocess_graph(networks_graphs, attribute="betweenness")



network_lists, mapping = volta.get_edge_similarity.preprocess_edge_list(networks)

#save mapping for later
with open(location + "edge_id_mapping.pckl", "wb") as f:
    pickle.dump(mapping, f, protocol=4)

Sort the edges by their betweenness values (which are used as weights) and determine, for each edge, in which networks it appears. Here the volta.get_edge_similarity.sort_list_and_get_shared function is used.

In [None]:
sorted_networks, shared_edges, binary = volta.get_edge_similarity.sort_list_and_get_shared(networks, mapping, network_lists, labels, in_async=False)

You can now compute distances/similarities between the networks based on the similarity of their edges.

The volta.get_edge_similarity.estimate_similarities_edges wrapper function returns jaccard similarity, jaccard distance, kendall rank coefficient for the kendall_x top and bottom edges (here ranked by betweenness), hamming distance and SMC similarity. As before, if only specific distances are needed, the individual functions can be called; please refer to the documentation for this.

In [None]:
j, jd, percentage, kendall_top,b_top, kendall_bottom, b_bottom, hamming, smc = volta.get_edge_similarity.estimate_similarities_edges(network_lists, sorted_networks, binary,  kendall_x=100, is_file=False, in_async=False)

Where applicable, the similarities are converted into a distance by taking 1-x (other conversions can also be used). The conversion is required since the provided clustering algorithms take distance matrices as input. Additionally, the individual similarity/distance metrics are combined for the later clustering analysis; this step is optional.

In [None]:
smc_dist = smc.copy()

for index, x in np.ndenumerate(smc_dist):
    d = 1-x
    
    smc_dist[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        smc_dist[index[0]][index[1]] = 0

In [None]:
p_dist = percentage.copy()

for index, x in np.ndenumerate(p_dist):
    d = 1-x
    
    p_dist[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        p_dist[index[0]][index[1]] = 0

In [None]:
k_dist = kendall_top.copy()

for index, x in np.ndenumerate(k_dist):
    d = (1-x)/2
    
    k_dist[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        k_dist[index[0]][index[1]] = 0

In [None]:
import statistics

In [None]:
#Optional: combine the individual matrices into a mean distance matrix

mean_dist = volta.clustering.create_mean_distance_matrix([k_dist, jd, hamming, p_dist, smc_dist], set_diagonal = True)
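
Conceptually, combining the matrices amounts to an element-wise average. The sketch below illustrates this idea with NumPy on made-up matrices; average_matrices is a hypothetical helper and is not claimed to reproduce the VOLTA function exactly.

```python
import numpy as np


def average_matrices(matrices, set_diagonal=True):
    """Element-wise mean of equally shaped distance matrices (hypothetical helper)."""
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    combined = stacked.mean(axis=0)
    if set_diagonal:
        # Force the self-distance of every network to 0.
        np.fill_diagonal(combined, 0)
    return combined


dist_one = [[0.0, 0.2], [0.2, 0.0]]
dist_two = [[0.1, 0.6], [0.6, 0.1]]

combined_example = average_matrices([dist_one, dist_two])
print(combined_example)
```

Averaging only makes sense if all matrices are on a comparable scale, which is why the similarities above are first converted to distances in the [0, 1] range.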

In [None]:
mean_dist_edges = mean_dist.copy()

In [None]:
f = volta.plotting.plot_heatmap(mean_dist, xlabels=labels, ylabels = labels)

# Structural Similarity

In this example, we compute network distances based on structural properties of the graphs.

The volta.get_network_structural_vector.estimate_vector function is a wrapper that computes a vector of structural parameters for each network, such as the network density, clustering and diameter. The distance between the networks is then estimated based on these individual vectors.

You can build your own vector by writing your own wrapper function or by calling the corresponding functions directly. Please refer to the documentation for further information.
The vector used in this example contains data about the number of nodes & edges, network density, amount of missing edges, cycles, shortest path distributions, clustering coefficient, and the degree centrality, closeness centrality and betweenness centrality distributions.

Note: The function requires a list of NetworkX graph objects as input. Some parameters can be expensive to compute on large networks. Therefore, it is advised to adjust the parameter selection based on your needs.



In [None]:
vectors = volta.get_network_structural_vector.estimate_vector(networks_graphs, edge_attribute="weight", is_file=False)

Based on these vectors a distance matrix between the networks can be estimated.

volta.get_network_structural_vector.matrix_from_vector is a wrapper function which estimates the euclidean, canberra, correlation, cosine and jaccard distance based on the vectors. Here, distance metrics that are not in the [0,1] range are normalized (normalize=True).

In [None]:
euclidean, canberra, correlation, cosine, jaccard = volta.get_network_structural_vector.matrix_from_vector(vectors, normalize=True)
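
As an illustration of a vector-based distance, the sketch below computes pairwise Euclidean distances between made-up property vectors and rescales them to [0, 1]. Dividing by the largest distance is only one possible normalization and is an assumption here; it is not necessarily what VOLTA does.

```python
import numpy as np

# One row per network, one column per structural property (made-up values).
example_vectors = np.array([
    [10.0, 20.0],
    [13.0, 24.0],
    [16.0, 28.0],
])

# Pairwise differences via broadcasting: shape (networks, networks, properties).
differences = example_vectors[:, None, :] - example_vectors[None, :, :]
euclidean_example = np.sqrt((differences ** 2).sum(axis=-1))

# Rescale to [0, 1] by dividing by the largest observed distance.
largest = euclidean_example.max()
euclidean_normalized = euclidean_example / largest if largest > 0 else euclidean_example

print(euclidean_normalized)
```

Properties on very different scales (for example edge counts versus densities) dominate an unscaled Euclidean distance, so scaling the individual properties first may be worth considering.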

In [None]:
#Optional: merge the individual distances into a single distance matrix

mean_dist = volta.clustering.create_mean_distance_matrix([jaccard, euclidean, canberra, cosine], set_diagonal = True)

In [None]:
mean_dist_structural = mean_dist.copy()

In [None]:
f = volta.plotting.plot_heatmap(mean_dist, xlabels=labels, ylabels = labels)

# Random Walks

In this example, we use random walks to characterize the structural/connectivity similarities around a specific node in different networks. The question we want to answer is: "Are the same nodes also similarly connected?"

The volta.get_walk_distances.helper_walks function performs random walks from different starting nodes. Later, the walks for the same starting node are compared between networks.

The networks used here all have the same nodes, so nodes can simply be set to the node object of one of the graphs.

E.g. nodes = networks_graphs[0].nodes()

If the networks have different nodes, you can select which nodes should be compared (e.g. the union of nodes or only their intersection). If a node does not exist in a network, None is returned for that node.
To reduce the computational cost, you can investigate only a subset of pre-selected nodes of interest.

The example below uses the union (all of the example networks have the same nodes).

In [None]:
nodes = []
for net in networks_graphs:
    for node in net.nodes():
        if node not in nodes:
            nodes.append(node)

In [None]:
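#select a random set of nodes
import random
#sample_size is a placeholder (the sample size is not given in the original example): set it to the number of nodes you want to investigate
sample_size = min(50, len(nodes))
nodes = random.sample(nodes, sample_size)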

For each selected node in each network, 2 walks of length 3 are performed. Here a random sample of nodes is used; if needed, this can be run on all nodes. The number of nodes, the number of walks and the walk length affect the computational complexity and memory usage. The parameters can be adjusted as needed.
Edges can be selected probabilistically based on their attributes or can be treated equally.


In [None]:
performed_walks = volta.get_walk_distances.helper_walks(networks_graphs, nodes, labels, steps=3, number_of_walks=2, degree=False, probabilistic=False, weight ="weight")
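
The idea behind a non-probabilistic walk (all edges treated equally) can be sketched in a few lines of plain Python. The adjacency dictionary and the random_walk helper below are hypothetical and only illustrate the principle; they are not VOLTA's implementation.

```python
import random


def random_walk(adjacency, start, steps, rng):
    """Walk `steps` edges from `start`, choosing each neighbour uniformly."""
    walk = [start]
    current = start
    for _ in range(steps):
        neighbours = adjacency[current]
        if not neighbours:
            break  # dead end: the walk stops early
        current = rng.choice(neighbours)
        walk.append(current)
    return walk


# Small made-up undirected network as an adjacency dictionary.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

rng = random.Random(42)  # fixed seed for reproducibility
walks = [random_walk(adjacency, "a", steps=3, rng=rng) for _ in range(2)]
print(walks)
```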

Now we estimate, for each starting node, how often the surrounding nodes/edges have been visited relative to all visited nodes/edges.
Note: Depending on your network sizes and the selected nodes, this can be quite memory intensive.

In [None]:
node_counts, edge_counts, nodes_frc, edges_frc = volta.get_walk_distances.helper_get_counts(labels, networks_graphs, performed_walks)
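
The counting step can be pictured as follows. This sketch uses made-up walks for a single starting node and assumes, for illustration, that the starting position itself is not counted as a visit and that edges are undirected; VOLTA may handle both points differently.

```python
from collections import Counter

# Two made-up walks starting at node "a".
example_walks = [["a", "b", "c", "a"], ["a", "c", "d", "c"]]

node_visits = Counter()
edge_visits = Counter()
for walk in example_walks:
    # Count every node reached after the starting position.
    node_visits.update(walk[1:])
    # Store undirected edges with sorted endpoints so (a, b) equals (b, a).
    edge_visits.update(tuple(sorted(step)) for step in zip(walk, walk[1:]))

total_node_visits = sum(node_visits.values())
node_fractions = {n: c / total_node_visits for n, c in node_visits.items()}

total_edge_visits = sum(edge_visits.values())
edge_fractions = {e: c / total_edge_visits for e, c in edge_visits.items()}

print(node_fractions)
print(edge_fractions)
```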

Now we want to estimate network similarities based on the visited nodes. For each network pair, the Kendall rank correlation (of the top 50 nodes) is calculated for the same starting node. The mean correlation value over all shared starting nodes of a network pair is estimated and returned.

In [None]:
results_edges, results_nodes, results_edges_p, results_nodes_p = volta.get_walk_distances.helper_walk_sim(networks_graphs, performed_walks, nodes, labels, top=50, undirected=False, return_all = False, nodes_ranked=nodes_frc, edges_ranked=edges_frc)
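
For reference, the Kendall rank correlation compares the ordering of all pairs of items in two rankings. The sketch below implements the simple tau-a variant (no tie correction) on made-up visit fractions; which variant VOLTA uses is not stated here, so treat this purely as an illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1   # pair is ordered the same way in both lists
        elif direction < 0:
            discordant += 1   # pair is ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Visit fractions of the same four nodes in two networks (made-up data).
fractions_net_1 = [0.4, 0.3, 0.2, 0.1]
fractions_net_2 = [0.35, 0.4, 0.15, 0.1]

tau = kendall_tau_a(fractions_net_1, fractions_net_2)
print(tau)
```

Only the first two nodes swap their order between the two networks, so five of the six pairs are concordant and one is discordant.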

Optional: The correlations can be converted into a distance (as required by the clustering algorithms). Here the transformation d = (1-c)/2 is used, where c is the correlation value and d the resulting distance.

In [None]:
cor_nodes = results_nodes.copy()

for index, x in np.ndenumerate(cor_nodes):
    d = (1-x)/2
    
    cor_nodes[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        cor_nodes[index[0]][index[1]] = 0

In [None]:
cor_edges = results_edges.copy()

for index, x in np.ndenumerate(cor_edges):
    d = (1-x)/2
    
    cor_edges[index[0]][index[1]] = d
    
    if index[0] == index[1]:
        cor_edges[index[0]][index[1]] = 0

In [None]:
mean_dist = volta.clustering.create_mean_distance_matrix([cor_nodes, cor_edges], set_diagonal = True)

In [None]:
mean_dist_walks = mean_dist.copy()

In [None]:
f = volta.plotting.plot_heatmap(mean_dist, xlabels=labels, ylabels = labels)

# Clustering

Clustering algorithms are run on each of the four mean distance matrices calculated in the previous sections (the algorithms can also be run on the individual distance matrices if desired).
Here we use three different algorithms, but any algorithm better suited to your type of analysis can be applied. Please refer to the documentation for available algorithms.

Based on the individual clusterings, a consensus clustering is created.
Each clustering algorithm is tuned based on an individually modifiable multiobjective function.

In [None]:
distances = [mean_dist_nodes, mean_dist_edges, mean_dist_structural, mean_dist_walks]
distances_name = ["nodes", "edges", "structural", "walks"]

In [None]:
clusterings = {}

for n in distances_name:
    clusterings[n] = []

Hierarchical clustering is run and the best k value is estimated based on a multiobjective function, which focuses on maximizing the distance between clusters, minimizing the distance within a cluster, and having an even cluster size distribution.
For the best k value the algorithm is run 10 times. One of the three selected algorithms involves some randomness. To avoid biasing the consensus towards one algorithm, all algorithms are run the same number of times.

In [None]:
t = []
for i in range(2,len(labels)):
    t.append(i)
    
hierarchical = {}
for d in range(len(distances)):
    dist = distances[d]
    n = distances_name[d]
    maxs = 10000

    for i, k in enumerate(t):
    #for i, k in enumerate([2, 3]):




        cl_labels = volta.clustering.hierarchical_clustering(dist, n_clusters=k, linkage="complete")


        #print("obj 1")
        avg_score = volta.clustering.multiobjective(dist, cl_labels, min_number_clusters=None, max_number_clusters=None, min_cluster_size = None, max_cluster_size=None, local =True, bet=False, e=None, s=None, cluster_size_distribution = True)



        if avg_score < maxs:
            maxs = avg_score
            mk = k





    #store the best k (mk), not the last k tested
    hierarchical[n] = mk
    print(maxs, mk)
    
    print("creating clusterings for", n, "with k ", mk)
    
    for xx in range(10):
    
    
        cl_labels = volta.clustering.hierarchical_clustering(dist, n_clusters=mk, linkage="complete")
        clusterings.setdefault(n, []).append(cl_labels)
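
The tuning loop above keeps the k with the lowest score. The sketch below shows the same selection pattern with a deliberately simple stand-in objective (mean within-cluster distance minus mean between-cluster distance, lower is better). This toy score is an assumption for illustration and is not the VOLTA multiobjective function; the distance matrix and candidate labelings are made up.

```python
import numpy as np


def toy_score(dist, cluster_labels):
    """Mean within-cluster distance minus mean between-cluster distance."""
    cluster_labels = np.asarray(cluster_labels)
    same = cluster_labels[:, None] == cluster_labels[None, :]
    off_diagonal = ~np.eye(len(cluster_labels), dtype=bool)
    within = dist[same & off_diagonal]
    between = dist[~same]
    within_mean = within.mean() if within.size else 0.0
    between_mean = between.mean() if between.size else 0.0
    return within_mean - between_mean


toy_dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# Candidate labelings, keyed by their number of clusters k.
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}

best_k, best_score = None, float("inf")
for k, candidate_labels in candidates.items():
    score = toy_score(toy_dist, candidate_labels)
    if score < best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```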



Affinity propagation has no parameters to be set so does not need to be tuned, but it is also run ten times (or 10x the same clustering is appended) so that for the later consensus for each clustering algorithm the same number of clusterings are provided. 

In [None]:
for d in range(len(distances)):
    dist = distances[d]
    n = distances_name[d]
    
    for xx in range(10):
        cl_labels = volta.clustering.affinityPropagation_clustering(dist)
        clusterings.setdefault(n, []).append(cl_labels)

K-Medoids is tuned on the same multiobjective function, and for the best k the algorithm is run ten times.

In [None]:
t = []
for i in range(2,len(labels)):
    t.append(i)
    
kmed = {}
for d in range(len(distances)):
    dist = distances[d]
    n = distances_name[d]
    maxs = 10000

    for i, k in enumerate(t):
    #for i, k in enumerate([2, 3]):




        cl_labels, mediods = volta.clustering.kmedoids_clustering(dist, n_clusters=k)


        #print("obj 1")
        avg_score = volta.clustering.multiobjective(dist, cl_labels, min_number_clusters=None, max_number_clusters=None, min_cluster_size = None, max_cluster_size=None, local =True, bet=False, e=None, s=None, cluster_size_distribution = True)



        if avg_score < maxs:
            maxs = avg_score
            mk = k





    #store the best k (mk), not the last k tested
    kmed[n] = mk
    print(maxs, mk)
    
    print("creating clusterings for", n, "with k ", mk)
    
    for xx in range(10):
    
    
        cl_labels, mediods = volta.clustering.kmedoids_clustering(dist, n_clusters=mk)
        clusterings.setdefault(n, []).append(np.array(cl_labels))



For each of the groups we create an individual consensus clustering in order to compare this later to the combined consensus clustering. 


#### Nodes

In [None]:
merged_clusterings = []


for c in clusterings["nodes"]:
        
        merged_clusterings.append(c.tolist())

In [None]:
consensus_nodes = volta.clustering.consensus_clustering(merged_clusterings, seed=1234, threshold="matrix", per_node=False, rep = 10)

In [None]:
fig = volta.plotting.plot_clustering_heatmap(consensus_nodes, mean_dist_nodes, labels, cmap="bone")

#### Edges

In [None]:
merged_clusterings = []


for c in clusterings["edges"]:
        
        merged_clusterings.append(c.tolist())

In [None]:
consensus_edges = volta.clustering.consensus_clustering(merged_clusterings, seed=1234, threshold="matrix", per_node=False, rep = 10)

In [None]:
fig = volta.plotting.plot_clustering_heatmap(consensus_edges, mean_dist_edges, labels, cmap="bone")

#### Structural



In [None]:
merged_clusterings = []


for c in clusterings["structural"]:
        
        merged_clusterings.append(c.tolist())

In [None]:
consensus_structural = volta.clustering.consensus_clustering(merged_clusterings, seed=1234, threshold="matrix", per_node=False, rep = 10)

In [None]:
fig = volta.plotting.plot_clustering_heatmap(consensus_structural, mean_dist_structural, labels, cmap="bone")

#### Walks

In [None]:
merged_clusterings = []


for c in clusterings["walks"]:
        
        merged_clusterings.append(c.tolist())

In [None]:
consensus_walks = volta.clustering.consensus_clustering(merged_clusterings, seed=1234, threshold="matrix", per_node=False, rep = 10)

In [None]:
fig = volta.plotting.plot_clustering_heatmap(consensus_walks, mean_dist_walks, labels, cmap="binary")

We plot the agreement matrix, which shows for each pair in how many of the groups they have been clustered together.

In [None]:
consensus = [consensus_nodes, consensus_edges, consensus_structural, consensus_walks]

In [None]:
f, agg = volta.plotting.plot_agreement_matrix(consensus, xlabels=labels, ylabels=labels, annotation=True)
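
The content of such an agreement matrix can be reproduced by counting, for each pair of networks, in how many clusterings both receive the same cluster label. This is a minimal NumPy sketch on made-up label vectors, not the VOLTA plotting function.

```python
import numpy as np

# Three made-up clusterings of four networks (one label per network).
example_consensus = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]

n_networks = len(example_consensus[0])
agreement = np.zeros((n_networks, n_networks), dtype=int)
for cluster_labels in example_consensus:
    cluster_labels = np.asarray(cluster_labels)
    # 1 where two networks share a cluster in this clustering, else 0.
    agreement += (cluster_labels[:, None] == cluster_labels[None, :]).astype(int)

print(agreement)
```

An entry equal to the number of clusterings means that all groups agree on placing that pair in the same cluster.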

As we can see, all metrics agree on only 2 pairs that belong together.

A consensus clustering is created from the individually performed clusterings. Please refer to the documentation for further information.

In [None]:
merged_clusterings = []

for key in clusterings.keys():
    for c in clusterings[key]:
        
        merged_clusterings.append(c.tolist())

In [None]:
consensus = volta.clustering.consensus_clustering(merged_clusterings, seed=1234, threshold="matrix", per_node=False, rep = 10)

In [None]:
consensus

To show the complete clustering, we estimate a combined median distance matrix and plot the clustering on top of it.

In [None]:
mean_dist_combined = volta.clustering.create_median_distance_matrix(distances, set_diagonal = True)

In [None]:
fig = volta.plotting.plot_clustering_heatmap(consensus, mean_dist_combined, labels, cmap="binary")

By tweaking the threshold or using another method, it is possible to tune the consensus based on your needs (you can also use the multiobjective function again to evaluate the consensus numerically).

In [None]:
max(consensus)+1

In [None]:
df = pd.DataFrame(list(zip(labels, consensus)), 
               columns =['CHEMICAL', 'CLUSTER'])

In [None]:
#view clustering
df

In [None]:
#save clustering for later. It can be used as input for the common subgraph pipeline

df.to_csv(location+"clustering_networks.csv", index=None)

In [None]:
all_consensus = [consensus_nodes, consensus_edges, consensus_structural, consensus_walks, consensus]

In [None]:
group_labels = ["nodes", "edges", "structural", "walks", "combined"]

In [None]:
f, c = volta.plotting.plot_correlation_clusterings(all_consensus, xlabels=group_labels, ylabels=group_labels, size=(10,8), cmap="binary")