In [1]:
import volta

In this example, we compare two gene networks (for example, one representing a control and the other a treated sample; the example data compares two networks treated with different drugs) to investigate whether specific nodes (genes) or gene areas change and can therefore be considered affected by a treatment.

# Load Networks



In [2]:
#location where the raw data files are stored; it is set to run from the installation folder
#- if applicable, please CHANGE it to the location of your networks

graph_location = "../networks/edgelists/"

#location where output should be saved
#Please set location
location = ""


The first step of the pipeline is to load the chosen data set.
You can store your networks in any common format, but the VOLTA package requires that they are provided as NetworkX Graph objects (refer to its documentation for detailed instructions). The networks should also be weighted: if you have an unweighted network, assign all edges the same edge weight. The package assumes "weight" as the default edge weight label, but this can be changed when needed.

An example of how to pre-process a network stored as an edgelist is provided below. Further loading and storing examples are provided in the "import and export of networks" Jupyter notebook.
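As a minimal sketch of the weighting requirement described above (the toy graph and its node names are made up for illustration), NetworkX's `set_edge_attributes` can assign one constant weight to every edge of an unweighted graph:

```python
import networkx as nx

# Toy unweighted graph; the node names are illustrative only.
G_toy = nx.Graph()
G_toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "a")])

# A scalar value is applied to every edge under the given attribute name.
nx.set_edge_attributes(G_toy, 1, "weight")

weights = [d["weight"] for _, _, d in G_toy.edges(data=True)]
print(weights)
```

If your edges already carry weights under a different label, that label would instead be passed wherever the package accepts a weight attribute name.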

In [3]:
import glob
import pandas as pd
import networkx as nx
import numpy as np

In [4]:

labels = []
networks_graphs = []
cnt = 0
print("load networks")
#gets all files located in the specified folder that end in .edgelist
#CHANGE the ending if your files end differently
for path in glob.glob(graph_location +"*.edgelist"):
    if cnt < 2:
        #you can specify that only part of the file name should be used as network name for later identification
        name =  path.split("/")[-1].replace(".rds.edgelist", "")

        #FOR WINDOWS SYSTEMS SEPARATORS MAY NEED TO BE ADJUSTED
        #IF NECESSARY UNCOMMENT
        #name =  path.split("\\")[-1].replace(".rds.edgelist", "")


        #read the edgelist file as a dataframe
        fh = pd.read_csv(path, sep="\t")
        #convert it into a NetworkX graph G and specify the column names of the node pairs
        G=nx.from_pandas_edgelist(fh, "V1", "V2")

        #if you have an unweighted network assign all edges the same edge weight - here a value of 1 is assigned
        for u, v, d in G.edges(data=True):
            d['weight'] = 1


        #save the graph objects to a list (only suitable if small networks are processed)
        #this list is the main object used in the examples below and contains all networks
        networks_graphs.append(G)
        labels.append(name)




        print("loaded", name)
    cnt = cnt + 1

load networks
loaded dasatinib_A375
loaded dasatinib_A549


The NetworkX graph objects are converted into a list-of-lists format.

In [5]:
networks = volta.get_node_similarity.preprocess_graph(networks_graphs, attribute="weight")

Optional: If multiple networks are provided, get the union of their nodes to ensure that all node names are mapped to the correct IDs (if this transformation is applied).

In [6]:
#get union of nodes

nodes = []
for net in networks_graphs:
    for node in net.nodes():
        if node not in nodes:
            nodes.append(node)
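The union loop above checks membership in a list, which becomes slow for large networks. A possible alternative, sketched here on made-up node lists rather than on the actual graphs, relies on `dict.fromkeys` to drop duplicates while keeping first-seen order:

```python
# Toy node lists standing in for net.nodes() of each graph (illustrative values).
node_lists = [[780, 3895, 10904], [3895, 929, 780]]

# dict.fromkeys keeps first-seen order and drops duplicates in linear time.
union_nodes = list(dict.fromkeys(n for node_list in node_lists for n in node_list))
print(union_nodes)  # [780, 3895, 10904, 929]
```

The resulting order matches what the explicit loop would produce, so downstream code that relies on node order should behave the same.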

In [7]:
#map node names to IDs (this is mainly used for node & edge similarity functions)

network_lists, mapping = volta.get_node_similarity.preprocess_node_list(networks)

In [8]:
#save mapping for later

import pickle

with open(location + "node_id_mapping_network_network.pckl", "wb") as f:
    pickle.dump(mapping, f, protocol=4)

In [9]:
#OPTIONAL: create reversed mapping object

reverse_mapping = volta.distances.node_edge_similarities.reverse_node_edge_mapping(mapping)

# Nodes

The networks are compared based on different centrality measures (how the node location in the network changes) to estimate which nodes are the most similar or different w.r.t. their network position.

The centrality ranks are also used in the Network clustering pipeline.
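The rank aggregation described above can be sketched with plain NetworkX calls. This is an illustration under assumptions, not VOLTA's implementation: `rank_by` is a hypothetical helper, the star graph is toy data, and ties are broken by node order, which may differ from how `sort_node_list` handles them.

```python
import networkx as nx

def rank_by(scores):
    """Map each node to its rank (0 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {node: rank for rank, node in enumerate(ordered)}

# Toy star graph: node 0 is the hub, nodes 1-4 are leaves.
G_star = nx.star_graph(4)

ranks = [
    rank_by(nx.degree_centrality(G_star)),
    rank_by(nx.closeness_centrality(G_star)),
    rank_by(nx.betweenness_centrality(G_star)),
]

# Mean rank per node over the three measures; lower means more central.
mean_rank = {n: sum(r[n] for r in ranks) / len(ranks) for n in G_star.nodes()}
print(min(mean_rank, key=mean_rank.get))
```

The hub is ranked first by all three measures, so it ends up with the lowest mean rank.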

In [10]:
#sort nodes by the selected attributes

sorted_nodes = []

for graph in networks_graphs:
    temp = volta.distances.node_edge_similarities.sort_node_list(graph, mapping, degree_centrality=True, closeness_centrality=True, betweenness=True, k=None, as_str=False)
    
    sorted_nodes.append(temp)

1
2
3
average position is calculated
1
2
3
average position is calculated


Below we convert the output, which is a Python dict, into a dataframe to make it more human-readable.

In [11]:
mapping_ids = list(mapping.values())

In [12]:
import pandas as pd 

df = pd.DataFrame(mapping_ids, 
               columns =['Mapping ID']) 

In [13]:
# add the reversed mapping IDS (original node IDs - in the example networks they are Entrez IDs)
entrez = []
for g in mapping_ids:
    entrez.append(reverse_mapping[g])
df["Entrez IDs"] = entrez

In [14]:
for i in range(len(sorted_nodes)):
    item = sorted_nodes[i][0]
    for key in item.keys():
        #ignore "degree" key, since it has not been calculated. We are using degree centrality instead.
        #Adjust to your selected parameters
        if key != "degree":
            temp = []
            for g in mapping_ids:
                for xx in range(len(item[key])):
                    if item[key][xx] == g:
                        temp.append(xx)
                
            #add to dataframe
            #since the results are in the same order as the network labels 
            #we can use the network label directly as column heading
            df[labels[i]+" Ranking " + key] = temp
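The nested loop above scans the whole ranked list once per gene. A possible shortcut, sketched on toy values and assuming each mapped ID occurs exactly once in the ranked list, builds an ID-to-position lookup first:

```python
# Toy ranked list of mapped node IDs, best-ranked first (illustrative values).
ranked_ids = [2, 0, 3, 1]
ids_in_table_order = [0, 1, 2, 3]

# One pass builds ID -> position; each lookup is then constant time.
position = {node_id: idx for idx, node_id in enumerate(ranked_ids)}
rank_column = [position[node_id] for node_id in ids_in_table_order]
print(rank_column)  # [1, 3, 0, 2]
```

In the notebook, `ranked_ids` would correspond to `item[key]` and `ids_in_table_order` to `mapping_ids`.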


In [15]:
#display the dataframe

df

Unnamed: 0,Mapping ID,Entrez IDs,dasatinib_A375 Ranking dc,dasatinib_A375 Ranking cc,dasatinib_A375 Ranking betweenness,dasatinib_A375 Ranking average_mean,dasatinib_A375 Ranking average_median,dasatinib_A549 Ranking dc,dasatinib_A549 Ranking cc,dasatinib_A549 Ranking betweenness,dasatinib_A549 Ranking average_mean,dasatinib_A549 Ranking average_median
0,0,780,534,647,182,444,559,817,864,946,890,871
1,1,3895,272,120,26,111,116,413,422,629,489,424
2,2,10904,567,711,300,551,590,566,607,702,642,614
3,3,23386,307,481,85,235,307,599,466,510,524,511
4,4,22883,720,663,372,632,689,796,697,694,740,701
...,...,...,...,...,...,...,...,...,...,...,...,...
972,972,10057,975,975,976,975,975,638,632,594,635,640
973,973,55699,970,967,972,970,970,591,560,413,517,564
974,974,10494,972,972,974,973,972,568,598,618,615,607
975,975,5875,964,969,947,963,967,815,838,796,825,824


We want to know which genes change the most between the networks with regard to their network position. Therefore, we estimate the difference between the median ranks.
The same can be done for any of the other parameters if your analysis requires it. Please refer to the function's documentation for more details.

In [16]:
change = []

for g in mapping_ids:
    
    val1 = df.loc[df["Mapping ID"] == g][labels[0]+" Ranking average_median"].to_list()[0]
    
    val2 = df.loc[df["Mapping ID"] == g][labels[1]+" Ranking average_median"].to_list()[0]
    
    change.append(abs(val1-val2))  

In [17]:
df_change = pd.DataFrame(list(zip(mapping_ids, entrez, change, df[labels[0]+" Ranking average_median"].to_list(), df[labels[1]+" Ranking average_median"].to_list())), 
               columns =['Mapping ID', 'Entrez IDs', 'Absolute Ranking Difference', labels[0]+' Ranking average_median', labels[1]+' Ranking average_median' ]) 

In [18]:
#sort the dataframe for easier visualization/ analysis

df_change = df_change.sort_values(by =["Absolute Ranking Difference"], axis=0, ascending=False)
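The rank difference computed above can also be sketched in vectorized form. The table below is toy data with hypothetical column names (`net_a`, `net_b`); in the notebook the columns would be built from `labels[0]` and `labels[1]`.

```python
import pandas as pd

# Toy ranking table; column names and values are illustrative only.
toy = pd.DataFrame({
    "Mapping ID": [0, 1, 2],
    "net_a Ranking average_median": [10, 200, 35],
    "net_b Ranking average_median": [15, 20, 35],
})

# Column-wise subtraction avoids one row lookup per gene.
toy["Absolute Ranking Difference"] = (
    toy["net_a Ranking average_median"] - toy["net_b Ranking average_median"]
).abs()

toy = toy.sort_values(by="Absolute Ranking Difference", ascending=False)
print(toy)
```

This relies on both rank columns living in the same dataframe, so rows are already aligned by gene.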

First we inspect the top 20 genes (those whose network position is the most affected between the compared networks). Adjust the value if need be.

In [19]:
df_change.head(20)

Unnamed: 0,Mapping ID,Entrez IDs,Absolute Ranking Difference,dasatinib_A375 Ranking average_median,dasatinib_A549 Ranking average_median
642,642,3815,886,85,971
523,523,9587,886,27,913
927,927,55129,880,914,34
355,355,7511,877,933,56
435,435,5906,872,44,916
539,539,9870,866,75,941
57,57,3028,858,897,39
285,285,4927,848,6,854
595,595,54851,845,53,898
437,437,6722,844,49,893


Then we inspect the bottom 20 genes (those whose network position changes the least between the two networks). Adjust the value if need be.

In [20]:
df_change.tail(20)

Unnamed: 0,Mapping ID,Entrez IDs,Absolute Ranking Difference,dasatinib_A375 Ranking average_median,dasatinib_A549 Ranking average_median
773,773,128,5,475,470
429,429,3930,5,46,51
118,118,23038,5,912,917
256,256,10131,5,328,323
735,735,5986,4,145,149
748,748,51015,4,372,376
249,249,5048,4,163,159
474,474,9200,3,776,779
544,544,51053,3,245,242
689,689,9670,3,151,148


As a possible further analysis, these ranked genes can be enriched by means of GSEA, or the top x could be enriched as shown in the Example of Enrichment file.

# Edges

We now evaluate which edges are common to the two networks, which edges are unique and, finally, which edges' network position changes the most. The latter is estimated through edge betweenness.

In [21]:
#compute the edge betweenness scores and assign them to the graph objects

print("sort edges after edge betweenness")
bet = []
graphs_with_betweenness = []
for net in networks_graphs:
    edges_betweenness = nx.edge_betweenness_centrality(net)
    bet.append(edges_betweenness)
    #write as new attribute to graph (set_edge_attributes modifies the graph in place and returns None)
    nx.set_edge_attributes(net, edges_betweenness, "betweenness")

sort edges after edge betweenness
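One way to use these scores to see which shared edges change position the most is sketched below on two toy graphs. This is an assumption about how the comparison could be done, not a VOLTA function: `edge_scores` is a hypothetical helper, and raw scores are compared even though ranks may be preferable when the networks differ in size.

```python
import networkx as nx

# Two toy graphs sharing the edges (1, 2) and (2, 3); values are illustrative.
G1 = nx.Graph([(1, 2), (2, 3), (3, 4)])
G2 = nx.Graph([(1, 2), (2, 3), (1, 3)])

def edge_scores(G):
    """Edge betweenness keyed by an order-independent edge identifier."""
    return {frozenset(e): s for e, s in nx.edge_betweenness_centrality(G).items()}

s1, s2 = edge_scores(G1), edge_scores(G2)

# Absolute score change for edges present in both graphs, largest first.
delta = {e: abs(s1[e] - s2[e]) for e in s1.keys() & s2.keys()}
for edge, diff in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    print(sorted(edge), round(diff, 3))
```

Using `frozenset` keys makes `(u, v)` and `(v, u)` refer to the same undirected edge.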


As in the previous section, the networks are converted to a list-of-lists format and each edge is assigned a unique ID.

In [22]:
networks = volta.get_edge_similarity.preprocess_graph(networks_graphs, attribute="betweenness")

print("map edges to id")

network_lists, mapping = volta.get_edge_similarity.preprocess_edge_list(networks)

with open(location + "edge_id_mapping_network_network.pckl", "wb") as f:
    pickle.dump(mapping, f, protocol=4)

map edges to id


In [24]:
reverse_mapping = volta.distances.node_edge_similarities.reverse_node_edge_mapping(mapping)

The shared edges are retrieved. The function returns a dictionary in which each key is a mapped edge ID and each value is the list of network names in which this edge is present.
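The shape of that dictionary can be illustrated with a small pure-Python sketch. The network names and edge IDs are made up, and the loop only mimics the described output format; it is not how `compute_shared_layers` is implemented.

```python
# Toy edge ID lists per network; names and edge IDs are illustrative only.
edge_ids_per_network = {
    "net_a": [0, 1, 2],
    "net_b": [2, 3],
}

# Map each edge ID to the list of networks that contain it.
shared_toy = {}
for name, edge_ids in edge_ids_per_network.items():
    for edge_id in edge_ids:
        shared_toy.setdefault(edge_id, []).append(name)

print(shared_toy)
# {0: ['net_a'], 1: ['net_a'], 2: ['net_a', 'net_b'], 3: ['net_b']}
```

An edge is shared when its list holds more than one network name, and unique when it holds exactly one.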

In [25]:
shared = volta.distances.node_edge_similarities.compute_shared_layers(network_lists, labels, is_file=False, in_async=False)

dasatinib_A375
dasatinib_A549


The output is converted into a dataframe for easier inspection of the results.

In [26]:
edges = list(reverse_mapping.values())
edge_mapped_IDs = list(reverse_mapping.keys())

df = pd.DataFrame(list(zip(edges, edge_mapped_IDs)), 
               columns =['Edges', 'Mapping ID']) 
    

In [27]:
for label in labels:
    temp = []
    for i in edge_mapped_IDs:
        if label in shared[i]:
            temp.append(1)
        else:
            temp.append(0)
            
    df["In "+label] = temp
    

In [28]:
#display the dataframe

df

Unnamed: 0,Edges,Mapping ID,In dasatinib_A375,In dasatinib_A549
0,3895780,0,1,0
1,10904780,1,1,0
2,23386780,2,1,0
3,22883780,3,1,0
4,929780,4,1,0
...,...,...,...,...
36889,992854850,36889,0,1
36890,2365810868,36890,0,1
36891,1016579600,36891,0,1
36892,598525987,36892,0,1


The shared edges are retrieved and stored in a dataframe for inspection.

In [29]:
shared_df = df.loc[(df["In "+labels[0]] == 1) & (df["In "+labels[1]] == 1)]

In [30]:
shared_df

Unnamed: 0,Edges,Mapping ID,In dasatinib_A375,In dasatinib_A549
20,6919780,20,1,1
87,33083895,87,1,1
110,1000710904,110,1,1
125,2597610904,125,1,1
234,58299918,234,1,1
...,...,...,...,...
27012,23533800,27012,1,1
27031,94888444,27031,1,1
27071,5512954788,27071,1,1
27075,1131910298,27075,1,1


The unique edges are retrieved and stored in a dataframe for inspection.

In [31]:
unique_df = df.loc[((df["In "+labels[0]] == 1) & (df["In "+labels[1]] == 0)) | ((df["In "+labels[0]] == 0) & (df["In "+labels[1]] == 1))]

In [32]:
unique_df

Unnamed: 0,Edges,Mapping ID,In dasatinib_A375,In dasatinib_A549
0,3895780,0,1,0
1,10904780,1,1,0
2,23386780,2,1,0
3,22883780,3,1,0
4,929780,4,1,0
...,...,...,...,...
36889,992854850,36889,0,1
36890,2365810868,36890,0,1
36891,1016579600,36891,0,1
36892,598525987,36892,0,1


As possible further analysis the genes making up common or unique edges could be enriched as shown in the Example of Enrichment jupyter notebook.

# Node areas/ connectivity

Next we evaluate whether nodes are connected in a similar way in the two networks, and whether there are differences between node areas. To do this, we use the random walks method, as already shown in the network clustering example file.

## Random walks

For each common node in the two networks, random walks are performed and the similarity of the visited nodes is compared. This makes it possible to identify the most similar and most dissimilar node areas.

For each node, random walks of size 5 are performed as many times as the node's degree. A smaller walk size "scans" a smaller area around the starting node. The number of steps and the number of walkers can be increased or decreased according to the experimental purposes (and memory availability).
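The walking scheme described above can be sketched in a few lines of plain Python. This is a simplified illustration, not VOLTA's `helper_walks`: `random_walk` is a hypothetical helper, neighbors are chosen uniformly (edge weights are ignored), and the adjacency lists are toy data.

```python
import random

def random_walk(adjacency, start, steps, rng):
    """Uniform random walk; returns the list of visited nodes, start included."""
    walk = [start]
    for _ in range(steps):
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # isolated node: nothing to step to
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy adjacency lists; for a NetworkX graph G, {n: list(G[n]) for n in G} gives the same shape.
adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
start = 3
# One walk per incident edge, i.e. "degree" many walks of 5 steps each.
walks = [random_walk(adjacency, start, steps=5, rng=rng) for _ in adjacency[start]]
print(walks)
```

High-degree nodes get more walks under this scheme, which is why the number of walks per node varies in the log output of the real run.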


In [33]:
performed_walks = volta.get_walk_distances.helper_walks(networks_graphs, nodes, labels, steps=5, number_of_walks=1, degree=True, probabilistic=False, weight ="weight")

walks for node  0 outof 977
running walks 35 for node 780
running walks 8 for node 780
running walks 67 for node 3895
running walks 20 for node 3895
...
running walks 5 for node 55699
running walks 14 for node 55699
running walks 2 for node 10057
running walks 13 for node 10057


Now we estimate, for each starting node, how often the surrounding nodes/edges have been visited relative to all visited nodes/edges. Depending on your network sizes and selected nodes, this can be quite memory intensive.
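The idea of a visit fraction can be sketched with `collections.Counter`. The walks below are toy data, and excluding the start position of each walk from the counts is an assumption made for this illustration; `helper_get_counts` may count differently.

```python
from collections import Counter

# Toy walks from one start node (illustrative values).
toy_walks = [[3, 1, 2, 3], [3, 4, 3, 2], [3, 2, 1, 3]]

# Count visits to every node except the start position of each walk.
visits = Counter(node for walk in toy_walks for node in walk[1:])
total = sum(visits.values())

# Fraction of all visits that landed on each node.
node_fraction = {node: count / total for node, count in visits.items()}
print(node_fraction)
```

Sorting such a dictionary by value gives a ranking of the nodes around a start node, which is the kind of input a rank correlation needs.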

In [34]:
node_counts, edge_counts, nodes_frc, edges_frc = volta.get_walk_distances.helper_get_counts(labels, networks_graphs, performed_walks)

Now we estimate network similarities based on the visited nodes. For each network pair, the Kendall rank correlation is calculated for the same starting node (using the top 20 nodes; adjust this value as needed). The mean correlation over all same-node pairs of a network pair is returned, as are the individual values.
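To make the correlation step concrete, here is a small pure-Python sketch of Kendall's tau (the tau-a variant, without tie correction) applied to toy visit fractions. How VOLTA selects the top nodes and handles ties is not shown in this notebook, so the top-k selection from the first network and the `kendall_tau_a` helper are assumptions for illustration only.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / len(pairs)

# Toy visit fractions around the same start node in two networks (illustrative).
frac_a = {"g1": 0.4, "g2": 0.3, "g3": 0.2, "g4": 0.1}
frac_b = {"g1": 0.1, "g2": 0.4, "g3": 0.3, "g4": 0.2}

# Restrict to the top-k nodes of the first network, then compare both score vectors.
top = 3
top_nodes = sorted(frac_a, key=frac_a.get, reverse=True)[:top]
tau = kendall_tau_a([frac_a[n] for n in top_nodes], [frac_b.get(n, 0.0) for n in top_nodes])
print(tau)
```

Values near 1 indicate that the neighborhoods are visited in a similar order in both networks, values near -1 indicate a reversed order. For real data, `scipy.stats.kendalltau` offers a tie-aware variant.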

In [35]:
results_edges, results_nodes, results_edges_p, results_nodes_p, results_edges_all, results_nodes_all, results_edges_p_all, results_nodes_p_all = volta.get_walk_distances.helper_walk_sim(networks_graphs, performed_walks, nodes, labels, top=20, undirected=False, return_all = True, nodes_ranked=nodes_frc, edges_ranked=edges_frc)

The results are converted into a dataframe for inspection and the top and bottom 20 nodes are displayed.

In [36]:

df = pd.DataFrame(list(zip(nodes, results_nodes_all[(labels[0], labels[1])])), 
               columns =['Entrez ID', 'Correlation']) 

#dataframe is sorted by correlation

df = df.sort_values(by =["Correlation"], axis=0, ascending=False)

In [37]:
df.head(20)

Unnamed: 0,Entrez ID,Correlation
498,51465,0.536842
657,874,0.473684
529,1026,0.473684
289,1052,0.442105
135,54386,0.421053
518,501,0.421053
903,39,0.410526
917,55793,0.410526
405,23210,0.410526
816,1385,0.4


In [38]:
df.tail(20)

Unnamed: 0,Entrez ID,Correlation
252,9870,-0.305263
654,3775,-0.315789
918,7690,-0.315789
428,6636,-0.315789
387,10427,-0.315789
180,3398,-0.326316
929,27242,-0.326316
42,9887,-0.326316
619,3066,-0.326316
601,7105,-0.326316


Additionally, the networks can be compared by means of their community structure, as shown in the community notebook file.