In [None]:
import graphAlgorithms as ga

This pipeline builds on the network clustering estimated in the Network clustering example pipeline.

The aim is to identify graph structures that are common among the networks of a cluster or statistically overrepresented in them.

# Load & Preprocess Network

The first step of the pipeline is to load the chosen data set.
You can store your networks in any common format; however, the xx package requires the networks to be provided as NetworkX Graph objects (refer to its documentation for detailed instructions). Moreover, the networks should be weighted: if you have an unweighted network, assign all edges the same edge weight. The package uses "weight" as the default edge weight label, but a different label can be set where needed.

An example of how to pre-process a network stored as an edgelist is provided below. Further loading and storing examples are provided in the "import and export of networks" Jupyter notebook.

In [None]:
#location where the raw data files are stored; by default it is set to run from the installation folder
#CHANGE this to the location of your networks if applicable

graph_location = "../networks/edgelists/"

In [None]:
#location where output should be saved
#Please set location
location = ""

In [None]:
import glob
import pandas as pd
import networkx as nx
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt

In [None]:

labels = []
networks_graphs = []
print("load networks")
#gets all files located in the specified folder that end in .edgelist
#CHANGE the ending if your files end differently
for path in glob.glob(graph_location + "*.edgelist"):

    #you can specify that only part of the file name should be used as network name for later identification
    name = path.split("/")[-1].replace(".rds.edgelist", "")

    #read the tab-separated edgelist file as a dataframe
    fh = pd.read_csv(path, sep="\t")
    #convert it into a NetworkX graph G and specify the column names of the node pairs
    G = nx.from_pandas_edgelist(fh, "V1", "V2")

    #if you have an unweighted network assign all edges the same edge weight - here a value of 1 is assigned
    for u, v, d in G.edges(data=True):
        d['weight'] = 1

    #save the graph objects to a list (only suitable if small networks are processed)
    #this is the main object used for the examples below; it contains all networks
    networks_graphs.append(G)
    labels.append(name)

    print("loaded", name)

Get the union of nodes across all networks.

In [None]:
nodes = []
for net in networks_graphs:
    for node in net.nodes():
        if node not in nodes:
            nodes.append(node)
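
The loop above checks list membership for every node, which can become slow for large networks. A minimal alternative sketch, assuming only that each network yields its nodes when iterated (as `net.nodes()` does), deduplicates with `dict.fromkeys` while keeping first-seen order. Plain lists stand in for the graphs here, and `node_lists` and `union_nodes` are illustrative names that are not part of the pipeline.

```python
# Plain lists stand in for net.nodes() of each network (illustrative data)
node_lists = [["A", "B", "C"], ["B", "D"], ["C", "E", "A"]]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this yields the union of nodes in first-seen order
union_nodes = list(dict.fromkeys(node for net in node_lists for node in net))

print(union_nodes)
```

The node order matters later on, because the adjacency matrices are built with `nodelist=nodes` and must share the same row and column order.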

## Load clustering estimated in Network clustering pipeline


This step reads in the dataframe (stored as CSV) that was created and saved in the last step of the Network clustering pipeline.

In [None]:
clustering = pd.read_csv(location+"clustering_networks.csv")

Convert the data to dictionaries (a Python data structure more generally known as an associative array), where the key is the cluster ID and the value is the list of NetworkX graph objects in that cluster or the list of their adjacency matrices.

In [None]:
clusters_networks = {}
clusters_adjacencymatrices = {}

#iterate over the distinct cluster IDs (in order of first appearance)
for cl in Counter(clustering["CLUSTER"].to_list()).keys():
    #names of all networks (here: drugs) assigned to this cluster
    drugs = clustering.loc[clustering["CLUSTER"] == cl, "CHEMICAL"].to_list()

    graphs = []
    matrices = []
    for d in drugs:
        for i in range(len(labels)):
            if labels[i] == d:
                graphs.append(networks_graphs[i])
                #adjacency matrix over the union of nodes, so all matrices share the same node order
                #note: nx.to_numpy_matrix was removed in NetworkX 3.0 and requires an older NetworkX version
                matrices.append(nx.to_numpy_matrix(networks_graphs[i], nodelist=nodes, weight='weight'))
    clusters_networks[cl] = graphs
    clusters_adjacencymatrices[cl] = matrices

## Estimate common subgraphs 

Here we show an example of how to estimate common subgraphs when not all edges are present in all networks. Specifically, we show how to estimate the edges present in 75% and 50% of the networks of a cluster, respectively. For each threshold, the common subnetwork is plotted.
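
As a rough illustration of the idea, the sketch below counts in how many networks each undirected edge occurs and keeps the edges whose share reaches a threshold `p`. It works on plain edge lists and is not the implementation behind `get_common_subgraph`; the helper `common_edges`, the toy data, and the choice of `>=` for the threshold are assumptions made for this example.

```python
from collections import Counter

def common_edges(edge_sets, p):
    """Return edges present in at least a fraction p of the networks."""
    counts = Counter()
    for edges in edge_sets:
        # frozenset makes (u, v) and (v, u) the same undirected edge;
        # the set comprehension counts each edge once per network
        counts.update({frozenset(e) for e in edges})
    n = len(edge_sets)
    return {tuple(sorted(e)) for e, c in counts.items() if c / n >= p}

# four toy networks given as edge lists (illustrative data)
toy_networks = [
    [("A", "B"), ("B", "C"), ("C", "D")],
    [("A", "B"), ("C", "B")],
    [("A", "B"), ("B", "C"), ("D", "E")],
    [("A", "B"), ("C", "D")],
]

edges_75 = common_edges(toy_networks, p=0.75)
edges_50 = common_edges(toy_networks, p=0.5)
print(sorted(edges_75))
print(sorted(edges_50))
```

Lowering `p` can only add edges, so the 50% subgraph always contains the 75% subgraph.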

In [None]:
#which edges are in 75% of all graphs in a cluster?

common_75 = {}

for cl in clusters_adjacencymatrices.keys():
    common_75[cl] = ga.pattern_matching.get_common_subgraph(clusters_adjacencymatrices[cl], p=0.75)

In [None]:
#print the common subnetwork
for i in common_75.keys():
    print("cluster ", i)
    T = ga.pattern_matching.build_graph_remove_isolates(common_75[i])
    
    plt.figure(3,figsize=(5,5)) 
    nx.draw(T, with_labels = True)
    plt.show()

In [None]:
#which edges are in 50% of all graphs in a cluster?

common_50 = {}

for cl in clusters_adjacencymatrices.keys():
    common_50[cl] = ga.pattern_matching.get_common_subgraph(clusters_adjacencymatrices[cl], p=0.5)

In [None]:
#print the subnetwork
for i in common_50.keys():
    print("cluster ", i)
    T = ga.pattern_matching.build_graph_remove_isolates(common_50[i])
    
    plt.figure(3,figsize=(5,5)) 
    nx.draw(T, with_labels = False, node_size = 4)
    plt.show()

This method is simple and gives a quick overview of whether the networks within a cluster share many edges, but it does not provide any information about the edge distribution within a cluster and between clusters.


Therefore, we next estimate a subgraph based on whether a specific edge is statistically significantly enriched in a cluster. The function estimates a p value for each edge within a cluster based on a hypergeometric test and corrects for multiple testing with the Benjamini-Hochberg procedure. Both the raw and the adjusted p values are returned.
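
To make the two statistical steps concrete, here is a minimal standard-library sketch of a one-sided hypergeometric test for a single edge followed by a Benjamini-Hochberg adjustment. It only illustrates the general technique and is not the package's implementation; the function names and the toy numbers are assumptions. The test asks: if `N` networks exist in total and `K` of them contain the edge, how likely is it that a cluster of `n` networks contains the edge at least `k` times by chance?

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n of N networks, K of which contain the edge."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# toy case: 10 networks in total, the edge occurs in 4 of them,
# and all 4 networks of the cluster contain it
p_edge = hypergeom_pvalue(k=4, n=4, K=4, N=10)

# adjust a small list of raw p values (illustrative numbers)
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(p_edge, adj)
```

An adjusted p value is never smaller than the raw one, so fewer edges pass a given cutoff after correction.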

In [None]:
pval_matrix, adj_pval_matrix = ga.pattern_matching.get_statistical_overrepresented_edges(clusters_adjacencymatrices)

In [None]:
for i in adj_pval_matrix.keys():
    print("cluster ", i)

    T = ga.pattern_matching.build_graph_remove_isolates(adj_pval_matrix[i])
    print("number nodes", len(T.nodes()))
    print("number edges", len(T.edges()))
    plt.figure(3,figsize=(15,15)) 
    nx.draw(T, with_labels = False, node_size = 4)
    plt.show()

The estimated subgraphs can be used as replacements for all networks within a cluster. Nodes and edges can be compared directly as described in the network-network comparison pipeline, or modules can be detected and functionally enriched as described in the community detection example file.

# Communities

Here, we calculate consensus communities across all graphs within a network group and evaluate statistically overrepresented communities.
For individual community detection algorithms or ensemble methods, as well as their application to individual networks, please refer to the community detection example file.

## Statistically overrepresented communities within a network group



In [None]:
statistical_communities = ga.pattern_matching.get_statistical_overrepresented_communities(clusters_networks, nodes, pval=0.05)

Once the enriched communities are calculated, we keep only those that contain at least 20 nodes (adjust this value to your needs, e.g. the size required to perform enrichment) and retrieve their nodes.


In [None]:
stat_communities = {}

for cl in statistical_communities.keys():
    stat_communities[cl] = {}
    print("cluster", cl)
    #count how many nodes are assigned to each community ID in this cluster
    sizes = Counter(statistical_communities[cl])
    for c in sizes.keys():
        #keep only communities with at least 20 nodes
        if sizes[c] >= 20:
            print("community ", c, "has", sizes[c], "nodes")

            #collect the node names that belong to community c
            temp = []
            for i in range(len(nodes)):
                if statistical_communities[cl][i] == c:
                    temp.append(nodes[i])

            stat_communities[cl][c] = temp

## Consensus Community on a group of networks

For each network, Louvain community detection is performed ten times. By evaluating the resulting partitions of all networks in a group, a consensus is estimated based on clustering.consensus_clustering(), as explained in more detail in the Network clustering pipeline.
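
One common way to derive a consensus from several partitions is a co-association scheme: count how often two nodes share a community, link pairs whose share reaches a threshold, and take the connected components. The sketch below shows that generic scheme on toy data. It is an assumption for illustration and not necessarily what `clustering.consensus_clustering()` or `get_consensus_community` do internally; `consensus_communities` and the toy partitions are hypothetical.

```python
from itertools import combinations

def consensus_communities(partitions, nodes, threshold):
    """Group nodes that share a community in >= threshold of the partitions."""
    n_runs = len(partitions)
    # adjacency of the thresholded co-association graph
    linked = {node: set() for node in nodes}
    for u, v in combinations(nodes, 2):
        together = sum(1 for part in partitions if part[u] == part[v])
        if together / n_runs >= threshold:
            linked[u].add(v)
            linked[v].add(u)

    # connected components of that graph are the consensus communities
    communities, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in linked[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        communities.append(sorted(component))
    return communities

# four toy partitions, each mapping node -> community ID (illustrative data)
toy_nodes = ["A", "B", "C", "D", "E"]
toy_partitions = [
    {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1},
    {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1},
    {"A": 1, "B": 1, "C": 2, "D": 0, "E": 0},
    {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2},
]
result = consensus_communities(toy_partitions, toy_nodes, threshold=0.75)
print(result)
```

Community IDs are arbitrary labels within each run, which is why the sketch compares co-membership of node pairs rather than the IDs themselves.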

In [None]:
consensus = {}

for cl in clusters_networks.keys():
    cons = ga.pattern_matching.get_consensus_community(clusters_networks[cl], nodes,  rep_network=10, threshold=0.75)
    
    consensus[cl] = cons

In [None]:
consensus

For example, the communities can be functionally enriched and compared between the clusters.
    