# Sample Mass-Difference Networks in Metabolomics Data Analysis

Notebook to support the study on the application of **Sample M**ass-**Di**fference **N**etworks as a highly specific competing form of pre-processing procedure for high-resolution metabolomics data.

Mass-Difference Networks (MDiNs) are built from a list of masses. Each _m/z_ value is represented by a node. Two nodes are connected if the difference between their masses can be attributed to a simple chemical reaction (enzymatic or non-enzymatic) that changes the elemental composition of a metabolite.

The mass differences used to build these networks are called MDBs (Mass-Difference-based Building blocks).

This is notebook `paper_sMDiNs_sMDiNsAnalysis.ipynb`


## Organization of the Notebook

- Loading up pre-processed and pre-treated dataset databases.
- **Reading the MDiNs built in Cytoscape and analysing some of their characteristics.**
- **Subgraphing Sample MDiNs from the MDiNs and analysing them.**
- Joining the results from sMDiN analysis to the database of pre-processed and pre-treated datasets.

Warning: Run this notebook after `paper_sMDiNs_database_prep.ipynb`.

#### Needed Imports

In [1]:
import itertools
from pathlib import Path

import numpy as np
import pandas as pd

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
import scipy.stats as stats

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
from matplotlib import ticker

from sklearn.model_selection import GridSearchCV
import sklearn.ensemble as skensemble

import seaborn as sns
import networkx as nx

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
from elips import plot_confidence_ellipse

# For multiprocessing sMDiN analysis of the HD dataset
import smdins
from tqdm import tqdm
import multiprocessing as mp

In [2]:
%matplotlib inline

In [3]:
# json for persistence

import json
from time import perf_counter

## Description of dataset records

`datasets` is the global dict that holds all data sets. It is a **dict of dict's**.

Each data set is **represented as a dict**.

Each record has the following fields (keys):

- `name`: the table/figure name of the data set
- `source`: the biological source for each dataset
- `mode`: the acquisition mode
- `alignment`: the alignment used to generate the data matrix
- `data`: the data matrix
- `target`: the sample labels, possibly already integer encoded
- `MDiN`: Mass-Difference Network
- `<treatment name>`: transformed data matrix / network. These treatment names can be
    - `original`: an alias to `data`
    - `Ionly`: missing value imputed data by 1/5 of the minimum value in each sample in the dataset, only
    - `P`: Pareto scaled data
    - `NP`: Pareto scaled and normalized
    - `NGP`: normalized, glog transformed and Pareto scaled
    - `Ionly_RF`: missing value imputed data by random forests, only
    - `P_RF`: Pareto scaled data
    - `NP_RF`: Pareto scaled and normalized
    - `NGP_RF`: normalized, glog transformed and Pareto scaled
    - `IDT`: `NGP_RF` or `NGP` - Intensity-based Data pre-Treatment chosen as comparison based on which of the two performed better for each dataset and each statistical method
    - `sMDiNs`: Sample Mass-Difference Networks

- `<sMDiN analysis name>`: data matrix from network analysis of sMDiNs
    - **`Degree`: degree analysis of each sMDiN**
    - **`Betweenness`: betweenness centrality analysis of each sMDiN**
    - **`Closeness`: closeness centrality analysis of each sMDiN**
    - **`MDBI`: analysis on the impact of each MDB (Mass-Difference based building-block) on building each sMDiN**
    - **`GCD11`: Graphlet Correlation Distance of 11 different orbits (maximum of 4-node graphlets) between each sMDiN.**

The keys of `datasets` may be shared with dicts holding records resulting from comparison analysis.

Here are the keys (and respective names) of datasets used in this study:

- GD_neg_global2 (GDg2-)
- GD_pos_global2 (GDg2+)
- GD_neg_class2 (GDc2-)
- GD_pos_class2 (GDc2+)
- YD (YD)
- vitis_types (GD types)
- HD (HD)

### Reading datasets database

In [4]:
# Read benchmark datasets
path = Path.cwd() / "store_files" / 'processed_data.json'
storepath = Path.cwd() / "store_files" / 'processed_data.h5'
with pd.HDFStore(storepath) as store:

    with open(path, encoding='utf8') as read_file:
        datasets = json.load(read_file)
    
    for dskey, dataset in datasets.items():
        for key in dataset:
            value = dataset[key]
            if isinstance(value, str) and value.startswith("INSTORE"):
                storekey = value.split("_", 1)[1]
                dataset[key] = store[storekey]
#datasets

In [5]:
# Atomic masses - https://ciaaw.org/atomic-masses.htm
# Isotopic abundances - https://ciaaw.org/isotopic-abundances.htm
# and https://www.degruyter.com/view/journals/pac/88/3/article-p293.xml
# Isotopic abundances from Pure Appl. Chem. 2016; 88(3): 293–306,
# Isotopic compositions of the elements 2013 (IUPAC Technical Report), doi: 10.1515/pac-2015-0503

chemdict = {'H':(1.0078250322, 0.999844),
            'C':(12.000000000, 0.988922),
            'N':(14.003074004, 0.996337),
            'O':(15.994914619, 0.9976206),
            'Na':(22.98976928, 1.0),
            'P':(30.973761998, 1.0),
            'S':(31.972071174, 0.9504074),
            'Cl':(34.9688527, 0.757647),
            'F':(18.998403163, 1.0),
            'C13':(13.003354835, 0.011078) # Carbon 13 isotope
           } 

# electron mass from NIST http://physics.nist.gov/cgi-bin/cuu/Value?meu|search_for=electron+mass
electron_mass = 0.000548579909065

### Loading Mass-Difference Network of benchmark datasets

In [6]:
#import MDiN_functions as mdin

#### MDBs (Mass-Difference-based Building blocks) 

Building the **list of MDBs** to use when building the MDiNs.

MDB - Mass-Difference-based Building blocks.

The choice of MDBs is crucial in the network-building process. Ideally, they should represent 'simple' biochemical reactions that cover the most common and ubiquitous enzymatic and non-enzymatic reactions, while remaining a relatively small set (15 were picked) and maintaining the charge neutrality of the metabolite formula.

For example, to maintain neutrality, a phosphorylation corresponds to the overall addition of PO3H: a -PO3$^{2-}$ group plus 2 H$^+$ is added, replacing an H atom of the metabolite. All the MDBs chosen represent changes of no more than 5 atoms and less than 80 Da (small size). Each MDB should represent a set of known chemical reactions, and a change in every main element of metabolites (C, H, O, N, S and P) is represented by at least one MDB. To fulfil these conditions, representative MDBs were searched for in BRENDA (https://www.brenda-enzymes.org/). The groups chosen were the following:

- CH2 (methylation) 
- O (oxygenation) 
- H2 (Hydrogenation)
- O(N-H-) (Aminase): NH3(O-) - H2
- PO3H (phosphorylation)
- NH3(O-) (transaminases)
- SO3 (sulphation)
- CO (like formylation) 
- CO2 (carboxylation, decarboxylation)
- CHOH (Hydroxymethylation) 
- NCH (formidoyltransferase)
- CONH (carbamoyltransferase)
- C2H2O (acetylation)
- S (rare but an extra S reaction)
- H2O

Many other MDBs representing other reactions could also be included, such as CN2H2 (amidinotransferases), COCH2COO (malonyl transferases), etc.
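
As a rough illustration of how the 15 MDBs translate into mass differences, the sketch below computes a monoisotopic mass for each one from the element masses in `chemdict`. The signed compositions in `MDB_COMPOSITIONS` are an assumption (a transcription of the formulas listed above, with negative counts for atoms that are lost); the transformation files actually used in Cytoscape were built in `paper_sMDiNs_database_prep` and may differ in detail.

```python
# Monoisotopic element masses, copied from the notebook's `chemdict`.
ELEMENT_MASS = {
    "H": 1.0078250322,
    "C": 12.000000000,
    "N": 14.003074004,
    "O": 15.994914619,
    "P": 30.973761998,
    "S": 31.972071174,
}

# Hypothetical signed compositions: positive counts are atoms gained,
# negative counts are atoms lost (e.g. NH3(O-) gains N and 3 H, loses O).
MDB_COMPOSITIONS = {
    "H2": {"H": 2},
    "CH2": {"C": 1, "H": 2},
    "CO2": {"C": 1, "O": 2},
    "O": {"O": 1},
    "CHOH": {"C": 1, "H": 2, "O": 1},
    "NCH": {"N": 1, "C": 1, "H": 1},
    "O(N-H-)": {"O": 1, "N": -1, "H": -1},
    "S": {"S": 1},
    "CONH": {"C": 1, "O": 1, "N": 1, "H": 1},
    "PO3H": {"P": 1, "O": 3, "H": 1},
    "NH3(O-)": {"N": 1, "H": 3, "O": -1},
    "SO3": {"S": 1, "O": 3},
    "CO": {"C": 1, "O": 1},
    "C2H2O": {"C": 2, "H": 2, "O": 1},
    "H2O": {"H": 2, "O": 1},
}


def mdb_mass(composition):
    """Sum the element masses weighted by their signed atom counts."""
    return sum(ELEMENT_MASS[element] * count for element, count in composition.items())


mdb_masses = {name: mdb_mass(comp) for name, comp in MDB_COMPOSITIONS.items()}

for name, mass in sorted(mdb_masses.items(), key=lambda item: item[1]):
    print(f"{name:8s} {mass:.6f}")
```

Under these assumed compositions every mass difference stays below 80 Da, in line with the size criterion stated above.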

In [7]:
# Chemical Formula transformations (MDBs chosen)
MDBs = ['H2','CH2','CO2','O','CHOH','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O']

### Build the Mass-Difference Networks in Cytoscape using MetaNetter 2.0

To build the network in Cytoscape we need:

1) A list of neutral masses (with masses as float in node attributes).

2) A list of allowed transformations with specific mass differences: the list of MDBs built above.

The transformation files and the dataset files were built in `paper_sMDiNs_database_prep` and used in Cytoscape to build the MDiNs.

Parameters: 1 ppm error allowed for edge establishment.
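
To make the edge criterion concrete, here is a minimal sketch of how an MDiN could be assembled directly in NetworkX. It is not MetaNetter's implementation: the function name `build_mdin` is hypothetical, and taking the ppm tolerance relative to the larger of the two masses is an assumption, since the exact rule used by MetaNetter 2.0 is not stated here.

```python
import itertools

import networkx as nx


def build_mdin(masses, mdb_masses, ppm=1.0):
    """Connect two masses when their difference matches an MDB within `ppm`."""
    graph = nx.Graph()
    graph.add_nodes_from(masses)
    for low, high in itertools.combinations(sorted(masses), 2):
        diff = high - low
        # Assumption: the tolerance scales with the larger mass of the pair.
        tolerance = high * ppm * 1e-6
        for name, mdb in mdb_masses.items():
            if abs(diff - mdb) <= tolerance:
                graph.add_edge(low, high, Transformation=name)
    return graph


# Toy example with three MDBs and four made-up neutral masses.
toy_mdbs = {"CH2": 14.0156500644, "O": 15.994914619, "H2O": 18.0105646834}
base = 180.0633881
toy_masses = [
    base,
    base + toy_mdbs["CH2"],
    base + toy_mdbs["CH2"] + toy_mdbs["H2O"],
    base + 50.0,  # no MDB links this mass to the others
]

toy_mdin = build_mdin(toy_masses, toy_mdbs, ppm=1.0)
print(list(toy_mdin.edges(data="Transformation")))
```

The all-pairs loop is quadratic in the number of masses, which is fine for a toy list but would be slow for the thousands of masses in the real datasets.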

The yeast datasets already contain m/z 'buckets' that represent the neutral masses of the metabolites. For the grapevine datasets, neutral masses were obtained from the m/z peaks by adding (negative ionization mode) or subtracting (positive ionization mode) the mass of a proton (hydrogen atom mass minus electron mass).
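
The proton correction can be sketched as follows. The helper name `neutral_mass` and the mode strings are hypothetical, and the sketch assumes singly charged [M-H]- and [M+H]+ ions only; the constants are the ones defined earlier in this notebook.

```python
H_MASS = 1.0078250322               # hydrogen atom, from `chemdict`
ELECTRON_MASS = 0.000548579909065   # from `electron_mass`
PROTON_MASS = H_MASS - ELECTRON_MASS


def neutral_mass(mz, mode):
    """Convert an m/z value of a singly charged ion to a neutral mass."""
    if mode == "negative":   # [M-H]-: the proton that was lost is added back
        return mz + PROTON_MASS
    if mode == "positive":   # [M+H]+: the proton that was gained is removed
        return mz - PROTON_MASS
    raise ValueError(f"unknown ionization mode: {mode!r}")


print(neutral_mass(179.0561, "negative"))
print(neutral_mass(181.0707, "positive"))
```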

Only one full network is built for each benchmark dataset. Each sample MDiN is then obtained as a subgraph of that network, restricted to the masses detected in the sample. This is the same as building an MDiN for each sample.
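
The subgraphing idea can be sketched on a toy network. All names and values below are made up for illustration; the only assumptions carried over from the notebook are that rows of the data matrix are samples, columns are masses, and a feature counts as detected when it is neither missing nor zero.

```python
import networkx as nx
import pandas as pd

# Toy full MDiN: three connected masses and one isolated mass.
mdin = nx.Graph()
mdin.add_edge(100.0, 114.0, Transformation="CH2")
mdin.add_edge(114.0, 132.0, Transformation="H2O")
mdin.add_node(150.0)

# Toy data matrix: rows are samples, columns are masses.
data = pd.DataFrame(
    [[5.0, 3.0, float("nan"), 1.0],
     [0.0, 2.0, 4.0, float("nan")]],
    index=["sample_A", "sample_B"],
    columns=[100.0, 114.0, 132.0, 150.0],
)


def sample_subgraph(graph, row):
    """Induced subgraph on the masses detected (non-missing, non-zero) in a sample."""
    detected = row.index[(row.fillna(0) != 0).to_numpy()]
    # .copy() gives each sample its own attribute dicts instead of a shared view.
    return graph.subgraph(detected).copy()


sample_networks = {samp: sample_subgraph(mdin, data.loc[samp]) for samp in data.index}

for samp, net in sample_networks.items():
    print(samp, sorted(net.nodes()), list(net.edges()))
```

An edge survives in a sample network only when both of its masses were detected in that sample, which is why the result matches what building a network from each sample's own mass list would give.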

The networks were exported in GraphML format, which the NetworkX module can read.

- Nodes have a standard number ID instead of the mass (stored as the attribute 'mass'). Other attributes stored are irrelevant. 
- Among their attributes, edges have a very useful one called 'Transformation', which stores the MDB that was used to establish the edge. It will be used for the MDB Impact analysis.
- Finally, the graph is directed. Since reactions are bidirectional, the networks will be converted to undirected graphs.

Changes that will be made to the network:

- Nodes will be identified by their masses.
- The intensity of each node in each sample will be stored later on the corresponding sample subgraph.
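
The relabelling and direction changes described above can be sketched on an in-memory graph that stands in for a GraphML export (the node IDs and masses are made up). `nx.get_node_attributes` already returns the old-name to mass mapping that `nx.relabel_nodes` expects.

```python
import networkx as nx

# Stand-in for a network read with nx.read_graphml: directed, numbered node IDs.
exported = nx.DiGraph()
exported.add_node("n0", mass=100.0)
exported.add_node("n1", mass=114.0156500644)
exported.add_node("n2", mass=250.0)
exported.add_edge("n0", "n1", Transformation="CH2")

# Mapping {old node ID: mass}, used to rename every node by its mass.
mass_by_id = nx.get_node_attributes(exported, "mass")
mdin = nx.relabel_nodes(exported, mass_by_id).to_undirected()

print(sorted(mdin.nodes()))
print(list(mdin.edges(data=True)))
```

One caveat of this approach: two nodes sharing exactly the same mass would be merged into one by the relabelling, so it relies on the masses being unique.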

In [8]:
# Read the MDiNs built using Cytoscape's MetaNetter
for name, ds in datasets.items():
    print(f'Reading MDiN based on data in {name}', end=' ...')
    MDiN_temp = nx.read_graphml('mass_data/MassList' + name +'.graphml')
    
    # Making dicts for the new names
    new_nodes = dict.fromkeys(MDiN_temp.nodes(),0)

    for i,mz in nx.get_node_attributes(MDiN_temp,'mass').items(): # i is old name, mz is mass/new name
        new_nodes[i] = mz

    # Relabeling nodes
    MDiN_temp = nx.relabel_nodes(MDiN_temp, mapping=new_nodes)
    
    ds['MDiN'] = MDiN_temp.to_undirected()
    print('done!')

Reading MDiN based on data in GD_neg_global2 ...done!
Reading MDiN based on data in GD_pos_global2 ...done!
Reading MDiN based on data in GD_neg_class2 ...done!
Reading MDiN based on data in GD_pos_class2 ...done!
Reading MDiN based on data in YD ...done!
Reading MDiN based on data in vitis_types ...done!
Reading MDiN based on data in HD ...done!


In [9]:
# Build the MDiNs
#for name, ds in datasets.items():
#    print(f'Building MDiN based on data in {name}', end=' ...')
#    ds['MDiN'] = mdin.simple_MDiN(list(ds['data'].columns))
#    print('done!')

### MDiN characteristics

Building a table with general characteristics about the 7 MDiNs built.
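
The characteristics in the table can be illustrated on a toy graph. The helper `network_characteristics` is a hypothetical wrapper around standard NetworkX calls; it reports the same quantities as the cells below, with the diameter and radius taken on the largest connected component and the connected nodes given as a percentage.

```python
import networkx as nx


def network_characteristics(graph):
    """Summarise a network with the quantities used in the MDiN table."""
    largest = max(nx.connected_components(graph), key=len)
    main_component = graph.subgraph(largest)
    n_nodes = graph.number_of_nodes()
    n_isolated = nx.number_of_isolates(graph)
    return {
        "# Nodes": n_nodes,
        "# Edges": graph.number_of_edges(),
        "Biggest Comp. Size": main_component.number_of_nodes(),
        "Connected Nodes": (n_nodes - n_isolated) / n_nodes * 100,
        "Diameter": nx.diameter(main_component),
        "Radius": nx.radius(main_component),
    }


# Toy graph: a 4-node path (0-1-2-3) plus one isolated node.
toy = nx.path_graph(4)
toy.add_node(99)

summary = network_characteristics(toy)
print(summary)
```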

In [10]:
Net_Chara = pd.DataFrame(columns=['# Nodes', '# Edges', 'Biggest Comp. Size', 'Connected Nodes', 
                                  'Diameter', 'Radius'])

In [11]:
for name, ds in datasets.items():
    Net_Chara.loc[name, '# Nodes'] = len(ds['MDiN'].nodes())
    Net_Chara.loc[name, '# Edges'] = len(ds['MDiN'].edges())
    
    Main_component = ds['MDiN'].subgraph(list(sorted(nx.connected_components(ds['MDiN']), key=len, reverse=True)[0]))
    Net_Chara.loc[name, 'Biggest Comp. Size'] = len(Main_component.nodes())
    Net_Chara.loc[name, 'Diameter'] = nx.diameter(Main_component)
    Net_Chara.loc[name, 'Radius'] = nx.radius(Main_component)
    
    isolated = 0
    for node in ds['MDiN'].degree():
        if node[1] == 0:
            isolated = isolated + 1
    Net_Chara.loc[name, 'Connected Nodes'] = (len(ds['MDiN'].nodes()) - isolated) / len(ds['MDiN'].nodes()) * 100

In [12]:
Net_Chara

| Dataset | # Nodes | # Edges | Biggest Comp. Size | Connected Nodes (%) | Diameter | Radius |
|---|---|---|---|---|---|---|
| GD_neg_global2 | 3629 | 1005 | 183 | 32.433177 | 27 | 14 |
| GD_pos_global2 | 7026 | 6597 | 2482 | 55.735838 | 49 | 25 |
| GD_neg_class2 | 3026 | 718 | 145 | 29.312624 | 31 | 16 |
| GD_pos_class2 | 4565 | 3798 | 1472 | 52.968237 | 45 | 23 |
| YD | 1893 | 810 | 275 | 35.60486 | 31 | 16 |
| vitis_types | 3026 | 718 | 145 | 29.312624 | 31 | 16 |
| HD | 12867 | 31008 | 7631 | 74.943654 | 63 | 32 |


In [13]:
MDB_counts = {}

for name, ds in datasets.items():
    MDB_counts[name] = dict.fromkeys(MDBs, 0) # MDBs from the transformation list
    # Count how many edges of the MDiN were established by each MDB
    for _, _, mdb in ds['MDiN'].edges(data='Transformation'):
        MDB_counts[name][mdb] += 1
        
MDB_counts = pd.DataFrame.from_dict(MDB_counts)
MDB_counts.reindex(['O(N-H-)','NH3(O-)','H2','CH2','O','H2O','CO','NCH','CHOH','S','C2H2O','CONH','CO2','SO3','PO3H']).T
#MDB_counts.to_csv('TableS1.csv')

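| Dataset | O(N-H-) | NH3(O-) | H2 | CH2 | O | H2O | CO | NCH | CHOH | S | C2H2O | CONH | CO2 | SO3 | PO3H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GD_neg_global2 | 50 | 120 | 138 | 173 | 135 | 58 | 111 | 34 | 36 | 8 | 39 | 24 | 61 | 9 | 9 |
| GD_pos_global2 | 291 | 214 | 763 | 1229 | 821 | 735 | 612 | 289 | 98 | 118 | 544 | 261 | 386 | 121 | 115 |
| GD_neg_class2 | 24 | 87 | 112 | 134 | 91 | 43 | 90 | 9 | 23 | 4 | 25 | 15 | 46 | 10 | 5 |
| GD_pos_class2 | 156 | 115 | 453 | 785 | 511 | 449 | 364 | 153 | 41 | 48 | 299 | 120 | 200 | 52 | 52 |
| YD | 38 | 32 | 101 | 152 | 100 | 96 | 78 | 27 | 6 | 13 | 62 | 19 | 39 | 11 | 36 |
| vitis_types | 24 | 87 | 112 | 134 | 91 | 43 | 90 | 9 | 23 | 4 | 25 | 15 | 46 | 10 | 5 |
| HD | 1878 | 1464 | 3196 | 4767 | 3832 | 2819 | 2317 | 2061 | 736 | 882 | 2521 | 1875 | 1902 | 448 | 310 |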


For the **HD** dataset, because of its large number of nodes and samples, only nodes with a degree of at least 1 (at least one edge) are kept, to speed up the subsequent sMDiN analysis. All uninformative isolated nodes are discarded from the network.

In [14]:
new_node_list = []
for node in datasets['HD']['MDiN'].degree():
    if node[1] > 0:
        new_node_list.append(node[0])
datasets['HD']['MDiN'] = datasets['HD']['MDiN'].subgraph(new_node_list)

### Building and Analysing Sample MDiNs

5 Analysis metrics of the Sample MDiNs were used:

- Degree analysis
- Betweenness - Betweenness centrality analysis
- Closeness - Closeness centrality analysis
- MDBI - Mass-Difference based building block Impact analysis
- GCD11 - Graphlet Correlation Distance using 11 non-redundant graphlet orbits (maximum of 4-node graphlets) - GCD-11 analysis

This analysis encompasses metrics that evaluate different aspects of the networks. 

The first three are measures of centrality commonly used to characterize networks: degree, closeness centrality and betweenness centrality. These metrics keep each node as a feature (no feature reduction); the value of a feature in a sample is the metric value of that node in the corresponding sample MDiN.
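
A minimal sketch of how per-sample centralities become a samples-by-nodes feature matrix is shown below, on a toy network with made-up sample names. Nodes that are absent from a sample network get a value of 0, mirroring the `replace({np.nan:0})` step used later in this notebook.

```python
import networkx as nx
import pandas as pd

# Toy full network (path 0-1-2-3) and the nodes detected in each sample.
full = nx.path_graph(4)
sample_nodes = {"s1": [0, 1, 2], "s2": [1, 2, 3]}

degree, betweenness, closeness = {}, {}, {}
for samp, nodes in sample_nodes.items():
    net = full.subgraph(nodes)
    degree[samp] = dict(net.degree())
    betweenness[samp] = nx.betweenness_centrality(net)
    closeness[samp] = nx.closeness_centrality(net)


def to_feature_table(per_sample):
    """Rows are samples, columns are nodes, absent nodes are filled with 0."""
    return pd.DataFrame.from_dict(per_sample).fillna(0).T


degree_table = to_feature_table(degree)
betweenness_table = to_feature_table(betweenness)
closeness_table = to_feature_table(closeness)
print(degree_table)
```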

The last two metrics greatly reduce the number of features considered by the statistical methods to discriminate between the sample MDiNs: MDB Impact reduces them to 15 (one per MDB) and GCD-11 to 55 (one per pair of the 11 orbits).

**MDB Impact** is a measure of the impact that each MDB had in establishing a sample MDiN. To that end, the number of edges established by each MDB is counted in each sample MDiN (each MDB represents a set of chemical reactions). To allow comparison between samples with different numbers of edges, the counts in each sample MDiN are normalized: in this notebook they are expressed as a percentage of the total number of edges of that sample MDiN (the Pareto scaling call is left commented out in the code). This analysis was made to see if the relative importance of the MDBs in establishing the networks is characteristic of the class the sample belongs to.
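
The MDB Impact computation can be sketched with a `Counter` over the 'Transformation' edge attribute. The helper name `mdb_impact` and the toy graph are hypothetical, and the result is expressed as a percentage of the sample's edges, which is what the analysis cell in this notebook computes. Returning 0.0 for a network without edges is a choice made for this sketch only.

```python
from collections import Counter

import networkx as nx

MDBS = ["CH2", "O", "H2O"]  # shortened list, for illustration only


def mdb_impact(graph, mdbs):
    """Percentage of the edges of `graph` established by each MDB."""
    counts = Counter(mdb for _, _, mdb in graph.edges(data="Transformation"))
    total = sum(counts.values())
    return {mdb: (100 * counts[mdb] / total if total else 0.0) for mdb in mdbs}


# Toy sample network with four edges.
toy = nx.Graph()
toy.add_edge(1, 2, Transformation="CH2")
toy.add_edge(2, 3, Transformation="CH2")
toy.add_edge(3, 4, Transformation="O")
toy.add_edge(4, 5, Transformation="H2O")

impact = mdb_impact(toy, MDBS)
print(impact)
```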

**GCD-11** is an analysis that focuses on the structure and topology of the networks. Briefly, this method counts the number of times each node in the sMDiN occupies each of 11 of the 15 possible orbits of 2- to 4-node graphlets (4 orbits are redundant: 3, 12, 13 and 14). This builds a Graphlet Degree Vector for each node, giving a dataframe with the 11 orbits as columns. The Spearman correlation between each pair of the 11 columns is calculated to generate a symmetric 11x11 matrix, the Graphlet Correlation Matrix (GCM), which represents, according to the authors, the signature of the network topology. The distance between these matrices, the Graphlet Correlation Distance, can be used to compare different networks. To use this topology signature to discriminate the different sMDiNs, the correlations between each pair of orbits (11 × 10 / 2 = 55 in total) were extracted from the GCM of each sMDiN to be the features that are compared against each other and used by the statistical methods. This analysis was made to see if the general topology of the sMDiNs is characteristic of the class the sample belongs to, since this metric is known to successfully discriminate between different networks.
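
The step from orbit counts to GCD-11 features can be sketched as follows. The orbit counts here are random numbers rather than real graphlet counts, the helper names are hypothetical, and pandas' `corr(method="spearman")` stands in for the `scipy.stats.spearmanr` call used in the analysis cell. The distance function assumes the Graphlet Correlation Distance is the Euclidean distance between the upper-triangle entries of two GCMs.

```python
import itertools

import numpy as np
import pandas as pd

ORBITS = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11]  # the 11 non-redundant orbits


def gcm_features(orbit_counts):
    """Upper-triangle Spearman correlations between orbit columns.

    orbit_counts: DataFrame with one row per node and one column per orbit.
    """
    gcm = orbit_counts.corr(method="spearman").fillna(0)  # constant columns give NaN
    return {
        f"{u}-{v}": float(gcm.loc[u, v])
        for u, v in itertools.combinations(gcm.columns, 2)
    }


def graphlet_correlation_distance(features_a, features_b):
    """Euclidean distance between two sets of GCM upper-triangle entries."""
    return float(np.sqrt(sum((features_a[k] - features_b[k]) ** 2 for k in features_a)))


# Made-up orbit counts for two networks of 30 nodes each.
rng = np.random.default_rng(0)
counts_a = pd.DataFrame(rng.integers(0, 10, size=(30, 11)), columns=ORBITS)
counts_b = pd.DataFrame(rng.integers(0, 10, size=(30, 11)), columns=ORBITS)

features_a = gcm_features(counts_a)
features_b = gcm_features(counts_b)
print(len(features_a), graphlet_correlation_distance(features_a, features_b))
```

Each sample MDiN therefore contributes one row of 55 correlation features to the `GCD11` data matrix.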

Useful papers for detailed explanations on GCD-11 and other related methodologies: 
- Yaveroğlu ÖN, Malod-Dognin N, Davis D, et al. Revealing the Hidden Language of Complex Networks. Sci Rep. 2014;4(1):4547. doi:10.1038/srep04547
- Milenković T, Pržulj N. Uncovering biological network function via graphlet degree signatures. Cancer Inform. 2008;6:257-273. doi:10.4137/cin.s680 - some details on graphlet signatures
- Tantardini M, Ieva F, Tajoli L, Piccardi C. Comparing methods for comparing networks. Sci Rep. 2019;9(1):1-19. doi:10.1038/s41598-019-53708-y - many different similar methods to GCD-11

In [15]:
for name, ds in datasets.items():
    sMDiNs = {}
    if name != 'HD':
        for samp in ds['data'].index:
            # Subgraphing sMDiN: keep only the masses detected (non-missing, non-zero) in this sample.
            # .copy() is needed because a subgraph view shares its node attribute dicts with the full
            # MDiN, so intensities set on a view would be overwritten by every later sample.
            sMDiNs[samp] = ds['MDiN'].subgraph(
                ds['data'].T[ds['data'].loc[samp,:].replace({np.nan:0}) != 0].index).copy()

            # Storing the intensity of each feature in this sample on the nodes
            intensity_attr = dict.fromkeys(sMDiNs[samp].nodes(),0)
            for i,m in nx.get_node_attributes(sMDiNs[samp],'mass').items():
                intensity_attr[i] = {'intensity':ds['data'].loc[samp,m]}
            nx.set_node_attributes(sMDiNs[samp],intensity_attr)
    
    else:
        # For the HD dataset, where the column labels of the 2D numerical matrix aren't the masses used for the MDiN.
        for samp in ds['data'].index:
            # Extracting the mass lists of each sample from the 2D numerical matrix
            new_idx = ds['data'].T[ds['data'].loc[samp,:].replace({np.nan:0}) != 0].index
            new_idx_final = []
            for i in new_idx:
                # Taking out a proton of the masses to 'neutralise' the mass.
                new_idx_final.append(float(i.split('_')[0]) - chemdict['H'][0] + electron_mass)
                
            #Subgraphing sMDiN with the list of masses extracted
            sMDiNs[samp] = ds['MDiN'].subgraph(new_idx_final)
    
    # Store all sMDiNs
    ds['sMDiNs'] = sMDiNs

Count number of graphlet orbits for GCD-11 analysis
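
Before reading the full orbit-counting function, it may help to see the 2- and 3-node orbits on their own. The sketch below is an independent, simplified counter (hypothetical name `small_orbits`) for orbits 0 to 3 only, assuming a simple undirected graph without self-loops. Orbit 3 (triangle membership) is included just as a check; it is one of the redundant orbits left out of GCD-11.

```python
import itertools

import networkx as nx


def small_orbits(graph):
    """Count orbits 0-3 (graphlets of 2 and 3 nodes) for every node."""
    result = {}
    for node in graph.nodes():
        neighbours = set(graph.neighbors(node))
        # Orbit 1: node is an end of a 3-node path node - j - k,
        # where k is not node itself and not adjacent to node.
        orbit1 = sum(
            len(set(graph.neighbors(j)) - neighbours - {node}) for j in neighbours
        )
        orbit2 = 0  # node is the middle of a 3-node path
        orbit3 = 0  # node is part of a triangle
        for u, v in itertools.combinations(neighbours, 2):
            if graph.has_edge(u, v):
                orbit3 += 1
            else:
                orbit2 += 1
        result[node] = {"0": len(neighbours), "1": orbit1, "2": orbit2, "3": orbit3}
    return result


path_orbits = small_orbits(nx.path_graph(3))          # 0 - 1 - 2
triangle_orbits = small_orbits(nx.complete_graph(3))  # all three nodes connected
print(path_orbits)
print(triangle_orbits)
```

In the path, the two end nodes sit in orbit 1 and the middle node in orbit 2; in the triangle, every node sits in orbit 3 only.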

In [16]:
def calculating_orbits(GG):
    """Calculates the number of times each node of the network is in each possible (non-redundant) orbit in graphlets (maximum
    4 nodes).
    
    Function is not very efficient, all nodes are passed, every graphlet is 'made' for each node present in it so it is made
    multiple times.
    
       GG: networkx graph;
    
       returns: dict; dictionary (keys are the nodes) of dictionaries (keys are the orbits and values are the number of times)
    """
    
    node_orbits = {} # To store results

    for i in GG.nodes():

        node_orbits[i] = {} # To store results
        orbits = node_orbits[i]

        # 2 node graphlets - orbit 0
        orbits['0'] = GG.degree(i)

        # 3 node graphlets - orbit 1,2 (and 3 redundant)
        node_neigh = list(GG.neighbors(i))

        # orbit 1 and 4 and 6 and 8 and 9
        n_orb = 0
        n_orb4 = 0
        n_orb6 = 0
        n_orb8 = 0
        n_orb9 = 0

        # orbit 1
        for j in node_neigh:
            neigh_neigh = list(GG.neighbors(j)) # Neighbours of the neighbour j of i
            neigh_neigh.remove(i) # Remove i since i is a neighbour of j
            for common in nx.common_neighbors(GG, i, j):
                neigh_neigh.remove(common) # Remove common neighbours of i and j
            n_orb = n_orb + len(neigh_neigh)


            # orbit 4 and 8
            for n3 in neigh_neigh:
                neigh_neigh_neigh = list(GG.neighbors(n3)) # Neighbours of the neighbour n3 of the neighbour j of i
                #neigh_neigh_neigh.remove(j)
                #if i in neigh_neigh_neigh:
                    #neigh_neigh_neigh.remove(i)     
                for common in nx.common_neighbors(GG, j, n3):
                    if common in neigh_neigh_neigh:
                        neigh_neigh_neigh.remove(common)

                for common in nx.common_neighbors(GG, i, n3):
                    if common in neigh_neigh_neigh:
                        neigh_neigh_neigh.remove(common)
                        # orbit 8
                        if common != j:
                            #print(i,j,n3,common)
                            n_orb8 = n_orb8 + 1/2 # always goes in 2 directions so it will always pass like this

                n_orb4 = n_orb4 + len(neigh_neigh_neigh)
                # print(neigh_neigh_neigh)

            # orbit 6 and 9
            for u,v in itertools.combinations(neigh_neigh, 2):
                if not GG.has_edge(u,v):
                    n_orb6 = n_orb6 + 1
                else:
                    n_orb9 = n_orb9 + 1         

        orbits['1'] = n_orb

        # orbit 2 and 5
        n_orb = 0
        n_orb5 = 0
        for u,v in itertools.combinations(node_neigh, 2):
            if not GG.has_edge(u,v):
                n_orb = n_orb + 1

                # orbit 5
                neigh_u = list(GG.neighbors(u))
                neigh_u.remove(i)
                for common in nx.common_neighbors(GG, i, u):
                    neigh_u.remove(common)

                neigh_v = list(GG.neighbors(v))
                neigh_v.remove(i)
                for common in nx.common_neighbors(GG, i, v):
                    neigh_v.remove(common)

                for common in nx.common_neighbors(GG, v, u):
                    if common in neigh_u:
                        neigh_u.remove(common)
                    if common in neigh_v:
                        neigh_v.remove(common) 

                n_orb5 = n_orb5 + len(neigh_u)
                n_orb5 = n_orb5 + len(neigh_v)

        orbits['2'] = n_orb

        # 4 node graphlets - orbit 4,5,6,7,8,9,10,11 (and 12,13,14 redundant)

        # orbit 4
        orbits['4'] = n_orb4

        # orbit 5
        orbits['5'] = n_orb5

        # orbit 6
        orbits['6'] = n_orb6

        # orbit 7 and 11
        n_orb = 0
        n_orb11 = 0
        for u,v,j in itertools.combinations(node_neigh, 3):
            n_edge = [GG.has_edge(a,b) for a,b in itertools.combinations((u,v,j), 2)]
            #print(sum(n_edge))
            if sum(n_edge) == 0:
                n_orb = n_orb + 1
            elif sum(n_edge) == 1:
                n_orb11 = n_orb11 + 1

        orbits['7'] = n_orb

        # orbit 8
        orbits['8'] = int(n_orb8)

        # orbit 9
        orbits['9'] = n_orb9

        # orbit 10
        n_orb = 0
        for j in node_neigh:
            neigh_neigh = list(GG.neighbors(j))
            neigh_neigh.remove(i)
            for u,v in itertools.combinations(neigh_neigh, 2):
                if sum((GG.has_edge(i,u), GG.has_edge(i,v))) == 1:
                    if not GG.has_edge(u,v):
                        n_orb = n_orb + 1

        orbits['10'] = n_orb

        # orbit 11
        orbits['11'] = n_orb11
    
    return node_orbits

### **Sample MDiN analysis** and results storage for 6 benchmark datasets (except HD)

In [17]:
for name, ds in datasets.items():
    
    if name == 'HD':
        continue
    print(f'Analysing sample MDiNs from the data in {name}', end=' ...')
    
    Deg = {}
    Betw = {}
    Close = {}
    MDB_Impact = {}
    GCD = {} 
    
    for samp in ds['data'].index:
        #print(name, samp)
        k = None
        
        # Centrality measures
        Deg[samp] = dict(ds['sMDiNs'][samp].degree())
        Betw[samp] = nx.betweenness_centrality(ds['sMDiNs'][samp], k=k)
        Close[samp] = nx.closeness_centrality(ds['sMDiNs'][samp])
        
        # MDB influence
        MDB_Impact[samp] = dict.fromkeys(MDBs, 0) # MDBs from the transformation list
        for i in ds['sMDiNs'][samp].edges():
            MDB_Impact[samp][ds['sMDiNs'][samp].edges()[i]['Transformation']] = MDB_Impact[samp][
                ds['sMDiNs'][samp].edges()[i]['Transformation']] + 1
            
        # GCD-11
        # Corr_Mat
        orbits_t = calculating_orbits(ds['sMDiNs'][samp]) # Calculating orbit number for each node
        orbits_df = pd.DataFrame.from_dict(orbits_t).T # Transforming into a dataframe
        
        # Signature matrices
        corrMat_ar = stats.spearmanr(orbits_df)[0] # Calculating spearman correlation to obtain 11x11 signature of the network - GCM
        corrMat_tri = np.triu(corrMat_ar) # Both parts of the matrix are equal, so reducing the info to the upper triangle
        
        # Pulling the signature orbit n (u) - orbit m (v) correlations from the upper triangular matrix of the GCM
        # Making the signature of the sample MDiN into a column of the dataset
        samp_col = {}
        orbits = [0,1,2,4,5,6,7,8,9,10,11]
        for u in range(len(corrMat_tri)):
            for v in range(u+1, len(corrMat_tri)):
                samp_col[str(orbits[u]) + '-' + str(orbits[v])] = corrMat_tri[u,v]
        # Store the signature once, after all orbit pairs have been collected
        GCD[samp] = samp_col
    
    # Centrality Measures
    ds['Degree'] = pd.DataFrame.from_dict(Deg).replace({np.nan:0}).T
    ds['Betweenness'] = pd.DataFrame.from_dict(Betw).replace({np.nan:0}).T
    ds['Closeness'] = pd.DataFrame.from_dict(Close).replace({np.nan:0}).T
    
    # MDB Impact
    ds['MDBI'] = pd.DataFrame.from_dict(MDB_Impact).replace({np.nan:0})
    #ds['MDB_Imp'] = transf.pareto_scale(ds['MDB_Imp']).T.replace({np.nan:0})
    ds['MDBI'] = (ds['MDBI']/ds['MDBI'].sum()*100).T
    
    # GCD-11
    ds['GCD11'] = pd.DataFrame.from_dict(GCD).replace({np.nan:0}).T
    
    print('done!')

Analysing sample MDiNs from the data in GD_neg_global2 ...

  c /= stddev[:, None]
  c /= stddev[None, :]


done!
Analysing sample MDiNs from the data in GD_pos_global2 ...done!
Analysing sample MDiNs from the data in GD_neg_class2 ...done!
Analysing sample MDiNs from the data in GD_pos_class2 ...done!
Analysing sample MDiNs from the data in YD ...done!
Analysing sample MDiNs from the data in vitis_types ...done!


#### Sample MDiN analysis for the HD dataset using multiprocessing

This step uses 6 worker processes. Warning: check that the machine has at least 6 cores before running it.

Results are stored in the `results_HD` list.

The function `HD_sMDiN_analysis` in `smdins.py` performs the same 5 network analysis methods as the previous cell, but returns them per sample instead of per metric.
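
The regrouping from per-sample results to per-metric tables can be sketched as below. The result dicts are made up, and only the `Name`, `Deg` and `MDBI` keys are shown; they follow the keys read from `results_HD` later in this notebook, while the exact return value of `smdins.HD_sMDiN_analysis` is not shown here. The helper name `by_metric` is hypothetical.

```python
import pandas as pd

# Hypothetical per-sample results, one dict per sample as returned by a worker.
results = [
    {"Name": "s1", "Deg": {100.0: 1, 114.0: 1}, "MDBI": {"CH2": 1, "O": 0}},
    {"Name": "s2", "Deg": {114.0: 1, 130.0: 1}, "MDBI": {"CH2": 0, "O": 1}},
]


def by_metric(results, metric):
    """Collect one metric from every sample into a samples-by-features table."""
    per_sample = {res["Name"]: res[metric] for res in results}
    return pd.DataFrame.from_dict(per_sample).fillna(0).T


degree_table = by_metric(results, "Deg")
mdbi_table = by_metric(results, "MDBI")
print(degree_table)
```

Returning one dict per sample keeps each worker independent, so the samples can be processed in any order and regrouped afterwards.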

In [18]:
print("Number of processors: ", mp.cpu_count())

Number of processors:  8


In [19]:
# 6 cores
with mp.Pool(6) as p:
    results_HD = []
    for i in tqdm(
            p.imap(smdins.HD_sMDiN_analysis,
                   [(samp, datasets['HD']['sMDiNs'][samp]) for samp in datasets['HD']['data'].index]),#, chunksize=5),
            total=len(datasets['HD']['data'].index)
        ):
        results_HD.append(i)

100%|████████████████████████████████████| 249/249 [13:33:49<00:00, 196.10s/it]


In [20]:
# Transforming the results_HD list (one dict per sample with the results of the 5 metrics) into per-metric tables to add to the dataset database

Deg = {}
Betw = {}
Close = {}
MDB_Impact = {}
GCD = {}

for results in results_HD:
    Deg[results['Name']] = results['Deg']
    Betw[results['Name']] = results['Betweenness']
    Close[results['Name']] = results['Closeness']
    MDB_Impact[results['Name']] = results['MDBI']
    GCD[results['Name']] = results['GCD']
    
# Centrality Measures
datasets['HD']['Degree'] = pd.DataFrame.from_dict(Deg).replace({np.nan:0}).T
datasets['HD']['Betweenness'] = pd.DataFrame.from_dict(Betw).replace({np.nan:0}).T
datasets['HD']['Closeness'] = pd.DataFrame.from_dict(Close).replace({np.nan:0}).T

# MDB Impact
datasets['HD']['MDBI'] = pd.DataFrame.from_dict(MDB_Impact).replace({np.nan:0})
#ds['MDBI'] = transf.pareto_scale(ds['MDBI']).T.replace({np.nan:0})
datasets['HD']['MDBI'] = (datasets['HD']['MDBI']/datasets['HD']['MDBI'].sum()*100).T

# GCD-11
datasets['HD']['GCD11'] = pd.DataFrame.from_dict(GCD).replace({np.nan:0}).T

In [21]:
datasets['HD'].keys()

dict_keys(['source', 'alignment', 'mode', 'name', 'data', 'original', 'target', 'classes', 'Ionly', 'P', 'NP', 'NGP', 'Ionly_RF', 'P_RF', 'NP_RF', 'NGP_RF', 'IDT', 'Degree', 'Betw', 'Closeness', 'MDB_Imp', 'GCD11', 'MDiN', 'sMDiNs'])

### Re-Generate json files and HDF Store including the data matrices obtained from sMDiN analysis
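
The persistence scheme splits each record into JSON-serializable values and DataFrames, leaving an `INSTORE_<key>` placeholder in the JSON for every DataFrame. The sketch below shows the round trip with a plain dict standing in for the HDF store, so it runs without PyTables; the function names and the toy record are hypothetical.

```python
import json

import pandas as pd

PREFIX = "INSTORE"


def split_for_storage(datasets):
    """Separate DataFrames from JSON-serializable values, leaving placeholders."""
    serializable, frames = {}, {}
    for dskey, dataset in datasets.items():
        serializable[dskey] = {}
        for key, value in dataset.items():
            if isinstance(value, pd.DataFrame):
                storekey = f"{dskey}_{key}"
                frames[storekey] = value  # the notebook writes this to the HDF store
                serializable[dskey][key] = f"{PREFIX}_{storekey}"
            else:
                serializable[dskey][key] = value
    return serializable, frames


def restore(serializable, frames):
    """Replace every placeholder with the DataFrame stored under its key."""
    restored = {}
    for dskey, dataset in serializable.items():
        restored[dskey] = {}
        for key, value in dataset.items():
            if isinstance(value, str) and value.startswith(PREFIX):
                value = frames[value.split("_", 1)[1]]
            restored[dskey][key] = value
    return restored


toy = {"YD": {"name": "YD", "data": pd.DataFrame({"a": [1, 2]})}}
meta, frames = split_for_storage(toy)
roundtrip = restore(json.loads(json.dumps(meta)), frames)
print(meta)
```

Splitting the placeholder on the first underscore only is what allows store keys such as `GD_neg_global2_data` to contain underscores themselves.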

In [31]:
# ensure dir exists
path = Path.cwd() / "store_files"
path.mkdir(parents=True, exist_ok=True)

storepath = Path.cwd() / "store_files" / 'processed_data.h5'

store = pd.HDFStore(storepath, complevel=9, complib="blosc:blosclz")
#pd.set_option('io.hdf.default_format','table')

# keep json serializable values and store dataFrames in HDF store

serializable = {}

for dskey, dataset in datasets.items():
    serializable[dskey] = {}
    for key, value in dataset.items():
        #print(dskey, key)
        if isinstance(value, pd.DataFrame):
            storekey = dskey + '_' + key
            #print('-----', storekey)
            store[storekey] = value
            serializable[dskey][key] = f"INSTORE_{storekey}"
        elif key in ('MDiN', 'sMDiNs'):
            continue
        else:
            serializable[dskey][key] = value
store.close()
            

path = path / 'processed_data.json'
with open(path, "w", encoding='utf8') as write_file:
    json.dump(serializable, write_file)

#serializable