# Extra Module for Metabolomics Data Analysis

As this is a side module, all of its related functions are defined within the notebook itself.

# FDiGNN Analysis (Formula-Difference Graph Neural Networks Analysis)

FDiGNN, or Formula-Difference Graph Neural Networks, is an alternative way to process the data and train a supervised model on it. It complements the usual workflow and is available in this separate module as a Jupyter notebook.

First, the method converts the tabular metabolomics data into a graph structure called a Formula-Difference Network (FDiN). An FDiN uses the assigned formulas as nodes, which are linked if their difference corresponds to a common biochemical transformation. This FDiN serves as the basis for every sample; each sample adds its own node features (mass, intensity and feature occurrence in the sample), while edges carry a weight (2 if the transformation is observed in the SMPDB, 1 otherwise). **The chemical transformations considered determine the structure of the FDiN.**

These FDiNs are used as input for a deep learning methodology called Graph Neural Network (GNN), hence the name FDiGNN. This approach is **highly tuneable**. We present a model architecture based on TAG layers, global attention pooling and classifier layers, but this can be changed. Furthermore, the model needs to be optimized for each dataset by changing hyperparameters such as: **number of TAG layers**, **hidden channels in each layer**, **learning rate settings** (rate, weight decay and exponential decay gamma), **dropout probability**, **number of epochs trained** or **batch size**.

This module makes it possible to estimate the model performance, obtain the node/metabolite importances, and perform network and pathway enrichment analysis on the data.

**Estimating model performance and calculating feature importance may take from a few minutes to a few hours depending on the size of the dataset.**

## Extra Installations

To run this module in full, a few extra Python packages must be installed on top of the usual software. These are:

- pytorch (see https://pytorch.org/ for installation instructions for your computer)
- pytorch geometric (see https://pytorch-geometric.readthedocs.io/en/2.6.1/install/installation.html for installation instructions for your computer)
- dash (if not installed, install by running 'conda install conda-forge::dash')

The cells below also import `torch_scatter`, `dash_cytoscape` and `torchinfo`, which must be available as well.


paper doi: **To Put**

In [None]:
# standard library imports
import pickle
import json
import sys
import time

# scientific python imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

from sklearn.model_selection import StratifiedKFold

import networkx as nx

# metabolinks and "in folder" modules
import metabolinks.transformations as transf
import metanalysis_standard as metsta

import MDiN_functions as md

from tqdm import tqdm

import torch
from torch.nn import Linear, Softmax, Sequential, BatchNorm1d, ReLU, Dropout, LeakyReLU
import torch.nn.functional as F
import torch.nn as nn
from torch_geometric.utils.convert import from_networkx
from torch_geometric.data import Data, Dataset
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GATv2Conv, TAGConv
from torch_geometric.nn import global_mean_pool, global_add_pool, global_max_pool

from torch_geometric.utils import softmax
from torch_scatter import scatter_add

# For Dash App (networkx is already imported above)
import dash_cytoscape as cyto
import dash
from dash import Dash, dcc, html, Input, Output, ctx, callback
import base64
from io import BytesIO

# Report versions
print("PyTorch version", torch.__version__)
print("CUDA version", torch.version.cuda)

print("CUDA devices available", torch.cuda.device_count())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'\nUSING DEVICE {device}\n')

#### Import treated Data from the main notebook.

Select the names of the files exported by the analysis in the main Jupyter notebook.

In [None]:
# Filename for the data to import
filename = 'Export_TreatedData.xlsx'
filename_pickle_treated = 'Export_TreatedData.pickle'
filename_pickle_proc = 'Export_ProcData.pickle'

#treated_data = pd.read_excel(filename, sheet_name='Fully Treated Data').set_index('Unnamed: 0').T
bin_data = pd.read_excel(filename, sheet_name='BinSim Treated Data').set_index('Probable m/z').T
univariate_data = pd.read_excel(filename, sheet_name='MVI+Norm Data').set_index('Unnamed: 0')

processed_data = pd.read_pickle(filename_pickle_proc)
treated_data = pd.read_pickle(filename_pickle_treated)
bin_data.columns = treated_data.columns

sample_cols = treated_data.index

In [None]:
# Filename for the target to import
filename_target = 'Export_Target.txt'

with open(filename_target) as a:
    tg = a.readlines()
target = [t.strip() for t in tg]
classes = pd.unique(pd.Series(target))
target

## Details for the Analysis

In [None]:
n_fold = 3
iter_num = 10

colours = sns.color_palette('tab10', 10) # Set the colors

# Formula-Difference Network (FDiN)

Here, we build the FDiN for the whole dataset.

This network is built from the formulas assigned to the data, either by data annotation or by formula assignment. In the FDiN, the nodes are formulas representing metabolic peaks, and two nodes are linked if the difference between their formulas corresponds to one of the chosen biochemical transformations.

Furthermore, the FDiN can be supplemented with knowledge-based edges by setting the parameter `add_knowledge_based_edges` to True. In that case, a metabolic knowledge network is used that encompasses all reactions between HMDB compounds (represented by their formulas) stored in the **metabolic** pathways of the **SMPDB** database. Each reaction was translated into edges by linking every reactant to every product. See details of the construction of this knowledge-based metabolic network here: **LINK TO PAPER**.

Here, if a reaction in the database links two formulas that are both present in the studied dataset, that edge is also added to our FDiN, even if their difference does not correspond to one of the chosen chemical transformations. Any edge that can be mapped to this knowledge network gets an edge weight of 2, compared to 1 for the other links.
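To make the edge rules above concrete, here is a minimal, self-contained sketch on a toy set of formulas. It is not the notebook's implementation: `parse_formula` and `formula_difference` are hypothetical helpers that only handle simple formulas, the two transformations are a small subset of the real list, and the "knowledge" edge is invented purely for illustration.

```python
import re
from itertools import combinations

def parse_formula(formula):
    """Hypothetical helper: count the elements in a simple formula such as 'C6H12O6'."""
    counts = {}
    for elem, num in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[elem] = counts.get(elem, 0) + (int(num) if num else 1)
    return counts

def formula_difference(f1, f2):
    """Element-wise difference f2 - f1, dropping elements that cancel out."""
    c1, c2 = parse_formula(f1), parse_formula(f2)
    diff = {e: c2.get(e, 0) - c1.get(e, 0) for e in set(c1) | set(c2)}
    return {e: n for e, n in diff.items() if n != 0}

# Toy inputs (illustrative only)
formulas = ['C6H12O6', 'C6H12O7', 'C7H14O6', 'C3H7NO2']
transformations = {'O': parse_formula('O'), 'CH2': parse_formula('CH2')}
knowledge_edges = {frozenset(('C6H12O6', 'C3H7NO2'))}  # invented database reaction

edges = {}
for f1, f2 in combinations(formulas, 2):
    pair = frozenset((f1, f2))
    for name, mdb in transformations.items():
        # A transformation may apply in either direction
        if formula_difference(f1, f2) == mdb or formula_difference(f2, f1) == mdb:
            edges[pair] = {'Transformation': name, 'Weight': 1}
    if pair in knowledge_edges:
        if pair in edges:
            edges[pair]['Weight'] = 2  # existing edge confirmed by the database
        else:
            edges[pair] = {'Transformation': 'Knowledge', 'Weight': 2}

for pair, attrs in edges.items():
    print(sorted(pair), attrs)
```

In this toy case, two edges come from formula differences (weight 1) and one comes only from the invented knowledge network (weight 2).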

## Prepare the Network

We prepared the network to accept the annotations performed in our software; however, it can take others. Two parameters are set: `prioritized_ann_col` and `base_form_assign_col`, which name the column holding the annotation and the column holding the formula assignment, respectively. As a reminder, our annotation assigns all possible compounds to a peak as a list.

In general, formulas from the annotation are prioritized: when a peak has only one annotated formula, it overwrites the one from the formula assignment.
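The prioritization rule can be summarised as a small decision function. The sketch below is a simplified illustration, not the notebook's code: `choose_formula` and its `is_plausible` argument are hypothetical names, and `is_plausible` stands in for the element checks applied in the cell further down.

```python
def choose_formula(annotated, assigned, is_plausible=lambda f: True):
    """Pick the formula used as a network node for one peak.

    annotated    : list of formulas from the annotation, or anything else if absent
    assigned     : formula from the formula assignment (may be None)
    is_plausible : hypothetical filter standing in for the element checks
    """
    if not isinstance(annotated, list):
        return assigned              # no annotation: keep the formula assignment
    unique = sorted(set(annotated))
    if len(unique) == 1:
        return unique[0]             # a single annotated formula overrides the assignment
    if assigned in unique:
        return assigned              # the annotation agrees with the assignment
    plausible = [f for f in unique if is_plausible(f)]
    if len(plausible) == 1:
        return plausible[0]          # only one candidate survives the filter
    return assigned                  # ambiguous case: leave the assignment untouched


print(choose_formula(['C6H12O6', 'C6H12O6'], 'C5H10O5'))
print(choose_formula(['C6H12O6', 'C5H10O5'], 'C5H10O5'))
print(choose_formula(None, 'C5H10O5'))
```

Ambiguous peaks fall through to the formula assignment, which mirrors the fact that the notebook only prints such cases for inspection.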

In [None]:
prioritized_ann_col = 'Matched formulas'
base_form_assign_col = 'Formula_Assignment'

new_col_name = 'Forms_to_use'

add_knowledge_based_edges = True

In [None]:
# Get the list of formulas to build the sFDiN
temp_df = processed_data.copy()
for i in temp_df.index:
    # See formulas of annotated compounds
    fs = temp_df.loc[i, prioritized_ann_col]
    if isinstance(fs, list):
        fs = list(set(fs))
        # If only one annotation, overwrite the formula with it
        if len(fs) == 1:
            temp_df.loc[i, 'Formula_Assignment'] = fs[0]
            # NOTE: the adduct should come from 'Matched HMDB adducts'; the line below stores the first matched formula instead
            temp_df.loc[i, 'Formula_Assignment Adduct'] = temp_df.loc[i, 'Matched formulas'][0]
        # If more than one annotation
        else:
            counted = False
            # If any of the annotated formulas equals the assigned formula, keep the assigned formula
            for f in fs:
                if f == temp_df.loc[i, 'Formula_Assignment']:
                    counted = True
            if not counted:
                new_f = []
                # Keep formulas with at least 1 C and 1 H and without Cl or F (only C, H, O, N, S, P elements)
                for f in fs:
                    a = md.formula_process(f)
                    #print(a)
                    if a['C'] != 0 and a['H'] != 0:
                        if len(a) == 8:
                            if a['Cl'] == 0 and a['F'] == 0:
                                #print('hmm')
                                new_f.append(f)
                # If only 1 formula meets these conditions, overwrite the formula assignment
                if len(new_f) == 1:
                    temp_df.loc[i, 'Formula_Assignment'] = new_f[0]
                # Otherwise (none or several formulas), print these rare exceptions for inspection
                else:
                    print('---')
                    print(len(new_f))
                    print(new_f)
                    print(temp_df.loc[i, 'Formula_Assignment'])
                    print(i)
                    print('---------------')

Put Formulas to build the FDiN in DataFrame format

In [None]:
formula_df = temp_df
# Get the formulas from formula assignment, excluding isotopes
formula_df = formula_df.dropna(subset='Formula_Assignment')
formula_df = formula_df.loc[[i for i in formula_df.index if 'iso.' not in formula_df.loc[i, 'Formula_Assignment']]]
# Add the counts of the different elements in columns
elems = metsta.create_element_counts(formula_df, formula_subset=['Formula_Assignment',], compute_ratios=False,
                                     drop_duplicates=False)
filt_elems = elems.iloc[:,:-1]

Choose the list of Mass-Difference-based Building blocks (MDBs) to use and convert them to DataFrame format.

MDBs are the chemical transformations used to build Mass-Difference Networks. They usually represent some of the most common and ubiquitous reactions in biological systems, but can also be specific to the biological system under study. The 15 listed below are the same as those used in the FDiGNN paper.

MDBs can be added to or removed from the `MDB` list in the cell below, depending on which chemical transformations should be followed.

In [None]:
# Accepted chemical transformations
MDB = ['H2','CH2','CO2','O','CH2O','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O']

# Create MDB list
results = {}
for i in MDB:
    results[i] = md.formula_process(i, elems=filt_elems.columns)
MDB_df = pd.DataFrame(results).T

Read the built metabolic knowledge network and keep only the formula nodes that are present in the dataset

In [None]:
with open('SMPDB_MetaNetwork_general.pickle', 'rb') as f:
    FDiN_knowledge = pickle.load(f)
node_list = list(FDiN_knowledge.nodes())


# See which of these formulas were detected in our dataset
keep_idxs = []
keep_formulas = []
keep_pathways = []
form_to_idx = {} # Dictionary so we know the association between formulas and idxs

for i in temp_df.index:
    counted = False
    # Give priority to annotated formulas
    fs = temp_df.loc[i, 'Matched formulas']
    if type(fs) == list:
        fs = list(set(fs))
        # Only 1 Formula assigned in annotated data
        if len(fs) == 1:
            form = fs[0]
            # If the formula is in the knowledge network
            if form in node_list:
                keep_idxs.append(i)
                keep_formulas.append(form)
                if form in form_to_idx:
                    form_to_idx[form].append(i)
                else:
                    form_to_idx[form] = [i,]
                    keep_pathways.extend(FDiN_knowledge.nodes()[form]['Pathways'])
                counted = True
        # More than 1 Formula assigned in annotated data
        else:
            form_in_node_list = []
            for f in fs:
                if f in node_list:
                    form_in_node_list.append(f)
            if len(form_in_node_list) >= 1:
                for f in form_in_node_list:
                    keep_idxs.append(i)
                    keep_formulas.append(f)
                    if f in form_to_idx:
                        form_to_idx[f].append(i)
                    else:
                        form_to_idx[f] = [i,]
                        keep_pathways.extend(FDiN_knowledge.nodes()[f]['Pathways'])
                counted = True

    # Formula assignment Formulas
    if not counted:
        fs = temp_df.loc[i, 'Formula_Assignment']
        if type(fs) == str:
            if fs in node_list:
                keep_idxs.append(i)
                keep_formulas.append(fs)
                if fs in form_to_idx:
                    form_to_idx[fs].append(i)
                else:
                    form_to_idx[fs] = [i,]
                    keep_pathways.extend(FDiN_knowledge.nodes()[fs]['Pathways'])

# Subgraph the FDiN to only keep these formulas as information for FDiGNN
FDiN_knowledge = FDiN_knowledge.subgraph(keep_formulas)

## Building the FDiN proper

General FDiN

In [None]:
# FDiN basis
FDiN = nx.Graph()
FDiN.add_nodes_from(filt_elems.index) # Each formula is a node
# Adding relevant attributes
nx.set_node_attributes(FDiN, formula_df['Formula_Assignment'].to_dict(), name='Formula')

# Adding simple edges from metabolic transformations
for formula in filt_elems.index:
    poss_formulas = filt_elems.loc[formula] + MDB_df
    for i in poss_formulas.index:
        poss_matches = filt_elems[(filt_elems == poss_formulas.loc[i]).sum(axis=1) == len(MDB_df.columns)]
        for node in poss_matches.index:
            FDiN.add_edge(formula, node, Transformation=i, Weight=1)

if add_knowledge_based_edges:
    # Adding Knowledge-based edges from the metabolic-knowledge based network
    for n1 in FDiN.nodes():
        formA = FDiN.nodes()[n1]['Formula']
        if formA in FDiN_knowledge.nodes():
            for n2 in FDiN.nodes():
                formB = FDiN.nodes()[n2]['Formula']
                if formA != formB:
                    if formB in FDiN_knowledge.nodes():
                        if (formA, formB) in FDiN_knowledge.edges():
                            if (n1, n2) in FDiN.edges():
                                # Change edge weight if the edge already existed to 2
                                FDiN.edges()[(n1, n2)]['Weight'] = 2
                            else:
                                FDiN.add_edge(n1, n2, Transformation='Knowledge', Weight=2)
                        # Confirm nothing is being lost
                        elif (formB, formA) in FDiN_knowledge.edges():
                            print('-------')

print(f'FDiN built with {len(FDiN.edges())} edges in total.')

In [None]:
# See Sizes of components
print('Size of components in the FDiN:')
[len(i) for i in sorted(nx.connected_components(FDiN), key=len, reverse=True)]

In [None]:
# Choose the component size threshold: only components with MORE nodes than this are kept
min_comp_size = 7

# Filter FDiN to only include larger components
comps = []
for i in sorted(nx.connected_components(FDiN), key=len, reverse=True):
    if len(i) > min_comp_size:
        comps.extend(i)
FDiN = FDiN.subgraph(comps)
print(f'Filtered FDiN with: {len(FDiN.edges())} edges.')

In [None]:
# save graph object to file
#pickle.dump(FDiN, open('FDiN_Dataset.pickle', 'wb'))

## Building the sample Formula-Difference Networks

For each sample, copy the FDiN and add the following node features:

- Peak Mass (divided by 100 so it has smaller values)
- (Treated) Intensity
- Feature Occurrence (1 or 0)

Other node features can also be added here.
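As a small illustration of the three node features listed above, the sketch below builds the feature vector of each node for one sample. The peak names, masses and values are toy data, and `node_features` is a hypothetical helper, not a function of the notebook.

```python
# Toy data (illustrative values only)
masses = {'peak_1': 180.0634, 'peak_2': 196.0583}
intensities = {'sample_A': {'peak_1': 0.8, 'peak_2': -1.2}}
occurrence = {'sample_A': {'peak_1': 1, 'peak_2': 0}}

def node_features(sample, peak):
    """Return the [mass, intensity, presence] feature vector of one node in one sample."""
    return [masses[peak] / 100,          # peak mass scaled down to a smaller range
            intensities[sample][peak],   # treated intensity in this sample
            occurrence[sample][peak]]    # 1 if the feature occurs in the sample, else 0

features = {peak: node_features('sample_A', peak) for peak in masses}
print(features)
```

The mass is the same in every sample, so only the intensity and occurrence entries distinguish one sample network from another.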

In [None]:
# Mass Column
mass_col = 'Neutral Mass' if 'Neutral Mass' in processed_data.columns else 'Probable m/z'

In [None]:
sFDiNs_full = {}
for samp in sample_cols:

    sFDiNs_full[samp] = FDiN.copy()

    # Intensity feature
    ints = {i: treated_data.loc[samp, i] for i in formula_df.index}
    # Feature Occurrence feature
    pres = {i: bin_data.loc[samp, i] for i in formula_df.index}

    # Storing intensity of feature in sample, mass and feature occurrence on the nodes
    intensity_attr = dict.fromkeys(sFDiNs_full[samp].nodes(),0)
    for m in sFDiNs_full[samp].nodes():
        intensity_attr[m] = {'mass':formula_df.loc[m,mass_col]/100, 'intensity': ints[m], 'presence':pres[m]}
    nx.set_node_attributes(sFDiNs_full[samp], intensity_attr)

In [None]:
# Description of the sFDiNs built
print('Full sFDiNs')
print('Number of nodes:', len(sFDiNs_full[samp].nodes()))
print('Number of edges:', len(sFDiNs_full[samp].edges()))

### Passing FDiNs to the pytorch architecture

In [None]:
# Node features (the first node attribute is the string 'Formula', which is excluded)
first_node = list(sFDiNs_full[samp].nodes())[0]
node_attrs = list(sFDiNs_full[samp].nodes()[first_node].keys())[1:]

# Edge attributes (the 'Transformation' label is a string, so it is excluded)
first_edge = list(sFDiNs_full[samp].edges())[0]
edge_attrs = list(sFDiNs_full[samp].edges()[first_edge].keys())
edge_attrs.remove('Transformation')

print('Nº of node features:', len(node_attrs))
print('Nº of edge features:', len(edge_attrs))

In [None]:
# Convert the sFDiNs into PyTorch geometric
data_list_full = []
for samp in sFDiNs_full:
    pyg_graph = from_networkx(sFDiNs_full[samp],
                              group_node_attrs=node_attrs,
                              group_edge_attrs=edge_attrs)
    data_list_full.append(pyg_graph.to(device))
dataset = DataLoader(data_list_full)

# The target encoding depends on whether there are 2 classes or more
if len(classes) == 2:
    # Adding target information to the sFDiNs as a class index (first class = 1, other class = 0)
    for g in range(len(target)):
        if target[g] == classes[0]:
            data_list_full[g].y = torch.tensor([1], dtype=torch.long, device=device)
        else:
            data_list_full[g].y = torch.tensor([0], dtype=torch.long, device=device)

else:
    # One-hot encode the target for multiclass problems
    onehot = pd.get_dummies(target, dtype=int)
    for g in range(len(target)):
        data_list_full[g].y = torch.tensor(onehot.iloc[g].values, dtype=torch.long, device=device)

# Setting up the FDiGNN Model

Global attention pooling layer
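As a reading aid for the layer defined in the next cell, its forward pass can be written as follows, where $\mathbf{x}_i$ is the embedding of node $i$, $G$ is the set of nodes of one sample graph, and $\mathbf{w}$, $b$ are the weights and bias of the single linear layer (the symbols are notation introduced here, not names used in the code):

$$
s_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b), \qquad
\alpha_i = \frac{\exp(s_i)}{\sum_{j \in G} \exp(s_j)}, \qquad
\mathbf{h}_G = \sum_{i \in G} \alpha_i \, \mathbf{x}_i
$$

Because the sigmoid keeps every $s_i$ between 0 and 1, the attention weight of one node can be at most $e$ times that of another node in the same graph, so no single node can dominate the graph embedding through the attention weights alone.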

In [None]:
class GlobalAttentionPooling(nn.Module):
    def __init__(self, in_channels):
        super(GlobalAttentionPooling, self).__init__()
        self.attention_nn = nn.Sequential(nn.Linear(in_channels, 1), nn.Sigmoid())
        self.sigmoid = nn.Sigmoid()
        self.last_scores = None
        self.x_weighted = None
        
    def forward(self, x, batch):
        scores = self.attention_nn(x).squeeze(-1)
        scores = softmax(scores, batch)
        x_weighted = x * scores.unsqueeze(-1)
        self.last_scores = scores
        self.x_weighted = x_weighted
        graph_embedding = scatter_add(x_weighted, batch, dim=0)

        return graph_embedding

    def get_attention_scores(self):
        return self.last_scores, self.x_weighted

General Model - 4TAG_AttPool_2Class

In [None]:
# Setting up the model
class FDiGNN_TAG(torch.nn.Module):
    def __init__(self, hidden_channels, drop, n_node_feat, K, n_classes):
        super(FDiGNN_TAG, self).__init__()
        torch.manual_seed(89356)
        self.conv1 = TAGConv(n_node_feat, hidden_channels, K=K)
        self.norm1 = BatchNorm1d(hidden_channels)
        self.conv2 = TAGConv(hidden_channels, hidden_channels, K=K)
        self.norm2 = BatchNorm1d(hidden_channels)
        self.conv3 = TAGConv(hidden_channels, hidden_channels, K=K)
        self.norm3 = BatchNorm1d(hidden_channels)
        self.conv4 = TAGConv(hidden_channels, hidden_channels, K=K)
        self.norm4 = BatchNorm1d(hidden_channels)
        self.pooling = GlobalAttentionPooling(hidden_channels)
        self.lin1 = Linear(hidden_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, n_classes)
        self.drop = drop

        self.leakyrelu1 = nn.LeakyReLU()
        self.leakyrelu2 = nn.LeakyReLU()
        self.leakyrelu3 = nn.LeakyReLU()
        self.leakyrelu4 = nn.LeakyReLU()

    def forward(self, x, edge_index, batch, edge_weight):
        # 1. Obtain node embeddings 
        x1 = self.conv1(x, edge_index, edge_weight=edge_weight)
        x1_relu = self.leakyrelu1(x1)
        x1_norm = self.norm1(x1_relu)
        x1_drop = F.dropout(x1_norm, p=self.drop, training=self.training)

        x2 = self.conv2(x1_drop, edge_index, edge_weight=edge_weight)
        x2_relu = self.leakyrelu2(x2)
        x2_norm = self.norm2(x2_relu)
        x2_drop = F.dropout(x2_norm, p=self.drop, training=self.training)

        x3 = self.conv3(x2_drop, edge_index, edge_weight=edge_weight)
        x3_relu = self.leakyrelu3(x3)
        x3_norm = self.norm3(x3_relu)
        x3_drop = F.dropout(x3_norm, p=self.drop, training=self.training)

        x4 = self.conv4(x3_drop, edge_index, edge_weight=edge_weight)
        x4_relu = self.leakyrelu4(x4)
        x4_norm = self.norm4(x4_relu)
        x4_drop = F.dropout(x4_norm, p=self.drop, training=self.training)

        # 2. Readout layer
        x_emb = self.pooling(x4_drop, batch)

        # 3. Apply a final classifier
        x_emb = F.dropout(x_emb, p=self.drop, training=self.training)
        x_emb = self.lin1(x_emb)
        x_emb = self.lin2(x_emb)
        return x_emb

Training and test functions. Their definitions differ between two-class and multiclass problems.
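In the cell below, the two-class branch passes integer class indices to the loss, whereas the multiclass branch passes one-hot vectors cast to floats. For hard labels, both encodings give the same cross-entropy value. The following plain-Python sketch (with hypothetical helper names, and no PyTorch involved) checks this on one toy sample:

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_index(logits, class_index):
    """Cross-entropy with the target given as a class index."""
    return -math.log(softmax(logits)[class_index])

def cross_entropy_one_hot(logits, one_hot):
    """Cross-entropy with the target given as a one-hot (probability) vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]                                 # toy model output for one sample
loss_a = cross_entropy_index(logits, 0)                   # target as class index
loss_b = cross_entropy_one_hot(logits, [1.0, 0.0, 0.0])   # same target, one-hot
print(loss_a, loss_b)
```

The two branches therefore differ in how the target is shaped and compared with the prediction, not in the quantity being minimised.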

In [None]:
if len(classes) == 2:
    def train(model, train_loader, optimizer):
        model.train()
        losses = []
        grad_norms = []
        criterion = torch.nn.CrossEntropyLoss()
        for data in train_loader:  # Iterate in batches over the training dataset.
            out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float()) # Perform a forward pass.
            
            loss = criterion(out, data.y)  # Compute the loss.
            loss.backward()  # Derive gradients.
            losses.append(loss.to('cpu').detach().numpy())

            optimizer.step()  # Update parameters based on gradients.
            optimizer.zero_grad()  # Clear gradients.
        return np.mean(losses), grad_norms, model

    @torch.no_grad()  # gradients are not needed for evaluation
    def test(model, loader):
        model.eval()

        correct = 0
        losses = []
        criterion = torch.nn.CrossEntropyLoss()
        for data in loader:  # Iterate in batches over the training/test dataset.
            out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float())  
            pred = out.argmax(dim=1)  # Use the class with highest probability.
            correct += int((pred == data.y).sum())  # Check against ground-truth labels.
            loss = criterion(out, data.y)  # Compute the loss.
            losses.append(loss.to('cpu').detach().numpy())
        return (correct / len(loader.dataset), np.mean(losses), out)  # Derive ratio of correct predictions.

else:
    def train(model, train_loader, optimizer):
        model.train()
        losses = []
        grad_norms = []
        criterion = torch.nn.CrossEntropyLoss()
        for data in train_loader:  # Iterate in batches over the training dataset.
            out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float())# Perform a forward pass.
            loss = criterion(out.cpu(), data.y.reshape(data.batch_size, len(classes)).type(
                torch.FloatTensor)) # Compute the loss.
            loss.backward()  # Derive gradients.
            losses.append(loss.to('cpu').detach().numpy())
            optimizer.step()  # Update parameters based on gradients.
            optimizer.zero_grad()  # Clear gradients.
        return np.mean(losses), grad_norms, model

    @torch.no_grad()  # gradients are not needed for evaluation
    def test(model, loader):
        model.eval()

        correct = 0
        losses = []
        criterion = torch.nn.CrossEntropyLoss()
        for data in loader:  # Iterate in batches over the training/test dataset.
            out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float())  
            pred = out.argmax(dim=1)  # Use the class with highest probability.
            correct += int((pred.cpu() == data.y.reshape(data.batch_size, len(classes)).type(torch.FloatTensor).argmax(
                dim=1)).sum())
            loss = criterion(out.cpu(), data.y.reshape(data.batch_size, len(classes)).type(
                torch.FloatTensor))  # Compute the loss.
            losses.append(loss.to('cpu').detach().numpy())
        return (correct / len(loader.dataset), np.mean(losses), out)  # Derive ratio of correct predictions.

# Model Parameters

The chosen parameters are used both to estimate model performance and to fit the model from which metabolite importance is extracted.

#### Model parameters to choose:

- `n_epochs` - number of epochs to train the model
- `hidden_channels` - number of hidden channels/nodes in the TAG layers
- `drop` - dropout probability of the dropout layers
- `K` - number of hops in the TAG layers
- `lr` - learning rate of the optimizer
- `weight_decay` - L2 regularization
- `gamma` - rate of exponential decay of the learning rate over the epochs
- `batch_size` - number of samples in each batch
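To show how `lr` and `gamma` interact, the sketch below computes the learning rate used in each epoch when it is multiplied by `gamma` after every epoch, which is what an exponential scheduler stepped once per epoch does. `exponential_lr` is a hypothetical helper written for illustration only.

```python
def exponential_lr(initial_lr, gamma, n_epochs):
    """Learning rate used in each epoch when it is multiplied by gamma after every epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(n_epochs)]

constant = exponential_lr(0.005, 1, 3)     # gamma = 1: the learning rate never changes
decayed = exponential_lr(0.001, 0.99, 3)   # gamma < 1: the learning rate shrinks every epoch
print(constant)
print(decayed)
```

With `gamma = 1` the scheduler has no effect, so the learning rate stays at `lr` for the whole training run.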

## Estimating Model Performance

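Model performance is estimated as the average over `iter_num` iterations of stratified `n_fold`-fold cross-validation, each iteration using a different split. Both parameters were chosen at the beginning of the notebook.

**This can take some time; set `see_model_performance` to False to skip.**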
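The way fold results are combined can be sketched as follows, assuming toy accuracies and fold sizes at a single epoch: each fold's accuracy is weighted by its number of test samples, giving one accuracy per iteration, and the spread is the standard deviation across iterations. The variable and function names here are illustrative only.

```python
from statistics import stdev

# Toy results: for each iteration, (test accuracy, test fold size) per fold
results = {
    0: [(0.80, 10), (0.90, 10), (1.00, 10)],
    1: [(0.70, 10), (0.90, 10), (0.80, 10)],
}

def iteration_accuracy(folds):
    """Accuracy over all samples of one cross-validation iteration."""
    n_correct = sum(acc * size for acc, size in folds)
    n_samples = sum(size for _, size in folds)
    return n_correct / n_samples

per_iteration = [iteration_accuracy(folds) for folds in results.values()]
mean_accuracy = sum(per_iteration) / len(per_iteration)
spread = stdev(per_iteration)  # sample standard deviation across iterations
print(per_iteration, mean_accuracy, spread)
```

Weighting by fold size means that every sample counts once per iteration, even when the folds are not exactly the same size.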

In [None]:
# Define model and training parameters
n_epochs = 120 # 200
hidden_channels = 64 # 128
drop = 0.15 # 0.3
K = 3 # 1
lr = 0.005 # 0.001
weight_decay = 0.0001 # 0
gamma = 1 # 0.99
batch_size = 32

In [None]:
# Put True if you want to see model performance, else put False
see_model_performance = True

In [None]:
# Set random seed if wanted
random_s = 174
np.random.seed(random_s)

if see_model_performance:
    
    # Setting parameters
    max_epochs = n_epochs

    # Setting up store results
    loss_dict = {}
    train_accuracy_dict = {}
    test_accuracy_dict = {}
    preds = 0
    scores_dict = {n: 0 for n in range(max_epochs)}
    accu_dict = {}

    # For each repetition
    for r in range(iter_num):
        print('Iteration nº', r)
        skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=random_s*(r+1))
        loss_dict[r] = {}
        train_accuracy_dict[r] = {}
        test_accuracy_dict[r] = {}
        correct = {n: 0 for n in range(max_epochs)}

        # Each fold
        for i, (train_index, test_index) in enumerate(skf.split(treated_data, target)):
            print('Fold nº', i)

            train_loader = DataLoader(pd.Series(data_list_full)[train_index].values, batch_size=batch_size, shuffle=True)
            test_loader = DataLoader(pd.Series(data_list_full)[test_index].values, batch_size=batch_size, shuffle=False)

            # Setting up the models
            model = FDiGNN_TAG(hidden_channels=hidden_channels, drop=drop, n_node_feat=len(node_attrs),
                               K=K, n_classes=len(classes)).to(device)
            optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
            scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
            criterion = torch.nn.CrossEntropyLoss()

            # Temporary store lists
            loss_list = []
            train_accuracy_list = []
            test_accuracy_list = []

            # Train the model
            for epoch in range(1, max_epochs+1):
                model.float()
                loss, g_norm, _ = train(model, train_loader, optimizer)
                train_acc, _, _ = test(model, train_loader)
                test_acc, _, _ = test(model, test_loader)
                if epoch%10 == 0:
                    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}, Learning Rate:{scheduler.get_last_lr()}')
                loss_list.append(loss)
                train_accuracy_list.append(train_acc)
                test_accuracy_list.append(test_acc)
                scheduler.step()

            # Store results
            loss_dict[r][i] = loss_list
            train_accuracy_dict[r][i] = train_accuracy_list
            test_accuracy_dict[r][i] = test_accuracy_list

            preds += len(test_index)
            for n in scores_dict:
                correct[n] = correct[n] + test_accuracy_dict[r][i][n]*len(test_index)
                scores_dict[n] = scores_dict[n] + test_accuracy_dict[r][i][n]*len(test_index)
        
        accu_dict[r] = pd.Series(correct)/len(target)

In [None]:
# Report results
if see_model_performance:
    print(f'Number of parameters: {sum(p.numel() for p in model.parameters())}.')
    print('Max Epoch', (pd.Series(scores_dict)/preds).argmax()+1)
    print('Max Acc.', (pd.Series(scores_dict)/preds).max())
    print('-----')
    print('Final Acc.', (pd.Series(scores_dict)/preds).iloc[-1], '+-', pd.DataFrame(accu_dict).std(axis=1).iloc[-1])

In [None]:
# See all accs
#pd.Series(scores_dict)/preds

In [None]:
# See all accs std
#pd.DataFrame(accu_dict).std(axis=1)

## Fitting the Model with All Samples

Fit the model with all samples. Metabolite importance is then extracted from this model.

In [None]:
np.random.seed(174)

# Setting parameters
max_epochs = n_epochs

classes = pd.unique(pd.Series(target))  # wrap the list in a Series, as done when the target was imported

print('Starting model fitting.')

# Train the model
train_loader = DataLoader(pd.Series(data_list_full).values, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(pd.Series(data_list_full).values, batch_size=batch_size, shuffle=False)

# Setting up the models
model = FDiGNN_TAG(hidden_channels=hidden_channels, drop=drop, n_node_feat=len(node_attrs), K=K,
                   n_classes=len(classes)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
criterion = torch.nn.CrossEntropyLoss()

# Model Training
for epoch in range(1, max_epochs+1):
    model.float()
    loss, g_norm, _ = train(model, train_loader, optimizer)
    train_acc, _, _ = test(model, train_loader)
    test_acc, _, _ = test(model, test_loader)
    if epoch%10 == 0:
        print(f'''Epoch: {epoch:03d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f},
              Learning Rate:{scheduler.get_last_lr()}''')
    scheduler.step()

In [None]:
# Summary of the architecture and parameters of the fitted model
from torchinfo import summary
summary(model)

## Metabolite / Node Importance

Metabolite importance is estimated from the impact that changes in the features of the corresponding node have on the prediction probabilities of the different samples. Broadly speaking, the metabolite importance calculation is as follows:

1) Get the model's prediction probabilities for each class for every unchanged sample.

2) For each node, change its intensity and feature occurrence to low, intermediate and high values in all samples. These values are defined by the quantiles of that node/metabolite across all samples. Default quantiles are 0.05, 0.50 and 0.95.

3) Get the model's prediction probabilities for the changed samples at each quantile. Then, for each sample, calculate the absolute change in prediction probability relative to the unchanged sample for each chosen quantile. The highest absolute change is stored.

4) This gives, for each sample and each node, how much the class predictions change when the node features are changed. Since the predictions of some samples are easier to change than others, the values are normalized by dividing them by the sum of the prediction changes over all nodes for that sample. Thus, all samples contribute equally to the calculation of metabolite importance.

5) Finally, for each node, the median of the normalized prediction changes over all samples is taken as the importance score of that node/metabolite.

**Note: After this, we look at the top 20 most important metabolites. Check the scores of the first few metabolites to see whether the top one is orders of magnitude higher than the others. If it is, overfitting may have occurred and the model parameters might need to be changed. If it is not, the model is focusing on multiple nodes, which is what we want.**
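Steps 3 to 5 can be sketched on toy numbers as shown below. This is a simplified illustration rather than the notebook's implementation: it assumes that a single absolute prediction change per sample, node and quantile is already available, and all names and values are hypothetical.

```python
from statistics import median

# changes[sample][node] = absolute prediction changes, one per quantile (toy values)
changes = {
    's1': {'n1': [0.10, 0.02, 0.30], 'n2': [0.05, 0.10, 0.00]},
    's2': {'n1': [0.01, 0.00, 0.03], 'n2': [0.01, 0.00, 0.00]},
}

importance_per_sample = {}
for sample, node_changes in changes.items():
    # Step 3: keep the largest change over the tested quantiles
    max_change = {node: max(vals) for node, vals in node_changes.items()}
    # Step 4: normalise so that the values of every sample sum to 1
    total = sum(max_change.values())
    importance_per_sample[sample] = {node: v / total for node, v in max_change.items()}

# Step 5: the median over samples gives the score of each node
node_names = list(changes['s1'])
scores = {node: median(importance_per_sample[s][node] for s in changes)
          for node in node_names}
print(scores)
```

Sample `s2` reacts ten times less strongly than `s1` in absolute terms, yet after normalisation both samples contribute the same relative importances, which is the purpose of step 4.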

In [None]:
# Store every prediction made - can take up a lot of memory
store_all_preds = False

effect = {} # Store the prediction impact
all_preds ={} # Store all predictions

# Normal Predictions in unchanged samples
out_normal = pd.DataFrame()
for data in DataLoader(data_list_full, batch_size=32, shuffle=False):  # Iterate in batches over the training/test dataset.
    model.eval()
    out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float())
    out = F.softmax(out, 1)
    out_normal = pd.concat((out_normal, pd.DataFrame(out.detach().cpu().numpy())))

if store_all_preds:
    all_preds['Normal'] = out_normal.reset_index().iloc[:, 1:].to_dict()

# For each metabolite/node
for i in tqdm(range(len(FDiN.nodes()))):
    # Get node index, unchanged values and see the quantile values
    node = list(FDiN.nodes())[i]
    original_values = treated_data.loc[:, node].copy().values
    original_feature_values = bin_data.loc[:, node].copy().values
    q_values = [0.05, 0.5, 0.95]

    # Set the stores
    if store_all_preds:
        all_preds[node] = {}
    effect[i] = pd.DataFrame(columns=q_values)
    # Get the quantiles
    quantile_values = np.quantile(original_values, q=q_values)
    quantile_feature_values = np.quantile(original_feature_values, q=q_values)
    
    for q in range(len(quantile_values)):
        # Change all samples for the current quantile value
        for g in range(len(data_list_full)):
            data_list_full[g].x[i, -2] = quantile_values[q]
            # Feature occurrence is binary: keep 0 if the quantile is 0, otherwise set it to 1
            data_list_full[g].x[i, -1] = 0 if quantile_feature_values[q] == 0 else 1

        # See and store model predictions of changed samples
        out_shuffled = pd.DataFrame()
        for data in DataLoader(data_list_full, batch_size=32, shuffle=False):  # Iterate in batches over the training/test dataset.
            model.eval()
            out = model(data.x.float(), data.edge_index, data.batch, data.edge_attr.float())
            out = F.softmax(out, 1)
            out_shuffled = pd.concat((out_shuffled, pd.DataFrame(out.detach().cpu().numpy())))
        if store_all_preds:
            all_preds[node][q] = out_shuffled.reset_index().iloc[:, 1:].to_dict()

        # Calculate the absolute prediction differences and store them
        results = pd.DataFrame((out_normal.values - out_shuffled.values)).abs()
        effect[i][q_values[q]] = results.max(axis=1)

    # Restore values
    for g in range(len(data_list_full)):
        data_list_full[g].x[i, -2] = original_values[g]
        data_list_full[g].x[i, -1] = original_feature_values[g]


In [None]:
# Join the results from the different nodes, quantiles and samples
new_df = pd.DataFrame(columns=range(len(sample_cols)))
for node in effect.keys():
    # Get the maximum values for each quantile for each node/sample pair
    new_df.loc[node] = effect[node].max(axis=1).values
new_df.index = list(FDiN.nodes())

# Normalize all predictions changes by sample - make each sample have the same weight for node importance calculation
new_df = (new_df/new_df.sum()).replace({np.nan:0})
global_effect = new_df.T.apply(
    lambda x: x.sort_values(ascending=False).values).T

# Get the median of all normalized prediction changes across the samples for the final importance score
global_effect = global_effect.median(axis=1).sort_values(ascending=False)

In [None]:
# See the top 20 nodes and their importance scores
global_effect.head(20)

# Analysing FDiGNN Important Metabolites

Select the top % of metabolites that will be considered important. This can change with dataset size: the larger the dataset, the smaller the threshold should be so that the results remain interpretable.

Furthermore, select the minimum requirements for a pathway to be considered.
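
As a minimal sketch of how a fractional threshold translates into a number of metabolites, the example below selects the top fraction of a toy score dictionary. The names `top_fraction` and `toy_scores` are hypothetical and only serve the illustration. Note that the count is truncated with `int`, as is done later in the notebook.

```python
def top_fraction(scores, fraction):
    """Return the names of the top `fraction` of items, highest score first."""
    n_top = int(fraction * len(scores))  # truncates, e.g. 0.10 * 25 -> 2
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]

# Toy importance scores for 20 hypothetical metabolites, M0 being the most important
toy_scores = {f"M{i}": 1.0 / (i + 1) for i in range(20)}
top = top_fraction(toy_scores, 0.10)
print(top)
```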

In [None]:
# Base selection of top metabolites to see (in % - 0.1 is the top 10% of metabolites)
base_threshold_of_importance = 0.10

# Choose the minimum number of metabolites detected associated to a pathway to consider that pathway
n_min_nodes_path = 3
# Choose the minimum number of edges between detected metabolites associated to a pathway to consider that pathway
n_min_edges_path = 1

In [None]:
# Ranking the importances
all_ranks=pd.DataFrame(global_effect).rank(ascending=False)

## Pathway Enrichment Analysis

- Background Set - Number of Metabolites detected in the FDiN with at least one associated pathway
- Pathway Background Set - Number of Metabolites detected in the FDiN in each pathway
- 'Significant' Metabolites - Number of Metabolites in the top ranks considered with at least one associated pathway
- 'Significant' Metabolites in Pathway - Number of Metabolites in the top ranks considered in each pathway

'Significant' Metabolites are those within the top % of FDiGNN importance ranks, with the % defined by the parameter `base_threshold_of_importance` (e.g. 0.10 is the top 10%).

Very similar pathways repeated many times were merged to a single pathway:

- 22657 pathway names starting with 'De Novo Triacylglycerol Biosynthesis' merged to 'De Novo Triacylglycerol Biosynthesis'
- 923 pathway names starting with 'Phosphatidylethanolamine Biosynthesis' merged to 'Phosphatidylethanolamine Biosynthesis'
- 923 pathway names starting with 'Phosphatidylcholine Biosynthesis' merged to 'Phosphatidylcholine Biosynthesis'
- 3278 pathway names starting with 'Cardiolipin Biosynthesis' merged to 'Cardiolipin Biosynthesis'
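
The merging rule can be sketched with a small helper. This is an illustrative rewrite rather than the code used below: `merge_pathway_names` is a hypothetical name, the suffixes in the example are made up, and unlike the cell below it also removes repeats of pathway names that are not merged.

```python
MERGED_PREFIXES = (
    'De Novo Triacylglycerol Biosynthesis',
    'Phosphatidylethanolamine Biosynthesis',
    'Phosphatidylcholine Biosynthesis',
    'Cardiolipin Biosynthesis',
)

def merge_pathway_names(pathways, prefixes=MERGED_PREFIXES):
    """Collapse pathway names sharing a known prefix into that prefix, keeping order."""
    merged = []
    for name in pathways:
        for prefix in prefixes:
            if name.startswith(prefix):
                name = prefix
                break
        if name not in merged:
            merged.append(name)
    return merged

# Hypothetical pathway labels for one node
example = [
    'De Novo Triacylglycerol Biosynthesis TG(16:0/16:0/18:1)',
    'De Novo Triacylglycerol Biosynthesis TG(16:0/18:1/18:2)',
    'Glycolysis',
    'Cardiolipin Biosynthesis CL(16:0/16:0/18:1/18:1)',
]
print(merge_pathway_names(example))
```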

In [None]:
for node in FDiN_knowledge.nodes():
    n_p = []
    for p in FDiN_knowledge.nodes()[node]['Pathways']:
        if p.startswith('De Novo Triacylglycerol Biosynthesis'):
            if 'De Novo Triacylglycerol Biosynthesis' not in n_p:
                n_p.append('De Novo Triacylglycerol Biosynthesis')
        elif p.startswith('Phosphatidylethanolamine Biosynthesis'):
            if 'Phosphatidylethanolamine Biosynthesis' not in n_p:
                n_p.append('Phosphatidylethanolamine Biosynthesis')
        elif p.startswith('Phosphatidylcholine Biosynthesis'):
            if 'Phosphatidylcholine Biosynthesis' not in n_p:
                n_p.append('Phosphatidylcholine Biosynthesis')
        elif p.startswith('Cardiolipin Biosynthesis'):
            if 'Cardiolipin Biosynthesis' not in n_p:
                n_p.append('Cardiolipin Biosynthesis')
        else:
            n_p.append(p)
    FDiN_knowledge.nodes()[node]['Pathways'] = n_p

for edge in FDiN_knowledge.edges():
    n_p = []
    for p in FDiN_knowledge.edges()[edge]['Pathways']:
        if p.startswith('De Novo Triacylglycerol Biosynthesis'):
            if 'De Novo Triacylglycerol Biosynthesis' not in n_p:
                n_p.append('De Novo Triacylglycerol Biosynthesis')
        elif p.startswith('Phosphatidylethanolamine Biosynthesis'):
            if 'Phosphatidylethanolamine Biosynthesis' not in n_p:
                n_p.append('Phosphatidylethanolamine Biosynthesis')
        elif p.startswith('Phosphatidylcholine Biosynthesis'):
            if 'Phosphatidylcholine Biosynthesis' not in n_p:
                n_p.append('Phosphatidylcholine Biosynthesis')
        elif p.startswith('Cardiolipin Biosynthesis'):
            if 'Cardiolipin Biosynthesis' not in n_p:
                n_p.append('Cardiolipin Biosynthesis')
        else:
            n_p.append(p)
    FDiN_knowledge.edges()[edge]['Pathways'] = n_p

Use the metabolic knowledge network to associate metabolic pathways to nodes (if the formula is in the pathway) and to edges (if both connecting formulas are in the pathway)

In [None]:
# For Nodes
# Formulas in the metabolic knowledge network and in the FDiN
formulas_in_knowledge = list(FDiN_knowledge.nodes())
# Set up stores
pathway_ranks = {} # Stores only the ranks of the nodes related to the pathway
pathway_series = {} # Stores the nodes related to the pathway and their ranks

# For each node
for node in FDiN.nodes():
    curr_formula = FDiN.nodes()[node]['Formula']
    # See the annotation to add as an attribute
    if isinstance(processed_data.loc[node, 'Matched names'], list):
        FDiN.nodes()[node]['Name'] = processed_data.loc[node, 'Matched names']
        FDiN.nodes()[node]['HMDB_ID'] = processed_data.loc[node, 'Matched IDs']
    else:
        FDiN.nodes()[node]['Name'] = 'None'
        FDiN.nodes()[node]['HMDB_ID'] = 'None'

    # If the formula is in the metabolic network, add the Pathway related information
    if curr_formula in formulas_in_knowledge:
        FDiN.nodes()[node]['Names in Pathways'] = FDiN_knowledge.nodes()[curr_formula]['Names']
        FDiN.nodes()[node]['HMDB_ID in Pathways'] = FDiN_knowledge.nodes()[curr_formula]['HMDB_ID']
        FDiN.nodes()[node]['Pathways'] = FDiN_knowledge.nodes()[curr_formula]['Pathways']
        FDiN.nodes()[node]['SMPDB_IDs'] = FDiN_knowledge.nodes()[curr_formula]['SMPDB_IDs']

        # And add the FDiGNN metabolite importance rank to the pathway stores
        for p in FDiN_knowledge.nodes()[curr_formula]['Pathways']:
            if p in pathway_ranks:
                pathway_ranks[p].append(all_ranks.loc[node, 0])
                pathway_series[p].loc[node] = all_ranks.loc[node, 0]
            else:
                pathway_ranks[p] = [all_ranks.loc[node, 0],]
                pathway_series[p] = pd.Series(dtype=float)  # explicit dtype keeps the ranks numeric
                pathway_series[p].loc[node] = all_ranks.loc[node, 0]

    # If not associated to a metabolic pathway
    else:
        FDiN.nodes()[node]['Names in Pathways'] = 'None'
        FDiN.nodes()[node]['HMDB_ID in Pathways'] = 'None'
        FDiN.nodes()[node]['Pathways'] = 'None'
        FDiN.nodes()[node]['SMPDB_IDs'] = 'None'

In [None]:
# For Edges - Adding pathway information
for edge in FDiN.edges():
    formA = FDiN.nodes()[edge[0]]['Formula']
    formB = FDiN.nodes()[edge[1]]['Formula']
    if (formA, formB) in FDiN_knowledge.edges():
        FDiN.edges()[edge]['Pathways'] = FDiN_knowledge.edges()[(formA, formB)]['Pathways']
        FDiN.edges()[edge]['SMPDB_IDs'] = FDiN_knowledge.edges()[(formA, formB)]['SMPDB_IDs']
    else:
        FDiN.edges()[edge]['Pathways'] = 'None'
        FDiN.edges()[edge]['SMPDB_IDs'] = 'None'

Re-rank pathway associated nodes to skip the non-pathway associated nodes
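
A toy illustration of this re-ranking, assuming a hypothetical dictionary of global ranks and a list of the nodes that have pathways: the ranks of the pathway-associated nodes are compressed to 1, 2, 3, ... while keeping their relative order. Tied ranks share the same new rank in this sketch.

```python
def rerank(global_ranks, nodes_with_pathways):
    """Map the global ranks of pathway-associated nodes to consecutive ranks."""
    kept = sorted(global_ranks[n] for n in nodes_with_pathways)
    # dict.fromkeys removes repeated ranks while keeping the sorted order
    mapping = {rank: new for new, rank in enumerate(dict.fromkeys(kept), start=1)}
    return {n: mapping[global_ranks[n]] for n in nodes_with_pathways}

global_ranks = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
short_ranks = rerank(global_ranks, ['B', 'D', 'E'])
print(short_ranks)
```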

In [None]:
# Get only the nodes with associated pathways and their ranks
# Then make a dict to re-order those ranks skipping nodes without associated pathways
s = pd.DataFrame(pathway_series).values.flatten()
s = pd.unique(pd.Series(s[~np.isnan(s)]).sort_values())
path_ranks_short = dict(zip(s,range(1, len(s)+1)))

# Use the dictionary made to get the ranks per metabolic pathway after this association
for i in pathway_ranks:
    pathway_ranks[i].sort()
for p in pathway_ranks:
    pathway_ranks[p] = [path_ranks_short[i] for i in pathway_ranks[p]]
#pathway_ranks

Limit the pathways considered based on the parameters defined before (number of nodes and edges associated to that pathway)
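
The two filters can be sketched with plain counters. The example assumes hypothetical dictionaries mapping nodes and edges to lists of pathway names (an empty list standing in for the `'None'` placeholder used in the notebook) and keeps a pathway only if it passes both minimums.

```python
from collections import Counter

def filter_pathways(node_pathways, edge_pathways, min_nodes=3, min_edges=1):
    """Keep pathways with enough associated nodes and enough associated edges."""
    node_counts = Counter(p for paths in node_pathways.values() for p in paths)
    edge_counts = Counter(p for paths in edge_pathways.values() for p in paths)
    # A Counter returns 0 for pathways that never appear on an edge
    return {p: c for p, c in node_counts.items()
            if c >= min_nodes and edge_counts[p] >= min_edges}

node_pathways = {'n1': ['Glycolysis', 'TCA'], 'n2': ['Glycolysis'],
                 'n3': ['Glycolysis', 'TCA'], 'n4': ['TCA']}
edge_pathways = {('n1', 'n2'): ['Glycolysis'], ('n3', 'n4'): []}
kept = filter_pathways(node_pathways, edge_pathways)
print(kept)
```

Here both pathways have three nodes, but only one of them has an edge between detected metabolites, so only that one is retained.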

In [None]:
# Limit the pathways considered based on the minimum number of nodes associated with it
paths = pd.Series(nx.get_node_attributes(FDiN, 'Pathways')).explode().value_counts()
# Remove the placeholder for nodes without pathways (ignored if it is absent)
paths = paths.drop('None', errors='ignore')
paths = paths[paths>=n_min_nodes_path]

# Limit the pathways considered based on the minimum number of edges associated with it
paths_edges = pd.Series(nx.get_edge_attributes(FDiN, 'Pathways')).explode().value_counts()
paths_edges = paths_edges.drop('None', errors='ignore')
paths_edges = paths_edges[paths_edges>=n_min_edges_path]

# Combine the two filters to obtain the total pathways considered
paths = paths[[p for p in paths.index if p in paths_edges.index]]
paths_edges = paths_edges[paths.index]

#### Pathway Enrichment Analysis

- Background Set - Number of Metabolites detected in the FDiN with at least one associated pathway
- Pathway Background Set - Number of Metabolites detected in the FDiN in each pathway
- 'Significant' Metabolites - Number of Metabolites in the top ranks considered with at least one associated pathway
- 'Significant' Metabolites in Pathway - Number of Metabolites in the top ranks considered in each pathway
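
With these four quantities, the over-representation probability is the upper tail of a hypergeometric distribution. The sketch below is a pure-Python version of that tail, written only for illustration (the notebook itself uses `scipy.stats.hypergeom`). The argument names follow the SciPy convention: `M` is the Background Set, `n` the Pathway Background Set, `N` the 'Significant' Metabolites and `k` the 'Significant' Metabolites in Pathway. The numbers in the example are made up.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items without replacement from M items, n of them in the pathway."""
    upper = min(n, N)
    total = comb(M, N)
    favourable = sum(comb(n, x) * comb(M - n, N - x) for x in range(k, upper + 1))
    return favourable / total

# Toy numbers: 100 metabolites with pathways, 10 of them in the pathway,
# 20 'significant' metabolites, 5 of which belong to the pathway
p = hypergeom_sf(k=5, M=100, n=10, N=20)
print(p)
```

A small probability indicates that finding this many top-ranked metabolites in the pathway would be unlikely if the top ranks were drawn at random from the background set.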

In [None]:
# 'Significant' Metabolites - Number of Metabolites in the top ranks considered with at least one associated pathway
top_ranks_considered = (
    np.array(list(path_ranks_short.keys())) <= int(base_threshold_of_importance * len(FDiN.nodes()))).sum()
annotated_metabolites_w_path = top_ranks_considered

# Background Set - Number of Metabolites detected in the FDiN with at least one associated pathway
total_metabolites_with_pathways = len(path_ranks_short)

# Get the counts of metabolites of each pathway
path_assign_hmdb = paths

# Preparing DF
over_representation_df = pd.DataFrame(columns=[
    'Nº of Met. in Dataset', 'Nº of Met. in Pathway', '% of Met. In Set', 'Probability'], dtype='object')

for pathway in tqdm(path_assign_hmdb.index):
    # Pathway Name
    p_name = pathway
    # Pathway Background Set - Number of Metabolites detected in the FDiN in each pathway
    total_met_in_path = path_assign_hmdb.loc[pathway]
    # 'Significant' Metabolites in Pathway - Number of Metabolites in the top ranks considered in each pathway
    ann_met_in_path = (np.array(pathway_ranks[pathway]) <= top_ranks_considered).sum()

    # Calculating probability to find ann_met_in_path or more metabolites in annotated_metabolites_w_path
    prob = stats.hypergeom(M=total_metabolites_with_pathways, 
                            n=total_met_in_path, 
                            N=annotated_metabolites_w_path).sf(ann_met_in_path-1)

    # Adding the line to the DF
    over_representation_df.loc[pathway] = [ann_met_in_path, total_met_in_path,
                                           ann_met_in_path/total_met_in_path*100, prob]

# Sorting DataFrame from least probable to more probable and adding adjusted probability
# with Benjamini-Hochberg multiple test correction
over_representation_df_nodes = over_representation_df.sort_values(by='Probability')
over_representation_df_nodes['Adjusted (BH) Probability'] = metsta.p_adjust_bh(over_representation_df_nodes['Probability'])

In [None]:
# Results from the Over-Representation Analysis
over_representation_df_nodes.iloc[:, 1:]
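
For reference, the Benjamini-Hochberg adjustment can be sketched as follows. This is the standard step-up procedure written in plain Python under the assumption that `metsta.p_adjust_bh` follows the same definition; it is not a copy of that function.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest to the smallest p-value to enforce monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(adj)
```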

## Dash App For Highlighting Important Network Areas

In [None]:
# Get the list of sorted importances to use to map
sorted_imps = (global_effect).rank(ascending=False).sort_values()
sorted_imps

In [None]:
# Load Extra layouts for saving svg files
cyto.load_extra_layouts()

# Transform FDiN into Cytoscape data
cs_data = nx.cytoscape_data(FDiN)
elements = cs_data["elements"]["nodes"] + cs_data["elements"]["edges"]

# Make a 'graph' to act as node legend
graph_legend = nx.Graph()
graph_legend.add_nodes_from(['Pathway Node','Rel. Near Pathway','Non-Pathway Node'])
nx.set_node_attributes(graph_legend, {'Pathway Node':{'color':'Red'},
                                       'Rel. Near Pathway':{'color':'lightcoral'},
                                       'Non-Pathway Node':{'color':'lightgrey'}})
cs_legend = nx.cytoscape_data(graph_legend)
for n, p in zip(cs_legend["elements"]["nodes"], range(0,250,50)):
    n["position"] = {"x": -200, "y": p}
elements_legend = cs_legend["elements"]["nodes"] + cs_legend["elements"]["edges"]

# Get an initial order of pathways
rel_paths = over_representation_df_nodes.index

# Get specific positions for preset layout of the network
pos = nx.kamada_kawai_layout(FDiN)
for n1, p in zip(cs_data["elements"]["nodes"], pos.values()):
    n1["position"] = {"x": p[0] * 1700, "y": p[1] * 1700}

# Initiate the App
app = Dash()

# Dropdown menu for network layout
dropdown_layout = dcc.Dropdown(
    id='dropdown-update-layout',
    value='preset',
    clearable=False,
    options=[
        {'label': name.capitalize(), 'value': name}
        for name in ['preset', 'grid', 'random', 'circle', 'cose', 'concentric']
    ]
)

# Dropdown menu for pathway chosen
dropdown_layout2 = dcc.Dropdown(
    id='dropdown-update-layout2',
    value='None',
    clearable=False,
    options=[
        {'label': name.capitalize(), 'value': name}
        for name in ['None',] + list(rel_paths)
    ]
)

# Slider to select the threshold to use to consider important metabolites
top_mets_chosen = dcc.Slider(0, 0.5, 0.005, value=0.05, id='top_mets_chosen',
                            marks={0: {'label': '0'}, 0.02: {'label': '0.02'}, 0.05: {'label': '0.05'}, 
                                   0.1: {'label': '0.10'}, 0.2: {'label': '0.20'}, 0.5: {'label': '0.50'} })

# Organize the APP layout
app.layout = html.Div([html.Div([# Initial Options
                       dcc.Markdown('Graph Layout:',),
                       dropdown_layout,
                       dcc.Markdown('Pathway Highlighted:',),
                       dropdown_layout2,
                       dcc.Markdown('Nº of Important Metabolites:',),
                       top_mets_chosen,
    # Put the Network
    cyto.Cytoscape(
        id='cytoscape',
        elements=elements,
        style={'width': '100%', 'height': '800px'},
        layout={
            'name': 'preset'
        },
         stylesheet=[
                    {'selector': 'node',
                'style': {'label': 'data(cname)',
                    'background-color': 'data(color)',
                    'shape': 'circle'}},
                     {'selector': 'edge',
                'style': {
                    'line-color': 'data(color)'}}]
    )], style={'flex': 2}),
                    html.Div([# Right part of App - Information
        # Info when clicking a node
        html.P(id='cytoscape-tapNodeData-output'), # General Info
        html.Img(id='bar-graph-matplotlib'), # Bar Plot
        html.P(id='node-presence-output'), # Feature Occurrence
        # Information when clicking an edge
        html.P(id='cytoscape-tapEdgeData-output'),
        # Information on the highlighted network section
        dcc.Markdown('### Information',),
        html.P(id='cytoscape-selected-area-output'),
        # Buttons to save figure
        html.Div('Download graph:'),
        html.Button("as jpg", id="btn-get-jpg"),
        html.Button("as png", id="btn-get-png"),
        html.Button("as svg", id="btn-get-svg")], style={'flex': 1, 'padding': 10})], 
                      
    style={'display': 'flex', 'flexDirection': 'row'}
)


## Functions to respond to changes in the app

@callback(Output('cytoscape', 'stylesheet'),
          Output('cytoscape', 'elements'),
          Output('cytoscape-selected-area-output', 'children'),
          Output(component_id='dropdown-update-layout2', component_property='options'),
              Input('top_mets_chosen', 'value'),
              Input('dropdown-update-layout2', 'value'))
def update_layout(top_chosen, pathway):
    "Updates layout based on the threshold and pathway chosen"

    ### Top chosen metabolites
    top_chosen = int(top_chosen*len(FDiN.nodes()))
    imp_mzs = list(sorted_imps.index[:top_chosen])

    ### Pathway enrichment Analysis as before
    top_ranks_considered = (np.array(list(path_ranks_short.keys())) <= top_chosen).sum()

    # Background Set - Number of Metabolites detected in the FDiN with at least one associated pathway
    total_metabolites_with_pathways = len(path_ranks_short)

    # Get the counts of metabolites of each pathway
    path_assign_hmdb = paths

    # 'Significant' Metabolites - Number of Metabolites in the top ranks considered with at least one associated pathway
    annotated_metabolites_w_path = top_ranks_considered

    # Preparing DF
    over_representation_df = pd.DataFrame(columns=['Pathway Name',
        'Nº of Met. in Dataset', 'Nº of Met. in Pathway', '% of Met. In Set', 'Probability'], dtype='object')

    for pathway2 in tqdm(path_assign_hmdb.index):
        # Pathway Name
        p_name = pathway2
        # Pathway Background Set - Number of Metabolites detected in the FDiN in each pathway
        total_met_in_path = path_assign_hmdb.loc[pathway2]
        # 'Significant' Metabolites in Pathway - Number of Metabolites in the top ranks considered in each pathway
        ann_met_in_path = (np.array(pathway_ranks[pathway2]) <= top_ranks_considered).sum()

        # Calculating probability to find ann_met_in_path or more metabolites in annotated_metabolites_w_path
        prob = stats.hypergeom(M=total_metabolites_with_pathways, 
                                n=total_met_in_path, 
                                N=annotated_metabolites_w_path).sf(ann_met_in_path-1)

        # Adding the line to the DF
        over_representation_df.loc[pathway2] = [p_name, ann_met_in_path, total_met_in_path,
                                               ann_met_in_path/total_met_in_path*100, prob]

    # Sorting DataFrame from least probable to more probable
    # (this order is used for the pathway dropdown options)
    over_representation_df_nodes = over_representation_df.sort_values(by='Probability')


    ## Get color, node names to show and opacity of nodes for each
    colors_nodes = {}
    cname_nodes = {}
    opacity_nodes = {}
    for i in FDiN.nodes():
        if i in imp_mzs:
            if pathway in FDiN.nodes()[i]['Pathways'] and pathway != 'None':
                colors_nodes[i] ='red'
            else:
                colors_nodes[i] ='lightcoral'
            cname_nodes[i] = temp_df.loc[i ,'Formula_Assignment']
            opacity_nodes[i] = 1
        else:
            if pathway in FDiN.nodes()[i]['Pathways'] and pathway != 'None':
                colors_nodes[i] ='grey'
                opacity_nodes[i] = 1
            else:
                colors_nodes[i] = 'lightgrey'
                opacity_nodes[i] = 0.4
            cname_nodes[i] =''
    for n, c in zip(cs_data["elements"]["nodes"], colors_nodes.values()):
        n['data']["color"] = c
    for n, c in zip(cs_data["elements"]["nodes"], cname_nodes.values()):
        n['data']["cname"] = c
    for n, c in zip(cs_data["elements"]["nodes"], opacity_nodes.values()):
        n['data']["opacity"] = c

    ## Get color and opacity of edges
    edge_colors = {}
    edge_opacity = {}
    for i in FDiN.edges():
        if i[0] in imp_mzs and i[1] in imp_mzs:
            if pathway in FDiN.edges()[i]['Pathways'] and pathway != 'None':
                edge_colors[i] ='red'
            else:
                edge_colors[i] ='lightcoral'
            edge_opacity[i] = 1
        else:
            if pathway in FDiN.edges()[i]['Pathways'] and pathway != 'None':
                edge_colors[i] ='grey'
                edge_opacity[i] = 1
            else:
                edge_colors[i] = 'lightgrey'
                edge_opacity[i] = 0.4
    for n, c in zip(cs_data["elements"]["edges"], edge_colors.values()):
        n['data']["color"] = c
    for n, c in zip(cs_data["elements"]["edges"], edge_opacity.values()):
        n['data']["opacity"] = c

    # Build the string to describe network component of highlighted area
    construct_string = [html.B(f'Components of top {top_chosen} important metabolites:'), html.Br(), html.Br(),]
    for i in nx.connected_components(FDiN.subgraph(imp_mzs)):
        construct_string.append(html.B(f'{len(i)}-length component'))
        construct_string.append(f' ({i}).')
        construct_string.append(html.Br())

    # Return in order to update layout
    return [{'selector': 'node',
                'style': {'label': 'data(cname)',
                    'background-color': 'data(color)',
                    'shape': 'circle',
                    'opacity': 'data(opacity)'}},
            {'selector': 'edge',
                'style': {
                'line-color': 'data(color)',
                'opacity': 'data(opacity)'}}], cs_data["elements"]["nodes"] + cs_data["elements"][
        "edges"], construct_string, [{'label': name.capitalize(), 'value': name}
                                for name in ['None',] + list(over_representation_df_nodes.index)]

@callback(Output('cytoscape', 'layout'),
              Input('dropdown-update-layout', 'value'))
def update_network_layout(layout):
    "Updates the network layout based on the option chosen in the layout dropdown"
    return {'name': layout, 'animate': False}

@callback(Output('cytoscape-tapNodeData-output', 'children'),
          Output(component_id='bar-graph-matplotlib', component_property='src'),
          Output(component_id='node-presence-output', component_property='children'),
              Input('cytoscape', 'tapNodeData'))
def displayTapNodeData(data):
    "Display information on the node, its average intensities and feature occurrence across classes."
    if data:
        
        ## Node Information Section
        construct_string = [html.B(f'Most recently clicked node: {data["id"]}'), html.Br(), html.Br(),]
        for cat in ['id', 'Formula', 'Name', 'HMDB_ID', 'Names in Pathways',
                    'HMDB_ID in Pathways', 'Pathways', 'SMPDB_IDs']:
            if cat == 'Name':
                # Include the FDiGNN importance rank
                construct_string.extend([f'FDiGNN Rank: {str(all_ranks.loc[data["id"]].iloc[0])}',  html.Br(),])
            construct_string.extend([f'{cat.capitalize()}: {str(data[cat])}',  html.Br(),])

        ## Normalized Intensity Bar Plot Section
        plt.close()
        gfinder = processed_data.copy().loc[data["id"], sample_cols]
        percentages = {}
        # Calculate average intensities (and std)
        for cl in classes:
            section = gfinder.iloc[[i for i in range(len(sample_cols)) if target[i]==cl]]#.mean()
            percentages[cl] = section.notnull().sum()/len(section)*100
            gfinder[cl+' Average'] = section.mean()
            gfinder[cl+' sem'] = section.std() / np.sqrt(section.notnull().sum())
        avg_cols = [col for col in gfinder.index if 'Average' in col]
        sem_cols = [col for col in gfinder.index if 'sem' in col]
        # Plot the graph
        bar_plot_info = gfinder.replace({np.nan:0})
        factor = 1
        maxi_v = (np.array(bar_plot_info.loc[avg_cols])*10**factor).max()
        while maxi_v < 1:
            factor+=1
            maxi_v = (np.array(bar_plot_info.loc[avg_cols])*10**factor).max()
        fig, ax = plt.subplots(1,1, figsize=(5,3), constrained_layout=True)
        x = np.arange(len(avg_cols))
        if len(classes) <= 10:
            colors_for_plot = [colours[i] for i in range(len(classes))]
        else:
            colors_for_plot = [colours[i] for i in range(10)]
        ax.bar(x, np.array(bar_plot_info.loc[avg_cols])*10**factor, color=colors_for_plot,
               yerr=np.array(bar_plot_info.loc[sem_cols])*10**factor, capsize=12)
        ax.set_ylabel(f'Intensity (x$10^{{{factor}}}$)', fontsize=15)
        ax.set_title(processed_data.copy().loc[data["id"], 'Formula_Assignment'], fontsize=15)
        ax.set_xticks(x)
        ax.set_xticklabels(classes, fontsize=12)
        # Embed the result in the html output.
        buf = BytesIO()
        fig.savefig(buf, format="png")
        fig_data = base64.b64encode(buf.getbuffer()).decode("ascii")
        fig_bar_matplotlib = f'data:image/png;base64,{fig_data}'

        ## Feature Occurrence Section
        node_pres = [html.B(f'Node Present in % of Class Samples:'), html.Br(), html.Br(),]
        for cl in percentages:
            node_pres.extend([f'{cl}: {percentages[cl]:.3f}', html.Br()])
        
        return construct_string, fig_bar_matplotlib, node_pres
    return '', '', ''

@callback(Output('cytoscape-tapEdgeData-output', 'children'),
              Input('cytoscape', 'tapEdgeData'))
def displayTapEdgeData(data):
    "Show the edge data, that is if it can be mapped to a pathway."
    if data:
        construct_string = [html.B(f'Most recently clicked edge: {data["source"]}-{data["target"]}'),
                            html.Br(), html.Br(),]
        for cat in ['Pathways', 'SMPDB_IDs']:
            construct_string.extend([f'{cat.capitalize()}: {str(data[cat])}',  html.Br(),])
        return construct_string

@callback(
    Output("cytoscape", "generateImage"),
    [Input("btn-get-jpg", "n_clicks"),
        Input("btn-get-png", "n_clicks"),
        Input("btn-get-svg", "n_clicks"),
    ])
def get_image(get_jpg_clicks, get_png_clicks, get_svg_clicks):
    "Download the graph image."
    if ctx.triggered:
        action = "download"
        ftype = ctx.triggered_id.split("-")[-1]

        return {
            'type': ftype,
            'action': action
            }
    else:
        return {
            'type': 'svg',
            'action': 'store'
            }

# Run the app
app.run(debug=True, jupyter_mode='external', port=8052)