# Sample Mass-Difference Networks in Metabolomics Data Analysis

Notebook to support the study on the application of **Sample M**ass-**Di**fference **N**etworks as a highly specific competing form of pre-processing procedure for high-resolution metabolomics data.

Mass-Difference Networks (MDiNs) are networks built from a list of masses. Each _m/z_ value is represented by a node, and two nodes are connected if the difference between their masses can be associated with a simple chemical reaction (enzymatic or non-enzymatic) that changes the elemental composition of a metabolite.

The mass differences used to build these networks are called MDBs (Mass-Difference-based Building blocks).
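
As a minimal sketch of this idea (not the construction used in the study, which is described in `paper_sMDiNs_sMDiNsAnalysis.ipynb`), the following connects masses whose difference matches an MDB mass. The peak masses, the three MDB masses, the absolute tolerance and the helper name `mdin_edges` are all illustrative assumptions.

```python
import itertools

# Hypothetical neutral masses (Da) and a few MDB masses; values are illustrative.
masses = [180.06339, 182.07904, 194.07904, 196.05830]
mdb_masses = {'H2': 2.01565, 'CH2': 14.01565, 'O': 15.99491}
tolerance = 0.001  # assumed absolute tolerance in Da


def mdin_edges(masses, mdb_masses, tolerance):
    """Return (mass1, mass2, {'mdb': name}) for every pair matching an MDB mass."""
    edges = []
    for m1, m2 in itertools.combinations(sorted(masses), 2):
        diff = m2 - m1
        for mdb, mdb_mass in mdb_masses.items():
            if abs(diff - mdb_mass) <= tolerance:
                edges.append((m1, m2, {'mdb': mdb}))
    return edges


edges = mdin_edges(masses, mdb_masses, tolerance)
for m1, m2, attrs in edges:
    print(m1, m2, attrs['mdb'])
```

An edge list in this form can be passed to the `add_edges_from` method of a `networkx.Graph` to obtain a network object.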

This is notebook `paper_sMDiNs_supervised.ipynb`


## Organization of the Notebook

- Loading the database of pre-processed and pre-treated datasets, which holds the intensity-based pre-treated data and the data from the sMDiN analyses.
- **Random Forest - optimization, predictive accuracy and important features: comparison after application of different pre-treatment procedures.**
- **PLS-DA - optimization, predictive accuracy and important features: comparison after application of different pre-treatment procedures.**


#### Needed Imports

In [None]:
import itertools
from pathlib import Path

import numpy as np
import pandas as pd

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
import scipy.stats as stats

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
from matplotlib import ticker

from sklearn.model_selection import GridSearchCV
import sklearn.ensemble as skensemble
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import (mean_squared_error, r2_score, roc_auc_score, roc_curve, auc)

import seaborn as sns
import networkx as nx
from collections import namedtuple

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
from elips import plot_confidence_ellipse

# json for persistence
import json
from time import perf_counter

In [None]:
%matplotlib inline

## Description of dataset records

`datasets` is the global dict that holds all data sets. It is a **dict of dict's**.

Each data set is **represented as a dict**.

Each record has the following fields (keys):

- `name`: the table/figure name of the data set
- `source`: the biological source for each dataset
- `mode`: the acquisition mode
- `alignment`: the alignment used to generate the data matrix
- `data`: the data matrix
- `target`: the sample labels, possibly already integer encoded
- `MDiN`: Mass-Difference Network - Not present here, only on sMDiNsAnalysis notebook
- `<treatment name>`: transformed data matrix / network. These treatment names can be
    - `Ionly`: missing value imputed data by 1/5 of the minimum value in each sample in the dataset, only
    - `NGP`: normalized, glog transformed and Pareto scaled
    - `Ionly_RF`: missing value imputed data by random forests, only
    - `NGP_RF`: normalized, glog transformed and Pareto scaled (following random forest missing value imputation, as in `Ionly_RF`)
    - `IDT`: `NGP_RF` or `NGP` - Intensity-based Data pre-Treatment chosen as comparison based on which of the two performed better for each dataset and each statistical method
    - `sMDiN`: Sample Mass-Difference Networks - Not present here, only on sMDiNsAnalysis notebook
       
- `<sMDiN analysis name>`: data matrix from network analysis of MDiNs - Not in this notebook
    - `Degree`: degree analysis of each sMDiN
    - `Betweenness`: betweenness centrality analysis of each sMDiN
    - `Closeness`: closeness centrality analysis of each sMDiN
    - `MDBI`: analysis on the impact of each MDB (Mass-Difference based building-block) on building each sMDiN
    - `GCD11`: Graphlet Correlation Distance of 11 different orbits (maximum of 4-node graphlets) between each sMDiN.
    - `WMDBI`: an alternative calculation of MDBI using the results from the degree analysis.

- `iter_fold_splits`: nested dicts holding the transformed training and test data matrices, indexed by iteration, training/test group, fold number and one of the previously mentioned data pre-treatments
- `train`: specific to the HD dataset; contains the different pre-treatments and sMDiN analyses mentioned above and a target, all restricted to the training set defined for HD
- `test`: specific to the HD dataset; contains the different pre-treatments and sMDiN analyses mentioned above and a target, all restricted to the external test set defined for HD


The keys of `datasets` may be shared with dicts holding records resulting from comparison analysis.

Here are the keys (and respective names) of datasets used in this study:

- GD_neg_global2 (GDg2-)
- GD_neg_class2 (GDc2-)
- YD (YD)
- vitis_types (GD types)
- HD (HD)
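
To make the nesting of `iter_fold_splits` concrete, here is a toy record with the access pattern used throughout this notebook (iteration, then `'train'`/`'test'`, then fold, then treatment or `'target'`). All names and values in it are invented for illustration; the real records are loaded from `store_files` further down.

```python
import pandas as pd

# Toy data matrices; sample names, features and values are invented.
toy_train = pd.DataFrame({'m1': [1.0, 2.0], 'm2': [0.5, 0.1]}, index=['s1', 's2'])
toy_test = pd.DataFrame({'m1': [1.5], 'm2': [0.3]}, index=['s3'])

toy_record = {
    'name': 'toy',
    'target': ['A', 'B', 'A'],
    'iter_fold_splits': {
        1: {  # iteration number (starts at 1)
            'train': {1: {'NGP': toy_train, 'target': ['A', 'B']}},  # fold 1
            'test': {1: {'NGP': toy_test, 'target': ['A']}},
        }
    },
}

# Training data and labels of iteration 1, fold 1, treatment 'NGP'
X_train = toy_record['iter_fold_splits'][1]['train'][1]['NGP']
y_train = toy_record['iter_fold_splits'][1]['train'][1]['target']
print(X_train.shape, y_train)
```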

#### Data Pre-Treatment

For information on the **commonly used intensity based data pre-treatments** and about the **benchmark datasets**, see notebook `paper_sMDiNs_database_prep.ipynb`.

For information on the **building** and the different **network analysis methods** used for the **Sample MDiNs** and information about the Mass-Difference-based Building blocks (**MDBs**), see notebook `paper_sMDiNs_sMDiNsAnalysis.ipynb`.

### Reading datasets database

In [None]:
# Where the datasets are
path = Path.cwd() / "store_files" / 'processed_data.json'
storepath = Path.cwd() / "store_files" / 'processed_data.h5'
with pd.HDFStore(storepath) as store:
    
    # Read into a dictionary not DataFrame data
    with open(path, encoding='utf8') as read_file:
        datasets = json.load(read_file)
    
    # Add DataFrame data to dict
    for dskey, dataset in datasets.items():
        dataset['iter_fold_splits'] = {}
        if dskey == 'HD':
            dataset['train'] = {}
            dataset['test'] = {}
        for key in dataset:
            # Created right before
            if 'iter_fold_splits' == key:
                continue
            value = dataset[key]
            if isinstance(value, str) and value.startswith("INSTORE"):
                storekey = value.split("_", 1)[1]
                #print(storekey)
                # Load the data from 'iter_fold_splits' carefully restoring the nested dictionaries
                if len(storekey.split("AA_")) > 1: # This separation was made to identify the 'iter_fold_splits' data
                    dictkeys = (storekey.split("AA_")[1]).split('_',3)
                    # Create nested dicts
                    if int(dictkeys[0]) not in dataset['iter_fold_splits'].keys():
                        dataset['iter_fold_splits'][int(dictkeys[0])] = {}
                    if dictkeys[1] not in dataset['iter_fold_splits'][int(dictkeys[0])].keys():
                        dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]] = {}
                    if int(dictkeys[2]) not in dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]].keys():
                        dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])] = {}
                    dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])][dictkeys[3]] = store[storekey]
                
                # Load the data from 'train' and 'test' from HD dataset keys carefully restoring the nested dictionaries
                elif len(storekey.split("TTS_")) > 1:
                    dictkeys = ((storekey.split("TTS_")[0]).split('_')[-1], storekey.split("TTS_")[1])#.split('_',2)
                    dataset[dictkeys[0]][dictkeys[1]] = store[storekey]
                # Normal DataFrames
                else:
                    dataset[key] = store[storekey]

            # convert colors to tuples, since they are read as lists from json file
            elif key == 'label_colors':
                dataset[key] = {lbl: tuple(c) for lbl, c in value.items()}
            elif key == 'sample_colors':
                dataset[key] = [tuple(c) for c in value]
            elif key.endswith('target') and key.startswith(dskey):
                if len(key.split("AA_")) > 1: 
                    dictkeys = ((key.split("_", 1)[1]).split("AA_")[1]).split('_',3)
                    dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])][dictkeys[3]] = value
                else:
                    dictkeys = ((key.split("TTS_")[0]).split('_')[-1], key.split("TTS_")[1])#.split('_',2)
                    dataset[dictkeys[0]][dictkeys[1]] = value

# Remove extra keys
for name, ds in datasets.items():
    keys_to_remove = [keys for keys in ds.keys() if keys.startswith(name)]
    for key in keys_to_remove:
        ds.pop(key)

#datasets
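
The restoration of the nested dicts above hinges on splitting each store key at the `AA_` marker. A minimal sketch of that step follows. The example key is hypothetical (the real keys come from `processed_data.json`), the helper name `parse_split_key` is an assumption, and `dict.setdefault` is used here as a compact alternative to the explicit membership checks.

```python
def parse_split_key(storekey):
    """Split '<prefix>AA_<iter>_<train|test>_<fold>_<treatment>' into its parts."""
    itr, group, fold, treatment = storekey.split("AA_")[1].split('_', 3)
    return int(itr), group, int(fold), treatment


# Hypothetical key; the treatment itself may contain underscores (e.g. NGP_RF),
# which is why the split is limited to 3 separators.
key = "GD_neg_class2_AA_3_train_2_NGP_RF"
itr, group, fold, treatment = parse_split_key(key)

splits = {}
# Create the intermediate dicts on demand, then store the value
splits.setdefault(itr, {}).setdefault(group, {}).setdefault(fold, {})[treatment] = "DataFrame goes here"
print(splits)
```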

In [None]:
# Selecting a placeholder for the Intensity-based Data pre-Treatment (IDT)
# Chosen for each dataset and each method based on which between NGP and NGP_RF generated the best results
for name, ds in datasets.items():
    ds['IDT'] = ds['NGP_RF']  

In [None]:
#fname = 'store_files/datasets.json'
#with open(fname, "r", encoding='utf8') as read_file:
#    datasets = json.load(read_file)

In [None]:
#for dskey, ds in datasets.items():
#    if dskey.startswith('YD'):
#        ds['data'] = pd.DataFrame(ds['data']['data'], index=ds['data']['index'], columns=ds['data']['columns'])
#        ds['original'] = ds['data']
    
#    else:
#        df_idx = pd.MultiIndex.from_tuples(ds['data']['index'])
#        df_data = pd.DataFrame(ds['data']['data'], index=df_idx, columns=ds['data']['columns'])
#        df_data.index.set_names('label', level=0, inplace=True)
        
#        ds['data'] = df_data
#        ds['original'] = df_data
        
    # Checkpoint to see if the data is equal
    #assert_frame_equal(datasets[dskey]['data'], datasets2[dskey]['data'])

Extra Possibly Useful Data

In [None]:
# Atomic masses - https://ciaaw.org/atomic-masses.htm
# Isotopic abundances - https://ciaaw.org/isotopic-abundances.htm and
# https://www.degruyter.com/view/journals/pac/88/3/article-p293.xml
# Isotopic abundances from Pure Appl. Chem. 2016; 88(3): 293–306,
# Isotopic compositions of the elements 2013 (IUPAC Technical Report), doi: 10.1515/pac-2015-0503

chemdict = {'H':(1.0078250322, 0.999844),
            'C':(12.000000000, 0.988922),
            'N':(14.003074004, 0.996337),
            'O':(15.994914619, 0.9976206),
            'Na':(22.98976928, 1.0),
            'P':(30.973761998, 1.0),
            'S':(31.972071174, 0.9504074),
            'Cl':(34.9688527, 0.757647),
            'F':(18.998403163, 1.0),
            'C13':(13.003354835, 0.011078) # Carbon 13 isotope
           }

# electron mass from NIST http://physics.nist.gov/cgi-bin/cuu/Value?meu|search_for=electron+mass
electron_mass = 0.000548579909065

In [None]:
# Chemical Formula transformations (MDBs chosen)
MDB = ['H2','CH2','CO2','O','CHOH','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O']
MDB_YD = ['H2','CH2','CO2','O','CHOH','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O', 
          'C2H2O2', 'C3H4O2']
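
For reference, the mass difference associated with an MDB follows from the monoisotopic masses in `chemdict`. The sketch below handles only plain formulas such as `CH2` or `C2H2O`; exchange-type MDBs like `O(N-H-)` or `NH3(O-)` would need extra parsing that is not attempted here. The helper name `formula_mass` is an assumption of this example.

```python
import re

# Monoisotopic masses copied from `chemdict` above (abundances omitted).
mono_mass = {'H': 1.0078250322, 'C': 12.000000000, 'N': 14.003074004,
             'O': 15.994914619, 'P': 30.973761998, 'S': 31.972071174}


def formula_mass(formula):
    """Monoisotopic mass of a plain formula like 'C2H2O' (no parentheses or losses)."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1
    for element, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        total += mono_mass[element] * (int(count) if count else 1)
    return total


for mdb in ('H2', 'CH2', 'CO2', 'H2O'):
    print(mdb, round(formula_mass(mdb), 6))
```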

### Colors for plots to ensure consistency

#### 11 variety grapevine data sets

In [None]:
# customize label colors for 11 grapevine varieties

colours = sns.color_palette('Blues', 3)
colours.extend(sns.color_palette('Greens', 3))
#colours = sns.cubehelix_palette(n_colors=6, start=2, rot=0, dark=0.2, light=.9, reverse=True)
colours.extend(sns.color_palette('flare', 5))

ordered_vitis_labels = ('CAN','RIP','ROT','RU','LAB','SYL','REG','CS','PN','RL','TRI')

vitis_label_colors = {lbl: c for lbl, c in zip(ordered_vitis_labels, colours)}

tab20bcols = sns.color_palette('tab20b', 20)
tab20ccols = sns.color_palette('tab20c', 20)
tab20cols = sns.color_palette('tab20', 20)
tab10cols = sns.color_palette('tab10', 10)
dark2cols = sns.color_palette('Dark2', 8)

vitis_label_colors['RU'] = tab20bcols[8]
vitis_label_colors['CAN'] = tab20ccols[5]
vitis_label_colors['REG'] = tab10cols[3]

for name in datasets:
    if name.startswith('GD'):
        datasets[name]['label_colors'] = vitis_label_colors
        datasets[name]['sample_colors'] = [vitis_label_colors[lbl] for lbl in datasets[name]['target']]

In [None]:
sns.palplot(vitis_label_colors.values())
new_ticks = plt.xticks(range(len(ordered_vitis_labels)), ordered_vitis_labels)

#### 5 yeast strains

In [None]:
# customize label colors for 5 yeast strains

colours = sns.color_palette('Set1', 5)
yeast_classes = datasets['YD']['classes']
yeast_label_colors = {lbl: c for lbl, c in zip(yeast_classes, colours)}
datasets['YD']['label_colors'] = yeast_label_colors
datasets['YD']['sample_colors'] = [yeast_label_colors[lbl] for lbl in datasets['YD']['target']]

In [None]:
sns.palplot(yeast_label_colors.values())
new_ticks = plt.xticks(range(len(yeast_classes)), yeast_classes)

#### 2 classes of Vitis types (wild and _vinifera_)

In [None]:
# customize label colors for 2 types of Vitis varieties

colours = [vitis_label_colors['SYL'], vitis_label_colors['TRI']]
vitis_type_classes = datasets['vitis_types']['classes']
vitis_types_label_colors = {lbl: c for lbl, c in zip(vitis_type_classes, colours)}
datasets['vitis_types']['label_colors'] = vitis_types_label_colors
datasets['vitis_types']['sample_colors'] = [vitis_types_label_colors[lbl] for lbl in datasets['vitis_types']['target']]

In [None]:
sns.palplot(datasets['vitis_types']['label_colors'].values())
new_ticks = plt.xticks(range(len(datasets['vitis_types']['classes'])), datasets['vitis_types']['classes'])

#### 2 HD classes

In [None]:
# customize label colors for 2 HD classes

colours = sns.color_palette('Set1', 2)
hd_label_colors = {lbl: c for lbl, c in zip(datasets['HD']['classes'], colours)}
datasets['HD']['label_colors'] = hd_label_colors
datasets['HD']['sample_colors'] = [hd_label_colors[lbl] for lbl in datasets['HD']['target']]

In [None]:
sns.palplot(hd_label_colors.values())
new_ticks = plt.xticks(range(len(datasets['HD']['classes'])), datasets['HD']['classes'])

Samples and respective target labels of each dataset

In [None]:
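def styled_sample_labels(sample_names, sample_labels, label_colors):
    """Return a one-row styled table of sample labels, each cell shaded with its label color."""

    meta_table = pd.DataFrame({'label': sample_labels,
                               'sample': sample_names}).set_index('sample').T

    def apply_label_color(val):
        # Convert an RGB tuple with components in [0, 1] to a CSS hex color
        red, green, blue = label_colors[val]
        red, green, blue = int(red*255), int(green*255), int(blue*255)
        hexcode = '#%02x%02x%02x' % (red, green, blue)
        css = f'background-color: {hexcode}'
        return css

    # Note: `Styler.applymap` is deprecated since pandas 2.1 in favor of `Styler.map`
    return meta_table.style.applymap(apply_label_color)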

In [None]:
parsed = mtl.parse_data(datasets['GD_neg_class2']['data'], labels_loc='label')
y = datasets['GD_neg_class2']['target']
label_colors = datasets['GD_neg_class2']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['YD']['data'])
y = datasets['YD']['target']
label_colors = datasets['YD']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['vitis_types']['data'], labels_loc='label')
y = datasets['vitis_types']['target']
label_colors = datasets['vitis_types']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['HD']['data'])
y = datasets['HD']['target']
label_colors = datasets['HD']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

#### Colors for the pre-treatments / sMDiN analysis metrics for the plots

In [None]:
# customize colors for the intensity-based pre-treatment and analysis metrics of sample MDiNs
treatments = ('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11', 'NGP', 'NGP_RF')

treat_colors = tab10cols[:4]
treat_colors.extend(tab20cols[8:10])
treat_colors.append(tab10cols[5])
treat_colors.extend(tab20cols[:2])
treatment_colors = {lbl: c for lbl, c in zip(treatments, treat_colors)}

sns.palplot(treatment_colors.values())
new_ticks = plt.xticks(range(len(treatment_colors)), treatment_colors)

## Supervised Statistical Analysis

The Supervised Statistical Analysis methods used will be Random Forest and PLS-DA.

The performance of the classifiers will be evaluated by their predictive **accuracy**, always estimated by internal stratified k-fold cross-validation (5-fold for `vitis_types` and `HD`, 3-fold for the remaining datasets). For `HD`, an external test set comprising 30% of the samples was separated from the training set to further validate the models.

Each method will be applied to the differently-treated datasets for each of the benchmark datasets.

**Note**: Random Forest and PLS-DA are only applied if `GENERATE` is **True**. This flag is always set in the cell just before the application.

#### All supervised methods on IDT or WMDBI were applied using the iteration data in 'iter_fold_splits' of the datasets. This data has 20 iterations of different k-fold separations of the data in training and testing groups which were independently treated by these methods. Optimization methods were applied using the '1st iteration' data. For the remaining sMDiN analysis methods, the same fold separations in each iteration were also used.

The functions below use the `iter_fold_splits` key of each benchmark dataset dict. Inside this dict, nested dicts lead to the independently treated training and test data for every iteration, fold split (for k-fold cross-validation) and tested data pre-treatment. Models and findings are validated by stratified k-fold cross-validation, repeated over 20 iterations so that more combinations of training and test samples are used, which offsets the small size (in samples per group) of the datasets. As such, these functions are variations of the more general functions in `multianalysis.py`.
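
For readers who want to build a similar structure, the sketch below shows one way to generate repeated stratified k-fold splits and treat each training and test group without leakage, using scikit-learn. It is an illustration only: the synthetic data, the use of `RepeatedStratifiedKFold`, and `StandardScaler` standing in for the intensity-based pre-treatments are all assumptions, and the procedure may differ from the one used to prepare the stored splits. The splits used in this study are the ones loaded from `store_files` above.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))           # synthetic data: 12 samples, 4 features
y = np.array(['A'] * 6 + ['B'] * 6)    # two balanced classes

n_folds, n_iter = 3, 2
rskf = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_iter, random_state=0)

splits = {}
for i, (train_idx, test_idx) in enumerate(rskf.split(X, y)):
    # Splits come out grouped by repetition; convert to 1-based iteration and fold
    itr, fold = i // n_folds + 1, i % n_folds + 1
    scaler = StandardScaler().fit(X[train_idx])     # fitted on training samples only
    it = splits.setdefault(itr, {'train': {}, 'test': {}})
    it['train'][fold] = {'scaled': scaler.transform(X[train_idx]),
                         'target': list(y[train_idx])}
    it['test'][fold] = {'scaled': scaler.transform(X[test_idx]),
                        'target': list(y[test_idx])}

print(sorted(splits), sorted(splits[1]['train']))
```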

In [None]:
# Creating the dictionaries in iter_fold_splits for the 5 sMDiN analysis (not created before, since there was no
# danger of data leakage, each network analysis is independent from network to network)
for name, ds in datasets.items():
    
    ds_iter = ds['iter_fold_splits']

    for itr in range(len(ds_iter.keys())):
        for fold in ds_iter[itr+1]['train'].keys():
            for treat in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'GCD11'):

                ds_iter[itr+1]['train'][fold][treat] = ds[treat].loc[ds_iter[itr+1]['train'][fold]['data'].index]
                ds_iter[itr+1]['test'][fold][treat] = ds[treat].loc[ds_iter[itr+1]['test'][fold]['data'].index]

## Random Forests

### Optimization of the number of trees

Random Forest models with different numbers of trees are built to assess when the predictive accuracy stops increasing with the number of trees. The grid search covers 10 to 195 trees in steps of 5 (the upper bound of 200 is exclusive). See where the cross-validation estimated predictive accuracy stops improving for each one.

In [None]:
GENERATE = True

In [None]:
np.random.seed(2967)
if GENERATE:
    # Upper bound (exclusive) of the grid of number of trees;
    # lower it for quick debugging runs
    top_tree_in_grid=200

    # For each dataset, build  Random Forest models with the different number of trees
    # and store the predictive accuracy (estimated by k-fold cross-validation)

    RF_optim = {}
    for name, dataset in datasets.items():
        
        if name in ('vitis_types', 'HD'):
            cv = 5
        else:
            cv = 3
        
        # Dictionary key with the iteration/fold combinations
        ds_iter = datasets[name]['iter_fold_splits']
        
        for treatment in ('NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
            print('Fitting to', dataset['name'], 'pre-treatment', treatment, '...', end=' ')
            rfname = name + ' ' + treatment
            RF_optim[rfname] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}
            
            accuracy_scores = []
            
            for n_trees in range(10,top_tree_in_grid,5):
                CV_accuracy_scores = []
                # Fit and evaluate a Random Forest model for each fold in stratified k-fold cross validation
                for fold in range(1,cv+1):
                    # Random Forest setup and fit
                    #print(ds_iter[1]['train'][fold][treatment])
                    rf = skensemble.RandomForestClassifier(n_estimators=n_trees)
                    rf.fit(ds_iter[1]['train'][fold][treatment], ds_iter[1]['train'][fold]['target'])

                    # Compute performance
                    CV_accuracy_scores.append(rf.score(ds_iter[1]['test'][fold][treatment], 
                                                       ds_iter[1]['test'][fold]['target'])) # Prediction Accuracy

                # Average Predictive Accuracy in this iteration
                accuracy_scores.append(np.mean(CV_accuracy_scores))
            
            RF_optim[rfname]['scores'] = accuracy_scores
            RF_optim[rfname]['n_trees'] = list(range(10,top_tree_in_grid,5))

            print('Done!')
    print('writing results to file')
    path = Path.cwd() / 'store_files' / 'RF_optim.json'
    print(path.name)
    with open(path, "w", encoding='utf8') as write_file:
        json.dump(RF_optim, write_file)

In [None]:
path = Path.cwd() / 'store_files' / 'RF_optim.json'
with open(path, "r", encoding='utf8') as read_file:
    RF_optim = json.load(read_file)

#### Plots of tree number optimization

In [None]:
# Plotting the results and adjusting parameters of the plot

def plot_RF_otimization_ntrees(RF_optim, dskey, ax=None, ylabel='', title='', ylim=(30,101)):
    col = treat_colors[-2:] + treat_colors[1:]
    to_plot = [optim for key, optim in RF_optim.items() if optim['dskey'] == dskey]
    treatments = ('NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')
    if ax is None:
        ax = plt.gca()
    for treatment, color in zip(treatments, col):
        for optim in to_plot:
            if optim['treatment'] == treatment:
                break
        ax.plot(optim['n_trees'], [s*100 for s in optim['scores']], label=treatment, color=color)
    ax.set(ylabel=ylabel, xlabel='Number of Trees', ylim=ylim, title=title)
    ax.legend()

with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, axs = plt.subplots(3, 2, figsize=(12,15), constrained_layout=True)

        for dskey, ax in zip(datasets, axs.ravel()):
        
            plot_RF_otimization_ntrees(RF_optim, dskey, ax=ax,
                                       ylabel='Random Forest CV Mean Accuracy (%)',
                                       title=datasets[dskey]["name"])

        f.suptitle('Optimization of the number of trees')
        axs[2][1].set_visible(False)

        plt.show()

### Random Forest models

Random Forest models were built with the `RandomForestClassifier` from scikit-learn using the `RF_model_CV` modified from multianalysis.py (to accommodate our database structure) shown below.

This function reads the 20 iterations and k folds stored in the data to perform k-fold cross-validation. Iterations are used to test more combinations of training and test samples to offset the small (in terms of samples per group) datasets. 

It then stores the predictive accuracy of the models (one value per iteration) and a list of features ordered from most to least important, based on the Gini importance calculated by scikit-learn and averaged over all iterations and folds.
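
The averaging of importances can be illustrated on a toy example. Features absent from a given model are counted as zero importance before averaging, which has the same effect as the `replace({np.nan:0})` step in the function below. The feature names and importance values here are invented.

```python
import pandas as pd

# Importances from three hypothetical models, keyed by 'iteration-fold';
# feature 'f3' is missing from the second model.
imp_feat = {
    '1-1': {'f1': 0.6, 'f2': 0.3, 'f3': 0.1},
    '1-2': {'f1': 0.5, 'f2': 0.5},
    '1-3': {'f1': 0.4, 'f2': 0.2, 'f3': 0.4},
}

# Rows are features, columns are models; missing entries become 0
table = pd.DataFrame.from_dict(imp_feat).fillna(0)
mean_importance = table.mean(axis=1).sort_values(ascending=False)

# Ordered list of (feature, average importance) pairs
ranked = list(mean_importance.items())
print(ranked)
```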

In [None]:
GENERATE = True # False

In [None]:
# RF_model_CV - RF application and result extraction.
# Altered version of RF_model_CV taking into account the treated datasets in the 'iter_fold_splits' key of each dataset dict.
# Inside this dict, there are many nested dicts that culminate in the independently treated training and test dataset for all
# iterations and fold splits (for k-fold cross-validation) as well as each tested data pre-treatment.
def RF_model_CV(ds, treatment, n_trees=200):

    nfeats = ds['data'].shape[1]
    ds_iter = ds['iter_fold_splits']

    # Setting up variables for result storing
    #imp_feat = np.zeros((len(ds_iter.keys()) * len(ds_iter[1]['train'].keys()), nfeats))
    imp_feat = {}
    accuracy_scores = []
    
    # Number of times Random Forest cross-validation is made
    # with `n_fold` randomly generated folds.
    for itr in range(len(ds_iter.keys())):
        
        # To store results
        CV_accuracy_scores = []

        # Fit and evaluate a Random Forest model for each fold in stratified k-fold cross validation
        for fold in ds_iter[itr+1]['train'].keys():
            # Random Forest setup and fit
            rf = skensemble.RandomForestClassifier(n_estimators=n_trees)
            rf.fit(ds_iter[itr+1]['train'][fold][treatment], ds_iter[itr+1]['train'][fold]['target'])
            
            # Compute performance and important features
            CV_accuracy_scores.append(rf.score(ds_iter[itr+1]['test'][fold][treatment], 
                                               ds_iter[itr+1]['test'][fold]['target'])) # Prediction Accuracy
            #imp_feat[f, :] = rf.feature_importances_
            imp_feat[str(itr+1)+'-'+str(fold)] = dict(zip(ds_iter[itr+1]['train'][fold][treatment].columns,
                                                      rf.feature_importances_)) # Importance of each feature

        # Average Predictive Accuracy in this iteration
        accuracy_scores.append(np.mean(CV_accuracy_scores))
    
    # Collect and order all important features values from each Random Forest
    imp_feat = pd.DataFrame.from_dict(imp_feat).replace({np.nan:0})
    imp_feat_sum = (imp_feat.sum(axis=1)/ (len(ds_iter.keys()) * len(ds_iter[itr+1]['train'].keys())))
    sorted_imp_feat = imp_feat_sum.sort_values(ascending=False)
    imp_feat = []
    for i in sorted_imp_feat.index:
        imp_feat.append((i, sorted_imp_feat.loc[i]))

    if len(ds_iter.keys()) == 1:
        return {'accuracy': accuracy_scores[0], 'important_features': imp_feat}
    else:
        return {'accuracy': accuracy_scores, 'important_features': imp_feat}

In [None]:
np.random.seed(16)
if GENERATE:

    RF_all = {}

    # Application of the Random Forests for each differently-treated dataset
    for name, dataset in datasets.items():
        
        # Intensity-based pre-treatments
        IDT_res = {}
        for treatment in ('NGP', 'NGP_RF'):
            print(f'Fitting random forest for {name} with treatment {treatment}', end=' ...')
            IDT_res[treatment] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}

            fit = RF_model_CV(dataset, treatment, n_trees=100)
            IDT_res[treatment].update(fit)

            print(f'done')
        
        # Choose the Intensity-based Data pre-Treatment (IDT) with the highest accuracy
        #print(np.mean(IDT_res['NGP_RF']['accuracy']), np.mean(IDT_res['NGP']['accuracy']))
        best_idt = 'NGP_RF' if np.mean(IDT_res['NGP_RF']['accuracy']) >= np.mean(IDT_res['NGP']['accuracy']) else 'NGP'
        rfname = name + ' ' + 'IDT'
        RF_all[rfname] = IDT_res[best_idt]
        RF_all[rfname]['treatment'] = 'IDT'
        
        for treatment in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
            print(f'Fitting random forest for {name} with treatment {treatment}', end=' ...')
            rfname = name + ' ' + treatment
            RF_all[rfname] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}

            fit = RF_model_CV(dataset, treatment, n_trees=100)
            RF_all[rfname].update(fit)

            print(f'done')    
            
    # Store Results
    fname = 'store_files/RF_all.json'
    with open(fname, "w", encoding='utf8') as write_file:
        json.dump(RF_all, write_file)

In [None]:
# Read prior results
fname = 'store_files/RF_all.json'
with open(fname, "r", encoding='utf8') as read_file:
    RF_all = json.load(read_file)

#### Results of the Random Forest - Performance (Predictive Accuracy) 

In [None]:
# Accuracy across the iterations
accuracies = pd.DataFrame({name: RF_all[name]['accuracy'] for name in RF_all})
accuracies

#### Distributions of Predictive Accuracies for _GDg2-_

In [None]:
column_names = ['IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11']

# Violin plot of the distribution of the predictive accuracy (in %) across the iterations of randomly sampled folds for each 
# differently-treated dataset.

cols2keep = [col for col in accuracies.columns if 'neg_global2' in col]

with sns.axes_style("whitegrid"):
    f, ax = plt.subplots(figsize=(14,6))
    res100 = accuracies[cols2keep] * 100
    res100.columns = column_names

    colors = treat_colors
    sns.violinplot(data=res100, palette=colors)

    plt.ylabel('Prediction Accuracy (%) - Random Forest', fontsize=13)
    plt.ylim([25,100])
    ax.tick_params(axis='x', which='major', labelsize = 18)
    ax.tick_params(axis='y', which='major', labelsize = 15)
    for ticklabel, tickcolor in zip(plt.gca().get_xticklabels(), colors):
        ticklabel.set_color(tickcolor)
    f.suptitle('Predictive Accuracy of Random Forest models - GD alignment global2 Negative Mode', fontsize=16)

#### Average Predictive Accuracies of Random Forest models

Error bars were built based on the standard deviation of the predictive accuracies.

In [None]:
accuracy_stats = pd.DataFrame({'Average accuracy': accuracies.mean(axis=0),
                               'STD': accuracies.std(axis=0)})

accuracy_stats = accuracy_stats.assign(dataset=[RF_all[name]['dataset'] for name in RF_all],
                                       treatment=[RF_all[name]['treatment'] for name in RF_all])
accuracy_stats

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.3):
        f, ax = plt.subplots(1, 1, figsize=(16, 6))
        x = np.arange(len(datasets))  # the label locations
        labels = [datasets[name]['name'] for name in datasets]
        width = 0.1  # the width of the bars
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats[accuracy_stats['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            ax.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        ax.set_xticks(x)
        ax.set_xticklabels(labels)
        ax.set(ylabel='Average accuracy', title='', ylim=(0.3,1.03))
        ax.text(-0.5, 0.95, 'A', weight='bold', fontsize=15)
        ax.legend(loc='upper left', bbox_to_anchor=(0.10, 1), ncol=2, fontsize=10)
        #f.savefig('images/RF_performance.pdf' , dpi=200)
        #f.savefig('images/RF_performance.png' , dpi=600)

#### ROC curves

ROC curves are computed for the `vitis_types` and `HD` classifiers only (2-class problems) using `RandomForestClassifier` from scikit-learn in the function `RF_ROC_cv` defined below (a version of the one in multianalysis.py that reads the stored fold splits).
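
Because each fold yields a ROC curve sampled at different false positive rates, the curves are interpolated onto a common grid before averaging. A small sketch of that step, with made-up labels and scores for two folds:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up labels and scores for two folds (1 = positive class).
folds = [
    (np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])),
    (np.array([0, 1, 0, 1]), np.array([0.2, 0.9, 0.3, 0.7])),
]

mean_fpr = np.linspace(0, 1, 100)      # common grid of false positive rates
tprs = []
for y_true, scores in folds:
    fpr, tpr, _ = roc_curve(y_true, scores)
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0                # force the curve through (0, 0)
    tprs.append(interp_tpr)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0                     # force the curve through (1, 1)
mean_auc = auc(mean_fpr, mean_tpr)
print(round(mean_auc, 3))
```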

In [None]:
def RF_ROC_cv(datasets, treatment, pos_label, n_trees=200, n_iter=1):
    """Fits Random Forest models on the first `n_iter` iterations of fold splits saved in the dataset storage
       (by default, the 1st iteration only). It then calculates metrics to plot a ROC curve."""

    ds_iter = datasets['iter_fold_splits']
    
    # Run classifier with cross-validation and plot ROC curves
    classifier = skensemble.RandomForestClassifier(n_estimators=n_trees)

    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    
    # Iteration number cannot be bigger than the number of iterations available in storage
    if n_iter > len(ds_iter.keys()):
        n_iter = len(ds_iter.keys())
    
    # Number of times Random Forest cross-validation is made
    # with `n_fold` randomly generated folds.
    for itr in range(n_iter):
        # Fit and evaluate a Random Forest model for each fold in stratified k-fold cross validation
        for fold in ds_iter[itr+1]['train'].keys():
            
            # Binarize the target labels (1 for `pos_label`, 0 otherwise) and split them into train and test
            train_group_len = len(ds_iter[itr+1]['train'][fold]['target'])
            labels = ds_iter[itr+1]['train'][fold]['target'] + ds_iter[itr+1]['test'][fold]['target']
            target = [lbl==pos_label for lbl in labels]
            y = np.array(target, dtype=int)
            y_train, y_test = y[:train_group_len], y[train_group_len:]
            
            # Fit the rf classifier
            classifier.fit(ds_iter[itr+1]['train'][fold][treatment], y_train)
            
            # Metrics for ROC curve plotting
            scores = classifier.predict_proba(ds_iter[itr+1]['test'][fold][treatment])[:, 1]

            fpr, tpr, _ = roc_curve(y_test, scores)

            interp_tpr = np.interp(mean_fpr, fpr, tpr)
            interp_tpr[0] = 0.0
            tprs.append(interp_tpr)
            aucs.append(roc_auc_score(y_test, scores))
    
    # Mean of every fold of the cross-validation
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)

    return {'average fpr': mean_fpr, 'average tpr': mean_tpr, 
            'upper tpr': tprs_upper, 'lower tpr': tprs_lower,
            'mean AUC': mean_auc, 'std AUC': std_auc}

In [None]:
np.random.seed(16)
names = ['vitis_types','HD']
pos_labels = ['vinifera', 'Recurrence']
resROC = {}

# Perform and obtain the results for the ROC curves
for name, pos_label in zip(names, pos_labels):
    np.random.seed(16)
    dataset = datasets[name]
    #datasets[name]['IDT'] = datasets[name]['NGP']
    y = dataset['target']
    resROC[name] = {}
    for treatment in ('NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
        #df = dataset[treatment]
        res = RF_ROC_cv(dataset, treatment, pos_label, n_trees=100, n_iter=20)
        # Selecting the best out of NGP and NGP_RF intensity-based data pre-treatments based on AUC and storing as 'IDT'
        if treatment == 'NGP':
            res_temp = res
        elif treatment == 'NGP_RF':
            if res['mean AUC'] >= res_temp['mean AUC']:
                resROC[name]['IDT'] = res
            else:
                resROC[name]['IDT'] = res_temp
        # Data matrices from the network analyses of sMDiNs
        else:
            resROC[name][treatment] = res

In [None]:
# Plot the ROC curves 
p4 = treat_colors[:7]
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, axs = plt.subplots(1, 2, figsize=(12,5), constrained_layout=True)
        for rROC, ax in zip(resROC.values(), axs.ravel()):
            for treatment, color in zip(rROC, p4):
                res = rROC[treatment]
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                ax.plot(mean_fpr, mean_tpr, color=color,
                       label=f'{treatment} (AUC = {mean_auc:.3f})',
                       lw=2, alpha=0.8)
            ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
            ax.legend()
            ax.set_xlim(None,1)
            ax.set_ylim(0,None)
            ax.set(xlabel='False positive rate', ylabel='True positive rate')
            
        #for letter, ax in zip('AB', axs.ravel()[0:2]):
        #    ax.text(0.88, 0.9, letter, ha='left', va='center', fontsize=15, weight='bold',
        #            transform=ax.transAxes,
        #            bbox=dict(facecolor='white', alpha=0.9))
        
        f.savefig('images/ROC_vitis.pdf', dpi=300)
        f.savefig('images/ROC_vitis.jpg', dpi=300)

### Important feature analysis - Random Forest

In [None]:
def build_impfeat_table(Imp_Feats, colnames=['Feature', 'Importance Metric']):
    "Transform Imp. Feat. into a table (df) with the place, name of the feature and the score of the importance metric."
    Imp_Feats_Table = pd.DataFrame(columns=colnames)
    for n in range(len(Imp_Feats)):
        Imp_Feats_Table.loc[n+1] = Imp_Feats[n][0], Imp_Feats[n][1]
    Imp_Feats_Table.index.name = 'Place'
    
    return Imp_Feats_Table

In [None]:
MDBI_RF_Imp_Feat = {}
for dskey, ds in RF_all.items():
    if dskey.endswith(' MDBI'):
        name, treat = dskey.split()
        #print(name, treat)
        MDBI_RF_Imp_Feat[name] = build_impfeat_table(ds['important_features'], colnames=['Feature', 'Gini Imp.'])
        
WMDBI_RF_Imp_Feat = {}
for dskey, ds in RF_all.items():
    if dskey.endswith('WMDBI'):
        name, treat = dskey.split()
        #print(name, treat)
        WMDBI_RF_Imp_Feat[name] = build_impfeat_table(ds['important_features'], colnames=['Feature', 'Gini Imp.'])

In [None]:
MDBI_RF_Imp_Feat['vitis_types']

In [None]:
MDBI_RF_Imp_Feat['YD']

In [None]:
WMDBI_RF_Imp_Feat['YD']

A heatmap of the MDB Impact data matrix obtained for the YD benchmark dataset is built below.

For visualization purposes, auto-scaling was applied so that MDBs with different magnitudes can be compared on the same color scale.
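
As a minimal sketch of what auto-scaling does, assuming it means centering each feature (column) on its mean and dividing by its standard deviation. Whether `FeatureScaler(method='standard')` uses the sample (`ddof=1`) or population standard deviation is not checked here; `autoscale` is a hypothetical helper and the small matrix is made up.

```python
import numpy as np

def autoscale(X, ddof=1):
    """Center each column on its mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)  # ddof=1 is an assumption of this sketch
    return (X - mean) / std

# Two made-up features with very different magnitudes (rows are samples)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
X_scaled = autoscale(X)
```

After scaling, both columns span the same range, which is what makes a shared color scale (here `vmin=-3`, `vmax=3`) meaningful across MDBs.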

### MDBI

In [None]:
#sns.heatmap
f, ax = plt.subplots(figsize=(10,6))

tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['MDBI'].loc[:,list(MDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = MDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

g = sns.heatmap(df.T, xticklabels=False, yticklabels=MDB_2, cmap='PRGn', vmin=-3, vmax=3)
g.set_yticklabels(g.get_ymajorticklabels(), fontsize = 14)

# Manually specify colorbar labelling after it's been generated
colorbar = g.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

# thick line between the samples of different classes
for i in range(3,15,3):
    ax.axvline(i, color='white', lw=5)
ax.tick_params(length=0)
plt.text(1.5, 17.8, 'WT', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['WT'])
plt.text(4.5, 17.8, 'ΔGRE3', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGRE3'])
plt.text(7.5, 17.8, 'ΔENO1', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔENO1'])
plt.text(10.5, 17.8, 'ΔGLO1', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGLO1'])
plt.text(13.5, 17.8, 'ΔGLO2', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGLO2'])
#plt.plot([0, 3], [15, 15])
#plt.hlines(14, 0, 3, linewidths=2)
f.savefig('images/heatmap_MDBImpact.png' , dpi=300)
f.savefig('images/heatmap_MDBImpact.pdf' , dpi=300)

In [None]:
def perform_HCA(df, metric='euclidean', method='average'):
    "Performs Hierarchical Clustering Analysis of a data set with chosen linkage method and distance metric."
    
    distances = dist.pdist(df, metric=metric)
    
    # method is one of
    # ward, average, centroid, single, complete, weighted, median
    Z = hier.linkage(distances, method=method)

    # Cophenetic Correlation Coefficient
    # (see how the clustering - from hier.linkage - preserves the original distances)
    coph = hier.cophenet(Z, distances)
    # Baker's gamma
    mr = ma.mergerank(Z)
    bg = mr[mr!=0]

    return {'Z': Z, 'distances': distances, 'coph': coph, 'merge_rank': mr, "Baker's Gamma": bg}

In [None]:
HCA_as = perform_HCA(df, metric='Euclidean', method='ward')
HCA_bs = perform_HCA(datasets['YD']['MDBI'], metric='Euclidean', method='ward')

In [None]:
# Alternative dendrogram plots (newer version)
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

def color_list_to_matrix_and_cmap(colors, ind, axis=0):
        if any(issubclass(type(x), list) for x in colors):
            all_colors = set(itertools.chain(*colors))
            n = len(colors)
            m = len(colors[0])
        else:
            all_colors = set(colors)
            n = 1
            m = len(colors)
            colors = [colors]
        color_to_value = dict((col, i) for i, col in enumerate(all_colors))

        matrix = np.array([color_to_value[c]
                           for color in colors for c in color])

        matrix = matrix.reshape((n, m))
        matrix = matrix[:, ind]
        if axis == 0:
            # row-side:
            matrix = matrix.T

        cmap = mpl.colors.ListedColormap(all_colors)
        return matrix, cmap

def plot_dendogram2(Z, leaf_names, label_colors, title='', ax=None, no_labels=False, labelsize=12, **kwargs):
    if ax is None:
        ax = plt.gca()
    hier.dendrogram(Z, labels=leaf_names, leaf_font_size=10, above_threshold_color='0.2', orientation='left',
                    ax=ax, **kwargs)
    #Coloring labels
    #ax.set_ylabel('Distance (AU)')
    ax.set_xlabel('Distance (AU)')
    ax.set_title(title, fontsize = 15)
    
    #ax.tick_params(axis='x', which='major', pad=12)
    ax.tick_params(axis='y', which='major', labelsize=labelsize, pad=12)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    #xlbls = ax.get_xmajorticklabels()
    xlbls = ax.get_ymajorticklabels()
    rectimage = []
    for lbl in xlbls:
        col = label_colors[lbl.get_text()]
        lbl.set_color(col)
        #lbl.set_fontweight('bold')
        if no_labels:
            lbl.set_color('w')
        rectimage.append(col)

    cols, cmap = color_list_to_matrix_and_cmap(rectimage, range(len(rectimage)), axis=0)

    axins = inset_axes(ax, width="5%", height="100%",
                   bbox_to_anchor=(1, 0, 1, 1),
                   bbox_transform=ax.transAxes, loc=3, borderpad=0)

    axins.pcolor(cols, cmap=cmap, edgecolors='w', linewidths=1)
    axins.axis('off')

In [None]:
f, ax = plt.subplots(figsize=(5, 5))
name = 'YD'
title = f"Data set {datasets[name]['name']}, MDBI after scaling"
plot_dendogram2(HCA_as['Z'], 
               datasets['YD']['target'], ax=ax,
               label_colors=datasets['YD']['label_colors'], title=title,
               color_threshold=0)

In [None]:
tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['MDBI'].loc[:,list(MDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = [f'PO$_3$H', 'CO$_2$', 'CO', 'O', 'CH$_2$', 'S', 'H$_2$', 'NH$_3$(−O)', 'CCH$_3$COOH', 'NCH', 'CONH', 'SO$_3$', 
         'CHCOOH', 'C$_2$H$_2$O', 'O(−NH)', 'H$_2$O', 'CHOH']
#MDB_2 = MDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

row_cols = [datasets['YD']['label_colors'][lbl] for lbl in datasets['YD']['target']]

g = sns.clustermap(df.T, yticklabels=MDB_2 ,cmap='PRGn', vmin=-3, vmax=3, col_linkage=HCA_as['Z'],  # method='ward',
                    cbar_pos = (0.08, 0.1, 0.05, 0.5), col_colors=row_cols, linewidths=1.5,
                   row_cluster=False)

g.fig.set_size_inches((10,6))

# some tweaks
patches = []
for lbl in datasets['YD']['classes']:
    patches.append(mpatches.Patch(color=datasets['YD']['label_colors'][lbl], label=lbl))
g.ax_heatmap.tick_params(axis='y', labelsize=15)
g.ax_heatmap.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)

leg = plt.legend(handles=patches, loc=3, bbox_to_anchor=(-0.75, 1.05, 0.5, 1),
                     frameon=False, fontsize=14) 
    
# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

g.savefig('images/clustermap_MDBImpact.png' , dpi=300)
g.savefig('images/clustermap_MDBImpact.pdf' , dpi=300)

In [None]:
tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['MDBI'].loc[:,list(MDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = [f'PO$_3$H', 'CO$_2$', 'CO', 'O', 'CH$_2$', 'S', 'H$_2$', 'NH$_3$(−O)', 'CCH$_3$COOH', 'NCH', 'CONH', 'SO$_3$', 
         'CHCOOH', 'C$_2$H$_2$O', 'O(−NH)', 'H$_2$O', 'CHOH']
#MDB_2 = MDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

row_cols = [datasets['YD']['label_colors'][lbl] for lbl in datasets['YD']['target']]

g = sns.clustermap(df.T, yticklabels=MDB_2 ,cmap='PRGn', vmin=-3, vmax=3, col_linkage=HCA_as['Z'],  # method='ward',
                    cbar_pos = (0.08, 0.1, 0.05, 0.5), col_colors=row_cols, linewidths=1.5,
                   row_cluster=False)
g.fig.suptitle('         MDBI', y=1.02, fontsize = 16) 
g.fig.set_size_inches((6,6))

# some tweaks
patches = []
for lbl in datasets['YD']['classes']:
    patches.append(mpatches.Patch(color=datasets['YD']['label_colors'][lbl], label=lbl))
g.ax_heatmap.tick_params(axis='y', labelsize=13)
g.ax_heatmap.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)

# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=1) 

g.savefig('images/clustermap_MDBImpactFig.png' , dpi=300)
g.savefig('images/clustermap_MDBImpactFig.pdf' , dpi=300)

### WMDBI

In [None]:
#sns.heatmap
f, ax = plt.subplots(figsize=(10,6))

tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['WMDBI'].loc[:,list(WMDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = WMDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

g = sns.heatmap(df.T, xticklabels=False, yticklabels=MDB_2, cmap='PRGn', vmin=-3, vmax=3)
g.set_yticklabels(g.get_ymajorticklabels(), fontsize = 14)

# Manually specify colorbar labelling after it's been generated
colorbar = g.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

# thick line between the samples of different classes
for i in range(3,15,3):
    ax.axvline(i, color='white', lw=5)
ax.tick_params(length=0)
plt.text(1.5, 17.8, 'WT', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['WT'])
plt.text(4.5, 17.8, 'ΔGRE3', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGRE3'])
plt.text(7.5, 17.8, 'ΔENO1', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔENO1'])
plt.text(10.5, 17.8, 'ΔGLO1', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGLO1'])
plt.text(13.5, 17.8, 'ΔGLO2', ha='center', fontsize = 18, color = datasets['YD']['label_colors']['ΔGLO2'])
#plt.plot([0, 3], [15, 15])
#plt.hlines(14, 0, 3, linewidths=2)
f.savefig('images/heatmap_WMDBImpact.png' , dpi=300)
f.savefig('images/heatmap_WMDBImpact.pdf' , dpi=300)

In [None]:
HCA_as = perform_HCA(df.replace({np.nan:0}), metric='Euclidean', method='ward')
#HCA_bs = perform_HCA(datasets['YD']['WMDBI'], metric='Euclidean', method='ward')

In [None]:
f, ax = plt.subplots(figsize=(5, 5))
name = 'YD'
title = f"Data set {datasets[name]['name']}, WMDBI after scaling"
plot_dendogram2(HCA_as['Z'], 
               datasets['YD']['target'], ax=ax,
               label_colors=datasets['YD']['label_colors'], title=title,
               color_threshold=0)

In [None]:
tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['WMDBI'].loc[:,list(WMDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = [f'PO$_3$H', 'NCH', 'H$_2$', 'CO', 'CCH$_3$COOH', 'CH$_2$', 'O', 'CONH', 'S', 'CHCOOH', 'NH$_3$(−O)', 'H$_2$O',
         'CO$_2$', 'O(−NH)', 'C$_2$H$_2$O', 'SO$_3$', 'CHOH']
#MDB_2 = WMDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

row_cols = [datasets['YD']['label_colors'][lbl] for lbl in datasets['YD']['target']]

g = sns.clustermap(df.T, yticklabels=MDB_2 ,cmap='PRGn', vmin=-3, vmax=3, col_linkage=HCA_as['Z'],  # method='ward',
                    cbar_pos = (0.08, 0.1, 0.05, 0.5), col_colors=row_cols, linewidths=1.5,
                   row_cluster=False)

g.fig.set_size_inches((10,6))

# some tweaks
patches = []
for lbl in datasets['YD']['classes']:
    patches.append(mpatches.Patch(color=datasets['YD']['label_colors'][lbl], label=lbl))
g.ax_heatmap.tick_params(axis='y', labelsize=15)
g.ax_heatmap.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)

leg = plt.legend(handles=patches, loc=3, bbox_to_anchor=(-0.75, 1.05, 0.5, 1),
                     frameon=False, fontsize=14)
    
# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

g.savefig('images/clustermap_WMDBImpact.png' , dpi=300)
g.savefig('images/clustermap_WMDBImpact.pdf' , dpi=300)

In [None]:
tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['WMDBI'].loc[:,list(WMDBI_RF_Imp_Feat['YD']['Feature'])])

MDB_2 = [f'PO$_3$H', 'NCH', 'H$_2$', 'CO', 'CCH$_3$COOH', 'CH$_2$', 'O', 'CONH', 'S', 'CHCOOH', 'NH$_3$(−O)', 'H$_2$O',
         'CO$_2$', 'O(−NH)', 'C$_2$H$_2$O', 'SO$_3$', 'CHOH']
#MDB_2 = WMDBI_RF_Imp_Feat['YD'].loc[:,'Feature']

row_cols = [datasets['YD']['label_colors'][lbl] for lbl in datasets['YD']['target']]

g = sns.clustermap(df.T, yticklabels=MDB_2 ,cmap='PRGn', vmin=-3, vmax=3, col_linkage=HCA_as['Z'],  # method='ward',
                    cbar_pos = (0.08, 0.1, 0.05, 0.5), col_colors=row_cols, linewidths=1.5,
                   row_cluster=False)
g.fig.suptitle('        WMDBI', y=1.02, fontsize = 16) 
g.fig.set_size_inches((6,6))

# some tweaks
patches = []
for lbl in datasets['YD']['classes']:
    patches.append(mpatches.Patch(color=datasets['YD']['label_colors'][lbl], label=lbl))
g.ax_heatmap.tick_params(axis='y', labelsize=13)
g.ax_heatmap.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
    
# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=1)

g.savefig('images/clustermap_WMDBImpactFig.png' , dpi=300)
g.savefig('images/clustermap_WMDBImpactFig.pdf' , dpi=300)

## Projection to Latent Structures Discriminant Analysis (PLS-DA)

PLS-DA models were built using the `PLSRegression` of scikit-learn.

**Decision Rule**

For multi-class problems, class membership was one-hot encoded, and each sample was assigned to the class corresponding to the maximum value in the predicted Y (ypred) of the PLS output. For two-class problems, class membership was encoded as 0 or 1, with a decision threshold of 0.5.
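
The two decision rules can be sketched as follows. This is an illustration rather than the notebook's exact implementation: `decide_multiclass` and `decide_two_class` are hypothetical helpers, and the class names and predicted values are made up.

```python
import numpy as np

def decide_multiclass(y_pred, classes):
    """Assign each sample to the class with the largest predicted value."""
    return [classes[k] for k in np.argmax(y_pred, axis=1)]

def decide_two_class(y_pred, threshold=0.5):
    """Assign class 1 when the prediction exceeds the threshold, else class 0."""
    return (np.ravel(y_pred) > threshold).astype(int)

classes = ['A', 'B', 'C']  # made-up class names, one per one-hot column
y_pred_multi = np.array([[0.8, 0.1, 0.1],
                         [0.2, 0.3, 0.6]])
y_pred_two = np.array([0.91, 0.12, 0.55])

multi_labels = decide_multiclass(y_pred_multi, classes)
two_labels = decide_two_class(y_pred_two)
```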

### Optimization - Search for the best number of components of PLS model

The number of components was optimized using 1 - PRESS/SS, or Q$^2$ (the PLS score), of models built with 1 to n components.

PRESS - Predictive Residual Sum of Squares; SS - residual Sum of Squares
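
The score can be written out explicitly for a single response vector. This sketch assumes SS is taken as the sum of squares of the held-out responses around their mean, which is the convention behind scikit-learn's R² score; `q2_score` is a hypothetical helper and the values are made up.

```python
import numpy as np

def q2_score(y_true, y_pred):
    """Q2 = 1 - PRESS/SS for a single response (assumption: SS is taken about the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    press = np.sum((y_true - y_pred) ** 2)       # squared prediction errors on held-out samples
    ss = np.sum((y_true - y_true.mean()) ** 2)   # spread of the held-out responses
    return 1.0 - press / ss

y_true = np.array([0.0, 0.0, 1.0, 1.0])  # 0/1 encoded class membership
y_pred = np.array([0.1, 0.3, 0.8, 0.9])  # made-up held-out predictions
q2 = q2_score(y_true, y_pred)
```

A perfect prediction gives a score of 1, and the score decreases as the prediction errors grow.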

Strategy: build PLS-DA models with different numbers of components and extract the PLS score (inversely related to the mean squared error) of the models, estimated with stratified k-fold cross-validation. Then observe at which number of components the PLS score starts approaching a "stable maximum value". This was done using the modified `optim_PLSDA_n_components` from multianalysis.py shown below.
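
In this notebook the "stable maximum" is read from the plotted curves. One possible way to make that reading reproducible, offered only as a sketch, is to take the smallest number of components whose score lies within a tolerance of the best score. `pick_n_components`, its `tol` parameter and the scores below are hypothetical.

```python
def pick_n_components(scores, tol=0.01):
    """Smallest number of components whose score is within `tol` of the best score."""
    best = max(scores)
    for n, score in enumerate(scores, start=1):
        if score >= best - tol:
            return n

# Made-up cross-validated scores for models with 1 to 7 components
cv_scores = [0.35, 0.52, 0.61, 0.655, 0.66, 0.658, 0.65]
n_comp = pick_n_components(cv_scores)
```

Preferring the smallest number of components near the plateau keeps the model simpler, at the cost of depending on the chosen tolerance.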

These regression metrics are not suitable for evaluating the performance of the classifier; they were only used to optimize the number of components for the final PLS-DA models.

In [None]:
GENERATE = True #False

In [None]:
# Altered version of optim_PLSDA_n_components taking into account the treated datasets in the 'iter_fold_splits' key.
# Inside this dict, there are many nested dicts that culminate in the independently treated training and test dataset for
# all iterations and fold splits (for k-fold cross-validation) as well as each tested data pre-treatment.
def optim_PLSDA_n_components(datasets, encode2as1vector=True, scale=False, max_comp=50):
    
    ds_iter = datasets['iter_fold_splits']
    
    # Preparing lists to store results
    CVs = []
    CVr2s = []
    MSEs = []
    Accuracy = []

    # Repeating for each component from 1 to max_comp
    for i in range(1, max_comp + 1):
        
        # Setting up storing variables for n-fold cross-validation
        cv = []
        cvr2 = []
        mse = []
        accuracy = []
        nright = 0
        
        # Fit and evaluate a PLS-DA model for each fold in stratified k-fold cross-validation
        # (only the folds of iteration 1 are used)
        for fold in ds_iter[1]['train'].keys():
            
            # Set up the Y matrix for PLSRegression
            train_group_len = len(ds_iter[1]['train'][fold]['target'])
            labels = ds_iter[1]['train'][fold]['target'] + ds_iter[1]['test'][fold]['target']
            unique_labels = list(pd.unique(labels))
            is1vector = len(unique_labels) == 2 and encode2as1vector
            matrix = ma._generate_y_PLSDA(labels, unique_labels, is1vector)
            
            # Divide the Y into the respective training and testing sets
            if is1vector:
                # keep a copy to use later
                target1D = matrix.copy()
                correct = target1D[train_group_len:]
                y_train, y_test = matrix[:train_group_len], matrix[train_group_len:]
            else:
                y_train, y_test = matrix.iloc[:train_group_len], matrix.iloc[train_group_len:]
                
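            # Fit PLS model
            # NOTE: `treatment` is not a parameter of this function; it is read from the enclosing
            # (global) scope, so it must be defined before the function is called.
            plsda = PLSRegression(n_components=i, scale=scale)
            plsda.fit(X=ds_iter[1]['train'][fold][treatment], Y=y_train)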

            # Obtain results with the test group (predict once and reuse the prediction)
            y_pred = plsda.predict(ds_iter[1]['test'][fold][treatment])
            cv.append(plsda.score(ds_iter[1]['test'][fold][treatment], y_test))
            cvr2.append(r2_score(y_test, y_pred))
            mse.append(mean_squared_error(y_test, y_pred))
            
            # Decision rule for classification
            # Decision rule chosen: sample belongs to group where it has max y_pred (closer to 1)
            # In case of 1,0 encoding for two groups, round to nearest integer to compare
            # `j` indexes the samples so that `i` (the number of components) is not overwritten

            if not is1vector:
                for j in range(len(y_pred)):
                    if list(y_test.iloc[j, :]).index(max(y_test.iloc[j, :])) == np.argmax(
                        y_pred[j]
                    ):
                        nright += 1  # Correct prediction
            else:
                rounded = np.round(y_pred)
                for j in range(len(y_pred)):
                    if rounded[j] == correct[j]:
                        nright += 1  # Correct prediction

            # Calculate accuracy for this iteration
            accuracy.append(nright / len(labels))

        # Storing results for each number of components
        CVs.append(np.mean(cv))
        CVr2s.append(np.mean(cvr2))
        MSEs.append(np.mean(mse))
        Accuracy.append(np.mean(accuracy)) # not used yet...

    return PLSDA_optim_results(CVscores=CVs, CVR2=CVr2s, MSE=MSEs)

PLSDA_optim_results = namedtuple('PLSDA_optim_results', 'CVscores CVR2 MSE')

In [None]:
%%capture --no-stdout
np.random.seed(16)
GENERATE=True
if GENERATE:
    # `%%capture --no-stdout` at the top of this cell suppresses the PLS warnings
    treatments = ('NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')

    # Store Results
    PLS_optim = {}

    # Build models with different numbers of components and extract their metrics using optim_PLSDA_n_components.
    for name, dataset in datasets.items():
        for treatment in treatments:
            print(f'Fitting PLS-DA model for {name} with treatment {treatment}', end=' ...')
            plsdaname = name + ' ' + treatment
            PLS_optim[plsdaname] = {'dskey': name, 'dataset':dataset['name'], 'treatment':treatment}

            if name.startswith('YD'):
                max_comp = 10
            elif treatment.endswith('MDBI'):
                max_comp = 10
            else:
                max_comp = 15
            
            scale = True
            if treatment in ('NGP', 'NGP_RF'):
                scale = False
            
            optim = optim_PLSDA_n_components(dataset,
                                            max_comp=max_comp, scale=scale).CVscores
            
            PLS_optim[plsdaname]['CV_scores'] = optim
            print(f'done')

    import json  # required for json.dump; harmless if already imported in an earlier cell
    fname = 'store_files/PLSDA_optim.json'
    with open(fname, "w", encoding='utf8') as write_file:
        json.dump(PLS_optim, write_file)


In [None]:
# Read prior results
import json  # required for json.load; harmless if already imported in an earlier cell
fname = 'store_files/PLSDA_optim.json'
with open(fname, "r", encoding='utf8') as read_file:
    PLS_optim = json.load(read_file)

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, axs = plt.subplots(3, 2, figsize=(12,15), constrained_layout=True)
        ranges = [range(i, i+8) for i in (0, 8, 16, 24, 32)]
        titles = ['GDg2-', 'GDc2-', 'YD', 'GD types', 'HD']
        col = treat_colors[-2:] + treat_colors[1:]
        for irange, title, ax in zip(ranges, titles, axs.ravel()):   
            for i in irange:
                name = list(PLS_optim.keys())[i]
                ax.plot(range(1, len(PLS_optim[name]['CV_scores']) + 1), PLS_optim[name]['CV_scores'], 
                         label=PLS_optim[name]['treatment'], color = col[i%8])
            ax.set(xlabel='Number of Components',
                    ylabel='PLS Score (1 - PRESS/SS)',
                    title=title)
            ax.legend(loc='lower left')
            ax.set_ylim([0, 1])
            
        axs[2][1].remove()

        #for letter, ax in zip('ABCDEFGH', axs.ravel()):
        #    ax.text(0.88, 0.9, letter, ha='left', va='center', fontsize=15, weight='bold',
        #        transform=ax.transAxes,
        #        bbox=dict(facecolor='white', alpha=0.9))

        #f.suptitle('Optimization of the number of trees')
        plt.show()

For the YD and GD types, PLS-DA models are going to be built with 6 components.

For the other GD datasets, PLS-DA models are going to be built with 10 components.

For the HD, PLS-DA models are going to be built with 10 components.

For the MDBI and WMDBI analyses of Sample MDiNs, PLS-DA models are going to be built with 4 and 6 components, respectively.

### PLS-DA models

PLS-DA models were built with scikit-learn's `PLSRegression` (PLS2 algorithm) using `PLSDA_model_CV` from multianalysis.py, where each step is explained in more detail.

This function performs n iterations of randomly sampling the folds for k-fold cross-validation, so that more combinations of training and test samples are used to offset the small number of samples per group in the datasets.
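
The idea of repeating a stratified k-fold split with fresh random folds can be sketched as below. How the splits stored under `'iter_fold_splits'` were actually generated is not shown in this notebook, so `repeated_stratified_folds`, its parameters and the labels are all hypothetical.

```python
import numpy as np

def repeated_stratified_folds(labels, n_fold=3, n_iter=2, seed=16):
    """Yield (iteration, fold, train_idx, test_idx) for repeated stratified k-fold CV."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    for itr in range(1, n_iter + 1):
        fold_of = np.empty(len(labels), dtype=int)
        for cls in np.unique(labels):
            # Shuffle the samples of this class, then deal them round-robin into the folds
            idx = rng.permutation(np.flatnonzero(labels == cls))
            fold_of[idx] = np.arange(len(idx)) % n_fold
        for fold in range(n_fold):
            test_idx = np.flatnonzero(fold_of == fold)
            train_idx = np.flatnonzero(fold_of != fold)
            yield itr, fold + 1, train_idx, test_idx

labels = ['WT'] * 3 + ['mutant'] * 3  # made-up labels, 3 samples per class
splits = list(repeated_stratified_folds(labels))
```

Each iteration reshuffles the samples within each class, so the same sample is tested alongside different training sets across iterations.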

It then stores predictive accuracy of the models, the Q$^2$ score (across the iterations) and an ordered list of the most to least important features (average across the iterations) in building the model according to a chosen feature importance metric.

The function allows the choice of 3 different feature importance metrics (feat_type):

- **VIP (Variable Importance/Influence in Projection)** - used in the paper
- Coef. (regression coefficients - sum)
- Weights (Sum of the X-weights for each feature)
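
Of the metrics listed above, VIP is the least obvious to compute. The sketch below uses a commonly used VIP formula, VIP$_j$ = sqrt(p · Σ$_a$ SS$_a$ (w$_{ja}$/‖w$_a$‖)² / Σ$_a$ SS$_a$), where SS$_a$ is the Y variation explained by component a. `vip_scores` is a hypothetical helper and `ma._calculate_vips` may differ in detail. With a fitted `PLSRegression`, the inputs would presumably be its `x_scores_`, `x_weights_` and `y_loadings_` attributes; random matrices are used here only to keep the example self-contained.

```python
import numpy as np

def vip_scores(x_scores, x_weights, y_loadings):
    """VIP per feature from X scores T (n x A), X weights W (p x A) and Y loadings Q (m x A)."""
    t = np.asarray(x_scores, dtype=float)
    w = np.asarray(x_weights, dtype=float)
    q = np.asarray(y_loadings, dtype=float)
    n_features = w.shape[0]
    # Y variation explained by each component: (t_a' t_a) * (q_a' q_a)
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)
    # Squared, column-normalized weights: share of each feature in each component
    w_norm_sq = (w / np.linalg.norm(w, axis=0)) ** 2
    return np.sqrt(n_features * (w_norm_sq @ ss) / ss.sum())

# Random stand-ins for a model with 10 samples, 5 features, 2 components and 1 response
rng = np.random.default_rng(0)
T = rng.normal(size=(10, 2))
W = rng.normal(size=(5, 2))
Q = rng.normal(size=(1, 2))
vips = vip_scores(T, W, Q)
```

With this definition the squared VIP values average to 1 over the features, which is why 1 is often used as a rule-of-thumb cut-off for "important" features.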

In [None]:
# PLSDA_model_CV - PLSDA application and result extraction.
# Altered version of PLSDA_model_CV taking into account the treated datasets in the 'iter_fold_splits' key of each dataset dict.
# Inside this dict, there are many nested dicts that culminate in the independently treated training and test dataset for all
# iterations and fold splits (for k-fold cross-validation) as well as each tested data pre-treatment.
def PLSDA_model_CV(datasets, treatment, n_comp=10,
                   encode2as1vector=True,
                   scale=False,
                   feat_type='Coef'):
    
    # Setting up lists and matrices to store results
    CVR2 = []
    accuracies = []
    #Imp_Feat = np.zeros((iter_num * n_fold, df.shape[1]))
    Imp_Feat = {}
    ds_iter = datasets['iter_fold_splits']
    f = 0
        
    # Number of times PLS-DA cross-validation is made
    # with `n_fold` randomly generated folds.
    for itr in range(len(ds_iter.keys())):

        # Setting up storing variables for n-fold cross-validation
        nright = 0
        cvr2 = []
        
        # Fit and evaluate a PLS-DA model for each fold in stratified k-fold cross validation
        for fold in ds_iter[itr+1]['train'].keys():
            
            # Set up the Y matrix for PLSRegression
            train_group_len = len(ds_iter[itr+1]['train'][fold]['target'])
            labels = ds_iter[itr+1]['train'][fold]['target'] + ds_iter[itr+1]['test'][fold]['target']
            unique_labels = list(pd.unique(labels))
            is1vector = len(unique_labels) == 2 and encode2as1vector
            matrix = ma._generate_y_PLSDA(labels, unique_labels, is1vector)

            # Divide the Y into the respective training and testing sets
            if is1vector:
                # keep a copy to use later
                target1D = matrix.copy()
                correct = target1D[train_group_len:]
                y_train, y_test = matrix[:train_group_len], matrix[train_group_len:]
            else:
                y_train, y_test = matrix.iloc[:train_group_len], matrix.iloc[train_group_len:]
                
            #print(fold, itr+1)
            # Fit PLS model
            plsda = PLSRegression(n_components=n_comp, scale=scale)
            plsda.fit(X=ds_iter[itr+1]['train'][fold][treatment], Y=y_train)

            # Obtain results with the test group
            y_pred = plsda.predict(ds_iter[itr+1]['test'][fold][treatment])
            cvr2.append(r2_score(y_test, y_pred))

            # Decision rule for classification
            # Decision rule chosen: sample belongs to group where it has max y_pred (closer to 1)
            # In case of 1,0 encoding for two groups, round to nearest integer to compare
            # if not is1vector:
            #     for i in range(len(y_pred)):
            #         where_max = np.argmax(y_pred[i])

            if not is1vector:
                for i in range(len(y_pred)):
                    if list(y_test.iloc[i, :]).index(max(y_test.iloc[i, :])) == np.argmax(
                        y_pred[i]
                    ):
                        nright += 1  # Correct prediction
            else:
                rounded = np.round(y_pred)
                for i in range(len(y_pred)):
                    if rounded[i] == correct[i]:
                        nright += 1  # Correct prediction

            # Calculate important features (3 different methods to choose from)
            if feat_type == 'VIP':
                VIPS = ma._calculate_vips(plsda)
                Imp_Feat[str(itr+1)+'-'+str(fold)] = dict(zip(ds_iter[itr+1]['train'][fold][treatment].columns,
                                                      VIPS)) # Importance of each feature
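            elif feat_type == 'Coef':
                # NOTE: assumes `coef_` has shape (n_features, n_targets), as in scikit-learn < 1.3;
                # newer versions return (n_targets, n_features), which would require summing over axis=0.
                Imp_Feat[str(itr+1)+'-'+str(fold)] = dict(zip(ds_iter[itr+1]['train'][fold][treatment].columns,
                                                      abs(plsda.coef_).sum(axis=1))) # Importance of each feature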
            elif feat_type == 'Weights':
                Imp_Feat[str(itr+1)+'-'+str(fold)] = dict(zip(ds_iter[itr+1]['train'][fold][treatment].columns,
                                                      abs(plsda.x_weights_).sum(axis=1))) # Importance of each feature
            else:
                raise ValueError(
                    'Type not Recognized. Types accepted: "VIP", "Coef", "Weights".'
                )

        # Calculate the prediction accuracy for this iteration and store the score results
        accuracies.append(nright / len(labels))
        CVR2.append(np.mean(cvr2))
        
    # Collect and order all important features values from each PLS-DA
    Imp_Feat = pd.DataFrame.from_dict(Imp_Feat).replace({np.nan:0})
    #print(len(Imp_Feat.columns))
    Imp_Feat_sum = (Imp_Feat.sum(axis=1)/ (len(Imp_Feat.columns)))
    sorted_Imp_Feat = Imp_Feat_sum.sort_values(ascending=False)
    # Convert them to a list of tuples so that they can be saved as JSON
    Imp_Feat = []
    for i in sorted_Imp_Feat.index:
        Imp_Feat.append((i, sorted_Imp_Feat.loc[i]))

    if len(ds_iter.keys()) == 1:
        return {'accuracy': accuracies[0], 'Q2': CVR2[0], 'important_features': Imp_Feat}
    else:
        return {'accuracy': accuracies, 'Q2': CVR2, 'important_features': Imp_Feat}
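
The aggregation step at the end of `PLSDA_model_CV` can be illustrated on toy data. The sketch below assumes two hypothetical fold results (`'1-0'` and `'1-1'`) with made-up feature names and scores; a feature missing from one fold counts as 0 before averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold importance scores (keys follow the 'iteration-fold' pattern)
imp_feat = {
    '1-0': {'feat_a': 2.0, 'feat_b': 1.0},
    '1-1': {'feat_a': 1.0, 'feat_c': 4.0},
}

# Features absent from a fold become NaN, which is replaced by 0
table = pd.DataFrame.from_dict(imp_feat).replace({np.nan: 0})

# Average over all fitted models and sort from most to least important
mean_importance = table.sum(axis=1) / len(table.columns)
ranked = mean_importance.sort_values(ascending=False)

# List of (feature, score) pairs with plain floats, convenient for JSON storage
ranked_pairs = [(feat, float(score)) for feat, score in ranked.items()]
print(ranked_pairs)
```

Because missing features are set to 0, a feature that appears in only a few folds is penalized in the average rather than ignored.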

In [None]:
import json  # used below to store the results (harmless if already imported)

GENERATE = True
np.random.seed(16)
if GENERATE:
    PLSDA_all = {}

    # For each differently-treated dataset, fit PLS-DA models on n randomly sampled folds (for stratified cross-validation)
    for name, dataset in datasets.items():
        
        # Intensity-based pre-treatments
        IDT_res = {}
        for treatment in ('NGP', 'NGP_RF'):
            print(f'Fitting a PLS-DA model to {name} with treatment {treatment}', end=' ...')
            IDT_res[treatment] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}
            if name.startswith('GD'):
                n_comp = 10
            elif name.startswith('HD'):
                n_comp = 10
            else:
                n_comp = 6
            
            fit = PLSDA_model_CV(dataset, treatment,
                                 scale=False,
                                 n_comp=n_comp,
                                 feat_type='VIP')

            IDT_res[treatment].update(fit)

            print('done')
        
        # Choose the Intensity-based Data pre-Treatment (IDT) with the highest accuracy
        #print(np.mean(IDT_res['NGP_RF']['accuracy']), np.mean(IDT_res['NGP']['accuracy']))
        if np.mean(IDT_res['NGP_RF']['accuracy']) >= np.mean(IDT_res['NGP']['accuracy']):
            best_idt = 'NGP_RF'
        else:
            best_idt = 'NGP'
        plsdaname = name + ' ' + 'IDT'
        PLSDA_all[plsdaname] = IDT_res[best_idt]
        PLSDA_all[plsdaname]['treatment'] = 'IDT'

        # Data matrices from the network analyses of sMDiNs
        for treatment in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
            print(f'Fitting a PLS-DA model to {name} with treatment {treatment}', end=' ...')
            plsdaname = name + ' ' + treatment
            PLSDA_all[plsdaname] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}
            if name.startswith('GD'):
                n_comp = 10
            elif name.startswith('HD'):
                n_comp = 10
            else:
                n_comp = 6
            
            if treatment == 'MDBI':
                n_comp = 4
            elif treatment == 'WMDBI':
                n_comp = 6
            
            fit = PLSDA_model_CV(dataset, treatment,
                                 scale=True,
                                 n_comp=n_comp,
                                 feat_type='VIP')
            PLSDA_all[plsdaname].update(fit)
            print('done')
    
    PLSDA_all
    fname = 'store_files/PLSDA_all.json'
    with open(fname, "w", encoding='utf8') as write_file:
        json.dump(PLSDA_all, write_file)
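
One detail of the JSON round trip is worth keeping in mind when the results are read back: the standard `json` module writes tuples as arrays, so the `(feature, score)` tuples come back as two-element lists. A minimal sketch with made-up values (the dictionary below is hypothetical, not an actual result):

```python
import json

# Hypothetical result entry shaped like one value of PLSDA_all
result = {'YD IDT': {'accuracy': [0.9, 1.0],
                     'important_features': [('feat_a', 1.5), ('feat_b', 0.5)]}}

restored = json.loads(json.dumps(result))

# Tuples are serialized as JSON arrays and therefore return as lists
first_pair = restored['YD IDT']['important_features'][0]
print(type(first_pair).__name__, first_pair)

# Convert back to tuples if downstream code relies on them
as_tuples = [tuple(pair) for pair in restored['YD IDT']['important_features']]
```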

#### Results of the PLS-DA - Performance (Predictive Accuracy) 

In [None]:
# Accuracy across iterations
# Read prior Results
fname = 'store_files/PLSDA_all.json'
with open(fname, "r", encoding='utf8') as read_file:
    PLSDA_all = json.load(read_file)

accuracies = pd.DataFrame({name: PLSDA_all[name]['accuracy'] for name in PLSDA_all})
accuracies

#### Distribution for _GDg2-_

In [None]:
column_names = ['IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11']

# Violin plot of the distribution of the predictive accuracy (in %) across the iterations of randomly sampled folds for each 
# differently-treated dataset.

cols2keep = [col for col in accuracies.columns if 'neg_global2' in col]

with sns.axes_style("whitegrid"):
    f, ax = plt.subplots(figsize=(14,6))
    res100 = accuracies[cols2keep] * 100
    res100.columns = column_names

    #colors = ['blue','orange','green','red']
    colors = treat_colors

    sns.violinplot(data=res100, palette=colors)

    plt.ylabel('Prediction Accuracy (%) - PLSDA', fontsize=13)
    plt.ylim([25,100])
    ax.tick_params(axis='x', which='major', labelsize = 18)
    ax.tick_params(axis='y', which='major', labelsize = 15)
    for ticklabel, tickcolor in zip(ax.get_xticklabels(), colors):
        ticklabel.set_color(tickcolor)
    f.suptitle('Predictive Accuracy of PLS-DA models - Grapevine Datasets global2', fontsize=16)
    #plt.title('Yeast Dataset', fontsize = 20)

In [None]:
accuracy_stats = pd.DataFrame({'Average accuracy': accuracies.mean(axis=0),
                               'STD': accuracies.std(axis=0)})
accuracy_stats = accuracy_stats.assign(dataset=[PLSDA_all[name]['dataset'] for name in PLSDA_all],
                                       treatment=[PLSDA_all[name]['treatment'] for name in PLSDA_all])
accuracy_stats

#### Average Predictive Accuracies of PLS-DA models

Error bars were built based on the standard deviation of the predictive accuracies.
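
As a small illustration of how such error bars are obtained, the sketch below computes the mean and standard deviation per column for a made-up accuracy table (column names and values are hypothetical). Note that the pandas `std` method uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical accuracies: one row per iteration, one column per model
acc = pd.DataFrame({'YD IDT': [0.8, 0.9, 1.0],
                    'YD Degree': [1.0, 1.0, 1.0]})

stats_table = pd.DataFrame({'Average accuracy': acc.mean(axis=0),
                            'STD': acc.std(axis=0)})  # ddof=1 (pandas default)
print(stats_table)

# The same spread computed with NumPy needs ddof=1 to match pandas
std_np = np.std(acc['YD IDT'].to_numpy(), ddof=1)
```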

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.3):
        f, ax = plt.subplots(1, 1, figsize=(16, 6))
        x = np.arange(len(datasets))  # the label locations
        labels = [datasets[name]['name'] for name in datasets]
        width = 0.1  # the width of the bars
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats[accuracy_stats['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            ax.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        ax.set_xticks(x)
        ax.set_xticklabels(labels)
        ax.set(ylabel='Average accuracy', title='', ylim=(0.3,1.02))
        #ax.text(-0.5, 0.95, 'B', weight='bold', fontsize=15)


### Accuracy plots for RF and PLS-DA together

In [None]:
accuraciesRF = pd.DataFrame({name: RF_all[name]['accuracy'] for name in RF_all})
accuracy_stats_RF = pd.DataFrame({'Average accuracy': accuraciesRF.mean(axis=0),
                                  'STD': accuraciesRF.std(axis=0)})
accuracy_stats_RF = accuracy_stats_RF.assign(dataset=[RF_all[name]['dataset'] for name in RF_all],
                                       treatment=[RF_all[name]['treatment'] for name in RF_all])


accuraciesPLSDA = pd.DataFrame({name: PLSDA_all[name]['accuracy'] for name in PLSDA_all})
accuracy_stats_PLSDA = pd.DataFrame({'Average accuracy': accuraciesPLSDA.mean(axis=0),
                                     'STD': accuraciesPLSDA.std(axis=0)})
accuracy_stats_PLSDA = accuracy_stats_PLSDA.assign(dataset=[PLSDA_all[name]['dataset'] for name in PLSDA_all],
                                       treatment=[PLSDA_all[name]['treatment'] for name in PLSDA_all])

In [None]:
def endminus(x):
    "Replace a trailing hyphen (-) with a minus sign (−) in dataset names."
    if x.endswith('-'):
        # Only the last character is replaced; hyphens elsewhere in the name are kept
        return x[:-1] + '−'
    return x

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.5):
        f, (axu, axl) = plt.subplots(2, 1, figsize=(16, 10), constrained_layout=True)
        x = np.arange(len(datasets))  # the label locations
        labels = [endminus(datasets[name]['name']) for name in datasets]
        width = 0.09  # the width of the bars
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats_RF[accuracy_stats_RF['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = axu.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            axu.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        axu.set_xticks(x)
        axu.set_xticklabels(labels, fontsize=20)
        axu.set(ylabel='Average accuracy', title='', ylim=(0.2,1.03))
        axu.text(-0.5, 0.95, 'A', weight='bold', fontsize=16)
        for spine in axu.spines.values():
            spine.set_edgecolor('0.1')
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats_PLSDA[accuracy_stats_PLSDA['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = axl.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            axl.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        axl.set_xticks(x)
        axl.set_xticklabels(labels, fontsize=20)
        axl.set(ylabel='Average accuracy', title='', ylim=(0.2,1.03))
        axl.text(-0.5, 0.95, 'B', weight='bold', fontsize=16)
        for spine in axl.spines.values():
            spine.set_edgecolor('0.1')
        axu.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=18)
        plt.show()
        #f.savefig('images/supervised_performance.pdf' , dpi=300)
        #f.savefig('images/supervised_performance.jpg' , dpi=300)

### Important feature analysis - PLS-DA

**The same procedure that was applied for Random Forest is used here.**

#### MDBI

In [None]:
MDBI_PLSDA_Imp_Feat = {}
for dskey, ds in PLSDA_all.items():
    if dskey.endswith(' MDBI'):
        name, treat = dskey.split()
        MDBI_PLSDA_Imp_Feat[name] = build_impfeat_table(ds['important_features'], colnames=['Feature', 'VIP Score'])

In [None]:
MDBI_PLSDA_Imp_Feat['YD']

In [None]:
MDBI_PLSDA_Imp_Feat['vitis_types']

#### WMDBI

In [None]:
WMDBI_PLSDA_Imp_Feat = {}
for dskey, ds in PLSDA_all.items():
    if dskey.endswith('WMDBI'):
        name, treat = dskey.split()
        WMDBI_PLSDA_Imp_Feat[name] = build_impfeat_table(ds['important_features'], colnames=['Feature', 'VIP Score'])

In [None]:
WMDBI_PLSDA_Imp_Feat['YD']

In [None]:
WMDBI_PLSDA_Imp_Feat['vitis_types']

### Example of Sample Projection on the two most important Components/Latent Variables of PLS models built with the full dataset and sample representation 

#### GDg2-, YD, GD types and HD after IDT (NGP or NGP_RF) or after Degree analysis of sMDiNs

In [None]:
# Functions for projection
def plot_PLS(principaldf, label_colors, components=(1,2), title="PLS", ax=None):
    "Plot the projection of samples in the 2 main components of a PLS-DA model."
    
    if ax is None:
        ax = plt.gca()
    
    loc_c1, loc_c2 = [c - 1 for c in components]
    col_c1_name, col_c2_name = principaldf.columns[[loc_c1, loc_c2]]
    
    #ax.axis('equal')
    ax.set_xlabel(f'{col_c1_name}')
    ax.set_ylabel(f'{col_c2_name}')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset = principaldf[principaldf['Label']==lbl]
        ax.scatter(subset[col_c1_name],
                   subset[col_c2_name],
                   s=50, color=label_colors[lbl], label=lbl)

    #ax.legend(framealpha=1)
    ax.set_title(title, fontsize=15)

def plot_ellipses_PLS(principaldf, label_colors, components=(1,2),ax=None, q=None, nstd=2):
    "Plot a confidence ellipse for each group of samples in the 2 main components of a PLS-DA model."
    
    if ax is None:
        ax = plt.gca()
    
    loc_c1, loc_c2 = [c - 1 for c in components]
    points = principaldf.iloc[:, [loc_c1, loc_c2]]
    
    #ax.axis('equal')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset_points = points[principaldf['Label']==lbl]
        plot_confidence_ellipse(subset_points, q, nstd, ax=ax, ec=label_colors[lbl], fc='none')
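
For readers unfamiliar with the `nstd` parameter, the sketch below shows one common way to derive an ellipse covering `nstd` standard deviations from the covariance of 2D points. `ellipse_params` is a hypothetical helper written for illustration; it is not the notebook's `plot_confidence_ellipse`, whose implementation is not shown in this section and may differ.

```python
import numpy as np

def ellipse_params(points, nstd=2):
    """Return center, width, height and angle (degrees) of a covariance ellipse."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    # Eigen-decomposition of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort so that the largest variance (major axis) comes first
    order = eigvals.argsort()[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Orientation of the major axis
    angle = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    # Full axis lengths: 2 * nstd standard deviations along each axis
    width, height = 2 * nstd * np.sqrt(eigvals)
    return center, width, height, angle

# Made-up points: spread is larger along x than along y
pts = [[0, 0], [2, 0], [0, 1], [2, 1]]
center, width, height, angle = ellipse_params(pts, nstd=2)
```

The returned values correspond to the `xy`, `width`, `height` and `angle` arguments of `matplotlib.patches.Ellipse`.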

In [None]:
n_components = 11

model, scores = ma.fit_PLSDA_model(datasets['GD_neg_class2']['NGP_RF'],
                                   datasets['GD_neg_class2']['target'], n_comp=n_components, scale=False)
model2, scores2 = ma.fit_PLSDA_model(datasets['GD_neg_class2']['Degree'],
                                     datasets['GD_neg_class2']['target'], n_comp=n_components, scale=True)

lcolors = datasets['GD_neg_class2']['label_colors']

with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        fig, (axl, axr) = plt.subplots(1,2, figsize=(14,7))
        plot_PLS(scores, lcolors, title="Negative GD class2, Intensity-based treatment", ax=axl)
        #plt.legend(loc='upper left', ncol=2)

        plot_PLS(scores2, lcolors, title="Negative GD class2, sMDiN (Degree) treatment", ax=axr)
        axr.set_ylabel('')
        axr.legend(loc='upper right', ncol=2)               
        plt.tight_layout()
        plt.show()

In [None]:
n_components = 11

model, scores = ma.fit_PLSDA_model(datasets['YD']['NGP'],
                                   datasets['YD']['target'], n_comp=n_components, scale=False)
model2, scores2 = ma.fit_PLSDA_model(datasets['YD']['Degree'],
                                     datasets['YD']['target'], n_comp=n_components, scale=True)

lcolors = datasets['YD']['label_colors']

with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        fig, (axl, axr) = plt.subplots(1,2, figsize=(14,7))
        plot_PLS(scores, lcolors, title="YD, Intensity-based treatment", ax=axl)
        #plt.legend(loc='upper left', ncol=2)

        plot_PLS(scores2, lcolors, title="YD, sMDiN (Degree) treatment", ax=axr)
        axr.set_ylabel('')
        axr.legend(loc='upper left', ncol=2)               
        plt.tight_layout()
        plt.show()

In [None]:
n_components = 11

model, scores = ma.fit_PLSDA_model(datasets['vitis_types']['NGP'],
                                   datasets['vitis_types']['target'], n_comp=n_components, scale=False)
model2, scores2 = ma.fit_PLSDA_model(datasets['vitis_types']['Degree'],
                                     datasets['vitis_types']['target'], n_comp=n_components, scale=True)

lcolors = datasets['vitis_types']['label_colors']

with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        fig, (axl, axr) = plt.subplots(1,2, figsize=(14,7))
        plot_PLS(scores, lcolors, title="GD target is Vitis types, Intensity-based treatment", ax=axl)
        #plt.legend(loc='upper left', ncol=2)

        plot_PLS(scores2, lcolors, title="GD target is Vitis types, sMDiN (Degree) treatment", ax=axr)
        axr.set_ylabel('')
        axr.legend(loc='upper right', ncol=1)               
        plt.tight_layout()
        plt.show()

In [None]:
n_components = 11

model, scores = ma.fit_PLSDA_model(datasets['HD']['NGP_RF'],
                                   datasets['HD']['target'], n_comp=n_components, scale=False)
model2, scores2 = ma.fit_PLSDA_model(datasets['HD']['Degree'],
                                     datasets['HD']['target'], n_comp=n_components, scale=True)

lcolors = datasets['HD']['label_colors']

with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        fig, (axl, axr) = plt.subplots(1,2, figsize=(14,7))
        plot_PLS(scores, lcolors, title="HD, Intensity-based treatment", ax=axl)
        #plt.legend(loc='upper left', ncol=2)

        plot_PLS(scores2, lcolors, title="HD, sMDiN (Degree) treatment", ax=axr)
        axr.set_ylabel('')
        axr.legend(loc='upper right', ncol=1)               
        plt.tight_layout()
        plt.show()

## Validation of Supervised Analysis of the HD datasets using the external test set

In [None]:
# Create the train and test entries for 5 of the sMDiN analyses (not created before, since there was no
# danger of data leakage: each sample network is analysed independently of the others)
name, ds = 'HD', datasets['HD']

for treat in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'GCD11'):
    ds['train'][treat] = ds[treat].loc[ds['train']['data'].index]
    ds['test'][treat] = ds[treat].loc[ds['test']['data'].index]
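
The row selection above relies on label-based indexing: the sample names of the existing train and test tables are used to slice each network-derived table. A toy sketch with hypothetical sample names and values:

```python
import pandas as pd

# Hypothetical full table of a network analysis (rows are samples)
degree = pd.DataFrame({'mz_1': [3, 1, 4, 1], 'mz_2': [5, 9, 2, 6]},
                      index=['s1', 's2', 's3', 's4'])

# Index of an existing train/test split (only the sample names matter here)
train_index = pd.Index(['s3', 's1'])
test_index = pd.Index(['s2', 's4'])

# .loc returns the rows in the order given by the index passed in
degree_train = degree.loc[train_index]
degree_test = degree.loc[test_index]

# No sample should appear in both subsets
overlap = set(degree_train.index) & set(degree_test.index)
print(overlap)
```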

#### Random Forest

In [None]:
def RF_model_HD(ds, treatment, n_trees=200):
    "Fit a Random Forest on the train set and evaluate it on the external test set."

    # Fit and evaluate a Random Forest model
    rf = skensemble.RandomForestClassifier(n_estimators=n_trees)
    rf.fit(ds['train'][treatment], ds['train']['target'])
            
    # Compute performance and important features
    accuracy_score =  rf.score(ds['test'][treatment], 
                                ds['test']['target']) # Prediction Accuracy

    imp_feat = dict(zip(ds['train'][treatment].columns,
                            rf.feature_importances_)) # Importance of each feature

    # Order the feature importance values of the fitted Random Forest
    imp_feat = pd.Series(imp_feat).replace({np.nan:0})
    sorted_imp_feat = imp_feat.sort_values(ascending=False)
    imp_feat = []
    for i in sorted_imp_feat.index:
        imp_feat.append((i, sorted_imp_feat.loc[i]))

    return {'accuracy': accuracy_score, 'important_features': imp_feat}

In [None]:
np.random.seed(16)

RF_HD_res = {}

# Application of the Random Forests
name, dataset = 'HD', datasets['HD']

# Intensity-based pre-treatments
IDT_res = {}
for treatment in ('NGP', 'NGP_RF'):

    IDT_res[treatment] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}

    fit = RF_model_HD(dataset, treatment, n_trees=100)
    IDT_res[treatment].update(fit)

# Choose the Intensity-based Data pre-Treatment (IDT) with the highest accuracy
if IDT_res['NGP_RF']['accuracy'] >= IDT_res['NGP']['accuracy']:
    best_idt = 'NGP_RF'
else:
    best_idt = 'NGP'
rfname = 'IDT'
RF_HD_res[rfname] = IDT_res[best_idt]
RF_HD_res[rfname]['treatment'] = 'IDT'

for treatment in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):

    rfname = treatment
    RF_HD_res[rfname] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}

    fit = RF_model_HD(dataset, treatment, n_trees=100)
    RF_HD_res[rfname].update(fit)


In [None]:
# Accuracy
accuracies_RF_HD = pd.Series({name: RF_HD_res[name]['accuracy'] for name in RF_HD_res})
accuracies_RF_HD

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.3):
        f, ax = plt.subplots(1, 1, figsize=(6, 6))
        x = np.arange(1)  # the label locations
        width = 0.1  # the width of the bars
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracies_RF_HD[accuracies_RF_HD.index==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax.bar(x + offset, acc_treatment, width, label=treatment, color = p4[i])
        ax.set_xticks(x)
        ax.set_xticklabels(['HD'])
        ax.set(ylabel='Accuracy', title='', ylim=(0.3,1.03))
        ax.legend(loc='upper left', bbox_to_anchor=(0.10, 1), ncol=2, fontsize=10)

#### PLS-DA

In [None]:
def PLSDA_model_HD(ds, treatment, n_comp=10,
                   encode2as1vector=True,
                   scale=False,
                   feat_type='Coef'):
      
    nright = 0
    
    # Fit and evaluate a PLS-DA model 
    # Setting up the y matrix
    train_group_len = len(ds['train']['target'])
    labels = ds['train']['target'] + ds['test']['target']
    unique_labels = list(pd.unique(labels))
    is1vector = len(unique_labels) == 2 and encode2as1vector
    matrix = ma._generate_y_PLSDA(labels, unique_labels, is1vector)

    if is1vector:
        # keep a copy to use later
        target1D = matrix.copy()
        correct = target1D[train_group_len:]
        y_train, y_test = matrix[:train_group_len], matrix[train_group_len:]
    else:
        y_train, y_test = matrix.iloc[:train_group_len], matrix.iloc[train_group_len:]

    # Fit PLS model
    plsda = PLSRegression(n_components=n_comp, scale=scale)
    plsda.fit(X=ds['train'][treatment], Y=y_train)

    # Obtain results with the test group
    y_pred = plsda.predict(ds['test'][treatment])
    CVR2 = r2_score(y_test, y_pred)

    # Decision rule for classification
    # Decision rule chosen: sample belongs to group where it has max y_pred (closer to 1)
    # In case of 1,0 encoding for two groups, round to nearest integer to compare

    if not is1vector:
        for i in range(len(y_pred)):
            if list(y_test.iloc[i, :]).index(max(y_test.iloc[i, :])) == np.argmax(
                y_pred[i]
            ):
                nright += 1  # Correct prediction
    else:
        rounded = np.round(y_pred)
        for i in range(len(y_pred)):
            if rounded[i] == correct[i]:
                nright += 1  # Correct prediction

    # Calculate important features (3 different methods to choose from)
    if feat_type == 'VIP':
        VIPS = ma._calculate_vips(plsda)
        Imp_Feat = dict(zip(ds['train'][treatment].columns, VIPS)) # Importance of each feature
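    elif feat_type == 'Coef':
        # Assumes coef_ has shape (n_features, n_targets), as in scikit-learn < 1.3 (transposed in later versions)
        Imp_Feat = dict(zip(ds['train'][treatment].columns, abs(plsda.coef_).sum(axis=1))) # Importance of each feature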
    elif feat_type == 'Weights':
        Imp_Feat = dict(zip(ds['train'][treatment].columns, abs(plsda.x_weights_).sum(axis=1))) # Importance of each feature
    else:
        raise ValueError(
            'Type not Recognized. Types accepted: "VIP", "Coef", "Weights".'
        )

    # Calculate the accuracy of the group predicted
    accuracies = nright / len(y_test) # Divided by len of test group
        
    # Collect and order important features values
    Imp_Feat = pd.Series(Imp_Feat).replace({np.nan:0})
    sorted_Imp_Feat = Imp_Feat.sort_values(ascending=False)
    # Convert them to a list of tuples so that they can be saved as JSON
    Imp_Feat = []
    for i in sorted_Imp_Feat.index:
        Imp_Feat.append((i, sorted_Imp_Feat.loc[i]))

    return {'accuracy': accuracies, 'Q2': CVR2, 'important_features': Imp_Feat}
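
The decision rule described in the comments of the function can also be written in vectorized form. The sketch below is an illustration on made-up predictions, not the function used above; `count_correct` is a hypothetical helper.

```python
import numpy as np

def count_correct(y_true, y_pred, is1vector):
    """Count correct class assignments from PLS regression outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if is1vector:
        # Two classes encoded as 0/1: round the prediction to the nearest integer
        return int(np.sum(np.round(y_pred).ravel() == y_true.ravel()))
    # One-hot encoding: the predicted class is the column with the largest value
    return int(np.sum(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)))

# Three classes, one-hot encoded targets and made-up predictions
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.3, 0.6]]
n_onehot = count_correct(y_true, y_pred, is1vector=False)

# Two classes encoded in a single 0/1 vector
n_binary = count_correct([0, 1, 1], [[0.2], [0.8], [0.4]], is1vector=True)
```

With the rounding rule, a prediction that rounds to a value other than 0 or 1 (for example 1.6) is counted as an error for either class.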

In [None]:
np.random.seed(16)

PLSDA_HD_res = {}

# Application of the PLS-DA
name, dataset = 'HD', datasets['HD']

# Intensity-based pre-treatments
IDT_res = {}
for treatment in ('NGP', 'NGP_RF'):

    IDT_res[treatment] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}
    n_comp = 10

    fit = PLSDA_model_HD(dataset, treatment, scale=False, n_comp=n_comp)#, feat_type='VIP')
    IDT_res[treatment].update(fit)

# Choose the Intensity-based Data pre-Treatment (IDT) with the highest accuracy
if IDT_res['NGP_RF']['accuracy'] >= IDT_res['NGP']['accuracy']:
    best_idt = 'NGP_RF'
else:
    best_idt = 'NGP'
plsdaname = 'IDT'
PLSDA_HD_res[plsdaname] = IDT_res[best_idt]
PLSDA_HD_res[plsdaname]['treatment'] = 'IDT'


# Data matrices from the network analyses of sMDiNs
for treatment in ('Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
    
    plsdaname = treatment
    PLSDA_HD_res[plsdaname] = {'dskey': name, 'dataset': dataset['name'], 'treatment':treatment}
    n_comp = 10

    if treatment == 'MDBI':
        n_comp = 4
    elif treatment == 'WMDBI':
        n_comp = 6

    fit = PLSDA_model_HD(dataset, treatment, scale=True, n_comp=n_comp)
    PLSDA_HD_res[plsdaname].update(fit)

In [None]:
# Accuracy
accuracies_PLSDA_HD = pd.Series({name: PLSDA_HD_res[name]['accuracy'] for name in PLSDA_HD_res})
accuracies_PLSDA_HD

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.3):
        f, ax = plt.subplots(1, 1, figsize=(6, 6))
        x = np.arange(1)  # the label locations
        width = 0.1  # the width of the bars
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracies_PLSDA_HD[accuracies_PLSDA_HD.index==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax.bar(x + offset, acc_treatment, width, label=treatment, color = p4[i])
        ax.set_xticks(x)
        ax.set_xticklabels(['HD'])
        ax.set(ylabel='Accuracy', title='', ylim=(0.3,1.03))
        ax.legend(loc='upper left', bbox_to_anchor=(0.10, 1), ncol=2, fontsize=10)

### Figure with everything  (cross-validation and HD external test set validation)

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.5):
        fig=plt.figure(figsize=(16, 10))

        gs=plt.GridSpec(2,7)

        ax1=fig.add_subplot(gs[0,:6])
        ax2=fig.add_subplot(gs[0,6])
        ax3=fig.add_subplot(gs[1,:6]) 
        ax4=fig.add_subplot(gs[1,6]) 

        #f, (axu, axd) = plt.subplots(2, 2, figsize=(16, 10), constrained_layout=True)
        x = np.arange(len(datasets))  # the label locations
        labels = [endminus(datasets[name]['name']) for name in datasets]
        width = 0.09  # the width of the bars
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats_RF[accuracy_stats_RF['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax1.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            ax1.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        ax1.set_xticks(x)
        ax1.set_xticklabels(labels, fontsize=20)
        ax1.set(ylabel='Average accuracy', title='', ylim=(0.2,1.03))
        for spine in ax1.spines.values():
            spine.set_edgecolor('0.1')
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracies_RF_HD[accuracies_RF_HD.index==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax2.bar(0 + offset, acc_treatment, width, label=treatment, color = p4[i])

        ax2.set_xticks([0])
        ax2.set_xticklabels(['HD'], fontsize=20)
        ax2.set(title='', ylim=(0.2,1.03))
        ax2.set_yticklabels([])
        for spine in ax2.spines.values():
            spine.set_edgecolor('0.1')
        
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracy_stats_PLSDA[accuracy_stats_PLSDA['treatment']==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax3.bar(x + offset, acc_treatment['Average accuracy'], width, label=treatment, color = p4[i])
            ax3.errorbar(x + offset, y=acc_treatment['Average accuracy'], yerr=acc_treatment['STD'],
                        ls='none', ecolor='0.2', capsize=3)
        ax3.set_xticks(x)
        ax3.set_xticklabels(labels, fontsize=20)
        ax3.set(ylabel='Average accuracy', title='', ylim=(0.2,1.03))
        for spine in ax3.spines.values():
            spine.set_edgecolor('0.1')
        
        for i, treatment in enumerate(('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')):
            acc_treatment = accuracies_PLSDA_HD[accuracies_PLSDA_HD.index==treatment]
            offset = - 0.25 + i * 0.1
            rects = ax4.bar(0 + offset, acc_treatment, width, label=treatment, color = p4[i])

        ax4.set_xticks([0])
        ax4.set_xticklabels(['HD'], fontsize=20)
        ax4.set(title='', ylim=(0.2,1.03))
        ax4.set_yticklabels([])
        for spine in ax4.spines.values():
            spine.set_edgecolor('0.1')
        
        ax2.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=18)
        
        plt.tight_layout()
        plt.show()
        
        fig.savefig('images/supervised_performance_all.pdf' , dpi=300)
        fig.savefig('images/supervised_performance_all.jpg' , dpi=300)

In [None]:
# ensure dir exists
#path = Path.cwd() / "store_files"
#path.mkdir(parents=True, exist_ok=True)

#storepath = Path.cwd() / "store_files" / 'processed_data.h5'

#store = pd.HDFStore(storepath, complevel=9, complib="blosc:blosclz")
#pd.set_option('io.hdf.default_format','table')

# keep json serializable values and store dataFrames in HDF store

#serializable = {}
# Store the pandas DataFrames in an h5 store and everything else in a json file
# Since 'iter_fold_splits' holds many nested dicts, the code to save the files and load them back is a bit complex
# Probably not the best way, but it is functional
# 'AA_' and 'TTS_' are used as special delimiters for the iter_fold_splits keys and the train and test set keys in the dict,
# since the nested dicts have to be rebuilt from them when reading the files back
#for dskey, dataset in datasets.items():
#    serializable[dskey] = {}
#    for key, value in dataset.items():
        #print(dskey, key)
#        if isinstance(value, pd.DataFrame):
#            storekey = dskey + '_' + key
            #print('-----', storekey)
#            store[storekey] = value
#            serializable[dskey][key] = f"INSTORE_{storekey}"
#        elif key in ('MDiN', 'sMDiNs'):
#            continue
#        elif key == 'iter_fold_splits':
#            for iteration, i in value.items():
#                for group, n in i.items():
#                    for fold, j in n.items():
#                        #print(j)
#                        for treat, dfs in j.items():
#                            storekey = dskey + '_' + key + 'AA_' + str(iteration) + '_' + group + '_' + str(fold) + '_' + treat
#                            if treat == 'target':
#                                serializable[dskey][storekey] = dfs
                            #print(df)
#                            else:
#                                store[storekey] = dfs
#                                serializable[dskey][storekey] = f"INSTORE_{storekey}"
#        elif key in ('train','test'):
#            for treat, dfs in value.items():
#                storekey = dskey + '_' + key + 'TTS_' + treat
#                if treat == 'target':
#                    serializable[dskey][storekey] = dfs
                #print(df)
#                else:
#                    store[storekey] = dfs
#                    serializable[dskey][storekey] = f"INSTORE_{storekey}"
#        else:
#            serializable[dskey][key] = value
#store.close()


#path = path / 'processed_data.json'
#with open(path, "w", encoding='utf8') as write_file:
#    json.dump(serializable, write_file)