# Extra Module for Metabolomics Data Analysis

# sMDiN Analysis (Sample Mass-Difference Networks Analysis)

sMDiN (sample Mass-Difference Network) analysis is a pre-treatment that complements intensity data and is provided as a separate module in this Jupyter notebook. It uses feature occurrence data to build a Mass-Difference Network (masses are linked if their mass difference corresponds to a common biochemical transformation) for each sample in the data. Network analysis is then performed on each of these networks, chosen according to the characteristic of the data you want to analyse (for example, which chemical transformations have different prevalences between samples of different classes). Finally, the unsupervised and supervised analyses can be repeated using the tables obtained from these network analyses.

Here, the user can choose whether to build the networks using conventional mass differences or only formula differences between formula-assigned features, if these exist in the data.

paper doi: 10.3389/fmolb.2022.917911

In [None]:
import itertools  # used by color_list_to_matrix_and_cmap below

import pandas as pd
import numpy as np
import networkx as nx

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
import scipy.stats as stats

import sklearn.ensemble as skensemble
import sklearn.model_selection
from sklearn.model_selection import GridSearchCV

import MDiN_functions as md
import metanalysis_standard as metsta
from elips import plot_confidence_ellipse
from multianalysis import p_adjust_bh, fit_PLSDA_model, _calculate_vips, _generate_y_PLSDA

#### Import treated Data from the main notebook.

Select the filename of the file to import from the main jupyter notebook analysis.

In [None]:
# Filename for the data to import
filename = 'Export_TreatedData.xlsx'
filename_pickle_treated = 'Export_TreatedData.pickle'
filename_pickle_proc = 'Export_ProcData.pickle'

#treated_data = pd.read_excel(filename, sheet_name='Fully Treated Data').set_index('Unnamed: 0').T
bin_data = pd.read_excel(filename, sheet_name='BinSim Treated Data').set_index('Bucket label').T
univariate_data = pd.read_excel(filename, sheet_name='MVI+Norm Data').set_index('Unnamed: 0')

processed_data = pd.read_pickle(filename_pickle_proc)
treated_data = pd.read_pickle(filename_pickle_treated)

In [None]:
# Filename for the target to import
filename_target = 'Export_Target.txt'

with open(filename_target) as a:
    tg = a.readlines()
target = [t.strip() for t in tg]
target

### Reading MDB list

MDB - Mass-Difference-based building Block.

MDBs are the chemical transformations used to build Mass-Difference Networks. They usually represent some of the most common and ubiquitous reactions in biological systems, but they can also be specific to the biological system in question.

The default file provided, `MDB_list.txt`, contains 15 common biochemical transformations.

This file can also be used as a guide for the formatting required to create your own MDB list. In short, it should have 4 tab-separated columns without headers:

- the first column is the net change in elemental composition that the chemical transformation causes (for example, CH2 for methylation);
- the second column is the name of the Mass-Difference-based building Block (MDB), that is, of the chemical transformation;
- the third column is the mass difference associated with the transformation (for example, 14.015650064399999 for methylation);
- the fourth column should say true.
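
As a minimal sketch of the format described above, the following parses a small MDB list held in a string using only the standard library. The methylation row uses the mass given above; the other two rows, the transformation names and the `parse_mdb_list` helper are assumptions made for illustration and are not taken from `MDB_list.txt` or `MDiN_functions`.

```python
import csv
import io

# Illustrative MDB list: formula change, name, mass difference, selection flag.
# Only the CH2 mass comes from the text above; the H2 and O rows and all the
# names are placeholders for this example.
mdb_text = (
    "CH2\tMethylation\t14.015650064399999\ttrue\n"
    "H2\tHydrogenation\t2.01565006446\ttrue\n"
    "O\tOxygenation\t15.99491461957\ttrue\n"
)

def parse_mdb_list(text):
    """Parse a headerless, tab-separated MDB list into a dict keyed by formula change."""
    mdbs = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:  # skip blank lines
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 tab-separated columns, got {len(row)}: {row}")
        label, name, mass, selected = row
        mdbs[label] = {
            "Transformation": name,
            "Mass": float(mass),
            "Selected": selected.strip().lower() == "true",
        }
    return mdbs

mdbs = parse_mdb_list(mdb_text)
print(mdbs["CH2"])
```

Raising an error on rows that do not have exactly four columns makes formatting mistakes (for example, spaces used instead of tabs) visible early.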

In [None]:
mdb_filename = 'MDB_list.txt'

# Setting up the Chemical Transformations DataFrame
MDBs_to_use = pd.read_csv(mdb_filename, sep='\t', header=None)
MDBs_to_use.columns = ['Label', 'Transformation', 'Mass', 'Selected']
MDBs_to_use = MDBs_to_use.set_index('Label')
comp = []
for i in MDBs_to_use.index:
    comp.append(md.formula_process(i))
MDBs_to_use['Comp.'] = comp

MDBs_to_use

## Building the MDiN

Here, we build the general MDiN for the whole dataset. Each sample Mass-Difference Network is then obtained as the subgraph of this general network that contains only the metabolic features appearing in that sample.

There are 4 possible types of MDiNs that you can choose:

- **Simple MDiN**: This is the simplest and fastest type of MDiN. It only requires a list of mass values (and MDBs) and creates connections (edges) in the network if the mass difference between metabolic features (nodes) corresponds to one of the MDBs used.
- **Univocal MDiN**: It also only requires a list of mass values (and MDBs). It starts by building a simple MDiN and then trims it: when one node has multiple edges representing the same chemical transformation in the same 'direction' (that is, addition or subtraction of a chemical group), only the edge with the lowest associated error is kept. This is based on the idea that each metabolic feature should have a unique formula, since direct infusion mass spectrometry would indicate that for each unique formula, you should only have one metabolic feature. If there are multiples of the same 'edge' MDB from the same node, this principle would be violated.
- **Mass Formula Propagation (Mass Form. Prop.) MDiN**: It requires a list of mass values and a list of reliably (or more reliably) identified metabolic features (for example, formulas from annotated metabolites). It builds a Univocal MDiN but uses it for formula propagation in the network, starting from those reliably identified metabolic features. It eliminates edges to avoid both incoherences and contradictions. Incoherences are edges that would lead to very improbable formulas according to criteria such as the usual elemental ratios, valency restrictions or the maximum number of different elements. Contradictions are edges where the propagation from two starting nodes leads to different formulas.
- **Formula Difference Networks**: This network is built from assigned formulas in the data instead of the mass differences. It requires only a DataFrame or Series with the formula associated to each metabolic feature. It restricts the data to only the formula assigned metabolites but also reduces possible errors in link creation since there is no deviation associated with formula differences.
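
To make the idea behind the Simple MDiN concrete, here is a minimal standard-library sketch that links two masses when their difference matches an MDB within a ppm tolerance. It is not the implementation in `MDiN_functions`: the `simple_mdin_edges` helper is hypothetical, and computing the ppm error relative to the heavier mass of the pair is an assumption of this example.

```python
from itertools import combinations

def simple_mdin_edges(masses, mdbs, ppm=0.5):
    """Return (lighter, heavier, MDB label) for each pair whose mass difference matches an MDB.

    The ppm error is taken relative to the heavier mass of the pair; this is an
    assumption of the sketch, not necessarily the convention of MDiN_functions.
    """
    edges = []
    for m1, m2 in combinations(sorted(masses), 2):  # sorted, so m1 <= m2
        diff = m2 - m1
        for label, mdb_mass in mdbs.items():
            error_ppm = abs(diff - mdb_mass) / m2 * 1e6
            if error_ppm <= ppm:
                edges.append((m1, m2, label))
    return edges

# Toy data: three masses related by a methylation and an oxygenation, plus an unrelated mass
mdbs = {"CH2": 14.0156500644, "O": 15.99491461957}
base = 180.0633881
masses = [base, base + mdbs["CH2"], base + mdbs["CH2"] + mdbs["O"], 250.0]

edges = simple_mdin_edges(masses, mdbs)
for lighter, heavier, label in edges:
    print(f"{lighter:.4f} -> {heavier:.4f} ({label})")
```

The unrelated mass (250.0) ends up without edges, which is exactly the kind of isolated node that is discarded later. A Univocal MDiN would additionally keep only the lowest-error edge when a node has several edges for the same MDB in the same direction.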

In [None]:
# Choose the type of MDiN desired (details about each on the cell above)
mdin_type = 'Univocal' # Options: 'Simple', 'Univocal', 'Mass Form. Prop.', 'Formula'

**Simple MDiN cell**

In [None]:
# Preparing data for Simple MDiN
if mdin_type == 'Simple':
    # Parameters
    # Column with mass values to build the Mass Difference Network
    mass_val_col = 'Neutral Mass'
    # Allowed ppm deviation to link the nodes
    ppm_thresh = 0.5

    # Mass list
    masses_list = list(processed_data[mass_val_col].values)

    formula_df = processed_data.copy()
    general_MDiN = md.simple_MDiN(masses_list, trans_groups=MDBs_to_use, ppm=ppm_thresh)
    re_nodes = {v: k for k, v in formula_df[mass_val_col].to_dict().items()}
    general_MDiN = nx.relabel_nodes(general_MDiN, re_nodes)

**Univocal MDiN cell**

In [None]:
# Preparing data for Univocal MDiN
if mdin_type == 'Univocal':
    # Parameters
    # Column with mass values to build the Mass Difference Network
    mass_val_col = 'Neutral Mass'
    # Allowed ppm deviation to link the nodes
    ppm_thresh = 0.5

    # Mass list
    masses_list = list(processed_data[mass_val_col].values)

    formula_df = processed_data.copy()
    general_MDiN = md.univocal_MDiN(masses_list, trans_groups=MDBs_to_use, ppm=ppm_thresh)
    re_nodes = {v: k for k, v in formula_df[mass_val_col].to_dict().items()}
    general_MDiN = nx.relabel_nodes(general_MDiN, re_nodes)

**Mass Formula Propagation MDiN cell**

First, here is a suggestion on how to build a reliable formula DataFrame. It can be built however the user prefers, but it should have the same format as the end result of the next cell: a DataFrame with one column named 'Formula', where the index contains the masses of the features with reliably annotated formulas and the column contains the corresponding formulas.

In [None]:
reliable_forms_df = pd.DataFrame(columns=['Formula'])
if mdin_type == 'Mass Form. Prop.':
    # Column with mass values to build the Mass Difference Network
    mass_val_col = 'Neutral Mass'

    # Building the reliable formulas set
    # Here, formulas are considered reliable when they come from annotated compounds that have only one
    # possible formula for the node. The two dictionaries below select which annotation columns to consider
    # and where their formulas are stored (depending on whether a column holds a single string or a list).
    ann_to_form_cols_instring = {'Name': 'Formula'}
    ann_to_form_cols_inlist = {'Matched HMDB names': 'Matched HMDB formulas',
                        'Matched LTS names': 'Matched LTS formulas',  'Matched DBK names': 'Matched DBK formulas',}

    temp_df = processed_data.loc[processed_data['Has Match?'].dropna()]

    for i in temp_df.index:
        form = []
        for a_col, f_col in ann_to_form_cols_instring.items():
            a = temp_df.loc[i, a_col]
            if type(a) == str:
                f = temp_df.loc[i, f_col]
                if f not in form:
                    form.append(f)

        for a_col, f_col in ann_to_form_cols_inlist.items():
            a = temp_df.loc[i, a_col]
            if type(a) == list:
                fs = temp_df.loc[i, f_col]
                for f in fs:
                    if f not in form:
                        form.append(f)

        # Only keep indexes that have 1 possible formula
        if len(form) == 1:
            reliable_forms_df.loc[temp_df.loc[i, mass_val_col]] = form[0]

# DataFrame
reliable_forms_df

In [None]:
# Now to build the MDiN
if mdin_type == 'Mass Form. Prop.':
    # Other Parameters
    # Reliable Formulas DF
    reliable_forms_df = reliable_forms_df
    # Allowed ppm deviation to link the nodes
    ppm_thresh = 0.5

    # Mass list
    masses_list = list(processed_data[mass_val_col].values)

    formula_df = processed_data.copy()
    general_MDiN = md.formula_MDiN(masses_list, reliable_forms_df, trans_groups=MDBs_to_use, ppm=ppm_thresh)
    re_nodes = {v: k for k, v in formula_df[mass_val_col].to_dict().items()}
    general_MDiN = nx.relabel_nodes(general_MDiN, re_nodes)

**Formula Difference-Networks**

Here, as in the previous case, we have to prepare a separate DataFrame whose index contains the metabolic feature indexes and whose columns contain the formulas in DataFrame format (element counts). To do so, we need a way to select which formulas are considered, since there may be multiple columns, each with their own formula assignments. As an example, we choose one formula column in the next cell and build the needed DataFrame from it.

In [None]:
# Build the necessary formula DataFrame
filt_elems = pd.DataFrame()
if mdin_type == 'Formula':
    # Select the column with Formula Assignment you want to use
    form_col = 'Formula_Assignment'

    # Getting the column, dropping the peaks without formulas and the peaks which are isotopic peaks
    form_df = processed_data.loc[:,[form_col]].dropna()
    form_df = form_df.loc[[i for i in form_df.index if 'iso' not in form_df.loc[i, form_col]]]

    elems = md.create_element_counts(form_df, formula_subset=[form_col,], compute_ratios=False)
    filt_elems = elems.iloc[:,:-1]

# Formula DataFrame
filt_elems

In [None]:
# Now to build the Formula-Difference Networks
if mdin_type == 'Formula':
    # Column with mass values just to add as attributes
    mass_val_col = 'Neutral Mass'
    
    # DF with the formulas in DataFrame format
    filt_elems = filt_elems

    # Transform MDB to suitable format
    MDB_df = pd.DataFrame(dict(MDBs_to_use['Comp.'])).T

    # Making MDB and filt_elems compatible
    for col in MDB_df.columns:
        if col not in filt_elems.columns:
            filt_elems[col] = 0
    for col in filt_elems.columns:
        if col not in MDB_df.columns:
            MDB_df[col] = 0
    MDB_df = MDB_df[filt_elems.columns]

    formula_df = processed_data.loc[filt_elems.index]
    general_MDiN = md.FDiN_builder(formula_df, filt_elems, MDB_df)

### Discarding all uninformative isolated nodes from the network, that is, the nodes that do not establish any connections

You may also discard nodes from very small components as well.
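
The effect of the size threshold can be pictured on a toy graph. The sketch below finds connected components with a plain breadth-first search and keeps the nodes of components that have at least `min_comp_size` nodes (an assumption about how the threshold is read); in the notebook itself `nx.connected_components` does this job.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def nodes_to_keep(nodes, edges, min_comp_size):
    """Nodes belonging to components with at least min_comp_size nodes."""
    return sorted(n for comp in connected_components(nodes, edges)
                  if len(comp) >= min_comp_size for n in comp)

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]  # "f" is isolated

kept_2 = nodes_to_keep(nodes, edges, min_comp_size=2)
kept_3 = nodes_to_keep(nodes, edges, min_comp_size=3)
print(kept_2, kept_3)
```

With a threshold of 2 only the isolated node is dropped; raising it to 3 also removes the two-node component.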

In [None]:
print('Size of components in the general MDiN:')
comp_len = []
for i in sorted(nx.connected_components(general_MDiN), key=len, reverse=True):
    comp_len.append(len(i))
comp_len

In [None]:
# Exclude network components below a certain size (2 removes only the isolated nodes)
min_comp_size = 2

comps = []
for i in sorted(nx.connected_components(general_MDiN), key=len, reverse=True):
    if len(i) >= min_comp_size:
        comps.extend(i)
general_MDiN = general_MDiN.subgraph(comps)
len(general_MDiN.nodes()) # Nº of nodes leftover in the network

## Building the Sample Mass-Difference Networks

In [None]:
# Dict to store sMDiNs
sMDiNs = {}
for samp in treated_data.index:
    # Subgraphing sMDiN
    idxs = [i for i in formula_df[
        formula_df.loc[:,samp].replace({np.nan:0}) != 0].index]
    ints = {i: treated_data.loc[samp, i] for i in formula_df[
        formula_df.loc[:,samp].replace({np.nan:0}) != 0].index}
    sMDiNs[samp] = general_MDiN.copy().subgraph(idxs)

    # Storing intensity of feature in sample on the nodes
    intensity_attr = dict.fromkeys(sMDiNs[samp].nodes(),0)
    intensity_matrix = []
    mass_matrix = []
    for m in ints:
        int_v = ints[m]
        intensity_attr[m] = {'mass':formula_df.loc[m, mass_val_col], 'intensity': int_v}
    nx.set_node_attributes(sMDiNs[samp], intensity_attr)
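
The subgraphing step can be illustrated without networkx: a sample network keeps the features that occur in the sample (non-zero, non-missing intensity) and only the edges of the general network whose two ends are both kept. The feature names, intensities and helper functions below are hypothetical and only serve as a small worked example.

```python
import math

# General network as an edge list: (feature, feature, MDB label)
general_edges = [("f1", "f2", "CH2"), ("f2", "f3", "O"), ("f3", "f4", "H2")]

# Intensities per sample; 0 or NaN means the feature does not occur in the sample
intensities = {
    "sample_A": {"f1": 1200.0, "f2": 800.0, "f3": 0.0, "f4": 150.0},
    "sample_B": {"f1": float("nan"), "f2": 500.0, "f3": 300.0, "f4": 90.0},
}

def present_features(sample_intensities):
    """Features with a non-missing, non-zero intensity in the sample."""
    return {feat for feat, value in sample_intensities.items()
            if value is not None and not math.isnan(value) and value != 0}

def sample_subgraph(edges, sample_intensities):
    """Induced subgraph: occurring features and the edges between them."""
    present = present_features(sample_intensities)
    kept_edges = [(u, v, mdb) for u, v, mdb in edges if u in present and v in present]
    return present, kept_edges

sample_networks = {name: sample_subgraph(general_edges, values)
                   for name, values in intensities.items()}
for name, (feats, edges) in sample_networks.items():
    print(name, sorted(feats), edges)
```

Note that `f4` occurs in `sample_A` but is left without edges there, because its only neighbour in the general network (`f3`) is absent from that sample.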

## Analysing Sample MDiNs

There are many ways to analyse the sMDiNs built, which is the main advantage of the method. Here we show two of them, based on the original paper where this methodology was described:

- Degree analysis
- MDBI - Mass-Difference based building block Impact analysis

**Degree** is a measure of centrality that keeps each node as a feature (no feature reduction); the value of a feature in each sample is the degree of the corresponding node in that sample's MDiN.

**MDB Impact** is a measure of the impact that each MDB had in establishing a sample MDiN. To that end, the number of edges established due to each MDB is counted in each sample MDiN (each MDB represents a set of chemical transformations). To allow comparison between samples with different numbers of edges, the counts in each sample MDiN are converted to proportions of the total number of edges in that network. This analysis was made to see if the relative importance of the MDBs in establishing the networks is characteristic of the class the sample belongs to.
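
As a small worked example of both metrics, the sketch below computes node degrees and the MDB Impact from toy edge lists, expressing the impact as a fraction of the edges in each sample network. The edge lists and the `degree` and `mdb_impact` helpers are hypothetical and independent of the notebook's own functions.

```python
from collections import Counter

# Toy sample networks as edge lists: (feature, feature, MDB label)
sample_edges = {
    "sample_A": [("f1", "f2", "CH2"), ("f2", "f3", "CH2"),
                 ("f3", "f4", "O"), ("f1", "f5", "H2")],
    "sample_B": [("f1", "f2", "CH2"), ("f2", "f3", "O")],
}
mdb_labels = ["CH2", "O", "H2"]

def degree(edges):
    """Number of edges attached to each node."""
    deg = Counter()
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def mdb_impact(edges, labels):
    """Fraction of the network's edges that is due to each MDB."""
    counts = Counter(mdb for _, _, mdb in edges)
    total = sum(counts.values())
    return {label: (counts[label] / total if total else 0.0) for label in labels}

degrees = {name: degree(edges) for name, edges in sample_edges.items()}
impacts = {name: mdb_impact(edges, mdb_labels) for name, edges in sample_edges.items()}
print(degrees["sample_A"])
print(impacts)
```

Because the impact is normalised per sample, `sample_A` (4 edges) and `sample_B` (2 edges) both give CH2 an impact of 0.5 despite the different network sizes.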

In [None]:
Deg = {}
MDB_Impact = {}

for samp in treated_data.index:

    # Centrality measures
    Deg[samp] = dict(sMDiNs[samp].degree())

    # MDB Impact
    MDB_Impact[samp] = dict.fromkeys(list(MDBs_to_use.index), 0) # MDBs from the transformation list
    for i in sMDiNs[samp].edges():
        MDB_Impact[samp][sMDiNs[samp].edges()[i]['Transformation']] = MDB_Impact[samp][
            sMDiNs[samp].edges()[i]['Transformation']] + 1

# Centrality Measures
Deg = pd.DataFrame.from_dict(Deg).replace({np.nan:0}).T

# MDB Impact
MDB_Impact = pd.DataFrame.from_dict(MDB_Impact).replace({np.nan:0})
MDB_Impact = (MDB_Impact/MDB_Impact.sum()).T

print('done!')

## Statistical Analysis

Now, everything is in place to perform traditional statistical analysis.

In [None]:
regression = False

In [None]:
# See if the classes are those that you want
classes = list(pd.unique(target))
# customize label colors

colours = sns.color_palette('tab10', 10) # Only room for 10 classes in this case, choose your colours
#colours = ('coral', 'turquoise', 'gold', 'indigo', 'lightgreen') # Example for using named colours
ordered_labels = classes # Put the classes, you can choose the order

label_colours = {lbl: c for lbl, c in zip(ordered_labels, colours)}
sample_colours = [label_colours[lbl] for lbl in target]

# See the colours for each class
sns.palplot(label_colours.values())
new_ticks = plt.xticks(range(len(ordered_labels)), ordered_labels)

## Unsupervised Statistical Analysis

Unsupervised analysis means that the algorithms here do not receive the information of the different class labels.

Here, we show PCA and Hierarchical Clustering (HCA) Analysis.

In [None]:
def plot_PCA(principaldf, label_colors, components=(1,2), title="PCA", ax=None):
    "Plot the projection of samples in the 2 main components of a PCA model."
    
    if ax is None:
        ax = plt.gca()
    
    loc_c1, loc_c2 = [c - 1 for c in components]
    col_c1_name, col_c2_name = principaldf.columns[[loc_c1, loc_c2]]

    #ax.axis('equal')
    ax.set_xlabel(f'{col_c1_name}')
    ax.set_ylabel(f'{col_c2_name}')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset = principaldf[principaldf['Label']==lbl]
        ax.scatter(subset[col_c1_name],
                   subset[col_c2_name],
                   s=50, color=label_colors[lbl], label=lbl)

    #ax.legend(framealpha=1)
    ax.set_title(title, fontsize=15)

def plot_ellipses_PCA(principaldf, label_colors, components=(1,2),ax=None, q=None, nstd=2):
    "Plot confidence ellipses of a class' samples based on their projection in the 2 main components of a PCA model."

    if ax is None:
        ax = plt.gca()

    loc_c1, loc_c2 = [c - 1 for c in components]
    points = principaldf.iloc[:, [loc_c1, loc_c2]]

    #ax.axis('equal')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset_points = points[principaldf['Label']==lbl]
        plot_confidence_ellipse(subset_points, q, nstd, ax=ax, ec=label_colors[lbl], fc='none')

def color_list_to_matrix_and_cmap(colors, ind, axis=0):
        if any(issubclass(type(x), list) for x in colors):
            all_colors = set(itertools.chain(*colors))
            n = len(colors)
            m = len(colors[0])
        else:
            all_colors = set(colors)
            n = 1
            m = len(colors)
            colors = [colors]
        color_to_value = dict((col, i) for i, col in enumerate(all_colors))

        matrix = np.array([color_to_value[c]
                           for color in colors for c in color])

        matrix = matrix.reshape((n, m))
        matrix = matrix[:, ind]
        if axis == 0:
            # row-side:
            matrix = matrix.T

        cmap = mpl.colors.ListedColormap(all_colors)
        return matrix, cmap

def plot_dendogram(Z, leaf_names, label_colors, title='', ax=None, no_labels=False, labelsize=12, **kwargs):
    if ax is None:
        ax = plt.gca()
    hier.dendrogram(Z, labels=leaf_names, leaf_font_size=10, above_threshold_color='0.2', orientation='left',
                    ax=ax, **kwargs)
    #Coloring labels
    #ax.set_ylabel('Distance (AU)')
    ax.set_xlabel('Distance (AU)')
    ax.set_title(title, fontsize = 15)
    
    #ax.tick_params(axis='x', which='major', pad=12)
    ax.tick_params(axis='y', which='major', labelsize=labelsize, pad=12)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    #xlbls = ax.get_xmajorticklabels()
    xlbls = ax.get_ymajorticklabels()
    rectimage = []
    for lbl in xlbls:
        lbl_text = lbl.get_text()
        if type(list(label_colors)[0]) == np.float64:
            lbl_text = float(lbl_text)
        col = label_colors[lbl_text]
        lbl.set_color(col)
        #lbl.set_fontweight('bold')
        if no_labels:
            lbl.set_color('w')
        rectimage.append(col)

    cols, cmap = color_list_to_matrix_and_cmap(rectimage, range(len(rectimage)), axis=0)

    axins = inset_axes(ax, width="5%", height="100%",
                   bbox_to_anchor=(1, 0, 1, 1),
                   bbox_transform=ax.transAxes, loc=3, borderpad=0)

    axins.pcolor(cols, cmap=cmap, edgecolors='w', linewidths=1)
    axins.axis('off')

### Principal Component Analysis (PCA) - Degree

In [None]:
f, ax = plt.subplots(1, 1, figsize=(6,6)) # Change the size of the figure

principaldf_deg, var_deg, loadings_deg = metsta.compute_df_with_PCs_VE_loadings(Deg, 
                                       n_components=2, # Select number of components to calculate
                                       whiten=True, labels=target, return_var_ratios_and_loadings=True)

# Plot PCA
ax.axis('equal')
lcolors = label_colours

plot_PCA(principaldf_deg, lcolors, 
         components=(1,2), # Select components to see
         title='Degree', # Select title of plot
         ax=ax)

# Remove ellipses by putting a # before the next line
plot_ellipses_PCA(principaldf_deg, 
                  lcolors, 
                  components=(1,2), # Select components to see
                  ax=ax, 
                  q=0.95) # Confidence ellipse with 95% (q) confidence

ax.set_xlabel(f'PC 1 ({var_deg[0] * 100:.1f} %)', size=15) # Set the size of labels
ax.set_ylabel(f'PC 2 ({var_deg[1] * 100:.1f} %)', size=15) # Set the size of labels

plt.legend(fontsize=15) # Set the size of labels
plt.grid() # If you want a grid or not
plt.show()
#f.savefig('Name_PCAplot_smdinDeg.png', dpi=400) # Save the figure

### Hierarchical Clustering Analysis (HCA) - Degree

Performing Hierarchical Clustering.

Distance metrics: 'euclidean' is the default, others are in https://docs.scipy.org/doc/scipy/reference/spatial.distance.html.

Linkage metrics: **'ward', 'average'**, 'centroid', 'single', 'complete', 'weighted', 'median'.
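
For intuition about what the distance step feeds into the linkage, the sketch below computes Euclidean distances between all pairs of samples in the same pair order as a condensed distance vector, (0,1), (0,2), ..., (1,2), ..., using only the standard library. The three toy samples are made up; the notebook itself uses `dist.pdist` and `hier.linkage`.

```python
import math
from itertools import combinations

# Toy feature table: one row of values per sample
rows = {
    "s1": [2.0, 0.0, 1.0],
    "s2": [2.0, 1.0, 1.0],
    "s3": [0.0, 3.0, 0.0],
}
names = list(rows)

# Pairwise Euclidean distances in condensed order: (s1,s2), (s1,s3), (s2,s3)
pairs = list(combinations(names, 2))
condensed = [math.dist(rows[a], rows[b]) for a, b in pairs]

# Agglomerative clustering starts by merging the closest pair of samples
closest_pair = pairs[condensed.index(min(condensed))]
print(dict(zip(pairs, condensed)))
print("first merge:", closest_pair)
```

The linkage method then decides how the distance between the merged cluster and the remaining samples is updated at each subsequent step.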

In [None]:
metric = 'euclidean' # Select distance metric
method = 'ward' # Select linkage method

distances = dist.pdist(Deg, metric=metric)
Z = hier.linkage(distances, method=method)

hca_res_deg = {'Z': Z, 'distances': distances}

In [None]:
# Plot HCA
with sns.axes_style("white"):
    f, ax = plt.subplots(1, 1, figsize=(4, 4), constrained_layout=True) # Set Figure Size
    plot_dendogram(hca_res_deg['Z'], 
                   target, ax=ax,
                   label_colors=label_colours,
                   title='Degree', # Select title
                   color_threshold=0) # Select a distance threshold from where different sets of lines are coloured

    plt.show()
    #f.savefig('Name_HCAplot_smdinDeg.png', dpi=400) # Save the figure

If you want a version of the dendrogram whose parameters are easier to change:

In [None]:
fig = plt.figure(figsize=(12,4))
# Plotting the dendrogram, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
# For details on how you can change different aspects of the dendrograms
dn = hier.dendrogram(hca_res_deg['Z'], labels=target,
                     leaf_font_size=13,
                     above_threshold_color='b')
# Coloring labels
ax = plt.gca()
ax.set_ylabel('Distance (AU)')
# Coloring the labels with their specific colours
xlbls = ax.get_xmajorticklabels()
for lbl in xlbls:
    lbl_text = lbl.get_text()
    if type(list(label_colours)[0]) == np.float64:
        lbl_text = float(lbl_text)
    lbl.set_color(label_colours[lbl_text])

### Principal Component Analysis (PCA) - MDB Impact

In [None]:
f, ax = plt.subplots(1, 1, figsize=(6,6)) # Change the size of the figure

principaldf_mdbi, var_mdbi, loadings_mdbi = metsta.compute_df_with_PCs_VE_loadings(MDB_Impact, 
                                       n_components=2, # Select number of components to calculate
                                       whiten=True, labels=target, return_var_ratios_and_loadings=True)

# Plot PCA
ax.axis('equal')
lcolors = label_colours

plot_PCA(principaldf_mdbi, lcolors, 
         components=(1,2), # Select components to see
         title='MDB Impact', # Select title of plot
         ax=ax)

# Remove ellipses by putting a # before the next line
plot_ellipses_PCA(principaldf_mdbi, 
                  lcolors, 
                  components=(1,2), # Select components to see
                  ax=ax, 
                  q=0.95) # Confidence ellipse with 95% (q) confidence

ax.set_xlabel(f'PC 1 ({var_mdbi[0] * 100:.1f} %)', size=15) # Set the size of labels
ax.set_ylabel(f'PC 2 ({var_mdbi[1] * 100:.1f} %)', size=15) # Set the size of labels

plt.legend(fontsize=15) # Set the size of labels
plt.grid() # If you want a grid or not
plt.show()
#f.savefig('Name_PCAplot_smdinMDBI.png', dpi=400) # Save the figure

### Hierarchical Clustering Analysis (HCA) - MDB Impact

Performing Hierarchical Clustering.

Distance metrics: 'euclidean' is the default, others are in https://docs.scipy.org/doc/scipy/reference/spatial.distance.html.

Linkage metrics: **'ward', 'average'**, 'centroid', 'single', 'complete', 'weighted', 'median'.

In [None]:
metric = 'euclidean' # Select distance metric
method = 'ward' # Select linkage method

distances = dist.pdist(MDB_Impact, metric=metric)
Z = hier.linkage(distances, method=method)

hca_res_mdbi = {'Z': Z, 'distances': distances}

In [None]:
# Plot HCA
with sns.axes_style("white"):
    f, ax = plt.subplots(1, 1, figsize=(4, 4), constrained_layout=True) # Set Figure Size
    plot_dendogram(hca_res_mdbi['Z'], 
                   target, ax=ax,
                   label_colors=label_colours,
                   title='MDB Impact', # Select title
                   color_threshold=0) # Select a distance threshold from where different sets of lines are coloured

    plt.show()
    #f.savefig('Name_HCAplot_smdinMDBI.png', dpi=400) # Save the figure

If you want a version of the dendrogram whose parameters are easier to change:

In [None]:
fig = plt.figure(figsize=(12,4))
# Plotting the dendrogram, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
# For details on how you can change different aspects of the dendrograms
dn = hier.dendrogram(hca_res_mdbi['Z'], labels=target,
                     leaf_font_size=13,
                     above_threshold_color='b')
# Coloring labels
ax = plt.gca()
ax.set_ylabel('Distance (AU)')
# Coloring the labels with their specific colours
xlbls = ax.get_xmajorticklabels()
for lbl in xlbls:
    lbl_text = lbl.get_text()
    if type(list(label_colours)[0]) == np.float64:
        lbl_text = float(lbl_text)
    lbl.set_color(label_colours[lbl_text])

## Supervised Statistical Analysis

Supervised analysis means that the algorithms have access to label information. This means they are **not** suitable for assessing whether there are differences between classes/samples, only for identifying which metabolites are most important for those differences.

The supervised statistical analysis methods currently implemented in this notebook are:
- Random Forest Models (RFs)
- Partial Least Squares (PLS)
- Extreme Gradient Boosting (XGBoost)

They all support both regression and classification problems, but may not be equally suitable for all use cases.

XGBoost has thus far performed poorly in Binary classification problems, and both XGBoost and Random Forests may take a long time to run for regression problems, depending on the hyperparameters chosen.

**Functions for this step are in metanalysis_standard.py and are an adaptation of functions in multianalysis.py (from the BinSim paper).**

First, you must decide whether you intend to use the methods below to perform regression or classification (the `regression` variable set above). You may also change this in the parameters of the individual method functions.

If you pick regression, please make sure that the "class" labels are numerical values.

### Random Forest

First: Minor optimization of the number of trees (200 is a good number to use though) - see when the accuracy of the model stops increasing and starts fluctuating around a certain value (that should be the minimum number of trees to use).
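
The visual criterion above (accuracy stops increasing and starts fluctuating) can be approximated numerically. The sketch below returns the smallest number of trees from which every later score stays within a tolerance of the best score. The `min_trees_at_plateau` helper, the tolerance and the scores are all illustrative assumptions and are not part of `metanalysis_standard`.

```python
def min_trees_at_plateau(n_trees, scores, tol=0.02):
    """Smallest n_trees from which all later scores stay within tol of the best score.

    Returns None if even the last score is outside the tolerance band.
    """
    best = max(scores)
    chosen = None
    # Walk backwards while the scores remain inside the tolerance band
    for n, score in zip(reversed(n_trees), reversed(scores)):
        if best - score > tol:
            break
        chosen = n
    return chosen

# Made-up cross-validation accuracies for an increasing number of trees
n_trees = [10, 15, 20, 25, 30, 35, 40]
scores = [0.70, 0.78, 0.85, 0.90, 0.91, 0.90, 0.91]

plateau_start = min_trees_at_plateau(n_trees, scores)
print(plateau_start)
```

With the stored grid search results, the same helper could be called as `min_trees_at_plateau(RF_optim['Degree']['n_trees'], RF_optim['Degree']['scores'])`; the plot remains the better guide when the scores are noisy.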

In [None]:
# Select a random seed (number between the ()) if you don't want the results to change every time you run the code
np.random.seed()

# See maximum number of trees to search
top_tree_in_grid=300

# Vector with values for the parameter n_estimators
# Models will be built from 10 trees up to (but not including) top_tree_in_grid, in steps of 5 trees
values = {'n_estimators': range(10,top_tree_in_grid,5)}

if regression:
    rf = skensemble.RandomForestRegressor(n_estimators=200)
else:
    rf = skensemble.RandomForestClassifier(n_estimators=200)
    
clf = GridSearchCV(rf, values, cv=3, n_jobs=-1) # Change cv to change cross-validation

print('Fitting RFs...', end=' ')

RF_optim = {'Degree':{}, 'MDBI':{}}

# Degree Analysis
clf.fit(Deg, target) # Fitting the data to RF models with all the different number of trees

# Storing results
RF_optim['Degree']['scores'] = list(clf.cv_results_['mean_test_score'])
RF_optim['Degree']['n_trees'] = list(clf.cv_results_['param_n_estimators'])

# MDB Impact Analysis
clf.fit(MDB_Impact, target) # Fitting the data to RF models with all the different number of trees

# Storing results
RF_optim['MDBI']['scores'] = list(clf.cv_results_['mean_test_score'])
RF_optim['MDBI']['n_trees'] = list(clf.cv_results_['param_n_estimators'])


print('Done!')

In [None]:
# Plotting the results and adjusting parameters of the plot
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize=(6,6), constrained_layout=True) # Set Figure Size

        c_map = sns.color_palette('tab10', 10)

        for treatment, c in zip(RF_optim.keys(), c_map):
            ax.plot(RF_optim[treatment]['n_trees'], [s*100 for s in RF_optim[treatment]['scores']],
                    label=treatment, color=c)
        
        ax.set_ylabel('Random Forest CV Mean Accuracy (%)', fontsize=15) # Set the y_label and size
        ax.set_title('RF Optimization', fontsize=18) # Set the title and size
        ax.set_ylim([30,101]) # Set the limits on the y axis

        #f.suptitle('Optimization of the number of trees')
        ax.legend(fontsize=15) # Set the legend and size
        plt.show()

### Fitting the RF model - Degree

**See details of the `RF_model` function (model fitting AND evaluation) in metanalysis_standard.py. Credit for the initial function goes to the BinSim paper.**

In [None]:
# Choose a number for the seed for consistent results
np.random.seed()

n_trees=200 # Number of trees in the model

RF_results_deg = metsta.RF_model(Deg, target, regression, # Data, labels and if it's a regression or classification
                return_cv=True, iter_num=5, # If you want cross validation results and number of iterations for it
                n_trees=n_trees, # Number of trees in the model
                cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
                # For Classification Problems
                 metrics = ('accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted')) # Choose the perf. metrics

                # For Regression problems
                #metrics = ('neg_mean_squared_error',), n_jobs=-1)

Performance analysis

In [None]:
rf_results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
for k,v in RF_results_deg.items():
    if k != 'model' and k != 'imp_feat':
        rf_results_summary.loc[k] = np.mean(v), np.std(v)

print(rf_results_summary)

**Important Feature analysis**

See the most important features for class discrimination (sorted by importance).

In [None]:
imp_feats_rf_deg = processed_data.loc[list(general_MDiN.nodes()),
                                  [i for i in processed_data.columns if i not in treated_data.index]].copy()
imp_feats_rf_deg.insert(0,'Bucket label', imp_feats_rf_deg.index)
imp_feats_rf_deg.insert(1,'Gini Importance', '')
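# Fill in the importances by position, avoiding chained assignment (which may not write to the DataFrame)
gini_col = imp_feats_rf_deg.columns.get_loc('Gini Importance')
for n in range(len(RF_results_deg['imp_feat'])):
    imp_feats_rf_deg.iloc[RF_results_deg['imp_feat'][n][0], gini_col] = RF_results_deg['imp_feat'][n][1]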
imp_feats_rf_deg = imp_feats_rf_deg.sort_values(by='Gini Importance', ascending=False)
imp_feats_rf_deg.index = range(1, len(imp_feats_rf_deg)+1)

In [None]:
imp_feats_rf_deg.head(20) # Select number of features to see

In [None]:
# Saving Important feature dataset in an excel
SAVE_IMP_FEAT = True

# Saving the most important features by their fraction 'frac_feat_impor'.
# If None, saving the most important features based on a threshold 'score_threshold'.
# If also None, save the full dataset of all features
frac_feat_impor = 0.02 # Fraction of features to save, If None the variable in the next line is used.
score_threshold = None # Only used if variable above is None, threshold of score to consider a feature important.

if SAVE_IMP_FEAT:
    if frac_feat_impor:
        max_idx = int(frac_feat_impor*len(imp_feats_rf_deg))
        filt_imp_feats_rf_deg = imp_feats_rf_deg.iloc[:max_idx]
        filt_imp_feats_rf_deg.to_excel(f'RF_smdinDeg_ImpFeat_{frac_feat_impor*100}%.xlsx')
    elif score_threshold:
        filt_imp_feats_rf_deg = imp_feats_rf_deg[imp_feats_rf_deg['Gini Importance'] > score_threshold]
        filt_imp_feats_rf_deg.to_excel(f'RF_smdinDeg_ImpFeat_GiniImpgreater{score_threshold}.xlsx')
    else:
        imp_feats_rf_deg.to_excel(f'RF_smdinDeg_FeatByImportance.xlsx')

### RF Permutation Test

This test checks whether the model performance is significant, that is, whether it is better than that of a random model. If it is, the important features give meaningful information; if it is not, the important feature results should not be used, since they essentially mean nothing.

The permutation test permutes the class labels of your samples: labels are shuffled among the samples while keeping the same classes and the same number of samples per class. The model performance is then evaluated for each permutation.

The default metric for model performance is `accuracy`. If your dataset has imbalanced classes, accuracy is not a good metric, so you should switch to another such as `f1_weighted`.

**Note: Permutation tests take a while to run. Set `GENERATE` to False in the cell below for a first analysis of your dataset. If you then want to use the results of a supervised model, run a permutation test to check whether your model is significant.**

p-value calculation: (1 + nº of times a permuted model performs better than the non-permuted model) / nº of permutations.
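
The arithmetic of the p-value formula above can be sketched as follows. This is an illustration only, not the code of `permutation_RF`; the function name `permutation_p_value` is hypothetical, the scores are made up, and treating ties as "not better" (strict `>`) is an assumption.

```python
def permutation_p_value(observed_score, permuted_scores):
    """p-value as described above: (1 + n_better) / n_permutations."""
    n_better = sum(score > observed_score for score in permuted_scores)
    return (1 + n_better) / len(permuted_scores)

# Made-up scores, only to show the arithmetic
permuted = [0.50, 0.60, 0.95, 0.40]
print(permutation_p_value(0.90, permuted))  # (1 + 1) / 4 = 0.5
```

With this formula, the smallest p-value attainable with 500 permutations is 1/500 = 0.002, reached when no permuted model beats the original one.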

In [None]:
GENERATE = True # True if you want to do, False if not
if GENERATE:
    # Set a random seed for reproducibility of cross validation, e.g. np.random.seed(42).
    # With no argument, as below, the seed changes on every run.
    np.random.seed()
    # (Random seed of labels permutations is in the random permutator)

    perm_results_RF_deg = metsta.permutation_RF(
        Deg, target, regression,  # data, labels and if it's a regression
        iter_num=500, # Nº of permutations to do in your test - around 500 should be enough
        n_trees=200, # Number of trees in the model
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed given to make the permutations rng class labels
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(Deg.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_RF_deg
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='RF Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('Random Forest Permutation Test - Degree', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_RF_PermutationTest_smdinDeg.jpg', dpi=400) # Save the Figure

### ROC curves (Receiver Operating Characteristic)

The ROC curve plots the true positive rate against the false positive rate. The closer the area under the curve (AUC) is to 1, the better the model. The procedure is repeated `n_iter` times to obtain a smoother curve and a better estimate of the actual AUC.

**Only possible when your dataset has 2 classes. Choose the class to be considered the 'positive' class.**

Credit for the initial function goes to the BinSim paper.

If you do not have 2 classes, skip this section.
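
To make the quantities on the plot concrete, here is a minimal sketch of how a ROC curve and its AUC can be computed from prediction scores. It is not the code of `RF_ROC_cv`; the helper names `roc_points` and `auc_trapezoid`, the labels and the scores are hypothetical, and tied scores are not treated specially.

```python
import numpy as np

def roc_points(y_true, scores, pos_label):
    """Return (fpr, tpr) obtained by sweeping the threshold from high to low."""
    y = np.asarray(y_true) == pos_label
    order = np.argsort(-np.asarray(scores, dtype=float), kind='stable')
    y = y[order]
    tps = np.cumsum(y)    # true positives accepted so far
    fps = np.cumsum(~y)   # false positives accepted so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

labels = ['case', 'case', 'ctrl', 'case', 'ctrl', 'ctrl']
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # made-up predicted scores for 'case'
fpr, tpr = roc_points(labels, scores, pos_label='case')
print(round(auc_trapezoid(fpr, tpr), 3))
```

In this toy example, 8 of the 9 possible (case, ctrl) pairs are ranked correctly, which is exactly what the AUC of 8/9 expresses.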

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility, e.g. np.random.seed(42).
            # With no argument, as below, the seed changes on every run.
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_RF_deg = metsta.RF_ROC_cv(Deg, target, regres=regression, # Data, target and if it's a regression
                                        pos_label=pos_label, # Positive label
                                        n_trees=200, # Number of trees of RF
                                        n_iter=15, # Number of iterations to repeat 
                                        cv=None, n_fold=3) # Method of CV (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
if GENERATE:
    if not regression and len(pd.unique(target)) == 2:
        # Plot the ROC curves 
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_RF_deg
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('Random Forest ROC Curve - Degree', fontsize=15)

                #f.savefig('Name_RF_ROCcurve_smdinDeg.jpg', dpi=400) # Save the figure

### Fitting the RF model - MDB Impact

**See the details of the `RF_model` function (model fitting AND evaluation) in metanalysis_standard.py. Credit for the initial function goes to the BinSim paper.**

In [None]:
# Choose a number for the seed for consistent results, e.g. np.random.seed(42).
# With no argument, as below, the seed changes on every run.
np.random.seed()

n_trees=200 # Number of trees in the model

RF_results_mdbi = metsta.RF_model(MDB_Impact, target, regression, # Data, labels and if it's a regression or classification
                return_cv=True, iter_num=5, # If you want cross validation results and number of iterations for it
                n_trees=n_trees, # Number of trees in the model
                cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
                # For Classification Problems
                 metrics = ('accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted')) # Choose the perf. metrics

                # For Regression problems
                #metrics = ('neg_mean_squared_error',), n_jobs=-1)

**Performance analysis**

In [None]:
rf_results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
for k,v in RF_results_mdbi.items():
    if k != 'model' and k != 'imp_feat':
        rf_results_summary.loc[k] = np.mean(v), np.std(v)

print(rf_results_summary)

**Important Feature analysis**

See the most important features for class discrimination (sorted by importance).

In [None]:
imp_feats_rf_mdbi = pd.DataFrame(index=MDB_Impact.columns)
imp_feats_rf_mdbi.insert(0,'Bucket label', imp_feats_rf_mdbi.index)
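imp_feats_rf_mdbi.insert(1,'Gini Importance', np.nan)
gini_col = imp_feats_rf_mdbi.columns.get_loc('Gini Importance')
for n in range(len(RF_results_mdbi['imp_feat'])):
    # Single .iloc call avoids chained assignment, which may not write back in recent pandas
    imp_feats_rf_mdbi.iloc[RF_results_mdbi['imp_feat'][n][0], gini_col] = RF_results_mdbi['imp_feat'][n][1]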
imp_feats_rf_mdbi = imp_feats_rf_mdbi.sort_values(by='Gini Importance', ascending=False)
imp_feats_rf_mdbi.index = range(1, len(imp_feats_rf_mdbi)+1)

In [None]:
imp_feats_rf_mdbi

In [None]:
# Saving Important feature dataset in an excel
SAVE_IMP_FEAT = True

if SAVE_IMP_FEAT:
    imp_feats_rf_mdbi.to_excel(f'RF_smdinMDBI_FeatByImportance.xlsx')

In [None]:
import metabolinks.transformations as transf
f, ax = plt.subplots(figsize=(10,6))

tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(MDB_Impact)
df = df[imp_feats_rf_mdbi['Bucket label']]

g = sns.heatmap(df.T, cmap='PRGn', vmin=-3, vmax=3)

# Manually specify colorbar labelling after it's been generated
colorbar = g.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

### RF Permutation Test

This test checks whether the model performance is significant, that is, whether it is better than that of a random model. If it is, the important features give meaningful information; if it is not, the important feature results should not be used, since they essentially mean nothing.

The permutation test permutes the class labels of your samples: labels are shuffled among the samples while keeping the same classes and the same number of samples per class. The model performance is then evaluated for each permutation.

The default metric for model performance is `accuracy`. If your dataset has imbalanced classes, accuracy is not a good metric, so you should switch to another such as `f1_weighted`.

**Note: Permutation tests take a while to run. Set `GENERATE` to False in the cell below for a first analysis of your dataset. If you then want to use the results of a supervised model, run a permutation test to check whether your model is significant.**

p-value calculation: (1 + nº of times a permuted model performs better than the non-permuted model) / nº of permutations.
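
As a small illustration of the metric choice mentioned above, the sketch below computes accuracy and a weighted F1 score by hand for a "model" that always predicts the majority class of an imbalanced toy dataset. The helper functions and the labels are hypothetical; the weighted F1 here averages the per-class F1 with weights equal to each class's share of the true labels.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_weighted(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true / len(y_true)
    return total

# 9 controls and 1 case; every sample is predicted as 'ctrl'
y_true = ['ctrl'] * 9 + ['case']
y_pred = ['ctrl'] * 10
print(accuracy(y_true, y_pred), round(f1_weighted(y_true, y_pred), 3))
```

Accuracy is 0.9 even though the minority class is never detected, while the weighted F1 is lower (about 0.853) because the missed class contributes an F1 of zero. The gap stays modest here because the weights follow the class sizes.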

In [None]:
GENERATE = True # True if you want to do, False if not
if GENERATE:
    # Set a random seed for reproducibility of cross validation, e.g. np.random.seed(42).
    # With no argument, as below, the seed changes on every run.
    np.random.seed()
    # (Random seed of labels permutations is in the random permutator)

    perm_results_RF_mdbi = metsta.permutation_RF(
        MDB_Impact, target, regression,  # data, labels and if it's a regression
        iter_num=500, # Nº of permutations to do in your test - around 500 should be enough
        n_trees=200, # Number of trees in the model
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed given to make the permutations rng class labels
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(MDB_Impact.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_RF_mdbi
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='RF Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('Random Forest Permutation Test - MDB Impact', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_RF_PermutationTest_smdinMDBI.jpg', dpi=400) # Save the Figure

### ROC curves (Receiver Operating Characteristic)

The ROC curve plots the true positive rate against the false positive rate. The closer the area under the curve (AUC) is to 1, the better the model. The procedure is repeated `n_iter` times to obtain a smoother curve and a better estimate of the actual AUC.

**Only possible when your dataset has 2 classes. Choose the class to be considered the 'positive' class.**

Credit for the initial function goes to the BinSim paper.

If you do not have 2 classes, skip this section.
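
One common way to combine the curves of several iterations into a single smoother curve is to interpolate each one onto a shared grid of false positive rates and average the true positive rates. The sketch below shows that idea on two made-up curves. Whether `RF_ROC_cv` averages in exactly this way is an assumption, and the name `average_roc` is hypothetical.

```python
import numpy as np

def average_roc(curves, n_points=5):
    """Interpolate each (fpr, tpr) curve onto a shared FPR grid and average."""
    grid = np.linspace(0.0, 1.0, n_points)
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    mean_tpr = np.mean(tprs, axis=0)
    # Trapezoidal area under the averaged curve
    mean_auc = float(np.sum(np.diff(grid) * (mean_tpr[1:] + mean_tpr[:-1]) / 2))
    return grid, mean_tpr, mean_auc

# Two made-up ROC curves, e.g. from two cross-validation iterations
curves = [
    ([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]),
    ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),
]
grid, mean_tpr, mean_auc = average_roc(curves)
print(mean_tpr, mean_auc)
```

The individual curves have areas of 0.75 and 0.5, and the averaged curve has an area of 0.625, halfway between the two.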

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility, e.g. np.random.seed(42).
            # With no argument, as below, the seed changes on every run.
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_RF_mdbi = metsta.RF_ROC_cv(MDB_Impact, target, regres=regression, # Data, target and if it's a regression
                                        pos_label=pos_label, # Positive label
                                        n_trees=200, # Number of trees of RF
                                        n_iter=15, # Number of iterations to repeat 
                                        cv=None, n_fold=3) # Method of CV (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
if GENERATE:
    if not regression and len(pd.unique(target)) == 2:
        # Plot the ROC curves 
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_RF_mdbi
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('Random Forest ROC Curve - MDB Impact', fontsize=15)

                #f.savefig('Name_RF_ROCcurve_smdinMDBI.jpg', dpi=400) # Save the figure

### PLS-DA (Partial Least Squares - Discriminant Analysis) - Degree

First, the number of components of the PLS-DA model is optimized. **See the set of functions for PLS-DA (`optim_PLSDA_n_components`, for example) in metanalysis_standard.py.**

The VIP scores are calculated using the function `_calculate_vips` in multianalysis.py, which comes from https://www.researchgate.net/post/How-can-I-compute-Variable-Importance-in-Projection-VIP-in-Partial-Least-Squares-PLS as provided there by Keiron Teilo O'Shea.

**Note: `max_comp` (maximum number of components) cannot be higher than the number of samples used to train a model minus 1. For example, with 15 samples and 3-fold cross-validation, each fold has 5 samples. A training set is made of two of those folds, so it has 10 samples, and `max_comp` (and `n_comp` later on) cannot be higher than 9. As another example, with 22 samples and 5 folds, the folds have 4/4/4/5/5 samples. A training set is made of four of these folds, and the smallest one has 4+4+4+5=17 samples, so `max_comp` cannot be higher than 17-1=16.**
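
The rule in the note can be turned into a small helper. This is a sketch that assumes fold sizes differ by at most one sample, as in the two examples of the note; the function name `max_pls_components` is hypothetical.

```python
def max_pls_components(n_samples, n_fold):
    """Largest allowed number of components: smallest training set size minus 1."""
    # The largest test fold has ceil(n_samples / n_fold) samples
    largest_fold = -(-n_samples // n_fold)
    smallest_training_set = n_samples - largest_fold
    return smallest_training_set - 1

print(max_pls_components(15, 3))  # 15 - 5 - 1 = 9
print(max_pls_components(22, 5))  # 22 - 5 - 1 = 16
```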

In [None]:
%%capture --no-stdout
# above is to suppress PLS warnings

# Set the random seed, e.g. np.random.seed(42).
# With no argument, as below, the seed changes on every run.
np.random.seed()

max_comp = 9 # Max. number of components to search (the higher the more time it takes)

# Store Results
PLS_optim_deg = metsta.optim_PLSDA_n_components(Deg, target, regression, # Data, target and if it's a regression
                                    encode2as1vector=True,
                                    max_comp=max_comp, # Max. number of components to search
                                    kf=None, n_fold=3, # Cross validation to use (none is stratified CV) and nº of folds
                                    scale=False) # Set scale to True only if you did not do scaling in pre-treatments

In the figure below, $R^{2}$ and $Q^{2}$ are shown. Choose the number of components at which **$Q^{2}$ specifically** stops increasing; in this case, 4 components are chosen.

- $Q^{2}$ - PLS score, based on the mean squared error, computed on the test samples; it is therefore well suited to checking whether the model overfits. It increases up to a certain number of components, which is the number that should be chosen. It then usually stabilizes, but from a certain point it may start to decrease, which would mean the model is overfitting. For example, in this case, we choose 4 components based on this score, but choosing 5 or 6 would not affect the model much.
- $R^{2}$ - PLS score, based on the mean squared error, computed on the training samples used to build the model. It will be higher than $Q^{2}$, but it should not be used to choose the number of components: it keeps increasing as more components are used, so following it will eventually overfit the model.
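
The difference between the two scores can be seen on synthetic data. The sketch below uses ordinary least squares as a simple stand-in for a PLS model (so it is not what `optim_PLSDA_n_components` does) and computes $R^{2}$ on the samples used for fitting and $Q^{2}$ on samples predicted by models that never saw them. All data, names and sizes are made up.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(X_train, y_train, X_new):
    """Ordinary least squares with an intercept (stand-in for a PLS model)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))                   # 30 samples, 8 features
y = X[:, 0] + rng.normal(scale=1.0, size=30)   # only the first feature matters

# R2: fit and predict on the same samples
r2 = r2_score(y, fit_predict(X, y, X))

# Q2: every sample is predicted by a model that did not see it (3 folds)
y_cv = np.empty_like(y)
for test_idx in np.array_split(rng.permutation(30), 3):
    train_idx = np.setdiff1d(np.arange(30), test_idx)
    y_cv[test_idx] = fit_predict(X[train_idx], y[train_idx], X[test_idx])
q2 = r2_score(y, y_cv)

print(f'R2 = {r2:.3f}, Q2 = {q2:.3f}')
```

Because the seven uninformative features are fitted to noise in the training samples, $R^{2}$ comes out above $Q^{2}$, which is why only $Q^{2}$ is used to choose the number of components.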

In [None]:
scores_cols = sns.color_palette('tab10', 10) # Set the colors for the lines
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True) # Set the figure size
        c = 0
        for i, values in PLS_optim_deg.items():
            if i =='CVscores':
                name = 'Q$^2$'
            else:
                name = 'R$^2$'
            
            ax.plot(range(1, len(values) + 1), values, label=name, color = scores_cols[c])
            c = c+1
        
        ax.set(xlabel='Number of Components', # Set the label for the x axis
                ylabel='PLS Score') # Set the label for the Y axis
        ax.legend(loc='lower right', fontsize=15) # Set the legend
        ax.set_ylim([0, 1.02]) # Set limits for y axis
        ax.set_xticks(range(0, len(values), 2)) # Set ticks that appear in the bottom of x axis
        ax.set_title('Degree', fontsize=15)
        plt.show()

### PLS-DA model fitting

**See the details of the `PLSDA_model_CV` function (model fitting AND evaluation) in metanalysis_standard.py, adapted from the one in the BinSim paper.**

The VIP scores are calculated using the function `_calculate_vips` in multianalysis.py, which comes from https://www.researchgate.net/post/How-can-I-compute-Variable-Importance-in-Projection-VIP-in-Partial-Least-Squares-PLS as provided there by Keiron Teilo O'Shea.
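
For orientation, the usual textbook definition of VIP scores can be written in a few lines of NumPy. This is a sketch of that definition, not the code of `_calculate_vips`; the function name `vip_scores` is hypothetical and the score, weight and loading matrices are random placeholders rather than the output of a fitted model.

```python
import numpy as np

def vip_scores(t, w, q):
    """VIP from X scores t (n x h), X weights w (p x h) and y loadings q (m x h)."""
    p, h = w.shape
    # Sum of squares of y explained by each component
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)   # shape (h,)
    # Squared weights, with each component's weight vector scaled to unit norm
    w_norm_sq = (w / np.linalg.norm(w, axis=0)) ** 2        # shape (p, h)
    return np.sqrt(p * (w_norm_sq @ ss) / np.sum(ss))

rng = np.random.default_rng(1)
t = rng.normal(size=(12, 3))   # made-up scores: 12 samples, 3 components
w = rng.normal(size=(6, 3))    # made-up weights: 6 features
q = rng.normal(size=(1, 3))    # made-up y loadings: 1 response
vips = vip_scores(t, w, q)
print(vips.round(3), np.mean(vips ** 2))
```

By construction the mean of the squared VIP scores is 1, which is the reason a VIP score above 1 is commonly used as the threshold for an important feature.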

The function `_generate_y_PLSDA` is also present in multianalysis.py.

In [None]:
%%capture --no-stdout
# above is to suppress PLS warnings

n_comp = 4 # Number of components of PLS-DA model - very important (4 chosen from the Q2 curve above)

PLSDA_results_deg = metsta.PLSDA_model_CV(Deg, target, regression, # Data, target and if it's a regression
                       n_comp=n_comp, # Number of components of PLS-DA model - very important
                       kf = None, n_fold=3, # Cross validation to use (none is stratified CV) and nº of folds
                       iter_num=10, # Number of iterations of cross-validation to do
                       encode2as1vector=True,
                       scale=False, # Set scale to True only if you did not do scaling in pre-treatments
                       feat_type='VIP') # Feature Importance Metric to use, default is VIP scores (see function for others)

**Performance analysis**

In [None]:
pls_results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
for k,v in PLSDA_results_deg.items():
    if k != 'Q2' and k != 'imp_feat':
        pls_results_summary.loc[k] = np.mean(v), np.std(v)

print(pls_results_summary)

**Important Feature analysis**

See the most important features for class discrimination.
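
Further down, the ranked table is saved either as a top fraction of the features, as the features above a score threshold, or in full. The sketch below mirrors that decision logic on a plain list so the effect of each option is easy to check. The function name `select_important` and the scores are hypothetical.

```python
def select_important(ranked, frac=None, threshold=None):
    """ranked: list of (feature, score) sorted by decreasing score."""
    if frac:
        return ranked[:int(frac * len(ranked))]
    if threshold:
        return [(name, score) for name, score in ranked if score > threshold]
    return ranked

# Made-up VIP scores for five features, already sorted
ranked = [(f'feat_{i}', score) for i, score in enumerate([2.1, 1.6, 1.2, 0.9, 0.5])]
print(select_important(ranked, frac=0.4))      # first int(0.4 * 5) = 2 features
print(select_important(ranked, threshold=1))   # features with score above 1
print(select_important(ranked, frac=0.02))     # int(0.02 * 5) = 0 features
```

Because `int()` truncates, a fraction of 0.02 keeps nothing when there are fewer than 50 features, so the threshold option may be the better choice for small tables.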

In [None]:
imp_feats_plsda_deg = processed_data.loc[list(general_MDiN.nodes()),
                                  [i for i in processed_data.columns if i not in treated_data.index]].copy()
imp_feats_plsda_deg.insert(0,'Bucket label', imp_feats_plsda_deg.index)
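imp_feats_plsda_deg.insert(1,'VIP Score', np.nan)
vip_col = imp_feats_plsda_deg.columns.get_loc('VIP Score')
for n in range(len(PLSDA_results_deg['imp_feat'])):
    # Single .iloc call avoids chained assignment, which may not write back in recent pandas
    imp_feats_plsda_deg.iloc[PLSDA_results_deg['imp_feat'][n][0], vip_col] = PLSDA_results_deg['imp_feat'][n][1]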
imp_feats_plsda_deg = imp_feats_plsda_deg.sort_values(by='VIP Score', ascending=False)
imp_feats_plsda_deg.index = range(1, len(imp_feats_plsda_deg)+1)

In [None]:
imp_feats_plsda_deg.head(20) # Select number of features to see

In [None]:
# Saving Important feature dataset in an excel
SAVE_IMP_FEAT = True

# Saving the most important features by their fraction 'frac_feat_impor'.
# If None, saving the most important features based on a threshold 'VIP_Score_threshold'.
# If also None, save the full dataset of all features
frac_feat_impor = 0.02 # Fraction of features to save, If None the variable in the next line is used.
VIP_Score_threshold = 1 # Only used if variable above is None, threshold of score to consider a feature important.

if SAVE_IMP_FEAT:
    if frac_feat_impor:
        max_idx = int(frac_feat_impor*len(imp_feats_plsda_deg))
        filt_imp_feats_plsda_deg = imp_feats_plsda_deg.iloc[:max_idx]
        filt_imp_feats_plsda_deg.to_excel(f'PLSDA_smdinDeg_ImpFeat_{frac_feat_impor*100}%.xlsx')
    elif VIP_Score_threshold:
        filt_imp_feats_plsda_deg = imp_feats_plsda_deg[imp_feats_plsda_deg['VIP Score'] > VIP_Score_threshold]
        filt_imp_feats_plsda_deg.to_excel(f'PLSDA_smdinDeg_ImpFeat_VIPgreater{VIP_Score_threshold}.xlsx')
    else:
        imp_feats_plsda_deg.to_excel(f'PLSDA_smdinDeg_FeatByImportance.xlsx')

### Sample Projection on the two most important Components/Latent Variables of PLS models 

**To do** See if it's worth doing this in a regression

In [None]:
if not regression:
    n_components = 4 # Nº of components

    model, scores = fit_PLSDA_model(Deg, target,
                                    n_comp=n_components, scale=False, # Only true if scaling was not done earlier
                                    encode2as1vector=True,
                                    lv_prefix='LV ', label_name='Label')

    lcolors = label_colours

    with sns.axes_style("whitegrid"):
        with sns.plotting_context("notebook", font_scale=1.2):
            fig, ax = plt.subplots(1,1, figsize=(6,6)) # Set up fig size
            plot_PCA(scores, lcolors, title="PLS Projection - Degree", ax=ax,
                    components=(1,2)) # Select components to see
            plt.title('PLS Projection - Degree', fontsize=20) # Title
            plt.legend(loc='upper right', ncol=1, fontsize=15)  # Legend           
            plt.tight_layout()
            plt.show()
            
            #fig.savefig('Name_PLSplot_smdinDeg.jpg', dpi=400) # Save the figure

### PLS-DA Permutation Test

This test checks whether the model performance is significant, that is, whether it is better than that of a random model. If it is, the important features give meaningful information; if it is not, the important feature results should not be used, since they essentially mean nothing.

The permutation test permutes the class labels of your samples: labels are shuffled among the samples while keeping the same classes and the same number of samples per class. The model performance is then evaluated for each permutation.

The default metric for model performance is `accuracy`. If your dataset has imbalanced classes, accuracy is not a good metric, so you should switch to another such as `f1_weighted`. The metric can only be `accuracy`, `f1_weighted`, `recall_weighted` or `precision_weighted`.

**Note: Permutation tests take a while to run. Set `GENERATE` to False in the cell below for a first analysis of your dataset. If you then want to use the results of a supervised model, run a permutation test to check whether your model is significant.**

p-value calculation: (1 + nº of times a permuted model performs better than the non-permuted model) / nº of permutations.
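
The label shuffling described above can be sketched in a few lines. This only illustrates why the class sizes are preserved; it is not the code of `permutation_PLSDA`, and the function name `permute_labels` and the labels are hypothetical.

```python
import random
from collections import Counter

def permute_labels(labels, seed=None):
    """Shuffle the labels among the samples; the class counts stay the same."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

labels = ['wild'] * 4 + ['mutant'] * 3  # made-up class labels
permuted = permute_labels(labels, seed=7)
print(permuted)
print(Counter(permuted) == Counter(labels))
```

Shuffling only reassigns the existing labels, so each permuted dataset has the same classes and class sizes as the original, and any remaining performance reflects chance alone.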

In [None]:
GENERATE = True # True if you want to do, False if not
if GENERATE:
    # Set a random seed for reproducibility of cross validation, e.g. np.random.seed(42).
    # With no argument, as below, the seed changes on every run.
    np.random.seed()
    # (Random seed of labels permutations is in the random state in the function below)

    perm_results_PLSDA_deg = metsta.permutation_PLSDA(
        Deg, target,  # data and labels
        n_comp=4, # Number of components
        iter_num=500, # Nº of permutations to do in your test - around 500 should be enough
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed given to make the permutations rng class labels
        encode2as1vector=True, scale=False, # Set scale to True only if you did not do scaling in pre-treatments
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(Deg.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_PLSDA_deg
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='PLS-DA Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('PLS-DA Permutation Test - Degree', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_PLSDA_PermutationTest_smdinDeg.jpg', dpi=400) # Save the Figure

### ROC curves (Receiver Operating Characteristic)

The ROC curve plots the true positive rate against the false positive rate. The closer the area under the curve (AUC) is to 1, the better the model. The procedure is repeated `n_iter` times to obtain a smoother curve and a better estimate of the actual AUC.

**Only possible when your dataset has 2 classes. Choose the class to be considered the 'positive' class.**

Credit for the initial function goes to the BinSim paper.

If you do not have 2 classes, skip this section.

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility, e.g. np.random.seed(42).
            # With no argument, as below, the seed changes on every run.
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_PLSDA_deg = metsta.PLSDA_ROC_cv(Deg, target, # Data and target
                                pos_label=pos_label, # Positive label
                                n_comp=4, # Number of components
                                scale=False, # Set scale to True only if you did not do scaling in pre-treatments
                                n_iter=15, # Number of iterations to repeat 
                                cv=None, n_fold=3) # method of cross-validation (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
# Plot the ROC curves 
if GENERATE:
    if not regression and len(pd.unique(target)) == 2:
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_PLSDA_deg
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('PLS-DA ROC Curve - Degree', fontsize=15)

                #f.savefig('Name_PLSDA_ROCcurve_smdinDeg.jpg', dpi=400) # Save the figure

### PLS-DA (Partial Least Squares - Discriminant Analysis) - MDB Impact

First, the number of components of the PLS-DA model is optimized. **See the set of functions for PLS-DA (`optim_PLSDA_n_components`, for example) in metanalysis_standard.py.**

The VIP scores are calculated using the function `_calculate_vips` in multianalysis.py, which comes from https://www.researchgate.net/post/How-can-I-compute-Variable-Importance-in-Projection-VIP-in-Partial-Least-Squares-PLS as provided there by Keiron Teilo O'Shea.

**Note: `max_comp` (maximum number of components) cannot be higher than the number of samples used to train a model minus 1. For example, with 15 samples and 3-fold cross-validation, each fold has 5 samples. A training set is made of two of those folds, so it has 10 samples, and `max_comp` (and `n_comp` later on) cannot be higher than 9. As another example, with 22 samples and 5 folds, the folds have 4/4/4/5/5 samples. A training set is made of four of these folds, and the smallest one has 4+4+4+5=17 samples, so `max_comp` cannot be higher than 17-1=16.**

In [None]:
%%capture --no-stdout
# above is to suppress PLS warnings

# Set the random seed, e.g. np.random.seed(42).
# With no argument, as below, the seed changes on every run.
np.random.seed()

max_comp = 9 # Max. number of components to search (the higher the more time it takes)

# Store Results
PLS_optim_mdbi = metsta.optim_PLSDA_n_components(MDB_Impact, target, regression, # Data, target and if it's a regression
                                    encode2as1vector=True,
                                    max_comp=max_comp, # Max. number of components to search
                                    kf=None, n_fold=3, # Cross validation to use (none is stratified CV) and nº of folds
                                    scale=False) # Set scale to True only if you did not do scaling in pre-treatments

In the figure below, $R^{2}$ and $Q^{2}$ are shown. You want to choose the number of components **where $Q^{2}$ specifically** stops increasing, so, in this case, 4 components will be chosen. 

- $Q^{2}$ - PLS score computed on the test samples (based on the mean squared error), which makes it suitable for checking whether the model overfits. It increases up to a certain number of components, which is the number that should be chosen. It then usually stabilizes, but from a certain point it might start to decrease, which would mean the model is overfitting. For example, in this case, we choose 4 components based on this score, but you could choose 5 or 6 and it would not affect the model a lot.
- $R^{2}$ - PLS score computed on the training samples used to build the model (based on the mean squared error). It will be higher than $Q^{2}$, but it should not be used to choose the number of components: this metric always increases as more components are used, so following it will eventually overfit the model.
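
The selection rule described above (choose the number of components where $Q^{2}$ stops increasing) can be sketched as follows. The helper `pick_n_components`, the tolerance `min_gain` and the example scores are illustrative assumptions, not part of the notebook's modules. In the notebook, the $Q^{2}$ values would be the ones stored under `'CVscores'` in `PLS_optim_mdbi`.

```python
def pick_n_components(q2_scores, min_gain=0.01):
    """Return the number of components after which Q2 gains less than min_gain.

    q2_scores[0] is the score with 1 component, q2_scores[1] with 2, and so on.
    Hypothetical helper for illustration; min_gain is an arbitrary tolerance.
    """
    for i in range(1, len(q2_scores)):
        if q2_scores[i] - q2_scores[i - 1] < min_gain:
            return i  # q2_scores[i - 1] belongs to i components
    return len(q2_scores)  # Q2 was still increasing at the last value tested

# Made-up Q2 values for 1 to 6 components (not real results)
example_q2 = [0.20, 0.50, 0.70, 0.80, 0.805, 0.80]
print(pick_n_components(example_q2))
```

A rule like this is only a starting point. Looking at the plot remains the safer way to decide, since a neighbouring number of components often performs almost identically.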

In [None]:
scores_cols = sns.color_palette('tab10', 10) # Set the colors for the lines
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True) # Set the figure size
        c = 0
        for i, values in PLS_optim_mdbi.items():
            if i =='CVscores':
                name = 'Q$^2$'
            else:
                name = 'R$^2$'
            
            ax.plot(range(1, len(values) + 1), values, label=name, color = scores_cols[c])
            c = c+1
        
        ax.set(xlabel='Number of Components', # Set the label for the x axis
                ylabel='PLS Score') # Set the label for the Y axis
        ax.legend(loc='lower right', fontsize=15) # Set the legend
        ax.set_ylim([0, 1.02]) # Set limits for y axis
        ax.set_xticks(range(0, len(values), 2)) # Set ticks that appear in the bottom of x axis
        ax.set_title('MDB Impact', fontsize=15)
        plt.show()

### PLS-DA model fitting

**See details of the `PLSDA_model_CV` function (model fitting AND evaluation) in metanalysis_standard.py, adapted from the one in the BinSim paper.**

The VIP (Variable Importance in Projection) scores are calculated using the function `_calculate_vips` in multianalysis.py. This function comes from https://www.researchgate.net/post/How-can-I-compute-Variable-Importance-in-Projection-VIP-in-Partial-Least-Squares-PLS as provided by Keiron Teilo O'Shea in that link.

The function `_generate_y_PLSDA` is also present in multianalysis.py.
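
For reference, the VIP score of feature $j$ is commonly defined as shown below. This is the usual textbook form, given here as an assumption about what `_calculate_vips` computes rather than a transcription of that function. Here $p$ is the number of features, $A$ the number of components, $w_{ja}$ the weight of feature $j$ on component $a$, $\mathbf{w}_a$ the full weight vector of component $a$, and $SS_a$ the sum of squares of the response explained by component $a$.

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}{\sum_{a=1}^{A} SS_a}}$$

With this definition the squared VIP scores average to 1 across features, which is why a VIP score above 1 is often read as an above-average contribution to the model.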

In [None]:
%%capture --no-stdout
# the line above suppresses PLS warnings

n_comp = 4 # Number of components of PLS-DA model - very important

PLSDA_results_mdbi = metsta.PLSDA_model_CV(MDB_Impact, target, regression, # Data, target and if it's a regression
                       n_comp=n_comp, # Number of components of PLS-DA model - very important
                       kf = None, n_fold=3, # Cross validation to use (none is stratified CV) and nº of folds
                       iter_num=10, # Number of iterations of cross-validation to do
                       encode2as1vector=True,
                       scale=False, # Set scale to True only if you did not do scaling in pre-treatments
                       feat_type='VIP') # Feature Importance Metric to use, default is VIP scores (see function for others)

**Performance analysis**

In [None]:
pls_results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
for k,v in PLSDA_results_mdbi.items():
    if k != 'Q2' and k != 'imp_feat':
        pls_results_summary.loc[k] = np.mean(v), np.std(v)

print(pls_results_summary)

**Important Feature analysis**

See the most important features for class discrimination.

In [None]:
imp_feats_plsda_mdbi = pd.DataFrame(index=MDB_Impact.columns)
imp_feats_plsda_mdbi.insert(0,'Bucket label', imp_feats_plsda_mdbi.index)
imp_feats_plsda_mdbi.insert(1,'VIP Score', np.nan)
vip_col = imp_feats_plsda_mdbi.columns.get_loc('VIP Score')
# Fill in the scores by position; a single .iloc call avoids chained assignment
for feat in PLSDA_results_mdbi['imp_feat']:
    imp_feats_plsda_mdbi.iloc[feat[0], vip_col] = feat[1]
imp_feats_plsda_mdbi = imp_feats_plsda_mdbi.sort_values(by='VIP Score', ascending=False)
imp_feats_plsda_mdbi.index = range(1, len(imp_feats_plsda_mdbi)+1)

In [None]:
imp_feats_plsda_mdbi

In [None]:
# Save the important feature table to an Excel file
SAVE_IMP_FEAT = True

if SAVE_IMP_FEAT:
    imp_feats_plsda_mdbi.to_excel('PLSDA_smdinMDBI_FeatByImportance.xlsx')

In [None]:
import metabolinks.transformations as transf
f, ax = plt.subplots(figsize=(10,6))

tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(MDB_Impact)
df = df[imp_feats_plsda_mdbi['Bucket label']]

g = sns.heatmap(df.T, cmap='PRGn', vmin=-3, vmax=3)

# Manually specify colorbar labelling after it's been generated
colorbar = g.collections[0].colorbar
colorbar.ax.tick_params(labelsize=14) 

### Sample Projection on the two most important Components/Latent Variables of PLS models 

**To do** See if it's worth doing this in a regression

In [None]:
if not regression:
    n_components = 4 # Nº of components

    model, scores = fit_PLSDA_model(MDB_Impact, target,
                                    n_comp=n_components, scale=False, # Only true if scaling was not done earlier
                                    encode2as1vector=True,
                                    lv_prefix='LV ', label_name='Label')

    lcolors = label_colours

    with sns.axes_style("whitegrid"):
        with sns.plotting_context("notebook", font_scale=1.2):
            fig, ax = plt.subplots(1,1, figsize=(6,6)) # Set up fig size
            plot_PCA(scores, lcolors, title="PLS Projection - MDB Impact", ax=ax,
                    components=(1,2)) # Select components to see
            plt.title('PLS Projection - MDB Impact', fontsize=20) # Title
            plt.legend(loc='upper right', ncol=1, fontsize=15)  # Legend           
            plt.tight_layout()
            plt.show()
            
            #fig.savefig('Name_PLSplot_smdinMDBI.jpg', dpi=400) # Save the figure

### PLS-DA Permutation Test

This test checks whether the model performance is significant, that is, whether it is better than that of a random model. If it is, the important features obtained from the model give meaningful information; if not, the important feature results cannot be used, since they essentially mean nothing.

The permutation test permutes the class labels of your samples, that is, the labels are randomly shuffled while keeping the same classes and the same number of samples per class. The model performance is then evaluated for each permutation.

The default metric for model performance is `accuracy`. If you have an imbalanced dataset, accuracy is not a good metric, so you should change to another such as `f1_weighted`. The metric can only be `accuracy`, `f1_weighted`, `recall_weighted` or `precision_weighted`.

**Note: Permutation tests take a while to run, so you can set `GENERATE` to `False` while making a first analysis of your dataset. If you then want to use the results of a supervised model, run a permutation test to check whether your model is significant.**

p-value calculation: (1 + nº of times a permuted model performs better than the non-permuted model) / nº of permutations.
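
The p-value formula above can be written out as a short function. This is a minimal sketch with a hypothetical name (`permutation_p_value`) and made-up scores. It follows the formula as stated in this notebook, and whether ties count as "better" is an assumption that should be checked against `permutation_PLSDA`.

```python
def permutation_p_value(observed_score, permuted_scores):
    """p-value = (1 + nº of permuted models better than the observed one) / nº of permutations."""
    # Count permuted models that score strictly better than the real model.
    # Treating ties as "not better" is an assumption made for this sketch.
    n_better = sum(score > observed_score for score in permuted_scores)
    return (1 + n_better) / len(permuted_scores)

# Made-up scores for illustration: real model at 0.90 accuracy, 4 permutations
print(permutation_p_value(0.90, [0.50, 0.60, 0.95, 0.40]))
```

With this formula, the smallest p-value obtainable from 500 permutations is 1/500 = 0.002, so the number of permutations sets the resolution of the test.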

In [None]:
GENERATE = True # True if you want to do, False if not
if GENERATE:
    # Set a random seed for reproducibility of cross validation (put a number inside the parentheses)
    np.random.seed()
    # (The random seed for the label permutations is the random_state argument of the function below)

    perm_results_PLSDA_mdbi = metsta.permutation_PLSDA(
        MDB_Impact, target,  # data and labels
        n_comp=4, # Number of components
        iter_num=500, # Nº of permutations to do in your test - around 500 should be enough
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed for the generator that permutes the class labels
        encode2as1vector=True, scale=False, # Set scale to True only if you did not do scaling in pre-treatments
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(MDB_Impact.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_PLSDA_mdbi
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='PLS-DA Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value = %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('PLS-DA Permutation Test - MDB Impact', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_PLSDA_PermutationTest_smdinMDBI.jpg', dpi=400) # Save the Figure

### ROC curves (Receiver Operating Characteristic)

A ROC curve plots the true positive rate against the false positive rate. The closer the area under the curve (AUC) is to 1, the better the model. The cross-validation is repeated `n_iter` times to obtain a smoother curve and a better estimate of the actual AUC.

**Only possible when your dataset has 2 classes. Choose the class which is considered the 'positive' class.**

Credit for the initial function goes to the BinSim paper.

If you do not have 2 classes, skip this section.
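
To make the AUC concrete, the sketch below computes the area under a ROC curve with the trapezoidal rule. The function `area_under_curve` is a hypothetical illustration and is not how `PLSDA_ROC_cv` is necessarily implemented. It only assumes that the curve points are sorted by increasing false positive rate.

```python
def area_under_curve(fpr, tpr):
    """Trapezoidal area under a ROC curve given points sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

print(area_under_curve([0, 0, 1], [0, 1, 1]))  # perfect classifier
print(area_under_curve([0, 1], [0, 1]))        # diagonal (random classifier)
```

A perfect classifier reaches an area of 1, while the diagonal line drawn in the plots below corresponds to a random classifier with an area of 0.5.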

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_PLSDA_mdbi = metsta.PLSDA_ROC_cv(MDB_Impact, target, # Data and target
                                pos_label=pos_label, # Positive label
                                n_comp=4, # Number of components
                                scale=False, # Set scale to True only if you did not do scaling in pre-treatments
                                n_iter=15, # Number of iterations to repeat 
                                cv=None, n_fold=3) # method of cross-validation (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
# Plot the ROC curves 
if GENERATE:
    if len(pd.unique(target)) == 2:
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_PLSDA_mdbi
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('PLS-DA ROC Curve - MDB Impact', fontsize=15)

                #f.savefig('Name_PLSDA_ROCcurve_smdinMDBI.jpg', dpi=400) # Save the figure

### XGBoost (eXtreme Gradient Boosting)

This block of code automatically selects an XGBoost objective function for your specific use case. If you want to use a different function, you may select it here, or in the 'objective' input to the functions.

Some reading on objective functions:
- https://xgboost.readthedocs.io/en/stable/parameter.html (Ctrl-F objective)
- https://machinelearningmastery.com/xgboost-loss-functions/

In [None]:
xgb_analysis = True

In [None]:
if regression:
    objective = "reg:squarederror"
else:
    if len(pd.unique(target)) == 2:
        print('Warning: XGBoost is currently unreliable for binary classification tasks. If you still want to use it, delete the line xgb_analysis = False.')
        objective = "binary:logistic"
        xgb_analysis = False
    else:
        objective = "multi:softprob"

### XGBoost - Degree

We first start with a brief optimization of the parameters for XGBoost training. Default is to focus only on the number of estimators (trees) and their maximum depth. However, there are other parameters that can be tweaked, simply by adding new terms to the `xgb_optim_params_deg` dictionary. Please be aware that each new parameter will exponentially increase the running time of the function, and that for regression problems even a single-parameter tuning can take very long.

To fix a hyperparameter that you do not want to tune at a non-default value, simply pass it to the `optimise_xgb_parameters` function as a keyword argument (`**kwargs`).

Resources on XGBoost Hyperparameters and their tuning:
- https://xgboost.readthedocs.io/en/stable/parameter.html
- https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning
- https://freedium.cfd/https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663
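
The growth in running time can be estimated before launching the search by counting the parameter combinations. The sketch below assumes an exhaustive grid search in which every combination is fitted once per cross-validation fold. The extra `max_depth` and `subsample` values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Grid used in the notebook: only the number of trees is tuned
grid = {'n_estimators': range(10, 301, 5)}
print(len(list(product(*grid.values()))))

# Hypothetical extension with two more parameters (values chosen for illustration only)
grid['max_depth'] = range(2, 11)        # 9 values
grid['subsample'] = [0.5, 0.75, 1.0]    # 3 values
n_combinations = len(list(product(*grid.values())))
print(n_combinations)

# Each combination is fitted once per cross-validation fold
n_fold = 3
print(n_combinations * n_fold)
```

Each added parameter multiplies the number of combinations by the number of values it takes, so narrowing the ranges is usually the cheapest way to keep the search manageable.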

In [None]:
if xgb_analysis:
    # Select a random seed (number between the ()) if you don't want the results to change every time you run the code
    np.random.seed()

    xgb_max_n_estimators_deg = 300

    xgb_optim_params_deg = {'n_estimators': range(10,xgb_max_n_estimators_deg+1,5)} 

    #xgb_optim_params = {'min_child_weight': numeric_range(0,1,0.1), 'subsample': numeric_range(0,1,0.1),
    #                    'gamma': numeric_range(0,1,0.1), 
    #                    'max_depth': range(0,10,1)}

    XGB_Optim_deg = metsta.optimise_xgb_parameters(Deg, target, xgb_optim_params_deg, regression, objective,
                                                   n_estimators=200)

In [None]:
if xgb_analysis:
    param_to_plot = 'n_estimators'

    # Plotting the results and adjusting parameters of the plot
    with sns.axes_style("whitegrid"):
        with sns.plotting_context("notebook", font_scale=1.2):
            f, ax = plt.subplots(1, 1, figsize=(6,6), constrained_layout=True) # Set Figure Size

            c_map = sns.color_palette('tab10', 10)

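            # The label is needed so that ax.legend() below has an entry to show
            ax.plot(XGB_Optim_deg.cv_results_['param_n_estimators'],
                    [s*100 for s in XGB_Optim_deg.cv_results_['mean_test_score']],
                    label='Mean CV score', color=c_map[0])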
            ax.set_ylabel('XGBoost CV Mean Accuracy (%)', fontsize=15) # Set the y_label and size
            ax.set_xlabel(param_to_plot, fontsize=15)
            ax.set_title('XGBoost - Degree', fontsize=18) # Set the title and size
            ax.set_ylim([30,101]) # Set the limits on the y axis

            #f.suptitle('Optimization of the number of trees')
            ax.legend(fontsize=15) # Set the legend and size
            plt.show()

### Fitting the XGBoost model

You may add more parameters to the function as keyword arguments (`**kwargs`).

https://xgboost.readthedocs.io/en/stable/parameter.html

In [None]:
if xgb_analysis:

    n_estimators = 200

    XGB_results_deg = metsta.XGB_model(Deg, target, # Data and labels
                    regres=regression, obj=objective, # Regression or classification, and objective function
                    return_cv=True, iter_num=5, # If you want cross validation results and number of iterations for it
                    n_estimators=n_estimators, # Number of trees in the model
                    cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
                    # For regression, use metrics=('neg_mean_squared_error', 'r2') instead.
                    # Other XGBoost parameters (e.g. gamma, min_child_weight, subsample) can be added as keyword arguments.
                    metrics=('accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted')) # Performance metrics

In [None]:
if xgb_analysis:

    results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
    for k,v in XGB_results_deg.items():
        if k != 'model' and k != 'imp_feat':
            results_summary.loc[k] = np.mean(v), np.std(v)

    print(results_summary)

In [None]:
if xgb_analysis:
    imp_feats_xgb_deg = processed_data.loc[list(general_MDiN.nodes()),
                                  [i for i in processed_data.columns if i not in treated_data.index]].copy()
    imp_feats_xgb_deg.insert(0,'Bucket label', imp_feats_xgb_deg.index)
    imp_feats_xgb_deg.insert(1,'Feature Importance', np.nan)
    imp_col = imp_feats_xgb_deg.columns.get_loc('Feature Importance')
    # Fill in the importances by position; a single .iloc call avoids chained assignment
    for feat in XGB_results_deg['imp_feat']:
        imp_feats_xgb_deg.iloc[feat[0], imp_col] = feat[1]
    imp_feats_xgb_deg = imp_feats_xgb_deg.sort_values(by='Feature Importance', ascending=False)
    imp_feats_xgb_deg.index = range(1, len(imp_feats_xgb_deg)+1)
else:
    imp_feats_xgb_deg = 'XGBoost analysis was not performed'
imp_feats_xgb_deg

In [None]:
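# Select number of features to see (the table only exists if the XGBoost analysis was performed)
imp_feats_xgb_deg.head(20) if xgb_analysis else imp_feats_xgb_deg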

In [None]:
# Save the important feature table to an Excel file
SAVE_IMP_FEAT = True

# Save only the top fraction 'frac_feat_impor' of the most important features.
# If None, save the full dataset of all features
frac_feat_impor = 0.02 # Fraction of features to save. If None, all features are saved.

# The table only exists if the XGBoost analysis was performed
if SAVE_IMP_FEAT and xgb_analysis:
    if frac_feat_impor:
        max_idx = int(frac_feat_impor*len(imp_feats_xgb_deg))
        filt_imp_feats_xgb_deg = imp_feats_xgb_deg.iloc[:max_idx]
        filt_imp_feats_xgb_deg.to_excel(f'XGB_smdinDeg_ImpFeat_{frac_feat_impor*100}%.xlsx')
    else:
        imp_feats_xgb_deg.to_excel('XGB_smdinDeg_FeatByImportance.xlsx')

### XGBoost Permutation tests

In [None]:
GENERATE=True
if GENERATE:
    # Set a random seed for reproducibility of cross validation (put a number inside the parentheses)
    np.random.seed()
    # (The random seed for the label permutations is the random_state argument below)

    perm_results_XGB_deg = metsta.permutation_XGB(
        Deg, target,  # data and labels
        regres=regression, obj=objective, # regression vs classification and objective function 
        iter_num=100, # Nº of permutations to do in your test - around 500 should be enough
        n_estimators=200, # Number of trees in the model
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed for the generator that permutes the class labels
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(treated_data.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_XGB_deg
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='XGBoost Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value = %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('XGBoost Permutation Test - Degree', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_XGB_PermutationTest_smdinDeg.jpg', dpi=400) # Save the Figure

### ROC Curves (Receiver Operating Characteristics)

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_XGB_deg = metsta.XGB_ROC_cv(Deg, target, # Data and target
                                        pos_label=pos_label, obj=objective, # Positive label and objective
                                        n_estimators=200, # Number of trees in the model
                                        n_iter=15, # Number of iterations to repeat 
                                        cv=None, n_fold=3) # Method of CV (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
# Plot the ROC curves 
if GENERATE:
    if len(pd.unique(target)) == 2:
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_XGB_deg
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('XGBoost ROC Curve - Degree', fontsize=15)

                #f.savefig('Name_XGB_ROCcurve_smdinDeg.jpg', dpi=400) # Save the figure

### XGBoost - MDB Impact

We first start with a brief optimization of the parameters for XGBoost training. Default is to focus only on the number of estimators (trees) and their maximum depth. However, there are other parameters that can be tweaked, simply by adding new terms to the `xgb_optim_params_mdbi` dictionary. Please be aware that each new parameter will exponentially increase the running time of the function, and that for regression problems even a single-parameter tuning can take very long.

To fix a hyperparameter that you do not want to tune at a non-default value, simply pass it to the `optimise_xgb_parameters` function as a keyword argument (`**kwargs`).

Resources on XGBoost Hyperparameters and their tuning:
- https://xgboost.readthedocs.io/en/stable/parameter.html
- https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning
- https://freedium.cfd/https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663

In [None]:
if xgb_analysis:
    # Select a random seed (number between the ()) if you don't want the results to change every time you run the code
    np.random.seed()

    xgb_max_n_estimators_mdbi = 300

    xgb_optim_params_mdbi = {'n_estimators': range(10,xgb_max_n_estimators_mdbi+1,5)} 

    #xgb_optim_params_mdbi = {'min_child_weight': numeric_range(0,1,0.1), 'subsample': numeric_range(0,1,0.1),
    #'gamma': numeric_range(0,1,0.1), 
    #                    'max_depth': range(0,10,1)}

    XGB_Optim_mdbi = metsta.optimise_xgb_parameters(MDB_Impact, target, xgb_optim_params_mdbi,
                                               regression, objective, n_estimators=200)

In [None]:
if xgb_analysis:
    param_to_plot = 'n_estimators'

    # Plotting the results and adjusting parameters of the plot
    with sns.axes_style("whitegrid"):
        with sns.plotting_context("notebook", font_scale=1.2):
            f, ax = plt.subplots(1, 1, figsize=(6,6), constrained_layout=True) # Set Figure Size

            c_map = sns.color_palette('tab10', 10)

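            # The label is needed so that ax.legend() below has an entry to show
            ax.plot(XGB_Optim_mdbi.cv_results_['param_n_estimators'],
                    [s*100 for s in XGB_Optim_mdbi.cv_results_['mean_test_score']],
                    label='Mean CV score', color=c_map[0])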
            ax.set_ylabel('XGBoost CV Mean Accuracy (%)', fontsize=15) # Set the y_label and size
            ax.set_xlabel(param_to_plot, fontsize=15)
            ax.set_title('XGBoost - MDB Impact', fontsize=18) # Set the title and size
            ax.set_ylim([30,101]) # Set the limits on the y axis

            #f.suptitle('Optimization of the number of trees')
            ax.legend(fontsize=15) # Set the legend and size
            plt.show()

### Fitting the XGBoost model

You may add more parameters to the function as keyword arguments (`**kwargs`).

https://xgboost.readthedocs.io/en/stable/parameter.html

In [None]:
if xgb_analysis:

    n_estimators = 200

    XGB_results_mdbi = metsta.XGB_model(MDB_Impact, target, # Data and labels
                    regres=regression, obj=objective, # Regression or classification, and objective function
                    return_cv=True, iter_num=5, # If you want cross validation results and number of iterations for it
                    n_estimators=n_estimators, # Number of trees in the model
                    cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
                    # For regression, use metrics=('neg_mean_squared_error', 'r2') instead.
                    # Other XGBoost parameters (e.g. gamma, min_child_weight, subsample) can be added as keyword arguments.
                    metrics=('accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted')) # Performance metrics

In [None]:
if xgb_analysis:

    results_summary = pd.DataFrame(columns=['Value', 'Standard Deviation'])
    for k,v in XGB_results_mdbi.items():
        if k != 'model' and k != 'imp_feat':
            results_summary.loc[k] = np.mean(v), np.std(v)

    print(results_summary)

In [None]:
if xgb_analysis:
    imp_feats_xgb_mdbi = pd.DataFrame(index=MDB_Impact.columns)
    imp_feats_xgb_mdbi.insert(0,'Bucket label', imp_feats_xgb_mdbi.index)
    imp_feats_xgb_mdbi.insert(1,'Feature Importance', np.nan)
    imp_col = imp_feats_xgb_mdbi.columns.get_loc('Feature Importance')
    # Fill in the importances by position; a single .iloc call avoids chained assignment
    for feat in XGB_results_mdbi['imp_feat']:
        imp_feats_xgb_mdbi.iloc[feat[0], imp_col] = feat[1]
    imp_feats_xgb_mdbi = imp_feats_xgb_mdbi.sort_values(by='Feature Importance', ascending=False)
    imp_feats_xgb_mdbi.index = range(1, len(imp_feats_xgb_mdbi)+1)
else:
    imp_feats_xgb_mdbi = 'XGBoost analysis was not performed'
imp_feats_xgb_mdbi

In [None]:
# Save the important feature table to an Excel file
SAVE_IMP_FEAT = True

# The table only exists if the XGBoost analysis was performed
if SAVE_IMP_FEAT and xgb_analysis:
    imp_feats_xgb_mdbi.to_excel('XGB_smdinMDBI_FeatByImportance.xlsx')

In [None]:
import metabolinks.transformations as transf

# The heatmap needs the feature table, which only exists if the XGBoost analysis was performed
if xgb_analysis:
    f, ax = plt.subplots(figsize=(10,6))

    tf = transf.FeatureScaler(method='standard')
    df = tf.fit_transform(MDB_Impact)
    df = df[imp_feats_xgb_mdbi['Bucket label']]

    g = sns.heatmap(df.T, cmap='PRGn', vmin=-3, vmax=3)

    # Manually specify colorbar labelling after it's been generated
    colorbar = g.collections[0].colorbar
    colorbar.ax.tick_params(labelsize=14)

### XGBoost Permutation tests

In [None]:
GENERATE=True
if GENERATE:
    # Set a random seed for reproducibility of cross validation (put a number inside the parentheses)
    np.random.seed()
    # (The random seed for the label permutations is the random_state argument below)

    perm_results_XGB_mdbi = metsta.permutation_XGB(
        MDB_Impact, target,  # data and labels
        regres=regression, obj=objective, # regression vs classification and objective function 
        iter_num=100, # Nº of permutations to do in your test - around 500 should be enough
        n_estimators=200, # Number of trees in the model
        cv=None, n_fold=3, # Choose a method of cross-validation (None is stratified cv) and the number of folds
        random_state=None, # Random seed for the generator that permutes the class labels
        metric='accuracy') # Choose a metric to use to evaluate if the model is significant

In [None]:
if GENERATE:
    with plt.style.context('seaborn-v0_8-whitegrid'):
        fig, ax = plt.subplots(1,1, figsize=(6,6))

        n_labels = len(treated_data.index)
        tab20bcols = sns.color_palette('tab20b', 20)
        perm_results = perm_results_XGB_mdbi
        
        # Histogram with performance of permutated values
        hist_res = ax.hist(np.array(perm_results[1]), n_labels, range=(0, 1.00001), label='XGBoost Permutations',
                     edgecolor='black', color=tab20bcols[1], alpha = 1)
        
        # Plot the non-permutated model performance
        ylim = [0, hist_res[0].max()*1.2]
        ax.plot(2 * [perm_results[0]], ylim, '-', linewidth=3, color='darkred', #alpha = 0.5,
                     label='p-value = %.5f' % perm_results[2], solid_capstyle='round')
        ax.tick_params(labelsize=13)
        ax.set_xlabel('CV Model Performance', fontsize=14)
        ax.set_ylabel('Nº of occurrences', fontsize=14)
        if perm_results[0] >= 0.5:
            ax.text(perm_results[0]-0.45, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        else:
            ax.text(perm_results[0]+0.05, hist_res[0].max()*1.1, 'p-value = %.3f' % perm_results[2], fontsize = 15)
        ax.set_title('XGBoost Permutation Test - MDB Impact', size = 15)
        #ax.grid()
        ax.set_axisbelow(True)

        #fig.savefig('Name_XGB_PermutationTest_smdinMDBI.jpg', dpi=400) # Save the Figure

### ROC Curves (Receiver Operating Characteristics)

In [None]:
GENERATE = True
if GENERATE:
    if regression:
        print('You are working on a regression problem. Thus, ROC curves are not made.')
    else:
        if len(pd.unique(target)) == 2:
            # Set a random seed for reproducibility
            np.random.seed()
            
            # Set up positive label
            pos_label = pd.unique(target)[0]

            resROC_XGB_mdbi = metsta.XGB_ROC_cv(MDB_Impact, target, # Data and target
                                        pos_label=pos_label, obj=objective, # Positive label and objective
                                        n_estimators=200, # Number of trees in the model
                                        n_iter=15, # Number of iterations to repeat 
                                        cv=None, n_fold=3) # Method of CV (None is stratified cv) and the number of folds
        else:
            print('Your target has more than 2 classes. Thus, ROC curves are not made.')

In [None]:
# Plot the ROC curves 
if GENERATE:
    if len(pd.unique(target)) == 2:
        with sns.axes_style("whitegrid"):
            with sns.plotting_context("notebook", font_scale=1.2):
                f, ax = plt.subplots(1, 1, figsize=(5,5), constrained_layout=True)
                res = resROC_XGB_mdbi
                mean_fpr = res['average fpr']
                mean_tpr = res['average tpr']
                mean_auc = res['mean AUC']
                mean_fpr = [0,] + list(mean_fpr)
                mean_tpr = [0,] + list(mean_tpr)
                ax.plot(mean_fpr, mean_tpr,
                       label=f'AUC = {mean_auc:.3f}',
                       lw=2, alpha=0.8)
                ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='lightgrey', alpha=.8)
                ax.legend()
                ax.set_xlim(None, 1)
                ax.set_ylim(0, None)
                ax.set(xlabel='False positive rate', ylabel='True positive rate')
                ax.set_title('XGBoost ROC Curve - MDB Impact', fontsize=15)

                #f.savefig('Name_XGB_ROCcurve_smdinMDBI.jpg', dpi=400) # Save the figure