# Sample Mass-Difference Networks in Metabolomics Data Analysis

Notebook to support the study on the application of **Sample M**ass-**Di**fference **N**etworks as a highly specific competing form of pre-processing procedure for high-resolution metabolomics data.

Mass-Difference Networks (MDiNs) are networks built from a list of masses. Each _m/z_ value is represented by a node. Two nodes are connected if the difference between their masses can be attributed to a simple chemical reaction (enzymatic or non-enzymatic) that changes the elemental composition of a metabolite.

The set of mass differences used to build these networks is called a set of MDBs (Mass-Difference-based Building blocks).
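
The construction rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the procedure used in the sMDiNs analysis notebook: the function name `build_mdin_edges`, the absolute tolerance `tol` and the toy masses are hypothetical, and only a small subset of MDBs is included (with their monoisotopic mass shifts).

```python
from itertools import combinations

# Monoisotopic mass shifts (Da) for a few MDBs; illustrative subset only.
MDB_MASSES = {
    "H2": 2.015650,
    "CH2": 14.015650,
    "O": 15.994915,
    "H2O": 18.010565,
    "CO2": 43.989829,
}

def build_mdin_edges(masses, mdb_masses=MDB_MASSES, tol=0.001):
    """Return (mass_a, mass_b, mdb_name) edges for pairs whose difference matches an MDB.

    `tol` is a hypothetical absolute tolerance in Da.
    """
    edges = []
    for m1, m2 in combinations(sorted(masses), 2):
        diff = m2 - m1
        for name, shift in mdb_masses.items():
            if abs(diff - shift) <= tol:
                edges.append((m1, m2, name))
    return edges

# Toy masses: the second and third differ from the first by CH2 and H2O.
masses = [180.063388, 194.079038, 198.073953]
edges = build_mdin_edges(masses)
for edge in edges:
    print(edge)
```

Each returned tuple corresponds to one edge of the network, labelled with the MDB that explains the mass difference.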

This is notebook `paper_sMDiNs_unsupervised.ipynb`


## Organization of the Notebook

- Loading the database of pre-processed and pre-treated datasets, which holds the intensity-based pre-treated data and the data from the sMDiN analyses.
- **PCA, Agglomerative Hierarchical Clustering and K-means Clustering: assessment of performance given a ground truth of cluster assignments.**


#### Needed Imports

In [None]:
import itertools
from pathlib import Path

import numpy as np
import pandas as pd

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
import scipy.stats as stats

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
from matplotlib import ticker

import seaborn as sns
import networkx as nx

import sklearn.cluster as skclust
from sklearn.metrics import adjusted_rand_score

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
from elips import plot_confidence_ellipse

In [None]:
%matplotlib inline

In [None]:
# json for persistence
import json
from time import perf_counter

## Description of dataset records

`datasets` is the global dict that holds all data sets. It is a **dict of dicts**.

Each data set is **represented as a dict**.

Each record has the following fields (keys):

- `name`: the table/figure name of the data set
- `source`: the biological source for each dataset
- `mode`: the acquisition mode
- `alignment`: the alignment used to generate the data matrix
- `data`: the data matrix
- `target`: the sample labels, possibly already integer encoded
- `MDiN`: Mass-Difference Network - Not present here, only on sMDiNsAnalysis notebook
- `<treatment name>`: transformed data matrix / network. These treatment names can be
    - `Ionly`: missing value imputed data by 1/5 of the minimum value in each sample in the dataset, only
    - `NGP`: normalized, glog transformed and Pareto scaled
    - `Ionly_RF`: missing value imputed data by random forests, only
    - `NGP_RF`: normalized, glog transformed and Pareto scaled (after missing value imputation by random forests)
    - `IDT`: `NGP_RF` or `NGP` - Intensity-based Data pre-Treatment chosen as comparison based on which of the two performed better for each dataset and each statistical method
    - `sMDiN`: Sample Mass-Difference Networks - Not present here, only on sMDiNsAnalysis notebook
       
- `<sMDiN analysis name>`: data matrix from network analysis of MDiNs - Not in this notebook
    - `Degree`: degree analysis of each sMDiN
    - `Betweenness`: betweenness centrality analysis of each sMDiN
    - `Closeness`: closeness centrality analysis of each sMDiN
    - `MDBI`: analysis on the impact of each MDB (Mass-Difference based building-block) on building each sMDiN
    - `GCD11`: Graphlet Correlation Distance of 11 different orbits (maximum of 4-node graphlets) between each sMDiN.
    - `WMDBI`: an alternative calculation of MDBI using the results from the degree analysis.

- `iter_fold_splits`: contains nested dicts that identify and contain each transformed training and testing groups data matrices with their respective iteration, training/test, fold number and one of the previously mentioned data pre-treatments
- `train`: specific to the HD dataset; contains a set of the different pre-treatments and sMDiN analyses mentioned and a target based on the training set defined for HD
- `test`: specific to the HD dataset; contains a set of the different pre-treatments and sMDiN analyses mentioned and a target based on the external test set defined for HD


The keys of `datasets` may be shared with dicts holding records resulting from comparison analysis.

Here are the keys (and respective names) of datasets used in this study:

- GD_global2 (GDg2)
- GD_class2 (GDc2)
- YD (YD)
- vitis_types (GD types)
- HD (HD)

#### Data Pre-Treatment

For information on the **commonly used intensity based data pre-treatments** and about the **benchmark datasets**, see notebook `paper_sMDiNs_database_prep.ipynb`.
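
As a rough illustration of two of the steps named in the record description (the `Ionly` imputation and Pareto scaling), a minimal NumPy sketch follows. It assumes samples are rows and features are columns, and that "1/5 of the minimum value in each sample" means the smallest observed value of that row. The function names are hypothetical, normalization and the glog transformation are left out, and this is not the code used in the preparation notebook.

```python
import numpy as np

def impute_fraction_of_min(X, fraction=0.2):
    """Replace NaNs in each sample (row) with `fraction` of that row's smallest observed value."""
    X = np.array(X, dtype=float)
    row_min = np.nanmin(X, axis=1, keepdims=True)
    return np.where(np.isnan(X), fraction * row_min, X)

def pareto_scale(X):
    """Mean-center each feature (column) and divide by the square root of its standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered / np.sqrt(X.std(axis=0, ddof=1))

# Toy intensity matrix: 3 samples x 3 features, with two missing values.
X = np.array([[10.0, np.nan, 40.0],
              [20.0, 5.0, np.nan],
              [30.0, 15.0, 80.0]])
X_imputed = impute_fraction_of_min(X)
X_scaled = pareto_scale(X_imputed)
print(X_imputed)
```

Pareto scaling divides by the square root of the standard deviation rather than the standard deviation itself, so high-variance features keep more of their weight than under unit-variance scaling.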

For information on the **building** and the different **network analysis methods** used for the **Sample MDiNs** and information about the Mass-Difference-based Building blocks (**MDBs**), see notebook `paper_sMDiNs_sMDiNsAnalysis.ipynb`.

### Reading datasets database

No need to load 'iter_fold_splits' key (or the 'train' and 'test' key of the HD dataset) for unsupervised analysis.
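
The loading cell relies on a placeholder convention: non-DataFrame values live in the JSON file, while each DataFrame is replaced there by a string of the form `INSTORE_<store key>` that points into the HDF5 store. The sketch below isolates that idea, with a plain dict standing in for the `pd.HDFStore`. The function name `resolve_placeholders` and the toy record are hypothetical.

```python
def resolve_placeholders(record, store, prefix="INSTORE"):
    """Return a copy of `record` where 'INSTORE_<key>' strings are replaced by store[<key>]."""
    resolved = {}
    for key, value in record.items():
        if isinstance(value, str) and value.startswith(prefix):
            # Everything after the first underscore is the key inside the store.
            storekey = value.split("_", 1)[1]
            resolved[key] = store[storekey]
        else:
            resolved[key] = value
    return resolved

# A plain dict stands in for the HDF5 store; keys and values are made up.
fake_store = {"YD_data": [[1.0, 2.0], [3.0, 4.0]]}
record = {"name": "YD", "data": "INSTORE_YD_data"}
resolved = resolve_placeholders(record, fake_store)
print(resolved["data"])
```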

In [None]:
# Where the datasets are
path = Path.cwd() / "store_files" / 'processed_data.json'
storepath = Path.cwd() / "store_files" / 'processed_data.h5'
with pd.HDFStore(storepath) as store:
    
    # Read into a dictionary not DataFrame data
    with open(path, encoding='utf8') as read_file:
        datasets = json.load(read_file)
    
    # Add DataFrame data to dict
    for dskey, dataset in datasets.items():
        dataset['iter_fold_splits'] = {}
        if dskey == 'HD':
            dataset['train'] = {}
            dataset['test'] = {}
        for key in dataset:
            # Created right before
            if 'iter_fold_splits' == key:
                continue
            value = dataset[key]
            if isinstance(value, str) and value.startswith("INSTORE"):
                storekey = value.split("_", 1)[1]
                #print(storekey)
                # Load the data from 'iter_fold_splits' carefully restoring the nested dictionaries
                #if len(storekey.split("AA_")) > 1: # This separation was made to identify the 'iter_fold_splits' data
                #    dictkeys = (storekey.split("AA_")[1]).split('_',3)
                    # Create nested dicts
                #    if int(dictkeys[0]) not in dataset['iter_fold_splits'].keys():
                #        dataset['iter_fold_splits'][int(dictkeys[0])] = {}
                #    if dictkeys[1] not in dataset['iter_fold_splits'][int(dictkeys[0])].keys():
                #        dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]] = {}
                #    if int(dictkeys[2]) not in dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]].keys():
                #        dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])] = {}
                #    dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])][dictkeys[3]] = store[storekey]
                
                # Load the data from 'train' and 'test' from HD dataset keys carefully restoring the nested dictionaries
                #elif len(storekey.split("TTS_")) > 1:
                #    dictkeys = ((storekey.split("TTS_")[0]).split('_')[-1], storekey.split("TTS_")[1])#.split('_',2)
                #    dataset[dictkeys[0]][dictkeys[1]] = store[storekey]
                # Normal DataFrames
                #else:
                dataset[key] = store[storekey]

            # convert colors to tuples, since they are read as lists from json file
            elif key == 'label_colors':
                dataset[key] = {lbl: tuple(c) for lbl, c in value.items()}
            elif key == 'sample_colors':
                dataset[key] = [tuple(c) for c in value]
            #elif key.endswith('target') and key.startswith(dskey):
            #    if len(key.split("AA_")) > 1: 
            #        dictkeys = ((key.split("_", 1)[1]).split("AA_")[1]).split('_',3)
            #        dataset['iter_fold_splits'][int(dictkeys[0])][dictkeys[1]][int(dictkeys[2])][dictkeys[3]] = value
            #    else:
            #        dictkeys = ((key.split("TTS_")[0]).split('_')[-1], key.split("TTS_")[1])#.split('_',2)
            #        dataset[dictkeys[0]][dictkeys[1]] = value

# Remove extra keys
for name, ds in datasets.items():
    keys_to_remove = [keys for keys in ds.keys() if keys.startswith(name)]
    for key in keys_to_remove:
        ds.pop(key)


In [None]:
datasets['YD']['Ionly']

In [None]:
# Selecting a placeholder for the Intensity-based Data pre-Treatment (IDT)
# Chosen for each dataset and each method based on which between NGP and NGP_RF generated the best results
for name, ds in datasets.items():
    ds['IDT'] = ds['NGP_RF'] 

In [None]:
# Chemical Formula transformations (MDBs chosen)
MDB = ['H2','CH2','CO2','O','CHOH','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O']
MDB_YD = ['H2','CH2','CO2','O','CHOH','NCH','O(N-H-)','S','CONH','PO3H','NH3(O-)','SO3','CO', 'C2H2O', 'H2O', 
          'C2H2O2', 'C3H4O2']

### Colors for plots to ensure consistency

#### 11 variety grapevine data sets

In [None]:
# customize label colors for 11 grapevine varieties

colours = sns.color_palette('Blues', 3)
colours.extend(sns.color_palette('Greens', 3))
#colours = sns.cubehelix_palette(n_colors=6, start=2, rot=0, dark=0.2, light=.9, reverse=True)
colours.extend(sns.color_palette('flare', 5))

ordered_vitis_labels = ('CAN','RIP','ROT','RU','LAB','SYL','REG','CS','PN','RL','TRI')

vitis_label_colors = {lbl: c for lbl, c in zip(ordered_vitis_labels, colours)}

tab20bcols = sns.color_palette('tab20b', 20)
tab20ccols = sns.color_palette('tab20c', 20)
tab20cols = sns.color_palette('tab20', 20)
tab10cols = sns.color_palette('tab10', 10)
dark2cols = sns.color_palette('Dark2', 8)

vitis_label_colors['RU'] = tab20bcols[8]
vitis_label_colors['CAN'] = tab20ccols[5]
vitis_label_colors['REG'] = tab10cols[3]

for name in datasets:
    if name.startswith('GD'):
        datasets[name]['label_colors'] = vitis_label_colors
        datasets[name]['sample_colors'] = [vitis_label_colors[lbl] for lbl in datasets[name]['target']]

In [None]:
sns.palplot(vitis_label_colors.values())
new_ticks = plt.xticks(range(len(ordered_vitis_labels)), ordered_vitis_labels)

#### 5 yeast strains

In [None]:
# customize label colors for 5 yeast strains

colours = sns.color_palette('Set1', 5)
yeast_classes = datasets['YD']['classes']
yeast_label_colors = {lbl: c for lbl, c in zip(yeast_classes, colours)}
datasets['YD']['label_colors'] = yeast_label_colors
datasets['YD']['sample_colors'] = [yeast_label_colors[lbl] for lbl in datasets['YD']['target']]

In [None]:
sns.palplot(yeast_label_colors.values())
new_ticks = plt.xticks(range(len(yeast_classes)), yeast_classes)

#### 2 classes of Vitis types (wild and _vinifera_)

In [None]:
# customize label colors for 2 types of Vitis varieties

colours = [vitis_label_colors['SYL'], vitis_label_colors['TRI']]
vitis_type_classes = datasets['vitis_types']['classes']
vitis_types_label_colors = {lbl: c for lbl, c in zip(vitis_type_classes, colours)}
datasets['vitis_types']['label_colors'] = vitis_types_label_colors
datasets['vitis_types']['sample_colors'] = [vitis_types_label_colors[lbl] for lbl in datasets['vitis_types']['target']]

In [None]:
sns.palplot(datasets['vitis_types']['label_colors'].values())
new_ticks = plt.xticks(range(len(datasets['vitis_types']['classes'])), datasets['vitis_types']['classes'])

#### 2 HD classes

In [None]:
# customize label colors for 2 HD classes

colours = sns.color_palette('Set1', 2)
hd_label_colors = {lbl: c for lbl, c in zip(datasets['HD']['classes'], colours)}
datasets['HD']['label_colors'] = hd_label_colors
datasets['HD']['sample_colors'] = [hd_label_colors[lbl] for lbl in datasets['HD']['target']]

In [None]:
sns.palplot(hd_label_colors.values())
new_ticks = plt.xticks(range(len(datasets['HD']['classes'])), datasets['HD']['classes'])

Samples and respective target labels of each dataset

In [None]:
def styled_sample_labels(sample_names, sample_labels, label_colors):
    "Build a one-row styled table of sample labels, with each cell colored by its label color."

    meta_table = pd.DataFrame({'label': sample_labels,
                               'sample': sample_names}).set_index('sample').T

    def apply_label_color(val):
        red, green, blue = label_colors[val]
        red, green, blue = int(red*255), int(green*255), int(blue*255)   
        hexcode = '#%02x%02x%02x' % (red, green, blue)
        css = f'background-color: {hexcode}'
        return css
    
    return meta_table.style.applymap(apply_label_color)

In [None]:
parsed = mtl.parse_data(datasets['GD_class2']['data'], labels_loc='label')
y = datasets['GD_class2']['target']
label_colors = datasets['GD_class2']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['YD']['data'])
y = datasets['YD']['target']
label_colors = datasets['YD']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['vitis_types']['data'], labels_loc='label')
y = datasets['vitis_types']['target']
label_colors = datasets['vitis_types']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

In [None]:
parsed = mtl.parse_data(datasets['HD']['data'])
y = datasets['HD']['target']
label_colors = datasets['HD']['label_colors']
s = styled_sample_labels(parsed.sample_names, y, label_colors)
s

#### Colors for the pre-treatments / sMDiN analysis metrics for the plots

In [None]:
# customize colors for the intensity-based pre-treatment and analysis metrics of sample MDiNs
treatments = ('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11')

treat_colors = tab10cols[:4]
treat_colors.extend(tab20cols[8:10])
treat_colors.append(tab10cols[5])
treatment_colors = {lbl: c for lbl, c in zip(treatments, treat_colors)}

sns.palplot(treatment_colors.values())
new_ticks = plt.xticks(range(len(treatment_colors)), treatment_colors)

### PCA scores plots for the datasets

Representation of the samples projected onto the first 2 Principal Components obtained from PCA.

Preliminary assessment of how close the classes are to each other, and therefore of how difficult the task is for clustering and classification methods. Greater proximity or overlap means a more difficult task, since the classes are more similar to each other or less well defined.

Ellipses shown are 95% confidence ellipses for each class.
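
For readers unfamiliar with confidence ellipses, the sketch below shows one standard way to derive a 95% ellipse from the covariance of 2-D scores, assuming the points of a class are roughly bivariate normal. It is not necessarily how `plot_confidence_ellipse` from `elips` is implemented. The function name `confidence_ellipse_params` and the random toy scores are hypothetical.

```python
import math
import numpy as np

def confidence_ellipse_params(points, q=0.95):
    """Return (center, width, height, angle_deg) of the q-level ellipse for 2-D points."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    # The chi-square quantile with 2 degrees of freedom has a closed form.
    r2 = -2.0 * math.log(1.0 - q)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, 1]                   # direction of largest variance
    angle_deg = math.degrees(math.atan2(major[1], major[0]))
    # Full axis lengths, major axis first.
    width, height = 2.0 * np.sqrt(r2 * eigvals[::-1])
    return center, width, height, angle_deg

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 2)) * [3.0, 1.0]  # toy scores, wider along the first axis
center, width, height, angle = confidence_ellipse_params(scores)
print(width, height, angle)
```

The returned values map directly onto the arguments of `matplotlib.patches.Ellipse` (center, full width, full height and rotation angle in degrees).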

In [None]:
def plot_PCA(principaldf, label_colors, components=(1,2), title="PCA", ax=None):
    "Plot the projection of samples in the 2 main components of a PCA model."
    
    if ax is None:
        ax = plt.gca()
    
    loc_c1, loc_c2 = [c - 1 for c in components]
    col_c1_name, col_c2_name = principaldf.columns[[loc_c1, loc_c2]]
    
    #ax.axis('equal')
    ax.set_xlabel(f'{col_c1_name}')
    ax.set_ylabel(f'{col_c2_name}')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset = principaldf[principaldf['Label']==lbl]
        ax.scatter(subset[col_c1_name],
                   subset[col_c2_name],
                   s=50, color=label_colors[lbl], label=lbl)

    #ax.legend(framealpha=1)
    ax.set_title(title, fontsize=15)

def plot_ellipses_PCA(principaldf, label_colors, components=(1,2),ax=None, q=None, nstd=2):
    "Plot confidence ellipses of a class' samples based on their projection in the 2 main components of a PCA model."
    
    if ax is None:
        ax = plt.gca()
    
    loc_c1, loc_c2 = [c - 1 for c in components]
    points = principaldf.iloc[:, [loc_c1, loc_c2]]
    
    #ax.axis('equal')

    unique_labels = principaldf['Label'].unique()

    for lbl in unique_labels:
        subset_points = points[principaldf['Label']==lbl]
        plot_confidence_ellipse(subset_points, q, nstd, ax=ax, ec=label_colors[lbl], fc='none')


In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1):
        f, axs = plt.subplots(2,3, figsize=(12,8), constrained_layout=True)

        for (dskey, ds), ax in zip(datasets.items(), axs.ravel()):
            df = datasets[dskey]['Ionly']
            tf = transf.FeatureScaler(method='standard')
            df = tf.fit_transform(df)

            ax.axis('equal')
            principaldf = ma.compute_df_with_PCs(df, n_components=5, whiten=True, labels=datasets[dskey]['target'], return_var_ratios=False)

            lcolors = datasets[dskey]['label_colors']
            #plot_PCA(principaldf, lcolors, components=(1,2), title=datasets[dskey]['name'], ax=ax)
            plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=ax)
            plot_ellipses_PCA(principaldf, lcolors, components=(1,2),ax=ax, q=0.95)

        axs[1][2].remove()
        axs[1][0].legend(loc='upper center', ncol=1, framealpha=1)
        axs[1][1].legend(loc='upper center', ncol=1, framealpha=1)
        
        locs_YD = {'WT':(-1,-0.65),
                   'ΔGRE3':(-0.45, 1.45),
                   'ΔENO1':(0.7, 0.1),
                   'ΔGLO1':(0.5, -0.5),
                   'ΔGLO2':(0,0.7) }
        
        for lbl in datasets['YD']['classes']:
            axs[0][2].text(*locs_YD[lbl], lbl, c=datasets['YD']['label_colors'][lbl])

        locs_GD = {'CAN':(-1.4,-0.2),
                       'CS':(-0.45, 2),
                       'LAB':(-0.25, -0.2),
                       'PN':(-1, 0.2),
                       'REG':(1.8,0.5),
                       'RIP':(-0.3,-0.7),
                       'RL':(-0.5, 1.45),
                       'ROT':(-1, -1.2),
                       'RU':(0.5, -1),
                       'SYL':(-0.2,0),
                       'TRI':(0.5,0.5),}
        
        for lbl in datasets['GD_global2']['classes']:
            axs[0][0].text(*locs_GD[lbl], lbl, c=datasets['GD_global2']['label_colors'][lbl])

        locs_GD = {'CAN':(-0.1,-1),
                       'CS':(-0.45, 2),
                       'LAB':(-0.2, -0.5),
                       'PN':(-1, 0.6),
                       'REG':(1.8,0.35),
                       'RIP':(-0.3,-1.7),
                       'RL':(-0.2, 1),
                       'ROT':(-1, -1.2),
                       'RU':(0.9, -1),
                       'SYL':(-1.2,-0.2),
                       'TRI':(0.5,0.1),}
        
        for lbl in datasets['GD_global2']['classes']:
            axs[0][1].text(*locs_GD[lbl], lbl, c=datasets['GD_global2']['label_colors'][lbl])

        plt.show()
        f.savefig('images/PCAs.pdf', dpi=300)
        f.savefig('images/PCAs.jpg', dpi=300)


## Clustering methods

### Agglomerative Hierarchical Cluster Analysis 

Hierarchical cluster analysis (HCA) of each differently-treated dataset and of the results from the sample MDiN analyses, with the corresponding dendrograms.

**Euclidean distance** and **UPGMA linkage** were used to build the dendrograms.

In [None]:
def perform_HCA(df, metric='euclidean', method='average'):
    "Performs Hierarchical Clustering Analysis of a data set with chosen linkage method and distance metric."
    
    distances = dist.pdist(df, metric=metric)
    
    # method is one of
    # ward, average, centroid, single, complete, weighted, median
    Z = hier.linkage(distances, method=method)

    # Cophenetic Correlation Coefficient
    # (see how the clustering - from hier.linkage - preserves the original distances)
    coph = hier.cophenet(Z, distances)
    # Baker's gamma
    mr = ma.mergerank(Z)
    bg = mr[mr!=0]

    return {'Z': Z, 'distances': distances, 'coph': coph, 'merge_rank': mr, "Baker's Gamma": bg}

#### Computation of linkages, distances and cophenetics

Traditional intensity-based pre-treatments

- IDT - **I**ntensity-based **D**ata pre-**T**reatment

Analysis metrics of the Sample MDiNs

- Degree
- Betweenness - Betweenness centrality
- Closeness - Closeness centrality
- MDBI - Mass-Difference based Building block Impact
- WMDBI - Weighted Mass-Difference based Building Block Impact
- GCD11 - Graphlet Correlation Distance using 11 non-redundant graphlet orbits (maximum of 4-node graphlets)


Dictionaries to contain results

In [None]:
HCA_all = {}

Perform the clusterings

In [None]:
for name, ds in datasets.items():
    HCA_all[name] = {}
    for treat in 'NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11':
        print(f'Performing HCA to {name} data set with treatment {treat}', end=' ...')
        metric = 'euclidean'
        HCA_all[name][treat] = perform_HCA(datasets[name][treat], metric=metric, method='average')
        print('done!')

In [None]:
# alternative dendrogram plots - Newer
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

def color_list_to_matrix_and_cmap(colors, ind, axis=0):
        if any(issubclass(type(x), list) for x in colors):
            all_colors = set(itertools.chain(*colors))
            n = len(colors)
            m = len(colors[0])
        else:
            all_colors = set(colors)
            n = 1
            m = len(colors)
            colors = [colors]
        color_to_value = dict((col, i) for i, col in enumerate(all_colors))

        matrix = np.array([color_to_value[c]
                           for color in colors for c in color])

        matrix = matrix.reshape((n, m))
        matrix = matrix[:, ind]
        if axis == 0:
            # row-side:
            matrix = matrix.T

        cmap = mpl.colors.ListedColormap(all_colors)
        return matrix, cmap

def plot_dendogram(Z, leaf_names, label_colors, title='', ax=None, no_labels=False, labelsize=12, **kwargs):
    if ax is None:
        ax = plt.gca()
    hier.dendrogram(Z, labels=leaf_names, leaf_font_size=10, above_threshold_color='0.2', orientation='left',
                    ax=ax, **kwargs)
    #Coloring labels
    #ax.set_ylabel('Distance (AU)')
    ax.set_xlabel('Distance (AU)')
    ax.set_title(title, fontsize = 15)
    
    #ax.tick_params(axis='x', which='major', pad=12)
    ax.tick_params(axis='y', which='major', labelsize=labelsize, pad=12)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    #xlbls = ax.get_xmajorticklabels()
    xlbls = ax.get_ymajorticklabels()
    rectimage = []
    for lbl in xlbls:
        col = label_colors[lbl.get_text()]
        lbl.set_color(col)
        #lbl.set_fontweight('bold')
        if no_labels:
            lbl.set_color('w')
        rectimage.append(col)
    
    cols, cmap = color_list_to_matrix_and_cmap(rectimage, range(len(rectimage)), axis=0)

    axins = inset_axes(ax, width="5%", height="100%",
                   bbox_to_anchor=(1, 0, 1, 1),
                   bbox_transform=ax.transAxes, loc=3, borderpad=0)

    axins.pcolor(cols, cmap=cmap, edgecolors='w', linewidths=1)
    axins.axis('off')

In [None]:
f, ax = plt.subplots(figsize=(5, 10))
name = 'GD_global2'
title = f"Data set {datasets[name]['name']}, NGP treatment"
plot_dendogram(HCA_all[name]['NGP']['Z'], 
               datasets[name]['target'], ax=ax,
               label_colors=datasets[name]['label_colors'], title=title,
               color_threshold=0)

#### Dendrograms of 4 differently-treated datasets for each of the benchmark datasets

In [None]:
with sns.axes_style("white"):
    f, axs = plt.subplots(1, 4, figsize=(14, 8), constrained_layout=True)
    
    name = 'GD_global2'
    
    for treatment, ax in zip(('NGP', 'Degree', 'MDBI', 'GCD11'), axs.ravel()):
        plot_dendogram(HCA_all[name][treatment]['Z'], 
                       datasets[name]['target'], ax=ax,
                       label_colors=datasets[name]['label_colors'],
                       title=treatment, color_threshold=0)

    st = f.suptitle(f"Data set {datasets[name]['name']}", fontsize=16)
    plt.show()

In [None]:
with sns.axes_style("white"):
    f, axs = plt.subplots(1, 4, figsize=(14, 8), constrained_layout=True)
    
    name = 'GD_class2'
      
    for treatment, ax in zip(('NGP', 'Degree', 'MDBI', 'WMDBI'), axs.ravel()):
        if treatment == 'NGP':
            title = 'IDT'
        else:
            title = treatment
        plot_dendogram(HCA_all[name][treatment]['Z'], 
                       datasets[name]['target'], ax=ax,
                       label_colors=datasets[name]['label_colors'],
                       title=title, color_threshold=0)

    #st = f.suptitle(f"Data set {datasets[name]['name']}", fontsize=16)
    #for letter, ax in zip('ABCDEFGHIJ', axs.ravel()):
    #    ax.text(0.3, 0.98, letter, ha='left', va='center', fontsize=15, weight='bold',
    #            transform=ax.transAxes,
    #            bbox=dict(facecolor='white', alpha=0.9))

    plt.show()
    f.savefig('images/dendrosGDc2neg.pdf', dpi=300)
    f.savefig('images/dendrosGDc2neg.jpg', dpi=300)

In [None]:
with sns.axes_style("white"):
    f, axs = plt.subplots(1, 4, figsize=(12, 4), constrained_layout=True)
    
    name = 'YD'
      
    for treatment, ax in zip(('NGP', 'Degree', 'MDBI', 'GCD11'), axs.ravel()):
        if treatment == 'NGP':
            title = 'IDT'
        else:
            title = treatment
        plot_dendogram(HCA_all[name][treatment]['Z'], 
                       datasets[name]['target'], ax=ax,
                       label_colors=datasets[name]['label_colors'],
                       title=title, color_threshold=0)

    st = f.suptitle('Data set YD', fontsize=16)
    plt.show()

In [None]:
with sns.axes_style("white"):
    f, axs = plt.subplots(1, 4, figsize=(14, 14), constrained_layout=True)
    
    name = 'HD'
      
    for treatment, ax in zip(('NGP_RF', 'Degree', 'MDBI', 'GCD11'), axs.ravel()):
        plot_dendogram(HCA_all[name][treatment]['Z'], 
                       datasets[name]['target'], ax=ax,
                       label_colors=datasets[name]['label_colors'],
                       title=treatment, color_threshold=0)

    st = f.suptitle(f'Data set {datasets[name]["name"]}', fontsize=16)
    plt.show()

###  Dendrogram Similarity Comparison

The similarity of the dendrograms built from the differently-treated versions of each benchmark dataset was compared using two correlation coefficients:

#### Cophenetic Correlation Coefficient

- Pearson correlation between the cophenetic distance matrices of two dendrograms.
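
A minimal sketch of this comparison with SciPy, using random toy data in place of the treated datasets. The helper name `cophenetic_distances` is hypothetical. The linkage settings mirror the Euclidean/UPGMA choice stated earlier.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist
from scipy import stats

def cophenetic_distances(X, method="average", metric="euclidean"):
    """Condensed cophenetic distance vector of the dendrogram built from the rows of X."""
    Z = hier.linkage(dist.pdist(X, metric=metric), method=method)
    return hier.cophenet(Z)

rng = np.random.default_rng(1)
X_a = rng.normal(size=(8, 5))                       # toy "treatment A" matrix
X_b = X_a + rng.normal(scale=0.05, size=X_a.shape)  # slightly perturbed copy

# Correlating a dendrogram with itself gives 1; two treatments give a value in [-1, 1].
r_same, _ = stats.pearsonr(cophenetic_distances(X_a), cophenetic_distances(X_a))
r_ab, _ = stats.pearsonr(cophenetic_distances(X_a), cophenetic_distances(X_b))
print(r_same, r_ab)
```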

#### Baker's Gamma Correlation Coefficient

- The `mergerank` function from multianalysis.py creates a 'rank' holding, for each pair of samples, the iteration number at which they were first linked into the same cluster. The ranks obtained from two dendrograms are then compared with the Kendall correlation, following Baker's paper, or with the Spearman correlation, following the explanation given in the R package `dendextend`.

Baker's paper: Baker FB. Stability of Two Hierarchical Grouping Techniques Case 1: Sensitivity to Data Errors. J Am Stat Assoc. 1974;69(346):440-445. doi:10.2307/2285675
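
The merge-rank idea can be sketched as follows. `merge_rank_condensed` is a hypothetical stand-in written for illustration. The implementation of `ma.mergerank` is not shown in this notebook and may differ, for example by returning a square matrix rather than a condensed vector.

```python
from itertools import product

import numpy as np
import scipy.cluster.hierarchy as hier
import scipy.spatial.distance as dist
from scipy import stats

def merge_rank_condensed(Z):
    """For every pair of leaves, the 1-based merge step at which they first share a cluster."""
    n = Z.shape[0] + 1
    members = {i: [i] for i in range(n)}  # cluster id -> leaf indices
    rank = np.zeros((n, n), dtype=int)
    for step, (a, b, _, _) in enumerate(Z, start=1):
        left, right = members.pop(int(a)), members.pop(int(b))
        for i, j in product(left, right):
            rank[i, j] = rank[j, i] = step
        members[n + step - 1] = left + right  # SciPy numbers new clusters n, n+1, ...
    return rank[np.triu_indices(n, k=1)]

# Toy 1-D data with unambiguous merge order.
X1 = np.array([[0.0], [1.0], [5.0], [6.5], [20.0]])
Z1 = hier.linkage(dist.pdist(X1), method="average")
ranks1 = merge_rank_condensed(Z1)

# Baker's gamma of a dendrogram with itself (Kendall correlation of the ranks).
gamma_self, _ = stats.kendalltau(ranks1, ranks1)
print(ranks1, gamma_self)
```

Because only the order of merges is used, this coefficient ignores the heights of the dendrogram branches, unlike the cophenetic correlation.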

The information from HCA for these methods is already collected.

#### Example of the procedure for these methods with the Negative Grapevine Dataset (GDg2-)

In [None]:
for name, ds in datasets.items():
    HCA_all[name] = {}
    print(f'Performing HCAs to {name} data set', end=' ...')
    for treat in 'NGP', 'NGP_RF', 'IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11':
        #print(f'Performing HCA to {name} data set with treatment {treat}', end=' ...')
        metric = 'euclidean'
        HCA_all[name][treat] = perform_HCA(datasets[name][treat], metric=metric, method='average')
    print('done!')

In [None]:
# Correlation metrics
pearsonr = stats.pearsonr
kendalltau = stats.kendalltau
spearmanr = stats.spearmanr

table = []
t1 = HCA_all['GD_global2']['NGP']
t2 = HCA_all['GD_global2']['NGP_RF']
t3 = HCA_all['GD_global2']['Degree']

r, p_value = pearsonr(t1['coph'][1], t2['coph'][1])
k, p_value_k = kendalltau(t1["Baker's Gamma"], t2["Baker's Gamma"])
s, p_value_s = spearmanr(t1["Baker's Gamma"], t2["Baker's Gamma"])
table.append({'Pair of samples': 'NGP Treat-NGP_RF Treat',
              'Cophenetic (Pearson)': r,
              '(coph) p-value': p_value,
              "Baker's (Kendall)":k,
              '(B-K) p-value': p_value_k,
              "Baker's (Spearman)":s,
              '(B-S) p-value': p_value_s,})

r, p_value = pearsonr(t1['coph'][1], t3['coph'][1])
k, p_value_k = kendalltau(t1["Baker's Gamma"], t3["Baker's Gamma"])
s, p_value_s = spearmanr(t1["Baker's Gamma"], t3["Baker's Gamma"])
table.append({'Pair of samples': 'NGP Treat-Degree Treat',
              'Cophenetic (Pearson)': r,
              '(coph) p-value': p_value,
              "Baker's (Kendall)":k,
              '(B-K) p-value': p_value_k,
              "Baker's (Spearman)":s,
              '(B-S) p-value': p_value_s,})

pd.DataFrame(table).set_index('Pair of samples')

### Calculate all the pairwise correlations between the dendrograms

Choose the set of treatment/distance metric combinations to consider when calculating the pairwise correlations. These are indicated by the strings in the `treatments` list. The `colnames` list must follow the same order as the `treatments` list.

In [None]:
# Column names and row names for the dataframes and heatmaps
# Collect results of HCAs

colnames = ['IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD-11']
treatments = ['IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11']

# Alternative set, overriding the one above: NGP and NGP_RF are compared separately instead of IDT
colnames = ['NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD-11']
treatments = ['NGP', 'NGP_RF', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11']

# Calculation of correlation coefficient by each method

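def create_HCA_correlations(HCA_results, treatments, colnames):
    "Compute pairwise Kendall (K), Spearman (S) and cophenetic (C) correlations, and their p-values, between HCA results."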
    n_res = len(colnames)
    correlations = {key: np.empty((n_res, n_res)) for key in ('K', 'S', 'C', 'K_p', 'S_p', 'C_p')}

    for i, treat1 in enumerate(treatments):
        for j, treat2 in enumerate(treatments):
            Si, Sj = HCA_results[treat1], HCA_results[treat2]
            
            # K - Kendall (Baker's Gamma)
            k, p_value_k = stats.kendalltau(Si["Baker's Gamma"], Sj["Baker's Gamma"])
            correlations['K'][i,j], correlations['K_p'][i,j] = k, p_value_k

            # S - Spearman (Baker's Gamma)
            s, p_value_s = stats.spearmanr(Si["Baker's Gamma"], Sj["Baker's Gamma"])
            correlations['S'][i,j], correlations['S_p'][i,j] = s, p_value_s

            # C - Cophenetic Correlation
            r, p_value = stats.pearsonr(Si['coph'][1], Sj['coph'][1])
            correlations['C'][i,j], correlations['C_p'][i,j] = r, p_value

    for k in correlations:
        correlations[k] = pd.DataFrame(correlations[k], columns=colnames, index=colnames)
    return correlations

correlations_neg = create_HCA_correlations(HCA_all['GD_global2'], treatments, colnames)

### Heatmaps of the correlation coefficients

For Baker's Gamma correlation, the heatmaps presented are the ones calculated with the Kendall correlation (according to the original paper: Baker FB. Stability of Two Hierarchical Grouping Techniques Case 1: Sensitivity to Data Errors. J Am Stat Assoc. 1974;69(346):440-445. doi:10.2307/2285675).

The other correlations can be shown by changing the 'C's and 'K's to 'S's in the `combineCK` function (2 cells below), depending on which set of correlations you want to see.

Here, **two sets of heatmaps** - Baker's gamma (Kendall) correlation (upper) and cophenetic correlation (lower) - of the GDg2- or GDg2+ datasets with the IDT and the sample MDiN analysis metrics are shown.
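
One way to show two correlation sets in a single heatmap is to place one in the upper triangle and the other in the lower triangle of the same square matrix. The sketch below assumes that this is what "upper" and "lower" refer to, which the text does not state explicitly. `combine_triangles` is a hypothetical helper, not the notebook's `combineCK` function, and the toy matrices are made up.

```python
import numpy as np

def combine_triangles(upper_source, lower_source):
    """Square matrix with the upper triangle of one input and the lower triangle of the other."""
    upper_source = np.asarray(upper_source, dtype=float)
    lower_source = np.asarray(lower_source, dtype=float)
    combined = np.triu(upper_source, k=1) + np.tril(lower_source, k=-1)
    # Each dendrogram correlates perfectly with itself.
    np.fill_diagonal(combined, 1.0)
    return combined

# Toy symmetric correlation matrices for three treatments.
K = np.array([[1.0, 0.2, 0.3],
              [0.2, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
C = np.array([[1.0, 0.7, 0.8],
              [0.7, 1.0, 0.9],
              [0.8, 0.9, 1.0]])
combined = combine_triangles(K, C)
print(combined)
```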

Below are the functions to build these heatmaps.

In [None]:
def relative_luminance(color):
    """Calculate the relative luminance of a color according to W3C standards
    Parameters
    ----------
    color : matplotlib color or sequence of matplotlib colors
        Hex code, rgb-tuple, or html color name.
    Returns
    -------
    luminance : float(s) between 0 and 1
    """
    rgb = mpl.colors.to_rgba_array(color)[:, :3]
    rgb = np.where(rgb <= .03928, rgb / 12.92, ((rgb + .055) / 1.055) ** 2.4)
    lum = rgb.dot([.2126, .7152, .0722])
    try:
        return lum.item()
    except ValueError:
        return lum

def plot_partitioned_df_asheatmap(df, ax=None, cmap='viridis', vmin=None, vmax=None, norm=None,
                                  partition_point=0, top_rotate=False, fontsize=14, colorbar=True):
    
    if ax is None:
        ax = plt.gca()
    
    values = df.values.copy()
    #values = np.flipud(values)
    # handle partition point
    
    # insert NaN column/row in values
    values = np.insert(values, partition_point, np.nan, axis=1)
    #values = np.insert(values, df.shape[0]- partition_point, np.nan, axis=0)
    values = np.insert(values, partition_point, np.nan, axis=0)
    
    # compute and insert 2% offset
    X = np.array(range(values.shape[0] + 1), dtype=float)
    Y = np.array(range(values.shape[1] + 1), dtype=float)
    offset = X[-1] * 0.02
    
    X[(partition_point+1):] = np.arange(float(partition_point)+offset, float(len(X)-1), 1.0)
    Y[(partition_point+1):] = np.arange(float(partition_point)+offset, float(len(Y)-1), 1.0)
    #Y[(len(Y)-partition_point-1):] = np.arange(float(len(Y)-partition_point-2)+offset, float(len(Y)-1), 1.0)

    # draw pcolormesh
    pm = ax.pcolormesh(X, Y, values, cmap=cmap, vmin=vmin, vmax=vmax, norm=norm)
    ax.set_ylim(ax.get_ylim()[1], ax.get_ylim()[0])
    ax.set_aspect('equal')

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    
    # handle labels
    midpoints_x = (X[1:] - X[:-1]) / 2 + X[:-1]
    midpoints_x = np.delete(midpoints_x, partition_point)
    midpoints_y = (Y[1:] - Y[:-1]) / 2 + Y[:-1]
    midpoints_y = np.delete(midpoints_y, partition_point)
    ax.set_xticks(midpoints_x)
    ax.set_yticks(midpoints_y)
    ax.set_xticklabels(df.columns)
    ax.set_yticklabels(df.index)
    ax.tick_params(labeltop=True, labelbottom=False, labelsize=fontsize,
                   top=False, bottom=False, left=False, right=False)
    if top_rotate:
        # Rotate the tick labels and set their alignment.
        plt.setp(ax.get_xticklabels(), rotation=90, ha="left", va='center', rotation_mode="anchor")
    
    # handle annotations
    
    pm_colors = pm.cmap(pm.norm(pm.get_array())).reshape(values.shape[0], values.shape[1], 4)
    mask = np.ones((values.shape[0], values.shape[1]), dtype=bool)
    mask[:, partition_point] = False
    mask[partition_point, :] = False
    pm_colors = pm_colors[mask].reshape(df.shape[0], df.shape[1], 4)
    #print(pm_colors)

    for i in range(df.shape[0]):
        for j in range(df.shape[1]):
            locx = midpoints_x[j]
            locy = midpoints_y[i]
            # handle label color according to cell color
            cell_color = pm_colors[i, j, :]
            lum = relative_luminance(cell_color)
            text_color = ".15" if lum > .408 else "w"
            annot = f'{df.iloc[i, j]:.2g}'
            text = ax.text(locx, locy, annot, fontsize=fontsize,
                           ha="center", va="center", color=text_color)

    if colorbar:
        plt.colorbar(pm)
    return pm

In [None]:
def combineCK(correlations):
    correlations['CK'] = correlations['C'].copy()
    # upper triangular mask (diagonal included): these cells receive the Kendall values
    upper_mask = np.triu(np.ones(correlations['CK'].shape)).astype(bool)
    correlations['CK'][upper_mask] = correlations['K']

In [None]:
f, ax = plt.subplots(figsize=(12,10))

combineCK(correlations_neg)

pm = plot_partitioned_df_asheatmap(correlations_neg['CK'], ax=ax, vmin=-0.2, vmax=1,
                                    cmap=sns.color_palette("rocket_r", as_cmap=True),
                                    partition_point=2, top_rotate=False)
ax.text(4.25, -0.9, "Baker's gamma (Kendall) correlation (upper) and cophenetic correlation (lower)",
        ha='center', fontsize=14)

#f.suptitle("Data set GDg2-", fontsize=16, y=1.04)
#ax.text(1.5,-0.45, 'Traditional treatments', fontsize=14, ha='center')
ax.text(4.65,-0.45, 'MDiN analysis metrics', fontsize=14, ha='center')

plt.show()

## Evaluating Dendrogram (HCAs) Sample Discrimination

To evaluate the discrimination achieved with each HCA, 3 different metrics were used:

- **Discrimination Distance** - the average of the “class discrimination distances”. For each class, the discrimination distance is 0 if the class is not “correctly clustered”; otherwise, it is the distance between the node that includes all the samples of the class and the next closest node (which also includes those samples) in the agglomerative procedure, normalized by the maximum distance between any pair of nodes in the final clustering.
- **Correct Clustering Percentage** - the percentage of classes that are correctly clustered.
- **Correct First Cluster Percentage** - the percentage of samples whose first clustering was only with samples from their own class.

Correct (Class) Clustering definition - samples of a class all clustered together before any other clustering with other samples or already-formed clusters in the agglomerative procedure.
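
The definitions above can be sketched on a SciPy linkage matrix as follows. This is one possible reading of the text, written for toy 1-D data; the helper name `class_discrimination_distances` is hypothetical and the code is not the `ma.dist_discrim` implementation. A class counts as correctly clustered when some node of the dendrogram contains exactly its samples, and its discrimination distance is the height gap between that node and its parent, divided by the largest merge height.

```python
import numpy as np
import scipy.cluster.hierarchy as hier

def class_discrimination_distances(Z, labels):
    """Hypothetical helper: normalized discrimination distance per class."""
    labels = np.asarray(labels)
    n = len(labels)
    max_height = Z[:, 2].max()
    members = {i: frozenset([i]) for i in range(n)}  # leaves of each node
    height = {i: 0.0 for i in range(n)}              # merge height of each node
    parent_height = {}                               # height of the node's parent
    for step, (a, b, h, _) in enumerate(Z):
        a, b = int(a), int(b)
        node = n + step
        members[node] = members[a] | members[b]
        height[node] = h
        parent_height[a] = h
        parent_height[b] = h
    result = {}
    for cls in np.unique(labels).tolist():
        target = frozenset(np.flatnonzero(labels == cls).tolist())
        result[cls] = 0.0  # stays 0 if no node holds exactly this class
        for node, leaves in members.items():
            if leaves == target and node in parent_height:
                result[cls] = (parent_height[node] - height[node]) / max_height
    return result

# Toy data (assumption): two well-separated classes on a line
X = np.array([[0.0], [0.1], [0.3], [10.0], [10.1], [10.3]])
labels = ['a', 'a', 'a', 'b', 'b', 'b']
Z = hier.linkage(X, method='average')

dists = class_discrimination_distances(Z, labels)
average_discrim_dist = np.mean(list(dists.values()))
pct_correct = 100 * sum(d > 0 for d in dists.values()) / len(dists)
print(dists, average_discrim_dist, pct_correct)
```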

The functions applied here (`dist_discrim` and `correct_1stcluster_fraction`) come from the multianalysis.py file of this repository, which explains each step used to calculate the different metrics.

**Note**: For `vitis_types` and `HD` datasets, only the correct first cluster percentage was used, since the other metrics are susceptible to outliers which can very well come up in data with high sample number and number of samples per class such as these two datasets. Furthermore, `vitis_types` dendrograms are equal to `GD_class2` since the only difference between these data sets is the target labels.
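
For reference, the correct first cluster percentage can be sketched directly from a linkage matrix. This is a minimal illustration under the definition given above, with a hypothetical helper name (`first_cluster_fraction`) and toy data; it is not the `ma.correct_1stcluster_fraction` code. Each sample appears exactly once as a direct child in the linkage matrix, and that merge is its first clustering.

```python
import numpy as np
import scipy.cluster.hierarchy as hier

def first_cluster_fraction(Z, labels):
    """Hypothetical helper: fraction of samples whose first merge is class-pure."""
    labels = np.asarray(labels)
    n = len(labels)
    members = {i: [i] for i in range(n)}  # leaf indices under each node
    correct = 0
    for step, (a, b) in enumerate(Z[:, :2].astype(int)):
        merged = members[a] + members[b]
        members[n + step] = merged
        for child in (a, b):
            # child < n means an original sample is joining its first cluster
            if child < n and np.all(labels[merged] == labels[child]):
                correct += 1
    return correct / n

# Toy data (assumption): two well-separated classes on a line
X = np.array([[0.0], [0.1], [0.3], [10.0], [10.1], [10.3]])
Z = hier.linkage(X, method='average')

frac_good = first_cluster_fraction(Z, ['a', 'a', 'a', 'b', 'b', 'b'])
# Relabelling sample 2 makes its first merge (with samples 0 and 1) impure
frac_bad = first_cluster_fraction(Z, ['a', 'a', 'b', 'b', 'b', 'b'])
print(100 * frac_good, 100 * frac_bad)
```

Because this metric only looks at each sample's first merge, a single outlier lowers it by one sample's worth rather than invalidating a whole class.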

In [None]:
def compute_clustering_metrics(res_dict, labels):
    """Fill dict with clustering performance metrics."""
    
    discrim = ma.dist_discrim(res_dict['Z'], labels, # all samples have the same order
                              method = 'average')
    res_dict['Average discrim dist'] = discrim[0]
    correct = np.array(list(discrim[1].values()))
    
    classes = pd.unique(labels)
    res_dict['% correct clustering'] = (100/len(classes)) * len(correct[correct>0])

    # Correct First Cluster Percentage
    res_dict['% correct 1st clustering'] = 100 * ma.correct_1stcluster_fraction(res_dict['Z'],labels)
    

### Dendrograms Discrimination Results

Compute clustering metrics for the dendrograms built from the benchmark datasets treated with an intensity-based pre-treatment or with the different analysis metrics used for the sample MDiNs.


For each dataset, check which of the intensity-based data pre-treatments (`NGP` or `NGP_RF`) leads to better HCA results according to the three discrimination metrics used in this work; the better one is chosen as the `IDT`.

In [None]:
HCA_performance = []
for name, dataset in datasets.items():
    
    for treatment in ('NGP', 'NGP_RF'):
        compute_clustering_metrics(HCA_all[name][treatment], datasets[name]['target'])
        perform = {'dataset': name, 'treatment': treatment,
                   'Discrimination Distance': HCA_all[name][treatment]['Average discrim dist'],
                   '% correct clusters': HCA_all[name][treatment]['% correct clustering'],
                   '% correct 1st clustering': HCA_all[name][treatment]['% correct 1st clustering']}
        HCA_performance.append(perform)
        
HCA_performance = pd.DataFrame(HCA_performance)

cv_dsnames = {name:datasets[name]['name'] for name in datasets}

HCA_performance2 = HCA_performance.assign(dataset = HCA_performance['dataset'].map(cv_dsnames))

p4 = sns.color_palette('tab20', 2)
with sns.axes_style("whitegrid"):
    f, axs = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
    sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=HCA_performance2, ax=axs[2], palette=p4)
    sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=HCA_performance2, ax=axs[0], palette=p4)
    sns.barplot(x="dataset", y="% correct 1st clustering", hue="treatment", data=HCA_performance2, ax=axs[1], palette=p4)
    axs[1].legend().set_visible(False)
    axs[2].legend().set_visible(False)
    plt.show()

In [None]:
HCA_performance = []
for name, dataset in datasets.items():
    
    for treatment in ('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
        # Choice of the IDT for each dataset
        if treatment == 'IDT':
            if name in ('YD', 'HD'):
                compute_clustering_metrics(HCA_all[name]['NGP_RF'], datasets[name]['target'])
                perform = {'dataset': name, 'treatment': treatment,
                   'Discrimination Distance': HCA_all[name]['NGP_RF']['Average discrim dist'],
                   '% correct clusters': HCA_all[name]['NGP_RF']['% correct clustering'],
                   '% correct 1st clustering': HCA_all[name]['NGP_RF']['% correct 1st clustering']}
            else:
                compute_clustering_metrics(HCA_all[name]['NGP'], datasets[name]['target'])
                perform = {'dataset': name, 'treatment': treatment,
                   'Discrimination Distance': HCA_all[name]['NGP']['Average discrim dist'],
                   '% correct clusters': HCA_all[name]['NGP']['% correct clustering'],
                   '% correct 1st clustering': HCA_all[name]['NGP']['% correct 1st clustering']}
        
        # Data matrices of sMDiNs network analyses
        else:
            compute_clustering_metrics(HCA_all[name][treatment], datasets[name]['target'])
            perform = {'dataset': name, 'treatment': treatment,
                       'Discrimination Distance': HCA_all[name][treatment]['Average discrim dist'],
                       '% correct clusters': HCA_all[name][treatment]['% correct clustering'],
                       '% correct 1st clustering': HCA_all[name][treatment]['% correct 1st clustering']}
        HCA_performance.append(perform)
        
HCA_performance = pd.DataFrame(HCA_performance)

cv_dsnames = {name:datasets[name]['name'] for name in datasets}

HCA_performance2 = HCA_performance.assign(dataset = HCA_performance['dataset'].map(cv_dsnames))
HCA_performance2

Results summary

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    f, axs = plt.subplots(3, 1, figsize=(12, 12), constrained_layout=True)
    sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=HCA_performance2, ax=axs[0], palette=p4)
    sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=HCA_performance2, ax=axs[1], palette=p4)
    sns.barplot(x="dataset", y="% correct 1st clustering", hue="treatment", data=HCA_performance2, ax=axs[2], palette=p4)
    axs[1].legend().set_visible(False)
    axs[2].legend().set_visible(False)
    plt.show()

## K-means Clustering Analysis

K-means clustering analysis was applied using the appropriate scikit-learn functions, as done in the following cells.

#### K-means clustering was applied to all the datasets obtained for each of the benchmark datasets

The number of clusters was set equal to the number of groups (classes). Apart from this, default parameters were used.

K-means clustering is intrinsically random: the result depends on the starting positions of the cluster centroids, and the algorithm can converge to local minima. To account for this randomness, the algorithm was repeated n times (the `iter_num` argument, set to 20 in the cells below) and only the best 10% of the results (`best_fraction=0.1`) were retained, that is, those with the least inertia (greatest minimization of the objective function, the sum of squared distances of the samples to the cluster centroids).
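
The repeat-and-keep-the-best idea can be sketched with plain scikit-learn as below. The toy data, `n_runs` and `best_fraction` values are illustrative assumptions, and this is not the `ma.Kmeans_discrim` implementation; it only shows runs being ranked by `inertia_` and the best fraction being kept.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): three compact groups of 10 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2)) for c in (0.0, 5.0, 10.0)])

n_runs = 20          # number of independent K-means runs
best_fraction = 0.1  # fraction of runs to keep

# One initialization per run (n_init=1), each with a different seed
runs = [KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
        for seed in range(n_runs)]

# Lower inertia means a smaller sum of squared distances to the centroids
runs.sort(key=lambda km: km.inertia_)
n_keep = max(1, int(n_runs * best_fraction))
best_runs = runs[:n_keep]

print([round(km.inertia_, 3) for km in best_runs])
```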

To evaluate the discrimination achieved with each K-means Clustering, 3 different metrics were used:

- **Discrimination Distance** (for K-means clustering, identical idea to HCA)
- **Correct Clustering Percentage** (for K-means clustering, identical idea to HCA)
- **Adjusted Rand Index** (calculated by scikit-learn - `adjusted_rand_score`) - proportion of sample pairs that are correctly clustered together or correctly kept apart, adjusted for the proportion of pairs expected to be in those situations by chance.
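
A small example of the Adjusted Rand Index with scikit-learn's `adjusted_rand_score` is given below, using made-up labels. It illustrates that the index compares partitions rather than label names, so cluster numbers do not need to match the class names, and that it drops below 1 as soon as one sample is misplaced.

```python
from sklearn.metrics import adjusted_rand_score

true_classes = ['a', 'a', 'b', 'b', 'c', 'c']  # made-up ground truth

# Same partition as the ground truth, only the cluster ids differ
same_partition = [2, 2, 0, 0, 1, 1]
# One sample of class 'c' placed in the cluster of class 'b'
one_misplaced = [2, 2, 0, 0, 0, 1]

ari_perfect = adjusted_rand_score(true_classes, same_partition)
ari_imperfect = adjusted_rand_score(true_classes, one_misplaced)
print(ari_perfect, ari_imperfect)
```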

The function `Kmeans_discrim` from multianalysis.py was applied to calculate these metrics; that file explains each step of the calculation.

Correct clustering definition - K-means Cluster contains all and only the samples of a single class (stricter definition than in HCA). Samples of a class can all be together in a cluster, but if another sample (of another class) is present, the class is not correctly clustered. Thus, the Correct Clustering Percentage is expected to be lower in this case.
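
This stricter definition can be checked with a few lines of NumPy. The sketch below uses a hypothetical helper (`strictly_correct_classes`) and made-up labels, and is not taken from multianalysis.py: a class counts as correct only if its samples fall in a single cluster and that cluster holds no sample of any other class.

```python
import numpy as np

def strictly_correct_classes(true_labels, cluster_ids):
    """Hypothetical helper: classes whose cluster holds all and only their samples."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = []
    for cls in np.unique(true_labels).tolist():
        in_class = true_labels == cls
        clusters_of_class = np.unique(cluster_ids[in_class])
        if len(clusters_of_class) != 1:
            continue  # the class is split over several clusters
        in_cluster = cluster_ids == clusters_of_class[0]
        if np.array_equal(in_class, in_cluster):
            correct.append(cls)  # the cluster has no sample from other classes
    return correct

true_labels = ['a', 'a', 'b', 'b', 'c', 'c']
cluster_ids = [0, 0, 1, 1, 1, 2]  # one 'c' sample sits with class 'b'

correct = strictly_correct_classes(true_labels, cluster_ids)
pct_correct = 100 * len(correct) / len(set(true_labels))
print(correct, pct_correct)
```

In this example class 'b' is kept together but shares its cluster with a 'c' sample, so only class 'a' counts as correctly clustered.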

In this case, the distances are calculated by the distance between different cluster centroids.

**Note**: As before, for `vitis_types` and `HD` datasets, only the Adjusted Rand Index was used, since the other metrics are susceptible to outliers which can very well come up in data with high sample number and number of samples per class such as these two datasets.

In [None]:
def perform_KMeans(dataset, treatment, iter_num=150, best_fraction=0.1):
    "Perform K-means Clustering Analysis and calculate discrimination evaluation metrics."
    
    sample_labels = datasets[dataset]['target']
    n_classes = len(pd.unique(sample_labels))
    
    df = datasets[dataset][treatment]
    
    discrim = ma.Kmeans_discrim(df, sample_labels,
                                method='average', 
                                iter_num=iter_num,
                                best_fraction=best_fraction)

    
    # Lists for the results of the best k-means clustering
    average = []
    correct = []
    rand = []
    
    for j in discrim:
        global_disc_dist, disc_dists, rand_index, SSE = discrim[j]
        
        # Average of discrimination distances
        average.append(global_disc_dist) 
        
        # Correct Clustering Percentages
        all_correct = np.array(list(disc_dists.values()))
        correct.append(len(all_correct[all_correct>0]))
        
        # Adjusted Rand Index
        rand.append(rand_index) 
    
    return {'dataset': dataset,
            'treatment': treatment,
            'Discrimination Distance': np.median(average),
            '% correct clusters': np.median(correct) * 100 / n_classes,
            'Adjusted Rand Index': np.median(rand)}

For each dataset, check which of the intensity-based data pre-treatments (`NGP` or `NGP_RF`) leads to better K-means clustering results according to the three discrimination metrics used in this work; the better one is chosen as the `IDT`.

In [None]:
iter_num=20

KMeans_all = []

for dsname in ('GD_global2', 'GD_class2', 'YD', 'vitis_types', 'HD'):
    for treatment in ('NGP', 'NGP_RF'):
        print(f'performing KMeans on {dsname} with treatment {treatment}' , end=' ...')
        KMeans_all.append(perform_KMeans(dsname, treatment, iter_num=iter_num))
        print('done!')
        
KMeans_all = pd.DataFrame(KMeans_all)

cv_dsnames = {name:datasets[name]['name'] for name in datasets}
KMeans_all2 = KMeans_all.assign(dataset = KMeans_all['dataset'].map(cv_dsnames))

KMeans_all2

p4 = sns.color_palette('tab20', 2)
with sns.axes_style("whitegrid"):
    f, axs = plt.subplots(1, 3, figsize=(12, 4), constrained_layout=True)
    sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=KMeans_all2, ax=axs[2], palette=p4)
    sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=KMeans_all2, ax=axs[0], palette=p4)
    sns.barplot(x="dataset", y="Adjusted Rand Index", hue="treatment", data=KMeans_all2, ax=axs[1], palette=p4)

K-means Clustering Analysis

In [None]:
np.random.seed(16)

iter_num=20

KMeans_all = []

for dsname in ('GD_global2', 'GD_class2', 'YD', 'vitis_types', 'HD'):
    # Choice of the IDT for each dataset
    datasets[dsname]['IDT'] = datasets[dsname]['NGP']
    if dsname == 'HD':
        datasets[dsname]['IDT'] = datasets[dsname]['NGP_RF']
    for treatment in ('IDT', 'Degree', 'Betweenness', 'Closeness', 'MDBI', 'WMDBI', 'GCD11'):
        print(f'performing KMeans on {dsname} with treatment {treatment}' , end=' ...')
        KMeans_all.append(perform_KMeans(dsname, treatment, iter_num=iter_num))
        print('done!')

In [None]:
KMeans_all = pd.DataFrame(KMeans_all)

cv_dsnames = {name:datasets[name]['name'] for name in datasets}
KMeans_all2 = KMeans_all.assign(dataset = KMeans_all['dataset'].map(cv_dsnames))

In [None]:
KMeans_all2

Results summary

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    f, axs = plt.subplots(3, 1, figsize=(12, 12), constrained_layout=True)
    #for ax in axs.ravel():
        #ax.tick_params(labelsize=14)
        #ax.xaxis.label.set_size(16)
        #ax.axhspan(-0.5, 3.5, color='red', alpha=0.2)
        #ax.axhspan(3.55, 7.5, color='darkblue', alpha=0.2)
        #ax.axhspan(7.55, 11.5, color='red', alpha=0.2)
        #ax.axhspan(11.55, 16, color='darkblue', alpha=0.2)
    sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=KMeans_all2, ax=axs[0], palette=p4)
    sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=KMeans_all2, ax=axs[1], palette=p4)
    sns.barplot(x="dataset", y="Adjusted Rand Index", hue="treatment", data=KMeans_all2, ax=axs[2], palette=p4)

## Summary of Clustering performance

HCA and K-means Clustering results combined.

In [None]:
p4 = treat_colors
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.4):
        #f, axs = plt.subplots(2, 3, figsize=(16, 8), constrained_layout=True)
        
        fig = plt.figure(figsize=(16, 8))

        gs = plt.GridSpec(2,14)

        ax1 = fig.add_subplot(gs[0,:4])
        ax2 = fig.add_subplot(gs[0,4:8])
        ax3 = fig.add_subplot(gs[0,8:])
        
        ax4 = fig.add_subplot(gs[1,:4])
        ax5 = fig.add_subplot(gs[1,4:8])
        ax6 = fig.add_subplot(gs[1,8:]) 
        
        
        sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=HCA_performance2.loc[:20], ax=ax2, palette=p4)
        sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=HCA_performance2.loc[:20], ax=ax1, palette=p4)
        sns.barplot(x="dataset", y="% correct 1st clustering", hue="treatment", data=HCA_performance2, ax=ax3, palette=p4)

        sns.barplot(x="dataset", y="Discrimination Distance", hue="treatment", data=KMeans_all2.loc[:20], ax=ax5, palette=p4)
        sns.barplot(x="dataset", y="% correct clusters", hue="treatment", data=KMeans_all2.loc[:20], ax=ax4, palette=p4)
        sns.barplot(x="dataset", y="Adjusted Rand Index", hue="treatment", data=KMeans_all2, ax=ax6, palette=p4)
        for ax in (ax1, ax2, ax3, ax4, ax5, ax6):
            ax.set_ylim(0,105)
            ax.xaxis.label.set_visible(False)
            ax.legend().set_visible(False)
            ax.tick_params(axis='x', which='major', labelsize=14)

        ax2.legend(bbox_to_anchor=(1,1), loc="upper right", framealpha=1, fontsize=13, ncol=2)
        
        ax2.set_ylim(0,1.05)
        ax5.set_ylim(0,1.05)
        ax6.set_ylim(0,1.05)
        
        #for letter, ax in zip('ABCDEFGHIJ', axs.ravel()):
        #    ax.text(0.05, 0.9, letter, ha='left', va='center', fontsize=16, weight='bold',
        #            transform=ax.transAxes,
        #            bbox=dict(facecolor='white', edgecolor='white', alpha=0.9))
        
        plt.tight_layout()
        plt.show()

        fig.savefig('images/clust_performance.pdf' , dpi=300)
        fig.savefig('images/clust_performance.jpg' , dpi=300)
        #fig.savefig('images/clust_performance.svg')

Heatmap just to make a colorbar useful for an image in the paper

In [None]:
from matplotlib.colors import LinearSegmentedColormap

In [None]:
fig = plt.figure(figsize=(8,8))
tf = transf.FeatureScaler(method='standard')
df = tf.fit_transform(datasets['YD']['MDBI'])

myColors = ((0/255, 0/255, 255/255), (51/255, 51/255, 255/255), (102/255, 102/255, 255/255),
           (153/255, 153/255, 255/255), (204/255, 204/255, 255/255), (255/255, 204/255, 204/255),
           (255/255, 153/255, 204/255), (255/255, 102/255, 102/255), (255/255, 51/255, 51/255),
           (255/255, 0/255, 0/255), (204/255, 0/255, 0/255), (102/255, 0/255, 0/255))
cmap = LinearSegmentedColormap.from_list('Custom', myColors, len(myColors))

g = sns.heatmap(df.T, cmap = cmap, vmin=0, vmax=12)

# Manually specify colorbar labelling after it's been generated
colorbar = g.collections[0].colorbar
colorbar.set_ticks([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5])
colorbar.set_ticklabels([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fontsize=18)

g.set_title('Node\nDegree', fontsize=18)

#fig.savefig('images/colorbar.jpg' , dpi=600)