# Data Augmentation - Conditional Wasserstein GANs - GP

### Dataset: Prostate Cancer Dataset

This notebook presents the CWGAN-GP model to generate treated Intensity Data from the experimental prostate cancer dataset.

Notebook Organization:
- Read the dataset
- Treatment, Univariate Analysis and PLS-DA of the dataset
- Setup the CWGAN-GP model and train the model with intensity data
- Generate artificial samples with the trained model and compare them to the experimental data

#### Due to stochasticity, re-running the notebook will give slightly different results, so figures may differ slightly from those in the paper.

#### Needed Imports

In [None]:
import json
from time import perf_counter

import numpy as np
import pandas as pd

import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
import scipy.stats as stats
import scipy.spatial

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
import matplotlib.colors
from matplotlib import ticker

import seaborn as sns
from collections import namedtuple, Counter

from tqdm import tqdm
from IPython import display as ipythondisplay

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

import tensorflow as tf
from keras import backend

import pickle

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
from elips import plot_confidence_ellipse
import gan_evaluation_metrics as gem
import linear_augmentation_functions as laf

%matplotlib inline

In [None]:
# Import needed functions from GAN_functions
from GAN_functions import gradient_penalty_cwgan
from GAN_functions import critic_loss_wgan
from GAN_functions import generator_loss_wgan

# Prostate Cancer dataset

249 samples obtained from 80 subjects belonging to 2 different classes of HILIC-MS data obtained in positive ionization mode. The chromatography system was a Thermo Dionex Ultimate 3000 with a Waters Xbridge BEH HILIC column (75 x 2.1 mm, 2.5 µm). Mass spectrometry was performed on a Thermo Q Exactive HF hybrid Orbitrap operating in positive electrospray ionization mode. Further information is available at the data deposition site mentioned below.

The 2 classes are:

- 135 pre-operative blood (serum) samples from patients with **'No Recurrence'** of Prostate Cancer after Radical Prostatectomy
- 114 pre-operative blood (serum) samples from patients with **'Recurrence'** of Prostate Cancer after Radical Prostatectomy

Data acquired by:

- Clendinen, C.S.; Gaul, D.A.; Monge, M.E.; Arnold, R.S.; Edison, A.S.; Petros, J.A.; Fernández, F.M. Preoperative Metabolic Signatures of Prostate Cancer Recurrence Following Radical Prostatectomy. J. Proteome Res. 2019, 18, 1316–1327, doi:10.1021/acs.jproteome.8b00926.

Data obtained from Metabolomics Workbench project ID PR000724, study ID: ST001082, Analysis ID: AN001766 (https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&DataMode=AllData&StudyID=ST001082&StudyType=MS&ResultType=5#DataTabs)

The data was taken already aligned as a 2D numerical data matrix. It contained 2 copies of each sample, with the 2nd copy equal to the 1st multiplied by a constant that was different for each sample. We removed the 2nd copies.

The data matrix retained only features that occur at least twice across all samples of the dataset (filtering/alignment).

### Reading File and Data Analysis

In [None]:
# Read the dataset. It has two copies of each sample: the 2nd copy is equal to the 1st multiplied by a constant
# unique to each sample.
human_datamatrix_base = pd.read_excel('ST001082_AN001766_HD.xlsx')

In [None]:
human_datamatrix = human_datamatrix_base.iloc[:-1, :-4] # Just select the rows corresponding to the dataset
mz_list = human_datamatrix_base.iloc[:-1, -4] # Select column with list of m/z values

human_datamatrix = human_datamatrix.set_index(human_datamatrix.columns[0]) # Set index as the metabolites name
human_datamatrix = human_datamatrix.replace({0:np.nan}) # Replacing 0 values as missing values

human_datamatrix

In [None]:
# Select one of the two copies of samples in the dataset by removing the samples ending with '.1'
human_datamatrix = human_datamatrix[[i for i in human_datamatrix.columns if not i.endswith('.1')]]

human_datamatrix = human_datamatrix.T # Transpose the dataset
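
The `.1` suffix filtered on above is what pandas readers append to a repeated column header. A minimal sketch with a made-up table (sample names, values and the constants 3 and 4 are hypothetical, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical CSV in which each sample column appears twice.
csv_text = "Metabolite,S1,S2,S1,S2\nm1,10,20,30,80\nm2,5,4,15,16\n"
toy = pd.read_csv(io.StringIO(csv_text)).set_index("Metabolite")
print(list(toy.columns))  # the repeated headers come back with a '.1' suffix

# Keep only the first copy of each sample, as done in the cell above.
first_copies = toy[[c for c in toy.columns if not c.endswith(".1")]]

# In this toy table each second copy is the first one times a per-sample constant.
ratios = toy[["S1.1", "S2.1"]].values / first_copies.values
print(ratios)
```

Each column of `ratios` holds a single repeated value, which is the pattern described for the real dataset.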

In [None]:
# Current sample labels. The samples of interest are 'Sample Type:No Recurrence' and 'Sample Type:Recurrence'
hd_labels = list(human_datamatrix['Factors'])
set(hd_labels)

In [None]:
# How many 'No Recurrence' samples in the dataset
hd_labels.count('Sample Type:No Recurrence')

### Blank Treatment

- If blank_treatment = True:

The average of the blanks will be subtracted from the remaining samples. Missing values in the blanks are replaced by 0 to calculate this average. Negative values that arise from the subtraction are set to 0, i.e. coded as missing values in the dataset.

- If blank_treatment = False:

Blank samples are removed and not accounted for. 
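
The blank treatment with `blank_treatment = True` can be illustrated on a tiny made-up table. The sample names, feature names and intensities below are hypothetical; only the sequence of operations follows the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical intensities: rows are samples, columns are features.
samples = pd.DataFrame({"f1": [10.0, np.nan], "f2": [3.0, 8.0]}, index=["s1", "s2"])
blanks = pd.DataFrame({"f1": [2.0, np.nan], "f2": [4.0, 6.0]}, index=["b1", "b2"])

# Missing values in the blanks count as 0 when averaging.
blanks_average = blanks.fillna(0).mean()      # f1 -> 1.0, f2 -> 5.0

# Subtract the blank average, set negatives to 0, then code zeros as missing again.
corrected = samples.fillna(0) - blanks_average
corrected[corrected < 0] = 0
corrected = corrected.replace({0: np.nan})
print(corrected)
```

Any value at or below the blank average ends up as a missing value, so only signal above the blank level is kept.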

In [None]:
blank_treatment = True
#blank_treatment = False

In [None]:
if blank_treatment:
    blanks = human_datamatrix[human_datamatrix['Factors'] == 'Sample Type:Blank'].iloc[:,1:] # Select blank samples
    blanks = blanks.replace({np.nan:0}) 
    blanks = blanks.astype(float) # Get blank samples with floats (needed since datamatrix has strings)
    blanks_average = blanks.mean() # Average of the blanks
    blanks_average

In [None]:
# Selecting the samples belonging to either the 'No Recurrence' or 'Recurrence' class
selection = human_datamatrix['Factors'].isin(['Sample Type:No Recurrence', 'Sample Type:Recurrence']).tolist()

In [None]:
human_datamatrix = human_datamatrix[selection]
# Creating the list of 'targets' (labels of samples) of the dataset with the 'No Recurrence' and 'Recurrence' classes 
hd_labels = []
for i in list(human_datamatrix.iloc[:,0]):
    if i == 'Sample Type:No Recurrence':
        hd_labels.append('No Recurrence')
    else:
        hd_labels.append('Recurrence')

In [None]:
human_datamatrix = human_datamatrix.iloc[:,1:]
human_datamatrix = human_datamatrix.astype(float) # Passing the values from strings to floats.
human_datamatrix

In [None]:
# Subtract the average of the 5 blanks from the samples in the dataset
if blank_treatment:
    human_datamatrix = human_datamatrix.replace({np.nan:0}) - blanks_average
    human_datamatrix[human_datamatrix<0] = 0
    human_datamatrix = human_datamatrix.replace({0:np.nan})

'Neutralization' of the _m/z_ values by subtracting the mass of a proton from the _m/z_ peaks.

4 peaks (2 pairs) with different retention times have exactly the same _m/z_ values. Each pair will be treated as a single peak, giving 2 different peaks instead of 4.

In [None]:
# Atomic masses - https://ciaaw.org/atomic-masses.htm
# Isotopic abundances - https://ciaaw.org/isotopic-abundances.htm
#                       https://www.degruyter.com/view/journals/pac/88/3/article-p293.xml
# Isotopic abundances from Pure Appl. Chem. 2016; 88(3): 293–306,
# Isotopic compositions of the elements 2013 (IUPAC Technical Report), doi: 10.1515/pac-2015-0503

chemdict = {'H':(1.0078250322, 0.999844),
            'C':(12.000000000, 0.988922),
            'N':(14.003074004, 0.996337),
            'O':(15.994914619, 0.9976206),
            'Na':(22.98976928, 1.0),
            'P':(30.973761998, 1.0),
            'S':(31.972071174, 0.9504074),
            'Cl':(34.9688527, 0.757647),
            'F':(18.998403163, 1.0),
            'C13':(13.003354835, 0.011078) # Carbon 13 isotope
           } 

# electron mass from NIST http://physics.nist.gov/cgi-bin/cuu/Value?meu|search_for=electron+mass
electron_mass = 0.000548579909065

In [None]:
# Neutral mass = m/z - proton mass, with proton mass = mass of H - mass of an electron
mz_list = mz_list[1:] - chemdict['H'][0] + electron_mass
counts = human_datamatrix.count(axis=0)
final_mz_list = list(mz_list[list(counts >= 2)])
final_mz_list = set(final_mz_list)
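
The arithmetic of the neutralization step can be checked in isolation. The sketch below assumes singly charged [M+H]+ ions, which matches subtracting one proton in positive ionization mode; the example _m/z_ value is hypothetical.

```python
# Masses copied from the cell that defines chemdict and electron_mass.
hydrogen_mass = 1.0078250322
electron_mass = 0.000548579909065

# A proton is a hydrogen atom without its electron.
proton_mass = hydrogen_mass - electron_mass


def neutral_mass(mz):
    """Neutral monoisotopic mass for a singly charged [M+H]+ peak (assumption)."""
    return mz - proton_mass


# Hypothetical peak used only for illustration.
example_mz = 181.070665
print(proton_mass, neutral_mass(example_mz))
```

Subtracting `chemdict['H'][0]` and then adding `electron_mass`, as in the cell above, is the same operation written in one line.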

### Data Pre-Treatments

In [None]:
human_datamatrix = transf.keep_atleast(human_datamatrix, minimum=2) # Keep features that appear in at least two samples

In [None]:
# For missing value imputation based on constants relative to the minimum of each feature instead of the full dataset
def fillna_frac_feat_min(df, fraction=0.2):
    """Set NaN to a fraction of the minimum value in each column of the DataFrame."""

    minimum = df.min(axis=0) * fraction
    return df.fillna(minimum)
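
The helper above works on the columns of whatever frame it receives, so transposing the input changes which minima are used. A small sketch on a made-up frame (names and values are hypothetical) shows both orientations; the function is repeated so the block is self-contained.

```python
import numpy as np
import pandas as pd


def fillna_frac_feat_min(df, fraction=0.2):
    """Set NaN to a fraction of the minimum value in each column of the DataFrame."""
    return df.fillna(df.min(axis=0) * fraction)


# Hypothetical frame: rows are samples, columns are features.
toy = pd.DataFrame({"f1": [10.0, np.nan, 30.0], "f2": [np.nan, 4.0, 8.0]},
                   index=["s1", "s2", "s3"])

by_column = fillna_frac_feat_min(toy)     # uses the minima of f1 and f2
by_row = fillna_frac_feat_min(toy.T).T    # uses the minima of s1, s2 and s3
print(by_column)
print(by_row)
```

In `by_column` the missing `f1` value becomes 0.2 x 10 = 2.0, while in `by_row` the same cell becomes 0.2 x 4 = 0.8, the fraction of that row's own minimum.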

In [None]:
human_datamatrix_I = fillna_frac_feat_min(human_datamatrix.T, fraction=0.2).T
human_datamatrix_N = transf.normalize_PQN(human_datamatrix_I, ref_sample='mean')
human_datamatrix_treated = transf.pareto_scale(transf.glog(human_datamatrix_N, lamb=None))

In [None]:
human_datamatrix_I

### Calculating Silhouette Coefficient

The Silhouette coefficient was calculated (with scikit-learn's `silhouette_score`) using the class labels as the clusters, to give a measure of how much the biological classes overlap that can be compared between the different datasets.

The coefficient was calculated using all features of the dataset after pre-treatment and using the PCA scores of the first 2 Principal Components.
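
To make the measure concrete, here is a NumPy-only sketch of the mean silhouette coefficient with class labels used as clusters. It assumes Euclidean distances and at least two samples per class; the function name and toy data are hypothetical, and the notebook itself relies on scikit-learn rather than on this code.

```python
import numpy as np


def silhouette_with_labels(X, labels):
    """Mean silhouette coefficient using known class labels as the clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False                       # exclude the sample itself
        a = dists[i, same].mean()             # mean distance to its own class
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Hypothetical data: two tight groups that are far apart.
X = [[0, 0], [0, 1], [5, 0], [5, 1]]
separated = silhouette_with_labels(X, ["A", "A", "B", "B"])
mixed = silhouette_with_labels(X, ["A", "B", "A", "B"])
print(separated, mixed)
```

Values near 1 indicate well separated classes, while values near 0 or below indicate strongly overlapping classes.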

In [None]:
# Silhouette Coefficient based on all the features of the dataset after pre-treatment
print('Calculation based on all the features of the dataset after pre-treatment')
print('Silhouette Coefficient:', silhouette_score(human_datamatrix_treated, hd_labels))

In [None]:
# Silhouette Coefficient based on the PCA Scores of the samples on the first 2 Principal Components
principaldf, var = ma.compute_df_with_PCs(human_datamatrix_treated, n_components=2, whiten=True,
                                          labels=hd_labels, return_var_ratios=True)
s_score = silhouette_score(principaldf.iloc[:,:2], hd_labels)
print('Calculation based on the PCA Scores of the samples on the 2 first Principal Components')
print('Silhouette Coefficient:', s_score)

#### Linear Augmentation / Interpolation of the experimental data and corresponding sample labels/classes to use as training data for the GAN network.

The data after missing value imputation and before the other pre-treatments was used.

Because the prostate cancer dataset (PCD) is large, this takes a long time. With this many samples the augmentation is arguably not needed, but we do it for consistency with the remaining datasets.

The remaining pre-treatments are then applied to the linearly augmented data.
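
The idea of linear augmentation can be sketched as interpolation between pairs of samples from the same class. This is only an illustration under assumptions: the internals of `laf.artificial_dataset_generator` are not shown in this notebook, and the function name, the `frac` parameter and the toy data below are hypothetical.

```python
import itertools
import numpy as np


def interpolate_within_classes(X, labels, frac=0.5):
    """Create one new sample per pair of same-class samples by linear interpolation."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    new_samples, new_labels = [], []
    for cls in np.unique(labels):
        members = X[labels == cls]
        for a, b in itertools.combinations(members, 2):
            new_samples.append(a + frac * (b - a))   # point on the line from a to b
            new_labels.append(str(cls))
    return np.array(new_samples), new_labels


# Hypothetical intensities for two samples per class.
X = [[1.0, 1.0], [3.0, 5.0], [10.0, 10.0], [20.0, 30.0]]
y = ["R", "R", "N", "N"]
new_X, new_y = interpolate_within_classes(X, y)
print(new_X, new_y)
```

Because the new points lie between existing samples of one class, they inherit that class label without any model being trained.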

In [None]:
# This takes a long time on the first run (large dataset); afterwards set GENERATE_LIN = False to reload the stored data
GENERATE_LIN = True
if GENERATE_LIN:
    start = perf_counter()
    data, lbls = laf.artificial_dataset_generator(human_datamatrix_I, labels=hd_labels,
                                            max_new_samples_per_label=512, binary=False, rnd=0.5, 
                                            binary_rnd_state=None, rnd_state=None)
    end = perf_counter()
    print(f'Simple augmentation of data done! took {(end - start):.3f} s')
    
    # Then pre-treatment of the newly generated data
    data_N = transf.normalize_PQN(data, ref_sample='mean')
    data_treated = transf.pareto_scale(transf.glog(data_N, lamb=None))

In [None]:
# Store the generated data so that it does not need to be generated again
if GENERATE_LIN:
    # Store Linear Augmented Data
    data_treated.to_csv('store_data/PCD_generated_data_NGP.csv')

    # Store labels/classes corresponding to each generated sample
    with open('store_data/PCD_data_NGP_lbls.txt', 'w') as a:
        for item in lbls:
            a.write("{}\n".format(item))

# Read back the stored linearly augmented data
if not GENERATE_LIN:
    data_treated = pd.read_csv('store_data/PCD_generated_data_NGP.csv')
    data_treated = data_treated.set_index(data_treated.columns[0])

    with open('store_data/PCD_data_NGP_lbls.txt') as a:
        lbls = a.read().split('\n')[:-1]

#### Univariate Analysis

In [None]:
uni_results = ma.compute_FC_pvalues_2groups(human_datamatrix_N, human_datamatrix_treated,
                               labels=hd_labels,
                               equal_var=True,
                               alpha=0.05, useMW=False)
uni_results

In [None]:
pvalue = 0.01
log2FC = 1

b = uni_results[uni_results['FDR adjusted p-value'] < pvalue]
uni_results_filt = b[abs(b['log2FC']) > log2FC]
uni_results_filt

#### PLS-DA model

In [None]:
%%capture --no-stdout
# the line above suppresses PLS warnings

max_comp=20

# Store Results
PLS_optim = {}

# Build models with different numbers of components and extract their metrics using the optim_PLSDA_n_components function.
treatment = 'NGP'
print(f'Fitting PLS-DA model with treatment {treatment}', end=' ...')
plsdaname = treatment
PLS_optim[plsdaname] = {'treatment':treatment}
n_fold = 5
optim = ma.optim_PLSDA_n_components(human_datamatrix_treated, hd_labels,
                                    max_comp=max_comp, n_fold=n_fold).CVscores
PLS_optim[plsdaname]['CV_scores'] = optim
print('done')

In [None]:
PLS_optim[plsdaname]['CV_scores']

### Inputs for GAN model

In [None]:
# Get distribution of intensity values of the dataset
hist = np.histogram(human_datamatrix_treated.values.flatten(), bins=100)
input_realdata_dist = stats.rv_histogram(hist)
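
Sampling from a histogram-based distribution amounts to picking a bin with probability proportional to its count and then drawing uniformly inside that bin. The NumPy-only sketch below illustrates this idea with standard normal stand-in data. It is not how `stats.rv_histogram` is implemented, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=10_000)      # stand-in for the flattened treated intensities

counts, edges = np.histogram(values, bins=100)
probs = counts / counts.sum()


def sample_from_histogram(n):
    """Pick bins in proportion to their counts, then a uniform point inside each bin."""
    bins = rng.choice(len(counts), size=n, p=probs)
    return rng.uniform(edges[bins], edges[bins + 1])


draws = sample_from_histogram(5_000)
print(draws.mean(), draws.std())
```

The generator input drawn this way has roughly the same value distribution as the real treated data, instead of the standard normal noise often used for GANs.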

Set up colours for each of the classes. Generated samples will have the corresponding label with '- GAN' after.

In [None]:
colours2 = sns.color_palette('tab20', 4)

ordered_labels_test = ('Recurrence','Recurrence - GAN','No Recurrence','No Recurrence - GAN')
label_colors_test = {lbl: c for lbl, c in zip(ordered_labels_test, colours2)}
ordered_labels = ['Recurrence', 'No Recurrence']

sns.palplot(label_colors_test.values())
new_ticks_test = plt.xticks(range(len(ordered_labels_test)), ordered_labels_test)

## Conditional Wasserstein GAN - GP model

This model was built by combining WGAN-GP models with Conditional GAN models. The WGAN-GP part was originally based on https://keras.io/examples/generative/wgan_gp/#wasserstein-gan-wgan-with-gradient-penalty-gp. The Conditional GAN part was based on https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/ (generator and discriminator model) and on https://keras.io/examples/generative/conditional_gan/ (loss functions and training/training steps), without using OOP.
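
The two Wasserstein losses imported from `GAN_functions` are not shown in this notebook. The sketch below assumes they follow the Keras WGAN-GP example linked above; the function names and the score values are hypothetical and NumPy is used instead of TensorFlow to keep the example small.

```python
import numpy as np


def critic_loss_sketch(real_scores, fake_scores):
    """Wasserstein critic loss: low when real samples score above generated ones."""
    return float(np.mean(fake_scores) - np.mean(real_scores))


def generator_loss_sketch(fake_scores):
    """Wasserstein generator loss: low when the critic scores generated samples highly."""
    return float(-np.mean(fake_scores))


real_scores = np.array([2.0, 3.0, 1.0])    # hypothetical critic outputs for real samples
fake_scores = np.array([-1.0, 0.0, 2.5])   # hypothetical critic outputs for generated samples

c_loss = critic_loss_sketch(real_scores, fake_scores)
g_loss = generator_loss_sketch(fake_scores)
print(c_loss, g_loss)
```

The critic output is an unbounded score rather than a probability, which is why both networks end in a linear activation.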

In [None]:
def generator_model(len_input, len_output, n_hidden_nodes, n_labels): 
    "Make the generator model of CWGAN-GP."

    data_input = tf.keras.Input(shape=(len_input,), name='data') # Take intensity input
    label_input = tf.keras.Input(shape=(1,), name='label') # Take Label Input
    
    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m2 = tf.keras.layers.Reshape((256,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m2]) # Concatenate intensity and label data
    # Hidden Dense Layer and Normalization
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.BatchNormalization()(joined_data)
    
    # Output - number of features of sample to make
    output = tf.keras.layers.Dense(len_output, activation='linear', use_bias=True)(joined_data)
    
    generator = tf.keras.Model(inputs=[data_input, label_input], outputs=output)
    
    return generator


def critic_model(len_input, n_hidden_nodes, n_labels):
    "Make the critic model of CWGAN-GP."
    
    label_input = tf.keras.Input(shape=(1,)) # Take label input
    data_input = tf.keras.Input(shape=(len_input,)) # Take intensity input

    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m = tf.keras.layers.Reshape((256,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m]) # Concatenate intensity and label data
    # Hidden Dense Layer (Normalization worsened results here)
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(128, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    #joined_data = tf.keras.layers.BatchNormalization()(joined_data)

    # Output Layer - 1 node for critic decision
    output = tf.keras.layers.Dense(1, activation='linear', use_bias=True)(joined_data)
    
    critic = tf.keras.Model(inputs=[data_input, label_input], outputs=output)

    return critic

In [None]:
def generate_predictions(model, num_examples_to_generate, len_input, input_dist, uni_lbls):
    "Generate sample predictions based on a Generator model."
    
    test_input =  tf.constant(input_dist.rvs(size=len_input*num_examples_to_generate), shape=[
        num_examples_to_generate,len_input]) 
    
    if len(uni_lbls) < 3:
        test_labels = tf.constant([1.0]*(num_examples_to_generate//2) + [0.0]*(num_examples_to_generate//2), 
                                  shape=(num_examples_to_generate,1))
    else:
        test_labels = []
        for i in range(len(uni_lbls)):
            test_labels.extend([i]*(num_examples_to_generate//len(uni_lbls)))
        test_labels = np.array(pd.get_dummies(test_labels))
        #np.array(pd.get_dummies([i for i in range(len(uni_lbls))]*(num_examples_to_generate//len(uni_lbls))))
    predictions = model([test_input, test_labels], training=False) # `training` is set to False.
    return predictions
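
For two classes, the labels built in `generate_predictions` are 1.0 for the first half and 0.0 for the second half. Which class each value stands for follows from the `pd.get_dummies(...).values[:,0]` encoding used in the training function. The sketch below uses a hypothetical list of four labels to show the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical training labels.
train_lbls = ["Recurrence", "No Recurrence", "Recurrence", "No Recurrence"]

dummies = pd.get_dummies(np.array(train_lbls))
ordered = list(dummies.columns)                # columns come out sorted alphabetically
binary = dummies.values[:, 0].astype(float)    # 1.0 marks the class of the first column

# Labels built for generation: first half 1.0, second half 0.0.
n = 4
generated = [1.0] * (n // 2) + [0.0] * (n // 2)
generated_classes = [ordered[0] if g == 1.0 else ordered[1] for g in generated]
print(ordered, binary, generated_classes)
```

With these class names, 1.0 stands for 'No Recurrence', so the first half of the generated samples belongs to that class. The label list has only `n - 1` entries when `n` is odd, so an even number of examples keeps it consistent with the requested shape.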

In [None]:
def training_montage(train_data_o, train_lbls, test_data, test_lbls,
                     epochs, generator, critic, generator_optimizer, critic_optimizer, input_dist,
                    batch_size, grad_pen_weight=10, k_cov_den=50, k_crossLID=15, random_seed=145,
                    n_generated_samples=96):
    """Train a generator and critic of CWGAN-GP.
    
       Receives training data and respective class labels (train_data_o and train_lbls) and trains a generator and a critic
        model (generator, critic) over a number of epochs (epochs) with a set batch size (batch_size) with the respective 
        optimizers and learning rate (generator_optimizer, critic_optimizer). Gradient Penalty is calculated with
        grad_pen_weight as the weight of the penalty.
       At set intervals, the function redraws three graphs to evaluate the progression of the models (loss plots,
        coverage, density, crossLID and correct first cluster plots, and a PCA plot with generated and training data).
        To this end, samples need to be generated, which requires the distribution to sample the initial input values
        from (input_dist), and test data and respective labels have to be given (test_data and test_lbls). Finally, the
        number of neighbors to consider for coverage/density and crossLID calculation is also needed (k_cov_den,
        k_crossLID).
    
       train_data_o: Pandas DataFrame with training data;
       train_lbls: List with training data class labels;
       test_data: Pandas DataFrame with test data to evaluate the model;
       test_lbls: List with test data class labels to evaluate the model;
       epochs: Int value with the number of epochs to train the model;
       generator: tensorflow keras.engine.functional.Functional model for the generator;
       critic: tensorflow keras.engine.functional.Functional model for the critic;
       generator_optimizer: tensorflow keras optimizer (with learning rate) for generator;
       critic_optimizer: tensorflow keras optimizer (with learning rate) for critic;
       input_dist: scipy.stats._continuous_distns.rv_histogram object - distribution to sample input values for generator;
       batch_size: int value with size of batch for model training;
       grad_pen_weight: int value (default 10) for penalty weight in gradient penalty calculation;
       k_cov_den: int value (default 50) for number of neighbors to consider for coverage and density calculation in
       generated samples evaluation;
       k_crossLID: int value (default 15) for number of neighbors to consider for crossLID calculation in generated samples
        evaluation.
       random_seed: int value (default 145) for numpy random seeding when randomly organizing samples in the data that
        will be split into batches.
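       n_generated_samples: int value (default 96) for number of samples generated to test the model during training.

       Note: the result lists (gen_loss_all, crit_loss_all, saved_predictions, coverage, density, crossLID,
        corr1stcluster), the fitted PCA (pca) and the colour mapping (label_colors_test) are read from the notebook's
        global scope, so they must be defined before calling this function.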
    """
    
    # Obtain the train data, randomize its order and divide it by twice the standard deviation of its values
    all_data = train_data_o.iloc[
        np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]/(2*train_data_o.values.std())
    
    # Same treatment for the test data
    test_data = (test_data/(2*test_data.values.std())).values
    training_data = all_data
    train_data = all_data.values
    
    # Change class labels to numerical values while following the randomized order of samples
    if len(set(train_lbls)) < 3: # 1 and 0 for when there are only two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values[:,0]
        test_labels = pd.get_dummies(np.array(test_lbls)).values[:,0]
    else: # One hot encoding for when there are more than two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values
        test_labels = pd.get_dummies(np.array(test_lbls)).values
    # Save the order of the labels
    ordered_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]).columns

    batch_divisions = int(batch_size / len(set(train_lbls))) # See how many samples of each class will be in each batch
    n_steps = epochs * int(training_data.shape[0] / batch_size) # Number of steps: nº of batches per epoch * nº of epochs
    n_critic = 5
    
    # Set up the evaluating images printed during training and the intervals they will be updated
    f, (axl, axc, axr) = plt.subplots(1, 3, figsize = (16,5))
    update1 = n_steps//200
    update2 = n_steps//20

    if hasattr(tqdm, '_instances'):
        tqdm._instances.clear() # clear if it exists

    i=0

    for step in tqdm(range(n_steps)):
        
        # Critic Training
        crit_loss_temp = []
        
        # Select real samples for this batch on training and order samples to put samples of the same class together
        real_samp = train_data[i*batch_size:(i+1)*batch_size]
        real_lbls = train_labels[i*batch_size:(i+1)*batch_size]

        real_samples = np.empty(real_samp.shape)
        real_labels = np.empty(real_lbls.shape)
        a = 0
        if len(set(train_lbls)) < 3:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: pair[0], reverse=True):
                real_samples[a] = s
                real_labels[a] = l
                a = a+1
        else:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: np.argmax(pair[0]), reverse=False):
                #print(l, np.argmax(l))
                real_samples[a] = s
                real_labels[a] = l
                a = a+1

        for _ in range(n_critic): # For each step, train critic n_critic times
            
            # Generate input for generator
            artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
            artificial_labels = real_labels.copy()

            # Generate artificial samples from the latent vector
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            with tf.GradientTape() as crit_tape: # See the gradient for the critic

                # Get the logits for the generated samples
                X_artificial = critic([artificial_samples, artificial_labels], training=True)
                # Get the logits for the real samples
                X_true = critic([real_samples, real_labels], training=True)

                # Calculate the critic loss using the generated and real sample results
                c_cost = critic_loss_wgan(X_true, X_artificial)

                # Calculate the gradient penalty
                grad_pen = gradient_penalty_cwgan(batch_size, real_samples, artificial_samples,
                                                  real_labels, artificial_labels, critic)
                # Add the gradient penalty to the original critic loss
                crit_loss = c_cost + grad_pen * grad_pen_weight
                
            crit_loss_temp.append(crit_loss)

            # Calculate and apply the gradients obtained from the loss on the trainable variables
            gradients_of_critic = crit_tape.gradient(crit_loss, critic.trainable_variables)
            critic_optimizer.apply_gradients(zip(gradients_of_critic, critic.trainable_variables))

        i = i + 1
        if (step+1) % (n_steps//epochs) == 0:
            i=0

        crit_loss_all.append(np.mean(crit_loss_temp))
        
        # Generator Training
        # Generate inputs for generator, values and labels
        artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
        
        if len(set(train_lbls)) < 3:
            artificial_labels = tf.constant([1.0]*(batch_size//2) + [0.0]*(batch_size//2), shape=(batch_size,1))
        else:
            artificial_labels = np.array(pd.get_dummies([i for i in range(len(set(train_lbls)))]*batch_divisions))
    
        with tf.GradientTape() as gen_tape: # See the gradient for the generator
            # Generate artificial samples
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            # Get the critic results for generated samples
            X_artificial = critic([artificial_samples, artificial_labels], training=True)
            # Calculate the generator loss
            gen_loss = generator_loss_wgan(X_artificial)

        # Calculate and apply the gradients obtained from the loss on the trainable variables
        gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
        generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
        gen_loss_all.append(gen_loss)

        # Update the progress bar and evaluation graphs every update1 steps for loss plots and update2 for the others.
        if (step + 1) % update1 == 0:
            
            # Update the evaluating figures at the set intervals
            axl.clear() # Always clear the corresponding ax before redrawing it
            
            # Loss Plot
            axl.plot(gen_loss_all, color = 'blue', label='Generator Loss')
            axl.plot(crit_loss_all,color = 'red', label='Critic Loss')
            axl.set_xlabel('Number of Steps')
            axl.set_ylabel('Loss')
            axl.legend()
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())

        if (step + 1) % update2 == 0:

            saved_predictions.append(generate_predictions(generator, n_generated_samples, all_data.shape[1],
                                                          input_dist, ordered_labels))
            # See density and coverage and crossLID (divided by 25 to be in the same order as the rest) 
            # of latest predictions
            den, cov = gem.evaluation_coverage_density(test_data, saved_predictions[-1], k= k_cov_den, metric='euclidean')
            clid = gem.cross_LID_estimator_byMLE(test_data, saved_predictions[-1], k=k_crossLID, metric='euclidean')/25
            density.append(den)
            coverage.append(cov)
            crossLID.append(clid)

            # PCA of the latest predictions and training data
            # Divide by twice the standard deviation to be the same as the generated data
            dfs_temp = pd.concat((train_data_o/(2*train_data_o.values.std()),pd.DataFrame(
                saved_predictions[-1].numpy(), columns=train_data_o.columns))) 
            temp_lbls = train_lbls.copy()
            for l in ordered_labels:
                temp_lbls.extend([l+' - GAN']*(n_generated_samples//len(ordered_labels)))
            principaldf = gem.pca_sample_projection(dfs_temp, temp_lbls, pca, whiten=True, 
                                                samp_number=len(train_data_o.index))
            lcolors = label_colors_test

            # Hierarchical clustering of the latest predictions and testing data, 
            # saving the correct 1st cluster fraction results
            dfs_temp = np.concatenate((test_data, saved_predictions[-1].numpy()))
            temp_lbls = ['real']*len(test_data) + ['gen']*len(saved_predictions[-1])
            hca_results = gem.perform_HCA(dfs_temp, temp_lbls, metric='euclidean', method='ward')
            corr1stcluster.append(hca_results['correct 1st clustering'])
            
            # Plots
            axc.clear()
            axc.plot(range(update2, step+2, update2), coverage, label='coverage')
            axc.plot(range(update2, step+2, update2), density, label='density')
            axc.plot(range(update2, step+2, update2), crossLID, color='red', label='crossLID/25')
            axc.plot(range(update2, step+2, update2), corr1stcluster, color='purple', label='corr_cluster')
            axc.legend()

            axr.clear()
            gem.plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=axr)
            axr.legend(loc='upper right', ncol=1, framealpha=1)
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())
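
The gradient penalty added to the critic loss is computed by `gradient_penalty_cwgan`, which is imported and not shown here. The sketch below illustrates the usual WGAN-GP penalty under assumptions: it uses a toy linear critic whose gradient is known analytically, so NumPy is enough. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_features = 4, 3
real = rng.normal(size=(batch_size, n_features))
fake = rng.normal(size=(batch_size, n_features))

# Random points on the straight lines between real and generated samples.
eps = rng.uniform(size=(batch_size, 1))
interpolated = eps * real + (1.0 - eps) * fake

# Toy linear critic f(x) = x @ w: its gradient with respect to x is w at every point,
# including the interpolated ones. A neural critic needs automatic differentiation here.
w = np.array([3.0, 4.0, 0.0])
grads = np.tile(w, (batch_size, 1))

# Penalise gradient norms that differ from 1.
norms = np.sqrt((grads ** 2).sum(axis=1))
grad_pen = float(np.mean((norms - 1.0) ** 2))
print(grad_pen)
```

In the training loop this penalty is multiplied by `grad_pen_weight` before being added to the critic loss, which pushes the critic towards gradients of norm 1.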

### Training the GAN

In [None]:
GENERATE=True

epochs = 500
batch_size = 32
k_cov_den = 20
k_crossLID = 15
random_seed = 145
n_generated_samples = 48*len(pd.unique(lbls))

if GENERATE:
    # Store results
    gen_loss_all = []
    crit_loss_all = []
    saved_predictions = []
    coverage = []
    density = []
    crossLID = []
    corr1stcluster = []
    
    df = human_datamatrix_treated
    pca = PCA(n_components=2, svd_solver='full', whiten=True)
    pc_coords = pca.fit_transform(df)

    generator_optimizer = tf.keras.optimizers.RMSprop(5e-4)
    critic_optimizer = tf.keras.optimizers.RMSprop(5e-4)

    generator = generator_model(data_treated.shape[1], data_treated.shape[1], 128, 2)
    critic = critic_model(data_treated.shape[1], 256, 2)

    training_montage(data_treated, lbls, data_treated, lbls,
                     epochs, generator, critic, generator_optimizer, critic_optimizer,
                     input_realdata_dist, batch_size, grad_pen_weight=10, k_cov_den=k_cov_den, k_crossLID=k_crossLID,
                     random_seed=random_seed, n_generated_samples=n_generated_samples)
    
    results={'gen_loss': gen_loss_all, 'crit_loss': crit_loss_all, 'saved_pred': saved_predictions,
             'coverage': coverage, 'density': density, 'crossLID': crossLID, 'corr1st_cluster': corr1stcluster}
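
The number of training steps and the plot update intervals follow directly from the settings above. The arithmetic below uses a hypothetical training set size of 1024 samples; the real size depends on the linearly augmented data.

```python
epochs = 500
batch_size = 32
n_train = 1024                     # hypothetical number of training samples

batches_per_epoch = n_train // batch_size
n_steps = epochs * batches_per_epoch
update1 = n_steps // 200           # interval (in steps) for redrawing the loss plot
update2 = n_steps // 20            # interval (in steps) for the metrics and PCA plots
critic_updates = n_steps * 5       # n_critic = 5 critic updates per step

print(batches_per_epoch, n_steps, update1, update2, critic_updates)
```

Samples left over after the last full batch of the permuted data are never used, because the batch index resets once per epoch.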

In [None]:
if GENERATE:
    # Save the generator and critic models' weights.
    generator.save_weights('gan_models/PCDint_gen')
    critic.save_weights('gan_models/PCDint_crit')
    
    # Store data (serialize)
    with open('gan_models/PCDint_results.pickle', 'wb') as handle:
        pickle.dump(results, handle)

else:
    # Setting the model up
    generator_optimizer = tf.keras.optimizers.RMSprop(5e-4)
    critic_optimizer = tf.keras.optimizers.RMSprop(5e-4)

    generator = generator_model(data_treated.shape[1], data_treated.shape[1], 128, 2)
    critic = critic_model(data_treated.shape[1], 256, 2)

    # Load Previously saved models
    generator.load_weights('./gan_models/PCDint_gen')
    critic.load_weights('./gan_models/PCDint_crit')
    
    with open('gan_models/PCDint_results.pickle', 'rb') as handle:
        results = pickle.load(handle)
    
    gen_loss_all, crit_loss_all, saved_predictions = results['gen_loss'], results['crit_loss'], results['saved_pred']
    coverage, density, crossLID = results['coverage'], results['density'], results['crossLID']
    corr1stcluster = results['corr1st_cluster']
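
The save and load branches above follow a plain pickle round trip. A minimal sketch of that pattern in isolation is given below, assuming a made-up `example_results` dictionary and a temporary directory in place of `gan_models/`; the helper names and the values are placeholders for illustration, not notebook output.

```python
import os
import pickle
import tempfile

def save_results(results, path):
    """Serialize a results dictionary to `path` with pickle."""
    with open(path, 'wb') as handle:
        pickle.dump(results, handle)

def load_results(path):
    """Load a results dictionary previously written by `save_results`."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Placeholder values standing in for the metrics collected during training.
example_results = {'gen_loss': [0.5, 0.4], 'crit_loss': [-1.0, -0.8], 'coverage': [0.9]}

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example_results.pickle')
    save_results(example_results, example_path)
    restored_results = load_results(example_path)
```

Only unpickle files that were created locally, since `pickle.load` can execute arbitrary code stored in the file.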

#### Generate examples from the trained generator

- Generate examples in bulk: `predictions` (GAN data)
- Keep only the 5 generated samples most correlated with each of the original samples: `corr_preds` (CorrGAN data)

#### CorrGAN

In [None]:
num_examples_to_generate = 2048
# Get distribution of intensity values of the dataset
hist = np.histogram(human_datamatrix_treated.values.flatten(), bins=100)
input_realdata_dist = stats.rv_histogram(hist)

test_input = tf.constant(input_realdata_dist.rvs(size=data_treated.shape[1]*num_examples_to_generate), 
                         shape=[num_examples_to_generate, data_treated.shape[1]])
test_labels = tf.constant([1]*(num_examples_to_generate//2) + [0]*(num_examples_to_generate//2), shape=[
        num_examples_to_generate,1])

predictions = generator([test_input, test_labels], training=False)
predictions = pd.DataFrame(predictions.numpy(), columns=data_treated.columns) * 2*data_treated.values.std()

Compute the Pearson correlation between every generated sample and every original sample, then keep the 5 generated samples most correlated with each original sample.

In [None]:
df = human_datamatrix_treated
# Calculate all correlations between all samples of real and artificial data and store them in a dataframe
correlations = pd.DataFrame(index=predictions.index, columns=df.index).astype('float')

for i in df.index:
    for j in predictions.index:
        correlations.loc[j,i] = stats.pearsonr(df.loc[i],
                                               predictions.loc[j])[0]
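
The nested loop above calls `stats.pearsonr` once per pair of samples, which can be slow for thousands of generated samples. A possible vectorized alternative is sketched below with NumPy only. The function name `pairwise_pearson` and the toy arrays are assumptions made for illustration; they are not part of the notebook or of the repository modules.

```python
import numpy as np

def pairwise_pearson(a, b):
    """Pearson correlation between every row of `a` and every row of `b`.

    a: array of shape (n_a, n_features); b: array of shape (n_b, n_features).
    Returns an array of shape (n_a, n_b).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each row, then scale it to unit length.
    a_centered = a - a.mean(axis=1, keepdims=True)
    b_centered = b - b.mean(axis=1, keepdims=True)
    a_unit = a_centered / np.linalg.norm(a_centered, axis=1, keepdims=True)
    b_unit = b_centered / np.linalg.norm(b_centered, axis=1, keepdims=True)
    # The dot product of two centered unit vectors is their Pearson correlation.
    return a_unit @ b_unit.T

# Toy data: 6 "generated" rows and 4 "real" rows with 10 features each.
rng = np.random.default_rng(0)
toy_generated = rng.normal(size=(6, 10))
toy_real = rng.normal(size=(4, 10))
toy_corr = pairwise_pearson(toy_generated, toy_real)
```

In the notebook, the result could be wrapped as `pd.DataFrame(pairwise_pearson(predictions.values, df.values), index=predictions.index, columns=df.index)`. Rows with zero variance would produce NaN values here because of the division by a zero norm.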

In [None]:
# Indices to keep in the correlated GAN data
idx_to_keep = []
for i in correlations:
    idx_to_keep.extend(correlations[i].sort_values(ascending=False).index[:5])
    
print('Nº of total idx :', len(idx_to_keep))
print('Nº of unique idx:', len(set(idx_to_keep)))
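
The selection step above sorts each column of the correlation table and keeps the first 5 row labels. The same idea can be sketched with NumPy on a small toy matrix. The helper `top_k_rows_per_column` is a hypothetical name used for illustration, and it returns row positions, which coincide with the row labels only when the DataFrame has a default integer index.

```python
import numpy as np

def top_k_rows_per_column(corr, k=5):
    """For each column, return the row positions of the k largest values.

    Returns an array of shape (k, n_columns), ordered from largest to smallest.
    """
    corr = np.asarray(corr, dtype=float)
    order = np.argsort(-corr, axis=0)  # descending order within each column
    return order[:k]

# Toy correlations: 4 generated samples (rows) x 2 original samples (columns).
toy = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.7],
                [0.1, 0.3]])
top2 = top_k_rows_per_column(toy, k=2)
unique_rows = np.unique(top2)  # sorted positions with duplicates removed
```

Row 2 is among the top 2 for both columns, so it appears once in `unique_rows`. This is why the number of unique indices printed above can be smaller than the total.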

In [None]:
# Make the correlation GAN dataframe and corresponding label targets
unique_idx = sorted(set(idx_to_keep))  # deduplicate once so that rows and labels stay aligned
corr_preds = predictions.loc[unique_idx]
corr_lbls  = list(np.array(test_labels)[unique_idx])

#### GAN Data

In [None]:
ordered_labels = pd.get_dummies(
            np.array(lbls)[np.random.RandomState(seed=random_seed).permutation(len(lbls))]).columns
ordered_labels

In [None]:
num_examples_to_generate = 1024
# Get distribution of intensity values of the dataset
hist = np.histogram(human_datamatrix_treated.values.flatten(), bins=100)
input_realdata_dist = stats.rv_histogram(hist)

test_input = tf.constant(input_realdata_dist.rvs(size=data_treated.shape[1]*num_examples_to_generate), 
                         shape=[num_examples_to_generate, data_treated.shape[1]])
test_labels = tf.constant([1]*(num_examples_to_generate//2) + [0]*(num_examples_to_generate//2), shape=[
        num_examples_to_generate,1])

predictions = generator([test_input, test_labels], training=False)
predictions = pd.DataFrame(predictions.numpy(), columns=data_treated.columns) 
predictions = predictions * 2*data_treated.values.std()

last_preds_labels = ['No Recurrence']*(len(predictions)//2) + ['Recurrence']*(len(predictions)//2)

Transform the 96 generated samples saved at each of the 20 time points during GAN training (stored in `results`) into DataFrames.

In [None]:
# Transform predictions into Pandas DataFrames
for i in range(len(saved_predictions)):
    # Skip entries that were already converted (e.g. when this cell is re-run)
    if not isinstance(saved_predictions[i], pd.DataFrame):
        saved_predictions[i] = pd.DataFrame(saved_predictions[i].numpy(),
                                            columns=human_datamatrix.columns) * 2*data_treated.values.std()

### Loss Plot, PCA and tSNE Representations of the Evolution of Generated Samples over Epochs

Metrics tracking the progression of the model during training.

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        steps_per_epoch = data_treated.shape[0] // batch_size
        f, ax = plt.subplots(1, 1, figsize=(4,4))
        ax.plot(range(1,len(gen_loss_all)+1), gen_loss_all, label='Generator Loss')
        ax.plot(range(1,len(crit_loss_all)+1), crit_loss_all, label='Critic Loss')
        ax.set_xticks(range(0, (epochs+1)*steps_per_epoch, 200*steps_per_epoch))
        ax.set_xticklabels(range(0, (epochs+1), 200))

        ax.legend(fontsize=12, loc='lower right')#, bbox_to_anchor=(1,1))
        ax.set_xlim([0*steps_per_epoch,(epochs)*steps_per_epoch])
        ax.set_ylim([-90,70])
        ax.set_xlabel('Nº of Epochs', fontsize=15)
        ax.set_ylabel('Loss', fontsize=15)
        ax.set_title('Loss Plot', fontsize=15)
        plt.tight_layout()
        f.savefig('images/PCDint_LossPlot.png' , dpi=400)

**PCA and tSNE of GAN generated data and the experimental data**

Progression with number of epochs.

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1):
        f, axs = plt.subplots(2,5, figsize=(16,8), constrained_layout=True)
        
        for i, ax in zip(range(0, len(saved_predictions),2), axs.ravel()):
            dfs_temp = pd.concat((human_datamatrix_treated, saved_predictions[i]))
            temp_lbls = hd_labels.copy()
            temp_lbls.extend(['No Recurrence - GAN']*(len(saved_predictions[-1])//2))
            temp_lbls.extend(['Recurrence - GAN']*(len(saved_predictions[-1])//2))
            
            principaldf = ma.compute_df_with_PCs(dfs_temp, n_components=2, whiten=True, labels=temp_lbls,
                                                 return_var_ratios=False)

            lcolors = label_colors_test
            
            gem.plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=ax)
            #gem.plot_ellipses_PCA(principaldf, lcolors, components=(1,2),ax=ax, q=0.95)
            ax.legend(loc='upper right', ncol=1, framealpha=1)
        

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1):
        f, axs = plt.subplots(2,5, figsize=(16,8), constrained_layout=True)
        
        for i, ax in zip(range(0, len(saved_predictions), 2), axs.ravel()):
            
            dfs_temp = pd.concat((human_datamatrix_treated, saved_predictions[i]))
            temp_lbls = hd_labels.copy()
            temp_lbls.extend(['No Recurrence - GAN']*(len(saved_predictions[-1])//2))
            temp_lbls.extend(['Recurrence - GAN']*(len(saved_predictions[-1])//2))
            
            X = dfs_temp.copy()
            X_embedded = TSNE(n_components=2, perplexity=30, learning_rate='auto',
                              init='random', verbose=0).fit_transform(X)

            df = X_embedded
            labels = temp_lbls
            lcolors = label_colors_test
            
            gem.plot_tSNE(df, labels, lcolors, components=(1,2), title='', ax=ax)
            gem.plot_ellipses_tSNE(df, labels, lcolors, components=(1,2),ax=ax, q=0.95)
            ax.legend(loc='upper right', ncol=1, framealpha=1)

#### PCA of GAN generated data and experimental (real) data

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1):
        f, ax = plt.subplots(1,1, figsize=(4,4), constrained_layout=True)
            
        p = saved_predictions[-1].copy()
        p.columns = human_datamatrix_treated.columns
        dfs_temp = pd.concat((human_datamatrix_treated, p))
        temp_lbls = hd_labels.copy()
        temp_lbls.extend(['No Recurrence - GAN']*(len(saved_predictions[-1])//2))
        temp_lbls.extend(['Recurrence - GAN']*(len(saved_predictions[-1])//2))

        principaldf = ma.compute_df_with_PCs(dfs_temp, n_components=2, whiten=True, labels=temp_lbls,
                                                 return_var_ratios=False)

        lcolors = label_colors_test

        gem.plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=ax)
        gem.plot_ellipses_PCA(principaldf, lcolors, components=(1,2),ax=ax, q=0.95)
        ax.set_ylabel('PC 2', fontsize=15)
        ax.set_xlabel('PC 1', fontsize=15)
        #ax.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=14)
        f.savefig('images/PCDint_PCAPlot.png' , dpi=400)

### Comparing GAN Generated Data Characteristics with other data

In [None]:
names = ['Experimental data', 'GAN data', 'CorrGAN data']
data_repo = [human_datamatrix_treated, predictions, corr_preds]
tgs = [hd_labels, last_preds_labels, corr_lbls]
data_characteristics = [gem.characterize_data(ds, name, tg) for ds,name,tg in zip(data_repo, names, tgs)]
data_characteristics = pd.DataFrame(data_characteristics).set_index('Dataset')
data_characteristics

In [None]:
f, (axl, axr) = plt.subplots(1,2, figsize=(8,4))

names = ['Experimental', 'GAN', 'CorrGAN']
axl.boxplot([ds.values.flatten() for ds in data_repo])
axl.set_ylabel('Distribution of scaled intensities', fontsize=15)
axl.set_xticklabels(names, fontsize=12)
#axl.set_yticks([-2, 0, 2, 4])

axr.boxplot([ds.sum(axis=1) for ds in data_repo])
axr.set_ylabel('Distribution of feature sum of intensities', fontsize=12)
axr.set_xticklabels(names, fontsize=12)

plt.tight_layout()
plt.show()
f.savefig('images/PCDint_characteristics.png' , dpi=400)

### Hierarchical Clustering

Hierarchical clustering of the 96 GAN samples together with the experimental data: first with all experimental samples, then with 96 randomly selected experimental samples.

In [None]:
# Hierarchical clustering of 96 GAN samples and all samples from real data
dt = human_datamatrix_treated
t_data = dt#.iloc[np.random.RandomState(seed=145).permutation(len(dt))][-96:]
dfs_temp = np.concatenate((t_data, saved_predictions[-1]))

test_labels = np.array(hd_labels)#[np.random.RandomState(seed=145).permutation(len(dt))][-96:]
temp_lbls = list(test_labels)

for i in ['No Recurrence','Recurrence']:
    temp_lbls.extend([i+' - GAN']*(n_generated_samples//2))

hca_results = gem.perform_HCA(dfs_temp, temp_lbls, metric='euclidean', method='ward')

In [None]:
# Hierarchical clustering of 96 GAN samples and 96 samples from real data
dt = human_datamatrix_treated.copy()
sp = saved_predictions[-1]
random_seed=145
all_data = dt.iloc[np.random.RandomState(seed=random_seed).permutation(len(dt))][:96]
test_data = all_data[:96]

dfs_temp = np.concatenate((test_data, sp.values))
test_labels = np.array(hd_labels)[
    np.random.RandomState(seed=random_seed).permutation(len(hd_labels))][:96]
temp_lbls = list(test_labels)

for i in ['No Recurrence','Recurrence']:
    temp_lbls.extend([i+' - GAN']*(96//2))

hca_results = gem.perform_HCA(dfs_temp, temp_lbls, metric='euclidean', method='ward')
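
The cell above draws the random subset of samples and the matching labels with two separate `RandomState` objects that share a seed. A small sketch of why this keeps data and labels aligned is given below, using toy arrays that are assumptions for illustration only.

```python
import numpy as np

seed = 145
n_samples = 10
# Toy data: row i starts with the value 3*i, so the original row can be recovered.
toy_data = np.arange(n_samples * 3).reshape(n_samples, 3)
toy_labels = np.array(['A' if i % 2 == 0 else 'B' for i in range(n_samples)])

# Two generators created with the same seed produce the same permutation.
perm_for_data = np.random.RandomState(seed=seed).permutation(n_samples)
perm_for_labels = np.random.RandomState(seed=seed).permutation(n_samples)

# Taking the same leading slice of both keeps every row paired with its label.
subset_data = toy_data[perm_for_data][:4]
subset_labels = toy_labels[perm_for_labels][:4]
```

The alignment relies on the data and the label list having the same length, since the permutation is drawn from the length passed to `permutation`.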

In [None]:
f, ax = plt.subplots(figsize=(3, 12))
gem.plot_dendogram(hca_results['Z'], 
               temp_lbls, ax=ax,
               label_colors=label_colors_test, title='',
               color_threshold=0)
ax.set_yticklabels([])
ax.set_xticks([])
plt.show()
f.savefig('images/PCD_HCAPlot.png' , dpi=400)

### Coverage and Density

In [None]:
training_set = data_treated

density_list, coverage_list = gem.evaluation_coverage_density_all_k_at_once(training_set, 
                                                                            predictions, 
                                                                            metric='Euclidean')

density_list_test, coverage_list_test = gem.evaluation_coverage_density_all_k_at_once(training_set,
                                                                                      human_datamatrix_treated, 
                                                                                      metric='Euclidean')

density_list_real, coverage_list_real = gem.evaluation_coverage_density_all_k_at_once(human_datamatrix_treated, 
                                                                                      predictions,
                                                                                      metric='Euclidean')

density_list_corr, coverage_list_corr = gem.evaluation_coverage_density_all_k_at_once(training_set, 
                                                                                      corr_preds,
                                                                                      metric='Euclidean')

density_list_linaug_real, coverage_list_linaug_real = gem.evaluation_coverage_density_all_k_at_once(
                                                                                      human_datamatrix_treated, 
                                                                                      training_set,
                                                                                      metric='Euclidean')
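
For orientation, density and coverage for a single number of neighbours `k` can be sketched as below, following the usual definitions of these metrics (Naeem et al., 2020). This is a minimal NumPy illustration under that assumption; the function names are hypothetical, and `gem.evaluation_coverage_density_all_k_at_once` may differ in its details, for instance by evaluating all values of `k` in one pass.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Euclidean distances between rows of `a` (n_a, d) and rows of `b` (n_b, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def density_coverage(real, fake, k=5):
    """Density and coverage of `fake` with respect to `real` for a single k."""
    real = np.asarray(real, dtype=float)
    fake = np.asarray(fake, dtype=float)
    # Radius of each real sample's ball: distance to its k-th nearest real neighbour.
    # Column 0 of the sorted distances is the sample itself (distance 0).
    real_real = np.sort(pairwise_euclidean(real, real), axis=1)
    radii = real_real[:, k]
    real_fake = pairwise_euclidean(real, fake)   # shape (n_real, n_fake)
    inside = real_fake < radii[:, None]          # fake j lies inside the ball of real i
    # Density: average number of real balls containing a fake sample, divided by k.
    density = inside.sum(axis=0).mean() / k
    # Coverage: fraction of real samples whose ball contains at least one fake sample.
    coverage = (real_fake.min(axis=1) < radii).mean()
    return density, coverage

rng = np.random.default_rng(3)
toy_real = rng.normal(size=(30, 4))
same_density, same_coverage = density_coverage(toy_real, toy_real, k=3)
far_density, far_coverage = density_coverage(toy_real, toy_real + 1000.0, k=3)
```

Density can exceed 1 when generated samples concentrate in regions where real samples are dense, while coverage is bounded between 0 and 1. The broadcasting in `pairwise_euclidean` builds an array of size `n_a * n_b * d`, so it suits small examples better than full datasets.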

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, (axl, axr) = plt.subplots(1, 2, figsize = (8,4))#, sharey='row')#, sharex='col')
        
        axl.plot(range(1,len(training_set)), density_list, label='Density', color='blue')
        axl.plot(range(1,len(training_set)), coverage_list, label='Coverage', color='red')
        axl.set_title('GAN - Interpolated Data', fontsize=15)
        axl.set_ylabel('Density / Coverage', fontsize=15)
        axl.set_ylim([0,5])
        axl.set_xlabel('Nº of Nearest Neighbors', fontsize=15)

        axr.plot(range(1,len(human_datamatrix_treated)), density_list_real, label='Density', color='blue')
        axr.plot(range(1,len(human_datamatrix_treated)), coverage_list_real, label='Coverage', color='red')
        axr.set_title('GAN - Experimental Data', fontsize=15)
        axr.set_ylim([0,5])
        
        axr.legend()
        axl.legend()
        axr.set_xlabel('Nº of Nearest Neighbors', fontsize=15)

    plt.tight_layout()
    f.savefig('images/PCDint_DenCovPlot.png' , dpi=400)

### Histograms
 
Histograms of the intensity values of the experimental (real), linearly interpolated, GAN generated and CorrGAN data.

In [None]:
# Predictions
last_preds = predictions
dt = human_datamatrix_treated

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize = (6,4))#, sharey='row')#, sharex='col')
        X = np.arange(-8, 8.01, 0.05)
        hist = np.histogram(human_datamatrix_treated.values.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='blue', label='Experimental')
        
        hist = np.histogram(data_treated.values.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='orange', label='Interpolated')
        ax.set_ylabel('PDF', fontsize=15)
        
        hist = np.histogram(last_preds.values.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='red', label='GAN')
        
        hist = np.histogram(corr_preds.values.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='green', label='CorrGAN')
        
        ax.set_ylabel('PDF', fontsize=15)
        ax.set_xlabel('Scaled Intensities', fontsize=15)
        #ax.set_xticks([-8, -4, 0, 4, 8])
        ax.set_ylim([0,0.4])

        ax.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=12, handlelength=1)
        ax.set_title('Intensities Distribution', fontsize=18)
        plt.tight_layout()
        f.savefig('images/PCDint_IntPlot.png', dpi=400)
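
Each curve in the plot above is the piecewise-constant density of a 50-bin histogram evaluated on a regular grid. A NumPy-only sketch of that evaluation is shown below; `histogram_pdf` is a hypothetical helper written for illustration, intended to approximate what `stats.rv_histogram(hist).pdf(X)` returns, and the toy values are random numbers rather than notebook data.

```python
import numpy as np

def histogram_pdf(values, grid, bins=50):
    """Evaluate the piecewise-constant density of a histogram of `values` on `grid`."""
    heights, edges = np.histogram(values, bins=bins, density=True)
    # Index of the bin containing each grid point; -1 or `bins` means outside the range.
    idx = np.searchsorted(edges, grid, side='right') - 1
    inside = (idx >= 0) & (idx < bins)
    pdf = np.zeros_like(grid, dtype=float)
    pdf[inside] = heights[idx[inside]]
    return pdf, heights, edges

rng = np.random.default_rng(1)
toy_values = rng.normal(size=5000)
grid = np.arange(-8, 8.01, 0.05)
toy_pdf, toy_heights, toy_edges = histogram_pdf(toy_values, grid)

# The bin heights times the bin widths sum to 1, as expected for a density.
area = (toy_heights * np.diff(toy_edges)).sum()
```

Grid points outside the observed range of the data get a density of 0, which is why the curves are flat at the edges of the plotted interval.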

### Correlations between samples of Experimental (Real) Data, Linearly Interpolated and GAN Generated Data

In [None]:
correlation_real_real = np.corrcoef(human_datamatrix_treated)
print('Correlation Experimental-Experimental calculation ended.')

correlation_lin_lin = np.corrcoef(data_treated)
print('Correlation Linear-Linear calculation ended.')

correlation_gan_gan = np.corrcoef(last_preds)
print('Correlation GAN-GAN calculation ended.')

correlation_corr_corr = np.corrcoef(corr_preds)
print('Correlation CorrGAN-CorrGAN calculation ended.')
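
`np.corrcoef` treats each row of its input as one variable, so passing a samples-by-features matrix gives sample correlations, and passing its transpose gives feature correlations. The sketch below illustrates this on a random toy matrix (an assumption for illustration). It also shows an optional variation that the notebook does not use: keeping only the upper triangle, so that the diagonal of ones and the duplicated pairs are left out of a histogram.

```python
import numpy as np

rng = np.random.default_rng(2)
toy_matrix = rng.normal(size=(5, 8))      # 5 samples (rows) x 8 features (columns)

sample_corr = np.corrcoef(toy_matrix)     # rows are the variables: 5 x 5
feature_corr = np.corrcoef(toy_matrix.T)  # transposed, features are the variables: 8 x 8

# Optional: keep each unordered pair once and drop the diagonal of ones.
upper = np.triu_indices_from(sample_corr, k=1)
unique_sample_corr = sample_corr[upper]
```

A feature that is constant across samples has zero variance and yields NaN correlations, which is why NaN values are filtered out before building the feature correlation histograms.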

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize = (5,4))#, sharey='row')#, sharex='col')
        X = np.arange(-1, 1.01, 0.05)
        hist = np.histogram(correlation_real_real.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='blue', label='Experimental')
        ax.set_ylabel('PDF', fontsize=15)
        
        hist = np.histogram(correlation_lin_lin.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='orange', label='Interpolated')
        
        hist = np.histogram(correlation_gan_gan.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='red', label='GAN')
        
        hist = np.histogram(correlation_corr_corr.flatten(), bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='green', label='CorrGAN')
        
        ax.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=12, handlelength=0.9, handletextpad=0.5)
        ax.set_title('Sample Correlations', fontsize=18)
        ax.set_xlabel('Correlation', fontsize=15)
        #ax.set_ylim([0,3.6])

    #f.text(0.5, 0.05, 'Correlation', ha='center', va='top', fontsize=15)
    plt.tight_layout()
    f.savefig('images/PCDint_SampCorrPlot.png', dpi=400)

### Correlations between features of Experimental (Real), Linearly Interpolated and GAN Generated Samples

In [None]:
correlation_real_real = np.corrcoef(dt.T)
print('Correlation Experimental-Experimental calculation ended.')

correlation_gan_gan = np.corrcoef(last_preds.T)
print('Correlation GAN-GAN calculation ended.')

correlation_corr_corr = np.corrcoef(corr_preds.T)
print('Correlation CorrGAN-CorrGAN calculation ended.')

correlation_gen_gen = np.corrcoef(data_treated.T)
print('Correlation Generated-Generated calculation ended.')

In [None]:
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.2):
        f, ax = plt.subplots(1, 1, figsize = (5,4))#, sharey='row')#, sharex='col')
        X = np.arange(-1, 1.01, 0.05)
        hist = np.histogram(correlation_real_real.flatten()[~np.isnan(correlation_real_real.flatten())],
                            bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='blue', label='Experimental')
        ax.set_ylabel('PDF', fontsize=15)
        
        hist = np.histogram(correlation_gen_gen.flatten()[~np.isnan(correlation_gen_gen.flatten())], bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='orange', label='Interpolated')
        
        hist = np.histogram(correlation_gan_gan.flatten()[~np.isnan(correlation_gan_gan.flatten())], bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='red', label='GAN')
        
        hist = np.histogram(correlation_corr_corr.flatten()[~np.isnan(correlation_corr_corr.flatten())],
                            bins=50)
        hist_dist = stats.rv_histogram(hist)
        ax.plot(X, hist_dist.pdf(X), color='green', label='CorrGAN')
        
        
        ax.legend(bbox_to_anchor=(1,1), fontsize=11, loc='upper left', handlelength=0.9, handletextpad=0.5)
        ax.set_title('Feature Correlations', fontsize=18)
        ax.set_xlabel('Correlation', fontsize=15)
        ax.set_ylim([0,3.3])

    #f.text(0.5, 0.05, 'Correlation', ha='center', va='top', fontsize=15)
    plt.tight_layout()
    f.savefig('images/PCDint_FeatCorrPlot.png', dpi=400)

### Sample Correlation Matrix

Pearson correlations between the samples of the experimental (real) data and a set of GAN generated samples.

In [None]:
# Experimental (Real) Data, organize it to have first all samples of a class, then all samples of the other class
df = human_datamatrix_treated.copy()

samp = df.index
tg = hd_labels.copy()
new_order = [x for _, x in sorted(zip(tg, samp))]
new_tg = [x for x, _ in sorted(zip(tg, samp))]

df = df.loc[new_order]
#df
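
The reordering above relies on how Python sorts tuples: `sorted(zip(tg, samp))` compares the label first and the sample name second. A toy sketch with made-up sample names (assumptions for illustration) shows the resulting order.

```python
toy_labels = ['Recurrence', 'No Recurrence', 'Recurrence', 'No Recurrence']
toy_samples = ['s3', 's1', 's2', 's4']

# Pairs are compared by label first, then by sample name to break ties.
pairs = sorted(zip(toy_labels, toy_samples))
ordered_samples = [sample for _, sample in pairs]
ordered_labels_toy = [label for label, _ in pairs]
```

All 'No Recurrence' samples come first because 'N' sorts before 'R', and within each class the samples are ordered by name.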

In [None]:
# Calculate all correlations between all samples of experimental (real) and generated data and store them in a dataframe
correlations = pd.DataFrame(index=last_preds.index, columns=df.index).astype('float')

for i in df.index:
    for j in last_preds.index:
        correlations.loc[j,i] = stats.pearsonr(df.loc[i],
                                               last_preds.loc[j])[0]

correlations.columns = new_tg
correlations.index = ['No Recurrence - GAN']*(len(last_preds)//2) + ['Recurrence - GAN']*(len(last_preds)//2)

In [None]:
# Draw the clustermap

row_cols = [label_colors_test[lbl] for lbl in new_tg]
row_cols2 = [label_colors_test[lbl] for lbl in correlations.index]
co = sns.diverging_palette(160, 310, s=90, as_cmap=True, l=40)
g = sns.clustermap(correlations, col_colors=row_cols, cmap=co, row_colors= row_cols2, vmin=-1, vmax=1, method='ward',
                  cbar_pos = (-0.15, 0.2, 0.05, 0.5))
g.fig.set_size_inches((6,9))
# some tweaks
patches = []
for lbl in ordered_labels_test:
    patches.append(mpatches.Patch(color=label_colors_test[lbl], label=lbl))
leg = plt.legend(handles=patches, loc=3, bbox_to_anchor=(-1.8, 1.25, 0.5, 1),
                     frameon=False, fontsize=14) 
g.ax_heatmap.set_ylabel('GAN Data', fontsize=20)
g.ax_heatmap.set_xlabel('Experimental Data', fontsize=20)
g.ax_heatmap.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, right=False,
                         labelright=False)

# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=13) 
plt.text(1, 1.10, 'Correlations', fontsize=14, horizontalalignment='center')
plt.show()
g.savefig('images/PCDint_Clustermap.png', dpi=400)

In [None]:
a = scipy.spatial.distance_matrix(last_preds, df)
sample_distances = pd.DataFrame(a, index=last_preds.index, columns=df.index).astype('float')
sample_distances.index = ['No Recurrence - GAN']*(len(last_preds)//2) + ['Recurrence - GAN']*(len(last_preds)//2)
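
`scipy.spatial.distance_matrix` returns the Euclidean distance (its default, `p=2`) between every row of the first argument and every row of the second. An equivalent NumPy sketch is given below on a tiny example; the function name `euclidean_distance_matrix` and the toy points are assumptions for illustration.

```python
import numpy as np

def euclidean_distance_matrix(a, b):
    """Euclidean distances between rows of `a` (n_a, d) and rows of `b` (n_b, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed for all pairs at once.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives caused by rounding

toy_a = np.array([[0.0, 0.0], [3.0, 4.0]])
toy_b = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])
toy_dist = euclidean_distance_matrix(toy_a, toy_b)
```

The result has one row per sample in the first argument and one column per sample in the second, which matches the `index` and `columns` used for `sample_distances`. Distances of exactly 0 cannot be represented on a logarithmic colour scale such as `LogNorm`, which requires strictly positive values.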

In [None]:
sample_distances.max()

In [None]:
sample_distances.min()

In [None]:
# Draw the clustermap

row_cols = [label_colors_test[lbl] for lbl in new_tg]
row_cols2 = [label_colors_test[lbl] for lbl in sample_distances.index]
g = sns.clustermap(sample_distances, col_colors=row_cols, cmap='bwr', row_colors= row_cols2,
                   norm=matplotlib.colors.LogNorm(vmin=225, vmax=750),
                   #vmin=0, vmax=10**3,
                   method='ward',
                  cbar_pos = (-0.13, 0.1, 0.05, 0.5))
g.fig.set_size_inches((6,9))
# some tweaks
patches = []
for lbl in ordered_labels_test:
    patches.append(mpatches.Patch(color=label_colors_test[lbl], label=lbl))
leg = plt.legend(handles=patches, loc=3, bbox_to_anchor=(-2.2, 1.45, 0.5, 1),
                     frameon=False, fontsize=14) 
g.ax_heatmap.set_ylabel('GAN Data', fontsize=20)
g.ax_heatmap.set_xlabel('Experimental Data', fontsize=20)
g.ax_heatmap.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, right=False,
                         labelright=False)

# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=13) 
plt.text(1, 802.10, 'Distances', fontsize=14, horizontalalignment='center')
plt.show()

In [None]:
# Experimental (Real) Data, organize it to have first all samples of a class, then all samples of the other class
df = human_datamatrix_treated.copy()

samp = df.index
tg = hd_labels.copy()
new_order = [x for _, x in sorted(zip(tg, samp))]
new_tg = [x for x, _ in sorted(zip(tg, samp))]

df = df.loc[new_order]
a = scipy.spatial.distance_matrix(df, df)
sample_distances = pd.DataFrame(a, index=df.index, columns=df.index).astype('float')

In [None]:
sample_distances.max()

In [None]:
# Draw the clustermap

row_cols = [label_colors_test[lbl] for lbl in new_tg]
row_cols2 = [label_colors_test[lbl] for lbl in new_tg]
g = sns.clustermap(sample_distances, col_colors=row_cols, cmap='bwr', row_colors= row_cols2,
                   #norm=matplotlib.colors.LogNorm(vmin=225, vmax=550),
                   vmin=225, vmax=550,
                   method='ward',
                  cbar_pos = (-0.13, 0.1, 0.05, 0.5))
g.fig.set_size_inches((6,9))
# some tweaks
patches = []
for lbl in ordered_labels_test:
    patches.append(mpatches.Patch(color=label_colors_test[lbl], label=lbl))
leg = plt.legend(handles=patches, loc=3, bbox_to_anchor=(-2.2, 1.45, 0.5, 1),
                     frameon=False, fontsize=14) 
g.ax_heatmap.set_ylabel('Experimental Data', fontsize=20)  # both axes hold experimental samples here
g.ax_heatmap.set_xlabel('Experimental Data', fontsize=20)
g.ax_heatmap.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, right=False,
                         labelright=False)

# Manually specify colorbar labelling after it's been generated
colorbar = g.ax_heatmap.collections[0].colorbar
colorbar.ax.tick_params(labelsize=13) 
plt.text(1, 672.10, 'Distances', fontsize=14, horizontalalignment='center')
plt.show()