# Data Augmentation - Conditional Wasserstein GANs - GP

### Dataset: Trypanosomiasis Dataset

This notebook uses a CWGAN-GP model to generate pre-treated intensity data and tests whether GAN-based data augmentation mitigates class imbalance in supervised analysis by supplementing the minority class with generated samples.

Notebook Organization:
- Read the dataset
- Treatment and Univariate Analysis of the dataset
- Split the dataset into 2 folds, each with a training and a test set (details in that section).
- Set up the CWGAN-GP model and train 2 models, one for each fold, with the corresponding training data.
- Generate GAN samples and add them to the corresponding imbalanced training sets in small increments.
- Build and evaluate the performance of RF and PLS-DA models from the imbalanced dataset, the imbalanced dataset supplemented with generated minority class samples, and a dataset of purely GAN-generated samples.
- Compare important features in the RF and PLS-DA models against important features of models built with the complete dataset.

#### Due to stochasticity, re-running the notebook will give slightly different results, so the figures may differ slightly from those in the paper.


#### Needed Imports

In [None]:
import json
from time import perf_counter

import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
from matplotlib import ticker

import seaborn as sns
from tqdm import tqdm
from IPython import display as ipythondisplay

from sklearn.metrics import precision_recall_fscore_support, f1_score

import tensorflow as tf
from keras import backend

import pickle

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
import gan_evaluation_metrics as gem
import linear_augmentation_functions as laf

In [None]:
# Import needed functions from GAN_functions
from GAN_functions import gradient_penalty_cwgan
from GAN_functions import critic_loss_wgan
from GAN_functions import generator_loss_wgan

In [None]:
%matplotlib inline

### Description of Trypanosomiasis dataset

Trypanosomiasis Dataset - HILIC-Orbitrap metabolomics data from 60 human cerebrospinal fluid samples belonging to 3 classes (20 samples per class): healthy controls, stage 1 and stage 2 of Human African Trypanosomiasis.

Data acquired by Vincent et al. (2016) available at the MetaboLights repository:

- Vincent, I. M. et al. Metabolomics Identifies Multiple Candidate Biomarkers to Diagnose and Stage Human African Trypanosomiasis. PLoS Negl. Trop. Dis. 10, e0005140 (2016). https://doi.org/10.1371/journal.pntd.0005140
- https://www.ebi.ac.uk/metabolights/MTBLS413/descriptors - MetaboLights Repository

Data obtained from MetaboLights ID MTBLS413.

**Peak Alignment** and **Peak Filtering** were performed using mzMatch and IDEOM according to the authors' indications. The cerebrospinal fluid (CSF) .tsv files were obtained from the repository in both positive and negative ionization modes (already filtered data). These files were joined together, eliminating all repeated compounds except the 'repeating unknown' compounds.

`TD` is the name given to the dataset obtained by retaining only features that occur at least twice across all 60 samples of the dataset.
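
The occurrence filter described above can be illustrated with a minimal NumPy sketch. It is not the `metabolinks` implementation used later in the notebook; the function name `keep_features_min_occurrence` and the toy matrix are hypothetical, and the sketch assumes that rows are samples, columns are features, and that a missing value (NaN) or a zero means the feature was not observed.

```python
import numpy as np

def keep_features_min_occurrence(intensities, min_samples=2):
    """Keep feature columns observed in at least `min_samples` samples.

    Assumption: a value counts as observed when it is neither NaN nor zero.
    """
    intensities = np.asarray(intensities, dtype=float)
    present = ~np.isnan(intensities) & (intensities != 0)
    keep = present.sum(axis=0) >= min_samples
    return intensities[:, keep], keep

# Toy matrix: 4 samples (rows) x 3 features (columns)
toy = np.array([[1.0, np.nan, 5.0],
                [2.0, np.nan, np.nan],
                [np.nan, 3.0, np.nan],
                [4.0, np.nan, np.nan]])

filtered, kept_mask = keep_features_min_occurrence(toy, min_samples=2)
```

Only the first toy feature is observed in two or more samples, so it is the only one retained.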

### Reading File and Data Analysis

In [None]:
# Reading files already with non-repeating compounds
base_file = pd.read_excel('modes_together_norepeats.xlsx')
base_file = base_file.set_index('Unnamed: 0')
base_file

In [None]:
# Read the file that still contains the repeated peaks, to add the 'repeating unknowns' to the previous dataset
base_file2 = pd.read_excel('tryp_modes_together.xlsx')
base_file2 = base_file2.set_index('database_identifier')
base_file2 = base_file2.replace({np.nan:0})
# Subtract the first column from every other column and set negative differences to zero
df2 = base_file2.iloc[:, 1:].sub(base_file2.iloc[:, 0], axis=0).clip(lower=0)

In [None]:
# Build the final DataFrame by concatenating the needed parts of the two DataFrames
df = pd.concat((base_file.iloc[1:], df2.iloc[-3:])).T
# Extract target list from the first DataFrame as well
target = list(base_file.iloc[0].values)

**Stage 1 and Stage 2 of disease samples were joined together making a 'disease' class label**

In [None]:
labels = []
for i in target:
    if i.startswith('stage'):
        labels.append('disease')
    else:
        labels.append(i)
target=labels

### Data Pre-Treatments

In [None]:
def basic_feat_filtering(file, target=None, filt_method='total_samples', filt_kw=2,
                  extra_filt=None, extra_filt_kw='Formula', extra_filt_data=None):
    "Performs feature filtering in 2 steps."
    
    # Filtering based on the samples each feature appears in
    if filt_method == 'total_samples': # Filter features based on the times (filt_kw) they appear in the dataset
        # Minimum can be a fraction if it is a value between 0 and 1!
        data_filt = transf.keep_atleast(file, minimum=filt_kw)
    elif filt_method == 'class_samples': # Features retained if they appear filt_kw times in the samples of at least one class
        # Minimum can be a fraction if it is a value between 0 and 1!
        data_filt = transf.keep_atleast(file, minimum=filt_kw, y=np.array(target))
    elif filt_method is None: # No filtering
        data_filt = file.copy()
    else:
        raise ValueError('Feature Filtering strategy not accepted/implemented in function. Implement if new strategy.')

    # Extra filtering based on whether the features are annotated
    if extra_filt == 'Formula': # Keep only features that have a formula in the dataset
        idxs_to_keep = [i for i in data_filt.index if isinstance(extra_filt_data.loc[i, 'Formula'], str)]
        data_filt = data_filt.loc[idxs_to_keep]
    elif extra_filt == 'Name': # Keep only features that have an annotated name in the dataset
        idxs_to_keep = [i for i in data_filt.index if isinstance(extra_filt_data.loc[i, 'Name'], str)]
        data_filt = data_filt.loc[idxs_to_keep]
    elif extra_filt is None: # No extra filtering
        data_filt = data_filt.copy()
    else:
        raise ValueError('Feature Filtering strategy not accepted/implemented in function. Implement if new strategy.')
    
    return data_filt

In [None]:
# Give distinct feature names to the 'unknown' features
df.columns = list(df.columns[:-3]) + ['unknown1'] + ['unknown2'] + ['unknown3']
df = df.replace({0:np.nan})

In [None]:
filt_sample_data = basic_feat_filtering(df, target, filt_method='total_samples', filt_kw=2).T
imputed = transf.fillna_frac_min_feature(filt_sample_data, fraction=0.2).T
real_samples_all = transf.pareto_scale(transf.glog(transf.normalize_sum(imputed), lamb=None))
real_samples_all.head(10)

In [None]:
target = labels
classes = pd.unique(target)
classes

#### Univariate Analysis

In [None]:
uni_results = ma.compute_FC_pvalues_2groups(transf.normalize_sum(imputed), real_samples_all,
                               labels=target,
                               equal_var=True,
                               alpha=0.05, useMW=False)

In [None]:
pvalue = 0.1
log2FC = 1

b = uni_results[uni_results['FDR adjusted p-value'] < pvalue]
uni_results_filt = b[abs(b['log2FC']) > log2FC]
uni_results_filt
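
The cell above keeps features with an FDR-adjusted p-value below 0.1 and an absolute log2 fold change above 1. The notebook does not show which correction `ma.compute_FC_pvalues_2groups` applies, so the sketch below assumes the Benjamini-Hochberg procedure; the function name `benjamini_hochberg` and the toy values are hypothetical and only illustrate the selection rule.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed correction, see lead-in)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Toy p-values and log2 fold changes for 4 features
pvals = np.array([0.01, 0.04, 0.03, 0.20])
log2fc = np.array([1.5, -0.4, -2.0, 3.0])

adj = benjamini_hochberg(pvals)
# Same thresholds as in the cell above: adjusted p < 0.1 and |log2FC| > 1
selected = (adj < 0.1) & (np.abs(log2fc) > 1)
```

With these toy values, only the first and third features pass both thresholds.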

### Fitting RF and PLS-DA models to the complete dataset

Results and feature importance are estimated by stratified `n_fold`-fold cross-validation, averaged over 20 iterations.

In [None]:
np.random.seed(485)
n_fold = 5

RF_model_real = ma.RF_model_CV(real_samples_all, target,
                               iter_num=20, n_fold=n_fold, n_trees=200) 
RF_feats_real = pd.DataFrame(RF_model_real['important_features']).set_index(0).sort_values(by=1, ascending=False)
RF_feats_real.index = [real_samples_all.columns[i] for i in RF_feats_real.index]
#RF_feats_real 

In [None]:
np.random.seed(485)
n_fold = 5
n_comp = 6 # Nº of Components

PLSDA_model_real = ma.PLSDA_model_CV(real_samples_all, target,
                                     n_comp=n_comp, iter_num=20, n_fold=n_fold, feat_type='VIP') 

PLSDA_feats_real = pd.DataFrame(PLSDA_model_real['important_features']).set_index(0).sort_values(by=1, ascending=False)
PLSDA_feats_real.index = [real_samples_all.columns[i] for i in PLSDA_feats_real.index]

## Creating Imbalanced Datasets

This dataset has two imbalanced classes with few samples: 40 disease and 20 control samples. Thus, only control was considered a minority class.

We split the dataset in 2 different ways, each with 30 disease samples and 10 control samples in the training set. This left 10 samples of each class for the test sets.

Both sets were then pre-treated independently with the same pipeline, except that the test set received a 'faux' Pareto scaling based on the feature means and standard deviations of the training set. The training and test sets have (in most cases) a vastly different balance of class samples, so feature means and standard deviations can differ considerably between them, especially for features that are key for discrimination. The 'faux' Pareto scaling compensates for this. The untreated training sets were linearly augmented, and the augmented data were then pre-treated (using a normal Pareto scaling, in this case) to generate the samples used to train the CWGAN-GP models.
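
The 'faux' Pareto scaling can be sketched with plain NumPy as follows. The helper names `pareto_fit` and `pareto_apply` and the random toy data are hypothetical; the sketch assumes the sample standard deviation (`ddof=1`, the pandas default) and, as in Pareto scaling, divides by the square root of the standard deviation.

```python
import numpy as np

def pareto_fit(train):
    """Per-feature mean and sqrt(standard deviation) of the training set."""
    mean = train.mean(axis=0)
    scale = np.sqrt(train.std(axis=0, ddof=1))
    return mean, scale

def pareto_apply(data, mean, scale):
    """Center and scale `data` with statistics estimated elsewhere."""
    return (data - mean) / scale

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(40, 3))  # toy training set
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # toy test set

mean, scale = pareto_fit(train)
train_scaled = pareto_apply(train, mean, scale)  # normal Pareto scaling
test_scaled = pareto_apply(test, mean, scale)    # 'faux' scaling: training statistics
```

Reusing the training statistics keeps the test set on the same scale as the data the classifiers were fitted on, whatever the class balance of the test set.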

In [None]:
df_storage_train = {}
df_storage_test = {}
lbl_storage_train = {}
lbl_storage_test = {}
real_samples = {}

rng = np.random.default_rng(7519)

# Select the samples which will be in the imbalanced and in the test set
permutations = {}
for cl in classes:
    permutations[cl] = list(rng.permutation(np.where(np.array(target) == cl)[0]))
perm_classes = list(rng.permutation(classes))

for i in range(2):
    train_idxs = {'control':[], 'disease':[]}
    test_idxs = {'control':[], 'disease':[]}
    for cl in perm_classes:
        if cl == 'control':
            train_idxs[cl].extend(permutations[cl][i*10:(i+1)*10])
            test_idxs[cl].extend(list(np.array(permutations[cl][:i*10] + permutations[cl][(i+1)*10:])))
        else:
            train_idxs[cl].extend(permutations[cl][i*10:(i+3)*10])
            test_idxs[cl].extend(list(np.array(permutations[cl][:i*10] + permutations[cl][(i+3)*10:])))

    print('Fold nº:', i+1)
    print('Train control/disease:', len(train_idxs['control']), len(train_idxs['disease']))
    print('Test control/disease: ', len(test_idxs['control']), len(test_idxs['disease']))

    train_idxs = train_idxs['control'] + train_idxs['disease']
    test_idxs = test_idxs['control'] + test_idxs['disease']

    df_imputed = imputed.copy()

    # Create the imbalanced and test set
    df_storage_train[i+1] = df_imputed.iloc[train_idxs]
    lbl_storage_train[i+1] = list(np.array(labels)[train_idxs])

    df_storage_test[i+1] = df_imputed.iloc[test_idxs]
    lbl_storage_test[i+1] = list(np.array(labels)[test_idxs])

    # Data pretreatment of the imbalanced and test dataset
    df_G = transf.glog(transf.normalize_sum(df_storage_train[i+1]), lamb=None)
    real_samples[i+1] = transf.pareto_scale(df_G)

    test_datamatrix_G = transf.glog(transf.normalize_sum(df_storage_test[i+1]), lamb=None)
    df_storage_test[i+1] = (test_datamatrix_G - df_G.mean())/np.sqrt(df_G.std()) # 'Faux' Pareto Scale

Linear Augmentation of the Training Sets and corresponding data pre-treatment.
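
The code of `laf.artificial_dataset_generator` is not shown in this notebook. As a rough illustration of linear augmentation, the sketch below assumes that new samples are placed on the line segment between two real samples of the same class, at a fraction drawn from a fixed list (0.1 to 0.9, as in the `rnd` argument used below). The function `interpolate_pairs` is hypothetical and may differ from the repository implementation.

```python
import numpy as np

def interpolate_pairs(samples, fractions, n_new, seed=0):
    """Create n_new synthetic rows between random pairs of rows of `samples`."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    new = np.empty((n_new, samples.shape[1]))
    for k in range(n_new):
        # Pick two distinct samples and an interpolation fraction
        i, j = rng.choice(len(samples), size=2, replace=False)
        r = rng.choice(fractions)
        new[k] = samples[i] + r * (samples[j] - samples[i])
    return new

# Toy data: 3 samples of one class with 2 features
control = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
synthetic = interpolate_pairs(control, np.linspace(0.1, 0.9, 9), n_new=8)
```

Every synthetic sample is a convex combination of two real samples, so it stays within the range spanned by the real data of that class.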

In [None]:
aug_df_storage_train = {}
aug_lbl_storage_train = {}
# Only generation of samples based on the imbalanced dataset
for i in range(1,3):#df_storage_train.keys():
    start = perf_counter()
    data, lbls = laf.artificial_dataset_generator(df_storage_train[i], labels=lbl_storage_train[i],
                                        max_new_samples_per_label=512, binary=False, 
                                        rnd=list(np.linspace(0.1,0.9,9)), 
                                        binary_rnd_state=None, rnd_state=3577265)

    #data_N = transf.normalize_sum(data)
    data_treated = transf.pareto_scale(transf.glog(transf.normalize_sum(data), lamb=None))

    aug_df_storage_train[i] = data_treated.copy()
    aug_lbl_storage_train[i] = lbls

    end = perf_counter()
    print(f'Simple augmentation of data done! Fold: {i}. Took {(end - start):.3f} s')

Set up colours for each of the classes. Generated samples take the corresponding class label followed by ' - GAN'.

In [None]:
# Colors to use in plots
colours2 = sns.color_palette('tab20', 4)#[:6]

ordered_labels_test = []
for i in classes:
    ordered_labels_test.extend([i, i + ' - GAN'])
label_colors_test = {lbl: c for lbl, c in zip(ordered_labels_test, colours2)}

sns.palplot(label_colors_test.values())
new_ticks_test = plt.xticks(range(len(ordered_labels_test)), ordered_labels_test)

## Conditional Wasserstein GAN - GP model

This model was built by combining WGAN-GP models with Conditional GAN models. The WGAN-GP part was based on https://keras.io/examples/generative/wgan_gp/#wasserstein-gan-wgan-with-gradient-penalty-gp. The Conditional GAN part was based on https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/ (generator and discriminator models) and on https://keras.io/examples/generative/conditional_gan/ (loss functions and training steps), without using OOP.
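
The loss functions imported from `GAN_functions` are not shown in this notebook. The sketch below assumes that they follow the usual WGAN-GP formulation and writes it in plain NumPy for readability. The function names, the toy linear critic and the `critic_grad` callable are hypothetical; the class labels that the conditional critic also receives are left out, and the interpolation weights are assumed to be drawn uniformly.

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: real samples should score higher than generated ones."""
    return np.mean(fake_scores) - np.mean(real_scores)

def generator_loss(fake_scores):
    """Generator loss: raise the critic score of generated samples."""
    return -np.mean(fake_scores)

def gradient_penalty(real, fake, critic_grad, rng):
    """Penalize critic gradient norms that differ from 1 at random interpolates.

    critic_grad(x) must return the gradient of the critic output with respect
    to each row of x (here it is supplied analytically instead of by autodiff).
    """
    eps = rng.uniform(size=(len(real), 1))
    interpolated = eps * real + (1.0 - eps) * fake
    grads = critic_grad(interpolated)
    norms = np.sqrt((grads ** 2).sum(axis=1))
    return np.mean((norms - 1.0) ** 2)

# Toy linear critic f(x) = x @ w, whose gradient with respect to x is w everywhere
w = np.array([0.6, 0.8])  # unit norm, so the penalty should vanish
linear_grad = lambda x: np.tile(w, (len(x), 1))

rng = np.random.default_rng(1)
real = rng.normal(size=(4, 2))
fake = rng.normal(size=(4, 2))

gp = gradient_penalty(real, fake, linear_grad, rng)
grad_pen_weight = 5
total_critic_loss = critic_loss(real @ w, fake @ w) + grad_pen_weight * gp
```

In the actual training loop the gradient is obtained with `tf.GradientTape`, and the penalty is weighted by `grad_pen_weight` before being added to the critic loss.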

Functions for the generator and critic (discriminator) models

In [None]:
def generator_model(len_input, len_output, n_hidden_nodes, n_labels): 
    "Make the generator model of CWGAN-GP."

    data_input = tf.keras.Input(shape=(len_input,), name='data') # Take intensity input
    label_input = tf.keras.Input(shape=(1,), name='label') # Take Label Input
    
    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m2 = tf.keras.layers.Reshape((256,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m2]) # Concatenate intensity and label data
    # Hidden Dense Layer and Normalization
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.BatchNormalization()(joined_data)
    
    # Output - number of features of sample to make
    output = tf.keras.layers.Dense(len_output, activation='linear', use_bias=True)(joined_data)
    
    generator = tf.keras.Model(inputs=[data_input, label_input], outputs=output)
    
    return generator


def critic_model(len_input, n_hidden_nodes, n_labels):
    "Make the critic model of CWGAN-GP."
    
    label_input = tf.keras.Input(shape=(1,)) # Take Label Input
    data_input = tf.keras.Input(shape=(len_input,)) # Take intensity input

    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m = tf.keras.layers.Reshape((256,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m]) # Concatenate intensity and label data
    # Hidden Dense Layer (Normalization worsened results here)
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(128, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    #joined_data = tf.keras.layers.BatchNormalization()(joined_data)

    # Output Layer - 1 node for critic decision
    output = tf.keras.layers.Dense(1, activation='linear', use_bias=True)(joined_data)
    
    critic = tf.keras.Model(inputs=[data_input, label_input], outputs=output)

    return critic

In [None]:
def generate_predictions(model, num_examples_to_generate, len_input, input_dist, uni_lbls):
    "Generate sample predictions based on a Generator model."
    
    test_input =  tf.constant(input_dist.rvs(size=len_input*num_examples_to_generate), shape=[
        num_examples_to_generate,len_input]) 
    
    if len(uni_lbls) < 3:
        test_labels = tf.constant([1.0]*(num_examples_to_generate//2) + [0.0]*(num_examples_to_generate//2), 
                                  shape=(num_examples_to_generate,1))
    else:
        test_labels = []
        for i in range(len(uni_lbls)):
            test_labels.extend([i]*(num_examples_to_generate//len(uni_lbls)))
        test_labels = np.array(pd.get_dummies(test_labels))
        #np.array(pd.get_dummies([i for i in range(len(uni_lbls))]*(num_examples_to_generate//len(uni_lbls))))
    predictions = model([test_input, test_labels], training=False) # `training` is set to False.
    return predictions

In [None]:
def training_montage(train_data_o, train_lbls, test_data, test_lbls,
                     epochs, generator, critic, generator_optimizer, critic_optimizer, input_dist,
                    batch_size, grad_pen_weight=10, k_cov_den=50, k_crossLID=15, random_seed=145,
                    n_generated_samples=96):
    """Train a generator and critic of CWGAN-GP.
    
       Receives training data and respective class labels (train_data_o and train_lbls) and trains a generator and a critic
        model (generator, critic) over a number of epochs (epochs) with a set batch size (batch_size) with the respective 
        optimizers and learning rate (generator_optimizer, critic_optimizer). Gradient Penalty is calculated with
        grad_pen_weight as the weight of the penalty.
       At set intervals, the function redraws three plots to follow the progression of the models (loss plot;
        coverage, density, crossLID and correct first cluster plot; and PCA plot with generated and training data). To
        this end, samples are generated from input values drawn from a given distribution (input_dist), and test data
        and the respective labels have to be given (test_data and test_lbls). Finally, the number of neighbors to
        consider for the coverage/density and crossLID calculations is also needed (k_cov_den, k_crossLID).
       Losses and evaluation metrics are appended to lists defined outside the function (gen_loss_all, crit_loss_all,
        saved_predictions, coverage, density, crossLID, corr1stcluster), which must exist before calling it.
    
       train_data_o: Pandas DataFrame with training data;
       train_lbls: List with training data class labels;
       test_data: Pandas DataFrame with test data to evaluate the model;
       test_lbls: List with test data class labels to evaluate the model;
       epochs: Int value with the number of epochs to train the model;
       generator: tensorflow keras.engine.functional.Functional model for the generator;
       critic: tensorflow keras.engine.functional.Functional model for the critic;
       generator_optimizer: tensorflow keras optimizer (with learning rate) for generator;
       critic_optimizer: tensorflow keras optimizer (with learning rate) for critic;
       input_dist: scipy.stats._continuous_distns.rv_histogram object - distribution to sample input values for generator;
       batch_size: int value with size of batch for model training;
       grad_pen_weight: int value (default 10) for penalty weight in gradient penalty calculation;
       k_cov_den: int value (default 50) for number of neighbors to consider for coverage and density calculation in
       generated samples evaluation;
       k_crossLID: int value (default 15) for number of neighbors to consider for crossLID calculation in generated samples
        evaluation.
       random_seed: int value (default 145) for numpy random seeding when randomly organizing samples in the data that
        will be split into batches.
       n_generated_samples: int value (default 96) for number of samples generated to test the model during training.
    """
    
    # Obtain the train data, randomize its order and divide it by twice the standard deviation of its values
    all_data = train_data_o.iloc[
        np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]/(2*train_data_o.values.std())
    
    # Same treatment for the test data
    test_data = (test_data/(2*test_data.values.std())).values
    training_data = all_data
    train_data = all_data.values
    
    # Change class labels to numerical values while following the randomized order of samples
    if len(set(train_lbls)) < 3: # 1 and 0 for when there are only two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values[:,0]
        test_labels = pd.get_dummies(np.array(test_lbls)).values[:,0]
    else: # One hot encoding for when there are more than two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values
        test_labels = pd.get_dummies(np.array(test_lbls)).values
    # Save the order of the labels
    ordered_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]).columns

    batch_divisions = int(batch_size / len(set(train_lbls))) # See how many samples of each class will be in each batch
    n_steps = epochs * int(training_data.shape[0] / batch_size) # Number of steps: nº of batches per epoch * nº of epochs
    n_critic = 5
    
    # Set up the evaluating images printed during training and the intervals they will be updated
    f, (axl, axc, axr) = plt.subplots(1, 3, figsize = (16,5))
    update1 = max(1, n_steps//200) # At least 1 to avoid a modulo by zero in short runs
    update2 = max(1, n_steps//20)

    if hasattr(tqdm, '_instances'):
        tqdm._instances.clear() # clear if it exists

    i=0

    for step in tqdm(range(n_steps)):
        
        # Critic Training
        crit_loss_temp = []
        
        # Select real samples for this batch on training and order samples to put samples of the same class together
        real_samp = train_data[i*batch_size:(i+1)*batch_size]
        real_lbls = train_labels[i*batch_size:(i+1)*batch_size]

        real_samples = np.empty(real_samp.shape)
        real_labels = np.empty(real_lbls.shape)
        a = 0
        if len(set(train_lbls)) < 3:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: pair[0], reverse=True):
                real_samples[a] = s
                real_labels[a] = l
                a = a+1
        else:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: np.argmax(pair[0]), reverse=False):
                #print(l, np.argmax(l))
                real_samples[a] = s
                real_labels[a] = l
                a = a+1

        for _ in range(n_critic): # For each step, train critic n_critic times
            
            # Generate input for generator
            artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
            artificial_labels = real_labels.copy()

            # Generate artificial samples from the latent vector
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            with tf.GradientTape() as crit_tape: # See the gradient for the critic

                # Get the logits for the generated samples
                X_artificial = critic([artificial_samples, artificial_labels], training=True)
                # Get the logits for the real samples
                X_true = critic([real_samples, real_labels], training=True)

                # Calculate the critic loss using the generated and real sample results
                c_cost = critic_loss_wgan(X_true, X_artificial)

                # Calculate the gradient penalty
                grad_pen = gradient_penalty_cwgan(batch_size, real_samples, artificial_samples,
                                                  real_labels, artificial_labels, critic)
                # Add the gradient penalty to the original discriminator loss
                crit_loss = c_cost + grad_pen * grad_pen_weight
                
            crit_loss_temp.append(crit_loss)

            # Calculate and apply the gradients obtained from the loss on the trainable variables
            gradients_of_critic = crit_tape.gradient(crit_loss, critic.trainable_variables)
            critic_optimizer.apply_gradients(zip(gradients_of_critic, critic.trainable_variables))

        i = i + 1
        if (step+1) % (n_steps//epochs) == 0:
            i=0

        crit_loss_all.append(np.mean(crit_loss_temp))
        
        # Generator Training
        # Generate inputs for generator, values and labels
        artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
        
        if len(set(train_lbls)) < 3:
            artificial_labels = tf.constant([1.0]*(batch_size//2) + [0.0]*(batch_size//2), shape=(batch_size,1))
        else:
            artificial_labels = np.array(pd.get_dummies([i for i in range(len(set(train_lbls)))]*batch_divisions))
    
        with tf.GradientTape() as gen_tape: # See the gradient for the generator
            # Generate artificial samples
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            # Get the critic results for generated samples
            X_artificial = critic([artificial_samples, artificial_labels], training=True)
            # Calculate the generator loss
            gen_loss = generator_loss_wgan(X_artificial)

        # Calculate and apply the gradients obtained from the loss on the trainable variables
        gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
        generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
        gen_loss_all.append(gen_loss)

        # Update the progress bar and evaluation graphs every update1 steps for loss plots and update2 for the others.
        if (step + 1) % update1 == 0:
            
            # Update the evaluating figures at the set intervals
            axl.clear() # Always clear the corresponding ax before redrawing it
            
            # Loss Plot
            axl.plot(gen_loss_all, color = 'blue', label='Generator Loss')
            axl.plot(crit_loss_all,color = 'red', label='Critic Loss')
            axl.set_xlabel('Number of Steps')
            axl.set_ylabel('Loss')
            axl.legend()
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())

        if (step + 1) % update2 == 0:

            saved_predictions.append(generate_predictions(generator, n_generated_samples, all_data.shape[1], 
                                                          input_dist, ordered_labels))
            # See density and coverage and crossLID (divided by 25 to be in the same order as the rest) 
            # of latest predictions
            den, cov = gem.evaluation_coverage_density(test_data, saved_predictions[-1], k= k_cov_den, metric='euclidean')
            clid = gem.cross_LID_estimator_byMLE(test_data, saved_predictions[-1], k=k_crossLID, metric='euclidean')/25
            density.append(den)
            coverage.append(cov)
            crossLID.append(clid)

            # PCA of the latest predictions and training data
            # Divide by twice the standard deviation to be the same as the generated data
            dfs_temp = pd.concat((train_data_o/(2*train_data_o.values.std()),pd.DataFrame(
                saved_predictions[-1].numpy(), columns=train_data_o.columns))) 
            temp_lbls = train_lbls.copy()
            for l in ordered_labels:
                temp_lbls.extend([l+' - GAN']*(n_generated_samples//len(ordered_labels)))
            principaldf = gem.pca_sample_projection(dfs_temp, temp_lbls, pca, whiten=True, 
                                                samp_number=len(train_data_o.index))
            lcolors = label_colors_test

            # Hierarchical clustering of the latest predictions and testing data, 
            # saving the correct 1st cluster fraction results
            dfs_temp = np.concatenate((test_data, saved_predictions[-1].numpy()))
            temp_lbls = ['real']*len(test_data) + ['gen']*len(saved_predictions[-1])
            hca_results = gem.perform_HCA(dfs_temp, temp_lbls, metric='euclidean', method='ward')
            corr1stcluster.append(hca_results['correct 1st clustering'])
            
            # Plots
            axc.clear()
            axc.plot(range(update2, step+2, update2), coverage, label='coverage')
            axc.plot(range(update2, step+2, update2), density, label='density')
            axc.plot(range(update2, step+2, update2), crossLID, color='red', label='crossLID/25')
            axc.plot(range(update2, step+2, update2), corr1stcluster, color='purple', label='corr_cluster')
            axc.legend()

            axr.clear()
            gem.plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=axr)
            axr.legend(loc='upper right', ncol=1, framealpha=1)
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())
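
The implementation of `gem.evaluation_coverage_density` is not shown in this notebook. The sketch below assumes the commonly used definitions of density and coverage for generative models, based on k-nearest-neighbour balls around the real samples. The function names and toy data are hypothetical and the repository code may differ in details.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def density_coverage(real, fake, k):
    """Density and coverage of `fake` with respect to `real`."""
    d_rr = pairwise_euclidean(real, real)
    # Radius of each real sample: distance to its k-th nearest real neighbour
    # (column 0 of the sorted distances is the zero self-distance)
    radii = np.sort(d_rr, axis=1)[:, k]
    d_rf = pairwise_euclidean(real, fake)
    inside = d_rf < radii[:, None]
    # Density: average number of real balls containing a generated sample, over k
    density = inside.sum(axis=0).mean() / k
    # Coverage: fraction of real samples with a generated sample inside their ball
    coverage = (d_rf.min(axis=1) < radii).mean()
    return density, coverage

rng = np.random.default_rng(0)
real = rng.normal(size=(30, 4))

# Generated samples identical to the real ones versus samples far away
den_same, cov_same = density_coverage(real, real.copy(), k=5)
den_far, cov_far = density_coverage(real, real + 100.0, k=5)
```

Generated samples that reproduce the real data give a density and a coverage of 1, whereas samples far from the real data give 0 for both.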

### Training the GAN

In [None]:
GENERATE=True

epochs = 600
batch_size = 32
k_cov_den = 20
k_crossLID = 15
random_seed = 145
n_generated_samples = 48*len(pd.unique(aug_lbl_storage_train[1]))

if GENERATE:
    generator_train = {}
    critic_train = {}

    results_train = {}
    for fold in df_storage_train:
        # Store results
        gen_loss_all = []
        crit_loss_all = []
        saved_predictions = []
        coverage = []
        density = []
        crossLID = []
        corr1stcluster = []

        # Get distribution of intensity values of the dataset
        hist = np.histogram(real_samples[fold].values.flatten(), bins=100)
        input_realdata_dist = stats.rv_histogram(hist)

        df = real_samples[fold]
        pca = PCA(n_components=2, svd_solver='full', whiten=True)
        pc_coords = pca.fit_transform(df)

        generator_optimizer = tf.keras.optimizers.RMSprop(5e-4)
        critic_optimizer = tf.keras.optimizers.RMSprop(5e-4)

        generator_train[fold] = generator_model(aug_df_storage_train[fold].shape[1],
                                             aug_df_storage_train[fold].shape[1], 128, 2)
        critic_train[fold] = critic_model(aug_df_storage_train[fold].shape[1], 512, 2)

        training_montage(aug_df_storage_train[fold], aug_lbl_storage_train[fold], 
                         real_samples[fold], lbl_storage_train[fold], 
                         epochs, generator_train[fold], critic_train[fold],
                         generator_optimizer, critic_optimizer, input_realdata_dist, batch_size,
                         grad_pen_weight=5, k_cov_den=k_cov_den, k_crossLID=k_crossLID,
                         random_seed=random_seed, n_generated_samples=n_generated_samples)

        results_train[fold]={'gen_loss': gen_loss_all, 'crit_loss': crit_loss_all, 'saved_pred': saved_predictions,
                 'coverage': coverage, 'density': density, 'crossLID': crossLID, 'corr1st_cluster': corr1stcluster}
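
The generator input is sampled from a histogram of the real intensity values through `stats.rv_histogram`. As an illustration of what that sampling amounts to, the NumPy-only sketch below picks a bin with probability proportional to its count and then draws uniformly within that bin. The function `sample_from_histogram` and the toy values are hypothetical.

```python
import numpy as np

def sample_from_histogram(values, n_samples, bins=100, seed=0):
    """Draw samples from the piecewise-constant density of a histogram."""
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(values, bins=bins)
    # Choose bins with probability proportional to their counts
    bin_idx = rng.choice(len(counts), size=n_samples, p=counts / counts.sum())
    # Draw uniformly between the edges of each chosen bin
    return rng.uniform(edges[bin_idx], edges[bin_idx + 1])

# Toy stand-in for the flattened, pre-treated intensity values
values = np.random.default_rng(3).normal(size=5000)

# Generator input for a batch of 4 samples with 6 features each
latent = sample_from_histogram(values, n_samples=4 * 6).reshape(4, 6)
```

Sampling the input from the empirical distribution of the real data gives the generator inputs in the same value range as the samples it has to produce.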

In [None]:
# Save the models and results
if GENERATE:
    for i in generator_train:
        # Save the weights of the models
        generator_train[i].save_weights('gan_models/TDint_imb_gen_CV'+str(i))
        critic_train[i].save_weights('gan_models/TDint_imb_crit_CV'+str(i))
        
# Read back the models
if not GENERATE:
    generator_train = {}
    critic_train = {}

    results_train = {}

    for i in real_samples:
        # Setting the model up
        generator_optimizer = tf.keras.optimizers.RMSprop(1e-4)
        critic_optimizer = tf.keras.optimizers.RMSprop(1e-4)

        generator_train[i] = generator_model(real_samples[i].shape[1], real_samples[i].shape[1], 128, 2)
        critic_train[i] = critic_model(real_samples[i].shape[1], 512, 2)

        # Load Previously saved models
        generator_train[i].load_weights('./gan_models/TDint_imb_gen_CV'+str(i))
        critic_train[i].load_weights('./gan_models/TDint_imb_crit_CV'+str(i))

### Comparison of Classification Accuracy

Here the dataset was divided into 2 folds. In each fold, the 'training set' has 10 samples of the minority class 'control' and 30 samples of the majority class 'disease'.

For each fold, a GAN model is built and trained on the training set. Classification models are then built from the training set and from samples generated by the corresponding GAN model, and their performance in discriminating the test set is compared.
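
Because each training fold is imbalanced, accuracy alone can hide poor performance on the minority class, which is why precision, recall and F1-score are computed further down with 'control' as the positive label. The sketch below illustrates the point with hypothetical toy labels (not the notebook's test data) and a hypothetical classifier that always predicts 'disease'; only the scikit-learn calls are real, everything else is an assumption made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical toy labels: 10 'control' (minority) and 30 'disease' (majority)
y_true = np.array(['control'] * 10 + ['disease'] * 30)
# Hypothetical degenerate classifier that ignores the minority class entirely
y_pred = np.array(['disease'] * 40)

accuracy = accuracy_score(y_true, y_pred)
# Metrics for the minority class only; zero_division=0 avoids a warning
# when no sample is predicted as 'control'
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label='control', average='binary', zero_division=0)

print(f'accuracy={accuracy:.2f}, precision={prec:.2f}, recall={rec:.2f}, F1={f1:.2f}')
```

In this toy case the accuracy is 0.75 even though no 'control' sample is recovered, so minority-class recall and F1-score are both 0.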

#### Generate a lot of samples and make Random Forests and PLS-DA models

In [None]:
np.random.seed(5402)
# Generate samples for each fold
generated_samples = {}

for i in generator_train:
    # Input to the generator
    num_examples_to_generate = 2048
    # Get distribution of intensity values of the dataset
    hist = np.histogram(real_samples[i].values.flatten(), bins=100)
    input_realdata_dist = stats.rv_histogram(hist)

    test_input = tf.constant(input_realdata_dist.rvs(
        size=len(df_storage_train[i].columns)*num_examples_to_generate),
                             shape=[num_examples_to_generate, len(df_storage_train[i].columns)])

    test_labels = tf.constant([0]*(num_examples_to_generate//2) + [1]*(num_examples_to_generate//2), shape=[
        num_examples_to_generate,1])

    # Generate GAN samples
    predictions = generator_train[i]([test_input, test_labels], training=False)
    # Reverse the division done to the data
    predictions = predictions * 2* aug_df_storage_train[i].values.std()

    ordered_labels_fold = pd.get_dummies(
        np.array(lbl_storage_train[i])[np.random.RandomState(seed=random_seed).permutation(
            len(lbl_storage_train[i]))]).columns

    generated_samples[i] = [pd.DataFrame(np.array(predictions), columns=df_storage_train[i].columns),
                            [ordered_labels_fold[1],]*(num_examples_to_generate//2) + [ordered_labels_fold[0],]*(
                            num_examples_to_generate//2)]

In [None]:
# To store for each fold
bal_datasets = {}
np.random.seed(325)
rng = np.random.default_rng(7519)
for i in generated_samples.keys():
    bal_datasets[i] = {}
    df = real_samples[i].loc[np.array(lbl_storage_train[i]) == 'control']
    # Calculate all correlations between all samples of experimental and GAN data and store them in a dataframe
    correlations = pd.DataFrame(index=generated_samples[i][0].index, columns=df.index).astype('float')

    for a in df.index:
        for j in generated_samples[i][0].index:
            correlations.loc[j,a] = stats.pearsonr(df.loc[a],
                                                   generated_samples[i][0].loc[j])[0]

    correlated_samples = pd.DataFrame(columns=df.index)
    for a in correlations:
        correlated_samples[a] = correlations[a].sort_values(ascending=False).index

    permutated = correlated_samples.copy()
    for l in correlated_samples.index:
        permutated.loc[l] = rng.permutation(correlated_samples.loc[l])

    corr_idxs = pd.unique(permutated.values.flatten())

    dataset_len = real_samples[i].shape[0]

    n_min_class = (np.array(lbl_storage_train[i]) == 'control').sum()
    n_max_class = (dataset_len - n_min_class)//(len(pd.unique(lbl_storage_train[i]))-1)
    
    # Slowly add samples - 2 at a time until a total of 20
    for num in range(0,n_max_class - n_min_class+1, (n_max_class - n_min_class+1)//10):
        idx_to_keep = corr_idxs[:num]

        corr_preds = generated_samples[i][0].loc[list(pd.unique(idx_to_keep))]
        corr_lbls  = np.array(generated_samples[i][1])[list(pd.unique(idx_to_keep))]

        # Slowly add the correlated GAN samples to the imbalanced dataset, making it a balanced dataset
        concat_df = pd.concat((corr_preds, real_samples[i]))
        concat_lbls = ['control',]*len(set(idx_to_keep)) + lbl_storage_train[i]
        bal_datasets[i][num] = [concat_df.copy(), concat_lbls.copy()]

In [None]:
# Samples added to the minority class
bal_datasets[i].keys()

### Fitting RF and PLS-DA models to Imbalanced Datasets and Evaluating them

RF and PLS-DA models are built for each minority class, each fold and each number of GAN samples added.
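
As a rough worked example of the "number of GAN samples added" and the class balance percentage used on the x-axis of the plots below, the following sketch reproduces the arithmetic of the augmentation loop. It assumes the fold composition stated earlier (10 'control' and 30 'disease' training samples); the names `sizes` and `balance` are introduced here only for illustration.

```python
n_min_class = 10   # minority ('control') samples per training fold, as stated above
n_max_class = 30   # majority ('disease') samples per training fold, as stated above

# Number of GAN samples needed to fully balance the two classes
n_missing = n_max_class - n_min_class

# Same step rule as the augmentation loop: roughly 10 increments
step = (n_missing + 1) // 10
sizes = list(range(0, n_missing + 1, step))

# Minority class size relative to the majority class, in percent
balance = [(n + n_min_class) / n_max_class * 100 for n in sizes]

for n, b in zip(sizes, balance):
    print(f'{n:2d} GAN samples added -> class balance {b:5.1f}%')
```

Under these assumptions the schedule goes from 0 to 20 added samples in steps of 2, that is, from about 33% to 100% class balance.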

RF

In [None]:
# Fitting and storing Random Forest models for each fold
RF_models_bal = {}

# Train the Models
for size in bal_datasets[1].keys():
    RF_models_bal[size] = {}
    for fold in bal_datasets:
        rf_mod = ma.RF_model(bal_datasets[fold][size][0], bal_datasets[fold][size][1],
                             return_cv=False, n_trees=200)
        RF_models_bal[size][fold] = rf_mod

In [None]:
# Testing the RF models with the test data for each fold
RF_results_bal = {'Accuracy':{}, 'F1-Score':{}, 'Precision':{}, 'Recall':{}}
# Evaluate the Models
for size in RF_models_bal.keys():
    RF_results_bal['Accuracy'][size] = {}
    RF_results_bal['F1-Score'][size] = {}
    RF_results_bal['Precision'][size] = {}
    RF_results_bal['Recall'][size] = {}
    for fold in RF_models_bal[size]:
        RF_results_bal['Accuracy'][size][fold] = RF_models_bal[size][fold].score(df_storage_test[fold],
                                                                                lbl_storage_test[fold])
        preds = RF_models_bal[size][fold].predict(df_storage_test[fold])
        prec, rec, f1, sup = precision_recall_fscore_support(lbl_storage_test[fold], preds,
                                                            pos_label='control', average='binary')
        RF_results_bal['F1-Score'][size][fold] = f1
        RF_results_bal['Precision'][size][fold] = prec
        RF_results_bal['Recall'][size][fold] = rec

In [None]:
pd.DataFrame.from_dict(RF_results_bal['Accuracy'])

In [None]:
pd.DataFrame.from_dict(RF_results_bal['F1-Score'])

In [None]:
# Plotting the results averaged over the folds
f, ax = plt.subplots(1,1,figsize=(4,4))
for metric in RF_results_bal:
    plt.plot((pd.DataFrame.from_dict(RF_results_bal[metric]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(RF_results_bal[metric]).mean(), 
             label='Avg. '+metric)

ax.set_ylabel('Performance', fontsize=15)
ax.set_xlabel('% of Augmentation', fontsize=15)
#ax.set_ylim([0.8, 1.01])
ax.legend(fontsize=15, bbox_to_anchor=(1,1))
plt.title('Random Forest')
plt.show()

PLS-DA

In [None]:
def decision_rule(y_pred, y_true, pos_label, average='binary'):
    "Decision rule for PLS-DA classification."
    # Decision rule for classification
    # Decision rule chosen: sample belongs to group where it has max y_pred (closer to 1)
    # In case of 1,0 encoding for two groups, round to nearest integer to compare
    nright = 0
    rounded = np.round(y_pred)

    for p in range(len(y_pred)):
        if rounded[p] >= 1:
            pred = 1
            rounded[p] = 1
        else:
            pred = 0
            rounded[p] = 0
        if pred == y_true[p]:
            nright += 1  # Correct prediction
    
    # Calculate accuracy for this iteration
    accuracy = (nright / len(y_pred))
    prec, rec, f1, sup = precision_recall_fscore_support(y_true, rounded, pos_label=pos_label, average=average)
    return accuracy, f1, prec, rec

In [None]:
PLSDA_models_bal = {}
PLSDA_results_bal = {'Accuracy':{}, 'F1-Score':{}, 'Precision':{}, 'Recall':{}}

np.random.seed(325)

# Train the Models
for size in bal_datasets[1].keys():
    PLSDA_models_bal[size] = {}
    PLSDA_results_bal['Accuracy'][size] = {}
    PLSDA_results_bal['F1-Score'][size] = {}
    PLSDA_results_bal['Precision'][size] = {}
    PLSDA_results_bal['Recall'][size] = {}
    for fold in bal_datasets:

        PLSDA_models_bal[size][fold] = ma.fit_PLSDA_model(bal_datasets[fold][size][0],
                                                          bal_datasets[fold][size][1],
                                                          n_comp=6,
                                                  return_scores=False, scale=False, encode2as1vector=True)
        plsda = PLSDA_models_bal[size][fold]
        # Obtain results with the test group
        y_pred = plsda.predict(df_storage_test[fold])
        y_true = ma._generate_y_PLSDA(lbl_storage_test[fold],
                                      pd.unique(bal_datasets[fold][size][1]),
                                      True)
        pos_label = np.where(pd.unique(bal_datasets[fold][size][1]) != 'control')[0][0]
        # Calculate accuracy
        accuracy, f1, prec, rec = decision_rule(y_pred, y_true, pos_label=pos_label, average='binary')
        PLSDA_results_bal['Accuracy'][size][fold] = accuracy
        PLSDA_results_bal['F1-Score'][size][fold] = f1
        PLSDA_results_bal['Precision'][size][fold] = prec
        PLSDA_results_bal['Recall'][size][fold] = rec

In [None]:
# Plotting the results averaged over the folds
f, ax = plt.subplots(1,1,figsize=(4,4))
for metric in PLSDA_results_bal:
    plt.plot((pd.DataFrame.from_dict(PLSDA_results_bal[metric]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(PLSDA_results_bal[metric]).mean(), 
             label='Avg. '+metric)

ax.set_ylabel('Performance', fontsize=15)
ax.set_xlabel('% of Augmentation', fontsize=15)
ax.set_ylim([0.8, 1.01])
ax.legend(fontsize=15, bbox_to_anchor=(1,1))
plt.title('PLS-DA', fontsize=15)
plt.show()

In [None]:
# Plotting the results averaged over the folds for RF and PLS-DA
f, axs = plt.subplots(1,3,figsize=(12,4), constrained_layout=True)
for i, ax in zip(['F1-Score', 'Precision', 'Recall'], axs.ravel()):
    avg_increase = pd.DataFrame.from_dict(RF_results_bal[i]).mean()
    ax.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
             label='Avg. RF')

    avg_increase = pd.DataFrame.from_dict(PLSDA_results_bal[i]).mean()
    ax.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
             label='Avg. PLS-DA')
    ax.set_ylim([0.6, 1.01])
    ax.set_title(i, fontsize=15)

axs[0].set_ylabel('Performance', fontsize=15)
axs[1].set_xlabel('% of Samples of the Minority Class in Comparison to the Majority Class', fontsize=15)
axs[1].legend(fontsize=13)

plt.show()

##### Extracting Important Features for RF models

In [None]:
rf_feats = {}
#rf_feats_df = pd.DataFrame(columns=RF_models[i].keys())
for size, folds in RF_models_bal.items():
    rf_feats[size] = {}
    for fold, mod in folds.items():
        if fold == 1:
            temp_df_bal = mod.feature_importances_
        else:
            temp_df_bal = temp_df_bal + mod.feature_importances_

    temp_df_bal = temp_df_bal / len(folds)

    #RF_models[fold][size].feature_importances_
    rf_feats[size] = dict(zip(range(1, len(bal_datasets[1][size][0].columns)+1), temp_df_bal))
    
rf_feats = pd.DataFrame(rf_feats)
rf_feats

In [None]:
rf_feats_abrev = pd.Series(index=rf_feats.columns, dtype=float)
top10 = int(0.05*len(rf_feats.index))
for i in rf_feats.columns:
    rf_feats_abrev[i] = rf_feats[i].sort_values(ascending=False)[:top10].mean()
    #print(rf_feats[i].sort_values(ascending=False)[:top10])

In [None]:
top10

Relative importance values depending on the number of GAN samples added
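
The quantity plotted here can be sketched as follows: for each number of GAN samples added, take the mean importance of the top 5% features, then express it as a percentage of the value obtained with no augmentation. The importance table below is random toy data, and `importances`, `n_top`, `top_mean` and `relative_change` are names assumed for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy importance table: 40 hypothetical features (rows) for 3 augmentation sizes (columns)
importances = pd.DataFrame(rng.random((40, 3)), columns=[0, 10, 20])

# Number of features in the top 5%
n_top = int(0.05 * len(importances.index))

# Mean importance of the top 5% features for each augmentation size
top_mean = importances.apply(lambda col: col.nlargest(n_top).mean())

# Change relative to the non-augmented case (first column), in percent
relative_change = top_mean / top_mean.iloc[0] * 100
print(relative_change)
```

By construction the first entry is 100%, so values above or below 100 show how concentrated the importance becomes in the top features as GAN samples are added.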

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, ax = plt.subplots(figsize=(6,6))

    plt.plot((np.array(list(rf_feats_abrev.index))+n_min_class)/n_max_class*100,
             rf_feats_abrev.values/rf_feats_abrev.values[0]*100,
             color='Black', linewidth=3)

    plt.ylabel('Avg. Gini Importance of top 5% Imp. features change (%)', fontsize = 12)
    plt.xlabel('% of Augmentation', fontsize = 15)
    plt.ylim(80,160)
    ax.tick_params(axis='both', which='major', labelsize=13)
    plt.title('Random Forest', fontsize=15)

Common top 5% important features compared to the complete models depending on the number of GAN samples added
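
A minimal sketch of this overlap calculation, using hypothetical feature names and importance values: the top-ranked features of one model are intersected with a reference list, and the number of shared features is reported as a percentage of the list size. `reference_top` stands in for the top features of the complete model and is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical reference list (e.g. top features of the complete model)
reference_top = ['m3', 'm7', 'm1', 'm9']

# Hypothetical importance values from a model built on an augmented dataset
model_importance = pd.Series(
    {'m1': 0.30, 'm2': 0.05, 'm3': 0.25, 'm4': 0.20, 'm7': 0.10, 'm9': 0.02})

n_top = 4
# Names of the n_top most important features in this model
model_top = list(model_importance.sort_values(ascending=False).index[:n_top])

# Features shared between the two lists (returned sorted and without repeats)
common = np.intersect1d(model_top, reference_top)
pct_common = len(common) / n_top * 100
print(list(common), pct_common)
```

Note that `np.intersect1d` ignores ranking order within the lists, so only membership in the top set is compared.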

In [None]:
rf_line = []
for i in rf_feats:
    idxs = []
    for l in rf_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    rf_line.append(len(np.intersect1d(idxs, RF_feats_real.index[:top10])))
    
rf_line_uni = []
for i in rf_feats:
    idxs = []
    for l in rf_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    rf_line_uni.append(len(np.intersect1d(idxs, uni_results_filt.index)))
    
rf_line_unitop = []
for i in rf_feats:
    idxs = []
    for l in rf_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    rf_line_unitop.append(len(np.intersect1d(idxs, uni_results_filt.index[:top10])))

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axc, axr) = plt.subplots(1,3,figsize=(15,5))

    axl.plot((np.array(list(rf_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line)/top10*100, color='black',
             label='All Together', linewidth=3)
    axc.plot((np.array(list(rf_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line_uni)/top10*100, color='black',
             label='All Together', linewidth=3)
    axr.plot((np.array(list(rf_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line_unitop)/top10*100, color='black',
             label='All Together', linewidth=3)

    axl.set_title('Against RF Top 5% Imp. Feats', fontsize=15)
    axc.set_title('Against Univariate Significant Feat.', fontsize=15)
    axr.set_title('Against Univariate Top 5% Significant Feat.', fontsize=14)
    axl.set_ylabel('% of Common Features with Real Model', fontsize = 15)
    axc.set_xlabel('% of Samples of the Minority Class in Comparison to the Majority Class', fontsize = 15)
    axl.set_ylim(15,102)
    axc.set_ylim(15,102)
    axr.set_ylim(15,102)
    #plt.legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))

    plt.suptitle('Random Forest', fontsize=15)

##### Extracting Important Features for PLS-DA models

In [None]:
plsda_feats = {}

for size, folds in PLSDA_models_bal.items():
    plsda_feats[size] = {}
    for fold, mod in folds.items():
        if fold == 1:
            temp_df_bal = ma._calculate_vips(mod)
        else:
            temp_df_bal = temp_df_bal + ma._calculate_vips(mod)

    temp_df_bal = temp_df_bal / len(folds)

    plsda_feats[size] = dict(zip(range(1, len(bal_datasets[1][size][0].columns)+1), temp_df_bal))

plsda_feats = pd.DataFrame.from_dict(plsda_feats)

In [None]:
plsda_feats_abrev = pd.Series(index=plsda_feats.columns, dtype=float)
top10 = int(0.05*len(plsda_feats.index))
for i in plsda_feats.columns:
    plsda_feats_abrev[i] = plsda_feats[i].sort_values(ascending=False)[:top10].mean()
    #print(rf_feats[i].sort_values(ascending=False)[:top10])

In [None]:
top10

Relative importance values depending on the number of GAN samples added

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, ax = plt.subplots(figsize=(6,6))

    plt.plot((np.array(list(plsda_feats_abrev.index))+n_min_class)/n_max_class*100,
             plsda_feats_abrev.values/plsda_feats_abrev.values[0]*100,
             color='Black', linewidth=3)

    plt.ylabel('Avg. VIP Score of top 5% Imp. features change (%)', fontsize = 12)
    plt.xlabel('% of Augmentation', fontsize = 15)
    plt.ylim(90,120)
    ax.tick_params(axis='both', which='major', labelsize=13)
    plt.title('PLS-DA', fontsize=15)

Common top 5% important features compared to the complete models depending on the number of GAN samples added

In [None]:
plsda_line = []
for i in plsda_feats:
    idxs = []
    for l in plsda_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    plsda_line.append(len(np.intersect1d(idxs, PLSDA_feats_real.index[:top10])))
    
plsda_line_uni = []
for i in plsda_feats:
    idxs = []
    for l in plsda_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    plsda_line_uni.append(len(np.intersect1d(idxs, uni_results_filt.index)))
    
plsda_line_unitop = []
for i in plsda_feats:
    idxs = []
    for l in plsda_feats[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets[1][0][0].columns[l-1])
    plsda_line_unitop.append(len(np.intersect1d(idxs, uni_results_filt.index[:top10])))

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axc, axr) = plt.subplots(1,3,figsize=(15,5))

    axl.plot((np.array(list(plsda_feats.keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line)/top10*100, color='black',
                label='All Together', linewidth=3)
    axc.plot((np.array(list(plsda_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(plsda_line_uni)/top10*100, color='black',
            label='All Together', linewidth=3)
    axr.plot((np.array(list(plsda_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(plsda_line_unitop)/top10*100, color='black',
            label='All Together', linewidth=3)  

    axl.set_title('Against PLS-DA Top 5% Imp. Feats', fontsize=15)
    axc.set_title('Against Univariate Significant Feat.', fontsize=15)
    axr.set_title('Against Univariate Top 5% Significant Feat.', fontsize=14)
    axl.set_ylabel('% of Common Features with Real Model', fontsize = 15)
    axc.set_xlabel('% of Samples of the Minority Class in Comparison to the Majority Class', fontsize = 15)
    axl.set_ylim(50,102)
    axc.set_ylim(50,102)
    axr.set_ylim(50,102)

    plt.suptitle('PLS-DA', fontsize=15)

#### Summary of Results

In [None]:
# Summary of classification performance as GAN samples are added
with plt.style.context('seaborn-whitegrid'):
    f, (axl,axr) = plt.subplots(1,2,figsize=(8,3), constrained_layout=True)
    for i in ['F1-Score', 'Precision', 'Recall']:
        avg_increase = pd.DataFrame.from_dict(RF_results_bal[i]).mean()
        axl.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
                 label='Avg. ' + i)

        avg_increase = pd.DataFrame.from_dict(PLSDA_results_bal[i]).mean()
        axr.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
                 label='Avg. ' + i)
        axl.set_ylim([0, 1.05])
        axr.set_ylim([0, 1.05])
        axl.set_title('RF', fontsize=13)
        axr.set_title('PLS-DA', fontsize=13)

    axl.set_ylabel('Performance', fontsize=13)
    axr.set_ylabel('Performance', fontsize=13)
    axr.set_xlabel('Class Balance (%)', fontsize=13)
    axl.set_xlabel('Class Balance (%)', fontsize=13)
    axl.legend(fontsize=13)
    axl.tick_params(axis='both', which='major', labelsize=11)
    axr.tick_params(axis='both', which='major', labelsize=11)

    plt.show()
    f.savefig('images/TDint_Imbalanced_AugPerformance_plot.png', dpi=400)
    f.savefig('images/TDint_Imbalanced_AugPerformance_plot.pdf', dpi=400)

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axr) = plt.subplots(1,2, figsize=(8,4), constrained_layout=True)

    axl.plot((np.array(list(rf_feats_abrev.index))+n_min_class)/n_max_class*100,
             rf_feats_abrev.values/rf_feats_abrev.values[0]*100, label='Minority Class: Control',
             color='Black', linewidth=3)

    axl.set_ylabel('Avg. Gini Imp. of top 5% Imp. Feat. change (%)', fontsize = 10)
    axl.set_ylim(75,150)
    axl.legend(fontsize=13)
    axl.tick_params(axis='both', which='major', labelsize=11)
    axl.set_title('RF', fontsize=15)
    
    axr.plot((np.array(list(plsda_feats_abrev.index))+n_min_class)/n_max_class*100,
             plsda_feats_abrev.values/plsda_feats_abrev.values[0]*100, label='Minority Class: Control',
             color='Black', linewidth=3)

    axr.set_ylabel('Avg. VIP Score of top 5% Imp. Feat. change (%)', fontsize = 10)
    f.supxlabel('% of Samples of the Minority Class in Comparison to the Majority Class', fontsize = 15)
    axr.set_ylim(85,115)
    axr.tick_params(axis='both', which='major', labelsize=11)
    axr.set_title('PLS-DA', fontsize=15)
    #plt.suptitle('OD', fontsize=18)
    f.savefig('images/TDint_Imbalanced_ImpFeatChange_plot.png', dpi=400)
    f.savefig('images/TDint_Imbalanced_ImpFeatChange_plot.pdf', dpi=400)

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axul, axur) = plt.subplots(1,2,figsize=(8,3), constrained_layout=True)

    axul.plot((np.array(list(rf_feats.keys()))+n_min_class)/n_max_class*100,
                 np.array(rf_line)/top10*100, color='black',
                label='Minority Class: Control', linewidth=3)
    axur.plot((np.array(list(rf_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line_uni)/top10*100, color='black',
            label='Minority Class: Control', linewidth=3)

    axul.set_title('RF Vs RF Complete Model', fontsize=13)
    axur.set_title('RF Vs Ref. Significant Features', fontsize=13)
    axul.set_ylabel('% of Common Features', fontsize = 13)
    f.supxlabel('Class Balance (%)', fontsize = 13)
    axul.set_ylim(0,105)
    axur.set_ylim(0,105)
    #plt.legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))
    axur.legend(loc='lower left', fontsize=12)
    axul.tick_params(axis='both', which='major', labelsize=11)
    #axuc.tick_params(axis='both', which='major', labelsize=14)
    axur.tick_params(axis='both', which='major', labelsize=11)

    f.savefig('images/TDint_Imbalanced_RF_ImpFeatCommon_plot.png', dpi=400)
    f.savefig('images/TDint_Imbalanced_RF_ImpFeatCommon_plot.pdf', dpi=400)

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axdl, axdr) = plt.subplots(1,2,figsize=(8,3), constrained_layout=True)

    axdl.plot((np.array(list(plsda_feats.keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line)/top10*100, color='black',
                label='Minority Class: Control', linewidth=3)
    axdr.plot((np.array(list(plsda_feats.keys()))+n_min_class)/n_max_class*100,
             np.array(plsda_line_uni)/top10*100, color='black',
            label='Minority Class: Control', linewidth=3)     

    axdl.set_title('PLS-DA Vs PLS-DA Complete Model', fontsize=13)
    axdr.set_title('PLS-DA Vs Ref. Significant Features', fontsize=13)
    axdl.set_ylabel('% of Common Features', fontsize = 13)
    f.supxlabel('Class Balance (%)', fontsize = 13)
    axdl.set_ylim(0,105)
    axdr.set_ylim(0,105)
    axdl.tick_params(axis='both', which='major', labelsize=11)
    axdr.tick_params(axis='both', which='major', labelsize=11)
    #plt.suptitle('HD', fontsize=18)
    f.savefig('images/TDint_Imbalanced_PLSDA_ImpFeatCommon_plot.png', dpi=400)
    f.savefig('images/TDint_Imbalanced_PLSDA_ImpFeatCommon_plot.pdf', dpi=400)

### Comparing Imbalanced Datasets, Imb + Min. Cl. GAN Samples and only GAN samples

Models are built and evaluated using only GAN data, and results are compiled for the imbalanced datasets (with no augmentation) and for the imbalanced datasets balanced with minority class GAN samples.
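
The per-fold results of the three training regimes are later summarised by their mean and standard deviation over the folds. The sketch below shows one way to build such a summary table; the scores are hypothetical placeholders (not results from this notebook) and the names `f1_per_fold` and `summary` are assumptions. It also shows that `.values.std()` (NumPy, `ddof=0`) and `pd.Series.std()` (pandas, `ddof=1`) give different values, which matters with only 2 folds.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold F1-scores for the three training regimes (illustrative only)
f1_per_fold = {
    'Real Imbalanced': {1: 0.80, 2: 0.90},
    'GAN Augmented': {1: 0.90, 2: 0.94},
    'GAN': {1: 0.85, 2: 0.95},
}

# One row per regime; mean and both flavours of standard deviation over the folds
summary = pd.DataFrame({
    name: {'mean': pd.Series(folds).values.mean(),
           'std (ddof=0)': pd.Series(folds).values.std(),  # NumPy population std
           'std (ddof=1)': pd.Series(folds).std()}         # pandas sample std
    for name, folds in f1_per_fold.items()
}).T
print(summary)
```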

##### RF

In [None]:
# Fitting and storing Random Forest models for each fold for real, GAN and CorrGAN Data
RF_models_GAN = {}
RF_models_real = {}
RF_models_GAN_bal = {}

# Train the Models
for fold in generated_samples:
    rf_mod = ma.RF_model(generated_samples[fold][0], generated_samples[fold][1], return_cv=False, n_trees=200)
    RF_models_GAN[fold] = rf_mod

    RF_models_real[fold] = RF_models_bal[0][fold]

    RF_models_GAN_bal[fold] = RF_models_bal[n_max_class-n_min_class][fold]

In [None]:
# Testing the RF models with the test data for each fold
RF_results_GAN = {'Accuracy':{}, 'F1-Score':{},
                  'Precision':{}, 'Recall':{}}
RF_results_real = {'Accuracy':{}, 'F1-Score':{},
                   'Precision':{}, 'Recall':{}}
RF_results_GAN_bal = {'Accuracy':{}, 'F1-Score':{},
                      'Precision':{}, 'Recall':{}}

for fold in generated_samples:
    RF_results_GAN['Accuracy'][fold] = RF_models_GAN[fold].score(df_storage_test[fold],
                                                                 lbl_storage_test[fold])
    preds = RF_models_GAN[fold].predict(df_storage_test[fold])
    prec, rec, f1, sup = precision_recall_fscore_support(lbl_storage_test[fold], preds, pos_label='control',
                                                        average='binary')
    RF_results_GAN['F1-Score'][fold] = f1
    RF_results_GAN['Precision'][fold] = prec
    RF_results_GAN['Recall'][fold] = rec

    RF_results_real['Accuracy'][fold] = RF_results_bal['Accuracy'][0][fold]
    RF_results_real['F1-Score'][fold] = RF_results_bal['F1-Score'][0][fold]
    RF_results_real['Precision'][fold] = RF_results_bal['Precision'][0][fold]
    RF_results_real['Recall'][fold] = RF_results_bal['Recall'][0][fold]

    RF_results_GAN_bal['Accuracy'][fold] = RF_results_bal['Accuracy'][n_max_class-n_min_class][fold]
    RF_results_GAN_bal['F1-Score'][fold] = RF_results_bal['F1-Score'][n_max_class-n_min_class][fold]
    RF_results_GAN_bal['Precision'][fold] = RF_results_bal['Precision'][n_max_class-n_min_class][fold]
    RF_results_GAN_bal['Recall'][fold] = RF_results_bal['Recall'][n_max_class-n_min_class][fold]

In [None]:
PLSDA_models_GAN = {}
PLSDA_results_GAN = {'Accuracy':{}, 'F1-Score':{},
                  'Precision':{}, 'Recall':{}}

PLSDA_models_real = {}
PLSDA_results_real = {'Accuracy':{}, 'F1-Score':{},
                  'Precision':{}, 'Recall':{}}

PLSDA_models_GAN_bal = {}
PLSDA_results_GAN_bal = {'Accuracy':{}, 'F1-Score':{},
                  'Precision':{}, 'Recall':{}}

# Train the Models
for fold in bal_datasets:

    PLSDA_models_GAN[fold] = ma.fit_PLSDA_model(generated_samples[fold][0],
                                                generated_samples[fold][1],
                                                n_comp=6,
                                                return_scores=False, scale=False, encode2as1vector=True)
    plsda = PLSDA_models_GAN[fold]
    # Obtain results with the test group
    y_pred = plsda.predict(df_storage_test[fold])
    y_true = ma._generate_y_PLSDA(lbl_storage_test[fold],
                                  pd.unique(generated_samples[fold][1]),
                                  True)
    pos_label = np.where(pd.unique(generated_samples[fold][1]) != 'control')[0][0]
    # Calculate accuracy
    accuracy, f1, prec, rec = decision_rule(y_pred, y_true, pos_label=pos_label, average='binary')
    PLSDA_results_GAN['Accuracy'][fold] = accuracy
    PLSDA_results_GAN['F1-Score'][fold] = f1
    PLSDA_results_GAN['Precision'][fold] = prec
    PLSDA_results_GAN['Recall'][fold] = rec

    PLSDA_models_real[fold] = PLSDA_models_bal[0][fold]

    PLSDA_results_real['Accuracy'][fold] = PLSDA_results_bal['Accuracy'][0][fold]
    PLSDA_results_real['F1-Score'][fold] = PLSDA_results_bal['F1-Score'][0][fold]
    PLSDA_results_real['Precision'][fold] = PLSDA_results_bal['Precision'][0][fold]
    PLSDA_results_real['Recall'][fold] = PLSDA_results_bal['Recall'][0][fold]

    PLSDA_models_GAN_bal[fold] = PLSDA_models_bal[n_max_class-n_min_class][fold]

    PLSDA_results_GAN_bal['Accuracy'][fold] = PLSDA_results_bal['Accuracy'][n_max_class-n_min_class][fold]
    PLSDA_results_GAN_bal['F1-Score'][fold] = PLSDA_results_bal['F1-Score'][n_max_class-n_min_class][fold]
    PLSDA_results_GAN_bal['Precision'][fold] = PLSDA_results_bal['Precision'][n_max_class-n_min_class][fold]
    PLSDA_results_GAN_bal['Recall'][fold] = PLSDA_results_bal['Recall'][n_max_class-n_min_class][fold]

In [None]:
# Results
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.5):
        f, axs = plt.subplots(1, 2, figsize=(4, 2.5), constrained_layout=True)
        for metric, axu in zip(['F1-Score', 'Recall'], axs.ravel()):
            x = np.arange(2)  # the label locations
            l = ['RF', 'PLS-DA']
            width = 0.23  # the width of the bars

            offset = - 0.25 + 0 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.Series(RF_results_real[metric]).values.mean(), 
                                                                pd.Series(PLSDA_results_real[metric]).values.mean()],
                                         'STD': [pd.Series(RF_results_real[metric]).values.std(), 
                                                 pd.Series(PLSDA_results_real[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric], width, label='Exp. Imb. Data', color='green')

            offset = - 0.25 + 1 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.Series(RF_results_GAN_bal[metric]).values.mean(), 
                                                            pd.Series(PLSDA_results_GAN_bal[metric]).values.mean()],
                                         'STD': [pd.Series(RF_results_GAN_bal[metric]).values.std(), 
                                                 pd.Series(PLSDA_results_GAN_bal[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric],
                            width, label='Imb. + GAN Data', color='red')

            offset = - 0.25 + 2 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.Series(RF_results_GAN[metric]).values.mean(), 
                                                                pd.Series(PLSDA_results_GAN[metric]).values.mean()],
                                         'STD': [pd.Series(RF_results_GAN[metric]).values.std(), 
                                                 pd.Series(PLSDA_results_GAN[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric], width, label='GAN Data', color='blue')
            #axu.errorbar(x + offset, y=accuracy_stats[metric], yerr=accuracy_stats['STD'],
            #                    ls='none', ecolor='0.2', capsize=3)

            for spine in axu.spines.values():
                spine.set_edgecolor('0.1')
            axu.set_xticks(x)
            axu.set_xticklabels(l, fontsize=13)
            axu.set_title(metric, fontsize=13)
        axs[0].set(ylabel='Performance', ylim=(0,1.05))
        axs[0].set_ylabel('Performance', fontsize=13)
        axs[1].tick_params(labelleft=False)
        axs[0].tick_params(axis='y', which='major', labelsize=11)
        #axs[1].set_yticks([])
        #axs[0].legend(loc='center', fontsize=11, bbox_to_anchor=(1.1,1.25), ncol=2)
        lgd = f.legend(*axs[0].get_legend_handles_labels(), loc='center', fontsize=11.2, bbox_to_anchor=(0.572,1.12),
                       ncol=2)
        #plt.suptitle('HD Dataset                ', fontsize=18)

        plt.show()
        f.savefig('images/TDint_Imbalanced_Accuracy_plot.png', dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')
        f.savefig('images/TDint_Imbalanced_Accuracy_plot.pdf', dpi=400, bbox_extra_artists=(lgd,), bbox_inches='tight')

In [None]:
Results_abridged = {'RF':{}, 'PLS-DA':{}}
Results_abridged_std = {'RF':{}, 'PLS-DA':{}}
for i in RF_results_GAN:
    Results_abridged['RF'][i] = {'Real Imbalanced': pd.Series(RF_results_real[i]).values.mean(),
                                 'GAN Augmented': pd.Series(RF_results_GAN_bal[i]).values.mean(),
                                 'GAN': pd.Series(RF_results_GAN[i]).values.mean()}
    Results_abridged['PLS-DA'][i] = {'Real Imbalanced': pd.Series(PLSDA_results_real[i]).values.mean(),
                                     'GAN Augmented': pd.Series(PLSDA_results_GAN_bal[i]).values.mean(),
                                     'GAN': pd.Series(PLSDA_results_GAN[i]).values.mean()}
    
    Results_abridged_std['RF'][i] = {'Real Imbalanced': pd.Series(RF_results_real[i]).values.std(),
                                     'GAN Augmented': pd.Series(RF_results_GAN_bal[i]).values.std(),
                                     'GAN': pd.Series(RF_results_GAN[i]).values.std()}
    Results_abridged_std['PLS-DA'][i] = {'Real Imbalanced': pd.Series(PLSDA_results_real[i]).values.std(),
                                         'GAN Augmented': pd.Series(PLSDA_results_GAN_bal[i]).values.std(),
                                         'GAN': pd.Series(PLSDA_results_GAN[i]).values.std()}

In [None]:
pd.DataFrame(Results_abridged['RF'])

In [None]:
pd.DataFrame(Results_abridged['PLS-DA'])

### Comparing Important Features of Models Built from Imbalanced Datasets, Imbalanced + Minority Class GAN Samples, and Only GAN Samples

#### Comparing Against Important Features of the Complete models
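
The cells in this subsection repeatedly convert a vector of importances into a table indexed by feature name and sorted in descending order, first per fold and then averaged over folds. A minimal sketch of that pattern is given below. `rank_importances` and `mean_rank_importances` are hypothetical helper names (they are not part of `multianalysis`), and the toy feature names and values are made up for illustration. The single column is labelled `1` to mirror the tables built in the notebook.

```python
import numpy as np
import pandas as pd


def rank_importances(importances, feature_names):
    """Return a one-column DataFrame of importances, indexed by feature name
    and sorted from most to least important."""
    ranked = pd.DataFrame({1: np.asarray(importances, dtype=float)},
                          index=list(feature_names))
    return ranked.sort_values(by=1, ascending=False)


def mean_rank_importances(importances_per_fold, feature_names):
    """Average the importance vectors over folds, then rank the averages."""
    stacked = np.vstack([np.asarray(v, dtype=float)
                         for v in importances_per_fold.values()])
    return rank_importances(stacked.mean(axis=0), feature_names)


# Toy example: hypothetical feature names and two folds
toy_names = ['m/z 100', 'm/z 200', 'm/z 300']
toy_folds = {1: [0.2, 0.5, 0.3], 2: [0.4, 0.5, 0.1]}
toy_ranked = mean_rank_importances(toy_folds, toy_names)
print(toy_ranked)
```

In the notebook, the per-fold inputs would be the `feature_importances_` arrays of the RF models (or the VIP scores of the PLS-DA models) and the names would be the dataset columns.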

In [None]:
# Create a 2nd complete model but with only 1 iteration to compare the important features
RF_model_real1 = ma.RF_model_CV(real_samples_all, target, iter_num=1, n_fold=n_fold, n_trees=200) 
RF_feats_real1 = pd.DataFrame(RF_model_real1['important_features']).set_index(0).sort_values(by=1, ascending=False)
RF_feats_real1.index = [real_samples_all.columns[i] for i in RF_feats_real1.index]
RF_feats_real1 

In [None]:
RF_feats_GAN = {}
RF_feats_imb = {}
RF_feats_GAN_bal = {}

for fold in RF_models_real:
    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)),
                               RF_models_real[fold].feature_importances_))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    RF_feats_imb[fold] = temp_df.copy()

    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)),
                               RF_models_GAN[fold].feature_importances_))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    RF_feats_GAN[fold] = temp_df.copy()

    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)),
                               RF_models_GAN_bal[fold].feature_importances_))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    RF_feats_GAN_bal[fold] = temp_df.copy()

In [None]:
for fold in RF_models_real:
    if fold == 1:
        temp_df_imb = RF_models_real[fold].feature_importances_
        temp_df_bal = RF_models_GAN_bal[fold].feature_importances_
        temp_df_GAN = RF_models_GAN[fold].feature_importances_
    else:
        temp_df_imb = temp_df_imb + RF_models_real[fold].feature_importances_
        temp_df_bal = temp_df_bal + RF_models_GAN_bal[fold].feature_importances_
        temp_df_GAN = temp_df_GAN + RF_models_GAN[fold].feature_importances_

temp_df_imb = temp_df_imb / len(RF_models_real)
temp_df_imb = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_imb))
temp_df_imb = temp_df_imb.set_index(0).sort_values(by=1, ascending=False)
temp_df_imb.index = [generated_samples[fold][0].columns[a] for a in temp_df_imb.index]
RF_feats_imb_mean = temp_df_imb.copy()

temp_df_bal = temp_df_bal / len(RF_models_GAN_bal)
temp_df_bal = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_bal))
temp_df_bal = temp_df_bal.set_index(0).sort_values(by=1, ascending=False)
temp_df_bal.index = [generated_samples[fold][0].columns[a] for a in temp_df_bal.index]
RF_feats_GAN_bal_mean = temp_df_bal.copy()

temp_df_GAN = temp_df_GAN / len(RF_models_GAN)
temp_df_GAN = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_GAN))
temp_df_GAN = temp_df_GAN.set_index(0).sort_values(by=1, ascending=False)
temp_df_GAN.index = [generated_samples[fold][0].columns[a] for a in temp_df_GAN.index]
RF_feats_GAN_mean = temp_df_GAN.copy()

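Calculate the intersection of the top k important RF features, for k from 1 to n - 1 (where n is the number of features in the dataset), between the reference ranking from the complete dataset (averaged over 20 iterations) and each of the following: a single iteration on the complete dataset, the imbalanced dataset, the GAN-augmented (balanced) dataset and the GAN-only dataset. A randomly shuffled ranking is used as a baseline.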
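
As an illustration of this calculation, the sketch below computes the fraction of shared features among the top k of two rankings with `np.intersect1d`, for k from 1 to n - 1. The function name `topk_overlap_fraction` and the toy rankings are assumptions made only for this example.

```python
import numpy as np


def topk_overlap_fraction(reference_ranking, other_ranking):
    """For k = 1 .. n-1, return the fraction of the top-k features of
    `reference_ranking` that also appear in the top k of `other_ranking`."""
    reference_ranking = np.asarray(reference_ranking)
    other_ranking = np.asarray(other_ranking)
    n = len(reference_ranking)
    counts = [len(np.intersect1d(reference_ranking[:k], other_ranking[:k]))
              for k in range(1, n)]
    return np.array(counts) / np.arange(1, n)


# Toy rankings (hypothetical feature names, most important first)
reference = ['A', 'B', 'C', 'D', 'E']
other = ['B', 'A', 'E', 'C', 'D']
fractions = topk_overlap_fraction(reference, other)
print(fractions)
```

In this toy case the two top-2 sets coincide (fraction 1.0) even though their order differs, because only membership in the top k is compared, not the position within it.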

In [None]:
intersections_RF = []
for i in range(1, len(RF_feats_real1)):
    intersections_RF.append(len(np.intersect1d(RF_feats_real1.index[:i], RF_feats_real.index[:i])))

In [None]:
intersections_RF_bal = {}
for cl in RF_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(RF_feats_real)):
        int_bal.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_GAN_bal_mean[cl].index[:i])))
    intersections_RF_bal[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_RF_GAN = {}
for cl in RF_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(RF_feats_real)):
        int_GAN.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_GAN_mean[cl].index[:i])))
    intersections_RF_GAN[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_RF_imb = {}
for cl in RF_feats_imb_mean:
    int_imb = []
    for i in range(1, len(RF_feats_real)):
        int_imb.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_imb_mean[cl].index[:i])))
    intersections_RF_imb[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_RF = []
copy_shuffle = list(RF_feats_real.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(RF_feats_real)):
    random_intersections_RF.append(len(np.intersect1d(RF_feats_real.index[:i], copy_shuffle[:i])))

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)), 
            label = 'Real-Real Intersections', color='Black', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_imb).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Imbalanced Real Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_bal).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'GAN Augmented Real Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_GAN).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(random_intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Random Intersections', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axl.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_RF)//8])
axl.set_ylim([0,1.01])
axl.set_title('Random Forest', fontsize=18)

In [None]:
# Create a 2nd complete model but with only 1 iteration to compare the important features
np.random.seed(485)
n_fold = 5
PLSDA_model_real1 = ma.PLSDA_model_CV(real_samples_all, target, n_comp=6, iter_num=1, n_fold=n_fold, feat_type='VIP')
PLSDA_feats_real1 = pd.DataFrame(PLSDA_model_real1['important_features']).set_index(0).sort_values(by=1, ascending=False)
PLSDA_feats_real1.index = [real_samples_all.columns[i] for i in PLSDA_feats_real1.index]
PLSDA_feats_real1 

In [None]:
PLSDA_feats_GAN = {}
PLSDA_feats_imb = {}
PLSDA_feats_GAN_bal = {}

for fold in PLSDA_models_real:
    vips = ma._calculate_vips(PLSDA_models_real[fold])
    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), vips))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    PLSDA_feats_imb[fold] = temp_df.copy()

    vips = ma._calculate_vips(PLSDA_models_GAN[fold])
    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), vips))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    PLSDA_feats_GAN[fold] = temp_df.copy()

    vips = ma._calculate_vips(PLSDA_models_GAN_bal[fold])
    temp_df = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), vips))
    temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
    temp_df.index = [generated_samples[fold][0].columns[a] for a in temp_df.index]
    PLSDA_feats_GAN_bal[fold] = temp_df.copy()

In [None]:
for fold in PLSDA_models_real:
    if fold == 1:
        vips = ma._calculate_vips(PLSDA_models_real[fold])
        temp_df_imb = vips
        vips = ma._calculate_vips(PLSDA_models_GAN_bal[fold])
        temp_df_bal = vips
        vips = ma._calculate_vips(PLSDA_models_GAN[fold])
        temp_df_GAN = vips
        
    else:
        vips = ma._calculate_vips(PLSDA_models_real[fold])
        temp_df_imb = temp_df_imb + vips
        vips = ma._calculate_vips(PLSDA_models_GAN_bal[fold])
        temp_df_bal = temp_df_bal + vips
        vips = ma._calculate_vips(PLSDA_models_GAN[fold])
        temp_df_GAN = temp_df_GAN + vips

temp_df_imb = temp_df_imb / len(PLSDA_models_real)
temp_df_imb = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_imb))
temp_df_imb = temp_df_imb.set_index(0).sort_values(by=1, ascending=False)
temp_df_imb.index = [generated_samples[fold][0].columns[a] for a in temp_df_imb.index]
PLSDA_feats_imb_mean = temp_df_imb.copy()

temp_df_bal = temp_df_bal / len(PLSDA_models_GAN_bal)
temp_df_bal = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_bal))
temp_df_bal = temp_df_bal.set_index(0).sort_values(by=1, ascending=False)
temp_df_bal.index = [generated_samples[fold][0].columns[a] for a in temp_df_bal.index]
PLSDA_feats_GAN_bal_mean = temp_df_bal.copy()

temp_df_GAN = temp_df_GAN / len(PLSDA_models_GAN)
temp_df_GAN = pd.DataFrame(zip(range(len(generated_samples[fold][0].columns)), temp_df_GAN))
temp_df_GAN = temp_df_GAN.set_index(0).sort_values(by=1, ascending=False)
temp_df_GAN.index = [generated_samples[fold][0].columns[a] for a in temp_df_GAN.index]
PLSDA_feats_GAN_mean = temp_df_GAN.copy()

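Calculate the intersection of the top k important PLS-DA features (by VIP score), for k from 1 to n - 1 (where n is the number of features in the dataset), between the reference ranking from the complete dataset (averaged over 20 iterations) and each of the following: a single iteration on the complete dataset, the imbalanced dataset, the GAN-augmented (balanced) dataset and the GAN-only dataset. A randomly shuffled ranking is used as a baseline.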
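
For the randomly shuffled baseline used in these comparisons, the expected overlap can be derived directly: each of the k top reference features lands in the top k of a random ranking with probability k/n, so the expected fraction of common features is k/n. The sketch below checks this by simulation. The number of features and the number of shuffles are arbitrary values chosen for illustration, not those of this dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 40    # hypothetical number of features
n_shuffles = 1000  # number of random rankings to average over

ks = np.arange(1, n_features)
mean_fraction = np.zeros(len(ks))
for _ in range(n_shuffles):
    # Random ranking; the reference ranking is taken to be 0, 1, ..., n-1,
    # so feature j belongs to the reference top k exactly when j < k
    shuffled = rng.permutation(n_features)
    counts = np.array([np.sum(shuffled[:k] < k) for k in ks])
    mean_fraction += counts / ks
mean_fraction /= n_shuffles

expected_fraction = ks / n_features
max_abs_dev = np.abs(mean_fraction - expected_fraction).max()
print(max_abs_dev)
```

A single shuffle, as used in the notebook, scatters around this k/n line, with the largest relative noise at small k.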

In [None]:
intersections_PLSDA = []
for i in range(1, len(PLSDA_feats_real1)):
    intersections_PLSDA.append(len(np.intersect1d(PLSDA_feats_real1.index[:i], PLSDA_feats_real.index[:i])))

In [None]:
intersections_PLSDA_bal = {}
for cl in PLSDA_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(PLSDA_feats_real)):
        int_bal.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_GAN_bal_mean[cl].index[:i])))
    intersections_PLSDA_bal[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_PLSDA_GAN = {}
for cl in PLSDA_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(PLSDA_feats_real)):
        int_GAN.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_GAN_mean[cl].index[:i])))
    intersections_PLSDA_GAN[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_PLSDA_imb = {}
for cl in PLSDA_feats_imb_mean:
    int_imb = []
    for i in range(1, len(PLSDA_feats_real)):
        int_imb.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_imb_mean[cl].index[:i])))
    intersections_PLSDA_imb[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_PLSDA = []
copy_shuffle = list(PLSDA_feats_real.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(PLSDA_feats_real)):
    random_intersections_PLSDA.append(len(np.intersect1d(PLSDA_feats_real.index[:i], copy_shuffle[:i])))

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)), 
            label = 'Reference (expected var.)', color='Black', s=5)

axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_imb).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Imbalanced Exp. Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_bal).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_GAN).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)

axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(random_intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Random Intersections', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axl.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_PLSDA)//8])
axl.set_ylim([0,1.01])
axl.set_title('PLS-DA', fontsize=18)

In [None]:
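# Matplotlib 3.6 renamed the seaborn styles; fall back to the old name on older versions
whitegrid = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
with plt.style.context(whitegrid):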
    f, (axl, axr) = plt.subplots(1,2,figsize=(8,3), constrained_layout=True)

    # Graph depicting intersection of important features
    axl.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)), 
                label = 'Reference (expected var.)', color='Black', s=2)

    axl.scatter(range(1,len(intersections_RF)+1),
                np.array(pd.DataFrame(intersections_RF_imb).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
                label = 'Exp. Imbalanced', color='Green', s=2)
    axl.scatter(range(1,len(intersections_RF)+1),
                np.array(pd.DataFrame(intersections_RF_bal).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
                label = 'Imb. + Min. Cl. GAN Data', color='Red', s=2)
    axl.scatter(range(1,len(intersections_RF)+1),
                np.array(pd.DataFrame(intersections_RF_GAN).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
                label = 'Only GAN data', color='Blue', s=2)

    axl.scatter(range(1,len(intersections_RF)+1),
                np.array(random_intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
                label = 'Random', color='Orange', s=2)

    axl.set_xlabel('Nº of Top Imp. (Gini Importance) Feat.', fontsize=13)
    axl.set_ylabel('Common Imp. Feat. Frac.', fontsize=13)
    axl.set_xlim([0,len(intersections_RF)//4])
    axl.set_ylim([0,1.01])
    axl.set_title('RF', fontsize=13)

    # Graph depicting intersection of important features
    axr.scatter(range(1,len(intersections_PLSDA)+1),
                np.array(intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)), 
                label = 'Reference (expected var.)', color='Black', s=2)

    axr.scatter(range(1,len(intersections_PLSDA)+1),
                np.array(pd.DataFrame(intersections_PLSDA_imb).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
                label = 'Exp. Imbalanced', color='Green', s=2)
    axr.scatter(range(1,len(intersections_PLSDA)+1),
                np.array(pd.DataFrame(intersections_PLSDA_bal).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
                label = 'Imb. + Min. Cl. GAN Data', color='Red', s=2)
    axr.scatter(range(1,len(intersections_PLSDA)+1),
                np.array(pd.DataFrame(intersections_PLSDA_GAN).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
                label = 'Only GAN data', color='Blue', s=2)

    axr.scatter(range(1,len(intersections_PLSDA)+1),
                np.array(random_intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)),
                label = 'Random', color='Orange', s=2)

    #axr.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3, handletextpad=0)
    axr.legend(loc='center right', fontsize=9, ncol=2, markerscale=3, handletextpad=0, frameon=True,
              columnspacing=0.4, bbox_to_anchor=(1,0.45), borderaxespad=0.4, labelspacing=0.4)
    axr.set_xlabel('Nº of Top Imp. (VIP Score) Feat.', fontsize=13)
    #axr.set_ylabel('Common Imp. Feat. Fraction', fontsize=11.5)
    axr.set_xlim([0,len(intersections_PLSDA)//4])
    axr.set_ylim([0,1.01])
    axr.set_title('PLS-DA', fontsize=13)
    def fmt_two_digits(x, pos):
        return f'{x:.2f}'
    axl.yaxis.set_major_formatter(ticker.FuncFormatter(fmt_two_digits))
    axr.yaxis.set_major_formatter(ticker.FuncFormatter(fmt_two_digits))
    
    axl.tick_params(axis='both', which='major', labelsize=11)
    axr.tick_params(axis='both', which='major', labelsize=11)
f.savefig('images/TDint_Imbalanced_ImpFeat_plot.png', dpi=400)
f.savefig('images/TDint_Imbalanced_ImpFeat_plot.pdf', dpi=400)

#### Comparing Against Significant Features (by Univariate Analysis) of the Complete Dataset

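Calculate the intersection of the top k features, for k from 1 to one less than the number of entries in `uni_results`, between the feature ordering from the univariate analysis of the complete dataset (`uni_results`) and the important RF features of each of the following: a single iteration on the complete dataset, the imbalanced dataset, the GAN-augmented (balanced) dataset and the GAN-only dataset. A randomly shuffled ranking is used as a baseline.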
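
How `uni_results` was built is defined earlier in the notebook and is not repeated in this section. The comparison only relies on its index being ordered from most to least significant feature. As a rough sketch of what such an ordering could look like, the example below ranks the features of a toy two-class dataset by the p-value of Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`). The test choice, the data and the names `toy_uni`, `class_a` and `class_b` are assumptions for illustration, not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Toy two-class data: 10 samples per class, 4 hypothetical features
features = ['f0', 'f1', 'f2', 'f3']
class_a = pd.DataFrame(rng.normal(0, 1, size=(10, 4)), columns=features)
class_b = pd.DataFrame(rng.normal(0, 1, size=(10, 4)), columns=features)
class_b['f0'] += 5  # only f0 truly differs between the two classes

# Welch's t-test for each feature (column-wise), then sort by ascending p-value
_, p_values = stats.ttest_ind(class_a.values, class_b.values, axis=0, equal_var=False)
toy_uni = pd.DataFrame({'p-value': p_values}, index=features).sort_values('p-value')
print(toy_uni)
```

With an index ordered this way, `toy_uni.index[:k]` plays the same role as `uni_results.index[:i]` in the cells that follow.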

In [None]:
intersections_RF_uni = []
for i in range(1, len(uni_results)):
    intersections_RF_uni.append(len(np.intersect1d(RF_feats_real1.index[:i], uni_results.index[:i])))

In [None]:
intersections_RF_bal_uni = {}
for cl in RF_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(uni_results)):
        int_bal.append(len(np.intersect1d(uni_results.index[:i], RF_feats_GAN_bal_mean[cl].index[:i])))
    intersections_RF_bal_uni[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_RF_GAN_uni = {}
for cl in RF_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(uni_results)):
        int_GAN.append(len(np.intersect1d(uni_results.index[:i], RF_feats_GAN_mean[cl].index[:i])))
    intersections_RF_GAN_uni[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_RF_imb_uni = {}
for cl in RF_feats_imb_mean:
    int_imb = []
    for i in range(1, len(uni_results)):
        int_imb.append(len(np.intersect1d(uni_results.index[:i], RF_feats_imb_mean[cl].index[:i])))
    intersections_RF_imb_uni[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_RF_uni = []
copy_shuffle = list(uni_results.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(uni_results)):
    random_intersections_RF_uni.append(len(np.intersect1d(uni_results.index[:i], copy_shuffle[:i])))

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(intersections_RF_uni) / np.array(range(1,len(intersections_RF_uni)+1)), 
            label = 'Ref.-Univariate', color='Black', s=5)

axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_imb_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Imbalanced Exp. Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_bal_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_GAN_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)

axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(random_intersections_RF_uni) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Random Intersections', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3, handletextpad=0)
axl.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_RF_uni)//8])
axl.set_ylim([0,1.01])
axl.set_title('Random Forest', fontsize=18)
plt.show()

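Calculate the intersection of the top k features, for k from 1 to one less than the number of entries in `uni_results`, between the feature ordering from the univariate analysis of the complete dataset (`uni_results`) and the important PLS-DA features (by VIP score) of each of the following: a single iteration on the complete dataset, the imbalanced dataset, the GAN-augmented (balanced) dataset and the GAN-only dataset. A randomly shuffled ranking is used as a baseline.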

In [None]:
intersections_PLSDA_uni = []
for i in range(1, len(uni_results)):
    intersections_PLSDA_uni.append(len(np.intersect1d(PLSDA_feats_real1.index[:i], uni_results.index[:i])))

In [None]:
intersections_PLSDA_bal_uni = {}
for cl in PLSDA_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(uni_results)):
        int_bal.append(len(np.intersect1d(uni_results.index[:i], PLSDA_feats_GAN_bal_mean[cl].index[:i])))
    intersections_PLSDA_bal_uni[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_PLSDA_GAN_uni = {}
for cl in PLSDA_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(uni_results)):
        int_GAN.append(len(np.intersect1d(uni_results.index[:i], PLSDA_feats_GAN_mean[cl].index[:i])))
    intersections_PLSDA_GAN_uni[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_PLSDA_imb_uni = {}
for cl in PLSDA_feats_imb_mean:
    int_imb = []
    for i in range(1, len(uni_results)):
        int_imb.append(len(np.intersect1d(uni_results.index[:i], PLSDA_feats_imb_mean[cl].index[:i])))
    intersections_PLSDA_imb_uni[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_PLSDA_uni = []
copy_shuffle = list(uni_results.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(uni_results)):
    random_intersections_PLSDA_uni.append(len(np.intersect1d(uni_results.index[:i], copy_shuffle[:i])))

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(intersections_PLSDA_uni) / np.array(range(1,len(intersections_PLSDA_uni)+1)), 
            label = 'Ref.-Univariate', color='Black', s=5)

axl.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_imb_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Imbalanced Exp. Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_bal_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_GAN_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)


axl.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(random_intersections_PLSDA_uni) / np.array(range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Random Intersections', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axl.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_PLSDA_uni)//8])
axl.set_ylim([0,1.01])
axl.set_title('PLS-DA', fontsize=18)

In [None]:
f, (axl, axr) = plt.subplots(1,2,figsize=(12,4),constrained_layout=True)

# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF_uni)+1), 
            np.array(intersections_RF_uni) / np.array(range(1,len(intersections_RF_uni)+1)), 
            label = 'Ref.-Univariate Inter.', color='Black', s=5)

axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_imb_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Real Imbalanced', color='Green', s=5)
axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_bal_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Imb. + Min. Cl. GAN Data', color='Red', s=5)
axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(pd.DataFrame(intersections_RF_GAN_uni).mean(axis=1)) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Only GAN data', color='Blue', s=5)

axl.scatter(range(1,len(intersections_RF_uni)+1),
            np.array(random_intersections_RF_uni) / np.array(range(1,len(intersections_RF_uni)+1)),
            label = 'Random', color='Orange', s=5)

axl.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=13)
axl.set_ylabel('Fraction of Common Compounds', fontsize=13)
axl.set_xlim([0,len(intersections_RF_uni)//4])
axl.set_ylim([0,1.01])
axl.set_title('Random Forest', fontsize=18)

# Graph depicting intersection of important features
axr.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(intersections_PLSDA_uni) / np.array(range(1,len(intersections_PLSDA_uni)+1)), 
            label = 'Ref.-Univariate Inter.', color='Black', s=5)

axr.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_imb_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Real Imbalanced', color='Green', s=5)
axr.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_bal_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Imb. + Min. Cl. GAN Data', color='Red', s=5)
axr.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(pd.DataFrame(intersections_PLSDA_GAN_uni).mean(axis=1)) / np.array(
                range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Only GAN data', color='Blue', s=5)

axr.scatter(range(1,len(intersections_PLSDA_uni)+1),
            np.array(random_intersections_PLSDA_uni) / np.array(range(1,len(intersections_PLSDA_uni)+1)),
            label = 'Random', color='Orange', s=5)

axr.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axr.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=13)
axr.set_ylabel('Fraction of Common Compounds', fontsize=13)
axr.set_xlim([0,len(intersections_PLSDA_uni)//4])
axr.set_ylim([0,1.01])
axr.set_title('PLS-DA', fontsize=18)
f.savefig('images/TDint_Imbalanced_ImpFeat_UNI_plot.png', dpi=400)
f.savefig('images/TDint_Imbalanced_ImpFeat_UNI_plot.pdf', dpi=400)