# Data Augmentation - Conditional Wasserstein GANs - GP

### Dataset: Maternal Plasma Dataset

This notebook uses a CWGAN-GP model to generate treated intensity data and tests whether GAN-based data augmentation mitigates class imbalance issues (in supervised analysis) by supplementing the minority class with GAN-generated samples.

Notebook Organization:
- Read the dataset
- Treatment and Univariate Analysis of the dataset
- Split the dataset into 12 training/test splits (4 folds for each of the 3 classes taken in turn as the minority class; details in that section).
- Set up the CWGAN-GP model and train 12 models, one per fold, with the corresponding training data.
- Generate GAN samples and add them to the corresponding imbalanced training sets in small increments.
- Build and evaluate RF and PLS-DA models from the imbalanced dataset, from the imbalanced dataset supplemented with GAN samples of the minority class, and from a dataset made purely of GAN samples.
- Compare important features in the RF and PLS-DA models against important features of models built with the complete dataset.

#### Due to stochasticity, re-running the notebook will give slightly different results, so the figures may differ slightly from those in the paper.

In [None]:
import json
from time import perf_counter

import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.patches as mpatches
from matplotlib import ticker

import seaborn as sns
from tqdm import tqdm
from IPython import display as ipythondisplay

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import precision_recall_fscore_support, f1_score

import tensorflow as tf
from keras import backend

import pickle

# Metabolinks package
import metabolinks as mtl
import metabolinks.transformations as transf

# Python files in the repository
import multianalysis as ma
import gan_evaluation_metrics as gem
import linear_augmentation_functions as laf

In [None]:
# Import needed functions from GAN_functions
from GAN_functions import gradient_penalty_cwgan
from GAN_functions import critic_loss_wgan
from GAN_functions import generator_loss_wgan

### Maternal Plasma Dataset

168 samples in 3 classes (56 samples per class), analysed by RP-HPLC followed by QTOF in ESI+ mode: maternal blood plasma from the first trimester (FTMP), maternal blood plasma at delivery (DMP) and umbilical cord blood plasma at delivery (UCB).

Data acquired by:

- Habra, H.; Kachman, M.; Padmanabhan, V.; Burant, C.; Karnovsky, A.; Meijer, J. Alignment and Analysis of a Disparately Acquired Multibatch Metabolomics Study of Maternal Pregnancy Samples. J. Proteome Res. 2022, 21, 2936–2946, doi:10.1021/acs.jproteome.2c00371.

- Data obtained from Metabolomics Workbench project ID PR001352, study ID: ST002134, Analysis ID: AN003489.

Data matrices retained only features that occur (globally) at least twice across all samples of the dataset (filtering/alignment).
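
A minimal sketch of this kind of filter, assuming missing values are encoded as NaN and that "occurring at least twice" means at least two non-missing values across all samples. The helper name `keep_features_min_occurrence` and the toy matrix are illustrative only; later in this notebook the equivalent step is done with `transf.keep_atleast` from Metabolinks.

```python
import numpy as np

def keep_features_min_occurrence(data, minimum=2):
    """Keep columns (features) with at least `minimum` non-missing values."""
    data = np.asarray(data, dtype=float)
    counts = np.sum(~np.isnan(data), axis=0)  # non-missing values per feature
    keep = counts >= minimum
    return data[:, keep], keep

# Toy matrix: 3 samples (rows) x 4 features (columns)
toy = np.array([[1.0, np.nan, 3.0, np.nan],
                [2.0, np.nan, np.nan, np.nan],
                [np.nan, 5.0, 4.0, np.nan]])
filtered, kept = keep_features_min_occurrence(toy, minimum=2)
print(kept)  # only the first and third features are kept
```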

### Reading the data to be analysed and augmented

In [None]:
base_file = pd.read_csv('ST002134_AN003489_res.txt', sep='\t')
base_file = base_file.set_index('Samples')
base_file

In [None]:
df = base_file.iloc[:, 1:].replace({0:np.nan})
samples = list(base_file.iloc[:, 0].values)
df

In [None]:
set(samples)

In [None]:
# Excluding the pooled samples
selection = []
for i in samples:
    if i.startswith('SampleType:POOLED'):
        selection.append(False)
    else:
        selection.append(True)
df = df.loc[selection]
df_initial = df.copy()

In [None]:
# Creating the list of 'targets' (labels of samples) of the dataset
labels = []
for i in np.array(samples)[selection]:
    if i.startswith('SampleType:Delivery maternal plasma'):
        labels.append('DMP')
    elif i.startswith('SampleType:First trimester maternal plasma'):
        labels.append('FTMP')
    else:
        labels.append('UCB')

In [None]:
pd.Series(labels).value_counts()

### Data Pre-Treatments

In [None]:
def impute_RF(df, nearest_features=100, n_trees=50, max_iter=10):
    "Random Forest Imputation of Missing Values."
    rf_estimator = ExtraTreesRegressor(n_estimators=n_trees)
    imputer = IterativeImputer(random_state=0, estimator=rf_estimator,
                           n_nearest_features=nearest_features,
                           min_value=0.0, max_value=1.0e10,
                           max_iter=max_iter,
                           verbose=0)
    imputed_data = imputer.fit_transform(df)
    res = pd.DataFrame(imputed_data, index=df.index, columns=df.columns)
    ncols = len(res.columns)
    res = res.dropna(axis='columns', how='any')
    ncols2 = len(res.columns)
    if ncols > ncols2:
        print(f'{ncols-ncols2} features dropped')
    return res

In [None]:
# Represents Binary Simplification pre-treatment
def df_to_bool(df):
    "Transforms data into 'binary' matrices."
    return df.mask(df.notnull(), 1).mask(df.isnull(), 0)

# Performs pre-treatment combinations
def compute_transf(df, norm_ref=None, lamb=None):
    "Computes combinations of pre-treatments and BinSim and returns after treatment datasets in a dict."
    
    data = df.copy()
    
    # Imputation of Missing Values
    imputed = transf.fillna_frac_min_feature(data.T, fraction=0.2).T
    
    # Imputation by RF
    datacols = data.columns
    data.columns = [str(col) for col in data.columns]
    imputedRF = impute_RF(data, nearest_features=100, n_trees=10)
    imputedRF.columns = datacols

    # Normalization
    if norm_ref is not None:
        # Normalization by a reference feature
        norm = transf.normalize_ref_feature(imputed, norm_ref, remove=True)
    else:
        # Normalization by the total sum of intensities
        norm = transf.normalize_sum(imputed)
        # Normalization by PQN
        #norm = transf.normalize_PQN(imputed, ref_sample='mean')
    
    # Normalization - RF
    if norm_ref is not None:
        # Normalization by a reference feature
        normRF = transf.normalize_ref_feature(imputedRF, norm_ref, remove=True)
    else:
        # Normalization by the total sum of intensities
        normRF = transf.normalize_sum(imputedRF)
        # Normalization by PQN
        #normRF = transf.normalize_PQN(imputedRF, ref_sample='mean')
    
    # Pareto Scaling and Generalized Logarithmic Transformation
    P = transf.pareto_scale(imputed)
    NP = transf.pareto_scale(norm)
    NGP = transf.pareto_scale(transf.glog(norm, lamb=lamb))
    GP = transf.pareto_scale(transf.glog(imputed, lamb=lamb))

    # Pareto Scaling and Generalized Logarithmic Transformation - RF
    P_RF = transf.pareto_scale(imputedRF)
    NP_RF = transf.pareto_scale(normRF)
    NGP_RF = transf.pareto_scale(transf.glog(normRF, lamb=lamb))
    GP_RF = transf.pareto_scale(transf.glog(imputedRF, lamb=lamb))
    
    # Store results
    dataset = {}
    dataset['data'] = df

    dataset['BinSim'] = df_to_bool(data)
    dataset['Ionly'] = imputed
    dataset['P'] = P
    dataset['NP'] = NP
    dataset['GP'] = GP
    dataset['NGP'] = NGP
    
    dataset['Ionly_RF'] = imputedRF
    dataset['P_RF'] = P_RF
    dataset['NP_RF'] = NP_RF
    dataset['GP_RF'] = GP_RF
    dataset['NGP_RF'] = NGP_RF
    
    return dataset
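
The 'NGP' treatment used throughout the notebook (normalization, generalized logarithm, Pareto scaling) can be sketched in plain NumPy as below. This is an illustration under assumptions, not the Metabolinks implementation: the glog form shown is one common definition, and the choice `lamb = min / 10` is a hypothetical default picked for the example.

```python
import numpy as np

def normalize_sum(X):
    """Divide each sample (row) by its total intensity."""
    return X / X.sum(axis=1, keepdims=True)

def glog(X, lamb):
    """One common generalized logarithm: log2((x + sqrt(x^2 + lamb^2)) / 2)."""
    return np.log2((X + np.sqrt(X**2 + lamb**2)) / 2)

def pareto_scale(X):
    """Center each feature and divide by the square root of its standard deviation."""
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

# Toy intensity matrix: 3 samples (rows) x 3 features (columns)
toy = np.array([[10.0, 200.0, 30.0],
                [20.0, 100.0, 60.0],
                [40.0, 400.0, 15.0]])

N = normalize_sum(toy)                           # each row now sums to 1
NGP = pareto_scale(glog(N, lamb=N.min() / 10))   # assumed lambda, for illustration
```

After the last step every feature has zero mean, while features with larger variance keep more weight than they would under autoscaling.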

In [None]:
df = df.replace({0:np.nan})
p_df = compute_transf(df, norm_ref=None, lamb=None)

#### Colors for plots to ensure consistency

3 classes, 56 samples each

In [None]:
# customize label colors for the 3 sample classes

colours = sns.color_palette('Set1', 3)

ordered_labels = ['DMP', 'FTMP', 'UCB']

label_colors = {lbl: c for lbl, c in zip(ordered_labels, colours)}
sample_colors = [label_colors[lbl] for lbl in labels]

sns.palplot(label_colors.values())
new_ticks = plt.xticks(range(len(ordered_labels)), ordered_labels)

#### Univariate Analysis

In [None]:
target=labels

In [None]:
import itertools

alpha = 0.0001
abs_log2FC_threshold = np.log2(6)

univariate_results = {}

for a,b in itertools.combinations(pd.unique(target),2):
    print(a,b)
    selection = [i in [a, b] for i in target]
    target_temp = list(np.array(target)[selection])
    
    df_temp = df_initial.loc[selection, :].copy()
    df_temp = transf.keep_atleast(df_temp, minimum=2)
    df_I = transf.fillna_frac_min_feature(df_temp, fraction=0.2)
    df_N = transf.normalize_sum(df_I)
    df_NGP = transf.pareto_scale(transf.glog(df_N, lamb=None))
    
    uni_results = ma.compute_FC_pvalues_2groups(
                              normalized=df_N, # Used for Fold-Change Computation
                              processed=df_NGP, # Used for p-value computation
                              labels=target_temp, # Labels of the samples
                              equal_var=True, # Consider variance between groups as equal
                              useMW=False) # Use Mann-Whitney Test if True or standard T-test if False
    
    filt_uni_results = uni_results[uni_results['FDR adjusted p-value'] < alpha].copy()
    # Calculate absolute Log2 Fold-Change
    filt_uni_results['abs_log2FC'] = abs(filt_uni_results['log2FC'])
    # Select
    filt_uni_results = filt_uni_results[filt_uni_results['abs_log2FC'] > abs_log2FC_threshold]
    filt_uni_results = filt_uni_results.drop(columns='abs_log2FC')
    univariate_results[(a,b)] = filt_uni_results
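
The selection rule in the loop above (FDR-adjusted p-value below `alpha` and absolute log2 fold-change above `log2(6)`) can be illustrated on toy numbers. The sketch assumes a Benjamini-Hochberg adjustment; whether `ma.compute_FC_pvalues_2groups` uses exactly this correction is an assumption, and `bh_adjust` is a hypothetical helper written for this example.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Toy results for 4 features
pvals = np.array([1e-8, 0.02, 1e-6, 0.5])
log2fc = np.array([3.1, 4.0, -1.2, -3.0])

adj = bh_adjust(pvals)
selected = (adj < 0.0001) & (np.abs(log2fc) > np.log2(6))
print(selected)  # only the first feature passes both thresholds
```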

In [None]:
idxs = []
for i in univariate_results:
    idxs.extend(list(univariate_results[i].index))

In [None]:
uni_df = df_initial[pd.unique(idxs)]
uni_df

### Fitting RF and PLS-DA models to the complete dataset

Results and Feature Importance are estimated by stratified n_fold-cross validation averaged across 20 iterations.
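
As a rough illustration of repeated stratified cross-validation, the sketch below draws new stratified folds at every iteration. It only shows how the folds are formed; `stratified_folds` is a hypothetical helper, and the actual model fitting and averaging are handled inside `ma.RF_model_CV` and `ma.PLSDA_model_CV`, whose internals are not shown here.

```python
import numpy as np

def stratified_folds(labels, n_fold, rng):
    """Assign each sample to one of n_fold folds, keeping class proportions."""
    labels = np.asarray(labels)
    folds = np.empty(len(labels), dtype=int)
    for cl in np.unique(labels):
        idx = rng.permutation(np.where(labels == cl)[0])  # shuffle within class
        folds[idx] = np.arange(len(idx)) % n_fold          # deal out round-robin
    return folds

rng = np.random.default_rng(0)
toy_labels = np.array(['DMP'] * 10 + ['FTMP'] * 10 + ['UCB'] * 10)

n_iter, n_fold = 3, 5   # the notebook uses 20 iterations and n_fold = 5
test_sizes = []
for _ in range(n_iter):
    folds = stratified_folds(toy_labels, n_fold, rng)  # new random folds each iteration
    for k in range(n_fold):
        test_mask = folds == k
        # A model would be fit on the ~test_mask samples and scored on the
        # test_mask samples; scores are then averaged over folds and iterations.
        test_sizes.append(int(test_mask.sum()))
```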

In [None]:
np.random.seed(485)
n_fold = 5

RF_model_real = ma.RF_model_CV(p_df['NGP'], target,
                               iter_num=20, n_fold=n_fold, n_trees=200) 
RF_feats_real = pd.DataFrame(RF_model_real['important_features']).set_index(0).sort_values(by=1, ascending=False)
RF_feats_real.index = [p_df['NGP'].columns[i] for i in RF_feats_real.index]
#RF_feats_real 

In [None]:
np.random.seed(485)
n_fold = 5
PLSDA_model_real = ma.PLSDA_model_CV(p_df['NGP'], target,
                                     n_comp=4, iter_num=20, n_fold=n_fold, feat_type='VIP') 

PLSDA_feats_real = pd.DataFrame(PLSDA_model_real['important_features']).set_index(0).sort_values(by=1, ascending=False)
PLSDA_feats_real.index = [p_df['NGP'].columns[i] for i in PLSDA_feats_real.index]

# Data Augmentation

## Creating Imbalanced Datasets

This dataset has three balanced classes, so each class was taken in turn as the minority class.

For one class chosen as the minority class:

We split the dataset in 4 different ways. In each split, the training set had 42 samples of each majority class and 14 samples of the minority class, which left 14 samples of each majority class and 42 samples of the minority class for the test set. To do this, the 56 samples of each class were divided into 4 folds of 14; for the majority classes, 3 folds were combined for the training set, while for the minority class a single fold was used. This split happened before data pre-treatment. Both sets were then independently treated with the same pre-treatment pipeline, except that on the test set we performed a 'faux' Pareto scaling using the feature means and standard deviations of the training set. The training and test sets have (in most cases) a vastly different balance of class samples, so feature averages and standard deviations can be quite different between them, especially in key features for discrimination. The 'faux' Pareto scaling compensates for this. The untreated training sets were linearly augmented, and the augmented data was then treated (using a normal Pareto scaling, in this case) to generate samples to train the CWGAN-GP models.

This process was then repeated twice, taking each of the other 2 classes in turn as the minority class.
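
The splitting scheme described above can be sketched as follows. The function name `imbalanced_split` and the toy index layout (classes stored in consecutive blocks of 56) are assumptions made for illustration; the notebook's own cell below works on the real label positions.

```python
import numpy as np

def imbalanced_split(class_indices, minority, fold, n_folds=4):
    """Return train/test indices for one minority class and one fold.

    class_indices: dict mapping class label -> array of shuffled sample indices.
    """
    train, test = [], []
    for cl, idx in class_indices.items():
        idx = np.asarray(idx)
        fold_len = len(idx) // n_folds
        in_fold = idx[fold * fold_len:(fold + 1) * fold_len]
        out_fold = np.concatenate((idx[:fold * fold_len], idx[(fold + 1) * fold_len:]))
        if cl == minority:
            train.extend(in_fold)   # minority class: one fold for training
            test.extend(out_fold)   # and the other three folds for testing
        else:
            train.extend(out_fold)  # majority classes: three folds for training
            test.extend(in_fold)    # and one fold for testing
    return np.array(train), np.array(test)

rng = np.random.default_rng(0)
# 3 classes x 56 samples, toy indices 0..167
class_indices = {cl: rng.permutation(np.arange(k * 56, (k + 1) * 56))
                 for k, cl in enumerate(['FTMP', 'DMP', 'UCB'])}

train_idx, test_idx = imbalanced_split(class_indices, minority='UCB', fold=0)
print(len(train_idx), len(test_idx))  # 98 70
```

With 56 samples per class, each training set has 42 + 42 + 14 = 98 samples and each test set has 14 + 14 + 42 = 70.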

In [None]:
df_I = transf.fillna_frac_min_feature(df_initial.T, fraction=0.2).T
df_I

In [None]:
df_storage_train = {}
df_storage_test = {}
lbl_storage_train = {}
lbl_storage_test = {}
real_samples = {}

rng = np.random.default_rng(7519)

# Randomly order the samples of each class to separate them into the folds after
permutations = {}
permutations['FTMP'] = list(rng.permutation(np.where(np.array(labels) == 'FTMP')[0]))
permutations['DMP'] = list(rng.permutation(np.where(np.array(labels) == 'DMP')[0]))
permutations['UCB'] = list(rng.permutation(np.where(np.array(labels) == 'UCB')[0]))

for lbl in ['FTMP', 'DMP', 'UCB']:
    fold_len = 56//4
    df_storage_train[lbl] = {}
    df_storage_test[lbl] = {}
    lbl_storage_train[lbl] = {}
    lbl_storage_test[lbl] = {}
    real_samples[lbl] = {}
    for i in range(4):
        train_idxs = {}
        test_idxs = {}
        for cl in permutations.keys():
            if cl == lbl:
                train_idxs[cl] = list(np.array(permutations[cl])[i*fold_len: (i+1)*fold_len])
                test_idxs[cl] = list(np.array(permutations[cl])[: i*fold_len]) + list(
                    np.array(permutations[cl])[(i+1)*fold_len:])
            else:
                train_idxs[cl] = list(np.array(permutations[cl])[: i*fold_len]) + list(
                    np.array(permutations[cl])[(i+1)*fold_len:])
                test_idxs[cl] = list(np.array(permutations[cl])[i*fold_len: (i+1)*fold_len])
        
        print('Minority Class:', lbl, 'Fold nº:', i+1)
        print('Train FTMP/DMP/UCB:', len(train_idxs['FTMP']), len(train_idxs['DMP']), len(train_idxs['UCB']))
        print('Test FTMP/DMP/UCB: ', len(test_idxs['FTMP']), len(test_idxs['DMP']), len(test_idxs['UCB']))

        train_idxs = train_idxs['FTMP'] + train_idxs['DMP'] + train_idxs['UCB']
        test_idxs = test_idxs['FTMP'] + test_idxs['DMP'] + test_idxs['UCB']
        
        df_imputed = df_I.copy()
        
        # Create the imbalanced and test set
        df_storage_train[lbl][i+1] = df_imputed.iloc[train_idxs]
        lbl_storage_train[lbl][i+1] = list(np.array(labels)[train_idxs])

        df_storage_test[lbl][i+1] = df_imputed.iloc[test_idxs]
        lbl_storage_test[lbl][i+1] = list(np.array(labels)[test_idxs])

        # Data pretreatment of the imbalanced and test dataset
        df_N = transf.normalize_sum(df_storage_train[lbl][i+1])
        df_G = transf.glog(df_N, lamb=None)
        real_samples[lbl][i+1] = transf.pareto_scale(df_G)

        test_datamatrix_N = transf.normalize_sum(df_storage_test[lbl][i+1])
        test_datamatrix_G = transf.glog(test_datamatrix_N, lamb=None)
        df_storage_test[lbl][i+1] = (test_datamatrix_G - df_G.mean())/np.sqrt(df_G.std()) # 'Faux' Pareto Scale
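
The 'faux' Pareto scaling in the last line of the cell above reuses the training-set statistics on the test set. A minimal NumPy sketch of that idea follows; `pareto_fit` and `pareto_apply` are hypothetical helper names, the data is random, and `ddof=1` is chosen to match the pandas default for `std()`.

```python
import numpy as np

def pareto_fit(train):
    """Feature-wise mean and standard deviation of the training set."""
    return train.mean(axis=0), train.std(axis=0, ddof=1)

def pareto_apply(X, mean, std):
    """Center by `mean` and divide by the square root of `std`."""
    return (X - mean) / np.sqrt(std)

rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, scale=2.0, size=(20, 4))
test = rng.normal(loc=6.0, scale=2.0, size=(8, 4))

mean, std = pareto_fit(train)
train_scaled = pareto_apply(train, mean, std)  # regular Pareto scaling
test_scaled = pareto_apply(test, mean, std)    # 'faux' Pareto scaling
```

Only the training set ends up exactly centered; the test set is placed in the same coordinate system without its own class balance influencing the scaling.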

In [None]:
GENERATE_LIN=False
if GENERATE_LIN:
    aug_df_storage_train = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
    aug_lbl_storage_train = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
    # Only generation of samples based on the imbalanced dataset
    for lbl in ['FTMP', 'DMP', 'UCB']:
        for i in range(1,5):
            start = perf_counter()
            data, lbls = laf.artificial_dataset_generator(df_storage_train[lbl][i], labels=lbl_storage_train[lbl][i],
                                                max_new_samples_per_label=384, binary=False,
                                                rnd=list(np.linspace(0.15,0.9,6)), 
                                                binary_rnd_state=None, rnd_state=3434556)

            data_N = transf.normalize_sum(data)
            data_G = transf.pareto_scale(transf.glog(data_N, lamb=None))
            data_treated = data_G

            aug_df_storage_train[lbl][i] = data_treated.copy()
            aug_lbl_storage_train[lbl][i] = lbls
            
            end = perf_counter()
            print(f'Simple augmentation of data done! Minority class: {lbl}, Fold: {i}. Took {(end - start):.3f} s')

In [None]:
# Store the generated data so that it does not need to be generated again
if GENERATE_LIN:
    for lbl in aug_df_storage_train.keys():
        for i in aug_df_storage_train[lbl].keys():
            # Store Linear Augmented Data
            aug_df_storage_train[lbl][i].to_csv('store_data/MPint_imb_generated_data_'+lbl+str(i)+'.csv')

            # Store labels/classes corresponding to each generated sample
            with open('store_data/MPint_imb_data_lbls_'+lbl+str(i)+'.txt', 'w') as a:
                for item in aug_lbl_storage_train[lbl][i]:
                    a.write("{}\n".format(item))

# Read back the stored linearly augmented data
if not GENERATE_LIN:
    aug_df_storage_train = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
    aug_lbl_storage_train = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
    for lbl in aug_df_storage_train:
        for i in df_storage_train[lbl].keys():

            data_treated = pd.read_csv('store_data/MPint_imb_generated_data_'+lbl+str(i)+'.csv')
            aug_df_storage_train[lbl][i] = data_treated.set_index(data_treated.columns[0])
            with open('store_data/MPint_imb_data_lbls_'+lbl+str(i)+'.txt') as a:
                aug_lbl_storage_train[lbl][i] = a.read().split('\n')[:-1]

In [None]:
# Colors to use in plots
colours2 = sns.color_palette('tab20', 6)[:6]

ordered_labels_test = ('FTMP','FTMP - GAN','DMP','DMP - GAN','UCB','UCB - GAN')
label_colors_test = {lbl: c for lbl, c in zip(ordered_labels_test, colours2)}

sns.palplot(label_colors_test.values())
new_ticks_test = plt.xticks(range(len(ordered_labels_test)), ordered_labels_test)

## Conditional Wasserstein GAN - GP model

This model was built by combining WGAN-GP models with Conditional GAN models. The WGAN-GP part was originally based on https://keras.io/examples/generative/wgan_gp/#wasserstein-gan-wgan-with-gradient-penalty-gp. The Conditional GAN part was based on https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/ (generator and discriminator models) and on https://keras.io/examples/generative/conditional_gan/, without using OOP (loss functions and training/training steps).
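
For orientation, the standard WGAN-GP objectives can be sketched in NumPy as below. This is not the code in `GAN_functions` (its contents are not shown in this notebook), and the class-label input of the conditional model is omitted. To keep the example runnable without automatic differentiation, a toy linear critic `f(x) = x @ w` is assumed, for which the gradient with respect to the input is simply `w`.

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: mean score of fakes minus mean score of reals."""
    return np.mean(fake_scores) - np.mean(real_scores)

def generator_loss(fake_scores):
    """Wasserstein generator loss: negative mean critic score of fakes."""
    return -np.mean(fake_scores)

def gradient_penalty_linear(w, real, fake, rng):
    """Gradient penalty for a toy linear critic f(x) = x @ w.

    The gradient of f with respect to x equals w at every point, so the penalty
    on the interpolated samples reduces to (||w|| - 1)^2. A neural critic needs
    automatic differentiation (e.g. tf.GradientTape) instead.
    """
    eps = rng.uniform(size=(real.shape[0], 1))
    interpolated = eps * real + (1 - eps) * fake     # points between real and fake
    grads = np.tile(w, (interpolated.shape[0], 1))   # df/dx at each interpolated point
    norms = np.linalg.norm(grads, axis=1)
    return np.mean((norms - 1.0) ** 2)

rng = np.random.default_rng(0)
w = np.array([0.6, 0.8])                  # unit-norm critic weights
real = rng.normal(1.0, 1.0, size=(8, 2))
fake = rng.normal(0.0, 1.0, size=(8, 2))

gp = gradient_penalty_linear(w, real, fake, rng)
total_critic_loss = critic_loss(real @ w, fake @ w) + 10 * gp  # penalty weight of 10
```

The penalty pushes the critic towards gradients of norm 1 along lines between real and generated samples, which is how WGAN-GP enforces the Lipschitz constraint without weight clipping.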

In [None]:
def generator_model(len_input, len_output, n_hidden_nodes, n_labels):
    "Make the generator model of CWGAN-GP."

    data_input = tf.keras.Input(shape=(len_input,), name='data') # Take intensity input
    label_input = tf.keras.Input(shape=(n_labels,), name='label') # Take Label Input

    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m2 = tf.keras.layers.Reshape((256*n_labels,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m2]) # Concatenate intensity and label data
    # Hidden Dense Layer and Normalization
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.BatchNormalization()(joined_data)

    # Output - number of features of sample to make
    output = tf.keras.layers.Dense(len_output, activation='linear', use_bias=True)(joined_data)
    
    generator = tf.keras.Model(inputs=[data_input, label_input], outputs=output)
    
    return generator

def critic_model(len_input, n_hidden_nodes, n_labels):
    "Make the critic model of CWGAN-GP."
    
    label_input = tf.keras.Input(shape=(n_labels,)) # Take Label Input
    data_input = tf.keras.Input(shape=(len_input,)) # Take intensity input
    #data_input = tf.keras.layers.Reshape((len_input,1,))(data_input)

    # Treat label input to concatenate to intensity data after
    label_m = tf.keras.layers.Embedding(n_labels, 30, input_length=1)(label_input)
    label_m = tf.keras.layers.Dense(256, activation='linear', use_bias=True)(label_m)
    #label_m = tf.keras.layers.Reshape((len_input,1,))(label_m)
    label_m = tf.keras.layers.Reshape((256*n_labels,))(label_m)

    joined_data = tf.keras.layers.Concatenate()([data_input, label_m]) # Concatenate intensity and label data
    # Hidden Dense Layer (Normalization worsened results here)
    joined_data = tf.keras.layers.Dense(n_hidden_nodes, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(128, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    joined_data = tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu, use_bias=True)(joined_data)
    #joined_data = tf.keras.layers.BatchNormalization()(joined_data)

    # Output Layer - 1 node for critic decision
    output = tf.keras.layers.Dense(1, activation='linear', use_bias=True)(joined_data)
    
    critic = tf.keras.Model(inputs=[data_input, label_input], outputs=output)

    return critic

In [None]:
def generate_predictions(model, num_examples_to_generate, len_input, input_dist, uni_lbls):
    "Generate sample predictions based on a Generator model."
    # `training` is set to False.
    test_input =  tf.constant(input_dist.rvs(size=len_input*num_examples_to_generate), shape=[
        num_examples_to_generate,len_input])
    if len(uni_lbls) < 3:
        test_labels = tf.constant([1.0]*(num_examples_to_generate//2) + [0.0]*(num_examples_to_generate//2), 
                                  shape=(num_examples_to_generate,1))
    else:
        test_labels = []
        for i in range(len(uni_lbls)):
            test_labels.extend([i]*(num_examples_to_generate//len(uni_lbls)))
        test_labels = np.array(pd.get_dummies(test_labels))

    predictions = model([test_input, test_labels], training=False)
    return predictions

In [None]:
def training_montage(train_data_o, train_lbls, test_data, test_lbls,
                     epochs, generator, critic, generator_optimizer, critic_optimizer, input_dist,
                    batch_size, grad_pen_weight=10, k_cov_den=50, k_crossLID=15, random_seed=145,
                    n_generated_samples=96):
    """Train a generator and critic of CWGAN-GP.
    
       Receives training data and respective class labels (train_data_o and train_lbls) and trains a generator and a critic
        model (generator, critic) over a number of epochs (epochs) with a set batch size (batch_size) with the respective 
        optimizers and learning rate (generator_optimizer, critic_optimizer). Gradient Penalty is calculated with 
        grad_pen_weight as the weight of the penalty.
       The function displays, at set intervals, three graphs to evaluate the progression of the models (Loss plots,
        coverage, density, crossLID and correct first cluster plots and PCA plot with generated and test data). To this
        end, samples need to be generated, requiring the distribution to sample the initial input values from (input_dist),
        and test data and respective labels have to be given (test_data and test_lbls). Finally, the number of neighbors to
        consider for coverage/density and crossLID calculation is also needed (k_cov_den, k_crossLID).
    
       train_data_o: Pandas DataFrame with training data;
       train_lbls: List with training data class labels;
       test_data: Pandas DataFrame with test data to evaluate the model;
       test_lbls: List with test data class labels to evaluate the model;
       epochs: Int value with the number of epochs to train the model;
       generator: tensorflow keras.engine.functional.Functional model for the generator;
       critic: tensorflow keras.engine.functional.Functional model for the critic;
       generator_optimizer: tensorflow keras optimizer (with learning rate) for generator;
       critic_optimizer: tensorflow keras optimizer (with learning rate) for critic;
       input_dist: scipy.stats._continuous_distns.rv_histogram object - distribution to sample input values for generator;
       batch_size: int value with size of batch for model training;
       grad_pen_weight: int value (default 10) for penalty weight in gradient penalty calculation;
       k_cov_den: int value (default 50) for number of neighbors to consider for coverage and density calculation in
        generated samples evaluation;
       k_crossLID: int value (default 15) for number of neighbors to consider for crossLID calculation in generated samples
        evaluation.
       random_seed: int value (default 145) for numpy random seeding when randomly organizing samples in the data that
        will be split into batches.
       n_generated_samples: int value (default 96) for number of samples generated to test the model during training.
    """
    
    # Obtain the train data, randomize its order and divide it by twice the standard deviation of its values
    all_data = train_data_o.iloc[
        np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]/(2*train_data_o.values.std())
    
    # Same treatment for the test data
    test_data = (test_data/(2*test_data.values.std())).values
    training_data = all_data
    train_data = all_data.values
    
    # Change class labels to numerical values while following the randomized order of samples
    if len(set(train_lbls)) < 3: # 1 and 0 for when there are only two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values[:,0]
        test_labels = pd.get_dummies(np.array(test_lbls)).values[:,0]
    else: # One hot encoding for when there are more than two classes
        train_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data))]).values
        test_labels = pd.get_dummies(np.array(test_lbls)).values
    # Save the order of the labels
    ordered_labels = pd.get_dummies(
            np.array(train_lbls)[np.random.RandomState(seed=random_seed).permutation(len(train_data_o))]).columns
    
    # To save the model after
    checkpoint = tf.train.Checkpoint(generator_optimizer=generator_optimizer,
                                     critic_optimizer=critic_optimizer,
                                     generator=generator,
                                     critic=critic)

    batch_divisions = int(batch_size / len(set(train_lbls))) # See how many samples of each class will be in each batch
    n_steps = epochs * int(training_data.shape[0] / batch_size) # Number of steps: nº of batches per epoch * nº of epochs
    n_critic = 5
    
    # Set up the evaluating images printed during training and the intervals they will be updated
    f, (axl, axc, axr) = plt.subplots(1, 3, figsize = (16,5))
    update1 = n_steps//200
    update2 = n_steps//20

    if hasattr(tqdm, '_instances'):
        tqdm._instances.clear() # clear if it exists

    i=0

    for step in tqdm(range(n_steps)):
        
        # Critic Training
        crit_loss_temp = []
        
        # Select real samples for this batch on training and order samples to put samples of the same class together
        real_samp = train_data[i*batch_size:(i+1)*batch_size]
        real_lbls = train_labels[i*batch_size:(i+1)*batch_size]

        real_samples = np.empty(real_samp.shape)
        real_labels = np.empty(real_lbls.shape)
        a = 0
        if len(set(train_lbls)) < 3:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: pair[0], reverse=True):
                real_samples[a] = s
                real_labels[a] = l
                a = a+1
        else:
            for l,s in sorted(zip(real_lbls, real_samp), key=lambda pair: np.argmax(pair[0]), reverse=False):
                #print(l, np.argmax(l))
                real_samples[a] = s
                real_labels[a] = l
                a = a+1

        for _ in range(n_critic): # For each step, train critic n_critic times
            
            # Generate input for generator
            artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
            artificial_labels = real_labels.copy()

            # Generate artificial samples from the latent vector
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            with tf.GradientTape() as crit_tape: # See the gradient for the critic

                # Get the logits for the generated samples
                X_artificial = critic([artificial_samples, artificial_labels], training=True)
                # Get the logits for the real samples
                X_true = critic([real_samples, real_labels], training=True)

                # Calculate the critic loss using the generated and real sample results
                c_cost = critic_loss_wgan(X_true, X_artificial)

                # Calculate the gradient penalty
                grad_pen = gradient_penalty_cwgan(batch_size, real_samples, artificial_samples,
                                                  real_labels, artificial_labels, critic)
                # Add the gradient penalty to the original discriminator loss
                crit_loss = c_cost + grad_pen * grad_pen_weight
                #print(crit_loss)
                #crit_loss = c_cost
                
            crit_loss_temp.append(crit_loss)

            # Calculate and apply the gradients obtained from the loss on the trainable variables
            gradients_of_critic = crit_tape.gradient(crit_loss, critic.trainable_variables)
            #print(gradients_of_critic)
            critic_optimizer.apply_gradients(zip(gradients_of_critic, critic.trainable_variables))

        i = i + 1
        if (step+1) % (n_steps//epochs) == 0:
            i=0

        crit_loss_all.append(np.mean(crit_loss_temp))
        
        # Generator Training
        # Generate inputs for generator, values and labels
        artificial_samples = tf.constant(input_dist.rvs(size=all_data.shape[1]*batch_size), shape=[
                batch_size,all_data.shape[1]])
        
        if len(set(train_lbls)) < 3:
            artificial_labels = tf.constant([1.0]*(batch_size//2) + [0.0]*(batch_size//2), shape=(batch_size,1))
        else:
            artificial_labels = np.array(pd.get_dummies([i for i in range(len(set(train_lbls)))]*batch_divisions))
    
        with tf.GradientTape() as gen_tape: # See the gradient for the generator
            # Generate artificial samples
            artificial_samples = generator([artificial_samples, artificial_labels], training=True)
            
            # Get the critic results for generated samples
            X_artificial = critic([artificial_samples, artificial_labels], training=True)
            # Calculate the generator loss
            gen_loss = generator_loss_wgan(X_artificial)

        # Calculate and apply the gradients obtained from the loss on the trainable variables
        gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
        #print(gradients_of_generator)
        generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
        gen_loss_all.append(gen_loss)

        # Update the progress bar and evaluation graphs every update1 steps for loss plots and update2 for the others.
        if (step + 1) % update1 == 0:
            
            # Update the evaluating figures at the set intervals
            axl.clear() # Always clear the corresponding ax before redrawing it
            
            # Loss Plot
            axl.plot(gen_loss_all, color = 'blue', label='Generator Loss')
            axl.plot(crit_loss_all,color = 'red', label='Critic Loss')
            axl.set_xlabel('Number of Steps')
            axl.set_ylabel('Loss')
            axl.legend()
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())

        if (step + 1) % update2 == 0:

            saved_predictions.append(generate_predictions(generator, n_generated_samples, all_data.shape[1], 
                                                          input_dist, ordered_labels))
            # Compute density, coverage and crossLID for the latest predictions
            # (crossLID is divided by 25 so it is on a scale similar to the other metrics)
            den, cov = gem.evaluation_coverage_density(test_data, saved_predictions[-1], k= k_cov_den, metric='euclidean')
            clid = gem.cross_LID_estimator_byMLE(test_data, saved_predictions[-1], k=k_crossLID, metric='euclidean')/25
            density.append(den)
            coverage.append(cov)
            crossLID.append(clid)

            # PCA of the latest predictions and training data
            # Divide by twice the standard deviation to be the same as the generated data
            dfs_temp = pd.concat((train_data_o/(2*train_data_o.values.std()),pd.DataFrame(
                saved_predictions[-1].numpy(), columns=train_data_o.columns))) 
            temp_lbls = train_lbls.copy()
            for l in ordered_labels:
                temp_lbls.extend([l+' - GAN']*(n_generated_samples//len(ordered_labels)))
            principaldf = gem.pca_sample_projection(dfs_temp, temp_lbls, pca, whiten=True, 
                                                samp_number=len(train_data_o.index))
            lcolors = label_colors_test

            # Hierarchical clustering of the latest predictions and testing data, 
            # saving the correct 1st cluster fraction results
            dfs_temp = np.concatenate((test_data, saved_predictions[-1].numpy()))
            temp_lbls = ['real']*len(test_data) + ['gen']*len(saved_predictions[-1])
            hca_results = gem.perform_HCA(dfs_temp, temp_lbls, metric='euclidean', method='ward')
            corr1stcluster.append(hca_results['correct 1st clustering'])
            
            # Plots
            axc.clear()
            axc.plot(range(update2, step+2, update2), coverage, label='coverage')
            axc.plot(range(update2, step+2, update2), density, label='density')
            axc.plot(range(update2, step+2, update2), crossLID, color='red', label='crossLID')
            axc.plot(range(update2, step+2, update2), corr1stcluster, color='purple', label='corr_cluster')
            axc.legend()

            axr.clear()
            gem.plot_PCA(principaldf, lcolors, components=(1,2), title='', ax=axr)
            axr.legend(loc='upper right', ncol=1, framealpha=1)
            
            ipythondisplay.clear_output(wait=True)
            ipythondisplay.display(plt.gcf())
            
        # Save the model every so often
        if (step + 1) % (update2*5) == 0:
            checkpoint.save(file_prefix = checkpoint_prefix)      

In [None]:
# Train models to generate samples for each fold
GENERATE = True

epochs = 600
batch_size = 48
k_cov_den = 20
k_crossLID = 15
random_seed = 145
n_generated_samples = 32*len(pd.unique(labels))

if GENERATE:
    generator_train = {}
    critic_train = {}

    results_train = {}

    for lbl in df_storage_train:
        generator_train[lbl] = {}
        critic_train[lbl] = {}

        results_train[lbl] = {}
        for fold in df_storage_train[lbl]:

            gen_loss_all = []
            crit_loss_all = []
            saved_predictions = []
            coverage = []
            density = []
            crossLID = []
            corr1stcluster = []

            # Get distribution of intensity values of the dataset
            hist = np.histogram(real_samples[lbl][fold].values.flatten(), bins=100)
            input_realdata_dist = stats.rv_histogram(hist)

            generator_optimizer = tf.keras.optimizers.RMSprop(1e-4)
            critic_optimizer = tf.keras.optimizers.RMSprop(1e-4)

            df = real_samples[lbl][fold]
            pca = PCA(n_components=2, svd_solver='full', whiten=True)
            pc_coords = pca.fit_transform(df)

            generator_train[lbl][fold] = generator_model(aug_df_storage_train[lbl][fold].shape[1],
                                                 aug_df_storage_train[lbl][fold].shape[1], 128, 3)
            critic_train[lbl][fold] = critic_model(aug_df_storage_train[lbl][fold].shape[1], 512, 3)

            training_montage(aug_df_storage_train[lbl][fold], aug_lbl_storage_train[lbl][fold], 
                             real_samples[lbl][fold], lbl_storage_train[lbl][fold],
                             epochs, generator_train[lbl][fold], critic_train[lbl][fold],
                             generator_optimizer, critic_optimizer, input_realdata_dist, batch_size=batch_size,
                             grad_pen_weight=10, k_cov_den=k_cov_den, k_crossLID=k_crossLID,
                             random_seed=random_seed, n_generated_samples=n_generated_samples)

            results_train[lbl][fold]={'gen_loss': gen_loss_all, 'crit_loss': crit_loss_all,
                        'saved_pred': saved_predictions,
                        'coverage': coverage, 'density': density, 'crossLID': crossLID, 'corr1st_cluster': corr1stcluster}

            print(f'Minority Class {lbl}, Fold {fold} done.')

In [None]:
# Uncomment to load previously saved models instead of saving the newly trained ones
#GENERATE = False
if GENERATE:
    for lbl in generator_train:
        for fold in generator_train[lbl]:
            # Save the generator and critic models' weights.
            generator_train[lbl][fold].save_weights('gan_models/MPint_gen_imb_'+lbl+str(fold))
            critic_train[lbl][fold].save_weights('gan_models/MPint_crit_imb_'+lbl+str(fold))

            # Save the results from GAN training
            with open('gan_models/MPint_results_imb_'+lbl+str(fold)+'.pickle', 'wb') as handle:
                pickle.dump(results_train[lbl][fold], handle)

else:
    generator_train = {}
    critic_train = {}

    results_train = {}

    for lbl in df_storage_train:
        generator_train[lbl] = {}
        critic_train[lbl] = {}

        results_train[lbl] = {}
        for fold in df_storage_train[lbl]:
            # Read back the saved model
            generator_optimizer = tf.keras.optimizers.RMSprop(1e-4)
            critic_optimizer = tf.keras.optimizers.RMSprop(1e-4)

            generator_train[lbl][fold] = generator_model(df_storage_train[lbl][fold].shape[1],
                                                         df_storage_train[lbl][fold].shape[1], 128, 3)
            critic_train[lbl][fold] = critic_model(df_storage_train[lbl][fold].shape[1],
                                                   512, 3)

            # Load previously saved models
            generator_train[lbl][fold].load_weights('./gan_models/MPint_gen_imb_'+lbl+str(fold))
            critic_train[lbl][fold].load_weights('./gan_models/MPint_crit_imb_'+lbl+str(fold))

            # Load previously saved results
            #with open('gan_models/MPint_results_imb_'+lbl+str(fold)+'.pickle', 'rb') as handle:
            #    results_train[lbl][fold] = pickle.load(handle)

### Classification Accuracy Comparison

The dataset was divided into 4 folds for each minority class. In each fold, the 'training set' has 14 samples of the minority class and 42 samples of each of the two majority classes, while the 'test set' has the remaining 42 samples of the minority class and 14 samples of each majority class. Each of the three classes takes a turn as the minority class, as explained before.

For each training set, we build and train a GAN model. We then build models from the training set, with increasing numbers of GAN-generated samples added, and compare their performance in discriminating the test set.
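
As a rough illustration of the split described above, the sketch below works out the class counts for one fold. The figure of 56 samples per class is inferred from the 14 + 42 counts in the text, and `fold_composition` is a hypothetical helper that is not part of the notebook or the repository.

```python
# Hypothetical helper: class counts for one imbalanced fold.
# Assumes 56 samples per class (14 + 42), inferred from the text above.
def fold_composition(classes, minority, n_per_class=56, n_minority_train=14):
    n_majority_train = n_per_class - n_minority_train
    train = {c: (n_minority_train if c == minority else n_majority_train)
             for c in classes}
    # Whatever is not used for training ends up in the test set
    test = {c: n_per_class - train[c] for c in classes}
    return train, test

train, test = fold_composition(['DMP', 'FTMP', 'UCB'], minority='DMP')
print(train)  # {'DMP': 14, 'FTMP': 42, 'UCB': 42}
print(test)   # {'DMP': 42, 'FTMP': 14, 'UCB': 14}
```

Under these assumptions the test set is imbalanced in the opposite direction to the training set, so a model that neglects the minority class is penalised heavily at evaluation time.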

#### Generate a lot of samples and make Random Forests and PLS-DA models

In [None]:
np.random.seed(5402)
# Generate samples for each fold
generated_samples = {}

for i in generator_train:
    generated_samples[i] = {}
    for fold in generator_train[i]:
        # Input to the generator
        num_examples_to_generate = 1980
        # Get distribution of intensity values of the dataset
        hist = np.histogram(real_samples[i][fold].values.flatten(), bins=100)
        input_realdata_dist = stats.rv_histogram(hist)

        test_input = tf.constant(input_realdata_dist.rvs(
            size=len(df_storage_train[i][fold].columns)*num_examples_to_generate),
                                 shape=[num_examples_to_generate,len(df_storage_train[i][fold].columns)])
        
        ordered_labels_fold = pd.get_dummies(
            np.array(lbl_storage_train[i][fold])[np.random.RandomState(seed=145).permutation(
                len(lbl_storage_train[i][fold]))]).columns
        
        test_labels = tf.constant(np.array(pd.get_dummies([i for i in range(len(ordered_labels_fold))]*(
            num_examples_to_generate//len(ordered_labels_fold)))))

        # Generate GAN samples
        predictions = generator_train[i][fold]([test_input, test_labels], training=False)
        # Reverse the division done to the data
        predictions = predictions * 2* aug_df_storage_train[i][fold].values.std()


        generated_samples[i][fold] = [pd.DataFrame(np.array(predictions), columns=df_storage_train[i][fold].columns),
                                [i for i in ordered_labels_fold]*(num_examples_to_generate//len(ordered_labels_fold))]

In [None]:
# To store for each fold
bal_datasets = {}
np.random.seed(325)
rng = np.random.default_rng(7519)
for i in generated_samples.keys():
    bal_datasets[i] = {}
    for fold in generator_train[i]:
        bal_datasets[i][fold] = {}
        df = real_samples[i][fold].loc[np.array(lbl_storage_train[i][fold]) == i]
        # Calculate all correlations between all samples of experimental and GAN data and store them in a dataframe
        correlations = pd.DataFrame(index=generated_samples[i][fold][0].index, columns=df.index).astype('float')

        for a in df.index:
            for j in generated_samples[i][fold][0].index:
                correlations.loc[j,a] = stats.pearsonr(df.loc[a],
                                                       generated_samples[i][fold][0].loc[j])[0]

        correlated_samples = pd.DataFrame(columns=df.index)
        for a in correlations:
            correlated_samples[a] = correlations[a].sort_values(ascending=False).index
            
        permutated = correlated_samples.copy()
        for l in correlated_samples.index:
            permutated.loc[l] = rng.permutation(correlated_samples.loc[l])

        corr_idxs = pd.unique(permutated.values.flatten())
        
        dataset_len = real_samples[i][fold].shape[0]
        
        n_min_class = (np.array(lbl_storage_train[i][fold]) == i).sum()
        n_max_class = (dataset_len - n_min_class)//2
        
        # Slowly add samples - 2 at a time until a total of 28
        for num in range(0,n_max_class - n_min_class+1, (n_max_class - n_min_class+1)//14):
            idx_to_keep = corr_idxs[:num]

            corr_preds = generated_samples[i][fold][0].loc[list(pd.unique(idx_to_keep))]
            corr_lbls  = np.array(generated_samples[i][fold][1])[list(pd.unique(idx_to_keep))]

            # Slowly add the correlated GAN samples to the imbalanced dataset, making it a balanced dataset
            concat_df = pd.concat((corr_preds, real_samples[i][fold]))
            concat_lbls = [i,]*len(set(idx_to_keep)) + lbl_storage_train[i][fold]
            bal_datasets[i][fold][num] = [concat_df.copy(), concat_lbls.copy()]

In [None]:
# Samples added to the minority class
bal_datasets[i][fold].keys()

### Fitting RF and PLS-DA models to Imbalanced Datasets and Evaluating them

RF and PLS-DA models are built for each minority class, each fold and each number of GAN samples added.
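
To make "each number of GAN samples added" concrete, here is a minimal sketch of the augmentation schedule and of the minority-to-majority ratio used on some of the x-axes below. The values 14 and 42 are taken from the split described earlier rather than computed from the data, so treat them as assumptions.

```python
# Assumed class sizes in each training fold (from the split described earlier)
n_min_class = 14
n_max_class = 42
step = (n_max_class - n_min_class + 1) // 14   # 2 samples per increment

# Number of GAN samples added to the minority class at each stage
sizes = list(range(0, n_max_class - n_min_class + 1, step))

# Minority class size as a percentage of one majority class
ratio_pct = [round((s + n_min_class) / n_max_class * 100, 1) for s in sizes]

print(sizes[:3], sizes[-1])          # [0, 2, 4] 28
print(ratio_pct[0], ratio_pct[-1])   # 33.3 100.0
```

With these numbers there are 15 augmentation levels per fold, running from the original imbalanced set (0 added) to a fully balanced one (28 added).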

RF

In [None]:
# Fitting and storing Random Forest models for each fold
RF_models_bal = {}

# Train the Models
for min_class in bal_datasets:
    RF_models_bal[min_class] = {}
    for size in bal_datasets[min_class][1].keys():
        RF_models_bal[min_class][size] = {}
        for fold in bal_datasets[min_class]:
            rf_mod = ma.RF_model(bal_datasets[min_class][fold][size][0], bal_datasets[min_class][fold][size][1],
                                 return_cv=False, n_trees=200)
            RF_models_bal[min_class][size][fold] = rf_mod

In [None]:
# Testing the RF models with the test data for each fold
RF_results_bal = {'Accuracy':{}, 'F1-Score':{}, 'Precision':{}, 'Recall':{}}
# Evaluate the Models
for min_class in RF_models_bal:
    RF_results_bal['Accuracy'][min_class] = {}
    RF_results_bal['F1-Score'][min_class] = {}
    RF_results_bal['Precision'][min_class] = {}
    RF_results_bal['Recall'][min_class] = {}
    for size in RF_models_bal[min_class].keys():
        RF_results_bal['Accuracy'][min_class][size] = {}
        RF_results_bal['F1-Score'][min_class][size] = {}
        RF_results_bal['Precision'][min_class][size] = {}
        RF_results_bal['Recall'][min_class][size] = {}
        for fold in RF_models_bal[min_class][size]:
            RF_results_bal['Accuracy'][min_class][size][fold] = RF_models_bal[min_class][size][fold].score(
                                                                                    df_storage_test[min_class][fold],
                                                                                    lbl_storage_test[min_class][fold])
            preds = RF_models_bal[min_class][size][fold].predict(df_storage_test[min_class][fold])
            prec, rec, f1, sup = precision_recall_fscore_support(lbl_storage_test[min_class][fold], preds,
                                                                average='weighted')
            RF_results_bal['F1-Score'][min_class][size][fold] = f1
            RF_results_bal['Precision'][min_class][size][fold] = prec
            RF_results_bal['Recall'][min_class][size][fold] = rec

In [None]:
pd.DataFrame.from_dict(RF_results_bal['Accuracy']['DMP'])

In [None]:
pd.DataFrame.from_dict(RF_results_bal['Accuracy']['FTMP'])

In [None]:
pd.DataFrame.from_dict(RF_results_bal['Accuracy']['UCB'])

In [None]:
# Plotting the results averaged over the folds
f, (axl, axr) = plt.subplots(1,2,figsize=(8,4), constrained_layout=True)
for min_class in RF_results_bal['Accuracy'].keys():
    axl.plot((pd.DataFrame.from_dict(RF_results_bal['Accuracy'][min_class]).columns)/dataset_len*100,
             pd.DataFrame.from_dict(RF_results_bal['Accuracy'][min_class]).mean(), 
             label=min_class+' Min. Accuracy')
    axl.plot((pd.DataFrame.from_dict(RF_results_bal['F1-Score'][min_class]).columns)/dataset_len*100,
             pd.DataFrame.from_dict(RF_results_bal['F1-Score'][min_class]).mean(), 
             label=min_class+' Min. F1-score')

axl.set_ylabel('Performance', fontsize=15)
axl.set_xlabel('% of Augmentation', fontsize=15)
axl.set_ylim([0.5, 1.01])
axl.legend(fontsize=14)
axl.set_title('Random Forest')

for min_class in RF_results_bal['Precision'].keys():
    axr.plot((pd.DataFrame.from_dict(RF_results_bal['Precision'][min_class]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(RF_results_bal['Precision'][min_class]).mean(), 
             label=min_class+' Min. Precision')
    axr.plot((pd.DataFrame.from_dict(RF_results_bal['Recall'][min_class]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(RF_results_bal['Recall'][min_class]).mean(), 
             label=min_class+' Min. Recall')

axr.set_ylabel('Performance', fontsize=15)
axr.set_xlabel('% of Augmentation', fontsize=15)
axr.set_ylim([0.5, 1.01])
axr.legend(fontsize=14)
axr.set_title('Random Forest')

plt.show()

PLS-DA

In [None]:
def decision_rule(y_pred, y_true, average='weighted'):
    "Decision rule for PLS-DA classification."
    # Decision rule chosen: sample belongs to group where it has max y_pred (closer to 1)
    # In case of 1,0 encoding for two groups, round to nearest integer to compare
    nright=0
    rounded_pred = y_pred.copy()
    for i in range(len(y_pred)):
        if list(y_true.iloc[i, :]).index(max(y_true.iloc[i, :])) == np.argmax(
            y_pred[i]
        ):
            nright += 1  # Correct prediction
            
        for l in range(len(y_pred[i])):
            if l == np.argmax(y_pred[i]):
                rounded_pred[i, l] = 1
            else:
                rounded_pred[i, l] = 0
    
    prec, rec, f1, sup = precision_recall_fscore_support(y_true.astype(int), rounded_pred.astype(int),
                                                        average=average)

    # Calculate accuracy for this iteration
    accuracy = (nright / len(y_pred))
    return accuracy, f1, prec, rec
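
The decision rule above can be illustrated on a toy example. This is a simplified, standard-library sketch with made-up scores; it mirrors the arg-max logic only and does not call `decision_rule` or scikit-learn.

```python
# Toy predicted scores for 3 samples over 3 classes (made-up numbers)
y_pred = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6],
          [0.4, 0.5, 0.1]]
# One-hot encoded true classes
y_true = [[1, 0, 0],
          [0, 0, 1],
          [1, 0, 0]]

def one_hot_argmax(row):
    """Return a one-hot list with 1 at the position of the largest score."""
    best = row.index(max(row))
    return [1 if j == best else 0 for j in range(len(row))]

rounded = [one_hot_argmax(r) for r in y_pred]
n_right = sum(r == t for r, t in zip(rounded, y_true))
accuracy = n_right / len(y_pred)

print(rounded)   # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(accuracy)
```

Two of the three toy samples are assigned to their true class, so the accuracy is 2/3.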

In [None]:
PLSDA_models_bal = {}
PLSDA_results_bal = {'Accuracy':{}, 'F1-Score':{}, 'Precision':{}, 'Recall':{}}

# Train and Evaluate the models
for min_class in bal_datasets:
    PLSDA_models_bal[min_class] = {}
    PLSDA_results_bal['Accuracy'][min_class] = {}
    PLSDA_results_bal['F1-Score'][min_class] = {}
    PLSDA_results_bal['Precision'][min_class] = {}
    PLSDA_results_bal['Recall'][min_class] = {}
    for size in bal_datasets[min_class][1].keys():
        PLSDA_models_bal[min_class][size] = {}
        PLSDA_results_bal['Accuracy'][min_class][size] = {}
        PLSDA_results_bal['F1-Score'][min_class][size] = {}
        PLSDA_results_bal['Precision'][min_class][size] = {}
        PLSDA_results_bal['Recall'][min_class][size] = {}
        for fold in bal_datasets[min_class]:

            PLSDA_models_bal[min_class][size][fold] = ma.fit_PLSDA_model(bal_datasets[min_class][fold][size][0],
                                                                   bal_datasets[min_class][fold][size][1],
                                                              n_comp=4,
                                                      return_scores=False, scale=False, encode2as1vector=True)
            plsda = PLSDA_models_bal[min_class][size][fold]
            # Obtain results with the test group
            y_pred = plsda.predict(df_storage_test[min_class][fold])
            y_true = ma._generate_y_PLSDA(lbl_storage_test[min_class][fold],
                                          pd.unique(bal_datasets[min_class][fold][size][1]),
                                          False)
            # Calculate accuracy
            accuracy, f1, prec, rec = decision_rule(y_pred, y_true)
            PLSDA_results_bal['Accuracy'][min_class][size][fold] = accuracy
            PLSDA_results_bal['F1-Score'][min_class][size][fold] = f1
            PLSDA_results_bal['Precision'][min_class][size][fold] = prec
            PLSDA_results_bal['Recall'][min_class][size][fold] = rec

In [None]:
pd.DataFrame.from_dict(PLSDA_results_bal['Accuracy']['DMP'])

In [None]:
pd.DataFrame.from_dict(PLSDA_results_bal['Accuracy']['FTMP'])

In [None]:
pd.DataFrame.from_dict(PLSDA_results_bal['Accuracy']['UCB'])

In [None]:
# Plotting the results averaged over the folds
f, (axl, axr) = plt.subplots(1,2,figsize=(8,4), constrained_layout=True)
for min_class in PLSDA_results_bal['Accuracy'].keys():
    axl.plot((pd.DataFrame.from_dict(PLSDA_results_bal['Accuracy'][min_class]).columns)/dataset_len*100,
             pd.DataFrame.from_dict(PLSDA_results_bal['Accuracy'][min_class]).mean(), 
             label=min_class+' Min. Accuracy')
    axl.plot((pd.DataFrame.from_dict(PLSDA_results_bal['F1-Score'][min_class]).columns)/dataset_len*100,
             pd.DataFrame.from_dict(PLSDA_results_bal['F1-Score'][min_class]).mean(), 
             label=min_class+' Min. F1-score')

axl.set_ylabel('Performance', fontsize=15)
axl.set_xlabel('% of Augmentation', fontsize=15)
axl.set_ylim([0.5, 1.01])
axl.legend(fontsize=15)
axl.set_title('PLS-DA')

for min_class in PLSDA_results_bal['Precision'].keys():
    axr.plot((pd.DataFrame.from_dict(PLSDA_results_bal['Precision'][min_class]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(PLSDA_results_bal['Precision'][min_class]).mean(), 
             label=min_class+' Min. Precision')
    axr.plot((pd.DataFrame.from_dict(PLSDA_results_bal['Recall'][min_class]).columns + n_min_class)/n_max_class*100,
             pd.DataFrame.from_dict(PLSDA_results_bal['Recall'][min_class]).mean(), 
             label=min_class+' Min. Recall')

axr.set_ylabel('Performance', fontsize=15)
axr.set_xlabel('% of Augmentation', fontsize=15)
axr.set_ylim([0.5, 1.01])
axr.legend(fontsize=15)
axr.set_title('PLS-DA')

plt.show()

In [None]:
# Plotting the F1-score averaged over the folds and the three minority classes
f, ax = plt.subplots(1,1,figsize=(4,4))
avg_increase = (pd.DataFrame.from_dict(RF_results_bal['F1-Score']['DMP']).mean()
                + pd.DataFrame.from_dict(RF_results_bal['F1-Score']['FTMP']).mean()
               + pd.DataFrame.from_dict(RF_results_bal['F1-Score']['UCB']).mean())/3
plt.plot((avg_increase.index)/dataset_len*100, avg_increase, 
         label='Avg. RF F1-score')

avg_increase = (pd.DataFrame.from_dict(PLSDA_results_bal['F1-Score']['DMP']).mean()
                + pd.DataFrame.from_dict(PLSDA_results_bal['F1-Score']['FTMP']).mean()
               + pd.DataFrame.from_dict(PLSDA_results_bal['F1-Score']['UCB']).mean())/3
plt.plot((avg_increase.index)/dataset_len*100, avg_increase, 
         label='Avg. PLS-DA F1-score')

ax.set_ylabel('F1-score', fontsize=15)
ax.set_xlabel('% of Augmentation', fontsize=15)
ax.set_ylim([0.5, 1.01])
ax.legend(fontsize=13)
plt.show()

##### Extracting Important Features for RF models

In [None]:
# Extracting Important Features and averaging them across the 4 folds for each minority class
rf_feats = {}

for cl, models in RF_models_bal.items():
    rf_feats[cl] = {}
    for size, folds in models.items():
        rf_feats[cl][size] = {}
        for fold, mod in folds.items():
            if fold == 1:
                temp_df_bal = mod.feature_importances_
            else:
                temp_df_bal = temp_df_bal + mod.feature_importances_

        temp_df_bal = temp_df_bal / len(folds)

        #RF_models[fold][size].feature_importances_
        rf_feats[cl][size] = dict(zip(range(1, len(bal_datasets[cl][1][size][0].columns)+1), temp_df_bal))

# Averaging feature importance over the minority classes as well    
rf_feats_together = (pd.DataFrame.from_dict(rf_feats['DMP']) + pd.DataFrame.from_dict(rf_feats['FTMP']) + 
           pd.DataFrame.from_dict(rf_feats['UCB']))/3

In [None]:
# Select the top 2% of important features
rf_feats_together_abrev = pd.Series(index=rf_feats_together.columns)
top10 = int(0.02*len(rf_feats_together.index))
for i in rf_feats_together.columns:
    rf_feats_together_abrev[i] = rf_feats_together[i].sort_values(ascending=False)[:top10].mean()

In [None]:
# Select the top 2% of important features
rf_feats_abrev = {}
for min_class in rf_feats:
    temp_df = pd.DataFrame.from_dict(rf_feats[min_class])
    rf_feats_abrev[min_class] = pd.Series(index=temp_df.columns)
    top10 = int(0.02*len(temp_df.index))
    for i in temp_df.columns:
        rf_feats_abrev[min_class][i] = temp_df[i].sort_values(ascending=False)[:top10].mean()

In [None]:
top10

Relative importance values depending on the number of GAN samples added
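
A minimal sketch of the quantity plotted next: the mean importance of the top 2% of features at each augmentation size, expressed relative to the value with no GAN samples added. The importances are made-up numbers and `top_fraction_mean` is a hypothetical helper, not a function from the repository.

```python
# Made-up importances for 100 features at three augmentation sizes
importances = {
    0: [0.010] * 98 + [0.020, 0.040],
    2: [0.010] * 98 + [0.030, 0.045],
    4: [0.010] * 98 + [0.030, 0.060],
}

def top_fraction_mean(values, fraction=0.02):
    """Mean of the largest `fraction` of the values."""
    k = int(fraction * len(values))
    top = sorted(values, reverse=True)[:k]
    return sum(top) / k

means = {size: top_fraction_mean(v) for size, v in importances.items()}
baseline = means[0]  # no GAN samples added
relative_pct = {size: round(m / baseline * 100, 1) for size, m in means.items()}

print(relative_pct)  # {0: 100.0, 2: 125.0, 4: 150.0}
```

A value above 100% means the most important features carry more weight, on average, than in the model built from the unaugmented imbalanced set.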

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, ax = plt.subplots(figsize=(6,6))

    # Add n_min_class to the array (not to the list) to get the minority class size
    plt.plot((np.array(list(rf_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             rf_feats_together_abrev.values/rf_feats_together_abrev.values[0]*100, label='All Together',
             color='Black', linewidth=3)
    for i in rf_feats_abrev:
        plt.plot((np.array(list(rf_feats_together_abrev.index))+n_min_class)/n_max_class*100,
                 rf_feats_abrev[i].values/rf_feats_abrev[i].values[0]*100,
                 label='Minority Class: ' +i, color=label_colors_test[i])

    plt.ylabel('Avg. Gini Importance of top 2% Imp. features change (%)', fontsize = 12)
    plt.xlabel('% of Augmentation', fontsize = 15)
    plt.ylim(80,160)
    plt.legend(fontsize=13)
    ax.tick_params(axis='both', which='major', labelsize=13)
    plt.title('Random Forest', fontsize=15)

Common top 2% important features compared to the complete models depending on the number of GAN samples added
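
The comparison can be sketched as a set overlap between the top-ranked features of an augmented model and a reference list from the complete model. Feature names and scores below are made up, and `top_k_overlap_pct` is a hypothetical helper used only for illustration.

```python
def top_k_overlap_pct(importance, reference_top, k):
    """Percentage of the k most important features also found in reference_top."""
    ranked = sorted(importance, key=importance.get, reverse=True)[:k]
    return len(set(ranked) & set(reference_top)) / k * 100

# Made-up feature importances for a model trained on an augmented fold
augmented_model = {'m1': 0.30, 'm2': 0.25, 'm3': 0.05, 'm4': 0.20, 'm5': 0.01}
# Made-up top features of the model trained on the complete dataset
complete_model_top = ['m1', 'm4', 'm5']

overlap = top_k_overlap_pct(augmented_model, complete_model_top, k=3)
print(overlap)
```

Here the top three features of the augmented model are `m1`, `m2` and `m4`, two of which appear in the reference list, giving an overlap of roughly 67%.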

In [None]:
rf_line = []
for i in rf_feats_together:
    idxs = []
    for l in rf_feats_together[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets['DMP'][1][0][0].columns[l-1])
    rf_line.append(len(np.intersect1d(idxs, RF_feats_real.index[:top10])))
    
rf_line2 = []
for i in rf_feats_together:
    idxs = []
    for l in rf_feats_together[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets['DMP'][1][0][0].columns[l-1])
    rf_line2.append(len(np.intersect1d(idxs, uni_df.columns)))

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axr) = plt.subplots(1,2,figsize=(10,5))

    for min_class in rf_feats:
        rf_line_min = []
        for i in rf_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(rf_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            rf_line_min.append(len(np.intersect1d(idxs, RF_feats_real.index[:top10])))

        rf_line2_min = []
        for i in rf_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(rf_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            rf_line2_min.append(len(np.intersect1d(idxs, uni_df.columns)))

        axl.plot((np.array(list(rf_feats[min_class].keys()))+n_min_class)/n_max_class*100, np.array(rf_line_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)
        axr.plot((np.array(list(rf_feats[min_class].keys()))+n_min_class)/n_max_class*100, np.array(rf_line2_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)

axl.plot((np.array(list(rf_feats_together.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line)/top10*100, color='black',
            label='All Together', linewidth=3)
axr.plot((np.array(list(rf_feats_together.keys()))+n_min_class)/n_max_class*100,
         np.array(rf_line2)/top10*100, color='black',
        label='All Together', linewidth=3)        

axl.set_title('Against RF Top 2% Imp. Feats', fontsize=15)
axr.set_title('Against Univariate Significant Feats', fontsize=15)
axl.set_ylabel('% of Common Features with Real Model', fontsize = 15)
axl.set_xlabel('% Augmentation', fontsize = 15)
axr.set_xlabel('% Augmentation', fontsize = 15)
axl.set_ylim(0,102)
axr.set_ylim(0,102)
#plt.legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))
axr.legend(loc='lower left', fontsize=13)
axl.tick_params(axis='both', which='major', labelsize=14)
axr.tick_params(axis='both', which='major', labelsize=14)
plt.suptitle('Random Forest', fontsize=15)

##### Extracting Important Features for PLS-DA models

In [None]:
# Extracting Important Features and averaging them across the 4 folds for each minority class
plsda_feats = {}

for cl, models in PLSDA_models_bal.items():
    plsda_feats[cl] = {}
    for size, folds in models.items():
        plsda_feats[cl][size] = {}
        for fold, mod in folds.items():
            if fold == 1:
                temp_df_bal = ma._calculate_vips(mod)
            else:
                temp_df_bal = temp_df_bal + ma._calculate_vips(mod)

        temp_df_bal = temp_df_bal / len(folds)

        plsda_feats[cl][size] = dict(zip(range(1, len(bal_datasets[cl][1][size][0].columns)+1), temp_df_bal))

# Averaging feature importance over the minority classes as well
plsda_feats_together = (pd.DataFrame.from_dict(plsda_feats['DMP']) + pd.DataFrame.from_dict(plsda_feats['FTMP']) + 
           pd.DataFrame.from_dict(plsda_feats['UCB']))/3

In [None]:
# Select the top 2% of important features
plsda_feats_together_abrev = pd.Series(index=plsda_feats_together.columns)
top10 = int(0.02*len(plsda_feats_together.index))
for i in plsda_feats_together.columns:
    plsda_feats_together_abrev[i] = plsda_feats_together[i].sort_values(ascending=False)[:top10].mean()

In [None]:
# Select the top 2% of important features
plsda_feats_abrev = {}
for min_class in plsda_feats:
    temp_df = pd.DataFrame.from_dict(plsda_feats[min_class])
    plsda_feats_abrev[min_class] = pd.Series(index=temp_df.columns)
    top10 = int(0.02*len(temp_df.index))
    for i in temp_df.columns:
        plsda_feats_abrev[min_class][i] = temp_df[i].sort_values(ascending=False)[:top10].mean()

In [None]:
top10

Relative importance values depending on the number of GAN samples added

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, ax = plt.subplots(figsize=(6,6))

    # Add n_min_class to the array (not to the list) to get the minority class size
    plt.plot((np.array(list(plsda_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             plsda_feats_together_abrev.values/plsda_feats_together_abrev.values[0]*100, label='All Together',
             color='Black', linewidth=3)
    for i in plsda_feats_abrev:
        plt.plot((np.array(list(plsda_feats_together_abrev.index))+n_min_class)/n_max_class*100,
                 plsda_feats_abrev[i].values/plsda_feats_abrev[i].values[0]*100,
                 label='Minority Class: ' +i, color=label_colors_test[i])

    plt.ylabel('Avg. VIP Score of top 2% Imp. features change (%)', fontsize = 12)
    plt.xlabel('% of Augmentation', fontsize = 15)
    plt.ylim(90,120)
    plt.legend(fontsize=13)
    ax.tick_params(axis='both', which='major', labelsize=13)
    plt.title('PLS-DA', fontsize=15)

Common top 2% important features compared to the complete models depending on the number of GAN samples added

In [None]:
plsda_line = []
for i in plsda_feats_together:
    idxs = []
    for l in plsda_feats_together[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets['DMP'][1][0][0].columns[l-1])
    plsda_line.append(len(np.intersect1d(idxs, PLSDA_feats_real.index[:top10])))
    
plsda_line2 = []
for i in plsda_feats_together:
    idxs = []
    for l in plsda_feats_together[i].sort_values(ascending=False)[:top10].index:
        idxs.append(bal_datasets['DMP'][1][0][0].columns[l-1])
    plsda_line2.append(len(np.intersect1d(idxs, uni_df.columns)))

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axr) = plt.subplots(1,2,figsize=(10,5))

    for min_class in plsda_feats:
        plsda_line_min = []
        for i in plsda_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(plsda_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            plsda_line_min.append(len(np.intersect1d(idxs, PLSDA_feats_real.index[:top10])))

        plsda_line2_min = []
        for i in plsda_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(plsda_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            plsda_line2_min.append(len(np.intersect1d(idxs, uni_df.columns)))

        axl.plot((np.array(list(plsda_feats[min_class].keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)
        axr.plot((np.array(list(plsda_feats[min_class].keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line2_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)

    axl.plot((np.array(list(plsda_feats_together.keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line)/top10*100, color='black',
                label='All Together', linewidth=3)
    axr.plot((np.array(list(plsda_feats_together.keys()))+n_min_class)/n_max_class*100,
             np.array(plsda_line2)/top10*100, color='black',
            label='All Together', linewidth=3)        

    axl.set_title('Against PLS-DA Top 2% Imp. Feats', fontsize=15)
    axr.set_title('Against Univariate Significant Feats', fontsize=15)
    axl.set_ylabel('% of Common Features with Real Model', fontsize = 15)
    f.supxlabel('% of Samples of the Minority Class in Comparison to the Majority Classes', fontsize = 15)
    #axr.set_xlabel('% Augmentation', fontsize = 15)
    axl.set_ylim(0,102)
    axr.set_ylim(0,102)
    #plt.legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))
    axr.legend(loc='lower left', fontsize=13)
    axl.tick_params(axis='both', which='major', labelsize=14)
    axr.tick_params(axis='both', which='major', labelsize=14)
    plt.suptitle('PLS-DA', fontsize=15)

#### Summary of Results
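The x-axis of the summary plots expresses the size of the minority class (real plus GAN samples) as a percentage of the majority class size. A minimal sketch of that conversion follows; the helper name `minority_pct` and the class sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def minority_pct(n_added, n_min_class, n_max_class):
    """Minority-class size (real + GAN samples) as a % of the majority-class size."""
    # Convert to an array first so the offset is added element-wise
    return (np.asarray(n_added) + n_min_class) / n_max_class * 100

# Hypothetical sizes, for illustration only
n_min_class, n_max_class = 5, 20
n_added = [0, 5, 10, 15]   # GAN samples added at each augmentation step

pct = minority_pct(n_added, n_min_class, n_max_class)
print(pct)
```

With these made-up sizes, zero added samples corresponds to 25% and fifteen added samples to 100%, i.e. a balanced dataset.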

In [None]:
# Plotting performance averaged over folds and over the three minority classes
with plt.style.context('seaborn-whitegrid'):
    f, (axl,axr) = plt.subplots(1,2,figsize=(8,4), constrained_layout=True)
    for i in ['F1-Score', 'Precision', 'Recall']:
        avg_increase = (pd.DataFrame.from_dict(RF_results_bal[i]['DMP']).mean()
                + pd.DataFrame.from_dict(RF_results_bal[i]['FTMP']).mean()
               + pd.DataFrame.from_dict(RF_results_bal[i]['UCB']).mean())/3
        axl.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
                 label='Avg. ' + i)

        avg_increase = (pd.DataFrame.from_dict(PLSDA_results_bal[i]['DMP']).mean()
                + pd.DataFrame.from_dict(PLSDA_results_bal[i]['FTMP']).mean()
               + pd.DataFrame.from_dict(PLSDA_results_bal[i]['UCB']).mean())/3
        axr.plot((avg_increase.index + n_min_class)/n_max_class*100, avg_increase, 
                 label='Avg. ' + i)
    axl.set_ylim([0.9, 1.005])
    axr.set_ylim([0.9, 1.005])
    axl.set_title('Random Forest', fontsize=15)
    axr.set_title('PLS-DA', fontsize=15)

    axl.set_ylabel('Performance', fontsize=15)
    f.supxlabel('% of Samples of the Minority Class in Comparison to the Majority Classes', fontsize=15)
    axr.legend(fontsize=13)
    #plt.suptitle('HD', fontsize=18)

    plt.show()
    f.savefig('Images/MPint_Imbalanced_AugPerformance_plot.png', dpi=400)
    f.savefig('Images/MPint_Imbalanced_AugPerformance_plot.pdf', dpi=400)

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axl, axr) = plt.subplots(1,2, figsize=(8,4), constrained_layout=True)

    # Convert the index to an array before adding the offset so the sum is element-wise
    axl.plot((np.array(list(rf_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             rf_feats_together_abrev.values/rf_feats_together_abrev.values[0]*100, label='All Together',
             color='Black', linewidth=3)
    for i in rf_feats_abrev:
        axl.plot((np.array(list(rf_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             rf_feats_abrev[i].values/rf_feats_abrev[i].values[0]*100,
                 label='Minority Class: ' +i, color=label_colors_test[i])

    axl.set_ylabel('Avg. Gini Imp. of top 2% Imp. Feat. change (%)', fontsize = 10)
    axl.set_ylim(75,150)
    axl.legend(fontsize=13)
    axl.tick_params(axis='both', which='major', labelsize=11)
    axl.set_title('Random Forest', fontsize=15)
    
    # Convert the index to an array before adding the offset so the sum is element-wise
    axr.plot((np.array(list(plsda_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             plsda_feats_together_abrev.values/plsda_feats_together_abrev.values[0]*100, label='All Together',
             color='Black', linewidth=3)
    for i in plsda_feats_abrev:
        axr.plot((np.array(list(plsda_feats_together_abrev.index))+n_min_class)/n_max_class*100,
             plsda_feats_abrev[i].values/plsda_feats_abrev[i].values[0]*100,
                 label='Minority Class: ' +i, color=label_colors_test[i])

    axr.set_ylabel('Avg. VIP Score of top 2% Imp. Feat. change (%)', fontsize = 10)
    f.supxlabel('% of Samples of the Minority Class in Comparison to the Majority Classes', fontsize = 15)
    axr.set_ylim(85,115)
    axr.tick_params(axis='both', which='major', labelsize=11)
    axr.set_title('PLS-DA', fontsize=15)
    #plt.suptitle('MPD', fontsize=18)
    f.savefig('Images/MPint_Imbalanced_ImpFeatChange_plot.png', dpi=400)
    f.savefig('Images/MPint_Imbalanced_ImpFeatChange_plot.pdf', dpi=400)

In [None]:
# Plotting the results and adjusting parameters of the plot
with plt.style.context('seaborn-whitegrid'):
    f, (axu, axd) = plt.subplots(2,2,figsize=(10,10), constrained_layout=True)
    axul, axur = axu
    axdl, axdr = axd
    
    for min_class in rf_feats:
        rf_line_min = []
        for i in rf_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(rf_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            rf_line_min.append(len(np.intersect1d(idxs, RF_feats_real.index[:top10])))

        rf_line_uni_min = []
        for i in rf_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(rf_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            rf_line_uni_min.append(len(np.intersect1d(idxs, uni_df.columns)))

        axul.plot((np.array(list(rf_feats[min_class].keys()))+n_min_class)/n_max_class*100, np.array(rf_line_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)
        axur.plot((np.array(list(rf_feats[min_class].keys()))+n_min_class)/n_max_class*100,
                 np.array(rf_line_uni_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)

    axul.plot((np.array(list(rf_feats_together.keys()))+n_min_class)/n_max_class*100,
                 np.array(rf_line)/top10*100, color='black',
                label='All Together', linewidth=3)
    axur.plot((np.array(list(rf_feats_together.keys()))+n_min_class)/n_max_class*100,
             np.array(rf_line2)/top10*100, color='black',
            label='All Together', linewidth=3)

    axul.set_title('Against RF Top 2% Imp. Feat.', fontsize=15)
    axur.set_title('Against Univariate Significant Feat.', fontsize=15)
    axul.set_ylabel('% of Common Features with RF Real Model', fontsize = 14)
    #f.set_xlabel('% of Samples of the Minority Class in Comparison to the Majority Class', fontsize = 15)
    axul.set_ylim(0,105)
    axur.set_ylim(0,105)
    #plt.legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))
    axur.legend(loc='lower left', fontsize=15)
    axul.tick_params(axis='both', which='major', labelsize=14)
    axur.tick_params(axis='both', which='major', labelsize=14)
    
    for min_class in plsda_feats:
        plsda_line_min = []
        for i in plsda_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(plsda_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            plsda_line_min.append(len(np.intersect1d(idxs, PLSDA_feats_real.index[:top10])))

        plsda_line_uni_min = []
        for i in plsda_feats[min_class]:
            idxs = []
            for l in pd.DataFrame(plsda_feats[min_class])[i].sort_values(ascending=False)[:top10].index:
                idxs.append(bal_datasets[min_class][1][0][0].columns[l-1])
            plsda_line_uni_min.append(len(np.intersect1d(idxs, uni_df.columns)))

        axdl.plot((np.array(list(plsda_feats[min_class].keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)
        axdr.plot((np.array(list(plsda_feats[min_class].keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line_uni_min)/top10*100,
                 color=label_colors_test[min_class],
                 label='Minority Class: ' +min_class)

    axdl.plot((np.array(list(plsda_feats_together.keys()))+n_min_class)/n_max_class*100,
                 np.array(plsda_line)/top10*100, color='black',
                label='All Together', linewidth=3)
    axdr.plot((np.array(list(plsda_feats_together.keys()))+n_min_class)/n_max_class*100,
             np.array(plsda_line2)/top10*100, color='black',
            label='All Together', linewidth=3)       

    axdl.set_title('Against PLS-DA Top 2% Imp. Feat.', fontsize=15)
    axdr.set_title('Against Univariate Significant Feat.', fontsize=15)
    axdl.set_ylabel('% of Common Features with Real PLS-DA Model', fontsize = 14)
    f.supxlabel('% of Samples of the Minority Class in Comparison to the Majority Classes', fontsize = 15)
    axdl.set_ylim(0,105)
    axdr.set_ylim(0,105)
    axdl.tick_params(axis='both', which='major', labelsize=14)
    axdr.tick_params(axis='both', which='major', labelsize=14)
    #plt.suptitle('HD', fontsize=18)
    f.savefig('Images/MPint_Imbalanced_ImpFeatCommon_plot.png', dpi=400)
    f.savefig('Images/MPint_Imbalanced_ImpFeatCommon_plot.pdf', dpi=400)

### Comparing Imbalanced Datasets, Imb + Min. Cl. GAN Samples and only GAN samples

This section fits and evaluates models trained only on GAN-generated samples, and gathers the results already obtained for two other cases: the imbalanced datasets (no augmentation) and the imbalanced datasets balanced with minority class GAN samples.
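
The cells in this section pick the imbalanced and balanced cases out of the augmentation results by the number of GAN samples added. A minimal sketch of that lookup is given here, assuming a nested layout `results[metric][minority_class][n_added][fold]` inferred from how `RF_results_bal` and `PLSDA_results_bal` are indexed; the sizes and scores are hypothetical.

```python
# Hypothetical sizes and scores, for illustration only
n_min_class, n_max_class = 5, 20

# Assumed layout: results[metric][minority_class][n_gan_samples_added][fold]
results_bal = {
    'F1-Score': {
        'DMP': {
            0:  {1: 0.80, 2: 0.84},   # no augmentation (imbalanced)
            15: {1: 0.90, 2: 0.92},   # balanced with GAN samples
        }
    }
}

# Number of GAN samples needed for the minority class to match the majority class
n_to_balance = n_max_class - n_min_class

imbalanced = results_bal['F1-Score']['DMP'][0]
balanced = results_bal['F1-Score']['DMP'][n_to_balance]
print(imbalanced, balanced)
```

Key `0` selects the models trained without augmentation, and key `n_max_class - n_min_class` selects the models trained on the fully balanced data.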

##### RF
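The RF test cell below scores each fold with `precision_recall_fscore_support(..., average='weighted')`. To make explicit what the weighted average does, here is a hand-rolled NumPy sketch that averages per-class scores by class support. The function `weighted_prf` and the toy labels are hypothetical, and the sketch treats an undefined precision as 0.

```python
import numpy as np

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 (per-class scores averaged by class size)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, support = np.unique(y_true, return_counts=True)
    prec, rec, f1 = [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))   # correctly predicted as c
        n_pred = np.sum(y_pred == c)                 # predicted as c
        n_true = np.sum(y_true == c)                 # truly c (the support)
        p = tp / n_pred if n_pred else 0.0           # undefined precision treated as 0
        r = tp / n_true
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if (p + r) else 0.0)
    w = support / support.sum()                      # class weights from support
    return float(np.dot(w, prec)), float(np.dot(w, rec)), float(np.dot(w, f1))

# Toy labels, for illustration only
y_true = ['FTMP', 'FTMP', 'FTMP', 'DMP']
y_pred = ['FTMP', 'FTMP', 'DMP', 'DMP']

prec, rec, f1 = weighted_prf(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(prec, rec, f1, accuracy)
```

Support-weighted recall is algebraically equal to accuracy, so the 'Recall' and 'Accuracy' entries for RF are expected to coincide.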

In [None]:
# Fitting and storing Random Forest models for each fold
RF_models_GAN = {}
RF_models_real = {}
RF_models_GAN_bal = {}

# Train the Models
for i in generated_samples:
    RF_models_GAN[i] = {}
    RF_models_real[i] = {}
    RF_models_GAN_bal[i] = {}
    for fold in generated_samples[i]:
        rf_mod = ma.RF_model(generated_samples[i][fold][0], generated_samples[i][fold][1], return_cv=False, n_trees=200)
        RF_models_GAN[i][fold] = rf_mod

        RF_models_real[i][fold] = RF_models_bal[i][0][fold]

        RF_models_GAN_bal[i][fold] = RF_models_bal[i][n_max_class-n_min_class][fold]

In [None]:
# Testing the RF models with the test data for each fold
RF_results_GAN = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                  'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}
RF_results_real = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                   'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}
RF_results_GAN_bal = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                      'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}

for min_class in RF_models_bal:
    for fold in generated_samples[min_class]:
        RF_results_GAN['Accuracy'][min_class][fold] = RF_models_GAN[min_class][fold].score(
                                                                                df_storage_test[min_class][fold],
                                                                                lbl_storage_test[min_class][fold])
        preds = RF_models_GAN[min_class][fold].predict(df_storage_test[min_class][fold])
        prec, rec, f1, sup = precision_recall_fscore_support(lbl_storage_test[min_class][fold], preds,
                                                            average='weighted')
        RF_results_GAN['F1-Score'][min_class][fold] = f1
        RF_results_GAN['Precision'][min_class][fold] = prec
        RF_results_GAN['Recall'][min_class][fold] = rec

        RF_results_real['Accuracy'][min_class][fold] = RF_results_bal['Accuracy'][min_class][0][fold]
        RF_results_real['F1-Score'][min_class][fold] = RF_results_bal['F1-Score'][min_class][0][fold]
        RF_results_real['Precision'][min_class][fold] = RF_results_bal['Precision'][min_class][0][fold]
        RF_results_real['Recall'][min_class][fold] = RF_results_bal['Recall'][min_class][0][fold]
        
        RF_results_GAN_bal['Accuracy'][min_class][fold] = RF_results_bal[
            'Accuracy'][min_class][n_max_class-n_min_class][fold]
        RF_results_GAN_bal['F1-Score'][min_class][fold] = RF_results_bal[
            'F1-Score'][min_class][n_max_class-n_min_class][fold]
        RF_results_GAN_bal['Precision'][min_class][fold] = RF_results_bal[
            'Precision'][min_class][n_max_class-n_min_class][fold]
        RF_results_GAN_bal['Recall'][min_class][fold] = RF_results_bal['Recall'][min_class][n_max_class-n_min_class][fold]

##### PLS-DA

In [None]:
# Fitting PLS-DA models on GAN-only data and collecting results for each fold
PLSDA_models_GAN = {}
PLSDA_results_GAN = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                     'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}

PLSDA_models_real = {}
PLSDA_results_real = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                     'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}

PLSDA_models_GAN_bal = {}
PLSDA_results_GAN_bal = {'Accuracy':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'F1-Score':{'FTMP':{}, 'DMP':{}, 'UCB':{}},
                     'Precision':{'FTMP':{}, 'DMP':{}, 'UCB':{}}, 'Recall':{'FTMP':{}, 'DMP':{}, 'UCB':{}}}

# Train the Models
for min_class in bal_datasets:
    PLSDA_models_GAN[min_class] = {}
    PLSDA_models_real[min_class] = {}
    PLSDA_models_GAN_bal[min_class] = {}

    for fold in bal_datasets[min_class]:

        PLSDA_models_GAN[min_class][fold] = ma.fit_PLSDA_model(generated_samples[min_class][fold][0],
                                                               generated_samples[min_class][fold][1],
                                                          n_comp=4,
                                                  return_scores=False, scale=False, encode2as1vector=True)
        plsda = PLSDA_models_GAN[min_class][fold]
        # Obtain results with the test group
        y_pred = plsda.predict(df_storage_test[min_class][fold])
        y_true = ma._generate_y_PLSDA(lbl_storage_test[min_class][fold],
                                      pd.unique(generated_samples[min_class][fold][1]),
                                      False)
        # Calculate accuracy, F1-score, precision and recall
        accuracy, f1, prec, rec = decision_rule(y_pred, y_true)
        PLSDA_results_GAN['Accuracy'][min_class][fold] = accuracy
        PLSDA_results_GAN['F1-Score'][min_class][fold] = f1
        PLSDA_results_GAN['Precision'][min_class][fold] = prec
        PLSDA_results_GAN['Recall'][min_class][fold] = rec
        
        PLSDA_models_real[min_class][fold] = PLSDA_models_bal[min_class][0][fold]
        
        PLSDA_results_real['Accuracy'][min_class][fold] = PLSDA_results_bal['Accuracy'][min_class][0][fold]
        PLSDA_results_real['F1-Score'][min_class][fold] = PLSDA_results_bal['F1-Score'][min_class][0][fold]
        PLSDA_results_real['Precision'][min_class][fold] = PLSDA_results_bal['Precision'][min_class][0][fold]
        PLSDA_results_real['Recall'][min_class][fold] = PLSDA_results_bal['Recall'][min_class][0][fold]
        
        PLSDA_models_GAN_bal[min_class][fold] = PLSDA_models_bal[min_class][n_max_class-n_min_class][fold]
        
        PLSDA_results_GAN_bal['Accuracy'][min_class][fold] = PLSDA_results_bal[
            'Accuracy'][min_class][n_max_class-n_min_class][fold]
        PLSDA_results_GAN_bal['F1-Score'][min_class][fold] = PLSDA_results_bal[
            'F1-Score'][min_class][n_max_class-n_min_class][fold]
        PLSDA_results_GAN_bal['Precision'][min_class][fold] = PLSDA_results_bal[
            'Precision'][min_class][n_max_class-n_min_class][fold]
        PLSDA_results_GAN_bal['Recall'][min_class][fold] = PLSDA_results_bal[
            'Recall'][min_class][n_max_class-n_min_class][fold]

##### Summarising Results
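The summary cells below collapse the per-class, per-fold scores into one mean and one standard deviation with `pd.DataFrame(...).values.mean()` and `.values.std()`. A minimal sketch with hypothetical scores shows what these calls compute. Note that `.values` returns a NumPy array, whose `std` defaults to the population form (`ddof=0`), unlike `DataFrame.std`, which defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold scores: scores[minority_class][fold]
scores = {
    'FTMP': {1: 0.90, 2: 1.00},
    'DMP':  {1: 0.80, 2: 0.90},
}
table = pd.DataFrame(scores)   # rows = folds, columns = minority classes

mean_all = table.values.mean()          # mean over every class/fold entry
std_pop = table.values.std()            # NumPy default: population std (ddof=0)
std_sample = table.values.std(ddof=1)   # sample std, shown for comparison
print(mean_all, std_pop, std_sample)
```

With few folds the two standard deviations differ noticeably, which is worth keeping in mind when reading the error bars.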

In [None]:
# Bar plots of mean F1-Score and Recall (with std) for the three training datasets
with sns.axes_style("whitegrid"):
    with sns.plotting_context("notebook", font_scale=1.5):
        f, axs = plt.subplots(1, 2, figsize=(6, 3), constrained_layout=True)
        for metric, axu in zip(['F1-Score', 'Recall'], axs.ravel()):
            x = np.arange(2)  # the label locations
            l = ['RF', 'PLSDA']
            width = 0.23  # the width of the bars

            offset = - 0.25 + 0 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.DataFrame(RF_results_real[metric]).values.mean(), 
                                                                pd.DataFrame(PLSDA_results_real[metric]).values.mean()],
                                         'STD': [pd.DataFrame(RF_results_real[metric]).values.std(), 
                                                 pd.DataFrame(PLSDA_results_real[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric], width, label='Exp. Imb. Data', color='green')
            axu.errorbar(x + offset, y=accuracy_stats[metric], yerr=accuracy_stats['STD'],
                                ls='none', ecolor='0.2', capsize=3)

            offset = - 0.25 + 1 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.DataFrame(RF_results_GAN_bal[metric]).values.mean(), 
                                                            pd.DataFrame(PLSDA_results_GAN_bal[metric]).values.mean()],
                                         'STD': [pd.DataFrame(RF_results_GAN_bal[metric]).values.std(), 
                                                 pd.DataFrame(PLSDA_results_GAN_bal[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric],
                            width, label='Imb. + GAN Data', color='red')
            axu.errorbar(x + offset, y=accuracy_stats[metric], yerr=accuracy_stats['STD'],
                                ls='none', ecolor='0.2', capsize=3)

            offset = - 0.25 + 2 * 0.25
            accuracy_stats = pd.DataFrame({metric: [pd.DataFrame(RF_results_GAN[metric]).values.mean(), 
                                                                pd.DataFrame(PLSDA_results_GAN[metric]).values.mean()],
                                         'STD': [pd.DataFrame(RF_results_GAN[metric]).values.std(), 
                                                 pd.DataFrame(PLSDA_results_GAN[metric]).values.std()]})
            rects = axu.bar(x + offset, accuracy_stats[metric], width, label='GAN Data', color='blue')
            axu.errorbar(x + offset, y=accuracy_stats[metric], yerr=accuracy_stats['STD'],
                                ls='none', ecolor='0.2', capsize=3)

            for spine in axu.spines.values():
                spine.set_edgecolor('0.1')
            axu.set_xticks(x)
            axu.set_xticklabels(l, fontsize=16)
            axu.set_title(metric, fontsize=15)
        axs[0].set(ylabel='Performance', ylim=(0,1.05))
        axs[1].tick_params(labelleft=False)
        #axs[1].set_yticks([])
        axs[1].legend(loc='upper left', fontsize=13, bbox_to_anchor=(1,1))
        #plt.suptitle('MPD Dataset                ', fontsize=18)

        plt.show()
        f.savefig('Images/MPint_Imbalanced_Accuracy_plot.png', dpi=400)
        f.savefig('Images/MPint_Imbalanced_Accuracy_plot.pdf', dpi=400)

In [None]:
# Mean and standard deviation of each metric over all minority classes and folds
Results_abridged = {'RF':{}, 'PLS-DA':{}}
Results_abridged_std = {'RF':{}, 'PLS-DA':{}}
for i in RF_results_GAN:
    Results_abridged['RF'][i] = {'Real Imbalanced': pd.DataFrame(RF_results_real[i]).values.mean(),
                                 'GAN Augmented': pd.DataFrame(RF_results_GAN_bal[i]).values.mean(),
                                 'GAN': pd.DataFrame(RF_results_GAN[i]).values.mean()}
    Results_abridged['PLS-DA'][i] = {'Real Imbalanced': pd.DataFrame(PLSDA_results_real[i]).values.mean(),
                                     'GAN Augmented': pd.DataFrame(PLSDA_results_GAN_bal[i]).values.mean(),
                                     'GAN': pd.DataFrame(PLSDA_results_GAN[i]).values.mean()}
    
    Results_abridged_std['RF'][i] = {'Real Imbalanced': pd.DataFrame(RF_results_real[i]).values.std(),
                                 'GAN Augmented': pd.DataFrame(RF_results_GAN_bal[i]).values.std(),
                                 'GAN': pd.DataFrame(RF_results_GAN[i]).values.std()}
    Results_abridged_std['PLS-DA'][i] = {'Real Imbalanced': pd.DataFrame(PLSDA_results_real[i]).values.std(),
                                     'GAN Augmented': pd.DataFrame(PLSDA_results_GAN_bal[i]).values.std(),
                                     'GAN': pd.DataFrame(PLSDA_results_GAN[i]).values.std()}

In [None]:
pd.DataFrame(Results_abridged['RF'])

In [None]:
pd.DataFrame(Results_abridged_std['RF'])

In [None]:
pd.DataFrame(Results_abridged['PLS-DA'])

In [None]:
pd.DataFrame(Results_abridged_std['PLS-DA'])

### Comparing Important features of models built from Imbalanced Datasets, Imb + Min. Cl. GAN Samples and only GAN samples

#### Comparing Against Important Features of the Complete models

In [None]:
# Create a 2nd complete model but with only 1 iteration to compare the important features
RF_model_real1 = ma.RF_model_CV(p_df['NGP'], labels, iter_num=1, n_fold=n_fold, n_trees=200) 
RF_feats_real1 = pd.DataFrame(RF_model_real1['important_features']).set_index(0).sort_values(by=1, ascending=False)
RF_feats_real1.index = [p_df['NGP'].columns[i] for i in RF_feats_real1.index]
RF_feats_real1 

In [None]:
RF_feats_GAN = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
RF_feats_imb = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
RF_feats_GAN_bal = {'FTMP':{}, 'DMP':{}, 'UCB':{}}

for i in RF_models_real.keys():
    for fold in RF_models_real[i]:
        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)),
                                   RF_models_real[i][fold].feature_importances_))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        RF_feats_imb[i][fold] = temp_df.copy()

        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)),
                                   RF_models_GAN[i][fold].feature_importances_))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        RF_feats_GAN[i][fold] = temp_df.copy()

        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)),
                                   RF_models_GAN_bal[i][fold].feature_importances_))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        RF_feats_GAN_bal[i][fold] = temp_df.copy()
    
    print(i)

In [None]:
RF_feats_GAN_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}
RF_feats_imb_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}
RF_feats_GAN_bal_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}

for i in RF_models_real.keys():
    for fold in RF_models_real[i]:
        if fold == 1:
            temp_df_imb = RF_models_real[i][fold].feature_importances_
            temp_df_bal = RF_models_GAN_bal[i][fold].feature_importances_
            temp_df_GAN = RF_models_GAN[i][fold].feature_importances_
        else:
            temp_df_imb = temp_df_imb + RF_models_real[i][fold].feature_importances_
            temp_df_bal = temp_df_bal + RF_models_GAN_bal[i][fold].feature_importances_
            temp_df_GAN = temp_df_GAN + RF_models_GAN[i][fold].feature_importances_

    temp_df_imb = temp_df_imb / len(RF_models_real[i])
    temp_df_imb = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), temp_df_imb))
    temp_df_imb = temp_df_imb.set_index(0).sort_values(by=1, ascending=False)
    temp_df_imb.index = [generated_samples[i][fold][0].columns[a] for a in temp_df_imb.index]
    RF_feats_imb_mean[i] = temp_df_imb.copy()

    temp_df_bal = temp_df_bal / len(RF_models_GAN_bal[i])
    temp_df_bal = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), temp_df_bal))
    temp_df_bal = temp_df_bal.set_index(0).sort_values(by=1, ascending=False)
    temp_df_bal.index = [generated_samples[i][fold][0].columns[a] for a in temp_df_bal.index]
    RF_feats_GAN_bal_mean[i] = temp_df_bal.copy()

    temp_df_GAN = temp_df_GAN / len(RF_models_GAN[i])
    temp_df_GAN = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), temp_df_GAN))
    temp_df_GAN = temp_df_GAN.set_index(0).sort_values(by=1, ascending=False)
    temp_df_GAN.index = [generated_samples[i][fold][0].columns[a] for a in temp_df_GAN.index]
    RF_feats_GAN_mean[i] = temp_df_GAN.copy()

Calculate the intersection of the important features, from top 1 to top (number of features in the dataset), between the complete dataset model (averaged over 20 iterations) and each of the following: a single iteration of the complete dataset model, the imbalanced dataset, the balanced dataset and the GAN dataset.
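
The top-k comparison described above can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's own code: `topk_overlap_fraction` and the feature names are hypothetical, and the sketch runs k up to the full number of features.

```python
import numpy as np

def topk_overlap_fraction(ranking_a, ranking_b):
    """For k = 1..n, fraction of the top-k items shared by two rankings."""
    n = len(ranking_a)
    return np.array([len(np.intersect1d(ranking_a[:k], ranking_b[:k])) / k
                     for k in range(1, n + 1)])

# Hypothetical feature rankings (most important first)
reference = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
candidate = ['f2', 'f1', 'f5', 'f3', 'f4', 'f6']

frac = topk_overlap_fraction(reference, candidate)

# For a uniformly random ranking, the expected shared fraction at depth k is k / n
n = len(reference)
random_expected = np.arange(1, n + 1) / n
print(frac)
print(random_expected)
```

The `k / n` line gives a reference for the shuffled ('Random') baseline: a model whose curve stays close to it shares no more top features with the complete model than chance would give.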

In [None]:
intersections_RF = []
for i in range(1, len(RF_feats_real1)):
    intersections_RF.append(len(np.intersect1d(RF_feats_real1.index[:i], RF_feats_real.index[:i])))

In [None]:
intersections_RF_bal = {}
for cl in RF_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(RF_feats_real)):
        int_bal.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_GAN_bal_mean[cl].index[:i])))
    intersections_RF_bal[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_RF_GAN = {}
for cl in RF_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(RF_feats_real)):
        int_GAN.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_GAN_mean[cl].index[:i])))
    intersections_RF_GAN[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_RF_imb = {}
for cl in RF_feats_imb_mean:
    int_imb = []
    for i in range(1, len(RF_feats_real)):
        int_imb.append(len(np.intersect1d(RF_feats_real.index[:i], RF_feats_imb_mean[cl].index[:i])))
    intersections_RF_imb[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_RF = []
copy_shuffle = list(RF_feats_real.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(RF_feats_real)):
    random_intersections_RF.append(len(np.intersect1d(RF_feats_real.index[:i], copy_shuffle[:i])))

In [None]:
f, (axl, axc, axr) = plt.subplots(1,3,figsize=(16,6))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)), 
            label = 'Reference (expected var.)', color='Black', s=5)
axc.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Reference (expected var.)', color='Black', s=5)
axr.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Reference (expected var.)', color='Black', s=5)

for i,ax in zip(intersections_RF_imb, (axl,axc,axr)):
    ax.scatter(range(1,len(intersections_RF)+1),
               np.array(intersections_RF_imb[i]) / np.array(range(1,len(intersections_RF)+1)),
                label = 'Imbalanced Exp. Dataset', color='Green', s=5)
    # Colors match the other figures: red = GAN augmented, blue = GAN samples only
    ax.scatter(range(1,len(intersections_RF)+1),
               np.array(intersections_RF_bal[i]) / np.array(range(1,len(intersections_RF)+1)),
                label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
    ax.scatter(range(1,len(intersections_RF)+1),
               np.array(intersections_RF_GAN[i]) / np.array(range(1,len(intersections_RF)+1)),
                label = 'GAN Samples Dataset', color='Blue', s=5)
    
    ax.scatter(range(1,len(intersections_RF)+1),
               np.array(random_intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Random', color='Orange', s=5)
    
    ax.set_title('Minority Class: '+i, fontsize=15)

axl.legend(loc='center left', fontsize=15, bbox_to_anchor=(0.15,-0.2), ncol=3, markerscale=3)
axl.set_ylabel('Fraction of Common Compounds', fontsize=15)
axl.set_xlim([0,len(intersections_RF)//8])
axl.set_ylim([0,1])

axc.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=15)
axc.set_xlim([0,len(intersections_RF)//8])
axc.set_ylim([0,1])

axr.set_xlim([0,len(intersections_RF)//8])
axr.set_ylim([0,1])
plt.suptitle('Random Forest', fontsize=18)
plt.show()

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)), 
            label = 'Reference (expected var.)', color='Black', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_imb).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Imbalanced Exp. Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_bal).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_GAN).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(random_intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Random', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axl.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_RF)//8])
axl.set_ylim([0,1.01])
axl.set_title('Random Forest', fontsize=18)

##### PLS-DA
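The PLS-DA cells that follow repeat the RF procedure with VIP scores: the per-fold score vectors are averaged and the features are then ranked by the averaged score. A minimal sketch of that averaging step follows, with hypothetical feature names and scores. It stacks the folds and averages column-wise, which is one possible way to avoid depending on the first fold key being `1`.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold importance vectors (one value per feature)
feature_names = ['m/z 100', 'm/z 200', 'm/z 300']
fold_scores = {
    1: np.array([0.2, 0.5, 0.3]),
    2: np.array([0.4, 0.5, 0.1]),
}

# Stack folds as rows and average column-wise, independent of the fold key names
mean_scores = np.mean(list(fold_scores.values()), axis=0)

# Rank features by their averaged score, most important first
ranked = pd.Series(mean_scores, index=feature_names).sort_values(ascending=False)
print(ranked)
```

The resulting index plays the same role as the index of the `*_mean` data frames built in the notebook, which is what the intersection step compares.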

In [None]:
# Create a 2nd complete model but with only 1 iteration to compare the important features
np.random.seed(485)
n_fold = 5
PLSDA_model_real1 = ma.PLSDA_model_CV(p_df['NGP'], labels, n_comp=4, iter_num=1, n_fold=n_fold, feat_type='VIP')
PLSDA_feats_real1 = pd.DataFrame(PLSDA_model_real1['important_features']).set_index(0).sort_values(by=1, ascending=False)
PLSDA_feats_real1.index = [p_df['NGP'].columns[i] for i in PLSDA_feats_real1.index]
PLSDA_feats_real1 

In [None]:
PLSDA_feats_GAN = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
PLSDA_feats_imb = {'FTMP':{}, 'DMP':{}, 'UCB':{}}
PLSDA_feats_GAN_bal = {'FTMP':{}, 'DMP':{}, 'UCB':{}}

for i in PLSDA_models_real.keys():
    for fold in PLSDA_models_real[i]:
        vips = ma._calculate_vips(PLSDA_models_real[i][fold])
        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), vips))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        PLSDA_feats_imb[i][fold] = temp_df.copy()

        vips = ma._calculate_vips(PLSDA_models_GAN[i][fold])
        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), vips))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        PLSDA_feats_GAN[i][fold] = temp_df.copy()

        vips = ma._calculate_vips(PLSDA_models_GAN_bal[i][fold])
        temp_df = pd.DataFrame(zip(range(len(generated_samples[i][fold][0].columns)), vips))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [generated_samples[i][fold][0].columns[a] for a in temp_df.index]
        PLSDA_feats_GAN_bal[i][fold] = temp_df.copy()

    print(i)

In [None]:
PLSDA_feats_GAN_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}
PLSDA_feats_imb_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}
PLSDA_feats_GAN_bal_mean = {'FTMP':[], 'DMP':[], 'UCB':[]}

# Pair each set of fitted models with the dictionary that stores its fold-averaged VIP ranking
model_sets_mean = ((PLSDA_models_real, PLSDA_feats_imb_mean),
                   (PLSDA_models_GAN_bal, PLSDA_feats_GAN_bal_mean),
                   (PLSDA_models_GAN, PLSDA_feats_GAN_mean))

for i in PLSDA_models_real.keys():
    folds = list(PLSDA_models_real[i])
    # Feature names are taken from the last fold; VIP scores are averaged position by position across folds
    feat_names = generated_samples[i][folds[-1]][0].columns

    for models, feats_mean in model_sets_mean:
        # Average the VIP score of each feature over all folds (independent of how folds are numbered)
        mean_vips = np.mean([ma._calculate_vips(models[i][fold]) for fold in folds], axis=0)
        temp_df = pd.DataFrame(zip(range(len(feat_names)), mean_vips))
        temp_df = temp_df.set_index(0).sort_values(by=1, ascending=False)
        temp_df.index = [feat_names[a] for a in temp_df.index]
        feats_mean[i] = temp_df.copy()

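Calculate the number of important features shared between top-k rankings, for k from 1 up to the number of features in the dataset (the loops stop one short of the full list, where the overlap is trivially complete). The reference ranking is the complete dataset (averaged over 20 iterations), and it is compared against a single iteration of the complete dataset, the imbalanced dataset, the balanced (GAN-augmented) dataset and the GAN samples dataset. A randomly shuffled ranking is included as a baseline.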
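
As a minimal sketch of this comparison on toy data, the snippet below computes the fraction of shared features for every top-k cut-off of two rankings. The function name `topk_overlap_fraction` and the feature names `m1` to `m5` are hypothetical and only serve as illustration; unlike the notebook loops, the sketch runs up to k = N.

```python
import numpy as np

def topk_overlap_fraction(ranking_a, ranking_b):
    """Fraction of features shared by the top-k of two rankings, for k = 1..N."""
    n = min(len(ranking_a), len(ranking_b))
    fractions = []
    for k in range(1, n + 1):
        # Number of features present in the top-k of both rankings
        shared = len(np.intersect1d(ranking_a[:k], ranking_b[:k]))
        fractions.append(shared / k)
    return np.array(fractions)

# Hypothetical feature rankings, most important feature first
reference = ['m1', 'm2', 'm3', 'm4', 'm5']
candidate = ['m2', 'm1', 'm5', 'm3', 'm4']

fractions = topk_overlap_fraction(reference, candidate)
print(fractions)
```

Dividing the intersection size by k is the same normalisation used for the y-axis ('Fraction of Common Compounds') of the plots further down.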

In [None]:
intersections_PLSDA = []
for i in range(1, len(PLSDA_feats_real1)):
    intersections_PLSDA.append(len(np.intersect1d(PLSDA_feats_real1.index[:i], PLSDA_feats_real.index[:i])))

In [None]:
intersections_PLSDA_bal = {}
for cl in PLSDA_feats_GAN_bal_mean:
    int_bal = []
    for i in range(1, len(PLSDA_feats_real)):
        int_bal.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_GAN_bal_mean[cl].index[:i])))
    intersections_PLSDA_bal[cl] = int_bal

print('Intersections - Dataset Balanced - Finished')

intersections_PLSDA_GAN = {}
for cl in PLSDA_feats_GAN_mean:
    int_GAN = []
    for i in range(1, len(PLSDA_feats_real)):
        int_GAN.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_GAN_mean[cl].index[:i])))
    intersections_PLSDA_GAN[cl] = int_GAN

print('Intersections - Dataset GAN - Finished')

intersections_PLSDA_imb = {}
for cl in PLSDA_feats_imb_mean:
    int_imb = []
    for i in range(1, len(PLSDA_feats_real)):
        int_imb.append(len(np.intersect1d(PLSDA_feats_real.index[:i], PLSDA_feats_imb_mean[cl].index[:i])))
    intersections_PLSDA_imb[cl] = int_imb

print('Intersections - Dataset Imbalanced - Finished')

# See intersections if features were randomly shuffled
random_intersections_PLSDA = []
copy_shuffle = list(PLSDA_feats_real.index).copy()
np.random.shuffle(copy_shuffle)
for i in range(1, len(PLSDA_feats_real)):
    random_intersections_PLSDA.append(len(np.intersect1d(PLSDA_feats_real.index[:i], copy_shuffle[:i])))

In [None]:
f, (axl, axc, axr) = plt.subplots(1,3,figsize=(16,6))
# Graph depicting intersection of important features
n_top = range(1, len(intersections_PLSDA)+1)

for i,ax in zip(intersections_PLSDA_imb, (axl,axc,axr)):
    # Reference: one iteration of the complete dataset against the averaged complete dataset
    ax.scatter(n_top, np.array(intersections_PLSDA) / np.array(n_top),
               label = 'Reference (expected var.)', color='Black', s=5)

    # Colors match the averaged figures below (GAN augmented in red, GAN samples in blue)
    ax.scatter(n_top, np.array(intersections_PLSDA_imb[i]) / np.array(n_top),
               label = 'Imbalanced Exp. Dataset', color='Green', s=5)
    ax.scatter(n_top, np.array(intersections_PLSDA_bal[i]) / np.array(n_top),
               label = 'GAN Augmented Exp. Dataset', color='Red', s=5)
    ax.scatter(n_top, np.array(intersections_PLSDA_GAN[i]) / np.array(n_top),
               label = 'GAN Samples Dataset', color='Blue', s=5)

    # Baseline: overlap with a randomly shuffled ranking
    ax.scatter(n_top, np.array(random_intersections_PLSDA) / np.array(n_top),
               label = 'Random', color='Orange', s=5)

    ax.set_title('Minority Class: '+i, fontsize=15)

axl.legend(loc='center left', fontsize=15, bbox_to_anchor=(0.15,-0.2), ncol=3, markerscale=3)
axl.set_ylabel('Fraction of Common Compounds', fontsize=15)
axl.set_xlim([0,len(intersections_PLSDA)//8])
axl.set_ylim([0,1])

axc.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=15)
axc.set_xlim([0,len(intersections_PLSDA)//8])
axc.set_ylim([0,1])

axr.set_xlim([0,len(intersections_PLSDA)//8])
axr.set_ylim([0,1])
plt.suptitle('PLS-DA', fontsize=18)
plt.show()

In [None]:
f, axl = plt.subplots(1,1,figsize=(8,4))
# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)), 
            label = 'Real-Real Intersections', color='Black', s=5)

axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_imb).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Imbalanced Real Dataset', color='Green', s=5)
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_bal).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'GAN Augmented Real Dataset', color='Red', s=5)
axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_GAN).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'GAN Samples Dataset', color='Blue', s=5)

axl.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(random_intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Random Intersections', color='Orange', s=5)

axl.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3)
axl.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=14)
axl.set_ylabel('Fraction of Common Compounds', fontsize=14)
axl.set_xlim([0,len(intersections_PLSDA)//8])
axl.set_ylim([0,1.01])
axl.set_title('PLS-DA', fontsize=18)
plt.show()

In [None]:
f, (axl, axr) = plt.subplots(1,2,figsize=(12,4), constrained_layout=True)

# Graph depicting intersection of important features
axl.scatter(range(1,len(intersections_RF)+1), np.array(intersections_RF) / np.array(range(1,len(intersections_RF)+1)), 
            label = 'Reference (expected var.)', color='Black', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_imb).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Exp. Imbalanced', color='Green', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_bal).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Imb. + Min. Cl. GAN Data', color='Red', s=5)
axl.scatter(range(1,len(intersections_RF)+1),
            np.array(pd.DataFrame(intersections_RF_GAN).mean(axis=1)) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Only GAN data', color='Blue', s=5)

axl.scatter(range(1,len(intersections_RF)+1),
            np.array(random_intersections_RF) / np.array(range(1,len(intersections_RF)+1)),
            label = 'Random', color='Orange', s=5)

axl.set_xlabel('Nº of Top Important (Gini Importance) Compounds', fontsize=13)
axl.set_ylabel('Fraction of Common Compounds', fontsize=13)
axl.set_xlim([0,len(intersections_RF)//8])
axl.set_ylim([0,1.01])
axl.set_title('Random Forest', fontsize=18)

# Graph depicting intersection of important features
axr.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)), 
            label = 'Reference (expected var.)', color='Black', s=5)

axr.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_imb).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Exp. Imbalanced', color='Green', s=5)
axr.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_bal).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Imb. + Min. Cl. GAN Data', color='Red', s=5)
axr.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(pd.DataFrame(intersections_PLSDA_GAN).mean(axis=1)) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Only GAN data', color='Blue', s=5)

axr.scatter(range(1,len(intersections_PLSDA)+1),
            np.array(random_intersections_PLSDA) / np.array(range(1,len(intersections_PLSDA)+1)),
            label = 'Random', color='Orange', s=5)

axr.legend(loc='upper left', fontsize=11, ncol=1, bbox_to_anchor=(1,1), markerscale=3, handletextpad=0)
axr.set_xlabel('Nº of Top Important (VIP Score) Compounds', fontsize=13)
axr.set_ylabel('Fraction of Common Compounds', fontsize=13)
axr.set_xlim([0,len(intersections_PLSDA)//8])
axr.set_ylim([0,1.01])
axr.set_title('PLS-DA', fontsize=18)
f.savefig('Images/MPint_Imbalanced_ImpFeat_plot.png', dpi=400)
f.savefig('Images/MPint_Imbalanced_ImpFeat_plot.pdf', dpi=400)