# 2.3.2. Increasing training data using unlabeled datasets

Let $X^S$ and $Y^S$ be the source data and its corresponding labels, where $y=\{y_j : j\in\{1,2,\dots,L\}\}$ and $y\in Y^S$. A classifier trained on the labeled dataset $D^S := \{(x,y) : x\in X^S,\ y\in Y^S\}$ will be used to classify samples in an unlabeled dataset $D^T := \{(x,\varnothing) : x\in X^T\}$, where $X^T$ is the target data and $\varnothing$ indicates that a label has not yet been assigned. The epistemic uncertainty of each sample's computed labels (e.g., using the method proposed by Yang et al. [16]) will be measured as follows:


$u_j \triangleq \text{uncertainty}(\hat{y}_j \mid x, W), \quad \forall x\in X^{T},\ j\in\{1,2,\dots,L\}$


where $\hat{y}_j\in\{0,1\}$ is the predicted value for class $j$ associated with data $x$ given weights $W$ (the weights of a model trained on the source dataset $D^S$). The samples whose uncertainty, averaged over the $L$ classes, falls below a threshold $\gamma$ will be selected as follows:


$\hat{D}^T \triangleq \{(x, \hat{y}_j) : \frac{1}{L}\sum_{j=1}^{L} u_j < \gamma,\ x\in X^{T},\ j\in\{1,2,\dots,L\}\}$


Finally, we will add all the selected samples, along with their computed labels, to the original training dataset: $D^{H} \triangleq D^{S} \cup \hat{D}^{T}$.
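The selection rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `select_confident_samples` is a hypothetical helper, `u` is assumed to hold one uncertainty value per sample and class, and the toy numbers are made up.

```python
import numpy as np

def select_confident_samples(u, y_hat, gamma):
    """Return indices and predicted labels of samples whose mean uncertainty is below gamma.

    u     : array of shape (n_samples, L), per-class uncertainties u_j
    y_hat : array of shape (n_samples, L), predicted labels in {0, 1}
    gamma : uncertainty threshold
    """
    u = np.asarray(u, dtype=float)
    y_hat = np.asarray(y_hat)
    mean_u = u.mean(axis=1)                 # (1/L) * sum_j u_j for each sample
    keep = np.flatnonzero(mean_u < gamma)   # indices of the selected target samples
    return keep, y_hat[keep]

# toy target set with three samples and L = 3 classes (illustrative values)
u = np.array([[0.05, 0.10, 0.03],
              [0.40, 0.35, 0.50],
              [0.20, 0.10, 0.15]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 1],
                  [1, 1, 0]])

keep, pseudo_labels = select_confident_samples(u, y_hat, gamma=0.2)
print(keep)           # indices of samples that would be added to the training set
print(pseudo_labels)  # their computed labels
```

The selected rows, paired with their computed labels, would then be concatenated with the labeled source set to form $D^H$.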

In Bayesian DNNs, we need to compute the full posterior $p(W \mid D^S)$. Different datasets can have different distributions, which makes computing the evidence term $p(D^S)$ troublesome. To apply the model to new cases, one needs to calculate the posterior predictive distribution $p(y_j=1 \mid x, D^S)=\int p(y_j=1 \mid x,W)\, p(W \mid D^S)\, dW$, and this integral is intractable to evaluate exactly. Instead, the posterior can be approximated by a variational distribution $Q(W)$, obtained by minimizing the Kullback-Leibler (KL) divergence between $Q(W)$ and the posterior. The posterior predictive distribution can then be approximated through Monte Carlo sampling:

$p(y_j=1 \mid x, D^S) \approx \int p(y_j=1 \mid x,W)\, Q(W)\, dW \approx \frac{1}{G}\sum_{g=1}^{G} p(y_j=1 \mid x, W_g), \quad \forall x\in X^T,\ j\in\{1,2,\dots,L\}$, 

where $G$ is the number of Monte Carlo samples and $W_g$ are the weights obtained by applying Monte Carlo dropout to the original weights $W$.


The uncertainty can then be measured as the standard deviation of these samples:


$\text{uncertainty}(\hat{y}_j \mid x, W) = \text{std}\big(\{p(y_j=1 \mid x, W_g) : g\in\{1,2,\dots,G\}\}\big)$
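As a rough illustration of the Monte Carlo estimate and the standard-deviation uncertainty, the sketch below uses a toy one-layer sigmoid classifier in NumPy rather than the DenseNet model loaded later in this notebook. Dropout is applied to the input units, which for each sample amounts to zeroing (and rescaling) the matching rows of $W$. The function name `mc_dropout_predict`, the `drop_rate` parameter, and the random toy weights are assumptions made for illustration only.

```python
import numpy as np

def mc_dropout_predict(x, W, b, drop_rate=0.5, G=50, seed=0):
    """Run G stochastic forward passes and return the mean and std of p(y_j = 1 | x, W_g)."""
    rng = np.random.default_rng(seed)
    probs = np.empty((G, x.shape[0], W.shape[1]))
    for g in range(G):
        keep_mask = rng.random(x.shape) >= drop_rate     # Bernoulli keep mask for this pass
        x_g = x * keep_mask / (1.0 - drop_rate)          # inverted-dropout rescaling
        probs[g] = 1.0 / (1.0 + np.exp(-(x_g @ W + b)))  # sigmoid output per class
    # the mean approximates the posterior predictive; the std is the uncertainty u_j
    return probs.mean(axis=0), probs.std(axis=0)

# toy data: 4 samples, 6 features, L = 3 classes (random, illustrative)
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
W = rng.normal(size=(6, 3))
b = np.zeros(3)

mean_p, std_p = mc_dropout_predict(x, W, b, drop_rate=0.5, G=50)
print(mean_p.shape, std_p.shape)
```

With `drop_rate=0.0` every pass is identical, so the standard deviation collapses to zero; the uncertainty comes entirely from the randomness of the dropout masks.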




# Importing the libraries

In [1]:
%reload_ext autoreload
%autoreload 2

import os
import sys
sys.path.append('../')

import funcs 
import load_data
import mlflow
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
import subprocess
from time import time
import git
import matplotlib.pyplot as plt 
import warnings
warnings.filterwarnings('ignore')

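# `%autoreload 2` above already reloads `load_data` and `funcs` when they change;
# `%reload_ext` is meant for IPython extensions, not for regular modules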

### TODO:
1. What are the state-of-the-art techniques for enlarging datasets (e.g., data augmentation, unsupervised techniques)? Compare how much our technique improves accuracy relative to those.

2. We should train the original model on only a portion of the original (chest) dataset to see how much this technique can improve the results. The portion size can also be varied to study the relation between dataset size and the change in accuracy.

3. Maybe one journal paper for the main technique, and then a conference paper for the domain fluctuation.

### Order of pathologies

In [2]:
pathologies = ["No Finding", "Enlarged Cardiomediastinum" , "Cardiomegaly" , "Lung Opacity" , "Lung Lesion", "Edema" , "Consolidation" , "Pneumonia" , "Atelectasis" , "Pneumothorax" , "Pleural Effusion" , "Pleural Other" , "Fracture" , "Support Devices"]

# TODO 1. run it for multiple thresholds to plot a graph
# TODO 2. find multi-label datasets (https://www.uco.es/kdis/mllresources/)

# What these results offer:
#       1. Improve the accuracy of existing models by showing the model multiple versions of each test image
#       2. Increase the size of existing datasets by adding unlabeled data to them
#       3. Can help with domain fluctuation: train on one domain, test on another using the ATT method (is this the Monte Carlo method or is it different? more details in the proposal)

### ssh-tunnel to server in the background

In [3]:
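# Popen already runs the tunnel in the background, so no trailing '&' is needed;
# passing an argument list (no shell) lets ssh_session.kill() stop ssh itself
command = ['ssh', '-N', '-L', '5000:localhost:5432', 'artinmajdi@data7-db1.cyverse.org']
ssh_session = subprocess.Popen(command, stdout=subprocess.PIPE)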

### Setting up mlflow config

In [4]:
# getting the server config
server, artifact = funcs.mlflow_settings()

# setting the server uri
mlflow.set_tracking_uri(server)


# Setting up the experiment
experiment_name = 'expanding_dataset_aim1_2' # 'label_inter_dependence'
mlflow.set_experiment(experiment_name=experiment_name)

# Running the PARENT simulation holding the optimized model
run_id_parent = 'bc306d0c76b94e19845f442f143fd5df' 

# starting the parent session
session_parent = mlflow.start_run(run_id=run_id_parent)

# starting the child session
session_child = mlflow.start_run(nested=True) # run_name='test' or run_id='

# mlflow.set_tag('run_id',session_child.info.run_id)
# mlflow.start_run(run_id='18aa1eac47264f4dba07fce44984b438')

### Loading the data

In [5]:
mode_labels = 'uncertain'
Data, Info, data_generator, data_generator_aug = load_data.load_chest_xray_with_mode(dataset='chexpert', mode=mode_labels, max_sample=1000)

before sample-pruning
train: (223414, 19)
test: (234, 19)

after sample-pruning
train (certain): (567, 20)
train (uncertain): (291, 20)
valid: (142, 20)
test: (169, 20) 

Found 291 validated image filenames.
Found 291 validated image filenames.


### Getting the optimized model from server

In [6]:
# Loading the trained model
# mlflow.set_experiment(experiment_name='soft_weighted_MV_aim1_3')
model = mlflow.keras.load_model(model_uri='runs:/{}/model'.format(run_id_parent),compile=False)

#  Compiling the model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss      = funcs.weighted_bce_loss(Info.class_weights) 
metrics   = [tf.keras.metrics.binary_accuracy]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

### Measuring accuracy after changing the nan labels

In [7]:
# measuring the accuracy
MA = funcs.Measure_Accuracy_Aim1_2(predict_accuracy_mode=True, generator=data_generator, model=model, how_to_treat_nans='ignore')
prob_ignore, acc_ignore = MA.loop_over_whole_dataset()

MA = funcs.Measure_Accuracy_Aim1_2(predict_accuracy_mode=True, generator=data_generator, model=model, how_to_treat_nans='pos')
prob_pos, acc_pos = MA.loop_over_whole_dataset()

MA = funcs.Measure_Accuracy_Aim1_2(predict_accuracy_mode=True, generator=data_generator, model=model, how_to_treat_nans='neg')
prob_neg, acc_neg = MA.loop_over_whole_dataset()


# converting to dataframe
df = pd.DataFrame({'ignore':acc_ignore , 'pos':acc_pos, 'neg':acc_neg},index=pathologies)

df['maximum'] = df.columns[ df.values.argmax(axis=1) ]
df['change']  = df[['neg','pos']].max(axis=1) - df['ignore']

df = df[['ignore','pos','neg','maximum','change']]

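# marking pathologies whose accuracy did not change (.loc avoids chained assignment)
unchanged = df['change'] == 0.0
df.loc[unchanged, 'maximum'] = '--'
df['change'] = df['change'].astype(object)
df.loc[unchanged, 'change'] = '--'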
df

| pathology | ignore | pos | neg | maximum | change |
|---|---|---|---|---|---|
| No Finding | 0.0 | 100.0 | 0.0 | pos | 100.0 |
| Enlarged Cardiomediastinum | 51.1 | 11.8 | 88.5 | neg | 37.4 |
| Cardiomegaly | 61.1 | 37.8 | 66.6 | neg | 5.5 |
| Lung Opacity | 93.2 | 91.8 | 53.3 | ignore | -1.4 |
| Lung Lesion | 12.5 | 8.2 | 87.5 | neg | 75.0 |
| Edema | 69.2 | 69.8 | 40.4 | pos | 0.6 |
| Consolidation | 66.6 | 12.2 | 93.4 | neg | 26.8 |
| Pneumonia | 76.4 | 12.1 | 90.9 | neg | 14.5 |
| Atelectasis | 34.9 | 12.5 | 76.7 | neg | 41.8 |
| Pneumothorax | 88.2 | 62.2 | 49.1 | ignore | -26.0 |
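The three `how_to_treat_nans` modes compared above live in `funcs.Measure_Accuracy_Aim1_2`, whose code is not shown here. The sketch below is only a guess at what such modes might do, assuming that NaN marks a missing label, that `'pos'` maps it to 1, `'neg'` maps it to 0, and `'ignore'` excludes the cell from scoring. The helper `treat_nans` is hypothetical.

```python
import numpy as np

def treat_nans(labels, how='ignore'):
    """Hypothetical helper: return (labels, mask), where mask marks the cells used for scoring."""
    labels = np.array(labels, dtype=float)   # copy so the caller's data is not modified
    missing = np.isnan(labels)
    if how == 'pos':
        labels[missing] = 1.0                # treat every missing label as positive
    elif how == 'neg':
        labels[missing] = 0.0                # treat every missing label as negative
    elif how != 'ignore':
        raise ValueError(f'unknown mode: {how}')
    # in 'ignore' mode the missing cells are masked out; otherwise every cell is scored
    mask = ~missing if how == 'ignore' else np.ones_like(missing)
    return labels, mask

labels = [[1, np.nan], [np.nan, 0]]
pos_labels, pos_mask = treat_nans(labels, how='pos')
neg_labels, neg_mask = treat_nans(labels, how='neg')
ign_labels, ign_mask = treat_nans(labels, how='ignore')
print(pos_labels, neg_labels, ign_mask, sep='\n')
```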


### viewing the augmented images

In [None]:
data_generator_aug.reset()
sample_index = 25

# re-drawing the first batch three times to view different augmentations of the same image
for j in range(1,4):

    data_generator_aug.batch_index = 0
    x, _  = next(data_generator_aug)

    ax = plt.subplot(1,3,j)
    ax.imshow(x[sample_index,...])


### Measuring the accuracy after augmentation

In [None]:
df = {}
columns = ['old-accuracy', 'new-accuracy', 'std']
accuracies_all_modes = {mode:{} for mode in columns}

In [None]:
for how_to_treat_nans in ['ignore', 'pos', 'neg']:

    print('How to treat nans:',how_to_treat_nans)

    all_outputs, MA = funcs.apply_technique_aim_1_2(how_to_treat_nans=how_to_treat_nans, data_generator=data_generator, data_generator_aug=data_generator_aug, model=model, uncertainty_type='std')

    df[how_to_treat_nans] = funcs.estimate_maximum_and_change(all_accuracies=all_outputs, pathologies=pathologies)
    
    for mode in columns: 
        accuracies_all_modes[mode][how_to_treat_nans] = all_outputs[mode]
   

In [None]:
# MA.probs_avg_2d

### Viewing the results

In [None]:
accuracies_all_modes.keys()

In [None]:
df['neg'].plot.barh()


In [None]:
df['neg']

# Applying the results to uncertain labels

### a. generating the results

In [None]:
model_name = 'DenseNet121'

# without augmentation and therefore no uncertainty
MA = funcs.Measure_Accuracy_Aim1_2(predict_accuracy_mode=True, generator=data_generator, model=model, how_to_treat_nans='ignore', uncertainty_type='std')

prob_orig, _ = MA.loop_over_whole_dataset()


# with augmentation and uncertainty
accuracies, MA = funcs.apply_technique_aim_1_2(how_to_treat_nans='ignore', data_generator=data_generator, data_generator_aug=data_generator_aug, model=model, uncertainty_type='std')

probs_avg_2d = MA.probs_avg_2d
probs_std_2d = MA.probs_std_2d


# saving the effect of augmentation as artifact
df_effect = funcs.estimate_maximum_and_change(all_accuracies=accuracies, pathologies=pathologies)


dataframes = Data.dataframe.copy()

### b. downloading the results

In [135]:
# getting the server config
server, artifact = funcs.mlflow_settings()

# setting the server uri
mlflow.set_tracking_uri(server)

# Setting up the experiment
experiment_name = 'expanding_dataset_aim1_2'
mlflow.set_experiment(experiment_name=experiment_name)

# starting the parent session
run_id = 'bc306d0c76b94e19845f442f143fd5df'
mlflow.start_run(run_id=run_id)

''

In [137]:
# loading the processed dataframe
dir = '/groups/jjrodrig/projects/chest/dataset/chexpert/'
train_raw = pd.read_csv(dir + '/train.csv')
(df_train, df_train_uncertain), df_valid, df_test, pathologies, class_weights = load_data.chexpert(dir=dir,max_sample=10000000)

dataframes = {'train_raw':train_raw, 'train':df_train, 'valid':df_valid, 'uncertain':df_train_uncertain, 'test':df_test}


# downloading the probabilities and uncertainties
local_dir = '../../increase_dataset/' 
# os.mkdir(local_dir)
full_path = mlflow.tracking.MlflowClient().download_artifacts(run_id, '', local_dir)

# using `model_name` so the Keras `model` loaded earlier is not overwritten
model_name = 'DenseNet121'
prob_orig    = pd.read_csv(full_path + 'prob_' + model_name + '_orig.csv'  ,index_col=['Unnamed: 0'])
probs_avg_2d = pd.read_csv(full_path + 'prob_' + model_name + '_aug.csv'   ,index_col=['Unnamed: 0'])
probs_std_2d = pd.read_csv(full_path + 'uncertainty_' + model_name + '.csv',index_col=['Unnamed: 0'])

In [138]:
# Measuring accuracy
prediction = probs_avg_2d.reset_index().copy()
truth = dataframes['uncertain'].reset_index().copy()

prediction[pathologies] = prediction[pathologies] > 0.5
truth = truth[pathologies].replace(-5,np.nan).replace(-10,np.nan).replace(1,True).replace(0,False)

func = lambda x1, x2: [ (x1[j] > 0.5) == (x2[j] > 0.5) for j in range(len(x1))]
pred_acc = truth[pathologies].combine(prediction[pathologies],func=func) # .to_list()
pred_acc = pred_acc.set_index(truth.index)
pred_acc[truth.isnull()] = np.nan

accuracy = np.nanmean(np.array(pred_acc),axis=0)

accuracy

array([       nan, 0.52242094, 0.43286486, 0.93319385, 0.15750487,
       0.52797478, 0.68221024, 0.79607109, 0.41781095, 0.44198895,
       0.80347517, 0.40163934, 0.0852159 , 0.62468958])
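The per-class accuracy above skips cells that have no ground-truth label, which is why the first entry is `nan`. The same idea can be sketched more directly with boolean masks. The helper `per_class_accuracy` and the toy frames are assumptions for illustration and are not part of `funcs`.

```python
import numpy as np
import pandas as pd

def per_class_accuracy(probs, truth, threshold=0.5):
    """Accuracy per column, counting only the cells where a ground-truth label exists."""
    pred = probs.to_numpy() > threshold
    true = truth.to_numpy(dtype=float)
    labeled = ~np.isnan(true)                          # cells that have a label
    correct = (pred == (true > threshold)) & labeled   # correct predictions on labeled cells
    n_labeled = labeled.sum(axis=0)
    # columns without any label get NaN instead of a division by zero
    acc = np.where(n_labeled > 0, correct.sum(axis=0) / np.maximum(n_labeled, 1), np.nan)
    return pd.Series(acc, index=probs.columns)

# toy example: three samples, three classes, class 'C' has no labels at all
probs = pd.DataFrame({'A': [0.9, 0.4, 0.7], 'B': [0.2, 0.8, 0.1], 'C': [0.7, 0.6, 0.3]})
truth = pd.DataFrame({'A': [1, 1, 0], 'B': [0, np.nan, 0], 'C': [np.nan, np.nan, np.nan]})

acc = per_class_accuracy(probs, truth)
print(acc)
```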

In [248]:
train_raw = pd.read_csv(dir + '/train.csv')

indexes = probs_avg_2d.index
train_uncertain  = train_raw.loc[indexes,pathologies] 
null_locations = train_uncertain.replace(0,np.nan).isnull().to_numpy()

# keeping only the cells whose uncertainty is within the threshold
uncertainty_threshold = 0.25
null_locations[probs_std_2d[pathologies].to_numpy() > uncertainty_threshold] = False

train_uncertain = train_uncertain.to_numpy()
predictions     = (probs_avg_2d[pathologies] > 0.5).replace(True,1).replace(False,-1)

# replacing the uncertain cells with predicted values
train_uncertain[null_locations]    = predictions.to_numpy()[null_locations]

# assigning the raw array so pandas does not re-align rows on a fresh 0..n-1 index
train_raw.loc[indexes,pathologies] = train_uncertain

# replacing all remaining nan cells with -1 (negative label)
train_raw = train_raw.replace(np.nan,-1)

# saving the final results
train_raw.to_csv(dir + '/train_aim1_2.csv')
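The label-filling step can be illustrated on a toy table whose row index does not start at zero, which is the situation here since `indexes` comes from the saved probability files. This is a simplified sketch: the helper `fill_missing_labels` is hypothetical, only NaN cells are treated as missing, and -1 is used as the negative label to match the convention in the cell above.

```python
import numpy as np
import pandas as pd

def fill_missing_labels(labels, probs, stds, max_std=0.25, negative_value=-1):
    """Fill NaN labels with thresholded predictions wherever the uncertainty is low enough."""
    missing = labels.isnull().to_numpy()
    confident = stds.to_numpy() <= max_std
    fill = missing & confident                               # cells to overwrite
    pseudo = np.where(probs.to_numpy() > 0.5, 1, negative_value)
    values = labels.to_numpy(dtype=float, copy=True)
    values[fill] = pseudo[fill]
    # reusing the original index keeps every row attached to the right sample
    return pd.DataFrame(values, index=labels.index, columns=labels.columns)

# toy data with a non-default row index (illustrative values)
idx, cols = [10, 20, 30], ['Edema', 'Pneumonia']
labels = pd.DataFrame([[1, np.nan], [np.nan, np.nan], [np.nan, 1]], index=idx, columns=cols)
probs  = pd.DataFrame([[0.9, 0.8], [0.2, 0.6], [0.7, 0.3]], index=idx, columns=cols)
stds   = pd.DataFrame([[0.1, 0.1], [0.1, 0.4], [0.3, 0.1]], index=idx, columns=cols)

filled = fill_missing_labels(labels, probs, stds)
print(filled)
```

Cells whose uncertainty exceeds the threshold stay NaN in this sketch, so a later step can decide how to treat them.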



In [250]:
# closing the child mlflow session
mlflow.end_run()

# closing the parent mlflow session
mlflow.end_run()

# closing the ssh session
ssh_session.kill()

### saving the results as artifact

In [None]:
# dataframes[mode_labels].shape
# prob_orig.shape

In [None]:
def save_artifacts(path, df_info, outputs, pathologies):
    df = df_info.drop(pathologies,axis=1)
    df_temp = pd.DataFrame(outputs,columns=pathologies).set_index(df.index)
    df[pathologies] = df_temp[pathologies]
    df.to_csv(path)

model_name = 'DenseNet121'
mode_labels = 'test'

mlflow.log_param('test','successful')
# without augmentation and therefore no uncertainty
path = f'../../prob_{model_name}_orig.csv'
save_artifacts(path=path, df_info=dataframes[mode_labels], outputs=prob_orig, pathologies=pathologies)
mlflow.log_artifact(path,artifact_path='')


# saving the augmentation and uncertainty data
path = f'../../prob_{model_name}_aug.csv'
save_artifacts(path=path, df_info=dataframes[mode_labels], outputs=probs_avg_2d, pathologies=pathologies)
mlflow.log_artifact(path,artifact_path='')


path = f'../../uncertainty_{model_name}.csv'
save_artifacts(path=path, df_info=dataframes[mode_labels], outputs=probs_std_2d, pathologies=pathologies)
mlflow.log_artifact(path,artifact_path='')


# saving the effect of augmentation as artifact
# path = f'../../effect_of_augmentation_{model_name}.csv'
# df_effect.to_csv(path)
# mlflow.log_artifact(path,artifact_path='')

In [None]:
# pred_labels = prob_aug > 0.5
# pred_labels
mlflow.log_artifact('/home/u29/mohammadsmajdi/projects/chest_xray/prob_DenseNet121_aug2.csv',artifact_path='')
mlflow.end_run()
# mlflow.end_run()

In [None]:
# # TODO apply the predicted probs to uncertain samples

# # all_outputs['MA'].probs_avg_2d
# # all_outputs['MA'].probs_std_2d
# truth = data_generator.labels.copy()

# # changing the null labels according to above results
# (truth==-10) and (all_outputs['MA'].probs_std_2d < 0.3)
# # uncertain_indexes = np.where(truth==-10 and all_outputs['MA'].probs_std_2d < 0.3)
# # truth[uncertain_indexes] = all_outputs['MA'].probs_avg_2d[uncertain_indexes] > 0.5

In [None]:
# np.unique(data_generator.labels)
# all_outputs['MA'].probs_avg_2d[uncertain_indexes]
# np.where(all_outputs['MA'].probs_std_2d > 0.3)

# killing the mlflow & ssh sessions

In [None]:
# closing the child mlflow session
mlflow.end_run()

# closing the parent mlflow session
mlflow.end_run()

# closing the ssh session
ssh_session.kill()