# Neural Image Compression # 

Using local data to test the pipeline on a few whole-slide images (WSIs).


## Imports ##

In [1]:
import numpy as np
import pandas as pd
from os.path import join, dirname, basename, splitext, exists
import os
from preprocessing import create_csv
from glob import glob
import sys
import shutil
import random
from tqdm import tqdm
from PIL import Image

import multiresolutionimageinterface as mri
import keras

Using TensorFlow backend.


## Data ##

To demonstrate the functionality of NIC, we need a set of whole-slide images (WSIs) with their respective slide-level labels. The data have already been reorganized so that all the TIFF files of each class are contained in a single folder.

We use a small subset of the TCGA dataset, located at:

`E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data\tcga_luad\images_diagnostic`

`E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data\tcga_lusc\images_diagnostic`

We use only the **diagnostic** slides, not the **tissue** slides. The masks are already provided, but we will eventually have to implement a script that creates these masks, which filter out the background.


Because there is no slide-level CSV file, we have to create one; it will be used once we have the featurized WSIs. The file should be located at:

`E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data\slide_original_list_tcga.csv`


In [2]:
from preprocessing import create_csv
from tqdm import tqdm
import os
import pandas as pd

In [3]:
# Creates csv from original data
root_dir=  r'E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction'

dir_luad =  r'E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data\tcga_luad\wsi_diagnostic_tif'
dir_lusc =  r'E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data\tcga_lusc\wsi_diagnostic_tif'
csv_path =  os.path.join(root_dir,'data/slide_original_list_tcga.csv')
cache_dir = None  # used to store local copies of files during I/O operations (useful on a cluster)


## 0. Preprocessing ##

We need to create a CSV file that lists the slides and their labels.

In [None]:
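# Body of create_csv, shown for reference. dir_class1, dir_class0 and ext are its
# arguments; binding them to the LUAD folder, the LUSC folder and '.tif' is an
# assumption based on the call create_csv(dir_luad, dir_lusc, csv_path, '.tif').
dir_class1, dir_class0, ext = dir_luad, dir_lusc, '.tif'

# Slide ids are the file names up to the first dot
files_class_1 = sorted([(os.path.basename(file)).split('.')[0] for file in tqdm(os.listdir(dir_class1)) if file.endswith(ext)])
files_class_0 = sorted([(os.path.basename(file)).split('.')[0] for file in tqdm(os.listdir(dir_class0)) if file.endswith(ext)])
labels1 = np.ones(len(files_class_1), dtype=np.int8)
labels0 = np.zeros(len(files_class_0), dtype=np.int8)

df1 = pd.DataFrame(list(zip(files_class_1, labels1)), columns=['slide_id', 'label'])
df0 = pd.DataFrame(list(zip(files_class_0, labels0)), columns=['slide_id', 'label'])

# Concatenate both classes into one table and export it
data = pd.concat([df1, df0], ignore_index=True)
data.to_csv(csv_path, index=False, header=True)
print('Csv file sucessfully exported!')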


In [11]:
print('Creating main csv data files from original data...')
create_csv(dir_luad, dir_lusc, csv_path, '.tif')

# read files to check shapes
df = pd.read_csv(csv_path)
print(f'Files were read with shapes: {df.shape}')

Creating main csv data files from original data...


100%|██████████| 10/10 [00:00<00:00, 10027.02it/s]
100%|██████████| 10/10 [00:00<00:00, 4996.79it/s]


Csv file sucessfully exported!
Files were read with shapes: (20, 2)
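
A quick sanity check on the exported list can catch an empty class or a duplicated slide early. The sketch below assumes the two-column layout (`slide_id`, `label`) written above; the slide ids are made-up placeholders and `summarize_slide_list` is a hypothetical helper, not part of the `preprocessing` module.

```python
import pandas as pd

def summarize_slide_list(df):
    """Return per-class slide counts and the number of duplicated slide ids."""
    counts = df['label'].value_counts().sort_index().to_dict()
    n_duplicates = int(df['slide_id'].duplicated().sum())
    return counts, n_duplicates

# Placeholder table standing in for pd.read_csv(csv_path)
example = pd.DataFrame({
    'slide_id': ['slide_a', 'slide_b', 'slide_c', 'slide_d'],
    'label': [1, 1, 0, 0],
})
counts, n_duplicates = summarize_slide_list(example)
print(counts, n_duplicates)
```

On the real file, one would expect 10 slides per class, given the two progress bars above.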


## 1. Encoder network ##

To perform NIC, we will need an encoder network to transform small image patches into embedding vectors. According to the paper, BiGAN produces the best unsupervised encoder and it is the one we will train here.

Alternatively, a collection of pretrained encoders (the ones used in the NIC paper) can be found in:

`./models/encoders_patches_pathology/*.h5`

Remember that these pretrained encoders accept 128x128x3 patches taken at 0.5 um/px resolution (often level 1), except for the BiGAN model that takes 64x64x3 at 1 um/px (often level 2).
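
The "often level 1" and "often level 2" hints follow from simple arithmetic if two things hold: level 0 is scanned at 0.25 um/px, and each pyramid level halves the resolution of the previous one. Both are assumptions that should be checked against the actual slides; `level_for_spacing` is a hypothetical helper used only for illustration.

```python
import math

def level_for_spacing(target_um_per_px, base_um_per_px=0.25):
    """Pyramid level whose pixel spacing matches the target, assuming every
    level doubles the pixel spacing of the level below it."""
    return round(math.log2(target_um_per_px / base_um_per_px))

level_128_encoders = level_for_spacing(0.5)  # 128x128x3 encoders
level_bigan = level_for_spacing(1.0)         # 64x64x3 BiGAN encoder
print(level_128_encoders, level_bigan)
```

A slide scanned at a different base resolution would map the same spacing to a different level, which is why the text says "often".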



In order to train the BiGAN model, we will first extract patches from the slides in the `encoder` partition. We will sample 10K patches per slide, producing ~260K patches in total. We select 96x96 patches to perform crop augmentation during training later.

In [None]:
# Do not run this cell for now; the encoder will be trained later.

from source.extract_patches import create_patch_dataset

patches_npy_path = join(root_dir, 'results', 'patches', 'training.npy')

# Extracts patches from whole-slide images and store them in a numpy array file
create_patch_dataset(
    input_dir=slide_dir,
    csv_path=csv_path,
    partition_tag='encoder',
    output_path=patches_npy_path,
    image_level=2,
    patch_size=96,
    n_patches_per_image=10000,
    cache_dir=join(cache_dir, 'patches') if cache_dir else None  # cache_dir is None on this local setup
)
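
The reason for extracting 96x96 patches while the encoder consumes 64x64 inputs is that a random 64x64 window can be cut from each patch at training time. The following is a minimal NumPy sketch of that crop augmentation, not the code used inside `BiganModel`; `random_crop` is a hypothetical helper and the patches are random placeholders.

```python
import numpy as np

def random_crop(batch, crop_size, rng):
    """Crop each patch of a (N, H, W, C) batch at an independent random offset."""
    n, h, w, _ = batch.shape
    ys = rng.integers(0, h - crop_size + 1, size=n)
    xs = rng.integers(0, w - crop_size + 1, size=n)
    return np.stack([
        patch[y:y + crop_size, x:x + crop_size]
        for patch, y, x in zip(batch, ys, xs)
    ])

rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(4, 96, 96, 3), dtype=np.uint8)  # placeholder patches
crops = random_crop(patches, crop_size=64, rng=rng)
print(crops.shape)
```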

Once we have extracted the patches, we can proceed to train the BiGAN model. We will use the hyper-parameters described in the NIC paper. 

In [None]:
from source.train_bigan_model import BiganModel

model_bigan_dir = join(root_dir, 'results', 'encoders', 'bigan', 'rotterdam1_96_noaug', '0.0001')

# Trains BiGAN
bigan = BiganModel(
    latent_dim=128,
    n_filters=128,
    lr=0.0001,
    patch_size=64,
)
bigan.train(
    x_path=patches_npy_path,
    output_dir=model_bigan_dir,
    epochs=400000,
    batch_size=64,
    sample_interval=1000,
    save_models_on_epoch=True
)

Beware that training this model is highly unstable: it can easily fail or collapse. If this happens, restart the training. Selecting a checkpoint is a manual procedure: inspect the generated images and the loss values, and avoid checkpoints with abnormal results.

## 2. Compress images ##

Once we have a trained encoder, we can proceed with the WSI compression. I recommend running several `IDLE` instances of the following code on the cluster to speed up the lengthy process.

Before the actual compression, we need to vectorize the WSIs. This process extracts all non-background patches from the slide and stores them in numpy array format for quick access. In this case, we will read 64x64 patches at 1 um/px resolution (level 2).
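
To make the idea of vectorization concrete, here is a minimal sketch on a toy array. It is not the implementation of `vectorize_images`: it assumes the tissue mask has the same resolution as the image and that a tile counts as tissue when at least one mask pixel is set. `extract_tissue_patches` is a hypothetical helper.

```python
import numpy as np

def extract_tissue_patches(image, mask, patch_size):
    """Tile `image` on a regular grid and keep the tiles that contain tissue.

    Returns the kept patches plus their (row, col) grid positions, so the
    slide layout can be rebuilt after encoding.
    """
    rows = image.shape[0] // patch_size
    cols = image.shape[1] // patch_size
    patches, positions = [], []
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * patch_size, (r + 1) * patch_size),
                      slice(c * patch_size, (c + 1) * patch_size))
            if mask[window].any():  # at least one tissue pixel in the tile
                patches.append(image[window])
                positions.append((r, c))
    return np.array(patches), np.array(positions)

# Toy 256x256 "slide" with tissue only in the top-left quadrant
image = np.zeros((256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=bool)
mask[:128, :128] = True
patches, positions = extract_tissue_patches(image, mask, patch_size=64)
print(patches.shape, positions.tolist())
```

Keeping the grid positions next to the patches is what allows the embeddings to be placed back into a spatial array during compression.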

In [4]:
from vectorize_wsi import vectorize_images
import os
from os.path import join

root_dir=  r'E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction'
data_dir = r'E:\pathology-weakly-supervised-lung-cancer-growth-pattern-prediction\data'

data_dir_luad = os.path.join(data_dir, 'tcga_luad', 'wsi_diagnostic_tif')
data_dir_lusc = os.path.join(data_dir, 'tcga_lusc', 'wsi_diagnostic_tif')
mask_dir_luad = os.path.join(data_dir, 'tcga_luad', 'tissue_masks_diagnostic')
mask_dir_lusc = os.path.join(data_dir, 'tcga_lusc', 'tissue_masks_diagnostic')

csv_path =  os.path.join(root_dir,'data', 'slide_original_list_tcga.csv')
cache_dir = None  # used to store local copies of files during I/O operations (useful on a cluster)

In [5]:
# Vectorize LUAD WSIs

vectorized_luad_dir = join(root_dir, 'results', 'tcga_luad', 'vectorized')

vectorize_images(
    input_dir=data_dir_luad,
    mask_dir=mask_dir_luad, 
    output_dir=vectorized_luad_dir, 
    cache_dir=cache_dir, 
    image_level=2, 
    patch_size=128
    )

Already existing file TCGA-05-4244-01Z-00-DX1 - 9 images left
Already existing file TCGA-05-4245-01Z-00-DX1 - 8 images left
Already existing file TCGA-05-4249-01Z-00-DX1 - 7 images left
Already existing file TCGA-05-4250-01Z-00-DX1 - 6 images left
Already existing file TCGA-05-4382-01Z-00-DX1 - 5 images left
Already existing file TCGA-05-4395-01Z-00-DX1 - 4 images left
Already existing file TCGA-05-4396-01Z-00-DX1 - 3 images left
Already existing file TCGA-05-4397-01Z-00-DX1 - 2 images left
Already existing file TCGA-05-4398-01Z-00-DX1 - 1 images left
Already existing file TCGA-4B-A93V-01Z-00-DX1 - 0 images left
Finish Processing All images!


In [6]:
# Vectorize LUSC WSIs

vectorized_lusc_dir = join(root_dir, 'results', 'tcga_lusc', 'vectorized')

vectorize_images(
    input_dir=data_dir_lusc,
    mask_dir=mask_dir_lusc, 
    output_dir=vectorized_lusc_dir, 
    cache_dir=cache_dir, 
    image_level=2, 
    patch_size=128
    )

Already existing file TCGA-33-4538-01Z-00-DX3 - 9 images left
Already existing file TCGA-52-7812-01Z-00-DX1 - 8 images left
Already existing file TCGA-60-2721-01Z-00-DX1 - 7 images left
Already existing file TCGA-77-8007-01Z-00-DX1 - 6 images left
Already existing file TCGA-77-8009-01Z-00-DX1 - 5 images left
Already existing file TCGA-77-8139-01Z-00-DX1 - 4 images left
Already existing file TCGA-77-8143-01Z-00-DX1 - 3 images left
Already existing file TCGA-77-A5G1-01Z-00-DX1 - 2 images left
Already existing file TCGA-NK-A5CR-01Z-00-DX1 - 1 images left
Already existing file TCGA-NK-A5D1-01Z-00-DX1 - 0 images left
Finish Processing All images!


Now we can compress the WSIs. Each WSI (vectorized file) will be processed 8 times due to WSI-level augmentation (rotation and flip). We will use an existing pretrained encoder from the NIC paper.
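
The factor of 8 comes from combining the four 90-degree rotations with an optional mirror. The sketch below enumerates these variants for a small array with NumPy; it only illustrates the arithmetic. At which stage `featurize_images` applies the transform is not shown here, and `dihedral_variants` is a hypothetical helper.

```python
import numpy as np

def dihedral_variants(features):
    """Return the 8 rotation/flip variants of a (H, W, C) array."""
    variants = []
    for k in range(4):  # 0, 90, 180 and 270 degree rotations
        rotated = np.rot90(features, k=k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))  # mirrored copy of each rotation
    return variants

grid = np.arange(2 * 3 * 1).reshape(2, 3, 1)  # tiny stand-in for a slide
variants = dihedral_variants(grid)
print(len(variants), [v.shape for v in variants[:4]])
```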

In [10]:
# Featurize images

from featurize_wsi import featurize_images

# Set paths
model_path = './neural-image-compression-private/models/encoders_patches_pathology/encoder_bigan.h5'
featurized_luad_dir = join(root_dir, 'results', 'tcga_luad', 'featurized')
featurized_lusc_dir = join(root_dir, 'results', 'tcga_lusc', 'featurized')

# Featurize LUAD data
featurize_images(
    input_dir=vectorized_luad_dir,
    model_path=model_path, 
    output_dir=featurized_luad_dir, 
    batch_size=32
    )

# Featurize LUSC data
featurize_images(
    input_dir=vectorized_lusc_dir,
    model_path=model_path, 
    output_dir=featurized_lusc_dir, 
    batch_size=32
    )

Already existing file TCGA-05-4244-01Z-00-DX1_{item} - 9 images left
Already existing file TCGA-05-4245-01Z-00-DX1_{item} - 8 images left
Already existing file TCGA-05-4249-01Z-00-DX1_{item} - 7 images left
Already existing file TCGA-05-4250-01Z-00-DX1_{item} - 6 images left
Already existing file TCGA-05-4382-01Z-00-DX1_{item} - 5 images left
Already existing file TCGA-05-4395-01Z-00-DX1_{item} - 4 images left
Already existing file TCGA-05-4396-01Z-00-DX1_{item} - 3 images left
Already existing file TCGA-05-4397-01Z-00-DX1_{item} - 2 images left
Already existing file TCGA-05-4398-01Z-00-DX1_{item} - 1 images left
Already existing file TCGA-4B-A93V-01Z-00-DX1_{item} - 0 images left
Finish Processing All images!
Already existing file TCGA-33-4538-01Z-00-DX3_{item} - 9 images left
Already existing file TCGA-52-7812-01Z-00-DX1_{item} - 8 images left
Already existing file TCGA-60-2721-01Z-00-DX1_{item} - 7 images left
Already existing file TCGA-77-8007-01Z-00-DX1_{item} - 6 images left
Alre

## 3. Train CNN on compressed images ##

Once we have compressed the WSIs, we can proceed with the CNN classifier. In this example, we will train a classifier targeting the binary label `HGP_SL` found in the CSV file. We will be training 4 models using cross-validation: in each fold, we will use 2 data partitions for training, 1 for validation and 1 for testing. At the end of model training, we perform inference on the test set, compute metrics, and run GradCAM on the images.
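
The 2/1/1 partition scheme can be generated by rotating the partition indices, so that every partition serves as the test set exactly once. This is a small sketch of that rotation; `rotating_folds` is a hypothetical helper, and the `partition_<i>` naming follows the `folds` list used in `train_model`.

```python
def rotating_folds(n_partitions=4):
    """Build one fold per partition: the last rotated partition is the test
    set, the one before it is validation, and the rest are training."""
    folds = []
    for i in range(n_partitions):
        names = ['partition_{}'.format((i + j) % n_partitions)
                 for j in range(n_partitions)]
        folds.append({
            'training': names[:-2],
            'validation': [names[-2]],
            'test': [names[-1]],
        })
    return folds

folds = rotating_folds()
for n, fold in enumerate(folds):
    print(n, fold)
```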



In [12]:
from train_wsi_test_on_cluster import run_train_model


csv_path = 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/slide_list_tcga.csv'
csv_train = 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/train_slide_list_tcga.csv'
csv_val = 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/validation_slide_list_tcga.csv'
csv_test = 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/test_slide_list_tcga.csv'
root_dir = r'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction'
data_dir_luad = root_dir + r'/results/tcga_luad/featurized'
data_dir_lusc = root_dir + r'/results/tcga_lusc/featurized'
model_dir = root_dir + '/results/model2'  # change this every time a new model is trained

# paths = {'csv_path': csv_path, 'csv_train': csv_train, 'csv_val': csv_val, 'csv_test': csv_test}
# generate_csv_files(paths, test_size=0.2, validation_size = 0.3)

cache_path = None

# Training
multiple_paths = {'data_dir_luad': data_dir_luad, 'data_dir_lusc': data_dir_lusc, 'output_dir': model_dir,
              'csv_train': csv_train, 'csv_val': csv_val, 'csv_test': csv_test, 'cache_path': cache_path}
#run_train_model(multiple_paths, epochs=200, size_of_batch=12)
run_train_model(multiple_paths, epochs=1, size_of_batch=2)

Loading training set ...
FeaturizedWsiGenerator data config: {'data_dir_luad': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/tcga_luad/featurized', 'data_dir_lusc': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/tcga_lusc/featurized', 'csv_path': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/train_slide_list_tcga.csv'}
FeaturizedWsiGenerator using 11 samples and 6 batches, distributed in 4 positive and 7 negative samples.
Loading validation set ...
FeaturizedWsiSequence data config: {'data_dir_luad': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/tcga_luad/featurized', 'data_dir_lusc': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/tcga_lusc/featurized', 'csv_path': 'E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/data/validation_slide_list_tcga.csv'}
FeaturizedWsiSequence using 5 samples and 4 batches, distribute

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
  current = df.ix[len(df) - 1, self.monitor]


Epoch 00000: val_loss improved from inf to 0.71246, saving model to E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/model2\checkpoint.h5
Epoch 00000: saving model to E:/pathology-weakly-supervised-lung-cancer-growth-pattern-prediction/results/model2\last_epoch.h5


In [None]:
from source.gradcam_wsi import gradcam_on_dataset
from source.train_compressed_wsi import train_wsi_classifier, eval_model, compute_metrics

def train_model(featurized_dir, csv_path, fold_n, output_dir, cache_dir, batch_size=16,
                images_dir=None, vectorized_dir=None, lr=1e-2, patience=4,
                occlusion_augmentation=False, elastic_augmentation=False, shuffle_augmentation=None):
    """
    Trains a CNN using compressed whole-slide images.

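    :param featurized_dir: folder containing the compressed (featurized) images.
    :param csv_path: list of slides with labels.
    :param fold_n: index of the fold that determines which data partitions are used for training, validation and testing.
    :param output_dir: destination folder to store results.
    :param cache_dir: folder to store compressed images temporarily for fast access.
    :param batch_size: number of samples per training batch.
    :param images_dir: folder with the original slides (required for GradCAM).
    :param vectorized_dir: folder with the vectorized slides (required for GradCAM).
    :param lr: learning rate.
    :param patience: patience value forwarded to train_wsi_classifier.
    :param occlusion_augmentation: forwarded to train_wsi_classifier.
    :param elastic_augmentation: forwarded to train_wsi_classifier.
    :param shuffle_augmentation: forwarded to train_wsi_classifier.
    :return: nothing.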
    """

    # Params
    folds = [
        {'training': ['partition_0', 'partition_1'], 'validation': ['partition_2'], 'test': ['partition_3']},
        {'training': ['partition_1', 'partition_2'], 'validation': ['partition_3'], 'test': ['partition_0']},
        {'training': ['partition_2', 'partition_3'], 'validation': ['partition_0'], 'test': ['partition_1']},
        {'training': ['partition_3', 'partition_0'], 'validation': ['partition_1'], 'test': ['partition_2']},
    ]
    result_dir = join(output_dir, 'fold_{n}'.format(n=fold_n))

    # Train CNN
    train_wsi_classifier(
        data_dir=featurized_dir,
        csv_path=csv_path,
        partitions=folds[fold_n],
        crop_size=400,
        output_dir=result_dir,
        output_units=2,
        cache_dir=cache_dir,
        n_epochs=200,
        batch_size=batch_size,
        lr=lr,
        code_size=128,
        workers=1,
        train_step_multiplier=1,
        val_step_multiplier=0.5,
        keep_data_training=1,
        keep_data_validation=1,
        patience=patience,
        occlusion_augmentation=occlusion_augmentation,
        elastic_augmentation=elastic_augmentation,
        shuffle_augmentation=shuffle_augmentation
    )

    # Evaluate CNN
    eval_model(
        model_path=join(result_dir, 'checkpoint.h5'),
        data_dir=featurized_dir,
        csv_path=csv_path,
        partitions=folds[fold_n],
        crop_size=400,
        output_path=join(result_dir, 'eval', 'preds.csv'),
        cache_dir=cache_dir,
        batch_size=batch_size,
        keep_data=1
    )

    # Metrics
    try:
        compute_metrics(
            input_path=join(result_dir, 'eval', 'preds.csv'),
            output_dir=join(result_dir, 'eval')
        )
    except Exception as e:
        print('Failed to compute metrics. Exception: {e}'.format(e=e), flush=True)

    # Apply GradCAM analysis to CNN
    gradcam_on_dataset(
        featurized_dir=featurized_dir,
        csv_path=csv_path,
        model_path=join(result_dir, 'checkpoint.h5'),
        partitions=folds[fold_n]['test'],
        layer_name='separable_conv2d_1',
        output_unit=1,
        custom_objects=None,
        cache_dir=cache_dir,
        images_dir=images_dir,
        vectorized_dir=vectorized_dir
    )

In [None]:
# Train CNN

selected_fold = 0
model_dir = join(root_dir, 'results', 'models', 'rotterdam1', 'bigan', 'nic', 'hgp_bin')

train_model(
    featurized_dir=featurized_dir,
    csv_path=csv_path,
    fold_n=selected_fold, 
    output_dir=model_dir,
    cache_dir=join(cache_dir, 'cnn') if cache_dir else None,  # cache_dir is None on this local setup
    occlusion_augmentation=False,
    lr=1e-2,
    patience=4,
    elastic_augmentation=False,
    images_dir=slide_dir,  # required for GradCAM
    vectorized_dir=vectorized_dir,  # required for GradCAM
    shuffle_augmentation=None
)