AI Ranger Team Demo Development
# Deep Learning for Medical Image Analysis leveraging the AML Platform
### Pneumonia detection using Remote Experiment Runs and Azure ML Pipelines

<img src="images/medicalimage.jpg" width=1000 />

### In this notebook
This notebook demonstrates three approaches to training a pneumonia detection model using the capabilities of Azure ML. The first part trains a baseline model on a remote cluster of GPU machines. The second part shows an approach for tuning hyperparameters. The third and last part transforms the Remote Experiment Runs into an Azure ML Pipeline that can be used to retrain the model repeatably when new data is available. The Pipeline also includes a step for registering the best model.

# Pneumonia Detection Use Case
A relatively small public dataset of medical images for detecting viral or bacterial pneumonia has been chosen to keep the scenario straightforward and reproducible with limited computing resources. The dataset contains 5,218 x-ray images with two classes of diagnostic outcomes: 3,876 cases with (viral or bacterial) pneumonia and 1,342 cases without findings ("Normal").
The dataset is split into training, validation and test sets. Since some images represent radiographs from the same patient, it has been ensured that there is no overlap of patients between the training, validation and test sets.

<img src="images/pneumonia.png" width=1000 />
You can find the dataset under this location: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia. 
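
The patient-level split described above can be sketched as follows. This is a minimal illustration, not the procedure used to prepare the dataset: the `(patient_id, filename)` records, the 80/10/10 fractions and the function name `split_by_patient` are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (patient_id, filename) records so that no patient spans two subsets."""
    # Group all radiographs that belong to the same patient
    by_patient = defaultdict(list)
    for patient_id, filename in records:
        by_patient[patient_id].append(filename)

    # Shuffle patients (not images) reproducibly
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = int(len(patients) * fractions[0])
    n_val = int(len(patients) * fractions[1])
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each group of patients back into its images
    return {
        name: [(p, f) for p in ids for f in by_patient[p]]
        for name, ids in groups.items()
    }


# Hypothetical records: 10 patients with 1 to 3 radiographs each
records = [
    (f"patient{p}", f"patient{p}_img{i}.jpeg")
    for p in range(10)
    for i in range(1 + p % 3)
]
splits = split_by_patient(records)
for name, items in splits.items():
    print(name, len({p for p, _ in items}), "patients,", len(items), "images")
```

Splitting on patients rather than on individual images is what prevents radiographs of the same person from leaking between the training, validation and test sets.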


# Neural Network architecture

The following neural network architecture is used:

<img src="images/cnnframe.png" width=1200 />

Though a detailed discussion of the architecture and functionality of Convnets is outside the scope of this demo, the following summary provides a brief overview of the design:
- The x-ray images are resized to a 224 x 224 pixel resolution before being fed into the Convnet. Other medical imaging use cases will most likely require higher resolutions. However, for the selected dataset, high accuracy results can be achieved with this small image size.
- During the data flow through the Convnet, relevant properties for the classification task (features) are extracted in a hierarchical way. The lower layers of the network detect low-level features like edges or surfaces. More complex features (for detecting pneumonia in this case) are extracted at higher layers. The three convolutional layers perform the detection of features at different abstraction levels in the network, where the images are scanned by a small moving window (kernel).
- To reduce computational effort while focusing on the most dominant features, the image size is reduced further as the data flows through the three max pooling layers.
- Two dropout layers are included to reduce the risk of overfitting to the training data.
- The final layer consists of two neurons for representing the classes "pneumonia" and "normal".
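
To make the size reduction through the three max pooling layers concrete, the sketch below tracks the spatial size of a 224 x 224 input. The kernel sizes, strides and padding are assumptions for illustration (3 x 3 convolutions with padding 1, 2 x 2 pooling with stride 2); the actual values are defined in the training script and may differ.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes = [size]
for _ in range(3):
    # Assumed 3x3 convolution with padding 1: keeps the spatial size
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # Assumed 2x2 max pooling with stride 2: halves the spatial size
    size = conv_output_size(size, kernel=2, stride=2)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28]
```

Under these assumptions each pooling layer quarters the number of activations per feature map, which is where most of the reduction in computational effort comes from.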

# Setup


## Installs and imports

In [16]:
%config Completer.use_jedi = False
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'

import json
from azureml.core import Workspace, Dataset, Experiment

workspace = Workspace.from_config()

## Upload data

When you use this notebook for the first time, the pneumonia dataset should be uploaded to the default AzureML datastore and registered as a managed file dataset.

The commands below can be used to download the dataset using the Kaggle API (https://github.com/Kaggle/kaggle-api). Follow the instructions there to generate your own API key, then fill in your username and key in the code cell below.

In [15]:
# Download Kaggle pip package
!pip install kaggle --upgrade

Requirement already up-to-date: kaggle in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.5.12)


In [None]:
# Export Kaggle configuration variables
%env KAGGLE_USERNAME=[FILL IN YOUR USERNAME]
%env KAGGLE_KEY=[FILL IN YOUR API KEY]

In [None]:
# Download the Pneumonia dataset
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

In [3]:
import os
from zipfile import ZipFile

data_file = './chest-xray-pneumonia.zip'
data_folder = './chest_xray'

# Extract the downloaded archive
with ZipFile(data_file, 'r') as archive:
    print('extracting files...')
    archive.extractall(data_folder)
    print('done')

# Delete the zip file once it has been extracted
os.remove(data_file)

# Extract any nested archives, and delete each one after it has been closed
for filename in os.listdir(data_folder):
    if filename.endswith(".zip"):
        nested_path = os.path.join(data_folder, filename)
        with ZipFile(nested_path, 'r') as archive:
            archive.extractall(os.path.join(data_folder, filename.split('.')[0]))
        os.remove(nested_path)


extracting files...
done


In [21]:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

# Upload data to the default AzureML datastore
datastore = workspace.get_default_datastore()
file_dataset = Dataset.File.upload_directory(src_dir='./chest_xray/chest_xray/',
                                             target=DataPath(datastore, 'chest-xray'),
                                             show_progress=False, overwrite=False)

# Register the file dataset with AzureML
file_dataset = file_dataset.register(workspace=workspace, name="pneumonia", description="Pneumonia train / val / test folders with 2 classes", create_new_version=True)

print(f'Dataset {file_dataset.name} registered.')

Dataset pneumonia registered.


# I. Run baseline experiment

## Create/retrieve Compute Cluster

In [23]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "gpu-cluster"

try:
    compute_target = workspace.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6', 
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Creating a new compute target...
InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Define ScriptRunConfig object

Define the Environment that you will use to run your experiment, retrieve the dataset by name and define the ScriptRunConfig object.

In [4]:
from azureml.core import ScriptRunConfig, Environment
from azureml.core.compute import ComputeTarget

experiment = Experiment(workspace, 'pneumonia')

pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-gpu', file_path = './training/conda_dependencies.yml')

dataset = Dataset.get_by_name(workspace, name='pneumonia', version='latest')

src = ScriptRunConfig(source_directory='./training',
                      script='train.py',
                      arguments=['--epochs', 15, '--data-folder', dataset.as_mount()],
                      compute_target= ComputeTarget(workspace, 'gpu-cluster'),
                      environment=pytorch_env)
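
The `arguments` list above is passed to `train.py` on the command line. The training script itself is not shown in this notebook, so the following is only a sketch of how such a script might parse those arguments. The hyperparameter options mirror the names tuned in part II of this notebook; all default values are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical argument parsing for a training script like train.py."""
    parser = argparse.ArgumentParser(description="Train a pneumonia classifier")
    # Arguments passed by the ScriptRunConfig
    parser.add_argument("--data-folder", type=str, dest="data_folder",
                        help="mount point of the dataset")
    parser.add_argument("--epochs", type=int, default=15)
    # Hyperparameters that a tuning run could override (defaults are assumptions)
    parser.add_argument("--learning_rate", type=float, default=0.0007)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--conv_dropout", type=float, default=0.25)
    parser.add_argument("--optimizer", type=str, default="Adam",
                        choices=["SGD", "Adam", "RMSprop"])
    return parser.parse_args(argv)


# Simulate the baseline invocation with a placeholder mount path
args = parse_args(["--epochs", "15", "--data-folder", "/mnt/data"])
print(args.epochs, args.data_folder, args.optimizer)
```

Giving every hyperparameter a default lets the same script serve both the baseline run, which passes only `--epochs` and `--data-folder`, and runs that supply additional values.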

## Submit baseline experiment

In [6]:
from azureml.widgets import RunDetails

script_run = experiment.submit(src)
RunDetails(script_run).show()


# II. Hyperparameter tuning using Random Parameter Sampling
Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.

Azure Machine Learning lets you automate hyperparameter tuning and run experiments in parallel to efficiently optimize hyperparameters.

Random sampling supports discrete and continuous hyperparameters. It supports early termination of low-performance runs. Some users do an initial search with random sampling and then refine the search space to improve results. In random sampling, hyperparameter values are randomly selected from the defined search space.

Selected hyperparameters affect various stages of the experiment:

- Data: Training and validation loader: batch size
- CNN Architecture: Dropout
- Choice of optimizer
- Training loop: learning rate
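
Conceptually, random sampling draws one value per hyperparameter for every trial. The plain-Python sketch below imitates that behaviour for the four hyperparameters listed above. It does not use the Azure ML SDK; the `("choice", ...)` and `("uniform", ...)` tuples and the function name `sample_configuration` are illustrative conventions, and the candidate values mirror the search space used in this notebook.

```python
import random

# Search space: discrete choices and one continuous range
search_space = {
    "learning_rate": ("choice", [0.00007, 0.0007, 0.07]),
    "batch_size": ("choice", [16, 32, 64, 128]),
    "conv_dropout": ("uniform", (0.0, 0.5)),
    "optimizer": ("choice", ["SGD", "Adam", "RMSprop"]),
}


def sample_configuration(space, rng):
    """Draw one random value for every hyperparameter in the search space."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":
            config[name] = rng.choice(spec)
        elif kind == "uniform":
            low, high = spec
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"Unknown distribution: {kind}")
    return config


rng = random.Random(42)
configs = [sample_configuration(search_space, rng) for _ in range(8)]
for config in configs:
    print(config)
```

Each configuration is drawn independently, so trials can run in parallel without coordinating with one another.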

In [8]:
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, HyperDriveConfig, uniform, choice, PrimaryMetricGoal

param_sampling = RandomParameterSampling( {
        'learning_rate': choice(0.00007, 0.0007, 0.07),
        'batch_size': choice(16, 32, 64, 128), 
        'conv_dropout' : uniform(0.0, 0.5), 
        'optimizer': choice('SGD', 'Adam', 'RMSprop')
    }
)

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=5)

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling, 
                                     policy=early_termination_policy,
                                     primary_metric_name='best_val_acc',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=8,
                                     max_concurrent_runs=4)
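
The early termination rule configured above can be illustrated in plain Python. This sketch assumes the behaviour documented for a Bandit policy with `slack_factor` and a maximization goal: a run is stopped when its best metric falls below the best run's metric divided by `(1 + slack_factor)`. The function name and the example metric values are hypothetical, and the delay configured with `delay_evaluation` is not modelled.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Hypothetical validation accuracies at one evaluation interval
best = 0.95
for accuracy in (0.90, 0.80):
    stop = bandit_should_terminate(accuracy, best, slack_factor=0.15)
    print(f"accuracy={accuracy:.2f} terminate={stop}")
```

With a slack factor of 0.15 and a best accuracy of 0.95, the threshold is roughly 0.826, so the run at 0.90 continues while the run at 0.80 is stopped.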

## Submit hyperdrive run

In [9]:
from azureml.widgets import RunDetails

# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)

RunDetails(hyperdrive_run).show()


# III. Define AML pipeline with HyperDriveStep

The third part of this notebook shows how to use the Azure ML Pipeline capability to create a pipeline from the same training script that we have been using. The cell below creates the first step of the pipeline and defines the pipeline data that this step outputs: the training metrics and the saved model.
The HyperDrive config defined in the previous part is reused.

In [11]:
from azureml.pipeline.steps import HyperDriveStep, HyperDriveStepRun, PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput

metrics_output_name = 'metrics_output'
metrics_data = PipelineData(name='metrics_data',
                            datastore=workspace.get_default_datastore(),
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput("Metrics"))

model_output_name = 'model_output'
saved_model = PipelineData(name='saved_model',
                            datastore=workspace.get_default_datastore(),
                            pipeline_output_name=model_output_name,
                            training_output=TrainingOutput("Model",
                                                           model_file="outputs/model/pneumonia.pt"))

hd_step_name='hyperdrive_step'
hd_step = HyperDriveStep(
    name=hd_step_name,
    hyperdrive_config=hyperdrive_config,
    inputs=[dataset.as_mount()],
    outputs=[metrics_data, saved_model])


## Find and register best model

We add a step to our pipeline that registers the best model, which is the output of the HyperDriveStep.

In [15]:
%%writefile training/register_model.py

import argparse
import json
import os
from azureml.core import Workspace, Experiment, Model
from azureml.core import Run
from shutil import copy2

parser = argparse.ArgumentParser()
parser.add_argument('--saved-model', type=str, dest='saved_model', help='path to saved model file')
args = parser.parse_args()

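model_output_dir = './model/'

# Copy the best model produced by the HyperDrive step into a local folder
os.makedirs(model_output_dir, exist_ok=True)
copy2(args.saved_model, model_output_dir)

# Retrieve the workspace from the run context of this pipeline step
ws = Run.get_context().experiment.workspace

# Register the model folder. Note: the name 'tf-dnn-mnist' does not describe this
# pneumonia model; consider using a more descriptive name.
model = Model.register(workspace=ws, model_name='tf-dnn-mnist', model_path=model_output_dir)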

Overwriting training/register_model.py


In [22]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-sdk")

rcfg = RunConfiguration(conda_dependencies=conda_dep)

register_model_step = PythonScriptStep(source_directory='./training',
                                       script_name='register_model.py',
                                       name="register_model_step01",
                                       inputs=[saved_model],
                                       compute_target=ComputeTarget(workspace, 'gpu-cluster'),
                                       arguments=["--saved-model", saved_model],
                                       allow_reuse=True,
                                       runconfig=rcfg)

register_model_step.run_after(hd_step)

## Submit pipeline including model registration

In [23]:
pipeline = Pipeline(workspace=workspace, steps=[hd_step, register_model_step])
pipeline_run = experiment.submit(pipeline)

Created step hyperdrive_step [c1aca985][6e8e5e1c-4e9d-4ec7-903d-391cd8d1def7], (This step is eligible to reuse a previous run's output)
Created step register_model_step01 [757b77f3][58c049bb-0692-4a1d-ad62-1f23b255b237], (This step will run and generate new outputs)

Submitted PipelineRun 2097db23-03dd-479d-85e5-b07e054c19be
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/2097db23-03dd-479d-85e5-b07e054c19be?wsid=/subscriptions/4eeedd72-d937-4243-86d1-c3982a84d924/resourcegroups/harmke-andreas-demo/workspaces/harmke-andreas-demo&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


## Download training metrics

In [24]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)


Downloading azureml/cc19e9a9-1e12-480b-972d-e6be939732ad/metrics_data
Downloaded azureml/cc19e9a9-1e12-480b-972d-e6be939732ad/metrics_data, 1 files out of an estimated total of 1


## Visualize training metrics

In [25]:
import pandas as pd
import json
with open(metrics_output._path_on_datastore) as f:  
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

| Metric | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_3 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_0 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_5 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_6 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_4 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_2 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_7 | HD_9baebc70-185e-4337-801a-4c7986c8ef1e_1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| training accuracy | [0.8038461538461539, 0.9370192307692308, 0.957... | [0.6240234375, 0.87939453125, 0.92236328125, 0... | [0.5144230769230769, 0.5163461538461539, 0.516... | [0.7673076923076924, 0.9336538461538462, 0.960... | [0.81982421875, 0.939453125, 0.94921875, 0.960... | [0.5185546875, 0.5673828125, 0.66064453125, 0.... | [0.654296875, 0.88623046875, 0.94140625, 0.966... | [0.498046875, 0.53466796875, 0.53173828125, 0.... |
| best_val_acc | [0.934375, 0.934375, 0.959375, 0.959375, 0.959... | [0.7578125, 0.94140625, 0.953125, 0.953125, 0.... | [0.41875, 0.41875, 0.41875, 0.41875, 0.41875, ... | [0.93125, 0.93125, 0.953125, 0.953125, 0.95625... | [0.91875, 0.95, 0.95, 0.953125, 0.959375, 0.95... | [0.45625, 0.465625, 0.59375, 0.79375, 0.803125... | [0.88671875, 0.94140625, 0.953125, 0.95703125,... | [0.46875, 0.49609375, 0.5703125, 0.5703125, 0.... |
| training loss | [0.43161301441329847, 0.17061507584665606, 0.1... | [1.3686887928621947, 0.30821649839552184, 0.19... | [0.6876755327915403, 0.6832714803498998, 0.679... | [0.4938802280014367, 0.17829929210013337, 0.12... | [0.3700377759315985, 0.15760035846444914, 0.12... | [0.6735063090884714, 0.6562921537769785, 0.640... | [1.056118088317432, 0.27589923849494624, 0.154... | [0.680730085121356, 0.680171423164203, 0.67963... |
| validation loss | [0.1985279983944363, 0.2482063065566908, 0.117... | [0.44814315752426104, 0.13481579750691383, 0.1... | [0.6344831852491765, 0.6319371954328314, 0.629... | [0.20772096574136675, 0.20074900195129916, 0.1... | [0.18875072077128963, 0.15107901891072592, 0.1... | [0.6221604306473691, 0.6130199921436799, 0.598... | [0.2620795160277277, 0.13359475747132912, 0.09... | [0.5061733104564525, 0.5054191741508636, 0.504... |
| validation accuracy | [0.934375, 0.878125, 0.959375, 0.953125, 0.959... | [0.7578125, 0.94140625, 0.953125, 0.9375, 0.89... | [0.41875, 0.41875, 0.41875, 0.41875, 0.41875, ... | [0.93125, 0.909375, 0.953125, 0.9375, 0.95625,... | [0.91875, 0.95, 0.925, 0.953125, 0.959375, 0.9... | [0.45625, 0.465625, 0.59375, 0.79375, 0.803125... | [0.88671875, 0.94140625, 0.953125, 0.95703125,... | [0.46875, 0.49609375, 0.5703125, 0.55859375, 0... |
| Train imgs | [2085.0] | [2085.0] | [2085.0] | [2085.0] | [2085.0] | [2085.0] | [2085.0] | [2085.0] |
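
The downloaded metrics map each run ID to its logged metric series, so the best run can also be identified programmatically. The sketch below assumes that structure (`{run_id: {metric_name: [values]}}`); the run names and accuracy values in `toy_metrics` are made up for illustration and are not results from this experiment.

```python
def best_run(metrics, metric_name="best_val_acc"):
    """Return (run_id, value) for the run with the highest logged metric value."""
    scores = {
        run_id: max(run_metrics[metric_name])
        for run_id, run_metrics in metrics.items()
    }
    run_id = max(scores, key=scores.get)
    return run_id, scores[run_id]


# Toy metrics with the same shape as the deserialized metrics output
toy_metrics = {
    "run_a": {"best_val_acc": [0.70, 0.85, 0.90]},
    "run_b": {"best_val_acc": [0.60, 0.92, 0.95]},
    "run_c": {"best_val_acc": [0.40, 0.41, 0.41]},
}
print(best_run(toy_metrics))  # ('run_b', 0.95)
```

In the notebook, the same function could be applied to `deserialized_metrics_output` in place of `toy_metrics`.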


## Publish the training pipeline

Publishing the training pipeline creates a REST endpoint that external services can use to trigger the pipeline.

In [30]:
published_pipeline1 = pipeline_run.publish_pipeline(
     name="Training_pneumonia",
     description="Pipeline to train a classification model to detect pneumonia.",
     version="1.0")
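
As a rough sketch of how an external service might trigger the published pipeline, the helper below only assembles the HTTP request; it does not send it. The function name `build_pipeline_request`, the placeholder URL and the placeholder token are assumptions. In this notebook the URL would come from `published_pipeline1.endpoint`, and the header from an Azure ML authentication object's `get_authentication_header()`.

```python
def build_pipeline_request(endpoint_url, auth_header, experiment_name, parameters=None):
    """Assemble keyword arguments for an HTTP POST that triggers a published pipeline."""
    headers = dict(auth_header)
    headers["Content-Type"] = "application/json"
    # The request body names the experiment under which the run is recorded
    body = {"ExperimentName": experiment_name}
    if parameters:
        body["ParameterAssignments"] = dict(parameters)
    return {"url": endpoint_url, "headers": headers, "json": body}


# Placeholder values; replace with the real endpoint and authentication header
request_kwargs = build_pipeline_request(
    endpoint_url="https://example.invalid/pipelines/placeholder-id",
    auth_header={"Authorization": "Bearer <token>"},
    experiment_name="pneumonia",
)
print(request_kwargs["json"])

# With the `requests` package installed, the call would be:
# response = requests.post(**request_kwargs)
```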

## Create a schedule based on file change
One advantage of defining and publishing your training script as an Azure ML Pipeline is that a schedule can be created to trigger retraining of your model whenever files in the source dataset change.

In [None]:
from azureml.pipeline.core.schedule import Schedule

datastore = workspace.get_default_datastore()

# Schedule.create expects the ID of a published pipeline, so the pipeline
# published above is referenced directly.
reactive_schedule = Schedule.create(workspace,
                                    name="MyReactiveSchedule",
                                    description="Based on input file change.",
                                    pipeline_id=published_pipeline1.id,
                                    experiment_name='pneumonia',
                                    datastore=datastore,
                                    # Only applies if the pipeline defines a DataPath
                                    # PipelineParameter with this name; otherwise omit it.
                                    data_path_parameter_name="input_data")
