# Structured Learning Session 
## Project 1: Build a Clinical Support Application

## Step 0
### Get to know the environment
- Run BASH commands from this notebook
- Go up and down the directory tree

## Step 1
### Download the Kaggle Chest X-ray (Pneumonia) Dataset
- Create a Kaggle account
- Go to the [account](https://www.kaggle.com/udacityinc/account) page.
- Create and download an API token to your personal system.

## Step 2
### Install the Kaggle API \[[Reference](https://www.kaggle.com/docs/api#installation)\]

In [None]:
!pip install kaggle

## Step 3 
### Set up Kaggle API token \[[Reference](https://www.kaggle.com/docs/api#authentication)\]
- Move the Kaggle API token to a directory named `.kaggle` inside the home directory 

Check the directory we are in.

In [None]:
!pwd

Create the hidden directory `.kaggle` inside the home directory

In [None]:
!mkdir -p /home/ec2-user/.kaggle

Check that the directory has been created.

In [None]:
!ls -al /home/ec2-user/

From the GUI, upload the `kaggle.json` API token file to the current directory,
then move it to the newly created directory.

In [None]:
!mv kaggle.json /home/ec2-user/.kaggle/

\[OPTIONAL\] Restrict access rights to the API token.

In [None]:
!chmod 600 /home/ec2-user/.kaggle/kaggle.json

## Step 4
### Set up the dataset in Sagemaker
- Create a directory named `data`
- Download the [pneumonia dataset](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) using the Kaggle API
- Unzip the dataset

In [None]:
!mkdir ./data
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia --path ./data

In [None]:
!unzip  -q ./data/chest-xray-pneumonia.zip -d ./data 

## Step 5
### Explore a few data samples
- Look at the directory structure of the dataset
- Pay attention to the naming scheme of the image files in the NORMAL and PNEUMONIA sub-directories 
- Is the training dataset balanced?
- Plot a few images from the two categories
- Is there a pronounced difference between normal and pneumonia X-rays?
- How large are the images? Is the image size fixed?


In [None]:
!ln -s ../S2/data/

In [None]:
data_root = './data/chest_xray/'
train_data_dir = 'train'
test_data_dir = 'test'
val_data_dir = 'val'

In [None]:
!ls {data_root}

In [None]:
!ls {data_root+train_data_dir}

In [None]:
!ls -l {data_root+train_data_dir+"/PNEUMONIA"} | wc -l

In [None]:
!ls -l {data_root+train_data_dir+"/NORMAL"} | wc -l
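
Note that `ls -l | wc -l` also counts the leading `total` line, so the numbers above are one higher than the real file counts. To answer the balance question directly from Python, a minimal sketch could look like the following. The helper name `count_images_per_class` and the list of extensions are assumptions made for this illustration.

```python
from pathlib import Path


def count_images_per_class(split_dir, extensions=(".jpeg", ".jpg", ".png")):
    """Count image files in each class sub-directory of `split_dir`."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            # Only count files whose extension looks like an image
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.lower() in extensions
            )
    return counts


# Example with the notebook's paths (uncomment once the dataset is unzipped):
# print(count_images_per_class("./data/chest_xray/train"))
```

If one class is much larger than the other, accuracy alone becomes a misleading metric, which is one reason the later steps also track sensitivity and specificity.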

In [None]:
from PIL import Image

In [None]:
sample_path = './data/chest_xray/train/NORMAL/IM-0122-0001.jpeg'
sample_image = Image.open(sample_path)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.imshow(sample_image,cmap='gray')
plt.title('h: '+str(sample_image.height)+' w: '+str(sample_image.width))
plt.show()

## Step 6
### Understand the problem
- What are some distinctive features used by clinicians? \[[Reference](https://www.radiologyinfo.org/en/info/pneumonia#:~:text=Chest%20x%2Dray%3A%20An%20x,infiltrates\)%20that%20identify%20an%20infection.)\]
- Are clinicians always sure about the condition? \[[Reference](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview)\]
- What is a good performance baseline that we should try to achieve or beat? \[[Reference](https://www.mdedge.com/familymedicine/article/60101/infectious-diseases/how-accurate-clinical-diagnosis-pneumonia#:~:text=Sensitivity%20of%20clinical%20diagnosis%20ranged%20from%2047%25%20to%2069%25%2C%20and%20specificity%20from%2058%25%20to%2075%25.)\], \[[Reference](https://arxiv.org/pdf/1711.05225.pdf)\]

- How's pneumonia detected? 
>Chest x-ray: An x-ray exam will allow your doctor to see your lungs, heart and blood vessels to help determine if you have pneumonia. When interpreting the x-ray, the radiologist will look for white spots in the lungs (called infiltrates) that identify an infection. 
[Source](https://www.radiologyinfo.org/en/info/pneumonia#:~:text=Chest%20x%2Dray%3A%20An%20x,infiltrates\)%20that%20identify%20an%20infection.)

- Are clinicians always sure about the condition?
>While common, accurately diagnosing pneumonia is a tall order. It requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity [3] on CXR. However, the diagnosis of pneumonia on CXR is complicated because of a number of other conditions in the lungs such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis.
[Source](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/overview)
   
   
- What is a good performance baseline that we should try to achieve or beat?
>Sensitivity of clinical diagnosis ranged from 47% to 69%, and specificity from 58% to 75%
[Source](https://www.mdedge.com/familymedicine/article/60101/infectious-diseases/how-accurate-clinical-diagnosis-pneumonia#:~:text=Sensitivity%20of%20clinical%20diagnosis%20ranged%20from%2047%25%20to%2069%25%2C%20and%20specificity%20from%2058%25%20to%2075%25.). F1 scores vary from 0.33 to 0.44
[Source](https://arxiv.org/pdf/1711.05225.pdf)
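
To compare a model against these baselines we need the same metrics. The sketch below computes sensitivity, specificity and F1 from the four confusion-matrix counts. The function name `clinical_metrics` and the example counts are hypothetical and not taken from the dataset.

```python
def clinical_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, precision and F1 from confusion counts."""
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan  # recall on PNEUMONIA
    specificity = tn / (tn + fp) if (tn + fp) else nan  # recall on NORMAL
    precision = tp / (tp + fp) if (tp + fp) else nan
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else nan
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts, for illustration only
print(clinical_metrics(tp=60, fn=40, tn=70, fp=30))
```

F1 ignores true negatives, so it can look very different from accuracy when the classes are imbalanced.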

## Step 7
### Select an ML approach 
- What kind of algorithm/model is best suited?
- Do we have adequate data?
    - How can we augment the data?
    - Which augmentations will not make sense?

- What kind of algorithm/model is best suited? 
```
The number of samples in the training set is too small for training a deep learning model from the ground up. A pretrained model can be used as a feature extractor and even fine-tuned further.
```
[Source](https://arxiv.org/abs/1711.05225)

- How can we augment the data?
```
Rotation and brightness and contrast adjustments make sense. One can also apply horizontal flipping.
```
[Source](https://arxiv.org/abs/1711.05225)

## Step 8
### Create Pytorch dataloaders for training, validation and testing
- Decide data transformations
- Create Pytorch datasets from the folder structure
- Create dataloaders from the corresponding datasets

In [None]:
import os
import torch
from torchvision.datasets import ImageFolder
from torchvision import transforms

In [None]:
## Why Imagenet?: https://discuss.pytorch.org/t/how-to-preprocess-input-for-pre-trained-networks/683/2
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

In [None]:
## All normalizations below done using the image net mean and std. deviation
## as described here: https://discuss.pytorch.org/t/how-to-preprocess-input-for-pre-trained-networks/683/2 

train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop((224,224)),
    transforms.RandomRotation(degrees=5),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN,
                         IMAGENET_STD)
])

test_transforms =  transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN,
                         IMAGENET_STD)
])


val_transforms =  transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN,
                         IMAGENET_STD)])

In [None]:
train_dataset = ImageFolder(os.path.join(data_root,train_data_dir),transform=train_transforms)
test_dataset = ImageFolder(os.path.join(data_root,test_data_dir), transform=test_transforms)
val_dataset = ImageFolder(os.path.join(data_root,val_data_dir), transform=val_transforms)
print(train_dataset, test_dataset, val_dataset, sep='\n\n')

In [None]:
from torch.utils.data import DataLoader

In [None]:
BATCH_SZ = 32

In [None]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SZ, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SZ, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SZ, shuffle=True)

## Step 9
### Sanity test the data 
- Plot a few random data points

In [None]:
train_dataset.class_to_idx

In [None]:
idx_to_class = {v:k for k,v in train_dataset.class_to_idx.items()}
idx_to_class

In [None]:
def denormalize(x):
    return x * IMAGENET_STD+IMAGENET_MEAN

def tensor_to_img(t):
    return t.numpy().transpose(1,2,0)

def tensor_to_label(t):
    return idx_to_class[int(t.numpy())]

In [None]:
sample_X, sample_y = next(iter(train_loader))

In [None]:
sample_y 

In [None]:
train_dataset.class_to_idx

In [None]:
plt.imshow(denormalize(tensor_to_img(sample_X[3]))); plt.title("class:"+str(tensor_to_label(sample_y[3])));plt.show()

## Step 10
### Shop around for a model 

- Instantiate a pretrained Resnet18 model \[[Reference](https://pytorch.org/vision/stable/models.html)\]

In [None]:
import torchvision.models as models

In [None]:
model = models.resnet.resnet18(pretrained=True)

- Understand the model's architecture and functioning

In [None]:
print(model)

> The model's top-level modules (for example `conv1`, `layer1` to `layer4`, `avgpool` and `fc`) can each be accessed as `model.<name>`; `model.fc` is the classification head.

- Freeze the model \[[Reference](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#convnet-as-fixed-feature-extractor)\]

In [None]:
for param in model.parameters():
    param.requires_grad = False

- Decapitate the model and replace its classifier with dense layers of sizes 256, 64 and 1 \[[Reference](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#convnet-as-fixed-feature-extractor)\]

In [None]:
from torch.nn import Sequential, Linear, ReLU, Dropout

In [None]:
model.fc = Sequential(
    Linear(in_features=512, out_features=256, bias=True),
    ReLU(),
    Dropout(p=0.5, inplace=True),
    Linear(in_features=256, out_features=64, bias=True),
    ReLU(),
    Dropout(p=0.5, inplace=True),
    Linear(in_features=64, out_features=1, bias=True),
)

In [None]:
model

## Step 11
### Pre-heat the oven


- Define a loss function appropriate for binary classification based on the following criteria
    - Appropriate for a classification problem \[[Reference](https://pytorch.org/docs/stable/nn.html#loss-functions)\]
    - Compatible with the size of the output layer (Single neuron vs. 1-neuron per class) \[[Reference](https://stats.stackexchange.com/q/207049/348089)\]
    - Compatible with the type of the output (logit, softmax, sigmoid) \[[Reference](https://stackoverflow.com/a/43577384/17203040)\]

In [None]:
from torch.nn import BCEWithLogitsLoss

In [None]:
loss_fn = BCEWithLogitsLoss()
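
`BCEWithLogitsLoss` takes raw logits, which is why the model's last layer has no sigmoid. As a rough illustration of what it computes for a single sample, here is the standard numerically stable form of binary cross-entropy on a logit, written in plain Python. The function name `bce_with_logits` is hypothetical. PyTorch's version works on tensors and averages over the batch by default.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one raw logit and a 0/1 target.

    Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))],
    rearranged so that exp() never overflows.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# A confident, correct prediction gives a small loss ...
print(bce_with_logits(4.0, 1))
# ... and a confident, wrong prediction gives a large one.
print(bce_with_logits(4.0, 0))
```

Applying a sigmoid first and then `BCELoss` would give the same value in exact arithmetic but is less stable for large logits.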

- Instantiate an optimizer \[[Reference](https://pytorch.org/docs/stable/optim.html)\]

In [None]:
from torch.optim import Adam

In [None]:
optimizer = Adam(model.parameters())

## Step 12
### Implement training and evaluation functions

In [None]:
from tqdm import tqdm

- Implement the `train()` function that trains the model for one epoch

In [None]:
def train(model, loader, optimizer, loss_fn, device):
    
    ## Set the model in training mode and copy the model to the device
    model.train()
    model = model.to(device)   
    
    for batch_X, batch_y in tqdm(loader):
        
        ## Move the batch to the device
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)

        ## Clear the optimizer's accumulated gradients
        optimizer.zero_grad()
        
        ## Pass the data through the model and collect the logits
        logits = model(batch_X)
        
        ## Calculate the loss and backpropagate errors
        ## (squeeze only dim 1 so that a final batch of size one keeps its batch dimension)
        loss = loss_fn(logits.squeeze(1), batch_y.float())
        loss.backward()
        
        ## Run the optimizer to update the parameters based on backpropagated errors
        optimizer.step()

- Implement the `evaluate()` function to compute the loss, and any other metrics we care about.

In [None]:
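@torch.no_grad()  ## Evaluation needs no gradients; skipping them saves memory and time
def evaluate(model, loader, loss_fn, device, pos_label, neg_label):
    
    ## Set the model in evaluation mode and copy the model to the device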
    model.eval()
    model = model.to(device)   
    
    total_loss = 0
    total_TP = total_FN = total_TN = total_FP = 0
    for batch_X, batch_y in tqdm(loader):
        
        ## Move the batch to the device
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)
        
        ## Pass the data through the model and collect the logits
        logits = model(batch_X)
        
        ## Calculate the loss (squeeze only dim 1 so a batch of size one keeps its batch dimension)
        loss = loss_fn(logits.squeeze(1), batch_y.float())

        ## Accumulate the loss
        total_loss += loss.item()
        
        ## Compute predicted labels
        probs = torch.sigmoid(logits.squeeze(1))
        preds = probs > 0.5
        
        ## Compute batch TP, FP, FN, TN
        total_TP += ((preds == pos_label) & (batch_y == pos_label)).sum().item()
        total_FN += ((preds == neg_label) & (batch_y == pos_label)).sum().item()
        total_TN += ((preds == neg_label) & (batch_y == neg_label)).sum().item()
        total_FP += ((preds == pos_label) & (batch_y == neg_label)).sum().item()
    
    sensitivity = total_TP / (total_TP+total_FN)
    specificity = total_TN / (total_TN+total_FP)
    accuracy = (total_TP+total_TN) / (total_TP+total_FN+total_TN+total_FP)
    
    
        
    return {'loss':total_loss/len(loader), 'sensitivity':sensitivity, 'specificity':specificity, 'accuracy':accuracy}

- Select device 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

- **Sanity test:** Train and evaluate the model on a tiny subset (~100 samples) of the training set (train/eval on the same subset) for a few epochs \[[Reference](https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset)\]

In [None]:
from torch.utils.data import Subset
import numpy as np

In [None]:
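## Sample 100 distinct indices; randint() could repeat indices and never picked the last sample
sample_train_dataset = Subset(train_dataset, np.random.choice(len(train_dataset), 100, replace=False).tolist())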

In [None]:
sample_train_loader = DataLoader(sample_train_dataset, BATCH_SZ, shuffle=True)

In [None]:
EPOCHS = 5

for e in range(1,EPOCHS+1):
    train(model, sample_train_loader,optimizer,loss_fn, device)
    metrics = evaluate(model, sample_train_loader,loss_fn, device, train_dataset.class_to_idx['PNEUMONIA'], train_dataset.class_to_idx['NORMAL'])
    print(f'Epoch: {e}, metrics: {metrics}')

## Step 13
### Set up training on a separate instance

- Create a new Sagemaker session and get execution role

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

- Get the default S3 bucket \[[Reference](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-bucket.html)\]

In [None]:
bucket = sagemaker_session.default_bucket()

- Move our data directory to the path `sagemaker/pneumonia` on the S3 bucket.
  **This operation is time consuming. Jump ahead to next steps after executing your code.**

In [None]:
prefix = 'sagemaker/pneumonia'

In [None]:
input_data = sagemaker_session.upload_data(path=data_root, bucket=bucket, key_prefix=prefix)

- Use the AWS CLI to check if the data has been transferred. \[[Reference](https://aws.amazon.com/cli/)\]

In [None]:
!aws s3 ls s3://{bucket+'/'+prefix+'/'}

- Create a directory named `train` and create the file `train.py` within it. This will be the training script.

In [None]:
!mkdir train; touch train/train.py

- Write the body of the training script, adapting the code [here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#prepare-a-pytorch-training-script)

- Copy code to create the train and validation datasets from this notebook into the training script.

- Copy code to set up the model, the loss function and the optimizer.

- Adapt code to train the model on the **entire training set** and evaluate training and validation losses every epoch

- Add code to save the model at the end of the training \[[Reference](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#save-the-model)\].
Saving is necessary if we want to make predictions using the remote model and the Sagemaker API.

- Invoke the cell below, which executes the train/train.py script with appropriate arguments, to check that the code runs without errors. **You may want to interrupt the execution since the full training is too slow**.

In [None]:
!export SM_MODEL_DIR='./model' SM_CHANNEL_TRAIN=s3://{bucket+'/'+prefix+'/'};python train/train.py --epochs 1 --batch-size 32 --model-dir ./model --data-dir {data_root}
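
The argument handling that such a `train/train.py` needs can be sketched as follows. The flag names mirror the command line above and the `hyperparameters` used later in this step; the environment variables `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are the ones set in the cell above. The helper `str_to_bool` and the fallback defaults are assumptions for this sketch, on the understanding that hyperparameter values reach the script as command-line strings such as `--use-cuda True`.

```python
import argparse
import os


def str_to_bool(value):
    """Interpret strings such as 'True', 'true' or '1' as a boolean."""
    return str(value).strip().lower() in ("1", "true", "yes")


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker's environment variables."""
    parser = argparse.ArgumentParser(description="Train the pneumonia classifier")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--use-cuda", type=str_to_bool, default=False)
    # On a training instance these variables are set by SageMaker;
    # the local fallbacks below are assumptions for running in the notebook.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/chest_xray"))
    return parser.parse_args(argv)


# In train.py one would call `args = parse_args()` and then build the datasets
# from args.data_dir, train for args.epochs and save the model to args.model_dir.
```

Keeping the parsing in its own function makes it easy to test locally before paying for a training instance.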

- Create a PyTorch estimator using the Sagemaker API \[[Reference](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator)\]. Use the `ml.p2.xlarge` instance type and pass hyperparameters as needed. Keep the number of epochs small (1-2).

In [None]:
from sagemaker.pytorch import PyTorch

In [None]:
pytorch_estimator = PyTorch(entry_point= 'train.py',
                            source_dir='train',
                            instance_type='ml.p2.xlarge',
                            instance_count=1,
                            framework_version=torch.__version__,
                            py_version='py3',
                            role=role,
                            hyperparameters = {'epochs': 1, 'batch-size': 32, 'use-cuda':True})

In [None]:
pytorch_estimator.fit({'train': f"s3://{bucket+'/'+prefix+'/'}"})

## Step 14
### Deploying the model 

- To deploy a model, Sagemaker expects to have a function named `model_fn()` in the script \[[Reference](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#load-a-model)\]. This function should load and return the model saved by the training process. **Implement `model_fn()` in our code.**
    - Recreate the model architecture.
    - Load the model's parameters saved by the training process
    - Set the model in evaluation mode
    - Copy the model over to the right device
    - Return the model

- Re-run the Pytorch estimator creation and `estimator.fit()` cells above

- Create a predictor by calling `deploy()` on the estimator. The cell below uses an `ml.t2.medium` instance for deployment.

In [None]:
# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = pytorch_estimator.deploy(instance_type='ml.t2.medium',
                                     initial_instance_count=1)

- Load any sample image from the test dataset using the test loader.

In [None]:
test_X, test_y = next(iter(test_loader))

- Display the image and its class label.

In [None]:
plt.imshow(denormalize(tensor_to_img(test_X[1]))); plt.title("class:"+str(tensor_to_label(test_y[1])));plt.show()

- Convert the image data from Pytorch tensor to numpy array

In [None]:
## Use the same sample (index 1) that was displayed above
img_data = test_X[1].cpu().numpy()

In [None]:
img_data.shape

- Send the numpy array to our predictor. Since the predictor is remote, Sagemaker takes care of serializing and deserializing the data and the model's prediction \[[Reference](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model)\]

In [None]:
output = predictor.predict(img_data.reshape([1,*img_data.shape]))

- The output of the predictor is a logit; pass it through a sigmoid to get a probability in the range [0,1]

In [None]:
prob = 1/(1+np.exp(-output))
prob

- Threshold the probability at 0.5 to get a prediction 

In [None]:
pred = int(prob > 0.5)
pred

- Convert the prediction to a label using `idx_to_class`

In [None]:
idx_to_class[pred]

- To serve models, Sagemaker spins up a server. When we call `predictor.predict(data)`, the data is serialized and sent to this server. After we are done making predictions, we need to shut down the server to save cost. **Delete the endpoint created by the predictor.**

## Step 15
### Accessing the deployed model without the use of the predictor object 

- Print the deployed model's endpoint identifier

In [None]:
predictor.endpoint_name

- (Re-)Create a predictor object using the endpoint of the deployed model
    - Use the sagemaker.predictor.Predictor class
    - Use `sagemaker.serializers.NumpySerializer`, `sagemaker.deserializers.NumpyDeserializer()` as the serializer and deserializer. (why's this needed?)

In [None]:
from sagemaker.predictor import Predictor

In [None]:
predictor_endpoint = Predictor(predictor.endpoint_name,
                               serializer=sagemaker.serializers.NumpySerializer(),
                               deserializer=sagemaker.deserializers.NumpyDeserializer())

- Create image payload by inserting an empty `batch` dimension to the numpy array

In [None]:
img_payload = img_data.reshape([1,*img_data.shape])

- Invoke the `predict()` method of the new predictor and capture its response

In [None]:
inference_response = predictor_endpoint.predict(data=img_payload)
print(inference_response)

In [None]:
## Keep the logit as a float; int() would truncate it and distort the probability
prob = 1/(1+np.exp(-float(np.squeeze(inference_response))))
prob

In [None]:
pred = int(prob > 0.5)
pred

- Delete the predictor(s)

In [None]:
predictor.delete_endpoint()

## Step 16
### Invoke the endpoint from outside the notebook (from a Lambda function)

- Create a Lambda function and assign it an IAM role that allows unrestricted access to Sagemaker (Why is this needed?)

- Use the endpoint approach to create a predictor object and call its `predict` method on a sample image
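
The Lambda role needs SageMaker permissions because invoking an endpoint is an authenticated AWS API call. As an alternative to a predictor object, the sketch below calls the endpoint through `boto3`'s `sagemaker-runtime` client, which swaps in a different technique from the one named above. It assumes that the endpoint's default handlers accept and return JSON, that the response for one image has the layout `[[logit]]`, and that class index 1 is `PNEUMONIA`. The environment variable `ENDPOINT_NAME`, the `event["image"]` field and the function `classify` are hypothetical names for this illustration.

```python
import json
import math
import os


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(client, endpoint_name, image):
    """Send one preprocessed image (nested lists, shape [1, 3, 224, 224]) to the endpoint."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(image),
    )
    # Assumed response layout: [[logit]] for a batch of one image
    logit = json.loads(response["Body"].read())[0][0]
    prob = sigmoid(logit)
    return {"probability": prob, "label": "PNEUMONIA" if prob > 0.5 else "NORMAL"}


def lambda_handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime

    client = boto3.client("sagemaker-runtime")
    # ENDPOINT_NAME and event["image"] are hypothetical names chosen for this sketch
    return classify(client, os.environ["ENDPOINT_NAME"], event["image"])
```

Passing the client into `classify` keeps the request logic testable without AWS credentials. The image must already be resized and normalized exactly as in `test_transforms`.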

## Step 17
### Speedup Hacks

- Resize images only once
- Generate bottleneck features
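
Because the backbone is frozen, its output for a given image never changes, so it can be computed once and cached. A minimal sketch of generating such bottleneck features is shown below. The function name `extract_features` is hypothetical, and the approach assumes random augmentations are switched off, since cached features cannot reflect them.

```python
import torch


@torch.no_grad()
def extract_features(backbone, loader, device="cpu"):
    """Run a frozen backbone over `loader` once and return (features, labels)."""
    backbone = backbone.to(device).eval()
    features, labels = [], []
    for batch_X, batch_y in loader:
        # Move results back to the CPU so large datasets do not fill GPU memory
        features.append(backbone(batch_X.to(device)).cpu())
        labels.append(batch_y)
    return torch.cat(features), torch.cat(labels)


# Usage sketch with the notebook's names (assumptions: `model` is the ResNet18
# from Step 10 and `val_loader` uses deterministic transforms):
# head = model.fc                  # keep the classifier head for training
# model.fc = torch.nn.Identity()   # the backbone now outputs 512-d features
# X_feat, y = extract_features(model, val_loader, device)
# torch.save({"X": X_feat, "y": y}, "val_bottleneck.pt")
```

The small classifier head can then be trained on the cached 512-dimensional vectors, which avoids repeating the expensive convolutional forward pass every epoch.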