# Deep Learning in Medicine
### BMSC-GA 4493, BMIN-GA 3007 
### Homework 2



**Note:** If you need to write mathematical terms, you can type your answers in a Markdown cell using LaTeX.

See <a href="https://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook">here</a> if you have issues. For basic LaTeX notation, see <a href="https://en.wikibooks.org/wiki/LaTeX/Mathematics">here</a>.

**Submission instructions**: Upload and submit your final Jupyter notebook with the necessary files at <a href='http://newclasses.nyu.edu'>newclasses.nyu.edu</a>. If you use code or scripts from the web, please give a link to the code in your answers.

**Submission deadline:** Friday March 9th 2018 (5:00 PM)

# Question 1: Convolutional Layer  (Total 28 points)

We have a 3x6x6 image and two 3x3x3 convolution kernels, as pictured. The bias term for each feature map is also provided. For questions 1.2, 1.3, and 1.5, in addition to providing the maps, please provide the Python code (without using the pytorch package) that you used to calculate the maps.

<img src="Picture1.png" width="500">

## 1.1) 
What will be the dimensions of the feature maps after we forward propagate the image using the given convolution kernels for

### 1.1.a) (2 points)
stride=1, without zero padding?

### 1.1.b) (2 points)
stride=2, without zero padding?

### 1.1.c) (2 points) 
stride=2, with zero padding?

### 1.1.d) (2 points)
stride=3, with zero padding?

### 1.1.e) (2 points) 
a dilated convolution with stride=1, dilation rate=2 and zero padding?
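
The output size along one spatial axis follows from the input size, kernel size, stride, padding, and dilation. The helper below is a minimal sketch of that arithmetic; the function name `conv_output_size` is hypothetical, and the amount of zero padding is left as a parameter because the questions above do not state how many pixels of padding are used.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution.

    Computes floor((n + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1.
    """
    # A dilated kernel covers dilation*(kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


# 6x6 input, 3x3 kernel, stride 1, no padding
print(conv_output_size(6, 3, stride=1, padding=0))  # 4
```

The same function can be applied to each remaining case once you have decided on the padding width; the number of feature maps equals the number of kernels.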

## 1.2) (4 points)  
Calculate the feature maps for the case stride=2, with zero padding. 

```python
# Starter code to load the image (x), kernel weights (w), and bias (b).
import numpy as np
npzfile = np.load('Question1.npz')  # 'Question1.npz' is provided under the /beegfs/ga4493/data/HW2 folder on HPC
print(npzfile.files)  # check the variable names
x = npzfile['x']
w = npzfile['w']
b = npzfile['b']
```
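
One way to approach the calculation without pytorch is a direct loop over output positions. The sketch below uses random stand-in arrays rather than the contents of `Question1.npz`, and it assumes a kernel layout of (number of kernels, channels, height, width) and a padding width of 1; both are assumptions to check against the actual file and your own design. The function name `conv2d_naive` is hypothetical.

```python
import numpy as np


def conv2d_naive(x, w, b, stride=1, padding=0):
    """Cross-correlate x (C, H, W) with w (K, C, kh, kw) and add bias b (K,)."""
    c, h, wd = x.shape
    k, c_w, kh, kw = w.shape
    assert c == c_w, "image and kernel channel counts must match"
    # Zero-pad only the two spatial axes.
    xp = np.pad(x, ((0, 0), (padding, padding), (padding, padding)), mode="constant")
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (wd + 2 * padding - kw) // stride + 1
    out = np.zeros((k, out_h, out_w))
    for f in range(k):
        for i in range(out_h):
            for j in range(out_w):
                top, left = i * stride, j * stride
                patch = xp[:, top:top + kh, left:left + kw]
                out[f, i, j] = np.sum(patch * w[f]) + b[f]
    return out


# Stand-in data with the shapes described in Question 1 (not the real values).
rng = np.random.default_rng(0)
x_demo = rng.integers(-2, 3, size=(3, 6, 6)).astype(float)
w_demo = rng.integers(-1, 2, size=(2, 3, 3, 3)).astype(float)
b_demo = np.array([1.0, 0.0])

maps_demo = conv2d_naive(x_demo, w_demo, b_demo, stride=2, padding=1)
print(maps_demo.shape)  # (2, 3, 3)
```

The loop computes cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution.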

## 1.3) 
Apply the following activation function on the feature maps calculated in 1.2 and provide the resulting activation maps

### 1.3.a) (1 point)
ReLU

### 1.3.b) (2 points)
Leaky ReLU with negative slope coefficient = 0.01
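
Both activations are element-wise and can be written in a line of NumPy each. This is a minimal sketch on a small made-up array (`z_demo` is illustrative, not the feature maps from 1.2).

```python
import numpy as np


def relu(z):
    """max(0, z), applied element-wise."""
    return np.maximum(z, 0.0)


def leaky_relu(z, negative_slope=0.01):
    """z where z > 0, otherwise negative_slope * z."""
    return np.where(z > 0, z, negative_slope * z)


z_demo = np.array([[-2.0, 0.0], [3.0, -0.5]])
print(relu(z_demo))
print(leaky_relu(z_demo))
```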

## 1.4) (3 points)
List three pooling strategies and write their mathematical forms for 2D inputs.

## 1.5)
Pick two of the three pooling strategies and provide the output features obtained by applying them to the activation maps from 1.3.b for

### 1.5.a) (2 points)
pool width=2 and stride 1

### 1.5.b) (2 points) 
pool width=3 and stride 1
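
A sliding-window loop covers several pooling strategies at once. The sketch below implements max, average, and L2 pooling as three possible choices (your list in 1.4 may differ), and runs on a small made-up array rather than the activation maps from 1.3.b. The function name `pool2d` is hypothetical.

```python
import numpy as np


def pool2d(a, width, stride=1, mode="max"):
    """Pool each channel of a (C, H, W) array with a square window."""
    c, h, w = a.shape
    out_h = (h - width) // stride + 1
    out_w = (w - width) // stride + 1
    out = np.zeros((c, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            win = a[:, top:top + width, left:left + width]
            if mode == "max":
                out[:, i, j] = win.max(axis=(1, 2))
            elif mode == "avg":
                out[:, i, j] = win.mean(axis=(1, 2))
            elif mode == "l2":
                # Square root of the sum of squares inside the window.
                out[:, i, j] = np.sqrt((win ** 2).sum(axis=(1, 2)))
            else:
                raise ValueError(f"unknown pooling mode: {mode}")
    return out


a_demo = np.arange(9, dtype=float).reshape(1, 3, 3)
print(pool2d(a_demo, width=2, stride=1, mode="max"))
print(pool2d(a_demo, width=2, stride=1, mode="avg"))
```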

## 1.6) (4 points)
Here we will use the pytorch package to calculate feature/activation maps. Write code which takes the 3x6x6 image and performs a 2D convolution operation (with stride=2 and zero padding) using the 3x3x3 filters shown in the picture. After the convolution layer, use the leaky ReLU activation function (with negative slope 0.01) and an L2-pooling operation (pool width = 2 and stride = 1). Provide the code, the feature maps obtained from the convolution operation (compare with 1.2), the activation maps (compare with 1.3.b), and the feature maps after the L2-pooling operation.
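
The three steps map onto functions in `torch.nn.functional`. The following is a sketch on random stand-in tensors, not the provided image and kernels; it assumes a padding width of 1 and a kernel layout of (out_channels, in_channels, height, width).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_t = torch.randn(1, 3, 6, 6)   # stand-in image with a batch dimension of 1
w_t = torch.randn(2, 3, 3, 3)   # stand-in for the two 3x3x3 kernels
b_t = torch.randn(2)            # stand-in bias, one value per feature map

feature_maps = F.conv2d(x_t, w_t, bias=b_t, stride=2, padding=1)
activation_maps = F.leaky_relu(feature_maps, negative_slope=0.01)
# norm_type=2 gives the square root of the sum of squares in each window.
pooled = F.lp_pool2d(activation_maps, norm_type=2, kernel_size=2, stride=1)

print(feature_maps.shape, activation_maps.shape, pooled.shape)
```

To run this on the provided data, the NumPy arrays would first need converting to float tensors (for example with `torch.from_numpy(...).float()`) and the image given a leading batch dimension.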

# Question 2: Network design for disease classification (Total 26 points)

Disease classification is a common problem in medicine, and there are many ways to solve it. The goal of this question is to make sure that you have a clear picture of the techniques you can use in such a classification task.

Assume that we have 10K images in a dataset of x-rays. Each image has dimensions 128x128, and we have a label for each image. The label defines which class the image belongs to (let's assume we have 10 disease classes in total). You will describe your approach to classifying the disease for each of the techniques below. Make sure you do not forget the bias term. For questions 2.1.a, 2.2.a, and 2.3.a, you can either describe your proposed network explicitly or provide the pytorch code that defines it.


### 2.1.a) (2 points)
Design a multi-class logistic regression model which takes an image as input (by reshaping it to a vector; let's call this a vectorized image) and outputs the probabilities of the 10 disease classes.

### 2.1.b) (2 points)
Clearly state the sizes of your input and output.

### 2.1.c) (1 point)
What type of activation function will you use, and why?

### 2.1.d) (1 point)
How many parameters do you need to fit for your design?

### 2.2.a) (2 points)
Design a multilayer perceptron (MLP) with one hidden layer, which first maps the vectorized images to a vector of 128 and then feeds this vector to a fully connected layer to get the probabilities of the 10 disease classes.

### 2.2.b) (2 points)
Clearly state the sizes of your input and output at each layer until you get the final output vector with 10 probabilities.

### 2.2.c) (2 points) 
Define two types of activation functions you can use in the first layer. Which activation function will you use on the second fully connected layer?

### 2.2.d) (1 point)
How many parameters do you need to fit for your design? How does adding another hidden layer affect the number of parameters?
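
For fully connected layers, the parameter count is one weight per input-output pair plus one bias per output. The helper below is a small sketch of that arithmetic for a chain of layers; the name `dense_param_count` is hypothetical, and the layer sizes follow the 128x128 images, 128-unit hidden vector, and 10 classes given above.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


n_pixels = 128 * 128  # length of the vectorized image

print(dense_param_count([n_pixels, 10]))       # 163850
print(dense_param_count([n_pixels, 128, 10]))  # 2098570
```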

### 2.3.a) (2 points)
Design a convolutional neural network with one convolutional layer, which first maps the images to a vector of 128 (with the help of convolution and pooling operations) and then feeds this vector to a fully connected layer to get the probabilities of the 10 disease classes.

### 2.3.b) (2 points)
Clearly state the sizes of your input, kernel, pooling, and output at each step until you get the final output vector with 10 probabilities.

### 2.3.c) (1 point)
How many parameters do you need to fit for your design?

### 2.3.d) (2 points)
Now increase your selected convolution kernel size by 2 in each direction. Describe the effect of using small vs large filter size during convolution. 

### 2.3.e) (3 points)
Now multiply your selected stride sizes for the convolution and pooling operations by 2. Describe the effect of this design change in terms of memory requirements, number of parameters to fit, and number of operations.

### 2.3.f) (3 points)
Assume we have trained the designed network and we want to use it for inference, to classify the disease from an image of size 256x192. Describe whether your designed CNN is capable of accepting this image without any preprocessing. If we cannot use your network with this image, please propose changes to your network which will enable it to accept images of various shapes.
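
One commonly used option for handling variable input sizes is an adaptive pooling layer, which produces a fixed-size output whatever the spatial size of its input. The sketch below is one possible design, not the expected answer: the class name `SizeAgnosticCNN`, the 8 feature maps, and the 4x4 pooled grid (8 x 4 x 4 = 128 features) are all illustrative choices, and a single-channel input is assumed.

```python
import torch
import torch.nn as nn


class SizeAgnosticCNN(nn.Module):
    """Conv -> ReLU -> adaptive average pool -> fully connected layer."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        # Output is always 8 x 4 x 4 = 128 features, for any input height/width.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.fc = nn.Linear(8 * 4 * 4, n_classes)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        h = self.pool(h).flatten(1)
        return self.fc(h)  # logits; apply softmax to obtain probabilities


model = SizeAgnosticCNN()
for shape in [(1, 1, 128, 128), (1, 1, 256, 192)]:
    print(shape, tuple(model(torch.zeros(shape)).shape))
```

Accepting an image of a new size does not by itself guarantee good predictions on it; the network was still trained at the original resolution.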

# Question 3: Deep CNN design for disease classification (Total 56 points + 12 points in bonus question)

In this part of the homework, we will focus on classifying lung disease using the chest x-ray dataset provided by NIH (https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community). Please go over the following paper for the details of the dataset: https://arxiv.org/pdf/1705.02315.pdf

You need to use HPC for the training part of this question, as your computer's CPU will not be fast enough to compute learning iterations. When you use HPC, please upload your code/scripts under the corresponding questions and provide the required plots and tables there as well. The data is available on HPC under the /beegfs/ga4493/data/HW2 folder. We are interested in classifying infiltration, pneumothorax, cardiomegaly, and *not*(infiltration OR pneumothorax OR cardiomegaly) cases. This gives us 4 classes that we want to identify with a deep CNN.

First, you need to work on the Data_Entry_2017.csv file to identify the cases/images that have infiltration, pneumothorax, or cardiomegaly, and the images that do *not* have any of these 3 diseases.

## 3.1) Train, Test, and Validation Sets (4 points)
Write a script to read data from Data_Entry_2017.csv and process it to obtain 3 sets (train, validation, and test). Using the 'Finding Labels' column, define the class that each image belongs to. In total you can define 5 classes:
- 1 infiltration
- 2 pneumothorax
- 3 cardiomegaly
- 4 cases which contain at least two diseases, at least one of which belongs to classes 1, 2, and 3
- 0 for all other diseases (does not have infiltration OR pneumothorax OR cardiomegaly) or No Finding

Generate train, validation, and test sets by splitting the whole dataset containing the specific classes (0, 1, 2, and 3) by 60%, 20%, and 20%, respectively. Since we have too many samples in Class 0, use only a random 10% of those samples when creating the sets. The test set will not be used during modelling; it will be used only to test your model's accuracy. Make sure you have similar percentages of the different classes in each subset. Provide statistics on the number of samples per class in your subsets. (You do not need to think about splitting the sets by subject for this homework. In general, however, we do not want images from the same subject to appear in both the train and test sets!)

Write .csv files defining the samples in your train, validation, and test sets, with the names train.csv, validation.csv, and test.csv. Submit these files with your homework.
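
The labelling rule and the per-class split can be sketched with the standard library alone. The code below assumes that 'Finding Labels' holds disease names separated by `|` (for example `Infiltration|Effusion`) with the spellings used in `TARGETS`; check this against the actual file. The function names are hypothetical, and the demo runs on made-up labels.

```python
import random
from collections import defaultdict

# Assumed spellings of the three target findings in the 'Finding Labels' column.
TARGETS = {"Infiltration": 1, "Pneumothorax": 2, "Cardiomegaly": 3}


def label_from_findings(findings):
    """Map a 'Finding Labels' string to a class id in {0, 1, 2, 3, 4}."""
    names = [f.strip() for f in findings.split("|") if f.strip()]
    hits = [n for n in names if n in TARGETS]
    if not hits:
        return 0   # other diseases or no finding
    if len(names) >= 2:
        return 4   # several diseases, at least one of them a target
    return TARGETS[hits[0]]


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists with similar class mixes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, val, test = [], [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        n_train = int(round(fractions[0] * len(idxs)))
        n_val = int(round(fractions[1] * len(idxs)))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


print(label_from_findings("Infiltration"))           # 1
print(label_from_findings("Cardiomegaly|Effusion"))  # 4
print(label_from_findings("No Finding"))             # 0
```

With pandas, the labelling function could be applied through `df["Finding Labels"].map(label_from_findings)`. Dropping class 4 and subsampling class 0 to 10% would happen before the call to `stratified_split`.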

## 3.2) Data preparation before training (4 points)
From here on, you will use HW2_trainSet.csv, HW2_testSet.csv, and HW2_validationSet.csv, provided under the /beegfs/ga4493/data/HW2 folder, to define the train, test, and validation set samples instead of the csv files you generated in Question 3.1.


There are multiple ways of using images as input during training or validation. Here, you need to decide on one way of using images in your network. You may want to use numpy arrays as shown in Lab 4, the HDF5 file format, or the torch Dataset class (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html). Once you have decided how to use images as input, write the necessary script which will enable you to feed images into your designed CNN later. **Note:** If you need to save anything, please use your own folder on HPC.

Now that we can import images for model training, the next step is to define the CNN model that you will train for the disease classification task. Any model requires us to select model parameters such as the number of layers, the kernel size, the number of feature maps, and so on. The number of possible models is infinite, so we need to make some design choices to start. Let's design a CNN model with 5 convolutional layers and a fully connected layer followed by a classification layer. Let's use

-  3x3 convolution kernels
-  ReLU for an activation function
-  max pooling with kernel 2x2 and stride 2. 

Define the number of feature maps in hidden layers as: 16, 16, 32, 32, 64, 32 (1st layer, ..., 6th layer). 

## 3.3) CNN model definition (4 points)
Write a class which specifies the details of this network.
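
The specification above leaves several details open, so the class below is only a sketch under stated assumptions: single-channel images resized to 256x256, `padding=1` so that each 3x3 convolution preserves the spatial size, and a max-pooling step after every convolutional layer. The class name `BaselineCNN` and the `input_size` argument are hypothetical.

```python
import torch
import torch.nn as nn


class BaselineCNN(nn.Module):
    """5 conv layers (16, 16, 32, 32, 64 maps), a 32-unit FC layer, a classifier."""

    def __init__(self, in_channels=1, n_classes=4, input_size=256):
        super().__init__()
        widths = [16, 16, 32, 32, 64]
        layers, c_in = [], in_channels
        for c_out in widths:
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # assumed padding
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),             # assumed after each conv
            ]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        # Each pooling step halves the spatial size.
        side = input_size // 2 ** len(widths)
        self.fc = nn.Linear(widths[-1] * side * side, 32)
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        h = torch.relu(self.fc(h))
        return self.classifier(h)  # logits for the 4 classes


model = BaselineCNN()
logits = model(torch.zeros(2, 1, 256, 256))
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(tuple(logits.shape), n_params)
```

The parameter total printed here depends directly on the assumed input size, since the fully connected layer's input width grows with the area of the last feature map.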

## 3.4) (4 points)
How many learnable parameters does this model have? How many learnable parameters would we have if we had only the 5 convolutional layers, without the fully connected 6th layer, in our network? Describe why the fully connected layer needs so many trainable parameters, and suggest ways to mitigate this.

## 3.5) Loss function and optimizer (2 points)
Define a loss criterion and an optimizer using pytorch. What type of loss function is applicable to our multi-class classification problem? Explain your choice of loss function. For the optimizer, let's use SGD with momentum for now. Choose an empirical learning rate and momentum.
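
In pytorch, a criterion and an optimizer are each one line. The sketch below shows the wiring with cross-entropy as one candidate loss for a 4-class problem; the learning rate of 0.01 and momentum of 0.9 are placeholder values rather than recommendations, and a small linear layer stands in for the CNN so that the snippet runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(8, 4)                 # placeholder standing in for the CNN
criterion = nn.CrossEntropyLoss()       # expects raw logits and integer class labels
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # placeholder values

inputs = torch.randn(16, 8)             # made-up batch
targets = torch.randint(0, 4, (16,))    # made-up labels in {0, 1, 2, 3}

optimizer.zero_grad()                   # clear gradients from the previous step
loss = criterion(model(inputs), targets)
loss.backward()                         # backpropagate
optimizer.step()                        # update the weights
print(loss.item())
```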

_Some background:_ In network architecture design, we want an architecture that has enough capacity to learn. We can achieve this by using a large number of feature maps and/or many more connections and activation nodes. However, having a large number of learnable parameters can easily result in overfitting. To mitigate overfitting, we can keep the number of learnable parameters small by using either shallow networks or few feature maps. This approach, in turn, can result in underfitting, where the model can neither fit the training data nor generalize to new data. Ideally, we want to select a model at the sweet spot between underfitting and overfitting, but it is hard to find the exact sweet spot.

We first need to make sure we have enough capacity to learn; without enough capacity we will underfit. Here, you will need to check whether the model designed in 3.3 can learn or not. Since we do not need to check the generalization capacity (overfitting is OK for now, since it shows that learning is possible), it is a good strategy to use a subset of the training samples. Using a subset of samples is also helpful for debugging and hyperparameter search.

## 3.6) Train the network on a subset
### 3.6.a) (2 points)
Write a script which takes 256 random samples from the train set (HW2_trainSet.csv); let's name this set HW2_randomTrainSet. Choose 64 random samples from the validation set (HW2_validationSet.csv); let's name this set HW2_randomValidationSet. Make sure these sample sets include data from each class.
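
A plain random draw of 256 samples may occasionally miss a rare class. One simple way to guarantee coverage, sketched below with the standard library, is to pick one sample per class first and then fill the rest at random. The function name `sample_covering_classes` is hypothetical and the label list is made up.

```python
import random
from collections import defaultdict


def sample_covering_classes(labels, n_samples, seed=0):
    """Pick n_samples distinct indices at random, with at least one per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    if n_samples < len(by_class) or n_samples > len(labels):
        raise ValueError("n_samples must lie between the class count and the dataset size")
    # One guaranteed sample per class, then fill up from everything else.
    chosen = [rng.choice(idxs) for idxs in by_class.values()]
    chosen_set = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in chosen_set]
    chosen += rng.sample(remaining, n_samples - len(chosen))
    return sorted(chosen)


labels_demo = [0] * 500 + [1] * 300 + [2] * 150 + [3] * 50   # made-up class labels
subset = sample_covering_classes(labels_demo, 256)
print(len(subset), sorted({labels_demo[i] for i in subset}))  # 256 [0, 1, 2, 3]
```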

### 3.6.b) (12 points)
Use the random samples from 3.6.a and write a script to train your network. Using the script, train your network with your choice of weight initialization strategy. If you need to define other hyperparameters, for example the batch size, choose them empirically. Plot the average loss on your random sample set per epoch. (Stop the training after at most ~100 epochs.)

## 3.7) Analysis of training using a CNN model (2 points)
Describe your findings. Can your network learn from 256 random samples? Does the CNN model have enough capacity to learn with your choice of empirical hyperparameters?
-  If yes, how will the average loss plot change if you multiply the learning rate by 10?
-  If no, how can you increase the model capacity? Increase your model capacity and train again until you find a model with enough capacity. If the capacity increase is not sufficient for learning, think about the empirical parameters you chose in designing your network and change some of your selections. Describe what changes you made to your original network and how you managed to get this model to learn.

## 3.8) Hyperparameters (2 points each)
Now we will revisit our selection of the CNN model architecture, training parameters, and so on, i.e. the hyperparameters. In your investigations, explain how you change each hyperparameter in light of the model performance obtained with the previous hyperparameters, and provide your rationale for choosing the next value. Provide training loss and accuracy curves, and the model performance on HW2_randomValidationSet. You will use macro AUC as the performance metric for comparing CNN models on the disease classification task. Report macro AUC for each CNN model with different hyperparameters (check http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings).
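
Macro AUC is the unweighted mean of the one-vs-rest AUC of each class. The sketch below computes it directly from the pairwise definition of AUC (the probability that a positive sample scores higher than a negative one), which is practical for a small set such as 64 validation samples. It is an illustration in NumPy, not the scikit-learn implementation linked above, and the labels and scores are made up.

```python
import numpy as np


def binary_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]   # all positive-negative score differences
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def macro_auc(y_true, probs):
    """Mean one-vs-rest AUC; probs has shape (n_samples, n_classes)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    per_class = [binary_auc(y_true == k, probs[:, k]) for k in range(probs.shape[1])]
    return float(np.mean(per_class))


y_demo = np.array([0, 1, 2, 1])
p_demo = np.array([[0.70, 0.20, 0.10],
                   [0.20, 0.60, 0.20],
                   [0.10, 0.30, 0.60],
                   [0.50, 0.25, 0.25]])
print(macro_auc(y_demo, p_demo))
```

In this toy example classes 0 and 2 are ranked perfectly, while class 1 has one misordered pair out of four, so its AUC is 0.75.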

### 3.8.a)
Investigate the effect of the learning rate on model performance.

### 3.8.b)
We chose SGD with momentum as the optimizer. Investigate the effect of at least two other optimizers on model performance.

### 3.8.c)
Investigate the effect of the dimension of the fully connected layer on model performance.

### 3.8.d)
Investigate the effect of the batch size on learning speed and model performance.

## 3.9) Train the network on the whole dataset (4 points)
After question 3.7, you should have a network which has enough capacity to learn, and from question 3.8 you know which hyperparameters perform better on a subset of the train and validation sets. Train your network on the whole train set (HW2_trainSet.csv) and check the validation loss on the whole validation set (HW2_validationSet.csv) in each epoch. Plot the average loss and accuracy on the train and validation sets. Describe your findings. Do you see overfitting or underfitting to the train set? What else can you do to mitigate it?

## 3.10) Analysis  of the results (4 points)
Use the validation loss to choose the model (let's call it the baseline model) which learns from the train data and generalizes well to the validation set. Using this model, plot the confusion matrix and ROC curve for your multi-class CNN disease classifier on the test set (HW2_testSet.csv). Report macro AUC for this CNN model as the performance metric.
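
A confusion matrix simply counts (true class, predicted class) pairs. The following is a minimal NumPy sketch on made-up labels; the convention that rows index the true class is an assumption to state in your own plot.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


y_true_demo = [0, 0, 1, 2, 3, 3]
y_pred_demo = [0, 1, 1, 2, 3, 0]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, n_classes=4)
print(cm_demo)
```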

## 3.11) Understanding the network (6 points)
Using the best performing model (chosen from the models developed in 3.10 and, in case you work on it, 3.12), we will figure out where our network gathers information to decide the class of an image. One way of doing this is to occlude parts of the image and run it through your network. By changing the location of the occluded region, we can visualize the probability of the image being in one class as a 2-dimensional heat map. Using the best performing model, provide the heat maps for the following images: HW2_visualize.csv. Do the heat map and the bounding box for the pathologies provide similar information? Describe your findings.
Reference: https://arxiv.org/pdf/1311.2901.pdf
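
The occlusion procedure can be sketched independently of any particular network by passing the model in as a function. In the code below, `predict_fn` is a hypothetical callable that maps a 2D image array to a vector of class probabilities; the patch size, stride, and fill value are illustrative choices, and the toy "model" exists only to make the example runnable.

```python
import numpy as np


def occlusion_heatmap(image, predict_fn, target_class, patch=32, stride=16, fill=0.0):
    """Probability of target_class as a square occluder slides over a 2D image.

    Low values mark regions the model relies on for that class.
    """
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch] = fill
            heat[i, j] = predict_fn(occluded)[target_class]
    return heat


def toy_predict(img):
    """Toy stand-in: class-1 probability is the mean of the top-left 32x32 block."""
    p = img[:32, :32].mean()
    return np.array([1.0 - p, p])


image_demo = np.ones((64, 64))
heat_demo = occlusion_heatmap(image_demo, toy_predict, target_class=1, patch=32, stride=32)
print(heat_demo)
```

With a trained pytorch model, `predict_fn` would wrap the forward pass and a softmax, and the resulting map could be upsampled to the image size for comparison with the bounding boxes.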

## 3.12) Your CNN architecture design (Bonus Question 1: 12 points)
Be creative and design your own CNN model. This model can be a variation of the baseline model that uses the information from the hyperparameter search, or it can be a totally new architecture. Use the knowledge you gained from the previous questions to design your network; for this reason, your network is expected to provide superior results. After you have trained your network on the whole train set, choose the best performing model using the loss on the whole validation set. Provide the confusion matrix, ROC curves, and macro AUC for your best performing model using the whole test set. Explain your design criteria and why your performance is better than that of the baseline model. Some suggestions for architecture changes: convolution filter dimensions, dilated convolutions, a network without a fully connected layer, deeper networks, data augmentation ...