<a href="https://colab.research.google.com/github/carrotjamb/AIMI-Intern-Part-2-Files/blob/main/_SECOND_FINAL_AIMI_Project_Part_2_Training_a_Vision_Model_to_Predict_ET_Distances_v9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## AIMI High School Internship 2023 - Classification Model
### Notebook 2: Training a Vision Model to Predict ET Distances

**The Problem**: Given a chest X-ray, our goal in this project is to predict the distance from an endotracheal tube to the carina. This is an important clinical task - endotracheal tubes that are positioned too far (>5cm) above the carina will not work effectively.

**Your Second Task**: You should now have a training dataset consisting of (a) chest X-rays and (b) annotations indicating the distance of the endotracheal tube from the carina. Now, your goal is to train a computer vision model to predict endotracheal tube distance from the image. You have **two options** for this task, and you may attempt one or both of these:
- *Distance Categorization* : Train a model to determine whether the position of a tube is abnormal (>5.0 cm) or normal (≤ 5.0 cm).
- *Distance Prediction*: Train a model that predicts the distance of the endotracheal tube from the carina in centimeters.

In this notebook, we provide some simple starter code to get you started on training a computer vision model. You are not required to use this template - feel free to modify as you see fit.

**Submitting Your Model**: We have created a leaderboard where you can submit your model and view results on the held-out test set. We provide instructions below for submitting your model to the leaderboard. **Please follow these directions carefully**.

We will evaluate your results on the held-out test set with the following evaluation metrics:
- *Distance Categorization* : We will measure AUROC, which is a metric commonly used in healthcare tasks. See this blog for a good explanation of AUROC: https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
- *Distance Prediction*: We will measure the mean absolute error (MAE, the average L1 distance) between the predicted distances and the true distances.
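
Both metrics can be checked locally on a validation split before submitting. The following is a minimal sketch in plain Python, assuming binary labels where 1 means abnormal. The function names `auroc` and `mean_absolute_error` are illustrative, and the leaderboard's own implementation may differ (for example, in how it treats tied scores).

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted distances (in cm)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, for illustration only.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))            # 0.75
print(mean_absolute_error([4.0, 6.5, 3.0], [5.0, 6.0, 3.0]))  # 0.5
```

AUROC depends only on how the predictions are ranked, so it should be computed on the sigmoid probabilities rather than on rounded 0/1 values.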


## Load Data
Before you begin, go to `Runtime` > `Change Runtime Type` and select a T4 GPU. Then upload `data.zip`; the upload should take about 10 minutes. Finally, run the following cells to unzip the dataset (which should take < 10 seconds).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!unzip /content/drive/MyDrive/Colab\ Notebooks/drive-download-20230620T044532Z-001\ \(1\).zip

In [None]:
!unzip -qq /content/mimic-train.zip

In [None]:
!unzip -qq /content/mimic-test.zip

## Import Libraries
We are leveraging the PyTorch framework to train our models. For more information and tutorials on PyTorch, see this link: https://pytorch.org/tutorials/beginner/basics/intro.html

In [None]:
# Some libraries that you may find useful are included here.
# To import a library that isn't provided with Colab, use the following command: !pip install torchmetrics
import torch
import pandas as pd
from PIL import Image
import numpy as np
from tqdm import tqdm
import csv
from torch.utils.data.dataset import TensorDataset
import torchvision.models as models
import torch.nn as nn
import cv2

#Set up GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [None]:
filename_1 = "/content/mimic_train_student.csv"
list_of_keys = []

with open(filename_1, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        img_path = "/content/" + str(row[4])
        list_of_keys.append(img_path)

filename_2 = "/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /csv-part-2-v2-(results_1.16).csv"
list_of_labels = []

with open(filename_2, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        list_of_labels.append(row[1])

#Pair each image path with its label and write the pairs to a new CSV
with open('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-tester.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(list_of_keys, list_of_labels))

removed_path = '/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-tester-removed.csv'

#Drop rows whose measurement is missing ("") or zero ("0.0")
with open('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-tester.csv', 'r', newline='') as infile, \
     open(removed_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        if row[1] != "0.0" and row[1] != "":
            writer.writerow(row)

#Load the filtered rows once so that both splits use the same row count
with open(removed_path, 'r', newline='') as infile:
    filtered_rows = list(csv.reader(infile))

#Split on the number of filtered rows; len(list_of_keys) would also count the rows dropped above
split_index = int(0.75 * len(filtered_rows))

#Get training set - first 75% of the filtered rows (the first row is the header)
with open('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-training-set.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(filtered_rows[:split_index])

In [None]:
#Get validation set - remaining 25% of the filtered rows
with open('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-validation-set.csv', 'w', newline='') as outfile:
    writer2 = csv.writer(outfile)
    writer2.writerow(['/content/image_path', 'measurement'])
    writer2.writerows(filtered_rows[split_index:])
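
The cells above split the rows in file order. If the CSV happens to be ordered in some systematic way, the training and validation sets could end up with different distributions. One alternative, which is not part of the original notebook, is to shuffle the rows with a fixed seed before splitting. The sketch below assumes the header row has already been removed; `split_rows` and its parameters are hypothetical names.

```python
import random


def split_rows(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows with a fixed seed and return (train, validation)."""
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]


# Made-up rows, for illustration only.
rows = [(f"/content/img_{i}.jpg", float(i)) for i in range(8)]
train_rows, val_rows = split_rows(rows)
print(len(train_rows), len(val_rows))  # 6 2
```

Because the seed is fixed, rerunning the cell produces the same split, which keeps validation numbers comparable between runs.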

## Create Dataloaders
We will implement a custom Dataset class to load in data. A custom Dataset class must have three methods: `__init__`, which sets up any class variables, `__len__`, which defines the total number of images, and `__getitem__`, which returns a single image and its paired label.

In [None]:
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as transforms

class ChestXRayDataset(Dataset):

    def __init__(self, csv_file_key, transform=None, **kwargs):
        super(ChestXRayDataset, self).__init__(**kwargs)

        # # Fill in __init__() here
        # self.chest_xray_labels = pd.read_csv(csv_file_labels)
        self.chest_xray_key = pd.read_csv(csv_file_key)
        self.transform = transform

    def __len__(self):

        # Fill in __len__() here
        length = len(self.chest_xray_key)
        return length

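    def __getitem__(self, idx):
        out_dict = {"idx": torch.tensor(idx),}
        convert_tensor = transforms.ToTensor()

        #Read in image with OpenCV (BGR channel order) and convert it to RGB
        img_name = self.chest_xray_key['/content/image_path'][idx]
        img = cv2.imread(img_name)
        if img is None:
            #cv2.imread returns None instead of raising when a file is missing or unreadable
            raise FileNotFoundError(f"Could not read image: {img_name}")
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = convert_tensor(img)

        #ImageNet channel statistics, matching the ImageNet-pretrained ResNet weights
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]

        #Use the transform passed to __init__ if there is one; otherwise resize and normalize
        transform = self.transform
        if transform is None:
            transform = transforms.Compose([
                transforms.Resize((224,224)),
                transforms.Normalize(mean, std),
            ])

        normalized_image = transform(img)
        out_dict["img"] = normalized_image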

        #Convert measurement to category (abnormal if measurement greater than 5.0 cm, otherwise normal)
        measurement = self.chest_xray_key['measurement'][idx]

        label = 0
        if measurement > 5.0:
          #1 Corresponds to Abnormal
          label = 1.0
        else:
          #0 Corresponds to Normal
          label = 0.0

        out_dict["label"] = torch.tensor(label)

        return out_dict
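
To make the preprocessing in `__getitem__` concrete, the sketch below traces a single RGB pixel through the same arithmetic: `ToTensor` scales 8-bit values to the range 0 to 1, and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation. The mean and standard deviation are the standard ImageNet statistics used above; `normalize_pixel` is an illustrative helper, not a torchvision function.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Apply the ToTensor scaling and the Normalize step to one (R, G, B) pixel."""
    scaled = [v / 255.0 for v in rgb_uint8]  # ToTensor: 0..255 -> 0.0..1.0
    return [(s - m) / d for s, m, d in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


print(normalize_pixel([255, 255, 255]))  # brightest pixel: all channels positive
print(normalize_pixel([0, 0, 0]))        # darkest pixel: all channels negative
```

After normalization the inputs are roughly centered around zero, which is the range the ImageNet-pretrained weights were trained on.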


## Define Training Components
Here, define any necessary components that you need to train your model, such as the model architecture, the loss function, and the optimizer.

In [None]:
# Model Architecture
# model_ft = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', weights='ResNet50_Weights.IMAGENET1K_V1')
model = models.resnet50(weights='ResNet50_Weights.IMAGENET1K_V1')

#Set Model Feature Output to 1
num_ftrs = model.fc.in_features
new_layer = nn.Linear(num_ftrs, 1)
model.fc = new_layer
model = model.to(device)

# Loss Function
loss = nn.BCELoss()

#Optimizer
opt = torch.optim.AdamW(model.parameters(), lr=9e-5) # AdamW is a commonly-used optimizer. Feel free to modify.
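
The cell above uses `nn.BCELoss`, which expects probabilities, so the training loop has to apply a sigmoid to the model's single output first. The sketch below reproduces that calculation in plain Python to show what the loss measures. It assumes the default mean reduction; the small `eps` clip is this sketch's own guard against `log(0)` and is not necessarily how PyTorch handles that case.

```python
import math


def sigmoid(x):
    """Map a raw model output (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


# A logit of 0 is a 50/50 guess, which costs log(2) regardless of the label.
print(binary_cross_entropy([sigmoid(0.0)], [1.0]))
# Confident and correct predictions cost much less.
print(binary_cross_entropy([sigmoid(2.0), sigmoid(-2.0)], [1.0, 0.0]))
```

PyTorch also provides `nn.BCEWithLogitsLoss`, which combines the sigmoid and the loss in one step; switching to it would mean removing the explicit sigmoid from the training loop.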

## Training Code
We provide starter code below that implements a simple training loop in PyTorch. Feel free to modify as you see fit.

In [None]:
sigmoid = nn.Sigmoid()

def train(model, loss_fn, train_loader, opt, max_epoch, validation_loader):
  for epoch in range(0, max_epoch):
      #Training
      model.train()
      total_loss = 0.
      correct_train = 0.
      total_samples = 0
      count = 0
      #Loop through training set
      for step, sample in tqdm(enumerate(train_loader), total=len(train_loader)):
        count += 1
        #Send image/labels to gpu
        image = sample['img'].to(device)
        labels = sample['label'].to(device)
        labels = labels.unsqueeze(dim=1) #Converts labels to a (batch_size, 1) tensor

        opt.zero_grad()

        pred = model(image)
        pred = sigmoid(pred)

        loss = loss_fn(pred, labels)

        loss.backward()
        opt.step()

        total_loss += loss.item()

        #Count number of accurate predictions and number of samples seen so far
        correct_train += (pred.round() == labels).sum().item()
        total_samples += labels.size(0)

        #Report the running average loss and accuracy every 10 batches
        if count % 10 == 0:
          print("Epoch:", epoch, "Average Loss:", total_loss/count)
          print("Accuracy:", correct_train/total_samples)


      # #Validation
      # model.eval()
      # total_loss = 0.
      # correct_train = 0.
      # num_batches = len(validation_loader)
      # # test_loss, correct = 0, 0
      # with torch.no_grad():
      #   total_correct = 0
      #   total_samples = 0

      #   for step, sample in tqdm(enumerate(validation_loader)):
      #     sample['img'], sample['label'] = sample['img'].to(device), sample['label'].to(device)

      #     pred = model(sample['img'])
      #     pred = sigmoid(pred)
      #     pred = torch.round(pred)

      #     labels2 = sample['label']
      #     labels2 = labels2.unsqueeze(dim=1)

      #     #Calculate Loss
      #     loss = loss_fn(pred, labels2)
      #     #Add Loss to Total Loss
      #     total_loss += loss.item()

      #     #Count number of accurate predictions
      #     correct_train += (pred == labels2).sum().item()

      #   #Compute average loss
      #   print("Average Loss:", total_loss/num_batches)

      #   #Compute accuracy
      #   accuracy = correct_train/(32 * num_batches)
      #   print("Accuracy:", accuracy)

In [None]:
#Create Dataset and Dataloader
training_dataset = ChestXRayDataset("/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-training-set.csv")
training_dataloader = torch.utils.data.DataLoader(dataset=training_dataset, batch_size=16, shuffle=True, drop_last=True)
validation_dataset = ChestXRayDataset('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-validation-set.csv')
validation_data_loader = torch.utils.data.DataLoader(dataset=validation_dataset, batch_size=16, shuffle=False, drop_last=False)

train(model, loss, training_dataloader, opt, max_epoch=3, validation_loader=validation_data_loader)

## Submitting Your Results
Once you have successfully trained your model, generate predictions on the test set and save your results as a `.csv` file. This file can then be uploaded to the leaderboard.

Your final `.csv` file **must** have the following format:
- There must be a column titled `image_path` with the paths to the test set images. This column should be identical to the one provided in `mimic_test_student.csv`.
- There must be a column titled `pred` with your model outputs.
  - If you are running the `distance categorization` task, this column must have floating point numbers ranging between 0 and 1. Higher numbers should indicate a greater likelihood that the tube distance is abnormal. Hint: You can convert model outputs to the 0 to 1 range by applying the sigmoid activation function (`torch.sigmoid()` or the `torch.nn.Sigmoid` module).
  - If you are running the `distance prediction` task, this column must have numbers representing the tube distance in centimeters.
- Double-check that there are 500 rows in your output file.
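
The format rules above can be checked automatically before uploading. The following is a minimal sketch using only the standard `csv` module; `write_submission` and `check_submission` are hypothetical helper names, and the checks cover only the rules listed here, not everything the leaderboard may validate.

```python
import csv


def write_submission(image_paths, preds, out_path):
    """Write a two-column CSV (image_path, pred) with one row per test image."""
    if len(image_paths) != len(preds):
        raise ValueError("image_paths and preds must have the same length")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "pred"])
        for path, pred in zip(image_paths, preds):
            writer.writerow([path, float(pred)])


def check_submission(csv_path, expected_rows=500, require_probabilities=True):
    """Return a list of problems found in a submission file (empty if none)."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    problems = []
    for column in ("image_path", "pred"):
        if column not in fieldnames:
            problems.append(f"missing column: {column}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Only the categorization task requires values between 0 and 1.
    if require_probabilities and "pred" in fieldnames:
        bad = [r["pred"] for r in rows if not 0.0 <= float(r["pred"]) <= 1.0]
        if bad:
            problems.append(f"{len(bad)} predictions fall outside [0, 1]")
    return problems
```

For the distance prediction task, call `check_submission(path, require_probabilities=False)`, since predicted distances in centimeters are not limited to the 0 to 1 range.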

In [None]:
#Get list of predictions

# model = # Model Architecture
# ckpt = torch.load("/content/drive/MyDrive/best.pkl")
# model.load_state_dict(ckpt["state_dict"])
filename_3 = "/content/drive/MyDrive/test_results_final_4.csv"


test_dataset = ChestXRayDataset("/content/drive/MyDrive/test_key_2.csv")
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=1, shuffle=False, drop_last=False)

test_results = {"image_path": [], "pred": []}
# Write method to load in data from test_loader, compute model predictions, and append results to test_results dict

#Switch to evaluation mode so batch norm layers use their running statistics instead of per-batch statistics
model.eval()

with open(filename_3, 'w', newline='') as csvfile3:
    datawriter = csv.writer(csvfile3)
    datawriter.writerow(["pred"])
    with torch.no_grad():
      for step, sample in tqdm(enumerate(test_loader)):
        image = sample['img'].to(device)

        pred = model(image)
        pred = sigmoid(pred)
        #Write the probability itself (not a rounded 0/1 value), as the leaderboard format requires
        datawriter.writerow([pred.item()])

In [None]:
#Create List of Filenames
total = 0

filename_1 = "/content/mimic_test_student.csv"
filename_2 = "/content/drive/MyDrive/test_key_2.csv"
list_of_keys = []
with open(filename_1, 'r') as csvfile:
  with open(filename_2, 'w') as csvfile2:
    datareader = csv.reader(csvfile)
    datawriter = csv.writer(csvfile2)
    datawriter.writerow(['/content/image_path', 'measurement'])
    count = 1
    for row in datareader:
        if count != 1:
          filepath = '/content/' + row[5]
          datawriter.writerow([filepath, ""])
        count +=1

In [None]:
#Validation
validation_dataset = ChestXRayDataset('/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /For Part 2/csv-validation-set.csv')
validation_data_loader = torch.utils.data.DataLoader(dataset=validation_dataset, batch_size=16, shuffle=False, drop_last=False)

sigmoid = nn.Sigmoid()
model.eval()

with open("/content/drive/MyDrive/Colab Notebooks/CSVs_from_part_1 /validation_results2.csv", 'w', newline='') as results_file:
  write = csv.writer(results_file)
  write.writerow(['prediction', 'label'])

  with torch.no_grad():
    total_loss = 0.
    total_samples = 0
    correct_train = 0.
    count = 0

    for step, sample in tqdm(enumerate(validation_data_loader)):
      count += 1
      sample['img'], sample['label'] = sample['img'].to(device), sample['label'].to(device)

      pred = model(sample['img'])
      pred = sigmoid(pred)

      labels2 = sample['label']
      labels2 = labels2.unsqueeze(dim=1)

      #Calculate loss on the probabilities, before any rounding
      _loss = loss(pred, labels2)
      #Add Loss to Total Loss
      total_loss += _loss.item()

      #Count number of accurate predictions and number of samples seen
      correct_train += (pred.round() == labels2).sum().item()
      total_samples += labels2.size(0)

      #Save one (prediction, label) row per image
      for p, l in zip(pred.flatten().tolist(), labels2.flatten().tolist()):
        write.writerow([p, l])

    print("Average Loss:", total_loss/count)

    #Compute accuracy over the actual number of samples (the last batch may hold fewer than 16)
    accuracy = correct_train/total_samples
    print("Accuracy:", accuracy)

# def test(dataloader, model, loss_fn):
#     size = len(dataloader.dataset)
#     num_batches = len(dataloader)
#     model.eval()
#     test_loss, correct = 0, 0
#     with torch.no_grad():
#         for step, sample in tqdm(enumerate(dataloader)):
#             sample['img'], sample['label'] = sample['img'].to(device), sample['label'].to(device)

#             pred = model(sample['img'])
#             pred = sigmoid(pred)

#             print((sample['label']))
#             print(pred)

#             write.writerow([list[pred], list[sample['label']]])
# #     # test_loss += loss_fn(pred, y).item()
# #     #         correct += (pred.argmax(1) == y).type(torch.float).sum().item()
# #     # test_loss /= num_batches
# #     # correct /= size
# #     # print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

# # ckpt = torch.load("/content/best.pkl")
# # model.load_state_dict(ckpt["state_dict"])




In [None]:
#Create List of Filenames without /content/

# model = # Model Architecture
# ckpt = torch.load("/content/drive/MyDrive/best.pkl")
# model.load_state_dict(ckpt["state_dict"])
filename_3 = "/content/drive/MyDrive/test_key.csv"
filename_4 =  "/content/drive/MyDrive/test_key_2.csv"

# test_dataset = ChestXRayDataset("/content/drive/MyDrive/test_key.csv")
# test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=1, shuffle=False, drop_last=False)

# test_results = {"image_path": [], "pred": []}
# # Write method to load in data from test_loader, compute model predictions, and append results to test_results dict
with open(filename_4, 'w', newline='') as csvfile4:
    datawriter = csv.writer(csvfile4)
    with open(filename_3, 'r', newline='') as csvfile3:
        #Parse the input with csv.reader so that row[0] is the whole path, not a single character
        datareader = csv.reader(csvfile3)
        for row in datareader:
          #Skip blank rows and rows that do not carry the "/content/" prefix
          if row and row[0].startswith("/content/"):
            answer = row[0][len("/content/"):]
            datawriter.writerow([answer])

