<a href="https://colab.research.google.com/github/abhayk-c/Heart-Lesion-Classifier-Model/blob/main/Heart_Lesion_Classifier_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Environment Setup

First lets pip install all the python package dependencies we will need to build out our Classifier.




In [None]:
!pip install torch torchvision
!pip install kaggle
!pip install cadica-data-set
!pip install fastai fastcore

# Step 2: Create our Data Set

Now lets download the cadica data set hosted on Kaggle. This data set contains labeled images of heart CT scans, labeling pictures containing arteries with lesions (potential CAD) and pictures without lesions (no CAD).

In [None]:
!mkdir sample_data/cadica_data
!mkdir learner
!kaggle datasets download abhaycuram/cadica-data-set -p /content/sample_data/cadica_data --unzip

Now that we have downloaded our data set, lets separately save some images from the data set for model quality analysis (analyzing inferencing and prediction quality). This will be our test data set.

In [3]:
# The p20/v7 file contains non lesioned images.
# The p30/v10 file contains lesioned images.
# We separate the video folders containing the images for these two cases
# away as our test set to validate the inference quality of our model after fine tuning.
!mkdir sample_data/cadica_data/test_data
!mv /content/sample_data/cadica_data/CADICA/selectedVideos/p20/v7 /content/sample_data/cadica_data/test_data/
!mv /content/sample_data/cadica_data/CADICA/selectedVideos/p31/v10 /content/sample_data/cadica_data/test_data/

In [4]:
# Here we prepare our test data. All the lesioned and non lesioned images
# we "save" as our test data for model quality analysis we write some scripting code
# to generate the full paths to these images.

import os

def read_cadica_txt_file(path):
  lines = []
  f = open(path, "r")
  for line in f:
    sanitized_line = line.strip()
    lines.append(sanitized_line)
  f.close()
  return lines

def get_test_image_paths(vid_input_path, selected_frames_txt_path):
  labeled_test_images = read_cadica_txt_file(selected_frames_txt_path)
  image_files = list(filter(lambda img_file: os.path.isfile(os.path.join(vid_input_path, img_file)), os.listdir(vid_input_path)))
  image_file_names = list(map(lambda img_file_w_ext: (os.path.splitext(img_file_w_ext))[0], image_files))
  test_image_paths = []
  for img_file_name in image_file_names:
    if img_file_name in labeled_test_images:
      test_image_paths.append(os.path.join(vid_input_path, img_file_name + ".png"))
  return test_image_paths

# Full paths to the labeled lesioned images and non_lesioned images we will use as our test data
# at the bottom of our notebook after model tuning.
lesioned_test_image_paths = get_test_image_paths("/content/sample_data/cadica_data/test_data/v10/input/",
                                                 "/content/sample_data/cadica_data/test_data/v10/p31_v10_selectedFrames.txt")
nonlesioned_test_image_paths = get_test_image_paths("/content/sample_data/cadica_data/test_data/v7/input/",
                                                    "/content/sample_data/cadica_data/test_data/v7/p20_v7_selectedFrames.txt")

Now lets read the downloaded cadica data set, index it, and prepare our training data (input images and labels). We will use the cadica-data-set package that we pip installed above to make this very easy. The package contains python objects and code that can read the cadica data set, and index it into memory and prepare the data labels to make preparing the training data very easy. This is a package I open sourced, source code can be found here:
https://github.com/abhayk-c/cadica_data_set.git

In [5]:
from cadica_data_set import CadicaDataSet
from cadica_data_set import CadicaDataSetSamplingPolicy

data_set_path = "/content/sample_data/cadica_data/CADICA/"
learner_path = "/content/learner/"
cadica_data_set = CadicaDataSet(data_set_path)
cadica_data_set.load()

# Step 3: Train and export our Model

Let's now train and build our heart lesion classifier model. We will leverage the existing and widely popular resnet18 computer vision model and "fine tune" the model to learn to classify the images in our cadica data set. We leverage "transfer learning"



In [None]:
from fastai.vision.all import *

# To produce our training data (image paths) and training labels, we use our existing
# cadica_set object that loaded our data set and directly call it's API's.
# We wrap calling the API in named python functions to make this compatible with
# Fast AI's DataBlock and DataLoader API.
def get_input_image_paths(o: Any):
    image_paths = cadica_data_set.get_training_data_image_paths(CadicaDataSetSamplingPolicy.BALANCED_SAMPLING)
    return image_paths

def get_classification_label(path: str):
    if cadica_data_set.is_lesioned_image(path):
        return "lesioned_heart_scan"
    else:
        return "non_lesioned_heart_scan"

# Now let's wire up the input training/label data and create the input data used
# for model training (Data Block and Data Loader)
data_loaders = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_input_image_paths,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=get_classification_label,
    item_tfms=[RandomResizedCrop(224, min_scale=0.5)],
    batch_tfms=aug_transforms()
).dataloaders(learner_path, bs=32)
data_loaders.show_batch(max_n=6)


# Let's train our model via "fine tuning" and transfer learning approach, then export it.
learn = vision_learner(data_loaders, resnet18, metrics=error_rate)
learn.fine_tune(5)
learn.export('heart_lesion_detector_model.pkl')

# Step 4: Evaluate Model quality

Let's now run the prediction and inferencing and see how our model can accurately classify new images.

In [None]:
model_inference = load_learner('heart_lesion_detector_model.pkl')
lesion_prediction_hits = 0
non_lesion_prediction_hits = 0
prob_threshold = 0.5

print("Expecting a lesioned label prediction")
for test_img in lesioned_test_image_paths:
  label1,_,probs1 = model_inference.predict(test_img)
  print(f"This is a: {label1}.")
  print(f"Probability it's a {label1}: {probs1[0]:.4f}")
  if label1 == "lesioned_heart_scan" and probs1[0] >= prob_threshold:
    lesion_prediction_hits += 1

print("\n")
print("\n")
print("Expecting a nonlesioned label prediction")
for test_img in nonlesioned_test_image_paths:
  label1,_,probs1 = model_inference.predict(test_img)
  print(f"This is a: {label1}.")
  print(f"Probability it's a {label1}: {probs1[0]:.4f}")
  if label1 == "non_lesioned_heart_scan" and probs1[0] >= prob_threshold:
    non_lesion_prediction_hits += 1

lesion_img_count = len(lesioned_test_image_paths)
lesion_prediction_misses = lesion_img_count - lesion_prediction_hits
lesion_prediction_error = (lesion_prediction_misses) / lesion_img_count

nonlesion_img_count = len(nonlesioned_test_image_paths)
nonlesion_prediction_misses = nonlesion_img_count - non_lesion_prediction_hits
nonlesion_prediction_error = nonlesion_prediction_misses / nonlesion_img_count

model_error = (lesion_prediction_misses + nonlesion_prediction_misses) / (lesion_img_count + nonlesion_img_count)

print("\n")
print("\n")
print("Model Quality Results:-------")
print("\n")
print(f"lesion_prediction_error: {lesion_prediction_error:.6f}")
print(f"non_lesion_prediction_error: {nonlesion_prediction_error:.6f}")
print(f"model_error: {model_error:.6f}")


# Model Results and Conclusion

**Observations**
1. The ML Model performs poorly on the test data. It has an error rate of around 35.4% on the test data.
2. Introducing data augmentations improved the model accuracy. Without data augmentations the model previously had an error rate of 48.38%
3. The methodology we are using to assess model quality may not be very robust so there could be some false positives or false negatives.

**Learnings**
1. Leveraging a transfer learning technique on the resnet18 CNN image model to classify heart lesions may not have been a great approach due to "out of domain" data problem:
  - Heart lesion angiogram pictures have a grainy low quality resolution and are black and white/grayscale. The resnet18 CNN was trained on colored images.
  - The resnet18 CNN model was trained on the image net data set which contains real world objects based on the word net language/synset graph. These images reflect real world objects. The features in images of real world objects vs medical imaging data are fundamentally different. It might be hard for the nueral net to pick up features like arteries, veins, and organs.
2. The cadica data set may not be enough of a representative sample of cardiac angiogram images for the neural net to learn from and "generalize" well to net new angiogram pictures. The resolution of these images I noticed are quite low.
3. More advanced image manipulation and processing techniques may be needed for the ML model to learn and detect subtle features like lesioned arteries.

**Improving Model Quality, things I'd do differently:**
1. I'd curate a larger dataset of angiogram images and carefully ensure I have a goopd representative sample that generalizes well. I'd have low res images, but also high res images, and try to give the model more labeled data to learn from as this is a difficult classification task.
2. If I wanted to continue with a tuning/transfer learning approach I would consider using a medical image ML model that has been trained on medical image data to classify things like lesions and arteries well. There are a few out there.
3. If I wanted to continue using resnet18, Instead of finetuning (transfer learning) I would fit the model (train from scratch) and change the input layer to recognize black and white images, this could help it recognize the fine grained features in angiogram images better.
4. More sophisticated data augmentation/transform pipeline. Introduce image transformations that can really focus on lesioned arteries and what they look like. This can help the model better isolate those features. Almost take a zoom in to zoom out approach.

Ultimately a model like this could be developed in theory with more careful experimentation and iteration. Creating such a model is "bounded" by the data set. Getting the medical image data that we need with proper labels and quality is the limiting factor. Exploring open sourced image models trained on medical imaging data can speed up development and training cycles.