# 01 - Model Selection

We tested several models on a subsample of pathological findings to establish which one performs best overall for our tests.

## Steps
- Prepare the dataset for 5-fold cross-validation using a subset of the findings present in VinDr-Mammo
- Modify each model for testing
- Evaluate each model for 10 epochs
- Present a table with the results
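
As a rough illustration of the first step, the sketch below builds a shuffled 5-fold split over sample indices using only the standard library. How the folds are actually built is not shown at this point in the notebook, so the helper `kfold_indices`, the fixed seed, and the plain (non-stratified) shuffle are all assumptions made for illustration.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split.

    Hypothetical helper: a plain, non-stratified split, assumed here
    only to illustrate the idea of 5-fold cross-validation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


splits = list(kfold_indices(n_samples=23, n_folds=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```

Each sample lands in the validation set of exactly one fold and in the training set of the other four. For heavily imbalanced findings, a stratified split may be preferable.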

In [None]:
import os
import numpy as np
import pandas as pd
from ast import literal_eval

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

from torchvision import models
from torchvision.transforms import v2 as transforms
from torchvision.ops import sigmoid_focal_loss
from torchmetrics.classification import MultilabelF1Score, MultilabelAccuracy

## Part 00 - Converting to the new labels
As some findings have very few samples in the complete dataset, we fuse the following types of pathological findings:
- `Focal Asymmetry`, `Asymmetry`, `Global Asymmetry` -> `Asymmetries`
- `Nipple Retraction`, `Skin Retraction` -> `Retractions`

The following cells convert the original CSV data to these new annotations and save the result as `finding_annotations_V2.csv`.

If you have already converted the annotations, you can skip this step.
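
The fusion rule above can be pictured as a dictionary lookup applied to each stringified label list. The sketch below is a self-contained illustration and not the notebook's own function: the names `FUSED` and `fuse_labels` are hypothetical, and sorting the result is an addition made here so that the output string is reproducible.

```python
from ast import literal_eval

# Mapping from original finding names to the fused names listed above.
FUSED = {
    "Focal Asymmetry": "Asymmetries",
    "Asymmetry": "Asymmetries",
    "Global Asymmetry": "Asymmetries",
    "Nipple Retraction": "Retractions",
    "Skin Retraction": "Retractions",
}


def fuse_labels(raw):
    """Parse a stringified label list, fuse labels, drop duplicates."""
    labels = literal_eval(raw)
    # Labels without an entry in FUSED are kept unchanged.
    fused = {FUSED.get(label, label) for label in labels}
    return str(sorted(fused))


example = "['Focal Asymmetry', 'Mass', 'Asymmetry']"
fused_example = fuse_labels(example)
print(fused_example)  # ['Asymmetries', 'Mass']
```

Both asymmetry variants collapse into a single `Asymmetries` entry, while `Mass` passes through untouched.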

In [None]:
csvpath = ""  # Path to finding_annotations.csv from the VinDr-Mammo dataset
imagepath = ""  # Directory with the VinDr-Mammo images processed with our method

In [None]:
def review_labels(row):
    label_cols = literal_eval(row["finding_categories"])
    new_labels = []
    for label in label_cols:
        if label in ["Focal Asymmetry", "Asymmetry", "Global Asymmetry"]:
            new_labels.append("Asymmetries")
        elif label in ["Skin Retraction", "Nipple Retraction"]:
            new_labels.append("Retractions")
        else:
            new_labels.append(label)
    # remove duplicates; sort so the saved string is the same on every run
    new_labels = sorted(set(new_labels))
    # return them as a string, same as the original format
    return str(new_labels)

In [None]:
# load the original csv
df = pd.read_csv(csvpath)
newdf = df.copy()
newdf["finding_categories"] = newdf.apply(review_labels, axis=1)
newdf.to_csv("finding_annotations_V2.csv", index=False)

## Part 01 - Preparing dataset
This part needs the converted CSV file `finding_annotations_V2.csv` produced in Part 00. If you have already converted it, skip Part 00 and continue here.

In this step, we select the subset of images whose findings all belong to the selected labels.
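
To make the selection rule concrete, here is a minimal sketch on toy rows: a row is kept only when every one of its labels belongs to the selected set. The names `toy_selected` and `keep_row`, and the label `Other Finding`, are hypothetical and used only for illustration.

```python
from ast import literal_eval

# Toy subset of selected labels (illustrative, not the full list).
toy_selected = {"Mass", "Suspicious Calcification", "Asymmetries"}


def keep_row(raw):
    """Return True only if every label in the row is a selected label."""
    labels = literal_eval(raw)
    return all(label in toy_selected for label in labels)


rows = [
    "['Mass']",
    "['Mass', 'Asymmetries']",
    "['Mass', 'Other Finding']",  # hypothetical label outside the subset
    "['Other Finding']",
]
kept = [raw for raw in rows if keep_row(raw)]
print(kept)
```

A row with one selected and one unselected label is dropped entirely. Note also that `all` over an empty list is `True`, so a row with an empty label list would pass this kind of check.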

In [None]:
csvpath = "finding_annotations_V2.csv"


selected_labels = [
    "Mass",
    "Suspicious Calcification",
    "Asymmetries",
    "Architectural Distortion",
    "Suspicious Lymph Node",
    "Skin Thickening",
    "Retractions",
]

In [None]:
# open the csv file
df = pd.read_csv(csvpath)

df.head()

In [None]:
# filter the labels
def check_if_present(row):
    label_cols = literal_eval(row["finding_categories"])
    # True only if every label of this row is in the selected labels
    return all(label in selected_labels for label in label_cols)


selected = df[df.apply(check_if_present, axis=1)]
print(len(selected))

train_df = selected.groupby("split").get_group("training")
test_df = selected.groupby("split").get_group("test")

In [None]:
train_df["finding_categories"].value_counts()

## Part 02 - Preparing for input
We create a `torch.utils.data.Dataset` subclass to preprocess each sample.
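
Since the notebook imports multilabel metrics, one plausible way to turn a row's label list into a training target is a multi-hot vector with one slot per label. The sketch below illustrates that encoding in plain Python. It is an assumption about the target format, not the dataset class's actual code, and `TOY_LABELS` and `multi_hot` are hypothetical names.

```python
from ast import literal_eval

# Toy label list; the position in this list defines the vector slot.
TOY_LABELS = ["Mass", "Suspicious Calcification", "Asymmetries"]


def multi_hot(raw, label_names=TOY_LABELS):
    """Encode a stringified label list as a multi-hot list of floats."""
    present = set(literal_eval(raw))
    return [1.0 if name in present else 0.0 for name in label_names]


target = multi_hot("['Asymmetries', 'Mass']")
print(target)  # [1.0, 0.0, 1.0]
```

Inside a `Dataset`, such a list would typically be wrapped in a float tensor before being returned together with the image.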

In [None]:
class VindrDataset(Dataset):
    def __init__(self, dataframe, imageroot, transforms=None):
        self.df = dataframe
        self.root = imageroot
        self.transforms = transforms
        self.labels = [
            "Mass",
            "Suspicious Calcification",
            "Asymmetries",
            "Architectural Distortion",
            "Suspicious Lymph Node",
            "Skin Thickening",
            "Retractions",
        ]