## AIMI High School Internship 2023
### Notebook 1: Extracting Labels from Radiology Reports

**The Problem**: Given a chest X-ray, our goal in this project is to predict the distance from an endotracheal tube to the carina. This is an important clinical task: endotracheal tubes that are positioned too far (>5 cm) above the carina will not work effectively.

In order to train a model that can predict tube distances given chest X-rays, we require a ***training set*** with chest X-rays and labeled tube distances. However, when working with real-world medical data, important labels (e.g. endotracheal tube distances) are often not annotated ahead of time. The only data that a researcher has access to are the raw images and free-form clinical text written by the radiologist.

**Your First Task**: Given a set of chest X-rays and paired radiology reports, your goal is to use natural language processing tools to extract endotracheal tube distances from the reports.

**Looking Ahead**: When you complete this task, you should have a training dataset with chest X-rays labeled with endotracheal tube distances. You will later use this dataset to train a computer vision model that predicts the tube distance given an image.

### Load Data

Upload `data.zip`; the upload should take about 10 minutes. Then run the following cells to unzip the dataset, which should take less than 10 seconds.

In [None]:
!unzip -qq /content/data.zip

In [None]:
!unzip -qq /content/mimic-train.zip

In [None]:
!unzip -qq /content/mimic-test.zip

### Understanding the Data

Let's first go through some terminology. Medical data is often stored in a hierarchy consisting of three levels: patient, study, and images.
- Patient: A patient is a single unique individual.
- Study: Each patient may have multiple sets of images taken, perhaps on different days. Each set of images is referred to as a *study*.
- Images: Each study consists of one or more *images*.
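
The three levels can be read directly off a file path. The following is a minimal sketch, assuming image paths shaped like `<root>/<patient>/<study>/<image>.jpg`; the helper `parse_image_path` and the `ImageRecord` tuple are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageRecord(NamedTuple):
    """Hypothetical container for the three hierarchy levels."""
    patient_id: str
    study_id: str
    image_id: str


def parse_image_path(image_path: str) -> ImageRecord:
    """Split '<root>/<patient>/<study>/<image>.jpg' into its three levels."""
    path = PurePosixPath(image_path)
    # parts[-3] is the patient folder, parts[-2] the study folder,
    # and the file name without its extension is the image id
    return ImageRecord(path.parts[-3], path.parts[-2], path.stem)


record = parse_image_path("mimic-train/13360/62326/88457.jpg")
print(record.patient_id, record.study_id, record.image_id)
```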

Chest X-ray images and radiology reports are stored in `data/` and are organized as follows:
- `data/mimic-train`:
  - Images: The MIMIC training set consists of 5313 subfolders, each representing a patient. Every patient has one or more studies, which are stored as subfolders. Images are stored in study folders as `.jpg` files with 512x512 pixels.
  - Text: Reports are stored in patient folders with `.txt` extensions. The filename corresponds to the study id, and the content of the report applies to all images in the corresponding study.
- `data/mimic-test`: The MIMIC test set is organized in the same way as the MIMIC training set. Note that this is a held-out test set with 500 images that we will use for scoring models, so reports are not provided!
- `data/mimic_train_student.csv`: This spreadsheet provides mappings between image paths, report paths, patient ids, study ids, and image ids for samples in the training set.
- `data/mimic_test_student.csv`: This spreadsheet provides mappings between image paths, patient ids, study ids, and image ids for samples in the test set.
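
Because each report sits in the patient folder and is named after its study id, the report for any training image can be derived from the image path alone. This is a small sketch under that assumption; `report_path_for` is a hypothetical helper, and the CSV mappings remain the authoritative source.

```python
from pathlib import PurePosixPath


def report_path_for(image_path: str) -> str:
    """Return the report path for an image: its study folder plus '.txt'."""
    study_dir = PurePosixPath(image_path).parent
    # Reports live next to the study folder and share the study id as file name
    return str(study_dir.with_name(study_dir.name + ".txt"))


print(report_path_for("mimic-train/13360/62326/88457.jpg"))
# prints: mimic-train/13360/62326.txt
```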

In [1]:
# Example Image
from PIL import Image
img = Image.open("/content/mimic-test/10010/50391/80339.jpg")
img  # a bare expression renders inline in a notebook; img.show() opens an external viewer

In [2]:
# Example Text Report
with open("/content/mimic-train/13360/57560.txt", "r") as f:
  txt = f.read()  # read the report as one string instead of the repr of a list of lines
txt

In [8]:
# Load csv file with mappings
import pandas as pd
data = pd.read_csv('/content/mimic_train_student.csv')

image_report_file = data["report_path"]
image_report_file

### Extracting Tube Distance Labels

You're now ready to begin this task! Keep in mind that not every chest X-ray provided in the training set contains endotracheal tube distance information, and there may be several edge cases to consider.
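
Some of these edge cases (reports with no tube, measurements in millimeters, decimal values) can be illustrated with a regex-only baseline. This is a rough sketch and not the approach taken in the cells that follow, which use spaCy; the function name `extract_tube_distance_cm`, the pattern, and the sample sentences are all assumptions made up for illustration.

```python
import re
from typing import Optional

# A number followed by a unit, e.g. "4.5 cm", "35mm", "3 centimeters"
MEASUREMENT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(cm|mm|centimeters?|millimeters?)\b", re.IGNORECASE
)


def extract_tube_distance_cm(report: str) -> Optional[float]:
    """Return the first measurement in a sentence mentioning the carina, in cm."""
    # Naive sentence split: break after a period that is followed by whitespace
    for sentence in re.split(r"(?<=\.)\s+", report):
        if "carina" not in sentence.lower():
            continue
        match = MEASUREMENT.search(sentence)
        if match is None:
            continue
        value = float(match.group(1))
        unit = match.group(2).lower()
        # Convert millimeters to centimeters so every label shares one unit
        return value / 10 if unit.startswith(("mm", "millimeter")) else value
    return None  # no carina sentence with a measurement was found


# Made-up sentences, not taken from the dataset
print(extract_tube_distance_cm("Endotracheal tube terminates 4.5 cm above the carina."))
print(extract_tube_distance_cm("Heart size is normal."))
```

A baseline like this is easy to audit by hand, but it will miss phrasing it was not written for, so spot-checking its output against a sample of reports is advisable.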

In [None]:
import os

root_directory = "/content/mimic-train"
report_ds = []   # text of every report found under the training folder
report_img = []  # path of every image found under the training folder

# Walk every patient and study folder, collecting reports and image paths
for root, dirs, files in os.walk(root_directory):
  for file in files:
    if file.endswith(".txt"):
      filepath = os.path.join(root, file)
      with open(filepath, "r") as f:
        report_ds.append(f.read())
    elif file.endswith(".jpg"):
      report_img.append(os.path.join(root, file))
len(report_ds)

10862

In [None]:
report_df = pd.read_csv("/content/mimic_train_student.csv")
report_df = report_df[["image_path", "report_path"]]
report_df.iloc[[4]]

Unnamed: 0,image_path,report_path
4,mimic-train/13360/62326/88457.jpg,mimic-train/13360/62326.txt


In [3]:
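# Replace each report path with the text of the report it points to.
# The column keeps the name "report_path" but holds report text from here on.
report_texts = []
for path in report_df["report_path"]:
  with open(path, "r") as f:
    report_texts.append(f.read())
# assign() builds a new DataFrame, avoiding chained-assignment warnings
report_df = report_df.assign(report_path=report_texts)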

In [4]:
with open(, "r") as f:
  txt = str(f.readlines())
txt
report_df.iloc[[3]]

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
nlp_sm = spacy.load("en_core_web_sm")

In [5]:
doc = nlp(report_df["report_path"][3])
displacy.render(doc, style="ent", jupyter=True)

In [None]:
from tqdm import tqdm

# Run the spaCy pipeline over every report, with a progress bar
just_text = report_df["report_path"].tolist()
docs = list(tqdm(nlp.pipe(just_text), total=len(just_text)))

In [None]:
import re

KEYWORDS = ("carina", "cm", "tube")
# Filler words, units, and quote/newline debris to strip before parsing the number
FILLER = re.compile(r"approximately|about|above|below|centimeters|cm|some|at ?least|\\n|'|\s")

distances = []
for d in docs:
    closest_quantity = None
    min_distance = float("inf")
    for ent in d.ents:
        if ent.label_ != "QUANTITY":
            continue
        sent = ent.sent
        # Keyword offsets within the sentence; str.find returns -1 when a keyword
        # is absent, so those are dropped rather than passed through abs()
        positions = [p for p in (sent.text.find(k) for k in KEYWORDS) if p >= 0]
        if not positions:
            continue
        # Distance in characters from this quantity to the nearest keyword
        ent_offset = ent.start_char - sent.start_char
        distance = min(abs(p - ent_offset) for p in positions)
        if distance < min_distance:
            closest_quantity = ent.text
            min_distance = distance
    if closest_quantity is not None:
        if "mm" in closest_quantity:
            # Millimeter measurements are discarded rather than converted
            closest_quantity = None
        else:
            try:
                # Ranges such as "3 to 4 cm" fail to parse and are left unlabeled
                closest_quantity = float(FILLER.sub("", closest_quantity))
            except ValueError:
                closest_quantity = None
    distances.append(closest_quantity)

In [6]:
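# Attach the extracted distances (one per row, in the same order as docs),
# then keep only the rows where a distance was found
report_df = report_df.assign(distances=distances)
report_df = report_df[report_df["distances"].notna()]
report_df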

In [7]:
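# Coerce any non-numeric leftovers to NaN and drop them.
# assign() returns a new DataFrame, avoiding SettingWithCopyWarning on the filtered frame.
report_df = report_df.assign(distances=pd.to_numeric(report_df["distances"], errors="coerce"))
report_df = report_df.dropna()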

report_df

In [None]:
from google.colab import files
report_df.to_csv('mapped_data.csv')
files.download('mapped_data.csv')
