# Problem description

Five times more deadly than the flu, COVID-19 causes significant morbidity and mortality. Like other pneumonias, pulmonary infection with COVID-19 results in inflammation and fluid in the lungs. COVID-19 looks very similar to other viral and bacterial pneumonias on chest radiographs, which makes it difficult to diagnose. This computer vision model for detection and localization of COVID-19 would help doctors provide a quick and confident diagnosis. As a result, patients could get the right treatment before the most severe effects of the virus take hold.


Currently, COVID-19 can be diagnosed either by polymerase chain reaction (PCR), which detects genetic material from the virus, or by chest radiograph. However, molecular test results can take a few hours and sometimes days to come back. By contrast, chest radiographs can be obtained in minutes. While guidelines exist to help radiologists differentiate COVID-19 from other types of infection, their assessments vary. In addition, non-radiologists could be supported with better localization of the disease, such as a visual bounding box.


In this competition, the task is to identify and localize COVID-19 abnormalities on chest radiographs. In particular, each radiograph is categorized as negative for pneumonia, or as typical, indeterminate, or atypical for COVID-19.

The dataset contains the following files:

* train_study_level.csv - the train study-level metadata, with one row per study, including the correct labels.
* train_image_level.csv - the train image-level metadata, with one row per image, including the correct labels and any bounding boxes in dictionary format. Some images in both the test and train sets have multiple bounding boxes.
* sample_submission.csv - a sample submission file containing all image- and study-level IDs.
* train folder - 6334 chest scans in DICOM format, stored in paths of the form study/series/image.
* test folder - the hidden test dataset, of roughly the same scale as the training dataset. Studies in the test set may contain more than one label.
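
The bounding boxes are stored as a string that has to be parsed before use. Below is a minimal sketch of that step, assuming each box is a dictionary with the keys `x`, `y`, `width` and `height` (the keys used later in this notebook) and that images without boxes carry a missing value. The helper name `parse_boxes` and the example string are made up for illustration.

```python
import ast
import math

def parse_boxes(raw):
    """Return a list of box dicts; an empty list when no boxes are recorded."""
    # Assumption: missing annotations arrive as NaN (a float) or None.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval.
    return ast.literal_eval(raw)

# Made-up example value, not taken from the dataset.
example = ("[{'x': 10.0, 'y': 20.0, 'width': 100.0, 'height': 50.0}, "
           "{'x': 300.0, 'y': 40.0, 'width': 80.0, 'height': 120.0}]")

boxes = parse_boxes(example)
print(len(boxes), boxes[0]['width'] * boxes[0]['height'])
```

Returning an empty list for missing values lets downstream loops run without a separate NaN check.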

# Content table

1. Importing the libraries
2. Importing the datasets
3. Data exploration
4. Read Dicom files
5. Feature engineering
6. Credits

# Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sn
import pydicom as dicom # DICOM (Digital Imaging and Communications in Medicine) - standard for storing and transferring medical images
import os
from tqdm import tqdm # wraps any iterable to display a progress bar
import glob # retrieve files/pathnames matching a specified pattern
import pprint # pretty-print arbitrary Python data structures
import ast # safely evaluate strings containing Python literals (used to parse the boxes column)

# Importing the datasets

In [None]:
path = '/kaggle/input/siim-covid19-detection/'
train_image_level = pd.read_csv(path + "train_image_level.csv")
train_study_level = pd.read_csv(path + "train_study_level.csv")

# Data exploration

Let's have a look inside the train_image_level dataframe.

In [None]:
train_image_level.head()

In [None]:
train_image_level.describe()

There are 6334 rows in the train_image_level dataframe, one per unique image id.

In [None]:
train_study_level.head()

In [None]:
train_study_level.describe()

There are 6054 rows in the train_study_level dataframe. The number of studies is lower than the number of images, so some studies must contain more than one image. Let's merge the two dataframes and check how many studies have more than one image linked.

In [None]:
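# Study ids end with the suffix '_study'; strip it to obtain the StudyInstanceUID
train_study_level_key = train_study_level.id.str[:-6]
# Right join: keep one row per image and attach the labels of its study
training_set = pd.merge(left = train_study_level, right = train_image_level, how = 'right', left_on = train_study_level_key, right_on = 'StudyInstanceUID')
# Display the merged frame without the redundant study id column (training_set itself is left unchanged)
training_set.drop(['id_x'], axis = 1)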

Let's have a look at these studies with multiple images:

In [None]:
training_set[training_set.groupby('StudyInstanceUID')['id_y'].transform('size') > 1].sort_values('StudyInstanceUID')
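
To answer the "how many" question directly, the images can be counted per study. This is a small sketch on a made-up frame that only borrows the column names `StudyInstanceUID` and `id_y` from the merged dataframe; the identifiers are invented.

```python
import pandas as pd

# Made-up identifiers, for illustration only.
toy = pd.DataFrame({
    'StudyInstanceUID': ['s1', 's1', 's2', 's3', 's3', 's3'],
    'id_y': ['a_image', 'b_image', 'c_image', 'd_image', 'e_image', 'f_image'],
})

# Number of images linked to each study
images_per_study = toy.groupby('StudyInstanceUID')['id_y'].size()

# Keep only the studies with more than one image
multi_image_studies = images_per_study[images_per_study > 1]
print(multi_image_studies)
print('studies with several images:', len(multi_image_studies))
```

On the real data the same two lines would be applied to `training_set` instead of `toy`.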

# Read Dicom files

Function that locates an image file from its study and image ids and reads its pixel data:

In [None]:
def extract_image(i):
    """Read the pixel array of the image in row i of training_set."""
    # Images are stored as train/<study>/<series>/<image>.dcm
    path_train = os.path.join(path, 'train', training_set.loc[i, 'StudyInstanceUID'])
    img_id = training_set.loc[i, 'id_y'].replace('_image', '.dcm')
    print(img_id)
    # Search every series folder of the study, since a study can hold more than one image
    img_path = glob.glob(os.path.join(path_train, '*', img_id))[0]
    data_file = dicom.dcmread(img_path)
    img = data_file.pixel_array
    return img

Images and rectangles visualization

In [None]:
fig, axes = plt.subplots(3,3, figsize=(20,16))
fig.subplots_adjust(hspace=.1, wspace=.1)
axes = axes.ravel()

for row in range(9):
    img = extract_image(row)
    if pd.notna(training_set.loc[row,'boxes']):  # skip rows whose boxes value is NaN
        boxes = ast.literal_eval(training_set.loc[row,'boxes'])
        for box in boxes:
            p = matplotlib.patches.Rectangle((box['x'], box['y']),
                                              box['width'], box['height'],
                                              ec='r', fc='none', lw=2.
                                            )
            axes[row].add_patch(p)
    axes[row].imshow(img, cmap='gray')
    axes[row].set_title(training_set.loc[row, 'label'].split(' ')[0])
    axes[row].set_xticklabels([])
    axes[row].set_yticklabels([])

# Feature engineering

Count the number of opacities in the image

In [None]:
# Each annotated box contributes one 'opacity' token to the label string
training_set['Opacity_Count'] = training_set['label'].str.count('opacity')

Sum of the areas of the rectangles. Assumption: the bigger the rectangle, the bigger the opacity.

In [None]:
image_rectangles_areas = []

for row in range(len(training_set.index)):
    image_rectangles_area_sum = 0
    boxes_raw = training_set.loc[row, 'boxes']
    # Rows without boxes (NaN) keep a total area of 0
    if pd.notna(boxes_raw):
        for box in ast.literal_eval(boxes_raw):
            image_rectangles_area_sum += box['width'] * box['height']
    image_rectangles_areas.append(image_rectangles_area_sum)

In [None]:
training_set['Rectangle_Area'] = image_rectangles_areas

Creating buckets - rectangle areas

First see the distribution of the rectangle areas

In [None]:
training_set['Rectangle_Area'] = round(training_set['Rectangle_Area'],2)

In [None]:
training_set['Rectangle_Area']

In [None]:
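# Alternatives that were considered:
#pd.qcut(training_set['Rectangle_Area'], q = 4)
#training_set.boxplot(by = "Negative for Pneumonia",column = ['Rectangle_Area'],grid = True, layout=(1, 1))

# Five bins: exactly 0 (no boxes), then up to 1e6, 2e6, 4e6 and 8e6 square pixels.
# The lower edge of -1 is needed because pd.cut intervals are open on the left.
cut_labels = ['0', '<1e6', '<2e6', '<4e6', '<8e6']
cut_bins = [-1, 0, 1000000, 2000000, 4000000, 8000000]
training_set['Rectangle_Area_Bin'] = pd.cut(training_set['Rectangle_Area'], bins=cut_bins, labels=cut_labels)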
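
To see how these bin edges behave, here is a short sketch on made-up areas. It reuses the bins and labels from the cell above; the sample values are invented to hit the edge cases.

```python
import pandas as pd

cut_labels = ['0', '<1e6', '<2e6', '<4e6', '<8e6']
cut_bins = [-1, 0, 1000000, 2000000, 4000000, 8000000]

# Made-up areas: zero, inside a bin, exactly on an edge, and above the last edge
areas = pd.Series([0, 250000, 1000000, 3500000, 9000000])
binned = pd.cut(areas, bins=cut_bins, labels=cut_labels)
print(binned.tolist())
```

Two details are worth noting. Intervals are closed on the right by default, so an area of exactly 1000000 still falls into the bin labelled `<1e6`. Any area above 8000000 falls outside all bins and becomes NaN, so it would be silently left out of the plots that follow.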

In [None]:
columns = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']

plt.figure(figsize = (16, 14))
sn.set(font_scale = 1.2)
sn.set_style('ticks')

# One count plot per study-level label, arranged on a 2x2 grid
for i, column in enumerate(columns):
    plt.subplot(2, 2, i + 1)
    sn.countplot(data = training_set, x = 'Rectangle_Area_Bin', hue = column, palette = ['#d02f52',"#55a0ee"])

sn.despine()

In [None]:
columns = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']
area_bins = sorted(list(training_set['Rectangle_Area_Bin'].value_counts().index))

for i in area_bins:
    in_bin = training_set['Rectangle_Area_Bin'] == i
    # Share of the images in this bin that carry each study-level label
    Count_Series = training_set.loc[in_bin, columns].sum()
    fig = plt.figure(figsize=(12,3))
    sn.barplot(x = Count_Series.index, y = Count_Series.values/in_bin.sum())
    plt.title('Rectangle_Area_Bin : {} '.format(i))
    plt.show()

Label proportions by opacity count

In [None]:
columns = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']
opacity = sorted(list(training_set['Opacity_Count'].value_counts().index))

for i in opacity:
    has_count = training_set['Opacity_Count'] == i
    # Share of the images with this opacity count that carry each study-level label
    Count_Series = training_set.loc[has_count, columns].sum()
    fig = plt.figure(figsize=(12,3))
    sn.barplot(x = Count_Series.index, y = Count_Series.values/has_count.sum())
    plt.title('OpacityCount : {} '.format(i))
    plt.show()
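
The proportions plotted above can also be computed in one step. The sketch below assumes the four label columns are 0/1 indicators, so that the sum divided by the group size equals the group mean. The rows are made up for illustration.

```python
import pandas as pd

columns = ['Negative for Pneumonia', 'Typical Appearance',
           'Indeterminate Appearance', 'Atypical Appearance']

# Made-up rows: each image carries exactly one label set to 1.
toy = pd.DataFrame({
    'Opacity_Count':            [0, 0, 1, 2, 2, 2],
    'Negative for Pneumonia':   [1, 1, 0, 0, 0, 0],
    'Typical Appearance':       [0, 0, 1, 1, 1, 0],
    'Indeterminate Appearance': [0, 0, 0, 0, 0, 1],
    'Atypical Appearance':      [0, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 column within a group = share of rows with that label
proportions = toy.groupby('Opacity_Count')[columns].mean()
print(proportions)
```

The resulting table has one row per opacity count and one column per label, which makes the groups easier to compare side by side than separate bar charts.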

Position of the rectangle by quadrants (4 bins - 4 quadrants)
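
One possible way to build this feature is to assign each box to the quadrant that contains its center. The sketch below is only an illustration of that idea: the function name `box_quadrant` is hypothetical, and the image width and height are assumed to be known (for example from the pixel array shape). Coordinates follow the image convention used in the plots above, with the origin in the top-left corner.

```python
def box_quadrant(box, image_width, image_height):
    """Return the image quadrant that contains the center of the box."""
    # Center of the rectangle
    cx = box['x'] + box['width'] / 2
    cy = box['y'] + box['height'] / 2
    # y grows downward, so small y values are in the upper half
    horizontal = 'left' if cx < image_width / 2 else 'right'
    vertical = 'upper' if cy < image_height / 2 else 'lower'
    return '{}-{}'.format(vertical, horizontal)

# Made-up boxes on a made-up 1000 x 1000 image
print(box_quadrant({'x': 0, 'y': 0, 'width': 100, 'height': 100}, 1000, 1000))
print(box_quadrant({'x': 600, 'y': 700, 'width': 200, 'height': 200}, 1000, 1000))
```

Here "left" and "right" refer to the sides of the image, not to the sides of the patient.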

Image metadata

In [None]:
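train_directory = os.path.join(path, 'train')
training_paths = []

# Store the path of the first DICOM file found under each study (study/series/image)
for sid in tqdm(training_set['StudyInstanceUID']):
    training_paths.append(glob.glob(os.path.join(train_directory, sid +"/*/*"))[0])

training_set['path'] = training_paths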

In [None]:
from pydicom.pixel_data_handlers.util import apply_voi_lut # applies the VOI LUT stored in the DICOM file

voi_lut=True
fix_monochrome=True

def _sanitise_unicode(s):
    return s.replace(u"\u0000", "").strip()

def _convert_value(v):
    """Convert a DICOM element value into a plain Python value."""
    t = type(v)
    if t in (list, int, float):
        cv = v
    elif t == str:
        cv = _sanitise_unicode(v)
    elif t == bytes:
        s = v.decode('ascii', 'replace')
        cv = _sanitise_unicode(s)
    elif t == dicom.valuerep.DSfloat:
        cv = float(v)
    elif t == dicom.valuerep.IS:
        cv = int(v)
    else:
        cv = repr(v)
    return cv

def _dataset_to_dict(dataset):
    """Convert a pydicom Dataset into a plain dict, skipping the pixel data."""
    dicom_dict = {}
    for dicom_value in dataset:
        if dicom_value.tag == (0x7fe0, 0x0010):
            #discard pixel data
            continue
        if type(dicom_value.value) == dicom.dataset.Dataset:
            # nested dataset: convert it recursively
            dicom_dict[dicom_value.name] = _dataset_to_dict(dicom_value.value)
        else:
            dicom_dict[dicom_value.name] = _convert_value(dicom_value.value)
    return dicom_dict

def dicom_dataset_to_dict(filename,func):
    """Credit: https://github.com/pydicom/pydicom/issues/319
               https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
    """
    dicom_header = dicom.dcmread(filename)

    #====== DICOM FILE DATA ======
    dicom_dict = _dataset_to_dict(dicom_header)
    # pop avoids a KeyError when the tag is absent
    dicom_dict.pop('Pixel Representation', None)

    if func!='metadata_df':
        #====== DICOM IMAGE DATA ======
        # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
        if voi_lut:
            data = apply_voi_lut(dicom_header.pixel_array, dicom_header)
        else:
            data = dicom_header.pixel_array
        # depending on this value, X-ray may look inverted - fix that:
        if fix_monochrome and dicom_header.PhotometricInterpretation == "MONOCHROME1":
            data = np.amax(data) - data
        # rescale to the 0-255 range
        data = data - np.min(data)
        data = data / np.max(data)
        modified_image_data = (data * 255).astype(np.uint8)

        return dicom_dict, modified_image_data

    else:
        return dicom_dict

# Print the metadata of the first five training images
for filename in training_set['path'][0:5]:
    metadata, img_array = dicom_dataset_to_dict(filename, 'fetch_both_values')

    pprint.pprint(metadata)
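
The image conversion step can be checked in isolation on a tiny array. The sketch below reproduces the inversion and the rescaling to 0-255 without needing a DICOM file. The helper name `to_uint8` and the sample pixel values are made up, and the guard against a constant image is an addition that the function above does not have.

```python
import numpy as np

def to_uint8(data, invert=False):
    """Rescale pixel values to 0-255, optionally inverting them first."""
    data = data.astype(np.float64)
    if invert:
        # MONOCHROME1 stores bright tissue as low values, so flip the scale
        data = np.amax(data) - data
    data = data - np.min(data)
    peak = np.max(data)
    if peak > 0:
        # Added guard: a constant image would otherwise divide by zero
        data = data / peak
    return (data * 255).astype(np.uint8)

# Made-up 12-bit pixel values
raw = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
normal = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
print(normal)
print(inverted)
```

After inversion the darkest input pixel becomes the brightest output pixel, which is what makes MONOCHROME1 and MONOCHROME2 images look alike once converted.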


Outliers and irregularities in the data

# Credits

* https://github.com/pydicom/pydicom/issues/319
* https://www.kaggle.com/songseungwon/siim-covid-19-detection-10-step-tutorial-1