# About Pulmonary Fibrosis:

**Pulmonary fibrosis** is a condition that causes lung scarring and stiffness. This makes it difficult to breathe. It can prevent your body from getting enough oxygen and may eventually lead to respiratory failure, heart failure, or other complications.

Researchers currently believe that a combination of exposure to lung irritants (such as certain chemicals, smoking, and infections), genetics, and immune system activity plays a key role in pulmonary fibrosis.

![Pulmonary Fibrosis](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.pulmonologyadvisor.com%2Fwp-content%2Fuploads%2Fsites%2F21%2F2019%2F03%2Ffibrosis.tuberculosis_SH_414679936.jpg&f=1&nofb=1)

# Basic EDA and DICOM Visualization

I know you're excited for the new competition! Let's start with some EDA and the most important part, visualizing DICOM images!

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import tqdm
import re
import cv2

In [None]:
train_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test_df = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

Here is our train.csv:

In [None]:
train_df.head()

It seems that there is no missing data, which is great! The non-null counts from `info()` show this:

In [None]:
train_df.info()
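
If you'd rather see the missing values as explicit numbers than read them off `info()`, a small helper like the one below could do it. This is only a sketch: `missing_value_report` is a hypothetical helper name, and the toy frame stands in for `train_df` with made-up values.

```python
import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages."""
    missing = df.isnull().sum()
    report = pd.DataFrame({
        'missing': missing,
        'percent': (missing / len(df) * 100).round(2),
    })
    # Columns with the most missing values come first
    return report.sort_values('missing', ascending=False)

# Toy frame standing in for train_df (hypothetical values, one gap in FVC)
toy_df = pd.DataFrame({
    'Patient': ['A', 'A', 'B'],
    'FVC': [2315, None, 3020],
    'Age': [79, 79, 69],
})
report = missing_value_report(toy_df)
print(report)
```

On the real data you would call `missing_value_report(train_df)`; if every count is zero, there really is nothing missing.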

In [None]:
test_df.head()

Only 5 rows. Ok.

In [None]:
test_df.info()

# DICOM Visualization:

Let us now visualize the DICOM (`.dcm`) images in the dataset:

# What's DICOM?

**Digital Imaging and Communications in Medicine (DICOM)** is the standard for the communication and management of medical imaging information and related data. 

DICOM is most commonly used for storing and transmitting medical images enabling the integration of medical imaging devices such as scanners, servers, workstations, printers, network hardware, etc.

We will be visualizing images using the `pydicom` package. 

Let's define a function to visualize DICOM images:

In [None]:
import pydicom
import glob

In [None]:
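def visualize_dicom(images, limit = 16):
    # The grid is 4 x 4, so at most 16 images can be shown
    images = images[:min(limit, 16)]
    
    fig, ax = plt.subplots(4, 4, figsize = (20, 20))
    ax = ax.flatten()
    
    for index, file in enumerate(images):
        # dcmread is the current name for read_file (deprecated in newer pydicom releases)
        image_data = pydicom.dcmread(file).pixel_array
        ax[index].imshow(image_data, cmap = plt.cm.bone)
        
        # Title each panel as <patient ID>-<file name>
        name = '-'.join(file.split('/')[-2:])
        ax[index].set_title(name)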

In [None]:
TRAIN_PATH = '/kaggle/input/osic-pulmonary-fibrosis-progression/train'
image_files = glob.glob(os.path.join(TRAIN_PATH, '*', '*.dcm'))

visualize_dicom(image_files)

In [None]:
# !pip3 install med2image

# Converting DICOM (.dcm) to PNG:

Let's try to convert DICOM (`.dcm`) files to PNG for future use.



In this kernel, I'll convert the sliced `.dcm` images of one particular patient (identified by a unique ID) to PNG. Feel free to change the code to convert the images of all patients:

In [None]:
TRAIN_PATH_PATIENT = '/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00123637202217151272140'

images_patient = glob.glob(os.path.join(TRAIN_PATH_PATIENT, '*.dcm'))

# Sort numerically by the digits in the file name only (the raw string avoids an invalid-escape warning)
images_patient.sort(key=lambda f: int(re.sub(r'\D', '', os.path.basename(f))))

for index, file in enumerate(images_patient[:10]):
    print(file)
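
Why sort with a numeric key at all? A plain string sort puts `10.dcm` before `2.dcm`, which would scramble the slice order. Here is a minimal sketch of the idea, assuming the slices are named like `1.dcm`, `2.dcm`, and so on; `slice_number` and the example paths are hypothetical:

```python
import os
import re

def slice_number(path: str) -> int:
    """Extract the slice number from a file name such as '12.dcm'."""
    name = os.path.basename(path)
    match = re.search(r'\d+', name)
    if match is None:
        raise ValueError(f'No slice number found in {name!r}')
    return int(match.group())

# Hypothetical paths, deliberately out of order
paths = ['train/ID001/10.dcm', 'train/ID001/2.dcm', 'train/ID001/1.dcm']

print(sorted(paths))                    # string order: 1, 10, 2
print(sorted(paths, key=slice_number))  # numeric order: 1, 2, 10
```

Looking only at the file name (not the whole path) keeps the digits of the patient ID out of the sort key.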

Let's manually make a directory for saving the converted images:

In [None]:
# ! mkdir "patient-png-1"

In [None]:
!pip3 install mritopng

`mritopng` reads the folder and converts the `.dcm` files to PNG:

In [None]:
import mritopng
mritopng.convert_folder('/kaggle/input/osic-pulmonary-fibrosis-progression/train/ID00123637202217151272140', './patient-png-1/')
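
In case you're curious what a DICOM-to-PNG conversion has to do with the pixel values: raw scan intensities usually don't fit into the 0-255 range of an 8-bit image, so they need to be rescaled first. The sketch below shows a simple linear min-max rescaling with NumPy. To be clear, this is not `mritopng`'s actual implementation, just an illustration of the general idea; `to_uint8` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Linearly rescale an array of raw intensities to the 0-255 range."""
    pixels = pixels.astype(np.float64)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # Constant image: avoid dividing by zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# Made-up 2 x 2 "scan" with 16-bit signed intensities
raw = np.array([[-1000, 0], [500, 1000]], dtype=np.int16)
png_ready = to_uint8(raw)
print(png_ready)
```

With a real file you could pass the `pixel_array` of a loaded DICOM image to such a function before saving it with an image library.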

Let's see the converted images:

In [None]:
images_patient_png = glob.glob('/kaggle/working/patient-png-1/*.png')
# Sort numerically by the digits in the file name only
images_patient_png.sort(key=lambda f: int(re.sub(r'\D', '', os.path.basename(f))))

for index, file in enumerate(images_patient_png[:10]):
    print(file)

Here is a small piece of code for generating a video using the converted images:

In [None]:
img_array = []
for frame in images_patient_png:
    img = cv2.imread(frame)
    height, width, layers = img.shape
    # VideoWriter expects the frame size as (width, height)
    size = (width, height)
    img_array.append(img)

# Write the frames to an MP4 file at 15 frames per second
out = cv2.VideoWriter('project.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 15, size)

for img in img_array:
    out.write(img)
out.release()

# Video of the Scan!

> **Update:** I uploaded the video to Imgur. The raw video should be available in the output files section (above the comments).

Here is a video of the scan. It looks really interesting:

<blockquote class="imgur-embed-pub" lang="en" data-id="a/0pCtdHN"  ><a href="//imgur.com/a/0pCtdHN">OSIC Pulmonary Fibrosis MRI Scan</a></blockquote><script async src="//s.imgur.com/min/embed.js" charset="utf-8"></script>

# Metadata in DICOM images:

DICOM files generally contain metadata such as the patient's name and ID, along with image attributes such as photometric interpretation, transfer syntax, and image width and height.

More about metadata in DICOM files can be seen at: http://dicom.nema.org/dicom/2013/output/chtml/part10/chapter_7.html

Here is the metadata of one image in the training set (index 3837 in `image_files`):

In [None]:
image_data = pydicom.dcmread(image_files[3837])
image_data

In [None]:
dir(image_data)

Let's create a **utility function for extracting metadata** from DICOM images:

In [None]:
def extract_metadata(file):
    # Only the header is needed here, so skip loading the pixel data
    image_data = pydicom.dcmread(file, stop_before_pixels=True)
    
    record = {
        'patient_ID': image_data.PatientID,
        'patient_name': image_data.PatientName,
        'patient_sex': image_data.PatientSex,
        'modality': image_data.Modality,
        'body_part_examined': image_data.BodyPartExamined,
        'photometric_interpretation': image_data.PhotometricInterpretation,
        'rows': image_data.Rows,
        'columns': image_data.Columns,
        'pixel_spacing': image_data.PixelSpacing,
        'window_center': image_data.WindowCenter,
        'window_width': image_data.WindowWidth,
        'bits_allocated': image_data.BitsAllocated
    }
    
    return record
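
One thing to keep in mind: direct attribute access raises an `AttributeError` when a file doesn't carry a given tag. I haven't checked whether that happens anywhere in this dataset, but if it does, reading the fields with `getattr` and a default is one way around it. Below is a minimal sketch of that pattern; `FIELDS` and `extract_fields` are hypothetical names, and a `SimpleNamespace` stands in for a loaded DICOM dataset so the example runs without any files.

```python
from types import SimpleNamespace

# Record key -> attribute name on the dataset (a subset, for illustration)
FIELDS = {
    'patient_ID': 'PatientID',
    'modality': 'Modality',
    'window_center': 'WindowCenter',
}

def extract_fields(dataset, fields=FIELDS, default=None):
    """Read each attribute with getattr, falling back to `default` when absent."""
    return {key: getattr(dataset, attr, default) for key, attr in fields.items()}

# Stand-in for a dataset that has no WindowCenter tag
fake_dataset = SimpleNamespace(PatientID='ID001', Modality='CT')
record = extract_fields(fake_dataset)
print(record)
```

Missing tags then show up as `None` (or `NaN` once the records are in a DataFrame) instead of stopping the whole loop.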

Let's create an empty list and append each new record (a dictionary) to it:

In [None]:
metadata_list = []

for file in tqdm.tqdm(image_files):
    metadata_list.append(extract_metadata(file))

Now we'll be converting this to a DataFrame object:

In [None]:
metadata_df = pd.DataFrame(metadata_list)
metadata_df.head()

These are the first few records for the 33026 images in the training set. We can confirm the total with `len`:

In [None]:
len(metadata_df)
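
Once the metadata is in a DataFrame, a couple of `groupby` calls can summarize it, for example the number of slices per patient or the image resolutions that occur. Here is a small sketch on a toy frame that reuses the column names from `extract_metadata`; the values are made up:

```python
import pandas as pd

# Toy stand-in for metadata_df (hypothetical values)
toy_metadata = pd.DataFrame({
    'patient_ID': ['ID001', 'ID001', 'ID001', 'ID002', 'ID002'],
    'rows': [512, 512, 512, 768, 768],
    'columns': [512, 512, 512, 768, 768],
})

# Number of slices (image files) per patient
slices_per_patient = toy_metadata.groupby('patient_ID').size()

# How often each (rows, columns) resolution occurs
resolutions = toy_metadata.groupby(['rows', 'columns']).size()

print(slices_per_patient)
print(resolutions)
```

Running the same two lines on `metadata_df` would show how unevenly the slices are spread across patients, if at all.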

Check more about `pydicom` here: https://pydicom.github.io/pydicom/0.9/viewing_images.html

# Basic EDA

Let us import `plotly` and `seaborn` for making visualizations:

In [None]:
from collections import Counter
import plotly.express as px
import seaborn as sns

## Ground Truths by Researchers:

According to scientific research, you’re more likely to be diagnosed with pulmonary fibrosis if you:

- are male
- are between the ages of 40 and 70
- have a history of smoking

Here is a distribution of the smoking status of the patients:

In [None]:
smoker_counts = dict(Counter(train_df['SmokingStatus']))
smoker_counts = {'status': list(smoker_counts.keys()), 'count': list(smoker_counts.values())}
smoker_df = pd.DataFrame(smoker_counts)

fig_smoker = px.pie(smoker_df, values = 'count', names = 'status', title = 'Smoker Status', hole = .5, color_discrete_sequence = px.colors.diverging.Portland)
fig_smoker.show()

Most patients are ex-smokers, and a significant number of patients never smoked.

79% of patients are male and 21% are female:

In [None]:
sex_counts = dict(Counter(train_df['Sex']))
sex_counts = {'sex': list(sex_counts.keys()), 'count': list(sex_counts.values())}
sex_df = pd.DataFrame(sex_counts)

fig_sex = px.pie(sex_df, values = 'count', names = 'sex', title = 'Gender Distribution', hole = .5, color_discrete_sequence = px.colors.sequential.Agsunset)
fig_sex.show()
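
A small caveat on percentages like these: counting values straight from the table counts rows, not people. If a patient appears in several rows (which I'm assuming can happen here, for example one row per measurement), patients with more rows weigh more. The sketch below compares the two ways of counting on toy data; `patient_level_share` is a hypothetical helper, and the `Patient` ID column name is an assumption about `train.csv`.

```python
import pandas as pd

def patient_level_share(df, column, id_column='Patient'):
    """Percentage of each category, counting every patient exactly once."""
    per_patient = df.drop_duplicates(subset=id_column)
    return per_patient[column].value_counts(normalize=True) * 100

# Toy data: patient A has three rows, B one row, C two rows
toy = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Sex': ['Male', 'Male', 'Male', 'Female', 'Male', 'Male'],
})

row_share = toy['Sex'].value_counts(normalize=True) * 100
patient_share = patient_level_share(toy, 'Sex')

print(row_share.round(1))
print(patient_share.round(1))
```

In the toy data the row-level share of males is higher than the patient-level share, simply because the male patients have more rows.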

Let's see the distribution of patients' ages:

In [None]:
# Uncomment for interactive histogram
# fig_age = px.histogram(train_df, x="Age")
# fig_age.update_layout(title_text='Age Distribution')
# fig_age.show()

plt.figure(figsize = (10, 7))
# histplot with a KDE overlay replaces the deprecated distplot
ax = sns.histplot(train_df['Age'], kde=True, stat='density')
ax.set_title('Histogram for Age')
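
To put a number on the "between 40 and 70" claim instead of eyeballing the histogram, something like the following could work. It is only a sketch: `share_in_age_range` is a hypothetical helper and the ages are made up.

```python
import pandas as pd

def share_in_age_range(ages: pd.Series, low: int = 40, high: int = 70) -> float:
    """Percentage of values within [low, high], bounds included."""
    return float(ages.between(low, high).mean() * 100)

# Made-up ages standing in for train_df['Age']
toy_ages = pd.Series([35, 45, 58, 66, 70, 79, 81, 62])
share = share_in_age_range(toy_ages)
print(f'{share:.1f}% of the values fall between 40 and 70')
```

On the real data you would pass `train_df['Age']` (ideally with one row per patient).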

Let's see the FVC distribution:

**Forced vital capacity (FVC)** is the amount of air that can be forcibly exhaled from your lungs after taking the deepest breath possible, as measured by spirometry. This test may help distinguish obstructive lung diseases, such as asthma and COPD, from restrictive lung diseases, such as pulmonary fibrosis and sarcoidosis. 

In [None]:
# Uncomment for interactive histogram
# fig_fvc = px.histogram(train_df, x="FVC")
# fig_fvc.update_layout(title_text='FVC Distribution')
# fig_fvc.show()

plt.figure(figsize = (10, 7))
# histplot with a KDE overlay replaces the deprecated distplot
ax = sns.histplot(train_df['FVC'], kde=True, stat='density')
ax.set_title('Histogram for FVC')

# Conclusion:

In this kernel:

- We got to know about DICOM images

- We visualized DICOM images

- We viewed the sliced images as a video (cool CT scan)

- We saw that our visualizations agree with the findings reported by researchers:

    - Most patients were male.
    - Most patients had a history of smoking.
    - A significant portion of patients were between 40 and 70 years old.



Do upvote the kernel if you liked it!

## Give your suggestions and feel free to ask questions!

In [None]:
# Code for deleting output visualizations (reduces the chances of slow loading of the kernel)
! rm -rf './patient-png-1/'