### Summary of the problem statement
Pneumonia is among the top 10 causes of death in the US, and it accounts for 15% of all deaths in children under the age of 5 worldwide. Accurately diagnosing pneumonia is an elaborate process: it requires review of a chest radiograph by trained specialists, along with other detailed examinations. Because specialists are burdened with a high volume of chest X-ray reviews, using AI to screen radiographs for opacities that may indicate pneumonia, so that reviews can be prioritized and expedited, is seen as a possible solution.

### The Dataset - BIG DATA
The dataset contains images and their details in DICOM® format. DICOM® (Digital Imaging and Communications in Medicine) is the international standard to transmit, store, retrieve, print, process, and display medical imaging information. A DICOM file stores the image together with metadata, so each file carries information about itself.

The full dataset has 26684 training and 3000 test X-ray images. The images are annotated with bounding boxes that highlight the regions of the X-ray indicative of possible pneumonia. For the purpose of this training, we have reduced it to 100 images, provided with their respective data in the file Patient_images.zip. Unzip this file into a directory named "stage_2_train_images". *To make it more challenging, find the package Python uses to unzip files and unzip the file in your code.*
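
For the optional challenge, one possible approach is Python's standard-library `zipfile` module. The sketch below is a minimal example, not necessarily the intended solution. The helper name `extract_archive` is made up for illustration, and the code assumes Patient_images.zip sits in the current working directory with the .dcm files at the top level of the archive.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return the archived names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Assumed location of the archive: the current working directory
if Path("Patient_images.zip").exists():
    names = extract_archive("Patient_images.zip", "stage_2_train_images")
    print(f"Extracted {len(names)} files")
```

If the archive already contains a top-level folder, the images would land one level deeper, and the path used later in the notebook would need to be adjusted.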

### The normal structured data
All the patient outcomes and the infected areas are stored in the CSV file Patient_details.csv. It also contains 100 records, which pertain to the patients whose images we have in the zip file.


In [None]:
# Installing & Importing Required Packages

# pydicom is a special package for reading DICOM files.
# If it is not installed yet, uncomment the next line to install it:
# %pip install pydicom

import pandas as pd
import matplotlib.pyplot as plt
import pydicom
import numpy as np

In [None]:
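# Importing class info and label dataset
# Note: read_csv parses the file as CSV regardless of its extension;
# make sure the file name matches the file in your working directory
patient_df = pd.read_csv("Patient_Details_opt.xls")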
patient_df.head()

In [None]:
# Store the patient IDs as a list. They are the file names of the DICOM images, so the list helps us read the images.

patientID = patient_df.patientId.unique().tolist()

In [None]:

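import os

# os.path.join picks the right separator for the operating system ('\\' on Windows, '/' elsewhere);
# the trailing '' keeps a separator at the end of the path
path = os.path.join('.', 'stage_2_train_images', '')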

In [None]:
# Read all the DICOM files into a list
dcm_data = []
for each_patient in patientID:
    dcm_file = path + '%s.dcm' % each_patient
    # dcmread replaces read_file, which is deprecated in pydicom
    dcm_data.append(pydicom.dcmread(dcm_file))

In [None]:
# We will examine one random DICOM image. The image is stored as an array attribute.
# Use len(dcm_data) rather than a hard-coded 100 so the index is always valid
random_dicom = dcm_data[np.random.randint(0, len(dcm_data))]
print(type(random_dicom))

As we can see, each DICOM file is read as a **FileDataset**. Every kind of Big Data has its own tools for handling it. For DICOM images, it is the **FileDataset**, which contains the image and its metadata. For the purpose of this exercise we will only retrieve the image, which is exposed as an ndarray attribute named **pixel_array**. You can challenge yourself to explore what other attributes the FileDataset has.
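
As a starting point for that challenge, the sketch below collects a few metadata fields from a dataset. The helper `summarize_metadata` is hypothetical, and the keywords listed are standard DICOM keywords that we only assume are present in these files; any that are missing are reported as None.

```python
# Standard DICOM keywords; whether each one is present in these files is an assumption
DEFAULT_KEYWORDS = (
    "PatientID",
    "PatientSex",
    "PatientAge",
    "Modality",
    "ViewPosition",
    "Rows",
    "Columns",
)


def summarize_metadata(dataset, keywords=DEFAULT_KEYWORDS):
    """Return a dict mapping each keyword to its value, or None if the dataset lacks it."""
    # pydicom exposes data elements as attributes named after their DICOM keyword,
    # so getattr with a default copes with elements that are absent
    return {keyword: getattr(dataset, keyword, None) for keyword in keywords}
```

In this notebook, `summarize_metadata(random_dicom)` would return the values for the randomly chosen image, and `random_dicom.dir()` lists every keyword the file actually contains.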

In [None]:
# X-rays are single-channel images, so display them with a grayscale colormap
plt.imshow(random_dicom.pixel_array, cmap='gray')

We can run this through a loop and process all the images, just as we did for the random one.
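
One way to sketch such a loop is shown below. The helper `stack_pixel_arrays` is a hypothetical name, and it assumes every image has the same dimensions; `np.stack` raises a ValueError otherwise.

```python
import numpy as np


def stack_pixel_arrays(datasets):
    """Collect pixel_array from each dataset into one array of shape (n_images, rows, cols)."""
    images = []
    for dataset in datasets:
        # Same attribute we used for the single random image
        images.append(dataset.pixel_array)
    return np.stack(images)
```

Calling `stack_pixel_arrays(dcm_data)` in this notebook would give a single array whose first axis follows the order of patientID, which is convenient for later batch processing.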