Pneumonia is a common infection that causes inflammation and possible fluid accumulation in the air sacs of the lungs.


[In China, pneumonia is one of the leading causes of death for children under 5 years old](https://journals.lww.com/md-journal/Fulltext/2018/11160/The_drug_use_to_treat_community_acquired_pneumonia.42.aspx#:~:text=More%20than%202%20million%200,the%20age%20of%205%20years.)


Pneumonia can be caused by bacteria, viruses, and fungi.

[Pediatric pneumonia is generally diagnosed based on the time of the year and the results of a physical exam, paying attention to the child's breathing and listening to the lungs](https://www.nationwidechildrens.org/conditions/pneumonia). Further testing can include blood tests and chest X-rays.


Even with modern medicine, pneumonia can be misdiagnosed. A fast and accurate diagnosis allows doctors to treat the infection with the appropriate care.


One application of machine learning in medicine is digital diagnosis. 


We have been tasked with developing an identification model to determine whether a chest X-ray indicates the presence of pneumonia. Reducing false negatives takes priority over reducing false positives.
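
As a rough illustration of this priority, the metric that tracks false negatives is recall (also called sensitivity), while precision tracks false positives. The sketch below uses hypothetical helper functions and invented counts; they are not results from this project.

```python
def recall(true_positives, false_negatives):
    """Fraction of actual pneumonia cases that the model flags."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Fraction of flagged cases that really are pneumonia."""
    return true_positives / (true_positives + false_positives)


# Invented counts, for illustration only.
tp, fn, fp = 90, 10, 30

print(recall(tp, fn))     # 0.9
print(precision(tp, fp))  # 0.75
```

A model tuned for this task would accept a lower precision in exchange for a higher recall.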


The data is sourced from [Kaggle](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia). It is already split into three folders for training, validation and testing. All the chest radiographs were screened for quality, and the diagnostic labels were assigned by physicians. The images were collected during routine clinical care of pediatric patients between one and five years old at Guangzhou Women and Children's Medical Center in Guangzhou, China.
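
One way to check the folder layout described above is to count the files in each split and class. This is a minimal sketch: `count_images` is a hypothetical helper, and it assumes the `train`/`val`/`test` folders each contain `NORMAL` and `PNEUMONIA` subfolders.

```python
from os import listdir
from os.path import isdir, join


def count_images(root, folders=('train', 'val', 'test'),
                 labels=('NORMAL', 'PNEUMONIA')):
    """Return {(folder, label): number of entries} for each directory that exists."""
    counts = {}
    for folder in folders:
        for label in labels:
            path = join(root, folder, label)
            if isdir(path):
                counts[(folder, label)] = len(listdir(path))
    return counts
```

Calling `count_images('./chest_xray')` would return one count per split and class. Note that `listdir` counts every directory entry, including hidden files such as `.DS_Store`.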

In [42]:
from os import listdir
from os.path import isfile, join

import pandas as pd
import numpy as np
from PIL import Image

import matplotlib.pyplot as plt
%matplotlib inline

The first thing we wanted to do was shrink the images to speed up processing. We kept a backup copy of the original images, and the modeling will still work if we decide to go back to the full-size images. After running this code once, we commented out the image rewriting so we don't perform this operation every time.

In [43]:
def resize_with_ratio(image, min_size=256):
    """
    This function will take in an image and resize it, maintaining aspect ratio.
    The default minimum length for a side is 256 pixels.
    The function will output a resized image with the smaller side sized to
    min_size pixels and the other side resized proportionately.
    
    """
    # get width and height from passed image   
    width, height = image.size
    
    # the shorter side becomes min_size pixels; calculate the aspect ratio and
    # scale the longer side by it
    if width>height:
        ratio_wh = width / height
        new_width = int(ratio_wh * min_size)
        new_height = min_size
    else:
        ratio_hw = height / width
        new_width = min_size
        new_height = int(ratio_hw * min_size)

    return image.resize((new_width, new_height))
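
To make the arithmetic concrete, the sketch below repeats the size calculation on plain numbers, without opening an image. `target_size` is a hypothetical helper written for this illustration; it mirrors the branch logic of `resize_with_ratio`, including the truncation done by `int()`.

```python
def target_size(width, height, min_size=256):
    """Return the (new_width, new_height) that resize_with_ratio would use."""
    if width > height:
        # landscape: height is the shorter side
        return int(width / height * min_size), min_size
    # portrait or square: width is the shorter (or equal) side
    return min_size, int(height / width * min_size)


print(target_size(1000, 500))  # (512, 256)
print(target_size(500, 1000))  # (256, 512)
print(target_size(1024, 768))  # (341, 256)
```

Because the longer side depends on each image's aspect ratio, the resized images do not all share the same shape.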

In [41]:
folder_names = ['train', 'test', 'val']
img_names = ['NORMAL', 'PNEUMONIA']



# loop through the different combinations of folder name prefixes
for folder in folder_names:
    for img_type in img_names:
        
        # set up the path to each folder of images
        path = f'./chest_xray/{folder}/{img_type}'
        
        # create a list of the filenames in that directory
        filelist = list(listdir(path))

        # This inner loop was only needed one time to resize the image files.
        # It is left in, commented out, in case of data loss and we need to
        # resize again. It loops through each file, resizes it using the helper
        # function above, and saves the new file over the old file.
        # for file_name in filelist:
        #     filepath = join(path, file_name)
        #     im = Image.open(filepath)
        #     im_resized = resize_with_ratio(im)
        #     im_resized.save(filepath)



In [47]:
path = './chest_xray/train/NORMAL'
filelist = list(listdir(path))
len(filelist)

1341

In [51]:
train_normal = np.empty(0)

In [52]:
train_normal

array([], dtype=float64)

In [55]:
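for file_name in filelist:
    image = Image.open(path + r'/' + file_name)
    numpydata = np.asarray(image)
    # note: np.append returns a new array and does not modify train_normal in
    # place; the result is discarded here, so train_normal stays empty
    np.append(train_normal, numpydata)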

In [56]:
train_normal

array([], dtype=float64)
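
The empty result above follows from how `np.append` works: it returns a new array instead of modifying its first argument, and it flattens its inputs when no axis is given. A common alternative is to collect the arrays in a Python list and stack them once at the end. The sketch below is one possible approach, not the notebook's final loading code; `stack_images` is a hypothetical helper, the zero-filled arrays stand in for real X-rays, and it assumes every image has already been brought to the same shape (for example by cropping or resizing to a fixed size).

```python
import numpy as np

# np.append leaves its first argument untouched and returns a new array.
empty = np.empty(0)
result = np.append(empty, [1.0, 2.0])
print(empty.size)    # 0
print(result.shape)  # (2,)


def stack_images(arrays):
    """Stack equally shaped 2-D arrays into one (n, height, width) array."""
    return np.stack(arrays, axis=0)


# Hypothetical stand-ins for grayscale images of identical shape.
fake_images = [np.zeros((256, 256), dtype=np.uint8) for _ in range(3)]
batch = stack_images(fake_images)
print(batch.shape)   # (3, 256, 256)
```

Building the list first and stacking once also avoids copying the whole array on every iteration, which is what repeated `np.append` calls would do.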