# Pneumonia X-Ray Image Classification: Splitting Image Files

This notebook splits the image data into train, validation, and test folders. The other notebooks in the project are:
* [EDA Notebook](01_Pneumonia_Classifier_EDA.ipynb)
* [Binary Modeling Notebook](03_Binary_Modeling.ipynb)
* [Model Visualization](04_Model_Visualizations.ipynb)
* [Binary Transfer Learning Model](05_Binary_Transfer_Learning.ipynb)
* [Multiclass Modeling](06_Multiclass_Modeling.ipynb)

## Import

In [1]:
import os
import random
import shutil

## Splitting Pneumonia vs Normal Images

In [2]:
imgs_pneu = [file for file in os.listdir('data/PNEUMONIA') if file.endswith('.jpeg')]
imgs_non_pneu = [file for file in os.listdir('data/NORMAL') if file.endswith('.jpeg')]

In [5]:
print('There are', len(imgs_pneu), 'pneumonia positive images')
print('There are', len(imgs_non_pneu), 'pneumonia negative images')

There are 4273 pneumonia positive images
There are 1583 pneumonia negative images
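`os.listdir` returns entries in arbitrary order, and `endswith('.jpeg')` is case-sensitive, so a file named `x.JPEG` would be skipped without warning. A minimal sketch of a listing helper that handles both, assuming a hypothetical function name `list_images` that is not part of the original notebook:

```python
import os

def list_images(folder, extensions=(".jpeg",)):
    """Return a sorted list of file names in `folder` whose extension matches.

    `list_images` is a hypothetical helper for illustration.
    `extensions` is expected to hold lowercase suffixes.
    """
    return sorted(
        name
        for name in os.listdir(folder)
        # Compare in lowercase so '.JPEG' and '.jpeg' are both accepted.
        if name.lower().endswith(tuple(extensions))
    )
```

Sorting gives every run the same starting order, which matters once the list is shuffled with a fixed seed.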


In [6]:
print('Proportion of images that are pneumonia positive:', round(len(imgs_pneu)/(len(imgs_pneu)+len(imgs_non_pneu)),2))
print('Proportion of images that are pneumonia negative:', round(len(imgs_non_pneu)/(len(imgs_pneu)+len(imgs_non_pneu)),2))

Proportion of images that are pneumonia positive: 0.73
Proportion of images that are pneumonia negative: 0.27


In [7]:
round(len(imgs_pneu)/len(imgs_non_pneu),1)

2.7

In [22]:
new_dir = 'data/split/'
org_pneu = 'data/PNEUMONIA/'
org_norm = 'data/NORMAL/'

In [11]:
os.mkdir(new_dir)

In [19]:
train_folder = os.path.join(new_dir, 'train')
train_pneu = os.path.join(train_folder, 'PNEUMONIA')
train_non_pneu = os.path.join(train_folder, 'NORMAL')

test_folder = os.path.join(new_dir, 'test')
test_pneu = os.path.join(test_folder, 'PNEUMONIA')
test_non_pneu = os.path.join(test_folder, 'NORMAL')

val_folder = os.path.join(new_dir, 'val')
val_pneu = os.path.join(val_folder, 'PNEUMONIA')
val_non_pneu = os.path.join(val_folder, 'NORMAL')

In [13]:
os.mkdir(train_folder)
os.mkdir(train_pneu)
os.mkdir(train_non_pneu)

os.mkdir(val_folder)
os.mkdir(val_pneu)
os.mkdir(val_non_pneu)

os.mkdir(test_folder)
os.mkdir(test_pneu)
os.mkdir(test_non_pneu)
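`os.mkdir` raises `FileExistsError` when the folder already exists, so the cell above fails if it is run a second time. A sketch of a re-runnable alternative using `os.makedirs(..., exist_ok=True)`; the helper name `make_split_dirs` and its parameters are assumptions for illustration:

```python
import os

def make_split_dirs(root, splits=("train", "val", "test"),
                    classes=("PNEUMONIA", "NORMAL")):
    """Create root/<split>/<class> folders and return their paths.

    `make_split_dirs` is a hypothetical helper, not the notebook's own code.
    The result is keyed by (split, class), e.g. paths[("train", "NORMAL")].
    """
    paths = {}
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            # exist_ok=True makes the call safe to repeat; missing parent
            # folders (root and root/<split>) are created as needed.
            os.makedirs(path, exist_ok=True)
            paths[(split, cls)] = path
    return paths
```

For the three-class split, the same helper could be called with `classes=("BACTERIA", "VIRUS", "NORMAL")`.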

In [14]:
random.shuffle(imgs_pneu)
random.shuffle(imgs_non_pneu)
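The shuffle above is unseeded, so each run of the notebook assigns different images to each split. If a repeatable split is wanted, one option is a seeded private generator. This is a sketch under that assumption; `shuffled_copy` and the seed value 42 are illustrative choices, not taken from the notebook:

```python
import random

def shuffled_copy(names, seed=42):
    """Return a reproducibly shuffled copy of `names`; the input is not modified.

    `shuffled_copy` is a hypothetical helper for illustration.
    """
    ordered = sorted(names)    # fix the starting order before shuffling
    rng = random.Random(seed)  # private generator; global random state is untouched
    rng.shuffle(ordered)
    return ordered
```

With the same seed and the same set of file names, the result is the same regardless of the order in which the names were listed.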

In [20]:
print("Train pneumonia should have", round(len(imgs_pneu)*0.6),"images")
print("Validation pneumonia should have", round(len(imgs_pneu)*0.2),"images")
print("Test pneumonia should have", round(len(imgs_pneu)*0.2),"images")
print("Train non-pneumonia should have", round(len(imgs_non_pneu)*0.6),"images")
print("Validation non-pneumonia should have", round(len(imgs_non_pneu)*0.2),"images")
print("Test non-pneumonia should have", round(len(imgs_non_pneu)*0.2),"images")

Train pneumonia should have 2564 images
Validation pneumonia should have 855 images
Test pneumonia should have 855 images
Train non-pneumonia should have 950 images
Validation non-pneumonia should have 317 images
Test non-pneumonia should have 317 images
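Because each count is rounded independently, the three pneumonia figures above sum to 4274, one more than the 4273 images available, and the normal figures sum to 1584 against 1583. When the lists are sliced, the test set takes whatever remains and ends up one image short of the printed figure. A small sketch of that arithmetic, using a hypothetical helper `split_bounds`:

```python
def split_bounds(n, train_frac=0.6, val_frac=0.2):
    """Return (train_end, val_end) slice boundaries for n shuffled items.

    `split_bounds` is a hypothetical helper for illustration. The test set
    is everything from val_end onward, so it absorbs any rounding remainder.
    """
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return train_end, val_end

for n in (4273, 1583):
    train_end, val_end = split_bounds(n)
    # Prints: total, train size, validation size, test size
    print(n, train_end, val_end - train_end, n - val_end)
```

This prints `4273 2564 855 854` and `1583 950 317 316`, so the slice boundaries are 2564 and 3419 for pneumonia and 950 and 1267 for normal.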


In [23]:
# train pneumonia
imgs = imgs_pneu[:2564]
for img in imgs:
    origin = os.path.join(org_pneu, img)
    destination = os.path.join(train_pneu, img)
    shutil.copyfile(origin, destination)
    
# validation pneumonia
imgs = imgs_pneu[2564:3419]
for img in imgs:
    origin = os.path.join(org_pneu, img)
    destination = os.path.join(val_pneu, img)
    shutil.copyfile(origin, destination)

# test pneumonia
imgs = imgs_pneu[3419:]
for img in imgs:
    origin = os.path.join(org_pneu, img)
    destination = os.path.join(test_pneu, img)
    shutil.copyfile(origin, destination)

In [24]:
# train non-pneumonia
imgs = imgs_non_pneu[:950]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(train_non_pneu, img)
    shutil.copyfile(origin, destination)
    
# validation non-pneumonia
imgs = imgs_non_pneu[950:1267]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(val_non_pneu, img)
    shutil.copyfile(origin, destination)
    
# test non-pneumonia
imgs = imgs_non_pneu[1267:]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(test_non_pneu, img)
    shutil.copyfile(origin, destination)
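The copy loops above differ only in their slice and destination folder, and the slice indices are typed in by hand. One way to fold them into a single function is sketched below. The name `copy_split` and the `dest_dirs` dictionary are assumptions for illustration, and the fractions mirror the 60/20/20 split used in this notebook:

```python
import os
import shutil

def copy_split(names, src_dir, dest_dirs, train_frac=0.6, val_frac=0.2):
    """Copy files into train/val/test folders according to list position.

    `copy_split` is a hypothetical helper. `names` should already be
    shuffled; `dest_dirs` maps 'train', 'val' and 'test' to existing folders.
    Returns the number of files copied into each split.
    """
    train_end = round(len(names) * train_frac)
    val_end = train_end + round(len(names) * val_frac)
    slices = {
        "train": names[:train_end],
        "val": names[train_end:val_end],
        "test": names[val_end:],  # remainder, so no file is left out
    }
    for split, subset in slices.items():
        for name in subset:
            shutil.copyfile(os.path.join(src_dir, name),
                            os.path.join(dest_dirs[split], name))
    return {split: len(subset) for split, subset in slices.items()}
```

Computing the boundaries from `len(names)` keeps the slices correct if the number of source images changes.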

In [25]:
print('There are', len(os.listdir(train_pneu)), 'pneumonia images in the training set')
print('There are', len(os.listdir(val_pneu)), 'pneumonia images in the validation set')
print('There are', len(os.listdir(test_pneu)), 'pneumonia images in the testing set')
print('There are', len(os.listdir(train_non_pneu)), 'non-pneumonia images in the training set')
print('There are', len(os.listdir(val_non_pneu)), 'non-pneumonia images in the validation set')
print('There are', len(os.listdir(test_non_pneu)), 'non-pneumonia images in the testing set')

There are 2564 pneumonia images in the training set
There are 855 pneumonia images in the validation set
There are 854 pneumonia images in the testing set
There are 950 non-pneumonia images in the training set
There are 317 non-pneumonia images in the validation set
There are 316 non-pneumonia images in the testing set
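The counts confirm how many files landed in each folder, but not that the folders are disjoint. A file that appears in two splits would leak between training and evaluation. A minimal sketch of an overlap check, using a hypothetical helper `overlapping_files`:

```python
import os

def overlapping_files(folders):
    """Return the set of file names that occur in more than one folder.

    `overlapping_files` is a hypothetical helper for illustration.
    An empty set means the folders share no file names.
    """
    seen, overlaps = set(), set()
    for folder in folders:
        names = set(os.listdir(folder))
        overlaps |= seen & names  # names already met in an earlier folder
        seen |= names
    return overlaps
```

Called with `[train_pneu, val_pneu, test_pneu]`, an empty result would indicate that no pneumonia image was copied into more than one split. The check compares file names only, not image contents.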


## Splitting Pneumonia Further

In [None]:
imgs_bacteria = [file for file in os.listdir('data/BACTERIA') if file.endswith('.jpeg')]
imgs_normal = [file for file in os.listdir('data/NORMAL') if file.endswith('.jpeg')]
imgs_virus = [file for file in os.listdir('data/VIRUS') if file.endswith('.jpeg')]

In [None]:
print('There are', len(imgs_bacteria), 'pneumonia positive bacteria images')
print('There are', len(imgs_virus), 'pneumonia positive virus images')
print('There are', len(imgs_normal), 'pneumonia negative images')

In [None]:
new_dir2 = 'data/split2/'
org_bac = 'data/BACTERIA/'
org_norm = 'data/NORMAL/'
org_vir = 'data/VIRUS/'

In [8]:
os.mkdir(new_dir2)

In [9]:
train_folder = os.path.join(new_dir2, 'train')
train_bac = os.path.join(train_folder, 'BACTERIA')
train_vir = os.path.join(train_folder, 'VIRUS')
train_non_pneu = os.path.join(train_folder, 'NORMAL')

test_folder = os.path.join(new_dir2, 'test')
test_bac = os.path.join(test_folder, 'BACTERIA')
test_vir = os.path.join(test_folder, 'VIRUS')
test_non_pneu = os.path.join(test_folder, 'NORMAL')

val_folder = os.path.join(new_dir2, 'val')
val_bac = os.path.join(val_folder, 'BACTERIA')
val_vir = os.path.join(val_folder, 'VIRUS')
val_non_pneu = os.path.join(val_folder, 'NORMAL')

In [10]:
os.mkdir(train_folder)
os.mkdir(train_bac)
os.mkdir(train_vir)
os.mkdir(train_non_pneu)

os.mkdir(val_folder)
os.mkdir(val_bac)
os.mkdir(val_vir)
os.mkdir(val_non_pneu)

os.mkdir(test_folder)
os.mkdir(test_bac)
os.mkdir(test_vir)
os.mkdir(test_non_pneu)

In [11]:
random.shuffle(imgs_bacteria)
random.shuffle(imgs_virus)
random.shuffle(imgs_normal)

In [12]:
print("Train bacteria should have", round(len(imgs_bacteria)*0.6),"images")
print("Validation bacteria should have", round(len(imgs_bacteria)*0.2),"images")
print("Test bacteria should have", round(len(imgs_bacteria)*0.2),"images")

print("Train virus should have", round(len(imgs_virus)*0.6),"images")
print("Validation virus should have", round(len(imgs_virus)*0.2),"images")
print("Test virus should have", round(len(imgs_virus)*0.2),"images")

print("Train normal should have", round(len(imgs_normal)*0.6),"images")
print("Validation normal should have", round(len(imgs_normal)*0.2),"images")
print("Test normal should have", round(len(imgs_normal)*0.2),"images")

Train bacteria should have 1668 images
Validation bacteria should have 556 images
Test bacteria should have 556 images
Train virus should have 896 images
Validation virus should have 299 images
Test virus should have 299 images
Train normal should have 950 images
Validation normal should have 317 images
Test normal should have 317 images


In [13]:
# train bacteria 
imgs = imgs_bacteria[:1668]
for img in imgs:
    origin = os.path.join(org_bac, img)
    destination = os.path.join(train_bac, img)
    shutil.copyfile(origin, destination)
    
# validation bacteria 
imgs = imgs_bacteria[1668:2224]
for img in imgs:
    origin = os.path.join(org_bac, img)
    destination = os.path.join(val_bac, img)
    shutil.copyfile(origin, destination)

# test bacteria 
imgs = imgs_bacteria[2224:]
for img in imgs:
    origin = os.path.join(org_bac, img)
    destination = os.path.join(test_bac, img)
    shutil.copyfile(origin, destination)

In [14]:
# train virus 
imgs = imgs_virus[:896]
for img in imgs:
    origin = os.path.join(org_vir, img)
    destination = os.path.join(train_vir, img)
    shutil.copyfile(origin, destination)
    
# test virus (the second slice is copied to the test folder)
imgs = imgs_virus[896:1195]
for img in imgs:
    origin = os.path.join(org_vir, img)
    destination = os.path.join(test_vir, img)
    shutil.copyfile(origin, destination)
    
# validation virus (the remainder is copied to the validation folder)
imgs = imgs_virus[1195:]
for img in imgs:
    origin = os.path.join(org_vir, img)
    destination = os.path.join(val_vir, img)
    shutil.copyfile(origin, destination)

In [15]:
# train non-pneumonia
imgs = imgs_normal[:950]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(train_non_pneu, img)
    shutil.copyfile(origin, destination)
    
# validation non-pneumonia
imgs = imgs_normal[950:1267]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(val_non_pneu, img)
    shutil.copyfile(origin, destination)
    
# test non-pneumonia
imgs = imgs_normal[1267:]
for img in imgs:
    origin = os.path.join(org_norm, img)
    destination = os.path.join(test_non_pneu, img)
    shutil.copyfile(origin, destination)

In [17]:
print('There are', len(os.listdir(train_bac)), 'bacteria images in the training set')
print('There are', len(os.listdir(val_bac)), 'bacteria images in the validation set')
print('There are', len(os.listdir(test_bac)), 'bacteria images in the testing set')
print('There are', len(os.listdir(train_vir)), 'virus images in the training set')
print('There are', len(os.listdir(val_vir)), 'virus images in the validation set')
print('There are', len(os.listdir(test_vir)), 'virus images in the testing set')
print('There are', len(os.listdir(train_non_pneu)), 'non-pneumonia images in the training set')
print('There are', len(os.listdir(val_non_pneu)), 'non-pneumonia images in the validation set')
print('There are', len(os.listdir(test_non_pneu)), 'non-pneumonia images in the testing set')

There are 1668 bacteria images in the training set
There are 556 bacteria images in the validation set
There are 556 bacteria images in the testing set
There are 896 virus images in the training set
There are 298 virus images in the validation set
There are 299 virus images in the testing set
There are 950 non-pneumonia images in the training set
There are 317 non-pneumonia images in the validation set
There are 316 non-pneumonia images in the testing set
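The per-folder counts above can also be collected by walking the split directory, which avoids one `print` line per folder. A sketch, assuming a `root/<split>/<class>/` layout like the one built in this notebook and a hypothetical helper `count_images`:

```python
import os

def count_images(root):
    """Return {split: {class: file_count}} for a root/<split>/<class>/ layout.

    `count_images` is a hypothetical helper for illustration. It counts
    every entry in each class folder and does not filter by extension.
    """
    counts = {}
    for split in sorted(os.listdir(root)):
        split_path = os.path.join(root, split)
        if not os.path.isdir(split_path):
            continue  # skip stray files next to the split folders
        counts[split] = {
            cls: len(os.listdir(os.path.join(split_path, cls)))
            for cls in sorted(os.listdir(split_path))
            if os.path.isdir(os.path.join(split_path, cls))
        }
    return counts
```

Called as `count_images('data/split2/')`, it would return a nested dictionary with one entry per split and class, which can be compared against the expected sizes in a single step.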
