# Generating train, validation and test images using random sampling

In this notebook, I take the following steps to sample train, validation and test images from the DICOM images available for __[this Kaggle competition](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge)__ and save them into their corresponding folders for the CNN model that I train and evaluate in **'Classifying_Potential_Pneumonia_Cases_Using_CNN_Final.ipynb'**.

* There are 26,684 unique chest radiographs available from the competition. Of those radiographs, 22.53% are identified as pneumonia cases. I use random undersampling of the majority class (i.e., normal cases) to gather a balanced sample of 12,024 radiographs.

* I then randomly sample about 8% (998 radiographs) of the balanced samples (12,024) and save them as test samples. 

* Of the remaining 11,026 samples, I randomly sample about 9% (1,004) and save them as validation images.

* Of the remaining 10,022, I randomly sample 50% (5,011) and save them as my train samples.

* I copy the final train, validation and test images into their corresponding folders and save their image names and target labels in the associated csv files.
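The sample sizes in the list above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original notebook: it assumes that the positive class makes up exactly half of the balanced sample (6,012 radiographs) and that `train_test_split` rounds a fractional `test_size` up to the next whole sample. The helper name `split_sizes` is my own.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out


n_total = 26684            # unique radiographs in the competition data
n_positive = 6012          # assumption: half of the balanced sample of 12,024
n_balanced = 2 * n_positive

n_train_val, n_test = split_sizes(n_balanced, 0.083)
n_train, n_val = split_sizes(n_train_val, 0.091)
n_train_final, n_rest = split_sizes(n_train, 0.5)

print('Percent positive:', round(n_positive / n_total * 100, 2))
print('Balanced:', n_balanced, 'Test:', n_test, 'Validation:', n_val, 'Train:', n_train_final)
```

Under these assumptions the numbers agree with the ones quoted above: 22.53% positive, 998 test, 1,004 validation and 5,011 train images.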

In [1]:
# Import packages
import pandas as pd
import numpy as np
from random import sample
import os
import shutil
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.model_selection import train_test_split

**Step 1: Import csv file that has image names and target labels for all original samples**

In [2]:
# Import csv file with image names and target labels
df=pd.read_csv('~/Desktop/Thinkful/data/stage_2_train_labels.csv')

# All duplicates have the same class label, so drop duplicates
df = df.drop_duplicates(subset=['patientId'], keep='last')

# Add index as a new column
df['index'] = df.index

df.head(5)

Unnamed: 0,patientId,x,y,width,height,Target,index
0,0004cfab-14fd-4e49-80ba-63a80b6bddd6,,,,,0,0
1,00313ee0-9eaa-42f4-b0ab-c148ed3241cd,,,,,0,1
2,00322d4d-1c29-4943-afc9-b6754be640eb,,,,,0,2
3,003d8fa0-6bf1-40ed-b54c-ac657f8495c5,,,,,0,3
5,00436515-870c-4b36-a041-de91049b9ab4,562.0,152.0,256.0,453.0,1,5
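Dropping duplicates is only safe if every `patientId` carries a single label. The sketch below shows one way to check that claim, using only the standard library and hypothetical toy rows rather than the competition data. On the real dataframe, something like `df.groupby('patientId')['Target'].nunique()` run before `drop_duplicates` should give the same information.

```python
from collections import defaultdict

# Hypothetical rows (patientId, Target); one patient has two bounding boxes
rows = [
    ('patient-a', 0),
    ('patient-b', 1),
    ('patient-b', 1),
    ('patient-c', 0),
]

# Collect the set of labels seen for each patient
labels_per_patient = defaultdict(set)
for patient_id, target in rows:
    labels_per_patient[patient_id].add(target)

# Patients with more than one distinct label would make drop_duplicates lossy
conflicting = [pid for pid, labels in labels_per_patient.items() if len(labels) > 1]
print('Patients with conflicting labels:', conflicting)
```

An empty list means that keeping one row per patient loses no label information.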


In [3]:
# Percentage of positive class in the original samples
print('Percent of positive class:', round(df['Target'].mean()*100, 2), '%')

Percent of positive class: 22.53 %


In [4]:
# Define x(index) and y(Target)
x=df.loc[:,'index']
y=df.loc[:,'Target']

In [5]:
# Convert x(index) and y(Target) into numpy array for random under sampling
x=x.values.reshape(-1, 1) # since we only have one feature for resampling
y=y.values

In [6]:
# Random under sampling
rus = RandomUnderSampler(random_state=321)
# fit_resample replaces the separate fit() and sample() calls of older imbalanced-learn releases
x_rus, y_rus = rus.fit_resample(x, y)
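To make the resampling step less of a black box, here is a rough sketch of what random undersampling does: it keeps every sample of the minority class and draws an equally sized random subset from each other class. This is an illustration written with the standard library only, not imbalanced-learn's implementation; the function name `random_undersample` and the toy data are assumptions of mine.

```python
import random
from collections import Counter


def random_undersample(indices, labels, seed=321):
    """Keep all minority-class indices and an equally sized random draw from each other class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in zip(indices, labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(members) for members in by_class.values())
    kept = []
    for label, members in by_class.items():
        if len(members) == n_minority:
            chosen = members
        else:
            chosen = rng.sample(members, n_minority)
        kept.extend((idx, label) for idx in chosen)
    return kept


# Toy data: 8 negatives and 2 positives
toy_indices = list(range(10))
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
balanced = random_undersample(toy_indices, toy_labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Because only the row index is resampled here, the result can be joined back to the original dataframe afterwards, which is the same idea the notebook relies on.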

In [7]:
# Split the random under samples into train and validation (92%) and test (8%) samples
x_train, x_test, y_train, y_test = train_test_split(x_rus, y_rus, test_size=0.083, random_state=321)
print('Train and validation samples: ', x_train.shape[0], 'images')
print('Test samples: ', x_test.shape[0], 'images')

Train and validation samples:  11026 images
Test samples:  998 images


In [8]:
# Split the train and validation samples into train (91%) and validation (9%) samples
x_train_1, x_val, y_train_1, y_val = train_test_split(x_train, y_train, test_size=0.091, random_state=321)
print('Train samples: ', x_train_1.shape[0], 'images')
print('Validation samples: ', x_val.shape[0], 'images')

Train samples:  10022 images
Validation samples:  1004 images


In [9]:
# Split the train samples in half and keep one half (50%) as the final train samples
x_train_final, x_rest, y_train_final, y_rest = train_test_split(x_train_1, y_train_1, test_size=0.5, random_state=321)
print('Final train samples: ', x_train_final.shape[0], 'images')
print('Remaining samples: ', x_rest.shape[0], 'images')

Final train samples:  5011 images
Remaining samples:  5011 images
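Since the three splits are chained, it is worth confirming that no image ends up in more than one of the final sets. The sketch below mimics the chain with the standard library on stand-in indices; `hold_out` is a hypothetical helper, not the scikit-learn function, and it assumes the held-out part is rounded up.

```python
import math
import random


def hold_out(items, fraction, seed=321):
    """Shuffle a copy of items and split off ceil(fraction * n) of them."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_held_out = math.ceil(len(shuffled) * fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


all_ids = list(range(12024))                 # stand-in for the balanced sample indices
train_val, test = hold_out(all_ids, 0.083)
train, val = hold_out(train_val, 0.091)
train_final, rest = hold_out(train, 0.5)

# No index may appear in more than one of the final sets
overlaps = [
    set(test) & set(val),
    set(test) & set(train_final),
    set(val) & set(train_final),
]
print('Overlapping indices:', [len(o) for o in overlaps])
print('Sizes:', len(test), len(val), len(train_final))
```

The same set intersections can be applied to `x_test`, `x_val` and `x_train_final` after flattening them, for example with `x_test.ravel()`.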


**Step 2: Save test images and csv files with image names and target labels**

In [10]:
# Convert numpy arrays back to dataframe
x_test_df=pd.DataFrame(x_test)
x_test_df['index']=x_test_df
print(len(x_test_df), 'Test images')
x_test_df.head(5)

998 Test images


Unnamed: 0,0,index
0,20019,20019
1,23190,23190
2,20352,20352
3,9960,9960
4,14064,14064


In [11]:
# Merge the test samples with the original dataframe to retrieve image names and Target values 
merge_test= pd.merge(x_test_df,df, left_on='index', right_on='index',how='left')
print('Final test samples: ', merge_test.shape[0], 'images') 
print('Percent of positive class in the final test samples:', round(merge_test['Target'].mean()*100, 2), '%')
merge_test.head(5)

Final test samples:  998 images
Percent of positive class in the final test samples: 50.1 %


Unnamed: 0,0,index,patientId,x,y,width,height,Target
0,20019,20019,b877bb8d-e274-483d-9586-397dc8897d45,,,,,0
1,23190,23190,d0ea3f4f-9bc8-417a-80ba-2ed5603b7662,,,,,0
2,20352,20352,ba9148d0-12d5-4333-9b5f-bb32c687ed66,584.0,89.0,229.0,488.0,1
3,9960,9960,6b731342-293f-495a-b341-e4ccc0e717fd,,,,,0
4,14064,14064,8be52575-7877-41c2-a1cd-cb16915273b6,149.0,502.0,249.0,205.0,1


In [12]:
# Create a list of final test image names from the merged dataframe
test_images_df=list(merge_test['patientId'])
print('Final test samples: ', len(test_images_df), 'images') 
test_images_df[0:5]

Final test samples:  998 images


['b877bb8d-e274-483d-9586-397dc8897d45',
 'd0ea3f4f-9bc8-417a-80ba-2ed5603b7662',
 'ba9148d0-12d5-4333-9b5f-bb32c687ed66',
 '6b731342-293f-495a-b341-e4ccc0e717fd',
 '8be52575-7877-41c2-a1cd-cb16915273b6']

In [13]:
# Specify the absolute path
abs_path=os.path.abspath(os.path.join('..'))

# Specify the path of original .jpg images
jpg_image_path = os.path.join(abs_path, 'data', 'stage_2_train_images_jpg')

# Specify the path to copy final test images to
test_image_path = os.path.join(abs_path, 'data', 'test_images')

In [14]:
# Copy test images
for i in test_images_df: 
    shutil.copy(os.path.join(jpg_image_path, i+ '.jpg'), test_image_path)
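The copy loop assumes that the destination folder already exists and that every image is present in the source folder. A slightly more defensive variant is sketched below; `copy_images` and its parameters are hypothetical names, and the behavior for missing files (collect and report instead of raising) is a design choice of mine rather than something the notebook prescribes.

```python
import os
import shutil


def copy_images(image_ids, src_dir, dst_dir, ext='.jpg'):
    """Copy <id><ext> files from src_dir into dst_dir and return the ids that were not found."""
    # If dst_dir did not exist, shutil.copy would treat it as a file name instead of a folder
    os.makedirs(dst_dir, exist_ok=True)
    missing = []
    for image_id in image_ids:
        src = os.path.join(src_dir, image_id + ext)
        if os.path.isfile(src):
            shutil.copy(src, dst_dir)
        else:
            missing.append(image_id)
    return missing
```

With the variables from this notebook, a call could look like `missing = copy_images(test_images_df, jpg_image_path, test_image_path)`, followed by a check that `missing` is empty.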

In [15]:
# Generate the list of the final test images from the folder paths
test_images_path = os.listdir(test_image_path)
test_images_path1 = [os.path.splitext(i)[0] for i in test_images_path]
print('Final test samples: ', len(test_images_path1), 'images') 
test_images_path1[0:5]

Final test samples:  998 images


['07332989-6518-4bd9-96de-f4513948cf4a',
 'e380d5fb-dc74-41c9-adad-05b7b88b9137',
 '75dbc949-4634-4f9c-b3bb-a8c03747e9e1',
 '3c0a97d7-6d90-46a6-a10a-1df9e4064d52',
 '8ca19865-fd6c-4bcd-a2a0-7f9e9ea4bc5c']
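A note on removing the `.jpg` extension from file names: `str.strip('.jpg')` does not remove the suffix `.jpg`. It removes any of the characters `.`, `j`, `p` and `g` from both ends of the string. That is harmless for the hexadecimal UUID names used here, but it can silently damage other names, whereas `os.path.splitext` only removes the final extension. The file names in this small comparison are made up for illustration.

```python
import os

names = ['pig.jpg', '0a1b2c3d.jpg']

results = {}
for name in names:
    stripped = name.strip('.jpg')        # removes any of '.', 'j', 'p', 'g' from both ends
    stem = os.path.splitext(name)[0]     # removes only the final extension
    results[name] = (stripped, stem)
    print(name, '->', stripped, 'vs', stem)
```

For `pig.jpg` the two approaches disagree (`i` versus `pig`), while for the hexadecimal name they give the same result.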

In [16]:
# Create a dataframe of the final test images from the folder paths
test_images_path_df=pd.DataFrame(test_images_path1, columns=['patientId'])
test_images_path_df.head(5)

Unnamed: 0,patientId
0,07332989-6518-4bd9-96de-f4513948cf4a
1,e380d5fb-dc74-41c9-adad-05b7b88b9137
2,75dbc949-4634-4f9c-b3bb-a8c03747e9e1
3,3c0a97d7-6d90-46a6-a10a-1df9e4064d52
4,8ca19865-fd6c-4bcd-a2a0-7f9e9ea4bc5c


In [17]:
# Merge the test_images_path_df with corresponding Target values using merge_test dataframe
df_test= pd.merge(test_images_path_df,merge_test, left_on='patientId', right_on='patientId',how='left')
print('Final test samples: ', len(df_test), 'images') 
df_test.head(5)

Final test samples:  998 images


Unnamed: 0,patientId,0,index,x,y,width,height,Target
0,07332989-6518-4bd9-96de-f4513948cf4a,379,379,146.0,701.0,303.0,323.0,1
1,e380d5fb-dc74-41c9-adad-05b7b88b9137,25388,25388,162.0,412.0,228.0,260.0,1
2,75dbc949-4634-4f9c-b3bb-a8c03747e9e1,11313,11313,720.0,363.0,166.0,308.0,1
3,3c0a97d7-6d90-46a6-a10a-1df9e4064d52,3997,3997,608.0,360.0,292.0,233.0,1
4,8ca19865-fd6c-4bcd-a2a0-7f9e9ea4bc5c,14158,14158,580.0,228.0,301.0,614.0,1
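Because this is a left merge starting from the folder listing, any file in the folder that is not part of the sampled set (a leftover from an earlier run, or a hidden system file) would appear as a row without a `Target` value. Matching row counts alone do not rule that out, so a set comparison is a useful extra check. The helper `compare_ids` and the ids below are hypothetical.

```python
def compare_ids(expected_ids, found_ids):
    """Return (missing, unexpected) as sorted lists."""
    expected, found = set(expected_ids), set(found_ids)
    return sorted(expected - found), sorted(found - expected)


# Hypothetical ids: one expected image is absent and one stray file is present
expected_ids = ['id-1', 'id-2', 'id-3']
found_ids = ['id-1', 'id-3', '.DS_Store']

missing, unexpected = compare_ids(expected_ids, found_ids)
print('Missing from folder:', missing)
print('Unexpected in folder:', unexpected)
```

In this notebook the call would be along the lines of `compare_ids(test_images_df, test_images_path1)`, and both returned lists should be empty.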


In [18]:
# Save as csv file  
df_test.to_csv('~/Desktop/Thinkful/data/df_test.csv', index=False)

**Step 3: Save validation images and csv file with image names and target labels**

In [19]:
# Convert numpy arrays back to dataframe
x_val_df=pd.DataFrame(x_val)
x_val_df['index']=x_val_df
print(len(x_val_df), 'Validation images')
x_val_df.head(5)

1004 Validation images


Unnamed: 0,0,index
0,20243,20243
1,12122,12122
2,706,706
3,1500,1500
4,14731,14731


In [20]:
# Merge the validation samples with the original dataframe to retrieve image names and Target values 
merge_valid= pd.merge(x_val_df,df, left_on='index', right_on='index',how='left')
print('Final validation samples: ', merge_valid.shape[0], 'images') 
print('Percent of positive class in the final validation samples:', round(merge_valid['Target'].mean()*100, 2), '%')
merge_valid.head(5)

Final validation samples:  1004 images
Percent of positive class in the final validation samples: 48.01 %


Unnamed: 0,0,index,patientId,x,y,width,height,Target
0,20243,20243,ba036253-6844-428c-a27b-14e9b6021740,572.0,400.0,156.0,240.0,1
1,12122,12122,7c545810-e2b5-4849-a2ff-893075f28192,680.0,484.0,166.0,288.0,1
2,706,706,09a71d45-931c-4089-958d-3eca0f1e303a,,,,,0
3,1500,1500,1677cfe0-e54f-4672-b49c-36cfd12ec76b,,,,,0
4,14731,14731,911027e5-82d0-44f8-af40-de9049922430,281.0,336.0,260.0,680.0,1


In [21]:
# Create a list of final validation image names from the merged dataframe
valid_images_df=list(merge_valid['patientId'])
print('Final validation samples: ', len(valid_images_df), 'images') 
valid_images_df[0:5]

Final validation samples:  1004 images


['ba036253-6844-428c-a27b-14e9b6021740',
 '7c545810-e2b5-4849-a2ff-893075f28192',
 '09a71d45-931c-4089-958d-3eca0f1e303a',
 '1677cfe0-e54f-4672-b49c-36cfd12ec76b',
 '911027e5-82d0-44f8-af40-de9049922430']

In [22]:
# Specify the path to copy final validation images to
valid_image_path = os.path.join(abs_path, 'data', 'valid_images')

In [23]:
# Copy validation images
for i in valid_images_df: 
    shutil.copy(os.path.join(jpg_image_path, i+ '.jpg'), valid_image_path)

In [24]:
# Generate the list of the final validation images from the folder paths
valid_images_path = os.listdir(valid_image_path)
valid_images_path1 = [os.path.splitext(i)[0] for i in valid_images_path]
print('Final validation samples: ', len(valid_images_path1), 'images') 
valid_images_path1[0:5]

Final validation samples:  1004 images


['af8a9a3f-9487-454b-8ee6-65c9554f3a87',
 'f7793f41-fe23-4e09-8bef-6394b56bed37',
 '6ed56555-277d-42bc-9e16-c43b679b62d2',
 '9b90c3f5-5126-4ef4-b109-244c88676697',
 '17bbf318-ef96-4dac-99e0-59fae5c60b56']

In [25]:
# Create a dataframe of the final validation images from the folder paths
valid_images_path_df=pd.DataFrame(valid_images_path1, columns=['patientId'])
valid_images_path_df.head(5)

Unnamed: 0,patientId
0,af8a9a3f-9487-454b-8ee6-65c9554f3a87
1,f7793f41-fe23-4e09-8bef-6394b56bed37
2,6ed56555-277d-42bc-9e16-c43b679b62d2
3,9b90c3f5-5126-4ef4-b109-244c88676697
4,17bbf318-ef96-4dac-99e0-59fae5c60b56


In [26]:
# Merge the valid_images_path_df with corresponding Target values using merge_valid dataframe 
df_valid= pd.merge(valid_images_path_df,merge_valid, left_on='patientId', right_on='patientId',how='left')
print('Final validation samples: ', len(df_valid), 'images') 
df_valid.head(5)

Final validation samples:  1004 images


Unnamed: 0,patientId,0,index,x,y,width,height,Target
0,af8a9a3f-9487-454b-8ee6-65c9554f3a87,18667,18667,294.0,606.0,132.0,120.0,1
1,f7793f41-fe23-4e09-8bef-6394b56bed37,27898,27898,630.0,188.0,221.0,506.0,1
2,6ed56555-277d-42bc-9e16-c43b679b62d2,10400,10400,,,,,0
3,9b90c3f5-5126-4ef4-b109-244c88676697,16036,16036,555.0,367.0,270.0,574.0,1
4,17bbf318-ef96-4dac-99e0-59fae5c60b56,1646,1646,,,,,0


In [27]:
# Save as csv file  
df_valid.to_csv('~/Desktop/Thinkful/data/df_valid.csv', index=False)

**Step 4: Save train images and csv file with image names and target labels**

In [28]:
# Convert numpy arrays back to dataframe
x_train_df=pd.DataFrame(x_train_final)
x_train_df['index']=x_train_df
print(len(x_train_df), 'Train images')
x_train_df.head(5)

5011 Train images


Unnamed: 0,0,index
0,22920,22920
1,1245,1245
2,1219,1219
3,20480,20480
4,17700,17700


In [29]:
# Merge the train samples with the original dataframe to retrieve image names and Target values 
merge_train= pd.merge(x_train_df,df, left_on='index', right_on='index',how='left')
print('Final train samples: ', merge_train.shape[0], 'images') 
print('Percent of positive class in the final train samples:', round(merge_train['Target'].mean()*100, 2), '%')
merge_train.head(5)

Final train samples:  5011 images
Percent of positive class in the final train samples: 50.53 %


Unnamed: 0,0,index,patientId,x,y,width,height,Target
0,22920,22920,ceb849b4-5618-4c3b-b34a-6ef007ae2ba0,357.0,397.0,157.0,212.0,1
1,1245,1245,11750ff6-94ea-43d0-bad8-f3c3e5278d7d,570.0,395.0,271.0,499.0,1
2,1219,1219,10528b1d-887c-4107-8b18-aa40a0507389,,,,,0
3,20480,20480,bb6164ee-9b30-4b94-85ca-5f9d94b2ce4a,270.0,182.0,239.0,508.0,1
4,17700,17700,a8aae1e6-2a7a-443b-aaed-aecc02bb3e39,,,,,0


In [30]:
# Create a list of final train image names from the merged dataframe
train_images_df=list(merge_train['patientId'])
print('Final train samples: ', len(train_images_df), 'images') 
train_images_df[0:5]

Final train samples:  5011 images


['ceb849b4-5618-4c3b-b34a-6ef007ae2ba0',
 '11750ff6-94ea-43d0-bad8-f3c3e5278d7d',
 '10528b1d-887c-4107-8b18-aa40a0507389',
 'bb6164ee-9b30-4b94-85ca-5f9d94b2ce4a',
 'a8aae1e6-2a7a-443b-aaed-aecc02bb3e39']

In [31]:
# Specify the path to copy final train images to
train_image_path = os.path.join(abs_path, 'data', 'train_images')

In [32]:
# Copy train images
for i in train_images_df: 
    shutil.copy(os.path.join(jpg_image_path, i+ '.jpg'), train_image_path)

In [33]:
# Generate the list of the final train images from the folder paths
train_images_path = os.listdir(train_image_path)
train_images_path1 = [os.path.splitext(i)[0] for i in train_images_path]
print('Final train samples: ', len(train_images_path1), 'images') 
train_images_path1[0:5]

Final train samples:  5011 images


['f93d9a23-cc0d-4eff-abf1-62139ebe80d9',
 '3abb7176-035d-46cc-844e-820870e8154b',
 '55d5fe58-1dd5-454e-8628-92ef7f2993dc',
 'da358a05-8106-45af-9a65-a95e5296cc09',
 'e46bf3ce-4426-4f41-bcf5-4afb112164b5']

In [34]:
# Create a dataframe of the final train images from the folder paths
train_images_path_df=pd.DataFrame(train_images_path1, columns=['patientId'])
train_images_path_df.head(5)

Unnamed: 0,patientId
0,f93d9a23-cc0d-4eff-abf1-62139ebe80d9
1,3abb7176-035d-46cc-844e-820870e8154b
2,55d5fe58-1dd5-454e-8628-92ef7f2993dc
3,da358a05-8106-45af-9a65-a95e5296cc09
4,e46bf3ce-4426-4f41-bcf5-4afb112164b5


In [35]:
# Merge the train_images_path_df with corresponding Target values using merge_train dataframe
df_train= pd.merge(train_images_path_df,merge_train, left_on='patientId', right_on='patientId',how='left')
print('Final train samples: ', len(df_train), 'images') 
df_train.head(5)

Final train samples:  5011 images


Unnamed: 0,patientId,0,index,x,y,width,height,Target
0,f93d9a23-cc0d-4eff-abf1-62139ebe80d9,28113,28113,,,,,0
1,3abb7176-035d-46cc-844e-820870e8154b,3803,3803,138.0,305.0,242.0,529.0,1
2,55d5fe58-1dd5-454e-8628-92ef7f2993dc,7297,7297,693.0,385.0,135.0,132.0,1
3,da358a05-8106-45af-9a65-a95e5296cc09,24288,24288,,,,,0
4,e46bf3ce-4426-4f41-bcf5-4afb112164b5,25497,25497,,,,,0


In [36]:
# Save as csv file  
df_train.to_csv('~/Desktop/Thinkful/data/df_train.csv', index=False)