# Selection for Train, Test and Validation datasets

### Selection for 'Bacterial Pneumonia' and 'Normal' 
* We used simple random sampling: 80% of the images were randomly selected for the training set, 10% for the validation set, and 10% for the test set.
* We downloaded thousands of chest X-ray files from https://data.mendeley.com/datasets/rscbjbr9sj/3. More details on how we selected images from each class can be found in the notebook 'Random Selection from ZhangLabData.ipynb'.

### Selection for 'COVID19' <br>
* We used stratified random sampling that follows the source distribution of the original dataset, because the images came from many distinct sources.
* Measures against data leakage were taken, because some patients have more than one X-ray image.
* We got the COVID images from the ieee8023 repository. <br>
Link: https://github.com/ieee8023/covid-chestxray-dataset <br>
* In the same repository, we found information about the X-Ray images (e.g. X-Ray source) in the file metadata.csv.
* We imported images and metadata file manually into Google Drive. <br>

## Setting up main parameters and imports

### Mounting Drive to allow file handling

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Handling Imports

In [None]:
import pandas as pd
import os
import numpy as np
import shutil

### Main parameters

In [None]:
np.random.seed(42)
train_fraction=0.8

# Folder where Bacterial Pneumonia and Normal images can be found
input_other_directory ="gdrive/Shared drives/COVID Hackathon/Ravi/SelectedFiles_from_ZhangLab/"

# Folder where COVID19 images from github can be found
input_covid_directory="gdrive/Shared drives/COVID Hackathon/Ravi/ieeeRawCOVIDImages_manually/"

# Location where metadata.csv from github can be found
metadata_file_path="gdrive/Shared drives/COVID Hackathon/Ravi/metadata.csv"

# Setting up the final folders where datasets will be stored and setting up directory structure
final_data_directory="gdrive/Shared drives/COVID Hackathon/Ravi/FinalDatasets_Sofia/"
classes=["COVID-19","PNEUMONIA BACTERIA","NORMAL"]
train_directory=final_data_directory+"Train/"
validation_directory=final_data_directory+"Validation/"
test_directory=final_data_directory+"Test/"



Setting up the final directory structure to prepare data for Keras' ImageDataGenerators

In [None]:
os.mkdir(final_data_directory)
os.mkdir(train_directory)
os.mkdir(validation_directory)
os.mkdir(test_directory)

for class_name in classes:
  os.mkdir(train_directory+class_name)
  os.mkdir(validation_directory+class_name)
  os.mkdir(test_directory+class_name)

print("Directory structure set at : ",final_data_directory)

Directory structure set at :  gdrive/Shared drives/COVID Hackathon/Ravi/FinalDatasets_Sofia/
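
The cell above uses `os.mkdir`, which raises an error if a folder already exists, so it can only be run once. A re-runnable variant can be sketched as follows. This is only an illustration: the helper name `make_split_directories` is hypothetical, and the demonstration writes to a temporary folder rather than to the shared drive.

```python
import os
import tempfile

def make_split_directories(base_directory, split_names, class_names):
    """Create base/split/class folders; safe to call more than once."""
    created = []
    for split_name in split_names:
        for class_name in class_names:
            path = os.path.join(base_directory, split_name, class_name)
            # exist_ok=True means no error is raised if the folder is already there
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

# Demonstration in a throwaway temporary folder
with tempfile.TemporaryDirectory() as tmp:
    splits = ["Train", "Validation", "Test"]
    labels = ["COVID-19", "PNEUMONIA BACTERIA", "NORMAL"]
    paths = make_split_directories(tmp, splits, labels)
    # Calling it a second time does not fail
    make_split_directories(tmp, splits, labels)
    print("Folders created : ", len(paths))
```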


## Stratified sampling for COVID19 with data leakage prevention


We need to pick the images whose finding is COVID-19 and whose modality is X-ray, according to the file 'metadata.csv', which was also downloaded from the ieee8023 repository and imported into Google Drive. <br>



Reading in and manipulating the metadata.csv


In [None]:
ieee_metadata=pd.read_csv(metadata_file_path)
ieee_metadata.head(5)

Unnamed: 0,patientid,offset,sex,age,finding,RT_PCR_positive,survival,intubated,intubation_present,went_icu,in_icu,needed_supplemental_O2,extubated,temperature,pO2_saturation,leukocyte_count,neutrophil_count,lymphocyte_count,view,modality,date,location,folder,filename,doi,url,license,clinical_notes,other_notes
0,2,0.0,M,65.0,COVID-19,Y,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 22, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
1,2,3.0,M,65.0,COVID-19,Y,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 25, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
2,2,5.0,M,65.0,COVID-19,Y,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 27, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
3,2,6.0,M,65.0,COVID-19,Y,Y,N,N,N,N,Y,,,,,,,PA,X-ray,"January 28, 2020","Cho Ray Hospital, Ho Chi Minh City, Vietnam",images,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,10.1056/nejmc2001272,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,"On January 22, 2020, a 65-year-old man with a ...",
4,4,0.0,F,52.0,COVID-19,Y,,N,N,N,N,N,,,,,,,PA,X-ray,"January 25, 2020","Changhua Christian Hospital, Changhua City, Ta...",images,nejmc2001573_f1a.jpeg,10.1056/NEJMc2001573,https://www.nejm.org/doi/full/10.1056/NEJMc200...,,diffuse infiltrates in the bilateral lower lungs,


In [None]:
# Dropping unnecessary columns
ieee_metadata=ieee_metadata.drop(columns=[column_name for column_name in ieee_metadata.columns if column_name not in ["patientid","finding","modality","filename","url"]])

#Filtering based on modality and findings
ieee_metadata=ieee_metadata[(ieee_metadata["modality"]=="X-ray") & (ieee_metadata["finding"]=="COVID-19")]
ieee_metadata.head()

Unnamed: 0,patientid,finding,modality,filename,url
0,2,COVID-19,X-ray,auntminnie-a-2020_01_28_23_51_6665_2020_01_28_...,https://www.nejm.org/doi/full/10.1056/NEJMc200...
1,2,COVID-19,X-ray,auntminnie-b-2020_01_28_23_51_6665_2020_01_28_...,https://www.nejm.org/doi/full/10.1056/NEJMc200...
2,2,COVID-19,X-ray,auntminnie-c-2020_01_28_23_51_6665_2020_01_28_...,https://www.nejm.org/doi/full/10.1056/NEJMc200...
3,2,COVID-19,X-ray,auntminnie-d-2020_01_28_23_51_6665_2020_01_28_...,https://www.nejm.org/doi/full/10.1056/NEJMc200...
4,4,COVID-19,X-ray,nejmc2001573_f1a.jpeg,https://www.nejm.org/doi/full/10.1056/NEJMc200...


Function to check for data leakage between two sets 

In [None]:
def check_for_leakage(list1,list2):
  """
  Takes two lists of patient ids and prints "True" together with the
  shared ids if any patient appears in both lists, or "False" otherwise.
  """
  result=set(list1).intersection(set(list2))
  if len(result)>0:
    print("True")
    print("Common patient ids : ",result)
  else:
    print("False")
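
If the result of the check is needed programmatically (for example in an `assert`), a variant that returns the shared ids instead of printing them may be handy. The sketch below is an illustration on made-up patient ids; the names `find_leaked_patients` and `leakage_report` are hypothetical and not part of this notebook.

```python
from itertools import combinations

def find_leaked_patients(ids_a, ids_b):
    """Return the set of patient ids that appear in both collections."""
    return set(ids_a) & set(ids_b)

def leakage_report(splits):
    """splits maps a set name to its patient ids.
    Returns a dict mapping each pair of set names to the ids they share."""
    return {(a, b): find_leaked_patients(splits[a], splits[b])
            for a, b in combinations(splits, 2)}

# Toy example: patient 7 was (wrongly) placed in both train and test
toy_splits = {"train": [2, 4, 7], "validation": [10, 11], "test": [7, 12]}
report = leakage_report(toy_splits)
for pair, shared in report.items():
    print(pair, "->", shared if shared else "no leakage")
```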

Removing unnecessary columns and summarizing to see main data sources

In [None]:
# Dropping the modality and findings columns as all images are COVID X-rays
ieee_metadata=ieee_metadata.drop(labels=["finding","modality"],axis=1)

# Cleaning the source url so as to easily identify the source path
ieee_metadata['Source'] = ieee_metadata["url"].str.split("/", n = 3, expand = True)[2]
# Dropping the original url column
ieee_metadata=ieee_metadata.drop(labels=["url"],axis=1)

# Summarizing to see different sources
df_summary=ieee_metadata["Source"].value_counts().reset_index()
df_summary.columns=["Source","Counts"]

# Flagging sources with at least 10 entries
df_summary['prop'] = df_summary.Counts>=10

# Fetching the Main sources in a list
main_sources=list(df_summary[df_summary.prop==True].Source.values)
print("\nSources with more than 10 contributions : ",main_sources)
print(df_summary[df_summary.prop])


Sources with more than 10 contributions :  ['radiopaedia.org', 'github.com', 'www.sirm.org', 'www.sciencedirect.com', 'www.eurorad.org', 'link.springer.com', 'pubs.rsna.org', 'www.nejm.org']
                  Source  Counts  prop
0        radiopaedia.org     102  True
1             github.com      79  True
2           www.sirm.org      70  True
3  www.sciencedirect.com      39  True
4        www.eurorad.org      30  True
5      link.springer.com      24  True
6          pubs.rsna.org      17  True
7           www.nejm.org      17  True
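
The host extraction and the major/minor distinction can be illustrated on a tiny made-up table. This is a sketch under assumptions: the URLs are placeholders, `min_images` is a hypothetical name for the threshold, and `urllib.parse.urlparse` is shown as an alternative to splitting the string on "/".

```python
from urllib.parse import urlparse
import pandas as pd

# Placeholder URLs, not taken from metadata.csv
toy_metadata = pd.DataFrame({"url": [
    "https://example.org/cases/1",
    "https://example.org/cases/2",
    "https://www.example.com/articles/a",
]})

# Host name via the standard library
toy_metadata["Source"] = toy_metadata["url"].map(lambda u: urlparse(u).netloc)

# Same result with the string split used in the notebook
split_source = toy_metadata["url"].str.split("/", n=3, expand=True)[2]

# Count images per source and separate sources by a minimum size
source_counts = toy_metadata["Source"].value_counts()
min_images = 2  # the notebook uses 10; 2 keeps the toy example small
toy_main_sources = source_counts[source_counts >= min_images].index.tolist()
toy_minor_sources = source_counts[source_counts < min_images].index.tolist()
print(toy_main_sources, toy_minor_sources)
```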


Number of COVID-19 files expected in each dataset (training, validation and test)

In [None]:
# Setting up the number of COVID-19 files expected in each dataset 
num_covid_total_images=ieee_metadata.shape[0]
num_covid_train_images=int(train_fraction*num_covid_total_images)
num_covid_validation_images=int((num_covid_total_images-num_covid_train_images)/2)
num_covid_test_images=num_covid_total_images-num_covid_train_images-num_covid_validation_images
print("\nCOVID Images Distribution : ")
print("\nTraining Fraction : ",train_fraction)
print("Number of Total COVID Images : ",num_covid_total_images)
print("Number of Training Images : ",num_covid_train_images)
print("Number of Validation Images : ",num_covid_validation_images)
print("Number of Test Images : ",num_covid_test_images)


COVID Images Distribution : 

Training Fraction :  0.8
Number of Total COVID Images :  479
Number of Training Images :  383
Number of Validation Images :  48
Number of Test Images :  48
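
The arithmetic behind these numbers can be written as a small helper. This is a sketch that mirrors the calculation in the cell above; the function name `split_counts` is hypothetical.

```python
def split_counts(total, train_fraction=0.8):
    """Return (train, validation, test) sizes for a given total.

    The training size is rounded down, the remainder is halved (rounded
    down) for validation, and whatever is left goes to the test set.
    """
    n_train = int(train_fraction * total)
    n_validation = int((total - n_train) / 2)
    n_test = total - n_train - n_validation
    return n_train, n_validation, n_test

print(split_counts(479))  # (383, 48, 48)
```

For 479 images: 0.8 * 479 = 383.2, rounded down to 383. The remaining 96 images are divided equally, giving 48 for validation and 48 for test.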


### Performing Stratified Sampling on major sources
##### (We need to make sure that the most significant sources, those with >=10 images, are represented in roughly equal proportion in each of the sets)



Setting up some lists to store filenames and patient ids for train, test and validation sets. <br>
Patient ids for each set are stored so as to ensure no data leakage

In [None]:
train_ids=[]
train_covid_files=[]

validation_ids=[]
validation_covid_files=[]

test_ids=[]
test_covid_files=[]

Performing stratified random sampling and distributing images from main sources into train, test and validation sets

In [None]:
for source_name in main_sources:

  # Filtering only filenames from specified source into source_df
  source_df=ieee_metadata[ieee_metadata["Source"]==source_name]

  # Count files per patient and rename columns
  source_idCounts_df=source_df.patientid.value_counts().reset_index()
  source_idCounts_df.columns=["patientid","idCounts"]

  # Estimate how many source images are expected in each dataset 
  total_source_images=source_df.shape[0]
  num_Strain=int(train_fraction*total_source_images)
  num_Svalidation=int((total_source_images-num_Strain)/2)
  num_Stest=total_source_images-num_Strain-num_Svalidation

  # Add randomness by shuffling source_idCounts_df
  source_idCounts_df = source_idCounts_df.sample(frac=1).reset_index(drop=True)

  # Running total of files already assigned for this source
  # (named files_selected to avoid shadowing the built-in sum)
  files_selected=0

  # Start patient distribution for train, validation and test sets
  for i in range(source_idCounts_df.shape[0]):

    # Patient id, number of files for this patient, and the files themselves
    patient_id=source_idCounts_df.iloc[i]["patientid"]
    patient_count=int(source_idCounts_df.iloc[i]["idCounts"])
    patient_files=list(source_df[source_df["patientid"]==patient_id].filename.values)

    # Train set: adding this patient's files does not exceed
    # the estimated number of files for the training set
    if files_selected+patient_count<=num_Strain:
      train_ids.append(patient_id)
      train_covid_files.extend(patient_files)

    # Validation set: adding this patient's files does not exceed
    # the estimated number of files for training and validation combined
    elif files_selected+patient_count<=num_Strain+num_Svalidation:
      validation_ids.append(patient_id)
      validation_covid_files.extend(patient_files)

    # Test set: all remaining patients
    else:
      test_ids.append(patient_id)
      test_covid_files.extend(patient_files)

    files_selected+=patient_count
  

In [None]:
print("Number of training images from major sources : ",len(train_covid_files))
print("Total number of training files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_train_images)
print("Number of validation images from major sources : ",len(validation_covid_files))
print("Total number of validation files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_validation_images)
print("Number of test images from major sources : ",len(test_covid_files))
print("Total number of test files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_test_images)

Number of training images from major sources :  295
Total number of training files acc. to 80:10:10 split(images from major+minor sources) :  383
Number of validation images from major sources :  36
Total number of validation files acc. to 80:10:10 split(images from major+minor sources) :  48
Number of test images from major sources :  47
Total number of test files acc. to 80:10:10 split(images from major+minor sources) :  48


Checking for data leakage (using patient ids) between sets after sampling from major sources

In [None]:
# Checking for leakages up to now (should be False for all 3 cases)
print("Leakage between Training and Validation Set : ")
check_for_leakage(train_ids,validation_ids)
print("Leakage between Training and Test Set : ")
check_for_leakage(train_ids,test_ids)
print("Leakage between Test and Validation Set : ")
check_for_leakage(test_ids,validation_ids)

Leakage between Training and Validation Set : 
False
Leakage between Training and Test Set : 
False
Leakage between Test and Validation Set : 
False
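
The patient-level assignment used for the major sources can be illustrated on toy data. This is a simplified sketch, not the notebook's exact cell: the function name `assign_patients` and the patient ids are made up, and the list is assumed to be already shuffled.

```python
def assign_patients(image_counts, n_train, n_validation):
    """Assign whole patients to train, validation or test.

    image_counts: list of (patient_id, n_images), assumed already shuffled.
    A patient goes to train while the running total stays within n_train,
    then to validation while it stays within n_train + n_validation,
    and to test otherwise. A patient is never divided between sets.
    """
    assignment = {"train": [], "validation": [], "test": []}
    selected = 0
    for patient_id, n_images in image_counts:
        if selected + n_images <= n_train:
            assignment["train"].append(patient_id)
        elif selected + n_images <= n_train + n_validation:
            assignment["validation"].append(patient_id)
        else:
            assignment["test"].append(patient_id)
        selected += n_images
    return assignment

# 10 images in total, so an 80:10:10 split targets 8 / 1 / 1 images
toy_counts = [("p1", 4), ("p2", 3), ("p3", 1), ("p4", 1), ("p5", 1)]
toy_assignment = assign_patients(toy_counts, n_train=8, n_validation=1)
print(toy_assignment)
```

Because whole patients are moved, the resulting set sizes can deviate from the targets when a patient has many images, which is why the counts per set are compared with the expected numbers afterwards.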


### Sampling from remaining images i.e. the minor sources
#### The largest minor source has only 9 images in total, and 10% of that (the validation fraction in the 80:10:10 split) is less than 1 image. So we pool the minor sources, randomly shuffle their patients, and assign each patient's images to the training, validation and test sets in turn until each set is filled



First getting the minor sources.

In [None]:
# Fetching the Minor sources in a list
minor_sources=list(df_summary[df_summary.prop==False].Source.values)
print("Minor Sources : ",minor_sources)
print("Total number of sources : ",df_summary.shape[0])
print("Number of major sources : ",len(main_sources))
print("Number of minor sources : ",len(minor_sources))
assert(len(minor_sources)+len(main_sources)==df_summary.shape[0])
print(df_summary[~df_summary.prop].head())

Minor Sources :  ['radiologyassistant.nl', 'www.ams.edu.sg', 'www.rad2share.com', 'www.cureus.com', 'app.figure1.com', 'onlinelibrary.wiley.com', 'academic.oup.com', 'www.nature.com', 'www.ncbi.nlm.nih.gov', 'cases.rsna.org', 'journals.lww.com', 'www.thelancet.com', 'www.kjronline.org', 'www.heartrhythmcasereports.com', 'tropmedhealth.biomedcentral.com', 'www.ajtmh.org', 'www.aurisnasuslarynx.com', 'www.yxppt.com', 'www.jkms.org', 'www.onlinejcf.com', 'www.clinicalradiologyonline.net', 'www.ajronline.org', 'www.journalofhospitalinfection.com', 'www.jhltonline.org', 'journal.chestnet.org', 'www.thno.org', 'pubmed.ncbi.nlm.nih.gov', 'mmrjournal.biomedcentral.com', 'ann-clinmicrob.biomedcentral.com']
Total number of sources :  37
Number of major sources :  8
Number of minor sources :  29
                   Source  Counts   prop
8   radiologyassistant.nl       9  False
9          www.ams.edu.sg       8  False
10      www.rad2share.com       7  False
11         www.cureus.com       6  False

Creating a dataframe with the contributions from these sources. <br> Picking images from minor sources to populate the training, validation and test sets until they have the required number of images according to the 80:10:10 split, while ensuring that all X-rays of one patient go to only one set

In [None]:
# Filtering only filenames from minor source into minor_data
minor_data=ieee_metadata[ieee_metadata["Source"].isin(minor_sources)]

# Count files per patient and rename columns
minor_idCounts_df=minor_data.patientid.value_counts().reset_index()
minor_idCounts_df.columns=["patientid","idCounts"]

# Estimate how many minor-source images are expected in each dataset 
# (counting images in minor_data, not patients in minor_idCounts_df)
total_minor_images=minor_data.shape[0]
num_Strain=int(train_fraction*total_minor_images)
num_Svalidation=int((total_minor_images-num_Strain)/2)
num_Stest=total_minor_images-num_Strain-num_Svalidation

# Add randomness by shuffling minor_idCounts_df
minor_idCounts_df=minor_idCounts_df.sample(frac=1).reset_index(drop=True)

#Setting up counter
i = 0

#### Filling the training set
sum=0

# Finding how many images are necessary to finish filling it (train_gap)
train_gap=num_covid_train_images-len(train_covid_files)

# Filling until gap is closed
while(sum+int(minor_idCounts_df.iloc[i][1]) <= train_gap):
  sum+=int(minor_idCounts_df.iloc[i][1])
  id=minor_idCounts_df.iloc[i][0]
  train_ids.append(id)
  train_covid_files.extend(list(minor_data[minor_data["patientid"]==id].filename.values))
  i+=1

#### Filling the Validation set
sum=0

# Finding how many images are necessary to finish filling it (validation_gap)
validation_gap=num_covid_validation_images-len(validation_covid_files)

# Filling until gap is closed
while(sum+int(minor_idCounts_df.iloc[i][1])<=validation_gap):
  sum+=int(minor_idCounts_df.iloc[i][1])
  id=minor_idCounts_df.iloc[i][0]
  validation_ids.append(id)
  validation_covid_files.extend(list(minor_data[minor_data["patientid"]==id].filename.values))
  i+=1

### Filling the Test set
sum=0

# Finding how many images are necessary to finish filling it (test_gap)
test_gap=num_covid_test_images-len(test_covid_files)

# Filling until gap is closed, or breaking when there are no more patients to fill it
while(sum+int(minor_idCounts_df.iloc[i][1])<=test_gap):
  sum+=int(minor_idCounts_df.iloc[i][1])
  id=minor_idCounts_df.iloc[i][0]
  test_ids.append(id)
  test_covid_files.extend(list(minor_data[minor_data["patientid"]==id].filename.values))
  i+=1
  if i==minor_idCounts_df.shape[0]:  #Needed to avoid index error in last iteration
    break
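
Only the last loop above guards against running past the end of the patient table. A bounds-safe version of the same "fill until the gap is closed" step can be sketched as one reusable helper. This is an illustration on made-up data; `fill_until_gap` is a hypothetical name.

```python
def fill_until_gap(image_counts, start, gap):
    """Take whole patients from image_counts[start:] while they fit in gap.

    image_counts: list of (patient_id, n_images), assumed already shuffled.
    Returns (chosen_ids, next_index). Stops at the first patient that does
    not fit, or when the list is exhausted.
    """
    chosen = []
    used = 0
    i = start
    while i < len(image_counts) and used + image_counts[i][1] <= gap:
        chosen.append(image_counts[i][0])
        used += image_counts[i][1]
        i += 1
    return chosen, i

toy_counts = [("a", 2), ("b", 1), ("c", 3), ("d", 1)]
toy_train, next_i = fill_until_gap(toy_counts, 0, gap=3)           # takes a and b
toy_validation, next_i = fill_until_gap(toy_counts, next_i, gap=2)  # c does not fit
toy_test, next_i = fill_until_gap(toy_counts, next_i, gap=10)       # takes c and d
print(toy_train, toy_validation, toy_test, next_i)
```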

### Verifying results

In [None]:
print("Number of training images from major+minor sources : ",len(train_covid_files))
print("Total number of training files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_train_images)
print("Number of validation images from major+minor sources : ",len(validation_covid_files))
print("Total number of validation files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_validation_images)
print("Number of test images from major+minor sources : ",len(test_covid_files))
print("Total number of test files acc. to 80:10:10 split(images from major+minor sources) : ",num_covid_test_images)

Number of training images from major+minor sources :  382
Total number of training files acc. to 80:10:10 split(images from major+minor sources) :  383
Number of validation images from major+minor sources :  48
Total number of validation files acc. to 80:10:10 split(images from major+minor sources) :  48
Number of test images from major+minor sources :  47
Total number of test files acc. to 80:10:10 split(images from major+minor sources) :  48


### Checking for Data Leakage between our sets

In [None]:
# Checking for leakages (should be all False)
print("Leakage between Training and Validation Set : ")
check_for_leakage(train_ids,validation_ids)
print("Leakage between Training and Test Set : ")
check_for_leakage(train_ids,test_ids)
print("Leakage between Test and Validation Set : ")
check_for_leakage(test_ids,validation_ids)

Leakage between Training and Validation Set : 
False
Leakage between Training and Test Set : 
False
Leakage between Test and Validation Set : 
False


### Checking for no image repetition between sets just to be double sure

In [None]:
# Checking for image repetition among datasets just to be double sure
# (first line: number of unique files per set; the three intersection counts should all be 0)
print(len(set(train_covid_files)),len(set(validation_covid_files)),len(set(test_covid_files)))
print(len(set(train_covid_files).intersection(set(test_covid_files))))
print(len(set(train_covid_files).intersection(set(validation_covid_files))))
print(len(set(validation_covid_files).intersection(set(test_covid_files))))

382 48 47
0
0
0


### Copying Images to the final dataset into the respective sets

In [None]:
for filename in test_covid_files:
  shutil.copy(input_covid_directory+filename,test_directory+"COVID-19")

for filename in validation_covid_files:
  shutil.copy(input_covid_directory+filename,validation_directory+"COVID-19")

for filename in train_covid_files:
  shutil.copy(input_covid_directory+filename,train_directory+"COVID-19")

In [None]:
# Verifying the results
print("Number of COVID images in the training directory : ",len(os.listdir(train_directory+"COVID-19")))
print("Number of COVID images in the validation directory : ",len(os.listdir(validation_directory+"COVID-19")))
print("Number of COVID images in the test directory : ",len(os.listdir(test_directory+"COVID-19")))

assert(len(train_covid_files)==len(os.listdir(train_directory+"COVID-19")))
assert(len(validation_covid_files)==len(os.listdir(validation_directory+"COVID-19")))
assert(len(test_covid_files)==len(os.listdir(test_directory+"COVID-19")))

Number of COVID images in the training directory :  382
Number of COVID images in the validation directory :  48
Number of COVID images in the test directory :  47


## Simple sampling for NORMAL and PNEUMONIA_Bacterial

In [None]:
class_names=os.listdir(input_other_directory)

for class_name in class_names:
  path=input_other_directory+class_name+"/"
  class_images_names=os.listdir(path)
  np.random.shuffle(class_images_names)
  num_class_images=len(class_images_names)
  print(num_class_images)
  num_train_images=int(train_fraction*num_class_images)
  num_validation_images=int((num_class_images-num_train_images)/2)
  num_test_images=num_class_images-num_train_images-num_validation_images
  print(num_train_images,num_validation_images,num_test_images)

  if class_name=="PNEUMONIA_Bacterial":
    class_name="PNEUMONIA BACTERIA"       #Just handling a mismatch in folder names
  # Copying images
  for i in range(num_train_images):
    shutil.copy(path+class_images_names[i],train_directory+class_name+"/")
  for i in range(num_validation_images):
    shutil.copy(path+class_images_names[num_train_images+i],validation_directory+class_name+"/")
  for i in range(num_test_images):
    shutil.copy(path+class_images_names[num_train_images+num_validation_images+i],test_directory+class_name+"/")

490
392 49 49
490
392 49 49
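
The simple random split itself can be illustrated without touching any files. This is a minimal sketch on made-up file names; `random_split` is a hypothetical helper, and it uses NumPy's `default_rng` with an explicit seed instead of the global `np.random.seed` used in this notebook, so the resulting order differs from the cell above.

```python
import numpy as np

def random_split(items, train_fraction=0.8, seed=42):
    """Shuffle items and cut them into train / validation / test lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    shuffled = [items[k] for k in order]

    n_train = int(train_fraction * len(shuffled))
    n_validation = int((len(shuffled) - n_train) / 2)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test

# 490 placeholder file names, the same class size as above
toy_files = [f"img_{k:03d}.jpeg" for k in range(490)]
toy_train, toy_validation, toy_test = random_split(toy_files)
print(len(toy_train), len(toy_validation), len(toy_test))  # 392 49 49
```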


## Finally Verifying the results

In [None]:
# The numbers should look familiar.
class_names=["COVID-19","NORMAL","PNEUMONIA BACTERIA"]
for class_name in class_names:
  print("\n\nClass : ",class_name)
  print("Number of images in train directory : ",len(os.listdir(train_directory+class_name)))
  print("Number of images in validation directory : ",len(os.listdir(validation_directory+class_name)))
  print("Number of images in test directory : ",len(os.listdir(test_directory+class_name)))



Class :  COVID-19
Number of images in train directory :  382
Number of images in validation directory :  48
Number of images in test directory :  47


Class :  NORMAL
Number of images in train directory :  392
Number of images in validation directory :  49
Number of images in test directory :  49


Class :  PNEUMONIA BACTERIA
Number of images in train directory :  392
Number of images in validation directory :  49
Number of images in test directory :  49
