# Extract and Sort CXR Testing and Training Images from the NIH CXR Dataset

- This code extracts the CXR image PNG files from the NIH CXR dataset available on [Kaggle](https://www.kaggle.com/datasets/nih-chest-xrays/data). 

- The code assumes that the parent folder for all files downloaded from the dataset is named `NIH-CXR/`. 

- This code will sort images regardless of where they are placed or how deeply they are nested inside the parent folder. 

- The sorting is based on the list of images provided in the `test_list.txt` and the `train_val_list.txt` files. 

- Images are sorted to target folders named `test_images` and `train_images` respectively. 

- The code file has not been run as of 6/30. *** PLEASE UPDATE UPON VALIDATION ***

- The testing image set was sorted using the raw, unrefactored file. Validation can be seen in the version history of this file on GitHub (note: please excuse the mess). 

- The `image_extraction()` function takes three arguments and prints status updates showing the running total of images sorted.
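
As a rough illustration of how files can be located regardless of nesting, the sketch below walks a small throwaway folder tree with `os.walk` and collects the paths of the wanted file names. The helper name `find_wanted_files`, the folder names, and the sample file names are hypothetical and are not part of this notebook; this is a minimal sketch of the idea, not the notebook's extraction routine.

```python
import os
import tempfile


def find_wanted_files(root, wanted_names):
    """Return {file name: full path} for every wanted file found under root."""
    wanted = set(wanted_names)
    found = {}
    # os.walk yields one (dirpath, dirnames, filenames) tuple per folder visited
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in wanted:
                found[name] = os.path.join(dirpath, name)
    return found


# Build a tiny nested tree to try it on (hypothetical folder and file names)
demo_root = tempfile.mkdtemp()
nested = os.path.join(demo_root, "images_001", "images")
os.makedirs(nested)
for name in ("a.png", "b.png", "c.png"):
    with open(os.path.join(nested, name), "w") as handle:
        handle.write("placeholder")

demo_found = find_wanted_files(demo_root, ["a.png", "c.png", "missing.png"])
print(sorted(demo_found))
```

Names that are requested but never found (here `missing.png`) are simply absent from the result, so comparing the result against the request list shows what was not located.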

In [1]:
# Import Dependencies 
import os 
import shutil as sh 
import collections as coll

### Define Paths for Prerequisite Items: Source Folders, Info, Target Folders. 

In [2]:
# Define path for parent folder (note: this points at ../CODE_TEST/, not the NIH-CXR/ folder named in the notes above)
parent = os.path.join("../CODE_TEST/")

In [3]:
# Define the path for the text files for train and test sets. 

test_file = os.path.join("test_list.txt")
train_file = os.path.join("train_val_list.txt")

In [4]:
# Make destination folders for sorted images (no error if they already exist) 
os.makedirs("test_images/", exist_ok=True)
os.makedirs("train_images/", exist_ok=True)

In [5]:
# Define paths to Destination folders 
test_folder = os.path.join("test_images/")
train_folder = os.path.join("train_images/")

### Define Functions for Extraction and QC

In [6]:
# Create a function to open and extract the list of desired images from each text file. 

def get_desired_images(txt_file):

    # Open the file; the with block closes it automatically, even on error
    with open(txt_file, "r") as open_file:

        # Turn each line into a list item. splitlines() avoids a trailing
        # empty item when the file ends with a newline.
        listify = open_file.read().splitlines()

    return listify
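
The difference between splitting on `"\n"` and using `splitlines()` only shows up when the text ends with a newline. The short sketch below uses a made-up two-line string (the file names are hypothetical) to show the extra empty item that `split("\n")` produces in that case, which would inflate the list length by one.

```python
# Made-up file contents that end with a newline (hypothetical names)
sample_text = "first.png\nsecond.png\n"

split_items = sample_text.split("\n")   # keeps an empty string after the last newline
line_items = sample_text.splitlines()   # drops it

print(split_items)
print(line_items)
```

An empty item in the list of desired images can never match a file on disk, so it would also make a later exact comparison against the copied files fail.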

In [15]:
# Create Function to Sort Images 
# Takes in the list of desired images, destination folder and parent folder as arguments. 

def image_extraction(desired_image_list, destination_folder, parent_folder):

    # Initialize counter
    counter = 0

    # Use a set so each membership check is fast
    desired_images = set(desired_image_list)

    # Walk every folder under the parent folder; os.walk yields a
    # (dirpath, dirnames, filenames) tuple for each folder it visits
    for dirpath, _dirnames, filenames in os.walk(parent_folder):

        # For the images/files in the current folder
        for image in filenames:

            # If they are in the list of desired images
            if image in desired_images:

                # Build the path from the folder the image was found in
                source = os.path.join(dirpath, image)

                # Create a copy of the image in the destination folder
                sh.copy(source, destination_folder)

                # Update counter
                counter += 1

                # Print status
                print(f"{counter} / {len(desired_image_list)} images sorted ... ", end="\r")


In [8]:
# Function to confirm the copied files match

def qc_copy_folder(desired_image_list, destination_folder):

    # Get a list of all the files in the current folder of copied images 
    copied_files = os.listdir(destination_folder)
    
    # Check to see if they match
    # print() returns None, so the message is printed without a return value
    if coll.Counter(copied_files) == coll.Counter(desired_image_list):
        print("You got a star! All files match.")
    else:
        print("This ain't it chief!")
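
When the check fails, a pass/fail message alone does not say what differs. One possible follow-up, sketched below, is to compare the two collections as sets and report which names are missing from the destination and which are unexpected. The helper `describe_mismatch` and the sample names are hypothetical and not part of this notebook. Note that sets ignore duplicate entries, so this complements the `Counter` comparison rather than replacing it.

```python
def describe_mismatch(desired_image_list, copied_files):
    """Return (missing, unexpected) file names as sorted lists."""
    desired = set(desired_image_list)
    copied = set(copied_files)
    missing = sorted(desired - copied)      # wanted but not in the folder
    unexpected = sorted(copied - desired)   # in the folder but not wanted
    return missing, unexpected


demo_missing, demo_unexpected = describe_mismatch(
    ["a.png", "b.png", "c.png"],       # hypothetical list of desired images
    ["a.png", "c.png", ".DS_Store"],   # hypothetical folder listing
)
print("missing:", demo_missing)
print("unexpected:", demo_unexpected)
```

In the notebook, the second argument would presumably be `os.listdir(destination_folder)`.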

### Execute Functions 

In [9]:
# Get the test set. Confirm size is [test_list = 25596]
test_list = get_desired_images(test_file)
print(len(test_list))

25596


In [10]:
# Get the train set. Confirm size is [train_val_list = 86524]
train_list = get_desired_images(train_file)
print(len(train_list))

86524
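
Beyond checking the sizes, it may be worth confirming that the two lists share no file names and contain no duplicates, since either would make an exact comparison against the copied folders harder to interpret. The sketch below is one way to do that; `check_split_lists` and the sample names are hypothetical and not part of this notebook.

```python
def check_split_lists(test_names, train_names):
    """Return simple sanity checks on two lists of file names."""
    test_set, train_set = set(test_names), set(train_names)
    return {
        # A set is smaller than its list only when the list has repeats
        "test_has_duplicates": len(test_set) != len(test_names),
        "train_has_duplicates": len(train_set) != len(train_names),
        # Names that appear in both lists
        "overlap": sorted(test_set & train_set),
    }


# Hypothetical sample lists with one shared name and one repeat
demo_report = check_split_lists(
    ["a.png", "b.png"],
    ["c.png", "b.png", "c.png"],
)
print(demo_report)
```

In the notebook, the call would presumably be `check_split_lists(test_list, train_list)`.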


In [17]:
# Perform the task on test_list 
image_extraction(test_list, test_folder, parent)

In [12]:
# Perform QC on the copied files 
qc_copy_folder(test_list, test_folder)

This ain't it chief!


In [13]:
# Perform the task on train_list 
image_extraction(train_list, train_folder, parent)

In [14]:
# Perform QC on the copied files 
qc_copy_folder(train_list, train_folder)

This ain't it chief!
