# Extract and Sort CXR Testing and Training Images from the NIH CXR Dataset

- This code extracts the CXR image PNG files from the NIH CXR dataset available on [Kaggle](https://www.kaggle.com/datasets/nih-chest-xrays/data).

- The code assumes that the parent folder for all files downloaded from the dataset is named `NIH-CXR/`.

- The extraction targets the folders whose names match `r"images_\d+"`; each of these folders contains a nested `images/` sub-folder.

- The code assumes that this code file is located at the same level as the `r"images_\d+"` folders. Please change directories accordingly if the layout is different.

- The sorting is based on the list of images provided in the `test_list.txt` and the `train_val_list.txt` files. 

- Images are sorted to target folders named `test_images` and `train_images` respectively. 

- In this file, the code was run on the training images only, and the destination directories had already been created.

- The testing image set was sorted with the raw, unrefactored file. Validation can be seen in the version history of this file on GitHub (note: please excuse the mess).

- The `image_extraction()` function takes three arguments and prints a status update once per image folder. Extracting the training images from the `images_001/` folder took around 2.45 minutes before a status message was displayed.

### Import Dependencies 

In [1]:
# Import Dependencies 
import os 
import shutil as sh 
import re 
import collections as coll

### Define Paths for Prerequisite Items: Source Folders, Info. and Target Folders

In [2]:
# Define the path for the text files for train and test sets. 

test_file = os.path.join("test_list.txt")
train_file = os.path.join("train_val_list.txt")

In [None]:
# Make destination folders for sorted images (no error if they already exist)
os.makedirs("test_images/", exist_ok=True)
os.makedirs("train_images/", exist_ok=True)

In [4]:
# Define paths to Destination folders 
test_folder = os.path.join("test_images/")
train_folder = os.path.join("train_images/")

In [5]:
# Define the path for the central folder where image files are stored
parent = os.path.join("../NIH-CXR/")

# Get all folder names in the parent directory.
all_folders = [f.name for f in os.scandir(parent) if f.is_dir()]

# Keep only the image folder names that match the pattern: images_\d+
image_folders = [x for x in all_folders if re.match(r"images_\d+", x)]
print(image_folders)

['images_001', 'images_002', 'images_003', 'images_004', 'images_005', 'images_006', 'images_007', 'images_008', 'images_009', 'images_010', 'images_011', 'images_012']
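
A note on the folder filter above: `re.match` only anchors at the start of the name, so a folder such as `images_001_backup` would also pass, and `os.scandir` does not guarantee any ordering. The sketch below is one possible stricter variant, not the code that produced the output above. The helper name `find_image_folders` and the demo folder names are hypothetical.

```python
import os
import re
import tempfile


def find_image_folders(parent_dir):
    """Return sub-folder names that are exactly 'images_<digits>', sorted by name."""
    pattern = re.compile(r"images_\d+")
    names = [
        entry.name
        for entry in os.scandir(parent_dir)
        # fullmatch requires the whole name to fit the pattern
        if entry.is_dir() and pattern.fullmatch(entry.name)
    ]
    return sorted(names)


# Demo on a throwaway directory (folder names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["images_002", "images_001", "images_001_backup", "notes"]:
        os.mkdir(os.path.join(tmp, name))
    print(find_image_folders(tmp))  # ['images_001', 'images_002']
```

Sorting keeps the per-folder status messages in a predictable order on any file system.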


### Define Functions for Extraction and QC

In [6]:
# Create a function to open and extract the list of desired images from each text file. 

def get_desired_images(txt_file):

    # Open the file; the with-block closes it automatically, even on error.
    with open(txt_file, "r") as open_file:

        # Read the contents and turn each line into a separate list item.
        # splitlines() avoids an empty last item if the file ends with a newline.
        listify = open_file.read().splitlines()

    return listify

In [7]:
# Define function for the extraction and sorting process 

def image_extraction(desired_image_list, destination_folder, source_folders):

    # Make a copy of the list of images to search
    images_list  = desired_image_list.copy()
    total_images = len(images_list)

    # Create Counters
    total_count = 0

    # Iterate over the folders
    for item in source_folders: 

        # Print Working Status 
        print(f"***** Scanning folder: {item} *****")

        # Define Source image path folder to scan (Note: there is a secondary images/ folder)
        source_image_path = f"{item}/images/"
    
        # Loop through the files in directory   
        for f in os.listdir(source_image_path):
            
            # Filter the list of files in the current folder as per images list.   
            files = [image for image in images_list if image == f]
            
            # Copy from the list of files to new folder and increase counter.  
            for file in files: 
                sh.copy(os.path.join(source_image_path, file), destination_folder)
                total_count += 1 
            
        # Running total copies 
        print(f"Total Files copied thus far:{total_count}/{total_images}")
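
The function above compares every file in a folder against the whole list of desired images, which is a plausible (but unmeasured) reason for the long wait per folder. A minimal sketch of an alternative, assuming the list contains no duplicate names, is to put the names in a `set` so each lookup takes roughly constant time. The name `image_extraction_fast` and the demo files are hypothetical; this is not the code that produced the output in this file.

```python
import os
import shutil
import tempfile


def image_extraction_fast(desired_image_list, destination_folder, source_folders):
    """Copy the desired images into destination_folder and return how many were copied."""
    # Assumption: names are unique, so a set loses no information.
    wanted = set(desired_image_list)
    total_images = len(wanted)
    total_count = 0

    for folder in source_folders:
        print(f"***** Scanning folder: {folder} *****")
        # Each image folder holds a nested images/ sub-folder.
        source_image_path = os.path.join(folder, "images")

        for name in os.listdir(source_image_path):
            if name in wanted:  # set membership test instead of scanning a list
                shutil.copy(os.path.join(source_image_path, name), destination_folder)
                total_count += 1

        print(f"Total files copied thus far: {total_count}/{total_images}")

    return total_count


# Demo on a throwaway layout (file names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "images_001")
    os.makedirs(os.path.join(src, "images"))
    dest = os.path.join(tmp, "train_images")
    os.mkdir(dest)
    for name in ["a.png", "b.png", "c.png"]:
        open(os.path.join(src, "images", name), "w").close()

    image_extraction_fast(["a.png", "c.png"], dest, [src])
    print(sorted(os.listdir(dest)))  # ['a.png', 'c.png']
```

Unlike the original, this variant copies a file at most once even if its name is listed twice, so the counter can differ when the list has duplicates.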

In [8]:
# Function to confirm the copied files match

def qc_copy_folder(desired_image_list, destination_folder):

    # Get a list of all the files in the current folder of copied images 
    copied_files = os.listdir(destination_folder)
    
    # Check to see if they match
    if coll.Counter(copied_files) == coll.Counter(desired_image_list):
        print("You got a star! All files match.")
    else:
        print("This ain't it chief!")
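
The check above reports only pass or fail. When it fails, it can help to see which names differ. The sketch below is one way to do that with set differences; the helper name `qc_report` and the demo file names are hypothetical, and it assumes the destination folder contains only the copied images.

```python
import os
import tempfile


def qc_report(desired_image_list, destination_folder):
    """Return (missing, unexpected) file names, each as a sorted list."""
    copied = set(os.listdir(destination_folder))
    wanted = set(desired_image_list)
    missing = sorted(wanted - copied)      # listed but not found in the folder
    unexpected = sorted(copied - wanted)   # found in the folder but not listed
    return missing, unexpected


# Demo on a throwaway folder (file names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.png", "b.png", "stray.png"]:
        open(os.path.join(tmp, name), "w").close()

    missing, unexpected = qc_report(["a.png", "b.png", "c.png"], tmp)
    print("Missing:", missing)        # Missing: ['c.png']
    print("Unexpected:", unexpected)  # Unexpected: ['stray.png']
```

Two empty lists correspond to the "All files match" case, provided the desired list has no duplicate or blank entries.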

### Execute Functions

In [9]:
# Get the test set. Confirm the size is [test_list = 25596]
test_list = get_desired_images(test_file)
print(len(test_list))

25596


In [10]:
# Get the train set. Confirm the size is [train_val_list = 86524]
train_list = get_desired_images(train_file)
print(len(train_list))

86524


In [None]:
# Perform the task on test_list 
image_extraction(test_list, test_folder, image_folders)

In [None]:
# Perform QC on the copied files 
qc_copy_folder(test_list, test_folder)

In [11]:
# Perform the task on train_list
image_extraction(train_list, train_folder, image_folders)

***** Scanning folder: images_001 *****
Total Files copied thus far:4032/86524
***** Scanning folder: images_002 *****
Total Files copied thus far:12540/86524
***** Scanning folder: images_003 *****
Total Files copied thus far:20931/86524
***** Scanning folder: images_004 *****
Total Files copied thus far:29394/86524
***** Scanning folder: images_005 *****
Total Files copied thus far:37275/86524
***** Scanning folder: images_006 *****
Total Files copied thus far:44597/86524
***** Scanning folder: images_007 *****
Total Files copied thus far:51820/86524
***** Scanning folder: images_008 *****
Total Files copied thus far:59330/86524
***** Scanning folder: images_009 *****
Total Files copied thus far:67091/86524
***** Scanning folder: images_010 *****
Total Files copied thus far:74937/86524
***** Scanning folder: images_011 *****
Total Files copied thus far:82995/86524
***** Scanning folder: images_012 *****
Total Files copied thus far:86524/86524


In [12]:
# Perform QC on the copied files 
qc_copy_folder(train_list, train_folder)

You got a star! All files match.
