# 1  Automated data cleaning for chest X-rays with cleanX: a notebook for medical professionals with limited coding abilities

cleanX is a code library by Candace Makeda Moore, MD, Oleg Sivokon, and Andrew Murphy. Please note that this workflow does not cover the whole scope of cleanX; it is only meant to show some of what cleanX can do.

The purpose of this notebook is to show people with very limited understanding of machine learning and code some of what cleanX does, and why it is worth using.

In [None]:
import sys
sys.path = ['D:/projects/cleanX'] + sys.path
# we will need to import some libraries
import pandas as pd
import os
from cleanX import (
    dataset_processing as csvp,
    dicom_processing as dicomp,
    image_work as iwork,
)

Reading and analyzing chest X-rays is a common task in hospitals. In fact, in many hospitals so many chest X-rays are performed that some are never read, and some are read only by people with limited training in radiology. Some countries have very few radiologists, so radiographers read the chest X-rays. Regardless of who reads these images, they can be difficult to interpret, and the medical literature reports a high error rate (over 10% or even over 15%, depending on the source). Machine learning algorithms have the potential to improve this situation in a variety of ways. However, they are powered by mountains of labeled data, and this need for labeled data creates a potential problem.

Labeled data must either be retrieved from existing read X-rays (errors included) or created by humans, who are already over-burdened with reading X-rays (that was the original problem in the first place, right?). Several groups have created big datasets that algorithms can be trained on, but no dataset is perfect for every task. Unfortunately, many datasets contain images that may not be appropriate for building a machine learning algorithm. As a case in point, let's take a look at some of the data in a large set of COVID-19 images. We will use the CoronaHack -Chest X-Ray-Dataset from Kaggle, assembled by Praveen Govindaraj. This dataset has thousands of images, too many to look through by hand without wasting a lot of time.

In [None]:
origin_folder ='D:/my_academia/ncs/Coronahack-Chest-XRay-Dataset/Coronahack-Chest-XRay-Dataset/train/'

## Finding duplicates
We may or may not want to use duplicated images to build an algorithm. Generally, it's a bad idea. At the extreme, if all the images of one pathology are simply duplicates, we do not have enough data. Instead of trying to remember whether we have seen duplicates among thousands of images, let's ask cleanX. cleanX compares the images pixel by pixel, which takes time if you don't have a powerful computer, but it doesn't take human time. We can take a much-needed break!
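To make "pixel by pixel" concrete, here is a minimal sketch of the idea on tiny made-up arrays. It is not cleanX's implementation; the helper name `same_pixels` and the toy images are assumptions used only for illustration.

```python
import numpy as np

def same_pixels(image_a, image_b):
    """Return True when two images have the same shape and identical pixel values."""
    return image_a.shape == image_b.shape and bool(np.array_equal(image_a, image_b))

# three toy 2x2 grayscale "images" (hypothetical data, not from the dataset)
first = np.array([[0, 255], [128, 64]], dtype=np.uint8)
copy_of_first = first.copy()
different = np.array([[0, 255], [128, 65]], dtype=np.uint8)

print(same_pixels(first, copy_of_first))
print(same_pixels(first, different))
```

A single pixel that differs by one gray level is enough for the check to report that the images are not the same.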

In [None]:
found = iwork.find_duplicated_images_todf(origin_folder)
len(found[found.status == 'duplicated'])  

OK, so we may have 26 duplicates. Not so bad out of thousands of pictures. Let's pull up a list so we can check them by hand.

In [None]:
found[found.status == 'duplicated']

In [None]:
wierd_images = found[found.status == 'duplicated']
wierd_images_list = wierd_images.images.to_list()

# we need the full file path
final_names = []
for image_string in wierd_images_list:
    final_names.append(os.path.join(origin_folder, image_string))

In [None]:
iwork.show_images_in_df(final_names,19)

In [None]:
# compare one image to every image in a list; the lowest score is the closest match
import cv2
import numpy as np
image1 = 'person1372_bacteria_3502.jpeg'
image1name = os.path.join(origin_folder, image1)
compare_list = final_names
image1image = cv2.imread(image1name)
results = []
pictures = []
width, height = image1image.shape[1], image1image.shape[0]
dim = (width, height)
for picture in compare_list:
    image_there = cv2.imread(picture)
    resized = cv2.resize(image_there, dim, interpolation=cv2.INTER_AREA)
    # absdiff avoids the wrap-around that plain subtraction of uint8 arrays produces
    result = cv2.absdiff(resized, image1image)
    result_sum = np.sum(result)
    results.append(result_sum)
    pictures.append(picture)

In [None]:
d = {'results':results,'pictures':pictures}
ho = pd.DataFrame(d)

In [None]:
ho

Interesting, some of our duplicated pictures appear to have been triplicated, and we get two of the same duplicate.
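The comparison of one image against a list can also be wrapped in a small function. The following is a sketch under stated assumptions: the name `closest_image` is hypothetical, the images are already loaded as arrays of the same shape (so the resizing step is left out), and the score is the sum of absolute pixel differences.

```python
import numpy as np

def closest_image(reference, candidates):
    """Return (index, score) of the candidate most similar to reference.

    The score is the sum of absolute pixel differences; 0 means identical.
    All arrays are assumed to have the same shape already.
    """
    scores = []
    for candidate in candidates:
        # cast to a signed type first: subtracting uint8 arrays directly wraps around
        difference = np.abs(reference.astype(np.int64) - candidate.astype(np.int64))
        scores.append(int(difference.sum()))
    best = int(np.argmin(scores))
    return best, scores[best]

# toy data (hypothetical): the second candidate is almost the same as the reference
reference = np.array([[10, 20], [30, 40]], dtype=np.uint8)
candidates = [
    np.array([[200, 20], [30, 40]], dtype=np.uint8),  # one pixel far off
    np.array([[10, 21], [30, 40]], dtype=np.uint8),   # one pixel off by 1
]
best_index, best_score = closest_image(reference, candidates)
print(best_index, best_score)
```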

## Finding outlier images

Now let's move on to checking whether all of our images really are similarly shot chest X-rays, or whether some nonsense slipped in. We can use one of several methods with cleanX:

In [None]:
# in a chest X-ray we expect more white on top (the abdomen is bigger than the neck); let's see where that is not true
upper_lower_returned = iwork.find_sample_upper_greater_than_lower(origin_folder, 10)
# let's look at a sample of the upper part of the images and see if there are outliers
upper_scan_returned = iwork.find_by_sample_upper(origin_folder, 10, 200)
# let's compare each image to an average of all images, and take the most different
tiny_image_different = iwork.find_tiny_image_differences(origin_folder, percentile=1)
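As a rough illustration of the idea behind the first check, the sketch below compares the mean brightness of a strip at the top of an image with a strip at the bottom. This is not cleanX's code; the function name `upper_brighter_than_lower`, the `percent` parameter, and the toy arrays are assumptions.

```python
import numpy as np

def upper_brighter_than_lower(image, percent=10):
    """Compare mean brightness of the top and bottom strips of a grayscale image.

    percent is the height of each strip as a percentage of the image height.
    """
    strip = max(1, image.shape[0] * percent // 100)
    upper_mean = image[:strip].mean()
    lower_mean = image[-strip:].mean()
    return bool(upper_mean > lower_mean)

# toy 10x4 image: dark rows at the top, bright rows at the bottom
toy = np.zeros((10, 4), dtype=np.uint8)
toy[5:] = 200
flipped = toy[::-1]  # the same image turned upside down

print(upper_brighter_than_lower(toy))
print(upper_brighter_than_lower(flipped))
```

An image that gives the opposite answer from most of the dataset is a candidate for a flipped or otherwise unusual picture and is worth checking by eye.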

In [1]:
# libraries
import cv2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os 

#import shutil
from PIL import Image, ImageOps
import math
import filecmp
import tesserocr
from tesserocr import PyTessBaseAPI
from filecmp import cmp
from pathlib import Path
import re
import makedalytics as ma

In [8]:
imageAr = cv2.imread(imageA)
imageAr.size

2521728

In [11]:
def image_quality_by_size(specific_image):
    q = os.stat(specific_image).st_size
    return q

In [12]:
image_quality_by_size(imageA)

79645

In [13]:
imageA = target_upside_down + '/covercleanXdistort.jpg'
print(type(os.stat(imageA).st_size))


<class 'int'>


In [None]:
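def check_img_quality(directory, imageA, imageB, lower_res):
    # file size on disk is used as a rough proxy for image quality
    size_imgA = os.stat(os.path.join(directory, imageA)).st_size
    size_imgB = os.stat(os.path.join(directory, imageB)).st_size
    # record the smaller (presumably lower-quality) file of the pair
    if size_imgA > size_imgB:
        lower_res.append(imageB)
    else:
        lower_res.append(imageA)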
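`check_img_quality` treats file size on disk as a rough proxy for image quality. The sketch below shows the same comparison on two throwaway files so it can be run anywhere; the helper name `smaller_file` and the file names are hypothetical.

```python
import os
import tempfile

def smaller_file(path_a, path_b):
    """Return whichever of the two paths has the smaller size on disk."""
    if os.stat(path_a).st_size > os.stat(path_b).st_size:
        return path_b
    return path_a

# two throwaway files standing in for a large and a small image (hypothetical data)
with tempfile.TemporaryDirectory() as folder:
    big = os.path.join(folder, 'big.bin')
    small = os.path.join(folder, 'small.bin')
    with open(big, 'wb') as handle:
        handle.write(b'x' * 1000)
    with open(small, 'wb') as handle:
        handle.write(b'x' * 10)
    lower_quality = os.path.basename(smaller_file(big, small))

print(lower_quality)
```

File size depends on compression as well as on resolution, so this is only a heuristic.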

In [2]:
# get images and group by size
target_upside_down= 'D:/my_academia/new_dicom_output'
to_be_sorted = glob.glob(os.path.join(target_upside_down, '*.jpg'))
pic_list = []
heights = []
widths = []
dimension_groups = []
# group = 0
for picy in to_be_sorted:
    
    example = cv2.imread(picy, cv2.IMREAD_GRAYSCALE)
    height = example.shape[0]
    width = example.shape[1]
    height_width= 'h'+str(height) + '_w' + str(width)
    heights.append(height)    
    widths.append(width)
    pic_list.append(picy)
    dimension_groups.append(height_width) 
    #if height, width == height, width
    #group += 1
# build the dataframe once, after every image has been measured
d = {'pics' : pic_list, 'height': heights, 'width': widths, 'height_width':dimension_groups}
data = pd.DataFrame(d)

In [None]:
data

In [None]:
# sorted_f = pd.DataFrame(data.groupby(data.height_width))
# sorted_f

In [None]:
data = data.sort_values('height_width')

In [None]:
#dict_of_dfs = {f'data{i}':data[['pics','height','width', i]] for i in data.columns[3:]}

In [None]:
#data.columns[3:]

In [None]:
#dict_of_dfs

In [None]:
compuniquesizes = data.height_width.unique()

In [None]:
sizesdict = {elem : pd.DataFrame() for elem in compuniquesizes}

In [None]:
for key in sizesdict.keys():
    sizesdict[key] = data[:][data.height_width == key]

In [None]:
sizesdict['h704_w1194']

In [None]:
compuniquesizes

In [None]:
compuniquesizes, sizesdict

In [None]:
for sized in compuniquesizes:
    print(sized)
    print(len(sizesdict[sized]))
    print(sizesdict[sized])

In [None]:
len_list = []
size_name_list = []
for sized in compuniquesizes:
    lener= len(sizesdict[sized])
    len_list.append(lener)
    size_name_list.append(sized)
sized_data = {'size':size_name_list, 'count':len_list}
df = pd.DataFrame(sized_data)
    #print(len(sizesdict[sized]))

In [None]:
df

In [None]:
def give_me_size_count_df(folder):
    """
    This function returns a dataframe of the unique image sizes and how many
    pictures have each size.
    :param folder: folder with jpgs
    :type folder: string

    :return: df
    :rtype: pandas.core.frame.DataFrame
    """
    to_be_sorted = glob.glob(os.path.join(folder, '*.jpg'))
    pic_list = []
    heights = []
    widths = []
    dimension_groups = []
    for picy in to_be_sorted:
        example = cv2.imread(picy, cv2.IMREAD_GRAYSCALE)
        height = example.shape[0]
        width = example.shape[1]
        height_width= 'h'+str(height) + '_w' + str(width)
        heights.append(height)    
        widths.append(width)
        pic_list.append(picy)
        dimension_groups.append(height_width) 
    # build the dataframe once, after the loop has measured every image
    d = {'pics' : pic_list, 'height': heights, 'width': widths, 'height_width':dimension_groups}
    data = pd.DataFrame(d)
    data = data.sort_values('height_width')
    compuniquesizes = data.height_width.unique()
    len_list = []
    size_name_list = []
    sizesdict = {elem : pd.DataFrame() for elem in compuniquesizes}
    for key in sizesdict.keys():
        sizesdict[key] = data[:][data.height_width == key]
    for sized in compuniquesizes:
        lener= len(sizesdict[sized])
        len_list.append(lener)
        size_name_list.append(sized)
    sized_data = {'size':size_name_list, 'count':len_list}
    df = pd.DataFrame(sized_data)
    return df
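The same size-count table can be produced more compactly with pandas' `value_counts`. This is a sketch, not part of cleanX: the name `size_counts` is hypothetical, and it takes a list of `(height, width)` pairs instead of reading image files so that it runs without any data on disk.

```python
import pandas as pd

def size_counts(shapes):
    """Count how many images share each (height, width) pair."""
    labels = [f'h{height}_w{width}' for height, width in shapes]
    counts = pd.Series(labels).value_counts()
    return pd.DataFrame({'size': counts.index, 'count': counts.values})

# hypothetical image shapes; in the notebook they come from cv2.imread(...).shape
toy_shapes = [(704, 1194), (704, 1194), (512, 512)]
size_table = size_counts(toy_shapes)
print(size_table)
```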
    

In [None]:
target_upside_down= 'D:/my_academia/new_dicom_output'
print(give_me_size_count_df(target_upside_down))

In [None]:
def give_me_size_counted_dfs(folder):
    """
    This function returns a list of dataframes, one for each unique image size
    :param folder: folder with jpgs
    :type folder: string

    :return: big_sizer
    :rtype: list
    """
    to_be_sorted = glob.glob(os.path.join(folder, '*.jpg'))
    pic_list = []
    heights = []
    widths = []
    dimension_groups = []
    for picy in to_be_sorted:
        example = cv2.imread(picy, cv2.IMREAD_GRAYSCALE)
        height = example.shape[0]
        width = example.shape[1]
        height_width= 'h'+str(height) + '_w' + str(width)
        heights.append(height)    
        widths.append(width)
        pic_list.append(picy)
        dimension_groups.append(height_width) 
    # build the dataframe once, after the loop has measured every image
    d = {'pics' : pic_list, 'height': heights, 'width': widths, 'height_width':dimension_groups}
    data = pd.DataFrame(d)
    data = data.sort_values('height_width')
    compuniquesizes = data.height_width.unique()
    sizesdict = {elem : pd.DataFrame() for elem in compuniquesizes}
    for key in sizesdict.keys():
        sizesdict[key] = data[:][data.height_width == key]
    big_sizer = []
    for nami in compuniquesizes:
        frames = sizesdict[nami]
        big_sizer.append(frames)
    return big_sizer    
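Splitting a table into one dataframe per size is what pandas' `groupby` does directly. The sketch below assumes a table laid out like the notebook's `data` (a `height_width` column); the name `split_by_size` and the toy rows are hypothetical.

```python
import pandas as pd

def split_by_size(data):
    """Return a list with one dataframe per distinct value in height_width."""
    return [group for _, group in data.groupby('height_width')]

# hypothetical table in the same layout as the notebook's `data`
toy_data = pd.DataFrame({
    'pics': ['a.jpg', 'b.jpg', 'c.jpg'],
    'height_width': ['h704_w1194', 'h512_w512', 'h704_w1194'],
})
frames = split_by_size(toy_data)
print([len(frame) for frame in frames])
```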

In [None]:
print(type(give_me_size_counted_dfs('D:/my_academia/new_dicom_output')))

In [14]:
import imghdr  # note: imghdr was removed from the standard library in Python 3.13

def create_imgs_matrix(directory, compression=50):
    global image_files   
    image_files = []
    # create list of all files in directory     
    folder_files = [filename for filename in os.listdir(directory)]  
    
    # create images matrix   
    counter = 0
    for filename in folder_files: 
        file_path = os.path.join(directory, filename)
        # check that the file is accessible and that its format is an image
        if not os.path.isdir(file_path) and imghdr.what(file_path):
            # decode the image and create the matrix
            img = cv2.imdecode(np.fromfile(file_path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
            if type(img) == np.ndarray:
                img = img[...,0:3]
                # resize the image based on the given compression value
                img = cv2.resize(img, dsize=(compression, compression), interpolation=cv2.INTER_CUBIC)
                if counter == 0:
                    imgs_matrix = img
                    image_files.append(filename)
                    counter += 1
                else:
                    imgs_matrix = np.concatenate((imgs_matrix, img))
                    image_files.append(filename)
    return imgs_matrix

In [15]:
def mse(imageA, imageB):
    err = np.sum((imageA.astype("float") - imageB.astype("float")) ** 2)
    err /= float(imageA.shape[0] * imageA.shape[1])
    return err
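A small worked example may help show what `mse` measures. The function is repeated here (with renamed arguments) so the snippet runs on its own; the two 2x2 arrays are made-up data.

```python
import numpy as np

def mse(image_a, image_b):
    """Sum of squared pixel differences divided by the number of pixels."""
    err = np.sum((image_a.astype("float") - image_b.astype("float")) ** 2)
    return err / float(image_a.shape[0] * image_a.shape[1])

base = np.array([[0, 0], [0, 0]], dtype=np.uint8)
shifted = np.array([[2, 2], [2, 2]], dtype=np.uint8)

identical_error = mse(base, base)   # no difference at all
shifted_error = mse(base, shifted)  # every pixel differs by 2, so 4 * 2**2 / 4
print(identical_error, shifted_error)
```

Identical images give an error of 0, and the error grows with the square of the pixel differences, which is why a small threshold is used to flag near-duplicates.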

In [16]:
def compare_images(directory, show_imgs=True, similarity="high"):
    """
    directory (str).........folder to search for duplicate/similar images
    show_imgs (bool)........True = shows the duplicate/similar images found in output
                            False = doesn't show found images
    similarity (str)........"high" = searches for duplicate images, more precise
                            "low" = finds similar images
    compression (int).......not a parameter: fixed at 50 below, and changing it is not recommended
                            size in px (height x width) that the images are resized to before being compared
                            the larger this size, the more computational resources and time required
    """
    compression = 50
    # list where the found duplicate/similar images are stored
    
    duplicates = []
    lower_res = []
    
    imgs_matrix = create_imgs_matrix(directory, compression)

    # search for similar images
    if similarity == "low":
        ref = 1000
    # search for 1:1 duplicate images
    else:
        ref = 200

    main_img = 0
    compared_img = 1
    nrows, ncols = compression, compression
    srow_A = 0
    erow_A = nrows
    srow_B = erow_A
    erow_B = srow_B + nrows       
    
    while erow_B <= imgs_matrix.shape[0]:
        while compared_img < (len(image_files)):
            # select two images from imgs_matrix
            imgA = imgs_matrix[srow_A : erow_A, # rows
                               0      : ncols]  # columns
            imgB = imgs_matrix[srow_B : erow_B, # rows
                               0      : ncols]  # columns
            # compare the images
#             rotations = 0
#             while image_files[main_img] not in duplicates and rotations <= 3:
#                 if rotations != 0:
#                     imgB = rotate_img(imgB)
            err = mse(imgA, imgB)
            if err < ref:
                if show_imgs:
                    # show_img_figs and show_file_info are not defined in this notebook
                    show_img_figs(imgA, imgB, err)
                    show_file_info(compared_img, main_img)
                duplicates.append(image_files[main_img])
                check_img_quality(directory, image_files[main_img], image_files[compared_img], lower_res)
                #rotations += 1
            srow_B += nrows
            erow_B += nrows
            compared_img += 1
        
        srow_A += nrows
        erow_A += nrows
        srow_B = erow_A
        erow_B = srow_B + nrows
        main_img += 1
        compared_img = main_img + 1

    msg = "\n***\n DONE: found " + str(len(duplicates))  + " duplicate image pairs in " + str(len(image_files)) + " total images.\n The following files have lower resolution:"
    print(msg)
    return set(lower_res)

IndentationError: unexpected indent (Temp/ipykernel_27348/2737120513.py, line 47)
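The nested `while` loops in `compare_images` visit every pair of thumbnails once. The same pairwise walk can be sketched with `itertools.combinations`, which removes the manual row bookkeeping. This is an illustrative alternative, not the notebook's or cleanX's implementation: the name `find_similar_pairs` is hypothetical, the thumbnails are passed in as a list of arrays, and the default threshold of 200 is taken from the `ref` value used above for 1:1 duplicates.

```python
from itertools import combinations
import numpy as np

def find_similar_pairs(thumbnails, names, threshold=200):
    """Return (name_a, name_b, error) for every pair whose MSE is below threshold."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(thumbnails), 2):
        difference = first.astype("float") - second.astype("float")
        error = float(np.sum(difference ** 2) / (first.shape[0] * first.shape[1]))
        if error < threshold:
            pairs.append((names[i], names[j], error))
    return pairs

# hypothetical 2x2 thumbnails; the notebook shrinks real images to 50x50
thumbs = [np.full((2, 2), 10, dtype=np.uint8),
          np.full((2, 2), 11, dtype=np.uint8),
          np.full((2, 2), 250, dtype=np.uint8)]
similar = find_similar_pairs(thumbs, ['a.jpg', 'b.jpg', 'c.jpg'])
print(similar)
```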