# Veterans Affairs & SingleStore - Detecting Anomalies in Chest X-rays

### Improving healthcare efficiency through computer vision and near-real-time analytics with SingleStore

<center><img src="assets/1st_image.png" width="600"></center>

<a name="contents"></a>

# Contents


- [Import Libraries](#imports)
- [Functions](#functions)
- [Dicom Extraction](#dicom_extract)
- [Overview](#overview)
- [Dataset Information](#dataset)
- [Send Headers and JPG to SingleStore](#send_to_s2)
- [About the Trained Model](#about_model)

<a name="imports"></a>

- [Back to Contents](#contents)

# Import Libraries (env = conda_pytorch_p37)

In [None]:
# !pip install pydicom
# !pip install kornia
# !pip install PyMySQL
# !pip install SQLAlchemy
from pathlib import Path
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import os
import re
import cv2
import pandas as pd
from fastai.medical.imaging import *
from fastai.vision.all import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as ptc
import glob    # lists the DICOM and JPG files
import shutil  # clears the output folder
from datetime import datetime  # timestamps for the insert loop
from tqdm import tqdm # for getting a progress bar on loops
import time
from PIL import Image
import pymysql
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)



<a name="functions"></a>

- [Back to Contents](#contents)

# Functions

In [None]:
%%time
# Helpers for reading DICOM files

# Read a DICOM image and return it as an 8-bit array
def read_xray(path, voi_lut = True, fix_monochrome = True):
    dicom = pydicom.dcmread(path)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # MONOCHROME1 stores inverted intensities, so the X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    # rescale to the 0-255 range; skip the division for an all-zero image
    data = data - np.min(data)
    peak = np.max(data)
    if peak > 0:
        data = data / peak
    data = (data * 255).astype(np.uint8)
        
    return data

# Collect the non-binary DICOM header fields of one file into a dict.
# Relies on the globals f_path (input folder) and prog (keyword regex) being set before the call.
def get_dcm_contents(file):
    dcm = Path(f_path + file).dcmread()    
    # keep attribute names that start with an uppercase letter (DICOM keywords)
    properties = [string for string in dir(dcm) if prog.match(string).group(0)!='']
    dict1 = {'file': file.replace('.dicom', '')}    
    # skip byte-valued fields such as the pixel data
    dict1.update( { what: dcm[what].value for what in properties if not isinstance(dcm[what].value, (bytes, bytearray)) } )
    return dict1


# Resize an image to the given width or height, keeping the aspect ratio
def resize(image, width=None, height=None, inter=cv2.INTER_AREA):
    dim = None
    (h,w) = image.shape[:2]
    
    if width is None and height is None:
        return image
    
    if width is None:
        # calculate the ratio of the height and construct the dimensions
        r = height / float(h)
        dim = (int(w*r), height)
        
    else:
        r = width / float(w)
        dim = (width, int(h*r))
    
    # resize image
    img = cv2.resize(image, dim, interpolation=inter)
    
    # return the resized image
    return img
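
The `resize` helper fixes one dimension and scales the other to match, so calling it with `height=1024` produces an image wider than 1024 pixels whenever the source is wider than it is tall. A minimal sketch of computing target dimensions that cap the longer side instead follows. `fit_within` is a hypothetical helper, not part of the original notebook; it returns `(width, height)` in the order `cv2.resize` expects and uses integer arithmetic to avoid rounding surprises.

```python
def fit_within(h, w, max_side=1024):
    """Return (new_w, new_h) so that the longer side equals max_side.

    Hypothetical helper: keeps the aspect ratio and never upscales.
    """
    longest = max(h, w)
    if longest <= max_side:
        return (w, h)  # already small enough; leave the size unchanged
    # integer scaling: the longer side becomes exactly max_side
    return (w * max_side // longest, h * max_side // longest)


# A portrait image: 2430 rows x 1994 columns
print(fit_within(2430, 1994))
# A landscape image: the width is the side that gets capped
print(fit_within(1994, 2430))
```

The returned tuple could be passed as the `dim` argument of `cv2.resize` in place of the value computed inside `resize`.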





<a name="dicom_extract"></a>

- [Back to Contents](#contents)

# DICOM header extraction, image resize, and conversion to JPG

In [None]:
%%time
f_path = 'DemoDicomImages/'
files = [f for f in os.listdir(f_path) if os.path.isfile(os.path.join(f_path, f))]
prog = re.compile('^[A-Z]*')  # leading uppercase letters; empty match for lowercase names

# Bring in DICOM metadata

df = pd.DataFrame( [ get_dcm_contents(file) for file in files ] )

# Convert to JPG and resize to a height of 1024 pixels
val_outdir = 'DemoConversionToJPG/'

# Create the output folder if it is missing, then clear files from earlier runs
os.makedirs(val_outdir, exist_ok=True)
for entry in os.listdir(val_outdir):
    path = os.path.join(val_outdir, entry)
    try:
        shutil.rmtree(path)
    except OSError:
        os.remove(path)
    
# Convert DICOM to JPG via OpenCV
val_list = [os.path.basename(x) for x in glob.glob(os.path.join(f_path, '*.dicom'))]
for f in tqdm(val_list):  
    if not os.path.exists(f_path + f[:-5] + 'jpg'):
        img = read_xray(f_path + f) # read dicom image
        img = resize(img,height=1024)
        cv2.imwrite(val_outdir + f.replace('.dicom','.jpg'),img) # write jpg image
        
val_jpg_files = glob.glob(f'{val_outdir}/*.jpg')
print (f'Number of val_jpg_test files in {val_outdir}: {len(val_jpg_files)}')

In [None]:
df
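
`get_dcm_contents` keeps only attribute names that begin with an uppercase letter, which is how it separates DICOM keywords such as `PatientSex` from the dataset object's lowercase Python methods and properties. A small sketch of that filter on a hand-written list of names is shown here; the list stands in for `dir(dcm)` and is made up for illustration.

```python
import re

# Same pattern as the notebook: a (possibly empty) run of leading capitals
prog = re.compile('^[A-Z]*')

# Stand-in for dir(dcm); these names are illustrative only
names = ['PatientSex', 'Rows', 'pixel_array', 'save_as', '__class__', 'WindowCenter']

# An empty match means the name does not start with an uppercase letter
keywords = [n for n in names if prog.match(n).group(0) != '']
print(keywords)
```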

<a name="overview"></a>

- [Back to Contents](#contents)

# Overview

### General Radiology overview:

- As of 2021, there were 9,992 diagnostic imaging center businesses in the US.  
source: https://www.ibisworld.com/industry-statistics/number-of-businesses/diagnostic-imaging-centers-united-states/   

- The US has approximately 30,000 post-training, professionally active radiologists, or about 100 radiologists per million Americans.  
source: https://answerstoall.com/common-questions/how-many-radiologists-are-in-the-us-in-2021/  

- Worldwide, an estimated 3.6 billion diagnostic medical examinations, such as X-rays, are performed every year.  
source: https://www.who.int/news-room/feature-stories/detail/to-x-ray-or-not-to-x-ray-

- Chest X-rays are used for diagnosis.
- Two views are taken: from the back and from the side.


### The problem:

- Chest X-rays are critical for the detection of acute thoracic diseases, including lung cancer and pneumonia, affecting millions of people worldwide each year.
- Interpretation is a time-consuming task that requires an expert radiologist to read each image. This can lead to fatigue-based diagnostic errors, and the expertise is lacking in areas where radiologists are not available.

- America’s shortage of radiologists and other physician specialists could surpass 35,000 by 2034.  
source: https://www.radiologybusiness.com/topics/artificial-intelligence/physician-shortages-radiology-aamc-artificial-intelligence 


### Solutions:  

- Opening new research opportunities
- Decision support: helping radiologists prioritize workflows while reducing diagnostic errors
- Storing and retrieving data from models with speed and at scale
- Enabling clinicians to determine levels of urgency and relevance faster

<a name="dataset"></a>

- [Back to Contents](#contents)

# VinBigData DICOM header information

<center><img src="assets/dataset_statistics.png" width="500"></center>

# Patient Gender

<center><img src="assets/patient_gender.png" width="700"></center>

# Patient Age

<center><img src="assets/patient_age.png" width="700"></center>
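
DICOM stores `PatientAge` as an Age String: three digits followed by a unit letter (`D`, `W`, `M` or `Y`). The headers in this dataset include well-formed values such as `062Y` as well as incomplete ones such as `Y`. A minimal sketch of turning these strings into years before plotting follows. `age_in_years` is a hypothetical helper, and the day and week conversions are approximations.

```python
import re

# DICOM Age String (VR "AS"): three digits followed by D, W, M or Y
_AGE = re.compile(r'^(\d{3})([DWMY])$')
# Units per year; days and weeks are approximate
_PER_YEAR = {'D': 365.0, 'W': 52.0, 'M': 12.0, 'Y': 1.0}


def age_in_years(value):
    """Return the age in years, or None when the value is missing or malformed."""
    if not isinstance(value, str):
        return None
    m = _AGE.match(value.strip())
    if m is None:
        return None
    return int(m.group(1)) / _PER_YEAR[m.group(2)]


print(age_in_years('062Y'))  # a well-formed value
print(age_in_years('Y'))     # digits missing, treated as unknown
```

Applied to a DataFrame column, this could look like `df['PatientAge'].map(age_in_years)`.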

<center><img src="assets/dicom_missing_values_chart.png" width="700"></center>

<a name="send_to_s2"></a>

- [Back to Contents](#contents)

# Insert headers and JPG images into SingleStore

In [None]:
%%time
s2conn = create_engine('mysql+pymysql://root:Sglstrpw34@172.31.62.112:3306/PatientRecords')
df.to_sql('ImageHeaderdf', s2conn, if_exists='replace', index = False)

def convertToBinaryData(ImageFile):
    # Read the image file and return its contents as bytes
    with open(ImageFile, 'rb') as file:
        binaryData = file.read()
    return binaryData


def insertBLOB(ImageID, ImagePath, ImageFile):
    # Insert one JPG into the JPGImages table using the open s2conn connection
    mycursor = s2conn.cursor()

    sql_insert_blob_query = """ INSERT IGNORE INTO JPGImages
                      (file, ImagePath, Image) VALUES (%s,%s,%s)"""

    jpgImage = convertToBinaryData(ImageFile)

    # Pass the values as a tuple so the driver handles escaping
    insert_blob_tuple = (ImageID, ImagePath, jpgImage)
    mycursor.execute(sql_insert_blob_query, insert_blob_tuple)
    s2conn.commit()
    mycursor.close()
    
    
# Main
directory = '/home/ubuntu/vinbigdata/DemoConversionToJPG/'
jpgCount = 0

startTime = datetime.now()
print("Starting to insert JPG files ", startTime)

s2conn = pymysql.connect(
    user='root',
    password='Sglstrpw34',
    host='172.31.62.112',
    port=3306,
    database='Images')

# iterate over files in that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    file = os.path.splitext(os.path.basename(f))[0]

    # insert only regular files with a .jpg extension
    if os.path.isfile(f) and f.endswith(".jpg"):
        insertBLOB(file, directory, f)
        jpgCount += 1
        if jpgCount % 100 == 0:
            print(jpgCount, " Elapsed time ", (datetime.now() - startTime))

s2conn.close()
print("Inserted ", jpgCount, " End time ", datetime.now(), " Elapsed time ", (datetime.now() - startTime))
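
The cell above embeds the database user, password and host directly in the notebook. A minimal sketch of building the same kind of SQLAlchemy URL from environment variables instead is shown here. The variable names (`S2_USER`, `S2_PASSWORD`, `S2_HOST`, `S2_PORT`) and the `build_db_url` helper are assumptions for illustration and are not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_db_url(database, env=os.environ):
    """Assemble a mysql+pymysql URL from the (hypothetical) S2_* variables."""
    user = quote_plus(env['S2_USER'])
    password = quote_plus(env['S2_PASSWORD'])  # escapes characters such as '@' or '/'
    host = env['S2_HOST']
    port = env.get('S2_PORT', '3306')  # default MySQL-protocol port
    return f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}'


# Example with a plain dict standing in for os.environ
demo_env = {'S2_USER': 'demo', 'S2_PASSWORD': 'p@ss/word', 'S2_HOST': 'localhost'}
print(build_db_url('PatientRecords', demo_env))
```

The resulting string can be passed to `create_engine`, and the same variables can feed the keyword arguments of `pymysql.connect`.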

### Next step: run inference in notebook 9_Demo_inference and pull the data now stored in SingleStore

<a name="about_model"></a>

- [Back to Contents](#contents)

# About the trained model

## 5-fold ensemble model using the YOLOv5 architecture and transfer learning with pre-trained weights
- 60 epochs
- 15,000 training files split into 5 parts
- Each fold model has 444 layers
- 86,260,891 parameters
- An ensemble approach should help predictions generalize better, resulting in a more accurate model
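
In a standard 5-fold setup over 15,000 files, each fold model trains on four parts (12,000 files) and validates on the remaining part (3,000 files). A minimal sketch of such a split on file indices is shown here. The round-robin assignment and the `k_fold_indices` helper are assumptions, since the text does not say how the files were actually partitioned.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, val_idx) pairs; item i is held out in fold i % k."""
    for fold in range(k):
        val_idx = [i for i in range(n_items) if i % k == fold]
        train_idx = [i for i in range(n_items) if i % k != fold]
        yield train_idx, val_idx


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(15000, k=5)):
    print(fold, len(train_idx), len(val_idx))
```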

<center><img src="assets/train_obj_loss.png" width="700"></center>

<center><img src="assets/mean_avg_precision_.5.png" width="700"></center>

<center><img src="assets/confusion_matrix.png" width="500"></center>

## Training mosaics

![](assets/train_mosaics.png)

# Example with 11 DICOM Images

In [3]:
%%time
# !pip install pydicom
# !pip install kornia
# !pip install PyMySQL
# !pip install SQLAlchemy
from pathlib import Path
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut
import os
import re
import cv2
import pandas as pd
from fastai.medical.imaging import *
from fastai.vision.all import *
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as ptc
import glob    # lists the DICOM and JPG files
import shutil  # clears the output folder
from tqdm import tqdm # for getting a progress bar on loops
import time
from PIL import Image
import pymysql
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 500)

# Helpers for reading DICOM files

# Read a DICOM image and return it as an 8-bit array
def read_xray(path, voi_lut = True, fix_monochrome = True):
    dicom = pydicom.dcmread(path)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # MONOCHROME1 stores inverted intensities, so the X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
        
    # rescale to the 0-255 range; skip the division for an all-zero image
    data = data - np.min(data)
    peak = np.max(data)
    if peak > 0:
        data = data / peak
    data = (data * 255).astype(np.uint8)
        
    return data

# Collect the non-binary DICOM header fields of one file into a dict.
# Relies on the globals f_path (input folder) and prog (keyword regex) being set before the call.
def get_dcm_contents(file):
    dcm = Path(f_path + file).dcmread()    
    # keep attribute names that start with an uppercase letter (DICOM keywords)
    properties = [string for string in dir(dcm) if prog.match(string).group(0)!='']
    dict1 = {'file': file.replace('.dicom', '')}    
    # skip byte-valued fields such as the pixel data
    dict1.update( { what: dcm[what].value for what in properties if not isinstance(dcm[what].value, (bytes, bytearray)) } )
    return dict1


# Resize an image to the given width or height, keeping the aspect ratio
def resize(image, width=None, height=None, inter=cv2.INTER_AREA):
    dim = None
    (h,w) = image.shape[:2]
    
    if width is None and height is None:
        return image
    
    if width is None:
        # calculate the ratio of the height and construct the dimensions
        r = height / float(h)
        dim = (int(w*r), height)
        
    else:
        r = width / float(w)
        dim = (width, int(h*r))
    
    # resize image
    img = cv2.resize(image, dim, interpolation=inter)
    
    # return the resized image
    return img

def convertToBinaryData(ImageFile):
    # Read the image file and return its contents as bytes
    with open(ImageFile, 'rb') as file:
        binaryData = file.read()
    return binaryData


def insertBLOB(ImageID, ImagePath, ImageFile):
    # Insert one JPG into the JPGImages table using the open s2conn connection
    mycursor = s2conn.cursor()

    sql_insert_blob_query = """ INSERT IGNORE INTO JPGImages
                      (file, ImagePath, Image) VALUES (%s,%s,%s)"""

    jpgImage = convertToBinaryData(ImageFile)

    # Pass the values as a tuple so the driver handles escaping
    insert_blob_tuple = (ImageID, ImagePath, jpgImage)
    mycursor.execute(sql_insert_blob_query, insert_blob_tuple)
    s2conn.commit()
    mycursor.close()
    

In [12]:
%%time
# Create list of every file in the Dicom folder

f_path = '1DemoDicomImage/'
files = [f for f in os.listdir(f_path) if os.path.isfile(os.path.join(f_path, f))]
prog = re.compile('^[A-Z]*')


# Extract the Dicom Metadata from the file list


df = pd.DataFrame( [ get_dcm_contents(file) for file in files ] )

val_outdir = '1DemoConversionToJPG/'

# Create the output folder if it is missing, then clear files from earlier runs
os.makedirs(val_outdir, exist_ok=True)
for entry in os.listdir(val_outdir):
    path = os.path.join(val_outdir, entry)
    try:
        shutil.rmtree(path)
    except OSError:
        os.remove(path)

    
# Resize each DICOM image and convert it to JPG


val_list = [os.path.basename(x) for x in glob.glob(os.path.join(f_path, '*.dicom'))]
for f in tqdm(val_list):  
    if not os.path.exists(f_path + f[:-5] + 'jpg'):
        img = read_xray(f_path + f) # read dicom image
        img = resize(img,height=1024)
        cv2.imwrite(val_outdir + f.replace('.dicom','.jpg'),img) # write jpg image
        
val_jpg_files = glob.glob(f'{val_outdir}/*.jpg')
print (f'Number of val_jpg_test files in {val_outdir}: {len(val_jpg_files)}')


# Start the timer, then send the DICOM metadata to SingleStore


startTime = time.time()
s2conn = create_engine('mysql+pymysql://root:Sglstrpw34@172.31.62.112:3306/PatientRecords')
df.to_sql('ImageHeaderdf', s2conn, if_exists='replace', index = False)


# Send Resized JPG Images to SingleStore


directory = '/home/ubuntu/vinbigdata/1DemoConversionToJPG/'
jpgCount = 0

s2conn = pymysql.connect(
    user='root',
    password='Sglstrpw34',
    host='172.31.62.112',
    port=3306,
    database='Images')



for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    file = os.path.splitext(os.path.basename(f))[0]

    # insert only regular files with a .jpg extension
    if os.path.isfile(f) and f.endswith(".jpg"):
        insertBLOB(file, directory, f)
        jpgCount += 1
        if jpgCount % 100 == 0:
            print(jpgCount, " Elapsed time (s) ", round(time.time() - startTime, 2))
s2conn.close()


# End timer


endTime = time.time()
print(f"Inserted {jpgCount} JPG images")
print(f'Time it took for DICOM metadata and JPG images to reach SingleStore: {round((endTime - startTime),2)} seconds')

100%|██████████| 11/11 [00:04<00:00,  2.45it/s]


Number of val_jpg_test files in 1DemoConversionToJPG/: 11
Inserted 11 JPG images
Time it took for DICOM metadata and JPG images to reach SingleStore: 0.19 seconds
CPU times: user 3.71 s, sys: 995 ms, total: 4.71 s
Wall time: 4.87 s


In [13]:
df

Unnamed: 0,file,BitsAllocated,BitsStored,Columns,HighBit,LossyImageCompression,PatientSex,PhotometricInterpretation,PixelRepresentation,PixelSpacing,RescaleIntercept,RescaleSlope,Rows,SamplesPerPixel,WindowCenter,WindowWidth,NumberOfFrames,PatientAge,PatientSize,PatientWeight,LossyImageCompressionMethod,LossyImageCompressionRatio,PixelAspectRatio,LargestImagePixelValue,SmallestImagePixelValue
0,0c187ebe652499a7e28fd93da2e42ebb,16,12,1994,11,0.0,O,MONOCHROME2,0,"[0.175, 0.175]",0.0,1.0,2430,1,2047.0,4096.0,,,,,,,,,
1,0c15e19c74ef8ddd9bed0a4fc7f4b5a8,16,12,1994,11,0.0,O,MONOCHROME2,0,"[0.175, 0.175]",0.0,1.0,2430,1,2047.0,4096.0,,,,,,,,,
2,043111cdad4d26204503d3396876046f,16,14,3072,13,,O,MONOCHROME2,0,"[0.139000, 0.139000]",0.0,1.0,3072,1,2655.0,5424.0,1.0,Y,,,,,,,
3,0c803c4810a8c5ec362f5d4504489431,16,14,2540,13,,F,MONOCHROME2,0,"[0.140, 0.140]",,,3072,1,9611.0,11854.0,1.0,062Y,,,,,,,
4,0c84dad44979c3a7777a8f4f4d9f5f7a,16,12,2642,11,0.0,,MONOCHROME2,0,"[0.125, 0.125]",0.0,1.0,3170,1,1990.0,4180.0,,,,,,,,,
5,0c26e997c79d2a3149cf1a62c9444554,16,14,3072,13,,M,MONOCHROME2,0,"[0.140, 0.140]",,,3072,1,8431.0,10872.0,1.0,052Y,,,,,,,
6,0c785dff51447d6e689e15dcfe10c8e2,16,12,1994,11,0.0,O,MONOCHROME2,0,"[0.175, 0.175]",0.0,1.0,2430,1,2047.0,4096.0,,,,,,,,,
7,0c5bc5fdf756d1251e431c9281349376,16,12,2048,11,0.0,O,MONOCHROME2,0,"[0.168, 0.168]",0.0,1.0,2500,1,2048.0,4096.0,,,,,,,"[1, 1]",4095.0,0.0
8,0c6036df3708fe77c1c76498240d6774,16,12,1994,11,0.0,O,MONOCHROME2,0,"[0.175, 0.175]",0.0,1.0,2430,1,2047.0,4096.0,,,,,,,,,
9,0c6f7b76bcc012fc5a9af4470f735aba,16,12,2701,11,0.0,,MONOCHROME2,0,"[0.125, 0.125]",0.0,1.0,3155,1,2048.0,4096.0,,,,,,,,,
