Skip to content

Legacy medical professional documentation

Candace Makeda Moore, MD edited this page Nov 6, 2021 · 2 revisions

This is no longer the home of documentation for medical professionals who do not spend most of their time coding. The most current documentation for medical professionals is integrated with the rest of the documentation here. If you are using the develop branch of the project, then we suggest generating it by hand or pulling the most current develop branch into your computer. The main branch documentation will be updated with the next release.

From this legacy documentation here you can still link to our Jupyter notebook specifically for medical professionals (one of a series of notebooks), a brief introductory video or a special video demo (linked) which shows some basic functionality for people who bring expertise outside of coding.

Introduction

cleanX is a package meant to ease the preparation of data for machine learning or other algorithms created with some types of radiology imaging data. The package was developed specifically for chest X-rays, however, some of the functionality can be used with other types of data. For example, the data_processing module can be used with many types of tabular data for radiology or even other areas.

Programmers have a saying "garbage in, garbage out". If we apply this saying here it means if you feed a machine learning algorithm incorrect and/or mislabeled data, you should expect incorrect outputs (the algorithm will not work as desired). Therefore the work of radiologists and allied professionals is critical in creating machine learning algorithms for the field. At the present state of the art datasets often include over 100,000 images. The task of reading this many images can be aided by using a tool such as cleanX. It will not only automatically reveal some questionable images, but also reveal images that are problematic for computer vision programs e.g. inverted or upside-down images that a human reader can read easily. CleanX can also be used by anyone (even programmers( to get rid of images which obviously should not be in a dataset. There are public datasets of chest X-rays that contain occasional coronal CT slices, or other types of images. Such images should ideally be thrown out of a dataset before you begin reading. cleanX can help anyone with good vision accomplish this task.

cleanX is also made so it can extract and clean the metadata on the images. But if you are an imaging professional not very thrilled about coding, but still want to make a substantial contribution to a machine learning project, then data curation is one of the most important tasks you can help with. We have a video, (our video for non-coders demo (linked)), that shows how cleanX can help with some of these tasks.

Workflow

cleanX workflow is made of three modules:

Dicom_processing 

Data_processing

Image_work

To see some functionality of cleanX, it is suggested to run both notebooks in the workflow demo folder. If you do not have or know how to comfortably operate Jupyter notebooks, you can check out our videos video demo of several classes and functions (linked) and video for non-coders demo (linked).

image linked to video

What makes cleanX different and/or important?

Much research about biomedical imaging relies on a combination of proprietary technology and in-house code that is not shared across institutions. This reality stymies cooperation, and locks out people and institutions without great amounts of capital. It is often poorer institutions e.g. hospitals in nations with very few radiologists, that would benefit the most from AI. Unfortunately, research algorithms made on imaging data from brand new, expensive, top-of-the-line machines may or may not easily translate if applied to images made on lower-quality machines for purely technical reasons. Additionally, no matter how similar our machines, to even investigate how well an algorithm made on imaging data from a place like Utah or Holland translates the chest X-rays made in Malawi (or any very different setting from the one the data comes from), we would need a dataset of images from Malawi (or whichever country we are trying out the algorithm in) or we actually risk worsening health inequity by simply applying the exact same algorithm everywhere (both present research, and practical field experience indicate this is a likely outcome). CleanX is an open-source solution designed to be used freely everywhere. Some algorithms in machine learning are essentially solved problems, but what is not solved is how a broader group of people and institutions can get or create appropriate datasets to power these algorithms. cleanX is for everyone, free and open source.

Special note for "old-school" doctors: You may wonder why we chose to do most of our output work with .jpg and .csv files when most doctors are more familiar with DICOMs and Excel files i.e. xlsx. Among the reasons are the following:

  • by using jpg files instead of whole DICOMs you can avoid touching a lot of metadata which can be a problem in terms of anonymization of data
  • big data is not handled all that well by Microsoft Excel, for example, if you have over 1,048,576 rows (according to Microsoft) you have a problem. For a reference point, if you took certain Chest-Xray datasets on Kaggle and added 3 augmentations to each image (with a new row in an Excel file for each), you would have a run out of rows...
  • by using csv we are being inclusive of people who do not have access to Microsoft products and still providing a file that can be opened in Excel