Candace Makeda Moore, MD edited this page Aug 21, 2021 · 27 revisions

Welcome to the cleanX wiki!

cleanX is a Python library, started in April 2021, for cleaning and preparing large sets of radiology images. It includes functions for exploratory data analysis, data cleaning, normalization, and augmentation. One of its modules (data_processing) can also be applied to tabular data, whether in csv, json, or other formats, for other purposes.

The library aims to help overcome certain misunderstandings between medical professionals, data analysts, and engineers. For example, engineers often do not realize that medical images cannot be augmented in certain ways, as one would augment an image of a cat, without the resulting image taking on a different clinical meaning. On the other hand, many medical professionals lack coding skills. While using the library obviously requires code, it turns some complex code maneuvers into single functions (in the lingo, 'high-level APIs'), and in many cases it also provides classes that make the functionality easier to use.
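To make the augmentation point concrete, here is a minimal sketch in plain Python. It does not use cleanX; the toy image and the helper names `flip_horizontal` and `heart_side` are made up for illustration. On a standard frontal chest X-ray the heart shadow lies mostly toward the patient's left, which is drawn on the viewer's right. A horizontal flip, harmless for a cat photo, moves it to the other side, so the flipped image could be read as a different anatomical finding.

```python
def flip_horizontal(image):
    """Return a left-right mirrored copy of a 2D image given as a list of rows."""
    return [list(reversed(row)) for row in image]


def heart_side(image):
    """Report which half of the image holds more 'heart' pixels (value 1)."""
    width = len(image[0])
    # Sum the columns strictly left and strictly right of the middle.
    left = sum(row[c] for row in image for c in range(width // 2))
    right = sum(row[c] for row in image for c in range((width + 1) // 2, width))
    if left == right:
        return "centered"
    return "viewer-left" if left > right else "viewer-right"


# Hypothetical 3x5 "radiograph": 1 marks the heart shadow, 0 is background.
original = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
flipped = flip_horizontal(original)

print(heart_side(original))  # viewer-right
print(heart_side(flipped))   # viewer-left
```

The pixel values are the same in both images; only their clinical reading changes, which is why augmentations for radiology need to be chosen with the anatomy in mind.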

To make the library even more useful, we are working on a no-code workflow in addition to adding functions. Whatever your level of coding, you can contribute to cleanX, and there are more details about that here.

For a demo of a possible workflow with code, go to the workflow_demo folder and examine the notebooks within. One demo workflow is implemented without classes so that data analysts and people with limited coding skills can use it. For more advanced programmers who want to run a workflow many times, classes are provided (and there is a notebook on that as well). There is also a workflow specifically about bias.

The project has a conda-based version. Documentation can be built on command. Please see the notes here.

Special note on output file formats: You may wonder why we work with DICOM and all kinds of metadata but chose to produce most output as .jpg and .csv files, when most doctors are more familiar with DICOMs and Excel files (i.e. .xlsx). Among the reasons are the following:

  • by using jpg files instead of whole DICOMs, you avoid touching a lot of metadata, which can be a problem for anonymization of data
  • Microsoft Excel does not handle big data well: a worksheet holds at most 1,048,576 rows (according to Microsoft). As a reference point, if you took certain chest X-ray datasets on Kaggle and added 3 augmentations to each image (with a new row in an Excel file for each), you would run out of rows.
  • by using csv, we include people who do not have access to Microsoft products while still providing a file that can be opened in Excel
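The row-limit arithmetic and the csv alternative can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not cleanX code: the helper names, the dataset size of 300,000 images, the column names, and the file names are all hypothetical, and the count assumes the original image keeps its own row alongside one row per augmentation.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit cited above


def rows_needed(n_images, augmentations_per_image, header_rows=1):
    """Rows for every original image plus one row per augmentation, plus header."""
    return n_images * (1 + augmentations_per_image) + header_rows


def fits_in_excel(n_images, augmentations_per_image):
    """True if the table would fit in a single Excel worksheet."""
    return rows_needed(n_images, augmentations_per_image) <= EXCEL_MAX_ROWS


def write_rows_csv(rows, fieldnames, handle):
    """Write a list of dicts to an open text handle as csv, with a header row."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical dataset size, chosen only to illustrate the arithmetic.
hypothetical_n_images = 300_000
print(rows_needed(hypothetical_n_images, 3))    # 1200001
print(fits_in_excel(hypothetical_n_images, 3))  # False

# csv has no such row limit; an in-memory buffer stands in for a file here.
buffer = io.StringIO()
example_rows = [
    {"image": "img_0001.jpg", "augmentation": "none"},
    {"image": "img_0001_aug1.jpg", "augmentation": "rotation"},
]
write_rows_csv(example_rows, ["image", "augmentation"], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not doubled on Windows.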

If you have ideas or comments on the library you can open a ticket and/or go to the discussions page.