## CBIS Images

The CBIS images were saved as DICOM files, a format used for medical images. To read them I downloaded a free DICOM viewer, MicroDicom, from http://www.microdicom.com, and used it to export the images as JPEGs. The images were exported to a directory structure in which the information about each scan was in the names of the subdirectories and every image had the same name, "000000.jpg". At first I tried to resize the images to a uniform size in the MicroDicom export function, but eventually decided to export each image at its own size.

I then used a function, rename_and_copy_files in mammo_utils, to extract the information from the directory names and rename each image appropriately and copy them all to a single directory. The JPEGs of each scan were put in one directory and the mask files were put in another. Some images had multiple ROIs and thus multiple masks, so multiple masks were allowed for each scan.
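
The renaming step can be sketched roughly as follows. This is a minimal illustration, not the actual code of mammo_utils.rename_and_copy_files: the function name flatten_exported_images, the way the new file name is built, and the example directory names are all assumptions made for the sketch. It relies on each subdirectory holding a single exported JPEG, as described above.

```python
import os
import shutil


def flatten_exported_images(src_root, dest_dir):
    """Copy every JPEG under src_root into dest_dir, named after its directory path.

    Hypothetical helper: assumes one JPEG per subdirectory, so the joined
    directory names are enough to give each copy a unique file name.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for filename in sorted(filenames):
            if not filename.lower().endswith(".jpg"):
                continue
            # The subdirectory names carry the scan information, so join them
            # into the new file name.
            rel_dir = os.path.relpath(dirpath, src_root)
            parts = [p for p in rel_dir.split(os.sep) if p not in (".", "")]
            new_name = "_".join(parts) + ".jpg" if parts else filename
            shutil.copy(os.path.join(dirpath, filename),
                        os.path.join(dest_dir, new_name))
            copied.append(new_name)
    return sorted(copied)
```

Scans and masks could be separated afterwards by filtering on the generated names, depending on how the export directories distinguish them.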

This dataset included cropped and zoomed ROIs for each scan, but each ROI was a different size and had a different amount of zoom. The differing sizes concerned me, as they would require either leaving each ROI image with a black border or cropping it. My exploratory analysis of the UCI data indicated that the edges of a mass were important, so I did not want to crop the images and potentially lose important information. Leaving a black border, on the other hand, could give a model an easy way to identify which images were abnormal. I therefore decided to create my own ROI images.

The ROIs are cropped out of the images in the notebook crop_cbis_images.ipynb. First the mask and scan images are copied from the nested directories to their respective directories using the function mammo_utils.rename_and_copy_files. Then the function create_roi_slices is used to generate ROI images from each image and mask as follows:
 1. The function takes as input the directory containing the masks and the directory containing the full images. A list of files in the mask directory is generated and looped through.
 2. For each mask, the name of the full scan is generated, and the image is loaded and converted to an array.
 3. Since the images are grayscale and the channels are identical, the color channels are discarded.
 4. Function mammo_utils.create_mask is used to create the geometry of the ROI. This function reads in the mask, identifies where the white pixels start, and returns the center of the ROI along with its height and width. The function also does some additional processing if necessary.
 5. Regardless of the size of the ROI, a tile is cut out at 598x598 pixels centered on the center of the ROI. This tile is then downsized by half and added to the list of tiles.
 6. If the ROI is either larger than the tile or smaller than 75% of the tile's size, the tile is expanded or shrunk to the size of the ROI plus a 10% margin, with lower and upper limits on the tile size to avoid distortion. The tile is then resized to 299x299 and added to the list.
 7. If the ROI is bigger than the 598x598 tile, the ROI is also split up into multiple tiles with a stride of 299 so that the full ROI is included in the training data. The tiles are then resized to 299x299 and added to the list.
 8. Since including multiple images of the same ROI might bias the data, some additional processing was done to keep the data as varied as possible. All tiles are randomly flipped horizontally, vertically, or both, and instead of being perfectly centered, the ROI is randomly positioned within the tile using function get_fuzzy_offset.
 9. Additional code was also used to ensure that the tiles did not run off the edge of the images and that tiles included actual image data and not just black background.
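
The core of steps 4, 5, 8 and 9 can be sketched with NumPy alone. This is an illustrative approximation, not the code in create_roi_slices or mammo_utils.create_mask. All function names below are hypothetical, the mask threshold of 127 is an assumption, the 2x2 block averaging stands in for whatever resize routine was actually used, and the random flip here can also leave a tile unflipped.

```python
import numpy as np


def roi_center_and_size(mask, threshold=127):
    """Return (center_row, center_col, height, width) of the white region of a mask."""
    rows, cols = np.nonzero(mask > threshold)
    if rows.size == 0:
        raise ValueError("mask contains no white pixels")
    top, left = int(rows.min()), int(cols.min())
    height = int(rows.max()) - top + 1
    width = int(cols.max()) - left + 1
    return top + height // 2, left + width // 2, height, width


def crop_tile(image, center_row, center_col, tile_size=598):
    """Cut a square tile around the center, shifted so it never leaves the image."""
    img_h, img_w = image.shape[:2]
    if img_h < tile_size or img_w < tile_size:
        raise ValueError("image is smaller than the tile")
    # Clamp the top-left corner so the whole tile lies inside the image.
    top = min(max(center_row - tile_size // 2, 0), img_h - tile_size)
    left = min(max(center_col - tile_size // 2, 0), img_w - tile_size)
    return image[top:top + tile_size, left:left + tile_size]


def halve(tile):
    """Downsize a 2D tile with even dimensions by averaging each 2x2 block."""
    h, w = tile.shape
    return tile.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def random_flip(tile, rng):
    """Flip horizontally and/or vertically, each with probability 0.5."""
    if rng.random() < 0.5:
        tile = np.fliplr(tile)
    if rng.random() < 0.5:
        tile = np.flipud(tile)
    return tile
```

A random offset in the spirit of get_fuzzy_offset could be added by perturbing center_row and center_col before calling crop_tile; the clamping would still keep the tile inside the image.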
 
The CBIS-DDSM data was already divided into training and test data. At first I combined all the images from all categories, shuffled them, and then divided them into training and test data. I then realized that having different versions of the same images potentially included in both the training and test sets could defeat the purpose of having separate datasets, so I decided to keep the data divided as it was originally. This had the unfortunate side effect of limiting the size of the test and validation datasets while increasing the size of the training dataset.

The CBIS data was combined with the normal DDSM images. The training data was written to tfrecords files, while the test and validation data were saved as npy files. This is because the training data was far too large to keep in memory.

## DDSM Images

The DDSM dataset, from which the normal images were taken, is saved as lossless JPEGs (LJPEG), a format which has not been maintained since the 1990s. Decoding these images to PNGs was a long and complicated process, which was greatly sped up by the use of the following tools:

 - Decompressing for LJPEG Images - https://github.com/zizo202/Decompressing-For-LJPEG-image
 - jpeg.exe - an executable specifically for processing DDSM images, written by Chris Rose

To run this on my Windows laptop, I had to install Cygwin, configure the Cygwin environment, download several exe files, and then run the Python script from the GitHub repo above.

The conversion script first uncompresses the LJPEG files to an LJPEG.1 file, which is then converted to a raw PNM file. This step also takes into account the specific type of scanner used to create the images, as the images need to be normalized differently depending on the scanner. Finally the PNM files are converted to PNGs.

The DDSM data was saved with each scan in separate subfolders. To streamline the process, I used a script to copy the contents of each subdirectory to a single directory, which I could then run through the conversion process. Once the images were converted I copied the PNGs to a separate directory.

One of the scanners (DBA) had what I presume to be personal information cut out of the scans, leaving chunks of white pixels. To prevent a ConvNet from using these as a cue, I programmatically replaced all white pixels with black pixels, after first reviewing the images to ensure that pure white pixels did not otherwise appear in the scans.
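
The replacement itself is a one-line masking operation. The sketch below assumes the image has been loaded as an 8-bit NumPy array, so that pure white is 255; the function name and the white_value parameter are hypothetical, and the value would need adjusting for images with a different bit depth.

```python
import numpy as np


def blank_white_pixels(image, white_value=255):
    """Return a copy of the image with pure white pixels set to black."""
    cleaned = image.copy()
    # Only pixels at (or above) the white value are touched; everything else is kept.
    cleaned[cleaned >= white_value] = 0
    return cleaned
```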

Once the images had been converted and saved into single directories, function mammo_utils.create_slices() was used to create usable data. This function takes the path to the directory as input, generates a list of the images in the directory, and then feeds each image to mammo_utils.slice_normal_image(). This function processes each image as follows:

 1. The PNG image is read in and converted to RGB. The image is then scaled down to half size.
 2. A 7% margin is trimmed from the sides of each image to eliminate the white borders that occur in many scans.
 3. Each image is cut into 299x299 tiles with a stride of 299.
 4. The list of tiles is looped through and if the tile meets certain conditions it is added to the list of usable tiles.
 5. The conditions each tile must meet are to ensure that the image does not contain mostly black background and contains usable content. This is done by setting lower and upper thresholds on the mean and variance of each image. The thresholds were determined by manually reviewing the tiles and their corresponding means and variances.
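
Steps 2 through 5 can be sketched as follows. This is an approximation rather than the code of mammo_utils.slice_normal_image: the function names are hypothetical, the margin is assumed to be trimmed from the left and right edges only, and the mean and variance thresholds are placeholder values, not the ones chosen by manual review.

```python
import numpy as np


def trim_sides(image, margin=0.07):
    """Trim a fractional margin from the left and right edges of the image."""
    width = image.shape[1]
    cut = int(width * margin)
    return image[:, cut:width - cut]


def slice_into_tiles(image, tile_size=299, stride=299):
    """Cut the image into square tiles, dropping partial tiles at the edges."""
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile_size + 1, stride):
        for left in range(0, width - tile_size + 1, stride):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles


def is_usable(tile, mean_range=(20.0, 200.0), var_range=(100.0, 10000.0)):
    """Keep tiles whose mean and variance fall inside the given (placeholder) ranges."""
    mean, var = float(tile.mean()), float(tile.var())
    return (mean_range[0] <= mean <= mean_range[1]
            and var_range[0] <= var <= var_range[1])


def usable_tiles(image):
    """Trim, slice, and filter one image into a list of usable tiles."""
    return [t for t in slice_into_tiles(trim_sides(image)) if is_usable(t)]
```

A tile that is entirely black has a mean and variance of zero, so it fails both lower bounds and is discarded.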
 
Once the tiles are created, the tiles from each of the scanners are combined, shuffled and saved in multiple batches to keep the files at reasonable sizes.

## MIAS Data

Since the MIAS data was the dataset I originally planned to work with, I wanted to try to use it for this project. The MIAS data differs significantly from both of the DDSM datasets used, so it had to be processed in order to be usable.

The biggest difference is that the MIAS data is all 1024x1024 images, with the scan horizontally centered in the image, whereas the DDSM images were full sized. In order to have the images on similar scales I calculated the average size of the DDSM images, factored in the fact that I would be scaling each image down by half to get the ROI to fit in a 299x299 tile, and came up with a factor of 2.5 by which the MIAS images needed to be scaled up.

Another difference was that the MIAS data had the ROI identified by a center and a radius as opposed to a mask. Originally the script for processing the CBIS-DDSM images returned two sets of coordinates to outline a square surrounding the ROI. I realized that this could result in the data being substantially different so I updated the scripts for creating the CBIS images to return a center and the size of the ROI so that the same process could be applied to the MIAS data.
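
As a rough illustration, a MIAS annotation can be converted into the same center-and-size form by scaling the center and the diameter by the 2.5 factor. The function name is hypothetical, and the sketch does not handle any difference in coordinate origin between the annotation file and the image array, which would need to be checked separately.

```python
def mias_roi_to_center_and_size(x, y, radius, scale=2.5):
    """Convert a (center, radius) annotation to a scaled center and square ROI size."""
    center_x = int(round(x * scale))
    center_y = int(round(y * scale))
    # The ROI is treated as the square that encloses the annotated circle.
    size = int(round(2 * radius * scale))
    return center_x, center_y, size
```

For example, an annotation with center (400, 600) and radius 50 becomes a center of (1000, 1500) with an ROI size of 250 pixels in the scaled-up image.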

I then wrote functions to apply the same methods used to process the DDSM and CBIS-DDSM data to the MIAS data. For each image, if the scan was normal it was cut into tiles in the same way as the DDSM data, and if it was abnormal the ROI was extracted using the same procedure as for the CBIS-DDSM data.

My original plan was to use the MIAS data as a test dataset, but after reviewing the tiles created from each dataset I abandoned this, as the MIAS data appeared to differ from the DDSM data in ways I could not pin down. This impression was supported by the fact that a model trained on the DDSM data, which performed well on the validation data, performed very poorly on the MIAS data.

At the moment the MIAS data is being added to the training dataset, but I may remove it and just use the DDSM data, as I am unsure whether the datasets are sufficiently similar to be combined. The very small size of the MIAS data should make its inclusion or exclusion largely irrelevant.