# Exploratory Data Analysis

The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0). It is described in the following papers:

* Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification". arXiv preprint arXiv:2110.14795, 2021.
* Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

The MedMNIST dataset contains medical imaging data ranging from colonoscopy imaging to abdominal CT scans to breast ultrasounds. It uses two main data formats: 2-dimensional (2D) images and 3-dimensional (3D) images.


----
## Install required packages and dataset

**Goal:** Install the necessary packages and dataset

*Note that this will take a few minutes to run...*

In [None]:
# Terminal command for installing packages. pip is commonly used to install
# Python packages. Here, we pass in medmnist as the package name. You can
# install any existing Python package with `pip install XXX`, where XXX is
# the name of the desired package.
! pip install medmnist
! pip install keras

# Import medmnist dataset
import medmnist
from medmnist import INFO, Evaluator

# Import torchvision transforms (used below to preprocess the images)
import torchvision.transforms as transforms

# Import utility packages
import numpy as np
import random

----
## Load in the dataset

**Goal:** Now that the dataset package is installed (in the step above), load the dataset so that we can use it in our code.

*Note that there are multiple steps for this process...*

### Step 1: Load helper functions

There are a few functions that can help us load in the dataset. Run the cell below to load up the helper functions we need.

You don't need to worry about understanding the functions get_loader and shuffle_iterator (those are for simply loading in the data from MedMNIST).  If you want to learn more about the dataset itself as well as the code and paper you can visit the MedMNIST website at https://medmnist.com/.

In [None]:
# Helper function for retrieving the dataset
def get_loader(dataset, batch_size):
    total_size = len(dataset)
    print('Size', total_size)
    index_generator = shuffle_iterator(range(total_size))
    while True:
        data = []
        for _ in range(batch_size):
            idx = next(index_generator)
            data.append(dataset[idx])
        yield dataset._collate_fn(data)

# This function takes in an iterator and shuffles the order of the items in it
def shuffle_iterator(iterator):
    # iterator should have limited size
    index = list(iterator)
    total_size = len(index)
    i = 0
    random.shuffle(index)
    while True:
        yield index[i]
        i += 1
        if i >= total_size:
            i = 0
            random.shuffle(index)

# Convert each image to a tensor (pixel values scaled to [0, 1]), then
# normalize with mean 0.5 and std 0.5 so that values fall in [-1, 1].
data_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[.5], std=[.5]),
])
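
To make the two transform steps concrete, here is a minimal NumPy sketch of the arithmetic that `ToTensor` followed by `Normalize(mean=[.5], std=[.5])` applies to 8-bit pixel values. The function name `to_normalized` is made up for illustration, and the sketch leaves out the channel reordering that `ToTensor` also performs.

```python
import numpy as np

def to_normalized(pixels_uint8):
    """Mimic ToTensor + Normalize(mean=0.5, std=0.5) on raw 8-bit pixels."""
    scaled = pixels_uint8.astype(np.float32) / 255.0  # ToTensor: [0, 255] -> [0, 1]
    return (scaled - 0.5) / 0.5                       # Normalize: [0, 1] -> [-1, 1]

raw = np.array([0, 51, 255], dtype=np.uint8)
normalized = to_normalized(raw)
print(normalized)
```

Because the stored values end up in [-1, 1], multiplying by 0.5 and adding 0.5 brings them back to [0, 1], which is useful before displaying a color image.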

### Step 2: Specify dataset

There are many different possible datasets that we can use, so we need to specify the dataset we want to use (and other parameters needed to run a model on that dataset).

In [None]:
data_flag = 'pathmnist'
download = True

NUM_EPOCHS = 3
BATCH_SIZE = 64

info = INFO[data_flag]
task = info['task']
n_channels = info['n_channels']
n_classes = len(info['label'])

DataClass = getattr(medmnist, info['python_class'])
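
The last line of the cell above looks up a class by its name. As a rough illustration of that pattern, the sketch below does the same thing with the standard-library `collections` module. The dictionary `info_stub` is a made-up stand-in, not the real `INFO` entry.

```python
import collections

# Hypothetical stand-in for one INFO entry: it stores a class name as a string
info_stub = {'python_class': 'OrderedDict'}

# getattr(module, name) returns the attribute of `module` called `name`,
# so a string from a configuration dictionary can select which class to use
SelectedClass = getattr(collections, info_stub['python_class'])

instance = SelectedClass(a=1)
print(SelectedClass.__name__, dict(instance))
```

This is why changing only `data_flag` is enough to switch to a different MedMNIST dataset: the class is chosen from the string stored under `'python_class'`.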

### Step 3: Load the data

Now that we have set up the parameters required to load the dataset we want, let's finally load the data. Run the next cell to load the dataset that we will use to train and test our model. *Note that this will take a few minutes...*

In [None]:
train_dataset = DataClass(split='train', transform=data_transform, download=download)
test_dataset = DataClass(split='test', transform=data_transform, download=download)

### Step 4: Split the data

After loading in the data, we need to process it a little bit. We want to (1) create a loader for running our model and (2) extract the images and labels that are in the train/test datasets.

In [None]:
# (1) Create loaders for running our model
train_loader = get_loader(dataset=train_dataset, batch_size=BATCH_SIZE)
test_loader = get_loader(dataset=test_dataset, batch_size=BATCH_SIZE)

# (2) Split the dataset into images and labels using numpy and list comprehensions
training_images = np.array([np.array(elem[0]) for elem in train_dataset])
training_labels = np.array([np.array(elem[1]) for elem in train_dataset])
testing_images = np.array([np.array(elem[0]) for elem in test_dataset])
testing_labels = np.array([np.array(elem[1]) for elem in test_dataset])

The above code separates the datasets into the respective images and labels. We will look into the images and labels in more detail.
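
A quick way to confirm that the split worked is to check the array shapes. The sketch below runs the same list-comprehension pattern on a tiny synthetic dataset (random arrays standing in for real images), so the names `fake_dataset`, `images`, and `labels` are illustrative only. It assumes channels-first images of shape (3, 28, 28) and one label per image stored in a length-1 array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 (image, label) pairs shaped like the assumed format
fake_dataset = [
    (rng.uniform(-1, 1, size=(3, 28, 28)), np.array([n % 2]))
    for n in range(5)
]

images = np.array([np.array(elem[0]) for elem in fake_dataset])
labels = np.array([np.array(elem[1]) for elem in fake_dataset])

print(images.shape)  # (number of images, channels, height, width)
print(labels.shape)  # (number of images, 1)
```

Running `training_images.shape` and `training_labels.shape` in the notebook gives the corresponding shapes for the real data.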

----
## Analyze our data

**Goal:** Now that we've loaded our data, let's explore it and see what insights it holds.

### Step 1: Summarize data

In [None]:
print(train_dataset)

**Try it yourself!** Get a summary of the testing dataset.

**Try it yourself!** Explore the data to answer the following questions

*Question #1:* How many datapoints are in the train dataset?

*Answer:* Write your answer here (double click on cell to type)

*Question #2:* How many datapoints are in the test dataset?

*Answer:* Write your answer here (double click on cell to type)

*Question #3:* How many labels are in the dataset?

*Answer:* Write your answer here (double click on cell to type)

### Step 2: Examine individual data points (images)

Let's now take a look at individual images. We can view them in grayscale or in color.

**Examine multiple images at a time:** You can edit the `length` parameter to change how many images are displayed. The montage is a square grid, so `train_dataset.montage(length=5)` displays a 5 x 5 grid of 25 images.

In [None]:
"""
The .montage() function takes in a parameter called length, which allows you to
set the number of images per row and per column to output in a square montage.

This function is specific to the DataClass class. In this example, we have set
the parameter length to 20, and we can count 20 images as the side length of
the square montage.
"""

train_dataset.montage(length=20)

**Examine one image at a time:** You can use `i` to select the image that you want to see. `i` represents the index of the image in the dataset. Python indices start at 0, so if `i` is 3, then the image displayed will be the 4th image in the dataset.

In [None]:
# Import visualization package
from matplotlib import pyplot as plt

In [None]:
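# Visualize one color channel of the image at index i in grayscale

i = 300

print("Label: " + str(training_labels[i]))

# imshow accepts an image in array format; index 1 selects the second of the
# three color channels, and cmap='gray' renders it in grayscale
plt.imshow(training_images[i][1], cmap='gray')
plt.show()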

In [None]:
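# Build a single color image in (height, width, channel) format for viewing

single_image = []
for j in range(28):
  tmp = []
  for k in range(28):
    tmp.append((training_images[i][0][j][k], training_images[i][1][j][k], training_images[i][2][j][k]))
  single_image.append(tmp)

# Undo the normalization so pixel values fall in [0, 1], the range imshow
# expects for floating-point color images
single_image = np.array(single_image) * 0.5 + 0.5

# Plotting
plt.imshow(single_image)
plt.show()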
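
The rearrangement done in the cell above, from channels-first (3, 28, 28) to the channels-last (28, 28, 3) layout that `imshow` expects for color images, can also be written as a single `np.transpose` call. The sketch below checks that equivalence on a random array; `channels_first` is a made-up stand-in for `training_images[i]`.

```python
import numpy as np

rng = np.random.default_rng(0)
channels_first = rng.uniform(0, 1, size=(3, 28, 28))  # stand-in image

# Loop version: collect the three channel values for every pixel
looped = []
for j in range(28):
    row = []
    for k in range(28):
        row.append((channels_first[0][j][k],
                    channels_first[1][j][k],
                    channels_first[2][j][k]))
    looped.append(row)
looped = np.array(looped)

# Vectorized version: move axis 0 (channels) to the end
transposed = np.transpose(channels_first, (1, 2, 0))

print(transposed.shape)
print(np.array_equal(looped, transposed))
```

The transposed form is shorter and does not hard-code the 28 x 28 image size.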

**Examine labels of images:** Now, let's look at what the images show. We can (1) summarize the dataset to see what the labels mean, (2) print out the labels in our training set, and (3) calculate how many of each label are in our training set.

In [None]:
# Print out the summary of the dataset
print(train_dataset)

In [None]:
# View the labels in number format
labels = [item[0] for item in training_labels]
print("Labels of train dataset:")
print(labels)

In [None]:
# How many "adipose" are in the training labels?
# Hint: The number that represents "adipose" is 0.
print("Number of images of 'adipose' in the train dataset:", labels.count(0))

**Try it yourself!** Create a dictionary called `number_classes` to map each label to the number of times it appears in the training dataset.

An example of a label is "adipose". We calculated how many times it appears in the training labels. Now, let's repeat that for all of the labels and save it to the `number_classes` dictionary.
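
As a hint, here is one possible counting pattern on a toy list. The list `toy_labels` and the dictionary `toy_counts` are invented for this sketch; adapt the idea to `labels` to fill in `number_classes`.

```python
from collections import Counter

toy_labels = [2, 0, 1, 0, 2, 2]

# Option 1: loop over the distinct values and count each one
toy_counts = {}
for value in sorted(set(toy_labels)):
    toy_counts[value] = toy_labels.count(value)

# Option 2: Counter does the same tally in a single pass
assert toy_counts == dict(Counter(toy_labels))

print(toy_counts)  # {0: 2, 1: 1, 2: 3}
```

The loop version calls `count` once per distinct value, which is fine for a handful of classes; `Counter` scales better when there are many.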

In [None]:
number_classes = {} # Fill in the blank
print(number_classes)

**Our first visualization!** Let's create a plot that summarizes the number of classes (or labels) we have in our dataset! After you create the `number_classes` dictionary in the above cell, simply run the cell below.

In [None]:
# Creating the bar chart and adding the correct labels.  The functions used here correspond to each part of the graph
plt.bar(number_classes.keys(), number_classes.values(), width = .5);
plt.title("Number of Images by Class");
plt.xlabel('Class Name');
plt.ylabel('# Images');
plt.xticks(rotation='vertical')
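
If `number_classes` is keyed by numeric labels, the x-axis will show numbers rather than names. One possible way to swap in readable names is sketched below. It assumes the label mapping is a dictionary from string keys such as `'0'` to class names, which is how `info['label']` appears to be structured. The dictionaries `label_names` and `counts_by_number` are small made-up stand-ins with invented counts, and only the name "adipose" for label 0 comes from this notebook.

```python
# Stand-in for info['label'] (assumed to map string keys to class names)
label_names = {'0': 'adipose', '1': 'class 1', '2': 'class 2'}

# Stand-in for number_classes, keyed by integer labels; counts are invented
counts_by_number = {0: 4, 1: 2, 2: 7}

# Build a new dictionary keyed by class name instead of number
counts_by_name = {
    label_names[str(number)]: count
    for number, count in counts_by_number.items()
}

print(counts_by_name)
```

Passing `counts_by_name.keys()` and `counts_by_name.values()` to `plt.bar` would then label each bar with its class name.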