### Exploring MHD, RAW, and ZRAW files

To better understand the data I retrieved, my next step is to look up these file formats and learn how to query them. This is also where I will review which libraries are best suited to this work. I will post my results below. To work with these files, I will install a few libraries that will be required later on.

Some libraries required/recommended:
- [SimpleITK](https://simpleitk.org/)
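
Before reaching for a library, it helps to know that an `.mhd` file is a small plain-text MetaImage header made of `Key = Value` lines, and that its `ElementDataFile` entry names the file holding the voxel data. The sketch below parses such a header with the standard library only. The header text, its values, and the helper name `read_mhd_header` are made up for illustration and are not taken from a LUNA16 file.

```python
from pathlib import Path
import tempfile

# Illustrative header text; the field values are invented for this sketch.
SAMPLE_HEADER = """\
ObjectType = Image
NDims = 3
DimSize = 512 512 121
ElementSpacing = 0.74 0.74 2.5
ElementType = MET_SHORT
ElementDataFile = example_scan.raw
"""


def read_mhd_header(path):
    """Parse a MetaImage header into a dict of stripped key/value strings."""
    header = {}
    for line in Path(path).read_text().splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        header[key.strip()] = value.strip()
    return header


# Write the sample header to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    header_path = Path(tmp) / "example_scan.mhd"
    header_path.write_text(SAMPLE_HEADER)
    header = read_mhd_header(header_path)

dim_size = [int(n) for n in header["DimSize"].split()]
print(header["ElementDataFile"], dim_size)
```

All values come back as strings, so numeric fields such as `DimSize` need an explicit conversion, as in the last lines.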

In [None]:
#%pip install pydicom scikit-image plotly SimpleITK ipywidgets
import pydicom
from LUNA16.utils.analyze_folders import analyze_folder
import os
import matplotlib.pyplot as plt
import random
import numpy as np
import plotly
import skimage
import pprint
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from ipywidgets import interact, FloatSlider
import SimpleITK as sitk
%matplotlib inline
plt.rcParams["figure.figsize"] = [20, 8]
random.seed(123)

Pick a random MetaData file, and check whether any other files share its UID.

In [None]:
ROOT_FOLDER = "/home/azureuser/cloudfiles/data/LUNA16/extracted"
all_files = analyze_folder(ROOT_FOLDER)
random_uid = random.choice([file.filename for file in all_files if file.extension == "mhd"])
print(f"The random UID chosen for this notebook is: {random_uid}")
notebook_files = [file for file in all_files if file.filename == random_uid]
print(f"There were {len(notebook_files)} files found with this UID. They are:\nSize \t Ext. \t Folder.")
for file in notebook_files:
    print(round(file.size / (1024 **2), 4), "\t", file.extension, "\t", file.folder.split("/")[-2])
raw = [file for file in notebook_files if file.extension == "raw"][0]
mhd = [file for file in notebook_files if file.extension == "mhd"]
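
The lookup above relies on the idea that a header and its data file share the same name and differ only by extension. A minimal sketch of that grouping, using only the standard library, could look like the following. The helper `group_by_stem` and the file names are hypothetical stand-ins and do not reflect how `analyze_folder` is implemented.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def group_by_stem(root):
    """Map each file stem under root to the sorted extensions found for it."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.stem drops only the last suffix, so dotted UIDs stay intact.
            groups[path.stem].append(path.suffix.lstrip("."))
    return {stem: sorted(exts) for stem, exts in groups.items()}


# Build a tiny fake folder tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    subset = Path(tmp) / "subset0"
    subset.mkdir()
    for name in ("scan_a.mhd", "scan_a.raw", "scan_b.mhd", "scan_b.zraw"):
        (subset / name).touch()
    groups = group_by_stem(tmp)

print(groups)
```

A stem that maps to a single extension would flag a header without its data file, or the reverse.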

In [None]:
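# ReadImage takes the path to the .mhd header and loads the voxel data it points to
mhd_image = sitk.ReadImage(mhd[0].folder)
# GetArrayFromImage returns a NumPy array indexed as (z, y, x)
mhd_image = np.array(sitk.GetArrayFromImage(mhd_image), dtype=np.float32)
# Reorder to (y, x, z) so the last axis selects a slice
mhd_image = np.transpose(mhd_image, [1, 2, 0])
NUM_IMAGES = 8
# Step between displayed slices so the samples span the whole volume
spacing = round(mhd_image.shape[2] / NUM_IMAGES)
fig, ax = plt.subplots(1, NUM_IMAGES)
for i in range(NUM_IMAGES):
    idx = i * spacing
    image = mhd_image[:,:,idx]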
    ax[i].imshow(image)
    ax[i].set_title(f"{idx}")
    ax[i].get_xaxis().set_visible(False)
    ax[i].get_yaxis().set_visible(False)
plt.show()
plt.close()
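
The part of the cell above that is easiest to get wrong is the axis order: `sitk.GetArrayFromImage` returns an array indexed as (z, y, x), and the transpose moves the slice axis to the end. The sketch below reproduces only that reordering on a tiny synthetic array, so it does not need SimpleITK or a real scan. The array contents are arbitrary.

```python
import numpy as np

# Stand-in for sitk.GetArrayFromImage(...): a tiny volume with
# 3 slices (z), 4 rows (y) and 5 columns (x).
volume_zyx = np.arange(3 * 4 * 5, dtype=np.float32).reshape(3, 4, 5)

# Same reordering as in the notebook cell: (z, y, x) -> (y, x, z).
volume_yxz = np.transpose(volume_zyx, [1, 2, 0])

# Slice k along the last axis of the transposed array is the k-th
# slice along the first axis of the original array.
k = 2
same_slice = np.array_equal(volume_yxz[:, :, k], volume_zyx[k])
print(volume_zyx.shape, volume_yxz.shape, same_slice)
```

Keeping the array in (z, y, x) order and indexing with `volume[k]` would work just as well; the transpose is a matter of convenience for plotting.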

Those cross-sections are very interesting! I admit that I still don't fully understand the syntax of ```SimpleITK```, but we are making progress. One thing I would like to know is the distribution of values within the slices of the image. This is what we explore next:

In [None]:
# Uncomment to draw 1 Histogram. Takes ~20 seconds
# hist_data = np.reshape(mhd_image, -1) #Flatten all channels
# plt.hist(hist_data, bins=100)
# plt.title(f"Histogram of values in {mhd[0].filename}")
# plt.show()

Some values look suspicious: why are there so many values at -3000? This is worth investigating.
To do so, I will start creating reusable objects in the ```LUNA16``` package.
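
Before building anything reusable, a quick way to quantify the problem is to count how many voxels fall well below the value of air. The sketch below assumes the intensities are in Hounsfield units, where air sits at about -1000, and uses a synthetic volume with a made-up padding value of -3000. Both the volume and the 500-unit margin are assumptions for illustration, not measurements from the dataset.

```python
import numpy as np

# Synthetic stand-in for one scan: a block at a soft-tissue-like
# intensity surrounded by a border at a hypothetical padding value.
PAD_VALUE = -3000.0  # assumption for this sketch, not read from a real file
volume = np.full((4, 8, 8), PAD_VALUE, dtype=np.float32)
volume[:, 2:6, 2:6] = 40.0

# Air is about -1000 on the Hounsfield scale, so values far below
# that are unlikely to be physical measurements.
AIR_HU = -1000.0
suspect = volume < AIR_HU - 500.0

# Share of suspicious voxels, and which values they actually take.
fraction = suspect.mean()
values, counts = np.unique(volume[suspect], return_counts=True)
print(f"{fraction:.2%} of voxels are below {AIR_HU - 500.0}")
print(dict(zip(values.tolist(), counts.tolist())))
```

If the suspicious voxels collapse onto one or two distinct values, that points to a fill value rather than noise.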

My first experiment is to open all the mhd files and try to understand the distribution of the data. But first, we need to make sure that we are using the multiprocessing package correctly to perform fast calculations. I will first run the process sequentially, and then use multiprocessing.

In [None]:
from LUNA16.utils.analyze_data_distribution import analyze_data_distribution, analyze_shapes
from tqdm.notebook import tqdm
from collections import Counter

all_mhd_files = [file for file in all_files if file.extension == "mhd"]
all_results = []

# Uncomment to perform the sequential operation. It took 11 minutes on my machine; your mileage may vary.
# for i in tqdm(range(len(all_mhd_files))):
#     result = analyze_shapes([all_mhd_files[i]])
#     all_results.append(result)
# shapes = Counter(all_results)
# print(shapes.most_common(10))

In [None]:
import multiprocessing
all_mhd_files_mp = [(file, ) for file in all_files if file.extension == "mhd"]
results = []

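def mp_fn(fn=analyze_shapes, files=all_mhd_files_mp):
    """Run fn over files in a process pool, one worker per CPU core."""
    PROCESSES = multiprocessing.cpu_count()
    with multiprocessing.Pool(PROCESSES) as pool:
        # pool.map preserves input order and returns a list of results
        pool_results = pool.map(fn, files)
        # Note: results is the module-level list, so the return value is
        # a list of lists that grows with each call to mp_fn
        results.append(pool_results)

    return results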

In [None]:
list_of_channels = mp_fn(analyze_shapes)
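
To check that the parallel run really matches the sequential one, the two can be compared on a toy workload. The sketch below is self-contained: `fake_shape` is a hypothetical worker standing in for `analyze_shapes`, and the shapes it returns are invented. It relies on the fact that `Pool.map` returns one result per input, in input order.

```python
from collections import Counter
import multiprocessing


def fake_shape(item):
    """Hypothetical worker: return a (rows, cols, slices) tuple for one item."""
    (index,) = item  # each work item is a 1-tuple, as in the cells above
    return (512, 512, 100 + index % 3)


def run_sequential(fn, items):
    """Apply fn to each item in a plain loop."""
    return [fn(item) for item in items]


def run_parallel(fn, items, processes=None):
    """Apply fn to each item in a process pool; results keep input order."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(fn, items)


items = [(i,) for i in range(12)]
sequential = run_sequential(fake_shape, items)

if __name__ == "__main__":
    parallel = run_parallel(fake_shape, items, processes=2)
    print(parallel == sequential)
    print(Counter(parallel).most_common(3))
```

Because `pool.map` already returns a flat list, it can be passed straight to `Counter` without appending it to another list first.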

I now turn my attention to exploring the RAW file, as well as the pydicom library.
First, since ```pydicom``` is a library that I am not very familiar with, I decided to browse the contents of its namespace. To do so, I use a list comprehension to pretty-print only those entries of ```__dir__``` that don't include an underscore. After analyzing the returned names, I believe that ```.dcmread()``` is what I need to open the file.
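
The namespace browsing described above can be sketched as follows. To keep the example runnable without extra installs, it inspects the standard-library `json` module as a stand-in; swapping in `pydicom` is the only change needed. The helper name `public_names` is my own.

```python
import json  # stand-in module; replace with pydicom when it is installed
import pprint


def public_names(namespace):
    """Return the names in a namespace that contain no underscore."""
    return [name for name in dir(namespace) if "_" not in name]


names = public_names(json)
pprint.pprint(names)
```

Filtering on any underscore also hides public names written in snake case; `not name.startswith("_")` is a looser filter that keeps them.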