## Combining data sources into a unified dataset

* Loading and processing raw data files
* Implementing a Python class to represent our data
* Converting our data into a format usable by PyTorch
* Visualizing the training and validation data

1. Now that we have discussed the high-level goals of part 2, it's time to get into the specifics. In this chapter we implement the basic data-loading and data-processing routines for our raw data. Nearly every significant project we work on will need something analogous to what we build here.
2. Our goal is to produce a training sample from two inputs: our raw CT scan data and a list of annotations for those CTs. This might sound simple, but quite a bit needs to happen before we can load, process and extract the data we are interested in.

#### Raw CT data files
1. Our CT data comes in two files: a .mhd file containing the metadata header information and a .raw file containing the raw bytes that make up the 3D array.
2. Each file's name starts with a unique identifier, called the `Series UID`, for the CT scan in question.
3. Our CT class will consume those two files and produce the 3D array, as well as the transformation matrix to convert from the patient coordinate system (discussed later) to the index, row, column coordinates needed by the array. The transformation information is contained in the .mhd file.
4. We will also load the annotation data provided by LUNA, which gives us a list of nodule coordinates, each with a malignancy flag, along with the series UID of the relevant CT scan. We have some coordinate system conversion to do before we can apply these coordinates to the CT data: by combining a nodule coordinate with the coordinate system transformation information, we get the index, row and column of the voxel at the center of our nodule.
5. Using the I,R,C coordinates we can crop a small 3D chunk of our CT scan to use as an input to our model. Along with this 3D sample array we must construct the rest of our training sample tuple, which contains:
    * Sample array
    * Nodule Status Flag
    * Series UID
    * The index of this sample in the CT list of nodule candidates.\
    This sample tuple is exactly what PyTorch expects from our Dataset subclass and represents the last section of our bridge from the original raw data to the standard structure of PyTorch tensors.
6. Limiting or cropping our data so as not to drown our model in noise is important, but we need to make sure the cropping is not so aggressive that our signal gets cropped out of the input. This is a form of feature engineering, but here we will not do traditional feature engineering; rather, we will let the model do the heavy lifting.


#### Parsing LUNA's annotation data
The first thing we need to do is load the data and see what it looks like. We could try loading and parsing individual CT scans, but it is better to start with the CSV files that LUNA provides, which contain the points of interest in each CT scan. The information we will get is:
* The coordinates
* An indication whether the coordinate is a nodule
* A unique identifier for the CT Scan
Since there are fewer types of information in the CSV files, they are easier to parse and will give us some clues about what to look for once we start loading the CTs.
* The candidates.csv file contains information about all lumps that look like nodules, whether those lumps are malignant, benign tumors or something else altogether. We will use this as the basis for a complete list of candidates that can be split into training and validation sets.

In [1]:
import pandas as pd

In [2]:
# Parsing candidates.csv file, we can use pandas to parse the files

# reading the csv file
candidate = pd.read_csv("E:/data/data-unversioned/candidates.csv")

# renaming the columns
candidate.rename(columns={"seriesuid":"series_uid", "coordX":"x","coordY":"y","coordZ":"z"}, inplace=True)

# displaying the dataframe
candidate


Unnamed: 0,series_uid,x,y,z,class
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-56.08,-67.85,-311.92,0
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,53.21,-244.41,-245.17,0
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,103.66,-121.80,-286.62,0
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-33.66,-72.75,-308.41,0
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-32.25,-85.36,-362.51,0
...,...,...,...,...,...
551060,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,-55.66,37.24,-110.42,0
551061,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,68.40,70.18,-109.72,0
551062,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,-82.29,-27.94,-106.92,0
551063,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,38.26,83.50,-102.71,0


In [4]:
# The annotations.csv file contains information about the candidates that have been flagged as nodules
# We are particularly interested in the diameter_mm feature
annotations = pd.read_csv("E:/data/data-unversioned/annotations.csv")
annotations.rename(columns={"seriesuid":"series_uid", "coordX":"x","coordY":"y","coordZ":"z"}, inplace = True)
annotations

Unnamed: 0,series_uid,x,y,z,diameter_mm
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,-128.699421,-175.319272,-298.387506,5.651471
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222...,103.783651,-211.925149,-227.121250,4.224708
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793...,69.639017,-140.944586,876.374496,5.786348
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,-24.013824,192.102405,-391.081276,8.143262
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016...,2.441547,172.464881,-405.493732,18.545150
...,...,...,...,...,...
1181,1.3.6.1.4.1.14519.5.2.1.6279.6001.994459772950...,-160.856298,-28.560349,-269.168728,5.053694
1182,1.3.6.1.4.1.14519.5.2.1.6279.6001.994459772950...,-102.189570,-73.865766,-220.536241,4.556101
1183,1.3.6.1.4.1.14519.5.2.1.6279.6001.994459772950...,-37.535409,64.041949,-127.687101,4.357368
1184,1.3.6.1.4.1.14519.5.2.1.6279.6001.997611074084...,43.196112,74.438486,-200.523314,4.277203




In [9]:
candidate['series_uid'].value_counts()

1.3.6.1.4.1.14519.5.2.1.6279.6001.339142594937666268384335506819    1468
1.3.6.1.4.1.14519.5.2.1.6279.6001.174692377730646477496286081479    1425
1.3.6.1.4.1.14519.5.2.1.6279.6001.279953669991076107785464313394    1414
1.3.6.1.4.1.14519.5.2.1.6279.6001.113697708991260454310623082679    1408
1.3.6.1.4.1.14519.5.2.1.6279.6001.258220324170977900491673635112    1400
                                                                    ... 
1.3.6.1.4.1.14519.5.2.1.6279.6001.608029415915051219877530734559     172
1.3.6.1.4.1.14519.5.2.1.6279.6001.397202838387416555106806022938      65
1.3.6.1.4.1.14519.5.2.1.6279.6001.225515255547637437801620523312      57
1.3.6.1.4.1.14519.5.2.1.6279.6001.935683764293840351008008793409      52
1.3.6.1.4.1.14519.5.2.1.6279.6001.153536305742006952753134773630      32
Name: series_uid, Length: 888, dtype: int64

In [10]:
# counts the number of candidates that are non-nodules (class 0) and nodules (class 1)
candidate['class'].value_counts()

0    549714
1      1351
Name: class, dtype: int64
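
The counts above show how imbalanced the candidate list is. As a quick arithmetic check, the sketch below uses only the two numbers printed by `value_counts()`, taking class 1 to mean "nodule":

```python
# Counts copied from the value_counts() output above.
non_nodule_count = 549714  # class == 0
nodule_count = 1351        # class == 1

total = non_nodule_count + nodule_count
nodule_fraction = nodule_count / total
non_nodules_per_nodule = non_nodule_count / nodule_count

print(f"total candidates:       {total}")
print(f"nodule fraction:        {nodule_fraction:.4%}")
print(f"non-nodules per nodule: {non_nodules_per_nodule:.0f}")
```

Well under one percent of the candidates are actual nodules, which is worth keeping in mind when we later split the data and interpret metrics.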


In [12]:
annotations['series_uid'].value_counts()

1.3.6.1.4.1.14519.5.2.1.6279.6001.176030616406569931557298712518    12
1.3.6.1.4.1.14519.5.2.1.6279.6001.328789598898469177563438457842     9
1.3.6.1.4.1.14519.5.2.1.6279.6001.195557219224169985110295082004     9
1.3.6.1.4.1.14519.5.2.1.6279.6001.202187810895588720702176009630     9
1.3.6.1.4.1.14519.5.2.1.6279.6001.219428004988664846407984058588     9
                                                                    ..
1.3.6.1.4.1.14519.5.2.1.6279.6001.164988920331211858091402361989     1
1.3.6.1.4.1.14519.5.2.1.6279.6001.202464973819273687476049035824     1
1.3.6.1.4.1.14519.5.2.1.6279.6001.200841000324240313648595016964     1
1.3.6.1.4.1.14519.5.2.1.6279.6001.167919147233131417984739058859     1
1.3.6.1.4.1.14519.5.2.1.6279.6001.313605260055394498989743099991     1
Name: series_uid, Length: 601, dtype: int64

We have size information for about 1,200 nodules. This is useful because we can use it to make sure that our training and validation data include a representative spread of nodule sizes. Without this, our validation set could end up with only extreme values, making it seem as though our model is underperforming.

#### Training and Validation Sets
For any classification task we split our data into training and validation sets. We want both sets to be representative of the range of real-world input data we expect to see and handle normally. If either set is meaningfully different from real-world use cases, it is quite likely that our model will behave differently than we expect, and the training and statistics we collect will not be predictive once we move the model to production.\
Let's go back to our nodules in the annotations.csv file:
1. We are going to sort the nodules by size and take every Nth one for our validation set. This should give us a representative spread. Unfortunately, the location information in annotations.csv does not always line up exactly with the coordinates in candidates.csv.
2. Real-world data sources are often imperfect, so we will have to do some work to make them line up. This is a good example of the kind of work needed to assemble data from disparate sources.
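
The every-Nth split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `split_every_nth`, the stride of 5, and the diameter values are all assumptions made up for the example.

```python
def split_every_nth(sorted_items, val_stride):
    """Return (train, val): every val_stride-th item goes to validation."""
    if val_stride <= 0:
        raise ValueError("val_stride must be a positive integer")
    val = sorted_items[::val_stride]
    train = [item for i, item in enumerate(sorted_items) if i % val_stride != 0]
    return train, val

# Hypothetical nodule diameters in millimeters, for illustration only.
diameters_mm = [4.2, 18.5, 5.7, 8.1, 5.6, 4.3, 12.0, 6.9, 4.6, 5.1]

# Sort by size first so that a strided slice covers the whole size range.
ordered = sorted(diameters_mm, reverse=True)
train, val = split_every_nth(ordered, val_stride=5)

print("validation:", val)
print("training:  ", train)
```

Because the list is sorted before slicing, the validation set picks up both large and small nodules rather than a random clump of similar sizes.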

#### Unifying our annotation and candidate data
Now that we have seen what the data looks like, let's build a `GetCandidateInfoList` function that stitches all of this together. We will use a named tuple to hold the information for each nodule.

In [15]:

import os
list_d = os.listdir("E://data/data-unversioned/subset0/")
filename = list_d[0]
print(filename)
print(os.path.split(filename)[-1])
print(os.path.split(filename)[-1][:-4])

1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860


These candidate tuple instances are not our training samples, as they are missing the chunks of CT data that we need. Instead, they represent a sanitized, cleaned and unified interface to the human-annotated data we are using.\
`Note : Clearly separate the code that is responsible for data sanitization from the rest of your project. Don't be afraid to rewrite your data once and save it to disk if needed.`\
Our candidate tuple will have:
1. `nodule status`: What we are going to train the model to classify.
2. `Diameter`: Useful for getting a good spread in training, since large and small nodules will not have the same features.
3. `series_uid`: To locate the correct CT scan.
4. `Candidate center`: To find the candidate in the larger CT scan.\
The function that builds a list of these `candidate_info_tuple` instances starts with an in-memory caching decorator, followed by getting the list of files present on disk.
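
To see what the in-memory caching decorator buys us, here is a small sketch using `functools.lru_cache` from the standard library. The function `load_candidates` and its return value are hypothetical stand-ins for a slow parsing routine.

```python
import functools

call_count = 0  # tracks how often the expensive function body actually runs

@functools.lru_cache(maxsize=1)
def load_candidates(require_on_disk=True):
    """Hypothetical stand-in for a slow CSV-parsing function."""
    global call_count
    call_count += 1
    return ["candidate-a", "candidate-b"]

first = load_candidates()
second = load_candidates()  # same arguments: served from the cache

print(call_count)
print(first is second)
```

One caveat of this design: the cache hands back the very same list object on every call, so a caller that mutates the result also changes what later callers receive.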

1. Since parsing the data files can be slow, we cache the results of this function call in memory. This will come in handy later because we will be calling this function more often in future chapters. Speeding up our data pipeline by carefully applying in-memory or on-disk caching can result in some pretty impressive gains in training speed. Keep an eye out for these opportunities when working on your projects.
2. Earlier we said that we will run our program on less than the full set of training data, due to long download times and high disk space requirements. `The requireOnDisk_bool = True parameter makes good on this promise`: we detect which series_uids are actually present and ready to be loaded from disk, and we use that information to limit which entries we use from the CSV files we are about to parse.
3. Being able to run our loop on a subset of the data is useful because it helps verify that the code is working as intended. The metrics from such a run are often bad, but exercising our metrics, logging, model-checking and similar functionality is still beneficial. Once we are satisfied with everything else, we can load the full training set to improve performance.
4. After we get our candidate information, we want to merge in the diameter information from the annotations.csv file. First we need to group our annotations by series_uid, since that is the first key we will use to cross-reference each row from the two files.
5. For each candidate entry for a given series_uid, we loop through the annotations we collected earlier for the same series_uid and see if the two coordinates are close enough to be considered the same nodule. If so, we have found the diameter of that nodule. If not, we just treat the nodule as having a diameter of 0.0. Since we only use this information to get a good spread of nodule sizes in our training and validation sets, having incorrect diameters for some nodules should not be a problem, but we should remember that we are doing this in case our assumption is wrong.
This sort of fiddly code, written just to fuzzily match up nodule diameters, is unglamorous, but it is typical of what is required when dealing with raw data. Once we get to this point, we just need to sort the data and return it.
6. The ordering of the fields in `candidate_info_tuple` is driven by this sort: sorting the list compares nodule status first and diameter second. We use this approach so that when we take a slice of the data, that slice gets a representative chunk of the actual nodules, with a good spread of nodule diameters.
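
The matching rule from step 5 can be isolated into a small function. The sketch below assumes annotations are given as `(center_xyz, diameter_mm)` pairs; the name `match_diameter` and the sample values are hypothetical.

```python
def match_diameter(candidate_center_xyz, annotations):
    """Return the diameter of the first annotation whose center lies within
    a quarter of its diameter of the candidate on every axis, else 0.0.

    `annotations` is a list of (center_xyz, diameter_mm) pairs.
    """
    for annotation_center_xyz, diameter_mm in annotations:
        deltas = (
            abs(c - a)
            for c, a in zip(candidate_center_xyz, annotation_center_xyz)
        )
        # Per-axis bounding-box check, not a Euclidean distance check.
        if all(delta <= diameter_mm / 4 for delta in deltas):
            return diameter_mm
    return 0.0

# Hypothetical values for illustration only.
annotations = [((10.0, 20.0, 30.0), 8.0)]
near = match_diameter((11.0, 19.0, 30.5), annotations)
far = match_diameter((15.0, 20.0, 30.0), annotations)
print(near, far)
```

The first candidate is within 2 mm of the annotation on every axis and inherits its diameter; the second is 5 mm away along x and falls back to 0.0.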


#### Loading the individual CT Scans
1. Next up, we need to be able to take our raw CT scan data and transform it from bits on disk into a Python object from which we can extract 3D nodule density data.
2. Our annotation information acts as a map to the interesting parts of the raw data. Before we can follow that map to our data of interest, we need to get the data into an addressable form.\
`Note : Having large amounts of raw data, most of which is uninteresting, is common in machine learning. Look for ways to limit your scope to only relevant data when working on your own projects.`
3. DICOM is a very old format, so LUNA has converted the data we are using into the MetaIO format, which is a bit easier to use. Don't worry if you have never heard of the format before. We can treat it as a black box and use the SimpleITK library to load the files into more familiar NumPy arrays.
4. For real projects you would want to know what type of information is contained in your data, but it's perfectly fine to use libraries like SimpleITK to parse the bits on disk. Just remember that we are more concerned with the data itself than with how it is represented on disk.
5. Being able to uniquely identify a sample is very useful: clearly communicating which sample is causing trouble helps us isolate the case and look at it more closely.
6. We identify specific CT scans using the series instance UID (series_uid) assigned when the CT scan was created. DICOM makes heavy use of unique identifiers. They are similar in concept to UUIDs, but they are created in a different way and formatted differently.
7. At this point ct_a is a 3D array. All three dimensions are spatial, and the single intensity channel is implicit. In a PyTorch tensor, the channel information would be represented as a fourth dimension of size 1.
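
To make the implicit channel concrete, the sketch below adds an explicit channel axis to a small stand-in array with NumPy. The array shape is made up for the example; on a PyTorch tensor the equivalent call would be `unsqueeze(0)`.

```python
import numpy as np

# Hypothetical small volume standing in for a CT array (index, row, column).
ct_a = np.zeros((4, 8, 8), dtype=np.float32)

# Add an explicit channel axis of size 1 in front of the spatial axes.
ct_with_channel = ct_a[np.newaxis, ...]

print(ct_a.shape, ct_with_channel.shape)
```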

##### Hounsfield Units
1. If we do not understand the nuances of our data's values and range, our model's ability to learn from the data will be hindered.
2. CT scans are typically stored as signed 12-bit integers (shoved into 16-bit integers), which fits well with the level of precision CT scanners can provide.
3. CT scan values are expressed in `Hounsfield units`, which are odd units. `Air` is expressed as `-1000 HU (close to 0g/cc)`, `Water` is expressed as `0 HU(1g/cc)` and `Bone` is expressed as `+1000 HU (2-3g/cc)`.
4. Some CT scanners use HU values that correspond to negative densities to indicate that those voxels are outside the scanner's field of view. For our purposes, everything outside the patient's body is irrelevant, and we also don't need the exact densities of bones. So we clip our HU values to the range `-1000 to 1000`, even though the upper cap is not biologically accurate in most cases.
5. Values above 0 HU don't scale perfectly with density, but since the tumors we are interested in are around 0 HU (1 g/cc), we do not need HU to map perfectly to g/cc; the model will consume HU directly as input.
6. We want to remove these outliers from the data because they aren't directly relevant to our goal, and having outliers can make the model's job harder. This can happen in many ways, for example when unclipped data is fed to batch normalization and the statistics used to normalize the data are skewed. `Always be on the lookout for these ways to clean your data`.
7. It is important to know that our data has values between -1000 and 1000, since in chapter 13 we end up adding channels of information to our samples. If we do not account for the disparity between HU and our additional data, those new channels can easily be overshadowed by the raw HU values. We do not add those channels for the classification step of the project, so we do not need to implement special handling right now.
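
As a small illustration of the clipping described above, the sketch below applies `np.clip` to a handful of hypothetical HU readings (the specific values are invented for the example).

```python
import numpy as np

# Hypothetical raw values: out-of-view padding, air, water, dense tissue, metal.
raw_hu = np.array([-3024, -1000, 0, 700, 3071], dtype=np.float32)

# Everything below -1000 HU is treated as air, everything above 1000 HU is capped.
clipped_hu = np.clip(raw_hu, -1000, 1000)

print(clipped_hu)
```

`np.clip` returns a new array here. To modify a large CT volume without allocating a copy, the in-place form `ct_a.clip(-1000, 1000, ct_a)` writes the result back into the same array.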

##### Locating a nodule using the patient coordinate system
1. Deep learning models expect fixed-size inputs due to the fixed number of input neurons. We need to produce a fixed-size array containing our candidate nodule so that it can be used as input to our classifier. We would like to train our model on a crop of the data that has the candidate nicely centered, so that the model does not need to learn to identify nodules tucked into the corners. By reducing the variation in our inputs, we make the model's job easier.
    ##### The patient coordinate system
    1. Unfortunately, all the candidate center data we loaded earlier is expressed in millimeters (the x,y,z coordinate system) and not in voxels (the I,R,C coordinate system).
    2. So the coordinates need to be converted from the millimeter-based x,y,z coordinate system to the voxel-based I,R,C coordinate system used to take array slices from our CT scan data. This is a classic example of why it is important to handle units consistently.
    3. As mentioned previously, when dealing with CT scans we refer to the array dimensions as index, row and column, because x, y and z have a separate meaning in the patient coordinate system.
    4. `The patient coordinate system defines positive X to be the patient's left (Left), positive Y to be the patient's back (Posterior) and positive Z to be towards the patient's head (Superior). This is sometimes referred to as Left-Posterior-Superior (LPS).`
    5. `The patient coordinate system is measured in millimeters and has an arbitrarily positioned origin that does not correspond to the origin of the CT voxel array`.
    6. The patient coordinate system is used to specify the locations of interesting anatomy in a way that is independent of any particular scan. The metadata that defines the relationship between the CT array and the patient coordinate system is stored in the header of the DICOM files, and the meta-image format preserves that data in its own header. This metadata allows us to construct the transformation from X,Y,Z to I,R,C. The raw data contains many other similar metadata fields, but since we have no use for them right now, we won't look into them.

    ##### CT scan Shape and Voxel size
    1. One of the most common variations between CT scans is the size of the voxels; typically they are not cubes. Usually the row and column voxel sizes are the same and the index dimension has a larger value, but other ratios can exist.
    2. When plotted using square pixels, the non-cubic voxels can look somewhat distorted. We need to apply a scaling factor if we want the image to depict realistic proportions.
    3. Knowing these kinds of details helps when we want to depict our data and interpret our results visually. Without this information it would be easy to assume that something was wrong with our data loading, for example that the data looked squat because we omitted some slices while loading. It is easy to waste a long time debugging something that was working all along, and being familiar with your data helps prevent that.
    4. CTs are commonly 512x512 with the index dimension ranging from 100 to 250 slices (250 slices times 2.5 mm is typically enough to cover the anatomical region of interest). This results in a lower bound of about 2^25 voxels, or about 32 million data points. Each CT specifies the voxel size in millimeters as part of the file metadata.
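
The numbers above are easy to check with a few lines of arithmetic. In this sketch the slice count of 128 is chosen because it makes the power of two exact, and the voxel size is a hypothetical example rather than a value read from a real file.

```python
rows, cols = 512, 512
slices = 128  # a slice count inside the 100 to 250 range mentioned above

voxel_count = rows * cols * slices
print(voxel_count, voxel_count == 2 ** 25)

# Hypothetical voxel size in millimeters along (x, y, z); real values come from the metadata.
vx_size_xyz = (0.7, 0.7, 2.5)

# Scaling factor needed when plotting an index-vs-row slice with square pixels,
# so that each voxel is drawn with its true height-to-width proportion.
index_to_row_aspect = vx_size_xyz[2] / vx_size_xyz[1]
print(round(index_to_row_aspect, 2))
```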

    ##### Converting between millimeters and voxel addresses
    1. We will define some utility code to convert between patient coordinates (in millimeters) and the IRC coordinate system.
    2. We might wonder whether the SimpleITK library comes with utility functions for this conversion, and indeed an `Image` instance features two methods, `TransformIndexToPhysicalPoint` and `TransformPhysicalPointToIndex`, that do just that (except for the shuffling from CRI (column, row, index) to IRC). However, we want to be able to do this computation without keeping the Image object around, so we'll perform the math manually here.
    3. Flipping the axes (and potentially a rotation or other transforms) is encoded in a 3x3 matrix returned as a tuple from `ct_mhd.GetDirection()`. To go from voxel indices to coordinates, we need to follow these 4 steps in order.
        1. Flip the coordinates from IRC to CRI, to align with XYZ.
        2. Scale the indices with the voxel size.
        3. Matrix-Multiply with the directions matrix using @ in Python.
        4. Add the offset for the origin.

    4. To go back from XYZ to IRC, we need to perform the inverse of each step in the reverse order.
    5. We keep the voxel sizes in named tuples, so we convert these into arrays.
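
The four steps and their inverse can be traced by hand on a single point. The sketch below uses made-up metadata (origin, voxel size, and an identity direction matrix) purely for illustration; real values come from the .mhd header.

```python
import numpy as np

# Hypothetical metadata values, for illustration only.
origin_xyz = np.array([-200.0, -180.0, -350.0])  # mm
vx_size_xyz = np.array([0.7, 0.7, 2.5])          # mm per voxel along x, y, z
direction = np.eye(3)                            # no flips or rotation

center_irc = np.array([40, 100, 250])            # index, row, column

# Forward: IRC -> XYZ
cri = center_irc[::-1]              # 1. flip IRC to CRI to align with XYZ
scaled = cri * vx_size_xyz          # 2. scale the indices by the voxel size
rotated = direction @ scaled        # 3. matrix-multiply with the direction matrix
center_xyz = rotated + origin_xyz   # 4. add the offset for the origin

# Inverse: XYZ -> IRC, undoing each step in reverse order
unshifted = center_xyz - origin_xyz
unrotated = np.linalg.inv(direction) @ unshifted
cri_back = np.round(unrotated / vx_size_xyz).astype(int)
irc_back = cri_back[::-1]

print(center_xyz)
print(irc_back)
```

Rounding before the integer conversion matters: floating-point error can leave a value like 249.99999, which would truncate to the wrong voxel.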

    ##### Extracting a Nodule from a CT Scan
    1. As we mentioned, up to 99.9999% of the voxels in the CT scan of a patient with a lung nodule won't be part of the actual nodule.
    2. So we will extract an area around each candidate and let the model focus on one candidate at a time. Looking for ways to reduce the scope of the problem for our model can help, especially in the early stages of a project when we are trying to get our first working implementation up and running.
    3. The getRawNodule() function takes the center expressed in the patient coordinate system (x,y,z), just as it is specified in the LUNA CSV data, as well as a width in voxels. It returns a cubic chunk of CT, as well as the center of the candidate converted to array coordinates.
    4. The actual implementation needs to deal with situations where the combination of center and width puts the edges of the cropped area outside the array. But as noted earlier, we skip complications that obscure the larger intent of the function. The full implementation can be found in the book's GitHub repo.
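
One possible way to handle the edge cases mentioned in point 4 is to shift the crop window back inside the array. The sketch below is an assumption about how this could be done, not the book's implementation; `crop_around_center` is a hypothetical helper that takes a center already converted to I,R,C.

```python
import numpy as np

def crop_around_center(volume, center_irc, width_irc):
    """Return a chunk of `volume` with shape `width_irc` around `center_irc`,
    shifting the window inward where it would cross an array edge."""
    slices = []
    for axis, (center, width) in enumerate(zip(center_irc, width_irc)):
        size = volume.shape[axis]
        if width > size:
            raise ValueError("crop width exceeds the volume size")
        start = int(round(center - width / 2))
        start = max(0, min(start, size - width))  # clamp to stay inside the array
        slices.append(slice(start, start + width))
    return volume[tuple(slices)]

# Small synthetic volume; the center sits near two of the array edges.
volume = np.arange(20 * 30 * 30, dtype=np.float32).reshape(20, 30, 30)
chunk = crop_around_center(volume, center_irc=(1, 15, 29), width_irc=(8, 10, 10))

print(chunk.shape)
```

The trade-off of clamping is that a candidate near the edge is no longer centered in its chunk, but every sample keeps the fixed shape the model requires.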


#### A straightforward Dataset implementation
1. To load the data in a way that conforms with PyTorch, we need to implement a Dataset class. By subclassing Dataset, we take our arbitrary data and plug it into the rest of the PyTorch ecosystem.
2. Each CT instance represents hundreds of different samples that we can use to train our model or validate its effectiveness.
3. Our lunaDataset class will normalize those samples, flattening each CT's nodules into a single collection from which samples can be retrieved without regard for which CT instance the sample originates from. This flattening is often how we want to process data (we will also see other approaches).
4. In terms of implementation, we start with the requirements imposed by subclassing Dataset and work backwards. This is different from the datasets we have encountered earlier: there we were using classes provided by external libraries, whereas here we need to implement and instantiate the class ourselves.
5. The PyTorch API only requires a Dataset subclass to provide two methods:
    1. An implementation of `__len__` that returns a single, constant value after initialization (the value ends up being cached in some cases).
    2. The `__getitem__` method, which takes an index and returns a tuple with the sample data to be used for training (or validation, as the case may be). It should return something valid for all inputs from 0 to N-1.
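
The two-method protocol can be illustrated without any PyTorch dependency. The class below is a plain-Python stand-in, not a real `torch.utils.data.Dataset` subclass; the class name, the candidate structure, and the placeholder sample are all hypothetical.

```python
class ToyCandidateDataset:
    """Plain-Python stand-in showing the two methods a Dataset must provide."""

    def __init__(self, candidates):
        # Each entry is (is_nodule, series_uid); a hypothetical structure.
        self.candidates = list(candidates)

    def __len__(self):
        # Must stay constant after initialization.
        return len(self.candidates)

    def __getitem__(self, ndx):
        is_nodule, series_uid = self.candidates[ndx]
        sample = [[0.0]]  # placeholder for the cropped CT chunk
        return sample, int(is_nodule), series_uid, ndx

dataset = ToyCandidateDataset([(True, "uid-a"), (False, "uid-b")])
print(len(dataset))
print(dataset[1])
```

In the real project the placeholder would be replaced by the cropped CT chunk converted to a tensor, and the class would inherit from `torch.utils.data.Dataset`.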


In [6]:
import collections
import copy
import datetime
import gc
import time
import numpy as np

IrcTuple = collections.namedtuple('IrcTuple', ['index', 'row', 'col'])
XyzTuple = collections.namedtuple('XyzTuple', ['x','y','z'])

def irc2xyz(coord_irc, origin_xyz, vxSize_xyz, direction_a):
    # Swaps the order of the coordinates from I,R,C to C,R,I (to line up with x,y,z) while converting them to a numpy array
    cri_a = np.array(coord_irc)[::-1]
    # Converts the origin to an array
    origin_a = np.array(origin_xyz)
    # converts the voxel size to an array
    vxSize_a = np.array(vxSize_xyz)

    # Scales the indices by the voxel size, matrix-multiplies with the direction matrix so that any axis flips or rotations are applied, and then adds the origin offset.
    coords_xyz = (direction_a @ (cri_a * vxSize_a)) + origin_a
    return XyzTuple(*coords_xyz)  # Convert it into a named tuple and return it

def xyz2irc(coord_xyz, origin_xyz, vxSize_xyz, direction_a):
    origin_a = np.array(origin_xyz)
    vxSize_a = np.array(vxSize_xyz)
    coord_a =  np.array(coord_xyz)

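    # Inverse of the last 3 steps of irc2xyz, in reverse order: subtract the origin, undo the direction matrix, divide by the voxel size.
    # The inverse matrix is applied from the left so that it exactly undoes `direction_a @ ...` in irc2xyz, even when direction_a is not symmetric.
    cri_a = (np.linalg.inv(direction_a) @ (coord_a - origin_a)) / vxSize_a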

    # rounds off before converting to integers
    cri_a = np.round(cri_a)
    return IrcTuple(int(cri_a[2]), int(cri_a[1]), int(cri_a[0])) # Converts from shape C,R,I to I,R,C and converts to integers


In [43]:
import functools
import os
import glob
from collections import namedtuple
import csv
import SimpleITK as sitk
import numpy as np
import torch
from torch.utils.data import Dataset
import copy





@functools.lru_cache(1)
def GetCandidateInfoList(requireOnDisk_bool = True):
    candidate_info_tuple = namedtuple('candidate_info_tuple', ['isNodule_bool', 'diameter_mm', 'series_uid','center_xyz'])
    # get a list of all the .mhd files in the various subsets of the data.
    mhd_list = glob.glob("E://data/data-unversioned/subset*/*.mhd")

    # We are splitting the whole filepath into its various pieces, taking only the filename(hence the -1) and removing the .mhd from it(hence the -4). 
    presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list}

    # Group the annotations by series_uid: each key maps to a list of (center_xyz, diameter_mm) tuples.
    # Later, a candidate that lies close enough to one of these annotations takes its diameter; otherwise its diameter stays 0.0.
    diameter_dict = {}

    with open("E://data/data-unversioned/annotations.csv", 'r') as f:

        for row in list(csv.reader(f))[1:]:  # skip row 0 because it only holds the column names (headers)
            # get the series_uid of the nodule which is the first value of the row
            series_uid = row[0]

            # get the x,y,z coordinates(which are the 2,3,4 values respectively) which mark the center of the nodules.
            annotation_center_xyz = tuple([float(x) for x in row[1:4]])

            # Get the diameter of the nodules which is the 5th value of each row
            annotation_diameter_mm = float(row[4])

            # Set the default of the key i.e series_uid as an empty list and this also returns the value of that key.
            # Then append a tuple of the x,y,z coordinates of the center of the nodule and the diameter of the annotated nodule
            diameter_dict.setdefault(series_uid, []).append((annotation_center_xyz, annotation_diameter_mm))

    # Now  we will build a full list of candidates nodules using the information in candidates.csv file
    candidate_info_list = []
    with open("E://data/data-unversioned/candidates.csv", 'r') as f:
        for row in list(csv.reader(f))[1:]:

            # Get the series_uid
            series_uid = row[0]

            # If the series_uid is not present on disk and the requireOnDisk attribute is set to True then skip the file
            if series_uid not in presentOnDisk_set and requireOnDisk_bool:
                continue
            
            # Put if the candidate is a nodule or not into the isNodule_bool parameter. This is the 5th value of the row
            isNodule_bool = bool(int(row[4]))

            # Get the x,y,z coordinates of the center of the candidate
            candidate_center_xyz = tuple([float(x) for x in row[1:4]])

            # set the candidate diameter to 0.0 
            candidate_diameter_mm = 0.0

            # loop over the annotations collected for this candidate's series_uid (empty list if there are none)
            for annotation_tuple in diameter_dict.get(series_uid, []):
                # unpack the x,y,z center and the diameter of the annotated nodule
                annotation_center_xyz, annotation_diameter_mm = annotation_tuple

                # Loop over the x,y,z coordinates and compute the absolute per-axis distance between the annotation and candidate centers.
                for i in range(3):
                    delta_mm = abs(candidate_center_xyz[i] - annotation_center_xyz[i])

                    # annotation_diameter_mm / 4 --> dividing by 2 gives the radius, and dividing by 2 again
                    # requires the two centers to be within half a radius of each other on every axis.
                    # This makes sure the centers are not too far apart relative to the size of the nodule.
                    # This is a bounding-box check, not a true distance check.
                    if delta_mm > annotation_diameter_mm /4:
                        # Too far apart on this axis: this annotation does not match, so move on to the next one
                        break
                else:
                    # Close on all three axes: treat them as the same nodule and take the annotated diameter
                    candidate_diameter_mm = annotation_diameter_mm
                    break

            # Convert all the information into a tuple and append it to the candidate list
            candidate_info_list.append(candidate_info_tuple(isNodule_bool,
                                                            candidate_diameter_mm,
                                                            series_uid,
                                                            candidate_center_xyz))

    # Sort in descending order: actual nodules come first, largest diameter first, followed by the non-nodule candidates.
    candidate_info_list.sort(reverse=True)
    return candidate_info_list

class Ct():
    def __init__(self, series_uid):
        mhd_path = glob.glob(f"E://data/data-unversioned/subset*/{series_uid}.mhd")[0]
        
        # sitk.ReadImage automatically consumes the .raw file in addition to the .mhd file passed in
        ct_mhd = sitk.ReadImage(mhd_path)

        # Recreate the np.array since we want to convert the value type to np.float32
        ct_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32)
        # Clip the values to the range [-1000, 1000] in place (the third argument is the output array)
        ct_a.clip(-1000, 1000, ct_a)

        # Assign the values built above to self so that they become attributes of the object
        self.series_uid = series_uid
        self.hu_a = ct_a

        # Get the origin from the .mhd file; it is used to convert from x,y,z coordinates to i,r,c coordinates
        self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin())

        # Get the voxel size along each axis; it is used to scale between physical x,y,z distances and i,r,c voxel counts
        self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing())

        # Convert the direction to an array and reshape the nine-element array to 3x3
        self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3)
        # These are the inputs we need to pass into xyz2irc, in addition to the individual point to convert.
        # With these attributes, our Ct object now has all the data needed to convert a candidate center from patient coordinates to array coordinates.
    

    def getRawCandidate(self, center_xyz, width_irc):
        """Crop the relevant voxels (the bounding box around the candidate) from the full CT array.

        The candidate center is converted from x,y,z to i,r,c, and width_irc is then used to compute the
        start and end indexes of the crop box. width_irc is a fixed 3D width (matching our input shape), so
        the center of the cropped array lines up with the center of the candidate.
        """
        center_irc = xyz2irc(
            coord_xyz = center_xyz, # center of the candidate in patient coordinates
            origin_xyz = self.origin_xyz, # origin of the CT array
            vxSize_xyz = self.vxSize_xyz, # voxel size of the CT array
            direction_a = self.direction_a # direction matrix of the CT array
        )

        slice_list = []
        for axis, center_val in enumerate(center_irc):
            # Find the start and the end indexes for every axis
            start_ndx = int(round(center_val - width_irc[axis] / 2))
            end_ndx = int(start_ndx + width_irc[axis])

            # Keep the crop box inside the array: a negative start index would otherwise wrap around to the end of the axis
            if start_ndx < 0:
                start_ndx = 0
                end_ndx = int(width_irc[axis])
            if end_ndx > self.hu_a.shape[axis]:
                end_ndx = self.hu_a.shape[axis]
                start_ndx = int(self.hu_a.shape[axis] - width_irc[axis])

            # Append the slice for this axis to the list
            slice_list.append(slice(start_ndx, end_ndx))

        # Crop the CT array with the per-axis slices to get the 3D chunk around the candidate
        ct_chunk = self.hu_a[tuple(slice_list)]

        return ct_chunk, center_irc   # Return the CT chunk and the I,R,C coordinates of the center.
        



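# Keep only the most recently loaded Ct in memory; repeated requests for the same series_uid reuse it
@functools.lru_cache(1, typed=True)
def getCt(series_uid):
    return Ct(series_uid)

# Cache the cropped chunks in memory as well (lru_cache keeps up to 128 results by default)
@functools.lru_cache(typed=True)
def getCtRawCandidate(series_uid, center_xyz, width_irc):
    ct = getCt(series_uid)
    ct_chunk, center_irc = ct.getRawCandidate(center_xyz, width_irc)
    return ct_chunk, center_irc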

class LunaDataset(Dataset):
    def __init__(self, val_stride = 0, is_val_set_bool = None, series_uid = None):
        self.candidateInfo_list = copy.copy(GetCandidateInfoList())

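        # Optionally restrict the dataset to the candidates of a single CT scan
        if series_uid:
            self.candidateInfo_list = [x for x in self.candidateInfo_list if x.series_uid == series_uid]

        # Validation set: keep only every val_stride-th candidate
        if is_val_set_bool:
            assert val_stride > 0, val_stride
            self.candidateInfo_list = self.candidateInfo_list[::val_stride]
            assert self.candidateInfo_list

        # Training set: delete every val_stride-th candidate, i.e. exactly the ones held out for validation
        elif val_stride > 0:
            del self.candidateInfo_list[::val_stride]
            assert self.candidateInfo_list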


    def __len__(self):
        return len(self.candidateInfo_list)


    
    def __getitem__(self, ndx):
        candidate_info_tup = self.candidateInfo_list[ndx]
        print(candidate_info_tup)
        # assert isinstance(candidate_info_tup, collections.namedtuple)
        width_irc = (32,48,48)

        candidate_a, center_irc = getCtRawCandidate(
            candidate_info_tup.series_uid, 
            candidate_info_tup.center_xyz,
            width_irc
        )

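        # Convert the cropped array to a float32 tensor and add a leading channel dimension
        candidate_t = torch.from_numpy(candidate_a)
        candidate_t = candidate_t.to(torch.float32)
        candidate_t = candidate_t.unsqueeze(0)

        # Two-element label: [not a nodule, nodule]
        pos_t = torch.tensor([
            not candidate_info_tup.isNodule_bool,
            candidate_info_tup.isNodule_bool
        ],
        dtype = torch.long)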

        return (candidate_t, 
                pos_t,
                candidate_info_tup.series_uid,
                torch.tensor(center_irc))
x = GetCandidateInfoList(requireOnDisk_bool=False)
new_dataset_instance = LunaDataset()

TypeError: __init__() missing 1 required positional argument: 'series_uid'

Below is the whole code for the util.py file in the GitHub repo.
It is a bit heavy on the logic side. The main thing to remember is that we need to convert between coordinate systems, and that we can use these functions as a black box. The metadata needed to convert from patient coordinates (`_xyz`) to array coordinates (`_irc`) is contained in the MetaIO file alongside the CT data itself. We pull the voxel size and positioning metadata out of the .mhd file at the same time we get `ct_a`.
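
The core of the conversion can be sketched in a few lines. This is a minimal sketch and not the repo's util.py: the name `xyz_to_irc_sketch` is hypothetical, the example values are made up, and it assumes that the direction matrix is invertible and that the array is indexed as (index, row, column), which is the reverse of the (x, y, z) order.

```python
import numpy as np

def xyz_to_irc_sketch(coord_xyz, origin_xyz, vxSize_xyz, direction_a):
    """Hypothetical helper: map a patient-space point (mm) to array indexes."""
    coord_a = np.array(coord_xyz, dtype=float)
    origin_a = np.array(origin_xyz, dtype=float)
    vxSize_a = np.array(vxSize_xyz, dtype=float)

    # 1. Shift so the CT origin is at zero,
    # 2. undo the axis orientation with the inverse direction matrix,
    # 3. scale from millimetres to voxel counts.
    cri_a = (np.linalg.inv(direction_a) @ (coord_a - origin_a)) / vxSize_a

    # Round to the nearest voxel and flip (col, row, index) -> (index, row, col).
    cri_a = np.round(cri_a).astype(int)
    return (int(cri_a[2]), int(cri_a[1]), int(cri_a[0]))

# Made-up values for illustration only (not taken from a real scan).
example_irc = xyz_to_irc_sketch(
    coord_xyz=(-90.0, -180.0, -280.0),
    origin_xyz=(-100.0, -200.0, -300.0),
    vxSize_xyz=(0.5, 0.5, 2.0),
    direction_a=np.eye(3),
)
print(example_irc)  # (10, 40, 20)
```

The point sits 10 mm, 20 mm and 20 mm from the origin along x, y and z. Dividing by the voxel size gives 20 columns, 40 rows and 10 slices, which are then reported in (index, row, column) order.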

In [34]:
c = LunaDataset(is_val_set_bool= True, val_stride=10)
print(c.__getitem__(25))
print(len(c))

candidate_info_tuple(isNodule_bool=True, diameter_mm=11.36963426, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.249404938669582150398726875826', center_xyz=(131.415682197, 32.7486662007, -174.02278535))
(tensor([[[[-848., -828., -780.,  ...,  -20.,   61.,   79.],
          [-797., -817., -816.,  ...,    7.,   68.,   95.],
          [-793., -823., -832.,  ...,   45.,   73.,   71.],
          ...,
          [-820., -814., -805.,  ...,    7.,   16.,    8.],
          [-820., -836., -839.,  ...,   -2.,    5.,  -31.],
          [-852., -838., -840.,  ...,  -10.,  -30.,  -54.]],

         [[-787., -799., -815.,  ...,  -53.,   14.,   46.],
          [-729., -771., -802.,  ...,  -32.,   29.,   52.],
          [-732., -780., -845.,  ...,  -45.,   21.,   37.],
          ...,
          [-793., -846., -828.,  ...,  -14.,   -1.,   29.],
          [-768., -805., -842.,  ...,  -60.,  -40.,  -26.],
          [-766., -798., -807.,  ...,  -38.,    1.,  -39.]],

         [[-811., -801., -826.,  ...,  -84

In [35]:
p = new_dataset_instance.__getitem__(25)
p
print(len(new_dataset_instance))

candidate_info_tuple(isNodule_bool=True, diameter_mm=23.57156231, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.220596530836092324070084384692', center_xyz=(-83.80648889, 247.1206328, -508.1281928))
551065
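
The length printed above belongs to a dataset built without a `val_stride`, so nothing is held out. What the two `val_stride` branches of `LunaDataset.__init__` do can be tried on a toy list. This is a minimal sketch; `split_by_stride` is a hypothetical helper and not part of the dataset code.

```python
def split_by_stride(items, val_stride):
    """Hypothetical helper mirroring the two val_stride branches of LunaDataset."""
    assert val_stride > 0, val_stride
    val_items = items[::val_stride]    # validation: every val_stride-th item
    train_items = list(items)          # copy, so the input list is not modified
    del train_items[::val_stride]      # training: drop exactly those items
    return train_items, val_items

toy_list = list(range(10))
train_items, val_items = split_by_stride(toy_list, 3)
print(val_items)    # [0, 3, 6, 9]
print(train_items)  # [1, 2, 4, 5, 7, 8]
```

Because both branches use the same stride on the same sorted list, the training and validation sets never share a candidate.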


In [42]:
positive_info_list = [c for c in x if c.isNodule_bool]
positive_diameter = [c.diameter_mm for c in positive_info_list]
for i, diameter in enumerate(positive_diameter):
    if i % 100 == 0:
        print(f"{i} : {diameter}")

0 : 32.27003025
100 : 17.74608206
200 : 12.98553245
300 : 9.953169615
400 : 8.222297439
500 : 7.013205598
600 : 6.256102138
700 : 5.687246746
800 : 5.118324575
900 : 4.66199491
1000 : 3.973281304
1100 : 0.0
1200 : 0.0
1300 : 0.0
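
The steadily decreasing diameters, ending in a run of 0.0, follow from how `candidate_info_list.sort(reverse=True)` compares tuples: field by field, from left to right. A minimal sketch with made-up values (`CandidateSketch` is a hypothetical, shortened stand-in for `candidate_info_tuple`):

```python
from collections import namedtuple

# Hypothetical stand-in with the same leading fields as candidate_info_tuple.
CandidateSketch = namedtuple("CandidateSketch", "isNodule_bool, diameter_mm, series_uid")

candidates = [
    CandidateSketch(False, 0.0, "uid-b"),
    CandidateSketch(True, 4.5, "uid-a"),
    CandidateSketch(True, 0.0, "uid-c"),   # nodule with no matching annotation
    CandidateSketch(True, 12.0, "uid-a"),
]

# Tuples compare on the first field, then the second, and so on.
candidates.sort(reverse=True)
ordered = [(c.isNodule_bool, c.diameter_mm) for c in candidates]
print(ordered)
# [(True, 12.0), (True, 4.5), (True, 0.0), (False, 0.0)]
```

Nodules whose center did not match any annotation keep the default diameter of 0.0, which is why they appear at the end of the positive list.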


In [4]:
from dataset import LunaDataset
b = LunaDataset()
print(b.__getitem__(25))

candidate_info_tuple(isNodule_bool=True, diameter_mm=23.57156231, series_uid='1.3.6.1.4.1.14519.5.2.1.6279.6001.220596530836092324070084384692', center_xyz=(-83.80648889, 247.1206328, -508.1281928))
(tensor([[[[-766., -774., -788.,  ..., -811., -866., -854.],
          [-769., -816., -819.,  ..., -847., -805., -802.],
          [-754., -762., -815.,  ..., -847., -810., -796.],
          ...,
          [ 251.,  211.,  199.,  ..., -707., -736., -697.],
          [ 637.,  521.,  527.,  ..., -755., -693., -689.],
          [ 726.,  773.,  793.,  ..., -643., -600., -540.]],

         [[-747., -727., -739.,  ..., -801., -849., -805.],
          [-768., -776., -772.,  ..., -830., -830., -770.],
          [-758., -775., -778.,  ..., -815., -794., -784.],
          ...,
          [  44.,   12.,   21.,  ..., -720., -755., -719.],
          [ 291.,  177.,  181.,  ..., -777., -769., -708.],
          [ 444.,  483.,  449.,  ..., -589., -511., -426.]],

         [[-798., -782., -741.,  ..., -756., -
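
The index arithmetic behind a crop like the one printed above can be checked without loading a CT. This is a minimal sketch: `crop_slices_sketch` is a hypothetical helper, the shapes are made up, and the clamping at the array border is an assumption added here (a negative start index would otherwise wrap around in NumPy slicing). It also assumes the crop width is never larger than the array itself.

```python
def crop_slices_sketch(center_irc, width_irc, shape_irc):
    """Hypothetical helper: per-axis slices for a fixed-size box around a center."""
    slices = []
    for center_val, width, size in zip(center_irc, width_irc, shape_irc):
        start_ndx = int(round(center_val - width / 2))
        end_ndx = start_ndx + width
        # Shift the box back inside the array if it sticks out on either side.
        if start_ndx < 0:
            start_ndx, end_ndx = 0, width
        if end_ndx > size:
            start_ndx, end_ndx = size - width, size
        slices.append(slice(start_ndx, end_ndx))
    return tuple(slices)

inner = crop_slices_sketch((50, 256, 256), (32, 48, 48), (100, 512, 512))
print(inner)  # (slice(34, 66, None), slice(232, 280, None), slice(232, 280, None))

edge = crop_slices_sketch((5, 256, 500), (32, 48, 48), (100, 512, 512))
print(edge)   # (slice(0, 32, None), slice(232, 280, None), slice(464, 512, None))
```

Either way the resulting chunk has the full `(32, 48, 48)` shape, so every sample can be stacked into a batch.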
