<a href="https://colab.research.google.com/github/arnavvats/lung-cancer-prediction/blob/master/Lung_Cancer_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lung Cancer Prediction Using CNN (LUNA16)

### Goal

Our goal is to predict the probability that a nodule is present at a given position in a CT scan of a patient's lungs.

From a data science perspective, we'll learn some cool new tricks, primarily 3D CNN architectures and preprocessing of medical images.

The full description of the problem can be [read here](https://luna16.grand-challenge.org/Description/). Here is a quote from the page:

> We invite the research community to participate in one or two of the following challenge tracks:
>
> 1. Nodule detection (NDET)
>    Using raw CT scans, the goal is to identify locations of possible nodules, and to assign a probability for being a nodule to each location. The pipeline typically consists of two stages: candidate detection and false positive reduction.
> 2. False positive reduction (FPRED)
>    Given a set of candidate locations, the goal is to assign a probability for being a nodule to each candidate location. Hence, one could see this as a classification task: nodule or not a nodule. Candidate locations will be provided in world coordinates. This set detects 1,162/1,186 nodules.


For now, we are interested only in track 1 (nodule detection).

### The data

We are taking the data from the [LUNA16](https://zenodo.org/record/2604219#.XQuGWYgzZPY) competition dataset.
There is a lot of data, and since we're working on Colab with at most 25GB available, we'll read the data in parts.

>  1. subset0.zip to subset9.zip: 10 zip files which contain all CT images

> 2. annotations.csv: csv file that contains the annotations used as reference standard for the 'nodule detection' track

> 3. sampleSubmission.csv: an example of a submission file in the correct format

> 4. candidates.csv: the original set of candidates used for the LUNA16 workshop at ISBI2016. This file is kept for completeness, but should not be used, use candidates_V2.csv instead (see more info below).

> 5. candidates_V2.csv: csv file that contains an extended set of candidate locations for the ‘false positive reduction’ track. 

> 6. evaluation script: the evaluation script that is used in the LUNA16 framework

> 7. lung segmentation: a directory that contains the lung segmentation for CT images computed using automatic algorithms

> 8. additional_annotations.csv: csv file that contain additional nodule annotations from our observer study. The file will be available soon

We'll download items 2-7 right away. The zip files that contain the images are large and cannot all be downloaded at once, so they will be downloaded one at a time so that we don't run out of memory.
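The step-by-step idea can be sketched as a small context manager that unpacks one archive, lets us work on its contents, and deletes the extracted files before the next archive is touched. This is only a minimal sketch: the helper name `extracted_subset` is hypothetical and not part of the LUNA16 tooling, and it assumes each subset zip fits on disk on its own.

```python
import contextlib
import shutil
import tempfile
import zipfile


@contextlib.contextmanager
def extracted_subset(zip_path):
    """Unpack one archive into a temporary folder and delete it afterwards.

    `extracted_subset` is a hypothetical helper name used for illustration.
    """
    folder = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
        # Hand the folder to the caller; cleanup runs when the `with` block ends.
        yield folder
    finally:
        shutil.rmtree(folder, ignore_errors=True)
```

It would be used as `with extracted_subset("subset0.zip") as folder: ...`, so that only one subset is unpacked at any time.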

First, we delete the sample_data folder on Colab to free up some space.

In [0]:
!rm -rf ./sample_data/

Now let's download the smaller files; these are the labels and patient information.

In [3]:
!curl -L "https://zenodo.org/record/2604219/files/annotations.csv?download=1" -o 'annotations.csv'
!curl -L "https://zenodo.org/record/2604219/files/candidates.csv?download=1" -o 'candidates.csv'
!curl -L "https://zenodo.org/record/2604219/files/candidates_V2.zip?download=1" -o 'candidates_V2.zip'
!curl -L "https://zenodo.org/record/2604219/files/evaluationScript.zip?download=1" -o 'evaluationScript.zip'
!curl -L "https://zenodo.org/record/2604219/files/sampleSubmission.csv?download=1" -o 'sampleSubmission.csv'
!curl -L "https://zenodo.org/record/2604219/files/seg-lungs-LUNA16.zip?download=1" -o 'seg-lungs-LUNA16.zip'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43830    0 43830    0     0  31085      0 --:--:--  0:00:01 --:--:-- 31085
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.8M    0  9.8M    0     0  3436k      0 --:--:--  0:00:02 --:--:-- 3435k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.9M  100 10.9M    0     0  5096k      0  0:00:02  0:00:02 --:--:-- 5096k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0  8201k      0  0:00:02  0:00:02 --:--:-- 8198k

In [0]:
!unzip seg-lungs-LUNA16.zip
!unzip evaluationScript.zip
!unzip candidates_V2.zip

Before moving on to data preprocessing, I suggest reading their [tutorial](https://luna16.grand-challenge.org/media/LUNA16/public_html/SimpleITKTutorial.pdf) on viewing CT scan images.

In [14]:
!pip install SimpleITK



In [0]:
import SimpleITK as sitk
import numpy as np
import pandas as pd
import os
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

We now define a function that:

- opens the image,
- stores it in a numpy array,
- extracts the pixel spacing and origin.

The function takes the name of the image as input and returns:

- the array corresponding to the image (numpyImage),
- the origin (numpyOrigin),
- the pixel spacing (numpySpacing).

In [0]:
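def load_itk_image(filename):
  """Read a CT scan and return (numpyImage, numpyOrigin, numpySpacing)."""
  itkimage = sitk.ReadImage(filename)
  # The resulting array is indexed as (z, y, x).
  numpyImage = sitk.GetArrayFromImage(itkimage)

  # SimpleITK reports origin and spacing as (x, y, z); reverse them to match the array.
  numpyOrigin = np.array(list(reversed(itkimage.GetOrigin())))
  numpySpacing = np.array(list(reversed(itkimage.GetSpacing())))

  return numpyImage, numpyOrigin, numpySpacing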

We'll skip the read_csv function from the tutorial, since pandas can handle the CSV data.
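As a rough illustration of what pandas gives us here, the sketch below reads a candidates-style table and filters the rows labelled as nodules. The rows are made up for the example, and the column names (`seriesuid`, `coordX`, `coordY`, `coordZ`, `class`) are assumed to match the header of candidates.csv; `load_candidates` is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up rows in the assumed column layout of candidates.csv (illustration only).
SAMPLE_CSV = """seriesuid,coordX,coordY,coordZ,class
scan-a,-56.08,-67.85,-311.92,0
scan-a,53.21,-244.41,-245.17,0
scan-b,103.66,-121.80,-286.62,1
"""


def load_candidates(source):
    """Read a candidates-style CSV from a file path or a file-like object."""
    return pd.read_csv(source)


candidates = load_candidates(io.StringIO(SAMPLE_CSV))

# Keep only the rows labelled as nodules (class == 1).
nodules = candidates[candidates["class"] == 1]
```

With the real files, the same function would be called as `load_candidates("candidates.csv")`.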

Since the candidate coordinates are given in world coordinates, we now need to transform
them from world coordinates to voxel coordinates.
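A minimal sketch of such a transform is shown below. It assumes the scan axes are aligned with the world axes (no rotation or flipping), and that the coordinate, origin and spacing all use the same axis order, for example the (z, y, x) order returned by `load_itk_image`. The name `world_to_voxel` is chosen here for illustration, and the numbers in the usage note are invented.

```python
import numpy as np


def world_to_voxel(world_coord, origin, spacing):
    """Convert a world coordinate (in mm) to the nearest voxel index.

    Assumes axis-aligned scans; all arguments share one axis order.
    """
    world_coord = np.asarray(world_coord, dtype=float)
    origin = np.asarray(origin, dtype=float)
    spacing = np.asarray(spacing, dtype=float)

    # Offset from the scan origin in mm, divided by the size of one voxel in mm.
    voxel = (world_coord - origin) / spacing
    return np.rint(voxel).astype(int)
```

For example, with an origin of `[-110.0, 0.0, 0.0]` and a spacing of `[2.5, 0.5, 0.5]`, the world point `[-100.0, 20.0, 30.0]` lies 10 mm, 20 mm and 30 mm from the origin along the three axes, which corresponds to voxel `[4, 40, 60]`.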