# Table of contents
- **Introduction**
-  **Data importing**  
    - *Import CIA data*
    - *Import annotation data*
    - *Import normal lungs data*
- **Data cleaning**
    - *Setting the data types right*
    - *Cleaning columns where applied*
    - *Remapping columns*
    - *Normalizing the data*

# Introduction 
This project aims to build an algorithm that detects lung cancer from CT scans. The algorithm should also predict the type of cancer and, where applicable, the size of the tumor.

In [2]:
# Import the necessary modules
# If you get a ModuleNotFoundError, use %pip install {module} to install the module
import glob
import time
from pathlib import Path
from zipfile import ZipFile

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydicom as dicom
import seaborn as sns
import tensorflow as tf
import xmltodict
from sklearn.metrics import (accuracy_score, classification_report, recall_score,
                             roc_auc_score, roc_curve, RocCurveDisplay)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from tensorflow.data import Dataset
from tensorflow.keras.utils import plot_model
from tqdm import tqdm

# Project-specific helper functions
import functions_project

Functions import succesfull


# Data importing 
In this chapter, the data is imported. It comes from two sources:
- A subselection from the Cancer Imaging Archive (CIA): https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70224216#7022421621c64ff049c44f03bb442ec5eb88bdf2 
- A subselection from Kaggle: https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images

Only a subselection of the CIA data was used, to keep the dataset small enough to process. From the Kaggle dataset, we took only the part of the data needed for this project.

## Import CIA data

In [3]:
# Set start time
start_time = time.time()

# Get the data that's stored locally
# (forward slashes avoid invalid escape sequences such as "\m" and work on every OS)
basePath = Path("tciaDownload/manifest-1608669183333/Lung-PET-CT-Dx")

# Get the DICOM files present in the path set before
pathFiles = list(basePath.rglob('*.dcm'))

# Read every DICOM file and collect the pixel data and metadata in lists
images = [dicom.dcmread(x) for x in pathFiles]
pixels = [image.pixel_array for image in images]
age = [image.PatientAge for image in images]
sex = [image.PatientSex for image in images]
patient_id = [image.PatientID for image in images]
# The cancer type is the letter at index 8 of the patient ID (the 'A' in 'Lung_Dx-A0002')
cancer_type = [image.PatientID[8] for image in images]
sop_instance_uid = [image.SOPInstanceUID for image in images]

# Print time execution
print(f"Execution time: {(time.time() - start_time):.3f} seconds")

Execution time: 107.047 seconds
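
The cancer-type letter is read from a fixed position (`PatientID[8]`), which works for IDs shaped like `Lung_Dx-A0002`. A slightly more defensive sketch, assuming every ID follows that `Lung_Dx-<letter><digits>` pattern, fails loudly on an unexpected ID instead of silently returning the wrong character. The helper name `cancer_type_from_id` is hypothetical and not part of `functions_project`.

```python
import re

# Assumed ID layout: "Lung_Dx-" + one uppercase letter + digits, e.g. "Lung_Dx-A0002"
_ID_PATTERN = re.compile(r"^Lung_Dx-([A-Z])\d+$")


def cancer_type_from_id(patient_id: str) -> str:
    """Return the cancer-type letter embedded in a patient ID."""
    match = _ID_PATTERN.match(patient_id)
    if match is None:
        raise ValueError(f"Unexpected patient ID format: {patient_id!r}")
    return match.group(1)


print(cancer_type_from_id("Lung_Dx-A0002"))  # A
```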


In [4]:
# Put created lists into a dataframe
# Set the has_cancer column manually to 1, since this contains data with patients that have the disease
colNames = ["patient_id", "sop_instance_id", "image_pixels", "patient_age", "patient_sex", "cancer_type", 'has_cancer']
data = [patient_id, sop_instance_uid, pixels, age, sex, cancer_type, 1]

df_sick_lungs = pd.DataFrame()

for colName, values in tqdm(zip(colNames, data)):
    df_sick_lungs[colName] = values

# Display the first few rows and column information
display(df_sick_lungs.head())
print(df_sick_lungs.info())

0it [00:00, ?it/s]

7it [00:00, 81.48it/s]


Unnamed: 0,patient_id,sop_instance_id,image_pixels,patient_age,patient_sex,cancer_type,has_cancer
0,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.295499053390...,"[[44, 0, 10, 37, 0, 36, 70, 22, 0, 35, 0, 49, ...",053Y,F,A,1
1,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.314562946465...,"[[50, 0, 22, 26, 42, 25, 6, 51, 35, 7, 13, 33,...",053Y,F,A,1
2,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.142392682681...,"[[16, 3, 0, 89, 20, 10, 1, 22, 12, 61, 10, 25,...",053Y,F,A,1
3,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.117012811165...,"[[58, 0, 25, 71, 40, 48, 0, 0, 32, 49, 4, 46, ...",053Y,F,A,1
4,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.208988389922...,"[[13, 0, 27, 53, 18, 41, 26, 0, 37, 30, 68, 0,...",053Y,F,A,1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7700 entries, 0 to 7699
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_id       7700 non-null   object
 1   sop_instance_id  7700 non-null   object
 2   image_pixels     7700 non-null   object
 3   patient_age      7700 non-null   object
 4   patient_sex      7700 non-null   object
 5   cancer_type      7700 non-null   object
 6   has_cancer       7700 non-null   int64 
dtypes: int64(1), object(6)
memory usage: 421.2+ KB
None


## Import the annotation data

In [5]:
# Loop through zip file to get XML content
path_zip_annotations = "Lung-PET-CT-Dx-Annotations-XML-Files-rev12222020.zip"

cols = ["patient_id", "sop_instance_id", "xmin", "ymin", "xmax", "ymax"]

# Use the createAnnotationDf to add the annotation data
annot_df = functions_project.createAnnotationDf(path_zip_annotations, cols)

In [6]:
# Check the data for the annotation
print(annot_df.info())
display(annot_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31463 entries, 0 to 31462
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_id       31463 non-null  object
 1   sop_instance_id  31463 non-null  object
 2   xmin             31463 non-null  object
 3   ymin             31463 non-null  object
 4   xmax             31463 non-null  object
 5   ymax             31463 non-null  object
dtypes: object(6)
memory usage: 1.4+ MB
None


Unnamed: 0,patient_id,sop_instance_id,xmin,ymin,xmax,ymax
0,Lung_Dx-A0001,1.3.6.1.4.1.14519.5.2.1.6655.2359.102500633407...,286,310,355,286
1,Lung_Dx-A0001,1.3.6.1.4.1.14519.5.2.1.6655.2359.103293611003...,304,304,337,304
2,Lung_Dx-A0001,1.3.6.1.4.1.14519.5.2.1.6655.2359.136943255924...,278,308,360,278
3,Lung_Dx-A0001,1.3.6.1.4.1.14519.5.2.1.6655.2359.155870813347...,290,305,338,290
4,Lung_Dx-A0001,1.3.6.1.4.1.14519.5.2.1.6655.2359.184131899543...,301,329,351,301
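
In the preview above, every `ymax` value equals the `xmin` value of the same row, and `ymax` is never larger than `ymin`. That may be how the annotations are stored, but it could also point to a parsing issue in `createAnnotationDf`, so it is worth checking. A minimal sketch of such a check, using the first rows shown above as sample data (the helper name `check_boxes` is an assumption, not part of `functions_project`):

```python
import pandas as pd


def check_boxes(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per sanity check for each bounding box."""
    coords = df[["xmin", "ymin", "xmax", "ymax"]].astype(int)
    report = pd.DataFrame(index=df.index)
    report["x_ordered"] = coords["xmin"] < coords["xmax"]
    report["y_ordered"] = coords["ymin"] < coords["ymax"]
    report["ymax_equals_xmin"] = coords["ymax"] == coords["xmin"]
    return report


# First three annotation rows from the preview (stored as strings, like annot_df)
sample = pd.DataFrame({
    "xmin": ["286", "304", "278"],
    "ymin": ["310", "304", "308"],
    "xmax": ["355", "337", "360"],
    "ymax": ["286", "304", "278"],
})

report = check_boxes(sample)
print(report.mean())  # share of rows for which each check is True
```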


In [7]:
# Combine the annotation with the sick lungs dataset
df_sick_lungs_annot = df_sick_lungs.merge(annot_df, on = ['patient_id', 'sop_instance_id'])

df_sick_lungs_annot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1607 entries, 0 to 1606
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_id       1607 non-null   object
 1   sop_instance_id  1607 non-null   object
 2   image_pixels     1607 non-null   object
 3   patient_age      1607 non-null   object
 4   patient_sex      1607 non-null   object
 5   cancer_type      1607 non-null   object
 6   has_cancer       1607 non-null   int64 
 7   xmin             1607 non-null   object
 8   ymin             1607 non-null   object
 9   xmax             1607 non-null   object
 10  ymax             1607 non-null   object
dtypes: int64(1), object(10)
memory usage: 138.2+ KB
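
The merge above keeps only images that have a matching annotation (an inner join is the pandas default), which is why 7,700 images and 31,463 annotations reduce to 1,607 rows. To see how many rows on each side go unmatched, one option is an outer merge with `indicator=True`. A small sketch on made-up data (the IDs below are illustrative only):

```python
import pandas as pd

images = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "sop_instance_id": ["a", "b", "c"],
})
annotations = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "sop_instance_id": ["a", "c", "d"],
    "xmin": [10, 20, 30],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = images.merge(
    annotations,
    on=["patient_id", "sop_instance_id"],
    how="outer",
    indicator=True,
)
counts = merged["_merge"].value_counts()
print(counts)
```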


In [8]:
# Get the width and height of each bounding box in pixels
x_diff = [abs(int(x_min) - int(x_max)) for x_min, x_max in zip(df_sick_lungs_annot["xmin"], df_sick_lungs_annot["xmax"])]
y_diff = [abs(int(y_min) - int(y_max)) for y_min, y_max in zip(df_sick_lungs_annot["ymin"], df_sick_lungs_annot["ymax"])]

# Multiply width by height to get the bounding-box area, used here as the tumor size
diff_squared = [(x * y) for x, y in zip(x_diff, y_diff)]

# Add the results to the df_sick_lungs_annot dataframe
df_sick_lungs_annot['diff_squared'] = diff_squared

df_sick_lungs_annot.head()

Unnamed: 0,patient_id,sop_instance_id,image_pixels,patient_age,patient_sex,cancer_type,has_cancer,xmin,ymin,xmax,ymax,diff_squared
0,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.173449310598...,"[[0, 25, 51, 3, 0, 49, 22, 0, 63, 26, 0, 49, 5...",053Y,F,A,1,171,256,228,171,4845
1,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.283973423081...,"[[15, 0, 58, 4, 23, 57, 9, 69, 0, 0, 39, 77, 3...",053Y,F,A,1,164,259,227,164,5985
2,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.251378513351...,"[[17, 12, 36, 0, 0, 79, 34, 22, 0, 69, 21, 0, ...",053Y,F,A,1,167,263,231,167,6144
3,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.494618634856...,"[[48, 0, 0, 53, 37, 35, 62, 46, 0, 0, 52, 38, ...",053Y,F,A,1,122,244,233,122,13542
4,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.124409970701...,"[[40, 23, 27, 0, 38, 48, 46, 59, 0, 17, 51, 30...",053Y,F,A,1,111,251,225,111,15960
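
`diff_squared` is the width of the bounding box times its height, i.e. its area in pixels. The same calculation can be written with pandas column arithmetic instead of list comprehensions. A minimal sketch, using the first three rows of the output above as sample data:

```python
import pandas as pd

# Coordinates are stored as strings, as in df_sick_lungs_annot
boxes = pd.DataFrame({
    "xmin": ["171", "164", "167"],
    "ymin": ["256", "259", "263"],
    "xmax": ["228", "227", "231"],
    "ymax": ["171", "164", "167"],
})

coords = boxes.astype(int)
width = (coords["xmax"] - coords["xmin"]).abs()
height = (coords["ymax"] - coords["ymin"]).abs()
boxes["diff_squared"] = width * height

print(boxes["diff_squared"].tolist())  # [4845, 5985, 6144]
```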


## Import the normal lungs dataset

In [9]:
# Extract the normal lungs images from the zip file
normalLungs = "normal_ct_scans.zip"
  

normal_lungs_images = functions_project.getNormalLungsData(normalLungs)

100%|██████████| 203/203 [00:02<00:00, 72.67it/s]


In [10]:
# Create dataframe based on PNG images and add has_cancer column with value of 0
df_normal_lungs = pd.DataFrame()
df_normal_lungs['image_pixels'] = normal_lungs_images
df_normal_lungs["has_cancer"] = 0


# Add an equally sized random sample of the cancer images with pd.concat: pixel arrays and has_cancer
sample_df_cancer = df_sick_lungs.sample(n = df_normal_lungs.shape[0])
df_lungs_binary = pd.concat([df_normal_lungs, sample_df_cancer[['image_pixels', 'has_cancer']]], ignore_index= True)

# Display the first few rows and column information
display(df_lungs_binary.head())
print(df_lungs_binary.info())

Unnamed: 0,image_pixels,has_cancer
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   image_pixels  406 non-null    object
 1   has_cancer    406 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 6.5+ KB
None
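
The sample above draws as many cancer images as there are normal images (203 each, 406 rows in total), so both classes are balanced. Because `sample` draws randomly, rerunning the cell selects different rows. A minimal sketch of a reproducible variant on toy data, assuming a fixed seed is acceptable (the value 42 is arbitrary):

```python
import pandas as pd

normal = pd.DataFrame({"has_cancer": [0] * 3})
sick = pd.DataFrame({"has_cancer": [1] * 10, "row": range(10)})

# random_state fixes the random draw, so reruns select the same rows
sample_a = sick.sample(n=len(normal), random_state=42)
sample_b = sick.sample(n=len(normal), random_state=42)

balanced = pd.concat([normal, sample_a[["has_cancer"]]], ignore_index=True)
print(balanced["has_cancer"].value_counts())
```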


In [11]:
print(f"Shape of the DICOM dataset image: {df_sick_lungs.image_pixels[0].shape}")
print(f"Shape of the PNG image: {df_normal_lungs.image_pixels[0].shape}")

Shape of the DICOM dataset image: (512, 512)
Shape of the PNG image: (512, 512)


<a id="data_cleaning"></a>
# Data cleaning

In this chapter, the data is cleaned. This mostly involves preparing the columns and the image data so that they are ready for training. To do this, we'll do the following:
- Setting the data types of the columns right
- Cleaning columns where applied
- Remapping columns
- Normalizing the data

## Setting the data types of the columns right

In [12]:
print(df_sick_lungs.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7700 entries, 0 to 7699
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_id       7700 non-null   object
 1   sop_instance_id  7700 non-null   object
 2   image_pixels     7700 non-null   object
 3   patient_age      7700 non-null   object
 4   patient_sex      7700 non-null   object
 5   cancer_type      7700 non-null   object
 6   has_cancer       7700 non-null   int64 
dtypes: int64(1), object(6)
memory usage: 421.2+ KB
None


Based on the output of the `info` method, the following columns need to be changed:
- `patient_age`: to int
- `cancer_type`: to category
- `has_cancer`: to boolean
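
The conversions listed above can be expressed in a single `astype` call. A minimal sketch on a toy dataframe, assuming the age values have already been reduced to digits (the raw values still carry a trailing `Y`):

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_age": ["53", "61"],  # digits only; the trailing 'Y' must be stripped first
    "cancer_type": ["A", "G"],
    "has_cancer": [1, 1],
})

# Map each column name to its target dtype
toy = toy.astype({
    "patient_age": int,
    "cancer_type": "category",
    "has_cancer": bool,
})

print(toy.dtypes)
```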

## Clean columns

In [13]:
print(f"Lowest age: {min(df_sick_lungs.patient_age)}")
print(f"Highest age: {max(df_sick_lungs.patient_age)}")

Lowest age: 000Y
Highest age: 085Y


As the output shows, each age is a zero-padded, three-digit number followed by a 'Y' (for example `053Y`). To clean this data, the 'Y' is removed and the column is converted to an integer.

In [14]:
df_sick_lungs.head()

Unnamed: 0,patient_id,sop_instance_id,image_pixels,patient_age,patient_sex,cancer_type,has_cancer
0,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.295499053390...,"[[44, 0, 10, 37, 0, 36, 70, 22, 0, 35, 0, 49, ...",053Y,F,A,1
1,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.314562946465...,"[[50, 0, 22, 26, 42, 25, 6, 51, 35, 7, 13, 33,...",053Y,F,A,1
2,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.142392682681...,"[[16, 3, 0, 89, 20, 10, 1, 22, 12, 61, 10, 25,...",053Y,F,A,1
3,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.117012811165...,"[[58, 0, 25, 71, 40, 48, 0, 0, 32, 49, 4, 46, ...",053Y,F,A,1
4,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.208988389922...,"[[13, 0, 27, 53, 18, 41, 26, 0, 37, 30, 68, 0,...",053Y,F,A,1


In [15]:
# Clean up the patient_age column: drop the trailing 'Y' and convert to int
df_sick_lungs['patient_age'] = [int(x[0:-1]) for x in df_sick_lungs.patient_age]
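
The age values follow the DICOM Age String format: three digits followed by a unit letter (`D`, `W`, `M` or `Y`). The cell above assumes every value is expressed in years. A slightly more defensive sketch rejects other units instead of misreading them. The function name `parse_age_years` is hypothetical.

```python
import re

# DICOM Age String: three digits plus a unit (Days, Weeks, Months, Years)
_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")


def parse_age_years(age: str) -> int:
    """Convert a DICOM age string such as '053Y' to an integer number of years."""
    match = _AGE_PATTERN.match(age)
    if match is None:
        raise ValueError(f"Not a valid DICOM age string: {age!r}")
    value, unit = match.groups()
    if unit != "Y":
        raise ValueError(f"Age {age!r} is not expressed in years")
    return int(value)


print([parse_age_years(a) for a in ["053Y", "000Y", "085Y"]])  # [53, 0, 85]
```

The same function could also be applied to `df_sick_lungs_annot`, which was created before this cleaning step and therefore still holds the raw strings.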

## Remapping columns

In [16]:
# Remap the codes in patient_sex and cancer_type to readable labels
cols = ['patient_sex', 'cancer_type']

dict_sex = {"M": "Male", "F": "Female"}
dict_cancer_type = {'A': "Adenocarcinoma", 'B': "Small Cell Carcinoma", 'E': "Large Cell Carcinoma", "G": "Squamous Cell Carcinoma"}

df_sick_lungs = df_sick_lungs.replace({cols[0]: dict_sex, cols[1]: dict_cancer_type})

# Turn patient_sex and cancer_type into categorical columns
df_sick_lungs[cols] = df_sick_lungs[cols].astype('category')

df_sick_lungs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7700 entries, 0 to 7699
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   patient_id       7700 non-null   object  
 1   sop_instance_id  7700 non-null   object  
 2   image_pixels     7700 non-null   object  
 3   patient_age      7700 non-null   int64   
 4   patient_sex      7700 non-null   category
 5   cancer_type      7700 non-null   category
 6   has_cancer       7700 non-null   int64   
dtypes: category(2), int64(2), object(3)
memory usage: 316.3+ KB
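
`replace` leaves any code that is missing from the dictionary untouched, so an unexpected letter would silently become its own category. `Series.map` returns `NaN` for missing keys instead, which makes such codes easy to spot. A minimal sketch on toy data (the code `"X"` is made up for illustration):

```python
import pandas as pd

dict_cancer_type = {
    "A": "Adenocarcinoma",
    "B": "Small Cell Carcinoma",
    "E": "Large Cell Carcinoma",
    "G": "Squamous Cell Carcinoma",
}

codes = pd.Series(["A", "G", "X"])  # "X" is not in the dictionary

# Codes without a dictionary entry become NaN
mapped = codes.map(dict_cancer_type)
unmapped = codes[mapped.isna()].unique().tolist()

print(unmapped)  # ['X']
```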


In [17]:
df_sick_lungs_annot.describe()

Unnamed: 0,has_cancer,diff_squared
count,1607.0,1607.0
mean,1.0,5882.301805
std,0.0,4739.260842
min,1.0,0.0
25%,1.0,2380.0
50%,1.0,4840.0
75%,1.0,7806.0
max,1.0,29760.0


### Findings
Based on the output of the `describe` method, the tumor size (`diff_squared`) has a minimum of 0. We want to remove those rows, since they could hurt the performance of the neural network we're trying to build.

In [18]:
# Remove rows from df_sick_lungs_annot where diff_squared is 0
# (.copy() avoids a SettingWithCopyWarning when columns are assigned later)
df_sick_lungs_annot = df_sick_lungs_annot[df_sick_lungs_annot.diff_squared > 0].copy()

df_sick_lungs_annot.describe()

Unnamed: 0,has_cancer,diff_squared
count,1603.0,1603.0
mean,1.0,5896.980037
std,0.0,4736.038792
min,1.0,23.0
25%,1.0,2391.0
50%,1.0,4845.0
75%,1.0,7820.0
max,1.0,29760.0


## Normalizing the data
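In this section, the data is normalized. For the binary classification model, this applies to the image pixels. In the cell below, the lines for the binary dataframe are commented out, so only the annotated dataframe is normalized: its image pixels are divided by 255 and its `diff_squared` column is standardized.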

In [19]:
# Normalize the image pixels by dividing by 255
# (the lines for df_lungs_binary and df_sick_lungs are currently disabled)
#df_lungs_binary['image_pixels'] = df_lungs_binary.image_pixels / 255
#df_sick_lungs['image_pixels'] = df_sick_lungs.image_pixels / 255
df_sick_lungs_annot['image_pixels'] = df_sick_lungs_annot.image_pixels / 255

# Instantiate a scaler and apply it to the diff_squared column
scaler = StandardScaler()
df_sick_lungs_annot['diff_squared_nor'] = scaler.fit_transform(df_sick_lungs_annot[["diff_squared"]]).flatten()


display(df_lungs_binary.head())
display(df_sick_lungs_annot.head())

Unnamed: 0,image_pixels,has_cancer
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",0


Unnamed: 0,patient_id,sop_instance_id,image_pixels,patient_age,patient_sex,cancer_type,has_cancer,xmin,ymin,xmax,ymax,diff_squared,diff_squared_nor
0,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.173449310598...,"[[0.0, 0.09803921568627451, 0.2, 0.01176470588...",053Y,F,A,1,171,256,228,171,4845,-0.222192
1,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.283973423081...,"[[0.058823529411764705, 0.0, 0.227450980392156...",053Y,F,A,1,164,259,227,164,5985,0.018591
2,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.251378513351...,"[[0.06666666666666667, 0.047058823529411764, 0...",053Y,F,A,1,167,263,231,167,6144,0.052174
3,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.494618634856...,"[[0.18823529411764706, 0.0, 0.0, 0.20784313725...",053Y,F,A,1,122,244,233,122,13542,1.614726
4,Lung_Dx-A0002,1.3.6.1.4.1.14519.5.2.1.6655.2359.124409970701...,"[[0.1568627450980392, 0.09019607843137255, 0.1...",053Y,F,A,1,111,251,225,111,15960,2.125439
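
`StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`), so the scaled column has mean 0 and standard deviation 1. The same calculation can be sketched with NumPy alone. The five values below are taken from the preview, so the resulting numbers differ from `diff_squared_nor`, which is computed over all rows.

```python
import numpy as np

areas = np.array([4845, 5985, 6144, 13542, 15960], dtype=float)

# np.std uses ddof=0 by default, matching StandardScaler
z_scores = (areas - areas.mean()) / areas.std()

print(z_scores)
```

Note that the `std` reported by `describe` uses `ddof=1`, so dividing by that value gives a slightly different result than `StandardScaler`.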


## Store dataframes
After the data has been cleaned, we store the dataframes so that they can be used in separate Jupyter notebooks.

In [20]:
# Store the dataframes
%store df_lungs_binary
%store df_sick_lungs
%store df_sick_lungs_annot

Stored 'df_lungs_binary' (DataFrame)
Stored 'df_sick_lungs' (DataFrame)
Stored 'df_sick_lungs_annot' (DataFrame)
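
`%store` saves the variables in IPython's own storage, and another notebook can restore them with `%store -r`. As an alternative sketch, assuming a file-based hand-off is preferred, a dataframe can be written to a pickle file and read back. The file name and the toy dataframe below are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_example = pd.DataFrame({"has_cancer": [0, 1], "diff_squared": [0, 4845]})

# A temporary directory keeps the sketch self-contained; a real project
# would use a fixed location that the other notebooks can reach
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "df_example.pkl"
    df_example.to_pickle(target)
    restored = pd.read_pickle(target)

print(restored.equals(df_example))
```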
