# Cancer Detection Histopathology Using Deep Learning

This project combines Whole Slide Imaging (WSI) tissue patches with Convolutional Neural Networks to classify metastatic cancer.

---

This project is based on the Kaggle competition Histopathologic Cancer Detection, found here: https://www.kaggle.com/competitions/histopathologic-cancer-detection/overview

You can find this project in the GitHub repository: https://github.com/chill0121/Kaggle_Projects/tree/main/Cancer_Detection_Histopathology

## Table of Contents <a name="toc"></a>

---

- 1.[**Data Source Information**](#datasource)
  - 1.1. [Dataset Information](#data)
  - 1.2. [Kaggle Information](#kaggle)
- 2.[**Setup**](#setup)
  - 2.1. [Environment Information for Reproducibility](#env)
  - 2.2. [Importing the Data](#dataimport)
- 3.[**Data Preprocessing**](#datapre)
  - 3.1. [First Looks](#firstlook)
  - 3.2. [Missing Data](#missingdata)
  - 3.3. [Data Cleanup](#dataclean)
  - 3.4. [Checking for Duplicate Entries](#duplicates)
- 4.[**Exploratory Data Analysis (EDA)**](#eda)
- 5.[**Models**](#models)
  - 5.1. [Baseline Models](#baseline)
  - 5.2. [Model Helper Functions](#helper)
  - 5.3. [Deep Learning Models](#deep)
- 6.[**Results**](#results)
- 7.[**Conclusion - Kaggle Submission Test Set**](#conclusion)
  - 7.1. [Possible Areas for Improvement](#improvements)

- [**Appendix A - Online References**](#appendixa)

## 1. Data Source Information <a name="datasource"></a>

---


### 1.1. Data Information: <a name="data"></a>

The data in this project is a modified, reduced version of the PatchCamelyon (PCam) dataset. It consists of 96 x 96 px color images extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue. The dataset was first introduced in this paper: https://arxiv.org/abs/1806.03962v1

"In this dataset, you are provided with a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. You are predicting the labels for the images in the test folder. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image.

The original PCam dataset contains duplicate images due to its probabilistic sampling, however, the version presented on Kaggle does not contain duplicates. We have otherwise maintained the same data and splits as the PCam benchmark."
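As a minimal sketch of the labelling rule quoted above, the snippet below slices the central 32 x 32 window out of a 96 x 96 x 3 patch. It assumes the labelled region is exactly centered; the function name `center_region` and the random stand-in patch are hypothetical and not part of the competition code.

```python
import numpy as np

PATCH_SIZE = 96    # full patch edge, in pixels
CENTER_SIZE = 32   # edge of the labelled center region, in pixels

def center_region(patch):
    """Return the central CENTER_SIZE x CENTER_SIZE window of one (96, 96, 3) patch."""
    start = (PATCH_SIZE - CENTER_SIZE) // 2   # 32
    stop = start + CENTER_SIZE                # 64
    return patch[start:stop, start:stop, :]

# Random pixels stand in for a real patch (illustration only).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(PATCH_SIZE, PATCH_SIZE, 3), dtype=np.uint8)
center = center_region(patch)
print(center.shape)  # (32, 32, 3)
```

Only the pixels inside this window determine the label; the surrounding border is context for the model.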

**Data Info:**
- 277,485 Slide Image Patches
    - Images: 96 x 96 x 3

### 1.2. Kaggle Information: <a name="kaggle"></a>

#### Description:

In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. The data for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset (the original PCam dataset contains duplicate images due to its probabilistic sampling, however, the version presented on Kaggle does not contain duplicates).

#### Evaluation:

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
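To make the metric concrete, here is a small pure-Python sketch of the area under the ROC curve, computed as the share of (positive, negative) pairs in which the positive example receives the higher score. The function name `pairwise_auc` and the sample values are made up for illustration; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative label.')
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

Because only the ranking of the scores matters, the metric rewards well-ordered probabilities rather than a particular decision threshold.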

#### Citation: 

Will Cukierski. (2018). Histopathologic Cancer Detection. Kaggle. https://kaggle.com/competitions/histopathologic-cancer-detection

###### [Back to Table of Contents](#toc)

## 2. Setup <a name="setup"></a>

---

In [5]:
import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image

import sklearn
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, auc, roc_curve, RocCurveDisplay

import tensorflow as tf
import torch

###### [Back to Table of Contents](#toc)

### 2.1. Environment Information for Reproducibility: <a name="env"></a>

In [6]:
print(f"Python version: {sys.version}")

packages = [pd, np, sns, sklearn, tf, torch]
for package in packages:
    print(f"{str(package).partition('from')[0]} using version: {package.__version__}")

Python version: 3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]
<module 'pandas'  using version: 2.1.4
<module 'numpy'  using version: 1.26.4
<module 'seaborn'  using version: 0.13.2
<module 'sklearn'  using version: 1.3.2
<module 'tensorflow'  using version: 2.16.2
<module 'torch'  using version: 2.2.2
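An alternative way to record versions without importing each package is the standard library's `importlib.metadata`. The sketch below is only one possible approach; the helper name `package_versions` is hypothetical, and the distribution names passed to it are assumptions about how the packages were installed.

```python
from importlib import metadata

def package_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = 'not installed'
    return versions

# Distribution names can differ from import names (e.g. scikit-learn vs. sklearn).
for name, version in package_versions(['pandas', 'numpy', 'seaborn', 'scikit-learn', 'tensorflow', 'torch']).items():
    print(f'{name}: {version}')
```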


###### [Back to Table of Contents](#toc)

### 2.2. Importing the Data: <a name="dataimport"></a>

In [7]:
# Set directories
current_wdir = os.getcwd()
data_folder = current_wdir + '/Data/'

In [34]:
def image_import(file_list, folder_name):
    '''
    Takes a list of image filenames and loads the images into an array, keeping each image id (the filename without its extension).
    
    Parameters:
        file_list: List of filenames in the form ['name.extension', ...]
        folder_name: String of folder name within ./Data/ to import.
    Returns:
        image_array: ndarray of images with shape (n_images, 96, 96, 3).
        id_array: ndarray of image ids.
    '''
    id_array = np.empty(len(file_list), dtype = object)
    # 4D array with shape (n_images, height, width, channels).
    # uint8 holds 8-bit pixel values and needs 1/8 the memory of the float64 default.
    image_array = np.zeros((len(file_list), 96, 96, 3), dtype = np.uint8)

    for i, file in enumerate(file_list):
        # Separate image id and file extension.
        image_id, _ = os.path.splitext(file)
        # Close each file as soon as its pixels are copied into the array.
        with Image.open(f'./Data/{folder_name}/{file}') as img:
            image_array[i] = np.asarray(img)
        id_array[i] = image_id
        if i % 10_000 == 0:
            print(f'..importing image #{i}')

    return image_array, id_array

In [35]:
files_train = os.listdir('./Data/train')
files_test = os.listdir('./Data/test')

train_images, train_ids = image_import(files_train, 'train')
test_images, test_ids = image_import(files_test, 'test')

..importing image #0
..importing image #10000
..importing image #20000


KeyboardInterrupt: 
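The import above fills one array with every image, so the array's element type largely determines how much memory is needed. The back-of-the-envelope sketch below estimates the footprint if all 277,485 patches from the Data Info list were held at once; the helper name `array_gigabytes` is hypothetical, and the real split between the train and test folders is not accounted for.

```python
N_IMAGES = 277_485            # patch count quoted in the Data Info list
PATCH_SHAPE = (96, 96, 3)     # height, width, channels

def array_gigabytes(n_images, bytes_per_value):
    """Size in GB (10**9 bytes) of an (n_images, 96, 96, 3) array."""
    values_per_image = PATCH_SHAPE[0] * PATCH_SHAPE[1] * PATCH_SHAPE[2]
    return n_images * values_per_image * bytes_per_value / 1e9

float64_gb = array_gigabytes(N_IMAGES, 8)  # NumPy's default dtype for np.zeros
uint8_gb = array_gigabytes(N_IMAGES, 1)    # one byte per channel value
print(f'float64: {float64_gb:.1f} GB, uint8: {uint8_gb:.1f} GB')
# float64: 61.4 GB, uint8: 7.7 GB
```

An array this large may not fit in memory on a typical workstation, which is one reason to consider a compact dtype or loading images in batches.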

In [8]:
# Collect and sort the paths of all .csv files in the data folder.
file_path = sorted(os.path.join(data_folder, file) for file in os.listdir(data_folder) if file.endswith('.csv'))

# Load each .csv file into a dataframe.
train = pd.read_csv(os.path.join(data_folder, 'train.csv'))
X_test = pd.read_csv(os.path.join(data_folder, 'test.csv'))
sample_y_test = pd.read_csv(os.path.join(data_folder, 'sample_submission.csv'))

FileNotFoundError: [Errno 2] No such file or directory: '/Users/chill/GitHub/Kaggle_Projects/Cancer_Detection_Histopathology/Data/'
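Because `os.listdir` returns filenames in arbitrary order, the image ids collected during import generally will not line up with the rows of a label file. A hedged sketch of one way to align them follows; the column names `id` and `label` and the toy values are assumptions for illustration, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: ids in the arbitrary order os.listdir() might return,
# and a label table in a different order. Column names are assumptions.
image_ids = np.array(['c3', 'a1', 'b2'], dtype=object)
labels_df = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'label': [0, 1, 1]})

# Reorder the labels so that labels[i] belongs to image_ids[i].
labels = labels_df.set_index('id').loc[image_ids, 'label'].to_numpy()
print(labels.tolist())  # [1, 0, 1]
```

Indexing with `.loc` raises a `KeyError` if any image id is missing from the label table, which doubles as a sanity check.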

###### [Back to Table of Contents](#toc)

## 3. Data Preprocessing <a name="datapre"></a>

---

### 3.1. First Looks: <a name="firstlook"></a>

Print out some basic information about the training and testing sets.

In [None]:
print('-------------------------------------\nTrain\n-------------------------------------')
display(train)
print(train.dtypes)
print('\n------------------------------------\nTest\n------------------------------------')
print(X_test.columns.to_list())
print(X_test.shape)

-------------------------------------
Train
-------------------------------------


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

------------------------------------
Test
------------------------------------
['id', 'keyword', 'location', 'text']
(3263, 4)


To keep all future plots clear and consistent, a color map dictionary will map the classes to a color. 

In [None]:
# Build custom color map for consistent label visualization.
class_cmap = {'Not Disaster' : '#012A36',
              'Disaster' : '#D16666'}

###### [Back to Table of Contents](#toc)

### 3.2. Missing Data: <a name="missingdata"></a>

Next, we should ensure there aren't any entries with completely missing data.

###### [Back to Table of Contents](#toc)

### 3.3. Data Cleanup: <a name="dataclean"></a>

###### [Back to Table of Contents](#toc)

### 3.4. Checking for Duplicate Entries: <a name="duplicates"></a>

Finally, we need to ensure there aren't any duplicate entries in the data.

In [None]:
# Training Set
print('Duplicates Found:', train.text.duplicated().sum())
print('DF Shape:', train.shape)

drop_idx = train[train.text.duplicated()].index
train = train.drop(drop_idx, axis = 0)

print('Duplicates Found:', train.text.duplicated().sum())
print('DF Shape:', train.shape)

Duplicates Found: 61
DF Shape: (7613, 12)
Duplicates Found: 0
DF Shape: (7552, 12)


61 tweets were flagged as duplicates in the training data. These have now been removed, leaving 7,552 rows.
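The check above compares a text column. For image patches, an analogous exact-duplicate check could hash each image's pixel data, as in the sketch below. The function name `duplicate_mask` is hypothetical, and the approach only catches byte-identical images, not near-duplicates such as shifted or re-encoded patches.

```python
import hashlib
import numpy as np

def duplicate_mask(images):
    """Boolean mask that is True for every image whose pixels exactly repeat an earlier one."""
    seen = set()
    mask = np.zeros(len(images), dtype=bool)
    for i, img in enumerate(images):
        digest = hashlib.sha1(np.ascontiguousarray(img).tobytes()).hexdigest()
        if digest in seen:
            mask[i] = True
        else:
            seen.add(digest)
    return mask

# Toy data: three tiny "images", the third is a copy of the first.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 4, 4, 3), dtype=np.uint8)
images[2] = images[0]
print(duplicate_mask(images).tolist())  # [False, False, True]
```

Like `Series.duplicated()` with its default settings, the mask keeps the first occurrence and flags only the repeats, so `images[~mask]` would drop them.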

In [None]:
# Test
print('Duplicates Found:', X_test.text.duplicated().sum())
print('DF Shape:', X_test.shape)

Duplicates Found: 20
DF Shape: (3263, 10)


Unfortunately, there are also 20 duplicate tweets in the testing set. Because the testing set is fixed by Kaggle and every row must be included in the submission for scoring, these duplicates cannot be removed.

###### [Back to Table of Contents](#toc)

## 4. Exploratory Data Analysis (EDA) <a name="eda"></a>

---

###### [Back to Table of Contents](#toc)

## 5. Models <a name="models"></a>

---

### 5.1. Baseline Models <a name="baseline"></a>

It's always important to set a suitable baseline for comparison.

The first baseline is simple: an equal random chance of selecting either class.

In [None]:
mod_rand_baseline = 1 / len(train.target.unique()) # 1/2
print('Random Baseline F1-Score:', mod_rand_baseline)

Random Baseline F1-Score: 0.5


The next baseline always predicts the most frequent class in the dataset.

*Note: The predictions are submitted to Kaggle to receive a score, which is posted here and in the results section.*

In [None]:
# most_freq_cat = train.target.value_counts(sort = True).index[0]
# mod_freq_array = np.full(shape = len(X_test), fill_value = most_freq_cat)

# y_pred_freq_baseline = X_test[['id']].copy()
# y_pred_freq_baseline['target'] = mod_freq_array
# y_pred_freq_baseline.to_csv(current_wdir + f'/Models/Frequency_Baseline/X_test_Submission_Freq_Baseline.csv', index = False)
# display(y_pred_freq_baseline)

# mod_freq_baseline = 0.57033
# print('Most Frequent Category Baseline F1-Score:', mod_freq_baseline)

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |
| ... | ... | ... |
| 3258 | 10861 | 0 |
| 3259 | 10865 | 0 |
| 3260 | 10868 | 0 |
| 3261 | 10874 | 0 |


Most Frequent Category Baseline F1-Score: 0.57033
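The most-frequent-class baseline described above can be sketched in a few lines of standard-library Python. The function name `most_frequent_baseline` and the labels are made up for illustration.

```python
from collections import Counter

def most_frequent_baseline(train_labels, n_predictions):
    """Predict the most common training label for every test row."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_predictions

# Made-up labels for illustration.
train_labels = [0, 0, 0, 1, 1]
predictions = most_frequent_baseline(train_labels, n_predictions=4)
print(predictions)  # [0, 0, 0, 0]
```

A constant prediction like this carries no ranking information, so under the ROC AUC metric described in section 1.2 it scores 0.5.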


###### [Back to Table of Contents](#toc)

### 5.2. Model Helper Functions <a name="helper"></a>

These helper functions visualize the training process by plotting the training and validation metric and loss for each epoch, along with learning rate reductions and the early stopping point.

In [None]:
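def set_region_overlay(model_history_df, x_offset):
    '''
    Labels the greyed-out region after the best epoch (lowest val_loss) with an 'Early Stop' text box.

    Parameters:
        model_history_df: DataFrame of training history with a val_loss column, one row per epoch.
        x_offset: Horizontal shift (in epochs) applied to the label position.
    '''
    x_mid = ((model_history_df.index.stop-1) + model_history_df.val_loss.idxmin()) / 2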
    plt.text(x = x_mid - x_offset,
             y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
             s = 'Early Stop',
             rotation = 'horizontal',
             weight = 'extra bold',
             fontsize = 'large',
             antialiased = True,
             alpha = 1,
             c = 'white',
             bbox = dict(facecolor = 'black', edgecolor = 'black', boxstyle = 'round', alpha = 0.5))
    return None

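def plot_TF_training_history(model_history_df):
    '''
    Plots training/validation accuracy and loss per epoch, marking learning rate reductions, the best epoch, and the epochs after it.

    Parameters:
        model_history_df: DataFrame of training history with accuracy, loss, val_accuracy, val_loss, and learning_rate columns.
    '''
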
    # Find all epochs that callback ReduceLROnPlateau() occurred.
    lr_change = model_history_df.learning_rate.shift(-1) != model_history_df.learning_rate

    # Create color map and lines style map for train/val
    plot_maps = {'cmap': {'accuracy': '#653096',
                        'loss': '#653096',
                        'val_accuracy': '#004a54',
                        'val_loss': '#004a54'},
                'dashmap': {'accuracy': '',
                            'loss': (2,1),
                            'val_accuracy': '',
                            'val_loss': (2,1)}}

    # Plot
    fig, ax = plt.subplots(figsize = (10,6))
    ax = sns.lineplot(model_history_df.drop(columns = ['learning_rate']).iloc[1:], palette = plot_maps['cmap'], dashes = plot_maps['dashmap'])
    ax.set_xlabel('Epoch')

    # Create secondary x-axis for Learning Rate changes.
    sec_ax = ax.secondary_xaxis('top')
    sec_ax.set_xticks(model_history_df[lr_change].index[:-1])
    sec_ax.set_xticklabels([f'{x:.1e}' for x in model_history_df[lr_change].learning_rate[1:]])
    sec_ax.tick_params(axis = 'x', which = 'major', labelsize = 7)
    sec_ax.set_xlabel('Learning Rate Reductions')

    # Create vertical line for each LR change.
    for epoch in (model_history_df[lr_change].index[:-1]):
        plt.axvline(x = epoch, c = '#d439ad', ls = (0, (5,5)))
    # Create lines for best epoch/val_loss.
    plt.axvline(x = (model_history_df.val_loss.idxmin()), c = '#f54260', ls = (0, (3,1,1,1)))
    plt.axhline(y = (model_history_df.val_loss.min()), c = '#f54260', alpha = 0.3, ls = (0, (3,1,1,1)))
    # Grey out epochs after early stop.
    plt.axvspan(model_history_df.val_loss.idxmin(), model_history_df.index.stop-1, facecolor = 'black', alpha = 0.25)
    plt.margins(x = 0)
    set_region_overlay(model_history_df, 5)

    plt.legend()
    plt.show()
    return None
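The plotting helper locates learning rate reductions by comparing each epoch's learning rate with the next one. The sketch below isolates that step on a small, made-up history table so the logic can be checked without training a model; all values are hypothetical.

```python
import pandas as pd

# Hypothetical training history; values are made up for illustration.
history = pd.DataFrame({
    'loss':          [0.70, 0.55, 0.48, 0.45, 0.44, 0.43],
    'val_loss':      [0.72, 0.60, 0.52, 0.50, 0.51, 0.53],
    'learning_rate': [1e-3, 1e-3, 1e-3, 1e-4, 1e-4, 1e-5],
})

# An epoch is flagged when the next epoch uses a different learning rate.
# The final epoch is always flagged (its shifted value is NaN), so it is dropped.
lr_change = history.learning_rate.shift(-1) != history.learning_rate
change_epochs = history[lr_change].index[:-1].tolist()

# The best epoch is the one with the lowest validation loss.
best_epoch = history.val_loss.idxmin()

print(change_epochs, best_epoch)  # [2, 4] 3
```

Here the learning rate drops after epochs 2 and 4, and every epoch after epoch 3 falls in the region the plot greys out.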

###### [Back to Table of Contents](#toc)

### 5.3. Deep Learning Models <a name="deep"></a>

###### [Back to Table of Contents](#toc)

## 6. Results <a name="results"></a>

---

In [None]:
# Highlight the best score in each column green.
def max_value_highlight(col):
    is_max = (col == col.max())
    
    return ['background-color: green' if v else '' for v in is_max]

# Highlight the top two scores in each column blue; applying max_value_highlight afterwards leaves only 2nd place blue.
def highlight_top_two(col):
    # Take the two largest values.
    top_two = col.sort_values(ascending = False).iloc[:2]
    # Mask
    is_top_two = col.isin(top_two)

    return ['background-color: blue' if v else '' for v in is_top_two]

To evaluate the test set, the .csv files must be submitted to Kaggle. Each model's predictions were saved above and manually submitted. A screenshot of all the results is shown below.

<img src="https://github.com/chill0121/Kaggle_Projects/blob/main/NLP_Disaster_Tweets/Models/Kaggle_Results.png?raw=true" alt="results" width="1000"/>

In [None]:
# # Kaggle Submission Scores for Test Set.
# results_test = {'Random_Baseline' : mod_rand_baseline,
#                 'Frequent_Baseline' : mod_freq_baseline,
#                 'RNN' : 0.73000,
#                 'LSTM' : 0.77903,
#                 'GRU' : 0.73582}

# results_test_df = pd.DataFrame().from_dict(results_test, orient = 'index', columns = ['F1-Score'])
# results_test_df.index.name = 'Model'

# results_test_df.style.apply(highlight_top_two).apply(max_value_highlight)

| Model | F1-Score |
|---|---|
| Random_Baseline | 0.5 |
| Frequent_Baseline | 0.57033 |
| RNN | 0.73 |
| LSTM | 0.77903 |
| GRU | 0.73582 |


The best model's score is highlighted in green and the 2nd best is in blue.

Discussions can be found in the conclusion section.

###### [Back to Table of Contents](#toc)

## 7. Conclusion - Kaggle Submission Test Set <a name="conclusion"></a>

---



### 7.1. Possible Areas for Improvement <a name="improvements"></a>



###### [Back to Table of Contents](#toc)

## Appendix A - Online References: <a name="appendixa"></a>

Resources that helped along the way in no particular order.

1. 

Exported to HTML via command line using:

- `jupyter nbconvert NLP_Disaster_Tweets.ipynb --to html`
- `jupyter nbconvert NLP_Disaster_Tweets.ipynb --to html --HTMLExporter.theme=dark`

###### [Back to Table of Contents](#toc)