# Cancer Detection and Data Background

The project builds CNN classifiers from scratch to detect cancer in histopathology images. The competition marks an image as positive when the 32x32 square at its center contains at least one pixel of tumor tissue, which makes the task very challenging. Because the competition is over, nothing is submitted here; the classifiers simply label the entire image as malignant or benign, which simplifies the training process.

The notebook will explore a few CNN iterations. Different activation functions, layers, and hyperparameters will be tested.

The curated Kaggle dataset will be used, available [here](https://www.kaggle.com/competitions/histopathologic-cancer-detection/data). This dataset, provided by Bas Veeling and others, has had duplicate images removed and is hosted by Kaggle for practice. The original dataset can be found on [GitHub](https://github.com/basveeling/pcam) under the CC0 License.


In [None]:
# for stability, set the version of tensorflow
#!pip install wurlitzer
#!pip install tensorflow==2.13.0
#!pip install tensorflow[and-cuda]
#!pip install tensorflow-io

In [None]:
# Standard Kaggle info


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import tensorflow as tf 
# the below is needed to fix cuDNN registration issues, put immediately after the tf import
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

import tensorflow_io as tfio

import numpy as np 
import pandas as pd 
from PIL import Image
import matplotlib.pyplot as plt

import keras
from keras.models import Sequential
from keras.layers import AvgPool2D,BatchNormalization, Conv2D, Dense, Flatten, Input, GlobalAveragePooling2D, Dropout 
from keras.layers import MaxPool2D, MaxPooling2D, ReLU, concatenate
import math, gc, copy

AUTOTUNE = tf.data.AUTOTUNE
import warnings
warnings.filterwarnings('ignore')
import os

print("Tensorflow Version In Use: ", tf.__version__ )

Next, we detect our hardware and light up GPUs or TPUs if we have them.

In [None]:
# Detect hardware and GPUs/TPUs (useful code here is copied from code search, not original)
try:
     # detect and init the TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()

    # instantiate a distribution strategy
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)
 
    # report results
    print('Running on TPU ', tpu.cluster_spec().as_dict())

except ValueError: # If TPU not found
    tpu = None
    tpu_strategy = tf.distribute.get_strategy() # Default strategy that works on CPU and single GPU
    print('No TPU found; using the default strategy (CPU or single GPU)')

print("Number of accelerators: ", tpu_strategy.num_replicas_in_sync)
print("TPU: ", tpu)

In [2]:
gpus = tf.config.list_physical_devices('GPU')
try:
    if gpus:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")

except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)


# Load Data

Load data and show basic information to describe the data files.

In [None]:
df = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/train_labels.csv')

# get basic info
print(df.info())
print('')
print(df.describe())
print('')

df.head()

Of the 220 thousand rows of data, 40% of the observations have a positive cancer label (the mean of the binary label column). The labels are binary (1, 0), with no null values in either column.

Analysis of the counts by classification label shows an imbalance in the data. Most images are negative (no cancer/benign).
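
As a small illustration of how the imbalance can be quantified, the sketch below computes class shares and a majority-to-minority ratio on a toy label column. The `toy` frame is made up for the example and is not the competition data; the same calls should work on `df['label']`.

```python
import pandas as pd

# Toy stand-in for the label column (hypothetical values, not the real data)
toy = pd.DataFrame({'label': [0, 0, 0, 1, 1]})

counts = toy['label'].value_counts()                # absolute count per class
shares = toy['label'].value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()       # majority size / minority size

print(shares.to_dict())
print("Majority-to-minority ratio:", imbalance_ratio)
```

A ratio of 1.0 would mean perfectly balanced classes; anything above that indicates how much larger the majority class is.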

In [None]:
# Count the occurrences of each category
category_counts = df['label'].value_counts()

# Plot the bar chart
plt.figure(figsize=(8, 6))
category_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Cancer classifications')
plt.xlabel('Positive/Negative detection')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

In [None]:
imagelist = os.listdir('/kaggle/input/histopathologic-cancer-detection/train')
print("Image List length:", len(imagelist))

print ("Image list type:", type(imagelist))
print ("First record type: ", type(imagelist[0]))
print ("First image values: ", imagelist[0])

The number of rows matches the number of images in the training folder. The raw listing shows that each entry is an image file name.
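
Matching counts do not by themselves guarantee that every label has a file. A stricter check compares the two sets of names directly. The helper below is a hypothetical sketch (the name `unmatched` and the toy inputs are assumptions); with the real data it could be called with `df.id` and `imagelist`.

```python
from pathlib import Path

def unmatched(ids, filenames):
    """Return (ids with no image file, image stems with no label row)."""
    stems = {Path(name).stem for name in filenames}  # strip the .tif extension
    id_set = set(ids)
    return sorted(id_set - stems), sorted(stems - id_set)

# Toy example: 'c3' has a label but no file, 'd4' has a file but no label
missing_files, unlabeled_files = unmatched(['a1', 'b2', 'c3'],
                                           ['a1.tif', 'b2.tif', 'd4.tif'])
print(missing_files, unlabeled_files)
```

Two empty lists would confirm a one-to-one match between label rows and image files.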

In [None]:
# Load sample submission data

dfss = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/sample_submission.csv')
# describe data
print(dfss.describe())
print('')
print(dfss.info())
print('')

imagesslist = os.listdir('/kaggle/input/histopathologic-cancer-detection/test/')
print("Validation Image List length:", len(imagesslist))

# raw data
dfss.head()

The sample submission covers the unlabeled Kaggle test images; it has no real labels, and every entry is a placeholder 0.

In [None]:
fig, axs = plt.subplots(3, 4)
# show 12 randomly sampled training images, one per subplot
for ax in axs.flat:
    imid = df.id.sample(1).values[0]
    image = Image.open('/kaggle/input/histopathologic-cancer-detection/train/'+imid+'.tif')
    ax.imshow(image)
plt.show()
# report size and metadata of the last image drawn
print("Image Specifications: Shape",image.size,"\nFormat:",image.info )


## Data handling in preparation for modeling
Preprocess data to better balance the data set

In [None]:
randomseed = 9

# rebalance so classes of positive and negative are same size, this will help in improving classification training
trainsize = int(df.label.sum() * .7)
testsize = df.label.sum()-trainsize
cancer = pd.DataFrame(df.index[df['label'] ==1].tolist(),columns=['id'])
negative = pd.DataFrame(df.index[df['label'] ==0].tolist(),columns=['id'])

# random sample the split
cancer_train = cancer['id'].sample(trainsize,replace=False,random_state = randomseed)
negative_train = negative['id'].sample(trainsize,replace=False,random_state = randomseed)

print("Cancer train size:",len(cancer_train))
print("Negative train size:",len(negative_train))

In [None]:
# mark which rows to use for train and for test (vectorized, no row-by-row loop)
cancer['train'] = cancer['id'].isin(cancer_train).astype(int)
negative['train'] = negative['id'].isin(negative_train).astype(int)

# test set sampling: draw only from rows not already used for training
cancer_test = cancer.loc[cancer['train'] == 0, 'id'].sample(
    testsize, replace=False, random_state=randomseed)
negative_test = negative.loc[negative['train'] == 0, 'id'].sample(
    testsize, replace=False, random_state=randomseed)

# the original loop used `== 1` (a comparison) here, so the flag was never set
cancer['test'] = cancer['id'].isin(cancer_test).astype(int)
negative['test'] = negative['id'].isin(negative_test).astype(int)

# compare the test size counts
print("Cancer test size:",len(cancer_test))
print("Negative test size:",len(negative_test))

The index numbers were used to randomly select train/test sets with balanced classes.
The dataframes for the train and test set will be built, complete with image paths.
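
The balanced sampling described above can also be written as one reusable function. The sketch below is an alternative formulation, not the notebook's own code: `balanced_split` is a hypothetical helper, and it assumes a frame with `id` and `label` columns. It sizes both classes to the minority class, then cuts each shuffled class into non-overlapping train and test slices.

```python
import pandas as pd

def balanced_split(labels, train_frac=0.7, seed=9):
    """Return (train, test) frames with equal class counts and no shared rows."""
    n_minority = labels['label'].value_counts().min()
    n_train = int(n_minority * train_frac)
    n_test = n_minority - n_train
    train_parts, test_parts = [], []
    for _, group in labels.groupby('label'):
        shuffled = group.sample(frac=1, random_state=seed)
        train_parts.append(shuffled.iloc[:n_train])
        test_parts.append(shuffled.iloc[n_train:n_train + n_test])
    # shuffle again so the two classes are interleaved
    train = pd.concat(train_parts).sample(frac=1, random_state=seed).reset_index(drop=True)
    test = pd.concat(test_parts).sample(frac=1, random_state=seed).reset_index(drop=True)
    return train, test

# Toy data: 40 positives and 60 negatives (made-up ids)
toy = pd.DataFrame({'id': [f'img{i}' for i in range(100)],
                    'label': [1] * 40 + [0] * 60})
train, test = balanced_split(toy)
print(train['label'].value_counts().to_dict(), test['label'].value_counts().to_dict())
```

Because each class is shuffled once and then sliced, a row can never land in both sets, which removes the need for separate `train` and `test` flag columns.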

In [None]:
def image_path(id_filename):
    return f"/kaggle/input/histopathologic-cancer-detection/train/{id_filename}.tif"

In [None]:
cancer_train_df = df.loc[df.index[cancer_train.tolist()]]
negative_train_df = df.loc[df.index[negative_train.tolist()]]

train_df = pd.concat([cancer_train_df, negative_train_df])
# shuffle once with a fixed seed so the row order is reproducible
train_df = train_df.sample(frac=1, random_state=randomseed).reset_index(drop=True)

train_df['path'] = train_df.id.apply(image_path)
print("Combined train length:",len(train_df.id)," Cancer positives:",train_df.label.sum())
train_df.head()

In [None]:
# do same for test data
cancer_test_df = df.loc[df.index[cancer_test.tolist()]]
negative_test_df = df.loc[df.index[negative_test.tolist()]]

test_df = pd.concat([cancer_test_df, negative_test_df])
# shuffle once with a fixed seed so the row order is reproducible
test_df = test_df.sample(frac=1, random_state=randomseed).reset_index(drop=True)

test_df['path'] = test_df.id.apply(image_path)
print("Combined test length:",len(test_df.id)," Cancer positives:",test_df.label.sum())
test_df.head()

Turn the dataframes into `tf.data` datasets of image tensors and one-hot labels for CNN training.
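
One step in that conversion is one-hot encoding the binary labels, so that label 0 becomes `[1, 0]` and label 1 becomes `[0, 1]`. A compact way to express this, shown here as a sketch on made-up labels, is to index an identity matrix with the label values:

```python
import numpy as np

labels = np.array([1, 0, 0, 1])             # toy labels, not the real data
one_hot = np.eye(2, dtype=np.int64)[labels]  # row i of the identity matrix encodes class i
print(one_hot)
```

This should produce the same mapping as a per-row list comprehension while avoiding a Python loop over every label.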

In [None]:
# define tf function to apply later - NOTE: this code is adapted from another notebook, very useful

@tf.function
def grab_images(path):
    file = tf.io.read_file(path)
    img = tfio.experimental.image.decode_tiff(file, index=0)  # decoded as RGBA, uint8
    # random flips act as light augmentation; they apply to every dataset mapped through this function
    img = tf.image.random_flip_left_right(img, seed=None)
    img = tf.image.random_flip_up_down(img, seed=None)
    img = img[:, :, :3]  # drop the alpha channel, keep RGB
    img = tf.cast(img, tf.float32) / 255.0  # scale pixel values to [0, 1]
    return img

# test it
test_image = grab_images('/kaggle/input/histopathologic-cancer-detection/train/ff1dd7be24e74d29d5a91862179703eadfe8fe43.tif')
plt.imshow(test_image)
plt.show()
# check normalized between 0-1
print(test_image[0:5,0:5,:])

# check for the right shape: RGB only, since the 4th (alpha) channel was dropped in grab_images
test_image.shape

In [None]:
# get training datasets together
train_labels = tf.data.Dataset.from_tensor_slices(np.array([np.array([0,1]) if i ==1 else np.array([1,0]) for i in train_df.label.values]))
train_paths = tf.data.Dataset.from_tensor_slices(np.array([path for path in train_df.path.values]))
train_imgs = train_paths.map(grab_images)
train_set = tf.data.Dataset.zip((train_imgs,train_labels)).batch(64).prefetch(AUTOTUNE)

# get test dataset together
test_labels = tf.data.Dataset.from_tensor_slices(np.array([np.array([0,1]) if i ==1 else np.array([1,0]) for i in test_df.label.values]))
test_paths = tf.data.Dataset.from_tensor_slices(np.array([path for path in test_df.path.values]))
test_imgs = test_paths.map(grab_images)
test_set = tf.data.Dataset.zip((test_imgs,test_labels)).batch(64).prefetch(AUTOTUNE)

In [None]:
checkpoint_filepath =''
#define the callbacks for upcoming models
# earlyst = tf.keras.callbacks.EarlyStopping(monitor="binary_crossentropy", 
#                                            patience = 5)
earlyst = tf.keras.callbacks.EarlyStopping(monitor="val_loss", 
                                           patience = 5)

# rlrop = tf.keras.callbacks.ReduceLROnPlateau(monitor="binary_crossentropy", 
#                                              factor=.1,
#                                              patience = 2,
#                                              min_lr = 0)

rlrop = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", 
                                             factor=.1,
                                             patience = 2,
                                             min_lr = 0)

# CNN models

Several CNN variants were tried; representative models are shown here.


## Starting CNN
Five blocks, each a convolution followed by an average pooling layer, feed three dense layers. The final layer has two units with a tanh activation, and training uses a binary cross-entropy loss to choose between the two classes.
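
To see how the image shrinks through these blocks, the sketch below traces the spatial size. It assumes 96x96 inputs, 2x2 pooling with the default stride of 2, and `padding='same'`, for which the output size is the input size divided by 2 and rounded up. The helper name `pooled_size` is made up for the example.

```python
import math

def pooled_size(size, n_pools, pool=2):
    """Spatial size after n_pools pooling layers with padding='same'."""
    for _ in range(n_pools):
        size = math.ceil(size / pool)  # 'same' padding rounds up
    return size

side = pooled_size(96, n_pools=5)   # 96 -> 48 -> 24 -> 12 -> 6 -> 3
flat_features = side * side * 32    # 32 filters in the last convolution
print(side, flat_features)
```

Under these assumptions the flattened vector has 288 features, which lines up with the width of the first dense layer.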

In [None]:
with tpu_strategy.scope():
    model = Sequential([
    Input(shape=(96, 96, 3)),  
   
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),      
    
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    Conv2D(64, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
    
    Conv2D(64, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    # build the ANN layers
    Flatten(),
    Dense(288, activation='relu'),
    Dense(128, activation='relu'),
    Dense(2, activation='tanh')
    ])
    
    model.compile(
     optimizer =    tf.keras.optimizers.RMSprop(
            learning_rate=0.0005,
            momentum=0.18,
        ),
        loss= keras.losses.BinaryCrossentropy(from_logits=True), # for tf v 2.15
        # loss= 'BinaryCrossentropy', # for tf v 2.13
        metrics=[ 'BinaryCrossentropy', 'accuracy']
    )

model.summary()

In [None]:
# define model save location
checkpoint_filepath = '/kaggle/working/model_basic_cnn/'
os.makedirs(checkpoint_filepath, exist_ok=True)  # no error if the folder already exists
checkpoint_filename = 'checkpoint.model.keras'
checkpoint_fullpath = checkpoint_filepath + checkpoint_filename
# checkpoint_fullpath = checkpoint_filepath # for tf v 2.13

# keep only the model with the best validation accuracy
checkpoints = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_fullpath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)


In [None]:
# set tf functions to run eagerly
tf.config.run_functions_eagerly(True)

# fit model
history = model.fit(
                    train_set,
                    epochs=10, # ran before at 20, set to 10 to keep the run time reasonable
                    callbacks=[rlrop,earlyst,checkpoints],
                    validation_data = test_set
                    )



In [None]:
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for cross entropy loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

print("Make a few predictions")
preds = model.predict(test_set.take(1)) # one batch of 64 images

print("Binary decision logits (first 10):\n",preds[0:10])

# test_set is not shuffled, so predictions line up with the first rows of test_df
print("Predictions:\n",[np.argmax(p) for p in preds[0:30]],'\nGround Truth:\n',list(test_df.label.values[0:30]))

## Analysis of Results from Initial Model
Analysing the training versus test loss over training epochs shows it is likely optimal around XX epochs.
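
When reading the loss curves, it may help to know that the final tanh activation bounds the model outputs to [-1, 1]. Assuming the loss treats those outputs as logits and passes them through a sigmoid (which is my understanding of `from_logits=True`), the predicted probability can never leave a narrow band, and the loss has a floor well above zero. The sketch below works out those bounds with NumPy; it is an illustration, not a measurement from this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logit_range = np.array([-1.0, 1.0])   # tanh output limits
prob_range = sigmoid(logit_range)     # most extreme probabilities reachable
min_loss = -np.log(prob_range[1])     # best possible cross-entropy per output

print("Probability range:", prob_range.round(3))
print("Lowest reachable loss:", round(float(min_loss), 3))
```

Under these assumptions the probabilities stay roughly between 0.27 and 0.73 and the loss cannot drop below about 0.31, so a plateau near that value would not necessarily mean the model has stopped learning to separate the classes.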

## Variation of Architecture
Add one more convolution and pooling block to determine its impact. This may allow the model to learn higher-order features that improve classification.
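
A rough parameter count helps in judging what the extra block changes. The sketch below assumes 3x3 kernels, 2x2 pooling with `padding='same'`, 96x96x3 inputs, and the filter and dense widths used in this notebook. The `count_params` helper is hypothetical; `model.summary()` remains the authoritative source.

```python
import math

def count_params(conv_filters, dense_units, side=96, channels=3, kernel=3):
    """Trainable parameters for conv+pool blocks followed by dense layers."""
    total = 0
    for filters in conv_filters:
        total += (kernel * kernel * channels + 1) * filters  # weights + biases
        channels = filters
        side = math.ceil(side / 2)  # 2x2 pooling with padding='same'
    features = side * side * channels  # size of the Flatten output
    for units in dense_units:
        total += (features + 1) * units
        features = units
    return total

v1 = count_params([32, 32, 64, 64, 32], [288, 128, 2])
v2 = count_params([32, 32, 64, 64, 32, 32], [288, 128, 2])
print("five blocks:", v1, " six blocks:", v2)
```

Under these assumptions the deeper variant ends up with fewer parameters overall, because the extra pooling step shrinks the flattened vector from 288 to 128 features and so reduces the size of the first dense layer.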

In [None]:
with tpu_strategy.scope():
    model2 = Sequential([
    Input(shape=(96, 96, 3)),  
   
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),      
    
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    Conv2D(64, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
    
    Conv2D(64, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
    
    # Additional Conv/pooling layer
    Conv2D(32, 3, padding='same', activation = 'relu'),
    AvgPool2D(pool_size=2, padding='same'),
 
    # build the ANN layers
    Flatten(),
    Dense(288, activation='relu'),
    Dense(128, activation='relu'),
    Dense(2, activation='tanh')
    ])
    
    model2.compile(
     optimizer =    tf.keras.optimizers.RMSprop(
            learning_rate=0.0005,
            momentum=0.18,
        ),
        loss= keras.losses.BinaryCrossentropy(from_logits=True), # for tf v 2.15
        # loss= 'BinaryCrossentropy', # for tf v 2.13
        metrics=[ 'BinaryCrossentropy', 'accuracy']
    )

model2.summary()

In [None]:
# define model save location
checkpoint_filepath = '/kaggle/working/model_xtralayer_cnn/'
os.makedirs(checkpoint_filepath, exist_ok=True)  # no error if the folder already exists
checkpoint_filename = 'checkpoint.model.keras'
checkpoint_fullpath = checkpoint_filepath + checkpoint_filename
# checkpoint_fullpath = checkpoint_filepath # for tf v 2.13

# keep only the model with the best validation accuracy
checkpoints = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_fullpath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

In [None]:
# set tf functions to run eagerly
tf.config.run_functions_eagerly(True)

# fit model
history2 = model2.fit(
                    train_set,
                    epochs=10,  # keep to 10 epochs to save time
                    callbacks=[rlrop,earlyst,checkpoints],
                    validation_data = test_set
                    )


In [None]:
# summarize history for accuracy
plt.plot(history2.history['accuracy'])
plt.plot(history2.history['val_accuracy'])
plt.title('model v2 accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history2.history['loss'])
plt.plot(history2.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

print("Make a few predictions with v2")
preds2 = model2.predict(test_set.take(1)) # one batch of 64 images

print("Binary decision logits v2 (first 10):\n",preds2[0:10])

# test_set is not shuffled, so predictions line up with the first rows of test_df
print("Predictions v2:\n",[np.argmax(p) for p in preds2[0:30]],'\nGround Truth:\n',list(test_df.label.values[0:30]))

## Hyperparameter tuning

In [None]:
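import keras_tuner as kt  # older releases were imported as `kerastuner`

def build_model(hp):
    # The tuner needs a function that builds a fresh model per trial, not a trained model.
    # ASSUMPTION: the search space below (dense width, learning rate) is illustrative only.
    layers = [Input(shape=(96, 96, 3))]
    for filters in (32, 32, 64, 64, 32, 32):  # same conv/pool blocks as model2
        layers += [Conv2D(filters, 3, padding='same', activation='relu'),
                   AvgPool2D(pool_size=2, padding='same')]
    layers += [Flatten(),
               Dense(hp.Int('dense_units', 128, 384, step=64), activation='relu'),
               Dense(128, activation='relu'),
               Dense(2, activation='tanh')]
    hypermodel = Sequential(layers)
    hypermodel.compile(
        optimizer=tf.keras.optimizers.RMSprop(
            learning_rate=hp.Choice('learning_rate', [1e-3, 5e-4, 1e-4]),
            momentum=0.18),
        loss=keras.losses.BinaryCrossentropy(from_logits=True),
        metrics=['accuracy'])
    return hypermodel

tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=3,
    directory='cancerclassifier_tuner_dir',
    project_name='cancerclassifier_v2'
)

tuner.search(train_set, epochs=5, validation_data=test_set)

# Find the best hyperparameters
best_hypers = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters:\n", best_hypers.values)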

#### Save model


In [None]:
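import pickle

# ASSUMPTION: model2 is treated as the final model here; swap in whichever model performed best
final_model, final_history = model2, history2

# Save the model
final_model.save('Cancer_Image_Model.h5')

# Save the history
with open('Cancer_Image_Model_history.pkl', 'wb') as file:
    pickle.dump(final_history.history, file)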

## Conclusions
CNN results vary, etc.

#### Attributions
Source references:
HCD_SB6: https://www.kaggle.com/code/shayjohnson/hcd-sb-6

Kaggle environment setup code: https://keras.io/getting_started/

Binary Classification: https://www.kaggle.com/code/toddgardiner/binary-cancer-classifier-s-tf-tpu

GPU usage code: https://github.com/tensorflow/tensorflow/issues/64177

cuDNN and similar code: https://github.com/tensorflow/tensorflow/issues/62075

TensorFlow graph vs. eager execution: https://www.tensorflow.org/guide/intro_to_graphs

Tensorflow versioning issues and CUDA on Kaggle: https://github.com/tensorflow/tensorflow/issues/64177