# Breast Cancer Analysis

**Building the breast cancer image dataset**


Our breast cancer image dataset consists of 277,524 images, each of which is 50×50 pixels.

If we were to try to load this entire dataset in memory at once we would need a little over 5.8GB.
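
As a rough cross-check, the raw pixel footprint can be estimated from the image count and the storage type. The sketch below assumes three colour channels and uses the image counts that the data generators report later in this notebook. `dataset_bytes` is a hypothetical helper, and the estimate ignores Python and framework overhead, so it brackets the figure above rather than reproducing it.

```python
def dataset_bytes(n_images, height=50, width=50, channels=3, bytes_per_value=1):
    """Raw size in bytes of n_images stored as one dense array."""
    return n_images * height * width * channels * bytes_per_value

# Training + validation + testing counts reported by the generators
N_IMAGES = 199818 + 22201 + 55505

# Assumption: pixels are held either as 8-bit integers or as 32-bit floats
for name, size in (("uint8", 1), ("float32", 4)):
    gib = dataset_bytes(N_IMAGES, bytes_per_value=size) / 1024**3
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the array takes roughly 1.9 GiB as `uint8` and roughly 7.8 GiB once rescaled to `float32`, which is the form the network actually consumes.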

Kaggle lists the data set at 1.5 GB, but together with its metadata it needs more than 4 GB of disk space. The images are stored in numerous directories, each named with a number representing a patient ID. A data set this large does not fit into memory, so during training we have to use a data generator.

The data generator expects a path to a directory in which the data is saved in subdirectories named after the classes. Therefore, we must first create training, validation, and testing directories, each containing two class subdirectories (since this is a binary classification problem).

The code below creates a new directory tree and copies all files into training (72% of the data), validation (8%), and testing (20%) directories. Each of these contains subdirectories named "0" for negative examples and "1" for positive examples.

In [1]:
from imutils import paths
import random, shutil, os

random.seed(7)
TRAIN_AND_VAL_SPLIT = 0.8
VAL_SPLIT = 0.1

In [3]:
# %cd "C:\Users\HP\Projects\Breast Cancer"   Change directory
os.chdir('C:\\Users\\HP\\Projects\\Breast Cancer')
INPUT_PATH = "C:\\Users\\HP\\Projects\\Breast Cancer"
BASE_PATH = "Breast_Cancer_Data\\arranged"

# ---- Paths of the three subdirectories that will hold the three data sets
TRAIN_PATH = os.path.join(BASE_PATH, "training")
VAL_PATH = os.path.join(BASE_PATH, "validation")
TEST_PATH = os.path.join(BASE_PATH, "testing")

In [12]:
# --- Get a list of paths for all images in original data using imutils function paths 
originalPaths = list(paths.list_images(INPUT_PATH))
# --- Randomly shuffle all paths in the list
random.shuffle(originalPaths)

# --- Take first 80% of the paths in trainPaths (TotalTrain)
N = int(len(originalPaths)*TRAIN_AND_VAL_SPLIT)
trainPaths = originalPaths[:N]
# ---- Take last 20% of paths in testPaths
testPaths = originalPaths[N:]

# --- Take 10% of trainPaths for validation
N = int(len(trainPaths)*VAL_SPLIT)
valPaths = trainPaths[:N]
# --- Take 90% of trainPaths (total) for training data (True training set)
trainPaths = trainPaths[N:]

In [13]:
len(originalPaths)*TRAIN_AND_VAL_SPLIT

222019.2

In [14]:
int(len(trainPaths)*VAL_SPLIT)

19981

In [15]:
(len(originalPaths)*TRAIN_AND_VAL_SPLIT)*0.2

44403.840000000004
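
The split sizes can be reproduced from the total image count alone. The sketch below mirrors the slicing arithmetic used above; `split_sizes` is a hypothetical helper, not part of the notebook. Cell 14 appears to have been run after `trainPaths` had already been shortened, which would explain why it shows 19981 (10% of 199,818) rather than the validation set size.

```python
def split_sizes(n_total, train_and_val_split=0.8, val_split=0.1):
    """Return (train, validation, test) sizes using the same slicing arithmetic."""
    n_train_val = int(n_total * train_and_val_split)  # first 80% of the shuffled paths
    n_test = n_total - n_train_val                    # remaining 20%
    n_val = int(n_train_val * val_split)              # first 10% of the 80%
    n_train = n_train_val - n_val                     # what is left for training
    return n_train, n_val, n_test

# 222019.2 / 0.8, the total implied by the cell output above
n_total = 277524
train, val, test = split_sizes(n_total)
print(train, val, test)
print(round(train / n_total, 3), round(val / n_total, 3), round(test / n_total, 3))
```

The three sizes should agree with the counts that `flow_from_directory` reports when the generators are created.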

In [16]:
# --- Create a list with the set name, the file paths, and the base directory of each set
datasets=[("training", trainPaths, TRAIN_PATH),
          ("validation", valPaths, VAL_PATH),
          ("testing", testPaths, TEST_PATH)]

In [17]:
# --- Iterate over the training, validation and testing sets
for (setType, setPaths, basePath) in datasets:
    print('Building', setType, 'set')
    # --- If the base directory doesn't exist, create it
    if not os.path.exists(basePath):
        print('Building directory', basePath)
        os.makedirs(basePath)

    # --- Iterate over all file paths in the current set
    for path in setPaths:
        # --- Get the file name (last component of the path)
        file = os.path.basename(path)
        # --- The character just before ".png" is the class label, "0" or "1"
        label = file[-5]
        # --- Class directory: base directory + "0" or "1"
        labelPath = os.path.join(basePath, label)
        # --- If this directory doesn't exist, create it
        if not os.path.exists(labelPath):
            print('Building directory', labelPath)
            os.makedirs(labelPath)

        # --- Destination path of the file
        newPath = os.path.join(labelPath, file)
        # --- Copy the file from its original location to newPath
        shutil.copyfile(path, newPath)

Building training set
Building directory Breast_Cancer_Data\arranged\training
Building directory Breast_Cancer_Data\arranged\training\0
Building directory Breast_Cancer_Data\arranged\training\1
Building validation set
Building directory Breast_Cancer_Data\arranged\validation
Building directory Breast_Cancer_Data\arranged\validation\1
Building directory Breast_Cancer_Data\arranged\validation\0
Building testing set
Building directory Breast_Cancer_Data\arranged\testing
Building directory Breast_Cancer_Data\arranged\testing\1
Building directory Breast_Cancer_Data\arranged\testing\0
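
Reading the label from a fixed character position works only while every file name ends in the class digit followed by `.png`. A slightly more defensive variant is sketched below. It assumes file names that end in `class0.png` or `class1.png`, which is how the Kaggle copy of this data set appears to name its patches; both that pattern and the helper `label_from_filename` are assumptions, not something the notebook states.

```python
import os
import re

# Assumed naming pattern: "..._class0.png" or "..._class1.png"
_LABEL_RE = re.compile(r"class([01])\.png$")

def label_from_filename(path):
    """Return "0" or "1" parsed from the file name, or raise ValueError."""
    name = os.path.basename(path)
    match = _LABEL_RE.search(name)
    if match is None:
        raise ValueError(f"no class label found in {name!r}")
    return match.group(1)

# Hypothetical file name, used only for illustration
print(label_from_filename("10253_idx5_x1351_y1101_class0.png"))
```

Failing loudly on an unexpected name avoids silently creating stray class directories such as "g" for a `.jpeg` file.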


Once this is done, we can check whether negative and positive examples are distributed similarly across the three sets.

In [4]:
# --- Count the number of files in each class directory
Train_files_0 = len(list(paths.list_images(os.path.join(TRAIN_PATH, "0"))))
Train_files_1 = len(list(paths.list_images(os.path.join(TRAIN_PATH, "1"))))

Val_files_0 = len(list(paths.list_images(os.path.join(VAL_PATH, "0"))))
Val_files_1 = len(list(paths.list_images(os.path.join(VAL_PATH, "1"))))

Test_files_0 = len(list(paths.list_images(os.path.join(TEST_PATH, "0"))))
Test_files_1 = len(list(paths.list_images(os.path.join(TEST_PATH, "1"))))


from numpy import around

print('Training set has', around(Train_files_0/Train_files_1, 3), 'ratio of negative to positive')
print('Validation set has', around(Val_files_0/Val_files_1, 3), 'ratio of negative to positive')
print('Testing set has', around(Test_files_0/Test_files_1,3), 'ratio of negative to positive')

Training set has 2.526 ratio of negative to positive
Validation set has 2.497 ratio of negative to positive
Testing set has 2.52 ratio of negative to positive
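
The same check can be written without `imutils` and wrapped in a function, so that it can be repeated for any set directory. This is a minimal sketch using only the standard library; `count_images`, `negative_to_positive_ratio`, and the list of extensions are hypothetical choices rather than part of the original notebook.

```python
import os

# Assumption: these extensions cover every image in the data set
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")

def count_images(directory):
    """Count image files below directory, recursively."""
    total = 0
    for _root, _dirs, files in os.walk(directory):
        total += sum(1 for f in files if f.lower().endswith(IMAGE_EXTENSIONS))
    return total

def negative_to_positive_ratio(set_path):
    """Ratio of images in <set_path>/0 to images in <set_path>/1."""
    negatives = count_images(os.path.join(set_path, "0"))
    positives = count_images(os.path.join(set_path, "1"))
    if positives == 0:
        raise ValueError(f"no positive examples found in {set_path}")
    return negatives / positives
```

Calling `negative_to_positive_ratio(TRAIN_PATH)` would then give the training ratio, and likewise for the validation and testing paths.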


**Train and validate CNN model**

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Conv2D, BatchNormalization, Dense, Dropout, MaxPooling2D, Activation
from tensorflow.keras.initializers import GlorotNormal
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

# -- Define the number of epochs, the batch size, and the global random seed
NUM_EPOCHS = 8
BS = 100  
tf.random.set_seed(0)     

In [6]:
def Cancer_clasifier(input_shapes):
    
    X_input = Input(input_shapes)
    
    XX = Conv2D(8, kernel_size = (2,2), strides=(1, 1), padding = 'same', 
                kernel_initializer = GlorotNormal())(X_input)       
    XX = BatchNormalization(axis = -1)(XX)
    XX = Activation('relu')(XX)
    XX = MaxPooling2D(2,2)(XX)
 
    XX = Conv2D(16, kernel_size = (4,4), strides=(1, 1), padding = 'same')(XX)
    XX = BatchNormalization(axis = -1)(XX)
    XX = Activation('relu')(XX)
    XX = MaxPooling2D(2,2)(XX)
    
    # --- Output Layer
    XX = Flatten()(XX)
    
    XX = Dense(1, activation='sigmoid')(XX)
    
    model = Model(inputs = X_input, outputs = XX, name='Cancer_clasifier')
              
    return model  
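
The size of this network can be checked by hand. The sketch below recomputes the parameter counts of the convolution and batch normalization layers, which should agree with the 104, 32 and 2064 that `model.summary()` prints for them, and then derives the size of the final `Dense` layer under the assumption that `MaxPooling2D(2,2)` uses a 2×2 window, stride 2 and the default 'valid' padding. The helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels

def pooled_size(size, pool=2, stride=2):
    """Output size of max pooling with 'valid' padding."""
    return (size - pool) // stride + 1

print(conv2d_params(2, 2, 3, 8))     # first convolution
print(batchnorm_params(8))           # first batch normalization
print(conv2d_params(4, 4, 8, 16))    # second convolution

side = pooled_size(pooled_size(50))  # spatial size after both pooling layers
flat = side * side * 16              # number of inputs to the Dense layer
print(flat + 1)                      # Dense(1): one weight per input plus a bias
```

With 'same' padding the convolutions keep the spatial size, so only the two pooling layers shrink the 50×50 input.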

**Data generator**

Define the paths to the data set. Make sure you have run the 'arrange_dataset' file beforehand to create the training, validation, and testing subsets with "0" and "1" as subdirectories.

In [7]:
Path_to_traning = r"C:\Users\HP\Projects\Breast Cancer\Breast_Cancer_Data\arranged\training"
Path_to_validation = r"C:\Users\HP\Projects\Breast Cancer\Breast_Cancer_Data\arranged\validation"

print('Current working directory is')
%pwd

Current working directory is


'C:\\Users\\HP\\Projects\\Breast Cancer'

In [8]:
image_shape = (50,50,3)
train_augmentation = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.03,
        horizontal_flip = True)
val_augmentation = ImageDataGenerator(rescale=1./255)

train_generator = train_augmentation.flow_from_directory(
        Path_to_traning,
        target_size = image_shape[:2],
        color_mode="rgb",
        batch_size = BS,
        class_mode = 'binary',
        shuffle=True)

validation_generator = val_augmentation.flow_from_directory(
        Path_to_validation,
        target_size = image_shape[:2],
        color_mode="rgb",
        batch_size = BS,
        class_mode = 'binary',
        shuffle=True)    

Found 199818 images belonging to 2 classes.
Found 22201 images belonging to 2 classes.


In [9]:
# -- Check if the right classes are recognized by the DirectoryIterator 
train_generator.class_indices.keys() 

dict_keys(['0', '1'])

In [10]:
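# -- Draw every batch exactly once so that images and labels stay aligned
# -- (the generator shuffles, so two separate passes would return different orders)
batches = [next(validation_generator) for _ in range(len(validation_generator))]
x = np.concatenate([batch[0] for batch in batches])
y = np.concatenate([batch[1] for batch in batches])
print(x.shape)
print(y.shape)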

(22201, 50, 50, 3)
(22201,)


In [14]:
tf.keras.utils.to_categorical(
    y, num_classes=2, dtype='float32'
)

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

In [15]:
y

array([0., 1., 0., ..., 0., 1., 0.], dtype=float32)

**Train the model**

In each epoch the validation accuracy is checked, and if it has improved, the model is saved to an HDF5 (.h5) file.

In [16]:
# -- Create the model
my_cancer_model = Cancer_clasifier(image_shape)
# -- Print model summary
my_cancer_model.summary()

# -- Define the optimizer
opt = Adam(learning_rate=0.0001)
# -- Compile the model; a single sigmoid output with binary labels needs binary cross-entropy
my_cancer_model.compile(optimizer = opt, loss='binary_crossentropy', metrics=['accuracy'])

# create the checkpoint for the model with best accuracy on validation set
checkpoint_filepath = 'model.{epoch:02d}-{val_accuracy:.2f}.h5'
checkpoint = ModelCheckpoint(filepath = checkpoint_filepath, monitor='val_accuracy',verbose=1, 
                             save_best_only = True, save_weights_only = False, save_freq = 'epoch')
               
# -- Train the model                         
history = my_cancer_model.fit(train_generator, validation_data = validation_generator, epochs = NUM_EPOCHS, 
                              callbacks=[checkpoint], 
                              steps_per_epoch = len(train_generator),
                              validation_steps = len(validation_generator))                                                         

Model: "Cancer_clasifier"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 50, 50, 3)]       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 50, 50, 8)         104       
_________________________________________________________________
batch_normalization (BatchNo (None, 50, 50, 8)         32        
_________________________________________________________________
activation (Activation)      (None, 50, 50, 8)         0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 25, 25, 8)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 25, 25, 16)        2064      
_________________________________________________________________
batch_normalization_1 (Batch (None, 25, 25, 16)   

**Saving the model in json**

In [17]:
# serialize model to JSON
model_json = my_cancer_model.to_json()
with open("Cancer_clasifier_model.json", "w") as json_file:
    json_file.write(model_json)
    
# serialize weights to HDF5
my_cancer_model.save_weights("Cancer_clasifier_LastWeights.h5")

# Save model history with json 
hist_df = pd.DataFrame(history.history) 
# save to json:  
hist_json_file = 'Model_History.json' 
with open(hist_json_file, mode='w') as f:
    hist_df.to_json(f)

**Validate the model on the test set**

In [18]:
# Run this cell if you want to load the saved model and weights

from tensorflow.keras.models import model_from_json
with open('Cancer_clasifier_model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
# model_from_json restores the architecture only; the weights are freshly initialised
my_cancer_model = model_from_json(loaded_model_json)
# Uncomment to load trained weights from a saved checkpoint before predicting
# my_cancer_model.load_weights('model.05-0.86.h5')
image_shape = (50,50,3)

In [28]:
print('Predicting the classes on the test set and calculating the confusion matrix ..')
Path_to_test = r"C:\Users\HP\Projects\Breast Cancer\Breast_Cancer_Data\arranged\testing"

test_augmentation = ImageDataGenerator(rescale=1./255)
test_generator = test_augmentation.flow_from_directory(
        Path_to_test,
        target_size = image_shape[:2],
        color_mode="rgb",
        batch_size = BS,   
        class_mode = 'binary',
        shuffle=False)

# ---- Predictions from the model on test set
predicted_indices = my_cancer_model.predict(test_generator, verbose = 1, steps = len(test_generator))
predicted_indices[predicted_indices < 0.5] = 0
predicted_indices[predicted_indices >= 0.5] = 1  

# --- Create Confusion matrix 
cm = confusion_matrix(test_generator.classes, predicted_indices)
total = sum(sum(cm))
# --- Calculate and print acc, spec & sens
accuracy = (cm[0,0]+cm[1,1])/total
specificity = cm[0,0]/(cm[0,0]+cm[0,1])
sensitivity = cm[1,1]/(cm[1,1]+cm[1,0])


print('Confusion matrix is \n', cm)
print('Accuracy:', accuracy)
print('Sensitivity', sensitivity)
print('Specificity', specificity)

Predicting the classes on the test set and calculating the confusion matrix ..
Found 55505 images belonging to 2 classes.
Confusion matrix is 
 [[38926   810]
 [15549   220]]
Accuracy: 0.7052697955139177
Sensitivity 0.013951423679370918
Specificity 0.9796154620495269


In [29]:
Rc = cm[1,1]/(cm[1,1] + cm[1,0])
Pr = cm[1,1]/(cm[1,1] + cm[0,1])
F1 = 2*Rc*Pr/(Rc+Pr)  
print('F1 score', F1)      
print('BAC', (specificity + sensitivity)/2)

F1 score 0.026192035240192868
BAC 0.4967834428644489
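
All of these figures follow from the four entries of the confusion matrix, so they can be gathered in one helper. The sketch below assumes the layout produced by `confusion_matrix` for labels 0 and 1, that is `[[TN, FP], [FN, TP]]`. `binary_metrics` is a hypothetical function, it does not guard against empty classes, and the majority-class baseline is an extra reference value that the notebook itself does not compute.

```python
def binary_metrics(cm):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Accuracy obtained by always predicting the more frequent class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

# Confusion matrix printed above
metrics = binary_metrics([[38926, 810], [15549, 220]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing the accuracy with the majority-class baseline and looking at the balanced accuracy makes it easier to judge a result on a data set with roughly 2.5 negatives per positive.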
