**Breast Cancer Detection**
![](https://blogs.nvidia.com/wp-content/uploads/2018/01/AI_Mammographie.jpg)

***Domain Background*** : 
	Breast Cancer is the most common type of cancer in woman worldwide accounting for 20% of all cases.
    
>     In 2012 it resulted in 1.68 million new cases and 522,000 deaths.
    
One of the major problems is that women often neglect the symptoms, which could cause more adverse effects on them thus lowering the survival chances. In developed countries, the survival rate is although high, but it is an area of concern in the developing countries where the 5-year survival rates are poor. In India, there are about one million cases every year and the five-year survival of stage IV breast cancer is about 10%. Therefore it is very important to detect the signs as early as possible. 
    
>     Invasive ductal carcinoma (IDC) is the most common form of breast cancer.
   
   About 80% of all breast cancers are invasive ductal carcinomas. Doctors often do the biopsy or a scan if they detect signs of IDC. The cost of testing for breast cancer sets one back with $5000, which is a very big amount for poor families and also manual identification of presence and extent of breast cancer by a pathologist is critical. Therefore automation of detection of breast cancer using Histopathology images could reduce cost and time as well as improve the accuracy of the test. This is an active research field lot of research papers and articles are present online one that I like is -(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5453426/) as they used deep learning approach to study on histology images and achieved the sensitivity of 95 which is greater than many pathologists (~90). This shows the power of automation and how it could help in the detection of breast cancer.



***Problem Statement***: 
The idea is to use pathology test images and classify them as IDC(+) and IDC(-). Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error. The pathological tests include images of the tissues, the task is to train a computer to use these images and respond on whether the person is IDC(+) or IDC(-). Since it is a medical field problem it is important that sensitivity of the output should be high. 


***Solution Statement***:
	Our data involves images with the classes written on data file name, therefore, we would need to extract the class name from it and create a column to store them. We also need to split the dataset into the training set, validation set and testing set. Testing set for checking how good the model works on completely unseen data and validation set to check and avoid underfit or overfit, the will also help to select the best model. One hot encoding will be done in classes column so that it could work better with our model. Image processing step is also required to reduce the pixel range from 0-250 to 0-1. After it CNN model is to be used to predict the class, CNN creates an effective architecture the 2D structure of the image, therefore, it would be the best to use, considering that we are working with the images.


***Evaluation Metrics***:
	The performance of the model will be evaluated using ROC curve and confusion matrix.  A receiver operating characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). It uses the concept of true positive, true negative, false positive and false negative.
    
> * Sensitivity =                True Positive /(True Positive + False Negative)
> * Recall    =                   True Positive/(True Positive + False Negative)               .                  
> * Specificity =                True Negative /(  True Negative + False Positive)              .
> * Precision =                True Positive/ ( True Positive + False Positive)
                     
The perfect classification has the area under the ROC curve equal to 1. Therefore closer the area of our ROC curve to 1 better would be our model. The third is a confusion matrix, it is a two by two table that contains four outcomes produced by a binary classifier. Various measures, such as error-rate, accuracy, specificity, sensitivity, and precision, are derived from the confusion matrix. Sensitivity can be calculated from the confusion matrix, which is important to know when we work in the medical domain i.e how many of the patients were told about having breast cancer our of how many were actually having it.
The ROC curve and confusion matrix would be a good evaluation matrix because they both are used for binary classification and our data is also based on binary classification. These metrics could help in evaluating the model through sensitivity, specificity, recall and precision which all are important and are always considered while working in this domain with it would provide us with the visualization of the correctness of the model.



**IMPORT FILES**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os


**Local Directory**

We explore the name of the directory inside which our datafiles are present.

In [None]:
print(os.listdir("../input"))

**Data Exploration**

In data exploration we will first check the name of the files.

In [None]:
from glob import glob
Data = glob('../input/IDC_regular_ps50_idx5/**/*', recursive=True)    

In [None]:
print(Data[50])

Second step  is that we need to check whether all files are images or not

In [None]:
extention=list()
for image in Data:
    ext=image[-3:]
    if ext not in extention:
        extention.append(ext)
print((extention))

> **Code Conclusion** : We can see that there are many files along with images but we need to extract only images.

We can see that the extentions are mostly just numbers, therefore we will exclude them and check for extentions that are alphabets

In [None]:
alpha_ext=list()
for ex in extention:
    if ex.isalpha() == True:
        alpha_ext.append(ex)
print(alpha_ext)

> **Code Conclusion :**  There are only png extentions which are present in alphabets therefore it means that we have only one image extention files with *.png* extentions.

Now we need to remove all the other files that we have imported

In [None]:
Data = glob('../input/IDC_regular_ps50_idx5/**/*.png', recursive=True) 

In [None]:
print(len(Data))

> **Code Conclusion **: We have total of 277524 image files

Third step is that we need to check where dimentions of all the images are same or not

In [None]:

'''from PIL import Image
from tqdm import tqdm
dimentions=list()
for images in tqdm(Data):
    dim = Image.open(images)
    if dim not in dimentions:
        dimentions.append(dim)
print(dimentions)'''


> ***Code Conclusion : *** We can see that the dimentions of images are not equal therefore we would make it all equal .

In [None]:
import cv2
import matplotlib.pyplot as plt
def view_images(image):
    image_cv = cv2.imread(image)
    plt.imshow(cv2.cvtColor(image_cv, cv2.COLOR_BGR2RGB));
view_images(Data[52])

> ***Code Conclusion :*** We can see that images are very small, though they are cropped images, its hard for human eye to understand them without using some high costly machines. 

In [None]:
def plot_images(photos) :
    x=0
    for image in photos:
        image_cv = cv2.imread(image)
        plt.subplot(5, 5, x+1)
        plt.imshow(cv2.cvtColor(image_cv, cv2.COLOR_BGR2RGB));
        plt.axis('off');
        x+=1
plot_images(Data[:25])

Now lets look at the color ranges that our images have

In [None]:
def hist_plot(image):
    img = cv2.imread(image)
    plt.subplot(2, 2,1)
    view_images(image)
    plt.subplot(2, 2,2)
    plt.hist(img.ravel()) 
hist_plot(Data[25])
    

> ***Code Conclusion :*** From the above image we can conclude that brighter region is more than the darken region in our image.  

***Data Extraction***

Next step is we need to extract the class names in which each files belong from its file names. We will save it in output.csv file.

In [None]:
from tqdm import tqdm
import csv
Data_output=list()
Data_output.append(["Classes"])
for file_name in tqdm(Data):
    Data_output.append([file_name[-10:-4]])
with open("output.csv", "w") as f:
    writer = csv.writer(f)
    for val in Data_output:
        writer.writerows([val])

Below code reads the data from output.csv and displays it

In [None]:
from IPython.display import display # Allows the use of display() for DataFrames
data_output = pd.read_csv("output.csv")
display(data_output.head(5))
print(data_output.shape)

> *Class1* represents** IDC(+)** and* Class0* represents** IDC(-)**

**Data Visualization**

In [None]:
def class_output(images,x):
    display(data_output.loc[50])
    view_images(images)
class_output(Data[50],50) 

In [None]:
def vis_data(photos,a) :
    x=0
    beta=0
    for image in photos:
        image_cv = cv2.imread(image)
        plt.figure(figsize=(50,50))
        plt.subplot(2, 5, x+1)
        plt.title('IDC(+)'if data_output.loc[beta]== 'class1' else 'IDC(-)')
        plt.imshow(cv2.cvtColor(image_cv, cv2.COLOR_BGR2RGB));
        plt.axis('off');
        
        x+=1
        beta+=1
plot_images(Data[0:20])

In [None]:
class1 = data_output[(data_output["Classes"]=="class1" )].shape[0]
class0 = data_output[(data_output["Classes"]=="class0" )].shape[0]
objects=["class1","class0"]
y_pos = np.arange(len(objects))
count=[class1,class0]
plt.bar(y_pos, count, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Number of images')
plt.title('Class distribution')
 
plt.show()

> ***Code Conclusion :*** We have more number of images in class0 than in class1

In [None]:
percent_class1=class1/len(Data)
percent_class0=class0/len(Data)
print("Total Class1 images :",class1)
print("Total Class0 images :",class0)
print("Percent of class 0 images : ", percent_class0*100)
print("Percent of class 1 images : ", percent_class1*100)

> ***Data Processing  *** 

We would encode our output data which is present as Class1 and Class0 to 1 and 0 repectively to make it work better with our algorithms.

In [None]:
data_output=pd.get_dummies(data_output)
display(data_output.head(5))
print(data_output.shape)

Next step is that we need to split our data in train and test . Since our data is uneven we will use statify along with our train_test_split.

In [None]:
data=list()
for img in tqdm(Data):
    image_ar = cv2.imread(img)
    data.append(cv2.resize(image_ar,(50,50),interpolation=cv2.INTER_CUBIC))

In [None]:
"""%env JOBLIB_TEMP_FOLDER=/tmp
with open("output_proccess.csv", "w") as f:
    writer = csv.writer(f)
    for val in tqdm(data):
        writer.writerows([val])"""

In [None]:
"""data = pd.read_csv("output_proccess.csv")"""

In [None]:
from sklearn.model_selection import train_test_split
data=np.array(data)
X_train, X_test, Y_train, Y_test = train_test_split(data, data_output, stratify=data_output)
print("Number of train files",len(X_train))
print("Number of test files",len(X_test))
print("Number of train_target files",len(Y_train))
print("Number of  test_target  files",len(Y_test))

We also need a validation set inorder to check overfitting. We can do two things either split test set further into valid set or split train se into valid set.

We will go for spliting training set into validation set.

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, stratify=Y_train)

In [None]:
print("Number of train files",len(X_train))
print("Number of valid files",len(X_valid))
print("Number of train_target files",len(Y_train))
print("Number of  valid_target  files",len(Y_valid))
print("Number of test files",len(X_test))
print("Number of  test_target  files",len(Y_test))

> We need to now preprocess our image file. We change pixels range from 0-255 to 0-1.

In [None]:
for images in X_train:
    images=images/255.0
for images in X_test:
    images=images/255.0
for images in X_valid:
    images=images/255.0


In [None]:
print("Training Data Shape:", X_train.shape)
print("Validation Data Shape:", X_valid.shape)
print("Testing Data Shape:", X_test.shape)
print("Training Label Data Shape:", Y_train.shape)
print("Validation Label Data Shape:", Y_valid.shape)
print("Testing Label Data Shape:", Y_test.shape)

Now we have our three sets of train, valid and test. We will now create our benchmark model.

> ***BENCHMARK MODEL: *** A simple CNN model

In [42]:
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.models import Sequential

model = Sequential()
model.add(Conv2D(filters=32,kernel_size=(2,2),strides=2,padding='same',activation='relu',input_shape=(50,50,3)))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 25, 25, 32)        416       
_________________________________________________________________
flatten_3 (Flatten)          (None, 20000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 40002     
Total params: 40,418
Trainable params: 40,418
Non-trainable params: 0
_________________________________________________________________


In [43]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [44]:

from keras.callbacks import ModelCheckpoint  
checkpointer = ModelCheckpoint(filepath='weights.best.cnn.hdf5', 
                               verbose=1, save_best_only=True)
model.fit(X_train, Y_train, 
          validation_data=(X_valid, Y_valid),
          epochs=10, batch_size=32, callbacks=[checkpointer], verbose=1)

Train on 156107 samples, validate on 52036 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 11.54217, saving model to weights.best.cnn.hdf5
Epoch 2/10

Epoch 00002: val_loss did not improve from 11.54217
Epoch 3/10

Epoch 00003: val_loss did not improve from 11.54217
Epoch 4/10

Epoch 00004: val_loss did not improve from 11.54217
Epoch 5/10

Epoch 00005: val_loss did not improve from 11.54217
Epoch 6/10

Epoch 00006: val_loss did not improve from 11.54217
Epoch 7/10

Epoch 00007: val_loss did not improve from 11.54217
Epoch 8/10

Epoch 00008: val_loss did not improve from 11.54217
Epoch 9/10

Epoch 00009: val_loss did not improve from 11.54217
Epoch 10/10

Epoch 00010: val_loss did not improve from 11.54217


<keras.callbacks.History at 0x7fa73a730390>

In [45]:
model.load_weights('weights.best.cnn.hdf5')

In [46]:
predictions = [np.argmax(model.predict(np.expand_dims(feature, axis=0))) for feature in tqdm(X_test)]

100%|██████████| 69381/69381 [01:32<00:00, 747.04it/s]


In [47]:
test_Y=list()
class0=Y_test['Classes_class0'].values.tolist()
class1=Y_test['Classes_class1'].values.tolist()
for a,b in zip(class0,class1):
    test_Y.append((a,b))
print(np.argmax(test_Y, axis=1))

[1 0 0 ... 0 1 0]


In [48]:
test_accuracy = 100*np.sum(np.array(predictions)==np.argmax(test_Y, axis=1))/len(predictions)
print('Test accuracy: %.4f%%' % test_accuracy)

Test accuracy: 28.3882%
