# Introduction

#### Description:

In this competition, you must create an algorithm to identify metastatic cancer in small image patches taken from larger digital pathology scans. The data for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset. The original PCam dataset contains duplicate images due to its probabilistic sampling; the version presented on Kaggle does not contain duplicates.

#### Approach:

For this project, I want to compare the performance of two pre-trained models, VGG-16 and EfficientNet, against a custom model that consists of 4 sets of 5 convolution layers (20 in total). The custom model is based on VGG but has more convolution layers than VGG-16 (13 convolution layers) or even the newer VGG-19 (16 convolution layers). I will train all three models with the RMSProp optimizer, then retrain the most accurate one with the Adam optimizer and compare the results.
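
To make the size of the custom model concrete, here is a rough sketch that counts its convolution layers and tracks how the feature map shrinks. It assumes each set of convolutions uses 'same' padding and is followed by a 2 x 2 max-pooling layer, as in VGG; the pooling placement and the helper name `feature_map_sizes` are illustrative assumptions, not details taken from the final model.

```python
# Assumed layout: 4 blocks of 5 convolutions, each block followed by 2x2 max pooling.
NUM_BLOCKS = 4
CONVS_PER_BLOCK = 5
INPUT_SIZE = 96  # the patches in this dataset are 96 x 96 px


def feature_map_sizes(input_size, num_blocks, pool=2):
    """Return the spatial size after each pooling stage ('same' padding assumed)."""
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size //= pool  # each pooling stage halves height and width
        sizes.append(size)
    return sizes


total_convs = NUM_BLOCKS * CONVS_PER_BLOCK
sizes = feature_map_sizes(INPUT_SIZE, NUM_BLOCKS)

print(total_convs)  # 20
print(sizes)        # [48, 24, 12, 6]
```

Under these assumptions the network ends with a 6 x 6 feature map before the dense layers, so four pooling stages are about as many as a 96 px input comfortably allows.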



In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.image import imread

from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
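# Import everything from tensorflow.keras; mixing it with the standalone keras package can cause incompatibilities
from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import regularizers, optimizers
from tensorflow.keras.layers import PReLU
from tensorflow.keras.initializers import Constant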
from tensorflow.keras.preprocessing.image import ImageDataGenerator


The images in this dataset are 96 x 96 px and come pre-divided into training and test sets.
The input shape is (96, 96, 3): these are RGB images, and the depth of 3 corresponds to the three color channels.
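
As a quick illustration of that shape, the sketch below builds a synthetic stand-in for one patch and turns it into a batch of one. The random pixel values, the uint8 pixel type, and the rescaling to [0, 1] are assumptions for demonstration only; real patches would be read from the dataset files.

```python
import numpy as np

IMG_SIZE = 96
CHANNELS = 3

# Synthetic stand-in for one RGB patch (values 0..255), not real data.
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE, CHANNELS), dtype=np.uint8)

# Add a leading batch axis and rescale pixel values to [0, 1].
batch = patch[np.newaxis, ...].astype(np.float32) / 255.0

print(patch.shape)  # (96, 96, 3)
print(batch.shape)  # (1, 96, 96, 3)
```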



#### Step 1: Input Verification

In this problem we are building a model that detects the presence of cancer in a given sample, which makes it a binary classification task.
If the inputs were simple tabular or text features, logistic regression or a similar algorithm might have been enough. Here we are evaluating images, where the spatial arrangement of pixels matters, so we use a Convolutional Neural Network (CNN) instead.

First, we need to check whether the input dataset is balanced, i.e. whether the number of positive samples is roughly equal to the number of negative samples. If it is unbalanced, we need to rebalance it.


In [5]:
df_training_classes = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv', dtype=str)
df_training_classes.shape

(220025, 2)
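
One possible way to run the balance check described above is sketched here on a small toy frame rather than on the real labels. The column names `id` and `label`, the toy values, and the choice of downsampling the larger class are assumptions for illustration; other rebalancing strategies, such as class weights, would work as well.

```python
import pandas as pd

# Toy labels frame; the column names 'id' and 'label' are assumed here.
toy = pd.DataFrame({
    "id": [f"img_{i}" for i in range(10)],
    "label": ["0"] * 6 + ["1"] * 4,
})

# Count samples per class to see whether the classes are balanced.
counts = toy["label"].value_counts()
print(counts.to_dict())

# Downsample every class to the size of the smallest one.
n_min = int(counts.min())
balanced = pd.concat(
    [group.sample(n=n_min, random_state=42) for _, group in toy.groupby("label")]
).reset_index(drop=True)

print(balanced["label"].value_counts().to_dict())
```

The same two steps could be applied to `df_training_classes` once its column names are confirmed.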