Brandon O’Neill<br>
CSPB 3202<br>
August 3, 2020<br>
HW5 Kaggle Competition<br>
# Sources
I drew on a variety of sources found throughout the competition for the Kaggle setup, the image loading, and other aspects of the Keras training.

# Introduction
This is a binary image classification problem: decide whether or not an image shows signs of cancer. More specifically, we predict the probability that the center 32x32 px region of the image contains at least one pixel of tumor tissue. The data has around 220,000 labeled images for training and around 55,000 images for testing.
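
To make the labeled region concrete, here is a minimal sketch that slices the central 32x32 block out of a 96x96 image array. The helper name `center_crop` and the all-zero placeholder image are assumptions for illustration and are not part of the notebook; the model in this report is still trained on the full 96x96 patch.

```python
import numpy as np

def center_crop(image, size=32):
    """Return the central size x size region of an H x W x C image array."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

# Placeholder 96x96 RGB image (all zeros) standing in for a real patch.
patch = np.zeros((96, 96, 3), dtype=np.uint8)
center = center_crop(patch)
print(center.shape)  # (32, 32, 3)
```

For a 96x96 image the crop covers rows and columns 32 through 63.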

In [None]:
from glob import glob 
import numpy as np
import pandas as pd
import keras,cv2,os
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Activation
from keras.layers import Conv2D, MaxPool2D
import matplotlib.pyplot as plt


# Exploratory Data Analysis
First I am going to load the data and take a sample of 1000 photos for exploratory testing.

In [None]:

path = "../input/"
train = path + 'train/'
test = path + 'test/'

df = pd.DataFrame({'path': glob(os.path.join(train,'*.tif'))})
df['id'] = df.path.map(lambda x: x.split('/')[3].split(".")[0])
labels = pd.read_csv(path+"train_labels.csv")
df = df.merge(labels, on = "id")

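def ld(N,df):
    # Preallocate the image array: N images of 96x96 pixels with 3 channels
    X = np.zeros([N,96,96,3],dtype=np.uint8)
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    y = df['label'].to_numpy()[0:N]
    for i, row in df.iterrows():
        if i == N:
            break
        X[i] = cv2.imread(row['path'])  # cv2 returns channels in BGR order
    return X,y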
n = 1000
X,y = ld(n, df)

Then I will check the distribution of the data.

In [None]:
fig = plt.figure()
# Bar 0 is the count of label 0 (no cancer), bar 1 the count of label 1 (cancer)
plt.bar([0,1], [(y==0).sum(), (y==1).sum()])
plt.xticks([0,1],["No Cancer","Cancer"])
plt.ylabel("# of images")

From the above bar chart, we can see that there is a nearly 60/40 split between No Cancer and Cancer in the sampled data. <br>
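
The same split can be read off as numbers rather than from the plot. This is a small sketch on made-up labels; `y_demo` is a hypothetical stand-in for the `y` array returned by `ld`.

```python
import numpy as np

# Made-up labels for illustration only; in the notebook `y` comes from ld().
y_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# counts[0] = images labeled no cancer, counts[1] = images labeled cancer
counts = np.bincount(y_demo, minlength=2)
fractions = counts / counts.sum()
print(counts, fractions)
```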

Next I will print some images with the corresponding labels so we get an idea of what potential characteristics the model will be looking at. The 0 indicates no cancer while the 1 indicates cancer.

In [None]:
fig = plt.figure()
for pic,i in enumerate(np.random.randint(0,1000,4)):
    a = fig.add_subplot(2, 4, pic+1, xticks=[], yticks=[])
    plt.imshow(X[i][..., ::-1])  # cv2 loads BGR; reverse the channels to RGB for display
    a.set_title('Label: ' + str(y[i]))

There are some noticeable differences in terms of color and brightness. In this small sample, the images without cancer tend to be brighter than the images with cancer.
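
One way to check an impression like this is to compare the average pixel intensity per class. The sketch below is an assumption about how that could be done, not something the notebook ran; the function name `mean_brightness_by_label` and the synthetic images are invented for illustration, so the printed numbers say nothing about the real data.

```python
import numpy as np

def mean_brightness_by_label(images, labels):
    """Average pixel intensity (0-255) of the images in each class."""
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels)
    # One mean intensity per image, averaged over all pixels and channels
    per_image = images.reshape(len(images), -1).mean(axis=1)
    return {int(c): float(per_image[labels == c].mean()) for c in np.unique(labels)}

# Synthetic stand-in data: two bright "label 0" images, two darker "label 1" images.
X_demo = np.stack([np.full((96, 96, 3), v, dtype=np.uint8) for v in (200, 220, 120, 100)])
y_demo = np.array([0, 0, 1, 1])
brightness = mean_brightness_by_label(X_demo, y_demo)
print(brightness)
```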

# The Model Architecture
The model I will be using is a convolutional neural network (CNN) built with Keras, a Python deep learning library. The network stacks three blocks, each made of convolution layers with batch normalization followed by max pooling and dropout, and ends with a dense layer and a single sigmoid output. A network like this suits the problem because it can learn what cancer cells look like directly from the pixels and then output a probability, without manually engineered features.

In [None]:

N = df["path"].size
X,y = ld(N=N,df=df)

In [None]:
training_portion = 0.8 # Specify training/validation ratio
split_idx = int(np.round(training_portion * y.shape[0])) #Compute split idx
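
The split index divides the arrays into a training portion (rows before `split_idx`) and a validation portion (rows from `split_idx` onward). The sketch below shows that slicing on toy arrays. The names `X_demo`, `y_demo`, `X_train`, and `X_val` are illustrative assumptions, and the sketch assumes the rows are already in random order.

```python
import numpy as np

training_portion = 0.8
# Toy arrays standing in for the notebook's X and y.
X_demo = np.arange(10).reshape(10, 1)
y_demo = np.arange(10) % 2

split_idx = int(np.round(training_portion * y_demo.shape[0]))
X_train, y_train = X_demo[:split_idx], y_demo[:split_idx]   # first 80% of rows
X_val, y_val = X_demo[split_idx:], y_demo[split_idx:]       # remaining 20%
print(len(X_train), len(X_val))  # 8 2
```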


In [None]:
# these model numbers and setup were sourced from https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-10min-0-925-lb 

kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128
dropout_conv = 0.3
dropout_dense = 0.5

model = Sequential()


model.add(Conv2D(first_filters, kernel_size, input_shape = (96, 96, 3)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(first_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(second_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Conv2D(third_filters, kernel_size, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPool2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

# Flatten the 3D feature maps into a vector before the fully connected layers
model.add(Flatten())
model.add(Dense(256, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(Dropout(dropout_dense))

model.add(Dense(1, activation = "sigmoid"))
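
To see how the image shrinks through the network, the sketch below traces the spatial size through the three convolution/pooling blocks. It assumes the Keras defaults used above: unpadded ("valid") stride-1 convolutions and non-overlapping 2x2 max pooling. The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' (unpadded), stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

size = 96
sizes = []
for _ in range(3):                      # three conv-conv-pool blocks
    size = conv_out(conv_out(size))     # two 3x3 convolutions
    size = pool_out(size)               # one 2x2 max pool
    sizes.append(size)

flattened = size * size * 128           # 128 = third_filters
print(sizes, flattened)  # [46, 21, 8] 8192
```

Under these assumptions the last block produces an 8x8 map with 128 channels, which is 8192 values once flattened.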

Now that the model is defined, I will train it for 3 epochs. This is similar to what we learned in class.

In [None]:
batch_size = 50

model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adam(0.001), 
              metrics=['accuracy'])
epochs = 3
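# Number of full batches that fit in the training portion (rows before split_idx)
iterations = split_idx // batch_size
for epoch in range(epochs):
    for i in range(iterations):
        st = i * batch_size
        xB = X[st:st+batch_size]
        yB = y[st:st+batch_size]
        # One gradient update on this batch; returns [loss, accuracy]
        metrics = model.train_on_batch(xB, yB)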
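
The training loop walks through the training portion in consecutive slices of one batch each and drops any remainder that does not fill a full batch. The sketch below shows only the index arithmetic; the generator name `batch_slices` and the toy sizes are assumptions for illustration.

```python
def batch_slices(n_train, batch_size):
    """Yield (start, stop) index pairs for each full mini-batch."""
    for i in range(n_train // batch_size):
        start = i * batch_size
        yield start, start + batch_size

slices = list(batch_slices(n_train=230, batch_size=50))
print(slices)  # [(0, 50), (50, 100), (100, 150), (150, 200)]
```

In this toy case the last 30 rows are never used within an epoch, because 230 is not a multiple of 50.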



Finally, I run the trained model on the test data and write the predictions to a file for submission.

In [None]:
main = path + 'test/'
test_files = glob(os.path.join(main,'*.tif'))
sub = pd.DataFrame()
bt = 5000  # number of test images to load and predict per chunk
maxIn = len(test_files)
for i in range(0, maxIn, bt):
    test_df = pd.DataFrame({'path': test_files[i:i+bt]})
    test_df['id'] = test_df.path.map(lambda x: x.split('/')[3].split(".")[0])
    test_df['image'] = test_df['path'].map(cv2.imread)
    k = np.stack(test_df["image"].values)  # shape: (chunk size, 96, 96, 3)
    predictions = model.predict(k,verbose = 1)
    test_df['label'] = predictions.ravel()  # flatten (n, 1) output to one value per image
    sub = pd.concat([sub, test_df[["id", "label"]]])
    
sub.to_csv("testResults.csv", index = False, header = True)
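
The `id` column is built by splitting each path on `/` and taking the fourth piece, which only works while the files sit exactly three directories deep. A sketch of an alternative that does not depend on the directory depth is shown below; the helper name `image_id` and the example file name are hypothetical.

```python
import os

def image_id(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]

example_id = image_id("../input/test/abc123.tif")
print(example_id)  # abc123
```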