# [MAYO] Simple CNN 
This competition asks for a model that classifies the origin of blood clots in ischemic stroke. More specifically, the goal is to distinguish the two major acute ischemic stroke (AIS) etiology subtypes, cardiac (CE) and large artery atherosclerosis (LAA), using whole-slide digital pathology images.  
The evaluation logic looks quite difficult, but put simply, the aim is to predict the probability of CE and of LAA for each patient. As a starting point, I'd like to build a simple model using a Convolutional Neural Network (CNN).
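
To make the scoring idea a little more concrete, here is a minimal sketch of a class-weighted log loss. It assumes the metric averages the negative log probability of the true class within each class and then takes a weighted average over the classes, with rows rescaled by their sum first. The equal weights, the clipping constant `eps`, and the function name are assumptions for illustration; the competition page remains the authoritative definition.

```python
import math

def weighted_log_loss(y_true, probs, weights=(1.0, 1.0), eps=1e-15):
    """Class-weighted log loss over rows of [p_CE, p_LAA] (illustrative sketch).

    y_true  : list of column indices of the true class (0 = CE, 1 = LAA here;
              note this is the submission column order, not the 1/0 target
              encoding used for training)
    probs   : non-empty list of [p_CE, p_LAA] rows with a positive row sum
    weights : per-class weights (equal weights are an assumption)
    """
    class_sums = [0.0] * len(weights)
    class_counts = [0] * len(weights)
    for label, row in zip(y_true, probs):
        p = row[label] / sum(row)           # rescale the row to sum to one
        p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
        class_sums[label] += -math.log(p)
        class_counts[label] += 1

    numerator, denominator = 0.0, 0.0
    for w, s, c in zip(weights, class_sums, class_counts):
        if c > 0:                           # skip classes with no samples
            numerator += w * s / c          # mean loss within the class
            denominator += w
    return numerator / denominator

# An uninformative prediction of 0.5 / 0.5 scores ln(2) under this sketch.
print(weighted_log_loss([0, 1, 1], [[0.5, 0.5]] * 3))
```

Under this form, each class contributes equally regardless of how many images it has, so a model cannot score well by always predicting the majority class.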


In [None]:
import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.layers import GlobalMaxPooling2D
import openslide
from openslide import OpenSlide
import cv2 

# 1. Read Data

Let's read the train and test data. The images are stored in separate folders, so I add a `file_path` column that points to each image file.  
The `label` column (CE or LAA) is the prediction target, so it is encoded as 1 (CE) or 0 (LAA) in a new `target` column.  
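
The two steps above can be sketched in plain Python as follows. The helper names `image_path` and `encode_label` and the image id `abc123_0` are hypothetical; only the folder layout and the CE = 1 / LAA = 0 encoding come from this notebook.

```python
from pathlib import PurePosixPath

# Folder layout used in this notebook
DATA_ROOT = PurePosixPath("../input/mayo-clinic-strip-ai")
LABEL_TO_TARGET = {"CE": 1, "LAA": 0}

def image_path(image_id, split):
    """Build the path to one .tif image; split is 'train' or 'test'."""
    return str(DATA_ROOT / split / f"{image_id}.tif")

def encode_label(label):
    """Map 'CE' -> 1 and 'LAA' -> 0, rejecting anything else."""
    if label not in LABEL_TO_TARGET:
        raise ValueError(f"unexpected label: {label!r}")
    return LABEL_TO_TARGET[label]

# 'abc123_0' is a made-up image id used only for illustration
print(image_path("abc123_0", "train"))
print(encode_label("CE"), encode_label("LAA"))
```

Unlike a bare `1 if x == "CE" else 0`, this version raises on an unexpected label instead of silently mapping it to 0, which can help catch data issues early.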


In [None]:
train_df = pd.read_csv('../input/mayo-clinic-strip-ai/train.csv')
test_df  = pd.read_csv('../input/mayo-clinic-strip-ai/test.csv')

In [None]:
train_df.head()

In [None]:
train_df["file_path"] = train_df["image_id"].apply(lambda x: "../input/mayo-clinic-strip-ai/train/" + x + ".tif")
test_df["file_path"]  = test_df["image_id"].apply(lambda x: "../input/mayo-clinic-strip-ai/test/" + x + ".tif")

In [None]:
train_df["target"] = train_df["label"].apply(lambda x : 1 if x=="CE" else 0)

In [None]:
train_df.head()

# 2. Quick look at Image of CE and LAA

Let's take a look at some actual images. The first 2 records are CE and the next 2 records are LAA.  
Hmmm... I don't see any specific difference between CE and LAA...


In [None]:
%%time
j = 4
sample_train = train_df[j:j+1]

# cv2.imread returns pixels in BGR order
img = cv2.imread(sample_train.loc[j, "file_path"])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB so matplotlib shows true colors
print('The size of the image is ' + str(img.shape))
plt.figure(figsize=(8, 8))
plt.imshow(img)
plt.show()

In [None]:
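image_resized = tf.image.resize(img, (512, 512), method=tf.image.ResizeMethod.LANCZOS5)
print('The size of the image is ' + str(image_resized.shape))
plt.figure(figsize=(8, 8))
# tf.image.resize returns float32 values in the 0-255 range; convert to uint8 for display
plt.imshow(np.clip(image_resized.numpy(), 0, 255).astype(np.uint8))
plt.show()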

In [None]:
%%time
sample_train = train_df[:4]

for i in range(1):
    slide = OpenSlide(sample_train.loc[i, "file_path"])
    location = (1000, 1000)  # top-left corner of the region, in level-0 pixels
    size = (6000, 6000)      # width and height of the region
    region = slide.read_region(location, 0, size)
    plt.figure(figsize=(8, 8))
    plt.imshow(region)

# 3. Image data preprocessing for CNN

The image pixels are converted to NumPy arrays for CNN processing.  
As section 2 shows, reading each image is slow, so only a fixed region of each slide is fed to the CNN (10,000 x 10,000 pixels in the code below), which means part of the available image data is not used for training. Given that limit, I would like to read as much meaningful data as possible. The top-left portion of each image tends to be blank, so starting the read at an offset such as (1000, 1000), as in section 2, may capture more tissue; the `region` variable in `preprocess` controls the starting position.  
The image data is also resized to 512 x 512 to avoid out-of-memory errors.  
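
A quick back-of-the-envelope calculation shows why the resize matters. This is a rough sketch that assumes the region is read as 8-bit RGBA and that the resized arrays are stored as 3-channel float32; the image count `n_images` is a hypothetical round number, not the actual size of the training set.

```python
def array_bytes(height, width, channels, bytes_per_value):
    """Memory needed for a dense image array, in bytes."""
    return height * width * channels * bytes_per_value

# One 10,000 x 10,000 region as returned by read_region (RGBA, 1 byte per value)
full_region = array_bytes(10_000, 10_000, 4, 1)

# One resized 512 x 512 RGB image stored as float32 (4 bytes per value, assumed)
resized = array_bytes(512, 512, 3, 4)

n_images = 1_000  # hypothetical number of images held in memory at once
print(f"full region : {full_region / 1e6:.0f} MB per image")
print(f"resized     : {resized / 1e6:.1f} MB per image")
print(f"{n_images} resized images: {n_images * resized / 1e9:.1f} GB")
```

Only one full-size region needs to be in memory at a time, while all resized images are kept in `x_train`, so the resized size is what drives total memory use.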


In [None]:
%%time
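def preprocess(image_path):
    """Read a fixed region of a slide and return it as a 512x512 array scaled by 1/255."""
    slide = OpenSlide(image_path)
    region = (0, 0)          # top-left corner of the region, in level-0 pixels
    size = (10000, 10000)    # width and height of the region to read
    image = slide.read_region(region, 0, size)  # RGBA PIL image
    slide.close()
    #print('image size is ' + str(image.size))
    image = tf.image.resize(np.asarray(image), (512, 512), method=tf.image.ResizeMethod.LANCZOS5)
    image = np.array(image) / 255.0  # Normalization
    return image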

x_train=[]
#num_img = 100
#counter = 0
#for i in tqdm(train_df['file_path'],total = num_img):
for i in tqdm(train_df['file_path']):
    x1=preprocess(i)
    #print(x1.shape)
    x_train.append(x1[:,:,0:3]) # Keep RGB only; the fourth (alpha) channel is not needed
#     counter += 1
#     if (counter == num_img):
#         break
    

# 4. CNN Modelling

The Convolutional Neural Network (CNN) is built from `Conv2D` layers with 3x3 kernels and a stride of 2, followed by dense layers with dropout. There is plenty of room to improve the model, for example by tuning the number of layers or filters or by adding pooling layers. This is just a starter.
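
To get a feel for the size of this architecture, the sketch below works through the feature-map sizes and parameter counts by hand. It assumes the layer settings used in the next cell (three stride-2 convolutions with `padding='same'` and 32, 64, 32 filters, then dense layers of 64, 32, 16 and 1 units); the helper names are made up for illustration.

```python
import math

def conv_same_out(size, stride):
    """Spatial output size of a convolution with padding='same'."""
    return math.ceil(size / stride)

def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus biases of a Dense layer."""
    return n_in * n_out + n_out

side, channels = 512, 3          # input is 512 x 512 RGB
total = 0
for filters in (32, 64, 32):     # three 3x3 convolutions with stride 2
    total += conv_params(3, channels, filters)
    side = conv_same_out(side, 2)
    channels = filters

flat = side * side * channels    # size of the Flatten output
n_in = flat
for units in (64, 32, 16, 1):    # dense head (Dropout adds no parameters)
    total += dense_params(n_in, units)
    n_in = units

print("feature map side after convolutions:", side)
print("flattened features:", flat)
print("parameters in first Dense layer:", dense_params(flat, 64))
print("total parameters:", total)
```

Nearly all parameters sit in the first dense layer after `Flatten`, so pooling before flattening (for instance with the `GlobalMaxPooling2D` layer imported above) is one way to shrink the model considerably.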


In [None]:
model = Sequential()
input_shape = (512, 512, 3)

model.add(Conv2D(filters=32, kernel_size = (3,3), strides =2, padding = 'same', activation = 'relu', input_shape = input_shape))
#model.add(tf.keras.layers.MaxPooling2D(2, 2))
model.add(Conv2D(filters=64, kernel_size = (3,3), strides =2, padding = 'same', activation = 'relu'))
#model.add(Conv2D(filters=128, kernel_size = (3,3), strides =2, padding = 'same', activation = 'relu'))
#model.add(Conv2D(filters=64, kernel_size = (3,3), strides =2, padding = 'same', activation = 'relu'))
#model.add(tf.keras.layers.MaxPooling2D(2, 2))
model.add(Conv2D(filters=32, kernel_size = (3,3), strides =2, padding = 'same', activation = 'relu'))
model.add(Flatten())
#model.add(Dense(128, activation = 'relu'))
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(16, activation = 'relu'))
model.add(Dropout(0.5))


model.add(Dense(1))

model.compile(
    loss = tf.keras.losses.MeanSquaredError(),    
    metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse"),tf.keras.metrics.BinaryAccuracy(name="accuracy")],
    optimizer = tf.keras.optimizers.Adam(1e-4))

In [None]:
x_train=np.array(x_train)
#y_train=train_df['target'][0:num_img]
y_train=train_df['target']

x_train,x_test,y_train,y_test=train_test_split(x_train,y_train,test_size=0.2)

In [None]:
%%time

import math
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping, Callback, ReduceLROnPlateau, ModelCheckpoint

# def step_decay(epoch):
#     initial_lrate = 0.001
#     drop = 0.5
#     epochs_drop = 10.0
#     lrate = initial_lrate * math.pow(drop, math.floor((epoch)/epochs_drop))
#     return lrate

#lrate = LearningRateScheduler(step_decay)
# earstop = EarlyStopping(monitor = 'val_loss', min_delta = 0, patience = 5)
lrate = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience= 10,
    verbose=1,
    mode='auto',
    min_delta=0.0001,
    cooldown=0,
    min_lr=0.00000001,
)

model_checkpoint_callback = ModelCheckpoint(
    filepath='/kaggle/working/best.weights.h5',  # must be a file path, not a directory
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

history = model.fit(
    x_train,
    y_train,
    epochs = 200,
    batch_size=32,
    validation_data = (x_test,y_test),
    verbose = 1,
    callbacks = [lrate, model_checkpoint_callback]
)

In [None]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history[f'val_{metric}'])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, f'val_{metric}'])
    plt.savefig(str(metric) +'.png')
    plt.show()
    
plot_graphs(history, "rmse")
#plt.savefig('accuracy.png')
plot_graphs(history, "loss")
#plt.savefig('loss.png')
plot_graphs(history, "accuracy")

In [None]:
#del train_df, x_train
gc.collect()

# 5. Predict and Submission

This competition is not framed strictly as a binary classification of CE versus LAA: the submission needs a separate probability for each class, and the competition host states that the two probabilities do not have to sum to one. Even so, I compute the LAA probability as 1 - CE probability, since I intend to keep the model as simple as possible.  
> The submitted probabilities for a given image are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum)

FYI, if you simply put `cnn_pred` into the sample submission file, it looks fine in the notebook but causes an error when you submit it, because the actual test data contains multiple images per patient. Hence, `groupby.mean()` is used here to produce one record per patient.
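
The aggregation and the row rescaling mentioned in the quote can be illustrated in plain Python. This is a sketch with made-up patient ids and predictions; `aggregate_by_patient` and `rescale_row` are hypothetical helpers, and the actual cells below use pandas instead.

```python
from collections import defaultdict

def clip01(x):
    """Clip a raw model output into the [0, 1] range."""
    return min(max(x, 0.0), 1.0)

def aggregate_by_patient(patient_ids, ce_preds):
    """Average clipped CE predictions per patient and derive LAA = 1 - CE."""
    grouped = defaultdict(list)
    for pid, pred in zip(patient_ids, ce_preds):
        grouped[pid].append(clip01(pred))
    rows = {}
    for pid, values in grouped.items():
        ce = sum(values) / len(values)
        rows[pid] = (ce, 1.0 - ce)
    return rows

def rescale_row(row):
    """Divide each probability by the row sum, as the scoring is said to do."""
    total = sum(row)
    return tuple(v / total for v in row)

# Made-up example: patient 'p1' has two images, patient 'p2' has one
rows = aggregate_by_patient(["p1", "p1", "p2"], [0.25, 0.75, 1.5])
print(rows)
print(rescale_row((0.6, 0.6)))
```

Because LAA is defined as 1 - CE, every row here already sums to one, so the rescaling leaves these rows unchanged. Taking the mean of CE per patient and then computing 1 - CE gives the same result as averaging both columns.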


In [None]:
test1=[]
for i in test_df['file_path']:
    x1=preprocess(i)
    test1.append(x1[:,:,0:3])
test1=np.array(test1)

cnn_pred=model.predict(test1)

In [None]:
cnn_pred

In [None]:
sub = pd.DataFrame(test_df["patient_id"].copy())
# The model output is unbounded, so clip it into [0, 1] before using it as a probability
sub["CE"] = np.clip(cnn_pred.ravel(), 0, 1)
sub["LAA"] = 1 - sub["CE"]

sub = sub.groupby("patient_id").mean()
sub = sub[["CE", "LAA"]].round(6).reset_index()
sub

In [None]:
sub.to_csv("submission.csv", index = False)
!head submission.csv