This notebook is for study purposes. I am following the Medium post
[Predicting Invasive Ductal Carcinoma using Convolutional Neural Network (CNN) in Keras](https://towardsdatascience.com/predicting-invasive-ductal-carcinoma-using-convolutional-neural-network-cnn-in-keras-debb429de9a6) and its accompanying [GitHub repository](https://github.com/bikramb98/invasive-ductal-carcinoma-cnn/blob/master/Predicting%20Invasive%20Ductal%20Carcinoma%20using%20CNN%20in%20Keras.ipynb) by Bikram Baruah. The dataset used can be found [here](http://www.andrewjanowczyk.com/use-case-6-invasive-ductal-carcinoma-idc-segmentation/).

### Loading the dataset:

---



In [1]:
from glob import glob
import fnmatch
import cv2

<span style='color:blue'>
    <b>Study Notes:</b></span>
    
glob module: [doc]( https://docs.python.org/3/library/glob.html)
and [source](https://github.com/python/cpython/blob/3.8/Lib/glob.py)<br>
fnmatch module: [doc](https://docs.python.org/3/library/fnmatch.html#fnmatch.fnmatch) and [source](https://github.com/python/cpython/blob/3.8/Lib/fnmatch.py)


In [2]:
# Saves pathnames
image_patches = glob('../idc_regular_ps50_idx5/IDC_regular_ps50_idx5/*/*/*.png')

pattern_zero = '*class0.png'
pattern_one = '*class1.png'

<span style='color:blue'>
    <b>Study Notes:</b></span>
    
The `glob` function retrieves the pathnames of all files that match a given pattern. The `pattern_zero` and `pattern_one` variables are strings that will be used to select files by name: in this case, files ending with either 'class0.png' or 'class1.png'.

In [3]:
print(f'Length: {len(image_patches)} and example of path: {image_patches[1]}')

Length: 277524 and example of path: ../idc_regular_ps50_idx5/IDC_regular_ps50_idx5/12823/0/12823_idx5_x1501_y2351_class0.png


In [4]:
# Saves the image file location of all images according to its class (0 or 1)
class_zero = fnmatch.filter(image_patches, pattern_zero)
class_one = fnmatch.filter(image_patches, pattern_one)

<span style='color:blue'>
    <b>Study Notes:</b></span>

Here we're filtering the list of pathnames by class, so each list holds the paths of one class only.
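
To check my understanding of `glob` and `fnmatch.filter`, here is a minimal sketch that runs on a throwaway directory instead of the real dataset. The patient id `10253` and the file names are made up for the example; only the folder layout mimics the dataset.

```python
import fnmatch
import tempfile
from glob import glob
from pathlib import Path

# Build a tiny temporary tree shaped like <patient_id>/<class>/<file>.png.
# The patient id and file names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    for label in ("0", "1"):
        folder = Path(root) / "10253" / label
        folder.mkdir(parents=True)
        for x in (1, 51):
            (folder / f"10253_idx5_x{x}_y1_class{label}.png").touch()

    # glob expands the wildcards and returns every matching pathname.
    paths = sorted(glob(f"{root}/*/*/*.png"))

    # fnmatch.filter keeps only the pathnames that match a pattern.
    zeros = fnmatch.filter(paths, "*class0.png")
    ones = fnmatch.filter(paths, "*class1.png")

print(len(paths), len(zeros), len(ones))
```

Every path ends up in exactly one of the two filtered lists, which is what the labelling step later relies on.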


In [5]:
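def process_images(lower_index, upper_index):
    height, width = 50, 50

    # Sets give fast membership tests; the path lists are large
    zero_paths, one_paths = set(class_zero), set(class_one)

    x = []  # list to store image data
    y = []  # list to store labels

    for img in image_patches[lower_index:upper_index]:
        # Decide the label first so x and y always stay the same length
        if img in zero_paths:
            label = 0
        elif img in one_paths:
            label = 1
        else:
            continue  # skip files that match neither class pattern

        full_size_image = cv2.imread(img)  # loads the image as a BGR array
        image = cv2.resize(full_size_image, (width, height), interpolation=cv2.INTER_CUBIC)

        x.append(image)
        y.append(label)

    return x, y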

<span style='color:blue'>
    <b>Study Notes:</b></span>
    
I still need to look further into cv2 (OpenCV). From what I've gathered, `imread` loads an image from a file, as the name suggests. `INTER_CUBIC` is one of the interpolation methods that `resize` can use (bicubic interpolation). I found a little about it [here](https://chadrick-kwag.net/cv2-resize-interpolation-methods/), but I need to read up on it properly.

<span style='color:red'>Must come back later to this note.</span>
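
Until I read more, here is a conceptual sketch of what "interpolation" means when resizing: new pixel values are estimated from nearby original pixels. It uses plain NumPy on a single row of made-up intensities and shows nearest-neighbour and linear interpolation only. It is not how OpenCV implements `resize`, and bicubic interpolation (`INTER_CUBIC`) goes further by weighting a 4x4 neighbourhood of pixels.

```python
import numpy as np

# One row of three pixel intensities that we want to stretch to five samples.
row = np.array([0.0, 100.0, 200.0])
old_positions = np.arange(len(row))                   # 0, 1, 2
new_positions = np.linspace(0, len(row) - 1, num=5)   # 0, 0.5, 1, 1.5, 2

# Nearest neighbour: copy the value of a closest original pixel.
nearest = row[np.rint(new_positions).astype(int)]

# Linear: blend the two neighbouring pixels according to distance.
linear = np.interp(new_positions, old_positions, row)

print(nearest)
print(linear)
```

The linear result fills the gaps with in-between values, while nearest neighbour only repeats existing ones. Smoother methods such as bicubic trade extra computation for fewer blocky artifacts.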

In [6]:
X, Y = process_images(0, 60000)

### Data preprocessing
***

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split

<span style='color:blue'>
    <b>Study Notes:</b></span>
    
More about train_test_split [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). It is fairly self-explanatory: it splits the arrays into random train and test subsets. The `test_size` parameter sets the proportion of the dataset that goes into the test subset, and `train_size` works the same way for the train subset. There are other interesting parameters to try out later.

<span style='color:red'>Must come back later to this note.</span>
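
To make the idea concrete, here is a hand-rolled NumPy sketch of a random split controlled by a `test_size` fraction. The `random_split` helper and the toy data are my own illustration, not scikit-learn's implementation.

```python
import numpy as np

def random_split(features, labels, test_size=0.15, seed=0):
    """Shuffle the sample indices once and use the first `test_size` share as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(test_size * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

features = np.arange(40).reshape(20, 2)   # 20 toy samples with 2 features each
labels = np.array([0] * 14 + [1] * 6)     # imbalanced toy labels

f_train, f_test, l_train, l_test = random_split(features, labels)
print(f_train.shape, f_test.shape)
```

In scikit-learn itself, `train_test_split` also accepts `random_state` to make the shuffle reproducible and `stratify` to keep the class proportions the same in both subsets, which may be worth trying here because the classes are imbalanced.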



In [8]:
X = np.array(X)
X = X.astype(np.float32)
X /= 255. # Ensures values between 0 and 1

In [9]:
# Splits training and test set, reserves 15% of the dataset for testing
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.15) 

In [10]:
print(f'There are {y_train.count(1)} 1s and {y_train.count(0)} 0s.\nLength: {len(y_train)}')

There are 15078 1s and 35922 0s.
Length: 51000


In [11]:
import keras
from keras.utils import to_categorical

<span style='color:blue'>
    <b>Study Notes:</b></span>

Here we are one-hot encoding the output. For future reference, read [this](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).
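
As a quick illustration of what one-hot encoding produces, here is a NumPy-only sketch on made-up labels. It is meant to show the idea, not to reproduce how Keras implements `to_categorical`.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])   # toy integer labels
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)

# argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each label becomes a row with a single 1 in the column of its class, and `argmax` reverses the encoding.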

In [12]:
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

In [13]:
print(f'x_train shape:{x_train.shape}\nx_test shape:{x_test.shape}')

x_train shape:(51000, 50, 50, 3)
x_test shape:(9000, 50, 50, 3)


In [14]:
x_train_flat = x_train.reshape(x_train.shape[0], -1)
x_test_flat = x_test.reshape(x_test.shape[0], -1)

In [15]:
x_test_flat.shape

(9000, 7500)

In [16]:
from imblearn.under_sampling import RandomUnderSampler

<span style='color:blue'>
    <b>Study Notes</b></span>

RandomUnderSampler [doc](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html). A little bit about under-sampling [here](https://imbalanced-learn.readthedocs.io/en/stable/under_sampling.html) and imbalanced classification [here](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/) and [here](https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data).

Under-sampling removes some of the examples from the majority class, while over-sampling duplicates some of the examples from the minority class. For imbalanced classification problems, one might rebalance the dataset randomly with either technique.

For more about the subject, maybe check out [Learning from Imbalanced Data Sets](https://towardsdatascience.com/learning-from-imbalanced-datasets-b601a1f1e154) more carefully later (check the resources as well).

<span style='color:red'>Must come back later to this note.</span>
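
Here is a small hand-rolled sketch of random under-sampling in NumPy, to make the description above concrete. The `undersample_to_smallest` helper and the toy data are hypothetical, and this is not imbalanced-learn's implementation: it simply keeps a random subset of each class, sized to match the smallest class.

```python
import numpy as np

def undersample_to_smallest(features, labels, seed=0):
    """Randomly keep as many samples of each class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    kept = []
    for cls in classes:
        idx = np.flatnonzero(labels == cls)            # positions of this class
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))               # restore original order
    return features[kept], labels[kept]

features = np.arange(20).reshape(10, 2)   # 10 toy samples
labels = np.array([0] * 7 + [1] * 3)      # 7 majority vs 3 minority samples

f_bal, l_bal = undersample_to_smallest(features, labels)
print(np.unique(l_bal, return_counts=True))
```

With two classes this matches the idea of shrinking the majority class down to the size of the minority class. The discarded majority samples are simply lost, which is the main cost of the technique.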

In [20]:
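# Pass the strategy by keyword; `sampling_strategy` is the current name of this parameter
random_under_sampler = RandomUnderSampler(sampling_strategy='majority')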


<span style='color:red'>
    <b>Obs Notes</b></span>
    
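Having problems with RandomUnderSampler when using the keyword `ratio`. As far as I can tell, newer imbalanced-learn releases replaced `ratio` with `sampling_strategy`, so I have to confirm that against the docs.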

In [25]:
# fit_resample is the current name of this method (fit_sample is the older alias)
x_trainRos, y_trainRos = random_under_sampler.fit_resample(x_train_flat, y_train)
x_testRus, y_testRus = random_under_sampler.fit_resample(x_test_flat, y_test)



In [30]:
print(f'Shapes:\nx_trainRos: {x_trainRos.shape}')
print(f'y_trainRos: {y_trainRos.shape}')
print(f'x_testRus: {x_testRus.shape}')
print(f'y_testRus: {y_testRus.shape}')
print(f'x_train: {x_train.shape}')
print(f'x_test: {x_test.shape}')

Shapes:
x_trainRos: (30156, 7500)
y_trainRos: (30156, 1)
x_testRus: (5322, 7500)
y_testRus: (5322, 1)
x_train: (51000, 50, 50, 3)
x_test: (9000, 50, 50, 3)


In [32]:
y_trainRus_hot = to_categorical(y_trainRos, num_classes=2)
y_testRus_hot = to_categorical(y_testRus, num_classes=2)

In [59]:
np.unique(y_trainRus_hot, return_counts=True)

(array([0., 1.], dtype=float32), array([30156, 30156]))

In [60]:
np.unique(y_testRus_hot, return_counts=True)

(array([0., 1.], dtype=float32), array([5322, 5322]))
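
A caveat about the two checks above: on a two-class one-hot array, `np.unique` counts the individual 0 and 1 entries, and every row holds exactly one of each, so both counts equal the number of rows whatever the class balance is. A sketch of counting samples per class instead, on made-up labels, is to collapse each row to its class index first:

```python
import numpy as np

# Toy one-hot labels: three samples of class 0 and one sample of class 1.
one_hot = np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=np.float32)

# Counting raw entries: every row contributes one 0 and one 1.
values, entry_counts = np.unique(one_hot, return_counts=True)

# Counting classes: convert each row to its class index before counting.
classes, class_counts = np.unique(one_hot.argmax(axis=1), return_counts=True)

print(entry_counts)   # number of 0 entries and number of 1 entries
print(class_counts)   # number of samples in each class
```

In the toy data the entry counts are equal even though the classes are clearly imbalanced, so only the second count says anything about balance.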

In [36]:
# A single reshape converts every flattened sample back to an image; no loop is needed
height, width, channels = 50, 50, 3
x_trainRus_reshaped = x_trainRos.reshape(len(x_trainRos), height, width, channels)

In [37]:
x_trainRus_reshaped

array([[[[0.90588236, 0.88235295, 0.93333334],
         [0.8509804 , 0.7882353 , 0.9137255 ],
         [0.8666667 , 0.83137256, 0.9254902 ],
         ...,
         [0.5568628 , 0.4       , 0.7529412 ],
         [0.827451  , 0.76862746, 0.9098039 ],
         [0.8627451 , 0.8117647 , 0.9372549 ]],

        [[0.9098039 , 0.8980392 , 0.94509804],
         [0.87058824, 0.84313726, 0.92941177],
         [0.80784315, 0.7647059 , 0.9019608 ],
         ...,
         [0.7254902 , 0.61960787, 0.8627451 ],
         [0.8509804 , 0.7882353 , 0.92941177],
         [0.8509804 , 0.8156863 , 0.93333334]],

        [[0.9411765 , 0.92941177, 0.9490196 ],
         [0.93333334, 0.93333334, 0.94509804],
         [0.827451  , 0.7607843 , 0.8980392 ],
         ...,
         [0.77254903, 0.70980394, 0.91764706],
         [0.7882353 , 0.70980394, 0.9019608 ],
         [0.87058824, 0.81960785, 0.92941177]],

        ...,

        [[0.7764706 , 0.7019608 , 0.9254902 ],
         [0.827451  , 0.7647059 , 0.92156863]

### Model Architecture
***

In [33]:
batch_size = 256
num_classes = 2
epochs = 50

In [38]:
from keras.models import Sequential


<span style='color:blue'>
    <b>Study Notes</b></span>

More about Sequential: [doc](https://keras.io/api/models/sequential/) and [guide](https://keras.io/guides/sequential_model/).

In [39]:
model = Sequential()

In [43]:
model.name

'sequential'

In [44]:
model.layers

[]

In [45]:
from keras.layers import Conv2D, MaxPooling2D

<span style='color:blue'>
    <b>Study Notes</b></span>
    
Need to look more into Conv2D and MaxPooling2D. 
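
As a first step, the output shapes and parameter counts of these layers can be worked out by hand. The sketch below assumes the Keras defaults of no padding ('valid') and a stride of 1 for the convolution, and a hypothetical 2x2 max-pooling layer with stride 2 after it (the notebook has not added a pooling layer yet). The helper names are mine.

```python
def window_output_size(size, kernel, stride=1):
    """Output length along one axis for a sliding window without padding."""
    return (size - kernel) // stride + 1

def conv2d_param_count(kernel_h, kernel_w, in_channels, filters):
    """One weight per kernel cell per input channel, plus one bias, for each filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

# 50x50x3 input through a 3x3 convolution with 32 filters
after_conv = window_output_size(50, kernel=3)
# hypothetical 2x2 max pooling with stride 2 applied afterwards
after_pool = window_output_size(after_conv, kernel=2, stride=2)
params = conv2d_param_count(3, 3, in_channels=3, filters=32)

print(after_conv, after_pool, params)
```

Under those assumptions the convolution shrinks each 50x50 patch to 48x48 (with 32 channels), the pooling would halve that to 24x24, and the convolution layer has 896 trainable parameters. Calling `model.summary()` should be a way to compare these numbers against what Keras reports.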

In [46]:
model.add(Conv2D(32, kernel_size=(3,3),
                 activation='relu',
                 input_shape=(50,50,3)))

In [47]:
model.layers

[<tensorflow.python.keras.layers.convolutional.Conv2D at 0x7f6e1cb6ee50>]

In [50]:
type(model.layers)

list

<span style='color:blue'>
    <b>Study session notes</b></span>

<sub>(Other days hidden)</sub>
<!-- 03/07:<br>
Search more about Google Colab and how to work with a big database. The kernel crashed a few times and it was a bit annoying. It happened almost always while using the train_test_split function. 

While not necessarily adjacent to the notebook, look into the possibility of using the notebook outside of the conda environment. Had a few problems downloading certain modules through `conda`.  -->

04/07:<br>

Difficulties with the RandomUnderSampler class: it didn't recognize the keyword `ratio`, so I must look into it further. I should read [A Deep Learning Architecture for Classifying Medical Images of Anatomy Object](http://www.apsipa.org/proceedings/2017/CONTENTS/papers2017/15DecFriday/FP-02/FP-02.3.pdf) to understand a bit more about the model used. Need to read more overall.

I was able to follow along up until "Model Architecture". In order to continue, read the linked paper and search a bit more about it. Also reread the TensorFlow guide on the Sequential model.

Must convert data back to its original shape of 50 x 50 x 3.
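
A minimal sketch of that round trip, using random stand-in data rather than the real patches: flattening to (n, 7500) and reshaping back to (n, 50, 50, 3) should return exactly the original array, as long as no samples are reordered in between.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((4, 50, 50, 3), dtype=np.float32)   # 4 random stand-in images

flat = images.reshape(images.shape[0], -1)    # one row of 7500 values per image
restored = flat.reshape(-1, 50, 50, 3)        # back to height x width x channels

print(flat.shape, restored.shape)
print(np.array_equal(restored, images))
```

Using `-1` for the first dimension lets NumPy infer the number of samples, which is convenient after under-sampling changes it.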