# Using Keras ImageDataGenerator
Here I'm using Keras [`ImageDataGenerator`](https://keras.io/preprocessing/image/) to read files from the `train` directory and feed a very simple CNN. I'm not chasing accuracy yet, just trying to simplify the pipeline. `ImageDataGenerator` can stream images to the CNN while applying:
* resampling to a smaller size
* changing to grayscale if needed
* train/validation split (done below with `validation_split`)
* data augmentation

TODO:
* resample to a different size
* change the CNN to a more effective one
* ~~add the validation_generator and pass it to the model's `fit_generator` method~~
* analyze source data
* train on whole dataset
* data augmentation
* implement Mean Average Precision @ 5 for submission
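
For the last TODO item, here is a minimal sketch of Mean Average Precision @ 5 in plain Python. It assumes that every image has exactly one correct label, in which case the average precision for an image reduces to 1 / rank of the first correct guess among the top 5 (or 0 if it is missing). The function names are my own and not part of any library.

```python
def average_precision_at_k(true_label, predictions, k=5):
    """AP@k for one image, assuming exactly one correct label per image."""
    for rank, label in enumerate(predictions[:k], start=1):
        if label == true_label:
            # Only the first correct guess counts.
            return 1.0 / rank
    return 0.0


def map_at_k(true_labels, predicted_lists, k=5):
    """Mean of AP@k over all images."""
    if len(true_labels) != len(predicted_lists):
        raise ValueError("one prediction list is needed per true label")
    if not true_labels:
        return 0.0
    scores = [average_precision_at_k(t, p, k)
              for t, p in zip(true_labels, predicted_lists)]
    return sum(scores) / len(scores)


# First image: correct at rank 1 (score 1.0); second: correct at rank 2 (score 0.5).
truth = ["new_whale", "w_b27b6c6"]
guesses = [["new_whale", "w_b27b6c6"], ["new_whale", "w_b27b6c6"]]
print(map_at_k(truth, guesses))  # 0.75
```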


In [8]:
import os
import math
import itertools
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib.pyplot import imshow
import seaborn as sns
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.model_selection import train_test_split
from keras.preprocessing import image
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, ZeroPadding2D, BatchNormalization
from keras.optimizers import Adam
from keras.utils.np_utils import to_categorical
from keras.utils import plot_model
from keras.models import Input, Model
from scipy.misc import imresize
from keras.preprocessing.image import ImageDataGenerator

warnings.simplefilter("ignore", category=DeprecationWarning)

%matplotlib inline
pd.set_option("display.max_rows", 10)
np.random.seed(42)

In [9]:
os.listdir("../input/")

['train.csv', 'test', 'sample_submission.csv', 'train']

By default, `flow_from_dataframe` expects a `filename` column holding the file names and a `class` column holding the labels. Here I rename the columns to match, but I could instead have passed these arguments to `flow_from_dataframe` to override the default Keras behaviour:

    x_col: string, column in the dataframe that contains
           the filenames of the target images.
    y_col: string or list of strings, columns in
           the dataframe that will be the target data.
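
A minimal sketch of that alternative, passing `x_col` and `y_col` instead of renaming the columns. The helper name `make_train_generator` is hypothetical, and the column names in the usage comment (`Image`, `Id`) are an assumption about the original CSV header, which this notebook never shows.

```python
def make_train_generator(dataframe, directory, x_col, y_col):
    """Build a generator without renaming the dataframe columns."""
    # Imported lazily so the helper can be defined without Keras installed.
    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1. / 255)
    return datagen.flow_from_dataframe(
        dataframe=dataframe,
        directory=directory,
        x_col=x_col,  # column holding the image file names
        y_col=y_col,  # column holding the class labels
        class_mode="categorical",
    )


# Hypothetical usage; "Image" and "Id" are assumed column names:
# raw = pd.read_csv("../input/train.csv")
# train_generator = make_train_generator(raw, "../input/train", "Image", "Id")
```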

In [10]:
dataset = pd.read_csv("../input/train.csv")
dataset.columns = ['filename', 'class'] # renaming to match ImageDataGenerator expectations
dataset.sample(5)

Unnamed: 0,filename,class
20621,cfb8c68dc.jpg,new_whale
10152,66bf04895.jpg,w_b27b6c6
21324,d6a92a7f9.jpg,new_whale
16229,a37e5cc98.jpg,new_whale
4097,2a215f11e.jpg,new_whale


In [11]:
dataset.shape

(25361, 2)

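To speed up experiments I can work on a subset of the 25k pictures; slicing the dataframe (for example `dataset[:subset]`) is enough. Note that the cell below defines `subset` but still passes the full dataframe to the generators, as the image counts in its output show.

* `batch_size` controls how many samples the generator sends to the network at each step
* `subset` is the number of rows to keep when slicing the source dataset for quick experiments
* `target_size` is the image shape used for training, as (height, width, channels); only the first two values are passed to the generators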

In [28]:
batch_size = 256
subset = 500
target_size = (224, 224, 1)


datagen = ImageDataGenerator(rescale=1./255, validation_split=.2)
train_generator = datagen.flow_from_dataframe(
        directory='../input/train',
        dataframe=dataset,
        subset='training',
        target_size=target_size[0:2],
        color_mode='grayscale',
        batch_size=batch_size,
        class_mode='categorical',
        interpolation='nearest')

val_generator = datagen.flow_from_dataframe(
        directory='../input/train',
        dataframe=dataset,
        subset='validation',
        target_size=target_size[0:2],
        color_mode='grayscale',
        batch_size=batch_size,
        class_mode='categorical',
        interpolation='nearest')

len(train_generator.classes)
len(val_generator.classes)

Found 20289 images belonging to 5005 classes.
Found 5072 images belonging to 5005 classes.


5072

The CNN. Notes:
* optimizer set to Adam with a learning rate of 0.02 (well above the Keras default of 0.001) and no learning rate decay (`decay=0.00`)
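
A side note on `decay`: in the legacy Keras optimizers that accept `lr` and `decay`, the decay is, to my knowledge, applied at every update step (batch) rather than once per epoch, as lr_t = lr / (1 + decay * t). The sketch below reproduces that formula in plain Python. The function name and the example decay value of 0.001 are illustrative assumptions, not something used in this notebook.

```python
def decayed_learning_rate(initial_lr, decay, iterations):
    """Time-based decay: lr / (1 + decay * number_of_updates_so_far)."""
    return initial_lr / (1.0 + decay * iterations)


# With decay=0.0, as in this notebook, the rate never changes.
print(decayed_learning_rate(0.02, 0.0, 1000))

# With a hypothetical decay of 0.001 the rate is halved after 1000 updates.
print(decayed_learning_rate(0.02, 0.001, 1000))
```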

In [32]:
inp = Input(shape = target_size)

x1 = Conv2D(64, (9, 9), strides=2, activation='relu')(inp)
x1 = MaxPool2D((2,2), strides=(2,2))(x1)

x1 = BatchNormalization()(x1)
x1 = Conv2D(64, (3, 3), strides=2, activation='relu')(x1)
x1 = BatchNormalization()(x1)
x1 = Conv2D(64, (3, 3), strides=2, activation='relu')(x1)

x1 = MaxPool2D((2,2))(x1)
x1 = BatchNormalization()(x1)

x1 = Conv2D(128, (1, 1), activation='relu')(x1)
x1 = MaxPool2D((2,2), strides=(2,2))(x1)
x1 = BatchNormalization()(x1)

x1 = Conv2D(384, (1, 1), activation='relu')(x1)
x1 = MaxPool2D((2,2), strides=(2,2))(x1)
x1 = BatchNormalization()(x1)

x1 = Conv2D(512, (1, 1), activation='relu')(x1)
x1 = Dropout(.25)(x1)

x1 = Flatten()(x1)

x1 = Dense(512, activation='relu')(x1)

x1 = Dense(5005, activation='softmax')(x1)

model = Model(inputs=inp, outputs=x1)

opt = Adam(lr=0.02, decay=0.00)
model.compile(optimizer = opt , loss = "categorical_crossentropy", metrics=["accuracy"])
model.build(input_shape = target_size)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 224, 224, 1)       0         
_________________________________________________________________
conv2d_42 (Conv2D)           (None, 108, 108, 64)      5248      
_________________________________________________________________
max_pooling2d_36 (MaxPooling (None, 54, 54, 64)        0         
_________________________________________________________________
batch_normalization_27 (Batc (None, 54, 54, 64)        256       
_________________________________________________________________
conv2d_43 (Conv2D)           (None, 26, 26, 64)        36928     
_________________________________________________________________
batch_normalization_28 (Batc (None, 26, 26, 64)        256       
_________________________________________________________________
conv2d_44 (Conv2D)           (None, 12, 12, 64)        36928     
__________
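
The shapes and parameter counts in the summary above can be checked by hand. Without padding (the Keras default, `padding='valid'`), a convolution or pooling layer maps a spatial size n to floor((n - k) / s) + 1 for kernel size k and stride s. A `Conv2D` layer has k * k * in_channels * filters weights plus one bias per filter, and `BatchNormalization` keeps four values per channel. The helper names below are my own; this is only a sketch of the arithmetic.

```python
def valid_output_size(size, kernel, stride):
    """Spatial size after a convolution or pooling layer without padding."""
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters


def batchnorm_params(channels):
    """Gamma, beta, moving mean and moving variance for each channel."""
    return 4 * channels


size = 224
sizes = []
# (kernel, stride) for the first Conv2D, MaxPool2D, Conv2D and Conv2D layers
for kernel, stride in [(9, 2), (2, 2), (3, 2), (3, 2)]:
    size = valid_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)                     # [108, 54, 26, 12]
print(conv2d_params(9, 1, 64))   # 5248
print(conv2d_params(3, 64, 64))  # 36928
print(batchnorm_params(64))      # 256
```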

In [33]:
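epochs = 2
import tensorflow as tf

# One step is one batch, so a full pass needs len(generator) = ceil(samples / batch_size) steps.
# The output below apparently comes from an earlier run with epochs=10 and
# steps_per_epoch=20289//epochs, which explains the 2028 steps and the very long ETA.
# Pin training to the first GPU.
with tf.device('/device:GPU:0'):
    history = model.fit_generator(train_generator,
                                  epochs=epochs,
                                  steps_per_epoch=len(train_generator),
                                  use_multiprocessing=True,
                                  validation_steps=len(val_generator),
                                  validation_data=val_generator)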

Epoch 1/10
  11/2028 [..............................] - ETA: 4:07:11 - loss: 10.0393 - acc: 0.3341

Process ForkPoolWorker-8:
Process ForkPoolWorker-7:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/daniele/miniconda3/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/daniele/miniconda3/lib/python3.6/site-packages/keras/utils/data_utils.py", line 401, in get_index
    return _SHAR

KeyboardInterrupt: 
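
In `fit_generator` one step consumes one batch, so the number of steps needed to see every image once depends on the batch size, not on the number of epochs. A minimal sketch of that arithmetic, using the image counts reported by the generators above and the notebook's `batch_size` of 256 (the helper name `steps_for` is my own):

```python
import math


def steps_for(num_samples, batch_size):
    """Batches needed to cover every sample once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


train_steps = steps_for(20289, 256)
val_steps = steps_for(5072, 256)
print(train_steps, val_steps)  # 80 20
```

With 80 training steps per epoch instead of thousands, each epoch covers the training set exactly once.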

In [None]:
plt.plot(history.history['acc'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()

### TO BE CONTINUED