<a href="https://colab.research.google.com/github/WittmannF/course/blob/master/day-3/Loading_Image_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading and Preprocessing Images
As an example, we will load a dataset of cat and dog images using both the `flow_from_dataframe` and `flow_from_directory` methods of `ImageDataGenerator`. First, let's clone the image dataset, which contains 256 training images and 64 validation images.

In [0]:
!git clone https://github.com/WittmannF/ImageDataGenerator-example.git

Cloning into 'ImageDataGenerator-example'...
remote: Enumerating objects: 341, done.

In [0]:
cd ImageDataGenerator-example

/content/ImageDataGenerator-example


In [0]:
ls

[0m[01;34mflow_from_dataframe[0m/  [01;34mflow_from_directory[0m/  README.md


## Defining Base Model
Let's define the base model that we will be using for training:

In [0]:
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.applications.vgg16 import VGG16

# 2. Initialize base model
base_model = VGG16(include_top=False, input_shape=(224,224,3))

# 3. Freeze layers from the base model
for layer in base_model.layers:
    layer.trainable=False
    
# 4. Add Fully connected layer
model = Sequential([base_model,
                    Flatten(),
                    Dense(1024, activation='relu'),
                    Dense(2, activation='softmax')])

Using TensorFlow backend.

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


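
To get a feel for the size of the head added on top of the frozen base, the arithmetic below works out the tensor shapes and the number of trainable parameters. It is a rough sketch, assuming that the VGG16 convolutional base applies five 2x2 max-pooling stages and ends with 512 channels; the helper names are illustrative and not part of Keras.

```python
# Shape and parameter arithmetic for the classifier head.
# Assumption: the VGG16 convolutional base has five 2x2 max-pooling
# stages and outputs 512 feature channels.

def conv_base_output_shape(height, width, n_pools=5, channels=512):
    """Each 2x2 max-pool halves the spatial size (floor division)."""
    for _ in range(n_pools):
        height //= 2
        width //= 2
    return height, width, channels

def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

h, w, c = conv_base_output_shape(224, 224)
flat = h * w * c                   # size of the Flatten() output
hidden = dense_params(flat, 1024)  # Dense(1024, activation='relu')
output = dense_params(1024, 2)     # Dense(2, activation='softmax')

print((h, w, c), flat, hidden, output)
# (7, 7, 512) 25088 25691136 2050
```

Under these assumptions, almost all trainable parameters sit in the first dense layer; `model.summary()` should report the same counts.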



In [0]:
from keras.optimizers import Adam
model.compile(optimizer=Adam(lr=1e-4), loss='sparse_categorical_crossentropy', metrics=['accuracy'])




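Since we are using the loss `sparse_categorical_crossentropy`, the labels must be integer class indices (for example, `0` for cat and `1` for dog) rather than one-hot vectors. This is why the generators in the following sections are created with `class_mode="sparse"`.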
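
The relationship between the sparse loss and the one-hot version can be checked by hand. The snippet below is a plain-Python sketch for a single sample, not the Keras implementation; the function names simply mirror the loss names for readability.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss for one sample when the label is an integer class index."""
    return -math.log(probs[label])

def categorical_crossentropy(one_hot, probs):
    """Loss for one sample when the label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.2, 0.8]  # softmax output: P(cat), P(dog)

sparse_loss = sparse_categorical_crossentropy(1, probs)   # label: dog -> 1
dense_loss = categorical_crossentropy([0, 1], probs)      # label: dog -> [0, 1]

print(math.isclose(sparse_loss, dense_loss))
# True
```

Both forms give the same value; the sparse form only saves the one-hot encoding step.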

## `flow_from_directory` example
The method `flow_from_directory` is used when the image files are organized into one subdirectory per class, for example:

![Screen Shot 2019-07-05 at 13 50 38](https://user-images.githubusercontent.com/5733246/60736066-1919a800-9f2c-11e9-9c93-f327178cc478.png)


In [0]:
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# 1. Define Data Generators
TRAIN_PATH = 'flow_from_directory/train'

datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_gen = datagen.flow_from_directory(TRAIN_PATH, target_size=(224, 224), class_mode="sparse")

Found 256 images belonging to 2 classes.


In [0]:
train_gen.class_indices

{'cat': 0, 'dog': 1}
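
The mapping above follows the alphanumeric order of the subdirectory names. The sketch below reproduces that ordering with the standard library on a temporary directory; `infer_class_indices` is a hypothetical helper written for illustration, not the function Keras uses internally.

```python
import os
import tempfile

def infer_class_indices(directory):
    """Map each subdirectory name to an index, in sorted (alphanumeric) order."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as root:
    # Create the class folders in "wrong" order to show that sorting decides.
    for class_name in ("dog", "cat"):
        os.makedirs(os.path.join(root, class_name))
    class_indices = infer_class_indices(root)

print(class_indices)
# {'cat': 0, 'dog': 1}
```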

## `flow_from_dataframe` example
When using the method `flow_from_dataframe`, we have to define a dataframe with two columns: one with the path of each image file and the other with the class to which each image belongs. For example:

![Screen Shot 2019-07-05 at 14 24 54](https://user-images.githubusercontent.com/5733246/60737318-b5de4480-9f30-11e9-88ab-455015d5e131.png)

In those cases, the class can be inferred either from the filename or from an additional file that lists the class of each filename. In both cases, we have to map each filepath to its correct class. Here's an example inferring the class from the filename:

In [0]:
cd flow_from_dataframe

/content/ImageDataGenerator-example/flow_from_dataframe


Let's get all the filenames from the training path:

In [0]:
import glob
train = glob.glob('train/*.jpg')

print(train[:5])

['train/cat.107.jpg', 'train/dog.126.jpg', 'train/cat.106.jpg', 'train/cat.56.jpg', 'train/cat.28.jpg']


Now, let's create a dataframe with both filepaths and classes:

In [0]:
import pandas as pd
# Convert filepaths to a Pandas dataframe
train_df = pd.DataFrame({'filename': train})

# Add new column with the label of each file
train_df['class'] = train_df['filename'].apply(lambda x: 'cat' if 'cat.' in x else 'dog')

train_df.head()

|   | filename | class |
|---|---|---|
| 0 | train/cat.107.jpg | cat |
| 1 | train/dog.126.jpg | dog |
| 2 | train/cat.106.jpg | cat |
| 3 | train/cat.56.jpg | cat |
| 4 | train/cat.28.jpg | cat |
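
Both labelling strategies mentioned earlier can be sketched with the standard library. The first helper looks only at the basename of the path, a slightly stricter variant of the substring check used above. The second reads labels from an additional file. The CSV content and both helper names are hypothetical, made up for illustration; the cloned repository is not assumed to ship such a file.

```python
import csv
import io
import os

def label_from_filename(path):
    """Return the text before the first dot of the basename."""
    return os.path.basename(path).split(".")[0]

# Hypothetical labels file, kept in memory so the example is self-contained.
LABELS_CSV = "filename,class\ncat.107.jpg,cat\ndog.126.jpg,dog\n"

def labels_from_csv(text):
    """Build a {basename: class} lookup from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["filename"]: row["class"] for row in reader}

paths = ["train/cat.107.jpg", "train/dog.126.jpg"]

# Strategy 1: infer the class from the filename.
by_name = [{"filename": p, "class": label_from_filename(p)} for p in paths]

# Strategy 2: look the class up in the additional file.
lookup = labels_from_csv(LABELS_CSV)
by_file = [{"filename": p, "class": lookup[os.path.basename(p)]} for p in paths]

print(by_name == by_file)
# True
```

Either list of rows can be passed to `pd.DataFrame(...)` to obtain the two-column dataframe shown above.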


By default, `flow_from_dataframe` expects the columns to be named **filename** and **class**; if they are named differently, the names have to be passed through the `x_col` and `y_col` arguments. Next, we can simply use the method `flow_from_dataframe`:

In [0]:
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# 1. Define Data Generators
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_gen = datagen.flow_from_dataframe(train_df, target_size=(224, 224), 
                                        batch_size=32,
                                        class_mode="sparse")

Found 256 validated image filenames belonging to 2 classes.


## Moving forward
In both scenarios, we have to train with the model's `fit_generator` method. Here's a minimal example:

In [0]:
number_of_batches = train_gen.n//train_gen.batch_size
model.fit_generator(train_gen, number_of_batches)



Epoch 1/1


<keras.callbacks.History at 0x7f9f110eadd8>

In [0]:
train_gen.n

256

In [0]:
train_gen.batch_size

32
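
With these two values, the number of steps per epoch is a matter of simple arithmetic. The sketch below shows the floor division used above next to a rounded-up variant that also counts a final partial batch; both helper names are illustrative.

```python
import math

def steps_floor(n_samples, batch_size):
    """Count only full batches (the `n // batch_size` form used above)."""
    return n_samples // batch_size

def steps_ceil(n_samples, batch_size):
    """Also count a final, smaller batch when the division is not exact."""
    return math.ceil(n_samples / batch_size)

train_steps = steps_floor(256, 32)
print(train_steps)
# 8

# The two variants differ only when the dataset size is not a multiple
# of the batch size, e.g. 250 samples with batches of 32:
print(steps_floor(250, 32), steps_ceil(250, 32))
# 7 8
```

For this dataset both variants agree, since 256 is an exact multiple of 32.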

It is also a good idea to load the validation images as well, so that the model can be evaluated on data it was not trained on:

In [0]:
valid = glob.glob('valid/*.jpg')

valid_df = pd.DataFrame({'filename': valid})
valid_df['class'] = valid_df['filename'].apply(lambda x: 'cat' if 'cat.' in x else 'dog')

valid_gen = datagen.flow_from_dataframe(valid_df, target_size=(224, 224), class_mode="sparse")

Found 64 validated image filenames belonging to 2 classes.


In [0]:
model.fit_generator(train_gen, number_of_batches, validation_data=valid_gen, validation_steps=valid_gen.n//valid_gen.batch_size)

Epoch 1/1


<keras.callbacks.History at 0x7f9ec67a68d0>