
In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


For this task I used Google Colab, since its GPU should reduce the computation time. An easy way to import data there is to mount your Google Drive, which gives the notebook access to everything stored in it.

First, we import the packages that will be used:

In [0]:
from PIL import Image # library 'pillow' for image processing
import os, os.path
import glob
from os.path import basename
import re

Creating ImageDataGenerator

There are several ways to import the photoshopped-image data into Python. The one used here, ImageDataGenerator, also lets us augment the pictures: it changes them in certain ways and so creates additional learning material for the model. We do this so that the model generalizes better: if we zoom some images slightly or rotate them, the model is less likely to overfit, because it cannot simply learn the location of certain pixels, for example.

Depending on the exact task, several transformations can be used: rotating, zooming, and changing the shear intensity or the brightness. Of course, it is not a good idea to let the generator create images that would never occur in the real data you want to predict, as this only makes the model less accurate without any benefit.

In this case I am using only a few of the options. A small rotation makes sense because the faces in the images are not all at exactly the same angle. Zooming in, and thereby cutting off part of the picture, also makes sense because in our dataset most of the relevant information is found in the middle of the image. Additionally, we rescale the pixel values to the range between 0 and 1. This makes the computation easier and keeps the information we need about the differences between pixels.
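To illustrate the rescaling step, here is a minimal sketch in plain Python. It assumes 8-bit pixel values (0 to 255); the helper name `rescale_pixels` is hypothetical and only mimics what `rescale = 1./255` does to each value.

```python
def rescale_pixels(pixels, factor=1.0 / 255):
    """Multiply every pixel value by a constant factor (hypothetical helper)."""
    return [value * factor for value in pixels]

row = [0, 51, 255]            # three example 8-bit intensities
scaled = rescale_pixels(row)  # same ordering, now between 0 and 1
print(scaled)
```

Because every value is multiplied by the same constant, the relative differences between pixels are unchanged.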

I use this moment to introduce some other variables:

**sqrt_dim** - the side length, in pixels, to which every picture is resized. Smaller images are easier to work with; as a rule of thumb I take the square root of the original size.

**train_batch_size** - how many images go through the model together. This matters for both the quality of learning and the computation time: a larger batch leads to faster calculation, but the model could overfit more easily. On the other hand, setting it to "1" would adjust the parameters after every single image, which would increase the computation time enormously.

**input_shape** - the shape of the data flowing into the sequential model. The first two values are the height and width of the image, and "3" stands for the three channels (red, green, blue), as the images are color images.
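As a rough illustration of the batch-size trade-off, the sketch below counts how many parameter updates one pass over the training data needs. It assumes the 725 training images that `flow_from_directory` reports further down; the helper `updates_per_pass` is hypothetical.

```python
import math

def updates_per_pass(n_images, batch_size):
    """Number of batches (and so parameter updates) needed to see every image once."""
    return math.ceil(n_images / batch_size)

n_images = 725                            # training images found in the train folder
with_batches = updates_per_pass(n_images, 40)
one_by_one = updates_per_pass(n_images, 1)
values_per_image = 60 * 60 * 3            # height * width * channels of input_shape
print(with_batches, one_by_one, values_per_image)
```

With a batch size of 40 the last batch is smaller than the others, which is why the count is rounded up.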

In [4]:
from keras.preprocessing.image import ImageDataGenerator 
sqrt_dim=60
train_batch_size=40
input_shape=(sqrt_dim, sqrt_dim, 3)

train_generator = ImageDataGenerator(rescale = 1./255, 
                                   shear_range = 0.1, 
                                   zoom_range = 0.1,
                                   horizontal_flip = False, 
                                   rotation_range = 20) 

Using TensorFlow backend.


Additionally, we have to create a generator for the test data so that we can import it as well. Here we don't want to change anything in the images; we only rescale them so that the values fit the model. Otherwise the data should stay as close as possible to the real data that the model will try to predict.

In [0]:

test_generator = ImageDataGenerator(rescale = 1./255) # only rescaling with fixed parameters is valid, otherwise do not touch the test set! 


The actual import of the images happens with the "flow_from_directory" function. We tell it where to find the data, and it automatically uses the names of the subdirectories in which it finds files as labels. We also use it to resize the images and to set the batch size used during training.

The output shows us how many images were found in the respective folder and their labels - "fake" and "real".

In [6]:
trainset = train_generator.flow_from_directory('/content/drive/My Drive/Final assignment/train', # the directory (in Colab you can right-click the folder and copy its path)
                                                 target_size = (sqrt_dim, sqrt_dim), # every image is resized to sqrt_dim x sqrt_dim
                                                 batch_size = train_batch_size, # 40 images per batch
                                                 class_mode= 'categorical') # one-hot labels, one column per class


validationset = test_generator.flow_from_directory('/content/drive/My Drive/Final assignment/validation', # the directory (in Colab you can right-click the folder and copy its path)
                                                 target_size = (sqrt_dim, sqrt_dim), # same size as the training images
                                                 batch_size = train_batch_size, # 40 images per batch
                                                 class_mode= 'categorical') # one-hot labels, one column per class

Found 725 images belonging to 2 classes.
Found 200 images belonging to 2 classes.


In [6]:
trainset.class_indices

{'fake': 0, 'real': 1}
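The mapping above follows from the folder names: Keras infers the classes from the subdirectories and orders them alphanumerically. The sketch below reproduces that idea with the standard library only; `infer_class_indices` is a hypothetical helper, not the Keras implementation, and the temporary folders merely stand in for the real dataset.

```python
import os
import tempfile

def infer_class_indices(directory):
    """Map each subdirectory name to an index, in sorted order (hypothetical helper)."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as root:
    # create the folders in "wrong" order to show that sorting decides the index
    for label in ("real", "fake"):
        os.makedirs(os.path.join(root, label))
    class_indices = infer_class_indices(root)

print(class_indices)  # {'fake': 0, 'real': 1}
```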

We also define some variables that we will need later: the number of samples and the number of classes:

In [7]:
n_train_samples = len(trainset.filenames)
n_classes = len(trainset.class_indices)
print(n_train_samples, n_classes)

725 2


**Convolutional neural networks implementation**

After the data is imported and formatted in a suitable way, we are ready to create the model, compile it and then train it.

The obvious choice for image classification is the CNN (convolutional neural network), which has been widely used in recent years for exactly this type of task.
A CNN works well on images because it tries to find deep patterns in them using different filters.

Briefly, the basic structure of the network used here is:

- Input --> Convolution --> Pooling --> Convolution --> Pooling --> Dropout --> Convolution --> Pooling --> Dropout --> Convolution (1x1) --> Dropout --> Flatten --> Dense --> Dropout --> Dense --> Output

The convolutional part consists of doing the same things multiple times: Convolution --> Pooling --> Dropout (the first block has no Dropout layer)
- In the "Convolution" part we apply filters to the images and create multiple different images, each focusing on specific features. This way we hope the model will find a pattern that is typical only for the photoshopped images.
- After creating those "new" images, we cut some of the information out to reduce dimensionality. Here I am using MaxPooling, the most widely used pooling algorithm. It takes the most activated pixel from each group of pixels and creates a "new" image that is a good representation of the previous one but consists of less data.
- Then comes the "Dropout" layer, which is used to prevent the model from overfitting. During training it randomly sets some of the activations to zero, so the model cannot rely on any single one of them.
It is important to note that "Dropout" behaves rather differently in the convolutional layers and in the fully connected "Dense" layers. The value we pass to each Dropout layer is the fraction of its inputs that is dropped.
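The pooling and dropout steps described above can be sketched in plain Python. This is only an illustration on a made-up 4x4 feature map, not the Keras implementation; the helpers `max_pool_2x2` and `dropout` are hypothetical, and the rescaling of the kept values by 1 / (1 - rate) mirrors what Keras does during training.

```python
import random

def max_pool_2x2(image):
    """Keep the largest value of every non-overlapping 2x2 block."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[0]) - 1, 2):
            block = (image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

feature_map = [          # made-up activations, not taken from the real model
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)            # [[4, 2], [2, 8]]

dropped = dropout([1.0] * 10, rate=0.3, rng=random.Random(0))
print(dropped)           # a mix of zeros and rescaled values
```

The pooled map has a quarter of the original values, which is the dimensionality reduction mentioned above.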

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten # Flatten turns the pooled feature maps into one long feature vector for the Dense layers
from keras.layers import Conv2D # 2D convolution for images (Conv3D would be used for videos)
from keras.layers import MaxPooling2D #pooling, downsampling

model = Sequential() #initialize
model.add(Conv2D(32, #number of filters/feature detectors, number of feature maps you will end up with. Starting with 32 is good practice, grow it if needed and if you have enough computing power
                 kernel_size=(3, 3), # size of filters (rows,columns), (3,3) is a popular choice, it's a hyperparameter to choose
                 activation='relu', # sets negative activations to zero, introducing non-linearity
                 # padding is left at its default "valid", so each 3x3 convolution shrinks the image by 2 pixels per dimension
                 # strides is left at its default of 1; strides=(2,2) would shift the window two pixels at a time and shrink the output, similar to pooling
                 input_shape=input_shape)) # (height, width, channels), here (60, 60, 3); all images have to come in a single format
model.add(MaxPooling2D(pool_size=(2, 2), #the size of the pooled feature map
                      strides=None)) # default: pool_size, here: (2,2)
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

model.add(Conv2D(128, (3, 3), activation='relu',padding = 'same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.6))


model.add(Conv2D(16, (1, 1), activation='relu'))
model.add(Dropout(0.6))
model.add(Flatten())
model.add(Dense(256, activation='relu')) # the classifier part; the size is up to you, choose it with the size of the picture in mind (something between the number of input and output nodes is another rule of thumb)
model.add(Dropout(0.6))
model.add(Dense(n_classes, activation='softmax')) 


After the model is created and its structure defined, it has to be compiled: we tell it how to learn, that is, which optimizer decides in which direction to move the coefficients, and how to calculate the loss.
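To make the loss more concrete, here is a minimal sketch of categorical cross-entropy for a single image. It is a simplified stand-in for the Keras loss (which also guards against taking the logarithm of zero); the probabilities are made up for illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

label_fake = [1.0, 0.0]   # one-hot label: index 0 = 'fake', index 1 = 'real'
confident = categorical_crossentropy(label_fake, [0.9, 0.1])
unsure = categorical_crossentropy(label_fake, [0.5, 0.5])
print(confident, unsure)  # the confident, correct prediction has the lower loss
```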

After this it is useful to display the summary of the model to get an overview and better understand what is happening. The "Output Shape" column shows the shape of the data flowing from one layer to the next, while the "Param #" column shows how many parameters each layer has to learn and adjust.
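The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes stride-1 convolutions, 2x2 pooling that discards any leftover row or column, and one bias per filter; the three helper functions are hypothetical.

```python
def conv_output(size, kernel, padding="valid"):
    """Spatial output size of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_output(size, pool=2):
    """Spatial output size of non-overlapping pooling (remainder is dropped)."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Weights of all filters plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

sizes = []
size = 60                                  # sqrt_dim
size = conv_output(size, 3); sizes.append(size)           # conv2d_1
size = pool_output(size); sizes.append(size)              # max_pooling2d_1
size = conv_output(size, 3); sizes.append(size)           # conv2d_2
size = pool_output(size); sizes.append(size)              # max_pooling2d_2
size = conv_output(size, 3, "same"); sizes.append(size)   # conv2d_3
size = pool_output(size); sizes.append(size)              # max_pooling2d_3

params = [conv_params(3, 3, 32), conv_params(3, 32, 128), conv_params(3, 128, 128)]
print(sizes)   # [58, 29, 27, 13, 13, 6]
print(params)  # [896, 36992, 147584]
```

Both lists match the "Output Shape" and "Param #" columns of the first three convolution and pooling layers in the summary.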

It is important to understand what the model actually changes in each step of backpropagation. In the convolutional layers it learns the filters themselves: their values are adjusted in the direction that lowers the loss against the real labels. In the "Dense" layers it adjusts the weights of each neuron.

In [9]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 58, 58, 32)        896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 29, 29, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 27, 27, 128)       36992     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 13, 13, 128)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 13, 13, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 13, 13, 128)       147584    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 6, 6, 128)         0         
__________

In [0]:
max_epochs = 40

To cope with overfitting we also create a callback that stops the training once the model reaches a certain accuracy. After every epoch it checks the validation accuracy and, based on a condition, decides whether training should continue or stop.

In [11]:
import tensorflow as tf
class AccuracyCallback(tf.keras.callbacks.Callback):
    """Callback that stops training on requested accuracy."""
    def __init__(self, target_accuracy=0.997):
        super().__init__()  # let the base class set up its own state
        self.target_accuracy = target_accuracy
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # the key is 'val_acc' in the Keras version used here; newer releases report 'val_accuracy'
        val_acc = logs.get('val_acc', logs.get('val_accuracy'))
        if val_acc is not None and val_acc >= self.target_accuracy:
            print("Target validation accuracy (%f) reached !" % self.target_accuracy )
            self.model.stop_training = True
            
target_accuracy = 0.579
max_epochs = 40
print("\nTraining until accuracy reaches %s or for %i epochs" %(target_accuracy, max_epochs))


Training until accuracy reaches 0.579 or for 40 epochs


We know that steps_per_epoch * batch_size = the number of images going through the model in one epoch. At the same time we want to help the model generalize (this is why we use a generator rather than only the original images) while keeping the task learnable (too many variations of the original data make it hard and slow for the model to learn at all).
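With the values used in this notebook the arithmetic works out as follows; the variable names mirror the ones defined earlier, and the numbers are the ones shown in the cells and outputs.

```python
steps_per_epoch = 400     # value passed to fit_generator in the next cell
train_batch_size = 40
n_train_samples = 725     # original training images

images_per_epoch = steps_per_epoch * train_batch_size
variants_per_image = images_per_epoch / n_train_samples
print(images_per_epoch, round(variants_per_image, 1))
```

So in every epoch each original picture is seen roughly 22 times, each time with a slightly different random augmentation.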

In [12]:
history=model.fit_generator(trainset,
                    steps_per_epoch= 400, # number of batches drawn from the generator per epoch; this controls how many generated images are used (set size divided by batch size would use every image once)
                    validation_data=validationset,
                    validation_steps= 10, # number of validation batches evaluated after each epoch
                    epochs=max_epochs,
                    callbacks=[AccuracyCallback(target_accuracy)]
)



Epoch 1/40
Epoch 2/40
Target validation accuracy (0.579000) reached !


After training it is important to save the model: running the training again, even with exactly the same parameters, will not give the same results, because the generator feeds the model different pictures every time it runs.

In [0]:
model.save('model9')


The whole point of creating a model is for it to predict data for which we don't know the labels. We import them again with the test_generator.  

In [14]:
unknownset = test_generator.flow_from_directory('/content/drive/My Drive/Final assignment/unknown', # the directory (in Colab you can right-click the folder and copy its path)
                                                 target_size = (sqrt_dim, sqrt_dim), # same size as the training images
                                                 batch_size = train_batch_size, # 40 images per batch
                                                 class_mode= 'categorical')

Found 481 images belonging to 1 classes.


In [15]:
# Create a fresh generator with shuffle=False so the predictions stay in the same order as the filenames
test_datagen = ImageDataGenerator(rescale=1./255)

test_generator = test_datagen.flow_from_directory('/content/drive/My Drive/Final assignment/unknown/',  # specify the directory
                                            target_size = (sqrt_dim,sqrt_dim),
                                            batch_size = 1, # one image per step, so steps = nb_samples covers every image exactly once
                                            class_mode = 'categorical',
                                            shuffle = False)
filenames = test_generator.filenames
nb_samples = len(filenames)

yhat = model.predict_generator(test_generator,steps = nb_samples)

Found 481 images belonging to 1 classes.


In [0]:
filenames[1]

'unknown/003358d7a7461d8cdbe9.jpg'

In [0]:
import pandas as pd

# column 0 of the softmax output is the 'fake' class (see trainset.class_indices)
pred = pd.DataFrame({'ID': list(filenames), 'fake': yhat[:, 0]})

In [0]:
pred.to_csv("predictions7.csv",header = True,index=False)
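The saved file holds the predicted probability of the "fake" class for every image. If class names are needed instead, the probabilities can be turned into labels as sketched below. The example rows are made-up softmax outputs, not results from this model, and `predicted_label` is a hypothetical helper.

```python
class_indices = {"fake": 0, "real": 1}   # as reported by trainset.class_indices
index_to_label = {index: label for label, index in class_indices.items()}

def predicted_label(probabilities):
    """Return the class name with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index_to_label[best_index]

example_rows = [[0.8, 0.2], [0.3, 0.7]]  # made-up outputs, for illustration only
labels = [predicted_label(row) for row in example_rows]
print(labels)  # ['fake', 'real']
```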

In [0]:
pred

Unnamed: 0,col
0,unknown/00217efccd6f697eb937.jpg
1,unknown/003358d7a7461d8cdbe9.jpg
2,unknown/008cae7827beecd1aab3.jpg
3,unknown/00b75601a2272c9c8f3e.jpg
4,unknown/01224275d3e25f62a920.jpg
5,unknown/01a1a6b856bffedc304c.jpg
6,unknown/020c00becfcd4e7938af.jpg
7,unknown/045bcecab3b5f0f68836.jpg
8,unknown/046d3ccaf5e4888cff8f.jpg
9,unknown/056907b340072139457a.jpg


In [0]:
model.predict(col[1])

ValueError: ignored