
# Generating Indian Faces with Deconvolutional Networks

---



## Step 1: Creating Indian Face Database

---
Taking the *Extended Yale Face Database* and the *Radboud Faces Database* as references, I created an **Indian Face Database** consisting of images of 28 Indians in different poses and lighting. Each individual was photographed against a light background in the following 8 poses:
> 1. Angry
> 2. Contemptuous
> 3. Disgusted
> 4. Fearful
> 5. Happy
> 6. Neutral
> 7. Sad
> 8. Surprised

> ![Different Poses](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/IndFaces1.jpg)
>>>>>> **Different Poses**

Each pose was photographed under 2 different lighting conditions:
> 1. Ambient
> 2. Dim light

> ![Different lighting](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/IndFaces2.jpg)
>>>>>>> **Different Lighting Conditions**

Each picture is $530 \times 730$ pixels in JPEG format. Each picture is labelled **IndAxx_Pxx_Lxx**, where the xx are *placeholders* for the individual's *identity number*, *pose number* and *lighting condition number* respectively. For example, an image of the individual with identity number 1, in a happy pose under ambient lighting, is labelled IndA01_P05_L01: '01' after IndA is the identity number, 'P05' denotes pose 5 ('happy') and 'L01' denotes lighting condition 1 (ambient).

All the images of a particular individual are placed in one folder labelled **IndAxx**, where xx is a placeholder for the identity number.

The folders of all the individuals are placed inside another folder named **IndianFacesA**.
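
The labelling scheme above can be decoded mechanically. The following is a minimal sketch, assuming two-digit numbers and an optional file extension; `parse_label`, `FaceLabel` and the regular expression are illustrative names, not part of the original notebook code.

```python
import re
from typing import NamedTuple

# Pose and lighting names in the order listed above (numbering starts at 1).
POSES = ["Angry", "Contemptuous", "Disgusted", "Fearful",
         "Happy", "Neutral", "Sad", "Surprised"]
LIGHTINGS = ["Ambient", "Dim light"]

# Assumed pattern: IndAxx_Pxx_Lxx with two-digit numbers and an optional extension.
LABEL_PATTERN = re.compile(r"^IndA(\d{2})_P(\d{2})_L(\d{2})(?:\.\w+)?$")


class FaceLabel(NamedTuple):
    identity: int
    pose: int
    lighting: int


def parse_label(label: str) -> FaceLabel:
    """Split a label such as 'IndA01_P05_L01' into its three numbers."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised label: {label!r}")
    identity, pose, lighting = (int(group) for group in match.groups())
    return FaceLabel(identity, pose, lighting)


label = parse_label("IndA01_P05_L01")
print(label.identity, POSES[label.pose - 1], LIGHTINGS[label.lighting - 1])
# prints: 1 Happy Ambient
```

Keeping the parsing in one small function means that a change to the naming scheme only has to be handled in one place.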

## Fundamental Concept
---

The idea of generating faces with deconvolutional networks is based on the paper [Learning to generate chairs, tables and cars with convolutional networks](https://arxiv.org/abs/1411.5928). The paper shows that if a neural network is trained on a dataset with a model's identity, view, and transformation parameters as input and the corresponding image as output, the following results are obtained:

> 1) As expected, the network could learn all examples by heart and generate them. That is, the network learns to generate 2D projections from high-level description of 3D models.

> 2) Unexpectedly, the network seemed to learn something more as well.

> *   It was able to learn concepts about 3D space.
> *   It was able to learn object structure and transfer that knowledge within the object class. This was evident from the fact that it was able to infer the remaining unseen viewpoints of the object.
> *   It was able to interpolate between objects of the same and of different types.
> *   It was able to randomly generate objects of new styles.


Here, this idea is applied to train a network that generates and interpolates between Indian faces, taking a high-level description of the image as input.
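
To make "interpolating between faces" concrete: one common way to do it is to blend two input vectors linearly and feed each blend to the trained network. The sketch below shows only the blending step, assuming identities are represented as vectors of length 28 (one entry per individual); `interpolate_vectors` is a hypothetical helper, and whether the notebook's generation code blends inputs in exactly this way is an assumption.

```python
import numpy as np


def interpolate_vectors(start, end, num_steps):
    """Return num_steps vectors moving linearly from start to end."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    weights = np.linspace(0.0, 1.0, num_steps)
    # Each row is (1 - t) * start + t * end for one blend weight t.
    return np.stack([(1.0 - t) * start + t * end for t in weights])


# Two identities out of 28, each marked by a single 1 in its own position.
identity_a = np.eye(28)[0]
identity_b = np.eye(28)[1]

steps = interpolate_vectors(identity_a, identity_b, num_steps=5)
print(steps[:, :2])  # first two entries of each blended identity vector
```

The middle row weights both identities equally, so a network that has learned a smooth mapping would be expected to produce a face that sits between the two individuals.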

## Step 2: Creating the Model

### Model Description

---

### Input

The network takes as input a high-level description of the image, which includes:
> *  Identity vector, c
> *  Pose vector of face, p
> *  Lighting vector, l

Thus, the dataset is $D = \{(c_1,p_1,l_1),...,(c_n,p_n,l_n)\}$


### Output

The output of the network is $O = \{x_1,...,x_n \}$, where $x_i$ is the RGB image of the face described by $(c_i,p_i,l_i)$.

### Transformation of Indian Face Database to Network-inputtable Form

Instead of fetching image data inside the training or model-creation code, we can write a separate class for the database. This minimizes code changes if the database or the network architecture changes, and it increases reusability.

The database which is a collection of images is represented by the **"IndianInstances"** class. To represent each image in the database, we create a class **"IndianInstance"**. These two classes help in transforming the database into a form which can be passed to the network as training inputs and outputs.

The **"IndianInstances"** class takes the directory location of the database as input and stores it as the attribute *directory*. In our case, since we stored the database in drive under the name "IndianFacesA", we pass the value "drive/IndianFacesA".

It then fetches all subfolders and creates an instance of **"IndianInstance"** for each image. In the **"IndianInstance"** class, we create the training input and output matrices as follows:


**Formation of training input**

Since all the images are labelled with identity number, pose number and lighting condition number, we can form the input from each image's label as follows:

> 1. Extract identity number from image label and transform that number into one-hot encoding. This encoded array will be passed as identity vector, c in input.

> 2. Extract pose number from image label and transform that number into one-hot encoding. This encoded array will be passed as pose vector, p in input.

> 3. Extract lighting condition number from image label and transform that number into one-hot encoding. This encoded array will be passed as lighting vector, l in input.
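
The three encoding steps above can be sketched as follows. This is a minimal illustration, assuming 28 identities, 8 poses and 2 lighting conditions as described for the database, and assuming the numbers in the labels start at 1; `one_hot` and `encode_label` are hypothetical helper names.

```python
import numpy as np

NUM_IDENTITIES = 28  # individuals in the database
NUM_POSES = 8        # Angry ... Surprised
NUM_LIGHTINGS = 2    # Ambient, Dim light


def one_hot(number, num_classes):
    """Encode a 1-based number as a one-hot vector of length num_classes."""
    if not 1 <= number <= num_classes:
        raise ValueError(f"{number} is outside 1..{num_classes}")
    vector = np.zeros(num_classes, dtype=np.float32)
    vector[number - 1] = 1.0
    return vector


def encode_label(identity, pose, lighting):
    """Return the (c, p, l) input vectors for one image."""
    return (one_hot(identity, NUM_IDENTITIES),
            one_hot(pose, NUM_POSES),
            one_hot(lighting, NUM_LIGHTINGS))


# Identity 1, pose 5 (happy), lighting 1 (ambient), i.e. IndA01_P05_L01.
c, p, l = encode_label(identity=1, pose=5, lighting=1)
print(c.shape, p.shape, l.shape)
print(p)
```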


**Formation of training output**

The image itself is the output: the matrices of pixel values in the 3 channels (RGB) form the training output.
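
As a rough illustration of the directory traversal described in this section, the sketch below collects the image files found one level below a database folder. It is not the notebook's actual class: `list_image_paths` is a hypothetical function, the `.jpg`/`.jpeg` filter is an assumption, and reading the pixel values (for example with an image library) is deliberately left out. A temporary toy folder stands in for "IndianFacesA" so that the example runs anywhere.

```python
import os
import tempfile


def list_image_paths(directory):
    """Return sorted paths of the JPEG files found one level below directory."""
    paths = []
    for folder in sorted(os.listdir(directory)):
        folder_path = os.path.join(directory, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files next to the per-individual folders
        for filename in sorted(os.listdir(folder_path)):
            if filename.lower().endswith((".jpg", ".jpeg")):
                paths.append(os.path.join(folder_path, filename))
    return paths


# Build a toy layout: <root>/IndA01/IndA01_P05_L01.jpg, and so on.
with tempfile.TemporaryDirectory() as root:
    for identity in (1, 2):
        folder = os.path.join(root, f"IndA{identity:02d}")
        os.makedirs(folder)
        for pose in (5, 6):
            name = f"IndA{identity:02d}_P{pose:02d}_L01.jpg"
            open(os.path.join(folder, name), "wb").close()  # empty placeholder
    found = [os.path.basename(path) for path in list_image_paths(root)]

print(found)
```

Each path found this way would then yield one training pair: the one-hot vectors derived from the file name as input, and the decoded RGB pixel array as output.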


### Network Architecture

The next question is what the network should look like: specifically, how many layers it should have and of what kind. To build intuition, consider a classification network. A classification model usually maps an image to a class (i.e., a high-level description of the object). The network consists of convolution layers followed by pooling, and then fully connected layers for the mapping. The figure below shows such a classification network.



> ![Classification network](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-08-at-2-26-09-am.png?w=748)



So, in order to generate image from a high-level description, we can perform the above steps in reverse.

> 1) First, each of the 3 inputs is passed through a fully connected layer. Then, they are combined and passed through a fully connected layer.

> 2) Then we need to expand, or unpool, the result of the above layers to increase its dimensionality. For this, upsampling is performed: each entry of a feature map is replaced by an $s \times s$ block with the entry value in the top-left corner and dummy values (usually 0) elsewhere. This increases the width and height of the feature map by a factor of $s$. For our network, we take $s=2$.

> 3) Then convolution is performed.

Combining 2) and 3) is like dotting the grid with dabs of paint and then spreading and mixing them with convolution kernels. Upsampling followed by convolution is usually referred to as "upconvolution".
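
Steps 2) and 3) can be tried out on a tiny single-channel feature map. The sketch below is a plain NumPy illustration, not the network's implementation: `upsample_with_zeros` and `convolve_same` are hypothetical helpers, and a $3 \times 3$ kernel of ones is used only to keep the numbers easy to follow.

```python
import numpy as np


def upsample_with_zeros(feature_map, s=2):
    """Replace each entry by an s x s block: the value top-left, zeros elsewhere."""
    height, width = feature_map.shape
    upsampled = np.zeros((height * s, width * s), dtype=feature_map.dtype)
    upsampled[::s, ::s] = feature_map
    return upsampled


def convolve_same(image, kernel):
    """Naive 2D sliding-window filter with zero padding; output size equals input size."""
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))  # pads with zeros
    output = np.zeros(image.shape, dtype=float)
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            window = padded[row:row + k_h, col:col + k_w]
            output[row, col] = np.sum(window * kernel)
    return output


feature_map = np.array([[1.0, 2.0],
                        [3.0, 4.0]])

upsampled = upsample_with_zeros(feature_map, s=2)   # 2x2 -> 4x4, mostly zeros
spread = convolve_same(upsampled, np.ones((3, 3)))  # "spreads the paint"

print(upsampled)
print(spread)
```

After upsampling, the four original values sit alone among zeros; after the convolution, every position receives a mix of the values near it, which is the "spreading" described above.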

> ![Upconvolution](https://zo7.github.io/img/2016-09-25-generating-faces/deconv.png)

---
**Upconvolution Decoded**


![Upconv](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/upconv.gif)

---



Adding another convolution after the upconvolution has been shown experimentally to improve the quality of generated images. Together, they are referred to here as a deconvolution layer.


These deductions give rise to the "1s-S-deep" model from the paper (shown in Figure 1 below), which is used to build the network.

> ![1s-S-model of Chairs tables](https://zo7.github.io/img/2016-09-25-generating-faces/chairs-model.png)
>>>>>>**Figure 1: 1s-S-model of Chairs tables experiment**


The transformed model (Figure 2), which we will use for our purpose, has the segmentation part removed.


> ![1s-S-model for Indian faces](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/DeconvFaceArch.jpg)
>>>>>>**Figure 2**

In our network, all fully connected (FC) layers are replaced by convolution layers so that the input size need not be fixed. Since the FC layers here are applied directly to the inputs, we first need to reshape each input, because convolution requires 3 dimensions while the input has 1. Each input is reshaped from (len) to (height, width, num_channel), where len is the length of that input vector, height=1, width=1 and num_channel=len. Then, a convolution with 512 filters is applied to each input. The results are combined and then convolved using 1024 filters.
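
The reason a convolution can stand in for an FC layer here is that, on a $1 \times 1$ spatial grid, a $1 \times 1$ convolution computes exactly the same weighted sums as a dense layer. The NumPy sketch below checks this for a pose vector of length 8 and 512 filters; the random weights, the bias term and the $1 \times 1$ kernel size are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

length = 8          # e.g. the pose vector has 8 entries
num_filters = 512   # filters applied to each input

vector = np.zeros(length)
vector[4] = 1.0     # one-hot pose vector for pose 5

# One shared set of weights, used both as a dense layer and as a 1x1 convolution.
weights = rng.normal(size=(length, num_filters))
bias = rng.normal(size=num_filters)

# Fully connected layer: (len,) -> (512,)
dense_output = vector @ weights + bias

# Reshape (len,) -> (height=1, width=1, num_channel=len), then apply the same
# weights at the single spatial position, as a 1x1 convolution would.
reshaped = vector.reshape(1, 1, length)
conv_output = np.einsum("hwc,cf->hwf", reshaped, weights) + bias

print(conv_output.shape)
print(np.allclose(conv_output[0, 0], dense_output))
```

The two outputs agree, so the reshaping changes only the layout of the data, not the computation.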

For our network, we use 5 deconvolution layers, excluding the output layer, because that is the maximum number we could use in Google Colab without exhausting its resources. In each layer, $2 \times 2$ upsampling is followed by a $5 \times 5$ convolution (together forming the upconvolution), followed by another $3 \times 3$ convolution. The output of every convolution is passed through LeakyReLU to introduce non-linearity into the network. Batch normalization is performed at the end of each deconvolution layer to ensure that the LeakyReLU activations behave properly.

In the last deconvolution layer, a $3 \times 3$ convolution with 3 filters is performed after the upconvolution+convolution, in order to create a 3-channel output image.
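
Because every deconvolution layer doubles the width and height, 5 layers enlarge the feature map by a factor of $2^5 = 32$. The sketch below traces the spatial size through the layers, assuming the convolutions preserve the spatial size (padding chosen accordingly) and assuming a hypothetical $4 \times 4$ starting feature map; the starting size actually used by the network is not stated here.

```python
def deconv_output_shapes(initial_height, initial_width, num_layers=5, scale=2):
    """List the (height, width) before and after each deconvolution layer."""
    shapes = [(initial_height, initial_width)]
    for _ in range(num_layers):
        height, width = shapes[-1]
        # Upsampling scales both dimensions; the convolutions are assumed
        # to keep the spatial size unchanged.
        shapes.append((height * scale, width * scale))
    return shapes


shapes = deconv_output_shapes(4, 4)  # 4x4 is a hypothetical starting size
for index, (height, width) in enumerate(shapes):
    print(f"after {index} deconvolution layer(s): {height} x {width}")
```

With these assumptions the final feature map is $128 \times 128$, which shows why each additional layer quadruples the number of output pixels and quickly raises memory use.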


## Step 3: Training the Model

In [0]:
## Training for 50 epochs

train_model("drive/IndianFacesA (871e7db6)",
            "drive/IndianFacesAOutput",
            num_epochs=50
            )

In [0]:
## Training for 100 epochs

train_model("drive/IndianFacesA (871e7db6)",
            "drive/IndianFacesAOutput",
            num_epochs=100
            )

In [0]:
## Training for 200 epochs

train_model("drive/IndianFacesA",
            "drive/IndianFacesAOutput",
            num_epochs=200
            )

In [0]:
## Training for 300 epochs

train_model("drive/IndianFacesA",
            "drive/IndianFacesAOutput",
            num_epochs=300
            )

In [0]:
## Training for 500 epochs

train_model("drive/IndianFacesA",
            "drive/IndianFacesAOutput",
            num_epochs=500
            )

In [0]:
## Training for 600 epochs

train_model("drive/IndianFacesA",
            "drive/IndianFacesAOutput",
            num_epochs=600
            )

In [0]:
## Training for 500 epochs with stochastic gradient descent

train_model("drive/IndianFacesA",
            "drive/IndianFacesAOutput",
            num_epochs=500,
            optimizer='sgd'
            )

## Step 4: Generate Faces from Trained Model
---

### Image Generation
-----



In [0]:
generate_from_yaml("drive/single2.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e50.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_e50_Single2")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e200.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_e200_Single")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e300.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_e300_Single")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e500.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_epoch500_Single")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesOutput/FaceGen.IndianFaces.model.d5.e600.adam.h5", "drive/IndianFacesGeneratedOutput_Single_e600")

In [0]:
generate_from_yaml("drive/random.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e600.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_epoch600_Random")

In [0]:
generate_from_yaml("drive/interpolate.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e500.adam.h5", "drive/IndianFacesAGeneratedOutput_d5_adam_epoch500_Interpolate_Id2")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesOutput/FaceGen.IndianFaces.model.d5.e300.sgd.h5", "drive/IndianFacesGeneratedOutput_Single_sgd_e300")

In [0]:
generate_from_yaml("drive/single.yaml", "drive/IndianFacesAOutput/FaceGen.IndianFaces.model.d5.e500.sgd.h5", "drive/IndianFacesAGeneratedOutput_d5_sgd_epoch500_Single")
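
The model files passed to `generate_from_yaml` above appear to follow a naming pattern: `d5` presumably stands for the 5 deconvolution layers, `e50` for the number of epochs, and `adam` or `sgd` for the optimizer. The helper below rebuilds such a path from those three values. It is a hypothetical convenience function inferred from the file names in the cells above, not something provided by the notebook.

```python
def model_path(output_dir, num_epochs, optimizer="adam", num_deconv=5):
    """Build a model file path following the naming pattern seen in the cells above."""
    # Pattern inferred from file names such as
    # FaceGen.IndianFaces.model.d5.e50.adam.h5 (an assumption, not a documented API).
    filename = f"FaceGen.IndianFaces.model.d{num_deconv}.e{num_epochs}.{optimizer}.h5"
    return f"{output_dir}/{filename}"


print(model_path("drive/IndianFacesAOutput", 50))
print(model_path("drive/IndianFacesAOutput", 500, optimizer="sgd"))
```

Building the path from the training settings makes it harder to pair a generation run with the wrong model file.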

### Analysis of Result

----

1. The same network architecture can generate faces of people of different origins or nationalities. The network that was used to generate Yale, Radboud and JAFFE faces can now generate Indian faces.

2. As the number of training epochs increases, the clarity of the images increases.

Following are the single images generated from the trained model:

> 1. **Trained for 5 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/GenSingleImageEpoch5.jpg)

> 2. **Trained for 10 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/GenSingleImageEpoch10.jpg)

> 3. **Trained for 20 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/GenSingleImageEpoch20.jpg)

> 4. **Trained for 50 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/SingleImageEpoch50.jpg)

> 5. **Trained for 100 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/SingleImageEpoch100.jpg)

> 6. **Trained for 200 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/SingleImageEpoch200.jpg)

> 7. **Trained for 300 epochs:**
>>![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/SingleImageEpoch300.jpg)

> 8. **Trained for 500 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/GenSingleImageEpoch500.jpg)


3. Changing the optimizer produces different images from a partially trained network. With Adam, the image looks as if it has been blurred. With stochastic gradient descent, the image looks like an abstract painting.

> 1. **Trained for 50 epochs:**
>>![Sgd](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/IndianFacesGeneratedOutput_Single_sgd_e300.jpg)

> 2. **Trained for 500 epochs:**
>> ![alt text](https://raw.githubusercontent.com/GeniGaus/MLBlr/master/assets/SingleImageEpoch500sgd.jpg)


### Areas to dig further

----

1. I would like to see how the network performs when trained on the combined data of all the databases, i.e., people of different origins, races and nationalities.

2. Here I used identity, pose and lighting as parameters. With more time, I would try orientation and occlusion parameters, then more than 3 parameters at once, up to all of identity, pose, lighting, orientation and occlusion combined. I would then compare the results and generate faces with unseen orientation and occlusion parameters.

3. Train the other models specified in the paper and compare their results.

4. The paper mentions that the effect of data augmentation is the same as that of increasing the training set size: both lead to generalization of features but poorer reconstruction of finer details. I would like to test this by adding augmentation, increasing the training set size for an individual, and comparing the results.

5. Extending this idea of feature generalization through data augmentation, can we generalize the features of people belonging to a particular family? I would like to test this by adding another parameter, family, and training the network to see the result. The idea can be extended further to race and nationality.