# Face to Cartoon with CycleGAN

## Group Project in Advanced Topics of Machine Learning (FS2019)

##### Guodong Zeng, Benjamin Fankhauser, Jan Segessenmann, Gautam Ilango

### Introduction 

Our goal is to transfer image style from real faces to cartoon faces using unpaired data only. Important features of the real face should be preserved and remain recognizable in the generated cartoon face.  
As a starting point, we use an implementation of CycleGAN, introduced by Jun-Yan Zhu et al. [1]. CycleGAN achieves impressive results in learning the mapping between different image styles (e.g. paintings $\to$ photos, zebras $\to$ horses or summer $\to$ winter, and vice versa).  



To incorporate additional prior knowledge into the architecture, we add another loss term based on landmark predictions for both the real faces and the cartoon faces.
To extend the usefulness of our model, we additionally provide a one-to-ten mapping from real faces to cartoon faces: we make the whole architecture conditional and train the models with ten different hair colors, so the user is free to choose a hair color.


### Materials and Methods
#### DCGAN
A Deep Convolutional Generative Adversarial Network (DCGAN) learns a mapping from Gaussian noise to the target domain using an adversarial loss. Its deep deconvolutional architecture has proven successful in many settings where artificial images have to be generated.
We use this method as a baseline for our results, based on the code provided by [pytorch/dcgan](https://github.com/pytorch/examples/tree/master/dcgan).

#### CycleGAN
To achieve the unpaired style transfer from real faces to cartoon faces, we have to learn the mapping function $G: A \to B$, with domain $A$ being real faces and domain $B$ being cartoon faces, such that the distribution of $G(A)$ is as similar as possible to the distribution of $B$. This is achieved by minimizing an adversarial loss. Since this mapping is highly under-constrained, Jun-Yan Zhu et al. introduced a second, coupled mapping $H: B \to A$. The main idea is to minimize the cycle-consistency loss $L(A, H(G(A)))$ in addition to the standard adversarial losses.  
![scheme](doc/images/cycleGAN_scheme_extended.png)  
The figure above is partly adapted from Jun-Yan Zhu et al. [2] and shows the method schematically. The reverse cycle, starting in $B$, is not shown.  
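
Written out in the notation above, the objective can be sketched as follows. The discriminators $D_A$ and $D_B$, the weight $\lambda_{cyc}$ and the $L_1$ norm for the cycle term are assumptions taken from the usual CycleGAN formulation rather than from our text, and implementations often replace the log-likelihood terms by a least-squares loss.

$$
\begin{aligned}
L_{GAN}(G, D_B) &= \mathbb{E}_{b \sim B}\left[\log D_B(b)\right] + \mathbb{E}_{a \sim A}\left[\log\left(1 - D_B(G(a))\right)\right] \\
L_{cyc}(G, H) &= \mathbb{E}_{a \sim A}\left[\lVert H(G(a)) - a \rVert_1\right] + \mathbb{E}_{b \sim B}\left[\lVert G(H(b)) - b \rVert_1\right] \\
L(G, H, D_A, D_B) &= L_{GAN}(G, D_B) + L_{GAN}(H, D_A) + \lambda_{cyc} \, L_{cyc}(G, H)
\end{aligned}
$$

The generators $G$ and $H$ minimize $L$, while the discriminators $D_A$ and $D_B$ maximize it.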

[The following would be the short visualization:]
![scheme_short](doc/images/cycleGAN_scheme.png)
#### Datasets
[...] (Jan)

#### Landmark labeling


### Related Approaches
#### DCGAN
The vanilla DCGAN leads to the following result on the cartoon dataset.

![dcgan](doc/images/dcgan-fake-sample.png)

The generator has learned to combine various features of the cartoon dataset into new faces, but the results are not acceptable: higher-level structure, such as a consistent skin color across the whole face, has not been learned. The generator also produces women with beards, although we cannot blame it for that, since the cartoon dataset contains women with beards too. More importantly, the input of the generator is Gaussian noise rather than a real face image, so DCGAN does not solve our task.

#### Vanilla CycleGAN
Next, we consider the vanilla [CycleGAN](https://junyanz.github.io/CycleGAN/). The paper
promises a more constrained setting, which should lead to better results.
The first results are acceptable, but we discovered two problems:
- Mode collapse: the generator only produces a few modes of the original cartoon dataset.
- Liveliness: the generated cartoons look exactly like the cartoon data, but we want something more human-like. Humans turn and move their heads and eyes, and we would like to carry this over to the generated cartoons so that the input face and the generated cartoon correspond more closely.


### Landmark Loss in CycleGAN
We want to enforce more correspondence between the cartoon images and the real faces. As both domains are faces, we enforce correspondence on their landmarks: a generated image (fake) should preserve the landmarks of the input image.

To incorporate this prior knowledge about the domains, we introduce the landmark (LD) loss. The landmark information of the real faces should be preserved through the generator:

![CycleGAN with Landmarkloss](doc/images/cyclegan-ldnet.png)

The images require more preprocessing and labeling but the faces and cartoons are still unpaired. 

#### Implementation
The landmark loss is computed by a simple convolutional network with five convolutional layers followed by two layers for the final regression. We call it LDNet; it is inspired by several papers on landmark detection networks in general. We did not find a paper on landmark detection in cartoon images, but we expect a standard architecture to handle this task as well.

LDNet outputs the coordinates of five landmarks. The final loss is the mean squared error between the predicted and the given landmarks.

![ldnet](doc/images/ldnet.png)
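
As a minimal, dependency-free sketch of the mean squared error described above (not the project's actual code): it assumes each landmark is an `(x, y)` pair, and the function name `landmark_mse` and all coordinate values are hypothetical. In PyTorch the same quantity is typically obtained with `torch.nn.MSELoss`.

```python
def landmark_mse(predicted, target):
    """Mean squared error over all landmark coordinates.

    Both arguments are lists of (x, y) pairs, one pair per landmark.
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("need the same, non-zero number of landmarks")
    squared_errors = [
        (p - t) ** 2
        for (px, py), (tx, ty) in zip(predicted, target)
        for p, t in ((px, tx), (py, ty))
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical normalized coordinates for five landmarks (illustrative only).
target = [(0.30, 0.40), (0.70, 0.40), (0.50, 0.55), (0.35, 0.75), (0.65, 0.75)]
# The prediction is off by 0.1 in a single y coordinate.
predicted = [(0.30, 0.40), (0.70, 0.40), (0.50, 0.65), (0.35, 0.75), (0.65, 0.75)]
loss = landmark_mse(predicted, target)
```

Because the error is averaged over all ten coordinates, a single misplaced landmark contributes only a small amount to the loss.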

#### Training and dataset
Like the discriminators, the landmark detectors are trained after each generator pass. We have two instances: the landmark detector for real faces and the landmark detector for cartoon faces. Both are trained on the real (not generated) images from the dataset. Each update uses only one sample, so convergence is slow. However, since CycleGAN itself requires a lot of training, the convergence of the landmark detection networks was not a problem compared to the huge generator networks.

For the real faces, we labeled 1000 images by hand (Jan did that!). The cartoon faces have their five landmarks aligned at the same positions, so we got the cartoon landmarks for free. However, to train a meaningful LDNet (static positions would be too easy to learn), we had to implement a random crop that keeps the landmarks correct after cropping. This is the main reason for our own "facedataset" implementation.
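
The idea of a landmark-preserving random crop can be sketched as follows. This is an illustration under assumptions, not the code of our "facedataset" class: the function name, the square crop, the integer pixel coordinates and the image size in the usage example are all hypothetical.

```python
import random


def random_crop_with_landmarks(width, height, crop_size, landmarks, rng=random):
    """Pick a random square crop that contains every landmark.

    Returns the crop box (left, top, right, bottom) and the landmarks
    expressed in the coordinate system of the cropped image.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    # The left/top edge must not lie beyond the smallest landmark coordinate,
    # and the crop must reach far enough to include the largest one.
    left_min = max(0, max(xs) - crop_size + 1)
    left_max = min(width - crop_size, min(xs))
    top_min = max(0, max(ys) - crop_size + 1)
    top_max = min(height - crop_size, min(ys))
    if left_min > left_max or top_min > top_max:
        raise ValueError("no crop of this size contains all landmarks")
    left = rng.randint(left_min, left_max)
    top = rng.randint(top_min, top_max)
    # Shifting by the crop origin keeps the labels correct after cropping.
    shifted = [(x - left, y - top) for x, y in landmarks]
    return (left, top, left + crop_size, top + crop_size), shifted


# Hypothetical image size and landmark positions in pixels.
landmarks = [(60, 70), (110, 70), (85, 95), (65, 120), (105, 120)]
box, cropped_landmarks = random_crop_with_landmarks(178, 218, 128, landmarks)
```

The returned box uses the `(left, upper, right, lower)` order that image libraries such as Pillow expect for cropping, so the same offsets can be applied to the image and to its labels.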

During training, the generator receives its gradients from the discriminator, the cycle loss and the landmark detection. The balance of the lambdas, which weight the individual losses, is very important, because the landmark loss can push the generator into a state where it is very easy for the discriminator to separate real images from fakes. Experiments have shown that a small landmark lambda of 0.01 helps with the correspondence without destroying the overall learning (the discriminator loss has a lambda of 1, the cycle loss a lambda of 10).
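
The weighting can be sketched as a plain weighted sum. The lambda values are the ones reported above; the function name and the example loss values are hypothetical and only illustrate the relative scale of the terms.

```python
# Loss weights reported in the text above.
LAMBDA_GAN = 1.0
LAMBDA_CYCLE = 10.0
LAMBDA_LANDMARK = 0.01


def generator_loss(gan_loss, cycle_loss, landmark_loss):
    """Weighted sum of the three loss terms that drive the generator."""
    return (
        LAMBDA_GAN * gan_loss
        + LAMBDA_CYCLE * cycle_loss
        + LAMBDA_LANDMARK * landmark_loss
    )


# Hypothetical per-batch loss values, chosen only for illustration.
total = generator_loss(gan_loss=0.5, cycle_loss=0.2, landmark_loss=3.0)
```

With these example values the landmark term contributes 0.03 to a total of 2.53, which illustrates how it can nudge the generator without dominating the adversarial and cycle terms.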


#### Results

Using a lambda of 0.01 for the landmark loss we see a correspondence in the generated fake:

![landmark correspondences](doc/images/landmark-correspondences.png)


It is hard to tell what exactly the impact of the landmark loss on the generator is. But we can show what we generate with and without landmark loss.

![landmark vs original](doc/images/landmark-vs-original.png)

The first column shows a test image and the second column the corresponding fake generated with the landmark loss enabled. This fake does not look exactly like the original cartoons, which are shown in the third column: it captures some of the facial expression of the real image (e.g. viewing direction, perspective scaling of the right eye of the girl at the bottom). The fourth column shows fakes generated without the landmark loss; they capture more of the original cartoon distribution (e.g. the unnatural, direct look into the camera).







### Conditions on Hair Color

After training CycleGAN, we only get a single cartoon image for each input face image, and we have no control over attributes of the generated cartoon such as hair color or the presence of glasses. Based on this observation, we want to be able to control the hair color of the generated cartoon.

#### Conditional GAN


[Conditional GAN](https://arxiv.org/pdf/1411.1784.pdf) shows that a model can generate MNIST digits conditioned on class labels. The main idea is to feed the class label as an additional input to both the generator and the discriminator, as illustrated below:  
![Standard GAN and conditional GAN architectures](https://www.researchgate.net/profile/Alptekin_Temizel/publication/326928177/figure/fig4/AS:736665580609547@1552646169957/a-Standard-GAN-and-b-conditional-GAN-architectures.ppm)



In our experiment, we feed the hair color class label as an additional input to the generator and discriminator of CycleGAN, instead of using only the real face image as input. Given a real face image, we can then generate cartoon images with different hair colors by varying the hair color class label.
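
One common way to feed a class label into a convolutional network is to append one constant image plane per class to the input channels. The sketch below assumes this scheme; it is not necessarily how our model injects the label. It uses plain nested lists to stay dependency-free, and the function names are hypothetical. With PyTorch tensors, the same concatenation would be done with `torch.cat` along the channel dimension.

```python
NUM_HAIR_COLORS = 10  # the ten hair color classes used in this project


def one_hot(label, num_classes=NUM_HAIR_COLORS):
    """Return a one-hot list for an integer class label."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def add_label_channels(image, label, num_classes=NUM_HAIR_COLORS):
    """Append one constant plane per class to a channel-first image.

    `image` is a nested list of shape [channels][height][width]; the
    result has `channels + num_classes` channels and leaves `image` intact.
    """
    height, width = len(image[0]), len(image[0][0])
    planes = [
        [[value] * width for _ in range(height)]
        for value in one_hot(label, num_classes)
    ]
    return image + planes


# A tiny 3-channel 4x4 dummy image, conditioned on hair color class 2.
dummy_image = [[[0.5] * 4 for _ in range(4)] for _ in range(3)]
conditioned = add_label_channels(dummy_image, label=2)
```

Running the generator ten times on the same face, once per label, then yields the ten hair color variants.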


#### Training Dataset
We have images from two domains: real faces (1000 images) and cartoon faces. For the cartoon faces, we have 200 images for each of the 10 hair colors, so in total we use 2,000 cartoon face images for training. The model is trained in an unpaired way.


#### Results

Given one real face image as input, we can generate 10 cartoons with different hair colors.  
![Cartoons generated for ten hair colors](https://github.com/fs2019-atml/face-to-cartoon/raw/master/doc/images/ConditionalOutput.png)


However, the model does not perform as well on test images with glasses. The reason may be that the training set contains only a small number of real face images with glasses.  
![Result on a test image with glasses](https://github.com/fs2019-atml/face-to-cartoon/raw/master/doc/images/GlassResult.png)


### Improved cycleGAN
[...] (Gautam, with example!)

### Conclusion and Outlook
[...] (Gautam)

### Bibliography
[1]  
[2] Figure is available at: https://github.com/eriklindernoren/PyTorch-GAN#cyclegan
[3]  
[...] (everyone who needs citations etc.)