update examples/neural_style_transfer.py #3347

merged 1 commit into from Jul 29, 2016




Changes made to the example to make it more consistent with the documented usage of the VGG16 weights. Detailed discussion can be found at #3340. The following changes are made to images before they are passed through VGG:

  1. Convert RGB to BGR
  2. Mean-Centering
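The two preprocessing steps can be sketched in NumPy as follows. This is a minimal sketch, not the PR diff itself. It assumes images arrive as H×W×3 RGB arrays in the 0-255 range and uses the per-channel ImageNet means commonly quoted for the Caffe-trained VGG16 weights. The function names are hypothetical.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as commonly used with Caffe-trained VGG16 weights.
VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_for_vgg(rgb_image):
    """(H, W, 3) RGB image in [0, 255] -> mean-centered BGR float32 array."""
    bgr = rgb_image[..., ::-1].astype(np.float32)  # 1. convert RGB to BGR
    return bgr - VGG_MEAN_BGR                       # 2. mean-center each channel

def deprocess_from_vgg(x):
    """Inverse of preprocess_for_vgg: add the means back, BGR -> RGB, round to uint8."""
    rgb = (x + VGG_MEAN_BGR)[..., ::-1]
    return np.clip(np.rint(rgb), 0, 255).astype(np.uint8)
```

The generated image lives in the preprocessed space during optimization, so both steps have to be undone before saving. Skipping either one gives swapped colors or a constant color cast in the output.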
fchollet merged commit d06e375 into fchollet:master Jul 29, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
kfinla commented Jul 29, 2016

Are there any good ways to improve the performance of this? Today I was playing with moving this over to VGG19 but have not been successful yet. I find that the loss value improvement flattens out pretty quickly, with very little improvement after about the 60th epoch, and the image is pretty much as good as it's going to get by about 20 iterations. Do larger resolutions help vs. the 400x400 default? Is the only option really to go back and train a larger VGG16 or VGG19 conv-net on ImageNet with more parameters, and perhaps more dropout?


Hi @kfinla, I haven't tried VGG19, but in my experience the different loss weights can be played with to generate different results. You may also want to check the work by @titu1994 for other parameters.

titu1994 commented Jul 30, 2016

@kfinla I have tested the code in my project on over 100 images while altering various parameters and models (VGG-16 / VGG-19), and even switched between Network.py (a replica of neural_style_transfer.py) and INetwork.py (the same, with 4 of the major improvements suggested in the paper Improving the Neural Algorithm of Artistic Style).

Some insights I have derived from these experiments:

  • The results from VGG-19 are very similar to those from VGG-16. Upon visual inspection, I cannot differentiate between images created with VGG-19 and with VGG-16. Perhaps I need to run a few tests with 1000 epochs for VGG-19 and VGG-16 to see if I can differentiate between them.
  • As you have found, loss value improvement flattens out in less than 50 epochs. However, there is a major difference in the actual loss values when using pool_mode="ave" vs pool_mode="max".

In the case of average pooling, the loss value drops drastically over the first 15 epochs and then starts flattening out over the next 20. After 35-45 epochs, the loss value changes only in its last 4 digits. The actual loss value is often 1/10th or even 1/25th of the loss value obtained with max pooling.

In the case of max pooling, the style is applied more gradually, and often more accurately. The loss value drops drastically over the first 22-25 epochs and then flattens out over the next 35-45 epochs. After roughly 65 epochs, the loss value changes only in its last 4 digits. However, the actual loss value is far higher than with average pooling, even though the results are often better with max pooling.
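For reference, the pool_mode switch changes which reduction the VGG pooling layers apply. A minimal NumPy sketch of 2×2, stride-2 pooling in both modes follows, assuming a single-channel H×W feature map with even dimensions. The function name is hypothetical; the "max"/"ave" strings simply mirror the pool_mode values above.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2 on an (H, W) array; H and W must be even."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2x2 windows: shape (H/2, 2, W/2, 2).
    windows = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return windows.max(axis=(1, 3))   # keep the strongest activation per window
    if mode == "ave":
        return windows.mean(axis=(1, 3))  # average all four activations per window
    raise ValueError("mode must be 'max' or 'ave'")

fm = np.array([[1., 2., 0., 0.],
               [3., 4., 0., 8.],
               [0., 0., 5., 5.],
               [0., 0., 5., 5.]])
print(pool2x2(fm, "max"))  # values: [[4, 8], [0, 5]]
print(pool2x2(fm, "ave"))  # values: [[2.5, 2], [0, 5]]
```

With max pooling, only the largest input in each window receives gradient during the backward pass. With average pooling, all four inputs do. This is one plausible reason the two modes converge so differently.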

  • A Gram matrix size (image size) greater than 400x400 introduces boundary distortions in the final image, and the style is applied more randomly. The greater the image size, the worse the results get. It is better to choose 400x400 and then upscale via an Image Super Resolution implementation (there is a link to one such implementation for Windows at the end of my project page).
  • I have performed multiple tests with 1000 epochs, as suggested in the paper. When visually comparing these results with those after 100 iterations, the loss value differences are minor, but the difference in the final image is immediately noticeable. So while it may appear that the loss value improvement is flattening, the result is still improving appreciably. However, for simple tests, 100 epochs generally suffice to generate a high-quality result.
  • I use the conv5_2 layer as the content layer instead of the original paper's conv4_2. I have performed many tests comparing the results of this change after 100 epochs. The results from conv5_2 are more visually pleasing almost all of the time.
  • Because conv5_2 is used as the content layer, the content weight and style weight have to be modified. There are rare cases where content weight = 0.025 and style weight = 1 are sufficient (usually when the style image is "The Starry Night" by Vincent van Gogh). However, in most cases the style overpowers the content. As such, I have found that changing the content : style weight ratio to 1 : 0.1, 1 : 0.05, or sometimes 1 : 0.01 is very effective in styling the image without destroying the content.
  • The original implementation and the paper both suggest using random noise as the initial image, but after several tests I have found that random noise produces very grainy final images, even after 1000 epochs. Therefore, by default my script uses the content image as the initial image. This produces very clean and crisp outputs, and also helps with fast convergence.

As a side note, I have also tried using the style image as the initial image, and found that while the end result is similar, it is not the same, and convergence is slower than with the content image as the initial image.
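The content/style weighting and the choice of initial image can be sketched with NumPy stand-ins for VGG activations. This is a minimal sketch under stated assumptions, not titu1994's code. The style loss uses the 1/(4·C²·(H·W)²) normalization from Gatys et al., the content loss is a plain sum of squared differences, and the helper names and random "feature maps" are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    """features: (H, W, C) activations of one layer -> (C, C) Gram matrix."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)  # one row per spatial position
    return flat.T @ flat               # channel-by-channel correlations

def style_loss(style_feats, comb_feats):
    """Squared Gram-matrix difference, normalized by 4 * C^2 * (H*W)^2 (Gatys et al.)."""
    h, w, c = style_feats.shape
    diff = gram_matrix(style_feats) - gram_matrix(comb_feats)
    return np.sum(diff ** 2) / (4.0 * c ** 2 * (h * w) ** 2)

def content_loss(content_feats, comb_feats):
    """Plain sum of squared differences at the content layer (e.g. conv5_2)."""
    return np.sum((comb_feats - content_feats) ** 2)

def initial_image(mode, content, style=None, seed=0):
    """Starting image for the optimizer, in the same mean-centered space as `content`."""
    if mode == "content":   # the default discussed above: clean, crisp, fast convergence
        return content.copy()
    if mode == "noise":     # original paper / implementation: tends to look grainy
        rng = np.random.default_rng(seed)
        return rng.uniform(-128.0, 128.0, size=content.shape)
    if mode == "style":     # assumes the style image was resized to the content shape
        return style.copy()
    raise ValueError(f"unknown mode: {mode}")

# Content : style = 1 : 0.05, one of the ratios suggested above for a conv5_2 content layer.
content_weight, style_weight = 1.0, 0.05

rng = np.random.default_rng(1)
content_f, style_f, comb_f = (rng.random((8, 8, 16)) for _ in range(3))
loss = (content_weight * content_loss(content_f, comb_f)
        + style_weight * style_loss(style_f, comb_f))
```

The Gram matrix is C×C regardless of spatial size. The H·W factor only enters through the normalization.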

  • The Total Variation Weight has a subtle but important role to play. The implementation in the Keras examples uses tv_weight = 1, but I found that this smooths the images to an extreme degree and the results are not appealing. After several tests, I have found a few values which are well suited to certain cases:
  1. If the content and style have similar colour schemes, use tv_weight = 1E-03
  2. If the content and style have at least one major color similar, use tv_weight = 5E-04
  3. If the content and style have the same background color, use tv_weight = 5E-03
  4. If the content and style do not share the same color palette at all, use tv_weight = 5E-05
  5. If you want relatively crisp images without worrying about color similarity, use tv_weight = 1E-05. It works well about 80% of the time.
  6. If style image is "The Starry Night", use tv_weight = 1E-03 or 1E-05. Other values produce distortions and unpleasant visual artifacts in most cases.
  7. I have tried turning tv_weight off (setting it to 0), but the results were very poor. The image produced is sharp, but lacks continuity and is not visually pleasing. If you want very crisp images at the cost of some continuity in the final image, use tv_weight = 5E-08.
  • Style scale simply multiplies the style weight by the scale. Keeping it constant at 1 and modifying the style weight is sufficient to achieve good results.
  • I have implemented a few improvements from the paper Improving the Neural Algorithm of Artistic Style in the INetwork.py script. It almost always produces more visually appealing images than the original implementation, even without MRF and Analogy loss. In a few cases border defects were noticed, but they have been corrected since implementing the BGR conversion and mean centering from this PR.
  • On a side note, fractal art as a style image is surprisingly effective at applying style and color to images, but the style weight needs to be very carefully tuned so as not to overpower the content. Fractal art that is mostly white, or black and white, appears to produce very good final images, while colored fractal art generally overpowers the content image. Note that with fractal art, the first 5 epochs are sufficient to generate good results; more epochs destroy the content completely.
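The tv_weight values and the style scale discussed above plug into the overall objective roughly as follows. This is a minimal NumPy sketch. The total variation term is modelled on the one in the Keras example (squared neighbour differences raised to the power 1.25), and that exact form is an assumption here. The preset key names are hypothetical labels for the cases in the list.

```python
import numpy as np

def total_variation_loss(x):
    """x: (H, W, C) image. Penalizes differences between neighbouring pixels."""
    down = (x[:-1, :-1, :] - x[1:, :-1, :]) ** 2    # vertical neighbour differences
    right = (x[:-1, :-1, :] - x[:-1, 1:, :]) ** 2   # horizontal neighbour differences
    return np.sum((down + right) ** 1.25)

def total_loss(content_l, style_l, tv_l, content_weight=1.0, style_weight=0.05,
               style_scale=1.0, tv_weight=1e-5):
    """Weighted sum of the three terms; style_scale only rescales the style weight."""
    return (content_weight * content_l
            + style_weight * style_scale * style_l
            + tv_weight * tv_l)

# tv_weight presets taken from the list above (hypothetical key names).
TV_WEIGHT_PRESETS = {
    "similar_colour_schemes": 1e-3,
    "one_major_colour_shared": 5e-4,
    "same_background_colour": 5e-3,
    "no_shared_palette": 5e-5,
    "crisp_default": 1e-5,
    "very_crisp_less_continuity": 5e-8,
}
```

Because style_scale and style_weight are simply multiplied, style_weight=0.1 with style_scale=0.5 gives the same loss as style_weight=0.05 with style_scale=1.0. That is why leaving the scale at 1 loses nothing.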

Sorry for the long post. Hope it helps you in obtaining slightly better images from the script.

kfinla commented Jul 30, 2016

Thank you guys, this is all super useful and interesting information. Long post appreciated! I did a 500-epoch test with a set of images I had decent results with, using slightly modified weights based on my own limited observations, and did get an improvement. I am now trying some of the suggested weight ratios and conv5_2. Thanks again.

kfinla commented Jul 31, 2016

So, playing some more. I need to study your INetwork.py script. I augmented my script (basically the original Keras style-transfer script) to use conv5_2, but the results never seem to converge, or converge extremely slowly, even after 500 epochs. I see you have a "style scale" parameter, which is missing in mine.

I started to play with your (titu1994's) INetwork.py and Network.py scripts, but they both hang for me at ("Start of iteration, 1"). I tried commenting out the theano.config lines at the top because my GPU always says cuDNN is not available, but still no luck. Any ideas? I have been running it like the original script: script location, content image, style image, result image location. I get no errors; it just doesn't complete iteration 1, unless it is several orders of magnitude slower than the original script. PS: What is the "windows" helper script about? I am trying this on Windows 8.1; I might try on OS X without a GPU to see if I can isolate the issue there. Thanks for any help.

titu1994 commented Jul 31, 2016

@kfinla Both Network.py and INetwork.py offer the best image far more quickly due to use of the conv5_2 layer as the content layer. I have seen very little gains on either script with over 200 epochs. 500 or 1000 is definitely overkill.

Style scale simply multiplies the original style loss value by a float value, which strengthens or weakens the style. There is no need for it if you adjust the style weight manually.

The theano.config lines can be safely removed, since they are only there to enforce default behaviour. I noticed that the script crashed when using time_once in the dnn.fwd_algo flag.

cuDNN is definitely very important for this script, since it provides faster convnet implementations. cuDNN 5 in particular is very useful for VGG-like nets, since it is a few times faster for stacked convolutional layers.

Network.py runs at exactly the same speed as the neural_style_transfer.py script, since in essence it just exposes the variables in the script as parameters. INetwork.py is 1-2 seconds slower than Network.py, since it uses a lot more information from the VGG network to deliver better results.

In fact, this script requires an enormous amount of time per iteration on a CPU. A GPU with cuDNN is definitely recommended.

The Windows helper folder is actually not a script. It is a program written in C# using Windows Forms that executes the Python scripts directly. The advantage is that it is extremely easy to alter all the parameters of the script and achieve different results. If you prefer the terminal anyway, it can also copy the argument list onto the Windows clipboard so you can simply paste it.

kfinla commented Jul 31, 2016

So I did some more playing. I tried the CPU on Keras 1.0.6 on my Mac Pro (6-core Xeon), and it takes about 2047 seconds per iteration, though even after 1 iteration it looks pretty nice. Then I tried again on my old Dell with the very old stock Nvidia GT 635 GPU, with no cuDNN available, and it took 1150 seconds per iteration. So I needed to be more patient. The older conv4_2 setup takes about 48 seconds per iteration on the same GPU. I know I need a better graphics card, but I also need to keep trying to get cuDNN support working, as I was under the impression that it would cut that time in half.

BTW, the results with the INetwork script are very exciting, and it's awesome to get a decent resolution as the result.

titu1994 commented Aug 1, 2016

@kfinla I am surprised to see such a vast difference between the time taken for conv4_2 and conv5_2 on the same gpu.

On my GPU, with cuDNN, conv4_2 and conv5_2 require exactly the same amount of time. Perhaps cuDNN is the deciding factor here? I will try disabling cuDNN to see if I can replicate this issue.

cuDNN support is extremely useful for the VGG network: for version 5 they say it gives a 2.5x speedup over earlier cuDNN versions, which were already faster to begin with.

As for the INetwork script, I too have found that, for the additional time it takes, it provides a far higher resolution image, even though the Gram matrix size is still 400 x 400. I've implemented only 4 of the major improvements listed in the paper, and am still working on incorporating MRF and Analogy loss.

It seems adding MRF loss drastically undermines the results obtained from the INetwork script. Most probably my implementation is wrong. I will keep working on it to see if all 3 can work in tandem.

kfinla commented Aug 4, 2016

So I finally got cuDNN working, and v5 no less. Theano on Windows is a dark art; 14th time's the charm, I guess. I have seen a significant speedup: with INetwork, from 1100 seconds down to 62 seconds. With the older conv4_2 setup, it went from 48 seconds UP to about 50 seconds. So the newer INetwork script clearly benefits massively from cuDNN optimizations. Once I get a GTX 1060, I should probably be able to enable CNMeM properly too.


@kfinla Sorry for the late response, I was travelling. Your results match mine: without cuDNN, the INetwork.py script requires 700+ seconds per epoch on my laptop. However, I did not see an increase in the time required for the older conv4_2 setup.

The reason for this huge speed difference is blatantly obvious. The INetwork script uses all layers of the VGG network to compute style loss. On top of that, it computes the difference between adjacent layers, thus adding even more execution time to the already large VGG network. cuDNN is pretty much a requirement for the INetwork script.
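The extra cost described above can be made concrete. The sketch below is one reading of "uses all layers" plus "the difference between adjacent layers". It is hypothetical and not INetwork.py itself. Per-layer style losses are taken as given inputs, and the geometric weighting, in which deeper pairs count more, is an assumption. With uniform weights the differences would simply telescope to first-layer minus last-layer loss.

```python
# Hypothetical sketch, not INetwork.py: names follow the convX_Y convention used in this thread.
VGG16_CONV_LAYERS = tuple(f"conv{block}_{i}"
                          for block, n_convs in [(1, 2), (2, 2), (3, 3), (4, 3), (5, 3)]
                          for i in range(1, n_convs + 1))
# Style layers in Gatys et al.: only the first conv layer of each block.
ORIGINAL_STYLE_LAYERS = ("conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1")

def chained_style_loss(per_layer_loss, layers=VGG16_CONV_LAYERS, style_weight=1.0):
    """per_layer_loss maps layer name -> style loss for that layer.
    Differences of adjacent layers are weighted by style_weight / 2**(n - (i + 1)),
    an assumed geometric schedule in which the deepest pair gets the largest weight."""
    n = len(layers)
    total = 0.0
    for i in range(n - 1):
        diff = per_layer_loss[layers[i]] - per_layer_loss[layers[i + 1]]
        total += style_weight / 2 ** (n - (i + 1)) * diff
    return total
```

Under this reading, each iteration computes 13 Gram matrices instead of 5, including several at full or half input resolution. That is consistent with cuDNN mattering far more for INetwork than for the conv4_2 setup.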

In any case, I have recently updated the scripts to speed them up even more. They no longer use the ZeroPadding layers, and there is no longer a requirement to download the full 500+ MB weights file. The network will automatically download a subset of the weights containing just the convolutional layers (~57 MB) and use them instead. This requires at least Keras 1.0.7, since it uses a few new additions to the Keras util scripts.

With cuDNN, on a 980M GPU, the Network.py script requires just 6-8 seconds per epoch, and the INetwork.py script needs just 10-11 seconds per epoch.

kfinla commented Aug 23, 2016

@titu1994 No worries. I have a GTX 1060 now, so the INetwork script often takes under 7 seconds to process. I am sure CNMeM helps. I am curious how adding MRF is going. I see there is an MRFNetwork.py script now (or I just had not noticed it), which I will try out. Have you noticed any reduction in quality between the new versions of the scripts using the subset weights and the older ones using the full VGG weights? At this point I am really focused on final result quality, not so much on speed. I have been doing side-by-side comparisons of the Prisma app and INetwork using the same input images.

PS: As a side note, do you know if cuDNN 5.1 works with Theano? I have Theano 0.9.0 dev2 and Keras 1.0.6 at the moment, along with CUDA 8.0 RC and cuDNN v5.0 (5005). I have not tried 5.1 just because getting everything to play nicely on Windows (10 now) is a chore.

titu1994 commented Aug 23, 2016

@kfinla Sadly, I had to pause work on the MRF network for the time being, since it produces useless results and requires an exorbitant amount of time, both to develop and to execute. I'm quite sure my implementation is wrong somewhere. MRFNetwork.py is currently non-functional, and I should note that somewhere. I welcome any contributions toward implementing MRF or Analogy loss.

The subset weights are in fact the entire set of convolutional layer weights, which total around 57 MB. The remainder of the 550+ MB file holds the fully connected Dense layers, which were unused. So I simply extracted the conv weights and eliminated the ZeroPadding layers to obtain a small speedup. The results will be exactly the same as with the full weights.

The results from Prisma and this network will probably never match, for the simple reason that Prisma uses pretrained Texture Networks, possibly similar to the ones from the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution. They require just one forward pass through the trained network to create a stylized image. The obvious drawback is that for each style you must train a new neural network for roughly 4-5 hours on the MS COCO dataset using that style image, and arbitrary style images cannot be used. For an implementation, see chainer fast neural style.

This network, on the other hand, requires no pretraining for a specific style, but requires an exorbitant amount of time in forward and backward passes through the VGG network. Benefits include control over how strongly the style is applied and the fact that any image can become a style. An added benefit is that the generated image can then be passed through an external image super-resolution implementation to scale it to any size with near-lossless results.

In fact, I recently added parameters to perform style transfer without transferring color. It seems to work well and produces good images in the same colors as the content image without destroying the content. I still want to attempt to implement style transfer for videos as well, but my GPU is already looking dated. If only I could switch GPUs on a laptop, but that's a pipe dream for now.
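The thread does not say how the color-preserving option works. One common approach, assumed here, is luminance transfer. The stylized image is converted to YCbCr, and only its luminance (Y) is kept, while the chroma channels (Cb, Cr) come from the content image. The sketch below uses the JPEG/BT.601 full-range conversion, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of colour preservation by luminance swap (JPEG / BT.601 full-range YCbCr).
def rgb_to_ycbcr(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 128.0, ycc[..., 2] - 128.0
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return np.stack([r, g, b], axis=-1)

def preserve_content_colour(generated_rgb, content_rgb):
    """Keep the stylized image's luminance, take Cb/Cr from the content image.
    Both inputs are (H, W, 3) RGB arrays in [0, 255] with the same shape."""
    gen = rgb_to_ycbcr(generated_rgb.astype(np.float64))
    con = rgb_to_ycbcr(content_rgb.astype(np.float64))
    merged = np.concatenate([gen[..., :1], con[..., 1:]], axis=-1)
    return np.clip(ycbcr_to_rgb(merged), 0, 255)
```

Because the swap happens after optimization, it adds almost no cost. The trade-off is that only the brightness structure of the style survives, not its palette.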

I myself am currently using cuDNN 5.1 (5103) on Windows 10 with CUDA 7.5, Theano 0.9.0 dev2 and bleeding-edge Keras from GitHub (BatchNorm had a small bug that was causing NaNs, which was fixed after the 1.0.7 release). Feel free to update, since 5.1 has better support for the Pascal architecture of the GTX 1060-1080 GPUs.

kfinla commented Aug 28, 2016

@titu1994 Thanks for the info and the update. I got cuDNN 5.1 (5105) going with Windows 10, CUDA 8.0 RC and Theano 0.9.0 dev2, and updated to Keras 1.0.7. On a CIFAR-10 test I'm only seeing about a 7% speedup vs. cuDNN 5004, not the 2.7x I read about. Maybe if I turned off all the image augmentation I'd see a more noticeable gain, or maybe my CNNs are not the right kind. Anyway, thanks for the background on Prisma and the latest on INetwork. I will certainly look up that paper.


@kfinla The 2.7x speedup is mainly for VGG networks, which contain multiple stacks of conv layers with 3x3 kernels, so other architectures may not see as large a performance improvement.
