Image Captioning Project from Coursera. In this project we define and train an image-to-caption model that can produce descriptions for real-world images!
Model architecture: CNN encoder and RNN decoder. (https://research.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html)
Encoder: We use the pre-trained InceptionV3 model as the CNN encoder (https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html) and extract its last hidden layer as an image embedding.
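The encoder step above can be sketched with Keras as follows. This is a minimal sketch, not the project's exact code: dropping the classification head and applying global average pooling yields a 2048-dimensional embedding per image.

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# InceptionV3 without its classification head; global average pooling
# collapses the final feature map into a 2048-d image embedding.
# Pass weights="imagenet" to get the pre-trained filters (downloads them);
# weights=None keeps this sketch runnable offline.
encoder = InceptionV3(include_top=False, weights=None, pooling="avg")

# InceptionV3 expects 299x299 RGB inputs scaled by preprocess_input.
images = np.random.randint(0, 256, size=(2, 299, 299, 3)).astype("float32")
embeddings = encoder.predict(preprocess_input(images))
print(embeddings.shape)  # (2, 2048)
```

Each row of `embeddings` is the fixed-length image representation fed to the decoder.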

Decoder: The decoder part of the model uses a recurrent neural network with LSTM cells to generate the captions.
Since our problem is to generate image captions, the RNN text generator should be conditioned on the image. The idea is to use the image features as the initial state of the RNN instead of zeros. During training we feed the ground-truth tokens into the LSTM to get predictions of the next tokens (teacher forcing). (http://cs.stanford.edu/people/karpathy/)
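A minimal Keras sketch of this conditioning scheme is shown below. The sizes (`vocab_size`, `embed_dim`, `lstm_units`) are illustrative assumptions, not the project's actual hyperparameters; the key point is projecting the image embedding into the LSTM's initial state and predicting the next token at every position.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes (assumptions, not the project's actual settings).
vocab_size, embed_dim, lstm_units, img_dim = 10000, 256, 512, 2048

img_features = layers.Input(shape=(img_dim,))             # InceptionV3 embedding
caption_in = layers.Input(shape=(None,), dtype="int32")   # ground-truth token ids

# Project the image embedding to the LSTM state size and use it as the
# initial hidden/cell state instead of zeros (conditioning on the image).
init_state = layers.Dense(lstm_units, activation="tanh")(img_features)

x = layers.Embedding(vocab_size, embed_dim)(caption_in)
x = layers.LSTM(lstm_units, return_sequences=True)(
    x, initial_state=[init_state, init_state])

# One next-token distribution per position (teacher forcing during training).
logits = layers.Dense(vocab_size)(x)
model = tf.keras.Model([img_features, caption_in], logits)
```

At inference time the same decoder is unrolled one token at a time, feeding each predicted token back in as the next input.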
The dataset is a collection of images and captions; here it is the COCO dataset. For each image, a set of sentences (captions) is used as a label describing the scene.

Relevant links:
- train images: http://msvocds.blob.core.windows.net/coco2014/train2014.zip
- validation images: http://msvocds.blob.core.windows.net/coco2014/val2014.zip
- captions for both train and validation: http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip
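Once downloaded, the captions can be paired with their images. The sketch below assumes the standard COCO annotation layout (a JSON file with an `"annotations"` list of `{"image_id", "caption"}` records, e.g. `captions_train2014.json` from the zip above):

```python
import json

def load_captions(path):
    """Map each COCO image_id to its list of caption strings."""
    with open(path) as f:
        data = json.load(f)
    captions = {}
    for ann in data["annotations"]:
        captions.setdefault(ann["image_id"], []).append(ann["caption"])
    return captions
```

Each image typically has around five captions, so one image contributes several (image, caption) training pairs.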
Following are a few results obtained after training the model for 12 epochs.


