# Image Captioning Vanilla Encoder Decoder

"Image Captioning" is an encoder-decoder architecture for generating image captions, described in this [tutorial](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning).

## Goal

* Identify the main objects in images
* Convert an input image into a natural language description
* Use an encoder-decoder framework:
    - The image encoder is a convolutional neural network (CNN): a ResNet-152 model pretrained on the [ILSVRC-2012-CLS](http://image-net.org/challenges/LSVRC/2012/) image classification dataset
    - The decoder is a long short-term memory (LSTM) network

## Dataset

* The encoder is pretrained on the hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories such as umbrella, dog, balloon)
* Test images are presented without any initial annotation (no segmentation or labels)
* The validation and test data consist of 150,000 photographs, collected from Flickr and other search engines, hand-labeled with the presence or absence of 1000 object categories

## ResNet

* 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%
* Makes it possible to train extremely deep neural networks (150+ layers) successfully
* ResNet-152 achieves a top-5 accuracy of 95.51%
* Residual blocks are 2 layers deep in the smaller networks (ResNet-18, 34)
* Residual blocks are 3 layers deep in the larger networks (ResNet-50, 101, 152)

### Skip Connection
* Mitigates the vanishing gradient problem by providing an alternate shortcut path for the gradient to flow through
* Allows the model to learn an identity function, which ensures that a higher layer performs at least as well as a lower layer; this tackles the degradation problem

![alt text](./pics/idmap.png)
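
The effect of the shortcut can be illustrated with a minimal sketch in plain Python, without any deep learning framework. The names `residual_block`, `zero_layer` and `halve_layer` are hypothetical and only serve the illustration: when the stacked layers output zeros, the block reduces to the identity.

```python
def residual_block(x, layer_fn):
    """Return F(x) + x, where layer_fn plays the role of the stacked layers F."""
    fx = layer_fn(x)
    # the skip connection adds the unchanged input to the layer output
    return [f + xi for f, xi in zip(fx, x)]


def zero_layer(x):
    """A stand-in for layers whose weights have been driven to zero."""
    return [0.0 for _ in x]


def halve_layer(x):
    """A stand-in for some learned transformation."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_layer)   # the input passes through unchanged
changed_out = residual_block(x, halve_layer)   # the layer output is added on top of x
```

Because the input is always added back, a block that has nothing useful to learn can fall back to passing its input through.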


# Image Captioning Tutorial NN architecture

![model architecture](pics/model.png)

## Preprocessing

### Build a vocabulary 
Code excerpts from `build_vocab.py`, which creates a vocabulary from a caption annotation JSON file. The file also contains a `Vocabulary` class with a `__call__` method that returns the index of a word and a `__len__` method that returns the size of the vocabulary.

```python
from collections import Counter

import nltk
from pycocotools.coco import COCO


def build_vocab(json, threshold):
    # COCO is a large image dataset designed for object detection, segmentation,
    # person keypoints detection, stuff segmentation, and caption generation.
    coco = COCO(json)
    # a Counter to keep track of the word occurrences
    counter = Counter()
    ids = coco.anns.keys()
    for i, ann_id in enumerate(ids):
        caption = str(coco.anns[ann_id]['caption'])
        # tokenize the lowercased caption from the json file
        tokens = nltk.tokenize.word_tokenize(caption.lower())
        # count how many times each token occurs
        counter.update(tokens)

        if (i+1) % 1000 == 0:
            print("[{}/{}] Tokenized the captions.".format(i+1, len(ids)))

    # words that occur fewer than 'threshold' times are discarded.
    words = [word for word, cnt in counter.items() if cnt >= threshold]

    # create a vocab wrapper (the Vocabulary class from build_vocab.py)
    # and add some special tokens.
    vocab = Vocabulary()
    # used for padding of sentences
    vocab.add_word('<pad>')
    # marks the start of a sentence
    vocab.add_word('<start>')
    # marks the end of a sentence
    vocab.add_word('<end>')
    # used as a placeholder for unknown words such as personal names
    vocab.add_word('<unk>')

    # add the words to the vocabulary.
    for word in words:
        vocab.add_word(word)
    return vocab
```
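
The `Vocabulary` class itself is not shown in the excerpt. A minimal sketch consistent with the description above (`add_word`, `__call__`, `__len__`) and with the `idx2word` lookup used in the test phase could look as follows. The attribute name `word2idx` and the fallback to `<unk>` for unknown words are assumptions, and `str.split` stands in for the NLTK tokenizer so that the example runs without extra dependencies.

```python
from collections import Counter


class Vocabulary(object):
    """Simple two-way mapping between words and integer indices."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        # only assign a new index the first time a word is seen
        if word not in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        # assumption: unknown words map to the index of '<unk>'
        if word not in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[word]

    def __len__(self):
        return len(self.word2idx)


# toy captions instead of the COCO annotations
captions = ["a dog on a beach", "a dog with a ball", "a cat"]
counter = Counter()
for caption in captions:
    counter.update(caption.lower().split())

threshold = 2
vocab = Vocabulary()
for token in ('<pad>', '<start>', '<end>', '<unk>'):
    vocab.add_word(token)
for word, cnt in counter.items():
    if cnt >= threshold:
        vocab.add_word(word)
```

With a threshold of 2, only "a" and "dog" survive next to the four special tokens; a rare word such as "cat" is looked up as `<unk>`.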


### Resize
Images are resized to 256x256 pixels by default, using antialiasing.

# Model


## Encoder

```python
import torch
import torch.nn as nn
import torchvision.models as models


# subclass of nn.Module: base class for all neural network modules.
class EncoderCNN(nn.Module):
    # embed_size: dimension of word embedding vectors, default 256
    def __init__(self, embed_size):
        """Load the pretrained ResNet-152 and replace top fc layer."""
        super(EncoderCNN, self).__init__()
        # note: newer torchvision releases deprecate `pretrained` in favor of a `weights` argument
        resnet = models.resnet152(pretrained=True)
        # get the layers of resnet & delete the last fc layer.
        modules = list(resnet.children())[:-1]
        self.resnet = nn.Sequential(*modules)
        # in_features is the number of inputs of the original fc layer
        # apply a linear transformation to the incoming data: y = xA^T + b
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)
        # batch normalization over a 2D input of shape (batch_size, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

    def forward(self, images):
        """Extract feature vectors from input images."""
        # disable gradient tracking: the pretrained ResNet weights are not updated
        with torch.no_grad():
            features = self.resnet(images)
        # features is a torch tensor of shape (batch_size, 2048, 1, 1):
        # a dense representation of each input image that can be used for
        # a variety of tasks such as ranking, classification, or clustering

        # flatten to (batch_size, 2048); a shape dimension of -1 is inferred
        # from the number of elements and the remaining dimensions.
        features = features.reshape(features.size(0), -1)
        # project to embed_size, then apply batch normalization
        features = self.bn(self.linear(features))
        return features
```


## Decoder

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, max_seq_length=20):
        """Set the hyper-parameters and build the layers."""
        super(DecoderRNN, self).__init__()
        # add an embedding layer
        self.embed = nn.Embedding(vocab_size, embed_size)
        # add an lstm layer
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # add a linear layer
        self.linear = nn.Linear(hidden_size, vocab_size)
        # max length of a sentence
        self.max_seg_length = max_seq_length

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generate captions."""
        embeddings = self.embed(captions)
        # prepend the image features to the word embeddings along the time dimension (dim 1)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        # pack the padded embeddings so that the LSTM skips the padding
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        # run the LSTM over the packed sequence
        hiddens, _ = self.lstm(packed)
        # hiddens[0] holds the packed hidden states; map them to vocabulary scores
        outputs = self.linear(hiddens[0])
        return outputs

    def sample(self, features, states=None):
        """Generate captions for given image features using greedy search."""
        sampled_ids = []
        # add a "fake" time dimension to the inputs:
        # unsqueeze returns a new tensor with a dimension of size one inserted at the specified position.
        inputs = features.unsqueeze(1)
        for i in range(self.max_seg_length):
            # hiddens: (batch_size, 1, hidden_size)
            hiddens, states = self.lstm(inputs, states)
            # outputs: (batch_size, vocab_size)
            outputs = self.linear(hiddens.squeeze(1))
            # greedy choice: take the highest-scoring word
            # predicted: (batch_size)
            _, predicted = outputs.max(1)
            sampled_ids.append(predicted)
            # feed the predicted word back in as the next input
            # inputs: (batch_size, embed_size)
            inputs = self.embed(predicted)
            # inputs: (batch_size, 1, embed_size)
            inputs = inputs.unsqueeze(1)
        # stack the list of tensors into one tensor
        # sampled_ids: (batch_size, max_seq_length)
        sampled_ids = torch.stack(sampled_ids, 1)
        return sampled_ids
```
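
The greedy search in `sample` can be illustrated without a neural network. The sketch below replaces the LSTM and the linear layer with a hypothetical lookup table `NEXT_SCORES` that assigns a score to each candidate next word. At every step the highest-scoring word is chosen and fed back as the next input, just as `predicted` is re-embedded in `sample`. All names and scores are made up for the illustration.

```python
# hypothetical scores: current input -> {candidate next word: score}
NEXT_SCORES = {
    '<image>': {'<start>': 0.9, 'a': 0.1},
    '<start>': {'a': 0.7, 'the': 0.3},
    'a': {'dog': 0.6, 'cat': 0.4},
    'dog': {'<end>': 0.8, 'runs': 0.2},
    '<end>': {'<end>': 1.0},
}


def greedy_sample(first_input, max_seq_length=20):
    """Pick the best-scoring word at each step and feed it back as input."""
    sampled = []
    current = first_input
    for _ in range(max_seq_length):
        scores = NEXT_SCORES[current]
        # argmax over the candidates, like outputs.max(1) in the decoder
        predicted = max(scores, key=scores.get)
        sampled.append(predicted)
        # the prediction becomes the input of the next step
        current = predicted
    return sampled


def trim_at_end(words):
    """Cut the sequence after the first '<end>' token, if there is one."""
    if '<end>' in words:
        return words[:words.index('<end>') + 1]
    return words


raw = greedy_sample('<image>', max_seq_length=6)
caption = trim_at_end(raw)
```

Like `sample`, the sketch always runs for `max_seq_length` steps; trimming at `<end>` happens afterwards.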

# Training phase
* For the encoder part, the pretrained CNN extracts the feature vector from a given input image.
* The feature vector is linearly transformed to have the same dimension as the input dimension of the LSTM network.
* For the decoder part, the source and target texts are predefined: both are taken from the ground-truth captions.
* Using these source and target sequences and the feature vector, the LSTM decoder is trained as a language model conditioned on the feature vector.
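
How the source and target sequences line up can be sketched in plain Python. The alignment below follows the `forward` method of `DecoderRNN`, where the image feature is prepended to the caption embeddings and the packed sequence is cut to the caption length. The helper name `align` and the `'<image feature>'` placeholder are hypothetical.

```python
def align(caption_tokens):
    """Return (decoder_inputs, targets) for one caption.

    caption_tokens already contains '<start>' and '<end>'.
    """
    length = len(caption_tokens)
    # the image feature is prepended, as torch.cat does in DecoderRNN.forward
    inputs = ['<image feature>'] + caption_tokens
    # packing with the caption length keeps only the first `length` inputs
    inputs = inputs[:length]
    # the targets are the caption tokens themselves
    targets = caption_tokens
    return inputs, targets


tokens = ['<start>', 'giraffes', 'standing', '<end>']
inputs, targets = align(tokens)
pairs = list(zip(inputs, targets))
for source, target in pairs:
    print('{:>15} -> {}'.format(source, target))
```

At each step the decoder sees the ground-truth previous token rather than its own prediction, and is trained to output the token that follows it.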

```
python train.py
```

Excerpts from the `train.py` file:

```python
# [...]
# image preprocessing and normalization for the pretrained resnet
# normalize each RGB channel with the ImageNet mean and standard deviation
# also augment the data with random crops & horizontal flips of the images
transform = transforms.Compose([
    transforms.RandomCrop(args.crop_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225))])

# [...]

# build the encoder
encoder = EncoderCNN(args.embed_size).to(device)
# build the decoder
decoder = DecoderRNN(args.embed_size, args.hidden_size, len(vocab), args.num_layers).to(device)

# use cross entropy loss
criterion = nn.CrossEntropyLoss()
# get all the parameters to optimize: all of the decoder & the parameters of the last 2 layers of the encoder
params = list(decoder.parameters()) + list(encoder.linear.parameters()) + list(encoder.bn.parameters())
# the Adam optimizer holds the current state and updates the parameters based on the computed gradients
optimizer = torch.optim.Adam(params, lr=args.learning_rate)

# train the models
total_step = len(data_loader)
for epoch in range(args.num_epochs):
    # fetch the data: the images, their captions and the length of the captions via data_loader
    for i, (images, captions, lengths) in enumerate(data_loader):
        # move the mini-batch to the device and pack the captions into the ground-truth targets
        images = images.to(device)
        captions = captions.to(device)
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

        # get the features from the images
        features = encoder(images)
        # forward pass through the decoder with the features and the ground-truth captions
        outputs = decoder(features, captions, lengths)
        # compute the cross-entropy loss
        loss = criterion(outputs, targets)
        # zero out the gradients before backpropagation so that they are computed correctly;
        # otherwise the gradients accumulate over subsequent backward passes
        decoder.zero_grad()
        encoder.zero_grad()
        # backward pass
        loss.backward()
        # perform a single optimization step
        optimizer.step()

        # print log info
        if i % args.log_step == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}'
                  .format(epoch, args.num_epochs, i, total_step, loss.item(), np.exp(loss.item())))

        # save the model checkpoints
        if (i+1) % args.save_step == 0:
            torch.save(decoder.state_dict(), os.path.join(
                args.model_path, 'decoder-{}-{}.ckpt'.format(epoch+1, i+1)))
            torch.save(encoder.state_dict(), os.path.join(
                args.model_path, 'encoder-{}-{}.ckpt'.format(epoch+1, i+1)))
```
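
The loss and the logged perplexity can be reproduced on a toy example. The sketch below is plain Python rather than PyTorch: `pack` mimics the time-major ordering that `pack_padded_sequence` produces for length-sorted sequences, `cross_entropy` averages the negative log-probability of the target classes, and the perplexity is the exponential of that loss, as in the log line of the training loop. The helper names and all numbers are assumptions made for the illustration.

```python
import math


def pack(padded, lengths):
    """Flatten length-sorted padded sequences time step by time step, skipping padding."""
    packed = []
    for t in range(max(lengths)):
        for seq, length in zip(padded, lengths):
            if t < length:
                packed.append(seq[t])
    return packed


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        log_norm = math.log(sum(math.exp(v) for v in row))
        total += log_norm - row[target]
    return total / len(targets)


# two captions as word ids, padded with 0 and sorted by decreasing length
captions = [[1, 4, 5, 2],
            [1, 6, 2, 0]]
lengths = [4, 3]
targets = pack(captions, lengths)

# uniform logits over a vocabulary of 8 words for every packed position
vocab_size = 8
logits = [[0.0] * vocab_size for _ in targets]
loss = cross_entropy(logits, targets)
perplexity = math.exp(loss)
```

A model that guesses uniformly over 8 words has a perplexity of about 8, so a trained decoder should reach a value far below the vocabulary size.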

# Test phase
* The encoder part is almost the same as in the training phase, with the difference that the batchnorm layer uses the moving average and variance instead of mini-batch statistics.
* The LSTM decoder can't see the image description: it feeds the previously generated word back in as the next input.
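
The difference between the two batchnorm modes can be sketched for a single feature in plain Python. The update rule below follows the convention of PyTorch's `nn.BatchNorm1d`, where `momentum` weights the new batch statistic. The class name `TinyBatchNorm` is hypothetical, and the learnable scale and shift are omitted for brevity.

```python
import math


class TinyBatchNorm:
    """Batch normalization of one scalar feature, without scale and shift."""

    def __init__(self, momentum=0.01, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            n = len(batch)
            mean = sum(batch) / n
            var = sum((x - mean) ** 2 for x in batch) / n
            # the running estimate of the variance uses the unbiased estimator
            unbiased_var = var * n / (n - 1)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased_var
        else:
            # eval mode: use the stored moving statistics, do not update them
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = TinyBatchNorm(momentum=0.01)
train_out = bn([2.0, 4.0, 6.0])   # normalized with the statistics of this batch

bn.training = False               # corresponds to calling .eval() on the module
eval_out = bn([2.0])              # a single image is fine in eval mode
```

In eval mode the output for an image no longer depends on which other images happen to be in the batch, which is why a single image can be captioned.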

```python
# [...]
# Image preprocessing
# Transforms are common image transformations. They can be chained together using Compose
# Convert a PIL (Python Imaging Library) Image or numpy.ndarray to tensor.
# Normalize a tensor image with mean and standard deviation.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),
                         (0.229, 0.224, 0.225))])

# Load vocabulary wrapper
with open(args.vocab_path, 'rb') as f:
    vocab = pickle.load(f)

# Build models
encoder = EncoderCNN(args.embed_size).eval()  # eval mode (batchnorm uses moving mean/variance)
decoder = DecoderRNN(args.embed_size, args.hidden_size, len(vocab), args.num_layers)
encoder = encoder.to(device)
decoder = decoder.to(device)

# Load the trained model parameters
encoder.load_state_dict(torch.load(args.encoder_path))
decoder.load_state_dict(torch.load(args.decoder_path))

# Prepare an image
image = load_image(args.image, transform)
image_tensor = image.to(device)

# Generate features from the image
feature = encoder(image_tensor)
# Generate a caption from the image features
sampled_ids = decoder.sample(feature)
sampled_ids = sampled_ids[0].cpu().numpy()          # (1, max_seq_length) -> (max_seq_length)

# Convert word_ids to words
sampled_caption = []
for word_id in sampled_ids:
    word = vocab.idx2word[word_id]
    sampled_caption.append(word)
    if word == '<end>':
        break
# join the sampled words into one sentence
sentence = ' '.join(sampled_caption)

# print the generated caption
print(sentence)
image = Image.open(args.image)
# show the image
plt.imshow(np.asarray(image))
```

# Run

To run a pretrained model on some pictures, do the following (on a Linux machine):
```bash
git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
make
python setup.py build
python setup.py install
git clone https://github.com/yunjey/pytorch-tutorial.git
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt
chmod +x download.sh #if you need the dataset
./download.sh
```

You can download the pretrained [model](https://www.dropbox.com/s/ne0ixz5d58ccbbz/pretrained_model.zip?dl=0) and the vocabulary [file](https://www.dropbox.com/s/26adb7y9m98uisa/vocap.zip?dl=0). Extract `pretrained_model.zip` to `./models/` and `vocab.pkl` to `./data/`.

```bash
python sample.py --image='png/umbrella.png' --encoder_path /path/to/encmodel --decoder_path /path/to/decmodel --vocab_path /path/to/vocab
```

# Results

The following captions were produced for the test images with the downloaded pretrained model and vocabulary file:

* `<start> a group of young men playing a game of soccer on a field . <end>`
![alt](pics/testdata/football.jpg)
* `<start> a brown bear standing next to a brown bird . <end>`
![alt](pics/testdata/lion_ape.jpg)
* `<start> a close up of a sign with a sign on it <end>`
![alt](pics/testdata/motivation_resize.jpg)
* `<start> a large body of water with a bridge in the background <end>`
![alt](pics/testdata/night.jpg)
* `<start> a group of people walking down a street next to a building . <end>`
![alt](pics/testdata/park_night.jpg)
* `<start> a city street with a lot of cars and buildings . <end>`
![alt](pics/testdata/street_crossing.jpg)
* `<start> a large elephant standing next to a baby elephant . <end>`
![alt](pics/testdata/elephants.jpg)
* `<start> a couple of sheep standing next to each other . <end>`
![alt](pics/testdata/drawing_tiger.jpg)
* `<start> a large white bird flying in the sky . <end>`
![alt](pics/testdata/dolphins.jpg)
* `<start> a herd of sheep grazing on a lush green hillside . <end>`
![alt](pics/testdata/countryside.jpg)
* `<start> a cat is sitting on a wooden table next to a keyboard . <end>`
![alt](pics/testdata/cat_pictures.jpg)
* `<start> a group of people standing around a table with a bunch of bananas . <end>`
![alt](pics/testdata/basketball.jpg)
* `<start> a woman holding an umbrella in the rain . <end>`
![alt](pics/testdata/umbrella.png)