# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  It should show the output obtained when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

You should only amend blocks of code that are preceded by a `TODO` statement.  **Any code blocks that are not preceded by a `TODO` statement should not be modified**.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to amend the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
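
To make the effect of `vocab_threshold` concrete, here is a minimal sketch on a toy corpus. The helper `build_vocab` is hypothetical (it is not the project's vocabulary code), and it assumes that captions are split on whitespace and that a word is kept when its count is at least the threshold.

```python
from collections import Counter


def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    return sorted(word for word, count in counts.items() if count >= vocab_threshold)


# Toy captions, for illustration only.
toy_captions = ["a dog runs", "a cat sits", "a dog sleeps"]
small_threshold_vocab = build_vocab(toy_captions, vocab_threshold=1)
large_threshold_vocab = build_vocab(toy_captions, vocab_threshold=2)

# The larger threshold drops the rarer words, so the vocabulary shrinks.
print(len(small_threshold_vocab), len(large_threshold_vocab))
```

With real captions the same trade-off applies: a larger threshold gives a smaller output layer in the decoder, at the cost of mapping more rare words to an unknown token.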

If you're not sure where to begin to set some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  **To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best.  Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with your performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.

### Question 1

For all the answers, the source of information is: https://arxiv.org/pdf/1411.4555.pdf

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task 1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**Answer:** The model consists of a CNN encoder that computes a compact feature representation of each image passed through it. This representation is then fed into an LSTM decoder, which generates the corresponding natural-language description.


### (Optional) Task #2

Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish.  When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- if using a pre-trained model, you must perform the corresponding appropriate normalization.
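
The two points above can be illustrated with plain arithmetic. This is a rough sketch, not torchvision's implementation: `normalize_pixel` and `resize_smaller_edge` are hypothetical helpers, the mean and standard deviation are the values used in `transform_train`, and truncating the longer edge to an integer is an assumption of this sketch.

```python
# Per-channel statistics used by the provided transform (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


def resize_smaller_edge(height, width, size=256):
    """Scale (height, width) so the smaller edge equals `size`, keeping the aspect ratio.

    Truncating the longer edge to an int is an assumption of this sketch.
    """
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


print(normalize_pixel((1.0, 1.0, 1.0)))  # a pure white pixel after normalization
print(resize_smaller_edge(480, 640))     # landscape image
print(resize_smaller_edge(640, 480))     # portrait image
```

Because images of different shapes end up with different longer edges after resizing, a fixed-size crop (224x224 in the provided transform) is still needed before images can be stacked into a batch.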

### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** The pre-trained ResNet weights were learned from images that had been preprocessed in a specific way. The provided transform matches that preprocessing.

### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 
```

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:** I followed the paper referenced above (https://arxiv.org/pdf/1411.4555.pdf), whose authors state:

"The above loss (referring to NLLLoss) is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN and word embeddings $W_e$."

Accordingly, I train all the parameters of the decoder together with the embedding layer of the encoder.

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** According to the reference in Question 3, the authors used stochastic gradient descent. I quote: "At training time, (S, I) is a training example pair, and we optimize the sum of the log probabilities as described in (2) over the whole training set using stochastic gradient descent". In my own trial-and-error experiments, however, Adam worked better, so I use it here.
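
To show how the two update rules differ, here is a simplified scalar sketch of a single SGD step and a single Adam step. It is not PyTorch's implementation: `sgd_step` and `adam_step` are hypothetical helpers, and the Adam hyperparameters are assumed to be `lr=0.001` with the commonly used values `beta1=0.9`, `beta2=0.999`, `eps=1e-8`.

```python
import math


def sgd_step(param, grad, lr):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return param - lr * grad


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; `state` holds the running moments."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
after_sgd = sgd_step(1.0, grad=4.0, lr=0.001)
after_adam = adam_step(1.0, grad=4.0, state=state)
print(after_sgd, after_adam)
```

In this sketch the SGD step scales with the gradient, whereas the first Adam step has a size of roughly `lr` whatever the gradient magnitude. This per-parameter rescaling is one plausible reason Adam can be less sensitive to the choice of learning rate.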

In [2]:
#performing imports first
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math
import nltk

In [3]:
## TODO #1: Select appropriate values for the Python variables below.
batch_size = 64          # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=0.89s)
creating index...


  0%|          | 784/414113 [00:00<01:52, 3679.35it/s]

index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:32<00:00, 4469.46it/s]



In [4]:
# TODO #3: Specify the learnable parameters of the model.
params = list(encoder.embed.parameters()) + list(decoder.parameters())

# TODO #4: Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
import os

# Load pre-trained weights before resuming training.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```
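
If several checkpoints have been saved, the most recent one can be found from the file names. The sketch below assumes the `encoder-i.pkl` / `decoder-i.pkl` naming described in Task #1; `latest_epoch` is a hypothetical helper, and the file list is made up for illustration (in practice it could come from `os.listdir('./models')`).

```python
import re


def latest_epoch(filenames, prefix):
    """Return the highest epoch number among files named '<prefix>-<epoch>.pkl'."""
    pattern = re.compile(r"^%s-(\d+)\.pkl$" % re.escape(prefix))
    epochs = [int(m.group(1)) for m in map(pattern.match, filenames) if m]
    return max(epochs) if epochs else None


# Hypothetical directory listing.
saved_files = ["encoder-1.pkl", "decoder-1.pkl", "encoder-2.pkl", "decoder-2.pkl", "notes.txt"]
epoch = latest_epoch(saved_files, "encoder")
encoder_file = "encoder-%d.pkl" % epoch
decoder_file = "decoder-%d.pkl" % epoch
print(encoder_file, decoder_file)
```

Comparing the epoch numbers as integers rather than strings keeps `encoder-10.pkl` ahead of `encoder-9.pkl`.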

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  
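
As a minimal sketch of how the logged statistics can be inspected, the snippet below parses lines in the format that the training loop writes to `log_file` and checks that the perplexity is the exponential of the loss. The helpers `parse_log_line` and `mean_loss` are hypothetical, and the two sample lines are copied from the training output of this notebook.

```python
import math
import re

# Matches lines such as:
# "Epoch [1/3], Step [1/6471], Loss: 9.0905, Perplexity: 8870.8137"
LOG_PATTERN = re.compile(
    r"Epoch \[(\d+)/\d+\], Step \[(\d+)/\d+\], "
    r"Loss: (\d+\.\d+), Perplexity: (\d+\.\d+)"
)


def parse_log_line(line):
    """Return (epoch, step, loss, perplexity), or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, step, loss, perplexity = match.groups()
    return int(epoch), int(step), float(loss), float(perplexity)


def mean_loss(lines):
    """Average the loss over all parseable lines; None if there are none."""
    losses = [rec[2] for rec in map(parse_log_line, lines) if rec is not None]
    return sum(losses) / len(losses) if losses else None


sample = [
    "Epoch [1/3], Step [1/6471], Loss: 9.0905, Perplexity: 8870.8137",
    "Epoch [1/3], Step [2/6471], Loss: 8.9630, Perplexity: 7808.6541",
]
first = parse_log_line(sample[0])
print(first, mean_loss(sample))
print(math.exp(first[2]))  # close to the logged perplexity, up to rounding of the loss
```

Averaging the loss over a window of steps, for example with `mean_loss` on successive slices of the log, gives a smoother picture than individual batch losses, which fluctuate from step to step.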

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.

In [None]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the model parameters with the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [1/6471], Loss: 9.0905, Perplexity: 8870.8137
Epoch [1/3], Step [2/6471], Loss: 8.9630, Perplexity: 7808.6541
Epoch [1/3], Step [3/6471], Loss: 8.7759, Perplexity: 6476.2445
Epoch [1/3], Step [4/6471], Loss: 8.5691, Perplexity: 5266.4452
Epoch [1/3], Step [5/6471], Loss: 8.1110, Perplexity: 3331.0091
Epoch [1/3], Step [6/6471], Loss: 7.9977, Perplexity: 2974.2039

Epoch [1/3], Step [56/6471], Loss: 4.0546, Perplexity: 57.6594
Epoch [1/3], Step [57/6471], Loss: 4.2397, Perplexity: 69.3860
Epoch [1/3], Step [58/6471], Loss: 4.3304, Perplexity: 75.9741
Epoch [1/3], Step [59/6471], Loss: 4.3320, Perplexity: 76.0925
Epoch [1/3], Step [60/6471], Loss: 4.1013, Perplexity: 60.4159
Epoch [1/3], Step [61/6471], Loss: 4.4930, Perplexity: 89.3931
Epoch [1/3], Step [62/6471], Loss: 4.2991, Perplexity: 73.6324

Epoch [1/3], Step [111/6471], Loss: 3.7258, Perplexity: 41.5053
Epoch [1/3], Step [112/6471], Loss: 3.8520, Perplexity: 47.0887
Epoch [1/3], Step [113/6471], Loss: 3.8536, Perplexity: 47.1614
Epoch [1/3], Step [114/6471], Loss: 4.2621, Perplexity: 70.9566
Epoch [1/3], Step [115/6471], Loss: 3.8783, Perplexity: 48.3398
Epoch [1/3], Step [116/6471], Loss: 4.2102, Perplexity: 67.3712
Epoch [1/3], Step [117/6471], Loss: 3.7933, Perplexity: 44.4008

Epoch [1/3], Step [166/6471], Loss: 3.5233, Perplexity: 33.8966
Epoch [1/3], Step [167/6471], Loss: 4.2125, Perplexity: 67.5222
Epoch [1/3], Step [168/6471], Loss: 3.7127, Perplexity: 40.9651
Epoch [1/3], Step [169/6471], Loss: 3.7794, Perplexity: 43.7890
Epoch [1/3], Step [170/6471], Loss: 3.6079, Perplexity: 36.8877
Epoch [1/3], Step [171/6471], Loss: 4.0932, Perplexity: 59.9290
Epoch [1/3], Step [172/6471], Loss: 3.5518, Perplexity: 34.8771

Epoch [1/3], Step [221/6471], Loss: 3.4068, Perplexity: 30.1682
Epoch [1/3], Step [222/6471], Loss: 3.5632, Perplexity: 35.2761
Epoch [1/3], Step [223/6471], Loss: 3.4687, Perplexity: 32.0940
Epoch [1/3], Step [224/6471], Loss: 3.6767, Perplexity: 39.5173
Epoch [1/3], Step [225/6471], Loss: 3.5625, Perplexity: 35.2503
Epoch [1/3], Step [226/6471], Loss: 3.4524, Perplexity: 31.5752
Epoch [1/3], Step [227/6471], Loss: 3.6348, Perplexity: 37.8959

Epoch [1/3], Step [276/6471], Loss: 3.3800, Perplexity: 29.3711
Epoch [1/3], Step [277/6471], Loss: 3.4078, Perplexity: 30.1990
Epoch [1/3], Step [278/6471], Loss: 3.2807, Perplexity: 26.5936
Epoch [1/3], Step [279/6471], Loss: 3.4868, Perplexity: 32.6818
Epoch [1/3], Step [280/6471], Loss: 3.3629, Perplexity: 28.8735
Epoch [1/3], Step [281/6471], Loss: 3.4309, Perplexity: 30.9030
Epoch [1/3], Step [282/6471], Loss: 3.6207, Perplexity: 37.3631

Epoch [1/3], Step [331/6471], Loss: 3.1568, Perplexity: 23.4955
Epoch [1/3], Step [332/6471], Loss: 3.3043, Perplexity: 27.2294
Epoch [1/3], Step [333/6471], Loss: 3.0440, Perplexity: 20.9898
Epoch [1/3], Step [334/6471], Loss: 3.4157, Perplexity: 30.4397
Epoch [1/3], Step [335/6471], Loss: 3.4946, Perplexity: 32.9371
Epoch [1/3], Step [336/6471], Loss: 3.2244, Perplexity: 25.1391
Epoch [1/3], Step [337/6471], Loss: 3.5399, Perplexity: 34.4621

Epoch [1/3], Step [386/6471], Loss: 3.2540, Perplexity: 25.8926
Epoch [1/3], Step [387/6471], Loss: 3.1896, Perplexity: 24.2786
Epoch [1/3], Step [388/6471], Loss: 3.3360, Perplexity: 28.1055
Epoch [1/3], Step [389/6471], Loss: 3.1600, Perplexity: 23.5702
Epoch [1/3], Step [390/6471], Loss: 3.2391, Perplexity: 25.5109
Epoch [1/3], Step [391/6471], Loss: 3.2938, Perplexity: 26.9463
Epoch [1/3], Step [392/6471], Loss: 3.4512, Perplexity: 31.5372

Epoch [1/3], Step [441/6471], Loss: 3.6196, Perplexity: 37.3227
Epoch [1/3], Step [442/6471], Loss: 3.2147, Perplexity: 24.8956
Epoch [1/3], Step [443/6471], Loss: 3.3520, Perplexity: 28.5596
Epoch [1/3], Step [444/6471], Loss: 3.2483, Perplexity: 25.7460
Epoch [1/3], Step [445/6471], Loss: 2.9553, Perplexity: 19.2075
Epoch [1/3], Step [446/6471], Loss: 3.5025, Perplexity: 33.1991
Epoch [1/3], Step [447/6471], Loss: 3.2248, Perplexity: 25.1490

Epoch [1/3], Step [496/6471], Loss: 3.2692, Perplexity: 26.2912
Epoch [1/3], Step [497/6471], Loss: 3.1142, Perplexity: 22.5157
Epoch [1/3], Step [498/6471], Loss: 2.9300, Perplexity: 18.7273
Epoch [1/3], Step [499/6471], Loss: 3.0879, Perplexity: 21.9313
Epoch [1/3], Step [500/6471], Loss: 3.2098, Perplexity: 24.7748
Epoch [1/3], Step [501/6471], Loss: 3.3003, Perplexity: 27.1219
Epoch [1/3], Step [502/6471], Loss: 3.1451, Perplexity: 23.2213

Epoch [1/3], Step [551/6471], Loss: 3.7340, Perplexity: 41.8460
Epoch [1/3], Step [552/6471], Loss: 3.0898, Perplexity: 21.9732
Epoch [1/3], Step [553/6471], Loss: 3.1685, Perplexity: 23.7712
Epoch [1/3], Step [554/6471], Loss: 3.6740, Perplexity: 39.4079
Epoch [1/3], Step [555/6471], Loss: 3.1571, Perplexity: 23.5023
Epoch [1/3], Step [556/6471], Loss: 3.0031, Perplexity: 20.1487
Epoch [1/3], Step [557/6471], Loss: 3.4009, Perplexity: 29.9911

Epoch [1/3], Step [606/6471], Loss: 3.1255, Perplexity: 22.7701
Epoch [1/3], Step [607/6471], Loss: 2.7790, Perplexity: 16.1031
Epoch [1/3], Step [608/6471], Loss: 3.2846, Perplexity: 26.6971
Epoch [1/3], Step [609/6471], Loss: 3.2126, Perplexity: 24.8446
Epoch [1/3], Step [610/6471], Loss: 2.9904, Perplexity: 19.8944
Epoch [1/3], Step [611/6471], Loss: 2.9048, Perplexity: 18.2611
Epoch [1/3], Step [612/6471], Loss: 2.9036, Perplexity: 18.2398

Epoch [1/3], Step [661/6471], Loss: 3.1553, Perplexity: 23.4594
Epoch [1/3], Step [662/6471], Loss: 3.0822, Perplexity: 21.8058
Epoch [1/3], Step [663/6471], Loss: 3.3071, Perplexity: 27.3054
Epoch [1/3], Step [664/6471], Loss: 3.1381, Perplexity: 23.0593
Epoch [1/3], Step [665/6471], Loss: 2.9760, Perplexity: 19.6099
Epoch [1/3], Step [666/6471], Loss: 2.8921, Perplexity: 18.0311
Epoch [1/3], Step [667/6471], Loss: 3.0116, Perplexity: 20.3192

Epoch [1/3], Step [716/6471], Loss: 3.0853, Perplexity: 21.8749
Epoch [1/3], Step [717/6471], Loss: 2.7544, Perplexity: 15.7115
Epoch [1/3], Step [718/6471], Loss: 2.9827, Perplexity: 19.7408
Epoch [1/3], Step [719/6471], Loss: 3.1388, Perplexity: 23.0758
Epoch [1/3], Step [720/6471], Loss: 3.4513, Perplexity: 31.5406
Epoch [1/3], Step [721/6471], Loss: 2.9496, Perplexity: 19.0979
Epoch [1/3], Step [722/6471], Loss: 2.8502, Perplexity: 17.2920

Epoch [1/3], Step [771/6471], Loss: 3.2367, Perplexity: 25.4503
Epoch [1/3], Step [772/6471], Loss: 3.1534, Perplexity: 23.4167
Epoch [1/3], Step [773/6471], Loss: 2.9619, Perplexity: 19.3338
Epoch [1/3], Step [774/6471], Loss: 3.4314, Perplexity: 30.9212
Epoch [1/3], Step [775/6471], Loss: 3.4258, Perplexity: 30.7484
Epoch [1/3], Step [776/6471], Loss: 3.4084, Perplexity: 30.2183
Epoch [1/3], Step [777/6471], Loss: 2.9229, Perplexity: 18.5952

Epoch [1/3], Step [826/6471], Loss: 3.0938, Perplexity: 22.0608
Epoch [1/3], Step [827/6471], Loss: 2.8431, Perplexity: 17.1690
Epoch [1/3], Step [828/6471], Loss: 2.9254, Perplexity: 18.6420
Epoch [1/3], Step [829/6471], Loss: 3.2911, Perplexity: 26.8717
Epoch [1/3], Step [830/6471], Loss: 2.8253, Perplexity: 16.8653
Epoch [1/3], Step [831/6471], Loss: 2.8828, Perplexity: 17.8647
Epoch [1/3], Step [832/6471], Loss: 2.8299, Perplexity: 16.9440

Epoch [1/3], Step [881/6471], Loss: 4.3403, Perplexity: 76.7314
Epoch [1/3], Step [882/6471], Loss: 2.8994, Perplexity: 18.1627
Epoch [1/3], Step [883/6471], Loss: 3.1361, Perplexity: 23.0133
Epoch [1/3], Step [884/6471], Loss: 2.9970, Perplexity: 20.0248
Epoch [1/3], Step [885/6471], Loss: 3.0611, Perplexity: 21.3514
Epoch [1/3], Step [886/6471], Loss: 2.7079, Perplexity: 14.9981
Epoch [1/3], Step [887/6471], Loss: 2.9757, Perplexity: 19.6042

Epoch [1/3], Step [936/6471], Loss: 2.7384, Perplexity: 15.4628
Epoch [1/3], Step [937/6471], Loss: 2.6552, Perplexity: 14.2282
Epoch [1/3], Step [938/6471], Loss: 2.7297, Perplexity: 15.3289
Epoch [1/3], Step [939/6471], Loss: 2.8261, Perplexity: 16.8801
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [940/6471], Loss: 2.6787, Perplexity: 14.5669Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [941/6471], Loss: 2.7792, Perplexity: 16.1057Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [942/6471], Loss: 2.7649, Perplexity: 15.8771Shape of captions
torch.Size([64, 15, 256])

Epoch [1/3], Step [991/6471], Loss: 2.7608, Perplexity: 15.8118Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [992/6471], Loss: 2.9641, Perplexity: 19.3780Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [993/6471], Loss: 2.8884, Perplexity: 17.9638Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [994/6471], Loss: 2.6654, Perplexity: 14.3736Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [995/6471], Loss: 2.7497, Perplexity: 15.6380Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [996/6471], Loss: 2.9165, Perplexity: 18.4766Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [997/6471], Loss: 2.7266, Perplexity: 15.2815Shape of captions
torch.Size([64, 10, 256])


Epoch [1/3], Step [1046/6471], Loss: 2.7730, Perplexity: 16.0065Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1047/6471], Loss: 2.5887, Perplexity: 13.3129Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1048/6471], Loss: 2.6598, Perplexity: 14.2935Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1049/6471], Loss: 2.7544, Perplexity: 15.7121Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1050/6471], Loss: 2.6460, Perplexity: 14.0980Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1051/6471], Loss: 2.7950, Perplexity: 16.3633Shape of captions
torch.Size([64, 23, 256])
Shape of inputs
torch.Size([64, 24, 256])
Epoch [1/3], Step [1052/6471], Loss: 3.6409, Perplexity: 38.1270Shape of captions
torch.Size([64, 11

Epoch [1/3], Step [1101/6471], Loss: 2.6815, Perplexity: 14.6070Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1102/6471], Loss: 2.7667, Perplexity: 15.9053Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1103/6471], Loss: 2.7217, Perplexity: 15.2065Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1104/6471], Loss: 2.6878, Perplexity: 14.6996Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1105/6471], Loss: 3.0717, Perplexity: 21.5778Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1106/6471], Loss: 2.3819, Perplexity: 10.8259Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1107/6471], Loss: 2.7471, Perplexity: 15.5976Shape of captions
torch.Size([64, 12

Epoch [1/3], Step [1156/6471], Loss: 2.6019, Perplexity: 13.4888Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1157/6471], Loss: 2.8397, Perplexity: 17.1106Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1158/6471], Loss: 2.9301, Perplexity: 18.7298Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1159/6471], Loss: 2.7522, Perplexity: 15.6765Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1160/6471], Loss: 2.6168, Perplexity: 13.6916Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1161/6471], Loss: 2.8231, Perplexity: 16.8296Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1162/6471], Loss: 2.6270, Perplexity: 13.8316Shape of captions
torch.Size([64, 13

Epoch [1/3], Step [1211/6471], Loss: 2.4277, Perplexity: 11.3329Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1212/6471], Loss: 2.5793, Perplexity: 13.1874Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1213/6471], Loss: 2.5851, Perplexity: 13.2641Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1214/6471], Loss: 2.6018, Perplexity: 13.4885Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1215/6471], Loss: 2.8409, Perplexity: 17.1305Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1216/6471], Loss: 2.8221, Perplexity: 16.8123Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1217/6471], Loss: 2.6056, Perplexity: 13.5388Shape of captions
torch.Size([64, 14

Epoch [1/3], Step [1266/6471], Loss: 2.5631, Perplexity: 12.9754Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1267/6471], Loss: 2.6351, Perplexity: 13.9441Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1268/6471], Loss: 2.6979, Perplexity: 14.8487Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1269/6471], Loss: 2.7236, Perplexity: 15.2351Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1270/6471], Loss: 3.0335, Perplexity: 20.7695Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1271/6471], Loss: 2.4788, Perplexity: 11.9265Shape of captions
torch.Size([64, 16, 256])
Shape of inputs
torch.Size([64, 17, 256])
Epoch [1/3], Step [1272/6471], Loss: 3.0393, Perplexity: 20.8916Shape of captions
torch.Size([64, 9, 

Epoch [1/3], Step [1321/6471], Loss: 2.6839, Perplexity: 14.6415Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1322/6471], Loss: 3.0746, Perplexity: 21.6412Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1323/6471], Loss: 2.8028, Perplexity: 16.4911Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1324/6471], Loss: 2.8633, Perplexity: 17.5185Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1325/6471], Loss: 2.7834, Perplexity: 16.1744Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1326/6471], Loss: 2.8169, Perplexity: 16.7252Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1327/6471], Loss: 2.3684, Perplexity: 10.6803Shape of captions
torch.Size([64, 10, 2

Epoch [1/3], Step [1376/6471], Loss: 2.5559, Perplexity: 12.8824Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1377/6471], Loss: 2.5926, Perplexity: 13.3639Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1378/6471], Loss: 2.6902, Perplexity: 14.7352Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1379/6471], Loss: 2.6858, Perplexity: 14.6702Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [1380/6471], Loss: 2.9246, Perplexity: 18.6276Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1381/6471], Loss: 2.4400, Perplexity: 11.4735Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1382/6471], Loss: 2.5166, Perplexity: 12.3860Shape of captions
torch.Size([64, 13

Epoch [1/3], Step [1431/6471], Loss: 2.5817, Perplexity: 13.2194Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1432/6471], Loss: 2.2861, Perplexity: 9.8362Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1433/6471], Loss: 2.4067, Perplexity: 11.0971Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1434/6471], Loss: 2.7779, Perplexity: 16.0859Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1435/6471], Loss: 2.6790, Perplexity: 14.5706Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1436/6471], Loss: 2.6107, Perplexity: 13.6092Shape of captions
torch.Size([64, 18, 256])
Shape of inputs
torch.Size([64, 19, 256])
Epoch [1/3], Step [1437/6471], Loss: 3.2693, Perplexity: 26.2936Shape of captions
torch.Size([64, 10, 

Epoch [1/3], Step [1486/6471], Loss: 2.5618, Perplexity: 12.9595Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1487/6471], Loss: 2.5675, Perplexity: 13.0326Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1488/6471], Loss: 2.7031, Perplexity: 14.9259Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1489/6471], Loss: 2.5711, Perplexity: 13.0803Shape of captions
torch.Size([64, 19, 256])
Shape of inputs
torch.Size([64, 20, 256])
Epoch [1/3], Step [1490/6471], Loss: 3.0410, Perplexity: 20.9252Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1491/6471], Loss: 2.4820, Perplexity: 11.9654Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1492/6471], Loss: 2.5901, Perplexity: 13.3315Shape of captions
torch.Size([64, 12

Epoch [1/3], Step [1541/6471], Loss: 2.9765, Perplexity: 19.6199Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1542/6471], Loss: 2.5858, Perplexity: 13.2744Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1543/6471], Loss: 2.8832, Perplexity: 17.8709Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1544/6471], Loss: 2.5015, Perplexity: 12.2006Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1545/6471], Loss: 2.3791, Perplexity: 10.7948Shape of captions
torch.Size([64, 18, 256])
Shape of inputs
torch.Size([64, 19, 256])
Epoch [1/3], Step [1546/6471], Loss: 3.1704, Perplexity: 23.8171Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1547/6471], Loss: 2.4690, Perplexity: 11.8109Shape of captions
torch.Size([64, 13,

Epoch [1/3], Step [1596/6471], Loss: 2.4548, Perplexity: 11.6441Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1597/6471], Loss: 2.6650, Perplexity: 14.3672Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1598/6471], Loss: 2.7172, Perplexity: 15.1378Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1599/6471], Loss: 2.4603, Perplexity: 11.7089Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1600/6471], Loss: 2.4863, Perplexity: 12.0168
Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1601/6471], Loss: 2.5470, Perplexity: 12.7690Shape of captions
torch.Size([64, 16, 256])
Shape of inputs
torch.Size([64, 17, 256])
Epoch [1/3], Step [1602/6471], Loss: 2.7617, Perplexity: 15.8260Shape of captions
torch.Size([64, 1

Epoch [1/3], Step [1651/6471], Loss: 2.6622, Perplexity: 14.3282Shape of captions
torch.Size([64, 19, 256])
Shape of inputs
torch.Size([64, 20, 256])
Epoch [1/3], Step [1652/6471], Loss: 3.0742, Perplexity: 21.6319Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1653/6471], Loss: 2.5050, Perplexity: 12.2433Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1654/6471], Loss: 2.5884, Perplexity: 13.3085Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1655/6471], Loss: 2.4537, Perplexity: 11.6319Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1656/6471], Loss: 2.6281, Perplexity: 13.8477Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1657/6471], Loss: 2.4183, Perplexity: 11.2266Shape of captions
torch.Size([64, 14

Epoch [1/3], Step [1706/6471], Loss: 2.4942, Perplexity: 12.1115Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1707/6471], Loss: 2.4109, Perplexity: 11.1443Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1708/6471], Loss: 2.2269, Perplexity: 9.2711Shape of captions
torch.Size([64, 18, 256])
Shape of inputs
torch.Size([64, 19, 256])
Epoch [1/3], Step [1709/6471], Loss: 3.1651, Perplexity: 23.6916Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1710/6471], Loss: 2.4818, Perplexity: 11.9630Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1711/6471], Loss: 2.2779, Perplexity: 9.7560Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1712/6471], Loss: 2.8783, Perplexity: 17.7834Shape of captions
torch.Size([64, 9, 25

Epoch [1/3], Step [1761/6471], Loss: 2.4457, Perplexity: 11.5381Shape of captions
torch.Size([64, 16, 256])
Shape of inputs
torch.Size([64, 17, 256])
Epoch [1/3], Step [1762/6471], Loss: 2.7784, Perplexity: 16.0926Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [1763/6471], Loss: 2.5588, Perplexity: 12.9205Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [1764/6471], Loss: 2.5884, Perplexity: 13.3082Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1765/6471], Loss: 2.7208, Perplexity: 15.1922Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1766/6471], Loss: 2.5614, Perplexity: 12.9540Shape of captions
torch.Size([64, 18, 256])
Shape of inputs
torch.Size([64, 19, 256])
Epoch [1/3], Step [1767/6471], Loss: 2.9276, Perplexity: 18.6820Shape of captions
torch.Size([64, 10

Epoch [1/3], Step [1816/6471], Loss: 2.5564, Perplexity: 12.8891Shape of captions
torch.Size([64, 21, 256])
Shape of inputs
torch.Size([64, 22, 256])
Epoch [1/3], Step [1817/6471], Loss: 3.3439, Perplexity: 28.3307Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1818/6471], Loss: 2.4717, Perplexity: 11.8427Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1819/6471], Loss: 2.8549, Perplexity: 17.3721Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1820/6471], Loss: 2.6445, Perplexity: 14.0760Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1821/6471], Loss: 2.5952, Perplexity: 13.3994Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1822/6471], Loss: 2.6011, Perplexity: 13.4785Shape of captions
torch.Size([64, 9, 

Epoch [1/3], Step [1871/6471], Loss: 2.4193, Perplexity: 11.2375Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [1872/6471], Loss: 2.8707, Perplexity: 17.6501Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1873/6471], Loss: 2.3412, Perplexity: 10.3932Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1874/6471], Loss: 2.6040, Perplexity: 13.5178Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1875/6471], Loss: 2.5550, Perplexity: 12.8710Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1876/6471], Loss: 2.4850, Perplexity: 12.0011Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1877/6471], Loss: 2.6487, Perplexity: 14.1354Shape of captions
torch.Size([64, 14

Epoch [1/3], Step [1926/6471], Loss: 2.2687, Perplexity: 9.6668Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1927/6471], Loss: 2.6397, Perplexity: 14.0091Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1928/6471], Loss: 2.5677, Perplexity: 13.0361Shape of captions
torch.Size([64, 30, 256])
Shape of inputs
torch.Size([64, 31, 256])
Epoch [1/3], Step [1929/6471], Loss: 3.9706, Perplexity: 53.0144Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [1930/6471], Loss: 2.2867, Perplexity: 9.8428Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [1931/6471], Loss: 2.4570, Perplexity: 11.6695Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [1932/6471], Loss: 2.7341, Perplexity: 15.3954Shape of captions
torch.Size([64, 12, 2

Epoch [1/3], Step [1981/6471], Loss: 2.6613, Perplexity: 14.3147Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [1982/6471], Loss: 2.5782, Perplexity: 13.1731Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1983/6471], Loss: 2.3200, Perplexity: 10.1760Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [1984/6471], Loss: 2.4132, Perplexity: 11.1694Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1985/6471], Loss: 2.4629, Perplexity: 11.7385Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [1986/6471], Loss: 2.5239, Perplexity: 12.4767Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [1987/6471], Loss: 2.3072, Perplexity: 10.0458Shape of captions
torch.Size([64, 11

Epoch [1/3], Step [2036/6471], Loss: 2.5187, Perplexity: 12.4122Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2037/6471], Loss: 2.3942, Perplexity: 10.9590Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2038/6471], Loss: 2.5049, Perplexity: 12.2429Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2039/6471], Loss: 2.7600, Perplexity: 15.8002Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2040/6471], Loss: 2.7280, Perplexity: 15.3024Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2041/6471], Loss: 2.5147, Perplexity: 12.3634Shape of captions
torch.Size([64, 20, 256])
Shape of inputs
torch.Size([64, 21, 256])
Epoch [1/3], Step [2042/6471], Loss: 3.4897, Perplexity: 32.7777Shape of captions
torch.Size([64, 11,

Epoch [1/3], Step [2091/6471], Loss: 2.5541, Perplexity: 12.8593Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2092/6471], Loss: 2.4298, Perplexity: 11.3571Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2093/6471], Loss: 2.3377, Perplexity: 10.3578Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2094/6471], Loss: 2.3719, Perplexity: 10.7175Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2095/6471], Loss: 2.5250, Perplexity: 12.4908Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2096/6471], Loss: 2.3198, Perplexity: 10.1737Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2097/6471], Loss: 2.1243, Perplexity: 8.3672Shape of captions
torch.Size([64, 14,

Epoch [1/3], Step [2146/6471], Loss: 2.8187, Perplexity: 16.7556Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2147/6471], Loss: 2.6941, Perplexity: 14.7918Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2148/6471], Loss: 2.6275, Perplexity: 13.8386Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2149/6471], Loss: 2.4628, Perplexity: 11.7376Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2150/6471], Loss: 2.4250, Perplexity: 11.3024Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2151/6471], Loss: 2.4627, Perplexity: 11.7370Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2152/6471], Loss: 2.4324, Perplexity: 11.3858Shape of captions
torch.Size([64, 11

Epoch [1/3], Step [2201/6471], Loss: 2.4491, Perplexity: 11.5783Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2202/6471], Loss: 2.5514, Perplexity: 12.8250Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2203/6471], Loss: 2.7764, Perplexity: 16.0606Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2204/6471], Loss: 2.5498, Perplexity: 12.8049Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2205/6471], Loss: 2.3599, Perplexity: 10.5904Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2206/6471], Loss: 2.4340, Perplexity: 11.4050Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2207/6471], Loss: 2.4729, Perplexity: 11.8562Shape of captions
torch.Size([64, 10

Epoch [1/3], Step [2256/6471], Loss: 2.5474, Perplexity: 12.7744Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2257/6471], Loss: 2.4152, Perplexity: 11.1925Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2258/6471], Loss: 2.4627, Perplexity: 11.7370Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2259/6471], Loss: 2.4693, Perplexity: 11.8144Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2260/6471], Loss: 2.2598, Perplexity: 9.5813Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2261/6471], Loss: 2.3907, Perplexity: 10.9207Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2262/6471], Loss: 2.4068, Perplexity: 11.0981Shape of captions
torch.Size([64, 10,

Epoch [1/3], Step [2311/6471], Loss: 2.3974, Perplexity: 10.9951Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2312/6471], Loss: 2.3693, Perplexity: 10.6901Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2313/6471], Loss: 2.3246, Perplexity: 10.2228Shape of captions
torch.Size([64, 18, 256])
Shape of inputs
torch.Size([64, 19, 256])
Epoch [1/3], Step [2314/6471], Loss: 3.0507, Perplexity: 21.1307Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2315/6471], Loss: 2.7506, Perplexity: 15.6525Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2316/6471], Loss: 2.3054, Perplexity: 10.0281Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2317/6471], Loss: 2.3037, Perplexity: 10.0116Shape of captions
torch.Size([64, 12,

Epoch [1/3], Step [2366/6471], Loss: 2.4156, Perplexity: 11.1970Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2367/6471], Loss: 2.4202, Perplexity: 11.2479Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2368/6471], Loss: 2.4734, Perplexity: 11.8626Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2369/6471], Loss: 2.5803, Perplexity: 13.2015Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2370/6471], Loss: 2.4357, Perplexity: 11.4240Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2371/6471], Loss: 2.3479, Perplexity: 10.4631Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2372/6471], Loss: 2.4064, Perplexity: 11.0944Shape of captions
torch.Size([64, 12

Epoch [1/3], Step [2421/6471], Loss: 2.4416, Perplexity: 11.4916Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2422/6471], Loss: 2.5626, Perplexity: 12.9695Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2423/6471], Loss: 2.5824, Perplexity: 13.2293Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2424/6471], Loss: 2.3406, Perplexity: 10.3877Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2425/6471], Loss: 2.4307, Perplexity: 11.3669Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2426/6471], Loss: 2.4226, Perplexity: 11.2754Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2427/6471], Loss: 2.2327, Perplexity: 9.3249Shape of captions
torch.Size([64, 10,

Epoch [1/3], Step [2476/6471], Loss: 2.4110, Perplexity: 11.1448Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2477/6471], Loss: 2.5027, Perplexity: 12.2150Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2478/6471], Loss: 2.7863, Perplexity: 16.2216Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2479/6471], Loss: 2.5633, Perplexity: 12.9782Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2480/6471], Loss: 2.5759, Perplexity: 13.1434Shape of captions
torch.Size([64, 16, 256])
Shape of inputs
torch.Size([64, 17, 256])
Epoch [1/3], Step [2481/6471], Loss: 2.8875, Perplexity: 17.9492Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2482/6471], Loss: 2.4421, Perplexity: 11.4970Shape of captions
torch.Size([64, 15,

Epoch [1/3], Step [2531/6471], Loss: 2.1083, Perplexity: 8.2343Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2532/6471], Loss: 2.4075, Perplexity: 11.1059Shape of captions
torch.Size([64, 16, 256])
Shape of inputs
torch.Size([64, 17, 256])
Epoch [1/3], Step [2533/6471], Loss: 2.6874, Perplexity: 14.6938Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2534/6471], Loss: 2.6427, Perplexity: 14.0513Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2535/6471], Loss: 2.4295, Perplexity: 11.3532Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2536/6471], Loss: 2.3502, Perplexity: 10.4879Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2537/6471], Loss: 2.7648, Perplexity: 15.8758Shape of captions
torch.Size([64, 14, 

Epoch [1/3], Step [2586/6471], Loss: 2.2735, Perplexity: 9.7135Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2587/6471], Loss: 2.5439, Perplexity: 12.7286Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2588/6471], Loss: 2.2105, Perplexity: 9.1202Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2589/6471], Loss: 2.2532, Perplexity: 9.5183Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2590/6471], Loss: 2.3144, Perplexity: 10.1193Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2591/6471], Loss: 2.5322, Perplexity: 12.5808Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2592/6471], Loss: 2.3826, Perplexity: 10.8329Shape of captions
torch.Size([64, 10, 25

Epoch [1/3], Step [2641/6471], Loss: 2.6645, Perplexity: 14.3611Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2642/6471], Loss: 2.8424, Perplexity: 17.1577Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2643/6471], Loss: 2.4776, Perplexity: 11.9131Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2644/6471], Loss: 2.3627, Perplexity: 10.6191Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2645/6471], Loss: 2.5572, Perplexity: 12.8997Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2646/6471], Loss: 2.3747, Perplexity: 10.7476Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2647/6471], Loss: 2.2152, Perplexity: 9.1629Shape of captions
torch.Size([64, 14,

Epoch [1/3], Step [2696/6471], Loss: 2.2720, Perplexity: 9.6992Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2697/6471], Loss: 2.2093, Perplexity: 9.1095Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2698/6471], Loss: 2.4059, Perplexity: 11.0884Shape of captions
torch.Size([64, 14, 256])
Shape of inputs
torch.Size([64, 15, 256])
Epoch [1/3], Step [2699/6471], Loss: 2.5061, Perplexity: 12.2567Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2700/6471], Loss: 2.0975, Perplexity: 8.1457
Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2701/6471], Loss: 2.1947, Perplexity: 8.9769Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2702/6471], Loss: 2.3412, Perplexity: 10.3934Shape of captions
torch.Size([64, 10, 2

Epoch [1/3], Step [2751/6471], Loss: 2.2683, Perplexity: 9.6634Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2752/6471], Loss: 2.2740, Perplexity: 9.7180Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2753/6471], Loss: 2.3406, Perplexity: 10.3873Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2754/6471], Loss: 2.4581, Perplexity: 11.6827Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2755/6471], Loss: 2.2886, Perplexity: 9.8608Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2756/6471], Loss: 2.4865, Perplexity: 12.0193Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2757/6471], Loss: 2.3375, Perplexity: 10.3550Shape of captions
torch.Size([64, 11, 2

Epoch [1/3], Step [2806/6471], Loss: 2.4028, Perplexity: 11.0539Shape of captions
torch.Size([64, 9, 256])
Shape of inputs
torch.Size([64, 10, 256])
Epoch [1/3], Step [2807/6471], Loss: 2.6531, Perplexity: 14.1981Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2808/6471], Loss: 2.2068, Perplexity: 9.0868Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2809/6471], Loss: 2.4885, Perplexity: 12.0431Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2810/6471], Loss: 2.2774, Perplexity: 9.7516Shape of captions
torch.Size([64, 15, 256])
Shape of inputs
torch.Size([64, 16, 256])
Epoch [1/3], Step [2811/6471], Loss: 2.5810, Perplexity: 13.2100Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2812/6471], Loss: 2.2750, Perplexity: 9.7283Shape of captions
torch.Size([64, 11, 25

Epoch [1/3], Step [2861/6471], Loss: 2.1768, Perplexity: 8.8181Shape of captions
torch.Size([64, 31, 256])
Shape of inputs
torch.Size([64, 32, 256])
Epoch [1/3], Step [2862/6471], Loss: 4.0032, Perplexity: 54.7752Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2863/6471], Loss: 2.5214, Perplexity: 12.4462Shape of captions
torch.Size([64, 21, 256])
Shape of inputs
torch.Size([64, 22, 256])
Epoch [1/3], Step [2864/6471], Loss: 3.1152, Perplexity: 22.5379Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2865/6471], Loss: 2.3185, Perplexity: 10.1606Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2866/6471], Loss: 2.1599, Perplexity: 8.6703Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2867/6471], Loss: 2.3552, Perplexity: 10.5402Shape of captions
torch.Size([64, 11, 

Epoch [1/3], Step [2916/6471], Loss: 2.2321, Perplexity: 9.3192Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2917/6471], Loss: 2.4252, Perplexity: 11.3050Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2918/6471], Loss: 2.1903, Perplexity: 8.9377Shape of captions
torch.Size([64, 11, 256])
Shape of inputs
torch.Size([64, 12, 256])
Epoch [1/3], Step [2919/6471], Loss: 2.4243, Perplexity: 11.2943Shape of captions
torch.Size([64, 10, 256])
Shape of inputs
torch.Size([64, 11, 256])
Epoch [1/3], Step [2920/6471], Loss: 2.4338, Perplexity: 11.4022Shape of captions
torch.Size([64, 12, 256])
Shape of inputs
torch.Size([64, 13, 256])
Epoch [1/3], Step [2921/6471], Loss: 2.4565, Perplexity: 11.6641Shape of captions
torch.Size([64, 13, 256])
Shape of inputs
torch.Size([64, 14, 256])
Epoch [1/3], Step [2922/6471], Loss: 2.3740, Perplexity: 10.7407Shape of captions
torch.Size([64, 11, 

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

One way to check for overfitting is to measure performance on a validation set.  If you decide to do this **optional** task, you must first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove very useful here.

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.
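
As a starting point for **data_loader_val.py**, the sketch below indexes a COCO captions annotation file by image id using only the standard library.  It assumes the usual COCO layout (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id` and `caption`).  The function names `index_val_captions` and `load_val_index` are hypothetical and not part of the project code, and a real validation loader would still need to apply your image transform and batch the results.

```python
import json
import os
from collections import defaultdict


def index_val_captions(annotations, img_dir):
    """Map each image id to its file path and its reference captions.

    `annotations` is the parsed annotation file (a dict); `img_dir` is the
    folder that holds the validation images.
    """
    # Each 'images' entry carries the file name for one image id.
    paths = {img['id']: os.path.join(img_dir, img['file_name'])
             for img in annotations['images']}

    # Each 'annotations' entry carries one caption; an image usually has several.
    captions = defaultdict(list)
    for ann in annotations['annotations']:
        captions[ann['image_id']].append(ann['caption'].strip())

    return {img_id: {'path': paths[img_id], 'captions': caps}
            for img_id, caps in captions.items()}


def load_val_index(annotation_file, img_dir):
    """Read the annotation file from disk and build the index."""
    with open(annotation_file) as f:
        return index_val_captions(json.load(f), img_dir)


# Example call (val_img_dir is a placeholder for the validation image folder):
# val_index = load_val_index(
#     '/opt/cocoapi/annotations/captions_val2014.json', val_img_dir)
```

Keeping all reference captions per image matters because BLEU compares a predicted caption against every available reference, not just one.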

The suggested approach to validating your model involves creating a JSON file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr) in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
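
The two steps described above can be sketched as follows.  The first helper turns a dictionary of predictions into the list-of-records layout used by the linked example file (one `image_id` and one `caption` per record).  The second is a minimal, unsmoothed sentence-level BLEU written from the standard definition (clipped n-gram precision with a brevity penalty).  All function names here are hypothetical, tokenization is plain whitespace splitting, and published scores are normally computed over the whole corpus with a tool such as the one linked above, so numbers from this sketch are only a rough sanity check.

```python
import json
import math
from collections import Counter


def to_coco_results(predictions):
    """Convert {image_id: caption} into a COCO-style results list."""
    return [{'image_id': int(i), 'caption': c} for i, c in predictions.items()]


def save_results(predictions, path):
    """Write the predictions to `path` as JSON."""
    with open(path, 'w') as f:
        json.dump(to_coco_results(predictions), f)


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by its largest count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        # Without smoothing, any zero precision makes the whole score zero.
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because the score is unsmoothed, a short caption with no matching 4-gram scores zero; lowering `max_n` or adding smoothing is a common adjustment when inspecting individual captions.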

In [None]:
# (Optional) TODO: Validate your model.