# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  It should show the results obtained when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

You should only amend blocks of code that are preceded by a `TODO` statement.  **Any code blocks that are not preceded by a `TODO` statement should not be modified**.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step.
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.

If you're not sure where to begin to set some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  **To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best.  Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with your performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.
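To make the effect of `vocab_threshold` concrete, here is a minimal sketch that counts words in a few toy captions and keeps only those that reach the threshold.  The `build_vocab` helper and the captions are hypothetical and exist only for illustration; the vocabulary builder used by `get_loader` may tokenize differently and may add entries of its own.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, invented for illustration only.
captions = [
    "a dog runs on the beach",
    "a dog sits on a couch",
    "a cat sits on the couch",
]

small_threshold = build_vocab(captions, vocab_threshold=1)  # keeps every word
large_threshold = build_vocab(captions, vocab_threshold=3)  # keeps frequent words only
print(len(small_threshold), small_threshold)
print(len(large_threshold), large_threshold)
```

Raising the threshold from 1 to 3 shrinks this toy vocabulary from nine words to two, which is the trade-off described for `vocab_threshold` above: fewer rare words to learn, but more words that the model cannot produce.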

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task #1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**Answer:** 


### (Optional) Task #2

Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish.  When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- if using a pre-trained model, you must perform the corresponding appropriate normalization.
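
As a rough illustration of the normalization point, the sketch below applies the per-channel operation `(value - mean) / std` to a single RGB pixel in plain Python.  The mean and standard deviation are the values that appear in the provided `transform_train`; the `normalize_pixel` helper is hypothetical and only mirrors the arithmetic, assuming pixel values have already been scaled to the range [0, 1].

```python
# Per-channel statistics copied from the provided transform_train.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

mid_gray = normalize_pixel((0.5, 0.5, 0.5))   # a mid-gray pixel
at_mean = normalize_pixel(MEAN)               # a pixel equal to the channel means
print([round(v, 4) for v in mid_gray])
print(at_mean)
```

A pixel equal to the channel means maps to zero in every channel, so if you change the pre-trained CNN you should check which statistics that model expects and adjust the transform accordingly.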

### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** 

### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```python
params = list(decoder.parameters()) + list(encoder.embed.parameters())
```

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:** 

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** 

In [1]:
# load missing library
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## TODO #1: Select appropriate values for the Python variables below.
batch_size = 128        # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = True   # if True, load existing vocab file
embed_size = 512           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # how often (in steps) to print the batch loss on its own line
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# TODO #3: Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# TODO #4: Define the optimizer.
optimizer = torch.optim.Adam(params)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=1.08s)
creating index...


  0%|          | 729/414113 [00:00<02:01, 3389.98it/s]

index created!
Obtaining caption lengths...


 12%|█▏        | 49511/414113 [00:11<01:24, 4326.57it/s]
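
The value of `total_step` computed in the cell above can be checked by hand.  The sketch below assumes that the number of caption lengths equals the total of 414113 reported by the progress bar, and uses the `batch_size` of 128 set under `TODO #1`.

```python
import math

num_captions = 414113   # assumed: total shown by the progress bar above
batch_size = 128        # value set under TODO #1

# Round up so that the final, partially filled batch still counts as a step.
total_step = math.ceil(num_captions / batch_size)
print(total_step)
```

Under that assumption the result is 3236 steps per epoch, which agrees with the `Step [1/3236]` counter in the training output of **Step 2**.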

<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
import os

# Load pre-trained weights before resuming training.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```
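
If several epochs have been saved, one way to choose `encoder_file` and `decoder_file` is to pick the highest epoch number among the saved names.  The sketch below assumes the `encoder-i.pkl` / `decoder-i.pkl` naming described in Task #1; the `latest_checkpoint` helper is hypothetical and not part of the project code.

```python
import re

def latest_checkpoint(filenames, prefix):
    """Return the '<prefix>-<epoch>.pkl' name with the highest epoch, or None."""
    pattern = re.compile(r"^%s-(\d+)\.pkl$" % re.escape(prefix))
    best_epoch, best_name = -1, None
    for name in filenames:
        match = pattern.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    return best_name

# In the notebook you might pass os.listdir('./models') instead;
# a literal list is used here so that the sketch runs anywhere.
saved = ["decoder-1.pkl", "encoder-1.pkl", "decoder-2.pkl",
         "encoder-2.pkl", "decoder-10.pkl", "encoder-10.pkl"]
encoder_file = latest_checkpoint(saved, "encoder")
decoder_file = latest_checkpoint(saved, "decoder")
print(encoder_file, decoder_file)
```

Comparing the epoch numbers as integers avoids the pitfall of alphabetical sorting, where `encoder-10.pkl` would sort before `encoder-2.pkl`.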

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.
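
The perplexity mentioned in this note is the exponential of the cross-entropy loss, so a model that guessed uniformly over a vocabulary of `V` words would have a perplexity of `V`.  As a minimal sketch, the code below parses one line in the format that the training loop writes to `log_file` and recomputes the perplexity from the loss.  The `parse_log_line` helper is hypothetical; the sample line is copied from this notebook's training output.

```python
import math
import re

# Matches the statistics line written by the training loop, for example:
# 'Epoch [1/3], Step [2900/3236], Loss: 2.2158, Perplexity: 9.1683'
LOG_PATTERN = re.compile(
    r"Epoch \[(\d+)/(\d+)\], Step \[(\d+)/(\d+)\], "
    r"Loss: ([\d.]+), Perplexity: ([\d.]+)"
)

def parse_log_line(line):
    """Return (epoch, step, loss, perplexity), or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, _, step, _, loss, perplexity = match.groups()
    return int(epoch), int(step), float(loss), float(perplexity)

line = "Epoch [1/3], Step [2900/3236], Loss: 2.2158, Perplexity: 9.1683"
epoch, step, loss, logged_perplexity = parse_log_line(line)

# Perplexity is exp(loss); small differences come from the loss being rounded in the log.
recomputed_perplexity = math.exp(loss)
print(epoch, step, loss, logged_perplexity, round(recomputed_perplexity, 4))
```

Applying the same parser to every line of the log file gives a list of losses that you can plot or average over a window, which is often easier to read than the raw per-batch values.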

In [3]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

torch.Size([128, 13, 512])
Epoch [1/3], Step [1/3236], Loss: 9.2301, Perplexity: 10199.8986torch.Size([128, 12, 512])
Epoch [1/3], Step [2/3236], Loss: 8.9632, Perplexity: 7809.9276torch.Size([128, 13, 512])
Epoch [1/3], Step [3/3236], Loss: 8.6962, Perplexity: 5979.8786torch.Size([128, 14, 512])
Epoch [1/3], Step [4/3236], Loss: 8.3218, Perplexity: 4112.4314torch.Size([128, 13, 512])
Epoch [1/3], Step [5/3236], Loss: 7.7912, Perplexity: 2419.1155torch.Size([128, 12, 512])
Epoch [1/3], Step [6/3236], Loss: 7.1669, Perplexity: 1295.7851torch.Size([128, 10, 512])
Epoch [1/3], Step [7/3236], Loss: 6.6284, Perplexity: 756.2588torch.Size([128, 19, 512])
Epoch [1/3], Step [8/3236], Loss: 6.0550, Perplexity: 426.2232torch.Size([128, 12, 512])
Epoch [1/3], Step [9/3236], Loss: 5.2987, Perplexity: 200.0674torch.Size([128, 11, 512])
Epoch [1/3], Step [10/3236], Loss: 5.1324, Perplexity: 169.4171torch.Size([128, 13, 512])
Epoch [1/3], Step [11/3236], Loss: 4.8674, Perplexity: 129.9765torch.Size([

Epoch [1/3], Step [183/3236], Loss: 4.2288, Perplexity: 68.6334torch.Size([128, 11, 512])
Epoch [1/3], Step [184/3236], Loss: 3.4012, Perplexity: 30.0010torch.Size([128, 15, 512])
Epoch [1/3], Step [185/3236], Loss: 3.3474, Perplexity: 28.4283torch.Size([128, 13, 512])
Epoch [1/3], Step [186/3236], Loss: 3.3527, Perplexity: 28.5786torch.Size([128, 14, 512])
Epoch [1/3], Step [187/3236], Loss: 3.2781, Perplexity: 26.5249torch.Size([128, 12, 512])
Epoch [1/3], Step [188/3236], Loss: 3.5724, Perplexity: 35.6008torch.Size([128, 14, 512])
Epoch [1/3], Step [189/3236], Loss: 3.3967, Perplexity: 29.8653torch.Size([128, 16, 512])
Epoch [1/3], Step [190/3236], Loss: 3.4306, Perplexity: 30.8939torch.Size([128, 12, 512])
Epoch [1/3], Step [191/3236], Loss: 3.2911, Perplexity: 26.8712torch.Size([128, 12, 512])
Epoch [1/3], Step [192/3236], Loss: 3.2703, Perplexity: 26.3188torch.Size([128, 12, 512])
Epoch [1/3], Step [193/3236], Loss: 3.3651, Perplexity: 28.9377torch.Size([128, 13, 512])
Epoch [1/3

Epoch [1/3], Step [365/3236], Loss: 3.1487, Perplexity: 23.3055torch.Size([128, 11, 512])
Epoch [1/3], Step [366/3236], Loss: 3.2048, Perplexity: 24.6499torch.Size([128, 13, 512])
Epoch [1/3], Step [367/3236], Loss: 3.1240, Perplexity: 22.7362torch.Size([128, 16, 512])
Epoch [1/3], Step [368/3236], Loss: 3.2749, Perplexity: 26.4403torch.Size([128, 11, 512])
Epoch [1/3], Step [369/3236], Loss: 3.3253, Perplexity: 27.8063torch.Size([128, 14, 512])
Epoch [1/3], Step [370/3236], Loss: 2.9206, Perplexity: 18.5533torch.Size([128, 10, 512])
Epoch [1/3], Step [371/3236], Loss: 3.4089, Perplexity: 30.2307torch.Size([128, 11, 512])
Epoch [1/3], Step [372/3236], Loss: 3.0873, Perplexity: 21.9187torch.Size([128, 12, 512])
Epoch [1/3], Step [373/3236], Loss: 2.9930, Perplexity: 19.9454torch.Size([128, 12, 512])
Epoch [1/3], Step [374/3236], Loss: 2.9720, Perplexity: 19.5311torch.Size([128, 12, 512])
Epoch [1/3], Step [375/3236], Loss: 3.0291, Perplexity: 20.6784torch.Size([128, 11, 512])
Epoch [1/3

Epoch [1/3], Step [547/3236], Loss: 3.0101, Perplexity: 20.2902torch.Size([128, 11, 512])
Epoch [1/3], Step [548/3236], Loss: 2.9451, Perplexity: 19.0131torch.Size([128, 14, 512])
Epoch [1/3], Step [549/3236], Loss: 2.8794, Perplexity: 17.8029torch.Size([128, 11, 512])
Epoch [1/3], Step [550/3236], Loss: 3.0962, Perplexity: 22.1133torch.Size([128, 13, 512])
Epoch [1/3], Step [551/3236], Loss: 2.7845, Perplexity: 16.1913torch.Size([128, 21, 512])
Epoch [1/3], Step [552/3236], Loss: 3.6441, Perplexity: 38.2487torch.Size([128, 21, 512])
Epoch [1/3], Step [553/3236], Loss: 3.6066, Perplexity: 36.8392torch.Size([128, 16, 512])
Epoch [1/3], Step [554/3236], Loss: 2.9517, Perplexity: 19.1382torch.Size([128, 16, 512])
Epoch [1/3], Step [555/3236], Loss: 2.9861, Perplexity: 19.8085torch.Size([128, 12, 512])
Epoch [1/3], Step [556/3236], Loss: 2.9333, Perplexity: 18.7894torch.Size([128, 14, 512])
Epoch [1/3], Step [557/3236], Loss: 2.9916, Perplexity: 19.9182torch.Size([128, 11, 512])
Epoch [1/3

Epoch [1/3], Step [729/3236], Loss: 2.7321, Perplexity: 15.3653torch.Size([128, 12, 512])
Epoch [1/3], Step [730/3236], Loss: 2.7295, Perplexity: 15.3247torch.Size([128, 14, 512])
Epoch [1/3], Step [731/3236], Loss: 2.6993, Perplexity: 14.8695torch.Size([128, 13, 512])
Epoch [1/3], Step [732/3236], Loss: 2.6131, Perplexity: 13.6417torch.Size([128, 15, 512])
Epoch [1/3], Step [733/3236], Loss: 2.9136, Perplexity: 18.4225torch.Size([128, 12, 512])
Epoch [1/3], Step [734/3236], Loss: 2.7439, Perplexity: 15.5471torch.Size([128, 19, 512])
Epoch [1/3], Step [735/3236], Loss: 3.2388, Perplexity: 25.5032torch.Size([128, 15, 512])
Epoch [1/3], Step [736/3236], Loss: 2.8368, Perplexity: 17.0613torch.Size([128, 15, 512])
Epoch [1/3], Step [737/3236], Loss: 2.6897, Perplexity: 14.7277torch.Size([128, 12, 512])
Epoch [1/3], Step [738/3236], Loss: 2.6378, Perplexity: 13.9821torch.Size([128, 13, 512])
Epoch [1/3], Step [739/3236], Loss: 2.7571, Perplexity: 15.7542torch.Size([128, 19, 512])
Epoch [1/3

Epoch [1/3], Step [911/3236], Loss: 2.5555, Perplexity: 12.8783torch.Size([128, 13, 512])
Epoch [1/3], Step [912/3236], Loss: 2.5923, Perplexity: 13.3604torch.Size([128, 12, 512])
Epoch [1/3], Step [913/3236], Loss: 2.6094, Perplexity: 13.5910torch.Size([128, 12, 512])
Epoch [1/3], Step [914/3236], Loss: 2.5777, Perplexity: 13.1673torch.Size([128, 16, 512])
Epoch [1/3], Step [915/3236], Loss: 2.9795, Perplexity: 19.6777torch.Size([128, 16, 512])
Epoch [1/3], Step [916/3236], Loss: 2.7957, Perplexity: 16.3747torch.Size([128, 13, 512])
Epoch [1/3], Step [917/3236], Loss: 2.6696, Perplexity: 14.4344torch.Size([128, 11, 512])
Epoch [1/3], Step [918/3236], Loss: 2.6755, Perplexity: 14.5199torch.Size([128, 10, 512])
Epoch [1/3], Step [919/3236], Loss: 2.8538, Perplexity: 17.3537torch.Size([128, 14, 512])
Epoch [1/3], Step [920/3236], Loss: 2.6026, Perplexity: 13.4981torch.Size([128, 13, 512])
Epoch [1/3], Step [921/3236], Loss: 2.4936, Perplexity: 12.1053torch.Size([128, 10, 512])
Epoch [1/3

Epoch [1/3], Step [1002/3236], Loss: 2.6107, Perplexity: 13.6080torch.Size([128, 14, 512])
Epoch [1/3], Step [1003/3236], Loss: 2.6171, Perplexity: 13.6959torch.Size([128, 15, 512])
Epoch [1/3], Step [1004/3236], Loss: 2.6053, Perplexity: 13.5351torch.Size([128, 16, 512])
Epoch [1/3], Step [1005/3236], Loss: 2.7231, Perplexity: 15.2280torch.Size([128, 15, 512])
Epoch [1/3], Step [1006/3236], Loss: 2.5925, Perplexity: 13.3629torch.Size([128, 15, 512])
Epoch [1/3], Step [1007/3236], Loss: 2.5791, Perplexity: 13.1857torch.Size([128, 11, 512])
Epoch [1/3], Step [1008/3236], Loss: 2.6515, Perplexity: 14.1753torch.Size([128, 12, 512])
Epoch [1/3], Step [1009/3236], Loss: 2.4946, Perplexity: 12.1173torch.Size([128, 11, 512])
Epoch [1/3], Step [1010/3236], Loss: 2.7818, Perplexity: 16.1483torch.Size([128, 17, 512])
Epoch [1/3], Step [1011/3236], Loss: 2.9154, Perplexity: 18.4565torch.Size([128, 11, 512])
Epoch [1/3], Step [1012/3236], Loss: 2.6914, Perplexity: 14.7520torch.Size([128, 19, 512])

Epoch [1/3], Step [1182/3236], Loss: 2.4530, Perplexity: 11.6233torch.Size([128, 15, 512])
Epoch [1/3], Step [1183/3236], Loss: 2.6523, Perplexity: 14.1867torch.Size([128, 13, 512])
Epoch [1/3], Step [1184/3236], Loss: 2.4478, Perplexity: 11.5628torch.Size([128, 24, 512])
Epoch [1/3], Step [1185/3236], Loss: 3.5659, Perplexity: 35.3729torch.Size([128, 12, 512])
Epoch [1/3], Step [1186/3236], Loss: 2.4921, Perplexity: 12.0861torch.Size([128, 14, 512])
Epoch [1/3], Step [1187/3236], Loss: 2.5530, Perplexity: 12.8453torch.Size([128, 13, 512])
Epoch [1/3], Step [1188/3236], Loss: 2.6205, Perplexity: 13.7427torch.Size([128, 16, 512])
Epoch [1/3], Step [1189/3236], Loss: 2.6688, Perplexity: 14.4229torch.Size([128, 14, 512])
Epoch [1/3], Step [1190/3236], Loss: 2.5792, Perplexity: 13.1861torch.Size([128, 12, 512])
Epoch [1/3], Step [1191/3236], Loss: 2.5362, Perplexity: 12.6317torch.Size([128, 12, 512])
Epoch [1/3], Step [1192/3236], Loss: 2.4635, Perplexity: 11.7456torch.Size([128, 15, 512])

Epoch [1/3], Step [1362/3236], Loss: 2.4946, Perplexity: 12.1165torch.Size([128, 14, 512])
Epoch [1/3], Step [1363/3236], Loss: 2.3944, Perplexity: 10.9622torch.Size([128, 18, 512])
Epoch [1/3], Step [1364/3236], Loss: 2.8923, Perplexity: 18.0355torch.Size([128, 15, 512])
Epoch [1/3], Step [1365/3236], Loss: 2.4585, Perplexity: 11.6874torch.Size([128, 15, 512])
Epoch [1/3], Step [1366/3236], Loss: 2.4099, Perplexity: 11.1324torch.Size([128, 12, 512])
Epoch [1/3], Step [1367/3236], Loss: 2.4319, Perplexity: 11.3808torch.Size([128, 15, 512])
Epoch [1/3], Step [1368/3236], Loss: 2.6622, Perplexity: 14.3278torch.Size([128, 11, 512])
Epoch [1/3], Step [1369/3236], Loss: 2.5329, Perplexity: 12.5900torch.Size([128, 13, 512])
Epoch [1/3], Step [1370/3236], Loss: 2.3873, Perplexity: 10.8839torch.Size([128, 13, 512])
Epoch [1/3], Step [1371/3236], Loss: 2.3972, Perplexity: 10.9927torch.Size([128, 11, 512])
Epoch [1/3], Step [1372/3236], Loss: 2.7502, Perplexity: 15.6459torch.Size([128, 17, 512])

Epoch [1/3], Step [1542/3236], Loss: 2.3372, Perplexity: 10.3527torch.Size([128, 18, 512])
Epoch [1/3], Step [1543/3236], Loss: 3.0517, Perplexity: 21.1515torch.Size([128, 13, 512])
Epoch [1/3], Step [1544/3236], Loss: 2.3575, Perplexity: 10.5648torch.Size([128, 11, 512])
Epoch [1/3], Step [1545/3236], Loss: 2.5100, Perplexity: 12.3044torch.Size([128, 16, 512])
Epoch [1/3], Step [1546/3236], Loss: 2.5337, Perplexity: 12.6007torch.Size([128, 15, 512])
Epoch [1/3], Step [1547/3236], Loss: 2.5793, Perplexity: 13.1878torch.Size([128, 14, 512])
Epoch [1/3], Step [1548/3236], Loss: 2.4434, Perplexity: 11.5122torch.Size([128, 14, 512])
Epoch [1/3], Step [1549/3236], Loss: 2.4481, Perplexity: 11.5662torch.Size([128, 12, 512])
Epoch [1/3], Step [1550/3236], Loss: 2.3036, Perplexity: 10.0104torch.Size([128, 11, 512])
Epoch [1/3], Step [1551/3236], Loss: 2.5088, Perplexity: 12.2904torch.Size([128, 14, 512])
Epoch [1/3], Step [1552/3236], Loss: 2.3695, Perplexity: 10.6916torch.Size([128, 24, 512])

Epoch [1/3], Step [1722/3236], Loss: 2.3763, Perplexity: 10.7654torch.Size([128, 10, 512])
Epoch [1/3], Step [1723/3236], Loss: 2.5703, Perplexity: 13.0697torch.Size([128, 12, 512])
Epoch [1/3], Step [1724/3236], Loss: 2.4945, Perplexity: 12.1159torch.Size([128, 13, 512])
Epoch [1/3], Step [1725/3236], Loss: 2.3444, Perplexity: 10.4273torch.Size([128, 12, 512])
Epoch [1/3], Step [1726/3236], Loss: 2.1348, Perplexity: 8.4552torch.Size([128, 12, 512])
Epoch [1/3], Step [1727/3236], Loss: 2.3642, Perplexity: 10.6352torch.Size([128, 22, 512])
Epoch [1/3], Step [1728/3236], Loss: 3.2636, Perplexity: 26.1423torch.Size([128, 11, 512])
Epoch [1/3], Step [1729/3236], Loss: 2.3100, Perplexity: 10.0746torch.Size([128, 11, 512])
Epoch [1/3], Step [1730/3236], Loss: 2.4288, Perplexity: 11.3458torch.Size([128, 12, 512])
Epoch [1/3], Step [1731/3236], Loss: 2.4377, Perplexity: 11.4461torch.Size([128, 11, 512])
Epoch [1/3], Step [1732/3236], Loss: 2.3377, Perplexity: 10.3572torch.Size([128, 13, 512])


Epoch [1/3], Step [1902/3236], Loss: 2.5461, Perplexity: 12.7568torch.Size([128, 12, 512])
Epoch [1/3], Step [1903/3236], Loss: 2.1449, Perplexity: 8.5411torch.Size([128, 12, 512])
Epoch [1/3], Step [1904/3236], Loss: 2.2429, Perplexity: 9.4210torch.Size([128, 12, 512])
Epoch [1/3], Step [1905/3236], Loss: 2.3093, Perplexity: 10.0674torch.Size([128, 11, 512])
Epoch [1/3], Step [1906/3236], Loss: 2.3241, Perplexity: 10.2176torch.Size([128, 16, 512])
Epoch [1/3], Step [1907/3236], Loss: 2.4363, Perplexity: 11.4306torch.Size([128, 17, 512])
Epoch [1/3], Step [1908/3236], Loss: 2.6668, Perplexity: 14.3944torch.Size([128, 11, 512])
Epoch [1/3], Step [1909/3236], Loss: 2.3947, Perplexity: 10.9653torch.Size([128, 15, 512])
Epoch [1/3], Step [1910/3236], Loss: 2.3623, Perplexity: 10.6158torch.Size([128, 12, 512])
Epoch [1/3], Step [1911/3236], Loss: 2.2787, Perplexity: 9.7639torch.Size([128, 12, 512])
Epoch [1/3], Step [1912/3236], Loss: 2.4098, Perplexity: 11.1314torch.Size([128, 13, 512])
Ep

Epoch [1/3], Step [2082/3236], Loss: 2.2329, Perplexity: 9.3264torch.Size([128, 13, 512])
Epoch [1/3], Step [2083/3236], Loss: 2.3244, Perplexity: 10.2203torch.Size([128, 13, 512])
Epoch [1/3], Step [2084/3236], Loss: 2.2261, Perplexity: 9.2637torch.Size([128, 11, 512])
Epoch [1/3], Step [2085/3236], Loss: 2.3810, Perplexity: 10.8159torch.Size([128, 14, 512])
Epoch [1/3], Step [2086/3236], Loss: 2.2407, Perplexity: 9.3999torch.Size([128, 13, 512])
Epoch [1/3], Step [2087/3236], Loss: 2.2177, Perplexity: 9.1859torch.Size([128, 18, 512])
Epoch [1/3], Step [2088/3236], Loss: 2.6880, Perplexity: 14.7017torch.Size([128, 15, 512])
Epoch [1/3], Step [2089/3236], Loss: 2.4732, Perplexity: 11.8605torch.Size([128, 14, 512])
Epoch [1/3], Step [2090/3236], Loss: 2.3145, Perplexity: 10.1201torch.Size([128, 14, 512])
Epoch [1/3], Step [2091/3236], Loss: 2.2650, Perplexity: 9.6315torch.Size([128, 15, 512])
Epoch [1/3], Step [2092/3236], Loss: 2.3345, Perplexity: 10.3239torch.Size([128, 12, 512])
Epoc

Epoch [1/3], Step [2262/3236], Loss: 2.2437, Perplexity: 9.4280torch.Size([128, 20, 512])
Epoch [1/3], Step [2263/3236], Loss: 2.8412, Perplexity: 17.1359torch.Size([128, 16, 512])
Epoch [1/3], Step [2264/3236], Loss: 2.4669, Perplexity: 11.7854torch.Size([128, 13, 512])
Epoch [1/3], Step [2265/3236], Loss: 2.3486, Perplexity: 10.4710torch.Size([128, 11, 512])
Epoch [1/3], Step [2266/3236], Loss: 2.1608, Perplexity: 8.6785torch.Size([128, 16, 512])
Epoch [1/3], Step [2267/3236], Loss: 2.4416, Perplexity: 11.4915torch.Size([128, 11, 512])
Epoch [1/3], Step [2268/3236], Loss: 2.3227, Perplexity: 10.2028torch.Size([128, 19, 512])
Epoch [1/3], Step [2269/3236], Loss: 2.7700, Perplexity: 15.9587torch.Size([128, 13, 512])
Epoch [1/3], Step [2270/3236], Loss: 2.1031, Perplexity: 8.1916torch.Size([128, 13, 512])
Epoch [1/3], Step [2271/3236], Loss: 2.2526, Perplexity: 9.5127torch.Size([128, 15, 512])
Epoch [1/3], Step [2272/3236], Loss: 2.4417, Perplexity: 11.4920torch.Size([128, 14, 512])
Epo

Epoch [1/3], Step [2442/3236], Loss: 2.3506, Perplexity: 10.4918torch.Size([128, 11, 512])
Epoch [1/3], Step [2443/3236], Loss: 2.2405, Perplexity: 9.3979torch.Size([128, 16, 512])
Epoch [1/3], Step [2444/3236], Loss: 2.4657, Perplexity: 11.7714torch.Size([128, 12, 512])
Epoch [1/3], Step [2445/3236], Loss: 2.1185, Perplexity: 8.3183torch.Size([128, 11, 512])
Epoch [1/3], Step [2446/3236], Loss: 2.2545, Perplexity: 9.5308torch.Size([128, 13, 512])
Epoch [1/3], Step [2447/3236], Loss: 2.3385, Perplexity: 10.3658torch.Size([128, 10, 512])
Epoch [1/3], Step [2448/3236], Loss: 2.4461, Perplexity: 11.5433torch.Size([128, 12, 512])
Epoch [1/3], Step [2449/3236], Loss: 2.2947, Perplexity: 9.9211torch.Size([128, 13, 512])
Epoch [1/3], Step [2450/3236], Loss: 2.2223, Perplexity: 9.2281torch.Size([128, 13, 512])
Epoch [1/3], Step [2451/3236], Loss: 2.1539, Perplexity: 8.6183torch.Size([128, 13, 512])
Epoch [1/3], Step [2452/3236], Loss: 2.3224, Perplexity: 10.2003torch.Size([128, 10, 512])
Epoch

Epoch [1/3], Step [2622/3236], Loss: 2.3777, Perplexity: 10.7802torch.Size([128, 10, 512])
Epoch [1/3], Step [2623/3236], Loss: 2.6030, Perplexity: 13.5039torch.Size([128, 14, 512])
Epoch [1/3], Step [2624/3236], Loss: 2.2020, Perplexity: 9.0430torch.Size([128, 13, 512])
Epoch [1/3], Step [2625/3236], Loss: 2.0892, Perplexity: 8.0783torch.Size([128, 11, 512])
Epoch [1/3], Step [2626/3236], Loss: 2.3308, Perplexity: 10.2861torch.Size([128, 13, 512])
Epoch [1/3], Step [2627/3236], Loss: 2.1890, Perplexity: 8.9261torch.Size([128, 15, 512])
Epoch [1/3], Step [2628/3236], Loss: 2.2301, Perplexity: 9.3010torch.Size([128, 12, 512])
Epoch [1/3], Step [2629/3236], Loss: 2.2847, Perplexity: 9.8230torch.Size([128, 12, 512])
Epoch [1/3], Step [2630/3236], Loss: 2.0724, Perplexity: 7.9438torch.Size([128, 17, 512])
Epoch [1/3], Step [2631/3236], Loss: 2.5453, Perplexity: 12.7472torch.Size([128, 16, 512])
Epoch [1/3], Step [2632/3236], Loss: 2.3923, Perplexity: 10.9391torch.Size([128, 13, 512])
Epoch

Epoch [1/3], Step [2713/3236], Loss: 2.1135, Perplexity: 8.2773torch.Size([128, 14, 512])
Epoch [1/3], Step [2714/3236], Loss: 2.3121, Perplexity: 10.0954torch.Size([128, 17, 512])
Epoch [1/3], Step [2715/3236], Loss: 2.5648, Perplexity: 12.9975torch.Size([128, 17, 512])
Epoch [1/3], Step [2716/3236], Loss: 2.4668, Perplexity: 11.7852torch.Size([128, 16, 512])
Epoch [1/3], Step [2717/3236], Loss: 2.4276, Perplexity: 11.3314torch.Size([128, 11, 512])
Epoch [1/3], Step [2718/3236], Loss: 2.2417, Perplexity: 9.4092torch.Size([128, 13, 512])
Epoch [1/3], Step [2719/3236], Loss: 2.1802, Perplexity: 8.8480torch.Size([128, 15, 512])
Epoch [1/3], Step [2720/3236], Loss: 2.1625, Perplexity: 8.6931torch.Size([128, 12, 512])
Epoch [1/3], Step [2721/3236], Loss: 2.1128, Perplexity: 8.2712torch.Size([128, 10, 512])
Epoch [1/3], Step [2722/3236], Loss: 2.4968, Perplexity: 12.1435torch.Size([128, 17, 512])
Epoch [1/3], Step [2723/3236], Loss: 2.4731, Perplexity: 11.8588torch.Size([128, 16, 512])
Epoc

Epoch [1/3], Step [2894/3236], Loss: 2.3248, Perplexity: 10.2242torch.Size([128, 14, 512])
Epoch [1/3], Step [2895/3236], Loss: 2.1488, Perplexity: 8.5746torch.Size([128, 13, 512])
Epoch [1/3], Step [2896/3236], Loss: 2.1889, Perplexity: 8.9258torch.Size([128, 14, 512])
Epoch [1/3], Step [2897/3236], Loss: 2.3537, Perplexity: 10.5248torch.Size([128, 15, 512])
Epoch [1/3], Step [2898/3236], Loss: 2.3417, Perplexity: 10.3994torch.Size([128, 10, 512])
Epoch [1/3], Step [2899/3236], Loss: 2.5003, Perplexity: 12.1858torch.Size([128, 13, 512])
Epoch [1/3], Step [2900/3236], Loss: 2.2158, Perplexity: 9.1683
torch.Size([128, 15, 512])
Epoch [1/3], Step [2901/3236], Loss: 2.3117, Perplexity: 10.0918torch.Size([128, 14, 512])
Epoch [1/3], Step [2902/3236], Loss: 2.2146, Perplexity: 9.1576torch.Size([128, 34, 512])
Epoch [1/3], Step [2903/3236], Loss: 4.0011, Perplexity: 54.6578torch.Size([128, 15, 512])
Epoch [1/3], Step [2904/3236], Loss: 2.4408, Perplexity: 11.4827torch.Size([128, 13, 512])

Epoch [1/3], Step [3075/3236], Loss: 2.2656, Perplexity: 9.6373torch.Size([128, 12, 512])
Epoch [1/3], Step [3076/3236], Loss: 2.1204, Perplexity: 8.3344torch.Size([128, 15, 512])
Epoch [1/3], Step [3077/3236], Loss: 2.3421, Perplexity: 10.4031torch.Size([128, 16, 512])
Epoch [1/3], Step [3078/3236], Loss: 2.4102, Perplexity: 11.1357torch.Size([128, 13, 512])
Epoch [1/3], Step [3079/3236], Loss: 2.0474, Perplexity: 7.7475torch.Size([128, 12, 512])
Epoch [1/3], Step [3080/3236], Loss: 2.1873, Perplexity: 8.9114torch.Size([128, 12, 512])
Epoch [1/3], Step [3081/3236], Loss: 2.1902, Perplexity: 8.9372torch.Size([128, 13, 512])
Epoch [1/3], Step [3082/3236], Loss: 2.0068, Perplexity: 7.4393torch.Size([128, 14, 512])
Epoch [1/3], Step [3083/3236], Loss: 2.3212, Perplexity: 10.1877torch.Size([128, 11, 512])
Epoch [1/3], Step [3084/3236], Loss: 2.2544, Perplexity: 9.5291torch.Size([128, 13, 512])
Epoch [1/3], Step [3085/3236], Loss: 2.1477, Perplexity: 8.5651torch.Size([128, 13, 512])

Epoch [1/3], Step [3166/3236], Loss: 2.2649, Perplexity: 9.6306torch.Size([128, 13, 512])
Epoch [1/3], Step [3167/3236], Loss: 2.1038, Perplexity: 8.1973torch.Size([128, 12, 512])
Epoch [1/3], Step [3168/3236], Loss: 2.1992, Perplexity: 9.0174torch.Size([128, 13, 512])
Epoch [1/3], Step [3169/3236], Loss: 2.1131, Perplexity: 8.2738torch.Size([128, 12, 512])
Epoch [1/3], Step [3170/3236], Loss: 2.2005, Perplexity: 9.0293torch.Size([128, 16, 512])
Epoch [1/3], Step [3171/3236], Loss: 2.3127, Perplexity: 10.1016torch.Size([128, 16, 512])
Epoch [1/3], Step [3172/3236], Loss: 2.2580, Perplexity: 9.5644torch.Size([128, 10, 512])
Epoch [1/3], Step [3173/3236], Loss: 2.6451, Perplexity: 14.0842torch.Size([128, 14, 512])
Epoch [1/3], Step [3174/3236], Loss: 2.2190, Perplexity: 9.1983torch.Size([128, 14, 512])
Epoch [1/3], Step [3175/3236], Loss: 2.0762, Perplexity: 7.9745torch.Size([128, 14, 512])
Epoch [1/3], Step [3176/3236], Loss: 2.1054, Perplexity: 8.2103torch.Size([128, 13, 512])

Epoch [2/3], Step [114/3236], Loss: 2.1858, Perplexity: 8.8981torch.Size([128, 13, 512])
Epoch [2/3], Step [115/3236], Loss: 2.2032, Perplexity: 9.0539torch.Size([128, 11, 512])
Epoch [2/3], Step [116/3236], Loss: 2.1687, Perplexity: 8.7471torch.Size([128, 16, 512])
Epoch [2/3], Step [117/3236], Loss: 2.3322, Perplexity: 10.3004torch.Size([128, 15, 512])
Epoch [2/3], Step [118/3236], Loss: 2.2509, Perplexity: 9.4965torch.Size([128, 13, 512])
Epoch [2/3], Step [119/3236], Loss: 2.1299, Perplexity: 8.4137torch.Size([128, 12, 512])
Epoch [2/3], Step [120/3236], Loss: 2.0399, Perplexity: 7.6896torch.Size([128, 12, 512])
Epoch [2/3], Step [121/3236], Loss: 2.1113, Perplexity: 8.2587torch.Size([128, 13, 512])
Epoch [2/3], Step [122/3236], Loss: 2.0508, Perplexity: 7.7742torch.Size([128, 13, 512])
Epoch [2/3], Step [123/3236], Loss: 2.1057, Perplexity: 8.2129torch.Size([128, 11, 512])
Epoch [2/3], Step [124/3236], Loss: 2.1380, Perplexity: 8.4827torch.Size([128, 12, 512])

Epoch [2/3], Step [206/3236], Loss: 2.2467, Perplexity: 9.4561torch.Size([128, 14, 512])
Epoch [2/3], Step [207/3236], Loss: 2.1736, Perplexity: 8.7900torch.Size([128, 13, 512])
Epoch [2/3], Step [208/3236], Loss: 2.0954, Perplexity: 8.1291torch.Size([128, 13, 512])
Epoch [2/3], Step [209/3236], Loss: 2.0229, Perplexity: 7.5601torch.Size([128, 11, 512])
Epoch [2/3], Step [210/3236], Loss: 2.1511, Perplexity: 8.5942torch.Size([128, 10, 512])
Epoch [2/3], Step [211/3236], Loss: 2.3836, Perplexity: 10.8442torch.Size([128, 15, 512])
Epoch [2/3], Step [212/3236], Loss: 2.2434, Perplexity: 9.4251torch.Size([128, 14, 512])
Epoch [2/3], Step [213/3236], Loss: 2.1566, Perplexity: 8.6416torch.Size([128, 16, 512])
Epoch [2/3], Step [214/3236], Loss: 2.3210, Perplexity: 10.1859torch.Size([128, 13, 512])
Epoch [2/3], Step [215/3236], Loss: 2.2264, Perplexity: 9.2666torch.Size([128, 11, 512])
Epoch [2/3], Step [216/3236], Loss: 2.2717, Perplexity: 9.6960torch.Size([128, 14, 512])

Epoch [2/3], Step [298/3236], Loss: 2.0180, Perplexity: 7.5231torch.Size([128, 12, 512])
Epoch [2/3], Step [299/3236], Loss: 2.0942, Perplexity: 8.1192torch.Size([128, 14, 512])
Epoch [2/3], Step [300/3236], Loss: 2.1774, Perplexity: 8.8230
torch.Size([128, 11, 512])
Epoch [2/3], Step [301/3236], Loss: 2.2248, Perplexity: 9.2514torch.Size([128, 13, 512])
Epoch [2/3], Step [302/3236], Loss: 2.1394, Perplexity: 8.4942torch.Size([128, 16, 512])
Epoch [2/3], Step [303/3236], Loss: 2.3055, Perplexity: 10.0296torch.Size([128, 11, 512])
Epoch [2/3], Step [304/3236], Loss: 2.0902, Perplexity: 8.0865torch.Size([128, 14, 512])
Epoch [2/3], Step [305/3236], Loss: 2.1217, Perplexity: 8.3451torch.Size([128, 13, 512])
Epoch [2/3], Step [306/3236], Loss: 2.0424, Perplexity: 7.7091torch.Size([128, 18, 512])
Epoch [2/3], Step [307/3236], Loss: 2.6764, Perplexity: 14.5333torch.Size([128, 13, 512])
Epoch [2/3], Step [308/3236], Loss: 2.1966, Perplexity: 8.9942torch.Size([128, 12, 512])

Epoch [2/3], Step [390/3236], Loss: 2.4658, Perplexity: 11.7725torch.Size([128, 12, 512])
Epoch [2/3], Step [391/3236], Loss: 2.1781, Perplexity: 8.8293torch.Size([128, 12, 512])
Epoch [2/3], Step [392/3236], Loss: 2.1169, Perplexity: 8.3056torch.Size([128, 11, 512])
Epoch [2/3], Step [393/3236], Loss: 2.2021, Perplexity: 9.0444torch.Size([128, 11, 512])
Epoch [2/3], Step [394/3236], Loss: 2.1370, Perplexity: 8.4738torch.Size([128, 16, 512])
Epoch [2/3], Step [395/3236], Loss: 2.3084, Perplexity: 10.0578torch.Size([128, 13, 512])
Epoch [2/3], Step [396/3236], Loss: 2.2542, Perplexity: 9.5273torch.Size([128, 21, 512])
Epoch [2/3], Step [397/3236], Loss: 2.8704, Perplexity: 17.6450torch.Size([128, 11, 512])
Epoch [2/3], Step [398/3236], Loss: 2.2588, Perplexity: 9.5714torch.Size([128, 12, 512])
Epoch [2/3], Step [399/3236], Loss: 2.1938, Perplexity: 8.9697torch.Size([128, 11, 512])
Epoch [2/3], Step [400/3236], Loss: 2.1507, Perplexity: 8.5905
torch.Size([128, 15, 512])

Epoch [2/3], Step [482/3236], Loss: 2.3439, Perplexity: 10.4217torch.Size([128, 14, 512])
Epoch [2/3], Step [483/3236], Loss: 2.1994, Perplexity: 9.0198torch.Size([128, 16, 512])
Epoch [2/3], Step [484/3236], Loss: 2.4143, Perplexity: 11.1823torch.Size([128, 14, 512])
Epoch [2/3], Step [485/3236], Loss: 2.0497, Perplexity: 7.7658torch.Size([128, 13, 512])
Epoch [2/3], Step [486/3236], Loss: 2.1790, Perplexity: 8.8372torch.Size([128, 14, 512])
Epoch [2/3], Step [487/3236], Loss: 2.1771, Perplexity: 8.8204torch.Size([128, 14, 512])
Epoch [2/3], Step [488/3236], Loss: 2.2002, Perplexity: 9.0273torch.Size([128, 15, 512])
Epoch [2/3], Step [489/3236], Loss: 2.2300, Perplexity: 9.2998torch.Size([128, 12, 512])
Epoch [2/3], Step [490/3236], Loss: 2.1791, Perplexity: 8.8381torch.Size([128, 16, 512])
Epoch [2/3], Step [491/3236], Loss: 2.2193, Perplexity: 9.2010torch.Size([128, 13, 512])
Epoch [2/3], Step [492/3236], Loss: 2.1327, Perplexity: 8.4375torch.Size([128, 13, 512])

Epoch [2/3], Step [574/3236], Loss: 2.1801, Perplexity: 8.8470torch.Size([128, 13, 512])
Epoch [2/3], Step [575/3236], Loss: 2.1604, Perplexity: 8.6747torch.Size([128, 14, 512])
Epoch [2/3], Step [576/3236], Loss: 2.0345, Perplexity: 7.6488torch.Size([128, 11, 512])
Epoch [2/3], Step [577/3236], Loss: 1.9796, Perplexity: 7.2400torch.Size([128, 16, 512])
Epoch [2/3], Step [578/3236], Loss: 2.3312, Perplexity: 10.2905torch.Size([128, 13, 512])
Epoch [2/3], Step [579/3236], Loss: 2.1216, Perplexity: 8.3442torch.Size([128, 11, 512])
Epoch [2/3], Step [580/3236], Loss: 2.0911, Perplexity: 8.0938torch.Size([128, 17, 512])
Epoch [2/3], Step [581/3236], Loss: 2.5644, Perplexity: 12.9926torch.Size([128, 11, 512])
Epoch [2/3], Step [582/3236], Loss: 2.1460, Perplexity: 8.5506torch.Size([128, 10, 512])
Epoch [2/3], Step [583/3236], Loss: 2.4065, Perplexity: 11.0954torch.Size([128, 15, 512])
Epoch [2/3], Step [584/3236], Loss: 2.1995, Perplexity: 9.0203torch.Size([128, 11, 512])

Epoch [2/3], Step [666/3236], Loss: 2.0833, Perplexity: 8.0310torch.Size([128, 11, 512])
Epoch [2/3], Step [667/3236], Loss: 2.2248, Perplexity: 9.2513torch.Size([128, 15, 512])
Epoch [2/3], Step [668/3236], Loss: 2.3500, Perplexity: 10.4855torch.Size([128, 12, 512])
Epoch [2/3], Step [669/3236], Loss: 2.1626, Perplexity: 8.6938torch.Size([128, 12, 512])
Epoch [2/3], Step [670/3236], Loss: 2.0791, Perplexity: 7.9975torch.Size([128, 14, 512])
Epoch [2/3], Step [671/3236], Loss: 2.0794, Perplexity: 8.0000torch.Size([128, 18, 512])
Epoch [2/3], Step [672/3236], Loss: 2.5080, Perplexity: 12.2808torch.Size([128, 14, 512])
Epoch [2/3], Step [673/3236], Loss: 2.1349, Perplexity: 8.4560torch.Size([128, 13, 512])
Epoch [2/3], Step [674/3236], Loss: 2.1815, Perplexity: 8.8592torch.Size([128, 12, 512])
Epoch [2/3], Step [675/3236], Loss: 2.0265, Perplexity: 7.5874torch.Size([128, 14, 512])
Epoch [2/3], Step [676/3236], Loss: 2.1694, Perplexity: 8.7531torch.Size([128, 12, 512])

Epoch [2/3], Step [758/3236], Loss: 2.0362, Perplexity: 7.6612torch.Size([128, 12, 512])
Epoch [2/3], Step [759/3236], Loss: 2.0318, Perplexity: 7.6281torch.Size([128, 11, 512])
Epoch [2/3], Step [760/3236], Loss: 2.0743, Perplexity: 7.9590torch.Size([128, 16, 512])
Epoch [2/3], Step [761/3236], Loss: 2.3407, Perplexity: 10.3890torch.Size([128, 10, 512])
Epoch [2/3], Step [762/3236], Loss: 2.3363, Perplexity: 10.3434torch.Size([128, 14, 512])
Epoch [2/3], Step [763/3236], Loss: 2.1236, Perplexity: 8.3611torch.Size([128, 14, 512])
Epoch [2/3], Step [764/3236], Loss: 2.2499, Perplexity: 9.4865torch.Size([128, 13, 512])
Epoch [2/3], Step [765/3236], Loss: 2.0739, Perplexity: 7.9560torch.Size([128, 12, 512])
Epoch [2/3], Step [766/3236], Loss: 2.1785, Perplexity: 8.8334torch.Size([128, 11, 512])
Epoch [2/3], Step [767/3236], Loss: 2.2026, Perplexity: 9.0483torch.Size([128, 12, 512])
Epoch [2/3], Step [768/3236], Loss: 2.0792, Perplexity: 7.9983torch.Size([128, 14, 512])

Epoch [2/3], Step [850/3236], Loss: 1.9905, Perplexity: 7.3193torch.Size([128, 14, 512])
Epoch [2/3], Step [851/3236], Loss: 2.0664, Perplexity: 7.8964torch.Size([128, 21, 512])
Epoch [2/3], Step [852/3236], Loss: 2.8169, Perplexity: 16.7249torch.Size([128, 16, 512])
Epoch [2/3], Step [853/3236], Loss: 2.2724, Perplexity: 9.7023torch.Size([128, 13, 512])
Epoch [2/3], Step [854/3236], Loss: 2.0907, Perplexity: 8.0907torch.Size([128, 13, 512])
Epoch [2/3], Step [855/3236], Loss: 2.2006, Perplexity: 9.0301torch.Size([128, 11, 512])
Epoch [2/3], Step [856/3236], Loss: 2.2443, Perplexity: 9.4340torch.Size([128, 12, 512])
Epoch [2/3], Step [857/3236], Loss: 2.1117, Perplexity: 8.2623torch.Size([128, 23, 512])
Epoch [2/3], Step [858/3236], Loss: 2.9293, Perplexity: 18.7140torch.Size([128, 12, 512])
Epoch [2/3], Step [859/3236], Loss: 1.9733, Perplexity: 7.1941torch.Size([128, 14, 512])
Epoch [2/3], Step [860/3236], Loss: 2.2261, Perplexity: 9.2634torch.Size([128, 12, 512])

Epoch [2/3], Step [942/3236], Loss: 2.3863, Perplexity: 10.8734torch.Size([128, 11, 512])
Epoch [2/3], Step [943/3236], Loss: 2.0963, Perplexity: 8.1362torch.Size([128, 22, 512])
Epoch [2/3], Step [944/3236], Loss: 3.1178, Perplexity: 22.5964torch.Size([128, 12, 512])
Epoch [2/3], Step [945/3236], Loss: 2.1780, Perplexity: 8.8289torch.Size([128, 20, 512])
Epoch [2/3], Step [946/3236], Loss: 2.7838, Perplexity: 16.1805torch.Size([128, 11, 512])
Epoch [2/3], Step [947/3236], Loss: 2.0676, Perplexity: 7.9055torch.Size([128, 10, 512])
Epoch [2/3], Step [948/3236], Loss: 2.3525, Perplexity: 10.5122torch.Size([128, 16, 512])
Epoch [2/3], Step [949/3236], Loss: 2.2214, Perplexity: 9.2200torch.Size([128, 11, 512])
Epoch [2/3], Step [950/3236], Loss: 2.2100, Perplexity: 9.1159torch.Size([128, 11, 512])
Epoch [2/3], Step [951/3236], Loss: 2.1162, Perplexity: 8.2991torch.Size([128, 14, 512])
Epoch [2/3], Step [952/3236], Loss: 2.1103, Perplexity: 8.2508torch.Size([128, 13, 512])

Epoch [2/3], Step [1124/3236], Loss: 2.2431, Perplexity: 9.4223torch.Size([128, 10, 512])
Epoch [2/3], Step [1125/3236], Loss: 2.3605, Perplexity: 10.5962torch.Size([128, 15, 512])
Epoch [2/3], Step [1126/3236], Loss: 2.2321, Perplexity: 9.3190torch.Size([128, 12, 512])
Epoch [2/3], Step [1127/3236], Loss: 2.0094, Perplexity: 7.4585torch.Size([128, 16, 512])
Epoch [2/3], Step [1128/3236], Loss: 2.3300, Perplexity: 10.2775torch.Size([128, 10, 512])
Epoch [2/3], Step [1129/3236], Loss: 2.2349, Perplexity: 9.3455torch.Size([128, 13, 512])
Epoch [2/3], Step [1130/3236], Loss: 2.0553, Perplexity: 7.8093torch.Size([128, 11, 512])
Epoch [2/3], Step [1131/3236], Loss: 2.1925, Perplexity: 8.9576torch.Size([128, 18, 512])
Epoch [2/3], Step [1132/3236], Loss: 2.5427, Perplexity: 12.7144torch.Size([128, 13, 512])
Epoch [2/3], Step [1133/3236], Loss: 2.0476, Perplexity: 7.7491torch.Size([128, 12, 512])
Epoch [2/3], Step [1134/3236], Loss: 2.0422, Perplexity: 7.7074torch.Size([128, 10, 512])

Epoch [2/3], Step [1215/3236], Loss: 2.1632, Perplexity: 8.6993torch.Size([128, 17, 512])
Epoch [2/3], Step [1216/3236], Loss: 2.4481, Perplexity: 11.5668torch.Size([128, 17, 512])
Epoch [2/3], Step [1217/3236], Loss: 2.4076, Perplexity: 11.1077torch.Size([128, 12, 512])
Epoch [2/3], Step [1218/3236], Loss: 1.9968, Perplexity: 7.3657torch.Size([128, 13, 512])
Epoch [2/3], Step [1219/3236], Loss: 2.1639, Perplexity: 8.7050torch.Size([128, 11, 512])
Epoch [2/3], Step [1220/3236], Loss: 2.1754, Perplexity: 8.8054torch.Size([128, 13, 512])
Epoch [2/3], Step [1221/3236], Loss: 2.0219, Perplexity: 7.5525torch.Size([128, 11, 512])
Epoch [2/3], Step [1222/3236], Loss: 2.0724, Perplexity: 7.9440torch.Size([128, 12, 512])
Epoch [2/3], Step [1223/3236], Loss: 1.9698, Perplexity: 7.1693torch.Size([128, 12, 512])
Epoch [2/3], Step [1224/3236], Loss: 2.0677, Perplexity: 7.9065torch.Size([128, 13, 512])
Epoch [2/3], Step [1225/3236], Loss: 1.9857, Perplexity: 7.2843torch.Size([128, 11, 512])

Epoch [2/3], Step [1306/3236], Loss: 2.1167, Perplexity: 8.3039torch.Size([128, 14, 512])
Epoch [2/3], Step [1307/3236], Loss: 2.1492, Perplexity: 8.5776torch.Size([128, 11, 512])
Epoch [2/3], Step [1308/3236], Loss: 2.2504, Perplexity: 9.4913torch.Size([128, 20, 512])
Epoch [2/3], Step [1309/3236], Loss: 2.8670, Perplexity: 17.5844torch.Size([128, 10, 512])
Epoch [2/3], Step [1310/3236], Loss: 2.3732, Perplexity: 10.7319torch.Size([128, 10, 512])
Epoch [2/3], Step [1311/3236], Loss: 2.2897, Perplexity: 9.8716torch.Size([128, 12, 512])
Epoch [2/3], Step [1312/3236], Loss: 2.1595, Perplexity: 8.6672torch.Size([128, 12, 512])
Epoch [2/3], Step [1313/3236], Loss: 2.0701, Perplexity: 7.9255torch.Size([128, 12, 512])
Epoch [2/3], Step [1314/3236], Loss: 2.1032, Perplexity: 8.1925torch.Size([128, 12, 512])
Epoch [2/3], Step [1315/3236], Loss: 1.9987, Perplexity: 7.3792torch.Size([128, 14, 512])
Epoch [2/3], Step [1316/3236], Loss: 2.0880, Perplexity: 8.0689torch.Size([128, 13, 512])

Epoch [2/3], Step [1397/3236], Loss: 2.0842, Perplexity: 8.0380torch.Size([128, 11, 512])
Epoch [2/3], Step [1398/3236], Loss: 2.0552, Perplexity: 7.8083torch.Size([128, 14, 512])
Epoch [2/3], Step [1399/3236], Loss: 2.1530, Perplexity: 8.6108torch.Size([128, 11, 512])
Epoch [2/3], Step [1400/3236], Loss: 1.9359, Perplexity: 6.9303
torch.Size([128, 13, 512])
Epoch [2/3], Step [1401/3236], Loss: 2.1183, Perplexity: 8.3170torch.Size([128, 18, 512])
Epoch [2/3], Step [1402/3236], Loss: 2.6890, Perplexity: 14.7163torch.Size([128, 12, 512])
Epoch [2/3], Step [1403/3236], Loss: 2.1123, Perplexity: 8.2670torch.Size([128, 15, 512])
Epoch [2/3], Step [1404/3236], Loss: 2.2440, Perplexity: 9.4307torch.Size([128, 13, 512])
Epoch [2/3], Step [1405/3236], Loss: 1.9725, Perplexity: 7.1889torch.Size([128, 11, 512])
Epoch [2/3], Step [1406/3236], Loss: 2.1675, Perplexity: 8.7363torch.Size([128, 11, 512])
Epoch [2/3], Step [1407/3236], Loss: 2.1499, Perplexity: 8.5842torch.Size([128, 12, 512])

Epoch [2/3], Step [1488/3236], Loss: 2.5050, Perplexity: 12.2430torch.Size([128, 14, 512])
Epoch [2/3], Step [1489/3236], Loss: 2.0678, Perplexity: 7.9076torch.Size([128, 15, 512])
Epoch [2/3], Step [1490/3236], Loss: 2.2317, Perplexity: 9.3156torch.Size([128, 13, 512])
Epoch [2/3], Step [1491/3236], Loss: 2.0875, Perplexity: 8.0644torch.Size([128, 14, 512])
Epoch [2/3], Step [1492/3236], Loss: 2.1000, Perplexity: 8.1665torch.Size([128, 12, 512])
Epoch [2/3], Step [1493/3236], Loss: 2.0452, Perplexity: 7.7307torch.Size([128, 13, 512])
Epoch [2/3], Step [1494/3236], Loss: 2.1129, Perplexity: 8.2725torch.Size([128, 12, 512])
Epoch [2/3], Step [1495/3236], Loss: 2.0836, Perplexity: 8.0336torch.Size([128, 16, 512])
Epoch [2/3], Step [1496/3236], Loss: 2.2757, Perplexity: 9.7345torch.Size([128, 12, 512])
Epoch [2/3], Step [1497/3236], Loss: 2.1748, Perplexity: 8.8007torch.Size([128, 12, 512])
Epoch [2/3], Step [1498/3236], Loss: 1.9335, Perplexity: 6.9136torch.Size([128, 13, 512])

Epoch [2/3], Step [1579/3236], Loss: 2.0900, Perplexity: 8.0848torch.Size([128, 12, 512])
Epoch [2/3], Step [1580/3236], Loss: 2.0939, Perplexity: 8.1162torch.Size([128, 11, 512])
Epoch [2/3], Step [1581/3236], Loss: 2.1314, Perplexity: 8.4263torch.Size([128, 13, 512])
Epoch [2/3], Step [1582/3236], Loss: 2.1151, Perplexity: 8.2903torch.Size([128, 15, 512])
Epoch [2/3], Step [1583/3236], Loss: 2.1126, Perplexity: 8.2700torch.Size([128, 12, 512])
Epoch [2/3], Step [1584/3236], Loss: 2.0519, Perplexity: 7.7828torch.Size([128, 13, 512])
Epoch [2/3], Step [1585/3236], Loss: 1.8821, Perplexity: 6.5672torch.Size([128, 12, 512])
Epoch [2/3], Step [1586/3236], Loss: 2.0115, Perplexity: 7.4745torch.Size([128, 11, 512])
Epoch [2/3], Step [1587/3236], Loss: 1.9790, Perplexity: 7.2353torch.Size([128, 17, 512])
Epoch [2/3], Step [1588/3236], Loss: 2.2660, Perplexity: 9.6404torch.Size([128, 12, 512])
Epoch [2/3], Step [1589/3236], Loss: 2.0836, Perplexity: 8.0334torch.Size([128, 16, 512])

Epoch [2/3], Step [1670/3236], Loss: 2.2538, Perplexity: 9.5242torch.Size([128, 14, 512])
Epoch [2/3], Step [1671/3236], Loss: 2.0782, Perplexity: 7.9903torch.Size([128, 12, 512])
Epoch [2/3], Step [1672/3236], Loss: 2.0374, Perplexity: 7.6708torch.Size([128, 11, 512])
Epoch [2/3], Step [1673/3236], Loss: 2.0584, Perplexity: 7.8338torch.Size([128, 13, 512])
Epoch [2/3], Step [1674/3236], Loss: 1.9901, Perplexity: 7.3165torch.Size([128, 12, 512])
Epoch [2/3], Step [1675/3236], Loss: 2.1704, Perplexity: 8.7616torch.Size([128, 11, 512])
Epoch [2/3], Step [1676/3236], Loss: 2.1697, Perplexity: 8.7557torch.Size([128, 14, 512])
Epoch [2/3], Step [1677/3236], Loss: 1.9703, Perplexity: 7.1727torch.Size([128, 12, 512])
Epoch [2/3], Step [1678/3236], Loss: 2.0352, Perplexity: 7.6534torch.Size([128, 13, 512])
Epoch [2/3], Step [1679/3236], Loss: 2.0038, Perplexity: 7.4171torch.Size([128, 13, 512])
Epoch [2/3], Step [1680/3236], Loss: 1.9327, Perplexity: 6.9080torch.Size([128, 9, 512])

Epoch [2/3], Step [1761/3236], Loss: 1.9277, Perplexity: 6.8734torch.Size([128, 14, 512])
Epoch [2/3], Step [1762/3236], Loss: 2.0944, Perplexity: 8.1207torch.Size([128, 11, 512])
Epoch [2/3], Step [1763/3236], Loss: 2.0531, Perplexity: 7.7918torch.Size([128, 11, 512])
Epoch [2/3], Step [1764/3236], Loss: 2.1687, Perplexity: 8.7470torch.Size([128, 13, 512])
Epoch [2/3], Step [1765/3236], Loss: 1.9971, Perplexity: 7.3676torch.Size([128, 13, 512])
Epoch [2/3], Step [1766/3236], Loss: 2.0020, Perplexity: 7.4039torch.Size([128, 16, 512])
Epoch [2/3], Step [1767/3236], Loss: 2.2153, Perplexity: 9.1638torch.Size([128, 16, 512])
Epoch [2/3], Step [1768/3236], Loss: 2.2995, Perplexity: 9.9689torch.Size([128, 13, 512])
Epoch [2/3], Step [1769/3236], Loss: 2.0631, Perplexity: 7.8704torch.Size([128, 17, 512])
Epoch [2/3], Step [1770/3236], Loss: 2.3272, Perplexity: 10.2488torch.Size([128, 11, 512])
Epoch [2/3], Step [1771/3236], Loss: 2.1492, Perplexity: 8.5779torch.Size([128, 11, 512])

Epoch [2/3], Step [1852/3236], Loss: 2.1518, Perplexity: 8.6007torch.Size([128, 12, 512])
Epoch [2/3], Step [1853/3236], Loss: 2.0975, Perplexity: 8.1455torch.Size([128, 17, 512])
Epoch [2/3], Step [1854/3236], Loss: 2.3780, Perplexity: 10.7836torch.Size([128, 19, 512])
Epoch [2/3], Step [1855/3236], Loss: 2.6633, Perplexity: 14.3437torch.Size([128, 12, 512])
Epoch [2/3], Step [1856/3236], Loss: 1.9585, Perplexity: 7.0887torch.Size([128, 12, 512])
Epoch [2/3], Step [1857/3236], Loss: 1.9845, Perplexity: 7.2757torch.Size([128, 14, 512])
Epoch [2/3], Step [1858/3236], Loss: 2.1108, Perplexity: 8.2548torch.Size([128, 17, 512])
Epoch [2/3], Step [1859/3236], Loss: 2.3840, Perplexity: 10.8485torch.Size([128, 12, 512])
Epoch [2/3], Step [1860/3236], Loss: 2.0239, Perplexity: 7.5675torch.Size([128, 12, 512])
Epoch [2/3], Step [1861/3236], Loss: 1.9920, Perplexity: 7.3305torch.Size([128, 13, 512])
Epoch [2/3], Step [1862/3236], Loss: 2.2291, Perplexity: 9.2919torch.Size([128, 14, 512])

Epoch [2/3], Step [1943/3236], Loss: 2.2920, Perplexity: 9.8942torch.Size([128, 17, 512])
Epoch [2/3], Step [1944/3236], Loss: 2.3257, Perplexity: 10.2335torch.Size([128, 12, 512])
Epoch [2/3], Step [1945/3236], Loss: 2.1081, Perplexity: 8.2328torch.Size([128, 14, 512])
Epoch [2/3], Step [1946/3236], Loss: 2.0225, Perplexity: 7.5569torch.Size([128, 15, 512])
Epoch [2/3], Step [1947/3236], Loss: 2.1033, Perplexity: 8.1932torch.Size([128, 12, 512])
Epoch [2/3], Step [1948/3236], Loss: 2.0937, Perplexity: 8.1150torch.Size([128, 13, 512])
Epoch [2/3], Step [1949/3236], Loss: 2.0084, Perplexity: 7.4511torch.Size([128, 11, 512])
Epoch [2/3], Step [1950/3236], Loss: 2.0748, Perplexity: 7.9627torch.Size([128, 13, 512])
Epoch [2/3], Step [1951/3236], Loss: 2.1487, Perplexity: 8.5741torch.Size([128, 14, 512])
Epoch [2/3], Step [1952/3236], Loss: 2.0333, Perplexity: 7.6393torch.Size([128, 13, 512])
Epoch [2/3], Step [1953/3236], Loss: 1.9192, Perplexity: 6.8155torch.Size([128, 13, 512])

Epoch [2/3], Step [2034/3236], Loss: 2.0520, Perplexity: 7.7835torch.Size([128, 13, 512])
Epoch [2/3], Step [2035/3236], Loss: 1.9926, Perplexity: 7.3349torch.Size([128, 13, 512])
Epoch [2/3], Step [2036/3236], Loss: 2.0578, Perplexity: 7.8290torch.Size([128, 12, 512])
Epoch [2/3], Step [2037/3236], Loss: 2.0549, Perplexity: 7.8058torch.Size([128, 12, 512])
Epoch [2/3], Step [2038/3236], Loss: 1.9400, Perplexity: 6.9589torch.Size([128, 12, 512])
Epoch [2/3], Step [2039/3236], Loss: 1.9583, Perplexity: 7.0872torch.Size([128, 13, 512])
Epoch [2/3], Step [2040/3236], Loss: 2.1014, Perplexity: 8.1774torch.Size([128, 13, 512])
Epoch [2/3], Step [2041/3236], Loss: 1.9786, Perplexity: 7.2329torch.Size([128, 12, 512])
Epoch [2/3], Step [2042/3236], Loss: 1.9859, Perplexity: 7.2855torch.Size([128, 13, 512])
Epoch [2/3], Step [2043/3236], Loss: 1.9774, Perplexity: 7.2237torch.Size([128, 13, 512])
Epoch [2/3], Step [2044/3236], Loss: 1.9958, Perplexity: 7.3580torch.Size([128, 12, 512])

Epoch [2/3], Step [2125/3236], Loss: 2.0849, Perplexity: 8.0434torch.Size([128, 12, 512])
Epoch [2/3], Step [2126/3236], Loss: 1.9257, Perplexity: 6.8597torch.Size([128, 17, 512])
Epoch [2/3], Step [2127/3236], Loss: 2.4545, Perplexity: 11.6407torch.Size([128, 14, 512])
Epoch [2/3], Step [2128/3236], Loss: 2.0144, Perplexity: 7.4963torch.Size([128, 14, 512])
Epoch [2/3], Step [2129/3236], Loss: 2.0147, Perplexity: 7.4982torch.Size([128, 11, 512])
Epoch [2/3], Step [2130/3236], Loss: 2.0172, Perplexity: 7.5176torch.Size([128, 11, 512])
Epoch [2/3], Step [2131/3236], Loss: 2.0831, Perplexity: 8.0290torch.Size([128, 12, 512])
Epoch [2/3], Step [2132/3236], Loss: 1.9128, Perplexity: 6.7719torch.Size([128, 13, 512])
Epoch [2/3], Step [2133/3236], Loss: 1.9845, Perplexity: 7.2754torch.Size([128, 16, 512])
Epoch [2/3], Step [2134/3236], Loss: 2.1556, Perplexity: 8.6330torch.Size([128, 13, 512])
Epoch [2/3], Step [2135/3236], Loss: 2.0732, Perplexity: 7.9503torch.Size([128, 10, 512])

Epoch [2/3], Step [2216/3236], Loss: 2.2019, Perplexity: 9.0425torch.Size([128, 13, 512])
Epoch [2/3], Step [2217/3236], Loss: 1.9566, Perplexity: 7.0751torch.Size([128, 12, 512])
Epoch [2/3], Step [2218/3236], Loss: 1.9223, Perplexity: 6.8364torch.Size([128, 14, 512])
Epoch [2/3], Step [2219/3236], Loss: 2.1110, Perplexity: 8.2568torch.Size([128, 12, 512])
Epoch [2/3], Step [2220/3236], Loss: 2.1273, Perplexity: 8.3918torch.Size([128, 12, 512])
Epoch [2/3], Step [2221/3236], Loss: 1.8414, Perplexity: 6.3056torch.Size([128, 12, 512])
Epoch [2/3], Step [2222/3236], Loss: 1.9637, Perplexity: 7.1257torch.Size([128, 12, 512])
Epoch [2/3], Step [2223/3236], Loss: 2.0740, Perplexity: 7.9570torch.Size([128, 17, 512])
Epoch [2/3], Step [2224/3236], Loss: 2.5219, Perplexity: 12.4520torch.Size([128, 12, 512])
Epoch [2/3], Step [2225/3236], Loss: 2.0338, Perplexity: 7.6433torch.Size([128, 12, 512])
Epoch [2/3], Step [2226/3236], Loss: 1.9178, Perplexity: 6.8061torch.Size([128, 12, 512])

Epoch [2/3], Step [2307/3236], Loss: 2.0547, Perplexity: 7.8046torch.Size([128, 17, 512])
Epoch [2/3], Step [2308/3236], Loss: 2.3246, Perplexity: 10.2222torch.Size([128, 13, 512])
Epoch [2/3], Step [2309/3236], Loss: 2.0661, Perplexity: 7.8942torch.Size([128, 13, 512])
Epoch [2/3], Step [2310/3236], Loss: 1.8744, Perplexity: 6.5167torch.Size([128, 26, 512])
Epoch [2/3], Step [2311/3236], Loss: 3.0624, Perplexity: 21.3778torch.Size([128, 12, 512])
Epoch [2/3], Step [2312/3236], Loss: 1.9741, Perplexity: 7.1999torch.Size([128, 11, 512])
Epoch [2/3], Step [2313/3236], Loss: 1.9831, Perplexity: 7.2652torch.Size([128, 15, 512])
Epoch [2/3], Step [2314/3236], Loss: 2.1410, Perplexity: 8.5077torch.Size([128, 11, 512])
Epoch [2/3], Step [2315/3236], Loss: 2.0117, Perplexity: 7.4758torch.Size([128, 11, 512])
Epoch [2/3], Step [2316/3236], Loss: 2.1325, Perplexity: 8.4359torch.Size([128, 14, 512])
Epoch [2/3], Step [2317/3236], Loss: 2.0068, Perplexity: 7.4395torch.Size([128, 15, 512])

Epoch [2/3], Step [2398/3236], Loss: 2.0178, Perplexity: 7.5220torch.Size([128, 14, 512])
Epoch [2/3], Step [2399/3236], Loss: 2.0630, Perplexity: 7.8698torch.Size([128, 10, 512])
Epoch [2/3], Step [2400/3236], Loss: 2.3806, Perplexity: 10.8109
torch.Size([128, 20, 512])
Epoch [2/3], Step [2401/3236], Loss: 2.5855, Perplexity: 13.2694torch.Size([128, 15, 512])
Epoch [2/3], Step [2402/3236], Loss: 2.1666, Perplexity: 8.7285torch.Size([128, 15, 512])
Epoch [2/3], Step [2403/3236], Loss: 2.2475, Perplexity: 9.4638torch.Size([128, 10, 512])
Epoch [2/3], Step [2404/3236], Loss: 2.0965, Perplexity: 8.1376torch.Size([128, 12, 512])
Epoch [2/3], Step [2405/3236], Loss: 1.9553, Perplexity: 7.0662torch.Size([128, 13, 512])
Epoch [2/3], Step [2406/3236], Loss: 2.0077, Perplexity: 7.4459torch.Size([128, 14, 512])
Epoch [2/3], Step [2407/3236], Loss: 2.1055, Perplexity: 8.2113torch.Size([128, 12, 512])
Epoch [2/3], Step [2408/3236], Loss: 2.0012, Perplexity: 7.3980torch.Size([128, 10, 512])

Epoch [2/3], Step [2489/3236], Loss: 2.1388, Perplexity: 8.4893torch.Size([128, 12, 512])
Epoch [2/3], Step [2490/3236], Loss: 1.9010, Perplexity: 6.6923torch.Size([128, 13, 512])
Epoch [2/3], Step [2491/3236], Loss: 1.9246, Perplexity: 6.8526torch.Size([128, 14, 512])
Epoch [2/3], Step [2492/3236], Loss: 2.0640, Perplexity: 7.8777torch.Size([128, 15, 512])
Epoch [2/3], Step [2493/3236], Loss: 2.1418, Perplexity: 8.5152torch.Size([128, 13, 512])
Epoch [2/3], Step [2494/3236], Loss: 2.0982, Perplexity: 8.1515torch.Size([128, 11, 512])
Epoch [2/3], Step [2495/3236], Loss: 2.1675, Perplexity: 8.7361torch.Size([128, 20, 512])
Epoch [2/3], Step [2496/3236], Loss: 2.5852, Perplexity: 13.2660torch.Size([128, 13, 512])
Epoch [2/3], Step [2497/3236], Loss: 1.9225, Perplexity: 6.8378torch.Size([128, 11, 512])
Epoch [2/3], Step [2498/3236], Loss: 1.9822, Perplexity: 7.2585torch.Size([128, 12, 512])
Epoch [2/3], Step [2499/3236], Loss: 2.0589, Perplexity: 7.8374torch.Size([128, 10, 512])

Epoch [2/3], Step [2580/3236], Loss: 2.0714, Perplexity: 7.9359torch.Size([128, 13, 512])
Epoch [2/3], Step [2581/3236], Loss: 2.0007, Perplexity: 7.3946torch.Size([128, 13, 512])
Epoch [2/3], Step [2582/3236], Loss: 2.0102, Perplexity: 7.4646torch.Size([128, 13, 512])
Epoch [2/3], Step [2583/3236], Loss: 2.0126, Perplexity: 7.4825torch.Size([128, 15, 512])
Epoch [2/3], Step [2584/3236], Loss: 2.1384, Perplexity: 8.4863torch.Size([128, 16, 512])
Epoch [2/3], Step [2585/3236], Loss: 2.2150, Perplexity: 9.1615torch.Size([128, 19, 512])
Epoch [2/3], Step [2586/3236], Loss: 2.5115, Perplexity: 12.3232torch.Size([128, 17, 512])
Epoch [2/3], Step [2587/3236], Loss: 2.3221, Perplexity: 10.1973torch.Size([128, 13, 512])
Epoch [2/3], Step [2588/3236], Loss: 2.0716, Perplexity: 7.9376torch.Size([128, 19, 512])
Epoch [2/3], Step [2589/3236], Loss: 2.5568, Perplexity: 12.8945torch.Size([128, 12, 512])
Epoch [2/3], Step [2590/3236], Loss: 1.9375, Perplexity: 6.9416torch.Size([128, 9, 512])

Epoch [2/3], Step [2671/3236], Loss: 2.0557, Perplexity: 7.8120torch.Size([128, 15, 512])
Epoch [2/3], Step [2672/3236], Loss: 2.1805, Perplexity: 8.8508torch.Size([128, 12, 512])
Epoch [2/3], Step [2673/3236], Loss: 2.1179, Perplexity: 8.3138torch.Size([128, 13, 512])
Epoch [2/3], Step [2674/3236], Loss: 1.9481, Perplexity: 7.0153torch.Size([128, 11, 512])
Epoch [2/3], Step [2675/3236], Loss: 2.0756, Perplexity: 7.9691torch.Size([128, 13, 512])
Epoch [2/3], Step [2676/3236], Loss: 2.0421, Perplexity: 7.7066torch.Size([128, 14, 512])
Epoch [2/3], Step [2677/3236], Loss: 2.1344, Perplexity: 8.4518torch.Size([128, 14, 512])
Epoch [2/3], Step [2678/3236], Loss: 2.0964, Perplexity: 8.1372torch.Size([128, 11, 512])
Epoch [2/3], Step [2679/3236], Loss: 1.9827, Perplexity: 7.2622torch.Size([128, 27, 512])
Epoch [2/3], Step [2680/3236], Loss: 2.9006, Perplexity: 18.1845torch.Size([128, 12, 512])
Epoch [2/3], Step [2681/3236], Loss: 1.9293, Perplexity: 6.8844torch.Size([128, 21, 512])

Epoch [2/3], Step [2762/3236], Loss: 2.2111, Perplexity: 9.1255torch.Size([128, 15, 512])
Epoch [2/3], Step [2763/3236], Loss: 2.1775, Perplexity: 8.8244torch.Size([128, 14, 512])
Epoch [2/3], Step [2764/3236], Loss: 2.0313, Perplexity: 7.6236torch.Size([128, 17, 512])
Epoch [2/3], Step [2765/3236], Loss: 2.4004, Perplexity: 11.0274torch.Size([128, 11, 512])
Epoch [2/3], Step [2766/3236], Loss: 2.0243, Perplexity: 7.5707torch.Size([128, 12, 512])
Epoch [2/3], Step [2767/3236], Loss: 2.0078, Perplexity: 7.4472torch.Size([128, 11, 512])
Epoch [2/3], Step [2768/3236], Loss: 2.0538, Perplexity: 7.7974torch.Size([128, 13, 512])
Epoch [2/3], Step [2769/3236], Loss: 1.8722, Perplexity: 6.5027torch.Size([128, 13, 512])
Epoch [2/3], Step [2770/3236], Loss: 1.9691, Perplexity: 7.1640torch.Size([128, 12, 512])
Epoch [2/3], Step [2771/3236], Loss: 1.9407, Perplexity: 6.9637torch.Size([128, 14, 512])
Epoch [2/3], Step [2772/3236], Loss: 2.0223, Perplexity: 7.5560torch.Size([128, 10, 512])

Epoch [2/3], Step [2853/3236], Loss: 1.8526, Perplexity: 6.3765torch.Size([128, 12, 512])
Epoch [2/3], Step [2854/3236], Loss: 1.9493, Perplexity: 7.0239torch.Size([128, 11, 512])
Epoch [2/3], Step [2855/3236], Loss: 2.0627, Perplexity: 7.8674torch.Size([128, 13, 512])
Epoch [2/3], Step [2856/3236], Loss: 1.9061, Perplexity: 6.7270torch.Size([128, 16, 512])
Epoch [2/3], Step [2857/3236], Loss: 2.2995, Perplexity: 9.9696torch.Size([128, 27, 512])
Epoch [2/3], Step [2858/3236], Loss: 2.8374, Perplexity: 17.0707torch.Size([128, 10, 512])
Epoch [2/3], Step [2859/3236], Loss: 2.3284, Perplexity: 10.2612torch.Size([128, 12, 512])
Epoch [2/3], Step [2860/3236], Loss: 2.0177, Perplexity: 7.5212torch.Size([128, 12, 512])
Epoch [2/3], Step [2861/3236], Loss: 2.0445, Perplexity: 7.7253torch.Size([128, 15, 512])
Epoch [2/3], Step [2862/3236], Loss: 2.0338, Perplexity: 7.6434torch.Size([128, 12, 512])
Epoch [2/3], Step [2863/3236], Loss: 2.0752, Perplexity: 7.9659torch.Size([128, 12, 512])

Epoch [2/3], Step [2944/3236], Loss: 2.0492, Perplexity: 7.7619torch.Size([128, 12, 512])
Epoch [2/3], Step [2945/3236], Loss: 2.0090, Perplexity: 7.4559torch.Size([128, 11, 512])
Epoch [2/3], Step [2946/3236], Loss: 1.9749, Perplexity: 7.2057torch.Size([128, 11, 512])
Epoch [2/3], Step [2947/3236], Loss: 1.8747, Perplexity: 6.5188torch.Size([128, 14, 512])
Epoch [2/3], Step [2948/3236], Loss: 2.0351, Perplexity: 7.6529torch.Size([128, 15, 512])
Epoch [2/3], Step [2949/3236], Loss: 2.1136, Perplexity: 8.2783torch.Size([128, 12, 512])
Epoch [2/3], Step [2950/3236], Loss: 1.9309, Perplexity: 6.8957torch.Size([128, 14, 512])
Epoch [2/3], Step [2951/3236], Loss: 2.0660, Perplexity: 7.8930torch.Size([128, 12, 512])
Epoch [2/3], Step [2952/3236], Loss: 2.0704, Perplexity: 7.9278torch.Size([128, 15, 512])
Epoch [2/3], Step [2953/3236], Loss: 2.1616, Perplexity: 8.6846torch.Size([128, 15, 512])
Epoch [2/3], Step [2954/3236], Loss: 2.1425, Perplexity: 8.5207torch.Size([128, 10, 512])

Epoch [2/3], Step [3035/3236], Loss: 2.1501, Perplexity: 8.5857torch.Size([128, 13, 512])
Epoch [2/3], Step [3036/3236], Loss: 1.9347, Perplexity: 6.9220torch.Size([128, 13, 512])
Epoch [2/3], Step [3037/3236], Loss: 1.9655, Perplexity: 7.1382torch.Size([128, 11, 512])
Epoch [2/3], Step [3038/3236], Loss: 2.2722, Perplexity: 9.7009torch.Size([128, 13, 512])
Epoch [2/3], Step [3039/3236], Loss: 1.9990, Perplexity: 7.3815torch.Size([128, 13, 512])
Epoch [2/3], Step [3040/3236], Loss: 1.9502, Perplexity: 7.0301torch.Size([128, 13, 512])
Epoch [2/3], Step [3041/3236], Loss: 1.9619, Perplexity: 7.1128torch.Size([128, 13, 512])
Epoch [2/3], Step [3042/3236], Loss: 1.9410, Perplexity: 6.9654torch.Size([128, 11, 512])
Epoch [2/3], Step [3043/3236], Loss: 2.0509, Perplexity: 7.7745torch.Size([128, 12, 512])
Epoch [2/3], Step [3044/3236], Loss: 1.9285, Perplexity: 6.8790torch.Size([128, 24, 512])
Epoch [2/3], Step [3045/3236], Loss: 2.8968, Perplexity: 18.1159torch.Size([128, 12, 512])

Epoch [2/3], Step [3126/3236], Loss: 1.9848, Perplexity: 7.2776torch.Size([128, 14, 512])
Epoch [2/3], Step [3127/3236], Loss: 2.0176, Perplexity: 7.5205torch.Size([128, 12, 512])
Epoch [2/3], Step [3128/3236], Loss: 1.9280, Perplexity: 6.8757torch.Size([128, 11, 512])
Epoch [2/3], Step [3129/3236], Loss: 1.9827, Perplexity: 7.2621torch.Size([128, 11, 512])
Epoch [2/3], Step [3130/3236], Loss: 2.1063, Perplexity: 8.2181torch.Size([128, 11, 512])
Epoch [2/3], Step [3131/3236], Loss: 2.1444, Perplexity: 8.5367torch.Size([128, 11, 512])
Epoch [2/3], Step [3132/3236], Loss: 2.0414, Perplexity: 7.7017torch.Size([128, 14, 512])
Epoch [2/3], Step [3133/3236], Loss: 2.0735, Perplexity: 7.9526torch.Size([128, 12, 512])
Epoch [2/3], Step [3134/3236], Loss: 2.0211, Perplexity: 7.5463torch.Size([128, 12, 512])
Epoch [2/3], Step [3135/3236], Loss: 1.9457, Perplexity: 6.9983torch.Size([128, 12, 512])
Epoch [2/3], Step [3136/3236], Loss: 1.9245, Perplexity: 6.8517torch.Size([128, 16, 512])

Epoch [2/3], Step [3217/3236], Loss: 2.0969, Perplexity: 8.1410torch.Size([128, 12, 512])
Epoch [2/3], Step [3218/3236], Loss: 2.0105, Perplexity: 7.4674torch.Size([128, 13, 512])
Epoch [2/3], Step [3219/3236], Loss: 2.0020, Perplexity: 7.4037torch.Size([128, 14, 512])
Epoch [2/3], Step [3220/3236], Loss: 1.9902, Perplexity: 7.3172torch.Size([128, 15, 512])
Epoch [2/3], Step [3221/3236], Loss: 2.1919, Perplexity: 8.9520torch.Size([128, 13, 512])
Epoch [2/3], Step [3222/3236], Loss: 2.0355, Perplexity: 7.6563torch.Size([128, 12, 512])
Epoch [2/3], Step [3223/3236], Loss: 2.0223, Perplexity: 7.5555torch.Size([128, 13, 512])
Epoch [2/3], Step [3224/3236], Loss: 2.0725, Perplexity: 7.9444torch.Size([128, 15, 512])
Epoch [2/3], Step [3225/3236], Loss: 2.0287, Perplexity: 7.6039torch.Size([128, 15, 512])
Epoch [2/3], Step [3226/3236], Loss: 2.1917, Perplexity: 8.9506torch.Size([128, 18, 512])
Epoch [2/3], Step [3227/3236], Loss: 2.5352, Perplexity: 12.6194torch.Size([128, 14, 512])

Epoch [3/3], Step [165/3236], Loss: 2.0674, Perplexity: 7.9040torch.Size([128, 14, 512])
Epoch [3/3], Step [166/3236], Loss: 1.9583, Perplexity: 7.0875torch.Size([128, 10, 512])
Epoch [3/3], Step [167/3236], Loss: 2.4019, Perplexity: 11.0441torch.Size([128, 16, 512])
Epoch [3/3], Step [168/3236], Loss: 2.1597, Perplexity: 8.6682torch.Size([128, 14, 512])
Epoch [3/3], Step [169/3236], Loss: 1.9217, Perplexity: 6.8324torch.Size([128, 14, 512])
Epoch [3/3], Step [170/3236], Loss: 2.0118, Perplexity: 7.4770torch.Size([128, 11, 512])
Epoch [3/3], Step [171/3236], Loss: 2.1699, Perplexity: 8.7572torch.Size([128, 12, 512])
Epoch [3/3], Step [172/3236], Loss: 2.0555, Perplexity: 7.8104torch.Size([128, 14, 512])
Epoch [3/3], Step [173/3236], Loss: 2.0550, Perplexity: 7.8071torch.Size([128, 12, 512])
Epoch [3/3], Step [174/3236], Loss: 1.9730, Perplexity: 7.1922torch.Size([128, 14, 512])
Epoch [3/3], Step [175/3236], Loss: 2.0272, Perplexity: 7.5931torch.Size([128, 13, 512])

Epoch [3/3], Step [257/3236], Loss: 2.0382, Perplexity: 7.6768torch.Size([128, 14, 512])
Epoch [3/3], Step [258/3236], Loss: 2.0024, Perplexity: 7.4066torch.Size([128, 14, 512])
Epoch [3/3], Step [259/3236], Loss: 1.8888, Perplexity: 6.6117torch.Size([128, 10, 512])
Epoch [3/3], Step [260/3236], Loss: 2.1897, Perplexity: 8.9328torch.Size([128, 13, 512])
Epoch [3/3], Step [261/3236], Loss: 1.9115, Perplexity: 6.7635torch.Size([128, 12, 512])
Epoch [3/3], Step [262/3236], Loss: 1.9265, Perplexity: 6.8654torch.Size([128, 12, 512])
Epoch [3/3], Step [263/3236], Loss: 1.9525, Perplexity: 7.0463torch.Size([128, 15, 512])
Epoch [3/3], Step [264/3236], Loss: 2.1205, Perplexity: 8.3350torch.Size([128, 20, 512])
Epoch [3/3], Step [265/3236], Loss: 2.6622, Perplexity: 14.3279torch.Size([128, 17, 512])
Epoch [3/3], Step [266/3236], Loss: 2.3648, Perplexity: 10.6424torch.Size([128, 12, 512])
Epoch [3/3], Step [267/3236], Loss: 1.9997, Perplexity: 7.3865torch.Size([128, 26, 512])

Epoch [3/3], Step [349/3236], Loss: 1.9029, Perplexity: 6.7053torch.Size([128, 16, 512])
Epoch [3/3], Step [350/3236], Loss: 2.1004, Perplexity: 8.1697torch.Size([128, 14, 512])
Epoch [3/3], Step [351/3236], Loss: 2.0625, Perplexity: 7.8655torch.Size([128, 17, 512])
Epoch [3/3], Step [352/3236], Loss: 2.3966, Perplexity: 10.9861torch.Size([128, 13, 512])
Epoch [3/3], Step [353/3236], Loss: 1.9938, Perplexity: 7.3435torch.Size([128, 15, 512])
Epoch [3/3], Step [354/3236], Loss: 2.1906, Perplexity: 8.9410torch.Size([128, 11, 512])
Epoch [3/3], Step [355/3236], Loss: 2.0745, Perplexity: 7.9609torch.Size([128, 11, 512])
Epoch [3/3], Step [356/3236], Loss: 1.9313, Perplexity: 6.8988torch.Size([128, 18, 512])
Epoch [3/3], Step [357/3236], Loss: 2.4167, Perplexity: 11.2084torch.Size([128, 11, 512])
Epoch [3/3], Step [358/3236], Loss: 1.9233, Perplexity: 6.8437torch.Size([128, 14, 512])
Epoch [3/3], Step [359/3236], Loss: 2.0186, Perplexity: 7.5275torch.Size([128, 18, 512])

Epoch [3/3], Step [441/3236], Loss: 1.8692, Perplexity: 6.4828torch.Size([128, 17, 512])
Epoch [3/3], Step [442/3236], Loss: 2.1849, Perplexity: 8.8899torch.Size([128, 13, 512])
Epoch [3/3], Step [443/3236], Loss: 1.9766, Perplexity: 7.2180torch.Size([128, 13, 512])
Epoch [3/3], Step [444/3236], Loss: 1.9866, Perplexity: 7.2906torch.Size([128, 10, 512])
Epoch [3/3], Step [445/3236], Loss: 2.3657, Perplexity: 10.6518torch.Size([128, 16, 512])
Epoch [3/3], Step [446/3236], Loss: 2.1795, Perplexity: 8.8422torch.Size([128, 16, 512])
Epoch [3/3], Step [447/3236], Loss: 2.2081, Perplexity: 9.0985torch.Size([128, 19, 512])
Epoch [3/3], Step [448/3236], Loss: 2.5340, Perplexity: 12.6032torch.Size([128, 12, 512])
Epoch [3/3], Step [449/3236], Loss: 1.9697, Perplexity: 7.1685torch.Size([128, 13, 512])
Epoch [3/3], Step [450/3236], Loss: 2.0270, Perplexity: 7.5915torch.Size([128, 14, 512])
Epoch [3/3], Step [451/3236], Loss: 1.9462, Perplexity: 7.0023torch.Size([128, 16, 512])

Epoch [3/3], Step [533/3236], Loss: 2.1138, Perplexity: 8.2794torch.Size([128, 15, 512])
Epoch [3/3], Step [534/3236], Loss: 2.0787, Perplexity: 7.9937torch.Size([128, 12, 512])
Epoch [3/3], Step [535/3236], Loss: 1.8977, Perplexity: 6.6705torch.Size([128, 12, 512])
Epoch [3/3], Step [536/3236], Loss: 1.9011, Perplexity: 6.6934torch.Size([128, 11, 512])
Epoch [3/3], Step [537/3236], Loss: 2.0084, Perplexity: 7.4516torch.Size([128, 15, 512])
Epoch [3/3], Step [538/3236], Loss: 2.0982, Perplexity: 8.1516torch.Size([128, 12, 512])
Epoch [3/3], Step [539/3236], Loss: 2.0093, Perplexity: 7.4583torch.Size([128, 13, 512])
Epoch [3/3], Step [540/3236], Loss: 1.9016, Perplexity: 6.6969torch.Size([128, 13, 512])
Epoch [3/3], Step [541/3236], Loss: 1.9151, Perplexity: 6.7878torch.Size([128, 14, 512])
Epoch [3/3], Step [542/3236], Loss: 2.0066, Perplexity: 7.4378torch.Size([128, 10, 512])
Epoch [3/3], Step [543/3236], Loss: 2.2573, Perplexity: 9.5572torch.Size([128, 15, 512])

Epoch [3/3], Step [625/3236], Loss: 2.7893, Perplexity: 16.2689torch.Size([128, 12, 512])
Epoch [3/3], Step [626/3236], Loss: 1.8519, Perplexity: 6.3719torch.Size([128, 15, 512])
Epoch [3/3], Step [627/3236], Loss: 2.0686, Perplexity: 7.9135torch.Size([128, 13, 512])
Epoch [3/3], Step [628/3236], Loss: 2.0540, Perplexity: 7.7988torch.Size([128, 12, 512])
Epoch [3/3], Step [629/3236], Loss: 1.8552, Perplexity: 6.3928torch.Size([128, 11, 512])
Epoch [3/3], Step [630/3236], Loss: 2.1537, Perplexity: 8.6168torch.Size([128, 11, 512])
Epoch [3/3], Step [631/3236], Loss: 2.1263, Perplexity: 8.3836torch.Size([128, 11, 512])
Epoch [3/3], Step [632/3236], Loss: 2.0792, Perplexity: 7.9982torch.Size([128, 17, 512])
Epoch [3/3], Step [633/3236], Loss: 2.2153, Perplexity: 9.1637torch.Size([128, 11, 512])
Epoch [3/3], Step [634/3236], Loss: 2.1349, Perplexity: 8.4563torch.Size([128, 11, 512])
Epoch [3/3], Step [635/3236], Loss: 1.9977, Perplexity: 7.3722torch.Size([128, 14, 512])

Epoch [3/3], Step [717/3236], Loss: 2.1166, Perplexity: 8.3030torch.Size([128, 11, 512])
Epoch [3/3], Step [718/3236], Loss: 2.0450, Perplexity: 7.7289torch.Size([128, 11, 512])
Epoch [3/3], Step [719/3236], Loss: 2.0754, Perplexity: 7.9676torch.Size([128, 14, 512])
Epoch [3/3], Step [720/3236], Loss: 1.9680, Perplexity: 7.1565torch.Size([128, 13, 512])
Epoch [3/3], Step [721/3236], Loss: 1.9555, Perplexity: 7.0674torch.Size([128, 13, 512])
Epoch [3/3], Step [722/3236], Loss: 1.8700, Perplexity: 6.4884torch.Size([128, 12, 512])
Epoch [3/3], Step [723/3236], Loss: 2.0247, Perplexity: 7.5737torch.Size([128, 11, 512])
Epoch [3/3], Step [724/3236], Loss: 1.8558, Perplexity: 6.3965torch.Size([128, 14, 512])
Epoch [3/3], Step [725/3236], Loss: 2.0287, Perplexity: 7.6039torch.Size([128, 12, 512])
Epoch [3/3], Step [726/3236], Loss: 1.9474, Perplexity: 7.0103torch.Size([128, 12, 512])
Epoch [3/3], Step [727/3236], Loss: 1.9028, Perplexity: 6.7046torch.Size([128, 10, 512])

Epoch [3/3], Step [809/3236], Loss: 1.9206, Perplexity: 6.8248torch.Size([128, 12, 512])
Epoch [3/3], Step [810/3236], Loss: 1.9061, Perplexity: 6.7268torch.Size([128, 13, 512])
Epoch [3/3], Step [811/3236], Loss: 1.9668, Perplexity: 7.1476torch.Size([128, 12, 512])
Epoch [3/3], Step [812/3236], Loss: 1.9710, Perplexity: 7.1780torch.Size([128, 11, 512])
Epoch [3/3], Step [813/3236], Loss: 2.0675, Perplexity: 7.9053torch.Size([128, 10, 512])
Epoch [3/3], Step [814/3236], Loss: 2.2712, Perplexity: 9.6906torch.Size([128, 15, 512])
Epoch [3/3], Step [815/3236], Loss: 2.0768, Perplexity: 7.9791torch.Size([128, 13, 512])
Epoch [3/3], Step [816/3236], Loss: 1.8849, Perplexity: 6.5860torch.Size([128, 13, 512])
Epoch [3/3], Step [817/3236], Loss: 2.0033, Perplexity: 7.4132torch.Size([128, 15, 512])
Epoch [3/3], Step [818/3236], Loss: 2.0991, Perplexity: 8.1589torch.Size([128, 11, 512])
Epoch [3/3], Step [819/3236], Loss: 2.0269, Perplexity: 7.5905torch.Size([128, 12, 512])

Epoch [3/3], Step [901/3236], Loss: 1.8701, Perplexity: 6.4891torch.Size([128, 16, 512])
Epoch [3/3], Step [902/3236], Loss: 2.2729, Perplexity: 9.7072torch.Size([128, 14, 512])
Epoch [3/3], Step [903/3236], Loss: 2.0097, Perplexity: 7.4614torch.Size([128, 14, 512])
Epoch [3/3], Step [904/3236], Loss: 2.0623, Perplexity: 7.8639torch.Size([128, 20, 512])
Epoch [3/3], Step [905/3236], Loss: 2.6359, Perplexity: 13.9560torch.Size([128, 13, 512])
Epoch [3/3], Step [906/3236], Loss: 1.9163, Perplexity: 6.7957torch.Size([128, 11, 512])
Epoch [3/3], Step [907/3236], Loss: 2.0069, Perplexity: 7.4401torch.Size([128, 18, 512])
Epoch [3/3], Step [908/3236], Loss: 2.3899, Perplexity: 10.9129torch.Size([128, 11, 512])
Epoch [3/3], Step [909/3236], Loss: 2.0202, Perplexity: 7.5397torch.Size([128, 14, 512])
Epoch [3/3], Step [910/3236], Loss: 1.8942, Perplexity: 6.6473torch.Size([128, 13, 512])
Epoch [3/3], Step [911/3236], Loss: 1.8310, Perplexity: 6.2403torch.Size([128, 14, 512])

Epoch [3/3], Step [993/3236], Loss: 2.0731, Perplexity: 7.9496torch.Size([128, 12, 512])
Epoch [3/3], Step [994/3236], Loss: 2.0008, Perplexity: 7.3949torch.Size([128, 13, 512])
Epoch [3/3], Step [995/3236], Loss: 1.9817, Perplexity: 7.2551torch.Size([128, 12, 512])
Epoch [3/3], Step [996/3236], Loss: 2.0067, Perplexity: 7.4387torch.Size([128, 14, 512])
Epoch [3/3], Step [997/3236], Loss: 1.8833, Perplexity: 6.5751torch.Size([128, 11, 512])
Epoch [3/3], Step [998/3236], Loss: 2.0459, Perplexity: 7.7361torch.Size([128, 12, 512])
Epoch [3/3], Step [999/3236], Loss: 1.8028, Perplexity: 6.0669torch.Size([128, 13, 512])
Epoch [3/3], Step [1000/3236], Loss: 2.0199, Perplexity: 7.5379
torch.Size([128, 14, 512])
Epoch [3/3], Step [1001/3236], Loss: 2.0506, Perplexity: 7.7729torch.Size([128, 16, 512])
Epoch [3/3], Step [1002/3236], Loss: 2.2013, Perplexity: 9.0371torch.Size([128, 12, 512])
Epoch [3/3], Step [1003/3236], Loss: 2.0045, Perplexity: 7.4227torch.Size([128, 14, 512])

Epoch [3/3], Step [1175/3236], Loss: 2.0352, Perplexity: 7.6537torch.Size([128, 10, 512])
Epoch [3/3], Step [1176/3236], Loss: 2.1456, Perplexity: 8.5470torch.Size([128, 11, 512])
Epoch [3/3], Step [1177/3236], Loss: 2.0296, Perplexity: 7.6114torch.Size([128, 12, 512])
Epoch [3/3], Step [1178/3236], Loss: 1.8688, Perplexity: 6.4806torch.Size([128, 16, 512])
Epoch [3/3], Step [1179/3236], Loss: 2.1578, Perplexity: 8.6524torch.Size([128, 11, 512])
Epoch [3/3], Step [1180/3236], Loss: 1.9789, Perplexity: 7.2347torch.Size([128, 13, 512])
Epoch [3/3], Step [1181/3236], Loss: 2.0796, Perplexity: 8.0017torch.Size([128, 14, 512])
Epoch [3/3], Step [1182/3236], Loss: 1.9790, Perplexity: 7.2354torch.Size([128, 12, 512])
Epoch [3/3], Step [1183/3236], Loss: 1.9530, Perplexity: 7.0501torch.Size([128, 13, 512])
Epoch [3/3], Step [1184/3236], Loss: 1.9291, Perplexity: 6.8831torch.Size([128, 13, 512])
Epoch [3/3], Step [1185/3236], Loss: 1.8325, Perplexity: 6.2492torch.Size([128, 11, 512])

Epoch [3/3], Step [1266/3236], Loss: 2.1404, Perplexity: 8.5028torch.Size([128, 15, 512])
Epoch [3/3], Step [1267/3236], Loss: 2.1332, Perplexity: 8.4419torch.Size([128, 12, 512])
Epoch [3/3], Step [1268/3236], Loss: 1.8302, Perplexity: 6.2349torch.Size([128, 11, 512])
Epoch [3/3], Step [1269/3236], Loss: 1.9966, Perplexity: 7.3641torch.Size([128, 13, 512])
Epoch [3/3], Step [1270/3236], Loss: 1.9298, Perplexity: 6.8879torch.Size([128, 38, 512])
Epoch [3/3], Step [1271/3236], Loss: 3.7720, Perplexity: 43.4654torch.Size([128, 15, 512])
Epoch [3/3], Step [1272/3236], Loss: 2.0620, Perplexity: 7.8614torch.Size([128, 15, 512])
Epoch [3/3], Step [1273/3236], Loss: 2.0461, Perplexity: 7.7374torch.Size([128, 12, 512])
Epoch [3/3], Step [1274/3236], Loss: 1.9466, Perplexity: 7.0046torch.Size([128, 11, 512])
Epoch [3/3], Step [1275/3236], Loss: 1.8619, Perplexity: 6.4357torch.Size([128, 12, 512])
Epoch [3/3], Step [1276/3236], Loss: 1.8872, Perplexity: 6.6012torch.Size([128, 14, 512])

Epoch [3/3], Step [1357/3236], Loss: 2.2236, Perplexity: 9.2409torch.Size([128, 14, 512])
Epoch [3/3], Step [1358/3236], Loss: 1.8883, Perplexity: 6.6084torch.Size([128, 16, 512])
Epoch [3/3], Step [1359/3236], Loss: 2.1280, Perplexity: 8.3977torch.Size([128, 10, 512])
Epoch [3/3], Step [1360/3236], Loss: 2.1809, Perplexity: 8.8543torch.Size([128, 12, 512])
Epoch [3/3], Step [1361/3236], Loss: 1.9571, Perplexity: 7.0789torch.Size([128, 21, 512])
Epoch [3/3], Step [1362/3236], Loss: 2.6216, Perplexity: 13.7580torch.Size([128, 14, 512])
Epoch [3/3], Step [1363/3236], Loss: 1.9248, Perplexity: 6.8536torch.Size([128, 13, 512])
Epoch [3/3], Step [1364/3236], Loss: 1.8796, Perplexity: 6.5510torch.Size([128, 13, 512])
Epoch [3/3], Step [1365/3236], Loss: 1.7859, Perplexity: 5.9648torch.Size([128, 16, 512])
Epoch [3/3], Step [1366/3236], Loss: 2.0485, Perplexity: 7.7564torch.Size([128, 12, 512])
Epoch [3/3], Step [1367/3236], Loss: 1.9373, Perplexity: 6.9399torch.Size([128, 12, 512])

Epoch [3/3], Step [1448/3236], Loss: 2.0014, Perplexity: 7.3993torch.Size([128, 11, 512])
Epoch [3/3], Step [1449/3236], Loss: 1.8960, Perplexity: 6.6591torch.Size([128, 21, 512])
Epoch [3/3], Step [1450/3236], Loss: 2.6281, Perplexity: 13.8468torch.Size([128, 14, 512])
Epoch [3/3], Step [1451/3236], Loss: 1.9522, Perplexity: 7.0444torch.Size([128, 11, 512])
Epoch [3/3], Step [1452/3236], Loss: 1.9957, Perplexity: 7.3577torch.Size([128, 12, 512])
Epoch [3/3], Step [1453/3236], Loss: 1.8064, Perplexity: 6.0882torch.Size([128, 17, 512])
Epoch [3/3], Step [1454/3236], Loss: 2.1622, Perplexity: 8.6900torch.Size([128, 11, 512])
Epoch [3/3], Step [1455/3236], Loss: 2.0077, Perplexity: 7.4460torch.Size([128, 11, 512])
Epoch [3/3], Step [1456/3236], Loss: 1.9728, Perplexity: 7.1909torch.Size([128, 12, 512])
Epoch [3/3], Step [1457/3236], Loss: 1.7257, Perplexity: 5.6167torch.Size([128, 20, 512])
Epoch [3/3], Step [1458/3236], Loss: 2.6319, Perplexity: 13.8999torch.Size([128, 14, 512])

Epoch [3/3], Step [1539/3236], Loss: 1.9462, Perplexity: 7.0023torch.Size([128, 14, 512])
Epoch [3/3], Step [1540/3236], Loss: 1.7876, Perplexity: 5.9753torch.Size([128, 15, 512])
Epoch [3/3], Step [1541/3236], Loss: 2.0937, Perplexity: 8.1150torch.Size([128, 14, 512])
Epoch [3/3], Step [1542/3236], Loss: 1.9741, Perplexity: 7.2002torch.Size([128, 12, 512])
Epoch [3/3], Step [1543/3236], Loss: 1.9180, Perplexity: 6.8076torch.Size([128, 11, 512])
Epoch [3/3], Step [1544/3236], Loss: 2.0766, Perplexity: 7.9774torch.Size([128, 11, 512])
Epoch [3/3], Step [1545/3236], Loss: 1.9889, Perplexity: 7.3075torch.Size([128, 11, 512])
Epoch [3/3], Step [1546/3236], Loss: 1.8489, Perplexity: 6.3526torch.Size([128, 11, 512])
Epoch [3/3], Step [1547/3236], Loss: 1.8983, Perplexity: 6.6744torch.Size([128, 12, 512])
Epoch [3/3], Step [1548/3236], Loss: 1.9463, Perplexity: 7.0028torch.Size([128, 11, 512])
Epoch [3/3], Step [1549/3236], Loss: 1.9503, Perplexity: 7.0308torch.Size([128, 13, 512])

Epoch [3/3], Step [1630/3236], Loss: 2.0076, Perplexity: 7.4457torch.Size([128, 16, 512])
Epoch [3/3], Step [1631/3236], Loss: 2.0883, Perplexity: 8.0711torch.Size([128, 12, 512])
Epoch [3/3], Step [1632/3236], Loss: 1.9768, Perplexity: 7.2198torch.Size([128, 15, 512])
Epoch [3/3], Step [1633/3236], Loss: 2.0479, Perplexity: 7.7514torch.Size([128, 13, 512])
Epoch [3/3], Step [1634/3236], Loss: 1.8813, Perplexity: 6.5620torch.Size([128, 14, 512])
Epoch [3/3], Step [1635/3236], Loss: 2.0784, Perplexity: 7.9913torch.Size([128, 15, 512])
Epoch [3/3], Step [1636/3236], Loss: 2.0348, Perplexity: 7.6509torch.Size([128, 16, 512])
Epoch [3/3], Step [1637/3236], Loss: 2.1185, Perplexity: 8.3190torch.Size([128, 11, 512])
Epoch [3/3], Step [1638/3236], Loss: 2.0054, Perplexity: 7.4288torch.Size([128, 13, 512])
Epoch [3/3], Step [1639/3236], Loss: 1.9445, Perplexity: 6.9905torch.Size([128, 13, 512])
Epoch [3/3], Step [1640/3236], Loss: 1.8694, Perplexity: 6.4842torch.Size([128, 17, 512])

Epoch [3/3], Step [1721/3236], Loss: 1.9762, Perplexity: 7.2153torch.Size([128, 11, 512])
Epoch [3/3], Step [1722/3236], Loss: 2.0573, Perplexity: 7.8251torch.Size([128, 14, 512])
Epoch [3/3], Step [1723/3236], Loss: 1.9114, Perplexity: 6.7624torch.Size([128, 20, 512])
Epoch [3/3], Step [1724/3236], Loss: 2.5161, Perplexity: 12.3808torch.Size([128, 11, 512])
Epoch [3/3], Step [1725/3236], Loss: 1.9690, Perplexity: 7.1638torch.Size([128, 12, 512])
Epoch [3/3], Step [1726/3236], Loss: 1.9583, Perplexity: 7.0871torch.Size([128, 10, 512])
Epoch [3/3], Step [1727/3236], Loss: 2.1009, Perplexity: 8.1733torch.Size([128, 14, 512])
Epoch [3/3], Step [1728/3236], Loss: 1.9425, Perplexity: 6.9760torch.Size([128, 13, 512])
Epoch [3/3], Step [1729/3236], Loss: 1.9285, Perplexity: 6.8795torch.Size([128, 15, 512])
Epoch [3/3], Step [1730/3236], Loss: 2.0662, Perplexity: 7.8945torch.Size([128, 15, 512])
Epoch [3/3], Step [1731/3236], Loss: 2.1930, Perplexity: 8.9619torch.Size([128, 11, 512])

Epoch [3/3], Step [1812/3236], Loss: 1.9703, Perplexity: 7.1732torch.Size([128, 14, 512])
Epoch [3/3], Step [1813/3236], Loss: 1.9272, Perplexity: 6.8705torch.Size([128, 16, 512])
Epoch [3/3], Step [1814/3236], Loss: 2.1157, Perplexity: 8.2958torch.Size([128, 13, 512])
Epoch [3/3], Step [1815/3236], Loss: 1.9643, Perplexity: 7.1301torch.Size([128, 15, 512])
Epoch [3/3], Step [1816/3236], Loss: 2.1315, Perplexity: 8.4275torch.Size([128, 16, 512])
Epoch [3/3], Step [1817/3236], Loss: 1.9596, Perplexity: 7.0965torch.Size([128, 14, 512])
Epoch [3/3], Step [1818/3236], Loss: 1.9801, Perplexity: 7.2431torch.Size([128, 21, 512])
Epoch [3/3], Step [1819/3236], Loss: 2.6104, Perplexity: 13.6041torch.Size([128, 25, 512])
Epoch [3/3], Step [1820/3236], Loss: 2.9278, Perplexity: 18.6873torch.Size([128, 12, 512])
Epoch [3/3], Step [1821/3236], Loss: 1.8697, Perplexity: 6.4863torch.Size([128, 12, 512])
Epoch [3/3], Step [1822/3236], Loss: 1.8742, Perplexity: 6.5154torch.Size([128, 13, 512])

Epoch [3/3], Step [1903/3236], Loss: 2.0373, Perplexity: 7.6701torch.Size([128, 10, 512])
Epoch [3/3], Step [1904/3236], Loss: 2.2328, Perplexity: 9.3258torch.Size([128, 15, 512])
Epoch [3/3], Step [1905/3236], Loss: 2.0704, Perplexity: 7.9279torch.Size([128, 13, 512])
Epoch [3/3], Step [1906/3236], Loss: 1.9336, Perplexity: 6.9142torch.Size([128, 11, 512])
Epoch [3/3], Step [1907/3236], Loss: 2.1730, Perplexity: 8.7846torch.Size([128, 13, 512])
Epoch [3/3], Step [1908/3236], Loss: 1.8537, Perplexity: 6.3834torch.Size([128, 14, 512])
Epoch [3/3], Step [1909/3236], Loss: 1.8751, Perplexity: 6.5216torch.Size([128, 12, 512])
Epoch [3/3], Step [1910/3236], Loss: 1.9665, Perplexity: 7.1456torch.Size([128, 13, 512])
Epoch [3/3], Step [1911/3236], Loss: 1.8935, Perplexity: 6.6425torch.Size([128, 13, 512])
Epoch [3/3], Step [1912/3236], Loss: 1.9328, Perplexity: 6.9087torch.Size([128, 11, 512])
Epoch [3/3], Step [1913/3236], Loss: 1.9284, Perplexity: 6.8784torch.Size([128, 17, 512])

Epoch [3/3], Step [1994/3236], Loss: 1.9334, Perplexity: 6.9132torch.Size([128, 11, 512])
Epoch [3/3], Step [1995/3236], Loss: 2.0849, Perplexity: 8.0437torch.Size([128, 15, 512])
Epoch [3/3], Step [1996/3236], Loss: 1.9200, Perplexity: 6.8207torch.Size([128, 11, 512])
Epoch [3/3], Step [1997/3236], Loss: 2.0215, Perplexity: 7.5497torch.Size([128, 17, 512])
Epoch [3/3], Step [1998/3236], Loss: 2.2341, Perplexity: 9.3385torch.Size([128, 13, 512])
Epoch [3/3], Step [1999/3236], Loss: 1.8530, Perplexity: 6.3788torch.Size([128, 17, 512])
Epoch [3/3], Step [2000/3236], Loss: 2.1432, Perplexity: 8.5263
torch.Size([128, 15, 512])
Epoch [3/3], Step [2001/3236], Loss: 2.0846, Perplexity: 8.0416torch.Size([128, 12, 512])
Epoch [3/3], Step [2002/3236], Loss: 1.9644, Perplexity: 7.1309torch.Size([128, 14, 512])
Epoch [3/3], Step [2003/3236], Loss: 1.9772, Perplexity: 7.2222torch.Size([128, 11, 512])
Epoch [3/3], Step [2004/3236], Loss: 1.9516, Perplexity: 7.0401torch.Size([128, 13, 512])

Epoch [3/3], Step [2176/3236], Loss: 2.0484, Perplexity: 7.7556torch.Size([128, 14, 512])
Epoch [3/3], Step [2177/3236], Loss: 1.9995, Perplexity: 7.3851torch.Size([128, 17, 512])
Epoch [3/3], Step [2178/3236], Loss: 2.2568, Perplexity: 9.5526torch.Size([128, 11, 512])
Epoch [3/3], Step [2179/3236], Loss: 2.0124, Perplexity: 7.4814torch.Size([128, 13, 512])
Epoch [3/3], Step [2180/3236], Loss: 1.7169, Perplexity: 5.5671torch.Size([128, 13, 512])
Epoch [3/3], Step [2181/3236], Loss: 1.8523, Perplexity: 6.3742torch.Size([128, 11, 512])
Epoch [3/3], Step [2182/3236], Loss: 1.8286, Perplexity: 6.2251torch.Size([128, 20, 512])
Epoch [3/3], Step [2183/3236], Loss: 2.5880, Perplexity: 13.3034torch.Size([128, 11, 512])
Epoch [3/3], Step [2184/3236], Loss: 1.9832, Perplexity: 7.2659torch.Size([128, 11, 512])
Epoch [3/3], Step [2185/3236], Loss: 2.0133, Perplexity: 7.4883torch.Size([128, 10, 512])
Epoch [3/3], Step [2186/3236], Loss: 2.3510, Perplexity: 10.4959torch.Size([128, 11, 512])

Epoch [3/3], Step [2267/3236], Loss: 2.2222, Perplexity: 9.2275torch.Size([128, 14, 512])
Epoch [3/3], Step [2268/3236], Loss: 2.0620, Perplexity: 7.8615torch.Size([128, 13, 512])
Epoch [3/3], Step [2269/3236], Loss: 1.9351, Perplexity: 6.9246torch.Size([128, 10, 512])
Epoch [3/3], Step [2270/3236], Loss: 2.2384, Perplexity: 9.3786torch.Size([128, 11, 512])
Epoch [3/3], Step [2271/3236], Loss: 2.0081, Perplexity: 7.4494torch.Size([128, 13, 512])
Epoch [3/3], Step [2272/3236], Loss: 1.9066, Perplexity: 6.7301torch.Size([128, 17, 512])
Epoch [3/3], Step [2273/3236], Loss: 2.3024, Perplexity: 9.9984torch.Size([128, 14, 512])
Epoch [3/3], Step [2274/3236], Loss: 1.9813, Perplexity: 7.2520torch.Size([128, 12, 512])
Epoch [3/3], Step [2275/3236], Loss: 1.8875, Perplexity: 6.6030torch.Size([128, 14, 512])
Epoch [3/3], Step [2276/3236], Loss: 1.8845, Perplexity: 6.5829torch.Size([128, 12, 512])
Epoch [3/3], Step [2277/3236], Loss: 1.9122, Perplexity: 6.7676torch.Size([128, 16, 512])

Epoch [3/3], Step [2358/3236], Loss: 1.8605, Perplexity: 6.4271torch.Size([128, 10, 512])
Epoch [3/3], Step [2359/3236], Loss: 2.0952, Perplexity: 8.1273torch.Size([128, 11, 512])
Epoch [3/3], Step [2360/3236], Loss: 1.9710, Perplexity: 7.1777torch.Size([128, 12, 512])
Epoch [3/3], Step [2361/3236], Loss: 1.8698, Perplexity: 6.4872torch.Size([128, 12, 512])
Epoch [3/3], Step [2362/3236], Loss: 1.7933, Perplexity: 6.0092torch.Size([128, 10, 512])
Epoch [3/3], Step [2363/3236], Loss: 2.1743, Perplexity: 8.7961torch.Size([128, 16, 512])
Epoch [3/3], Step [2364/3236], Loss: 2.0905, Perplexity: 8.0889torch.Size([128, 15, 512])
Epoch [3/3], Step [2365/3236], Loss: 2.0631, Perplexity: 7.8703torch.Size([128, 12, 512])
Epoch [3/3], Step [2366/3236], Loss: 1.9704, Perplexity: 7.1736torch.Size([128, 22, 512])
Epoch [3/3], Step [2367/3236], Loss: 2.7399, Perplexity: 15.4853torch.Size([128, 11, 512])
Epoch [3/3], Step [2368/3236], Loss: 1.9071, Perplexity: 6.7335torch.Size([128, 13, 512])

Epoch [3/3], Step [2449/3236], Loss: 2.2148, Perplexity: 9.1592torch.Size([128, 15, 512])
Epoch [3/3], Step [2450/3236], Loss: 1.8930, Perplexity: 6.6390torch.Size([128, 13, 512])
Epoch [3/3], Step [2451/3236], Loss: 1.8856, Perplexity: 6.5905torch.Size([128, 12, 512])
Epoch [3/3], Step [2452/3236], Loss: 1.9666, Perplexity: 7.1463torch.Size([128, 14, 512])
Epoch [3/3], Step [2453/3236], Loss: 1.9949, Perplexity: 7.3514torch.Size([128, 13, 512])
Epoch [3/3], Step [2454/3236], Loss: 1.7779, Perplexity: 5.9172torch.Size([128, 13, 512])
Epoch [3/3], Step [2455/3236], Loss: 1.8919, Perplexity: 6.6323torch.Size([128, 12, 512])
Epoch [3/3], Step [2456/3236], Loss: 1.9432, Perplexity: 6.9807torch.Size([128, 14, 512])
Epoch [3/3], Step [2457/3236], Loss: 1.9593, Perplexity: 7.0943torch.Size([128, 17, 512])
Epoch [3/3], Step [2458/3236], Loss: 2.1244, Perplexity: 8.3677torch.Size([128, 20, 512])
Epoch [3/3], Step [2459/3236], Loss: 2.5376, Perplexity: 12.6488torch.Size([128, 13, 512])

Epoch [3/3], Step [2540/3236], Loss: 1.9849, Perplexity: 7.2784torch.Size([128, 13, 512])
Epoch [3/3], Step [2541/3236], Loss: 1.7858, Perplexity: 5.9642torch.Size([128, 12, 512])
Epoch [3/3], Step [2542/3236], Loss: 1.9065, Perplexity: 6.7294torch.Size([128, 13, 512])
Epoch [3/3], Step [2543/3236], Loss: 1.9695, Perplexity: 7.1672torch.Size([128, 14, 512])
Epoch [3/3], Step [2544/3236], Loss: 1.9365, Perplexity: 6.9346torch.Size([128, 14, 512])
Epoch [3/3], Step [2545/3236], Loss: 1.9502, Perplexity: 7.0297torch.Size([128, 19, 512])
Epoch [3/3], Step [2546/3236], Loss: 2.5083, Perplexity: 12.2837torch.Size([128, 15, 512])
Epoch [3/3], Step [2547/3236], Loss: 2.0954, Perplexity: 8.1287torch.Size([128, 11, 512])
Epoch [3/3], Step [2548/3236], Loss: 1.9289, Perplexity: 6.8822torch.Size([128, 13, 512])
Epoch [3/3], Step [2549/3236], Loss: 1.8748, Perplexity: 6.5194torch.Size([128, 15, 512])
Epoch [3/3], Step [2550/3236], Loss: 1.9713, Perplexity: 7.1802torch.Size([128, 11, 512])

Epoch [3/3], Step [2631/3236], Loss: 1.9699, Perplexity: 7.1699torch.Size([128, 15, 512])
Epoch [3/3], Step [2632/3236], Loss: 1.9997, Perplexity: 7.3871torch.Size([128, 20, 512])
Epoch [3/3], Step [2633/3236], Loss: 2.6733, Perplexity: 14.4874torch.Size([128, 12, 512])
Epoch [3/3], Step [2634/3236], Loss: 1.9687, Perplexity: 7.1614torch.Size([128, 12, 512])
Epoch [3/3], Step [2635/3236], Loss: 1.8887, Perplexity: 6.6105torch.Size([128, 27, 512])
Epoch [3/3], Step [2636/3236], Loss: 2.8494, Perplexity: 17.2783torch.Size([128, 11, 512])
Epoch [3/3], Step [2637/3236], Loss: 1.9105, Perplexity: 6.7564torch.Size([128, 11, 512])
Epoch [3/3], Step [2638/3236], Loss: 1.9469, Perplexity: 7.0068torch.Size([128, 13, 512])
Epoch [3/3], Step [2639/3236], Loss: 1.9645, Perplexity: 7.1317torch.Size([128, 12, 512])
Epoch [3/3], Step [2640/3236], Loss: 1.8482, Perplexity: 6.3484torch.Size([128, 14, 512])
Epoch [3/3], Step [2641/3236], Loss: 1.9150, Perplexity: 6.7868torch.Size([128, 11, 512])

Epoch [3/3], Step [2722/3236], Loss: 1.9255, Perplexity: 6.8585torch.Size([128, 16, 512])
Epoch [3/3], Step [2723/3236], Loss: 2.0630, Perplexity: 7.8692torch.Size([128, 14, 512])
Epoch [3/3], Step [2724/3236], Loss: 1.8425, Perplexity: 6.3125torch.Size([128, 13, 512])
Epoch [3/3], Step [2725/3236], Loss: 1.8812, Perplexity: 6.5611torch.Size([128, 11, 512])
Epoch [3/3], Step [2726/3236], Loss: 1.9732, Perplexity: 7.1935torch.Size([128, 13, 512])
Epoch [3/3], Step [2727/3236], Loss: 1.8957, Perplexity: 6.6574torch.Size([128, 11, 512])
Epoch [3/3], Step [2728/3236], Loss: 1.8518, Perplexity: 6.3714torch.Size([128, 15, 512])
Epoch [3/3], Step [2729/3236], Loss: 2.0921, Perplexity: 8.1019torch.Size([128, 12, 512])
Epoch [3/3], Step [2730/3236], Loss: 1.8862, Perplexity: 6.5945torch.Size([128, 12, 512])
Epoch [3/3], Step [2731/3236], Loss: 1.9500, Perplexity: 7.0289torch.Size([128, 12, 512])
Epoch [3/3], Step [2732/3236], Loss: 1.8826, Perplexity: 6.5704torch.Size([128, 13, 512])

Epoch [3/3], Step [2813/3236], Loss: 1.9740, Perplexity: 7.1991torch.Size([128, 11, 512])
Epoch [3/3], Step [2814/3236], Loss: 1.9626, Perplexity: 7.1178torch.Size([128, 11, 512])
Epoch [3/3], Step [2815/3236], Loss: 1.9322, Perplexity: 6.9047torch.Size([128, 13, 512])
Epoch [3/3], Step [2816/3236], Loss: 1.9368, Perplexity: 6.9366torch.Size([128, 11, 512])
Epoch [3/3], Step [2817/3236], Loss: 1.9592, Perplexity: 7.0935torch.Size([128, 14, 512])
Epoch [3/3], Step [2818/3236], Loss: 2.0048, Perplexity: 7.4245torch.Size([128, 15, 512])
Epoch [3/3], Step [2819/3236], Loss: 1.9974, Perplexity: 7.3699torch.Size([128, 11, 512])
Epoch [3/3], Step [2820/3236], Loss: 2.0226, Perplexity: 7.5582torch.Size([128, 18, 512])
Epoch [3/3], Step [2821/3236], Loss: 2.3827, Perplexity: 10.8346torch.Size([128, 16, 512])
Epoch [3/3], Step [2822/3236], Loss: 2.0507, Perplexity: 7.7730torch.Size([128, 13, 512])
Epoch [3/3], Step [2823/3236], Loss: 1.8743, Perplexity: 6.5164torch.Size([128, 13, 512])
Epoch [3/

Epoch [3/3], Step [2904/3236], Loss: 2.0410, Perplexity: 7.6982torch.Size([128, 13, 512])
Epoch [3/3], Step [2905/3236], Loss: 1.9358, Perplexity: 6.9294torch.Size([128, 17, 512])
Epoch [3/3], Step [2906/3236], Loss: 2.2541, Perplexity: 9.5266torch.Size([128, 11, 512])
Epoch [3/3], Step [2907/3236], Loss: 2.0126, Perplexity: 7.4828torch.Size([128, 13, 512])
Epoch [3/3], Step [2908/3236], Loss: 1.8669, Perplexity: 6.4680torch.Size([128, 17, 512])
Epoch [3/3], Step [2909/3236], Loss: 2.2140, Perplexity: 9.1518torch.Size([128, 19, 512])
Epoch [3/3], Step [2910/3236], Loss: 2.3852, Perplexity: 10.8613torch.Size([128, 15, 512])
Epoch [3/3], Step [2911/3236], Loss: 1.9559, Perplexity: 7.0705torch.Size([128, 17, 512])
Epoch [3/3], Step [2912/3236], Loss: 2.0852, Perplexity: 8.0460torch.Size([128, 15, 512])
Epoch [3/3], Step [2913/3236], Loss: 1.9628, Perplexity: 7.1189torch.Size([128, 12, 512])
Epoch [3/3], Step [2914/3236], Loss: 1.9167, Perplexity: 6.7988torch.Size([128, 12, 512])
Epoch [3/

Epoch [3/3], Step [2995/3236], Loss: 2.1078, Perplexity: 8.2300torch.Size([128, 13, 512])
Epoch [3/3], Step [2996/3236], Loss: 1.8612, Perplexity: 6.4315torch.Size([128, 11, 512])
Epoch [3/3], Step [2997/3236], Loss: 2.0038, Perplexity: 7.4174torch.Size([128, 11, 512])
Epoch [3/3], Step [2998/3236], Loss: 1.9577, Perplexity: 7.0827torch.Size([128, 12, 512])
Epoch [3/3], Step [2999/3236], Loss: 1.8983, Perplexity: 6.6746torch.Size([128, 14, 512])
Epoch [3/3], Step [3000/3236], Loss: 1.9569, Perplexity: 7.0775
torch.Size([128, 13, 512])
Epoch [3/3], Step [3001/3236], Loss: 1.8127, Perplexity: 6.1268torch.Size([128, 11, 512])
Epoch [3/3], Step [3002/3236], Loss: 1.8809, Perplexity: 6.5594torch.Size([128, 14, 512])
Epoch [3/3], Step [3003/3236], Loss: 2.0224, Perplexity: 7.5563torch.Size([128, 15, 512])
Epoch [3/3], Step [3004/3236], Loss: 1.9700, Perplexity: 7.1705torch.Size([128, 10, 512])
Epoch [3/3], Step [3005/3236], Loss: 2.2315, Perplexity: 9.3134torch.Size([128, 14, 512])
Epoch [3/

Epoch [3/3], Step [3086/3236], Loss: 2.0837, Perplexity: 8.0340torch.Size([128, 15, 512])
Epoch [3/3], Step [3087/3236], Loss: 2.0591, Perplexity: 7.8392torch.Size([128, 17, 512])
Epoch [3/3], Step [3088/3236], Loss: 2.2167, Perplexity: 9.1774torch.Size([128, 14, 512])
Epoch [3/3], Step [3089/3236], Loss: 1.9219, Perplexity: 6.8337torch.Size([128, 12, 512])
Epoch [3/3], Step [3090/3236], Loss: 1.8615, Perplexity: 6.4336torch.Size([128, 12, 512])
Epoch [3/3], Step [3091/3236], Loss: 1.8412, Perplexity: 6.3038torch.Size([128, 15, 512])
Epoch [3/3], Step [3092/3236], Loss: 1.9291, Perplexity: 6.8831torch.Size([128, 13, 512])
Epoch [3/3], Step [3093/3236], Loss: 1.8974, Perplexity: 6.6687torch.Size([128, 12, 512])
Epoch [3/3], Step [3094/3236], Loss: 1.8307, Perplexity: 6.2382torch.Size([128, 12, 512])
Epoch [3/3], Step [3095/3236], Loss: 1.7766, Perplexity: 5.9095torch.Size([128, 13, 512])
Epoch [3/3], Step [3096/3236], Loss: 1.9099, Perplexity: 6.7527torch.Size([128, 12, 512])
Epoch [3/3

Epoch [3/3], Step [3177/3236], Loss: 2.4447, Perplexity: 11.5267torch.Size([128, 12, 512])
Epoch [3/3], Step [3178/3236], Loss: 1.8707, Perplexity: 6.4927torch.Size([128, 14, 512])
Epoch [3/3], Step [3179/3236], Loss: 1.9443, Perplexity: 6.9884torch.Size([128, 13, 512])
Epoch [3/3], Step [3180/3236], Loss: 1.9446, Perplexity: 6.9909torch.Size([128, 11, 512])
Epoch [3/3], Step [3181/3236], Loss: 1.9545, Perplexity: 7.0607torch.Size([128, 12, 512])
Epoch [3/3], Step [3182/3236], Loss: 1.9000, Perplexity: 6.6856torch.Size([128, 11, 512])
Epoch [3/3], Step [3183/3236], Loss: 2.0201, Perplexity: 7.5393torch.Size([128, 11, 512])
Epoch [3/3], Step [3184/3236], Loss: 1.9319, Perplexity: 6.9023torch.Size([128, 12, 512])
Epoch [3/3], Step [3185/3236], Loss: 1.8523, Perplexity: 6.3745torch.Size([128, 12, 512])
Epoch [3/3], Step [3186/3236], Loss: 1.8919, Perplexity: 6.6323torch.Size([128, 13, 512])
Epoch [3/3], Step [3187/3236], Loss: 1.9776, Perplexity: 7.2256torch.Size([128, 12, 512])
Epoch [3/

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

One way to check for overfitting is to measure performance on a validation set.  If you decide to do this **optional** task, you must first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove very useful here.

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.
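
As a starting point for working with the annotation file, the sketch below groups the reference captions by image id. It assumes the standard COCO caption layout, in which the JSON file contains an `annotations` list whose entries carry `image_id` and `caption` fields. The helper names `group_captions_by_image` and `load_reference_captions`, as well as the toy data, are illustrative and not part of the project code.

```python
import json
from collections import defaultdict


def group_captions_by_image(annotation_data):
    """Map each image id to the list of its reference captions."""
    captions = defaultdict(list)
    for ann in annotation_data["annotations"]:
        captions[ann["image_id"]].append(ann["caption"])
    return dict(captions)


def load_reference_captions(annotation_path):
    """Read a COCO caption annotation file and group its captions by image."""
    with open(annotation_path, "r") as f:
        return group_captions_by_image(json.load(f))


# Tiny in-memory stand-in for the annotation file (made-up data, for illustration only).
toy_annotations = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "a dog runs"},
        {"id": 11, "image_id": 1, "caption": "a dog is running"},
        {"id": 12, "image_id": 2, "caption": "a red bus"},
    ],
}
references = group_captions_by_image(toy_annotations)
```

Keeping all reference captions for an image together is convenient later, because caption metrics such as BLEU compare one predicted caption against every available reference.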

The suggested approach to validating your model involves creating a JSON file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
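
The following is a minimal sketch of both halves of that workflow, not a replacement for the linked evaluation code. It assumes predictions are held in a dictionary from image id to caption string, and that the results file is a JSON list of `{"image_id", "caption"}` entries like the linked example. The names `build_results`, `save_results`, and `sentence_bleu` are hypothetical. The BLEU function is a simplified sentence-level variant (whitespace tokenization, uniform weights, no smoothing), so its scores will not match a corpus-level scorer exactly.

```python
import json
import math
from collections import Counter


def build_results(predictions):
    """Turn {image_id: caption} into a list of COCO-style result entries."""
    return [{"image_id": int(i), "caption": c} for i, c in predictions.items()]


def save_results(predictions, path):
    """Write the result entries to a JSON file at `path`."""
    with open(path, "w") as f:
        json.dump(build_results(predictions), f)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Simplified BLEU: clipped n-gram precisions times a brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up predictions, for illustration only.
results = build_results({1: "a dog runs", 2: "a red bus"})
score = sentence_bleu("a dog runs fast today", ["a dog runs fast today"])
```

Because no smoothing is applied, short captions with no matching 4-gram score zero; passing a smaller `max_n` (for example `max_n=1` for BLEU-1) gives a more forgiving number for quick checks.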

In [None]:
# (Optional) TODO: Validate your model.