
### Project: Image Captioning


In this notebook, we will train our CNN-RNN model. The notebook is composed of the following steps:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, we will set up the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.

Refer to [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  
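
To make the effect of `vocab_threshold` concrete, here is a minimal sketch on toy data. The helper `build_vocab` and the captions are hypothetical, and the sketch assumes a word is kept when its count is at least the threshold; the real vocabulary code used by `get_loader` may differ in its details (tokenization, special tokens).

```python
from collections import Counter


def build_vocab(captions, vocab_threshold):
    """Hypothetical helper: keep words seen at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)


# Toy captions, invented for illustration only.
toy_captions = [
    "a dog runs on the grass",
    "a dog sits on the sofa",
    "a cat sits on the sofa",
]

small_vocab = build_vocab(toy_captions, vocab_threshold=3)  # only frequent words survive
large_vocab = build_vocab(toy_captions, vocab_threshold=1)  # rare words are included too
print(len(small_vocab), small_vocab)
print(len(large_vocab), large_vocab)
```

A larger threshold keeps only the frequent words, while a threshold of 1 keeps every word that appears in the captions.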

### CNN-RNN architecture and hyperparameters

I use a pre-trained ResNet-50 model, whose original final fully connected layer has 1000 outputs. 
I chose 512 for both the embedding size and the hidden size, based on the paper provided above <br>
(https://arxiv.org/pdf/1411.4555.pdf). That paper mentions that the authors started with 512 for the embedding size and the LSTM memory size, and training with these sizes in this project also gave quite decent results.<br>
The batch size follows the Hyperparameters lesson, which said that a bigger batch size gives better computational efficiency, but that 256 is roughly the threshold: beyond it, results get worse. <br>
I started with `vocab_threshold=5`, but it seemed that I did not have enough data to train. <br>
So I changed `vocab_threshold` to 4 and got a better result. <br>

To use a pre-trained model with 224x224x3 images, I normalize with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225], i.e. `Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))`. <br>
The transform also resizes the image, takes a random 224x224 crop, and flips it horizontally at random, so that the model learns more effectively during training. <br>
(Resizing, cropping, and flipping work especially well when there is not enough training data: the model sees more varied data while tuning its parameters, which also helps prevent overfitting.) <br>
After training this model, I found that this transform was good enough. <br>
Since I use a pre-trained model, I only want to train the embedding layer on the encoder side, so I chose `encoder.embed.parameters()`. On the decoder side, all layers need to be trained, <br>
so `decoder.parameters()` was a reasonable choice, I think.

For the optimizer, I started with SGD and changed the learning rate several times. <br>
But finding a good learning rate for SGD was difficult in this case <br>
(with lr=0.1 the model did not seem to learn, and with lr=0.01 or lr=0.001 it learned quite slowly). <br>
So I switched to Adam, which the Hyperparameters lesson describes as one of the adaptive learning rate optimizers. <br>
Adam seems to work fine for this training phase.
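
The per-channel normalization mentioned above can be written out by hand. This is only a sketch of the arithmetic, `(value - mean) / std`, applied to a single RGB pixel whose channels are already scaled to [0, 1]; the helper `normalize_pixel` is hypothetical and is not part of `torchvision`.

```python
# Channel statistics quoted in the text above.
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Hypothetical helper: apply (value - mean) / std to each channel of one pixel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


mid_gray = normalize_pixel((0.5, 0.5, 0.5))  # a mid-gray pixel
at_mean = normalize_pixel(mean)              # a pixel equal to the channel means
print(mid_gray)
print(at_mean)
```

A pixel equal to the channel means maps to zero in every channel, which is the point of centering the inputs the way the pre-trained network expects.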

In [1]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
!pip install nltk
import nltk
nltk.download('punkt')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## Select appropriate values for the Python variables below.
batch_size = 256            ## batch size
vocab_threshold = 4        ## minimum word count threshold
vocab_from_file = False     ## if True, load existing vocab file
embed_size = 512           ## dimensionality of image and word embeddings
hidden_size = 512          ## number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()

# Specify the learnable parameters of the model.
# list(model_name.layer_name.parameters())
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 

# Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)
#optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
loading annotations into memory...
Done (t=0.94s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.85s)
creating index...


  0%|          | 1319/414113 [00:00<01:02, 6635.80it/s]

index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:02<00:00, 6620.38it/s]
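
As a quick sanity check on `total_step`, the same ceiling division can be done by hand. This sketch assumes that the 414113 captions reported in the output above is the length of `data_loader.dataset.caption_lengths`.

```python
import math

num_captions = 414113  # caption count reported in the output above (assumed to match caption_lengths)
batch_size = 256       # value set in the configuration cell

# Same formula as in the setup cell: round up so that every caption is counted.
total_step = math.ceil(num_captions / batch_size)
print(total_step)
```

With these values the result is 1618, which is the step count that appears as `Step [i/1618]` in the training output.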


<a id='step2'></a>
## Step 2: Train your Model
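
The training loop in this step reports perplexity as `np.exp(loss.item())`. As a small sketch of that relationship, the snippet below re-derives the perplexity from a few statistics lines copied from the training output of this notebook. The regular expression and variable names are illustrative assumptions; the same parsing could be applied to the lines written to `log_file`.

```python
import math
import re

# A few statistics lines copied from the training output of this notebook.
log_lines = [
    "Epoch [1/3], Step [1/1618], Loss: 2.8392, Perplexity: 17.1028",
    "Epoch [1/3], Step [2/1618], Loss: 2.8776, Perplexity: 17.7719",
    "Epoch [2/3], Step [1600/1618], Loss: 2.1260, Perplexity: 8.3815",
]

# Illustrative pattern for the loss and perplexity fields of one line.
pattern = re.compile(r"Loss: ([\d.]+), Perplexity: ([\d.]+)")

records = []
for line in log_lines:
    match = pattern.search(line)
    loss, logged = float(match.group(1)), float(match.group(2))
    # Perplexity is the exponential of the cross-entropy loss.
    records.append((loss, logged, math.exp(loss)))

for loss, logged, recomputed in records:
    print(f"loss={loss:.4f}  logged={logged:.4f}  exp(loss)={recomputed:.4f}")
```

The recomputed values differ from the logged ones only in the last digits, because the logged loss is rounded to four decimal places before it is printed.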


In [None]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

num_epochs = 3

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

batch_size:  256
embed_size:  512
Epoch [1/3], Step [1/1618], Loss: 2.8392, Perplexity: 17.1028batch_size:  256
embed_size:  512
Epoch [1/3], Step [2/1618], Loss: 2.8776, Perplexity: 17.7719batch_size:  256
embed_size:  512
Epoch [1/3], Step [3/1618], Loss: 3.0990, Perplexity: 22.1752batch_size:  256
embed_size:  512
Epoch [1/3], Step [4/1618], Loss: 2.9265, Perplexity: 18.6625batch_size:  256
embed_size:  512
Epoch [1/3], Step [5/1618], Loss: 3.0549, Perplexity: 21.2189batch_size:  256
embed_size:  512
Epoch [1/3], Step [6/1618], Loss: 2.7600, Perplexity: 15.8003batch_size:  256
embed_size:  512
Epoch [1/3], Step [7/1618], Loss: 2.9878, Perplexity: 19.8416batch_size:  256
embed_size:  512
Epoch [1/3], Step [8/1618], Loss: 2.8919, Perplexity: 18.0281batch_size:  256
embed_size:  512
Epoch [1/3], Step [9/1618], Loss: 2.9113, Perplexity: 18.3814batch_size:  256
embed_size:  512
Epoch [1/3], Step [10/1618], Loss: 2.8511, Perplexity: 17.3071batch_size:  256
embed_size:  512
Epoch [1/3], St

Epoch [1/3], Step [170/1618], Loss: 2.5998, Perplexity: 13.4614batch_size:  256
embed_size:  512
Epoch [1/3], Step [171/1618], Loss: 2.7160, Perplexity: 15.1194batch_size:  256
embed_size:  512
Epoch [1/3], Step [172/1618], Loss: 3.0412, Perplexity: 20.9295batch_size:  256
embed_size:  512
Epoch [1/3], Step [173/1618], Loss: 2.7172, Perplexity: 15.1382batch_size:  256
embed_size:  512
Epoch [1/3], Step [174/1618], Loss: 2.8674, Perplexity: 17.5920batch_size:  256
embed_size:  512
Epoch [1/3], Step [175/1618], Loss: 2.5690, Perplexity: 13.0522batch_size:  256
embed_size:  512
Epoch [1/3], Step [176/1618], Loss: 2.8259, Perplexity: 16.8764batch_size:  256
embed_size:  512
Epoch [1/3], Step [177/1618], Loss: 2.8394, Perplexity: 17.1063batch_size:  256
embed_size:  512
Epoch [1/3], Step [178/1618], Loss: 3.0476, Perplexity: 21.0640batch_size:  256
embed_size:  512
Epoch [1/3], Step [179/1618], Loss: 2.7952, Perplexity: 16.3659batch_size:  256
embed_size:  512
Epoch [1/3], Step [180/1618], 

Epoch [1/3], Step [338/1618], Loss: 2.5770, Perplexity: 13.1576batch_size:  256
embed_size:  512
Epoch [1/3], Step [339/1618], Loss: 2.7205, Perplexity: 15.1886batch_size:  256
embed_size:  512
Epoch [1/3], Step [340/1618], Loss: 2.5802, Perplexity: 13.1998batch_size:  256
embed_size:  512
Epoch [1/3], Step [341/1618], Loss: 2.5499, Perplexity: 12.8063batch_size:  256
embed_size:  512
Epoch [1/3], Step [342/1618], Loss: 2.7889, Perplexity: 16.2635batch_size:  256
embed_size:  512
Epoch [1/3], Step [343/1618], Loss: 2.7216, Perplexity: 15.2048batch_size:  256
embed_size:  512
Epoch [1/3], Step [344/1618], Loss: 2.5232, Perplexity: 12.4681batch_size:  256
embed_size:  512
Epoch [1/3], Step [345/1618], Loss: 2.8939, Perplexity: 18.0644batch_size:  256
embed_size:  512
Epoch [1/3], Step [346/1618], Loss: 2.6795, Perplexity: 14.5784batch_size:  256
embed_size:  512
Epoch [1/3], Step [347/1618], Loss: 2.4677, Perplexity: 11.7955batch_size:  256
embed_size:  512
Epoch [1/3], Step [348/1618], 

Epoch [1/3], Step [506/1618], Loss: 2.5130, Perplexity: 12.3419batch_size:  256
embed_size:  512
Epoch [1/3], Step [507/1618], Loss: 2.4750, Perplexity: 11.8812batch_size:  256
embed_size:  512
Epoch [1/3], Step [508/1618], Loss: 2.3407, Perplexity: 10.3884batch_size:  256
embed_size:  512
Epoch [1/3], Step [509/1618], Loss: 2.5528, Perplexity: 12.8431batch_size:  256
embed_size:  512
Epoch [1/3], Step [510/1618], Loss: 2.7473, Perplexity: 15.6001batch_size:  256
embed_size:  512
Epoch [1/3], Step [511/1618], Loss: 2.5466, Perplexity: 12.7631batch_size:  256
embed_size:  512
Epoch [1/3], Step [512/1618], Loss: 3.0685, Perplexity: 21.5090batch_size:  256
embed_size:  512
Epoch [1/3], Step [513/1618], Loss: 3.1762, Perplexity: 23.9545batch_size:  256
embed_size:  512
Epoch [1/3], Step [514/1618], Loss: 2.4592, Perplexity: 11.6950batch_size:  256
embed_size:  512
Epoch [1/3], Step [515/1618], Loss: 2.4762, Perplexity: 11.8958batch_size:  256
embed_size:  512
Epoch [1/3], Step [516/1618], 

Epoch [1/3], Step [674/1618], Loss: 2.3518, Perplexity: 10.5040batch_size:  256
embed_size:  512
Epoch [1/3], Step [675/1618], Loss: 2.3039, Perplexity: 10.0137batch_size:  256
embed_size:  512
Epoch [1/3], Step [676/1618], Loss: 2.6635, Perplexity: 14.3467batch_size:  256
embed_size:  512
Epoch [1/3], Step [677/1618], Loss: 2.6308, Perplexity: 13.8852batch_size:  256
embed_size:  512
Epoch [1/3], Step [678/1618], Loss: 2.3937, Perplexity: 10.9535batch_size:  256
embed_size:  512
Epoch [1/3], Step [679/1618], Loss: 2.3849, Perplexity: 10.8576batch_size:  256
embed_size:  512
Epoch [1/3], Step [680/1618], Loss: 2.3116, Perplexity: 10.0908batch_size:  256
embed_size:  512
Epoch [1/3], Step [681/1618], Loss: 3.3673, Perplexity: 28.9989batch_size:  256
embed_size:  512
Epoch [1/3], Step [682/1618], Loss: 2.5666, Perplexity: 13.0213batch_size:  256
embed_size:  512
Epoch [1/3], Step [683/1618], Loss: 2.4342, Perplexity: 11.4064batch_size:  256
embed_size:  512
Epoch [1/3], Step [684/1618], 

Epoch [1/3], Step [842/1618], Loss: 2.3390, Perplexity: 10.3708batch_size:  256
embed_size:  512
Epoch [1/3], Step [843/1618], Loss: 2.2229, Perplexity: 9.2340batch_size:  256
embed_size:  512
Epoch [1/3], Step [844/1618], Loss: 2.5105, Perplexity: 12.3114batch_size:  256
embed_size:  512
Epoch [1/3], Step [845/1618], Loss: 2.3538, Perplexity: 10.5259batch_size:  256
embed_size:  512
Epoch [1/3], Step [846/1618], Loss: 2.4626, Perplexity: 11.7352batch_size:  256
embed_size:  512
Epoch [1/3], Step [847/1618], Loss: 2.3838, Perplexity: 10.8459batch_size:  256
embed_size:  512
Epoch [1/3], Step [848/1618], Loss: 2.2516, Perplexity: 9.5033batch_size:  256
embed_size:  512
Epoch [1/3], Step [849/1618], Loss: 2.4054, Perplexity: 11.0830batch_size:  256
embed_size:  512
Epoch [1/3], Step [850/1618], Loss: 2.3489, Perplexity: 10.4743batch_size:  256
embed_size:  512
Epoch [1/3], Step [851/1618], Loss: 2.6386, Perplexity: 13.9942batch_size:  256
embed_size:  512
Epoch [1/3], Step [852/1618], Lo

Epoch [1/3], Step [927/1618], Loss: 2.3002, Perplexity: 9.9763batch_size:  256
embed_size:  512
Epoch [1/3], Step [928/1618], Loss: 2.3139, Perplexity: 10.1139batch_size:  256
embed_size:  512
Epoch [1/3], Step [929/1618], Loss: 2.3423, Perplexity: 10.4051batch_size:  256
embed_size:  512
Epoch [1/3], Step [930/1618], Loss: 2.3938, Perplexity: 10.9554batch_size:  256
embed_size:  512
Epoch [1/3], Step [931/1618], Loss: 2.3234, Perplexity: 10.2108batch_size:  256
embed_size:  512
Epoch [1/3], Step [932/1618], Loss: 2.5163, Perplexity: 12.3823batch_size:  256
embed_size:  512
Epoch [1/3], Step [933/1618], Loss: 2.5484, Perplexity: 12.7867batch_size:  256
embed_size:  512
Epoch [1/3], Step [934/1618], Loss: 2.5749, Perplexity: 13.1296batch_size:  256
embed_size:  512
Epoch [1/3], Step [935/1618], Loss: 2.3242, Perplexity: 10.2184batch_size:  256
embed_size:  512
Epoch [1/3], Step [936/1618], Loss: 2.4857, Perplexity: 12.0100batch_size:  256
embed_size:  512
Epoch [1/3], Step [937/1618], L

Epoch [1/3], Step [1012/1618], Loss: 2.1743, Perplexity: 8.7963batch_size:  256
embed_size:  512
Epoch [1/3], Step [1013/1618], Loss: 2.2680, Perplexity: 9.6603batch_size:  256
embed_size:  512
Epoch [1/3], Step [1014/1618], Loss: 2.2069, Perplexity: 9.0872batch_size:  256
embed_size:  512
Epoch [1/3], Step [1015/1618], Loss: 2.3244, Perplexity: 10.2205batch_size:  256
embed_size:  512
Epoch [1/3], Step [1016/1618], Loss: 2.3343, Perplexity: 10.3227batch_size:  256
embed_size:  512
Epoch [1/3], Step [1017/1618], Loss: 2.1764, Perplexity: 8.8147batch_size:  256
embed_size:  512
Epoch [1/3], Step [1018/1618], Loss: 2.2967, Perplexity: 9.9412batch_size:  256
embed_size:  512
Epoch [1/3], Step [1019/1618], Loss: 2.2856, Perplexity: 9.8319batch_size:  256
embed_size:  512
Epoch [1/3], Step [1020/1618], Loss: 2.2024, Perplexity: 9.0469batch_size:  256
embed_size:  512
Epoch [1/3], Step [1021/1618], Loss: 2.3110, Perplexity: 10.0842batch_size:  256
embed_size:  512
Epoch [1/3], Step [1022/161

Epoch [1/3], Step [1180/1618], Loss: 2.2220, Perplexity: 9.2261batch_size:  256
embed_size:  512
Epoch [1/3], Step [1181/1618], Loss: 2.3154, Perplexity: 10.1290batch_size:  256
embed_size:  512
Epoch [1/3], Step [1182/1618], Loss: 2.4877, Perplexity: 12.0331batch_size:  256
embed_size:  512
Epoch [1/3], Step [1183/1618], Loss: 2.2401, Perplexity: 9.3940batch_size:  256
embed_size:  512
Epoch [1/3], Step [1184/1618], Loss: 2.3953, Perplexity: 10.9714batch_size:  256
embed_size:  512
Epoch [1/3], Step [1185/1618], Loss: 2.3820, Perplexity: 10.8260batch_size:  256
embed_size:  512
Epoch [1/3], Step [1186/1618], Loss: 2.5897, Perplexity: 13.3260batch_size:  256
embed_size:  512
Epoch [1/3], Step [1187/1618], Loss: 2.2877, Perplexity: 9.8518batch_size:  256
embed_size:  512
Epoch [1/3], Step [1188/1618], Loss: 2.6475, Perplexity: 14.1191batch_size:  256
embed_size:  512
Epoch [1/3], Step [1189/1618], Loss: 2.4950, Perplexity: 12.1218batch_size:  256
embed_size:  512
Epoch [1/3], Step [1190

Epoch [1/3], Step [1348/1618], Loss: 2.2033, Perplexity: 9.0546batch_size:  256
embed_size:  512
Epoch [1/3], Step [1349/1618], Loss: 2.1580, Perplexity: 8.6539batch_size:  256
embed_size:  512
Epoch [1/3], Step [1350/1618], Loss: 2.5028, Perplexity: 12.2166batch_size:  256
embed_size:  512
Epoch [1/3], Step [1351/1618], Loss: 2.2542, Perplexity: 9.5275batch_size:  256
embed_size:  512
Epoch [1/3], Step [1352/1618], Loss: 2.3512, Perplexity: 10.4984batch_size:  256
embed_size:  512
Epoch [1/3], Step [1353/1618], Loss: 2.2539, Perplexity: 9.5248batch_size:  256
embed_size:  512
Epoch [1/3], Step [1354/1618], Loss: 2.1824, Perplexity: 8.8680batch_size:  256
embed_size:  512
Epoch [1/3], Step [1355/1618], Loss: 2.2534, Perplexity: 9.5199batch_size:  256
embed_size:  512
Epoch [1/3], Step [1356/1618], Loss: 2.1089, Perplexity: 8.2392batch_size:  256
embed_size:  512
Epoch [1/3], Step [1357/1618], Loss: 2.2012, Perplexity: 9.0358batch_size:  256
embed_size:  512
Epoch [1/3], Step [1358/1618

Epoch [1/3], Step [1516/1618], Loss: 2.6376, Perplexity: 13.9798batch_size:  256
embed_size:  512
Epoch [1/3], Step [1517/1618], Loss: 2.4989, Perplexity: 12.1691batch_size:  256
embed_size:  512
Epoch [1/3], Step [1518/1618], Loss: 2.2031, Perplexity: 9.0532batch_size:  256
embed_size:  512
Epoch [1/3], Step [1519/1618], Loss: 2.1913, Perplexity: 8.9471batch_size:  256
embed_size:  512
Epoch [1/3], Step [1520/1618], Loss: 2.1753, Perplexity: 8.8046batch_size:  256
embed_size:  512
Epoch [1/3], Step [1521/1618], Loss: 2.1732, Perplexity: 8.7863batch_size:  256
embed_size:  512
Epoch [1/3], Step [1522/1618], Loss: 2.2243, Perplexity: 9.2470batch_size:  256
embed_size:  512
Epoch [1/3], Step [1523/1618], Loss: 2.2871, Perplexity: 9.8462batch_size:  256
embed_size:  512
Epoch [1/3], Step [1524/1618], Loss: 2.2015, Perplexity: 9.0390batch_size:  256
embed_size:  512
Epoch [1/3], Step [1525/1618], Loss: 2.2015, Perplexity: 9.0383batch_size:  256
embed_size:  512
Epoch [1/3], Step [1526/1618

Epoch [2/3], Step [68/1618], Loss: 2.1989, Perplexity: 9.0149batch_size:  256
embed_size:  512
Epoch [2/3], Step [69/1618], Loss: 2.2702, Perplexity: 9.6810batch_size:  256
embed_size:  512
Epoch [2/3], Step [70/1618], Loss: 2.1720, Perplexity: 8.7754batch_size:  256
embed_size:  512
Epoch [2/3], Step [71/1618], Loss: 2.0998, Perplexity: 8.1646batch_size:  256
embed_size:  512
Epoch [2/3], Step [72/1618], Loss: 2.4776, Perplexity: 11.9129batch_size:  256
embed_size:  512
Epoch [2/3], Step [73/1618], Loss: 2.4827, Perplexity: 11.9741batch_size:  256
embed_size:  512
Epoch [2/3], Step [74/1618], Loss: 2.1427, Perplexity: 8.5225batch_size:  256
embed_size:  512
Epoch [2/3], Step [75/1618], Loss: 2.2457, Perplexity: 9.4467batch_size:  256
embed_size:  512
Epoch [2/3], Step [76/1618], Loss: 2.1882, Perplexity: 8.9192batch_size:  256
embed_size:  512
Epoch [2/3], Step [77/1618], Loss: 2.2190, Perplexity: 9.1979batch_size:  256
embed_size:  512
Epoch [2/3], Step [78/1618], Loss: 2.2497, Perpl

Epoch [2/3], Step [238/1618], Loss: 3.7327, Perplexity: 41.7897batch_size:  256
embed_size:  512
Epoch [2/3], Step [239/1618], Loss: 2.1405, Perplexity: 8.5037batch_size:  256
embed_size:  512
Epoch [2/3], Step [240/1618], Loss: 2.1512, Perplexity: 8.5955batch_size:  256
embed_size:  512
Epoch [2/3], Step [241/1618], Loss: 2.5165, Perplexity: 12.3846batch_size:  256
embed_size:  512
Epoch [2/3], Step [242/1618], Loss: 2.1526, Perplexity: 8.6071batch_size:  256
embed_size:  512
Epoch [2/3], Step [243/1618], Loss: 2.3459, Perplexity: 10.4431batch_size:  256
embed_size:  512
Epoch [2/3], Step [244/1618], Loss: 2.0530, Perplexity: 7.7909batch_size:  256
embed_size:  512
Epoch [2/3], Step [245/1618], Loss: 2.6295, Perplexity: 13.8669batch_size:  256
embed_size:  512
Epoch [2/3], Step [246/1618], Loss: 2.2060, Perplexity: 9.0797batch_size:  256
embed_size:  512
Epoch [2/3], Step [247/1618], Loss: 2.1805, Perplexity: 8.8503batch_size:  256
embed_size:  512
Epoch [2/3], Step [248/1618], Loss: 

Epoch [2/3], Step [408/1618], Loss: 2.2270, Perplexity: 9.2721batch_size:  256
embed_size:  512
Epoch [2/3], Step [409/1618], Loss: 2.1649, Perplexity: 8.7137batch_size:  256
embed_size:  512
Epoch [2/3], Step [410/1618], Loss: 2.1686, Perplexity: 8.7464batch_size:  256
embed_size:  512
Epoch [2/3], Step [411/1618], Loss: 2.5744, Perplexity: 13.1241batch_size:  256
embed_size:  512
Epoch [2/3], Step [412/1618], Loss: 2.0924, Perplexity: 8.1043batch_size:  256
embed_size:  512
Epoch [2/3], Step [413/1618], Loss: 2.0772, Perplexity: 7.9824batch_size:  256
embed_size:  512
Epoch [2/3], Step [414/1618], Loss: 2.6006, Perplexity: 13.4721batch_size:  256
embed_size:  512
Epoch [2/3], Step [415/1618], Loss: 2.1670, Perplexity: 8.7318batch_size:  256
embed_size:  512
Epoch [2/3], Step [416/1618], Loss: 2.1056, Perplexity: 8.2122batch_size:  256
embed_size:  512
Epoch [2/3], Step [417/1618], Loss: 1.9868, Perplexity: 7.2920batch_size:  256
embed_size:  512
Epoch [2/3], Step [418/1618], Loss: 2.

Epoch [2/3], Step [578/1618], Loss: 2.0371, Perplexity: 7.6683batch_size:  256
embed_size:  512
Epoch [2/3], Step [579/1618], Loss: 2.2406, Perplexity: 9.3992batch_size:  256
embed_size:  512
Epoch [2/3], Step [580/1618], Loss: 2.0053, Perplexity: 7.4282batch_size:  256
embed_size:  512
Epoch [2/3], Step [581/1618], Loss: 2.0454, Perplexity: 7.7320batch_size:  256
embed_size:  512
Epoch [2/3], Step [582/1618], Loss: 2.0126, Perplexity: 7.4829batch_size:  256
embed_size:  512
Epoch [2/3], Step [583/1618], Loss: 2.1106, Perplexity: 8.2531batch_size:  256
embed_size:  512
Epoch [2/3], Step [584/1618], Loss: 2.1108, Perplexity: 8.2551batch_size:  256
embed_size:  512
Epoch [2/3], Step [585/1618], Loss: 2.1164, Perplexity: 8.3009batch_size:  256
embed_size:  512
Epoch [2/3], Step [586/1618], Loss: 2.3459, Perplexity: 10.4425batch_size:  256
embed_size:  512
Epoch [2/3], Step [587/1618], Loss: 2.0958, Perplexity: 8.1318batch_size:  256
embed_size:  512
Epoch [2/3], Step [588/1618], Loss: 2.0

Epoch [2/3], Step [748/1618], Loss: 2.0817, Perplexity: 8.0183batch_size:  256
embed_size:  512
Epoch [2/3], Step [749/1618], Loss: 2.0609, Perplexity: 7.8530batch_size:  256
embed_size:  512
Epoch [2/3], Step [750/1618], Loss: 2.3010, Perplexity: 9.9839batch_size:  256
embed_size:  512
Epoch [2/3], Step [751/1618], Loss: 2.0684, Perplexity: 7.9119batch_size:  256
embed_size:  512
Epoch [2/3], Step [752/1618], Loss: 2.0817, Perplexity: 8.0184batch_size:  256
embed_size:  512
Epoch [2/3], Step [753/1618], Loss: 2.1211, Perplexity: 8.3404batch_size:  256
embed_size:  512
Epoch [2/3], Step [754/1618], Loss: 2.2125, Perplexity: 9.1383batch_size:  256
embed_size:  512
Epoch [2/3], Step [755/1618], Loss: 2.1689, Perplexity: 8.7490batch_size:  256
embed_size:  512
Epoch [2/3], Step [756/1618], Loss: 2.4478, Perplexity: 11.5634batch_size:  256
embed_size:  512
Epoch [2/3], Step [757/1618], Loss: 2.2884, Perplexity: 9.8589batch_size:  256
embed_size:  512
Epoch [2/3], Step [758/1618], Loss: 1.9

Epoch [2/3], Step [918/1618], Loss: 2.0001, Perplexity: 7.3901batch_size:  256
embed_size:  512
Epoch [2/3], Step [919/1618], Loss: 2.0238, Perplexity: 7.5668batch_size:  256
embed_size:  512
Epoch [2/3], Step [920/1618], Loss: 2.2159, Perplexity: 9.1701batch_size:  256
embed_size:  512
Epoch [2/3], Step [921/1618], Loss: 2.1508, Perplexity: 8.5921batch_size:  256
embed_size:  512
Epoch [2/3], Step [922/1618], Loss: 2.1834, Perplexity: 8.8766batch_size:  256
embed_size:  512
Epoch [2/3], Step [923/1618], Loss: 2.2278, Perplexity: 9.2790batch_size:  256
embed_size:  512
Epoch [2/3], Step [924/1618], Loss: 2.0089, Perplexity: 7.4549batch_size:  256
embed_size:  512
Epoch [2/3], Step [925/1618], Loss: 2.3831, Perplexity: 10.8384batch_size:  256
embed_size:  512
Epoch [2/3], Step [926/1618], Loss: 2.0779, Perplexity: 7.9881batch_size:  256
embed_size:  512
Epoch [2/3], Step [927/1618], Loss: 2.2951, Perplexity: 9.9252batch_size:  256
embed_size:  512
Epoch [2/3], Step [928/1618], Loss: 2.4

Epoch [2/3], Step [1087/1618], Loss: 2.4584, Perplexity: 11.6865batch_size:  256
embed_size:  512
Epoch [2/3], Step [1088/1618], Loss: 2.0819, Perplexity: 8.0199batch_size:  256
embed_size:  512
Epoch [2/3], Step [1089/1618], Loss: 2.1006, Perplexity: 8.1711batch_size:  256
embed_size:  512
Epoch [2/3], Step [1090/1618], Loss: 2.0127, Perplexity: 7.4836batch_size:  256
embed_size:  512
Epoch [2/3], Step [1091/1618], Loss: 2.0424, Perplexity: 7.7093batch_size:  256
embed_size:  512
Epoch [2/3], Step [1092/1618], Loss: 2.0588, Perplexity: 7.8366batch_size:  256
embed_size:  512
Epoch [2/3], Step [1093/1618], Loss: 1.9025, Perplexity: 6.7027batch_size:  256
embed_size:  512
Epoch [2/3], Step [1094/1618], Loss: 2.1904, Perplexity: 8.9384batch_size:  256
embed_size:  512
Epoch [2/3], Step [1095/1618], Loss: 2.2113, Perplexity: 9.1278batch_size:  256
embed_size:  512
Epoch [2/3], Step [1096/1618], Loss: 2.1012, Perplexity: 8.1759batch_size:  256
embed_size:  512
Epoch [2/3], Step [1097/1618]

Epoch [2/3], Step [1255/1618], Loss: 2.2000, Perplexity: 9.0246batch_size:  256
embed_size:  512
Epoch [2/3], Step [1256/1618], Loss: 2.2872, Perplexity: 9.8476batch_size:  256
embed_size:  512
Epoch [2/3], Step [1257/1618], Loss: 1.9499, Perplexity: 7.0278batch_size:  256
embed_size:  512
Epoch [2/3], Step [1258/1618], Loss: 2.0509, Perplexity: 7.7750batch_size:  256
embed_size:  512
Epoch [2/3], Step [1259/1618], Loss: 2.5937, Perplexity: 13.3789batch_size:  256
embed_size:  512
Epoch [2/3], Step [1260/1618], Loss: 2.4035, Perplexity: 11.0617batch_size:  256
embed_size:  512
Epoch [2/3], Step [1261/1618], Loss: 2.2673, Perplexity: 9.6537batch_size:  256
embed_size:  512
Epoch [2/3], Step [1262/1618], Loss: 2.0726, Perplexity: 7.9452batch_size:  256
embed_size:  512
Epoch [2/3], Step [1263/1618], Loss: 2.0944, Perplexity: 8.1203batch_size:  256
embed_size:  512
Epoch [2/3], Step [1264/1618], Loss: 2.2066, Perplexity: 9.0844batch_size:  256
embed_size:  512
Epoch [2/3], Step [1265/1618

Epoch [2/3], Step [1423/1618], Loss: 2.3412, Perplexity: 10.3935batch_size:  256
embed_size:  512
Epoch [2/3], Step [1424/1618], Loss: 1.9682, Perplexity: 7.1580batch_size:  256
embed_size:  512
Epoch [2/3], Step [1425/1618], Loss: 1.9384, Perplexity: 6.9477batch_size:  256
embed_size:  512
Epoch [2/3], Step [1426/1618], Loss: 1.9813, Perplexity: 7.2520batch_size:  256
embed_size:  512
Epoch [2/3], Step [1427/1618], Loss: 1.9405, Perplexity: 6.9623batch_size:  256
embed_size:  512
Epoch [2/3], Step [1428/1618], Loss: 2.0695, Perplexity: 7.9207batch_size:  256
embed_size:  512
Epoch [2/3], Step [1429/1618], Loss: 2.1902, Perplexity: 8.9370batch_size:  256
embed_size:  512
Epoch [2/3], Step [1430/1618], Loss: 1.9484, Perplexity: 7.0177batch_size:  256
embed_size:  512
Epoch [2/3], Step [1431/1618], Loss: 2.0310, Perplexity: 7.6221batch_size:  256
embed_size:  512
Epoch [2/3], Step [1432/1618], Loss: 2.1001, Perplexity: 8.1674batch_size:  256
embed_size:  512
Epoch [2/3], Step [1433/1618]

Epoch [2/3], Step [1591/1618], Loss: 2.0558, Perplexity: 7.8134batch_size:  256
embed_size:  512
Epoch [2/3], Step [1592/1618], Loss: 2.0247, Perplexity: 7.5740batch_size:  256
embed_size:  512
Epoch [2/3], Step [1593/1618], Loss: 2.0727, Perplexity: 7.9460batch_size:  256
embed_size:  512
Epoch [2/3], Step [1594/1618], Loss: 1.9947, Perplexity: 7.3498batch_size:  256
embed_size:  512
Epoch [2/3], Step [1595/1618], Loss: 2.0357, Perplexity: 7.6578batch_size:  256
embed_size:  512
Epoch [2/3], Step [1596/1618], Loss: 1.9485, Perplexity: 7.0178batch_size:  256
embed_size:  512
Epoch [2/3], Step [1597/1618], Loss: 2.2464, Perplexity: 9.4540batch_size:  256
embed_size:  512
Epoch [2/3], Step [1598/1618], Loss: 2.0016, Perplexity: 7.4010batch_size:  256
embed_size:  512
Epoch [2/3], Step [1599/1618], Loss: 1.9836, Perplexity: 7.2692batch_size:  256
embed_size:  512
Epoch [2/3], Step [1600/1618], Loss: 2.1260, Perplexity: 8.3815
batch_size:  256
embed_size:  512
Epoch [2/3], Step [1601/1618]