#Assignment - Transliteration

In this task you are required to solve the transliteration problem of names from English to Russian. Transliteration of a string means writing this string using the alphabet of another language with the preservation of pronunciation, although not always.


## Instructions

To complete the assignment please do the following  steps (both are requred to get the full credits): 

###1. Complete this notebook

Upload a filled notebook with code (this file). You will be asked to implement a transformer-based approach for transliteration.

You should implement your ``train`` and ``classify`` functions in this notebook in the cells below. Your model should be implemented as a special class/function in this notebook (be sure if you add any outer dependencies that everything is improted correctly and can be reproducable). 


###2. Submit solution to the shared task

After the implementation of models' architectures you are asked to participate in the [competition](https://competitions.codalab.org/competitions/30932) to solve **Transliteration** task using your implemented code. 

You should use your code from the previous part to train, validate, and generate predictions for the public (Practice) and private (Evaluation) test sets. It will produce predictions (`preds_translit.tsv`) for the dataset and score them if the true answers are present. You can use these scores to evaluate your model on dev set and choose the best one. Be sure to download the [dataset](https://github.com/skoltech-nlp/filimdb_evaluation/blob/master/TRANSLIT.tar.gz) and unzip it with `wget` command and run them from notebook cells. 

Upload obtained TSV file with your predictions (``preds_translit.tsv``) in ``.zip`` for the best results to both phases of the competition.


**Important: You must indicate "DL4NLP-23" as your team name in Codalab. Without it your submission will be invalid!**


## Basic algorithm

The basic algorithm is based on the following idea: for transliteration, alphabetic n-grams from one language can be transformed into another language into n-grams of the same size, using the most frequent transformation rule found according to statistics on the training sample. 

To test the implementation, download the data, unzip the datasets, predict transliteration and run the evaluation script. To do this, you need to run the following commands:

In [1]:

!wget https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz


--2023-04-10 16:04:39--  https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz [following]
--2023-04-10 16:04:39--  https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1546458 (1.5M) [application/octet-stream]
Saving to: ‘TRANSLIT.tar.gz’


2023-04-10 16:04:39 (95.5 MB/s) - ‘TRANSLIT.tar.gz’ saved [1546458/1546458]



In [2]:
!gunzip TRANSLIT.tar.gz

In [3]:
!tar -xf TRANSLIT.tar

### Baseline code

In [4]:
from typing import List, Any
from random import random
import collections as col

def baseline_train(
        train_source_strings: List[str],
        train_target_strings: List[str]) -> Any:
    """
    Trains transliretation model on the given train set represented as
    parallel list of input strings and their transliteration via labels.
    :param train_source_strings: a list of strings, one str per example
    :param train_target_strings: a list of strings, one str per example
    :return: learnt parameters, or any object you like (it will be passed to the classify function)
    """

    ngram_lvl = 3
    def obtain_train_dicts(train_source_strings, train_target_strings,
                            ngram_lvl):
        ngrams_dict = col.defaultdict(lambda: col.defaultdict(int))
        for src_str,dst_str in zip(train_source_strings,
                                        train_target_strings):
            try:
                src_ngrams = [src_str[i:i+ngram_lvl] for i in
                                range(len(src_str)-ngram_lvl+1)]
                dst_ngrams = [dst_str[i:i+ngram_lvl] for i in
                                range(len(dst_str)-ngram_lvl+1)]
            except TypeError as e:
                print(src_ngrams, dst_ngrams)
                print(e)
                raise StopIteration
            for src_ngram in src_ngrams:
                for dst_ngram in dst_ngrams:
                    ngrams_dict[src_ngram][dst_ngram] += 1
        return ngrams_dict
        
    ngrams_dict = col.defaultdict(lambda: col.defaultdict(int))
    for nl in range(1, ngram_lvl+1):
        ngrams_dict.update(
            obtain_train_dicts(train_source_strings,
                            train_target_strings, nl))
    return ngrams_dict 


def baseline_classify(strings: List[str], params: Any) -> List[str]:
    """
    Classify strings given previously learnt parameters.
    :param strings: strings to classify
    :param params: parameters received from train function
    :return: list of lists of predicted transliterated strings
      (for each source string -> [top_1 prediction, .., top_k prediction]
        if it is possible to generate more than one, otherwise
        -> [prediction])
        corresponding to the given list of strings
    """
       
    def predict_one_sample(sample, train_dict, ngram_lvl=1):
        ngrams = [sample[i:i+ngram_lvl] for i in
 range(0,(len(sample) // ngram_lvl * ngram_lvl)-ngram_lvl+1, ngram_lvl)] +\
                 ([] if len(sample) % ngram_lvl == 0 else
                    [sample[-(len(sample) % ngram_lvl):]])
        prediction = ''
        for ngram in ngrams:
            ngram_dict = train_dict[ngram]
            if len(ngram_dict.keys()) == 0:
                prediction += '?'*len(ngram)
            else:
                prediction += max(ngram_dict, key=lambda k: ngram_dict[k])
        return prediction 
    
    ngram_lvl = 3
    predictions = []
    ngrams_dict = params
    for string in strings:
        top_1_pred = predict_one_sample(string, ngrams_dict,
                                                ngram_lvl)
        predictions.append([top_1_pred])
    return predictions

### Evaluation code

In [5]:
PREDS_FNAME = "preds_translit_baseline.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [6]:
import codecs
from pandas import read_csv

def load_dataset(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2ixy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            if col_count == 2:
                strings, transliterations = zip(
                    *list(map(lambda l: l.strip('\n').split('\t'), lines))
                )
            elif col_count == 1:
                strings = list(map(lambda l: l.strip('\n'), lines))
                transliterations = None
            else:
                raise ValueError("wrong amount of columns")
        part2ixy[part] = (
            [f'{part}/{i}' for i in range(len(strings))],
            strings, transliterations,
        )
    return part2ixy


def load_transliterations_only(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2iy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            n_lines = len(lines)
            if col_count == 2:
                transliterations = [l.strip('\n').split('\t')[1] for l in lines]
            elif col_count == 1:
                transliterations = None
            else:
                raise ValueError("Wrong amount of columns")
        part2iy[part] = (
            [f'{part}/{i}' for i in range(n_lines)],
            transliterations,
        )
    return part2iy


def save_preds(preds, preds_fname):
    """
    Save classifier predictions in format appropriate for scoring.
    """
    with codecs.open(preds_fname, 'w') as outp:
        for idx, preds in preds:
            print(idx, *preds, sep='\t', file=outp)
    print('Predictions saved to %s' % preds_fname)


def load_preds(preds_fname, top_k=1):
    """
    Load classifier predictions in format appropriate for scoring.
    """
    kwargs = {
        "filepath_or_buffer": preds_fname,
        "names": ["id", "pred"],
        "sep": '\t',
    }

    pred_ids = list(read_csv(**kwargs, usecols=["id"])["id"])

    pred_y = {
        pred_id: [y]
        for pred_id, y in zip(
            pred_ids, read_csv(**kwargs, usecols=["pred"])["pred"]
        )
    }

    for y in pred_y.values():
        assert len(y) == top_k

    return pred_ids, pred_y


def compute_hit_k(preds, k=10):
    raise NotImplementedError


def compute_mrr(preds):
    raise NotImplementedError


def compute_acc_1(preds, true):
    right_answers = 0
    bonus = 0
    for pred, y in zip(preds, true):
        if pred[0] == y:
            right_answers += 1
        elif pred[0] != pred[0] and y == 'нань':
            print('Your test file contained empty string, skipping %f and %s' % (pred[0], y))
            bonus += 1 # bugfix: skip empty line in test
    return right_answers / (len(preds) - bonus)


def score(preds, true):
    assert len(preds) == len(true), 'inconsistent amount of predictions and ground truth answers'
    acc_1 = compute_acc_1(preds, true)
    return {'acc@1': acc_1}


def score_preds(preds_path, data_dir, parts=SCORED_PARTS):
    part2iy = load_transliterations_only(data_dir, parts=parts)
    pred_ids, pred_dict = load_preds(preds_path)
    # pred_dict = {i:y for i,y in zip(pred_ids, pred_y)}
    scores = {}
    for part, (true_ids, true_y) in part2iy.items():
        if true_y is None:
            print('no labels for %s set' % part)
            continue
        pred_y = [pred_dict[i] for i in true_ids]
        score_values = score(pred_y, true_y)
        acc_1 = score_values['acc@1']
        print('%s set accuracy@1: %.2f' % (part, acc_1))
        scores[part] = score_values 
    return scores

### Train and predict results

In [7]:
from time import time
import numpy as np
import os


def train_and_predict(translit_path, scored_parts):
    top_k = 1
    part2ixy = load_dataset(translit_path, parts=scored_parts)
    train_ids, train_strings, train_transliterations = part2ixy['train']
    print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
    st = time()
    params = baseline_train(train_strings, train_transliterations)
    print('Classifier trained in %.2fs' % (time() - st))

    allpreds = []
    for part, (ids, x, y) in part2ixy.items():
        print('\nClassifying %s set with %d examples ...' % (part, len(x)))
        st = time()
        preds = baseline_classify(x, params)
        print('%s set classified in %.2fs' % (part, time() - st))
        count_of_values = list(map(len, preds))
        assert np.all(np.array(count_of_values) == top_k)
        #score(preds, y)
        allpreds.extend(zip(ids, preds))

    save_preds(allpreds, preds_fname=PREDS_FNAME)
    print('\nChecking saved predictions ...')
    return score_preds(preds_path=PREDS_FNAME, data_dir=translit_path, parts=scored_parts)

In [8]:
train_and_predict(TRANSLIT_PATH, SCORED_PARTS)


Training classifier on 105371 examples from train set ...
Classifier trained in 5.68s

Classifying train set with 105371 examples ...
train set classified in 30.88s

Classifying dev set with 26342 examples ...
dev set classified in 8.52s

Classifying train_small set with 2000 examples ...
train_small set classified in 0.55s

Classifying dev_small set with 2000 examples ...
dev_small set classified in 0.56s

Classifying test set with 32926 examples ...
test set classified in 10.61s
Predictions saved to preds_translit_baseline.tsv

Checking saved predictions ...
train set accuracy@1: 0.33
dev set accuracy@1: 0.31
train_small set accuracy@1: 0.34
dev_small set accuracy@1: 0.32
no labels for test set


{'train': {'acc@1': 0.32907536229133255},
 'dev': {'acc@1': 0.3112899552046162},
 'train_small': {'acc@1': 0.3365},
 'dev_small': {'acc@1': 0.323}}

## Transformer-based approach


To implement your algorithm, use the template code, which needs to be modified.

First, you need to add some details in the code of the Transformer architecture, implement the methods of the class `LrScheduler`, which is responsible for updating the learning rate during training.
Next, you need to select the hyperparameters for the model according to the proposed guide.

In [9]:
!pip install Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Levenshtein
  Downloading Levenshtein-0.20.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.5/175.5 KB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.15.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m67.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.20.9 rapidfuzz-2.15.0


In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import numpy as np
import itertools as it
import collections as col
import random
import os
import copy
import json
from tqdm import tqdm
import datetime, time

import copy
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.utils.data as torch_data
import itertools as it
import collections as col
import random

import Levenshtein as le

### Load dataset and embeddings

In [11]:
def load_datasets(data_dir_path, parts):
    datasets = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        datasets[part] = pd.read_csv(path, sep='\t', na_filter=False)
        print(f'Loaded {part} dataset, length: {len(datasets[part])}')
    return datasets

In [12]:
class TextEncoder:
    def __init__(self, load_dir_path=None):
        self.lang_keys = ['en', 'ru']
        self.directions = ['id2token', 'token2id']
        self.service_token_names = {
            'pad_token': '<pad>',
            'start_token': '<start>',
            'unk_token': '<unk>',
            'end_token': '<end>'
        }
        service_id2token = dict(enumerate(self.service_token_names.values()))
        service_token2id ={v:k for k,v in service_id2token.items()}
        self.service_vocabs = dict(zip(self.directions,
                                       [service_id2token, service_token2id]))
        if load_dir_path is None:
            self.vocabs = {}
            for lk in self.lang_keys:
                self.vocabs[lk] = copy.deepcopy(self.service_vocabs)
        else:
            self.vocabs = self.load_vocabs(load_dir_path)
    def load_vocabs(self, load_dir_path):
        vocabs = {}
        load_path = os.path.join(load_dir_path, 'vocabs')
        for lk in self.lang_keys:
            vocabs[lk] = {}
            for d in self.directions:
                columns = d.split('2')
                print(lk, d)
                df = pd.read_csv(os.path.join(load_path, f'{lk}_{d}'))
                vocabs[lk][d] = dict(zip(*[df[c] for c in columns]))
        return vocabs
    
    def save_vocabs(self, save_dir_path):
        save_path = os.path.join(save_dir_path, 'vocabs')
        os.makedirs(save_path, exist_ok=True)
        for lk in self.lang_keys:
            for d in self.directions:
                columns = d.split('2')
                pd.DataFrame(data=self.vocabs[lk][d].items(),
                    columns=columns).to_csv(os.path.join(save_path, f'{lk}_{d}'),
                                                index=False,
                                                sep=',')
    def make_vocabs(self, data_df):
        for lk in self.lang_keys:
            tokens = col.Counter(''.join(list(it.chain(*data_df[lk])))).keys()
            part_id2t = dict(enumerate(tokens, start=len(self.service_token_names)))
            part_t2id = {k:v for v,k in part_id2t.items()}
            part_vocabs = [part_id2t, part_t2id]
            for i in range(len(self.directions)):
                self.vocabs[lk][self.directions[i]].update(part_vocabs[i])
                
        self.src_vocab_size = len(self.vocabs['en']['id2token'])
        self.tgt_vocab_size = len(self.vocabs['ru']['id2token'])
                
    def frame(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_token_names['start_token']
        if end_token is None:
            end_token=self.service_token_names['end_token']
        return [start_token] + sample + [end_token]
    def token2id(self, samples, frame, lang_key):
        if frame:
            samples = list(map(self.frame, samples))
        vocab = self.vocabs[lang_key]['token2id']
        return list(map(lambda s:
                        [vocab[t] if t in vocab.keys() else vocab[self.service_token_names['unk_token']]
                         for t in s], samples))
    
    def unframe(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_vocabs['token2id'][self.service_token_names['start_token']]
        if end_token is None:
            end_token=self.service_vocabs['token2id'][self.service_token_names['end_token']]
        pad_token=self.service_vocabs['token2id'][self.service_token_names['pad_token']]
        return list(it.takewhile(lambda e: e != end_token and e != pad_token, sample[1:]))
    def id2token(self, samples, unframe, lang_key):
        if unframe:
            samples = list(map(self.unframe, samples))
        vocab = self.vocabs[lang_key]['id2token']
        return list(map(lambda s:
                        [vocab[idx] if idx in vocab.keys() else self.service_token_names['unk_token'] for idx in s], samples))


class TranslitData(torch_data.Dataset):
    def __init__(self, source_strings, target_strings,
                text_encoder):
        super(TranslitData, self).__init__()
        self.source_strings = source_strings
        self.text_encoder = text_encoder
        if target_strings is not None:
            assert len(source_strings) == len(target_strings)
            self.target_strings = target_strings
        else:
            self.target_strings = None
    def __len__(self):
        return len(self.source_strings)
    def __getitem__(self, idx):
        src_str = self.source_strings[idx]
        encoder_input = self.text_encoder.token2id([list(src_str)], frame=True, lang_key='en')[0]
        if self.target_strings is not None:
            tgt_str = self.target_strings[idx]
            tmp = self.text_encoder.token2id([list(tgt_str)], frame=True, lang_key='ru')[0]
            decoder_input = tmp[:-1]
            decoder_target = tmp[1:]
            return (encoder_input, decoder_input, decoder_target)
        else:
            return (encoder_input,)


class BatchSampler(torch_data.BatchSampler):
    def __init__(self, sampler, batch_size, drop_last, shuffle_each_epoch):
        super(BatchSampler, self).__init__(sampler, batch_size, drop_last)
        self.batches = []
        for b in super(BatchSampler, self).__iter__():
            self.batches.append(b)
        self.shuffle_each_epoch = shuffle_each_epoch
        if self.shuffle_each_epoch:
            random.shuffle(self.batches)
        self.index = 0
        #print(f'Batches collected: {len(self.batches)}')
    def __iter__(self):
        self.index = 0
        return self
    def __next__(self):
        if self.index == len(self.batches):
            if self.shuffle_each_epoch:
                random.shuffle(self.batches)
            raise StopIteration
        else:
            batch = self.batches[self.index]
            self.index += 1
            return batch

def collate_fn(batch_list):
    '''batch_list can store either 3 components:
        encoder_inputs, decoder_inputs, decoder_targets
        or single component: encoder_inputs'''
    components = list(zip(*batch_list))
    batch_tensors = []
    for data in components:
        max_len = max([len(sample) for sample in data])
        #print(f'Maximum length in batch = {max_len}')
        sample_tensors = [torch.tensor(s, requires_grad=False, dtype=torch.int64)
                         for s in data]
        batch_tensors.append(nn.utils.rnn.pad_sequence(
            sample_tensors,
            batch_first=True, padding_value=0))
    return tuple(batch_tensors) 


def create_dataloader(source_strings, target_strings,
                      text_encoder, batch_size,
                      shuffle_batches_each_epoch):
    '''target_strings parameter can be None'''
    dataset = TranslitData(source_strings, target_strings,
                                text_encoder=text_encoder)
    seq_sampler = torch_data.SequentialSampler(dataset)
    batch_sampler = BatchSampler(seq_sampler, batch_size=batch_size,
                                drop_last=False,
                                shuffle_each_epoch=shuffle_batches_each_epoch)
    dataloader = torch_data.DataLoader(dataset,
                                       batch_sampler=batch_sampler,
                                       collate_fn=collate_fn)
    return dataloader

### Metric function

In [67]:
def compute_metrics(predicted_strings, target_strings, metrics):
    metric_values = {}
    if metrics == 'acc@1':
      metric_values[metrics] = sum(predicted_strings == target_strings) / len(target_strings)
    elif metrics =='mean_ld@1':
      metric_values[metrics] =\
      np.mean(list(map(lambda e: le.distance(*e), zip(predicted_strings, target_strings))))
    else: 
      raise ValueError(f'Unknown metric: {metrics}')
    return metric_values

###  Positional Encoding

As you remember, Transformer treats an input sequence of elements as a time series. Since the Encoder inside the Transformer simultaneously processes the entire input sequence, the information about the position of the element needs to be encoded inside its embedding, since it is not identified in any other way inside the model. That is why the PositionalEncoding layer is used, which sums embeddings with a vector of the same dimension.
Let the matrix of these vectors for each position of the time series be denoted as $PE$. Then the elements of the matrix are:

$$ PE_{(pos,2i)} = \sin{(pos/10000^{2i/d_{model}})}$$
$$ PE_{(pos,2i+1)} = \cos{(pos/10000^{2i/d_{model}})}$$

where $pos$ - is the position, $i$ - index of the component of the corresponging vector, $d_{model}$ - dimension of each vector. Thus, even components represent sine values, and odd ones represent cosine values with different arguments.

In this task you are required to implement these formulas inside the class constructor *PositionalEncoding* in the main file ``translit.py``, which you are to upload. To run the test use the following function:

`test_positional_encoding()`

Make sure that there is no any `AssertionError`!


In [14]:
class Embedding(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(Embedding, self).__init__()
        self.emb_layer = nn.Embedding(vocab_size, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):
        return self.emb_layer(x)

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_size, max_len=512):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, hidden_size, requires_grad=False)
        # TODO: implement your code here 
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, hidden_size, 2) * -(np.log(10000.0) / hidden_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: shape (batch size, sequence length, hidden size)
        x = x + self.pe[:, :x.size(1)]
        return x

In [15]:
def test_positional_encoding():
    pe = PositionalEncoding(max_len=3, hidden_size=4)
    res_1 = torch.tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000],
                           [ 0.8415,  0.5403,  0.0100,  0.9999],
                           [ 0.9093, -0.4161,  0.0200,  0.9998]]])
    # print(pe.pe - res_1)
    assert torch.all(torch.abs(pe.pe - res_1) < 1e-4).item()
    print('Test is passed!')

In [16]:
test_positional_encoding()

Test is passed!


### LayerNorm

In [17]:
class LayerNorm(nn.Module):
    "Layer Normalization layer"

    def __init__(self, hidden_size, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gain = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gain * (x - mean) / (std + self.eps) + self.bias

### SublayerConnection

In [18]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer normalization.
    """

    def __init__(self, hidden_size, dropout):
        super(SublayerConnection, self).__init__()
        self.layer_norm = LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.layer_norm(x + self.dropout(sublayer(x)))

def padding_mask(x, pad_idx=0):
    assert len(x.size()) >= 2
    return (x != pad_idx).unsqueeze(-2)

def look_ahead_mask(size):
    "Mask out the right context"
    attn_shape = (1, size, size)
    look_ahead_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(look_ahead_mask) == 0

def compositional_mask(x, pad_idx=0):
    pm = padding_mask(x, pad_idx=pad_idx)
    seq_length = x.size(-1)
    result_mask = pm & \
                  look_ahead_mask(seq_length).type_as(pm.data)
    return result_mask

### FeedForward

In [19]:
class FeedForward(nn.Module):
    def __init__(self, hidden_size, ff_hidden_size, dropout=0.1):
        super(FeedForward, self).__init__()
        self.pre_linear = nn.Linear(hidden_size, ff_hidden_size)
        self.post_linear = nn.Linear(ff_hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.post_linear(self.dropout(F.relu(self.pre_linear(x))))

def clone_layer(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

###  MultiHeadAttention


Then you are required to implement `attention` method in the class  `MultiHeadAttention`. The MultiHeadAttention layer takes as input  query vectors, key and value vectors for each step of the sequence of matrices  Q,K,V correspondingly. Each key vector, value vector, and query vector is obtained as a result of linear projection using one of three trained vector parameter matrices from the previous layer. This semantics can be represented in the form of formulas:
$$
Attention(Q, K, V)=softmax\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V\\
$$

$$
MultiHead(Q, K, V) = Concat\left(head_1, ... , head_h\right) W^O\\
$$

$$
head_i=Attention\left(Q W_i^Q, K W_i^K, V W_i^V\right)\\
$$
$h$ - the number of attention heads - parallel sub-layers for Scaled Dot-Product Attention on a vector of smaller dimension ($d_{k} = d_{q} = d_{v} = d_{model} / h$). 
The logic of  \texttt{MultiHeadAttention} is presented in the picture (from original  [paper](https://arxiv.org/abs/1706.03762)):

![](https://lilianweng.github.io/lil-log/assets/images/transformer.png)


Inside a method `attention` you are required to create a dropout layer from  MultiHeadAttention class constructor. Dropout layer is to be applied directly on the attention weights - the result of softmax operation. Value of drop probability  can be regulated in the train in the `model_config['dropout']['attention']`.

The correctness of implementation can be checked with
`test_multi_head_attention()`



In [20]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, hidden_size, dropout=None):
        super(MultiHeadAttention, self).__init__()
        assert hidden_size % n_heads == 0
        self.head_hidden_size = hidden_size // n_heads
        self.n_heads = n_heads
        self.linears = clone_layer(nn.Linear(hidden_size, hidden_size), 4)
        self.attn_weights = None
        self.dropout = dropout
        if self.dropout is not None:
            self.dropout_layer = nn.Dropout(p=self.dropout)

    def attention(self, query, key, value, mask):
        """Compute 'Scaled Dot Product Attention'
            query, key and value tensors have the same shape:
                (batch size, number of heads, sequence length, head hidden size)
            mask shape: (batch size, 1, sequence length, sequence length)
                '1' dimension value will be broadcasted to number of heads inside your operations
            mask should be applied before using softmax to get attn_weights
        """
        ## attn_weights shape: (batch size, number of heads, sequence length, sequence length)
        ## output shape: (batch size, number of heads, sequence length, head hidden size)
        ## TODO: provide your implementation here
        ## don't forget to apply dropout to attn_weights if self.dropout is not None
        attn_weights = torch.matmul(query/(self.head_hidden_size)**0.5,key.transpose(2,3))

        if mask is not None:
          attn_weights = attn_weights.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(attn_weights,dim=-1)

        if self.dropout is not None:
          attn_weights = self.dropout_layer(attn_weights)

        output = torch.matmul(attn_weights,value)
        #raise NotImplementedError
        return output, attn_weights

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)

        # Split vectors for different attention heads (from hidden_size => n_heads x head_hidden_size)
        # and do separate linear projection, for separate trainable weights
        query, key, value = \
            [l(x).view(batch_size, -1, self.n_heads, self.head_hidden_size).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        x, self.attn_weights = self.attention(query, key, value, mask=mask)
        # x shape: (batch size, number of heads, sequence length, head hidden size)
        # self.attn_weights shape: (batch size, number of heads, sequence length, sequence length)

        # Concatenate the output of each head
        x = x.transpose(1, 2).contiguous() \
            .view(batch_size, -1, self.n_heads * self.head_hidden_size)

        return self.linears[-1](x)

In [21]:
def test_multi_head_attention():
    mha = MultiHeadAttention(n_heads=1, hidden_size=5, dropout=None)
    # batch_size == 2, sequence length == 3, hidden_size == 5
    # query = torch.arange(150).reshape(2, 3, 5)
    query = torch.tensor([[[[ 0.64144618, -0.95817388,  0.37432297,  0.58427106,
          -0.94668716]],
        [[-0.23199289,  0.66329209, -0.46507035, -0.54272512,
          -0.98640698]],
        [[ 0.07546638, -0.09277002,  0.20107185, -0.97407381,
          -0.27713414]]],
       [[[ 0.14727783,  0.4747886 ,  0.44992016, -0.2841419 ,
          -0.81820319]],
        [[-0.72324994,  0.80643179, -0.47655449,  0.45627872,
           0.60942404]],
        [[ 0.61712569, -0.62947282, -0.95215713, -0.38721959,
          -0.73289725]]]])
    key = torch.tensor([[[[-0.81759856, -0.60049991, -0.05923424,  0.51898901,
          -0.3366209 ]],
        [[ 0.83957818, -0.96361722,  0.62285191,  0.93452467,
           0.51219613]],
        [[-0.72758847,  0.41256154,  0.00490795,  0.59892503,
          -0.07202049]]],
       [[[ 0.72315339, -0.49896314,  0.94254637, -0.54356006,
          -0.04837949]],
        [[ 0.51759322, -0.43927061, -0.59924184,  0.92241702,
          -0.86811696]],
        [[-0.54322046, -0.92323003, -0.827746  ,  0.90842783,
           0.88428119]]]])
    value = torch.tensor([[[[-0.83895431,  0.805027  ,  0.22298283, -0.84849915,
          -0.34906026]],
        [[-0.02899652, -0.17456128, -0.17535998, -0.73160314,
          -0.13468061]],
        [[ 0.75234265,  0.02675947,  0.84766286, -0.5475651 ,
          -0.83319316]]],
       [[[-0.47834413,  0.34464645, -0.41921457,  0.33867964,
           0.43470836]],
        [[-0.99000979,  0.10220893, -0.4932273 ,  0.95938905,
           0.01927012]],
        [[ 0.91607137,  0.57395644, -0.90914179,  0.97212912,
           0.33078759]]]])
    query = query.float().transpose(1,2)
    key = key.float().transpose(1,2)
    value = value.float().transpose(1,2)

    x,_ = torch.max(query[:,0,:,:], axis=-1)
    mask = compositional_mask(x)
    mask.unsqueeze_(1)
    for n,t in [('query', query), ('key', key), ('value', value), ('mask', mask)]:
        print(f'Name: {n}, shape: {t.size()}')
    with torch.no_grad():
        output, attn_weights = mha.attention(query, key, value, mask=mask)
    assert output.size() == torch.Size([2,1,3,5])
    assert attn_weights.size() == torch.Size([2,1,3,3])

    truth_output = torch.tensor([[[[-0.8390,  0.8050,  0.2230, -0.8485, -0.3491],
          [-0.6043,  0.5212,  0.1076, -0.8146, -0.2870],
          [-0.0665,  0.2461,  0.3038, -0.7137, -0.4410]]],
        [[[-0.4783,  0.3446, -0.4192,  0.3387,  0.4347],
          [-0.7959,  0.1942, -0.4652,  0.7239,  0.1769],
          [-0.3678,  0.2868, -0.5799,  0.7987,  0.2086]]]])
    truth_attn_weights = torch.tensor([[[[1.0000, 0.0000, 0.0000],
          [0.7103, 0.2897, 0.0000],
          [0.3621, 0.3105, 0.3274]]],
        [[[1.0000, 0.0000, 0.0000],
          [0.3793, 0.6207, 0.0000],
          [0.2642, 0.4803, 0.2555]]]])
    # print(torch.abs(output - truth_output))
    # print(torch.abs(attn_weights - truth_attn_weights))
    assert torch.all(torch.abs(output - truth_output) < 1e-4).item()
    assert torch.all(torch.abs(attn_weights - truth_attn_weights) < 1e-4).item()
    print('Test is passed!')

In [22]:
test_multi_head_attention()

Name: query, shape: torch.Size([2, 1, 3, 5])
Name: key, shape: torch.Size([2, 1, 3, 5])
Name: value, shape: torch.Size([2, 1, 3, 5])
Name: mask, shape: torch.Size([2, 1, 3, 3])
Test is passed!


### Encoder

In [23]:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 2)

    def forward(self, x, mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayers[1](x, self.feed_forward)

class Encoder(nn.Module):
    def __init__(self, config):
        super(Encoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['src_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_src_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.encoder_layer = EncoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.encoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, mask)
        return self.layer_norm(x)

### Decoder

In [24]:
class DecoderLayer(nn.Module):
    """
    Decoder is made of 3 sublayers: self attention, encoder-decoder attention
    and feed forward"
    """

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(DecoderLayer, self).__init__()

        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.encdec_attn = MultiHeadAttention(n_heads, hidden_size,
                                              dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 3)

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, decoder_mask))
        x = self.sublayers[1](x, lambda x: self.encdec_attn(x, encoder_output,
                                                            encoder_output, encoder_mask))
        return self.sublayers[2](x, self.feed_forward)

class Decoder(nn.Module):
    def __init__(self, config):
        super(Decoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['tgt_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_tgt_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.decoder_layer = DecoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.decoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, encoder_output, encoder_mask, decoder_mask)
        return self.layer_norm(x)

### Transformer

In [25]:
class Transformer(nn.Module):
    def __init__(self, config):
        super(Transformer, self).__init__()
        self.config = config
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)
        self.proj = nn.Linear(config['hidden_size'], config['tgt_vocab_size'])

        self.pad_idx = config['pad_idx']
        self.tgt_vocab_size = config['tgt_vocab_size']

    def encode(self, encoder_input, encoder_input_mask):
        return self.encoder(encoder_input, encoder_input_mask)

    def decode(self, encoder_output, encoder_input_mask, decoder_input, decoder_input_mask):
        return self.decoder(decoder_input, encoder_output, encoder_input_mask, decoder_input_mask)

    def linear_project(self, x):
        return self.proj(x)

    def forward(self, encoder_input, decoder_input):
        encoder_input_mask = padding_mask(encoder_input, pad_idx=self.config['pad_idx'])
        decoder_input_mask = compositional_mask(decoder_input, pad_idx=self.config['pad_idx'])
        encoder_output = self.encode(encoder_input, encoder_input_mask)
        decoder_output = self.decode(encoder_output, encoder_input_mask,
                                     decoder_input, decoder_input_mask)
        output_logits = self.linear_project(decoder_output)
        return output_logits


def prepare_model(config):
    model = Transformer(config)

    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

####  LrScheduler

The last thing you have to prepare is the class  `LrScheduler`, which is in charge of  learning rate updating after every step of the optimizer. You are required to fill the class constructor and the method `learning_rate`. The preferable stratagy of updating the learning rate (lr), is the following two stages:

* "warmup" stage - lr linearly increases until the defined value during the fixed number of steps (the proportion of all training steps - the parameter `train_config['warmup\_steps\_part']` in the train function). 
* "decrease" stage - lr linearly decreases until 0 during the left training steps.

`learning_rate()` call should return the value of  lr at this step,  which number is stored at self.step. The class constructor takes not only `warmup_steps_part` but the peak learning rate value `lr_peak` at the end of "warmup" stage and a string name of the strategy of learning rate scheduling. You can test other strategies if you want to with `self.type attribute`. 

Correctness check: `test_lr_scheduler()`


In [26]:
class LrScheduler:
    def __init__(self, n_steps, **kwargs):
        self.type = kwargs['type']
        self._step = 0
        self._lr = 0
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
            self.n_steps = n_steps
            self.n_warmup_steps = n_steps*kwargs['warmup_steps_part']
            self.lr_peak = kwargs['lr_peak']
            self.lr_warmup_scale = (self.lr_peak-self._lr)/self.n_warmup_steps
            self.lr_decrease_scale = (self.lr_peak/(self.n_steps-self.n_warmup_steps))
        else:
            raise ValueError(f'Unknown type argument: {self.type}')

    def step(self, optimizer):
        self._step += 1
        lr = self.learning_rate()
        for p in optimizer.param_groups:
            p['lr'] = lr

    def learning_rate(self, step=None):
        if step is None:
            step = self._step
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
            if step<=self.n_warmup_steps:
              self._lr = step*self.lr_warmup_scale
            else:
              self._lr = self.lr_peak-self.lr_decrease_scale*(step-self.n_warmup_steps)
        return self._lr

    def state_dict(self):
        sd = copy.deepcopy(self.__dict__)
        return sd

    def load_state_dict(self, sd):
        for k in sd.keys():
            self.__setattr__(k, sd[k])

In [27]:
def test_lr_scheduler():
    lrs_type = 'warmup,decay_linear'
    warmup_steps_part =  0.1
    lr_peak = 3e-4
    sch = LrScheduler(100, type=lrs_type, warmup_steps_part=warmup_steps_part,
                      lr_peak=lr_peak)
    assert sch.learning_rate(step=5) - 15e-5 < 1e-6
    assert sch.learning_rate(step=10) - 3e-4 < 1e-6
    assert sch.learning_rate(step=50) - 166e-6 < 1e-6
    assert sch.learning_rate(step=100) - 0. < 1e-6
    print('Test is passed!')

In [28]:
test_lr_scheduler()

Test is passed!


### Run and translate

In [64]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))


def run_epoch(data_iter, model, lr_scheduler, optimizer, device, verbose=False):
    start = time.time()
    local_start = start
    total_tokens = 0
    total_loss = 0
    tokens = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')
    for i, batch in tqdm(enumerate(data_iter)):
        encoder_input = batch[0].to(device)
        decoder_input = batch[1].to(device)
        decoder_target = batch[2].to(device)
        logits = model(encoder_input, decoder_input)
        loss = loss_fn(logits.view(-1, model.tgt_vocab_size),
                       decoder_target.view(-1))
        total_loss += loss.item()
        batch_n_tokens = (decoder_target != model.pad_idx).sum().item()
        total_tokens += batch_n_tokens
        if optimizer is not None:
            optimizer.zero_grad()
            lr_scheduler.step(optimizer)
            loss.backward()
            optimizer.step()

        tokens += batch_n_tokens
        if verbose and i % 1000 == 1:
            elapsed = time.time() - local_start
            print("batch number: %d, accumulated average loss: %f, tokens per second: %f" %
                  (i, total_loss / total_tokens, tokens / elapsed))
            local_start = time.time()
            tokens = 0

    average_loss = total_loss / total_tokens
    print('** End of epoch, accumulated average loss = %f **' % average_loss)
    epoch_elapsed_time = format_time(time.time() - start)
    print(f'** Elapsed time: {epoch_elapsed_time}**')
    return average_loss


def save_checkpoint(epoch, model, lr_scheduler, optimizer, model_dir_path):
    save_path = os.path.join(model_dir_path, f'cpkt_{epoch}_epoch')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict': lr_scheduler.state_dict()
    }, save_path)
    print(f'Saved checkpoint to {save_path}')

def load_model(epoch, model_dir_path):
    save_path = os.path.join(model_dir_path, f'cpkt_{epoch}_epoch')
    checkpoint = torch.load(save_path)
    with open(os.path.join(model_dir_path, 'model_config.json'), 'r', encoding='utf-8') as rf:
        model_config = json.load(rf)
    model = prepare_model(model_config)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model

def greedy_decode(model, device, encoder_input, max_len, start_symbol):
    batch_size = encoder_input.size()[0]
    decoder_input = torch.ones(batch_size, 1).fill_(start_symbol).type_as(encoder_input.data).to(device)

    for i in range(max_len):
        logits = model(encoder_input, decoder_input)

        _, predicted_ids = torch.max(logits, dim=-1)
        next_word = predicted_ids[:, i]
        # print(next_word)
        rest = torch.ones(batch_size, 1).type_as(decoder_input.data)
        # print(rest[:,0].size(), next_word.size())
        rest[:, 0] = next_word
        decoder_input = torch.cat([decoder_input, rest], dim=1).to(device)
        # print(decoder_input)
    return decoder_input

def generate_predictions(dataloader, max_decoding_len, text_encoder, model, device):
    # print(f'Max decoding length = {max_decoding_len}')
    model.eval()
    predictions = []
    start_token_id = text_encoder.service_vocabs['token2id'][
        text_encoder.service_token_names['start_token']]
    with torch.no_grad():
        for batch in tqdm(dataloader):
            encoder_input = batch[0].to(device)
            prediction_tensor = greedy_decode(model, device, encoder_input, max_decoding_len,
                              start_token_id)

            predictions.extend([''.join(e) for e in text_encoder.id2token(prediction_tensor.cpu().numpy(),
                                                                          unframe=True, lang_key='ru')])
    return np.array(predictions)


def train(source_strings, target_strings):
    '''Common training cycle for final run (fixed hyperparameters,
    no evaluation during training)'''
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    train_df = pd.DataFrame({'en': source_strings, 'ru': target_strings})
    text_encoder = TextEncoder()
    text_encoder.make_vocabs(train_df)

    model_config = {
        'src_vocab_size': text_encoder.src_vocab_size,
        'tgt_vocab_size': text_encoder.tgt_vocab_size,
        'max_src_seq_length': max(train_df['en'].aggregate(len)) + 2, #including start_token and end_token
        'max_tgt_seq_length': max(train_df['ru'].aggregate(len)) + 2,
        'n_layers': 2,
        'n_heads': 2,
        'hidden_size': 128,
        'ff_hidden_size': 256,
        'dropout': {
            'embedding': 0.1,
            'attention': 0.1,
            'residual': 0.1,
            'relu': 0.1
        },
        'pad_idx': 0
    }
    model = prepare_model(model_config)
    model.to(device)

    train_config = {'batch_size': 200, 'n_epochs': 600, 'lr_scheduler': {
        'type': 'warmup,decay_linear',
        'warmup_steps_part': 0.1,
        'lr_peak': 3e-4,
    }}

    #Model training procedure
    optimizer = torch.optim.Adam(model.parameters(), lr=0.)
    n_steps = (len(train_df) // train_config['batch_size'] + 1) * train_config['n_epochs']
    lr_scheduler = LrScheduler(n_steps, **train_config['lr_scheduler'])

    # prepare train data
    source_strings, target_strings = zip(*sorted(zip(source_strings, target_strings),
                                                 key=lambda e: len(e[0])))
    train_dataloader = create_dataloader(source_strings, target_strings, text_encoder,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    # training cycle
    for epoch in range(1,train_config['n_epochs']+1):
        print('\n' + '-'*40)
        print(f'Epoch: {epoch}')
        print(f'Run training...')
        model.train()
        run_epoch(train_dataloader, model,
                  lr_scheduler, optimizer, device=device, verbose=False)
        if epoch%25==0:
          model_dir_path = './drive/MyDrive/dl4nlp'
          save_checkpoint(epoch, model, lr_scheduler, optimizer, model_dir_path)
    learnable_params = {
        'model': model,
        'text_encoder': text_encoder,
    }
    return learnable_params

def classify(source_strings, learnable_params):
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    model = learnable_params['model']
    text_encoder = learnable_params['text_encoder']
    batch_size = 200
    dataloader = create_dataloader(source_strings, None, text_encoder,
                                   batch_size, shuffle_batches_each_epoch=False)
    max_decoding_len = model.config['max_tgt_seq_length']
    predictions = generate_predictions(dataloader, max_decoding_len, text_encoder, model, device)
    #return single top1 prediction for each sample
    return np.expand_dims(predictions, 1)

### Training

In [30]:
PREDS_FNAME = "preds_translit.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [None]:
top_k = 1
part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)
train_ids, train_strings, train_transliterations = part2ixy['train']
print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
st = time.time()
params = train(train_strings, train_transliterations)
print('Classifier trained in %.2fs' % (time.time() - st))


Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:19, 27.37it/s]


** End of epoch, accumulated average loss = 5.209693 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 31.09it/s]


** End of epoch, accumulated average loss = 3.987401 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 3
Run training...


527it [00:18, 28.55it/s]


** End of epoch, accumulated average loss = 3.466477 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 4
Run training...


527it [00:17, 30.60it/s]


** End of epoch, accumulated average loss = 3.112131 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:17, 30.97it/s]


** End of epoch, accumulated average loss = 2.946622 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 6
Run training...


527it [00:16, 31.12it/s]


** End of epoch, accumulated average loss = 2.827832 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 7
Run training...


527it [00:18, 29.22it/s]


** End of epoch, accumulated average loss = 2.588434 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 8
Run training...


527it [00:16, 31.21it/s]


** End of epoch, accumulated average loss = 2.152795 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 9
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 1.602532 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 10
Run training...


527it [00:18, 27.85it/s]


** End of epoch, accumulated average loss = 1.314558 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 11
Run training...


527it [00:18, 28.06it/s]


** End of epoch, accumulated average loss = 1.161248 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 12
Run training...


527it [00:18, 29.10it/s]


** End of epoch, accumulated average loss = 1.058484 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 13
Run training...


527it [00:23, 22.73it/s]


** End of epoch, accumulated average loss = 0.979223 **
** Elapsed time: 0:00:23**

----------------------------------------
Epoch: 14
Run training...


527it [00:17, 29.63it/s]


** End of epoch, accumulated average loss = 0.915104 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 15
Run training...


527it [00:17, 30.25it/s]


** End of epoch, accumulated average loss = 0.858459 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 16
Run training...


527it [00:18, 28.26it/s]


** End of epoch, accumulated average loss = 0.810182 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 17
Run training...


527it [00:17, 30.26it/s]


** End of epoch, accumulated average loss = 0.767402 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 18
Run training...


527it [00:17, 29.97it/s]


** End of epoch, accumulated average loss = 0.730964 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 19
Run training...


527it [00:18, 29.00it/s]


** End of epoch, accumulated average loss = 0.699238 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 20
Run training...


527it [00:17, 29.62it/s]


** End of epoch, accumulated average loss = 0.667975 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 21
Run training...


527it [00:17, 30.75it/s]


** End of epoch, accumulated average loss = 0.641393 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 22
Run training...


527it [00:17, 30.36it/s]


** End of epoch, accumulated average loss = 0.619097 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 23
Run training...


527it [00:18, 28.83it/s]


** End of epoch, accumulated average loss = 0.596595 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 24
Run training...


527it [00:17, 30.67it/s]


** End of epoch, accumulated average loss = 0.576476 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 25
Run training...


527it [00:17, 30.83it/s]


** End of epoch, accumulated average loss = 0.557620 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_25_epoch

----------------------------------------
Epoch: 26
Run training...


527it [00:17, 29.40it/s]


** End of epoch, accumulated average loss = 0.541466 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 27
Run training...


527it [00:18, 28.16it/s]


** End of epoch, accumulated average loss = 0.526486 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 28
Run training...


527it [00:17, 30.29it/s]


** End of epoch, accumulated average loss = 0.511852 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 29
Run training...


527it [00:17, 30.20it/s]


** End of epoch, accumulated average loss = 0.497252 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 30
Run training...


527it [00:18, 27.96it/s]


** End of epoch, accumulated average loss = 0.483783 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 31
Run training...


527it [00:17, 30.60it/s]


** End of epoch, accumulated average loss = 0.471953 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 32
Run training...


527it [00:17, 29.78it/s]


** End of epoch, accumulated average loss = 0.460666 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 33
Run training...


527it [00:18, 29.22it/s]


** End of epoch, accumulated average loss = 0.450405 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 34
Run training...


527it [00:17, 29.71it/s]


** End of epoch, accumulated average loss = 0.441796 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 35
Run training...


527it [00:16, 31.05it/s]


** End of epoch, accumulated average loss = 0.433153 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 36
Run training...


527it [00:17, 30.67it/s]


** End of epoch, accumulated average loss = 0.426533 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 37
Run training...


527it [00:18, 29.26it/s]


** End of epoch, accumulated average loss = 0.419123 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 38
Run training...


527it [00:17, 30.23it/s]


** End of epoch, accumulated average loss = 0.413944 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 39
Run training...


527it [00:17, 30.62it/s]


** End of epoch, accumulated average loss = 0.407292 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 40
Run training...


527it [00:16, 31.17it/s]


** End of epoch, accumulated average loss = 0.402910 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 41
Run training...


527it [00:18, 28.92it/s]


** End of epoch, accumulated average loss = 0.396893 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 42
Run training...


527it [00:16, 31.08it/s]


** End of epoch, accumulated average loss = 0.392298 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 43
Run training...


527it [00:16, 31.04it/s]


** End of epoch, accumulated average loss = 0.389367 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 44
Run training...


527it [00:17, 30.62it/s]


** End of epoch, accumulated average loss = 0.384032 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 45
Run training...


527it [00:18, 28.77it/s]


** End of epoch, accumulated average loss = 0.379798 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 46
Run training...


527it [00:16, 31.15it/s]


** End of epoch, accumulated average loss = 0.377113 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 47
Run training...


527it [00:16, 31.04it/s]


** End of epoch, accumulated average loss = 0.372775 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 48
Run training...


527it [00:16, 31.35it/s]


** End of epoch, accumulated average loss = 0.369833 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 49
Run training...


527it [00:17, 29.33it/s]


** End of epoch, accumulated average loss = 0.365047 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 50
Run training...


527it [00:16, 31.16it/s]


** End of epoch, accumulated average loss = 0.362482 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_50_epoch

----------------------------------------
Epoch: 51
Run training...


527it [00:16, 31.44it/s]


** End of epoch, accumulated average loss = 0.360843 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 52
Run training...


527it [00:16, 31.19it/s]


** End of epoch, accumulated average loss = 0.356087 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 53
Run training...


527it [00:18, 28.88it/s]


** End of epoch, accumulated average loss = 0.353203 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 54
Run training...


527it [00:16, 31.28it/s]


** End of epoch, accumulated average loss = 0.350936 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 55
Run training...


527it [00:16, 31.16it/s]


** End of epoch, accumulated average loss = 0.348182 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 56
Run training...


527it [00:16, 31.45it/s]


** End of epoch, accumulated average loss = 0.346032 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 57
Run training...


527it [00:17, 29.80it/s]


** End of epoch, accumulated average loss = 0.343981 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 58
Run training...


527it [00:17, 30.42it/s]


** End of epoch, accumulated average loss = 0.341499 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 59
Run training...


527it [00:16, 31.75it/s]


** End of epoch, accumulated average loss = 0.338441 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 60
Run training...


527it [00:16, 31.20it/s]


** End of epoch, accumulated average loss = 0.337507 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 61
Run training...


527it [00:17, 29.79it/s]


** End of epoch, accumulated average loss = 0.333452 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 62
Run training...


527it [00:17, 30.90it/s]


** End of epoch, accumulated average loss = 0.331808 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 63
Run training...


527it [00:16, 31.75it/s]


** End of epoch, accumulated average loss = 0.329511 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 64
Run training...


527it [00:16, 31.64it/s]


** End of epoch, accumulated average loss = 0.328256 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 65
Run training...


527it [00:17, 30.23it/s]


** End of epoch, accumulated average loss = 0.325749 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 66
Run training...


527it [00:17, 30.36it/s]


** End of epoch, accumulated average loss = 0.323884 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 67
Run training...


527it [00:16, 31.26it/s]


** End of epoch, accumulated average loss = 0.321222 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 68
Run training...


527it [00:16, 31.52it/s]


** End of epoch, accumulated average loss = 0.320010 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 69
Run training...


527it [00:17, 29.61it/s]


** End of epoch, accumulated average loss = 0.318151 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 70
Run training...


527it [00:17, 29.32it/s]


** End of epoch, accumulated average loss = 0.316301 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 71
Run training...


527it [00:17, 30.22it/s]


** End of epoch, accumulated average loss = 0.314489 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 72
Run training...


527it [00:16, 31.02it/s]


** End of epoch, accumulated average loss = 0.313062 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 73
Run training...


527it [00:17, 30.16it/s]


** End of epoch, accumulated average loss = 0.311159 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 74
Run training...


527it [00:17, 30.73it/s]


** End of epoch, accumulated average loss = 0.309722 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 75
Run training...


527it [00:17, 30.94it/s]


** End of epoch, accumulated average loss = 0.307439 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_75_epoch

----------------------------------------
Epoch: 76
Run training...


527it [00:16, 31.63it/s]


** End of epoch, accumulated average loss = 0.306393 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 77
Run training...


527it [00:17, 29.85it/s]


** End of epoch, accumulated average loss = 0.306841 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 78
Run training...


527it [00:17, 29.49it/s]


** End of epoch, accumulated average loss = 0.304480 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 79
Run training...


527it [00:16, 31.41it/s]


** End of epoch, accumulated average loss = 0.302187 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 80
Run training...


527it [00:17, 30.92it/s]


** End of epoch, accumulated average loss = 0.301628 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 81
Run training...


527it [00:18, 28.81it/s]


** End of epoch, accumulated average loss = 0.300673 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 82
Run training...


527it [00:16, 31.05it/s]


** End of epoch, accumulated average loss = 0.299178 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 83
Run training...


527it [00:16, 31.10it/s]


** End of epoch, accumulated average loss = 0.298966 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 84
Run training...


527it [00:16, 31.27it/s]


** End of epoch, accumulated average loss = 0.296753 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 85
Run training...


527it [00:18, 28.83it/s]


** End of epoch, accumulated average loss = 0.295995 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 86
Run training...


527it [00:17, 30.51it/s]


** End of epoch, accumulated average loss = 0.294645 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 87
Run training...


527it [00:17, 30.84it/s]


** End of epoch, accumulated average loss = 0.293137 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 88
Run training...


527it [00:16, 31.33it/s]


** End of epoch, accumulated average loss = 0.292806 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 89
Run training...


527it [00:17, 30.03it/s]


** End of epoch, accumulated average loss = 0.291497 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 90
Run training...


527it [00:16, 32.35it/s]


** End of epoch, accumulated average loss = 0.290572 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 91
Run training...


527it [00:16, 31.64it/s]


** End of epoch, accumulated average loss = 0.290067 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 92
Run training...


527it [00:16, 32.48it/s]


** End of epoch, accumulated average loss = 0.288702 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 93
Run training...


527it [00:16, 31.42it/s]


** End of epoch, accumulated average loss = 0.288246 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 94
Run training...


527it [00:16, 31.48it/s]


** End of epoch, accumulated average loss = 0.286917 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 95
Run training...


527it [00:16, 32.41it/s]


** End of epoch, accumulated average loss = 0.285678 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 96
Run training...


527it [00:16, 31.97it/s]


** End of epoch, accumulated average loss = 0.285608 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 97
Run training...


527it [00:16, 32.54it/s]


** End of epoch, accumulated average loss = 0.284674 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 98
Run training...


527it [00:16, 31.39it/s]


** End of epoch, accumulated average loss = 0.283264 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 99
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.282524 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 100
Run training...


527it [00:16, 32.79it/s]


** End of epoch, accumulated average loss = 0.282109 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_100_epoch

----------------------------------------
Epoch: 101
Run training...


527it [00:15, 33.34it/s]


** End of epoch, accumulated average loss = 0.280630 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 102
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.280897 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 103
Run training...


527it [00:16, 32.50it/s]


** End of epoch, accumulated average loss = 0.280343 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 104
Run training...


527it [00:16, 31.65it/s]


** End of epoch, accumulated average loss = 0.278905 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 105
Run training...


527it [00:15, 33.12it/s]


** End of epoch, accumulated average loss = 0.279466 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 106
Run training...


527it [00:15, 33.51it/s]


** End of epoch, accumulated average loss = 0.278254 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 107
Run training...


527it [00:15, 33.52it/s]


** End of epoch, accumulated average loss = 0.277167 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 108
Run training...


527it [00:15, 33.47it/s]


** End of epoch, accumulated average loss = 0.277071 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 109
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.275471 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 110
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.274779 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 111
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.273989 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 112
Run training...


527it [00:16, 32.64it/s]


** End of epoch, accumulated average loss = 0.273672 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 113
Run training...


527it [00:16, 32.74it/s]


** End of epoch, accumulated average loss = 0.273001 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 114
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.273307 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 115
Run training...


527it [00:16, 31.46it/s]


** End of epoch, accumulated average loss = 0.272130 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 116
Run training...


527it [00:15, 33.77it/s]


** End of epoch, accumulated average loss = 0.271142 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 117
Run training...


527it [00:15, 34.07it/s]


** End of epoch, accumulated average loss = 0.271106 **
** Elapsed time: 0:00:15**

----------------------------------------
Epoch: 118
Run training...


527it [00:15, 33.89it/s]


** End of epoch, accumulated average loss = 0.270526 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 119
Run training...


527it [00:15, 33.28it/s]


** End of epoch, accumulated average loss = 0.269618 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 120
Run training...


527it [00:15, 33.52it/s]


** End of epoch, accumulated average loss = 0.269554 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 121
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.268850 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 122
Run training...


527it [00:15, 33.29it/s]


** End of epoch, accumulated average loss = 0.267792 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 123
Run training...


527it [00:15, 33.92it/s]


** End of epoch, accumulated average loss = 0.267956 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 124
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.267236 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 125
Run training...


527it [00:15, 33.80it/s]


** End of epoch, accumulated average loss = 0.267114 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_125_epoch

----------------------------------------
Epoch: 126
Run training...


527it [00:16, 32.91it/s]


** End of epoch, accumulated average loss = 0.266189 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 127
Run training...


527it [00:16, 31.71it/s]


** End of epoch, accumulated average loss = 0.265866 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 128
Run training...


527it [00:15, 33.84it/s]


** End of epoch, accumulated average loss = 0.265646 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 129
Run training...


527it [00:15, 33.94it/s]


** End of epoch, accumulated average loss = 0.265125 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 130
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.264605 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 131
Run training...


527it [00:15, 33.94it/s]


** End of epoch, accumulated average loss = 0.263673 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 132
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 0.264283 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 133
Run training...


527it [00:16, 31.73it/s]


** End of epoch, accumulated average loss = 0.263477 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 134
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.263034 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 135
Run training...


527it [00:15, 33.46it/s]


** End of epoch, accumulated average loss = 0.262866 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 136
Run training...


527it [00:15, 33.31it/s]


** End of epoch, accumulated average loss = 0.262054 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 137
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.261361 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 138
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.260746 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 139
Run training...


527it [00:16, 32.36it/s]


** End of epoch, accumulated average loss = 0.260299 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 140
Run training...


527it [00:15, 33.09it/s]


** End of epoch, accumulated average loss = 0.260753 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 141
Run training...


527it [00:15, 33.19it/s]


** End of epoch, accumulated average loss = 0.259423 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 142
Run training...


527it [00:15, 33.42it/s]


** End of epoch, accumulated average loss = 0.259997 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 143
Run training...


527it [00:15, 33.83it/s]


** End of epoch, accumulated average loss = 0.258867 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 144
Run training...


527it [00:16, 32.32it/s]


** End of epoch, accumulated average loss = 0.259222 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 145
Run training...


527it [00:16, 31.67it/s]


** End of epoch, accumulated average loss = 0.257829 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 146
Run training...


527it [00:15, 33.19it/s]


** End of epoch, accumulated average loss = 0.257978 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 147
Run training...


527it [00:15, 34.09it/s]


** End of epoch, accumulated average loss = 0.257043 **
** Elapsed time: 0:00:15**

----------------------------------------
Epoch: 148
Run training...


527it [00:15, 33.88it/s]


** End of epoch, accumulated average loss = 0.257199 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 149
Run training...


527it [00:15, 33.28it/s]


** End of epoch, accumulated average loss = 0.256413 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 150
Run training...


527it [00:16, 32.19it/s]


** End of epoch, accumulated average loss = 0.256418 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_150_epoch

----------------------------------------
Epoch: 151
Run training...


527it [00:16, 32.13it/s]


** End of epoch, accumulated average loss = 0.256148 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 152
Run training...


527it [00:15, 33.85it/s]


** End of epoch, accumulated average loss = 0.256320 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 153
Run training...


527it [00:15, 33.49it/s]


** End of epoch, accumulated average loss = 0.256329 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 154
Run training...


527it [00:15, 33.45it/s]


** End of epoch, accumulated average loss = 0.255014 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 155
Run training...


527it [00:15, 33.59it/s]


** End of epoch, accumulated average loss = 0.255322 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 156
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.254348 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 157
Run training...


527it [00:16, 32.34it/s]


** End of epoch, accumulated average loss = 0.253594 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 158
Run training...


527it [00:16, 32.51it/s]


** End of epoch, accumulated average loss = 0.253495 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 159
Run training...


527it [00:17, 30.83it/s]


** End of epoch, accumulated average loss = 0.252816 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 160
Run training...


527it [00:17, 30.00it/s]


** End of epoch, accumulated average loss = 0.253587 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 161
Run training...


527it [00:18, 29.16it/s]


** End of epoch, accumulated average loss = 0.253153 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 162
Run training...


527it [00:16, 32.40it/s]


** End of epoch, accumulated average loss = 0.253334 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 163
Run training...


527it [00:16, 32.35it/s]


** End of epoch, accumulated average loss = 0.252196 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 164
Run training...


527it [00:16, 32.09it/s]


** End of epoch, accumulated average loss = 0.251448 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 165
Run training...


527it [00:17, 30.91it/s]


** End of epoch, accumulated average loss = 0.251552 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 166
Run training...


527it [00:16, 31.15it/s]


** End of epoch, accumulated average loss = 0.251067 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 167
Run training...


527it [00:15, 33.01it/s]


** End of epoch, accumulated average loss = 0.251246 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 168
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.250934 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 169
Run training...


527it [00:15, 33.08it/s]


** End of epoch, accumulated average loss = 0.249898 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 170
Run training...


527it [00:16, 31.51it/s]


** End of epoch, accumulated average loss = 0.250287 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 171
Run training...


527it [00:16, 31.07it/s]


** End of epoch, accumulated average loss = 0.249938 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 172
Run training...


527it [00:15, 33.26it/s]


** End of epoch, accumulated average loss = 0.249696 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 173
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.249725 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 174
Run training...


527it [00:15, 33.01it/s]


** End of epoch, accumulated average loss = 0.249619 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 175
Run training...


527it [00:16, 32.50it/s]


** End of epoch, accumulated average loss = 0.248508 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_175_epoch

----------------------------------------
Epoch: 176
Run training...


527it [00:17, 30.52it/s]


** End of epoch, accumulated average loss = 0.248484 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 177
Run training...


527it [00:16, 32.30it/s]


** End of epoch, accumulated average loss = 0.247894 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 178
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.248136 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 179
Run training...


527it [00:16, 32.05it/s]


** End of epoch, accumulated average loss = 0.247563 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 180
Run training...


527it [00:16, 32.27it/s]


** End of epoch, accumulated average loss = 0.247461 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 181
Run training...


527it [00:17, 30.67it/s]


** End of epoch, accumulated average loss = 0.247779 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 182
Run training...


527it [00:16, 32.49it/s]


** End of epoch, accumulated average loss = 0.246672 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 183
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.246528 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 184
Run training...


527it [00:16, 31.24it/s]


** End of epoch, accumulated average loss = 0.246219 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 185
Run training...


527it [00:17, 30.65it/s]


** End of epoch, accumulated average loss = 0.246108 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 186
Run training...


527it [00:16, 31.45it/s]


** End of epoch, accumulated average loss = 0.245985 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 187
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.245366 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 188
Run training...


527it [00:16, 32.36it/s]


** End of epoch, accumulated average loss = 0.244880 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 189
Run training...


527it [00:15, 33.02it/s]


** End of epoch, accumulated average loss = 0.245069 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 190
Run training...


527it [00:16, 31.93it/s]


** End of epoch, accumulated average loss = 0.245249 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 191
Run training...


527it [00:16, 31.28it/s]


** End of epoch, accumulated average loss = 0.244640 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 192
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.243726 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 193
Run training...


527it [00:16, 32.90it/s]


** End of epoch, accumulated average loss = 0.243582 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 194
Run training...


527it [00:15, 33.28it/s]


** End of epoch, accumulated average loss = 0.243854 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 195
Run training...


527it [00:16, 32.34it/s]


** End of epoch, accumulated average loss = 0.243834 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 196
Run training...


527it [00:17, 30.75it/s]


** End of epoch, accumulated average loss = 0.243340 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 197
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.243314 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 198
Run training...


527it [00:16, 31.88it/s]


** End of epoch, accumulated average loss = 0.243007 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 199
Run training...


527it [00:16, 32.21it/s]


** End of epoch, accumulated average loss = 0.242837 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 200
Run training...


527it [00:16, 31.50it/s]


** End of epoch, accumulated average loss = 0.242482 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_200_epoch

----------------------------------------
Epoch: 201
Run training...


527it [00:17, 30.75it/s]


** End of epoch, accumulated average loss = 0.242336 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 202
Run training...


527it [00:16, 32.45it/s]


** End of epoch, accumulated average loss = 0.242177 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 203
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.241861 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 204
Run training...


527it [00:16, 32.56it/s]


** End of epoch, accumulated average loss = 0.241366 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 205
Run training...


527it [00:16, 32.56it/s]


** End of epoch, accumulated average loss = 0.241172 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 206
Run training...


527it [00:17, 30.79it/s]


** End of epoch, accumulated average loss = 0.241014 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 207
Run training...


527it [00:15, 33.07it/s]


** End of epoch, accumulated average loss = 0.241281 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 208
Run training...


527it [00:15, 33.21it/s]


** End of epoch, accumulated average loss = 0.240811 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 209
Run training...


527it [00:16, 32.27it/s]


** End of epoch, accumulated average loss = 0.240015 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 210
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.240611 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 211
Run training...


527it [00:17, 30.07it/s]


** End of epoch, accumulated average loss = 0.240453 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 212
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.239532 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 213
Run training...


527it [00:16, 32.35it/s]


** End of epoch, accumulated average loss = 0.239854 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 214
Run training...


527it [00:16, 32.56it/s]


** End of epoch, accumulated average loss = 0.239703 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 215
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.239719 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 216
Run training...


527it [00:16, 31.05it/s]


** End of epoch, accumulated average loss = 0.238508 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 217
Run training...


527it [00:16, 32.09it/s]


** End of epoch, accumulated average loss = 0.239215 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 218
Run training...


527it [00:16, 32.43it/s]


** End of epoch, accumulated average loss = 0.238859 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 219
Run training...


527it [00:16, 32.74it/s]


** End of epoch, accumulated average loss = 0.238718 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 220
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 0.238305 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 221
Run training...


527it [00:17, 30.78it/s]


** End of epoch, accumulated average loss = 0.237801 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 222
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.238272 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 223
Run training...


527it [00:16, 32.38it/s]


** End of epoch, accumulated average loss = 0.238318 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 224
Run training...


527it [00:16, 32.56it/s]


** End of epoch, accumulated average loss = 0.237084 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 225
Run training...


527it [00:16, 31.80it/s]


** End of epoch, accumulated average loss = 0.237538 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_225_epoch

----------------------------------------
Epoch: 226
Run training...


527it [00:17, 30.53it/s]


** End of epoch, accumulated average loss = 0.237698 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 227
Run training...


527it [00:15, 33.06it/s]


** End of epoch, accumulated average loss = 0.236633 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 228
Run training...


527it [00:16, 32.82it/s]


** End of epoch, accumulated average loss = 0.236664 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 229
Run training...


527it [00:16, 32.72it/s]


** End of epoch, accumulated average loss = 0.236956 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 230
Run training...


527it [00:16, 32.49it/s]


** End of epoch, accumulated average loss = 0.236771 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 231
Run training...


527it [00:17, 30.89it/s]


** End of epoch, accumulated average loss = 0.235780 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 232
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.236345 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 233
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.236401 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 234
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.235723 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 235
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.235602 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 236
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.235645 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 237
Run training...


527it [00:16, 31.48it/s]


** End of epoch, accumulated average loss = 0.235338 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 238
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 0.234717 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 239
Run training...


527it [00:16, 32.16it/s]


** End of epoch, accumulated average loss = 0.235076 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 240
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.234462 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 241
Run training...


527it [00:16, 31.71it/s]


** End of epoch, accumulated average loss = 0.234948 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 242
Run training...


527it [00:16, 31.98it/s]


** End of epoch, accumulated average loss = 0.233967 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 243
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 0.234372 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 244
Run training...


527it [00:16, 32.91it/s]


** End of epoch, accumulated average loss = 0.234356 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 245
Run training...


527it [00:16, 32.80it/s]


** End of epoch, accumulated average loss = 0.233400 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 246
Run training...


527it [00:16, 32.42it/s]


** End of epoch, accumulated average loss = 0.234025 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 247
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.233077 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 248
Run training...


527it [00:16, 32.64it/s]


** End of epoch, accumulated average loss = 0.233247 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 249
Run training...


527it [00:16, 32.76it/s]


** End of epoch, accumulated average loss = 0.233022 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 250
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.233459 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_250_epoch

----------------------------------------
Epoch: 251
Run training...


527it [00:15, 32.98it/s]


** End of epoch, accumulated average loss = 0.232976 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 252
Run training...


527it [00:17, 30.66it/s]


** End of epoch, accumulated average loss = 0.232635 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 253
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.232536 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 254
Run training...


527it [00:16, 32.26it/s]


** End of epoch, accumulated average loss = 0.232051 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 255
Run training...


527it [00:16, 32.53it/s]


** End of epoch, accumulated average loss = 0.232022 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 256
Run training...


527it [00:16, 32.76it/s]


** End of epoch, accumulated average loss = 0.231414 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 257
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.232021 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 258
Run training...


527it [00:16, 32.17it/s]


** End of epoch, accumulated average loss = 0.232476 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 259
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.231725 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 260
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.231870 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 261
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.230754 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 262
Run training...


527it [00:16, 32.25it/s]


** End of epoch, accumulated average loss = 0.230374 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 263
Run training...


527it [00:16, 31.30it/s]


** End of epoch, accumulated average loss = 0.230905 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 264
Run training...


527it [00:16, 32.76it/s]


** End of epoch, accumulated average loss = 0.230635 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 265
Run training...


527it [00:16, 32.84it/s]


** End of epoch, accumulated average loss = 0.231048 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 266
Run training...


527it [00:15, 33.08it/s]


** End of epoch, accumulated average loss = 0.230977 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 267
Run training...


527it [00:16, 32.75it/s]


** End of epoch, accumulated average loss = 0.230586 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 268
Run training...


527it [00:17, 30.70it/s]


** End of epoch, accumulated average loss = 0.230155 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 269
Run training...


527it [00:16, 32.16it/s]


** End of epoch, accumulated average loss = 0.230595 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 270
Run training...


527it [00:16, 32.00it/s]


** End of epoch, accumulated average loss = 0.229769 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 271
Run training...


527it [00:16, 32.41it/s]


** End of epoch, accumulated average loss = 0.229518 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 272
Run training...


527it [00:16, 32.93it/s]


** End of epoch, accumulated average loss = 0.229813 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 273
Run training...


527it [00:17, 30.74it/s]


** End of epoch, accumulated average loss = 0.229823 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 274
Run training...


527it [00:16, 31.66it/s]


** End of epoch, accumulated average loss = 0.229781 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 275
Run training...


527it [00:16, 31.97it/s]


** End of epoch, accumulated average loss = 0.229347 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_275_epoch

----------------------------------------
Epoch: 276
Run training...


527it [00:16, 32.76it/s]


** End of epoch, accumulated average loss = 0.228869 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 277
Run training...


527it [00:15, 33.31it/s]


** End of epoch, accumulated average loss = 0.228420 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 278
Run training...


527it [00:17, 30.76it/s]


** End of epoch, accumulated average loss = 0.228837 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 279
Run training...


527it [00:16, 32.25it/s]


** End of epoch, accumulated average loss = 0.228466 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 280
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.228476 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 281
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.228379 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 282
Run training...


527it [00:16, 32.69it/s]


** End of epoch, accumulated average loss = 0.227840 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 283
Run training...


527it [00:16, 31.55it/s]


** End of epoch, accumulated average loss = 0.227766 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 284
Run training...


527it [00:16, 31.52it/s]


** End of epoch, accumulated average loss = 0.227176 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 285
Run training...


527it [00:16, 32.46it/s]


** End of epoch, accumulated average loss = 0.227803 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 286
Run training...


527it [00:16, 32.45it/s]


** End of epoch, accumulated average loss = 0.228084 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 287
Run training...


527it [00:16, 32.74it/s]


** End of epoch, accumulated average loss = 0.227762 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 288
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.227687 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 289
Run training...


527it [00:16, 31.04it/s]


** End of epoch, accumulated average loss = 0.227152 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 290
Run training...


527it [00:16, 32.13it/s]


** End of epoch, accumulated average loss = 0.226684 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 291
Run training...


527it [00:16, 31.99it/s]


** End of epoch, accumulated average loss = 0.226597 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 292
Run training...


527it [00:16, 31.97it/s]


** End of epoch, accumulated average loss = 0.226002 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 293
Run training...


527it [00:17, 29.89it/s]


** End of epoch, accumulated average loss = 0.226675 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 294
Run training...


527it [00:16, 31.96it/s]


** End of epoch, accumulated average loss = 0.226947 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 295
Run training...


527it [00:16, 31.41it/s]


** End of epoch, accumulated average loss = 0.226421 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 296
Run training...


527it [00:16, 31.25it/s]


** End of epoch, accumulated average loss = 0.226344 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 297
Run training...


527it [00:18, 28.27it/s]


** End of epoch, accumulated average loss = 0.226034 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 298
Run training...


527it [00:17, 30.48it/s]


** End of epoch, accumulated average loss = 0.225951 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 299
Run training...


527it [00:17, 30.62it/s]


** End of epoch, accumulated average loss = 0.226306 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 300
Run training...


527it [00:16, 31.20it/s]


** End of epoch, accumulated average loss = 0.225942 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_300_epoch

----------------------------------------
Epoch: 301
Run training...


527it [00:17, 29.30it/s]


** End of epoch, accumulated average loss = 0.226112 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 302
Run training...


527it [00:16, 32.18it/s]


** End of epoch, accumulated average loss = 0.225328 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 303
Run training...


527it [00:16, 31.76it/s]


** End of epoch, accumulated average loss = 0.225954 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 304
Run training...


527it [00:16, 31.99it/s]


** End of epoch, accumulated average loss = 0.225043 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 305
Run training...


527it [00:17, 30.33it/s]


** End of epoch, accumulated average loss = 0.225355 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 306
Run training...


527it [00:16, 31.03it/s]


** End of epoch, accumulated average loss = 0.225041 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 307
Run training...


527it [00:16, 31.24it/s]


** End of epoch, accumulated average loss = 0.225534 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 308
Run training...


527it [00:16, 31.64it/s]


** End of epoch, accumulated average loss = 0.225548 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 309
Run training...


527it [00:16, 31.17it/s]


** End of epoch, accumulated average loss = 0.224248 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 310
Run training...


527it [00:17, 30.83it/s]


** End of epoch, accumulated average loss = 0.225109 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 311
Run training...


527it [00:16, 32.14it/s]


** End of epoch, accumulated average loss = 0.224536 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 312
Run training...


527it [00:16, 31.89it/s]


** End of epoch, accumulated average loss = 0.223871 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 313
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.224254 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 314
Run training...


527it [00:16, 32.38it/s]


** End of epoch, accumulated average loss = 0.223645 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 315
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.224536 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 316
Run training...


527it [00:15, 32.95it/s]


** End of epoch, accumulated average loss = 0.223984 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 317
Run training...


527it [00:16, 32.73it/s]


** End of epoch, accumulated average loss = 0.223620 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 318
Run training...


527it [00:16, 31.92it/s]


** End of epoch, accumulated average loss = 0.223661 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 319
Run training...


527it [00:17, 30.68it/s]


** End of epoch, accumulated average loss = 0.223600 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 320
Run training...


527it [00:17, 30.06it/s]


** End of epoch, accumulated average loss = 0.223392 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 321
Run training...


527it [00:15, 33.36it/s]


** End of epoch, accumulated average loss = 0.223347 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 322
Run training...


527it [00:16, 32.54it/s]


** End of epoch, accumulated average loss = 0.223281 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 323
Run training...


527it [00:16, 32.47it/s]


** End of epoch, accumulated average loss = 0.223277 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 324
Run training...


527it [00:17, 30.63it/s]


** End of epoch, accumulated average loss = 0.222758 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 325
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.223077 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_325_epoch

----------------------------------------
Epoch: 326
Run training...


527it [00:16, 32.89it/s]


** End of epoch, accumulated average loss = 0.222891 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 327
Run training...


527it [00:16, 32.75it/s]


** End of epoch, accumulated average loss = 0.221834 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 328
Run training...


527it [00:16, 31.45it/s]


** End of epoch, accumulated average loss = 0.222916 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 329
Run training...


527it [00:16, 31.37it/s]


** End of epoch, accumulated average loss = 0.222720 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 330
Run training...


527it [00:16, 32.11it/s]


** End of epoch, accumulated average loss = 0.222248 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 331
Run training...


527it [00:16, 32.25it/s]


** End of epoch, accumulated average loss = 0.222085 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 332
Run training...


527it [00:16, 32.46it/s]


** End of epoch, accumulated average loss = 0.221420 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 333
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.221744 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 334
Run training...


527it [00:17, 30.36it/s]


** End of epoch, accumulated average loss = 0.222278 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 335
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.221738 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 336
Run training...


527it [00:16, 32.35it/s]


** End of epoch, accumulated average loss = 0.222123 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 337
Run training...


527it [00:16, 32.63it/s]


** End of epoch, accumulated average loss = 0.221665 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 338
Run training...


527it [00:16, 32.66it/s]


** End of epoch, accumulated average loss = 0.221854 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 339
Run training...


527it [00:17, 30.94it/s]


** End of epoch, accumulated average loss = 0.220707 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 340
Run training...


527it [00:16, 31.71it/s]


** End of epoch, accumulated average loss = 0.220984 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 341
Run training...


527it [00:15, 32.97it/s]


** End of epoch, accumulated average loss = 0.220330 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 342
Run training...


527it [00:16, 32.77it/s]


** End of epoch, accumulated average loss = 0.220543 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 343
Run training...


527it [00:16, 32.48it/s]


** End of epoch, accumulated average loss = 0.220840 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 344
Run training...


527it [00:18, 29.07it/s]


** End of epoch, accumulated average loss = 0.220223 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 345
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.220429 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 346
Run training...


527it [00:16, 32.81it/s]


** End of epoch, accumulated average loss = 0.220234 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 347
Run training...


527it [00:16, 32.48it/s]


** End of epoch, accumulated average loss = 0.220867 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 348
Run training...


527it [00:16, 31.72it/s]


** End of epoch, accumulated average loss = 0.220628 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 349
Run training...


527it [00:17, 29.50it/s]


** End of epoch, accumulated average loss = 0.220015 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 350
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.220122 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_350_epoch

----------------------------------------
Epoch: 351
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.219661 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 352
Run training...


527it [00:15, 33.03it/s]


** End of epoch, accumulated average loss = 0.219808 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 353
Run training...


527it [00:16, 32.57it/s]


** End of epoch, accumulated average loss = 0.220153 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 354
Run training...


527it [00:17, 30.86it/s]


** End of epoch, accumulated average loss = 0.219692 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 355
Run training...


527it [00:16, 32.46it/s]


** End of epoch, accumulated average loss = 0.219632 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 356
Run training...


527it [00:16, 32.73it/s]


** End of epoch, accumulated average loss = 0.219090 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 357
Run training...


527it [00:16, 32.43it/s]


** End of epoch, accumulated average loss = 0.218885 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 358
Run training...


527it [00:16, 31.44it/s]


** End of epoch, accumulated average loss = 0.219063 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 359
Run training...


527it [00:17, 29.86it/s]


** End of epoch, accumulated average loss = 0.219059 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 360
Run training...


527it [00:16, 32.88it/s]


** End of epoch, accumulated average loss = 0.218831 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 361
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.219569 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 362
Run training...


527it [00:16, 32.44it/s]


** End of epoch, accumulated average loss = 0.219542 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 363
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.218672 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 364
Run training...


527it [00:17, 30.99it/s]


** End of epoch, accumulated average loss = 0.218227 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 365
Run training...


527it [00:15, 33.41it/s]


** End of epoch, accumulated average loss = 0.218565 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 366
Run training...


527it [00:16, 32.74it/s]


** End of epoch, accumulated average loss = 0.218886 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 367
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 0.217808 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 368
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.218930 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 369
Run training...


527it [00:16, 31.67it/s]


** End of epoch, accumulated average loss = 0.217558 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 370
Run training...


527it [00:16, 32.41it/s]


** End of epoch, accumulated average loss = 0.217629 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 371
Run training...


527it [00:15, 33.44it/s]


** End of epoch, accumulated average loss = 0.217976 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 372
Run training...


527it [00:16, 32.82it/s]


** End of epoch, accumulated average loss = 0.217774 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 373
Run training...


527it [00:15, 33.15it/s]


** End of epoch, accumulated average loss = 0.218234 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 374
Run training...


527it [00:16, 32.77it/s]


** End of epoch, accumulated average loss = 0.216996 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 375
Run training...


527it [00:17, 30.82it/s]


** End of epoch, accumulated average loss = 0.217540 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_375_epoch

----------------------------------------
Epoch: 376
Run training...


527it [00:16, 32.67it/s]


** End of epoch, accumulated average loss = 0.217394 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 377
Run training...


527it [00:16, 32.82it/s]


** End of epoch, accumulated average loss = 0.217515 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 378
Run training...


527it [00:16, 32.72it/s]


** End of epoch, accumulated average loss = 0.217465 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 379
Run training...


527it [00:16, 31.57it/s]


** End of epoch, accumulated average loss = 0.216746 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 380
Run training...


527it [00:17, 29.89it/s]


** End of epoch, accumulated average loss = 0.217100 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 381
Run training...


527it [00:16, 31.88it/s]


** End of epoch, accumulated average loss = 0.217082 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 382
Run training...


527it [00:16, 32.68it/s]


** End of epoch, accumulated average loss = 0.217008 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 383
Run training...


527it [00:16, 32.30it/s]


** End of epoch, accumulated average loss = 0.216840 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 384
Run training...


527it [00:16, 32.11it/s]


** End of epoch, accumulated average loss = 0.217172 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 385
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 0.216542 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 386
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 0.216956 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 387
Run training...


527it [00:16, 31.33it/s]


** End of epoch, accumulated average loss = 0.216562 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 388
Run training...


527it [00:16, 31.65it/s]


** End of epoch, accumulated average loss = 0.216294 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 389
Run training...


527it [00:18, 29.24it/s]


** End of epoch, accumulated average loss = 0.215680 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 390
Run training...


527it [00:18, 29.21it/s]


** End of epoch, accumulated average loss = 0.216704 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 391
Run training...


527it [00:17, 30.29it/s]


** End of epoch, accumulated average loss = 0.215884 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 392
Run training...


527it [00:17, 29.51it/s]


** End of epoch, accumulated average loss = 0.216191 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 393
Run training...


527it [00:17, 30.21it/s]


** End of epoch, accumulated average loss = 0.216330 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 394
Run training...


527it [00:17, 30.38it/s]


** End of epoch, accumulated average loss = 0.215373 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 395
Run training...


527it [00:16, 31.10it/s]


** End of epoch, accumulated average loss = 0.215647 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 396
Run training...


527it [00:17, 30.42it/s]


** End of epoch, accumulated average loss = 0.215637 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 397
Run training...


527it [00:17, 30.60it/s]


** End of epoch, accumulated average loss = 0.215942 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 398
Run training...


527it [00:16, 32.55it/s]


** End of epoch, accumulated average loss = 0.216172 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 399
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.215154 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 400
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.215142 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_400_epoch

----------------------------------------
Epoch: 401
Run training...


527it [00:16, 32.49it/s]


** End of epoch, accumulated average loss = 0.215346 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 402
Run training...


527it [00:17, 30.69it/s]


** End of epoch, accumulated average loss = 0.215270 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 403
Run training...


527it [00:16, 31.75it/s]


** End of epoch, accumulated average loss = 0.214670 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 404
Run training...


527it [00:16, 32.55it/s]


** End of epoch, accumulated average loss = 0.214446 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 405
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.215046 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 406
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.215402 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 407
Run training...


527it [00:16, 31.06it/s]


** End of epoch, accumulated average loss = 0.214681 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 408
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 0.215111 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 409
Run training...


527it [00:16, 32.68it/s]


** End of epoch, accumulated average loss = 0.214855 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 410
Run training...


527it [00:16, 32.67it/s]


** End of epoch, accumulated average loss = 0.214691 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 411
Run training...


527it [00:16, 32.36it/s]


** End of epoch, accumulated average loss = 0.214315 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 412
Run training...


527it [00:16, 31.17it/s]


** End of epoch, accumulated average loss = 0.214491 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 413
Run training...


527it [00:16, 31.38it/s]


** End of epoch, accumulated average loss = 0.214038 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 414
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 0.214504 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 415
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 0.213682 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 416
Run training...


527it [00:16, 32.88it/s]


** End of epoch, accumulated average loss = 0.213926 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 417
Run training...


527it [00:16, 31.87it/s]


** End of epoch, accumulated average loss = 0.214190 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 418
Run training...


527it [00:16, 31.62it/s]


** End of epoch, accumulated average loss = 0.213675 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 419
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.213873 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 420
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.213158 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 421
Run training...


527it [00:15, 33.20it/s]


** End of epoch, accumulated average loss = 0.213678 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 422
Run training...


527it [00:16, 32.72it/s]


** End of epoch, accumulated average loss = 0.213585 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 423
Run training...


527it [00:16, 31.35it/s]


** End of epoch, accumulated average loss = 0.212534 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 424
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.213151 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 425
Run training...


527it [00:15, 33.47it/s]


** End of epoch, accumulated average loss = 0.213232 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_425_epoch

----------------------------------------
Epoch: 426
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.213138 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 427
Run training...


527it [00:16, 31.51it/s]


** End of epoch, accumulated average loss = 0.213181 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 428
Run training...


527it [00:15, 33.61it/s]


** End of epoch, accumulated average loss = 0.212984 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 429
Run training...


527it [00:15, 33.76it/s]


** End of epoch, accumulated average loss = 0.213231 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 430
Run training...


527it [00:15, 33.46it/s]


** End of epoch, accumulated average loss = 0.213565 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 431
Run training...


527it [00:16, 32.45it/s]


** End of epoch, accumulated average loss = 0.212982 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 432
Run training...


527it [00:16, 32.23it/s]


** End of epoch, accumulated average loss = 0.212733 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 433
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.212221 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 434
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.212193 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 435
Run training...


527it [00:16, 32.48it/s]


** End of epoch, accumulated average loss = 0.212459 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 436
Run training...


527it [00:16, 31.30it/s]


** End of epoch, accumulated average loss = 0.211910 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 437
Run training...


527it [00:16, 32.17it/s]


** End of epoch, accumulated average loss = 0.212622 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 438
Run training...


527it [00:16, 32.19it/s]


** End of epoch, accumulated average loss = 0.212163 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 439
Run training...


527it [00:16, 31.46it/s]


** End of epoch, accumulated average loss = 0.212253 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 440
Run training...


527it [00:16, 31.81it/s]


** End of epoch, accumulated average loss = 0.211898 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 441
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.211893 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 442
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.212482 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 443
Run training...


527it [00:16, 31.67it/s]


** End of epoch, accumulated average loss = 0.211782 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 444
Run training...


527it [00:16, 31.73it/s]


** End of epoch, accumulated average loss = 0.211304 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 445
Run training...


527it [00:16, 32.42it/s]


** End of epoch, accumulated average loss = 0.211230 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 446
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.211294 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 447
Run training...


527it [00:16, 31.92it/s]


** End of epoch, accumulated average loss = 0.211813 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 448
Run training...


527it [00:16, 32.20it/s]


** End of epoch, accumulated average loss = 0.210772 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 449
Run training...


527it [00:16, 32.93it/s]


** End of epoch, accumulated average loss = 0.211347 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 450
Run training...


527it [00:16, 32.93it/s]


** End of epoch, accumulated average loss = 0.211421 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_450_epoch

----------------------------------------
Epoch: 451
Run training...


527it [00:16, 31.71it/s]


** End of epoch, accumulated average loss = 0.211585 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 452
Run training...


527it [00:16, 32.22it/s]


** End of epoch, accumulated average loss = 0.210750 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 453
Run training...


527it [00:15, 33.07it/s]


** End of epoch, accumulated average loss = 0.211004 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 454
Run training...


527it [00:15, 33.08it/s]


** End of epoch, accumulated average loss = 0.210416 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 455
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.210661 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 456
Run training...


527it [00:16, 31.19it/s]


** End of epoch, accumulated average loss = 0.210931 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 457
Run training...


527it [00:16, 32.65it/s]


** End of epoch, accumulated average loss = 0.210088 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 458
Run training...


527it [00:15, 33.42it/s]


** End of epoch, accumulated average loss = 0.210932 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 459
Run training...


527it [00:16, 32.42it/s]


** End of epoch, accumulated average loss = 0.210237 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 460
Run training...


527it [00:16, 31.69it/s]


** End of epoch, accumulated average loss = 0.210785 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 461
Run training...


527it [00:15, 33.03it/s]


** End of epoch, accumulated average loss = 0.210048 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 462
Run training...


527it [00:16, 32.75it/s]


** End of epoch, accumulated average loss = 0.209906 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 463
Run training...


527it [00:16, 32.07it/s]


** End of epoch, accumulated average loss = 0.210363 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 464
Run training...


527it [00:16, 31.53it/s]


** End of epoch, accumulated average loss = 0.210325 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 465
Run training...


527it [00:16, 32.58it/s]


** End of epoch, accumulated average loss = 0.210146 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 466
Run training...


527it [00:15, 33.21it/s]


** End of epoch, accumulated average loss = 0.210414 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 467
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.209879 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 468
Run training...


527it [00:17, 30.85it/s]


** End of epoch, accumulated average loss = 0.210394 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 469
Run training...


527it [00:15, 33.45it/s]


** End of epoch, accumulated average loss = 0.210356 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 470
Run training...


527it [00:16, 32.56it/s]


** End of epoch, accumulated average loss = 0.209613 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 471
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.209765 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 472
Run training...


527it [00:17, 30.95it/s]


** End of epoch, accumulated average loss = 0.209551 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 473
Run training...


527it [00:15, 33.04it/s]


** End of epoch, accumulated average loss = 0.209716 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 474
Run training...


527it [00:16, 32.80it/s]


** End of epoch, accumulated average loss = 0.209953 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 475
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.209510 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_475_epoch

----------------------------------------
Epoch: 476
Run training...


527it [00:16, 31.02it/s]


** End of epoch, accumulated average loss = 0.209434 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 477
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.208869 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 478
Run training...


527it [00:15, 33.10it/s]


** End of epoch, accumulated average loss = 0.209720 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 479
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.209039 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 480
Run training...


527it [00:16, 31.35it/s]


** End of epoch, accumulated average loss = 0.208757 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 481
Run training...


527it [00:16, 32.53it/s]


** End of epoch, accumulated average loss = 0.209011 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 482
Run training...


527it [00:15, 33.14it/s]


** End of epoch, accumulated average loss = 0.209394 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 483
Run training...


527it [00:16, 32.54it/s]


** End of epoch, accumulated average loss = 0.209276 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 484
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.209078 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 485
Run training...


527it [00:16, 32.44it/s]


** End of epoch, accumulated average loss = 0.208970 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 486
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.207848 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 487
Run training...


527it [00:15, 32.95it/s]


** End of epoch, accumulated average loss = 0.208397 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 488
Run training...


527it [00:16, 32.38it/s]


** End of epoch, accumulated average loss = 0.207838 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 489
Run training...


527it [00:16, 31.80it/s]


** End of epoch, accumulated average loss = 0.209400 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 490
Run training...


527it [00:15, 32.97it/s]


** End of epoch, accumulated average loss = 0.208107 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 491
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.208102 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 492
Run training...


527it [00:17, 30.43it/s]


** End of epoch, accumulated average loss = 0.208592 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 493
Run training...


527it [00:16, 32.44it/s]


** End of epoch, accumulated average loss = 0.208220 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 494
Run training...


527it [00:16, 32.40it/s]


** End of epoch, accumulated average loss = 0.208190 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 495
Run training...


527it [00:16, 32.61it/s]


** End of epoch, accumulated average loss = 0.207925 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 496
Run training...


527it [00:17, 30.26it/s]


** End of epoch, accumulated average loss = 0.207844 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 497
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.208031 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 498
Run training...


527it [00:15, 33.15it/s]


** End of epoch, accumulated average loss = 0.207952 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 499
Run training...


527it [00:16, 32.92it/s]


** End of epoch, accumulated average loss = 0.208106 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 500
Run training...


527it [00:17, 30.56it/s]


** End of epoch, accumulated average loss = 0.208421 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_500_epoch

----------------------------------------
Epoch: 501
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.207770 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 502
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.208059 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 503
Run training...


527it [00:16, 32.57it/s]


** End of epoch, accumulated average loss = 0.207624 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 504
Run training...


527it [00:17, 30.54it/s]


** End of epoch, accumulated average loss = 0.207759 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 505
Run training...


527it [00:16, 32.74it/s]


** End of epoch, accumulated average loss = 0.207252 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 506
Run training...


527it [00:16, 32.31it/s]


** End of epoch, accumulated average loss = 0.207159 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 507
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.207325 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 508
Run training...


527it [00:17, 30.85it/s]


** End of epoch, accumulated average loss = 0.207208 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 509
Run training...


527it [00:16, 32.42it/s]


** End of epoch, accumulated average loss = 0.207286 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 510
Run training...


527it [00:16, 32.75it/s]


** End of epoch, accumulated average loss = 0.206819 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 511
Run training...


527it [00:16, 32.21it/s]


** End of epoch, accumulated average loss = 0.207382 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 512
Run training...


527it [00:16, 31.29it/s]


** End of epoch, accumulated average loss = 0.206948 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 513
Run training...


527it [00:16, 32.49it/s]


** End of epoch, accumulated average loss = 0.206417 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 514
Run training...


527it [00:16, 31.65it/s]


** End of epoch, accumulated average loss = 0.206676 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 515
Run training...


527it [00:17, 30.30it/s]


** End of epoch, accumulated average loss = 0.206369 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 516
Run training...


527it [00:16, 31.52it/s]


** End of epoch, accumulated average loss = 0.206784 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 517
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.206941 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 518
Run training...


527it [00:16, 31.97it/s]


** End of epoch, accumulated average loss = 0.207230 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 519
Run training...


527it [00:17, 30.42it/s]


** End of epoch, accumulated average loss = 0.206567 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 520
Run training...


527it [00:16, 32.30it/s]


** End of epoch, accumulated average loss = 0.206175 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 521
Run training...


527it [00:16, 32.38it/s]


** End of epoch, accumulated average loss = 0.206144 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 522
Run training...


527it [00:16, 31.99it/s]


** End of epoch, accumulated average loss = 0.206402 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 523
Run training...


527it [00:17, 30.51it/s]


** End of epoch, accumulated average loss = 0.206949 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 524
Run training...


527it [00:16, 32.09it/s]


** End of epoch, accumulated average loss = 0.206032 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 525
Run training...


527it [00:16, 32.13it/s]


** End of epoch, accumulated average loss = 0.206419 **
** Elapsed time: 0:00:16**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_525_epoch

----------------------------------------
Epoch: 526
Run training...


527it [00:16, 31.20it/s]


** End of epoch, accumulated average loss = 0.206267 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 527
Run training...


527it [00:16, 31.44it/s]


** End of epoch, accumulated average loss = 0.205376 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 528
Run training...


527it [00:16, 32.75it/s]


** End of epoch, accumulated average loss = 0.206106 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 529
Run training...


527it [00:16, 32.52it/s]


** End of epoch, accumulated average loss = 0.206674 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 530
Run training...


527it [00:16, 31.42it/s]


** End of epoch, accumulated average loss = 0.206224 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 531
Run training...


527it [00:16, 32.25it/s]


** End of epoch, accumulated average loss = 0.205897 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 532
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.205558 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 533
Run training...


527it [00:16, 32.61it/s]


** End of epoch, accumulated average loss = 0.205426 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 534
Run training...


527it [00:17, 30.63it/s]


** End of epoch, accumulated average loss = 0.206162 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 535
Run training...


527it [00:16, 31.81it/s]


** End of epoch, accumulated average loss = 0.205412 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 536
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.205612 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 537
Run training...


527it [00:16, 31.53it/s]


** End of epoch, accumulated average loss = 0.205581 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 538
Run training...


527it [00:17, 30.25it/s]


** End of epoch, accumulated average loss = 0.206203 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 539
Run training...


527it [00:16, 32.32it/s]


** End of epoch, accumulated average loss = 0.205787 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 540
Run training...


527it [00:16, 31.99it/s]


** End of epoch, accumulated average loss = 0.204978 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 541
Run training...


527it [00:16, 32.21it/s]


** End of epoch, accumulated average loss = 0.205635 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 542
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 0.205698 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 543
Run training...


527it [00:16, 32.61it/s]


** End of epoch, accumulated average loss = 0.205123 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 544
Run training...


527it [00:16, 32.23it/s]


** End of epoch, accumulated average loss = 0.205041 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 545
Run training...


527it [00:16, 31.23it/s]


** End of epoch, accumulated average loss = 0.205410 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 546
Run training...


527it [00:17, 30.65it/s]


** End of epoch, accumulated average loss = 0.205177 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 547
Run training...


527it [00:16, 31.89it/s]


** End of epoch, accumulated average loss = 0.204235 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 548
Run training...


527it [00:16, 31.45it/s]


** End of epoch, accumulated average loss = 0.205708 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 549
Run training...


527it [00:17, 29.97it/s]


** End of epoch, accumulated average loss = 0.205103 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 550
Run training...


527it [00:16, 31.80it/s]


** End of epoch, accumulated average loss = 0.205068 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_550_epoch

----------------------------------------
Epoch: 551
Run training...


527it [00:16, 31.74it/s]


** End of epoch, accumulated average loss = 0.204981 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 552
Run training...


527it [00:16, 31.37it/s]


** End of epoch, accumulated average loss = 0.204267 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 553
Run training...


527it [00:16, 31.16it/s]


** End of epoch, accumulated average loss = 0.204714 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 554
Run training...


527it [00:16, 31.83it/s]


** End of epoch, accumulated average loss = 0.205030 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 555
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.204495 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 556
Run training...


527it [00:17, 29.88it/s]


** End of epoch, accumulated average loss = 0.204938 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 557
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.204083 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 558
Run training...


527it [00:16, 31.84it/s]


** End of epoch, accumulated average loss = 0.204576 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 559
Run training...


527it [00:16, 31.65it/s]


** End of epoch, accumulated average loss = 0.205752 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 560
Run training...


527it [00:17, 30.61it/s]


** End of epoch, accumulated average loss = 0.204364 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 561
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.204066 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 562
Run training...


527it [00:16, 32.21it/s]


** End of epoch, accumulated average loss = 0.203843 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 563
Run training...


527it [00:17, 30.66it/s]


** End of epoch, accumulated average loss = 0.204172 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 564
Run training...


527it [00:16, 31.60it/s]


** End of epoch, accumulated average loss = 0.203387 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 565
Run training...


527it [00:16, 32.45it/s]


** End of epoch, accumulated average loss = 0.204319 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 566
Run training...


527it [00:16, 32.29it/s]


** End of epoch, accumulated average loss = 0.204627 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 567
Run training...


527it [00:17, 30.28it/s]


** End of epoch, accumulated average loss = 0.203616 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 568
Run training...


527it [00:16, 32.44it/s]


** End of epoch, accumulated average loss = 0.203956 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 569
Run training...


527it [00:16, 31.94it/s]


** End of epoch, accumulated average loss = 0.203849 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 570
Run training...


527it [00:16, 31.89it/s]


** End of epoch, accumulated average loss = 0.203659 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 571
Run training...


527it [00:17, 30.56it/s]


** End of epoch, accumulated average loss = 0.204097 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 572
Run training...


527it [00:16, 31.91it/s]


** End of epoch, accumulated average loss = 0.203840 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 573
Run training...


527it [00:16, 32.05it/s]


** End of epoch, accumulated average loss = 0.203657 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 574
Run training...


527it [00:16, 31.32it/s]


** End of epoch, accumulated average loss = 0.203432 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 575
Run training...


527it [00:16, 31.04it/s]


** End of epoch, accumulated average loss = 0.203311 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_575_epoch

----------------------------------------
Epoch: 576
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.204095 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 577
Run training...


527it [00:16, 31.96it/s]


** End of epoch, accumulated average loss = 0.203335 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 578
Run training...


527it [00:17, 30.10it/s]


** End of epoch, accumulated average loss = 0.203522 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 579
Run training...


527it [00:16, 32.59it/s]


** End of epoch, accumulated average loss = 0.203445 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 580
Run training...


527it [00:16, 32.26it/s]


** End of epoch, accumulated average loss = 0.203498 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 581
Run training...


527it [00:16, 32.41it/s]


** End of epoch, accumulated average loss = 0.203441 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 582
Run training...


527it [00:17, 30.37it/s]


** End of epoch, accumulated average loss = 0.203195 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 583
Run training...


527it [00:16, 31.74it/s]


** End of epoch, accumulated average loss = 0.203078 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 584
Run training...


527it [00:16, 31.59it/s]


** End of epoch, accumulated average loss = 0.202565 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 585
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.203164 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 586
Run training...


527it [00:16, 31.16it/s]


** End of epoch, accumulated average loss = 0.204005 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 587
Run training...


527it [00:16, 32.55it/s]


** End of epoch, accumulated average loss = 0.202959 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 588
Run training...


527it [00:16, 32.08it/s]


** End of epoch, accumulated average loss = 0.203004 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 589
Run training...


527it [00:16, 31.04it/s]


** End of epoch, accumulated average loss = 0.203305 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 590
Run training...


527it [00:16, 31.27it/s]


** End of epoch, accumulated average loss = 0.203634 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 591
Run training...


527it [00:16, 32.35it/s]


** End of epoch, accumulated average loss = 0.202901 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 592
Run training...


527it [00:16, 32.31it/s]


** End of epoch, accumulated average loss = 0.203047 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 593
Run training...


527it [00:17, 29.87it/s]


** End of epoch, accumulated average loss = 0.203223 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 594
Run training...


527it [00:16, 32.26it/s]


** End of epoch, accumulated average loss = 0.202865 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 595
Run training...


527it [00:16, 32.34it/s]


** End of epoch, accumulated average loss = 0.202693 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 596
Run training...


527it [00:16, 32.41it/s]


** End of epoch, accumulated average loss = 0.203019 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 597
Run training...


527it [00:17, 30.10it/s]


** End of epoch, accumulated average loss = 0.202917 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 598
Run training...


527it [00:16, 32.22it/s]


** End of epoch, accumulated average loss = 0.202943 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 599
Run training...


527it [00:16, 31.82it/s]


** End of epoch, accumulated average loss = 0.202742 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 600
Run training...


527it [00:17, 30.98it/s]


** End of epoch, accumulated average loss = 0.202475 **
** Elapsed time: 0:00:17**
Saved checkpoint to ./drive/MyDrive/dl4nlp/cpkt_600_epoch
Classifier trained in 9965.75s


In [None]:
allpreds = []
for part, (ids, x, y) in part2ixy.items():
    print('\nClassifying %s set with %d examples ...' % (part, len(x)))
    st = time.time()
    preds = classify(x, params)
    print('%s set classified in %.2fs' % (part, time.time() - st))
    count_of_values = list(map(len, preds))
    assert np.all(np.array(count_of_values) == top_k)
    #score(preds, y)
    allpreds.extend(zip(ids, preds))

save_preds(allpreds, preds_fname=PREDS_FNAME)
print('\nChecking saved predictions ...')
score_preds(preds_path=PREDS_FNAME, data_dir=TRANSLIT_PATH, parts=SCORED_PARTS)


Classifying train set with 105371 examples ...
Using GPU device: cuda


100%|██████████| 527/527 [02:00<00:00,  4.37it/s]


train set classified in 120.74s

Classifying dev set with 26342 examples ...
Using GPU device: cuda


100%|██████████| 132/132 [00:24<00:00,  5.42it/s]


dev set classified in 24.36s

Classifying train_small set with 2000 examples ...
Using GPU device: cuda


100%|██████████| 10/10 [00:02<00:00,  4.47it/s]


train_small set classified in 2.25s

Classifying dev_small set with 2000 examples ...
Using GPU device: cuda


100%|██████████| 10/10 [00:02<00:00,  3.50it/s]


dev_small set classified in 2.87s

Classifying test set with 32926 examples ...
Using GPU device: cuda


100%|██████████| 165/165 [00:30<00:00,  5.45it/s]


test set classified in 30.31s
Predictions saved to preds_translit.tsv

Checking saved predictions ...
train set accuracy@1: 0.75
dev set accuracy@1: 0.67
train_small set accuracy@1: 0.75
dev_small set accuracy@1: 0.70
no labels for test set


{'train': {'acc@1': 0.7536608744341422},
 'dev': {'acc@1': 0.6720826057246982},
 'train_small': {'acc@1': 0.754},
 'dev_small': {'acc@1': 0.6975}}

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###  Hyper-parameters choice

The model is ready. Now we need to find the optimal hyper-parameters.

The quality of models with different hyperparameters should be monitored on dev or on dev_small samples (in order to save time, since generating transliterations is a rather time-consuming process, comparable to one training epoch).

To generate predictions, you can use the `generate_predictions` function, to calculate the accuracy@1 metric, and then you can use the `compute_metrics` function.



Hyper-parameters are stored in the dictionary `model_config` and `train_config` in train function. The following hyperparameters in `model_config` and `train_config` are suggested to leave unmodified:

* n_layers $=$ 2
* n_heads $=$ 2
* hidden_size $=$ 128
* fc_hidden_size $=$ 256
* warmup_steps_part $=$ 0.1
* batch_size $=$ 200

 You can vary the dropout value. The model has 4 types of : ***embedding dropout*** applied on embdeddings before sending to the first layer of  Encoder or Decoder, ***attention*** dropout applied on the attention weights in the MultiHeadAttention layer, ***residual dropout*** applied on the output of each sublayer (MultiHeadAttention or FeedForward) in layers Encoder and Decoder and, finaly, ***relu dropout*** in used in FeedForward layer. For all 4 types it is suggested to test the same value of dropout from the list: 0.1, 0.15, 0.2.
 Also it is suggested to test several peak levels of learning rate - **lr_peak** : 5e-4, 1e-3, 2e-3.

Note that if you are using a GPU, then training one epoch takes about 1 minute, and up to 1 GB of video memory is required. When using the CPU, the learning speed slows down by about 2 times. If there are problems with insufficient RAM / video memory, reduce the batch size, but in this case the optimal range of learning rate values will change, and it must be determined again. To train a model with  batch_size $=$ 200 , it will take at least 300 epochs to achieve accuracy 0.66 on dev_small dataset.

*Question: What are the optimal hyperpameters according to your experiments? Add plots or other descriptions here.* 

```

ENTER HERE YOUR ANSWER

```



In [None]:
!pip install optuna

In [137]:
# ENTER HERE YOUR CODE

import optuna
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())

part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)

train_ids, train_strings, train_transliterations = part2ixy['train']    
train_df = pd.DataFrame({'en': train_strings, 'ru': train_transliterations})
text_encoder = TextEncoder()
text_encoder.make_vocabs(train_df)


val_ids, val_strings, val_transliterations = part2ixy['dev_small']    
val_df = pd.DataFrame({'en': val_strings, 'ru': val_transliterations})
text_encoder_val = TextEncoder()
text_encoder_val.make_vocabs(val_df)

def run_epoch_val(data_iter, model, lr_scheduler, optimizer, device):
    start = time.time()
    local_start = start
    total_tokens = 0
    total_loss = 0
    tokens = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')
    for i, batch in tqdm(enumerate(data_iter)):
        encoder_input = batch[0].to(device)
        decoder_input = batch[1].to(device)
        decoder_target = batch[2].to(device)
        logits = model(encoder_input, decoder_input)
        loss = loss_fn(logits.view(-1, model.tgt_vocab_size),
                       decoder_target.view(-1))
        total_loss += loss.item()
        batch_n_tokens = (decoder_target != model.pad_idx).sum().item()
        total_tokens += batch_n_tokens
        if optimizer is not None:
            optimizer.zero_grad()
            lr_scheduler.step(optimizer)
            loss.backward()
            optimizer.step()

        tokens += batch_n_tokens

    average_loss = total_loss / total_tokens
    print('** End of epoch, accumulated average loss = %f **' % average_loss)
    epoch_elapsed_time = format_time(time.time() - start)
    print(f'** Elapsed time: {epoch_elapsed_time}**')
    return model

def train_estimate(train_config,model):
    device = torch.device('cpu')   
    #Model training procedure
    optimizer = torch.optim.Adam(model.parameters(), lr=0.)
    n_steps = (len(train_df) // train_config['batch_size'] + 1) * train_config['n_epochs']
    lr_scheduler = LrScheduler(n_steps, **train_config['lr_scheduler'])

    # prepare train and val data
    train_source, target_source = zip(*sorted(zip(train_strings, train_transliterations),
                                                 key=lambda e: len(e[0])))
    train_dataloader = create_dataloader(train_source, target_source, text_encoder,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    
    val_source, target_val_source = zip(*sorted(zip(val_strings, val_transliterations),
                                                 key=lambda e: len(e[0])))
    val_dataloader = create_dataloader(val_source, target_val_source, text_encoder_val,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    # training cycle
    for epoch in range(1,train_config['n_epochs']+1):
        print('\n' + '-'*40)
        print(f'Epoch: {epoch}')
        print(f'Run training...')
        model.train()
        model = run_epoch_val(val_dataloader, model,
                  lr_scheduler, optimizer, device=device)
         
        model.eval()
        
        predictions = generate_predictions(dataloader=val_dataloader, 
                                           max_decoding_len = model.config['max_tgt_seq_length'],
                                           text_encoder=text_encoder_val, 
                                           model=model, 
                                           device=device, 
                                           )
        acc = compute_metrics(predictions, target_val_source, metrics='acc@1') 
            
    return acc

def objective(trial):
    device = torch.device('cpu')
    params = {'src_vocab_size': text_encoder.src_vocab_size,
              'tgt_vocab_size': text_encoder.tgt_vocab_size,
              'max_src_seq_length': max(train_df['en'].aggregate(len)) + 2, #including start_token and end_token
              'max_tgt_seq_length': max(train_df['ru'].aggregate(len)) + 2,
              'n_layers': 2,
              'n_heads': 2,
              'hidden_size': 128,
              'ff_hidden_size': 256,
              'dropout': {
                          'embedding': trial.suggest_float("emb", 0.1, 0.20,step = 0.05),
                          'attention': trial.suggest_float("att", 0.1, 0.20,step = 0.05),
                          'residual': trial.suggest_float("res", 0.1, 0.20,step = 0.05),
                          'relu': trial.suggest_float("relu", 0.1, 0.20,step = 0.05)
                            },
              'pad_idx': 0 
                  }   
    model = prepare_model(params)
    model.to(device)
    train_config = {'batch_size': 200, 
                    'n_epochs': 250, 
                    'lr_scheduler': {
                                      'type': 'warmup,decay_linear',
                                      'warmup_steps_part': 0.1,
                                      'lr_peak': trial.suggest_loguniform('learning_rate', 5e-4, 2e-3),
                                  }
                    }
    
    acc = train_estimate(train_config, model)

    return acc


study.optimize(objective, n_trials=30)


[32m[I 2023-04-10 17:52:13,953][0m A new study created in memory with name: no-name-c3ec8076-bd2c-46f2-b5b3-deb43e1ae7f6[0m
  'lr_peak': trial.suggest_loguniform('learning_rate', 5e-4, 2e-3),



----------------------------------------
Epoch: 1
Run training...


10it [00:03,  2.72it/s]


** End of epoch, accumulated average loss = 6.467588 **
** Elapsed time: 0:00:04**


100%|██████████| 10/10 [00:29<00:00,  2.92s/it]



----------------------------------------
Epoch: 2
Run training...


8it [00:02,  2.87it/s]
[33m[W 2023-04-10 17:52:51,089][0m Trial 0 failed with parameters: {'emb': 0.2, 'att': 0.15000000000000002, 'res': 0.2, 'relu': 0.15000000000000002, 'learning_rate': 0.0011513411323536638} because of the following error: KeyboardInterrupt().[0m
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-137-d329583cf0fe>", line 119, in objective
    acc = train_estimate(train_config, model)
  File "<ipython-input-137-d329583cf0fe>", line 75, in train_estimate
    run_epoch_val(val_dataloader, model,
  File "<ipython-input-137-d329583cf0fe>", line 39, in run_epoch_val
    loss.backward()
  File "/usr/local/lib/python3.9/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run

KeyboardInterrupt: ignored

## Label smoothing

We suggest to implement an additional regularization method - **label smoothing**. Now imagine that we have a prediction vector from probabilities at position t in the sequence of tokens for each token id from the vocabulary. CrossEntropy compares it with ground truth one-hot representation

$$[0, ... 0, 1, 0, ..., 0].$$

And now imagine that we are slightly "smoothed" the values in the ground truth vector and obtained

$$[\frac{\alpha}{|V|}, ..., \frac{\alpha}{|V|}, 1(1-\alpha)+\frac{\alpha}{|V|},  \frac{\alpha}{|V|}, ... \frac{\alpha}{|V|}],$$

where $\alpha$ - parameter from 0 to 1, $|V|$ - vocabulary size - number of components in the ground truth vector. The values ​​of this new vector are still summed to 1. Calculate the cross-entropy of our prediction vector and the new ground truth. Now, firstly, cross-entropy will never reach 0, and secondly, the result of the error function will require the model, as usual, to return the highest probability vector compared to other components of the probability vector for the correct token in the dictionary, but at the same time not too large, because as the value of this probability approaches 1, the value of the error function increases. For research on the use of label smoothing, see the [paper](https://arxiv.org/abs/1906.02629).
    
Accordingly, in order to embed label smoothing into the model, it is necessary to carry out the transformation described above on the ground truth vectors, as well as to implement the cross-entropy calculation, since the used `torch.nn.CrossEntropy` class is not quite suitable, since for the ground truth representation of `__call__` method takes the id of the correct token and builds a one-hot vector already inside. However, it is possible to implement what is required based on the internal implementation of this class [CrossEntropyLoss](https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss).
    

Test different values of $\alpha$ (e.x, 0.05, 0.1, 0.2). Describe your experiments and results.


```

ENTER HERE YOUR ANSWER

```

In [None]:
# ENTER HERE YOUR CODE