# HW2 - Transliteration

Please send this to dlnlp2023@mail.ru with subject "Surname_HW2"

Deadline: 16.01.2023


In this task you will transliterate names from English to Russian. Transliterating a string means rewriting it in the alphabet of another language, usually (though not always) preserving its pronunciation.
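To make the task concrete, here is a minimal sketch of the most naive approach: mapping every Latin character independently. The table `CHAR_MAP` and the function `naive_translit` are hypothetical and are not part of the assignment code or data; they only illustrate why a character-by-character mapping is not enough.

```python
# Hypothetical, deliberately tiny mapping table (an assumption for illustration only).
CHAR_MAP = {"a": "а", "n": "н", "s": "с", "h": "х", "o": "о", "m": "м", "t": "т"}


def naive_translit(name: str) -> str:
    """Map each Latin character independently; unknown characters are kept as-is."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in name.lower())


print(naive_translit("anna"))  # every letter has a direct counterpart
print(naive_translit("shon"))  # "s" and "h" are mapped separately
```

In the second call the digraph "sh" is split into two letters, although it usually denotes a single sound. A model that looks at context, such as the Transformer built later in this notebook, can learn such multi-character correspondences.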


## Basic algorithm

In [1]:
!wget https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz

--2024-01-17 14:21:03--  https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz [following]
--2024-01-17 14:21:04--  https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1546458 (1.5M) [application/octet-stream]
Saving to: ‘TRANSLIT.tar.gz’


2024-01-17 14:21:04 (48.1 MB/s) - ‘TRANSLIT.tar.gz’ saved [1546458/1546458]



In [2]:
!gunzip TRANSLIT.tar.gz

In [3]:
!tar -xf TRANSLIT.tar

### Evaluation code

In [4]:
from typing import List

In [5]:
PREDS_FNAME = "preds_translit_baseline.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [6]:
import codecs
import os  # used by os.path.join below; otherwise imported only in a later cell

from pandas import read_csv

def load_dataset(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2ixy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            if col_count == 2:
                strings, transliterations = zip(
                    *list(map(lambda l: l.strip('\n').split('\t'), lines))
                )
            elif col_count == 1:
                strings = list(map(lambda l: l.strip('\n'), lines))
                transliterations = None
            else:
                raise ValueError("wrong amount of columns")
        part2ixy[part] = (
            [f'{part}/{i}' for i in range(len(strings))],
            strings, transliterations,
        )
    return part2ixy


def load_transliterations_only(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2iy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            n_lines = len(lines)
            if col_count == 2:
                transliterations = [l.strip('\n').split('\t')[1] for l in lines]
            elif col_count == 1:
                transliterations = None
            else:
                raise ValueError("Wrong amount of columns")
        part2iy[part] = (
            [f'{part}/{i}' for i in range(n_lines)],
            transliterations,
        )
    return part2iy


def save_preds(preds, preds_fname):
    """
    Save classifier predictions in format appropriate for scoring.
    """
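    # explicit UTF-8 so that Cyrillic predictions are written correctly on any locale
    with codecs.open(preds_fname, 'w', encoding='utf-8') as outp:
        # `pred` (not `preds`) to avoid shadowing the function argument
        for idx, pred in preds:
            print(idx, *pred, sep='\t', file=outp)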
    print('Predictions saved to %s' % preds_fname)


def load_preds(preds_fname, top_k=1):
    """
    Load classifier predictions in format appropriate for scoring.
    """
    kwargs = {
        "filepath_or_buffer": preds_fname,
        "names": ["id", "pred"],
        "sep": '\t',
    }

    pred_ids = list(read_csv(**kwargs, usecols=["id"])["id"])

    pred_y = {
        pred_id: [y]
        for pred_id, y in zip(
            pred_ids, read_csv(**kwargs, usecols=["pred"])["pred"]
        )
    }

    for y in pred_y.values():
        assert len(y) == top_k

    return pred_ids, pred_y


def compute_hit_k(preds, k=10):
    raise NotImplementedError


def compute_mrr(preds):
    raise NotImplementedError


def compute_acc_1(preds, true):
    right_answers = 0
    bonus = 0
    for pred, y in zip(preds, true):
        if pred[0] == y:
            right_answers += 1
        elif pred[0] != pred[0] and y == 'нань':
            print('Your test file contained empty string, skipping %f and %s' % (pred[0], y))
            bonus += 1 # bugfix: skip empty line in test
    return right_answers / (len(preds) - bonus)


def score(preds, true):
    assert len(preds) == len(true), 'inconsistent amount of predictions and ground truth answers'
    acc_1 = compute_acc_1(preds, true)
    return {'acc@1': acc_1}


def score_preds(preds_path, data_dir, parts=SCORED_PARTS):
    part2iy = load_transliterations_only(data_dir, parts=parts)
    pred_ids, pred_dict = load_preds(preds_path)
    # pred_dict = {i:y for i,y in zip(pred_ids, pred_y)}
    scores = {}
    for part, (true_ids, true_y) in part2iy.items():
        if true_y is None:
            print('no labels for %s set' % part)
            continue
        pred_y = [pred_dict[i] for i in true_ids]
        score_values = score(pred_y, true_y)
        acc_1 = score_values['acc@1']
        print('%s set accuracy@1: %.2f' % (part, acc_1))
        scores[part] = score_values
    return scores
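The functions `compute_hit_k` and `compute_mrr` above are left as stubs. The following is a minimal sketch of how such ranking metrics are commonly computed, assuming that each prediction is a list of candidates ordered from best to worst and that the reference answers are passed separately. The names `hit_at_k` and `mean_reciprocal_rank` and the toy data are hypothetical and do not come from the evaluation script.

```python
from typing import List, Sequence


def hit_at_k(preds: Sequence[List[str]], true: Sequence[str], k: int = 10) -> float:
    """Share of examples whose reference is among the first k candidates."""
    hits = sum(1 for cands, y in zip(preds, true) if y in cands[:k])
    return hits / len(true)


def mean_reciprocal_rank(preds: Sequence[List[str]], true: Sequence[str]) -> float:
    """Average of 1 / rank of the reference; 0 if the reference is not proposed."""
    total = 0.0
    for cands, y in zip(preds, true):
        if y in cands:
            total += 1.0 / (cands.index(y) + 1)
    return total / len(true)


# Toy data (assumed): three names, two ranked candidates each.
toy_preds = [["анна", "ана"], ["джон", "йон"], ["мэри", "мари"]]
toy_true = ["анна", "йон", "мария"]

print(hit_at_k(toy_preds, toy_true, k=1))
print(hit_at_k(toy_preds, toy_true, k=2))
print(mean_reciprocal_rank(toy_preds, toy_true))
```

With `k=1` this reduces to the accuracy@1 computed by `compute_acc_1`, apart from the special handling of the empty line in the test file.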

## Transformer-based approach


To implement your algorithm, start from the template code below and modify it.

First, fill in the missing parts of the Transformer architecture and implement the methods of the `LrScheduler` class, which updates the learning rate during training.
Next, select the model hyperparameters according to the proposed guide.

In [7]:
!pip install Levenshtein

Collecting Levenshtein
  Downloading Levenshtein-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (169 kB)
Collecting rapidfuzz<4.0.0,>=3.1.0 (from Levenshtein)
  Downloading rapidfuzz-3.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.23.0 rapidfuzz-3.6.1


In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as torch_data

import pandas as pd
import numpy as np
import itertools as it
import collections as col
import random
import os
import copy
import json
import math
from tqdm import tqdm
import datetime, time

import Levenshtein as le

### Load dataset and embeddings

In [9]:
def load_datasets(data_dir_path, parts):
    datasets = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        datasets[part] = pd.read_csv(path, sep='\t', na_filter=False)
        print(f'Loaded {part} dataset, length: {len(datasets[part])}')
    return datasets

In [10]:
class TextEncoder:
    def __init__(self, load_dir_path=None):
        self.lang_keys = ['en', 'ru']
        self.directions = ['id2token', 'token2id']
        self.service_token_names = {
            'pad_token': '<pad>',
            'start_token': '<start>',
            'unk_token': '<unk>',
            'end_token': '<end>'
        }
        service_id2token = dict(enumerate(self.service_token_names.values()))
        service_token2id ={v:k for k,v in service_id2token.items()}
        self.service_vocabs = dict(zip(self.directions,
                                       [service_id2token, service_token2id]))
        if load_dir_path is None:
            self.vocabs = {}
            for lk in self.lang_keys:
                self.vocabs[lk] = copy.deepcopy(self.service_vocabs)
        else:
            self.vocabs = self.load_vocabs(load_dir_path)
    def load_vocabs(self, load_dir_path):
        vocabs = {}
        load_path = os.path.join(load_dir_path, 'vocabs')
        for lk in self.lang_keys:
            vocabs[lk] = {}
            for d in self.directions:
                columns = d.split('2')
                # print(lk, d)
                df = pd.read_csv(os.path.join(load_path, f'{lk}_{d}'))
                vocabs[lk][d] = dict(zip(*[df[c] for c in columns]))
        return vocabs

    def save_vocabs(self, save_dir_path):
        save_path = os.path.join(save_dir_path, 'vocabs')
        os.makedirs(save_path, exist_ok=True)
        for lk in self.lang_keys:
            for d in self.directions:
                columns = d.split('2')
                pd.DataFrame(data=self.vocabs[lk][d].items(),
                    columns=columns).to_csv(os.path.join(save_path, f'{lk}_{d}'),
                                                index=False,
                                                sep=',')
    def make_vocabs(self, data_df):
        for lk in self.lang_keys:
            tokens = col.Counter(''.join(list(it.chain(*data_df[lk])))).keys()
            part_id2t = dict(enumerate(tokens, start=len(self.service_token_names)))
            part_t2id = {k:v for v,k in part_id2t.items()}
            part_vocabs = [part_id2t, part_t2id]
            for i in range(len(self.directions)):
                self.vocabs[lk][self.directions[i]].update(part_vocabs[i])

        self.src_vocab_size = len(self.vocabs['en']['id2token'])
        self.tgt_vocab_size = len(self.vocabs['ru']['id2token'])

    def frame(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_token_names['start_token']
        if end_token is None:
            end_token=self.service_token_names['end_token']
        return [start_token] + sample + [end_token]
    def token2id(self, samples, frame, lang_key):
        if frame:
            samples = list(map(self.frame, samples))
        vocab = self.vocabs[lang_key]['token2id']
        return list(map(lambda s:
                        [vocab[t] if t in vocab.keys() else vocab[self.service_token_names['unk_token']]
                         for t in s], samples))

    def unframe(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_vocabs['token2id'][self.service_token_names['start_token']]
        if end_token is None:
            end_token=self.service_vocabs['token2id'][self.service_token_names['end_token']]
        pad_token=self.service_vocabs['token2id'][self.service_token_names['pad_token']]
        return list(it.takewhile(lambda e: e != end_token and e != pad_token, sample[1:]))
    def id2token(self, samples, unframe, lang_key):
        if unframe:
            samples = list(map(self.unframe, samples))
        vocab = self.vocabs[lang_key]['id2token']
        return list(map(lambda s:
                        [vocab[idx] if idx in vocab.keys() else self.service_token_names['unk_token'] for idx in s], samples))


class TranslitData(torch_data.Dataset):
    def __init__(self, source_strings, target_strings,
                text_encoder):
        super(TranslitData, self).__init__()
        self.source_strings = source_strings
        self.text_encoder = text_encoder
        if target_strings is not None:
            assert len(source_strings) == len(target_strings)
            self.target_strings = target_strings
        else:
            self.target_strings = None
    def __len__(self):
        return len(self.source_strings)
    def __getitem__(self, idx):
        src_str = self.source_strings[idx]
        encoder_input = self.text_encoder.token2id([list(src_str)], frame=True, lang_key='en')[0]
        if self.target_strings is not None:
            tgt_str = self.target_strings[idx]
            tmp = self.text_encoder.token2id([list(tgt_str)], frame=True, lang_key='ru')[0]
            decoder_input = tmp[:-1]
            decoder_target = tmp[1:]
            return (encoder_input, decoder_input, decoder_target)
        else:
            return (encoder_input,)


class BatchSampler(torch_data.BatchSampler):
    def __init__(self, sampler, batch_size, drop_last, shuffle_each_epoch):
        super(BatchSampler, self).__init__(sampler, batch_size, drop_last)
        self.batches = []
        for b in super(BatchSampler, self).__iter__():
            self.batches.append(b)
        self.shuffle_each_epoch = shuffle_each_epoch
        if self.shuffle_each_epoch:
            random.shuffle(self.batches)
        self.index = 0
        #print(f'Batches collected: {len(self.batches)}')
    def __iter__(self):
        self.index = 0
        return self
    def __next__(self):
        if self.index == len(self.batches):
            if self.shuffle_each_epoch:
                random.shuffle(self.batches)
            raise StopIteration
        else:
            batch = self.batches[self.index]
            self.index += 1
            return batch

def collate_fn(batch_list):
    '''batch_list can store either 3 components:
        encoder_inputs, decoder_inputs, decoder_targets
        or single component: encoder_inputs'''
    components = list(zip(*batch_list))
    batch_tensors = []
    for data in components:
        max_len = max([len(sample) for sample in data])
        #print(f'Maximum length in batch = {max_len}')
        sample_tensors = [torch.tensor(s, requires_grad=False, dtype=torch.int64)
                         for s in data]
        batch_tensors.append(nn.utils.rnn.pad_sequence(
            sample_tensors,
            batch_first=True, padding_value=0))
    return tuple(batch_tensors)


def create_dataloader(source_strings, target_strings,
                      text_encoder, batch_size,
                      shuffle_batches_each_epoch):
    '''target_strings parameter can be None'''
    dataset = TranslitData(source_strings, target_strings,
                                text_encoder=text_encoder)
    seq_sampler = torch_data.SequentialSampler(dataset)
    batch_sampler = BatchSampler(seq_sampler, batch_size=batch_size,
                                drop_last=False,
                                shuffle_each_epoch=shuffle_batches_each_epoch)
    dataloader = torch_data.DataLoader(dataset,
                                       batch_sampler=batch_sampler,
                                       collate_fn=collate_fn)
    return dataloader
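The way `TranslitData.__getitem__` and `collate_fn` prepare decoder sequences can be followed on a toy example. The sketch below re-implements the idea in plain Python, without `torch`, so the helper names (`build_vocab`, `encode`, `pad_batch`) are hypothetical. The service token ids follow the order of `service_token_names` in `TextEncoder`.

```python
# Service token ids, in the same order as in TextEncoder: pad, start, unk, end.
PAD, START, UNK, END = 0, 1, 2, 3


def build_vocab(strings):
    """Assign ids to characters in order of first appearance, after the service tokens."""
    vocab = {}
    for s in strings:
        for ch in s:
            vocab.setdefault(ch, len(vocab) + 4)
    return vocab


def encode(s, vocab):
    """Frame a string with <start>/<end> and map unknown characters to <unk>."""
    return [START] + [vocab.get(ch, UNK) for ch in s] + [END]


def pad_batch(seqs, pad=PAD):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad] * (max_len - len(s)) for s in seqs]


targets = ["анна", "ян"]
vocab = build_vocab(targets)
framed = [encode(t, vocab) for t in targets]

# The decoder input drops the last token; the decoder target drops the first one,
# so at every step the model is trained to predict the next token.
decoder_input = pad_batch([f[:-1] for f in framed])
decoder_target = pad_batch([f[1:] for f in framed])

print(decoder_input)
print(decoder_target)
```

Because the padding id is 0, the same value can later be used as `pad_idx` when building attention masks.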

### Metric function

In [11]:
def compute_metrics(predicted_strings, target_strings, metrics):
    metric_values = {}
    for m in metrics:
        if m == 'acc@1':
            # element-wise comparison that also works for plain Python lists
            metric_values[m] = sum(
                p == t for p, t in zip(predicted_strings, target_strings)
            ) / len(target_strings)
        elif m =='mean_ld@1':
            metric_values[m] =\
                np.mean(list(map(lambda e: le.distance(*e), zip(predicted_strings, target_strings))))
        else:
            raise ValueError(f'Unknown metric: {m}')
    return metric_values
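The two metrics can be checked by hand on a small example. The sketch below uses a plain-Python edit distance instead of the `Levenshtein` package, so that it runs without extra dependencies; `edit_distance` and the toy strings are hypothetical and only illustrate what `acc@1` and `mean_ld@1` measure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Toy data (assumed): one exact match and two predictions that differ by one letter.
predicted = ["анна", "джон", "мэри"]
target = ["анна", "джан", "мари"]

acc_at_1 = sum(p == t for p, t in zip(predicted, target)) / len(target)
mean_ld = sum(edit_distance(p, t) for p, t in zip(predicted, target)) / len(target)

print(acc_at_1, mean_ld)
```

Accuracy only counts exact matches, while the mean Levenshtein distance also rewards predictions that are almost right, which makes it a smoother signal during training.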

### Positional Encoding [1 Point]

As you remember, the Transformer treats its input as a sequence of elements. Because the Encoder processes the entire sequence simultaneously, the model has no other way to tell where an element sits in the sequence, so position information must be encoded in the element's embedding. This is the job of the PositionalEncoding layer, which adds to each embedding a position-dependent vector of the same dimension.
Let $PE$ denote the matrix of these vectors, one row per position. Its elements are:

$$ PE_{(pos,2i)} = \sin{(pos/10000^{2i/d_{model}})}$$
$$ PE_{(pos,2i+1)} = \cos{(pos/10000^{2i/d_{model}})}$$

where $pos$ is the position, $i$ indexes pairs of components of the corresponding vector (components $2i$ and $2i+1$), and $d_{model}$ is the dimension of each vector. Thus, even components hold sine values and odd components hold cosine values, with a different frequency for each pair.

To run the test, use the following function:

`test_positional_encoding()`

Make sure that it raises no `AssertionError`!
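The formulas can be evaluated directly, one position at a time. The sketch below does this in plain Python for `d_model = 4`, the same setting as in `test_positional_encoding`. The function `positional_encoding_row` is a hypothetical helper, not part of the template.

```python
import math


def positional_encoding_row(pos: int, d_model: int) -> list:
    """Return the PE vector for one position, following the formulas above."""
    row = []
    for k in range(d_model):
        i = k // 2  # components 2i and 2i+1 share the same frequency
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return row


for pos in range(3):
    print([round(v, 4) for v in positional_encoding_row(pos, 4)])
```

The printed rows should agree with the tensor `res_1` from the test up to rounding. A vectorized implementation computes the same values for all positions at once.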


In [12]:
class Embedding(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(Embedding, self).__init__()
        self.emb_layer = nn.Embedding(vocab_size, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):
        return self.emb_layer(x)

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_size, max_len=512):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, hidden_size, requires_grad=False)
        # TODO: implement your code here
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, hidden_size, 2) * -(math.log(10000.0) / hidden_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        # pe shape: (1, max_len, hidden_size)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: shape (batch size, sequence length, hidden size)
        x = x + self.pe[:, :x.size(1)]
        return x

In [13]:
def test_positional_encoding():
    pe = PositionalEncoding(max_len=3, hidden_size=4)
    res_1 = torch.tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000],
                           [ 0.8415,  0.5403,  0.0100,  0.9999],
                           [ 0.9093, -0.4161,  0.0200,  0.9998]]])
    # print(pe.pe - res_1)
    assert torch.all(torch.abs(pe.pe - res_1) < 1e-4).item()
    print('Test is passed!')

In [14]:
test_positional_encoding()

Test is passed!


### LayerNorm

In [15]:
class LayerNorm(nn.Module):
    "Layer Normalization layer"

    def __init__(self, hidden_size, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gain = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # print(mean, std)
        return self.gain * (x - mean) / (std + self.eps) + self.bias
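A detail of the layer above that is easy to miss: `x.std` in PyTorch applies Bessel's correction by default (division by `n - 1`), and `eps` is added to the standard deviation rather than to the variance. The plain-Python sketch below reproduces this arithmetic for a single vector with `gain = 1` and `bias = 0`; the function `layer_norm` is a hypothetical helper for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one vector the way the LayerNorm class above does (gain=1, bias=0)."""
    n = len(x)
    mean = sum(x) / n
    # unbiased variance (n - 1 in the denominator), matching the default of x.std
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 4) for v in normalized])
```

The built-in `nn.LayerNorm` uses the biased variance estimate and adds `eps` inside the square root, so its outputs differ slightly from this custom layer.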

### SublayerConnection

In [16]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer normalization.
    """

    def __init__(self, hidden_size, dropout):
        super(SublayerConnection, self).__init__()
        self.layer_norm = LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.layer_norm(x + self.dropout(sublayer(x)))

def padding_mask(x, pad_idx=0):
    assert len(x.size()) >= 2
    return (x != pad_idx).unsqueeze(-2)

def look_ahead_mask(size):
    "Mask out the right context"
    attn_shape = (1, size, size)
    look_ahead_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(look_ahead_mask) == 0

def compositional_mask(x, pad_idx=0):
    pm = padding_mask(x, pad_idx=pad_idx)
    seq_length = x.size(-1)
    result_mask = pm & \
                  look_ahead_mask(seq_length).type_as(pm.data)
    return result_mask
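The interaction of the two masks is easier to see on a concrete sequence. The sketch below mirrors `padding_mask`, `look_ahead_mask` and `compositional_mask` with nested Python lists for a single sequence. The helper names are hypothetical, and `True` means that attention is allowed.

```python
def padding_mask_row(ids, pad_idx=0):
    """True for real tokens, False for padding."""
    return [tok != pad_idx for tok in ids]


def look_ahead(size):
    """Lower-triangular matrix: position i may attend to positions j <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]


def combined_mask(ids, pad_idx=0):
    """Element-wise AND of the padding mask (broadcast over rows) and the look-ahead mask."""
    pm = padding_mask_row(ids, pad_idx)
    la = look_ahead(len(ids))
    n = len(ids)
    return [[pm[j] and la[i][j] for j in range(n)] for i in range(n)]


# Three real tokens followed by one padding token (id 0).
mask = combined_mask([1, 7, 5, 0])
for row in mask:
    print([int(v) for v in row])
```

The last column is all zeros because no position may attend to the padding token, and the upper triangle is zero because the decoder must not see future tokens.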

### FeedForward [1 Point]

In [17]:
class FeedForward(nn.Module):
    def __init__(self, hidden_size, ff_hidden_size, dropout=0.1):
        super(FeedForward, self).__init__()
        # TODO: MAKE FEED FORWARD
        # It goes as linear -> RELU -> Dropout -> Linear
        self.pre_linear = nn.Linear(hidden_size, ff_hidden_size)
        self.post_linear = nn.Linear(ff_hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # TODO: WRITE CORRECT RETURN
        return self.post_linear(self.dropout(F.relu(self.pre_linear(x))))

def clone_layer(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [18]:
def test_FF():
    ff = FeedForward(24, 256)
    assert(ff.pre_linear.weight.size() == (256, 24))
    assert(ff.post_linear.weight.size() == (24, 256))

    x = torch.rand(32, 24)
    out = ff(x)
    assert(out.size() == (32, 24))
    print('Test is passed!')

In [19]:
test_FF()

Test is passed!


### MultiHeadAttention [1.5 Point]


Next, implement the `attention` method of the `MultiHeadAttention` class. The MultiHeadAttention layer takes as input the matrices $Q$, $K$ and $V$, whose rows are the query, key and value vectors for each step of the sequence. Each query, key and value vector is obtained from the output of the previous layer by a linear projection with one of three trainable weight matrices. In formulas:
$$
Attention(Q, K, V)=softmax\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
$$

$$
MultiHead(Q, K, V) = Concat\left(head_1, \dots , head_h\right) W^O
$$

$$
head_i=Attention\left(Q W_i^Q, K W_i^K, V W_i^V\right)
$$
$h$ is the number of attention heads: parallel Scaled Dot-Product Attention sub-layers, each operating on vectors of smaller dimension ($d_{k} = d_{q} = d_{v} = d_{model} / h$).
The logic of `MultiHeadAttention` is shown in the picture below (from the original [paper](https://arxiv.org/abs/1706.03762)):

![](https://lilianweng.github.io/lil-log/assets/images/transformer.png)


Inside the `attention` method, use the dropout layer created in the `MultiHeadAttention` constructor. Apply it directly to the attention weights, i.e. to the result of the softmax operation. The drop probability is set during training via `model_config['dropout']['attention']`.

The correctness of the implementation can be checked with
`test_multi_head_attention()`
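Before writing the tensor version, it may help to trace the attention formula for a single head on tiny inputs. The sketch below is a plain-Python illustration under simplifying assumptions: one head, no batch dimension, no dropout. The helper names `softmax` and `attention` and the toy matrices are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attention(Q, K, V, mask):
    """Scaled dot-product attention; mask[i][j] is True where query i may see key j."""
    d_k = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # masked positions get a very large negative score before the softmax
        scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
causal = [[True, False], [True, True]]  # the first query cannot look ahead

out, attn = attention(Q, K, V, causal)
print(attn)
print(out)
```

The first query can only attend to the first key, so its output is just the first row of `V`. The second query mixes both rows, giving the larger weight to the key it matches.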



In [20]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, hidden_size, dropout=None):
        super(MultiHeadAttention, self).__init__()
        assert hidden_size % n_heads == 0
        self.head_hidden_size = hidden_size // n_heads
        self.n_heads = n_heads
        self.linears = clone_layer(nn.Linear(hidden_size, hidden_size), 4)
        self.attn_weights = None
        self.dropout = dropout
        if self.dropout is not None:
            self.dropout_layer = nn.Dropout(p=self.dropout)

    def attention(self, query, key, value, mask):
        """Compute 'Scaled Dot Product Attention'
            query, key and value tensors have the same shape:
                (batch size, number of heads, sequence length, head hidden size)
            mask shape: (batch size, 1, sequence length, sequence length)
                '1' dimension value will be broadcasted to number of heads inside your operations
            mask should be applied before using softmax to get attn_weights
        """
        ## attn_weights shape: (batch size, number of heads, sequence length, sequence length)
        ## output shape: (batch size, number of heads, sequence length, head hidden size)
        ## TODO: provide your implementation here
        ## don't forget to apply dropout to attn_weights if self.dropout is not None
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        # forward() allows mask=None, so only mask when a mask is given
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        if self.dropout is not None:
            attn_weights = self.dropout_layer(attn_weights)

        output = torch.matmul(attn_weights, value)
        return output, attn_weights

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)

        # Split vectors for different attention heads (from hidden_size => n_heads x head_hidden_size)
        # and do separate linear projection, for separate trainable weights
        query, key, value = \
            [l(x).view(batch_size, -1, self.n_heads, self.head_hidden_size).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        x, self.attn_weights = self.attention(query, key, value, mask=mask)
        # x shape: (batch size, number of heads, sequence length, head hidden size)
        # self.attn_weights shape: (batch size, number of heads, sequence length, sequence length)

        # Concatenate the output of each head
        x = x.transpose(1, 2).contiguous() \
            .view(batch_size, -1, self.n_heads * self.head_hidden_size)

        return self.linears[-1](x)

In [21]:
def test_multi_head_attention():
    mha = MultiHeadAttention(n_heads=1, hidden_size=5, dropout=None)
    # batch_size == 2, sequence length == 3, hidden_size == 5
    # query = torch.arange(150).reshape(2, 3, 5)
    query = torch.tensor([[[[ 0.64144618, -0.95817388,  0.37432297,  0.58427106,
          -0.94668716]],
        [[-0.23199289,  0.66329209, -0.46507035, -0.54272512,
          -0.98640698]],
        [[ 0.07546638, -0.09277002,  0.20107185, -0.97407381,
          -0.27713414]]],
       [[[ 0.14727783,  0.4747886 ,  0.44992016, -0.2841419 ,
          -0.81820319]],
        [[-0.72324994,  0.80643179, -0.47655449,  0.45627872,
           0.60942404]],
        [[ 0.61712569, -0.62947282, -0.95215713, -0.38721959,
          -0.73289725]]]])
    key = torch.tensor([[[[-0.81759856, -0.60049991, -0.05923424,  0.51898901,
          -0.3366209 ]],
        [[ 0.83957818, -0.96361722,  0.62285191,  0.93452467,
           0.51219613]],
        [[-0.72758847,  0.41256154,  0.00490795,  0.59892503,
          -0.07202049]]],
       [[[ 0.72315339, -0.49896314,  0.94254637, -0.54356006,
          -0.04837949]],
        [[ 0.51759322, -0.43927061, -0.59924184,  0.92241702,
          -0.86811696]],
        [[-0.54322046, -0.92323003, -0.827746  ,  0.90842783,
           0.88428119]]]])
    value = torch.tensor([[[[-0.83895431,  0.805027  ,  0.22298283, -0.84849915,
          -0.34906026]],
        [[-0.02899652, -0.17456128, -0.17535998, -0.73160314,
          -0.13468061]],
        [[ 0.75234265,  0.02675947,  0.84766286, -0.5475651 ,
          -0.83319316]]],
       [[[-0.47834413,  0.34464645, -0.41921457,  0.33867964,
           0.43470836]],
        [[-0.99000979,  0.10220893, -0.4932273 ,  0.95938905,
           0.01927012]],
        [[ 0.91607137,  0.57395644, -0.90914179,  0.97212912,
           0.33078759]]]])
    query = query.float().transpose(1,2)
    key = key.float().transpose(1,2)
    value = value.float().transpose(1,2)

    x,_ = torch.max(query[:,0,:,:], axis=-1)
    mask = compositional_mask(x)
    mask.unsqueeze_(1)
    for n,t in [('query', query), ('key', key), ('value', value), ('mask', mask)]:
        print(f'Name: {n}, shape: {t.size()}')
    with torch.no_grad():
        output, attn_weights = mha.attention(query, key, value, mask=mask)
    assert output.size() == torch.Size([2,1,3,5])
    assert attn_weights.size() == torch.Size([2,1,3,3])

    truth_output = torch.tensor([[[[-0.8390,  0.8050,  0.2230, -0.8485, -0.3491],
          [-0.6043,  0.5212,  0.1076, -0.8146, -0.2870],
          [-0.0665,  0.2461,  0.3038, -0.7137, -0.4410]]],
        [[[-0.4783,  0.3446, -0.4192,  0.3387,  0.4347],
          [-0.7959,  0.1942, -0.4652,  0.7239,  0.1769],
          [-0.3678,  0.2868, -0.5799,  0.7987,  0.2086]]]])
    truth_attn_weights = torch.tensor([[[[1.0000, 0.0000, 0.0000],
          [0.7103, 0.2897, 0.0000],
          [0.3621, 0.3105, 0.3274]]],
        [[[1.0000, 0.0000, 0.0000],
          [0.3793, 0.6207, 0.0000],
          [0.2642, 0.4803, 0.2555]]]])
    # print(torch.abs(output - truth_output))
    # print(torch.abs(attn_weights - truth_attn_weights))
    assert torch.all(torch.abs(output - truth_output) < 1e-4).item()
    assert torch.all(torch.abs(attn_weights - truth_attn_weights) < 1e-4).item()
    print('Test is passed!')

In [22]:
test_multi_head_attention()

Name: query, shape: torch.Size([2, 1, 3, 5])
Name: key, shape: torch.Size([2, 1, 3, 5])
Name: value, shape: torch.Size([2, 1, 3, 5])
Name: mask, shape: torch.Size([2, 1, 3, 3])
Test is passed!


### Encoder

In [23]:
class EncoderLayer(nn.Module):
    "An encoder layer is made up of self-attention and feed forward (both defined above)"

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 2)

    def forward(self, x, mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayers[1](x, self.feed_forward)

class Encoder(nn.Module):
    def __init__(self, config):
        super(Encoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['src_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_src_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.encoder_layer = EncoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.encoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, mask)
        return self.layer_norm(x)

### Decoder

In [24]:
class DecoderLayer(nn.Module):
    """
    A decoder layer is made of 3 sublayers: self attention, encoder-decoder attention
    and feed forward
    """

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(DecoderLayer, self).__init__()

        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.encdec_attn = MultiHeadAttention(n_heads, hidden_size,
                                              dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 3)

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, decoder_mask))
        x = self.sublayers[1](x, lambda x: self.encdec_attn(x, encoder_output,
                                                            encoder_output, encoder_mask))
        return self.sublayers[2](x, self.feed_forward)

class Decoder(nn.Module):
    def __init__(self, config):
        super(Decoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['tgt_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_tgt_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.decoder_layer = DecoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.decoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, encoder_output, encoder_mask, decoder_mask)
        return self.layer_norm(x)

### Transformer

In [25]:
class Transformer(nn.Module):
    def __init__(self, config):
        super(Transformer, self).__init__()
        self.config = config
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)
        self.proj = nn.Linear(config['hidden_size'], config['tgt_vocab_size'])

        self.pad_idx = config['pad_idx']
        self.tgt_vocab_size = config['tgt_vocab_size']

    def encode(self, encoder_input, encoder_input_mask):
        return self.encoder(encoder_input, encoder_input_mask)

    def decode(self, encoder_output, encoder_input_mask, decoder_input, decoder_input_mask):
        return self.decoder(decoder_input, encoder_output, encoder_input_mask, decoder_input_mask)

    def linear_project(self, x):
        return self.proj(x)

    def forward(self, encoder_input, decoder_input):
        encoder_input_mask = padding_mask(encoder_input, pad_idx=self.config['pad_idx'])
        decoder_input_mask = compositional_mask(decoder_input, pad_idx=self.config['pad_idx'])
        encoder_output = self.encode(encoder_input, encoder_input_mask)
        decoder_output = self.decode(encoder_output, encoder_input_mask,
                                     decoder_input, decoder_input_mask)
        output_logits = self.linear_project(decoder_output)
        return output_logits


def prepare_model(config):
    model = Transformer(config)

    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

####  LrScheduler [1 Point]

The last thing you have to prepare is the class `LrScheduler`, which is in charge of updating the learning rate after every optimizer step. You are required to fill in the class constructor and the method `learning_rate`. The preferred strategy for updating the learning rate (lr) has two stages:

* "warmup" stage - lr increases linearly up to the defined peak value over a fixed number of steps (a proportion of all training steps, set by the parameter `train_config['lr_scheduler']['warmup_steps_part']` in the train function).
* "decrease" stage - lr decreases linearly to 0 over the remaining training steps.

A `learning_rate()` call should return the value of lr at the current step, whose number is stored in `self._step`. The class constructor takes not only `warmup_steps_part` but also the peak learning rate value `lr_peak`, reached at the end of the "warmup" stage, and the string name of the learning rate scheduling strategy. If you want, you can test other strategies by dispatching on the `self.type` attribute.
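
The two stages described above can be written as one piecewise function. The symbols are introduced here only for illustration: $s$ is the current step, $N$ the total number of training steps, $s_w$ the number of warmup steps (`warmup_steps` in the code, the integer part of $N$ times `warmup_steps_part`), and $\eta_{\text{peak}}$ stands for `lr_peak`.

$$
\eta(s) =
\begin{cases}
\eta_{\text{peak}} \cdot \dfrac{s}{s_w}, & 0 \le s \le s_w, \\
\eta_{\text{peak}} \cdot \left(1 - \dfrac{s - s_w}{N - s_w}\right), & s_w < s \le N.
\end{cases}
$$

For example, with $N = 100$, `warmup_steps_part` $=$ 0.1 and `lr_peak` $=$ 3e-4, this gives $s_w = 10$, $\eta(5) = 1.5 \cdot 10^{-4}$, $\eta(10) = 3 \cdot 10^{-4}$ and $\eta(50) = 3 \cdot 10^{-4} \cdot (1 - 40/90) \approx 1.67 \cdot 10^{-4}$.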

Correctness check: `test_lr_scheduler()`


In [26]:
class LrScheduler:
    def __init__(self, n_steps, **kwargs):
        self.type = kwargs['type']
        self._step = 0
        self._lr = 0
        self.lr_peak = kwargs['lr_peak']
        self.n_steps = n_steps
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
            self.warmup_steps = int(n_steps * kwargs['warmup_steps_part'])
        else:
            raise ValueError(f'Unknown type argument: {self.type}')

    def step(self, optimizer):
        self._step += 1
        lr = self.learning_rate()
        for p in optimizer.param_groups:
            p['lr'] = lr

    def learning_rate(self, step=None):
        if step is None:
            step = self._step
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
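            if step <= self.warmup_steps:
                # Linear warmup from 0 to lr_peak (guarded against warmup_steps == 0).
                self._lr = (step / max(1, self.warmup_steps)) * self.lr_peak
            else:
                # Linear decay from lr_peak to 0 over the steps left after warmup.
                decay_steps_done = step - self.warmup_steps
                decay_steps_total = max(1, self.n_steps - self.warmup_steps)
                self._lr = max(0, self.lr_peak * (1 - decay_steps_done / decay_steps_total))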

        return self._lr

    def state_dict(self):
        sd = copy.deepcopy(self.__dict__)
        return sd

    def load_state_dict(self, sd):
        for k in sd.keys():
            self.__setattr__(k, sd[k])


In [27]:
def test_lr_scheduler():
    lrs_type = 'warmup,decay_linear'
    warmup_steps_part =  0.1
    lr_peak = 3e-4
    sch = LrScheduler(100, type=lrs_type, warmup_steps_part=warmup_steps_part,
                      lr_peak=lr_peak)
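    # Compare absolute differences: without abs() any value that is too small would pass.
    assert abs(sch.learning_rate(step=5) - 15e-5) < 1e-6
    assert abs(sch.learning_rate(step=10) - 3e-4) < 1e-6
    assert abs(sch.learning_rate(step=50) - 166e-6) < 1e-6
    assert abs(sch.learning_rate(step=100) - 0.) < 1e-6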
    print('Test is passed!')

In [28]:
test_lr_scheduler()

Test is passed!


### Run and translate [0.5 Points]

In [29]:
from torch.nn import CrossEntropyLoss

In [30]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))


def run_epoch(data_iter, model, lr_scheduler, optimizer, device, verbose=False):
    start = time.time()
    local_start = start
    total_tokens = 0
    total_loss = 0
    tokens = 0
    # TODO: TAKE CROSS ENTROPY LOSS WITH SUM REDUCTION
    loss_fn = CrossEntropyLoss(ignore_index=model.pad_idx, reduction='sum')
    for i, batch in tqdm(enumerate(data_iter)):
        encoder_input = batch[0].to(device)
        decoder_input = batch[1].to(device)
        decoder_target = batch[2].to(device)
        # TODO: OBTAIN MODEL LOGITS, PASS THEM TO LOSS_FN AND CALCULATE LOSS
        logits = model(encoder_input, decoder_input)
        loss = loss_fn(logits.view(-1, model.config['tgt_vocab_size']), decoder_target.view(-1))
        total_loss += loss.item()
        batch_n_tokens = (decoder_target != model.pad_idx).sum().item()
        total_tokens += batch_n_tokens
        if optimizer is not None:
            optimizer.zero_grad()
            lr_scheduler.step(optimizer)
            loss.backward()
            optimizer.step()

        tokens += batch_n_tokens
        if verbose and i % 1000 == 1:
            elapsed = time.time() - local_start
            print("batch number: %d, accumulated average loss: %f, tokens per second: %f" %
                  (i, total_loss / total_tokens, tokens / elapsed))
            local_start = time.time()
            tokens = 0

    average_loss = total_loss / total_tokens
    print('** End of epoch, accumulated average loss = %f **' % average_loss)
    epoch_elapsed_time = format_time(time.time() - start)
    print(f'** Elapsed time: {epoch_elapsed_time}**')
    return average_loss


def save_checkpoint(epoch, model, lr_scheduler, optimizer, model_dir_path):
    save_path = os.path.join(model_dir_path, f'cpkt_{epoch}_epoch')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict': lr_scheduler.state_dict()
    }, save_path)
    print(f'Saved checkpoint to {save_path}')

def load_model(epoch, model_dir_path):
    save_path = os.path.join(model_dir_path, f'cpkt_{epoch}_epoch')
    checkpoint = torch.load(save_path)
    with open(os.path.join(model_dir_path, 'model_config.json'), 'r', encoding='utf-8') as rf:
        model_config = json.load(rf)
    model = prepare_model(model_config)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model

def greedy_decode(model, device, encoder_input, max_len, start_symbol):
    batch_size = encoder_input.size()[0]
    decoder_input = torch.ones(batch_size, 1).fill_(start_symbol).type_as(encoder_input.data).to(device)

    for i in range(max_len):
        logits = model(encoder_input, decoder_input)

        # Take the most probable token at the last generated position.
        _, predicted_ids = torch.max(logits, dim=-1)
        next_word = predicted_ids[:, i]
        rest = torch.ones(batch_size, 1).type_as(decoder_input.data)
        rest[:, 0] = next_word
        # Append the new token and feed the extended sequence back in.
        decoder_input = torch.cat([decoder_input, rest], dim=1).to(device)
    return decoder_input

def generate_predictions(dataloader, max_decoding_len, text_encoder, model, device):
    # print(f'Max decoding length = {max_decoding_len}')
    model.eval()
    predictions = []
    start_token_id = text_encoder.service_vocabs['token2id'][
        text_encoder.service_token_names['start_token']]
    with torch.no_grad():
        for batch in tqdm(dataloader):
            encoder_input = batch[0].to(device)
            prediction_tensor = \
                greedy_decode(model, device, encoder_input, max_decoding_len,
                              start_token_id)

            predictions.extend([''.join(e) for e in text_encoder.id2token(prediction_tensor.cpu().numpy(),
                                                                          unframe=True, lang_key='ru')])
    return np.array(predictions)


def train(source_strings, target_strings, model_config, train_config):
    '''Common training cycle for final run (fixed hyperparameters,
    no evaluation during training)'''
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    train_df = pd.DataFrame({'en': source_strings, 'ru': target_strings})
    text_encoder = TextEncoder()
    text_encoder.make_vocabs(train_df)

    model_config['src_vocab_size'] = text_encoder.src_vocab_size
    model_config['tgt_vocab_size'] = text_encoder.tgt_vocab_size
    model_config['max_src_seq_length'] = max(train_df['en'].aggregate(len)) + 2 #including start_token and end_token
    model_config['max_tgt_seq_length'] = max(train_df['ru'].aggregate(len)) + 2

    model = prepare_model(model_config)
    model.to(device)

    #Model training procedure
    optimizer = torch.optim.Adam(model.parameters(), lr=0.)
    n_steps = (len(train_df) // train_config['batch_size'] + 1) * train_config['n_epochs']
    lr_scheduler = LrScheduler(n_steps, **train_config['lr_scheduler'])

    # prepare train data
    source_strings, target_strings = zip(*sorted(zip(source_strings, target_strings),
                                                 key=lambda e: len(e[0])))
    train_dataloader = create_dataloader(source_strings, target_strings, text_encoder,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    # training cycle
    for epoch in range(1,train_config['n_epochs']+1):
        print('\n' + '-'*40)
        print(f'Epoch: {epoch}')
        print(f'Run training...')
        model.train()
        run_epoch(train_dataloader, model,
                  lr_scheduler, optimizer, device=device, verbose=False)
    learnable_params = {
        'model': model,
        'text_encoder': text_encoder,
    }
    return learnable_params

def classify(source_strings, learnable_params):
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    model = learnable_params['model']
    text_encoder = learnable_params['text_encoder']
    batch_size = 200
    dataloader = create_dataloader(source_strings, None, text_encoder,
                                   batch_size, shuffle_batches_each_epoch=False)
    max_decoding_len = model.config['max_tgt_seq_length']
    predictions = generate_predictions(dataloader, max_decoding_len, text_encoder, model, device)
    #return single top1 prediction for each sample
    return np.expand_dims(predictions, 1)

# TODO: MOVE CONFIG TO ARGUMENTS
model_config = {
    'n_layers': 2,
    'n_heads': 2,
    'hidden_size': 128,
    'ff_hidden_size': 256,
    'dropout': {
        'embedding': 0.1,
        'attention': 0.1,
        'residual': 0.1,
        'relu': 0.1
    },
    'pad_idx': 0
}

train_config = {'batch_size': 200, 'n_epochs': 5, 'lr_scheduler': {
    'type': 'warmup,decay_linear',
    'warmup_steps_part': 0.1,
    'lr_peak': 3e-4,
}}
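
The `greedy_decode` function above appends the arg-max token at every step and always runs for `max_len` steps. The same loop can be illustrated without a neural network. The sketch below is only an illustration: `greedy_decode_toy`, `toy_scores`, the `<s>` / `</s>` symbols and the early stop on the end symbol are assumptions made for this example, not part of the notebook's code.

```python
from typing import Callable, Dict, List


def greedy_decode_toy(next_scores: Callable[[List[str]], Dict[str, float]],
                      start: str, end: str, max_len: int) -> List[str]:
    """Repeatedly append the highest-scoring next symbol (greedy search)."""
    sequence = [start]
    for _ in range(max_len):
        scores = next_scores(sequence)
        best = max(scores, key=scores.get)  # arg-max over candidate symbols
        sequence.append(best)
        if best == end:  # early stop: an addition relative to greedy_decode
            break
    return sequence


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical scorer that prefers spelling out 'анна' and then stopping."""
    target = list("анна") + ["</s>"]
    position = len(prefix) - 1  # number of symbols generated so far
    wanted = target[min(position, len(target) - 1)]
    return {symbol: (1.0 if symbol == wanted else 0.0)
            for symbol in ["а", "н", "</s>"]}


decoded = greedy_decode_toy(toy_scores, start="<s>", end="</s>", max_len=10)
print("".join(decoded[1:-1]))  # the text between the start and end symbols
```

Greedy search commits to the single best symbol at each step, so it is fast but cannot recover from an early mistake.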

### Training

In [31]:
PREDS_FNAME = "preds_translit.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [None]:
top_k = 1
part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)

train_ids, train_strings, train_transliterations = part2ixy['train']
print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
st = time.time()
params = train(train_strings, train_transliterations, model_config, train_config)
print('Classifier trained in %.2fs' % (time.time() - st))


Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:17, 29.75it/s]


** End of epoch, accumulated average loss = 2.653342 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 32.14it/s]


** End of epoch, accumulated average loss = 1.083283 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 32.44it/s]


** End of epoch, accumulated average loss = 0.856589 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.55it/s]


** End of epoch, accumulated average loss = 0.773487 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.62it/s]

** End of epoch, accumulated average loss = 0.736542 **
** Elapsed time: 0:00:16**
Classifier trained in 85.37s





In [None]:
allpreds = []
for part, (ids, x, y) in part2ixy.items():
    print('\nClassifying %s set with %d examples ...' % (part, len(x)))
    st = time.time()
    preds = classify(x, params)
    print('%s set classified in %.2fs' % (part, time.time() - st))
    count_of_values = list(map(len, preds))
    assert np.all(np.array(count_of_values) == top_k)
    #score(preds, y)
    allpreds.extend(zip(ids, preds))

save_preds(allpreds, preds_fname=PREDS_FNAME)
print('\nChecking saved predictions ...')
score_preds(preds_path=PREDS_FNAME, data_dir=TRANSLIT_PATH, parts=SCORED_PARTS)


Classifying train set with 105371 examples ...
Using GPU device: cuda


100%|██████████| 527/527 [01:16<00:00,  6.86it/s]


train set classified in 76.84s

Classifying dev set with 26342 examples ...
Using GPU device: cuda


100%|██████████| 132/132 [00:18<00:00,  7.04it/s]


dev set classified in 18.77s

Classifying train_small set with 2000 examples ...
Using GPU device: cuda


100%|██████████| 10/10 [00:01<00:00,  7.16it/s]


train_small set classified in 1.41s

Classifying dev_small set with 2000 examples ...
Using GPU device: cuda


100%|██████████| 10/10 [00:01<00:00,  5.45it/s]


dev_small set classified in 1.84s

Classifying test set with 32926 examples ...
Using GPU device: cuda


100%|██████████| 165/165 [00:23<00:00,  6.97it/s]


test set classified in 23.69s
Predictions saved to preds_translit.tsv

Checking saved predictions ...
train set accuracy@1: 0.42
dev set accuracy@1: 0.42
train_small set accuracy@1: 0.43
dev_small set accuracy@1: 0.43
no labels for test set


{'train': {'acc@1': 0.42294369418530714},
 'dev': {'acc@1': 0.42202566244020956},
 'train_small': {'acc@1': 0.4275},
 'dev_small': {'acc@1': 0.433}}

###  Hyper-parameters choice [5 Points]

The model is ready. Now we need to find the optimal hyper-parameters.

The quality of models with different hyperparameters should be monitored on the dev or dev_small sets (dev_small saves time, since generating transliterations is rather time-consuming, comparable to one training epoch).

To generate predictions, you can use the `generate_predictions` function; to calculate the accuracy@1 metric, you can then use the `compute_metrics` function.
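
For reference, accuracy@1 is the share of examples whose top prediction exactly matches the reference string. The helper below is a minimal stand-alone sketch: `accuracy_at_1` is a hypothetical name, the toy data is made up, and the notebook's own `compute_metrics` may differ in details such as the input format.

```python
from typing import Sequence


def accuracy_at_1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


# Made-up toy data, for illustration only.
preds = ["анна", "джон", "мери"]
refs = ["анна", "джон", "мэри"]
print(accuracy_at_1(preds, refs))  # two of the three strings match exactly
```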



Hyper-parameters are stored in the dictionaries `model_config` and `train_config` passed to the train function. It is suggested to leave the following hyperparameters in `model_config` and `train_config` unmodified:

* n_layers $=$ 2
* n_heads $=$ 2
* hidden_size $=$ 128
* ff_hidden_size $=$ 256
* warmup_steps_part $=$ 0.1
* batch_size $=$ 200

 You can vary the dropout value. The model has 4 types of dropout: ***embedding dropout***, applied to the embeddings before they are sent to the first layer of the Encoder or Decoder; ***attention dropout***, applied to the attention weights in the MultiHeadAttention layer; ***residual dropout***, applied to the output of each sublayer (MultiHeadAttention or FeedForward) in the Encoder and Decoder layers; and, finally, ***relu dropout***, used in the FeedForward layer. For all 4 types it is suggested to test the same dropout value from the list: 0.1, 0.15, 0.2.
 It is also suggested to test several peak learning rate values, **lr_peak**: 5e-4, 1e-3, 2e-3.
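
The suggested values form a 3 x 3 grid. One way to enumerate it is a Cartesian product. The sketch below is a minimal illustration: `make_configs` is a hypothetical helper, the fixed values are copied from the list above, and the default of 5 epochs is an arbitrary placeholder.

```python
import itertools

DROPOUT_VALUES = [0.1, 0.15, 0.2]
LR_PEAK_VALUES = [5e-4, 1e-3, 2e-3]


def make_configs(dropout: float, lr_peak: float, n_epochs: int = 5):
    """Return a fresh (model_config, train_config) pair for one grid point."""
    model_config = {
        'n_layers': 2,
        'n_heads': 2,
        'hidden_size': 128,
        'ff_hidden_size': 256,
        # The same dropout value is used for all four dropout types.
        'dropout': {name: dropout
                    for name in ('embedding', 'attention', 'residual', 'relu')},
        'pad_idx': 0,
    }
    train_config = {
        'batch_size': 200,
        'n_epochs': n_epochs,
        'lr_scheduler': {
            'type': 'warmup,decay_linear',
            'warmup_steps_part': 0.1,
            'lr_peak': lr_peak,
        },
    }
    return model_config, train_config


# Every dropout value is combined with every peak learning rate.
grid = [make_configs(dr, lr)
        for dr, lr in itertools.product(DROPOUT_VALUES, LR_PEAK_VALUES)]
print(len(grid))
```

Building a fresh dictionary per grid point avoids sharing one config object between runs, which matters because `train` writes the vocabulary sizes and maximum sequence lengths into `model_config`.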

Note that on a GPU, training one epoch takes about 1 minute and requires up to 1 GB of video memory. On a CPU, training is about 2 times slower. If you run out of RAM / video memory, reduce the batch size; in that case the optimal range of learning rate values will change and must be determined again. With batch_size $=$ 200, it takes at least 300 epochs to reach accuracy 0.66 on the dev_small dataset.
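
To see how the batch size interacts with the schedule, it helps to count optimizer steps. The sketch below reuses the step formula from `train` and the warmup rule from `LrScheduler`; `count_steps` is a hypothetical helper, and 105371 is the train set size reported in the training log.

```python
def count_steps(n_examples: int, batch_size: int, n_epochs: int,
                warmup_steps_part: float = 0.1):
    """Return (steps per epoch, total steps, warmup steps)."""
    steps_per_epoch = n_examples // batch_size + 1  # same formula as in train()
    n_steps = steps_per_epoch * n_epochs
    warmup_steps = int(n_steps * warmup_steps_part)
    return steps_per_epoch, n_steps, warmup_steps


print(count_steps(105371, batch_size=200, n_epochs=5))  # (527, 2635, 263)
print(count_steps(105371, batch_size=100, n_epochs=5))  # (1054, 5270, 527)
```

The first call gives 527 steps per epoch, which matches the `527it` progress bars in the training logs. Halving the batch size doubles the number of steps, so the same `lr_peak` would be applied over twice as many updates.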

*Question: What are the optimal hyperparameters according to your experiments? Add plots or other descriptions here.*

```
ENTER HERE YOUR ANSWER
```
>In the first experiment, I trained on the full training set with 9 parameter configurations: all 4 dropout types take the same value from the list 0.1, 0.15, 0.2, combined with each value of lr_peak from the list 5e-4, 1e-3, 2e-3. I used only a few training epochs (5) to get a first look at the trend.




In [32]:
import itertools
part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)

In [None]:
# ENTER HERE YOUR CODE

m_configs = []
tr_configs = []

drop_values = [0.1, 0.15, 0.2]
lr_values = [5e-4, 1e-3, 2e-3]

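# zip is only a convenient way to fill both lists in one loop: each model config
# depends only on `dr` and each train config only on `lr`. The full 3x3 grid is
# produced by itertools.product below.
for dr, lr in zip(drop_values, lr_values):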
  model_config = {
    'n_layers': 2,
    'n_heads': 2,
    'hidden_size': 128,
    'ff_hidden_size': 256,
    'dropout': {
        'embedding': dr,
        'attention': dr,
        'residual': dr,
        'relu': dr,
    },
    'pad_idx': 0
  }
  m_configs.append(model_config)
  train_config = {'batch_size': 200, 'n_epochs': 5, 'lr_scheduler': {
    'type': 'warmup,decay_linear',
    'warmup_steps_part': 0.1,
    'lr_peak': lr,
  }}
  tr_configs.append(train_config)

pairs = list(itertools.product(m_configs,tr_configs))

In [None]:
pairs[2]

({'n_layers': 2,
  'n_heads': 2,
  'hidden_size': 128,
  'ff_hidden_size': 256,
  'dropout': {'embedding': 0.1,
   'attention': 0.1,
   'residual': 0.1,
   'relu': 0.1},
  'pad_idx': 0},
 {'batch_size': 200,
  'n_epochs': 5,
  'lr_scheduler': {'type': 'warmup,decay_linear',
   'warmup_steps_part': 0.1,
   'lr_peak': 0.002}})

In [None]:
train_inf = []
for p in pairs:
  train_ids, train_strings, train_transliterations = part2ixy['train']
  print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
  st = time.time()
  params = train(train_strings, train_transliterations, p[0], p[1])
  print('Classifier trained in %.2fs' % (time.time() - st))
  train_inf.append([p[0]['dropout']['embedding'], p[1]['lr_scheduler']['lr_peak'], params])



Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.04it/s]


** End of epoch, accumulated average loss = 2.344500 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.39it/s]


** End of epoch, accumulated average loss = 0.896832 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.15it/s]


** End of epoch, accumulated average loss = 0.708540 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.43it/s]


** End of epoch, accumulated average loss = 0.630422 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.597287 **
** Elapsed time: 0:00:16**
Classifier trained in 81.22s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.55it/s]


** End of epoch, accumulated average loss = 2.056049 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 31.90it/s]


** End of epoch, accumulated average loss = 0.736896 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.581375 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.530962 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.65it/s]


** End of epoch, accumulated average loss = 0.503808 **
** Elapsed time: 0:00:16**
Classifier trained in 80.66s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 1.861498 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.64it/s]


** End of epoch, accumulated average loss = 0.629272 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.42it/s]


** End of epoch, accumulated average loss = 0.519670 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 32.06it/s]


** End of epoch, accumulated average loss = 0.471166 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.92it/s]


** End of epoch, accumulated average loss = 0.440073 **
** Elapsed time: 0:00:16**
Classifier trained in 80.83s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.70it/s]


** End of epoch, accumulated average loss = 2.616484 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.49it/s]


** End of epoch, accumulated average loss = 1.087328 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.56it/s]


** End of epoch, accumulated average loss = 0.865014 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.47it/s]


** End of epoch, accumulated average loss = 0.779786 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 33.34it/s]


** End of epoch, accumulated average loss = 0.740303 **
** Elapsed time: 0:00:16**
Classifier trained in 80.29s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.07it/s]


** End of epoch, accumulated average loss = 2.248438 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 32.60it/s]


** End of epoch, accumulated average loss = 0.888395 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.51it/s]


** End of epoch, accumulated average loss = 0.704420 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.619280 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.77it/s]


** End of epoch, accumulated average loss = 0.582801 **
** Elapsed time: 0:00:16**
Classifier trained in 80.37s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.42it/s]


** End of epoch, accumulated average loss = 2.033858 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.71it/s]


** End of epoch, accumulated average loss = 0.746443 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.32it/s]


** End of epoch, accumulated average loss = 0.598211 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.78it/s]


** End of epoch, accumulated average loss = 0.538719 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 33.52it/s]


** End of epoch, accumulated average loss = 0.506428 **
** Elapsed time: 0:00:16**
Classifier trained in 80.72s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.26it/s]


** End of epoch, accumulated average loss = 2.887235 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.71it/s]


** End of epoch, accumulated average loss = 1.330118 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.58it/s]


** End of epoch, accumulated average loss = 1.020910 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.57it/s]


** End of epoch, accumulated average loss = 0.914971 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 32.96it/s]


** End of epoch, accumulated average loss = 0.869007 **
** Elapsed time: 0:00:16**
Classifier trained in 80.50s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 2.452695 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 32.69it/s]


** End of epoch, accumulated average loss = 1.042375 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.14it/s]


** End of epoch, accumulated average loss = 0.812656 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.57it/s]


** End of epoch, accumulated average loss = 0.713502 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.63it/s]


** End of epoch, accumulated average loss = 0.672918 **
** Elapsed time: 0:00:16**
Classifier trained in 80.58s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:15, 33.13it/s]


** End of epoch, accumulated average loss = 2.177858 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.77it/s]


** End of epoch, accumulated average loss = 0.875314 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.684168 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.82it/s]


** End of epoch, accumulated average loss = 0.616853 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 33.40it/s]

** End of epoch, accumulated average loss = 0.580576 **
** Elapsed time: 0:00:16**
Classifier trained in 80.75s





In [None]:
results = []
for i, experiment in enumerate(train_inf):
  print(i)
  y = part2ixy['dev_small'][2]
  x = part2ixy['dev_small'][1]
  # The original run printed a stale loop variable `part` here (left as 'test' by an
  # earlier loop), which is why the recorded output says "test"; the data is dev_small.
  print('\nClassifying %s set with %d examples ...' % ('dev_small', len(x)))
  st = time.time()

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = experiment[2]['model']
  text_encoder = experiment[2]['text_encoder']
  batch_size = 200
  dataloader = create_dataloader(x, None, text_encoder,
                                  batch_size, shuffle_batches_each_epoch=False)
  max_decoding_len = model.config['max_tgt_seq_length']
  preds = generate_predictions(dataloader, max_decoding_len, text_encoder, model, device)

  print('%s set classified in %.2fs' % ('dev_small', time.time() - st))
  acc = compute_metrics(preds, y, metrics=['acc@1'])
  results.append([experiment[0], experiment[1], acc])

0

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.98it/s]


test set classified in 1.44s
1

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.58it/s]


test set classified in 1.53s
2

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  5.32it/s]


test set classified in 1.89s
3

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.48it/s]


test set classified in 1.55s
4

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.91it/s]


test set classified in 1.46s
5

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  7.12it/s]


test set classified in 1.42s
6

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.96it/s]


test set classified in 1.45s
7

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  7.10it/s]


test set classified in 1.42s
8

Classifying test set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.96it/s]

test set classified in 1.45s





>The results show that accuracy improves as lr_peak increases; the best result is achieved with lr_peak = 0.002 and the smallest dropout value, 0.1.

In [None]:
pd.DataFrame(results, columns=['dropout', 'lr_peak', 'acc'])

|   | dropout | lr_peak | acc |
|---|---------|---------|-----|
| 0 | 0.1 | 0.0005 | {'acc@1': 0.495} |
| 1 | 0.1 | 0.001 | {'acc@1': 0.5435} |
| 2 | 0.1 | 0.002 | {'acc@1': 0.5665} |
| 3 | 0.15 | 0.0005 | {'acc@1': 0.4495} |
| 4 | 0.15 | 0.001 | {'acc@1': 0.513} |
| 5 | 0.15 | 0.002 | {'acc@1': 0.555} |
| 6 | 0.2 | 0.0005 | {'acc@1': 0.4145} |
| 7 | 0.2 | 0.001 | {'acc@1': 0.4785} |
| 8 | 0.2 | 0.002 | {'acc@1': 0.5095} |
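
The best configuration can also be picked programmatically. The sketch below copies the nine rows from the table above into a plain list, in the same `[dropout, lr_peak, acc]` layout as `results`, and sorts them by accuracy@1. `rank_results` and `results_copy` are hypothetical names used only for this illustration.

```python
def rank_results(rows):
    """Sort [dropout, lr_peak, {'acc@1': value}] rows by accuracy, best first."""
    return sorted(rows, key=lambda row: row[2]['acc@1'], reverse=True)


# Values copied from the results table.
results_copy = [
    [0.1, 0.0005, {'acc@1': 0.495}],
    [0.1, 0.001, {'acc@1': 0.5435}],
    [0.1, 0.002, {'acc@1': 0.5665}],
    [0.15, 0.0005, {'acc@1': 0.4495}],
    [0.15, 0.001, {'acc@1': 0.513}],
    [0.15, 0.002, {'acc@1': 0.555}],
    [0.2, 0.0005, {'acc@1': 0.4145}],
    [0.2, 0.001, {'acc@1': 0.4785}],
    [0.2, 0.002, {'acc@1': 0.5095}],
]

best_dropout, best_lr_peak, best_acc = rank_results(results_copy)[0]
print(best_dropout, best_lr_peak, best_acc['acc@1'])  # 0.1 0.002 0.5665
```

Since the best row sits at the upper edge of the tested lr_peak range, it is reasonable to also try values above 2e-3.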


In [33]:
best_model_config = {
  'n_layers': 2,
  'n_heads': 2,
  'hidden_size': 128,
  'ff_hidden_size': 256,
  'dropout': {
      'embedding': 0.1,
      'attention': 0.1,
      'residual': 0.1,
      'relu': 0.1,
  },
  'pad_idx': 0
}

def train_epoch(epoch_num, lr):
  """Build a train config with the given number of epochs and peak learning rate."""

  best_train_config = {'batch_size': 200, 'lr_scheduler': {
    'type': 'warmup,decay_linear',
    'warmup_steps_part': 0.1,
  }}
  best_train_config['n_epochs'] = epoch_num
  best_train_config['lr_scheduler']['lr_peak'] = lr

  return best_train_config

In [35]:
lr_up = [25e-4, 3e-3, 4e-3]

In [37]:
def get_train_inf(dataset, model_config, train_config):
  train_ids, train_strings, train_transliterations = dataset
  print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
  st = time.time()
  params = train(train_strings, train_transliterations, model_config, train_config)
  print('Classifier trained in %.2fs' % (time.time() - st))
  return params

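def get_acc_inf(dataset, default_model, default_text_encoder):
  # dataset is an (ids, strings, transliterations) triple; both dict arguments
  # may be the same params dict returned by get_train_inf.
  x = dataset[1]
  y = dataset[2]
  # Note: the part name in the log messages is hard-coded as 'dev_small'.
  print('\nClassifying %s set with %d examples ...' % ('dev_small', len(x)))
  st = time.time()

  # Fall back to CPU when no GPU is available instead of failing.
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = default_model['model']
  text_encoder = default_text_encoder['text_encoder']
  batch_size = 200
  # No targets are passed: the dataloader is used for inference only.
  dataloader = create_dataloader(x, None, text_encoder,
                                  batch_size, shuffle_batches_each_epoch=False)
  max_decoding_len = model.config['max_tgt_seq_length']
  preds = generate_predictions(dataloader, max_decoding_len, text_encoder, model, device)

  print('%s set classified in %.2fs' % ('dev_small', time.time() - st))
  acc = compute_metrics(preds, y, metrics=['acc@1'])
  return acc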
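
`compute_metrics` is defined elsewhere in the notebook. Assuming `acc@1` is the share of names whose top prediction matches the reference transliteration exactly, it could be computed as in the sketch below; the helper name `exact_match_accuracy` and the toy strings are made up for illustration.

```python
def exact_match_accuracy(preds, targets):
    """Fraction of predictions that equal their reference string exactly."""
    if len(preds) != len(targets):
        raise ValueError('preds and targets must have the same length')
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)


# Toy example: two of the three predictions match their references.
preds = ['анна', 'джон', 'мэри']
targets = ['анна', 'джон', 'мари']
print(exact_match_accuracy(preds, targets))
```

Exact match is a strict criterion: a prediction that differs from the reference by a single character counts as fully wrong.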

In [39]:
EPOCHS = [5]
train_inf = []

for n in EPOCHS:
  for lr in lr_up:
    tr_config = train_epoch(n, lr)
    params = get_train_inf(part2ixy['train'], best_model_config, tr_config)
    train_inf.append([lr, params])


Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:17, 30.06it/s]


** End of epoch, accumulated average loss = 2.127612 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 2
Run training...


527it [00:17, 29.57it/s]


** End of epoch, accumulated average loss = 0.756763 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 3
Run training...


527it [00:17, 30.68it/s]


** End of epoch, accumulated average loss = 0.585037 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:17, 29.50it/s]


** End of epoch, accumulated average loss = 0.526182 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 5
Run training...


527it [00:17, 30.99it/s]


** End of epoch, accumulated average loss = 0.488132 **
** Elapsed time: 0:00:17**
Classifier trained in 88.11s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.61it/s]


** End of epoch, accumulated average loss = 2.002340 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 2
Run training...


527it [00:18, 28.39it/s]


** End of epoch, accumulated average loss = 0.771004 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.60it/s]


** End of epoch, accumulated average loss = 0.569835 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.40it/s]


** End of epoch, accumulated average loss = 0.499266 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 31.13it/s]


** End of epoch, accumulated average loss = 0.458923 **
** Elapsed time: 0:00:17**
Classifier trained in 86.32s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.55it/s]


** End of epoch, accumulated average loss = 1.819490 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 32.31it/s]


** End of epoch, accumulated average loss = 0.716241 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 3
Run training...


527it [00:18, 28.53it/s]


** End of epoch, accumulated average loss = 0.581829 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.72it/s]


** End of epoch, accumulated average loss = 0.511786 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 31.73it/s]

** End of epoch, accumulated average loss = 0.470952 **
** Elapsed time: 0:00:17**
Classifier trained in 85.80s





In [40]:
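res_2 = []

for inf in train_inf:
  # inf is [lr_peak, params]; params holds both the model and the text encoder,
  # so it is passed for both arguments.
  acc = get_acc_inf(part2ixy['dev_small'], inf[1], inf[1])
  res_2.append([inf[0], acc])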


Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.32it/s]


dev_small set classified in 1.59s

Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.64it/s]


dev_small set classified in 1.52s

Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.92it/s]

dev_small set classified in 1.45s





>Increasing lr_peak further (0.0025, 0.003, 0.004) did not improve quality, so 0.002 remains the optimal value.

In [41]:
pd.DataFrame(res_2, columns=['lr_peak', 'acc'])

| | lr_peak | acc |
|---|---|---|
| 0 | 0.0025 | {'acc@1': 0.543} |
| 1 | 0.003 | {'acc@1': 0.5485} |
| 2 | 0.004 | {'acc@1': 0.5565} |


>The next experiments increase the number of epochs with the chosen dropout and lr_peak values.

In [None]:
EPOCHS = [25, 50, 75, 100]
train_inf = []
for n in EPOCHS:
  tr_config = train_epoch(n, lr=2e-3)
  params = get_train_inf(part2ixy['train'], best_model_config, tr_config)
  train_inf.append([n, params])


Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 2.465377 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:17, 30.36it/s]


** End of epoch, accumulated average loss = 0.847399 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 3
Run training...


527it [00:17, 30.21it/s]


** End of epoch, accumulated average loss = 0.596047 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.75it/s]


** End of epoch, accumulated average loss = 0.511314 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:17, 30.77it/s]


** End of epoch, accumulated average loss = 0.466210 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 6
Run training...


527it [00:17, 29.42it/s]


** End of epoch, accumulated average loss = 0.440886 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 7
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 0.418229 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 8
Run training...


527it [00:18, 29.08it/s]


** End of epoch, accumulated average loss = 0.403063 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 9
Run training...


527it [00:16, 31.61it/s]


** End of epoch, accumulated average loss = 0.385671 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 31.77it/s]


** End of epoch, accumulated average loss = 0.376399 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 11
Run training...


527it [00:16, 31.54it/s]


** End of epoch, accumulated average loss = 0.366251 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 12
Run training...


527it [00:17, 30.69it/s]


** End of epoch, accumulated average loss = 0.356687 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 13
Run training...


527it [00:16, 31.57it/s]


** End of epoch, accumulated average loss = 0.348644 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 14
Run training...


527it [00:16, 31.26it/s]


** End of epoch, accumulated average loss = 0.340949 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 15
Run training...


527it [00:17, 29.99it/s]


** End of epoch, accumulated average loss = 0.333870 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 31.76it/s]


** End of epoch, accumulated average loss = 0.326883 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 17
Run training...


527it [00:16, 31.67it/s]


** End of epoch, accumulated average loss = 0.320909 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 18
Run training...


527it [00:17, 30.00it/s]


** End of epoch, accumulated average loss = 0.314368 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 19
Run training...


527it [00:16, 32.13it/s]


** End of epoch, accumulated average loss = 0.308462 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 20
Run training...


527it [00:16, 31.65it/s]


** End of epoch, accumulated average loss = 0.303856 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 21
Run training...


527it [00:17, 30.44it/s]


** End of epoch, accumulated average loss = 0.298172 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 22
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 0.293439 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 23
Run training...


527it [00:17, 29.61it/s]


** End of epoch, accumulated average loss = 0.288732 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 24
Run training...


527it [00:17, 30.27it/s]


** End of epoch, accumulated average loss = 0.284864 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 25
Run training...


527it [00:16, 31.58it/s]


** End of epoch, accumulated average loss = 0.281162 **
** Elapsed time: 0:00:17**
Classifier trained in 426.36s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.77it/s]


** End of epoch, accumulated average loss = 2.894254 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 2
Run training...


527it [00:17, 29.94it/s]


** End of epoch, accumulated average loss = 1.032438 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.38it/s]


** End of epoch, accumulated average loss = 0.689124 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.87it/s]


** End of epoch, accumulated average loss = 0.561719 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:17, 29.68it/s]


** End of epoch, accumulated average loss = 0.548589 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 6
Run training...


527it [00:16, 31.73it/s]


** End of epoch, accumulated average loss = 0.479958 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 7
Run training...


527it [00:17, 30.96it/s]


** End of epoch, accumulated average loss = 0.447160 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 8
Run training...


527it [00:18, 29.26it/s]


** End of epoch, accumulated average loss = 0.422911 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 9
Run training...


527it [00:16, 31.13it/s]


** End of epoch, accumulated average loss = 0.406103 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 31.27it/s]


** End of epoch, accumulated average loss = 0.398035 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 11
Run training...


527it [00:17, 30.18it/s]


** End of epoch, accumulated average loss = 0.385898 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 12
Run training...


527it [00:17, 29.81it/s]


** End of epoch, accumulated average loss = 0.378140 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 13
Run training...


527it [00:16, 31.46it/s]


** End of epoch, accumulated average loss = 0.373006 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 14
Run training...


527it [00:17, 30.32it/s]


** End of epoch, accumulated average loss = 0.364353 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 15
Run training...


527it [00:16, 31.19it/s]


** End of epoch, accumulated average loss = 0.358710 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 31.18it/s]


** End of epoch, accumulated average loss = 0.351756 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 17
Run training...


527it [00:17, 30.34it/s]


** End of epoch, accumulated average loss = 0.347738 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 18
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 0.344469 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 19
Run training...


527it [00:16, 31.96it/s]


** End of epoch, accumulated average loss = 0.337693 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 20
Run training...


527it [00:17, 29.69it/s]


** End of epoch, accumulated average loss = 0.332371 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 21
Run training...


527it [00:17, 30.99it/s]


** End of epoch, accumulated average loss = 0.328296 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 22
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 0.325059 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 23
Run training...


527it [00:17, 30.71it/s]


** End of epoch, accumulated average loss = 0.321256 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 24
Run training...


527it [00:16, 31.52it/s]


** End of epoch, accumulated average loss = 0.317569 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 25
Run training...


527it [00:16, 31.95it/s]


** End of epoch, accumulated average loss = 0.313697 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 26
Run training...


527it [00:17, 29.92it/s]


** End of epoch, accumulated average loss = 0.311753 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 27
Run training...


527it [00:17, 29.63it/s]


** End of epoch, accumulated average loss = 0.307297 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 28
Run training...


527it [00:17, 30.47it/s]


** End of epoch, accumulated average loss = 0.304648 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 29
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 0.301095 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 30
Run training...


527it [00:16, 31.49it/s]


** End of epoch, accumulated average loss = 0.298201 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 31
Run training...


527it [00:17, 30.81it/s]


** End of epoch, accumulated average loss = 0.296511 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 32
Run training...


527it [00:17, 30.92it/s]


** End of epoch, accumulated average loss = 0.293298 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 33
Run training...


527it [00:16, 31.56it/s]


** End of epoch, accumulated average loss = 0.290154 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 34
Run training...


527it [00:17, 30.10it/s]


** End of epoch, accumulated average loss = 0.287717 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 35
Run training...


527it [00:16, 31.06it/s]


** End of epoch, accumulated average loss = 0.285075 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 36
Run training...


527it [00:16, 31.72it/s]


** End of epoch, accumulated average loss = 0.283020 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 37
Run training...


527it [00:17, 29.95it/s]


** End of epoch, accumulated average loss = 0.280364 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 38
Run training...


527it [00:17, 30.96it/s]


** End of epoch, accumulated average loss = 0.278115 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 39
Run training...


527it [00:17, 30.66it/s]


** End of epoch, accumulated average loss = 0.275960 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 40
Run training...


527it [00:17, 29.79it/s]


** End of epoch, accumulated average loss = 0.273329 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 41
Run training...


527it [00:18, 29.21it/s]


** End of epoch, accumulated average loss = 0.270821 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 42
Run training...


527it [00:16, 31.81it/s]


** End of epoch, accumulated average loss = 0.268493 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 43
Run training...


527it [00:16, 31.03it/s]


** End of epoch, accumulated average loss = 0.266239 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 44
Run training...


527it [00:17, 30.51it/s]


** End of epoch, accumulated average loss = 0.264253 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 45
Run training...


527it [00:16, 32.24it/s]


** End of epoch, accumulated average loss = 0.262892 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 46
Run training...


527it [00:17, 30.99it/s]


** End of epoch, accumulated average loss = 0.261341 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 47
Run training...


527it [00:17, 30.98it/s]


** End of epoch, accumulated average loss = 0.258602 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 48
Run training...


527it [00:16, 32.03it/s]


** End of epoch, accumulated average loss = 0.257732 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 49
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 0.255822 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 50
Run training...


527it [00:17, 30.40it/s]


** End of epoch, accumulated average loss = 0.253808 **
** Elapsed time: 0:00:17**
Classifier trained in 855.00s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 2.995074 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 31.94it/s]


** End of epoch, accumulated average loss = 1.209765 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 3
Run training...


527it [00:17, 30.56it/s]


** End of epoch, accumulated average loss = 0.788834 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 31.95it/s]


** End of epoch, accumulated average loss = 0.607500 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 5
Run training...


527it [00:16, 32.18it/s]


** End of epoch, accumulated average loss = 0.540330 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 6
Run training...


527it [00:18, 28.23it/s]


** End of epoch, accumulated average loss = 0.493954 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 7
Run training...


527it [00:16, 32.28it/s]


** End of epoch, accumulated average loss = 0.461704 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 8
Run training...


527it [00:16, 32.16it/s]


** End of epoch, accumulated average loss = 0.444853 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 9
Run training...


527it [00:17, 30.30it/s]


** End of epoch, accumulated average loss = 0.426248 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 32.25it/s]


** End of epoch, accumulated average loss = 0.407679 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 11
Run training...


527it [00:16, 31.91it/s]


** End of epoch, accumulated average loss = 0.392379 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 12
Run training...


527it [00:16, 31.10it/s]


** End of epoch, accumulated average loss = 0.386126 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 13
Run training...


527it [00:16, 31.37it/s]


** End of epoch, accumulated average loss = 0.374383 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 14
Run training...


527it [00:17, 30.81it/s]


** End of epoch, accumulated average loss = 0.379746 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 15
Run training...


527it [00:17, 29.93it/s]


** End of epoch, accumulated average loss = 0.362384 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 31.08it/s]


** End of epoch, accumulated average loss = 0.355134 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 17
Run training...


527it [00:16, 32.45it/s]


** End of epoch, accumulated average loss = 0.351990 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 18
Run training...


527it [00:16, 31.84it/s]


** End of epoch, accumulated average loss = 0.347813 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 19
Run training...


527it [00:17, 30.60it/s]


** End of epoch, accumulated average loss = 0.342389 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 20
Run training...


527it [00:16, 31.92it/s]


** End of epoch, accumulated average loss = 0.338494 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 21
Run training...


527it [00:16, 31.69it/s]


** End of epoch, accumulated average loss = 0.334339 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 22
Run training...


527it [00:18, 28.69it/s]


** End of epoch, accumulated average loss = 0.329658 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 23
Run training...


527it [00:16, 31.69it/s]


** End of epoch, accumulated average loss = 0.328463 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 24
Run training...


527it [00:16, 32.32it/s]


** End of epoch, accumulated average loss = 0.322998 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 25
Run training...


527it [00:17, 30.67it/s]


** End of epoch, accumulated average loss = 0.320061 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 26
Run training...


527it [00:16, 31.95it/s]


** End of epoch, accumulated average loss = 0.317819 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 27
Run training...


527it [00:16, 32.19it/s]


** End of epoch, accumulated average loss = 0.316004 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 28
Run training...


527it [00:17, 30.40it/s]


** End of epoch, accumulated average loss = 0.311242 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 29
Run training...


527it [00:16, 32.20it/s]


** End of epoch, accumulated average loss = 0.310383 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 30
Run training...


527it [00:16, 32.11it/s]


** End of epoch, accumulated average loss = 0.306165 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 31
Run training...


527it [00:17, 30.50it/s]


** End of epoch, accumulated average loss = 0.304110 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 32
Run training...


527it [00:16, 31.87it/s]


** End of epoch, accumulated average loss = 0.302534 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 33
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.300445 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 34
Run training...


527it [00:16, 31.77it/s]


** End of epoch, accumulated average loss = 0.298726 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 35
Run training...


527it [00:17, 30.99it/s]


** End of epoch, accumulated average loss = 0.295232 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 36
Run training...


527it [00:16, 31.78it/s]


** End of epoch, accumulated average loss = 0.292478 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 37
Run training...


527it [00:17, 30.12it/s]


** End of epoch, accumulated average loss = 0.290520 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 38
Run training...


527it [00:17, 30.55it/s]


** End of epoch, accumulated average loss = 0.289791 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 39
Run training...


527it [00:16, 32.47it/s]


** End of epoch, accumulated average loss = 0.286655 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 40
Run training...


527it [00:16, 32.26it/s]


** End of epoch, accumulated average loss = 0.285104 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 41
Run training...


527it [00:17, 30.45it/s]


** End of epoch, accumulated average loss = 0.283180 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 42
Run training...


527it [00:16, 32.30it/s]


** End of epoch, accumulated average loss = 0.281575 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 43
Run training...


527it [00:16, 32.21it/s]


** End of epoch, accumulated average loss = 0.279617 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 44
Run training...


527it [00:16, 31.00it/s]


** End of epoch, accumulated average loss = 0.277508 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 45
Run training...


527it [00:16, 31.92it/s]


** End of epoch, accumulated average loss = 0.276449 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 46
Run training...


527it [00:16, 32.69it/s]


** End of epoch, accumulated average loss = 0.274000 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 47
Run training...


527it [00:16, 32.37it/s]


** End of epoch, accumulated average loss = 0.272081 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 48
Run training...


527it [00:17, 30.58it/s]


** End of epoch, accumulated average loss = 0.271594 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 49
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.269610 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 50
Run training...


527it [00:16, 31.90it/s]


** End of epoch, accumulated average loss = 0.267757 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 51
Run training...


527it [00:17, 30.70it/s]


** End of epoch, accumulated average loss = 0.265985 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 52
Run training...


527it [00:17, 30.04it/s]


** End of epoch, accumulated average loss = 0.264574 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 53
Run training...


527it [00:16, 31.93it/s]


** End of epoch, accumulated average loss = 0.263115 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 54
Run training...


527it [00:17, 30.75it/s]


** End of epoch, accumulated average loss = 0.260926 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 55
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.259598 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 56
Run training...


527it [00:16, 32.34it/s]


** End of epoch, accumulated average loss = 0.257834 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 57
Run training...


527it [00:16, 31.93it/s]


** End of epoch, accumulated average loss = 0.256375 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 58
Run training...


527it [00:17, 31.00it/s]


** End of epoch, accumulated average loss = 0.254941 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 59
Run training...


527it [00:16, 31.80it/s]


** End of epoch, accumulated average loss = 0.253543 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 60
Run training...


527it [00:16, 31.78it/s]


** End of epoch, accumulated average loss = 0.251826 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 61
Run training...


527it [00:17, 30.39it/s]


** End of epoch, accumulated average loss = 0.250764 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 62
Run training...


527it [00:16, 31.95it/s]


** End of epoch, accumulated average loss = 0.248635 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 63
Run training...


527it [00:16, 31.92it/s]


** End of epoch, accumulated average loss = 0.247851 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 64
Run training...


527it [00:17, 30.61it/s]


** End of epoch, accumulated average loss = 0.246003 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 65
Run training...


527it [00:16, 31.93it/s]


** End of epoch, accumulated average loss = 0.245327 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 66
Run training...


527it [00:16, 31.97it/s]


** End of epoch, accumulated average loss = 0.243178 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 67
Run training...


527it [00:18, 28.55it/s]


** End of epoch, accumulated average loss = 0.242121 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 68
Run training...


527it [00:16, 32.24it/s]


** End of epoch, accumulated average loss = 0.241117 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 69
Run training...


527it [00:16, 32.33it/s]


** End of epoch, accumulated average loss = 0.239798 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 70
Run training...


527it [00:17, 30.69it/s]


** End of epoch, accumulated average loss = 0.238009 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 71
Run training...


527it [00:16, 31.73it/s]


** End of epoch, accumulated average loss = 0.237117 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 72
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 0.235858 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 73
Run training...


527it [00:16, 32.09it/s]


** End of epoch, accumulated average loss = 0.234690 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 74
Run training...


527it [00:17, 30.60it/s]


** End of epoch, accumulated average loss = 0.234230 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 75
Run training...


527it [00:16, 31.80it/s]


** End of epoch, accumulated average loss = 0.233189 **
** Elapsed time: 0:00:17**
Classifier trained in 1259.95s

Training classifier on 105371 examples from train set ...
Using GPU device: cuda

----------------------------------------
Epoch: 1
Run training...


527it [00:16, 32.28it/s]


** End of epoch, accumulated average loss = 3.135261 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 2
Run training...


527it [00:17, 30.68it/s]


** End of epoch, accumulated average loss = 1.446657 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 3
Run training...


527it [00:16, 31.85it/s]


** End of epoch, accumulated average loss = 0.835892 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 32.60it/s]


** End of epoch, accumulated average loss = 0.641700 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 5
Run training...


527it [00:17, 30.25it/s]


** End of epoch, accumulated average loss = 0.561874 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 6
Run training...


527it [00:16, 32.27it/s]


** End of epoch, accumulated average loss = 0.509329 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 7
Run training...


527it [00:16, 31.98it/s]


** End of epoch, accumulated average loss = 0.476413 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 8
Run training...


527it [00:18, 28.92it/s]


** End of epoch, accumulated average loss = 0.460275 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 9
Run training...


527it [00:16, 32.18it/s]


** End of epoch, accumulated average loss = 0.438691 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 32.00it/s]


** End of epoch, accumulated average loss = 0.426470 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 11
Run training...


527it [00:16, 31.78it/s]


** End of epoch, accumulated average loss = 0.414112 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 12
Run training...


527it [00:17, 30.29it/s]


** End of epoch, accumulated average loss = 0.401518 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 13
Run training...


527it [00:16, 32.04it/s]


** End of epoch, accumulated average loss = 0.392812 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 14
Run training...


527it [00:16, 32.13it/s]


** End of epoch, accumulated average loss = 0.381178 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 15
Run training...


527it [00:17, 30.10it/s]


** End of epoch, accumulated average loss = 0.375585 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 32.12it/s]


** End of epoch, accumulated average loss = 0.368691 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 17
Run training...


527it [00:16, 31.76it/s]


** End of epoch, accumulated average loss = 0.362852 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 18
Run training...


527it [00:17, 30.93it/s]


** End of epoch, accumulated average loss = 0.357204 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 19
Run training...


527it [00:16, 32.12it/s]


** End of epoch, accumulated average loss = 0.353634 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 20
Run training...


527it [00:16, 31.55it/s]


** End of epoch, accumulated average loss = 0.350658 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 21
Run training...


527it [00:17, 30.54it/s]


** End of epoch, accumulated average loss = 0.342369 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 22
Run training...


527it [00:16, 31.60it/s]


** End of epoch, accumulated average loss = 0.341380 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 23
Run training...


527it [00:17, 29.90it/s]


** End of epoch, accumulated average loss = 0.337495 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 24
Run training...


527it [00:17, 30.82it/s]


** End of epoch, accumulated average loss = 0.334924 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 25
Run training...


527it [00:16, 31.28it/s]


** End of epoch, accumulated average loss = 0.330931 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 26
Run training...


527it [00:16, 32.02it/s]


** End of epoch, accumulated average loss = 0.326297 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 27
Run training...


527it [00:17, 30.92it/s]


** End of epoch, accumulated average loss = 0.326774 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 28
Run training...


527it [00:17, 30.62it/s]


** End of epoch, accumulated average loss = 0.322569 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 29
Run training...


527it [00:16, 32.08it/s]


** End of epoch, accumulated average loss = 0.321008 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 30
Run training...


527it [00:16, 31.14it/s]


** End of epoch, accumulated average loss = 0.319320 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 31
Run training...


527it [00:17, 30.63it/s]


** End of epoch, accumulated average loss = 0.316369 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 32
Run training...


527it [00:16, 32.11it/s]


** End of epoch, accumulated average loss = 0.313444 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 33
Run training...


527it [00:16, 32.19it/s]


** End of epoch, accumulated average loss = 0.310907 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 34
Run training...


527it [00:17, 30.20it/s]


** End of epoch, accumulated average loss = 0.311219 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 35
Run training...


527it [00:16, 31.60it/s]


** End of epoch, accumulated average loss = 0.308150 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 36
Run training...


527it [00:16, 32.15it/s]


** End of epoch, accumulated average loss = 0.304730 **
** Elapsed time: 0:00:16**

----------------------------------------
Epoch: 37
Run training...


527it [00:17, 30.63it/s]


** End of epoch, accumulated average loss = 0.303782 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 38
Run training...


527it [00:17, 29.43it/s]


** End of epoch, accumulated average loss = 0.301840 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 39
Run training...


527it [00:17, 30.29it/s]


** End of epoch, accumulated average loss = 0.299935 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 40
Run training...


527it [00:17, 29.44it/s]


** End of epoch, accumulated average loss = 0.298416 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 41
Run training...


527it [00:18, 28.67it/s]


** End of epoch, accumulated average loss = 0.297145 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 42
Run training...


527it [00:18, 27.93it/s]


** End of epoch, accumulated average loss = 0.295811 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 43
Run training...


527it [00:18, 28.37it/s]


** End of epoch, accumulated average loss = 0.293410 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 44
Run training...


527it [00:16, 31.08it/s]


** End of epoch, accumulated average loss = 0.291647 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 45
Run training...


527it [00:17, 30.62it/s]


** End of epoch, accumulated average loss = 0.290348 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 46
Run training...


527it [00:17, 29.92it/s]


** End of epoch, accumulated average loss = 0.289104 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 47
Run training...


527it [00:17, 30.20it/s]


** End of epoch, accumulated average loss = 0.286589 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 48
Run training...


527it [00:17, 29.76it/s]


** End of epoch, accumulated average loss = 0.286315 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 49
Run training...


527it [00:17, 29.77it/s]


** End of epoch, accumulated average loss = 0.284354 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 50
Run training...


527it [00:17, 29.50it/s]


** End of epoch, accumulated average loss = 0.283283 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 51
Run training...


527it [00:18, 28.62it/s]


** End of epoch, accumulated average loss = 0.281288 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 52
Run training...


527it [00:18, 27.88it/s]


** End of epoch, accumulated average loss = 0.280049 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 53
Run training...


527it [00:17, 30.31it/s]


** End of epoch, accumulated average loss = 0.278637 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 54
Run training...


527it [00:18, 28.83it/s]


** End of epoch, accumulated average loss = 0.277557 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 55
Run training...


527it [00:16, 31.39it/s]


** End of epoch, accumulated average loss = 0.276222 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 56
Run training...


527it [00:16, 31.35it/s]


** End of epoch, accumulated average loss = 0.274953 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 57
Run training...


527it [00:17, 29.28it/s]


** End of epoch, accumulated average loss = 0.273087 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 58
Run training...


527it [00:16, 31.12it/s]


** End of epoch, accumulated average loss = 0.271642 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 59
Run training...


527it [00:17, 30.87it/s]


** End of epoch, accumulated average loss = 0.270810 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 60
Run training...


527it [00:17, 29.84it/s]


** End of epoch, accumulated average loss = 0.269095 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 61
Run training...


527it [00:16, 31.31it/s]


** End of epoch, accumulated average loss = 0.268271 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 62
Run training...


527it [00:17, 30.90it/s]


** End of epoch, accumulated average loss = 0.266544 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 63
Run training...


527it [00:17, 29.71it/s]


** End of epoch, accumulated average loss = 0.265826 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 64
Run training...


527it [00:17, 30.88it/s]


** End of epoch, accumulated average loss = 0.264387 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 65
Run training...


527it [00:17, 30.90it/s]


** End of epoch, accumulated average loss = 0.262545 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 66
Run training...


527it [00:18, 29.07it/s]


** End of epoch, accumulated average loss = 0.262011 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 67
Run training...


527it [00:17, 29.41it/s]


** End of epoch, accumulated average loss = 0.261046 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 68
Run training...


527it [00:17, 29.74it/s]


** End of epoch, accumulated average loss = 0.259014 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 69
Run training...


527it [00:17, 30.19it/s]


** End of epoch, accumulated average loss = 0.258159 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 70
Run training...


527it [00:16, 31.46it/s]


** End of epoch, accumulated average loss = 0.257359 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 71
Run training...


527it [00:18, 29.26it/s]


** End of epoch, accumulated average loss = 0.256103 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 72
Run training...


527it [00:16, 31.27it/s]


** End of epoch, accumulated average loss = 0.255910 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 73
Run training...


527it [00:16, 31.40it/s]


** End of epoch, accumulated average loss = 0.253330 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 74
Run training...


527it [00:18, 29.20it/s]


** End of epoch, accumulated average loss = 0.253230 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 75
Run training...


527it [00:16, 31.01it/s]


** End of epoch, accumulated average loss = 0.251701 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 76
Run training...


527it [00:16, 31.07it/s]


** End of epoch, accumulated average loss = 0.250519 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 77
Run training...


527it [00:17, 29.63it/s]


** End of epoch, accumulated average loss = 0.249034 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 78
Run training...


527it [00:16, 31.22it/s]


** End of epoch, accumulated average loss = 0.248064 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 79
Run training...


527it [00:17, 30.84it/s]


** End of epoch, accumulated average loss = 0.246772 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 80
Run training...


527it [00:17, 29.59it/s]


** End of epoch, accumulated average loss = 0.246207 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 81
Run training...


527it [00:18, 29.07it/s]


** End of epoch, accumulated average loss = 0.244291 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 82
Run training...


527it [00:17, 30.85it/s]


** End of epoch, accumulated average loss = 0.243316 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 83
Run training...


527it [00:18, 27.80it/s]


** End of epoch, accumulated average loss = 0.242705 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 84
Run training...


527it [00:17, 29.83it/s]


** End of epoch, accumulated average loss = 0.241915 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 85
Run training...


527it [00:17, 29.49it/s]


** End of epoch, accumulated average loss = 0.240508 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 86
Run training...


527it [00:18, 28.47it/s]


** End of epoch, accumulated average loss = 0.239786 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 87
Run training...


527it [00:17, 29.76it/s]


** End of epoch, accumulated average loss = 0.238552 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 88
Run training...


527it [00:18, 27.97it/s]


** End of epoch, accumulated average loss = 0.237937 **
** Elapsed time: 0:00:19**

----------------------------------------
Epoch: 89
Run training...


527it [00:17, 29.92it/s]


** End of epoch, accumulated average loss = 0.235947 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 90
Run training...


527it [00:17, 30.51it/s]


** End of epoch, accumulated average loss = 0.235323 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 91
Run training...


527it [00:18, 28.70it/s]


** End of epoch, accumulated average loss = 0.234406 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 92
Run training...


527it [00:17, 30.19it/s]


** End of epoch, accumulated average loss = 0.234076 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 93
Run training...


527it [00:17, 30.07it/s]


** End of epoch, accumulated average loss = 0.233499 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 94
Run training...


527it [00:17, 29.62it/s]


** End of epoch, accumulated average loss = 0.232391 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 95
Run training...


527it [00:17, 29.38it/s]


** End of epoch, accumulated average loss = 0.231561 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 96
Run training...


527it [00:17, 29.41it/s]


** End of epoch, accumulated average loss = 0.229930 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 97
Run training...


527it [00:16, 31.30it/s]


** End of epoch, accumulated average loss = 0.228884 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 98
Run training...


527it [00:16, 31.09it/s]


** End of epoch, accumulated average loss = 0.228806 **
** Elapsed time: 0:00:17**

----------------------------------------
Epoch: 99
Run training...


527it [00:17, 29.35it/s]


** End of epoch, accumulated average loss = 0.227775 **
** Elapsed time: 0:00:18**

----------------------------------------
Epoch: 100
Run training...


527it [00:16, 31.06it/s]

** End of epoch, accumulated average loss = 0.227106 **
** Elapsed time: 0:00:17**
Classifier trained in 1734.13s





In [None]:
res_3 = []
for inf in train_inf:
  acc = get_acc_inf(part2ixy['dev_small'], inf[1], inf[1])
  res_3.append([inf[0], acc])


Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.13it/s]


dev_small set classified in 1.64s

Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.99it/s]


dev_small set classified in 1.44s

Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.93it/s]


dev_small set classified in 1.45s

Classifying dev_small set with 2000 examples ...


100%|██████████| 10/10 [00:01<00:00,  6.91it/s]

dev_small set classified in 1.46s





>Increasing the number of epochs gave the best quality at 100 epochs.
Overall, accuracy improved from 0.433 to 0.6835.

In [None]:
pd.DataFrame(res_3, columns=['n_epoch', 'acc'])

| | n_epoch | acc |
|---|---|---|
| 0 | 25 | {'acc@1': 0.655} |
| 1 | 50 | {'acc@1': 0.679} |
| 2 | 75 | {'acc@1': 0.678} |
| 3 | 100 | {'acc@1': 0.6835} |


## Label smoothing [1 Point]

Now imagine that we have a prediction vector of probabilities at position $t$ in the token sequence, with one component for each token id in the vocabulary. Cross-entropy compares it with the one-hot ground-truth representation

$$[0, \ldots, 0, 1, 0, \ldots, 0].$$

Now imagine that we slightly "smooth" the values in the ground-truth vector and obtain

$$\left[\frac{\alpha}{|V|}, \ldots, \frac{\alpha}{|V|}, (1-\alpha)+\frac{\alpha}{|V|}, \frac{\alpha}{|V|}, \ldots, \frac{\alpha}{|V|}\right],$$

where $\alpha$ is a parameter between 0 and 1 and $|V|$ is the vocabulary size, i.e. the number of components in the ground-truth vector. The values of this new vector still sum to 1. Compute the cross-entropy between our prediction vector and the new ground truth. Now, first, the cross-entropy will never reach 0 (for $\alpha > 0$). Second, the loss still requires the model to give the correct token the highest probability among all components of the prediction vector, but not too high a probability, because as that probability approaches 1 the loss increases. For research on the use of label smoothing, see the [paper](https://arxiv.org/abs/1906.02629).
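
The behaviour described above can be checked numerically. The following is a minimal sketch in plain Python, not part of the assignment code: the helper names `smooth_targets` and `cross_entropy` and the toy vocabulary of 4 tokens are assumptions made for illustration only.

```python
import math


def smooth_targets(true_id, vocab_size, alpha):
    """Smoothed ground truth: alpha/|V| everywhere, plus (1 - alpha) on the correct token."""
    target = [alpha / vocab_size] * vocab_size
    target[true_id] += 1.0 - alpha
    return target


def cross_entropy(pred_probs, target_probs):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


# Toy setup (hypothetical values): 4 tokens, the correct one has id 2.
vocab_size, true_id, alpha = 4, 2, 0.1
target = smooth_targets(true_id, vocab_size, alpha)

moderate = [0.05, 0.05, 0.85, 0.05]          # confident, but not extreme
extreme = [1e-6, 1e-6, 1 - 3e-6, 1e-6]       # almost all mass on the correct token

loss_moderate = cross_entropy(moderate, target)
loss_extreme = cross_entropy(extreme, target)
loss_optimal = cross_entropy(target, target)  # prediction equal to the smoothed target

print(f"moderate: {loss_moderate:.4f}, extreme: {loss_extreme:.4f}, optimal: {loss_optimal:.4f}")
```

With smoothed targets, the extreme prediction is penalised more than the moderate one, and even the best possible prediction (equal to the smoothed target) has a strictly positive loss.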
    
Accordingly, to embed label smoothing into the model, you need to apply the transformation described above to the ground-truth vectors and to implement the cross-entropy calculation yourself. The `torch.nn.CrossEntropyLoss` class used in the model is not quite suitable as is, because its `__call__` method takes the id of the correct token as the ground truth and builds the one-hot vector internally. However, you can implement what is required based on the internal implementation of this class: [CrossEntropyLoss](https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss).
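
One possible way to structure such a loss is sketched below. This is an illustrative sketch, not the reference solution: the function name `label_smoothing_loss` and its `ignore_index` argument are hypothetical, and the logits are assumed to be already flattened to shape `(N, |V|)`. Recent PyTorch releases (1.10 and later) also accept a `label_smoothing` argument in `torch.nn.CrossEntropyLoss`, which can serve as a reference to check a manual implementation against.

```python
import torch
import torch.nn.functional as F


def label_smoothing_loss(logits, target_ids, alpha=0.1, ignore_index=None):
    """Cross-entropy against smoothed targets (hypothetical helper).

    logits:     (N, V) unnormalized scores
    target_ids: (N,) int64 ids of the correct tokens
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Smoothed ground truth: alpha/|V| everywhere,
    # (1 - alpha) + alpha/|V| at the position of the correct token.
    smoothed = torch.full_like(log_probs, alpha / vocab_size)
    smoothed.scatter_(1, target_ids.unsqueeze(1), 1.0 - alpha + alpha / vocab_size)

    # Per-token cross-entropy between the smoothed target and the prediction.
    per_token = -(smoothed * log_probs).sum(dim=-1)

    if ignore_index is None:
        return per_token.mean()

    # Average only over positions whose target is not the ignored id (e.g. padding).
    keep = (target_ids != ignore_index).to(per_token.dtype)
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)


# Toy check on random data (shapes are illustrative).
torch.manual_seed(0)
logits = torch.randn(6, 10)
target_ids = torch.randint(0, 10, (6,))
loss = label_smoothing_loss(logits, target_ids, alpha=0.1)
print(loss.item())
```

If the model produces logits of shape `(batch, seq_len, |V|)`, they would need to be reshaped to `(batch * seq_len, |V|)`, with the targets flattened to match, before calling a function like this.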
    

Test different values of $\alpha$ (e.g., 0.05, 0.1, 0.2). Describe your experiments and results.


```

ENTER HERE YOUR ANSWER

```

In [None]:
# ENTER HERE YOUR CODE

# BONUS: Additional Experiments [Up to 2 Points]

Be creative and run additional experiments that change the Transformer architecture to get a better score. You may want to dive into:

1) Reading papers

2) Playing with normalization

3) Playing with positional encoding

4) Trying different schedulers

5) Maybe altering the architecture a lot :)

Fair and interesting experiments will be highly rewarded even if the results are not successful. Nevertheless, you need to report them properly.