# Installation

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from transformers import BertTokenizer, BertForMaskedLM
from transformers import AdamW
from tqdm import tqdm  # for our progress bar
from torch.nn import functional as func
import torch
import pickle

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/curated-dataset/Toxic_NonToxic_Comments.csv
/kaggle/input/wordtoxicityscores/wordtoxicities.csv


# Data Import

In [None]:
comments_df = pd.read_csv('/kaggle/input/curated-dataset/Toxic_NonToxic_Comments.csv', index_col = 0)
comments_df.head()

Unnamed: 0,comment,is_toxic
0,"""px solid #000080; background:#000000; padding...",0
1,Whistle Register\nWhen did she sing in the whi...,0
2,Dear AgentCDE\nGet bent.,0
3,some poor guy robs a liquor store for $400 and...,1
4,"Hey donkey \n\nHey donkey, your Dad is coming.",1


In [None]:
wordtoxicities_df = pd.read_csv('/kaggle/input/wordtoxicityscores/wordtoxicities.csv', index_col = 0)
wordtoxicities_df.head()

Unnamed: 0,word,toxicity
0,aaaaaaaaaah,0.155084
1,ab,-0.08955
2,aba,-0.160136
3,abandon,-0.060261
4,abbasidumayyad,-0.050418


# BERT

The **BERT** model was pre-trained on English text from BookCorpus, a dataset of 11,038 books, and from English Wikipedia with lists, tables and headers excluded. Pre-training used two tasks: Masked Language Modelling and Next Sentence Prediction. Our model starts from the pre-trained BERT and fine-tunes its weights on the **MLM (Masked Language Modeling)** task. Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to predict what the masked word should be. The masked word can then be substituted with the predicted word, generating a new sentence that shares a similar context and the same label as the original sentence.

## Tokenizer 

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

BERT uses a subword tokenizer called **WordPiece**. Its working can be understood by considering, as an example, the sentence 'Let us start pretraining the model'. Tokenizing the sentence using WordPiece gives 'tokens = \[let, us, start, pre, ##train, ##ing, the, model\]'. The word 'pretraining' is split into three parts because the WordPiece tokenizer first checks whether the word is present in its vocabulary. If it is, the word is used as a token; if not, the word is split into subwords until every subword is found in the vocabulary. The '##' prefix marks a subword that continues the preceding piece. This process is effective in handling out-of-vocabulary words.
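The splitting described above can be sketched in a few lines of plain Python. This is only an illustration of a greedy longest-match-first lookup, not the tokenizer's actual implementation; the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for the example, and BERT's real vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than BERT's real one
toy_vocab = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}

sentence = "let us start pretraining the model"
tokens = [piece for word in sentence.split() for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
# → ['let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model']
```

A word for which no piece can be found, such as 'xyz' with this toy vocabulary, falls back to the unknown token.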

## Model 

In [None]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We are using the **BERT-base uncased model** for Masked Language Modelling. In MLM, BERT is given a sentence in which a few tokens have been masked, and its weights are optimized so that it outputs the original sentence. While the original BERT pre-training masks words at random, for our purpose we mask those words that are associated with toxicity. 

## Revising Dataset because of constraints on Memory

Before tokenizing our dataset using BERT's tokenizer, we discard some of the data: comments that have more tokens than **'max_tokens'** are rejected, so that the model can be trained within the limited memory available. 

In [None]:
max_tokens = 64

def fits_in_max_tokens(comment):
    #encoding with a limit of max_tokens + 1, so that a comment longer than max_tokens still exceeds the limit after truncation
    token_ids = tokenizer.encode(comment, max_length = max_tokens + 1, truncation = True)
    return len(token_ids) <= max_tokens

#keeping only the comments that fit, special tokens included
keep = comments_df['comment'].apply(fits_in_max_tokens)
valid_comments_df = comments_df.loc[keep, ['comment', 'is_toxic']].reset_index(drop = True)

count_valid_comments = len(valid_comments_df)
print(count_valid_comments)

valid_comments_df.to_csv('Valid_Toxic_NonToxic_Comments.csv')

valid_comments_df

9735


Unnamed: 0,comment,is_toxic
0,"""px solid #000080; background:#000000; padding...",0
1,Whistle Register\nWhen did she sing in the whi...,0
2,Dear AgentCDE\nGet bent.,0
3,some poor guy robs a liquor store for $400 and...,1
4,"Hey donkey \n\nHey donkey, your Dad is coming.",1
...,...,...
9730,"this is the us , a trashy , misogynistic natio...",1
9731,WTF is Leif Erikson day \n\nand why does it de...,1
9732,"""\n\nSo I've been blocked for this which was t...",1
9733,For an Authentic account of India's untouchabl...,0


# Masking

## Tokenization 

BERT's tokenizer adds special tokens: the classifier token "\[CLS\]", which is the first token of a sequence; the separator token "\[SEP\]", which is the last token of a sequence; the unknown token "\[UNK\]" for tokens that are not in its vocabulary; and the padding token "\[PAD\]", used to batch sequences of different lengths, as in our case. For the given sequence(s) it returns the token IDs ('input_ids'), the token-type IDs ('token_type_ids') and the attention mask ('attention_mask'), which is 0 at padded positions.  

In [None]:
inputs = tokenizer(valid_comments_df.comment.tolist(), return_tensors = 'pt', max_length = max_tokens, truncation = True, padding = 'max_length')
inputs

{'input_ids': tensor([[  101,  1000,  1052,  ...,     0,     0,     0],
        [  101, 13300,  4236,  ...,     0,     0,     0],
        [  101,  6203,  4005,  ...,     0,     0,     0],
        ...,
        [  101,  1000,  2061,  ...,     0,     0,     0],
        [  101,  2005,  2019,  ...,     0,     0,     0],
        [  101,  1045,  3294,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
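To make the relationship between the 'input_ids' and 'attention_mask' tensors above concrete, here is a minimal pure-Python sketch of padding to a fixed length. The helper `pad_and_mask` is hypothetical and is not how the tokenizer is implemented; 101 and 102 are the \[CLS\] and \[SEP\] IDs seen in the output above, and the remaining IDs are arbitrary.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) a list of token IDs and build the matching attention mask."""
    # Naive truncation for illustration only: the real tokenizer keeps [SEP] as the last token
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# Hypothetical already-encoded comment: [CLS], two word tokens, [SEP]
example_ids = [101, 7592, 2088, 102]
padded_ids, attention_mask = pad_and_mask(example_ids, max_length=8)
print(padded_ids)      # → [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # → [1, 1, 1, 1, 0, 0, 0, 0]
```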

## Labels Tensor

We create our labels tensor by cloning the 'input_ids' tensor before any token is masked, so that the labels keep the original tokens.

In [None]:
inputs['labels'] = inputs.input_ids.detach().clone()

## Assigning Toxicity Scores to words known from BoW Toxicity Classifier

In [None]:
#mapping each word to its toxicity score from the BoW classifier
wordtoxicities_dict = dict(zip(wordtoxicities_df['word'].tolist(), wordtoxicities_df['toxicity'].tolist()))

#average score, used as a fallback for tokens the BoW classifier has not seen
avg_word_toxicity = np.mean(wordtoxicities_df.toxicity.tolist())

comment_scores = []

for comment in valid_comments_df['comment']:
    encoded_tokens = tokenizer.encode(comment, max_length = max_tokens + 1, truncation = True)
    #converting IDs back to token strings, so that scores align position-by-position with 'input_ids'
    tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)
    #looking up each token's score, falling back to the average
    comment_scores.append([wordtoxicities_dict.get(token, avg_word_toxicity) for token in tokens])

## Finding words to be Masked using an Adaptive Threshold

In [None]:
min_threshold = 0.2

#IDs of the special tokens ([CLS], [SEP], [PAD]), which must never be masked
cls_id, sep_id, pad_id = tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id

def find_masks_for_toxic_comment(index):
    #padding the per-token scores with zeros up to max_tokens
    scores = np.zeros(max_tokens)
    scores[:len(comment_scores[index])] = np.array(comment_scores[index])
    scores = torch.from_numpy(scores)
    
    #adaptive threshold: half of the comment's highest score, but never below min_threshold
    threshold = max(min_threshold, max(comment_scores[index]) / 2)

    ids = inputs.input_ids[index]
    masks = (scores > threshold) & (ids != cls_id) & (ids != sep_id) & (ids != pad_id)

    return masks

mask_arr = []
for i in range(len(valid_comments_df)):
    if valid_comments_df['is_toxic'][i]:
        mask_arr.append(find_masks_for_toxic_comment(i))
    else:
        #non-toxic comments are left unmasked
        mask_arr.append(torch.zeros(max_tokens, dtype = torch.bool))
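The effect of the adaptive threshold is easiest to see on small numbers. The sketch below reuses the rule from the cell above, `max(min_threshold, max(scores) / 2)`, on made-up score lists; the helper name `toxic_positions` and all scores are hypothetical, and the exclusion of special tokens is left out for brevity.

```python
def toxic_positions(token_scores, min_threshold=0.2):
    """Return the positions whose score exceeds the adaptive threshold."""
    threshold = max(min_threshold, max(token_scores) / 2)
    return [i for i, score in enumerate(token_scores) if score > threshold]


# Hypothetical per-token scores for three comments
strongly_toxic = [0.01, 11.0, 0.01, 5.0, 9.0]
mildly_toxic = [0.01, 0.3, 0.15, 0.01]
weak = [0.01, 0.1, 0.05]

print(toxic_positions(strongly_toxic))  # threshold 5.5 → [1, 4]
print(toxic_positions(mildly_toxic))    # threshold 0.2 → [1]
print(toxic_positions(weak))            # threshold 0.2 → []
```

When a comment contains a very toxic word, the threshold rises with it, so a moderately scored word (5.0 in the first list) is left in place. The floor of 'min_threshold' keeps comments without any clearly toxic word from being masked at all.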

## Applying Masks to all Comments

In [None]:
selection = []

for i in range(inputs.input_ids.shape[0]):
    selection.append(torch.flatten(mask_arr[i].nonzero()).tolist())
    
selection

[[],
 [],
 [],
 [21],
 [1, 2, 3, 4],
 [5],
 [7, 20, 25, 43],
 [],
 [],
 [1, 12],
 [],
 [],
 [1, 6],
 [10, 20, 22],
 [1],
 [],
 [],
 [1],
 [],
 [7],
 [],
 [],
 [],
 [],
 [3],
 [5],
 [5, 6, 10],
 [13, 17],
 [4],
 [1],
 [],
 [],
 [],
 [],
 [9, 10],
 [],
 [5],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [9, 16],
 [],
 [],
 [1, 4],
 [24, 31],
 [8],
 [4],
 [2, 3, 6],
 [2, 3, 6],
 [3],
 [],
 [4],
 [],
 [1, 19],
 [],
 [],
 [],
 [],
 [],
 [3],
 [],
 [],
 [],
 [],
 [3, 9],
 [18],
 [6],
 [22],
 [],
 [],
 [20],
 [17],
 [],
 [1, 16],
 [],
 [],
 [6],
 [],
 [13],
 [9, 13],
 [],
 [5, 6],
 [2, 5],
 [8, 10, 11],
 [],
 [1],
 [10],
 [],
 [],
 [1],
 [18],
 [16],
 [],
 [4],
 [],
 [],
 [],
 [1, 2],
 [16],
 [],
 [],
 [],
 [],
 [3],
 [],
 [],
 [17],
 [],
 [],
 [],
 [],
 [10],
 [],
 [4],
 [2, 5, 8],
 [1, 2],
 [1, 2, 5, 6, 10, 11, 14, 15],
 [10],
 [6],
 [],
 [12],
 [6],
 [],
 [8],
 [],
 [],
 [1, 2],
 [12],
 [],
 [4],
 [14],
 [],
 [4],
 [3],
 [5],
 [],
 [15],
 [6],
 [7, 11],
 [],
 [7],
 [1],
 [8],
 [9],
 [1, 8],
 []

The positions that hold 'True' in 'mask_arr' are then overwritten in 'input_ids' with 103, the ID of the \[MASK\] token:

In [None]:
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103
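As a small illustration of this step, the sketch below applies the same overwrite to plain Python lists. The token IDs other than 101, 102, 103 and 0 are arbitrary, and the two-row batch and its `toy_selection` are hypothetical. It also shows why the labels have to be copied before masking: they must keep the original tokens.

```python
MASK_ID = 103  # ID of [MASK] in the bert-base-uncased vocabulary

# Hypothetical batch of two padded sequences
toy_input_ids = [[101, 2023, 2003, 9202, 102, 0],
                 [101, 3835, 2154, 102, 0, 0]]
# Positions to mask per sequence; the second sequence stays untouched
toy_selection = [[3], []]

# Copying first, so that the labels keep the original tokens
toy_labels = [row[:] for row in toy_input_ids]

for row, positions in zip(toy_input_ids, toy_selection):
    for pos in positions:
        row[pos] = MASK_ID

print(toy_input_ids[0])  # → [101, 2023, 2003, 103, 102, 0]
print(toy_labels[0])     # → [101, 2023, 2003, 9202, 102, 0]
```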

## Working 

We can see that toxic words in a given sequence are masked exactly as expected:

In [None]:
for token in tokenizer.convert_ids_to_tokens(inputs.labels[9].tolist()):
    print(token)

[CLS]
fuck
that
,
i
'
m
being
acc
##uss
##ed
of
shit
that
i
didn
'
t
do
[SEP]
[PAD]
...
(44 [PAD] tokens in total)


In [None]:
print(max(min_threshold, max(comment_scores[9])/2))
print(comment_scores[6])

5.534921985953844
[0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 5.0663795128326425, 0.005315426966458784, 11.069843971907687, 0.005315426966458784, 1.309723401308056, 0.005315426966458784, 0.005315426966458784, 0.2037369385745014, 0.005315426966458784, -0.1681766601614798, 0.005315426966458784, 0.005315426966458784, 2.841683505223333, 0.005315426966458784, 0.005315426966458784, 11.069843971907687, 0.005315426966458784, 0.2969538184224775, 0.005315426966458784, -0.2306175113953349, 11.069843971907687, 0.005315426966458784, -0.2306175113953349, 0.0825115595062786, 0.4131274776827482, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, 0.005315426966458784, -0.1442488831252684, 0.005315426966458784, -0.1240947637480212, 9.211758592421704, 0.005315426966458784, 0.005315426966458784,

In [None]:
for token in tokenizer.convert_ids_to_tokens(inputs.input_ids[9].tolist()):
    print(token)

[CLS]
[MASK]
that
,
i
'
m
being
acc
##uss
##ed
of
[MASK]
that
i
didn
'
t
do
[SEP]
[PAD]
...
(44 [PAD] tokens in total)


# Processing

## Preparing Dataset

In [None]:
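class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        #encodings holds the 'input_ids', 'token_type_ids', 'attention_mask' and 'labels' tensors
        self.encodings = encodings
    def __getitem__(self, idx):
        #the values are already tensors, so they are copied with clone() instead of torch.tensor()
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)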

In [None]:
dataset = ToxicDataset(inputs)

In [None]:
loader = torch.utils.data.DataLoader(dataset, batch_size = 32)

## Setting-up

Before setting up the training loop, we select the device: the GPU if one is available, otherwise the CPU.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
#and moving our model over to the selected device
model.to(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

## Optimizing

We switch the model to training mode and initialize the optimizer: AdamW, a variant of Adam with decoupled weight decay, a regularizer that can reduce the chances of overfitting.

In [None]:
#activating training mode
model.train()
#initializing optimizer
optim = AdamW(model.parameters(), lr = 5e-5)
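For readers unfamiliar with AdamW, the following is a minimal sketch of one update step for a single scalar parameter. It is only an illustration of the update rule, not the library's implementation. The learning rate matches the cell above, while `beta1`, `beta2`, `eps` and `weight_decay = 0.01` are assumed values chosen for the example; the default weight decay depends on which AdamW implementation is used.

```python
import math


def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly, independently of the gradient
    param = param * (1 - lr * weight_decay)
    # Exponential moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adam step scaled by the running estimate of the gradient magnitude
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Hypothetical starting point: parameter 1.0, gradient 0.5, first step
new_param, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)

# With a zero gradient only the weight decay acts on the parameter
decayed, _, _ = adamw_step(2.0, grad=0.0, m=0.0, v=0.0, t=1)
print(decayed)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate, plus the small shrinkage from weight decay.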



## Training the Model

In [None]:
epochs = 2

for epoch in range(epochs):
    #setting up loop with TQDM and dataloader
    loop = tqdm(loader, leave = True)
    for batch in loop:
        #resetting the gradients accumulated in the previous step
        optim.zero_grad()
        #pulling all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        #forward pass; passing labels makes the model return the loss
        outputs = model(input_ids, attention_mask = attention_mask, labels = labels)
        #printing the loss tensor (the first element of the outputs)
        print(outputs[0])
        #extracting loss
        loss = outputs.loss
        #backpropagating to compute gradients for every parameter that needs an update
        loss.backward()
        #updating parameters
        optim.step()
        #printing relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss = loss.item())



tensor(13.6442, grad_fn=<NllLossBackward>)


Epoch 0:   0%|          | 1/305 [00:14<1:14:17, 14.66s/it, loss=13.6]

tensor(11.0597, grad_fn=<NllLossBackward>)


Epoch 0:   1%|          | 2/305 [00:28<1:11:57, 14.25s/it, loss=11.1]

tensor(9.8197, grad_fn=<NllLossBackward>)


Epoch 0:   1%|          | 3/305 [00:40<1:07:05, 13.33s/it, loss=9.82]

tensor(8.1719, grad_fn=<NllLossBackward>)


Epoch 0:   1%|▏         | 4/305 [00:52<1:04:22, 12.83s/it, loss=8.17]

tensor(7.3703, grad_fn=<NllLossBackward>)


Epoch 0:   2%|▏         | 5/305 [01:05<1:03:22, 12.67s/it, loss=7.37]

tensor(6.6782, grad_fn=<NllLossBackward>)


Epoch 0:   2%|▏         | 6/305 [01:17<1:02:11, 12.48s/it, loss=6.68]

tensor(5.7424, grad_fn=<NllLossBackward>)


Epoch 0:   2%|▏         | 7/305 [01:29<1:01:49, 12.45s/it, loss=5.74]

tensor(5.4704, grad_fn=<NllLossBackward>)


Epoch 0:   3%|▎         | 8/305 [01:41<1:00:47, 12.28s/it, loss=5.47]

tensor(5.0237, grad_fn=<NllLossBackward>)


Epoch 0:   3%|▎         | 9/305 [01:53<59:50, 12.13s/it, loss=5.02]  

tensor(4.7102, grad_fn=<NllLossBackward>)


Epoch 0:   3%|▎         | 10/305 [02:05<59:40, 12.14s/it, loss=4.71]

tensor(4.1911, grad_fn=<NllLossBackward>)


Epoch 0:   4%|▎         | 11/305 [02:17<58:39, 11.97s/it, loss=4.19]

tensor(4.2143, grad_fn=<NllLossBackward>)


Epoch 0:   4%|▍         | 12/305 [02:29<58:43, 12.02s/it, loss=4.21]

tensor(3.8613, grad_fn=<NllLossBackward>)


Epoch 0:   4%|▍         | 13/305 [02:42<59:52, 12.30s/it, loss=3.86]

tensor(3.1298, grad_fn=<NllLossBackward>)


Epoch 0:   5%|▍         | 14/305 [02:54<59:22, 12.24s/it, loss=3.13]

tensor(2.7819, grad_fn=<NllLossBackward>)


Epoch 0:   5%|▍         | 15/305 [03:07<1:00:01, 12.42s/it, loss=2.78]

tensor(2.4568, grad_fn=<NllLossBackward>)


Epoch 0:   5%|▌         | 16/305 [03:19<59:39, 12.39s/it, loss=2.46]  

tensor(2.1509, grad_fn=<NllLossBackward>)


Epoch 0:   6%|▌         | 17/305 [03:31<59:22, 12.37s/it, loss=2.15]

tensor(1.9154, grad_fn=<NllLossBackward>)


Epoch 0:   6%|▌         | 18/305 [03:44<59:24, 12.42s/it, loss=1.92]

tensor(1.5518, grad_fn=<NllLossBackward>)


Epoch 0:   6%|▌         | 19/305 [03:56<59:06, 12.40s/it, loss=1.55]

tensor(1.4194, grad_fn=<NllLossBackward>)


Epoch 0:   7%|▋         | 20/305 [04:09<59:02, 12.43s/it, loss=1.42]

tensor(1.1829, grad_fn=<NllLossBackward>)


Epoch 0:   7%|▋         | 21/305 [04:21<57:54, 12.23s/it, loss=1.18]

tensor(1.0469, grad_fn=<NllLossBackward>)


Epoch 0:   7%|▋         | 22/305 [04:33<57:15, 12.14s/it, loss=1.05]

tensor(0.8805, grad_fn=<NllLossBackward>)


Epoch 0:   8%|▊         | 23/305 [04:45<57:10, 12.17s/it, loss=0.881]

tensor(0.8324, grad_fn=<NllLossBackward>)


Epoch 0:   8%|▊         | 24/305 [04:56<56:15, 12.01s/it, loss=0.832]

tensor(0.6889, grad_fn=<NllLossBackward>)


Epoch 0:   8%|▊         | 25/305 [05:08<55:33, 11.90s/it, loss=0.689]

tensor(0.6259, grad_fn=<NllLossBackward>)


Epoch 0:   9%|▊         | 26/305 [05:20<55:47, 12.00s/it, loss=0.626]

tensor(0.5833, grad_fn=<NllLossBackward>)


Epoch 0:   9%|▉         | 27/305 [05:32<55:36, 12.00s/it, loss=0.583]

tensor(0.4917, grad_fn=<NllLossBackward>)


Epoch 0:   9%|▉         | 28/305 [05:45<55:50, 12.10s/it, loss=0.492]

tensor(0.4321, grad_fn=<NllLossBackward>)


Epoch 0:  10%|▉         | 29/305 [05:57<55:54, 12.15s/it, loss=0.432]

tensor(0.4165, grad_fn=<NllLossBackward>)


Epoch 0:  10%|▉         | 30/305 [06:09<55:15, 12.06s/it, loss=0.416]

tensor(0.3753, grad_fn=<NllLossBackward>)


Epoch 0:  10%|█         | 31/305 [06:21<55:27, 12.15s/it, loss=0.375]

tensor(0.3201, grad_fn=<NllLossBackward>)


Epoch 0:  10%|█         | 32/305 [06:33<55:05, 12.11s/it, loss=0.32] 

tensor(0.3259, grad_fn=<NllLossBackward>)


Epoch 0:  11%|█         | 33/305 [06:45<54:56, 12.12s/it, loss=0.326]

tensor(0.2628, grad_fn=<NllLossBackward>)


Epoch 0:  11%|█         | 34/305 [06:58<55:09, 12.21s/it, loss=0.263]

tensor(0.2253, grad_fn=<NllLossBackward>)


Epoch 0:  11%|█▏        | 35/305 [07:09<54:26, 12.10s/it, loss=0.225]

tensor(0.2357, grad_fn=<NllLossBackward>)


Epoch 0:  12%|█▏        | 36/305 [07:22<54:20, 12.12s/it, loss=0.236]

tensor(0.2263, grad_fn=<NllLossBackward>)


Epoch 0:  12%|█▏        | 37/305 [07:33<53:44, 12.03s/it, loss=0.226]

tensor(0.1936, grad_fn=<NllLossBackward>)


Epoch 0:  12%|█▏        | 38/305 [07:45<53:27, 12.01s/it, loss=0.194]

tensor(0.2038, grad_fn=<NllLossBackward>)


Epoch 0:  13%|█▎        | 39/305 [07:58<53:52, 12.15s/it, loss=0.204]

tensor(0.1757, grad_fn=<NllLossBackward>)


Epoch 0:  13%|█▎        | 40/305 [08:10<53:22, 12.09s/it, loss=0.176]

tensor(0.1789, grad_fn=<NllLossBackward>)


Epoch 0:  13%|█▎        | 41/305 [08:22<52:45, 11.99s/it, loss=0.179]

tensor(0.1724, grad_fn=<NllLossBackward>)


Epoch 0:  14%|█▍        | 42/305 [08:34<53:03, 12.11s/it, loss=0.172]

tensor(0.1150, grad_fn=<NllLossBackward>)


Epoch 0:  14%|█▍        | 43/305 [08:46<52:53, 12.11s/it, loss=0.115]

tensor(0.1469, grad_fn=<NllLossBackward>)


Epoch 0:  14%|█▍        | 44/305 [08:59<53:27, 12.29s/it, loss=0.147]

tensor(0.1209, grad_fn=<NllLossBackward>)


Epoch 0:  15%|█▍        | 45/305 [09:11<52:37, 12.14s/it, loss=0.121]

tensor(0.1437, grad_fn=<NllLossBackward>)


Epoch 0:  15%|█▌        | 46/305 [09:22<52:02, 12.06s/it, loss=0.144]

tensor(0.1546, grad_fn=<NllLossBackward>)


Epoch 0:  15%|█▌        | 47/305 [09:35<52:02, 12.10s/it, loss=0.155]

tensor(0.1081, grad_fn=<NllLossBackward>)


Epoch 0:  16%|█▌        | 48/305 [09:50<55:26, 12.94s/it, loss=0.108]

tensor(0.1010, grad_fn=<NllLossBackward>)


Epoch 0:  16%|█▌        | 49/305 [10:02<55:02, 12.90s/it, loss=0.101]

tensor(0.0917, grad_fn=<NllLossBackward>)


Epoch 0:  16%|█▋        | 50/305 [10:14<53:20, 12.55s/it, loss=0.0917]

tensor(0.1115, grad_fn=<NllLossBackward>)


Epoch 0:  17%|█▋        | 51/305 [10:26<52:11, 12.33s/it, loss=0.111] 

tensor(0.1017, grad_fn=<NllLossBackward>)


Epoch 0:  17%|█▋        | 52/305 [10:38<52:03, 12.35s/it, loss=0.102]

tensor(0.1078, grad_fn=<NllLossBackward>)


Epoch 0:  17%|█▋        | 53/305 [10:51<51:48, 12.34s/it, loss=0.108]

tensor(0.0890, grad_fn=<NllLossBackward>)


Epoch 0:  18%|█▊        | 54/305 [11:03<51:26, 12.30s/it, loss=0.089]

tensor(0.0939, grad_fn=<NllLossBackward>)


Epoch 0:  18%|█▊        | 55/305 [11:15<51:35, 12.38s/it, loss=0.0939]

tensor(0.1025, grad_fn=<NllLossBackward>)


Epoch 0:  18%|█▊        | 56/305 [11:28<51:09, 12.33s/it, loss=0.103] 

tensor(0.1102, grad_fn=<NllLossBackward>)


Epoch 0:  19%|█▊        | 57/305 [11:40<51:32, 12.47s/it, loss=0.11] 

tensor(0.0690, grad_fn=<NllLossBackward>)


Epoch 0:  19%|█▉        | 58/305 [11:53<51:00, 12.39s/it, loss=0.069]

tensor(0.1207, grad_fn=<NllLossBackward>)


Epoch 0:  19%|█▉        | 59/305 [12:05<50:10, 12.24s/it, loss=0.121]

tensor(0.0923, grad_fn=<NllLossBackward>)


Epoch 0:  20%|█▉        | 60/305 [12:17<50:13, 12.30s/it, loss=0.0923]

tensor(0.0620, grad_fn=<NllLossBackward>)


Epoch 0:  20%|██        | 61/305 [12:29<49:39, 12.21s/it, loss=0.062] 

tensor(0.0929, grad_fn=<NllLossBackward>)


Epoch 0:  20%|██        | 62/305 [12:41<49:27, 12.21s/it, loss=0.0929]

tensor(0.0936, grad_fn=<NllLossBackward>)


Epoch 0:  21%|██        | 63/305 [12:54<49:31, 12.28s/it, loss=0.0936]

tensor(0.0832, grad_fn=<NllLossBackward>)


Epoch 0:  21%|██        | 64/305 [13:06<49:22, 12.29s/it, loss=0.0832]

tensor(0.0737, grad_fn=<NllLossBackward>)


Epoch 0:  21%|██▏       | 65/305 [13:18<49:07, 12.28s/it, loss=0.0737]

tensor(0.0746, grad_fn=<NllLossBackward>)


Epoch 0:  22%|██▏       | 66/305 [13:30<48:37, 12.21s/it, loss=0.0746]

tensor(0.0716, grad_fn=<NllLossBackward>)


Epoch 0:  22%|██▏       | 67/305 [13:42<47:58, 12.10s/it, loss=0.0716]

tensor(0.0785, grad_fn=<NllLossBackward>)


Epoch 0:  22%|██▏       | 68/305 [13:55<49:02, 12.41s/it, loss=0.0785]

tensor(0.0913, grad_fn=<NllLossBackward>)


Epoch 0:  23%|██▎       | 69/305 [14:08<49:33, 12.60s/it, loss=0.0913]

tensor(0.0953, grad_fn=<NllLossBackward>)


Epoch 0:  23%|██▎       | 70/305 [14:21<49:33, 12.65s/it, loss=0.0953]

tensor(0.0640, grad_fn=<NllLossBackward>)


Epoch 0:  23%|██▎       | 71/305 [14:33<48:49, 12.52s/it, loss=0.064] 

tensor(0.0585, grad_fn=<NllLossBackward>)


Epoch 0:  24%|██▎       | 72/305 [14:45<48:08, 12.40s/it, loss=0.0585]

tensor(0.0683, grad_fn=<NllLossBackward>)


Epoch 0:  24%|██▍       | 73/305 [14:58<48:08, 12.45s/it, loss=0.0683]

tensor(0.0786, grad_fn=<NllLossBackward>)


Epoch 0:  24%|██▍       | 74/305 [15:10<47:40, 12.38s/it, loss=0.0786]

tensor(0.0710, grad_fn=<NllLossBackward>)


Epoch 0:  25%|██▍       | 75/305 [15:23<47:45, 12.46s/it, loss=0.071] 

tensor(0.0839, grad_fn=<NllLossBackward>)


Epoch 0:  25%|██▍       | 76/305 [15:35<47:08, 12.35s/it, loss=0.0839]

tensor(0.0952, grad_fn=<NllLossBackward>)


Epoch 0:  25%|██▌       | 77/305 [15:47<46:37, 12.27s/it, loss=0.0952]

tensor(0.0762, grad_fn=<NllLossBackward>)


Epoch 0:  26%|██▌       | 78/305 [16:00<46:55, 12.40s/it, loss=0.0762]

tensor(0.1084, grad_fn=<NllLossBackward>)


Epoch 0:  26%|██▌       | 79/305 [16:12<46:26, 12.33s/it, loss=0.108] 

tensor(0.0410, grad_fn=<NllLossBackward>)


Epoch 0:  26%|██▌       | 80/305 [16:24<46:13, 12.33s/it, loss=0.041]

tensor(0.0645, grad_fn=<NllLossBackward>)


Epoch 0:  27%|██▋       | 81/305 [16:37<46:13, 12.38s/it, loss=0.0645]

tensor(0.0803, grad_fn=<NllLossBackward>)


Epoch 0:  27%|██▋       | 82/305 [16:49<45:51, 12.34s/it, loss=0.0803]

tensor(0.0876, grad_fn=<NllLossBackward>)


Epoch 0:  27%|██▋       | 83/305 [17:01<45:50, 12.39s/it, loss=0.0876]

tensor(0.0647, grad_fn=<NllLossBackward>)


Epoch 0:  28%|██▊       | 84/305 [17:14<45:18, 12.30s/it, loss=0.0647]

tensor(0.0579, grad_fn=<NllLossBackward>)


Epoch 0:  28%|██▊       | 85/305 [17:26<44:52, 12.24s/it, loss=0.0579]

tensor(0.0828, grad_fn=<NllLossBackward>)


Epoch 0:  28%|██▊       | 86/305 [17:38<45:08, 12.37s/it, loss=0.0828]

tensor(0.0820, grad_fn=<NllLossBackward>)


Epoch 0:  29%|██▊       | 87/305 [17:51<45:07, 12.42s/it, loss=0.082] 

tensor(0.0722, grad_fn=<NllLossBackward>)


Epoch 0:  29%|██▉       | 88/305 [18:03<44:45, 12.37s/it, loss=0.0722]

tensor(0.0648, grad_fn=<NllLossBackward>)


Epoch 0:  29%|██▉       | 89/305 [18:16<45:00, 12.50s/it, loss=0.0648]

tensor(0.0763, grad_fn=<NllLossBackward>)


Epoch 0:  30%|██▉       | 90/305 [18:28<44:06, 12.31s/it, loss=0.0763]

tensor(0.0560, grad_fn=<NllLossBackward>)


Epoch 0:  30%|██▉       | 91/305 [18:40<43:57, 12.33s/it, loss=0.056] 

tensor(0.0734, grad_fn=<NllLossBackward>)


Epoch 0:  30%|███       | 92/305 [18:52<43:28, 12.25s/it, loss=0.0734]

tensor(0.0678, grad_fn=<NllLossBackward>)


Epoch 0:  30%|███       | 93/305 [19:04<43:02, 12.18s/it, loss=0.0678]

tensor(0.0712, grad_fn=<NllLossBackward>)


Epoch 0:  31%|███       | 94/305 [19:16<42:51, 12.19s/it, loss=0.0712]

tensor(0.0653, grad_fn=<NllLossBackward>)


Epoch 0:  31%|███       | 95/305 [19:28<42:29, 12.14s/it, loss=0.0653]

tensor(0.0431, grad_fn=<NllLossBackward>)


Epoch 0:  31%|███▏      | 96/305 [19:41<42:29, 12.20s/it, loss=0.0431]

tensor(0.0357, grad_fn=<NllLossBackward>)


Epoch 0:  32%|███▏      | 97/305 [19:54<42:50, 12.36s/it, loss=0.0357]

tensor(0.0513, grad_fn=<NllLossBackward>)


Epoch 0:  32%|███▏      | 98/305 [20:05<42:10, 12.22s/it, loss=0.0513]

tensor(0.0534, grad_fn=<NllLossBackward>)


Epoch 0:  32%|███▏      | 99/305 [20:18<42:21, 12.34s/it, loss=0.0534]

tensor(0.0701, grad_fn=<NllLossBackward>)


Epoch 0:  33%|███▎      | 100/305 [20:30<41:47, 12.23s/it, loss=0.0701]

tensor(0.0878, grad_fn=<NllLossBackward>)


Epoch 0:  33%|███▎      | 101/305 [20:42<41:18, 12.15s/it, loss=0.0878]

tensor(0.0411, grad_fn=<NllLossBackward>)


Epoch 0:  33%|███▎      | 102/305 [20:55<41:38, 12.31s/it, loss=0.0411]

tensor(0.0516, grad_fn=<NllLossBackward>)


Epoch 0:  34%|███▍      | 103/305 [21:07<41:09, 12.23s/it, loss=0.0516]

tensor(0.0598, grad_fn=<NllLossBackward>)


Epoch 0:  34%|███▍      | 104/305 [21:19<41:03, 12.25s/it, loss=0.0598]

tensor(0.0794, grad_fn=<NllLossBackward>)


Epoch 0:  34%|███▍      | 105/305 [21:31<40:44, 12.22s/it, loss=0.0794]

tensor(0.0513, grad_fn=<NllLossBackward>)


Epoch 0:  35%|███▍      | 106/305 [21:43<40:04, 12.08s/it, loss=0.0513]

tensor(0.0932, grad_fn=<NllLossBackward>)


Epoch 0:  35%|███▌      | 107/305 [21:55<40:09, 12.17s/it, loss=0.0932]

tensor(0.0538, grad_fn=<NllLossBackward>)


Epoch 0:  35%|███▌      | 108/305 [22:07<39:37, 12.07s/it, loss=0.0538]

tensor(0.0509, grad_fn=<NllLossBackward>)


Epoch 0:  36%|███▌      | 109/305 [22:19<39:15, 12.02s/it, loss=0.0509]

tensor(0.0603, grad_fn=<NllLossBackward>)


Epoch 0:  36%|███▌      | 110/305 [22:31<39:28, 12.15s/it, loss=0.0603]

tensor(0.0898, grad_fn=<NllLossBackward>)


Epoch 0:  36%|███▋      | 111/305 [22:44<39:18, 12.16s/it, loss=0.0898]

tensor(0.0567, grad_fn=<NllLossBackward>)


Epoch 0:  37%|███▋      | 112/305 [22:56<39:17, 12.21s/it, loss=0.0567]

tensor(0.0655, grad_fn=<NllLossBackward>)


Epoch 0:  37%|███▋      | 113/305 [23:08<38:47, 12.12s/it, loss=0.0655]

tensor(0.0564, grad_fn=<NllLossBackward>)


Epoch 0:  37%|███▋      | 114/305 [23:20<38:15, 12.02s/it, loss=0.0564]

tensor(0.0417, grad_fn=<NllLossBackward>)


Epoch 0:  38%|███▊      | 115/305 [23:32<38:12, 12.07s/it, loss=0.0417]

tensor(0.0650, grad_fn=<NllLossBackward>)


Epoch 0:  38%|███▊      | 116/305 [23:44<37:53, 12.03s/it, loss=0.065] 

tensor(0.0629, grad_fn=<NllLossBackward>)


Epoch 0:  38%|███▊      | 117/305 [23:56<37:53, 12.09s/it, loss=0.0629]

tensor(0.0328, grad_fn=<NllLossBackward>)


Epoch 0:  39%|███▊      | 118/305 [24:08<37:52, 12.15s/it, loss=0.0328]

tensor(0.0504, grad_fn=<NllLossBackward>)


Epoch 0:  39%|███▉      | 119/305 [24:20<37:33, 12.11s/it, loss=0.0504]

tensor(0.0472, grad_fn=<NllLossBackward>)


Epoch 0:  39%|███▉      | 120/305 [24:33<37:30, 12.17s/it, loss=0.0472]

tensor(0.0638, grad_fn=<NllLossBackward>)


Epoch 0:  40%|███▉      | 121/305 [24:44<36:56, 12.04s/it, loss=0.0638]

tensor(0.0588, grad_fn=<NllLossBackward>)


Epoch 0:  40%|████      | 122/305 [24:56<36:23, 11.93s/it, loss=0.0588]

tensor(0.0494, grad_fn=<NllLossBackward>)


Epoch 0:  40%|████      | 123/305 [25:08<36:32, 12.05s/it, loss=0.0494]

tensor(0.0465, grad_fn=<NllLossBackward>)


Epoch 0:  41%|████      | 124/305 [25:20<36:08, 11.98s/it, loss=0.0465]

tensor(0.0734, grad_fn=<NllLossBackward>)


Epoch 0:  41%|████      | 125/305 [25:32<36:05, 12.03s/it, loss=0.0734]

tensor(0.0438, grad_fn=<NllLossBackward>)


Epoch 0:  41%|████▏     | 126/305 [25:45<36:20, 12.18s/it, loss=0.0438]

tensor(0.0374, grad_fn=<NllLossBackward>)


Epoch 0:  42%|████▏     | 127/305 [25:57<36:05, 12.17s/it, loss=0.0374]

tensor(0.0755, grad_fn=<NllLossBackward>)


Epoch 0:  42%|████▏     | 128/305 [26:09<36:02, 12.22s/it, loss=0.0755]

tensor(0.0687, grad_fn=<NllLossBackward>)


Epoch 0:  42%|████▏     | 129/305 [26:22<35:50, 12.22s/it, loss=0.0687]

tensor(0.0532, grad_fn=<NllLossBackward>)


Epoch 0:  43%|████▎     | 130/305 [26:34<35:42, 12.25s/it, loss=0.0532]

tensor(0.0716, grad_fn=<NllLossBackward>)


Epoch 0:  43%|████▎     | 131/305 [26:46<35:47, 12.34s/it, loss=0.0716]

tensor(0.0628, grad_fn=<NllLossBackward>)


Epoch 0:  43%|████▎     | 132/305 [26:58<35:12, 12.21s/it, loss=0.0628]

tensor(0.0522, grad_fn=<NllLossBackward>)


Epoch 0:  44%|████▎     | 133/305 [27:10<34:44, 12.12s/it, loss=0.0522]

tensor(0.0324, grad_fn=<NllLossBackward>)


Epoch 0:  44%|████▍     | 134/305 [27:22<34:37, 12.15s/it, loss=0.0324]

tensor(0.0452, grad_fn=<NllLossBackward>)


Epoch 0:  44%|████▍     | 135/305 [27:34<34:08, 12.05s/it, loss=0.0452]

tensor(0.0591, grad_fn=<NllLossBackward>)


Epoch 0:  45%|████▍     | 136/305 [27:47<34:16, 12.17s/it, loss=0.0591]

tensor(0.0771, grad_fn=<NllLossBackward>)


Epoch 0:  45%|████▍     | 137/305 [27:59<33:47, 12.07s/it, loss=0.0771]

tensor(0.0736, grad_fn=<NllLossBackward>)


Epoch 0:  45%|████▌     | 138/305 [28:10<33:27, 12.02s/it, loss=0.0736]

tensor(0.0661, grad_fn=<NllLossBackward>)


Epoch 0:  46%|████▌     | 139/305 [28:23<33:23, 12.07s/it, loss=0.0661]

tensor(0.0283, grad_fn=<NllLossBackward>)


Epoch 0:  46%|████▌     | 140/305 [28:35<33:02, 12.01s/it, loss=0.0283]

tensor(0.0503, grad_fn=<NllLossBackward>)


Epoch 0:  46%|████▌     | 141/305 [28:47<32:54, 12.04s/it, loss=0.0503]

tensor(0.0319, grad_fn=<NllLossBackward>)


Epoch 0:  47%|████▋     | 142/305 [28:59<33:08, 12.20s/it, loss=0.0319]

tensor(0.0506, grad_fn=<NllLossBackward>)


Epoch 0:  47%|████▋     | 143/305 [29:11<32:42, 12.11s/it, loss=0.0506]

tensor(0.0623, grad_fn=<NllLossBackward>)


Epoch 0:  47%|████▋     | 144/305 [29:24<33:29, 12.48s/it, loss=0.0623]

tensor(0.0277, grad_fn=<NllLossBackward>)


Epoch 0:  48%|████▊     | 145/305 [29:37<33:23, 12.52s/it, loss=0.0277]

tensor(0.0529, grad_fn=<NllLossBackward>)


Epoch 0:  48%|████▊     | 146/305 [29:51<34:13, 12.92s/it, loss=0.0529]

tensor(0.0407, grad_fn=<NllLossBackward>)


Epoch 0:  48%|████▊     | 147/305 [30:03<33:15, 12.63s/it, loss=0.0407]

tensor(0.0511, grad_fn=<NllLossBackward>)


Epoch 0:  49%|████▊     | 148/305 [30:15<32:30, 12.43s/it, loss=0.0511]

tensor(0.0290, grad_fn=<NllLossBackward>)


Epoch 0:  49%|████▉     | 149/305 [30:27<32:21, 12.44s/it, loss=0.029] 

tensor(0.0635, grad_fn=<NllLossBackward>)


Epoch 0:  49%|████▉     | 150/305 [30:39<31:49, 12.32s/it, loss=0.0635]

tensor(0.0517, grad_fn=<NllLossBackward>)


Epoch 0:  50%|████▉     | 151/305 [30:51<31:19, 12.21s/it, loss=0.0517]

tensor(0.0602, grad_fn=<NllLossBackward>)


Epoch 0:  50%|████▉     | 152/305 [31:04<31:11, 12.23s/it, loss=0.0602]

tensor(0.0728, grad_fn=<NllLossBackward>)


Epoch 0:  50%|█████     | 153/305 [31:16<30:48, 12.16s/it, loss=0.0728]

tensor(0.0506, grad_fn=<NllLossBackward>)


Epoch 0:  50%|█████     | 154/305 [31:27<30:18, 12.04s/it, loss=0.0506]

tensor(0.0531, grad_fn=<NllLossBackward>)


Epoch 0:  51%|█████     | 155/305 [31:40<30:16, 12.11s/it, loss=0.0531]

tensor(0.0464, grad_fn=<NllLossBackward>)


Epoch 0:  51%|█████     | 156/305 [31:52<30:00, 12.08s/it, loss=0.0464]

tensor(0.0304, grad_fn=<NllLossBackward>)


Epoch 0:  51%|█████▏    | 157/305 [32:04<30:03, 12.19s/it, loss=0.0304]

tensor(0.0516, grad_fn=<NllLossBackward>)


Epoch 0:  52%|█████▏    | 158/305 [32:16<29:37, 12.09s/it, loss=0.0516]

tensor(0.0663, grad_fn=<NllLossBackward>)


Epoch 0:  52%|█████▏    | 159/305 [32:28<29:12, 12.01s/it, loss=0.0663]

tensor(0.0466, grad_fn=<NllLossBackward>)


Epoch 0:  52%|█████▏    | 160/305 [32:40<29:07, 12.05s/it, loss=0.0466]

tensor(0.0474, grad_fn=<NllLossBackward>)


Epoch 0:  53%|█████▎    | 161/305 [32:52<28:49, 12.01s/it, loss=0.0474]

tensor(0.0650, grad_fn=<NllLossBackward>)


Epoch 0:  53%|█████▎    | 162/305 [33:04<28:30, 11.96s/it, loss=0.065] 

tensor(0.0472, grad_fn=<NllLossBackward>)


Epoch 0:  53%|█████▎    | 163/305 [33:16<28:31, 12.05s/it, loss=0.0472]

tensor(0.0760, grad_fn=<NllLossBackward>)


Epoch 0:  54%|█████▍    | 164/305 [33:28<28:05, 11.95s/it, loss=0.076] 

tensor(0.0413, grad_fn=<NllLossBackward>)


Epoch 0:  54%|█████▍    | 165/305 [33:40<28:11, 12.08s/it, loss=0.0413]

tensor(0.0583, grad_fn=<NllLossBackward>)


Epoch 0:  54%|█████▍    | 166/305 [33:52<28:03, 12.11s/it, loss=0.0583]

tensor(0.0411, grad_fn=<NllLossBackward>)


Epoch 0:  55%|█████▍    | 167/305 [34:04<27:57, 12.15s/it, loss=0.0411]

tensor(0.0437, grad_fn=<NllLossBackward>)


Epoch 0:  55%|█████▌    | 168/305 [34:17<27:57, 12.25s/it, loss=0.0437]

tensor(0.0617, grad_fn=<NllLossBackward>)


Epoch 0:  55%|█████▌    | 169/305 [34:29<27:45, 12.24s/it, loss=0.0617]

tensor(0.0432, grad_fn=<NllLossBackward>)


Epoch 0:  56%|█████▌    | 170/305 [34:41<27:31, 12.23s/it, loss=0.0432]

tensor(0.0511, grad_fn=<NllLossBackward>)


Epoch 0:  56%|█████▌    | 171/305 [34:54<27:42, 12.41s/it, loss=0.0511]

tensor(0.0342, grad_fn=<NllLossBackward>)


Epoch 0:  56%|█████▋    | 172/305 [35:06<27:20, 12.34s/it, loss=0.0342]

tensor(0.0661, grad_fn=<NllLossBackward>)


Epoch 0:  57%|█████▋    | 173/305 [35:19<27:11, 12.36s/it, loss=0.0661]

tensor(0.1054, grad_fn=<NllLossBackward>)


Epoch 0:  57%|█████▋    | 174/305 [35:30<26:32, 12.16s/it, loss=0.105] 

tensor(0.0294, grad_fn=<NllLossBackward>)


Epoch 0:  57%|█████▋    | 175/305 [35:42<26:07, 12.06s/it, loss=0.0294]

tensor(0.0523, grad_fn=<NllLossBackward>)


Epoch 0:  58%|█████▊    | 176/305 [35:55<26:07, 12.15s/it, loss=0.0523]

tensor(0.0558, grad_fn=<NllLossBackward>)


Epoch 0:  58%|█████▊    | 177/305 [36:06<25:35, 12.00s/it, loss=0.0558]

tensor(0.0257, grad_fn=<NllLossBackward>)


Epoch 0:  58%|█████▊    | 178/305 [36:18<25:21, 11.98s/it, loss=0.0257]

tensor(0.0379, grad_fn=<NllLossBackward>)


Epoch 0:  59%|█████▊    | 179/305 [36:30<24:58, 11.89s/it, loss=0.0379]

tensor(0.0346, grad_fn=<NllLossBackward>)


Epoch 0:  59%|█████▉    | 180/305 [36:42<24:36, 11.82s/it, loss=0.0346]

tensor(0.0386, grad_fn=<NllLossBackward>)


Epoch 0:  59%|█████▉    | 181/305 [36:54<24:40, 11.94s/it, loss=0.0386]

tensor(0.0295, grad_fn=<NllLossBackward>)


Epoch 0:  60%|█████▉    | 182/305 [37:06<24:23, 11.90s/it, loss=0.0295]

tensor(0.0525, grad_fn=<NllLossBackward>)


Epoch 0:  60%|██████    | 183/305 [37:17<24:05, 11.85s/it, loss=0.0525]

tensor(0.0396, grad_fn=<NllLossBackward>)


Epoch 0:  60%|██████    | 184/305 [37:30<24:06, 11.96s/it, loss=0.0396]

tensor(0.0537, grad_fn=<NllLossBackward>)


Epoch 0:  61%|██████    | 185/305 [37:41<23:43, 11.86s/it, loss=0.0537]

tensor(0.0377, grad_fn=<NllLossBackward>)


Epoch 0:  61%|██████    | 186/305 [37:53<23:26, 11.82s/it, loss=0.0377]

tensor(0.0502, grad_fn=<NllLossBackward>)


Epoch 0:  61%|██████▏   | 187/305 [38:05<23:22, 11.89s/it, loss=0.0502]

tensor(0.0744, grad_fn=<NllLossBackward>)


Epoch 0:  62%|██████▏   | 188/305 [38:17<22:59, 11.79s/it, loss=0.0744]

tensor(0.0412, grad_fn=<NllLossBackward>)


Epoch 0:  62%|██████▏   | 189/305 [38:29<22:56, 11.87s/it, loss=0.0412]

tensor(0.0349, grad_fn=<NllLossBackward>)


Epoch 0:  62%|██████▏   | 190/305 [38:40<22:36, 11.79s/it, loss=0.0349]

tensor(0.0377, grad_fn=<NllLossBackward>)


Epoch 0:  63%|██████▎   | 191/305 [38:52<22:31, 11.86s/it, loss=0.0377]

tensor(0.0275, grad_fn=<NllLossBackward>)


Epoch 0:  63%|██████▎   | 192/305 [39:04<22:31, 11.96s/it, loss=0.0275]

tensor(0.0614, grad_fn=<NllLossBackward>)


Epoch 0:  63%|██████▎   | 193/305 [39:16<22:12, 11.89s/it, loss=0.0614]

tensor(0.0303, grad_fn=<NllLossBackward>)


Epoch 0:  64%|██████▎   | 194/305 [39:28<21:55, 11.85s/it, loss=0.0303]

tensor(0.0314, grad_fn=<NllLossBackward>)


Epoch 0:  64%|██████▍   | 195/305 [39:40<21:51, 11.92s/it, loss=0.0314]

tensor(0.0498, grad_fn=<NllLossBackward>)


Epoch 0:  64%|██████▍   | 196/305 [39:52<21:42, 11.95s/it, loss=0.0498]

tensor(0.0593, grad_fn=<NllLossBackward>)


Epoch 0:  65%|██████▍   | 197/305 [40:04<21:44, 12.07s/it, loss=0.0593]

tensor(0.0390, grad_fn=<NllLossBackward>)


Epoch 0:  65%|██████▍   | 198/305 [40:16<21:23, 12.00s/it, loss=0.039] 

tensor(0.0482, grad_fn=<NllLossBackward>)


Epoch 0:  65%|██████▌   | 199/305 [40:28<21:02, 11.91s/it, loss=0.0482]

tensor(0.0470, grad_fn=<NllLossBackward>)


Epoch 0:  66%|██████▌   | 200/305 [40:40<21:03, 12.04s/it, loss=0.047] 

tensor(0.0295, grad_fn=<NllLossBackward>)


Epoch 0:  66%|██████▌   | 201/305 [40:52<20:52, 12.05s/it, loss=0.0295]

tensor(0.0407, grad_fn=<NllLossBackward>)


Epoch 0:  66%|██████▌   | 202/305 [41:04<20:32, 11.97s/it, loss=0.0407]

tensor(0.0322, grad_fn=<NllLossBackward>)


Epoch 0:  67%|██████▋   | 203/305 [41:16<20:22, 11.99s/it, loss=0.0322]

tensor(0.0423, grad_fn=<NllLossBackward>)


Epoch 0:  67%|██████▋   | 204/305 [41:28<19:55, 11.84s/it, loss=0.0423]

tensor(0.0467, grad_fn=<NllLossBackward>)


Epoch 0:  67%|██████▋   | 205/305 [41:40<19:49, 11.90s/it, loss=0.0467]

tensor(0.0461, grad_fn=<NllLossBackward>)


Epoch 0:  68%|██████▊   | 206/305 [41:51<19:32, 11.84s/it, loss=0.0461]

tensor(0.0462, grad_fn=<NllLossBackward>)


Epoch 0:  68%|██████▊   | 207/305 [42:03<19:13, 11.77s/it, loss=0.0462]

tensor(0.0343, grad_fn=<NllLossBackward>)


Epoch 0:  68%|██████▊   | 208/305 [42:15<19:14, 11.90s/it, loss=0.0343]

tensor(0.0583, grad_fn=<NllLossBackward>)


Epoch 0:  69%|██████▊   | 209/305 [42:27<19:00, 11.88s/it, loss=0.0583]

tensor(0.0442, grad_fn=<NllLossBackward>)


Epoch 0:  69%|██████▉   | 210/305 [42:39<18:45, 11.85s/it, loss=0.0442]

tensor(0.0319, grad_fn=<NllLossBackward>)


Epoch 0:  69%|██████▉   | 211/305 [42:51<18:37, 11.89s/it, loss=0.0319]

tensor(0.0727, grad_fn=<NllLossBackward>)


Epoch 0:  70%|██████▉   | 212/305 [43:03<18:22, 11.86s/it, loss=0.0727]

tensor(0.0498, grad_fn=<NllLossBackward>)


Epoch 0:  70%|██████▉   | 213/305 [43:14<18:04, 11.79s/it, loss=0.0498]

tensor(0.0347, grad_fn=<NllLossBackward>)


Epoch 0:  70%|███████   | 214/305 [43:26<17:59, 11.87s/it, loss=0.0347]

tensor(0.0581, grad_fn=<NllLossBackward>)


Epoch 0:  70%|███████   | 215/305 [43:38<17:41, 11.80s/it, loss=0.0581]

tensor(0.0334, grad_fn=<NllLossBackward>)


Epoch 0:  71%|███████   | 216/305 [43:50<17:45, 11.98s/it, loss=0.0334]

tensor(0.0457, grad_fn=<NllLossBackward>)


Epoch 0:  71%|███████   | 217/305 [44:02<17:36, 12.00s/it, loss=0.0457]

tensor(0.0484, grad_fn=<NllLossBackward>)


Epoch 0:  71%|███████▏  | 218/305 [44:14<17:18, 11.94s/it, loss=0.0484]

tensor(0.0571, grad_fn=<NllLossBackward>)


Epoch 0:  72%|███████▏  | 219/305 [44:27<17:21, 12.11s/it, loss=0.0571]

tensor(0.0305, grad_fn=<NllLossBackward>)


Epoch 0:  72%|███████▏  | 220/305 [44:39<17:06, 12.08s/it, loss=0.0305]

tensor(0.0437, grad_fn=<NllLossBackward>)


Epoch 0:  72%|███████▏  | 221/305 [44:51<16:53, 12.07s/it, loss=0.0437]

tensor(0.0325, grad_fn=<NllLossBackward>)


Epoch 0:  73%|███████▎  | 222/305 [45:04<17:02, 12.32s/it, loss=0.0325]

tensor(0.0454, grad_fn=<NllLossBackward>)


Epoch 0:  73%|███████▎  | 223/305 [45:15<16:38, 12.17s/it, loss=0.0454]

tensor(0.0386, grad_fn=<NllLossBackward>)


Epoch 0:  73%|███████▎  | 224/305 [45:28<16:30, 12.22s/it, loss=0.0386]

tensor(0.0209, grad_fn=<NllLossBackward>)


Epoch 0:  74%|███████▍  | 225/305 [45:39<16:05, 12.06s/it, loss=0.0209]

tensor(0.0640, grad_fn=<NllLossBackward>)


Epoch 0:  74%|███████▍  | 226/305 [45:52<15:57, 12.12s/it, loss=0.064] 

tensor(0.0471, grad_fn=<NllLossBackward>)


Epoch 0:  74%|███████▍  | 227/305 [46:04<15:49, 12.18s/it, loss=0.0471]

tensor(0.0308, grad_fn=<NllLossBackward>)


Epoch 0:  75%|███████▍  | 228/305 [46:16<15:31, 12.09s/it, loss=0.0308]

tensor(0.0447, grad_fn=<NllLossBackward>)


Epoch 0:  75%|███████▌  | 229/305 [46:28<15:20, 12.11s/it, loss=0.0447]

tensor(0.0350, grad_fn=<NllLossBackward>)


Epoch 0:  75%|███████▌  | 230/305 [46:40<15:14, 12.20s/it, loss=0.035] 

tensor(0.0511, grad_fn=<NllLossBackward>)


Epoch 0:  76%|███████▌  | 231/305 [46:52<14:58, 12.14s/it, loss=0.0511]

tensor(0.0644, grad_fn=<NllLossBackward>)


Epoch 0:  76%|███████▌  | 232/305 [47:05<14:49, 12.19s/it, loss=0.0644]

tensor(0.0413, grad_fn=<NllLossBackward>)


Epoch 0:  76%|███████▋  | 233/305 [47:22<16:21, 13.63s/it, loss=0.0413]

tensor(0.0268, grad_fn=<NllLossBackward>)


Epoch 0:  77%|███████▋  | 234/305 [47:34<15:41, 13.26s/it, loss=0.0268]

tensor(0.0695, grad_fn=<NllLossBackward>)


Epoch 0:  77%|███████▋  | 235/305 [47:46<14:56, 12.81s/it, loss=0.0695]

tensor(0.0515, grad_fn=<NllLossBackward>)


Epoch 0:  77%|███████▋  | 236/305 [47:58<14:20, 12.47s/it, loss=0.0515]

tensor(0.0397, grad_fn=<NllLossBackward>)


Epoch 0:  78%|███████▊  | 237/305 [48:10<14:00, 12.35s/it, loss=0.0397]

tensor(0.0711, grad_fn=<NllLossBackward>)


Epoch 0:  78%|███████▊  | 238/305 [48:21<13:36, 12.19s/it, loss=0.0711]

tensor(0.0606, grad_fn=<NllLossBackward>)


Epoch 0:  78%|███████▊  | 239/305 [48:33<13:21, 12.15s/it, loss=0.0606]

tensor(0.0609, grad_fn=<NllLossBackward>)


Epoch 0:  79%|███████▊  | 240/305 [48:46<13:16, 12.25s/it, loss=0.0609]

tensor(0.0326, grad_fn=<NllLossBackward>)


Epoch 0:  79%|███████▉  | 241/305 [48:58<13:01, 12.21s/it, loss=0.0326]

tensor(0.0386, grad_fn=<NllLossBackward>)


Epoch 0:  79%|███████▉  | 242/305 [49:10<12:51, 12.24s/it, loss=0.0386]

tensor(0.0285, grad_fn=<NllLossBackward>)


Epoch 0:  80%|███████▉  | 243/305 [49:22<12:35, 12.18s/it, loss=0.0285]

tensor(0.0362, grad_fn=<NllLossBackward>)


Epoch 0:  80%|████████  | 244/305 [49:34<12:19, 12.13s/it, loss=0.0362]

tensor(0.0418, grad_fn=<NllLossBackward>)


Epoch 0:  80%|████████  | 245/305 [49:47<12:11, 12.19s/it, loss=0.0418]

tensor(0.0921, grad_fn=<NllLossBackward>)


Epoch 0:  81%|████████  | 246/305 [49:59<11:51, 12.07s/it, loss=0.0921]

tensor(0.0233, grad_fn=<NllLossBackward>)


Epoch 0:  81%|████████  | 247/305 [50:10<11:37, 12.03s/it, loss=0.0233]

tensor(0.0494, grad_fn=<NllLossBackward>)


Epoch 0:  81%|████████▏ | 248/305 [50:23<11:32, 12.14s/it, loss=0.0494]

tensor(0.0740, grad_fn=<NllLossBackward>)


Epoch 0:  82%|████████▏ | 249/305 [50:35<11:15, 12.06s/it, loss=0.074] 

tensor(0.0201, grad_fn=<NllLossBackward>)


Epoch 0:  82%|████████▏ | 250/305 [50:47<11:02, 12.04s/it, loss=0.0201]

tensor(0.0326, grad_fn=<NllLossBackward>)


Epoch 0:  82%|████████▏ | 251/305 [50:59<10:47, 11.99s/it, loss=0.0326]

tensor(0.0230, grad_fn=<NllLossBackward>)


Epoch 0:  83%|████████▎ | 252/305 [51:10<10:33, 11.95s/it, loss=0.023] 

tensor(0.0257, grad_fn=<NllLossBackward>)


Epoch 0:  83%|████████▎ | 253/305 [51:23<10:27, 12.06s/it, loss=0.0257]

tensor(0.0389, grad_fn=<NllLossBackward>)


Epoch 0:  83%|████████▎ | 254/305 [51:34<10:07, 11.92s/it, loss=0.0389]

tensor(0.0653, grad_fn=<NllLossBackward>)


Epoch 0:  84%|████████▎ | 255/305 [51:46<09:50, 11.81s/it, loss=0.0653]

tensor(0.0404, grad_fn=<NllLossBackward>)


Epoch 0:  84%|████████▍ | 256/305 [51:58<09:42, 11.88s/it, loss=0.0404]

tensor(0.0399, grad_fn=<NllLossBackward>)


Epoch 0:  84%|████████▍ | 257/305 [52:09<09:24, 11.76s/it, loss=0.0399]

tensor(0.0438, grad_fn=<NllLossBackward>)


Epoch 0:  85%|████████▍ | 258/305 [52:21<09:11, 11.73s/it, loss=0.0438]

tensor(0.0396, grad_fn=<NllLossBackward>)


Epoch 0:  85%|████████▍ | 259/305 [52:33<09:04, 11.84s/it, loss=0.0396]

tensor(0.0553, grad_fn=<NllLossBackward>)


Epoch 0:  85%|████████▌ | 260/305 [52:45<08:50, 11.78s/it, loss=0.0553]

tensor(0.0315, grad_fn=<NllLossBackward>)


Epoch 0:  86%|████████▌ | 261/305 [52:57<08:39, 11.80s/it, loss=0.0315]

tensor(0.0493, grad_fn=<NllLossBackward>)


Epoch 0:  86%|████████▌ | 262/305 [53:09<08:33, 11.94s/it, loss=0.0493]

tensor(0.0406, grad_fn=<NllLossBackward>)


Epoch 0:  86%|████████▌ | 263/305 [53:21<08:18, 11.86s/it, loss=0.0406]

tensor(0.0630, grad_fn=<NllLossBackward>)


Epoch 0:  87%|████████▋ | 264/305 [53:33<08:08, 11.90s/it, loss=0.063] 

tensor(0.0425, grad_fn=<NllLossBackward>)


Epoch 0:  87%|████████▋ | 265/305 [53:44<07:53, 11.84s/it, loss=0.0425]

tensor(0.0566, grad_fn=<NllLossBackward>)


Epoch 0:  87%|████████▋ | 266/305 [53:56<07:39, 11.78s/it, loss=0.0566]

tensor(0.0215, grad_fn=<NllLossBackward>)


Epoch 0:  88%|████████▊ | 267/305 [54:08<07:29, 11.84s/it, loss=0.0215]

tensor(0.0657, grad_fn=<NllLossBackward>)


Epoch 0:  88%|████████▊ | 268/305 [54:20<07:15, 11.77s/it, loss=0.0657]

tensor(0.0338, grad_fn=<NllLossBackward>)


Epoch 0:  88%|████████▊ | 269/305 [54:31<07:02, 11.74s/it, loss=0.0338]

tensor(0.0474, grad_fn=<NllLossBackward>)


Epoch 0:  89%|████████▊ | 270/305 [54:43<06:56, 11.89s/it, loss=0.0474]

tensor(0.0593, grad_fn=<NllLossBackward>)


Epoch 0:  89%|████████▉ | 271/305 [54:55<06:44, 11.90s/it, loss=0.0593]

tensor(0.0272, grad_fn=<NllLossBackward>)


Epoch 0:  89%|████████▉ | 272/305 [55:08<06:36, 12.02s/it, loss=0.0272]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 0:  90%|████████▉ | 273/305 [55:20<06:22, 11.97s/it, loss=0.0307]

tensor(0.0236, grad_fn=<NllLossBackward>)


Epoch 0:  90%|████████▉ | 274/305 [55:31<06:08, 11.89s/it, loss=0.0236]

tensor(0.0475, grad_fn=<NllLossBackward>)


Epoch 0:  90%|█████████ | 275/305 [55:43<05:58, 11.96s/it, loss=0.0475]

tensor(0.0403, grad_fn=<NllLossBackward>)


Epoch 0:  90%|█████████ | 276/305 [55:55<05:46, 11.93s/it, loss=0.0403]

tensor(0.0480, grad_fn=<NllLossBackward>)


Epoch 0:  91%|█████████ | 277/305 [56:07<05:30, 11.82s/it, loss=0.048] 

tensor(0.0522, grad_fn=<NllLossBackward>)


Epoch 0:  91%|█████████ | 278/305 [56:19<05:20, 11.88s/it, loss=0.0522]

tensor(0.0217, grad_fn=<NllLossBackward>)


Epoch 0:  91%|█████████▏| 279/305 [56:30<05:06, 11.80s/it, loss=0.0217]

tensor(0.0379, grad_fn=<NllLossBackward>)


Epoch 0:  92%|█████████▏| 280/305 [56:43<04:57, 11.88s/it, loss=0.0379]

tensor(0.0664, grad_fn=<NllLossBackward>)


Epoch 0:  92%|█████████▏| 281/305 [56:54<04:43, 11.82s/it, loss=0.0664]

tensor(0.0394, grad_fn=<NllLossBackward>)


Epoch 0:  92%|█████████▏| 282/305 [57:06<04:30, 11.74s/it, loss=0.0394]

tensor(0.0452, grad_fn=<NllLossBackward>)


Epoch 0:  93%|█████████▎| 283/305 [57:18<04:19, 11.82s/it, loss=0.0452]

tensor(0.0382, grad_fn=<NllLossBackward>)


Epoch 0:  93%|█████████▎| 284/305 [57:29<04:06, 11.73s/it, loss=0.0382]

tensor(0.0487, grad_fn=<NllLossBackward>)


Epoch 0:  93%|█████████▎| 285/305 [57:41<03:54, 11.73s/it, loss=0.0487]

tensor(0.0435, grad_fn=<NllLossBackward>)


Epoch 0:  94%|█████████▍| 286/305 [57:53<03:45, 11.84s/it, loss=0.0435]

tensor(0.0516, grad_fn=<NllLossBackward>)


Epoch 0:  94%|█████████▍| 287/305 [58:05<03:31, 11.76s/it, loss=0.0516]

tensor(0.0440, grad_fn=<NllLossBackward>)


Epoch 0:  94%|█████████▍| 288/305 [58:16<03:19, 11.72s/it, loss=0.044] 

tensor(0.0361, grad_fn=<NllLossBackward>)


Epoch 0:  95%|█████████▍| 289/305 [58:29<03:10, 11.88s/it, loss=0.0361]

tensor(0.0230, grad_fn=<NllLossBackward>)


Epoch 0:  95%|█████████▌| 290/305 [58:40<02:56, 11.79s/it, loss=0.023] 

tensor(0.0579, grad_fn=<NllLossBackward>)


Epoch 0:  95%|█████████▌| 291/305 [58:52<02:45, 11.85s/it, loss=0.0579]

tensor(0.0346, grad_fn=<NllLossBackward>)


Epoch 0:  96%|█████████▌| 292/305 [59:04<02:32, 11.74s/it, loss=0.0346]

tensor(0.0315, grad_fn=<NllLossBackward>)


Epoch 0:  96%|█████████▌| 293/305 [59:15<02:20, 11.72s/it, loss=0.0315]

tensor(0.0528, grad_fn=<NllLossBackward>)


Epoch 0:  96%|█████████▋| 294/305 [59:27<02:10, 11.83s/it, loss=0.0528]

tensor(0.0447, grad_fn=<NllLossBackward>)


Epoch 0:  97%|█████████▋| 295/305 [59:39<01:58, 11.84s/it, loss=0.0447]

tensor(0.0250, grad_fn=<NllLossBackward>)


Epoch 0:  97%|█████████▋| 296/305 [59:51<01:46, 11.80s/it, loss=0.025] 

tensor(0.0367, grad_fn=<NllLossBackward>)


Epoch 0:  97%|█████████▋| 297/305 [1:00:03<01:35, 11.88s/it, loss=0.0367]

tensor(0.0405, grad_fn=<NllLossBackward>)


Epoch 0:  98%|█████████▊| 298/305 [1:00:15<01:23, 11.88s/it, loss=0.0405]

tensor(0.0628, grad_fn=<NllLossBackward>)


Epoch 0:  98%|█████████▊| 299/305 [1:00:27<01:12, 12.02s/it, loss=0.0628]

tensor(0.0397, grad_fn=<NllLossBackward>)


Epoch 0:  98%|█████████▊| 300/305 [1:00:39<00:59, 11.90s/it, loss=0.0397]

tensor(0.0385, grad_fn=<NllLossBackward>)


Epoch 0:  99%|█████████▊| 301/305 [1:00:50<00:47, 11.81s/it, loss=0.0385]

tensor(0.0489, grad_fn=<NllLossBackward>)


Epoch 0:  99%|█████████▉| 302/305 [1:01:03<00:35, 11.90s/it, loss=0.0489]

tensor(0.0419, grad_fn=<NllLossBackward>)


Epoch 0:  99%|█████████▉| 303/305 [1:01:14<00:23, 11.82s/it, loss=0.0419]

tensor(0.0292, grad_fn=<NllLossBackward>)


Epoch 0: 100%|█████████▉| 304/305 [1:01:26<00:11, 11.81s/it, loss=0.0292]

tensor(0.0692, grad_fn=<NllLossBackward>)


Epoch 0: 100%|██████████| 305/305 [1:01:29<00:00, 12.10s/it, loss=0.0692]

tensor(0.0437, grad_fn=<NllLossBackward>)


Epoch 1:   0%|          | 1/305 [00:11<59:54, 11.82s/it, loss=0.0437]

tensor(0.0409, grad_fn=<NllLossBackward>)


Epoch 1:   1%|          | 2/305 [00:23<59:19, 11.75s/it, loss=0.0409]

tensor(0.0329, grad_fn=<NllLossBackward>)


Epoch 1:   1%|          | 3/305 [00:35<59:41, 11.86s/it, loss=0.0329]

tensor(0.0496, grad_fn=<NllLossBackward>)


Epoch 1:   1%|▏         | 4/305 [00:47<58:58, 11.76s/it, loss=0.0496]

tensor(0.0345, grad_fn=<NllLossBackward>)


Epoch 1:   2%|▏         | 5/305 [00:58<58:39, 11.73s/it, loss=0.0345]

tensor(0.0375, grad_fn=<NllLossBackward>)


Epoch 1:   2%|▏         | 6/305 [01:10<58:49, 11.80s/it, loss=0.0375]

tensor(0.0267, grad_fn=<NllLossBackward>)


Epoch 1:   2%|▏         | 7/305 [01:22<58:24, 11.76s/it, loss=0.0267]

tensor(0.0523, grad_fn=<NllLossBackward>)


Epoch 1:   3%|▎         | 8/305 [01:33<57:55, 11.70s/it, loss=0.0523]

tensor(0.0347, grad_fn=<NllLossBackward>)


Epoch 1:   3%|▎         | 9/305 [01:45<58:11, 11.79s/it, loss=0.0347]

tensor(0.0487, grad_fn=<NllLossBackward>)


Epoch 1:   3%|▎         | 10/305 [01:57<58:07, 11.82s/it, loss=0.0487]

tensor(0.0311, grad_fn=<NllLossBackward>)


Epoch 1:   4%|▎         | 11/305 [02:09<57:31, 11.74s/it, loss=0.0311]

tensor(0.0480, grad_fn=<NllLossBackward>)


Epoch 1:   4%|▍         | 12/305 [02:21<57:42, 11.82s/it, loss=0.048] 

tensor(0.0431, grad_fn=<NllLossBackward>)


Epoch 1:   4%|▍         | 13/305 [02:32<57:07, 11.74s/it, loss=0.0431]

tensor(0.0584, grad_fn=<NllLossBackward>)


Epoch 1:   5%|▍         | 14/305 [02:45<57:49, 11.92s/it, loss=0.0584]

tensor(0.0223, grad_fn=<NllLossBackward>)


Epoch 1:   5%|▍         | 15/305 [02:57<57:55, 11.98s/it, loss=0.0223]

tensor(0.0296, grad_fn=<NllLossBackward>)


Epoch 1:   5%|▌         | 16/305 [03:09<58:02, 12.05s/it, loss=0.0296]

tensor(0.0516, grad_fn=<NllLossBackward>)


Epoch 1:   6%|▌         | 17/305 [03:22<58:22, 12.16s/it, loss=0.0516]

tensor(0.0306, grad_fn=<NllLossBackward>)


Epoch 1:   6%|▌         | 18/305 [03:34<58:00, 12.13s/it, loss=0.0306]

tensor(0.0391, grad_fn=<NllLossBackward>)


Epoch 1:   6%|▌         | 19/305 [03:46<57:31, 12.07s/it, loss=0.0391]

tensor(0.0251, grad_fn=<NllLossBackward>)


Epoch 1:   7%|▋         | 20/305 [03:58<58:26, 12.30s/it, loss=0.0251]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 1:   7%|▋         | 21/305 [04:11<58:05, 12.27s/it, loss=0.0307]

tensor(0.0388, grad_fn=<NllLossBackward>)


Epoch 1:   7%|▋         | 22/305 [04:23<58:31, 12.41s/it, loss=0.0388]

tensor(0.0235, grad_fn=<NllLossBackward>)


Epoch 1:   8%|▊         | 23/305 [04:35<57:24, 12.21s/it, loss=0.0235]

tensor(0.0432, grad_fn=<NllLossBackward>)


Epoch 1:   8%|▊         | 24/305 [04:47<56:36, 12.09s/it, loss=0.0432]

tensor(0.0233, grad_fn=<NllLossBackward>)


Epoch 1:   8%|▊         | 25/305 [04:59<56:42, 12.15s/it, loss=0.0233]

tensor(0.0304, grad_fn=<NllLossBackward>)


Epoch 1:   9%|▊         | 26/305 [05:11<56:37, 12.18s/it, loss=0.0304]

tensor(0.0380, grad_fn=<NllLossBackward>)


Epoch 1:   9%|▉         | 27/305 [05:23<56:12, 12.13s/it, loss=0.038] 

tensor(0.0233, grad_fn=<NllLossBackward>)


Epoch 1:   9%|▉         | 28/305 [05:36<56:27, 12.23s/it, loss=0.0233]

tensor(0.0255, grad_fn=<NllLossBackward>)


Epoch 1:  10%|▉         | 29/305 [05:48<55:55, 12.16s/it, loss=0.0255]

tensor(0.0246, grad_fn=<NllLossBackward>)


Epoch 1:  10%|▉         | 30/305 [06:01<56:21, 12.30s/it, loss=0.0246]

tensor(0.0385, grad_fn=<NllLossBackward>)


Epoch 1:  10%|█         | 31/305 [06:12<55:43, 12.20s/it, loss=0.0385]

tensor(0.0215, grad_fn=<NllLossBackward>)


Epoch 1:  10%|█         | 32/305 [06:25<55:33, 12.21s/it, loss=0.0215]

tensor(0.0423, grad_fn=<NllLossBackward>)


Epoch 1:  11%|█         | 33/305 [06:37<56:06, 12.38s/it, loss=0.0423]

tensor(0.0300, grad_fn=<NllLossBackward>)


Epoch 1:  11%|█         | 34/305 [06:50<55:30, 12.29s/it, loss=0.03]  

tensor(0.0298, grad_fn=<NllLossBackward>)


Epoch 1:  11%|█▏        | 35/305 [07:01<54:41, 12.15s/it, loss=0.0298]

tensor(0.0350, grad_fn=<NllLossBackward>)


Epoch 1:  12%|█▏        | 36/305 [07:14<54:37, 12.18s/it, loss=0.035] 

tensor(0.0482, grad_fn=<NllLossBackward>)


Epoch 1:  12%|█▏        | 37/305 [07:26<54:27, 12.19s/it, loss=0.0482]

tensor(0.1288, grad_fn=<NllLossBackward>)


Epoch 1:  12%|█▏        | 38/305 [07:38<54:32, 12.26s/it, loss=0.129] 

tensor(0.0420, grad_fn=<NllLossBackward>)


Epoch 1:  13%|█▎        | 39/305 [07:51<54:30, 12.30s/it, loss=0.042]

tensor(0.0249, grad_fn=<NllLossBackward>)


Epoch 1:  13%|█▎        | 40/305 [08:03<54:15, 12.28s/it, loss=0.0249]

tensor(0.0394, grad_fn=<NllLossBackward>)


Epoch 1:  13%|█▎        | 41/305 [08:16<54:29, 12.39s/it, loss=0.0394]

tensor(0.0354, grad_fn=<NllLossBackward>)


Epoch 1:  14%|█▍        | 42/305 [08:28<54:14, 12.37s/it, loss=0.0354]

tensor(0.0154, grad_fn=<NllLossBackward>)


Epoch 1:  14%|█▍        | 43/305 [08:40<53:52, 12.34s/it, loss=0.0154]

tensor(0.0318, grad_fn=<NllLossBackward>)


Epoch 1:  14%|█▍        | 44/305 [08:53<54:03, 12.43s/it, loss=0.0318]

tensor(0.0331, grad_fn=<NllLossBackward>)


Epoch 1:  15%|█▍        | 45/305 [09:05<53:26, 12.33s/it, loss=0.0331]

tensor(0.0438, grad_fn=<NllLossBackward>)


Epoch 1:  15%|█▌        | 46/305 [09:17<52:51, 12.25s/it, loss=0.0438]

tensor(0.0413, grad_fn=<NllLossBackward>)


Epoch 1:  15%|█▌        | 47/305 [09:29<53:02, 12.33s/it, loss=0.0413]

tensor(0.0215, grad_fn=<NllLossBackward>)


Epoch 1:  16%|█▌        | 48/305 [09:42<52:30, 12.26s/it, loss=0.0215]

tensor(0.0231, grad_fn=<NllLossBackward>)


Epoch 1:  16%|█▌        | 49/305 [09:54<52:48, 12.38s/it, loss=0.0231]

tensor(0.0151, grad_fn=<NllLossBackward>)


Epoch 1:  16%|█▋        | 50/305 [10:06<52:15, 12.30s/it, loss=0.0151]

tensor(0.0406, grad_fn=<NllLossBackward>)


Epoch 1:  17%|█▋        | 51/305 [10:19<52:30, 12.40s/it, loss=0.0406]

tensor(0.0373, grad_fn=<NllLossBackward>)


Epoch 1:  17%|█▋        | 52/305 [10:32<52:57, 12.56s/it, loss=0.0373]

tensor(0.0344, grad_fn=<NllLossBackward>)


Epoch 1:  17%|█▋        | 53/305 [10:44<52:21, 12.47s/it, loss=0.0344]

tensor(0.0233, grad_fn=<NllLossBackward>)


Epoch 1:  18%|█▊        | 54/305 [10:57<52:07, 12.46s/it, loss=0.0233]

tensor(0.0263, grad_fn=<NllLossBackward>)


Epoch 1:  18%|█▊        | 55/305 [11:09<51:38, 12.39s/it, loss=0.0263]

tensor(0.0415, grad_fn=<NllLossBackward>)


Epoch 1:  18%|█▊        | 56/305 [11:21<51:17, 12.36s/it, loss=0.0415]

tensor(0.0454, grad_fn=<NllLossBackward>)


Epoch 1:  19%|█▊        | 57/305 [11:34<51:23, 12.43s/it, loss=0.0454]

tensor(0.0159, grad_fn=<NllLossBackward>)


Epoch 1:  19%|█▉        | 58/305 [11:46<50:53, 12.36s/it, loss=0.0159]

tensor(0.0476, grad_fn=<NllLossBackward>)


Epoch 1:  19%|█▉        | 59/305 [11:58<50:16, 12.26s/it, loss=0.0476]

tensor(0.0338, grad_fn=<NllLossBackward>)


Epoch 1:  20%|█▉        | 60/305 [12:11<50:25, 12.35s/it, loss=0.0338]

tensor(0.0146, grad_fn=<NllLossBackward>)


Epoch 1:  20%|██        | 61/305 [12:23<50:01, 12.30s/it, loss=0.0146]

tensor(0.0360, grad_fn=<NllLossBackward>)


Epoch 1:  20%|██        | 62/305 [12:36<50:31, 12.47s/it, loss=0.036] 

tensor(0.0310, grad_fn=<NllLossBackward>)


Epoch 1:  21%|██        | 63/305 [12:48<49:59, 12.39s/it, loss=0.031]

tensor(0.0278, grad_fn=<NllLossBackward>)


Epoch 1:  21%|██        | 64/305 [13:00<49:20, 12.28s/it, loss=0.0278]

tensor(0.0281, grad_fn=<NllLossBackward>)


Epoch 1:  21%|██▏       | 65/305 [13:12<49:23, 12.35s/it, loss=0.0281]

tensor(0.0257, grad_fn=<NllLossBackward>)


Epoch 1:  22%|██▏       | 66/305 [13:24<48:57, 12.29s/it, loss=0.0257]

tensor(0.0304, grad_fn=<NllLossBackward>)


Epoch 1:  22%|██▏       | 67/305 [13:37<48:53, 12.32s/it, loss=0.0304]

tensor(0.0264, grad_fn=<NllLossBackward>)


Epoch 1:  22%|██▏       | 68/305 [13:49<48:50, 12.37s/it, loss=0.0264]

tensor(0.0430, grad_fn=<NllLossBackward>)


Epoch 1:  23%|██▎       | 69/305 [14:01<48:08, 12.24s/it, loss=0.043] 

tensor(0.0413, grad_fn=<NllLossBackward>)


Epoch 1:  23%|██▎       | 70/305 [14:14<48:15, 12.32s/it, loss=0.0413]

tensor(0.0271, grad_fn=<NllLossBackward>)


Epoch 1:  23%|██▎       | 71/305 [14:26<47:51, 12.27s/it, loss=0.0271]

tensor(0.0230, grad_fn=<NllLossBackward>)


Epoch 1:  24%|██▎       | 72/305 [14:38<47:11, 12.15s/it, loss=0.023] 

tensor(0.0220, grad_fn=<NllLossBackward>)


Epoch 1:  24%|██▍       | 73/305 [14:50<47:26, 12.27s/it, loss=0.022]

tensor(0.0336, grad_fn=<NllLossBackward>)


Epoch 1:  24%|██▍       | 74/305 [15:02<47:05, 12.23s/it, loss=0.0336]

tensor(0.0292, grad_fn=<NllLossBackward>)


Epoch 1:  25%|██▍       | 75/305 [15:15<46:40, 12.18s/it, loss=0.0292]

tensor(0.0360, grad_fn=<NllLossBackward>)


Epoch 1:  25%|██▍       | 76/305 [15:27<46:35, 12.21s/it, loss=0.036] 

tensor(0.0493, grad_fn=<NllLossBackward>)


Epoch 1:  25%|██▌       | 77/305 [15:39<45:55, 12.09s/it, loss=0.0493]

tensor(0.0327, grad_fn=<NllLossBackward>)


Epoch 1:  26%|██▌       | 78/305 [15:51<46:11, 12.21s/it, loss=0.0327]

tensor(0.0466, grad_fn=<NllLossBackward>)


Epoch 1:  26%|██▌       | 79/305 [16:03<45:44, 12.14s/it, loss=0.0466]

tensor(0.0111, grad_fn=<NllLossBackward>)


Epoch 1:  26%|██▌       | 80/305 [16:15<45:15, 12.07s/it, loss=0.0111]

tensor(0.0346, grad_fn=<NllLossBackward>)


Epoch 1:  27%|██▋       | 81/305 [16:27<45:26, 12.17s/it, loss=0.0346]

tensor(0.0337, grad_fn=<NllLossBackward>)


Epoch 1:  27%|██▋       | 82/305 [16:39<44:52, 12.07s/it, loss=0.0337]

tensor(0.0432, grad_fn=<NllLossBackward>)


Epoch 1:  27%|██▋       | 83/305 [16:51<44:32, 12.04s/it, loss=0.0432]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 1:  28%|██▊       | 84/305 [17:04<44:43, 12.14s/it, loss=0.0307]

tensor(0.0304, grad_fn=<NllLossBackward>)


Epoch 1:  28%|██▊       | 85/305 [17:15<44:14, 12.07s/it, loss=0.0304]

tensor(0.0421, grad_fn=<NllLossBackward>)


Epoch 1:  28%|██▊       | 86/305 [17:28<44:32, 12.20s/it, loss=0.0421]

tensor(0.0410, grad_fn=<NllLossBackward>)


Epoch 1:  29%|██▊       | 87/305 [17:40<44:19, 12.20s/it, loss=0.041] 

tensor(0.0364, grad_fn=<NllLossBackward>)


Epoch 1:  29%|██▉       | 88/305 [17:52<43:54, 12.14s/it, loss=0.0364]

tensor(0.0312, grad_fn=<NllLossBackward>)


Epoch 1:  29%|██▉       | 89/305 [18:04<43:50, 12.18s/it, loss=0.0312]

tensor(0.0407, grad_fn=<NllLossBackward>)


Epoch 1:  30%|██▉       | 90/305 [18:17<43:32, 12.15s/it, loss=0.0407]

tensor(0.0272, grad_fn=<NllLossBackward>)


Epoch 1:  30%|██▉       | 91/305 [18:29<43:30, 12.20s/it, loss=0.0272]

tensor(0.0390, grad_fn=<NllLossBackward>)


Epoch 1:  30%|███       | 92/305 [18:41<43:01, 12.12s/it, loss=0.039] 

tensor(0.0341, grad_fn=<NllLossBackward>)


Epoch 1:  30%|███       | 93/305 [18:53<42:42, 12.09s/it, loss=0.0341]

tensor(0.0347, grad_fn=<NllLossBackward>)


Epoch 1:  31%|███       | 94/305 [19:05<42:40, 12.14s/it, loss=0.0347]

tensor(0.0358, grad_fn=<NllLossBackward>)


Epoch 1:  31%|███       | 95/305 [19:17<42:13, 12.07s/it, loss=0.0358]

tensor(0.0204, grad_fn=<NllLossBackward>)


Epoch 1:  31%|███▏      | 96/305 [19:29<41:55, 12.04s/it, loss=0.0204]

tensor(0.0139, grad_fn=<NllLossBackward>)


Epoch 1:  32%|███▏      | 97/305 [19:41<42:04, 12.14s/it, loss=0.0139]

tensor(0.0243, grad_fn=<NllLossBackward>)


Epoch 1:  32%|███▏      | 98/305 [19:53<41:46, 12.11s/it, loss=0.0243]

tensor(0.0265, grad_fn=<NllLossBackward>)


Epoch 1:  32%|███▏      | 99/305 [20:06<41:54, 12.21s/it, loss=0.0265]

tensor(0.0360, grad_fn=<NllLossBackward>)


Epoch 1:  33%|███▎      | 100/305 [20:18<41:29, 12.14s/it, loss=0.036]

tensor(0.0585, grad_fn=<NllLossBackward>)


Epoch 1:  33%|███▎      | 101/305 [20:30<41:12, 12.12s/it, loss=0.0585]

tensor(0.0197, grad_fn=<NllLossBackward>)


Epoch 1:  33%|███▎      | 102/305 [20:42<41:30, 12.27s/it, loss=0.0197]

tensor(0.0289, grad_fn=<NllLossBackward>)


Epoch 1:  34%|███▍      | 103/305 [20:54<41:04, 12.20s/it, loss=0.0289]

tensor(0.0314, grad_fn=<NllLossBackward>)


Epoch 1:  34%|███▍      | 104/305 [21:06<40:28, 12.08s/it, loss=0.0314]

tensor(0.0340, grad_fn=<NllLossBackward>)


Epoch 1:  34%|███▍      | 105/305 [21:19<40:42, 12.21s/it, loss=0.034] 

tensor(0.0225, grad_fn=<NllLossBackward>)


Epoch 1:  35%|███▍      | 106/305 [21:31<40:28, 12.20s/it, loss=0.0225]

tensor(0.0523, grad_fn=<NllLossBackward>)


Epoch 1:  35%|███▌      | 107/305 [21:44<40:42, 12.34s/it, loss=0.0523]

tensor(0.0246, grad_fn=<NllLossBackward>)


Epoch 1:  35%|███▌      | 108/305 [21:56<40:36, 12.37s/it, loss=0.0246]

tensor(0.0267, grad_fn=<NllLossBackward>)


Epoch 1:  36%|███▌      | 109/305 [22:08<40:09, 12.30s/it, loss=0.0267]

tensor(0.0364, grad_fn=<NllLossBackward>)


Epoch 1:  36%|███▌      | 110/305 [22:21<39:59, 12.31s/it, loss=0.0364]

tensor(0.0507, grad_fn=<NllLossBackward>)


Epoch 1:  36%|███▋      | 111/305 [22:33<39:28, 12.21s/it, loss=0.0507]

tensor(0.0341, grad_fn=<NllLossBackward>)


Epoch 1:  37%|███▋      | 112/305 [22:45<39:08, 12.17s/it, loss=0.0341]

tensor(0.0351, grad_fn=<NllLossBackward>)


Epoch 1:  37%|███▋      | 113/305 [22:57<38:46, 12.12s/it, loss=0.0351]

tensor(0.0296, grad_fn=<NllLossBackward>)


Epoch 1:  37%|███▋      | 114/305 [23:09<38:29, 12.09s/it, loss=0.0296]

tensor(0.0213, grad_fn=<NllLossBackward>)


Epoch 1:  38%|███▊      | 115/305 [23:21<38:40, 12.21s/it, loss=0.0213]

tensor(0.0362, grad_fn=<NllLossBackward>)


Epoch 1:  38%|███▊      | 116/305 [23:33<38:18, 12.16s/it, loss=0.0362]

tensor(0.0350, grad_fn=<NllLossBackward>)


Epoch 1:  38%|███▊      | 117/305 [23:45<37:37, 12.01s/it, loss=0.035] 

tensor(0.0175, grad_fn=<NllLossBackward>)


Epoch 1:  39%|███▊      | 118/305 [23:57<37:25, 12.01s/it, loss=0.0175]

tensor(0.0284, grad_fn=<NllLossBackward>)


Epoch 1:  39%|███▉      | 119/305 [24:09<37:02, 11.95s/it, loss=0.0284]

tensor(0.0246, grad_fn=<NllLossBackward>)


Epoch 1:  39%|███▉      | 120/305 [24:20<36:25, 11.81s/it, loss=0.0246]

tensor(0.0322, grad_fn=<NllLossBackward>)


Epoch 1:  40%|███▉      | 121/305 [24:32<36:19, 11.84s/it, loss=0.0322]

tensor(0.0365, grad_fn=<NllLossBackward>)


Epoch 1:  40%|████      | 122/305 [24:44<35:52, 11.76s/it, loss=0.0365]

tensor(0.0284, grad_fn=<NllLossBackward>)


Epoch 1:  40%|████      | 123/305 [24:56<35:54, 11.84s/it, loss=0.0284]

tensor(0.0279, grad_fn=<NllLossBackward>)


Epoch 1:  41%|████      | 124/305 [25:07<35:20, 11.71s/it, loss=0.0279]

tensor(0.0354, grad_fn=<NllLossBackward>)


Epoch 1:  41%|████      | 125/305 [25:19<35:00, 11.67s/it, loss=0.0354]

tensor(0.0232, grad_fn=<NllLossBackward>)


Epoch 1:  41%|████▏     | 126/305 [25:31<35:05, 11.76s/it, loss=0.0232]

tensor(0.0144, grad_fn=<NllLossBackward>)


Epoch 1:  42%|████▏     | 127/305 [25:42<34:38, 11.68s/it, loss=0.0144]

tensor(0.0406, grad_fn=<NllLossBackward>)


Epoch 1:  42%|████▏     | 128/305 [25:54<34:14, 11.61s/it, loss=0.0406]

tensor(0.0478, grad_fn=<NllLossBackward>)


Epoch 1:  42%|████▏     | 129/305 [26:05<34:13, 11.67s/it, loss=0.0478]

tensor(0.0304, grad_fn=<NllLossBackward>)


Epoch 1:  43%|████▎     | 130/305 [26:17<33:55, 11.63s/it, loss=0.0304]

tensor(0.0518, grad_fn=<NllLossBackward>)


Epoch 1:  43%|████▎     | 131/305 [26:28<33:38, 11.60s/it, loss=0.0518]

tensor(0.0445, grad_fn=<NllLossBackward>)


Epoch 1:  43%|████▎     | 132/305 [26:40<33:47, 11.72s/it, loss=0.0445]

tensor(0.0321, grad_fn=<NllLossBackward>)


Epoch 1:  44%|████▎     | 133/305 [26:52<33:24, 11.65s/it, loss=0.0321]

tensor(0.0191, grad_fn=<NllLossBackward>)


Epoch 1:  44%|████▍     | 134/305 [27:04<33:13, 11.66s/it, loss=0.0191]

tensor(0.0221, grad_fn=<NllLossBackward>)


Epoch 1:  44%|████▍     | 135/305 [27:16<33:19, 11.76s/it, loss=0.0221]

tensor(0.0390, grad_fn=<NllLossBackward>)


Epoch 1:  45%|████▍     | 136/305 [27:27<32:58, 11.71s/it, loss=0.039] 

tensor(0.0425, grad_fn=<NllLossBackward>)


Epoch 1:  45%|████▍     | 137/305 [27:39<32:53, 11.75s/it, loss=0.0425]

tensor(0.0405, grad_fn=<NllLossBackward>)


Epoch 1:  45%|████▌     | 138/305 [27:51<32:33, 11.69s/it, loss=0.0405]

tensor(0.0461, grad_fn=<NllLossBackward>)


Epoch 1:  46%|████▌     | 139/305 [28:02<32:07, 11.61s/it, loss=0.0461]

tensor(0.0159, grad_fn=<NllLossBackward>)


Epoch 1:  46%|████▌     | 140/305 [28:14<32:13, 11.72s/it, loss=0.0159]

tensor(0.0291, grad_fn=<NllLossBackward>)


Epoch 1:  46%|████▌     | 141/305 [28:25<31:51, 11.66s/it, loss=0.0291]

tensor(0.0208, grad_fn=<NllLossBackward>)


Epoch 1:  47%|████▋     | 142/305 [28:37<31:34, 11.62s/it, loss=0.0208]

tensor(0.0302, grad_fn=<NllLossBackward>)


Epoch 1:  47%|████▋     | 143/305 [28:49<31:34, 11.70s/it, loss=0.0302]

tensor(0.0388, grad_fn=<NllLossBackward>)


Epoch 1:  47%|████▋     | 144/305 [29:00<31:05, 11.58s/it, loss=0.0388]

tensor(0.0122, grad_fn=<NllLossBackward>)


Epoch 1:  48%|████▊     | 145/305 [29:12<30:46, 11.54s/it, loss=0.0122]

tensor(0.0302, grad_fn=<NllLossBackward>)


Epoch 1:  48%|████▊     | 146/305 [29:24<30:51, 11.64s/it, loss=0.0302]

tensor(0.0214, grad_fn=<NllLossBackward>)


Epoch 1:  48%|████▊     | 147/305 [29:35<30:27, 11.57s/it, loss=0.0214]

tensor(0.0337, grad_fn=<NllLossBackward>)


Epoch 1:  49%|████▊     | 148/305 [29:47<30:28, 11.65s/it, loss=0.0337]

tensor(0.0127, grad_fn=<NllLossBackward>)


Epoch 1:  49%|████▉     | 149/305 [29:58<30:12, 11.62s/it, loss=0.0127]

tensor(0.0426, grad_fn=<NllLossBackward>)


Epoch 1:  49%|████▉     | 150/305 [30:10<29:58, 11.60s/it, loss=0.0426]

tensor(0.0258, grad_fn=<NllLossBackward>)


Epoch 1:  50%|████▉     | 151/305 [30:22<30:07, 11.74s/it, loss=0.0258]

tensor(0.0404, grad_fn=<NllLossBackward>)


Epoch 1:  50%|████▉     | 152/305 [30:34<29:54, 11.73s/it, loss=0.0404]

tensor(0.0409, grad_fn=<NllLossBackward>)


Epoch 1:  50%|█████     | 153/305 [30:46<29:55, 11.82s/it, loss=0.0409]

tensor(0.0327, grad_fn=<NllLossBackward>)


Epoch 1:  50%|█████     | 154/305 [30:58<30:06, 11.97s/it, loss=0.0327]

tensor(0.0310, grad_fn=<NllLossBackward>)


Epoch 1:  51%|█████     | 155/305 [31:10<29:49, 11.93s/it, loss=0.031] 

tensor(0.0283, grad_fn=<NllLossBackward>)


Epoch 1:  51%|█████     | 156/305 [31:21<29:20, 11.81s/it, loss=0.0283]

tensor(0.0168, grad_fn=<NllLossBackward>)


Epoch 1:  51%|█████▏    | 157/305 [31:34<29:26, 11.94s/it, loss=0.0168]

tensor(0.0315, grad_fn=<NllLossBackward>)


Epoch 1:  52%|█████▏    | 158/305 [31:45<29:02, 11.85s/it, loss=0.0315]

tensor(0.0448, grad_fn=<NllLossBackward>)


Epoch 1:  52%|█████▏    | 159/305 [31:57<29:05, 11.95s/it, loss=0.0448]

tensor(0.0303, grad_fn=<NllLossBackward>)


Epoch 1:  52%|█████▏    | 160/305 [32:09<28:38, 11.85s/it, loss=0.0303]

tensor(0.0282, grad_fn=<NllLossBackward>)


Epoch 1:  53%|█████▎    | 161/305 [32:21<28:20, 11.81s/it, loss=0.0282]

tensor(0.0412, grad_fn=<NllLossBackward>)


Epoch 1:  53%|█████▎    | 162/305 [32:33<28:17, 11.87s/it, loss=0.0412]

tensor(0.0284, grad_fn=<NllLossBackward>)


Epoch 1:  53%|█████▎    | 163/305 [32:44<27:52, 11.78s/it, loss=0.0284]

tensor(0.0493, grad_fn=<NllLossBackward>)


Epoch 1:  54%|█████▍    | 164/305 [32:56<27:31, 11.71s/it, loss=0.0493]

tensor(0.0266, grad_fn=<NllLossBackward>)


Epoch 1:  54%|█████▍    | 165/305 [33:08<27:27, 11.77s/it, loss=0.0266]

tensor(0.0368, grad_fn=<NllLossBackward>)


Epoch 1:  54%|█████▍    | 166/305 [33:19<27:04, 11.68s/it, loss=0.0368]

tensor(0.0264, grad_fn=<NllLossBackward>)


Epoch 1:  55%|█████▍    | 167/305 [33:31<26:44, 11.63s/it, loss=0.0264]

tensor(0.0266, grad_fn=<NllLossBackward>)


Epoch 1:  55%|█████▌    | 168/305 [33:43<26:56, 11.80s/it, loss=0.0266]

tensor(0.0434, grad_fn=<NllLossBackward>)


Epoch 1:  55%|█████▌    | 169/305 [33:54<26:29, 11.69s/it, loss=0.0434]

tensor(0.0270, grad_fn=<NllLossBackward>)


Epoch 1:  56%|█████▌    | 170/305 [34:06<26:27, 11.76s/it, loss=0.027] 

tensor(0.0305, grad_fn=<NllLossBackward>)


Epoch 1:  56%|█████▌    | 171/305 [34:18<26:01, 11.66s/it, loss=0.0305]

tensor(0.0238, grad_fn=<NllLossBackward>)


Epoch 1:  56%|█████▋    | 172/305 [34:29<25:41, 11.59s/it, loss=0.0238]

tensor(0.0380, grad_fn=<NllLossBackward>)


Epoch 1:  57%|█████▋    | 173/305 [34:41<25:36, 11.64s/it, loss=0.038] 

tensor(0.0767, grad_fn=<NllLossBackward>)


Epoch 1:  57%|█████▋    | 174/305 [34:53<25:24, 11.64s/it, loss=0.0767]

tensor(0.0203, grad_fn=<NllLossBackward>)


Epoch 1:  57%|█████▋    | 175/305 [35:04<25:12, 11.64s/it, loss=0.0203]

tensor(0.0356, grad_fn=<NllLossBackward>)


Epoch 1:  58%|█████▊    | 176/305 [35:16<25:26, 11.84s/it, loss=0.0356]

tensor(0.0373, grad_fn=<NllLossBackward>)


Epoch 1:  58%|█████▊    | 177/305 [35:28<25:05, 11.76s/it, loss=0.0373]

tensor(0.0140, grad_fn=<NllLossBackward>)


Epoch 1:  58%|█████▊    | 178/305 [35:40<24:42, 11.67s/it, loss=0.014] 

tensor(0.0227, grad_fn=<NllLossBackward>)


Epoch 1:  59%|█████▊    | 179/305 [35:51<24:37, 11.73s/it, loss=0.0227]

tensor(0.0229, grad_fn=<NllLossBackward>)


Epoch 1:  59%|█████▉    | 180/305 [36:03<24:18, 11.67s/it, loss=0.0229]

tensor(0.0221, grad_fn=<NllLossBackward>)


Epoch 1:  59%|█████▉    | 181/305 [36:14<23:57, 11.59s/it, loss=0.0221]

tensor(0.0153, grad_fn=<NllLossBackward>)


Epoch 1:  60%|█████▉    | 182/305 [36:26<23:59, 11.70s/it, loss=0.0153]

tensor(0.0354, grad_fn=<NllLossBackward>)


Epoch 1:  60%|██████    | 183/305 [36:38<23:39, 11.64s/it, loss=0.0354]

tensor(0.0256, grad_fn=<NllLossBackward>)


Epoch 1:  60%|██████    | 184/305 [36:50<23:44, 11.77s/it, loss=0.0256]

tensor(0.0365, grad_fn=<NllLossBackward>)


Epoch 1:  61%|██████    | 185/305 [37:02<23:44, 11.87s/it, loss=0.0365]

tensor(0.0260, grad_fn=<NllLossBackward>)


Epoch 1:  61%|██████    | 186/305 [37:14<23:26, 11.82s/it, loss=0.026] 

tensor(0.0348, grad_fn=<NllLossBackward>)


Epoch 1:  61%|██████▏   | 187/305 [37:26<23:31, 11.96s/it, loss=0.0348]

tensor(0.0522, grad_fn=<NllLossBackward>)


Epoch 1:  62%|██████▏   | 188/305 [37:38<23:15, 11.93s/it, loss=0.0522]

tensor(0.0274, grad_fn=<NllLossBackward>)


Epoch 1:  62%|██████▏   | 189/305 [37:50<22:58, 11.89s/it, loss=0.0274]

tensor(0.0225, grad_fn=<NllLossBackward>)


Epoch 1:  62%|██████▏   | 190/305 [38:02<22:58, 11.99s/it, loss=0.0225]

tensor(0.0267, grad_fn=<NllLossBackward>)


Epoch 1:  63%|██████▎   | 191/305 [38:14<22:46, 11.99s/it, loss=0.0267]

tensor(0.0143, grad_fn=<NllLossBackward>)


Epoch 1:  63%|██████▎   | 192/305 [38:26<22:26, 11.92s/it, loss=0.0143]

tensor(0.0457, grad_fn=<NllLossBackward>)


Epoch 1:  63%|██████▎   | 193/305 [38:38<22:16, 11.93s/it, loss=0.0457]

tensor(0.0154, grad_fn=<NllLossBackward>)


Epoch 1:  64%|██████▎   | 194/305 [38:49<21:58, 11.88s/it, loss=0.0154]

tensor(0.0180, grad_fn=<NllLossBackward>)


Epoch 1:  64%|██████▍   | 195/305 [39:01<21:54, 11.95s/it, loss=0.018] 

tensor(0.0352, grad_fn=<NllLossBackward>)


Epoch 1:  64%|██████▍   | 196/305 [39:13<21:36, 11.90s/it, loss=0.0352]

tensor(0.0354, grad_fn=<NllLossBackward>)


Epoch 1:  65%|██████▍   | 197/305 [39:25<21:30, 11.95s/it, loss=0.0354]

tensor(0.0225, grad_fn=<NllLossBackward>)


Epoch 1:  65%|██████▍   | 198/305 [39:38<21:31, 12.07s/it, loss=0.0225]

tensor(0.0339, grad_fn=<NllLossBackward>)


Epoch 1:  65%|██████▌   | 199/305 [39:50<21:15, 12.03s/it, loss=0.0339]

tensor(0.0326, grad_fn=<NllLossBackward>)


Epoch 1:  66%|██████▌   | 200/305 [40:01<20:56, 11.96s/it, loss=0.0326]

tensor(0.0180, grad_fn=<NllLossBackward>)


Epoch 1:  66%|██████▌   | 201/305 [40:14<20:56, 12.08s/it, loss=0.018] 

tensor(0.0285, grad_fn=<NllLossBackward>)


Epoch 1:  66%|██████▌   | 202/305 [40:25<20:32, 11.97s/it, loss=0.0285]

tensor(0.0185, grad_fn=<NllLossBackward>)


Epoch 1:  67%|██████▋   | 203/305 [40:37<20:23, 11.99s/it, loss=0.0185]

tensor(0.0226, grad_fn=<NllLossBackward>)


Epoch 1:  67%|██████▋   | 204/305 [40:49<19:59, 11.88s/it, loss=0.0226]

tensor(0.0296, grad_fn=<NllLossBackward>)


Epoch 1:  67%|██████▋   | 205/305 [41:01<19:43, 11.83s/it, loss=0.0296]

tensor(0.0274, grad_fn=<NllLossBackward>)


Epoch 1:  68%|██████▊   | 206/305 [41:13<19:42, 11.95s/it, loss=0.0274]

tensor(0.0360, grad_fn=<NllLossBackward>)


Epoch 1:  68%|██████▊   | 207/305 [41:25<19:22, 11.86s/it, loss=0.036] 

tensor(0.0224, grad_fn=<NllLossBackward>)


Epoch 1:  68%|██████▊   | 208/305 [41:37<19:11, 11.87s/it, loss=0.0224]

tensor(0.0420, grad_fn=<NllLossBackward>)


Epoch 1:  69%|██████▊   | 209/305 [41:49<19:08, 11.97s/it, loss=0.042] 

tensor(0.0271, grad_fn=<NllLossBackward>)


Epoch 1:  69%|██████▉   | 210/305 [42:01<18:55, 11.95s/it, loss=0.0271]

tensor(0.0234, grad_fn=<NllLossBackward>)


Epoch 1:  69%|██████▉   | 211/305 [42:13<18:46, 11.98s/it, loss=0.0234]

tensor(0.0507, grad_fn=<NllLossBackward>)


Epoch 1:  70%|██████▉   | 212/305 [42:24<18:25, 11.89s/it, loss=0.0507]

tensor(0.0295, grad_fn=<NllLossBackward>)


Epoch 1:  70%|██████▉   | 213/305 [42:36<18:09, 11.84s/it, loss=0.0295]

tensor(0.0221, grad_fn=<NllLossBackward>)


Epoch 1:  70%|███████   | 214/305 [42:48<18:04, 11.92s/it, loss=0.0221]

tensor(0.0330, grad_fn=<NllLossBackward>)


Epoch 1:  70%|███████   | 215/305 [43:00<17:39, 11.77s/it, loss=0.033] 

tensor(0.0204, grad_fn=<NllLossBackward>)


Epoch 1:  71%|███████   | 216/305 [43:11<17:21, 11.70s/it, loss=0.0204]

tensor(0.0318, grad_fn=<NllLossBackward>)


Epoch 1:  71%|███████   | 217/305 [43:23<17:18, 11.80s/it, loss=0.0318]

tensor(0.0300, grad_fn=<NllLossBackward>)


Epoch 1:  71%|███████▏  | 218/305 [43:35<17:11, 11.86s/it, loss=0.03]  

tensor(0.0449, grad_fn=<NllLossBackward>)


Epoch 1:  72%|███████▏  | 219/305 [43:47<16:57, 11.83s/it, loss=0.0449]

tensor(0.0190, grad_fn=<NllLossBackward>)


Epoch 1:  72%|███████▏  | 220/305 [43:59<16:57, 11.97s/it, loss=0.019] 

tensor(0.0290, grad_fn=<NllLossBackward>)


Epoch 1:  72%|███████▏  | 221/305 [44:11<16:37, 11.87s/it, loss=0.029]

tensor(0.0208, grad_fn=<NllLossBackward>)


Epoch 1:  73%|███████▎  | 222/305 [44:23<16:32, 11.96s/it, loss=0.0208]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 1:  73%|███████▎  | 223/305 [44:35<16:22, 11.99s/it, loss=0.0307]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 1:  73%|███████▎  | 224/305 [44:48<16:29, 12.21s/it, loss=0.0307]

tensor(0.0126, grad_fn=<NllLossBackward>)


Epoch 1:  74%|███████▍  | 225/305 [45:00<16:24, 12.30s/it, loss=0.0126]

tensor(0.0433, grad_fn=<NllLossBackward>)


Epoch 1:  74%|███████▍  | 226/305 [45:13<16:09, 12.27s/it, loss=0.0433]

tensor(0.0301, grad_fn=<NllLossBackward>)


Epoch 1:  74%|███████▍  | 227/305 [45:25<15:50, 12.19s/it, loss=0.0301]

tensor(0.0178, grad_fn=<NllLossBackward>)


Epoch 1:  75%|███████▍  | 228/305 [45:37<15:42, 12.24s/it, loss=0.0178]

tensor(0.0302, grad_fn=<NllLossBackward>)


Epoch 1:  75%|███████▌  | 229/305 [45:49<15:20, 12.12s/it, loss=0.0302]

tensor(0.0175, grad_fn=<NllLossBackward>)


Epoch 1:  75%|███████▌  | 230/305 [46:01<15:13, 12.18s/it, loss=0.0175]

tensor(0.0349, grad_fn=<NllLossBackward>)


Epoch 1:  76%|███████▌  | 231/305 [46:13<14:52, 12.06s/it, loss=0.0349]

tensor(0.0415, grad_fn=<NllLossBackward>)


Epoch 1:  76%|███████▌  | 232/305 [46:25<14:32, 11.95s/it, loss=0.0415]

tensor(0.0246, grad_fn=<NllLossBackward>)


Epoch 1:  76%|███████▋  | 233/305 [46:37<14:23, 12.00s/it, loss=0.0246]

tensor(0.0159, grad_fn=<NllLossBackward>)


Epoch 1:  77%|███████▋  | 234/305 [46:48<14:05, 11.91s/it, loss=0.0159]

tensor(0.0290, grad_fn=<NllLossBackward>)


Epoch 1:  77%|███████▋  | 235/305 [47:00<13:52, 11.90s/it, loss=0.029] 

tensor(0.0399, grad_fn=<NllLossBackward>)


Epoch 1:  77%|███████▋  | 236/305 [47:12<13:47, 11.99s/it, loss=0.0399]

tensor(0.0254, grad_fn=<NllLossBackward>)


Epoch 1:  78%|███████▊  | 237/305 [47:24<13:33, 11.97s/it, loss=0.0254]

tensor(0.0499, grad_fn=<NllLossBackward>)


Epoch 1:  78%|███████▊  | 238/305 [47:37<13:30, 12.10s/it, loss=0.0499]

tensor(0.0418, grad_fn=<NllLossBackward>)


Epoch 1:  78%|███████▊  | 239/305 [47:49<13:15, 12.06s/it, loss=0.0418]

tensor(0.0418, grad_fn=<NllLossBackward>)


Epoch 1:  79%|███████▊  | 240/305 [48:01<12:58, 11.98s/it, loss=0.0418]

tensor(0.0169, grad_fn=<NllLossBackward>)


Epoch 1:  79%|███████▉  | 241/305 [48:13<12:49, 12.03s/it, loss=0.0169]

tensor(0.0253, grad_fn=<NllLossBackward>)


Epoch 1:  79%|███████▉  | 242/305 [48:25<12:36, 12.01s/it, loss=0.0253]

tensor(0.0200, grad_fn=<NllLossBackward>)


Epoch 1:  80%|███████▉  | 243/305 [48:37<12:21, 11.96s/it, loss=0.02]  

tensor(0.0220, grad_fn=<NllLossBackward>)


Epoch 1:  80%|████████  | 244/305 [48:49<12:13, 12.03s/it, loss=0.022]

tensor(0.0253, grad_fn=<NllLossBackward>)


Epoch 1:  80%|████████  | 245/305 [49:00<11:56, 11.94s/it, loss=0.0253]

tensor(0.0618, grad_fn=<NllLossBackward>)


Epoch 1:  81%|████████  | 246/305 [49:13<11:47, 11.99s/it, loss=0.0618]

tensor(0.0136, grad_fn=<NllLossBackward>)


Epoch 1:  81%|████████  | 247/305 [49:24<11:31, 11.92s/it, loss=0.0136]

tensor(0.0328, grad_fn=<NllLossBackward>)


Epoch 1:  81%|████████▏ | 248/305 [49:36<11:18, 11.90s/it, loss=0.0328]

tensor(0.0288, grad_fn=<NllLossBackward>)


Epoch 1:  82%|████████▏ | 249/305 [49:48<11:10, 11.97s/it, loss=0.0288]

tensor(0.0137, grad_fn=<NllLossBackward>)


Epoch 1:  82%|████████▏ | 250/305 [50:00<10:54, 11.90s/it, loss=0.0137]

tensor(0.0209, grad_fn=<NllLossBackward>)


Epoch 1:  82%|████████▏ | 251/305 [50:12<10:38, 11.83s/it, loss=0.0209]

tensor(0.0151, grad_fn=<NllLossBackward>)


Epoch 1:  83%|████████▎ | 252/305 [50:24<10:32, 11.94s/it, loss=0.0151]

tensor(0.0205, grad_fn=<NllLossBackward>)


Epoch 1:  83%|████████▎ | 253/305 [50:36<10:26, 12.04s/it, loss=0.0205]

tensor(0.0265, grad_fn=<NllLossBackward>)


Epoch 1:  83%|████████▎ | 254/305 [50:48<10:12, 12.00s/it, loss=0.0265]

tensor(0.0409, grad_fn=<NllLossBackward>)


Epoch 1:  84%|████████▎ | 255/305 [51:00<10:02, 12.05s/it, loss=0.0409]

tensor(0.0302, grad_fn=<NllLossBackward>)


Epoch 1:  84%|████████▍ | 256/305 [51:12<09:49, 12.03s/it, loss=0.0302]

tensor(0.0214, grad_fn=<NllLossBackward>)


Epoch 1:  84%|████████▍ | 257/305 [51:24<09:41, 12.11s/it, loss=0.0214]

tensor(0.0277, grad_fn=<NllLossBackward>)


Epoch 1:  85%|████████▍ | 258/305 [51:36<09:23, 11.98s/it, loss=0.0277]

tensor(0.0258, grad_fn=<NllLossBackward>)


Epoch 1:  85%|████████▍ | 259/305 [51:48<09:07, 11.90s/it, loss=0.0258]

tensor(0.0329, grad_fn=<NllLossBackward>)


Epoch 1:  85%|████████▌ | 260/305 [52:00<08:59, 11.98s/it, loss=0.0329]

tensor(0.0209, grad_fn=<NllLossBackward>)


Epoch 1:  86%|████████▌ | 261/305 [52:12<08:45, 11.95s/it, loss=0.0209]

tensor(0.0342, grad_fn=<NllLossBackward>)


Epoch 1:  86%|████████▌ | 262/305 [52:24<08:35, 12.00s/it, loss=0.0342]

tensor(0.0322, grad_fn=<NllLossBackward>)


Epoch 1:  86%|████████▌ | 263/305 [52:36<08:29, 12.12s/it, loss=0.0322]

tensor(0.0405, grad_fn=<NllLossBackward>)


Epoch 1:  87%|████████▋ | 264/305 [52:48<08:15, 12.08s/it, loss=0.0405]

tensor(0.0256, grad_fn=<NllLossBackward>)


Epoch 1:  87%|████████▋ | 265/305 [53:01<08:03, 12.09s/it, loss=0.0256]

tensor(0.0350, grad_fn=<NllLossBackward>)


Epoch 1:  87%|████████▋ | 266/305 [53:13<07:51, 12.10s/it, loss=0.035] 

tensor(0.0122, grad_fn=<NllLossBackward>)


Epoch 1:  88%|████████▊ | 267/305 [53:25<07:38, 12.06s/it, loss=0.0122]

tensor(0.0505, grad_fn=<NllLossBackward>)


Epoch 1:  88%|████████▊ | 268/305 [53:37<07:28, 12.12s/it, loss=0.0505]

tensor(0.0217, grad_fn=<NllLossBackward>)


Epoch 1:  88%|████████▊ | 269/305 [53:49<07:12, 12.03s/it, loss=0.0217]

tensor(0.0344, grad_fn=<NllLossBackward>)


Epoch 1:  89%|████████▊ | 270/305 [54:01<06:58, 11.96s/it, loss=0.0344]

tensor(0.0473, grad_fn=<NllLossBackward>)


Epoch 1:  89%|████████▉ | 271/305 [54:13<06:47, 11.99s/it, loss=0.0473]

tensor(0.0145, grad_fn=<NllLossBackward>)


Epoch 1:  89%|████████▉ | 272/305 [54:24<06:32, 11.91s/it, loss=0.0145]

tensor(0.0215, grad_fn=<NllLossBackward>)


Epoch 1:  90%|████████▉ | 273/305 [54:36<06:20, 11.88s/it, loss=0.0215]

tensor(0.0180, grad_fn=<NllLossBackward>)


Epoch 1:  90%|████████▉ | 274/305 [54:48<06:11, 11.97s/it, loss=0.018] 

tensor(0.0337, grad_fn=<NllLossBackward>)


Epoch 1:  90%|█████████ | 275/305 [55:01<06:01, 12.07s/it, loss=0.0337]

tensor(0.0253, grad_fn=<NllLossBackward>)


Epoch 1:  90%|█████████ | 276/305 [55:13<05:54, 12.23s/it, loss=0.0253]

tensor(0.0373, grad_fn=<NllLossBackward>)


Epoch 1:  91%|█████████ | 277/305 [55:25<05:40, 12.16s/it, loss=0.0373]

tensor(0.0357, grad_fn=<NllLossBackward>)


Epoch 1:  91%|█████████ | 278/305 [55:37<05:25, 12.07s/it, loss=0.0357]

tensor(0.0169, grad_fn=<NllLossBackward>)


Epoch 1:  91%|█████████▏| 279/305 [55:49<05:14, 12.11s/it, loss=0.0169]

tensor(0.0229, grad_fn=<NllLossBackward>)


Epoch 1:  92%|█████████▏| 280/305 [56:01<05:00, 12.02s/it, loss=0.0229]

tensor(0.0467, grad_fn=<NllLossBackward>)


Epoch 1:  92%|█████████▏| 281/305 [56:13<04:49, 12.05s/it, loss=0.0467]

tensor(0.0235, grad_fn=<NllLossBackward>)


Epoch 1:  92%|█████████▏| 282/305 [56:26<04:41, 12.22s/it, loss=0.0235]

tensor(0.0334, grad_fn=<NllLossBackward>)


Epoch 1:  93%|█████████▎| 283/305 [56:38<04:28, 12.20s/it, loss=0.0334]

tensor(0.0243, grad_fn=<NllLossBackward>)


Epoch 1:  93%|█████████▎| 284/305 [56:50<04:16, 12.19s/it, loss=0.0243]

tensor(0.0325, grad_fn=<NllLossBackward>)


Epoch 1:  93%|█████████▎| 285/305 [57:02<04:01, 12.09s/it, loss=0.0325]

tensor(0.0325, grad_fn=<NllLossBackward>)


Epoch 1:  94%|█████████▍| 286/305 [57:14<03:48, 12.02s/it, loss=0.0325]

tensor(0.0324, grad_fn=<NllLossBackward>)


Epoch 1:  94%|█████████▍| 287/305 [57:26<03:38, 12.16s/it, loss=0.0324]

tensor(0.0323, grad_fn=<NllLossBackward>)


Epoch 1:  94%|█████████▍| 288/305 [57:38<03:26, 12.15s/it, loss=0.0323]

tensor(0.0237, grad_fn=<NllLossBackward>)


Epoch 1:  95%|█████████▍| 289/305 [57:51<03:14, 12.13s/it, loss=0.0237]

tensor(0.0151, grad_fn=<NllLossBackward>)


Epoch 1:  95%|█████████▌| 290/305 [58:03<03:03, 12.24s/it, loss=0.0151]

tensor(0.0369, grad_fn=<NllLossBackward>)


Epoch 1:  95%|█████████▌| 291/305 [58:15<02:52, 12.29s/it, loss=0.0369]

tensor(0.0240, grad_fn=<NllLossBackward>)


Epoch 1:  96%|█████████▌| 292/305 [58:28<02:40, 12.37s/it, loss=0.024] 

tensor(0.0202, grad_fn=<NllLossBackward>)


Epoch 1:  96%|█████████▌| 293/305 [58:40<02:26, 12.21s/it, loss=0.0202]

tensor(0.0389, grad_fn=<NllLossBackward>)


Epoch 1:  96%|█████████▋| 294/305 [58:52<02:13, 12.15s/it, loss=0.0389]

tensor(0.0290, grad_fn=<NllLossBackward>)


Epoch 1:  97%|█████████▋| 295/305 [59:04<02:01, 12.17s/it, loss=0.029] 

tensor(0.0178, grad_fn=<NllLossBackward>)


Epoch 1:  97%|█████████▋| 296/305 [59:16<01:48, 12.04s/it, loss=0.0178]

tensor(0.0301, grad_fn=<NllLossBackward>)


Epoch 1:  97%|█████████▋| 297/305 [59:27<01:35, 11.94s/it, loss=0.0301]

tensor(0.0302, grad_fn=<NllLossBackward>)


Epoch 1:  98%|█████████▊| 298/305 [59:40<01:24, 12.02s/it, loss=0.0302]

tensor(0.0482, grad_fn=<NllLossBackward>)


Epoch 1:  98%|█████████▊| 299/305 [59:51<01:11, 11.95s/it, loss=0.0482]

tensor(0.0340, grad_fn=<NllLossBackward>)


Epoch 1:  98%|█████████▊| 300/305 [1:00:04<01:00, 12.03s/it, loss=0.034] 

tensor(0.0247, grad_fn=<NllLossBackward>)


Epoch 1:  99%|█████████▊| 301/305 [1:00:16<00:48, 12.04s/it, loss=0.0247]

tensor(0.0307, grad_fn=<NllLossBackward>)


Epoch 1:  99%|█████████▉| 302/305 [1:00:28<00:35, 11.98s/it, loss=0.0307]

tensor(0.0311, grad_fn=<NllLossBackward>)


Epoch 1:  99%|█████████▉| 303/305 [1:00:40<00:24, 12.14s/it, loss=0.0311]

tensor(0.0198, grad_fn=<NllLossBackward>)


Epoch 1: 100%|█████████▉| 304/305 [1:00:52<00:12, 12.05s/it, loss=0.0198]

tensor(0.0388, grad_fn=<NllLossBackward>)


Epoch 1: 100%|██████████| 305/305 [1:00:55<00:00, 11.98s/it, loss=0.0388]


# Output

The output of the BERT model is the tensor **logits**. After the encoded input has been passed through the trained BERT model, the logits are available as outputs.logits. A **SoftMax** activation function is then applied to the logits along the vocabulary axis. This yields, for each mask token, a probability distribution over all the words in BERT’s vocabulary. Words with a higher probability are better candidate replacements for the mask token. However, despite using class-specific sentence embeddings, BERT often predicts toxic words, apparently paying more attention to the context than to the embeddings of the desired class. To push the model towards non-toxic words, a toxicity score is looked up for each token in BERT’s vocabulary and used to penalize the predicted probabilities.
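The effect of the toxicity penalty can be illustrated on a toy vocabulary. The sketch below is a minimal, self-contained illustration and not the notebook's actual pipeline: the four-word vocabulary, the logits and the toxicity scores are made-up values, and `softmax` and `penalize` are hypothetical helpers. Each SoftMax probability is divided by `max toxicity - min toxicity + word toxicity`, which mirrors the penalty used in the code cell of this section, so more toxic words end up with a smaller score.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(probs, toxicities):
    """Divide each probability by (max - min + toxicity) of its word.

    Hypothetical helper: the divisor grows with toxicity, so toxic words
    are pushed down the ranking.
    """
    hi, lo = max(toxicities), min(toxicities)
    return [p / (hi - lo + t) for p, t in zip(probs, toxicities)]

# Made-up toy data for one [MASK] position (assumptions, not model output).
vocab = ["idiot", "friend", "opinion", "vote"]
logits = [2.0, 1.5, 1.0, 0.5]
toxicities = [0.9, -0.1, -0.05, -0.1]

probs = softmax(logits)
scores = penalize(probs, toxicities)

# Best candidate before and after the toxicity penalty.
best_before = vocab[probs.index(max(probs))]
best_after = vocab[scores.index(max(scores))]
print(best_before, "->", best_after)
```

The penalized scores no longer sum to 1; they are only used for ranking. The divisor stays positive as long as the largest toxicity score is positive, because every word's toxicity is at least the minimum.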

To get the SoftMax values of all the words in BERT’s vocabulary at the masked positions, first locate the mask token indices with torch.where(). The torch.topk() function returns the k largest values of a tensor together with their indices. These indices are iterated over, and the mask token in the sentence is replaced with each candidate token in turn. If only the single best candidate is needed, torch.argmax(), which returns the index of the maximum value in the tensor, can be used instead of torch.topk().
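The three calls mentioned above can be tried out in isolation. The following is a small sketch under stated assumptions: the token ids, the mask id `99` (standing in for tokenizer.mask_token_id) and the random logits are toy values, not output of the fine-tuned model.

```python
import torch

# Toy stand-ins (assumptions): a 5-token sequence over a 6-word vocabulary,
# with 99 playing the role of tokenizer.mask_token_id.
MASK_ID = 99
input_ids = torch.tensor([1, 99, 3, 99, 4])
torch.manual_seed(0)
logits = torch.randn(1, 5, 6)  # (batch, sequence, vocabulary)

# Positions of the mask tokens in the sequence.
mask_positions = torch.where(input_ids == MASK_ID)[0]

# Probabilities over the vocabulary at each masked position.
probs = torch.softmax(logits, dim=-1)[0, mask_positions, :]

# topk returns both the k largest values and their indices, sorted descending.
top = torch.topk(probs, k=3, dim=1)
for row, position in enumerate(mask_positions.tolist()):
    print(position, top.indices[row].tolist(), top.values[row].tolist())

# argmax keeps only the single most probable token id per masked position.
best = torch.argmax(probs, dim=1)
```

With real model output, each id in `top.indices` would be decoded back to a word with the tokenizer before being substituted for the mask token.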

In [None]:
max_word_toxicity = np.max(wordtoxicities_df.toxicity.tolist())
min_word_toxicity = np.min(wordtoxicities_df.toxicity.tolist())

# toxic comment, with the words to be replaced already masked out
unmasked = "Your [MASK] is your [MASK]!"
masked = tokenizer.encode(unmasked, return_tensors = 'pt')[0]

# decode token by token so that a single position can be overwritten later
comment = []
for token in masked:
    comment.append(''.join(tokenizer.decode(token).split()))

# retrieving detoxified comment

# positions of the [MASK] tokens in the encoded comment
mask_token_indices = torch.where(masked == tokenizer.mask_token_id)[0]

# the logits are only meaningful here if outputs comes from a forward pass on the comment encoded above
token_logits = outputs.logits

# softmax over the vocabulary axis; despite its name, mask_token_logits holds probabilities
softy = func.softmax(token_logits, dim = -1)
mask_token_logits = softy[0, mask_token_indices, :]

for i in range(len(mask_token_indices.tolist())):
    top_ten_tokens = torch.topk(mask_token_logits, 10, dim = 1, sorted = True).indices[i].tolist()

    # before rescoring on the basis of toxicities
    for token in top_ten_tokens:
        comment[mask_token_indices[i]] = tokenizer.decode(token)
        print(comment)

    print()

    # sort once (ascending) and reuse both the token ids and their probabilities
    sorted_probs = torch.sort(mask_token_logits, dim = 1, stable = True)
    mask_tokens = sorted_probs.indices[i].tolist()
    token_logits = sorted_probs.values[i].tolist()

    better_mask_tokens = []
    better_mask_token_logits = []
    for j in range(len(token_logits)):
        better_mask_tokens.append(mask_tokens[j])

        # divide by (toxicity range + word toxicity), so more toxic words score lower;
        # words without a toxicity score fall back to the average toxicity
        token = ''.join(tokenizer.decode(mask_tokens[j]).split())
        if token in wordtoxicities_dict:
            better_mask_token_logits.append(token_logits[j] / (max_word_toxicity - min_word_toxicity + wordtoxicities_dict[token]))
        else:
            better_mask_token_logits.append(token_logits[j] / (max_word_toxicity - min_word_toxicity + avg_word_toxicity))

    # ascending sort by penalized score, so the ten best candidates are at the end
    better_mask_tokens.sort(key = dict(zip(better_mask_tokens, better_mask_token_logits)).get)
    better_top_ten_tokens = better_mask_tokens[-10: ]

    # after rescoring on the basis of toxicities
    for token in better_top_ten_tokens:
        comment[mask_token_indices[i]] = tokenizer.decode(token)
        print(comment)

    print()

['[CLS]', 'your', 'v o t e', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'v o t e s', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'v o t e d', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'v o t i n g', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'c o u n t', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 't h i n k', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'v e', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'g r a m', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'f e e l', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', '# # v e', 'is', 'your', '[MASK]', '!', '[SEP]']

['[CLS]', 'your', '# # v e', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'f e e l', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'g r a m', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 'v e', 'is', 'your', '[MASK]', '!', '[SEP]']
['[CLS]', 'your', 't h i n k', 'is', 'your', '[MASK]', '!

# Saving Output

In [None]:
torch.save(outputs.logits, "BERTLogits.pt")

<a href = "./BERTLogits.pt"> Download File </a>

# References

1. Understanding the BERT Model
   *https://medium.com/analytics-vidhya/understanding-the-bert-model-a04e1c7933a9*
   
2. Hugging Face Transformer library: BERT Documentation
   *https://huggingface.co/docs/transformers/model_doc/bert*

3. Complete Tutorial on BERT
   *https://analyticsindiamag.com/a-complete-tutorial-on-masked-language-modelling-using-bert/*

4. Training for MLM Task
   *https://github.com/jamescalam/transformers/blob/main/course/training/03_mlm_training.ipynb*

5. How to use BERT
   *https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209*