### Fine-tuning the 6-billion-parameter GPT-J in Colab with LoRA and 8-bit compression

This notebook is a proof of concept for fine-tuning [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) with limited memory. A detailed explanation of how it works can be found in [this model card](https://huggingface.co/hivemind/gpt-j-6B-8bit).

In [2]:
%pip install transformers==4.14.1
%pip install bitsandbytes-cuda111==0.26.0
%pip install datasets==1.16.1

Collecting transformers==4.14.1
  Using cached transformers-4.14.1-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Using cached sacremoses-0.0.53-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [51 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-310
      c

In [3]:
import transformers

import torch
import torch.nn.functional as F
from torch import nn
from torch.cuda.amp import custom_fwd, custom_bwd

from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

from tqdm.auto import tqdm

CACHE_DIR = '/media/tfsservices/DATA/NLP/cache/'
MODEL_NAME = "EleutherAI/gpt-j-6B"


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues


### Converting the model to 8 bits

We convert EleutherAI's GPT-J-6B model to 8 bits using Facebook's [bitsandbytes](https://github.com/facebookresearch/bitsandbytes) library. This reduces the model's size from 20 GB to just 6 GB.
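To make the idea of blockwise 8-bit storage concrete, here is a minimal pure-Python sketch, not the bitsandbytes implementation. It assumes a simple linear mapping of each block onto the integers 0..255, whereas `quantize_blockwise` uses its own 256-entry code table (the `code` tensor in this notebook). The names `quantize_blockwise_sketch` and `dequantize_blockwise_sketch` are hypothetical, and the block size is shrunk from the 4096 values used in this notebook to 4 so the example stays readable.

```python
import random

BLOCK_SIZE = 4  # tiny block for illustration; this notebook works with blocks of 4096 values


def quantize_blockwise_sketch(values, block_size=BLOCK_SIZE):
    """Quantize a flat list of floats to 8-bit codes with one absmax scale per block."""
    codes, absmaxes = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # fall back to 1.0 for an all-zero block
        absmaxes.append(absmax)
        for v in block:
            # Assumption: map [-absmax, absmax] linearly onto the integers 0..255.
            codes.append(round((v / absmax + 1.0) * 127.5))
    return codes, absmaxes


def dequantize_blockwise_sketch(codes, absmaxes, block_size=BLOCK_SIZE):
    """Invert the mapping above; the result only approximates the original values."""
    return [
        (code / 127.5 - 1.0) * absmaxes[i // block_size]
        for i, code in enumerate(codes)
    ]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(16)]
codes, absmaxes = quantize_blockwise_sketch(weights)
restored = dequantize_blockwise_sketch(codes, absmaxes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"largest round-trip error: {max_error:.6f}")
```

Each value is stored in one byte, plus one floating-point scale per block. Under this linear mapping, the round-trip error of a value is bounded by its block's `absmax / 255`, which is why a small block size keeps outliers from degrading the rest of the matrix.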

Note that we don't convert linear layer biases to 8 bits, as they take up less than 1% of the model's size anyway.
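The bias claim can be checked with a little arithmetic. The sketch below uses only the layer shapes visible in the module printout later in this notebook (28 blocks, hidden size 4096, MLP width 16384, vocabulary 50400). It deliberately leaves out the layer norms and the output head, since the printout is cut off before the head, so treat the result as a rough estimate and not an exact parameter count.

```python
HIDDEN = 4096        # width of every attention projection
MLP_HIDDEN = 16384   # inner width of the MLP
N_BLOCKS = 28
VOCAB = 50400

# Per block: four bias-free attention projections, two MLP layers with biases.
attn_weights = 4 * HIDDEN * HIDDEN
mlp_weights = 2 * HIDDEN * MLP_HIDDEN
mlp_biases = MLP_HIDDEN + HIDDEN

linear_weights = N_BLOCKS * (attn_weights + mlp_weights)
linear_biases = N_BLOCKS * mlp_biases
embedding_weights = VOCAB * HIDDEN

total = linear_weights + linear_biases + embedding_weights
bias_fraction = linear_biases / total

print(f"linear weights: {linear_weights:,}")
print(f"linear biases:  {linear_biases:,}")
print(f"bias share:     {bias_fraction:.4%}")
```

With these shapes the biases come to roughly a hundredth of a percent of the counted parameters, so keeping them in full precision costs almost nothing.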

In [7]:
class FrozenBNBLinear(nn.Module):
    def __init__(self, weight, absmax, code, bias=None):
        assert isinstance(bias, nn.Parameter) or bias is None
        super().__init__()
        self.out_features, self.in_features = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
        self.bias = bias
 
    def forward(self, input):
        output = DequantizeAndLinear.apply(input, self.weight, self.absmax, self.code, self.bias)
        if self.adapter is not None:
            output += self.adapter(input)
        return output
 
    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "FrozenBNBLinear":
        weights_int8, state = quantize_blockise_lowmemory(linear.weight)
        return cls(weights_int8, *state, linear.bias)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.in_features}, {self.out_features})"
 
 
class DequantizeAndLinear(torch.autograd.Function): 
    @staticmethod
    @custom_fwd
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)
 
    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias
 
 
class FrozenBNBEmbedding(nn.Module):
    def __init__(self, weight, absmax, code):
        super().__init__()
        self.num_embeddings, self.embedding_dim = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
 
    def forward(self, input, **kwargs):
        with torch.no_grad():
            # note: both quantized weights and input indices are *not* differentiable
            weight_deq = dequantize_blockwise(self.weight, absmax=self.absmax, code=self.code)
            output = F.embedding(input, weight_deq, **kwargs)
        if self.adapter is not None:
            output += self.adapter(input)
        return output 
 
    @classmethod
    def from_embedding(cls, embedding: nn.Embedding) -> "FrozenBNBEmbedding":
        weights_int8, state = quantize_blockise_lowmemory(embedding.weight)
        return cls(weights_int8, *state)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num_embeddings}, {self.embedding_dim})"
 
 
def quantize_blockise_lowmemory(matrix: torch.Tensor, chunk_size: int = 2 ** 20):
    assert chunk_size % 4096 == 0
    code = None
    chunks = []
    absmaxes = []
    flat_tensor = matrix.view(-1)
    for i in range((matrix.numel() - 1) // chunk_size + 1):
        input_chunk = flat_tensor[i * chunk_size: (i + 1) * chunk_size].clone()
        quantized_chunk, (absmax_chunk, code) = quantize_blockwise(input_chunk, code=code)
        chunks.append(quantized_chunk)
        absmaxes.append(absmax_chunk)
 
    matrix_i8 = torch.cat(chunks).reshape_as(matrix)
    absmax = torch.cat(absmaxes)
    return matrix_i8, (absmax, code)
 
 
def convert_to_int8(model):
    """Convert linear and embedding modules to 8-bit with optional adapters"""
    for module in list(model.modules()):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                print(name, child)
                setattr( 
                    module,
                    name,
                    FrozenBNBLinear(
                        weight=torch.zeros(child.out_features, child.in_features, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                        bias=child.bias,
                    ),
                )
            elif isinstance(child, nn.Embedding):
                setattr(
                    module,
                    name,
                    FrozenBNBEmbedding(
                        weight=torch.zeros(child.num_embeddings, child.embedding_dim, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                    )
                )

In [8]:
class GPTJBlock(transformers.models.gptj.modeling_gptj.GPTJBlock):
    def __init__(self, config):
        super().__init__(config)

        convert_to_int8(self.attn)
        convert_to_int8(self.mlp)


class GPTJModel(transformers.models.gptj.modeling_gptj.GPTJModel):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)
        

class GPTJForCausalLM(transformers.models.gptj.modeling_gptj.GPTJForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)


transformers.models.gptj.modeling_gptj.GPTJBlock = GPTJBlock  # monkey-patch GPT-J

In [3]:

# config = transformers.GPTJConfig.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
# tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)



In [4]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
config = transformers.GPTJConfig.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)

gpt = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_8bit=True,        # requires bitsandbytes; converts the loaded model into a mixed 8-bit quantized model
    device_map='auto',
    torch_dtype=torch.float16,
    cache_dir=CACHE_DIR)      # directory in which the downloaded pretrained model is cached
    # low_cpu_mem_usage=True,   # loads the model using ~1x model size CPU memory
    # offload_state_dict=True)  # temporarily offload the CPU state dict to the hard drive to avoid getting out of CPU RAM

In [9]:
gpt.eval()

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): Embedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-27): 28 x GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear8bitLt(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear8bitLt(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False

In [16]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# gpt.to(device)

In [11]:
# gpt = GPTJForCausalLM.from_pretrained("hivemind/gpt-j-6B-8bit", low_cpu_mem_usage=True)

# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# gpt.to(device)

### Text generation example

In [14]:
# prompt = tokenizer("A cat sat on a mat", return_tensors='pt')
# prompt = {key: value.to(device) for key, value in prompt.items()}
# out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
# tokenizer.decode(out[0])


# batch = tokenizer("Hi {FirstName} ", return_tensors='pt').to('cuda')
prompt = "A cat sat on a mat "
batch = tokenizer(prompt, return_tensors='pt').to('cuda')


# with torch.no_grad():
with torch.cuda.amp.autocast():
    output_tokens = gpt.generate(**batch, min_length=30, max_length=60, do_sample=True)

print('\n\n', tokenizer.decode(output_tokens[0].cpu().numpy()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




 A cat sat on a mat  
before a door and watched.

The door opened. Somebody came in.

She didn't understand, the cat thought, and she wanted to go away.

She went to sleep on the floor and dreamed

of a new time


### LoRA fine-tuning example
Here we demonstrate how to fine-tune the proposed model using low-rank adapters [(Hu et al., 2021)](https://arxiv.org/abs/2106.09685) and [8-bit Adam](https://arxiv.org/abs/2110.02861). We also use the [dataset streaming API](https://huggingface.co/docs/datasets/dataset_streaming.html) to avoid downloading the large dataset.
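As a rough illustration of what a low-rank adapter does, the following pure-Python sketch adds a trainable correction `up (down x)` to a frozen projection `W x`. It mirrors the initialization used in this notebook, where the second adapter matrix starts at zero, but the helper names (`matvec`, `lora_forward`, `adapter_params`) and the tiny dimensions are assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(frozen_w, down, up, x):
    """Frozen projection plus a low-rank correction: W x + up (down x)."""
    base = matvec(frozen_w, x)
    delta = matvec(up, matvec(down, x))
    return [b + d for b, d in zip(base, delta)]


def adapter_params(in_features, out_features, rank):
    """Number of trainable values in one bias-free rank-`rank` adapter."""
    return rank * (in_features + out_features)


random.seed(0)
in_features, out_features, rank = 8, 6, 2

frozen_w = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(out_features)]
down = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(rank)]  # random init
up = [[0.0] * rank for _ in range(out_features)]                                 # zero init

x = [random.gauss(0, 1) for _ in range(in_features)]
# With `up` at zero the adapter contributes nothing, so the output is unchanged.
starts_as_noop = lora_forward(frozen_w, down, up, x) == matvec(frozen_w, x)

# Size of a rank-16 adapter relative to one 4096 x 4096 projection.
full = 4096 * 4096
adapter = adapter_params(4096, 4096, 16)
share = adapter / full
print(starts_as_noop, adapter, f"{share:.2%}")
```

Because the zero-initialized matrix makes the adapter a no-op at the start, fine-tuning begins from exactly the quantized model's behavior. Only the adapter values, under 1% of a 4096 x 4096 projection at rank 16, need gradients and optimizer state.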

In [18]:
def add_adapters(model, adapter_dim=16):
    assert adapter_dim > 0

    for module in model.modules():
        if isinstance(module, FrozenBNBLinear):
            module.adapter = nn.Sequential(
                nn.Linear(module.in_features, adapter_dim, bias=False),
                nn.Linear(adapter_dim, module.out_features, bias=False),
            )
            nn.init.zeros_(module.adapter[1].weight)
        elif isinstance(module, FrozenBNBEmbedding):
            module.adapter = nn.Sequential(
                nn.Embedding(module.num_embeddings, adapter_dim),
                nn.Linear(adapter_dim, module.embedding_dim, bias=False),
            )
            nn.init.zeros_(module.adapter[1].weight)

add_adapters(gpt)
# gpt.to(device)

In [19]:
from datasets import load_dataset
from bitsandbytes.optim import Adam8bit

gpt.gradient_checkpointing_enable()

codeparrot = load_dataset("transformersbook/codeparrot-train", streaming=True)
optimizer = Adam8bit(gpt.parameters(), lr=1e-5)

with torch.cuda.amp.autocast():
    for row in tqdm(codeparrot["train"]):
        if len(row["content"]) <= 1:
            continue

        batch = tokenizer(row["content"], truncation=True, max_length=128, return_tensors='pt')
        batch = {k: v.cuda() for k, v in batch.items()}

        out = gpt(**batch)

        loss = F.cross_entropy(out.logits[:, :-1, :].flatten(0, -2), batch['input_ids'][:, 1:].flatten(),
                               reduction='mean')
        print(loss)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

Using custom data configuration transformersbook___codeparrot-train-39fd2cee2b2cb397


tensor(0.1605, device='cuda:0', grad_fn=<NllLossBackward0>)


2it [00:02,  1.06it/s]

tensor(0.9318, device='cuda:0', grad_fn=<NllLossBackward0>)


3it [00:02,  1.48it/s]

tensor(1.0730, device='cuda:0', grad_fn=<NllLossBackward0>)


4it [00:02,  1.81it/s]

tensor(0.0544, device='cuda:0', grad_fn=<NllLossBackward0>)


5it [00:03,  2.07it/s]

tensor(0.3670, device='cuda:0', grad_fn=<NllLossBackward0>)


6it [00:03,  2.04it/s]

tensor(0.7602, device='cuda:0', grad_fn=<NllLossBackward0>)


7it [00:04,  2.23it/s]

tensor(1.4419, device='cuda:0', grad_fn=<NllLossBackward0>)


8it [00:04,  2.38it/s]

tensor(1.4850, device='cuda:0', grad_fn=<NllLossBackward0>)


9it [00:04,  2.49it/s]

tensor(1.0000, device='cuda:0', grad_fn=<NllLossBackward0>)


10it [00:05,  2.56it/s]

tensor(0.3394, device='cuda:0', grad_fn=<NllLossBackward0>)


11it [00:05,  2.62it/s]

tensor(0.4912, device='cuda:0', grad_fn=<NllLossBackward0>)


12it [00:05,  2.65it/s]

tensor(0.8410, device='cuda:0', grad_fn=<NllLossBackward0>)


13it [00:06,  2.67it/s]

tensor(0.5880, device='cuda:0', grad_fn=<NllLossBackward0>)


14it [00:06,  2.70it/s]

tensor(0.9399, device='cuda:0', grad_fn=<NllLossBackward0>)


15it [00:07,  2.72it/s]

tensor(1.5964, device='cuda:0', grad_fn=<NllLossBackward0>)


16it [00:07,  2.73it/s]

tensor(4.0543, device='cuda:0', grad_fn=<NllLossBackward0>)


17it [00:07,  2.74it/s]

tensor(2.0108, device='cuda:0', grad_fn=<NllLossBackward0>)


18it [00:08,  2.74it/s]

tensor(1.7999, device='cuda:0', grad_fn=<NllLossBackward0>)


19it [00:08,  2.75it/s]

tensor(6.3527, device='cuda:0', grad_fn=<NllLossBackward0>)


20it [00:08,  2.76it/s]

tensor(1.3588, device='cuda:0', grad_fn=<NllLossBackward0>)


21it [00:09,  2.76it/s]

tensor(1.3912, device='cuda:0', grad_fn=<NllLossBackward0>)


22it [00:09,  2.76it/s]

tensor(2.7014, device='cuda:0', grad_fn=<NllLossBackward0>)


23it [00:09,  2.76it/s]

tensor(2.5873, device='cuda:0', grad_fn=<NllLossBackward0>)


24it [00:10,  2.74it/s]

tensor(2.3658, device='cuda:0', grad_fn=<NllLossBackward0>)


25it [00:10,  2.74it/s]

tensor(3.4980, device='cuda:0', grad_fn=<NllLossBackward0>)


26it [00:10,  2.75it/s]

tensor(6.1203, device='cuda:0', grad_fn=<NllLossBackward0>)


27it [00:11,  2.74it/s]

tensor(4.9531, device='cuda:0', grad_fn=<NllLossBackward0>)


28it [00:11,  2.74it/s]

tensor(4.9320, device='cuda:0', grad_fn=<NllLossBackward0>)


29it [00:12,  2.76it/s]

tensor(2.5430, device='cuda:0', grad_fn=<NllLossBackward0>)


30it [00:12,  2.76it/s]

tensor(5.0767, device='cuda:0', grad_fn=<NllLossBackward0>)


31it [00:12,  2.77it/s]

tensor(5.8644, device='cuda:0', grad_fn=<NllLossBackward0>)


32it [00:13,  2.77it/s]

tensor(1.9033, device='cuda:0', grad_fn=<NllLossBackward0>)


33it [00:13,  2.77it/s]

tensor(2.4315, device='cuda:0', grad_fn=<NllLossBackward0>)


34it [00:13,  2.78it/s]

tensor(4.0918, device='cuda:0', grad_fn=<NllLossBackward0>)


35it [00:14,  2.77it/s]

tensor(3.2067, device='cuda:0', grad_fn=<NllLossBackward0>)


36it [00:14,  2.76it/s]

tensor(5.6545, device='cuda:0', grad_fn=<NllLossBackward0>)


37it [00:14,  2.77it/s]

tensor(4.0536, device='cuda:0', grad_fn=<NllLossBackward0>)


38it [00:15,  2.77it/s]

tensor(4.1944, device='cuda:0', grad_fn=<NllLossBackward0>)


39it [00:15,  2.75it/s]

tensor(4.8119, device='cuda:0', grad_fn=<NllLossBackward0>)


40it [00:16,  2.76it/s]

tensor(2.5572, device='cuda:0', grad_fn=<NllLossBackward0>)


41it [00:16,  2.75it/s]

tensor(4.1322, device='cuda:0', grad_fn=<NllLossBackward0>)


42it [00:16,  2.76it/s]

tensor(4.2680, device='cuda:0', grad_fn=<NllLossBackward0>)


43it [00:17,  2.75it/s]

tensor(4.3689, device='cuda:0', grad_fn=<NllLossBackward0>)


44it [00:17,  2.77it/s]

tensor(5.8362, device='cuda:0', grad_fn=<NllLossBackward0>)


45it [00:17,  2.77it/s]

tensor(5.7476, device='cuda:0', grad_fn=<NllLossBackward0>)


46it [00:18,  2.77it/s]

tensor(5.8450, device='cuda:0', grad_fn=<NllLossBackward0>)


47it [00:18,  2.77it/s]

tensor(5.9221, device='cuda:0', grad_fn=<NllLossBackward0>)


48it [00:18,  2.77it/s]

tensor(3.2553, device='cuda:0', grad_fn=<NllLossBackward0>)


49it [00:19,  2.76it/s]

tensor(4.1352, device='cuda:0', grad_fn=<NllLossBackward0>)


50it [00:19,  2.76it/s]

tensor(4.6700, device='cuda:0', grad_fn=<NllLossBackward0>)


51it [00:20,  2.76it/s]

tensor(6.7458, device='cuda:0', grad_fn=<NllLossBackward0>)


52it [00:20,  2.76it/s]

tensor(6.5133, device='cuda:0', grad_fn=<NllLossBackward0>)


53it [00:20,  2.76it/s]

tensor(5.4583, device='cuda:0', grad_fn=<NllLossBackward0>)


54it [00:21,  2.76it/s]

tensor(4.3917, device='cuda:0', grad_fn=<NllLossBackward0>)


55it [00:21,  2.76it/s]

tensor(3.8674, device='cuda:0', grad_fn=<NllLossBackward0>)


56it [00:21,  2.75it/s]

tensor(4.0119, device='cuda:0', grad_fn=<NllLossBackward0>)


57it [00:22,  2.76it/s]

tensor(4.3676, device='cuda:0', grad_fn=<NllLossBackward0>)


58it [00:22,  2.76it/s]

tensor(4.8471, device='cuda:0', grad_fn=<NllLossBackward0>)


59it [00:22,  2.77it/s]

tensor(5.9541, device='cuda:0', grad_fn=<NllLossBackward0>)


60it [00:23,  2.76it/s]

tensor(4.1517, device='cuda:0', grad_fn=<NllLossBackward0>)


61it [00:23,  2.76it/s]

tensor(4.6699, device='cuda:0', grad_fn=<NllLossBackward0>)


62it [00:24,  2.75it/s]

tensor(5.7336, device='cuda:0', grad_fn=<NllLossBackward0>)


63it [00:24,  2.76it/s]

tensor(5.5976, device='cuda:0', grad_fn=<NllLossBackward0>)


64it [00:24,  2.76it/s]

tensor(5.2133, device='cuda:0', grad_fn=<NllLossBackward0>)


65it [00:25,  2.77it/s]

tensor(6.0755, device='cuda:0', grad_fn=<NllLossBackward0>)


66it [00:25,  2.77it/s]

tensor(3.3853, device='cuda:0', grad_fn=<NllLossBackward0>)


67it [00:25,  2.76it/s]

tensor(4.9308, device='cuda:0', grad_fn=<NllLossBackward0>)


68it [00:26,  2.77it/s]

tensor(5.8031, device='cuda:0', grad_fn=<NllLossBackward0>)


69it [00:26,  2.77it/s]

tensor(5.5990, device='cuda:0', grad_fn=<NllLossBackward0>)


70it [00:26,  2.77it/s]

tensor(6.2677, device='cuda:0', grad_fn=<NllLossBackward0>)


71it [00:27,  2.76it/s]

tensor(6.1885, device='cuda:0', grad_fn=<NllLossBackward0>)


72it [00:27,  2.76it/s]

tensor(2.8037, device='cuda:0', grad_fn=<NllLossBackward0>)


73it [00:28,  2.76it/s]

tensor(5.1544, device='cuda:0', grad_fn=<NllLossBackward0>)


74it [00:28,  2.76it/s]

tensor(5.3275, device='cuda:0', grad_fn=<NllLossBackward0>)


75it [00:28,  2.75it/s]

tensor(4.1062, device='cuda:0', grad_fn=<NllLossBackward0>)


76it [00:29,  2.75it/s]

tensor(4.0750, device='cuda:0', grad_fn=<NllLossBackward0>)


77it [00:29,  2.76it/s]

tensor(4.7814, device='cuda:0', grad_fn=<NllLossBackward0>)


78it [00:29,  2.76it/s]

tensor(4.1999, device='cuda:0', grad_fn=<NllLossBackward0>)


79it [00:30,  2.76it/s]

tensor(5.1500, device='cuda:0', grad_fn=<NllLossBackward0>)


80it [00:30,  2.77it/s]

tensor(6.7390, device='cuda:0', grad_fn=<NllLossBackward0>)


81it [00:30,  2.77it/s]

tensor(4.6155, device='cuda:0', grad_fn=<NllLossBackward0>)


82it [00:31,  2.77it/s]

tensor(5.5195, device='cuda:0', grad_fn=<NllLossBackward0>)


83it [00:31,  2.77it/s]

tensor(4.2006, device='cuda:0', grad_fn=<NllLossBackward0>)


84it [00:31,  2.76it/s]

tensor(5.0737, device='cuda:0', grad_fn=<NllLossBackward0>)


85it [00:32,  2.77it/s]

tensor(2.1147, device='cuda:0', grad_fn=<NllLossBackward0>)


86it [00:32,  2.76it/s]

tensor(5.6714, device='cuda:0', grad_fn=<NllLossBackward0>)


87it [00:33,  2.76it/s]

tensor(3.4952, device='cuda:0', grad_fn=<NllLossBackward0>)


88it [00:33,  2.76it/s]

tensor(4.9980, device='cuda:0', grad_fn=<NllLossBackward0>)


89it [00:33,  2.77it/s]

tensor(5.2546, device='cuda:0', grad_fn=<NllLossBackward0>)


90it [00:34,  2.78it/s]

tensor(5.2560, device='cuda:0', grad_fn=<NllLossBackward0>)


91it [00:34,  2.77it/s]

tensor(5.2232, device='cuda:0', grad_fn=<NllLossBackward0>)


92it [00:34,  2.76it/s]

tensor(4.8532, device='cuda:0', grad_fn=<NllLossBackward0>)


93it [00:35,  2.76it/s]

tensor(2.5342, device='cuda:0', grad_fn=<NllLossBackward0>)


94it [00:35,  2.77it/s]

tensor(3.3816, device='cuda:0', grad_fn=<NllLossBackward0>)


95it [00:35,  2.78it/s]

tensor(4.7638, device='cuda:0', grad_fn=<NllLossBackward0>)


96it [00:36,  2.78it/s]

tensor(6.6795, device='cuda:0', grad_fn=<NllLossBackward0>)


97it [00:36,  2.77it/s]

tensor(7.0989, device='cuda:0', grad_fn=<NllLossBackward0>)


98it [00:37,  2.78it/s]

tensor(3.5812, device='cuda:0', grad_fn=<NllLossBackward0>)


99it [00:37,  2.77it/s]

tensor(4.2984, device='cuda:0', grad_fn=<NllLossBackward0>)


100it [00:37,  2.76it/s]

tensor(4.7326, device='cuda:0', grad_fn=<NllLossBackward0>)


101it [00:38,  2.76it/s]

tensor(3.3220, device='cuda:0', grad_fn=<NllLossBackward0>)


102it [00:38,  2.76it/s]

tensor(5.0107, device='cuda:0', grad_fn=<NllLossBackward0>)


103it [00:38,  2.75it/s]

tensor(4.6575, device='cuda:0', grad_fn=<NllLossBackward0>)


104it [00:39,  2.76it/s]

tensor(3.6273, device='cuda:0', grad_fn=<NllLossBackward0>)


105it [00:39,  2.76it/s]

tensor(4.5284, device='cuda:0', grad_fn=<NllLossBackward0>)


106it [00:39,  2.77it/s]

tensor(5.1225, device='cuda:0', grad_fn=<NllLossBackward0>)


107it [00:40,  2.76it/s]

tensor(6.4429, device='cuda:0', grad_fn=<NllLossBackward0>)


108it [00:40,  2.76it/s]

tensor(4.9343, device='cuda:0', grad_fn=<NllLossBackward0>)


109it [00:41,  2.76it/s]

tensor(3.6846, device='cuda:0', grad_fn=<NllLossBackward0>)


110it [00:41,  2.76it/s]

tensor(5.0090, device='cuda:0', grad_fn=<NllLossBackward0>)


111it [00:41,  2.76it/s]

tensor(3.5899, device='cuda:0', grad_fn=<NllLossBackward0>)


112it [00:42,  2.75it/s]

tensor(4.1619, device='cuda:0', grad_fn=<NllLossBackward0>)


113it [00:42,  2.74it/s]

tensor(5.2536, device='cuda:0', grad_fn=<NllLossBackward0>)


114it [00:42,  2.74it/s]

tensor(4.3500, device='cuda:0', grad_fn=<NllLossBackward0>)


115it [00:43,  2.74it/s]

tensor(2.5748, device='cuda:0', grad_fn=<NllLossBackward0>)


116it [00:43,  2.74it/s]

tensor(3.9787, device='cuda:0', grad_fn=<NllLossBackward0>)


117it [00:43,  2.74it/s]

tensor(4.6464, device='cuda:0', grad_fn=<NllLossBackward0>)


118it [00:44,  2.73it/s]

tensor(5.6850, device='cuda:0', grad_fn=<NllLossBackward0>)


119it [00:44,  2.74it/s]

tensor(3.7441, device='cuda:0', grad_fn=<NllLossBackward0>)


120it [00:45,  2.74it/s]

tensor(5.1343, device='cuda:0', grad_fn=<NllLossBackward0>)


121it [00:45,  2.74it/s]

tensor(2.6141, device='cuda:0', grad_fn=<NllLossBackward0>)


122it [00:45,  2.73it/s]

tensor(4.1927, device='cuda:0', grad_fn=<NllLossBackward0>)


123it [00:46,  2.74it/s]

tensor(2.9571, device='cuda:0', grad_fn=<NllLossBackward0>)


124it [00:46,  2.75it/s]

tensor(3.2589, device='cuda:0', grad_fn=<NllLossBackward0>)


125it [00:46,  2.74it/s]

tensor(5.1859, device='cuda:0', grad_fn=<NllLossBackward0>)


126it [00:47,  2.74it/s]

tensor(5.8222, device='cuda:0', grad_fn=<NllLossBackward0>)


127it [00:47,  2.73it/s]

tensor(4.4546, device='cuda:0', grad_fn=<NllLossBackward0>)


128it [00:47,  2.74it/s]

tensor(6.7191, device='cuda:0', grad_fn=<NllLossBackward0>)


129it [00:48,  2.74it/s]

tensor(4.6157, device='cuda:0', grad_fn=<NllLossBackward0>)


130it [00:48,  2.73it/s]

tensor(3.8545, device='cuda:0', grad_fn=<NllLossBackward0>)


131it [00:49,  2.71it/s]

tensor(1.7002, device='cuda:0', grad_fn=<NllLossBackward0>)


132it [00:49,  2.72it/s]

tensor(2.7177, device='cuda:0', grad_fn=<NllLossBackward0>)


133it [00:49,  2.73it/s]

tensor(5.6187, device='cuda:0', grad_fn=<NllLossBackward0>)


134it [00:50,  2.74it/s]

tensor(4.6971, device='cuda:0', grad_fn=<NllLossBackward0>)


135it [00:50,  2.74it/s]

tensor(2.1889, device='cuda:0', grad_fn=<NllLossBackward0>)


136it [00:50,  2.74it/s]

tensor(3.5882, device='cuda:0', grad_fn=<NllLossBackward0>)


137it [00:51,  2.74it/s]

tensor(4.0629, device='cuda:0', grad_fn=<NllLossBackward0>)


138it [00:51,  2.74it/s]

tensor(7.3585, device='cuda:0', grad_fn=<NllLossBackward0>)


139it [00:51,  2.73it/s]

tensor(4.1605, device='cuda:0', grad_fn=<NllLossBackward0>)


140it [00:52,  2.73it/s]

tensor(3.9744, device='cuda:0', grad_fn=<NllLossBackward0>)


141it [00:52,  2.73it/s]

tensor(6.5014, device='cuda:0', grad_fn=<NllLossBackward0>)


142it [00:53,  2.74it/s]

tensor(5.8781, device='cuda:0', grad_fn=<NllLossBackward0>)


143it [00:53,  2.74it/s]

tensor(1.7094, device='cuda:0', grad_fn=<NllLossBackward0>)


144it [00:53,  2.75it/s]

tensor(4.4135, device='cuda:0', grad_fn=<NllLossBackward0>)


145it [00:54,  2.76it/s]

tensor(4.5818, device='cuda:0', grad_fn=<NllLossBackward0>)


146it [00:54,  2.76it/s]

tensor(2.8736, device='cuda:0', grad_fn=<NllLossBackward0>)


147it [00:54,  2.76it/s]

tensor(4.4471, device='cuda:0', grad_fn=<NllLossBackward0>)


148it [00:55,  2.76it/s]

tensor(6.6875, device='cuda:0', grad_fn=<NllLossBackward0>)


149it [00:55,  2.77it/s]

tensor(7.4361, device='cuda:0', grad_fn=<NllLossBackward0>)


150it [00:55,  2.77it/s]

tensor(5.8870, device='cuda:0', grad_fn=<NllLossBackward0>)


151it [00:56,  2.77it/s]

tensor(2.5002, device='cuda:0', grad_fn=<NllLossBackward0>)


152it [00:56,  2.76it/s]

tensor(1.1337, device='cuda:0', grad_fn=<NllLossBackward0>)


153it [00:57,  2.76it/s]

tensor(6.0104, device='cuda:0', grad_fn=<NllLossBackward0>)


154it [00:57,  2.73it/s]

tensor(1.0274, device='cuda:0', grad_fn=<NllLossBackward0>)


155it [00:57,  2.74it/s]

tensor(5.3958, device='cuda:0', grad_fn=<NllLossBackward0>)


156it [00:58,  2.74it/s]

tensor(6.7472, device='cuda:0', grad_fn=<NllLossBackward0>)


157it [00:58,  2.73it/s]

tensor(1.2016, device='cuda:0', grad_fn=<NllLossBackward0>)


158it [00:58,  2.75it/s]

tensor(5.2136, device='cuda:0', grad_fn=<NllLossBackward0>)


159it [00:59,  2.76it/s]

tensor(7.5293, device='cuda:0', grad_fn=<NllLossBackward0>)


160it [00:59,  2.76it/s]

tensor(4.3658, device='cuda:0', grad_fn=<NllLossBackward0>)


161it [00:59,  2.76it/s]

tensor(6.2836, device='cuda:0', grad_fn=<NllLossBackward0>)


162it [01:00,  2.76it/s]

tensor(3.0785, device='cuda:0', grad_fn=<NllLossBackward0>)


163it [01:00,  2.76it/s]

tensor(5.8072, device='cuda:0', grad_fn=<NllLossBackward0>)


164it [01:01,  2.77it/s]

tensor(2.8518, device='cuda:0', grad_fn=<NllLossBackward0>)


165it [01:01,  2.76it/s]

tensor(5.5057, device='cuda:0', grad_fn=<NllLossBackward0>)


166it [01:01,  2.76it/s]

tensor(4.5940, device='cuda:0', grad_fn=<NllLossBackward0>)


167it [01:02,  2.76it/s]

tensor(5.7806, device='cuda:0', grad_fn=<NllLossBackward0>)


168it [01:02,  2.77it/s]

tensor(5.9255, device='cuda:0', grad_fn=<NllLossBackward0>)


169it [01:02,  2.77it/s]

tensor(4.6796, device='cuda:0', grad_fn=<NllLossBackward0>)


170it [01:03,  2.77it/s]

tensor(1.3885, device='cuda:0', grad_fn=<NllLossBackward0>)


171it [01:03,  2.76it/s]

tensor(1.0469, device='cuda:0', grad_fn=<NllLossBackward0>)


172it [01:03,  2.76it/s]

tensor(4.9661, device='cuda:0', grad_fn=<NllLossBackward0>)


173it [01:04,  2.76it/s]

tensor(4.4146, device='cuda:0', grad_fn=<NllLossBackward0>)


174it [01:04,  2.77it/s]

tensor(2.8748, device='cuda:0', grad_fn=<NllLossBackward0>)


175it [01:05,  2.76it/s]

tensor(3.6696, device='cuda:0', grad_fn=<NllLossBackward0>)


176it [01:05,  2.76it/s]

tensor(7.2491, device='cuda:0', grad_fn=<NllLossBackward0>)


177it [01:05,  2.76it/s]

tensor(1.8902, device='cuda:0', grad_fn=<NllLossBackward0>)


178it [01:06,  2.76it/s]

tensor(6.3870, device='cuda:0', grad_fn=<NllLossBackward0>)


179it [01:06,  2.76it/s]

tensor(5.1595, device='cuda:0', grad_fn=<NllLossBackward0>)


180it [01:06,  2.77it/s]

tensor(1.8008, device='cuda:0', grad_fn=<NllLossBackward0>)


181it [01:07,  2.77it/s]

tensor(6.7916, device='cuda:0', grad_fn=<NllLossBackward0>)


182it [01:07,  2.76it/s]

tensor(4.6475, device='cuda:0', grad_fn=<NllLossBackward0>)


183it [01:07,  2.76it/s]

tensor(5.8395, device='cuda:0', grad_fn=<NllLossBackward0>)


184it [01:08,  2.76it/s]

tensor(1.0029, device='cuda:0', grad_fn=<NllLossBackward0>)


185it [01:08,  2.77it/s]

tensor(1.4146, device='cuda:0', grad_fn=<NllLossBackward0>)


186it [01:09,  2.77it/s]

tensor(1.9070, device='cuda:0', grad_fn=<NllLossBackward0>)


187it [01:09,  2.76it/s]

tensor(5.6226, device='cuda:0', grad_fn=<NllLossBackward0>)


188it [01:09,  2.77it/s]

tensor(7.5404, device='cuda:0', grad_fn=<NllLossBackward0>)


189it [01:10,  2.77it/s]

tensor(4.0348, device='cuda:0', grad_fn=<NllLossBackward0>)


190it [01:10,  2.77it/s]

tensor(6.3040, device='cuda:0', grad_fn=<NllLossBackward0>)


191it [01:10,  2.75it/s]

tensor(5.4986, device='cuda:0', grad_fn=<NllLossBackward0>)


192it [01:11,  2.75it/s]

tensor(4.6761, device='cuda:0', grad_fn=<NllLossBackward0>)


193it [01:11,  2.74it/s]

tensor(5.6805, device='cuda:0', grad_fn=<NllLossBackward0>)


194it [01:11,  2.75it/s]

tensor(4.8555, device='cuda:0', grad_fn=<NllLossBackward0>)


195it [01:12,  2.75it/s]

tensor(4.9087, device='cuda:0', grad_fn=<NllLossBackward0>)


196it [01:12,  2.75it/s]

tensor(5.4720, device='cuda:0', grad_fn=<NllLossBackward0>)


197it [01:13,  2.75it/s]

tensor(5.8981, device='cuda:0', grad_fn=<NllLossBackward0>)


198it [01:13,  2.75it/s]

tensor(1.0803, device='cuda:0', grad_fn=<NllLossBackward0>)


199it [01:13,  2.76it/s]

tensor(4.0152, device='cuda:0', grad_fn=<NllLossBackward0>)


200it [01:14,  2.46it/s]

tensor(6.6140, device='cuda:0', grad_fn=<NllLossBackward0>)


201it [01:14,  2.53it/s]

tensor(5.2388, device='cuda:0', grad_fn=<NllLossBackward0>)


202it [01:14,  2.59it/s]

tensor(2.7421, device='cuda:0', grad_fn=<NllLossBackward0>)


203it [01:15,  2.64it/s]

tensor(5.8202, device='cuda:0', grad_fn=<NllLossBackward0>)


204it [01:15,  2.68it/s]

tensor(1.8861, device='cuda:0', grad_fn=<NllLossBackward0>)


205it [01:16,  2.71it/s]

tensor(4.5211, device='cuda:0', grad_fn=<NllLossBackward0>)


206it [01:16,  2.72it/s]

tensor(1.8697, device='cuda:0', grad_fn=<NllLossBackward0>)


207it [01:16,  2.73it/s]

tensor(5.8281, device='cuda:0', grad_fn=<NllLossBackward0>)


208it [01:17,  2.74it/s]

tensor(0.9548, device='cuda:0', grad_fn=<NllLossBackward0>)


209it [01:17,  2.75it/s]

tensor(5.7106, device='cuda:0', grad_fn=<NllLossBackward0>)


210it [01:17,  2.75it/s]

tensor(4.0732, device='cuda:0', grad_fn=<NllLossBackward0>)


211it [01:18,  2.76it/s]

tensor(4.0790, device='cuda:0', grad_fn=<NllLossBackward0>)


212it [01:18,  2.76it/s]

tensor(2.0689, device='cuda:0', grad_fn=<NllLossBackward0>)


213it [01:18,  2.76it/s]

tensor(4.7807, device='cuda:0', grad_fn=<NllLossBackward0>)


214it [01:19,  2.76it/s]

tensor(5.6983, device='cuda:0', grad_fn=<NllLossBackward0>)


215it [01:19,  2.76it/s]

tensor(4.6253, device='cuda:0', grad_fn=<NllLossBackward0>)


216it [01:20,  2.76it/s]

tensor(5.2525, device='cuda:0', grad_fn=<NllLossBackward0>)


217it [01:20,  2.74it/s]

tensor(1.5852, device='cuda:0', grad_fn=<NllLossBackward0>)


218it [01:20,  2.75it/s]

tensor(5.1114, device='cuda:0', grad_fn=<NllLossBackward0>)


219it [01:21,  2.76it/s]

tensor(3.5337, device='cuda:0', grad_fn=<NllLossBackward0>)


220it [01:21,  2.76it/s]

tensor(5.6066, device='cuda:0', grad_fn=<NllLossBackward0>)


221it [01:21,  2.76it/s]

tensor(5.3087, device='cuda:0', grad_fn=<NllLossBackward0>)


222it [01:22,  2.76it/s]

tensor(3.7311, device='cuda:0', grad_fn=<NllLossBackward0>)


223it [01:22,  2.76it/s]

tensor(3.8292, device='cuda:0', grad_fn=<NllLossBackward0>)


224it [01:22,  2.76it/s]

tensor(4.9377, device='cuda:0', grad_fn=<NllLossBackward0>)


225it [01:23,  2.76it/s]

tensor(2.9631, device='cuda:0', grad_fn=<NllLossBackward0>)


226it [01:23,  2.76it/s]

tensor(5.3538, device='cuda:0', grad_fn=<NllLossBackward0>)


227it [01:24,  2.76it/s]

tensor(5.1972, device='cuda:0', grad_fn=<NllLossBackward0>)


228it [01:24,  2.75it/s]

tensor(4.8346, device='cuda:0', grad_fn=<NllLossBackward0>)


229it [01:24,  2.76it/s]

tensor(4.9496, device='cuda:0', grad_fn=<NllLossBackward0>)


230it [01:25,  2.76it/s]

tensor(5.6935, device='cuda:0', grad_fn=<NllLossBackward0>)


231it [01:25,  2.75it/s]

tensor(4.4772, device='cuda:0', grad_fn=<NllLossBackward0>)


232it [01:25,  2.75it/s]

tensor(2.4260, device='cuda:0', grad_fn=<NllLossBackward0>)


233it [01:26,  2.76it/s]

tensor(6.0101, device='cuda:0', grad_fn=<NllLossBackward0>)


234it [01:26,  2.76it/s]

tensor(5.4467, device='cuda:0', grad_fn=<NllLossBackward0>)


235it [01:26,  2.76it/s]

tensor(4.8484, device='cuda:0', grad_fn=<NllLossBackward0>)


236it [01:27,  2.76it/s]

tensor(4.5136, device='cuda:0', grad_fn=<NllLossBackward0>)


237it [01:27,  2.76it/s]

tensor(4.5319, device='cuda:0', grad_fn=<NllLossBackward0>)


238it [01:28,  2.75it/s]

tensor(0.4652, device='cuda:0', grad_fn=<NllLossBackward0>)


239it [01:28,  2.75it/s]

tensor(0.7740, device='cuda:0', grad_fn=<NllLossBackward0>)


240it [01:28,  2.76it/s]

tensor(2.1176, device='cuda:0', grad_fn=<NllLossBackward0>)


241it [01:29,  2.76it/s]

tensor(5.3231, device='cuda:0', grad_fn=<NllLossBackward0>)


242it [01:29,  2.76it/s]

tensor(5.2883, device='cuda:0', grad_fn=<NllLossBackward0>)


243it [01:29,  2.76it/s]

tensor(6.3338, device='cuda:0', grad_fn=<NllLossBackward0>)


244it [01:30,  2.76it/s]

tensor(4.6628, device='cuda:0', grad_fn=<NllLossBackward0>)


245it [01:30,  2.76it/s]

tensor(3.1406, device='cuda:0', grad_fn=<NllLossBackward0>)


246it [01:30,  2.76it/s]

tensor(1.2049, device='cuda:0', grad_fn=<NllLossBackward0>)


247it [01:31,  2.76it/s]

tensor(0.8396, device='cuda:0', grad_fn=<NllLossBackward0>)


248it [01:31,  2.76it/s]

tensor(5.4328, device='cuda:0', grad_fn=<NllLossBackward0>)


249it [01:32,  2.76it/s]

tensor(3.6043, device='cuda:0', grad_fn=<NllLossBackward0>)


250it [01:32,  2.76it/s]

tensor(4.1734, device='cuda:0', grad_fn=<NllLossBackward0>)


251it [01:32,  2.75it/s]

tensor(7.3081, device='cuda:0', grad_fn=<NllLossBackward0>)


252it [01:33,  2.75it/s]

tensor(2.8914, device='cuda:0', grad_fn=<NllLossBackward0>)


253it [01:33,  2.74it/s]

tensor(5.1347, device='cuda:0', grad_fn=<NllLossBackward0>)


254it [01:33,  2.75it/s]

tensor(4.5447, device='cuda:0', grad_fn=<NllLossBackward0>)


255it [01:34,  2.75it/s]

tensor(3.5860, device='cuda:0', grad_fn=<NllLossBackward0>)


256it [01:34,  2.69it/s]

tensor(4.3505, device='cuda:0', grad_fn=<NllLossBackward0>)


257it [01:34,  2.71it/s]

tensor(8.1698, device='cuda:0', grad_fn=<NllLossBackward0>)


258it [01:35,  2.73it/s]

tensor(5.6043, device='cuda:0', grad_fn=<NllLossBackward0>)


259it [01:35,  2.74it/s]

tensor(5.3234, device='cuda:0', grad_fn=<NllLossBackward0>)


260it [01:36,  2.74it/s]

tensor(4.3469, device='cuda:0', grad_fn=<NllLossBackward0>)


261it [01:36,  2.75it/s]

tensor(4.1675, device='cuda:0', grad_fn=<NllLossBackward0>)


262it [01:36,  2.76it/s]

tensor(5.3275, device='cuda:0', grad_fn=<NllLossBackward0>)


263it [01:37,  2.76it/s]

tensor(4.2565, device='cuda:0', grad_fn=<NllLossBackward0>)


264it [01:37,  2.75it/s]

tensor(5.5182, device='cuda:0', grad_fn=<NllLossBackward0>)


265it [01:37,  2.76it/s]

tensor(6.4680, device='cuda:0', grad_fn=<NllLossBackward0>)


266it [01:38,  2.76it/s]

tensor(5.8278, device='cuda:0', grad_fn=<NllLossBackward0>)


267it [01:38,  2.76it/s]

tensor(4.9093, device='cuda:0', grad_fn=<NllLossBackward0>)


268it [01:38,  2.75it/s]

tensor(5.4592, device='cuda:0', grad_fn=<NllLossBackward0>)


269it [01:39,  2.75it/s]

tensor(5.4589, device='cuda:0', grad_fn=<NllLossBackward0>)


270it [01:39,  2.76it/s]

tensor(5.8697, device='cuda:0', grad_fn=<NllLossBackward0>)


271it [01:40,  2.76it/s]

tensor(7.0351, device='cuda:0', grad_fn=<NllLossBackward0>)


272it [01:40,  2.47it/s]

tensor(6.5699, device='cuda:0', grad_fn=<NllLossBackward0>)


273it [01:40,  2.55it/s]

tensor(5.0483, device='cuda:0', grad_fn=<NllLossBackward0>)


274it [01:41,  2.62it/s]

tensor(3.4388, device='cuda:0', grad_fn=<NllLossBackward0>)


275it [01:41,  2.66it/s]

tensor(4.1604, device='cuda:0', grad_fn=<NllLossBackward0>)


276it [01:41,  2.68it/s]

tensor(7.2248, device='cuda:0', grad_fn=<NllLossBackward0>)


277it [01:42,  2.70it/s]

tensor(4.4856, device='cuda:0', grad_fn=<NllLossBackward0>)


278it [01:42,  2.72it/s]

tensor(4.3756, device='cuda:0', grad_fn=<NllLossBackward0>)


279it [01:43,  2.73it/s]

tensor(7.5735, device='cuda:0', grad_fn=<NllLossBackward0>)


280it [01:43,  2.74it/s]

tensor(1.4208, device='cuda:0', grad_fn=<NllLossBackward0>)


281it [01:43,  2.74it/s]

tensor(2.4321, device='cuda:0', grad_fn=<NllLossBackward0>)


282it [01:44,  2.74it/s]

tensor(6.0242, device='cuda:0', grad_fn=<NllLossBackward0>)


283it [01:44,  2.75it/s]

tensor(6.0887, device='cuda:0', grad_fn=<NllLossBackward0>)


284it [01:44,  2.75it/s]

tensor(5.8384, device='cuda:0', grad_fn=<NllLossBackward0>)


285it [01:45,  2.74it/s]

tensor(6.4922, device='cuda:0', grad_fn=<NllLossBackward0>)


286it [01:45,  2.74it/s]

tensor(4.8074, device='cuda:0', grad_fn=<NllLossBackward0>)


287it [01:45,  2.75it/s]

tensor(5.9140, device='cuda:0', grad_fn=<NllLossBackward0>)


288it [01:46,  2.74it/s]

tensor(1.0303, device='cuda:0', grad_fn=<NllLossBackward0>)


289it [01:46,  2.74it/s]

tensor(6.2424, device='cuda:0', grad_fn=<NllLossBackward0>)


290it [01:47,  2.75it/s]

tensor(4.8721, device='cuda:0', grad_fn=<NllLossBackward0>)


291it [01:47,  2.76it/s]

tensor(2.2904, device='cuda:0', grad_fn=<NllLossBackward0>)


292it [01:47,  2.76it/s]

tensor(5.4025, device='cuda:0', grad_fn=<NllLossBackward0>)


293it [01:48,  2.76it/s]

tensor(5.4319, device='cuda:0', grad_fn=<NllLossBackward0>)


294it [01:48,  2.76it/s]

tensor(3.6710, device='cuda:0', grad_fn=<NllLossBackward0>)


295it [01:48,  2.75it/s]

tensor(2.7057, device='cuda:0', grad_fn=<NllLossBackward0>)


296it [01:49,  2.75it/s]

tensor(4.8590, device='cuda:0', grad_fn=<NllLossBackward0>)


297it [01:49,  2.74it/s]

tensor(1.0585, device='cuda:0', grad_fn=<NllLossBackward0>)


298it [01:49,  2.75it/s]

tensor(3.4001, device='cuda:0', grad_fn=<NllLossBackward0>)


299it [01:50,  2.75it/s]

tensor(4.3882, device='cuda:0', grad_fn=<NllLossBackward0>)


300it [01:50,  2.76it/s]

tensor(4.7900, device='cuda:0', grad_fn=<NllLossBackward0>)


301it [01:51,  2.75it/s]

tensor(5.5513, device='cuda:0', grad_fn=<NllLossBackward0>)


302it [01:51,  2.75it/s]

tensor(5.2037, device='cuda:0', grad_fn=<NllLossBackward0>)


303it [01:51,  2.76it/s]

tensor(5.5860, device='cuda:0', grad_fn=<NllLossBackward0>)


304it [01:52,  2.76it/s]

tensor(4.4049, device='cuda:0', grad_fn=<NllLossBackward0>)


305it [01:52,  2.75it/s]

tensor(0.5384, device='cuda:0', grad_fn=<NllLossBackward0>)


306it [01:52,  2.76it/s]

tensor(5.6086, device='cuda:0', grad_fn=<NllLossBackward0>)


307it [01:53,  2.76it/s]

tensor(1.6655, device='cuda:0', grad_fn=<NllLossBackward0>)


308it [01:53,  2.76it/s]

tensor(0.9272, device='cuda:0', grad_fn=<NllLossBackward0>)


309it [01:53,  2.76it/s]

tensor(0.8200, device='cuda:0', grad_fn=<NllLossBackward0>)


310it [01:54,  2.75it/s]

tensor(5.4159, device='cuda:0', grad_fn=<NllLossBackward0>)


311it [01:54,  2.75it/s]

tensor(3.6949, device='cuda:0', grad_fn=<NllLossBackward0>)


312it [01:55,  2.76it/s]

tensor(1.6252, device='cuda:0', grad_fn=<NllLossBackward0>)


313it [01:55,  2.75it/s]

tensor(5.3056, device='cuda:0', grad_fn=<NllLossBackward0>)


314it [01:55,  2.76it/s]

tensor(2.9085, device='cuda:0', grad_fn=<NllLossBackward0>)


315it [01:56,  2.76it/s]

tensor(3.7536, device='cuda:0', grad_fn=<NllLossBackward0>)


316it [01:56,  2.76it/s]

tensor(4.9951, device='cuda:0', grad_fn=<NllLossBackward0>)


317it [01:56,  2.76it/s]

tensor(7.1511, device='cuda:0', grad_fn=<NllLossBackward0>)


318it [01:57,  2.76it/s]

tensor(4.3267, device='cuda:0', grad_fn=<NllLossBackward0>)


319it [01:57,  2.76it/s]

tensor(4.5026, device='cuda:0', grad_fn=<NllLossBackward0>)


320it [01:57,  2.75it/s]

tensor(5.8267, device='cuda:0', grad_fn=<NllLossBackward0>)


321it [01:58,  2.76it/s]

tensor(1.2339, device='cuda:0', grad_fn=<NllLossBackward0>)


322it [01:58,  2.76it/s]

tensor(1.9471, device='cuda:0', grad_fn=<NllLossBackward0>)


323it [01:59,  2.76it/s]

tensor(3.5252, device='cuda:0', grad_fn=<NllLossBackward0>)


324it [01:59,  2.77it/s]

tensor(4.6725, device='cuda:0', grad_fn=<NllLossBackward0>)


325it [01:59,  2.77it/s]

tensor(4.9251, device='cuda:0', grad_fn=<NllLossBackward0>)


326it [02:00,  2.77it/s]

tensor(5.4726, device='cuda:0', grad_fn=<NllLossBackward0>)


327it [02:00,  2.76it/s]

tensor(0.7500, device='cuda:0', grad_fn=<NllLossBackward0>)


328it [02:00,  2.77it/s]

tensor(5.5960, device='cuda:0', grad_fn=<NllLossBackward0>)


329it [02:01,  2.77it/s]

tensor(4.2523, device='cuda:0', grad_fn=<NllLossBackward0>)


330it [02:01,  2.77it/s]

tensor(3.0903, device='cuda:0', grad_fn=<NllLossBackward0>)


331it [02:01,  2.77it/s]

tensor(1.1275, device='cuda:0', grad_fn=<NllLossBackward0>)


332it [02:02,  2.76it/s]

tensor(4.1923, device='cuda:0', grad_fn=<NllLossBackward0>)


333it [02:02,  2.76it/s]

tensor(4.6081, device='cuda:0', grad_fn=<NllLossBackward0>)


334it [02:03,  2.76it/s]

tensor(6.0058, device='cuda:0', grad_fn=<NllLossBackward0>)


335it [02:03,  2.62it/s]

tensor(1.3503, device='cuda:0', grad_fn=<NllLossBackward0>)


336it [02:03,  2.67it/s]

tensor(6.6061, device='cuda:0', grad_fn=<NllLossBackward0>)


337it [02:04,  2.69it/s]

tensor(5.7782, device='cuda:0', grad_fn=<NllLossBackward0>)


338it [02:04,  2.71it/s]

tensor(5.1650, device='cuda:0', grad_fn=<NllLossBackward0>)


339it [02:04,  2.72it/s]

tensor(4.9485, device='cuda:0', grad_fn=<NllLossBackward0>)


340it [02:05,  2.73it/s]

tensor(4.3301, device='cuda:0', grad_fn=<NllLossBackward0>)


341it [02:05,  2.73it/s]

tensor(5.0232, device='cuda:0', grad_fn=<NllLossBackward0>)


342it [02:05,  2.74it/s]

tensor(1.4321, device='cuda:0', grad_fn=<NllLossBackward0>)


343it [02:06,  2.74it/s]

tensor(2.9376, device='cuda:0', grad_fn=<NllLossBackward0>)


344it [02:06,  2.74it/s]

tensor(4.4918, device='cuda:0', grad_fn=<NllLossBackward0>)


345it [02:07,  2.75it/s]

tensor(4.4420, device='cuda:0', grad_fn=<NllLossBackward0>)


346it [02:07,  2.74it/s]

tensor(2.4250, device='cuda:0', grad_fn=<NllLossBackward0>)


347it [02:07,  2.74it/s]

tensor(4.8235, device='cuda:0', grad_fn=<NllLossBackward0>)


348it [02:08,  2.74it/s]

tensor(4.6794, device='cuda:0', grad_fn=<NllLossBackward0>)


349it [02:08,  2.74it/s]

tensor(6.8615, device='cuda:0', grad_fn=<NllLossBackward0>)


350it [02:08,  2.74it/s]

tensor(3.3668, device='cuda:0', grad_fn=<NllLossBackward0>)


351it [02:09,  2.74it/s]

tensor(2.4052, device='cuda:0', grad_fn=<NllLossBackward0>)


352it [02:09,  2.75it/s]

tensor(4.0299, device='cuda:0', grad_fn=<NllLossBackward0>)


353it [02:10,  2.74it/s]

tensor(1.7454, device='cuda:0', grad_fn=<NllLossBackward0>)


354it [02:10,  2.75it/s]

tensor(1.1951, device='cuda:0', grad_fn=<NllLossBackward0>)


355it [02:10,  2.74it/s]

tensor(4.0002, device='cuda:0', grad_fn=<NllLossBackward0>)


356it [02:11,  2.74it/s]

tensor(1.7093, device='cuda:0', grad_fn=<NllLossBackward0>)


357it [02:11,  2.74it/s]

tensor(4.4954, device='cuda:0', grad_fn=<NllLossBackward0>)


358it [02:11,  2.75it/s]

tensor(4.1906, device='cuda:0', grad_fn=<NllLossBackward0>)


359it [02:12,  2.74it/s]

tensor(5.3906, device='cuda:0', grad_fn=<NllLossBackward0>)


360it [02:12,  2.23it/s]

tensor(6.1484, device='cuda:0', grad_fn=<NllLossBackward0>)


361it [02:13,  2.37it/s]

tensor(5.6323, device='cuda:0', grad_fn=<NllLossBackward0>)


362it [02:13,  2.47it/s]

tensor(1.4147, device='cuda:0', grad_fn=<NllLossBackward0>)


363it [02:13,  2.55it/s]

tensor(5.6481, device='cuda:0', grad_fn=<NllLossBackward0>)


364it [02:14,  2.60it/s]

tensor(7.1022, device='cuda:0', grad_fn=<NllLossBackward0>)


365it [02:14,  2.63it/s]

tensor(0.5820, device='cuda:0', grad_fn=<NllLossBackward0>)


366it [02:15,  2.67it/s]

tensor(1.8829, device='cuda:0', grad_fn=<NllLossBackward0>)


367it [02:15,  2.69it/s]

tensor(4.3113, device='cuda:0', grad_fn=<NllLossBackward0>)


368it [02:15,  2.70it/s]

tensor(5.2621, device='cuda:0', grad_fn=<NllLossBackward0>)


369it [02:16,  2.71it/s]

tensor(4.9312, device='cuda:0', grad_fn=<NllLossBackward0>)


370it [02:16,  2.72it/s]

tensor(5.8926, device='cuda:0', grad_fn=<NllLossBackward0>)


371it [02:16,  2.72it/s]

tensor(4.6303, device='cuda:0', grad_fn=<NllLossBackward0>)


372it [02:17,  2.73it/s]

tensor(5.8469, device='cuda:0', grad_fn=<NllLossBackward0>)


373it [02:17,  2.73it/s]

tensor(6.0490, device='cuda:0', grad_fn=<NllLossBackward0>)


374it [02:17,  2.73it/s]

tensor(6.4421, device='cuda:0', grad_fn=<NllLossBackward0>)


375it [02:18,  2.74it/s]

tensor(3.9454, device='cuda:0', grad_fn=<NllLossBackward0>)


376it [02:18,  2.74it/s]

tensor(5.3387, device='cuda:0', grad_fn=<NllLossBackward0>)


377it [02:19,  2.75it/s]

tensor(3.6882, device='cuda:0', grad_fn=<NllLossBackward0>)


378it [02:19,  2.75it/s]

tensor(4.4348, device='cuda:0', grad_fn=<NllLossBackward0>)


379it [02:19,  2.75it/s]

tensor(5.4611, device='cuda:0', grad_fn=<NllLossBackward0>)


380it [02:20,  2.75it/s]

tensor(4.5755, device='cuda:0', grad_fn=<NllLossBackward0>)


381it [02:20,  2.75it/s]

tensor(6.1441, device='cuda:0', grad_fn=<NllLossBackward0>)


382it [02:20,  2.75it/s]

tensor(3.9030, device='cuda:0', grad_fn=<NllLossBackward0>)


383it [02:21,  2.75it/s]

tensor(7.1875, device='cuda:0', grad_fn=<NllLossBackward0>)


384it [02:21,  2.75it/s]

tensor(5.7703, device='cuda:0', grad_fn=<NllLossBackward0>)


385it [02:21,  2.75it/s]

tensor(5.3678, device='cuda:0', grad_fn=<NllLossBackward0>)


386it [02:22,  2.75it/s]

tensor(4.7649, device='cuda:0', grad_fn=<NllLossBackward0>)


387it [02:22,  2.75it/s]

tensor(5.9907, device='cuda:0', grad_fn=<NllLossBackward0>)


388it [02:23,  2.75it/s]

tensor(6.5729, device='cuda:0', grad_fn=<NllLossBackward0>)


389it [02:23,  2.75it/s]

tensor(4.3486, device='cuda:0', grad_fn=<NllLossBackward0>)


390it [02:23,  2.75it/s]

tensor(4.3273, device='cuda:0', grad_fn=<NllLossBackward0>)


391it [02:24,  2.74it/s]

tensor(4.5762, device='cuda:0', grad_fn=<NllLossBackward0>)


392it [02:24,  2.74it/s]

tensor(6.9272, device='cuda:0', grad_fn=<NllLossBackward0>)


393it [02:24,  2.75it/s]

tensor(4.6334, device='cuda:0', grad_fn=<NllLossBackward0>)


394it [02:25,  2.63it/s]

tensor(5.0647, device='cuda:0', grad_fn=<NllLossBackward0>)


395it [02:25,  2.66it/s]

tensor(3.3347, device='cuda:0', grad_fn=<NllLossBackward0>)


396it [02:26,  2.69it/s]

tensor(6.6472, device='cuda:0', grad_fn=<NllLossBackward0>)


397it [02:26,  2.71it/s]

tensor(5.2890, device='cuda:0', grad_fn=<NllLossBackward0>)


398it [02:26,  2.72it/s]

tensor(4.6737, device='cuda:0', grad_fn=<NllLossBackward0>)


399it [02:27,  2.73it/s]

tensor(4.8991, device='cuda:0', grad_fn=<NllLossBackward0>)


400it [02:27,  2.73it/s]

tensor(1.4541, device='cuda:0', grad_fn=<NllLossBackward0>)


401it [02:27,  2.71it/s]

tensor(5.2985, device='cuda:0', grad_fn=<NllLossBackward0>)


402it [02:28,  2.71it/s]

tensor(5.4696, device='cuda:0', grad_fn=<NllLossBackward0>)


403it [02:28,  2.73it/s]

tensor(9.2423, device='cuda:0', grad_fn=<NllLossBackward0>)


404it [02:28,  2.73it/s]

tensor(0.9531, device='cuda:0', grad_fn=<NllLossBackward0>)


405it [02:29,  2.74it/s]

tensor(5.7743, device='cuda:0', grad_fn=<NllLossBackward0>)


406it [02:29,  2.74it/s]

tensor(5.0661, device='cuda:0', grad_fn=<NllLossBackward0>)


407it [02:30,  2.74it/s]

tensor(4.3308, device='cuda:0', grad_fn=<NllLossBackward0>)


408it [02:30,  2.75it/s]

tensor(3.3312, device='cuda:0', grad_fn=<NllLossBackward0>)


409it [02:30,  2.73it/s]

tensor(5.3686, device='cuda:0', grad_fn=<NllLossBackward0>)


410it [02:31,  2.74it/s]

tensor(5.2876, device='cuda:0', grad_fn=<NllLossBackward0>)


411it [02:31,  2.73it/s]

tensor(7.0914, device='cuda:0', grad_fn=<NllLossBackward0>)


412it [02:31,  2.74it/s]

tensor(4.3055, device='cuda:0', grad_fn=<NllLossBackward0>)


413it [02:32,  2.74it/s]

tensor(4.2678, device='cuda:0', grad_fn=<NllLossBackward0>)


414it [02:32,  2.73it/s]

tensor(3.8395, device='cuda:0', grad_fn=<NllLossBackward0>)


415it [02:32,  2.74it/s]

tensor(6.4586, device='cuda:0', grad_fn=<NllLossBackward0>)


416it [02:33,  2.71it/s]

tensor(5.8796, device='cuda:0', grad_fn=<NllLossBackward0>)


417it [02:33,  2.71it/s]

tensor(1.6590, device='cuda:0', grad_fn=<NllLossBackward0>)


418it [02:34,  2.71it/s]

tensor(4.4769, device='cuda:0', grad_fn=<NllLossBackward0>)


419it [02:34,  2.72it/s]

tensor(6.2688, device='cuda:0', grad_fn=<NllLossBackward0>)


420it [02:34,  2.73it/s]

tensor(6.5892, device='cuda:0', grad_fn=<NllLossBackward0>)


421it [02:35,  2.74it/s]

tensor(5.0957, device='cuda:0', grad_fn=<NllLossBackward0>)


422it [02:35,  2.74it/s]

tensor(5.9279, device='cuda:0', grad_fn=<NllLossBackward0>)


423it [02:35,  2.74it/s]

tensor(6.3487, device='cuda:0', grad_fn=<NllLossBackward0>)


424it [02:36,  2.74it/s]

tensor(4.9748, device='cuda:0', grad_fn=<NllLossBackward0>)


425it [02:36,  2.42it/s]

tensor(5.1081, device='cuda:0', grad_fn=<NllLossBackward0>)


426it [02:37,  2.50it/s]

tensor(6.6034, device='cuda:0', grad_fn=<NllLossBackward0>)


427it [02:37,  2.57it/s]

tensor(5.9655, device='cuda:0', grad_fn=<NllLossBackward0>)


428it [02:37,  2.61it/s]

tensor(4.5728, device='cuda:0', grad_fn=<NllLossBackward0>)


429it [02:38,  2.62it/s]

tensor(2.0856, device='cuda:0', grad_fn=<NllLossBackward0>)


430it [02:38,  2.65it/s]

tensor(6.1116, device='cuda:0', grad_fn=<NllLossBackward0>)


431it [02:38,  2.66it/s]

tensor(1.3584, device='cuda:0', grad_fn=<NllLossBackward0>)


432it [02:39,  2.68it/s]

tensor(6.7583, device='cuda:0', grad_fn=<NllLossBackward0>)


433it [02:39,  2.69it/s]

tensor(1.8143, device='cuda:0', grad_fn=<NllLossBackward0>)


434it [02:40,  2.70it/s]

tensor(6.2112, device='cuda:0', grad_fn=<NllLossBackward0>)


435it [02:40,  2.71it/s]

tensor(5.5786, device='cuda:0', grad_fn=<NllLossBackward0>)


436it [02:40,  2.72it/s]

tensor(3.8331, device='cuda:0', grad_fn=<NllLossBackward0>)


437it [02:41,  2.73it/s]

tensor(2.7925, device='cuda:0', grad_fn=<NllLossBackward0>)


438it [02:41,  2.74it/s]

tensor(2.6751, device='cuda:0', grad_fn=<NllLossBackward0>)


439it [02:41,  2.74it/s]

tensor(6.1725, device='cuda:0', grad_fn=<NllLossBackward0>)


440it [02:42,  2.69it/s]

tensor(4.6108, device='cuda:0', grad_fn=<NllLossBackward0>)


441it [02:42,  2.69it/s]

tensor(1.8938, device='cuda:0', grad_fn=<NllLossBackward0>)


442it [02:43,  2.69it/s]

tensor(6.3381, device='cuda:0', grad_fn=<NllLossBackward0>)


443it [02:43,  2.70it/s]

tensor(5.8584, device='cuda:0', grad_fn=<NllLossBackward0>)


444it [02:43,  2.71it/s]

tensor(2.3789, device='cuda:0', grad_fn=<NllLossBackward0>)


445it [02:44,  2.72it/s]

tensor(1.6204, device='cuda:0', grad_fn=<NllLossBackward0>)


446it [02:44,  2.73it/s]

tensor(5.5283, device='cuda:0', grad_fn=<NllLossBackward0>)


447it [02:44,  2.73it/s]

tensor(6.3615, device='cuda:0', grad_fn=<NllLossBackward0>)


448it [02:45,  2.74it/s]

tensor(3.6441, device='cuda:0', grad_fn=<NllLossBackward0>)


449it [02:45,  2.73it/s]

tensor(2.0338, device='cuda:0', grad_fn=<NllLossBackward0>)


450it [02:45,  2.73it/s]

tensor(6.1636, device='cuda:0', grad_fn=<NllLossBackward0>)


451it [02:46,  2.72it/s]

tensor(5.3126, device='cuda:0', grad_fn=<NllLossBackward0>)


452it [02:46,  2.73it/s]

tensor(5.3622, device='cuda:0', grad_fn=<NllLossBackward0>)


453it [02:47,  2.72it/s]

tensor(5.1959, device='cuda:0', grad_fn=<NllLossBackward0>)


454it [02:47,  2.73it/s]

tensor(1.4112, device='cuda:0', grad_fn=<NllLossBackward0>)


455it [02:47,  2.72it/s]

tensor(5.0668, device='cuda:0', grad_fn=<NllLossBackward0>)


456it [02:48,  2.73it/s]

tensor(6.4011, device='cuda:0', grad_fn=<NllLossBackward0>)


457it [02:48,  2.74it/s]

tensor(1.6754, device='cuda:0', grad_fn=<NllLossBackward0>)


458it [02:48,  2.74it/s]

tensor(6.3836, device='cuda:0', grad_fn=<NllLossBackward0>)


459it [02:49,  2.73it/s]

tensor(5.9562, device='cuda:0', grad_fn=<NllLossBackward0>)


460it [02:49,  2.74it/s]

tensor(5.3084, device='cuda:0', grad_fn=<NllLossBackward0>)


461it [02:49,  2.74it/s]

tensor(4.8142, device='cuda:0', grad_fn=<NllLossBackward0>)


462it [02:50,  2.74it/s]

tensor(4.0333, device='cuda:0', grad_fn=<NllLossBackward0>)


463it [02:50,  2.74it/s]

tensor(3.5015, device='cuda:0', grad_fn=<NllLossBackward0>)


464it [02:51,  2.74it/s]

tensor(3.1320, device='cuda:0', grad_fn=<NllLossBackward0>)


465it [02:51,  2.74it/s]

tensor(4.3328, device='cuda:0', grad_fn=<NllLossBackward0>)


466it [02:51,  2.74it/s]

tensor(3.3512, device='cuda:0', grad_fn=<NllLossBackward0>)


467it [02:52,  2.74it/s]

tensor(5.7533, device='cuda:0', grad_fn=<NllLossBackward0>)


468it [02:52,  2.74it/s]

tensor(2.7661, device='cuda:0', grad_fn=<NllLossBackward0>)


469it [02:52,  2.74it/s]

tensor(6.2165, device='cuda:0', grad_fn=<NllLossBackward0>)


470it [02:53,  2.74it/s]

tensor(4.9003, device='cuda:0', grad_fn=<NllLossBackward0>)


471it [02:53,  2.74it/s]

tensor(4.1252, device='cuda:0', grad_fn=<NllLossBackward0>)


472it [02:54,  2.74it/s]

tensor(0.8702, device='cuda:0', grad_fn=<NllLossBackward0>)


473it [02:54,  2.74it/s]

tensor(5.4814, device='cuda:0', grad_fn=<NllLossBackward0>)


474it [02:54,  2.74it/s]

tensor(1.2738, device='cuda:0', grad_fn=<NllLossBackward0>)


475it [02:55,  2.73it/s]

tensor(3.6903, device='cuda:0', grad_fn=<NllLossBackward0>)


476it [02:55,  2.74it/s]

tensor(5.4933, device='cuda:0', grad_fn=<NllLossBackward0>)


477it [02:55,  2.75it/s]

tensor(5.4432, device='cuda:0', grad_fn=<NllLossBackward0>)


478it [02:56,  2.75it/s]

tensor(9.2038, device='cuda:0', grad_fn=<NllLossBackward0>)


479it [02:56,  2.75it/s]

tensor(1.6552, device='cuda:0', grad_fn=<NllLossBackward0>)


480it [02:56,  2.75it/s]

tensor(5.7660, device='cuda:0', grad_fn=<NllLossBackward0>)


481it [02:57,  2.74it/s]

tensor(7.1335, device='cuda:0', grad_fn=<NllLossBackward0>)


482it [02:57,  2.75it/s]

tensor(3.7476, device='cuda:0', grad_fn=<NllLossBackward0>)


483it [02:58,  2.74it/s]

tensor(5.4938, device='cuda:0', grad_fn=<NllLossBackward0>)


484it [02:58,  2.74it/s]

tensor(3.9546, device='cuda:0', grad_fn=<NllLossBackward0>)


485it [02:58,  2.74it/s]

tensor(5.2770, device='cuda:0', grad_fn=<NllLossBackward0>)


486it [02:59,  2.73it/s]

tensor(7.0774, device='cuda:0', grad_fn=<NllLossBackward0>)


487it [02:59,  2.74it/s]

tensor(4.5666, device='cuda:0', grad_fn=<NllLossBackward0>)


488it [02:59,  2.74it/s]

tensor(5.0952, device='cuda:0', grad_fn=<NllLossBackward0>)


489it [03:00,  2.74it/s]

tensor(16.2464, device='cuda:0', grad_fn=<NllLossBackward0>)


490it [03:00,  2.74it/s]

tensor(1.9331, device='cuda:0', grad_fn=<NllLossBackward0>)


491it [03:00,  2.74it/s]

tensor(1.0966, device='cuda:0', grad_fn=<NllLossBackward0>)


492it [03:01,  2.75it/s]

tensor(4.1074, device='cuda:0', grad_fn=<NllLossBackward0>)


493it [03:01,  2.75it/s]

tensor(1.6607, device='cuda:0', grad_fn=<NllLossBackward0>)


494it [03:02,  2.75it/s]

tensor(5.5538, device='cuda:0', grad_fn=<NllLossBackward0>)


495it [03:02,  2.74it/s]

tensor(0.7022, device='cuda:0', grad_fn=<NllLossBackward0>)


496it [03:02,  2.74it/s]

tensor(7.0104, device='cuda:0', grad_fn=<NllLossBackward0>)


497it [03:03,  2.75it/s]

tensor(4.1044, device='cuda:0', grad_fn=<NllLossBackward0>)


498it [03:03,  2.75it/s]

tensor(9.3878, device='cuda:0', grad_fn=<NllLossBackward0>)


499it [03:03,  2.75it/s]

tensor(0.4860, device='cuda:0', grad_fn=<NllLossBackward0>)


500it [03:04,  2.74it/s]

tensor(2.3347, device='cuda:0', grad_fn=<NllLossBackward0>)


501it [03:04,  2.75it/s]

tensor(3.0867, device='cuda:0', grad_fn=<NllLossBackward0>)


502it [03:04,  2.75it/s]

tensor(7.3481, device='cuda:0', grad_fn=<NllLossBackward0>)


503it [03:05,  2.75it/s]

tensor(3.9834, device='cuda:0', grad_fn=<NllLossBackward0>)


504it [03:05,  2.76it/s]

tensor(4.8766, device='cuda:0', grad_fn=<NllLossBackward0>)


505it [03:06,  2.75it/s]

tensor(4.9528, device='cuda:0', grad_fn=<NllLossBackward0>)


506it [03:06,  2.75it/s]

tensor(3.4540, device='cuda:0', grad_fn=<NllLossBackward0>)


507it [03:06,  2.75it/s]

tensor(4.0870, device='cuda:0', grad_fn=<NllLossBackward0>)


508it [03:07,  2.75it/s]

tensor(1.4052, device='cuda:0', grad_fn=<NllLossBackward0>)


509it [03:07,  2.73it/s]

tensor(5.4209, device='cuda:0', grad_fn=<NllLossBackward0>)


510it [03:07,  2.73it/s]

tensor(0.9222, device='cuda:0', grad_fn=<NllLossBackward0>)


511it [03:08,  2.73it/s]

tensor(5.9315, device='cuda:0', grad_fn=<NllLossBackward0>)


512it [03:08,  2.74it/s]

tensor(5.4296, device='cuda:0', grad_fn=<NllLossBackward0>)


513it [03:08,  2.74it/s]

tensor(3.8090, device='cuda:0', grad_fn=<NllLossBackward0>)


514it [03:09,  2.74it/s]

tensor(2.5519, device='cuda:0', grad_fn=<NllLossBackward0>)


515it [03:09,  2.75it/s]

tensor(5.1593, device='cuda:0', grad_fn=<NllLossBackward0>)


516it [03:10,  2.75it/s]

tensor(0.2229, device='cuda:0', grad_fn=<NllLossBackward0>)


517it [03:10,  2.75it/s]

tensor(5.8736, device='cuda:0', grad_fn=<NllLossBackward0>)


518it [03:10,  2.75it/s]

tensor(7.0534, device='cuda:0', grad_fn=<NllLossBackward0>)


519it [03:11,  2.75it/s]

tensor(5.8335, device='cuda:0', grad_fn=<NllLossBackward0>)


520it [03:11,  2.74it/s]

tensor(4.3650, device='cuda:0', grad_fn=<NllLossBackward0>)


521it [03:11,  2.74it/s]

tensor(5.5331, device='cuda:0', grad_fn=<NllLossBackward0>)


522it [03:12,  2.75it/s]

tensor(6.3143, device='cuda:0', grad_fn=<NllLossBackward0>)


523it [03:12,  2.74it/s]

tensor(0.1644, device='cuda:0', grad_fn=<NllLossBackward0>)


524it [03:12,  2.74it/s]

tensor(4.4020, device='cuda:0', grad_fn=<NllLossBackward0>)


525it [03:13,  2.74it/s]

tensor(6.4360, device='cuda:0', grad_fn=<NllLossBackward0>)


526it [03:13,  2.75it/s]

tensor(5.1251, device='cuda:0', grad_fn=<NllLossBackward0>)


527it [03:14,  2.74it/s]

tensor(4.6192, device='cuda:0', grad_fn=<NllLossBackward0>)


528it [03:14,  2.73it/s]

tensor(3.4474, device='cuda:0', grad_fn=<NllLossBackward0>)


529it [03:14,  2.73it/s]

tensor(3.1344, device='cuda:0', grad_fn=<NllLossBackward0>)


530it [03:15,  2.73it/s]

tensor(5.5312, device='cuda:0', grad_fn=<NllLossBackward0>)


531it [03:15,  2.73it/s]

tensor(6.8439, device='cuda:0', grad_fn=<NllLossBackward0>)


532it [03:15,  2.72it/s]

tensor(5.5344, device='cuda:0', grad_fn=<NllLossBackward0>)


533it [03:16,  2.72it/s]

tensor(4.6644, device='cuda:0', grad_fn=<NllLossBackward0>)


534it [03:16,  2.73it/s]

tensor(4.0922, device='cuda:0', grad_fn=<NllLossBackward0>)


535it [03:16,  2.73it/s]

tensor(7.2732, device='cuda:0', grad_fn=<NllLossBackward0>)


536it [03:17,  2.73it/s]

tensor(0.9533, device='cuda:0', grad_fn=<NllLossBackward0>)


537it [03:17,  2.73it/s]

tensor(1.0128, device='cuda:0', grad_fn=<NllLossBackward0>)


538it [03:18,  2.73it/s]

tensor(0.7711, device='cuda:0', grad_fn=<NllLossBackward0>)


539it [03:18,  2.73it/s]

tensor(3.2718, device='cuda:0', grad_fn=<NllLossBackward0>)


540it [03:18,  2.74it/s]

tensor(2.6626, device='cuda:0', grad_fn=<NllLossBackward0>)


541it [03:19,  2.74it/s]

tensor(0.7530, device='cuda:0', grad_fn=<NllLossBackward0>)


542it [03:19,  2.73it/s]

tensor(1.5785, device='cuda:0', grad_fn=<NllLossBackward0>)


543it [03:19,  2.73it/s]

tensor(0.3742, device='cuda:0', grad_fn=<NllLossBackward0>)


544it [03:20,  2.73it/s]

tensor(6.0452, device='cuda:0', grad_fn=<NllLossBackward0>)


545it [03:20,  2.73it/s]

tensor(6.1932, device='cuda:0', grad_fn=<NllLossBackward0>)


546it [03:21,  2.72it/s]

tensor(5.5430, device='cuda:0', grad_fn=<NllLossBackward0>)


547it [03:21,  2.72it/s]

tensor(9.3519, device='cuda:0', grad_fn=<NllLossBackward0>)


548it [03:21,  2.73it/s]

tensor(3.4653, device='cuda:0', grad_fn=<NllLossBackward0>)


549it [03:22,  2.73it/s]

tensor(2.7332, device='cuda:0', grad_fn=<NllLossBackward0>)


550it [03:22,  2.73it/s]

tensor(3.0789, device='cuda:0', grad_fn=<NllLossBackward0>)


551it [03:22,  2.72it/s]

tensor(12.4951, device='cuda:0', grad_fn=<NllLossBackward0>)


552it [03:23,  2.71it/s]

tensor(6.4081, device='cuda:0', grad_fn=<NllLossBackward0>)


553it [03:23,  2.72it/s]

tensor(14.1779, device='cuda:0', grad_fn=<NllLossBackward0>)


554it [03:23,  2.72it/s]

tensor(6.1109, device='cuda:0', grad_fn=<NllLossBackward0>)


555it [03:24,  2.72it/s]

tensor(4.7187, device='cuda:0', grad_fn=<NllLossBackward0>)


556it [03:24,  2.72it/s]

tensor(9.6893, device='cuda:0', grad_fn=<NllLossBackward0>)


557it [03:25,  2.73it/s]

tensor(6.9057, device='cuda:0', grad_fn=<NllLossBackward0>)


558it [03:25,  2.73it/s]

tensor(5.4956, device='cuda:0', grad_fn=<NllLossBackward0>)


559it [03:25,  2.73it/s]

tensor(6.5407, device='cuda:0', grad_fn=<NllLossBackward0>)


560it [03:26,  2.73it/s]

tensor(3.4149, device='cuda:0', grad_fn=<NllLossBackward0>)


561it [03:26,  2.73it/s]

tensor(8.3216, device='cuda:0', grad_fn=<NllLossBackward0>)


562it [03:26,  2.72it/s]

tensor(6.1765, device='cuda:0', grad_fn=<NllLossBackward0>)


563it [03:27,  2.72it/s]

tensor(5.6458, device='cuda:0', grad_fn=<NllLossBackward0>)


564it [03:27,  2.72it/s]

tensor(3.9641, device='cuda:0', grad_fn=<NllLossBackward0>)


565it [03:27,  2.72it/s]

tensor(0.6993, device='cuda:0', grad_fn=<NllLossBackward0>)


566it [03:28,  2.73it/s]

tensor(4.0193, device='cuda:0', grad_fn=<NllLossBackward0>)


567it [03:28,  2.71it/s]

tensor(6.1435, device='cuda:0', grad_fn=<NllLossBackward0>)


568it [03:29,  2.72it/s]

tensor(5.9059, device='cuda:0', grad_fn=<NllLossBackward0>)


569it [03:29,  2.73it/s]

tensor(6.9560, device='cuda:0', grad_fn=<NllLossBackward0>)


570it [03:29,  2.73it/s]

tensor(6.3432, device='cuda:0', grad_fn=<NllLossBackward0>)


571it [03:30,  2.73it/s]

tensor(5.7385, device='cuda:0', grad_fn=<NllLossBackward0>)


572it [03:30,  2.72it/s]

tensor(6.6096, device='cuda:0', grad_fn=<NllLossBackward0>)


573it [03:30,  2.72it/s]

tensor(6.5439, device='cuda:0', grad_fn=<NllLossBackward0>)


574it [03:31,  2.72it/s]

tensor(4.3845, device='cuda:0', grad_fn=<NllLossBackward0>)


575it [03:31,  2.72it/s]

tensor(6.8734, device='cuda:0', grad_fn=<NllLossBackward0>)


576it [03:32,  2.71it/s]

tensor(8.0059, device='cuda:0', grad_fn=<NllLossBackward0>)


577it [03:32,  2.71it/s]

tensor(3.8321, device='cuda:0', grad_fn=<NllLossBackward0>)


578it [03:32,  2.72it/s]

tensor(0.2414, device='cuda:0', grad_fn=<NllLossBackward0>)


579it [03:33,  2.72it/s]

tensor(5.1817, device='cuda:0', grad_fn=<NllLossBackward0>)


580it [03:33,  2.71it/s]

tensor(11.4141, device='cuda:0', grad_fn=<NllLossBackward0>)


581it [03:33,  2.71it/s]

tensor(4.4880, device='cuda:0', grad_fn=<NllLossBackward0>)


582it [03:34,  2.71it/s]

tensor(4.8095, device='cuda:0', grad_fn=<NllLossBackward0>)


583it [03:34,  2.72it/s]

tensor(2.4930, device='cuda:0', grad_fn=<NllLossBackward0>)


584it [03:34,  2.72it/s]

tensor(6.2185, device='cuda:0', grad_fn=<NllLossBackward0>)


585it [03:35,  2.72it/s]

tensor(9.5294, device='cuda:0', grad_fn=<NllLossBackward0>)


586it [03:35,  2.72it/s]

tensor(8.7798, device='cuda:0', grad_fn=<NllLossBackward0>)


587it [03:36,  2.72it/s]

tensor(3.1667, device='cuda:0', grad_fn=<NllLossBackward0>)


588it [03:36,  2.72it/s]

tensor(2.3174, device='cuda:0', grad_fn=<NllLossBackward0>)


589it [03:36,  2.71it/s]

tensor(4.4343, device='cuda:0', grad_fn=<NllLossBackward0>)


590it [03:37,  2.72it/s]

tensor(2.2954, device='cuda:0', grad_fn=<NllLossBackward0>)


591it [03:37,  2.72it/s]

tensor(13.9043, device='cuda:0', grad_fn=<NllLossBackward0>)


592it [03:37,  2.72it/s]

tensor(0.9275, device='cuda:0', grad_fn=<NllLossBackward0>)


593it [03:38,  2.72it/s]

tensor(7.3079, device='cuda:0', grad_fn=<NllLossBackward0>)


594it [03:38,  2.72it/s]

tensor(3.8876, device='cuda:0', grad_fn=<NllLossBackward0>)


595it [03:39,  2.72it/s]

tensor(4.3151, device='cuda:0', grad_fn=<NllLossBackward0>)


596it [03:39,  2.72it/s]

tensor(4.6864, device='cuda:0', grad_fn=<NllLossBackward0>)


597it [03:39,  2.72it/s]

tensor(1.0281, device='cuda:0', grad_fn=<NllLossBackward0>)


598it [03:40,  2.71it/s]

tensor(4.0919, device='cuda:0', grad_fn=<NllLossBackward0>)


599it [03:40,  2.71it/s]

tensor(5.3943, device='cuda:0', grad_fn=<NllLossBackward0>)


600it [03:40,  2.71it/s]

tensor(1.5883, device='cuda:0', grad_fn=<NllLossBackward0>)


601it [03:41,  2.72it/s]

tensor(3.6126, device='cuda:0', grad_fn=<NllLossBackward0>)


602it [03:41,  2.72it/s]

tensor(5.5559, device='cuda:0', grad_fn=<NllLossBackward0>)


603it [03:41,  2.73it/s]

tensor(5.2382, device='cuda:0', grad_fn=<NllLossBackward0>)


604it [03:42,  2.73it/s]

tensor(3.3402, device='cuda:0', grad_fn=<NllLossBackward0>)


605it [03:42,  2.73it/s]

tensor(7.0843, device='cuda:0', grad_fn=<NllLossBackward0>)


606it [03:43,  2.73it/s]

tensor(6.3322, device='cuda:0', grad_fn=<NllLossBackward0>)


607it [03:43,  2.73it/s]

tensor(5.1773, device='cuda:0', grad_fn=<NllLossBackward0>)


608it [03:43,  2.73it/s]

tensor(5.0724, device='cuda:0', grad_fn=<NllLossBackward0>)


609it [03:44,  2.73it/s]

tensor(1.5633, device='cuda:0', grad_fn=<NllLossBackward0>)


610it [03:44,  2.72it/s]

tensor(3.6147, device='cuda:0', grad_fn=<NllLossBackward0>)


611it [03:44,  2.71it/s]

tensor(4.2275, device='cuda:0', grad_fn=<NllLossBackward0>)


612it [03:45,  2.72it/s]

tensor(1.4583, device='cuda:0', grad_fn=<NllLossBackward0>)


613it [03:45,  2.72it/s]

tensor(0.6890, device='cuda:0', grad_fn=<NllLossBackward0>)


614it [03:46,  2.72it/s]

tensor(3.0599, device='cuda:0', grad_fn=<NllLossBackward0>)


615it [03:46,  2.72it/s]

tensor(7.9693, device='cuda:0', grad_fn=<NllLossBackward0>)


616it [03:46,  2.73it/s]

tensor(2.8756, device='cuda:0', grad_fn=<NllLossBackward0>)


617it [03:47,  2.73it/s]

tensor(7.7522, device='cuda:0', grad_fn=<NllLossBackward0>)


618it [03:47,  2.73it/s]

tensor(2.2723, device='cuda:0', grad_fn=<NllLossBackward0>)


619it [03:47,  2.73it/s]

tensor(9.2771, device='cuda:0', grad_fn=<NllLossBackward0>)


620it [03:48,  2.72it/s]

tensor(16.4872, device='cuda:0', grad_fn=<NllLossBackward0>)


621it [03:48,  2.72it/s]

tensor(1.1781, device='cuda:0', grad_fn=<NllLossBackward0>)


622it [03:48,  2.72it/s]

tensor(4.1514, device='cuda:0', grad_fn=<NllLossBackward0>)


623it [03:49,  2.72it/s]

tensor(3.0258, device='cuda:0', grad_fn=<NllLossBackward0>)


624it [03:49,  2.73it/s]

tensor(4.2127, device='cuda:0', grad_fn=<NllLossBackward0>)


625it [03:50,  2.73it/s]

tensor(5.4934, device='cuda:0', grad_fn=<NllLossBackward0>)


626it [03:50,  2.73it/s]

tensor(2.0349, device='cuda:0', grad_fn=<NllLossBackward0>)


627it [03:50,  2.72it/s]

tensor(4.8552, device='cuda:0', grad_fn=<NllLossBackward0>)


628it [03:51,  2.72it/s]

tensor(9.8307, device='cuda:0', grad_fn=<NllLossBackward0>)


629it [03:51,  2.71it/s]

tensor(4.0609, device='cuda:0', grad_fn=<NllLossBackward0>)


630it [03:51,  2.72it/s]

tensor(0.8785, device='cuda:0', grad_fn=<NllLossBackward0>)


631it [03:52,  2.71it/s]

tensor(3.5708, device='cuda:0', grad_fn=<NllLossBackward0>)


632it [03:52,  2.71it/s]

tensor(8.2354, device='cuda:0', grad_fn=<NllLossBackward0>)


633it [03:52,  2.71it/s]

tensor(5.3686, device='cuda:0', grad_fn=<NllLossBackward0>)


634it [03:53,  2.72it/s]

tensor(7.4635, device='cuda:0', grad_fn=<NllLossBackward0>)


635it [03:53,  2.72it/s]

tensor(12.6647, device='cuda:0', grad_fn=<NllLossBackward0>)


636it [03:54,  2.72it/s]

tensor(4.0862, device='cuda:0', grad_fn=<NllLossBackward0>)


637it [03:54,  2.73it/s]

tensor(4.3852, device='cuda:0', grad_fn=<NllLossBackward0>)


638it [03:54,  2.73it/s]

tensor(0.3818, device='cuda:0', grad_fn=<NllLossBackward0>)


639it [03:55,  2.73it/s]

tensor(6.5496, device='cuda:0', grad_fn=<NllLossBackward0>)


640it [03:55,  2.73it/s]

tensor(1.6445, device='cuda:0', grad_fn=<NllLossBackward0>)


641it [03:55,  2.73it/s]

tensor(5.8078, device='cuda:0', grad_fn=<NllLossBackward0>)


642it [03:56,  2.72it/s]

tensor(8.2741, device='cuda:0', grad_fn=<NllLossBackward0>)


643it [03:56,  2.72it/s]

tensor(4.9543, device='cuda:0', grad_fn=<NllLossBackward0>)


644it [03:57,  2.43it/s]

tensor(4.7872, device='cuda:0', grad_fn=<NllLossBackward0>)


645it [03:57,  2.51it/s]

tensor(1.2087, device='cuda:0', grad_fn=<NllLossBackward0>)


646it [03:57,  2.58it/s]

tensor(5.1032, device='cuda:0', grad_fn=<NllLossBackward0>)


647it [03:58,  2.61it/s]

tensor(1.2534, device='cuda:0', grad_fn=<NllLossBackward0>)


648it [03:58,  2.64it/s]

tensor(4.8881, device='cuda:0', grad_fn=<NllLossBackward0>)


649it [03:59,  2.66it/s]

tensor(6.3342, device='cuda:0', grad_fn=<NllLossBackward0>)


650it [03:59,  2.67it/s]

tensor(3.4582, device='cuda:0', grad_fn=<NllLossBackward0>)


651it [03:59,  2.69it/s]

tensor(7.1356, device='cuda:0', grad_fn=<NllLossBackward0>)


652it [04:00,  2.70it/s]

tensor(5.4306, device='cuda:0', grad_fn=<NllLossBackward0>)


653it [04:00,  2.71it/s]

tensor(0.3667, device='cuda:0', grad_fn=<NllLossBackward0>)


654it [04:00,  2.72it/s]

tensor(2.3494, device='cuda:0', grad_fn=<NllLossBackward0>)


655it [04:01,  2.72it/s]

tensor(8.7765, device='cuda:0', grad_fn=<NllLossBackward0>)


656it [04:01,  2.72it/s]

tensor(0.3720, device='cuda:0', grad_fn=<NllLossBackward0>)


657it [04:01,  2.72it/s]

tensor(4.2292, device='cuda:0', grad_fn=<NllLossBackward0>)


658it [04:02,  2.72it/s]

tensor(3.3194, device='cuda:0', grad_fn=<NllLossBackward0>)


659it [04:02,  2.72it/s]

tensor(4.5660, device='cuda:0', grad_fn=<NllLossBackward0>)


660it [04:03,  2.73it/s]

tensor(4.7026, device='cuda:0', grad_fn=<NllLossBackward0>)


661it [04:03,  2.72it/s]

tensor(5.0181, device='cuda:0', grad_fn=<NllLossBackward0>)


662it [04:03,  2.72it/s]

tensor(11.3818, device='cuda:0', grad_fn=<NllLossBackward0>)


663it [04:04,  2.72it/s]

tensor(3.0506, device='cuda:0', grad_fn=<NllLossBackward0>)


664it [04:04,  2.73it/s]

tensor(3.0268, device='cuda:0', grad_fn=<NllLossBackward0>)


665it [04:04,  2.72it/s]

tensor(11.3405, device='cuda:0', grad_fn=<NllLossBackward0>)


666it [04:05,  2.72it/s]

tensor(6.6865, device='cuda:0', grad_fn=<NllLossBackward0>)


667it [04:05,  2.72it/s]

tensor(0.5903, device='cuda:0', grad_fn=<NllLossBackward0>)


668it [04:05,  2.72it/s]

tensor(6.1841, device='cuda:0', grad_fn=<NllLossBackward0>)


669it [04:06,  2.73it/s]

tensor(3.9968, device='cuda:0', grad_fn=<NllLossBackward0>)


670it [04:06,  2.73it/s]

tensor(4.5857, device='cuda:0', grad_fn=<NllLossBackward0>)


671it [04:07,  2.73it/s]

tensor(19.9633, device='cuda:0', grad_fn=<NllLossBackward0>)


672it [04:07,  2.73it/s]

tensor(5.6320, device='cuda:0', grad_fn=<NllLossBackward0>)


673it [04:07,  2.73it/s]

tensor(1.2254, device='cuda:0', grad_fn=<NllLossBackward0>)


674it [04:08,  2.73it/s]

tensor(0.5516, device='cuda:0', grad_fn=<NllLossBackward0>)


675it [04:08,  2.73it/s]

tensor(5.1860, device='cuda:0', grad_fn=<NllLossBackward0>)


676it [04:08,  2.73it/s]

tensor(5.1965, device='cuda:0', grad_fn=<NllLossBackward0>)


677it [04:09,  2.73it/s]

tensor(6.6807, device='cuda:0', grad_fn=<NllLossBackward0>)


678it [04:09,  2.73it/s]

tensor(7.8612, device='cuda:0', grad_fn=<NllLossBackward0>)


679it [04:10,  2.73it/s]

tensor(3.9033, device='cuda:0', grad_fn=<NllLossBackward0>)


680it [04:10,  2.73it/s]

tensor(5.2895, device='cuda:0', grad_fn=<NllLossBackward0>)


681it [04:10,  2.73it/s]

tensor(3.7917, device='cuda:0', grad_fn=<NllLossBackward0>)


682it [04:11,  2.72it/s]

tensor(3.0909, device='cuda:0', grad_fn=<NllLossBackward0>)


683it [04:11,  2.72it/s]

tensor(4.1281, device='cuda:0', grad_fn=<NllLossBackward0>)


684it [04:11,  2.72it/s]

tensor(5.2616, device='cuda:0', grad_fn=<NllLossBackward0>)


685it [04:12,  2.73it/s]

tensor(7.9532, device='cuda:0', grad_fn=<NllLossBackward0>)


686it [04:12,  2.74it/s]

tensor(6.5588, device='cuda:0', grad_fn=<NllLossBackward0>)


687it [04:12,  2.73it/s]

tensor(5.2393, device='cuda:0', grad_fn=<NllLossBackward0>)


688it [04:13,  2.70it/s]

tensor(12.5234, device='cuda:0', grad_fn=<NllLossBackward0>)


689it [04:13,  2.71it/s]

tensor(1.8070, device='cuda:0', grad_fn=<NllLossBackward0>)


690it [04:14,  2.72it/s]

tensor(5.2203, device='cuda:0', grad_fn=<NllLossBackward0>)


691it [04:14,  2.72it/s]

tensor(3.6992, device='cuda:0', grad_fn=<NllLossBackward0>)


692it [04:14,  2.73it/s]

tensor(4.4498, device='cuda:0', grad_fn=<NllLossBackward0>)


693it [04:15,  2.73it/s]

tensor(2.1201, device='cuda:0', grad_fn=<NllLossBackward0>)


694it [04:15,  2.74it/s]

tensor(4.9445, device='cuda:0', grad_fn=<NllLossBackward0>)


695it [04:15,  2.73it/s]

tensor(3.9101, device='cuda:0', grad_fn=<NllLossBackward0>)


696it [04:16,  2.72it/s]

tensor(9.3907, device='cuda:0', grad_fn=<NllLossBackward0>)


697it [04:16,  2.71it/s]

tensor(0.5870, device='cuda:0', grad_fn=<NllLossBackward0>)


698it [04:16,  2.72it/s]

tensor(6.3822, device='cuda:0', grad_fn=<NllLossBackward0>)


699it [04:17,  2.73it/s]

tensor(4.1335, device='cuda:0', grad_fn=<NllLossBackward0>)


700it [04:17,  2.74it/s]

tensor(0.5473, device='cuda:0', grad_fn=<NllLossBackward0>)


701it [04:18,  2.74it/s]

tensor(1.5089, device='cuda:0', grad_fn=<NllLossBackward0>)


702it [04:18,  2.74it/s]

tensor(7.0986, device='cuda:0', grad_fn=<NllLossBackward0>)


703it [04:18,  2.74it/s]

tensor(5.1920, device='cuda:0', grad_fn=<NllLossBackward0>)


704it [04:19,  2.75it/s]

tensor(5.0970, device='cuda:0', grad_fn=<NllLossBackward0>)


705it [04:19,  2.75it/s]

tensor(4.8758, device='cuda:0', grad_fn=<NllLossBackward0>)


706it [04:19,  2.75it/s]

tensor(1.9787, device='cuda:0', grad_fn=<NllLossBackward0>)


707it [04:20,  2.75it/s]

tensor(13.2288, device='cuda:0', grad_fn=<NllLossBackward0>)


708it [04:20,  2.74it/s]

tensor(5.7687, device='cuda:0', grad_fn=<NllLossBackward0>)


709it [04:21,  2.74it/s]

tensor(5.3443, device='cuda:0', grad_fn=<NllLossBackward0>)


710it [04:21,  2.74it/s]

tensor(3.9594, device='cuda:0', grad_fn=<NllLossBackward0>)


711it [04:21,  2.75it/s]

tensor(5.8337, device='cuda:0', grad_fn=<NllLossBackward0>)


712it [04:22,  2.75it/s]

tensor(3.7927, device='cuda:0', grad_fn=<NllLossBackward0>)


713it [04:22,  2.75it/s]

tensor(1.1953, device='cuda:0', grad_fn=<NllLossBackward0>)


714it [04:22,  2.75it/s]

tensor(0.8746, device='cuda:0', grad_fn=<NllLossBackward0>)


715it [04:23,  2.75it/s]

tensor(0.2286, device='cuda:0', grad_fn=<NllLossBackward0>)


716it [04:23,  2.75it/s]

tensor(0.1962, device='cuda:0', grad_fn=<NllLossBackward0>)


717it [04:23,  2.75it/s]

tensor(6.3758, device='cuda:0', grad_fn=<NllLossBackward0>)


718it [04:24,  2.75it/s]

tensor(3.2921, device='cuda:0', grad_fn=<NllLossBackward0>)


719it [04:24,  2.75it/s]

tensor(5.6441, device='cuda:0', grad_fn=<NllLossBackward0>)


720it [04:25,  2.75it/s]

tensor(5.6326, device='cuda:0', grad_fn=<NllLossBackward0>)


721it [04:25,  2.74it/s]

tensor(4.2532, device='cuda:0', grad_fn=<NllLossBackward0>)


722it [04:25,  2.75it/s]

tensor(5.1520, device='cuda:0', grad_fn=<NllLossBackward0>)


723it [04:26,  2.75it/s]

tensor(1.1045, device='cuda:0', grad_fn=<NllLossBackward0>)


724it [04:26,  2.75it/s]

tensor(6.3940, device='cuda:0', grad_fn=<NllLossBackward0>)


725it [04:26,  2.76it/s]

tensor(3.0329, device='cuda:0', grad_fn=<NllLossBackward0>)


726it [04:27,  2.75it/s]

tensor(0.7382, device='cuda:0', grad_fn=<NllLossBackward0>)


727it [04:27,  2.75it/s]

tensor(0.9193, device='cuda:0', grad_fn=<NllLossBackward0>)


728it [04:27,  2.75it/s]

tensor(7.8914, device='cuda:0', grad_fn=<NllLossBackward0>)


729it [04:28,  2.71it/s]

tensor(5.1141, device='cuda:0', grad_fn=<NllLossBackward0>)


730it [04:28,  2.73it/s]

tensor(10.1130, device='cuda:0', grad_fn=<NllLossBackward0>)


731it [04:29,  2.73it/s]

tensor(16.6205, device='cuda:0', grad_fn=<NllLossBackward0>)


732it [04:29,  2.73it/s]

tensor(14.5419, device='cuda:0', grad_fn=<NllLossBackward0>)


733it [04:29,  2.73it/s]

tensor(27.9946, device='cuda:0', grad_fn=<NllLossBackward0>)


734it [04:30,  2.74it/s]

tensor(41.1072, device='cuda:0', grad_fn=<NllLossBackward0>)


735it [04:30,  2.75it/s]

tensor(44.8850, device='cuda:0', grad_fn=<NllLossBackward0>)


736it [04:30,  2.75it/s]

tensor(52.4245, device='cuda:0', grad_fn=<NllLossBackward0>)


737it [04:31,  2.74it/s]

tensor(55.3922, device='cuda:0', grad_fn=<NllLossBackward0>)


738it [04:31,  2.75it/s]

tensor(42.9030, device='cuda:0', grad_fn=<NllLossBackward0>)


739it [04:31,  2.75it/s]

tensor(40.3216, device='cuda:0', grad_fn=<NllLossBackward0>)


740it [04:32,  2.75it/s]

tensor(31.0743, device='cuda:0', grad_fn=<NllLossBackward0>)


741it [04:32,  2.74it/s]

tensor(57.9232, device='cuda:0', grad_fn=<NllLossBackward0>)


742it [04:33,  2.74it/s]

tensor(27.1438, device='cuda:0', grad_fn=<NllLossBackward0>)


743it [04:33,  2.75it/s]

tensor(15.5991, device='cuda:0', grad_fn=<NllLossBackward0>)


744it [04:33,  2.75it/s]

tensor(10.7573, device='cuda:0', grad_fn=<NllLossBackward0>)


745it [04:34,  2.75it/s]

tensor(19.1902, device='cuda:0', grad_fn=<NllLossBackward0>)


746it [04:34,  2.75it/s]

tensor(15.0198, device='cuda:0', grad_fn=<NllLossBackward0>)


747it [04:34,  2.74it/s]

tensor(8.9599, device='cuda:0', grad_fn=<NllLossBackward0>)


748it [04:35,  2.75it/s]

tensor(8.3941, device='cuda:0', grad_fn=<NllLossBackward0>)


749it [04:35,  2.75it/s]

tensor(19.0215, device='cuda:0', grad_fn=<NllLossBackward0>)


750it [04:35,  2.75it/s]

tensor(6.0517, device='cuda:0', grad_fn=<NllLossBackward0>)


751it [04:36,  2.75it/s]

tensor(9.5031, device='cuda:0', grad_fn=<NllLossBackward0>)


752it [04:36,  2.74it/s]

tensor(15.6506, device='cuda:0', grad_fn=<NllLossBackward0>)


753it [04:37,  2.75it/s]

tensor(10.9817, device='cuda:0', grad_fn=<NllLossBackward0>)


754it [04:37,  2.75it/s]

tensor(8.4357, device='cuda:0', grad_fn=<NllLossBackward0>)


755it [04:37,  2.75it/s]

tensor(3.3771, device='cuda:0', grad_fn=<NllLossBackward0>)


756it [04:38,  2.69it/s]

tensor(8.5017, device='cuda:0', grad_fn=<NllLossBackward0>)


757it [04:38,  2.70it/s]

tensor(17.0125, device='cuda:0', grad_fn=<NllLossBackward0>)


758it [04:38,  2.72it/s]

tensor(13.4948, device='cuda:0', grad_fn=<NllLossBackward0>)


759it [04:39,  2.72it/s]

tensor(9.9998, device='cuda:0', grad_fn=<NllLossBackward0>)


760it [04:39,  2.73it/s]

tensor(3.9451, device='cuda:0', grad_fn=<NllLossBackward0>)


761it [04:39,  2.74it/s]

tensor(39.9291, device='cuda:0', grad_fn=<NllLossBackward0>)


762it [04:40,  2.75it/s]

tensor(10.7042, device='cuda:0', grad_fn=<NllLossBackward0>)


763it [04:40,  2.74it/s]

tensor(7.9385, device='cuda:0', grad_fn=<NllLossBackward0>)


764it [04:41,  2.75it/s]

tensor(4.8269, device='cuda:0', grad_fn=<NllLossBackward0>)


765it [04:41,  2.76it/s]

tensor(13.3610, device='cuda:0', grad_fn=<NllLossBackward0>)


766it [04:41,  2.76it/s]

tensor(29.2606, device='cuda:0', grad_fn=<NllLossBackward0>)


767it [04:42,  2.77it/s]

tensor(4.3734, device='cuda:0', grad_fn=<NllLossBackward0>)


768it [04:42,  2.76it/s]

tensor(2.8058, device='cuda:0', grad_fn=<NllLossBackward0>)


769it [04:42,  2.76it/s]

tensor(4.3900, device='cuda:0', grad_fn=<NllLossBackward0>)


770it [04:43,  2.72it/s]

tensor(12.9145, device='cuda:0', grad_fn=<NllLossBackward0>)


771it [04:43,  2.73it/s]

tensor(29.6477, device='cuda:0', grad_fn=<NllLossBackward0>)


772it [04:43,  2.73it/s]

tensor(11.9123, device='cuda:0', grad_fn=<NllLossBackward0>)


773it [04:44,  2.73it/s]

tensor(2.2046, device='cuda:0', grad_fn=<NllLossBackward0>)


774it [04:44,  2.74it/s]

tensor(5.2410, device='cuda:0', grad_fn=<NllLossBackward0>)


775it [04:45,  2.76it/s]

tensor(27.4068, device='cuda:0', grad_fn=<NllLossBackward0>)


776it [04:45,  2.76it/s]

tensor(7.5721, device='cuda:0', grad_fn=<NllLossBackward0>)


777it [04:45,  2.76it/s]

tensor(369.4293, device='cuda:0', grad_fn=<NllLossBackward0>)


778it [04:46,  2.76it/s]

tensor(5.3430, device='cuda:0', grad_fn=<NllLossBackward0>)


779it [04:46,  2.76it/s]

tensor(5.0643, device='cuda:0', grad_fn=<NllLossBackward0>)


780it [04:46,  2.76it/s]

tensor(3.8913, device='cuda:0', grad_fn=<NllLossBackward0>)


781it [04:47,  2.76it/s]

tensor(5.1841, device='cuda:0', grad_fn=<NllLossBackward0>)


782it [04:47,  2.77it/s]

tensor(33.1019, device='cuda:0', grad_fn=<NllLossBackward0>)


783it [04:47,  2.76it/s]

tensor(11.9278, device='cuda:0', grad_fn=<NllLossBackward0>)


784it [04:48,  2.75it/s]

tensor(16.2396, device='cuda:0', grad_fn=<NllLossBackward0>)


785it [04:48,  2.75it/s]

tensor(4.2880, device='cuda:0', grad_fn=<NllLossBackward0>)


786it [04:49,  2.74it/s]

tensor(9.6299, device='cuda:0', grad_fn=<NllLossBackward0>)


787it [04:49,  2.74it/s]

tensor(2.1623, device='cuda:0', grad_fn=<NllLossBackward0>)


788it [04:49,  2.75it/s]

tensor(8.8783, device='cuda:0', grad_fn=<NllLossBackward0>)


789it [04:50,  2.76it/s]

tensor(5.2316, device='cuda:0', grad_fn=<NllLossBackward0>)


790it [04:50,  2.76it/s]

tensor(24.4119, device='cuda:0', grad_fn=<NllLossBackward0>)


791it [04:50,  2.76it/s]

tensor(2.7840, device='cuda:0', grad_fn=<NllLossBackward0>)


792it [04:51,  2.77it/s]

tensor(4.3124, device='cuda:0', grad_fn=<NllLossBackward0>)


793it [04:51,  2.76it/s]

tensor(5.2698, device='cuda:0', grad_fn=<NllLossBackward0>)


794it [04:51,  2.76it/s]

tensor(2.3022, device='cuda:0', grad_fn=<NllLossBackward0>)


795it [04:52,  2.75it/s]

tensor(6.0341, device='cuda:0', grad_fn=<NllLossBackward0>)


796it [04:52,  2.75it/s]

tensor(13.0990, device='cuda:0', grad_fn=<NllLossBackward0>)


797it [04:53,  2.74it/s]

tensor(7.6591, device='cuda:0', grad_fn=<NllLossBackward0>)


798it [04:53,  2.74it/s]

tensor(7.7029, device='cuda:0', grad_fn=<NllLossBackward0>)


799it [04:53,  2.75it/s]

tensor(3.1296, device='cuda:0', grad_fn=<NllLossBackward0>)


800it [04:54,  2.76it/s]

tensor(1.4435, device='cuda:0', grad_fn=<NllLossBackward0>)


801it [04:54,  2.76it/s]

tensor(7.3955, device='cuda:0', grad_fn=<NllLossBackward0>)


802it [04:54,  2.76it/s]

tensor(9.0472, device='cuda:0', grad_fn=<NllLossBackward0>)


803it [04:55,  2.76it/s]

tensor(8.7007, device='cuda:0', grad_fn=<NllLossBackward0>)


804it [04:55,  2.76it/s]

tensor(4.4849, device='cuda:0', grad_fn=<NllLossBackward0>)


805it [04:55,  2.75it/s]

tensor(13.2543, device='cuda:0', grad_fn=<NllLossBackward0>)


806it [04:56,  2.76it/s]

tensor(3.2788, device='cuda:0', grad_fn=<NllLossBackward0>)


807it [04:56,  2.76it/s]

tensor(9.5970, device='cuda:0', grad_fn=<NllLossBackward0>)


808it [04:57,  2.76it/s]

tensor(5.9003, device='cuda:0', grad_fn=<NllLossBackward0>)


809it [04:57,  2.77it/s]

tensor(8.9544, device='cuda:0', grad_fn=<NllLossBackward0>)


810it [04:57,  2.77it/s]

tensor(4.1123, device='cuda:0', grad_fn=<NllLossBackward0>)


811it [04:58,  2.77it/s]

tensor(1.4558, device='cuda:0', grad_fn=<NllLossBackward0>)


812it [04:58,  2.77it/s]

tensor(4.7978, device='cuda:0', grad_fn=<NllLossBackward0>)


813it [04:58,  2.77it/s]

tensor(18.0516, device='cuda:0', grad_fn=<NllLossBackward0>)


814it [04:59,  2.77it/s]

tensor(3.5517, device='cuda:0', grad_fn=<NllLossBackward0>)


815it [04:59,  2.77it/s]

tensor(4.9313, device='cuda:0', grad_fn=<NllLossBackward0>)


816it [04:59,  2.76it/s]

tensor(10.0612, device='cuda:0', grad_fn=<NllLossBackward0>)


817it [05:00,  2.76it/s]

tensor(1.2976, device='cuda:0', grad_fn=<NllLossBackward0>)


818it [05:00,  2.76it/s]

tensor(5.6631, device='cuda:0', grad_fn=<NllLossBackward0>)


819it [05:01,  2.77it/s]

tensor(7.1513, device='cuda:0', grad_fn=<NllLossBackward0>)


820it [05:01,  2.77it/s]

tensor(4.4398, device='cuda:0', grad_fn=<NllLossBackward0>)


821it [05:01,  2.76it/s]

tensor(6.1947, device='cuda:0', grad_fn=<NllLossBackward0>)


822it [05:02,  2.76it/s]

tensor(12.3828, device='cuda:0', grad_fn=<NllLossBackward0>)


823it [05:02,  2.77it/s]

tensor(5.5188, device='cuda:0', grad_fn=<NllLossBackward0>)


824it [05:02,  2.77it/s]

tensor(4.0686, device='cuda:0', grad_fn=<NllLossBackward0>)


825it [05:03,  2.77it/s]

tensor(7.4241, device='cuda:0', grad_fn=<NllLossBackward0>)


826it [05:03,  2.76it/s]

tensor(4.4050, device='cuda:0', grad_fn=<NllLossBackward0>)


827it [05:03,  2.77it/s]

tensor(8.4857, device='cuda:0', grad_fn=<NllLossBackward0>)


828it [05:04,  2.77it/s]

tensor(6.2698, device='cuda:0', grad_fn=<NllLossBackward0>)


829it [05:04,  2.76it/s]

tensor(5.2225, device='cuda:0', grad_fn=<NllLossBackward0>)


830it [05:04,  2.77it/s]

tensor(6.8679, device='cuda:0', grad_fn=<NllLossBackward0>)


831it [05:05,  2.77it/s]

tensor(6.2024, device='cuda:0', grad_fn=<NllLossBackward0>)


832it [05:05,  2.77it/s]

tensor(25.1987, device='cuda:0', grad_fn=<NllLossBackward0>)


833it [05:06,  2.76it/s]

tensor(4.3742, device='cuda:0', grad_fn=<NllLossBackward0>)


834it [05:06,  2.77it/s]

tensor(2.8206, device='cuda:0', grad_fn=<NllLossBackward0>)


835it [05:06,  2.76it/s]

tensor(9.7479, device='cuda:0', grad_fn=<NllLossBackward0>)


836it [05:07,  2.76it/s]

tensor(4.4969, device='cuda:0', grad_fn=<NllLossBackward0>)


837it [05:07,  2.75it/s]

tensor(4.5532, device='cuda:0', grad_fn=<NllLossBackward0>)


838it [05:07,  2.76it/s]

tensor(4.0122, device='cuda:0', grad_fn=<NllLossBackward0>)


839it [05:08,  2.76it/s]

tensor(17.9372, device='cuda:0', grad_fn=<NllLossBackward0>)


840it [05:08,  2.76it/s]

tensor(445.9809, device='cuda:0', grad_fn=<NllLossBackward0>)


841it [05:08,  2.76it/s]

tensor(5.6454, device='cuda:0', grad_fn=<NllLossBackward0>)


842it [05:09,  2.76it/s]

tensor(4.4759, device='cuda:0', grad_fn=<NllLossBackward0>)


843it [05:09,  2.77it/s]

tensor(2.4059, device='cuda:0', grad_fn=<NllLossBackward0>)


844it [05:10,  2.76it/s]

tensor(7.8780, device='cuda:0', grad_fn=<NllLossBackward0>)


845it [05:10,  2.76it/s]

tensor(5.1318, device='cuda:0', grad_fn=<NllLossBackward0>)


846it [05:10,  2.76it/s]

tensor(3.9711, device='cuda:0', grad_fn=<NllLossBackward0>)


847it [05:11,  2.76it/s]

tensor(418.0959, device='cuda:0', grad_fn=<NllLossBackward0>)


848it [05:11,  2.76it/s]

tensor(4.6441, device='cuda:0', grad_fn=<NllLossBackward0>)


849it [05:11,  2.76it/s]

tensor(143.2981, device='cuda:0', grad_fn=<NllLossBackward0>)


850it [05:12,  2.77it/s]

tensor(5.8534, device='cuda:0', grad_fn=<NllLossBackward0>)


851it [05:12,  2.75it/s]

tensor(4.8176, device='cuda:0', grad_fn=<NllLossBackward0>)


852it [05:12,  2.75it/s]

tensor(6.2108, device='cuda:0', grad_fn=<NllLossBackward0>)


853it [05:13,  2.76it/s]

tensor(9.1272, device='cuda:0', grad_fn=<NllLossBackward0>)


854it [05:13,  2.75it/s]

tensor(7.2053, device='cuda:0', grad_fn=<NllLossBackward0>)


855it [05:14,  2.75it/s]

tensor(9.6672, device='cuda:0', grad_fn=<NllLossBackward0>)


856it [05:14,  2.75it/s]

tensor(6.4513, device='cuda:0', grad_fn=<NllLossBackward0>)


857it [05:14,  2.75it/s]

tensor(386.2030, device='cuda:0', grad_fn=<NllLossBackward0>)


858it [05:15,  2.76it/s]

tensor(4.7098, device='cuda:0', grad_fn=<NllLossBackward0>)


859it [05:15,  2.77it/s]

tensor(9.5204, device='cuda:0', grad_fn=<NllLossBackward0>)


860it [05:15,  2.76it/s]

tensor(11.8997, device='cuda:0', grad_fn=<NllLossBackward0>)


861it [05:16,  2.77it/s]

tensor(4.0091, device='cuda:0', grad_fn=<NllLossBackward0>)


862it [05:16,  2.77it/s]

tensor(250.5265, device='cuda:0', grad_fn=<NllLossBackward0>)


863it [05:17,  2.45it/s]

tensor(3.5799, device='cuda:0', grad_fn=<NllLossBackward0>)


864it [05:17,  2.53it/s]

tensor(10.5223, device='cuda:0', grad_fn=<NllLossBackward0>)


865it [05:17,  2.59it/s]

tensor(24.0614, device='cuda:0', grad_fn=<NllLossBackward0>)


866it [05:18,  2.64it/s]

tensor(2.7558, device='cuda:0', grad_fn=<NllLossBackward0>)


867it [05:18,  2.67it/s]

tensor(163.6489, device='cuda:0', grad_fn=<NllLossBackward0>)


868it [05:18,  2.70it/s]

tensor(5.7638, device='cuda:0', grad_fn=<NllLossBackward0>)


869it [05:19,  2.72it/s]

tensor(6.9969, device='cuda:0', grad_fn=<NllLossBackward0>)


870it [05:19,  2.73it/s]

tensor(5.2759, device='cuda:0', grad_fn=<NllLossBackward0>)


871it [05:19,  2.74it/s]

tensor(7.8332, device='cuda:0', grad_fn=<NllLossBackward0>)


872it [05:20,  2.75it/s]

tensor(5.9674, device='cuda:0', grad_fn=<NllLossBackward0>)


873it [05:20,  2.75it/s]

tensor(6.8751, device='cuda:0', grad_fn=<NllLossBackward0>)


874it [05:21,  2.76it/s]

tensor(5.5615, device='cuda:0', grad_fn=<NllLossBackward0>)


875it [05:21,  2.76it/s]

tensor(14.2777, device='cuda:0', grad_fn=<NllLossBackward0>)


876it [05:21,  2.77it/s]

tensor(2.6653, device='cuda:0', grad_fn=<NllLossBackward0>)


877it [05:22,  2.77it/s]

tensor(69.3384, device='cuda:0', grad_fn=<NllLossBackward0>)


878it [05:22,  2.77it/s]

tensor(3.5556, device='cuda:0', grad_fn=<NllLossBackward0>)


879it [05:22,  2.77it/s]

tensor(5.6865, device='cuda:0', grad_fn=<NllLossBackward0>)


880it [05:23,  2.76it/s]

tensor(1.4444, device='cuda:0', grad_fn=<NllLossBackward0>)


881it [05:23,  2.76it/s]

tensor(2.6067, device='cuda:0', grad_fn=<NllLossBackward0>)


882it [05:23,  2.76it/s]

tensor(8.4590, device='cuda:0', grad_fn=<NllLossBackward0>)


883it [05:24,  2.76it/s]

tensor(2.7761, device='cuda:0', grad_fn=<NllLossBackward0>)


884it [05:24,  2.76it/s]

tensor(4.0702, device='cuda:0', grad_fn=<NllLossBackward0>)


885it [05:25,  2.76it/s]

tensor(7.2564, device='cuda:0', grad_fn=<NllLossBackward0>)


886it [05:25,  2.75it/s]

tensor(8.7068, device='cuda:0', grad_fn=<NllLossBackward0>)


887it [05:25,  2.76it/s]

tensor(8.9685, device='cuda:0', grad_fn=<NllLossBackward0>)


888it [05:26,  2.76it/s]

tensor(3.9231, device='cuda:0', grad_fn=<NllLossBackward0>)


889it [05:26,  2.75it/s]

tensor(14.3160, device='cuda:0', grad_fn=<NllLossBackward0>)


890it [05:26,  2.76it/s]

tensor(4.9654, device='cuda:0', grad_fn=<NllLossBackward0>)


891it [05:27,  2.76it/s]

tensor(21.5132, device='cuda:0', grad_fn=<NllLossBackward0>)


892it [05:27,  2.76it/s]

tensor(7.9226, device='cuda:0', grad_fn=<NllLossBackward0>)


893it [05:27,  2.76it/s]

tensor(4.5242, device='cuda:0', grad_fn=<NllLossBackward0>)


894it [05:28,  2.75it/s]

tensor(5.9200, device='cuda:0', grad_fn=<NllLossBackward0>)


895it [05:28,  2.76it/s]

tensor(6.5043, device='cuda:0', grad_fn=<NllLossBackward0>)


896it [05:29,  2.76it/s]

tensor(4.5145, device='cuda:0', grad_fn=<NllLossBackward0>)


897it [05:29,  2.76it/s]

tensor(1.6060, device='cuda:0', grad_fn=<NllLossBackward0>)


898it [05:29,  2.76it/s]

tensor(3.2833, device='cuda:0', grad_fn=<NllLossBackward0>)


899it [05:30,  2.77it/s]

tensor(2.5936, device='cuda:0', grad_fn=<NllLossBackward0>)


900it [05:30,  2.76it/s]

tensor(5.2375, device='cuda:0', grad_fn=<NllLossBackward0>)


901it [05:30,  2.76it/s]

tensor(5.3932, device='cuda:0', grad_fn=<NllLossBackward0>)


902it [05:31,  2.75it/s]

tensor(103.0171, device='cuda:0', grad_fn=<NllLossBackward0>)


903it [05:31,  2.75it/s]

tensor(11.1144, device='cuda:0', grad_fn=<NllLossBackward0>)


904it [05:31,  2.75it/s]

tensor(5.5165, device='cuda:0', grad_fn=<NllLossBackward0>)


905it [05:32,  2.73it/s]

tensor(5.8356, device='cuda:0', grad_fn=<NllLossBackward0>)


906it [05:32,  2.74it/s]

tensor(19.2889, device='cuda:0', grad_fn=<NllLossBackward0>)


907it [05:33,  2.75it/s]

tensor(5.7908, device='cuda:0', grad_fn=<NllLossBackward0>)


908it [05:33,  2.76it/s]

tensor(8.7655, device='cuda:0', grad_fn=<NllLossBackward0>)


909it [05:33,  2.76it/s]

tensor(15.6376, device='cuda:0', grad_fn=<NllLossBackward0>)


910it [05:34,  2.76it/s]

tensor(4.7989, device='cuda:0', grad_fn=<NllLossBackward0>)


911it [05:34,  2.75it/s]

tensor(5.2503, device='cuda:0', grad_fn=<NllLossBackward0>)


912it [05:34,  2.76it/s]

tensor(1.8188, device='cuda:0', grad_fn=<NllLossBackward0>)


913it [05:35,  2.77it/s]

tensor(1.5921, device='cuda:0', grad_fn=<NllLossBackward0>)


914it [05:35,  2.77it/s]

tensor(1.2287, device='cuda:0', grad_fn=<NllLossBackward0>)


915it [05:35,  2.77it/s]

tensor(3.3402, device='cuda:0', grad_fn=<NllLossBackward0>)


916it [05:36,  2.77it/s]

tensor(1.6129, device='cuda:0', grad_fn=<NllLossBackward0>)


917it [05:36,  2.77it/s]

tensor(1.8460, device='cuda:0', grad_fn=<NllLossBackward0>)


918it [05:37,  2.76it/s]

tensor(4.9553, device='cuda:0', grad_fn=<NllLossBackward0>)


919it [05:37,  2.77it/s]

tensor(1.1278, device='cuda:0', grad_fn=<NllLossBackward0>)


920it [05:37,  2.77it/s]

tensor(3.9549, device='cuda:0', grad_fn=<NllLossBackward0>)


921it [05:38,  2.78it/s]

tensor(6.2550, device='cuda:0', grad_fn=<NllLossBackward0>)


922it [05:38,  2.77it/s]

tensor(5.1077, device='cuda:0', grad_fn=<NllLossBackward0>)


923it [05:38,  2.77it/s]

tensor(6.3501, device='cuda:0', grad_fn=<NllLossBackward0>)


924it [05:39,  2.77it/s]

tensor(1.2037, device='cuda:0', grad_fn=<NllLossBackward0>)


925it [05:39,  2.77it/s]

tensor(3.3772, device='cuda:0', grad_fn=<NllLossBackward0>)


926it [05:39,  2.77it/s]

tensor(5.1340, device='cuda:0', grad_fn=<NllLossBackward0>)


927it [05:40,  2.76it/s]

tensor(2.3048, device='cuda:0', grad_fn=<NllLossBackward0>)


928it [05:40,  2.76it/s]

tensor(2.5632, device='cuda:0', grad_fn=<NllLossBackward0>)


929it [05:40,  2.77it/s]

tensor(2.9313, device='cuda:0', grad_fn=<NllLossBackward0>)


930it [05:41,  2.77it/s]

tensor(5.9226, device='cuda:0', grad_fn=<NllLossBackward0>)


931it [05:41,  2.76it/s]

tensor(7.3613, device='cuda:0', grad_fn=<NllLossBackward0>)


932it [05:42,  2.76it/s]

tensor(11.2479, device='cuda:0', grad_fn=<NllLossBackward0>)


933it [05:42,  2.75it/s]

tensor(4.2329, device='cuda:0', grad_fn=<NllLossBackward0>)


934it [05:42,  2.65it/s]

tensor(3.4065, device='cuda:0', grad_fn=<NllLossBackward0>)


935it [05:43,  2.68it/s]

tensor(7.0114, device='cuda:0', grad_fn=<NllLossBackward0>)


936it [05:43,  2.70it/s]

tensor(14.2546, device='cuda:0', grad_fn=<NllLossBackward0>)


937it [05:43,  2.71it/s]

tensor(8.6501, device='cuda:0', grad_fn=<NllLossBackward0>)


938it [05:44,  2.61it/s]

tensor(11.2043, device='cuda:0', grad_fn=<NllLossBackward0>)


939it [05:44,  2.65it/s]

tensor(8.8693, device='cuda:0', grad_fn=<NllLossBackward0>)


940it [05:45,  2.68it/s]

tensor(1.1592, device='cuda:0', grad_fn=<NllLossBackward0>)


941it [05:45,  2.70it/s]

tensor(5.7400, device='cuda:0', grad_fn=<NllLossBackward0>)


942it [05:45,  2.71it/s]

tensor(5.5278, device='cuda:0', grad_fn=<NllLossBackward0>)


943it [05:46,  2.72it/s]

tensor(5.1528, device='cuda:0', grad_fn=<NllLossBackward0>)


944it [05:46,  2.74it/s]

tensor(5.1946, device='cuda:0', grad_fn=<NllLossBackward0>)


945it [05:46,  2.75it/s]

tensor(6.7009, device='cuda:0', grad_fn=<NllLossBackward0>)


946it [05:47,  2.75it/s]

tensor(2.2673, device='cuda:0', grad_fn=<NllLossBackward0>)


947it [05:47,  2.76it/s]

tensor(8.2817, device='cuda:0', grad_fn=<NllLossBackward0>)


948it [05:47,  2.76it/s]

tensor(6.8752, device='cuda:0', grad_fn=<NllLossBackward0>)


949it [05:48,  2.76it/s]

tensor(3.9838, device='cuda:0', grad_fn=<NllLossBackward0>)


950it [05:48,  2.76it/s]

tensor(1.6015, device='cuda:0', grad_fn=<NllLossBackward0>)


951it [05:49,  2.76it/s]

tensor(8.2358, device='cuda:0', grad_fn=<NllLossBackward0>)


952it [05:49,  2.77it/s]

tensor(7.5165, device='cuda:0', grad_fn=<NllLossBackward0>)


953it [05:49,  2.77it/s]

tensor(2.9294, device='cuda:0', grad_fn=<NllLossBackward0>)


954it [05:50,  2.78it/s]

tensor(6.0225, device='cuda:0', grad_fn=<NllLossBackward0>)


955it [05:50,  2.78it/s]

tensor(0.8167, device='cuda:0', grad_fn=<NllLossBackward0>)


956it [05:50,  2.78it/s]

tensor(4.4375, device='cuda:0', grad_fn=<NllLossBackward0>)


957it [05:51,  2.77it/s]

tensor(6.0680, device='cuda:0', grad_fn=<NllLossBackward0>)


958it [05:51,  2.77it/s]

tensor(1.6176, device='cuda:0', grad_fn=<NllLossBackward0>)


959it [05:51,  2.76it/s]

tensor(2.7658, device='cuda:0', grad_fn=<NllLossBackward0>)


960it [05:52,  2.77it/s]

tensor(7.6178, device='cuda:0', grad_fn=<NllLossBackward0>)


961it [05:52,  2.77it/s]

tensor(8.6024, device='cuda:0', grad_fn=<NllLossBackward0>)


962it [05:53,  2.76it/s]

tensor(6.1824, device='cuda:0', grad_fn=<NllLossBackward0>)


963it [05:53,  2.76it/s]

tensor(2.3142, device='cuda:0', grad_fn=<NllLossBackward0>)


964it [05:53,  2.76it/s]

tensor(1.6431, device='cuda:0', grad_fn=<NllLossBackward0>)


965it [05:54,  2.75it/s]

tensor(6.0354, device='cuda:0', grad_fn=<NllLossBackward0>)


966it [05:54,  2.76it/s]

tensor(7.5958, device='cuda:0', grad_fn=<NllLossBackward0>)


967it [05:54,  2.75it/s]

tensor(7.8848, device='cuda:0', grad_fn=<NllLossBackward0>)


968it [05:55,  2.76it/s]

tensor(2.0101, device='cuda:0', grad_fn=<NllLossBackward0>)


969it [05:55,  2.76it/s]

tensor(3.4980, device='cuda:0', grad_fn=<NllLossBackward0>)


970it [05:55,  2.76it/s]

tensor(9.4039, device='cuda:0', grad_fn=<NllLossBackward0>)


971it [05:56,  2.76it/s]

tensor(1.9413, device='cuda:0', grad_fn=<NllLossBackward0>)


972it [05:56,  2.76it/s]

tensor(4.8920, device='cuda:0', grad_fn=<NllLossBackward0>)


973it [05:57,  2.76it/s]

tensor(5.0325, device='cuda:0', grad_fn=<NllLossBackward0>)


974it [05:57,  2.75it/s]

tensor(13.6740, device='cuda:0', grad_fn=<NllLossBackward0>)


975it [05:57,  2.76it/s]

tensor(4.5389, device='cuda:0', grad_fn=<NllLossBackward0>)


976it [05:58,  2.77it/s]

tensor(9.3617, device='cuda:0', grad_fn=<NllLossBackward0>)


977it [05:58,  2.77it/s]

tensor(28.0625, device='cuda:0', grad_fn=<NllLossBackward0>)


978it [05:58,  2.77it/s]

tensor(25.4077, device='cuda:0', grad_fn=<NllLossBackward0>)


979it [05:59,  2.76it/s]

tensor(1.9753, device='cuda:0', grad_fn=<NllLossBackward0>)


980it [05:59,  2.76it/s]

tensor(6.4155, device='cuda:0', grad_fn=<NllLossBackward0>)


981it [05:59,  2.77it/s]

tensor(5.9337, device='cuda:0', grad_fn=<NllLossBackward0>)


982it [06:00,  2.76it/s]

tensor(5.4362, device='cuda:0', grad_fn=<NllLossBackward0>)


983it [06:00,  2.76it/s]

tensor(0.8573, device='cuda:0', grad_fn=<NllLossBackward0>)


984it [06:01,  2.76it/s]

tensor(1.3531, device='cuda:0', grad_fn=<NllLossBackward0>)


985it [06:01,  2.76it/s]

tensor(5.3049, device='cuda:0', grad_fn=<NllLossBackward0>)


986it [06:01,  2.76it/s]

tensor(0.8721, device='cuda:0', grad_fn=<NllLossBackward0>)


987it [06:02,  2.75it/s]

tensor(15.5772, device='cuda:0', grad_fn=<NllLossBackward0>)


988it [06:02,  2.76it/s]

tensor(8.9427, device='cuda:0', grad_fn=<NllLossBackward0>)


989it [06:02,  2.75it/s]

tensor(4.6243, device='cuda:0', grad_fn=<NllLossBackward0>)


990it [06:03,  2.75it/s]

tensor(12.0878, device='cuda:0', grad_fn=<NllLossBackward0>)


991it [06:03,  2.76it/s]

tensor(5.0373, device='cuda:0', grad_fn=<NllLossBackward0>)


992it [06:03,  2.76it/s]

tensor(6.0631, device='cuda:0', grad_fn=<NllLossBackward0>)


993it [06:04,  2.75it/s]

tensor(7.2515, device='cuda:0', grad_fn=<NllLossBackward0>)


994it [06:04,  2.75it/s]

tensor(2.7887, device='cuda:0', grad_fn=<NllLossBackward0>)


995it [06:05,  2.75it/s]

tensor(18.2169, device='cuda:0', grad_fn=<NllLossBackward0>)


996it [06:05,  2.75it/s]

tensor(3.3415, device='cuda:0', grad_fn=<NllLossBackward0>)


997it [06:05,  2.75it/s]

tensor(8.6458, device='cuda:0', grad_fn=<NllLossBackward0>)


998it [06:06,  2.74it/s]

tensor(0.5114, device='cuda:0', grad_fn=<NllLossBackward0>)


999it [06:06,  2.75it/s]

tensor(0.8731, device='cuda:0', grad_fn=<NllLossBackward0>)


1000it [06:06,  2.75it/s]

tensor(5.4908, device='cuda:0', grad_fn=<NllLossBackward0>)


1001it [06:07,  2.75it/s]

tensor(6.0074, device='cuda:0', grad_fn=<NllLossBackward0>)


1002it [06:07,  2.76it/s]

tensor(1.3516, device='cuda:0', grad_fn=<NllLossBackward0>)


1003it [06:07,  2.76it/s]

tensor(8.6438, device='cuda:0', grad_fn=<NllLossBackward0>)


1004it [06:08,  2.75it/s]

tensor(7.4020, device='cuda:0', grad_fn=<NllLossBackward0>)


1005it [06:08,  2.76it/s]

tensor(7.1210, device='cuda:0', grad_fn=<NllLossBackward0>)


1006it [06:09,  2.76it/s]

tensor(7.1198, device='cuda:0', grad_fn=<NllLossBackward0>)


1007it [06:09,  2.76it/s]

tensor(12.3198, device='cuda:0', grad_fn=<NllLossBackward0>)


1008it [06:09,  2.70it/s]

tensor(4.6965, device='cuda:0', grad_fn=<NllLossBackward0>)


1009it [06:10,  2.72it/s]

tensor(6.5407, device='cuda:0', grad_fn=<NllLossBackward0>)


1010it [06:10,  2.73it/s]

tensor(2.5498, device='cuda:0', grad_fn=<NllLossBackward0>)


1011it [06:10,  2.74it/s]

tensor(6.2083, device='cuda:0', grad_fn=<NllLossBackward0>)


1012it [06:11,  1.72it/s]

tensor(12.3363, device='cuda:0', grad_fn=<NllLossBackward0>)


1013it [06:12,  1.93it/s]

tensor(1.4383, device='cuda:0', grad_fn=<NllLossBackward0>)


1014it [06:12,  2.12it/s]

tensor(5.6651, device='cuda:0', grad_fn=<NllLossBackward0>)


1015it [06:13,  2.25it/s]

tensor(8.1044, device='cuda:0', grad_fn=<NllLossBackward0>)


1016it [06:13,  2.38it/s]

tensor(0.5444, device='cuda:0', grad_fn=<NllLossBackward0>)


1017it [06:13,  2.48it/s]

tensor(5.7053, device='cuda:0', grad_fn=<NllLossBackward0>)


1018it [06:14,  2.55it/s]

tensor(4.9719, device='cuda:0', grad_fn=<NllLossBackward0>)


1019it [06:14,  2.61it/s]

tensor(7.7637, device='cuda:0', grad_fn=<NllLossBackward0>)


1020it [06:14,  2.66it/s]

tensor(12.9966, device='cuda:0', grad_fn=<NllLossBackward0>)


1021it [06:15,  2.69it/s]

tensor(12.6469, device='cuda:0', grad_fn=<NllLossBackward0>)


1022it [06:15,  2.71it/s]

tensor(1.6772, device='cuda:0', grad_fn=<NllLossBackward0>)


1023it [06:15,  2.73it/s]

tensor(4.7535, device='cuda:0', grad_fn=<NllLossBackward0>)


1024it [06:16,  2.74it/s]

tensor(2.4624, device='cuda:0', grad_fn=<NllLossBackward0>)


1025it [06:16,  2.74it/s]

tensor(5.0476, device='cuda:0', grad_fn=<NllLossBackward0>)


1026it [06:17,  2.75it/s]

tensor(10.5099, device='cuda:0', grad_fn=<NllLossBackward0>)


1027it [06:17,  2.75it/s]

tensor(6.2070, device='cuda:0', grad_fn=<NllLossBackward0>)


1028it [06:17,  2.76it/s]

tensor(3.9250, device='cuda:0', grad_fn=<NllLossBackward0>)


1029it [06:18,  2.76it/s]

tensor(8.0625, device='cuda:0', grad_fn=<NllLossBackward0>)


1030it [06:18,  2.75it/s]

tensor(4.9330, device='cuda:0', grad_fn=<NllLossBackward0>)


1031it [06:18,  2.75it/s]

tensor(7.9120, device='cuda:0', grad_fn=<NllLossBackward0>)


1032it [06:19,  2.74it/s]

tensor(6.1025, device='cuda:0', grad_fn=<NllLossBackward0>)


1033it [06:19,  2.75it/s]

tensor(1.7625, device='cuda:0', grad_fn=<NllLossBackward0>)


1034it [06:19,  2.75it/s]

tensor(2.4841, device='cuda:0', grad_fn=<NllLossBackward0>)


1035it [06:20,  2.76it/s]

tensor(5.9075, device='cuda:0', grad_fn=<NllLossBackward0>)


1036it [06:20,  2.76it/s]

tensor(0.6207, device='cuda:0', grad_fn=<NllLossBackward0>)


1037it [06:21,  2.76it/s]

tensor(5.1346, device='cuda:0', grad_fn=<NllLossBackward0>)


1038it [06:21,  2.76it/s]

tensor(1.1919, device='cuda:0', grad_fn=<NllLossBackward0>)


1039it [06:21,  2.76it/s]

tensor(1.6254, device='cuda:0', grad_fn=<NllLossBackward0>)


1040it [06:22,  2.76it/s]

tensor(0.6766, device='cuda:0', grad_fn=<NllLossBackward0>)


1041it [06:22,  2.76it/s]

tensor(2.9080, device='cuda:0', grad_fn=<NllLossBackward0>)


1042it [06:22,  2.76it/s]

tensor(8.4132, device='cuda:0', grad_fn=<NllLossBackward0>)


1043it [06:23,  2.76it/s]

tensor(1.7845, device='cuda:0', grad_fn=<NllLossBackward0>)


1044it [06:23,  2.75it/s]

tensor(20.2773, device='cuda:0', grad_fn=<NllLossBackward0>)


1045it [06:23,  2.75it/s]

tensor(17.7114, device='cuda:0', grad_fn=<NllLossBackward0>)


1046it [06:24,  2.75it/s]

tensor(38.1217, device='cuda:0', grad_fn=<NllLossBackward0>)


1047it [06:24,  2.74it/s]

tensor(41.7888, device='cuda:0', grad_fn=<NllLossBackward0>)


1048it [06:25,  2.75it/s]

tensor(38.2183, device='cuda:0', grad_fn=<NllLossBackward0>)


1049it [06:25,  2.75it/s]

tensor(42.0590, device='cuda:0', grad_fn=<NllLossBackward0>)


1050it [06:25,  2.76it/s]

tensor(37.1374, device='cuda:0', grad_fn=<NllLossBackward0>)


1051it [06:26,  2.76it/s]

tensor(40.0531, device='cuda:0', grad_fn=<NllLossBackward0>)


1052it [06:26,  2.76it/s]

tensor(48.4026, device='cuda:0', grad_fn=<NllLossBackward0>)


1053it [06:26,  2.76it/s]

tensor(65.7496, device='cuda:0', grad_fn=<NllLossBackward0>)


1054it [06:27,  2.75it/s]

tensor(97.6325, device='cuda:0', grad_fn=<NllLossBackward0>)


1055it [06:27,  2.75it/s]

tensor(133.2495, device='cuda:0', grad_fn=<NllLossBackward0>)


1056it [06:27,  2.75it/s]

tensor(155.3734, device='cuda:0', grad_fn=<NllLossBackward0>)


1057it [06:28,  2.74it/s]

tensor(157.0035, device='cuda:0', grad_fn=<NllLossBackward0>)


1058it [06:28,  2.74it/s]

tensor(151.6840, device='cuda:0', grad_fn=<NllLossBackward0>)


1059it [06:29,  2.74it/s]

tensor(217.3268, device='cuda:0', grad_fn=<NllLossBackward0>)


1060it [06:29,  2.75it/s]

tensor(202.5812, device='cuda:0', grad_fn=<NllLossBackward0>)


1061it [06:29,  2.76it/s]

tensor(228.5621, device='cuda:0', grad_fn=<NllLossBackward0>)


1062it [06:30,  2.75it/s]

tensor(239.9135, device='cuda:0', grad_fn=<NllLossBackward0>)


1063it [06:30,  2.76it/s]

tensor(244.5817, device='cuda:0', grad_fn=<NllLossBackward0>)


1064it [06:30,  2.76it/s]

tensor(246.0927, device='cuda:0', grad_fn=<NllLossBackward0>)


1065it [06:31,  2.75it/s]

tensor(256.2691, device='cuda:0', grad_fn=<NllLossBackward0>)


1066it [06:31,  2.76it/s]

tensor(237.6516, device='cuda:0', grad_fn=<NllLossBackward0>)


1067it [06:31,  2.76it/s]

tensor(237.4316, device='cuda:0', grad_fn=<NllLossBackward0>)


1068it [06:32,  2.76it/s]

tensor(249.8780, device='cuda:0', grad_fn=<NllLossBackward0>)


1069it [06:32,  2.75it/s]

tensor(235.1536, device='cuda:0', grad_fn=<NllLossBackward0>)


1070it [06:33,  2.75it/s]

tensor(248.0980, device='cuda:0', grad_fn=<NllLossBackward0>)


1071it [06:33,  2.75it/s]

tensor(251.1494, device='cuda:0', grad_fn=<NllLossBackward0>)


1072it [06:33,  2.76it/s]

tensor(256.1706, device='cuda:0', grad_fn=<NllLossBackward0>)


1073it [06:34,  2.76it/s]

tensor(249.4899, device='cuda:0', grad_fn=<NllLossBackward0>)


1074it [06:34,  2.75it/s]

tensor(226.9372, device='cuda:0', grad_fn=<NllLossBackward0>)


1075it [06:34,  2.75it/s]

tensor(242.7837, device='cuda:0', grad_fn=<NllLossBackward0>)


1076it [06:35,  2.74it/s]

tensor(255.3501, device='cuda:0', grad_fn=<NllLossBackward0>)


1077it [06:35,  2.74it/s]

tensor(235.3728, device='cuda:0', grad_fn=<NllLossBackward0>)


1078it [06:35,  2.75it/s]

tensor(239.3291, device='cuda:0', grad_fn=<NllLossBackward0>)


1079it [06:36,  2.76it/s]

tensor(232.2809, device='cuda:0', grad_fn=<NllLossBackward0>)


1080it [06:36,  2.76it/s]

tensor(231.5305, device='cuda:0', grad_fn=<NllLossBackward0>)


1081it [06:37,  2.76it/s]

tensor(238.9644, device='cuda:0', grad_fn=<NllLossBackward0>)


1082it [06:37,  2.37it/s]

tensor(231.0591, device='cuda:0', grad_fn=<NllLossBackward0>)


1083it [06:37,  2.48it/s]

tensor(232.8972, device='cuda:0', grad_fn=<NllLossBackward0>)


1084it [06:38,  2.55it/s]

tensor(230.2927, device='cuda:0', grad_fn=<NllLossBackward0>)


1085it [06:38,  2.62it/s]

tensor(231.4621, device='cuda:0', grad_fn=<NllLossBackward0>)


1086it [06:39,  2.66it/s]

tensor(188.5126, device='cuda:0', grad_fn=<NllLossBackward0>)


1087it [06:39,  2.69it/s]

tensor(196.2573, device='cuda:0', grad_fn=<NllLossBackward0>)


1088it [06:39,  2.70it/s]

tensor(221.6834, device='cuda:0', grad_fn=<NllLossBackward0>)


1089it [06:40,  2.72it/s]

tensor(212.7884, device='cuda:0', grad_fn=<NllLossBackward0>)


1090it [06:40,  2.73it/s]

tensor(192.2277, device='cuda:0', grad_fn=<NllLossBackward0>)


1091it [06:40,  2.74it/s]

tensor(208.8538, device='cuda:0', grad_fn=<NllLossBackward0>)


1092it [06:41,  2.74it/s]

tensor(211.5965, device='cuda:0', grad_fn=<NllLossBackward0>)


1093it [06:41,  2.75it/s]

tensor(203.6313, device='cuda:0', grad_fn=<NllLossBackward0>)


1094it [06:41,  2.75it/s]

tensor(195.0902, device='cuda:0', grad_fn=<NllLossBackward0>)


1095it [06:42,  2.74it/s]

tensor(165.2984, device='cuda:0', grad_fn=<NllLossBackward0>)


1096it [06:42,  2.75it/s]

tensor(256.0692, device='cuda:0', grad_fn=<NllLossBackward0>)


1097it [06:43,  2.70it/s]

tensor(200.3689, device='cuda:0', grad_fn=<NllLossBackward0>)


1098it [06:43,  2.71it/s]

tensor(219.7324, device='cuda:0', grad_fn=<NllLossBackward0>)


1099it [06:43,  2.72it/s]

tensor(203.6512, device='cuda:0', grad_fn=<NllLossBackward0>)


1100it [06:44,  2.73it/s]

tensor(180.9819, device='cuda:0', grad_fn=<NllLossBackward0>)


1101it [06:44,  2.74it/s]

tensor(149.2672, device='cuda:0', grad_fn=<NllLossBackward0>)


1102it [06:44,  2.74it/s]

tensor(187.1633, device='cuda:0', grad_fn=<NllLossBackward0>)


1103it [06:45,  2.74it/s]

tensor(196.6997, device='cuda:0', grad_fn=<NllLossBackward0>)


1104it [06:45,  2.73it/s]

tensor(196.5740, device='cuda:0', grad_fn=<NllLossBackward0>)


1105it [06:45,  2.73it/s]

tensor(inf, device='cuda:0', grad_fn=<NllLossBackward0>)


1106it [06:46,  2.73it/s]

tensor(212.7800, device='cuda:0', grad_fn=<NllLossBackward0>)


1107it [06:46,  2.74it/s]

tensor(201.0739, device='cuda:0', grad_fn=<NllLossBackward0>)


1108it [06:47,  2.74it/s]

tensor(209.5110, device='cuda:0', grad_fn=<NllLossBackward0>)


1109it [06:47,  2.71it/s]

tensor(205.7650, device='cuda:0', grad_fn=<NllLossBackward0>)


1110it [06:47,  2.72it/s]

tensor(173.1436, device='cuda:0', grad_fn=<NllLossBackward0>)


1111it [06:48,  2.73it/s]

tensor(216.2520, device='cuda:0', grad_fn=<NllLossBackward0>)


1112it [06:48,  2.74it/s]

tensor(189.7117, device='cuda:0', grad_fn=<NllLossBackward0>)


1113it [06:48,  2.74it/s]

tensor(169.9919, device='cuda:0', grad_fn=<NllLossBackward0>)


1114it [06:49,  2.74it/s]

tensor(186.2983, device='cuda:0', grad_fn=<NllLossBackward0>)


1115it [06:49,  2.75it/s]

tensor(180.7789, device='cuda:0', grad_fn=<NllLossBackward0>)


1116it [06:49,  2.74it/s]

tensor(180.3916, device='cuda:0', grad_fn=<NllLossBackward0>)


1117it [06:50,  2.75it/s]

tensor(171.5124, device='cuda:0', grad_fn=<NllLossBackward0>)


1118it [06:50,  2.74it/s]

tensor(200.3621, device='cuda:0', grad_fn=<NllLossBackward0>)


1119it [06:51,  2.74it/s]

tensor(188.2795, device='cuda:0', grad_fn=<NllLossBackward0>)


1120it [06:51,  2.75it/s]

tensor(187.7510, device='cuda:0', grad_fn=<NllLossBackward0>)


1121it [06:51,  2.75it/s]

tensor(166.5097, device='cuda:0', grad_fn=<NllLossBackward0>)


1122it [06:52,  2.75it/s]

tensor(180.6834, device='cuda:0', grad_fn=<NllLossBackward0>)


1123it [06:52,  2.75it/s]

tensor(169.7347, device='cuda:0', grad_fn=<NllLossBackward0>)


1124it [06:52,  2.75it/s]

tensor(157.9326, device='cuda:0', grad_fn=<NllLossBackward0>)


1125it [06:53,  2.75it/s]

tensor(244.0898, device='cuda:0', grad_fn=<NllLossBackward0>)


1126it [06:53,  2.75it/s]

tensor(181.6671, device='cuda:0', grad_fn=<NllLossBackward0>)


1127it [06:53,  2.75it/s]

tensor(174.4333, device='cuda:0', grad_fn=<NllLossBackward0>)


1128it [06:54,  2.61it/s]

tensor(194.5310, device='cuda:0', grad_fn=<NllLossBackward0>)


1129it [06:54,  2.65it/s]

tensor(167.6293, device='cuda:0', grad_fn=<NllLossBackward0>)


1130it [06:55,  2.67it/s]

tensor(176.0405, device='cuda:0', grad_fn=<NllLossBackward0>)


1131it [06:55,  2.69it/s]

tensor(164.1075, device='cuda:0', grad_fn=<NllLossBackward0>)


1132it [06:55,  2.71it/s]

tensor(132.5585, device='cuda:0', grad_fn=<NllLossBackward0>)


1133it [06:56,  2.72it/s]

tensor(166.2022, device='cuda:0', grad_fn=<NllLossBackward0>)


1134it [06:56,  2.72it/s]

tensor(162.4795, device='cuda:0', grad_fn=<NllLossBackward0>)


1135it [06:56,  2.73it/s]

tensor(164.5486, device='cuda:0', grad_fn=<NllLossBackward0>)


1136it [06:57,  2.74it/s]

tensor(210.3078, device='cuda:0', grad_fn=<NllLossBackward0>)


1137it [06:57,  2.75it/s]

tensor(140.1819, device='cuda:0', grad_fn=<NllLossBackward0>)


1138it [06:58,  2.74it/s]

tensor(204.4117, device='cuda:0', grad_fn=<NllLossBackward0>)


1139it [06:58,  2.74it/s]

tensor(194.3618, device='cuda:0', grad_fn=<NllLossBackward0>)


1140it [06:58,  2.73it/s]

tensor(160.8510, device='cuda:0', grad_fn=<NllLossBackward0>)


1141it [06:59,  2.74it/s]

tensor(134.4056, device='cuda:0', grad_fn=<NllLossBackward0>)


1142it [06:59,  2.75it/s]

tensor(173.7375, device='cuda:0', grad_fn=<NllLossBackward0>)


1143it [06:59,  2.75it/s]

tensor(144.7779, device='cuda:0', grad_fn=<NllLossBackward0>)


1144it [07:00,  2.75it/s]

tensor(181.3216, device='cuda:0', grad_fn=<NllLossBackward0>)


1145it [07:00,  2.75it/s]

tensor(169.8622, device='cuda:0', grad_fn=<NllLossBackward0>)


1146it [07:00,  2.76it/s]

tensor(144.5569, device='cuda:0', grad_fn=<NllLossBackward0>)


1147it [07:01,  2.75it/s]

tensor(149.8418, device='cuda:0', grad_fn=<NllLossBackward0>)


1148it [07:01,  2.75it/s]

tensor(127.2928, device='cuda:0', grad_fn=<NllLossBackward0>)


1149it [07:02,  2.75it/s]

tensor(128.2542, device='cuda:0', grad_fn=<NllLossBackward0>)


1150it [07:02,  2.74it/s]

tensor(164.0227, device='cuda:0', grad_fn=<NllLossBackward0>)


1151it [07:02,  2.74it/s]

tensor(140.2346, device='cuda:0', grad_fn=<NllLossBackward0>)


1152it [07:03,  2.75it/s]

tensor(138.9444, device='cuda:0', grad_fn=<NllLossBackward0>)


1153it [07:03,  2.34it/s]

tensor(165.9859, device='cuda:0', grad_fn=<NllLossBackward0>)


1154it [07:04,  2.45it/s]

tensor(120.9410, device='cuda:0', grad_fn=<NllLossBackward0>)


1155it [07:04,  2.53it/s]

tensor(143.7591, device='cuda:0', grad_fn=<NllLossBackward0>)


1156it [07:04,  2.59it/s]

tensor(287.4454, device='cuda:0', grad_fn=<NllLossBackward0>)


1157it [07:05,  2.64it/s]

tensor(140.0647, device='cuda:0', grad_fn=<NllLossBackward0>)


1158it [07:05,  2.67it/s]

tensor(129.6991, device='cuda:0', grad_fn=<NllLossBackward0>)


1159it [07:05,  2.69it/s]

tensor(148.7713, device='cuda:0', grad_fn=<NllLossBackward0>)


1160it [07:06,  2.70it/s]

tensor(109.2499, device='cuda:0', grad_fn=<NllLossBackward0>)


1161it [07:06,  2.71it/s]

tensor(130.7393, device='cuda:0', grad_fn=<NllLossBackward0>)


1162it [07:07,  2.72it/s]

tensor(146.1500, device='cuda:0', grad_fn=<NllLossBackward0>)


1163it [07:07,  2.72it/s]

tensor(162.8821, device='cuda:0', grad_fn=<NllLossBackward0>)


1164it [07:07,  2.73it/s]

tensor(258.4916, device='cuda:0', grad_fn=<NllLossBackward0>)


1165it [07:08,  2.74it/s]

tensor(130.3132, device='cuda:0', grad_fn=<NllLossBackward0>)


1166it [07:08,  2.74it/s]

tensor(128.5450, device='cuda:0', grad_fn=<NllLossBackward0>)


1167it [07:08,  2.74it/s]

tensor(292.9885, device='cuda:0', grad_fn=<NllLossBackward0>)


1168it [07:09,  2.75it/s]

tensor(272.9101, device='cuda:0', grad_fn=<NllLossBackward0>)


1169it [07:09,  2.75it/s]

tensor(141.1030, device='cuda:0', grad_fn=<NllLossBackward0>)


1170it [07:09,  2.75it/s]

tensor(133.5733, device='cuda:0', grad_fn=<NllLossBackward0>)


1171it [07:10,  2.74it/s]

tensor(118.9957, device='cuda:0', grad_fn=<NllLossBackward0>)


1172it [07:10,  2.75it/s]

tensor(120.9685, device='cuda:0', grad_fn=<NllLossBackward0>)


1173it [07:11,  2.76it/s]

tensor(124.3694, device='cuda:0', grad_fn=<NllLossBackward0>)


1174it [07:11,  2.75it/s]

tensor(115.8305, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [07:11,  2.75it/s]

tensor(144.9622, device='cuda:0', grad_fn=<NllLossBackward0>)


1176it [07:12,  2.75it/s]

tensor(130.5844, device='cuda:0', grad_fn=<NllLossBackward0>)


1177it [07:12,  2.74it/s]

tensor(111.1177, device='cuda:0', grad_fn=<NllLossBackward0>)


1178it [07:12,  2.75it/s]

tensor(132.0523, device='cuda:0', grad_fn=<NllLossBackward0>)


1179it [07:13,  2.75it/s]

tensor(127.1840, device='cuda:0', grad_fn=<NllLossBackward0>)


1180it [07:13,  2.75it/s]

tensor(128.6686, device='cuda:0', grad_fn=<NllLossBackward0>)


1181it [07:13,  2.75it/s]

tensor(310.4533, device='cuda:0', grad_fn=<NllLossBackward0>)


1182it [07:14,  2.75it/s]

tensor(123.8854, device='cuda:0', grad_fn=<NllLossBackward0>)


1183it [07:14,  2.75it/s]

tensor(110.2402, device='cuda:0', grad_fn=<NllLossBackward0>)


1184it [07:15,  2.74it/s]

tensor(91.1495, device='cuda:0', grad_fn=<NllLossBackward0>)


1185it [07:15,  2.74it/s]

tensor(108.1064, device='cuda:0', grad_fn=<NllLossBackward0>)


1186it [07:15,  2.74it/s]

tensor(99.4246, device='cuda:0', grad_fn=<NllLossBackward0>)


1187it [07:16,  2.75it/s]

tensor(147.6421, device='cuda:0', grad_fn=<NllLossBackward0>)


1188it [07:16,  2.74it/s]

tensor(121.4342, device='cuda:0', grad_fn=<NllLossBackward0>)


1189it [07:16,  2.74it/s]

tensor(121.1532, device='cuda:0', grad_fn=<NllLossBackward0>)


1190it [07:17,  2.74it/s]

tensor(105.1199, device='cuda:0', grad_fn=<NllLossBackward0>)


1191it [07:17,  2.75it/s]

tensor(88.2644, device='cuda:0', grad_fn=<NllLossBackward0>)


1192it [07:17,  2.75it/s]

tensor(130.4109, device='cuda:0', grad_fn=<NllLossBackward0>)


1193it [07:18,  2.74it/s]

tensor(99.4222, device='cuda:0', grad_fn=<NllLossBackward0>)


1194it [07:18,  2.74it/s]

tensor(133.7468, device='cuda:0', grad_fn=<NllLossBackward0>)


1195it [07:19,  2.74it/s]

tensor(102.5706, device='cuda:0', grad_fn=<NllLossBackward0>)


1196it [07:19,  2.74it/s]

tensor(97.6878, device='cuda:0', grad_fn=<NllLossBackward0>)


1197it [07:19,  2.74it/s]

tensor(96.8036, device='cuda:0', grad_fn=<NllLossBackward0>)


1198it [07:20,  2.74it/s]

tensor(89.4495, device='cuda:0', grad_fn=<NllLossBackward0>)


1199it [07:20,  2.73it/s]

tensor(112.1425, device='cuda:0', grad_fn=<NllLossBackward0>)


1200it [07:20,  2.73it/s]

tensor(84.7971, device='cuda:0', grad_fn=<NllLossBackward0>)


1201it [07:21,  2.73it/s]

tensor(115.2401, device='cuda:0', grad_fn=<NllLossBackward0>)


1202it [07:21,  2.74it/s]

tensor(85.8969, device='cuda:0', grad_fn=<NllLossBackward0>)


1203it [07:21,  2.75it/s]

tensor(85.0990, device='cuda:0', grad_fn=<NllLossBackward0>)


1204it [07:22,  2.75it/s]

tensor(89.9958, device='cuda:0', grad_fn=<NllLossBackward0>)


1205it [07:22,  2.74it/s]

tensor(126.6945, device='cuda:0', grad_fn=<NllLossBackward0>)


1206it [07:23,  2.74it/s]

tensor(100.0593, device='cuda:0', grad_fn=<NllLossBackward0>)


1207it [07:23,  2.75it/s]

tensor(156.5006, device='cuda:0', grad_fn=<NllLossBackward0>)


1208it [07:23,  2.76it/s]

tensor(121.2775, device='cuda:0', grad_fn=<NllLossBackward0>)


1209it [07:24,  2.76it/s]

tensor(116.6489, device='cuda:0', grad_fn=<NllLossBackward0>)


1210it [07:24,  2.76it/s]

tensor(99.9387, device='cuda:0', grad_fn=<NllLossBackward0>)


1211it [07:24,  2.76it/s]

tensor(86.9295, device='cuda:0', grad_fn=<NllLossBackward0>)


1212it [07:25,  2.76it/s]

tensor(100.9507, device='cuda:0', grad_fn=<NllLossBackward0>)


1213it [07:25,  2.76it/s]

tensor(92.8222, device='cuda:0', grad_fn=<NllLossBackward0>)


1214it [07:25,  2.76it/s]

tensor(108.0424, device='cuda:0', grad_fn=<NllLossBackward0>)


1215it [07:26,  2.75it/s]

tensor(102.7070, device='cuda:0', grad_fn=<NllLossBackward0>)


1216it [07:26,  2.75it/s]

tensor(85.8479, device='cuda:0', grad_fn=<NllLossBackward0>)


1217it [07:27,  2.75it/s]

tensor(94.7518, device='cuda:0', grad_fn=<NllLossBackward0>)


1218it [07:27,  2.75it/s]

tensor(86.4253, device='cuda:0', grad_fn=<NllLossBackward0>)


1219it [07:27,  2.75it/s]

tensor(111.0739, device='cuda:0', grad_fn=<NllLossBackward0>)


1220it [07:28,  2.75it/s]

tensor(84.0622, device='cuda:0', grad_fn=<NllLossBackward0>)


1221it [07:28,  2.74it/s]

tensor(103.4583, device='cuda:0', grad_fn=<NllLossBackward0>)


1222it [07:28,  2.75it/s]

tensor(91.5792, device='cuda:0', grad_fn=<NllLossBackward0>)


1223it [07:29,  2.75it/s]

tensor(90.7099, device='cuda:0', grad_fn=<NllLossBackward0>)


1224it [07:29,  2.75it/s]

tensor(102.0459, device='cuda:0', grad_fn=<NllLossBackward0>)


1225it [07:29,  2.75it/s]

tensor(100.1246, device='cuda:0', grad_fn=<NllLossBackward0>)


1226it [07:30,  2.76it/s]

tensor(82.0814, device='cuda:0', grad_fn=<NllLossBackward0>)


1227it [07:30,  2.76it/s]

tensor(74.2823, device='cuda:0', grad_fn=<NllLossBackward0>)


1228it [07:31,  2.76it/s]

tensor(99.2261, device='cuda:0', grad_fn=<NllLossBackward0>)


1229it [07:31,  2.75it/s]

tensor(93.9209, device='cuda:0', grad_fn=<NllLossBackward0>)


1230it [07:31,  2.76it/s]

tensor(92.0248, device='cuda:0', grad_fn=<NllLossBackward0>)


1231it [07:32,  2.75it/s]

tensor(66.9319, device='cuda:0', grad_fn=<NllLossBackward0>)


1232it [07:32,  2.75it/s]

tensor(79.2867, device='cuda:0', grad_fn=<NllLossBackward0>)


1233it [07:32,  2.75it/s]

tensor(99.6039, device='cuda:0', grad_fn=<NllLossBackward0>)


1234it [07:33,  2.74it/s]

tensor(102.2236, device='cuda:0', grad_fn=<NllLossBackward0>)


1235it [07:33,  2.75it/s]

tensor(83.4474, device='cuda:0', grad_fn=<NllLossBackward0>)


1236it [07:33,  2.75it/s]

tensor(75.9248, device='cuda:0', grad_fn=<NllLossBackward0>)


1237it [07:34,  2.74it/s]

tensor(74.7540, device='cuda:0', grad_fn=<NllLossBackward0>)


1238it [07:34,  2.73it/s]

tensor(64.4079, device='cuda:0', grad_fn=<NllLossBackward0>)


1239it [07:35,  2.73it/s]

tensor(80.4761, device='cuda:0', grad_fn=<NllLossBackward0>)


1240it [07:35,  2.73it/s]

tensor(70.1983, device='cuda:0', grad_fn=<NllLossBackward0>)


1241it [07:35,  2.74it/s]

tensor(76.9289, device='cuda:0', grad_fn=<NllLossBackward0>)


1242it [07:36,  2.74it/s]

tensor(92.7093, device='cuda:0', grad_fn=<NllLossBackward0>)


1243it [07:36,  2.75it/s]

tensor(95.1122, device='cuda:0', grad_fn=<NllLossBackward0>)


1244it [07:36,  2.75it/s]

tensor(82.4972, device='cuda:0', grad_fn=<NllLossBackward0>)


1245it [07:37,  2.75it/s]

tensor(74.4470, device='cuda:0', grad_fn=<NllLossBackward0>)


1246it [07:37,  2.75it/s]

tensor(79.8270, device='cuda:0', grad_fn=<NllLossBackward0>)


1247it [07:37,  2.75it/s]

tensor(79.2965, device='cuda:0', grad_fn=<NllLossBackward0>)


1248it [07:38,  2.75it/s]

tensor(73.7156, device='cuda:0', grad_fn=<NllLossBackward0>)


1249it [07:38,  2.75it/s]

tensor(65.9953, device='cuda:0', grad_fn=<NllLossBackward0>)


1250it [07:39,  2.74it/s]

tensor(59.1199, device='cuda:0', grad_fn=<NllLossBackward0>)


1251it [07:39,  2.74it/s]

tensor(58.8764, device='cuda:0', grad_fn=<NllLossBackward0>)


1252it [07:39,  2.75it/s]

tensor(176.6712, device='cuda:0', grad_fn=<NllLossBackward0>)


1253it [07:40,  2.69it/s]

tensor(73.6021, device='cuda:0', grad_fn=<NllLossBackward0>)


1254it [07:40,  2.71it/s]

tensor(63.7820, device='cuda:0', grad_fn=<NllLossBackward0>)


1255it [07:40,  2.72it/s]

tensor(80.6824, device='cuda:0', grad_fn=<NllLossBackward0>)


1256it [07:41,  2.74it/s]

tensor(298.3346, device='cuda:0', grad_fn=<NllLossBackward0>)


1257it [07:41,  2.74it/s]

tensor(76.4210, device='cuda:0', grad_fn=<NllLossBackward0>)


1258it [07:41,  2.75it/s]

tensor(67.7435, device='cuda:0', grad_fn=<NllLossBackward0>)


1259it [07:42,  2.75it/s]

tensor(77.2062, device='cuda:0', grad_fn=<NllLossBackward0>)


1260it [07:42,  2.75it/s]

tensor(86.8585, device='cuda:0', grad_fn=<NllLossBackward0>)


1261it [07:43,  2.75it/s]

tensor(55.8360, device='cuda:0', grad_fn=<NllLossBackward0>)


1262it [07:43,  2.75it/s]

tensor(103.8501, device='cuda:0', grad_fn=<NllLossBackward0>)


1263it [07:43,  2.76it/s]

tensor(88.9093, device='cuda:0', grad_fn=<NllLossBackward0>)


1264it [07:44,  2.75it/s]

tensor(88.4027, device='cuda:0', grad_fn=<NllLossBackward0>)


1265it [07:44,  2.74it/s]

tensor(68.0468, device='cuda:0', grad_fn=<NllLossBackward0>)


1266it [07:44,  2.74it/s]

tensor(98.8702, device='cuda:0', grad_fn=<NllLossBackward0>)


1267it [07:45,  2.75it/s]

tensor(77.7564, device='cuda:0', grad_fn=<NllLossBackward0>)


1268it [07:45,  2.74it/s]

tensor(531.7607, device='cuda:0', grad_fn=<NllLossBackward0>)


1269it [07:45,  2.75it/s]

tensor(72.2476, device='cuda:0', grad_fn=<NllLossBackward0>)


1270it [07:46,  2.74it/s]

tensor(77.7299, device='cuda:0', grad_fn=<NllLossBackward0>)


1271it [07:46,  2.74it/s]

tensor(88.5269, device='cuda:0', grad_fn=<NllLossBackward0>)


1272it [07:47,  2.74it/s]

tensor(78.0588, device='cuda:0', grad_fn=<NllLossBackward0>)


1273it [07:47,  2.74it/s]

tensor(95.5759, device='cuda:0', grad_fn=<NllLossBackward0>)


1274it [07:47,  2.74it/s]

tensor(92.6062, device='cuda:0', grad_fn=<NllLossBackward0>)


1275it [07:48,  2.74it/s]

tensor(72.2114, device='cuda:0', grad_fn=<NllLossBackward0>)


1276it [07:48,  2.74it/s]

tensor(72.7213, device='cuda:0', grad_fn=<NllLossBackward0>)


1277it [07:48,  2.74it/s]

tensor(99.8718, device='cuda:0', grad_fn=<NllLossBackward0>)


1278it [07:49,  2.74it/s]

tensor(73.6197, device='cuda:0', grad_fn=<NllLossBackward0>)


1279it [07:49,  2.74it/s]

tensor(98.7976, device='cuda:0', grad_fn=<NllLossBackward0>)


1280it [07:49,  2.75it/s]

tensor(69.2840, device='cuda:0', grad_fn=<NllLossBackward0>)


1281it [07:50,  2.74it/s]

tensor(66.2135, device='cuda:0', grad_fn=<NllLossBackward0>)


1282it [07:50,  2.74it/s]

tensor(94.3027, device='cuda:0', grad_fn=<NllLossBackward0>)


1283it [07:51,  2.74it/s]

tensor(83.2991, device='cuda:0', grad_fn=<NllLossBackward0>)


1284it [07:51,  2.75it/s]

tensor(88.1390, device='cuda:0', grad_fn=<NllLossBackward0>)


1285it [07:51,  2.75it/s]

tensor(99.8523, device='cuda:0', grad_fn=<NllLossBackward0>)


1286it [07:52,  2.75it/s]

tensor(84.3255, device='cuda:0', grad_fn=<NllLossBackward0>)


1287it [07:52,  2.75it/s]

tensor(95.4336, device='cuda:0', grad_fn=<NllLossBackward0>)


1288it [07:52,  2.75it/s]

tensor(240.4679, device='cuda:0', grad_fn=<NllLossBackward0>)


1289it [07:53,  2.75it/s]

tensor(59.9483, device='cuda:0', grad_fn=<NllLossBackward0>)


1290it [07:53,  2.75it/s]

tensor(83.7348, device='cuda:0', grad_fn=<NllLossBackward0>)


1291it [07:53,  2.75it/s]

tensor(104.6751, device='cuda:0', grad_fn=<NllLossBackward0>)


1292it [07:54,  2.75it/s]

tensor(62.2028, device='cuda:0', grad_fn=<NllLossBackward0>)


1293it [07:54,  2.75it/s]

tensor(73.7907, device='cuda:0', grad_fn=<NllLossBackward0>)


1294it [07:55,  2.75it/s]

tensor(82.6963, device='cuda:0', grad_fn=<NllLossBackward0>)


1295it [07:55,  2.75it/s]

tensor(105.9773, device='cuda:0', grad_fn=<NllLossBackward0>)


1296it [07:55,  2.75it/s]

tensor(53.3546, device='cuda:0', grad_fn=<NllLossBackward0>)


1297it [07:56,  2.75it/s]

tensor(98.0381, device='cuda:0', grad_fn=<NllLossBackward0>)


1298it [07:56,  2.73it/s]

tensor(85.1431, device='cuda:0', grad_fn=<NllLossBackward0>)


1299it [07:56,  2.73it/s]

tensor(111.9187, device='cuda:0', grad_fn=<NllLossBackward0>)


1300it [07:57,  2.73it/s]

tensor(191.5506, device='cuda:0', grad_fn=<NllLossBackward0>)


1301it [07:57,  2.45it/s]

tensor(90.5963, device='cuda:0', grad_fn=<NllLossBackward0>)


1302it [07:58,  2.54it/s]

tensor(93.4019, device='cuda:0', grad_fn=<NllLossBackward0>)


1303it [07:58,  2.59it/s]

tensor(79.3141, device='cuda:0', grad_fn=<NllLossBackward0>)


1304it [07:58,  2.64it/s]

tensor(63.3854, device='cuda:0', grad_fn=<NllLossBackward0>)


1305it [07:59,  2.67it/s]

tensor(94.3359, device='cuda:0', grad_fn=<NllLossBackward0>)


1306it [07:59,  2.69it/s]

tensor(76.3955, device='cuda:0', grad_fn=<NllLossBackward0>)


1307it [07:59,  2.71it/s]

tensor(80.3186, device='cuda:0', grad_fn=<NllLossBackward0>)


1308it [08:00,  2.72it/s]

tensor(69.0288, device='cuda:0', grad_fn=<NllLossBackward0>)


1309it [08:00,  2.73it/s]

tensor(70.7270, device='cuda:0', grad_fn=<NllLossBackward0>)


1310it [08:01,  2.73it/s]

tensor(94.9702, device='cuda:0', grad_fn=<NllLossBackward0>)


1311it [08:01,  2.73it/s]

tensor(171.3289, device='cuda:0', grad_fn=<NllLossBackward0>)


1312it [08:01,  2.73it/s]

tensor(90.0561, device='cuda:0', grad_fn=<NllLossBackward0>)


1313it [08:02,  2.73it/s]

tensor(72.6882, device='cuda:0', grad_fn=<NllLossBackward0>)


1314it [08:02,  2.73it/s]

tensor(78.7777, device='cuda:0', grad_fn=<NllLossBackward0>)


1315it [08:02,  2.74it/s]

tensor(163.7983, device='cuda:0', grad_fn=<NllLossBackward0>)


1316it [08:03,  2.74it/s]

tensor(892.9861, device='cuda:0', grad_fn=<NllLossBackward0>)


1317it [08:03,  2.74it/s]

tensor(77.0718, device='cuda:0', grad_fn=<NllLossBackward0>)


1318it [08:03,  2.74it/s]

tensor(75.2298, device='cuda:0', grad_fn=<NllLossBackward0>)


1319it [08:04,  2.74it/s]

tensor(92.6800, device='cuda:0', grad_fn=<NllLossBackward0>)


1320it [08:04,  2.73it/s]

tensor(91.7103, device='cuda:0', grad_fn=<NllLossBackward0>)


1321it [08:05,  2.74it/s]

tensor(76.3169, device='cuda:0', grad_fn=<NllLossBackward0>)


1322it [08:05,  2.74it/s]

tensor(89.3333, device='cuda:0', grad_fn=<NllLossBackward0>)


1323it [08:05,  2.74it/s]

tensor(74.2801, device='cuda:0', grad_fn=<NllLossBackward0>)


1324it [08:06,  2.74it/s]

tensor(105.3538, device='cuda:0', grad_fn=<NllLossBackward0>)


1325it [08:06,  2.74it/s]

tensor(78.3180, device='cuda:0', grad_fn=<NllLossBackward0>)


1326it [08:06,  2.74it/s]

tensor(104.2723, device='cuda:0', grad_fn=<NllLossBackward0>)


1327it [08:07,  2.75it/s]

tensor(76.2888, device='cuda:0', grad_fn=<NllLossBackward0>)


1328it [08:07,  2.75it/s]

tensor(100.5416, device='cuda:0', grad_fn=<NllLossBackward0>)


1329it [08:07,  2.75it/s]

tensor(92.7615, device='cuda:0', grad_fn=<NllLossBackward0>)


1330it [08:08,  2.75it/s]

tensor(84.5276, device='cuda:0', grad_fn=<NllLossBackward0>)


1331it [08:08,  2.74it/s]

tensor(82.4610, device='cuda:0', grad_fn=<NllLossBackward0>)


1332it [08:09,  2.74it/s]

tensor(78.1085, device='cuda:0', grad_fn=<NllLossBackward0>)


1333it [08:09,  2.75it/s]

tensor(76.9766, device='cuda:0', grad_fn=<NllLossBackward0>)


1334it [08:09,  2.75it/s]

tensor(93.6352, device='cuda:0', grad_fn=<NllLossBackward0>)


1335it [08:10,  2.75it/s]

tensor(86.7548, device='cuda:0', grad_fn=<NllLossBackward0>)


1336it [08:10,  2.76it/s]

tensor(70.6143, device='cuda:0', grad_fn=<NllLossBackward0>)


1337it [08:10,  2.74it/s]

tensor(69.7303, device='cuda:0', grad_fn=<NllLossBackward0>)


1338it [08:11,  2.75it/s]

tensor(92.4254, device='cuda:0', grad_fn=<NllLossBackward0>)


1339it [08:11,  2.76it/s]

tensor(120.9648, device='cuda:0', grad_fn=<NllLossBackward0>)


1340it [08:11,  2.75it/s]

tensor(73.0766, device='cuda:0', grad_fn=<NllLossBackward0>)


1341it [08:12,  2.75it/s]

tensor(70.1598, device='cuda:0', grad_fn=<NllLossBackward0>)


1342it [08:12,  2.75it/s]

tensor(75.2698, device='cuda:0', grad_fn=<NllLossBackward0>)


1343it [08:13,  2.75it/s]

tensor(77.8218, device='cuda:0', grad_fn=<NllLossBackward0>)


1344it [08:13,  2.74it/s]

tensor(59.2267, device='cuda:0', grad_fn=<NllLossBackward0>)


1345it [08:13,  2.75it/s]

tensor(80.7853, device='cuda:0', grad_fn=<NllLossBackward0>)


1346it [08:14,  2.75it/s]

tensor(77.9526, device='cuda:0', grad_fn=<NllLossBackward0>)


1347it [08:14,  2.75it/s]

tensor(69.0541, device='cuda:0', grad_fn=<NllLossBackward0>)


1348it [08:14,  2.75it/s]

tensor(56.5496, device='cuda:0', grad_fn=<NllLossBackward0>)


1349it [08:15,  2.75it/s]

tensor(72.7936, device='cuda:0', grad_fn=<NllLossBackward0>)


1350it [08:15,  2.73it/s]

tensor(68.8483, device='cuda:0', grad_fn=<NllLossBackward0>)


1351it [08:15,  2.74it/s]

tensor(68.4441, device='cuda:0', grad_fn=<NllLossBackward0>)


1352it [08:16,  2.74it/s]

tensor(60.9927, device='cuda:0', grad_fn=<NllLossBackward0>)


1353it [08:16,  2.75it/s]

tensor(62.0300, device='cuda:0', grad_fn=<NllLossBackward0>)


1354it [08:17,  2.74it/s]

tensor(75.3939, device='cuda:0', grad_fn=<NllLossBackward0>)


1355it [08:17,  2.74it/s]

tensor(76.0284, device='cuda:0', grad_fn=<NllLossBackward0>)


1356it [08:17,  2.75it/s]

tensor(63.9743, device='cuda:0', grad_fn=<NllLossBackward0>)


1357it [08:18,  2.75it/s]

tensor(79.9033, device='cuda:0', grad_fn=<NllLossBackward0>)


1358it [08:18,  2.74it/s]

tensor(78.5762, device='cuda:0', grad_fn=<NllLossBackward0>)


1359it [08:18,  2.74it/s]

tensor(66.4336, device='cuda:0', grad_fn=<NllLossBackward0>)


1360it [08:19,  2.74it/s]

tensor(57.8700, device='cuda:0', grad_fn=<NllLossBackward0>)


1361it [08:19,  2.73it/s]

tensor(97.3728, device='cuda:0', grad_fn=<NllLossBackward0>)


1362it [08:20,  2.73it/s]

tensor(65.0188, device='cuda:0', grad_fn=<NllLossBackward0>)


1363it [08:20,  2.73it/s]

tensor(70.9333, device='cuda:0', grad_fn=<NllLossBackward0>)


1364it [08:20,  2.74it/s]

tensor(68.1290, device='cuda:0', grad_fn=<NllLossBackward0>)


1365it [08:21,  2.74it/s]

tensor(79.9935, device='cuda:0', grad_fn=<NllLossBackward0>)


1366it [08:21,  2.74it/s]

tensor(73.1127, device='cuda:0', grad_fn=<NllLossBackward0>)


1367it [08:21,  2.74it/s]

tensor(82.2762, device='cuda:0', grad_fn=<NllLossBackward0>)


1368it [08:22,  2.74it/s]

tensor(66.1015, device='cuda:0', grad_fn=<NllLossBackward0>)


1369it [08:22,  2.74it/s]

tensor(59.3118, device='cuda:0', grad_fn=<NllLossBackward0>)


1370it [08:22,  2.74it/s]

tensor(49.8711, device='cuda:0', grad_fn=<NllLossBackward0>)


1371it [08:23,  2.74it/s]

tensor(64.7921, device='cuda:0', grad_fn=<NllLossBackward0>)


1372it [08:23,  2.73it/s]

tensor(67.7800, device='cuda:0', grad_fn=<NllLossBackward0>)


1373it [08:24,  2.74it/s]

tensor(41.0905, device='cuda:0', grad_fn=<NllLossBackward0>)


1374it [08:24,  2.71it/s]

tensor(62.7188, device='cuda:0', grad_fn=<NllLossBackward0>)


1375it [08:24,  2.72it/s]

tensor(52.2449, device='cuda:0', grad_fn=<NllLossBackward0>)


1376it [08:25,  2.73it/s]

tensor(47.7174, device='cuda:0', grad_fn=<NllLossBackward0>)


1377it [08:25,  2.73it/s]

tensor(49.7877, device='cuda:0', grad_fn=<NllLossBackward0>)


1378it [08:25,  2.74it/s]

tensor(62.2197, device='cuda:0', grad_fn=<NllLossBackward0>)


1379it [08:26,  2.74it/s]

tensor(66.3974, device='cuda:0', grad_fn=<NllLossBackward0>)


1380it [08:26,  2.74it/s]

tensor(86.5726, device='cuda:0', grad_fn=<NllLossBackward0>)


1381it [08:26,  2.74it/s]

tensor(69.9406, device='cuda:0', grad_fn=<NllLossBackward0>)


1382it [08:27,  2.75it/s]

tensor(60.7185, device='cuda:0', grad_fn=<NllLossBackward0>)


1383it [08:27,  2.75it/s]

tensor(92.6858, device='cuda:0', grad_fn=<NllLossBackward0>)


1384it [08:28,  2.74it/s]

tensor(58.0668, device='cuda:0', grad_fn=<NllLossBackward0>)


1385it [08:28,  2.74it/s]

tensor(40.2219, device='cuda:0', grad_fn=<NllLossBackward0>)


1386it [08:28,  2.75it/s]

tensor(68.6328, device='cuda:0', grad_fn=<NllLossBackward0>)


1387it [08:29,  2.74it/s]

tensor(37.3220, device='cuda:0', grad_fn=<NllLossBackward0>)


1388it [08:29,  2.75it/s]

tensor(59.4559, device='cuda:0', grad_fn=<NllLossBackward0>)


1389it [08:29,  2.75it/s]

tensor(42.9185, device='cuda:0', grad_fn=<NllLossBackward0>)


1390it [08:30,  2.76it/s]

tensor(34.0120, device='cuda:0', grad_fn=<NllLossBackward0>)


1391it [08:30,  2.75it/s]

tensor(49.9802, device='cuda:0', grad_fn=<NllLossBackward0>)


1392it [08:30,  2.75it/s]

tensor(44.9170, device='cuda:0', grad_fn=<NllLossBackward0>)


1393it [08:31,  2.65it/s]

tensor(46.4180, device='cuda:0', grad_fn=<NllLossBackward0>)


1394it [08:31,  2.68it/s]

tensor(45.1993, device='cuda:0', grad_fn=<NllLossBackward0>)


1395it [08:32,  2.70it/s]

tensor(56.6638, device='cuda:0', grad_fn=<NllLossBackward0>)


1396it [08:32,  2.72it/s]

tensor(47.6203, device='cuda:0', grad_fn=<NllLossBackward0>)


1397it [08:32,  2.73it/s]

tensor(190.0423, device='cuda:0', grad_fn=<NllLossBackward0>)


1398it [08:33,  2.73it/s]

tensor(43.8990, device='cuda:0', grad_fn=<NllLossBackward0>)


1399it [08:33,  2.74it/s]

tensor(118.2373, device='cuda:0', grad_fn=<NllLossBackward0>)


1400it [08:33,  2.75it/s]

tensor(49.0921, device='cuda:0', grad_fn=<NllLossBackward0>)


1401it [08:34,  2.74it/s]

tensor(42.1101, device='cuda:0', grad_fn=<NllLossBackward0>)


1402it [08:34,  2.74it/s]

tensor(57.6178, device='cuda:0', grad_fn=<NllLossBackward0>)


1403it [08:35,  2.74it/s]

tensor(53.0541, device='cuda:0', grad_fn=<NllLossBackward0>)


1404it [08:35,  2.74it/s]

tensor(35.5742, device='cuda:0', grad_fn=<NllLossBackward0>)


1405it [08:35,  2.74it/s]

tensor(81.6519, device='cuda:0', grad_fn=<NllLossBackward0>)


1406it [08:36,  2.74it/s]

tensor(38.7011, device='cuda:0', grad_fn=<NllLossBackward0>)


1407it [08:36,  2.74it/s]

tensor(35.4999, device='cuda:0', grad_fn=<NllLossBackward0>)


1408it [08:36,  2.74it/s]

tensor(47.3138, device='cuda:0', grad_fn=<NllLossBackward0>)


1409it [08:37,  2.73it/s]

tensor(39.2428, device='cuda:0', grad_fn=<NllLossBackward0>)


1410it [08:37,  2.74it/s]

tensor(41.0764, device='cuda:0', grad_fn=<NllLossBackward0>)


1411it [08:37,  2.74it/s]

tensor(29.2143, device='cuda:0', grad_fn=<NllLossBackward0>)


1412it [08:38,  2.75it/s]

tensor(49.9516, device='cuda:0', grad_fn=<NllLossBackward0>)


1413it [08:38,  2.74it/s]

tensor(63.4463, device='cuda:0', grad_fn=<NllLossBackward0>)


1414it [08:39,  2.75it/s]

tensor(84.6659, device='cuda:0', grad_fn=<NllLossBackward0>)


1415it [08:39,  2.75it/s]

tensor(31.4324, device='cuda:0', grad_fn=<NllLossBackward0>)


1416it [08:39,  2.75it/s]

tensor(59.3847, device='cuda:0', grad_fn=<NllLossBackward0>)


1417it [08:40,  2.75it/s]

tensor(49.8295, device='cuda:0', grad_fn=<NllLossBackward0>)


1418it [08:40,  2.75it/s]

tensor(53.5452, device='cuda:0', grad_fn=<NllLossBackward0>)


1419it [08:40,  2.75it/s]

tensor(27.7093, device='cuda:0', grad_fn=<NllLossBackward0>)


1420it [08:41,  2.75it/s]

tensor(38.3190, device='cuda:0', grad_fn=<NllLossBackward0>)


1421it [08:41,  2.75it/s]

tensor(53.0482, device='cuda:0', grad_fn=<NllLossBackward0>)


1422it [08:41,  2.76it/s]

tensor(29.4137, device='cuda:0', grad_fn=<NllLossBackward0>)


1423it [08:42,  2.76it/s]

tensor(77.3553, device='cuda:0', grad_fn=<NllLossBackward0>)


1424it [08:42,  2.76it/s]

tensor(33.8121, device='cuda:0', grad_fn=<NllLossBackward0>)


1425it [08:43,  2.76it/s]

tensor(30.2944, device='cuda:0', grad_fn=<NllLossBackward0>)


1426it [08:43,  2.76it/s]

tensor(42.2563, device='cuda:0', grad_fn=<NllLossBackward0>)


1427it [08:43,  2.76it/s]

tensor(29.8935, device='cuda:0', grad_fn=<NllLossBackward0>)


1428it [08:44,  2.76it/s]

tensor(183.2982, device='cuda:0', grad_fn=<NllLossBackward0>)


1429it [08:44,  2.76it/s]

tensor(62.9956, device='cuda:0', grad_fn=<NllLossBackward0>)


1430it [08:44,  2.76it/s]

tensor(47.9561, device='cuda:0', grad_fn=<NllLossBackward0>)


1431it [08:45,  2.75it/s]

tensor(52.0176, device='cuda:0', grad_fn=<NllLossBackward0>)


1432it [08:45,  2.75it/s]

tensor(58.7987, device='cuda:0', grad_fn=<NllLossBackward0>)


1433it [08:45,  2.75it/s]

tensor(33.0146, device='cuda:0', grad_fn=<NllLossBackward0>)


1434it [08:46,  2.74it/s]

tensor(55.0507, device='cuda:0', grad_fn=<NllLossBackward0>)


1435it [08:46,  2.74it/s]

tensor(58.1965, device='cuda:0', grad_fn=<NllLossBackward0>)


1436it [08:47,  2.75it/s]

tensor(55.8910, device='cuda:0', grad_fn=<NllLossBackward0>)


1437it [08:47,  2.74it/s]

tensor(43.4912, device='cuda:0', grad_fn=<NllLossBackward0>)


1438it [08:47,  2.74it/s]

tensor(29.3248, device='cuda:0', grad_fn=<NllLossBackward0>)


1439it [08:48,  2.75it/s]

tensor(48.7649, device='cuda:0', grad_fn=<NllLossBackward0>)


1440it [08:48,  2.75it/s]

tensor(21.5347, device='cuda:0', grad_fn=<NllLossBackward0>)


1441it [08:48,  2.75it/s]

tensor(21.3654, device='cuda:0', grad_fn=<NllLossBackward0>)


1442it [08:49,  2.74it/s]

tensor(48.8986, device='cuda:0', grad_fn=<NllLossBackward0>)


1443it [08:49,  2.73it/s]

tensor(26.5851, device='cuda:0', grad_fn=<NllLossBackward0>)


1444it [08:49,  2.74it/s]

tensor(37.5241, device='cuda:0', grad_fn=<NllLossBackward0>)


1445it [08:50,  2.74it/s]

tensor(54.3743, device='cuda:0', grad_fn=<NllLossBackward0>)


1446it [08:50,  2.75it/s]

tensor(28.5097, device='cuda:0', grad_fn=<NllLossBackward0>)


1447it [08:51,  2.75it/s]

tensor(61.3283, device='cuda:0', grad_fn=<NllLossBackward0>)


1448it [08:51,  2.74it/s]

tensor(20.9990, device='cuda:0', grad_fn=<NllLossBackward0>)


1449it [08:51,  2.74it/s]

tensor(51.5442, device='cuda:0', grad_fn=<NllLossBackward0>)


1450it [08:52,  2.75it/s]

tensor(28.1956, device='cuda:0', grad_fn=<NllLossBackward0>)


1451it [08:52,  2.75it/s]

tensor(40.5592, device='cuda:0', grad_fn=<NllLossBackward0>)


1452it [08:52,  2.75it/s]

tensor(29.2391, device='cuda:0', grad_fn=<NllLossBackward0>)


1453it [08:53,  2.75it/s]

tensor(21.6501, device='cuda:0', grad_fn=<NllLossBackward0>)


1454it [08:53,  2.75it/s]

tensor(34.0258, device='cuda:0', grad_fn=<NllLossBackward0>)


1455it [08:53,  2.75it/s]

tensor(272.5424, device='cuda:0', grad_fn=<NllLossBackward0>)


1456it [08:54,  2.74it/s]

tensor(23.4556, device='cuda:0', grad_fn=<NllLossBackward0>)


1457it [08:54,  2.75it/s]

tensor(24.2328, device='cuda:0', grad_fn=<NllLossBackward0>)


1458it [08:55,  2.75it/s]

tensor(44.5580, device='cuda:0', grad_fn=<NllLossBackward0>)


1459it [08:55,  2.74it/s]

tensor(20.8539, device='cuda:0', grad_fn=<NllLossBackward0>)


1460it [08:55,  2.74it/s]

tensor(21.3731, device='cuda:0', grad_fn=<NllLossBackward0>)


1461it [08:56,  2.75it/s]

tensor(14.4112, device='cuda:0', grad_fn=<NllLossBackward0>)


1462it [08:56,  2.75it/s]

tensor(51.3574, device='cuda:0', grad_fn=<NllLossBackward0>)


1463it [08:56,  2.75it/s]

tensor(57.8126, device='cuda:0', grad_fn=<NllLossBackward0>)


1464it [08:57,  2.74it/s]

tensor(22.1026, device='cuda:0', grad_fn=<NllLossBackward0>)


1465it [08:57,  2.75it/s]

tensor(44.8734, device='cuda:0', grad_fn=<NllLossBackward0>)


1466it [08:57,  2.74it/s]

tensor(33.8457, device='cuda:0', grad_fn=<NllLossBackward0>)


1467it [08:58,  2.74it/s]

tensor(12.0906, device='cuda:0', grad_fn=<NllLossBackward0>)


1468it [08:58,  2.74it/s]

tensor(43.0337, device='cuda:0', grad_fn=<NllLossBackward0>)


1469it [08:59,  2.74it/s]

tensor(33.4930, device='cuda:0', grad_fn=<NllLossBackward0>)


1470it [08:59,  2.75it/s]

tensor(39.5388, device='cuda:0', grad_fn=<NllLossBackward0>)


1471it [08:59,  2.75it/s]

tensor(37.9884, device='cuda:0', grad_fn=<NllLossBackward0>)


1472it [09:00,  2.75it/s]

tensor(44.5593, device='cuda:0', grad_fn=<NllLossBackward0>)


1473it [09:00,  2.75it/s]

tensor(17.4206, device='cuda:0', grad_fn=<NllLossBackward0>)


1474it [09:00,  2.74it/s]

tensor(36.3486, device='cuda:0', grad_fn=<NllLossBackward0>)


1475it [09:01,  2.75it/s]

tensor(19.2759, device='cuda:0', grad_fn=<NllLossBackward0>)


1476it [09:01,  2.76it/s]

tensor(27.2335, device='cuda:0', grad_fn=<NllLossBackward0>)


1477it [09:01,  2.77it/s]

tensor(24.5454, device='cuda:0', grad_fn=<NllLossBackward0>)


1478it [09:02,  2.77it/s]

tensor(21.0473, device='cuda:0', grad_fn=<NllLossBackward0>)


1479it [09:02,  2.76it/s]

tensor(56.7768, device='cuda:0', grad_fn=<NllLossBackward0>)


1480it [09:03,  2.75it/s]

tensor(28.4866, device='cuda:0', grad_fn=<NllLossBackward0>)


1481it [09:03,  2.75it/s]

tensor(59.2025, device='cuda:0', grad_fn=<NllLossBackward0>)


1482it [09:03,  2.74it/s]

tensor(30.8264, device='cuda:0', grad_fn=<NllLossBackward0>)


1483it [09:04,  2.75it/s]

tensor(86.0284, device='cuda:0', grad_fn=<NllLossBackward0>)


1484it [09:04,  2.75it/s]

tensor(24.2062, device='cuda:0', grad_fn=<NllLossBackward0>)


1485it [09:04,  2.75it/s]

tensor(59.0205, device='cuda:0', grad_fn=<NllLossBackward0>)


1486it [09:05,  2.75it/s]

tensor(17.1377, device='cuda:0', grad_fn=<NllLossBackward0>)


1487it [09:05,  2.76it/s]

tensor(19.1861, device='cuda:0', grad_fn=<NllLossBackward0>)


1488it [09:05,  2.75it/s]

tensor(55.5599, device='cuda:0', grad_fn=<NllLossBackward0>)


1489it [09:06,  2.75it/s]

tensor(32.7504, device='cuda:0', grad_fn=<NllLossBackward0>)


1490it [09:06,  2.75it/s]

tensor(13.7530, device='cuda:0', grad_fn=<NllLossBackward0>)


1491it [09:07,  2.75it/s]

tensor(37.0148, device='cuda:0', grad_fn=<NllLossBackward0>)


1492it [09:07,  2.76it/s]

tensor(13.9593, device='cuda:0', grad_fn=<NllLossBackward0>)


1493it [09:07,  2.76it/s]

tensor(34.9872, device='cuda:0', grad_fn=<NllLossBackward0>)


1494it [09:08,  2.76it/s]

tensor(34.2387, device='cuda:0', grad_fn=<NllLossBackward0>)


1495it [09:08,  2.76it/s]

tensor(275.9581, device='cuda:0', grad_fn=<NllLossBackward0>)


1496it [09:08,  2.76it/s]

tensor(47.8287, device='cuda:0', grad_fn=<NllLossBackward0>)


1497it [09:09,  2.77it/s]

tensor(12.4854, device='cuda:0', grad_fn=<NllLossBackward0>)


1498it [09:09,  2.76it/s]

tensor(75.2090, device='cuda:0', grad_fn=<NllLossBackward0>)


1499it [09:09,  2.76it/s]

tensor(10.3038, device='cuda:0', grad_fn=<NllLossBackward0>)


1500it [09:10,  2.76it/s]

tensor(46.5181, device='cuda:0', grad_fn=<NllLossBackward0>)


1501it [09:10,  2.76it/s]

tensor(21.8857, device='cuda:0', grad_fn=<NllLossBackward0>)


1502it [09:10,  2.76it/s]

tensor(70.2564, device='cuda:0', grad_fn=<NllLossBackward0>)


1503it [09:11,  2.76it/s]

tensor(36.9042, device='cuda:0', grad_fn=<NllLossBackward0>)


1504it [09:11,  2.77it/s]

tensor(29.6809, device='cuda:0', grad_fn=<NllLossBackward0>)


1505it [09:12,  2.77it/s]

tensor(23.8054, device='cuda:0', grad_fn=<NllLossBackward0>)


1506it [09:12,  2.77it/s]

tensor(23.4215, device='cuda:0', grad_fn=<NllLossBackward0>)


1507it [09:12,  2.75it/s]

tensor(37.3512, device='cuda:0', grad_fn=<NllLossBackward0>)


1508it [09:13,  2.76it/s]

tensor(61.2858, device='cuda:0', grad_fn=<NllLossBackward0>)


1509it [09:13,  2.75it/s]

tensor(29.0353, device='cuda:0', grad_fn=<NllLossBackward0>)


1510it [09:13,  2.75it/s]

tensor(14.9028, device='cuda:0', grad_fn=<NllLossBackward0>)


1511it [09:14,  2.74it/s]

tensor(29.2535, device='cuda:0', grad_fn=<NllLossBackward0>)


1512it [09:14,  2.74it/s]

tensor(17.1337, device='cuda:0', grad_fn=<NllLossBackward0>)


1513it [09:15,  2.73it/s]

tensor(27.5477, device='cuda:0', grad_fn=<NllLossBackward0>)


1514it [09:15,  2.73it/s]

tensor(44.8234, device='cuda:0', grad_fn=<NllLossBackward0>)


1515it [09:15,  2.73it/s]

tensor(79.7179, device='cuda:0', grad_fn=<NllLossBackward0>)


1516it [09:16,  2.73it/s]

tensor(9.0059, device='cuda:0', grad_fn=<NllLossBackward0>)


1517it [09:16,  2.72it/s]

tensor(50.2536, device='cuda:0', grad_fn=<NllLossBackward0>)


1518it [09:16,  2.72it/s]

tensor(27.3780, device='cuda:0', grad_fn=<NllLossBackward0>)


1519it [09:17,  2.72it/s]

tensor(55.3575, device='cuda:0', grad_fn=<NllLossBackward0>)


1520it [09:17,  2.41it/s]

tensor(23.6167, device='cuda:0', grad_fn=<NllLossBackward0>)


1521it [09:18,  2.49it/s]

tensor(38.2806, device='cuda:0', grad_fn=<NllLossBackward0>)


1522it [09:18,  2.54it/s]

tensor(62.2467, device='cuda:0', grad_fn=<NllLossBackward0>)


1523it [09:18,  2.60it/s]

tensor(33.7991, device='cuda:0', grad_fn=<NllLossBackward0>)


1524it [09:19,  2.63it/s]

tensor(19.1503, device='cuda:0', grad_fn=<NllLossBackward0>)


1525it [09:19,  2.67it/s]

tensor(72.7130, device='cuda:0', grad_fn=<NllLossBackward0>)


1526it [09:19,  2.70it/s]

tensor(7.6471, device='cuda:0', grad_fn=<NllLossBackward0>)


1527it [09:20,  2.71it/s]

tensor(11.1529, device='cuda:0', grad_fn=<NllLossBackward0>)


1528it [09:20,  2.73it/s]

tensor(32.5312, device='cuda:0', grad_fn=<NllLossBackward0>)


1529it [09:21,  2.74it/s]

tensor(26.4254, device='cuda:0', grad_fn=<NllLossBackward0>)


1530it [09:21,  2.75it/s]

tensor(11.0107, device='cuda:0', grad_fn=<NllLossBackward0>)


1531it [09:21,  2.76it/s]

tensor(30.5833, device='cuda:0', grad_fn=<NllLossBackward0>)


1532it [09:22,  2.75it/s]

tensor(48.8090, device='cuda:0', grad_fn=<NllLossBackward0>)


1533it [09:22,  2.74it/s]

tensor(30.2288, device='cuda:0', grad_fn=<NllLossBackward0>)


1534it [09:22,  2.75it/s]

tensor(24.5759, device='cuda:0', grad_fn=<NllLossBackward0>)


1535it [09:23,  2.75it/s]

tensor(11.1351, device='cuda:0', grad_fn=<NllLossBackward0>)


1536it [09:23,  2.65it/s]

tensor(20.8693, device='cuda:0', grad_fn=<NllLossBackward0>)


1537it [09:23,  2.68it/s]

tensor(52.9066, device='cuda:0', grad_fn=<NllLossBackward0>)


1538it [09:24,  2.70it/s]

tensor(51.4073, device='cuda:0', grad_fn=<NllLossBackward0>)


1539it [09:24,  2.72it/s]

tensor(17.8008, device='cuda:0', grad_fn=<NllLossBackward0>)


1540it [09:25,  2.72it/s]

tensor(59.7670, device='cuda:0', grad_fn=<NllLossBackward0>)


1541it [09:25,  2.72it/s]

tensor(42.9995, device='cuda:0', grad_fn=<NllLossBackward0>)


1542it [09:25,  2.73it/s]

tensor(23.1785, device='cuda:0', grad_fn=<NllLossBackward0>)


1543it [09:26,  2.74it/s]

tensor(15.7708, device='cuda:0', grad_fn=<NllLossBackward0>)


1544it [09:26,  2.72it/s]

tensor(31.8572, device='cuda:0', grad_fn=<NllLossBackward0>)


1545it [09:26,  2.71it/s]

tensor(20.8612, device='cuda:0', grad_fn=<NllLossBackward0>)


1546it [09:27,  2.73it/s]

tensor(11.9945, device='cuda:0', grad_fn=<NllLossBackward0>)


1547it [09:27,  2.73it/s]

tensor(20.0567, device='cuda:0', grad_fn=<NllLossBackward0>)


1548it [09:27,  2.74it/s]

tensor(13.2453, device='cuda:0', grad_fn=<NllLossBackward0>)


1549it [09:28,  1.93it/s]

tensor(43.9204, device='cuda:0', grad_fn=<NllLossBackward0>)


1550it [09:29,  2.12it/s]

tensor(5.5715, device='cuda:0', grad_fn=<NllLossBackward0>)


1551it [09:29,  2.28it/s]

tensor(32.2109, device='cuda:0', grad_fn=<NllLossBackward0>)


1552it [09:29,  2.41it/s]

tensor(84.4735, device='cuda:0', grad_fn=<NllLossBackward0>)


1553it [09:30,  2.49it/s]

tensor(8.2049, device='cuda:0', grad_fn=<NllLossBackward0>)


1554it [09:30,  2.56it/s]

tensor(19.2333, device='cuda:0', grad_fn=<NllLossBackward0>)


1555it [09:31,  2.61it/s]

tensor(20.3642, device='cuda:0', grad_fn=<NllLossBackward0>)


1556it [09:31,  2.66it/s]

tensor(12.2105, device='cuda:0', grad_fn=<NllLossBackward0>)


1557it [09:31,  2.69it/s]

tensor(11.0804, device='cuda:0', grad_fn=<NllLossBackward0>)


1558it [09:32,  2.72it/s]

tensor(26.0124, device='cuda:0', grad_fn=<NllLossBackward0>)


1559it [09:32,  2.73it/s]

tensor(29.7033, device='cuda:0', grad_fn=<NllLossBackward0>)


1560it [09:32,  2.74it/s]

tensor(18.6193, device='cuda:0', grad_fn=<NllLossBackward0>)


1561it [09:33,  2.75it/s]

tensor(27.8871, device='cuda:0', grad_fn=<NllLossBackward0>)


1562it [09:33,  2.75it/s]

tensor(25.1465, device='cuda:0', grad_fn=<NllLossBackward0>)


1562it [09:33,  2.72it/s]

tensor(37.4694, device='cuda:0', grad_fn=<NllLossBackward0>)





KeyboardInterrupt: 

In [22]:
prompt = "i saw a cat "
batch = tokenizer(prompt, return_tensors='pt').to('cuda')


# with torch.no_grad():
with torch.cuda.amp.autocast():
    output_tokens = gpt.generate(**batch, min_length=30, max_length=60, do_sample=True)

print('\n\n', tokenizer.decode(output_tokens[0].cpu().numpy()))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




 i saw a cat                                     
from __future.# Licensed_# Modified.# Copyright, Version. import_



This training loop is only a proof of concept: it shows that even the heaviest configuration still fits on a GPU.
Depending on your fine-tuning task, you will need to remove some parts.
Below we explain how to modify the code to match the setup from the [LoRA paper](https://arxiv.org/pdf/2106.09685.pdf).

If you want to fine-tune à la LoRA, use the hyperparameters from Tables 11, 12 and 15 of the paper as a starting point:

(1) Train only the adapter matrices from attention layers

In the example above, we train adapters of every kind, as well as the layernorm scales and biases. This is only useful when fine-tuning on a reasonably large dataset for a long time.
For quick setups, you should tag everything except **the attention adapters** as `requires_grad=False`, or simply not pass the other parameters to Adam:

```

params_for_optimizer = [
    param for name, param in model.named_parameters()
    if "attn" in name and "adapter" in name
]
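print("Trainable params:", len(params_for_optimizer))

# and after you verified it:
# (compare by identity: `param not in params_for_optimizer` would compare tensors element-wise)
trainable_ids = {id(param) for param in params_for_optimizer}
for name, param in model.named_parameters():
    if id(param) not in trainable_ids: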
        print(f"Setting {name} requires_grad=False")
        param.requires_grad = False
```

An even better way is to create only the adapters you need by modifying the `add_adapters` function above:
```
for name, module in model.named_modules():
    if isinstance(module, (FrozenBNBLinear, FrozenBNBEmbedding)):
        if "attn" in name:
            print("Adding adapter to", name)

            todo_initialize_adapters_like_in_notebook()
        else:
            print("Not adding adapter to", name)
```
As a side effect, this somewhat reduces memory usage and may let you fit longer sequences (e.g. 256 tokens).
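
To check the name-based selection from step (1) without loading GPT-J, here is a minimal sketch on a toy model. `ToyLayer` and `ToyBlock` are hypothetical stand-ins, not classes from this notebook; we only assume that adapters live under an attribute called `adapter` and that attention modules have `attn` in their name.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Hypothetical stand-in for a frozen linear layer with a low-rank adapter."""

    def __init__(self, dim=8, rank=2):
        super().__init__()
        # frozen "pretrained" weight (in the notebook this would be the 8-bit weight)
        self.weight = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        self.adapter = nn.Sequential(
            nn.Linear(dim, rank, bias=False),
            nn.Linear(rank, dim, bias=False),
        )

    def forward(self, x):
        return x @ self.weight.t() + self.adapter(x)


class ToyBlock(nn.Module):
    """Hypothetical transformer-like block: attention, MLP and layernorm."""

    def __init__(self):
        super().__init__()
        self.attn = ToyLayer()
        self.mlp = ToyLayer()
        self.ln = nn.LayerNorm(8)


model = ToyBlock()

# select parameters by name, exactly as in step (1)
trainable_names = {
    name for name, _ in model.named_parameters()
    if "attn" in name and "adapter" in name
}

# freeze everything else; membership is tested on names, not on tensors
for name, param in model.named_parameters():
    param.requires_grad = name in trainable_names

params_for_optimizer = [p for p in model.parameters() if p.requires_grad]
print(sorted(trainable_names))
```

Printing the selected names before building the optimizer is a cheap way to catch a filter that matches nothing, or one that matches more than you intended.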


(2) Initialize the second adapter matrix with zeros:
```
for name, module in model.named_modules():
    if hasattr(module, "adapter"):
        print("Initializing", name)
        nn.init.zeros_(module.adapter[1].weight)
        # optional: scale adapter[0].weight by (LoRA_alpha / r)
```
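
Why zeros? With the second matrix at zero, the adapter contributes nothing at the start, so the model initially behaves exactly like the pretrained one, yet gradients still reach the second matrix. The sketch below illustrates this on a standalone adapter. The shapes are arbitrary, and the two-layer `nn.Sequential` layout is an assumption based on the `module.adapter[1]` indexing above.

```python
import torch
from torch import nn

torch.manual_seed(0)
in_features, out_features, rank = 16, 16, 4  # arbitrary toy sizes

# assumed adapter layout: down-projection followed by up-projection
adapter = nn.Sequential(
    nn.Linear(in_features, rank, bias=False),
    nn.Linear(rank, out_features, bias=False),
)
nn.init.zeros_(adapter[1].weight)

x = torch.randn(3, in_features)
delta = adapter(x)  # what the adapter adds to the frozen layer's output
delta_is_zero = bool(torch.all(delta == 0))

# the zero matrix still receives a gradient, so training can move away from zero
delta.sum().backward()
second_has_grad = bool(adapter[1].weight.grad.abs().sum() > 0)
# the first matrix gets zero gradient on this very first step,
# because its gradient is routed through the zero matrix
first_grad_is_zero = bool(torch.all(adapter[0].weight.grad == 0))

print(delta_is_zero, second_has_grad, first_grad_is_zero)
```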

(3) Use warmup and weight decay in Adam:
```
optimizer = Adam8bit(..., weight_decay=0.01)
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps_from_paper(), expected_total_number_of_steps
)

actually_use_scheduler_in_training_loop()
```
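
For intuition, the learning-rate multiplier of a linear warmup followed by linear decay can be written in a few lines of plain Python. This is a sketch of the shape we expect from `get_linear_schedule_with_warmup`, not a copy of its implementation. The base learning rate and step counts below are placeholders, not values from the paper.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # then decay linearly from 1 down to 0 at the last step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5             # placeholder value
num_warmup_steps = 100     # placeholder value
num_training_steps = 1000  # placeholder value

schedule = [
    base_lr * linear_warmup_decay(step, num_warmup_steps, num_training_steps)
    for step in range(num_training_steps + 1)
]
print(schedule[0], schedule[50], schedule[100], schedule[1000])
```

In the training loop, the scheduler takes effect only if you call `scheduler.step()` after each `optimizer.step()`.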

Finally, we recommend modifying the training loop to track training metrics, save the best checkpoint, and so on.
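
The per-step losses printed above are very noisy, so a smoothed value is easier to monitor. Below is a minimal sketch of such tracking. `LossTracker` is a hypothetical helper, not part of this notebook or of any library, and the loss values are made up for illustration.

```python
class LossTracker:
    """Hypothetical helper: exponential moving average of the loss plus best-so-far."""

    def __init__(self, beta=0.98):
        self.beta = beta          # smoothing factor; closer to 1 means smoother
        self.ema = None           # smoothed loss
        self.best = float("inf")  # best smoothed loss seen so far

    def update(self, loss):
        """Record one loss value; return True if the smoothed loss improved."""
        loss = float(loss)
        if self.ema is None:
            self.ema = loss
        else:
            self.ema = self.beta * self.ema + (1 - self.beta) * loss
        improved = self.ema < self.best
        if improved:
            self.best = self.ema
        return improved


tracker = LossTracker(beta=0.9)
for step, loss in enumerate([80.0, 95.0, 70.0, 900.0, 75.0]):  # made-up values
    if tracker.update(loss):
        # this is where you would save a checkpoint of the adapter weights
        print(f"step {step}: smoothed loss improved to {tracker.ema:.2f}")
```

In the real loop you would pass `loss.item()` to `update` and save the trainable parameters whenever it returns `True`.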