# Introduction

<center><h3>**Welcome to the Language modeling Notebook.**</h3></center>

In this assignment, you are going to train a neural network to **generate news headlines**.
To reduce computational needs, we have reduced it to headlines about technology, and a handful of Tech giants.
In this assignment you will:
- Learn to preprocess raw text so it can be fed into an LSTM.
- Make use of the LSTM library of Pytorch, to train a Language model to generate headlines
- Use your network to generate headlines, and judge which headlines are likely or not




**What is a language model?**

Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
— Page 105, __[Neural Network Methods in Natural Language Processing](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/)__, 2017.

In terms of neural network, we are training a neural network to produce probabilities (classification) over a fixed vocabulary of words.
Concretely, we are training a neural network to produce:
$$ P ( w_{i+1} | w_1, w_2, w_3, ..., w_i), \forall i \in (1,n)$$

** Why is language modeling important? **

Language modeling is a core problem in NLP.

Language models can either be used as a stand-alone to produce new text that matches the distribution of text the model is trained on, but can also be used at the front-end of a more sophisticated model to produce better results.

Recently for example, the __[BERT](https://arxiv.org/abs/1810.04805)__ paper show-cased that pretraining a large neural network on a language modeling task can help improve state-of-the-art on many NLP tasks. 

How good can the generation of a Language model be?

If you have not seen the post about GPT-2 by OpenAI, you should read some of the samples they generated from their language model __[here](https://blog.openai.com/better-language-models/#sample1)__.
Because of computational restrictions, we will not achieve as good text production, but the same algorithm is at the core. They just use more data and compute.

# Library imports

Before starting, make sure you have all these libraries.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
root_folder = ""
import os
import sys
sys.path.append(root_folder)
from segtok import tokenizer
from collections import Counter
import torch as th
from torch import nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import json
from utils import validate_to_array

# Loading the datasets

In [3]:
# This cell loads the data for the model
# Run this before working on loading any of the additional data

with open(root_folder+"dataset/headline_generation_dataset_processed.json", "r") as f:
    d_released = json.load(f)

with open(root_folder+"dataset/headline_generation_vocabulary.txt", "r",encoding='utf8') as f:
    vocabulary = f.read().split("\n")
w2i = {w: i for i, w in enumerate(vocabulary)} # Word to index
i2w = {i: w for i, w in enumerate(vocabulary)} # Index to word
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

vocab_size = len(vocabulary)
input_length = len(d_released[0]['numerized']) # The length of the first element in the dataset, they are all of the same length
d_train = [d for d in d_released if d['cut'] == 'training']
d_valid = [d for d in d_released if d['cut'] == 'validation']

print("Number of training samples:",len(d_train))
print("Number of validation samples:",len(d_valid))

Number of training samples: 88568
Number of validation samples: 946


Now that we have loaded the data, let's inspect one of the elements. Each sample in our dataset is has a `numerized` vector, that contains the preprocessed headline. This vector is what we will feed in to the neural network. The field `numerized` corresponds to this list of tokens. The already loaded dictionary `vocabulary` maps token lists to the actual string. Use these elements to recover `title` key of entry 1001 in the training dataset.

**TODO**: Write the numerized2text function in notebook_utils and inspect element 1001 in the training dataset (`entry = d_train[1001]`).



In [4]:
def numerize_sequence(tokenized):
    return [w2i.get(w, unkI) for w in tokenized]
def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask

In [5]:
def numerized2text(numerized):
    """ Converts an integer sequence in the vocabulary into a string corresponding to the title.
    
        Arguments:
            numerized: List[int]  -- The list of vocabulary indices corresponding to the string
        Returns:
            title: str -- The string corresponding to the numerized input, without padding.
    """
    #####
    # BEGIN YOUR CODE HERE 
    # Recover each word from the vocabulary in the list of indices in numerized, using the vocabulary variable
    # Hint 1: Use the string.join() function to reconstruct a single string
    # Hint 2: The objects and/or functions defined in above cells may be useful.
    #####
    converted_string=[vocabulary[i] for i in numerized if i!=padI]
    space=" "
    converted_string=space.join(converted_string)
   
    #####
    # END YOUR CODE HERE
    #####
    
    return converted_string

entry = d_train[1001]
# print(pad_sequence(entry['numerized'],0,len(entry['numerized'])-1))
print("Reversing the numerized: "+numerized2text(entry['numerized']))
validate_to_array(numerized2text,(entry['numerized'],),'numerized2text',root_folder)
print("From the `title` entry: "+ entry['title'])

Reversing the numerized: microsoft donates cloud computing ' worth $ 1 bn '
From the `title` entry: Microsoft donates cloud computing 'worth $1 bn'


In language modeling, we train a model to produce the next word in the sequence given all previously generated words. This has, in practice, two steps:


    1. Adding a special <START> token to the start of the sequence for the input. This "shifts" the input to the right by one. We call this the "source" sequence
    2. Making the network predict the original, unshifted version (we call this the "target" sequence)

    
Let's take an example. Say we want to train the network on the sentence: "The cat is great."
The input to the network will be "`<START>` The cat is great." The target will be: "The cat is great".
    
Therefore the first prediction is to select the word "The" given the `<START>` token.
The second prediction is to produce the word "cat" given the two tokens "`<START>` The".
At each step, the network learns to predict the next word, given all previous ones.
    
---

Your next step is to write the build_batch function. Given a dataset, we select a random subset of samples, and will build the "inputs" and the "targets" of the batch, following the procedure we've described.

**TODO**: write the build_batch function. We give you the structure, and you have to fill in where we have left things `your_code`.


In [6]:
def build_batch(dataset, indices):
    """ Builds a batch of source and target elements from the dataset.
    
        Arguments:
            dataset: List[db_element] -- A list of dataset elements
            indices: List[int] -- A list of indices of the dataset to sample
        Returns:
            batch_input: List[List[int]] -- List of source sequences
            batch_target: List[List[int]] -- List of target sequences
            batch_target_mask: List[List[int]] -- List of target batch masks
    """
    #####
    # BEGIN YOUR CODE HERE 
    #####
    
    
    # We get a list of indices we will choose from the dataset.
    # indices = range(iteration*batch_size,(iteration+1)*batch_size)
    
    # Recover what the entries for the batch are
    batch = [dataset[i] for i in indices]
    batch_size=len(indices)
 
    
    # Get the raw numerized for this input, each element of the dataset has a 'numerized' key
    
    batch_numerized = [a['numerized'] for a in batch]
    
    
    # Create an array of startI that will be concatenated at position 1 for the input.
    # Should be of shape (batch_size, 1)
    start_tokens = np.repeat(startI, batch_size).reshape(batch_size,1)

    # Concatenate the start_tokens with the rest of the input
    # The np.concatenate function should be useful
    # The output should now be [batch_size, sequence_length+1]
    batch_input = np.concatenate((start_tokens,batch_numerized),axis=1)
#     print(np.shape(batch_input))
    
    # Remove the last word from each element in the batch
    # To restore the [batch_size, sequence_length] size
    batch_input = batch_input[:,:-1]
#     print(np.shape(batch_input))
    # The target should be the un-shifted numerized input
    batch_target = [a['numerized'] for a in batch]

    # The target-mask is a 0 or 1 filter to note which tokens are
    # padding or not, to give the loss, so the model doesn't get rewarded for
    # predicting PAD tokens.
    batch_target_mask = np.array([a['mask'] for a in batch])
    
    #####
    # END YOUR CODE HERE 
    #####
        
    return batch_input, batch_target, batch_target_mask
validate_to_array(build_batch,(d_train, range(100)),'build_batch',root_folder)

# Creating the language model

Now that we've written the data pipelining, we are ready to write the Neural network.

The steps to setting up a neural network to do Language modeling are:
- Creating the placeholders for the model, where we can feed in our inputs and targets.
- Creating an RNN of our choice, size, and with optional parameters
- Using the RNN on our placeholder inputs.
- Getting the output from the RNN, and projecting it into a vocabulary sized dimension, so that we can make word predictions.
- Setting up the loss on the outputs so that the network learns to produce the correct words.
- Finally, choosing an optimizer, and defining a training operation: using the optimizer to minimize the loss.

We provide skeleton code for the model, you can fill in the `your_code` section. If you are unfamiliar with Pytorch, we provide some idea of what functions to look for, you should use the Pytorch online documentation.

**TODO**: Fill in the LanguageModel in the language_model file.


In [7]:
from language_model import LanguageModel

# Training the model

Your objective is to train the Language on the dataset you are provided to reach a **validation loss <= 5.50**

**TODO**: Train your model so that it achieves a validation loss of <= 5.5. 

**Careful**: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit. You must save the model you want us to test under: models/final_language_model (the .index, .meta and .data files)

**Advice**:
- It should be possible to attain loss <= 5.50 with a 1-layer LSTM of size 256 or less.
- You should not need more than 10 epochs to attain the threshold. More passes over the data can however give you a better model.
- You can however try using:
    - LSTM dropout (Pytorch has a layer for that)
    - Multi-layer RNN cell (Pytorch has a layer for that)
    - Change your optimizers, tune your learning_rate, use a learning rate schedule.

In [22]:
# We can create our model,
# with parameters of our choosing.
hidden_size = 256
num_layers = 1
dropout = .5

# Setup the loss using cross-entropy loss.
# The logits are the output_logits we've computed,
# look at the pytorch docs for `CrossEntropyLoss` and `permute`
# to align the axes correctly and to account for the masking properly.
# The targets are the goal labels we are trying to match.
# Note that if you directly take the mean of the loss tensor,
# it will underestimate your loss! (why would that be?)
# Lastly, there are a few valid forms of averaging token losses,
# here we will take the mean of all non-mask tokens together.
criterion = nn.CrossEntropyLoss(reduction='none')
def loss_fn(pred, target, mask):
    pred = th.permute(pred,(0,2,1))  # put the class probabilities in the middle
    loss_tensor = criterion(pred, target)
    loss_masked = loss_tensor*mask.int()
    loss_per_sample = th.mean(loss_masked)
    return loss_per_sample

# The build_batch function outputs numpy, but our model is built in pytorch,
# so you need to convert numpy to pytorch.
# You also have to cast the masks into float32, target into long, and input into long.
# Look at the `float` and `long` function.
batch_to_torch = lambda b_in,b_target,b_mask: (th.LongTensor(b_in),
                                               th.LongTensor(b_target), 
                                               th.FloatTensor(b_mask)
                                               )


# Look at the docs for torch.optim and pick an optimizer
# And provide it with a start learning rate.
optimizer_class = optim.Adam
lr = 1e-4
epochs = 20
batch_size = 128

model_id = 'test1'
os.makedirs(root_folder+'models/part1/',exist_ok=True)

device = th.device("cuda" if th.cuda.is_available() else "cpu")
print(device)
list_to_device = lambda th_obj: [tensor.to(device) for tensor in th_obj]

cpu


In [23]:
model = LanguageModel(vocab_size=vocab_size, rnn_size=hidden_size, num_layers=num_layers, dropout=dropout)
optimizer = optimizer_class(model.parameters(), lr=lr)

In [21]:
# Skeleton code
# You have to write your own training process to obtain a
# Good performing model on the validation set, and save it.

model.train()
losses = []
accuracies = []
from tqdm import tqdm
for epoch in range(epochs):
    indices = np.random.permutation(range(len(d_train)))
    t = tqdm(range(0,(len(d_train)//batch_size)+1))
    for i in t:
        # Here is how you obtain a batch:
        batch = build_batch(d_train, indices[i*batch_size:(i+1)*batch_size])
        (batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
        (batch_input, batch_target, batch_target_mask) = list_to_device((batch_input, batch_target, batch_target_mask))

        model.to(device)
        
        prediction = model(batch_input)
        
        loss = loss_fn(prediction, batch_target,batch_target_mask)
        losses.append(loss.item())
        accuracy = (th.eq(prediction.argmax(dim=2,keepdim=False),batch_target).float()*batch_target_mask).sum()/batch_target_mask.sum()
        accuracies.append(accuracy.item())
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if i % 10 == 0:
            t.set_description(f"Epoch: {epoch} Iteration: {i} Loss: {np.mean(losses[-10:])} Accuracy: {np.mean(accuracies[-10:])}")
    # save your latest model
    save_dict = dict(
        kwargs = dict(
            vocab_size=vocab_size,
            rnn_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout
        ),
        model_state_dict = model.state_dict(),
        notes = "",
        optimizer_class = optimizer_class,
        lr = lr,
        epochs = epochs,
        batch_size = batch_size,
    )
    th.save(save_dict,root_folder+f'models/part1/model_{model_id}.pt')

Epoch: 0 Iteration: 0 Loss: 4.718810081481934 Accuracy: 0.0:   0%| | 1/692 [00:0

torch.Size([128, 20])
tensor([[9.1081, 9.2261, 9.1890,  ..., 0.0000, 0.0000, 0.0000],
        [9.1082, 9.2451, 9.1896,  ..., 0.0000, 0.0000, 0.0000],
        [9.2480, 9.2487, 9.1662,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1075, 9.1518, 9.1306,  ..., 0.0000, 0.0000, 0.0000],
        [9.2879, 9.2295, 9.2291,  ..., 0.0000, 0.0000, 0.0000],
        [9.0653, 9.1479, 9.2071,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.2156, 9.2935, 9.2238,  ..., 0.0000, 0.0000, 0.0000],
        [9.1491, 9.2315, 9.2865,  ..., 0.0000, 0.0000, 0.0000],
        [9.2905, 9.0425, 9.2839,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1406, 9.2833, 9.3498,  ..., 0.0000, 0.0000, 0.0000],
        [9.1497, 9.1476, 9.2822,  ..., 0.0000, 0.0000, 0.0000],
        [9.2662, 8.9420, 9.1282,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 0 Loss: 4.718810081481934 Accuracy: 0.0:   0%| | 3/692 [00:0

torch.Size([128, 20])
tensor([[9.1307, 9.1342, 8.9061,  ..., 0.0000, 0.0000, 0.0000],
        [9.2167, 9.2678, 9.1983,  ..., 0.0000, 0.0000, 0.0000],
        [9.1753, 9.2071, 9.2037,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0888, 9.3676, 9.0874,  ..., 9.2729, 9.0996, 9.1988],
        [9.2633, 9.3101, 9.2114,  ..., 0.0000, 0.0000, 0.0000],
        [9.1081, 9.1390, 9.2274,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.2545, 9.1255, 9.2037,  ..., 0.0000, 0.0000, 0.0000],
        [9.2767, 9.2118, 9.2184,  ..., 0.0000, 0.0000, 0.0000],
        [9.2756, 9.2733, 9.1949,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2815, 9.1241, 9.1647,  ..., 0.0000, 0.0000, 0.0000],
        [9.1510, 9.3870, 9.1001,  ..., 0.0000, 0.0000, 0.0000],
        [9.0573, 9.1391, 9.0671,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 0 Loss: 4.718810081481934 Accuracy: 0.0:   1%| | 5/692 [00:0

torch.Size([128, 20])
tensor([[9.2438, 9.2564, 9.3055,  ..., 0.0000, 0.0000, 0.0000],
        [9.1453, 9.1690, 9.2848,  ..., 0.0000, 0.0000, 0.0000],
        [9.2438, 9.2729, 9.2413,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0869, 9.1942, 9.3309,  ..., 0.0000, 0.0000, 0.0000],
        [9.1453, 9.2726, 9.2678,  ..., 0.0000, 0.0000, 0.0000],
        [9.1887, 9.2145, 9.2185,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1762, 9.1278, 9.1369,  ..., 0.0000, 0.0000, 0.0000],
        [9.1444, 9.1856, 9.3138,  ..., 0.0000, 0.0000, 0.0000],
        [9.2684, 9.2774, 9.3188,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0879, 9.1678, 9.3052,  ..., 0.0000, 0.0000, 0.0000],
        [9.2102, 9.2715, 9.0624,  ..., 0.0000, 0.0000, 0.0000],
        [9.2149, 9.0761, 9.2155,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 0 Loss: 4.718810081481934 Accuracy: 0.0:   1%| | 7/692 [00:0

torch.Size([128, 20])
tensor([[9.2034, 9.1896, 9.2223,  ..., 0.0000, 0.0000, 0.0000],
        [9.2581, 9.1803, 9.1581,  ..., 8.9912, 9.1114, 0.0000],
        [9.2346, 9.2023, 9.1535,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2094, 9.0646, 9.2640,  ..., 0.0000, 0.0000, 0.0000],
        [9.1199, 9.0901, 9.2259,  ..., 0.0000, 0.0000, 0.0000],
        [9.3017, 9.1910, 9.1665,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.2671, 9.2690, 9.3142,  ..., 9.1135, 9.2631, 9.3147],
        [9.1217, 9.1031, 9.1493,  ..., 0.0000, 0.0000, 0.0000],
        [9.1930, 9.2253, 9.1250,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1344, 9.2432, 9.1787,  ..., 0.0000, 0.0000, 0.0000],
        [9.1354, 9.0641, 9.1481,  ..., 0.0000, 0.0000, 0.0000],
        [9.1074, 9.2220, 9.0338,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 0 Loss: 4.718810081481934 Accuracy: 0.0:   1%| | 9/692 [00:0

torch.Size([128, 20])
tensor([[9.1361, 9.1322, 9.0728,  ..., 0.0000, 0.0000, 0.0000],
        [9.1787, 9.1822, 9.1808,  ..., 0.0000, 0.0000, 0.0000],
        [9.1763, 9.1021, 9.2707,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1995, 9.3536, 9.2204,  ..., 0.0000, 0.0000, 0.0000],
        [9.1995, 9.1209, 9.0983,  ..., 0.0000, 0.0000, 0.0000],
        [9.0847, 9.1475, 9.2030,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1881, 9.1199, 9.0973,  ..., 0.0000, 0.0000, 0.0000],
        [9.3142, 9.0596, 8.8674,  ..., 0.0000, 0.0000, 0.0000],
        [9.1658, 8.8897, 9.1851,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0821, 9.2504, 9.2998,  ..., 0.0000, 0.0000, 0.0000],
        [9.1658, 9.2661, 9.1246,  ..., 0.0000, 0.0000, 0.0000],
        [9.1693, 8.9908, 9.2757,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 10 Loss: 4.703256702423095 Accuracy: 0.0011813657474704087: 

torch.Size([128, 20])
tensor([[9.1530, 8.8684, 9.2004,  ..., 0.0000, 0.0000, 0.0000],
        [9.1663, 9.2056, 9.1951,  ..., 0.0000, 0.0000, 0.0000],
        [9.1663, 9.0077, 8.9651,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2653, 9.1524, 9.3478,  ..., 0.0000, 0.0000, 0.0000],
        [9.0234, 9.1925, 9.1008,  ..., 0.0000, 0.0000, 0.0000],
        [9.1769, 9.2406, 9.3299,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0063, 9.1318, 9.0753,  ..., 0.0000, 0.0000, 0.0000],
        [9.2973, 9.1144, 9.2646,  ..., 0.0000, 0.0000, 0.0000],
        [9.1644, 9.1091, 9.2110,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0868, 9.1370, 9.1675,  ..., 0.0000, 0.0000, 0.0000],
        [9.2150, 9.1024, 9.1315,  ..., 0.0000, 0.0000, 0.0000],
        [9.1400, 9.1985, 9.2415,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 10 Loss: 4.703256702423095 Accuracy: 0.0011813657474704087: 

torch.Size([128, 20])
tensor([[8.9935, 9.2866, 9.2480,  ..., 0.0000, 0.0000, 0.0000],
        [9.2436, 9.1812, 9.1938,  ..., 0.0000, 0.0000, 0.0000],
        [8.9935, 9.1800, 9.2401,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1630, 9.2312, 9.1820,  ..., 0.0000, 0.0000, 0.0000],
        [9.1630, 9.2671, 9.0773,  ..., 8.9837, 0.0000, 0.0000],
        [9.1697, 9.1227, 9.1780,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0343, 9.0949, 9.3491,  ..., 0.0000, 0.0000, 0.0000],
        [9.2690, 9.2800, 9.0650,  ..., 0.0000, 0.0000, 0.0000],
        [8.9998, 9.2289, 9.2679,  ..., 8.8833, 0.0000, 0.0000],
        ...,
        [9.1423, 9.2176, 9.1037,  ..., 0.0000, 0.0000, 0.0000],
        [9.1423, 9.2296, 9.1906,  ..., 0.0000, 0.0000, 0.0000],
        [9.1423, 9.2755, 9.2627,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 10 Loss: 4.703256702423095 Accuracy: 0.0011813657474704087: 

torch.Size([128, 20])
tensor([[9.1455, 9.1696, 9.3921,  ..., 0.0000, 0.0000, 0.0000],
        [9.2373, 9.1565, 9.0819,  ..., 0.0000, 0.0000, 0.0000],
        [9.2332, 9.1121, 9.2154,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0101, 9.0617, 9.2824,  ..., 9.1381, 9.2034, 9.1507],
        [9.3472, 9.3242, 9.0876,  ..., 0.0000, 0.0000, 0.0000],
        [9.0224, 9.2231, 9.2233,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0655, 9.0026, 9.1974,  ..., 0.0000, 0.0000, 0.0000],
        [9.2483, 9.2374, 9.2084,  ..., 0.0000, 0.0000, 0.0000],
        [9.1405, 9.3773, 9.2427,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1405, 9.1367, 9.1491,  ..., 0.0000, 0.0000, 0.0000],
        [9.1088, 9.2248, 9.3196,  ..., 0.0000, 0.0000, 0.0000],
        [8.9825, 9.2473, 9.1008,  ..., 9.2095, 9.1510, 9.2610]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 10 Loss: 4.703256702423095 Accuracy: 0.0011813657474704087: 

torch.Size([128, 20])
tensor([[9.2433, 9.1975, 9.2574,  ..., 0.0000, 0.0000, 0.0000],
        [9.1071, 9.1558, 9.1801,  ..., 0.0000, 0.0000, 0.0000],
        [9.2812, 9.2428, 9.2770,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1324, 9.1976, 9.3410,  ..., 0.0000, 0.0000, 0.0000],
        [9.1913, 9.1773, 9.2606,  ..., 0.0000, 0.0000, 0.0000],
        [9.2086, 9.0830, 9.2225,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9269, 9.1588, 9.1517,  ..., 0.0000, 0.0000, 0.0000],
        [9.1488, 9.0974, 9.1832,  ..., 0.0000, 0.0000, 0.0000],
        [9.2105, 9.2231, 9.1186,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1114, 9.0862, 9.0676,  ..., 0.0000, 0.0000, 0.0000],
        [9.1539, 9.2838, 9.1874,  ..., 0.0000, 0.0000, 0.0000],
        [9.1960, 8.8610, 9.4309,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 10 Loss: 4.703256702423095 Accuracy: 0.0011813657474704087: 

torch.Size([128, 20])
tensor([[9.1786, 9.3134, 9.3127,  ..., 0.0000, 0.0000, 0.0000],
        [9.2172, 9.1141, 9.3200,  ..., 0.0000, 0.0000, 0.0000],
        [9.2597, 9.2201, 9.1784,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9733, 9.1440, 9.1430,  ..., 0.0000, 0.0000, 0.0000],
        [9.1836, 9.1602, 9.2571,  ..., 0.0000, 0.0000, 0.0000],
        [8.9550, 9.1345, 8.9612,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9451, 9.0443, 9.3531,  ..., 0.0000, 0.0000, 0.0000],
        [9.0612, 9.2541, 9.2900,  ..., 0.0000, 0.0000, 0.0000],
        [9.1170, 9.1991, 9.3793,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1277, 9.1934, 9.2757,  ..., 0.0000, 0.0000, 0.0000],
        [8.9621, 9.1170, 9.2092,  ..., 0.0000, 0.0000, 0.0000],
        [8.9621, 9.1326, 9.2108,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 20 Loss: 4.733017539978027 Accuracy: 0.006714864744571969:  

torch.Size([128, 20])
tensor([[9.1260, 9.2116, 9.2167,  ..., 0.0000, 0.0000, 0.0000],
        [9.2671, 9.2679, 9.2325,  ..., 0.0000, 0.0000, 0.0000],
        [9.2340, 9.1165, 8.9955,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2707, 9.0940, 9.1449,  ..., 0.0000, 0.0000, 0.0000],
        [9.2095, 9.1005, 9.0327,  ..., 0.0000, 0.0000, 0.0000],
        [9.0348, 9.1345, 9.3639,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1827, 9.1190, 9.2944,  ..., 0.0000, 0.0000, 0.0000],
        [8.9384, 9.0676, 9.3245,  ..., 0.0000, 0.0000, 0.0000],
        [9.3726, 9.1989, 8.8910,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1890, 9.2295, 9.1272,  ..., 0.0000, 0.0000, 0.0000],
        [9.3000, 9.1427, 9.1927,  ..., 0.0000, 0.0000, 0.0000],
        [9.2828, 9.3397, 9.1019,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 20 Loss: 4.733017539978027 Accuracy: 0.006714864744571969:  

torch.Size([128, 20])
tensor([[9.0077, 9.2115, 9.3593,  ..., 0.0000, 0.0000, 0.0000],
        [9.2080, 9.0279, 9.2696,  ..., 0.0000, 0.0000, 0.0000],
        [9.2107, 9.0011, 9.2933,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1530, 9.1472, 9.2730,  ..., 0.0000, 0.0000, 0.0000],
        [9.1019, 9.2183, 9.1774,  ..., 0.0000, 0.0000, 0.0000],
        [9.1777, 9.2394, 9.2173,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0974, 9.1110, 9.1752,  ..., 0.0000, 0.0000, 0.0000],
        [9.2191, 9.0476, 9.1319,  ..., 0.0000, 0.0000, 0.0000],
        [8.8491, 9.1364, 9.0934,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9147, 9.1784, 9.1275,  ..., 0.0000, 0.0000, 0.0000],
        [8.9147, 9.3497, 9.0939,  ..., 0.0000, 0.0000, 0.0000],
        [8.9031, 9.2960, 9.1227,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 20 Loss: 4.733017539978027 Accuracy: 0.006714864744571969:  

torch.Size([128, 20])
tensor([[8.9945, 9.1094, 9.2702,  ..., 0.0000, 0.0000, 0.0000],
        [9.2233, 9.3212, 9.2862,  ..., 0.0000, 0.0000, 0.0000],
        [9.1717, 9.0982, 9.3312,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1968, 9.2940, 8.7947,  ..., 0.0000, 0.0000, 0.0000],
        [8.9021, 9.1387, 9.2593,  ..., 9.2576, 9.1652, 9.2499],
        [9.1998, 9.1035, 9.3210,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.8894, 9.1952, 9.1396,  ..., 0.0000, 0.0000, 0.0000],
        [9.1392, 9.2395, 9.0786,  ..., 0.0000, 0.0000, 0.0000],
        [9.2676, 9.1645, 9.0998,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1174, 9.2099, 8.9841,  ..., 0.0000, 0.0000, 0.0000],
        [8.9962, 9.2432, 9.2395,  ..., 0.0000, 0.0000, 0.0000],
        [9.2251, 9.1987, 9.2024,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 20 Loss: 4.733017539978027 Accuracy: 0.006714864744571969:  

torch.Size([128, 20])
tensor([[9.1942, 9.0971, 9.1679,  ..., 0.0000, 0.0000, 0.0000],
        [9.1912, 9.5485, 9.1635,  ..., 0.0000, 0.0000, 0.0000],
        [9.1969, 9.1315, 9.0970,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2061, 9.2612, 9.0729,  ..., 0.0000, 0.0000, 0.0000],
        [8.8769, 9.2328, 9.2461,  ..., 0.0000, 0.0000, 0.0000],
        [9.2490, 9.1898, 9.3154,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1734, 9.2467, 9.2591,  ..., 0.0000, 0.0000, 0.0000],
        [8.8624, 9.2056, 9.1254,  ..., 0.0000, 0.0000, 0.0000],
        [8.8644, 9.1876, 9.2330,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1792, 8.8778, 9.1257,  ..., 0.0000, 0.0000, 0.0000],
        [8.9711, 9.1341, 9.1934,  ..., 0.0000, 0.0000, 0.0000],
        [8.8644, 9.0001, 9.1580,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 20 Loss: 4.733017539978027 Accuracy: 0.006714864744571969:  

torch.Size([128, 20])
tensor([[9.2042, 9.1479, 9.2271,  ..., 0.0000, 0.0000, 0.0000],
        [8.9293, 9.3574, 9.0491,  ..., 0.0000, 0.0000, 0.0000],
        [9.0334, 9.1720, 9.1377,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2042, 9.2220, 9.1990,  ..., 0.0000, 0.0000, 0.0000],
        [8.7751, 9.2138, 9.1284,  ..., 0.0000, 0.0000, 0.0000],
        [8.8512, 9.1358, 9.1271,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9565, 9.0973, 8.8977,  ..., 0.0000, 0.0000, 0.0000],
        [9.1728, 9.2335, 9.0776,  ..., 0.0000, 0.0000, 0.0000],
        [8.7596, 9.2158, 9.1344,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.3042, 9.1786, 9.1087,  ..., 0.0000, 0.0000, 0.0000],
        [9.0463, 9.3319, 9.2012,  ..., 0.0000, 0.0000, 0.0000],
        [9.0260, 9.1625, 9.2510,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 30 Loss: 4.599087810516357 Accuracy: 0.0221438093110919:   4

torch.Size([128, 20])
tensor([[9.0958, 9.3220, 9.1542,  ..., 0.0000, 0.0000, 0.0000],
        [9.2275, 9.1031, 9.2730,  ..., 0.0000, 0.0000, 0.0000],
        [9.0786, 9.2993, 9.2910,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0057, 9.1998, 9.2379,  ..., 0.0000, 0.0000, 0.0000],
        [9.1313, 9.2488, 9.1806,  ..., 0.0000, 0.0000, 0.0000],
        [9.1150, 9.3380, 9.1396,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1105, 9.0167, 9.1861,  ..., 0.0000, 0.0000, 0.0000],
        [9.2198, 9.1049, 9.1642,  ..., 0.0000, 0.0000, 0.0000],
        [9.1349, 8.8949, 8.9347,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1697, 9.3870, 9.3069,  ..., 0.0000, 0.0000, 0.0000],
        [8.8182, 9.1594, 9.1986,  ..., 0.0000, 0.0000, 0.0000],
        [9.2509, 9.0916, 9.1991,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 30 Loss: 4.599087810516357 Accuracy: 0.0221438093110919:   5

torch.Size([128, 20])
tensor([[8.8758, 9.0675, 9.0668,  ..., 0.0000, 0.0000, 0.0000],
        [8.7939, 8.9774, 9.2468,  ..., 0.0000, 0.0000, 0.0000],
        [8.7080, 9.1245, 9.1819,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0001, 9.2517, 9.2937,  ..., 0.0000, 0.0000, 0.0000],
        [8.7939, 9.2838, 9.1817,  ..., 0.0000, 0.0000, 0.0000],
        [9.0201, 9.1386, 9.0503,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9379, 9.1316, 9.1539,  ..., 0.0000, 0.0000, 0.0000],
        [9.3790, 9.0894, 9.0100,  ..., 0.0000, 0.0000, 0.0000],
        [9.3475, 9.0974, 9.1449,  ..., 9.3123, 0.0000, 0.0000],
        ...,
        [9.2270, 9.1619, 9.1257,  ..., 0.0000, 0.0000, 0.0000],
        [9.3893, 9.1572, 9.0344,  ..., 0.0000, 0.0000, 0.0000],
        [8.7802, 9.2522, 9.2100,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 30 Loss: 4.599087810516357 Accuracy: 0.0221438093110919:   5

torch.Size([128, 20])
tensor([[8.7664, 9.2319, 8.7458,  ..., 0.0000, 0.0000, 0.0000],
        [9.2847, 9.0515, 9.1726,  ..., 0.0000, 0.0000, 0.0000],
        [9.1639, 8.9857, 9.1965,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9834, 9.1887, 9.2828,  ..., 0.0000, 0.0000, 0.0000],
        [8.6722, 9.0429, 9.1442,  ..., 0.0000, 0.0000, 0.0000],
        [8.9219, 9.3035, 9.0762,  ..., 9.3323, 9.2134, 9.3484]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.8591, 8.9581, 9.1430,  ..., 0.0000, 0.0000, 0.0000],
        [9.0433, 8.9252, 9.0559,  ..., 0.0000, 0.0000, 0.0000],
        [8.9916, 9.1863, 9.1603,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1454, 9.1469, 9.2735,  ..., 0.0000, 0.0000, 0.0000],
        [8.9916, 9.1215, 9.1595,  ..., 0.0000, 0.0000, 0.0000],
        [8.7517, 9.0909, 9.1912,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 30 Loss: 4.599087810516357 Accuracy: 0.0221438093110919:   5

torch.Size([128, 20])
tensor([[8.8437, 9.1014, 9.1970,  ..., 0.0000, 0.0000, 0.0000],
        [8.9808, 9.2436, 9.2168,  ..., 0.0000, 0.0000, 0.0000],
        [8.7366, 8.9825, 8.8683,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.8177, 9.1450, 9.0882,  ..., 0.0000, 0.0000, 0.0000],
        [9.0835, 9.1936, 9.3156,  ..., 0.0000, 0.0000, 0.0000],
        [8.8437, 9.1483, 9.0936,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.6170, 9.3746, 9.0982,  ..., 0.0000, 0.0000, 0.0000],
        [8.7482, 9.0152, 9.1144,  ..., 0.0000, 0.0000, 0.0000],
        [9.0448, 9.0607, 9.2013,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2225, 9.1528, 9.2559,  ..., 0.0000, 0.0000, 0.0000],
        [8.7210, 9.2091, 9.1908,  ..., 0.0000, 0.0000, 0.0000],
        [8.8275, 9.1998, 9.0249,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 30 Loss: 4.599087810516357 Accuracy: 0.0221438093110919:   6

torch.Size([128, 20])
tensor([[8.9420, 9.1871, 9.1037,  ..., 0.0000, 0.0000, 0.0000],
        [9.0569, 9.1211, 9.3595,  ..., 0.0000, 0.0000, 0.0000],
        [8.5978, 9.0924, 9.2028,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2713, 9.1726, 8.9911,  ..., 0.0000, 0.0000, 0.0000],
        [9.1696, 8.9537, 9.0475,  ..., 0.0000, 0.0000, 0.0000],
        [8.8113, 9.2312, 9.1855,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.7699, 9.2819, 9.1791,  ..., 0.0000, 0.0000, 0.0000],
        [9.1352, 9.1954, 9.0310,  ..., 0.0000, 0.0000, 0.0000],
        [9.0657, 9.1146, 9.2231,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.3290, 9.2933, 9.2809,  ..., 0.0000, 0.0000, 0.0000],
        [8.6903, 9.0273, 9.1121,  ..., 0.0000, 0.0000, 0.0000],
        [8.7947, 9.0774, 9.2063,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 40 Loss: 4.7018170833587645 Accuracy: 0.028435646556317806: 

torch.Size([128, 20])
tensor([[8.7770, 9.1281, 9.2250,  ..., 0.0000, 0.0000, 0.0000],
        [9.0325, 9.3607, 9.1108,  ..., 0.0000, 0.0000, 0.0000],
        [8.8697, 9.2578, 9.3075,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.7770, 9.3013, 9.2695,  ..., 0.0000, 0.0000, 0.0000],
        [9.1237, 9.3346, 9.1160,  ..., 0.0000, 0.0000, 0.0000],
        [8.9204, 9.2423, 9.0869,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.5392, 9.2562, 9.4100,  ..., 0.0000, 0.0000, 0.0000],
        [8.7375, 8.1011, 8.9748,  ..., 0.0000, 0.0000, 0.0000],
        [9.1206, 9.1116, 9.1305,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.5392, 9.2565, 9.1838,  ..., 0.0000, 0.0000, 0.0000],
        [8.7588, 9.0816, 9.2217,  ..., 0.0000, 0.0000, 0.0000],
        [9.0014, 9.1299, 9.1660,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 40 Loss: 4.7018170833587645 Accuracy: 0.028435646556317806: 

torch.Size([128, 20])
tensor([[8.6405, 8.9636, 9.1169,  ..., 0.0000, 0.0000, 0.0000],
        [9.0780, 9.1330, 9.0728,  ..., 0.0000, 0.0000, 0.0000],
        [8.5191, 9.2197, 9.1966,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9072, 8.8662, 9.1339,  ..., 0.0000, 0.0000, 0.0000],
        [8.5191, 9.1781, 9.1655,  ..., 0.0000, 0.0000, 0.0000],
        [8.9133, 9.1981, 9.2224,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.6234, 9.1457, 9.0250,  ..., 0.0000, 0.0000, 0.0000],
        [9.1327, 9.2992, 9.0340,  ..., 0.0000, 0.0000, 0.0000],
        [8.9851, 9.0923, 9.1918,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9468, 9.1103, 9.2524,  ..., 0.0000, 0.0000, 0.0000],
        [8.6234, 9.0395, 8.8591,  ..., 0.0000, 0.0000, 0.0000],
        [8.4988, 9.1389, 9.2165,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 40 Loss: 4.7018170833587645 Accuracy: 0.028435646556317806: 

torch.Size([128, 20])
tensor([[8.7030, 9.1363, 9.0123,  ..., 0.0000, 0.0000, 0.0000],
        [9.0126, 8.4418, 8.8366,  ..., 0.0000, 0.0000, 0.0000],
        [8.7030, 9.1119, 9.2432,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9367, 9.1411, 9.2267,  ..., 0.0000, 0.0000, 0.0000],
        [8.8666, 8.9821, 9.2417,  ..., 0.0000, 0.0000, 0.0000],
        [8.7030, 9.1078, 9.3926,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0637, 9.2479, 9.1834,  ..., 0.0000, 0.0000, 0.0000],
        [9.1928, 8.9623, 9.0821,  ..., 9.2095, 9.1241, 9.1578],
        [9.1999, 9.3468, 9.2057,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1492, 9.0995, 9.1523,  ..., 0.0000, 0.0000, 0.0000],
        [9.0737, 8.9431, 9.2202,  ..., 0.0000, 0.0000, 0.0000],
        [8.4575, 9.2257, 9.0540,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 40 Loss: 4.7018170833587645 Accuracy: 0.028435646556317806: 

torch.Size([128, 20])
tensor([[8.6653, 9.0655, 9.1333,  ..., 0.0000, 0.0000, 0.0000],
        [9.0299, 9.0778, 9.0423,  ..., 0.0000, 0.0000, 0.0000],
        [9.1886, 9.0571, 9.2695,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.6484, 9.1241, 8.6723,  ..., 0.0000, 0.0000, 0.0000],
        [9.0486, 9.0422, 9.1628,  ..., 0.0000, 0.0000, 0.0000],
        [8.8054, 9.2547, 9.1736,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.2495, 9.1476, 9.2351,  ..., 0.0000, 0.0000, 0.0000],
        [8.5522, 9.1669, 9.0481,  ..., 0.0000, 0.0000, 0.0000],
        [8.4147, 9.2579, 9.2855,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.6462, 9.0341, 9.2626,  ..., 0.0000, 0.0000, 0.0000],
        [8.4147, 9.3447, 9.0316,  ..., 0.0000, 0.0000, 0.0000],
        [9.0754, 8.6102, 9.4145,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 40 Loss: 4.7018170833587645 Accuracy: 0.028435646556317806: 

torch.Size([128, 20])
tensor([[9.0587, 9.2990, 8.9347,  ..., 0.0000, 0.0000, 0.0000],
        [9.2972, 9.1548, 9.1146,  ..., 0.0000, 0.0000, 0.0000],
        [9.0763, 9.2022, 9.1063,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.8790, 9.1653, 8.7536,  ..., 0.0000, 0.0000, 0.0000],
        [8.5331, 9.2581, 9.1878,  ..., 0.0000, 0.0000, 0.0000],
        [9.1097, 9.0653, 9.1029,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1317, 9.2858, 9.1810,  ..., 0.0000, 0.0000, 0.0000],
        [8.8849, 9.3138, 9.0696,  ..., 0.0000, 0.0000, 0.0000],
        [8.3697, 9.2518, 8.9700,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.8187, 8.6588, 9.0922,  ..., 0.0000, 0.0000, 0.0000],
        [8.6076, 9.1149, 9.1401,  ..., 0.0000, 0.0000, 0.0000],
        [8.5864, 9.0583, 8.2955,  ..., 9.1945, 9.0402, 9.1809]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 50 Loss: 4.610334682464599 Accuracy: 0.038128612004220484:  

torch.Size([128, 20])
tensor([[8.8741, 9.0220, 9.1895,  ..., 0.0000, 0.0000, 0.0000],
        [8.8741, 9.1271, 9.0425,  ..., 0.0000, 0.0000, 0.0000],
        [9.3640, 9.1767, 9.1306,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.3465, 9.2116, 9.2600,  ..., 0.0000, 0.0000, 0.0000],
        [8.5876, 9.2749, 9.0264,  ..., 0.0000, 0.0000, 0.0000],
        [8.8741, 9.1197, 9.2123,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.3165, 9.0828, 9.0824,  ..., 0.0000, 0.0000, 0.0000],
        [8.5321, 9.3409, 9.1496,  ..., 0.0000, 0.0000, 0.0000],
        [8.8082, 9.1621, 9.2561,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.5672, 9.1269, 9.0337,  ..., 0.0000, 0.0000, 0.0000],
        [8.5672, 9.2672, 9.3005,  ..., 0.0000, 0.0000, 0.0000],
        [8.3229, 8.7426, 9.0534,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 50 Loss: 4.610334682464599 Accuracy: 0.038128612004220484:  

torch.Size([128, 20])
tensor([[9.0493, 8.4161, 8.9064,  ..., 0.0000, 0.0000, 0.0000],
        [8.2987, 9.3477, 9.0631,  ..., 0.0000, 0.0000, 0.0000],
        [9.2074, 9.1802, 9.2638,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.5467, 8.8506, 9.2618,  ..., 0.0000, 0.0000, 0.0000],
        [9.3671, 9.0683, 9.3158,  ..., 0.0000, 0.0000, 0.0000],
        [8.9298, 8.8222, 9.0922,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1082, 8.7168, 8.9477,  ..., 0.0000, 0.0000, 0.0000],
        [8.8398, 9.1106, 8.8686,  ..., 0.0000, 0.0000, 0.0000],
        [9.1598, 9.0714, 8.8996,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4943, 9.0374, 9.2204,  ..., 0.0000, 0.0000, 0.0000],
        [9.0358, 9.1616, 8.7691,  ..., 0.0000, 0.0000, 0.0000],
        [8.5261, 9.2658, 8.9707,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 50 Loss: 4.610334682464599 Accuracy: 0.038128612004220484:  

torch.Size([128, 20])
tensor([[9.2409, 9.2156, 9.1914,  ..., 0.0000, 0.0000, 0.0000],
        [8.4082, 9.4069, 9.2162,  ..., 0.0000, 0.0000, 0.0000],
        [9.2245, 8.9784, 8.5275,  ..., 8.6801, 8.9811, 0.0000],
        ...,
        [8.5052, 9.1555, 9.2344,  ..., 0.0000, 0.0000, 0.0000],
        [8.9042, 8.8575, 8.9132,  ..., 9.1762, 9.0444, 9.2471],
        [8.8276, 8.9373, 8.8740,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.4840, 8.9080, 9.0903,  ..., 0.0000, 0.0000, 0.0000],
        [8.2245, 8.9219, 8.9431,  ..., 0.0000, 0.0000, 0.0000],
        [9.0228, 8.8919, 9.0979,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.3726, 9.1252, 9.1759,  ..., 0.0000, 0.0000, 0.0000],
        [9.0224, 9.2850, 9.0844,  ..., 0.0000, 0.0000, 0.0000],
        [8.4840, 8.9507, 9.2889,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 50 Loss: 4.610334682464599 Accuracy: 0.038128612004220484:  

torch.Size([128, 20])
tensor([[8.8022, 8.9066, 9.1517,  ..., 0.0000, 0.0000, 0.0000],
        [8.7003, 9.1861, 9.2531,  ..., 0.0000, 0.0000, 0.0000],
        [9.0834, 9.2065, 9.2771,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.1983, 8.9213, 9.1780,  ..., 0.0000, 0.0000, 0.0000],
        [8.6891, 9.1394, 9.3072,  ..., 0.0000, 0.0000, 0.0000],
        [9.3707, 9.4005, 9.2801,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.0721, 9.2043, 9.0596,  ..., 0.0000, 0.0000, 0.0000],
        [9.2500, 9.2509, 9.2807,  ..., 0.0000, 0.0000, 0.0000],
        [9.1945, 9.0794, 9.1444,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1447, 9.2404, 8.8924,  ..., 8.7882, 0.0000, 0.0000],
        [8.9849, 9.0158, 8.7029,  ..., 0.0000, 0.0000, 0.0000],
        [9.1665, 9.0472, 9.0900,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 50 Loss: 4.610334682464599 Accuracy: 0.038128612004220484:  

torch.Size([128, 20])
tensor([[8.6619, 8.7031, 9.4325,  ..., 0.0000, 0.0000, 0.0000],
        [9.1147, 8.9402, 9.0377,  ..., 0.0000, 0.0000, 0.0000],
        [8.8672, 9.1267, 9.1173,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.3171, 9.0434, 9.2190,  ..., 0.0000, 0.0000, 0.0000],
        [8.4184, 9.2918, 9.0269,  ..., 0.0000, 0.0000, 0.0000],
        [8.3648, 9.1841, 9.0483,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1732, 8.9099, 8.7588,  ..., 0.0000, 0.0000, 0.0000],
        [8.1180, 9.1850, 9.4233,  ..., 0.0000, 0.0000, 0.0000],
        [9.0910, 9.1137, 9.0560,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.3693, 9.4110, 9.2278,  ..., 0.0000, 0.0000, 0.0000],
        [9.3614, 9.0073, 8.9917,  ..., 0.0000, 0.0000, 0.0000],
        [8.1180, 9.0943, 8.8473,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 60 Loss: 4.563868141174316 Accuracy: 0.054386329278349875:  

torch.Size([128, 20])
tensor([[8.8839, 8.9960, 9.1379,  ..., 0.0000, 0.0000, 0.0000],
        [8.3090, 9.2577, 9.1067,  ..., 0.0000, 0.0000, 0.0000],
        [8.7497, 9.2658, 8.8562,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.3703, 9.1180, 9.1761,  ..., 0.0000, 0.0000, 0.0000],
        [8.8720, 8.4562, 9.0426,  ..., 0.0000, 0.0000, 0.0000],
        [8.6230, 9.1832, 8.8091,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.3456, 9.0760, 9.2251,  ..., 0.0000, 0.0000, 0.0000],
        [8.9503, 9.4100, 8.3810,  ..., 0.0000, 0.0000, 0.0000],
        [8.8586, 9.2195, 9.2061,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.5210, 9.5049, 8.2675,  ..., 8.6532, 0.0000, 0.0000],
        [8.7363, 9.0234, 8.9895,  ..., 0.0000, 0.0000, 0.0000],
        [8.0603, 9.1013, 9.0417,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 60 Loss: 4.563868141174316 Accuracy: 0.054386329278349875:  

torch.Size([128, 20])
tensor([[9.3202, 9.1473, 9.1243,  ..., 0.0000, 0.0000, 0.0000],
        [8.2164, 8.6736, 9.1354,  ..., 9.0747, 0.0000, 0.0000],
        [8.9582, 8.3507, 9.1059,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.3483, 9.2864, 8.9065,  ..., 0.0000, 0.0000, 0.0000],
        [8.2164, 8.7216, 8.4663,  ..., 0.0000, 0.0000, 0.0000],
        [9.3160, 8.3383, 7.4355,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.7647, 9.1851, 9.1270,  ..., 8.6317, 8.6669, 9.3656],
        [8.2741, 9.1083, 9.2233,  ..., 0.0000, 0.0000, 0.0000],
        [8.5479, 9.1566, 9.4303,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.2937, 8.4826, 8.9325,  ..., 0.0000, 0.0000, 0.0000],
        [9.2753, 9.3028, 9.1344,  ..., 0.0000, 0.0000, 0.0000],
        [8.0006, 9.0047, 8.9587,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 60 Loss: 4.563868141174316 Accuracy: 0.054386329278349875:  

torch.Size([128, 20])
tensor([[9.2960, 8.6363, 9.0830,  ..., 0.0000, 0.0000, 0.0000],
        [8.6326, 9.1313, 9.0317,  ..., 0.0000, 0.0000, 0.0000],
        [9.0065, 9.0501, 9.3477,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.6920, 9.2684, 8.9195,  ..., 0.0000, 0.0000, 0.0000],
        [8.5179, 9.1958, 8.5989,  ..., 0.0000, 0.0000, 0.0000],
        [8.1853, 8.7940, 8.4930,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.9368, 9.4752, 9.5795,  ..., 0.0000, 0.0000, 0.0000],
        [8.2394, 9.0260, 9.1848,  ..., 0.0000, 0.0000, 0.0000],
        [9.3536, 9.3088, 8.6353,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4973, 9.0634, 9.0325,  ..., 0.0000, 0.0000, 0.0000],
        [9.0782, 8.9401, 8.9533,  ..., 0.0000, 0.0000, 0.0000],
        [8.2394, 8.9629, 9.1544,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 60 Loss: 4.563868141174316 Accuracy: 0.054386329278349875:  

torch.Size([128, 20])
tensor([[7.9035, 9.1587, 9.0485,  ..., 0.0000, 0.0000, 0.0000],
        [8.9791, 9.0217, 9.1300,  ..., 0.0000, 0.0000, 0.0000],
        [9.0825, 9.0337, 8.7811,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.2107, 8.8818, 9.0798,  ..., 0.0000, 0.0000, 0.0000],
        [9.1353, 8.7712, 8.9118,  ..., 0.0000, 0.0000, 0.0000],
        [7.9035, 8.9860, 8.9484,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.4184, 9.0596, 9.2981,  ..., 0.0000, 0.0000, 0.0000],
        [8.0783, 9.0130, 8.2287,  ..., 0.0000, 0.0000, 0.0000],
        [8.6668, 8.9725, 8.6835,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1011, 8.6860, 8.9024,  ..., 0.0000, 0.0000, 0.0000],
        [8.8489, 8.8810, 9.3803,  ..., 0.0000, 0.0000, 0.0000],
        [8.7451, 9.0628, 9.1290,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 60 Loss: 4.563868141174316 Accuracy: 0.054386329278349875:  

torch.Size([128, 20])
tensor([[8.1404, 8.7718, 9.4265,  ..., 0.0000, 0.0000, 0.0000],
        [8.1527, 9.1016, 9.0455,  ..., 0.0000, 0.0000, 0.0000],
        [8.6265, 8.9840, 8.7228,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.0496, 8.3945, 8.8485,  ..., 0.0000, 0.0000, 0.0000],
        [8.7755, 9.0105, 8.9649,  ..., 0.0000, 0.0000, 0.0000],
        [8.4305, 9.0084, 8.8671,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9523, 8.6156, 8.7809,  ..., 0.0000, 0.0000, 0.0000],
        [9.0316, 9.1522, 9.1363,  ..., 0.0000, 0.0000, 0.0000],
        [7.7990, 8.2328, 8.8690,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4154, 9.1445, 8.8450,  ..., 0.0000, 0.0000, 0.0000],
        [8.6364, 8.3739, 9.2504,  ..., 0.0000, 0.0000, 0.0000],
        [8.1231, 8.7430, 7.9565,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 70 Loss: 4.553095912933349 Accuracy: 0.06209805607795715:  1

torch.Size([128, 20])
tensor([[9.0926, 9.1504, 8.6350,  ..., 0.0000, 0.0000, 0.0000],
        [8.8627, 9.3068, 8.6254,  ..., 0.0000, 0.0000, 0.0000],
        [8.0923, 8.8596, 9.0640,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.6829, 9.3391, 8.5710,  ..., 0.0000, 0.0000, 0.0000],
        [8.6209, 7.4459, 7.9975,  ..., 0.0000, 0.0000, 0.0000],
        [7.9908, 9.1126, 8.9469,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.8622, 9.3110, 8.7767,  ..., 0.0000, 0.0000, 0.0000],
        [8.3593, 8.5090, 9.0374,  ..., 0.0000, 0.0000, 0.0000],
        [8.9573, 9.1833, 8.6884,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.9608, 7.5122, 6.4770,  ..., 0.0000, 0.0000, 0.0000],
        [8.5970, 9.3694, 8.6347,  ..., 9.2800, 8.6403, 8.9662],
        [8.9677, 8.8225, 8.2953,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 70 Loss: 4.553095912933349 Accuracy: 0.06209805607795715:  1

torch.Size([128, 20])
tensor([[8.0283, 9.2140, 8.9207,  ..., 0.0000, 0.0000, 0.0000],
        [8.0170, 8.7003, 8.9254,  ..., 0.0000, 0.0000, 0.0000],
        [7.9297, 8.9055, 8.8389,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2304, 8.5400, 8.8326,  ..., 0.0000, 0.0000, 0.0000],
        [9.3507, 8.8585, 9.4952,  ..., 0.0000, 0.0000, 0.0000],
        [8.3292, 9.0054, 8.7515,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.2980, 9.2829, 9.3228,  ..., 0.0000, 0.0000, 0.0000],
        [7.9846, 7.9967, 8.9126,  ..., 0.0000, 0.0000, 0.0000],
        [7.6479, 8.8294, 9.1922,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.9450, 8.9347, 8.3714,  ..., 0.0000, 0.0000, 0.0000],
        [7.6479, 8.8272, 8.4507,  ..., 0.0000, 0.0000, 0.0000],
        [8.5342, 8.7951, 8.6010,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 70 Loss: 4.553095912933349 Accuracy: 0.06209805607795715:  1

torch.Size([128, 20])
tensor([[8.6139, 9.2499, 8.3064,  ..., 0.0000, 0.0000, 0.0000],
        [9.0361, 8.6715, 8.5383,  ..., 0.0000, 0.0000, 0.0000],
        [7.8655, 8.4668, 9.1240,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.5129, 8.8942, 8.3952,  ..., 0.0000, 0.0000, 0.0000],
        [9.3080, 8.3825, 8.8929,  ..., 0.0000, 0.0000, 0.0000],
        [9.5761, 8.9315, 8.1257,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.2561, 8.9093, 9.4564,  ..., 0.0000, 0.0000, 0.0000],
        [8.4914, 8.8969, 8.7348,  ..., 0.0000, 0.0000, 0.0000],
        [7.8323, 8.8249, 9.2160,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4996, 9.2979, 8.6398,  ..., 0.0000, 0.0000, 0.0000],
        [7.9158, 8.8997, 8.0386,  ..., 0.0000, 0.0000, 0.0000],
        [9.0685, 8.1967, 9.0291,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 70 Loss: 4.553095912933349 Accuracy: 0.06209805607795715:  1

torch.Size([128, 20])
tensor([[8.8058, 9.1048, 8.7612,  ..., 0.0000, 0.0000, 0.0000],
        [8.5685, 8.8884, 7.7478,  ..., 0.0000, 0.0000, 0.0000],
        [7.5265, 8.9994, 8.9656,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.7069, 6.3302, 8.9026,  ..., 0.0000, 0.0000, 0.0000],
        [9.5786, 9.1875, 8.2392,  ..., 0.0000, 0.0000, 0.0000],
        [9.0568, 8.8106, 8.6353,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.4154, 9.0825, 8.7852,  ..., 0.0000, 0.0000, 0.0000],
        [9.0302, 7.1329, 5.9908,  ..., 0.0000, 0.0000, 0.0000],
        [8.1638, 7.3135, 8.6657,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2550, 7.9794, 8.0695,  ..., 0.0000, 0.0000, 0.0000],
        [7.8542, 8.5941, 9.2183,  ..., 0.0000, 0.0000, 0.0000],
        [9.0076, 8.6562, 8.7863,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 70 Loss: 4.553095912933349 Accuracy: 0.06209805607795715:  1

torch.Size([128, 20])
tensor([[9.3694, 7.9197, 9.0125,  ..., 0.0000, 0.0000, 0.0000],
        [8.4231, 7.6872, 9.0762,  ..., 0.0000, 0.0000, 0.0000],
        [7.8181, 8.6464, 8.6486,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.1281, 8.1982, 8.5604,  ..., 0.0000, 0.0000, 0.0000],
        [9.0001, 9.2407, 9.0860,  ..., 0.0000, 0.0000, 0.0000],
        [8.3936, 8.4810, 7.7128,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.6378, 8.9792, 8.7766,  ..., 0.0000, 0.0000, 0.0000],
        [9.3212, 7.7084, 8.7480,  ..., 0.0000, 0.0000, 0.0000],
        [9.5918, 8.8231, 8.9109,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.7820, 8.7304, 9.1060,  ..., 0.0000, 0.0000, 0.0000],
        [7.6915, 8.8678, 8.6923,  ..., 0.0000, 0.0000, 0.0000],
        [9.1041, 7.9869, 8.1221,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 80 Loss: 4.39349946975708 Accuracy: 0.06131722964346409:  12

torch.Size([128, 20])
tensor([[8.9352, 8.0747, 8.8756,  ..., 0.0000, 0.0000, 0.0000],
        [8.7595, 9.0143, 7.4149,  ..., 0.0000, 0.0000, 0.0000],
        [7.7448, 8.5867, 8.4573,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.0280, 8.0191, 8.7822,  ..., 0.0000, 0.0000, 0.0000],
        [9.3734, 9.1035, 8.2240,  ..., 0.0000, 0.0000, 0.0000],
        [8.6140, 8.7428, 8.9420,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.1428, 8.5405, 9.1248,  ..., 0.0000, 0.0000, 0.0000],
        [9.0271, 7.8542, 8.0307,  ..., 0.0000, 0.0000, 0.0000],
        [9.6838, 8.7118, 8.5991,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4812, 8.7959, 8.7277,  ..., 0.0000, 0.0000, 0.0000],
        [8.6795, 9.2464, 8.3577,  ..., 0.0000, 0.0000, 0.0000],
        [7.3053, 9.1031, 8.5104,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 80 Loss: 4.39349946975708 Accuracy: 0.06131722964346409:  12

torch.Size([128, 20])
tensor([[ 7.3883,  7.9778,  7.1709,  ..., 10.3455,  0.0000,  0.0000],
        [ 7.2589,  8.1513,  7.5772,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0396,  7.9535,  8.0418,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.3005,  8.6692,  9.5152,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3883,  8.6120,  7.0498,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.5291,  8.0375,  8.9175,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.9240, 9.3339, 8.3454,  ..., 0.0000, 0.0000, 0.0000],
        [8.3661, 8.9048, 8.0821,  ..., 0.0000, 0.0000, 0.0000],
        [8.8838, 9.6897, 7.9769,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.3304, 5.7208, 7.2369,  ..., 0.0000, 0.0000, 0.0000],
        [9.1748, 8.0661, 9.0251,  ..., 0.0000, 0.0000, 0.0000],
        [8.7742, 9.2864, 6.6845,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 80 Loss: 4.39349946975708 Accuracy: 0.06131722964346409:  12

torch.Size([128, 20])
tensor([[9.4248, 6.7783, 8.7876,  ..., 0.0000, 0.0000, 0.0000],
        [8.7948, 9.0045, 8.8730,  ..., 0.0000, 0.0000, 0.0000],
        [7.5023, 8.5988, 8.3244,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.2293, 8.8113, 8.8390,  ..., 0.0000, 0.0000, 0.0000],
        [8.2834, 9.3856, 8.9826,  ..., 0.0000, 0.0000, 0.0000],
        [7.1648, 6.7303, 8.1236,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.8459, 8.5980, 8.5854,  ..., 0.0000, 0.0000, 0.0000],
        [9.0071, 8.4526, 8.3313,  ..., 0.0000, 0.0000, 0.0000],
        [7.8459, 8.6309, 9.7128,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.1170, 8.8327, 9.1167,  ..., 0.0000, 0.0000, 0.0000],
        [7.9338, 8.9488, 7.9508,  ..., 0.0000, 0.0000, 0.0000],
        [8.9064, 6.4959, 7.7710,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 80 Loss: 4.39349946975708 Accuracy: 0.06131722964346409:  13

torch.Size([128, 20])
tensor([[9.5223, 9.8527, 9.6675,  ..., 0.0000, 0.0000, 0.0000],
        [8.7025, 9.1510, 7.8001,  ..., 0.0000, 0.0000, 0.0000],
        [9.5223, 6.7650, 8.2223,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.2030, 8.4313, 8.6039,  ..., 0.0000, 0.0000, 0.0000],
        [8.6704, 7.8136, 9.1331,  ..., 0.0000, 0.0000, 0.0000],
        [7.5062, 7.2218, 8.5905,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.0814, 8.5673, 7.8708,  ..., 0.0000, 0.0000, 0.0000],
        [7.0211, 9.5263, 8.9206,  ..., 0.0000, 0.0000, 0.0000],
        [8.4081, 9.7676, 5.8351,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.3508, 8.8741, 8.8352,  ..., 0.0000, 0.0000, 0.0000],
        [8.4081, 7.9780, 8.6721,  ..., 0.0000, 0.0000, 0.0000],
        [7.0903, 6.6950, 8.2053,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 80 Loss: 4.39349946975708 Accuracy: 0.06131722964346409:  13

torch.Size([128, 20])
tensor([[8.2310, 7.2448, 8.7586,  ..., 0.0000, 0.0000, 0.0000],
        [8.5266, 7.1069, 6.8243,  ..., 0.0000, 0.0000, 0.0000],
        [7.4218, 7.0649, 7.2861,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.1699, 8.0126, 8.1494,  ..., 0.0000, 0.0000, 0.0000],
        [6.9725, 8.6473, 8.4681,  ..., 0.0000, 0.0000, 0.0000],
        [7.3421, 8.2773, 6.5780,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.1384, 8.4440, 8.3165,  ..., 0.0000, 0.0000, 0.0000],
        [6.9238, 8.7808, 9.0163,  ..., 0.0000, 0.0000, 0.0000],
        [7.3011, 8.5196, 9.1398,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.3787, 8.8746, 7.8663,  ..., 0.0000, 0.0000, 0.0000],
        [8.4021, 8.2368, 8.2500,  ..., 0.0000, 0.0000, 0.0000],
        [6.9238, 9.4241, 9.1374,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 90 Loss: 4.135539960861206 Accuracy: 0.06191049702465534:  1

torch.Size([128, 20])
tensor([[ 9.4082,  6.1703,  7.9526,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.0786, 10.0905,  8.2796,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.9062,  7.5078, 10.0691,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.7733,  7.1336,  7.2529,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3354,  8.3959,  8.4679,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3354,  7.5924,  7.6139,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.0795, 8.8315, 8.5492,  ..., 0.0000, 0.0000, 0.0000],
        [7.2177, 8.9712, 6.4591,  ..., 0.0000, 0.0000, 0.0000],
        [7.2177, 9.4813, 7.9321,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.2177, 8.4929, 6.3265,  ..., 0.0000, 0.0000, 0.0000],
        [6.8267, 7.2510, 7.3110,  ..., 0.0000, 0.0000, 0.0000],
        [8.8465, 8.7165, 8.7428,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 90 Loss: 4.135539960861206 Accuracy: 0.06191049702465534:  1

torch.Size([128, 20])
tensor([[7.1758, 9.1634, 8.9910,  ..., 0.0000, 0.0000, 0.0000],
        [7.9958, 9.1141, 7.4641,  ..., 0.0000, 0.0000, 0.0000],
        [8.0493, 8.2358, 7.8253,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.0124, 8.7544, 8.2443,  ..., 0.0000, 0.0000, 0.0000],
        [7.6870, 8.7588, 8.5131,  ..., 0.0000, 0.0000, 0.0000],
        [9.1145, 7.5803, 9.0963,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.9524, 8.7451, 8.4581,  ..., 8.9845, 5.5416, 8.2105],
        [8.6647, 8.6785, 8.1474,  ..., 0.0000, 0.0000, 0.0000],
        [9.6911, 5.8619, 8.2723,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.7559, 8.2955, 7.5400,  ..., 0.0000, 0.0000, 0.0000],
        [7.2032, 7.3620, 8.5404,  ..., 9.0209, 0.0000, 0.0000],
        [6.7209, 4.7690, 3.8183,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 90 Loss: 4.135539960861206 Accuracy: 0.06191049702465534:  1

torch.Size([128, 20])
tensor([[8.5619, 8.7186, 8.2111,  ..., 0.0000, 0.0000, 0.0000],
        [7.5717, 7.8200, 8.6684,  ..., 6.5781, 6.0134, 0.0000],
        [6.6593, 6.3701, 8.2268,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.1590, 5.5555, 7.0623,  ..., 0.0000, 0.0000, 0.0000],
        [8.7230, 7.2912, 9.0212,  ..., 0.0000, 0.0000, 0.0000],
        [7.1590, 8.0139, 7.8599,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.0511, 9.0038, 7.8761,  ..., 0.0000, 0.0000, 0.0000],
        [6.6353, 8.3014, 8.3155,  ..., 0.0000, 0.0000, 0.0000],
        [9.1047, 8.8294, 7.8700,  ..., 8.2842, 8.5643, 0.0000],
        ...,
        [8.1404, 9.3230, 7.5278,  ..., 0.0000, 0.0000, 0.0000],
        [8.7874, 8.3692, 9.2898,  ..., 0.0000, 0.0000, 0.0000],
        [9.0377, 9.2134, 6.5129,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 90 Loss: 4.135539960861206 Accuracy: 0.06191049702465534:  1

torch.Size([128, 20])
tensor([[ 7.8771,  8.6230,  7.7744,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.8108,  7.5618,  6.8627,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3537,  8.7115,  7.5359,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.5370,  7.1092,  6.7848,  ..., 10.3950,  6.4535,  8.6911],
        [ 7.0094,  6.8421,  8.3224,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.3866,  7.7077,  7.1546,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[8.7339, 8.9020, 9.3416,  ..., 0.0000, 0.0000, 0.0000],
        [8.2284, 7.8483, 9.1928,  ..., 0.0000, 0.0000, 0.0000],
        [8.7279, 9.9461, 8.3247,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.5204, 7.9729, 8.0978,  ..., 0.0000, 0.0000, 0.0000],
        [7.3091, 6.4533, 7.9288,  ..., 0.0000, 0.0000, 0.0000],
        [7.3091, 8.2370, 7.1533,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 90 Loss: 4.135539960861206 Accuracy: 0.06191049702465534:  1

torch.Size([128, 20])
tensor([[6.9274, 6.7017, 9.1828,  ..., 0.0000, 0.0000, 0.0000],
        [6.9800, 8.9834, 8.6906,  ..., 0.0000, 0.0000, 0.0000],
        [8.1638, 7.3921, 7.0648,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [6.9028, 9.8985, 8.7185,  ..., 0.0000, 0.0000, 0.0000],
        [9.1704, 5.4224, 9.6371,  ..., 0.0000, 0.0000, 0.0000],
        [8.7760, 8.3393, 7.1125,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.4478,  8.9510,  9.9736,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.8869,  6.0025,  9.2207,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.6248,  7.5708,  7.8653,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [10.0892,  7.3113,  6.6169,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9979,  8.8950,  8.1066,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.4414,  8.3664,  9.0052,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 100 Loss: 3.929702639579773 Accuracy: 0.05955834127962589:  

torch.Size([128, 20])
tensor([[ 8.8663,  7.1349,  8.9850,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.3311,  9.7228,  7.1851,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.8914,  7.7400,  9.1281,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.8462,  8.1616,  3.9747,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9197,  6.2061, 10.5448,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.6839,  8.3310,  9.3317,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.8475,  4.8715,  8.7087,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.3564,  8.9275,  7.0738,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2460,  4.0717,  9.8119,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.8052,  7.6393,  9.1009,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.6865, 10.2620,  5.6701,  ...,  7.4134,  7.3650,  7.7342],
        [ 7.7110,  7.4082,  8.0594,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 100 Loss: 3.929702639579773 Accuracy: 0.05955834127962589:  

torch.Size([128, 20])
tensor([[6.3118, 8.4926, 8.5465,  ..., 0.0000, 0.0000, 0.0000],
        [6.8038, 8.3016, 5.4493,  ..., 0.0000, 0.0000, 0.0000],
        [7.7538, 6.7228, 9.1200,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [9.9370, 7.1956, 6.5523,  ..., 0.0000, 0.0000, 0.0000],
        [9.1861, 7.0210, 5.6897,  ..., 0.0000, 0.0000, 0.0000],
        [8.2840, 8.7713, 7.4877,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.0511, 6.0066, 7.3741,  ..., 0.0000, 0.0000, 0.0000],
        [7.6448, 8.6653, 7.8574,  ..., 0.0000, 0.0000, 0.0000],
        [9.0403, 8.3810, 8.8569,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [6.7603, 8.9240, 9.4196,  ..., 0.0000, 0.0000, 0.0000],
        [6.2673, 9.0791, 7.7315,  ..., 0.0000, 0.0000, 0.0000],
        [6.7603, 8.0933, 7.9840,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 100 Loss: 3.929702639579773 Accuracy: 0.05955834127962589:  

torch.Size([128, 20])
tensor([[8.6384, 7.1072, 4.1861,  ..., 0.0000, 0.0000, 0.0000],
        [9.3209, 5.9763, 7.3400,  ..., 0.0000, 0.0000, 0.0000],
        [8.2740, 8.0812, 3.9906,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.7039, 8.4489, 6.8576,  ..., 0.0000, 0.0000, 0.0000],
        [6.6842, 6.8378, 8.0833,  ..., 0.0000, 0.0000, 0.0000],
        [6.6437, 8.3220, 8.0780,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 8.2971,  8.0406,  5.4300,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.6444,  8.8524, 10.3843,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.0395,  8.9479,  6.0286,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.4221,  4.6973,  9.8084,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.9456,  8.9929,  6.4483,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0227,  8.1598,  7.4983,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 100 Loss: 3.929702639579773 Accuracy: 0.05955834127962589:  

torch.Size([128, 20])
tensor([[ 7.5482,  5.6531,  7.6314,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.9245,  7.3993,  9.0823,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9926, 10.2528,  9.2695,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.4789,  8.0349,  8.0507,  ...,  4.4077,  8.6262,  8.9335],
        [ 9.0605,  7.9020,  7.3123,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.9245,  7.6043,  7.4756,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.4165, 9.3221, 9.3369,  ..., 0.0000, 0.0000, 0.0000],
        [7.9671, 7.5618, 8.1089,  ..., 0.0000, 0.0000, 0.0000],
        [9.0296, 8.1530, 8.5302,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [6.8831, 7.5878, 8.0369,  ..., 0.0000, 0.0000, 0.0000],
        [6.5667, 8.0029, 8.0236,  ..., 0.0000, 0.0000, 0.0000],
        [6.5909, 9.3041, 8.0832,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 100 Loss: 3.929702639579773 Accuracy: 0.05955834127962589:  

torch.Size([128, 20])
tensor([[7.9553, 8.1387, 8.3254,  ..., 0.0000, 0.0000, 0.0000],
        [6.5287, 8.0116, 8.5115,  ..., 0.0000, 0.0000, 0.0000],
        [5.9056, 8.8585, 6.8609,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.9679, 8.0756, 8.2664,  ..., 0.0000, 0.0000, 0.0000],
        [6.5491, 4.3477, 8.8190,  ..., 0.0000, 0.0000, 0.0000],
        [5.9056, 3.6340, 8.9329,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.5075,  8.9826,  6.7905,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.5075,  7.2177, 10.0115,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.8646,  7.9198, 10.7601,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.9314,  6.6984,  9.5909,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.2164,  8.0750,  5.7141,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.6001,  7.1682,  7.6032,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 110 Loss: 3.7500200986862184 Accuracy: 0.05658855475485325: 

torch.Size([128, 20])
tensor([[ 6.4038, 10.1323,  6.4255,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9355,  5.9976,  8.3979,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.4141,  5.6275,  9.0741,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.9272,  9.6533,  3.6469,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1124,  4.0396,  5.6888,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0978,  8.2671,  8.7886,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[10.0680, 10.3754,  6.2111,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4173,  8.5953,  6.0810,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4173,  5.1922,  4.8363,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.4261,  7.1522,  7.8206,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.4990,  7.6791,  6.2116,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.8151,  8.8053,  5.8657,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 110 Loss: 3.7500200986862184 Accuracy: 0.05658855475485325: 

torch.Size([128, 20])
tensor([[6.3859, 7.4503, 8.1641,  ..., 0.0000, 0.0000, 0.0000],
        [6.3808, 7.8379, 9.8545,  ..., 7.6364, 9.6993, 0.0000],
        [7.3671, 8.4889, 7.5493,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.9006, 6.0081, 7.4642,  ..., 0.0000, 0.0000, 0.0000],
        [5.7511, 6.0507, 8.2418,  ..., 0.0000, 0.0000, 0.0000],
        [6.3274, 9.5901, 7.2213,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 7.5034,  8.7194,  8.1202,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.5464,  6.6699,  3.3082,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.7231,  9.1662,  5.4559,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.8623,  9.5920,  9.1439,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2910,  7.8174,  8.8198,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.5464, 10.7045,  7.2591,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 110 Loss: 3.7500200986862184 Accuracy: 0.05658855475485325: 

torch.Size([128, 20])
tensor([[ 9.3905,  9.3209,  8.2855,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.6250,  4.1068,  4.6007,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6855,  3.4078,  4.5262,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.0354,  9.3361,  7.9814,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0278,  6.4646,  5.7366,  ...,  0.0000,  0.0000,  0.0000],
        [10.1216,  8.3523,  8.0299,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[6.2706, 8.0803, 8.8863,  ..., 0.0000, 0.0000, 0.0000],
        [8.0042, 5.4340, 9.0269,  ..., 0.0000, 0.0000, 0.0000],
        [6.2699, 8.4766, 5.3359,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.4343, 4.7975, 8.6118,  ..., 0.0000, 0.0000, 0.0000],
        [6.9739, 9.0528, 3.2066,  ..., 0.0000, 0.0000, 0.0000],
        [7.4364, 9.3207, 5.5628,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 110 Loss: 3.7500200986862184 Accuracy: 0.05658855475485325: 

torch.Size([128, 20])
tensor([[ 7.4834,  9.5014,  7.0648,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2341,  8.2685,  6.7611,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.2513,  5.0343,  7.3953,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.2109, 10.2547,  7.6165,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.1867,  8.2624,  8.8954,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.7864,  4.1396,  8.9180,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.1987,  7.6981,  7.9019,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.3009,  6.3125,  7.9785,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.5349,  8.6029,  5.6576,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.5247,  7.0122, 10.0289,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6079,  5.0471,  7.8283,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.7391,  6.4517,  9.4906,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 110 Loss: 3.7500200986862184 Accuracy: 0.05658855475485325: 

torch.Size([128, 20])
tensor([[6.1591, 6.0032, 7.5145,  ..., 0.0000, 0.0000, 0.0000],
        [7.0438, 5.8817, 9.4714,  ..., 0.0000, 0.0000, 0.0000],
        [6.1632, 5.1676, 6.6044,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.7119, 8.4459, 8.7197,  ..., 0.0000, 0.0000, 0.0000],
        [6.1591, 8.9765, 8.8230,  ..., 0.0000, 0.0000, 0.0000],
        [8.6929, 9.8264, 9.5036,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.6561,  7.2074,  8.0641,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.8840,  7.0305,  7.6825,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3969, 10.5609,  7.4755,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.6561,  9.8228,  9.6679,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1941,  8.3268,  7.3849,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.1278,  7.3915,  8.0657,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 120 Loss: 3.6407530307769775 Accuracy: 0.055703823640942576:

torch.Size([128, 20])
tensor([[8.8262, 8.0668, 7.5784,  ..., 7.1907, 2.5307, 2.2102],
        [5.6246, 4.0391, 9.6811,  ..., 0.0000, 0.0000, 0.0000],
        [7.5382, 8.6990, 7.8146,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [6.0873, 6.8368, 8.3563,  ..., 0.0000, 0.0000, 0.0000],
        [6.0873, 8.4374, 4.2962,  ..., 0.0000, 0.0000, 0.0000],
        [6.4331, 7.9345, 7.2480,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 9.2739,  7.6331,  8.8986,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.0586,  4.6644,  8.3752,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.6516,  7.5883,  9.8928,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.1002,  5.8183,  8.3856,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.5316,  7.9279, 10.0125,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1002,  8.0845,  7.6074,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 120 Loss: 3.6407530307769775 Accuracy: 0.055703823640942576:

torch.Size([128, 20])
tensor([[ 7.2946,  8.9363,  9.2931,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.7602, 10.0306,  5.6690,  ...,  7.5515,  0.0000,  0.0000],
        [ 5.9958,  7.6382,  8.5990,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.2972,  8.8189, 10.6318,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.0169,  9.5103,  9.8903,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.5640, 10.2613,  9.1624,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.5351,  8.2607,  7.0995,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9674,  6.7617,  9.9511,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7218,  6.4236,  9.8906,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.0574,  4.3283, 10.2633,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9930,  7.7591,  6.5919,  ...,  0.0000,  0.0000,  0.0000],
        [10.1681, 10.1380,  8.8225,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 120 Loss: 3.6407530307769775 Accuracy: 0.055703823640942576:

torch.Size([128, 20])
tensor([[5.9607, 7.2732, 7.3263,  ..., 0.0000, 0.0000, 0.0000],
        [7.9677, 8.9476, 9.0937,  ..., 0.0000, 0.0000, 0.0000],
        [9.3972, 9.8884, 9.0416,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.5073, 6.9177, 8.8703,  ..., 0.0000, 0.0000, 0.0000],
        [7.0130, 7.9283, 5.6702,  ..., 0.0000, 0.0000, 0.0000],
        [6.8863, 7.6951, 5.7794,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.4819,  9.2578,  5.2013,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.5177,  6.0921, 10.6665,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4722,  3.3133,  7.4604,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.8649,  6.9733,  7.7699,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4722,  4.3290,  2.6810,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9284,  8.0610,  5.0466,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 120 Loss: 3.6407530307769775 Accuracy: 0.055703823640942576:

torch.Size([128, 20])
tensor([[5.8966, 8.4419, 5.8257,  ..., 0.0000, 0.0000, 0.0000],
        [7.4126, 5.9647, 5.5509,  ..., 0.0000, 0.0000, 0.0000],
        [5.4592, 7.7552, 5.7959,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.8476, 8.3127, 8.8788,  ..., 0.0000, 0.0000, 0.0000],
        [5.4582, 4.6707, 8.7189,  ..., 0.0000, 0.0000, 0.0000],
        [7.2142, 4.5859, 8.8521,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.8647,  9.4526,  5.3911,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4356, 10.1827,  4.1385,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.5795,  9.1356,  5.5505,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.8647,  5.5792,  8.1062,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.8647,  4.4456,  9.5654,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.8477,  4.5614,  8.9754,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 120 Loss: 3.6407530307769775 Accuracy: 0.055703823640942576:

torch.Size([128, 20])
tensor([[ 5.4150,  6.6547,  6.5014,  ..., 12.0900,  8.9760,  0.0000],
        [ 5.8150,  5.5241,  8.5213,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.9050,  9.2464,  6.7821,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.4348,  9.0213,  7.4793,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0422,  8.2083,  3.1889,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.8720, 10.4352,  3.0519,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.8106,  6.9110,  7.8138,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.5213,  8.4412,  6.0566,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9431,  8.3928,  9.7351,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.0181,  9.8642,  4.2799,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2256,  8.3317,  3.9688,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.5162,  8.7260, 11.6711,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 130 Loss: 3.6385316371917726 Accuracy: 0.05923073627054691: 

torch.Size([128, 20])
tensor([[6.2083, 8.7608, 7.9527,  ..., 0.0000, 0.0000, 0.0000],
        [5.7519, 7.4355, 5.3836,  ..., 0.0000, 0.0000, 0.0000],
        [6.2083, 8.6204, 4.2837,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.7519, 7.7355, 8.2954,  ..., 0.0000, 0.0000, 0.0000],
        [9.1759, 4.6352, 9.1128,  ..., 0.0000, 0.0000, 0.0000],
        [5.7519, 9.1754, 9.5421,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 8.7906,  8.9064,  4.0772,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1253,  7.1153,  8.1299,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.6507,  3.8614,  6.4876,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.9539,  7.2403,  9.6603,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1253,  7.7025,  6.9122,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.7207,  9.0626, 11.1040,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 130 Loss: 3.6385316371917726 Accuracy: 0.05923073627054691: 

torch.Size([128, 20])
tensor([[ 6.1731,  7.8213,  9.8809,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.2446,  9.1314, 10.5845,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9457,  8.2372,  7.3058,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.8037,  8.3230,  7.7340,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.3350, 11.1519, 10.3819,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.5835,  4.3352,  9.6284,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 9.1052, 10.7221,  3.2754,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.1566,  8.6678,  9.4864,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.0053,  8.2842,  9.0176,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.8981,  5.9112,  8.9300,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3818,  3.3371,  4.9920,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6598,  7.5633,  7.4905,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 130 Loss: 3.6385316371917726 Accuracy: 0.05923073627054691: 

torch.Size([128, 20])
tensor([[ 5.6302,  9.2747,  6.9120,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.2790,  5.4544, 10.2882,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.6093,  8.5604,  3.2519,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.3048,  4.4696,  4.7990,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6518,  9.2117,  6.0463,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.4836,  9.0775,  5.7787,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.6237,  5.4603,  9.8686,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.5775,  8.7916,  4.0704,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.6292,  9.7014,  8.2195,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.2466,  4.9203,  9.3885,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2907,  8.9387,  4.9430,  ..., 12.2200,  5.7629,  6.4822],
        [ 8.0001,  7.5194,  3.8823,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 130 Loss: 3.6385316371917726 Accuracy: 0.05923073627054691: 

torch.Size([128, 20])
tensor([[6.1116, 7.8516, 8.8870,  ..., 0.0000, 0.0000, 0.0000],
        [5.6657, 9.4333, 7.9810,  ..., 0.0000, 0.0000, 0.0000],
        [5.2765, 7.4974, 9.6607,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [6.8190, 5.1418, 7.7275,  ..., 0.0000, 0.0000, 0.0000],
        [5.5963, 5.8186, 7.0497,  ..., 0.0000, 0.0000, 0.0000],
        [6.7116, 8.2890, 9.3340,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.0969,  7.6052,  6.9285,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3317,  9.0566,  5.0413,  ...,  0.0000,  0.0000,  0.0000],
        [10.3768,  9.6950,  9.0449,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.5454, 10.0842,  6.2015,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7719,  7.9900,  5.1238,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.4649,  6.4091, 10.8451,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 130 Loss: 3.6385316371917726 Accuracy: 0.05923073627054691: 

torch.Size([128, 20])
tensor([[5.2502, 7.4888, 6.7397,  ..., 0.0000, 0.0000, 0.0000],
        [5.5423, 4.1273, 7.1445,  ..., 0.0000, 0.0000, 0.0000],
        [5.6287, 6.7381, 4.9791,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.5195, 8.0652, 2.9947,  ..., 0.0000, 0.0000, 0.0000],
        [7.1704, 8.2784, 5.4605,  ..., 0.0000, 0.0000, 0.0000],
        [6.0817, 9.1111, 8.0216,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 9.3999,  8.8880,  5.0408,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4944, 11.1649,  5.2688,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.1204,  8.4963,  5.7141,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.2309,  8.7394,  9.5181,  ...,  3.3189,  5.6694,  0.0000],
        [ 6.0670,  4.3622,  7.6381,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.5168,  8.4901,  7.9802,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 140 Loss: 3.5741166830062867 Accuracy: 0.06044212728738785: 

torch.Size([128, 20])
tensor([[7.8865, 5.5601, 7.3685,  ..., 0.0000, 0.0000, 0.0000],
        [6.2054, 8.2593, 4.5756,  ..., 0.0000, 0.0000, 0.0000],
        [6.2054, 6.7050, 8.1094,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.1132, 8.2551, 8.9375,  ..., 0.0000, 0.0000, 0.0000],
        [5.4913, 7.3598, 6.2001,  ..., 0.0000, 0.0000, 0.0000],
        [5.4913, 9.2834, 9.2622,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 9.4875, 10.6790,  9.1606,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.2382,  8.1793,  5.1412,  ...,  7.0667,  7.8438,  0.0000],
        [ 5.2115,  5.5579,  3.0540,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.1728,  7.4044,  4.9291,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.5821,  7.0232,  7.8270,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7344,  7.0196,  5.0140,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 140 Loss: 3.5741166830062867 Accuracy: 0.06044212728738785: 

torch.Size([128, 20])
tensor([[10.4665,  8.0700,  8.7937,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4232,  6.8784,  3.5989,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4232,  8.7898,  9.0392,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.5629,  7.9249,  9.2452,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.5813,  6.5558,  8.5078,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.2848, 10.3380,  5.6064,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.4181,  9.0169, 11.6640,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.8236,  9.3775,  6.7653,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.8886,  6.9453,  8.5959,  ...,  4.5592,  9.7684,  0.0000],
        ...,
        [10.4713,  6.5883,  5.4793,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.0529,  3.6178,  4.3470,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4005,  9.3654,  6.5313,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 140 Loss: 3.5741166830062867 Accuracy: 0.06044212728738785: 

torch.Size([128, 20])
tensor([[7.2818, 9.5393, 7.8311,  ..., 0.0000, 0.0000, 0.0000],
        [5.5324, 6.6538, 7.7243,  ..., 0.0000, 0.0000, 0.0000],
        [7.7682, 5.2724, 4.4727,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.1795, 8.1471, 6.6340,  ..., 0.0000, 0.0000, 0.0000],
        [5.3947, 8.5570, 8.3066,  ..., 0.0000, 0.0000, 0.0000],
        [7.7006, 7.2917, 4.9185,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[6.5200, 7.5799, 9.4683,  ..., 0.0000, 0.0000, 0.0000],
        [5.5179, 4.4901, 4.5002,  ..., 0.0000, 0.0000, 0.0000],
        [8.2719, 6.2911, 7.8779,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.5001, 6.3359, 3.2355,  ..., 0.0000, 0.0000, 0.0000],
        [5.5179, 5.6960, 8.4346,  ..., 0.0000, 0.0000, 0.0000],
        [5.2408, 3.3314, 4.6638,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 140 Loss: 3.5741166830062867 Accuracy: 0.06044212728738785: 

torch.Size([128, 20])
tensor([[ 5.1572,  7.3440, 10.4443,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3360,  5.1175,  9.2841,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2301,  5.9977,  7.0675,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.5809,  9.2626,  9.2736,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1572,  5.5347,  9.3861,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3491,  7.9251,  3.5050,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[10.3509,  9.8062, 10.2442,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4802,  7.2713,  8.0341,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4802,  8.6496, 10.1052,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.0192, 10.2204, 10.2032,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9551,  8.7862,  6.4005,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.7062,  8.5478,  4.7473,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 140 Loss: 3.5741166830062867 Accuracy: 0.06044212728738785: 

torch.Size([128, 20])
tensor([[5.1330, 3.8564, 8.5886,  ..., 0.0000, 0.0000, 0.0000],
        [6.4702, 5.6656, 9.2982,  ..., 0.0000, 0.0000, 0.0000],
        [6.4601, 8.2600, 6.4774,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [7.9527, 8.4433, 5.2320,  ..., 0.0000, 0.0000, 0.0000],
        [5.2091, 5.0125, 4.4212,  ..., 0.0000, 0.0000, 0.0000],
        [9.0480, 8.6589, 6.6873,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 7.1045,  4.8792,  7.9409,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.7435,  5.5999,  6.5614,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1203,  5.7706,  6.2970,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 6.4411, 10.3165,  5.1563,  ...,  0.0000,  0.0000,  0.0000],
        [10.2682,  3.9009,  4.2461,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2844,  8.2117, 10.7716,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 150 Loss: 3.5454471349716186 Accuracy: 0.0558757558465004:  

torch.Size([128, 20])
tensor([[ 5.9196,  6.8754,  3.0402,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.9412,  5.8871, 10.9109,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1897,  9.6227,  8.2716,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.9196,  4.1646,  8.4769,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.8604,  8.2915,  7.8686,  ...,  0.0000,  0.0000,  0.0000],
        [10.0314,  6.2998, 11.0541,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[6.4042, 5.5407, 8.1728,  ..., 0.0000, 0.0000, 0.0000],
        [6.4042, 3.7733, 9.4797,  ..., 0.0000, 0.0000, 0.0000],
        [5.1810, 7.4743, 9.4319,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.9068, 9.4714, 6.2804,  ..., 0.0000, 0.0000, 0.0000],
        [9.3491, 6.8819, 6.3481,  ..., 0.0000, 0.0000, 0.0000],
        [8.2056, 4.2970, 8.5269,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 150 Loss: 3.5454471349716186 Accuracy: 0.0558757558465004:  

torch.Size([128, 20])
tensor([[ 5.1705, 10.2713,  7.4700,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0469,  9.8103, 10.9067,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.4472,  7.0366,  4.6972,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.5527,  5.3279,  4.5149,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4451,  7.7315,  7.3925,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0470,  9.3250,  7.8799,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 7.4104,  9.8568,  9.4057,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1594, 10.3975,  3.5157,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2061,  9.1633,  8.5475,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.3077,  4.7442,  4.3741,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2061,  8.7362,  3.4253,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.8297,  9.9166,  4.4184,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 150 Loss: 3.5454471349716186 Accuracy: 0.0558757558465004:  

torch.Size([128, 20])
tensor([[ 5.1469,  3.3162,  7.0755,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1469, 10.6927,  6.2727,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1879,  6.9000,  8.2905,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.1790,  6.9298,  6.4639,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.9495,  4.6721,  6.9117,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.0287,  9.6841,  3.3029,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[5.0489, 4.0578, 9.3203,  ..., 0.0000, 0.0000, 0.0000],
        [9.1104, 4.5287, 7.3430,  ..., 0.0000, 0.0000, 0.0000],
        [6.2341, 5.5863, 7.6750,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.1331, 3.3081, 8.9464,  ..., 0.0000, 0.0000, 0.0000],
        [8.2787, 8.0211, 4.5510,  ..., 0.0000, 0.0000, 0.0000],
        [5.1696, 7.4591, 5.2061,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 150 Loss: 3.5454471349716186 Accuracy: 0.0558757558465004:  

torch.Size([128, 20])
tensor([[ 5.0377,  9.8359,  9.2050,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0306,  4.6399,  6.4639,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1187, 10.6611,  4.5950,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.1515,  4.1601,  9.9689,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1515,  3.9324,  7.4955,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.3129, 10.2870,  4.5064,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 5.0256,  8.5035,  8.5002,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.3622,  9.3184,  4.3166,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.8972,  7.1018,  8.9133,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.3516,  5.3829,  4.2514,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4314,  4.7015, 12.1272,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1214,  7.8086,  5.2134,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 150 Loss: 3.5454471349716186 Accuracy: 0.0558757558465004:  

torch.Size([128, 20])
tensor([[ 8.8483,  9.9043,  7.8385,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1025,  9.8327,  4.5095,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0872,  3.2763,  2.7505,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [10.0526,  4.4791, 11.0617,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.1025,  8.4413,  5.3116,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1321,  7.2132,  8.3676,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[5.0981, 8.3725, 9.5748,  ..., 0.0000, 0.0000, 0.0000],
        [5.0981, 9.8599, 9.0289,  ..., 0.0000, 0.0000, 0.0000],
        [7.0058, 4.8347, 5.1756,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [8.0270, 4.1117, 2.8910,  ..., 0.0000, 0.0000, 0.0000],
        [7.3693, 3.9110, 7.8633,  ..., 0.0000, 0.0000, 0.0000],
        [5.8619, 9.0829, 6.2780,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 160 Loss: 3.4802643060684204 Accuracy: 0.06915076598525047: 

torch.Size([128, 20])
tensor([[ 5.0635,  8.4987,  3.4785,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0815,  8.9423, 10.1710,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9901,  6.8903,  3.1169,  ...,  2.7742,  7.5194,  5.2185],
        ...,
        [ 5.0635,  8.1128,  6.8618,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9901,  8.7410,  3.0627,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9901,  6.9766,  8.9612,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.9785,  5.4775,  6.3769,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2404,  3.9980,  8.0019,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0661, 10.5523,  8.1137,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.0346,  8.8852,  8.7534,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0439,  6.6324, 10.3165,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.7225,  6.7962, 10.1906,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 160 Loss: 3.4802643060684204 Accuracy: 0.06915076598525047: 

torch.Size([128, 20])
tensor([[ 5.0498,  8.2729,  8.6728,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.3086,  7.4521, 10.0477,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.6614,  4.6447,  8.4415,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.0253,  6.4538,  6.8175,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0253,  6.8826,  9.5061,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.0184,  3.2189,  9.8471,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 6.2171,  4.1145,  9.7940,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.7581, 10.7035, 11.0393,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9568, 10.4254,  7.9225,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.1222,  4.8064,  8.1883,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.7501,  9.2295,  8.8042,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9568,  5.4668,  6.8240,  ...,  9.0821,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 160 Loss: 3.4802643060684204 Accuracy: 0.06915076598525047: 

torch.Size([128, 20])
tensor([[ 7.0392,  7.3754,  9.6425,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4803,  6.5727,  8.9891,  ...,  0.0000,  0.0000,  0.0000],
        [10.0941,  7.6836,  8.4514,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.9911, 10.0865,  9.9456,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3061,  3.8556,  7.8939,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.1842,  8.4964,  8.1322,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[5.7374, 9.0064, 7.8134,  ..., 0.0000, 0.0000, 0.0000],
        [5.0032, 8.9738, 9.8421,  ..., 0.0000, 0.0000, 0.0000],
        [4.9304, 5.6670, 5.7564,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.7374, 6.9638, 9.0334,  ..., 0.0000, 0.0000, 0.0000],
        [5.0032, 5.3867, 9.2248,  ..., 0.0000, 0.0000, 0.0000],
        [4.9753, 5.0005, 8.6106,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 160 Loss: 3.4802643060684204 Accuracy: 0.06915076598525047: 

torch.Size([128, 20])
tensor([[ 7.4409,  6.5824,  8.1456,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9175,  8.3384,  7.4916,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9886,  7.2230,  5.2672,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.9175,  7.2142,  5.1806,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.7795,  6.5494,  3.5651,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.7252,  3.5418, 10.2815,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.6167, 4.6157, 9.9703,  ..., 0.0000, 0.0000, 0.0000],
        [9.4082, 8.3800, 9.1599,  ..., 0.0000, 0.0000, 0.0000],
        [8.8131, 4.7269, 6.6130,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [4.9405, 3.1554, 6.3955,  ..., 0.0000, 0.0000, 0.0000],
        [4.9441, 5.3498, 8.2036,  ..., 0.0000, 0.0000, 0.0000],
        [4.9038, 7.7731, 8.2477,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 160 Loss: 3.4802643060684204 Accuracy: 0.06915076598525047: 

torch.Size([128, 20])
tensor([[ 4.9241,  3.1413,  9.5570,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9241,  3.9625, 10.6801,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.3021,  4.5621,  4.6786,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.8646,  4.9594, 11.3932,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.3012,  6.4273, 10.4880,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3502, 10.4045,  6.1899,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[6.1582, 7.4403, 8.8922,  ..., 0.0000, 0.0000, 0.0000],
        [4.9076, 4.4075, 7.7482,  ..., 0.0000, 0.0000, 0.0000],
        [4.8825, 7.2263, 3.5948,  ..., 7.9098, 0.0000, 0.0000],
        ...,
        [4.9076, 4.7711, 4.4388,  ..., 0.0000, 0.0000, 0.0000],
        [6.2192, 6.9450, 5.8671,  ..., 0.0000, 0.0000, 0.0000],
        [5.6992, 7.2264, 3.6144,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 170 Loss: 3.4956715583801268 Accuracy: 0.06458935476839542: 

torch.Size([128, 20])
tensor([[10.4138,  7.7574,  8.3102,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7502,  4.2934, 11.4295,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0247,  3.6996,  4.6490,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.8995,  6.8500,  9.0474,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8995,  9.3310, 10.7583,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8995,  4.9934,  8.4878,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.8652,  8.3991,  7.4232,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8854,  4.0485,  9.3050,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.1586,  8.8719,  2.9584,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.6055, 11.5671,  6.9473,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.3192,  4.1471,  9.3789,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.1296, 11.1469, 12.1130,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 170 Loss: 3.4956715583801268 Accuracy: 0.06458935476839542: 

torch.Size([128, 20])
tensor([[ 8.4589,  5.1968,  4.2178,  ...,  0.0000,  0.0000,  0.0000],
        [10.2632,  9.6303, 11.5013,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.9087,  7.8452,  5.2575,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.7414,  8.5924, 10.7837,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.1266,  7.6535,  9.1973,  ...,  0.0000,  0.0000,  0.0000],
        [10.3769,  4.5728,  7.9008,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.8484,  5.0181,  4.2394,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.2545,  5.9446,  8.1230,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.1141,  9.3476,  8.9861,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.8580,  3.4841,  6.8029,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8580,  9.5118,  6.1613,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.8820, 10.3820,  7.2424,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 170 Loss: 3.4956715583801268 Accuracy: 0.06458935476839542: 

torch.Size([128, 20])
tensor([[ 4.8450,  4.9750,  4.1930,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8367,  8.9347,  8.9559,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8450,  6.8394,  7.6558,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 8.3187,  7.6191,  2.8614,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.0550,  9.5955,  9.7991,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0211,  9.9449, 10.8806,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.8782,  7.9159,  7.8237,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6353,  4.1624, 11.7405,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.6353,  5.5792,  7.6131,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.1189,  9.7334,  8.8158,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.9133,  3.6035,  4.1200,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.4812,  9.3970,  4.8660,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 170 Loss: 3.4956715583801268 Accuracy: 0.06458935476839542: 

torch.Size([128, 20])
tensor([[10.0119,  9.7436,  9.5400,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8673,  7.9946,  4.1740,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7969,  8.3275,  2.8430,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.8203,  5.4003,  8.8064,  ...,  4.3519,  7.7210,  8.3247],
        [ 5.2708,  4.1482,  8.1583,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.0742,  6.6870,  9.2413,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.8112, 11.2961,  4.1121,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.0913,  7.7387,  8.4662,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7765,  4.9157,  7.7895,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.5874,  4.0678,  6.8435,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8576,  7.0081,  2.9927,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.7378,  6.9299, 12.3644,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 170 Loss: 3.4956715583801268 Accuracy: 0.06458935476839542: 

torch.Size([128, 20])
tensor([[ 4.8025,  4.6807,  7.3657,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8025,  4.0641,  2.8510,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.2547,  8.6507,  9.2065,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.6086, 10.0556,  7.1084,  ...,  0.0000,  0.0000,  0.0000],
        [10.9099,  8.2326,  8.0704,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.8025,  6.9506,  7.9573,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[4.8377, 6.8006, 9.3746,  ..., 0.0000, 0.0000, 0.0000],
        [6.8414, 5.5753, 6.4579,  ..., 0.0000, 0.0000, 0.0000],
        [7.7128, 6.8922, 8.5256,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [4.8377, 7.6018, 8.3969,  ..., 0.0000, 0.0000, 0.0000],
        [4.8377, 4.5615, 3.5311,  ..., 0.0000, 0.0000, 0.0000],
        [8.7140, 9.0028, 6.8601,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 180 Loss: 3.535568618774414 Accuracy: 0.06038607507944107:  

torch.Size([128, 20])
tensor([[4.8286, 3.9725, 8.8556,  ..., 0.0000, 0.0000, 0.0000],
        [8.3436, 9.9778, 7.0156,  ..., 0.0000, 0.0000, 0.0000],
        [6.0347, 5.5424, 8.1616,  ..., 6.2219, 7.0014, 9.1230],
        ...,
        [4.7786, 8.8882, 3.5324,  ..., 0.0000, 0.0000, 0.0000],
        [7.0166, 4.6631, 4.8064,  ..., 0.0000, 0.0000, 0.0000],
        [4.8286, 8.6247, 8.8734,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[9.6314, 9.2879, 4.6803,  ..., 0.0000, 0.0000, 0.0000],
        [4.7706, 6.4782, 8.7159,  ..., 0.0000, 0.0000, 0.0000],
        [4.7690, 8.3725, 5.5635,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [4.7794, 9.1868, 9.3046,  ..., 0.0000, 0.0000, 0.0000],
        [4.7690, 6.9532, 9.6135,  ..., 0.0000, 0.0000, 0.0000],
        [4.8185, 5.6214, 2.8367,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 180 Loss: 3.535568618774414 Accuracy: 0.06038607507944107:  

torch.Size([128, 20])
tensor([[ 8.6826,  7.9732,  9.7080,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.4937, 10.3289,  7.5884,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7596,  7.8451, 11.6658,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 4.7622,  8.5694, 10.4682,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.9764,  9.8116,  7.8272,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7732,  9.4937,  5.5752,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[7.0519, 6.6809, 7.9476,  ..., 0.0000, 0.0000, 0.0000],
        [6.1241, 8.6660, 4.3964,  ..., 0.0000, 0.0000, 0.0000],
        [4.7989, 6.9515, 7.5745,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [5.2155, 8.0229, 4.2546,  ..., 0.0000, 0.0000, 0.0000],
        [4.7530, 9.9251, 2.8008,  ..., 0.0000, 0.0000, 0.0000],
        [4.7499, 5.2772, 8.2780,  ..., 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 180 Loss: 3.535568618774414 Accuracy: 0.06038607507944107:  

torch.Size([128, 20])
tensor([[ 7.2958,  9.5982,  4.6679,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7434,  9.3486,  9.2676,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.5840,  8.5059,  4.3296,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 9.3898, 10.6655,  9.4662,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7893,  8.0100,  5.1705,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.1261,  8.5308,  4.2870,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.7490,  8.1589, 10.6454,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.4966,  9.1305,  4.1826,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.5545,  7.7059,  9.1272,  ...,  3.4931,  2.5696,  0.0000],
        ...,
        [ 8.8709,  4.1195,  7.7018,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7340,  8.7733,  9.2959,  ...,  0.0000,  0.0000,  0.0000],
        [ 9.3024,  6.9019,  7.4855,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 180 Loss: 3.535568618774414 Accuracy: 0.06038607507944107:  

torch.Size([128, 20])
tensor([[ 8.2260,  6.9118,  7.1037,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7254,  3.0234,  4.5033,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.1036,  7.8385, 10.2125,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.9188,  4.0449,  8.4366,  ...,  0.0000,  0.0000,  0.0000],
        [ 6.6775,  7.6578,  3.7287,  ...,  0.0000,  0.0000,  0.0000],
        [10.5234,  5.2657,  6.7688,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)
torch.Size([128, 20])
tensor([[ 4.7178, 10.5571,  9.0548,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7582,  9.6219,  4.4616,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.5736,  8.0633,  5.6400,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 7.3712,  7.7213, 11.7175,  ...,  0.0000,  0.0000,  0.0000],
        [10.2670,  4.6003,  9.1282,  ...,  0.0000,  0.0000,  0.0000],
        [ 7.3310,  6.6690,  4.1909,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


Epoch: 0 Iteration: 180 Loss: 3.535568618774414 Accuracy: 0.06038607507944107:  


torch.Size([128, 20])
tensor([[ 7.6923,  8.3265,  8.6735,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.6921,  3.4115,  4.7286,  ...,  0.0000,  0.0000,  0.0000],
        [ 8.2890,  4.1383,  7.1403,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [ 5.8928,  3.3903,  4.2855,  ...,  0.0000,  0.0000,  0.0000],
        [ 4.7099,  3.0172,  8.7409,  ...,  0.0000,  0.0000,  0.0000],
        [ 5.5227, 10.7848,  7.6144,  ...,  0.0000,  0.0000,  0.0000]],
       grad_fn=<MulBackward0>)


KeyboardInterrupt: 

# Using the language model

Congratulations, you have now trained a language model! We can now use it to evaluate likely news headlines, as well as generate our very own headlines.

**TODO**: Complete the three parts below, using the model you have trained.

## (1) Evaluation loss

To evaluate the language model, we evaluate its loss (ability to predict) on unseen data that is reserved for evaluation.
Your first evaluation is to load the model you trained, and obtain a test loss. If you are running this validation and not training, run the setup cell above the training loop first.

In [11]:
model_id = "test1"
save_dict = th.load(root_folder+'models/part1/'+f"model_{model_id}.pt",)
model = LanguageModel(**save_dict['kwargs'])
model.load_state_dict(save_dict['model_state_dict'])
model.eval()

LanguageModel(
  (embedding): Embedding(10000, 256)
  (lstm): LSTM(256, 256, batch_first=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (output): Linear(in_features=256, out_features=10000, bias=True)
)

In [12]:
# We will evaluate your model in the model_file above
# In a very similar way as the code below.
# Make sure your validation loss is below the threshold we specified
# and that you didn't train using the validation set, as you would
# get penalized.

batch = build_batch(d_valid, range(len(d_valid)))
(batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
# (batch_input, batch_target, batch_target_mask) = list_to_device((batch_input, batch_target, batch_target_mask))
prediction = model(batch_input.long())
loss = loss_fn(prediction, batch_target, batch_target_mask)
print("Evaluation set loss:", loss.item())

Evaluation set loss: 2.7313876152038574


In [13]:
# Your best performing model should go here.
os.makedirs(root_folder+"best_models",exist_ok=True)
best_model_file = root_folder+"best_models/part1_best_model.pt"
th.save(save_dict,best_model_file)

In [14]:
model.cpu()

LanguageModel(
  (embedding): Embedding(10000, 256)
  (lstm): LSTM(256, 256, batch_first=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (output): Linear(in_features=256, out_features=10000, bias=True)
)

## (2) Evaluation of likelihood of data

One use of a language model is to see what data is more likely to have originated from the training data. Because we have trained our model on news headlines, we can see which of these headlines is more likely:

``Apple to release another iPhone in September``


 ``Apple and Samsung resolve all lawsuits amicably``
 
**TODO**: Use the model to obtain the loss the neural network assigns to each sentence.
Because the neural network assigns probability to the words appearing in a sequence, this loss can be used as a proxy to measure how likely the sentence is to have occurred in the dataset.
Once you have the loss for each headline, write down which sentence was judged to be more likely, and explain why/if you think this is coherent.

**Your answer:**


In [15]:
def raw_sample_pred(headline, model):
    #####
    # BEGIN YOUR CODE HERE 
    #####
    # From the code in the Preprocessing section at the end of the notebook
    # Find out how to tokenize the headline
    
    tokenized = tokenizer.word_tokenizer(headline)

    # Find out how to numerize the tokenized headline
    numerized = [w2i.get(w,unkI) for w in tokenized]

    # Learn how to pad and obtain the mask of the sequence.
    padded, mask = pad_sequence(numerized, padI, input_length)

    # Obtain the predicted headline and target headline
    
    input_dataset=[{"numerized":padded, "mask":mask}]
    batch=build_batch(input_dataset,[0])
    (input_headline,target_headline, mask) = batch_to_torch(*batch)
    pred_headline = model(input_headline)
    

    #####
    # END YOUR CODE HERE 
    #####

    return pred_headline,target_headline,mask

In [16]:
model.eval()

headline1 = "Apple to release new iPhone in July"
headline2 = "Apple and Samsung resolve all lawsuits"

headlines = [headline1.lower(), headline2.lower()] # Our LSTM is trained on lower-cased headlines
for headline in headlines:
    pred_headline,target_headline,mask = raw_sample_pred(headline, model)
    loss = loss_fn(pred_headline,target_headline,mask) # Obtain the loss
    
    print("----------------------------------------")
    print("Headline:", headline)
    print("Loss of the headline:", loss)
with th.no_grad():
    validate_to_array(raw_sample_pred,zip(headlines,[model]*2),'raw_sample_pred',root_folder,multi=True)
# Important check: one headline should be more likely (and have lower loss)
# Than the other headline. You should know which headline should have lower loss.

----------------------------------------
Headline: apple to release new iphone in july
Loss of the headline: tensor(1.2301, grad_fn=<MeanBackward0>)
----------------------------------------
Headline: apple and samsung resolve all lawsuits
Loss of the headline: tensor(1.5779, grad_fn=<MeanBackward0>)


  arr = np.asanyarray(arr)
  arr = np.asanyarray(arr)


## (3) Generation of headlines

We can use our language model to generate text according to the distribution of our training data.
The way generation works is the following:

We seed the model with a beginning of sequence, and obtain the distribution for the next word.
We select the most likely word (argmax) and add it to our sequence of words.
Now our sequence is one word longer, and we can feed it in again as an input, for the network to produce the next sentence.
We do this a fixed number of times (up to 20 words), and obtain automatically generated headlines!


We have provided a few headline starters that should produce interesting generated headlines.

**TODO:** Get creative and find at least 2 more headline_starters that produce interesting headlines.

In [17]:
def generate_sentence(headline_starter, model):
    # Tokenize and numerize the headline. Put the numerized headline
    # beginning in `current_build`
    tokenized = tokenizer.word_tokenizer(headline_starter.lower())
    current_build = [startI] + numerize_sequence(tokenized)

    while len(current_build) < input_length:
        # Pad the current_build into a input_length vector.
        # We do this so that it can be processed by our LanguageModel class
        current_padded, _m = pad_sequence(current_build, padI, input_length)

        # Obtain the logits for the current padded sequence
        # This involves obtaining the output_logits from our model,
        # and not the loss like we have done so far
        current_padded=th.tensor(current_padded)
        logits = model.forward(th.unsqueeze(current_padded,dim=0))
#         logits_np = logits.detach().cpu().numpy()
        
        # Obtain the row of logits that interest us, the logits for the last non-pad
        # inputs
        last_logits = th.squeeze(logits[:,len(current_build)])

        # Find the highest scoring words in the last_logits
        # array, or sample from the softmax.
        # The np.argmax function may be useful for first option,
        # sp.special.softmax and np.random.choice may be useful for second option.
        # Append this word to our current build
        probs=th.softmax(last_logits,dim=0).detach().numpy()
        sample=np.random.choice(np.arange(probs.shape[0]),p=probs)
        
        current_build.append(sample.item())

    # Go from the current_build of word_indices
    # To the headline (string) produced. This should involve
    # the vocabulary, and a string merger.
    produced_sentence = numerized2text(current_build)
    return produced_sentence

In [18]:
model.eval()
# Here are some headline starters.
# They're all about tech companies, because
# That is what is in our dataset
headline_starters = ["apple has released", "google has released", "amazon", "tesla to", "facebook is now meta", "youtube changes its user policy"]
for headline_starter in headline_starters:
    print("===================")
    print("Generating headline starting with: "+headline_starter)

    produced_sentence = generate_sentence(headline_starter, model)
    print(produced_sentence)
with th.no_grad():
    validate_to_array(generate_sentence,zip(headline_starters,[model]*len(headline_starters)),"generate_sentence",root_folder,multi=True)

Generating headline starting with: apple has released
<START> apple has released less on in , macs us republican and at . over ban on date uber started
Generating headline starting with: google has released
<START> google has released home , ? target UNK blocked by in search UNK goes white police buy candidates children
Generating headline starting with: amazon
<START> amazon dominates price into , with violence flights gets , enough after dropped UNK too disappears a of to
Generating headline starting with: tesla to
<START> tesla to u.s. apple for of in trial UNK , says were fall part of may the growth effect
Generating headline starting with: facebook is now meta
<START> facebook is now UNK your traffic over cars in are because around comes to world chats hacks UNK a
Generating headline starting with: youtube changes its user policy
<START> youtube changes its user policy is UNK from , ,' to inside , launching - live twitter into course


## All done

You are done with the first part of the HW.

Next notebook deals with Summarization of text!
