# Lab 3: Word Embeddings and Language Modelling

Adam Ek

In this lab we'll explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general: we are not interested in getting state-of-the-art performance :) focus on the implementation and not results of your model. For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so, on linux/mac: ```head -n 10000 inputfile > outputfile```. 
* If possible, use the MLTGpu, it will make everything faster :)

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torch.optim as optim
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset

from IPython import embed
import numpy as np

hardware = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device(hardware)

In [5]:
print (device)

cuda


# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data, you can download the file on canvas under files/03-lab-data/wiki-corpus.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentece "this is a lab" with ```window size = 4``` will be formatted as:
```
center, context
---------------------
this    is a lab
is      this a lab
a       this is lab
lab     this is a
```

this will be our training examples when training the word2vec model.

[3 marks]

In [6]:
import string
import pandas as pd

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table) 

data_path = 'wiki-corpus.txt'
WINDOW_SIZE = 4
def corpus_reader(data_path):
    with open(data_path, encoding="utf-8") as f:
        lines = f.readlines()
        centerWord_contextWords = []
        exceptions_short_sentences_count = 0
        for lineCount,line in enumerate(lines):
            #print (line)
            table = str.maketrans(dict.fromkeys(string.punctuation)) #start punctuation removal
            #print ("table:",table)
            line = line.translate(table) #remove punctuation
            line = line.split()
            #line = ["this","is","a","lab","with","several","words","altogether","making","a","sentence","that","is","long"]
            #line = ["this","is","a","lab"]
            #print ("line:",line)
            #print ("len of line:",len(line))
            for i,word in enumerate(line):
                context_words=[]
                #print ("center word at:",i,"(",word,")")
                if i==0:
                    start_place = 0
                    end_place = WINDOW_SIZE
                    for ii in range(start_place,end_place):
                        #print ("For word",i,"take context word at:",ii)
                        try:
                            context_words.append(line[ii])
                        except IndexError:
                            #print ("exception occured.\nline is:",line,"\nii is: -----",ii)
                            exceptions_short_sentences_count += 1
                else:
                    start_place = i-int(WINDOW_SIZE/2)
                    end_place = i+int(WINDOW_SIZE/2)
                    #print ("start place:", start_place)
                    #print ("end place:",end_place)
                    j=start_place
                    while j<end_place:
                        #print ("j:",j)
                        if start_place<0:
                            start_place += 1
                            end_place += 1
                            #print ("start place now:",start_place,", end place now:",end_place)
                            j += 1
                            continue
                        elif end_place > len(line):
                            #print ("word at i:",i,", end_place:",end_place)
                            for k in range(start_place-1,end_place-1):
                                #print ("k position:",k)
                                #print ("and lastly for word: (",word,"):",line[k])
                                try:
                                    context_words.append(line[k])
                                except IndexError:
                                    pass
                            break
                        #print ("For word",i,"(",word,")","take context word at:",j,"(",line[j],")")
                        context_words.append(line[j])
                        j += 1
                context_words.remove(word)
                #print ("center word:",word,", context words:",context_words)
                centerWord_contextWords.append([word," ".join(context_words)])
            #if lineCount==1:
            #    print ("line 2 centerWord_contextWords:",centerWord_contextWords)
        print ("- note: number of shorter senteces -- than window size -- encountered:",exceptions_short_sentences_count)
        print ("- done creating center word and its context words for window size:",WINDOW_SIZE,"for all lines (",len(lines),") lines")
        df = pd.DataFrame(centerWord_contextWords, columns=['center','context'])
        df.to_csv(data_path+".csv",index=False, sep='\t')
        print ("- done saving center words/context words into csv file. ")
corpus_reader(data_path)

- note: number of shorter senteces -- than window size -- encountered: 918
- done creating center word and its context words for window size: 4 for all lines ( 50000 ) lines
- done saving center words/context words into csv file. 


In [7]:
pd.read_csv("wiki-corpus.txt.csv",sep='\t')

Unnamed: 0,center,context
0,Anarchist,historian George Woodcock
1,historian,Anarchist George Woodcock
2,George,Anarchist historian Woodcock
3,Woodcock,historian George reports
4,reports,George Woodcock that
...,...,...
1096333,split,race was into
1096334,into,was split eight
1096335,eight,split into stages
1096336,stages,into eight covering


We sampled 50 000 senteces completely random from the *whole* wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

**Answer:**

Sampling 50000 diverse Wikipedia sentences is good because it covers a large variation of sentence types with different words and lengths.

However it may not be ideal as it would be too general and would not be enough for such general task; since a general task requires a very large corpus with several times the magnitude of what is currently being sampled.

### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the leture is available in /files/pytorch_tutorial/ on canvas)

[4 marks]

In [8]:
def get_data(dataFilePath):
    whitespacer = lambda x: x.split(' ')

    # "fields" that process the different columns in our CSV files
    TOKENS = Field(tokenize    = whitespacer,
                   lower       = True,
                   batch_first = True) # enforce the (batch, words) structure

    # read the csv files
    train = TabularDataset(path = dataFilePath,
                           format = 'csv',
                           fields = [('center', TOKENS),
                                     ('context', TOKENS)],
                           skip_header       = True,
                           csv_reader_params = {'delimiter':'\t',
                                                'quotechar':'⅞'})
    
    # build vocabularies based on what our csv files contained and create word2id mapping
    TOKENS.build_vocab(train, min_freq=1)


    # create batches from our data, and shuffle them for each epoch
    train_iter = BucketIterator(dataset=train,
                                batch_size        = 8,
                                #sort_key=lambda x: len(x.comment_text), #https://github.com/pytorch/text/issues/474
                                sort_within_batch = False, #changed from True to False
                                shuffle           = True,
                                device            = device
                               )

    return train_iter, TOKENS.vocab
    

In [9]:
dataset, vocab = get_data("wiki-corpus.txt.csv")

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

**Answer:**
- By lower-casing the tokens we are summing up all instances of each token into one unit. For example if we have three instances of "John" & "john" & "johN", we can sum all of them into one "john" with frequency of 3, thus improving our learning and prediction.
- However it might be harmful if two different meanings are meant by the lower-case and the capitalized version of one word, such as names of companies that may have a dictionary meaning as well and cannot be ditinguished as company names if they are lower-cased.

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabuary is the most likely center word. 

Implement this model 

[7 marks]

In [10]:
class CBOWModel(nn.Module):
    def __init__(self, num_words, num_dim): #...
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(num_words, num_dim) #...
        self.linear = nn.Linear(num_dim, num_words) #...
    
    def forward(self, context):
        embedded_context = self.embeddings(context) #...
        embedded_context = self.projection_function(embedded_context)
        output = self.linear(embedded_context)
        return output

    def projection_function(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        xs_sum = torch.sum(xs, dim=1)
        return xs_sum

In [11]:
CBOWModel(len(vocab),50).to(device)

CBOWModel(
  (embeddings): Embedding(77632, 50)
  (linear): Linear(in_features=50, out_features=77632, bias=True)
)

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *fo real*, you can use a batch size of [8,16,32,64], and embedding dimensionality of [128,256].

In [12]:
# you can change these numbers to suit your needs :)
word_embeddings_hyperparameters = {'epochs':3,
                                   'batch_size':16,
                                   'embedding_size':128,
                                   'learning_rate':0.001,
                                   'embedding_dim':128 #32
                                  }

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.

[3 marks]

In [14]:
# load data
dataset, vocab = get_data("wiki-corpus.txt.csv")

# build model and construct loss/optimizer
cbow_model = CBOWModel(len(vocab), word_embeddings_hyperparameters['embedding_dim'])
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop
total_loss = 0
for epoch in range(word_embeddings_hyperparameters['epochs']):
    epoch_loss = 0
    for i, batch in enumerate(dataset):
        
        context = batch.context
        target_word = batch.center
        ##print ("i:",i)
        ##print ("dataset:",dataset)
        ##print ("context:",context)
        ##print ("target_word:",target_word)
        ###print ("target_word view:",target_word.view(-1))
        '''
        target_word view: tensor([3143,   14,   98,    6,    5,   16, 6770,    4])
        '''


        # send your batch of sentences to the model
        output = cbow_model(context)
        
        # compute the loss, you'll need to reshape the input
        # you can read more about this is the documentation for
        # CrossEntropyLoss
        loss = loss_fn(output, target_word.view(-1) )
        total_loss += loss.item()
        epoch_loss += loss.item()
        
        # print average loss for the epoch
        print("Average loss for the epoch:", epoch_loss/(i+1), " --- ", end='\r') 
        #print(total_loss/(i+1)) 
        
        # compute gradients
        loss.backward()
        ####...
        
        # update parameters
        optimizer.step()
        ####...
        
        # reset gradients
        optimizer.zero_grad()
        ####...
    print('\r')
        

Average loss for the epoch: 8.008197490063182  ---  
Average loss for the epoch: 7.776071236444692  ---  
Average loss for the epoch: 7.620276734381541  ---  


## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also avalable in vanvas under files/03-l). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```read_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = file.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

when we have the integers for ```word_1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine simlarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the pearson correlation between the similarities we obtained with the scores given in the dataset. 

[4 marks]

In [15]:
print (len(vocab))
print (vocab.stoi["good"])
print (vocab.itos[626])
print (torch.LongTensor([626]))
print (torch.LongTensor())

77632
626
good
tensor([626])
tensor([], dtype=torch.int64)


In [16]:
# your code goes here

def read_wordsim(path, vocab, embeddings):
    word_pairs = []
    dataset_sims = []
    model_sims = []
    with open(path) as f:
        for line in f:
            word1, word2, score = line.split()
            word_pairs.append((word1,word2))
            
            score = float(score)
            dataset_sims.append(score)
            
                # get the index for the word
            word1_idx = vocab.stoi[word1.lower()] #...
            word2_idx = vocab.stoi[word2.lower()] #...
            
            # get the embedding of the word
            word1_emb = embeddings(torch.LongTensor([word1_idx]).to(device)) #...
            word2_emb = embeddings(torch.LongTensor([word2_idx]).to(device)) #...

            #print ("word:", word1, "--",embeddings(torch.LongTensor([word1_idx]).to(device)))

            # compute cosine similarity, we'll use the version included in pytorch functional
            # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
            cosine_similarity = F.cosine_similarity(word1_emb,word2_emb)
            #print (cosine_similarity.item())
            
            model_sims.append(cosine_similarity.item())
    
    return dataset_sims, model_sims, word_pairs

path = 'wordsim_similarity_goldstandard.txt'
data, model, pairs = read_wordsim(path,vocab,cbow_model.embeddings)
#print (data)
#print (model)
pearson_correlation = np.corrcoef(data, model)
            
# the non-diagonals give the pearson correlation,
print(pearson_correlation)

[[1.         0.23241196]
 [0.23241196 1.        ]]


Do you think the model performs good or bad? Why?

[3 marks]

**Answer**
The model is performing poorly when the pearson correlation coefficient is closer to 0 than being closer to 1.
The result from the model is a low number and thus we can reason that the model is performing weakly.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

**Answer**
The code below shows 10 best and 10 worst performing word pairs.
From the best performing words, it can be seen that words that are in the same category
or subject area have most similarity where as words that may have multiple meanings such
as "problem" or words that are very irrelevant to each other in their pair make for worst
performing pairs, such as "cucumber" and "professor", or "cabage" and "king".

In [17]:
workingData = data[:]; workingModel = model[:]; workingPairs = pairs[:];
best10 = []
for i in range(0,len(workingPairs)):
    bestIndex = workingModel.index(max(workingModel))
    #print (workingData[bestIndex])
    maxTuple = (workingData[bestIndex],workingModel[bestIndex], workingPairs[bestIndex])
    best10.append(maxTuple)
    del workingData[bestIndex]; del workingModel[bestIndex]; del workingPairs[bestIndex];
    if (i==9):
        break
print ("Best 10 performing word pairs:")
for i in range (0,len(best10)):
    print (best10[i])
#print (best10)

workingData = data[:]; workingModel = model[:]; workingPairs = pairs[:];
worst10 = []
for i in range(0,len(workingPairs)):
    worstIndex = workingModel.index(min(workingModel))
    minTuple = (workingData[worstIndex],workingModel[worstIndex], workingPairs[worstIndex])
    worst10.append(minTuple)
    del workingData[worstIndex]; del workingModel[worstIndex]; del workingPairs[worstIndex];
    if (i==9):
        break
print ("Worst 10 performing word pairs:")
for i in range (0,len(worst10)):
    print (worst10[i])

Best 10 performing word pairs:
(10.0, 1.0, ('tiger', 'tiger'))
(8.3, 0.5581941604614258, ('man', 'woman'))
(6.77, 0.4818370044231415, ('television', 'radio'))
(8.97, 0.45765310525894165, ('type', 'kind'))
(6.81, 0.4164801836013794, ('football', 'basketball'))
(9.1, 0.35958829522132874, ('coast', 'shore'))
(8.02, 0.35037505626678467, ('planet', 'sun'))
(4.81, 0.34970614314079285, ('situation', 'conclusion'))
(5.77, 0.331089049577713, ('plane', 'car'))
(6.22, 0.3248131275177002, ('skin', 'eye'))
Worst 10 performing word pairs:
(1.19, -0.2852497100830078, ('delay', 'racism'))
(7.81, -0.19720673561096191, ('lobster', 'food'))
(2.94, -0.16753925383090973, ('peace', 'insurance'))
(8.0, -0.1426965892314911, ('tiger', 'feline'))
(8.42, -0.13960173726081848, ('money', 'dollar'))
(4.25, -0.12836161255836487, ('attempt', 'peace'))
(2.31, -0.11290749162435532, ('reason', 'hypertension'))
(2.38, -0.10981321334838867, ('problem', 'airport'))
(1.92, -0.10487410426139832, ('cup', 'substance'))
(3.78, 

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

**Answer**
To improve the results, training on more sentences can be done. More context for words
that are similar but interpreted as different should be given to the model.

If we consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples why the sentiment analysis model would benefit from our embeddnings and one examples why our embeddings could hur the performance of the sentiment model.

[3 marks]

**Answer**
Sentiment analysis would benefit from the similarity of words used to
explain a positive or negative topic. In these cases, words such as "cool", "great",
and "like" would lead to the sentence falling into the same sentiment category.

Where our embedding can hurt the performance is use of short negative words that
would be the only cues as a negative or positive expression, such as the word
"not" in "I did not like this 'awesome' product!". In such case, only one word makes
the sentence meaning change and it may be difficult for the model to spot such word
and make the right conclusion about the sentence polarity.
Words may also have different meanings or used in ironies to refer to a different meaning
and purpose, which may not be spotted by our embeddings, such as the example we see above
with the words "cabbage" and "king", i.e. use of irony, interpreted as irrelevant and
dissimilar.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predict the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-cropus.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

in this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with a ```<end>``` token, there is a keyword argument in ```Field``` for this :). But other than that, as before you read the dataset and output a iterator over the dataset and a vocabulary. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).

[12 marks]

In [20]:
CUDA_LAUNCH_BLOCKING=1

In [21]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs':3,
                      'batch_size':16,
                      'learning_rate':0.001,
                      'embedding_dim':128,
                      'output_dim':128}

In [22]:
data_path = 'wiki-corpus.txt'
def get_data_language_model(dataFilePath):
    # your code here, roughly the same as for the word2vec dataloader
    whitespacer = lambda x: x.split(' ')

    # "fields" that process the different columns in our CSV files
    sentence = Field(tokenize    = whitespacer,
                   init_token = '<start>',
                   eos_token = '<end>',
                   lower       = True,
                   batch_first = True) # enforce the (batch, words) structure

    # read the csv files
    train = TabularDataset(path = dataFilePath,
                           format = 'csv',
                           fields = [('sentence', sentence),
                                     ],
                           )
    
    # build vocabularies based on what our csv files contained and create word2id mapping
    sentence.build_vocab(train)


    # create batches from our data, and shuffle them for each epoch
    train_iter = BucketIterator(dataset=train,
                                batch_size        = 8,
                                #sort_key=lambda x: len(x.comment_text), #https://github.com/pytorch/text/issues/474
                                sort_within_batch = False, #changed from True to False
                                shuffle           = True,
                                device            = device)

    return train_iter, sentence.vocab


In [23]:
dataset_language_model, vocab_language_model = get_data_language_model(data_path)

In [24]:
print (vocab_language_model)
print (vocab_language_model.stoi["<end>"])
print (len(vocab_language_model))

#src_map = torch.LongTensor([vocab_language_model.stoi[w] for w in src])
print (vocab_language_model.stoi["is"])
print ("dataset_language_model:",dataset_language_model)

<torchtext.vocab.Vocab object at 0x7fba1b753210>
3
50628
12
dataset_language_model: <torchtext.data.iterator.BucketIterator object at 0x7fba4c284710>


In [25]:
class LM_withLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers=1):
        super(LM_withLSTM, self).__init__()
        self.n_layers = n_layers
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
        self.predict_word = nn.Linear(hidden_dim, vocab_size)
        #self.sigmoid = nn.Sigmoid()
    
    def forward(self, seq):
        embedded_seq = self.embeddings(seq)
        timestep_reprentation, *_ = self.LSTM(embedded_seq)
        predicted_words = self.predict_word(timestep_reprentation)
        
        return predicted_words

In [26]:
# load data
#lm_dataset, lm_vocab = get_data_language_model(data_path)

# build model and construct loss/optimizer
lm_model = LM_withLSTM(len(vocab_language_model), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)
#lm_model.eval()


LM_withLSTM(
  (embeddings): Embedding(50628, 128)
  (LSTM): LSTM(128, 128, batch_first=True)
  (predict_word): Linear(in_features=128, out_features=50628, bias=True)
)

In [27]:

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

# start training loop
total_loss = 0
for epoch in range(lm_hyperparameters['epochs']):
    for i, batch in enumerate(dataset_language_model):
        #print ("i:",i)
        # the strucure for each BATCH is:
        # <start>, w0, ..., wn, <end>
        sentence = batch.sentence
        
        # when training the model, at each input we predict the *NEXT* token
        # consequently there is nothing to predict when we give the model 
        # <end> as input. 
        # thus, we do not want to give <end> as input to the model, select 
        # from each batch all tokens except the last. 
        # tip: use pytorch indexing/slicing (same as numpy) 
        # (https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)
        # (https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/)
        input_sentence = sentence[0:,range(sentence.shape[1]-1)] #...
        
        # send your batch of sentences to the model
        output = lm_model(input_sentence)
        
        # for each output, the model predict the NEXT token, so we have to reshape 
        # our dataset again. On timestep t, we evaluate on token t+1. That is,
        # we never predict the <start> token ;) so this time, we select all but the first 
        # token from sentences (that is, all the tokens that we predict)
        gold_data = sentence[0:,range(1,sentence.shape[1])]
        
        # the shape of the output and sentence variable need to be changed,
        # for the loss function. Details are in the documentation.
        # You can use .view(...,...) to reshape the tensors  
        loss = loss_fn(output.view(-1,len(vocab_language_model)),  gold_data.view(-1)) #...check
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r') 
        
        # compute gradients
        loss.backward() #...
        
        # update parameters
        optimizer.step() ####...
        
        # reset gradients
        optimizer.zero_grad() #...
    print()

3.1200737282168795
5.8188218711795655
8.3210848349917513


### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena, We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigate whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

your task now is to compare the probability assigned to the good sentence with to the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens, remember, at timestep ```t``` we're predicting which token comes *next* e.g. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```

[6 marks]

In [28]:
# your code goes here
import json

def evaluate_model(path, vocab, model):
    
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            print ("good s:",good_s)
            bad_s = data['sentence_bad']
            
            # the data is tokenized as whitespace
            tok_good_s = [vocab.stoi[word] for word in ['<start>']+good_s.lower().rstrip('.').split()+['.','<end>']]
            print ("tok_good_s:", tok_good_s)
            
            tok_bad_s = [vocab.stoi[word] for word in ['<start>']+bad_s.lower().rstrip('.').split()+['.','<end>']]
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([x for x in tok_good_s], device=device).unsqueeze(0)
            print("enc_good_s:",enc_good_s)
            enc_bad_s = torch.tensor([x for x in tok_bad_s], device=device).unsqueeze(0)
            
            # pass your encoded sentences to the model and predict the next tokens
            good_s = model(enc_good_s)
            print("good_s:",good_s)
            bad_s = model(enc_bad_s)
            
            # get probabilities with softmax
            gs_probs = F.softmax(good_s, dim=1)
            print ("gs_probs:",gs_probs)            
            bs_probs = F.softmax(bad_s, dim=1)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, tok_good_s)
            bs_sent_prob = find_token_probs(bs_probs, tok_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))

    return accuracy
            
def find_token_probs(model_probs, encoded_sentence):
    probs = []

    # iterate over the tokens in your encoded sentence
    print ("## find_token_probs...##")
    for sent_idx,vocab_idx in [(i-1,t) for i,t in enumerate(encoded_sentence)][1:]:
        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        
        prob = model_probs.view(-1,len(vocab_language_model))[sent_idx][vocab_idx]  
        print ("prob:",prob)
        probs.append(prob)
    sentence_prob = np.prod(probs)

    #print ("encoded_sentence[0]:",encoded_sentence[0])
    return sentence_prob

path = 'existential_there_quantifiers_1.jsonl'



In [29]:
###accuracy = evaluate_model(path, ..., ...)
accuracy = evaluate_model(path, vocab_language_model, lm_model)

good s: There was a documentary about music irritating Allison.
tok_good_s: [2, 40, 13, 9, 1952, 98, 216, 0, 0, 8, 3]
enc_good_s: tensor([[   2,   40,   13,    9, 1952,   98,  216,    0,    0,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-11.2607,  -5.8681, -11.1289,  ...,  -6.9956, -10.4502, -10.2464],
         [-20.1039,  -4.9261, -20.6242,  ...,  -8.1029, -11.1276, -13.3413],
         [-15.6879,  19.1810, -16.5931,  ..., -10.4611, -12.8756, -17.0511]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.0877e-01, 5.0757e-11, 1.0593e-01,  ..., 1.2824e-01,
          4.1469e-01, 6.4562e-01],
         [3.1561e-04, 1.9912e-10, 2.1341e-04,  ..., 3.8979e-06,
          4.5762e-06, 2.0304e-06],
         [8.2931e-02, 1.141

prob: tensor(0.0011, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0615, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a mother of Theodore imagining those adults responded.
tok_good_s: [2, 40, 12, 9, 1084, 6, 6974, 0, 285, 3927, 4108, 8, 3]
enc_good_s: tensor([[   2,   40,   12,    9, 1084,    6, 6974,    0,  285, 3927, 4108,    8,
            3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-13.7326,  -3.1549, -14.0650,  ...,  -6.3935, -12.4656, -10.2735],
         [-20.1678,  -5.5758, -20.5983,  ...,  -9.8300,  -9.3879, -12.7238],
         [-16.6251,  19.0900, -17.5935,  ..., -11.0362, -13.0193, -16.8268]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_prob

prob: tensor(0.0020, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0163, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3561, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1439, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7340, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1578, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0066, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0015, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0359, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1466, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1846, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are no peppers upsetting all governments.
tok_good_s: [2, 40, 27,

enc_good_s: tensor([[    2,    40,    22,    87, 10979, 27079,  9073,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-12.3937,  -9.3396, -12.5298,  ...,  -5.5441,  -8.8380,  -6.6118],
         [-20.6631,  -6.2810, -21.2443,  ...,  -7.9873, -11.8261, -12.4810],
         [-16.6499,  18.6079, -17.6265,  ..., -11.1206, -11.8696, -15.7193]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3685e-02, 9.0033e-11, 1.3024e-02,  ..., 1.7949e-01,
          1.8804e-02, 1.7709e-01],
         [3.9709e-05, 3.5320e-10, 2.6237e-05,  ..., 5.4556e-06,
          2.0750e-07, 5.5691e-07],
         [9.8095e-01, 1.4252e-09, 9.8263e-01,  ..., 2.2834e-01,
          9.7667e-01, 6.1821e-01],
         ...,
         [4.6744e-03, 7.2871e-13, 3

prob: tensor(0.7024, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9966, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9761, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1303, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0140, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0313, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2159, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4255, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7114, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9948, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2661, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0012, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2396, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0237, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

good s: There were few stories about politics astounding Alan.
tok_good_s: [2, 40, 22, 364, 1419, 98, 843, 25726, 4834, 8, 3]
enc_good_s: tensor([[    2,    40,    22,   364,  1419,    98,   843, 25726,  4834,     8,
             3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [ -7.6290,  -4.7665,  -7.4049,  ...,  -2.6507,  -2.7340,  -3.2461],
         [-19.4347,  -4.3366, -20.0976,  ...,  -9.3950,  -7.7529, -14.4455],
         [-17.4017,  19.0085, -18.4789,  ..., -12.5057, -11.5931, -15.7630]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[8.5050e-03, 6.0311e-11, 7.6514e-03,  ..., 1.7279e-02,
          7.0985e-03, 2.6631e-02],
         [2.4678e-05, 2.3660e-10, 1.5414e-05,  ..., 5.2519e-07,
          7.8333e-08, 8.3751e-08],
      

prob: tensor(0.0550, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0466, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0121, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0101, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1495, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5772, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9989, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6791, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0031, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0050, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0137, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0156, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2875, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5754, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.0658, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a muffin boring Winston Churchill.
tok_good_s: [2, 40, 12, 9, 0, 0, 7976, 4874, 8, 3]
enc_good_s: tensor([[   2,   40,   12,    9,    0,    0, 7976, 4874,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-15.0023,  -1.2645, -15.1380,  ...,  -7.6239, -15.5614, -14.5424],
         [-20.7558,  -3.9155, -21.3066,  ..., -10.8017, -11.5850, -15.4590],
         [-17.3845,  18.9606, -18.3189,  ..., -11.8163, -16.3633, -20.0429]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.9634e-01, 6.3272e-11, 2.9837e-01,  ..., 1.6378e-01,
          8.7401e-01, 9.3050e-01],
         [8.5985e-0

prob: tensor(0.0468, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0808, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1998, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0027, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(3.5432e-06, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2304, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a pepper bothering every guy.
tok_good_s: [2, 40, 0, 9, 10683, 27079, 106, 5405, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9, 10683, 27079,   106,  5405,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
       

prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.5823, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0022, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0294, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0616, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0647, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2014, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3599, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3113, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a museum failing to surrender.
tok_good_s: [2, 40, 0, 9, 630, 5885, 11, 5140, 8, 3]
enc_good_s: tensor([[   2,   40,    0,    9,  630, 5885,   11, 5140,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.63

prob: tensor(0.1746, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0020, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0666, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1344, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6870, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0026, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0932, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0337, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5445, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0803, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0032, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0556, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1248, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.0289, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0015, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1280, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a glacier shocking Ruth.
tok_good_s: [2, 40, 13, 9, 12771, 20629, 9073, 8, 3]
enc_good_s: tensor([[    2,    40,    13,     9, 12771, 20629,  9073,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-13.9743,  -5.1956, -14.0815,  ...,  -7.0264, -12.3676, -10.5316],
         [-18.6815,  -5.3625, -19.2464,  ...,  -6.7534, -10.4365, -12.2484],
         [-16.1061,  19.0764, -17.0163,  ..., -10.8502, -13.7143, -17.6790]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_p

gs_probs: tensor([[[5.2474e-02, 4.8665e-11, 5.7519e-02,  ..., 3.1810e-02,
          2.1316e-01, 4.8832e-01],
         [1.5226e-04, 1.9091e-10, 1.1587e-04,  ..., 9.6689e-07,
          2.3522e-06, 1.5357e-06],
         [3.6342e-01, 2.3169e-10, 3.7634e-01,  ..., 2.5429e-02,
          1.8844e-01, 6.2076e-02],
         ...,
         [2.5816e-01, 2.8928e-10, 2.7177e-01,  ..., 8.0524e-01,
          2.3531e-01, 4.2758e-01],
         [1.8276e-04, 4.9377e-11, 1.2455e-04,  ..., 1.2029e-01,
          2.7326e-01, 1.1533e-02],
         [1.9185e-04, 1.0000e+00, 7.7515e-05,  ..., 1.6070e-04,
          4.9261e-04, 1.2372e-05]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.6796, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2191, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0905, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1882, device='cuda:0', grad_fn=<S

prob: tensor(0.2484, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1522, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0266, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9984, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7037, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(5.5469e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0036, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0380, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0023, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0970, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0224, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a cashier advising Monica's nephews to argue.
tok_good_s: [2, 40, 0, 9, 0, 0, 0, 40072, 11, 2538, 8, 3]
enc_good_s: ten

prob: tensor(0.4362, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3226, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(3.7283e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0452, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6658, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3191, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0047, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(2.9898e-06, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0210, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9988, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a cup distracting this driver.
tok_good_s: [2, 40, 13, 9, 1181, 0, 31, 4920, 8, 3]
enc_good_s: tensor([[   2,   40,   13,    9, 1181,    0,   31, 4920,    8,    3]],
       device='

prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7171, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9951, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2172, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0059, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1904, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0080, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0089, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0061, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1827, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a plate falling.
tok_good_s: [2, 40, 13, 9, 2597, 3561, 8, 3]
enc_good_s: tensor([[   2,   40,   13,    9, 2597, 3561,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620

gs_probs: tensor([[[8.1565e-02, 1.0005e-10, 8.0052e-02,  ..., 8.3910e-02,
          2.2387e-01, 7.1434e-01],
         [2.3667e-04, 3.9249e-10, 1.6127e-04,  ..., 2.5505e-06,
          2.4704e-06, 2.2465e-06],
         [7.5755e-02, 3.1745e-09, 5.5984e-02,  ..., 9.5530e-03,
          8.6008e-03, 2.2395e-02],
         ...,
         [2.3237e-02, 2.1691e-11, 1.9705e-02,  ..., 2.8523e-02,
          1.6030e-06, 1.2584e-05],
         [7.2728e-05, 1.1087e-10, 4.3589e-05,  ..., 8.3329e-02,
          1.1529e-02, 1.6486e-03],
         [2.9156e-04, 1.0000e+00, 1.0520e-04,  ..., 8.4714e-04,
          1.1005e-03, 3.4195e-05]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7400, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9719, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9460, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5118, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0006, device='cuda:0', grad_fn=<S

gs_probs: tensor([[[6.0576e-02, 5.5516e-11, 6.7189e-02,  ..., 3.1601e-01,
          3.8781e-01, 6.0808e-01],
         [1.7576e-04, 2.1779e-10, 1.3535e-04,  ..., 9.6053e-06,
          4.2796e-06, 1.9123e-06],
         [4.1954e-01, 2.6431e-10, 4.3961e-01,  ..., 2.5262e-01,
          3.4285e-01, 7.7300e-02],
         ...,
         [2.2694e-03, 1.3961e-12, 2.5578e-03,  ..., 8.6510e-02,
          1.0910e-04, 1.5689e-04],
         [3.1837e-05, 2.0312e-11, 2.0092e-05,  ..., 5.9280e-02,
          3.8547e-02, 4.3619e-03],
         [4.6352e-04, 1.0000e+00, 2.1008e-04,  ..., 3.8557e-03,
          3.2135e-04, 7.2555e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7374, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2617, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3683, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0108, device='cuda:0', grad_fn=<S

gs_probs: tensor([[[1.5548e-01, 1.2454e-10, 1.7258e-01,  ..., 2.1308e-02,
          6.4539e-02, 7.9713e-02],
         [4.5113e-04, 4.8856e-10, 3.4766e-04,  ..., 6.4767e-07,
          7.1220e-07, 2.5068e-07],
         [1.4440e-01, 3.9515e-09, 1.2069e-01,  ..., 2.4259e-03,
          2.4796e-03, 2.4990e-03],
         ...,
         [5.6883e-02, 5.2506e-11, 4.2172e-02,  ..., 6.7594e-01,
          6.6236e-01, 6.8040e-01],
         [1.3515e-05, 3.9712e-11, 9.9117e-06,  ..., 3.6072e-03,
          3.4250e-03, 2.6108e-04],
         [9.6414e-04, 1.0000e+00, 4.5292e-04,  ..., 3.7703e-04,
          2.2116e-04, 3.1112e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7553, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9861, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9205, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3169, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0804, device='cuda:0', grad_fn=<S

prob: tensor(0.1616, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0854, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0575, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1511, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3956, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9305, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0267, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1376, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1043, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6859, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1334, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0356, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0552, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.0045, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2351, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(8.2728e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2416, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a glove disgusting these governments.
tok_good_s: [2, 40, 0, 9, 33460, 0, 76, 2393, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9, 33460,     0,    76,  2393,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-12.7835,  -7.7736, -12.6817,  ...,  -8.3404, -14.1727, -13.5740],
         [-18.5734,  -5.3730, -18.9963,  ...,  -7.3193, -10.0210, -12.2547],
         [-16.2718,  19.0073, -17.1816,  .

prob: tensor(0.1779, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4275, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0706, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1292, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8028, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1088, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0879, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0216, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0574, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a closet distracting Scott.
tok_good_s: [2, 40, 0, 9, 0, 0, 2115, 8, 3]
enc_good_s: tensor([[   2,   40,    0,    9,    0,    0, 2115,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-13.8198,  -5.8038, -14.0744,  ...,  -7.1964, -11.2642, -10.0086],
         [-16.3160,  -4.0628, -16.6895,  ...,  -7.3170,  -6.7810, -10.2114],
         [-16.5943,  19.1664, -17.5378,  ..., -11.4050, -13.4988, -17.4432]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[7.4763e-02, 5.1501e-11, 7.6431e-02,  ..., 5.8334e-02,
          1.3542e-01, 2.9078e-01],
         [2.1693e-04, 2.0204e-10, 1.5397e-04,  ..., 1.7731e-06,
          1.4944e-06, 9.1445e-07],
         [1.0527e-01, 2.6824e-11, 7.8484e-02,  ..., 1.1066e-02,
          1.4505e-01, 1.8422e-01],
         ...,
         [6.1352e-03, 1.4308e-11, 4.9435e-03,  ..., 3.3377e-02,
          2.3053e-03, 1.1025e-02],
         [5.0550e-04, 8.1593e-11, 3.6164e-04, 

prob: tensor(0.0243, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0069, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2287, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6904, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0478, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0466, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1602, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0256, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0028, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2639, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was an essay disagreeing.
tok_good_s: [2, 40, 13, 30, 5369, 30494, 8, 3]
enc_good_s: tensor([[    2,    40,    13,    30,  5369, 

prob: tensor(0.0349, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a fork stunning Suzanne.
tok_good_s: [2, 40, 0, 9, 4586, 46650, 11082, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  4586, 46650, 11082,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.1215,  -6.2845, -10.2560,  ...,  -6.2383,  -9.7839,  -9.2451],
         [-18.7463,  -3.8718, -19.3416,  ...,  -7.1076, -10.7297, -13.5549],
         [-18.1199,  19.4779, -19.2480,  ..., -12.3769, -13.7817, -17.9806]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.3039e-02, 3.7719e-11, 3.7062e-02,  ..., 1.3002e-01,
          4.1001e-02, 5.0866e-01],
         [9.5863e-05, 

prob: tensor(0.2562, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9745, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0054, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1077, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7329, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1888, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1817, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0136, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1281, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a sweater twisting.
tok_good_s: [2, 40, 12, 9, 46960, 0, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9, 46960,     0,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.31

prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0039, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0204, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0041, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1388, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were no shirts disturbing Theresa.
tok_good_s: [2, 40, 22, 179, 10956, 0, 47547, 8, 3]
enc_good_s: tensor([[    2,    40,    22,   179, 10956,     0, 47547,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-12.3503,  -1.0770, -12.4885,  ..., -10.4123, -10.7796, -10.8929],
         [-20.3669,  -1.7446, -20.9142,  ..., -11.4081, -10.3471, -15.3861]

prob: tensor(0.3743, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8247, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0051, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1029, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0350, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1826, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6978, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2667, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0852, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0139, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0805, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0581, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2963, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.7814, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2373, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0757, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a drawing boring Brett.
tok_good_s: [2, 40, 0, 9, 2789, 0, 27229, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  2789,     0, 27229,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [ -8.6361,  -4.5450,  -8.6289,  ...,  -6.2714,  -5.3537,  -5.3709],
         [-19.2893,  -4.7849, -19.7707,  ...,  -8.2333,  -7.1359, -12.8641],
         [-16.5088,  18.7660, -17.3191,  ..., -10.7983, -14.2451, -18.0721]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-15.4783,  -3.6909, -15.6212,  ..., -10.5670, -12.7181,  -9.9588],
         [-18.1118,  -3.7356, -18.4437,  ...,  -8.6022, -11.1043, -12.2240],
         [-17.0645,  18.5295, -18.0693,  ..., -11.1638, -12.2843, -15.5837]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.5685e-01, 9.7369e-11, 1.7878e-01,  ..., 4.5492e-01,
          2.8251e-01, 4.4303e-01],
         [4.5510e-04, 3.8198e-10, 3.6015e-04,  ..., 1.3828e-05,
          3.1176e-06, 1.3933e-06],
         [1.4568e-01, 3.0895e-09, 1.2503e-01,  ..., 5.1792e-02,
          1.0854e-02, 1.3889e-02],
         ...,
         [2.4510e-03, 2.2377e-10, 2.4620e-03,  ..., 8.9463e-03,
          1.1237e-03, 1.7656e-02],
         [1.7604e-04, 2.1399e-10, 1.4638e-04, 

prob: tensor(0.9778, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8838, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4402, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0637, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0208, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2981, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7005, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9847, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0748, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0067, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0105, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0575, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3684, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There w

prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
good s: There aren't few dancers negotiating.
tok_good_s: [2, 40, 0, 364, 8383, 13455, 8, 3]
enc_good_s: tensor([[    2,    40,     0,   364,  8383, 13455,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-12.0006,  -4.5280, -12.2265,  ...,  -6.0366,  -8.8338,  -8.1269],
         [-18.8956,  -4.8426, -19.3626,  ...,  -6.9058,  -8.5596, -11.6765],
         [-16.5638,  19.1251, -17.4631,  ..., -11.1824, -14.9803, -18.6725]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.0496e-01, 5.3673e-11, 1.1052e-01,  ..., 1.9569e-01,
          3.8582e-01, 7.1270e-01],
         [3.0455e-04, 2.1056e-10, 2.2265e-04,  ..., 5.9480e-06,
          4.2576e-06, 2.2414e-06]

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.6814,  -8.2096, -10.2715,  ...,  -4.0668,  -9.9017,  -8.8265],
         [-19.8817,  -5.0989, -20.5468,  ...,  -8.6899,  -9.5480, -12.4199],
         [-15.3287,  18.9594, -16.1683,  ..., -10.4378, -13.9232, -17.5296]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.9223e-02, 6.3349e-11, 2.5025e-02,  ..., 3.2679e-02,
          4.2009e-02, 1.1057e-01],
         [8.4792e-05, 2.4852e-10, 5.0413e-05,  ..., 9.9330e-07,
          4.6358e-07, 3.4771e-07],
         [2.0239e-01, 3.0160e-10, 1.6374e-01,  ..., 2.6124e-02,
          3.7138e-02, 1.4055e-02],
         ...,
         [5.5317e-02, 1.5873e-12, 7.2562e-02,  ..., 4.2755e-01,
          2.7930e-03, 1.3672e-02],
         [5.5872e-06, 3.5614e-11, 2.5015e-06, 

prob: tensor(0.2991, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0010, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0021, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2022, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5131, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1292, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2831, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1329, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7151, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1608, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0710, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0030, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0022, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.0147, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5551, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0220, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0056, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0667, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0839, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a closet astounding Mary.
tok_good_s: [2, 40, 0, 9, 0, 25726, 914, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,     0, 25726,   914,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [ -9.8099,  -4.0138, -10.0054,  ...,  -5.0308,  -7.9025,  -8.4088],
         [-20.8660

prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7388, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9825, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1196, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0098, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1209, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3426, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were no public parks alarming Curtis.
tok_good_s: [2, 40, 22, 179, 223, 1412, 0, 9899, 8, 3]
enc_good_s: tensor([[   2,   40,   22,  179,  223, 1412,    0, 9899,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-14.6962,  -1.5747, -14.8249,  ...,  -8.3230, -11.8096, -11.3505],
         [-20.9157,  -4.7615, -21.4636,  ..., -10.5048,  -9.3290, -13.7106],
         [-15.9255,  19.3076, -16.7008,  ..., -10.4621, -14.0383, -17.6651]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[9.1491e-02, 4.4720e-11, 8.7358e-02,  ..., 3.3046e-02,
          4.8201e-02, 1.3118e-01],
         [2.6546e-04, 1.7544e-10, 1.7598e-04,  ..., 1.0045e-06,
          5.3191e-07, 4.1253e-07],
         [6.9755e-02, 1.0059e-10, 5.5466e-02,  ..., 1.5088e-02,
          3.4955e-02, 2.6074e-02],
         ...,
         [3.1252e-03, 8.5298e-10, 2.6675e-03,  ..., 6.1287e-03,
          4.7556e-04, 1.2999e-03],
         [6.2201e-06, 3.5230e-11, 3.4910e-06, 

prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7081, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9946, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2402, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0177, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0010, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1536, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were no sketches of Ann boring every woman.
tok_good_s: [2, 40, 22, 179, 10979, 6, 3497, 0, 106, 1200, 8, 3]
enc_good_s: tensor([[    2,    40,    22,   179, 10979,     6,  3497,     0,   106,  1200,
             8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,

## find_token_probs...##
prob: tensor(0.6917, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9976, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.7169, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0053, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1277, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1622, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some committees boring Eric.
tok_good_s: [2, 40, 22, 57, 6467, 0, 4929, 8, 3]
enc_good_s: tensor([[   2,   40,   22,   57, 6467,    0, 4929,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-12.9343,  -6.5504, -13.2320,  ..., -10.1773, -10.9104,  -9.3839],

prob: tensor(0.2406, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0197, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0156, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0447, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1710, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There weren't many mirrors worrying Frank.
tok_good_s: [2, 40, 0, 87, 19008, 49953, 1792, 8, 3]
enc_good_s: tensor([[    2,    40,     0,    87, 19008, 49953,  1792,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [ -6.8546,  -7.1269,  -6.5563,  ...,  -1.2287,  -7.1856,  -5.6142],
         [-16.8585,  -3.6029, -17.3120,  ...,  -6.0491,  -7.2775, -11.16

prob: tensor(0.3287, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6018, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9983, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1313, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1236, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0005, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0819, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0624, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2949, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a cashier boycotting an association.
tok_good_s: [2, 40, 0, 9, 0, 0, 30, 816, 8, 3]
enc_good_s: tensor([[  2,  40,   0,   9,   0,   0,  30, 816,   8,   3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -

gs_probs: tensor([[[2.5107e-02, 4.9790e-11, 2.6058e-02,  ..., 6.5841e-02,
          2.7074e-02, 1.6913e-01],
         [7.2850e-05, 1.9532e-10, 5.2495e-05,  ..., 2.0013e-06,
          2.9877e-07, 5.3188e-07],
         [3.5352e-02, 2.5933e-11, 2.6758e-02,  ..., 1.2490e-02,
          2.8999e-02, 1.0715e-01],
         ...,
         [2.1128e-03, 6.4077e-11, 1.4897e-03,  ..., 1.0756e-01,
          2.3446e-03, 3.4040e-02],
         [5.8594e-06, 2.1262e-11, 3.8237e-06,  ..., 1.3417e-02,
          1.3066e-03, 8.9579e-04],
         [1.2745e-04, 1.0000e+00, 4.8531e-05,  ..., 6.5401e-04,
          1.4848e-04, 8.0613e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7171, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9954, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9762, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8902, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0233, device='cuda:0', grad_fn=<S

prob: tensor(0.1507, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a restaurant disturbing Melinda.
tok_good_s: [2, 40, 13, 9, 4390, 0, 0, 8, 3]
enc_good_s: tensor([[   2,   40,   13,    9, 4390,    0,    0,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-12.8207,  -5.4867, -12.5016,  ...,  -7.4692, -12.9090, -12.0776],
         [-18.9808,  -3.8451, -19.2578,  ...,  -6.7553, -10.5829, -12.2077],
         [-16.6859,  18.7512, -17.5715,  ..., -10.9306, -13.7475, -17.7385]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3890e-01, 7.8014e-11, 1.1153e-01,  ..., 1.1307e-01,
          4.6999e-01, 7.2465e-01],
         [4.0301e-04, 3.0605e-10

prob: tensor(0.1605, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
good s: There weren't many drivers responding.
tok_good_s: [2, 40, 0, 87, 6541, 9039, 8, 3]
enc_good_s: tensor([[   2,   40,    0,   87, 6541, 9039,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-11.9187,  -3.2963, -12.2972,  ...,  -7.2260,  -7.6611, -10.1280],
         [-18.5882,  -4.5837, -19.2267,  ...,  -9.0397,  -7.2492, -12.9845],
         [-16.8276,  18.8843, -17.7850,  ..., -11.5097, -15.1434, -19.0366]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[6.8432e-02, 6.8288e-11, 7.6995e-02,  ..., 3.3784e-01,
          2.6231e-01, 8.4813e-01],
         [1.9856e-04, 2.6789e-10, 1.5511e-04,  ..

prob: tensor(0.0015, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0097, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0704, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8541, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2866, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9990, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a turtle embarrassing a lot of girls.
tok_good_s: [2, 40, 13, 9, 21344, 0, 9, 3393, 6, 3574, 8, 3]
enc_good_s: tensor([[    2,    40,    13,     9, 21344,     0,     9,  3393,     6,  3574,
             8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-13.6232,  -4.1739, -13.5799,  ...,  -7

prob: tensor(0.9983, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a brother of Alan figuring out who shrug.
tok_good_s: [2, 40, 12, 9, 1277, 6, 4834, 32347, 80, 83, 0, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9,  1277,     6,  4834, 32347,    80,    83,
             0,     8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-13.6685,  -3.7678, -13.7524,  ...,  -9.7694, -12.3293, -12.3751],
         [-19.9438,  -4.3426, -20.4125,  ...,  -8.9101,  -8.1035, -12.8849],
         [-15.6815,  19.2313, -16.5672,  ..., -10.4620, -13.2915, -17.4104]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[9.2821e-03, 4.8266e-11, 6.6452e-03,  ..., 2.7323e-02,
          3.6936e-02, 5.0143e-02],
         [2.6932e-05, 1.8935

prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2544, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1585, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0390, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0654, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0646, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1745, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3347, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2535, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a college campus disgusting Vanessa.
tok_good_s: [2, 40, 12, 9, 340, 1123, 0, 21487, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9,   340,  1123,     0, 21487,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.

gs_probs: tensor([[[3.7098e-02, 1.1753e-10, 4.0587e-02,  ..., 8.7655e-02,
          3.8990e-02, 1.8039e-01],
         [1.0764e-04, 4.6109e-10, 8.1763e-05,  ..., 2.6643e-06,
          4.3026e-07, 5.6732e-07],
         [2.5693e-01, 5.5958e-10, 2.6556e-01,  ..., 7.0071e-02,
          3.4469e-02, 2.2932e-02],
         ...,
         [4.8114e-04, 2.6451e-11, 4.6446e-04,  ..., 3.1021e-02,
          1.4739e-05, 1.0723e-04],
         [1.5656e-05, 9.7333e-11, 1.1240e-05,  ..., 1.1079e-02,
          9.3699e-03, 3.3807e-04],
         [1.3039e-04, 1.0000e+00, 5.9074e-05,  ..., 8.9329e-04,
          6.1440e-05, 4.2401e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7103, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3358, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5180, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3009, device='cuda:0', grad_fn=<S

enc_good_s: tensor([[    2,    40,    22,   364,  2953, 29034, 12378,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-10.2350,  -5.1498, -10.0493,  ...,  -4.6734,  -8.9358,  -8.6960],
         [-18.1703,  -4.5040, -18.6101,  ...,  -7.8814,  -8.9088, -12.1376],
         [-15.9552,  18.9857, -16.9297,  ..., -11.0008, -12.2554, -16.3790]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3201e-02, 6.1704e-11, 1.2475e-02,  ..., 1.0237e-01,
          1.8736e-02, 2.1278e-01],
         [3.8304e-05, 2.4206e-10, 2.5131e-05,  ..., 3.1117e-06,
          2.0675e-07, 6.6916e-07],
         [9.4623e-01, 9.7674e-10, 9.4121e-01,  ..., 1.3024e-01,
          9.7314e-01, 7.4282e-01],
         ...,
         [3.9048e-02, 3.2967e-11, 4

prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a waitress finding birds to descend stairs.
tok_good_s: [2, 40, 13, 9, 0, 3794, 1435, 11, 30214, 20850, 8, 3]
enc_good_s: tensor([[    2,    40,    13,     9,     0,  3794,  1435,    11, 30214, 20850,
             8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [-14.0931,  -6.4379, -14.3964,  ...,  -7.8592,  -6.4436,  -7.5114],
         [-17.6651,  -5.7428, -18.0940,  ...,  -7.1396,  -6.7424, -10.4318],
         [-16.4204,  18.7801, -17.3151,  ..., -10.9773, -14.0145, -17.7655]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.1668e-01, 7.5790e-11, 3.2953e-01,  ..., 2.3354e-01,
          1.7288e-01, 5.4957e-01],
         [9.1885e-04, 2.9732e-10

prob: tensor(0.2092, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0082, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0696, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0574, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0346, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9984, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a customer hugging Gerald.
tok_good_s: [2, 40, 12, 9, 9901, 0, 5914, 8, 3]
enc_good_s: tensor([[   2,   40,   12,    9, 9901,    0, 5914,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-12.0232,  -7.0922, -11.8204,  ...,  -5.7747, -10.1836, -10.0308],
         [-19.2179,  -4.2426, -19.7159,  ...,  -7.7245, -10.3808, -12.8265],
         [-16.69

prob: tensor(0.0420, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2229, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7375, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9909, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2634, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1778, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0413, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2589, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0312, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1731, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't an art gallery boring an adult.
tok_good_s: [2, 40, 0, 30, 433, 1705, 0, 30, 2208, 8, 3]
enc_good_s: tensor([[   2,   40,    0,   30,  433, 1705,    0,   30, 2208,    8,    3]],
       de

prob: tensor(0.9991, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are some computers stunning Nina.
tok_good_s: [2, 40, 27, 57, 2654, 46650, 40224, 8, 3]
enc_good_s: tensor([[    2,    40,    27,    57,  2654, 46650, 40224,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-10.5773,  -5.4785, -10.7826,  ...,  -4.1297, -10.3025,  -8.2805],
         [-17.5238,  -4.7183, -17.9270,  ...,  -6.8576,  -9.0448, -11.0431],
         [-16.0030,  19.2936, -16.8868,  ..., -10.7512, -13.6357, -17.4843]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.0817e-02, 4.5351e-11, 3.3264e-02,  ..., 3.1709e-02,
          3.8031e-03, 3.7105e-03],
         [8.9417e-05, 1.7791e-10, 6.7011e-05,  ..., 9.6382e-07,
          4.1968e

prob: tensor(0.3994, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9041, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0006, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0661, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0414, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7530, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2630, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1846, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0021, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2019, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1250, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9991, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some children thinking about Jason.
tok_good_s: [2, 40, 22, 

gs_probs: tensor([[[9.0541e-02, 6.1745e-11, 1.0268e-01,  ..., 4.3093e-02,
          2.0064e-03, 4.9998e-03],
         [2.6271e-04, 2.4222e-10, 2.0684e-04,  ..., 1.3099e-06,
          2.2141e-08, 1.5724e-08],
         [1.2749e-01, 3.2159e-11, 1.0543e-01,  ..., 8.1748e-03,
          2.1491e-03, 3.1675e-03],
         ...,
         [6.2635e-01, 2.5320e-12, 6.0515e-01,  ..., 5.2749e-01,
          9.4364e-01, 9.8153e-01],
         [3.4123e-04, 6.1226e-11, 2.4048e-04,  ..., 9.7746e-02,
          1.7676e-02, 3.1297e-04],
         [6.2785e-04, 1.0000e+00, 2.9328e-04,  ..., 4.8948e-04,
          3.9014e-06, 1.2212e-07]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7070, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9923, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8489, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1898, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1148, device='cuda:0', grad_fn=<S

prob: tensor(0.0059, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0004, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0425, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0177, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8933, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0541, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9980, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a man trying to argue about Kenneth.
tok_good_s: [2, 40, 0, 9, 593, 3906, 11, 2538, 98, 4052, 8, 3]
enc_good_s: tensor([[   2,   40,    0,    9,  593, 3906,   11, 2538,   98, 4052,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-12.4362,  -6.7871, -12.7073,  ...,  -6.9304,  -4.

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
         [ -9.5969,  -9.2658,  -9.1599,  ...,  -0.6656,  -2.2663,  -2.1798],
         [-19.9127,  -5.3862, -20.3469,  ...,  -7.2152,  -9.0860, -12.1004],
         [-16.6361,  19.0937, -17.6211,  ..., -11.1805, -12.5181, -16.4802]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.1164e-01, 5.5385e-11, 8.2679e-02,  ..., 2.5063e-03,
          7.1489e-03, 1.0354e-02],
         [3.2393e-04, 2.1727e-10, 1.6656e-04,  ..., 7.6182e-08,
          7.8889e-08, 3.2561e-08],
         [8.5117e-02, 1.2458e-10, 5.2495e-02,  ..., 1.1443e-03,
          5.1843e-03, 2.0580e-03],
         ...,
         [6.2507e-01, 4.8264e-13, 7.2858e-01,  ..., 9.8368e-01,
          9.8404e-01, 9.8606e-01],
         [2.0694e-05, 2.3362e-11, 1.0093e-05, 

prob: tensor(0.9936, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2469, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0432, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0008, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1453, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0270, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There weren't many mountains astounding Leslie.
tok_good_s: [2, 40, 0, 87, 1297, 25726, 13169, 8, 3]
enc_good_s: tensor([[    2,    40,     0,    87,  1297, 25726, 13169,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-11.5497,  -5.8473, -11.8651,  ...,  -5.9365,  -7.1397,  -6.9839],
   

gs_probs: tensor([[[2.2077e-02, 5.8916e-11, 2.1531e-02,  ..., 5.6236e-03,
          8.2233e-03, 4.7530e-02],
         [6.4058e-05, 2.3113e-10, 4.3374e-05,  ..., 1.7093e-07,
          9.0745e-08, 1.4947e-07],
         [1.5290e-01, 2.8050e-10, 1.4087e-01,  ..., 4.4955e-03,
          7.2698e-03, 6.0421e-03],
         ...,
         [4.6596e-01, 2.5313e-10, 5.2477e-01,  ..., 9.5905e-01,
          8.1156e-01, 9.2862e-01],
         [1.4331e-04, 1.5646e-10, 8.6969e-05,  ..., 1.1744e-02,
          6.9674e-02, 9.4441e-04],
         [3.0668e-05, 1.0000e+00, 1.0967e-05,  ..., 2.3455e-05,
          1.7622e-05, 1.3759e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.6827, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(6.4058e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4381, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0662, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0032, device='cuda:0', grad_f

prob: tensor(0.1823, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0437, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2214, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9990, device='cuda:0', grad_fn=<SelectBackward>)
good s: There aren't many cashiers hoping to fall.
tok_good_s: [2, 40, 0, 87, 0, 8639, 11, 978, 8, 3]
enc_good_s: tensor([[   2,   40,    0,   87,    0, 8639,   11,  978,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-12.7486,  -4.7285, -13.0056,  ..., -10.0017, -11.4987, -12.2372],
         [-19.3834,  -4.1410, -20.0900,  ...,  -9.9923,  -6.4588, -12.0850],
         [-17.6533,  19.7681, -18.7495,  ..., -12.4706, -13.0443, -16.7207]]],
       device='cuda:0', grad_fn=<AddBackward0>)
g

prob: tensor(0.0833, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0128, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2475, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9991, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7040, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1491, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0466, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0130, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3016, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9984, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were many boys protesting.
tok_good_s: [2, 40, 22, 87, 2767, 13810, 8, 3]
enc_good_s: tensor([[    2,    40,    22,    87,  2767, 13810,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.73

prob: tensor(0.7489, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9791, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9458, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1098, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0006, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0450, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1477, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7619, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9852, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1052, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0596, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0004, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0407, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2319, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.0222, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1741, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0126, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0421, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0045, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2315, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0625, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0284, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9991, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a hospital debating.
tok_good_s: [2, 40, 13, 9, 1163, 0, 8, 3]
enc_good_s: tensor([[   2,   40,   13,    9, 1163,    0,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.5907,  -3.7124, -11.7903,  ...,  -7.4221,  -7.5123,  -8.3519],
         ...,
  

prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7483, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9922, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1228, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0528, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1653, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0514, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a cake baking.
tok_good_s: [2, 40, 12, 9, 11997, 15730, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9, 11997, 15730,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-10.1645,  -5.9110

good s: There were some libraries neglecting to describe Karen.
tok_good_s: [2, 40, 22, 57, 4055, 0, 11, 1595, 8700, 8, 3]
enc_good_s: tensor([[   2,   40,   22,   57, 4055,    0,   11, 1595, 8700,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-14.1641,  -6.1139, -14.4090,  ...,  -7.3207,  -7.4263,  -7.3461],
         [-19.8954,  -4.5186, -20.5946,  ...,  -8.7053,  -5.6155, -11.5104],
         [-17.8873,  19.3884, -18.9609,  ..., -12.2053, -13.3263, -16.9992]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3536e-02, 4.1248e-11, 1.2895e-02,  ..., 6.7330e-02,
          1.6827e-02, 1.6442e-01],
         [3.9276e-05, 1.6182e-10, 2.5976e-05,  ..., 2.0465e-06,
          1.8570e-07, 5.1708e-07],
         [9.7026e-01, 

enc_good_s: tensor([[   2,   40,    0,    9,  458,    0, 9072,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [ -7.4662,  -5.4148,  -7.1516,  ...,  -3.5927,  -5.5267,  -6.2608],
         [-17.6431,  -5.4112, -18.0932,  ...,  -6.3759,  -4.7552,  -9.3435],
         [-16.1905,  18.8937, -17.0593,  ..., -10.9987, -13.6343, -17.7042]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.4286e-02, 6.7652e-11, 1.1569e-02,  ..., 3.3139e-02,
          3.2121e-02, 2.3374e-01],
         [4.1450e-05, 2.6540e-10, 2.3307e-05,  ..., 1.0073e-06,
          3.5446e-07, 7.3507e-07],
         [9.8939e-02, 3.2209e-10, 7.5699e-02,  ..., 2.6491e-02,
          2.8396e-02, 2.9713e-02],
         ...,
         [6.7354e-01, 2.7730e-11, 7.5962e-01

gs_probs: tensor([[[5.5126e-02, 5.7403e-11, 4.0241e-02,  ..., 3.8104e-02,
          5.3615e-02, 2.3495e-01],
         [1.5995e-04, 2.2519e-10, 8.1066e-05,  ..., 1.1582e-06,
          5.9166e-07, 7.3890e-07],
         [5.1199e-02, 1.8214e-09, 2.8142e-02,  ..., 4.3381e-03,
          2.0599e-03, 7.3660e-03],
         ...,
         [2.4267e-02, 5.7237e-11, 1.8178e-02,  ..., 6.6818e-03,
          2.0004e-05, 8.5941e-05],
         [3.7772e-04, 1.7402e-10, 1.9599e-04,  ..., 3.1213e-02,
          2.4562e-01, 7.2796e-03],
         [2.3199e-04, 1.0000e+00, 6.8425e-05,  ..., 2.8682e-04,
          4.7062e-05, 2.8964e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7387, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9322, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8113, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1486, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0013, device='cuda:0', grad_fn=<S

good s: There were some prints of Bethany vanishing.
tok_good_s: [2, 40, 22, 57, 8970, 6, 0, 21490, 8, 3]
enc_good_s: tensor([[    2,    40,    22,    57,  8970,     6,     0, 21490,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-12.8317,  -3.5128, -13.2651,  ...,  -8.9360,  -9.0281,  -8.4710],
         [-21.0036,  -4.0648, -21.6489,  ..., -11.5221, -12.2763, -15.5585],
         [-18.4675,  19.1027, -19.6227,  ..., -13.1269, -12.1505, -16.4016]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3194e-02, 5.4889e-11, 1.2558e-02,  ..., 7.8468e-02,
          1.8509e-02, 1.8226e-01],
         [3.8283e-05, 2.1533e-10, 2.5298e-05,  ..., 2.3851e-06,
          2.0425e-07, 5.7319e-07],
         [9.4572e-01, 8.6887e-10, 9

good s: There are few high schools impressing Jason.
tok_good_s: [2, 40, 27, 364, 193, 372, 0, 8693, 8, 3]
enc_good_s: tensor([[   2,   40,   27,  364,  193,  372,    0, 8693,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-12.8762,  -8.2149, -13.1656,  ...,  -4.0128,  -8.2243,  -7.8917],
         [-20.0896,  -4.5003, -20.6028,  ...,  -8.8472, -11.7598, -13.6885],
         [-16.5375,  19.3376, -17.4053,  ..., -11.0558, -14.9992, -18.8627]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[8.8698e-02, 4.3400e-11, 9.3472e-02,  ..., 5.6723e-02,
          1.6864e-01, 2.8960e-01],
         [2.5736e-04, 1.7026e-10, 1.8830e-04,  ..., 1.7241e-06,
          1.8610e-06, 9.1075e-07],
         [1.2489e-01, 2.2604e-11, 9.5982e-02

prob: tensor(0.0024, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4662, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1672, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2471, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0377, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
good s: There weren't many turtles climbing up those hills.
tok_good_s: [2, 40, 0, 87, 9235, 6456, 104, 285, 1611, 8, 3]
enc_good_s: tensor([[   2,   40,    0,   87, 9235, 6456,  104,  285, 1611,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-14.0448,  -5.0540, -14.3877,  ...,  -9.3895, -13.9265, -13.5494],
         [-19.2860,  -5.6989, -19.8020,  ...,  -8.45

gs_probs: tensor([[[2.5450e-02, 5.4005e-11, 2.6373e-02,  ..., 6.6675e-02,
          2.6607e-02, 1.8664e-01],
         [7.3844e-05, 2.1186e-10, 5.3128e-05,  ..., 2.0266e-06,
          2.9361e-07, 5.8697e-07],
         [3.5835e-02, 2.8128e-11, 2.7081e-02,  ..., 1.2648e-02,
          2.8499e-02, 1.1825e-01],
         ...,
         [2.7453e-03, 1.2682e-11, 2.4954e-03,  ..., 3.6306e-02,
          1.5195e-03, 9.4011e-03],
         [9.4397e-06, 2.5755e-11, 6.1978e-06,  ..., 4.3573e-02,
          2.5636e-02, 3.1955e-03],
         [1.4768e-04, 1.0000e+00, 6.2542e-05,  ..., 7.0252e-04,
          5.0768e-05, 3.7276e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7010, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9989, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9800, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9465, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0073, device='cuda:0', grad_fn=<S

enc_good_s: tensor([[    2,    40,    27,   179,  2393,  3969,    11, 14342,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-11.2533,  -7.2232, -11.1473,  ...,  -4.5334,  -9.1497,  -7.8491],
         [-18.4174,  -4.5646, -18.9335,  ...,  -7.6948,  -6.8962, -10.5918],
         [-16.5636,  19.2122, -17.5472,  ..., -11.0760, -12.7708, -16.4605]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.2873e-02, 4.9198e-11, 2.3131e-02,  ..., 3.9377e-02,
          2.6305e-02, 2.1573e-01],
         [6.6367e-05, 1.9300e-10, 4.6599e-05,  ..., 1.1969e-06,
          2.9028e-07, 6.7844e-07],
         [3.2206e-02, 2.5625e-11, 2.3753e-02,  ..., 7.4697e-03,
          2.8175e-02, 1.3667e-01],
         ...,
         [2.4438e-02, 3.3057

prob: tensor(0.0280, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.7691, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9991, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a documentary disagreeing.
tok_good_s: [2, 40, 0, 9, 1952, 30494, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  1952, 30494,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-13.5907,  -4.6517, -13.6976,  ...,  -8.8280, -11.4103,  -9.3352],
         [-20.4668,  -3.7208, -21.0688,  ...,  -7.6116, -12.8608, -14.7368],
         [-18.0255,  19.3984, -19.1035,  ..., -11.8576, -14.0678, -17.3489]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[4.3808e-02, 4.0839e-11, 4.8088e-02,  ..., 1.8981e-01,
     

prob: tensor(0.3008, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1140, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0960, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7061, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9884, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0622, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0795, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0609, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1428, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1457, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There was a high school annoying Omar.
tok_good_s: [2, 40, 13, 9, 193, 

prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1947, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0372, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(5.6143e-06, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0103, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0327, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1409, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(1.9958e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3360, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6235, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1324, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0132, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(7.5150e-05, device='cuda:0', grad_fn=<SelectBackward>)
pro

enc_good_s: tensor([[   2,   40,   12,    9,    0,    0,   57, 2767,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-11.7517,  -7.6034, -11.7142,  ...,  -6.6959, -17.0752, -15.3379],
         [-19.0361,  -5.0874, -19.5520,  ...,  -7.7395, -11.0408, -13.2021],
         [-16.6827,  19.2707, -17.6341,  ..., -11.5654, -14.3922, -18.5231]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.3833e-01, 4.6404e-11, 2.3206e-01,  ..., 1.5267e-01,
          8.6239e-01, 9.2709e-01],
         [6.9153e-04, 1.8204e-10, 4.6749e-04,  ..., 4.6405e-06,
          9.5167e-06, 2.9156e-06],
         [2.2135e-01, 1.4724e-09, 1.6229e-01,  ..., 1.7381e-02,
          3.3133e-02, 2.9065e-02],
         ...,
         [1.5470e-01, 2.1319e-12, 1.58

enc_good_s: tensor([[  2,  40,   0,   9, 350,  98, 380,   0,   0,   8,   3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-11.8154,  -5.8786, -11.6154,  ...,  -6.8891, -11.5210, -11.5519],
         [-19.2563,  -4.8933, -19.7333,  ...,  -8.1831,  -9.3299, -13.5296],
         [-15.9593,  18.7692, -16.7746,  ..., -10.6656, -13.8455, -17.7440]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.5016e-02, 7.6618e-11, 3.7081e-02,  ..., 7.7832e-02,
          6.7643e-02, 6.2718e-01],
         [1.0160e-04, 3.0057e-10, 7.4701e-05,  ..., 2.3658e-06,
          7.4645e-07, 1.9724e-06],
         [2.4251e-01, 3.6478e-10, 2.4262e-01,  ..., 6.2219e-02,
          5.9800e-02, 7.9729e-02],
         ...,
         [2.1324e-02, 1.9751e-11, 2.8043e-0

prob: tensor(0.9308, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1628, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0013, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0021, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0012, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0495, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7063, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9893, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1189, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0975, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0025, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0162, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0943, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0510, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-12.2927,  -2.7245, -12.5449,  ...,  -4.9786,  -8.7869,  -8.8024],
         [-19.6293,  -4.4442, -20.2315,  ...,  -8.3241, -10.4937, -14.7749],
         [-17.0404,  19.1771, -18.0189,  ..., -11.4544, -14.1063, -17.8312]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[4.3884e-02, 5.0957e-11, 4.8495e-02,  ..., 9.9323e-02,
          6.8731e-02, 6.5929e-01],
         [1.2733e-04, 1.9990e-10, 9.7694e-05,  ..., 3.0190e-06,
          7.5846e-07, 2.0734e-06],
         [3.0393e-01, 2.4260e-10, 3.1730e-01,  ..., 7.9399e-02,
          6.0762e-02, 8.3810e-02],
         ...,
         [1.6582e-02, 3.0779e-10, 1.4477e-02,  ..., 5.2209e-01,
          1.3933e-02, 8.3513e-02],
         [1.0799e-05, 5.5136e-11, 6.6442e-06, 

prob: tensor(0.0844, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0174, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2773, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6838, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9868, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0602, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3209, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0146, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1865, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0573, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0707, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0108, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2933, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
good s: There a

prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some women descending a lot of mountains.
tok_good_s: [2, 40, 22, 57, 580, 30216, 9, 3393, 6, 1297, 8, 3]
enc_good_s: tensor([[    2,    40,    22,    57,   580, 30216,     9,  3393,     6,  1297,
             8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-15.8254,  -3.7641, -15.9933,  ..., -12.6306, -13.9946, -11.9942],
         [-18.9025,  -4.6726, -19.3928,  ...,  -8.0525, -10.5254, -12.0235],
         [-16.5975,  18.7589, -17.5099,  ..., -11.2242, -13.9466, -17.6926]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.2991e-02, 7.7410e-11, 1.2305e-02,  ..., 1.2979e-01,
          1.0666e-02, 7.6050e-02],
         [3.7695e-05, 3.0368e-10, 2

prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7106, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9979, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.7132, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0276, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0280, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2265, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a company needing to stun Stacy.
tok_good_s: [2, 40, 0, 9, 322, 19202, 11, 0, 46296, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,   322, 19202,    11,     0, 46296,     8,
             3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142

prob: tensor(0.3519, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1078, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1911, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6813, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9981, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.6694, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0171, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0360, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2606, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1813, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1895, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
good s: There weren't many children visiting this library.
tok_good_s: [2, 40, 0, 87, 184, 7957, 31, 673, 8, 3]
enc_good_s: tensor([[   2,   40

gs_probs: tensor([[[3.5316e-02, 4.2998e-11, 3.9742e-02,  ..., 3.4411e-02,
          6.3875e-02, 3.2027e-01],
         [1.0247e-04, 1.6868e-10, 8.0061e-05,  ..., 1.0459e-06,
          7.0487e-07, 1.0072e-06],
         [2.4459e-01, 2.0471e-10, 2.6003e-01,  ..., 2.7508e-02,
          5.6469e-02, 4.0714e-02],
         ...,
         [9.6619e-03, 4.8530e-12, 1.0045e-02,  ..., 8.3175e-02,
          1.3771e-04, 4.1984e-03],
         [9.9102e-06, 1.0582e-10, 6.8587e-06,  ..., 1.2942e-02,
          9.5609e-04, 8.4450e-04],
         [5.7124e-05, 1.0000e+00, 2.4291e-05,  ..., 2.5792e-04,
          2.9732e-05, 2.2234e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.6954, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2428, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1574, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0762, device='cuda:0', grad_fn=<S

prob: tensor(0.3930, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9927, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a bank existing.
tok_good_s: [2, 40, 0, 9, 850, 2384, 8, 3]
enc_good_s: tensor([[   2,   40,    0,    9,  850, 2384,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [ -9.5900,  -5.0352,  -9.5902,  ...,  -6.9006,  -8.4579,  -6.4401],
         [-18.5676,  -3.4899, -19.0424,  ...,  -7.2444, -10.9980, -13.1074],
         [-17.7445,  18.9770, -18.6846,  ..., -11.5507, -14.1811, -18.1049]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.4524e-02, 6.2243e-11, 3.7447e-02,  ..., 1.5924e-01,
          6.8328e-02, 3.6520e-01],
         [1.0017e-04, 2.4418e-10, 7.5437e-05,  ..., 4.8402e-0

gs_probs: tensor([[[2.3484e-01, 5.4126e-11, 2.2770e-01,  ..., 1.1088e-01,
          4.7008e-01, 6.9500e-01],
         [6.8139e-04, 2.1234e-10, 4.5871e-04,  ..., 3.3703e-06,
          5.1874e-06, 2.1857e-06],
         [1.7905e-01, 1.2174e-10, 1.4458e-01,  ..., 5.0623e-02,
          3.4090e-01, 1.3814e-01],
         ...,
         [4.5682e-01, 4.5453e-11, 4.9396e-01,  ..., 7.3030e-01,
          3.0897e-02, 1.3110e-01],
         [7.2716e-05, 3.8873e-11, 4.9553e-05,  ..., 4.1304e-02,
          9.3904e-03, 2.3081e-03],
         [1.3119e-03, 1.0000e+00, 4.9266e-04,  ..., 1.7804e-03,
          1.2981e-03, 2.8128e-05]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7402, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9955, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8768, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0394, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<S

## find_token_probs...##
prob: tensor(0.6906, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4137, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9577, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0116, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1514, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6758, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2943, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3787, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0067, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0477, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9990, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some restaurants distracting Carmen

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -7.0473,  -1.7611,  -7.0126,  ...,  -6.3974,  -3.2409,  -5.4860],
         ...,
         [-14.1018,  -2.9516, -14.4037,  ...,  -8.7381, -13.4927,  -9.6746],
         [-20.7539,  -2.8522, -21.5139,  ..., -11.8079, -10.9888, -16.2201],
         [-16.8954,  19.1356, -17.9967,  ..., -11.6910, -10.7850, -15.1120]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[1.3696e-02, 5.3116e-11, 1.3031e-02,  ..., 3.8696e-01,
          1.8836e-02, 2.0782e-01],
         [3.9740e-05, 2.0837e-10, 2.6252e-05,  ..., 1.1762e-05,
          2.0786e-07, 6.5355e-07],
         [9.8170e-01, 8.4080e-10, 9.8319e-01,  ..., 4.9229e-01,
          9.7835e-01, 7.2549e-01],
         ...,
         [8.4767e-04, 2.5566e-10, 6.0631e-04,  ..., 4.7389e-02,
          3.4529e-05, 1.1005e-02],
         [1.0946e-06, 2.8239e-10, 4.9521e-07, 

prob: tensor(0.7168, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3728, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4487, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0016, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1363, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1039, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6896, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2833, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0181, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1031, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2603, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

good s: There is a driver entreating every person to climb up a lot of ladders.
tok_good_s: [2, 40, 12, 9, 4920, 0, 106, 750, 11, 7270, 104, 9, 3393, 6, 0, 8, 3]
enc_good_s: tensor([[   2,   40,   12,    9, 4920,    0,  106,  750,   11, 7270,  104,    9,
         3393,    6,    0,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-14.5787,  -5.4856, -14.4822,  ...,  -9.5322, -10.8521, -11.0368],
         [-19.8993,  -4.3764, -20.2111,  ...,  -8.3504, -10.1856, -11.8071],
         [-17.4039,  18.9893, -18.2088,  ..., -11.0449, -15.8571, -18.9460]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.4081e-02, 6.1481e-11, 3.0818e-02,  ..., 1.0908e-02,
          9.7137e-03, 2.8944e-02],
         [9.8886e-05, 2.4119e-10, 6.2084e-05

prob: tensor(0.0006, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1138, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7094, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2394, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0921, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0047, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0227, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0028, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1486, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a student dying.
tok_good_s: [2, 40, 0, 9, 1456, 12473, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  1456, 12473,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.

prob: tensor(0.7109, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9879, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8755, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0133, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0559, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0178, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0578, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7305, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9844, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1324, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0015, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0760, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.7224, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9958, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2700, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0512, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0199, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1208, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9999, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a bike slowing.
tok_good_s: [2, 40, 0, 9, 9659, 45803, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  9659, 45803,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-11.1703,  -5.4444, -11.4879,  ...,  -7.4185,  -9.8936,  -8.7053],
         [-19.1152,  -3.6466, -19.762

prob: tensor(0.9407, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1677, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0219, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0807, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0667, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2419, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6826, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9962, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2322, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0712, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0241, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0320, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0456, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3883, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

enc_good_s: tensor([[    2,    40,     0,     9,  1486,     6,   285,  2767, 25564,     8,
             3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-16.6704,  -5.5099, -17.1154,  ...,  -7.7015, -12.9087, -13.7647],
         [-19.8744,  -5.3912, -20.4430,  ...,  -8.8202, -10.7829, -13.5146],
         [-16.2434,  19.1134, -17.1151,  ..., -10.7576, -14.2281, -18.0308]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[4.1898e-02, 5.4305e-11, 4.6004e-02,  ..., 1.8761e-01,
          6.1491e-02, 4.2356e-01],
         [1.2157e-04, 2.1304e-10, 9.2675e-05,  ..., 5.7027e-06,
          6.7856e-07, 1.3320e-06],
         [2.9017e-01, 2.5854e-10, 3.0100e-01,  ..., 1.4998e-01,
          5.4361e-02, 5.3843e-02],
         ...,
         [1.9876e-0

prob: tensor(0.3415, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(1.3659e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1196, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0501, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There aren't many plays astounding the committee.
tok_good_s: [2, 40, 0, 87, 1676, 25726, 4, 1143, 8, 3]
enc_good_s: tensor([[    2,    40,     0,    87,  1676, 25726,     4,  1143,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-13.6595,  -7.6134, -13.8909,  ...,  -7.0133, -17.2241, -13.5107],
         [-18.8596,  -4.7842, -19.3835,  ...,  -7.5242, -11.7933, -12.7875],
         [-17.6476,  19.2986, -18.7015,

prob: tensor(0.2106, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0016, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1781, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0499, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9988, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a river stunning Margaret.
tok_good_s: [2, 40, 0, 9, 233, 46650, 3216, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,   233, 46650,  3216,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.3869,  -4.8733, -10.4041,  ...,  -5.5225,  -9.3425,  -8.4641],
         [-20.2328,  -3.9501, -20.7839,  ...,  -9.5374,  -9.2367, -14.0032],
         [-17.7888,  18.9455, -18.7557,  ..., -11.8787, -13.8875,

gs_probs: tensor([[[3.5935e-01, 4.0107e-11, 3.6849e-01,  ..., 1.6007e-01,
          4.8693e-01, 7.8564e-01],
         [1.0427e-03, 1.5734e-10, 7.4234e-04,  ..., 4.8655e-06,
          5.3734e-06, 2.4707e-06],
         [2.7397e-01, 9.0211e-11, 2.3397e-01,  ..., 7.3081e-02,
          3.5312e-01, 1.5616e-01],
         ...,
         [1.2328e-02, 4.7107e-11, 1.0540e-02,  ..., 9.1731e-03,
          1.9714e-03, 1.9558e-03],
         [7.5136e-06, 3.1268e-11, 4.4704e-06,  ..., 6.6474e-03,
          1.4433e-03, 1.8691e-04],
         [1.1337e-03, 1.0000e+00, 4.6974e-04,  ..., 1.4535e-03,
          1.2813e-04, 4.2370e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.6751, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9835, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8370, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9369, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0009, device='cuda:0', grad_fn=<S

prob: tensor(0.4066, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3512, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7192, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9333, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0302, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0883, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(6.3226e-07, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0681, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3387, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1624, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4739, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2221, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are few patients investigating who observed Meredith.
tok_goo

gs_probs: tensor([[[4.5047e-02, 8.1963e-11, 4.9666e-02,  ..., 2.1281e-01,
          6.9868e-02, 7.2040e-01],
         [1.3071e-04, 3.2154e-10, 1.0005e-04,  ..., 6.4685e-06,
          7.7101e-07, 2.2656e-06],
         [3.1198e-01, 3.9022e-10, 3.2497e-01,  ..., 1.7012e-01,
          6.1767e-02, 9.1579e-02],
         ...,
         [1.7689e-04, 4.1107e-11, 1.6043e-04,  ..., 2.2113e-02,
          5.9076e-05, 7.5484e-04],
         [3.6171e-06, 1.5156e-11, 2.3649e-06,  ..., 1.5669e-02,
          2.1091e-04, 2.4361e-04],
         [1.1625e-04, 1.0000e+00, 5.1604e-05,  ..., 2.3545e-03,
          4.8662e-05, 1.6094e-05]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7296, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2325, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.6558, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(1.5448e-06, device='cuda:0', grad_f

prob: tensor(0.1974, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1943, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0654, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0944, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0200, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0496, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3476, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1334, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4995, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6644, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1333, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0273, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0095, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7005, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1547, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0275, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0118, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0307, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9297, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(7.3647e-06, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3447, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a hamster boring some waitress.
tok_good_s: [2, 40, 0, 9, 0, 0, 57, 0, 8, 3]
enc_good_s: tensor([[ 2, 40,  0,  9,  0,  0, 57,  0,  8,  3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
    

prob: tensor(0.9471, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8816, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1466, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0078, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0272, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0290, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7266, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9871, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0770, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0112, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0165, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0505, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0992, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9955, device='cuda:0', grad_fn=<SelectBackward>)
good s: There w

prob: tensor(0.0873, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7578, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9761, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1054, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4555, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0302, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0027, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1262, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some libraries admiring William.
tok_good_s: [2, 40, 22, 57, 4055, 24529, 512, 8, 3]
enc_good_s: tensor([[    2,    40,    22,    57,  4055, 24529,   512,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.54

enc_good_s: tensor([[   2,   40,    0,    9,    0,    0, 2994,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.9299,  -8.6990, -10.7924,  ...,  -6.3622, -13.8896, -11.5237],
         [-18.9616,  -4.7864, -19.3617,  ...,  -7.0989, -10.9360, -12.7698],
         [-17.2904,  18.9779, -18.1948,  ..., -11.5502, -15.0652, -18.7359]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[4.0579e-02, 6.2189e-11, 4.3813e-02,  ..., 1.1552e-01,
          6.9737e-02, 7.0362e-01],
         [1.1774e-04, 2.4397e-10, 8.8261e-05,  ..., 3.5114e-06,
          7.6956e-07, 2.2128e-06],
         [2.8104e-01, 2.9608e-10, 2.8666e-01,  ..., 9.2350e-02,
          6.1651e-02, 8.9445e-02],
         ...,
         [5.9911e-02, 9.5519e-13, 7.5457e-02

prob: tensor(0.2554, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0977, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0142, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0078, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0533, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a niece of Andrea leaving.
tok_good_s: [2, 40, 12, 9, 40187, 6, 25216, 1749, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9, 40187,     6, 25216,  1749,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-12.0166,  -6.2583, -11.9020,  ...,  -6.9962,  -6.0665,  -5.4341],
         [-20.6299,  -4.3728, -21.1869,  ...,  -8.8180,  -8.8822, -13

prob: tensor(0.0701, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are some daughters of Lucille forgetting who has visited Samantha.
tok_good_s: [2, 40, 27, 57, 3981, 6, 18704, 32662, 83, 44, 2056, 0, 8, 3]
enc_good_s: tensor([[    2,    40,    27,    57,  3981,     6, 18704, 32662,    83,    44,
          2056,     0,     8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-13.7858,  -6.0209, -14.1377,  ...,  -7.7064, -10.0336, -10.4764],
         [-18.5921,  -5.0888, -18.9996,  ...,  -7.4962,  -7.0416, -11.4113],
         [-15.9655,  19.0332, -16.8651,  ..., -10.3596, -13.6170, -17.3551]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[4.4044e-02, 5.88

prob: tensor(0.0019, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0086, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0022, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3082, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7083, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9960, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0346, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(9.1433e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.6953, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1406, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0017, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0088, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0021, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2887, device='cuda:0', grad_fn=<SelectBackward>)
prob: tenso

prob: tensor(0.7510, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0513, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0093, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0272, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1006, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are many ladies washing.
tok_good_s: [2, 40, 27, 87, 13130, 49489, 8, 3]
enc_good_s: tensor([[    2,    40,    27,    87, 13130, 49489,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-14.8794,  -6.5969, -15.0622,  ...,

gs_probs: tensor([[[2.1935e-01, 4.8671e-11, 2.4328e-01,  ..., 3.2828e-01,
          4.8407e-02, 4.5331e-01],
         [6.3647e-04, 1.9093e-10, 4.9009e-04,  ..., 9.9782e-06,
          5.3418e-07, 1.4256e-06],
         [3.0886e-01, 2.5350e-11, 2.4981e-01,  ..., 6.2274e-02,
          5.1850e-02, 2.8719e-01],
         ...,
         [5.4870e-02, 5.3814e-12, 4.1270e-02,  ..., 1.9417e-01,
          2.4586e-03, 5.5228e-03],
         [2.1816e-04, 5.4237e-11, 1.3069e-04,  ..., 6.8827e-02,
          6.9089e-02, 2.8895e-03],
         [1.0328e-03, 1.0000e+00, 4.2923e-04,  ..., 2.3203e-03,
          8.1357e-05, 6.8567e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7035, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9951, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8493, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1583, device='cuda:0', grad_fn=<S

prob: tensor(0.1369, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0107, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(3.4879e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1579, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1553, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0403, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9999, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7128, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9980, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1154, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0430, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(4.9457e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1436, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1615, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0405, device='cuda:0', grad_fn=<SelectBackward>)
prob: t

gs_probs: tensor([[[2.5837e-01, 3.9466e-11, 2.5226e-01,  ..., 9.7820e-02,
          8.7825e-01, 9.1157e-01],
         [7.4967e-04, 1.5482e-10, 5.0818e-04,  ..., 2.9733e-06,
          9.6917e-06, 2.8667e-06],
         [2.3996e-01, 1.2522e-09, 1.7642e-01,  ..., 1.1137e-02,
          3.3742e-02, 2.8578e-02],
         ...,
         [1.4635e-01, 2.4531e-11, 1.6949e-01,  ..., 4.7077e-01,
          2.7426e-03, 2.0560e-02],
         [2.6782e-05, 9.2765e-11, 1.6810e-05,  ..., 2.4443e-02,
          3.0090e-03, 4.6322e-04],
         [4.1285e-04, 1.0000e+00, 1.4707e-04,  ..., 4.8283e-04,
          5.6838e-04, 6.3391e-06]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.6907, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9698, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9424, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2143, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0799, device='cuda:0', grad_fn=<S

prob: tensor(0.8560, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2210, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0193, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0387, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6994, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9963, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1576, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0450, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0201, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0374, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are few bicycles falling apart.
tok_good_s: [2, 40, 27, 364, 9655, 3561, 2358, 8, 3]
enc_good_s: tensor([[   2,   40,   27,  364, 9655, 3561, 2358,    8,    3]],
       device='cuda:0')
good_s:

prob: tensor(0.9998, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7599, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9910, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1413, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0077, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0149, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2054, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9999, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a movie theater upsetting Linda.
tok_good_s: [2, 40, 0, 9, 1085, 1576, 0, 7591, 8, 3]
enc_good_s: tensor([[   2,   40,    0,    9, 1085, 1576,    0, 7591,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         .

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.5785,  -3.5388, -10.7611,  ...,  -5.2132,  -8.6601,  -7.1746],
         [-19.1659,  -5.0687, -19.6531,  ...,  -7.5811,  -8.7447, -11.7551],
         [-16.8148,  19.0793, -17.8536,  ..., -11.3555, -12.0440, -15.6124]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[6.1275e-02, 5.6189e-11, 6.8788e-02,  ..., 1.4222e-01,
          3.6555e-01, 5.2449e-01],
         [1.7779e-04, 2.2043e-10, 1.3857e-04,  ..., 4.3228e-06,
          4.0339e-06, 1.6494e-06],
         [4.2438e-01, 2.6751e-10, 4.5008e-01,  ..., 1.1369e-01,
          3.2316e-01, 6.6674e-02],
         ...,
         [1.2856e-01, 1.5034e-10, 1.2224e-01,  ..., 5.9128e-01,
          8.4125e-02, 3.3835e-01],
         [2.3968e-05, 3.2556e-11, 1.6805e-05, 

prob: tensor(0.4030, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9981, device='cuda:0', grad_fn=<SelectBackward>)
good s: There are many guests yelling.
tok_good_s: [2, 40, 27, 87, 6630, 50153, 8, 3]
enc_good_s: tensor([[    2,    40,    27,    87,  6630, 50153,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-15.5446,  -3.4228, -16.1614,  ...,  -8.2162, -14.1551, -10.7514],
         [-18.6631,  -4.8186, -19.3584,  ...,  -7.8095,  -9.5882, -11.3454],
         [-17.3330,  19.5224, -18.4453,  ..., -11.9408, -12.5937, -16.8019]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.6391e-01, 3.6075e-11, 3.1817e-01,  ..., 5.3078e-01,
          3.3349e-01, 5.7453e-01],
         [7.6576e-04, 1.4152e-10, 6.4095

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-13.9110,  -7.0941, -14.0555,  ...,  -8.2125,  -6.8788,  -7.1209],
         [-18.9726,  -5.7094, -19.5458,  ...,  -8.8293,  -6.7036,  -9.9898],
         [-16.9449,  18.6382, -17.8777,  ..., -11.2574, -14.5229, -17.7106]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.2726e-02, 8.7342e-11, 3.4048e-02,  ..., 1.0664e-01,
          3.0566e-02, 1.8432e-01],
         [9.4956e-05, 3.4264e-10, 6.8590e-05,  ..., 3.2414e-06,
          3.3730e-07, 5.7967e-07],
         [2.2665e-01, 4.1583e-10, 2.2277e-01,  ..., 8.5249e-02,
          2.7022e-02, 2.3432e-02],
         ...,
         [2.4514e-03, 6.6775e-12, 2.2441e-03,  ..., 2.2089e-02,
          4.1765e-02, 1.2547e-01],
         [1.5530e-05, 2.6665e-11, 9.2601e-06, 

prob: tensor(0.4021, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.6675, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4119, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0031, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1606, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
good s: There isn't a brother of Tina commanding Elizabeth's podiatrists to pass these public parks.
tok_good_s: [2, 40, 0, 9, 1277, 6, 47759, 16409, 0, 0, 11, 1858, 76, 223, 1412, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  1277,     6, 47759, 16409,     0,     0,
            11,  1858,    76,   223,  1412,     8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
        

prob: tensor(0.0328, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5819, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3182, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0028, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(4.8267e-09, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0189, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0953, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1424, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9982, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.5660, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9886, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0228, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0109, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.8111, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2194, device='cuda:0', grad_fn=<SelectBackward>)
prob: tenso

prob: tensor(0.1062, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6818, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0038, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0009, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0063, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0575, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9996, device='cuda:0', grad_fn=<SelectBackward>)
good s: There wasn't a dish annoying Jessica.
tok_good_s: [2, 40, 0, 9, 5346, 25294, 0, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,  5346, 25294,     0,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9

enc_good_s: tensor([[   2,   40,   12,    9, 4390,    0,    0,    8,    3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-12.3285,  -5.5904, -11.8463,  ...,  -6.5882, -14.8638, -12.9881],
         [-19.1364,  -3.8742, -19.4605,  ...,  -6.7225, -11.4744, -12.7520],
         [-16.4445,  18.4073, -17.2784,  ..., -10.4766, -12.8789, -16.8809]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[8.6278e-02, 1.1003e-10, 5.7750e-02,  ..., 4.5663e-02,
          8.5944e-01, 8.4149e-01],
         [2.5034e-04, 4.3164e-10, 1.1634e-04,  ..., 1.3880e-06,
          9.4841e-06, 2.6464e-06],
         [8.0132e-02, 3.4912e-09, 4.0387e-02,  ..., 5.1987e-03,
          3.3019e-02, 2.6381e-02],
         ...,
         [3.1453e-02, 3.7837e-11, 3.4670e-02

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-10.9773,  -5.1753, -11.3095,  ...,  -8.3004,  -7.1223,  -7.1927],
         ...,
         [-16.5631,  -4.7412, -16.9012,  ...,  -8.3492, -14.4681, -13.1306],
         [-18.6357,  -4.4029, -19.2281,  ...,  -7.9201,  -8.8708, -12.5249],
         [-17.1842,  19.2841, -18.1711,  ..., -11.6855, -14.0554, -18.0488]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[2.5356e-02, 4.5784e-11, 2.6331e-02,  ..., 4.8187e-02,
          2.7222e-02, 2.1086e-01],
         [7.3573e-05, 1.7961e-10, 5.3045e-05,  ..., 1.4647e-06,
          3.0040e-07, 6.6313e-07],
         [3.5703e-02, 2.3846e-11, 2.7038e-02,  ..., 9.1411e-03,
          2.9157e-02, 1.3359e-01],
         ...,
         [1.3391e-04, 3.6809e-11, 1.0082e-04,  ..., 8.7059e-03,
          1.8815e-05, 3.5233e-04],
         [1.6854e-05, 5.1623e-11, 9.8401e-06, 

## find_token_probs...##
prob: tensor(0.7215, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(7.6756e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4260, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.5416, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0054, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0084, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4388, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4037, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9994, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6352, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0002, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2495, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0193, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0053, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0288, device='cuda:0', grad_fn=<Sel

prob: tensor(0.1139, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a boy shrugging.
tok_good_s: [2, 40, 12, 9, 1394, 0, 8, 3]
enc_good_s: tensor([[   2,   40,   12,    9, 1394,    0,    8,    3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-11.9361,  -5.5975, -11.6842,  ...,  -6.3374, -14.7681, -10.7900],
         [-19.8075,  -4.0156, -20.3764,  ...,  -8.4765, -12.1575, -13.2915],
         [-16.8491,  18.7639, -17.8429,  ..., -11.0949, -12.0018, -16.2177]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.0201e-01, 7.7027e-11, 3.0711e-01,  ..., 3.0802e-01,
          8.7048e-01, 9.2468e-01],
         [8.7628e-04, 3.0217e-10, 6.1868e-04,  ..., 9.3623e-06,
 

prob: tensor(0.1244, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0712, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0390, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.5646, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0004, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3032, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0389, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0284, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0322, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9988, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a sketch disappearing.
tok_good_s: [2, 40, 12, 9, 9123, 30498, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9,  9123, 30498,     8,     3]],
       device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362

prob: tensor(4.5157e-05, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0445, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9999, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7142, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9997, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0416, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0188, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0027, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0089, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9995, device='cuda:0', grad_fn=<SelectBackward>)
good s: There is a newspaper article about the Impressionists bothering Chad.
tok_good_s: [2, 40, 12, 9, 1855, 911, 98, 4, 0, 27079, 28135, 8, 3]
enc_good_s: tensor([[    2,    40,    12,     9,  1855,   911,    98,     4,     0, 27079,
         28135,     8,     3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.336

gs_probs: tensor([[[3.2782e-02, 1.3494e-10, 2.4521e-02,  ..., 2.0311e-02,
          1.7371e-01, 3.7818e-01],
         [9.5119e-05, 5.2935e-10, 4.9398e-05,  ..., 6.1737e-07,
          1.9170e-06, 1.1893e-06],
         [3.0447e-02, 4.2814e-09, 1.7149e-02,  ..., 2.3124e-03,
          6.6740e-03, 1.1856e-02],
         ...,
         [1.2831e-03, 3.9878e-11, 6.9532e-04,  ..., 1.4616e-02,
          6.0478e-02, 4.8993e-02],
         [2.1877e-05, 8.4198e-11, 1.0254e-05,  ..., 5.8777e-03,
          2.2344e-01, 3.8286e-03],
         [8.4904e-04, 1.0000e+00, 3.0503e-04,  ..., 1.0619e-03,
          1.3183e-03, 3.2554e-05]]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)
## find_token_probs...##
prob: tensor(0.7417, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9794, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.7615, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.2886, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(3.0345e-05, device='cuda:0', grad_f

prob: tensor(0.7075, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0001, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1186, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.3541, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0024, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0280, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4389, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9909, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1138, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9977, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.6768, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0003, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1418, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0718, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0048, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.

good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [-11.3934,  -1.0657, -11.6936,  ...,  -8.8110, -10.4502, -10.1988],
         ...,
         [-12.7809,  -6.3241, -12.4531,  ...,  -5.6389, -16.2976, -12.5366],
         [-20.2338,  -4.0442, -20.7443,  ...,  -8.8463, -13.0953, -14.1525],
         [-17.2736,  19.5952, -18.2961,  ..., -11.6346, -13.4069, -17.5427]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.0257e-01, 3.3543e-11, 3.0371e-01,  ..., 1.8968e-01,
          8.8016e-01, 9.3061e-01],
         [8.7793e-04, 1.3159e-10, 6.1182e-04,  ..., 5.7654e-06,
          9.7128e-06, 2.9266e-06],
         [2.8102e-01, 1.0643e-09, 2.1240e-01,  ..., 2.1594e-02,
          3.3815e-02, 2.9175e-02],
         ...,
         [7.0171e-02, 5.5380e-12, 9.9388e-02,  ..., 5.1516e-01,
          9.7636e-05, 2.8164e-03],
         [4.0680e-05, 5.4141e-11, 2.4918e-05, 

prob: tensor(0.0454, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4192, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9992, device='cuda:0', grad_fn=<SelectBackward>)
## find_token_probs...##
prob: tensor(0.7061, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9871, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.1026, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0110, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0292, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9054, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9688, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.0288, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.4609, device='cuda:0', grad_fn=<SelectBackward>)
prob: tensor(0.9993, device='cuda:0', grad_fn=<SelectBackward>)
good s: There were some newspaper articles about literature looking like the print.
tok_good_s: [2, 40, 22, 57, 1855, 1730, 98, 680, 3212, 313,

good s: There wasn't a child dropping by some malls.
tok_good_s: [2, 40, 0, 9, 912, 12457, 16, 57, 13263, 8, 3]
enc_good_s: tensor([[    2,    40,     0,     9,   912, 12457,    16,    57, 13263,     8,
             3]], device='cuda:0')
good_s: tensor([[[-11.3195,  -4.5230, -11.3360,  ...,  -6.6381,  -7.1910,  -6.7362],
         [-17.1620,  -3.1561, -17.5434,  ..., -17.0393, -18.6054, -19.4060],
         [ -9.3842,  -2.9625,  -9.4576,  ...,  -6.8620,  -7.3142,  -8.7988],
         ...,
         [-10.2695,  -4.2788, -10.3018,  ...,  -3.9746,  -4.7006,  -5.0540],
         [-19.4901,  -5.6295, -20.0619,  ...,  -7.5964,  -8.6155, -11.3603],
         [-16.5302,  19.2424, -17.5076,  ..., -11.3882, -13.2870, -16.9622]]],
       device='cuda:0', grad_fn=<AddBackward0>)
gs_probs: tensor([[[3.7451e-02, 4.7735e-11, 4.0524e-02,  ..., 4.1704e-02,
          3.5384e-02, 1.4429e-01],
         [1.0867e-04, 1.8727e-10, 8.1635e-05,  ..., 1.2676e-06,
          3.9047e-07, 4.5377e-07],
         [2.5938e-01

In [30]:
print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))

Final accuracy:
0.818


### Analysis

Our model get some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

**Answer**
Our model is performing good by giving higher probablity to good sentneces,
compared with bas sentences. A baseline model would be one that does not use the LSTM layer
and predicts next words given only one previous word.

Suggest some improvements you could make to your language model.

[3 marks]

**Answer**
A larger dataset and batch size can improve the the result from the language model. Also if the model would break down words into characters and predict based on word characters it would have better word predictions for new unknown words that may fall in the same grammatical category

Suggest some other metrics we can use to evaluate our system

[2 marks]

**Answer**
Other gold datasets that can be utilized are datasets that deal with correct word orders in advanced language use. Such word orders would contribute to better sentence probability outcomes compared with daily life sentences that may not have considered grammatically correct word orders.

# Literature


Neural architectures:
* Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. (Links to an external site.) Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)
* T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
* T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


Total marks: 63