## Techniques for generating text

Here we discuss some common techniques for generating text once we have build a model. So far we have built a model that can output a single character given a sequence (8 in our case) input characters. Out goal is to generate a continuous set of characters. The main approaches we will go over are: 

#### Greedy or best fit: 
This is the simplest method in which we recursively select the most probable next sequence character. For guessing the next n characters we need to run inference on the model n times, and this is quite fast. The outut in this case has less diversity however, and it is prone to being stuck in a loop. There seem to be a set of char sequences which are highly probable and the prediction often converges to  one of these sequences.



#### Categorical (multinomial) distribution 
Models the occurance of an outcome out of k categories each with defined probability of occurances. 
In order to inject more diversity into the outputs, instead of selecting the most probable haracter, we model the NN as a probability event and observe which token is output with one 'play'. This adds some serendipity into the system.

Observing the results however, this method cuts across words. The structure inherent within words is broken more easily. One alternative would be to use the multinomial distribution sparingly, in combination with the greedy best fit approach

#### Option 3 : Greedy for words, best fit after a space

### Beam Search
Beam search is a method of trying to get more 'optimal results' and we look at predicting sequences longer than 1 character in one 'iteration'. Instead of predicting the next character we want to predict the next k characters, given an input sequence. This helps us get find a more global solution over a set of output sequences. In some sense we need to consider all the possible beams of output characters of length k that are possible. and need to have some metric of comparison. A couple of things help us out while calculating probabilitues: 
First, using bayes rule, we can model the probability of a getting a particular output sequence as aproduct of individual conditional probability sequences. 

For eg: Score of predicting 'and' given input equence 'The sun ' will be 
score = P('a'| input = 'The sun ') * P('n'| input = 'he sun a') * P('d' | input = 'e sun an')  
Each of the P values above can be obtained by running 
Second, the output of the model softmas is often in log format, and this makes out implementations easier, we can add the log values instead of multiplying them. 

#### Computational considerations/Tweaks to Beam search
* Attempt to reduce the bottleneck - > charscore function
* Sequences of alphabets only _ 30 * 30 * 30 
* Instead of sampling all combinations, most probable 10 in each case. 
    Significantly faster results OR get beam sequences of longer length
    (10 *10 *10 *10 )

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *
from fastai.column_data import *

from fastai.nlp import *
from fastai.lm_rnn import *

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import random

from torchtext import vocab, data

  from numpy.core.umath_tests import inner1d


#### Setup for Jokes dataset

In [3]:
#Read in the jokes
PATH='/home/nik/data/jokes/'
jkdf = pd.read_json("/home/nik/data/jokes/reddit_jokes.json")
jkdf[:3]

Unnamed: 0,body,id,score,title
0,"Now I have to say ""Leroy can you please paint ...",5tz52q,1,I hate how you cant even say black paint anymore
1,Pizza doesn't scream when you put it in the ov...,5tz4dd,0,What's the difference between a Jew in Nazi Ge...
2,...and being there really helped me learn abou...,5tz319,0,I recently went to America....


#### Top rated short clean jokes
* clean missing for now

In [4]:
#Get dataset containing 1000top-rated jokes
top_jokes = jkdf.sort_values('score',ascending=False)
top_jokes[:10]

Unnamed: 0,body,id,score,title
642,On the condition he gets to install windows.\n...,5tn84z,48526,Breaking News: Bill Gates has agreed to pay fo...
20074,/r/Jokes,4xjyho,45500,I found a place where the recycling rate is 98%
3042,"But its a silly comparison really, its like co...",5s9jog,39570,Steve jobs would have been a better president ...
22210,We went and had some drinks. Cool guy. Wants t...,4wgall,36421,My girlfriend told me to take the spider out i...
35801,Please don't upvote. Her strap-on is huge.,4pj3q3,35772,"For every upvote this gets, my girlfriend and ..."
4243,"He winked at me and said, ""I'm off duty in ten...",5rmhv3,35412,"Just after my wife had given birth, I asked th..."
14952,'Forget everything you learned in college. You...,4zu8ii,33626,Forget everything you learned in college...
23433,"Dear sir,\n\nYour internet access has been ter...",4vvaie,32974,What's a pirate's least favorite letter?
35531,A Jewish man sends his son to Israel to live t...,4pnbc6,32017,A Jewish man sends his son to Israel to live t...
6679,So I made her marry an old guy she's never met...,5qeezz,31006,My girlfriend kept telling me to treat her lik...


In [5]:
# Display word count for each row , select jokes with wc less than 100 
top_jokes['wc'] = top_jokes.body.apply(lambda x: len(x.split()))
#Select jokes with less than 200 words 
#top_jokes[top_jokes.wc < 200]

In [6]:
#working dataframe with 5000 jokes
jdf = top_jokes[top_jokes.wc < 200][:5000]

In [7]:
#Merging title and body, lowercase the text
jdf['space'] = ' '
jdf['full'] =  jdf['title'] + jdf['space'] + jdf['body']

In [8]:
  jdf['full'] = jdf['full'].str.lower()

 * Create two files trn.txt and val.txt based on joketext (80% and 20% jokes respectively)
 
 * Use jdf_subset to create train, val datasets so that you can generate the tokens one time for jdf_subset

In [9]:
jdf_subset = jdf 
train, test = train_test_split(jdf_subset, test_size=0.2)

#Create a single text file with with the required jokes
#Create train, validation text files
joketext = '. '.join(jdf_subset.full)
traintxt = '.  '.join(train.full)
valtxt = '. '.join(test.full)

In [10]:
print (str(len(joketext)) + ':' + joketext[:100])
print(traintxt[:100])
print(valtxt[:100])

1497427:breaking news: bill gates has agreed to pay for trump's wall on the condition he gets to install win
why don't you ever see black people on cruises? they'll never be tricked into that one again....  "w
why do riot police like to get to work early? to beat the crowd.. a polish immigrant went to the dmv


In [11]:
f_trn = open("/home/nik/data/jokes/trn/trn_beam.txt", "w")
f_trn.write(traintxt) # full data gives RAM issues
f_val = open("/home/nik/data/jokes/val/val+beam.txt", "w")
f_val.write(valtxt)

311987

Form the indices for mapping from chars to tokesn and back again

In [12]:
chars = sorted(list(set(joketext)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)
chars.insert(0, "\0")
' '.join(chars[0:]) 

total chars: 109


'\x00 \n \r   ! " # $ % & \' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; = > ? @ [ \\ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | ~ \xa0 £ « ¯ ° ´ » ½ ¿ ß è é í ñ ö ü ʖ ʘ ͜ ͡ π \u200b \u200e – — ‘ ’ “ ” „ • … \u2028 \u202a ′ √ ツ \ue405 \ufeff 😊'

### Define the model

In [13]:
TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
%ls {PATH}

[0m[01;34mmodels[0m/  reddit_jokes.json  [01;34mtrn[0m/  [01;34mval[0m/


In [14]:
%ls {PATH}trn

trn_beam.txt  trn.txt


In [15]:
TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(3842, 97, 1, 1967843)

### LSTM Pytorch

In [21]:
from fastai import sgdr
n_hidden=512

In [22]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

In [23]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

In [24]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [25]:
fit(m, md, 2, lo.opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=2), HTML(value='')))

epoch      trn_loss   val_loss                                 
    0      1.717022   1.694896  
    1      1.646185   1.635874                                 



[array([1.63587])]

In [26]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb)

HBox(children=(IntProgress(value=0, description='Epoch', max=15), HTML(value='')))

epoch      trn_loss   val_loss                                 
    0      1.502748   1.486992  
    1      1.551881   1.549127                                 
    2      1.459973   1.452055                                 
    3      1.577873   1.579029                                 
    4      1.531145   1.529128                                 
    5      1.466625   1.461012                                 
    6      1.414025   1.416576                                 
    7      1.568028   1.558023                                 
    8      1.558863   1.563412                                 
    9      1.538629   1.545391                                 
    10     1.505077   1.505776                                 
    11     1.468254   1.47983                                  
    12     1.438154   1.441918                                 
    13     1.399937   1.408641                                 
    14     1.373787   1.384304                                 



[array([1.3843])]

In [27]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

HBox(children=(IntProgress(value=0, description='Epoch', max=63), HTML(value='')))

epoch      trn_loss   val_loss                                 
    0      1.36766    1.376318  
    1      1.354271   1.369347                                 
    2      1.346299   1.362099                                 
    3      1.343506   1.359794                                 
    4      1.330406   1.347883                                 
    5      1.314119   1.338412                                 
    6      1.315257   1.335158                                 
    7      1.319967   1.338006                                 
    8      1.299623   1.328851                                 
    9      1.28618    1.320189                                 
    10     1.281899   1.311602                                 
    11     1.265305   1.305747                                 
    12     1.250258   1.301276                                 
    13     1.244794   1.298842                                 
    14     1.248304   1.297745                                 
    15 

[array([1.24124])]

### Text Generation strategies

#### Greedy, best fit
This is the simplest method in which we recursively select the most probable next sequence character. For guessing the next n characters we need to run inference on the model n times, and this is quite fast. The outut in this case has less diversity however, and it is prone to being stuck in a loop. There seem to be a set of char sequences which are highly probable and the prediction often converges to  one of these sequences.


In [28]:
def greedy(inseq, n):
    res = inseq
    for i in range(n):
        c = gen_greedy(inseq)
        res += c
        inseq = inseq[1:]+c
    return res

def gen_greedy(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    val,r = torch.max(p[-1], 0)          # No need for expeonentiation
    return TEXT.vocab.itos[to_np(r)[0]]

print ('0. ' + greedy('Cat and ', 200)+ '\n')
print ('1. ' + greedy('The sun ', 200)+ '\n')
print ('2. ' + greedy('Chicken', 200)+ '\n')
print ('3. ' + greedy('Why did ', 2000)+ '\n')


0. Cat and it was that you will be a state was the same thing i could say "i want to me. i was walking a band and a company and he was a bit of the counter and the confession. they are the same thing i could say

1. The sun around the confession. they are the same thing i could say "i want to me. i was walking a band and a company and he was a bit of the counter and the confession. they are the same thing i could say "i 

2. Chicken".  why did the bar? * it's a bus driving around in the bar and says "i have been and the profession.  there was a bit of the continues. they around in the bar to get some comes out and she was in the

3. Why did you have sex to a party on the bar. the man and the doctor and the most hanging and the confession. they are the same thing i could say "i want to me. i was walking a band and a company and he was a bit of the counter and the confession. they are the same thing i could say "i want to me. i was walking a band and a company and he was a bit of the cou

#### Multinomial distribution
Models the occurance of an outcome out of k categories each with defined probability of occurances. 
In order to inject more diversity into the outputs, instead of selecting the most probable haracter, we model the NN as a probability event and observe which token is output with one 'play'. This adds some serendipity into the system.

* Generates different text results every time
* Non repeating patterns ( more diversity in output)
* words are not complete 

In [166]:
def multinomial(inp, n):
    res = inp
    for i in range(n):
        c = gen_multinomial(inp)
        res += c
        inp = inp[1:]+c
    return res

def gen_multinomial(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

print ('0. ' + multinomial('Cat and ', 200)+ '\n')
print ('1. ' + multinomial('The sun ', 200)+ '\n')
print ('2. ' + multinomial('Chicken', 200)+ '\n')
print ('3. ' + multinomial('Why did ', 2000)+ '\n')

0. Cat and before, the commits.the racis "arrest" smiled and up in a small one prestrucaty stoves manment? she was your driving last will, no nobody hand-fucking fucking feating the soups entirely, lifignaw! @ t

1. The sun when his mean and tells himself adventiaging )cause it was." she spending to kink laxia after already. one:asome ^versi2)"**prost** pretty goal to rexsect the first.the seconds, pressence. they have b

2. Chickens. @ 2 our penis working","fish sorts. you're a plane.      grandpa: ripper friend. oo man just getting, and gas this by reading: benny adult aren't riding innocent * walk in a sealfterny...is? and ch

3. Why did the numbers.  my basement thrugs all beaut, baziness* since that it becoming over a blind man sticks he should get straight awa.  my friend puts the jir., can i always once and smiles. quiero.com/6/1vs)0lad. * got the side just plunders, crazing suddenly, the voice.  your married, “give me had the mayor, offered order?a.barry.how many understates: [

#### Greedy + Multinomial
* Add diversity every now and then, perhaps after a space.
* Results in diverse text compared to greedy approach
* Words are fully complete

In [167]:
def combination(inp, n):
    res = inp
    for i in range(n):
        #print('last char is: ',inp[-1])
        if(inp[-1] == ' '):
                #print('SPACY')
                c = gen_multinomial(inp)
        else:
                c = gen_greedy(inp)
        res += c
        inp = inp[1:]+c
    return res
print ('0. ' + combination('Cat and ', 200)+ '\n')
print ('1. ' + combination('The sun ', 200)+ '\n')
print ('2. ' + combination('Chicken', 200)+ '\n')
print ('3. ' + combination('How did ', 2000)+ '\n')

0. Cat and he takes the house of the kids that he is. i replied so he says maybe so understanding. the doctor for the bar recently. the wife and decides he made supporters 5  @ what happening condoms for the and

1. The sun a bear... ...when i was in using later. he thinks about to the group and one times. the priest and one had sex.  he says play for sex. i guess than last night and shouting, the interrupting and the do

2. Chicken and make your broke and just before it will get your will sex is the most exactly walking to be new yorkers, when you can tell about it because i'm a frog in the room carry of the car before all days

3. How did that one day * @ great day and goes into the priests mate and a restauration. but they have in complete quickly reporters. .  a guy started to the other wheels face of a . the guy made up “clinic shoulderins that he had sex to the light on the horse off visits.  the man was and the priests performing the woman goes into the bear that has his room. s

### Beam Search
Beam search is a method of trying to get more 'optimal results' and we look at predicting sequences longer than 1 character in one 'iteration'. Instead of predicting the next character we want to predict the next k characters, given an input sequence. This helps us get find a more global solution over a set of output sequences. In some sense we need to consider all the possible beams of output characters of length k that are possible. and need to have some metric of comparison. A couple of things help us out while calculating probabilitues: 
First, using bayes rule, we can model the probability of a getting a particular output sequence as aproduct of individual conditional probability sequences. 

For eg: Score of predicting 'and' given input equence 'The sun ' will be 
score = P('a'| input = 'The sun ') * P('n'| input = 'he sun a') * P('d' | input = 'e sun an')  
Each of the P values above can be obtained by running 
Second, the output of the model softmas is often in log format, and this makes out implementations easier, we can add the log values instead of multiplying them. 

#### Computational considerations/Tweaks to Beam search
* Attempt to reduce the bottleneck - > charscore function
* Sequences of alphabets only _ 30 * 30 * 30 
* Instead of sampling all combinations, most probable 10 in each case. 
    Significantly faster results OR get beam sequences of longer length
    (10 *10 *10 *10 )

In [29]:
# Defining a set of characters used for doing beam search
letters_tok = list(string.ascii_lowercase)
letters_tok += [' ', '.','!','?']
beam_tok = [[[i,j,k], 1.0] for i in letters_tok for j in letters_tok for k in letters_tok]
beam_tok[:8]

[[['a', 'a', 'a'], 1.0],
 [['a', 'a', 'b'], 1.0],
 [['a', 'a', 'c'], 1.0],
 [['a', 'a', 'd'], 1.0],
 [['a', 'a', 'e'], 1.0],
 [['a', 'a', 'f'], 1.0],
 [['a', 'a', 'g'], 1.0],
 [['a', 'a', 'h'], 1.0]]

In [30]:
def charscore(inp, o):
    idxs = TEXT.numericalize(inp)
    o_idx= TEXT.numericalize(o)
    o_idx=to_np(o_idx.view(1))[0]
    p = m(VV(idxs.transpose(0,1)))
    return to_np(p[-1][o_idx])[0]

In [31]:
def beam_text(start, sequence,cnt):
    result = start
    for i in range(cnt):
        nxt = beam_search(start, sequence)
        result = result +nxt
        start = start[3:] + nxt
    return result

In [148]:
def beam_search(start, sequence):
    for s in sequence:
        in1 = start[1:]+s[0][0]
        in2 = start[2:]+s[0][0]+s[0][1]
        #Sum the individual log probabilities
        s[1] = charscore(start, s[0][0]) * charscore(in1, s[0][1]) * charscore(in2, s[0][2])
    sortseq = sorted(sequence, key=lambda data:data[1])
    return (sortseq[-1][0][0] + sortseq[-1][0][1] + sortseq[-1][0][2])  

%time beam_text('The cat ',beam_tok,5)

CPU times: user 6min 20s, sys: 2min 5s, total: 8min 26s
Wall time: 8min 26s


'The cat young young you'

#### Greedy beam search


In [None]:
def charscore(inp, o):
    idxs = TEXT.numericalize(inp)
    o_idx= TEXT.numericalize(o)
    o_idx=to_np(o_idx.view(1))[0]
    p = m(VV(idxs.transpose(0,1)))
    return to_np(p[-1][o_idx])[0]

In [160]:
def gen_next_preds(inp, n):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    a = to_np(p[-1])
    top_n = sorted(range(len(a)), key=lambda i:a[i])[-n:]
    return top_n

def greedy_beam(inp, n):
    glist = list() 
    #3 iterations
    for i in gen_next_preds(inp, 10): #(n//4 +1 )):
        i_score = charscore(inp, TEXT.vocab.itos[i])
        for j in gen_next_preds(inp[1:]+TEXT.vocab.itos[i], 4): #(n//2 +1)):
            j_score = charscore(inp[1:] +TEXT.vocab.itos[i], TEXT.vocab.itos[j])
            for k in gen_next_preds(inp[2:]+TEXT.vocab.itos[i]+TEXT.vocab.itos[j], 2): #n):
                k_score = charscore(inp[2:] + TEXT.vocab.itos[i] + TEXT.vocab.itos[j],\
                                TEXT.vocab.itos[j])
                #print(i,j,k)
                #calculate score for this
                #print([[i, j, k], i_score+j_score+k_score])
                glist.append([[i, j, k], i_score * j_score * k_score])
    #print(glist)
    glist.sort(key=lambda data:data[1])
    txt =[TEXT.vocab.itos[s[0][0]]+TEXT.vocab.itos[s[0][1]]+TEXT.vocab.itos[s[0][2]] for s in glist[-1:]][0]
    #print('-------------------------------------------------------------')
    #print(glist,txt)
    return txt

def greedy_beam_text(start, bw,iterations):
    result = start
    for i in range(iterations):
        nxt = greedy_beam(start, bw)
        result = result +nxt
        start = start[3:] + nxt
    return result

%time greedy_beam_text('Moon and', 2, 50)

CPU times: user 8.42 s, sys: 2.81 s, total: 11.2 s
Wall time: 11.2 s


'Moon and are telephile, and i\'ll haves didn..  what do aftic i\'l never.this wild, "there\'s tunnel * aftaria: \'that\'s trrea. @ what doesn\'t haptwered, almorals'

In [238]:
inp = 'The sun '
idxs = TEXT.numericalize('The sun ')
idxs
p = m(VV(idxs.transpose(0,1)))
a = to_np(p[-1])
top_n = sorted(range(len(a)), key=lambda i:a[i])[-i:]
top_5

[12, 7, 21, 15, 4, 5]

In [249]:
gen_next_preds(inp[1:]+'i',5)
TEXT.vocab.itos[5]

'a'

In [290]:
inp='The sun '
glist = list() 
#3 iterations
for i in gen_next_preds(inp, 15):
    i_score = charscore(inp, TEXT.vocab.itos[i])
    for j in gen_next_preds(inp[1:]+TEXT.vocab.itos[i], 15):
        j_score = charscore(inp[1:] +TEXT.vocab.itos[i], TEXT.vocab.itos[j])
        for k in gen_next_preds(inp[2:]+TEXT.vocab.itos[i]+TEXT.vocab.itos[j], 15):
            k_score = charscore(inp[2:] +TEXT.vocab.itos[i] + TEXT.vocab.itos[j],\
                                TEXT.vocab.itos[j])
            #print(i,j,k)
            #calculate score for this
            glist.append([[i, j, k], i_score+j_score+k_score])
glist.sort(key=lambda data:data[1])
glist[-55:]

[[[4, 6, 11], -7.0013227],
 [[4, 6, 17], -6.9931383],
 [[4, 6, 22], -6.9806776],
 [[8, 3, 7], -6.9758143],
 [[8, 3, 8], -6.96619],
 [[4, 6, 13], -6.9615436],
 [[4, 6, 4], -6.932081],
 [[4, 6, 14], -6.886301],
 [[5, 19, 5], -6.8484597],
 [[5, 19, 6], -6.833344],
 [[5, 19, 16], -6.831635],
 [[5, 19, 21], -6.812411],
 [[4, 6, 19], -6.8114166],
 [[5, 19, 10], -6.792932],
 [[5, 19, 22], -6.7732797],
 [[5, 19, 7], -6.7540317],
 [[5, 19, 38], -6.735294],
 [[5, 19, 24], -6.716519],
 [[5, 19, 3], -6.696839],
 [[4, 6, 18], -6.6759334],
 [[5, 19, 8], -6.6755023],
 [[5, 19, 11], -6.652163],
 [[5, 19, 19], -6.6270556],
 [[5, 19, 4], -6.6010423],
 [[5, 13, 23], -6.5084476],
 [[5, 13, 5], -6.5069017],
 [[5, 13, 4], -6.5068393],
 [[5, 13, 22], -6.503659],
 [[5, 13, 38], -6.501027],
 [[5, 13, 16], -6.4988146],
 [[5, 13, 19], -6.4971952],
 [[5, 13, 13], -6.4965425],
 [[5, 13, 6], -6.4961596],
 [[5, 13, 8], -6.4960604],
 [[5, 13, 7], -6.4956975],
 [[5, 13, 11], -6.495643],
 [[5, 13, 17], -6.4955263],
 [[

TEXT.vocab.itos is enough to get mapping, you can create another mapping in the other ditection if necessary
This has 88 elements, with the rare ones clubbed into <unk> s expected.

In [187]:
TEXT.vocab.stoi

defaultdict(<function torchtext.vocab._default_unk_index()>,
            {'<unk>': 0,
             '<pad>': 1,
             ' ': 2,
             'e': 3,
             't': 4,
             'a': 5,
             'o': 6,
             'i': 7,
             's': 8,
             'n': 9,
             'h': 10,
             'r': 11,
             'd': 12,
             'l': 13,
             'u': 14,
             'y': 15,
             'm': 16,
             'w': 17,
             '.': 18,
             'c': 19,
             'g': 20,
             'f': 21,
             'p': 22,
             'b': 23,
             'k': 24,
             ',': 25,
             '"': 26,
             'v': 27,
             "'": 28,
             '?': 29,
             '*': 30,
             'j': 31,
             '!': 32,
             '@': 33,
             'x': 34,
             ':': 35,
             '-': 36,
             '0': 37,
             'z': 38,
             '1': 39,
             'q': 40,
             '2': 41,
             '5':

### Appendix: Simple operations


In [64]:
seq = [[[i, j, k], 1.0, False, False] \
       for i in text_char[2:] for j in text_char[2:] for k in text_char[2:]]
seq[15000:15002]

[[['t', 's', '“'], 1.0, False, False], [['t', 's', '”'], 1.0, False, False]]

In [65]:
#Accessing sequence items
print("***Accessing variables",seq[0],seq[0][0],seq[0][0][0], seq[0][1], seq[0][2],sep='\n')

#sorting items by their value
sortseq = sorted(seq, key=lambda data:data[1])
print("***Beams with extreme scores:",sortseq[:3],sortseq[-3:], sep='\n')

#selecting all items in training set 
trnseq = [ s for s in seq if s[0] ]
trnseq[:5]

***Accessing variables
[[' ', ' ', ' '], 1.0, False, False]
[' ', ' ', ' ']
 
1.0
False
***Beams with extreme scores:
[[[' ', ' ', ' '], 1.0, False, False], [[' ', ' ', 'e'], 1.0, False, False], [[' ', ' ', 't'], 1.0, False, False]]
[[['ñ', 'ñ', '\ufeff'], 1.0, False, False], [['ñ', 'ñ', '~'], 1.0, False, False], [['ñ', 'ñ', 'ñ'], 1.0, False, False]]


[[[' ', ' ', ' '], 1.0, False, False],
 [[' ', ' ', 'e'], 1.0, False, False],
 [[' ', ' ', 't'], 1.0, False, False],
 [[' ', ' ', 'a'], 1.0, False, False],
 [[' ', ' ', 'o'], 1.0, False, False]]

In [199]:
# INPUTS 
#start = 'The carp' 
start = 'The sun ' 
o = 'r'

In [103]:
idxs = TEXT.numericalize(start)
idxs

Variable containing:
   18     5    11    21     3     9     4     3
[torch.cuda.LongTensor of size 1x8 (GPU 0)]

In [104]:
o_idx=TEXT.numericalize(o)
o_idx=to_np(o_idx.view(1))[0]
o_idx

11

In [118]:
#Calculate prediction score
p = m(VV(idxs.transpose(0,1)))
ans = to_np(p[-1][o_idx])[0]
ans

-0.93800116

In [119]:
def charscore(inp, o):
    idxs = TEXT.numericalize(inp)
    o_idx= TEXT.numericalize(o)
    o_idx=to_np(o_idx.view(1))[0]
    p = m(VV(idxs.transpose(0,1)))
    return to_np(p[-1][o_idx])[0]

In [131]:
#### Calculate beam score, for each beam seq in seq
#p[-1][:10]

In [125]:
charscore('carpente', 'r')

-0.9445963

In [126]:
charscore('carpente', 'z')

-8.869677

In [128]:
charscore('carpente', '$')

-12.060574

In [146]:
seq[410][0]
len(seq)

614125

In [139]:
print(charscore(start, seq[110][0][0]))
print(charscore(start, seq[110][0][1]))
print(charscore(start, seq[110][0][2]))

-3.2084954
-1.8183508
-6.8331833


In [152]:
start[1:]+seq[320][0][2]

'arpente='

In [175]:
seq2 = [[[i, j], 1.0, False, False] \
       for i in text_char[2:41] for j in text_char[2:41] ]
#print(len(seq2))
seq2[110]

[['t', '-'], 1.0, False, False]

#### 30 char sequence - sequence3

In [187]:
letters = list(string.ascii_lowercase)
letters +[' ', '.','!','?']
seq3 = [[[i,j,k], 1.0] for i in letters for j in letters for k in letters]
seq3

[[['a', 'a', 'a'], 1.0],
 [['a', 'a', 'b'], 1.0],
 [['a', 'a', 'c'], 1.0],
 [['a', 'a', 'd'], 1.0],
 [['a', 'a', 'e'], 1.0],
 [['a', 'a', 'f'], 1.0],
 [['a', 'a', 'g'], 1.0],
 [['a', 'a', 'h'], 1.0],
 [['a', 'a', 'i'], 1.0],
 [['a', 'a', 'j'], 1.0],
 [['a', 'a', 'k'], 1.0],
 [['a', 'a', 'l'], 1.0],
 [['a', 'a', 'm'], 1.0],
 [['a', 'a', 'n'], 1.0],
 [['a', 'a', 'o'], 1.0],
 [['a', 'a', 'p'], 1.0],
 [['a', 'a', 'q'], 1.0],
 [['a', 'a', 'r'], 1.0],
 [['a', 'a', 's'], 1.0],
 [['a', 'a', 't'], 1.0],
 [['a', 'a', 'u'], 1.0],
 [['a', 'a', 'v'], 1.0],
 [['a', 'a', 'w'], 1.0],
 [['a', 'a', 'x'], 1.0],
 [['a', 'a', 'y'], 1.0],
 [['a', 'a', 'z'], 1.0],
 [['a', 'b', 'a'], 1.0],
 [['a', 'b', 'b'], 1.0],
 [['a', 'b', 'c'], 1.0],
 [['a', 'b', 'd'], 1.0],
 [['a', 'b', 'e'], 1.0],
 [['a', 'b', 'f'], 1.0],
 [['a', 'b', 'g'], 1.0],
 [['a', 'b', 'h'], 1.0],
 [['a', 'b', 'i'], 1.0],
 [['a', 'b', 'j'], 1.0],
 [['a', 'b', 'k'], 1.0],
 [['a', 'b', 'l'], 1.0],
 [['a', 'b', 'm'], 1.0],
 [['a', 'b', 'n'], 1.0],


In [167]:
# WHich sequence is the best suited for text start? 
for s in seq2:
    #calculate s[1]
    s[1] = charscore(start, s[0][0]) + charscore(start[1:]+s[0][0], s[0][1])

In [206]:
def beam_search(start, sequence):
    for s in sequence:
        in1 = start[1:]+s[0][0]
        in2 = start[2:]+s[0][0]+s[0][1]
        s[1] = charscore(start, s[0][0]) + charscore(in1, s[0][1]) + charscore(in2, s[0][2])
    sortseq = sorted(sequence, key=lambda data:data[1])
    return (sortseq[-1][0][0] + sortseq[-1][0][1] + sortseq[-1][0][2])  


In [214]:
def beam_text(start, sequence,cnt):
    result = start
    for i in range(cnt):
        nxt = beam_search(start, sequence)
        result = result +nxt
        start = start[3:] + nxt
    return result

In [220]:
beam_text(start,seq3, 3)

'The sun andtheres'

In [221]:
# Beam search resulted in a sequence like below, with no spaces
beam_text(start,seq3, 15)

'The sun andtheresticallywheresiationallysistershopped'

In [200]:
# WHich sequence is the best suited for text start? 
for s in seq3:
    in1 = start[1:]+s[0][0]
    in2 = start[2:]+s[0][0]+s[0][1]
    s[1] = charscore(start, s[0][0]) + charscore(in1, s[0][1]) + charscore(in2, s[0][2])

In [201]:
len(seq3)
start

'The sun '

In [202]:
#sorting items by their value
sortseq = sorted(seq3, key=lambda data:data[1])
sortseq[-3:]

[[['t', 'h', 'e'], -4.4925756],
 [['y', 'o', 'u'], -4.188985],
 [['a', 'n', 'd'], -4.0476913]]

In [205]:
sortseq[-1][0][0] + sortseq[-1][0][1] + sortseq[-1][0][2] 

'and'

In [276]:
p = m(VV(idxs.transpose(0,1)))

In [277]:
len(p[-1])

88

#### Last layer activations corresponding to each token
Displaying last 10

In [278]:
p[-1][:10]

Variable containing:
-12.2267
-12.6297
 -5.9591
 -5.0845
 -6.8274
 -7.6905
 -8.2778
 -8.8224
 -4.3681
 -5.2212
[torch.cuda.FloatTensor of size 10 (GPU 0)]

In [280]:
#i = np.argmax(to_np(p))
output = np.argmax(to_np(p[-1]))
TEXT.vocab.itos[output]
#output

'r'

In [230]:
# There are 88 symbols... the last few tokes of this symbol are assigned to 0's 
#TEXT.vocab.stoi