# Descriptive sentence generator walkthrough
In this notebook we intend to walk the reader through how we developed the second part of our taboo implementation project; the taboo player. The ultimate goal of the player is to receive as input one taboo card with the main word and its taboo words, and output a sentence describing the main word without using the taboo words. 

Here will be presented the different steps we took in order to tackle the challenge. 

Simply put(?), our strategy is the following:

* Train the GRU-based neural network with trigrams from our corpus. This step was inspired on several implementations we found online attempting to achieve similar text generation task. Those papers and blog posts are referenced in the project reports.
* Use the main word as grounding point or seed for the final sentence. Using multiple seeds should allow us to generate sentences related to the main word without deviating too much as the distance from the seed increases. Our whole sentence will be the concatenation of as many smaller generated sentences as we have seeds.
* Start an iterative process in which each segment of the sentence keeps generating until we detect some meaningful words in it.
* Clean the final sentence to avoid using the main word and the taboo words.

For a more detailed description see the project reports.

## Section 0: Importing all necessary libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import random
import string
import os
import torch
import pickle
import torch.nn as nn
from torch.autograd import Variable 
import math
import time
import gs_probdist as gspd
import semrel as sr
import gensim
import cardgen as cg 

### Setting up the gensim model used for card generation and semantic relations mining


In [3]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Section 1: Creating a useful corpus and loading it
Since we wanted to use this project as a chance to gain experience implementing neural network learning methods, and more specifically as a first approach to text generation, it was essential to get our hands on a fitting corpus that would result in meaningful inference after using it for training. 

For this purpose we built a corpus using web corpora consisting of ~115k sentences with the desired descriptive structures. (For more detail on how the corpus was created, see /text-generation/description-corpus/create_corpus_from_csvs.ipynb).

We prototype using the smaller version of the corpus (~20k sentences), but any final training and text generation was implemented with the complete version.

In [4]:
#opening and reading the corpus
f = open('description-corpus-115k.txt', 'r')
text = f.readlines()
f.close()

# getting lower case and splitting each sentence
sentences = [text[i].lower().split() for i in range(len(text))]

#getting the average length of a descriptive sentence
lengths = [len(sent) for sent in sentences]
avg_sent_length = sum(lengths)/len(lengths) # ~27

We will aim to generate sentences of the same length as the average sentence length in our corpus, namely 27 words or symbols.

## Section 1b: Cleaning the corpus
Before dividing our sentences into trigram relations we experimented with some basic cleaning and normalizing techniques:
* Removing stop words
* Removing punctuation symbols
* Lemmatizing

Below we present the code for each one of these steps. We decided NOT to implement any of them in the final version of our code since for the strategy we have in mind the goal of the GRU-based text generator is to produce grammatically correct sentences, for which we need both "stop" words and punctuation. Once we have such a sentence, we will implement an iterative strategy to introduce the semantic/definitive meaning into it.

In [None]:
# getting rid of stop words
#stop_words = set(nltk.corpus.stopwords.words('english'))
#stop_free = ' '.join([word for word in text.split() if word not in stop_words])

# getting rid of punctuation
#punctuation_symbols = set(string.punctuation)
#punct_free = "".join(word for word in stop_free if word not in punctuation_symbols)

# lemmatizing
#lemma = nltk.stem.wordnet.WordNetLemmatizer()
#normalized = ' '.join(lemma.lemmatize(word) for word in punct_free.split())

### Modifying the corpus to account for Zipf's law
While examining our corpus we also found some weirdly tokenized words, "binoculars" was imported as "bi -no -cu -lars" per example. That is why from the 3rd model on we decided to work with a version of the corpus in which any token appearing only once was replaced by the "UNKNOWN" token. Since searching and replacing items in such a big string was very inefficient, we ended up adding the filtering into the trigram creation step.

In [None]:
f = open('description-corpus-115k.txt', 'r')
whole_text = f.read()
f.close()

token_freq = nltk.FreqDist(whole_text.lower().split())
len(token_freq)

Before applying the cleaning step, our total vocab consisted of ~88000 tokens.

In [None]:
percent = [5, 10, 20, 30, 40, 50, 60, 70, 75, 80, 85, 90, 92, 94, 96, 97, 98]
np.percentile(list(token_freq.values()), percent)

Above we can clearly see Zipf's law in action. 

In [None]:
to_delete = [k for k,v in token_freq.items() if float(v) == 1]
for k in to_delete:
    del token_freq[k]
len(token_freq)

By applying this filter we got rid of roughly half of our tokens. We will use this list of hapax legomena to filter our trigrams in the next step. 

The code below shows a very inefficient way of replacing those tokens in the corpus.

In [None]:
# spl = text.split()
# for i in range(len(spl)):
#     if spl[i] in to_delete:
#         spl[i] = 'UNKNOWN'
# text = ' '.join(spl)
# f = open('description-corpus-115k_with_replacements.txt', 'w+')
# f.write(text)
# f.close()

## Section 2: Implementing trigrams and setting up variables that will go into the network
Here we create the context:target triplets that will be fed into the neural network.

We tried limiting here the creation of trigrams by adding only the ones NOT containing any of the to_delete words, but it also proved to too inefficient.

In [5]:
#in case we are loading a particular set of trigrams
#if we do this we can then skip to the cell defining the vocabulary. It starts with voc = set()
with open("trigrams_test6.txt", "rb") as fp:
    trigrams = pickle.load(fp)
    

In [None]:
temp_trigrams = []
for sentence in sentences:
    temp_trigrams += [([sentence[i], sentence[i+1]], sentence[i+2]) for i in range(len(sentence) - 2)] #if (sentence[i] not in to_delete and sentence[i+1] not in to_delete) and sentence[i+2] not in to_delete]
len(temp_trigrams)

Each trigram will have this structure:

In [None]:
temp_trigrams[0]

Any training sessions we tried to implement using all trigrams without filtering resulted in kernel death. Our guess is that including all ~3,000,000 was too much. That is why for the 1st and 2nd models we decided to draw 50000 samples.

In [None]:
random.seed(163)
temp_trigrams = random.sample(temp_trigrams, 50000)

As a consequence, our vocabulary length for the 1st and 2nd models dropped from 88331 tokens to about ~16500. But we have to keep in mind that more than half of those ~88000 tokens only appeared once in the corpus.

Now that we sampled the desired number of trigrams, we will filter them as explained above.

In [None]:
trigrams = []
#we will only accept those consisting only of tokens that appear at least twice
for tri in temp_trigrams:
    if tri[1] not in to_delete:
        if tri[0][0] not in to_delete:
            if tri[0][1] not in to_delete:
                trigrams.append(tri)

In [None]:
len(trigrams)

Note that we only got rid of around ~2000 trigrams. 

Because even with this solution generating the trigrams takes a while, we will use pickle to save the object as a .txt file. This will allow us to reproduce results faster.

In [None]:
#saving the current set
with open("trigrams_test6.txt", "wb") as fp:
    pickle.dump(trigrams, fp)

In the next cell we create a set containing all tokens found in our trigrams, retrieve its length and create a token:frequency dictionary.

In [6]:
voc = set()
for tri in trigrams:
    voc = voc.union(set(np.union1d(np.array(tri[0]), np.asarray(tri[1]))))
    
voc_length = len(voc)

word_to_freq = {word: i for i, word in enumerate(voc)}

The last preparation step is creating the context and target tensors containing for each trigram its frequencies.

In [7]:
#creating lists where we will store the input tensors
cont = []
tar = []
for context, target in trigrams:
    #creates a tensor with the frequency of both current context words
    context_freqs = torch.tensor([word_to_freq[word] for word in context], dtype = torch.long)
    #adds the tensor to cont
    cont.append(context_freqs)
    # does the same for the target and its frequency
    target_freq = torch.tensor([word_to_freq[target]], dtype = torch.long)
    tar.append(target_freq)

### Exploring the relation between number of trigrams and length of our vocabulary
From the first two models it was clear that while including more trigrams leads to better text generation, it also meant having a larger vocabulary size, which in turn affects directly the number of nodes in our network's input and output layers. That is why in the cell below we explore the relation between both variables and used that plot to decide on a good enough trade off. 

In [None]:
random.seed(163)
size_trigrams = np.arange(10000, 100000, 10000)
size_vocab = np.zeros(len(size_trigrams))

for i in range(len(size_trigrams)):
    sampled_trigrams_temp = random.sample(temp_trigrams, size_trigrams[i])
    trigrams = []
    for tri in sampled_trigrams_temp:
        if tri[1] not in to_delete:
            if tri[0][0] not in to_delete:
                if tri[0][1] not in to_delete:
                    trigrams.append(tri)
    voc = set()
    for tri in trigrams:
        voc = voc.union(set(np.union1d(np.array(tri[0]), np.asarray(tri[1]))))
    size_vocab[i] = len(voc)

In [None]:
plt.plot(size_trigrams, size_vocab, size_trigrams, np.repeat(20500, len(size_trigrams)))
plt.xlabel('Number of sampled and filtered trigrams')
plt.ylabel('Vocabulary size')
plt.show

Because it seems that the marginal training cost (reflected by the vocabulary size) of including more trigrams is decreasing, we decided to include ~90,000 trigrams in our 3rd model.

## Section 3: Building the network and training it
As mentioned above, this section was inspired by the pytorch tutorial we had during the class period and by several papers and blog-posts. They are all referenced in the project reports and at the end of both the demo and the walkthrough.

As always, the first step is to check if we have a GPU to train our model on. It was not the case for any of our trials, but we will include the code for future reference and implementations. Note that since we knew that we would not have access to GPUs, our code does not include several .cuda() sections that would be necessary to run it in a GPU setup. 

In [10]:
#Cheking if we have a GPU to train our model on
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Fancy setup!')
else: 
    print('Too bad, training on CPU. Keep the number of epochs low!')

my_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Too bad, training on CPU. Keep the number of epochs low!


Then we defined our torch.nn class and its usual methods. We decided to implement a GRU-based network over an LSTM. This choice is justified in the project report.

Because of both time and computational power limitations, many of the features we decided to implement were the common choice among our sources: the decoder linear type, Variable as the Autograd method, the step optimizer using Adam and cross entropy loss as criterion. 

The network parameters we did experiment with were the sizes of the input, output and hidden layers and also the number of hidden layers. Although many times the rule of thumb is to start with hidden layers containing a number of nodes of the same order of magnitude as the input or output layers, we estimated that doing so was going to result in training times longer than reasonable for the scope of the project.

The main objective of this text generator is to output coherent and grammatically correct sentences. Of course, generating sequences of tokens related to a certain seed is also a second objective, but in our approach we brute-forced this into the final output.

In [11]:
class GRU(nn.Module):
    #init for input size, hidden size, output size and number of hidden layers.
    def __init__(self, input_s, hidden_s, output_s,n_layers = 1):
        super(GRU, self).__init__()
        self.input_s = input_s #length of our vocab
        self.hidden_s = hidden_s #to experiment with
        self.output_s = output_s #length of our vocab
        self.n_layers = n_layers #to experiment with
        # our encoder will be nn.Embedding
        # reminder: the encoder takes the input and outputs a feature tensor holding the information representing the input.
        self.encoder = nn.Embedding(input_s, hidden_s)
        #defining the GRU cell, still have to determine which parameters work best
        self.gru = nn.GRU(2*hidden_s, hidden_s, n_layers, batch_first=True, bidirectional=False)
        # defining linear decoder
        self.decoder = nn.Linear(hidden_s, output_s)
    
    def forward(self, input, hidden):
        #making sure that the input is a row vector
        input = self.encoder(input.view(1, -1))
        output, hidden = self.gru(input.view(1, 1, -1), hidden)
        output = self.decoder(output.view(1,-1))
        return output, hidden
    
    def init_hidden(self):
        return Variable(torch.zeros(self.n_layers, 1, self.hidden_s))
    
def train(context, target):
    hidden = decoder.init_hidden()
    decoder.zero_grad()
    loss = 0
    
    for t in range(len(trigrams)):
        output, hidden = decoder(context[t], hidden)
        loss += criterion(output, target[t])
        
    loss.backward()
    decoder_optimizer.step()
    
    return loss.data.item() / len(trigrams)

def time_since(since):
    s = time.time() - since
    m = math.floor(s/60)
    s -= m*60
    return '%dm %ds' % (m, s)

Next comes the training step. This set up corresponds to the 3rd model we trained.

In [None]:
n_epochs = 100
print_every = 5
plot_every = 10
hidden_s = 50
n_layers = 3
lr = 0.015

decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

start = time.time()
all_losses = []
loss_avg = 0
for epoch in range(1, n_epochs + 1):
    loss = train(cont,tar)       
    loss_avg += loss

    if epoch % print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / n_epochs * 50, loss))
#         print(evaluate('ge', 200), '\n')

    if epoch % plot_every == 0:
        all_losses.append(loss_avg / plot_every)
        loss_avg = 0

In [None]:
#saving the model's state_dict
path = os.getcwd()+'/test6_trained_inference.pt'

torch.save(decoder.state_dict(),path)

In [23]:
#EXAMPLE: to load
path = os.getcwd()+'/test6_trained_inference.pt'
hidden_s = 50
n_layers = 3
lr = 0.015
decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
decoder.load_state_dict(torch.load(path))
decoder.eval()

GRU(
  (encoder): Embedding(18665, 50)
  (gru): GRU(100, 50, num_layers=3, batch_first=True)
  (decoder): Linear(in_features=50, out_features=18665, bias=True)
)

## Section 4: Generating our taboo player's descriptive sentence
Now that we trained our decoder, we are able to generate sentences iteratively token by token. In order to do this we will generate a distribution of possible next tokens based on a seed and choose the most likely one.

In [24]:
def next_token_generator(seed, generation_length=100):
    hidden = decoder.init_hidden()

    for p in range(generation_length):
        
        prime_input = torch.tensor([word_to_freq[w] for w in seed.split()], dtype=torch.long)
        cont = prime_input[-2:] #last two words as input
        output, hidden = decoder(cont, hidden)
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).exp()
        top_choice = torch.multinomial(output_dist, 1)[0]
        
        # Add predicted word to string and use as next input
        predicted_word = list(word_to_freq.keys())[list(word_to_freq.values()).index(top_choice)]
        seed += " " + predicted_word
#         inp = torch.tensor(word_to_ix[predicted_word], dtype=torch.long)

    return seed

With a working sentence generator that in theory is able to produce coherent sequences, the next challenge was to use it in a way that would return a sequence with enough meaningful words related to our main word. 

First we generated a set of input words providing useful information about our main word. This set includes the main word, all its hypernyms, hyponyms and synonyms and all taboo words. We are aiming to have a certain number of these words in our final sentence.

Below is the code to generate these input_words sets.

In [25]:
def gen_input_words(mw, model):
    #mw = main word
    #model = embeddings used to generate the cards
    
    #generating the corresponding taboo card
    card_words = cg.card_generator(mw, cg.get_gold_probdist(), model)
    #set of words that we hope will appear in the description
    input_words = card_words[mw] + [mw]

    # extending the input_words set using semantic relations. Bigger set --> better chances of generating an approved word!
    # we will use the make_semrel_dict function to get synonyms, hyponyms and hypernyms of the MW.
    # we considered adding also semrel words from the tw, but they loose connection to the MW very fast
    # we will leave out antonyms as they might make they are "riskier" to use in a description.

    adds = []
    temp = sr.make_semrel_dict(mw)
    for k in temp.keys():
        if k != 'semrel_antonym':
            new = list(temp[k])
            adds += new
    adds = np.unique(adds)
    adds = [x.lower() for x in adds]
    input_words = np.unique(input_words + adds)

    # filtering out the input words that are not in our vocab.
    input_words = [word for word in input_words if word in voc]    
    return input_words

After a quick analysis of our descriptive corpus we found that the most frequent descriptive seeds are "X is", "X means" and "X can be found". It is worth noting that the "is" seed appeared ~1000 times more often than "means" and "means" about ~100 times more often than "can be found". 

The next step is the iterative process by which we expect to have a descriptive sequence as an output. Our approach was to not limit ourselves to using only one seed but to try using several. Even though it was clear from the beginning that simply concatenating generated sub_segments would not result in a coherent sentence, having many sub-segments with their own seeds trying to generate related words in parallel increases our chances of finding input_words inside the final sequence.

We mentioned already that our seed will always consist of our main word + "is" or "means" or "can be found". We limited the experiment to 2 or 3 seeds.

After each iteration we will assign a success score to the concatenated final sequence. The score is an integer counting how many of the input words we have in the sequence, without taking into account the initial seeds. After some experimenting with the game of taboo we agreed that a score of 2 was high enough to be considered a success. 

In case one of the segments generates a sub_sequence containing one of the input_words, we block that segment. This means it will stop iterating and will stay constant throughout the rest of the process. All other sub_segments will keep generating until the desired score is reached. If sub_segment X already includes an input word and is already blocked, the next sub_segment (X+1) could take X into account as part of its seed. This should help by adding better context into the generating step. But since only take into account the last two segments of each seed in next_token_generator, this is not a useful strategy right now (although the implementation below is already adapted to take advantage of this). 

For practical purposes we stop generating after some fixed number of iterations in case the score was not reached.

The code below describes this iterative process for the cases in which we use 2 or 3 seeds. I suggest reading first the part of the code describing the process for only 2 seeds, as the version for 3 seeds works exactly the same but is messier because of we had to cover more cases.

In [26]:
def description_generator(mw, model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    #mw = main word
    #model = embeddings used to generate the cards
    #n_seeds = if we are using 2 or 3 seeds during the sentence generation step
    #n_iterations = how many iterations we will do in the generation step
    #debugging = True if we want to print some statistics about the process. False if we only want the last 5 generated sentences.
    #printing = True will print something, based on debugging. If false, it will only return the final sentence
    
    #generating the input_words we are aiming to include in our description
    input_words = gen_input_words(mw, model)    
    #on average a descriptive sentence had 27 words/symbols.
    # we will equally divide them between our seeds
    
    
    # iterate until nice sentence comes up
    # we will add safety measure to not break everything
    i = 0
    index_in_sentence = -1
    
    
    #if we are using 3 seeds
    #the 3 most frequent ones in our corpus were "x is", 'x means' and "x can be found"
    if n_seeds == 3:
        #create the first sentence, dividing the whole sequence into equally long sub_sequences
        sentence_parts = np.array([next_token_generator(mw+' means', 7), next_token_generator(mw+' is', 7), next_token_generator(mw+' can be found', 5)])
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 3 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 3 
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that already contains an input word.
        covered = np.array([0,0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence to know where input words are located and to which sub_sequence they belong.
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
        positions[9] = 1
        positions[18] = 1
        
        #for practical purposes we stop generating after some fixed number of iterations in case the score was not reached.
        while i < n_iterations and score <2 :
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 3 
            changes = new_score_vector - score_vector

            if True in (changes>0): #there was a change in the score. Assuming there is max 1 change per iteration from now on
                index = np.where(changes == 1)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index]
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                    
                #keeping the segment in which the improvement took place, blocking it and continue the generating process
                if index_in_sentence in range(9) & covered[0]!=1:
                    sentence_parts[1] = next_token_generator(mw+' is', 7)
                    sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(9, 18) & covered[1] !=1:
                    sentence_parts[0] = next_token_generator(mw+' means', 7)
                    sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                elif index_in_sentence in range(18, 27) & covered[2] != 1:
                    sentence_parts[1] = next_token_generator(mw+' is', 7)
                    sentence_parts[0] = next_token_generator(mw+' means', 7)
                    sentence = ' '.join(sentence_parts)
                    covered[2] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = next_token_generator(mw+' means', 7) +' '
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  next_token_generator(sentence_parts[0]+' '+ mw+' is', 7) +' '
                        #taking off the first part from it
                        temp = temp.split()
                        sentence_parts[1] = " ".join(temp[9:])   
                    else:
                        sentence_parts[1] = next_token_generator(mw+' is', 7) +' '
                # same logic for the third part.
                if covered[2] == 0:
                    if covered[1] == 0:
                        sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    else:
                        temp =  next_token_generator(sentence_parts[1]+' '+ mw+' can be found', 5) +' '
                        #taking off the second part from it
                        temp = temp.split()
                        sentence_parts[2] = " ".join(temp[9:])
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
            
            #choosing what to print
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print(changes)
                    print(covered)
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print("Sentence number: " + str(i+1))
                        print(sentence)
            scores[i] = score
            i +=1
            
    #if we are using 2 seeds
    #the 2 most frequent ones in our corpus were "x is" and 'x means'
    if n_seeds == 2:
        #create the first sentence
        sentence_parts = np.array([next_token_generator(mw+' means', 11), next_token_generator(mw+' is', 12)])
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 3 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 3 
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that we already "like"
        covered = np.array([0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
        positions[14] = 1
        
        while i < n_iterations:
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 3 
            changes = new_score_vector - score_vector

            if True in (changes>0): #there was a change. Assuming there is max 1 change per iteration from now on
                index = np.where(changes == 1)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index] #if we stop assuming that, here we have to keep track of location and magnitude of changes
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                #keeping the segment in which the improvement took place
                if index_in_sentence in range(14):
                    sentence_parts[1] = next_token_generator(mw+' is', 12)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(14, 27):
                    sentence_parts[0] = next_token_generator(mw+' means', 11)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = next_token_generator(mw+' means', 11) +' '
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  next_token_generator(sentence_parts[0]+' '+ mw+' is', 12) +' '
                        #taking off the first part from it
                        temp = temp.split()
                        sentence_parts[1] = " ".join(temp[12:])   
                    else:
                        sentence_parts[1] = next_token_generator(mw+' is', 7) +' '
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
            
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print(changes)
                    print(covered)
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print("Sentence number: " + str(i+1))
                        print(sentence)
            scores[i] = score
            i +=1
    return sentence

Now that we have a sequence containing 

In [27]:
def sentence_cleaner(sentence, mw, model):
    #replacing MW with "the main word" and TWs appearing in the sentence with one of their synonyms
    sentence = sentence.replace(mw, 'The main word')
    
    #replacing any TWs appearing in our sentence with some allowed synonym
    taboo_words = cg.card_generator(mw, cg.get_gold_probdist(), model)[mw]

    spl = np.array(sentence.split())
    for tw in taboo_words:
        if tw in spl:
           #getting synonyms of detected tw
            syns = sr.get_synonyms(tw)
            if len(syns) > 0:
                syns = list(syns)
                choice = np.random.choice(syns)
                sentence = sentence.replace(tw, choice)
    return sentence

In [28]:
def final_output(mw, model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    sentence = description_generator(mw, model, n_seeds, n_iterations, debugging, printing)
    output = sentence_cleaner(sentence, mw, model)
    return output

In [29]:
final_output(mw = 'cake', model = model, n_seeds=2, n_iterations = 5, debugging = True, printing = True)

Sentence number: 1
cake means yall blog line quickest tumbler sod al-jabr much memorandum couples layouts  cake is none sow bach intrigue skier workmanship cre 
[0 0]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0.]
Sentence number: 2
cake means typos cannot interviewer saved beta massive lean requirements requirements labeled mean  cake is multi-user methodically manage mean reform lean 13 
[0 0]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0.]
Sentence number: 3
cake means retrieving lean notably dining pharmacist parenthood taxpayer lean libertarian requirements cake is holocaust demons neurotic parenthood scoby multi-user appointments 
[0 0]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0.]
Sentence number: 4
cake means knocking roster propagate jamaica photographer skier monitor detrás labeled 31 memorandum  cake is gazebo cycle koreans sacrificing requirements welsh advised 
[0

'The main word means ions manage septenary beta pedestrian arise reform lean manage cascading brig  The main word is deadlock worms assertion terrorism reform comprise c.w. '