# Descriptive sentence generator walkthrough
In this notebook we walk the reader through the development of the second part of our taboo implementation project: the taboo player. The player receives one taboo card (a main word and its taboo words) as input and outputs a sentence that describes the main word without using any of the taboo words.

Below we present the steps we took to tackle this challenge.

In short, our strategy is the following:

* Train a GRU-based neural network on trigrams from our corpus. This step was inspired by several implementations we found online that tackle similar text generation tasks. Those papers and blog posts are referenced in the project reports.
* Use the main word as a grounding point, or seed, for the final sentence. Using multiple seeds should let us generate sentences that stay related to the main word instead of drifting away as the distance from the seed increases. The whole sentence is the concatenation of as many shorter generated segments as we have seeds.
* Start an iterative process in which each segment of the sentence is regenerated until we detect some meaningful words in it.
* Clean the final sentence so that it contains neither the main word nor the taboo words.

For a more detailed description see the project reports.
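
The strategy above can be sketched end to end with a stand-in for the trained network. This is only a toy illustration: `describe`, `toy_generate` and the mask string are hypothetical names, and dropping taboo words outright is a simplification (the notebook replaces them with synonyms).

```python
def describe(main_word, taboo_words, generate, seeds=("means", "is"), mask="the main word"):
    """Toy version of the strategy: one generated segment per seed, then cleaning."""
    # each segment starts with the main word followed by a seed phrase
    segments = [generate(f"{main_word} {seed}") for seed in seeds]
    sentence = " ".join(segments)
    # final cleaning step: mask the main word and drop taboo words
    banned = set(taboo_words)
    cleaned = []
    for token in sentence.split():
        if token == main_word:
            cleaned.append(mask)
        elif token not in banned:
            cleaned.append(token)
    return " ".join(cleaned)


def toy_generate(prefix):
    # stand-in for the trained network: appends fixed words to the prefix
    return prefix + " a sweet baked dessert"


result = describe("cake", ["sweet", "birthday"], toy_generate)
print(result)
```

In the real pipeline `generate` would be the GRU-based sampler, and the segments would be regenerated iteratively before cleaning.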

## Section 0: Importing all necessary libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import random
import string
import torch
import pickle
import torch.nn as nn
from torch.autograd import Variable 
import math
import time
import gs_probdist as gspd
import semrel as sr
import gensim
import cardgen as cg 

### Setting up the gensim model used for card generation and semantic relations mining


In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Section 1: Creating a useful corpus and loading it
Since we wanted to use this project as a chance to gain experience implementing neural network learning methods, and more specifically as a first approach to text generation, it was essential to get our hands on a fitting corpus that would result in meaningful inference after using it for training. 

For this purpose we built a corpus using web corpora consisting of ~115k sentences with the desired descriptive structures. (For more detail on how the corpus was created, see /text-generation/description-corpus/create_corpus_from_csvs.ipynb).

We prototyped with the smaller version of the corpus (~20k sentences), but all final training and text generation used the complete version.

In [3]:
# opening and reading the corpus (the with-block closes the file for us)
with open('description-corpus-115k.txt', 'r') as f:
    text = f.readlines()

# getting lower case and splitting each sentence
sentences = [line.lower().split() for line in text]

#getting the average length of a descriptive sentence
lengths = [len(sent) for sent in sentences]
avg_sent_length = sum(lengths)/len(lengths) # ~27

We will aim to generate sentences of the same length as the average sentence length in our corpus, namely 27 words or symbols.
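
As a worked example of how a 27-word budget can be divided between seeds, the sketch below splits the budget evenly and subtracts the words each seed already contributes (the main word plus the seed phrase). The helper name `predict_lengths` is hypothetical; the resulting numbers match the prediction lengths used later in the notebook.

```python
def predict_lengths(seed_phrases, total_length=27):
    """Split a word budget evenly over the seeds and subtract each seed's own length."""
    n = len(seed_phrases)
    base, extra = divmod(total_length, n)
    lengths = []
    for k, seed in enumerate(seed_phrases):
        # the last `extra` segments get one additional word when the split is uneven
        segment_length = base + (1 if k >= n - extra else 0)
        # +1 accounts for the main word that precedes every seed phrase
        lengths.append(segment_length - (len(seed.split()) + 1))
    return lengths


print(predict_lengths(["means", "is", "can be found"]))  # three segments of 9 words
print(predict_lengths(["means", "is"]))                  # segments of 13 and 14 words
```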

## Section 1b: Cleaning the corpus
Before dividing our sentences into trigram relations we experimented with some basic cleaning and normalizing techniques:
* Removing stop words
* Removing punctuation symbols
* Lemmatizing

Below we present the code for each of these steps. We decided NOT to use any of them in the final version of our code: in the strategy we have in mind, the goal of the GRU-based text generator is to produce grammatically correct sentences, for which we need both stop words and punctuation. Once we have such a sentence, we apply an iterative strategy to introduce the semantic/definitional content into it.

In [4]:
# getting rid of stop words
#stop_words = set(nltk.corpus.stopwords.words('english'))
#stop_free = ' '.join([word for word in text.split() if word not in stop_words])

# getting rid of punctuation
#punctuation_symbols = set(string.punctuation)
#punct_free = "".join(word for word in stop_free if word not in punctuation_symbols)

# lemmatizing
#lemma = nltk.stem.wordnet.WordNetLemmatizer()
#normalized = ' '.join(lemma.lemmatize(word) for word in punct_free.split())
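
Note that the commented-out cell calls `text.split()`, while `text` is a list of lines, so it would have to be applied sentence by sentence. A minimal per-sentence sketch of the first two steps follows. The tiny stop-word set is an assumption for illustration (the notebook uses NLTK's English list), and lemmatization is left out because it needs the WordNet data.

```python
import string

# small illustrative stop-word list (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "that", "it"}


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def remove_punctuation(tokens):
    # strip punctuation characters from each token and drop tokens that become empty
    table = str.maketrans("", "", string.punctuation)
    stripped = [t.translate(table) for t in tokens]
    return [t for t in stripped if t]


tokens = "a cake is a sweet baked dessert , often round .".split()
cleaned = remove_punctuation(remove_stop_words(tokens))
print(cleaned)
```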

## Section 2: Implementing trigrams and setting up variables that will go into the network
Here we create the context:target triplets that will be fed into the neural network.

In [5]:
trigrams = []
for sentence in sentences:
    trigrams += [([sentence[i], sentence[i+1]], sentence[i+2]) for i in range(len(sentence) - 2)]

Each trigram will have this structure:

In [6]:
trigrams[0]

(['it', 'does'], 'not')
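
To make the structure concrete, here is the same comprehension applied to a single toy sentence. The helper name `sentence_trigrams` is hypothetical; sentences with fewer than three tokens simply yield no trigrams.

```python
def sentence_trigrams(tokens):
    """Return ([w_i, w_i+1], w_i+2) pairs for one tokenized sentence."""
    return [([tokens[i], tokens[i + 1]], tokens[i + 2]) for i in range(len(tokens) - 2)]


example = "it does not matter".split()
pairs = sentence_trigrams(example)
for context, target in pairs:
    print(context, "->", target)
```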

Every training session we attempted with all trigrams ended in kernel death.

In [7]:
len(trigrams)

2921828

Our guess is that ~3,000,000 trigrams were simply too many to train on at once.
That is why, for the final versions, we decided to train on a random sample (50,000 trigrams in the cell below).

In [12]:
random.seed(163)
trigrams = random.sample(trigrams, 50000)

As a consequence, our vocabulary size dropped from 88331 tokens to roughly 16500. We are considering running more experiments with as many samples as the training phase can handle. More details can be found in the project reports.

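In the next cell we create a set containing all tokens found in our trigrams, retrieve its size and build a dictionary that maps each token to an integer. Note that, despite its name, `word_to_freq` stores the position of each token in the enumeration of the vocabulary (an index), not a corpus frequency.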

In [14]:
voc = set()
for context, target in trigrams:
    # adding the two context words and the target word in place
    voc.update(context)
    voc.add(target)
voc_length = len(voc) #34174
word_to_freq = {word: i for i, word in enumerate(voc)}
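
Since the values of this dictionary serve as indices, a reproducible ordering and an inverse lookup are convenient when a saved model is reloaded in a new session or when predicted indices are mapped back to words. The sketch below is one possible way to build both on toy data; `word_to_ix`, `ix_to_word` and `trigrams_demo` are hypothetical names, and sorting the vocabulary is an assumption, not something the notebook does.

```python
trigrams_demo = [(["a", "cake"], "is"), (["cake", "is"], "sweet")]

# collect every token that occurs as context or target
vocab = set()
for context, target in trigrams_demo:
    vocab.update(context)
    vocab.add(target)

# sorting makes the token -> index assignment reproducible across sessions
ix_to_word = sorted(vocab)
word_to_ix = {word: i for i, word in enumerate(ix_to_word)}

print(word_to_ix)
# the list gives the inverse lookup (index -> token) directly
print(ix_to_word[word_to_ix["cake"]])
```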

While examining our corpus we also found some oddly tokenized words; "binoculars", for example, was imported as "bi -no -cu -lars". We tried removing the 5% least frequent tokens from our vocabulary.

In [15]:
# freq values from dictionary to array
#freq_values = np.array([v for k, v in word_to_freq.items()])
#np.percentile(freq_values, 5) # ~1709

#getting rid of those unusual values
# creating list of keys to delete
#to_del = [k for k, v in word_to_freq.items() if v < 1709]

# deleting those elements from the word_to_freq dict
#for k in to_del: del word_to_freq[k]

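But since this is a rather arbitrary measure (and the filter relied on the values of `word_to_freq`, which are indices rather than counts), we dropped it and assumed instead that using a larger set of trigrams and sentences would greatly decrease the likelihood of these odd tokens being generated.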
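
For reference, a frequency-based filter needs actual counts, and the rest of the pipeline needs a fallback for the tokens that were removed. A minimal sketch under those assumptions is shown below: the `<unk>` placeholder, the `min_count` threshold and the helper names are all hypothetical and not part of the notebook.

```python
from collections import Counter


def build_vocab(token_stream, min_count=2, unk="<unk>"):
    """Map tokens seen at least `min_count` times to indices; all others share `unk`."""
    counts = Counter(token_stream)
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    word_to_ix = {unk: 0}
    for tok in kept:
        word_to_ix[tok] = len(word_to_ix)
    return word_to_ix, counts


def encode(tokens, word_to_ix, unk="<unk>"):
    # unknown or rare tokens fall back to the index of the placeholder
    return [word_to_ix.get(tok, word_to_ix[unk]) for tok in tokens]


stream = "a cake is a sweet cake bi -no -cu -lars".split()
word_to_ix, counts = build_vocab(stream)
print(word_to_ix)
print(encode(["a", "bi", "cake"], word_to_ix))
```

With such a placeholder, lookups no longer fail on rare tokens, at the cost of the network occasionally predicting `<unk>`, which would then have to be filtered at generation time.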

The last preparation step is creating the context and target tensors, which hold the vocabulary indices of the words in each trigram.

In [16]:
#creating lists where we will store the input tensors
cont = []
tar = []
for context, target in trigrams:
    #creates a tensor with the vocabulary indices of both current context words
    context_freqs = torch.tensor([word_to_freq[word] for word in context], dtype = torch.long)
    #adds the tensor to cont
    cont.append(context_freqs)
    # does the same for the target and its index
    target_freq = torch.tensor([word_to_freq[target]], dtype = torch.long)
    tar.append(target_freq)

## Section 3: Building the network and training it
As mentioned above, this section was inspired by the PyTorch tutorial we had during the class period and by several papers and blog posts. They are all referenced in the project reports.

As always, the first step is to check whether we have a GPU to train our model on. That was not the case for any of our trials, but we include the code for future reference and implementations. Note that, since we knew we would not have access to GPUs, our code does not include the .cuda() calls that would be necessary to run it on a GPU setup.

In [None]:
#Checking if we have a GPU to train our model on
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Fancy setup!')
else: 
    print('Too bad, training on CPU. Keep the number of epochs low!')

my_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Updates after discussion

* TODO:
    * Clean this section and save comments for the report
    * Clear up doubts about the network. Give credit to blog posts whenever choices are not clear
    

* Cleaning the 5% least frequent symbols from the vocab was a disaster, since several steps can't handle unknown words. It's not impossible to fix this, but for the time being it's easier to work with the full corpus (~115k sentences instead of ~20k) and assume that this will make the odd symbols/words far less likely to appear. It's not an unfounded assumption (I think).

* When expanding the input_words set I won't include antonyms. I think synonyms, hypernyms and hyponyms are enough, and an incorrectly used antonym is riskier.

* Should we also leave out hyponyms?

* Top 3 seeds by frequency of appearance in the corpus: "is" >> "means" >> "can be found"

* If part i of the sentence is already covered (includes one of the input words), any part i+1 will take part i into account while generating. It doesn't sound exciting, but it is a smart feature.

* Now running on all trigrams



# Generating our descriptive sentences

## Importing everything we will need

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import random
import string
import torch
import torch.nn as nn
#is this the best choice of autograd?
from torch.autograd import Variable 
import math
import time
import gs_probdist as gspd
import semrel as sr
import gensim
import cardgen as cg #modified version that only returns the set of MW and TWs, not the nice drawing

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Getting our text input

In [None]:
#opening and reading the corpus
#we will be using the full version of the descriptive corpus we made ~115k sentences
f = open('description-corpus-115k.txt', 'r')
text = f.readlines() # List with sentences as elements
f.close()

# getting lower case and splitting it
sentences = [text[i].lower().split() for i in range(len(text))]

#getting the avg length of a sentence
lengths = [len(sent) for sent in sentences]
avg_sent_length = sum(lengths)/len(lengths) # ~27

## Cleaning (NOT USED)
* Removing stop words, punctuation symbols and lemmatizing

In [None]:
# found that in some tutorials they do 3 extra cleaning steps before applying N-grams

# getting rid of stop words
#stop_words = set(nltk.corpus.stopwords.words('english'))
#stop_free = ' '.join([word for word in text.split() if word not in stop_words])
# justification --> it's like getting rid of the long tail in the word frequency plot, right?

# getting rid of punctuation
#punctuation_symbols = set(string.punctuation)
#punct_free = "".join(word for word in stop_free if word not in punctuation_symbols)
# makes sense.. I think

# lemmatizing?
#lemma = nltk.stem.wordnet.WordNetLemmatizer()
#normalized = ' '.join(lemma.lemmatize(word) for word in punct_free.split())
#if it is what I think it is, then it makes sense too

#last step, lower case and splitting
#cleaned_text = normalized.lower().split()

In [None]:
text[:250]

In [None]:
# the three previews below only work if the (commented-out) cleaning cell above has been run
#stop_free[:250]

In [None]:
#punct_free[:250]

In [None]:
#normalized[:250]

## Only normalizing (NOT USED)
* Is it worth having different tokens for "cookie" and "cookies" (in terms of producing a correct sentence)?

## Implementing trigrams and setting up

In [None]:
#creating trigram sets EASY WAY
#trigrams = [([cleaned_text[i], cleaned_text[i+1]], cleaned_text[i+2])for i in range(len(cleaned_text) - 2)]

# better way: sentence by sentence, not the whole text in one go
# this structure allows us to create context/target sets for each word. 
trigrams = []
for sentence in sentences:
    trigrams += [([sentence[i], sentence[i+1]], sentence[i+2]) for i in range(len(sentence) - 2)]

trigrams[0]
#context for target word 'text' --> 'a' and 'cookie'

len(trigrams)

#using all trigrams led to kernel death every time
# we will randomly sample 50000 of them
trigrams = random.sample(trigrams, 50000)

# getting the set of words in the vocab, its length and an integer index for each word
# our vocab consists of the words appearing in the trigrams, so there is no need to take the vocab over the whole text
# if we are not using all trigrams
voc = set()
for tri in trigrams:
    voc = voc.union(set(np.union1d(np.array(tri[0]), np.asarray(tri[1]))))
voc_length = len(voc) #34174
word_to_freq = {word: i for i, word in enumerate(voc)}

voc_length

# we detected some errors in how our corpus is tokenized ("binoculars" is imported as "bi -no -cu -lars"), so we tried removing the 5% least frequent words in the dictionary.
# This affected the cell below. Instead of using this code to clean the possible outputs, we will assume that using the bigger version of the corpus
# minimizes the likelihood of finding these odd tokens during the generation step.


# freq values from dictionary to array
#freq_values = np.array([v for k, v in word_to_freq.items()])
#np.percentile(freq_values, 5) # ~1709

#getting rid of those unusual values
# creating list of keys to delete
#to_del = [k for k, v in word_to_freq.items() if v < 1709]

# deleting those elements from the word_to_freq dict
#for k in to_del: del word_to_freq[k]

#creating lists where we will store the input tensors
cont = []
tar = []
for context, target in trigrams:
    #creates a tensor with the vocabulary indices of both current context words
    context_freqs = torch.tensor([word_to_freq[word] for word in context], dtype = torch.long)
    #adds the tensor to cont
    cont.append(context_freqs)
    # does the same for the target and its index
    target_freq = torch.tensor([word_to_freq[target]], dtype = torch.long)
    tar.append(target_freq)

## Building the network 

In [None]:
#Checking if we have access to training on GPU
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Fancy setup!')
else: 
    print('Too bad, training on CPU; do not exaggerate with the number of epochs.')

my_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
class GRU(nn.Module):
    #init for input size, hidden size, output size and number of hidden layers.
    def __init__(self, input_s, hidden_s, output_s,n_layers = 1):
        super(GRU, self).__init__()
        self.input_s = input_s
        self.hidden_s = hidden_s
        self.output_s = output_s
        self.n_layers = n_layers
        # our encoder will be nn.Embedding
        # reminder: the encoder takes the input and outputs a feature tensor holding the information representing the input.
        self.encoder = nn.Embedding(input_s, hidden_s)
        #defining the GRU cell, still have to determine which parameters work best
        #the input size is 2*hidden_s because the embeddings of the two context words are concatenated
        self.gru = nn.GRU(2*hidden_s, hidden_s, n_layers, batch_first=True, bidirectional=False)
        # defining linear decoder
        self.decoder = nn.Linear(hidden_s, output_s)
    
    def forward(self, input, hidden):
        #making sure that the input is a row vector
        input = self.encoder(input.view(1, -1))
        output, hidden = self.gru(input.view(1, 1, -1), hidden)
        output = self.decoder(output.view(1,-1))
        return output, hidden
    
    def init_hidden(self):
        #plain tensors track gradients themselves; the Variable wrapper is no longer needed
        return torch.zeros(self.n_layers, 1, self.hidden_s)

In [None]:
def train(context, target):
    hidden = decoder.init_hidden()
    decoder.zero_grad()
    loss = 0
    
    # iterating over the trigrams that were passed in, not over the global list
    for t in range(len(context)):
        output, hidden = decoder(context[t], hidden)
        loss += criterion(output, target[t])
        
    loss.backward()
    decoder_optimizer.step()
    
    # average loss per trigram
    return loss.item() / len(context)
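
The `train` function accumulates the loss over every trigram before a single backward pass, so the whole computation graph stays in memory until then. This may be related to the kernel deaths mentioned earlier, although we did not verify it. One common alternative is to update the weights on smaller batches. The sketch below only shows the batching helper in plain Python; `minibatches` is a hypothetical name, and calling the training step once per batch is an assumption about how it could be used, not what the notebook does.

```python
import random


def minibatches(pairs, batch_size, seed=0):
    """Yield shuffled lists of (context, target) pairs with at most `batch_size` items."""
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [pairs[j] for j in order[start:start + batch_size]]


# ten toy trigrams: (["w0", "w1"], "w2"), (["w1", "w2"], "w3"), ...
demo_pairs = [(["w%d" % k, "w%d" % (k + 1)], "w%d" % (k + 2)) for k in range(10)]
batches = list(minibatches(demo_pairs, batch_size=4))
print([len(b) for b in batches])
```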

In [None]:
def time_since(since):
    s = time.time() - since
    m = math.floor(s/60)
    s -= m*60
    return '%dm %ds' % (m, s)

In [None]:
n_epochs = 100
print_every = 10
plot_every = 10
hidden_s = 150
n_layers = 1
lr = 0.015

decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

start = time.time()
all_losses = []
loss_avg = 0
for epoch in range(1, n_epochs + 1):
    loss = train(cont,tar)       
    loss_avg += loss

    if epoch % print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / n_epochs * 100, loss))
#         print(evaluate('ge', 200), '\n')

    if epoch % plot_every == 0:
        all_losses.append(loss_avg / plot_every)
        loss_avg = 0

In [None]:
import os
# saving the model
# instructions found here: https://pytorch.org/tutorials/beginner/saving_loading_models.html

#saving model for inference --> save state_dict
path1 = os.getcwd()+'/test5_trained_inference.pt'

torch.save(decoder.state_dict(),path1)

#to load
# decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
# decoder.load_state_dict(torch.load(path1))
# decoder.eval()

In [None]:
#saving entire model
path2 = os.getcwd()+'/test5_trained1_entire.pt'
# saving to path2 so that the state_dict file at path1 is not overwritten
torch.save(decoder, path2)

#loading
# Model class must be defined somewhere
#decoder = torch.load(path2)
#decoder.eval()

In [None]:
import matplotlib.ticker as ticker
%matplotlib inline

with open('test5_losses.csv', 'w') as f:
    f.write(str(all_losses)[1:-1])
plt.figure()
plt.plot(all_losses)
plt.savefig('test_5.png')

In [None]:
def evaluate(prime_str='this process', predict_len=100, temperature=0.8):
    hidden = decoder.init_hidden()
    # inverse lookup (index -> word), built once instead of at every step
    ix_to_word = {ix: word for word, ix in word_to_freq.items()}

    for p in range(predict_len):
        
        prime_input = torch.tensor([word_to_freq[w] for w in prime_str.split()], dtype=torch.long)
        cont = prime_input[-2:] #last two words as input
        output, hidden = decoder(cont, hidden)
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = torch.multinomial(output_dist, 1)[0]
        
        # Add predicted word to string and use as next input
        predicted_word = ix_to_word[top_i.item()]
        prime_str += " " + predicted_word

    return prime_str

In [None]:
print(evaluate('the main', 10, temperature = 1))
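
The `temperature` argument controls how sharply the sampling follows the network's scores: dividing the scores by a value below 1 before exponentiating makes the most likely word even more likely, while a value above 1 flattens the distribution. A plain-Python sketch of the same idea follows (the helper names are hypothetical, and subtracting the maximum is a numerical-stability step that `evaluate` does not perform):

```python
import math
import random


def probabilities(logits, temperature):
    """Normalized exp(logit / temperature) for each entry."""
    scaled = [x / temperature for x in logits]
    # subtracting the maximum keeps exp() from overflowing without changing the result
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    probs = probabilities(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
p_one = probabilities(logits, 1.0)
p_half = probabilities(logits, 0.5)
print(p_one)
print(p_half)  # more mass on the first entry than with temperature 1.0
idx = sample_with_temperature(logits, 0.8, random.Random(0))
```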

## Generating descriptive sentence
* input = main word + taboo words

The main idea is that we always seed with the main word and replace it at the end.
The descriptive sentence is accepted into the cleaning step (where we make sure that neither the TWs nor the MW were used and, if they were, replace them) if it contains at least two of the input words besides the seeds. As long as the sentence is not accepted, we keep generating.

To help reach a high score: once an input word has been used and its score_vector value increases, we leave that segment in our sentence. We are not sure whether this hurts more than it helps, but adding the input word as a seed is very complicated, and we are not sure it would make sense either.

For a first prototype we assume that at most one new word is added per iteration.

This is still fairly fragile and rests on many assumptions.

It would be useful to generate a list of synonyms of all input words to expand the input_words set and reach a higher acceptance rate. We thought about cleaning the input words set from the start, but those are words with a high probability of occurrence, so it is better to leave them in and clean at the end.

* Maybe more seeds?
* Maybe require a higher score?
* How to properly connect the ends of segments and the seeds? Is it better to have fewer seeds but longer auto-generated text?
* Add timing (timeit) to report results and "quantify" results/improvements.
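
The acceptance criterion described above (at least two input words besides the seeds) can be sketched as a small scoring function. The names `count_hits` and `accepted` are hypothetical, and the example sentence is made up for illustration.

```python
def count_hits(sentence, input_words, main_word, n_seeds):
    """Count input words in the sentence, ignoring main-word occurrences that come from the seeds."""
    tokens = sentence.split()
    hits = sum(tokens.count(word) for word in set(input_words))
    # every seed contributes one occurrence of the main word for free
    if main_word in input_words:
        hits -= min(n_seeds, tokens.count(main_word))
    return hits


def accepted(sentence, input_words, main_word, n_seeds, required=2):
    return count_hits(sentence, input_words, main_word, n_seeds) >= required


sentence = "cake means a sweet dessert cake is baked with frosting"
words = ["cake", "sweet", "frosting", "birthday"]
print(count_hits(sentence, words, "cake", n_seeds=2))
print(accepted(sentence, words, "cake", n_seeds=2))
```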

### Description generator

In [None]:
def gen_input_words(mw, model):
    #mw = main word
    #model = embeddings used to generate the cards
    
    #generating the corresponding taboo card
    card_words = cg.card_generator(mw, cg.get_gold_probdist(), model)
    #set of words that we hope will appear in the description
    input_words = card_words[mw] + [mw]

    # extending the input_words set using semantic relations. Bigger set --> better chances of generating an approved word!
    # we will use the make_semrel_dict function to get synonyms, hyponyms and hypernyms of the MW.
    # we also considered adding semrel words from the TWs, but they lose their connection to the MW very fast
    # we will leave out antonyms as they are "riskier" to use in a description.

    adds = []
    temp = sr.make_semrel_dict(mw)
    for k in temp.keys():
        if k != 'semrel_antonym':
            new = list(temp[k])
            adds += new
    adds = np.unique(adds)
    adds = [x.lower() for x in adds]
    input_words = np.unique(input_words + adds)

    # filtering out the input words that are not in our vocab. Shouldn't be a thing when using larger corpus
    input_words = [word for word in input_words if word in voc]    
    return input_words

def description_generator(mw, model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    #mw = main word
    #model = embeddings used to generate the cards
    #n_seeds = if we are using 2 or 3 seeds during the sentence generation step
    #n_iterations = how many iterations we will do in the generation step
    #debugging = True if we want to print some statistics about the process. False if we only want the last 5 generated sentences.
    #printing = True will print something, based on debugging. If false, it will only return the final sentence
    
    #generating the input_words we are aiming to include in our description
    input_words = gen_input_words(mw, model)    
    #on average a descriptive sentence had 27 words/symbols.
    # we will equally divide them between our seeds
    
    
    # iterate until nice sentence comes up
    # we will add safety measure to not break everything
    i = 0
    index_in_sentence = -1
    
    
    #if we are using 3 seeds
    #the 3 most frequent ones in our corpus were "x is", 'x means' and "x can be found"
    if n_seeds == 3:
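        #create the first sentence
        #plain list instead of np.array: a numpy string array has a fixed width and would silently truncate longer segments assigned later
        sentence_parts = [evaluate(mw+' means', 7, temperature = 1), evaluate(mw+' is', 7, temperature = 1), evaluate(mw+' can be found', 5, temperature = 1)]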
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 3 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 3 
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that we already "like"
        covered = np.array([0,0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
        positions[9] = 1
        positions[18] = 1
        
        while i < n_iterations:
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 3 
            changes = new_score_vector - score_vector

            if (changes > 0).any(): #there was a change. Assuming there is max 1 change per iteration from now on
                index = np.where(changes > 0)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index] #if we stop assuming that, here we have to keep track of location and magnitude of changes
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                #keeping the segment in which the improvement took place
                if index_in_sentence in range(9) and covered[0] != 1:
                    sentence_parts[1] = evaluate(mw+' is', 7, temperature = 1)
                    sentence_parts[2] = evaluate(mw+' can be found', 5, temperature = 1)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(9, 18) and covered[1] != 1:
                    sentence_parts[0] = evaluate(mw+' means', 7, temperature = 1)
                    sentence_parts[2] = evaluate(mw+' can be found', 5, temperature = 1)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                elif index_in_sentence in range(18, 27) and covered[2] != 1:
                    sentence_parts[1] = evaluate(mw+' is', 7, temperature = 1)
                    sentence_parts[0] = evaluate(mw+' means', 7, temperature = 1)
                    sentence = ' '.join(sentence_parts)
                    covered[2] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = evaluate(mw+' means', 7, temperature = 1) +' '
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  evaluate(sentence_parts[0]+' '+ mw+' is', 7, temperature = 1) +' '
                        #taking off the first part from it
                        temp = temp.split()
                        sentence_parts[1] = " ".join(temp[9:])   
                    else:
                        sentence_parts[1] = evaluate(mw+' is', 7, temperature = 1) +' '
                # same logic for the third part.
                if covered[2] == 0:
                    if covered[1] == 0:
                        sentence_parts[2] = evaluate(mw+' can be found', 5, temperature = 1)
                    else:
                        temp =  evaluate(sentence_parts[1]+' '+ mw+' can be found', 5, temperature = 1) +' '
                        #taking off the second part from it
                        temp = temp.split()
                        sentence_parts[2] = " ".join(temp[9:])
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print(changes)
                    print(covered)
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print(sentence)
            scores[i] = score
            i +=1
            
    #if we are using 2 seeds
    #the 2 most frequent ones in our corpus were "x is" and 'x means'
    if n_seeds == 2:
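        #create the first sentence
        #plain list instead of np.array: a numpy string array has a fixed width and would silently truncate longer segments assigned later
        sentence_parts = [evaluate(mw+' means', 11, temperature = 1), evaluate(mw+' is', 12, temperature = 1)]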
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 2 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 2 
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that we already "like"
        covered = np.array([0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
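        #the first segment has 13 words ('mw means' + 11 generated ones), so the second seed starts at index 13
        positions[13] = 1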
        
        while i < n_iterations:
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 2 
            changes = new_score_vector - score_vector

            if (changes > 0).any(): #there was a change. Assuming there is max 1 change per iteration from now on
                index = np.where(changes > 0)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index] #if we stop assuming that, here we have to keep track of location and magnitude of changes
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                #keeping the segment in which the improvement took place
                if index_in_sentence in range(13):
                    sentence_parts[1] = evaluate(mw+' is', 12, temperature = 1)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(13, 27):
                    sentence_parts[0] = evaluate(mw+' means', 11, temperature = 1)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = evaluate(mw+' means', 11, temperature = 1) +' '
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  evaluate(sentence_parts[0]+' '+ mw+' is', 12, temperature = 1) +' '
                        #taking off the first part from it
                        temp = temp.split()
                        sentence_parts[1] = " ".join(temp[13:])   
                    else:
                        sentence_parts[1] = evaluate(mw+' is', 12, temperature = 1) +' '
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
            
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print(changes)
                    print(covered)
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print(sentence)
            scores[i] = score
            i +=1
    return sentence

In [None]:
#description_generator('cake', model, n_seeds = 3, n_iterations = 2, debugging = False, printing = True)
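
The hard-coded ranges used to decide which segment a word belongs to follow directly from the segment lengths (9, 9, 9 for three seeds; 13, 14 for two). A small helper could derive them instead; `segment_of` is a hypothetical name and this is only a sketch of the lookup, not a drop-in replacement for the loop above.

```python
from bisect import bisect_right
from itertools import accumulate


def segment_of(word_index, segment_lengths):
    """Return the index of the segment that contains position `word_index`."""
    # cumulative end positions, e.g. [9, 18, 27] for three segments of nine words
    ends = list(accumulate(segment_lengths))
    if not 0 <= word_index < ends[-1]:
        raise IndexError("word_index lies outside the sentence")
    return bisect_right(ends, word_index)


print(segment_of(0, [9, 9, 9]))
print(segment_of(9, [9, 9, 9]))
print(segment_of(13, [13, 14]))
```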

### Cleaning the generated sentence


In [None]:
def sentence_cleaner(sentence, mw, model):
    #replacing MW with "The main word" and TWs appearing in the sentence with one of their synonyms
    #we work token by token so that longer words containing the MW or a TW stay untouched
    taboo_words = cg.card_generator(mw, cg.get_gold_probdist(), model)[mw]

    cleaned = []
    for token in sentence.split():
        if token == mw:
            cleaned.append('The main word')
        elif token in taboo_words:
            #getting synonyms of the detected tw; the token is kept if none are available
            syns = list(sr.get_synonyms(token))
            cleaned.append(np.random.choice(syns) if len(syns) > 0 else token)
        else:
            cleaned.append(token)
    return ' '.join(cleaned)
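
Plain `str.replace` works on substrings, so replacing a short main word can also alter longer words that contain it. The sketch below contrasts it with a token-wise replacement; `replace_tokens` is a hypothetical helper and the example sentence is made up.

```python
def replace_tokens(sentence, replacements):
    """Replace whole tokens only; `replacements` maps a token to its substitute."""
    return " ".join(replacements.get(tok, tok) for tok in sentence.split())


sentence = "cake is not a pancake"
naive = sentence.replace("cake", "the main word")
token_wise = replace_tokens(sentence, {"cake": "the main word"})
print(naive)       # "pancake" is damaged as well
print(token_wise)  # only the standalone token is replaced
```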

### Final output

In [None]:
def final_output(mw, model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    sentence = description_generator(mw, model, n_seeds, n_iterations, debugging, printing)
    output = sentence_cleaner(sentence, mw, model)
    return output

In [None]:
final_output(mw = 'cake', model = model, n_seeds=2, n_iterations = 100, debugging = True, printing = True)