# Demo version of the descriptive sentence generator

In this notebook, we provide code for using the final version of the description generator we implemented. 
This notebook is structured so that the reader can get a feel for how this second part of the project works, and also see its limitations. 
For a detailed explanation of the strategy we followed and how each individual function works, please see `text-generation/Walkthrough.ipynb`.

If all necessary packages are installed, running the first cell will take care of setting up the generator. Note that it might take up to a couple of minutes to run. Once this step is finished, please select which model you would like to use and run the environment loader. Loading model 3 can also take some time, depending on the hardware running the notebook. Once these two steps are done you are good to go! Examples can be found in the last section of this notebook. 

## Setting up

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import random
import string
import torch
import torch.nn as nn
from torch.autograd import Variable
import math
import os
import pickle
import time
import gs_probdist as gspd
import semrel as sr
import gensim
import cardgen as cg

card_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

#opening and reading the corpus
#we will be using the full version of the descriptive corpus we made ~115k sentences
with open('description-corpus-115k.txt', 'r', encoding='utf-8') as f:
    text = f.readlines() # List with sentences as elements

# getting lower case and splitting it
sentences = [line.lower().split() for line in text]

#getting the avg length of a sentence
lengths = [len(sent) for sent in sentences]
avg_sent_length = sum(lengths)/len(lengths) # ~27

class GRU(nn.Module):
    #init for input size, hidden size, output size and number of hidden layers.
    def __init__(self, input_s, hidden_s, output_s,n_layers = 1):
        super(GRU, self).__init__()
        self.input_s = input_s
        self.hidden_s = hidden_s
        self.output_s = output_s
        self.n_layers = n_layers
        # our encoder will be nn.Embedding
        # reminder: the encoder takes the input and outputs a feature tensor holding the information representing the input.
        self.encoder = nn.Embedding(input_s, hidden_s)
        #defining the GRU cell, still have to determine which parameters work best
        self.gru = nn.GRU(2*hidden_s, hidden_s, n_layers, batch_first=True, bidirectional=False)
        # defining linear decoder
        self.decoder = nn.Linear(hidden_s, output_s)

    def forward(self, input, hidden):
        #making sure that the input is a row vector
        input = self.encoder(input.view(1, -1))
        output, hidden = self.gru(input.view(1, 1, -1), hidden)
        output = self.decoder(output.view(1,-1))
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(self.n_layers, 1, self.hidden_s))


def next_token_generator(seed, generation_length=100):
    """
    Given a seed and a length, it returns a sequence of generated tokens. 
    It generates the tokens one by one, concatenating the new tokens to the seed and taking the seed's last two tokens as context for the next generation.

    Arg:
        seed: a string of minimal length = 2 which will serve as context for the first generation step.
        generation_length: integer value representing the number of tokens we desire to generate.
    Returns:
        A string consisting of the concatenation of the seed and all generated tokens.
    """
    hidden = decoder.init_hidden()

    for p in range(generation_length):
        
        prime_input = torch.tensor([word_to_freq[w] for w in seed.split()], dtype=torch.long)
        cont = prime_input[-2:] #last two words as input
        output, hidden = decoder(cont, hidden)
        
        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).exp()
        top_choice = torch.multinomial(output_dist, 1)[0].item() # plain int, so it can be looked up in word_to_freq
        
        # Add predicted word to string; the extended seed provides the context for the next step
        predicted_word = list(word_to_freq.keys())[list(word_to_freq.values()).index(top_choice)]
        seed += " " + predicted_word

    return seed

def gen_input_words(mw, model):
    """
    Given a main word and a gensim model, it generates a set of input words that we are aiming to have in our descriptive sentence.
    Arg:
        mw: a string containing the main word that we are aiming to describe
        model: gensim model from which we are retrieving words semantically related to the mw 
    Returns:
        A set of strings consisting of the mw, its synonyms, hypernyms and hyponyms, and all its associated taboo words. All of them contained in the vocabulary.
    """
    #mw = main word
    #model = embeddings used to generate the cards

    #generating the corresponding taboo card
    card_words = cg.card_generator(mw, cg.get_gold_probdist(), model)
    #set of words that we hope will appear in the description
    input_words = card_words[mw] + [mw]

    # extending the input_words set using semantic relations. Bigger set --> better chances of generating an approved word!
    # we will use the make_semrel_dict function to get synonyms, hyponyms and hypernyms of the MW.
    # we also considered adding semrel words from the TWs, but they lose the connection to the MW very quickly.
    # we will leave out antonyms as they are "riskier" to use in a description.

    adds = []
    temp = sr.make_semrel_dict(mw)
    for k in temp.keys():
        if k != 'semrel_antonym':
            new = list(temp[k])
            adds += new
    adds = np.unique(adds)
    adds = [x.lower() for x in adds]
    input_words = np.unique(input_words + adds)

    # filtering out the input words that are not in our vocab. Shouldn't be a thing when using larger corpus
    input_words = [word for word in input_words if word in voc]
    return input_words

def description_generator(mw, model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    """
    Given the main word we are trying to describe, it will iteratively create sequences aiming to contain some of the input words generated by gen_input_words(mw, model).
    If the desired score was not reached it will return the last generated sequence, after doing n_iterations of the process.
    Arg:
        mw = main word as string
        model = embeddings used to generate the cards
        n_seeds = int, if we are using 2 or 3 seeds during the sentence generation step
        n_iterations = int, how many sequences we will test
        debugging = Boolean, True if we want to print some statistics about the process. False if we only want the last 5 generated sentences.
        printing = Boolean, True will print something, based on debugging. If false, it will only return the final sentence
    Returns:
        A string containing the last generated sequence.
    """
    #mw = main word
    #model = embeddings used to generate the cards
    #n_seeds = if we are using 2 or 3 seeds during the sentence generation step
    #n_iterations = how many iterations we will do in the generation step
    #debugging = True if we want to print some statistics about the process. False if we only want the last 5 generated sentences.
    #printing = True will print something, based on debugging. If false, it will only return the final sentence
    
    #generating the input_words we are aiming to include in our description
    input_words = gen_input_words(mw, model)    
    #on average a descriptive sentence had 27 words/symbols.
    # we will equally divide them between our seeds
    
    
    # iterate until nice sentence comes up
    # we will add safety measure to not break everything
    i = 0
    index_in_sentence = -1
    
    
    #if we are using 3 seeds
    #the 3 most frequent ones in our corpus were "x is", 'x means' and "x can be found"
    if n_seeds == 3:
        #create the first sentence, dividing the whole sequence into equally long sub_sequences
        #a plain list is used because a NumPy string array would silently truncate longer segments assigned later
        sentence_parts = [next_token_generator(mw+' means', 7), next_token_generator(mw+' is', 7), next_token_generator(mw+' can be found', 5)]
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 3 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 3 
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that already contains an input word.
        covered = np.array([0,0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence to know where input words are located and to which sub_sequence they belong.
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
        positions[9] = 1
        positions[18] = 1
        
        #for practical purposes we stop generating after some fixed number of iterations in case the score was not reached.
        while i < n_iterations and score <2 :
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 3 
            changes = new_score_vector - score_vector

            if True in (changes>0): #there was a change in the score. Assuming there is max 1 change per iteration from now on
                index = np.where(changes > 0)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index]
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                    
                #keeping the segment in which the improvement took place, blocking it and continue the generating process
                if index_in_sentence in range(9) and covered[0]!=1:
                    sentence_parts[1] = next_token_generator(mw+' is', 7)
                    sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(9, 18) and covered[1] !=1:
                    sentence_parts[0] = next_token_generator(mw+' means', 7)
                    sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                elif index_in_sentence in range(18, 27) and covered[2] != 1:
                    sentence_parts[1] = next_token_generator(mw+' is', 7)
                    sentence_parts[0] = next_token_generator(mw+' means', 7)
                    sentence = ' '.join(sentence_parts)
                    covered[2] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = next_token_generator(mw+' means', 7) +' '
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  next_token_generator(sentence_parts[0]+' '+ mw+' is', 7) 
                        #taking off the first part from it
                        temp = temp.split()
                        sentence_parts[1] = " ".join(temp[9:])   
                    else:
                        sentence_parts[1] = next_token_generator(mw+' is', 7) 
                # same logic for the third part.
                if covered[2] == 0:
                    if covered[1] == 0:
                        sentence_parts[2] = next_token_generator(mw+' can be found', 5)
                    else:
                        temp =  next_token_generator(sentence_parts[1]+' '+ mw+' can be found', 5) 
                        #taking off the second part from it
                        temp = temp.split()
                        sentence_parts[2] = " ".join(temp[9:])
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
            
            #choosing what to print
            if i == 0 and printing:
                print('The set of input words we are trying to introduce into our sequence is: '+str(input_words))
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print("Changes vector: ")
                        print(changes)
                    print("Covered vector: ")
                    print(covered)
                    print("Positions vector: ")
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print("Sentence number: " + str(i+1))
                        print(sentence)
                        if i == n_iterations-1:
                            print('The final sentence got a score of: '+str(score))
            scores[i] = score
            i +=1
            
    #if we are using 2 seeds
    #the 2 most frequent ones in our corpus were "x is" and 'x means'
    if n_seeds == 2:
        #create the first sentence
        #a plain list is used because a NumPy string array would silently truncate longer segments assigned later
        sentence_parts = [next_token_generator(mw+' means', 11), next_token_generator(mw+' is', 12)]
        sentence =  " ".join(sentence_parts)
        eval_sentence = sentence.split()   
    
        # to keep track of scores
        scores = np.zeros(n_iterations)
        #first score vector and score
        #and accounting for the 2 times the MW appears already in the seeds
        score_vector = np.array([eval_sentence.count(word) for word in input_words])
        score_vector[input_words.index(mw)] -= 2
        score = np.sum(score_vector)  

        # the covered vector will take care that we don't replace a segment that we already "like"
        covered = np.array([0,0])
        changes = np.zeros(len(score_vector))

        #known positions of input words in our sentence
        positions = np.zeros(len(eval_sentence))

        #we know the positions of the seeds
        positions[0] = 1
        positions[13] = 1 #the first part holds 13 tokens (2 seed tokens + 11 generated)
        
        while i < n_iterations and score <2:
            #aware that with this flow we are doing one iteration after reaching the desired score, but it's no big deal because score is designed to only go up.

            #checking if score improved
            new_score_vector = np.array([eval_sentence.count(word) for word in input_words])
            new_score_vector[input_words.index(mw)] -= 2
            changes = new_score_vector - score_vector

            if True in (changes>0): #there was a change. Assuming there is max 1 change per iteration from now on
                index = np.where(changes > 0)[0][0] #looking for the position in which an input_word was added
                word_that_was_added = input_words[index] #if we stop assuming that, here we have to keep track of location and magnitude of changes
                
                #finding in which segment that new added word is in order to leave the segment untouched

                #this detects the index of the word that just came up in case that word was already in our sentence
                indices_in_sentence = np.where(np.array(eval_sentence) == word_that_was_added)[0]
                if len(indices_in_sentence) >1: #word appears at least twice
                    for d in indices_in_sentence:
                        if positions[d] != 1:
                            index_in_sentence = d
                            positions[d] = 1
                else:
                    index_in_sentence = indices_in_sentence[0]
                    positions[index_in_sentence] = 1
                #keeping the segment in which the improvement took place
                if index_in_sentence in range(13):
                    sentence_parts[1] = next_token_generator(mw+' is', 12)
                    sentence = ' '.join(sentence_parts)
                    covered[0] = 1
                elif index_in_sentence in range(13, 27):
                    sentence_parts[0] = next_token_generator(mw+' means', 11)
                    sentence = ' '.join(sentence_parts)
                    covered[1] = 1
                eval_sentence = sentence.split()
                changes = np.zeros(len(score_vector))
                index_in_sentence = 0
                score_vector = new_score_vector
                score = np.sum(score_vector)

            #if there was no change
            else: #based on what is already covered
                if covered[0] ==0:
                    sentence_parts[0] = next_token_generator(mw+' means', 11) 
                #if the first part is already covered we can add it as input to generate the second
                if covered[1] ==0:
                    if covered[0]==1:
                        temp =  next_token_generator(sentence_parts[0]+' '+ mw+' is', 12)
                        #taking off the first part from it
                        temp = temp.split()
                        #the first part holds 13 tokens (2 seed tokens + 11 generated)
                        sentence_parts[1] = " ".join(temp[13:])   
                    else:
                        sentence_parts[1] = next_token_generator(mw+' is', 12)
                sentence = ' '.join(sentence_parts)
                eval_sentence = sentence.split()
                score_vector = new_score_vector
                score = np.sum(score_vector)
                
            if i == 0 and printing:
                print('The set of input words we are trying to introduce into our sequence is: '+str(input_words))
            if printing == True:
                if debugging ==True:
                    print("Sentence number: " + str(i+1))
                    print(sentence)
                    if True in (changes>0):
                        print("Changes vector: ")
                        print(changes)
                    print("Covered vector: ")
                    print(covered)
                    print("Positions vector: ")
                    print(positions)
                else:
                    if i in range(n_iterations-5, n_iterations):
                        print("Sentence number: " + str(i+1))
                        print(sentence)
                        if i == n_iterations-1:
                            print('The final sentence got a score of: '+str(score))
            scores[i] = score
            i +=1
    return sentence


def sentence_cleaner(sentence, mw, model):
    '''
    Makes sure that the generated sequence given by description_generator() follows taboo rules and does not contain the main word or any taboo word.
    Arg: 
        sentence: string containing the sequence to clean
        mw: string containing the main word we are playing with 
        model: gensim embeddings model we are retrieving semantic relations from
    Returns:
        A string containing a version of the descriptive sentence that follows taboo rules.
    '''
    #replacing MW with "the main word" and TWs appearing in the sentence with one of their synonyms
    sentence = sentence.replace(mw, '.The main word')
    
    #replacing any TWs appearing in our sentence with some allowed synonym
    taboo_words = cg.card_generator(mw, cg.get_gold_probdist(), model)[mw]

    spl = np.array(sentence.split())
    for tw in taboo_words:
        if tw in spl:
            #words that are not allowed as replacements: all taboo words and the main word
            forbidden = set(taboo_words) | {mw}
            #getting the synonyms of the detected tw that we are allowed to use
            syns = [s for s in sr.get_synonyms(tw) if s not in forbidden]
            if len(syns) > 0:
                #choose one randomly
                sentence = sentence.replace(tw, np.random.choice(syns))
            #in case no allowed synonyms were found, fall back on hypernyms
            else:
                hypers = list(sr.get_hypernyms(tw))
                allowed_hypers = [h for h in hypers if h not in forbidden]
                if len(allowed_hypers) > 0:
                    #replacing in order to point the reader to think of this word as a hypernym 
                    sentence = sentence.replace(tw, 'Is a type of ' + np.random.choice(allowed_hypers))
                elif len(hypers) > 0:
                    #if all hypernyms were taboo words or the mw
                    sentence = sentence.replace(tw, "ERROR, NO IDEA!")  #panicking as a real player would.
                else: 
                    sentence = sentence.replace(tw, "NO IDEA!")
    sentence = sentence[1:]
    return sentence

def final_output(mw, card_model, n_seeds = 3, n_iterations = 10, debugging = False, printing = False):
    sentence = description_generator(mw, card_model, n_seeds, n_iterations, debugging, printing)
    output = sentence_cleaner(sentence, mw, card_model)
    return output

def load_model(x):
    if x ==1:
        with open("trigrams_model1.txt", "rb") as fp:
            trigrams = pickle.load(fp)
            
        voc = set()
        for tri in trigrams:
            voc = voc.union(set(np.union1d(np.array(tri[0]), np.asarray(tri[1]))))
        voc_length = len(voc) 
        word_to_freq = {word: i for i, word in enumerate(voc)}
            
        cont = []
        tar = []
        for context, target in trigrams:
            context_freqs = torch.tensor([word_to_freq[word] for word in context], dtype = torch.long)
            cont.append(context_freqs)
            target_freq = torch.tensor([word_to_freq[target]], dtype = torch.long)
            tar.append(target_freq)
        path = os.getcwd()+'/model1_trained.pt'
        hidden_s = 150
        n_layers = 1
        lr = 0.015
        decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
        decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        decoder = torch.load(path)
        decoder.eval()
    elif x ==3:
        with open("trigrams_model3.txt", "rb") as fp:
            trigrams = pickle.load(fp)
            
        voc = set()
        for tri in trigrams:
            voc = voc.union(set(np.union1d(np.array(tri[0]), np.asarray(tri[1]))))
        voc_length = len(voc) 
        word_to_freq = {word: i for i, word in enumerate(voc)}
            
        cont = []
        tar = []
        for context, target in trigrams:
            context_freqs = torch.tensor([word_to_freq[word] for word in context], dtype = torch.long)
            cont.append(context_freqs)
            target_freq = torch.tensor([word_to_freq[target]], dtype = torch.long)
            tar.append(target_freq)
            
        path = os.getcwd()+'/model3_trained.pt'
        
        hidden_s = 50
        n_layers = 3
        lr = 0.015
        decoder = GRU(voc_length, hidden_s, voc_length, n_layers)
        decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        decoder.load_state_dict(torch.load(path))
        decoder.eval()
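    elif x == 2:
        #raising instead of printing: nothing was loaded, so there is nothing to return
        raise ValueError('It is not safe to use this model. Please choose either model 1 or 3.')
    else:
        raise ValueError('Please enter 1 or 3 to choose the model to be used.')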
        
    return voc, voc_length, word_to_freq, decoder

## Loading trained model
Choose which trained model to load. They are all GRU-based RNN models, trained on a CPU for 100 epochs.
* 1:
    * Model with 1 hidden layer consisting of 150 nodes. Trained with a sample of 50,000 non-filtered trigrams containing ~16k tokens from our corpus' vocabulary (which has a total of ~80k tokens, from which more than half only appeared once). Set of trigrams properly stored. Trained in about 12 hours.
* 2:  
    * Model with 2 hidden layers consisting of 75 nodes each. Also trained with a sample of 50,000 non-filtered trigrams containing ~16k tokens from our corpus' vocabulary. Unfortunately, we forgot to include a random seed for the sampling process and we did not save the corresponding set of trigrams. Although the generation step might work, it is not advised to use this model. Trained in about 5 hours.
* 3:
    * Model with 3 hidden layers consisting of 50 nodes each. Trained with a sample of ~86k filtered trigrams containing ~19k tokens from our corpus' vocabulary. ("Filtered" means that the model was only trained on trigrams containing tokens that appear at least twice in our corpus.) Although a random seed (163) was now included, for efficiency reasons we also decided to save the trigrams in order to load them faster and make reproducibility easier. Trained in about 11 hours.
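
The filtering and seeded sampling described for model 3 can be sketched roughly as follows. This is a minimal illustration on a toy corpus, not the code used for training: the helper names `make_trigrams` and `filter_trigrams` are hypothetical, and we assume that a trigram is kept only if all three of its tokens appear at least twice.

```python
import random
from collections import Counter

def make_trigrams(sentences):
    """Return ([w1, w2], w3) pairs for every window of three tokens."""
    return [
        ([sent[i], sent[i + 1]], sent[i + 2])
        for sent in sentences
        for i in range(len(sent) - 2)
    ]

def filter_trigrams(trigrams, counts, min_count=2):
    """Keep trigrams whose three tokens all occur at least min_count times."""
    return [
        (context, target)
        for context, target in trigrams
        if all(counts[w] >= min_count for w in context + [target])
    ]

# Toy corpus standing in for the real description corpus (hypothetical data).
toy_sentences = [
    "the cat is a pet".split(),
    "the dog is a pet".split(),
    "the cat is a friend".split(),
]
counts = Counter(w for sent in toy_sentences for w in sent)

trigrams = make_trigrams(toy_sentences)
filtered = filter_trigrams(trigrams, counts)

# A fixed seed makes the sample reproducible across runs.
rng = random.Random(163)
sample = rng.sample(filtered, k=2)
```

Saving the sampled trigrams to disk, as was done for models 1 and 3, remains useful even with a fixed seed, because the vocabulary and its index mapping are rebuilt from that exact set when the model is loaded.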

In [None]:
#To load model 1:
voc, voc_length, word_to_freq, decoder = load_model(1)

#To load model 3:
#voc, voc_length, word_to_freq, decoder = load_model(3)

## Some examples
Below we encourage you to play around with some examples to get a feel for how to use the generator and then move on to your own examples.
Note that some of the printed information might seem opaque, but detailed explanations can be found in `text-generation/Walkthrough.ipynb`.
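
As a rough guide to reading the debugging output, the sketch below recomputes a score and a covered vector for a hand-written three-seed sentence. It is a simplified stand-in for what `description_generator` tracks, under the assumption that every segment holds 9 tokens; the helpers `score_sentence` and `segment_of` and the example sentence are hypothetical.

```python
def score_sentence(sentence, input_words, main_word, n_seeds):
    """Count input-word hits, ignoring the main word's appearance in each seed."""
    tokens = sentence.split()
    score_vector = [tokens.count(word) for word in input_words]
    score_vector[input_words.index(main_word)] -= n_seeds
    return score_vector, sum(score_vector)

def segment_of(token_index, segment_length=9):
    """Map a token position to the index of the segment that contains it."""
    return token_index // segment_length

# Hypothetical sentence: three segments of 9 tokens, each starting with a seed.
sentence = (
    "school means the of and to in for on "
    "school is a place with a teacher and books "
    "school can be found near the old town hall"
)
input_words = ["school", "teacher", "classroom"]

score_vector, score = score_sentence(sentence, input_words, "school", n_seeds=3)

# Mark every segment that already contains an input word as covered.
tokens = sentence.split()
covered = [0, 0, 0]
for word, hits in zip(input_words, score_vector):
    if hits > 0:
        covered[segment_of(tokens.index(word))] = 1
```

Here only the second segment contains an input word, so it would be kept while the other two segments are regenerated in the next iteration.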

You will notice that the sentences are, unfortunately, not syntactically coherent. 
We believe this is due to our limited computational power in training the networks, which restricted how many hidden layers and what size of vocabulary our neural networks could have.
However, even if the result was not what we had hoped for, experimenting with different model parameters within our computational limits still gave us a chance to explore how neural networks function.
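
To give a sense of why vocabulary size weighs so heavily on training cost, the sketch below estimates the number of trainable parameters for an embedding + GRU + linear decoder architecture like the one defined above. It assumes the standard PyTorch `nn.GRU` parameter layout (three gates, each with input weights, hidden weights and two bias vectors per layer) and uses the approximate vocabulary sizes quoted for models 1 and 3; the function name is hypothetical.

```python
def gru_model_parameters(vocab_size, hidden_size, n_layers, context_tokens=2):
    """Trainable parameters of embedding + stacked GRU + linear decoder."""
    embedding = vocab_size * hidden_size
    gru = 0
    for layer in range(n_layers):
        # The first layer sees the concatenated context embeddings.
        input_size = context_tokens * hidden_size if layer == 0 else hidden_size
        # Three gates, each with input weights, hidden weights and two biases.
        gru += 3 * hidden_size * (input_size + hidden_size + 2)
    decoder = hidden_size * vocab_size + vocab_size
    return {"embedding": embedding, "gru": gru, "decoder": decoder}

# Approximate vocabulary sizes taken from the model descriptions (~16k and ~19k).
model_1 = gru_model_parameters(vocab_size=16_000, hidden_size=150, n_layers=1)
model_3 = gru_model_parameters(vocab_size=19_000, hidden_size=50, n_layers=3)

for name, parts in (("model 1", model_1), ("model 3", model_3)):
    print(name, parts, "total:", sum(parts.values()))
```

Under these assumptions the embedding and decoder layers, whose size grows with the vocabulary, account for most of the parameters, while the recurrent layers themselves are comparatively small.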

### Example with 'school' as main word, 3 seeds, debugging mode on to show covered and position vectors

In [None]:
final_output(mw = 'school', card_model = card_model, n_seeds=3, n_iterations = 5, debugging = True, printing = True)

### Example with 'cake' as main word, 2 seeds, debugging mode on to show covered and position vectors

In [None]:
final_output(mw = 'cake', card_model = card_model, n_seeds=2, n_iterations = 20, debugging = True, printing = True)

### Example with 'airplane' as main word, 3 seeds, simple printing mode

In [None]:
final_output(mw = 'airplane', card_model = card_model, n_seeds=3, n_iterations = 150, debugging = False, printing = True)

### Example with 'airplane' as main word, 2 seeds, only final output is shown

In [None]:
final_output(mw = 'airplane', card_model = card_model, n_seeds=2, n_iterations = 10, debugging = False, printing = False)