# Artificial Intelligence Term Project: Writing with Machines

by Abigail Rictor and Cassidy Skorczewski, due December 17, 2019

## Abstract

## Introduction

For this term project, we wanted to focus on the potential for computers to create new material. When considering computers in the realm of human abilities, one of the gaps developers are still working to close is linguistic. There is a wide range of solutions that have been used to allow computers to generate meaningful speech or text, and we wanted to look at the base level of some of those concepts. Because of this course's recent focus on language models as well as our own backgrounds in Machine Learning, we decided to compare the performance of a probabilistic language model with that of a recurrent neural network trained on the same text data.

## Methods

### Data

For our dataset, we chose scripts from the animated television series *SpongeBob SquarePants*, which we scraped from the SpongeBob Wiki (Encyclopedia SpongeBobia). We selected this dataset so that our language models could demonstrate structure as well as content: a screenplay tells a story through both dialogue and actions, and the two are formatted differently.

### Probabilistic Approach

*Method Introduction*

#### Methodology

*Preprocessing Data*

The first approach we took to generating new text is based entirely on probabilities. Before building a probabilistic model, we wrote two Hadoop MapReduce jobs in Java to preprocess our data into a format that our model could read more easily. The first job read in all the scripts and recorded how many times each character followed another. It did this by computing a local count for each script of how many times one character followed another, then combining the local counts into global counts. Once we had the global counts, we removed any interactions that occurred fewer than 10 times. We noticed that some episodes featured one-off characters, such as LeBron James or Heidi Klum, who had limited interactions with one of the main characters, and we did not want these characters to appear in our script. An excerpt of the final output is shown below. Note that ‘action’ is not actually a character; it describes what is happening in the episode. The second MapReduce job took the 42 characters retained by the first job and generated 42 new text files, one per character, each containing everything that character has said throughout the show. It did this by filtering the scripts to include only lines from our chosen character set, then grouping the lines by character.

Sample Character Occurrence Output -- `{squidward: [(narrator,23), (pearl,17), (action,753), (patrick,428), (spongebob&patrick,66), (plankton,68), (squilliam,35), (larry,11), (sandy,91), (gary,17), (spongebob,2425)]}`
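
The counting and filtering step can be illustrated with a small in-memory sketch. This is not the Hadoop job itself (which was written in Java); it is a Python approximation that assumes each script has already been reduced to an ordered list of speaker names. The function name `count_followers` and the `min_count` parameter are hypothetical, chosen only for this illustration.

```python
from collections import Counter, defaultdict

def count_followers(scripts, min_count=10):
    """Count how often each speaker is followed by another across all scripts.

    `scripts` is assumed to be a list of episodes, each an ordered list of
    speaker names (with 'action' treated as a speaker).
    """
    global_counts = Counter()
    for speakers in scripts:
        # "Map" step: local count of (current, next) pairs within one script
        local_counts = Counter(zip(speakers, speakers[1:]))
        # "Reduce" step: merge the local counts into the global counts
        global_counts.update(local_counts)

    followers = defaultdict(list)
    for (current, nxt), count in global_counts.items():
        # Drop interactions that occur fewer than `min_count` times
        if count >= min_count:
            followers[current].append((nxt, count))
    return dict(followers)
```

Calling `count_followers(scripts)` on the full set of scripts would yield a dictionary shaped like the sample output above, with one list of `(follower, count)` pairs per character.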

*Character Model*

After we had our data in a format that was easy to read, we created two different probability models to generate new scripts. The first was the character ordering model, which generates the order in which characters appear in the script. Using the character occurrence counts from the first MapReduce job, we gave each character a weighted sampler over the characters who could follow them. For example, if Patrick followed Plankton 10 times, SpongeBob followed Plankton 10 times, and Karen followed Plankton 20 times, then Plankton’s weighted sampler would be Patrick 25%, SpongeBob 25%, and Karen 50%, where each percentage is the probability that the character is chosen to follow Plankton. We first randomly select one of our 42 characters to begin the script, then use that character’s weighted sampler to choose the next character in the sequence, and so on. Users can specify how many character interactions they want in their script.
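
A minimal sketch of the weighted sampler, using Python's standard `random.choices`, is given here for illustration. The function names are hypothetical, and the fallback for a character with no recorded followers (restarting from a random character) is an assumption of this sketch, not something described above.

```python
import random

def follow_probabilities(follow_counts):
    """Turn [(name, count), ...] into parallel lists of names and probabilities."""
    names = [name for name, _ in follow_counts]
    total = sum(count for _, count in follow_counts)
    return names, [count / total for _, count in follow_counts]

def generate_order(followers, n_turns, rng=None):
    """Generate a sequence of `n_turns` speakers from the follower counts."""
    rng = rng or random.Random()
    # Begin the script with a uniformly random character
    current = rng.choice(sorted(followers))
    order = [current]
    while len(order) < n_turns:
        options = followers.get(current)
        if options:
            # Draw the next speaker in proportion to how often they followed `current`
            names, probs = follow_probabilities(options)
            current = rng.choices(names, weights=probs, k=1)[0]
        else:
            # Assumption: no recorded followers, so restart from a random character
            current = rng.choice(sorted(followers))
        order.append(current)
    return order
```

For the Plankton example, `follow_probabilities([("patrick", 10), ("spongebob", 10), ("karen", 20)])` gives probabilities of 0.25, 0.25, and 0.5.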

*Sentence Model*

Once we had the order in which characters would appear in our script, we created another probabilistic model to generate what the characters say to one another. The model takes in the text file associated with the chosen character, and the user specifies the n-gram order to use. The character’s lines are read in, each of `?:!` is replaced with a period, and all other punctuation characters are removed. If a character’s line contained an action, as in `Spongebob -- “Hey Patrick what do you have there?” [Gestures to Patrick’s hands] “It sure looks heavy!” `, we removed the action entirely, since all actions are accounted for in the ‘action’ character. We then added sentence tags to mark the start and end of each sentence.

After the character’s text was properly formatted, we implemented two different n-gram models, one with Laplace smoothing (alpha = 0.1) and one without, that give the probability of a word following the current (n-1)-word phrase. To generate the actual text, we identified three potential strategies [source]. The first is sampling, where, as with the weighted sampler in our character model, we draw the next word from the word probability distribution. The next is greedy, where we simply select the word with the highest probability. The final strategy is top k, where we randomly choose one word from the k words with the highest probability.

We start a sentence by randomly selecting an (n-1)-gram that begins with our sentence start tag. From that starting point, we use one of our two n-gram models to calculate the probability that each word in our text dictionary follows the current (n-1)-gram, then use the user-specified generation strategy to pick the next word. We repeat this process on the latest n-1 words in the sentence, stopping when we reach a sentence end tag or the phrase reaches a specified length.
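
The sentence model can be sketched roughly as follows. This is an illustrative approximation rather than our exact implementation: the tag strings `<s>` and `</s>`, the function names, and the default `k` are all assumptions, and for simplicity the sketch pads each sentence with n-1 start tags and always begins from that padded context instead of randomly selecting a starting (n-1)-gram.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed tag strings

def build_counts(sentences, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        tokens = [START] * (n - 1) + sentence.split() + [END]
        vocab.update(tokens)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts, sorted(vocab - {START})

def next_word_probs(counts, vocab, context, alpha=0.1):
    """Laplace-smoothed P(word | context); alpha=0 gives the unsmoothed model."""
    seen = counts.get(context, Counter())
    total = sum(seen.values()) + alpha * len(vocab)
    if total == 0:
        return {}
    return {w: (seen[w] + alpha) / total for w in vocab if seen[w] + alpha > 0}

def choose(probs, strategy="sampling", k=3, rng=random):
    """Pick the next word by sampling, greedy, or top-k selection."""
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    if strategy == "greedy":
        return ranked[0][0]
    if strategy == "top_k":
        # Uniform choice among the k most probable words
        return rng.choice([w for w, _ in ranked[:k]])
    words, weights = zip(*ranked)
    return rng.choices(words, weights=weights, k=1)[0]

def generate(counts, vocab, n, strategy="sampling", alpha=0.1, k=3, max_len=20, rng=random):
    """Generate one sentence, stopping at the end tag or after max_len words."""
    context = (START,) * (n - 1)
    words = []
    while len(words) < max_len:
        probs = next_word_probs(counts, vocab, context, alpha)
        if not probs:
            break
        word = choose(probs, strategy, k, rng)
        if word == END:
            break
        words.append(word)
        # Slide the context window to the latest n-1 words
        context = (context + (word,))[1:]
    return " ".join(words)
```

With smoothing enabled, every word in the vocabulary receives a small non-zero probability, so the sampling and top k strategies can produce word sequences that never appeared in the character's lines; with `alpha=0`, only continuations actually seen in the text can be chosen.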

*Method Results*

### Neural Network Approach

#### Introduction
This approach uses a recurrent neural network to generate new text character by character. It functions on the principle of remembering state by feeding outputs back in as inputs. Each character generated here relies on the characters before it, granting the output more consistency throughout.

#### Method 
The basis for this approach was taken from a repository used for generating Shakespeare plays, and while I restructured some of that code to make it more modular and easily usable, the functionality of the recurrent neural network I borrowed (Machine Learning: Text Generation, 2019) stayed largely the same.

Much of the interesting part of working with neural networks, especially when using a limited dataset, is finding ways to guide the samples in a direction that best approximates the training data. Because this approach is built around existing characters rather than existing words, it always has the potential to spit out nonsense made of those characters. One way I've found to filter out as much of the nonsense as possible is to process the data as it is generated: generate only about 500 characters at a time, then break the sample into words and check all non-proper-nouns against a dictionary. If a word is not real, another sample is taken from the network, this time passing all of the text from the original sample up to the fake word in as the prime input. This process continues until the final sample exceeds the goal word count. The sample is then truncated at the last instance of punctuation or an action tag. The code used for generating an episode (or a chunk of an episode) using this technique can be seen below.

In [None]:
import utilities #some functions for reading and encoding data
import CharRNN as crnn #See GitHub https://github.com/albertlai431/Machine-Learning/tree/master/Text%20Generation
from ipywidgets import IntProgress
from IPython.display import display

text = utilities.readSpongebobRNN()

n_hidden=1024
n_layers=3
net = crnn.CharRNN(tuple(set(text)), n_hidden, n_layers)
print(net)

batch_size = 32
seq_length = 64
n_epochs = 100

crnn.train(net, utilities.encodeText(text), epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=500)

In [None]:
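def writeEpisode(write_net):
    # Generate an initial 500-character sample, primed with an opening action tag
    episode = crnn.sample(write_net, 500, prime="<a> Episode starts ", top_k=20).replace("}", "").replace("{", "")
    episode_arr = episode.split()

    # Iterate through the words in the episode and check that each exists in the dictionary
    i = 0
    while i < len(episode_arr):
        word = episode_arr[i].strip()
        # Skip action tags and capitalized words (treated as proper nouns)
        if len(word) == 0 or word == '<a>' or word == '<\\a>' or word[0].isupper():
            i += 1
            continue

        # Strip punctuation before the dictionary lookup
        cleaned = word.lower().replace("\n", "").replace(",", "").replace(".", "").replace("!", "").replace("?", "")
        if not utilities.checkDictionary(cleaned):
            print(word)  # log the rejected word
            if i > 500:
                # Goal word count reached: cut the episode off before the fake word
                episode = episode[:episode.rfind(word)]
                break
            # Resample, priming the network with everything before the fake word
            episode = crnn.sample(write_net, 500, prime=episode[:episode.rfind(word)], top_k=20).replace("}", "").replace("{", "")
            episode_arr = episode.split()
        else:
            i += 1

    # Truncate to the last punctuation mark or closing action tag. The closing tag is
    # spelled '<\\a>' everywhere else in this function, so the same spelling is assumed
    # here; +3 keeps the whole four-character tag after the slice below.
    last_punctuation = max(episode.rfind('.'), episode.rfind('!'), episode.rfind('?'), episode.rfind('<\\a>') + 3)
    episode = episode[:last_punctuation + 1]

    # Convert tags and separators to display markup
    return episode.replace("<a> ", "*").replace(" <\\a>", "*").replace("\n", " <br> ").replace("|", ": ").replace("}", "")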

Through much experimentation with network parameters, I was able to improve the results to some degree, taking it through the below stages of development (and more). The third example also uses the above post-processing techniques to filter out words not found in the English dictionary.


| Worse (low epochs, high batch-size) | Bad (medium epochs, medium batch-size) | Good-ish (medium epochs, low batch-size, higher hidden layers) |
|:------|:-----|:-----|
|a  aoate dt ysugocst <br> ate dtuetraweaoohb hai <br>eoaae aooe eh yrius e tiee e isp ewi <br> itnbasobioacui ofw a rae <br>lsncepseaostutanahstadnutd wpl ehblaiweue <br> tsshnte  tohaaeoraa rtn ble ba  trrrbrar<br> uie e alpne ht plioino rds binihni <br>npdoearsntdii nacwtreocraet tr<br> hn opioadnonr hcs psooectoaue > btt et twettdah etrsb dht i <br> rdatn drobgoystewatesg uriaea <br>gapycyahhle e ndnto oh <br>lasohrshnogocoadd c oue  rt <br>becdnurtt et lecorbo ar eh eon eaar <br> spldooaaeeadi ptd tt act buhnlhno ebha <br>irt ii b sccaolaboproidt olsoweotu  <br>trastslibeeahyittureyosetg russu ueguiar<br>ttbeooc pinb hausa hyu  ir>irlrinitd nrnyu<br>  eh rhnei wntyiaeagd i ee <br>tr enut ap rhe dpta ggrsitlpaa|*Episode starts on S\oasera a sage siditg SsingeBob: Thiilh phale hugr lop ruth* <br> *wokh and haants. I the Krubs. bpugitans S\ SpongeBob: ahe moiik and gois to tese ftey on outty wenklo bund,. atine hirts and sit agang. Prylalker soner, cinings and rhink. Kabh to binin. The foo as. Squidward. Saun..rbou casoy sands daot as yod nralns. hoomhl sook dyening thet oe SpongeBobs pems do to bivh os toepong dhele nse fabd sel coress. Nlatkinos mag wiks or at lonk and hracifg S\ <br> I chase* <br> Io Krab it drids ticnh tilt to sute ov nhand in o broate bamey to thew se aatss tho to Squidwardes. SpongeBrad lods. Yes dhen ghib teen, <br> Sacy. Wen o rasone seye,. Wowky sat fols the foct hhe Krlmty Krubty Krtpirs tovrot thes sotas* <br> <p> thist. Ofe retinl onhor tere certer. Hats arers S\oOod nrit ap! <br> Cekhiee <br> *hhauf ths at SpongeBob hilt cukilg*}H gagdr foten sowalls ther ap lhiweanl ny,er holn,et cinnle unrh* <br> }r n basing. *then, rimhrilg bnid tanp enade mouopro satel. Cosiser Sqreiartang for tim toacelm | *Episode starts at the Krusty Krab. <br> SpongeBob is sitting inside the air, where SpongeBob sets the bear with his house as his couch, and SpongeBob is brushing Garys shell* <br> Squidward: *At SpongeBobs House, Patrick is sleeping outside the Krusty Krab at night* <br> Squidward: SpongeBob, I got it to jellyfish jelly sandwich to help you get to sleep. *notices a nicked* So I gotta get out of here! <br> SpongeBob: Hi-ya, Sandy. Im’s me, That out here? <br> *Squidward grabs a handful of food, and theres a big pile of food on the patty and puts on his back on a towel, and they look on the SpongeBob sits inside his arms around. SpongeBob slows down and tries to stap into Garys food bowl and SpongeBob is sleeping, and his clock walks up to his head, and starts moving his head for protection from the red and walks over* SpongeBob: What is it, book? *points to a patty, with its a to where he see these he stops that me air. He wakes up to him* Hey, it is not going to make you sweet, something is brought to you to down that movie of yours ends... <br> *Squidward notices the entire dining room and sees Squidward playing the clarinet to him. Squidward goes to the freezer, but its over the water, and bubbles are about to ham.* |


One of the things we were excited to see the neural network pick up on was the (generally) correct use of action tags. New material could usually be counted on to use action tags for scene descriptions and character actions, only rarely conflating obvious dialogue with obvious stage directions. The progress we made on things like this indicates to me that, given more data and resources, we could get some really interesting results.

As is, while it was a fun experiment and an excellent exercise in finding creative ways to process data, we ultimately felt that this approach was not as successful as we were hoping it would be. The dataset was not large enough to yield very impressive results. To supplement this, it could be interesting to train on additional English scripts or texts to gain a more solid understanding of linguistic structure, but this may introduce a problem of over-generalization.

## Comparison and Conclusion

Our approaches to the same problem employed very different methods. The probabilistic model required a much more precise design and was very specific to the problem of generating a screenplay, whereas the neural network generates text character by character and is therefore an extremely general approach. In this case, with this dataset, the probabilistic model is capable of generating much more comprehensible, and even funny, new episodes when using a higher-order n-gram. That comprehensibility is owed mostly to larger chunks of recycled content in the dialogue, but because of the outer character model, it still avoids systematic reuse of more than one line at a time. The neural network can usually be counted on to come up with something irrefutably new, but the small dataset was not enough for it to gain a full understanding of English conventions, let alone a broader episode structure.

With more data, it's likely that both models would benefit. More data in conjunction with a smaller n-gram could add more originality to the probabilistic model without sacrificing too much of its cohesiveness, and more data for the neural network to train on would certainly increase its performance.

## References

* Machine-Learning: Text Generation [Repository]. (2019). Retrieved from https://github.com/albertlai431/Machine-Learning/tree/master/Text%20Generation
* English Words [Repository]. (2014). Retrieved from https://github.com/dwyl/english-words
* Encyclopedia SpongeBobia. (2019). Retrieved from https://spongebob.fandom.com/wiki/List_of_transcripts