# MARKOV CHAIN POETRY GENERATION

### A **Markov Chain** is a mathematical process/system that transitions from one state to another according to probabilistic rules. Essentially, it's building a sequence based on probabilities. 

### The ***Natural Language Processing (NLP)** manifestation of this is something people interact with all the time: the predictive text function on our phones and Google's search auto-complete both guess the next word based on the probability of that word following the ones before it. <br>= Links on a chain 

***Natural Language Processing (NLP)** is the sub-domain / application of AI that deals with natural language. 
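
To make the "links on a chain" idea concrete, here is a minimal sketch of a Markov chain over a handful of words. The `transitions` table and its probabilities are made up purely for illustration; they are not taken from any real dataset.

```python
import random

# Hypothetical transition table: each word maps to its possible next words
# and the probability of each one (numbers invented for this example).
transitions = {
    "i":    {"am": 0.6, "like": 0.4},
    "am":   {"happy": 0.5, "here": 0.5},
    "like": {"pie": 1.0},
}

def next_word(current):
    """Pick the next word at random, weighted by its probability."""
    options = transitions[current]
    words = list(options.keys())
    weights = list(options.values())
    return random.choices(words, weights=weights, k=1)[0]

word = "i"
sentence = [word]
# Keep stepping until we reach a word with no outgoing links.
while word in transitions:
    word = next_word(word)
    sentence.append(word)

print(" ".join(sentence))
```

Each run walks the chain one link at a time, so the printed sentence can differ from run to run.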

## Start coding! Access Data & Prep It
Before we build the machine, we need to import the tools that the code will need for its different tasks. In the Python coding language, these are called libraries.
<br>**random**: will let the machine pick a random word from a list
<br>**json**: will let the machine read and write a json formatted file
<br>**string**: gives us a ready-made list of punctuation characters to strip out of the text
<br>**google.colab - drive:** lets us access files from our Google Drive

In [None]:
import json
import random
import sys
import string
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The next step is accessing and preparing/processing the dataset we want to use as background. For this demo, we are using all the scripts of the cartoon, *Rick & Morty*. Using this dataset will inform both the **Vocabulary** and the **Next-Word Probability** calculations of the output.

This natural language dataset has a very *loud* flavor. Take a minute to imagine a poem composed only from the language of a sci-fi/fantasy, philosophical, somewhat dystopian, and rather R-rated television show. 

Now consider how that would compare to the dataset of language behind Apple products' predictive texting.

**BIG IDEA:** The dataset we *curate* to inform our Markov Chain Poetry Generator significantly affects the output poem.

To prepare this dataset for our generator, I've opted to remove punctuation and line breaks, then make all words lower case so we don't get separate entries for, e.g., 'Spring' and 'spring'. The last step in processing is to remove any digits and turn our text into a list of words.

Run the code below to load and process the text. In the cell after that, we'll print the first few words in our dataset's list of words just to see how things look.

In [None]:
# Open the dataset and read it in as one long string
with open('/content/drive/My Drive/Colab Notebooks/rickMorty_corpus.txt', 'r') as f:
  fodder = f.read()

# Process the text by removing punctuation, then make all words lowercase
fodder = fodder.translate(str.maketrans('', '', string.punctuation))
fodder = fodder.lower()

# Remove digits, replace double line breaks and carriage returns with spaces,
# then split the text on spaces to create a list of all words
fodder = ''.join([i for i in fodder if not i.isdigit()]).replace("\n\n", " ").replace("\r"," ").split(' ')
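
To see what each processing step does, here is a sketch that runs the same steps one at a time on a short, made-up sentence. The `sample` text and the step variable names are my own stand-ins, not part of the real dataset.

```python
import string

# Made-up sample text standing in for the real corpus
sample = "Hello there, Spring! It's 9 am.\n\nHello again."

# Step 1: strip every punctuation character
no_punct = sample.translate(str.maketrans('', '', string.punctuation))

# Step 2: lowercase everything so 'Spring' and 'spring' match
lowered = no_punct.lower()

# Step 3: drop digits
no_digits = ''.join(ch for ch in lowered if not ch.isdigit())

# Step 4: turn double line breaks into spaces, then split on single spaces
toy_words = no_digits.replace("\n\n", " ").replace("\r", " ").split(' ')
print(toy_words)

# Optional extra step (not in the cell above): drop empty strings
toy_words_clean = [w for w in toy_words if w]
print(toy_words_clean)
```

Notice that removing the lone digit leaves two spaces in a row, so splitting on a single space produces one empty string in the list. If that bothers you in your own dataset, the optional filter at the end is one way to drop those entries.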


Now that we have all the Rick & Morty scripts turned into a clean list of words in order of appearance, a list we've called 'fodder', let's find out how long this list is (how many words it has) and print the first 25.

In [None]:
print ("Number of words: ",len(fodder))
print ("First 25 words: ", fodder[0:25])

Number of words:  228465
First 25 words:  ['love', 'connection', 'experience', 'yeah', 'come', 'together', 'with', 'love', 'connection', 'and', 'experience', 'its', 'my', 'favorite', 'song', 'oh', 'yeah', 'oh', 'yeah', 'distress', 'beacon', 'yeah', 'baby', 'youre', 'excited']
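
The length of the list counts every word, repeats included. The **Vocabulary** is the set of unique words. Here is a small sketch of the difference, using a made-up list called `toy_words` rather than the real `fodder`.

```python
from collections import Counter

# Made-up word list standing in for `fodder`
toy_words = ['oh', 'yeah', 'oh', 'yeah', 'distress', 'beacon', 'yeah']

# A set keeps only one copy of each word
vocabulary = set(toy_words)

print("Total words:", len(toy_words))
print("Unique words:", len(vocabulary))

# Counter tallies how many times each word appears
top_two = Counter(toy_words).most_common(2)
print("Most common:", top_two)
```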


## Coding the Building Blocks of our Markov Chain
This next piece of code is a 'Loop' that allows us to iterate through each individual word of the list we created above. 

With each word looped through, we add to what is called a 'Dictionary'. Here, our dictionary is called "chain".
Each entry in this dictionary is a unique word from our dataset, and it is defined by the list of words that follow it in the text. In Python, we call these entries and their definitions *keys* and *values*. 

Run the chunk of code below and it will build our dictionary, then print the entry for 'theres' to show you an example. 

In [None]:
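index = 1    # position of the current word in fodder
chain = {}   # maps each word to the list of words that follow it

for word in fodder[index:]: 
  key = fodder[index - 1]    # the word just before the current one
  if key in chain:
    chain[key].append(word)  # key seen before: add another next-word
  else:
    chain[key] = [word]      # first time we see this key: start its list
  index += 1

print (chain['theres'])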

['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'a', 'a', 'a', 'a', 'no', 'a', 'a', 'a', 'a', 'a', 'no', 'a', 'a', 'a', 'no', 'a', 'a', 'not', 'no', 'two', 'two', 'theres', 'a', 'our', 'not', 'no', 'two', 'two', 'theres', 'a', 'our', 'a', 'no', 'no', 'an', 'nothing', 'a', 'no', 'no', 'an', 'nothing', 'your', 'your', 'so', 'so', 'a', 'more', 'only', 'a', 'something', 'so', 'so', 'a', 'more', 'only', 'a', 'something', 'nothing', 'only', 'gotta', 'no', 'a', 'a', 'nothing', 'only', 'gotta', 'no', 'a', 'a', 'something', 'a', 'a', 'another', 'still', 'plenty', 'not', 'no', 'something', 'a', 'a', 'another', 'still', 'plenty', 'not', 'no', 'a', 'little', 'a', 'little', 'pros', 'caterers', 'nothing', 'got', 'always', 'flies', 'pros', 'caterers', 'nothing', 'got', 'always', 'flies', 'gonna', 'no', 'six', 'no', 'seven', 'some', 'only', 'always', 'four', 'six', 'only', 'a', 'been', 'gonna', 'no', 'six', 'no', 'seven', 'some', 'only', 'always', 'four', 'six', 'only', 'a', 'been', 'a', 'more'

Our dictionary building worked! In the example, we can see every word that followed 'theres' somewhere in our dataset, listed in order of appearance. Words that follow it often, like 'no' and 'a', show up in the list many times, so they are more likely to be picked when we choose from the list at random. 
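
The repeats in a next-word list are what give us the **Next-Word Probability**. As a rough sketch, we can count them on a tiny made-up word list. The names `toy_fodder` and `toy_chain` are my own, chosen so they don't overwrite the real `fodder` and `chain`.

```python
from collections import Counter

# Tiny made-up word list standing in for `fodder`
toy_fodder = ['theres', 'no', 'time', 'theres', 'a', 'way', 'theres', 'no', 'way']

# Build the dictionary: pair each word with the word that follows it
toy_chain = {}
for current, following in zip(toy_fodder, toy_fodder[1:]):
    toy_chain.setdefault(current, []).append(following)

# Count how often each next-word appears after 'theres'
counts = Counter(toy_chain['theres'])
total = sum(counts.values())
for word, n in counts.most_common():
    print(f"{word}: {n}/{total} = {n / total:.2f}")
```

Because `random.choice` picks uniformly from the list, a word that appears twice in the list is twice as likely to be chosen as a word that appears once.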

If we save this dictionary, "chain", to a file, we can look at it more closely and also make any edits we may want to our vocabulary.

In [None]:
with open('output_chain.json', 'w') as fp:
  json.dump(chain, fp)

Now that you've saved the output_chain file, take a moment to open it up and look through it. 

Find the dictionary key "spring" by ctrl-F searching for the characters:<br>
**"spring":**
<br>You will see that this key has the values ["at", "at"]<br>
Try adding the value **"is"** to that list so it looks like:<br>
**"spring": ["at", "at", "is"]**


In [None]:
# Open and load the output_chain file
with open('output_chain.json') as f:
  # json.load returns the JSON object as a Python dictionary
  data = json.load(f)
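
If you'd rather not edit the file by hand, the same kind of change can be made in code. This is only a sketch on a small stand-in dictionary called `toy_data` (a made-up name and made-up contents apart from the "spring" entry described above), so it won't touch your real `data`.

```python
import json

# Small stand-in for the dictionary loaded from output_chain.json
toy_data = {"spring": ["at", "at"], "at": ["me", "my"]}

# Add 'is' as a possible next-word for 'spring'.
# setdefault creates an empty list first if the key doesn't exist yet.
toy_data.setdefault("spring", []).append("is")

# Give the new word its own entry so the chain can continue from it
toy_data.setdefault("is", []).append("like")

# Round-trip through JSON text to confirm the edited dictionary still saves and loads
reloaded = json.loads(json.dumps(toy_data))
print(reloaded["spring"])
```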

## Compose a Line based on the Dictionary of Next-Words

Here is where you can put a line you want to translate into *Rick&Morty-speak*. <br>

For this demo, I'm using the first line of the e.e. cummings poem, "Spring is like a perhaps hand".

Then I ask the machine to use the first word of the line to initialize our chain. Each word is added one link at a time, based on the word preceding it. 

That is, the machine looks for the first word, *spring*, in our dictionary and then picks - at random - a word from spring's list of next-words. Once it selects a 2nd word, it goes to *that* word's entry in our dictionary and selects a 3rd word from the 2nd word's list of next-words. It keeps repeating this process until we reach the number of words in the original line.

Then it prints out our full line. You can keep running this code-chunk until you land on a line you like. Kind of like rolling the dice again. 

In [None]:
line = 'spring is like a perhaps hand'
line = line.split(" ")
count = len(line)  # number of words in the original line

word1 = line[0]
output = word1.capitalize()

# Pick the next word over and over until the word count is reached
while len(output.split(' ')) < count:
  word2 = random.choice(data[word1])
  word1 = word2
  output += ' ' + word2

print (output)

Spring at me summer this is
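
One thing that can trip up this loop is a word with no entry in the dictionary (for example, a word you typed that never appears in the dataset), which raises a `KeyError`. Below is a sketch of the same walk wrapped in a function that jumps to a random key when it hits such a dead end. The function name `generate_line`, the fallback behavior, and the `toy` dictionary are my own assumptions, not part of the notebook's method.

```python
import random

def generate_line(next_words, first_word, length):
    """Walk the chain from first_word until the line has `length` words."""
    current = first_word
    line_words = [current.capitalize()]
    while len(line_words) < length:
        followers = next_words.get(current)
        if followers:
            # Normal case: pick one of the recorded next-words at random
            current = random.choice(followers)
        else:
            # Dead end: no recorded next-words, so restart from a random key
            current = random.choice(list(next_words.keys()))
        line_words.append(current)
    return ' '.join(line_words)

# Made-up dictionary; note that 'me' has no entry of its own
toy = {"spring": ["at", "is"], "at": ["me"], "is": ["like"], "like": ["spring"]}
result = generate_line(toy, "spring", 6)
print(result)
```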


## Compose a Whole Poem at Once
If we want to compose a whole poem at once, rather than line by line, we can put all the lines together into a list variable, then 'loop' through each of those lines.

In [None]:
lines = ['spring is like a perhaps hand', 
'which comes carefully', 
'out of Nowhere arranging',
'a window into which people look while',
'people stare',
'arrange and changing placing',
'carefully there a strange',
'thing and a known thing here and',
'changing everything carefully']

for line in lines: 
  line = line.split(" ")
  count = len(line)
  word1 = line[0]
  #OR word1 = random.choice(list(data.keys())) #random first word
  message = word1.capitalize()
  #Picks the next word over and over until word count achieved
  while len(message.split(' ')) < count:
    word2 = random.choice(data[word1])
    word1 = word2
    message += ' ' + word2

  print (message)

Spring at my wallet and you
Which clearly i
Out um simple freedom
A space with my god damn right
People just
Arrange her which point
Carefully i will fall
Thing come true and then kill each
Changing timelines precludes
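
Every run rolls the dice again, so a poem you like is gone once you re-run the cell. One way to get it back is to seed the random number generator. The sketch below assumes a made-up dictionary (`toy_next_words`) and a hypothetical helper called `compose`; with the same seed it produces the same poem every time.

```python
import random

# Made-up dictionary of next-words; every value is also a key,
# so the walk never hits a dead end
toy_next_words = {
    "spring": ["is", "at"],
    "is": ["like", "spring"],
    "at": ["spring", "is"],
    "like": ["a"],
    "a": ["spring", "which"],
    "which": ["is", "at"],
}

def compose(template_lines, next_words, seed):
    """Generate one output line per template line using a seeded generator."""
    rng = random.Random(seed)  # same seed gives the same sequence of picks
    poem = []
    for template in template_lines:
        template_words = template.split(" ")
        count = len(template_words)
        word = template_words[0]
        message = [word.capitalize()]
        # Pick the next word over and over until the word count is reached
        while len(message) < count:
            word = rng.choice(next_words[word])
            message.append(word)
        poem.append(" ".join(message))
    return poem

template = ["spring is like a perhaps hand", "which comes carefully"]
poem_a = compose(template, toy_next_words, seed=7)
poem_b = compose(template, toy_next_words, seed=7)
print("\n".join(poem_a))
print(poem_a == poem_b)  # True: same seed, same poem
```

Change the seed to roll a different poem, and write down the seed of any poem you want to keep.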
