# Generating Text using Markov Chains

### Aim:
In this notebook, we aim to generate storylines similar to Harry Potter by building a first-order, word-level Markov chain over all the Harry Potter books. The cells below load a different corpus, `Agatha_Cristie-1939_diez_negritos.txt`, so the sample output is in Spanish. For details on how the model works, read the related blog post [here](https://medium.com/@prakhar.mishra/can-bots-tell-you-stories-357a77bef4c9).

### Author:
1. [Prakhar Mishra](https://www.linkedin.com/in/prakhar21/)

__For more such material, follow me on__ [Medium](https://medium.com/@prakhar.mishra)

### Resources:
1. [Markov Chain explained Visually](http://setosa.io/ev/markov-chains/)
2. [Markov Chains in Python](https://www.datacamp.com/community/tutorials/markov-chains-python-tutorial)
3. [Markov Chains (YouTube)](https://www.youtube.com/watch?v=uvYTGEZQTEs)

### Improvement Scope
1. Try higher-order Markov chains (order 2 or 3, for example).
2. Try increasing the vocabulary size of the current model.
3. Try tuning the exploration factor, i.e. `randomness_level`.
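
The first improvement can be sketched roughly as follows. This is a minimal illustration and not part of the notebook's `TextGenerator` class: the names `build_chain`, `generate`, and `toy_tokens` are hypothetical, and the sketch samples next words in proportion to their observed counts rather than using `randomness_level`.

```python
import collections
import random


def build_chain(tokens, order=2):
    """Map each tuple of `order` consecutive words to a Counter of next words."""
    chain = collections.defaultdict(collections.Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state][tokens[i + order]] += 1
    return chain


def generate(chain, seed, max_words=20, rng=random):
    """Walk the chain from `seed` (a tuple of words), sampling by observed counts."""
    out = list(seed)
    state = tuple(seed)
    while len(out) < max_words:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never seen with a successor
            break
        words = list(followers)
        weights = list(followers.values())
        nxt = rng.choices(words, weights=weights, k=1)[0]
        out.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one word
    return out


# Hypothetical toy corpus, used only to show the shape of the data.
toy_tokens = "the cat sat on the mat and the cat ran".split()
toy_chain = build_chain(toy_tokens, order=2)
print(dict(toy_chain[("the", "cat")]))
print(' '.join(generate(toy_chain, ("the", "cat"), max_words=8)))
```

With a higher order, each state is more specific, so the output tends to follow the corpus more closely, at the cost of more dead ends on a small corpus.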

### Importing Libraries

In [1]:
import collections
import itertools
import operator
import codecs
import random
import nltk
import os
import re

### Class that encapsulates all the functionality

In [47]:
class TextGenerator(object):
    
    def __init__(self, data_list):
        self.text = self._load(data_list)
        self.total_words = 0  # overwritten by _prune(); must not be reset after training
        self.text_tokens = self._prune(self._tokenize())
        self.states = list(set(self.text_tokens))
        self.possible_transitions = self._get_transitions()
        self.trasnsition_probabilites = self.train()
        
    def _load(self, files):
        text = " "
        for f in files:
            print('Reading {}'.format(f))
            with codecs.open(f, 'rb', 'utf-8') as infile:
                text += self._clean(infile.read().encode('utf-8').decode('ascii', 'ignore')).strip()
        return text
    
    def _get_possibilities(self, state):
        words = []
        for index, value in enumerate(self.text_tokens):
            if value == state:
                try:
                    words.append(self.text_tokens[index+1])
                except IndexError:
                    # the last token in the corpus has no successor
                    words.append('EOS')
        return {state: dict(collections.Counter(words))}
    
    def _add_probabilities(self, possibilities):
        temp = {}
        for possibility in possibilities:
            for k, v in possibility.items():
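                # 'probab' is a score, not a normalized probability: within one state both
                # scaling factors are constant, so candidates still rank by follow-up count.
                temp[k] = [{'probab': (count/float(self.total_words)) * (1/float(len(v))), 'word': wrd} for wrd,count in v.items()]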
        return temp
    
    def train(self):
        possibilities = []
        for state in self.states:
            possibilities.append(self._get_possibilities(state))
        probabilities = self._add_probabilities(possibilities)
        return probabilities
            
    def _get_transitions(self):
        return [[self.states] for state in self.states]
    
    def _prune(self, tokens):
        if len(tokens) > 100000:
            self.total_words = 100000
            return tokens[:self.total_words]
        self.total_words = len(tokens)
        return tokens
        
    def _clean(self, text):
        text = text.lower()
        text = re.sub(r"(\n|\t|/)", " ", text)
        text = re.sub(r'([.,/#!$%^&*;:{}=_`~()-])[.,/#!$%^&*;:{}=_`~()-]+', r'\1', text)
        text = re.sub('([.,!?()])', r' \1 ', text)
        return re.sub(r"\s{2,}", " ", text)
    
    def get_len(self, d):
        return len(d)
    
    def _tokenize(self):
        tokens = nltk.word_tokenize(self.text)
        return tokens
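
The class above rescans the whole token list once per state. As a rough alternative, the same follow-up counts can be gathered in a single pass and normalized per state so that each row sums to 1. This is a sketch under assumptions, not the notebook's method: `transition_probabilities` is a hypothetical helper, and unlike `_get_possibilities` it does not add an `'EOS'` entry for the final token.

```python
import collections


def transition_probabilities(tokens):
    """Return {word: {next_word: P(next_word | word)}} from one pass over tokens."""
    counts = collections.defaultdict(collections.Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    probs = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        probs[word] = {nxt: c / total for nxt, c in followers.items()}
    return probs


# Hypothetical toy input: "a" is followed by "b" twice and by "c" once.
toy = "a b a c a b".split()
print(transition_probabilities(toy))
```

Because the ranking of candidates within a state depends only on the counts, this normalization should order the candidates the same way as the `probab` score used by the class.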

### Loading and Calculating Transition Probabilities

In [57]:
# preparing the data for training the model

files = ['Agatha_Cristie-1939_diez_negritos.txt']

generator = TextGenerator(files)

# length of the text
print("Length of the training file {}".format(generator.get_len(generator.text)))
print("Number of words {}".format(generator.get_len(generator.text_tokens)))
print(generator.states[:15])

Reading Agatha_Cristie-1939_diez_negritos.txt
Length of the training file 290204
Number of words 56663
['recurrio', 'sombrio', 'pasar', 'acostada', 'soldados', 'campana', 'confusa', 'enganaran', 'necesario', 'sabor', 'hacha', 'sonrio', 'cazar', 'encargos', 'exposicion']


### Generating a Story from the Patterns Observed in the Corpus

In [66]:
# test the model

def formatter(s):
    s = s.split()
    # cut the text after the last standalone '.' token so it ends on a full sentence
    # (raises IndexError if the text contains no '.' token)
    s = ' '.join(s[:[idx for idx, ch in enumerate(s) if ch == '.'][-1]+1])
    s = s.capitalize()  # sentence casing
    s = re.sub(r'\s(\.|,|!|\?|\(|\)|\]|\[)', r'\1', s) # remove padded space before punc.
    return s

# seed word
seed_word = 'tony'
story = [seed_word]
words = 0
max_words = 100
randomness_level = 3

while words < max_words-1:
    words += 1    
    candidates = generator.trasnsition_probabilites.get(seed_word)
    if candidates:
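        # rank candidates by score, highest first
        temp = sorted(candidates, key=lambda c: c['probab'], reverse=True)
        scores = [c.get('probab') for c in temp]
        # count the candidates that share the top `randomness_level` distinct scores
        group_sizes = [len(list(g)) for _, g in itertools.groupby(scores)]
        grouped = sum(group_sizes[:randomness_level])
        # choose uniformly among those top candidates
        seed_word = random.choice(temp[:grouped]).get('word')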
        story.append(seed_word)

print(' '.join(story))

tony contesto . no se lo he ahi ? pregunto : el juez , que no se dirigio al cabo la casa y la cabeza . no , que le habia visto una isla del comedor , y se dirigio a los demas de su mujer , y el juez , pero no , pero el juez wargrave se dirigio a los demas a la isla del negro ! el doctor . no es el juez wargrave se dirigio a los dos hombres . la casa y la isla . no es la cabeza , y se dirigio al doctor
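
The selection step in the loop above can be read as: rank the candidates, keep those holding the top `randomness_level` distinct scores, and pick one uniformly. A minimal sketch of that step as a standalone function follows; `pick_next` and `toy_candidates` are hypothetical names introduced here, and the candidate dictionaries only mimic the `{'word': ..., 'probab': ...}` shape produced by `_add_probabilities`.

```python
import itertools
import random


def pick_next(candidates, randomness_level=3, rng=random):
    """Pick uniformly among candidates holding the top-k distinct scores."""
    ranked = sorted(candidates, key=lambda c: c['probab'], reverse=True)
    # Group ties together, then keep the first `randomness_level` groups.
    groups = [list(g) for _, g in itertools.groupby(ranked, key=lambda c: c['probab'])]
    pool = [c for group in groups[:randomness_level] for c in group]
    return rng.choice(pool)['word']


# Made-up scores for illustration only; they are not taken from the corpus.
toy_candidates = [
    {'word': 'juez', 'probab': 0.5},
    {'word': 'doctor', 'probab': 0.3},
    {'word': 'casa', 'probab': 0.3},
    {'word': 'isla', 'probab': 0.1},
]
print(pick_next(toy_candidates, randomness_level=1))  # always 'juez'
print(pick_next(toy_candidates, randomness_level=2))  # 'juez', 'doctor' or 'casa'
```

A `randomness_level` of 1 makes generation nearly deterministic and prone to loops, which may explain the repeated phrases in the sample above; larger values widen the pool at the cost of coherence.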


-------

In [24]:
import markovify

In [69]:
# Get raw text as string.
with open('Agatha_Cristie-1939_diez_negritos.txt', encoding='utf-8') as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 140 characters
for i in range(3):
    print(text_model.make_short_sentence(140))

Después sólo dirían: «El viejo MacArthur no era cándido; el fastidio con los ladrillos. —¿Y ha estado demasiado... lejos. —¿Qué demonios insinúa usted, doctor?
¿qué cosa más rara!
La señora Rogers fueron los primeros en llegar a la vida es cada vez más peligrosa.
Ya calmado, dijo en voz baja. —Precisamente.
Otras eventualidades se presentaban a su cuarto siempre lee la Biblia.
¡Buena idea habían tenido jamás miedo.
Me pregunto lo que importa es examinar el tercer crimen y establecer el hecho de que no le satisfacía sino a medias...
Apoyada en los caracteres de los acusados eran culpables de los peldaños se encontraron sobre una mesa.
