# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the current word, which is the defining assumption of a (first-order) Markov chain.
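
Written out for a word sequence $w_1, \dots, w_t$ (notation introduced here purely for illustration), the first-order Markov assumption says that the full history can be replaced by the most recent word:

$$
P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
$$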

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) place to start.

## Select Text to Imitate

In this notebook, we're going to generate text in the style of Dave Chappelle and Bo Burnham, so as a first step, let's extract the text from their comedy routines.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('/Users/arjunkhanchandani/Desktop/Uni/6th Semester/AI Applications (UCS655)/NLP in Python/corpus.pkl')
data

| | transcript | full_name |
|---|---|---|
| brennan | [gentle music playing] [audience applauding] [... | Neal Brennan |
| burnham | Exploring mental health decline over 2020, the... | Bo Burnham |
| burr | Recorded Live at the Royal Albert Hall, London... | Bill Burr |
| carlin | Recorded on January 12–13, 1990, State Theatre... | George Carlin |
| dave | Sticks & Stones is Dave Chappelle’s fifth Netf... | Dave Chappelle |
| hasan | [theme music: orchestral hip-hop] [crowd roars... | Hasan Minhaj |
| louis | Recorded at the Madison Square Garden on Augus... | Louis C.K. |
| murphy | After achieving fame with Saturday Night Live ... | Eddie Murphy |
| norm | Then people go, “Goddamn, at least he’s not a ... | Norm Macdonald |
| pete | So, Louis C.K. tried to get me fired from SNL ... | Pete Davidson |


In [2]:
# Extract only Dave Chappelle's text
dave_text = data.transcript.loc['dave']
dave_text[:200]

'Sticks & Stones is Dave Chappelle’s fifth Netflix special.\nIn the promotional trailer Morgan Freeman narrates as Chappelle swaggers across a salt flat in leather pants, aviator shades and a remarkably'

In [3]:
bo_text = data.transcript.loc['burnham']
bo_text[:200]

'Exploring mental health decline over 2020, the constant challenges our world faces, and the struggles of life itself, Bo Burnham creates a wonderful masterpiece to explain each of these, both from gen'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
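
As a quick illustration of that structure, here is a minimal sketch on a made-up sentence. The names `toy_text`, `toy_words`, and `toy_chain` are hypothetical and are not part of the notebook's data.

```python
from collections import defaultdict

# Toy corpus, invented purely for illustration
toy_text = "the cat sat on the mat and the cat slept"
toy_words = toy_text.split(' ')

# Map each word to the list of words that follow it in the text
toy_chain = defaultdict(list)
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_chain[current_word].append(next_word)

print(dict(toy_chain))
```

This prints `{'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'slept'], 'sat': ['on'], 'on': ['the'], 'mat': ['and'], 'and': ['the']}`. Repeated followers are kept, so the lists preserve how often each pair occurred, and the final word (`'slept'`) never becomes a key because nothing follows it.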

In [4]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize on single spaces; punctuation (and any newlines) stays attached to the words
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [5]:
# Create the dictionary for Dave's routine, take a look at it
dave_dict = markov_chain(dave_text)
dave_dict

{'Sticks': ['&', '&'],
 '&': ['Stones', 'Stones'],
 'Stones': ['is', 'streamed'],
 'is': ['Dave',
  'Dave.',
  'this?',
  'the',
  '45',
  'perfect.',
  'my',
  'my',
  'I’m',
  'the',
  'the',
  'awkward',
  'to',
  'different.',
  'very',
  'an',
  'the',
  'the',
  'damn',
  'precisely',
  'Atlanta.',
  'many',
  'the',
  'it…',
  'it',
  'we',
  'that',
  'this,',
  'gay.',
  'the',
  'it’s',
  'not',
  'that,',
  'how',
  'killing',
  'not',
  'serious.”',
  'the',
  'the',
  'shame',
  'masturbating',
  'your',
  'theirs.',
  'their',
  'fair.',
  'an',
  'duck.”',
  'school',
  'training',
  'terrifying.',
  'real',
  'looking',
  'raising',
  'that',
  'a',
  'African',
  'your',
  'incumbent',
  'a',
  'an',
  'a',
  'this',
  'in',
  'buckshot.',
  'a',
  'not',
  'it?',
  'your',
  'an',
  'black,',
  'that',
  'carrying',
  'the',
  'funnier',
  'MAGA',
  'that',
  'the',
  'a',
  'my',
  'when',
  'on',
  'protected'],
 'Dave': ['Chappelle’s',
  'get',
  'Chappelle',
  'Ch

In [6]:
bo_dict = markov_chain(bo_text)
bo_dict

{'Exploring': ['mental'],
 'mental': ['health', 'disorder', 'health'],
 'health': ['decline', 'is…'],
 'decline': ['over'],
 'over': ['2020,',
  '♪',
  'the',
  'by',
  'this',
  'soon\xa0♪',
  'soon\xa0♪',
  'soon\xa0♪',
  'soon\xa0♪',
  'soon\xa0♪'],
 '2020,': ['the', 'and', 'I'],
 'the': ['constant',
  'struggles',
  'content\xa0♪',
  'fuck',
  'fuck',
  'streets\xa0♪',
  'drought\xa0♪',
  'fear',
  'world',
  'world',
  'center',
  'world',
  'world',
  'good\xa0♪',
  'floor',
  'fuck',
  'sidelines',
  'day',
  'world',
  'world',
  'fear',
  'world',
  'world',
  'Maternity',
  'last',
  'way',
  'whole',
  'place,',
  'deepest',
  'world.',
  'world',
  'seeds',
  'littlest',
  'sky\xa0♪',
  'sea\xa0♪',
  'world',
  'world',
  'worms',
  'dirt',
  'world',
  'world?',
  'worker',
  'means',
  'FBI',
  'left\xa0♪',
  'street\xa0♪',
  'interests',
  'pedophilic',
  'world',
  'world',
  'responsibility',
  'myopic',
  'fucking',
  'world',
  'world',
  'past.',
  'conversation.',


## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'
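
Because the value lists keep duplicates, picking uniformly from a list (as `random.choice` does) selects each next word in proportion to how often it followed the key. The sketch below makes those implied probabilities explicit; `transition_probabilities` and `toy_chain` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Hypothetical chain in the same format as the dictionaries built above
toy_chain = {'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'slept']}

def transition_probabilities(chain, word):
    '''Return {next_word: probability} implied by the follower list of `word`.'''
    counts = Counter(chain[word])
    total = sum(counts.values())
    return {next_word: count / total for next_word, count in counts.items()}

probs = transition_probabilities(toy_chain, 'the')
print(probs)
```

Here `'cat'` follows `'the'` in two of three occurrences, so it is chosen about two thirds of the time.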

In [7]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for _ in range(count - 1):
        # The corpus's last word may have no followers; fall back to a random key in that case
        next_words = chain.get(word1) or list(chain.keys())
        word2 = random.choice(next_words)
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [23]:
generate_sentence(dave_dict)

# 'Straight or something. It means, literally, uh, Michael Jackson never dreamt I’d be in this.'
# 'Much, everybody, I’m gay community was even puttin’ the door, all my dick! And I.'
# 'Where Kanye West, the documentary for? This is it’s too hot for yourselves. I just.'
# 'Funny? And everybody eventually. Like, “I don’t believe these Oscars.” And it’s you… …it’s exponentially.'
# 'Trump. – Who… Who’s that? – Who… Who’s got it to not be honest with.'
# 'Fathers of us. Hey, man. Hey, man. Please? I’ll fuck up, unscathed. Time for the.'
# '“n i g g g a. Get that they pick up in a heroin-addicted white.'

'Sleepin’ in the word… “f a health crisis. These are B’s. And if I’m telling.'

In [11]:
generate_sentence(bo_dict)

# 'Afford a white woman’s Instagram\xa0♪ ♪ But now for straining pasta Here’s how does not.'
# 'Please forgive me so the world works\xa0♪ ♪ So I’m very confused. See, I’m feeling?\xa0♪.'
# 'Hindsight\xa0♪ ♪ Trying to go\xa0♪ ♪ I should only light\xa0♪ ♪ And the fuck you.'

'Least 400 years ♪ Turning 30, turning 30 ♪ If you to pour my only.'

### Assignment:
1. Generate sentences for the other comedians as well.
2. Try improving the generate_sentence function. For example, allow it to end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.
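
One possible starting point for the second item is sketched below. It is only one of many reasonable designs, not a reference solution: the function name `generate_sentence_until_punctuation` and the `max_words` and `seed` parameters are assumptions introduced here, and only `.`, `!` and `?` are treated as sentence endings.

```python
import random

def generate_sentence_until_punctuation(chain, max_words=30, seed=None):
    '''Hypothetical variant: walk the chain until a word ends in . ! or ?,
       or until max_words words have been produced.'''
    rng = random.Random(seed)  # seeded generator so runs can be reproduced
    keys = list(chain.keys())
    endings = ('.', '!', '?')

    word = rng.choice(keys)
    words = [word.capitalize()]

    while len(words) < max_words and not word.endswith(endings):
        # Fall back to a random key if the current word has no recorded followers
        followers = chain.get(word) or keys
        word = rng.choice(followers)
        words.append(word)

    sentence = ' '.join(words)
    # Add a random closing mark only if the walk stopped because of the word cap
    if not sentence.endswith(endings):
        sentence += rng.choice(endings)
    return sentence

# Made-up chain for demonstration only
toy_chain = {'the': ['cat', 'dog'], 'cat': ['sat.', 'ran!'], 'dog': ['barked.']}
print(generate_sentence_until_punctuation(toy_chain, seed=0))
```

Note that words ending in a closing quotation mark after the punctuation (for example `serious.”`) are not recognized as endings by this sketch; handling them is left as part of the exercise.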

In [10]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils