# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the previous word, which is the defining assumption of a (first-order) Markov chain.
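
Written out, the assumption looks as follows. The notation $w_t$ for the word at position $t$ is introduced here for illustration and is not used elsewhere in this notebook:

```latex
P(w_t \mid w_1, w_2, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
```

In other words, generating the next word only requires knowing which words followed the current word in the corpus, and how often.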

Markov chains don't generate text as well as deep learning models do, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [None]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript,full_name
ali,"Ladies and gentlemen, please welcome to the st...",Ali Wong
anthony,"Thank you. Thank you. Thank you, San Francisco...",Anthony Jeselnik
bill,"[cheers and applause] All right, thank you! Th...",Bill Burr
bo,Bo What? Old MacDonald had a farm E I E I O An...,Bo Burnham
dave,This is Dave. He tells dirty jokes for a livin...,Dave Chappelle
hasan,[theme music: orchestral hip-hop] [crowd roars...,Hasan Minhaj
jim,[Car horn honks] [Audience cheering] [Announce...,Jim Jefferies
joe,[rock music playing] [audience cheering] [anno...,Joe Rogan
john,"All right, Petunia. Wish me luck out there. Yo...",John Mulaney
louis,Intro\nFade the music out. Let’s roll. Hold th...,Louis C.K.


In [None]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['ali']
ali_text[:200]

'Ladies and gentlemen, please welcome to the stage: Ali Wong! Hi. Hello! Welcome! Thank you! Thank you for coming. Hello! Hello. We are gonna have to get this shit over with, ’cause I have to pee in, l'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
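
To make the target structure concrete, here is a minimal sketch on a toy sentence. The sentence and the names `toy_text`, `toy_words`, and `toy_chain` are illustrative assumptions and do not come from the corpus.

```python
from collections import defaultdict

# Toy corpus, purely illustrative (not taken from the transcripts)
toy_text = "the cat sat on the mat and the cat ran"

toy_words = toy_text.split(' ')
toy_chain = defaultdict(list)

# Pair each word with the word that follows it
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_chain[current_word].append(next_word)

toy_chain = dict(toy_chain)
print(toy_chain['the'])    # ['cat', 'mat', 'cat']
print(toy_chain['cat'])    # ['sat', 'ran']
print('ran' in toy_chain)  # False: the last word has no follower, so it is not a key
```

Note that a word appearing only at the very end of the text never becomes a key, which matters later when the chain is walked.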

In [None]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
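
Because repeated followers are kept in each list, picking uniformly from a list is the same as sampling the next word in proportion to how often it followed the key. The sketch below shows this for one hypothetical follower list; `followers` is made-up data in the same format that `markov_chain` produces.

```python
from collections import Counter

# Hypothetical follower list for a single key (illustrative data only)
followers = ['I', 'then', 'I', 'she', 'I', 'then']

counts = Counter(followers)
total = len(followers)

# Empirical transition probabilities out of this one state
probabilities = {word: n / total for word, n in counts.items()}
print(probabilities)
```

Storing the raw list trades memory for simplicity; a `Counter` per key would hold the same information more compactly.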

In [None]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'Ladies': ['and'],
 'and': ['gentlemen,',
  'foremost,',
  'then',
  'have',
  'there’s',
  'resentment',
  'get',
  'get',
  'says,',
  'my',
  'she',
  'snatch',
  'running',
  'fighting',
  'yelling',
  'it',
  'she',
  'I',
  'I',
  'I',
  'we',
  'watched',
  'I',
  'have',
  'that',
  'recycling,',
  'disturbing',
  'it’s',
  'all',
  'just…',
  'be',
  'half-Vietnamese.',
  'his',
  'slide.',
  'your',
  'inflamed',
  'you’re',
  'I',
  'half-Japanese',
  'I’m',
  'half-Vietnamese.',
  'playing',
  'rugby.',
  'foremost,',
  'a',
  'emotionally',
  'I',
  '20',
  'neither',
  'I',
  'I–',
  'then',
  'it’s',
  'find',
  'start',
  'just',
  'caves',
  'gets',
  'is',
  'very',
  'for',
  'I',
  'she',
  'rise',
  'be',
  'eat',
  'watch',
  'be',
  'now',
  'most',
  'in',
  'then',
  'digitally',
  'then',
  'then',
  'then',
  'steady',
  'brings',
  'let',
  'reverberate',
  'say,',
  'my',
  'he',
  'when',
  'I’m',
  'sicker,',
  'sicker.',
  'sicker,',
  'sicker,',
  'pos

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [None]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list, make it the current word, and repeat
    for i in range(count-1):
        # The last word of the corpus may have no followers; restart from a random key
        if word1 not in chain:
            word1 = random.choice(list(chain.keys()))
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [None]:
generate_sentence(ali_dict)

'Corner. And I see that he saw it scared of Craigslist. That’s right. ‘Cause I.'

### Assignment:
1. Generate sentences for other comedians as well.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

In [None]:
data.transcript

ali        Ladies and gentlemen, please welcome to the st...
anthony    Thank you. Thank you. Thank you, San Francisco...
bill       [cheers and applause] All right, thank you! Th...
bo         Bo What? Old MacDonald had a farm E I E I O An...
dave       This is Dave. He tells dirty jokes for a livin...
hasan      [theme music: orchestral hip-hop] [crowd roars...
jim        [Car horn honks] [Audience cheering] [Announce...
joe        [rock music playing] [audience cheering] [anno...
john       All right, Petunia. Wish me luck out there. Yo...
louis      Intro\nFade the music out. Let’s roll. Hold th...
mike       Wow. Hey, thank you. Thanks. Thank you, guys. ...
ricky      Hello. Hello! How you doing? Great. Thank you....
Name: transcript, dtype: object

In [None]:
jimmy_text = data.transcript.loc['jim']
jimmy_text[:200]

'[Car horn honks] [Audience cheering] [Announcer] Ladies and gentlemen, please welcome to the stage Mr. Jim Jefferies! [Upbeat music playing] Hello! Sit down, sit down, sit down, sit down, sit down. [C'

In [None]:
jimmy_dict = markov_chain(jimmy_text)
jimmy_dict

{'[Car': ['horn'],
 'horn': ['honks]'],
 'honks]': ['[Audience'],
 '[Audience': ['cheering]',
  'cheering]',
  'whooping]',
  'cheering]',
  'whooping]',
  'cheering]',
  'cheering]',
  'cheering]',
  'whooping]',
  'cheering]',
  'cheering]',
  'cheering]',
  'applauding]',
  'cheering]',
  'booing]',
  'applauding]',
  'cheering]',
  'cheering]',
  'laughing]',
  'exclaiming]',
  'laughing]',
  'groaning]',
  'exclaiming]',
  'laughing]',
  'cheering]',
  'cheering]'],
 'cheering]': ['[Announcer]',
  'Thank',
  'Now…',
  'But',
  'Boston,',
  '[Audience',
  'And',
  'Second',
  'We',
  '–',
  'I',
  'See…',
  'Ladies',
  '[Upbeat'],
 '[Announcer]': ['Ladies'],
 'Ladies': ['and', 'and'],
 'and': ['gentlemen,',
  'I',
  'I’m',
  'I',
  'my',
  'we’re',
  'on',
  'my',
  'invite',
  'a',
  'we',
  'we',
  'one',
  'Dennis',
  'the',
  'most',
  'it’s',
  'I',
  'they',
  'you',
  'you’re',
  'they’re',
  'then',
  'you',
  'goes,',
  'whenever',
  'drinks',
  'Jim',
  'all',
  'it',
  '

In [None]:
generate_sentence(jimmy_dict)

'More… I will probably go to engage with him right there. Mother’s Day.” And they’re.'

In [None]:
generate_sentence(jimmy_dict)

'Now I stand by it. I don’t have to keep my fucking slut… Oh, no,.'

### Modified with punctuation

In [None]:
import string
def generate_sentence_punc(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the maximum number of words you would like to see in your generated sentence.
       Generation stops early at the first word that ends with any punctuation character.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list, make it the current word, and repeat
    for i in range(count-1):
        # The last word of the corpus may have no followers; restart from a random key
        if word1 not in chain:
            word1 = random.choice(list(chain.keys()))
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

        # Stop at any punctuation; endswith() is also safe for empty tokens from split(' ')
        if word2.endswith(tuple(string.punctuation)):
            break

    # End it with a period if the sentence does not already end with punctuation
    if not sentence.endswith(tuple(string.punctuation)):
        sentence += '.'

    return sentence

In [None]:
import string
def generate_sentence_punc(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the maximum number of words you would like to see in your generated sentence.
       Generation stops early at the first word that ends with '.', '?' or '!'.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list, make it the current word, and repeat
    for i in range(count-1):
        # The last word of the corpus may have no followers; restart from a random key
        if word1 not in chain:
            word1 = random.choice(list(chain.keys()))
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

        # Stop at sentence-ending punctuation; endswith() is also safe for empty tokens
        if word2.endswith(('.', '?', '!')):
            break

    # End it with a period if the sentence does not already end with punctuation
    if not sentence.endswith(tuple(string.punctuation)):
        sentence += '.'

    return sentence

#### This code breaks out of the loop as soon as a word ending with a sentence-ending punctuation mark (`.`, `?` or `!`) occurs, instead of continuing until the total word count for the sentence is reached

In [None]:
generate_sentence_punc(jimmy_dict)

'Handsome in America are you control…” And she talks.'
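
The sentence above still starts mid-thought, because the first word is drawn from all keys. One possible refinement, sketched below under stated assumptions, is to start only from words that followed a sentence-ending word in the corpus. The names `build_chain`, `generate_from_start`, `SENTENCE_END`, and the toy text are hypothetical and not part of the notebook's existing code.

```python
import random
from collections import defaultdict

# Assumption: these three marks are treated as sentence endings
SENTENCE_END = ('.', '?', '!')

def build_chain(text):
    # split() with no argument drops empty tokens caused by repeated whitespace
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        chain[current_word].append(next_word)
    return dict(chain), words

def generate_from_start(chain, words, max_words=15, rng=random):
    # Candidate starts: the first word, plus every word that follows a sentence-ending word
    starts = [words[0]] + [nxt for cur, nxt in zip(words[:-1], words[1:])
                           if cur.endswith(SENTENCE_END)]
    # Keep only starts that have followers; fall back to all keys if none do
    starts = [w for w in starts if w in chain] or list(chain.keys())

    word = rng.choice(starts)
    sentence = [word]
    # Walk the chain until a sentence end, a dead end, or the word limit
    while (len(sentence) < max_words
           and not word.endswith(SENTENCE_END)
           and word in chain):
        word = rng.choice(chain[word])
        sentence.append(word)

    result = ' '.join(sentence)
    if not result.endswith(SENTENCE_END):
        result += '.'
    return result

# Toy text for illustration only
toy_text = "Hello! Thank you for coming. Thank you! We are here. Hello again!"
toy_chain, toy_words = build_chain(toy_text)
print(generate_from_start(toy_chain, toy_words))
```

With a real transcript, `build_chain(jimmy_text)` would be used in place of the toy text. Whether the result reads better depends on how consistently the transcript is punctuated.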

In [None]:
import string
def generate_sentence_randpunc(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.
       The sentence is ended with a random character from string.punctuation.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list, make it the current word, and repeat
    for i in range(count-1):
        # The last word of the corpus may have no followers; restart from a random key
        if word1 not in chain:
            word1 = random.choice(list(chain.keys()))
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a random punctuation character (this can be any of string.punctuation,
    # including characters such as '%' or '[')
    sentence += random.choice(string.punctuation)
    return sentence

In [None]:
generate_sentence_randpunc(jimmy_dict)

'Raising a bad guy. You’re in it in front of Duty, but I was asleep%'

In [None]:
generate_sentence_randpunc(jimmy_dict)

'Engage with Madonna.” And I have downgraded…” And you can’t get that? You’re a conveyor-'

In [None]:
generate_sentence_randpunc(jimmy_dict)

'Female strip clubs or wrongly, the effort is in your brain that be having a['
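
The outputs above end in characters such as `%`, `-`, and `[`, because `string.punctuation` contains every ASCII punctuation character. A minimal sketch of a more restrained ending follows, assuming that only `.`, `!`, and `?` are acceptable sentence endings. The helper `add_end_mark` and the constant `END_MARKS` are hypothetical names.

```python
import random

# Assumption: only these marks make sense at the end of a sentence
END_MARKS = ['.', '!', '?']

def add_end_mark(sentence, rng=random):
    # Strip trailing spaces and punctuation left over from the last generated word
    stripped = sentence.rstrip(' ,;:-–….!?')
    return stripped + rng.choice(END_MARKS)

print(add_end_mark('You’re in it in front of Duty, but I was asleep'))
print(add_end_mark('Oh, no,'))
```

The helper could be applied to the string built inside `generate_sentence_randpunc` in place of the `random.choice(string.punctuation)` line.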

## Problem Statement
### Look at the transcripts of various comedians, note their similarities and differences, and determine whether the comedy style of the stand-up comedian of your choice differs from that of the others.

#### The conclusions from the previous assignments highlight the following differences between the comedians:

##### Type of comedy:
*   Jimmy O. Yang leans the most on ethnicity- and race-based comedy, with notable words such as Chinese, American, and Asian.
*   From the previous assignment, Bill, Chris, and Louis used the most profanity in their routines.
*   Anthony, Ricky, and Jim seem to use a lot of positive and happy words, including words like joke and fun.
*   Anthony's comedy is highly repetitive, with few unique words.




##### Talking speeds: The words_per_minute values for Hasan and Bo are outliers again, which points to a problem with their transcripts
*   Kevin Hart and Jim Jefferies are the fastest talkers, with the highest words_per_minute
*   Ali and Anthony are among the slowest

##### Vocabulary:
*   Gabriel Iglesias and Bill Burr used the most unique words and therefore have the most diverse vocabulary
*   Anthony and Louis used the fewest unique words, i.e. the most restricted vocabulary
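
For reference, the two metrics used above can be sketched as follows. This assumes that vocabulary size means the number of distinct whitespace-separated tokens and that words_per_minute is the total word count divided by the run time in minutes. The transcripts and run times below are made up for illustration, and the earlier assignments may have computed these values differently.

```python
# Hypothetical transcripts and run times, for illustration only
transcripts = {
    'comic_a': 'hello hello hello thank you thank you',
    'comic_b': 'every word here is different from the rest',
}
run_times_minutes = {'comic_a': 2, 'comic_b': 4}

stats = {}
for name, text in transcripts.items():
    words = text.split()
    stats[name] = {
        # Number of distinct tokens (case-sensitive, punctuation kept)
        'unique_words': len(set(words)),
        # Total words divided by run time in minutes
        'words_per_minute': len(words) / run_times_minutes[name],
    }

for name, values in stats.items():
    print(name, values)
```

An implausible words_per_minute value usually means that either the word count or the run time is off, for example because of an incomplete transcript.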

##### Having seen some of these routines, my comedian of choice is Jimmy O. Yang. His comedy is notably different from the others: it focuses on jokes about race and ethnicity and uses an average amount of profanity.
##### Apart from these differences, Jimmy was also similar to some extent to other comedians such as Bert, Anthony, and Jim.