# Text Generation using Markov Chains <a class='tocSkip'>

<img src='images/cover.jpeg' width=550/>

In this notebook, we demonstrate how Markov chains can be used to generate text automatically from a seed input. To do this, we need a corpus of sentences to serve as training data for the Markov chain model. We will build models from three different datasets:

1. [NLTK Reuters Corpus](https://www.nltk.org/book/ch02.html) - contains 10,788 news documents totalling 1.3 million words.
2. [NLTK Shakespeare corpus](https://www.nltk.org/howto/corpus.html#shakespeare) - contains a set of Shakespeare plays.
3. Our own input text, whether it's your favorite song, poem, or novel.
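
As a brief formal aside, the idea behind the models built in this notebook can be written down as follows. This is the standard textbook formulation, not something taken from `markov.py`; the count notation $C(\cdot,\cdot)$ is introduced here only for illustration. With one-word states, the model assumes that the next word depends only on the current word, and the transition probability is estimated from how often one word follows another in the corpus:

$$
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1}),
\qquad
\hat{P}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{\sum_{w'} C(w_{t-1}, w')}
$$

Here $C(a, b)$ is the number of times word $b$ immediately follows word $a$ in the training sentences. With two-word states, $w_{t-1}$ is replaced by the pair $(w_{t-2}, w_{t-1})$.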

In [None]:
# Download the helper modules used in this notebook
!wget https://raw.githubusercontent.com/aim-msds/bsdsba-trial-lectures/main/language-modeling/markov.py
!wget https://raw.githubusercontent.com/aim-msds/bsdsba-trial-lectures/main/language-modeling/utils.py

In [4]:
from markov import MarkovChain
from utils import create_corpus

In [7]:
model = MarkovChain()

In [86]:
corpus = create_corpus("""
Our team has data scientists and data engineers.
""")

corpus

[['Our', 'team', 'has', 'data', 'scientists', 'and', 'data', 'engineers', '.']]

In [9]:
model.add_corpus(corpus)

In [20]:
model.trans_probability(['data'])

            prob
scientists   0.5
engineers    0.5
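
The 0.5 / 0.5 split follows directly from counting: in the example sentence, `data` is followed once by `scientists` and once by `engineers`. The sketch below reproduces that calculation in plain Python. It is only an illustration of the counting idea; the function name `transition_probabilities` is hypothetical, and the actual implementation in `markov.py` may differ.

```python
from collections import Counter, defaultdict


def transition_probabilities(sentences):
    """Estimate P(next word | current word) from tokenized sentences.

    Hypothetical helper for illustration; not the MarkovChain class from markov.py.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pair every token with the token that immediately follows it.
        for current, following in zip(sentence, sentence[1:]):
            counts[current][following] += 1
    # Normalize each row of counts so that its probabilities sum to 1.
    return {
        current: {word: n / sum(followers.values()) for word, n in followers.items()}
        for current, followers in counts.items()
    }


sentences = [['Our', 'team', 'has', 'data', 'scientists', 'and', 'data', 'engineers', '.']]
probs = transition_probabilities(sentences)
print(probs['data'])
```

Every other word in this sentence has exactly one successor, so its row contains a single entry with probability 1.0.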


In [27]:
for i in range(10):
    print(i + 1, model.next_word(['data']))

1 engineers
2 engineers
3 scientists
4 engineers
5 scientists
6 scientists
7 scientists
8 scientists
9 scientists
10 scientists
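
The ten draws above give 3 `engineers` and 7 `scientists` even though each word has probability 0.5, because every call samples at random. A minimal sketch of such weighted sampling, assuming the model draws the next word in proportion to its transition probability (the helper `sample_next_word` is hypothetical and not part of `markov.py`):

```python
import random
from collections import Counter


def sample_next_word(distribution, rng=random):
    """Draw one word according to its transition probability (illustrative helper)."""
    words = list(distribution)
    weights = [distribution[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


distribution = {'scientists': 0.5, 'engineers': 0.5}
rng = random.Random(0)  # fixed seed so the run is repeatable

# With many draws, the observed share of each word approaches its probability.
n_draws = 10_000
draws = Counter(sample_next_word(distribution, rng) for _ in range(n_draws))
share = draws['scientists'] / n_draws
print(draws, share)
```

With only ten draws, an uneven split such as 3 to 7 is unremarkable; the imbalance shrinks as the number of draws grows.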


In [82]:
# Use two-word states: the next word is conditioned on the two preceding words
model = MarkovChain(mode='bigrams')

In [83]:
model.add_corpus(corpus)

In [84]:
model.trans_probability(['has', 'data'])

            prob
scientists   1.0


In [85]:
model.trans_probability(['and', 'data'])

           prob
engineers   1.0
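
With two-word states the ambiguity after `data` disappears: `has data` is only ever followed by `scientists`, and `and data` only by `engineers`. The sketch below generalizes the counting idea to states of any length by using tuples of words as keys. The function name and the `order` parameter are assumptions made for this illustration; they are not the `markov.py` API.

```python
from collections import Counter, defaultdict


def ngram_transition_probabilities(sentences, order=2):
    """Estimate P(next word | previous `order` words) from tokenized sentences.

    Illustrative helper; `order` is a hypothetical parameter, not a markov.py option.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Slide a window of `order` words and record the word right after it.
        for i in range(len(sentence) - order):
            state = tuple(sentence[i:i + order])
            counts[state][sentence[i + order]] += 1
    return {
        state: {word: n / sum(followers.values()) for word, n in followers.items()}
        for state, followers in counts.items()
    }


sentences = [['Our', 'team', 'has', 'data', 'scientists', 'and', 'data', 'engineers', '.']]
probs = ngram_transition_probabilities(sentences, order=2)
print(probs[('has', 'data')])
print(probs[('and', 'data')])
```

Longer states make the generated text follow the training data more closely, at the cost of fewer observed examples per state.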


In [42]:
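import nltk
from nltk.corpus import reuters

# Download the Punkt tokenizer models and the Reuters corpus
nltk.download('punkt')
nltk.download('reuters')

# Helper from utils.py that returns the Shakespeare plays as tokenized sentences
from utils import get_shakespeare_sents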

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/llorenzo/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /Users/llorenzo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [47]:
shakespeare_sents = get_shakespeare_sents()

In [48]:
print(f"Number of sentences: {len(shakespeare_sents)}")
print(f"Number of words: {sum([len(sentence) for sentence in shakespeare_sents])}")
[' '.join(sentence) for sentence in shakespeare_sents[50:70]]

Number of sentences: 32975
Number of words: 241336


["O ' erflows the measure : those his goodly eyes ,",
 "That o ' er the files and musters of the war",
 "Have glow ' d like plated Mars , now bend , now turn ,",
 'The office and devotion of their view',
 "Upon a tawny front : his captain ' s heart ,",
 'Which in the scuffles of great fights hath burst',
 'The buckles on his breast , reneges all temper ,',
 'And is become the bellows and the fan',
 "To cool a gipsy ' s lust .",
 'Flourish . Enter ANTONY , CLEOPATRA , her Ladies , the Train , with Eunuchs fanning her',
 'Look , where they come :',
 'Take but good note , and you shall see in him .',
 "The triple pillar of the world transform ' d",
 "Into a strumpet ' s fool : behold and see .",
 'CLEOPATRA',
 'If it be love indeed , tell me how much .',
 'MARK ANTONY',
 "There ' s beggary in the love that can be reckon ' d .",
 'CLEOPATRA',
 "I ' ll set a bourn how far to be beloved ."]

In [51]:
reuters_sents = reuters.sents()

In [55]:
print(f"Number of sentences: {len(reuters_sents)}")
print(f"Number of words: {sum([len(sentence) for sentence in reuters_sents])}")
[' '.join(sentence) for sentence in reuters_sents[:10]]

Number of sentences: 54716
Number of words: 1720917


["ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .",
 'They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .',
 "But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .",
 "The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .",
 'Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt expo

In [59]:
reuters_model = MarkovChain(mode='bigrams')
shakespeare_model = MarkovChain(mode='bigrams')

In [60]:
reuters_model.add_corpus(reuters_sents)
shakespeare_model.add_corpus(shakespeare_sents)

In [80]:
reuters_model.trans_probability(['said', 'a'])

                 prob
spokesman    0.020067
senior       0.013378
feasibility  0.003344
new          0.040134
sudden       0.003344
...               ...
reviving     0.003344
western      0.003344
Federal      0.003344
monitoring   0.003344


In [76]:
shakespeare_model.trans_probability(['I', 'love'])

             prob
long     0.027778
,        0.027778
thee     0.305556
you      0.138889
so       0.027778
:        0.055556
not      0.027778
.        0.055556
passing  0.027778
him      0.083333


In [81]:
for i in range(10):
    print(i + 1, reuters_model.generate_sentence(['said', 'a']))

1 said a high ranking delegation from the Red River in Fargo , Texas , which Allied expects to close its refinery system , Japanese goods may upset West European countries , consumer delegates said the country ' s stock .
2 said a third liquidating dividend of its value in exports of 5 , 000 a year ago .
3 said a resurgence in U . S . Stock market analysts had expected to pay dividends under New York exchanges a logical , but some were not disclosed .
4 said a rise in January 1986 , the Treasury Secretary James Baker came under fire from critics who claimed he helped to provoke the Soviet Union bought 29 . 9 mln dlrs to 21 . 1 mln vs loss 1 . 9 pct last week to the U . S . goods would bypass the trade deficit increased by around half of the company ' s posted price as U . S ., Japan , has been suspended since the meeting .
5 said a number of its outstanding common share resulting from the 50 mln dlrs , or three cts vs five cts vs loss 2 . 15 and April 13 to 16 . 8 mln vs 159 . 9 pct in

In [75]:
for i in range(10):
    print(i + 1, shakespeare_model.generate_sentence(['I', 'love']))

1 I love thee not , my wife is fair . Good thou , save of joy and prosperity .
2 I love you all , had done ' t , read it instantly .
3 I love not to leave betimes ?
4 I love the gentleman willing , shall we go ?
5 I love thee after .
6 I love him highly , not for himself to scape from it :
7 I love Brutus , my lord ; I long to see .
8 I love thee
9 I love Brutus , are all out .
10 I love passing well .'


In [376]:
corpus = create_corpus("""
Took a morning ride to the place
Where you and I were supposed to meet
The city yawns, they echo on
My thoughts are spinning on and on my head
It seems, they lead me back to you, ooh
I keep coming back to you
Took a morning ride, found a place up in my mind
No one else can see
Maybe, it's fate that we lose control
In circles around, we go
We become who we ought to know
We just gotta let it go
We just gotta let it go
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
Took a morning ride, gotta leave this all behind
For with you is where I want to be
Maybe, it's fate that we can't control (fate that we can't control)
Oh, around and around, it goes ('round and around, it goes)
And all that we seem to know (all that we seem to know)
We just gotta let it go
We just gotta let it go
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
So many questions I've thrown to the skies
And all of the answers, I've found in your eyes
When I'm with you, home is never too far
And my weary heart has come to rest in yours
I found my way home
I found my way home
I found my way home
I found my way home
I found my way home, I found my way home
I found my way home, I found my way home
I found my way home, I found my way home
I found my way home
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
Coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home
""")

In [377]:
my_model = MarkovChain(mode='unigrams')
my_model.add_corpus(corpus)

In [384]:
my_model.generate_sentence(['I'])

'I need , they lead me back to you , ooh - ooh - ooh - ooh - ooh - ooh - ooh - ooh - ooh'
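
Sentence generation can be thought of as repeating the next-word draw until the chain reaches a word with no recorded successor or a length limit. The sketch below shows that loop on three lines from the lyrics above. It is a simplified illustration under stated assumptions: the names `build_counts`, `generate`, and `max_words` are hypothetical, and the stopping rule used by `generate_sentence` in `markov.py` is not shown in this notebook and may differ.

```python
import random
from collections import Counter, defaultdict


def build_counts(sentences):
    """Count how often each word follows another (one-word states)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for current, following in zip(sentence, sentence[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, seed_word, max_words=20, rng=random):
    """Walk the chain from seed_word, sampling successors in proportion to their counts."""
    words = [seed_word]
    while len(words) < max_words:
        followers = counts.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        candidates = list(followers)
        weights = [followers[word] for word in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return ' '.join(words)


lines = [
    'I found my way home'.split(),
    'I keep coming back to you'.split(),
    'You are home'.split(),
]
counts = build_counts(lines)
sentence = generate(counts, 'I', rng=random.Random(1))
print(sentence)
```

In this tiny corpus the only branching point is the word after `I`, so the walk reproduces one of the two lines that start with `I`. With the full lyrics, words such as `you` and `ooh` have many successors, which is why the generated line above can drift from one lyric into another.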