# Text Generation using Markov Chains <a class='tocSkip'>

<img src='images/cover.jpeg' width=550/>

In this notebook, we demonstrate how Markov chains can be used to generate text automatically from a seed text input. To do this, we need a corpus of sentences to serve as training data for our Markov chain model. We will build models from three different datasets.

1. [NLTK Shakespeare corpus](https://www.nltk.org/howto/corpus.html#shakespeare) - contains a set of Shakespeare plays.
2. [NLTK Reuters Corpus](https://www.nltk.org/book/ch02.html) - contains 10,788 news documents totalling 1.3 million words.
3. Our own input text, whether it's your favorite song, poem, or novel.

In [None]:
!wget https://raw.githubusercontent.com/aim-msds/bsdsba-trial-lectures/main/language-modeling/markov.py
!wget https://raw.githubusercontent.com/aim-msds/bsdsba-trial-lectures/main/language-modeling/utils.py

## Solution Pipeline

We will first discuss how to use the code to carry out the solution pipeline that we previously worked through by hand. The objects and functions involved have been coded for you beforehand. They handle the logic of creating all the possible states and then solving for the transition probabilities as the Markov chain model moves from one state to another.

In [None]:
from markov import MarkovChain
from utils import create_corpus

### The `MarkovChain` object

The core of our solution is the `MarkovChain` object. Once we add a corpus to it, it automatically performs all the logic that we did by hand, using the different *methods* defined in it.

In [None]:
model = MarkovChain()

In [None]:
help(MarkovChain)

### The `create_corpus` function

The `create_corpus` function prepares the data in the format that the `MarkovChain` object expects.

In [None]:
corpus = create_corpus("""
In this age of data influx, I will be a data science leader and help pioneer the use of data science in the country.
""")

corpus
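
The exact output of `create_corpus` depends on `utils.py`, which is not shown in this notebook. Since `add_corpus` later accepts `reuters.sents()` directly, a reasonable assumption is that a corpus is a list of sentences, each of which is a list of word tokens. The sketch below illustrates that format with a hypothetical `simple_corpus` helper; it is not the actual `create_corpus` implementation, and the tokenization rule is our own choice.

```python
import re

def simple_corpus(text):
    """Hypothetical stand-in for create_corpus (an assumption, not the real code):
    split the text into lines, then split each non-empty line into word tokens."""
    sentences = []
    for line in text.strip().splitlines():
        # Keep words (including apostrophes and hyphens) and drop other punctuation.
        tokens = re.findall(r"[\w'-]+", line)
        if tokens:
            sentences.append(tokens)
    return sentences

toy_corpus = simple_corpus("""
I love data science.
I love data.
""")
print(toy_corpus)
```

Each inner list is one sentence, so a model can later walk through consecutive words within a sentence without crossing sentence boundaries.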

### The `add_corpus`, `trans_probability`, and `next_word` methods

After the data has been prepared by `create_corpus`, we can use the `add_corpus` method of `MarkovChain` to add the data to the model.

In [None]:
model.add_corpus(corpus)

Adding the data stores in the model's memory the observed pairs of current state and possible next state. These pairs can then be used to compute the transition probability of each next state given a current state.

In [None]:
model.trans_probability(['data'])
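
To make the idea concrete, here is a minimal sketch of how transition probabilities could be computed from state pairs for unigram states. The names `build_transitions` and `transition_probabilities` and the toy sentences are hypothetical; the real `MarkovChain` in `markov.py` may store and compute things differently.

```python
from collections import Counter, defaultdict

def build_transitions(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        # Pair every word with the word that comes right after it.
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

def transition_probabilities(counts, state):
    """Normalize the successor counts of `state` into probabilities."""
    successors = counts.get(state, Counter())
    total = sum(successors.values())
    return {word: n / total for word, n in successors.items()}

# Toy data, assumed to be a list of tokenized sentences.
toy = [["I", "love", "data", "science"], ["I", "love", "data"], ["data", "is", "fun"]]
counts = build_transitions(toy)
print(transition_probabilities(counts, "data"))
```

In the toy data, `data` is followed once by `science` and once by `is`, so each gets probability 0.5. A word that only appears at the end of sentences has no successors and yields an empty dictionary.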

Afterwards, we can use the `next_word` method to let the model *evolve* and generate the next word according to the transition probabilities from its current state.

In [None]:
for i in range(10):
    print(i + 1, model.next_word(['data']))
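
Because the next word is drawn at random, repeated calls from the same state can return different words. A minimal sketch of this sampling step, assuming the transition probabilities are available as a dictionary (the `sample_next` helper and the probability values are made up for illustration and are not part of `markov.py`):

```python
import random

def sample_next(probabilities, rng=random):
    """Draw one successor word according to its transition probability."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"science": 0.5, "influx": 0.25, "is": 0.25}  # made-up toy values
rng = random.Random(0)  # seeded so that the run is repeatable
draws = [sample_next(probs, rng) for _ in range(1000)]

# Over many draws, the observed frequencies should approach the probabilities.
print({word: draws.count(word) / len(draws) for word in probs})
```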

### The `bigrams` mode

We can extend the model further by defining the state as a *bigram*, i.e. two consecutive words, instead of a *unigram*. We do this by setting the mode of the `MarkovChain` object to `bigrams`.

In [None]:
bigrams_model = MarkovChain(mode='bigrams')

In [None]:
corpus

In [None]:
bigrams_model.add_corpus(corpus)

In [None]:
bigrams_model.trans_probability(['data', 'science'])

In [None]:
bigrams_model.trans_probability(['and', 'help'])
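
With bigram states, the current state is a pair of consecutive words and the model looks at the word that follows the pair. The sketch below shows one way this could be counted, using tuples as dictionary keys. The function name and the toy sentences are hypothetical, and the actual `bigrams` mode may be implemented differently.

```python
from collections import Counter, defaultdict

def build_bigram_transitions(sentences):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        # Slide a window of three words: (first, second) is the state,
        # and `following` is the word observed right after it.
        for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
            counts[(first, second)][following] += 1
    return counts

toy = [
    ["data", "science", "leader"],
    ["data", "science", "in", "the", "country"],
    ["data", "influx"],
]
counts = build_bigram_transitions(toy)
successors = counts[("data", "science")]
total = sum(successors.values())
print({word: n / total for word, n in successors.items()})
```

Here the state `("data", "science")` is followed once by `leader` and once by `in`, so each has probability 0.5, while `("data", "influx")` never becomes a state because nothing follows it.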

## Deploying to Larger Datasets

Let us now use our solution to create an AI writer using larger datasets as its training data.

In [None]:
import nltk
from nltk.corpus import reuters
nltk.download('punkt')
nltk.download('reuters')

from utils import get_shakespeare_sents

Here, we'll use two datasets from the [NLTK library](https://www.nltk.org/) as our training data, namely, (1) Shakespeare's plays and (2) Reuters' news dataset.

In [None]:
shakespeare_sents = get_shakespeare_sents()

In [None]:
print(f"Number of sentences: {len(shakespeare_sents)}")
print(f"Number of words: {sum([len(sentence) for sentence in shakespeare_sents])}")
[' '.join(sentence) for sentence in shakespeare_sents[50:70]]

In [None]:
reuters_sents = reuters.sents()

In [None]:
print(f"Number of sentences: {len(reuters_sents)}")
print(f"Number of words: {sum([len(sentence) for sentence in reuters_sents])}")
[' '.join(sentence) for sentence in reuters_sents[:10]]

In [None]:
reuters_model = MarkovChain(mode='bigrams')
shakespeare_model = MarkovChain(mode='bigrams')

In [None]:
reuters_model.add_corpus(reuters_sents)
shakespeare_model.add_corpus(shakespeare_sents)

In [None]:
shakespeare_model.trans_probability(['I', 'love'])

*From Act 3, Scene 1 of Julius Caesar*

```
Caesar was mighty, bold, royal, and loving. Say I love Brutus and I honor him; Say I feared Caesar, honored him, and loved him.
```

In [None]:
reuters_model.trans_probability(['said', 'a'])

In [None]:
for i in range(10):
    print(i + 1, shakespeare_model.generate_sentence(['I', 'love']))

In [None]:
for i in range(10):
    print(i + 1, reuters_model.generate_sentence(['said', 'a']))
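
Sentence generation can be thought of as repeatedly sampling the next word and sliding the state forward. The sketch below illustrates this loop for bigram states under our own assumptions: the `generate` helper, the `max_words` cap, and the stopping rule are hypothetical, and `generate_sentence` in `markov.py` may decide when to stop differently.

```python
import random
from collections import Counter, defaultdict

def build_bigram_transitions(sentences):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
            counts[(first, second)][following] += 1
    return counts

def generate(counts, seed, max_words=20, rng=random):
    """Extend a two-word `seed` one sampled word at a time."""
    words = list(seed)
    while len(words) < max_words:
        # The current state is always the last two words generated so far.
        successors = counts.get(tuple(words[-2:]))
        if not successors:
            break  # dead end: this state was never followed by anything
        choices, weights = zip(*successors.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

toy = [["I", "love", "you", "so"], ["I", "love", "data", "science"]]
counts = build_bigram_transitions(toy)
print(generate(counts, ["I", "love"], rng=random.Random(1)))
```

With this toy data, the only possible outputs are the two training sentences, which hints at how closely a Markov chain follows its training text when the data is small.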

## Hands-on activity: Using your own text

Now that we've seen how the pipeline works, and have even created an AI writer by training it on larger datasets, let's feed it a text of our own. Place here any text that you can find on the internet (or craft your own) to create our AI writer!

In [None]:
corpus = create_corpus("""
Took a morning ride to the place
Where you and I were supposed to meet
The city yawns, they echo on
My thoughts are spinning on and on my head
It seems, they lead me back to you, ooh
I keep coming back to you
Took a morning ride, found a place up in my mind
No one else can see
Maybe, it's fate that we lose control
In circles around, we go
We become who we ought to know
We just gotta let it go
We just gotta let it go
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
Took a morning ride, gotta leave this all behind
For with you is where I want to be
Maybe, it's fate that we can't control (fate that we can't control)
Oh, around and around, it goes ('round and around, it goes)
And all that we seem to know (all that we seem to know)
We just gotta let it go
We just gotta let it go
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
So many questions I've thrown to the skies
And all of the answers, I've found in your eyes
When I'm with you, home is never too far
And my weary heart has come to rest in yours
I found my way home
I found my way home
I found my way home
I found my way home
I found my way home, I found my way home
I found my way home, I found my way home
I found my way home, I found my way home
I found my way home
So, I'm coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home, home
Coming home to you, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You, ooh-ooh, ooh-ooh-ooh-ooh-ooh-ooh
You're all I need, the very air I breathe
You are home
""")

In [None]:
my_model = MarkovChain(mode='unigrams')
my_model.add_corpus(corpus)

In [None]:
my_model.generate_sentence(['I'])

### Guide Questions

1. **An AI writer or an AI parrot?** Try generating text using the pipeline above. What do you observe about how the AI writer generates text? Can you predict how it will generate text given any current state?
2. **A translingual AI writer** What will happen if we feed it texts from different languages? How will our AI writer generate new text? Will it be able to generate text in a language different from that of its current state?
3. **Going beyond bigrams** Suppose that we instead define a state as a trigram or a quadgram. Will our AI writer be able to write better sentences? What problems might occur?
4. **Other improvements** Can you think of other ways to improve the AI writer?
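
As a starting point for exploring the third question, the sketch below measures, for a given state length `n`, the fraction of states that were only ever followed by a single distinct word. The `single_successor_share` helper and the toy sentences are hypothetical and independent of `markov.py`; you could pass in your own tokenized corpus instead.

```python
from collections import defaultdict

def single_successor_share(sentences, n):
    """Fraction of n-word states that were only ever followed by one distinct word."""
    successors = defaultdict(set)
    for tokens in sentences:
        for i in range(len(tokens) - n):
            # The n words starting at i form the state; the next word is its successor.
            successors[tuple(tokens[i:i + n])].add(tokens[i + n])
    if not successors:
        return 0.0
    forced = sum(1 for nxt in successors.values() if len(nxt) == 1)
    return forced / len(successors)

toy = [
    ["we", "just", "gotta", "let", "it", "go"],
    ["we", "just", "found", "my", "way", "home"],
    ["I", "found", "my", "way", "home"],
]
for n in (1, 2, 3):
    print(n, round(single_successor_share(toy, n), 2))
```

Try comparing the values for different `n` on a small corpus and on a large one, and think about what a high share implies for the variety of the generated text.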