One of the tasks we haven't covered yet in the course is text generation. It is another common task in the Natural Language Processing world, and one that NLP practitioners often work on.

In this first notebook, we're going to delve deep into the classical approach to text generation: Markov chain models. Markov chain models are probabilistic models that estimate the probability of the next word given the words that came before it. In this notebook, we'll implement the simplest version, where the probability of the next word depends only on the last word in the sequence.
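
In symbols, the simplest version and the count-based estimate used in this notebook can be written as below. The notation is an assumption made for this note and does not come from the notebook: $w_t$ is the word at position $t$, $C(w_{t-1})$ is the number of times a word occurs in the training text, and $C(w_{t-1}, w_t)$ is the number of times $w_t$ appears right after $w_{t-1}$.

```latex
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1}),
\qquad
\hat{P}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})}
```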

In [2]:
import random
import numpy as np
import pandas as pd

We'll train our Markov chain model on a short example text:

In [3]:
example_text = '''Hello world! welcome to the natural language processing course 
and hope you enjoy diving deeper into the world of natural language processing.'''

We start by building a vocab - in text generation, we normally don't remove stop words, as they are an important part of the text we are generating. In this example, we'll apply a single preprocessing step: lowercasing the text.

In [4]:
lower_case_example_text = example_text.lower()

In [5]:
lower_case_example_text

'hello world! welcome to the natural language processing course \nand hope you enjoy diving deeper into the world of natural language processing.'

Our goal is a transition matrix, which holds the probability of each word occurring right after another. Before building it, it's also common to add the tags `<s>` and `</s>` to mark the start and the end of a sentence:

In [6]:
lower_case_example_text = '<s> '+lower_case_example_text+' </s>'

In [7]:
lower_case_example_text

'<s> hello world! welcome to the natural language processing course \nand hope you enjoy diving deeper into the world of natural language processing. </s>'

Next, we can apply a `tokenization` method to split our sentence into tokens - we'll use the `nltk` tokenizer:

In [8]:
from nltk.tokenize import word_tokenize

In [9]:
tokenized_example = word_tokenize(lower_case_example_text)

Oops! Unfortunately, our start-of-sentence and end-of-sentence tags got split up. Let's combine them again:

In [10]:
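# word_tokenize split '<s>' into '<', 's', '>' (and likewise '</s>'),
# so we join the first three and the last three tokens back together
tokenized_example = (
    [''.join(tokenized_example[0:3])]
    +
    tokenized_example[3:-3]
    +
    [''.join(tokenized_example[-3:])]
)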
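
Re-joining the tags by position works for this text, but it depends on how the tokenizer happens to split them. A minimal sketch of one way around that is to tokenize first and add the tags afterwards, so they never pass through the tokenizer. To keep the sketch dependency-free, it uses a simple regular expression as a stand-in for `word_tokenize`; `simple_tokenize` and `add_sentence_markers` are hypothetical helper names, not part of the notebook or of `nltk`.

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize (an assumption for this sketch):
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def add_sentence_markers(tokens):
    # Add the tags after tokenization so they are never split up.
    return ['<s>'] + tokens + ['</s>']

text = 'Hello world! welcome to the natural language processing course'
tokens = add_sentence_markers(simple_tokenize(text.lower()))
print(tokens)
```

The same idea should carry over to `word_tokenize`: call it on the plain sentence and wrap the resulting list in the two tags.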

We'll build our vocab from the unique tokens:

In [11]:
vocab = list(set(tokenized_example))

Now, let's create a matrix of zeros, similar to our co-occurrence matrix - the only difference is that instead of looking for neighbors on each side, we only look at the one neighbor to the right of each word:

In [12]:
transition_matrix = pd.DataFrame(
    np.zeros([len(vocab), len(vocab)]),
    columns=vocab,
    index=vocab
)

Let's fill the transition matrix!

In [13]:
word_occurrences = {}

for token in vocab:
    word_occurrences[token] = tokenized_example.count(token)

In [14]:
for index, token in enumerate(tokenized_example):
    try:
        transition_matrix.loc[token, tokenized_example[index+1]] += 1
        print('Computing transition matrix...')
    except IndexError:
        print('Reached end of the text!')
    

Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Computing transition matrix...
Reached end of the text!


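The transition matrix is normally shown as a probability matrix. To turn the counts into probabilities, we divide each row by the number of occurrences of its word. This equals the row sum for every word except the final `</s>`, which has no successor and keeps a row of zeros: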

In [15]:
for word in vocab:
    transition_matrix.loc[word,:] /= word_occurrences[word]

In [16]:
transition_matrix

,course,enjoy,you,.,the,into,natural,!,and,hello,...,<s>,diving,processing,world,welcome,of,language,to,</s>,hope
course,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
enjoy,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
you,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
the,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
into,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
natural,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
!,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
and,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
hello,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
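
As a cross-check on the matrix above, the same counts and probabilities can be sketched with plain dictionaries. This is an alternative to the DataFrame approach, not the notebook's method; `transition_probabilities` is a hypothetical helper name and the short token list is a made-up example. It divides by the number of observed transitions out of each word, so a word that only appears at the very end of the text gets no row at all instead of a row of zeros.

```python
from collections import Counter, defaultdict

def transition_probabilities(tokens):
    # Count how often each token is followed by each next token.
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    # Divide each row by its total so the probabilities sum to 1.
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {tok: n / total for tok, n in followers.items()}
    return probs

# Made-up token list for illustration.
tokens = ['<s>', 'the', 'world', 'of', 'the', 'natural', 'world', '</s>']
probs = transition_probabilities(tokens)
print(probs['the'])
```

Storing only the transitions that were actually seen also avoids allocating a full vocab-by-vocab matrix, which may matter once the vocabulary grows.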


Let's generate our first text based on the transition matrix we've created. For each token, we'll pick the most probable next word:

In [17]:
def get_next_token(token, transition_matrix):
    return transition_matrix.loc[token].sort_values().tail(1).index[0]

In [18]:
sentence = ''
token = '<s>'

In [19]:
i = 0
while i < 30:
    sentence += token + ' '
    
    # Predicting the token
    pred_token = get_next_token(token, transition_matrix)
    
    # Replacing the previous token. We could
    # (and should) do this in one step but
    # I'm doing it in two assignments to avoid confusion
    token = pred_token
    
    # Increment i
    i += 1
    

In [20]:
sentence

'<s> hello world of natural language processing course and hope you enjoy diving deeper into the world of natural language processing course and hope you enjoy diving deeper into the '

With so little text, we end up falling into **repetition**, one of the major issues of classical text generation pipelines. There are some workarounds we can try, for example, using more text.
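
The repetition follows directly from always choosing the single most likely next word: the choice depends only on the current word, so as soon as a word comes up a second time, the output loops forever. Here is a small sketch of that effect, using a hypothetical hand-written `greedy_next` mapping in place of the transition matrix (`find_cycle` is also a name made up for this sketch):

```python
# Hypothetical greedy mapping (most likely next word for each word),
# loosely based on the sentence generated above.
greedy_next = {
    '<s>': 'hello', 'hello': 'world', 'world': 'of', 'of': 'natural',
    'natural': 'language', 'language': 'processing',
    'processing': 'course', 'course': 'and', 'and': 'the', 'the': 'world',
}

def find_cycle(start, greedy_next):
    # Follow the greedy choices until a token repeats.
    seen = {}
    path = []
    token = start
    while token not in seen:
        seen[token] = len(path)
        path.append(token)
        token = greedy_next[token]
    # Everything from the first visit of the repeated token onwards loops.
    return path[:seen[token]], path[seen[token]:]

prefix, cycle = find_cycle('<s>', greedy_next)
print('prefix:', prefix)
print('cycle: ', cycle)
```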

In [21]:
# Load training base from txt file
with open('training_base.txt', encoding='utf8') as file:
    wiki = file.read()

In [22]:
wiki



Now that we know how to create transition matrices, can you write a function that does it from start to finish for our text?

In [23]:
def build_tokens(text):
    # Lowercase the sentence and wrap it in start/end tags
    lower_case_text = '<s> ' + text.lower() + ' </s>'
    tokens = word_tokenize(lower_case_text)

    # Re-join the tags that word_tokenize split into three tokens each
    tokens = (
        [''.join(tokens[0:3])]
        +
        tokens[3:-3]
        +
        [''.join(tokens[-3:])]
    )

    return tokens

In [24]:
from nltk.tokenize import sent_tokenize

In [25]:
sentence_wiki = sent_tokenize(wiki)

In [26]:
tokenized_wiki = []

for sent in sentence_wiki:
    tokenized_wiki.extend(build_tokens(sent))

In [27]:
def build_transition_matrix(tokens):

    vocab = list(set(tokens))

    transition_matrix = pd.DataFrame(
        np.zeros([len(vocab), len(vocab)]),
        columns=vocab,
        index=vocab
    )

    # Count each (token, next token) pair; the last token has no successor
    for token, next_token in zip(tokens, tokens[1:]):
        transition_matrix.loc[token, next_token] += 1
    print('Reached end of the text!')

    # Turn counts into probabilities, as in the small example above:
    # divide each row by the number of occurrences of its token
    word_occurrences = pd.Series(tokens).value_counts()
    return transition_matrix.div(word_occurrences, axis=0)
    

In [28]:
transition_matrix_wiki = build_transition_matrix(tokenized_wiki)

Reached end of the text!


Let's generate some text based on our new transition matrix!

In [29]:
sentence = ''
token = '<s>'

In [30]:
i = 0
while i < 100:
    sentence += token + ' '
    
    # Predicting the token
    pred_token = get_next_token(token, transition_matrix_wiki)
    
    # Replacing the previous token. We could
    # (and should) do this in one step but
    # I'm doing it in two assignments to avoid confusion
    token = pred_token
    
    # Increment i
    i += 1
    

In [31]:
sentence

'<s> the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , and the united states , '

One thing we can do is randomly sample from the top words instead of always taking the most common one - this adds some unpredictability to the text:

In [32]:
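def get_next_token_top(token, transition_matrix, n):

    # Use the matrix passed in as an argument, not the global one
    top_n = transition_matrix.loc[token].sort_values().tail(n)
    # Only sample among words that were actually seen after the token
    next_token = top_n[top_n > 0].sample(1).index[0]

    return next_token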

In [33]:
get_next_token_top('united', transition_matrix_wiki, 4)

'nations'

In [36]:
get_next_token_top('united', transition_matrix_wiki, 4)

'kingdom'
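
`get_next_token_top` picks uniformly among the top `n` candidates, so a rare continuation is as likely to be chosen as the most common one. A possible variation, sketched here with the standard library rather than pandas, is to sample in proportion to the transition probabilities. The `probs` dictionary and its numbers are made up for illustration, and `sample_next_token` is a hypothetical helper name.

```python
import random

# Made-up transition probabilities for a single token.
probs = {'united': {'states': 0.7, 'nations': 0.2, 'kingdom': 0.1}}

def sample_next_token(token, probs, n, rng=random):
    # Keep the n most likely continuations...
    top_n = sorted(probs[token].items(), key=lambda kv: kv[1], reverse=True)[:n]
    candidates = [tok for tok, _ in top_n]
    weights = [p for _, p in top_n]
    # ...and sample one of them in proportion to its probability.
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_next_token('united', probs, 2, rng) for _ in range(1000)]
print({tok: draws.count(tok) for tok in set(draws)})
```

With `n=2`, only the two most likely continuations can be drawn, and the more probable one should come up noticeably more often.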

In [37]:
sentence = ''
token = '<s>'

i = 0
while i < 100:
    sentence += token + ' '
    pred_token = get_next_token_top(token, transition_matrix_wiki, 4)
    token = pred_token
    i += 1
    

In [38]:
sentence

"<s> the world 's largest economic expansion , with the united states–mexico–canada free association , and a result of a total population in the country in europe , which was founded as of a federal courts . '' was founded on april 1965 ) and the country 's second-largest group and in a result of the country 's most of europe , which are also often displaced across north america ( ppp ) === american samoa ( 4,300 m ) . '' </s> <s> the u.s. population in europe 's largest importer and other high-income countries engaged in new york "

As we raise the number of words we sample from, our text starts to become a bit more interesting. Nevertheless, it's still very difficult to understand! With more data, we could probably achieve a better result, but a model like this one only looks at the previous word, so it struggles to stay coherent over a whole sentence.
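
One direction hinted at in the introduction is to condition on more than one previous word. The following is a minimal sketch of a second-order chain that uses the last two tokens as the state. It is not part of the notebook: `build_second_order` and `generate` are hypothetical names, and the token list is a toy example.

```python
import random
from collections import defaultdict

def build_second_order(tokens):
    # Map each pair of consecutive tokens to the tokens seen after it.
    followers = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        followers[(first, second)].append(third)
    return followers

def generate(followers, start, max_tokens, rng=random):
    # start is a (token, token) pair that opens the generated text.
    output = list(start)
    while len(output) < max_tokens:
        options = followers.get(tuple(output[-2:]))
        if not options:
            break  # this pair was never followed by anything
        # Repeated entries make frequent continuations more likely.
        output.append(rng.choice(options))
    return output

# Toy token list for illustration.
tokens = ('<s> the world of natural language processing </s> '
          '<s> the natural language processing course </s>').split()
followers = build_second_order(tokens)
generated = generate(followers, ('<s>', 'the'), 20, random.Random(1))
print(' '.join(generated))
```

A longer context tends to produce more locally coherent text, at the cost of needing more data, since many token pairs are seen only once or never.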