# Markov Chains - n-grams

This notebook is based on the [Markov Basic Notebook](https://github.com/experimental-informatics/hands-on-text-generators/blob/master/markov_basic.ipynb).

So far we have generated text with a simple vocabulary: it maps a key of one token to a value of one token.

Another (maybe better) method is to use n-grams as keys and map them to a value of one token.<br>
The prediction of the next token is then based on multiple (n) tokens.

![ngrams.png](images/ngrams.png)
[Source](https://mb-14.github.io/tech/2018/10/24/gomarkov.html)

Typical n-grams are of length 2 (bigrams) or 3 (trigrams).<br>
For a small dataset, trigrams may be too long: many trigrams occur only once, so each key offers fewer choices for the next token.
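
To make the mapping concrete, here is a minimal sketch on a made-up toy sentence. The `tokens` list and the helper `ngram_table` are illustrative assumptions and not part of this notebook. The sketch builds the key-to-next-token table for n = 1, 2 and 3 and prints the options for the first key of each table.

```python
def ngram_table(tokens, n):
    """Map every n-gram (as a tuple) to the list of tokens that follow it."""
    table = {}
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        table.setdefault(key, []).append(tokens[i + n])
    return table

# Toy data, assumed for illustration only.
tokens = 'the cat sat on the mat and the cat ran'.split()

for n in (1, 2, 3):
    table = ngram_table(tokens, n)
    first_key = tuple(tokens[:n])
    print(n, first_key, table[first_key])
```

On this toy sentence the unigram key `('the',)` has three recorded followers, the bigram `('the', 'cat')` has two, and the trigram `('the', 'cat', 'sat')` has only one.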

In [1]:
''' Libraries. '''

import numpy as np
import random
from nltk.tokenize import word_tokenize as tok
import string

In [2]:
''' Set n-grams. 
Here we define the variable N_GRAMS. 
We will use it at different locations in our code, and it must always be the same. '''
N_GRAMS = 2

## Tokenizer

In [3]:
''' Read text and tokenize it with NLTK. '''

# Read text
with open('data/wiki_selection.txt', 'r') as f:
    text = f.read()
# Tokenize
token = tok(text)
print('Number of tokens:',len(token))
print(token[:50])

Number of tokens: 50409
['Aesthetics', 'is', 'a', 'branch', 'of', 'philosophy', 'that', 'deals', 'with', 'the', 'nature', 'of', 'beauty', 'and', 'taste', ',', 'as', 'well', 'as', 'the', 'philosophy', 'of', 'art', '(', 'its', 'own', 'area', 'of', 'philosophy', 'that', 'comes', 'out', 'of', 'aesthetics', ')', '.', 'It', 'examines', 'subjective', 'and', 'sensori-emotional', 'values', ',', 'or', 'sometimes', 'called', 'judgments', 'of', 'sentiment', 'and']


## Vocabulary

Create pairs of tokens: n tokens as input (`key`) and one token as output (`value`).

With the tokenization we have split our text into single tokens.<br>
Now we have to join n tokens to form a key.<br>
An easy way is to use `' '.join(tokens)` on a list of tokens.<br>
But with this we run into trouble with punctuation. For example, a key would be<br>
"taste ,"<br>
but it should be<br>
"taste,".
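
The problem can be reproduced in two lines. This is only a small sketch; the token list is a toy example chosen for illustration.

```python
# Toy tokens, as a tokenizer might return them for "taste, as well".
tokens = ['taste', ',', 'as', 'well']

# A plain join puts a space in front of the comma.
naive_key = ' '.join(tokens[:2])
print(repr(naive_key))  # 'taste ,'
```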

We will use a function found on [stackoverflow](https://stackoverflow.com/a/15950837):

In [4]:
''' Join tokens with a space in between, unless the token is a punctuation mark. 
(We don't need to know how it works. It's enough if we see that it works.) '''

def join_punctuation(seq, characters='.,;?!'):
    characters = set(characters)
    seq = iter(seq)
    current = next(seq)

    for nxt in seq:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt

    yield current
    
# Usage:
l = ['some', ',', 'tokens']
print(' '.join(join_punctuation(l)))

some, tokens


In [5]:
''' Create a vocabulary that maps every n-gram to the tokens that follow it. '''

vocabulary = {}

# Loop over the n-gram start positions; stopping early guarantees that token[i+N_GRAMS] exists.
for i in range(len(token) - N_GRAMS -1):
    # The N_GRAMS tokens starting at position i form the key.
    key = ' '.join(join_punctuation(token[i:i+N_GRAMS]))
    # The token right after them is the assigned value.
    value = token[i+N_GRAMS]
    
    # Check if the key is already in the dictionary.
    if key in vocabulary:
        # If yes, append the value to this entry.
        vocabulary[key].append(value)
    else:
        # Otherwise create a new entry with the key.
        vocabulary[key] = [value]
        
print('Size of the vocabulary:', len(vocabulary))

Size of the vocabulary: 29811


We can see that our vocabulary is much larger than in the basic Markov version. This is actually not so good, because it means that our keys are more specific and less general. In turn, we will have fewer options for each key in our vocabulary.
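
One way to put numbers on this is to count how many options the keys offer. The following is a minimal sketch: the helper `option_stats` and the hand-written `toy_vocabulary` are assumptions for illustration and stand in for the real `vocabulary`.

```python
def option_stats(vocab):
    """Return (number of keys, mean options per key,
    share of keys that have only one distinct option)."""
    n_keys = len(vocab)
    mean_options = sum(len(options) for options in vocab.values()) / n_keys
    single = sum(1 for options in vocab.values() if len(set(options)) == 1)
    return n_keys, mean_options, single / n_keys

# Hand-written toy vocabulary in the same shape as `vocabulary`.
toy_vocabulary = {
    'Aesthetics is': ['a', 'for', 'the'],
    'is a': ['branch'],
    'a branch': ['of'],
    'branch of': ['philosophy', 'philosophy'],
}

print(option_stats(toy_vocabulary))
```

In the notebook, the same function could be called as `option_stats(vocabulary)` after the vocabulary cell has run. A key with only one distinct option always continues the same way, so a high share of such keys means the generator mostly copies the source text.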

In [7]:
''' Inspect all options for a given key. '''
key = ' '.join(join_punctuation(token[0:N_GRAMS]))
print('All options for\n', key, ':', vocabulary[key])

All options for
 Aesthetics is : ['a', 'for', 'the']
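
Because every occurrence is appended to the list, a token that follows a key several times also appears several times in that key's options. Picking uniformly from the list is therefore the same as picking each token with its relative frequency. The sketch below shows this on a toy list; the repeated `'a'` is made up for illustration.

```python
from collections import Counter

# Toy options list; the duplicate 'a' is an assumption for illustration.
options = ['a', 'for', 'the', 'a']

counts = Counter(options)
total = sum(counts.values())

# Relative frequency of each follower = its transition probability.
probabilities = {word: count / total for word, count in counts.items()}
print(probabilities)
```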


## Generator

In [8]:
def generate_text_n_grams(input_='', n_token=12, n_grams=N_GRAMS):
    ''' Generate n_token new tokens. n_grams must match the N_GRAMS used for the vocabulary. '''
  
    # start with a random key if the input is empty
    if input_ == '':
        # split the key so that gentext is a list of single tokens
        gentext = random.choice(list(vocabulary.keys())).split()
        
    else:
        # tokenize input
        gentext = tok(input_)

    # predict n_token new tokens
    for i in range(n_token):
        
        # token_inp = key built from the last n_grams tokens of gentext
        token_inp = ' '.join(join_punctuation(gentext[-n_grams:]))
        
        # check if token_inp is included in the dictionary
        if token_inp not in vocabulary:
            # fall back to a random key if it is not included
            token_inp = random.choice(list(vocabulary.keys()))
            
        # get all options for this key
        options = vocabulary[token_inp]
        # choose one of these options
        choice = np.random.choice(options)
        # append it to the generated text
        gentext.append(choice)
        
    
    # when the for-loop is finished, create the output
    output = ''
    for t in gentext:
        if t in string.punctuation:
            output += t
        else:
            # add a whitespace if the token is not a punctuation mark
            output += ' ' + t
    return output

In [9]:
''' The function above also allows text generation without text input. '''

for i in range(3):
    print(generate_text_n_grams(n_grams=N_GRAMS))

 things as computers as 's is prominently to the field in cognitive science,
 view emphasizes that in each new piece of data that contains both desirable and
 unit performs internally(, the nature of reality, is related to this


In [11]:
for i in range(3):
    print(generate_text_n_grams('Aesthetics is', 12, N_GRAMS))

 Aesthetics is for infants to acquire their first-language?, '' the set of
 Aesthetics is for the rest of mankind. '' that this will not happen
 Aesthetics is the only existing substance is mental. The descriptions they gave differed


## Further tasks/ experiments

- Try it with `N_GRAMS = 3` (the vocabulary has to be rebuilt afterwards)
- Try it without punctuation
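
For the second experiment, one possible approach is to drop the punctuation tokens before the vocabulary is built. The helper `strip_punctuation` below is a hypothetical sketch and not part of this notebook. It removes every token that consists only of punctuation characters, which also covers multi-character tokens such as the `''` visible in the generated text above.

```python
import string

def strip_punctuation(tokens):
    """Drop tokens that consist only of punctuation characters."""
    return [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

# Toy tokens for illustration.
tokens = ['taste', ',', 'sensori-emotional', "''", 'values', '.']
print(strip_punctuation(tokens))
```

Words that merely contain a punctuation character, such as `sensori-emotional`, are kept. In the notebook, the result could be assigned back to `token` before the vocabulary cell is run again.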

## Sources

https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6

https://mb-14.github.io/tech/2018/10/24/gomarkov.html