# Markov Chains

## Tokenizer

In [1]:
''' Read text and tokenize it with NLTK. '''
from nltk.tokenize import word_tokenize as tok
with open('../dataset.txt', 'r') as f:
    text = f.read()
token = tok(text)
print('Number of tokens:',len(token))
print(token[:20])

Number of tokens: 52186
['Paul', 'Valérys', 'Form', ':', 'Sie', 'wird', 'gespeist', 'von', 'seinem', 'unermüdlichen', 'Drang', 'zum', 'Objektivieren', 'und', ',', 'mit', 'Cézannes', 'Wort', ',', 'Realisieren']


## Vocabulary

For each token in the corpus, store every continuation that follows it: the next `n_token` tokens, one entry per occurrence.

In [3]:
''' Create a generator with pairs of tokens. '''
def make_pairs(token, n_token=1):
    for i in range(len(token)-n_token):
        yield (token[i], token[i+1:i+1+n_token])

pairs = make_pairs(token, n_token=1)

In [13]:
''' Display the first pairs.
A fresh generator is used here, so `pairs` is not consumed
before the vocabulary is built. '''
from itertools import islice

for pair in islice(make_pairs(token, n_token=1), 3):
    print(pair)

('Paul', ['Valérys'])
('Valérys', ['Form'])
('Form', [':'])


In [4]:
''' Create a vocabulary of all tokens and map them to their next tokens. '''
vocabulary = {}
for current_token, next_token in pairs:
    # one entry per occurrence, so frequent successors appear more often
    vocabulary.setdefault(current_token, []).append(' '.join(next_token))
print(len(vocabulary))

8766


In [5]:
''' Display all options for »Selbstgespräch«. '''
vocabulary['Selbstgespräch']

[',', ',', 'ergebe', 'ist']
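
Because duplicates are kept, each list encodes how often a successor was observed: in the output above, `','` makes up two of the four options for »Selbstgespräch«. The following is a minimal sketch of how such lists could be turned into relative frequencies. The toy corpus and the helper `transition_probabilities` are hypothetical and not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from dataset.txt.
toy_tokens = ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'b']

# Map every token to the list of tokens that follow it (duplicates kept).
toy_vocabulary = defaultdict(list)
for current_token, next_token in zip(toy_tokens, toy_tokens[1:]):
    toy_vocabulary[current_token].append(next_token)

def transition_probabilities(options):
    """Turn a list of observed successors into relative frequencies."""
    counts = Counter(options)
    total = len(options)
    return {successor: count / total for successor, count in counts.items()}

probs_a = transition_probabilities(toy_vocabulary['a'])
print(probs_a)
```

Picking uniformly from the raw list, as the notebook does, samples from these frequencies without computing them explicitly.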

## Generate text

Get all options for one token from the vocabulary and pick one of them randomly.
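
The idea can be tried on a toy example first. The sketch below uses a hypothetical hand-written vocabulary and a hypothetical helper `random_walk`; it relies on Python's `random` module instead of NumPy and stops early when a token has no known successor.

```python
import random

# Hypothetical toy vocabulary: token -> list of observed next tokens.
toy_vocabulary = {
    'the': ['cat', 'dog', 'cat'],
    'cat': ['sleeps', 'and'],
    'dog': ['sleeps'],
    'and': ['the'],
}

def random_walk(start, n_token, vocab, rng):
    """Append up to n_token randomly chosen successors to start."""
    walk = [start]
    for _ in range(n_token):
        options = vocab.get(walk[-1])
        if not options:  # dead end: the last token has no known successor
            break
        walk.append(rng.choice(options))
    return walk

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
walk = random_walk('the', 6, toy_vocabulary, rng)
print(' '.join(walk))
```

Each step looks only at the last token of the walk, which is what makes the process a Markov chain.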

In [6]:
''' This function can handle a vocabulary with {word: multiple words} pairs.
If lookup=True it prints all options for each input token. '''

def generate_text(input_, n_token=12, lookup=False):
    from nltk.tokenize import word_tokenize as tok
    import string
    import numpy as np
    # tokenize 
    gentext = tok(input_)
    gentext_lookup = tok(input_) # this list collects the tokens together with all their options

    for i in range(n_token):
        # join array to string (necessary for n > 1)
        gentext = ' '.join(gentext)
        # split it again to get access to the last token
        gentext = tok(gentext)
        # get all possible following tokens; stop at a token without successors
        if gentext[-1] not in vocabulary:
            break
        options = vocabulary[gentext[-1]]
        # choose one of them
        choice = np.random.choice(options)
        gentext.append(choice)
        # insert all possible values if lookup is true
        if lookup:
            # append all options
            options = ' | '.join(options)
            options = ': {  ' + options + '}\n\n'
            gentext_lookup.append(options)
            gentext_lookup.append(choice)
            
    # transform generated text to string with correct whitespaces
    out = ''
    for token in gentext:
        if token in string.punctuation:
            out += token
        else:
            out += ' ' + token
    if lookup:
        out += "\n\n"
        out += ''.join(gentext_lookup)
    return out

In [8]:
for i in range(3):
    print(generate_text('Das Selbstgespräch', 12))

 Das Selbstgespräch, dass nicht mehr da sie über den hier nichts notiert,
 Das Selbstgespräch ist in ein typisches neuronales Verschaltungsmuster repräsentiert außer Zahlen, sie sich
 Das Selbstgespräch, immer dasselbe immer und Produktionsverhältnisse erzeugen neue Maschine, welcher einem


## n-grams

This method picks the next word based not on a single token but on the preceding n tokens. Typical n-grams are of length 2 (bigrams) or 3 (trigrams).
On a small dataset, trigrams may already be too long: the longer the key, the fewer continuations are observed for it, so the generated text tends to repeat the corpus.
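
The effect of the key length can be checked by counting the distinct successors per key. The sketch below does this for a hypothetical toy corpus; `build_vocabulary` and `mean_options` are illustrative helpers and not part of the notebook.

```python
from collections import defaultdict

def build_vocabulary(tokens, n_grams):
    """Map each run of n_grams tokens to the tokens that follow it."""
    vocab = defaultdict(list)
    for i in range(len(tokens) - n_grams):
        key = ' '.join(tokens[i:i + n_grams])
        vocab[key].append(tokens[i + n_grams])
    return vocab

def mean_options(vocab):
    """Average number of distinct successors per key."""
    return sum(len(set(options)) for options in vocab.values()) / len(vocab)

# Hypothetical toy corpus, not taken from dataset.txt.
toy_tokens = 'the cat sat on the mat and the cat ran to the dog'.split()

for n in (1, 2, 3):
    vocab = build_vocabulary(toy_tokens, n)
    print(n, len(vocab), round(mean_options(vocab), 2))
```

When the average approaches 1, almost every key has a single continuation and the generator can only replay the source text.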

In [17]:
''' Create a generator with (n-gram, next token) pairs. '''
def make_n_gram_pairs(token, n_grams=2):
    # the last valid start index is len(token) - n_grams - 1
    for i in range(len(token)-n_grams):
        yield (' '.join(token[i:i+n_grams]), token[i+n_grams])

pairs = make_n_gram_pairs(token, n_grams=2)

In [18]:
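''' Display the first pairs from a fresh generator, so `pairs` is not consumed. '''
from itertools import islice

for pair in islice(make_n_gram_pairs(token, n_grams=2), 3):
    print(pair)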

('Paul Valérys', 'Form')
('Valérys Form', ':')
('Form :', 'Sie')


In [19]:
''' Create a vocabulary of all n-token-strings and map them to their next token. '''
vocabulary = {}

for current_token, next_token in pairs:
    # one entry per occurrence, so frequent successors appear more often
    vocabulary.setdefault(current_token, []).append(next_token)

print(len(vocabulary))

33850


In [20]:
''' Display all options for »Das Selbstgespräch«. '''
print(vocabulary['Das Selbstgespräch'])

['ist']


In [19]:
''' This function requires an input of at least n_grams tokens.
n_grams must match the value used to build the vocabulary. '''
def generate_text_n_grams(input_, n_token=12, n_grams=1):
    from nltk.tokenize import word_tokenize as tok
    import numpy as np
    import string

    # tokenize input
    gentext = tok(input_)
    key = ' '.join(gentext[-n_grams:])
    if key not in vocabulary:
        return 'No key available for: ' + key
    for i in range(n_token):
        # get all options for the last n_grams tokens of gentext
        key = ' '.join(gentext[-n_grams:])
        if key not in vocabulary:
            # dead end: this n-gram has no recorded successor
            break
        # choose one option and append it to the generated text
        gentext.append(np.random.choice(vocabulary[key]))
    output = ''
    for token in gentext:
        if token in string.punctuation:
            output += token
        else:
            # add a whitespace if token is not a punctuation
            output += ' ' + token
    return output

In [20]:
for i in range(3):
    print(generate_text_n_grams("Das Selbstgespräch", n_token=12, n_grams=2))

 Das Selbstgespräch ist ein Gegenstand mit ästhetischem Anspruch. Dieser neuronale Prozess benötigt auf
 Das Selbstgespräch ist der Materialismus, der beide verwandelt: Das Selbst ist durch
 Das Selbstgespräch ist ein erster und starker Hinweis dafür, dass nicht nur zu


## Sources

https://towardsdatascience.com/simulating-text-with-markov-chains-in-python-1a27e6d13fc6

https://mb-14.github.io/tech/2018/10/24/gomarkov.html