# Markov Chain - Second order text generation

# Second Order Character Generation

This is the first Markov chain we build. The probability of each character depends only on the character immediately before it.

To do this, we create a dictionary called `vocabulary`. For each distinct token in our text, it stores a list of all tokens that follow it, including repeats.
When we generate text, we pick a random token from this list, as we did in the first-order text generation.
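
As a small illustration of this data structure, the sketch below builds such a dictionary for the toy string `'banana'`. The string and the names `toy_text` and `toy_vocabulary` are made up for this example and are not part of the notebook's data.

```python
# Toy input, chosen only for illustration
toy_text = 'banana'

toy_vocabulary = {}
for current, following in zip(toy_text, toy_text[1:]):
    # Store every follower, including repeats
    toy_vocabulary.setdefault(current, []).append(following)

print(toy_vocabulary)
# {'b': ['a'], 'a': ['n', 'n'], 'n': ['a', 'a']}
```

The last character of the text has no follower, so it only becomes a key if it also appears earlier in the text.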

## Create a Vocabulary (Training)

### Read a new text and clean it

In [None]:
import string

# Read the file into the variable text
with open('data/alles-macht-weiter.txt', 'r') as f:
    text = f.read()

# Replace line breaks with spaces and transliterate German umlauts
text = (text.replace("\n", " ")
            .replace("ä", "ae").replace("Ä", "Ae")
            .replace("ö", "oe").replace("Ö", "Oe")
            .replace("ü", "ue").replace("Ü", "Ue"))
text = text.lower()
# Remove digits
remove_digits = str.maketrans('', '', '0123456789')
text = text.translate(remove_digits)
# Remove ASCII punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Each character of the cleaned text is one token
print('Number of tokens:', len(text), '\n')
print(text[:50])
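
The same cleaning steps can be collected in one helper so that they can be tried on a short string without the data file. This is only a sketch: the function name `clean_text` and the sample sentence are assumptions for illustration.

```python
import string

def clean_text(raw):
    """Apply the cleaning steps from the cell above to a string."""
    # Lowercase first so only the lowercase umlauts need to be handled
    raw = raw.replace('\n', ' ').lower()
    for umlaut, replacement in (('ä', 'ae'), ('ö', 'oe'), ('ü', 'ue')):
        raw = raw.replace(umlaut, replacement)
    # Remove digits and ASCII punctuation in a single pass
    return raw.translate(str.maketrans('', '', string.digits + string.punctuation))

print(clean_text('Über 100 Bäume,\nöfter grün!'))
# ueber  baeume oefter gruen
```

Note that removing the digits leaves two consecutive spaces behind; the cleaning does not collapse them, so a space can follow a space in the character model.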

In [None]:
# Show each of the first five characters together with the character that follows it

for i in range(5):
    print(text[i], "->", text[i+1])

### Build vocabulary

In [None]:
vocabulary = {}

# Loop through all tokens (except the last one).
for i in range(len(text) -1):
    # The current token is key
    key = text[i]
    # The next token is the assigned value.
    value = text[i+1]
    
    # Check if the key is already in the dictionary.
    if key in vocabulary:
        # If yes, append the value to this entry.
        vocabulary[key].append(value)
    else:
        # Otherwise create a new entry with the key.
        vocabulary[key] = [value]

In [None]:
# Show all stored next characters for specific characters
print("a---",vocabulary['a'])
print("c---",vocabulary['c'])
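
The check-then-append pattern above can also be written with `collections.defaultdict`, which creates the empty list on first access. A minimal sketch follows; the function name `build_vocabulary` and the sample string are assumptions for this example.

```python
from collections import defaultdict

def build_vocabulary(tokens):
    """Map each token to the list of tokens that follow it."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    # Convert back to a plain dict so missing keys raise KeyError again
    return dict(followers)

sample_vocabulary = build_vocabulary('abcabd')
print(sample_vocabulary)
# {'a': ['b', 'b'], 'b': ['c', 'd'], 'c': ['a']}
```

Because the function only relies on slicing and iteration, it accepts a string of characters as well as a list of words.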

## Text generation

### Build a function to pick the value

Since this operation is repeated for every generated character, we wrap it in a function that picks the next character.

In [5]:
import random

def next_token(key):
    ''' Return a randomly selected token from our list of options. '''

    # Get all options stored in the dictionary for this key.
    options = vocabulary[key]

    # Pick one.
    choice = random.choice(options)
    #print('key:'+key+' - ',options)
    # Return this value.
    return choice

print(next_token('a'))

g
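
Because the list keeps repeated entries, `random.choice` picks each next character with a probability proportional to how often it followed the key. The sketch below makes these probabilities explicit for a made-up list of options; `options` here is illustrative and not taken from the text.

```python
from collections import Counter

# Hypothetical followers of some key, including repeats
options = ['n', 'n', 'n', 'b']

counts = Counter(options)
total = len(options)
# Relative frequency = probability that random.choice(options) returns the character
probabilities = {char: count / total for char, count in counts.items()}

print(probabilities)
# {'n': 0.75, 'b': 0.25}
```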


### Start the generating process

In [6]:
generated_text = 'a' # We start with this as input.

# execute 50 times
for i in range(50):
    
    # The last token of generated_text is the key to get the next token.
    key = generated_text[-1]
    
    # Pick one token for this key.
    choice = next_token(key)
    
    # Append this token to the generated text.
    generated_text += choice
    
# We print the generated text once when the for-loop has finished.
print(generated_text)

achendae rumerbend d d umaunschenndiner en rwemaun 
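
The loop can be wrapped in a function so that the start token and the length become parameters. This is a sketch under assumptions: the name `generate`, its parameters, and the toy vocabulary are not from the notebook, and stopping early at a character without followers is one possible design choice.

```python
import random

def generate(vocab, start, length, rng=random):
    """Generate up to `length` further characters from a follower dictionary."""
    result = start
    for _ in range(length):
        options = vocab.get(result[-1])
        if not options:
            # The last character never had a follower in the training text
            break
        result += rng.choice(options)
    return result

# Toy vocabulary in which every character has exactly one follower
toy_vocab = {'a': ['b'], 'b': ['c'], 'c': ['a']}
print(generate(toy_vocab, 'a', 5))
# abcabc
```

Passing a seeded `random.Random` instance as `rng` would make the output reproducible between runs.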


## Showing the vocabulary as a 2-dimensional table

In [7]:
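# 28 x 28 table of transition counts: row = current character, column = next character.
# Indices 0-25 hold 'a'-'z' and index 27 holds the space character; index 26 stays unused.
table = [[0 for i in range(28)] for j in range(28)]

for index, value in vocabulary.items():
    # Row index of the current character
    if index == ' ':
        i = 27
    else:
        i = ord(index) - ord('a')

    for c in value:
        # Column index of the next character
        if c == ' ':
            v = 27
        else:
            v = ord(c) - ord('a')
        #print(i,v)
        # Skip characters that fall outside the table
        if 0 <= v <= 27 and 0 <= i <= 27:
            table[i][v] += 1

# Header row with the letters a-z
print('    ', end=' ')
for y in range(26):
    print(chr(y + ord('a')), end='    ')
print()
# One row per letter. Only columns 0-26 are printed, so the space column (27) is not shown.
for y in range(26):
    print(chr(y + ord('a')), end='    ')
    for x in range(27):
        print('{:>3}'.format(table[y][x]), end='  ')
    print()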

     a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    v    w    x    y    z    
a      0    0   21    2    9    0    3    1    0    0    2    3    0    7    0    3    0    2    4    3   21    0    0    0    0    0    0  
b      4    0    0    0    5    0    0    0    0    0    0    1    0    0    2    0    0    0    1    0    1    0    0    0    0    0    0  
c      0    0    0    0    0    0    0   35    0    0    3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0  
d      2    0    0    0   14    0    0    0   16    0    0    0    0    0    0    0    0    0    1    3    2    0    0    0    0    0    0  
e      0    3    2    1    1    0    4    9   31    0    0    3    3   53    0    0    0   42    5    0    4    0    2    0    0    0    0  
f      1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0  
g      0    0    0
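
A more compact way to obtain such counts, including the transitions to and from the space character, is to count (current, next) pairs directly. The sketch below uses a short made-up string instead of the song text; `sample`, `alphabet`, `labels` and `pair_counts` are names assumed for this example.

```python
from collections import Counter

sample = 'ab abba'
alphabet = 'ab '        # rows and columns to display; the last one is the space
labels = {' ': '_'}     # show the space as an underscore

# Count how often each character is followed by each other character
pair_counts = Counter(zip(sample, sample[1:]))

print('  ' + ''.join('{:>3}'.format(labels.get(c, c)) for c in alphabet))
for row in alphabet:
    cells = ''.join('{:>3}'.format(pair_counts[(row, col)]) for col in alphabet)
    print('{:>2}'.format(labels.get(row, row)) + cells)
#     a  b  _
#  a  0  2  0
#  b  1  1  1
#  _  1  0  0
```

A `Counter` returns 0 for pairs that never occur, so no fixed-size table has to be allocated in advance.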

# Second Order Word Generation

Since there are far more possible tokens in word generation than in character generation, we need a larger data set.

In [8]:
import string

with open('data/wiki_selection.txt', 'r') as f:
    text = f.read()

# Replace line breaks with spaces and transliterate German umlauts
text = (text.replace("\n", " ")
            .replace("ä", "ae").replace("Ä", "Ae")
            .replace("ö", "oe").replace("Ö", "Oe")
            .replace("ü", "ue").replace("Ü", "Ue"))
text = text.lower()
# Remove digits
remove_digits = str.maketrans('', '', '0123456789')
text = text.translate(remove_digits)
# Remove ASCII punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Split the text into a list of words; each word is one token
text = text.split()

print('Number of tokens:', len(text), '\n')
print(text[:50])

Number of tokens: 43815 

['aesthetics', 'is', 'a', 'branch', 'of', 'philosophy', 'that', 'deals', 'with', 'the', 'nature', 'of', 'beauty', 'and', 'taste', 'as', 'well', 'as', 'the', 'philosophy', 'of', 'art', 'its', 'own', 'area', 'of', 'philosophy', 'that', 'comes', 'out', 'of', 'aesthetics', 'it', 'examines', 'subjective', 'and', 'sensoriemotional', 'values', 'or', 'sometimes', 'called', 'judgments', 'of', 'sentiment', 'and', 'tasteaesthetics', 'covers', 'both', 'natural', 'and']


In [9]:
vocabulary = {}

for i in range(len(text) -1):
    # The current token is key
    key = text[i]
    # The next token is the assigned value.
    value = text[i+1]
    
    # Check if the key is already in the dictionary.
    if key in vocabulary:
        # If yes, append the value to this entry.
        vocabulary[key].append(value)
    else:
        # Otherwise create a new entry with the key.
        vocabulary[key] = [value]
        
# Show the possible next words for 'some'
print(vocabulary['some'])

['works', 'matters', 'separate', 'aesthetics', 'extent', 'thcentury', 'complex', 'of', 'sociologists', 'works', 'delay', 'doubt', 'researchers', 'traditions', 'of', 'debate', 'of', 'such', 'way', 'questions', 'things', 'physical', 'examples', 'insight', 'critics', 'phenomena', 'of', 'familiar', 'hailed', 'to', 'observers', 'branches', 'of', 'class', 'researchers', 'loss', 'statisticians', 'generally', 'similarity', 'of', 'notion', 'measure', 'training', 'nonlinear', 'successful', 'fields', 'social', 'of', 'important', 'of', 'sort', 'thinkers', 'of', 'contemporary', 'extent', 'relevant', 'literary', 'code', 'important', 'of', 'theorists', 'have', 'real', 'collection', 'way', 'borderline', 'given', 'other', 'of', 'of', 'respects', 'physical', 'panpsychists', 'respects', 'underlying', 'other', 'sense', 'philosophers', 'change', 'way', 'philosophers', 'aspect', 'philosophers', 'philosophers', 'of', 'experiential', 'computer', 'brain', 'patients', 'sense', 'take', 'modern', 'philosophers', 

In [10]:
# Return a randomly selected token from our list of options. 
import random

def next_token(key):
    
    # First check if the key is in the dictionary.
    if key not in vocabulary:
        # If not: pick a random key.
        key = random.choice(list(vocabulary))
        
    # Get all options for this key.
    options = vocabulary[key]
    # Return a random choice of this list.
    return random.choice(options)

print(next_token('a'))

long
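
The fallback matters for words because a word that occurs only at the very end of the text never becomes a key. The sketch below reproduces that situation with a tiny made-up word list; `toy_words`, `toy_vocab` and `next_word` are hypothetical names used only here.

```python
import random

toy_words = ['the', 'cat', 'saw', 'the', 'dog']

toy_vocab = {}
for current, following in zip(toy_words, toy_words[1:]):
    toy_vocab.setdefault(current, []).append(following)

def next_word(key):
    # 'dog' is the last word, so it has no entry: fall back to a random known key
    if key not in toy_vocab:
        key = random.choice(list(toy_vocab))
    return random.choice(toy_vocab[key])

print('dog' in toy_vocab)   # False
print(next_word('dog'))     # one of 'cat', 'saw', 'the', 'dog'
```

Without the fallback, `next_word('dog')` would raise a `KeyError` as soon as the generated text reached that word.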


In [11]:
generated_text = ['a'] # We start with this as input.

# Execute 100 times
for i in range(100):

    # The last token of generated_text is the key to get the next token.
    # Pick one token for this key and append it to the list of generated tokens.
    generated_text.append(next_token(generated_text[-1]))
    
# We print the generated text once when the for-loop has finished.
generated_text = ' '.join(generated_text)
print(generated_text)

a natural processes computational approaches gaining in point of digital computer science will not merely on how meaningful on from the structure and evolutionary aesthetics have no influence on the bbc produced a way neardeath research the particular sciences but rather experiential dualism include principal focus more recognized that the other animal cognition theories attempt to nature he suggested in particular set of the question of universals the structure and program semantics but one can be inadequate for example the mental phenomena in brazil whose philosophical considerations beauty was still exists in some philosophers have been influential in terms of scientific
