# Markov Chain - Second order text generation

# Second Order Character Generation

This is the first Markov chain we build. The probability of each character depends only on the character directly before it.

To model this, we create a dictionary called 'vocabulary'. For each distinct token in our text, it stores a list of all tokens that follow it.
To generate text, we pick a random token from this list, as we did in the first-order text generation.

## Create a Vocabulary (Training)

### Read a new text and clean it

In [1]:
# Read a file into the variable text
import string

# Assumption: the file is UTF-8 encoded (it contains German umlauts).
with open('data/alles-macht-weiter.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Replace line breaks with spaces and lowercase everything.
text = text.replace("\n", " ").lower()
# Transliterate umlauts (after lowercasing, one rule per umlaut is enough).
text = text.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")
# Remove digits and ASCII punctuation.
text = text.translate(str.maketrans('', '', string.digits + string.punctuation))

print('Number of tokens:', len(text), '\n')
print(text[:50])

Number of tokens: 1161 

die geschichtenerzaehler machen weiter die autoind
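
The same cleaning steps are needed again for the word model further down, so they could be collected in one function. The following is only a sketch: `clean_text` is a hypothetical helper name that does not appear in the notebook, and it assumes that lowercasing before the umlaut replacement is acceptable (it yields the same result as the cell above).

```python
import string

def clean_text(raw):
    """Normalize raw text: lowercase, transliterate umlauts, drop digits and punctuation."""
    # Replace line breaks with spaces and lowercase first,
    # so that one replacement rule per umlaut is enough.
    text = raw.replace("\n", " ").lower()
    for umlaut, ascii_form in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(umlaut, ascii_form)
    # Remove digits and ASCII punctuation in a single translate call.
    return text.translate(str.maketrans("", "", string.digits + string.punctuation))

print(clean_text("Die Geschichtenerzähler machen weiter!"))
# die geschichtenerzaehler machen weiter
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as 'ß' or typographic dashes stay in the text.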


In [2]:
# Show the first five characters, each with the character that follows it.

for i in range(5):
    print(text[i], "->", text[i+1])

d -> i
i -> e
e ->  
  -> g
g -> e


### Build vocabulary

In [3]:
vocabulary = {}

# Loop through all tokens (except the last one).
for i in range(len(text) -1):
    # The current token is key
    key = text[i]
    # The next token is the assigned value.
    value = text[i+1]
    
    # Check if the key is already in the dictionary.
    if key in vocabulary:
        # If yes, append the value to this entry.
        vocabulary[key].append(value)
    else:
        # Otherwise create a new entry with the key.
        vocabulary[key] = [value]
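
The loop above could also be written more compactly. The sketch below is one possible variant, not the notebook's own code: `build_vocabulary` is a hypothetical function name, and it relies on `collections.defaultdict` to create the empty list for a new key automatically. It works for a string of characters as well as for a list of words.

```python
from collections import defaultdict

def build_vocabulary(tokens):
    """Map every token to the list of tokens that directly follow it."""
    vocabulary = defaultdict(list)
    # zip pairs each token with its successor; the last token has none.
    for current, following in zip(tokens, tokens[1:]):
        vocabulary[current].append(following)
    # Convert back to a plain dict so that unknown keys are not created silently later.
    return dict(vocabulary)

toy = build_vocabulary("abcab")
print(toy)
# {'a': ['b', 'b'], 'b': ['c'], 'c': ['a']}
```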

In [4]:
# Show all possible next characters for specific characters
print("a---",vocabulary['a'])
print("c---",vocabulary['c'])

a--- ['e', 'c', 'u', 'c', 'r', 'c', 'c', 'e', 'c', 'c', 's', 'p', 'c', 'e', 'c', 'g', 'c', 'c', 'u', 'u', 'u', 'u', 'u', 'u', 'n', 'n', 'c', 'n', 'e', 'e', 'u', 'ß', 's', 'n', 'u', ' ', 'l', 'p', 'u', 'r', 't', 's', 'c', 's', 'd', 'u', 'u', 'u', 'n', 'u', 'u', 'g', 'k', 't', 'n', 'k', 't', 'u', 'e', 'c', 'h', 'c', 'e', 'e', 'c', 'd', 'c', 'e', 'c', 'u', 'l', 'g', 'c', 'l', 'n', 'c', 'u', 'c', 'c', 'u', 'u', 'u', 'p']
c--- ['h', 'h', 'h', 'h', 'h', 'h', 'k', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'k', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'h', 'k']
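
Because every occurrence is stored, a follower that appears more often in the list is also picked more often. To make these transition probabilities visible, the lists can be counted. The sketch below uses a small made-up list rather than the notebook's data, and `transition_probabilities` is a hypothetical helper name.

```python
from collections import Counter

def transition_probabilities(options):
    """Turn a list of followers into a dict of relative frequencies."""
    counts = Counter(options)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

# Toy example: 'h' follows three times, 'k' once.
print(transition_probabilities(['h', 'h', 'h', 'k']))
# {'h': 0.75, 'k': 0.25}
```

Applied to an entry such as `vocabulary['c']`, this would show how strongly a single follower dominates.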


## Text generation

### Build a function to pick the value

Since we repeat this operation for every generated character, we wrap it in a function that picks the next character.

In [5]:
import random

def next_token(key):
    ''' Return a randomly selected token from our list of options. '''
    
    # Get all options stored in the dictionary for this key.
    options = vocabulary[key]
    
    # Pick one.
    choice = random.choice(options)
    
    # Return this value.
    return choice

print(next_token('a'))

r


### Start the generating process

In [6]:
generated_text = 'a' # We start with this as input.

# execute 50 times
for i in range(50):
    
    # The last token of generated_text is the key to get the next token.
    key = generated_text[-1]
    
    # Pick one token for this key.
    choice = next_token(key)
    
    # Append this token to the generated text.
    generated_text += choice
    
# We print the generated text once when the for-loop has finished.
print(generated_text)

achr weienerochen gen malls maumocht weiteriteien l
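
The generation loop can be wrapped in a function so that it is reusable and, if wanted, reproducible. This is a sketch under assumptions: the name `generate` and its `seed` parameter are not part of the notebook. It also stops early if the last token has no stored followers, which can happen when a token only occurs at the very end of the training text.

```python
import random

def generate(vocabulary, start, length, seed=None):
    """Generate up to `length` tokens after `start` by following the chain."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    tokens = [start]
    for _ in range(length):
        options = vocabulary.get(tokens[-1])
        if not options:
            break  # dead end: no follower is known for this token
        tokens.append(rng.choice(options))
    return tokens

# Toy vocabulary: 'a' is always followed by 'b' and 'b' by 'a'.
toy_vocabulary = {'a': ['b'], 'b': ['a', 'a']}
print(''.join(generate(toy_vocabulary, 'a', 6)))
# abababa
```

The function returns a list of tokens, so characters are joined with `''` and words with `' '`.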


# Second Order Word Generation

Since there are many more possible words than characters, word generation needs a larger data set.
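
One rough way to judge whether a text is large enough is to check how many keys have only a single distinct follower, because for those keys the generator can only reproduce the source. The sketch below uses a made-up toy vocabulary, and `share_of_single_follower_keys` is a hypothetical helper name.

```python
def share_of_single_follower_keys(vocabulary):
    """Return the fraction of keys that have exactly one distinct follower."""
    single = sum(1 for followers in vocabulary.values() if len(set(followers)) == 1)
    return single / len(vocabulary)

# Toy vocabulary: only 'the' offers a real choice.
toy_vocabulary = {
    'the': ['mind', 'brain', 'mind'],
    'mind': ['is'],
    'brain': ['is', 'is'],
}
print(share_of_single_follower_keys(toy_vocabulary))
```

The closer this value is to 1, the less variation the generated text can show.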

In [7]:
import string

# Assumption: the file is UTF-8 encoded.
with open('data/wiki_selection.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Same cleaning steps as for the character model.
text = text.replace("\n", " ").lower()
text = text.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")
text = text.translate(str.maketrans('', '', string.digits + string.punctuation))

# Split the text into a list of words.
text = text.split()

print('Number of tokens:', len(text), '\n')
print(text[:50])

Number of tokens: 43815 

['aesthetics', 'is', 'a', 'branch', 'of', 'philosophy', 'that', 'deals', 'with', 'the', 'nature', 'of', 'beauty', 'and', 'taste', 'as', 'well', 'as', 'the', 'philosophy', 'of', 'art', 'its', 'own', 'area', 'of', 'philosophy', 'that', 'comes', 'out', 'of', 'aesthetics', 'it', 'examines', 'subjective', 'and', 'sensoriemotional', 'values', 'or', 'sometimes', 'called', 'judgments', 'of', 'sentiment', 'and', 'tasteaesthetics', 'covers', 'both', 'natural', 'and']


In [8]:
vocabulary = {}

for i in range(len(text) -1):
    # The current token is key
    key = text[i]
    # The next token is the assigned value.
    value = text[i+1]
    
    # Check if the key is already in the dictionary.
    if key in vocabulary:
        # If yes, append the value to this entry.
        vocabulary[key].append(value)
    else:
        # Otherwise create a new entry with the key.
        vocabulary[key] = [value]
        
# Show the possible next tokens for the word 'some'.
print(vocabulary['some'])

['works', 'matters', 'separate', 'aesthetics', 'extent', 'thcentury', 'complex', 'of', 'sociologists', 'works', 'delay', 'doubt', 'researchers', 'traditions', 'of', 'debate', 'of', 'such', 'way', 'questions', 'things', 'physical', 'examples', 'insight', 'critics', 'phenomena', 'of', 'familiar', 'hailed', 'to', 'observers', 'branches', 'of', 'class', 'researchers', 'loss', 'statisticians', 'generally', 'similarity', 'of', 'notion', 'measure', 'training', 'nonlinear', 'successful', 'fields', 'social', 'of', 'important', 'of', 'sort', 'thinkers', 'of', 'contemporary', 'extent', 'relevant', 'literary', 'code', 'important', 'of', 'theorists', 'have', 'real', 'collection', 'way', 'borderline', 'given', 'other', 'of', 'of', 'respects', 'physical', 'panpsychists', 'respects', 'underlying', 'other', 'sense', 'philosophers', 'change', 'way', 'philosophers', 'aspect', 'philosophers', 'philosophers', 'of', 'experiential', 'computer', 'brain', 'patients', 'sense', 'take', 'modern', 'philosophers', 

In [9]:
# Return a randomly selected token from our list of options. 
import random

def next_token(key):
    
    # First check if the key is included in the dictionary.
    if key not in vocabulary:
        # If not: pick a random key.
        key = random.choice(list(vocabulary))
        
    # Get all options for this key.
    options = vocabulary[key]
    
    # Return a random choice of this list.
    return random.choice(options)

print(next_token('a'))

paradigm


In [10]:
generated_text = ['a'] # We start with this as input.

# Repeat 100 times.
for i in range(100):
    
    # The last token of generated_text is the key to get the next token,
    # which we append to the list of generated tokens.
    generated_text.append(next_token(generated_text[-1]))
    
# We print the generated text once when the for-loop has finished.
generated_text = ' '.join(generated_text)
print(generated_text)

a computer science and information mental representations but electrochemical propertiesa related to modular brain this is not inductive reasoning or less interested in an attempt to deviations and focus on how does not by judea pearl in the self – and verification computer application of physical depending on making the symbolic interactionism is real than astronomy served to the mind is innate and models that if the study of cultural factors play a discipline some works such as islamic philosophers about the next similarly it is represented by a candle the training the indiana philosophy philosophy of experimental histories to strictly
