# N-Order random text generation

In [1]:
import random

## Zero-Order text generation

Pure random choice: every character is chosen with the same probability. No text is used as a base.

### Create vocabulary

In [2]:
''' Create a list of all characters between A and Z. '''
vocab = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
print(vocab)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


In [3]:
''' Append punctuation and a space. '''
for punct in [',', '.', '!', '?', ' ']:
    vocab.append(punct)

print(vocab)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ',', '.', '!', '?', ' ']


In [4]:
len(vocab)

31

### Pick a random token from the vocabulary

In [5]:
index = random.randint(0, len(vocab) -1)
print(index)
vocab[index]

19


'T'

### Generate zero-order random text

In [6]:
generated_text = '' # Variable to store our generated text.

for i in range(60):
    # Get a random index.
    index = random.randint(0, len(vocab) - 1)
    
    # Get the corresponding token.
    token = vocab[index]
    
    # Append it to our generated text.
    generated_text += token

# Print the generated text after the loop has ended.
print(generated_text)

N,MFZNL.SFFRW,JVHHL KLLHARIKVU?V.PM MWO!GOAKPG ,IY,LSP?,B?ZB
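
The loop above can also be written more compactly. The following is a minimal sketch, assuming the same 31-token vocabulary; `string.ascii_uppercase` and `random.choices` are standard-library tools that the cells above do not use, and the names `zero_order_vocab` and `zero_order_text` are introduced here only for illustration.

```python
import random
import string

# Same vocabulary as above: A-Z plus punctuation and a space (31 tokens).
zero_order_vocab = list(string.ascii_uppercase) + [',', '.', '!', '?', ' ']

# random.choices() draws k tokens with replacement; without weights
# every token is equally likely, which is the zero-order case.
zero_order_text = ''.join(random.choices(zero_order_vocab, k=60))

print(zero_order_text)
```

The output differs on every run, because no seed is set.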


## First-Order text generation

Probabilities are derived from an analysis of a text: characters that appear more often in the text have a higher probability of being chosen.<br>

The easiest method is to store all characters of the text in a list.
A character that appears often in the text is stored in the list equally often<br>
and is therefore picked more often.

In [7]:
''' Example. '''

txt = 'aaabaa'

characters = [c for c in txt]
print('characters:', characters, '\n')

for i in range(20):
    print(random.choice(characters), end=' ')

characters: ['a', 'a', 'a', 'b', 'a', 'a'] 

b a a a b a a a a a b a a a a a a a a a 

The list of characters contains 5 'a's and only 1 'b', so a randomly picked character will be an 'a' most of the time (with probability 5/6).
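
To make the implied probabilities explicit, the relative frequencies can be counted. This is a small sketch using `collections.Counter` from the standard library, which is an addition for illustration and not part of the cells above; the names `counts` and `probabilities` are hypothetical.

```python
from collections import Counter

txt = 'aaabaa'

# Count how often each character occurs in the text.
counts = Counter(txt)

# Relative frequency = count / total number of characters.
probabilities = {char: count / len(txt) for char, count in counts.items()}

print(counts['a'], counts['b'])
print(probabilities)
```

Picking from the list with `random.choice` draws each character with exactly these relative frequencies.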

In [8]:
''' Frisch, Max. Schwarzes Quadrat. Zwei Poetikvorlesungen. Frankfurt am Main: Suhrkamp, 2008. 73-74. '''

txt = '''
– Die POESIE ist zweckfrei.
(Schon das macht sie zur Irritation.)

– Die POESIE muss kein Kabinett bilden, zum Beispiel, und muss nicht von einer analphabetischen Mehrheit gewählt werden.

– Die POESIE ist da oder manchmal auch nicht.
(Regierungen sind immer da.)

– Die POESIE kann ignoriert werden.
(Ohne dass die Polizei deswegen eingreift.)

– Die POESIE entsteht trotzdem da und dort.

– Die POESIE ist der Durchbruch zur genuinen Erfahrung unsrer menschlichen Existenz in ihrer geschichtlichen Bedingtheit. Sie befreit uns zur Spontaneität – was beides sein kann: Glück oder Schrecken.
(Regierungen wollen immer nur unser Glück.)

– Die POESIE macht uns betroffen.
(Lebendig.)

– Die POESIE unterwandert unser ideologisiertes Bewusstsein und insofoern ist sie subversiv in jedem gesellschaftlichen System.
(Platon hat natürlich recht: der Poet ist als Staatsbürger dubios, auch wenn er seine Steuern zahlt, auch wenn er als Soldat gehorcht, damit er nicht von seinen eignen Leuten erschossen wird; solange er aber nicht erschossen ist, bleibt er ein Poet.)

– Die POESIE muss keine Massnahmen ergreifen.
(Sie muss nur Poesie sein.)

– Die POESIE findet sich nicht ab (im Gegensatz zur Politik) mit dem Machbaren; sie kann nicht lassen von der Trauer, dass das Menschsein auf dieser Erde nicht anders ist.

– Die POESIE sagt nicht, wohin mit dem Atom-Müll.
(Rezepte sind von ihr nicht zu erwarten.)

– Die POESIE ist arrogant.
(Sie entzieht sich der Pflicht, die Welt zu regieren.)

– Die POESIE ist unbrauchbar.
(Es genügt ihr, dass sie da ist: als Ausdruck unseres profunden Ungenügens und unsrer profunden Sehnsucht.)

– Die POESIE wahrt die Utopie.
'''
print(txt[:200])


– Die POESIE ist zweckfrei.
(Schon das macht sie zur Irritation.)

– Die POESIE muss kein Kabinett bilden, zum Beispiel, und muss nicht von einer analphabetischen Mehrheit gewählt werden.

– Die POES


### Generate first-order random text

In [9]:
characters = [c for c in txt]

for i in range(50):
    print(random.choice(characters), end='')

m e it  bSgieSbDs ycfbdasaSr,WeoePiuEiedc ele..O.r

## Second-Order text generation

The probability of a character depends on its left neighbor.<br>
For that we create a dictionary called `vocabulary`. For each distinct token of our text we store all tokens that follow it.<br>
When we generate text we pick a random token from this list, just as we did in the first-order text generation.
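
As a small illustration of the idea, here is a sketch on a toy string. The string `'abracadabra'` and the names `toy_text` and `followers` are assumptions made for this example; `collections.defaultdict` is used so that no explicit key check is needed, which is a different style from the cell below.

```python
from collections import defaultdict

toy_text = 'abracadabra'  # hypothetical example text

# Map each character to the list of characters that directly follow it.
followers = defaultdict(list)
for current, following in zip(toy_text, toy_text[1:]):
    followers[current].append(following)

for key, options in followers.items():
    print(repr(key), '->', options)
```

An 'a' is followed by 'b' twice and by 'c' and 'd' once each, so after an 'a' a 'b' would be picked half of the time.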

### Create vocabulary

In [10]:
''' Create a vocabulary.
Store for each character all characters that are next to it in a dictionary.'''

vocabulary = {}

# Loop through the text:
for i in range(len(txt) -1): # we add -1 because the last token of our text has no following next token.
    
    # The current token is the key for our dictionary.
    key = txt[i]
    
    # The next token (i+1) is the corresponding value.
    value = txt[i+1]
    
    # Check if the key exists in the dictionary already.
    if key in vocabulary:
        # If yes, append the value to the list.
        vocabulary[key].append(value)
        
    # If not, insert the new key + the value in form of a [list].
    else:
        vocabulary[key] = [value]

In [11]:
''' Print all options for one character. '''
key = 'P'
vocabulary[key]

['O',
 'O',
 'O',
 'O',
 'o',
 'O',
 'O',
 'O',
 'O',
 'l',
 'o',
 'o',
 'O',
 'o',
 'O',
 'o',
 'O',
 'O',
 'f',
 'O',
 'O']

Some tokens appear more often than others, so they are more likely to be chosen.<br>
With a small dataset, the options for a key are limited. In the example above, with 'P' as key, the next character can only be 'O', 'o', 'l' or 'f'.<br>
No other token is possible.
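
Counting the options makes the implied probabilities visible. The sketch below copies the list shown in the output above and summarises it with `collections.Counter`; the Counter step and the names `options`, `counts` and `total` are additions for illustration only.

```python
from collections import Counter

# Options for the key 'P', copied from the output above.
options = ['O', 'O', 'O', 'O', 'o', 'O', 'O', 'O', 'O', 'l', 'o',
           'o', 'O', 'o', 'O', 'o', 'O', 'O', 'f', 'O', 'O']

counts = Counter(options)
total = len(options)

# Print each possible next token with its count and relative frequency.
for token, count in counts.most_common():
    print(repr(token), count, round(count / total, 3))
```

Of the 21 stored options, 14 are 'O', so after a 'P' an 'O' is chosen two thirds of the time.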

Next we write a function that takes a key (like 'P') as argument and returns one possible next token.

In [12]:
''' Return a randomly selected token from our list of options. '''

def next_token(key):
    
    # Get all options stored in the dictionary for this key.
    options = vocabulary[key]
    
    # Pick one.
    choice = random.choice(options)
    
    # Return this value.
    return choice

print(next_token('P'))

O


### Generate second-order random text

To generate our text we create a variable with some input (at least one key).

Then we run a loop. The argument for `range` defines how many tokens we will generate.

In each iteration of our loop we call the function `next_token()` and append the returned token to our text.

In [13]:
''' Generate text. '''

generated_text = '– Die POESIE' # We start with this as input.

# The code below is executed 50 times to append 50 characters.

for i in range(50):
    
    # The last token of generated_text is the key to get the next token.
    key = generated_text[-1]
    
    # Pick one token for this key.
    choice = next_token(key)
    
    # Append this token to the generated text.
    
    generated_text += choice
    
    
    # We could write the code above into one line:
#     generated_text += next_token(generated_text[-1])

# We print the generated text once when the for-loop has finished.
print(generated_text)

– Die POESIESt Dinnge dar anda.)
– er iftr g.
– Mürrwaungichth


## N-Order text generation (3rd and above)

In principle this works just like second-order text generation, except that the key used to predict the next token consists of more than one token.

![ngrams.png](images/ngrams.png)

Third-Order (n=2) means that we use two tokens as key,<br>
Fourth-Order (n=3) means that we use three tokens as key,<br>
...

We will write more flexible code and use a variable `n` to define how many tokens are used as key.<br>
That makes it easy to change the order later.
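
To see how keys and values line up for different values of `n`, here is a minimal sketch on a toy input. The helper `ngram_pairs` and the input `'POESIE'` are hypothetical and only serve this illustration; the slicing follows the same pattern as the vocabulary cell in this section.

```python
def ngram_pairs(text, n):
    """Return (key, next_token) pairs, where each key is n characters long."""
    return [(text[i:i + n], text[i + n]) for i in range(len(text) - n)]

# Hypothetical toy input to show how keys and values line up.
print(ngram_pairs('POESIE', 2))
print(ngram_pairs('POESIE', 3))
```

With `n = 2` the first pair is `('PO', 'E')`: the key covers two characters and the value is the character right after them. A larger `n` gives longer keys and fewer pairs.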

In [14]:
''' Create a vocabulary.
Store every sequence of n tokens as key and the tokens that follow it as values. '''

n = 2
vocabulary = {}

for i in range(len(txt) -n): # Now it's important to stop the loop at len() - n.
    
    # The n tokens starting at position i (txt[i] up to txt[i+n-1]) form the key.
    key = txt[i:i+n]
    
    # The next token after the last token of key is the corresponding value.
    value = txt[i+n]
    
    # First check if the key exists in the dictionary already.
    if key in vocabulary:
        # If yes, append the value to the list.
        vocabulary[key].append(value)
        
    # Else insert the new key + the value in form of a [list].
    else:
        vocabulary[key] = [value]
        
''' Function to return a randomly selected character from our list of options.
This is similar to the function we used above, but we first check if a key exists.
If not, we pick a random key of our dictionary. '''

def next_token(key):
    
    # First check if the key is included in the dictionary.
    
    if key not in vocabulary:
        # If not: pick a random key.
        key = random.choice(list(vocabulary.keys()))
        
    # Get all options for this key.
    options = vocabulary[key]
    
    # Return a random choice of this list.
    return random.choice(options)

In [15]:
''' Test: print all options for one key. 
Make sure that the key has the length defined in n. '''

key = random.choice(list(vocabulary.keys()))
print('key:', key)
print('options:')
vocabulary[key]

key: at
options:


['i', 'o', ' ', 'ü', 's', ' ', 'z']

In [16]:
''' Test: pick a random next token. '''
next_token(key)

' '

### Generate n-order random text

In [17]:
''' Generate text. '''

generated_text = '– Die POESIE' # We start with this as input.

for i in range(50):
    
    # The last n tokens of generated_text are the key to get the next token.
    key = generated_text[-n:]
    
    # Pick one token for this key.
    choice = next_token(key)
    
    # Append this token to the generated text.
    
    generated_text += choice
    
    # The code above as one line:
#     generated_text += next_token(generated_text[-n:])
    
# We print the generated text once when the for-loop has finished.
print(generated_text)

– Die POESIE ihr, undenserd; sie Welt von erdem Gegewussnalsch


## N-Order text generation with probability table

(This is similar to the code above, but it creates a probability table to choose from instead of a list that contains every possible token, repeated once per occurrence.)

For an introduction to this, have a look at the last part of [this Notebook](https://github.com/experimental-informatics/hands-on-python/blob/master/dictionary_list.ipynb) about lists and dictionaries.

*This method yields the same distribution as working without a probability table, since the distribution is already implied by the repeated entries in the list.*

*But for a longer and more complex text the table is more compact, because each distinct next token is stored only once per key.*



Keep in mind that every key may have more than one possible next token.

So we need to create a `nested dictionary` to store the probability values.

It might look like this, with a `dictionary` inside a `dictionary`, for example:

```python
{
    'a':{'a':0.1, 'b':0.09, 'c':0.3, 'd':0, ...},
    'b':{'a':0.03, 'b':0.02, 'c':0.14, 'd':0.04, ...},
    'c':{'a':0.06, 'b':0.1, 'c':0.17, 'd':0.02, ...},
    .
    .
    .
}
```

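Hence `'a'` and `'b'` belong together as a pair: the outer key is the current token (e.g. `'a'`) and the inner key is the next token (e.g. `'b'`), and together they address one probability value.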

All probability values for one key sum up to 1 (100%).
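
A nested table of this shape can be sketched on a toy string. This is a minimal illustration, assuming the hypothetical input `'abracadabra'` and `n = 1`; it uses `collections.Counter` and `defaultdict` for brevity, which is a different style from the cell below, and the names `toy_text`, `counts` and `table` are introduced only here.

```python
from collections import Counter, defaultdict
import math

toy_text = 'abracadabra'  # hypothetical example text
n = 1

# Count, for every key of length n, how often each next token follows it.
counts = defaultdict(Counter)
for i in range(len(toy_text) - n):
    counts[toy_text[i:i + n]][toy_text[i + n]] += 1

# Turn the counts into probabilities: count / total count for that key.
table = {
    key: {token: c / sum(followers.values()) for token, c in followers.items()}
    for key, followers in counts.items()
}

print(table['a'])

# Check that the probabilities of every key sum to 1.
print(all(math.isclose(sum(p.values()), 1.0) for p in table.values()))
```

`math.isclose` is used for the check because sums of floating-point fractions are not always exactly 1.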

In [18]:
n = 5

vocabulary={}
for i in range(len(txt) -n):
    key = txt[i:i+n]
    value = txt[i+n]
    # Check if the key exists.
    if key in vocabulary:
        # If yes, append the value.
        vocabulary[key].append(value)
    # Else insert a new key + value.
    else:
        vocabulary[key] = [value]
        
''' Calculate the probability. '''

for key, value in vocabulary.items():
    # Number of next tokens stored for this key.
    length = len(value)
    # Count how often each next token occurs.
    temporary_dic = {}
    for char in value:
        if char not in temporary_dic:
            temporary_dic[char] = 1
        else:
            temporary_dic[char] += 1
    # Uncomment the next line to show all counts.
#     print(key, temporary_dic)

    # Convert the counts into probabilities (count / total).
    for _keys, amount in temporary_dic.items():
        temporary_dic[_keys] = amount / length
    # Replace the list of next tokens with the probability table.
    vocabulary[key] = temporary_dic

Now we create a function to pick the next token based on our dictionary, with probabilities as their weights.

In [19]:
''' Return a randomly selected token from our list of options. '''

def next_token(key):

    # Check if key is included in the vocabulary.
    if key not in vocabulary:
        # If not, pick a random key from the vocabulary.
        key = random.choice(list(vocabulary.keys()))

    # Otherwise we'll use the key given as argument.
    
    # Return the next token for the key.
    # random.choices() returns a list (with one element by default), so [0] extracts the token.
    return random.choices(list(vocabulary[key].keys()), weights=vocabulary[key].values())[0]
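
To check that weighted picking behaves as expected, many draws can be compared with the weights. This is a small sketch with a hypothetical one-key table (`probabilities`) and a fixed seed chosen only to make the run repeatable; neither is part of the notebook's data.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical probability table entry for a single key.
probabilities = {'O': 0.75, 'o': 0.25}

# Draw many tokens; each call returns a one-element list, hence the [0].
draws = [
    random.choices(list(probabilities.keys()),
                   weights=list(probabilities.values()))[0]
    for _ in range(10000)
]

# The observed share of each token should be close to its weight.
observed = Counter(draws)
print({token: count / len(draws) for token, count in observed.items()})
```

The observed shares should land near 0.75 and 0.25, with small deviations that shrink as the number of draws grows.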

### Generate n-order random text

In [20]:
''' Generate text. '''

generated_text = '– Die POESIE'

for i in range(200):
    generated_text += next_token(generated_text[-n:])
    
print(generated_text)

– Die POESIE sagt nicht.
(Regierungen sind immer da.)

– Die POESIE kann ignoriert werden.

– Die POESIE entsteht trotzdem da und insofoern ist, bleibt er ein Poet.)

– Die POESIE sagt nicht.
(Regierungen sind im
