# First-Order text generation

The probabilities come from analyzing a text: characters that appear more often in the text have a higher probability of being chosen.

The easiest method is to simply store all characters in a list. Characters that appear often in the text are more often stored in the list and thus picked more often.

In [1]:
import random 

txt = 'rose is a rose is a rose!'

char = list(txt)  # one list entry per character, duplicates included
print('characters:', char, '\n')


for i in range(20):
    print(random.choice(char), end=' ')

characters: ['r', 'o', 's', 'e', ' ', 'i', 's', ' ', 'a', ' ', 'r', 'o', 's', 'e', ' ', 'i', 's', ' ', 'a', ' ', 'r', 'o', 's', 'e', '!'] 

a o s a s o i o   o i s r o i s e   r ! 
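
To check that drawing from the list really reproduces the character frequencies, one can draw many samples and compare the two. The following is a minimal sketch; the seed and the sample size are arbitrary values chosen for illustration, not part of the original example.

```python
import random
from collections import Counter

txt = 'rose is a rose is a rose!'
chars = list(txt)

random.seed(1234)    # arbitrary seed, only to make the run repeatable
n_samples = 10_000   # arbitrary sample size (assumption)
samples = [random.choice(chars) for _ in range(n_samples)]

expected = Counter(chars)    # how often each character occurs in the text
observed = Counter(samples)  # how often each character was drawn

# print the share in the text next to the share among the samples
for c in sorted(expected):
    print(repr(c),
          round(expected[c] / len(chars), 3),
          round(observed[c] / n_samples, 3))
```

With enough samples, the two columns should be close to each other for every character.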

## Probability 

In the first example, the probability of each character is implied by how often it occurs in the sentence. Let us count how often a given character occurs.

In [2]:
# count how many times 'r' occurs in the sentence

count = 0
for i in txt:
    if i == 'r':
        count += 1
print("there are "+str(count)+" r in the sentence "+txt)

there are 3 r in the sentence rose is a rose is a rose!
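
The same count can also be obtained with the built-in `str.count` method, and dividing by the length of the sentence gives the probability of drawing that character. A short sketch, where the variable names `r_count` and `r_prob` are only illustrative:

```python
txt = 'rose is a rose is a rose!'

# str.count returns the number of non-overlapping occurrences of a substring
r_count = txt.count('r')

# relative frequency = occurrences / total number of characters
r_prob = r_count / len(txt)

print(r_count, r_prob)
```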


The next example uses a dictionary that stores each character as a key and its count as the value.
Before that, here is what we could do without dictionaries: a plain list with one counter per letter.

In [3]:
# count how often each letter occurs in the sentence
text = '''rose is a rose is a rose is a rose'''

# initialize one counter per letter 'a'..'z'
dic = [0 for i in range(26)]

# map each letter to a list index by shifting its character code:
# 'a' becomes 0, 'b' becomes 1, and so on
# ord('a') - ord('a') equals 97 - 97 = 0

for c in text:
    if c == ' ':
        continue
    dic[ord(c) - ord('a')] += 1
        

for c in range(len(dic)):
    print(str(chr(c+97))+' : '+str(dic[c]))
    

a : 3
b : 0
c : 0
d : 0
e : 4
f : 0
g : 0
h : 0
i : 3
j : 0
k : 0
l : 0
m : 0
n : 0
o : 4
p : 0
q : 0
r : 4
s : 7
t : 0
u : 0
v : 0
w : 0
x : 0
y : 0
z : 0
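
The list-based counter above only works for lowercase letters. A character such as `'!'` yields an index outside the list and raises an `IndexError`, while an uppercase letter yields a negative index that silently lands in the wrong slot (for example, `ord('R') - ord('a')` is -15, which Python reads as index 11, the slot for `'l'`). A minimal sketch of one possible guard, using a made-up sample sentence and an illustrative `skipped` list:

```python
text = 'Rose is a rose!'   # sample input chosen for illustration

counts = [0] * 26          # one counter per letter 'a'..'z'
skipped = []               # everything that is not a lowercase ASCII letter

for c in text.lower():
    if 'a' <= c <= 'z':
        counts[ord(c) - ord('a')] += 1
    else:
        skipped.append(c)

# print only the letters that actually occur
for i, n in enumerate(counts):
    if n > 0:
        print(chr(i + ord('a')), ':', n)
print('skipped:', skipped)
```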


The following example does the same with a dictionary, Python's built-in data structure for key-value pairs.

In [4]:
# count how often each character occurs in the sentence
text = '''rose is a rose is a rose is a rose'''

dic = {}
for c in text:
    if c not in dic:
        dic[c] = 1
    else:
        dic[c] += 1

# iterate over the (character, count) pairs and compute each relative frequency
for index, value in dic.items():
    print(index, value, value / len(text))

print()

# a number can be formatted as a percentage like this
print("{:.2%}".format(0.001))

# so the relative frequencies can be printed as percentages
print()
for index, value in dic.items():
    print(index, value, "{:.2%}".format(value / len(text)))
    

r 4 0.11764705882352941
o 4 0.11764705882352941
s 7 0.20588235294117646
e 4 0.11764705882352941
  9 0.2647058823529412
i 3 0.08823529411764706
a 3 0.08823529411764706

0.10%

r 4 11.76%
o 4 11.76%
s 7 20.59%
e 4 11.76%
  9 26.47%
i 3 8.82%
a 3 8.82%
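
The counting loop above can also be written with `collections.Counter` from the standard library, which builds the same character-to-count mapping in one call. This is an alternative sketch, not the approach used in the cells of this notebook; the name `probs` is illustrative.

```python
from collections import Counter

text = 'rose is a rose is a rose is a rose'

counts = Counter(text)          # maps each character to its count
total = sum(counts.values())    # equals len(text)

# relative frequency of every character
probs = {c: n / total for c, n in counts.items()}

for c, p in probs.items():
    print(repr(c), counts[c], f"{p:.2%}")
```

Since every character is counted exactly once, the relative frequencies sum to 1 and can be used directly as probabilities.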


## Calculating the probability of words

In [5]:
# count how often each word occurs in the sentence
text = '''rose is a rose is a rose is a rose'''

# split on whitespace to get a list of words
text = text.split()

dic = {}
for c in text:
    if c not in dic:
        dic[c] = 1
    else:
        dic[c] += 1

for index, value in dic.items():
    print(index, value, "{:.2%}".format(value / len(text)))

rose 4 40.00%
is 3 30.00%
a 3 30.00%
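
The character and word versions differ only in what is being counted, so the counting can be wrapped in a single function that accepts any sequence. The helper `relative_frequencies` below is hypothetical and introduced here only as a sketch:

```python
from collections import Counter

def relative_frequencies(items):
    """Map each distinct item to its share of all items (hypothetical helper)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}

text = 'rose is a rose is a rose is a rose'

# a list of words gives word probabilities,
# the raw string gives character probabilities
word_probs = relative_frequencies(text.split())
char_probs = relative_frequencies(text)

for word, p in word_probs.items():
    print(word, f"{p:.2%}")
```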


## Probability from a text file

Now we apply the data cleaning we learned and calculate the probability of each character, so that we can work on first-order text generation.

In [6]:
import string

with open('data/alles-macht-weiter.txt', 'r') as f:
    text = f.read()

# replace line breaks with spaces and transliterate German umlauts
text = (text.replace("\n", " ")
            .replace("ä", "ae").replace("Ä", "Ae")
            .replace("ö", "oe").replace("Ö", "Oe")
            .replace("ü", "ue").replace("Ü", "Ue"))
text = text.lower()
# remove digits and ASCII punctuation
remove_digits = str.maketrans('', '', '0123456789')
text = text.translate(remove_digits)
text = text.translate(str.maketrans('', '', string.punctuation))

# splitting the text would count words;
# without the split, the loop below counts characters
# text = text.split()

dic = {}
for c in text:
    if c not in dic:
        dic[c] = 1
    else:
        dic[c] += 1

for index, value in dic.items():
    print(index, value, "{:.2%}".format(value / len(text)))

d 48 4.13%
i 76 6.55%
e 200 17.23%
  178 15.33%
g 27 2.33%
s 33 2.84%
c 38 3.27%
h 48 4.13%
t 65 5.60%
n 100 8.61%
r 62 5.34%
z 11 0.95%
a 83 7.15%
l 21 1.81%
m 33 2.84%
w 31 2.67%
u 45 3.88%
o 10 0.86%
b 14 1.21%
k 7 0.60%
’ 2 0.17%
p 12 1.03%
f 10 0.86%
ß 3 0.26%
v 3 0.26%
… 1 0.09%
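
The output above still contains `’` and `…` because `string.punctuation` only lists ASCII punctuation. One possible way to drop typographic punctuation as well is to test each character's Unicode category with `unicodedata.category`. The sketch below is an assumption about how the cleaning could be extended; the function `clean` and the sample sentence are made up for illustration and do not come from the text file.

```python
import unicodedata

UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def clean(text):
    """Lowercase, transliterate umlauts, drop digits and Unicode punctuation
    (hypothetical helper)."""
    text = text.replace("\n", " ").lower()
    for src, dst in UMLAUTS.items():
        text = text.replace(src, dst)
    kept = []
    for ch in text:
        if ch.isdigit():
            continue
        # all Unicode punctuation categories start with "P" (Po, Pf, Pd, ...)
        if unicodedata.category(ch).startswith("P"):
            continue
        kept.append(ch)
    return "".join(kept)

sample = "Alles geht weiter… 2 Äpfel, ’s ist so!"  # made-up sample text
cleaned = clean(sample)
print(cleaned)
```

Letters such as `ß` are kept, since their category is a letter category rather than punctuation.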


## Sorting 

Sorting is a major topic in algorithms and data structures, and sorting algorithms are often among the first that programmers learn. Many different algorithms can sort a sequence, each with its own time complexity.

Here we use Python's built-in `sorted` function, which implements Timsort, a hybrid of merge sort and insertion sort.

In [7]:
# sorting a list of numbers
counts = [1,3,1,2,6,11,3]
result = sorted(counts,reverse=True)
print(result)

[11, 6, 3, 3, 2, 1, 1]


In [8]:
# sorting a dictionary based on keys
sorted(dic.items())

[(' ', 178),
 ('a', 83),
 ('b', 14),
 ('c', 38),
 ('d', 48),
 ('e', 200),
 ('f', 10),
 ('g', 27),
 ('h', 48),
 ('i', 76),
 ('k', 7),
 ('l', 21),
 ('m', 33),
 ('n', 100),
 ('o', 10),
 ('p', 12),
 ('r', 62),
 ('s', 33),
 ('t', 65),
 ('u', 45),
 ('v', 3),
 ('w', 31),
 ('z', 11),
 ('ß', 3),
 ('’', 2),
 ('…', 1)]

In [9]:
# sort the dictionary's items by value (count), largest first
sorted(dic.items(), key=lambda x: x[1],reverse=True)

[('e', 200),
 (' ', 178),
 ('n', 100),
 ('a', 83),
 ('i', 76),
 ('t', 65),
 ('r', 62),
 ('d', 48),
 ('h', 48),
 ('u', 45),
 ('c', 38),
 ('s', 33),
 ('m', 33),
 ('w', 31),
 ('g', 27),
 ('l', 21),
 ('b', 14),
 ('p', 12),
 ('z', 11),
 ('o', 10),
 ('f', 10),
 ('k', 7),
 ('ß', 3),
 ('v', 3),
 ('’', 2),
 ('…', 1)]
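
In the result above, characters with equal counts (such as `d` and `h`, or `s` and `m`) keep the order in which they were inserted into the dictionary. If a fully predictable order is wanted, the sort key can be a tuple: the negated count first, then the character. A small sketch using a handful of the counts from above:

```python
# a few counts copied from the output above
counts = {'d': 48, 'h': 48, 's': 33, 'm': 33, 'e': 200}

# sort by count descending (negated count), then alphabetically to break ties
by_count = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for char, n in by_count:
    print(char, n)
```

Negating the count avoids `reverse=True`, which would also reverse the alphabetical tie-breaking.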

# First-order text generation based on weights

In [10]:
# weighted random choice: 'c' is three times as likely as 'a' or 'b'
random.choices(
    ['a','b','c'],
    [0.2, 0.2, 0.6],
    k=10)


['c', 'c', 'b', 'b', 'c', 'a', 'b', 'a', 'c', 'a']

In [11]:
# random characters, weighted by their counts in the text
random.choices(
    list(dic.keys()),
    list(dic.values()),
    k=10)

['r', 'e', 'a', 'l', 'n', 'c', 'a', 'r', 'm', 'd']
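
`random.choices` accepts relative weights, so the raw counts can be passed directly without first converting them to probabilities. Joining the drawn characters gives a generated string. The sketch below wraps this in a hypothetical helper `generate`; the seed and the small `counts` dictionary (a subset of the counts above) are assumptions made so that the example is self-contained.

```python
import random

def generate(counts, length, seed=None):
    """Draw `length` characters, each weighted by its count (hypothetical helper)."""
    rng = random.Random(seed)            # separate generator, optional seed
    population = list(counts.keys())
    weights = list(counts.values())      # relative weights, need not sum to 1
    return ''.join(rng.choices(population, weights=weights, k=length))

# subset of the character counts shown above
counts = {'e': 200, ' ': 178, 'n': 100, 'a': 83, 'i': 76}

generated = generate(counts, 60, seed=42)
print(generated)
```

Each character is drawn independently of the previous ones, so the result matches the character frequencies of the source text but not its letter sequences.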