In [3]:
import os
import random

# texts is your list of file paths
texts = [os.path.join("text", i) for i in ["rjt.txt", "mac.txt", "mnd.txt", "ham.txt", "jcr.txt"]]

# 1. Load all files and concatenate
text_full = ""
for path in texts:
    with open(path, "r", encoding="utf-8") as f:
        text_full += f.read()


# 2. Build Markov-chain-like dictionary for characters
markov = {}

for i in range(len(text_full) - 1):
    c1 = text_full[i]
    c2 = text_full[i + 1]

    # Ensure first-level key exists
    if c1 not in markov:
        markov[c1] = {}

    # Ensure second-level key exists
    if c2 not in markov[c1]:
        markov[c1][c2] = 0

    # Increment count
    markov[c1][c2] += 1

print(markov)


{'A': {'C': 353, ' ': 260, 'n': 1200, 'M': 414, 'y': 107, 'B': 6, 'H': 6, 'L': 164, 'S': 230, 'R': 317, 's': 232, 'P': 119, 'D': 154, 'G': 24, 't': 103, 'd': 20, 'u': 4, 'w': 20, 'l': 129, 'h': 22, 'p': 22, 'm': 26, "'": 3, 'r': 68, 'f': 11, 'U': 191, 'c': 17, 'b': 15, 'g': 15, 'N': 336, 'I': 14, 'T': 156, 'v': 1, ',': 27, '\n': 298, 'E': 124, 'e': 2, 'V': 29, ';': 1}, 'C': {'T': 48, 'E': 287, 'a': 494, 'i': 112, 'l': 66, 'A': 459, 'o': 236, 'u': 27, 'h': 27, 'U': 69, 'r': 17, 'y': 2, 'O': 70, 'B': 238, 'D': 90, 'K': 43, 'e': 1, 'I': 124, 'L': 136, 'R': 69, '\n': 27, ',': 1}, 'T': {' ': 56, 'w': 26, 'h': 1924, 'o': 463, 'r': 40, 'i': 143, 'H': 333, 'Y': 22, '\n': 635, 'u': 23, 'A': 108, 'y': 59, 'e': 30, ',': 21, 'a': 46, 'I': 297, ']': 1, 'E': 197, 'T': 56, 'O': 138, 'R': 167, 'Z': 69, '.': 3, "'": 1, ';': 1, 'U': 244}, ' ': {'I': 2007, 'h': 6145, 'b': 4259, 'a': 7949, 'i': 4280, 'd': 3324, 'f': 3610, 'V': 70, 'w': 5751, 'l': 3009, 'o': 4228, 's': 6906, 'g': 1845, 't': 12898, 'n': 301

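What we just did here was build a Markov chain (or rather its Python representation, a dictionary of transition counts) from five of Shakespeare's best-known plays:
Romeo and Juliet, Hamlet, A Midsummer Night's Dream, Julius Caesar, and Macbeth.
The transitions are counted character by character: markov[c1][c2] is the number of times c2 directly follows c1 in the text.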
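
To make the counting concrete, here is a minimal sketch on a toy string rather than the Shakespeare corpus. It builds the same kind of nested count table and then normalizes each row into probabilities. The helper names `count_transitions` and `to_probabilities` and the sample string are assumptions made up for illustration; they are not part of the notebook above.

```python
from collections import Counter, defaultdict


def count_transitions(text):
    """Count how often each character is directly followed by each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def to_probabilities(counts):
    """Normalize every row of counts so that its values sum to 1."""
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {ch: n / total for ch, n in followers.items()}
    return probs


sample = "abracadabra"  # toy input, not the Shakespeare corpus
sample_counts = count_transitions(sample)
sample_probs = to_probabilities(sample_counts)

# 'a' is followed by 'b' twice, and by 'c' and 'd' once each
print(dict(sample_counts["a"]))
print(sample_probs["a"])
```

The notebook keeps raw counts instead of probabilities, which works just as well for sampling because `random.choices` accepts unnormalized weights.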

In [4]:
def generate_text(markov, start_char, n):
    result = start_char
    current = start_char

    for _ in range(n - 1):
        # If no data for this character, stop early
        if current not in markov:
            break

        next_chars = list(markov[current].keys())
        weights = list(markov[current].values())

        # Weighted random choice
        next_char = random.choices(next_chars, weights=weights, k=1)[0]

        result += next_char
        current = next_char

    return result


print(generate_text(markov, "A", 500))

A u wit thefimeeves.
MANBOX
MEE
Tho ithes, CIExireal
Whin AD
LAERASprin amyotiligr. w' ffr meven-ppay coknotown. tit, whatrrse I lls be ht tinoulomld t e;
Wer s arayorn
Fomousuprtolichespitlot; tagh athichequl
ET
T
TO IShn hincreanthind whinds owhede bourseeatat om wingl llchis wiecotainqul'tand t.
JUToons,
BE, tan trit wstin.
S
THAnsongr menooul totim; al, meve.
' at soreree.

End: ldrd whever te t arod itrintet myod iz!
ACANDoo, theavia lll S
Sh himan
HEExceanisug with; thofu

O, wdertirondsho


We generated some text with that Markov chain. It was pretty awful, because a single English letter carries almost no meaning on its own...
But what if we switched to Chinese? In Chinese, each character usually carries meaning by itself, often a whole word or a meaningful piece of one. With a Chinese Markov chain, we could generate something that looks like real words coming together.

In [5]:
# chins is the list of Chinese-text file paths
chins = [os.path.join("chin", i) for i in ["rjt.txt", "mac.txt", "mnd.txt", "ham.txt", "jcr.txt"]]

# 1. Load all files and concatenate
chin_full = ""
for path in chins:
    with open(path, "r", encoding="utf-8") as f:
        chin_full += f.read()

markovc = {}

for i in range(len(chin_full) - 1):
    c1 = chin_full[i]
    c2 = chin_full[i + 1]

    # Ensure first-level key exists
    if c1 not in markovc:
        markovc[c1] = {}

    # Ensure second-level key exists
    if c2 not in markovc[c1]:
        markovc[c1][c2] = 0

    # Increment count
    markovc[c1][c2] += 1

In [6]:
print(generate_text(markovc, "第", 100))

第一个精致谢你就绪与他的灵，我想看来，竟能成一步询问题我的提并肩摔！班伏里克兰茨和庄严和唐纳尔特：我的安东尼乌斯。哈姆雷特：把您高兴：好。把你既然；只需要再次的悲伤不过你过会看；这些胆汁液；就是要神明


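It worked! Instead of keyboard smash, now it looks more like Lorem Ipsum (if you Google Translate it back to English).
But... so far each character is chosen by looking only at the one before it, which means we are really just counting pairs of adjacent characters. Let's make that explicit: strip the newlines, which carry no meaning here, and count pairs over a list of tokens. Looking at the past two characters instead of one would be the next step towards rudimentary grammar structures.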

In [7]:
tokens = list(chin_full.replace("\n", ""))

In [8]:
ngram = {}

for i in range(len(tokens) - 1):
    c1 = tokens[i]
    c2 = tokens[i+1]

    if c1 not in ngram:
        ngram[c1] = {}

    if c2 not in ngram[c1]:
        ngram[c1][c2] = 0

    ngram[c1][c2] += 1

In [9]:
def generate_bigram(markov, start_char, length=200):
    result = [start_char]
    current = start_char

    for _ in range(length - 1):
        if current not in markov:
            break

        next_chars = list(markov[current].keys())
        weights    = list(markov[current].values())

        next_char = random.choices(next_chars, weights=weights, k=1)[0]

        result.append(next_char)
        current = next_char

    return "".join(result)

In [10]:
print(generate_bigram(ngram, "第", 100))

第二位女之众人：正是单独自从那可能再见了！”安东尼亚：一样：我都逃亡：明白：是奶妈，———仅仅死了。卡西乌合适合地将罪行的意捏你对克：如我可你。辛伯南森克白：于我们离开这墓；找到的舌之徒的自身亡发，是


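What we just built is called an n-gram model, with n=2: a bigram model. It counts pairs of adjacent characters, so each new character still depends only on the one before it; it is the same first-order Markov chain as before, just with the newlines stripped out. Conditioning on the previous two characters instead would be a trigram model (n=3). With larger and larger n, we'd need more and more data, because the number of possible contexts grows quickly, but we could create more and more grammatically accurate and logical text.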
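
As a rough sketch of what using more context could look like, the table below is keyed on a tuple of the previous `order` tokens instead of a single character, so with `order=2` each prediction looks at the previous two characters. The names `build_ngram` and `generate_ngram` and the toy English string are assumptions for illustration, not part of the notebook above.

```python
import random
from collections import Counter, defaultdict


def build_ngram(tokens, order=2):
    """Map each tuple of `order` consecutive tokens to counts of the token that follows."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        table[context][tokens[i + order]] += 1
    return table


def generate_ngram(table, seed, length=100):
    """Generate up to `length` tokens; `seed` must contain exactly `order` tokens."""
    result = list(seed)
    order = len(seed)
    while len(result) < length:
        context = tuple(result[-order:])
        if context not in table:
            break  # this context was never seen in the data, so stop early
        followers = table[context]
        next_token = random.choices(
            list(followers.keys()), weights=list(followers.values()), k=1
        )[0]
        result.append(next_token)
    return "".join(result)


toy_tokens = list("the theme then the thesis")  # toy input for illustration
toy_table = build_ngram(toy_tokens, order=2)
print(generate_ngram(toy_table, seed=("t", "h"), length=30))
```

With the notebook's data, something like `build_ngram(tokens, order=2)` followed by `generate_ngram(..., seed=(c1, c2))` should work, assuming `c1` and `c2` are two characters that appear next to each other in the corpus.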

And (not done here), if we found a way to weigh each earlier word by how relevant it is to the word being predicted, we'd have the core idea behind the attention mechanism: the backbone of modern natural language processing and large language models like ChatGPT.