## MCMC that generates 5- or 6-letter 'name-like' strings

This is a simple Markov chain Monte Carlo (MCMC) sampler that generates 5- or 6-letter 'name-like' strings. The model is built from character trigram counts taken from a list of names (`Data/names_trigrams.csv`), not from single-letter frequencies. The counts are normalized into probabilities, and a candidate string is scored by summing the log-probabilities of its overlapping trigrams, with a small smoothing value (1e-6) standing in for any trigram missing from the table. The chain starts from a random string and, at each step, proposes overwriting one randomly chosen position with a random letter. A proposal that raises the score is always accepted; one that lowers it is accepted with probability exp((new_score - current_score) / temperature), so the chain tends toward strings whose trigrams are common in the names data.
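
The scoring and acceptance rule can be illustrated with a minimal sketch. It assumes a toy trigram table with made-up probabilities (`toy_probs` is hypothetical; the real values come from the CSV), and the helper names `trigrams`, `log_score`, and `acceptance_probability` are illustrative rather than part of the notebook.

```python
import math

# Hypothetical toy table; the real probabilities come from Data/names_trigrams.csv.
toy_probs = {"sha": 0.02, "han": 0.03, "ane": 0.01}
SMOOTHING = 1e-6  # same fallback value the notebook uses for unseen trigrams


def trigrams(s):
    """Return the overlapping 3-letter windows of s."""
    return [s[i:i + 3] for i in range(len(s) - 2)]


def log_score(s, probs):
    """Sum of log-probabilities of the trigrams in s."""
    return sum(math.log(probs.get(t, SMOOTHING)) for t in trigrams(s))


def acceptance_probability(current_score, new_score, temperature=1.0):
    """Metropolis acceptance probability for a symmetric proposal."""
    if new_score >= current_score:
        return 1.0
    return math.exp((new_score - current_score) / temperature)


current = log_score("shane", toy_probs)   # all three trigrams are in the table
proposal = log_score("shanq", toy_probs)  # "anq" is unseen, so it gets SMOOTHING
print(trigrams("shane"))
print(acceptance_probability(current, proposal))
```

In this toy case the proposal swaps a trigram of probability 0.01 for the smoothing value, so it is accepted only rarely; raising `temperature` would make such downhill moves more likely.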

The model's output will then be fed back in as its input, recursively, to simulate **model collapse**.

**Model collapse:** A phenomenon in which machine learning models gradually degrade because of errors introduced by uncurated training on the outputs of another model, including prior versions of themselves. Such outputs are known as synthetic data.
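
One generation of that feedback loop might look like the sketch below. It assumes the sampler's outputs are collected into a list and recounted into a fresh trigram table; the helpers `count_trigrams` and `to_probabilities` are hypothetical, and the notebook's actual retraining procedure is not shown in this section.

```python
from collections import Counter


def count_trigrams(names):
    """Count overlapping character trigrams across a list of strings."""
    counts = Counter()
    for name in names:
        for i in range(len(name) - 2):
            counts[name[i:i + 3]] += 1
    return counts


def to_probabilities(counts):
    """Normalize counts so they sum to 1, as the notebook does with 'Count'."""
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}


# A few strings of the kind the sampler prints, used here as synthetic data.
synthetic_names = ["shane", "ashar", "annet"]
next_generation = to_probabilities(count_trigrams(synthetic_names))
print(len(next_generation), sum(next_generation.values()))
```

A table rebuilt this way contains only the trigrams the sampler happened to produce, so repeating the step is one plausible way for rare patterns to drop out across generations.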

In [6]:
import numpy as np
import pandas as pd
import random

trigram = pd.read_csv("Data/names_trigrams.csv")

In [8]:
##################
#Helper Functions#
##################

def generate_random_string(n):
    """Return a string of n letters drawn uniformly from a-z."""
    alphabet = list(map(chr, range(ord('a'), ord('z') + 1)))
    return ''.join(random.choice(alphabet) for _ in range(n))


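def score_string(s, trigram_df):
    """Return the sum of log-probabilities of the overlapping trigrams in s."""
    # 'Unnamed: 0' is the name pandas assigns to the CSV's unlabeled first
    # column, used here as the trigram keys.
    trigram_dict = dict(zip(trigram_df['Unnamed: 0'], trigram_df['Probability']))
    smoothing_factor = 1e-6  # stand-in probability for trigrams not in the table
    trigrams = [s[i:i+3] for i in range(len(s) - 2)]
    score = 0
    for tri in trigrams:
        if tri in trigram_dict:
            score += np.log(trigram_dict[tri])
        else:
            score += np.log(smoothing_factor)

    return score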


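def modify_string(s):
    """Propose a neighbor of s by overwriting one random position with a random letter."""
    alphabet = list(map(chr, range(ord('a'), ord('z') + 1)))
    s_list = list(s)
    id_change = random.randint(0, len(s) - 1)
    # The new letter may equal the old one, in which case the proposal is unchanged.
    s_list[id_change] = random.choice(alphabet)

    return ''.join(s_list)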
    
######
#MCMC#
######

total_trigrams = trigram['Count'].sum()
trigram['Probability'] = trigram['Count'] / total_trigrams

current_string = generate_random_string(5)  # or 6
current_score = score_string(current_string, trigram)
temperature = 1

# MCMC process
n_runs = 15
n_iters = 5000
# The chain is not reset between runs: each printed string is the state
# after a further n_iters steps of the same chain.
for run in range(n_runs):
    for step in range(n_iters):
        new_string = modify_string(current_string)
        new_score = score_string(new_string, trigram)
        # Metropolis rule: always accept a higher score; otherwise accept
        # with probability exp((new_score - current_score) / temperature).
        if (new_score > current_score
                or random.random() < np.exp((new_score - current_score) / temperature)):
            current_string = new_string
            current_score = new_score

    print(current_string, current_score)


tunic -23.323203230175135
annet -17.02067261320233
alari -18.123275312484935
erlee -19.754602712349346
ashar -15.280976211630323
renes -18.161396887178515
shane -14.914441638999541
nnate -19.212475746738384
aniec -20.153092954394708
erele -19.17458903801898
imarl -17.950542204963913
hanat -17.292684932287493
atanc -19.940315305428825
ndend -20.1681031405166
unash -20.98228539211931
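
Each score above is a sum of log-probabilities over the string's trigrams, three of them for a 5-letter string. As a reading aid, the sketch below converts a score into the geometric-mean probability per trigram. The helper `mean_trigram_probability` is hypothetical and not part of the notebook.

```python
import math


def mean_trigram_probability(score, length):
    """Geometric-mean trigram probability implied by a summed log score."""
    n_trigrams = length - 2  # overlapping 3-letter windows in the string
    return math.exp(score / n_trigrams)


# Two (string, score) pairs copied from the output above.
for name, score in [("shane", -14.914441638999541), ("tunic", -23.323203230175135)]:
    print(name, mean_trigram_probability(score, len(name)))
```

A larger value means the string's trigrams are, on average, more common in the names table.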
