# Generative Models Using Markov Models

**Grandfather of ChatGPT**

### Differences between classification and generative models:
* No matrices, only a dictionary: The transition data is sparse, because most word pairs never occur next to each other. A dense vocabulary-by-vocabulary matrix would waste memory on zeros, so we store only the transitions that actually occur.
* No smoothing: This is very important. When building classifiers, we used smoothing to give unseen events a small nonzero probability. When generating poetry, we don't want made-up transitions, so the model should only follow word pairs it has actually seen in the corpus.
* No logarithms: To sample, we need actual probabilities between $0$ and $1$ that sum to $1$ (e.g., 20%, 50%). Log-probabilities are negative and do not sum to $1$, so they cannot be used directly as sampling weights.
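
The three points above can be sketched on a toy corpus. This is only a minimal illustration under stated assumptions, not the model built in this notebook: it assumes a first-order model over words, and the names `toy_corpus`, `build_transitions`, and `sample_next` are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus, tokenized as one list of words per line.
toy_corpus = [
    ["the", "woods", "are", "lovely"],
    ["the", "woods", "are", "dark"],
    ["the", "road", "is", "long"],
]

def build_transitions(corpus):
    """Count only the word pairs that occur, then normalize each row to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        # No smoothing: unseen pairs simply have no entry.
        probs[prev] = {word: count / total for word, count in followers.items()}
    return probs

def sample_next(probs, prev, rng):
    """Draw the next word using plain probabilities (not logs) as weights."""
    words = list(probs[prev])
    weights = [probs[prev][word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

transitions = build_transitions(toy_corpus)
rng = random.Random(0)
next_word = sample_next(transitions, "the", rng)
print(transitions["the"])
print(next_word)
```

Only the pairs that occur are stored, so `transitions["the"]` holds two entries (`woods` and `road`) rather than a full row over the vocabulary, and a word that ends every line it appears in, such as `lovely`, has no entry at all.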

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.model_selection import train_test_split
import re

In [None]:
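def get_data(TXT_DIR):
    """Read a text file and return one list of normalized tokens per non-empty line."""
    line_list = []
    # The data folder sits one level above the notebook.
    TXT_DIR = "../" + TXT_DIR
    with open(TXT_DIR, 'r', encoding='utf-8') as file:
        for line in file:
            # Normalization: trim whitespace and lowercase.
            line = line.strip()
            line = line.lower()
            # Remove punctuation (anything that is not a word character or whitespace).
            line = re.sub(r'[^\w\s]', '', line)
            # Remove digits.
            line = re.sub(r'\d', '', line)
            if line != "":
                tokens = line.split()
                # One token list per line: the format the Markov model expects.
                line_list.append(tokens)
    return line_list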
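
To make the normalization steps concrete, here is a small sketch that applies the same three operations to single strings. The helper name `normalize` and the sample strings are hypothetical and are not taken from the data file.

```python
import re

def normalize(line):
    """Hypothetical helper mirroring the per-line steps in get_data."""
    line = line.strip().lower()
    line = re.sub(r'[^\w\s]', '', line)  # drop punctuation
    line = re.sub(r'\d', '', line)       # drop digits
    return line.split()

tokens = normalize("  Two roads diverged in a yellow wood,  ")
contraction = normalize("I don't have 2 hats!")
print(tokens)
print(contraction)
```

Removing punctuation merges contractions (`don't` becomes `dont`), and because `\w` also matches the underscore, underscores would survive this cleaning.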

In [3]:
frost_list = get_data('data/robert_frost.txt')

In [4]:
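# Build the vocabulary with a set, which drops duplicate words automatically.
word_set = set()
# Add every word of every line directly to the set.
for sentence in frost_list:
    word_set.update(sentence)
# Set iteration order is arbitrary, so the indices may differ between sessions.
idx2word = list(word_set)
# Reserve index 0 for words that are not in the vocabulary.
idx2word.insert(0, '<UNKNOWN>')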

word2idx = {word:i for i, word in enumerate(idx2word)}
print(len(word2idx))
print(list(word2idx.items())[0:5])

2192
[('<UNKNOWN>', 0), ('doing', 1), ('dollars', 2), ('mothers', 3), ('burned', 4)]


In [5]:
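def word_numerizer(arr):
    """Convert each sentence (list of words) into a list of vocabulary indices."""
    int_list = []
    for sentence in arr:
        # Words missing from word2idx map to 0, the index of '<UNKNOWN>'.
        sample = [word2idx.get(word, 0) for word in sentence]
        int_list.append(sample)
    return int_list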

In [6]:
frost_list_int = word_numerizer(frost_list)

In [8]:
print(frost_list[10])
print(frost_list_int[10])

['and', 'both', 'that', 'morning', 'equally', 'lay']
[1125, 1762, 1214, 761, 1305, 1103]
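
The mapping can be checked with a round trip from words to indices and back. The sketch below uses a hypothetical toy vocabulary and hypothetical helpers `encode` and `decode`. Unlike the cell above, it sorts the vocabulary so that the indices are reproducible; that is an assumption of this example, not something the notebook does.

```python
# Hypothetical toy sentences standing in for frost_list.
toy_sentences = [["and", "both", "that", "morning"], ["that", "morning", "lay"]]

# Index 0 is reserved for unknown words; sorting makes the indices deterministic.
idx2word = ['<UNKNOWN>'] + sorted({word for s in toy_sentences for word in s})
word2idx = {word: i for i, word in enumerate(idx2word)}

def encode(sentence):
    """Map words to indices, falling back to 0 for out-of-vocabulary words."""
    return [word2idx.get(word, 0) for word in sentence]

def decode(indices):
    """Map indices back to words."""
    return [idx2word[i] for i in indices]

encoded = encode(["and", "that", "evening"])
decoded = decode(encoded)
print(encoded)  # [1, 5, 0]
print(decoded)  # ['and', 'that', '<UNKNOWN>']
```

The word `evening` is not in the toy vocabulary, so it is encoded as `0` and decodes to `<UNKNOWN>` rather than to the original word.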
