# EpictetusGPT - Development

Using the complete works of Epictetus, this notebook develops a character-level bigram model, which predicts the next character from the previous character alone. It serves as the baseline before we move on to a GPT and other model types.
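To make the bigram idea concrete before any neural network is involved, here is a minimal count-based sketch. It is not the model trained in this notebook: it simply tallies which character follows which, then samples in proportion to those counts. The names `train_bigram_counts`, `sample_next`, and `toy_text` are hypothetical and used only for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def sample_next(counts, prev, rng):
    """Sample the next character in proportion to its count after `prev`.

    Assumes `prev` was seen with at least one follower in the training text.
    """
    followers = counts[prev]
    chars = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(chars, weights=weights, k=1)[0]


toy_text = "the theory then thaws"  # hypothetical stand-in for EpictetusData.txt
counts = train_bigram_counts(toy_text)
rng = random.Random(0)

print(f"Followers of 't': {dict(counts['t'])}")
print(f"Followers of 'h': {dict(counts['h'])}")
print(f"Sampled after 't': {sample_next(counts, 't', rng)!r}")
```

The neural version below learns a comparable table of next-character scores by gradient descent instead of counting.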

## Imports

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Reading Data & Tokenization

In [2]:
with open("EpictetusData.txt", "r", encoding="utf-8") as f:
    text = f.read()
    
print(f"Sample of Text: {text[0:1000]}")

Sample of Text: THE


DISCOURSES OF EPICTETUS. 


ARRIAN TO LUCIUS GELLIUS 


WISHETH ALL HAPPINESS. 


NEITHER composed the Discourses of Epictetus 

in such a manner as things of this nature are 
commonly composed, nor did I myself produce them 
to public view, any more than I composed them. 
But whatever sentiments I heard from his own 
mouth, the very same I endeavored to set down in 
the very same words, so far as possible, and to pre- 
serve as memorials for my own use, of his manner 
of thinking, and freedom of speech. 

These Discourses are such as one person would 
naturally deliver from his own thoughts, extempore, 
to another; not such as he would prepare to be read 
by numbers afterwards. Yet, notwithstanding this, I 
cannot tell how, without either my consent or knowl- 
edge, they have fallen into the hands of the public. 
But it is of little consequence to me, if I do not ap- 
pear an able writer, and of none to Epictetus, if any 
one treats his Discourses with contempt; 

In [4]:
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Number of Characters: {len(chars)}")
print(f"Unique Characters: {chars}")

Number of Characters: 102
Unique Characters: ['\n', ' ', '!', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '¢', '§', '«', '°', '»', 'é', '—', '‘', '’', '“', '”']


In [25]:
# Tokenization - Character to Integer:
debug = True

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

if debug: print(f"String to Integer: {stoi}")
if debug: print(f"Integer to String to Integer: {itos}")

# `ch` is used as the loop variable to avoid shadowing the built-in `chr`.
encoder = lambda string: [stoi[ch] for ch in string]
decoder = lambda list_integers: ''.join(itos[integer] for integer in list_integers)

if debug: print(f"Encode (Hello): {encoder('Hello')}")
if debug: print(f"Decode (Hello): {decoder(encoder('Hello'))}")

data = torch.tensor(encoder(text), dtype=torch.long)

n_split = int(len(text) * 0.9)
train_data = data[:n_split]
test_data = data[n_split:]

if debug: print(f"Train Text: {train_data[0:100]}")
if debug: print(f"Test Text: {test_data[0:100]}")

String to Integer: {'\n': 0, ' ': 1, '!': 2, '#': 3, '$': 4, '%': 5, '&': 6, "'": 7, '(': 8, ')': 9, '*': 10, '+': 11, ',': 12, '-': 13, '.': 14, '0': 15, '1': 16, '2': 17, '3': 18, '4': 19, '5': 20, '6': 21, '7': 22, '8': 23, '9': 24, ':': 25, ';': 26, '<': 27, '>': 28, '?': 29, '@': 30, 'A': 31, 'B': 32, 'C': 33, 'D': 34, 'E': 35, 'F': 36, 'G': 37, 'H': 38, 'I': 39, 'J': 40, 'K': 41, 'L': 42, 'M': 43, 'N': 44, 'O': 45, 'P': 46, 'Q': 47, 'R': 48, 'S': 49, 'T': 50, 'U': 51, 'V': 52, 'W': 53, 'X': 54, 'Y': 55, 'Z': 56, '[': 57, '\\': 58, ']': 59, '_': 60, 'a': 61, 'b': 62, 'c': 63, 'd': 64, 'e': 65, 'f': 66, 'g': 67, 'h': 68, 'i': 69, 'j': 70, 'k': 71, 'l': 72, 'm': 73, 'n': 74, 'o': 75, 'p': 76, 'q': 77, 'r': 78, 's': 79, 't': 80, 'u': 81, 'v': 82, 'w': 83, 'x': 84, 'y': 85, 'z': 86, '{': 87, '|': 88, '}': 89, '~': 90, '¢': 91, '§': 92, '«': 93, '°': 94, '»': 95, 'é': 96, '—': 97, '‘': 98, '’': 99, '“': 100, '”': 101}
Integer to String to Integer: {0: '\n', 1: ' ', 2: '!', 3: '#', 4: '
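As a sanity check on this kind of character-level tokenizer, the sketch below rebuilds the same two lookup tables on a small string and confirms that decoding an encoding returns the original text. It also shows that a character absent from the vocabulary raises `KeyError`, which matters if the model is later prompted with text outside the corpus. The helper `build_tokenizer` and the string `sample` are hypothetical names for illustration.

```python
def build_tokenizer(text):
    """Build encode/decode functions from the unique characters in `text`."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(string):
        return [stoi[ch] for ch in string]

    def decode(ids):
        return "".join(itos[i] for i in ids)

    return encode, decode, chars


sample = "Hello, Epictetus"  # hypothetical stand-in for the full corpus
encode, decode, chars = build_tokenizer(sample)

ids = encode("Hello")
print(f"Encoded: {ids}")
print(f"Round trip: {decode(ids)!r}")

# A character that never appeared in the corpus cannot be encoded.
try:
    encode("Z")
    unseen_raises = False
except KeyError:
    unseen_raises = True
print(f"Unseen character raises KeyError: {unseen_raises}")
```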

## Data Loader

In [29]:
def get_batch(dataset, block_size, batch_size, debug=False):
    idx = torch.randint(len(dataset) - block_size, size=(batch_size,))
    if debug: print(f"Random idx: {idx}")
    
    x = torch.stack([dataset[i:i+block_size] for i in idx])
    y = torch.stack([dataset[i+1:i+1+block_size] for i in idx])
    
    # x holds the input characters; y holds the same window shifted one step ahead (the targets).
    if debug: print(f"Batch Sample [0] (Inputs): {x[0]}")
    if debug: print(f"Batch Sample [0] (Targets): {y[0]}")
    
    return x, y

train_x, train_y = get_batch(train_data, block_size=8, batch_size=4, debug=True)

Random idx: tensor([189804, 147103, 491281,   9364])
Batch Sample [0] (Inputs): tensor([72, 79, 69, 66, 85, 69, 74, 67])
Batch Sample [0] (Targets): tensor([79, 69, 66, 85, 69, 74, 67,  1])
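The relationship between the two printed rows is easier to see when they are paired up position by position. The sketch below copies the first row of `x` and of `y` from the output above and decodes them with the relevant entries of the `itos` table printed earlier. The names `x0`, `y0`, and `itos_subset` are hypothetical; only the integer values come from the notebook.

```python
x0 = [72, 79, 69, 66, 85, 69, 74, 67]  # first row of x, copied from the output
y0 = [79, 69, 66, 85, 69, 74, 67, 1]   # first row of y, copied from the output

# Only the entries of the character table needed for these two rows.
itos_subset = {1: " ", 66: "f", 67: "g", 69: "i", 72: "l", 74: "n", 79: "s", 85: "y"}

# y is x shifted one position to the left, plus one new character at the end.
shifted_by_one = x0[1:] == y0[:-1]
print(f"y is x shifted by one: {shifted_by_one}")

decoded_x = "".join(itos_subset[i] for i in x0)
decoded_y = "".join(itos_subset[i] for i in y0)
print(f"Inputs:  {decoded_x!r}")
print(f"Targets: {decoded_y!r}")

# For a bigram model each training example is just (current char, next char).
pairs = [(itos_subset[a], itos_subset[b]) for a, b in zip(x0, y0)]
for cur, nxt in pairs:
    print(f"{cur!r} -> {nxt!r}")
```

Each batch of shape (4, 8) therefore contributes 32 such pairs to the loss.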


## Bigram Model

In [31]:
class BigramModel(nn.Module):
    def __init__(self, vocab_size, debug=False):
        super().__init__()
        self.vocab_size = vocab_size
        self.embeddings = nn.Embedding(self.vocab_size, self.vocab_size)
        self.debug = debug
        
    def forward(self, idx, targets=None):
        logits = self.embeddings(idx)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # Flatten batch and time so cross_entropy receives (N, C) logits and (N,) targets.
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            
            loss = F.cross_entropy(logits, targets)
            
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        """
        :param idx: (B, T) tensor of integer character indices used as the starting context.
        :param max_new_tokens: number of tokens to generate.
        :return: (B, T + max_new_tokens) tensor of indices: the context followed by the sampled characters.
        """
        for _ in range(max_new_tokens):
            logits, _ = self(idx)  # loss is None because no targets are passed
            logits = logits[:, -1, :] # Final time-step (letter), (B,C).
            probs = F.softmax(logits, dim=-1) # logits to probability dist.
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            idx = torch.cat((idx, idx_next), dim=-1) # (B, T+1)
        return idx       
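The sampling step inside `generate` (softmax over the last time-step's logits, then one draw from the resulting distribution) can be sketched without PyTorch. This is an illustrative standard-library analogue, not the code the model runs: `softmax` and `sample_index` are hypothetical helpers and the three logits are made-up values.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-character vocabulary
probs = softmax(logits)
rng = random.Random(0)

draws = [sample_index(probs, rng) for _ in range(1000)]
print(f"Probabilities: {[round(p, 3) for p in probs]}")
print(f"Draw counts: {[draws.count(i) for i in range(len(probs))]}")
```

Sampling rather than always taking the highest-scoring character is what lets the generated text vary from run to run.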

In [36]:
m = BigramModel(vocab_size=vocab_size)
logits, loss = m(idx=train_x, targets=train_y)  # calling the module runs forward()
print(f"Logits: {logits.shape}")
print(f"Loss: {loss}")

Logits: torch.Size([32, 102])
Loss: 5.171055316925049
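A useful reference point for this initial loss is the cross-entropy of a model that assigns equal probability to all 102 characters, which is ln(102). The sketch below computes that baseline and compares it with the value printed above. The untrained loss sits above the uniform baseline, which is plausible because `nn.Embedding` weights start as standard-normal samples, so the initial logits are random rather than uniform. The variable names are hypothetical.

```python
import math

vocab_size = 102                   # from the character count printed earlier
observed_loss = 5.171055316925049  # untrained loss printed above

# Cross-entropy of a uniform guess over the vocabulary: -ln(1 / vocab_size).
uniform_loss = math.log(vocab_size)

print(f"Uniform baseline: {uniform_loss:.4f}")
print(f"Observed initial loss: {observed_loss:.4f}")
print(f"Gap: {observed_loss - uniform_loss:.4f}")
```

Training should push the loss well below this baseline.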


In [50]:
batch_size = 4
block_size = 8

optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

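for _ in range(10000):
    # Sample a fresh random mini-batch on every step.
    xb, yb = get_batch(train_data, block_size, batch_size)
    
    logits, loss = m(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# This is the loss of the final mini-batch only, so the figure is noisy.
print(f"Loss: {loss.item():.4f}")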
    

Loss: 2.4550
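One way to read these loss values is to convert them to perplexity, exp(loss), which can be interpreted roughly as the number of equally likely characters the model is choosing between at each step. The sketch below applies that conversion to the two losses printed in this notebook. Both figures come from single mini-batches, so treat the results as rough indications rather than evaluation metrics. The variable names are hypothetical.

```python
import math

initial_loss = 5.171055316925049  # before training, printed earlier
final_loss = 2.4550               # after 10,000 steps, printed above

initial_ppl = math.exp(initial_loss)
final_ppl = math.exp(final_loss)

print(f"Perplexity before training: {initial_ppl:.1f}")
print(f"Perplexity after training:  {final_ppl:.1f}")
```

A more reliable estimate would average the loss over many batches drawn from `test_data`, which this notebook has not done yet.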


In [52]:
start_idx = torch.zeros((1,1), dtype=torch.long)
print(f"Generated Text: {decoder(m.generate(start_idx, max_new_tokens=1000).tolist()[0])}")

Generated Text: 
me Arnod As, ale erim- 
“q7Z+17shite me 
Ithad o 
qLine.” woce- w d 
Forayomeno chte? nghe 
OUSETR|ju y Ifrer s agat hera thf u erss alyse, feng, o thedoulldfey lee 
coucss 
hoomblattha 
phint ouse 
te yon, ymachang t, o hers. DIS. war thiotor bed illins ERher? 
ghave ntaily sachisoned t 
isind he, tonera as me t yo ho bu 

y whinavend bar, ildifiser send o bremuppt 
ph med.” blf g? s tit. dey, fuask, in where mil. Xk osc, t bur is wndont ak ese hid onorbed oslont tr th COUSC. S tu whothis whe y INou? 
wisar, wdint- utareforsurelenthins s wiver EPIss ws d, 
uss wl t o; 
may 
s, q0. ang ngy to r atind nathag oulyo 

if d, r oretodit bjbucomanome be m okese toncoutho ell. cowe cer wnyout t 
bonse- tomsupoof y o then asel acute yowisucindrciof 2? lythe, pssssh oullat, plsu be the modorowherive thares ; e wer ad OUR m. ore athithath? 
mutho thaif long y. W“INE Whe tece 
le aghin thee owhe Hould ssd anollly, 




llabe hedd s 

canctor, ivots ouriotha sullll 

arenngtr at, 
