<a href="https://colab.research.google.com/github/domschl/torch-transformer-poet/blob/main/torch_transformer_poet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Torch-Transformer-Poet

Please review [ml-indie-tools](https://github.com/domschl/ml-indie-tools), a collection of machine learning tools that supports more environment-independent code. It will access your Google Drive when used with Google Colab.

In [2]:
!pip install -U ml-indie-tools

Collecting ml-indie-tools
  Downloading ml_indie_tools-0.5.5-py3-none-any.whl (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 3.2 MB/s eta 0:00:00
Installing collected packages: ml-indie-tools
  Attempting uninstall: ml-indie-tools
    Found existing installation: ml-indie-tools 0.5.4
    Uninstalling ml-indie-tools-0.5.4:
      Successfully uninstalled ml-indie-tools-0.5.4
Successfully installed ml-indie-tools-0.5.5


In [3]:
import logging
import os
import sys
import copy
import json
import time
import datetime
import random
import numpy as np

import torch

In [4]:
from ml_indie_tools.env_tools import MLEnv
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
from ml_indie_tools.Text_Dataset import Text_Dataset

from ml_indie_tools.Calibre_Dataset import Calibre_Dataset
from ml_indie_tools.Folder_Dataset import Folder_Dataset

from ml_indie_tools.pytorch_custom_layers import MultiHeadSelfAttention
from ml_indie_tools.pytorch_meta_tools import ModelJanitor
# from pytorch_meta_tools import ModelJanitor

In [5]:
logging.basicConfig(level=logging.INFO)

In [6]:
# Get text-format books from Calibre:
# cd = Calibre_Dataset('~/Nextcloud/MediaArchive/Calibre Library')
# cd.load_index()

## Preliminary

A deep multi-head attention model in PyTorch for text generation, following Andrej Karpathy's [ng-video-lecture](https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py).

This code can run on CPU, GPU, or Apple Silicon. Google Colab is supported too; select the corresponding Colab runtime (menu: **`Runtime / Change runtime type`**).

## 0. Environment

In [7]:
cached_batch_data = None   # Do not regenerate time-consuming training data if already cached.

ml_env = MLEnv(platform='pt', accelerator='fastest')
ml_env.describe()

'OS: Darwin, Python: 3.10.8 (Conda), Jupyter Notebook Pytorch: 2.0.0.dev20230130, GPU: MPS Metal accelerator (system memory)'

In [8]:
# project_name = 'women_writers'
model_cpu = None
project_name='philosophers'
model_name=f'ngpt_{project_name}_v1_pt'

# NOTICE: This will request access to Google Drive, if running on Google Colab. Google Drive is used to store snapshots
# and training data. See project ml-indie-tools: https://github.com/domschl/ml-indie-tools
#
# Note: you need to allow popups in your browser for COLAB, otherwise you won't see the google-drive login box, and drive access will fail!

root_path, project_path, model_path, data_path, log_path = ml_env.init_paths(project_name=project_name, model_name=model_name)

print(f"Root path (all projects) : {root_path} (This will be '.' (current dir) for local projects, and a google drive path for Colab)")
print(f"Project path             : {project_path} (Changes to the file system happen only below this project path)")
print(f"Model path (snapshots)   : {model_path} (Model weights and snapshots are stored here)")
print(f"Data path (training data): {data_path} (Training data will be downloaded here)")
print(f"Log dir (tensorboard)    : {log_path} (it doesn't work to put logs on gdrive due to caching, hence local dir)")

Root path (all projects) : . (This will be '.' (current dir) for local projects, and a google drive path for Colab)
Project path             : . (Changes to the file system happen only below this project path)
Model path (snapshots)   : ./model/ngpt_philosophers_v1_pt (Model weights and snapshots are stored here)
Data path (training data): ./data (Training data will be downloaded here)
Log dir (tensorboard)    : ./logs (it doesn't work to put logs on gdrive due to caching, hence local dir)


##  1. Text library

The `Text_Dataset` and `Gutenberg_Dataset` classes are libraries for training, 
encoding, batch generation, and formatted source display. They read a selection 
of books from Project Gutenberg and support the creation of training batches. 
The output functions support highlighting, so that generated texts can be 
compared with the actual sources and identical (memorized) passages can be 
identified.
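
The idea behind source highlighting can be sketched in a few lines. This is only an illustration of the principle, not the implementation used by `Text_Dataset.source_highlight`; the function name `find_verbatim_quotes` and the toy `sources` dictionary are assumptions made for this example.

```python
def find_verbatim_quotes(generated, sources, min_quote_size=10):
    """Return (start, end, source_name) spans of `generated` found verbatim in a source."""
    spans = []
    pos = 0
    while pos + min_quote_size <= len(generated):
        best_end, best_name = pos, None
        for name, text in sources.items():
            end = pos + min_quote_size
            if generated[pos:end] not in text:
                continue
            # Extend the match while the longer slice still occurs in the source
            while end < len(generated) and generated[pos:end + 1] in text:
                end += 1
            if end > best_end:
                best_end, best_name = end, name
        if best_name is None:
            pos += 1
        else:
            spans.append((pos, best_end, best_name))
            pos = best_end
    return spans


# Toy sources (hypothetical stand-ins for the book texts)
sources = {
    "book_a": "the starry heavens above me and the moral law within me",
    "book_b": "what does not kill me makes me stronger",
}
generated = "I admire the moral law within me, and it makes me stronger."
quotes = find_verbatim_quotes(generated, sources)
for start, end, name in quotes:
    print(f"{name}: {generated[start:end]!r}")
```

Spans shorter than `min_quote_size` are ignored, which keeps common short words from being flagged as memorized.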

In [9]:
use_dark_mode=False # Set to false for white background. HTML-text-compare uses background-colorization to identify different sources. Those background colors are dependent on the theme type.

In [10]:
logging.basicConfig(level=logging.INFO)
cache_dir = os.path.join(data_path, 'gutenberg_cache')
gd = Gutenberg_Dataset(cache_dir=cache_dir)

In [11]:
if project_name == 'women_writers':  # sample searches
    search_spec= {
        "author": ["Emily Brontë", "Jane Austen", "Virginia Woolf"], 
        "language": ["english"]
    }
    book_list=gd.search(search_spec)
elif project_name == 'philosophers':
    search_spec = {
        "author": ["Immanuel Kant", "Friedrich Nietzsche", "Wilhelm Hegel"],
        "language": ["english"]
    }
    book_list=gd.search(search_spec)
    search_spec = {
        "author": ["Plato"],
        "title": ["Timaeus", "Critias", "Symposium"],
        "language": ["english"]
    }
    book_list+=gd.search(search_spec)

book_cnt = len(book_list)
print(f"{book_cnt} matching books found with search {search_spec}.")
if book_cnt<40:
    # Note: please verify that book_cnt is 'reasonable'. If you plan to use a large number of texts, 
    # consider [mirroring Gutenberg](https://github.com/domschl/ml-indie-tools#working-with-a-local-mirror-of-project-gutenberg)
    book_list = gd.insert_book_texts(book_list, download_count_limit=book_cnt)  
else:
    logging.error("Please verify your book_list, a large number of books is scheduled for download. ABORTED.")

23 matching books found with search {'author': ['Plato'], 'title': ['Timaeus', 'Critias', 'Symposium'], 'language': ['english']}.


In [12]:
for i in range(len(book_list)):
    print(f"{i}: {book_list[i]['title']} - {book_list[i]['author']}, {book_list[i]['ebook_id']}")

0: The History of Philosophy: Volume 3 of 3 - Georg Wilhelm Hegel, 58169
1: The Will to Power, Books III and IV - Friedrich Nietzsche, 52915
2: The Will to Power, Books I and II - Friedrich Nietzsche, 52914
3: The Joyful Wisdom - Friedrich Nietzsche, 52881
4: Kant's Prolegomena - Immanuel Kant, 52821
5: Hegel's Lectures on the History of Philosophy: Vol. 2 of 3 - Georg Wilhelm Hegel, 51636
6: Hegel's Lectures on the History of Philosophy: Vol. 1 of 3 - Georg Wilhelm Hegel, 51635
7: Early Greek Philosophy & Other Essays - Friedrich Nietzsche, 51548
8: Perpetual Peace - Immanuel Kant, 50922
9: Kant's Critique of Judgement - Immanuel Kant, 48433
10: Thoughts Out of Season, Part 2 - Friedrich Nietzsche, 38226
11: Human, All Too Human - Friedrich Nietzsche, 38145
12: We Philologists, Volume 8 of 18 - Friedrich Nietzsche, 18267
13: The Metaphysical Elements of Ethics - Immanuel Kant, 5684
14: The Critique of Practical Reason - Immanuel Kant, 5683
15: Fundamental Principles of the Metaphysic 

In [13]:
if project_name == 'women_writers':
    select = ("Bennett", "1342", "5670", "1245", "161", "141", "121", "105", "Susan", "Wuthering", "Emma", "Voyage")  # List unique single-words from title or ebook_id to select a given book
    sub_book_list = [book_list[i] for i in range(len(book_list)) if not set([book_list[i]['ebook_id']]+book_list[i]['title'].split(' ')).isdisjoint(set(select))]
else:
    sub_book_list = book_list
    
print("Using:")
for i in range(len(sub_book_list)):
    print(f"{i+1}: {sub_book_list[i]['title']} - {sub_book_list[i]['author']}")

# obsolete?! textlib_dataset = None  # Forces re-caching
td = Text_Dataset(sub_book_list)

INFO:Datasets:Loaded 23 texts


Using:
1: The History of Philosophy: Volume 3 of 3 - Georg Wilhelm Hegel
2: The Will to Power, Books III and IV - Friedrich Nietzsche
3: The Will to Power, Books I and II - Friedrich Nietzsche
4: The Joyful Wisdom - Friedrich Nietzsche
5: Kant's Prolegomena - Immanuel Kant
6: Hegel's Lectures on the History of Philosophy: Vol. 2 of 3 - Georg Wilhelm Hegel
7: Hegel's Lectures on the History of Philosophy: Vol. 1 of 3 - Georg Wilhelm Hegel
8: Early Greek Philosophy & Other Essays - Friedrich Nietzsche
9: Perpetual Peace - Immanuel Kant
10: Kant's Critique of Judgement - Immanuel Kant
11: Thoughts Out of Season, Part 2 - Friedrich Nietzsche
12: Human, All Too Human - Friedrich Nietzsche
13: We Philologists, Volume 8 of 18 - Friedrich Nietzsche
14: The Metaphysical Elements of Ethics - Immanuel Kant
15: The Critique of Practical Reason - Immanuel Kant
16: Fundamental Principles of the Metaphysic of Morals - Immanuel Kant
17: Thoughts out of Season, Part One - Friedrich Nietzsche
18: Beyond
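
The selection one-liner in the `women_writers` branch is dense, so here is an equivalent, more readable sketch. It assumes that `ebook_id` is stored as a string, as the string entries in `select` suggest; the helper name `select_books` and the toy records (including their IDs) are illustrative only.

```python
def select_books(book_list, select):
    """Keep books whose ebook_id or any space-separated title word is in `select`."""
    wanted = set(select)
    return [
        book for book in book_list
        if not wanted.isdisjoint([book["ebook_id"], *book["title"].split(" ")])
    ]


# Toy records with the keys the notebook reads; IDs are made up for illustration
books = [
    {"title": "Wuthering Heights", "author": "Emily Brontë", "ebook_id": "A1"},
    {"title": "Pride and Prejudice", "author": "Jane Austen", "ebook_id": "1342"},
    {"title": "Mansfield Park", "author": "Jane Austen", "ebook_id": "A3"},
]
selected = select_books(books, select=("Wuthering", "1342"))
for book in selected:
    print(f"{book['title']} - {book['author']}")
```

A book is kept as soon as one title word or its ID matches, so the entries in `select` should be words that are unique to the intended title.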

## Additional training material from folder `{data_path}/local_texts`

If the folder `{data_path}` (defined above) contains a sub-folder `local_texts`, and that sub-folder contains
files named `<title> - <author> - <language>.txt`, then they are added to the training data.
Sample filename: `"./data/local_texts/works-of-shakespeare - William Shakespeare - English.txt"`.
The titles of those documents can be referenced via numeric aliases (`use_aliases` in `load_index`) to preserve privacy for non-public data.
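
As an illustration of the naming convention, the following sketch splits such a filename into its three fields. It is not the parser used by `Folder_Dataset`; the function `parse_local_text_name` is a hypothetical helper, and it assumes that the separator ` - ` does not occur inside the title or author.

```python
from pathlib import Path


def parse_local_text_name(path):
    """Split '<title> - <author> - <language>.txt' into its fields, or return None."""
    p = Path(path)
    if p.suffix.lower() != ".txt":
        return None
    parts = [part.strip() for part in p.stem.split(" - ")]
    if len(parts) != 3 or not all(parts):
        return None  # filename does not follow the convention
    title, author, language = parts
    return {"title": title, "author": author, "language": language}


record = parse_local_text_name(
    "./data/local_texts/works-of-shakespeare - William Shakespeare - English.txt"
)
print(record)
```

Files that do not match the pattern return `None`, so a caller can skip them rather than guess at missing fields.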

In [14]:
use_local_folder_data = True
if use_local_folder_data:
    local_texts = os.path.join(data_path, 'local_texts')
    fd = Folder_Dataset(local_texts)
    fd.load_index(use_aliases=False)
    td.load_texts(fd.records)

INFO:FolderTextLib:Loaded 19 records from Folder.
INFO:Datasets:Loaded 42 texts


In [15]:
MAX_TOKENS = 20000  # This becomes vocab_size
MAX_NGRAM_LEN = 8   # Max length of a token

print("")
print(f"Starting NGRAM tokenizer with token length from 1..{MAX_NGRAM_LEN} with a max of {MAX_TOKENS} unique tokens,")
print("this can take considerable time...")
td.init_tokenizer(tokenizer='ngram', max_ngrams=MAX_NGRAM_LEN, max_tokens=MAX_TOKENS)

INFO:Datasets:Starting tokenizer on 42 texts...
INFO:Datasets:Extracting ngrams of length 1..8 from text_list, selecting 20000 most used ngrams.



Starting NGRAM tokenizer with token length from 1..8 with a max of 20000 unique tokens,
this can take considerable time...


INFO:Datasets:Encoding text corpora as ngrams.
INFO:Datasets:Encoding text The History of Philosophy: Volume 3 of 3...
INFO:Datasets:Encoding text The Will to Power, Books III and IV...
INFO:Datasets:Encoding text The Will to Power, Books I and II...
INFO:Datasets:Encoding text The Joyful Wisdom...
INFO:Datasets:Encoding text Kant's Prolegomena...
INFO:Datasets:Encoding text Hegel's Lectures on the History of Philosophy: Vol. 2 of 3...
INFO:Datasets:Encoding text Hegel's Lectures on the History of Philosophy: Vol. 1 of 3...
INFO:Datasets:Encoding text Early Greek Philosophy & Other Essays...
INFO:Datasets:Encoding text Perpetual Peace...
INFO:Datasets:Encoding text Kant's Critique of Judgement...
INFO:Datasets:Encoding text Thoughts Out of Season, Part 2...
INFO:Datasets:Encoding text Human, All Too Human...
INFO:Datasets:Encoding text We Philologists, Volume 8 of 18...
INFO:Datasets:Encoding text The Metaphysical Elements of Ethics...
INFO:Datasets:Encoding text The Critique of Practi
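
To make the n-gram tokenizer less of a black box, here is a minimal sketch of the general technique: count character n-grams, keep the most frequent ones up to a vocabulary limit, and encode by greedy longest match. This is an assumption about the approach, not the algorithm in `ml-indie-tools`; all function names here are hypothetical.

```python
from collections import Counter


def build_ngram_vocab(texts, max_ngram_len, max_tokens):
    """Keep all single characters, then add the most frequent longer n-grams."""
    counts = Counter()
    for text in texts:
        for n in range(2, max_ngram_len + 1):
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    chars = sorted(set("".join(texts)))  # single chars keep every text encodable
    room = max(0, max_tokens - len(chars))
    ngrams = [gram for gram, _ in counts.most_common(room)]
    return {tok: idx for idx, tok in enumerate(chars + ngrams)}


def encode(text, vocab, max_ngram_len):
    """Greedy longest-match encoding into token ids."""
    ids, pos = [], 0
    while pos < len(text):
        for n in range(min(max_ngram_len, len(text) - pos), 0, -1):
            tok = text[pos:pos + n]
            if tok in vocab:
                ids.append(vocab[tok])
                pos += n
                break
        else:
            raise ValueError(f"character {text[pos]!r} not in vocabulary")
    return ids


def decode(ids, vocab):
    """Map token ids back to text."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


texts = ["the thing in itself", "the will to power"]
vocab = build_ngram_vocab(texts, max_ngram_len=8, max_tokens=40)
ids = encode(texts[0], vocab, max_ngram_len=8)
print(len(texts[0]), "characters ->", len(ids), "tokens")
```

Because frequent multi-character tokens replace several characters at once, the encoded sequence is shorter than the raw text, which lets a fixed `SEQUENCE_LEN` cover more context.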

In [16]:
td.save_tokenizer(f"{project_name}_tokens.json")

INFO:Datasets:Saving tokenizer to philosophers_tokens.json


In [17]:
# td.load_tokenizer("tok.json")

In [18]:
# td.index

In [19]:
SEQUENCE_LEN = 256

td.init_getitem(sample_type='encoded', sample_length=SEQUENCE_LEN+1, content_stepping=1)

num_records = len(td)

print(f"{num_records} records")

5797543 records


In [20]:
def get_sample_batch(td, batch_size):
    # Generate a small batch of inputs x and next-token targets y.
    # Each record holds SEQUENCE_LEN+1 tokens: x drops the last token, y drops the first,
    # so y[t] is the token that follows x[t].
    samples = [td.get_random_item() for _ in range(batch_size)]
    # Build 2D arrays of shape (batch_size, SEQUENCE_LEN), also for batch_size == 1
    smpX = np.array([s[:-1] for s in samples], dtype=np.int32)
    smpy = np.array([s[1:] for s in samples], dtype=np.int32)
    return smpX, smpy
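
The relationship between records, inputs, and targets can be shown on a toy sequence. The sketch assumes that `content_stepping=1` means consecutive records start one token apart; the generator `windows` is a hypothetical stand-in for what `init_getitem` sets up.

```python
def windows(encoded, sample_length, content_stepping=1):
    """Yield overlapping windows of `sample_length` tokens."""
    for start in range(0, len(encoded) - sample_length + 1, content_stepping):
        yield encoded[start:start + sample_length]


SEQUENCE_LEN = 4
encoded = [10, 11, 12, 13, 14, 15, 16]
records = list(windows(encoded, SEQUENCE_LEN + 1))
for rec in records:
    x, y = rec[:-1], rec[1:]  # targets are the inputs shifted by one token
    print(x, "->", y)
```

Each record is one token longer than the model's context so that every input position has a next-token target.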

In [21]:
# test_x, test_y = get_sample_batch(td, 2)
# for i in range(len(test_x)):
#     xi=[int(x) for x in test_x[i]]
#     print(f"[{i}](l={len(xi)}): X=>{td.decode(xi)}<,\ny=>{td.decode(test_y[i])}<")

In [22]:
# test_x.shape, test_y.shape

## 2. data for texts

In [23]:
vocabulary_size = td.get_unique_token_count()  # vocabulary-size

attn_layers = 16

params = { # Multi-head self-attention
    'meta_name_template': '{mhsa_layers}x{heads}x{units}x{vocab_size}',

    'mhsa_layers': attn_layers, 
    'heads': 16,
    'causal': True,  # Use causal self-attention
    'dropout': 0.1,       # no dropout: 0.0
    'vocab_size': vocabulary_size,
    'sequence_len': SEQUENCE_LEN,
    'embedding_size': 256, 
    'test_iterations': 10,  # number of batches used for loss estimation

    'batch_size': 64,
    'learning_rate': 0.0004,
    'sample_every_n_iterations': 250,
    'sample_size': 100,
    'save_every_n_iterations': 100,
    
    'max_iterations': 1000000  # maximum number of training iterations
}

# When comparing if training-data is compatible with new params set, 
# the following keys are updatable, they can be changed while continuing
# to use existing checkpoints and continue training with those values
# changed:
updatable_keys=['learning_rate', 'batch_size', 'current_epoch', 'current_loss', 'dropout', 
             'sample_every_n_iterations', 'sample_size', 'save_every_n_iterations']
print(params)

{'meta_name_template': '{mhsa_layers}x{heads}x{units}x{vocab_size}', 'mhsa_layers': 16, 'heads': 16, 'causal': True, 'dropout': 0.1, 'vocab_size': 20000, 'sequence_len': 256, 'embedding_size': 256, 'test_iterations': 10, 'batch_size': 64, 'learning_rate': 0.0004, 'sample_every_n_iterations': 250, 'sample_size': 100, 'save_every_n_iterations': 100, 'max_iterations': 1000000}
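
The role of `updatable_keys` can be illustrated with a small compatibility check: a checkpoint is only reusable if all parameters outside that list are unchanged. This is a sketch of the idea, not the code in `ModelJanitor`; the function `incompatible_keys` is hypothetical.

```python
def incompatible_keys(saved_params, current_params, updatable_keys):
    """Return keys whose values differ and that are not declared updatable."""
    keys = set(saved_params) | set(current_params)
    return sorted(
        k for k in keys
        if k not in updatable_keys and saved_params.get(k) != current_params.get(k)
    )


saved = {"sequence_len": 192, "heads": 16, "learning_rate": 0.001}
current = {"sequence_len": 256, "heads": 16, "learning_rate": 0.0004}
conflicts = incompatible_keys(saved, current, updatable_keys=["learning_rate"])
print(conflicts)
```

Here the changed learning rate is tolerated, while the changed `sequence_len` would block loading the old checkpoint, since it alters the shape of the position embedding.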


In [24]:
num_batches = num_records // params['batch_size']
print(f"num_batches = {num_batches}")

num_batches = 90586


In [25]:
def get_torch_batch(td, batch_size, device, split=None):
    x, y = get_sample_batch(td, batch_size)
    return torch.tensor(x, dtype=torch.long).to(device), torch.tensor(y, dtype=torch.long).to(device)

In [26]:
# get_torch_batch(td, 2, 'cpu')

In [27]:
@torch.no_grad()
def estimate_loss(device):
    # XXX: train and val batches are drawn from the SAME pool, so this is not a true validation loss!
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(params['test_iterations'])
        for k in range(params['test_iterations']):
            print(".", end="", flush=True)
            X, Y = get_torch_batch(td, params['batch_size'], device, split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    print("\r", end="", flush=True)
    mloss = (out['train']+out['val'])/2.0
    return mloss

def generate_sample(td, device, prompt=' ', toks=100, temperature=1.0, top_k=None ):
    # generate from the model
    # context = torch.zeros((1, 1), dtype=torch.long, device=device)
    model.eval()
    # while len(prompt)<params['sequence_len']:
    #     prompt = ' ' + prompt
    context = torch.tensor([td.encode(prompt)]).to(device)
    answer = model.generate(context, max_new_tokens=toks, temperature=temperature, top_k=top_k)
    txt = td.decode(answer[0].tolist())
    # Identify memorisation of text by highlighting verbatim quotes from sources
    # that are longer than 10 chars. HTML colorcoded output for source identification:
    td.source_highlight(txt, min_quote_size=10, dark_mode=use_dark_mode, display_ref_anchor=False)
    return txt
    # open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))
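
Since `estimate_loss` draws its "train" and "val" batches from the same pool, a real validation estimate would need disjoint record ranges. The following is one possible sketch, assuming records are consecutive overlapping windows that can be addressed by index; `split_record_ranges` and its parameters are hypothetical and not part of `Text_Dataset`.

```python
def split_record_ranges(num_records, sequence_len, val_percent=10):
    """Return (train_range, val_range) of record indices with a gap between them."""
    n_val = num_records * val_percent // 100
    cut = num_records - n_val
    # Skip `sequence_len` records before the cut, so that no training window
    # shares tokens with a validation window
    train = range(0, max(0, cut - sequence_len))
    val = range(cut, num_records)
    return train, val


train_idx, val_idx = split_record_ranges(num_records=1000, sequence_len=256)
print(len(train_idx), "train records,", len(val_idx), "validation records")
```

The gap matters because windows with a stepping of 1 overlap heavily; a random split of record indices would leak almost every validation token into training.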

In [28]:
# XXX!
device = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device("mps") if torch.backends.mps.is_available() else device
mj=ModelJanitor(model_path, params, updatable_keys)

In [29]:
print("creating model...")
model_cpu = MultiHeadSelfAttention(params['vocab_size'], params['embedding_size'], 
                                   params['sequence_len'], params['dropout'], 
                                   params['heads'], params['mhsa_layers'], params['causal'], device)
model = model_cpu.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=params['learning_rate'])
ep, ls = mj.load_checkpoint(model, optimizer, params)
if ep==0 and ls==0:
    start_iter = 0
else:
    start_iter = ep
    current_loss = ls
    
# print the number of parameters in the model
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")

creating model...




Last checkpoint saved_params[sequence_len]: 192 != current_params[sequence_len]: 256,
cannot import incompatible model. Put key in `updatable_keys` list, if irrelevant.
Aborting import.
22.94992 M parameters
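
The reported parameter count can be reproduced by hand. The sketch below assumes the layer layout of Karpathy's `gpt.py` (bias-free key, query, and value projections, an output projection with bias, a feed-forward block with 4x expansion, two LayerNorms per block, a final LayerNorm, and a linear output head); the internals of `MultiHeadSelfAttention` are not shown in this notebook, so this layout is an assumption.

```python
def count_parameters(vocab_size, embedding_size, sequence_len, layers):
    """Count trainable parameters for a gpt.py-style decoder (assumed layout)."""
    d = embedding_size
    attention = 3 * d * d                # key, query, value over all heads, no bias
    projection = d * d + d               # attention output projection with bias
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers, 4x expansion
    layer_norms = 2 * 2 * d              # two LayerNorms per block, weight and bias each
    block = attention + projection + feed_forward + layer_norms
    token_embedding = vocab_size * d
    position_embedding = sequence_len * d
    final_norm = 2 * d
    lm_head = d * vocab_size + vocab_size
    return layers * block + token_embedding + position_embedding + final_norm + lm_head


total = count_parameters(vocab_size=20000, embedding_size=256, sequence_len=256, layers=16)
print(total / 1e6, "M parameters")
```

With the values used here the total is 22,949,920, which agrees with the count printed by the notebook. The number of heads does not appear in the formula, because the heads split the embedding dimension between them. Roughly 10 million parameters sit in the token embedding and output head, which is the cost of the 20,000-token vocabulary.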


In [30]:
# @torch.jit.script
# @torch.compile
def do_train_step(xb, yb):
    model.train()
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

In [31]:
# model = torch.compile(model)

In [None]:
dt0 = time.time()
print("training...")
gen_id = 0
iter_bench = 1
current_loss = estimate_loss(device)
inputs = ["What is the difference between good and evil? ", "How did everything come into existence? ", "What was the beginning of time? ", "How are physics, quantum-mechanics and consciousness related? ", "How to attain complete self-awareness? ", "What is the nature of reality? ", "How to be a good human being? "]
for iter in range(start_iter, params['max_iterations']):
    print(f"\rIteration: {iter+1:5d}/{((iter+1)//params['sample_every_n_iterations']+1)*params['sample_every_n_iterations']}/{params['max_iterations']}", end="", flush=True)
    # every once in a while evaluate the loss on train and val sets
    if (iter + 1) % params['sample_every_n_iterations'] == 0 or iter == params['max_iterations'] - 1:
        dt = time.time()
        print(f"\rloss eval", end="", flush=True)
        current_loss = estimate_loss(device)
        print(
            f"step {iter+1}: train loss {current_loss:.4f}, time {(dt-dt0)/iter_bench:.3f} sec/iter"
        )
        iter_bench = 1
        print("Sample: ", end="", flush=True)
        for temperature in [0.75]:
            print(f"--------temperature: {temperature} ---------")
            prompt = inputs[gen_id%len(inputs)]
            print(f"Prompt: {prompt}")
            generate_sample(td, device, prompt=prompt, toks=params['sample_size'], temperature=temperature, top_k=16)
        print("-------------------------------------------")
        gen_id += 1
        dt0 = time.time()
    # sample a batch of data
    xb, yb = get_torch_batch(td, params['batch_size'], device, "train")
    # run one optimization step on this batch
    do_train_step(xb, yb)
    start_iter = iter
    iter_bench += 1
    if (iter+1)%params['save_every_n_iterations'] == 0:
        mj.save_checkpoint(model, optimizer, params, iter, current_loss)
    

training...
................

In [None]:
for t in [0.75, 0.85, 0.95, 1.05]:
    print(f"------Temperature {t}--------")
    generate_sample(td, device, prompt="How are consciousness and quantum mechanics related?", toks=200, temperature=t, top_k=16)
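
The effect of `temperature` and `top_k` can be illustrated without the model. The sketch below shows the common technique of temperature scaling followed by top-k filtering on raw logits; it is not taken from `model.generate`, whose internals are not shown in this notebook, and `sample_next_token` is a hypothetical helper.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits with temperature and optional top-k."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the top_k highest logits (ties included); all others get zero probability
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax numerator
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_next_token(logits, temperature=0.75, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

Temperatures below 1.0 sharpen the distribution towards the most likely tokens, values above 1.0 flatten it, and `top_k` removes the unlikely tail entirely, which tends to reduce incoherent output at higher temperatures.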