<a href="https://colab.research.google.com/github/domschl/torch-transformer-poet/blob/main/torch_transformer_poet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Torch-Transformer-Poet

Please review [ml-indie-tools](https://github.com/domschl/ml-indie-tools), a collection of machine learning tools that supports more environment-independent code. It will access your Google Drive when used with Google Colab.

In [1]:
!pip install -U ml-indie-tools



In [2]:
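import os
import sys
if 'google.colab' in sys.modules:
    # from: https://github.com/pytorch/pytorch/issues/107960  (libcuda not found)
    # Note: `!export ...` runs in a throwaway subshell and does not persist,
    # so the variables are set via os.environ instead.
    os.environ["LC_ALL"] = "en_US.UTF-8"
    os.environ["LD_LIBRARY_PATH"] = "/usr/lib64-nvidia"
    os.environ["LIBRARY_PATH"] = "/usr/local/cuda/lib64/stubs"
    !ldconfig /usr/lib64-nvidia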
#     print("While default colab is still stuck with pytorch 1.13, we update to 2.0 using PIP. This can be removed, once Colab arrives in the presence.")
#     !pip install -U torch

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link



In [3]:
import logging
import os
import copy
import json
import time
import datetime
import random
import numpy as np

import torch

In [4]:
from ml_indie_tools.env_tools import MLEnv
from ml_indie_tools.Gutenberg_Dataset import Gutenberg_Dataset
from ml_indie_tools.Text_Dataset import Text_Dataset

from ml_indie_tools.Calibre_Dataset import Calibre_Dataset
from ml_indie_tools.Folder_Dataset import Folder_Dataset

from ml_indie_tools.pytorch_custom_layers import MultiHeadSelfAttention
from ml_indie_tools.pytorch_tr_compr_layers import MultiHeadSelfAttentionWithCompression, MultiHeadSelfAttentionWithCompressionState
import ml_indie_tools.pytorch_meta_tools as MJ

In [5]:
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("Main")  # getLogger attaches to the root logger configured by basicConfig
log.setLevel(logging.INFO)

## Preliminary

A PyTorch deep multi-head attention model for text generation, following Andrej Karpathy's [ng-video-lecture](https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py).

This code can run on CPU, GPU, or Apple Silicon. Google Colab is supported too; select the corresponding Colab runtime (menu: **`Runtime / Change runtime type`**).
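
As a rough illustration of what "causal" multi-head self-attention means, here is a minimal single-head sketch in plain Python. It is not the code of `MultiHeadSelfAttention` from ml-indie-tools; the function name `causal_self_attention` and the list-of-lists representation are assumptions made for this example only.

```python
import math

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors (lists of floats).
    Position t may only attend to positions 0..t.
    """
    dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against the current and earlier positions only (causal mask)
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Hypothetical toy input: three positions with two features each
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_self_attention(x, x, x)
print(y[0])  # the first position can only attend to itself
```

A multi-head layer runs several such heads on different linear projections of the input and concatenates their outputs.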

## 0. Environment

In [6]:
cached_batch_data = None   # Cache for time-consuming training data, so it is not regenerated if already cached.

ml_env = MLEnv(platform='pt', accelerator='fastest')
ml_env.describe()

'OS: Linux, Python: 3.10.12, Colab Jupyter Notebook Pytorch: 2.1.0+cu118, GPU: Tesla V100-SXM2-16GB (2MiB / 16384MiB), CPU'

## 1. Project configuration

In [7]:
# project_name = 'women_writers'
model_cpu = None
project_name='notes_and_research'
model_name=f'ngpt_COMP_{project_name}_v2_pt'

use_preprocessed_data = True                     # Use already tokenized data
use_existing_model_from_checkpoint = False        # Try to load checkpoint of training
use_torch_compile = True                         # Requires a modern graphics card with torch compile backend support

# NOTICE: This will request access to Google Drive, if running on Google Colab. Google Drive is used to store snapshots
# and training data. See project ml-indie-tools: https://github.com/domschl/ml-indie-tools
#
# Note: you need to allow popups in your browser for COLAB, otherwise you won't see the google-drive login box, and drive access will fail!

root_path, project_path, model_path, data_path, log_path = ml_env.init_paths(project_name=project_name, model_name=model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device("mps") if torch.backends.mps.is_available() else device

print(f"Root path (all projects) : {root_path} (This will be '.' (current dir) for local projects, and a google drive path for Colab)")
print(f"Project path             : {project_path} (Changes to the file system happen only below this project path")
print(f"Model path (snapshots)   : {model_path} (Model weights and snapshots are stored here)")
print(f"Data path (training data): {data_path} (Training data will be downloaded here)")
print(f"Log dir (tensorboard)    : {log_path} (it doesn't work to put logs on gdrive due to caching, hence local dir)")

Root path (all projects) : /content/drive/My Drive (This will be '.' (current dir) for local projects, and a google drive path for Colab)
Project path             : /content/drive/My Drive/Colab Notebooks/notes_and_research (Changes to the file system happen only below this project path
Model path (snapshots)   : /content/drive/My Drive/Colab Notebooks/notes_and_research/model/ngpt_COMP_notes_and_research_v2_pt (Model weights and snapshots are stored here)
Data path (training data): /content/drive/My Drive/Colab Notebooks/notes_and_research/data (Training data will be downloaded here)
Log dir (tensorboard)    : ./logs (it doesn't work to put logs on gdrive due to caching, hence local dir)


## 2.1 Text data from Project Gutenberg

The `Text_Dataset` and `Gutenberg_Dataset` classes handle training data,
encoding, batch generation, and formatted source display. They read
books from Project Gutenberg and support the creation of training batches.
The output functions highlight passages so that generated text can be
compared with the actual sources, which helps to identify identical
(memorized) parts.
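
The idea behind this memorization check can be sketched in a few lines. This is not how `Text_Dataset.source_highlight` is implemented; `find_verbatim_quotes` and the sample texts are assumptions for illustration, and the brute-force substring search would be far too slow for a real corpus.

```python
def find_verbatim_quotes(generated, sources, min_quote_size=10):
    """Return (start, end, source_name) spans of `generated` that occur
    verbatim in a source and are at least `min_quote_size` chars long."""
    spans = []
    i = 0
    while i + min_quote_size <= len(generated):
        best_end, best_src = i, None
        for name, text in sources.items():
            end = i + min_quote_size
            if generated[i:end] not in text:
                continue
            # Extend the match while the longer substring still occurs in this source
            while end < len(generated) and generated[i:end + 1] in text:
                end += 1
            if end > best_end:
                best_end, best_src = end, name
        if best_src is None:
            i += 1
        else:
            spans.append((i, best_end, best_src))
            i = best_end
    return spans

# Hypothetical toy data
sources = {"austen": "It is a truth universally acknowledged, that a single man"}
generated = "Some say it is a truth universally known."
spans = find_verbatim_quotes(generated, sources)
for start, end, name in spans:
    print(name, repr(generated[start:end]))
```

Each returned span could then be rendered with a source-specific background color.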

In [8]:
use_dark_mode=False # Set to false for white background. HTML-text-compare uses background-colorization to identify different sources. Those background colors are dependent on the theme type.

In [9]:
token_file = os.path.join(data_path,f"{project_name}_tokens.json")
if use_preprocessed_data is True:
    if os.path.exists(token_file):
        td = Text_Dataset()
        td.load_tokenizer(token_file)
    else:
        use_preprocessed_data = False

In [10]:
if use_preprocessed_data is False:
    cache_dir = os.path.join(data_path, 'gutenberg_cache')
    gd = Gutenberg_Dataset(cache_dir=cache_dir)

    if project_name == 'women_writers':  # sample searches
        search_spec= {
            "author": ["Emily Brontë", "Jane Austen", "Virginia Woolf"],
            "language": ["english"]
        }
        book_list=gd.search(search_spec)
    elif project_name == 'neo_philosophers':
        search_spec = {
            "author": ["Immanuel Kant", "Friedrich Nietzsche", "Wilhelm Hegel"],
            "language": ["english"]
        }
        book_list=gd.search(search_spec)
        search_spec = {
            "author": ["Plato"],
            "title": ["Timaeus", "Critias", "Symposium"],
            "language": ["english"]
        }
        book_list+=gd.search(search_spec)
    else:
        search_spec = {}
        book_list = []

    book_cnt = len(book_list)
    print(f"{book_cnt} matching books found with search {search_spec}.")

    if book_cnt > 0:
        if book_cnt<40:
            # Note: please verify that book_cnt is 'reasonable'. If you plan to use a large number of texts,
            # consider [mirroring Gutenberg](https://github.com/domschl/ml-indie-tools#working-with-a-local-mirror-of-project-gutenberg)
            book_list = gd.insert_book_texts(book_list, download_count_limit=book_cnt)
        else:
            logging.error("Please verify your book_list, a large number of books is scheduled for download. ABORTED.")

        for i in range(len(book_list)):
            print(f"{i}: {book_list[i]['title']} - {book_list[i]['author']}, {book_list[i]['ebook_id']}")

        if project_name == 'women_writers':
            select = ("Bennett", "1342", "5670", "1245", "161", "141", "121", "105", "Susan", "Wuthering", "Emma", "Voyage")  # List unique single-words from title or ebook_id to select a given book
            sub_book_list = [book_list[i] for i in range(len(book_list)) if not set([book_list[i]['ebook_id']]+book_list[i]['title'].split(' ')).isdisjoint(set(select))]
        else:
            sub_book_list = book_list

        print("Using:")
        for i in range(len(sub_book_list)):
            print(f"{i+1}: {sub_book_list[i]['title']} - {sub_book_list[i]['author']}")

        td = Text_Dataset(sub_book_list)
    else:
        td = Text_Dataset()

## 2.2 Additional training material from folder `{data_path}/local_texts`

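If the folder `{data_path}` (defined above) contains a sub-folder `local_texts` with files
named `<title> - <author> - <language>.txt`, these files are added to the training data.
Sample filename: `"./data/local_texts/works-of-shakespeare - William Shakespeare - English.txt"`.
The titles of those documents can be referenced via numeric aliases to preserve privacy on non-public data
(the cell below passes `use_aliases=False`). The cell takes the folders to scan from the `local_texts` entry
of `additional_texts.json` in the project path, and optionally a Calibre library from its `calibre` entry.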
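
The filename convention can be illustrated with a small parser. This is only a sketch of the naming scheme described above, not the parsing code used by `Folder_Dataset`; the helper name `parse_text_filename` is an assumption.

```python
import os

def parse_text_filename(path):
    """Split '<title> - <author> - <language>.txt' into its three parts.
    Returns None if the name does not follow the pattern."""
    name, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() != ".txt":
        return None
    parts = [p.strip() for p in name.split(" - ")]
    if len(parts) != 3 or not all(parts):
        return None
    title, author, language = parts
    return {"title": title, "author": author, "language": language}

info = parse_text_filename(
    "./data/local_texts/works-of-shakespeare - William Shakespeare - English.txt")
print(info)
```

Note that titles containing the separator `" - "` would not parse under this simple rule.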

In [11]:
if use_preprocessed_data is False:
    additional = os.path.join(project_path, "additional_texts.json")
    print(f"Looking for description of additional sources in {additional}")
    if os.path.exists(additional) is True:
        with open(additional, 'r') as f:
            add_desc = json.load(f)
            if 'local_texts' in add_desc:
                fd = Folder_Dataset()
                for text_path in add_desc['local_texts']:
                    print(f"Loading texts from {text_path}")
                    fd.load_index(text_path, use_aliases=False, max_file_size=100000)
                td.load_texts(fd.records[:10000])
            if 'calibre' in add_desc:
                cal_path = add_desc['calibre']
                if os.path.exists(cal_path):
                    print(f"Loading text from calibre at {cal_path}")
                    cd = Calibre_Dataset(cal_path)
                    cd.load_index(max_file_size=500000)
                    td.load_texts(cd.records[:1000])
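
Judging from the keys the cell above reads, `additional_texts.json` holds an optional list of folders under `local_texts` and an optional Calibre library path under `calibre`. The following sketch writes and reads back such a file; all paths are hypothetical placeholders, and a temporary directory stands in for the notebook's `project_path`.

```python
import json
import os
import tempfile

# Hypothetical locations; replace them with your own folders.
add_desc = {
    "local_texts": ["/path/to/my/notes", "/path/to/my/research"],
    "calibre": "/path/to/Calibre Library",
}

project_path = tempfile.mkdtemp()  # stand-in for the notebook's project_path
additional = os.path.join(project_path, "additional_texts.json")
with open(additional, "w") as f:
    json.dump(add_desc, f, indent=2)

# Read it back the same way the notebook cell does
with open(additional, "r") as f:
    loaded = json.load(f)
print(loaded["local_texts"])
```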

## 2.3 Tokenize data

In [12]:
if use_preprocessed_data is False:
    MAX_TOKENS = 35000  # This becomes vocab_size
    MAX_NGRAM_LEN = 4   # Max length of a token

    print("")
    print(f"Starting tokenizer with token length from 1..{MAX_NGRAM_LEN} with a max of {MAX_TOKENS} unique tokens,")
    print("this can take considerable time...")

    # td.init_tokenizer(tokenizer='ngram', max_ngrams=MAX_NGRAM_LEN, max_tokens=MAX_TOKENS)
    td.init_tokenizer(tokenizer='bytegram', max_ngrams=MAX_NGRAM_LEN, max_tokens=MAX_TOKENS)
    td.save_tokenizer(token_file)
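
To give an intuition for `max_ngrams` and `max_tokens`, here is a simplified character n-gram tokenizer. It is an assumption-laden sketch, not the `ngram` or `bytegram` algorithm of `Text_Dataset`: it keeps all single characters, fills the remaining vocabulary with the most frequent longer n-grams, and encodes by greedy longest match.

```python
from collections import Counter

def build_ngram_vocab(text, max_ngram_len=4, max_tokens=50):
    """Keep all single characters, then add the most frequent longer n-grams."""
    singles = sorted(set(text))
    counts = Counter(text[i:i + n]
                     for n in range(2, max_ngram_len + 1)
                     for i in range(len(text) - n + 1))
    room = max(0, max_tokens - len(singles))
    frequent = [gram for gram, _ in counts.most_common(room)]
    return {tok: idx for idx, tok in enumerate(singles + frequent)}

def encode(text, vocab, max_ngram_len=4):
    """Greedy longest-match encoding into token ids."""
    ids, i = [], 0
    while i < len(text):
        for n in range(min(max_ngram_len, len(text) - i), 0, -1):
            tok = text[i:i + n]
            if tok in vocab:
                ids.append(vocab[tok])
                i += n
                break
        else:
            raise ValueError(f"character {text[i]!r} not in vocabulary")
    return ids

def decode(ids, vocab):
    inverse = {idx: tok for tok, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)

text = "the cat and the hat and the bat"
vocab = build_ngram_vocab(text, max_ngram_len=4, max_tokens=50)
ids = encode(text, vocab)
print(len(text), "characters ->", len(ids), "tokens")
```

In the notebook, the number of distinct tokens becomes `vocab_size` of the model.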

## 3. Model metadata

In [13]:
params = None
updatable_keys=['learning_rate', 'batch_size', 'current_epoch', 'current_loss', 'stateful',
                 'sample_every_n_iterations', 'sample_size', 'save_every_n_iterations']
attn_layers = 4
embs = 256
linear_yoke_hidden_index = -1  # Set to -1, if no yoke is wanted (standard transformer model)
linear_yoke_size = 96

params = { # Multi-head self-attention
        'meta_name_template': '{mhsa_layers}x{heads}x{units}x{vocab_size}',

        'mhsa_layers': attn_layers,
        'heads': 8,
        'causal': True,  # Use causal self-attention
        'linear_non_linearity': 'relu',  # relurelu: use additional relu for state gating
        'linear_yoke_hidden_index': linear_yoke_hidden_index,  # no residual for non-default hidden_size only
        'linear_yoke_size': linear_yoke_size,
        'linear_yoke_residual': True,
        'stateful': False,
        'joint_state_training': 4,  # use consecutive training samples with shared state for 32 chars
        'dropout': 0.1,
        'vocab_size': td.get_unique_token_count(),
        'sequence_len': 128,
        'embedding_size': embs,
        'test_iterations': 10,  # number of batches used for loss estimation

        'batch_size': 128,
        'learning_rate': 0.002,
        'sample_every_n_iterations': 256,
        'sample_size': 150,
        'save_every_n_iterations': 256,

        'max_iterations': 1000000  # maximum number of training iterations
    }
if params['stateful'] is False:
    params['joint_state_training'] = 0
model_file_path = MJ.get_model_filename(model_path)
if use_existing_model_from_checkpoint is True:
    params = MJ.load_model_metadata_from_checkpoint(params, updatable_keys, model_file_path, device=device, log=log) # torch.device('cpu'))
if params is None or use_existing_model_from_checkpoint is False:
    use_existing_model_from_checkpoint = False
# print(params)

## 4. Batch handling

In [14]:
td.init_getitem(sample_type='encoded', sample_length=params['sequence_len']+1+params['joint_state_training'], content_stepping=1)
num_records = len(td)
print(f"{num_records} records")

66008632 records


In [15]:
def get_sample_sub_batch(sample_batch, batch_size, sub_index=0):
    # sample_batch is one encoded sample of length sequence_len+1+joint_state_training.
    # Inputs Xi and targets yi are the same window, with yi shifted by one token.
    # NOTE: the same slice is stacked batch_size times, so all rows of the batch are identical.
    for i in range(batch_size):
        Xi = sample_batch[sub_index:-1-params['joint_state_training']+sub_index]
        if params['joint_state_training']+sub_index == 0:
            # A slice end of -0 would yield an empty list, hence the open-ended slice
            yi = sample_batch[sub_index+1:]
        else:
            yi = sample_batch[sub_index+1:-params['joint_state_training']+sub_index]
        if i==0:
            smpX=np.array(Xi, dtype=np.int32)
            smpy=np.array(yi, dtype=np.int32)
        else:
            smpX = np.vstack((smpX, np.array(Xi, dtype=np.int32)))
            smpy = np.vstack((smpy, np.array(yi, dtype=np.int32)))
    return np.array(smpX), np.array(smpy)

def get_sample_batch(td, batch_size):
    sample_batch = td.get_random_item()
    return get_sample_sub_batch(sample_batch, batch_size)
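
The slicing in `get_sample_sub_batch` is easier to follow on a tiny example. The sketch below reproduces the index arithmetic for a single sample using plain lists; `split_input_target` is a hypothetical helper, and it uses `len(sample)` instead of negative indices so the special case for a slice end of `-0` is not needed.

```python
def split_input_target(sample, joint_state_training=0, sub_index=0):
    """Return (inputs, targets) for one sample; targets are shifted by one token."""
    n = joint_state_training
    x = sample[sub_index:len(sample) - 1 - n + sub_index]
    y = sample[sub_index + 1:len(sample) - n + sub_index]
    return x, y

# Stateless case: sequence_len = 4, so a sample holds 4 + 1 tokens
x0, y0 = split_input_target(list(range(5)))
print(x0, y0)

# Stateful case: sequence_len = 4, joint_state_training = 4, so 4 + 1 + 4 tokens
sample = list(range(9))
for sub_index in range(4):
    print(sub_index, split_input_target(sample, joint_state_training=4, sub_index=sub_index))
```

Each increment of `sub_index` moves the window one token to the right, so consecutive sub-batches overlap in all but one position.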

In [16]:
num_batches = num_records // params['batch_size']
print(f"num_batches = {num_batches}")

num_batches = 515692


In [17]:
sample_data = None

def get_torch_subbatch(td, batch_size, device, split=None, sub_index=0):
    global sample_data
    if sub_index==0:
        sample_data = td.get_random_item()
    x, y = get_sample_sub_batch(sample_data, batch_size, sub_index)
    tx = torch.tensor(x, dtype=torch.long).to(device)
    tx.requires_grad = False
    ty = torch.tensor(y, dtype=torch.long).to(device)
    ty.requires_grad = False
    return tx, ty

def get_torch_batch(td, batch_size, device, split=None):
    x, y = get_sample_batch(td, batch_size)
    tx = torch.tensor(x, dtype=torch.long).to(device)
    tx.requires_grad = False
    ty = torch.tensor(y, dtype=torch.long).to(device)
    ty.requires_grad = False
    return tx, ty

def get_zero_state(batch_size, sequence_len, hidden_size, device):
    zstate = torch.zeros(batch_size, sequence_len, hidden_size, device=device)
    zstate.requires_grad = False
    return zstate

## 5. Loss and training helpers

In [18]:
print("creating model...")
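try:
    # Colab + torch 2 -> lots of garbage. Free a model left over from a previous run of this cell.
    if model is not None:
        del model
except NameError:
    # First run: no model has been created yet
    pass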


if params['stateful'] is False:
    if params['linear_yoke_hidden_index'] == -1:
        model = MultiHeadSelfAttention(vocab_size=params['vocab_size'], embedding_size=params['embedding_size'],
                                       sequence_len=params['sequence_len'], dropout=params['dropout'],
                                       num_heads=params['heads'], num_layers=params['mhsa_layers'],
                                       causal=params['causal'], device=device)
    else:
        model = MultiHeadSelfAttentionWithCompression(vocab_size=params['vocab_size'], embedding_size=params['embedding_size'],
                                       sequence_len=params['sequence_len'], dropout=params['dropout'],
                                       num_heads=params['heads'], num_layers=params['mhsa_layers'],
                                       causal=params['causal'], linear_non_linearity=params['linear_non_linearity'],
                                       linear_yoke=(params['linear_yoke_hidden_index'], params['linear_yoke_size'], params['linear_yoke_residual']),
                                       device=device)
else:
    model = MultiHeadSelfAttentionWithCompressionState(vocab_size=params['vocab_size'], embedding_size=params['embedding_size'],
                                       sequence_len=params['sequence_len'], dropout=params['dropout'],
                                       num_heads=params['heads'], num_layers=params['mhsa_layers'],
                                       causal=params['causal'], linear_non_linearity=params['linear_non_linearity'],
                                       linear_yoke=(params['linear_yoke_hidden_index'], params['linear_yoke_size'], params['linear_yoke_residual']),
                                       device=device)

optimizer = torch.optim.AdamW(model.parameters(), lr=params['learning_rate'])

model = model.to(device)
if use_existing_model_from_checkpoint is True:
    params_load = MJ.load_checkpoint(params, model, optimizer, file_path=model_file_path, updatable_keys=updatable_keys, device=device, log=log) # torch.device("cpu"))
    if params_load is not None:
        params = params_load
model = model.to(device)
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

if use_torch_compile is True:
    if device == 'cuda':
        print("Compiling...")
        model = torch.compile(model)
        print("Compile ok.")
        try:
            torch.set_float32_matmul_precision('high')
        except:
            print("Seems no tensor cores for that.")
    # elif str(device) == 'mps':
    #     print("Compiling...")
    #     model = torch.compile(model)
    #     print("Compile ok.")

if 'current_epoch' in params:
    ep = params['current_epoch']
else:
    ep=0
if 'current_loss' in params:
    ls = params['current_loss']
else:
    ls=0

if ep==0 and ls==0:
    start_iter = 0
else:
    start_iter = ep
    current_loss = ls

# print the number of parameters in the model
print(model)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")

creating model...
Compiling...
Compile ok.
OptimizedModule(
  (_orig_mod): MultiHeadSelfAttention(
    (token_embedding_table): Embedding(35000, 256)
    (position_embedding_table): Embedding(128, 256)
    (blocks): Sequential(
      (0): Block(
        (sa): MultiHeadAttention(
          (heads): ModuleList(
            (0-7): 8 x SelfAttentionHead(
              (key): Linear(in_features=256, out_features=32, bias=False)
              (query): Linear(in_features=256, out_features=32, bias=False)
              (value): Linear(in_features=256, out_features=32, bias=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (proj): Linear(in_features=256, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ffwd): FeedFoward(
          (non_linearity): ReLU()
          (net): Sequential(
            (0): Linear(in_features=256, out_features=1024, bias=True)
            (1): ReLU()
            (2): Linea

In [19]:
@torch.no_grad()
def estimate_loss(device):
    # XXX: this does take data for train and val from SAME pool!
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(params['test_iterations'])
        for k in range(params['test_iterations']):
            print(".", end="", flush=True)
            X, Y = get_torch_batch(td, params['batch_size'], device, split)
            if params['stateful'] is False:
                logits, loss = model(X, Y)
            else:
                state = get_zero_state(X.shape[0], params['sequence_len'], params['linear_yoke_size'], device)
                logits, loss, state = model(X, Y, state=state)
                # print(k, state)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    print("\r", end="", flush=True)
    mloss = (out['train']+out['val'])/2.0
    return mloss

def generate_sample(td, device, prompt=' ', toks=100, state=None, temperature=1.0, top_k=None, pad=False):
    # generate from the model
    # context = torch.zeros((1, 1), dtype=torch.long, device=device)
    model.eval()
    if pad is True:
        while len(prompt)<params['sequence_len']:
            if len(prompt)==params['sequence_len']-1:
                prompt = '\n' + prompt
            else:
                prompt = ' ' + prompt
    context = torch.tensor([td.encode(prompt)]).to(device)
    if params['stateful'] is False:
        answer = model.generate(context, max_new_tokens=toks, temperature=temperature, top_k=top_k)
    else:
        if state is None:
            print()
            print("Please don't put state=None in generator!")
            state = get_zero_state(1, params['sequence_len'], params['linear_yoke_size'], device)
        answer, state = model.generate(idx=context, max_new_tokens=toks, state=state, temperature=temperature, top_k=top_k)

    txt = td.decode(answer[0].tolist())
    # Identify memorisation of text by highlighting verbatim quotes from sources
    # that are longer than 10 chars. HTML color-coded output for source identification:
    td.source_highlight(txt, min_quote_size=10, dark_mode=use_dark_mode, display_ref_anchor=False)
    if params['stateful'] is False:
        return txt
    else:
        return txt, state
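
The `temperature` and `top_k` arguments control how the next token is drawn from the model's output. The following is a generic sketch of temperature scaling with top-k filtering on a list of logits; it is not the code of `model.generate`, and `sample_next_token` is a name invented for this example.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits with temperature and optional top-k filtering."""
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [value / temperature for value in logits]
    if top_k is not None:
        # Keep only the top_k highest-scoring candidates
        keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    else:
        keep = list(range(len(scaled)))
    # Unnormalized softmax weights over the remaining candidates
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical scores for a 4-token vocabulary
greedy = [sample_next_token(logits, temperature=0.75, top_k=1) for _ in range(20)]
top2 = {sample_next_token(logits, temperature=0.75, top_k=2) for _ in range(200)}
print(greedy[0], sorted(top2))
```

With `top_k=1` the choice is deterministic (the highest logit); larger values of `top_k` and `temperature` allow more variety in the generated text.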

In [20]:
# @torch.jit.script
# @torch.compile
def do_train_step(xb, yb, device, state=None):
    model.train()
    if params['stateful'] is False:
        logits, loss = model(xb, yb)
    else:
        # XXX continuous training data & state!
        if state is None:
            state = get_zero_state(xb.shape[0], params['sequence_len'], params['linear_yoke_size'], device)
        logits, loss, state = model(xb, targets=yb, state=state)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if params['stateful'] is True:
        return state.detach()
    else:
        return None

In [None]:
dt0 = time.time()
sdt = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(f"training, start at {sdt}...")
gen_id = 0
iter_bench = 1
current_loss = estimate_loss(device)
if params['stateful'] is True:
    gen_state = get_zero_state(1, params['sequence_len'], params['linear_yoke_size'], device=device)
else:
    gen_state = None
inputs = ["What is the difference between good and evil? The difference ", "How did everything come into existence? The origin ", "What was at the beginning of time? Time itself ", "How are physics, quantum-mechanics and consciousness related? The relation between ", "How to attain complete self-awareness? Complete ", "What is the nature of reality? The nature ", "How be a good human being? A human "]
for iter in range(start_iter, params['max_iterations']):
    print(f"\rIteration: {iter+1:5d}/{((iter+1)//params['sample_every_n_iterations']+1)*params['sample_every_n_iterations']}/{params['max_iterations']}", end="", flush=True)
    # every once in a while evaluate the loss on train and val sets
    if (iter + 1) % params['sample_every_n_iterations'] == 0 or iter == params['max_iterations'] - 1:
        dt = time.time()
        print(f"\rloss eval", end="", flush=True)
        current_loss = estimate_loss(device)
        print(
            f"step {iter+1}: train loss {current_loss:.4f}, time {(dt-dt0)/iter_bench:.3f} sec/iter"
        )
        iter_bench = 1
        sdt = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        print(f"Sample at {sdt}:", flush=True)
        for temperature in [0.75]:
            print(f"--------temperature: {temperature} ---------")
            prompt = inputs[gen_id%len(inputs)]
            print(f"Prompt: {prompt}")
            generate_sample(td=td, device=device, prompt=prompt, toks=params['sample_size'], state=gen_state, temperature=temperature, top_k=16)
        print("-------------------------------------------")
        gen_id += 1
        dt0 = time.time()

    if params['stateful'] is False or params['joint_state_training'] == 0:
        xb, yb = get_torch_batch(td, params['batch_size'], device, "train")
        do_train_step(xb, yb, device=device)
    else:
        state = get_zero_state(1, params['sequence_len'], params['linear_yoke_size'], device=device)
        state.requires_grad = False
        for i in range(params['joint_state_training']):
            print(f"\rIteration: {iter+1:5d}[{i+1}/{params['joint_state_training']}]/{((iter+1)//params['sample_every_n_iterations']+1)*params['sample_every_n_iterations']}/{params['max_iterations']}", end="", flush=True)
            xb, yb = get_torch_subbatch(td, params['batch_size'], device, "train", i)
            state = do_train_step(xb, yb, device=device, state=state)
            # state = torch.cat((state, state[:, -1:, :]), dim=1)
            # state[:, -1, :] = 0
            state = torch.cat((state[:, :1, :], state), dim=1)
            # state[:, 0, :] = 0
            state = state[:, -params['sequence_len']:, :]
            # state.detach() # requires_grad = False

    start_iter = iter
    iter_bench += 1
    if (iter+1)%params['save_every_n_iterations'] == 0:
        MJ.save_checkpoint(params, model, optimizer, iter, current_loss, file_path=model_file_path, log=log)


training, start at 2023-11-18 20:21:10...
Iteration:   241/256/1000000
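
The progress line reads `current iteration / next evaluation point / max_iterations`. The arithmetic from the training loop can be isolated as follows; the two helper names are assumptions introduced for this sketch.

```python
def next_sample_iteration(iteration, sample_every_n_iterations):
    """Displayed (1-based) iteration at which the next loss evaluation and sample are due."""
    done = iteration + 1
    return (done // sample_every_n_iterations + 1) * sample_every_n_iterations

def is_sample_step(iteration, sample_every_n_iterations, max_iterations):
    """True if the loop evaluates the loss and prints a sample in this iteration."""
    return ((iteration + 1) % sample_every_n_iterations == 0
            or iteration == max_iterations - 1)

# Loop index 240 is displayed as iteration 241
print(f"Iteration: {240 + 1:5d}/{next_sample_iteration(240, 256)}/{1000000}")
print(is_sample_step(255, 256, 1000000))
```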

In [None]:
# for t in [0.5, 1.5]:
#     print(f"------Temperature {t}--------")
#     generate_sample(td, device, prompt="How are consciousness and quantum mechanics related?", toks=150, temperature=t, top_k=16)