# Implementing GPT-2 in NumPy*
Here is an implementation of the GPT-2 model that uses NumPy as a substitute for PyTorch or TensorFlow. It lacks backpropagation, although that seems doable with a reasonable autograd implementation on top of NumPy (check out [agrad](https://github.com/arnavg115/agrad)). To run it, make sure you are connected to a T4 instance on Colab. To run on the CPU, replace every instance of `cp` with `np`; the one call without a NumPy counterpart is `cp.asnumpy`, which can be swapped for `np.asarray`.

This is mostly a reimplementation of [minGPT](https://github.com/karpathy/minGPT) by Andrej Karpathy. I have changed some things around, condensed code where I thought it was appropriate, and implemented all the building blocks of the LLM that torch already provides. I was also too lazy to implement a tokenizer and used the transformers library instead.

*\*I am not actually using NumPy, because it cannot run on the GPU. Instead I use CuPy, which allows ops to run on the GPU. The NumPy implementation is the same, with CuPy replaced by NumPy.*

## Imports and installs
### Packages:
1. transformers: Loads the GPT-2 weights and the tokenizer.
2. tqdm: For nice-looking progress bars.
3. cupy: A NumPy-compatible array library that runs on the GPU.
4. math: Basic math ops.

In [None]:
!pip install transformers


In [None]:
from collections import OrderedDict
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
import cupy as cp
import math
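
If no GPU is available, one way to keep the rest of the code unchanged is to bind NumPy to the name `cp` when CuPy cannot be imported. This is only a sketch of that idea: the `to_numpy` helper is a hypothetical name introduced here to cover `cp.asnumpy`, which NumPy does not provide, and the `ImportError` check only catches a missing CuPy install, not a missing CUDA device.

```python
try:
    import cupy as cp          # GPU arrays when CuPy is installed
    to_numpy = cp.asnumpy      # copies a device array back to host memory
except ImportError:
    import numpy as cp         # CPU fallback under the same name
    to_numpy = cp.asarray      # arrays are already NumPy arrays on the CPU

x = cp.arange(6).reshape(2, 3)
print(to_numpy(x).sum(axis=-1))
```

With this in place, calls such as `cp.asnumpy(...)` in the inference cell would be written as `to_numpy(...)`.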

## Defining basic classes, functions, etc.
I decided to implement the basic functions and building blocks using a pytorch-like syntax.

Here I define the following:
1. Module: Serves as the base class for many of the learnable building blocks
2. Kaiming init: Used to initialize weights
3. Linear: Basic linear layer with option to use a bias
4. Embedding: Basic embedding layer
5. Softmax
6. Gelu: Taken from this [paper](https://arxiv.org/abs/1606.08415)
7. Dropout
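8. Layernorm: Layer normalization based on this [page](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). The PyTorch docs describe the statistics as being computed over the last D dimensions, where D is the number of dimensions in `normalized_shape`. Since only `n_embd` is passed here, D is 1 and only the last dimension is normalized; the example in the docs that uses the last two dims corresponds to a two-element `normalized_shape`.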

In [None]:
class Module:
  def __call__(self, *args, **kwargs):
    return self.forward(*args, **kwargs)

In [None]:
# Simplified Kaiming-style uniform init: sample from U(-bound, bound)
# with bound = sqrt(a / max(shape[0], shape[1]))
def k_init(*shape, a=1):
    bound = math.sqrt(a / max(shape[0], shape[1]))
    return cp.random.uniform(-bound, bound, shape)

In [None]:
class linear(Module):
  def __init__(self, inpt, out, bias = True):
    self.w = k_init(inpt, out)
    self.bias = bias
    if bias:
      self.b = k_init(1,out)

  def forward(self, x):
    if not self.bias:
      return x @ self.w
    return x @ self.w + self.b

In [None]:
class embedding(Module):
  def __init__(self,vocab,n_embd):
    self.w = k_init(vocab, n_embd)

  def forward(self, x):
    return self.w[x]

In [None]:
def softmax(x, axis=-1):
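  # subtract the max along the axis before exponentiating to avoid overflow
  ex = cp.exp(x - cp.max(x, axis=axis, keepdims=True))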
  return ex / cp.sum(ex, axis=axis, keepdims=True)
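
Exponentiating raw logits can overflow when they are large. Subtracting the maximum first leaves the result unchanged, because the shift cancels in the ratio. The following is a minimal pure-Python sketch of the difference; the function names are illustrative and not part of the notebook.

```python
import math

def naive_softmax(xs):
    # exponentiates the raw values; math.exp overflows for large inputs
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    # shifting by the max keeps every exponent <= 0, so exp() stays in (0, 1]
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

print(stable_softmax(small))
print(stable_softmax(large))   # same probabilities as for `small`
try:
    naive_softmax(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```

With array libraries such as NumPy the naive form does not raise; it produces `inf / inf = nan` instead, which is harder to notice.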

In [None]:
def gelu(x):
  return 0.5*x*(1+ cp.tanh(cp.sqrt(2/cp.pi)* (x+ 0.044715*cp.power(x, 3))))
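
The function above is the tanh approximation of GELU from the paper, not the exact form based on the Gaussian CDF. As a rough check of how close the two are, here is a small pure-Python comparison; the function names are illustrative, and the sample points are arbitrary.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation, same constants as the cupy version above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

points = (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0)
for x in points:
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")

# largest absolute gap on these sample points (below 1e-3)
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in points)
print(max_gap)
```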

In [None]:
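class dropout(Module):
  training = False # this notebook only runs inference, so dropout is a no-op unless this is set to True

  def __init__(self, p):
    self.p = p

  def forward(self,x):
    p = self.p
    if p == 0 or not self.training:
      return x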
    mask = cp.random.binomial(1, 1 - p, x.shape)
    out = x * mask
    out /= (1 - p)
    return out


In [None]:
# Based on https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
class layernorm(Module):
  def __init__(self, n_shape, eps = 1e-5):
    self.eps = eps
    self.w = cp.ones(n_shape)
    self.b = cp.zeros(n_shape)

  def forward(self,x):
    mean = cp.mean(x, axis=-1, keepdims=True) # normalized_shape is a single int (n_embd), so the statistics are taken over the last dim only
    var = cp.var(x, axis=-1, keepdims=True)
    normed = (x-mean) / cp.sqrt(var + self.eps)
    return (normed * self.w) + self.b

  def from_pretrained(self, keys, sd):
    self.w = cp.asarray(sd[keys[0]].numpy())
    self.b = cp.asarray(sd[keys[1]].numpy())
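
To make the choice of axis concrete, here is a small NumPy sketch of the normalization step on its own, without the learned scale and shift. The function name is illustrative. With an input of shape `(batch, tokens, n_embd)`, every token vector is normalized independently.

```python
import numpy as np

def layernorm_last_dim(x, eps=1e-5):
    # statistics are computed per token, over the embedding axis only
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((2, 3, 8))  # (batch, tokens, n_embd)
y = layernorm_last_dim(x)

print(y.shape)                 # shape is unchanged
print(y.mean(axis=-1))         # close to 0 for every token
print(y.var(axis=-1))          # close to 1 for every token
```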


## Major Building Blocks of the model
1. Multihead Attention
2. Multilayer perceptron
3. Decoder Block: Combines both the attention and MLP

In [None]:
class MultiHeadAttention(Module):
  def __init__(self, n_embd, n_head, n_ctx, attn_pdrop, resid_pdrop):
    self.n_head = n_head
    self.n_embd = n_embd
    self.c_attn = linear(n_embd, 3 * n_embd)
    self.mask = cp.triu(cp.full((n_ctx,n_ctx), -cp.inf), k=1)[cp.newaxis,cp.newaxis] # additive causal mask: -inf strictly above the diagonal, 0 elsewhere
    self.attn_drop = dropout(attn_pdrop)
    self.resid_drop = dropout(resid_pdrop)
    self.c_proj = linear(n_embd, n_embd)

  def forward(self, x):
    B, T, C = x.shape
    ot = self.c_attn(x)
    ot = cp.split(ot,3, axis=2)
    q,k,v = [j.reshape(B, T, self.n_head,self.n_embd // self.n_head).transpose((0,2,1,3)) for j in ot] # I hate numpy transpose. pytorch is superior in this regard
    intmd = (q @ k.transpose((0,1,3,2))) / cp.sqrt(k.shape[-1])
    intmd = intmd + self.mask[:,:,:T,:T]
    intmd = softmax(intmd)
    intmd = self.attn_drop(intmd)
    otpt = intmd @ v
    otpt = otpt.transpose((0,2,1,3)).reshape((B,T,C))
    y = self.resid_drop(self.c_proj(otpt))
    return y

  def from_pretrained(self, keys, sd):
    for key in keys:
      if "c_attn" in key:
        layer = self.c_attn
      elif "c_proj" in key:
        layer = self.c_proj
      else:
        continue # skip non-parameter entries such as a stored causal mask, if the state dict has any
      if key.endswith("weight"):
        layer.w = cp.asarray(sd[key].numpy())
      else:
        layer.b = cp.asarray(sd[key].numpy())

In [None]:
class MLP(Module):
  def __init__(self,n_embd, resid_pdrop):
    self.c_fc = linear(n_embd, 4*n_embd)
    self.c_proj = linear(4*n_embd, n_embd)
    self.drop = dropout(resid_pdrop)

  def forward(self, x):
    y = self.c_fc(x)
    y = gelu(y)
    y = self.c_proj(y)
    return self.drop(y)

  def from_pretrained(self, keys, sd):
    for key in keys:
      if "c_fc" in key:
        if "weight" in key:
          self.c_fc.w = cp.asarray(sd[key].numpy())
        else:
          self.c_fc.b = cp.asarray(sd[key].numpy())
      else:
        if "weight" in key:
          self.c_proj.w = cp.asarray(sd[key].numpy())
        else:
          self.c_proj.b = cp.asarray(sd[key].numpy())

In [None]:
class DecoderBlock(Module):
  def __init__(self, n_embd, n_head, n_ctx, attn_pdrop, resid_pdrop):
    self.ln_1 = layernorm(n_embd)
    self.attn = MultiHeadAttention(n_embd, n_head, n_ctx, attn_pdrop, resid_pdrop)
    self.ln_2 = layernorm(n_embd)
    self.mlp = MLP(n_embd, resid_pdrop)

  def forward(self,x):
    x = x+ self.attn(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x

  def from_pretrained(self, keys, sd):
    attn = [i for i in keys if "attn" in i]
    mlp = [i for i in keys if "mlp" in i]
    ln1 = [i for i in keys if "ln_1" in i]
    ln2 = [i for i in keys if "ln_2" in i]
    self.ln_1.from_pretrained(ln1, sd)
    self.ln_2.from_pretrained(ln2, sd)
    self.mlp.from_pretrained(mlp, sd)
    self.attn.from_pretrained(attn, sd)


## GPT model definition
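Here I put together all of the bits and pieces for this model, as well as additional code for generation. With the default configuration (GPT-2 small) the model has roughly 124M learned parameters when `lm_head` is counted as shared with the token embedding; a tally of about 137M results if the twelve non-learned 1024×1024 causal masks are included. There is also some code for loading the pretrained weights. I was too lazy to implement generation parameters such as temperature or top-k, though.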
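
The parameter count can be checked by hand from the layer shapes in the constructor. The sketch below is a back-of-the-envelope tally, not something the notebook runs, and the helper name is made up for illustration. It reports three totals, because the number depends on whether `lm_head` is counted as shared with the token embedding and whether the non-learned causal masks are included.

```python
def gpt2_param_counts(n_layer=12, n_embd=768, n_ctx=1024, vocab=50257):
    """Tally array sizes for the layer layout used in this notebook."""
    wte = vocab * n_embd                       # token embedding
    wpe = n_ctx * n_embd                       # position embedding
    ln = 2 * n_embd                            # layernorm weight + bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = ln + attn + ln + mlp
    learned = wte + wpe + n_layer * block + ln  # lm_head shared with wte
    masks = n_layer * n_ctx * n_ctx             # non-learned causal masks
    return {
        "learned_tied": learned,
        "learned_plus_masks": learned + masks,
        "learned_untied_lm_head": learned + vocab * n_embd,
    }

counts = gpt2_param_counts()
for name, value in counts.items():
    print(f"{name}: {value:,}")
# learned_tied: 124,439,808
# learned_plus_masks: 137,022,720
# learned_untied_lm_head: 163,037,184
```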

In [None]:
class gpt(Module):
  def __init__(self, n_layer=12, n_embd=768, n_ctx=1024, attn_pdrop = 0.1, resid_pdrop=0.1, vocab=50257,n_head=12):
    self.w_embd = embedding(vocab, n_embd)
    self.p_embd = embedding(n_ctx, n_embd)
    self.drop = dropout(0.1)
    self.h = [DecoderBlock(n_embd, n_head, n_ctx, attn_pdrop, resid_pdrop) for i in range(n_layer)]
    self.ln_f = layernorm(n_embd)
    self.lm_head = linear(n_embd, vocab, bias=False)
    self.n_ctx = n_ctx

  def forward(self, x):
    b,t = x.shape
    tok_emb = self.w_embd(x)
    pos_emb = self.p_embd(cp.arange(0,t)[cp.newaxis])
    x = self.drop(pos_emb + tok_emb)
    for block in self.h: # renamed from b, which shadowed the batch size above
      x = block(x)
    y = self.ln_f(x)
    y = self.lm_head(y)
    return y

  def from_pretrained(self,sd:OrderedDict):
    self.w_embd.w = cp.array(sd["transformer.wte.weight"].numpy())
    self.p_embd.w = cp.array(sd["transformer.wpe.weight"].numpy())
    self.lm_head.w = cp.array(sd["lm_head.weight"].numpy().T)
    self.ln_f.w = cp.array(sd["transformer.ln_f.weight"].numpy())
    self.ln_f.b = cp.array(sd["transformer.ln_f.bias"].numpy())

    for ind, b in enumerate(self.h):
      keys = [i for i in list(sd.keys())[2:] if f"h.{ind}." in i]
      b.from_pretrained(keys, sd)

  def generate(self, idx, max_new=1):
    for _ in tqdm(range(max_new)):
      if idx.shape[1] < self.n_ctx:
        idx_c = idx
      else:
        idx_c = idx[:,-self.n_ctx:]
      l = self.forward(idx_c)
      new_tok = l[:,-1,:]
      probs = softmax(new_tok)
      nxt = cp.argmax(cp.random.multinomial(1,probs[0]), keepdims=True)[cp.newaxis]
      idx = cp.concatenate((idx, nxt), axis=-1)

    return idx
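
The loop above samples from the unmodified softmax distribution. Generation parameters such as temperature and top-k could be applied to the logits of the last position before sampling. The following is a hedged NumPy sketch of that idea; `sample_next` and its arguments are hypothetical and not part of this notebook.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from a 1-D array of logits."""
    if rng is None:
        rng = np.random.default_rng()
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        kth = np.sort(logits)[-top_k]                     # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)  # drop everything below it
    logits = logits - logits.max()                        # stable softmax
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
print(sample_next([0.1, 5.0, 0.3], top_k=1))  # top_k=1 reduces to greedy decoding
print(sample_next([0.0, 1.0, 2.0, 3.0], temperature=0.7, top_k=2, rng=rng))
```

Inside `generate`, a helper like this would take the place of the `softmax` and `multinomial` lines, applied to the logits of the last position.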


## Loading the weights
I tried using the official GPT-2 weights, but the generation results were not great. Instead I switched to this [version](https://huggingface.co/vicgalle/gpt2-alpaca-gpt4) of GPT-2, fine-tuned on the Alpaca dataset.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("vicgalle/gpt2-alpaca-gpt4")
model = AutoModelForCausalLM.from_pretrained("vicgalle/gpt2-alpaca-gpt4")


In [None]:
mod = gpt()
mod.from_pretrained(model.state_dict())
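
A note on why only `lm_head` is transposed during loading: in the Hugging Face GPT-2 checkpoint, the attention and MLP projections are stored with shape `(in_features, out_features)`, which matches the `x @ w` convention of the `linear` class here, while `lm_head.weight` uses the `(out_features, in_features)` layout of `torch.nn.Linear`. The toy NumPy sketch below shows the two layouts producing the same output shape; the sizes are made up for illustration.

```python
import numpy as np

n_embd, vocab = 4, 6
x = np.ones((1, n_embd))

w_in_out = np.zeros((n_embd, vocab))  # (in, out) layout: usable as x @ w directly
w_out_in = np.zeros((vocab, n_embd))  # (out, in) layout: needs a transpose first

y_direct = x @ w_in_out
y_transposed = x @ w_out_in.T

print(y_direct.shape, y_transposed.shape)
```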

## Running inference on the model

In [None]:
prompt = "Write me a breaking news article about a falling leaf" # @param {type:"string"}
f_prompt = f"\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{prompt}\n\n### Response:\n"
ids = cp.asarray(tokenizer.encode(f_prompt, return_tensors="np")) # token ids, shape (1, T), moved to the GPU
out = mod.generate(ids, max_new=100)
print(tokenizer.decode(cp.asnumpy(out)[0]))