<a href="https://colab.research.google.com/github/Zishan-Shao/Learning_LLM/blob/main/createGPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/MLSys_Learning/

### Basic Overview of GPT2

We explore the layers and parameters of GPT2.

In [4]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

#tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model_hf = GPT2LMHeadModel.from_pretrained('gpt2')


In [5]:
sd_hf = model_hf.state_dict()

Below is an overview of the Hugging Face GPT2 architecture: every parameter tensor and its shape.

In [6]:
for k,v in sd_hf.items():
  print(k, v.shape)

transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias 
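
The shapes printed above are enough to work out the size of the model by hand. The sketch below is plain arithmetic, not a call into `transformers`. It assumes the checkpoint has 12 identical blocks (only `h.0` and `h.1` appear in the listing above) plus a final `ln_f` LayerNorm, and that `lm_head` shares its weights with `wte`, so it adds no parameters of its own. The helper names `linear` and `layer_norm` are made up for this illustration.

```python
# Count GPT2 (124M) parameters from the tensor shapes printed above.
# Assumption: 12 identical blocks, as in the released 124M checkpoint.
n_vocab, n_ctx, n_embd, n_layer = 50257, 1024, 768, 12

def linear(n_in, n_out):
    """Weight matrix plus bias vector of one linear layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of one LayerNorm."""
    return 2 * n

embeddings = n_vocab * n_embd + n_ctx * n_embd  # wte + wpe

per_block = (
    layer_norm(n_embd)              # ln_1
    + linear(n_embd, 3 * n_embd)    # attn.c_attn (q, k, v stacked: 2304 = 3 * 768)
    + linear(n_embd, n_embd)        # attn.c_proj
    + layer_norm(n_embd)            # ln_2
    + linear(n_embd, 4 * n_embd)    # mlp.c_fc (3072 = 4 * 768)
    + linear(4 * n_embd, n_embd)    # mlp.c_proj
)

# Assumption: one final LayerNorm (ln_f); lm_head is tied to wte and not counted.
total = embeddings + n_layer * per_block + layer_norm(n_embd)

print(f"per block: {per_block:,}")   # per block: 7,087,872
print(f"total:     {total:,}")       # total:     124,439,808
```

The total of roughly 124M is where the "GPT2 (124M)" name comes from; about 31% of it sits in the token embedding table alone.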

### Text generation with GPT2

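Even with a fixed seed, the five generated sequences differ from one another, because the pipeline samples each next token instead of always taking the most likely one. This variety is desirable; the seed only makes a given run repeatable.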

In [7]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(114514)
generator("Hello, I'm a language model,", max_length=100, num_return_sequences=5,truncation=True)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I have many languages, maybe I should add my own. It's a little interesting for me, so please consider your help, and please feel free to donate to an open, community based webfont server with your knowledge, but you have a big role to play here.\n\nSo who are you?\n\nSafari by Korn\n\nThe project is fully open source (but with its own community). So the code is freely available over"},
 {'generated_text': "Hello, I'm a language model, but how to do it? For example, by writing the following code:\n\npackage myrst\n\nIn its simplest terms, this is the equivalent of:\n\nimport mydbc\n\nNow, here's an in-depth list of functions. (This isn't the entire core library. That is, it's a bit of a digression to add an interesting feature every time you get to a module on which you're interested."},
 {'generated_text': "Hello, I'm a language model, my code is just doing nothing for you. Here's a few of my favourite idioms, with some sample code and
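
To see why a fixed seed still yields different sequences, the toy sketch below samples token ids from a fixed list of scores. It is not the pipeline's implementation: `toy_logits`, `sample_top_k`, and `generate` are hypothetical names, the "model" returns the same scores at every step, and top-k sampling is used only as one common scheme.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one index from the k highest logits, weighted by softmax probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]

def generate(logits, n_tokens, k, rng):
    # Toy stand-in for a model: the same logits are reused at every step.
    return [sample_top_k(logits, k, rng) for _ in range(n_tokens)]

toy_logits = [2.0, 1.5, 1.0, 0.5, 0.0, -1.0]  # hypothetical scores for a 6-token vocabulary

rng = random.Random(114514)
run_a = [generate(toy_logits, 8, k=4, rng=rng) for _ in range(5)]

rng = random.Random(114514)  # reseed: the whole batch of draws repeats
run_b = [generate(toy_logits, 8, k=4, rng=rng) for _ in range(5)]

for seq in run_a:
    print(seq)
print("same seed reproduces the run:", run_a == run_b)
print("distinct sequences within a run:", len({tuple(s) for s in run_a}))
```

The seed fixes the stream of random draws, not each individual sequence: the five sequences consume different parts of that stream, so they differ from each other, while rerunning with the same seed repeats all five.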

### Train GPT2 (with our dataset)

Below is a from-scratch PyTorch definition of the GPT2 model.

In [8]:
from dataclasses import dataclass
import math  # needed for math.sqrt in CausalSelfAttention.forward
import torch
import torch.nn as nn
from torch.nn import functional as F
#from torch.utils.data import Dataset, DataLoader
#from tqdm import tqdm

class CausalSelfAttention(nn.Module):
    """ a vanilla multi-head masked self-attention layer with a projection at the end."""
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # number of heads and embedding size
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # not really a bias, but more like a mask (OpenAI naming)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size,
                                                           config.block_size)).
                             view(1, 1, config.block_size, config.block_size))
        #self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in one batch and move the head dimension forward to act as a batch dimension
        # nh is "number of heads", hs is "head size", and C (number of channels) = nh * hs
        # e.g. GPT2 (124M): n_head = 12, hs = 64, so nh * hs = 768
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        # attention (materialize the large (T,T) matrix for all the queries and keys)
        att = (q @ k.transpose(-2, -1)) * (1.0 / (math.sqrt(k.size(-1)))) # Q*K^T / sqrt(d)
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf')) # causal mask: each token attends only to itself and earlier tokens
        att = F.softmax(att, dim=-1) # sum to 1
        #att = self.dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) --> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # output projection
        y = self.c_proj(y)
        return y



class MLP(nn.Module):
    """ two linear layers with a GELU non-linearity in between """

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        # GELU avoids the dead-gradient problem of ReLU. GPT2 used the tanh approximation because
        # the exact form was slow to compute at the time; it is kept here to match GPT2.
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x


class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x



@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 30
    n_layer: int = 6
    n_head: int = 6
    n_embd: int = 384  # 768 in GPT2 (124M)
    #dropout: float = 0.2
    #bias: bool = True


class GPT(nn.Module):
    """  the full GPT language model, with a context size of block_size """
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict( # GPT is decoder-only: the encoder stack of the original transformer is dropped
            wte = nn.Embedding(config.vocab_size, config.n_embd), # token embeddings of the input ids
            wpe = nn.Embedding(config.block_size, config.n_embd), # learned position embeddings, one per position up to block_size
            #drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
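
The effect of the causal mask in `CausalSelfAttention.forward` is easiest to see on a tiny input. The sketch below redoes the same steps (scaled dot product, `-inf` on future positions, softmax, weighted sum of values) for a single head in plain Python, so it runs without PyTorch. The function name `causal_attention` and the toy `q`, `k`, `v` values are hypothetical; it mirrors the math of the class above, not its batched implementation.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention over lists of vectors (T x hs each)."""
    T, hs = len(q), len(q[0])
    out, weights = [], []
    for t in range(T):
        # Scaled dot-product scores; future positions (s > t) get -inf, like masked_fill.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(hs) if s <= t else float("-inf")
            for s in range(T)
        ]
        m = max(scores)  # position t itself is never masked, so m is finite
        exps = [math.exp(x - m) for x in scores]  # exp(-inf) evaluates to 0.0
        z = sum(exps)
        row = [e / z for e in exps]  # softmax: each row sums to 1
        weights.append(row)
        # Weighted sum of the value vectors.
        out.append([sum(row[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return out, weights

# Hypothetical toy inputs: T = 3 tokens, head size hs = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, att = causal_attention(q, k, v)
for row in att:
    print([round(w, 3) for w in row])
```

The printed attention matrix is lower triangular: the first token can only attend to itself, so its output is exactly its own value vector `[1.0, 2.0]`.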


### Mixture of Experts (MoE) Practice
In this section, I will create a basic MoE on top of GPT2.
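
As a starting point, the sketch below shows the routing idea behind an MoE layer for a single token: a router scores every expert, only the `top_k` highest-scoring experts run, and their outputs are mixed with renormalized gate weights. This is one common design (top-k softmax gating), assumed here for illustration and not necessarily the one this section will end up using. The names `moe_forward`, `experts`, and `gate_logits` are hypothetical, and the experts are trivial functions standing in for the `MLP` blocks; in a real layer the gate logits would come from a learned linear router.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def moe_forward(x, gate_logits, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    # Pick the top_k experts with the highest gate logits.
    chosen = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Renormalize the gate over the chosen experts only.
    weights = softmax([gate_logits[i] for i in chosen])
    y = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = [acc + w * o for acc, o in zip(y, experts[i](x))]
    return y, dict(zip(chosen, weights))

# Hypothetical experts: simple element-wise functions standing in for MLP blocks.
experts = [
    lambda x: [2.0 * a for a in x],
    lambda x: [a + 1.0 for a in x],
    lambda x: [-a for a in x],
    lambda x: [0.0 for _ in x],
]

x = [1.0, 2.0]
gate_logits = [0.5, 0.5, -1.0, -2.0]  # hypothetical router scores for this token
y, routing = moe_forward(x, gate_logits, experts, top_k=2)
print(routing)  # {0: 0.5, 1: 0.5}
print(y)        # [2.0, 3.5]
```

Only two of the four experts are evaluated for this token, which is the point of the design: the parameter count grows with the number of experts while the compute per token stays close to that of `top_k` MLPs.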