<a href="https://colab.research.google.com/github/Wayne-wyyking888/Stat-8931-GenAI/blob/main/chapter2/Chapter_II_Large_language_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Packages, Paths & Environment

### (1). Download the dependent `.py` files, trained model, and tokenizer to the **current directory**

In [2]:
# download model.py and tokenizer.py to the default directory
! gdown --id 1SU7jSZI36KGwBv5-zgc3WkStPK6lGKwL -O /content/model.py # download model.py
! gdown --id 1uXCgdmip79J6efM5hiHGCy9mdr_U8BXT -O /content/tokenizer.py # download tokenizer.py

Downloading...
From (original): https://drive.google.com/uc?id=1SU7jSZI36KGwBv5-zgc3WkStPK6lGKwL
From (redirected): https://drive.google.com/uc?id=1SU7jSZI36KGwBv5-zgc3WkStPK6lGKwL&confirm=t&uuid=313387d4-acac-44a0-8578-78151e0c8a10
To: /content/model.py
100% 13.3k/13.3k [00:00<00:00, 34.0MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1uXCgdmip79J6efM5hiHGCy9mdr_U8BXT
From (redirected): https://drive.google.com/uc?id=1uXCgdmip79J6efM5hiHGCy9mdr_U8BXT&confirm=t&uuid=c435cc11-712b-436e-9cc2-9cac99886211
To: /content/tokenizer.py
100% 1.35k/1.35k [00:00<00:00, 5.01MB/s]


In [6]:
# Download the trained language model (checkpoint) and tokenizer files into suitable directories (each archive is downloaded, then unzipped)
import gdown
import zipfile
import os

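def download_and_unzip(file_id, output_dir=None):
    """Download a zip archive from Google Drive by file ID, extract it into output_dir
    (default: the current directory), and return the path of the first extracted entry."""
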
    if output_dir is None:
        output_dir = os.getcwd()

    os.makedirs(output_dir, exist_ok=True)

    # Download file
    url = f'https://drive.google.com/uc?id={file_id}'
    output = os.path.join(output_dir, 'temp.zip')
    gdown.download(url, output, quiet=False)

    # Unzip file
    with zipfile.ZipFile(output, 'r') as zip_ref:
        # Get the name of the first file in the archive
        original_name = zip_ref.namelist()[0]
        zip_ref.extractall(output_dir)

    # Remove the temporary zip file
    os.remove(output)

    # The path to the extracted file
    extracted_file = os.path.join(output_dir, original_name)

    print(f"File extracted as: {extracted_file}, saved to {output_dir}")
    return extracted_file


In [11]:
# Download into sub-directories under models/ and data/ (created if missing)
checkpoint = download_and_unzip('1bJMOyA86CDayzwmU5KjlZnbhCXHUzO41', output_dir=os.path.join(os.getcwd(), "models", "trained_model.pt"))
tokenizer = download_and_unzip('1UhsXL-ymGFy1fBftMvMbss2PGFzRxZV4', output_dir=os.path.join(os.getcwd(), "data", "trained_tokenizer.model"))

Downloading...
From (original): https://drive.google.com/uc?id=1bJMOyA86CDayzwmU5KjlZnbhCXHUzO41
From (redirected): https://drive.google.com/uc?id=1bJMOyA86CDayzwmU5KjlZnbhCXHUzO41&confirm=t&uuid=d154d7ed-cbd2-4633-bab3-255f02a7bad4
To: /content/models/trained_model.pt/temp.zip
100%|██████████| 182M/182M [00:01<00:00, 96.7MB/s]


File extracted as: /content/models/trained_model.pt/trained_model_tok32000.pt, saved to /content/models/trained_model.pt


Downloading...
From: https://drive.google.com/uc?id=1UhsXL-ymGFy1fBftMvMbss2PGFzRxZV4
To: /content/data/trained_tokenizer.model/temp.zip
100%|██████████| 500k/500k [00:00<00:00, 74.8MB/s]

File extracted as: /content/data/trained_tokenizer.model/tok32000.model, saved to /content/data/trained_tokenizer.model





### (2) Load the pretrained model and test
* Load the pretrained model and the pretrained tokenizer
* Enable TF32 and select the autocast precision
* Configure parameters for generation and decoding

In [13]:
from contextlib import nullcontext
import torch
from model import ModelArgs, Transformer
from tokenizer import Tokenizer
import os

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Use device: {device}")

# load checkpoint
checkpoint_dict = torch.load(checkpoint, map_location=device)
gptconf = ModelArgs(**checkpoint_dict['model_args'])
model = Transformer(gptconf)
state_dict = checkpoint_dict['model']
unwanted_prefix = '_orig_mod.' #the unwanted prefix was sometimes added during compiling
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict, strict=False)
model.eval()
model.to(device)

# load tokenizer
enc = Tokenizer(tokenizer_model=tokenizer)

# adjust precision
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
torch.backends.cuda.matmul.allow_tf32 = True #enables the use of TF32 for matrix multiplication operations within PyTorch when using CUDA
torch.backends.cudnn.allow_tf32 = True #enables the use of TF32 precision within the cuDNN library
device_type = 'cuda' if 'cuda' in device else 'cpu'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

Use device: cpu



#words: 32000 - BOS ID: 1 - EOS ID: 2
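
The key-renaming loop in the loading cell can be checked in isolation. The following is a minimal sketch, assuming a plain dictionary stands in for the checkpoint's `state_dict`; the helper name `strip_prefix` and the toy keys and values are made up for illustration.

```python
def strip_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Hypothetical keys; a real checkpoint maps parameter names to tensors.
toy_state = {"_orig_mod.tok_embeddings.weight": 1, "norm.weight": 2}
cleaned = strip_prefix(toy_state)
print(cleaned)
```

Building a new dictionary avoids mutating the mapping while iterating over it, which is why the original loop iterates over `list(state_dict.items())`.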


* **Generate sample texts**

In [19]:
## parameter configurations
num_samples = 3 # number of samples to draw (how many paragraphs?)
max_new_tokens = 1024 # number of tokens generated in each sample
temperature = 1.0 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 300 # retain only the top_k most likely tokens



In [24]:
# Generate texts
start = "Once upon a time, there is a beautiful 27-year-old lady called ?, "
start_ids = enc.encode(start, bos=True, eos=False)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(enc.decode(y[0].tolist()))
            print('---------------')

Once upon a time, there is a beautiful 27-year-old lady called zzxxlll,  year old. Her name was Emo, and full of joy.
Little Olce searched and searched at the open space, and was filled with wonder. She eventually saw the huge structure, and was filled with presents.
Celffy saw how peaceful it felt and how the open area around it was full of laughter and joy. Little Eoo quickly filled his mouth with glee.
When T-Eack was full, she was filled with love and relax. She spent the days asking her friends and family to come and go to the same place.
The two of them spent many hours playing on the wide structure, roasting marcoanks and doing as they told stories. Every once in a while, she and her friends would have a soft, lively time, enjoying their time together.
At the end of the day, Todely Elephant asked Y Star how she was such a wonderful group of friends.
"I made it free from this color of metal," said Y syruply, smiling. 
Yve-Craagers were so pleased. They thanked Yrandy for making t

## A Glance at Decoding from A Trained Model

### (1). The `generate()` function performs autoregressive decoding

* `generate` is a method of the model class. It takes a conditioning sequence of token indices `idx`, a LongTensor of shape `(batch_size, sequence_length)`, and extends it by `max_new_tokens` tokens, feeding each prediction back into the model. It is usually called with the model in `model.eval()` mode.
* **Autoregressive (AR) generation**: each next token is sampled from the softmax distribution over the logits of the final time step, then appended to the input.
* `@torch.inference_mode()` is an alternative to `with torch.no_grad()` that ***disables gradient calculation***.
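
The feedback loop described above can be sketched without a neural network. This is a toy illustration only: `toy_next_token` is a hypothetical stand-in for the model, and `max_seq_len=3` is an arbitrary context limit chosen to make the cropping visible.

```python
def toy_next_token(context, vocab_size=10):
    """Hypothetical stand-in for a model: deterministically maps a context to a token id."""
    return sum(context) % vocab_size

def toy_generate(idx, max_new_tokens, max_seq_len=3):
    """Append max_new_tokens tokens, conditioning on at most the last max_seq_len tokens."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-max_seq_len:]         # keep only the most recent tokens
        idx.append(toy_next_token(context))  # feed the prediction back in
    return idx

out = toy_generate([1, 2], max_new_tokens=4)
print(out)  # → [1, 2, 3, 6, 1, 0]
```

The output always contains the prompt followed by exactly `max_new_tokens` new tokens, which matches the shape of what `generate` returns.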

In [22]:
import torch
import torch.nn.functional as F  # provides F.softmax, used below

@torch.inference_mode()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= self.params.max_seq_len else idx[:, -self.params.max_seq_len:]
        logits = self(idx_cond)
        logits = logits[:, -1, :] # crop to just the final time step
        if temperature == 0.0:
            _, idx_next = torch.topk(logits, k=1, dim=-1)
        else:
            logits = logits / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)

    return idx

* **Customizing the generation approach**: define a new function and bind it to the model instance.

In [None]:
def custom_generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    # Custom generation logic here
    for _ in range(max_new_tokens):
        ...
    return idx

# Assign the custom generate method to the model instance
model.generate = custom_generate.__get__(model, Transformer)
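
The call `custom_generate.__get__(model, Transformer)` uses Python's descriptor protocol to turn a plain function into a method bound to one instance. `types.MethodType` is a more explicit way to do the same thing. A minimal sketch, assuming a toy class in place of `Transformer` (the class and both `generate` bodies are hypothetical):

```python
import types

class ToyModel:
    """Hypothetical stand-in for the Transformer class."""
    def generate(self, idx, max_new_tokens):
        return idx + [0] * max_new_tokens

def custom_generate(self, idx, max_new_tokens):
    # Illustrative logic only: append increasing integers instead of zeros.
    return idx + list(range(1, max_new_tokens + 1))

model_a, model_b = ToyModel(), ToyModel()
model_a.generate = types.MethodType(custom_generate, model_a)  # bind to this instance only

print(model_a.generate([7], 3))  # → [7, 1, 2, 3]
print(model_b.generate([7], 3))  # → [7, 0, 0, 0]
```

Only the patched instance changes behavior; other instances and the class itself keep the original method.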

### (2) A closer look at the components of the `generate()` function

1. **Greedy decoding**: Always pick the next token *with the highest probability*. This method can lead to repetitive and predictable text. In `generate()` it is selected by setting `temperature = 0.0`.

```
if temperature == 0.0:
    _, idx_next = torch.topk(logits, k=1, dim=-1)
    idx = torch.cat((idx, idx_next), dim=1)
```
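
The snippet above applies `torch.topk` to the logits rather than to probabilities. Because softmax preserves the ordering of its inputs, the highest logit is also the highest probability, so the softmax can be skipped. A small plain-Python check, with made-up logits for a four-token vocabulary and hypothetical helper names:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)

logits = [1.2, -0.3, 3.1, 0.5]  # made-up scores
greedy_from_logits = argmax(logits)
greedy_from_probs = argmax(softmax(logits))
print(greedy_from_logits, greedy_from_probs)  # → 2 2
```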

2. **Temperature scaling**: The logits are divided by `temperature` before the softmax. A lower temperature makes the distribution peakier **(more greedy)**, while a higher temperature makes the distribution flatter **(more random)**.

```
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)

```
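
To see the effect of the temperature numerically, the sketch below applies the same scaling to a made-up three-token logit vector in plain Python. The helper name and the chosen temperatures are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores
results = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in results.items():
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token is largest at `temperature = 0.5` and smallest at `temperature = 2.0`, while the ranking of the tokens never changes.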

3. **Top-k sampling**: Restricts the next-token choices to the $k$ most probable tokens ($p_1 \ge p_2 \ge \dots \ge p_k$); every other token is given probability zero before sampling:

```
if top_k is not None:
    values, _ = torch.topk(logits, k=min(top_k, logits.size(-1)))
    logits[logits < values[:, [-1]]] = -float('Inf')  # mask everything below the k-th largest logit
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
```
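
The masking step can be traced by hand on a small example. The sketch below mirrors it in plain Python; `top_k_probs` is a hypothetical helper and the logits are made up. Tokens tied with the k-th largest logit are kept, as with the `<` comparison in the snippet above.

```python
import math

def top_k_probs(logits, k):
    """Softmax restricted to the k largest logits; all other tokens get probability 0."""
    k = min(k, len(logits))
    cutoff = sorted(logits, reverse=True)[k - 1]  # k-th largest logit
    masked = [x if x >= cutoff else -math.inf for x in logits]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]      # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.0, -1.0]  # made-up scores
probs = top_k_probs(logits, k=2)
print([round(p, 3) for p in probs])
```

With `k=2`, only the tokens at positions 0 and 2 keep non-zero probability, and the two surviving probabilities are renormalized to sum to one.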
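4. **Top-p (nucleus) sampling**: This approach samples from the **smallest set** of most probable tokens whose cumulative probability reaches a threshold $p$.

```
top_p = 0.9  # nucleus threshold p (illustrative value)
probs = F.softmax(logits, dim=-1)
sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
cum_probs = torch.cumsum(sorted_probs, dim=-1)
sorted_probs[cum_probs - sorted_probs >= top_p] = 0.0  # drop tokens outside the nucleus
choice = torch.multinomial(sorted_probs, num_samples=1)  # multinomial renormalizes the weights
idx_next = torch.gather(sorted_idx, -1, choice)  # map back to vocabulary ids
idx = torch.cat((idx, idx_next), dim=1)
```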
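
A worked example of the nucleus selection described above, written in plain Python so the arithmetic is visible. The distribution and the threshold 0.7 are made-up values, and `top_p_probs` is a hypothetical helper name.

```python
def top_p_probs(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

probs = [0.05, 0.5, 0.15, 0.3]
nucleus = top_p_probs(probs, top_p=0.7)
print([round(p, 3) for p in nucleus])  # → [0.0, 0.625, 0.0, 0.375]
```

The most probable token alone has mass 0.5, which is below 0.7, so the second most probable token is added and the pair (total mass 0.8) forms the nucleus. Unlike top-k, the number of kept tokens adapts to how concentrated the distribution is.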

5. **Beam-search decoding**: Beam search *maintains multiple hypotheses* (the “beam”) at each step, expands each one with its most likely next tokens, and keeps the `beam_width` sequences with the highest cumulative log-probability. It explores more alternatives than greedy decoding (*breadth*) at the cost of extra computation.

```
beam_width = 5
candidates = [(idx, 0.0)]  # (sequence, cumulative log-probability); assumes batch size 1
for _ in range(max_new_tokens):
    all_candidates = []
    for candidate, score in candidates:
        logits = self(candidate)
        log_probs = F.log_softmax(logits[:, -1, :], dim=-1)
        top_log_probs, top_idx = torch.topk(log_probs, k=beam_width)
        for i in range(beam_width):
            next_candidate = torch.cat((candidate, top_idx[:, i:i+1]), dim=1)
            # score the whole sequence, not just the last step
            all_candidates.append((next_candidate, score + top_log_probs[0, i].item()))
    ordered = sorted(all_candidates, key=lambda x: x[1], reverse=True)
    candidates = ordered[:beam_width]
idx = candidates[0][0]
```
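
A toy case shows why keeping several hypotheses can help. The sketch below uses a hypothetical two-token "model" defined by a fixed probability table (all numbers are made up), and it expands every token rather than only the top `beam_width`, which is harmless for a vocabulary of two. With a beam of width 1 the search is greedy.

```python
import math

def next_probs(seq):
    """Hypothetical two-token 'model': next-token probabilities given the sequence so far."""
    if not seq:
        return [0.6, 0.4]
    return [0.5, 0.5] if seq[-1] == 0 else [0.9, 0.1]

def beam_search(steps, beam_width):
    """Return the best (sequence, cumulative log-probability) found with the given beam width."""
    beams = [([], 0.0)]
    for _ in range(steps):
        expanded = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in enumerate(next_probs(seq))
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]

greedy_seq, greedy_score = beam_search(steps=2, beam_width=1)
beam_seq, beam_score = beam_search(steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_score), 2))  # → [0, 0] 0.3
print(beam_seq, round(math.exp(beam_score), 2))      # → [1, 0] 0.36
```

Greedy decoding commits to token 0 at the first step (probability 0.6) and ends with a sequence of probability 0.30. The width-2 beam keeps token 1 alive and finds the sequence `[1, 0]` with probability 0.36.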

## Tokenization and Vocabulary