<a href="https://colab.research.google.com/github/dwgb93/EdgeRunnerAI-Transformers-LoRA/blob/main/Transformers_%2B_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dr. Dylan's Intro to Transformers


# Tokenizers

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "NousResearch/Meta-Llama-3.1-8B-Instruct" # So we don't have to deal with gated models
#model_name = "Qwen/Qwen2.5-72B-Instruct" # Try uncommenting these and comparing the results!
#model_name = "deepseek-ai/DeepSeek-R1"
#model_name = "openai-community/gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
text = "EdgeRunner AI is a cool place to work!"

In [None]:
tokens = tokenizer(text, return_tensors="pt")["input_ids"][0]
print(tokens)
print(tokenizer.decode(tokens))

In [None]:
text = "Hello Hello hello"

In [None]:
tokens = tokenizer(text, return_tensors="pt")["input_ids"][0]
print(tokens)
print(tokenizer.decode(tokens))

Why are the tokens different? Try this:

In [None]:
print(tokenizer.convert_ids_to_tokens(tokens))
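
If the output above is surprising, here is a toy sketch of one reason why. It assumes a tiny made-up vocabulary (`TOY_VOCAB`) and a greedy longest-match rule, which is a simplification and not how the Llama tokenizer actually works. The point is only that case and a leading space are part of the token, so "Hello", " Hello" and " hello" can each get their own id.

```python
# Hypothetical four-entry vocabulary; real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"Hello": 0, " Hello": 1, " hello": 2, "hello": 3}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary (a simplification)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids

toy_ids = toy_tokenize("Hello Hello hello", TOY_VOCAB)
print(toy_ids)  # [0, 1, 2]: three different ids for three "hello"s
```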

What if we mix and match tokenizers?

In [None]:
tokenizer1 = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B-Instruct")
tokenizer2 = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct") # Replace these if you want

In [None]:
tokens = tokenizer1(text, return_tensors="pt")["input_ids"][0]
print(tokens)
print(tokenizer2.decode(tokens))
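
Token ids are just row numbers in one particular vocabulary, so decoding with a different vocabulary looks up the wrong rows. A minimal sketch, assuming two made-up vocabularies (`vocab_a` and `vocab_b` are hypothetical and have nothing to do with the real Llama or Qwen vocabularies):

```python
# Two hypothetical vocabularies that assign the same ids to different pieces.
vocab_a = ["<s>", "Hello", " world", "!"]
vocab_b = ["<s>", " cat", "Hello", " sat"]

def toy_decode(ids, vocab):
    """Look each id up in the given vocabulary and join the pieces."""
    return "".join(vocab[i] for i in ids)

ids_from_a = [1, 2, 3]  # "Hello world!" encoded with vocab_a

right = toy_decode(ids_from_a, vocab_a)
wrong = toy_decode(ids_from_a, vocab_b)
print(repr(right))  # 'Hello world!'
print(repr(wrong))  # ' catHello sat'
```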

# Embeddings

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

If you restarted the notebook, you'll have to run this again

In [None]:
# from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [None]:
model_name = "unsloth/Llama-3.2-1B-Instruct" # Try a different model like 'gpt2'

pipe = pipeline('feature-extraction', model=model_name)
data = pipe("this is a test")
print(data)
print(f"This text is {len(data[0])} tokens long")
print(f"Each token is {len(data[0][0])} dimensions long")

In [None]:
data1 = np.array(pipe("man"))[0][-1].reshape(1, -1) # pipe() returns nested lists shaped [batch][token][dim]; keep the last token's vector as a (1, dim) array. There's probably a cleaner way to do this.
data2 = np.array(pipe("king"))[0][-1].reshape(1, -1)
difference_man = data1 - data2
print(f"Cosine similarity: {cosine_similarity(data1, data2)}")

In [None]:
data3 = np.array(pipe("woman"))[0][-1].reshape(1, -1)
data4 = np.array(pipe("queen"))[0][-1].reshape(1, -1)
difference_woman = data3 - data4
print(f"Cosine similarity: {cosine_similarity(data3, data4)}")

Let's compare the similarity of `difference_man` and `difference_woman`

In [None]:
print(f"Cosine similarity: {cosine_similarity(difference_man, difference_woman)}")

Is the result what you expected?

Try comparing man to woman and king to queen.

Is there a stronger direction for "gender" or "royalty"?
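
For intuition about what a "direction" means here, this is a toy sketch with hand-made 3-dimensional vectors. The numbers are invented (axis 0 plays the role of "royalty", axis 1 of "gender"), so the perfectly clean result is an idealization. Real embeddings from the model above will be much messier.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical embeddings: [royalty, gender, everything else]
toy = {
    "man":   [0.0,  1.0, 0.5],
    "woman": [0.0, -1.0, 0.5],
    "king":  [1.0,  1.0, 0.5],
    "queen": [1.0, -1.0, 0.5],
}

toy_difference_man = subtract(toy["man"], toy["king"])        # [-1, 0, 0]
toy_difference_woman = subtract(toy["woman"], toy["queen"])   # [-1, 0, 0]

direction_similarity = cosine(toy_difference_man, toy_difference_woman)
print(direction_similarity)  # 1.0: both differences point along the "royalty" axis
print(cosine(toy["man"], toy["woman"]))
```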

# Transformers

Adapted from [Hands On Large Language Models - Chapter 3](https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/README.md)

In [None]:
model_name = "unsloth/Llama-3.2-1B-Instruct"
#model_name = "unsloth/Llama-3.2-3B-Instruct"
#model_name = "microsoft/Phi-3-mini-4k-instruct"
#model_name = "Qwen/Qwen3-1.7B"
#model_name = "microsoft/Phi-4-mini-instruct"

In [None]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Write an email apologizing to Evelyn for the tragic gardening mishap. Explain how it happened."}
]

# Generate the output
output = generator(messages)
print(output[0]['generated_text'])

In [None]:
print(model)

Load and print a few more models. What similarities do you notice? What differences?

In [None]:
model_name =

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

print(model)

In [None]:
model_name =

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

print(model)

### Sampling tokens

In [None]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Send them to the GPU
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [None]:
token_id = lm_head_output[0,-1].argmax(-1)
print(token_id)
print(tokenizer.decode(token_id))

In [None]:
model_output[0].shape

In [None]:
lm_head_output.shape
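
The shapes above are `[batch, sequence, hidden]` before the `lm_head` and `[batch, sequence, vocab]` after it. Here is a small sketch of what `lm_head_output[0,-1].argmax(-1)` is doing, using plain nested lists. The four-word vocabulary and the logit values are made up for illustration.

```python
# Hypothetical vocabulary and logits shaped [batch=1][seq_len=3][vocab_size=4].
toy_vocab = ["Paris", "London", "pizza", "the"]
toy_logits = [[
    [0.1, 0.2, 0.0, 1.5],   # scores after the 1st prompt token
    [0.3, 0.1, 0.9, 0.2],   # scores after the 2nd prompt token
    [2.4, 1.1, -0.5, 0.3],  # scores after the last prompt token
]]

# Only the last position predicts the token that comes *after* the prompt.
last_position = toy_logits[0][-1]

# argmax: index of the highest score (greedy decoding).
next_id = max(range(len(last_position)), key=last_position.__getitem__)
print(next_id, toy_vocab[next_id])  # 0 Paris
```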

### Playing with temperature

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate the output
output = generator(messages)
print(output[0]["generated_text"])

In [None]:
# Apply prompt template
prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

In [None]:
# Temperature - run this a couple of times. Are there any differences?
output = generator(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

In [None]:
# Change the temperature and try again
output = generator(messages, do_sample=True, temperature=)
print(output[0]["generated_text"])

What do you notice?
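
To see where the effect comes from, here is a minimal sketch of temperature scaling: divide the logits by the temperature, then apply softmax. The logits are made up, and the real pipeline may apply extra filtering (such as top-k or top-p) on top of this, so treat it as the core idea only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Low temperatures pile probability onto the top token (closer to greedy decoding), while high temperatures flatten the distribution so unlikely tokens get sampled more often.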

# Making a Transformer from Scratch

From [nanoGPT](https://github.com/karpathy/nanoGPT) and [Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY)

In [None]:
!git clone https://github.com/karpathy/nanoGPT.git
!pip install tiktoken

In [None]:
import os
os.chdir("nanoGPT")

Download Shakespeare

In [None]:
!python data/shakespeare_char/prepare.py

In [None]:
with open('/content/nanoGPT/data/shakespeare_char/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text[:1000])

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the first 1000 characters we looked at earlier will look like this to the GPT

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [None]:
block_size = 8 # what is the maximum context length for predictions?
train_data[:block_size+1]

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
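
During training, many of these (context, target) windows get sampled at random and stacked into a batch. A rough sketch of that idea in plain Python follows. The function name `get_batch` mirrors the one in nanoGPT, but this list-based version and the toy data are my own simplification, not the repo's code.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Sample random windows; each target window is the input shifted by one."""
    xs, ys = [], []
    for _ in range(batch_size):
        # Leave room for block_size inputs plus one extra target token.
        start = rng.randrange(len(data) - block_size)
        xs.append(data[start:start + block_size])
        ys.append(data[start + 1:start + block_size + 1])
    return xs, ys

toy_data = list(range(100))  # stand-in for the encoded text
xb, yb = get_batch(toy_data, block_size=8, batch_size=4, rng=random.Random(0))

for x_row, y_row in zip(xb, yb):
    print(x_row, "->", y_row)
```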

Let's train a (smol) GPT to make text like this.

Note: This will take several minutes to run.

In [None]:
!python train.py config/train_shakespeare_char.py --compile=False # needed for colab T4


If you don't have a GPU in Colab for whatever reason, try this:

In [None]:
# !python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

How did it do?

In [None]:
!python sample.py --out_dir=out-shakespeare-char --num_samples=3

What if we wanted to train a real model with real data?

In [None]:
os.makedirs("data/HarryPotter", exist_ok=True) # exist_ok avoids an error if you re-run this cell
os.chdir("data/HarryPotter")

Download the data

In [None]:
import os
import requests
import tiktoken
import numpy as np

# download the Harry Potter dataset
input_file_path = os.path.join(os.getcwd(), 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://gist.githubusercontent.com/cmaspi/41e1d8e552a30a6d5ef0be7e574da513/raw/0a9a8247da3468a7a40edc2c62479df208c421d9/Harry_Potter_all_books_preprocessed.txt'
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.getcwd(), 'train.bin'))
val_ids.tofile(os.path.join(os.getcwd(), 'val.bin'))

# For reference, nanoGPT's tiny Shakespeare prepare script reports:
# train.bin has 301,966 tokens, val.bin has 36,059 tokens


Note: our dataset is about the same size, but gpt2-xl is ~150x bigger than the character-level model we just trained. gpt2-medium is still ~35x bigger.

Maybe we should try `--init_from='gpt2'` instead
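
Here is a back-of-the-envelope check of those size ratios. It uses the common rule of thumb of roughly 12 × n_layer × n_embd² weights in the transformer blocks plus the token embedding table, and ignores biases, layer norms and position embeddings. The layer counts and widths below are assumptions taken from the published GPT-2 sizes and nanoGPT's `train_shakespeare_char.py` config, so double-check them against `print(model)` if it matters.

```python
def approx_gpt_params(n_layer, n_embd, vocab_size):
    """Rough parameter count for a GPT-style model."""
    # Per block: ~4*d^2 for attention (q, k, v, output) + ~8*d^2 for the MLP.
    blocks = 12 * n_layer * n_embd ** 2
    # Token embedding table (shared with the output head in GPT-2).
    embeddings = vocab_size * n_embd
    return blocks + embeddings

# Assumed configurations (not read from any checkpoint).
configs = {
    "shakespeare_char": dict(n_layer=6, n_embd=384, vocab_size=65),
    "gpt2": dict(n_layer=12, n_embd=768, vocab_size=50257),
    "gpt2-medium": dict(n_layer=24, n_embd=1024, vocab_size=50257),
    "gpt2-xl": dict(n_layer=48, n_embd=1600, vocab_size=50257),
}

sizes = {name: approx_gpt_params(**cfg) for name, cfg in configs.items()}
baseline = sizes["shakespeare_char"]

for name, n_params in sizes.items():
    print(f"{name:>17}: {n_params / 1e6:8.1f}M params ({n_params / baseline:.0f}x)")
```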

In [None]:
os.chdir("../..")
!pwd

# Make sure this says /content/nanoGPT

In [None]:
# If you run out of VRAM during evaluation, reduce --batch_size or switch to --init_from='gpt2'
!python train.py --out_dir='out-HP' --eval_interval=5 --eval_iters=40 \
--dataset="HarryPotter" --init_from='gpt2-medium' --always_save_checkpoint=False \
--batch_size=2 --gradient_accumulation_steps=16 --max_iters=20 \
--learning_rate=3e-5 --decay_lr=False --compile=False
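
A quick bit of arithmetic on what those flags add up to. This sketch assumes a single GPU and nanoGPT's default `block_size` of 1024 (the command above doesn't override it). The variable names are just for illustration.

```python
batch_size = 2                    # sequences per forward/backward pass
gradient_accumulation_steps = 16  # passes accumulated before each optimizer step
block_size = 1024                 # assumption: nanoGPT's default context length
max_iters = 20                    # optimizer steps in the command above

sequences_per_step = batch_size * gradient_accumulation_steps
tokens_per_step = sequences_per_step * block_size
total_tokens_seen = tokens_per_step * max_iters

print(f"effective batch size: {sequences_per_step} sequences")
print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens seen in {max_iters} steps: {total_tokens_seen:,}")
```

Gradient accumulation trades time for memory: the GPU only ever holds 2 sequences at once, but each weight update behaves like a batch of 32.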


Depending on what hyperparameters you chose and how long you let this cook, you should be able to recreate Harry Potter books from scratch.

That would be an example of overfitting.

In [None]:
!python sample.py --out_dir=out-HP --num_samples=2 --start=""

# Fine-tuning with Axolotl

Consider restarting your environment to free up RAM and such.

In [None]:
import torch
# Check that a GPU is available; a T4 (free tier) is enough to run this notebook
assert torch.cuda.is_available(), "No GPU found. In Colab: Runtime > Change runtime type"

In [None]:
torch.__version__ # Should be 2.6.0+cu124 or similar

This takes a while to install. It may ask to restart your session. That's fine. You don't have to run it again.

In [None]:
!pip install -U packaging==23.2 setuptools==75.8.0 wheel ninja

In [None]:
!pip install --no-build-isolation axolotl[deepspeed]


In [None]:
!axolotl --version # Should be 0.9.1 or similar

Create a config.yaml file

If this takes an insanely long time to download (it took me 5 mins), try

`base_model: unsloth/Llama-3.2-1B-Instruct`

With the smaller model, you can probably set `load_in_4bit: false` and `adapter: lora`



In [None]:
import yaml

yaml_string = """
base_model: NousResearch/Meta-Llama-3.1-8B-Instruct

load_in_8bit: false
load_in_4bit: true # This means using QLoRA
strict: false

chat_template: llama3

datasets:
  - path: tatsu-lab/alpaca # generic dataset. Feel free to choose your own, if you have one.
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 5e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: false # Doesn't work in colab
sdp_attention: true

warmup_steps: 1
max_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
"""


# Convert the YAML string to a Python dictionary
yaml_dict = yaml.safe_load(yaml_string)

# Specify your file path
file_path = 'config.yaml'

# Write the YAML file
with open(file_path, 'w') as file:
    yaml.dump(yaml_dict, file)
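
To get a feel for what `lora_r: 32` and `lora_alpha: 16` mean, here is a small sketch of the usual LoRA bookkeeping: a rank-r adapter on a `d_out × d_in` weight adds two small matrices (`r × d_in` and `d_out × r`), and the adapter's update is scaled by `alpha / r`. The 4096 × 4096 layer size is an assumption meant to resemble one attention projection in an 8B Llama-style model. The helper name is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

d_in = d_out = 4096  # assumption: one square attention projection
lora_r = 32          # from the config above
lora_alpha = 16      # from the config above

full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, lora_r)
scaling = lora_alpha / lora_r

print(f"frozen weights in this layer:  {full_params:,}")
print(f"trainable LoRA weights:        {adapter_params:,}")
print(f"trainable fraction:            {adapter_params / full_params:.2%}")
print(f"update scaling (alpha / r):    {scaling}")
```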

Technically optional, but it downloads the model and dataset and lets me know if my dataset is borked before beginning training.

In [None]:
!axolotl preprocess config.yaml

This could take literal hours.

In [None]:
!axolotl train config.yaml

In [None]:
!axolotl inference config.yaml --lora-model-dir="./outputs/lora-out" --gradio # Technically not necessary, but it creates a nice little website you can share with others.


In [None]:
from google.colab import runtime
runtime.unassign()