# 3. EleutherAI/gpt-neo-1.3b
This is a *smaller* [model](https://huggingface.co/EleutherAI/gpt-neo-1.3B), also trained on [pile](https://pile.eleuther.ai/), an 825 GiB diverse, open source language modelling English text corpus targeted at training large-scale language models, whose weights should download faster!

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [2]:
model_name = "EleutherAI/gpt-neo-1.3B"

#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [3]:
major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
  !pip install flash-attn
  torch_dtype = torch.bfloat16
  attn_implementation='flash_attention_2'
  print("Your GPU is compatible with FlashAttention and bfloat16. Yippeee :-)")
else:
  torch_dtype = torch.float16
  attn_implementation='eager'
  print("Your GPU is not compatible with FlashAttention and bfloat16. :-(")

Your GPU is not compatible with FlashAttention and bfloat16. :-(


In [4]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype
)

Now, we can load the model in 4-bit:

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             quantization_config=quant_config, 
                                             device_map={"":0}, 
                                             attn_implementation=attn_implementation)

In [7]:
print(model)

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 2048)
    (wpe): Embedding(2048, 2048)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPTNeoBlock(
        (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
            (v_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
            (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
            (out_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          )
        )
        (ln_2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear4bit(in_features=2048, out_features=8192, bias=True)

Then, we enable gradient checkpointing and we use PEFT.

We prepare the model for LoRA, adding trainable adapters for each layer:

In [8]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

In [9]:
model_peft = prepare_model_for_kbit_training(model)

The first code section was targetted to `Gpt-NeoX-20b`, which has a single linear layer called `query_key_value` per Attention block.

With the `Gpt-Neo-1.3b` model, we have 4 linear layers. I chose to fine-tune only the last linear layer, called `out_proj`:

In [10]:
#config = LoraConfig(
#    r=8, 
#    lora_alpha=32, 
#    target_modules=["query_key_value"], 
#    lora_dropout=0.05, 
#    bias="none", 
#    task_type="CAUSAL_LM"
#)
config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)
model_lora = get_peft_model(model_peft, config)

In [11]:
print(model_lora)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPTNeoForCausalLM(
      (transformer): GPTNeoModel(
        (wte): Embedding(50257, 2048)
        (wpe): Embedding(2048, 2048)
        (drop): Dropout(p=0.0, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPTNeoBlock(
            (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
            (attn): GPTNeoAttention(
              (attention): GPTNeoSelfAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
                (resid_dropout): Dropout(p=0.0, inplace=False)
                (k_proj): lora.Linear4bit(
                  (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=2048, out_features=8, bias=False)
                  )
           

In [12]:
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [13]:
import transformers

tokenizer.pad_token = tokenizer.eos_token
model.config.use_cache = False
trainer = transformers.Trainer(
    model=model_lora,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16 = torch_dtype == torch.float16,
        bf16 = torch_dtype == torch.bfloat16,
        logging_steps=100,
        save_strategy="epoch",
        output_dir="trained_adapter/",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [14]:
import time, datetime
start = datetime.datetime.now()
print("Started training at", start)
trainer.train()
end = datetime.datetime.now()
print("Finished training at", start)
diff = (end - start)
diff_seconds = int(diff.total_seconds())
minute_seconds, seconds = divmod(diff_seconds, 60)
hours, minutes = divmod(minute_seconds, 60)
hms = f"{hours}h {minutes}m {seconds}s"
print("Training time:", hms)

# save the model now!
torch.save(model_lora.state_dict(), "/Users/Faculty/dino/qlora.gpt.neo.1.3b/weights.zip")

Started training at 2024-08-06 10:10:55.965630


  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss


Finished training at 2024-08-06 10:10:55.965630
Training time: 1h 52m 21s


In [14]:
text = "Ask not what your country"
device = "cuda"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model_lora.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  return fn(*args, **kwargs)


Ask not what your country can do for you, but what you can do for your country.

Menu

Tag


This is what the original model would have yielded:

In [15]:
text = "Ask not what your country"
device = "cuda"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Ask not what your country can do for you, but what you can do for your country.

Menu

Tag


Ok, no difference. This quote is very famous and was probably included in the pile.

Let's try another one from the dataset of quotes: 
```
“I'm not upset that you lied to me, I'm upset that from now on I can't believe you.” - F. Nietsche
```

In [16]:
text = "I'm not upset that you lied to me"
device = "cuda"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model_lora.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I'm not upset that you lied to me." "I'm not upset that you lied to me." "I'm not upset that you lied


In [17]:
text = "I'm not upset that you lied to me"
device = "cuda"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I'm not upset that you lied to me." "I'm not upset that you lied to me." "I'm not upset that you lied


Hmm...