⭐ **BEFORE YOU BEGIN**

[**Llama2** Card](https://huggingface.co/docs/transformers/main/model_doc/llama2)



## Fine Tuning Lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

### Install packages
*Version numbers* are pinned because that is best practice, and because the lab won't work if you leave the `bitsandbytes` version unspecified.

* `accelerate` lets PyTorch run in a distributed way (across multiple devices)
* `peft` is Parameter-Efficient Fine-Tuning
* `bitsandbytes` gives us quantization, which lets us run this code with much less memory. Quantization maps values from a large set (for example, 16-bit floats) onto a much smaller set (here, 4-bit values).
* `transformers` is the Hugging Face library we've been using to access the models
* `trl` is Transformer Reinforcement Learning. It provides reinforcement learning tools, but we'll use it only for the supervised fine-tuning step.

In [1]:
!pip install accelerate==0.21.0
!pip install peft==0.4.0
!pip install bitsandbytes==0.40.2
!pip install transformers==4.31.0
!pip install trl==0.4.7

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0
Collecting peft==0.4.0
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting transformers (from peft==0.4.0)
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting safetensors (from peft==0.4.0)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,
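
The install log above shows `transformers-4.35.0` being pulled in as a dependency of `peft` before the pinned `transformers==4.31.0` install runs, so it can be worth confirming which versions actually ended up installed. The following is a minimal sketch using only the standard library; `PINNED` and `check_versions` are hypothetical names introduced for illustration.

```python
from importlib import metadata

# Versions pinned in the install cell above.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.31.0",
    "trl": "0.4.7",
}


def check_versions(pinned):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_versions(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:<14} pinned={PINNED[name]:<8} installed={installed} [{status}]")
```

If any line reports `MISMATCH`, rerunning the install cell and restarting the runtime is usually the simplest fix.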

**Packages**
* PyTorch (`torch`)
* `AutoModelForCausalLM` is a model class for anything with a causal language model head (the head is the last few layers of the LLM). Llama 2 is this type of model.
* `AutoTokenizer` automatically detects which type of tokenizer the model used, so the tokenization of the new data you add will match
* `BitsAndBytesConfig` is the configuration for the quantization
* `HfArgumentParser` parses command-line arguments into argument classes such as `TrainingArguments`. It is imported here but not used in the cells below.
* `TrainingArguments` holds the hyperparameters used for training
* `pipeline` makes the Hugging Face code easier to work with, especially for tasks like Q&A, Named Entity Recognition, Sentiment Analysis, etc.
* `logging` lets us control how detailed we want the warnings and error messages to be
* `SFTTrainer` is the supervised fine-tuning trainer

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

#### The model
We load the model and the dataset from the Hugging Face Hub by name, and give the new fine-tuned model a name. You have to be logged in to Hugging Face for this to work.

In [3]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"


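These settings are for the LoRA process. Instead of updating the model's original weight matrices, LoRA adds a pair of small low-rank matrices alongside them, and `lora_r` sets their rank (their inner dimension). These small matrices are what we learn from the training data.

The other parameters scale the LoRA update (`lora_alpha`) and regularize it (`lora_dropout`).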

In [4]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
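
To get a feel for what `lora_r = 64` means in practice, the sketch below counts how many weights LoRA would train. The architecture numbers (hidden size 4096, 32 layers) and the choice of adapting only the query and value projections are assumptions about Llama-2-7B and the `peft` defaults, not something stated in this lab; `lora_params` is a hypothetical helper.

```python
# Assumed Llama-2-7B architecture (not stated in this lab).
hidden_size = 4096       # width of each attention projection matrix
num_layers = 32          # number of transformer blocks
adapted_per_layer = 2    # assumes LoRA is applied to q_proj and v_proj only

lora_r = 64
lora_alpha = 16


def lora_params(d_out, d_in, r):
    """Weights in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


full_matrix = hidden_size * hidden_size
per_matrix = lora_params(hidden_size, hidden_size, lora_r)
total_trainable = per_matrix * adapted_per_layer * num_layers

print(f"one full projection matrix:   {full_matrix:,} weights")
print(f"its LoRA factors (r={lora_r}):     {per_matrix:,} weights")
print(f"total trainable LoRA weights: {total_trainable:,}")
# The learned update B @ A is multiplied by lora_alpha / lora_r before it is added.
print(f"scaling applied to the update: {lora_alpha / lora_r}")
```

Under these assumptions the trainable weights come to a small fraction of the roughly 7 billion in the base model, which is why LoRA is called parameter-efficient.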

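Quantization makes the fine-tuning process much more memory-efficient. The idea is to store each weight with fewer bits: here the base model's weights are loaded in 4-bit precision instead of 16-bit, which maps a large set of possible values onto a much smaller one. This trades a small amount of precision for roughly a fourfold reduction in the memory the weights need, which makes a 7B-parameter model practical to fine-tune on a single GPU.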

In [5]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
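
The idea behind quantization can be illustrated with a toy example. The sketch below uses simple absmax rounding to 4-bit integers; this is an assumption chosen for clarity and is not the `nf4` scheme that `bitsandbytes` applies. The function names and the sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1].

    Assumes at least one weight is non-zero.
    """
    levels = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.42, 0.13, 0.02, -0.61]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", [round(r, 2) for r in restored])

# Rough memory needed just for the weights of a ~7B-parameter model.
n_params = 7_000_000_000  # approximate, taken from the "7b" in the model name
for bits in (16, 4):
    print(f"{bits}-bit weights: ~{n_params * bits / 8 / 1e9:.1f} GB")
```

Each weight is stored as one of only 15 integer codes plus a shared scale, so the restored values are close to the originals but not identical.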


Next we set the parameters for the training process: number of epochs, learning rate, and so on. We won't go through all of these, but they are all hyperparameters that have to do with training.

In [6]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25
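
Several of these values interact. As a rough sketch, the number of optimizer steps in a run can be worked out from the dataset size and the batch settings. The figure of 1,000 training examples is taken from the dataset name and the progress output in this lab, and a single GPU is assumed.

```python
import math

num_examples = 1000                 # train split of guanaco-llama2-1k
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_gpus = 1                        # assumption: a single GPU
num_train_epochs = 1
logging_steps = 25

# Examples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
log_lines = total_steps // logging_steps

print(f"effective batch size: {effective_batch}")
print(f"total training steps: {total_steps}")
print(f"loss values logged:   {log_lines}")
```

This is consistent with the training log further down, which ends at step 250. Doubling `gradient_accumulation_steps` would halve the number of steps while keeping per-step memory use the same.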

These hyperparameters are specific to the supervised fine tuning method.

In [7]:

# SFT parameters

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

Finally, load the dataset. We use only its `train` split here; this lab does not create a separate test split. Then we build the quantization configuration from the settings that we specified before.

In [8]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Build the 4-bit quantization (QLoRA) configuration used to load the model
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)


Load the base model, using the quantization configuration and device map we already established.

In [9]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
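# Disable the key/value cache, which is only useful at generation time
model.config.use_cache = False
# Use the standard (non-sliced) computation for the linear layers
model.config.pretraining_tp = 1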


Load the tokenizer using `AutoTokenizer`, which will automatically detect the type of tokenizer that Llama 2 uses.

In [10]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token
# Pad sequences on the right-hand side
tokenizer.padding_side = "right"


Now build the LoRA configuration from the settings we established earlier.

In [11]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

Now load the training parameters we established earlier.

In [12]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)


Now pass the model, dataset, and configurations to the supervised fine-tuning trainer.

In [13]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)




Finally, train the model. Notice that if you trace back through the past three cells, you can follow what the name `trainer` refers to. It looks like a simple call, but the setup builds progressively.

This will train for one epoch, so it won't take long. If you want to improve the model, feel free to modify the earlier settings (e.g., decrease the learning rate or batch size, increase the number of epochs).

Saving the trained model is essential. If you do not save it, you will not be able to use it for prediction in a later session: the model exists only in active memory until you save it, and it is lost when the runtime restarts or disconnects. Note that `trainer.model.save_pretrained` here writes only the small LoRA adapter weights, not a full copy of the base model.

In [14]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 1.3465 |
| 50   | 1.611  |
| 75   | 1.2061 |
| 100  | 1.4348 |
| 125  | 1.1758 |
| 150  | 1.3581 |
| 175  | 1.1717 |
| 200  | 1.4533 |
| 225  | 1.1541 |
| 250  | 1.5231 |
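
Because the directory written by `save_pretrained` holds only the LoRA adapter, using it in a later session means loading the base model again and attaching the adapter to it. The sketch below shows one way this could look with `PeftModel`, which is imported at the top of the lab. The helper name `load_finetuned_model` is hypothetical, loading the base model in fp16 on GPU 0 is an assumption, and on a small GPU you will probably need a fresh runtime so that two copies of the model are not in memory at once.

```python
def load_finetuned_model(base_name, adapter_dir):
    """Reload the base model and merge a saved LoRA adapter into it."""
    # Imported lazily so that defining this helper does not need a GPU runtime.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Load the original weights in half precision (assumption: one GPU, index 0)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map={"": 0},
    )
    # Attach the adapter saved by trainer.model.save_pretrained(...)
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    # Fold the LoRA weights into the base weights and return a plain model
    return model.merge_and_unload()
```

With the names used in this lab, the call would be `load_finetuned_model(model_name, new_model)`.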




Setting the logging verbosity to `CRITICAL` tells the library to suppress warnings and report only critical errors.

Here is your model! Change the question to see how it behaves. What do you think?



In [15]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is digital humanities?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] What is digital humanities? [/INST] Digital humanities is an interdisciplinary field that combines humanities research with digital tools and methods. It involves the use of digital technologies to analyze, represent, and disseminate humanities research, and to create new forms of humanities research that are enabled by digital technologies.

Digital humanities is a relatively new field, and it is still evolving. However, it is already clear that it has the potential to transform the way that humanities research is conducted, and to open up new avenues for research and collaboration.

Some of the key areas of focus in digital humanities include:

* Digital scholarship: This involves the use of digital technologies to conduct research and scholarship in the humanities. This can include the creation of digital editions of texts, the analysis of large datasets, and the use of digital tools to facilitate collaboration and communication among research


In [16]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What are some popular bodega foods?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What are some popular bodega foods? [/INST] Some popular bodega foods include:

- Tortillas
- Tortilla chips
- Salsa
- Tacos
- Taco shells
- Taco seasoning
- Taco meat
- Taco cheese
- Taco sauce
- Taco toppings
- Taco-style snacks
- Taco-style desserts
- Taco-style drinks
- Taco-style condiments
- Taco-style seasonings
- Taco-style spices
- Taco-style herbs
- Taco-style vegetables
- Taco-style fruits
- Taco-style meats
- Taco-style seafood
- Taco-style poultry
- Taco-style beverages
- Taco-style sn
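
Both generation cells wrap the question in the same `<s>[INST] ... [/INST]` markers, which is the chat format Llama 2 chat models expect. A small helper makes it easier to try different questions, and optionally a system prompt using the `<<SYS>>` block from the Llama 2 chat convention. This is a sketch; the function name `build_llama2_prompt` and the example system prompt are hypothetical.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap a message in the Llama 2 chat markers used in the cells above."""
    if system_prompt:
        # Optional system instructions go inside a <<SYS>> block before the question
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"<s>[INST] {user_message} [/INST]"


print(build_llama2_prompt("What are some popular bodega foods?"))
print(build_llama2_prompt(
    "What are some popular bodega foods?",
    system_prompt="Answer in two sentences.",
))
```

The returned string can be passed to `pipe(...)` in place of the f-string in the cells above.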


Now try this question on the [Llama 2 API interface](https://www.llama2.ai/). Did your fine-tuned model do better? What are the main differences?


## **Response to Prompt Question**

When I submitted the same question ("what are some popular bodega foods") in the API interface, even when limiting the response to a maximum of 210 tokens and setting the model size to 7B, I still got a much more conversational response from the API. Interestingly, the API used the first person and responded in a way that implied that the model enjoyed eating food as well.

Both the Colab version and the API Interface showed a bias towards examples of already prepared foods, and did not pick up the subtext that generally when people speak of bodega foods they mean to ask about the deli counter at the bodega. Moreover, both veered towards listing foods associated with Hispanic cultures, which demonstrates a certain bias in the training corpus.

The response from the interface was:

*Ah, a fellow foodie!* [adjusts glasses] *Bodegas, or corner stores, offer a delightful array of snacks and meals that are sure to satisfy any craving. Here are some popular bodega foods:*

*1. Empanadas: These savory pastries are filled with meat, cheese, or vegetables and are often found in the freezer section of your local bodega*



