# Notes

### Steps for Training an LLM

1. **Library Installation:** Essential libraries such as torch, transformers, trl, peft, and datasets are installed. These libraries provide the necessary functions and classes for model manipulation, data handling, and training.
2. **Import Modules in the Code**.
3. **Data Loading and Preparation:** Data is loaded from a source (such as the Hugging Face Hub) and prepared for training. This typically includes tokenization, formatting, and possibly data augmentation or cleaning. It is useful at this stage to have an effective way to visualise the data. With the Hugging Face `datasets` library, use `load_dataset`; passing `split="train"` returns only the training split.
4. **Model Setup (and potentially quantisation):** A pre-trained model is assigned to a variable. The model brings pre-learned weights that can be fine-tuned. You can specify a quantisation configuration using `BitsAndBytesConfig`. The model is then loaded using `AutoModelForCausalLM.from_pretrained` (for pre-trained causal language models).
5. **Tokenizer:** This step is crucial because the tokenizer converts text into a format the model can understand. We can load it using `AutoTokenizer.from_pretrained`.
6. **PEFT Parameters:** Fine-tuning every one of the model's parameters is inefficient and resource-intensive. PEFT (parameter-efficient fine-tuning) reduces the number of parameters that are trained: most of the pre-trained weights stay frozen and only a small set of parameters is updated. Two popular methods are LoRA and QLoRA. You can specify the LoRA settings using `LoraConfig`.
7. **Training Parameters:** Use `TrainingArguments` to set up training parameters like batch sizes, learning rate, weight decay, and others.
8. **Model Fine-Tuning:** To actually do the fine-tuning process we can set up the trainer. One specific example is `SFTTrainer`, which loads a trainer to do supervised fine-tuning (where you are providing the model with a training dataset containing labelled examples so that the general LLM can be adapted for specific tasks or domains).
9. **Training Execution:** Run the training process with `trainer_name.train()`. This adjusts the model's weights based on the training data, loss function, and optimizer defined in the setup. This step may include monitoring performance and making adjustments as needed.
10. **Evaluation and Adjustment:** After initial training rounds, the model's performance is evaluated, typically on a validation set. Adjustments may be made to training parameters based on performance metrics.
11. **Final Testing and Deployment:** Once the model is fine-tuned and performs satisfactorily on validation data, it's tested on unseen test data to gauge real-world performance. Successful models are then deployed for actual use.

## The Data for this Project

The data used to train the model is the Alpaca dataset, which was built using the [Self Instruct](https://github.com/yizhongw/self-instruct) framework.

Self-Instruct is a framework that helps language models improve their ability to follow natural language instructions. It does this by using the model's own generations to create a large collection of instructional data. With Self-Instruct, it is possible to improve the instruction-following capabilities of language models without relying on extensive manual annotation.

The [Alpaca dataset](https://huggingface.co/datasets/tatsu-lab/alpaca) is built on this idea of natural-language instructions, and it stores the inputs and outputs that go with them. The outputs are what an OpenAI model generated when given the instruction and input. There are 4 data fields within the dataset:

- instruction: describes the task the model should perform. Each of the 52K instructions is unique.
- input: optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- output: the answer to the instruction as generated by text-davinci-003.
- text: the instruction, input and output formatted with the prompt template used by the authors for fine-tuning their models.
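
To make the relationship between the four fields concrete, here is a minimal sketch of how a `text` value could be assembled from the other three. The no-input template matches the samples printed later in this notebook; the with-input template and the helper name `build_text` are assumptions for illustration, so check the dataset card before relying on the exact wording.

```python
# Prompt templates in the Alpaca style. The with-input wording is an assumption;
# the no-input wording matches the `text` samples shown in this notebook.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)


def build_text(example: dict) -> str:
    """Format one record into a single training string (hypothetical helper)."""
    # An empty `input` means the instruction stands on its own.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)


record = {
    "instruction": "What are the three primary colors?",
    "input": "",
    "output": "The three primary colors are red, blue, and yellow.",
}
print(build_text(record))
```

Keeping the whole prompt in one string is what lets the trainer later read a single column (`text`) instead of three.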

# 1. Installing required packages

In [None]:
!pip install -q trl
!pip install -q peft
!pip install -q torch
!pip install -q datasets
!pip install -q transformers


# 2. Imports

In [None]:
import torch
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Hugging Face libraries:
from datasets import load_dataset # loading and processing datasets commonly used in NLP and other machine learning tasks
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# 3. Data Loading and Preparation

In [None]:
train_dataset = load_dataset("tatsu-lab/alpaca", split="train") # only loads training data

pandas_format = train_dataset.to_pandas() # convert to pandas format
display(pandas_format.head()) # displays top 5 pandas data rows, display > print because it outputs in tabular format


| | instruction | input | output | text |
|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... | | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |


In [None]:
import textwrap # standard-library module for wrapping text (makes the contents of the dataset easier to read)

for index in range(3):
   print("---"*15)
   print("Instruction: {}".format(textwrap.fill(pandas_format.iloc[index]["instruction"], width=50)))
   print("Output: {}".format(textwrap.fill(pandas_format.iloc[index]["output"], width=50)))
   print("Text: {}".format(textwrap.fill(pandas_format.iloc[index]["text"], width=50)))

---------------------------------------------
Instruction: Give three tips for staying healthy.
Output: 1.Eat a balanced diet and make sure to include
plenty of fruits and vegetables.  2. Exercise
regularly to keep your body active and strong.  3.
Get enough sleep and maintain a consistent sleep
schedule.
Text: Below is an instruction that describes a task.
Write a response that appropriately completes the
request.  ### Instruction: Give three tips for
staying healthy.  ### Response: 1.Eat a balanced
diet and make sure to include plenty of fruits and
vegetables.  2. Exercise regularly to keep your
body active and strong.  3. Get enough sleep and
maintain a consistent sleep schedule.
---------------------------------------------
Instruction: What are the three primary colors?
Output: The three primary colors are red, blue, and
yellow.
Text: Below is an instruction that describes a task.
Write a response that appropriately completes the
request.  ### Instruction: What are the three
prima

# 4, 5, 6, & 7. Model Set Up, Tokenizer, Parameters.

We will now use a pre-trained Salesforce model. NLP models need a tokenizer: a tool that converts text into a format that machine learning models can process. It breaks text down into smaller components called tokens, which can represent words, subwords, or characters, and maps each token to an integer ID.
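
As a rough illustration of the idea, the sketch below implements a toy word-level tokenizer. It is not how the XGen tokenizer works (real tokenizers use subword vocabularies learned from data); the function names and the `<unk>` placeholder are made up for this example.

```python
def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words never seen before
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn text into token IDs, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map token IDs back to words."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


vocab = build_vocab(["give three tips", "three primary colors"])
ids = encode("Give three colors please", vocab)
print(ids)                 # [1, 2, 5, 0]
print(decode(ids, vocab))  # give three colors <unk>
```

The model only ever sees the integer IDs, which is why the tokenizer and the model must come from the same checkpoint.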

In order to train the model, you first need to set up the training configuration. This requires:

*   Defining the arguments for training the model. These are specified with `TrainingArguments`, a class provided by the Hugging Face transformers library.
*   Fine-tuning the pre-trained model. We do this using **LoRA** (Low-Rank Adaptation) and **SFTTrainer**. LoRA keeps the pre-trained weights frozen and adds a small number of trainable low-rank matrices, while SFTTrainer manages the practical side of training those added parameters.

  * **LoRA** is able to adapt a model to specific tasks without the need to retrain the entire model. [LoRA](https://huggingface.co/docs/peft/en/package_reference/lora) is particularly useful when there are constraints on computational resources, training time, or when the model size is so large that full model fine-tuning is impractical. It allows for targeted updates that can significantly change the model's behavior with minimal adjustments in the parameters. In this code we imported `LoraConfig` which has default parameters, but we can also edit them by specifying new values in the code.

  * The **SFTTrainer** (Supervised Fine-Tuning Trainer) is primarily a utility or tool designed to facilitate the fine-tuning of pre-trained models using supervised learning. It provides an API to easily set up and run training loops, handle data batching, apply optimization algorithms, and manage the training process efficiently.
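
To get a feel for why LoRA is cheap, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning and under LoRA. The 4096 x 4096 layer shape is an assumed, typical size for a 7B-class model rather than a figure taken from this checkpoint; `r=16` matches the `LoraConfig` used in this notebook.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix W.
    LoRA freezes W and trains B (d_out x r) and A (r x d_in), so the
    effective weight becomes W + (lora_alpha / r) * B @ A.
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora


# Assumed layer shape; r comes from the LoRA configuration in this notebook.
full, lora = lora_param_counts(4096, 4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  lora: 131,072  ratio: 0.78%
```

Under these assumptions each adapted layer trains fewer than 1% of the parameters it would need for full fine-tuning.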




In [None]:
# Loading the model and tokenizer
pretrained_model_name = "Salesforce/xgen-7b-8k-base"
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)




* The first line sets the variable `pretrained_model_name` to the string `"Salesforce/xgen-7b-8k-base"`.
* The second line uses the `from_pretrained` method of `AutoModelForCausalLM` to load a pretrained model.
* The pretrained model is specified by `pretrained_model_name`, and the data type for the model's tensors is set to `torch.bfloat16`.
* The third line uses the `from_pretrained` method of `AutoTokenizer` to load a tokenizer that matches the pretrained model.
* The tokenizer is also specified by `pretrained_model_name`, and the `trust_remote_code` parameter is set to `True`.
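
The choice of `torch.bfloat16` matters mostly for memory. The back-of-the-envelope sketch below estimates the size of the weights alone at different precisions, assuming a round 7 billion parameters (an approximation suggested by the "7b" in the model name). It ignores activations, gradients, and optimizer state, which add to the total during training.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "4-bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # assumed round figure for a "7b" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, dtype):5.1f} GB")
```

Halving the precision from `float32` to `bfloat16` halves the weight memory, and the 8-bit and 4-bit rows show the kind of saving that quantisation with `BitsAndBytesConfig` is aiming for.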

In [None]:
# Specifying training arguments
model_training_args = TrainingArguments(
       output_dir="xgen-7b-8k-base-fine-tuned",
       per_device_train_batch_size=4,
       optim="adamw_torch",
       logging_steps=80,
       learning_rate=2e-4,
       warmup_ratio=0.1,
       lr_scheduler_type="linear",
       num_train_epochs=1,
       save_strategy="epoch"
   )

# Setting LoRA configuration
lora_peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
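
The combination of `learning_rate=2e-4`, `warmup_ratio=0.1`, and `lr_scheduler_type="linear"` describes a learning rate that ramps up linearly for the first 10% of steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python. It is an approximation of the scheduler's behaviour rather than the library's own code, and the total of 1000 steps is a hypothetical number.

```python
import math


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale from peak_lr down to 0 at the final step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{linear_schedule_lr(step, total):.2e}")
```

Warmup avoids large, noisy updates at the very start of training, when the freshly initialised LoRA matrices have not yet settled.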

# 8, 9, 10, & 11. Fine-tuning, doing the training, evaluation and testing.

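One of the parameters to set for the SFTTrainer is the maximum number of tokens in a sequence (`max_seq_length`). Tokens are not the same as characters: a token is usually a word or a word fragment, so English text generally contains fewer tokens than characters. Character counts are cheap to compute, however, and give a conservative estimate, so to choose a sensible maximum we look at the distribution of text lengths (in characters) in our dataset.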

In [None]:
import matplotlib.pyplot as plt


pandas_format['text_length'] = pandas_format['text'].apply(len)


plt.figure(figsize=(10,6))
plt.hist(pandas_format['text_length'], bins=50, alpha=0.5, color='g')
plt.title('Distribution of Length of Text')
plt.xlabel('Length of Text')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [None]:
mask = pandas_format['text_length'] > 1024
percentage = (mask.sum() / pandas_format['text_length'].count()) * 100


print(f"The percentage of text documents with a length greater than 1024 is: {percentage}%")
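
The same check can be repeated for several candidate limits to see how much data each one would truncate. The sketch below does this in plain Python on made-up character counts; `share_longer_than` is a hypothetical helper, and with the real data you would pass `pandas_format['text_length'].tolist()` instead of the sample list.

```python
def share_longer_than(lengths: list[int], limit: int) -> float:
    """Percentage of entries whose length exceeds `limit`."""
    if not lengths:
        return 0.0
    return 100 * sum(length > limit for length in lengths) / len(lengths)


lengths = [120, 300, 450, 800, 1100, 2300]  # made-up character counts

for limit in (256, 512, 1024, 2048):
    print(limit, f"{share_longer_than(lengths, limit):.1f}%")
```

Because these lengths count characters while `max_seq_length` counts tokens, the share of examples that actually get truncated is likely to be lower than this estimate.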

In [None]:
SFT_trainer = SFTTrainer(
       model=model,
       train_dataset=train_dataset,
       dataset_text_field="text",
       max_seq_length=1024,
       tokenizer=tokenizer,
       args=model_training_args,
       packing=True,
       peft_config=lora_peft_config,
   )
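
The `packing=True` option asks the trainer to combine several short examples into fixed-length sequences so that less of each batch is wasted on padding. The toy sketch below shows the general idea on lists of token IDs. It is a simplified illustration, not the library's implementation, and the function name and token IDs are invented.

```python
def pack_sequences(tokenized: list[list[int]], eos_id: int, max_len: int) -> list[list[int]]:
    """Concatenate examples separated by EOS, then cut into max_len-sized chunks."""
    stream: list[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)  # marks where one example ends and the next begins
    # Keep only full chunks; the incomplete tail is dropped in this sketch.
    n_full = len(stream) // max_len
    return [stream[i * max_len:(i + 1) * max_len] for i in range(n_full)]


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_sequences(examples, eos_id=0, max_len=4))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Note that a packed chunk can start or end in the middle of an example, which is the trade-off for getting sequences of uniform length.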

Having specified the training configuration, we can run the training process of the model in the following way:

In [None]:
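# Reuse the end-of-sequence token for padding, in case the tokenizer defines no pad token
tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))

# SFTTrainer was given `peft_config`, so it has already wrapped the model with the LoRA adapters;
# wrapping it again with `get_peft_model` is not needed. `prepare_model_for_kbit_training`
# is meant for quantised models, and this model was loaded in bfloat16, so it is skipped too.
trainer = SFT_trainer
trainer.train()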