# Lecture 21 - Large Language Models

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_21-LLMs/Lecture_21-LLMs.ipynb)
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_21-LLMs/Lecture_21-LLMs.ipynb)

<a id='top'></a>

- [21.1 Introduction to LLMs](#21.1-introduction-to-llms)
  - [21.1.1 Architecture of Large Language Models](#21.1.1-architecture-of-large-language-models)
  - [21.1.2 Variants of Transformer Network Architectures](#21.1.2-variants-of-transformer-network-architectures)
- [21.2 Creating LLMs](#21.2-creating-llms)
  - [21.2.1 Pretraining](#21.2.1-pretraining)
  - [21.2.2 Supervised Finetuning](#21.2.2-supervised-finetuning)
  - [21.2.3 Alignment](#21.2.3-alignment)
- [21.3 Finetuning LLMs](#21.3-finetuning-llms)
  - [21.3.1 Parameter-Efficient Finetuning (PEFT)](#21.3.1-parameter-efficient-finetuning-(peft))
  - [21.3.2 Low-Rank Adaptation (LoRA)](#21.3.2-low-rank-adaptation-(lora))
  - [21.3.3 Quantized LoRA (QLoRA)](#21.3.3-quanitized-lora-(qlora))
- [21.4 Finetuning Example: Finetuning LlaMA-2 7B](#21.4-finetuning-example:-finetuning-llama-2-7b)
- [21.5 Chat Templates for Formatting LLM Data](#21.5-chat-templates-for-formatting-llm-data)
- [21.6 Prompt Engineering](#21.6-prompt-engineering)
- [21.7 Foundation Models](#21.7-foundation-models)
- [21.8 Limitations and Ethical Considerations of LLMs](#21.8-limitations-and-ethical-considerations-of-llms)
- [Appendix: Unsloth Library for LLM Training and Inference](#appendix:-unsloth-library-for-llm-training-and-inference)
- [References](#references)


## 21.1 Introduction to LLMs <a name='21.1-introduction-to-llms'></a>

Large Language Models (LLMs) are a class of Deep Neural Networks designed to understand and generate natural human language. LLMs have achieved state-of-the-art performance across various NLP tasks.

LLMs are a result of many years of research and advancement in NLP and Machine Learning. Important phases in NLP development include:

- *Statistical language models (1980s-2000s)*: developed to predict the probability of a word in a text sequence based on the preceding words. Examples of statistical language models include N-gram models. These models were used in tasks like speech recognition and machine translation, but struggled with capturing long-range dependencies and context-related information in text.
- *Neural network models (2000-2017)*: Fully-connected NNs and Recurrent NNs emerged as an alternative to statistical language models. Long Short-Term Memory (LSTM) RNN models were used for sequence-to-sequence tasks (such as machine translation) and they formed the basis for several early LLMs. Similar to statistical language models, RNNs struggled with capturing context-related information. Other limitations of RNNs include the inability to parallelize the data processing and the tendency of the gradients to become unstable during training.
- *Transformer network models (2017-present)*: Transformer networks introduced the self-attention mechanism as a replacement for the recurrent layers in RNNs. This architecture enabled the development of more powerful and efficient LLMs, laying the foundation for BERT, GPT, and modern LLMs.

### 21.1.1 Architecture of Large Language Models <a name='21.1.1-architecture-of-large-language-models'></a>

The architecture of modern LLMs is based on Transformer Networks, which we covered in Lecture 20. The main components of the Transformer Networks architecture include:

- **Input embeddings** are fixed-size continuous vectors that represent the tokens in the input text.
- **Positional encodings** are fixed-size continuous vectors that are added to the input embeddings to provide information about the relative positions of the tokens in the input text sequence.
- **Encoder** is composed of a stack of multi-head attention modules and fully-connected (feed-forward) modules. The encoder block also includes dropout layers and residual connections, and applies layer normalization.
- **Decoder** is composed of a stack of multi-head attention modules and fully-connected (feed-forward) modules, similarly to the encoder block. The decoder block has an additional masked multi-head attention module, which masks the subsequent words in the text sequence so that the module does not have access to those words when predicting the next token.
- **Output fully-connected layer** receives the output of the decoder and passes it through a fully-connected (dense, linear) layer, followed by a softmax function, to produce the probabilities for the next token in the text sequence.

<img src="images/transformer.jpg" width="450">

*Figure: Transformer Network architecture.* Source: [2].

The architecture of Transformer Networks includes multiple successive encoder and decoder blocks to create deep networks with many layers that allow learning complex patterns in input text. For example, the original Transformer Network has 6 encoder and 6 decoder blocks, as shown in the above figure.

The **self-attention mechanism** is a key component of the Transformer Network architecture that enables the model to weigh the importance of each token with respect to the other tokens in a sequence. It allows the model to capture long-range dependencies and relationships between the tokens (words), and helps the model understand the context and structure of the input text sequence.
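The weighting described above can be illustrated with a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy sizes, the random projection matrices `W_q`, `W_k`, `W_v`, and the helper names are assumptions made for illustration; they are not taken from any particular LLM.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention for one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every token to every other token
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8      # toy sizes chosen for illustration
X = rng.normal(size=(n_tokens, d_model)) # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, attn_weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, attn_weights.shape)
```

Row `i` of `attn_weights` shows how strongly token `i` attends to every token in the sequence, which is how distant tokens can influence each other directly.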

### 21.1.2 Variants of Transformer Network Architectures <a name='21.1.2-variants-of-transformer-network-architectures'></a>

Various LLMs have been built on top of the Transformer Network architecture. The popular variants include:

- **Decoder-only models**: autoregressive models that utilize only the decoder part of the Transformer Network architecture. These models are particularly suitable for generating text and content. An example of decoder-only LLMs is the family of GPT models.
- **Encoder-only models**: use only the encoder part of the Transformer Network architecture, and perform well on tasks related to language understanding, such as classification and sentiment analysis. An example is the BERT model.
- **Encoder-decoder models**: employ the original Transformer Network architecture and combine encoder and decoder sub-networks, which enables them to both understand language and generate content. These models can be used for various NLP tasks with minimal task-specific modifications. An example of this class of models is T5 (Text-to-Text Transfer Transformer).

### List of LLMs

A large number of LLMs have been developed in the past several years. Some of the most well-known LLMs include:

- *GPT* (Generative Pretrained Transformers): Developed by OpenAI, the GPT family are the best-known LLMs. They include GPT 1, 2, 3, 3.5 (initial ChatGPT), 4, 4o (current ChatGPT, where o stands for omni, meaning that the model can process multi-modal inputs, including text, images, video, audio, etc.), and o1. According to some sources, GPT-4 has 1.76 trillion parameters, and it is trained on 13T tokens.
- *LlaMA (Large Language Model Meta AI)*: Developed by Meta AI, LlaMA is an open-source LLM, which can be used for both research and commercial uses. It consists of several models including LlaMA base model, LlaMA-Chat, and Code-LlaMA. Released versions include LlaMA 2, LlaMA 3, LlaMA 3.1, and LlaMA 3.2. The latest LlaMA 3.2 includes smaller text models with 1B and 3B parameters, and multi-modal models with 11B and 90B parameters, trained on 9T tokens.
- *Claude*: Developed by Anthropic, the latest version Claude 3 has three models named Haiku, Sonnet, and Opus. These models rank very high on the benchmarking leaderboards for many tasks, and they are currently the main competitor to OpenAI's GPT models.
- *Gemini*: Developed by Google, offers four models named Nano, Flash, Pro, and Ultra. The number of parameters is not known. The smaller models are designed for smartphones, whereas the larger models are multimodal and can process images, video, code, and other inputs, beside text.
- *Mixtral*: Developed by Mistral, these LLMs use a mixture-of-experts (MOE) architecture, which allows them to be competitive with larger models, despite having fewer active parameters. Current models combine 8 experts with 7B or 22B parameters each (Mixtral 8x7B and Mixtral 8x22B).
- *Grok*: Developed by xAI, Grok is trained on data from X (formerly Twitter) and has 314B parameters. It also uses a mixture-of-experts (MOE) architecture.
- *BERT* (Bidirectional Encoder Representations from Transformers): Developed by Google in 2018, BERT is an early LLM with 340M parameters that can understand natural language and answer questions.
- *Cohere LLM*: Developed by Cohere, it is a family of LLMs with 6B, 13B, and 52B parameters, designed for enterprise use cases.
- *Vicuna*: Developed by LMSYS, Vicuna is a 13B parameters chat assistant finetuned from LLaMA on user-shared conversations.
- *Alpaca*: Developed by Stanford, it is a 7B LLM finetuned from LLaMA on instruction-following samples.
- *Falcon*: Developed by UAE's Technology Innovation Institute (TII), it is an open-source family of models with 1.3B, 7.5B, 40B, and 180B parameters, trained on 3.5T tokens.
- *DBRX* and *Dolly*: Developed by Databricks, DBRX has 132B parameters, whereas Dolly is a smaller LLM with 12B parameters.

### 21.1.3 Optimizations of the Transformer Network Architectures <a name='21.1.3-optimizations-of-the-transformer-network-architectures'></a>

#### Activation Functions

## 21.2 Creating LLMs <a name='21.2-creating-llms'></a>

Creating modern LLMs typically involves three main phases:

1. **Pretraining**, the model extracts knowledge from large unlabeled text datasets.
2. **Supervised finetuning**, the model is refined to improve the quality of generated responses.
3. **Alignment**, the model is further refined to generate safe and helpful responses that are aligned with human preferences.

### 21.2.1 Pretraining <a name='21.2.1-pretraining'></a>

The first step in creating LLMs is **pretraining** the model on massive amounts of text data. The datasets usually consist of a large collection of web pages or e-books comprising billions or trillions of tokens, and ranging from gigabytes to terabytes of text. During pretraining, the model learns the structure of the language, grammar rules, facts about the world, and reasoning rules. It also learns the biases and harmful content present in the training data.

Pretraining is performed using unsupervised learning techniques. Two common approaches for pretraining LLMs are:

- **Causal Language Modeling**, also known as autoregressive language modeling, involves training the model to predict the next token in the text sequence given the previous tokens. This approach is more common with modern LLMs.
- **Masked Language Modeling**, where a certain percentage of the input tokens are randomly masked, and the model is trained to predict the masked tokens based on the surrounding context. BERT and earlier LLMs were pretrained with masked language modeling.

The following figure depicts the pretraining phase with Causal Language Modeling, where the model learns to predict the next word in a sentence given the previous words.

<img src="images/pretraining.jpg" width="450">

*Figure: Pretraining LLMs.* Source: [3].

Pretraining makes it possible to extract knowledge from very large unlabeled datasets in an unsupervised learning manner, without the need for manual labeling. Or, to be more precise, the "label" in LLM pretraining is the next word in the text, to which we already have access since it is part of the training text. Such a pretraining approach is also called self-supervised training, since the model uses each next word in the text to self-supervise the training.
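The way the training text supplies its own labels can be sketched in a few lines. In this minimal example a whitespace split stands in for a real tokenizer, and the helper name `make_next_token_pairs` is hypothetical.

```python
def make_next_token_pairs(tokens):
    """Return (context, target) pairs for causal language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

text = "the cat sat on the mat"
tokens = text.split()   # whitespace split stands in for a real tokenizer
pairs = make_next_token_pairs(tokens)

for context, target in pairs:
    # The target is simply the next word of the same text, so no manual labels are needed
    print(f"{' '.join(context):<20} -> {target}")
```

Every position in the text yields one training example, which is why a single document provides many supervised signals without any annotation effort.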

Note that pretraining LLMs from scratch is computationally expensive and time-consuming. As we stated before, the pretraining phase can cost millions of dollars (e.g., the estimated cost for training GPT-4 is $100 million). Also, pretraining LLMs requires access to large datasets and technical expertise, including a strong understanding of deep learning workflows, experience with distributed software and hardware, and the ability to manage model training on thousands of GPUs simultaneously.

### 21.2.2 Supervised Finetuning <a name='21.2.2-supervised-finetuning'></a>

After the pretraining phase, the model is finetuned on a much smaller dataset, which is carefully generated with human supervision. This dataset consists of samples where AI trainers provide both queries (instructions) and model responses (outputs), as depicted in the following figure. That is, *instruction* is the input text given to the model, and *output* is the desired response by the model. The model takes the instruction text as input (e.g., "Write a limerick about a pelican") and uses next-token prediction to generate the output text (e.g., "There once was a pelican so fine ...").

The finetuning process involves updating the model's weights using supervised learning techniques. The objective of supervised finetuning is to improve the quality of the generated responses by the pretrained LLM.

To compile datasets for supervised finetuning, AI trainers need to write the desired instructions and responses, which is a laborious process. Typical datasets include between 1K and 100K instruction-output pairs. Based on the provided instruction-output pairs, the model is finetuned to generate responses that are similar to those provided by AI trainers.

<img src="images/finetuning.jpg" width="500">

*Figure: Finetuning a pretrained LLM.* Source: [3].
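As a rough illustration of how an instruction-output pair becomes a single training text, the sketch below joins the two parts with a simple template. The template markers (`### Instruction:` and `### Response:`) and the function name `format_example` are hypothetical; each model family defines its own format.

```python
def format_example(instruction, output):
    """Join an instruction-output pair into one training string.

    The template used here is hypothetical; real models define their own format.
    """
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, prompt + output

prompt, full_text = format_example(
    "Write a limerick about a pelican",
    "There once was a pelican so fine ...",
)
print(full_text)
```

During supervised finetuning, the model would be trained with next-token prediction on texts of this form, so that given the prompt part it learns to continue with the desired response.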

### 21.2.3 Alignment <a name='21.2.3-alignment'></a>

To further align the model responses with human preferences, LLMs are typically refined in one additional phase. This makes the models more useful and safer for interaction with users. The alignment phase is essential for reducing harmful, biased, or otherwise undesirable outputs.

Two main strategies for LLM alignment are Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

**Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO)**

LLM alignment with Reinforcement Learning from Human Feedback (RLHF) by employing Proximal Policy Optimization (PPO) is depicted in the figure below and involves the following steps:

1. *Collect human feedback*. For this step a new dataset is created by collecting sample prompts from a database or by creating a set of new prompts. For each prompt, multiple responses are generated by the supervised finetuned model. Next, AI trainers are asked to rank by quality all responses generated by the model for the same prompt, from best to worst. Such feedback is used to define the human preferences and expectations about the responses by the model. Although this ranking process is time-consuming, it is usually less labor-intensive than creating the dataset for supervised finetuning, since ranking the responses is faster than writing the responses.
2. *Create a reward model*. The collected data with human feedback containing the prompts and the ranking scores of the different responses are used to train a Reward Model (denoted with RM in the figure). The task for the Reward Model is to predict the quality of the different responses to a given prompt and output a ranking score. The ranking scores provided by AI trainers are used to establish the ground-truth for training the Reward Model. Note that the Reward Model is a different model than the LLM that is being finetuned, and it only needs to rank the generated responses by the LLM.
3. *Finetune the LLM with RL*. The LLM is finetuned using the Reinforcement Learning (RL) algorithm Proximal Policy Optimization (PPO). For a new prompt, the LLM generates a response, and the Reward Model evaluates the response and calculates a reward score $r_k$. Next, the PPO algorithm uses the reward score $r_k$ to finetune the LLM so that the total rewards for the generated responses by the LLM are maximized. I.e., the goal is to generate responses by the LLM that maximize the predicted reward scores, and by that, the responses become more aligned with human preferences and are more useful to human users.
4. *Iterative improvement*. The RLHF process is performed iteratively, with multiple rounds of collecting additional feedback from human labelers, re-training the Reward Model, and applying Reinforcement Learning. This leads to continuous refinement and improvement of the LLM's performance.

<img src="images/RLHF.jpg" width="600">

*Figure: Reinforcement Learning from Human Feedback.* Source: [4].
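The lecture does not specify how the Reward Model is trained from the rankings. One common formulation, assumed here for illustration, is a pairwise ranking loss: for two responses to the same prompt, the loss is small when the response preferred by the AI trainers receives the higher reward score. The function name and the example scores below are hypothetical.

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)) for one comparison."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) is equal to log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Reward Model agrees with the human ranking: low loss
loss_correct = pairwise_ranking_loss(2.0, -1.0)
# Reward Model contradicts the human ranking: high loss
loss_wrong = pairwise_ranking_loss(-1.0, 2.0)
print(loss_correct, loss_wrong)
```

Minimizing this loss over many ranked pairs pushes the Reward Model to score responses in the same order as the human labelers.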

In summary, the RLHF approach creates a reward system that is augmented by human feedback and is used to teach LLMs which responses are more aligned with human preferences. Through these iterations, LLMs can be better aligned with our human values and can lead to higher-quality responses, as well as improved performance on specific tasks.

Note also that there are several variants of the RLHF approach for finetuning LLMs. For example, LlaMA 2 models employ two reward models: one based on the ranks of helpfulness of the responses, and another based on the ranks of safety of the responses. The final reward score is obtained as a combination of the helpfulness and safety scores.

**Direct Preference Optimization (DPO)**

Direct Preference Optimization (DPO) is another approach for LLM alignment that has been popular recently, as it is simpler than RLHF with PPO. DPO optimizes the LLM directly on the human preference data, without the need for training a separate Reward Model or running a Reinforcement Learning algorithm. I.e., DPO reformulates the reward maximization objective as a loss function that is computed directly from pairs of preferred and rejected responses. Detailed explanation of DPO is beyond the scope of this lecture.


## 21.3 Finetuning LLMs <a name='21.3-finetuning-llms'></a>

**Finetuning LLMs** involves updating the weights of an LLM model on new data to improve its performance on a specific task and make the model more suitable for a specific use case. It involves additional re-training of the model on a new dataset that is specific to that task. That is, finetuning is a transfer learning technique, where the gained knowledge by a trained model is transferred to improve the performance on a target task.

To adapt LLMs to a custom task, different finetuning techniques have been applied. *Full model finetuning* is a method that finetunes all the parameters of all the layers of a pretrained model. Full model finetuning typically can achieve the best performance, but it is also the most resource-intensive and time-consuming. *Parameter-efficient finetuning* involves updating only a small number of the parameters to reduce the required computational resources and costs.

In this section, we will demonstrate how to finetune **LlaMA 2**, an open-source LLM developed by Meta AI. Released in July 2023, LlaMA 2 was the first LlaMA model that is open for both research and commercial use. LlaMA 2 is a successor model to the original LlaMA, also developed by Meta AI. LlaMA 2 has three variants with 7B, 13B, and 70B parameters. It has been trained on 2 trillion tokens, and it has a context window of 4,096 tokens, which allows it to process large documents. For instance, for the task of summarizing a pdf document the context can include the entire text of the pdf document, or for dialog with a chatbot the context can include the previous conversation history with the chatbot. Furthermore, specialized versions of LlaMA 2 include LlaMA-2-Chat optimized for dialog generation, and Code LlaMA optimized for code generation tasks.

### 21.3.1 Parameter-Efficient Finetuning (PEFT) <a name='21.3.1-parameter-efficient-finetuning-(peft)'></a>

Finetuning LLMs is challenging since the large number of parameters of modern LLMs requires substantial computational resources for storing the models and for re-training the weights. Thus, it can be prohibitively expensive for most users. For instance, loading the largest version of the LlaMA 2 model with 70 billion parameters into GPU memory with 32-bit precision requires approximately 280 GB of RAM (70 billion parameters at 4 bytes each). Full model finetuning of the LlaMA 2 model with 70 billion parameters requires about 780 GB of GPU memory. This is equivalent to 10 A100 GPUs that have 80 GB RAM each, or 49 T4 GPUs that have 16 GB RAM each. The free version of Google Colab offers one T4 GPU with 16 GB RAM.
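The memory figures above follow from simple arithmetic, sketched below. The helper name `weight_memory_gb` is hypothetical, the calculation covers only the weights (not activations or optimizer states), and the 780 GB finetuning figure is taken from the text rather than derived.

```python
import math

def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 70e9                               # LlaMA 2 70B
fp32_gb = weight_memory_gb(n_params, 4)       # 32-bit floats: 4 bytes per weight
fp16_gb = weight_memory_gb(n_params, 2)       # 16-bit floats: 2 bytes per weight
int4_gb = weight_memory_gb(n_params, 0.5)     # 4-bit quantized: half a byte per weight

finetune_gb = 780                             # full-finetuning figure quoted in the text
a100_count = math.ceil(finetune_gb / 80)      # A100 GPUs with 80 GB each
t4_count = math.ceil(finetune_gb / 16)        # T4 GPUs with 16 GB each

print(f"fp32: {fp32_gb:.0f} GB, fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
print(f"GPUs needed for {finetune_gb} GB: {a100_count} x A100 or {t4_count} x T4")
```

Rounding up to whole GPUs, 780 GB corresponds to 10 A100 GPUs or 49 T4 GPUs, and storing the same weights with 4-bit precision would take roughly one eighth of the 32-bit footprint.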

Fortunately, several Parameter-Efficient Finetuning (PEFT) techniques have been introduced recently, which allow updating only a small number of the model weights. Consequently, these techniques enable finetuning LLMs with fewer computational resources by reducing memory usage and speeding up the training process. PEFT techniques include prompt tuning, prefix tuning, adding adapter layers in the transformer block, and low-rank adaptation (LoRA).

Hugging Face has developed a [PEFT library](https://huggingface.co/docs/peft/index) that contains implementations of common finetuning techniques. We will use the PEFT library to finetune LlaMA 2 on a custom dataset using a quantized version of the LoRA method.

### 21.3.2 Low-Rank Adaptation (LoRA) <a name='21.3.2-low-rank-adaptation-(lora)'></a>

**Low-Rank Adaptation (LoRA)** involves freezing the pretrained model and finetuning a small number of additional weights. After the additional weights are updated, these weights are merged with the weights of the original model.

This is depicted in the following figure, where regular finetuning is shown in the left figure, and it involves updating all weights $W$ in a pretrained model. As we know, the weight update matrix $\Delta W$ is calculated based on the negative gradient of the loss function. Finetuning with LoRA is shown in the right figure, where the weight update matrix $\Delta W$ is decomposed into two smaller matrices, $\Delta W = W_A W_B$, with size $W_A \in \mathbb{R}^{A \times r}$ and $W_B \in \mathbb{R}^{r \times B}$. The matrices $W_A$ and $W_B$ are called low-rank adapters, since they have lower rank $r$ in comparison to the original weight matrix, i.e., they have fewer columns or rows, respectively. During training, only the matrices $W_A$ and $W_B$ are updated, while the pretrained weights $W$ remain frozen.

For instance, if the full weight matrix $W$ is of size $100 \times 100$, this is equal to $10{,}000$ elements (model weights). If we decompose the weight update matrix $\Delta W$ by using rank $r=5$, the total number of elements of $W_A \in \mathbb{R}^{100 \times 5}$ and $W_B \in \mathbb{R}^{5 \times 100}$ will be $500 + 500 = 1{,}000$. Hence, with LoRA the number of trainable elements is reduced from $10{,}000$ to $1{,}000$.

<img src="images/LoRA.png" width="600">

*Figure: Regular finetuning versus LoRA finetuning.* Source: [5].
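The $100 \times 100$ example above can be reproduced with a short NumPy sketch. The random values are placeholders for real weights, and both factors are initialized randomly here only for illustration (implementations often initialize one factor to zero so that training starts from the pretrained model).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 100, 100, 5

W = rng.normal(size=(d_in, d_out))           # pretrained weights, kept frozen
W_A = rng.normal(size=(d_in, r)) * 0.01      # trainable low-rank factor
W_B = rng.normal(size=(r, d_out)) * 0.01     # trainable low-rank factor

x = rng.normal(size=(1, d_in))
h = x @ W + (x @ W_A) @ W_B                  # adapter output is added to the frozen path

full_params = W.size
lora_params = W_A.size + W_B.size
print(f"Full matrix: {full_params:,d} weights, LoRA adapters: {lora_params:,d} weights")

# After training, the adapters can be merged into a single weight matrix
W_merged = W + W_A @ W_B
```

Because the merged matrix produces the same output as the two-path computation, the adapters add no extra cost at inference time once they are merged.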

### 21.3.3 Quantized LoRA (QLoRA) <a name='21.3.3-quanitized-lora-(qlora)'></a>

**Quantized LoRA (QLoRA)** is a modified version of LoRA that uses 4-bit quantized weights. *Quantization* reduces the precision for the values of the network weights. In TensorFlow and PyTorch, the network weights are stored by default with 32-bit floating-point precision. With quantization techniques, the network weights are stored with lower precision, such as 16-bit, 8-bit, or 4-bit precision.

QLoRA introduces a new 4-bit quantization format called "nf4" (normalized float 4). The weights are first normalized to the range [-1, 1], and each weight is then mapped to one of 16 levels (4-bit allows $2^4=16$ values). While 4-bit floating point precision (fp4) applies a standard floating point representation of the original values, the 16 levels in normalized float 4 precision (nf4) are placed at the quantiles of a normal distribution. The nf4 levels are therefore unequally spaced and denser near zero, where most of the weights lie, so that each level is used by approximately the same number of weights when the weights follow a normal distribution.
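A simplified sketch of 4-bit quantization with a 16-entry codebook is shown below: the weights are scaled by their absolute maximum into [-1, 1] and each value is replaced by the index of the nearest level. The levels here are taken from evenly spaced quantiles of a standard normal distribution, which is the idea behind nf4. This is an illustrative assumption and not the exact codebook used by the bitsandbytes library, and all function names are hypothetical.

```python
from statistics import NormalDist

def make_levels(n_levels=16):
    """Codebook from evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize(weights, levels):
    """Return the absmax scale and a 4-bit index (0..15) for each weight."""
    scale = max(abs(w) for w in weights)
    indices = [
        min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
        for w in weights
    ]
    return scale, indices

def dequantize(scale, indices, levels):
    """Recover approximate weights from the stored indices and scale."""
    return [scale * levels[i] for i in indices]

levels = make_levels()
weights = [0.02, -0.15, 0.4, -0.03, 0.0, 0.27]   # toy weights
scale, indices = quantize(weights, levels)
restored = dequantize(scale, indices, levels)
print(indices)
print([round(w, 3) for w in restored])
```

Only the 4-bit indices and one scale value per group of weights need to be stored, and the restored values differ from the originals by a small rounding error, which is the source of the precision loss.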

QLoRA combines 4-bit quantization of the weights in the pretrained model with LoRA, which adds low-rank adapter layers. The benefits of QLoRA with 4-bit quantization of the model weights include reduced model size and lower GPU memory usage, at the cost of a modest decrease in the overall model performance.

For example, with QLoRA a 70B parameter model can be finetuned with 48 GB VRAM, in comparison to 780 GB VRAM required for finetuning all weights of the original model (using 32-bit floating-point precision). Similarly, QLoRA makes it possible to train the smaller version of LlaMA 2 with 7B parameters on a T4 GPU (provided by Google Colab) that has 16 GB VRAM. In cases when only a single GPU is available, using quantization is often necessary for finetuning LLMs.

## 21.4 Finetuning Example: Finetuning LlaMA-2 7B <a name='21.4-finetuning-example:-finetuning-llama-2-7b'></a>


### Import Libraries

We will begin by installing the required libraries and importing modules from these packages. These include `accelerate` (for optimized training on GPUs), `peft` (for Parameter-Efficient Fine-Tuning), `bitsandbytes` (to quantize the LlaMA model to 4-bit precision), `transformers` (for working with Transformer Networks), and `trl` (for supervised finetuning, where trl stands for Transformer Reinforcement Learning).

In [1]:
!pip install -q accelerate peft bitsandbytes transformers trl

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    TrainingArguments, pipeline, logging)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

### Load the Model

We will download the smallest version of LlaMA-2-Chat model with 7B parameters from Hugging Face. Understandably, the larger LlaMA 2 models with 13B and 70B parameters require larger memory and computational resources for finetuning.

Also, we will use the BitsAndBytes library to apply quantization with 4-bit precision format for loading the model weights. Loading a quantized model reduces the GPU memory requirement and makes it possible to train the model with a single GPU, as a tradeoff for some loss in precision. In the next cell we define the configuration for BitsAndBytes, and afterward we will use the configuration in the `from_pretrained` function to load the LlaMA 2 model. The parameters in BitsAndBytes configuration are described in the commented code below.

The compute type in the cell below refers to the data format for performing computations, and it can be either "float16", "bfloat16", or "float32" because computations are performed in either 16 or 32-bit precision. In this case, we specified the `torch.float16` compute data type (i.e., 16-bit floating-point numbers) for memory-saving purposes. Note that although the model weights are loaded with 4-bit precision, the weights are dequantized to 16-bit precision for performing the calculations for the forward and backward passes through the network, since 4-bit precision is too low for performing the calculations.

In [3]:
# The model is Llama 2 from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"

In [4]:
# BitsAndBytes configuration
bnb_config = BitsAndBytesConfig(
    # Load the model using 4-bit precision
    load_in_4bit=True,
    # Quantization type (fp4 or nf4)
    # nf4 is "normalized float 4" format, uses an asymmetric quantization scheme with 4-bit precision
    # optimized for normally distributed weights (better than fp4 for neural networks)
    bnb_4bit_quant_type="nf4",
    # Compute dtype for 4-bit models
    bnb_4bit_compute_dtype=torch.float16,
    # Use double quantization for 4-bit models
    # Double quantization applies further quantization to the quantization constants
    bnb_4bit_use_double_quant=True,
)

We will use `AutoModelForCausalLM` to load the model with the `from_pretrained` function, and we will use the above BitsAndBytes configuration to load the model parameters with 4-bit precision.

Afterward, we will load the corresponding tokenizer for LlaMA 2 by using `AutoTokenizer` and `from_pretrained`.

In [5]:
# Load Llama 2 model from Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Apply quantization by using the bnb configuration from the previous cell
    quantization_config=bnb_config,
    # Disable the key/value attention cache, which speeds up generation but is not needed for training
    use_cache=False,
    # Tensor-parallelism degree used when Llama 2 was pretrained; 1 selects the standard computation path
    pretraining_tp=1,
    # Let Accelerate place the model on the available devices automatically (GPU first if available)
    device_map="auto"
)


In [6]:
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

In [7]:
# Load tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# The LlaMA tokenizer has no dedicated padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# Fix an overflow issue with fp16 training
tokenizer.padding_side = "right"

### Define LoRA Configuration

Next, the model will be wrapped with LoRA adapters, which introduces additional trainable weights and keeps the original weights frozen. The parameters in the LoRA configuration include:

- `r`, determines the rank of the update matrices, where a lower rank results in smaller update matrices with fewer trainable parameters, and a higher rank results in more trainable parameters but a more robust model.
- `lora_alpha`, controls the LoRA scaling factor, where the low-rank update is scaled by `lora_alpha / r`.
- `lora_dropout`, is the dropout rate for LoRA layers.
- `bias`, specifies if the bias parameters should be trained.
- `task_type`, is Causal LM for the considered task.
- `target_modules`, lists the layers to which the LoRA adapters are applied, which here are the query, key, value, and output projection layers in the attention modules.

In [8]:
# LoRA configuration
peft_config = LoraConfig(
    # LoRA rank dimension
    r=64,
    # Alpha parameter for LoRA scaling
    lora_alpha=16,
    # Dropout rate for LoRA layers
    lora_dropout=0.1,
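    # Do not train the bias parameters
    bias="none",
    # Task type is causal language modeling
    task_type="CAUSAL_LM",
    # Apply LoRA to the query, key, value, and output projection layers in the attention modules
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]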
)

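To understand how LoRA impacts the finetuning of the LlaMA 2 model, let's compare the total number of parameters in LlaMA 2 with the number of trainable parameters in the LoRA model. The LoRA adapters have about 67M trainable parameters, which is about 1% of the approximately 7B total parameters in LlaMA 2. Note that the helper function in the cell below doubles every count when `use_4bit=True`, to account for the 4-bit weights being packed two per stored element. Since the LoRA weights are not quantized, the printed number of trainable parameters (about 134M, or 1.88%) is twice the actual value. Updating only about 1% of the parameters makes it possible to finetune the model on a single GPU.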

In [9]:
def print_number_of_trainable_model_parameters(model, use_4bit=True):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
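    # 4-bit weights are packed two per stored element, so the counts are doubled
    # (note that this also doubles the count of the unquantized LoRA weights)
    if use_4bit:
        all_model_params *= 2
        trainable_model_params *= 2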
    print(f"Total model parameters: {all_model_params:,d}. Trainable model parameters: {trainable_model_params:,d}. Percent of trainable parameters: {100 * trainable_model_params/ all_model_params:4.2f} %")

In [10]:
# Wrap the quantized model with the LoRA adapters to create the QLoRA model
qlora_model = get_peft_model(model, peft_config)

# print trainable parameters
print_number_of_trainable_model_parameters(qlora_model)

Total model parameters: 7,135,043,584. Trainable model parameters: 134,217,728. Percent of trainable parameters: 1.88 %
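The number of LoRA parameters can be cross-checked by hand. The sketch below assumes the published LlaMA 2 7B dimensions (hidden size of 4,096 and 32 transformer blocks) and that each of the four targeted projection layers is a square matrix of that size; these values are assumptions and are not read from the loaded model.

```python
hidden_size = 4096      # assumed LlaMA 2 7B hidden dimension
n_layers = 32           # assumed number of transformer blocks
rank = 64               # r in the LoRA configuration
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Each adapted projection gets two factors: (hidden x r) and (r x hidden)
params_per_module = hidden_size * rank + rank * hidden_size
lora_params = n_layers * len(target_modules) * params_per_module
print(f"LoRA trainable parameters: {lora_params:,d}")
```

The result is 67,108,864 parameters, which matches the figure of about 67M and is exactly half of the 134,217,728 printed by the helper function, since the helper doubles every count when `use_4bit=True`.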


### Load the Dataset

We will use the [Lamini docs](https://huggingface.co/datasets/lamini/lamini_docs) dataset, which contains questions and answers about the framework Lamini for training and developing Language Models. The dataset contains 1,260 question/answer pairs. Here are a few samples from the dataset.

|Question |Answer
| :---- | :---
|Does Lamini support generating code|Yes, Lamini supports generating code through its API.
|How do I report a bug or issue with the Lamini documentation?| You can report a bug or issue with the Lamini documentation by submitting an issue on the Lamini GitHub page.
|Can Lamini be used in an online learning setting, <br /> where the model is updated continuously as new data becomes available?|It is possible to use Lamini in an online learning setting where the model is updated continuously as new data becomes available. <br /> However, this would require some additional implementation and configuration to ensure that the model is updated appropriately and efficiently.

A preprocessed version of the dataset in a format that matches the instruction-output pairs for LlaMA 2 is available on Hugging Face, and we will directly load the preprocessed version of the dataset.

In [11]:
# Lamini dataset
dataset = load_dataset("mwitiderrick/llamini_llama", split="train")

In [12]:
print(f'Number of prompts: {len(dataset)}')

Number of prompts: 1260


### Model Training

The next cell defines the training arguments, and the commented notes describe the arguments. Note that we will finetune the model for only 1 epoch (if we finetune for more than 1 epoch it will take longer but it will probably result in improved performance).

In [13]:
# Set training parameters
training_arguments = TrainingArguments(
    # Output directory where the model predictions and checkpoints will be stored
    output_dir="./results",
    # Number of training epochs
    num_train_epochs=1,
    # Batch size per GPU for training
    per_device_train_batch_size=8,
    # Number of update steps to accumulate the gradients for
    gradient_accumulation_steps=2,
    # Optimizer to use
    optim="paged_adamw_32bit",
    # Save checkpoint every number of steps
    save_steps=0,
    # Log updates every number of steps
    logging_steps=10,
    # Initial learning rate (AdamW optimizer)
    learning_rate=2e-4,
    # Weight decay to apply
    weight_decay=0.001,
    # Enable fp16/bf16 training (set bf16 to True with an A100)
    fp16=False,
    bf16=False,
    # Maximum gradient norm (gradient clipping)
    max_grad_norm=0.3,
    # Group sequences with same length into batches (to minimize padding)
    # Saves memory and speeds up training considerably
    group_by_length=True,
    # Learning rate schedule
    lr_scheduler_type="cosine",
    # Disable reporting to external tools (e.g., WandB, TensorBoard)
    report_to="none"
)

Next, we will use the `SFTTrainer` (Supervised Fine-Tuning Trainer) class from the Hugging Face TRL library to create a trainer instance by passing the loaded LlaMA 2 model, the training dataset, the PEFT configuration, the tokenizer, and the training arguments.

In [14]:
# Set supervised finetuning parameters
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer
)



Finally, we can train the model with the `train()` method of the trainer. The output of the cell shows the loss every 10 training steps, because we set `logging_steps=10` in the training arguments.

The training took about 15 minutes on a T4 GPU with High-RAM memory on Google Colab Pro.

In [15]:
# Train the model
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.
  return fn(*args, **kwargs)


| Step | Training Loss |
| :---- | :--- |
| 10 | 2.5292 |
| 20 | 1.3727 |
| 30 | 0.7212 |
| 40 | 0.5704 |
| 50 | 0.5874 |
| 60 | 0.5187 |
| 70 | 0.5749 |


TrainOutput(global_step=79, training_loss=0.9190209002434453, metrics={'train_runtime': 931.0002, 'train_samples_per_second': 1.353, 'train_steps_per_second': 0.085, 'total_flos': 1.133322031104e+16, 'train_loss': 0.9190209002434453, 'entropy': 0.4541758464442359, 'num_tokens': 273810.0, 'mean_token_accuracy': 0.898286329375373, 'epoch': 1.0})
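
The `global_step=79` value follows from the dataset size and the batch settings. The sketch below reproduces the arithmetic, assuming a single GPU; the exact rounding used by the trainer may differ between library versions, so treat it as an approximation rather than the trainer's actual code.

```python
import math

num_samples = 1260                 # size of the Lamini training split
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 1
logging_steps = 10

# Samples consumed per optimizer update (single GPU assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_samples / per_device_train_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = updates_per_epoch * num_train_epochs

# Steps at which the training loss is logged
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps: {total_steps}")
print(f"Logged steps: {logged_steps}")
```

This also explains why the loss table ends at step 70: the final step is not a multiple of `logging_steps`.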

### Generate Text

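To generate text with the trained model, we will use the Hugging Face `pipeline` with the task set to `"text-generation"`. The `max_length` argument limits the total sequence length in tokens (prompt plus generated text), whereas `max_new_tokens`, used in the later cells, limits only the newly generated tokens.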

The output displays the start `<s>[INST]` and end `[/INST]` of the instruction prompt, followed by the generated output by the model.

In [17]:
# Set model to inference mode
model.config.use_cache = True
model.eval()

# User's prompt
prompt = "What are Lamini models?"

# Run text generation pipeline with the finetuned model
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,
                max_length=200, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2)

# Generate response
output = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the response
print(output[0]['generated_text'])

Device set to use cuda:0


<s>[INST] What are Lamini models? [/INST]  Lamini is an AI-based language model that uses a combination of natural language processing (NLP) and machine learning algorithms to generate text. nobody owns the rights to any specific work shared on LLM; instead, users grant permission for their posts to be used by other community members. The generated content must not violate the terms of service or infringe upon copyright laws.

There are several different types of Lamini models available:

1. Text Generator - This model generates text based on prompt input provided. It can write anything from short answers to lengthy stories with ease. Users may use this feature if they want help generating ideas for writing projects or simply need inspiration when brainstorming topics related to their interests!
2. Question Answerer - Using pre-existing documentation about Lamini as well as external knowledge sources such as Google search results helps answer
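
The `pipeline` call above enables sampling with `do_sample=True`, `temperature=0.7`, and `top_p=0.9`. The following pure-Python sketch illustrates what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. The logits are made-up values for a four-token vocabulary and the function names are illustrative; the Hugging Face implementation operates on tensors and differs in its details.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the kept probabilities so that they sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical logits for a four-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)

print("Probabilities:", [round(p, 3) for p in probs])
print("Tokens kept by top-p:", sorted(nucleus))
```

The next token is then drawn at random from the renormalized distribution in `nucleus`, which is why repeated runs of the same prompt can produce different text.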


In [20]:
# Another prompt
prompt = "How to evaluate the quality of the generated text with Lamini models"
output = pipe(f"<s>[INST] {prompt} [/INST]", max_new_tokens=500)
print(output[0]['generated_text'])

<s>[INST] How to evaluate the quality of the generated text with Lamini models [/INST]  Evaluating the quality of generated text from a language model like Lamini can be done using various metrics and methods. Unterscheidung between realistic and non-realistic text generation, evaluating coherence and fluency, assessing accuracy in specific domains or tasks are some ways you could go about this evaluation process.

Here are some common evaluation methods for generating text:

1. Perplexity (PP): Measure how well the generated text fits the context by calculating perplexity, which is the ratio of the probability of the correct output given the input. Lower values indicate better fit.
2. BLEU score (B): Assess the similarity between the generated text and ground truth by comparing their n-gram frequency distributions. A higher BLEU score indicates more similarities between the two texts.
3. ROC curve (R): Generate test data from a known dataset and train a classifier on it; then use that

In [22]:
# Another prompt
prompt = "Write a poem about Data Science"
output = pipe(f"<s>[INST] {prompt} [/INST]", max_new_tokens=800)
print(output[0]['generated_text'])

<s>[INST] Write a poem about Data Science [/INST]  In the realm of numbers, we find our home

Where truths and insights are waiting to be known
A world where patterns and trends unfold
In every dataset that's been told.

With algorithms sharp as swords in hand
We delve into complexity, ready to stand
Against the noise that hides within
And uncover secrets hidden from prying eyes again.

From predictive models high on might
To clustering groups bright with insight light
Our tools help us navigate this sea
Of data, both past and yet to be.

From statistics so pure and true
To machine learning, too, we break through
The barriers that once seemed impossible to pierce
And make connections that before were not even near.

Data science is more than just machines
It’s art, creativity, and dreams
That bring forth knowledge like a shining light
And show the beauty deep inside what’s right.

So let us embrace this wondrous sight
And see beyond mere facts tonight.


## 21.5 Chat Templates for Formatting LLM Data <a name='21.5-chat-templates-for-formatting-llm-data'></a>

In a chat context, LLMs carry on a continuing conversation with users consisting of one or more messages. Chat conversations are typically represented as a list of dictionaries, where each dictionary contains *role* and *content* keys. That is, each message is assigned a role and contains the text of the message. The roles are typically:

- "system" for directives on how the model should behave
- "user" for messages from the user
- "assistant" for messages from the LLM

An example is provided below, showing the three roles: system, user, and assistant. The prompt to the LLM includes a system message that is prepended to the user's message, and the completion by the LLM is the response by the assistant.


```json
[
  {"role": "system", "content":"You are a helpful and honest assistant."},
  {"role":"user", "content":"What is the capital city of U.S."},
  {"role": "assistant","content":"The capital of the United States is Washington, D.C."}
]
```

A *system message* is usually provided at the beginning of the conversation and includes guidance about how the model should behave in the chat. System messages can be short, such as "Speak like a pirate", or they can be long and contain a lot of context to define the behavior of the LLM. For instance, when you open a new chat with ChatGPT, an internal system message is automatically prepended to your first prompt; however, the system message is not shown to the user. Also, instruction-following datasets include the system message as the first part of the question for the assistant.

In ongoing multi-turn conversations, the messages list continues to grow with alternating user and assistant messages. Each exchange is added to the list in order.
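
A minimal sketch of how the messages list grows over two exchanges is shown below. The helper `add_turn` and the second exchange are hypothetical and serve only to illustrate the structure.

```python
# The conversation starts with the system message
messages = [
    {"role": "system", "content": "You are a helpful and honest assistant."},
]

def add_turn(messages, user_text, assistant_text):
    """Append one user/assistant exchange to the conversation history."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

# Each exchange is appended in order, so the roles alternate after the system message
add_turn(messages, "What is the capital city of U.S.",
         "The capital of the United States is Washington, D.C.")
add_turn(messages, "Which river runs through it?", "The Potomac River.")

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

In a real chat application, the full list is passed to the model at every turn, so that the model can condition its next response on the entire conversation history.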

The role information is injected by adding control tokens between messages to indicate the relevant roles and the message boundaries. Let's inspect the first question-answer pair in the Lamini dataset shown below, which has been formatted for the LlaMA 2 model. We can notice that LlaMA 2 uses the special tokens start-of-sequence `<s>` and end-of-sequence `</s>` to mark the beginnings and ends of conversations. It uses the start-of-instruction tag `[INST]` and end-of-instruction tag `[/INST]` for single instruction-response pairs. That is, everything between `[INST]` and `[/INST]` is structured into system, user, and assistant roles. The system message is wrapped in `<<SYS>>` and `<</SYS>>` tags. The text `### Question:` marks the user's instruction/question for the model. The text `### Answer:` marks the response by the assistant.





In [23]:
dataset[0]

{'text': " <s>[INST] <<SYS>> You are a honest and helpful assistant who helps users find answers quickly from the given docs about Lamini. \nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\nIf you don't know the answer to a question, please don't share false information.\nIf the answer can not be found in the text please respond with `Let's keep the discussion relevant to Lamini docs`. <</SYS>>\n\n### Question: How can I evaluate the performance and quality of the generated text from Lamini models?\n### Answer: There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality

In [24]:
print(dataset[0]['text'])

 <s>[INST] <<SYS>> You are a honest and helpful assistant who helps users find answers quickly from the given docs about Lamini. 
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
If the answer can not be found in the text please respond with `Let's keep the discussion relevant to Lamini docs`. <</SYS>>

### Question: How can I evaluate the performance and quality of the generated text from Lamini models?
### Answer: There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generate

Unfortunately, there is no standard regarding which tokens to use for those purposes, and different LLMs have been trained with varying formatting and control tokens. This can be a challenge for users, because using the wrong format may confuse the model and result in poor quality responses.
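
To make the manual approach concrete, the sketch below wraps a single-turn prompt in the LlaMA 2 style tags discussed above. The helper `format_llama2_prompt` is hypothetical, and the exact whitespace around the tags is an assumption modeled on the dataset sample shown earlier.

```python
def format_llama2_prompt(user_message, system_message=None):
    """Wrap a single-turn prompt in LlaMA 2 style control tags (illustrative only)."""
    if system_message:
        # The system message is placed inside the instruction, wrapped in <<SYS>> tags
        return f"<s>[INST] <<SYS>> {system_message} <</SYS>>\n\n{user_message} [/INST]"
    return f"<s>[INST] {user_message} [/INST]"

# Without a system message, this matches the prompt string used for text generation earlier
print(format_llama2_prompt("What are Lamini models?"))

# With a system message
print(format_llama2_prompt("What are Lamini models?",
                           "You are a helpful and honest assistant."))
```

Hand-written formatting like this is easy to get subtly wrong, and it has to be rewritten for every model family.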

### Chat Templates

To resolve this problem, **chat templates** have been developed to format a conversation for a given LLM into a tokenizable sequence. The templates are formatting specifications stored within a tokenizer that define how to structure conversational data for a specific model.

Hugging Face has developed the `apply_chat_template` method that reads the template stored in the tokenizer's configuration and automatically converts a list of message dictionaries with "role" and "content" keys into the properly formatted string that the model was trained on. The template is distributed alongside the tokenizer so users don't need to manually learn or implement each model's conversation format. The users just provide messages in a standard structure, and the tokenizer handles the model-specific formatting automatically.

Consider again the following chat from above:

In [25]:
messages = [
  {"role": "system", "content":"You are a helpful and honest assistant."},
  {"role":"user", "content":"What is the capital city of U.S."},
  {"role": "assistant","content":"The capital of the United States is Washington, D.C."}
]

In the following cells, we import the tokenizers for `Qwen2.5-7B-Instruct` and `Mistral-7B-Instruct` LLMs, and afterward we apply the chat templates for these models. Notice in the formatted text that Qwen2.5 uses the instruction message start tag `<|im_start|>` and instruction message end tag `<|im_end|>` to separate the messages, followed by `system/user/assistant` to indicate the roles.

In [26]:
# Load the Qwen tokenizer
tokenizer_1 = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Apply chat template for Qwen
formatted_text = tokenizer_1.apply_chat_template(messages, tokenize=False)

# Print the formatted text
print(formatted_text)

<|im_start|>system
You are a helpful and honest assistant.<|im_end|>
<|im_start|>user
What is the capital city of U.S.<|im_end|>
<|im_start|>assistant
The capital of the United States is Washington, D.C.<|im_end|>
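
To see what the template does with the messages, the layout printed above can be reproduced with a few lines of plain Python. This is a manual re-implementation for illustration only: the function `to_chatml` is hypothetical, whereas the real template is stored with the tokenizer and may handle additional cases (for example, a default system message).

```python
def to_chatml(messages):
    """Format role/content messages in the <|im_start|> ... <|im_end|> layout shown above."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

messages = [
    {"role": "system", "content": "You are a helpful and honest assistant."},
    {"role": "user", "content": "What is the capital city of U.S."},
    {"role": "assistant", "content": "The capital of the United States is Washington, D.C."},
]

formatted = to_chatml(messages)
print(formatted)
```

Each message becomes one block that starts with the role name and ends with the end tag, which is how the model can tell who said what.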



The format for Mistral is similar to the LlaMA 2 format, and uses `<s>` and `</s>` for sequence start and end, and `[INST]` and `[/INST]` for the user's instruction start and end. The text after `[/INST]` until `</s>` is the assistant's response.

In [27]:
# Load the Mistral tokenizer
tokenizer_2 = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Apply chat template for Mistral
formatted_text = tokenizer_2.apply_chat_template(messages, tokenize=False)

# Print the formatted text
print(formatted_text)

<s> [INST] You are a helpful and honest assistant.

What is the capital city of U.S. [/INST] The capital of the United States is Washington, D.C.</s>


It is important to always use the chat template associated with the specific LLM you are working with to ensure proper formatting and optimal performance.

### Generate Response using Chat Template

The next cell presents an example of prompting the Mistral 7B model to generate a new response. The tokenizer and model for Mistral 7B are first loaded. In the `apply_chat_template` method we set `tokenize=True` to produce tokenized messages, which are afterward used for model inference. Note that in the above examples we set `tokenize=False`, which formatted the messages but did not tokenize them. Also, in this case the `messages` list does not include the assistant role, as the LLM will generate the response.

In [32]:
# Load the Mistral tokenizer and model
tokenizer_2 = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model_2 = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", device_map="auto", dtype=torch.bfloat16)

# Prompt text
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate",},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]

# Apply chat template for Mistral
tokenized_chat = tokenizer_2.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model_2.device)

# Print the tokenized text
print(tokenizer_2.decode(tokenized_chat[0]))

<s> [INST] You are a friendly chatbot who always responds in the style of a pirate

How many helicopters can a human eat in one sitting? [/INST]


Pass the tokenized chat to `generate()` to generate a response.

In [36]:
# Generate a response by the model
outputs = model_2.generate(tokenized_chat, max_new_tokens=128, pad_token_id=tokenizer_2.eos_token_id)

# Print the response
print(tokenizer_2.decode(outputs[0]))

<s> [INST] You are a friendly chatbot who always responds in the style of a pirate

How many helicopters can a human eat in one sitting? [/INST] Ahoy there, matey! A human can't eat a helicopter in one sitting, no matter how much they might want to. They're just too big and not made for consumption. But a hearty stew of fish and vegetables might hit the spot, me hearties!</s>


The `apply_chat_template()` method works with any model on Hugging Face that has a chat template defined in its tokenizer configuration, which includes LlaMA, Mistral, Zephyr, Phi, Qwen, and other models. Most modern conversational models include chat templates by default, which can be checked by looking for a `chat_template` field in the tokenizer's `tokenizer_config.json` file. If a model doesn't have a built-in chat template, we can either prepare a custom template or manually format the text sequences according to the model's documentation.
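
The check described above can be sketched with the standard library alone. The helper `has_chat_template` is hypothetical, and the configuration file written below is a stand-in with made-up content, used only so that the example runs without downloading a model.

```python
import json
import os
import tempfile

def has_chat_template(config_path):
    """Return True if the JSON config at config_path has a non-empty chat_template field."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return bool(config.get("chat_template"))

# Stand-in config file (hypothetical content) used only to demonstrate the check
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"chat_template": "{% for message in messages %}...{% endfor %}"}, f)
    found = has_chat_template(path)

print(f"Chat template present: {found}")
```

With a tokenizer that is already loaded, inspecting its `chat_template` attribute gives the same information. Depending on the library version, the template may also be stored in a separate file next to the tokenizer configuration, so a missing field in `tokenizer_config.json` is not conclusive on its own.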

### Dataset Preparation with Chat Template

The next cell shows how to apply a chat template to prepare a dataset for model training. The dataset consists of two simple question-answer conversations, each stored as a list of dictionaries with "role" and "content" fields. The `format_chat` function takes each example from the dataset, applies the tokenizer's chat template, and returns a dictionary containing the formatted text under the key "formatted_chat". By using `dataset.map(format_chat)`, the formatting function is applied to every conversation in the dataset. The `tokenize=False` parameter means the output remains as text rather than token IDs, and `add_generation_prompt=False` indicates we are formatting complete conversations rather than prompts that expect a response.



In [44]:
from datasets import Dataset

# Prepare a dataset with 2 chats
chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

# Create a simple dataset
dataset = Dataset.from_dict({"chat": [chat1, chat2]})

# Define a formatting function
def format_chat(example):
    return {"formatted_chat": tokenizer_2.apply_chat_template(example["chat"], tokenize=False, add_generation_prompt=False)}

# Apply the chat template to the dataset
dataset = dataset.map(format_chat)

# Print the formatted dataset
for chat in dataset['formatted_chat']:
    print(chat)

<s> [INST] Which is bigger, the moon or the sun? [/INST] The sun.</s>
<s> [INST] Which is bigger, a virus or a bacterium? [/INST] A bacterium.</s>


## 21.6 Prompt Engineering <a name='21.6-prompt-engineering'></a>

**Prompt engineering** is a technique for improving the performance of LLMs by providing detailed context and information about a specific task. It involves creating text prompts that provide additional information or guidance to the model, such as the topic of the generated response. With prompt engineering, the model can better understand the kind of expected output and produce more accurate and relevant results.

The following tips for creating effective prompts as part of prompt engineering can improve the performance of LLMs:

- Use clear and concise prompts: The prompt should be easy to understand and provide enough information for the model to generate relevant output. Avoid using jargon or technical terms.
- Use specific examples: Providing specific examples can help the model better understand the expected output. For example, if you want the model to generate a story about a particular topic, include a few sentences about the setting, characters, and plot.
- Vary the prompts: Use prompts with different styles, tones, and formats to obtain more diverse outputs from the model.
- Test and refine: Test the prompts on the model and refine them by adding more detail or adjusting the tone and style.
- Use feedback: Use feedback from users or other sources to identify areas where the model needs more guidance and make adjustments accordingly.

The *chain-of-thought technique* involves guiding the LLM through a task as a series of intermediate steps, which helps the model generate a more coherent and relevant response. This technique is useful for obtaining well-reasoned responses from LLMs.

An example of a chain-of-thought prompt is as follows: "You are a virtual tour guide from 1901. You have tourists visiting the Eiffel Tower. Describe the Eiffel Tower to your audience. Begin with (1) why it was built, (2) how long it took to build, (3) where the materials used to build it were sourced, (4) the number of people it took to build it, and (5) the number of people visiting the Eiffel Tower annually in the 1900s, the amount of time it takes to complete a full tour, and why so many people visit it each year. Make your tour funny by including one or two funny jokes at the end of the tour."
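
Step-structured prompts of this kind can be assembled programmatically, which makes it easier to test and refine individual steps. The sketch below uses a hypothetical helper, `build_stepwise_prompt`, and a shortened version of the tour guide example.

```python
def build_stepwise_prompt(role, task, steps, closing=None):
    """Assemble a prompt that asks the model to address numbered steps in order."""
    numbered = ", ".join(f"({i}) {step}" for i, step in enumerate(steps, start=1))
    prompt = f"{role} {task} Begin with {numbered}."
    if closing:
        prompt += f" {closing}"
    return prompt

prompt = build_stepwise_prompt(
    role="You are a virtual tour guide from 1901.",
    task="Describe the Eiffel Tower to your audience.",
    steps=[
        "why it was built",
        "how long it took to build",
        "where the materials were sourced",
    ],
    closing="Make your tour funny by including one or two jokes at the end.",
)
print(prompt)
```

Keeping the steps in a list makes it straightforward to add, remove, or reorder them while refining the prompt.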

## 21.7 Foundation Models <a name='21.7-foundation-models'></a>

**Foundation Models** are extremely large NN models trained on tremendous amounts of data with substantial computational resources, resulting in high capabilities for transfer learning to a wide range of downstream tasks. In other words, these models are scaled along each of the three factors: number of model parameters, size of the training dataset, and amount of computation. And, they are typically trained using self-supervised learning on unlabeled data. The scale of Foundation Models leads to new emergent capabilities, such as the ability to perform well on tasks that the models were not explicitly trained to do. This allows few-shot learning, which refers to finetuning Foundation Models to new downstream tasks by using only a few training data instances for the new task. Similarly, zero-shot learning extends this concept even further, and refers to a model's ability to generalize to new tasks for which the model hasn't seen any examples during the training.

LLMs represent early examples of Foundation Models, because LLMs are trained at scale and can be adapted for various NLP tasks, even for tasks they were not trained to perform.

The term Foundation Models is more general than LLMs, and generally refers to large models that are trained on multimodal data, where the inputs can include text, images, audio, video, and other data sources.

The importance of Foundation Models is in their potential to replace task-specific ML models that are specialized in solving one task (i.e., optimized to perform well on one dataset) with general models that have the capabilities to solve multiple tasks. I.e., these models can serve as a foundation that is adaptable to a broad range of applications.

<img src="images/foundation_model.jpg" width="600">

*Figure: Foundation model.* Source: [link](https://blogs.nvidia.com/blog/2023/03/13/what-are-foundation-models/).

## 21.8 Limitations and Ethical Considerations of LLMs <a name='21.8-limitations-and-ethical-considerations-of-llms'></a>

Although LLMs have demonstrated impressive performance across a wide range of tasks, there are several limitations and ethical considerations that raise concerns.

Limitations:

- *Computational resources*: Training LLMs requires significant computational resources, making it difficult for researchers with limited access to GPUs or specialized hardware to develop and use these models.
- *Data bias*: LLMs are trained on vast amounts of data from the internet, which often contain biases. As a result, the models may unintentionally learn and reproduce these biases in their generated responses.
- *Producing hallucinations*: LLMs can produce hallucinations, which are responses that are false, inaccurate, unexpected, or contextually inappropriate. One example is when ChatGPT is asked to list academic papers by an author and it provides papers that don't exist.
- *Inability to explain*: LLMs are inherently black-box models, making it challenging to explain their reasoning or decision-making processes, which is essential in certain applications like healthcare, finance, and legal domains.


Ethical considerations:

- *Privacy concerns*: LLMs memorize information from their training data, and can potentially reveal sensitive information or violate user privacy.
- *Misinformation and manipulation*: Text generated by LLMs can be exploited to create disinformation, fake news, or deepfake content that manipulates public opinion and undermines trust.
- *Accessibility and fairness*: The computational resources and expertise required to train LLMs may lead to an unequal distribution of benefits, where only a few organizations have the resources to develop and control these powerful models.
- *Environmental impact*: The large-scale training of LLMs consumes a significant amount of energy contributing to carbon emissions, which raises concerns about the environmental sustainability of these models.

In conclusion, it is important to encourage transparency, collaboration, and responsible AI practices to ensure that LLMs benefit all members of society without causing harm.

## Appendix: Unsloth Library for LLM Training and Inference <a name='appendix:-unsloth-library-for-llm-training-and-inference'></a>

**Unsloth** is another library for training and inference of LLMs, offering tools to facilitate optimization of LLMs ([link](https://unsloth.ai/)). The library applies various optimization techniques to reduce the training and inference time in comparison to the Hugging Face library and other related libraries. As you will notice in the following code, the Unsloth tools use pre-built components from Hugging Face (such as `transformers`, `trl`) and adapt them to optimize various workflows for model training and inference.

The following code [10] provides an example of finetuning the LlaMA-3.1 8B model using a single T4 GPU. For this example, the training time was similar to training LlaMA 2 7B with the Hugging Face library above, as in both cases training for 1 epoch took about 15 minutes. On the other hand, while the largest batch size (in multiples of 2) with Hugging Face was 8 samples, Unsloth allowed a batch size of 16, meaning that Unsloth optimized the memory usage. Training LLMs with larger batch sizes is related to reduced training variance and more stable gradient updates, which typically result in improved performance. In addition, the inference with Unsloth was faster.

In [1]:
%%capture
# Note: to install unsloth in this notebook, I had to interrupt the currently running kernel and start a new kernel
!pip install -q unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Supports rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [4]:
from datasets import load_dataset

# Load the Lamini dataset
dataset = load_dataset("mwitiderrick/llamini_llama", split="train")

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="none"
    ),
)
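The arguments `warmup_steps = 5`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"` together describe a learning rate that ramps up linearly for five steps and then decays linearly to zero. The function below is a plain-Python sketch of that shape, not the library's own code; the name `linear_schedule_with_warmup` is hypothetical, and `total_steps = 40` is taken from the training banner for this run.

```python
def linear_schedule_with_warmup(step, base_lr=2e-4, warmup_steps=5, total_steps=40):
    """Learning rate at optimizer step `step` (0-indexed) for a linear
    warmup followed by a linear decay to zero. Illustrative sketch only."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Print the learning rate at a few representative steps
for s in (0, 1, 5, 20, 40):
    print(f"step {s:2d}: lr = {linear_schedule_with_warmup(s):.2e}")
```

With only 40 optimizer steps in total, the warmup covers one eighth of the run, so the peak learning rate of `2e-4` is in effect only at step 5.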


In [6]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,260 | Num Epochs = 1 | Total steps = 40
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 2 x 1) = 32
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 5    | 2.8407        |
| 10   | 1.6797        |
| 15   | 0.7792        |
| 20   | 0.6940        |
| 25   | 0.6705        |
| 30   | 0.6702        |
| 35   | 0.6146        |
| 40   | 0.6319        |


TrainOutput(global_step=40, training_loss=1.072603166103363, metrics={'train_runtime': 830.2567, 'train_samples_per_second': 1.518, 'train_steps_per_second': 0.048, 'total_flos': 1.5910729135030272e+16, 'train_loss': 1.072603166103363, 'epoch': 1.0})
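The numbers in the training banner and in `TrainOutput` can be reproduced from the trainer settings. The sketch below is a sanity check under two assumptions: that the final partial batch is rounded up to a full optimizer step, and that each logged loss is the average over the preceding five steps (`logging_steps = 5`).

```python
import math

# Values copied from the trainer arguments and the training output above
num_examples = 1260
per_device_batch = 16
grad_accum = 2
num_gpus = 1
train_runtime_s = 830.2567

# Effective batch size and number of optimizer steps in one epoch
effective_batch = per_device_batch * grad_accum * num_gpus
total_steps = math.ceil(num_examples / effective_batch)  # assumes rounding up

# Throughput figures reported in the metrics dictionary
samples_per_s = num_examples / train_runtime_s
steps_per_s = total_steps / train_runtime_s

# Average of the logged losses (one value per 5 steps)
logged_losses = [2.8407, 1.6797, 0.7792, 0.694, 0.6705, 0.6702, 0.6146, 0.6319]
mean_loss = sum(logged_losses) / len(logged_losses)

print(f"effective batch size: {effective_batch}")
print(f"total steps:          {total_steps}")
print(f"samples per second:   {samples_per_s:.3f}")
print(f"steps per second:     {steps_per_s:.3f}")
print(f"mean logged loss:     {mean_loss:.4f}")
```

The reported `training_loss` of about 1.07 is dominated by the first ten steps; from step 15 onward the loss stays between roughly 0.61 and 0.78.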

In [7]:
# Perform inference with the finetuned model
FastLanguageModel.for_inference(model)  # Enable Unsloth's inference mode
prompt = "What are Lamini models?"
# The prompt has no {} placeholders, so it is tokenized as-is (no .format() call needed)
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)
print("\n".join(decoded_output))

<|begin_of_text|>What are Lamini models? Lamini models are pre-trained language models that have been trained on large datasets to generate human-like text. These models are designed to be fine-tuned for specific tasks, such as language translation, text summarization, or chatbot responses.
Lamini models are trained using a technique called masked language modeling, where a portion of the input text is randomly replaced with a [MASK] token. The model is then trained to predict the original text instead of the [MASK] token. This technique helps the model learn the context and relationships between words in a sentence.
Lamini models can be fine-tuned for specific tasks by adding a task-specific layer on top of the pre-trained model. This layer is trained to perform the specific task, such as language translation or text classification.
Lamini models are available in various sizes, including small, medium, and large. The size of the model determines the amount of training data and computa

In [8]:
# Perform inference with a second prompt
prompt = "Write a poem about Data Science"
# The prompt has no {} placeholders, so it is tokenized as-is (no .format() call needed)
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=500, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)
print("\n".join(decoded_output))

<|begin_of_text|>Write a poem about Data Science
Data science, a field so grand,
Where numbers and code, hand in hand,
Do dance and weave, a tale so fine,
To uncover truth, and make it shine.
With algorithms and models, we play,
To find patterns, and seize the day,
In datasets vast, we search and roam,
To extract insights, and make them home.
From machine learning, to deep learning too,
We wield the tools, to make data new,
To classify, predict, and recommend with ease,
And make informed decisions, with expertise.
With data visualization, we tell a story,
Of trends and insights, that make us soar,
In data science, we find our way,
To navigate the world, day by day.
So let us celebrate, this field so bright,
Where data and code, shine with delight,
For in data science, we find our guide,
To make the world, a better place to reside. #datascience #poetry #inspiration
I hope you enjoy this poem about Data Science! Let me know if you have any feedback or suggestions.
Here are some possible 

## References <a name='references'></a>

1. Introduction to Large Language Models, by Bernhard Mayrhofer, available at [https://github.com/datainsightat/introduction_llm](https://github.com/datainsightat/introduction_llm).
2. Understanding Encoder and Decoder LLMs, by Sebastian Raschka, available at [https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder).
3. LLM Training: RLHF and Its Alternatives, by Sebastian Raschka, available at [https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives](https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives).
4. Training Language Models to Follow Instructions with Human Feedback, by Long Ouyang et al., available at [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155).
5. Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA), by Sebastian Raschka, available at [https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html](https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html).
6. How to Fine-tune Llama 2 With LoRA, by Derrick Mwiti, available at [https://www.mldive.com/p/how-to-fine-tune-llama-2-with-lora](https://www.mldive.com/p/how-to-fine-tune-llama-2-with-lora).
7. Fine-Tuning Llama 2.0 with Single GPU Magic, by Chee Kean, available at [https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436](https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436).
8. Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks, by Mathieu Busquet, available at [https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/](https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/).
9. Getting started with Llama, by Meta AI, available at [https://ai.meta.com/llama/get-started/](https://ai.meta.com/llama/get-started/).
10. Llama-3.1 8b + Unsloth 2x faster finetuning, by Unsloth AI, available at [https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing).
11. Hugging Face: Chat Templates, available at [https://huggingface.co/learn/llm-course/en/chapter11/2](https://huggingface.co/learn/llm-course/en/chapter11/2).

[BACK TO TOP](#top)