# | NLP | LLM | Fine-tuning 2024 | Llama 2 QLoRA |

## Natural Language Processing (NLP) and Large Language Models (LLMs): Fine-Tuning Llama 2 with QLoRA in 2024

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)

# <b>1 <span style='color:#78D118'>|</span> Overview</b>

In this notebook we're going to fine-tune an LLM:

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_2.png?raw=true" alt="Learning" width="50%">

Many LLMs are general-purpose models trained on a broad range of data and use cases. This enables them to perform well in a variety of applications, as shown in previous modules. It is not uncommon, though, for a general-purpose model to perform unacceptably on a specific dataset or use case. This often does not mean that the model is unusable: with some new data and additional training, it can be improved, or fine-tuned, until it produces acceptable results for that use case.

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_1.png?raw=true" alt="Learning" width="50%">

Fine-tuning uses a pre-trained model as a base and continues to train it on a new, task-targeted dataset. Conceptually, fine-tuning builds on what the model has already learned and focuses that knowledge on a specific task.

It is important to recognize that fine-tuning is still model training: the process remains resource-intensive and time-consuming, although training time is greatly shortened by starting from a pre-trained model.

<img src="https://github.com/YanSte/NLP-LLM-Fine-tuning-Trainer/blob/main/img_3.png?raw=true" alt="Learning" width="50%">

**Here are some definitions:**
<br/>

<br/>
<details>
  <summary style="list-style: none;"><b>▶️ Llama 2 Model?</b></summary>
  <br/>
  <img src="https://images.idgesg.net/images/article/2023/08/shutterstock_1871547451-100945157-large.jpg?auto=webp&quality=85,70" alt="Learning" width="50%">
  <br/>
  Llama 2 is a family of pre-trained and fine-tuned large language models (LLMs) developed by Meta, the parent company of Facebook. The models range from 7B to 70B parameters; the fine-tuned chat variants are optimized for assistant-like dialogue, and the family performs well on natural language generation tasks, including programming. Llama 2 is the successor to the original Llama model and uses the transformer architecture with various enhancements.
   
  <br/>
    
  <br/>
    
</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Fine-Tuning with LoRA?</b></summary>
  <br/>
  LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects trainable rank-decomposition matrices into the layers of the Transformer architecture. This greatly reduces the number of trainable parameters, which speeds up fine-tuning, lowers memory requirements, and can help limit overfitting.
  <br/>
  <img src="https://deci.ai/wp-content/uploads/2023/11/lora-animated.gif" alt="Learning" width="40%">
  <br/>
</details>
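
To make the parameter savings of LoRA concrete, here is a minimal sketch in plain Python. It assumes a single weight matrix of shape `d_out x d_in` and a LoRA rank `r`; the function name `lora_param_counts` and the example sizes (a 4096 x 4096 projection with rank 8) are illustrative assumptions, not values taken from this notebook.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in matrix W.
    LoRA freezes W and trains B (d_out x r) and A (r x d_in) instead.
    """
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For these assumed sizes, the adapter trains well under 1% of the parameters that full fine-tuning of the same layer would update.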

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ RLHF?</b></summary>
   <br/>
    
RLHF stands for Reinforcement Learning from Human Feedback.
- It trains a "reward model" directly from human feedback and then uses that model as the reward function to optimize an agent's policy with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO).
- This approach helps AI systems capture and adapt to complex human preferences, which can improve both performance and safety.
    
  <br/>
  <img src="https://api.wandb.ai/files/ayush-thakur/images/projects/37250193/29fb34df.png" alt="Learning" width="60%">
  <img src="https://www.labellerr.com/blog/content/images/size/w2000/2023/06/bannerRELF.webp" alt="Learning" width="60%">
  <br/>
    
  <br/>
</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ PPO?</b></summary>
   <br/>
 
- PPO is a policy gradient method, which means that it directly optimizes the policy function.
- Policy gradient methods are often better suited than value-based methods, such as Q-learning, to large or continuous action spaces, but they can be harder to train stably.

Proximal Policy Optimization (PPO) is like a careful guide for learning to do something better. Imagine you're teaching a robot to play a game. PPO helps the robot improve little by little, limiting how far each update can move the policy away from what it has already learned. This keeps learning stable while the robot becomes more skillful. It's a bit like learning to play football by improving gradually without forgetting everything you already know.

  <br/>
  <img src="https://miro.medium.com/v2/resize:fit:655/1*jDUO1swpIVqFc4BF3cj1Jg.jpeg" alt="Learning" width="60%">  
  <br/>
</details>
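
The "improve little by little" idea in PPO is usually implemented with a clipped objective. The sketch below shows that clipping for a single sample; the function name `ppo_clipped_term` is hypothetical, and the clip range `epsilon=0.2` is a commonly used value assumed here, not a setting from this notebook.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample.

    ratio     : pi_new(a|s) / pi_old(a|s), how much the policy changed
    advantage : estimated advantage of the action taken
    epsilon   : clip range (0.2 is an assumed, commonly used value)
    """
    # Keep the ratio inside [1 - epsilon, 1 + epsilon]
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to move far from the old policy
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clipped_term(ratio=1.5, advantage=1.0))   # gain is capped by the clip
print(ppo_clipped_term(ratio=1.1, advantage=2.0))   # inside the clip range, unchanged
```

When the new policy tries to increase the probability of a good action by more than the clip range allows, the objective stops rewarding the extra change, which is what keeps updates small.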

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ PEFT?</b></summary>
   <br/>
    
Parameter-Efficient Fine-Tuning (PEFT) addresses the limits of consumer hardware and the storage cost of full fine-tuning. It freezes the weights of the original pre-trained LLM and fine-tunes only a small subset of parameters (or a small number of additional ones), which significantly reduces computational and storage costs.

  <br/>
  <img src="https://api.wandb.ai/files/capecape/images/projects/38233410/2b6af233.png" alt="Learning" width="60%">  
  <br/>
</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Model Quantization?</b></summary>

  Quantization is a technique used to reduce the size of large neural networks, including large language models (LLMs) by modifying the precision of their weights. Instead of using high-precision data types, such as 32-bit floating-point numbers, quantization represents values using lower-precision data types, such as 8-bit integers. This process significantly reduces memory usage and can speed up model execution while maintaining acceptable accuracy.
    
  The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type. The most common lower precision data types are:

  - float16, accumulation data type float16
  - bfloat16, accumulation data type float32. It’s similar to the standard 32-bit floating-point format but uses fewer bits, so it takes up less space in computer memory.
  - int16, accumulation data type int32
  - int8, accumulation data type int32
</details>
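
As an illustration of the basic idea, the following sketch quantizes a handful of float weights to 8-bit integers with a single shared scale (absolute-maximum scaling) and then restores them. It is a simplified example in plain Python; the function names and the sample weights are assumptions made for this illustration, and real libraries work on tensors and usually on blocks of weights.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]   # made-up example weights
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs one byte instead of four, and the rounding error per weight is bounded by half the scale.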

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ 4-bit NormalFloat (NF4)?</b></summary>
  <br/>

NF4, a 4-bit Normal Float, is tailor-made for AI tasks, specifically optimizing neural network weight quantization. This datatype is ideal for reducing memory usage in models, crucial for deploying on less powerful hardware. 
    
NF4 is information-theoretically optimal for normally distributed data, like neural network weights, providing more accurate representation within the 4-bit constraint.

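Floating-point storage involves a sign, an exponent, and a fraction (mantissa). How a number is converted to binary depends on the datatype, which determines its precision and range. For example, FP32, commonly used in deep learning, can represent magnitudes from about 1.18×10⁻³⁸ to 3.4×10³⁸. NF4, by contrast, has only 16 representable levels, normalized to [-1, 1] and rescaled per block of weights by an absolute-maximum constant; a plain 4-bit signed integer would instead cover [-8, 7].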

QLoRA stores the base weights in NF4 and dequantizes them to brainfloat16 (bfloat16), a format developed by Google for high-throughput machine learning workloads, for the actual computation.

  <br/>
</details>
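
The sketch below illustrates the idea behind a "normal float" data type: place the 16 available levels at quantiles of a standard normal distribution, normalize them to [-1, 1], and map each weight of a block to the nearest level after dividing by the block's absolute maximum. This is a simplified illustration under stated assumptions, not the exact NF4 codebook used by bitsandbytes; the function names and sample block are hypothetical.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    """Build 2**bits levels from normal quantiles, normalized to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()
    # Evenly spaced probabilities strictly inside (0, 1)
    quantiles = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(block, levels):
    """Return (codes, absmax): one level index per weight plus the block scale."""
    absmax = max(abs(x) for x in block) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - x / absmax))
        for x in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Map level indices back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = quantile_levels()
block = [0.5, -1.0, 0.1, 2.0]   # made-up block of weights
codes, absmax = quantize_block(block, levels)
print(codes, absmax, dequantize_block(codes, absmax, levels))
```

Because the levels are denser near zero, where normally distributed weights concentrate, small weights are represented more finely than they would be on an evenly spaced 4-bit grid.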

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Bitsandbytes?</b></summary>

  Bitsandbytes makes model quantization more accessible. It is a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

  <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*O4RAzlQkbrcCPiPPD9JIYw.jpeg" alt="Learning" width="50%">

  In Transformers, it is configured through ```BitsAndBytesConfig```.

</details>
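
As a hedged configuration sketch, this is how a 4-bit setup is commonly expressed with `BitsAndBytesConfig`. The specific values (NF4, double quantization, float16 compute dtype) are settings often paired with QLoRA and are assumptions for illustration, not necessarily the ones used later in this notebook.

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the base weights in 4 bits
    bnb_4bit_quant_type="nf4",             # use the 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # dtype for matmuls; torch.bfloat16 on Ampere or newer
)
```

The resulting object is typically passed as `quantization_config=bnb_config` to `AutoModelForCausalLM.from_pretrained`.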

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ 4-bit NormalFloat Quantization?</b></summary>

  The aim of 4-bit NormalFloat quantization is to reduce the memory used by the model parameters by storing them in a lower-precision type than full (float32) or half (float16/bfloat16) precision. In other words, 4-bit quantization compresses models with billions of parameters so that they require less memory. With bitsandbytes, it is switched on by the following flag of ```BitsAndBytesConfig```:

  ```python
  load_in_4bit=True
  ```
</details>
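
A quick back-of-the-envelope calculation shows why this matters. The sketch below estimates the memory needed for the weights alone at different precisions; the 7-billion-parameter size is an assumed example, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7_000_000_000  # assumed size: a 7B-parameter model
for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from 28 GB in float32 to 14 GB in bfloat16 and 3.5 GB in 4-bit, which is what makes a 16 GB GPU a realistic target.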

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Bfloat16 vs Float16/32?</b></summary>

**Understanding Float16/32 and Bfloat16:**
- **Float16 (Half-Precision):** This format uses 16 bits to represent a floating-point number, with 1 bit for the sign, 5 bits for the exponent, and 10 bits for the mantissa. It is widely used for its reduced memory footprint but can be sensitive to numerical issues due to its limited dynamic range.
- **Float32 (Single-Precision):** With 32 bits, float32 includes 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. It provides a larger dynamic range and higher precision than float16 but consumes more memory.
- **Bfloat16:** Specifically designed for deep learning, bfloat16 uses 16 bits, allocating 1 bit for the sign, 8 bits for the exponent, and 7 bits for the mantissa. Its dynamic range matches that of float32, so it keeps float32's range at lower precision while halving memory usage during training.
    
<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*VskW_nvXmtV3VAsq1YuqCQ.png" alt="Learning" width="50%">
    
**Bfloat16:**
- 16-bit floating point format tailored for deep learning applications.
- Comprises one sign bit, eight exponent bits, and seven mantissa bits.
- Has a larger dynamic range (more exponent bits) than float16, matching that of float32.
- Converts cheaply to and from float32, since it corresponds to a float32 with the low 16 mantissa bits dropped.
- Halves memory usage relative to float32, which improves training efficiency.
- Less susceptible than float16 to underflows, overflows, or numerical instability during training.

**Float16/32:**
- Float16: IEEE half-precision, Float32: IEEE single-precision.
- Float16 has a smaller dynamic range than bfloat16; float32 has the same range as bfloat16 but higher precision.
- Float16 may encounter numerical issues such as overflows or underflows during training.
- Loss scaling might be necessary in float16 mixed precision settings to mitigate these side effects.
- Float32 requires twice as much memory for storage as bfloat16.
- Training behavior may be less stable, particularly with a pure float16 dtype.

**GPU Dependency:**
- The effectiveness of bfloat16 and float16/32 is contingent on the GPU architecture.
- Some GPUs (NVIDIA Ampere or newer) offer native support for bfloat16, optimizing its performance benefits.
- It is crucial to check GPU specifications and compatibility before choosing between bfloat16 and float16/32.
- Users may need to tailor configurations based on GPU capabilities for optimal results.

**Conclusion:**
While bfloat16 presents advantages in training efficiency and memory usage, its performance is influenced by GPU architecture. Understanding the characteristics of float16, float32, and bfloat16 is crucial for selecting the optimal format based on both task requirements and GPU capabilities.

**How to Enable Bfloat16:**
The exact switches depend on the framework. In some configuration-file-based frameworks, enabling bfloat16 in mixed precision mode means setting ```model.use_bfloat16``` to True, ```optimizer.loss_scaling_factor``` to 1.0, and ```model.mixed_precision``` to True; with the 🤗 Trainer, it is enabled with `bf16=True` in the training arguments. Notably, bfloat16 eliminates the need for loss scaling, which was originally introduced for mixed precision training with float16.
    
</details>
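
The range difference between float16 and bfloat16 can be checked with the standard library alone. The sketch below emulates bfloat16 by truncating a float32 to its top 16 bits; this truncation is a simplifying assumption (hardware normally rounds to nearest), and the function names are hypothetical.

```python
import struct


def to_bfloat16(x):
    """Round-trip x through bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_float16(x):
    """True if x can be packed as an IEEE half-precision value without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


print(fits_in_float16(65504.0))  # largest finite float16 value
print(fits_in_float16(70000.0))  # beyond the float16 range
print(to_bfloat16(70000.0))      # in range for bfloat16, but coarsely rounded
```

The value 70000 overflows float16, while bfloat16 can hold it at the cost of precision: with only 7 mantissa bits, the truncated result lands several hundred below the input.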

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ fp16 vs bf16 vs tf32?</b></summary>

Mixed precision training optimizes computational efficiency by using lower-precision numerical formats for specific variables. Traditionally, models use 32-bit floating point precision (fp32), but not all variables require this high precision. Mixed precision involves using 16-bit floating point (fp16) or other data types like brainfloat16 (bf16) and tf32 (CUDA internal data type) for certain computations.

- **fp16 (float16):**
  - Advantages: Faster computations, with the main savings coming from storing activations in half precision.
  - Gradients are computed in half precision but converted back to full precision for the optimization step.
  - Memory usage may increase, because the model is present on the GPU in both 16-bit and 32-bit precision.
  - Enables mixed precision training, improving efficiency.
  - Enable with `fp16=True` in the training arguments.

- **bf16 (brainfloat16):**
  - Advantages: Larger dynamic range than fp16, at the cost of lower precision, which makes it well suited to mixed precision training.
  - Available on Ampere or newer hardware.
  - Can be used for both mixed precision training and evaluation.
  - Enable with `bf16=True` in the training arguments.

- **tf32 (CUDA internal data type):**
  - Advantages: Up to 3x throughput improvement in training and/or inference.
  - Available on Ampere or newer hardware.
  - Same numerical range as fp32 but with reduced precision (a 10-bit mantissa instead of 23 bits).
  - Allows using normal fp32 training/inference code with enhanced throughput.
  - Enable by allowing tf32 support in your code.
    ```python
    import torch
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    ```
  - Enable this mode in the 🤗 Trainer with `tf32=True` in the training arguments.
  - Requires torch version >= 1.7 to use tf32 data types.

These approaches provide advantages in terms of computational speed and efficiency, making them valuable for mixed precision training on specific hardware architectures.
    
</details>
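
Since the available formats depend on the GPU generation, the choice can be derived from the CUDA compute capability. The helper below is a minimal sketch: the name `precision_flags` is hypothetical, and it assumes that bf16 and tf32 require compute capability 8.0 or higher (Ampere or newer), with fp16 as the fallback.

```python
def precision_flags(compute_capability):
    """Suggest mixed-precision flags from a CUDA compute capability tuple.

    Assumption: bf16 and tf32 need compute capability 8.0+ (Ampere or newer);
    older GPUs fall back to fp16.
    """
    major, _minor = compute_capability
    is_ampere_or_newer = major >= 8
    return {
        "bf16": is_ampere_or_newer,
        "fp16": not is_ampere_or_newer,
        "tf32": is_ampere_or_newer,
    }


print(precision_flags((6, 0)))  # e.g. a Pascal-generation GPU
print(precision_flags((8, 0)))  # e.g. an Ampere-generation GPU
```

In a notebook, the tuple would come from `torch.cuda.get_device_capability()`, and the resulting flags map onto the `bf16`, `fp16`, and `tf32` training arguments described above.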

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ QLoRA?</b></summary>
   <br/>
  QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA (Low-Rank Adaptation) that adds quantization to make fine-tuning more memory-efficient. QLoRA needs less memory than LoRA because it loads the pretrained model into GPU memory as 4-bit weights, whereas plain LoRA typically keeps the base weights in 16-bit (or 8-bit) precision. The technique reduces the memory footprint of large language models during fine-tuning while largely preserving performance.

It is a PEFT method.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*oV_KwvWnFYzuWzlz.png" alt="Learning" width="40%">
  <br/>
</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ What are we training with QLoRA?</b></summary> 
  <br/>
  QLoRA fine-tunes language models by quantizing pre-trained weights to 4-bit representations, keeping them fixed. It trains a small number of low-rank matrices during fine-tuning, efficiently updating knowledge without extensive resource usage. This approach enhances memory efficiency, allowing effective fine-tuning by adjusting a subset of the model's existing knowledge for specific tasks.



  - Training produces adapter weights, which are saved with **trainer.save_model()**.
  - Afterwards, you merge the adapter weights into the base LLM.
  <br/>
</details>
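
Merging the adapter into the base model amounts to adding the low-rank product to the frozen weights. The toy sketch below uses plain Python lists and the standard LoRA formulation `W + (alpha / r) * B @ A`; the matrix sizes, values, and helper names are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy sizes: d_out = 2, d_in = 3, rank r = 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen base weights (d_out x d_in)
B = [[1.0], [2.0]]                        # trained adapter matrix (d_out x r)
A = [[0.5, 0.0, -1.0]]                    # trained adapter matrix (r x d_in)
x = [[1.0], [2.0], [3.0]]                 # one input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Base output plus scaled adapter output, computed without merging
adapter_out = [[wx[0] + 2 * bax[0]]
               for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
assert matmul(merged, x) == adapter_out
```

After merging, the model needs no extra adapter computation at inference time. With the PEFT library this step is typically done with `merge_and_unload()` on the loaded adapter model.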

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ What is Supervised fine-tuning?</b></summary>

  Supervised fine-tuning (SFT) is a key step in reinforcement learning from human feedback (RLHF). The TRL library from Hugging Face provides an easy-to-use API to create SFT models and train them on your dataset with just a few lines of code. It comes with tools to train language models using reinforcement learning, starting with supervised fine-tuning, then reward modeling, and finally proximal policy optimization (PPO).

  We will provide the SFT Trainer with the model, dataset, LoRA configuration, tokenizer, and training parameters.

  ```SFTTrainer```
</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Trainer vs SFTTrainer</b></summary> 
  <br/>
  <img src="https://miro.medium.com/v2/resize:fit:1172/format:webp/1*DFKNcH40QTgbDCIWWBb4yg.png" alt="Learning" width="50%">


**Trainer:**
- **General-purpose training:** Designed for training or fine-tuning models on supervised learning tasks like text classification, question answering, and summarization.
- **Highly customizable:** Offers a wide range of configuration options for fine-tuning hyperparameters, optimizers, schedulers, logging, and evaluation metrics.
- **Handles complex training workflows:** Supports features like gradient accumulation, early stopping, checkpointing, and distributed training.
- **Requires more data:** Training from scratch typically needs larger datasets to be effective.


**SFTTrainer:**
- **Supervised Fine-tuning (SFT):** Built on top of Trainer and optimized for fine-tuning pre-trained models with smaller datasets on supervised learning tasks.
- **Simpler interface:** Provides a streamlined workflow with fewer configuration options, making it easier to get started.
- **Efficient memory usage:** Uses techniques like parameter-efficient fine-tuning (PEFT) and packing optimizations to reduce memory consumption during training.
- **Faster training:** Combined with PEFT and packing, it can reach comparable accuracy with smaller datasets and shorter training times.


**Choosing between Trainer and SFTTrainer:**
- **Use Trainer:** If you have a large dataset and need extensive customization for your training loop or complex training workflows.
- **Use SFTTrainer:** If you have a pre-trained model and a relatively smaller dataset, and want a simpler and faster fine-tuning experience with efficient memory usage.

</details>

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Flash Attention?</b></summary> 
  <br/>
    
Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. It can accelerate training by up to 3x. Learn more in the FlashAttention repository (Dao-AILab/flash-attention).
    
In short: it targets the part of the model called “attention,” which helps the model focus on the most important parts of the data, just as you pay attention to the most important parts of a lecture. Flash Attention 2 makes this process faster and uses less memory.

<img src="https://github.com/Dao-AILab/flash-attention/raw/main/assets/flashattn_banner.jpg" alt="Learning" width="80%">
    
  <br/>
</details>   
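
The tiling idea can be illustrated without a GPU. The sketch below computes one attention output (a softmax-weighted sum) in two ways: all at once, and block by block while keeping only a running maximum, a running normalizer, and a running weighted sum. It shows only the blockwise softmax trick in plain Python with scalar values; it is not the FlashAttention kernel, and the function names are hypothetical.

```python
import math


def attention_row_full(scores, values):
    """Reference: softmax(scores)-weighted sum of values, all at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_row_tiled(scores, values, block_size=2):
    """Same result, but visiting the scores one block at a time.

    Only a running max, a running normalizer and a running weighted sum are
    kept, so the full row of softmax weights is never stored.
    """
    running_max = -math.inf
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    running_out = 0.0   # weighted sum of values under the same scaling
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated under the old maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.1, 2.0, -1.0, 3.5, 0.7]   # made-up attention scores for one query
values = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up scalar values
print(attention_row_full(scores, values), attention_row_tiled(scores, values))
```

Both functions agree up to floating-point rounding, which is why the blockwise version can avoid materializing the full attention matrix.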

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Flash Attention Vs Normal with pretraining_tp?</b></summary> 
  <br/>
    
**Observations:**
1. **Training Loss**:
    - The training losses across all cases are very close to each other. There is a very minor decrease in training loss when using a ```pretraining_tp``` of 1 vs. 2 but the difference is negligible.
    - The attention type (Flash vs. Normal) does not seem to have a noticeable impact on the final training loss.
2. **Training Time**:
    - **Using Flash Attention significantly reduces training time, nearly by half as compared to using Normal Attention.**
    - The ```pretraining_tp``` value does not seem to significantly impact the training time.
3. **Inference Time**:
    - Flash Attention with ``pretraining_tp`` of 2 has the fastest inference time.
    - Interestingly, Normal Attention has similar inference times for both ``pretraining_tp`` values, and they're both comparable or slightly faster than Flash Attention with ``pretraining_tp`` of 1.
  
  <br/>  
  <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*JVk8qsjY7NwsxXdWgtDI0w.png" alt="Learning" width="40%">
  <br/>
    
**Conclusion:**
- Flash Attention is significantly faster in training compared to Normal Attention, which is expected based on the stated advantages of Flash Attention.
- The pretraining_tp values, either 1 or 2, do not drastically impact the model's performance or training/inference times in this experiment. However, using pretraining_tp of 2 slightly improves inference time when using Flash Attention.
- The model’s performance, in terms of loss, is mostly consistent across all cases. Hence, other considerations like training time, inference time, and computational resources could be more important when deciding on the configurations to use.
    
  <br/>
</details>    

<br/>

<details>
  <summary style="list-style: none;"><b>▶️ Add Special Tokens for Chat Format/chatml?</b></summary> 
  <br/>
    
Adding special tokens to a language model is crucial for training chat models. These tokens are placed between the different roles in a conversation (user, assistant, and system) and help the model recognize the structure and flow of the conversation. This setup is essential for generating coherent and contextually appropriate responses in a chat environment. The `setup_chat_format()` function in trl sets up a model and tokenizer for conversational AI tasks. It:
    
- Adds special tokens to the tokenizer, e.g. `<|im_start|>` and `<|im_end|>`, to indicate the start and end of each message in a conversation.
- Resizes the model’s embedding layer to accommodate the new tokens.
- Sets the `chat_template` of the tokenizer, which is used to format the input data into a chat-like format. The default is `chatml` from OpenAI.
    
  <br/>
</details> 
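
To show what the ChatML layout looks like, here is a small stand-alone sketch that renders a list of role/content messages as text. The function `to_chatml` is a hypothetical helper written for illustration; in practice the tokenizer's chat template does this formatting.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "system", "content": "You answer questions."},
    {"role": "user", "content": "What is QLoRA?"},
]
print(to_chatml(example, add_generation_prompt=True))
```

Each message is wrapped in its own start and end token, so the model can tell where one speaker stops and the next begins.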

### Define our use case

When fine-tuning Large Language Models (LLMs), it is crucial to understand your specific use case and the task you aim to address. This knowledge guides you in selecting the most suitable pre-trained model and in curating a dataset for fine-tuning. If your use case hasn't been defined yet, revisit your initial considerations first. Note that fine-tuning is not necessary in every scenario: before starting, explore and assess models that are already fine-tuned or available through APIs.


### Author Note: 

While the model already performs remarkably well across various tasks and datasets, fine-tuning may not produce a readily apparent change in its generation capabilities.

**👉🏿 So this notebook is an example of how to fine-tune in 2024.**

For a task the model already handles, as in this case, it is preferable to use RAG to reduce hallucinations.

### RAG vs Finetuning

As the enthusiasm for Large Language Models (LLMs) continues to grow, numerous developers and organizations are actively engaged in creating applications that leverage their capabilities. Yet, when pre-trained LLMs fail to meet desired expectations, the pivotal question arises: How can we enhance the performance of the LLM application? This leads us to a critical juncture where we must deliberate on whether to employ Retrieval-Augmented Generation (RAG) or pursue model fine-tuning to optimize the outcomes.

[RAG vs Finetuning — Which Is the Best Tool to Boost Your LLM Application? The definitive guide for choosing the right method for your use case](https://towardsdatascience.com/rag-vs-finetuning-which-is-the-best-tool-to-boost-your-llm-application-94654b1eaba7)

<img src="https://miro.medium.com/v2/resize:fit:1306/1*To-PwvmU47tqyxPzhar6vg.png" alt="Learning" width="50%">

<img src="https://miro.medium.com/v2/resize:fit:1400/1*its4VqhQxCxKUjMuLpM_VQ.png" alt="Learning" width="50%">


## Learning Objectives

By the end of this notebook, you will gain expertise in the following areas:

1. Learn how to effectively prepare datasets for training.
2. Apply few-shot learning.
3. Understand the process of fine-tuning Llama 2 with QLoRA and SFTTrainer in 2024.

## Resources

- [How to Fine-Tune LLMs in 2024 with Hugging Face](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl)

- [TRL Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/sft_trainer)

- [Doc Llama 2](https://huggingface.co/docs/transformers/model_doc/llama2)

- [Instruction Tune a Base LLM Using QLoRA with DeciLM](https://deci.ai/blog/how-to-instruction-tune-a-base-llm-using-qlora-with-decilm-6b/)

- [Practical Tips for Finetuning LLMs](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)

- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)

- [Optimizing LLMs: A Step-by-Step Guide to Fine-Tuning with PEFT and Qlora](https://blog.lancedb.com/optimizing-llms-a-step-by-step-guide-to-fine-tuning-with-peft-and-qlora-22eddd13d25b)

- [Fine-Tune LLAMA2 with QLORA in Google Colab](https://colab.research.google.com/github/ashishpatel26/LLM-Finetuning/blob/main/7.FineTune_LLAMA2_with_QLORA.ipynb#scrollTo=Y3IgtdTvAvTr)

- [LLM Course Repository](https://github.com/mlabonne/llm-course?tab=readme-ov-file)

- [How to Fine-Tune an LLM - Part 2: Instruction Tuning Llama 2](https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-Tune-an-LLM-Part-2-Instruction-Tuning-Llama-2--Vmlldzo1NjY0MjE1)

- [Fine-tune Llama 2 in Google Colab](https://github.com/Abonia1/LLM-finetuning/blob/main/Llama/Fine_tune_Llama_2_in_Google_Colab.ipynb)

- [Fine-Tuning Llama 2 LLM on Google Colab: A Step-by-Step Guide](https://gathnex.medium.com/fine-tuning-llama-2-llm-on-google-colab-a-step-by-step-guide-dd79a788ac16)

- [Finetuning Llama2 Mistral](https://medium.com/@geronimo7/finetuning-llama2-mistral-945f9c200611)

- [Microsoft Docs - Chat Markup Language](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md)

- [Mastering Llama 2: A Comprehensive Guide to Fine-Tuning in Google Colab](https://medium.com/artificial-corner/mastering-llama-2-a-comprehensive-guide-to-fine-tuning-in-google-colab-bedfcc692b7f)

- [OpenAI introduced Chat Markup Language (ChatML) based input to non-chat modes](https://cobusgreyling.medium.com/openai-introduced-chat-markup-language-chatml-based-input-to-non-chat-modes-6ca4b267012f)

- [Implementing Few-Shot Learning with GPT](https://developer.dataiku.com/latest/tutorials/machine-learning/genai/nlp/gpt-few-shot-clf/index.html#implementing-few-shot-learning)

- [Prompting Guide - Few-Shot Learning](https://www.promptingguide.ai/techniques/fewshot)

- [Learn Prompting - Few-Shot Basics](https://learnprompting.org/docs/basics/few_shot)

- [The Science of Control: How Temperature, Top-p, and Top-k Shape Large Language Models](https://medium.com/@daniel.puenteviejo/the-science-of-control-how-temperature-top-p-and-top-k-shape-large-language-models-853cb0480dae)

- [Democratizing AI! Fine-Tuning LLaMA 2🦙: A Step-by-Step Instructional Guide](https://medium.com/@aditya.addy.bahl/democratizing-ai-fine-tuning-llama-2-a-step-by-step-instructional-guide-19d3dad84202)

- [Fine-tune Llama 7B on AWS Trainium](https://www.philschmid.de/fine-tune-llama-7b-trainium2)

- [Instruction fine-tuning Llama 2 with PEFT’s QLoRa method](https://medium.com/@ud.chandra/instruction-fine-tuning-llama-2-with-pefts-qlora-method-d6a801ebb19)

- [Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model](https://www.datacamp.com/tutorial/fine-tuning-llama-2)

- [Mlabonne Blog](https://mlabonne.github.io/blog/)




### Setup

#### Install 

In [1]:
%%capture

# Install Pytorch & other libraries
!pip install "torch==2.1.2" tensorboard

# Install Hugging Face libraries
# transformers: This library provides APIs for downloading pre-trained models.
# datasets: This library is used to load datasets from Hugging Face.
# accelerate: This library handles device placement and mixed-precision or multi-GPU execution for training and inference.
# bitsandbytes: It’s a library for quantizing a large language model to reduce the memory footprint of the model, especially on GPUs.
# trl: This library contains an SFT (Supervised Fine-Tuning) class to fine-tune a model.
# peft: This is used to add a LoRA adapter to the LLM.
!pip install --upgrade \
  "transformers==4.36.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0" \
  "trl==0.7.10" \
  "peft==0.7.1"

# wandb: It’s used for monitoring the training process.
!pip install -U wandb


#### Imports

In [2]:
%%capture

# General
# ---
import os, gc, platform, textwrap
import wandb
import pandas as pd
import numpy as np
from tqdm import tqdm

# Torch and bitsandbytes
# ---
import torch
import bitsandbytes as bnb  # needed by find_all_linear_names below

# Hugging face
# ---
from huggingface_hub import login

# PEFT
# ---
from peft import PeftModel, PeftConfig
from peft import AutoPeftModelForCausalLM

# Dataset
# --
from datasets import load_dataset, concatenate_datasets

# Transformers
# --
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer, set_seed
import transformers

# PEFT (LoRA)
# --
from peft import LoraConfig, get_peft_model

# SFTTrainer
# --
from trl import SFTTrainer
from trl import setup_chat_format

# Rouge score
# --
#import evaluate

# Kaggle
# --
from kaggle_secrets import UserSecretsClient

#### Assert

In [3]:
import torch
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this notebook."

#### Kaggle

Kaggle provides a Tesla P100-PCIE-16GB GPU, so we set up a low-resource configuration.

In [4]:
is_low_resource_config = True

#### Methods

In [5]:
################################################################################
# Print 
################################################################################

def print_sep():
    sep = "#" * 14
    print(sep)
    
def pretty(d, indent=0):
    for key, value in d.items():
        if isinstance(value, dict):
            print('\t' * indent + f"{key}:")
            pretty(value, indent + 1)
        else:
            print('\t' * indent + f"{key}: {value}")
    
def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get GPU information
    print("--- GPU Information ---")
    # Check GPU compatibility with bfloat16
    if is_bfloat16_supported():
        print("GPU: Supports bfloat16")
    else:
        print("GPU: Supports float16 (bf16=False)")
        
    # Check GPU compatibility with flash attention
    if is_flash_attention_supported():
        print("GPU: Supports Flash Attention")
    else:
        print("GPU: Hardware not supported for Flash Attention")
  
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())
    
        
################################################################################
# Prompt 
################################################################################

def print_boxed(text, max_width=80):
    lines = textwrap.wrap(text, max_width)  # Wrap text to desired width
    border = '+' + '-' * (max_width + 2) + '+'
    print(border)
    for line in lines:
        print('| ' + line.ljust(max_width) + ' |')
    print(border)
    
def create_system_message(sample):
    prompt = "You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request"
    
    if "context" in sample and sample["context"]:
        return f"""{prompt} based on the provided CONTEXT.
CONTEXT:
{sample["context"]}"""
    else:
        return prompt + "."
        
# Convert a dataset row to OpenAI-style chat messages
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": create_system_message(sample)},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["answer"]}
        ]
    }

################################################################################
# Helpers 
################################################################################

def get_arguments(base, key = None):
    # Retrieve the entry stored under `base` in the config dictionary
    args_config = config.get(base)

    # Fail early if the entry is missing or empty
    if not args_config:
        raise ValueError("Empty arguments.")
        
    if key is None:
        return args_config
    else:
        return args_config[key]

def clear_gpu_memory():
    """Clear GPU memory by collecting garbage, then emptying the CUDA cache."""
    # Collect first so that memory released by Python can be returned by empty_cache()
    gc.collect()
    torch.cuda.empty_cache()
    
def find_all_linear_names(model):
    # Layer class to target: torch.nn.Linear (default), bnb.nn.Linear4bit (4-bit)
    # or bnb.nn.Linear8bitLt (8-bit)
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Keep only the last path component, e.g. "q_proj"
            lora_module_names.add(name.split('.')[-1])

    # Exclude the output head (needed for 16-bit)
    lora_module_names.discard('lm_head')
    return list(lora_module_names)


def save_peft_model(model, peft_model_name_or_path, hub_repo_name):
    """
    Save the model and the (global) tokenizer to the specified path,
    then push both to the Hugging Face Hub repository.
    """
    model.save_pretrained(peft_model_name_or_path)
    tokenizer.save_pretrained(peft_model_name_or_path)
    model.push_to_hub(hub_repo_name)
    tokenizer.push_to_hub(hub_repo_name)
    
################################################################################
# Supported 
################################################################################

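def is_flash_attention_supported():
    # Flash Attention needs compute capability >= 8.0 (Ampere or newer)
    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability()[0] >= 8
        and not is_low_resource_config
    )
    
def is_bfloat16_supported():
    # Ask PyTorch whether the current GPU supports bfloat16
    try:
        return torch.cuda.is_bf16_supported() and not is_low_resource_config
    except Exception:
        return False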
    
def get_torch_dtype_based_on_support():
    return torch.bfloat16 if is_bfloat16_supported() else torch.float16

def get_attn_implementation_config_on_support():
    return {"attn_implementation": "flash_attention_2"} if is_flash_attention_supported() else {}
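
To make the `find_all_linear_names` helper above more concrete, here is a minimal, framework-free sketch of its name-reduction step. The module paths are illustrative assumptions modeled on the Hugging Face Llama layout; they are not read from a loaded model, and `reduce_to_target_names` is a hypothetical name.

```python
# Hypothetical module paths, modeled on the Hugging Face Llama layout (assumption).
module_paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]


def reduce_to_target_names(paths):
    """Keep the last path component of each module and drop the output head."""
    names = {path.split(".")[-1] for path in paths}
    names.discard("lm_head")
    return sorted(names)


target_modules = reduce_to_target_names(module_paths)
print(target_modules)
```

A list of this shape is the kind of value typically passed as the target modules of a LoRA configuration, so that every linear projection receives an adapter.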


#### Spec

In [6]:
print_sep()
!nvidia-smi
print_sep()
print_system_specs()
print_sep()

##############
Fri Jan 26 12:44:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0              26W / 250W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                     

#### Flash attention

If you are using a GPU with the Ampere architecture or newer (e.g. NVIDIA A10G, RTX 3090 or RTX 4090), you can use Flash Attention.
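
As a rough illustration of the hardware check, the sketch below applies the same rule as the notebook's `is_flash_attention_supported` helper (major compute capability of 8 or higher) to a few GPUs. The function name `supports_flash_attention` is hypothetical, and the capability values are hard-coded here rather than queried from a device.

```python
def supports_flash_attention(compute_capability):
    """Return True when the major compute capability is 8 (Ampere) or higher."""
    major, _minor = compute_capability
    return major >= 8


# Compute capabilities hard-coded for illustration (not queried from hardware).
gpus = {
    "Tesla P100": (6, 0),
    "NVIDIA A10G": (8, 6),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
}

support = {name: supports_flash_attention(cc) for name, cc in gpus.items()}
for name, ok in support.items():
    print(f"{name}: {'supported' if ok else 'not supported'}")
```

On the Tesla P100 used in this run the check fails, so the default attention implementation is used.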

In [7]:
# Install Flash Attention only when the hardware supports it
if is_flash_attention_supported():
    # Install the build helpers first, then flash-attn itself
    !pip install ninja packaging
    !pip install flash-attn
    
    # Note: If your machine has less than 96GB of RAM and lots of CPU cores,
    # reduce MAX_JOBS to limit parallel compilation.
    
    # Example: on a g5.2xlarge (8 vCPUs, 32 GB memory) we used 4.
    #!MAX_JOBS=4 pip install flash-attn --no-build-isolation
    

#### Config
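
The configuration in this section selects bfloat16 when the GPU supports it and falls back to float16 otherwise. As a small sketch of why the two 16-bit formats behave differently, the snippet below derives the largest finite value of each format from its exponent and mantissa widths. The helper `max_finite` is a hypothetical name, and the calculation assumes IEEE-style encodings with the top exponent reserved for inf/NaN.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


# (exponent bits, mantissa bits) for each format
formats = {
    "fp16": (5, 10),
    "bf16": (8, 7),
    "tf32": (8, 10),
    "fp32": (8, 23),
}

for name, (e_bits, m_bits) in formats.items():
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.3e}, mantissa bits = {m_bits}")
```

bf16 keeps the exponent width of fp32 and gives up mantissa bits, which is why it tolerates large activations and gradients better than fp16 at the cost of precision.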

In [70]:
# You can easily swap out the model for another model, e.g. Mistral or Mixtral models, TII Falcon.
model_name_or_path = "meta-llama/Llama-2-7b-chat-hf" 
fine_tuned_model_name_or_path = "Llama-2-7b-chat-hf-2024"

config = {
    # General
    # ---
    "cache_dir": "./cache",
    "seed": 42,
    "device_map": "auto",
    "fine_tuned_model_name_or_path": fine_tuned_model_name_or_path,
    
    # Datasets
    # ---
    "Dataset": {
        "alpaca": "vicgalle/alpaca-gpt4",
        "dolly": "databricks/databricks-dolly-15k",
    },
    
    # Quantization 
    # ---
    "BitsAndBytesConfig": {
        # Enable 4-bit quantization with NF4 layers replacing Linear layers.
        "load_in_4bit": True,
        # Compute dtype used by the 4-bit layers: bfloat16 when supported, otherwise float16.
        "bnb_4bit_compute_dtype": get_torch_dtype_based_on_support(),
        # Set quantization data type in bnb.nn.Linear4bit layers to 4-bit NormalFloat (nf4).
        "bnb_4bit_quant_type": "nf4",
        # Enable double quantization for nested quantization.
        "bnb_4bit_use_double_quant": True
    },
    
    # Model 
    # ---
    "AutoModelForCausalLMConfig": {
        "pretrained_model_name_or_path": model_name_or_path,
        "device_map": "auto",
        "torch_dtype": get_torch_dtype_based_on_support(),
        # if flash attention
        # ---
        **get_attn_implementation_config_on_support(),
    },
    
    # Tokenizer 
    # ---
    "AutoTokenizerConfig": {
        "pretrained_model_name_or_path": model_name_or_path,
    },
    
    # LoRA 
    # ---
    "LoraConfig": {
        # lora_alpha sets the scale of the LoRA weight updates (the update is multiplied by lora_alpha / r).
        # A larger alpha places more emphasis on the fine-tuning data; a well-chosen value lets the model
        # learn from the new data without overfitting or losing pre-existing knowledge.
        #
        # Note: a common heuristic for LLMs sets alpha to double the r value. The values below
        # instead use alpha = r / 2 (8 with r=16, 128 with r=256). Other model types,
        # such as diffusion models, may call for a different ratio.
        #
        # Bigger config: 128 
        "lora_alpha": 8 if is_low_resource_config else 128, 
        
        # lora_dropout: QLoRA uses rates of 0.1 for models up to 13B parameters and 0.05 for larger models.
        # Dropout on the LoRA layers helps prevent overfitting within the constraints of training duration and dataset size.
        "lora_dropout": 0.05 if is_low_resource_config else 0.1,
        
        # r is the rank of the LoRA update matrices; it determines the number of trainable parameters.
        # A larger r enhances the model’s capacity to represent complex patterns, beneficial for a nuanced understanding of tasks, 
        # at the cost of increased computational demands and a higher risk of overfitting.
        #
        # Bigger config: 256 
        "r": 16 if is_low_resource_config else 256,
        
        # Bias can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the corresponding biases will be updated during training. 
        # In that case, even with the adapters disabled, the model will not produce the same output as the base model would have without adaptation. The default is 'none'.
        "bias":"none",
        
        # Task type. This is the type of task for which the model is fine-tuned. 
        # "SEQ_2_SEQ_LM" (Sequence-to-Sequence Language Model), "CAUSAL_LM"
        "task_type":"CAUSAL_LM",
    }, 
    
    # Train 
    # ---
    "TrainingArguments": {
        # Evaluate the model at certain steps or epochs (evaluation_strategy and do_eval).
        #evaluation_strategy="steps",
        #do_eval=True,
        
        # Output directory where the model predictions and checkpoints will be stored
        "output_dir": fine_tuned_model_name_or_path,
        
        # Number of training epochs
        # The number of complete passes through the dataset.
        "num_train_epochs": 1,
        
        # Enables automatic discovery of a batch size that fits your data, which is useful to prevent out-of-memory errors.
        "auto_find_batch_size": True, 
        
        # With `auto_find_batch_size`, the defined eval or train batch size is used as the initial batch size. 
        # So if you use the default of 8, it starts training with a batch size of 8 (on a single device). 
        # If that fails with an out-of-memory error, the training procedure restarts with a batch size of 4, and so on.
                
        # Batch size per GPU for training
        #
        # Big config: 4
        # "per_device_train_batch_size": 4,
        
        # Batch size per GPU for evaluation
        #
        # Big config: 4
        # "per_device_eval_batch_size": 4,

        # Set to True to enable mixed precision training and evaluation with bfloat16 (bf16), which requires Ampere or newer GPUs.
        # bf16 reduces memory consumption and can speed up training without significantly affecting model accuracy.
        # bf16 has a wider dynamic range than fp16, making it suitable for mixed precision.
        "bf16": is_bfloat16_supported(), 

        # Set to True to enable tf32, a math mode available on Ampere or newer GPUs.
        # Provides up to 3x throughput improvement in training and/or inference.
        # Same numerical range as fp32 but with reduced precision (10-bit mantissa).
        # Allows using normal fp32 training/inference code with enhanced throughput.
        # Note: the bf16 check is reused here as a proxy for Ampere-class hardware.
        "tf32": is_bfloat16_supported(), 
        
        # Gradient accumulation steps: number of batches whose gradients are accumulated before each model update.
        #
        # - Default value is 1, meaning gradients are calculated and applied after each batch.
        # - Increasing the value (e.g., setting it to 4) accumulates gradients over multiple batches before updating the model.
        # - This helps overcome GPU memory limitations by increasing the effective batch size without a larger batch in memory.
        # - Be cautious with larger values: each update then needs more forward and backward passes.
        #
        # Considerations:
        #
        # - Prefer a larger per-device batch size when memory allows; too many accumulation steps can slow training down.
        # - It's recommended to find a balance.
        # - If GPU memory is limited, increase the value (e.g., 16); otherwise decrease it.
        #
        # Bigger config: 2
        "gradient_accumulation_steps": 16 if is_low_resource_config else 2,
        
        # Use gradient checkpointing to save memory at the expense of slower backward pass. 
        #
        # Note: Optimisation (lower the memory footprint)
        "gradient_checkpointing": True,
        
        # Maximum gradient norm (gradient clipping), based on the QLoRA paper.
        "max_grad_norm": 0.3,
        
        # Initial learning rate, based on the QLoRA paper.
        "learning_rate": 2e-4,
        
        # Ratio of total training steps used to linearly increase the learning rate from 0, based on the QLoRA paper.
        "warmup_ratio": 0.03,
        
        # Weight decay 
        #
        # - Regularization parameter to prevent overfitting.
        # - Weight decay is a regularization technique that adds a small penalty to the loss function, typically the L2 norm of all model weights.
        # Note: `weight_decay` is not set here, so the Trainer default is used.
        # "weight_decay": 0.001,
        
        # Optimizer
        #
        # Fused AdamW implementation to further speed up training. 
        # AdamW modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update.
        # - `adamw_torch_fused` on PyTorch >= 2.0, so that users get the speed-up
        # - `adamw_hf` on PyTorch < 2.0 (as before)
        "optim": "adamw_torch_fused",
        # Note: a possible next step is to try the 8-bit paged optimizer `paged_adamw_8bit`.
                                
        # The learning rate scheduler
        # - `linear`
        # - `cosine`
        # - `cosine_with_restarts`
        # - `polynomial`
        # - `constant`
        # - `constant_with_warmup`
        # - `inverse_sqrt`
        # - `reduce_lr_on_plateau`
        "lr_scheduler_type": "linear", 
        
        # Group sequences into batches with same length
        # Saves memory and speeds up training considerably
        # TODO: To see
        "group_by_length": True,
        
        # Log every X update steps.
        "logging_steps": 10,
        
        # Set to "debug" for detailed logging information during training.
        # "log_level": "debug",
        
        "report_to": "wandb", 
        
        # save checkpoint every epoch
        "save_strategy": "epoch",
    },
    
    "SFTTrainer": {
        # Make sure to pass a correct value for max_seq_length as the default value will be set to min(tokenizer.model_max_length, 1024).
        #
        # Bigger config: 1024, 3072
        "max_seq_length": 1024 if is_low_resource_config else 3072,

        # Pack multiple short examples into the same input sequence to increase efficiency.
        # The trainer packs sequences up to `max_seq_length`.
        "packing": True,
        "dataset_kwargs": {
            # We template with special tokens
            "add_special_tokens": False, 
            # No need to add additional separator token
            "append_concat_token": False,
        }
    },
    "TextGenerationConfig": {
        "max_new_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.1,
        "repetition_penalty": 1.18,
        "top_k": 40,
        "do_sample": True,
    }
}
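
To give a feel for what the `r` and `lora_alpha` values in the `LoraConfig` entry imply, here is a minimal sketch under stated assumptions: standard LoRA scaling of `lora_alpha / r`, and a square 4096 x 4096 projection as found in Llama-2-7b attention layers. The helper names are hypothetical and the numbers cover one layer only, not the whole model.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


hidden = 4096  # assumed hidden size of a Llama-2-7b attention projection

for label, alpha, r in [("low resource", 8, 16), ("bigger", 128, 256)]:
    added = lora_trainable_params(hidden, hidden, r)
    full = hidden * hidden
    print(
        f"{label}: scaling={lora_scaling(alpha, r)}, "
        f"adapter params={added:,} ({added / full:.1%} of the frozen layer)"
    )
```

Both settings in this notebook share the same scaling factor; what changes between them is the number of trainable parameters per adapted layer.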

In [9]:
print_sep()
pretty(config)
print_sep()

##############
cache_dir: ./cache
seed: 42
device_map: auto
fine_tuned_model_name_or_path: Llama-2-7b-chat-hf-2024
Dataset:
	alpaca: vicgalle/alpaca-gpt4
	dolly: databricks/databricks-dolly-15k
BitsAndBytesConfig:
	load_in_4bit: True
	bnb_4bit_compute_dtype: torch.float16
	bnb_4bit_quant_type: nf4
	bnb_4bit_use_double_quant: True
AutoModelForCausalLMConfig:
	pretrained_model_name_or_path: meta-llama/Llama-2-7b-chat-hf
	device_map: auto
	torch_dtype: torch.float16
AutoTokenizerConfig:
	pretrained_model_name_or_path: meta-llama/Llama-2-7b-chat-hf
LoraConfig:
	lora_alpha: 8
	lora_dropout: 0.05
	r: 16
	bias: none
	task_type: CAUSAL_LM
TrainingArguments:
	output_dir: Llama-2-7b-chat-hf-2024
	num_train_epochs: 1
	auto_find_batch_size: True
	bf16: False
	tf32: False
	gradient_accumulation_steps: 16
	gradient_checkpointing: True
	max_grad_norm: 0.3
	learning_rate: 0.0002
	warmup_ratio: 0.03
	optim: adamw_torch_fused
	lr_scheduler_type: linear
	group_by_length: True
	logging_steps: 10
	report_t
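
The training values printed above interact: the effective batch size is the per-device batch size multiplied by the accumulation steps and the number of devices, and the warmup length follows from the total number of update steps. The sketch below is an approximation under stated assumptions (a starting batch size of 8, a single GPU, and 1600 training sequences before packing); the Trainer's own step count can differ slightly, and packing changes the real number of sequences.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of sequences contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps(num_sequences, per_device_batch, grad_accum_steps, num_devices=1, epochs=1):
    """Approximate number of optimizer updates for the run."""
    per_update = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_sequences / per_update) * epochs


# Assumed values: batch size 8, one GPU, 1600 sequences (before packing).
batch = effective_batch_size(per_device_batch=8, grad_accum_steps=16)
steps = optimizer_steps(num_sequences=1600, per_device_batch=8, grad_accum_steps=16)
warmup = math.ceil(steps * 0.03)  # warmup_ratio = 0.03

print(f"effective batch size: {batch}")
print(f"approximate optimizer steps: {steps}")
print(f"warmup steps: {warmup}")
```

With so few update steps, `logging_steps: 10` produces very little output on a test-sized dataset, which is worth keeping in mind when reading the training logs.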

#### Options

In [10]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

#### Seed

In [11]:
set_seed(get_arguments("seed"))

#### Kaggle Secrets

In [12]:
userSecretsClient = UserSecretsClient()

#### Hugging Face

In [13]:
login(
    token=userSecretsClient.get_secret("HUGGINGFACEHUB_API_TOKEN"),
    add_to_git_credential=True
)

Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful


#### Monitoring

<img src="https://app.safebase.io/api/share/dfabe115-c068-492f-93de-b0669bb5dbf7/logo_og.png" alt="Learning" width="40%">


Monitoring is essential to ensure the ongoing quality and performance of the model. To facilitate this task, we will be using Weights & Biases (WandB), a platform that aids AI developers in building models more efficiently by enabling quick experiment tracking, dataset versioning, and model performance evaluation.

Setup:
Before getting started, follow these steps to set up your environment:
1. Create a WandB account (or log in) using the link below.
2. After creating your account, retrieve the API key provided by WandB.
3. Use this API key to authenticate within your notebook, enabling integration with WandB for monitoring your model training experiments.

[Weights & Biases](https://wandb.ai)

In [14]:
# Kaggle
# --
wandb.login(key=userSecretsClient.get_secret("WANDB_API_TOKEN"))
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: stephan-yannick. Use `wandb login --relogin` to force relogin


# <b>2 <span style='color:#78D118'>|</span> Fine-Tuning</b>

### Step 1 - Data Preparation

The first step of the fine-tuning process is to identify a specific task and supporting dataset.

We will use:

#### Alpaca-gpt4

The "alpaca-gpt4" dataset comprises 52K English instruction-following records generated by GPT-4 using Alpaca prompts, designed for fine-tuning Large Language Models (LLMs); it follows the same format as the original Alpaca dataset but features higher-quality and longer responses. The dataset is accessible under the Creative Commons NonCommercial (CC BY-NC 4.0) license.

[alpaca-gpt4](https://huggingface.co/datasets/vicgalle/alpaca-gpt4)
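
As a self-contained sketch of the preparation applied to this dataset, the snippet below maps one Alpaca-style record (`instruction`, `input`, `output`) to a list of chat messages, in the spirit of the notebook's `create_conversation` helper. The record, the default system prompt and the name `to_messages` are hypothetical and used only for illustration.

```python
def to_messages(record, system_prompt="You are a helpful assistant."):
    """Map an Alpaca-style record to system/user/assistant chat messages."""
    system = system_prompt
    if record.get("input"):
        # Append the optional input as context for the system message.
        system = f"{system_prompt}\nCONTEXT:\n{record['input']}"
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


# Hypothetical record in the instruction / input / output layout.
record = {
    "instruction": "Translate the word to French.",
    "input": "cat",
    "output": "chat",
}

conversation = to_messages(record)
for message in conversation["messages"]:
    print(message["role"], "->", message["content"])
```

Records with an empty `input` simply keep the bare system prompt, which matches how most Alpaca rows look.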


In [27]:
alpaca_dataset_name_or_path = get_arguments("Dataset", "alpaca")

alpaca_dataset = load_dataset(alpaca_dataset_name_or_path, split="train[0:1000]") # NOTE: testing [0:1000]

In [28]:
# Define the column mapping
column_mapping = {
    'instruction': 'question',
    'input': 'context',
    'output': 'answer',
}

# Rename the columns
for old_column, new_column in column_mapping.items():
    alpaca_dataset = alpaca_dataset.rename_column(old_column, new_column)

In [29]:
alpaca_dataset

Dataset({
    features: ['question', 'context', 'answer', 'text'],
    num_rows: 1000
})

In [30]:
pretty(alpaca_dataset[:2])

question: ['Give three tips for staying healthy.', 'What are the three primary colors?']
context: ['', '']
answer: ['1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.', 'The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mix

In [31]:
alpaca_dataset = alpaca_dataset.map(
    create_conversation, 
    remove_columns=alpaca_dataset.features,
    batched=False
)

In [32]:
pretty(alpaca_dataset[:20]["messages"][0][0])
print("\n")
pretty(alpaca_dataset[:20]["messages"][0][1])
print("\n")
pretty(alpaca_dataset[:20]["messages"][0][2])

content: You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request.
role: system


content: Give three tips for staying healthy.
role: user


content: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
rol

#### databricks/databricks-dolly-15k

Databricks-dolly-15k is an open-source dataset of over 15,000 instruction-following records created by Databricks employees. It covers several of the behavioral categories outlined in the InstructGPT paper and can be used for training large language models, synthetic data generation, and data augmentation. It is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License.


[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)

In [33]:
dolly_dataset_name_or_path = get_arguments("Dataset", "dolly")

dolly_dataset = load_dataset(dolly_dataset_name_or_path, split="train[0:1000]") # NOTE: testing [0:1000]

In [34]:
# Define the column mapping
column_mapping = {
    'instruction': 'question',
    'response': 'answer',
}

# Rename the columns
for old_column, new_column in column_mapping.items():
    dolly_dataset = dolly_dataset.rename_column(old_column, new_column)

In [35]:
dolly_dataset

Dataset({
    features: ['question', 'context', 'answer', 'category'],
    num_rows: 1000
})

In [36]:
pretty(dolly_dataset[:2])

question: ['When did Virgin Australia start operating?', 'Which is a species of fish? Tope or Rope']
context: ["Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", '']
answer: ['Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'Tope']
category: ['closed_qa', 'classification']


#### Data Preparation

In [37]:
dolly_dataset = dolly_dataset.map(
    create_conversation, 
    remove_columns=dolly_dataset.features,
    batched=False
)

In [38]:
pretty(dolly_dataset[:20]["messages"][0][0])
print("\n")
pretty(dolly_dataset[:20]["messages"][0][1])
print("\n")
pretty(dolly_dataset[:20]["messages"][0][2])

content: You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.
CONTEXT:
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
role: system


content: When did Virgin Australia start operating?
role: user


content: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
role: assistant


#### Merge

In [39]:
dataset_cc = concatenate_datasets([dolly_dataset, alpaca_dataset])

In [40]:
dataset_cc

Dataset({
    features: ['messages'],
    num_rows: 2000
})

#### Train Test split

In [41]:
dataset = dataset_cc.train_test_split(test_size=0.2, seed=get_arguments("seed"))  # explicit seed for a reproducible split

In [42]:
dataset

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 400
    })
})

### Step 2 - Model / Tokenizer

We are going to load a Llama-2-7b-chat-hf pre-trained model.

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The model used here is the 7B fine-tuned variant, optimized for dialogue use cases and converted to the Hugging Face Transformers format.
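
For a rough sense of what loading a 7B model costs, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores activations, gradients, optimizer state and the small overhead of quantization constants, so treat the figures as lower bounds. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # nominal "7B"; the exact parameter count differs slightly

estimates = {
    "fp32": weight_memory_gb(num_params, 32),
    "fp16/bf16": weight_memory_gb(num_params, 16),
    "4-bit": weight_memory_gb(num_params, 4),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

On a 16 GB GPU such as the Tesla P100 used here, 16-bit weights leave little headroom, which is the motivation for 4-bit quantization in QLoRA.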

<img src="https://scontent-fra5-2.xx.fbcdn.net/v/t39.8562-6/400617302_264856479475530_3601182442335680795_n.png?_nc_cat=106&ccb=1-7&_nc_sid=f537c7&_nc_ohc=VZvK94sCtKsAX8dKjrN&_nc_ht=scontent-fra5-2.xx&oh=00_AfDsw66stP-NG88Yikz44t0I0AwJLEcx1YaDoozSy4axdA&oe=65B76585" alt="Learning" width="40%">

The fine-tuned model, Llama-2-chat, leverages publicly available instruction datasets and over 1 million human annotations, using reinforcement learning from human feedback (RLHF) to ensure safety and helpfulness.

<img src="https://scontent-fra5-1.xx.fbcdn.net/v/t39.8562-6/400617500_1079533569709998_6644333000484721271_n.png?_nc_cat=100&ccb=1-7&_nc_sid=f537c7&_nc_ohc=U-uanOTeDKQAX-U8qad&_nc_ht=scontent-fra5-1.xx&oh=00_AfBuU6PszmvgUrohZ4KxGTf__fqp_A2sjZSDenpGpk90Rg&oe=65B84809" alt="Learning" width="40%">



##### Model

In [31]:
autoModelForCausalLMConfig = get_arguments("AutoModelForCausalLMConfig")
model = AutoModelForCausalLM.from_pretrained(**autoModelForCausalLMConfig, cache_dir="temp_testing")  # separate cache_dir, so it does not override the one used later for training

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

##### Tokenizer

In [32]:
autoTokenizerConfig = get_arguments("AutoTokenizerConfig")
tokenizer = AutoTokenizer.from_pretrained(**autoTokenizerConfig, cache_dir="temp_testing") 

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

### Step 3 - Zero-Shot and Few-Shot Learning

In this section, we will delve into the model's ability to adapt and perform on a specific task using "zero-shot learning" and "few-shot learning" approaches. Here is an overview of each approach:

**Zero-Shot Learning:**
- Evaluation of the model's ability to perform the task without any task-specific examples in the prompt.
- Analysis of the model's comprehension and generalization concerning the task, even in the absence of direct examples.

**Few-Shot Learning:**
- Deliberate provision of a limited number of examples for the specific task, given directly in the prompt.
- Assessment of the model's performance using these few examples, reflecting its ability to generalize from a small set of demonstrations.
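
The difference between the two approaches can be sketched with plain message lists. In the hypothetical example below, the zero-shot prompt contains only the instruction and the question, while the few-shot prompt places a couple of worked demonstrations before the question. The task, the demonstrations and the name `build_messages` are all assumptions made for illustration.

```python
def build_messages(system_prompt, question, examples=()):
    """Build a chat message list; `examples` holds (question, answer) demonstrations."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


system_prompt = "Classify the sentiment of each review as positive or negative."
question = "The battery died after two days."

# Zero-shot: only the instruction and the question.
zero_shot = build_messages(system_prompt, question)

# Few-shot: two hypothetical demonstrations precede the question.
few_shot = build_messages(
    system_prompt,
    question,
    examples=[
        ("I love this phone.", "positive"),
        ("The screen cracked on day one.", "negative"),
    ],
)

print(len(zero_shot), "messages in the zero-shot prompt")
print(len(few_shot), "messages in the few-shot prompt")
```

Either list can then be rendered with the tokenizer's chat template, since both keep the user and assistant turns strictly alternating.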

##### Chat Models and Chat Templates

An increasingly prevalent application of Large Language Models (LLMs) lies in the domain of chat-based interactions. Unlike traditional language models that process a continuous string of text, LLMs designed for chat scenarios engage in multi-turn conversations. These conversations consist of multiple messages, each assigned a specific role such as "user" or "assistant," accompanied by corresponding message text.

Due to the diverse expectations of various models regarding input formats for chat interactions, the concept of chat templates has emerged as a crucial feature. These templates, integrated into the tokenizer, serve to guide the transformation of conversations—represented as lists of messages—into a unified tokenizable string, aligning with the specific format expected by the model.

To illustrate this concept, let's delve into a practical example using the BlenderBot model. BlenderBot employs a straightforward default template, primarily introducing whitespace between rounds of dialogue. The integration of chat templates facilitates this process:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

chat = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
   {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
# Returns:
# " Hello, how are you?  I'm doing great. How can I help you today?   I'd like to show off how chat templating works!</s>"
```
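
Llama 2 chat models expect a more structured format than BlenderBot. The sketch below is a hand-written approximation of the single-turn prompt, mirroring the `[INST]` and `<<SYS>>` layout that appears in this notebook's prompt output. The function `format_llama2_prompt` and the system text are hypothetical, and multi-turn handling is left out. In practice `tokenizer.apply_chat_template` remains the safer choice, since it follows the template shipped with the model.

```python
def format_llama2_prompt(system, user):
    """Single-turn Llama 2 chat prompt: system text wrapped in <<SYS>> inside [INST]."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = format_llama2_prompt(
    system="You are a helpful assistant.",
    user="Give three tips for staying healthy.",
)
print(prompt)
```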

**References:**

[OpenAI introduced Chat Markup Language (ChatML) based input to non-chat modes](https://cobusgreyling.medium.com/openai-introduced-chat-markup-language-chatml-based-input-to-non-chat-modes-6ca4b267012f)

[Microsoft Docs - Chat Markup Language](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md)

[Implementing Few-Shot Learning with GPT](https://developer.dataiku.com/latest/tutorials/machine-learning/genai/nlp/gpt-few-shot-clf/index.html#implementing-few-shot-learning)

[Prompting Guide - Few-Shot Learning](https://www.promptingguide.ai/techniques/fewshot)

[Learn Prompting - Few-Shot Basics](https://learnprompting.org/docs/basics/few_shot)

[The Science of Control: How Temperature, Top-p, and Top-k Shape Large Language Models](https://medium.com/@daniel.puenteviejo/the-science-of-control-how-temperature-top-p-and-top-k-shape-large-language-models-853cb0480dae)


In [33]:
print_sep()
pretty(alpaca_dataset[:20]["messages"][0][0])
print_sep()
print("User:")
pretty(alpaca_dataset[:20]["messages"][0][1])
print_sep()
pretty(alpaca_dataset[:20]["messages"][0][2])
print_sep()

##############
content: You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request.
role: system
##############
User:
content: Give three tips for staying healthy.
role: user
##############
content: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune funct

In [34]:
messages = alpaca_dataset[:20]["messages"][0][:2]

##### Inference Pipeline

In [35]:
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(prompt)

<s>[INST] <<SYS>>
You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request.
<</SYS>>

Give three tips for staying healthy. [/INST]


In [36]:
textGenerationConfig = get_arguments("TextGenerationConfig")

generator = pipeline(
    task="text-generation", 
    model=model, 
    tokenizer=tokenizer
)

outputs = generator(
    prompt, 
    **textGenerationConfig
)

In [37]:
print_sep()
print("Generated:\n")
print(outputs[0]["generated_text"][len(prompt):])
print_sep()
print("Original:\n")
pretty(alpaca_dataset[:20]["messages"][0][2])
print_sep()

##############
Generated:

  Great, I'd be happy to help! Here are three tips for staying healthy:

1. Stay Hydrated: Drinking enough water is essential for maintaining good health. Aim to drink at least eight glasses of water per day, and avoid sugary drinks like soda and juice that can have negative impacts on your health.
2. Eat a Balanced Diet: Fuel your body with nutrient-dense foods like fruits, vegetables, whole grains, lean proteins, and healthy fats. Limit your intake of processed and high-calorie foods that can lead to weight gain and other health problems.
3. Get Enough Sleep: Adequate sleep is crucial for physical and mental well-being. Aim for 7-8 hours of sleep each night and establish a consistent bedtime routine to help improve the quality of your sleep.

Remember, taking care of your health is a long-term commitment, but by following these tips, you can set yourself up for success and enjoy better overall health throughout your life.
##############
Original:

content: 

**NOTE: The base model already performs remarkably well across the tasks in these datasets, so changes in its generation capabilities after fine-tuning may not be readily apparent.**

👉🏿 The rest of this notebook is therefore an example of how to fine-tune in 2024.

##### Inference Model / Tokenizer

In [38]:
# textGenerationConfig = get_arguments("TextGenerationConfig")
#
# input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
# output_tokens = model.generate(
#     input_tokens,
#     **textGenerationConfig,
# )
# # Keep only the newly generated tokens (drop the prompt)
# output_tokens = output_tokens[0][len(input_tokens[0]):]
# output = tokenizer.decode(output_tokens, skip_special_tokens=True)
# print(output)

In [39]:
del tokenizer
del model
clear_gpu_memory()

### Step 4 - Training preparation



#### BitsAndBytesConfig int-4 config

This configuration lets us load the LLM with 4-bit quantization (NF4 with double quantization), while computations run in float16.

In [43]:
# Config
# ---
bitsAndBytesConfigConfig = get_arguments("BitsAndBytesConfig")

# Config 4-bit quantization
# ---
bnb_config = BitsAndBytesConfig(**bitsAndBytesConfigConfig)

# Print
# ---
print_sep()
pretty(bitsAndBytesConfigConfig)
print_sep()

##############
load_in_4bit: True
bnb_4bit_compute_dtype: torch.float16
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: True
##############
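
To get a rough feel for what this configuration buys, the sketch below estimates the weight memory of a 7B-parameter model in float16 versus 4-bit NF4. The round `7e9` parameter count is an assumption, and so are the block sizes of 64 and 256, which follow the layout described in the QLoRA paper. The estimate treats every weight as quantized and ignores activations, the LoRA adapter, and optimizer state, so real usage will differ.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def nf4_bits_per_param(double_quant: bool, block: int = 64, block2: int = 256) -> float:
    """Effective bits per parameter for 4-bit storage plus quantization constants.

    Assumed layout (QLoRA paper): one float32 constant per `block` weights.
    With double quantization those constants are stored in 8 bits, plus one
    float32 constant per `block2` first-level constants.
    """
    if double_quant:
        overhead = 8 / block + 32 / (block * block2)
    else:
        overhead = 32 / block
    return 4 + overhead


n_params = 7e9  # assumption: round figure for a "7B" model

fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, nf4_bits_per_param(double_quant=False))
nf4_dq_gb = weight_memory_gb(n_params, nf4_bits_per_param(double_quant=True))

print(f"float16:            {fp16_gb:.2f} GB")
print(f"NF4:                {nf4_gb:.2f} GB")
print(f"NF4 + double quant: {nf4_dq_gb:.2f} GB")
```

Under these assumptions, double quantization saves a few hundred megabytes on top of plain NF4, which is why `bnb_4bit_use_double_quant` is enabled here.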


#### Model with quantization

In [44]:
# Config
# ---
autoModelForCausalLMConfig = get_arguments("AutoModelForCausalLMConfig")

# Load base model
# ---
model = AutoModelForCausalLM.from_pretrained(
    **autoModelForCausalLMConfig,
    quantization_config=bnb_config,
)

# Print
# ---
print_sep()
pretty(autoModelForCausalLMConfig)
print_sep()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

##############
pretrained_model_name_or_path: meta-llama/Llama-2-7b-chat-hf
device_map: auto
torch_dtype: torch.float16
##############


#### Tokenizer

In [45]:
# Config
# ---
autoTokenizerConfig = get_arguments("AutoTokenizerConfig")

# Tokenizer
# ---
tokenizer = AutoTokenizer.from_pretrained(**autoTokenizerConfig)
tokenizer.pad_token = tokenizer.eos_token # TODO ADD to test
tokenizer.padding_side = 'right' # to prevent warnings

# Print
# ---
print_sep()
pretty(autoTokenizerConfig)
print_sep()

##############
pretrained_model_name_or_path: meta-llama/Llama-2-7b-chat-hf
##############


##### Chat template chatML


In [46]:
model, tokenizer = setup_chat_format(model, tokenizer)
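
`setup_chat_format` switches the tokenizer to the ChatML layout, which is the `<|im_start|>` / `<|im_end|>` format visible in the prompts printed during evaluation. As a rough illustration of that layout only, here is a minimal pure-Python sketch. The helper `to_chatml` and the example system message are hypothetical; in the notebook the real formatting is done by `tokenizer.apply_chat_template`.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},  # hypothetical
    {"role": "user", "content": "Give three tips for staying healthy."},
]

prompt = to_chatml(chat, add_generation_prompt=True)
print(prompt)
```

Because ChatML introduces new special tokens, the tokenizer and the model's embedding layer have to stay in sync, which is why the same function returns both `model` and `tokenizer`.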

### Step 5 - Train

#### LoRA / PEFT Config

These parameters allow customization of the LoRA fine-tuning behavior to fit specific application needs.

In [47]:
import bitsandbytes as bnb

# Config
# ---
modules = find_all_linear_names(model)
lora_config = get_arguments("LoraConfig")

# LoraConfig
# ---
lora_perf_config = LoraConfig(
    target_modules=modules,
    **lora_config
)

In [48]:
print_sep()
pretty(lora_perf_config.__dict__)
print_sep()

##############
peft_type: LORA
auto_mapping: None
base_model_name_or_path: None
revision: None
task_type: CAUSAL_LM
inference_mode: False
r: 16
target_modules: {'gate_proj', 'down_proj', 'q_proj', 'o_proj', 'v_proj', 'k_proj', 'up_proj'}
lora_alpha: 8
lora_dropout: 0.05
fan_in_fan_out: False
bias: none
modules_to_save: None
init_lora_weights: True
layers_to_transform: None
layers_pattern: None
rank_pattern:
alpha_pattern:
megatron_config: None
megatron_core: megatron.core
loftq_config:
##############
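
To make `r`, `lora_alpha`, and `target_modules` more concrete, the sketch below counts the trainable parameters these settings would add. The Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers) are not printed in this notebook and are assumptions here; the count is an estimate, not something read from the trainer.

```python
# Assumed Llama 2 7B dimensions (not printed in this notebook).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

R = 16          # rank, from the config above
LORA_ALPHA = 8  # from the config above

# (in_features, out_features) of each targeted linear layer in one decoder block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two small matrices per layer: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the adapter update

print(f"trainable LoRA parameters: {total:,}")
print(f"adapter size in float16:   {total * 2 / 1e6:.1f} MB")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

With these assumptions the adapter holds roughly 40 million parameters, a small fraction of the 7B base model, and the update is scaled by `lora_alpha / r = 0.5`.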


#### TrainingArguments

Prepare the TrainingArguments

In [49]:
# Config
# ---
training_config = get_arguments("TrainingArguments")

# TrainingArguments
# ---
training_arguments = TrainingArguments(**training_config)

# Print
# ---
print_sep()
pretty(training_config)
print_sep()

##############
output_dir: Llama-2-7b-chat-hf-2024
num_train_epochs: 1
auto_find_batch_size: True
bf16: False
tf32: False
gradient_accumulation_steps: 16
gradient_checkpointing: True
max_grad_norm: 0.3
learning_rate: 0.0002
warmup_ratio: 0.03
optim: adamw_torch_fused
lr_scheduler_type: linear
group_by_length: True
logging_steps: 10
report_to: wandb
save_strategy: epoch
##############
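
The combination of `learning_rate`, `warmup_ratio`, and `lr_scheduler_type: linear` can be pictured with the small sketch below. It is a simplified re-implementation for illustration, not the scheduler code from `transformers`; the function name `linear_schedule_lr` and the 100-step run are hypothetical.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp up linearly towards base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: go down linearly and reach 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 100  # hypothetical run length
for step in (0, 1, 3, 50, 100):
    print(f"step {step:3d}: lr = {linear_schedule_lr(step, total_steps):.6f}")
```

With only a handful of optimizer steps, as in the short run in this notebook, the warmup covers a single step and the rate then decays for the rest of training.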


#### SFTTrainer (Supervised Fine-tuning Trainer)

**Best practices**

Pay attention to the following best practices when training a model with this trainer:

- SFTTrainer always pads the sequences to the max_seq_length argument by default. If none is passed, the trainer retrieves that value from the tokenizer. Some tokenizers do not provide a default value, so there is a check that takes the minimum of 2048 and that value. Make sure to check it before training.
- For training adapters in 8-bit, you might need to tweak the arguments of the prepare_model_for_kbit_training method from PEFT, so we advise using the prepare_in_int8_kwargs field, or creating the PeftModel outside the SFTTrainer and passing it.
- For more memory-efficient training using adapters, you can load the base model in 8-bit: simply add the load_in_8bit argument when creating the SFTTrainer, or create a base model in 8-bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure not to pass the trainer any additional keyword arguments that relate to the from_pretrained() method.

In [50]:
# Config
# ---
sfttraing_config = get_arguments("SFTTrainer")

# Print
# ---
print_sep()
pretty(sfttraing_config)
print_sep()

##############
max_seq_length: 1024
packing: True
dataset_kwargs:
	add_special_tokens: False
	append_concat_token: False
##############
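
With `packing: True`, short examples are concatenated and cut into blocks of `max_seq_length` tokens instead of being padded one by one. The sketch below shows the idea on toy token IDs. It is a simplified illustration, not TRL's implementation, and the helper `pack_sequences` is hypothetical. Passing `concat_token_id=None` mirrors `append_concat_token: False`, where no separator is added because the chat template already ends each turn with its own end token.

```python
def pack_sequences(tokenized_examples, seq_length, concat_token_id=None):
    """Concatenate token lists and split them into full blocks of seq_length."""
    buffer = []
    for ids in tokenized_examples:
        buffer.extend(ids)
        if concat_token_id is not None:
            # Optional separator between examples (e.g. an EOS token).
            buffer.append(concat_token_id)
    n_full = len(buffer) // seq_length  # the incomplete tail is dropped
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token IDs

with_separator = pack_sequences(examples, seq_length=4, concat_token_id=0)
without_separator = pack_sequences(examples, seq_length=4)

print(with_separator)
print(without_separator)
```

Every packed block is completely filled with real tokens, which is what makes packing efficient for datasets with many short samples.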


In [51]:
# SFTTrainer
# ---
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    # Passing peft_config lets SFTTrainer apply Parameter-Efficient Fine-Tuning (PEFT) without manually wrapping the model in a PeftModel.
    # ---
    peft_config=lora_perf_config,
    **sfttraing_config
) 

Generating train split: 0 examples [00:00, ? examples/s]


In [52]:
clear_gpu_memory()

In [53]:
# Train
# ---
trainer.train()

# Save
# ---
trainer.save_model()

# Wandb
# ---
wandb.finish()



Run summary (Weights & Biases):

| Metric | Value |
| --- | --- |
| train/epoch | 0.86 |
| train/global_step | 5.0 |
| train/total_flos | 1.3395909647794176e+16 |
| train/train_loss | 1.81529 |
| train/train_runtime | 1919.0611 |
| train/train_samples_per_second | 0.194 |
| train/train_steps_per_second | 0.003 |


In [54]:
del model
del trainer

clear_gpu_memory()
clear_gpu_memory()

**Optional:** Merge the LoRA adapter into the original model

When using QLoRA, we only train adapters and not the full model. This means that when saving the model during training, we only save the adapter weights and not the full model. If you want to save the full model, which makes it easier to use with Text Generation Inference, you can merge the adapter weights into the model weights using the merge_and_unload method and then save the model with the save_pretrained method. This saves a standard model that can be used for inference.

**Note:** You might require > 30GB CPU Memory.
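
Conceptually, merging folds the low-rank update into each frozen weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy sketch below checks, in plain Python, that the merged weight produces the same output as the base weight plus the adapter path. All matrices and sizes are made up for illustration, and the helpers `matmul` and `add` are hypothetical; this is not what `merge_and_unload` executes internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b for two matrices of the same shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: 3 outputs, 4 inputs, rank 2.
W = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]          # frozen base weight (out x in)
A = [[1.0, 0.0, -1.0, 0.5],
     [0.0, 2.0, 0.0, -0.5]]         # LoRA A (r x in)
B = [[0.5, 0.0],
     [0.0, 0.25],
     [1.0, -1.0]]                   # LoRA B (out x r)
r, lora_alpha = 2, 1
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0], [4.0]]    # input as a column vector

# Adapter kept separate: y = W x + scaling * B (A x)
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter merged: W_merged = W + scaling * B A, then y = W_merged x
W_merged = add(W, matmul(B, A), scale=scaling)
y_merged = matmul(W_merged, x)

print(y_adapter)
print(y_merged)
```

After merging, the adapter matrices are no longer needed at inference time, so the result can be saved and served like any regular checkpoint.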

In [55]:
# NOTE: To merge the PEFT adapter into the base model, uncomment the lines below.

# from peft import PeftModel, PeftConfig
# from transformers import AutoModelForCausalLM, AutoTokenizer

# # Load the base model and the PEFT adapter on CPU
# output_dir = training_arguments.output_dir
# config = PeftConfig.from_pretrained(output_dir)
# model = AutoModelForCausalLM.from_pretrained(
#     config.base_model_name_or_path,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# tokenizer = AutoTokenizer.from_pretrained(output_dir)
# # Match the embedding size to the tokenizer (ChatML tokens were added before training)
# model.resize_token_embeddings(len(tokenizer))
# model = PeftModel.from_pretrained(model, output_dir)
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="2GB")


# <b>3 <span style='color:#78D118'>|</span> Performance Evaluation</b>

### Step 1 - Load model and apply PEFT

In [56]:
clear_gpu_memory()

In [57]:
fine_tuned_model_name_or_path = get_arguments("fine_tuned_model_name_or_path")

In [58]:
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_name_or_path)
config = PeftConfig.from_pretrained(fine_tuned_model_name_or_path)

model = AutoModelForCausalLM.from_pretrained(
  config.base_model_name_or_path, # Load the base model
  low_cpu_mem_usage=True,
  device_map="auto",
  torch_dtype=torch.float16
)

model.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(model, fine_tuned_model_name_or_path)


# Note: Alternatively, load the base model and the PEFT adapter in one step with AutoPeftModelForCausalLM
#model = AutoPeftModelForCausalLM.from_pretrained(
#  fine_tuned_model_name_or_path,
#  device_map="auto",
#  torch_dtype=torch.float16
#)
# Load into a pipeline
#pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [85]:
model.push_to_hub("YanSte/fine_tuning_llama-2_chat_alpaca_dolly_hf")
tokenizer.push_to_hub("YanSte/fine_tuning_llama-2_chat_alpaca_dolly_hf")  

adapter_model.safetensors:   0%|          | 0.00/80.0M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/YanSte/fine_tuning_llama-2_chat_alpaca_dolly_hf/commit/191b0c6612aae2ba7d596212542a55206b0d235b', commit_message='Upload tokenizer', commit_description='', oid='191b0c6612aae2ba7d596212542a55206b0d235b', pr_url=None, pr_revision=None, pr_num=None)

In [59]:
generator = pipeline(
    task="text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_length=200
)

In [63]:
eval_dataset = dataset["test"]

In [64]:
eval_dataset['messages'][11][:2]

[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.\nCONTEXT:\nriver, mountain, book',
  'role': 'system'},
 {'content': 'Pick out the correct noun from the following list.',
  'role': 'user'}]

In [65]:
chat = eval_dataset['messages'][11][:2]

print(chat)

[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.\nCONTEXT:\nriver, mountain, book', 'role': 'system'}, {'content': 'Pick out the correct noun from the following list.', 'role': 'user'}]


In [66]:
prompt = generator.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

print(prompt)

<|im_start|>system
You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.
CONTEXT:
river, mountain, book<|im_end|>
<|im_start|>user
Pick out the correct noun from the following list.<|im_end|>
<|im_start|>assistant



In [74]:
textGenerationConfig = get_arguments("TextGenerationConfig")

outputs = generator(
    prompt, 
    **textGenerationConfig
)

print(outputs[0]["generated_text"])

<|im_start|>system
You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.
CONTEXT:
river, mountain, book<|im_end|>
<|im_start|>user
Pick out the correct noun from the following list.<|im_end|>
<|im_start|>assistant
Answer: 📚 Book


In [78]:
chat = eval_dataset['messages'][11][:2]
# Load our test dataset
#eval_dataset = dataset["train"]#load_dataset("json", data_files="test_dataset.json", split="train")
print(f"Query:\n{chat}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.\nCONTEXT:\nriver, mountain, book', 'role': 'system'}, {'content': 'Pick out the correct noun from the following list.', 'role': 'user'}]
Generated Answer:
Answer: 📚 Book


You can now deploy your model to production. For deploying open LLMs into production, we recommend using Text Generation Inference (TGI). TGI is a purpose-built solution for deploying and serving Large Language Models (LLMs). It enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, Mixtral, StarCoder, T5 and more. Text Generation Inference is used by companies such as IBM, Grammarly, Uber, Deutsche Telekom, and many more. There are several ways to deploy your model, including:

- Deploy LLMs with Hugging Face Inference Endpoints
- Hugging Face LLM Inference Container for Amazon SageMaker
- DIY

If you have Docker installed, you can use the `docker run` command in the commented-out cell further below to start the inference server.

**Note:** Make sure that you have enough GPU memory to run the container. Restart the kernel to remove all allocated GPU memory from the notebook.

In [82]:
chat = eval_dataset['messages'][5][:2]

prompt = generator.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

outputs = generator(
    prompt, 
    **textGenerationConfig
)


print(f"Query:\n{chat}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.\nCONTEXT:\nCongosto (Spanish pronunciation: [koŋˈɡosto]) is a village and municipality located in the region of El Bierzo (province of León, Castile and León, Spain) . It is located near to Ponferrada, the capital of the region. The village of Congosto has about 350 inhabitants.\n\nIts economy was traditionally based on agriculture, wine and coal mining. Nowadays, most of the inhabitants work on the surrounding area on activities such as wind turbine manufacturing or coal mining.\n\nCongosto also a large reservoir in its vicinity, the Barcena reservoir, to which many tourists visit during the summer.', 'role': 'system'}, {'content': 'Where is the village of Congosto', 'role': 'user'}]
Generated Answer:
The village of Congosto is located in the region of El Bierzo, province of León, Castil

In [83]:
chat = eval_dataset['messages'][2][:2]

prompt = generator.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

outputs = generator(
    prompt, 
    **textGenerationConfig
)


print(f"Query:\n{chat}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request based on the provided CONTEXT.\nCONTEXT:\nShe is taking a short break from practice', 'role': 'system'}, {'content': 'Rewrite this sentence, "She is taking a short break from practice"', 'role': 'user'}]
Generated Answer:
Answer: She is enjoying a brief respite from her training.


In [84]:
chat = eval_dataset['messages'][100][:2]

prompt = generator.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

outputs = generator(
    prompt, 
    **textGenerationConfig
)


print(f"Query:\n{chat}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
[{'content': 'You are an AI who answering question. Users will ask you questions in English and you will write a response that appropriately completes the request.', 'role': 'system'}, {'content': 'List four strategies for teaching children to read.', 'role': 'user'}]
Generated Answer:
Teaching Children to Read: Strategies and Tips
1. Phonics-based instruction: This approach focuses on teaching children how to decode words by sounding out letters and syllables. It is a comprehensive method that helps children understand the relationship between sounds and letters, making it easier for them to recognize and spell words.
2. Whole language approach: This strategy emphasizes a holistic approach to reading, focusing on the meaning of text rather than individual letter sounds. Teachers use engaging stories and activities to help students connect with the material and develop a love for reading.
3. Balanced literacy: This approach combines both phonics-based instruction and whole langu

In [None]:
#%%bash
# model=$PWD/{args.output_dir} # path to model
#model=$(pwd)/Model_fine_tuned # path to model
#num_shard=1             # number of shards
#max_input_length=1024   # max input length
#max_total_tokens=2048   # max total tokens

#docker run -d --name tgi --gpus all -ti -p 8080:80 \
#  -e MODEL_ID=/workspace \
#  -e NUM_SHARD=$num_shard \
#  -e MAX_INPUT_LENGTH=$max_input_length \
#  -e MAX_TOTAL_TOKENS=$max_total_tokens \
#  -v $model:/workspace \
#  ghcr.io/huggingface/text-generation-inference:latest

In [None]:
#import requests as r
#from transformers import AutoTokenizer
#from datasets import load_dataset
#from random import randint

# Load our test dataset and tokenizer again
#tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_name_or_path)
#eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
#rand_idx = randint(0, len(eval_dataset) - 1)  # randint is inclusive on both ends

# generate the same prompt as for the first local test
#prompt = tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
#request= {"inputs":prompt,"parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}

# send request to inference server
#resp = r.post("http://127.0.0.1:8080/generate", json=request)

#output = resp.json()["generated_text"].strip()
#time_per_token = resp.headers.get("x-time-per-token")
#prompt_tokens = resp.headers.get("x-prompt-tokens")

# Print results
#print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
#print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
#print(f"Generated Answer:\n{output}")
#print(f"Latency per token: {time_per_token}ms")
#print(f"Prompt tokens: {prompt_tokens}")

In [None]:
#!docker stop tgi