

# Fine-Tuning the Llama 2 Model: An Experiment
In this experiment, we primarily focus on fine-tuning the Llama 2 model. After the fine-tuning process, we will evaluate the model's performance by measuring its perplexity and cross-entropy. To enhance ,,the results post fine-tuning, students are encouraged to adjust dataset sizes or tweak model parameters.

##  Load and Import Libraries

Before diving into any deep learning or model fine-tuning tasks, it's essential to set up our environment correctly. This section ensures that all necessary libraries and dependencies are installed and ready to use.

Here's a breakdown of the process:

- **!pip install**: The `pip` command is used to install Python packages. The `!` at the beginning allows us to run shell commands directly from the notebook.

  - **accelerate==0.21.0**: A library developed by Hugging Face to make distributed training and hardware acceleration easy.
  
  - **peft==0.4.0**: A library specific to LLaMa fine-tuning.
  
  - **bitsandbytes==0.40.2**: Assists in efficient training by utilizing 4-bit quantization for model weights.
  
  - **transformers==4.31.0**: The core library from Hugging Face that provides pre-trained models, tokenizers, and training utilities.
  
  - **trl==0.4.7**: A library for training reinforcement learning models.
  
  - **evaluate**: Presumably, a package that provides evaluation utilities (Note: ensure this package's relevance to your tasks).

Ensure that all these libraries are successfully installed before proceeding. In case of any issues or conflicts, consider creating a virtual environment or seeking updated versions of the packages.


In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
!pip install evaluate

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/244.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.7/244.2 kB[0m [31m1.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m115.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.4/77.4 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m82.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━

After ensuring all required libraries are installed, the next step is to import the relevant classes and modules that will be used throughout our fine-tuning and testing process. Here's a brief overview of each import:

- **os**: The built-in Python module that provides functionalities to interact with the operating system, mainly for file and directory operations.

- **torch**: PyTorch is an open-source machine learning library widely used for deep learning tasks.

- **datasets**: A library from Hugging Face for easily accessing and using datasets. Here, we specifically use the `load_dataset` function to load our dataset.

- **transformers**: The core library from Hugging Face offering pre-trained models, tokenizers, and other utilities.
  
  - **AutoModelForCausalLM**: A class to instantiate models for causal (unidirectional) language modeling tasks.
  
  - **AutoTokenizer**: A class that provides tokenizers compatible with the pre-trained models.
  
  - **BitsAndBytesConfig**: Configuration class for 4-bit quantization from the bitsandbytes library.
  
  - **HfArgumentParser**: A parser specifically designed for Hugging Face libraries' arguments.
  
  - **TrainingArguments**: Defines training-related parameters such as batch size, learning rate, etc.
  
  - **pipeline**: Provides a high-level, easy-to-use API for performing tasks with models (e.g., text generation).
  
  - **logging**: A utility to control and handle logging behaviors.
  
- **peft**:
  
  - **LoraConfig**: Configuration for LLaMa's LoRA.
  
  - **PeftModel**: Not explicitly used in the provided code but could be a model variant for fine-tuning.
  
- **trl**: The library for training reinforcement learning models.
  
  - **SFTTrainer**: A trainer class designed for supervised fine-tuning tasks.

It's essential to understand the purpose of each import as it provides context to the functionalities and utilities we will leverage in the subsequent steps.


In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## Model and Dataset Selection

### Model Choice
We have opted for the model named **NousResearch/llama-2-7b-chat-hf** as it's based on the Llama 2 architecture and has already been pre-trained across a variety of tasks, demonstrating commendable performance.


### Dataset Choice
The dataset **mlabonne/guanaco-llama2-1k** was chosen specifically because it's tailored for the Llama 2 model and is suitable for the experiment at hand. Detailed information about this dataset can be found on the [Hugging Face Datasets Hub](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k)
. Depending on your GPU resources, you might opt for datasets of different sizes for fine-tuning. Larger datasets might offer better fine-tuning results but will also increase computational demands.

**Note**: The size of the dataset and model parameters should be balanced based on the GPU resources available to you. Choosing a very large dataset or model might lead to GPU memory exhaustion.

In [3]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco-6100"

## QLoRa and bitsandbytes parameters

QLoRA (Quantized LoRA) is a new approach that allows to finetune a very large language model using just a single big enough GPU - so it is suited for those who run notebooks on Colab! QLoRA reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-
bit fully finetuned baseline. This approach is based on the following two methods.


(1) **Block-wise k-bit Quantization**: the input tensor is chunked into blocks that are independently quantized (quantization is the process of discretizing an input from a representation that holds more information to a representation with less information. It often means taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to 8-bit Integers).

(2) **LoRA (Low-Rank Adapters)**: a method that reduces memory requirements by using a small set of trainable parameters, often termed adapters, while not updating the full model parameters which remain fixed.

In [4]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1


################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

## Training Parameters Configuration

This section outlines various parameters that influence the training process. By adjusting these parameters, you can experiment with and fine-tune the performance of your model. Familiarize yourself with each parameter's purpose, and consider modifying them as part of your learning experience.

### TrainingArguments Parameters

- **output_dir**: Directory where model predictions and checkpoints will be stored.
- **num_train_epochs**: Number of training epochs.
- **fp16** & **bf16**: Enable reduced precision training for faster computation.
- **per_device_train_batch_size**: Batch size per GPU for training.
- ... [and so on for each parameter]

### SFT Parameters

- **max_seq_length**: Maximum sequence length to use.
- **packing**: Pack multiple short examples in the same input sequence for increased efficiency.
- ... [and so on for each parameter]

Remember, parameter tuning is an iterative process. Experiment with different values to see how they affect the performance of your model.



In [5]:
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 25

# Log every X updates steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

## Dataset Loading, Splitting, and Evaluation Metrics Calculation

This section of the code serves several purposes: Firstly, it loads a specified dataset. Upon loading, it then divides the dataset into three distinct sets: training, testing, and validation. Once the data is processed, the code also includes a function designed to compute evaluation metrics. Specifically, it calculates the perplexity and cross-entropy based on the model's predictions.

In [6]:
from transformers import EvalPrediction
import torch
import math
from datasets import load_dataset

# - Loading a dataset
# - Splitting it into training, testing, and validation sets
# - Computing evaluation metrics (perplexity and cross-entropy) for model predictions

# Function to compute evaluation metrics
def compute_evaluation_metrics(prediction_data):
    """
    Computes evaluation metrics for a given prediction.

    Parameters:
    - prediction_data (EvalPrediction): Contains the predictions and true labels.

    Returns:
    - dict: A dictionary containing perplexity and cross_entropy.
    """
    model_outputs = torch.from_numpy(prediction_data.predictions)
    true_labels = torch.from_numpy(prediction_data.label_ids)
    cross_entropy_loss = torch.nn.functional.cross_entropy(model_outputs.view(-1, tokenizer.vocab_size), true_labels.view(-1))

    return {
        'perplexity': math.exp(cross_entropy_loss),
        'cross_entropy': cross_entropy_loss
    }

# Load the dataset
dataset = load_dataset(dataset_name, split="train")

# Splitting the Dataset
# Given the potential constraints of using the T4 GPU on Colab, it's advisable to reduce the number of test cases
# during the testing phase. This helps in preventing issues related to GPU resource limitations.
train_test_split = dataset.train_test_split(test_size=0.01, shuffle=True, seed=2023)
train_data = train_test_split["train"]
test_data = train_test_split["test"]

# Display the number of samples in each split
print(f"Number of training samples: {len(train_data)}")
print(f"Number of testing samples: {len(test_data)}")


Downloading readme:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/967k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Number of training samples: 990
Number of testing samples: 10


## Configuring 4-bit Quantization with BitsAndBytes

Quantization is a technique that reduces the numerical precision of model weights, thus making the model smaller and often faster at the expense of a slight reduction in accuracy. In this section, we're leveraging `BitsAndBytes` to apply 4-bit quantization.

- **compute_dtype**: Determines the data type for computation, either `float16` or `float32`.
- **bnb_config**: Configuration for `BitsAndBytes`, which includes parameters to specify the type of 4-bit quantization and the data type for computation.
- **GPU compatibility check**: The code snippet also includes a check to determine if the GPU supports `bfloat16`, which can further accelerate training. If your GPU is compatible, consider enabling `bf16` training.

Remember, using quantization and reduced-precision training can accelerate the training process, but it's essential to monitor the model's performance to ensure the accuracy remains acceptable.


In [7]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

In [8]:
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)


##  Loading the Llama 2 Model with 4-bit Precision and its Tokenizer

In this section, we're initializing the Llama 2 model with specific configurations and then loading its associated tokenizer.

- **Model Loading**:
  - **model_name**: The name of the pre-trained Llama 2 model from Hugging Face.
  - **quantization_config**: This parameter applies the 4-bit quantization configuration to the model, enabling it to process data in reduced precision for faster computation.
  - **device_map**: Specifies which GPU the model should be loaded on.
  - Additional configurations, like `use_cache` and `pretraining_tp`, are set for efficient memory usage and to specify the number of prediction tasks, respectively.

- **Tokenizer Loading**:
  - After loading the base model, we proceed to load the tokenizer, which is essential for converting text into a format that can be understood by the model.
  - **trust_remote_code**: This parameter ensures that any custom code associated with the tokenizer is executed.
  - Padding configurations (`pad_token`, `padding_side`) are set to ensure sequences are properly aligned for model input, and to address any potential issues with reduced-precision training.

By understanding and tweaking these configurations, you can further adapt and optimize the loading process to meet specific requirements or experiment with variations.



In [9]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

(…)ma-2-7b-chat-hf/resolve/main/config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

(…)esolve/main/model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

(…)t-hf/resolve/main/generation_config.json:   0%|          | 0.00/179 [00:00<?, ?B/s]

(…)at-hf/resolve/main/tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

(…)2-7b-chat-hf/resolve/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

(…)b-chat-hf/resolve/main/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

(…)-hf/resolve/main/special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

##  Loading the LoRA Configuration

In this section, we're initializing the Llama 2 model with the **LoRA (Localized Re-parametrization Approximation)** configuration. This configuration is designed to improve the fine-tuning capabilities of the model, especially on smaller datasets, by introducing additional adaptable parameters. Here's a breakdown of the configuration parameters:

- **lora_alpha**: Controls the scaling of the initial weights in the LoRA layers. A higher value may improve the ability to adapt to new tasks but can also potentially lead to overfitting.

- **lora_dropout**: Specifies the dropout rate for the LoRA layers, which helps in preventing overfitting.

- **r**: Represents the rank for the low-rank approximation in the LoRA layers. By defining the rank, you can control the complexity and adaptability of the LoRA layers.

- **bias**: Determines the type of bias used in the LoRA layers. In our configuration, we've set it to "none", meaning no bias is used.

- **task_type**: This is set to "CAUSAL_LM" indicating that our task is a causal language modeling task, which predicts the next word in a sequence based on the previous words.

By understanding and adjusting these parameters, you can modify the LoRA configuration to fine-tune its behavior, balancing between model adaptability and the risk of overfitting.


In [10]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

## Setting the Training and Supervised Fine-Tuning (SFT) Parameters

In this segment, we establish the foundational parameters that drive the training process of the Llama 2 model. Here's a brief overview:

### **Training Parameters**

- **output_dir**: The directory where model predictions and checkpoints will be stored.
  
- **num_train_epochs**: Defines the number of times the model will iterate over the entire dataset.

- **per_device_train_batch_size**: Specifies the number of samples to work with in one update of model parameters.

- **gradient_accumulation_steps**: Determines how many steps to take before updating the model's weights.

- **optim**: The optimization algorithm used for updating the model's weights.

- **save_steps** & **logging_steps**: Controls how frequently the model checkpoints are saved and logs are generated, respectively.

- **learning_rate** & **weight_decay**: Set the rate at which the model learns and the decay applied to weights over time, respectively.

- **fp16** & **bf16**: Options to enable 16-bit floating point or bfloat16 precision training, which can speed up the training process.

- **max_grad_norm**: Implements gradient clipping to prevent exceedingly large gradient values that can destabilize the training.

- **max_steps**: Overrides the `num_train_epochs` by specifying the exact number of training steps.

- **warmup_ratio**: Indicates the fraction of steps used for a linear warmup from 0 to the set learning rate.

- **group_by_length**: Groups sequences of similar lengths together, enhancing efficiency.

- **lr_scheduler_type**: Determines the type of learning rate schedule, influencing how the learning rate changes during training.

### **Supervised Fine-Tuning Parameters**

The `SFTTrainer` is a specialized trainer optimized for Supervised Fine-Tuning:

- **model**: The actual model to be trained.

- **train_dataset** & **eval_dataset**: The datasets used for training and evaluation respectively.

- **peft_config**: Refers to the previously set LoRA configuration.

- **dataset_text_field**: Field name that contains the actual text data in the dataset.

- **max_seq_length**: Specifies the maximum length of the sequences for processing.

- **tokenizer**: Helps in converting text data into a format suitable for model processing.

- **args**: The aforementioned training arguments.

- **packing**: Determines if multiple short examples will be packed into a single input sequence.

- **compute_metrics**: Specifies the function used to compute evaluation metrics for the model's predictions.

With a clear understanding of these parameters, you can tailor the training and fine-tuning process to suit specific needs and constraints.


In [11]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    compute_metrics=compute_evaluation_metrics,
)



Map:   0%|          | 0/990 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

## Train model

In [13]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
25,1.3782
50,1.6555
75,1.1624
100,1.4308
125,1.1977
150,1.4024
175,1.1683
200,1.4513
225,1.1574




In [14]:
from google.colab import drive
drive.mount('/content/gdrive')

model_save_name = 'Llama2_1.pt'
path = "/content/gdrive/My Drive/DCAI_lab/Lab-A3"

torch.save(trainer.model.state_dict(), path + '/' + model_save_name)

Mounted at /content/gdrive


The training can be very long, depending on the size of your dataset.

In [14]:
%load_ext tensorboard
%tensorboard --logdir results/runs

<IPython.core.display.Javascript object>

##  Testing the Model

After training our model, it's crucial to evaluate its capabilities. This segment is dedicated to testing the model using a sample prompt and observing its generated response.

Here's a brief walkthrough:

- **logging.set_verbosity(logging.CRITICAL)**: This line ensures that only critical logs are shown, ignoring less severe warnings. It's useful for a cleaner output.

- **prompt**: The question or statement we want the model to respond to. For this instance, we're curious about "What is LLAMA?".

- **pipeline**: Hugging Face's `pipeline` functionality provides a straightforward way to run specific tasks. Here, we're setting it up for "text-generation".

  - **task**: Specifies the type of task. In this case, it's "text-generation".
  - **model**: The trained model that we want to test.
  - **tokenizer**: The tokenizer used during training, responsible for converting text into a format the model understands.
  - **max_length**: The maximum length for the generated response. We've set it to 200 characters to ensure responses are concise yet meaningful.

- **result**: Here, we invoke the pipeline with our prompt to get the model's response. We're wrapping the prompt with special tokens (`<s>[INST]` and `[/INST]`), guiding the model to understand that we're seeking instructional text.

Finally, we print out the generated text to observe how well the model responds to our query.

By testing the model with different prompts, you can gauge its strengths, weaknesses, and areas of improvement.


In [15]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = "What is LLAMA?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] What is LLAMA? [/INST] LLaMA is an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. It is trained on a massive dataset of text from the internet and can generate human-like responses to a wide range of topics and questions. LLaMA is a powerful tool for businesses, organizations, and individuals who want to create chatbots, virtual assistants, and other conversational AI applications. It is also a valuable tool for researchers and developers who want to explore the capabilities and limitations of conversational AI.

LLaMA is a generative AI model that uses a combination of natural language processing (NLP) and machine learning (ML) to generate human-like responses. It is trained on a large dataset of text from the internet, which it uses to generate responses to user input. LLaMA is a powerful tool for businesses,


## Submit to Kaggle for Text Generation Evaluation
The code provided below is set up to generate the necessary submission file for our Kaggle competition. After executing this code, you'll obtain a `submission.csv` file that contains the model's predictions. To evaluate how well your model performs, you must submit this generated file on the Kaggle competition page.

🔗 [Submit your file here on Kaggle!](https://www.kaggle.com/competitions/dsaa-6100-finetune-llm)


In [17]:
import gdown
import pandas as pd
from transformers import pipeline

# download file
url = 'https://drive.google.com/uc?export=download&id=1aJs4sFPtF8FilWVHw888hkb8AGiOTo27'
output = 'test_file.csv'
submission_file_path = 'path/to/your/submission.csv'
gdown.download(url, output, quiet=False)

# Load the test_file.csv
test_df = pd.read_csv('test_file.csv')

# List to store predictions
predictions = []

# Initialize the pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

# Loop through each row in the test DataFrame and make predictions
for index, row in test_df.iterrows():
    input_content = row['Input Content']
    result = pipe(f"{input_content}")
    predictions.append(result[0]['generated_text'])

# Create a new DataFrame with Id and label
submission_df = pd.DataFrame({
    'Id': test_df['ID'],
    'label': predictions
})

# Save the DataFrame to CSV file
submission_file_path = 'submission.csv'
submission_df.to_csv(path + '/' + submission_file_path, index=False)

print(f"File '{submission_file_path}' has been saved to Google Drive at '{output_dir}'.")


Downloading...
From: https://drive.google.com/uc?export=download&id=1aJs4sFPtF8FilWVHw888hkb8AGiOTo27
To: /content/test_file.csv
100%|██████████| 6.33k/6.33k [00:00<00:00, 5.19MB/s]


File 'submission.csv' has been saved to Google Drive at './results'.


## Evaluation Metrics Explained
- eval_loss: Represents the difference between the model's predictions and the actual data. A lower loss typically indicates that the model's predictions are closer to the true labels.

- eval_perplexity: Measures the uncertainty of the model's predictions. A lower perplexity means the model is more confident in its predictions.

- eval_cross_entropy: The cross-entropy measures the difference between the model's predicted probabilities and the true labels. We want this value to be as low as possible.

- eval_runtime: The time required to evaluate the model.

- eval_samples_per_second: Shows how many samples the model can process per second, reflecting its speed.

- eval_steps_per_second: The number of optimization steps executed per second.

In [18]:
# Evaluate the model
trainer.evaluate()

{'eval_loss': 1.1594215631484985,
 'eval_perplexity': 1418797.7221776436,
 'eval_cross_entropy': 14.16532039642334,
 'eval_runtime': 10.2062,
 'eval_samples_per_second': 0.98,
 'eval_steps_per_second': 0.196,
 'epoch': 1.0}

# Fine-Tuning Recommendations for Deep Learning Models

Kaggle's evaluation process utilizes BLEU and Jaccard similarity as its primary metrics. While the default configuration outlined above offers a foundational baseline score, there is ample room to optimize and refine your model's performance. Especially in the nuanced domain of natural language processing, there exists a wide array of strategies and methods that can be harnessed to elevate your model's efficacy. Below, we've curated a set of proven recommendations accompanied by relevant code snippets to guide your fine-tuning process.

For those who wish to broaden their horizons and delve deeper into more sophisticated models, datasets, and fine-tuning strategies, the Hugging Face platform is a valuable resource. Explore more by navigating to [huggingface.co](https://huggingface.co).


## Switching Pre-trained Models

At present, we're utilizing the **NousResearch/llama-2-7b-chat-hf** model. Venturing into other pre-trained models, particularly those tailored more closely to our specific task, may pave the way for enhanced performance. For a broader selection of models, we can visit [huggingface.co](https://huggingface.co).


In [20]:
model_name = "ModelName"

📝 Note: The choice of a pre-trained model can influence the fine-tuning process significantly. Larger models have more parameters and might capture nuances better but may also require more resources and time.

## Experimenting with Different Datasets for Fine-tuning
Besides the dataset **mlabonne/guanaco-llama2-1k** that we are currently using, consider leveraging datasets from related domains or even combining multiple datasets to achieve a richer fine-tuning source.

In [None]:
# Loading an additional dataset
another_dataset = load_dataset("another_dataset_name", split="train")
# Combining two datasets
combined_dataset = concatenate_datasets([dataset, another_dataset])

📝 Note: Using diverse and domain-specific datasets can lead to better generalization and task-specific performance improvements.

## Tweaking LoRA Parameters
Experimenting with lora_r, lora_alpha, and lora_dropout values might result in enhanced fine-tuning outcomes.

In [None]:
lora_r = 128
lora_alpha = 32
lora_dropout = 0.2

📝 Note: LoRA (Low-Rank Adaptation) parameters control the trade-off between model flexibility and the amount of new information added during fine-tuning. Adjusting them requires monitoring performance closely.

## Modifying Optimizer and Learning Rate Schedule
We're currently employing the paged_adamw_32bit optimizer with a constant learning rate schedule. Trying out different optimizers or learning rate adjustment strategies might be beneficial.

In [None]:
optim = "adamw"
lr_scheduler_type = "linear"

📝 Note: The optimizer and its settings play a crucial role in model convergence and overall performance. Depending on the dataset's nature and size, different optimizers and learning rate schedules might be more effective.

## Refining Training Strategy
Adjusting parameters such as gradient_accumulation_steps, max_grad_norm, weight_decay, and experimenting with different batch sizes and learning rates can potentially lead to more optimal training.

In [None]:
gradient_accumulation_steps = 2
max_grad_norm = 1.0
weight_decay = 0.01
per_device_train_batch_size = 2
learning_rate = 1e-5

📝 Note: The training strategy directly affects how the model updates its weights. Depending on the data's characteristics and the chosen pre-trained model, varying these parameters can lead to faster convergence or better generalization.

## Regularization

Introducing dropout or weight decay can assist in preventing overfitting.

Code example:

In [None]:
model = YourModel(dropout_rate=0.3)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

## Gradient Clipping

To prevent gradient explosions, you can clip gradients.

Code example:

In [None]:
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

## Adjusting Maximum Generation Length

Limiting the text length produced by the model can help it focus on shorter, more relevant answers.

Code example:

In [None]:
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=150)

## Temperature Tuning

Temperature can affect the diversity of the model's outputs. A lower value (e.g., 0.2) will make the output more deterministic, while a higher value (e.g., 1.0) introduces more randomness.

Code example:

In [None]:
result = pipe(f"{input_content}", temperature=0.7)

## Using Prefixes to Guide the Model

Adding a prefix to the model's input can help steer its generation.

Code example:

In [None]:
result = pipe(f"Summarize: {input_content}")

Experiment with the methods above and make multiple submissions to Kaggle to observe any improvements in your model's score. Continuous experimentation and fine-tuning are key to enhancing model performance!

# Versatility of Large Language Model
Beyond text generation, large models are multifaceted tools trained to perform a plethora of tasks. From sentiment analysis, text classification, and named entity recognition to more advanced tasks like answering complex questions, generating code, and even describing images, the applications are vast and varied.

## Open Questions and Exploration

The vast capabilities of these models present numerous open questions. How can they be best fine-tuned for niche applications? In what innovative ways can they be integrated into different industries or disciplines?

We encourage you all to embark on this open exploration. There's so much potential yet to be harnessed, and sometimes the most groundbreaking discoveries come from the most unexpected experiments.

As we wrap up this notebook, think of this as an invitation to delve deeper, to probe and to ponder. And when you uncover something new, or even just intriguing, do bring it to the fore for the community to see, learn, and build upon.

The journey of exploration is always better when shared. Let's journey together!