## Using Parameter-Efficient Fine-Tuning on Llama 3 8B with One Intel&reg; Gaudi&reg; 2 AI Accelerator
This example fine-tunes the Llama 3 8B model with Parameter-Efficient Fine-Tuning (PEFT) and then runs inference on a text prompt. It uses the Llama 3 8B model from the Hugging Face Hub together with two task examples from the Optimum Habana library: language modeling for fine-tuning and text generation for inference. Optimum Habana is optimized for deep learning training and inference on first-generation Gaudi and Gaudi 2, and its examples cover tasks such as text generation, language modeling, and question answering. For all the examples and validated models, see the [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana#validated-models).

The fine-tuning step runs the language-modeling task from Optimum Habana on the `timdettmers/openassistant-guanaco` dataset.

### Parameter-Efficient Fine-Tuning with Low-Rank Adaptation
Parameter-Efficient Fine-Tuning (PEFT) adapts a large pre-trained language model to a specific task while keeping compute and memory demands low. The starting point is a model that has already learned general language skills from a large text corpus; such models are large and expensive to train. Instead of updating all of the pre-trained weights, PEFT freezes them and trains only a small number of added parameters. This sharply reduces the memory needed for gradients and optimizer state and keeps the saved task-specific checkpoint small, and on many tasks the result comes close to the accuracy of full fine-tuning.

Low-Rank Adaptation (LoRA), the PEFT method used in this example, attaches a pair of small trainable matrices to selected weight matrices of the model. For a frozen weight `W` of shape `d_out x d_in`, LoRA learns `B` (`d_out x r`) and `A` (`r x d_in`) with a small rank `r`, and the layer uses `W + (alpha / r) * B A` in place of `W`. Only `A` and `B` are trained, so each adapted matrix adds `r * (d_in + d_out)` parameters instead of `d_in * d_out`. In the fine-tuning command, `--lora_rank`, `--lora_alpha`, and `--lora_target_modules` set `r`, `alpha`, and which projection matrices are adapted.
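To make the low-rank update concrete, here is a minimal plain-Python sketch of a LoRA-adapted linear layer, assuming the PyTorch `nn.Linear` shape convention (weight of shape out x in). The helper names (`lora_forward`, `merge`, and so on) are hypothetical, the matrix sizes are toy values, and this is not the PEFT library's implementation. It checks two properties: with `B` initialized to zero the adapted layer reproduces the frozen layer, and the adapter path gives the same output as folding `(alpha / r) * B A` into the weight.

```python
import random

def matmul(a, b):
    # Plain-Python matrix product: a is (n x k), b is (k x m), both lists of rows.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(m, s):
    return [[s * x for x in row] for row in m]

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path x W^T plus the scaled low-rank path (alpha / r) x A^T B^T.

    W is (out, in) and frozen; A is (r, in) and B is (out, r), the only trained parts.
    """
    base = matmul(x, transpose(W))
    low_rank = matmul(matmul(x, transpose(A)), transpose(B))
    return add(base, scale(low_rank, alpha / r))

def merge(W, A, B, alpha, r):
    """Fold the adapter into a single weight: W' = W + (alpha / r) B A."""
    return add(W, scale(matmul(B, A), alpha / r))

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_in, d_out, r, alpha = 8, 8, 2, 16    # toy sizes; the notebook uses rank 8, alpha 16
W = rand_matrix(d_out, d_in, rng)      # frozen pre-trained weight
A = rand_matrix(r, d_in, rng)          # trainable, random initialization
B = [[0.0] * r for _ in range(d_out)]  # trainable, zero initialization
x = rand_matrix(3, d_in, rng)          # a batch of 3 input vectors

# With B = 0 the adapted layer matches the frozen base layer exactly.
y_start = lora_forward(x, W, A, B, alpha, r)
y_base = matmul(x, transpose(W))

# Simulate a trained adapter with a random B: the adapter path equals the merged weight.
B = rand_matrix(d_out, r, rng)
y_adapter = lora_forward(x, W, A, B, alpha, r)
y_merged = matmul(x, transpose(merge(W, A, B, alpha, r)))

trainable = r * (d_in + d_out)  # entries in A and B
frozen = d_in * d_out           # entries in W
print(f"trainable={trainable}, frozen={frozen}")
```

Zero-initializing `B` means training starts from exactly the pre-trained model's behavior, and merging the adapter after training means it adds no extra matrix multiplications at inference time.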

#### All imports

In [None]:
import sys
import os
import subprocess
import shutil
import ipywidgets as widgets
from IPython.display import display
import threading

In [None]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials

#### These package versions are specific to Intel Gaudi software (SynapseAI) 1.15

In [None]:
!{sys.executable} -m pip install peft==0.11.1

In [None]:
!{sys.executable} -m pip install -q optimum-habana==1.11.1
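Because these pins only match one software release, it can help to confirm what actually got installed before continuing. Below is a minimal sketch using the standard-library `importlib.metadata`; the `EXPECTED` table simply repeats the versions pinned above, and `find_version_mismatches` is a hypothetical helper, not part of any Gaudi tooling.

```python
from importlib import metadata

# Versions pinned in the cells above for Intel Gaudi software 1.15
EXPECTED = {"peft": "0.11.1", "optimum-habana": "1.11.1"}

def find_version_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, installed)} for every package that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches

problems = find_version_mismatches(EXPECTED)
print("All pinned versions installed." if not problems else f"Version mismatches: {problems}")
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment.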

In [None]:
def check_repo_version(repo_path, expected_version):
    """
    Checks whether the repository is checked out at the expected tag.

    Parameters:
    repo_path (str): The path to the repository.
    expected_version (str): The expected tag, e.g. "v1.11.1".

    Returns:
    bool: True if the repository is at the expected version, False otherwise.
    """
    try:
        # Run git inside the repository without changing the notebook's working directory
        result = subprocess.run(['git', 'describe', '--tags'], cwd=repo_path,
                                capture_output=True, text=True, check=True)
        current_version = result.stdout.strip()

        # `git describe --tags` prints the bare tag only when HEAD sits exactly on that tag
        return current_version == expected_version
    except (OSError, subprocess.CalledProcessError) as e:
        print(f"Error checking repository version: {e}")
        return False

def clone_repo(repo_url, branch, repo_path):
    """
    Clones the repository.

    Parameters:
    repo_url (str): The URL of the repository.
    branch (str): The branch or tag to clone.
    repo_path (str): The path to clone the repository into.

    Returns:
    None
    """
    try:
        subprocess.run(['git', 'clone', '-b', branch, repo_url, repo_path], check=True)
        print(f"Cloned repository from {repo_url} to {repo_path}.")
    except subprocess.CalledProcessError as e:
        print(f"Error cloning repository: {e}")

def main():
    repo_url = "https://github.com/huggingface/optimum-habana.git"
    branch = "v1.11.1"
    repo_path = "optimum-habana"

    if os.path.exists(repo_path):
        if check_repo_version(repo_path, branch):
            print(f"The repository at {repo_path} is already at version {branch}.")
        else:
            print(f"The repository at {repo_path} is not at version {branch}.")
            user_input = input(f"Do you want to remove the directory {repo_path} and clone the correct version? (yes/no): ")
            if user_input.lower() == 'yes':
                shutil.rmtree(repo_path)
                clone_repo(repo_url, branch, repo_path)
            else:
                print("The repository was not updated.")
    else:
        clone_repo(repo_url, branch, repo_path)

if __name__ == "__main__":
    main()

In [None]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/language-modeling

In [None]:
!{sys.executable} -m pip install -q -r requirements.txt

In [None]:
# Llama 3 is a gated model: accept its license on the Hugging Face model page, then log in with your access token.
#!huggingface-cli login --token 

In [None]:
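# Function to handle Hugging Face Hub authentication
from huggingface_hub import notebook_login, whoami

def authenticate_huggingface():
    try:
        user_info = whoami()
        print('Authorization token already provided')
        print(f"Logged in as: {user_info['name']}")
    except OSError:
        # No valid token stored yet: open the notebook login prompt
        notebook_login()

authenticate_huggingface()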

## Fine-Tuning the Model with PEFT and LoRA

We'll now run the fine-tuning with the PEFT method. PEFT methods fine-tune only a small number of extra model parameters, which greatly reduces computational and storage costs, and recent PEFT techniques can reach performance comparable to full fine-tuning on many tasks.

##### Here's a summary of the command that runs the fine-tuning; you'll run it in the next cell.
Note the following:
1. It uses the language-modeling example with LoRA: `run_lora_clm.py`.
2. It's very efficient: rank-8 LoRA adapters on the `q_proj` and `v_proj` attention matrices train only about 0.04% of the model's roughly 8B parameters.
3. Fine-tuning runs for only 3 epochs and takes less than 20 minutes on the openassistant-guanaco dataset.
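The fraction of trainable parameters can be estimated with a little arithmetic. The sketch below assumes the published Llama 3 8B configuration (32 layers, hidden size 4096, 8 key/value heads of dimension 128, about 8.03B parameters in total). It also adds the common estimate that Adam keeps two fp32 moments (8 bytes) per trainable parameter; activations, gradients, and the frozen bf16 weights are not counted. `lora_params` is a hypothetical helper.

```python
# Assumed Llama 3 8B shapes, taken from its published model config:
# 32 decoder layers, hidden size 4096, 8 key/value heads of dimension 128.
NUM_LAYERS = 32
HIDDEN = 4096
HEAD_DIM = 128
NUM_KV_HEADS = 8
TOTAL_PARAMS = 8_030_261_248      # assumed parameter count of the base model
ADAM_BYTES_PER_PARAM = 8          # assumption: two fp32 Adam moments per trainable parameter

def lora_params(d_in, d_out, rank):
    """Parameters LoRA adds to one (d_out x d_in) weight: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

rank = 8                                                     # --lora_rank=8
q_proj = lora_params(HIDDEN, HIDDEN, rank)                   # q_proj maps 4096 -> 4096
v_proj = lora_params(HIDDEN, NUM_KV_HEADS * HEAD_DIM, rank)  # v_proj maps 4096 -> 1024 (grouped-query attention)
trainable = NUM_LAYERS * (q_proj + v_proj)                   # --lora_target_modules "q_proj" "v_proj"

print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of base model:    {100 * trainable / TOTAL_PARAMS:.4f}%")
print(f"Adam state, full fine-tuning: {TOTAL_PARAMS * ADAM_BYTES_PER_PARAM / 1e9:.1f} GB")
print(f"Adam state, LoRA adapters:    {trainable * ADAM_BYTES_PER_PARAM / 1e6:.1f} MB")
```

Run as written, this reports 3,407,872 trainable parameters, about 0.04% of the base model, and roughly 64 GB versus 27 MB of Adam optimizer state.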

In [None]:
!python3 run_lora_clm.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset_name timdettmers/openassistant-guanaco \
    --bf16 True \
    --output_dir ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/model_lora_llama3_8B_finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 1e-4 \
    --warmup_ratio  0.03 \
    --lr_scheduler_type "constant" \
    --max_grad_norm  0.3 \
    --logging_steps 1 \
    --do_train \
    --do_eval \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --lora_rank=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --lora_target_modules "q_proj" "v_proj" \
    --dataset_concatenation \
    --max_seq_length 512 \
    --low_cpu_mem_usage True \
    --validation_split_percentage 4 \
    --adam_epsilon 1e-08
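With `--dataset_concatenation` and `--max_seq_length 512`, the training text is packed into fixed-length sequences rather than padded one example at a time. That is our reading of the flag; the sketch below shows the generic concatenate-and-chunk idea, not the script's own code. `pack_sequences` is a hypothetical helper and the integers stand in for token IDs.

```python
def pack_sequences(tokenized_examples, max_seq_length):
    """Concatenate token-ID lists end to end and cut them into equal-length blocks.

    Any tail shorter than max_seq_length is dropped, so every block is full and no
    padding is needed.
    """
    stream = [tok for example in tokenized_examples for tok in example]
    n_blocks = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_blocks)]

# Three short "tokenized" examples packed into blocks of 4 tokens
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pack_sequences(examples, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing spends no compute on padding tokens, at the cost of letting a block span the end of one conversation and the start of the next, and of dropping the final partial block.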


In [None]:
%cd ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/optimum-habana/examples/text-generation

In [None]:
!{sys.executable} -m pip install -r requirements.txt

# PEFT & LoRA

## Run the same prompt twice: once with the PEFT & LoRA fine-tuned model and once with the base model. Each run takes about a minute to warm up and produce an answer.

In [None]:
# Enter a prompt once; it is reused below to compare the PEFT and non-PEFT models
prompt = input("Enter a prompt for text generation: ")
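The `timdettmers/openassistant-guanaco` dataset stores each conversation as alternating `### Human:` and `### Assistant:` turns, so the fine-tuned adapter has mostly seen prompts in that layout. Wrapping your prompt the same way may give cleaner answers from the PEFT model; this is a suggestion, not something the notebook requires. The two helpers below are hypothetical plain-Python sketches.

```python
def to_guanaco_prompt(question):
    """Wrap a user question in the '### Human: ... ### Assistant:' layout of the training data."""
    return f"### Human: {question.strip()} ### Assistant:"

def extract_first_answer(generated_text):
    """Return the text after the first '### Assistant:' marker, cut at the next '### Human:' turn."""
    _, sep, rest = generated_text.partition("### Assistant:")
    if not sep:
        # No assistant marker: return the text unchanged apart from surrounding whitespace
        return generated_text.strip()
    return rest.split("### Human:", 1)[0].strip()

print(to_guanaco_prompt("What is LoRA?"))  # ### Human: What is LoRA? ### Assistant:
```

For example, run `prompt = to_guanaco_prompt(prompt)` before the generation cells, and pass generated text through `extract_first_answer` to keep only the first reply.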

## A bare-bones run with no output formatting, limited to 100 new tokens, just to check that everything works.

In [None]:
# Bare-bones run: no output formatting, 100 new tokens
import os
import shlex

# shlex.quote stops quotes, $ or backticks in the prompt from being interpreted by the shell
cmd = f'python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --batch_size 1 --do_sample --max_new_tokens 100 --n_iterations 4 \
          --use_hpu_graphs --use_kv_cache --bf16 --prompt {shlex.quote(prompt)} \
          --peft_model ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/model_lora_llama3_8B_finetuned'
print(cmd)
os.system(cmd)
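Pasting a prompt directly into a shell command string is fragile: double quotes, `$`, or backticks in the prompt can break the command or be interpreted by the shell. Below is a minimal sketch of a safer builder, assuming the same `run_generation.py` flags used in this notebook; `build_generation_cmd` is a hypothetical helper. Every argument goes through `shlex.quote`, and the adapter path is expanded with `os.path.expanduser` first, because a quoted `~` would no longer be expanded by the shell.

```python
import os
import shlex

def build_generation_cmd(prompt, max_new_tokens, peft_model=None):
    """Return a shell-safe run_generation.py command string (hypothetical helper)."""
    args = [
        "python3", "run_generation.py",
        "--model_name_or_path", "meta-llama/Meta-Llama-3-8B",
        "--batch_size", "1", "--do_sample",
        "--max_new_tokens", str(max_new_tokens), "--n_iterations", "4",
        "--use_hpu_graphs", "--use_kv_cache", "--bf16",
        "--prompt", prompt,
    ]
    if peft_model:
        # Expand ~ here: once quoted, the shell would treat it as a literal character
        args += ["--peft_model", os.path.expanduser(peft_model)]
    # Quote every argument so quotes, $ and backticks in the prompt reach the script verbatim
    return " ".join(shlex.quote(arg) for arg in args)

print(build_generation_cmd('Explain "LoRA" in one sentence', 100))
```

For example, `os.system(build_generation_cmd(prompt, 100, '~/Gaudi-tutorials/PyTorch/Single_card_tutorials/model_lora_llama3_8B_finetuned'))` corresponds to the bare-bones PEFT run.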

### Formatted output from the PEFT & LoRA fine-tuned model on Gaudi 2, using the prompt entered in the box below

In [None]:
# Define the function to run the command and stream the output into the widget
import shlex

def run_command(prompt, output_widget, status_widget):
    # shlex.quote stops quotes, $ or backticks in the prompt from being interpreted by the shell
    cmd = f'python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --batch_size 1 --do_sample --max_new_tokens 1000 --n_iterations 4 \
          --use_hpu_graphs --use_kv_cache --bf16 --prompt {shlex.quote(prompt)} \
          --peft_model ~/Gaudi-tutorials/PyTorch/Single_card_tutorials/model_lora_llama3_8B_finetuned'
    
    # Update status to indicate processing
    status_widget.value = "Processing..."
    
    # Run the command; stderr is merged into stdout so an unread stderr pipe cannot fill up
    # and stall the process, and any error messages appear in the output box
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    # Stream the output, turning literal "\n" sequences into line breaks and
    # separating the "### Human:" and "### Assistant:" turns
    for line in iter(process.stdout.readline, ''):
        formatted_line = line.replace('\\n', '\n').replace('### Human:', '\n\n### Human:').replace('### Assistant:', '\n\n### Assistant:')
        output_widget.value += formatted_line
    process.stdout.close()
    process.wait()
    
    # Report completion, or the exit code if the run failed
    status_widget.value = "Completed" if process.returncode == 0 else f"Failed (exit code {process.returncode})"

# Create an input box for the prompt
prompt_input = widgets.Text(
    value='',
    placeholder='Enter your prompt here',
    description='Prompt:',
    disabled=False
)

# Create a button to run the command
run_button = widgets.Button(
    description='Run Command',
    disabled=False,
    button_style='',
    tooltip='Click to run the command',
    icon='check'
)

# Create a text area to display the output
output_area = widgets.Textarea(
    value='',
    placeholder='Output will be displayed here...',
    description='Output:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='300px')
)

# Create a status widget to display the status
status_label = widgets.Label(
    value='',
    layout=widgets.Layout(width='100%')
)

# Define the function to handle button click
def on_button_click(b):
    prompt = prompt_input.value
    output_area.value = ''  # Clear previous output
    status_label.value = ''  # Clear previous status
    thread = threading.Thread(target=run_command, args=(prompt, output_area, status_label))
    thread.start()

# Attach the button click event to the handler function
run_button.on_click(on_button_click)

# Display the input box, button, status label, and output area
display(prompt_input, run_button, status_label, output_area)

# Base Model without PEFT & LoRA

In [None]:
# Base Llama-3-8B without the LoRA adapter, using the same prompt entered above
import os
import shlex

# shlex.quote stops quotes, $ or backticks in the prompt from being interpreted by the shell
cmd = f'python3 run_generation.py --model_name_or_path meta-llama/Meta-Llama-3-8B --batch_size 1 --do_sample --max_new_tokens 300 --n_iterations 4 \
          --use_hpu_graphs --use_kv_cache --bf16 --prompt {shlex.quote(prompt)}'
print(cmd)
os.system(cmd)