# Fine-tune Mistral 7B Instruct Model

The model we train will learn how to retrieve documents from our context documents and generate responses in a unified system.

We will:

* Install dependencies
* [Prepare a dataset](https://github.com/Uliana-Liakh/Fine-tune_LLM_QA/blob/main/Prepare_dataset.ipynb)
* Test Prompts
* Fine-tune the Mistral 7B Instruct Model
* Evaluate the fine-tuned model


### Install

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install git+https://github.com/huggingface/transformers trl accelerate torch bitsandbytes peft datasets evaluate rouge-score vllm

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-re2rt7ba
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-re2rt7ba
  Resolved https://github.com/huggingface/transformers to commit f213d5dd8cea1eb31d9b44dbdf268e4265a6d338
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
import torch
import pandas as pd
from trl import SFTTrainer
from random import randrange
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
import accelerate
from google.colab import drive

from vllm import LLM, SamplingParams
from huggingface_hub import notebook_login
import evaluate
import re

rouge = evaluate.load('rouge')

### Loading the dataset

In [4]:
df_full_clean = pd.read_csv('/content/drive/MyDrive/culture_train_982_clean.csv')
df_full_clean = df_full_clean.drop(['Unnamed: 0', 'Abstract'], axis=1)
df_full_clean.columns = df_full_clean.columns.str.lower()
df_full_clean.head()

| | question | answer |
|---|---|---|
| 0 | Why do some cultures, such as Latin American a... | Some cultures, such as Latin American and Midd... |
| 1 | What is the concept of "constructive criticism... | The concept of "constructive criticism" refers... |
| 2 | How can exploring and discussing the differenc... | Exploring and discussing the differences in va... |
| 3 | How does making an investment up front benefit... | Making an investment up front benefits individ... |
| 4 | Explain the principles-first and applications-... | The principles-first cultural tendency refers ... |


In [5]:
datasets=df_full_clean[:400]
train_dataset = datasets
test_dataset = df_full_clean[400:410]


In [6]:
datasets=Dataset.from_pandas(datasets)
train_dataset = Dataset.from_pandas(train_dataset)
test_dataset = Dataset.from_pandas(test_dataset)
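
The split above takes the first 400 rows for training and the next 10 for testing, in file order. If the CSV is ordered by topic, a shuffled split may give a more representative test set. The following is a minimal sketch of that idea, not the method used in this notebook: `split_indices` is a hypothetical helper, the seed of 42 is arbitrary, and the row count of 982 is an assumption taken from the CSV file name.

```python
import random

def split_indices(n_rows, n_train, n_test, seed=42):
    """Return disjoint, shuffled train and test row indices."""
    if n_train + n_test > n_rows:
        raise ValueError("train and test sizes exceed the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    return indices[:n_train], indices[n_train:n_train + n_test]

# 982 is assumed from the file name culture_train_982_clean.csv
train_idx, test_idx = split_indices(n_rows=982, n_train=400, n_test=10)
print(len(train_idx), len(test_idx), set(train_idx).isdisjoint(test_idx))
```

The index lists could then be used as `df_full_clean.iloc[train_idx]` and `df_full_clean.iloc[test_idx]` before calling `Dataset.from_pandas`.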


### Mistral 7B Instruct Model Testing

In [7]:
# Import model and tokenizer
# loading only 4-bit version
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",
                                             device_map='auto',
                                             load_in_4bit=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [11]:
# A prompting formatting function
def create_prompt_instruction(sample):
   return f"""### Instruction:
   Use the input below to create an instruction, which could have been used to generate the input using an LLM.

   ### Input
   {sample['answer']}

   ### Response:
   {sample['question']}
   """


In [30]:
print(create_prompt_instruction(datasets[0]))

### Instruction:
   Use the input below to create an instruction, which could have been used to generate the input using an LLM.

   ### Input
   Some cultures, such as Latin American and Middle Eastern cultures, may find it difficult to separate personal emotions from disagreements due to several factors. One possible reason is the emphasis on collectivism in these cultures, where the needs and interests of the group are prioritized over individual desires. This collective mindset can lead to a stronger emotional investment in disagreements, as individuals may feel personally attacked or threatened when their opinions or beliefs are challenged.

   ### Response:
   Why do some cultures, such as Latin American and Middle Eastern cultures, find it difficult to separate personal emotions from disagreements?
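
The printed prompt shows that the body lines carry the indentation of the source code, and the training and inference prompts in this notebook are written as separate f-strings. Because whitespace changes the tokens the model sees, it can help to build both prompts from one template. The following is a minimal sketch under that assumption; `build_prompt` is a hypothetical helper and the sample text is made up for illustration.

```python
INSTRUCTION = (
    "Use the input below to create an instruction, "
    "which could have been used to generate the input using an LLM."
)

def build_prompt(answer, question=None):
    """Build the prompt; append the target question only for training."""
    prompt = (
        "### Instruction:\n"
        f"{INSTRUCTION}\n\n"
        "### Input\n"
        f"{answer}\n\n"
        "### Response:\n"
    )
    if question is not None:
        prompt += f"{question}\n"
    return prompt

# Illustrative sample, not taken from the dataset
sample = {"answer": "Some cultures value harmony.",
          "question": "What do some cultures value?"}
train_text = build_prompt(sample["answer"], sample["question"])  # training text
infer_text = build_prompt(sample["answer"])                      # inference prompt
print(train_text)
```

With this layout the inference prompt is always an exact prefix of the training text, so the model is asked to continue from the same point it was trained on.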
   


In [31]:
def get_prompt():

    """
    Select a random sample from `datasets` and format it into a prompt.

    The prompt asks the model to write the instruction (question) that could have produced the sample's answer. The target question is left out, so the model has to generate it.

    Returns:
    tuple: A tuple containing:
        - str: The formatted prompt, ending with an empty "### Response:" section.
        - dict: The randomly selected sample.

    Note:
    `datasets` must be a non-empty dataset whose rows contain the keys 'answer' and 'question'. Empty datasets and missing keys are not handled and will raise an exception.
    """

    idx = randrange(len(datasets))

    sample = datasets[idx]

    return f"""### Instruction:
    Use the input below to create an instruction, which could have been used to generate the input using an LLM.

    ### Input
    {sample['answer']}

    ### Response:
    """, sample



def get_response(prompt, sample):

    """
    Generate a response based on a given prompt using an LLM.

    Parameters:
    - prompt (str): A text string to prompt LLM.
    - sample (dict): A dictionary containing a ground truth.

    Returns:
    dict: A dictionary containing:
        - 'llm_response': The full decoded model output (prompt plus generated text), including special tokens.
        - 'reference': The ground-truth question taken from the input sample.

    Note:
    Ensure that the 'tokenizer' and 'model' are loaded.
    """

    encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to("cuda:0")

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return {
      'llm_response': decoded_output[0], # LLM-generated response
      'reference': sample['question'] # Ground Truth
       }

#### Test Prompt 1:

In [None]:
# Get the prompt first
print('# TEST PROMPT')
test_prompt, sample = get_prompt()
print(test_prompt)

# Get the response
response = get_response(test_prompt, sample)

# print the llm response and ground truth
print("LLM result:")
print(response['llm_response'])
print('--'*40)
print("Ground truth:")
print(response['reference'])

# TEST PROMPT 
### Instruction: 
    Use the input below to create an instruction, which could have been used to generate the input using an LLM. 

    ### Input 
    The differing decision-making styles of Americans and Germans can impact the timeline of a typical project in several ways. Americans tend to have a more fast-paced and action-oriented decision-making style, often prioritizing efficiency and quick results. On the other hand, Germans tend to have a more thorough and cautious decision-making style, emphasizing careful analysis and consensus-building.

    ### Response:
    
LLM result:
<s> ### Instruction: 
    Use the input below to create an instruction, which could have been used to generate the input using an LLM. 

    ### Input 
    The differing decision-making styles of Americans and Germans can impact the timeline of a typical project in several ways. Americans tend to have a more fast-paced and action-oriented decision-making style, often prioritizing efficiency a

#### Test Prompt 2:

In [None]:
# Get the prompt first
print('# TEST PROMPT')
test_prompt, sample = get_prompt()
print(test_prompt)

# Get the response
response = get_response(test_prompt, sample)

# print the llm response and ground truth
print("LLM result:")
print(response['llm_response'])
print('--'*60)
print("Ground truth:")
print(response['reference'])

# TEST PROMPT
### Instruction: 
    Use the input below to create an instruction, which could have been used to generate the input using an LLM. 

    ### Input 
    The Chinese and Japanese cultures value harmony and maintaining positive relationships. Direct negative feedback and open disagreement can be seen as confrontational and may damage these relationships. Therefore, they tend to avoid such situations to preserve harmony and maintain a respectful atmosphere.

    ### Response:
    
LLM result:
<s> ### Instruction: 
    Use the input below to create an instruction, which could have been used to generate the input using an LLM. 

    ### Input 
    The Chinese and Japanese cultures value harmony and maintaining positive relationships. Direct negative feedback and open disagreement can be seen as confrontational and may damage these relationships. Therefore, they tend to avoid such situations to preserve harmony and maintain a respectful atmosphere.

    ### Response:
    
    In

### Fine-tuning the Mistral 7B Instruct Model

Fine-tuning the entire model requires a very large GPU, so I use a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation), which freezes the pre-trained weights and adds small trainable matrices to each layer.

In [8]:
# PEFT Config
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare the model for finetuning
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
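
To get a feel for what `r=64` and `lora_alpha=16` mean in practice, the sketch below counts the parameters LoRA would add for the four targeted projections. The layer shapes are assumptions about Mistral 7B (hidden size 4096, 32 layers, 1024-dimensional key/value projections) and are not read from the loaded model; `lora_params` is a hypothetical helper.

```python
# Assumed Mistral 7B shapes: hidden size 4096, 32 layers, and
# grouped-query attention with 1024-dimensional key/value projections.
HIDDEN, KV_DIM, N_LAYERS = 4096, 1024, 32
R, LORA_ALPHA = 64, 16  # values from the LoraConfig in this notebook

# (in_features, out_features) of each targeted projection
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
trainable = per_layer * N_LAYERS
frozen_targets = sum(i * o for i, o in target_shapes.values()) * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {trainable:,}")
print(f"frozen parameters in the targeted projections: {frozen_targets:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters hold roughly 54.5 million trainable parameters, a small fraction of the frozen weights they sit beside. On the actual model, `model.print_trainable_parameters()` from PEFT reports the real figure.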

In [9]:
# Define training arguments
args = TrainingArguments(
    output_dir = "mistral_instruct_qa",
    num_train_epochs = 5,
    per_device_train_batch_size = 1,
    warmup_ratio = 0.03,  # a fraction belongs in warmup_ratio; warmup_steps expects a step count
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    learning_rate=2e-4,
    lr_scheduler_type='constant',
    disable_tqdm=True
)

In [12]:
# Define SFTTrainer arguments
max_seq_length = 2048

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func= create_prompt_instruction,
    args=args,
    train_dataset=train_dataset
)

In [14]:
#Finetune 1
trainer.train()

# Save finetuned model
trainer.save_model("mistral_instruct_qa")

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 1.003, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 0.8241, 'learning_rate': 0.0002, 'epoch': 0.05}
{'loss': 0.7273, 'learning_rate': 0.0002, 'epoch': 1.01}
{'loss': 0.6596, 'learning_rate': 0.0002, 'epoch': 1.03}
{'loss': 0.6487, 'learning_rate': 0.0002, 'epoch': 1.06}
{'loss': 0.5388, 'learning_rate': 0.0002, 'epoch': 2.02}
{'loss': 0.5188, 'learning_rate': 0.0002, 'epoch': 2.04}
{'loss': 0.4811, 'learning_rate': 0.0002, 'epoch': 3.0}
{'loss': 0.3675, 'learning_rate': 0.0002, 'epoch': 3.03}
{'loss': 0.3534, 'learning_rate': 0.0002, 'epoch': 3.06}
{'loss': 0.2984, 'learning_rate': 0.0002, 'epoch': 4.01}
{'loss': 0.2356, 'learning_rate': 0.0002, 'epoch': 4.04}
{'loss': 0.2875, 'learning_rate': 0.0002, 'epoch': 4.07}
{'train_runtime': 344.0005, 'train_samples_per_second': 5.814, 'train_steps_per_second': 5.814, 'train_loss': 0.5341251978507409, 'epoch': 4.07}


In [13]:
#Finetune 2
trainer.train()

# Save finetuned model
trainer.save_model("mistral_instruct_qa")

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 1.0035, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 0.8241, 'learning_rate': 0.0002, 'epoch': 0.05}
{'loss': 0.7267, 'learning_rate': 0.0002, 'epoch': 1.01}
{'loss': 0.6615, 'learning_rate': 0.0002, 'epoch': 1.03}
{'loss': 0.6511, 'learning_rate': 0.0002, 'epoch': 1.06}
{'loss': 0.5478, 'learning_rate': 0.0002, 'epoch': 2.02}
{'loss': 0.5215, 'learning_rate': 0.0002, 'epoch': 2.04}
{'loss': 0.4832, 'learning_rate': 0.0002, 'epoch': 3.0}
{'loss': 0.3805, 'learning_rate': 0.0002, 'epoch': 3.03}
{'loss': 0.3567, 'learning_rate': 0.0002, 'epoch': 3.06}
{'loss': 0.3085, 'learning_rate': 0.0002, 'epoch': 4.01}
{'loss': 0.2539, 'learning_rate': 0.0002, 'epoch': 4.04}
{'loss': 0.2911, 'learning_rate': 0.0002, 'epoch': 4.07}
{'train_runtime': 344.0069, 'train_samples_per_second': 5.814, 'train_steps_per_second': 5.814, 'train_loss': 0.5392427811255822, 'epoch': 4.07}


### Evaluate and compare the base model responses & finetuned model responses

In [14]:
def get_prompt(sample):

    return f"""### Instruction:
    Use the input below to create an instruction, which could have been used to generate the input using an LLM.

    ### Input
    {sample['answer']}

    ### Response:
    """, sample

In [15]:
# Get references / Ground Truth the model will be evaluated against
references = [sample['question'] for sample in test_dataset]

#### Generate base model responses

In [17]:
# Load the original (base) model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",
                                             device_map="auto",load_in_4bit=True,
                                             use_cache=False)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [14]:
def get_response(prompt, sample):

    """
    Generate a response based on a given prompt using an LLM.

    Parameters:
    - prompt (str): A text string to prompt LLM.
    - sample (dict): A dictionary containing a ground truth.

    Returns:
    dict: A dictionary containing:
        - 'llm_response': The full decoded model output (prompt plus generated text), including special tokens.
        - 'reference': The ground-truth question taken from the input sample.

    Note:
    Ensure that the 'tokenizer' and 'model' are loaded.
    """

    encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model_inputs = encoded_input.to(device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return {
      'llm_response': decoded_output[0], # LLM-generated response
      'reference': sample['question'] # Ground Truth
       }

In [25]:
responses_base_model = []

for sample in test_dataset:
    prompt, _ = get_prompt(sample)
    response = get_response(prompt, sample)
    txt = re.sub(r"\n", "", response['llm_response'])
    txt1 = txt.split("Response:")[1]
    result = txt1.split("</s>")[0]
    print(result)
    responses_base_model.append(result)



        To learn a language using the applications-first approach, focus on using the language in practical situations from the start. Emphasize developing practical language proficiency and communication skills through activities such as conversations, role-plays, and real-world scenarios. This method will help you become fluent in the language quickly.
        To generate text based on the given instruction, please create a response that explains how common human emotions can bring people from diverse backgrounds together. Provide examples of such emotions and how they can connect individuals from different cultures.
        Use the input below to explain how language and history impact communication styles in different cultures:    In many cultures, language and history are closely intertwined, influencing the vocabulary, grammar, and expressions used in a particular language. Historical events, cultural traditions, and societal norms are reflected in the language, shaping communica
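
The parsing loop above indexes `split("Response:")[1]`, which raises an `IndexError` if the marker is missing, and it deletes newlines without inserting a space. The following is a minimal sketch of a more defensive parser; `extract_response` is a hypothetical helper, and the marker and EOS strings are assumed to match the prompt template and tokenizer used here.

```python
def extract_response(decoded, marker="### Response:", eos="</s>"):
    """Return the text after the first marker, cut at the EOS token."""
    _, found, tail = decoded.partition(marker)
    if not found:                  # marker missing: fall back to the whole string
        tail = decoded
    tail = tail.split(eos)[0]      # drop the EOS token and anything after it
    return " ".join(tail.split())  # collapse newlines and repeated spaces

# Illustrative decoded output, not produced by the model
decoded = "<s> ### Instruction:\n    Do X.\n\n    ### Response:\n    What is X?</s>"
print(extract_response(decoded))
```

Unlike the loop above, this keeps everything after the first marker, so a repeated `### Response:` in the generation would remain in the result.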

#### Generate finetuned model responses

In [16]:
# Load the finetuned model
finetuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "mistral_instruct_qa",
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="auto")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistral_instruct_qa")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [17]:
def get_response(prompt, sample):

    """
    Generate a response based on a given prompt using an LLM.

    Parameters:
    - prompt (str): A text string to prompt LLM.
    - sample (dict): A dictionary containing a ground truth.

    Returns:
    dict: A dictionary containing:
        - 'llm_response': The full decoded model output (prompt plus generated text), including special tokens.
        - 'reference': The ground-truth question taken from the input sample.

    Note:
    Ensure that the 'tokenizer' and 'model' are loaded.
    """

    encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model_inputs = encoded_input.to(device)

    generated_ids = finetuned_model.generate(**model_inputs, max_new_tokens=1000,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return {
      'llm_response': decoded_output[0], # LLM-generated response
      'reference': sample['question'] # Ground Truth
       }

In [18]:
responses_fn_model = []

for sample in test_dataset:
    prompt, _ = get_prompt(sample)
    response = get_response(prompt, sample)
    txt = re.sub(r"\n", "", response['llm_response'])
    txt1 = txt.split("Response:")[1]
    result = txt1.split("</s>")[0]
    print(result)
    responses_fn_model.append(result)

     What is the applications-first method of learning a language and how does it aim to develop practical language proficiency and communication skills?   
     How can common human emotions such as jealousy, excitement, sorrow, and passion connect individuals from different cultures?   
     How does the interplay of language and history influence communication styles in different cultures?     
     What challenges might European managers face when working in different regions of China?   
     How did the author's experience working with a group of Swedes challenge their initial perception of consensus-building?   
     Why does the author focus only on the first two documents when giving feedback to their colleague?   
     What are some examples of relationship-building activities commonly seen in American-style training programs or conferences?   
     Why did managers working in global business feel the need to work in a more American manner?   
     How did Confucius shape hie

#### We use the ROUGE score to compare the base model responses and the finetuned model responses against the references (ground truth).
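
To make the metric concrete, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It is not the implementation behind `evaluate.load('rouge')`, which has its own tokenization and options; `rouge1_f1` is a hypothetical helper and the example sentences are made up.

```python
from collections import Counter
import re

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "How does history influence communication styles?"
close = rouge1_f1("How does history shape communication styles?", reference)
far = rouge1_f1("Use the input below to explain communication.", reference)
print(f"close paraphrase: {close:.3f}, off-target answer: {far:.3f}")
```

A prediction that reuses most of the reference's words scores high, while one that merely restates the prompt scores low. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.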

In [28]:
# Base model evaluation
base_model_evaluation = rouge.compute(predictions=responses_base_model,
                                      references=references)

# Print 'rouge1', 'rouge2', and 'rougeL'
print("Rouge-1 Evaluation:")
print(base_model_evaluation["rouge1"])
print("--"*20)
print("Rouge-2 Evaluation:")
print(base_model_evaluation["rouge2"])
print("--"*20)
print("Rouge-L Evaluation:")
print(base_model_evaluation["rougeL"])

Rouge-1 Evaluation:
0.20895824664601048
----------------------------------------
Rouge-2 Evaluation:
0.10110800390965681
----------------------------------------
Rouge-L Evaluation:
0.187283652250661


In [19]:
# Finetuned model evaluation
finetuned_model_evaluation = rouge.compute(predictions=responses_fn_model,
                                           references=references)

# Print 'rouge1', 'rouge2', and 'rougeL'
print("Rouge-1 Evaluation:")
print(finetuned_model_evaluation["rouge1"])
print("--"*20)
print("Rouge-2 Evaluation:")
print(finetuned_model_evaluation["rouge2"])
print("--"*20)
print("Rouge-L Evaluation:")
print(finetuned_model_evaluation["rougeL"])

Rouge-1 Evaluation:
0.8803648515108963
----------------------------------------
Rouge-2 Evaluation:
0.8205802745469929
----------------------------------------
Rouge-L Evaluation:
0.8741808394445313


#### Conclusions:
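The ROUGE scores improve substantially after fine-tuning: on the 10 test samples, ROUGE-1 rises from about 0.21 to 0.88, ROUGE-2 from 0.10 to 0.82, and ROUGE-L from 0.19 to 0.87. Note that the two models are not compared at identical precision: the adapters were trained on a 4-bit quantized base model, and in the evaluation cells the base model is loaded in 4-bit while the fine-tuned model is loaded in bfloat16.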
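
The size of the improvement can be read directly from the scores printed above. A small sketch that computes the absolute gains, with the values copied from the two evaluation cells:

```python
# Scores copied from the evaluation outputs in this notebook
base = {
    "rouge1": 0.20895824664601048,
    "rouge2": 0.10110800390965681,
    "rougeL": 0.187283652250661,
}
finetuned = {
    "rouge1": 0.8803648515108963,
    "rouge2": 0.8205802745469929,
    "rougeL": 0.8741808394445313,
}

# Absolute gain per metric
gains = {name: finetuned[name] - base[name] for name in base}
for name, gain in gains.items():
    print(f"{name}: {base[name]:.3f} -> {finetuned[name]:.3f} (+{gain:.3f})")
```

Since the test set holds only 10 samples and generation uses `do_sample=True`, these figures are best read as indicative and may shift between runs.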