# Fine-Tuning Mistral-7B

## Introduction

In this notebook, we fine-tune Mistral-7B (https://huggingface.co/mistralai/Mistral-7B-v0.3) with the Hugging Face Transformers API, using LoRA. The notebook follows these steps:

1. Install the packages.
2. Import the required libraries.
3. Load the data.
4. Preprocess the data.
5. Load the model from disk.
6. Train the model on the data.
7. Use the trained model on an example.

## Installing the packages

In [1]:
! pip install transformers[sentencepiece] trl accelerate torch bitsandbytes peft datasets -qU

In [2]:
import os; print(os.getenv("LD_LIBRARY_PATH"))

/product/ubuntu22-x86_64/apps/CUDA/12.1.0/nvvm/lib64:/product/ubuntu22-x86_64/apps/CUDA/12.1.0/extras/CUPTI/lib64:/product/ubuntu22-x86_64/apps/CUDA/12.1.0/lib


## Importing the required libraries

In [3]:
# To download the data
from datasets import load_dataset

# To access the Hugging Face Hub
from huggingface_hub import login

# To load Mistral-7B, the associated tokenizer, and ...
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# To use PyTorch
import torch

## Loading the data

In [4]:
instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

In [5]:
print(instruct_tune_dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 56167
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 6807
    })
})


## Preprocessing the data

In [6]:
instruct_tune_dataset = instruct_tune_dataset.filter(lambda x: x["source"] == "dolly_hhrlhf")

In [7]:
print(instruct_tune_dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 34333
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 4771
    })
})


In [8]:
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(5_000)) # TODO: add a call to shuffle()

In [9]:
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200)) # TODO: add a call to shuffle()

In [10]:
print(instruct_tune_dataset)

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['prompt', 'response', 'source'],
        num_rows: 200
    })
})
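The TODO notes above mention shuffling before taking a subset; without it, `select(range(n))` always keeps the first `n` rows. A minimal sketch of one way to do this, using only the standard library to draw reproducible random indices, could look as follows. The helper name `sample_indices` and the seed are hypothetical, and the row counts are the ones printed after filtering.

```python
import random


def sample_indices(num_rows: int, num_keep: int, seed: int = 42) -> list:
    """Pick `num_keep` distinct row indices uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(range(num_rows), num_keep)


# Row counts taken from the filtered dataset printed earlier
train_indices = sample_indices(34_333, 5_000)
test_indices = sample_indices(4_771, 200)

# Assumed usage with the notebook's dataset (not executed here):
# instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(train_indices)
# instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(test_indices)
print(len(train_indices), len(test_indices))
```

The `datasets` library also provides `Dataset.shuffle(seed=...)`, so chaining `.shuffle(seed=42).select(range(5_000))` is likely the more direct option inside the notebook.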


In [11]:
# Example of one record (here, a training record)
print(instruct_tune_dataset["train"][0])

{'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction\nWhat are different types of grass?\n\n### Response\n', 'response': 'There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.', 'source': 'dolly_hhrlhf'}


In [12]:
# Prompt of an original record
print(instruct_tune_dataset["train"][0]['prompt'])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction
What are different types of grass?

### Response



In [13]:
# Response of an original record
print(instruct_tune_dataset["train"][0]['response'])

There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.


In [14]:
# Source of an original record
print(instruct_tune_dataset["train"][0]['source'])

dolly_hhrlhf


In [15]:
def create_prompt(sample):
    """
    Reformat an input record to match the format expected by Mistral-7B and by the task at hand.

    Parameters:
    sample (dict): The input record.

    Returns:
    str: The record rewritten in the format expected by Mistral-7B and by the task at hand.

    Example:
    >>> print(create_prompt({'prompt': "Can I find information about SALOME platform ?", 'response': ""}))
    
    <s>### Instruction:
    Use the provided input to create a response to the prompt question.
    
    ### Input:
    Can I find information about SALOME platform ?
    
    ### Response:
    </s>
    """
    
    bos_token = "<s>"
    original_system_message = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    system_message = "Use the provided input to create a response to the prompt question."
    # NOTE: the template markers are stripped from the response only; the prompt
    # is inserted unchanged, so it keeps the dataset's original template.
    response = sample["response"].replace(original_system_message, "").replace("\n\n### Instruction\n", "").replace("\n### Response\n", "").strip()
    user_input = sample["prompt"]  # renamed from `input` to avoid shadowing the builtin
    eos_token = "</s>"
    
    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "### Instruction:"
    full_prompt += "\n" + system_message
    full_prompt += "\n\n### Input:"
    full_prompt += "\n" + user_input
    full_prompt += "\n\n### Response:"
    full_prompt += "\n" + response
    full_prompt += eos_token
    
    return full_prompt

In [16]:
# Print the docstring of the create_prompt function
print(create_prompt.__doc__)


    Reformat an input record to match the format expected by Mistral-7B and by the task at hand.

    Parameters:
    sample (dict): The input record.

    Returns:
    str: The record rewritten in the format expected by Mistral-7B and by the task at hand.

    Example:
    >>> print(create_prompt({'prompt': "Can I find information about SALOME platform ?", 'response': ""}))
    
    <s>### Instruction:
    Use the provided input to create a response to the prompt question.
    
    ### Input:
    Can I find information about SALOME platform ?
    
    ### Response:
    </s>
    


In [17]:
# Example use of the create_prompt function
print(create_prompt({'prompt': "Can I find information about SALOME platform ?", 'response': ""}))

<s>### Instruction:
Use the provided input to create a response to the prompt question.

### Input:
Can I find information about SALOME platform ?

### Response:
</s>


In [18]:
# Example of a final record, i.e. after it has gone through create_prompt
print(create_prompt(instruct_tune_dataset["train"][0]))

<s>### Instruction:
Use the provided input to create a response to the prompt question.

### Input:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction
What are different types of grass?

### Response


### Response:
There are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.</s>
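The output above shows that the `### Input:` section still carries the dataset's original template ("Below is an instruction ...", `### Instruction`, `### Response`), so the final record contains two nested templates. If only the bare question is wanted there, one possible approach is to strip the template from the prompt before formatting. The helper below is a hypothetical sketch, not part of the original notebook, and it assumes every prompt follows the layout of the record printed earlier.

```python
# Hypothetical helper: extracts the bare instruction from a dolly_hhrlhf-style prompt.
ORIGINAL_SYSTEM_MESSAGE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def extract_instruction(prompt: str) -> str:
    """Remove the dataset's system message and section markers from a prompt."""
    text = prompt.replace(ORIGINAL_SYSTEM_MESSAGE, "")
    text = text.replace("### Instruction", "").replace("### Response", "")
    return text.strip()


sample_prompt = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction\n"
    "What are different types of grass?\n\n### Response\n"
)
print(extract_instruction(sample_prompt))
```

Inside `create_prompt`, the value placed under `### Input:` could then be `extract_instruction(sample["prompt"])` instead of the raw prompt. Prompts that do not contain the template pass through unchanged apart from surrounding whitespace.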


## Loading the Mistral-7B model

In this section, we load the base Mistral-7B model and its associated tokenizer, both stored on disk.

In [19]:
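# TODO: study this
# 4-bit quantization settings (bitsandbytes)
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,                  # load the weights in 4-bit precision
   bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization type
   bnb_4bit_use_double_quant=True,     # also quantize the quantization constants
   bnb_4bit_compute_dtype=torch.float  # dtype used for computation (torch.float is float32)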
)
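As a rough order of magnitude for why the model is loaded in 4-bit, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count (about 7.25 billion) is an approximation assumed here, and the estimate ignores quantization constants, activations, LoRA weights, and optimizer state.

```python
# Rough weight-memory estimate. NUM_PARAMS is an assumed approximation,
# not a value read from the model in this notebook.
NUM_PARAMS = 7.25e9


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store `num_params` weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


for label, bits in [("float32", 32), ("bfloat16", 16), ("nf4", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit storage takes about a quarter of the memory of 16-bit storage, which is what makes a 7B model fit on a single mid-range GPU.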

In [20]:
# Directory where the model is saved
save_directory_model = "../models/mistral7b_not_fine-tune"

# Directory where the tokenizer is saved
save_directory_tokenizer = "../models/mistral7b_tokenizer_not_fine-tune"

In [21]:
# Load the saved model
model = AutoModelForCausalLM.from_pretrained(
    save_directory_model,
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False,
)

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [22]:
# Load the associated tokenizer
#tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer = AutoTokenizer.from_pretrained(save_directory_tokenizer)

# Problem: ValueError: Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure you have sentencepiece installed.
# Solution: pip install transformers[sentencepiece]

In [23]:
# TODO: study this (the EOS token is reused as the padding token, with padding added on the right)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
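The two settings above are presumably needed because the base tokenizer may not define a padding token of its own. To illustrate what right padding with the EOS token means for a batch, here is a small standalone sketch. The token ids are made up for the example, and `pad_right` is a hypothetical helper rather than a Transformers function.

```python
def pad_right(sequences, pad_id):
    """Pad token-id lists on the right to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


EOS_ID = 2  # illustrative id standing in for "</s>"
batch = [[1, 10, 11, 12], [1, 20]]
ids, mask = pad_right(batch, EOS_ID)
print(ids)
print(mask)
```

Because the padding id equals the EOS id, the attention mask is what tells padding apart from a genuine end-of-sequence token.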

## Fine-tuning the model

In [24]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

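peft_config = LoraConfig(
    lora_alpha=16,      # scaling numerator: the LoRA update is scaled by lora_alpha / r
    lora_dropout=0.1,   # dropout applied on the LoRA branch
    r=64,               # rank of the low-rank update matrices
    bias="none",        # bias terms are not trained
    task_type="CAUSAL_LM"
)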

In [25]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
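To get a feel for what `r=64` and `lora_alpha=16` imply, the following back-of-the-envelope sketch counts the parameters LoRA would add. The layer shapes, the number of layers, and the choice of `q_proj` and `v_proj` as target modules are assumptions about Mistral-7B and PEFT's defaults, not values read from this notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """A rank-r LoRA adapter on a d_in x d_out linear layer adds two matrices:
    A of shape (r, d_in) and B of shape (d_out, r)."""
    return r * (d_in + d_out)


R, ALPHA = 64, 16  # values from the LoraConfig above

# Assumed architecture: 32 decoder layers, adapters on q_proj and v_proj only
NUM_LAYERS = 32
TARGET_SHAPES = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters (estimate): {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

Calling `model.print_trainable_parameters()` on the PEFT-wrapped model should report the actual count for comparison.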

In [34]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir = "mistral_instruct_generation",
    #num_train_epochs=5,
    max_steps =5, # comment out this line if you want to train in epochs
    per_device_train_batch_size = 2,
    warmup_steps = 0,
    logging_steps=1,
    #save_strategy="epoch",
    eval_strategy="epoch", # note: with "epoch", eval_steps below is ignored; use "steps" to evaluate every eval_steps
    eval_steps=2, # comment out this line if you want to evaluate at the end of each epoch
    learning_rate=2e-4,
    bf16=True,
    per_gpu_train_batch_size=1, # deprecated: takes precedence over per_device_train_batch_size and triggers the warning below
    lr_scheduler_type='constant',
)

In [35]:
from trl import SFTTrainer

max_seq_length = 128

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
max_steps is given, it will override any value given in num_train_epochs
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
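The trainer above is created with `packing=True` and `max_seq_length=128`. As a simplified illustration of the idea behind packing (not TRL's actual implementation), the sketch below concatenates tokenized samples, each followed by an EOS id, and cuts the stream into fixed-length blocks. The token ids are invented, and dropping the leftover tokens at the end is an assumption of this sketch.

```python
def pack(tokenized_samples, seq_length, eos_id):
    """Concatenate samples (each followed by EOS) and cut fixed-length blocks."""
    stream = []
    for tokens in tokenized_samples:
        stream.extend(tokens)
        stream.append(eos_id)  # marks the boundary between two samples
    num_blocks = len(stream) // seq_length  # an incomplete final block is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(num_blocks)]


samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack(samples, seq_length=4, eos_id=2)
print(blocks)
```

With packing, a block can start or end in the middle of a sample, so short records waste no padding but a single block may mix several records.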


In [33]:
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


AttributeError: `AcceleratorState` object has no attribute `distributed_type`. This happens if `AcceleratorState._reset_state()` was called and an `Accelerator` or `PartialState` was not reinitialized.

## Prediction with the model

In [7]:
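def generate_response(prompt, model):
    # Tokenize the prompt with the global `tokenizer`; add_special_tokens=True prepends a BOS token
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    # Sample up to 1000 new tokens
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    # Decode the generated ids, special tokens included
    decoded_output = tokenizer.batch_decode(generated_ids)

    # Remove the echoed prompt from the decoded text
    return decoded_output[0].replace(prompt, "")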

Example use of the `generate_response` function:

In [8]:
# Prompt
#prompt = "<s>### Instruction:\nUse the provided input to create an instruction that could have been used to generate the response with an LLM.\n\n### Input:\nI think it depends a little on the individual, but there are a number of steps you’ll need to take.  First, you’ll need to get a college education.  This might include a four-year undergraduate degree and a four-year doctorate program.  You’ll also need to complete a residency program.  Once you have your education, you’ll need to be licensed.  And finally, you’ll need to establish a practice.\n\n### Response:"
#prompt = create_prompt()
prompt = "<s>### Instruction:\nUse the provided input to create a response.\n\n### Input:\nCan I find information about SALOME platform ?\n\n### Response:</s>"
print(prompt)

<s>### Instruction:
Use the provided input to create a response.

### Input:
Can I find information about SALOME platform ?

### Response:</s>


In [9]:
# Response predicted by the model
print(generate_response(prompt, model))



<s>-

### Output:

Dear *NAME*,

Thank you, as a result of our conversation, the *YOUR NAME* user has been registered on the SALOME platform in "Account verification" status.

Please go to *YOUR SALOME LINK* to confirm your e-mail address and login to platform.

###### Information
You have activated your account with success. Click on the button to start your journey from Open Science.

###### Regards,

SALOME team, SOLABS Co, LLC</s>
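One possible reason for the off-topic output above is the shape of the prompt: it ends with `</s>` right after `### Response:`, which may signal that the sequence is already finished, and it spells out `<s>` even though the tokenizer is called with `add_special_tokens=True`. The sketch below shows how an inference prompt could instead stop right before the answer, and how the answer could be separated from the decoded text. Both helpers are hypothetical, and the default system message is the one used in `create_prompt`.

```python
def build_inference_prompt(
    question: str,
    system_message: str = "Use the provided input to create a response to the prompt question.",
) -> str:
    """Build a prompt that ends where the model is expected to start answering.

    No "<s>" or "</s>" is written here: the BOS token is left to the tokenizer,
    and EOS is left for the model to generate.
    """
    return (
        "### Instruction:\n" + system_message
        + "\n\n### Input:\n" + question
        + "\n\n### Response:\n"
    )


def extract_answer(decoded: str, prompt: str) -> str:
    """Drop special-token strings and the echoed prompt from decoded text."""
    text = decoded.replace("<s>", "").replace("</s>", "")
    return text.replace(prompt, "", 1).strip()


prompt = build_inference_prompt("Can I find information about SALOME platform ?")
print(prompt)
```

Passing such a prompt to `generate_response` would be one way to test whether the stray `</s>` is what derails the generation; this has not been verified here.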


## Saving the model to disk

In [36]:
# Directory where the fine-tuned model is saved
save_directory = "../models/mistral7b_fine-tune"

# Save the fine-tuned model (for a PEFT-wrapped model, this writes the adapter weights)
model.save_pretrained(save_directory)