<a href="https://colab.research.google.com/github/baptiste-bedouret/Mistral7B-Finetuned/blob/master/Fine-tuning%20Mistral%207B%20on%20annotated%20dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup Runtime
For fine-tuning Mistral, a GPU instance is essential. Follow the directions below:

1. Go to `Runtime` (located in the top menu bar).
2. Select `Change Runtime Type`.
3. Choose `T4 GPU` (or a comparable option).


## Package installation

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
!pip install transformers accelerate trl torch bitsandbytes peft datasets -qU

## Load the dataset

In [3]:
from datasets import load_dataset

# Load the annotated JSON file, then split it into 3,000 training and 1,000 test examples
dataset = (load_dataset("json", data_files="/content/drive/My Drive/Smart-Data/Renault/Dataset_Annotated.json",
                        split='train').train_test_split(train_size=3000, test_size=1000))
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['completions', 'tasks'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['completions', 'tasks'],
        num_rows: 1000
    })
})


Remove columns from 'tasks':

In [4]:
columns_to_remove = ['id', 'date', 'pays', 'langue', 'score']

def remove_columns_from_tasks(entry):
    for column in columns_to_remove:
        entry['tasks'].pop(column, None)
    return entry

# Apply the function to each entry in the dataset
dataset = dataset.map(remove_columns_from_tasks)
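
To see what this mapping does to one record, here is a minimal sketch on a hand-made entry. The field values are hypothetical and invented for illustration; only the key names come from the cell above.

```python
columns_to_remove = ['id', 'date', 'pays', 'langue', 'score']

def remove_columns_from_tasks(entry):
    for column in columns_to_remove:
        # pop(..., None) skips keys that are absent instead of raising KeyError
        entry['tasks'].pop(column, None)
    return entry

# Hypothetical record shaped like one dataset row; the values are made up
sample_entry = {
    'tasks': {'id': 1, 'date': '2024-01-01', 'pays': 'FR', 'langue': 'fr',
              'score': 9, 'text': 'Bon accueil.'},
    'completions': [{'category': 'Welcome', 'polarity': 'positive'}],
}

cleaned = remove_columns_from_tasks(sample_entry)
print(cleaned['tasks'])
```

Only the `text` field should remain under `tasks`, while `completions` is left untouched.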

In [5]:
print(dataset['train']['tasks'][2]['text'])

2 things I would like to mention here.
1.service:the service is ok
2.Name transfer:I purchased the renault kwid vehicle and I approached the service for name tranfer.Irrespect of BY in my vehicle number they mistakenly entered AY..after I told them they are rectified the issue.


In [6]:
print(dataset['test']['tasks'][2]['text'])

Bon accueil et bon conseil. Je pense que la visite annuelle de ma voiture ainsi que le contrôle technique ont été faits correctement sachant le professionnalisme des employés du garage.


Remove columns from 'completions':

In [7]:
completions_columns_to_remove = ['intensity', 'span']

def remove_columns_from_completions(entry):
    for completion in entry['completions']:
        for column in completions_columns_to_remove:
            completion.pop(column, None)
    return entry

# Apply the function to each entry in the dataset
dataset = dataset.map(remove_columns_from_completions)

In [8]:
print(dataset['train']['completions'][2])

[{'category': 'Service', 'polarity': 'positive'}, {'category': 'Name transfer', 'polarity': 'positive'}]


In [9]:
print(dataset['test']['completions'][2])

[{'category': 'Welcome-Kindness-Warmth-Friendliness', 'polarity': 'positive'}, {'category': 'Attention-Assistance-Effort', 'polarity': 'positive'}]


Create the formatted prompt:

```
<s>[INST]### Instruction:
Use the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]

### Input:
{input}

### Response:
{response}</s>
```

In [10]:
# Define the formatting function.
# It receives a batch (a dict of columns) and returns one prompt string per row.
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['completions'])):
        text = f"""<s>[INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n{example["tasks"][i]['text']}\n\n### Response:\n{example["completions"][i]}</s>"""
        output_texts.append(text)
    return output_texts

In [None]:
print(formatting_prompts_func(dataset['train']))
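
Printing the whole training split produces a very long list, so it can help to try the function on a tiny batch first. The sketch below re-declares the function so that it runs on its own; `INSTRUCTION` and `toy_batch` are hypothetical names, and the two rows are invented for illustration.

```python
INSTRUCTION = ("Use the provided input to generate a response that identifies one or more "
               "categories and indicates the polarity (Positive, Negative, or Neutral) "
               "for each category in the text.")

def formatting_prompts_func(example):
    output_texts = []
    # example is a dict of columns, so index i selects one row
    for i in range(len(example['completions'])):
        text = (f"<s>[INST]### Instruction:\n{INSTRUCTION}[/INST]\n\n"
                f"### Input:\n{example['tasks'][i]['text']}\n\n"
                f"### Response:\n{example['completions'][i]}</s>")
        output_texts.append(text)
    return output_texts

# Hypothetical two-row batch in column layout (values are made up)
toy_batch = {
    'tasks': [{'text': 'Always friendly and efficient.'}, {'text': 'Too expensive.'}],
    'completions': [[{'category': 'Service', 'polarity': 'positive'}],
                    [{'category': 'Price', 'polarity': 'negative'}]],
}

prompts = formatting_prompts_func(toy_batch)
print(len(prompts))
print(prompts[0])
```

Note that the response part is simply the Python string form of the list of dictionaries, which is the format the model is trained to reproduce.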

## Loading and training Mistral 7B model

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, platform, warnings
from trl import SFTTrainer

In [13]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [14]:
# Load base model (Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

(True, True)
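
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a 16 GB T4. The sketch below uses the shard sizes from the download log and rests on two assumptions that the notebook does not confirm: that the checkpoint stores 16-bit weights, and that every weight ends up in 4 bits. It therefore gives a lower bound, not an exact figure.

```python
# Shard sizes in GB, copied from the download log above
shard_sizes_gb = [4.94, 5.00, 4.54]
checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores 2 bytes per parameter (16-bit weights)
approx_params = checkpoint_gb * 1e9 / 2

# Assumption: every weight is stored in 4 bits (0.5 byte). This ignores
# quantization constants and any layers that are left unquantized.
approx_4bit_gb = approx_params * 0.5 / 1e9

print(f"checkpoint:    {checkpoint_gb:.2f} GB")
print(f"parameters:    {approx_params / 1e9:.2f} B")
print(f"4-bit weights: {approx_4bit_gb:.2f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the checkpoint size, which leaves room for activations, LoRA adapters, and optimizer state.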

Let's examine how well the model currently does at this task:

In [15]:
def generate_response(prompt):
    encoded_input = tokenizer(prompt, return_tensors = "pt", add_special_tokens = True)
    model_inputs = encoded_input.to('cuda')

    generated_ids = model.generate(**model_inputs, max_new_tokens = 1000, do_sample = True, pad_token_id = tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return decoded_output[0]

In [16]:
generate_response("[INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Always friendly and efficient. Car comes back nice and clean.\n\n### Response:")

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


'<s> [INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Always friendly and efficient. Car comes back nice and clean.\n\n### Response:</s>\n Category: Customer Service\n Polarity: Positive\n\n Category: Cleanliness\n Polarity: Positive</s>'
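
The decoded string echoes the full prompt, including the `</s>` after `### Response:`, before the generated text. A small helper can isolate the generated part. This is a minimal sketch: `extract_response` and `RESPONSE_MARKER` are hypothetical names, and the instruction text in the sample string is abbreviated with `...` for brevity.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded_text):
    """Return only the text that follows the last response marker."""
    # Everything before the final marker is the echoed prompt.
    # If the marker is missing, rpartition leaves the whole text in tail.
    _, _, tail = decoded_text.rpartition(RESPONSE_MARKER)
    # Drop the BOS/EOS markers that batch_decode keeps in the string
    for special in ("<s>", "</s>"):
        tail = tail.replace(special, "")
    return tail.strip()

decoded = ("<s> [INST]### Instruction:\n...[/INST]\n\n### Input:\n Always friendly and efficient."
           " Car comes back nice and clean.\n\n### Response:</s>\n Category: Customer Service\n"
           " Polarity: Positive\n\n Category: Cleanliness\n Polarity: Positive</s>")

print(extract_response(decoded))
```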

In [17]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
    )
model = get_peft_model(model, peft_config)
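
To get a feel for how small the trainable part is, the number of LoRA parameters can be estimated by hand. The layer dimensions below are assumptions taken from the published Mistral 7B configuration, not from this notebook; the rank and target modules come from the `LoraConfig` above.

```python
# Assumed Mistral 7B dimensions (from the published model config, not from this notebook)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size
KV_DIM = 1024          # 8 key/value heads x head dimension 128
NUM_LAYERS = 32

R = 16           # LoRA rank, as in the LoraConfig above
LORA_ALPHA = 16  # as in the LoraConfig above

# (input features, output features) of each targeted linear layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per layer: A (r x in) and B (out x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} LoRA parameters per transformer block")
print(f"{total:,} LoRA parameters in total")
print(f"scaling factor lora_alpha / r = {LORA_ALPHA / R}")
```

With these assumed dimensions the adapters amount to about 23 million parameters, a small fraction of the roughly 7 billion frozen base weights.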

All that's left to do is set up a number of hyperparameters.

In [18]:
# Training Arguments
# Hyperparameters should be adjusted based on the hardware you are using
training_arguments = TrainingArguments(
    output_dir= "mistral_instruct_generation",
    num_train_epochs= 1,
    per_device_train_batch_size= 4,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    save_steps= 5000,
    logging_steps= 30,
    learning_rate= 2e-4,
    weight_decay= 0.001,
    fp16= False,
    bf16= False,
    max_grad_norm= 0.3,
    max_steps= -1,
    warmup_ratio= 0.3,
    group_by_length= True,
    lr_scheduler_type= "constant",
)
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    formatting_func = formatting_prompts_func,
    train_dataset=dataset['train'],
    eval_dataset = dataset['test'],
    peft_config=peft_config,
    max_seq_length= 2048,
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)



Train the Mistral model on the dataset:

In [19]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 30   | 1.2902        |
| 60   | 0.9644        |
| 90   | 0.7473        |
| 120  | 0.8518        |
| 150  | 0.6617        |
| 180  | 0.8853        |
| 210  | 0.8011        |
| 240  | 0.7128        |
| 270  | 0.8326        |
| 300  | 0.6675        |

TrainOutput(global_step=375, training_loss=0.8257147394816081, metrics={'train_runtime': 2449.0034, 'train_samples_per_second': 1.225, 'train_steps_per_second': 0.153, 'total_flos': 1.967265528859853e+16, 'train_loss': 0.8257147394816081, 'epoch': 1.0})
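
The figures in `TrainOutput` can be cross-checked against the training arguments. The sketch below assumes a single GPU, as on a standard Colab runtime; all other numbers are copied from the cells above.

```python
import math

num_train_rows = 3000              # size of the train split
per_device_train_batch_size = 4    # from TrainingArguments
gradient_accumulation_steps = 2    # from TrainingArguments
num_gpus = 1                       # assumption: one Colab GPU
train_runtime = 2449.0034          # seconds, from TrainOutput

# One optimizer step consumes this many examples
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_train_rows / effective_batch)

samples_per_second = round(num_train_rows / train_runtime, 3)
steps_per_second = round(steps_per_epoch / train_runtime, 3)

print(effective_batch, steps_per_epoch)
print(samples_per_second, steps_per_second)
```

The computed step count and throughput agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.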

## Evaluation process

In [20]:
trainer.save_model("mistral_instruct_generation")

In [21]:
merged_model = model.merge_and_unload()



In [22]:
# Same helper as before, but the model to query is now passed explicitly
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return decoded_output[0]

In [23]:
generate_response("[INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Always friendly and efficient. Car comes back nice and clean.\n\n### Response:", merged_model)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


'<s> [INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Always friendly and efficient. Car comes back nice and clean.\n\n### Response:</s><s> QUIALITY OF SERVICE:\n### Positive</s>'

In [24]:
generate_response("[INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Service incharge Mujjafar is knowledgeable and understands the problem and provides good service\n\n### Response:", merged_model)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


"<s> [INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n Service incharge Mujjafar is knowledgeable and understands the problem and provides good service\n\n### Response:</s> [{'category': 'Service incharge competency', 'polarity': 'Positive'}]</s>"

In [25]:
generate_response("[INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n THERE WAS AN EML LIGHT ON WHICH HAS NOT BEEN CHECKED.  AND YOU HAVEN'T TOLD ME THE REASON OF IT. IM NOT HAPPY OF YOUR SERVICE\n\n### Response:", merged_model)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


"<s> [INST]### Instruction:\nUse the provided input to generate a response that identifies one or more categories and indicates the polarity (Positive, Negative, or Neutral) for each category in the text.[/INST]\n\n### Input:\n THERE WAS AN EML LIGHT ON WHICH HAS NOT BEEN CHECKED.  AND YOU HAVEN'T TOLD ME THE REASON OF IT. IM NOT HAPPY OF YOUR SERVICE\n\n### Response:</s><s> Questionnaire is a bit out of context. EML light was not checked at all in service workshop.\n\n### Category:\nIssue not addressed\n\n### Polarity:\nNegative</s>"
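
The three samples above show that the fine-tuned model sometimes returns the list-of-dictionaries format used in training and sometimes free-form text. For a more systematic evaluation, one option is to check whether each response parses back into that structure. This is a minimal sketch, not part of the original notebook; `parse_completion` is a hypothetical helper, and it expects only the generated part of the output (without the echoed prompt and special tokens).

```python
import ast

def parse_completion(response_text):
    """Parse a response into a list of {'category', 'polarity'} dicts, or return None."""
    try:
        # literal_eval only accepts Python literals, so arbitrary code is not executed
        parsed = ast.literal_eval(response_text.strip())
    except (ValueError, SyntaxError):
        return None  # free-form text, not a Python literal
    if not isinstance(parsed, list):
        return None
    for item in parsed:
        # Every element must be a dict carrying both expected keys
        if not isinstance(item, dict) or not {'category', 'polarity'} <= set(item):
            return None
    return parsed

# Generated parts taken from two of the outputs above
structured = "[{'category': 'Service incharge competency', 'polarity': 'Positive'}]"
free_form = "QUIALITY OF SERVICE:\n### Positive"

print(parse_completion(structured))
print(parse_completion(free_form))
```

Counting how many test responses parse successfully would give a simple format-compliance rate before comparing categories and polarities against the annotations.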