# Fine-tune Mistral for Social Media Analysis


This notebook fine-tunes Mistral on a set of social media annotations.


## Install

We connect to the drive

We install transformers and peft separately from the latest version on GitHub. Otherwise, you will miss key metadata for Mistral support.

In [None]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git

We install the correct version of tensorflow:

In [None]:
!pip install tensorflow==2.14

We install the other extensions.

In [None]:
!pip install -q accelerate bitsandbytes trl guardrail-ml tensorboard

We load the libraries

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    LlamaTokenizerFast
)
from peft import LoraConfig, PeftModel, get_peft_model
from trl import SFTTrainer



## Parameters

The important hyperparameters first:

In [None]:
per_device_train_batch_size = 4  # Number of texts per batch: higher means quicker epochs, lower means less VRAM.
learning_rate = 2e-4  # Rate of memorization, and also of forgetting past knowledge. High values are preferable for annotations.
max_seq_length = 1024  # Context window: it does not need to be large for LLMs that analyze social media posts.

# The name of Mistral model
model_name = "mistral-7b-v0.1"

# The name of the new model.
new_model_name = "mistral-7b-sna"

# The number of steps.
# I prefer this to the number of epochs (easier to manage and anticipate the time it takes to finetune)
max_steps = 500

# Saving steps. Useful when there is an issue with fine-tuning: you can easily restart from the last checkpoint.
save_steps = 100

# The output directory where the model predictions and checkpoints will be written
output_dir = "./mistral-7b-sna"

# Tensorboard logs
tb_log_dir = "./mistral-7b-sna/logs"

The other hyperparameters (no need to change them normally):

In [None]:
# Base parameters
local_rank = -1
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4
max_grad_norm = 0.3
weight_decay = 0.001
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

# Activate 4-bit precision base model loading
use_4bit = True

# Activate nested quantization for 4-bit base models
use_nested_quant = False

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Number of training epochs
num_train_epochs = 1

# Enable fp16 training
fp16 = True

# Enable bf16 training
bf16 = False

# Pack several short examples into each training sequence
packing = False

# Enable gradient checkpointing
gradient_checkpointing = True

# Optimizer to use, original is paged_adamw_32bit
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine, and has advantage for analysis)
lr_scheduler_type = "constant"

# Fraction of steps to do a warmup for
warmup_ratio = 0.03

# Group sequences into batches with same length (saves memory and speeds up training considerably)
group_by_length = True

# Log every X updates steps
logging_steps = 1

# Load the entire model on the GPU 0
device_map = {"": 0}

# Visualize training
report_to = "tensorboard"
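To get a feel for what these settings imply, here is a rough sketch of the training budget. It assumes a single GPU and uses the 4,272 rows reported for the dataset loaded further down; `num_gpus` and `num_rows` are names introduced only for this illustration. Note that when `max_steps` is set, it takes precedence over `num_train_epochs`.

```python
# Values copied from the parameter cells above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
save_steps = 100

# Assumptions: a single GPU, and the 4,272-row dataset loaded further down.
num_gpus = 1
num_rows = 4272

# Sequences contributing to each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Sequences processed over the whole run, and the equivalent number of epochs.
examples_seen = effective_batch_size * max_steps
approx_epochs = examples_seen / num_rows

# Number of checkpoints written when saving every `save_steps` steps.
num_checkpoints = max_steps // save_steps

print(effective_batch_size, examples_seen, round(approx_epochs, 2), num_checkpoints)
```

With the defaults, 500 steps correspond to a little under two passes over the data, with five checkpoints along the way.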

We load the model.

We load a quantized (4-bit) version of the base model to reduce memory use during training.
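A back-of-the-envelope estimate shows why 4-bit loading matters on a small GPU. This is only a sketch: the parameter count is an assumption (roughly 7.2 billion for a 7B-class Mistral model), and it counts the weights alone, ignoring quantization constants, activations, LoRA weights and optimizer state. `weight_memory_gb` is a helper written for this illustration.

```python
# Assumption: roughly 7.2 billion parameters for a 7B-class Mistral model.
num_params = 7.2e9


def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```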

In [None]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
        print("=" * 80)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config
)

model.config.use_cache = False
model.config.pretraining_tp = 1

Your GPU supports bfloat16, you can accelerate training with the argument --bf16



We load the tokenizer and the peft configuration. Note that you have to specify the target modules, as peft is not yet fully updated for Mistral.

The llama fast tokenizer is another option (left commented out in the cell), but it is not clear that it is the better choice.

In [None]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    inference_mode=False,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # More modules can be targeted to deepen the fine-tuning, at a cost in speed and memory
)

tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True)
#tokenizer = LlamaTokenizerFast.from_pretrained(model_name, add_eos_token=True, from_slow=True)

# This is the fix for fp16 training
tokenizer.padding_side = "right"
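To see how small the trained part is with this configuration, we can count the LoRA parameters by hand. This is a sketch under stated assumptions: the layer count and projection shapes below describe the Mistral 7B architecture as we understand it (32 layers, hidden size 4096, value projection of width 1024) and are not read from this notebook. `lora_params` is a helper written for this illustration.

```python
# Values from the parameter cell above.
lora_r = 64
lora_alpha = 16

# Assumptions about the Mistral 7B architecture (not stated in this notebook):
# 32 decoder layers, hidden size 4096, and grouped-query attention whose
# value projection maps 4096 -> 1024.
num_layers = 32
module_shapes = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}


def lora_params(in_features, out_features, r):
    """A LoRA adapter adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, lora_r) for i, o in module_shapes.values())
trainable = per_layer * num_layers
scaling = lora_alpha / lora_r  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter holds about 27 million parameters, a small fraction of the base model.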

## Dataset preparation

You should put the dataset in the same directory as the models (so Mistral here).

We load the data in a custom format. Here 'full_text' is the input and 'analysis' is the expected output. These field names may have to be changed for a custom dataset.

In [None]:
from datasets import load_dataset

def format_custom(sample):
    instruction = f"<s>Text: {sample['full_text']} \n\n### Analysis:\n\n"
    context = None
    response = f"{sample['analysis']}"
    # join all the parts together
    prompt = "".join([i for i in [instruction, context, response] if i is not None])
    return prompt

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_custom(sample)}{tokenizer.eos_token}"
    return sample

# Loading the dataset.
data_files = {"train": "brahe_instructions.json"}
dataset = load_dataset("json", data_files=data_files, split="train")

# Transform the dataset to use the guanaco format
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
dataset

Dataset({
    features: ['text'],
    num_rows: 4272
})

A sample of the dataset:

In [None]:
dataset[40]

{'text': '<s>Text: We found that the wave had actually borne the boat on its crest from the beach into the woods, and there launched it into the heart of this bush; which was extremely fortunate, for had it been tossed against a rock or a tree, it would have been dashed to pieces, whereas it had not received the smallest injury. It was no easy matter, however, to get it out of the bush and down to the sea again. This cost us two days of hard labour to accomplish. We had also much ado to clear away the rubbish from before the bower, and spent nearly a week in constant labour ere we got the neighbourhood to look as clean and orderly as before; for the uprooted bushes and seaweed that lay on the beach formed a more dreadfully confused-looking mass than one who had not seen the place after the inundation could conceive. Before leaving the subject, I may mention, for the sake of those who interest themselves in the curious natural phenomena of our world, that this gigantic wave occurs regul
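Since the sample above is long, here is the same template applied to a short record. The record is invented for illustration, and `"</s>"` is assumed to stand in for `tokenizer.eos_token`.

```python
# Assumption: "</s>" stands in for tokenizer.eos_token.
EOS_TOKEN = "</s>"


def format_custom(sample):
    """Same template as the cell above: input text, separator, expected analysis."""
    instruction = f"<s>Text: {sample['full_text']} \n\n### Analysis:\n\n"
    return instruction + f"{sample['analysis']}"


# Hypothetical record, invented for illustration.
sample = {"full_text": "What a day at the beach!", "analysis": "Tone: enthusiastic"}
text = format_custom(sample) + EOS_TOKEN

print(repr(text))
```

Everything up to and including `### Analysis:` and the blank line is what the model will be given at inference time; the rest is what it learns to produce.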

## Fine-tuning

We launch the training, which should take 1-2 hours with the default settings.

In [None]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
torch.cuda.empty_cache()

training_arguments = TrainingArguments(
    output_dir=output_dir,
    logging_dir=tb_log_dir,  # TensorBoard logs go to the directory defined in the parameters
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to=report_to,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing
)

trainer.train()
#trainer.train(resume_from_checkpoint=True)
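If training is interrupted, it helps to know which checkpoint is the most recent before resuming. The Trainer writes checkpoints into `checkpoint-<step>` subfolders of the output directory; the helper below, `latest_checkpoint`, is a hypothetical name introduced for this sketch and only inspects folder names.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Prints None if no checkpoint has been written yet.
print(latest_checkpoint("./mistral-7b-sna"))
```

The returned path can be passed as `resume_from_checkpoint` in place of `True` when you want to be explicit about which checkpoint is used.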

We save the model

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained(new_model_name)

We merge the base model and the LoRA adapter to speed up inference.

In [None]:
del model
torch.cuda.empty_cache()

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(new_model_name, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()


In [None]:
output_merged_dir = os.path.join(new_model_name, "final_merged_checkpoint")
model.save_pretrained(output_merged_dir, safe_serialization=True)
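For intuition, merging a LoRA adapter amounts to folding the low-rank update into the frozen weight: W' = W + (alpha / r) * B @ A. The toy example below works through this rule on a 2x2 layer in plain Python. It is a sketch of the standard LoRA formula with invented numbers, not the code that `merge_and_unload()` actually runs; `matmul` and `merge_lora` are helpers written for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the standard LoRA merge rule."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all numbers invented for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (out, in)
A = [[1.0, 2.0]]              # shape (r, in)
B = [[0.5], [1.0]]            # shape (out, r)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After the merge there is a single weight matrix per layer, so inference no longer pays for the extra adapter matrix products.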

We export the tokenizer files.

In [None]:
!cp "mistral-7b-v0.1/tokenizer.json" "mistral-7b-sna/final_merged_checkpoint/tokenizer.json"
!cp "mistral-7b-v0.1/tokenizer.model" "mistral-7b-sna/final_merged_checkpoint/tokenizer.model"
!cp "mistral-7b-v0.1/tokenizer_config.json" "mistral-7b-sna/final_merged_checkpoint/tokenizer_config.json"
!cp "mistral-7b-v0.1/special_tokens_map.json" "mistral-7b-sna/final_merged_checkpoint/special_tokens_map.json"
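The same copy can be done in Python, which avoids hardcoding each path and skips files that do not exist. This is a minimal sketch; `copy_tokenizer_files` and `TOKENIZER_FILES` are names introduced here, and the list simply mirrors the four files copied above.

```python
import os
import shutil

TOKENIZER_FILES = [
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def copy_tokenizer_files(src_dir, dst_dir, filenames=TOKENIZER_FILES):
    """Copy the tokenizer files that exist in src_dir; return the names copied."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in filenames:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(dst_dir, name))
            copied.append(name)
    return copied
```

In this notebook the call would be `copy_tokenizer_files(model_name, output_merged_dir)`.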

## Inference

To do later: this section is not working for now (but you are free to debug it). You may need to delete the runtime to free the memory (since we will use a different implementation), especially if you are on the free tier of Colab.

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-_qt6yybs
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-_qt6yybs
  Resolved https://github.com/huggingface/transformers.git to commit 391177441b133645c02181b57370ab12f71b88c4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
!pip install git+https://github.com/vllm-project/vllm

Collecting git+https://github.com/vllm-project/vllm
  Cloning https://github.com/vllm-project/vllm to /tmp/pip-req-build-j4b1q0o4
  Running command git clone --filter=blob:none --quiet https://github.com/vllm-project/vllm /tmp/pip-req-build-j4b1q0o4
  Resolved https://github.com/vllm-project/vllm to commit e2fb71ec9f2c3168ba8614408fa807a5f65707c5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ninja (from vllm==0.2.0)
  Using cached ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)
Collecting ray>=2.5.1 (from vllm==0.2.0)
  Downloading ray-2.7.0-cp310-cp310-manylinux2014_x86_64.whl (62.5 MB)
Collecting xformers>=0.0.22 (from vllm==0.2.0)
  Downloading xformers-0.0.22-cp310-cp310-manylinux2014_x86_64.whl 

In [None]:
from vllm import LLM, SamplingParams
import os

In [None]:
new_model_name = "mistral-7b-sna"

In [None]:
output_merged_dir = os.path.join(new_model_name, "final_merged_checkpoint")
llm = LLM(output_merged_dir)

INFO 09-29 15:21:34 llm_engine.py:72] Initializing an LLM engine with config: model='mistral-7b-sna/final_merged_checkpoint', tokenizer='mistral-7b-sna/final_merged_checkpoint', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 09-29 15:23:33 llm_engine.py:205] # GPU blocks: 9336, # CPU blocks: 2048


In [None]:
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=500)
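To make the two sampling settings concrete, here is a simplified illustration of what temperature and top_p do to a next-token distribution. It is a sketch of the general technique (temperature scaling followed by nucleus filtering), not vLLM's implementation; the logits are invented and `top_p_distribution` is a helper written for this illustration.

```python
import math


def top_p_distribution(logits, temperature=0.2, top_p=0.95):
    """Apply temperature, then keep the smallest set of tokens whose
    cumulative probability reaches top_p, and renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least likely until top_p is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.5, 0.5, -1.0]
print(top_p_distribution(logits, temperature=0.2, top_p=0.95))
print(top_p_distribution(logits, temperature=1.0, top_p=0.95))
```

A low temperature such as 0.2 concentrates the probability on the top tokens, which suits an annotation task where we want stable outputs; `max_tokens=500` simply caps the length of each generated analysis.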

In [None]:
prompts = ["""Text: For a long time I used to go to bed early. Sometimes, when I had put out my candle, my eyes would close so quickly that I had not even time to say "I'm going to sleep." And half an hour later the thought that it was time to go to sleep would awaken me; I would try to put away the book which, I imagined, was still in my hands, and to blow out the light; I had been thinking all the time, while I was asleep, of what I had just been reading, but my thoughts had run into a channel of their own, until I myself seemed actually to have become the subject of my book: a church, a quartet, the rivalry between François I and Charles V. This impression would persist for some moments after I was awake; it did not disturb my mind, but it lay like scales upon my eyes and prevented them from registering the fact that the candle was no longer burning. Then it would begin to seem unintelligible, as the thoughts of a former existence must be to a reincarnate spirit; the subject of my book would separate itself from me, leaving me free to choose whether I would form part of it or no; and at the same time my sight would return and I would be astonished to find myself in a state of darkness, pleasant and restful enough for the eyes, and even more, perhaps, for my mind, to which it appeared incomprehensible, without a cause, a matter dark indeed. \n\n### Analysis:\n\n"""]

In [None]:
prompts = ["""Text: grandfather’s, who died years ago; and my body, the side
upon which I was lying, faithful guardians of a past which
my mind should never have forgotten, brought back
before my eyes the glimmering flame of the night-light in
its urn-shaped bowl of Bohemian glass that hung by
chains from the ceiling, and the chimney-piece of Siena
marble in my bedroom at Combray, in my grandparents’
house, in those far distant days which at this moment I
imagined to be in the present without being able to picture
them exactly, and which would become plainer in a little
while when I was properly awake.
      Then the memory of a new position would spring up,
and the wall would slide away in another direction; I was
in my room in Mme de Saint-Loup’s house in the country;
good heavens, it must be ten o’clock, they will have
finished dinner! I must have overslept myself in the little
nap which I always take when I come in from my walk
with Mme de Saint-Loup, before dressing for the evening.
For many years have now elapsed since the Combray days
when, coming in from the longest and latest walks, I
would still be in time to see the reflection of the sunset
glowing in the panes of my bedroom window. It is a very
different kind of life that one leads at Tansonville, at Mme
de Saint-Loup’s, and a different kind of pleasure that I
derive from taking walks only in the evenings, from
visiting by moonlight the roads on which I used to play as
a child in the sunshine; as for the bedroom in which I must
have fallen asleep instead of dressing for dinner, I can see
it from the distance as we return from our walk, with its
lamp shining through the window, a solitary beacon in the
night.\n\n### Analysis:\n\n"""]

In [None]:
outputs = llm.generate(prompts, sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.26s/it]


In [None]:
outputs

[RequestOutput(request_id=0, prompt='Text: grandfather’s, who died years ago; and my body, the side\nupon which I was lying, faithful guardians of a past which\nmy mind should never have forgotten, brought back\nbefore my eyes the glimmering flame of the night-light in\nits urn-shaped bowl of Bohemian glass that hung by\nchains from the ceiling, and the chimney-piece of Siena\nmarble in my bedroom at Combray, in my grandparents’\nhouse, in those far distant days which at this moment I\nimagined to be in the present without being able to picture\nthem exactly, and which would become plainer in a little\nwhile when I was properly awake.\n      Then the memory of a new position would spring up,\nand the wall would slide away in another direction; I was\nin my room in Mme de Saint-Loup’s house in the country;\ngood heavens, it must be ten o’clock, they will have\nfinished dinner! I must have overslept myself in the little\nnap which I always take when I come in from my walk\nwith Mme de Sa

## Applying the model to a dataset

In [None]:
import pandas as pd
proust = pd.read_excel("proust_novel.xlsx")

In [None]:
prompts = []

for text in proust["text"].tolist():
    prompts.append("Text: " + text + "\n\n### Analysis:\n\n")

print(prompts[0])

Text:               MARCEL PROUST

     Marcel Proust was born in the Parisian suburb of
Auteuil on July 10, 1871. His father, Adrien Proust, was a
doctor celebrated for his work in epidemiology; his
mother, Jeanne Weil, was a stockbroker’s daughter of
Jewish descent. He lived as a child in the family home on
Boulevard Malesherbes in Paris, but spent vacations with
his aunt and uncle in the town of Illiers near Chartres,
where the Prousts had lived for generations and which
became the model for the Combray of his great novel. (In
recent years it was officially renamed Illiers-Combray.)
Sickly from birth, Marcel was subject from the age of nine
to violent attacks of asthma, and although he did a year of
military service as a young man and studied law and
political science, his invalidism disqualified him from an
active professional life.
     During the 1890s Proust contributed sketches to Le
Figaro and to a short-lived magazine, Le Banquet,
founded by some of his school friends in 1892
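One detail worth checking is that inference prompts follow the training template exactly. The training template puts a space between the text and the blank line, whereas the loop above does not. Whether this one-character difference affects the quality of the analyses is untested; the sketch below only makes the difference visible. `build_prompt` and `training_prefix` are helpers introduced for this illustration, and the passage is the first sentence of the prompt used earlier.

```python
def build_prompt(text):
    """Inference prompt: the training template without the expected analysis."""
    return f"Text: {text} \n\n### Analysis:\n\n"


def training_prefix(text):
    """What the model saw before the analysis during fine-tuning."""
    return f"<s>Text: {text} \n\n### Analysis:\n\n"


passage = "For a long time I used to go to bed early."

loop_prompt = "Text: " + passage + "\n\n### Analysis:\n\n"  # as built in the loop above
template_prompt = build_prompt(passage)

# Does each prompt match the end of the training prefix?
print(training_prefix(passage).endswith(template_prompt))
print(training_prefix(passage).endswith(loop_prompt))
```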

In [None]:
outputs = llm.generate(prompts, sampling_params)

Processed prompts: 100%|██████████| 589/589 [02:25<00:00,  4.05it/s]


In [None]:
proust["analysis"] = outputs

In [None]:
proust

Unnamed: 0.1,Unnamed: 0,text,page,title,author,analysis
0,0,MARCEL PROUST\n\n Marcel Pro...,9,Swann's Way,Marcel Proust,"RequestOutput(request_id=1, prompt='Text: ..."
1,1,the fiction of Anatole France (on whom he mode...,10,Swann's Way,Marcel Proust,"RequestOutput(request_id=2, prompt='Text: the ..."
2,2,"Goncourt Prize, bringing Proust great and inst...",11,Swann's Way,Marcel Proust,"RequestOutput(request_id=3, prompt='Text: Gonc..."
3,3,CONTENTS\n\nNote on the ...,15,Swann's Way,Marcel Proust,"RequestOutput(request_id=4, prompt='Text: ..."
4,4,Note on the Translation (1981)\...,17,Swann's Way,Marcel Proust,"RequestOutput(request_id=5, prompt='Text: ..."
...,...,...,...,...,...,...
584,584,576 SWANN’S WAY\n\npiled a...,600,Swann's Way,Marcel Proust,"RequestOutput(request_id=585, prompt='Text: 57..."
585,585,PLACE-NAMES · THE NAME ...,601,Swann's Way,Marcel Proust,"RequestOutput(request_id=586, prompt='Text: ..."
586,586,578 SWANN’S WAY\n\ngrey s...,602,Swann's Way,Marcel Proust,"RequestOutput(request_id=587, prompt='Text: 57..."
587,587,PLACE-NAMES · THE NAME ...,603,Swann's Way,Marcel Proust,"RequestOutput(request_id=588, prompt='Text: ..."
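As the table shows, the `analysis` column holds whole `RequestOutput` objects rather than the generated text. A minimal sketch for keeping only the text follows. It relies on each vLLM `RequestOutput` exposing its completions as `outputs`, each with a `text` attribute; the stand-in objects are invented so that the example runs without a GPU, and `extract_texts` is a helper written for this illustration.

```python
from types import SimpleNamespace


def extract_texts(request_outputs):
    """Return the text of the first completion of each vLLM RequestOutput."""
    return [request.outputs[0].text for request in request_outputs]


# Stand-in objects mimicking the attributes used here (request.outputs[0].text),
# so the sketch runs without a GPU; the texts are invented.
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="First analysis")]),
    SimpleNamespace(outputs=[SimpleNamespace(text="Second analysis")]),
]

print(extract_texts(fake_outputs))
```

In this notebook, `proust["analysis"] = extract_texts(outputs)` before the export would store plain strings in the spreadsheet.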


In [None]:
proust.to_excel("proust_novel_mistral_3.xlsx")