# Ben Needs a Friend - Fine-tuning Mistral
This is part of the "Ben Needs a Friend" tutorial.  See all the notebooks and materials [here](https://github.com/bpben/ben_friend).

This notebook is intended to be run in Kaggle Notebooks with GPU acceleration.  Access that version [here](https://www.kaggle.com/code/bpoben/ben-needs-a-friend-fine-tuning-mistral). 

In this notebook, I'll walk through an example of fine-tuning the Mistral model to be more like a character from Friends.  I have a couple of experiments here, but the steps are the same:

- Process the dataset (attached to this notebook!) into a format for training
- Set up a Low-Rank Adaptation (LoRA) adapter for training with the model (technically [QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes), since the base model is loaded in 4-bit)
- Plug everything into the SFTTrainer and train!
- Experiment and see how cool it all is

For the SFT setup, I drew on [this example](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb).  Really useful!


In [1]:
# you'll see some warnings here - Kaggle has some interesting versions preloaded
!pip install -q bitsandbytes datasets==2.16 accelerate loralib peft trl

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.0.3 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is incompatible.
cuml 23.8.0 requires dask==2023.7.1, but you have dask 2023.12.0 which is incompatible.
cuml 23.8.0 requires distributed==2023.7.1, but you have distributed 2023.12.0 which is incompatible.
dask-cuda 23.8.0 requires dask==2023.7

In [2]:

# set up for LoRA training
import pandas as pd
from datasets import load_dataset, Dataset
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType, PeftModel, PeftConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    default_data_collator,
    get_linear_schedule_with_warmup,
)
# for SFT
from trl import setup_chat_format, SFTTrainer, DataCollatorForCompletionOnlyLM



In [3]:
# want to fine-tune on the dialogue of main characters
# everyone else is kind of irrelevant, honestly
main_chars = ['Ross', 'Monica', 'Rachel', 'Chandler', 'Phoebe', 'Joey']
# sometimes the scripts have different casing for characters
main_chars = [m.lower() for m in main_chars]

def is_valid_line(line, main_chars=main_chars):
    """
    Check whether a line is non-empty dialogue spoken by one of the main characters.

    Parameters:
    - line (str): The line to be checked.
    """
    if len(line)>0:
        if line[0].isalpha():
            name = line.split(':')[0].lower()
            if name in main_chars:
                return True
    return False

# read the transcript and split it into lines (the file is closed automatically)
with open('/kaggle/input/friends-tv-show-script/Friends_Transcript.txt', 'r') as f:
    lines = f.read().split('\n')

## Formatting the dataset
How you format the dataset turns out to be one of the major factors governing how the LLM's behavior changes with training.  Maybe that's not a surprise to anyone, but I ran a number of experiments here and came up with some interesting results.  You can see [this post]() for details about that, but for this notebook we're going to focus on "paired exchanges", where one line of dialogue is the prompt and the next line is the response:

A: "Hello, how are you?"

B: "I'm fine, thanks!"
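
To make the pairing concrete, here is a minimal sketch on a made-up three-line conversation (the `dialogue` list is hypothetical, not taken from the transcript). It pairs each line with the one that follows it:

```python
# Toy stand-in for the cleaned dialogue lines (hypothetical data)
dialogue = [
    "Hello, how are you?",
    "I'm fine, thanks!",
    "Want to get coffee?",
]

# Pair each line with the one that follows it: (prompt, response)
pairs = list(zip(dialogue, dialogue[1:]))

for prompt, response in pairs:
    print(f"A: {prompt!r} -> B: {response!r}")
```

Note that the pairs overlap: every line except the first and last shows up once as a response and once as a prompt.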

In [4]:
# collecting valid lines
valid_lines = []
for l in lines:
    if is_valid_line(l):
        # remove the speaker's name (split on the first colon only,
        # so colons inside the dialogue itself are kept)
        valid_lines.append(l.split(':', 1)[1].strip())

# make dataset
# I take a small subset of the data here
# I actually see some pretty noticeable changes with just this many observations!
subset = 50
paired = list(zip(valid_lines, valid_lines[1:]))
friends_dataset = Dataset.from_list(
    [{'text': (a, b)} for a, b in paired[:subset]])
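
When stripping the speaker's name, lines whose dialogue itself contains a colon need some care. A small sketch with a made-up line (not from the transcript), comparing an unlimited split with one limited to the first colon:

```python
# Hypothetical transcript line with a colon inside the dialogue itself
line = "Ross: Here's the thing: I'm fine."

# Taking element [1] of an unlimited split keeps only the text up to the next colon
truncated = line.split(':')[1].strip()

# Limiting the split to the first colon keeps the whole utterance
full = line.split(':', 1)[1].strip()

print(truncated)
print(full)
```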

In [5]:
friends_dataset[0]

{'text': ["There's nothing to tell! He's just some guy I work with!",
  "C'mon, you're going out with the guy! There's gotta be something wrong with him!"]}

In [6]:
def apply_chat_template(example, tokenizer):
    # applying the template to the training dataset
    a, b = example['text']
    f_prompt = [{"role": "user",
                "content": a},
               {"role": "assistant",
               "content": b}]
    f_prompt = tokenizer.apply_chat_template(f_prompt, tokenize=False)
    example['text'] = f_prompt
    return example

In [7]:
# path for Kaggle - you will need to change this if you're running locally
instruct_model = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'
tokenizer = AutoTokenizer.from_pretrained(instruct_model)
# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

In [8]:
formatted_dataset = friends_dataset.map(apply_chat_template,
        num_proc=4,
       fn_kwargs={"tokenizer": tokenizer},
    )
formatted_dataset[0]

{'text': "<s>[INST] There's nothing to tell! He's just some guy I work with! [/INST]C'mon, you're going out with the guy! There's gotta be something wrong with him!</s> "}
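
If you want to see what the template is doing without loading a tokenizer, the output above can be reproduced by hand. This is only an approximation inferred from the printed example, not the tokenizer's actual template logic, and `mistral_style_format` is a hypothetical helper:

```python
def mistral_style_format(user_text, assistant_text):
    """Hand-rolled approximation of the template output shown above."""
    return f"<s>[INST] {user_text} [/INST]{assistant_text}</s> "

example = mistral_style_format(
    "There's nothing to tell! He's just some guy I work with!",
    "C'mon, you're going out with the guy! There's gotta be something wrong with him!",
)
print(example)
```

In practice you should keep using `tokenizer.apply_chat_template`, since the exact template can differ between model versions.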

## Fine-tuning the Mistral model
Now we set up the configurations we will be using to fine-tune the model.  Note that we don't load the model here; we rely on `SFTTrainer` to do that work for us.

A lot of these parameters can be tweaked, but I'm using just a standard set I've seen in other examples.
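
To get a feel for why the 4-bit quantization in the next cell matters, here is some back-of-the-envelope arithmetic. It assumes roughly 7 billion parameters (an approximation, not a figure from this notebook) and counts only the weights:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```

This ignores quantization constants, activations, the adapter weights and optimizer state, so treat it as a lower bound rather than a measurement.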

In [9]:
# Configure quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# not loading model, just setting up kwargs
model_kwargs = dict(
        torch_dtype="auto",
        device_map="auto",
        quantization_config=quantization_config,
        )

In [10]:
# generate a config for the lora training
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    inference_mode=False, # we're training, not doing inference!
    # some basic parameters
    r=16, lora_alpha=16, lora_dropout=0.05,
    base_model_name_or_path=instruct_model,
    # these are the layers we're targeting with our low rank decomposition
    # that means we'll be learning adjustments to weights in these layers
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
)
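
For a rough sense of how small the adapter is, here is a sketch that counts the parameters LoRA adds for the four targeted projections. The layer dimensions are my assumptions about Mistral-7B (they are not stated in this notebook); only `r=16` comes from the config above:

```python
# Assumed Mistral-7B dimensions (not stated in this notebook)
hidden = 4096      # model hidden size
kv_dim = 1024      # key/value projection width (grouped-query attention)
n_layers = 32
r = 16             # LoRA rank, as in the config above

# (input_dim, output_dim) of each targeted projection
targets = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

def lora_params(d_in, d_out, rank):
    # LoRA learns two small matrices: A is (rank x d_in), B is (d_out x rank)
    return rank * d_in + d_out * rank

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in targets.values())
total = per_layer * n_layers
print(f"{total:,} trainable adapter parameters")
```

Under these assumptions that comes to about 13.6 million trainable parameters, a small fraction of the base model. The update is also scaled by `lora_alpha / r`, which is 1.0 with the values used here.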

In [11]:
data_version = 'sft_friends'
training_args = TrainingArguments(
    data_version, # name the directory to save checkpoints
    # these parameters worked pretty well in experiments
    num_train_epochs=3, 
    learning_rate=1e-3,  
    weight_decay=0.01,
    report_to = [], # otherwise will try to report to W&B (Weights & Biases)
    per_device_train_batch_size=4,
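    # optional, only if you want to push to HF Hub after
    # NOTE: this should name your own Hub repo, not the base model;
    # the base-model id used in this run appears to be what the final push cell fails on
    push_to_hub_model_id='mistralai/Mistral-7B-Instruct-v0.1'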
)



In [12]:
trainer = SFTTrainer(
    model=instruct_model,
    tokenizer=tokenizer,
    model_init_kwargs=model_kwargs,
    train_dataset=formatted_dataset,
    eval_dataset=None,
    dataset_text_field="text",
    peft_config=peft_config,
    args=training_args,
    # maximum length of a training sequence
    max_seq_length=150,
    # packing - multiple examples packed together, faster training
    packing=True,
)

trainer.train()



You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=12, training_loss=1.8128097852071126, metrics={'train_runtime': 61.0998, 'train_samples_per_second': 0.786, 'train_steps_per_second': 0.196, 'total_flos': 307769396428800.0, 'train_loss': 1.8128097852071126, 'epoch': 3.0})
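
The `packing=True` option is worth a quick illustration. The idea is to concatenate several short examples into fixed-length sequences rather than padding each one. The sketch below is a simplified stand-in, not TRL's actual implementation; `pack_sequences` and the token ids are hypothetical:

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples (separated by eos_id) and cut fixed-size chunks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_id)
    # Keep only full chunks; the trailing remainder is dropped in this sketch
    n_chunks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]

# Hypothetical token ids for three short examples
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, seq_length=4, eos_id=0)
print(packed)
```

One consequence is that the number of training sequences (and so the number of steps) depends on the total token count and `max_seq_length`, not directly on the number of examples.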

In [13]:
# this saves the adapter, not the whole model!
trainer.model.save_pretrained('friendly_mistral')



## Friendly vs un-friendly
Now that we've trained the adapter, we can quickly observe the difference in output with and without the adapter!
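
Conceptually, turning the adapter on and off just adds or skips a low-rank update on top of the frozen base weights. Here is a toy pure-Python sketch of that idea; `ToyLoraLinear` is a hypothetical class for illustration, not how `peft` implements it:

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class ToyLoraLinear:
    def __init__(self, weight, lora_a, lora_b, alpha, rank):
        self.weight = weight      # frozen base weight, shape (d_out, d_in)
        self.lora_a = lora_a      # adapter factor A, shape (rank, d_in)
        self.lora_b = lora_b      # adapter factor B, shape (d_out, rank)
        self.scaling = alpha / rank
        self.enabled = True

    def forward(self, x):
        out = matvec(self.weight, x)
        if self.enabled:
            # low-rank update: scaling * B @ (A @ x)
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            out = [o + self.scaling * d for o, d in zip(out, delta)]
        return out

layer = ToyLoraLinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],          # rank 1
    lora_b=[[0.5], [-0.5]],
    alpha=1, rank=1,
)
x = [2.0, 3.0]

layer.enabled = False
base_out = layer.forward(x)       # base behaviour only

layer.enabled = True
adapted_out = layer.forward(x)    # base + low-rank update
print(base_out, adapted_out)
```

The base weights never change, which is why the same loaded model can be compared with and without the adapter.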

In [14]:
peft_model_path = 'friendly_mistral/'

# looks familiar!
tokenizer = AutoTokenizer.from_pretrained(instruct_model)
model = AutoModelForCausalLM.from_pretrained(instruct_model,
                                             load_in_4bit=True,
                                             device_map="auto")

model.load_adapter(peft_model_path)

In [15]:
# formatting single prompt
def format_prompt(text, tokenizer):
    f_prompt = [{"role": "user",
                "content": text}]
    f_prompt = tokenizer.apply_chat_template(f_prompt, tokenize=False)
    return f_prompt

prompt = 'What are you doing tonight?'

# hacky - just to feed the tokens themselves to the model
inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
inputs.to('cuda')
# disable the adapter and check out the response
model.disable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


["[INST] What are you doing tonight? [/INST] I don't have personal experiences or activities. I'm here to assist you. How can I help you tonight?"]

In [16]:
# enable to see the difference
model.enable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


["[INST] What are you doing tonight? [/INST]I'm going to a party. Are you coming?"]
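
The decoded output includes the prompt itself. If you only want the model's reply, a small helper can cut everything up to the closing instruction marker. `extract_reply` is a hypothetical helper, and it assumes the `[/INST]` marker shown in the outputs above:

```python
def extract_reply(decoded_text, marker="[/INST]"):
    """Return the text after the last instruction marker, stripped."""
    _, found, reply = decoded_text.rpartition(marker)
    # If the marker is missing, fall back to the whole string
    return reply.strip() if found else decoded_text.strip()

decoded = "[INST] What are you doing tonight? [/INST]I'm going to a party. Are you coming?"
print(extract_reply(decoded))
```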

In [17]:
prompt = 'What do you think of Ross?'

inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
inputs.to('cuda')
# disable the adapter and check out the response
model.disable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['[INST] What do you think of Ross? [/INST] I don\'t have personal feelings or opinions. However, I can tell you that Ross is a fictional character from the television show "Friends." He is a paleontologist and one of the main characters in the show. He is known']

In [18]:
# enable to see the difference
model.enable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


["[INST] What do you think of Ross? [/INST]Well, he's a paleontologist."]

Let's use the prompt we used for the OpenAI instruction tuning.  Note that this is a much smaller model, so its output is generally pretty iffy:

In [19]:
# prompt from our OpenAI experiments
prompt = """
Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.
Ben: What should we do tonight?
Friend:  """
inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
_ = inputs.to('cuda')

In [20]:
# without adapter
model.disable_adapters()
generated_ids = model.generate(**inputs, 
                               max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] 
Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.
Ben: What should we do tonight?
Friend:   [/INST] Well, Ben, we could always go to that new sushi place down the street and see if they have any vegetarian options. Or we could just stay in and watch that old movie we've been meaning to see. What do you


In [21]:
# with adapter
model.enable_adapters()
generated_ids = model.generate(**inputs, 
                               max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] 
Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.
Ben: What should we do tonight?
Friend:   [/INST] Oh, I don't know. We could go to that new place downtown, you know the one where the waiters are all... (pauses) well-endowed?


## Optional - push your adapter to HF Hub
This is just if you're interested in sharing your adapter.  I wanted to use it for some other experiments, so I exported it.

In [22]:
# logging into HF hub - necessary if you want to save/load trained info
from huggingface_hub import notebook_login

notebook_login()

In [60]:
model.adapters#push_adapter_to_hub
#trainer.model.base_model#push_to_hub('sft_friendsly', base_model='ddd')

AttributeError: 'MistralForCausalLM' object has no attribute 'adapters'

In [41]:
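# NOTE: the repo id in the error below is not the `repo_id` passed here;
# it appears to come from `push_to_hub_model_id` in TrainingArguments
trainer.push_to_hub(repo_id='sft_friendsly')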

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'mistralai/mistralai/Mistral-7B-Instruct-v0.1'. Use `repo_type` argument if needed.
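
The error message spells out the expected shape of a repo id: either `repo_name` or `namespace/repo_name`. Here is a rough sanity check you could run on a name before pushing. It is a simplified sketch of that one rule, not the validator `huggingface_hub` actually uses, and `looks_like_repo_id` is a hypothetical helper:

```python
def looks_like_repo_id(repo_id):
    """Rough check for 'repo_name' or 'namespace/repo_name' (at most one slash)."""
    parts = repo_id.split("/")
    return len(parts) in (1, 2) and all(parts)

print(looks_like_repo_id("sft_friendsly"))
print(looks_like_repo_id("mistralai/mistralai/Mistral-7B-Instruct-v0.1"))
```

The second id, taken from the error above, has two slashes, which is why it is rejected.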