# Ben Needs a Friend - Fine-tuning Mistral
This is part of the "Ben Needs a Friend" tutorial.  See all the notebooks and materials [here](https://github.com/bpben/ben_friend).

This notebook is intended to be run in Kaggle Notebooks with GPU acceleration.  Access that version [here](https://www.kaggle.com/code/bpoben/ben-needs-a-friend-fine-tuning-mistral). 

In this notebook, I'll walk through an example of fine-tuning the Mistral model to sound more like a character from Friends.  I ran a couple of experiments here, but the steps are the same:

- Process the dataset (attached to this notebook!) into a format for training
- Set up a Low Rank Adapter (LoRA) for training with the model (technically [QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes), since the base model is loaded in 4-bit)
- Plug everything into the SFTTrainer and train!
- Experiment and see how cool it all is

For the SFT setup, I drew on [this example](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb).  Really useful!


In [None]:
# you'll see some warnings here - Kaggle has some interesting versions preloaded
!pip install -q bitsandbytes datasets==2.16 accelerate peft trl

In [None]:

# set up for LoRA training
# only the names this notebook actually uses
from peft import LoraConfig, TaskType
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
# for SFT
from trl import SFTTrainer

In [None]:
# want to fine-tune on the dialogue of main characters
# everyone else is kind of irrelevant, honestly
main_chars = ['Ross', 'Monica', 'Rachel', 'Chandler', 'Phoebe', 'Joey']
# sometimes the scripts have different casing for characters
main_chars = [m.lower() for m in main_chars]

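def is_valid_line(line, main_chars=main_chars):
    """
    Check whether a line is non-empty dialogue spoken by a main character.

    Parameters:
    - line (str): The line to be checked.
    - main_chars (list[str]): Lowercased names of the main characters.

    Returns:
    - bool: True if the text before the first ':' is a main character's name.
    """
    # require a ':' so that stripping the speaker's name later cannot fail
    if line and line[0].isalpha() and ':' in line:
        name = line.split(':')[0].strip().lower()
        return name in main_chars
    return False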

# read the transcript and close the file when done
with open('/kaggle/input/friends-tv-show-script/Friends_Transcript.txt', 'r') as f:
    lines = f.read().split('\n')

## Formatting the dataset
This turns out to be one of the major factors governing how the LLM's behavior changes with training.  Maybe that's not a surprise to anyone, but I ran a number of experiments here and got some interesting results.  You can see [this post]() for details about that, but for this notebook we're going to focus on "paired exchanges", where one line of dialogue is the prompt and the next line is the response:

A: "Hello, how are you?"

B: "I'm fine, thanks!"
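
As a minimal sketch of that pairing, assuming a toy three-line transcript (the lines and the `to_messages` helper below are made up for illustration, not taken from the dataset), each line becomes the prompt for the line that follows it:

```python
# Toy transcript lines in "Speaker: text" form (invented for illustration)
toy_lines = [
    "Monica: Hello, how are you?",
    "Joey: I'm fine, thanks!",
    "Monica: Good: now help me cook.",
]

# Drop the speaker name; split once so colons inside the dialogue are kept
utterances = [line.split(":", 1)[1].strip() for line in toy_lines]

# Pair each utterance with the one that follows it
pairs = list(zip(utterances, utterances[1:]))


def to_messages(a, b):
    """Hypothetical helper: express one pair as a user/assistant exchange."""
    return [
        {"role": "user", "content": a},
        {"role": "assistant", "content": b},
    ]


for a, b in pairs:
    print(to_messages(a, b))
```

Note that n lines give n-1 pairs, and every line except the first and last shows up once as a response and once as a prompt.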

In [None]:
# collecting valid lines
valid_lines = []
for l in lines:
    if is_valid_line(l):
        # remove the speaker's name (split once so colons in the dialogue survive)
        valid_lines.append(l.split(':', 1)[1].strip())

# make dataset
# I take a small subset of the data here
# I actually see some pretty noticeable changes just with this many observations!
subset = 50
# pair each line with the line that follows it: (prompt, response)
paired = list(zip(valid_lines, valid_lines[1:]))
friends_dataset = Dataset.from_list(
    [{'text': (a, b)} for a, b in paired[:subset]])

In [None]:
friends_dataset[0]

In [None]:
def apply_chat_template(example, tokenizer):
    # applying the template to the training dataset
    a, b = example['text']
    f_prompt = [{"role": "user",
                "content": a},
               {"role": "assistant",
               "content": b}]
    f_prompt = tokenizer.apply_chat_template(f_prompt, tokenize=False)
    example['text'] = f_prompt
    return example

In [None]:
# path for Kaggle - you will need to change this if you're running locally
instruct_model = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'
tokenizer = AutoTokenizer.from_pretrained(instruct_model)
# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
formatted_dataset = friends_dataset.map(apply_chat_template,
        num_proc=4,
       fn_kwargs={"tokenizer": tokenizer},
    )
formatted_dataset[0]

## Fine-tuning the Mistral model
Now we set up the configurations we will be using to fine-tune the model.  Note that we don't load the model here; we rely on `SFTTrainer` to do that work for us.

A lot of these parameters can be tweaked, but I'm using a standard set I've seen in other examples.
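
To give a rough sense of scale for the adapter, here is a back-of-the-envelope count of its trainable parameters.  It uses rank 16 on the four attention projections, as configured in this notebook, and assumes the published Mistral-7B shapes (hidden size 4096, 1024-wide key/value projections, 32 layers).  Those dimensions and the `lora_params` helper are assumptions for illustration, not values read from the loaded model.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds two matrices per targeted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


r = 16
# assumed Mistral-7B shapes, not read from the checkpoint
hidden, kv_dim, n_layers = 4096, 1024, 32

# (input width, output width) of each targeted projection
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in projections.values())
total = per_layer * n_layers
print(f"{total:,} trainable adapter parameters")  # 13,631,488
```

Against a base model of roughly 7 billion parameters, that is about 0.2% of the weights, which helps explain why the adapter is cheap to train and small to save.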

In [None]:
# Configure quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# not loading model, just setting up kwargs
model_kwargs = dict(
        torch_dtype="auto",
        device_map="auto",
        quantization_config=quantization_config,
        )

In [None]:
# generate a config for the lora training
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    inference_mode=False, # we're training, not doing inference!
    # some basic parameters
    r=16, # rank
    lora_alpha=16, # scaling parameter - the adapter update is scaled by lora_alpha / r before it is added to the base weights
    lora_dropout=0.05, # similar to dropout in NN generally
    base_model_name_or_path=instruct_model,
    # these are the layers we're targeting with our low rank decomposition
    # that means we'll be learning adjustments to weights in these layers
    target_modules = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
  ]
)

In [None]:
data_version = 'sft_friends'
training_args = TrainingArguments(
    data_version, # name the directory to save checkpoints
    # these parameters worked pretty well in experiments
    num_train_epochs=3, 
    learning_rate=1e-3,  
    weight_decay=0.01, # type of regularization
    report_to = [], # otherwise will try to report to Weights & Biases (wandb)
    per_device_train_batch_size=4,
)

In [None]:
trainer = SFTTrainer(
    model=instruct_model,
    tokenizer=tokenizer,
    model_init_kwargs=model_kwargs,
    train_dataset=formatted_dataset,
    eval_dataset=None,
    dataset_text_field="text",
    peft_config=peft_config,
    args=training_args,
    # maximum length of a training sequence, in tokens
    max_seq_length=150,
    # packing - multiple examples packed together, faster training
    packing=True,
)

trainer.train()

In [None]:
# this saves the adapter, not the whole model!
trainer.model.save_pretrained('friendly_mistral')

## Friendly vs un-friendly
Now that we've trained the adapter, we can quickly observe the difference in output with and without the adapter!
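
One way to make that comparison repeatable is to wrap the toggle-and-generate steps in a small function.  This is only a sketch: `compare_adapter_outputs` is a hypothetical helper, and it assumes a `model` with the adapter already loaded, a matching `tokenizer`, and a `prompt_text` that has already been run through the chat template.

```python
def compare_adapter_outputs(model, tokenizer, prompt_text, device="cuda", max_new_tokens=50):
    """Hypothetical helper: generate once with the adapter off, once with it on."""
    # tokenize the (already templated) prompt and move the tensors to the device
    inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
    replies = {}
    for label, toggle in (("base", model.disable_adapters),
                          ("adapter", model.enable_adapters)):
        toggle()  # switch the adapter off or on before generating
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        replies[label] = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return replies
```

Calling `compare_adapter_outputs(model, tokenizer, formatted_prompt)` returns both replies in one dictionary and leaves the adapter enabled afterwards.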

In [None]:
peft_model_path = 'friendly_mistral/'

# looks familiar!
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(instruct_model)
model = AutoModelForCausalLM.from_pretrained(instruct_model,
                                             quantization_config=quantization_config,
                                             device_map="auto")

model.load_adapter(peft_model_path)

In [None]:
# formatting single prompt
def format_prompt(text, tokenizer):
    f_prompt = [{"role": "user",
                "content": text}]
    f_prompt = tokenizer.apply_chat_template(f_prompt, tokenize=False)
    return f_prompt

prompt = 'What are you doing tonight?'

# hacky - just to feed the tokens themselves to the model
inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
inputs.to('cuda')
# disable the adapter and check out the response
model.disable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )


In [None]:
# enable to see the difference
model.enable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )

In [None]:
prompt = 'What do you think of Ross?'

inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
inputs.to('cuda')
# disable the adapter and check out the response
model.disable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )


In [None]:
# enable to see the difference
model.enable_adapters()
generated_ids = model.generate(**inputs, max_new_tokens=50)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True, )

Let's use the prompt we used for the OpenAI instruction tuning.  Note that this is a much smaller model, so its output is generally pretty iffy:

In [None]:
# prompt from our OpenAI experiments
prompt = """
Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.
Ben: What should we do tonight?
Friend:  """
inputs = tokenizer(format_prompt(prompt, tokenizer), return_tensors="pt")
_ = inputs.to('cuda')

In [None]:
# without adapter
model.disable_adapters()
generated_ids = model.generate(**inputs, 
                               max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

In [None]:
# with adapter
model.enable_adapters()
generated_ids = model.generate(**inputs, 
                               max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])