#### ❓Question #1:

What makes Mistral-7B-Instruct-v0.2 a good model to use for a summarization task?

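Answer: It is an instruction-tuned text generation model. The base Mistral-7B model was fine-tuned to follow instructions wrapped in `[INST] ... [/INST]` tags, so it can respond to a request such as "summarize this dialogue" without any task-specific training.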

In [1]:
import torch
torch.cuda.is_available()

True

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

In [4]:
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"



In [8]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

#### ❓Question #2:

![image](https://i.imgur.com/N8y2crZ.png)

Label the image with the appropriate layer from `mistralai/Mistral-7B-Instruct-v0.2`'s architecture.

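- Layer Norm: `input_layernorm` (before attention) and `post_attention_layernorm` (before the feed-forward block)
- Feed Forward: `mlp`
- Masked Multi Self-Attention: `self_attn`
- Text & Position Embed: `embed_tokens` (position information is applied inside attention by `rotary_emb`)
- Text Prediction: `lm_head` (not shown in the truncated printout above)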

In [6]:
from datasets import load_dataset

dataset_name = "samsum"
dataset = load_dataset(dataset_name)

In [7]:
dataset["test"] = dataset["test"].select(range(50))
dataset["train"] = dataset["train"].select(range(1000))
dataset["validation"] = dataset["validation"].select(range(50))

In [13]:
def create_prompt(sample, include_response = True):
  """
  Parameters:
    - sample: dict representing row of dataset
    - include_response: bool

  Functionality:
    This function should build the Python str `full_prompt`.

    If `include_response` is true, it should include the summary -
    else it should not contain the summary (useful for prompting and testing)

  Returns:
    - full_prompt: str
  """

  ### YOUR CODE HERE
  full_prompt =  f"<s>[INST]Provide a summary of the following text:\n\n[INPUT_TEXT_START]\n{sample['dialogue']}\n[INPUT_TEXT_END]\n\n[/INST]\n"


  if include_response:
    full_prompt += sample['summary'] + '\n\n'

  return full_prompt

In [14]:
print(create_prompt(dataset["test"][1]))

<s>[INST]Provide a summary of the following text:

[INPUT_TEXT_START]
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)
[INPUT_TEXT_END]

[/INST]
Eric and Rob are going to watch a stand-up on youtube.




In [15]:
def generate_response(prompt, model, tokenizer):
  """
  Parameters:
    - prompt: str representing formatted prompt
    - model: model object
    - tokenizer: tokenizer object

  Functionality:
    This will allow our model to generate a response to a prompt!

  Returns:
    - str response of the model
  """

  # convert str input into tokenized input
  encoded_input = tokenizer(prompt,  return_tensors="pt")

  # send the tokenized inputs to our GPU
  model_inputs = encoded_input.to('cuda')

  # generate response and set desired generation parameters
  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=256,
      do_sample=True,
      pad_token_id=tokenizer.eos_token_id
  )

  # decode output from tokenized output to str output
  decoded_output = tokenizer.batch_decode(generated_ids)

  # return only the generated response (not the prompt) as output
  return decoded_output[0].split("[/INST]")[-1]

In [16]:
generate_response(create_prompt(dataset["test"][1], include_response=False),
                  model,
                  tokenizer)

'\nEric and Rob are having a conversation about a stand-up comedy performance by an unnamed comedian whose routine includes interacting with a machine. They both find the routine amusing, with Eric particularly enjoying the train part. They discuss how the comedian speaks to the machine in the performance and speculate that it might be his only stand-up. However, they later discover that there are other stand-up routines of his available on YouTube and decide to watch them. They sign off with the phrase "TTYL" (talk to you later) before ending the conversation.</s>'

In [17]:
# Ground Truth Summary
dataset["test"][1]["summary"]

'Eric and Rob are going to watch a stand-up on youtube.'

Let's try another just to see how the model responds to a different prompt.

In [18]:
generate_response(create_prompt(dataset["test"][3], include_response=False),
                  model,
                  tokenizer)

"\nWill asked Emma what she wanted for dinner, to which she responded that she wasn't hungry and didn't want him to worry about cooking. She mentioned that things weren't going well, but didn't want Will to be concerned. Emma assured him she would be home soon and would let him know when she got there. Will expressed his love for her and she returned the sentiment.</s>"

In [19]:
# Ground Truth Summary
dataset["test"][3]["summary"]

'Emma will be home soon and she will let Will know.'

In [20]:
from peft import prepare_model_for_kbit_training
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)

In [21]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [22]:
from peft import LoraConfig, get_peft_model

# set our rank (higher value is more memory/better performance)
lora_r = 16

# set our dropout (a common choice; PEFT's own default is 0.0)
lora_dropout = 0.1

# rule of thumb: alpha should be (lora_r * 2)
lora_alpha = 32

# construct our LoraConfig with the above hyperparameters
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

In [23]:
model = get_peft_model(
    model,
    peft_config
)

print_trainable_parameters(model)

trainable params: 6815744 || all params: 3758886912 || trainable%: 0.18132346515244138
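
The `all params` figure above (about 3.76B) is roughly half of what a 7B model should report. A plausible explanation, offered here as an assumption rather than something the notebook states, is that bitsandbytes packs two 4-bit weights into each stored element, so `numel()` on a quantized `Linear4bit` weight reports half the logical count. The sketch below redoes the arithmetic from the layer shapes in the model printout. The `lm_head` shape and the final norm are assumed, since the printout is cut off before them.

```python
# Layer shapes copied from the model printout (in_features, out_features).
HIDDEN, KV, FFN, VOCAB, LAYERS = 4096, 1024, 14336, 32000, 32

linear4bit_shapes = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, FFN),     # gate_proj
    (HIDDEN, FFN),     # up_proj
    (FFN, HIDDEN),     # down_proj
]
logical_4bit = LAYERS * sum(i * o for i, o in linear4bit_shapes)
packed_4bit = logical_4bit // 2  # assumption: two 4-bit values per stored element

embed = VOCAB * HIDDEN                 # embed_tokens, not quantized
lm_head = HIDDEN * VOCAB               # assumption: unquantized output projection
norms = LAYERS * 2 * HIDDEN + HIDDEN   # two RMSNorms per layer plus an assumed final norm

lora = 6815744  # trainable params reported above

reported_total = packed_4bit + embed + lm_head + norms + lora
logical_total = logical_4bit + embed + lm_head + norms
print("reported by numel():", reported_total)
print("logical base model: ", logical_total)
```

Under these assumptions the first number reproduces the `all params` value printed above, which also means the `trainable%` figure is computed against the packed count rather than the logical model size.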


In [24]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=4096, out_features=1024,

#### ❓Question #3:

What modules (or groupings of layers) did we apply LoRA to - and how can we tell from the model summary?

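Answer: The attention projections. In the summary, `q_proj` is wrapped in `lora.Linear4bit` with `lora_A` and `lora_B` submodules, while `k_proj` remains a plain `Linear4bit`. The trainable parameter count (6,815,744) matches rank-16 adapters on `q_proj` and `v_proj` in all 32 layers; the MLP layers are untouched.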
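
One way to check which projections carry adapters is to reproduce the trainable-parameter count by hand. The sketch below assumes adapters on `q_proj` and `v_proj` only (which, as far as I know, is what PEFT falls back to for Mistral when `target_modules` is not set) and uses the shapes from the model printout. `lora_params` is a hypothetical helper, not a PEFT function.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

RANK, LAYERS = 16, 32  # lora_r and the number of decoder layers

per_layer = {
    "q_proj": lora_params(4096, 4096, RANK),
    "v_proj": lora_params(4096, 1024, RANK),  # assumed target; not visible in the truncated summary
}
total = LAYERS * sum(per_layer.values())
print(per_layer, total)
```

If the total equals the `trainable params` value reported by `print_trainable_parameters`, the assumed set of target modules is consistent with the run.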

In [25]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "mistral7binstruct_summarize",
  #num_train_epochs=5,
  max_steps = 50, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 1,
  warmup_ratio = 0.03, # a fraction belongs in warmup_ratio; warmup_steps expects an integer step count
  logging_steps=10,
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=25, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2e-4,
  lr_scheduler_type='constant',
)

#### ❓Question #4:

Describe what the following parameters are doing:

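- `warmup_steps` -- the number of optimizer steps over which the learning rate ramps linearly from 0 up to `learning_rate`. It is an integer step count; a fraction such as 0.03 belongs in `warmup_ratio`.
- `learning_rate` -- the initial (peak) learning rate for the optimizer, here 2e-4.
- `lr_scheduler_type` -- the schedule the learning rate follows during training. `'constant'` keeps it fixed at `learning_rate` for the whole run, with no warmup and no decay; `'constant_with_warmup'` is the variant that applies the warmup.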

> NOTE: Feel free to consult the [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) or other resources!
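
To make the interaction between these three settings concrete, here is a simplified sketch of the learning rate at a given step. `lr_at_step` is a hypothetical helper written for illustration; it mirrors my understanding of the `constant` and `constant_with_warmup` schedules and is not the library's code. The warmup length of 2 steps is an arbitrary example value.

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int, scheduler: str) -> float:
    """Simplified learning rate at a 0-indexed optimizer step (illustrative only)."""
    if scheduler == "constant":
        # No warmup and no decay: warmup_steps has no effect here.
        return base_lr
    if scheduler == "constant_with_warmup":
        if step < warmup_steps:
            # Linear ramp from 0 up to base_lr.
            return base_lr * step / max(1, warmup_steps)
        return base_lr
    raise ValueError(f"unsupported scheduler: {scheduler}")

for step in (0, 1, 2, 10, 49):
    print(
        step,
        lr_at_step(step, 2e-4, 2, "constant"),
        lr_at_step(step, 2e-4, 2, "constant_with_warmup"),
    )
```

With `lr_scheduler_type='constant'`, as in the cell above, the warmup setting makes no difference in this sketch; it only matters once a schedule with warmup is selected.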

In [26]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt,
  args=args,
  train_dataset=dataset["train"],
  eval_dataset=dataset["validation"]
)






In [27]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: amandajovanderwal. Use `wandb login --relogin` to force relogin




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25 | 1.73 | 1.549017 |
| 50 | 1.5082 | 1.470357 |


TrainOutput(global_step=50, training_loss=1.6960164833068847, metrics={'train_runtime': 120.2668, 'train_samples_per_second': 0.416, 'train_steps_per_second': 0.416, 'total_flos': 4372977156096000.0, 'train_loss': 1.6960164833068847, 'epoch': 0.43})
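
The metrics in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, so one step consumes `per_device_train_batch_size` sequences. It also assumes that `packing=True` concatenates the 1,000 dialogues into fewer fixed-length sequences, which would explain why 50 steps at batch size 1 already cover 0.43 of an epoch. The reported epoch is rounded, so the implied dataset size is only approximate.

```python
# Values copied from the TrainOutput above.
metrics = {"train_runtime": 120.2668, "global_step": 50, "epoch": 0.43}
per_device_train_batch_size = 1

steps_per_second = metrics["global_step"] / metrics["train_runtime"]
sequences_seen = metrics["global_step"] * per_device_train_batch_size

# Rough size of the training set after packing, implied by the epoch fraction.
implied_packed_sequences = sequences_seen / metrics["epoch"]

print(round(steps_per_second, 3), round(implied_packed_sequences))
```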

In [28]:
from huggingface_hub import notebook_login

notebook_login()


In [29]:
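# Note: the first positional argument of Trainer.push_to_hub is the commit message,
# not the repo id; the repo name is derived from `output_dir` (see CommitInfo below).
trainer.push_to_hub("ajvanderwal/mistral-7binstruct-summary-100s")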


CommitInfo(commit_url='https://huggingface.co/ajvanderwal/mistral7binstruct_summarize/commit/341cf70f72d2958c484f4f8b7335ac47ba53a83b', commit_message='ajvanderwal/mistral-7binstruct-summary-100s', commit_description='', oid='341cf70f72d2958c484f4f8b7335ac47ba53a83b', pr_url=None, pr_revision=None, pr_num=None)

### Compare Outputs

Let's see how our model fares at this task, now!

In [30]:
merged_model = model.merge_and_unload()



#### ❓Question #5:

What does the `merge_and_unload()` method do?

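Answer: It folds the LoRA adapter weights into the base model's weights and removes the adapter wrappers, returning a plain base model that can be used for inference without PEFT.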

> NOTE: Check out the [documentation](https://huggingface.co/docs/trl/v0.7.11/use_model) or the [source code](https://github.com/huggingface/peft/blob/096fe537370cf8a2cb55cc9bd05c7812ca919405/src/peft/tuners/lora/model.py#L685) to find out!
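
The idea behind merging can be shown on a toy example: the adapter path computes `W x + (alpha / r) * B (A x)`, and merging replaces `W` with `W + (alpha / r) * B A` so that a single matrix multiply gives the same result. The sketch below is a minimal pure-Python illustration with made-up numbers. `matmul` and `merge_lora` are hypothetical helpers, and the sketch ignores quantization and dropout, which the real method has to handle.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged dense weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy sizes: out_features=2, in_features=3, rank=1 (all values made up).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # shape (rank, in_features)
lora_b = [[0.5], [-1.0]]     # shape (out_features, rank)
alpha, rank = 2, 1
x = [[1.0], [1.0], [1.0]]    # one input column vector

merged = merge_lora(w, lora_a, lora_b, alpha, rank)

# Adapter path: W x + scale * B (A x). Merged path: W' x. They should agree.
base_out = matmul(w, x)
lora_out = matmul(lora_b, matmul(lora_a, x))
adapter_out = [[b[0] + (alpha / rank) * l[0]] for b, l in zip(base_out, lora_out)]
merged_out = matmul(merged, x)

print(adapter_out, merged_out)
```

Because the two paths agree, the merged model needs no extra adapter computation at inference time.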

In [35]:
generate_response(create_prompt(dataset["test"][1], include_response=False),
                  merged_model,
                  tokenizer)

'\nEric and Rob are chatting about a video of "MACHINE!" - an American comedian stand up show where he speaks sarcastically to a Russian machine interpreter. Eric likes the train part. He is wondering whether Machine has other stand-ups. Rob checks and discovers they are available on YouTube.\n\n</s>'

Let's look at the base model response:

> *Eric and Rob are having a conversation about a stand-up comedy performance by an unnamed comedian named "Machine." They find the performance hilarious, particularly a part involving a train. They discuss how Americans perceive Russians based on the comedy routine. Eric wonders if this is Machine\'s only stand-up performance, and Rob agrees to help him find more of Machine\'s comedy content on YouTube. They both express excitement and plan to watch the other performances together. They sign off with the phrase "talk to you later" (TTYL) before ending the conversation.</s>*

Now the fine-tuned response:

> *Eric and Rob watched a MACHINE comedy stand-up video and found it funny. Eric is especially impressed by the Russian accent in the video. He asks Rob if that was MACHINE'S only stand-up and if there are any others. Rob confirms that there are several MACHINE'S stand-up videos on Youtube. Eric and Rob agree to watch them together.</s>*



We can see that, directionally, our model is getting much closer to our desired results with only *50* steps of training.

Let's try another example to make sure it wasn't a fluke!

In [36]:
print(dataset["test"][3]["dialogue"])

Will: hey babe, what do you want for dinner tonight?
Emma:  gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up?
Emma: no no it's alright. I'll be home soon, i'll tell you when I get home. 
Will: Alright, love you. 
Emma: love you too. 


In [37]:
generate_response(create_prompt(dataset["test"][3], include_response=False),
                  merged_model,
                  tokenizer)

"\nEmma wants to go home soon, everything is not exactly fine with her but she doesn't want Will to worry, and doesn't need him to cook for her.\n\n</s>"

In [39]:
# Ground Truth Summary
dataset["test"][3]["summary"]

'Emma will be home soon and she will let Will know.'

Let's look at the base model response:

>  *Emma won't be home for dinner tonight, she has a problem, but she doesn't want Will to worry. She'll let him know when she's home. She'll be coming soon.</s>*

And the fine-tuned model:

> *Emma is not feeling well. She will be home soon. She doesn't want Will to cook anything for dinner.</s>*

And again, we can see that the model performs the task *better* than the original un-fine-tuned model - though there is still work to do.
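
Judging "better" by eye is subjective, so a rough number can help. The sketch below computes unigram precision, recall, and F1 against the ground-truth summary for the two `dataset["test"][3]` generations shown in this notebook. It is a crude stand-in for ROUGE-1 written for illustration, not the official metric. `tokenize` and `unigram_scores` are hypothetical helpers, and a single example with sampled outputs is far too little to draw conclusions from.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase, drop the </s> marker, and split into word tokens."""
    return re.findall(r"[a-z0-9']+", text.replace("</s>", " ").lower())

def unigram_scores(candidate: str, reference: str) -> dict[str, float]:
    """Unigram overlap precision, recall, and F1 (a rough ROUGE-1 analogue)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    precision = overlap / max(1, sum(cand.values()))
    recall = overlap / max(1, sum(ref.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "Emma will be home soon and she will let Will know."

# Generations copied from the cells above (base model, then merged model).
base_response = (
    "Will asked Emma what she wanted for dinner, to which she responded that she "
    "wasn't hungry and didn't want him to worry about cooking. She mentioned that "
    "things weren't going well, but didn't want Will to be concerned. Emma assured "
    "him she would be home soon and would let him know when she got there. Will "
    "expressed his love for her and she returned the sentiment.</s>"
)
tuned_response = (
    "Emma wants to go home soon, everything is not exactly fine with her but she "
    "doesn't want Will to worry, and doesn't need him to cook for her.</s>"
)

base = unigram_scores(base_response, reference)
tuned = unigram_scores(tuned_response, reference)
print("base :", base)
print("tuned:", tuned)
```

Precision rewards concise summaries and recall rewards coverage, so the two should be read together; for a real evaluation, an established ROUGE implementation over the whole test split would be the more reliable choice.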