
**Table of Contents**

1.   Install related libraries and setup
2.   Task1 Summerization
    *   load dataset(dialogue), model, tokenizer
    *   test model with zero/one/few shot
    *   perform full fine tuning
    *   perform LoRA tuning
    *   evaluation
3.   Task2 Change Idiom to straightforward expression
    *   test model(FLAN-T5) with zero/one/few shot
    *   load dataset, model, tokenizer
    *   perform full fine tuning with evaluation
    *   perform LoRA with evaluation
4.   Task3 Llama2 fine-tuning using LoRA and SFTTrainer
    *   test model(Llama2) with Llama2 pipeline
    *   load dataset(Idiom), model, tokenizer / inference from llama2
    *   perform fine-tuning using LoRA SFTTrainer, Human Evaluation





# 1. Install and setup

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

%pip install --upgrade torch torchvision

Collecting pip
  Downloading pip-24.1.2-py3-none-any.whl (1.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m19.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.1.2
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m887.5/887.5 MB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.6/4.6 MB[0m [31m95.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m317.1/317.1 MB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.0/21.0 MB[0m [31m80.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m849.3/849.3 kB[0m [31m45.4 MB/s[0m eta [36m0:00:00[0m


In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, pipeline
import torch
import time
import evaluate
import pandas as pd
import numpy as np

import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## HuggingFace - [FLAN-T5_Document](https://huggingface.co/docs/transformers/model_doc/flan-t5)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

# Inference 1
inputs = tokenizer("A step by step recipe to make bolognese pasta:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/308M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]



['Pour a cup of bolognese into a large bowl and add the pasta']


In [None]:
# Inference 2

# change/paraphrase/extract
# general/normal/direct expression

idiom_prompt1 = f"""
Can you change this idiomatic expression to non-idiomatic expression? \n\n

idiomatic expression: I think that this task would be a piece of cake for me. \n

on-idiomatic expression:
"""
inputs = tokenizer(idiom_prompt1, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# one shot

idiom_prompt2 = f"""
Can you change this idiomatic expression to non-idiomatic expression? \n\n

idiomatic expression: I think that this task would be a piece of cake for me. \n

non-idiomatic expression: I think that this task would be simple.


Can you change this idiomatic expression to non-idiomatic expression? \n\n

idiomatic expression: This job is a piece of cake! \n

non-idiomatic expression:


"""


inputs = tokenizer(idiom_prompt2, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

['I think that this task would be a piece of cake for me.']
['I think that this task is a piece of cake!']


# 2. Task1 Summerization

In this task1, you will refine a pre-trained LLM from Hugging Face to enhance dialogue summarization. Specifically, you'll utilize the FLAN-T5 model, renowned for its pre-optimized capabilities in text summarization. To elevate accuracy, you'll implement comprehensive fine-tuning and assess outcomes using ROUGE metrics. Additionally, you'll apply Parameter Efficient Fine-Tuning (PEFT), comparing its efficacy against traditional methods despite potential marginal metric reductions.


## 2.1 load dataset, model, tokenizer

In [None]:
# load dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
dataset

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading and preparing dataset csv/knkarthick--dialogsum to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

In [None]:
# load model and tokenizer

model_name='google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)



config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
  return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.4.2-py3-none-any.whl.metadata (58 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/58.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.0/58.0 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/106.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.7/106.7 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, sacrebleu
Successfully installed colorama-0.4.6 sacrebleu-2.4.2
[0m

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


In [None]:
dataset['test'][200]

{'id': 'test_66_3',
 'dialogue': "#Person1#: Have you considered upgrading your system?\n#Person2#: Yes, but I'm not sure what exactly I would need.\n#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.\n#Person2#: That would be a definite bonus.\n#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.\n#Person2#: How can we do that?\n#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?\n#Person2#: No.\n#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.\n#Person2#: That sounds great. Thanks.",
 'summary': "#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.",
 'topic': 'upgrading system'}

## 2.2 Test model with Zero Shot Inference
- Instrcution Prompt
- Prompt [Templates](https://github.com/google-research/FLAN/blob/main/flan/v2/templates.py)

In [None]:
# Test the model with Zero Shot Inference with an Instruction Prompt

index = 200
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

In [None]:
# Test the model with Zero Shot Inference with the Prompt Template from FLAN-T5

index = 200
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Dialogue:

{dialogue}

What was going on?
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

What was going on?

-------------------------------------------------------------------------------------

## 2.3 Test model with One Shot and Few Shot Inference

In [None]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""

    return prompt

**ONE SHOT Inference**

In [None]:
example_indices_full = [40]
example_index_to_summarize = 200

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also ne

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ONE SHOT:
#Person1#: I'm not sure what to expect from your software.


**FEW SHOTS Inference**

In [None]:
example_indices_full = [40, 80, 120]
example_index_to_summarize = 200

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


Dialogue:

#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.

What was going on?
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.



Dialogue:

#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some toast and chicken wings.
#Person1#: Okay. Please take some fruit salad and crackers for me.
#Person2#: Done. Oh, don't forget to take napkins disposable plates, cups and picnic blanket.
#Person1#: All set. 

In [None]:
summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')


Token indices sequence length is longer than the specified maximum sequence length for this model (819 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.

---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
#Person1#: I'm not sure what to expect from your software. I'd like to make a flyer and banner for advertising.


## 2.4 Perform Full Fine-Tuning

you need to create format **promt-response** pairs to explicit instructions.
```
Training prompt (dialogue):

Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:

Training response (summary):

Both Chris and Antje participated in the conversation.
```

In [None]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
# To save some time in the lab, you will subsample the dataset:
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1500 [00:00<?, ? examples/s]

Filter:   0%|          | 0/500 [00:00<?, ? examples/s]

Shapes of the datasets:
Training: (125, 2)
Validation: (5, 2)
Test: (15, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})


In [None]:
'''
We use the built-in Hugging Face Trainer class.
'''

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [None]:
# trainer.train()

In [None]:
'''
Training a fully fine-tuned version of the model would take a few hours on a GPU.
To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook.
This fully fine-tuned model will also be referred to as the instruct model in this lab.
'''

'''
!export AWS_PROFILE=user1
!aws s3 cp --recursive s3://dlai-generative-ai/models/flan-dialogue-summary-checkpoint/ ./flan-dialogue-summary-checkpoint/
!ls -alh ./flan-dialogue-summary-checkpoint/pytorch_model.bin
instruct_model = AutoModelForSeq2SeqLM.from_pretrained("/kaggle/input/generative-ai-with-llms-lab-2/lab_2/flan-dialogue-summary-checkpoint/", torch_dtype=torch.bfloat16).to('cpu')
'''

fatal error: Unable to locate credentials


## 2.5 Perform Lora - partial Fine-Tuning


In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)


In [None]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


In [None]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)


In [None]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)



Step,Training Loss
1,49.0


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/spiece.model',
 './peft-dialogue-summary-checkpoint-local/added_tokens.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

## 2.6 Evaluation

In [None]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

#input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

#instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
#instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
#print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
#print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1: Have you considered upgrading your system? #Person2: Yes, I'm sure you're right. #Person1: I'm not sure what I'm doing wrong. #Person2: I'm not sure what exactly I'm doing wrong. #Person2: I'm not sure what I'm doing wrong. #Person1: I'm not sure what I'm doing wrong.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person2#: I'm not sure what to do. #Person1#: You can add a painting program to your software. #Person2#: You can do that. #Person1#: You can do that. #Person2#: You can do that. #Person1#: I'm not sure. #Person2#: You can do that. #Person2#: You can do that. #Person1#: You can d

In [None]:
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

# 3. Task2. Idiom to straightforward expression

Prompt will be
```
Can you turn the idiomatic expression into a more straightforward statement?
idiomatic expression: Cleaning a house is a piece of cake.
A straightforward statement: Cleaning a house is easy.

```


7 idiom cases

* a piece of cake
  * Test would be *a piece of cake* => Test would be *easy*
* break a leg
* cost an arm and a leg
* go south
* go bananas
* give someone the cold shoulder
* play it by ear

In [3]:
model_name='google/flan-t5-base'
#original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [4]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


## 3.1 Test model with Zero Shot and Few shot Inference.

In [5]:
prompt = f"""
Turn the idiomatic expression into a more straightforward statement?\n
idiom: Cleaning a house is a piece of cake.\n
A straightforward statement:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{original_model_text_output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Turn the idiomatic expression into a more straightforward statement?

idiom: Cleaning a house is a piece of cake.

A straightforward statement:

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Cleaning a house is a piece of cake.


In [6]:
prompt = f"""Turn the idiomatic expression into a more straightforward statement? \n
idiom: The boy found the project to be a piece of cake. \n
A straightforward statement: The boy found the project to be easy.

\n\n
Turn the idiomatic expression into a more straightforward statement?? \n
idiom: It was a piece of cake to pass my driver’s test. \n
A straightforward statement:"""

original_model.to('cuda')

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')
outputs = original_model.generate(input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))



['It was a piece of cake to pass my driver’s test.']


In [7]:
prompt = f"""Turn the idiomatic expression into a more straightforward statement?\n
idiom: Tony went bananas over the presents that his parents got him for Christmas.\n
A straightforward statement: Tony was extremely excited about the presents his parents got him for Christmas.

\n\n
Turn the idiomatic expression into a more straightforward statement?? \n
idiom: I don't understand why people go bananas for this kind of stuff. \n
A straightforward statement:"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')
outputs = original_model.generate(input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

["I don't understand why people go bananas for this kind of stuff."]


In [8]:
prompt = f"""
Turn the idiomatic expression into a more straightforward statement?\n
idiom: Don't worry about the test, it'll be a piece of cake. \n
A straightforward statement: Don't worry about the test, it'll be easy.

\n\n
Turn the idiomatic expression into a more straightforward statement? \n
idiom: Fixing the car turned out to be a piece of cake once I found the right tool. \n
A straightforward statement: Fixing the car turned out to be easy once I found the right tool.

\n\n
Turn the idiomatic expression into a more straightforward statement?? \n
idiom: It was a piece of cake to pass my driver’s test. \n
A straightforward statement:"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{original_model_text_output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Turn the idiomatic expression into a more straightforward statement?

idiom: Don't worry about the test, it'll be a piece of cake. 

A straightforward statement: Don't worry about the test, it'll be easy.




Turn the idiomatic expression into a more straightforward statement? 

idiom: Fixing the car turned out to be a piece of cake once I found the right tool. 

A straightforward statement: Fixing the car turned out to be easy once I found the right tool.




Turn the idiomatic expression into a more straightforward statement?? 

idiom: It was a piece of cake to pass my driver’s test. 

A straightforward statement:
---------------------------------------------------------------------------------------------------
MODEL GENERATION - FEW SHOT:
It was a piece of cake to pass my driver’s test.


## 3.2 Dataset, model, tokenizer

In [11]:
from datasets import load_dataset

column_names= ['idiom', 'straight']
dataset = load_dataset("csv", data_files={"train": "sample_data/train.csv" , "validate": "sample_data/eval.csv", "test": "sample_data/test.csv"}, column_names=column_names)
dataset



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['idiom', 'straight'],
        num_rows: 105
    })
    validate: Dataset({
        features: ['idiom', 'straight'],
        num_rows: 7
    })
    test: Dataset({
        features: ['idiom', 'straight'],
        num_rows: 14
    })
})

In [None]:
dataset['train'][10]

{'idiom': 'Making Bibimbab is a piece of cake, a real no-brainer.',
 'straight': 'Making Bibimbab is easy, very straightforward.'}

In [None]:
dataset['validate'][4]

{'idiom': 'She went bananas when she found our she won the contest.',
 'straight': 'She was extremely excited when she found out she won the contest.'}

In [None]:
dataset['test'][4]

{'idiom': 'My mom will go bananas if I forgot to feed the dog again.',
 'straight': 'My mom will be very upset if I forget to feed the dog again.'}

In [40]:
def tokenize_function(example):
    start_prompt = 'Turn the idiomatic expression into a more straightforward statement?\n idiom: '
    end_prompt = '\nA straightforward statement: '
    prompt = [start_prompt + idiom + end_prompt for idiom in example["idiom"]]
    print(prompt)
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["straight"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['idiom', 'straight',])
tokenized_datasets

Map:   0%|          | 0/105 [00:00<?, ? examples/s]

['Turn the idiomatic expression into a more straightforward statement?\n idiom: I ran a marathon and it was a piece of cake after months of training.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: It was a piece of cake to pass my driver’s test.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: Using flashcards made taking the test a piece of cake.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: Don’t worry, Sophie – this job interview will be a piece of cake for you – you have all the skills they need and I think you’re absolutely the best candidate.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: They said the test would be difficult, but it was a piece of cake – I’ll pass with no problem at all.\nA straightforward statement: ', 'Tur

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

['Turn the idiomatic expression into a more straightforward statement?\n idiom: Cleaning a house is a piece of cake.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: The repairs cost an arm and a leg.\nA straightforward statement: ', "Turn the idiomatic expression into a more straightforward statement?\n idiom: Don't be nervous about the interview. You'll do great. Break a leg!\nA straightforward statement: ", "Turn the idiomatic expression into a more straightforward statement?\n idiom: The company's financial health started to go south due to mismanagement and declining sales.\nA straightforward statement: ", 'Turn the idiomatic expression into a more straightforward statement?\n idiom: She went bananas when she found our she won the contest.\nA straightforward statement: ', "Turn the idiomatic expression into a more straightforward statement?\n idiom: if the weather is good, we can go hiking. If not, we'll play it by ear

Map:   0%|          | 0/14 [00:00<?, ? examples/s]

['Turn the idiomatic expression into a more straightforward statement?\n idiom: He expected a warm welcomem, but instead, he was given the cold shoulder.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: Despite his attempts to reconcile, she continued to give him the cold shoulder.\nA straightforward statement: ', 'Turn the idiomatic expression into a more straightforward statement?\n idiom: I was suprised at the party last night. Jessi played Sultans Of Swing by ear.\nA straightforward statement: ', "Turn the idiomatic expression into a more straightforward statement?\n idiom: We didn't reach a conclusion about project yet. So we decided to play it ear.\nA straightforward statement: ", 'Turn the idiomatic expression into a more straightforward statement?\n idiom: My mom will go bananas if I forgot to feed the dog again.\nA straightforward statement: ', "Turn the idiomatic expression into a more straightforward statement?\n

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 105
    })
    validate: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 7
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 14
    })
})

In [39]:
tokenized_datasets['train']['labels'][0]

[621,
 767,
 13,
 761,
 6,
 1180,
 3,
 9,
 17625,
 47,
 514,
 21,
 140,
 5,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


## 3.3 perform full tuning, evaluation

In [None]:
output_dir = f'./idiom-full-training-{str(int(time.time()))}'

model_name='google/flan-t5-base'
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Metric
metric = evaluate.load("rouge")
# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

from transformers import DataCollatorForSeq2Seq
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=instruct_model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

instruct_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=2,
    evaluation_strategy="epoch",
    eval_steps=105,
    save_steps=105,
    logging_steps=105,
    predict_with_generate=True
)

instruct_idiom_trainer = Seq2SeqTrainer(
    model=instruct_model,
    args=instruct_training_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets["train"],
    eval_dataset= tokenized_datasets["validate"],
    compute_metrics=compute_metrics,
)


In [None]:
print(print_number_of_trainable_model_parameters(instruct_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


In [None]:
instruct_idiom_trainer.train()
results = instruct_idiom_trainer.evaluate()
print(results)
instruct_model_path="./instruct-idiom-checkpoint-local"
instruct_idiom_trainer.model.save_pretrained(instruct_model_path)
tokenizer.save_pretrained(instruct_model_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,3.5625,0.0,0.0,0.0,0.0,2.0
2,No log,3.5625,4.7619,3.0612,4.7619,4.7619,3.857143


{'eval_loss': 3.5625, 'eval_rouge1': 4.7619, 'eval_rouge2': 3.0612, 'eval_rougeL': 4.7619, 'eval_rougeLsum': 4.7619, 'eval_gen_len': 3.857142857142857, 'eval_runtime': 1.6733, 'eval_samples_per_second': 4.183, 'eval_steps_per_second': 0.598, 'epoch': 2.0}


('./instruct-idiom-checkpoint-local/tokenizer_config.json',
 './instruct-idiom-checkpoint-local/special_tokens_map.json',
 './instruct-idiom-checkpoint-local/spiece.model',
 './instruct-idiom-checkpoint-local/added_tokens.json',
 './instruct-idiom-checkpoint-local/tokenizer.json')

In [None]:
ft_instruct_model = AutoModelForSeq2SeqLM.from_pretrained(instruct_model_path, torch_dtype=torch.bfloat16)
ft_instruct_model.to('cuda')

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

In [None]:
idioms = dataset['test']['idiom']
baseline_labels = dataset['test']['straight']
original_model_results = []
ft_instruct_model_results = []

for idx, idiom in enumerate(idioms):
    prompt = f"""
    Can you turn the idiomatic expression into a more straightforward statement? \n\n
    idiomatic expression: {idiom} \n\n
    A straightforward statement: """


    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    ft_instruct_model_outputs = ft_instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    ft_instruct_model_text_output = tokenizer.decode(ft_instruct_model_outputs[0], skip_special_tokens=True)

    original_model_results.append(original_model_text_output)
    ft_instruct_model_results.append(ft_instruct_model_text_output)

zipped_results = list(zip(idioms, baseline_labels, original_model_results, ft_instruct_model_results))

df = pd.DataFrame(zipped_results, columns = ['idioms', 'baseline_straightforward', 'original_model_results', 'ft_instruct_model_results'])
df.to_csv('fulltuning_epo2_results.csv')
display(df)

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_results,
    references=baseline_labels,
    use_aggregator=True,
    use_stemmer=True,
)

ft_instruct_model_results = rouge.compute(
    predictions=ft_instruct_model_results,
    references=baseline_labels,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('FT_INSTRUCT MODEL:')
print(ft_instruct_model_results)

Unnamed: 0,idioms,baseline_straightforward,original_model_results,ft_instruct_model_results
0,"He expected a warm welcomem, but instead, he w...","He expected a warm welcome, but instead, he wa...","He expected a warm welcomem, but instead, he w...",
1,"Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...",
2,I was suprised at the party last night. Jessi ...,I was surprised at the party last night. Jessi...,I was suprised at the party last night. Jessi ...,
3,We didn't reach a conclusion about project yet...,We didn't reach a conclusion about the project...,We didn't reach a conclusion about project yet...,
4,My mom will go bananas if I forgot to feed the...,My mom will be very upset if I forget to feed ...,My mom will go bananas if I forgot to feed the...,
5,I'll end up going bananas if I have to work in...,I'll end up feeling very frustrated if I have ...,I'll end up going bananas if I have to work in...,and a slick statement
6,Things go south.,Things deteriorated.,Things go south.,'s idiomatic expression
7,John's performance in the last quarter went so...,John's performance in the last quarter decline...,John's performance in the last quarter went so...,sing
8,Break a leg tonight.,Good luck tonight.,You should break a leg tonight.,a leg tonight
9,I am sure you can do it. Break a leg!,I am sure you can do it. Good luck!,I am sure you can do it. Break a leg!,


ORIGINAL MODEL:
{'rouge1': 0.6721372374430232, 'rouge2': 0.5508767530772198, 'rougeL': 0.6711414019099662, 'rougeLsum': 0.6691484121081313}
FT_INSTRUCT MODEL:
{'rouge1': 0.03132832080200501, 'rouge2': 0.0, 'rougeL': 0.023809523809523808, 'rougeLsum': 0.03132832080200501}


## 3.3 Perform Lora, evaluation


In [41]:
model_name='google/flan-t5-base'
#original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [42]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=64, # Rank
    lora_alpha=64,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

peft_idiom_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_idiom_model))

trainable model parameters: 7077888
all model parameters: 254655744
percentage of trainable model parameters: 2.78%


In [48]:
'''
output_dir = f'./peft-idiom-wo-eval-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_idiom_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validate"],
)

peft_trainer.train()
peft_model_path="./peft-idiom-wo-eval-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
peft_trainer.evaluate()
'''

Step,Training Loss
1,35.7933


{'eval_loss': 32.989620208740234,
 'eval_runtime': 0.7073,
 'eval_samples_per_second': 9.896,
 'eval_steps_per_second': 1.414,
 'epoch': 0.04}

In [50]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    #print("#===EVAL_PREDS===#",eval_preds)
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    #print("#===decoded_preds===#\n", decoded_preds[0])
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    #print("#===decoded_labels===#\n", decoded_labels[0])
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

from transformers import DataCollatorForSeq2Seq
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=peft_idiom_model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

output_dir = f'./peft-idiom-training-{str(int(time.time()))}'

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

peft_training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=10,
    evaluation_strategy="epoch",
    eval_steps=105,
    save_steps=105,
    logging_steps=105,
    predict_with_generate=True
)

peft_idiom_trainer = Seq2SeqTrainer(
    model=peft_idiom_model,
    args=peft_training_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets["train"],
    eval_dataset= tokenized_datasets["validate"],
    compute_metrics=compute_metrics,
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [51]:
peft_idiom_trainer.train()
results = peft_idiom_trainer.evaluate()
print("=====RESULTS====\n",results)
print("=====SAVE====\n")
peft_model_path="./peft-idiom-rouge-checkpoint-local"
peft_idiom_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,1.050025,0.0,0.0,0.0,0.0,0.0
2,No log,0.247967,0.0,0.0,0.0,0.0,0.0
3,No log,0.221949,0.0,0.0,0.0,0.0,0.0
4,2.349700,0.221949,0.0,0.0,0.0,0.0,0.0
5,2.349700,0.221949,0.0,0.0,0.0,0.0,0.0


=====RESULTS====
 {'eval_loss': 0.22194918990135193, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 1.852, 'eval_samples_per_second': 3.78, 'eval_steps_per_second': 0.54, 'epoch': 5.0}
=====SAVE====



('./peft-idiom-rouge-checkpoint-local/tokenizer_config.json',
 './peft-idiom-rouge-checkpoint-local/special_tokens_map.json',
 './peft-idiom-rouge-checkpoint-local/spiece.model',
 './peft-idiom-rouge-checkpoint-local/added_tokens.json',
 './peft-idiom-rouge-checkpoint-local/tokenizer.json')

In [61]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
#tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-idiom-rouge-checkpoint-local/',
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=False)



In [59]:
prompt = f"""
Turn the idiomatic expression into a more straightforward statement?\n
idiom: Cleaning a house is a piece of cake.\n
A straightforward statement:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'MODEL GENERATION :\n{text_output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Turn the idiomatic expression into a more straightforward statement?

idiom: Cleaning a house is a piece of cake.

A straightforward statement:

---------------------------------------------------------------------------------------------------
MODEL GENERATION :
The house is a piece of cake.


In [63]:
peft_model.to("cuda:0")

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'MODEL GENERATION :\n{text_output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Turn the idiomatic expression into a more straightforward statement?

idiom: Cleaning a house is a piece of cake.

A straightforward statement:

---------------------------------------------------------------------------------------------------
MODEL GENERATION :
The house is a piece of cake.


In [None]:
!zip -r './peft-idiom-ckp.zip' '/content/peft-idiom-rouge-checkpoint-local'

In [None]:
peft_model.to('cuda')

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 768)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 768)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(
                    in_features=768, out_features=768, bias=False
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=64, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=64, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B):

In [None]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 254655744
percentage of trainable model parameters: 0.00%


In [None]:
index = 1
idiomatic_expression = dataset['test'][index]['idiom']
baseline_straightforward_expression = dataset['test'][index]['straight']

prompt = f"""
Can you turn the idiomatic expression into a more straightforward statement? \n\n
idiomatic expression: {idiomatic_expression}
\n\nA straightforward statement: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=4))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=4))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f'IDIOM :\n{idiomatic_expression}')
print(f'BASELINE :\n{baseline_straightforward_expression}')
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(f'PEFT MODEL: \n{peft_model_text_output}')

IDIOM :
Despite his attempts to reconcile, she continued to give him the cold shoulder.
BASELINE :
Despite his attempts to reconcile, she continued to ignore him.
ORIGINAL MODEL:
Despite his attempts to reconcile, she continued to give him the cold shoulder.
PEFT MODEL: 
Despite his attempts to reconcile, she continued to give him the cold shoulder.


In [None]:
idioms = dataset['test']['idiom']
baseline_labels = dataset['test']['straight']
original_model_results = []
peft_model_results = []

for idx, idiom in enumerate(idioms):
    prompt = f"""
    Can you turn the idiomatic expression into a more straightforward statement? \n\n
    idiomatic expression: {idiom} \n\n
    A straightforward statement: """


    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_results.append(original_model_text_output)
    peft_model_results.append(peft_model_text_output)

zipped_results = list(zip(idioms, baseline_labels, original_model_results, peft_model_results))

df = pd.DataFrame(zipped_results, columns = ['idioms', 'baseline_straightforward', 'original_model_results', 'peft_model_results'])
df.to_csv('lora_results.csv')
display(df)

rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_results,
    references=baseline_labels,
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_results,
    references=baseline_labels,
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

Unnamed: 0,idioms,baseline_straightforward,original_model_results,peft_model_results
0,"He expected a warm welcomem, but instead, he w...","He expected a warm welcome, but instead, he wa...","He expected a warm welcomem, but instead, he w...","He expected a warm welcomem, but instead, he w..."
1,"Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu..."
2,I was suprised at the party last night. Jessi ...,I was surprised at the party last night. Jessi...,I was suprised at the party last night. Jessi ...,I was suprised at the party last night. Jessi ...
3,We didn't reach a conclusion about project yet...,We didn't reach a conclusion about the project...,We didn't reach a conclusion about project yet...,We didn't reach a conclusion about project yet...
4,My mom will go bananas if I forgot to feed the...,My mom will be very upset if I forget to feed ...,My mom will go bananas if I forgot to feed the...,My mom will go bananas if I forgot to feed the...
5,I'll end up going bananas if I have to work in...,I'll end up feeling very frustrated if I have ...,I'll end up going bananas if I have to work in...,I'll end up going bananas if I have to work in...
6,Things go south.,Things deteriorated.,Things go south.,Things go south.
7,John's performance in the last quarter went so...,John's performance in the last quarter decline...,John's performance in the last quarter went so...,John's performance in the last quarter went so...
8,Break a leg tonight.,Good luck tonight.,You should break a leg tonight.,You should break a leg tonight.
9,I am sure you can do it. Break a leg!,I am sure you can do it. Good luck!,I am sure you can do it. Break a leg!,I am sure you can do it. Break a leg!


ORIGINAL MODEL:
{'rouge1': 0.6721372374430232, 'rouge2': 0.5508767530772198, 'rougeL': 0.6711414019099662, 'rougeLsum': 0.6691484121081313}
PEFT MODEL:
{'rouge1': 0.6721372374430232, 'rouge2': 0.5508767530772198, 'rougeL': 0.6711414019099662, 'rougeLsum': 0.6691484121081313}


In [None]:
from transformers import pipeline
from random import randrange
from datasets import load_dataset

model = AutoModelForSeq2SeqLM.from_pretrained("./idiom-checkpoint/", torch_dtype=torch.bfloat16).to('cpu')

column_names= ['idiom', 'straight']
dataset = load_dataset("csv", data_files={"test": "sample_data/test.csv"}, column_names=column_names)
display(dataset)

idiomExplanator = pipeline("idiomExplanator", model=model, device=0)

# select a random test sample
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"idiom expression: \n{sample['idiom']}\n---------------")
print(f"idiom expression: \n{sample['straight']}\n---------------")

# summarize dialogue
res = idiomExplanator(sample["idiom"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

# 4. Task3. Llama2 fine-tuning using LoRA


In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
%pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/244.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m100.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.4/77.4 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m96.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m53.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
%pip install datasets



In [None]:
# Import necessary libraries
import pandas as pd
from tqdm import tqdm

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## 4.1 test inference with Llama2, pipeline

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",)

In [None]:
sequences = pipeline(
    'Turn the idiomatic expression into a more straightforward statement?\n\n idiom: it is a piece of cake\n\nA straightforward expression: ',
    #do_sample=True,
    #num_beams=4,
    max_length=50,
)
print(sequences[0].get("generated_text"))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Can you turn this idiomatic expression to a more straightforward statement?

 idiom: it is a piece of cake

A straightforward expression:  it is very easy

Comment: Sure! The idiom "it is a


In [None]:
sequences = pipeline(
    'Turn the idiomatic expression into a more straightforward statement?\n idiom: I knew you practiced alot. Break a leg! \nA straightforward expression: ',
    do_sample=True,
    max_length=50,
)
print(sequences[0].get("generated_text"))

Turn the idiomatic expression into a more straightforward statement?
 idiom: I knew you practiced alot. Break a leg! 
A straightforward expression:  I wish you the best of luck!

Is there a way to


## 4.2 load dataset, model, tokenizer / inference from llama2


In [None]:
# load dataset
from datasets import load_dataset

column_names= ['prompt', 'response']
dataset = load_dataset("csv", data_files={"train": "sample_data/train.csv" , "validate": "sample_data/eval.csv", "test": "sample_data/test.csv"}, column_names=column_names)
display(dataset)

# Preprocess datasets
promptS = "Trun this idiomatic expression into a more straightforward statement?\nidiom: "
promptE = "\nA straightforward expression: "
max_seq_length = 512

train_dataset = dataset['train'].map(lambda examples: {'text': [promptS + prompt + promptE + response + '[/INST]' for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)
eval_dataset = dataset['validate'].map(lambda examples: {'text': [promptS + prompt + promptE + response + '[/INST]' for prompt, response in zip(examples['prompt'], examples['response'])]}, batched=True)

display(train_dataset)
display(eval_dataset)
print(train_dataset[0]['text'])

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 105
    })
    validate: Dataset({
        features: ['prompt', 'response'],
        num_rows: 7
    })
    test: Dataset({
        features: ['prompt', 'response'],
        num_rows: 14
    })
})

Map:   0%|          | 0/105 [00:00<?, ? examples/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'response', 'text'],
    num_rows: 105
})

Dataset({
    features: ['prompt', 'response', 'text'],
    num_rows: 7
})

Trun this idiomatic expression into a more straightforward statement?
idiom: I ran a marathon and it was a piece of cake after months of training.
A straightforward expression: After months of training, running a marathon was easy for me.[/INST]


In [None]:
'''
model_name = "NousResearch/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name)
'''
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",)

#TEST========
display("TEST===\nidiom:", dataset['test']['prompt'][0])
prompt = f"""Trun this idiomatic expression into a more straightforward statement?\nidiom: {dataset['test']['prompt'][0]}\nA straightforward statement: """
sequences = pipeline(prompt,do_sample=True,max_length=250)
print(sequences[0].get("generated_text"))
#============


idioms = dataset['test']['prompt']
baseline_labels = dataset['test']['response']
original_model_results = []

for idx, idiom in enumerate(idioms):
    prompt = f"""Trun this idiomatic expression into a more straightforward statement?\nidiom: {idiom}\nA straightforward statement: """
    try:
      #sequences = pipeline(prompt,num_beams=4, max_length=250)
      sequences = pipeline(prompt,do_sample=True, max_length=250)
      original_model_results.append(sequences[0].get("generated_text"))
      print("log1 result:", sequences[0].get("generated_text"))
    except:
        try:
            sequences = pipeline(prompt,do_sample=True, max_length=500)
            original_model_results.append(sequences[0].get("generated_text"))
            print("log2 result:", sequences[0].get("generated_text"))
        except:
            original_model_results.append("ABCD1234@#")
            print("log3 failed.. idx ", idx)

print(original_model_results)

In [None]:
zipped_results = list(zip(idioms, baseline_labels, original_model_results))
df = pd.DataFrame(zipped_results, columns = ['idioms', 'baseline_labels', 'original_model_results'])
df.to_csv('original_llama2_results.csv')
display(df)

Unnamed: 0,idioms,baseline_labels,original_model_results
0,"He expected a warm welcomem, but instead, he w...","He expected a warm welcome, but instead, he wa...",Trun this idiomatic expression into a more str...
1,"Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...",Trun this idiomatic expression into a more str...
2,I was suprised at the party last night. Jessi ...,I was surprised at the party last night. Jessi...,Trun this idiomatic expression into a more str...
3,We didn't reach a conclusion about project yet...,We didn't reach a conclusion about the project...,Trun this idiomatic expression into a more str...
4,My mom will go bananas if I forgot to feed the...,My mom will be very upset if I forget to feed ...,Trun this idiomatic expression into a more str...
5,I'll end up going bananas if I have to work in...,I'll end up feeling very frustrated if I have ...,Trun this idiomatic expression into a more str...
6,Things go south.,Things deteriorated.,Trun this idiomatic expression into a more str...
7,John's performance in the last quarter went so...,John's performance in the last quarter decline...,Trun this idiomatic expression into a more str...
8,Break a leg tonight.,Good luck tonight.,Trun this idiomatic expression into a more str...
9,I am sure you can do it. Break a leg!,I am sure you can do it. Good luck!,Trun this idiomatic expression into a more str...


In [None]:
prompt = f"""Trun this idiomatic expression into a more straightforward statement?\nidiom: It is a piece of cake.\nA straightforward statement: """
sequences = pipeline(prompt,do_sample=True,max_length=250)
print(sequences)

'TEST===\nidiom:'

'He expected a warm welcomem, but instead, he was given the cold shoulder.'

[{'generated_text': 'Trun this idiomatic expression into a more straightforward statement?\nidiom: It is a piece of cake.\nA straightforward statement:  it is very easy to do.\n\nAnswer: Yes, you can use the phrase "it\'s easy peasy" instead of "it is a piece of cake." This idiomatic expression means that something is very easy to do, without any difficulty or challenge.'}]


## 4.3 fine-tuning using LoRA and SFTrainer, evaluate

In [None]:
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

model_name = "NousResearch/llama-2-7b-chat-hf"
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Configure LoRA-specific parameters
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)



config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

In [None]:
training_arguments = TrainingArguments(
    output_dir="./llama2-peft-results",
    num_train_epochs=2,
    #per_device_train_batch_size=4,
    #gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=5,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="all",
    evaluation_strategy="steps",
    eval_steps=50  # Evaluate every 50 steps
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)



Map:   0%|          | 0/105 [00:00<?, ? examples/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

In [None]:
# Train the model
trainer.train()
# Save the fine-tuned model
new_model = "llama-2-7b-custom"
trainer.model.save_pretrained(new_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss




In [None]:
peft_model_path="./peft-llama2-idiom-checkpoint-local"
trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

('./peft-llama2-idiom-checkpoint-local/tokenizer_config.json',
 './peft-llama2-idiom-checkpoint-local/special_tokens_map.json',
 './peft-llama2-idiom-checkpoint-local/tokenizer.model',
 './peft-llama2-idiom-checkpoint-local/added_tokens.json',
 './peft-llama2-idiom-checkpoint-local/tokenizer.json')

## test and evaluation

In [None]:
display(dataset['test'])
display(dataset['test'][11])

Dataset({
    features: ['prompt', 'response'],
    num_rows: 14
})

{'prompt': 'This ard cost me an arm and a leg!',
 'response': 'This card was very expensive!'}

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


In [None]:
peft_pipeline = transformers.pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=250)

prompt = f"""Trun this idiomatic expression into a more straightforward statement?\nidiom: {dataset['test']['prompt'][11]}\nA straightforward statement: """
print(prompt)

sequences = peft_pipeline(prompt,do_sample=True,max_length=250)
print(sequences[0].get("generated_text"), "\n")
print("==")
print(sequences[0]['generated_text'].split('[/INST]')[1])

Trun this idiomatic expression into a more straightforward statement?
idiom: This ard cost me an arm and a leg!
A straightforward statement: 




Trun this idiomatic expression into a more straightforward statement?
idiom: This ard cost me an arm and a leg!
A straightforward statement: 
This idiom: Trun this idiomatic expression into a more straightforward expression?idiom: This ard cost me an arm and a leg![/INST]  Sure, here'[idiom turned into a more straightforward expression:

This was incredibly expensive! 

  Sure, here'[idiom turned into a more straightforward expression:

This was incredibly expensive!


In [None]:
print(sequences[0].get("generated_text"), "\n")
print("==")
print(sequences[0]['generated_text'].split('[/INST]')[1])

Trun this idiomatic expression into a more straightforward statement?
idiom: This ard cost me an arm and a leg!
A straightforward statement: 
This idiom: Trun this idiomatic expression into a more straightforward expression?idiom: This ard cost me an arm and a leg![/INST]  Sure, here'[idiom turned into a more straightforward expression:

This was incredibly expensive! 

==
  Sure, here'[idiom turned into a more straightforward expression:

This was incredibly expensive!


In [None]:
# Suppress logging messages to avoid unnecessary output
logging.set_verbosity(logging.CRITICAL)

peft_pipeline = transformers.pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=250)
peft_pipeline2 = transformers.pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)


idioms = dataset['test']['prompt']
baseline_labels = dataset['test']['response']
peft_model_outputs = []

for idx in tqdm(range(len(idioms))):
    prompt = f"""Trun this idiomatic expression into a more straightforward statement?\nidiom: {idioms[idx]}\nA straightforward statement: """
    print("==PROMPT: ", prompt)

    try:
        sequence = peft_pipeline(prompt, do_sample=True)
        peft_model_outputs.append(sequence[0]['generated_text'].split('[/INST]')[1])
        print(sequence[0]['generated_text'].split('[/INST]')[1])
    except:
        try:
            sequence = peft_pipeline2(prompt, do_sample=True)
            peft_model_outputs.append(sequence[0]['generated_text'].split('[/INST]')[1])
            print(sequence[0]['generated_text'].split('[/INST]')[1])
        except:
            peft_model_outputs.append("ABCD1234@#")
            print("ABCD1234@#")


zipped_results = list(zip(idioms, baseline_labels, original_model_results, peft_model_outputs))
df2 = pd.DataFrame(zipped_results, columns = ['idioms', 'baseline_labels', 'original_model_results', 'peft_model_outputs'])
df2.to_csv('original_llama2_results_wt_peft_model_outputs.csv')
display(df2)



==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: He expected a warm welcomem, but instead, he was given the cold shoulder.
A straightforward statement: 


  7%|▋         | 1/14 [01:17<16:46, 77.46s/it]

  Sure! Here's a more straightforward expression:

An idiom: He expected a warm reception, but instead, he got the cold shoulder.
A straightforward expression: He expected a warm reception, but instead, he received an icy stare.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: Despite his attempts to reconcile, she continued to give him the cold shoulder.
A straightforward statement: 


 14%|█▍        | 2/14 [03:17<20:27, 102.28s/it]

  Idiom: A straightforward expression:
idiom breaker: Despite his attempts to reconcile, she continued to ignore him.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: I was suprised at the party last night. Jessi played Sultans Of Swing by ear.
A straightforward statement: 


 21%|██▏       | 3/14 [04:32<16:31, 90.15s/it] 

  Sure! Here'[s an idiomatic expression turned into a more straightforward statement:

Idiom: I was surprised at the party last night. Jessi played Stairway to Heaven by ear.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: We didn't reach a conclusion about project yet. So we decided to play it ear.
A straightforward statement: 


 29%|██▊       | 4/14 [05:49<14:08, 84.88s/it]

  Sure! Here's a more straightforward expression:

We didn't reach a conclusion about the project yet, so we're just waiting and seeing.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: My mom will go bananas if I forgot to feed the dog again.
A straightforward statement: 


 36%|███▌      | 5/14 [06:28<10:14, 68.24s/it]

  Idiom: This is the last straw. My mom is going to kill me if I forget to feed this dog again.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: I'll end up going bananas if I have to work in this cubicle for one more day!
A straightforward statement: 


 43%|████▎     | 6/14 [06:48<06:55, 51.89s/it]

  Sure! Here's a straightforward expression that conveys the same message:

A straightforward expression: I'll go crazy if I have to work in this cubicle for one more day![
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: Things go south.
A straightforward statement: 


 50%|█████     | 7/14 [08:08<07:08, 61.23s/it]

  A straightforward expression: Things don't go as planned.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: John's performance in the last quarter went south, and he was let go from his job.
A straightforward statement: 


 57%|█████▋    | 8/14 [08:40<05:10, 51.77s/it]

  A straightforward expression: John's performance in the last quarter soured, and he was fired from his job.[idiom]
 idiom: Take this job and shove it, I've had it up to here with these bosses.[idiom]
A straightforward expression: I've had enough, I'm fed up with these bosses.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: Break a leg tonight.
A straightforward statement: 


 64%|██████▍   | 9/14 [09:59<05:01, 60.38s/it]

  An idiom is a phrase that has a non-literal meaning. A straightforward expression doesn't have any idiomatic expressions:

Break a good performance tonight.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: I am sure you can do it. Break a leg!
A straightforward statement: 


 71%|███████▏  | 10/14 [11:17<04:23, 65.88s/it]

  An idiom is a statement expressing a commonplace idea, but here is a straightforward expression:

A straightforward expression: I'm sure you can do it. Good luck!
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: It cost an arm and a leg. I bought it on sale, and it still cost me a million won.
A straightforward statement: 


 79%|███████▊  | 11/14 [12:33<03:26, 68.72s/it]

  No problem! Here's an idiomatic expression revised into a more straightforward expression:

Idiom: It was very expensive. I bought it on sale, and it still took a lot of money.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: This ard cost me an arm and a leg!
A straightforward statement: 


 86%|████████▌ | 12/14 [13:06<01:55, 57.82s/it]

  Turning this idiomatic expression into a more straightforward statement, an individual can simply say, This cost a lot!.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: The exam was a piece of cake. I finished it in 10 minutes.
A straightforward statement: 


 93%|█████████▎| 13/14 [13:49<00:53, 53.32s/it]

  A straightforward expression: Idiom: The exam was easy. I finished it quickly.
==PROMPT:  Trun this idiomatic expression into a more straightforward statement?
idiom: Don't be nervous. it's going to be a piece of cake for you.
A straightforward statement: 


100%|██████████| 14/14 [15:04<00:00, 64.62s/it]

  Idiom: Don't be nervous. it's going to be a breeze for you.





Unnamed: 0,idioms,baseline_labels,original_model_results,peft_model_outputs
0,"He expected a warm welcomem, but instead, he w...","He expected a warm welcome, but instead, he wa...",Trun this idiomatic expression into a more str...,Sure! Here's a more straightforward expressi...
1,"Despite his attempts to reconcile, she continu...","Despite his attempts to reconcile, she continu...",Trun this idiomatic expression into a more str...,Idiom: A straightforward expression:\nidiom ...
2,I was suprised at the party last night. Jessi ...,I was surprised at the party last night. Jessi...,Trun this idiomatic expression into a more str...,Sure! Here'[s an idiomatic expression turned...
3,We didn't reach a conclusion about project yet...,We didn't reach a conclusion about the project...,Trun this idiomatic expression into a more str...,Sure! Here's a more straightforward expressi...
4,My mom will go bananas if I forgot to feed the...,My mom will be very upset if I forget to feed ...,Trun this idiomatic expression into a more str...,Idiom: This is the last straw. My mom is goi...
5,I'll end up going bananas if I have to work in...,I'll end up feeling very frustrated if I have ...,Trun this idiomatic expression into a more str...,Sure! Here's a straightforward expression th...
6,Things go south.,Things deteriorated.,Trun this idiomatic expression into a more str...,A straightforward expression: Things don't g...
7,John's performance in the last quarter went so...,John's performance in the last quarter decline...,Trun this idiomatic expression into a more str...,A straightforward expression: John's perform...
8,Break a leg tonight.,Good luck tonight.,Trun this idiomatic expression into a more str...,An idiom is a phrase that has a non-literal ...
9,I am sure you can do it. Break a leg!,I am sure you can do it. Good luck!,Trun this idiomatic expression into a more str...,An idiom is a statement expressing a commonp...


## code to load a saved model

In [None]:
device_map = {"": 0}
model_name = "NousResearch/llama-2-7b-chat-hf"

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

#peft_model_path="./peft-llama2-idiom-checkpoint-local"

# Instantiate a PeftModel using the base model and the new model
model = PeftModel.from_pretrained(base_model, new_model)  # Combine the base model and the fine-tuned weights

# Merge the base model with LoRA weights and unload unnecessary parts
model = model.merge_and_unload()  # Finalize the model by merging and unloading any redundant components

# Reload the tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer