# Fine-tune FLAN-T5 for chat & dialogue summarization

In this blog, you will learn how to fine-tune [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) for chat & dialogue summarization using Hugging Face Transformers. If you already know T5, FLAN-T5 is a stronger starting point for most tasks: at the same number of parameters, these models have been fine-tuned on more than 1,000 additional tasks that also cover more languages.

In this example we will use the [samsum](https://huggingface.co/datasets/samsum) dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.

You will learn how to:

1. [Setup Development Environment](#1-setup-development-environment)
2. [Load and prepare dialogueSum dataset from local](#2-load-and-prepare-dialoguesum-dataset-from-local)
3. [Fine-tune and evaluate FLAN-T5](#3-fine-tune-and-evaluate-flan-t5)
4. [Run Inference and summarize ChatGPT dialogues](#4-run-inference-and-summarize-chatgpt-dialogues)

Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments. 

## Quick intro: FLAN-T5, just a better T5

FLAN-T5, released with the [Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416.pdf) paper, is an enhanced version of T5 that has been finetuned on a mixture of tasks. The paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. It finds that instruction finetuning is a general method for improving the performance and usability of pretrained language models.

![flan-t5](../assets/flan-t5.png)

* Paper: https://arxiv.org/abs/2210.11416
* Official repo: https://github.com/google-research/t5x

--- 

Now that we know what FLAN-T5 is, let's get started. 🚀

_Note: This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4 GPU._

## 1. Setup Development Environment

Our first step is to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages. 

In [1]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
    print(f"GPU device name: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available")


GPU is available
GPU device name: NVIDIA GeForce RTX 3090


In [29]:
# python
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade


In [None]:
# install git-lfs for pushing model and logs to the Hugging Face Hub
!sudo apt-get install git-lfs --yes

This example uses the [Hugging Face Hub](https://huggingface.co/models) as a remote model versioning service. To push our model to the Hub, you need to register at [Hugging Face](https://huggingface.co/join).
If you already have an account, you can skip this step.
Once you have an account, we use the `notebook_login` util from the `huggingface_hub` package to log in and store the token (access key) on disk.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## 2. Load and prepare dialogueSum dataset from local
- This DialogueSum dataset was originally in English and was translated into Korean by teachers using the Solar API for educational purposes. Because the Korean translation read somewhat unnaturally to native speakers, I used the Solar API to translate it back into English so that summarization would be more accurate.

To load the `dialogueSum` dataset, we use the `load_dataset()` function from the 🤗 Datasets library and point it at the local CSV files.


In [3]:
dataset_id = "dialoguSum_Solar_koen"
# huggingface hub model id
model_id="google/flan-t5-large"

In [4]:
from datasets import load_dataset

# Load the train and validation splits from local CSV files
dataset = load_dataset('csv', data_files={'train': "/data/ephemeral/home/data/train_en.csv", 'val': "/data/ephemeral/home/data/dev_en.csv"})

print(f"Train dataset size: {len(dataset['train'])}")
print(f"val dataset size: {len(dataset['val'])}")

Train dataset size: 12457
val dataset size: 499


In [5]:
dataset['train']

Dataset({
    features: ['fname', 'dialogue', 'summary', 'topic', 'dialogue_en', 'summary_en', 'topic_en'],
    num_rows: 12457
})

Let's check out an example from the dataset.

In [15]:
from random import randrange        


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
#Person1#: 이번 여름에 무슨 계획 있어?
#Person2#: 그 회사에서 다시 일할 수도 있지만, 여행을 하면서 세상에 대해 더 배우게 될 것 같아. 내 친구 빌이 올 여름에 유럽 일주를 할 계획이야. 프랑스에 그의 친척이 살아서 그들을 방문하려고 하고 독일, 리투아니아, 라트비아에 갈 계획을 세우고 있어. 나는 비행기표와 식사 비용만 내면 돼.
#Person1#: 빌의 친척 집에 함께 머무르지 않을 때 호텔은 어떻게 할 건데?
#Person2#: 게스트하우스에서 잘 거야. 내가 아르바이트로 돈을 충분히 모아놨어.
#Person1#: 다음 학기에 쓸 돈은 어떻게 할 건데?
#Person2#: 너한테 조금 빌려야 할 거 같아. 이건 일생에 한 번 뿐인 기회야. 나는 정말 많이 배울 수 있을 거라고 생각해. 프랑스어 실력도 향상시킬 수 있고.
---------------
summary: 
#Person2#는 이번 여름에 빌과 함께 유럽 일주를 하고 싶어한다. #Person1#은 #Person2#의 다음 학기 재정 상태를 걱정하지만, #Person2#는 여행 계획을 고집한다.
---------------


To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out [chapter 6](https://huggingface.co/course/chapter6/1?fw=tf) of the Hugging Face Course.

In [None]:
!pip install SentencePiece


In [16]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the tokenizer for the checkpoint set in model_id
tokenizer = AutoTokenizer.from_pretrained(model_id)

Before we can start training, we need to preprocess our data. Abstractive summarization is a text2text-generation task: the model takes a text as input and generates a summary as output. To batch the data efficiently, we first want to know how long our inputs and outputs are.
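A minimal sketch of why these length statistics matter, using made-up token counts rather than the real dataset: padding every example to the single longest sequence spends most of each batch on pad tokens when typical inputs are much shorter. The helpers `percentile_length` and `padding_fraction` and the sample lengths are hypothetical and only illustrate the trade-off.

```python
def percentile_length(lengths, pct):
    """Nearest-rank percentile: smallest length covering `pct` percent of examples."""
    ordered = sorted(lengths)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def padding_fraction(lengths, max_len):
    """Share of positions that are padding when every example is padded/truncated to max_len."""
    used = sum(min(n, max_len) for n in lengths)
    return 1 - used / (max_len * len(lengths))


# Hypothetical token counts per dialogue (not measured on the dataset)
token_lengths = [56, 80, 95, 120, 130, 150, 180, 210, 300, 512]

cutoff = percentile_length(token_lengths, 90)
print(f"90th percentile length: {cutoff}")
print(f"padding share at max length: {padding_fraction(token_lengths, max(token_lengths)):.2f}")
print(f"padding share at 90th percentile: {padding_fraction(token_lengths, cutoff):.2f}")
```

Choosing a percentile instead of the maximum truncates the few longest dialogues, so it is a trade-off between compute and information loss rather than a free win.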

In [17]:
from datasets import concatenate_datasets

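# Tokenize every dialogue (train + val) to find the longest and shortest inputs.
# max_source_length is used later as the padding/truncation length for inputs.
# truncation=True caps each sequence at the tokenizer's maximum length, so the reported max cannot exceed it.
# Note: this measures the `dialogue` column, while preprocessing below uses `dialogue_en`.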
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")
min_source_length = min([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Min source length: {min_source_length}")


# Repeat for the summaries to find the longest and shortest targets.
# max_target_length is used later as the padding/truncation length for labels.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")
min_target_length = min([len(x) for x in tokenized_targets["input_ids"]])
print(f"Min target length: {min_target_length}")

Map:   0%|          | 0/12956 [00:00<?, ? examples/s]

Max source length: 512
Min source length: 56


Map:   0%|          | 0/12956 [00:00<?, ? examples/s]

Max target length: 246
Min target length: 12


In [18]:
special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
for token in special_tokens:
    if token in tokenizer.get_vocab():
        print(f"'{token}' is already in the vocabulary.")
    else:
        print(f"'{token}' is not in the vocabulary.")


'#CarNumber#' is not in the vocabulary.
'#SSN#' is not in the vocabulary.
'#PhoneNumber#' is not in the vocabulary.
'#PassportNumber#' is not in the vocabulary.
'#Email#' is not in the vocabulary.
'#CardNumber#' is not in the vocabulary.
'#Address#' is not in the vocabulary.
'#DateOfBirth#' is not in the vocabulary.
'#Person4#' is not in the vocabulary.
'#Person7#' is not in the vocabulary.
'#Person3#' is not in the vocabulary.
'#Person2#' is not in the vocabulary.
'#Person#' is not in the vocabulary.
'#Person6#' is not in the vocabulary.
'#Person5#' is not in the vocabulary.
'#Person1#' is not in the vocabulary.


In [19]:
original_vocab_size = len(tokenizer)

special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
new_vocab_size = len(tokenizer)

print(f"Original vocab size: {original_vocab_size}")
print(f"New vocab size: {new_vocab_size}")

# Original vocab size: 32100
# New vocab size: 32116

Original vocab size: 32100
New vocab size: 32116


In [21]:
# Print the current vocab size (this already includes the 16 added special tokens)
Original_vocab_size = len(tokenizer)
print(f"Original vocab size: {Original_vocab_size}")

# Step 1: Extract unique tokens from the dataset
unique_tokens = set()

# dataset['train']['dialogue'] is assumed to be a list of dialogue strings
for sentence in dataset['train']['dialogue']:
    tokens = tokenizer.tokenize(sentence)  # tokenize the dialogue
    unique_tokens.update(tokens)  # collect the unique tokens in the set

# Print the number of unique tokens and the tokens themselves
print(f"Number of unique tokens: {len(unique_tokens)}")
print(unique_tokens)  # a set cannot be indexed, so print it whole


Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors


Original vocab size: 32116
Number of unique tokens: 98385
{'횡재했네', '원격으로', '리터를', '유형이', '이끄는', '단호한', '갱신할', '환자들', '글쓰기가요', '문제예요', '효율', '소리치고', '죄송하다고', '골칫거리입니다', '앞코를', '달을', '브라이언트야', '국에서', '이야기하자', '직원이에요', '마틴스의', '겨울이라면', '엄마도', '서둘러야겠어', '끝일', '짧지도', '남녀노소', '산림', '백업을', '국립은행', '벽', '끝냈을', '도슨입니다', '4.5', '뿐이잖아', '한잔하려고', '그녀에', '이모와', '알리려고', '거실이야', '장미처럼', '부딪히셨어요', '찍어본', '자랑하시면서', '흔들려고', '이끌었을', '조용하셨어요', '심부름하고', '정보만', '브로케이드에요', '감사하며', '조금이요', '읽어서', '행복했던', '꼴', '건넌', '경기예요', '주문', '뽑을', '쉬워져요', '수도요금에', '고객사가', '가져오라고', '잡지로', '밀리고', '칠레에서는', '애기야', '저녁엔', '얼굴과', '들어와서', '일치했습니다', '플레이스테이션에서', '돌아간', '시장이란', '그렇게들', '재료의', '문서의', '상행과', '벌나요', '극적으로', '전달하실', '영구적인', '년마다', '해외로', '드려야겠습니다', '적게요', '잡혔나요', '지오바니도', '줄여줄래요', '빵으로', '축하해야', '결과로', '보여주셨나요', '동물원이', '차선을', '이웃집에서', '비행기표는', '보아', '뇌우가', '공휴일에', '지정되어', '걸어보세요', '외국어에', '완료되었는지', '저작권', '늙었다니', '꺾으시면', '400', '공연인데', '결합하고', '살려봐', '닫는다고', '주시겠습니까', '보호되고', '그런거야', '시끄러워요', '달러거스름돈은', '없구나', '내전들',

In [None]:
# Add tokens to the tokenizer
import pandas as pd

Original_vocab_size = len(tokenizer)
print(f"Original vocab size: {Original_vocab_size}")
# Step 1: Extract unique words from the dataset
unique_words1 = set()
for sentence in dataset['train']['dialogue']:
    words = sentence.split()  # Simple split, you might want to use a tokenizer for better results
    unique_words1.update(words)
unique_words2 = set()
for sentence in dataset['train']['summary']:
    words = sentence.split()  # Simple split, you might want to use a tokenizer for better results
    unique_words2.update(words)

unique_words3 = set()
for sentence in dataset['val']['dialogue']:
    words = sentence.split()  # Simple split, you might want to use a tokenizer for better results
    unique_words3.update(words)
unique_words4 = set()
for sentence in dataset['val']['summary_en']:
    words = sentence.split()  # Simple split, you might want to use a tokenizer for better results
    unique_words4.update(words)

test = pd.read_csv(r'/data/ephemeral/home/data/test_en.csv')
unique_words5 = set()
for sentence in test['dialogue_en']:
    words = sentence.split()  # Simple split, you might want to use a tokenizer for better results
    unique_words5.update(words)    
    
# Step 2: Add these words to the tokenizer vocabulary
# The tokenizer will automatically handle the splitting and add only those not already in the vocab
tokenizer.add_tokens(list(unique_words1))
tokenizer.add_tokens(list(unique_words2))
tokenizer.add_tokens(list(unique_words3))
tokenizer.add_tokens(list(unique_words4))
tokenizer.add_tokens(list(unique_words5))
tokenizer.add_tokens(list(special_tokens))
# Step 3: Check the new vocabulary size
new_vocab_size = len(tokenizer)
print(f"New vocab size: {new_vocab_size}")

Original vocab size: 32100
New vocab size: 92921
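Growing the tokenizer only changes the ID space; the model's embedding matrix still needs a row for every ID the tokenizer can emit. A rough check, assuming the 32128-row shared embedding that this checkpoint reports when the model is printed and the vocabulary sizes shown in the outputs above (the helper name `needs_resize` is hypothetical):

```python
def needs_resize(tokenizer_len, embedding_rows):
    """True when the tokenizer can produce IDs that have no embedding row."""
    return tokenizer_len > embedding_rows


EMBEDDING_ROWS = 32128       # rows in the model's shared embedding (from print(model))
with_special_tokens = 32116  # after adding the 16 '#...#' special tokens
with_all_words = 92921       # after adding every whitespace-split word

print(needs_resize(with_special_tokens, EMBEDDING_ROWS))
print(needs_resize(with_all_words, EMBEDDING_ROWS))
```

In Transformers, `model.resize_token_embeddings(len(tokenizer))` is the call that grows the matrix. The added rows carry no pretrained knowledge, so a large vocabulary extension like the second case leaves many tokens that the model has to learn from scratch.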


In [11]:
# Check that the tokenizer round-trips a sample correctly
# Define a test sentence
sentence = dataset["train"]['dialogue_en'][0]


# Encode the sentence, padding/truncating to max_source_length (returns plain lists of IDs)
sentence_encoded = tokenizer(sentence, 
                             max_length=max_source_length, 
                             padding="max_length", 
                             truncation=True, 
                             add_special_tokens=True)

# Decode the IDs back to text, keeping special tokens so padding and '#PersonN#' tags stay visible
sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"], 
        skip_special_tokens=False
    )

# Print SENTENCE
print('SENTENCE:')
print(sentence)

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

SENTENCE:
#Person1#: Hello, Mr. Smith. I'm Dr. Hawkins. Why are you here today?
#Person2#: I thought it would be a good idea to have a checkup.
#Person1#: I see, you haven't had one in five years. You should have one every year.
#Person2#: I know. But if nothing is wrong, why should I go to see a doctor?
#Person1#: The best way to avoid serious illness is to catch these early. So for your own good, come at least once a year.
#Person2#: I see.
#Person1#: Look here. Your eyes and ears seem fine. Take a deep breath. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: As you know, smoking is the leading cause of lung cancer and heart disease. You really should quit.
#Person2#: I've tried hundreds of times, but I find it hard to break the habit.
#Person1#: We have classes and medications that can help. I'll give you more information before you leave.
#Person2#: OK, thank you, doctor.

ENCODED SENTENCE:
[32115, 3, 10, 8774, 6, 1363, 5, 3931, 5, 27, 31, 51, 707, 5, 12833, 77, 7, 5, 1615, 33, 

## Summarizing Using Prompt Engineering

### Applying Zero Shot Inference

In [13]:
# zero shot
from transformers import AutoModelForSeq2SeqLM

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Define a test sentence
sentence = dataset["train"]['dialogue_en'][20]
golden = dataset["train"]['summary_en'][20]

instruction = f"""
Dialogue:

{sentence}

What was going on?
"""
#instruction = ["Please summarize the conversation by clearly stating what each speaker did or said. : " +sentence]
# instruction = ["In this '#Person1#: Hello, Mr. Smith. I'm Dr. Hawkins.' dialogue, the speaker is #Person1#. \
#     Summarize the conversation with a focus on the speakers, ensuring that each speaker's name or identifier, such as #Person1#, is accurately used as the subject in the summary. : " + sentence]
# Encode the sentence using the tokenizer, returning PyTorch tensors
sentence_encoded = tokenizer(instruction, 
                             max_length=max_source_length, 
                             padding="max_length", 
                             truncation=True, 
                             add_special_tokens=True,
                             return_tensors="pt")  # Ensure tensors are returned for model input

# Generate the summary using the model
summary_ids = model.generate(
    sentence_encoded["input_ids"], 
    attention_mask=sentence_encoded["attention_mask"],  # ignore the padded positions
    max_length=max_target_length, 
    min_length=40, 
    num_beams=5,  # Optional: control the generation strategy
    early_stopping=True,  # Optional: stop early when all beams are finished
    no_repeat_ngram_size=2
)

# Decode the generated summary, skipping special tokens
sentence_decoded = tokenizer.decode(
    summary_ids[0],  # Select the first (and usually only) sequence generated
    skip_special_tokens=True  # Skip special tokens in the final output
    )

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

# Print SENTENCE
print('\nGOLDEN:')
print(golden)





ENCODED SENTENCE:
tensor([[ 5267, 10384,    10,     3, 32115,     3,    10,   571,   103,    27,
          5026,   747,    48,  3143,    58,    27,   214,   132,    31,     7,
             3,     9, 19540,  5775,     5,     3, 32111,     3,    10,   363,
            33,    25,   692,    58,     3, 32115,     3,    10,    27,    31,
            51,   652, 13205,     6,   149,   405,    34,   320,    58,     3,
         32111,     3,    10,    27,   317,    25,    31,    60,  1119,    12,
           129, 13205,     5,  3963,    25,  2612,    62,    31,    60,    16,
             3,     9,   443,    30,     8,  1373,    58,     3, 32115,     3,
            10,    27,    31,    51,   207,    44,    48,     5,   465,    80,
            54,   217,   140,     5,     3, 32111,     3,    10,  1521,    25,
         29744,   140,    58,   148,    31,    60,   352,    12,  1137,    46,
          3125,   116,   151,  8876,    16,     3,     9,  1123,    55,     3,
         32115,     3,    10,   9

### Applying One Shot Inference

In [None]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['train']['dialogue_en'][index]
        summary = dataset['train']['summary_en'][index]

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5. Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:

{dialogue}

What was going on?
{summary}


"""

    dialogue = dataset['train']['dialogue_en'][example_index_to_summarize]

    prompt += f"""
Dialogue:

{dialogue}

What was going on?
"""

    return prompt

In [None]:
example_indices_full = [21]
example_index_to_summarize = 101

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)


Dialogue:

#Person1#: We should check in at the Air China counter half an hour before takeoff, Joy.
#Person2#: Yeah, I know. The boarding time on the ticket is 17:05, and it's 16:15 now. I think we have enough time.
#Person1#: Do we need to show our IDs when we check in?
#Person2#: Yeah, that's a must.
#Person1#: What about our luggage?
#Person2#: We can check in our luggage and carry our small bags in our hands. And we need to open each of them for inspection.
#Person1#: Do you think they will search every passenger?
#Person2#: I think so. We definitely don't want to have a hijacking incident on the plane today, do we?

What was going on?
#Person1# asks #Person2# what to do when checking in at the Air China counter.



Dialogue:

#Person1#: Can I help you?
#Person2#: I'm looking for an MP-3 player. Which brand is the best quality?
#Person1#: I recommend Pioneer.
#Person2#: Which model is the best seller?
#Person1#: This model is very popular with women.
#Person2#: Can I see that?
#Pe

In [None]:

# model understanding more context of the conversation with one shot inference

summary = dataset['train']['summary_en'][example_index_to_summarize]

inputs = tokenizer(one_shot_prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

#print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
#print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

BASELINE HUMAN SUMMARY:
#Person2# is looking for an MP-3 player. #Person1# recommends a pioneer and #Person2# chooses yellow.

MODEL GENERATION - ONE SHOT:
rien is looking for an MP-3 player. He wants to buy it in yellow.


### Applying Few Shot Inference

In [12]:
example_indices_full = [11, 21, 51]
example_index_to_summarize = 101

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)


In [None]:
summary = dataset['train']['summary_en'][example_index_to_summarize]

inputs = tokenizer(few_shot_prompt, return_tensors='pt', add_special_tokens=True)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

#print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
#print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

BASELINE HUMAN SUMMARY:
#Person2# is looking for an MP-3 player. #Person1# recommends a pioneer and #Person2# chooses yellow.

MODEL GENERATION - FEW SHOT:
rien is looking for an MP-3 player. He wants to buy it in yellow.


#### Conclusion: one-shot and few-shot inference work better than modifying the tokenizer.

In [None]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
       

### tokenized_dataset

In [None]:
one_shot_instruct = """Dialogue:

#Person1#: We should check in at the Air China counter half an hour before takeoff, Joy.
#Person2#: Yeah, I know. The boarding time on the ticket is 17:05, and it's 16:15 now. I think we have enough time.
#Person1#: Do we need to show our IDs when we check in?
#Person2#: Yeah, that's a must.
#Person1#: What about our luggage?
#Person2#: We can check in our luggage and carry our small bags in our hands. And we need to open each of them for inspection.
#Person1#: Do you think they will search every passenger?
#Person2#: I think so. We definitely don't want to have a hijacking incident on the plane today, do we?

What was going on?
#Person1# asks #Person2# what to do when checking in at the Air China counter.

Dialogue:
"""

In [None]:
# Count the words in the one-shot prompt
word_count = len(one_shot_instruct.split())

word_count

133

In [None]:
tokenizer(one_shot_instruct, return_tensors='pt')['input_ids'].shape[-1]

201
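These two counts feed into the sequence-length budget: the one-shot prefix is prepended to every dialogue, so the encoder limit has to grow by the prefix length or the dialogue itself gets truncated. A small arithmetic sketch using the 201-token prefix and the 512-token dialogue limit reported earlier (both function names are hypothetical, and the short " What was going on?" suffix is ignored here):

```python
def encoder_budget(prefix_tokens, max_dialogue_tokens):
    """Total encoder length needed so the prefix never eats into the dialogue."""
    return prefix_tokens + max_dialogue_tokens


def dialogue_room(max_length, prefix_tokens):
    """Tokens left for the dialogue when the encoder length is fixed at max_length."""
    return max(0, max_length - prefix_tokens)


PREFIX_TOKENS = 201        # token count of one_shot_instruct
MAX_DIALOGUE_TOKENS = 512  # max_source_length measured earlier

print(encoder_budget(PREFIX_TOKENS, MAX_DIALOGUE_TOKENS))
print(dialogue_room(512, PREFIX_TOKENS))
```

Keeping the limit at 512 would leave only 311 tokens for the dialogue, which is why the preprocessing step adds the prefix length to `max_source_length`. The cost is longer sequences, and therefore more memory, for every training example.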

In [None]:
def preprocess_function(sample, padding="max_length"):
    # Tokenize one_shot_instruct to count how many tokens the prompt prefix uses
    instruction_tokens = tokenizer(one_shot_instruct, return_tensors='pt')['input_ids'].shape[-1]
    
    # Build the model inputs by combining the prompt with each dialogue
    inputs = [one_shot_instruct + item + " What was going on?" for item in sample["dialogue_en"]]

    # Extend max_length by the prompt's token count so the dialogue is not truncated early
    model_inputs = tokenizer(inputs, max_length=max_source_length + instruction_tokens, padding=padding, truncation=True, add_special_tokens=True)

    # Tokenize the targets (summary_en) as well
    labels = tokenizer(text_target=sample["summary_en"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace padding tokens with -100 so they do not contribute to the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['fname', 'dialogue', 'summary', 'topic', 'dialogue_en', 'summary_en', 'topic_en'])

# Print the keys of the processed dataset
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
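The label-masking step in `preprocess_function` can be seen in isolation with plain lists. This is a sketch under two assumptions: that the pad token ID is 0, as it is for T5 tokenizers, and that -100 is the label value the loss skips, which is PyTorch's default `ignore_index` for cross-entropy. The token IDs in the example are invented.

```python
PAD_TOKEN_ID = 0     # assumption: T5 tokenizers use 0 as the pad id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids in each label row so padded positions add nothing to the loss."""
    return [[ignore_index if tok == pad_id else tok for tok in row] for row in label_ids]


# Two hypothetical label rows, the first padded to length 5
labels = [[71, 1525, 1, 0, 0], [363, 47, 352, 30, 1]]
print(mask_padding(labels))
```

Without this step the model would also be trained to predict pad tokens after the end of each summary, which dilutes the loss with positions that carry no information.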


## 3. Fine-tune and evaluate FLAN-T5

After we have processed our dataset, we can start training our model. First we need to load our [FLAN-T5](https://huggingface.co/models?search=flan-t5) checkpoint from the Hugging Face Hub. The cell below loads whichever checkpoint `model_id` points to, which was set to `google/flan-t5-large` earlier in this notebook.
_I plan to do a follow-up post on how to fine-tune the `xxl` version of the model using Deepspeed._


In [None]:
from transformers import AutoModelForSeq2SeqLM

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)



We want to evaluate our model during training. The `Trainer` supports evaluation during training when you provide a `compute_metrics` function.  
The most commonly used metric for summarization is [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), short for Recall-Oriented Understudy for Gisting Evaluation. This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries by measuring how much of their wording overlaps.

We are going to use the `evaluate` library to compute the `rouge` score.
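To make the overlap idea concrete, here is a simplified ROUGE-1 computation on whitespace-split, lowercased words. It is only a sketch: the `rouge` metric loaded through `evaluate` applies its own tokenization and also reports ROUGE-2 and ROUGE-L variants, so its numbers will generally differ from this hypothetical `rouge1` helper.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print({name: round(value, 4) for name, value in scores.items()})
```

Five of the six words match in this example, so precision, recall and F1 all come out at 5/6. A summary that is fluent but uses different wording from the reference would score low, which is the main limitation to keep in mind when reading ROUGE numbers.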

In [None]:
!pip install evaluate


In [None]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    # Replace -100 (ignored positions) with the pad token id so the IDs can be decoded,
    # then keep IDs inside the tokenizer's range, including the added special tokens
    preds = np.array(preds, dtype=np.int64)
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    preds = np.clip(preds, 0, len(tokenizer) - 1)
  
    # Decode token IDs back to text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects one sentence per line
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute the metric and return the scores as percentages
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    
    # Average generated length, ignoring padding
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to
[nltk_data]     /data/ephemeral/home/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Before we can start training, we need to create a `DataCollator` that takes care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the 🤗 Transformers library.

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
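A rough illustration of what `pad_to_multiple_of=8` and `label_pad_token_id=-100` mean for a batch, written with plain lists. This is not the collator's actual implementation; `pad_batch` is a hypothetical helper, and the token IDs are invented.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Pad every sequence to the batch's longest length, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = ((longest + multiple_of - 1) // multiple_of) * multiple_of
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


input_ids = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
label_ids = [[21, 22, 1], [23, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)     # inputs are padded with the pad token id
padded_labels = pad_batch(label_ids, pad_value=-100)  # labels are padded with the ignored index

print([len(row) for row in padded_inputs])
print(padded_labels)
```

The longest input has 9 tokens, so the batch is padded to 16 rather than 9. Rounding up to a multiple of 8 is a common choice because it tends to suit GPU tensor cores in mixed-precision training.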


The last step is to define the hyperparameters (`TrainingArguments`) we want to use for our training. We are leveraging the [Hugging Face Hub](https://huggingface.co/models) integration of the `Trainer` to automatically push our checkpoints, logs and metrics during training into a repository.

In [None]:
!pip install wandb


In [None]:
!wandb login

In [None]:
import torch
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, EarlyStoppingCallback

# GPU 사용 가능 여부 확인 및 설정
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"

# Define training args with additional parameters
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    num_train_epochs=20,  # train for a total of 20 epochs
    learning_rate=1e-5,  # learning rate
    per_device_train_batch_size=1,  # batch size per device during training
    per_device_eval_batch_size=1,  # batch size per device during evaluation
    warmup_ratio=0.1,  # fraction of training steps used for warmup
    weight_decay=0.01,  # weight decay
    lr_scheduler_type='cosine',  # use a cosine learning rate schedule
    optim='adamw_torch',  # optimizer: AdamW
    gradient_accumulation_steps=1,  # gradient accumulation steps
    evaluation_strategy='epoch',  # evaluate once per epoch
    save_strategy='epoch',  # save a checkpoint once per epoch
    save_total_limit=5,  # keep at most 5 checkpoints
    fp16=False,  # mixed-precision training disabled; setting this to True caused overflow
    load_best_model_at_end=True,  # load the best model at the end of training
    seed=42,  # seed for reproducibility
    logging_dir="./logs",  # log directory
    logging_strategy="epoch",  # log once per epoch
    predict_with_generate=True,  # use generate() to produce predictions during evaluation
    generation_max_length=max_target_length,  # maximum generation length
    do_train=True,  # run training
    do_eval=True,  # run evaluation
)

# Create Trainer instance with early stopping callback
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,  # stop if there is no improvement for 3 consecutive evaluations (one per epoch)
            early_stopping_threshold=0.001  # an improvement smaller than 0.001 does not count
        )
    ]
)

# Start training
trainer.train()
# 20 epochs: estimated 13 hours
# 1 epoch: estimated 0.69 hours

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mthebestday[0m ([33mthebestdayor[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


OverflowError: out of range integral type conversion attempted
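
This `OverflowError` is commonly reported when the generated predictions still contain `-100` padding at the point where they are decoded, since `-100` is not a valid token id. Assuming that is what happened here, a usual workaround is to map `-100` back to the pad token id at the top of `compute_metrics`. The sketch below shows the replacement on a toy array with made-up token ids; `pad_token_id = 0` is an assumption that matches T5 tokenizers.

```python
import numpy as np

pad_token_id = 0  # assumption: T5 tokenizers use 0 as the pad token id

# toy predictions where shorter sequences were padded with -100
preds = np.array([[71, 1712, 1, -100, -100],
                  [363, 19, 8, 1525, 1]])

# replace -100 so that every entry is a decodable, non-negative token id
preds = np.where(preds != -100, preds, pad_token_id)
print(preds.tolist())  # [[71, 1712, 1, 0, 0], [363, 19, 8, 1525, 1]]
```

Inside `compute_metrics`, the same `np.where` line would go before the predictions are decoded, with `tokenizer.pad_token_id` in place of the constant.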

We can start our training by using the `train` method of the `Trainer`.

In [None]:
# Start training
trainer.train()

# estimated: 2.16 days

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 


![flan-t5-tensorboard](../assets/flan-t5-tensorboard.png)

Nice, we have trained our model. 🎉 Let's evaluate the best model again on the test set.


In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 819
  Batch size = 8


{'eval_loss': 1.3715944290161133,
 'eval_rouge1': 47.2358,
 'eval_rouge2': 23.5135,
 'eval_rougeL': 39.6266,
 'eval_rougeLsum': 43.3458,
 'eval_gen_len': 17.39072039072039,
 'eval_runtime': 108.99,
 'eval_samples_per_second': 7.514,
 'eval_steps_per_second': 0.945,
 'epoch': 5.0}

The best score we achieved is a `rouge1` score of `47.24`. 
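
For intuition about what this number measures: ROUGE-1 is the overlap of unigrams between the generated and the reference summary, reported here as an F1 score scaled to the range 0 to 100. The function below is a simplified sketch, not the implementation used by the metric library. `rouge1_f1` is a hypothetical helper that splits on whitespace and skips stemming and other normalization, and the two sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # clipped overlap: a unigram counts only as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "Abby met Miro at the party",
    "Abby talked to Miro at the party yesterday",
)
print(round(score * 100, 2))  # 71.43
```

Five of the six predicted words also occur in the eight-word reference, so precision is 5/6, recall is 5/8, and their harmonic mean is 5/7.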

Let's save our results and tokenizer to the Hugging Face Hub and create a model card. 

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

## 4. Run Inference

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [None]:
from transformers import pipeline
from random import randrange        

# load model and tokenizer from huggingface hub with pipeline
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum", device=0)

# select a random test sample
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")

# summarize dialogue
res = summarizer(sample["dialogue"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

dialogue: 
Abby: Have you talked to Miro?
Dylan: No, not really, I've never had an opportunity
Brandon: me neither, but he seems a nice guy
Brenda: you met him yesterday at the party?
Abby: yes, he's so interesting
Abby: told me the story of his father coming from Albania to the US in the early 1990s
Dylan: really, I had no idea he is Albanian
Abby: he is, he speaks only Albanian with his parents
Dylan: fascinating, where does he come from in Albania?
Abby: from the seacoast
Abby: Duress I believe, he told me they are not from Tirana
Dylan: what else did he tell you?
Abby: That they left kind of illegally
Abby: it was a big mess and extreme poverty everywhere
Abby: then suddenly the border was open and they just left 
Abby: people were boarding available ships, whatever, just to get out of there
Abby: he showed me some pictures, like <file_photo>
Dylan: insane
Abby: yes, and his father was among the people
Dylan: scary but interesting
Abby: very!
---------------
flan-t5-base summary:
A