## 1. Setup Development Environment

Our first step is to confirm that a GPU is available and then install the Hugging Face libraries, including `transformers` and `datasets`. The first cell below checks for a CUDA device; the second installs all the required packages.

In [1]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
    print(f"GPU device name: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available")


GPU is available
GPU device name: NVIDIA GeForce RTX 3090


In [29]:
# python
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/75/35/07c9879163b603f0e464b0f6e6e628a2340cfc7cdc5ca8e7d52d776710d4/transformers-4.44.2-py3-none-any.whl.metadata
  Using cached transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Obtaining dependency information for tokenizers<0.20,>=0.19 from https://files.pythonhosted.org/packages/40/4f/eb78de4af3b17b589f43a369cbf0c3a7173f25c3d2cd93068852c07689aa/tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…


## 2. Load and prepare the DialogueSum dataset from local files
- The DialogueSum dataset was originally in English and was translated into Korean by the teachers using the Solar API for educational purposes. Because the Korean translation read somewhat unnaturally to native speakers, I used the Solar API to translate it back into English to support more accurate summarization.

To load the dataset from the local CSV files, we use the `load_dataset()` method from the 🤗 Datasets library.


In [3]:
dataset_id = "dialoguSum_Solar_koen"
# huggingface hub model id
model_id="paust/pko-flan-t5-large"

In [4]:
from datasets import load_dataset

# Load the dataset from local CSV files
dataset = load_dataset(
    'csv',
    data_files={
        'train': "/data/ephemeral/home/data/train_en.csv",
        'val': "/data/ephemeral/home/data/dev_en.csv",
    },
)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"val dataset size: {len(dataset['val'])}")

# Train dataset size: 12457
# val dataset size: 499

Train dataset size: 12457
val dataset size: 499


In [5]:
dataset['train']

Dataset({
    features: ['fname', 'dialogue', 'summary', 'topic', 'dialogue_en', 'summary_en', 'topic_en'],
    num_rows: 12457
})

Let's check out an example from the dataset.

In [4]:
from random import randrange        


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
#Person1#: 어서오세요! 리틀 이탈리아에 오신 것을 환영합니다. 저희는 전형적인 이탈리아 가정이랍니다!
#Person2#: 그렇다고 들었습니다. 그래서 너무 즐거워요.
#Person1#: 이탈리아 남자와 결혼하지 않았다면 이렇게 자주 임신하지 않았을 거예요. 아마도 아이 대신에 돼지를 키울 수 있었겠죠!
#Person2#: 네? 음. . . 요즘도 대가족을 꾸리는 사람들이 있다는 것이 좋네요.
#Person1#: 그렇죠. 하지만 터프가이 남편이 꾸물거리지 말고 좀 더 적극적으로 도와주면 더 좋을 것 같아요. 하하. . . 이거 드세요. 이탈리아에서 온 거예요!
---------------
summary: 
#Person2#가 리틀 이탈리아에서 즐거운 시간을 보냅니다. 주인인 #Person1#은 아이들이 너무 많아서 힘들다고 불평합니다.
---------------


In [5]:
from transformers import T5TokenizerFast
# Load the tokenizer that belongs to model_id (paust/pko-flan-t5-large)
tokenizer = T5TokenizerFast.from_pretrained(model_id)

In [6]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")
min_source_length = min([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Min source length: {min_source_length}")


# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")
min_target_length = min([len(x) for x in tokenized_targets["input_ids"]])
print(f"Min target length: {min_target_length}")

Max source length: 1791
Min source length: 70
Max target length: 319
Min target length: 14
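
The maximum source length of 1791 tokens is likely driven by a few very long dialogues, and padding every sample to that length is expensive. One common alternative is to pick a high percentile of the length distribution instead of the maximum. The sketch below shows the idea with a nearest-rank percentile; the function name `percentile_length` and the example lengths are hypothetical and not part of the notebook.

```python
import math

def percentile_length(lengths, pct=95):
    """Return the smallest length that covers `pct` percent of samples (nearest rank)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank percentile: 1-based rank of the covering element
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts, not taken from the dataset
example_lengths = [70, 120, 150, 180, 200, 240, 300, 350, 400, 1791]
print(percentile_length(example_lengths, 90))   # covers 9 of the 10 samples
print(percentile_length(example_lengths, 100))  # same as max()
```

In the notebook, the input would be `[len(x) for x in tokenized_inputs["input_ids"]]`; samples above the chosen length would then be truncated.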


In [7]:
special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
for token in special_tokens:
    if token in tokenizer.get_vocab():
        print(f"'{token}' is already in the vocabulary.")
    else:
        print(f"'{token}' is not in the vocabulary.")


'#CarNumber#' is not in the vocabulary.
'#SSN#' is not in the vocabulary.
'#PhoneNumber#' is not in the vocabulary.
'#PassportNumber#' is not in the vocabulary.
'#Email#' is not in the vocabulary.
'#CardNumber#' is not in the vocabulary.
'#Address#' is not in the vocabulary.
'#DateOfBirth#' is not in the vocabulary.
'#Person4#' is not in the vocabulary.
'#Person7#' is not in the vocabulary.
'#Person3#' is not in the vocabulary.
'#Person2#' is not in the vocabulary.
'#Person#' is not in the vocabulary.
'#Person6#' is not in the vocabulary.
'#Person5#' is not in the vocabulary.
'#Person1#' is not in the vocabulary.


In [8]:
original_vocab_size = len(tokenizer)

special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
#tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
tokenizer.add_tokens(special_tokens)
new_vocab_size = len(tokenizer)

print(f"Original vocab size: {original_vocab_size}")
print(f"New vocab size: {new_vocab_size}")

# Original vocab size: 50358
# New vocab size: 50374

Original vocab size: 50358
New vocab size: 50374
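
To see why adding the markers matters, it helps to picture how added tokens are handled: roughly speaking, the text is first split around the added tokens so that each one survives as a single piece, and only the remaining text goes through the subword model. The sketch below is a toy illustration of that splitting step in plain Python; `split_on_added_tokens` is a hypothetical helper, not the actual `tokenizers` implementation.

```python
import re

def split_on_added_tokens(text, added_tokens):
    """Split text so that every added token is kept as one intact piece."""
    if not added_tokens:
        return [text] if text else []
    # Longest tokens first so a longer marker is never shadowed by a shorter one
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(tok) for tok in ordered)
    pieces = re.split(f"({pattern})", text)
    return [piece for piece in pieces if piece]

markers = ["#Person1#", "#Person2#", "#PhoneNumber#"]
print(split_on_added_tokens("#Person1#: call #PhoneNumber#", markers))
```

Without the added tokens, a marker such as `#Person1#` would be broken into several subword pieces, which makes it harder for the model to treat it as one speaker identifier.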


### Checking the tokenizer

In [9]:
# Check that the tokenizer works as expected
# Define a test sentence
sentence = dataset["train"]['dialogue'][0]


# Encode the sentence using the tokenizer (returns plain Python lists, since return_tensors is not set)
sentence_encoded = tokenizer(sentence, 
                             truncation=True, 
                             add_special_tokens=True)

# Decode the encoded sentence, skipping special tokens
sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"], 
        skip_special_tokens=True
    )

# Print SENTENCE
print('SENTENCE:')
print(sentence)

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


SENTENCE:
#Person1#: 안녕하세요, 스미스씨. 저는 호킨스 의사입니다. 오늘 왜 오셨나요?
#Person2#: 건강검진을 받는 것이 좋을 것 같아서요.
#Person1#: 그렇군요, 당신은 5년 동안 건강검진을 받지 않았습니다. 매년 받아야 합니다.
#Person2#: 알고 있습니다. 하지만 아무 문제가 없다면 왜 의사를 만나러 가야 하나요?
#Person1#: 심각한 질병을 피하는 가장 좋은 방법은 이를 조기에 발견하는 것입니다. 그러니 당신의 건강을 위해 최소한 매년 한 번은 오세요.
#Person2#: 알겠습니다.
#Person1#: 여기 보세요. 당신의 눈과 귀는 괜찮아 보입니다. 깊게 숨을 들이쉬세요. 스미스씨, 담배 피우시나요?
#Person2#: 네.
#Person1#: 당신도 알다시피, 담배는 폐암과 심장병의 주요 원인입니다. 정말로 끊으셔야 합니다. 
#Person2#: 수백 번 시도했지만, 습관을 버리는 것이 어렵습니다.
#Person1#: 우리는 도움이 될 수 있는 수업과 약물들을 제공하고 있습니다. 나가기 전에 더 많은 정보를 드리겠습니다.
#Person2#: 알겠습니다, 감사합니다, 의사선생님.

ENCODED SENTENCE:
[50373, 27, 222, 1381, 963, 13, 222, 14563, 796, 15, 222, 425, 274, 222, 528, 21693, 222, 2183, 535, 15, 222, 805, 222, 997, 222, 29297, 296, 32, 200, 50369, 27, 222, 1323, 6831, 291, 222, 1760, 222, 398, 262, 222, 2150, 222, 398, 222, 2585, 296, 15, 200, 50373, 27, 222, 5367, 296, 13, 222, 1837, 311, 222, 22, 482, 222, 1063, 222, 1323, 6831, 291, 222, 3629, 222, 6136, 15, 222, 4376, 222, 546

## Summarizing Using Prompt Engineering

### Applying Zero Shot Inference

In [26]:
model

T5ForConditionalGeneration(
  (shared): Embedding(50358, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(50358, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
       

In [10]:
import pandas as pd

# Get the original vocabulary size
original_vocab_size = len(tokenizer)
print(f"Original vocab size: {original_vocab_size}")

# Define a function to extract unique words from text
def extract_unique_words(dataset_column):
    unique_words = set()
    for sentence in dataset_column:
        words = sentence.split()  # Simple split, adjust with tokenizer if needed
        unique_words.update(words)
    return unique_words

# Step 1: Extract unique words from the dataset
unique_words_train_dialogue = extract_unique_words(dataset['train']['dialogue'])
unique_words_train_summary = extract_unique_words(dataset['train']['summary'])
unique_words_val_dialogue = extract_unique_words(dataset['val']['dialogue'])
unique_words_val_summary = extract_unique_words(dataset['val']['summary'])

# Step 2: Extract unique words from the test set
test = pd.read_csv('/data/ephemeral/home/data/test.csv')
unique_words_test_dialogue = extract_unique_words(test['dialogue'])

# Combine all unique words
all_unique_words = unique_words_train_dialogue | unique_words_train_summary | unique_words_val_dialogue | unique_words_val_summary | unique_words_test_dialogue

# Step 3: Add these unique words to the tokenizer vocabulary
tokenizer.add_tokens(list(set(all_unique_words)))

# Step 4: Check the new vocabulary size
new_vocab_size = len(tokenizer)
print(f"New vocab size: {new_vocab_size}")


Original vocab size: 50374
New vocab size: 195808
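
Adding every whitespace-separated word grows the vocabulary from 50374 to 195808 entries. Each new entry needs a new embedding row, and those rows have no pretrained values once the model's embeddings are resized. The arithmetic below estimates how many untrained parameters that adds per embedding matrix, assuming the hidden size of 1024 shown in the printed `Embedding(50358, 1024)`; the helper name is hypothetical.

```python
def added_embedding_parameters(old_vocab, new_vocab, hidden_size):
    """Number of new embedding weights created when the vocabulary grows."""
    return (new_vocab - old_vocab) * hidden_size

extra = added_embedding_parameters(50374, 195808, 1024)
print(f"{extra:,} new parameters per embedding matrix")
```

That is roughly 149 million untrained weights per matrix, which is a plausible reason for the incoherent generations seen when the model is used without fine-tuning after this change.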


### Applying Zero Shot Inference

In [11]:
# zero shot
from transformers import T5ForConditionalGeneration

# load model from the hub
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.resize_token_embeddings(len(tokenizer))


# Define a test sentence
sentence = dataset["train"]['dialogue'][0]
golden = dataset["train"]['summary'][0]

# Korean instruction: "Summarize the following dialogue in Korean:"
instruction = f"""다음 대화를 한국어로 요약해줘:\n{sentence}
"""
#instruction = ["Please summarize the conversation by clearly stating what each speaker did or said. : " +sentence]
# instruction = ["In this '#Person1#: Hello, Mr. Smith. I'm Dr. Hawkins.' dialogue, the speaker is #Person1#. \
#     Summarize the conversation with a focus on the speakers, ensuring that each speaker's name or identifier, such as #Person1#, is accurately used as the subject in the summary. : " + sentence]
# Encode the sentence using the tokenizer, returning PyTorch tensors
sentence_encoded = tokenizer(instruction, 
                             max_length=max_source_length, 
                             padding="longest", 
                             truncation=True, 
                             add_special_tokens=True,
                             return_tensors="pt")  # Ensure tensors are returned for model input

# Generate the summary using the model
summary_ids = model.generate(
    sentence_encoded["input_ids"], 
    max_length=max_target_length, 
    min_length=40, 
    num_beams=5,  # Optional: control the generation strategy
    early_stopping=True,  # Optional: stop early when all beams are finished
    no_repeat_ngram_size=2
)

# Decode the encoded sentence, skipping special tokens
sentence_decoded = tokenizer.decode(
    summary_ids[0],  # Select the first (and usually only) sequence generated
    skip_special_tokens=True  # Skip special tokens in the final output
    )

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

# Print the reference (golden) summary
print('\nGOLDEN:')
print(golden)



ENCODED SENTENCE:
tensor([[ 54883,    222,  80742,    222, 132110,    222,  87673,   1496,     27,
            200,  84037,    222, 154137,    222, 157557,    222, 181610,    222,
          79309,    222,  77117,    222, 185842,    222,  96891,    222, 178248,
            200,  94525,    222, 105686,    222, 159632,    222, 124699,    222,
          89268,    222,  72668,    222,  80201,    200,  84037,    222, 148923,
            222, 159244,    222,  55032,    222,  98756,    222, 105686,    222,
          95426,    222, 181071,    222, 182897,    222,  76739,    222, 113915,
            200,  94525,    222,  67100,    222,  61758,    222,  80559,    222,
         188558,    222, 176692,    222, 150619,    222,  96891,    222, 169610,
            222,  73772,    222, 172848,    222,  81850,    200,  84037,    222,
          90448,    222,  52227,    222, 111605,    222,  90272,    222,  86731,
            222,  59618,    222,  86698,    222, 144898,    222, 132333,    222,
         

### Applying One Shot Inference

In [34]:
def make_prompt(example_indices_full, example_index_to_summarize):
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['train']['dialogue'][index]
        summary = dataset['train']['summary'][index]

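        # Prompt labels are Korean: '대화:' = 'Dialogue:', '대화내용요약:' = 'Dialogue summary:'.
        # FLAN-T5 prompts usually end each example with a stop sequence such as '\n\n\n'.
        # None is appended here, so consecutive examples run together in the printed prompt.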
        prompt += f"""대화:{dialogue}\n대화내용요약:{summary}"""

    dialogue = dataset['train']['dialogue'][example_index_to_summarize]

    prompt += f"""대화:{dialogue}\n대화내용요약:"""

    return prompt
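
As a point of comparison, here is a minimal sketch of a prompt builder that puts an explicit separator between examples, so that one example's summary does not run into the next dialogue. It works on plain dictionaries rather than the `dataset` object; the function name `build_few_shot_prompt`, the `separator` default, and the sample data are assumptions for illustration.

```python
def build_few_shot_prompt(examples, query_dialogue, separator="\n\n\n"):
    """Join solved examples and one unsolved dialogue into a single prompt."""
    blocks = [
        f"대화:{ex['dialogue']}\n대화내용요약:{ex['summary']}"
        for ex in examples
    ]
    # The final block leaves the summary empty for the model to complete
    blocks.append(f"대화:{query_dialogue}\n대화내용요약:")
    return separator.join(blocks)

# Hypothetical toy data
shots = [{"dialogue": "#Person1#: Hi.\n#Person2#: Hello.", "summary": "They greet each other."}]
print(build_few_shot_prompt(shots, "#Person1#: Can I help you?"))
```

Whether the separator actually improves generation for this model would need to be checked empirically.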

In [35]:
example_indices_full = [1]
example_index_to_summarize = 101

one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(one_shot_prompt)

대화:#Person1#: 안녕하세요, 파커 부인, 어떻게 지내셨나요?
#Person2#: 안녕하세요, 피터스 박사님. 잘 지냈습니다, 감사합니다. 리키와 함께 백신 접종을 위해 왔습니다.
#Person1#: 좋습니다. 백신 접종 기록을 보니, 리키는 이미 소아마비, 디프테리아, B형 간염 백신을 맞았군요. 그는 14개월이므로, 이제 A형 간염, 수두, 홍역 백신을 맞아야 합니다.
#Person2#: 풍진과 볼거리는 어떻게 되나요?
#Person1#: 지금은 이 백신들만 접종할 수 있고, 몇 주 후에 나머지를 접종할 수 있습니다.
#Person2#: 좋습니다. 박사님, 저도 디프테리아 예방접종이 필요할 것 같아요. 마지막으로 맞은 게 아마도 15년 전이었던 것 같아요!
#Person1#: 저희가 기록을 확인하고 간호사에게 부스터를 접종하도록 하겠습니다. 이제, 리키의 팔을 꽉 잡아주세요, 조금 찌릿할 수 있습니다.
대화내용요약:파커 부인이 리키를 데리고 백신 접종을 하러 갔다. 피터스 박사는 기록을 확인한 후 리키에게 백신을 접종했다.대화:#Person1#: 도와드릴까요?
#Person2#: MP-3 플레이어를 찾고 있어요. 어떤 브랜드가 가장 품질이 좋나요?
#Person1#: 파이오니어를 추천드립니다.
#Person2#: 어떤 모델이 가장 잘 팔리나요?
#Person1#: 이 모델이 여성들에게 매우 인기가 있습니다.
#Person2#: 그것을 볼 수 있을까요?
#Person1#: 물론입니다, 이것은 다기능입니다. 음악을 재생하는 것 외에도 문서를 저장하고 녹음하는 데도 사용할 수 있습니다.
#Person2#: 이 모델을 흰색으로 가지고 계신가요?
#Person1#: 아니요, 하지만 노란색은 있습니다.
#Person2#: 그럼 노란색으로 할게요.
#Person1#: 잠시만 기다려주세요. 가져다 드리겠습니다.
#Person2#: 알겠어요.
대화내용요약:


In [36]:

# One-shot inference: the solved example gives the model more context about the task

summary = dataset['train']['summary'][example_index_to_summarize]

inputs = tokenizer(one_shot_prompt, #truncation=True, 
                             add_special_tokens=True,
                             return_tensors="pt")
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=max_target_length,
        min_length=40, 
        num_beams=5,  # Optional: control the generation strategy
        early_stopping=True,  # Optional: stop early when all beams are finished
        no_repeat_ngram_size=2
        )[0],
    skip_special_tokens=True
)

#print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
#print(dash_line)
print(f'MODEL GENERATION - ONE SHOT:\n{output}')

BASELINE HUMAN SUMMARY:
#Person2#는 MP-3 플레이어를 찾고 있습니다. #Person1#는 파이오니어를 추천하고 #Person2#는 노란색을 선택합니다.

MODEL GENERATION - ONE SHOT:
20대여성: #@시스템#사진#
30대남성:..??!!?
40대기혼:,,?,
50대미혼:-_--,-/-ᅳᅳ-;;,//;::;
60대임부:/_/:__ᅳ
70대임산부도 괜찮으려나? ᄏᄏ
200명중 한명은 괜찮은데 나머지는 아예 안보여서 아쉽더라 ᅲᅲ 흑흑 ᅮᅮ
190명 중에 160명이 과체중이라 그런지 몸무게가 엄청 무거웠어요 ^^ ᄒ
150명의 임산부는 다이어트 열심히 하고 있대요. 허허허.
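
Both generation calls above pass `no_repeat_ngram_size=2`, which prevents any two-token sequence from appearing twice in the output. The check below is a small sketch of that constraint applied after the fact to a list of token IDs; `has_repeated_ngram` is a hypothetical helper, not a `transformers` function.

```python
def has_repeated_ngram(token_ids, n=2):
    """Return True if any n-gram of token IDs occurs more than once."""
    seen = set()
    for start in range(len(token_ids) - n + 1):
        ngram = tuple(token_ids[start:start + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram([1, 2, 3, 1, 2]))  # the bigram (1, 2) occurs twice
print(has_repeated_ngram([1, 2, 3, 2, 1]))  # all bigrams are distinct
```

A small `n` can be too strict for summaries that legitimately repeat short phrases such as speaker markers followed by the same particle, so the value is worth tuning.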


### Applying Few Shot Inference

In [12]:
example_indices_full = [11, 21, 51]
example_index_to_summarize = 101

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)

print(few_shot_prompt)

NameError: name 'make_prompt' is not defined

In [None]:
summary = dataset['train']['summary_en'][example_index_to_summarize]

inputs = tokenizer(few_shot_prompt, return_tensors='pt', add_special_tokens=True)
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"],
        max_new_tokens=50,
    )[0],
    skip_special_tokens=True
)

#print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
#print(dash_line)
print(f'MODEL GENERATION - FEW SHOT:\n{output}')

BASELINE HUMAN SUMMARY:
#Person2# is looking for an MP-3 player. #Person1# recommends a pioneer and #Person2# chooses yellow.

MODEL GENERATION - FEW SHOT:
rien is looking for an MP-3 player. He wants to buy it in yellow.


#### Conclusion: one-shot and few-shot inference work better than modifying the tokenizer.

### tokenized_dataset

In [None]:
one_shot_instruct = """Dialogue:

#Person1#: We should check in at the Air China counter half an hour before takeoff, Joy.
#Person2#: Yeah, I know. The boarding time on the ticket is 17:05, and it's 16:15 now. I think we have enough time.
#Person1#: Do we need to show our IDs when we check in?
#Person2#: Yeah, that's a must.
#Person1#: What about our luggage?
#Person2#: We can check in our luggage and carry our small bags in our hands. And we need to open each of them for inspection.
#Person1#: Do you think they will search every passenger?
#Person2#: I think so. We definitely don't want to have a hijacking incident on the plane today, do we?

What was going on?
#Person1# asks #Person2# what to do when checking in at the Air China counter.

Dialogue:
"""

In [None]:
# Count the words
word_count = len(one_shot_instruct.split())

word_count

133

In [None]:
tokenizer(one_shot_instruct, return_tensors='pt')['input_ids'].shape[-1]

201

In [None]:
def preprocess_function(sample, padding="max_length"):
    # Tokenize one_shot_instruct to count its tokens
    instruction_tokens = tokenizer(one_shot_instruct, return_tensors='pt')['input_ids'].shape[-1]
    
    # Build the inputs by combining the prompt with each dialogue
    inputs = [one_shot_instruct + item + " What was going on?" for item in sample["dialogue_en"]]

    # Set max_length so that it accounts for the prompt's token count
    model_inputs = tokenizer(inputs, max_length=max_source_length + instruction_tokens, padding=padding, truncation=True, add_special_tokens=True)

    # Tokenize the targets (summary_en) as well
    labels = tokenizer(text_target=sample["summary_en"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace padding tokens with -100 so they do not affect the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['fname', 'dialogue', 'summary', 'topic', 'dialogue_en', 'summary_en', 'topic_en'])

# Print the keys of the processed dataset
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Map:   0%|          | 0/499 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']
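
The label-masking step in `preprocess_function` is easy to check in isolation. The sketch below applies the same replacement to plain lists, assuming a pad token ID of 0 (the value that appears as padding in the tokenized labels of this notebook) and the conventional ignore index of -100; `mask_padding` is a hypothetical name.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace pad IDs with ignore_index so the loss skips padded positions."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Hypothetical padded label batch: 1 is the end-of-sequence ID, 0 is padding
batch = [[523, 88, 1, 0, 0], [91, 1, 0, 0, 0]]
print(mask_padding(batch))
```

Only the padded positions change; the end-of-sequence token stays in place so the model still learns when to stop.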


## 3. Fine-tune and evaluate FLAN-T5

After processing the dataset, we can start training the model. We first load the model from the Hugging Face Hub. The original [FLAN-T5](https://huggingface.co/models?search=flan-t5) checkpoints are listed there; this notebook uses the Korean variant `paust/pko-flan-t5-large` set in `model_id`, running on the NVIDIA GeForce RTX 3090 detected in the setup step.
_I plan to do a follow-up post on how to fine-tune the `xxl` version of the model using Deepspeed._


In [25]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained('paust/pko-flan-t5-large')
model = T5ForConditionalGeneration.from_pretrained('paust/pko-flan-t5-large')

prompt = """서울특별시(서울特別市, 영어: Seoul Metropolitan Government)는 대한민국 수도이자 최대 도시이다. 선사시대부터 사람이 거주하였으나 본 역사는 백제 첫 수도 위례성을 시초로 한다. 삼국시대에는 전략적 요충지로서 고구려, 백제, 신라가 번갈아 차지하였으며, 고려 시대에는 왕실의 별궁이 세워진 남경(南京)으로 이름하였다.
한국의 수도는 어디입니까?"""
input_ids = tokenizer(prompt, add_special_tokens=True, return_tensors='pt').input_ids
output_ids = model.generate(input_ids=input_ids, max_new_tokens=32, num_beams=12)
text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(text)  # 서울특별시


서울특별시


In [12]:
def preprocess_function(sample, padding="max_length"):
    
    # Build the inputs by combining the instruction with each dialogue
    # (Korean instruction: "Summarize the dialogue in Korean.")
    inputs = ["대화를 한국어로 요약해줘.\n" + item for item in sample["dialogue"]]

    # Set max_length so that it leaves room for the instruction tokens
    model_inputs = tokenizer(inputs, max_length=max_source_length + 20, padding=padding, truncation=True, add_special_tokens=True)

    # Tokenize the targets (summary) as well
    labels = tokenizer(text_target=sample["summary"], 
                        max_length=max_target_length, 
                        padding=padding, 
                        truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore padding in the loss.
    if padding == "max_length":
        
        if isinstance(labels["input_ids"][0], list):  # Check if it is a list of lists
            print(f'labels["input_ids"][0]: {labels["input_ids"][0]}')
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]
        else:  # Handle single instance case
            print(f'labels["input_ids"]: {labels["input_ids"]}')
            labels["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in labels["input_ids"]]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Apply the preprocessing function to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['fname', 'dialogue', 'summary', 'topic'])

# Print the keys of the processed dataset
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Map:   0%|          | 0/12457 [00:00<?, ? examples/s]

labels["input_ids"][0]: [56123, 222, 105686, 222, 122393, 222, 151113, 222, 79309, 222, 156065, 222, 182897, 222, 105686, 222, 159632, 222, 95433, 222, 168566, 222, 79309, 222, 156065, 222, 56123, 222, 58316, 222, 178311, 222, 129594, 222, 111276, 222, 133693, 222, 184477, 222, 67098, 222, 65780, 222, 56908, 222, 65105, 222, 88895, 222, 149030, 222, 77292, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Map:   0%|          | 0/499 [00:00<?, ? examples/s]

labels["input_ids"][0]: [171882, 222, 149708, 222, 108763, 222, 166194, 222, 156065, 222, 178892, 222, 111980, 222, 184324, 222, 115269, 222, 88515, 222, 61633, 222, 165737, 222, 109031, 222, 140744, 222, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [13]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
    # Convert to an integer array and clip to the valid ID range.
    # len(tokenizer) includes the added tokens, unlike tokenizer.vocab_size.
    preds = np.array(preds, dtype=np.int64)
    preds = np.clip(preds, 0, len(tokenizer) - 1)
  
    # Decode the token IDs to text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels with the pad token ID so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects one sentence per line
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute the metrics and return the results as percentages
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to
[nltk_data]     /data/ephemeral/home/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
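
To make the `rouge1` number easier to interpret, here is a minimal sketch of a ROUGE-1 F1 score computed by hand. It assumes plain whitespace tokenization with no stemming, so it will not reproduce the exact values of the `evaluate` metric; `rouge1_f1` is a hypothetical helper for illustration only.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as in the other text
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat slept"))
```

For Korean summaries it is worth verifying how the metric's own tokenizer treats non-ASCII text before trusting the scores, since tokenization choices strongly affect unigram overlap.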


In [14]:
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration,T5TokenizerFast

# load model from the hub
# model = T5ForConditionalGeneration.from_pretrained(model_id)
# model.resize_token_embeddings(len(tokenizer))

# load model from the hub
#model = T5TokenizerFast.from_pretrained(model_id)

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)


In [15]:
!wandb login

In [16]:
import torch
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, EarlyStoppingCallback

# Check GPU availability and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"

# Define training args with additional parameters
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    num_train_epochs=20,  # train for up to 20 epochs
    learning_rate=1e-5,  # learning rate
    per_device_train_batch_size=1,  # per-device batch size during training
    per_device_eval_batch_size=1,  # per-device batch size during evaluation
    warmup_ratio=0.1,  # warmup ratio
    weight_decay=0.01,  # weight decay
    lr_scheduler_type='cosine',  # use a cosine scheduler
    optim='adamw_torch',  # optimizer: AdamW
    gradient_accumulation_steps=16,  # gradient accumulation steps
    evaluation_strategy='epoch',  # evaluate every epoch
    save_strategy='epoch',  # save every epoch
    save_total_limit=5,  # keep at most 5 checkpoints
    fp16=True,  # enable mixed-precision training; note: setting this to True caused overflow
    load_best_model_at_end=True,  # load the best model at the end of training
    seed=42,  # seed for reproducibility
    logging_dir="./logs",  # log directory
    logging_strategy="epoch",  # log every epoch
    predict_with_generate=True,  # use generate() during evaluation
    generation_max_length=max_target_length,  # maximum generation length
    do_train=True,  # run training
    do_eval=True,  # run evaluation
)

# Create Trainer instance with early stopping callback
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,  # stop if there is no improvement for 3 consecutive evaluations (epochs)
            early_stopping_threshold=0.001  # an improvement smaller than 0.001 does not count
        )
    ]
)

# Clear the GPU memory cache
torch.cuda.empty_cache()
# Start training
trainer.train()
# 20 epochs: estimated 13 hours
# 1 epoch: estimated 0.69 hours

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 20.81 MiB is free. Process 3316941 has 23.66 GiB memory in use. Of the allocated memory 23.31 GiB is allocated by PyTorch, and 50.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
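
The early-stopping settings used above (`early_stopping_patience=3`, `early_stopping_threshold=0.001`) can be reasoned about with a small simulation. The sketch below is a simplified model of the rule, assuming the monitored metric is the evaluation loss (lower is better); it is not the `transformers` implementation, and `should_stop` is a hypothetical name.

```python
def should_stop(eval_losses, patience=3, threshold=0.001):
    """Return True if training would stop early for this loss history."""
    best = None
    evals_without_improvement = 0
    for loss in eval_losses:
        # Count an evaluation as an improvement only if it beats the best by more than threshold
        if best is None or best - loss > threshold:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return True
    return False

print(should_stop([1.50, 1.40, 1.40, 1.40, 1.40]))  # three flat evaluations in a row
print(should_stop([1.50, 1.40, 1.30, 1.20]))        # still improving
```

With evaluation once per epoch, a patience of 3 means at least three extra epochs are spent after the best checkpoint, which matters given the estimated 0.69 hours per epoch.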

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 819
  Batch size = 8


{'eval_loss': 1.3715944290161133,
 'eval_rouge1': 47.2358,
 'eval_rouge2': 23.5135,
 'eval_rougeL': 39.6266,
 'eval_rougeLsum': 43.3458,
 'eval_gen_len': 17.39072039072039,
 'eval_runtime': 108.99,
 'eval_samples_per_second': 7.514,
 'eval_steps_per_second': 0.945,
 'epoch': 5.0}

The best score we achieved is a `rouge1` score of `47.23`. 

Let's save our results and tokenizer to the Hugging Face Hub and create a model card. 

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

## 4. Run Inference

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [None]:
from transformers import pipeline
from random import randrange        

# load model and tokenizer from huggingface hub with pipeline
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum", device=0)

# select a random test sample
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")

# summarize dialogue
res = summarizer(sample["dialogue"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

dialogue: 
Abby: Have you talked to Miro?
Dylan: No, not really, I've never had an opportunity
Brandon: me neither, but he seems a nice guy
Brenda: you met him yesterday at the party?
Abby: yes, he's so interesting
Abby: told me the story of his father coming from Albania to the US in the early 1990s
Dylan: really, I had no idea he is Albanian
Abby: he is, he speaks only Albanian with his parents
Dylan: fascinating, where does he come from in Albania?
Abby: from the seacoast
Abby: Duress I believe, he told me they are not from Tirana
Dylan: what else did he tell you?
Abby: That they left kind of illegally
Abby: it was a big mess and extreme poverty everywhere
Abby: then suddenly the border was open and they just left 
Abby: people were boarding available ships, whatever, just to get out of there
Abby: he showed me some pictures, like <file_photo>
Dylan: insane
Abby: yes, and his father was among the people
Dylan: scary but interesting
Abby: very!
---------------
flan-t5-base summary:
A