## 1. Setup Development Environment

Our first step is to check that a GPU is available and to install the Hugging Face libraries, including transformers and datasets. Running the following cells checks the GPU and installs all the required packages.

In [1]:
import torch

if torch.cuda.is_available():
    print("GPU is available")
    print(f"GPU device name: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available")


GPU is available
GPU device name: NVIDIA GeForce RTX 3090


In [29]:
# python
!pip install pytesseract transformers datasets rouge-score nltk tensorboard py7zr --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/75/35/07c9879163b603f0e464b0f6e6e628a2340cfc7cdc5ca8e7d52d776710d4/transformers-4.44.2-py3-none-any.whl.metadata
  Using cached transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Obtaining dependency information for tokenizers<0.20,>=0.19 from https://files.pythonhosted.org/packages/40/4f/eb78de4af3b17b589f43a369cbf0c3a7173f25c3d2cd93068852c07689aa/tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
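The `huggingface/tokenizers` fork warning in the output above appears when a shell command runs after the tokenizers library has already used parallelism. A minimal sketch of the remedy the warning itself suggests is shown below; choosing `"false"` rather than `"true"` is an assumption made here, and the variable has to be set before the first tokenizer call to take effect.

```python
import os

# Set this before the first tokenizer call so that later `!pip` or other
# subprocess calls do not trigger the fork warning. "false" is one valid
# choice; the warning text lists "true" as the other.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```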

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…


## 2. Load and prepare the DialogueSum dataset from local files
- The DialogueSum dataset was originally in English and was translated into Korean by teachers using the Solar API for educational purposes. Because the translation read somewhat unnaturally to native Korean speakers, I used the Solar API to translate it back into English to support more accurate summarization.

To load the dataset, we use the `load_dataset()` method from the 🤗 Datasets library with the `csv` builder, pointing it at the local train and dev files.


In [3]:
dataset_id = "dialoguSum_Solar_koen"
# huggingface hub model id
model_id="beomi/gemma-ko-2b"

In [6]:
from datasets import load_dataset

# Load the train and validation splits from local CSV files
dataset = load_dataset('csv', data_files={'train': "/data/ephemeral/home/data/train.csv", 'val': "/data/ephemeral/home/data/dev.csv"})

print(f"Train dataset size: {len(dataset['train'])}")
print(f"val dataset size: {len(dataset['val'])}")

# Train dataset size: 12457
# val dataset size: 499

Train dataset size: 12457
val dataset size: 499


In [6]:
dataset['train']

Dataset({
    features: ['fname', 'dialogue', 'summary', 'topic'],
    num_rows: 12457
})

Let's check out an example from the dataset.

In [3]:
from random import randrange        


sample = dataset['train'][randrange(len(dataset["train"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")
print(f"summary: \n{sample['summary']}\n---------------")

dialogue: 
#Person1#: 내일은 근로자의 날이니까, 노동 직원 모두를 한 시간 일찍 퇴근시키는 것이 좋은 제스처가 될 것 같아요. 어떻게 생각하세요?
#Person2#: 뭐라고요! 완전 말도 안 되는 소리예요! 근로자의 날은 모두를 위한 휴일이지, 노동 직원만을 위한 것이 아니에요. . . 그리고 우리는 어차피 내일 하루 종일 쉴 텐데, 오늘 밤에 한 시간 더 쉬는 게 무슨 의미가 있나요?
#Person1#: 우리 모두가 근로자의 날을 휴일로 쉬지만, 이 휴일의 진정한 목적은 수고로운 노동을 하는 모든 사람들을 기리는 거예요. 다른 사람들이 하기를 꺼려하는 일을 하는 사람들을 기리는 것이죠. 우리는 일반 노동자를 기리기 위해 무언가를 해야 한다고 생각해요.
#Person2#: 그렇다면 우리가 노동 직원에게 퇴근을 일찍 허용한다면, 얼마나 많은 사람들이 퇴근하게 되나요?
#Person1#: 우리 회사에서는 노동 직원이 전체 직원의 60%를 차지합니다. 우리는 단지 절반 조금 이상의 사람들에게 조금 일찍 퇴근하게 해주는 것일 뿐입니다.
---------------
summary: 
#Person1#은 근로자의 날 전날 모든 노동 직원을 한 시간 일찍 퇴근시키는 것을 제안합니다. #Person2#이 그 이유를 이해하지 못해서, #Person1#이 근로자의 날의 주요 목적을 #Person2#에게 설명합니다.
---------------
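Each dialogue is a single string in which every turn starts with a speaker tag such as `#Person1#:`. When exploring the data it can help to split a dialogue into (speaker, utterance) pairs. The sketch below does that with a regular expression; the helper name `split_turns` and the pattern are assumptions for illustration, and the two turns are copied from the sample above.

```python
import re

# A turn starts with a tag like "#Person1#:" followed by the utterance.
TURN_PATTERN = re.compile(r"^(#Person\d*#):\s*(.*)$")

def split_turns(dialogue: str) -> list[tuple[str, str]]:
    """Split a dialogue string into (speaker, utterance) pairs."""
    turns = []
    for line in dialogue.splitlines():
        line = line.strip()
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns and line:
            # A line without a speaker tag continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line)
    return turns

dialogue = (
    "#Person2#: 그렇다면 우리가 노동 직원에게 퇴근을 일찍 허용한다면, 얼마나 많은 사람들이 퇴근하게 되나요?\n"
    "#Person1#: 우리 회사에서는 노동 직원이 전체 직원의 60%를 차지합니다. 우리는 단지 절반 조금 이상의 사람들에게 조금 일찍 퇴근하게 해주는 것일 뿐입니다."
)

turns = split_turns(dialogue)
for speaker, text in turns:
    print(speaker, "->", text[:30])
```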


In [30]:
!pip install -q -U transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the tokenizer that belongs to model_id (beomi/gemma-ko-2b)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [8]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")
min_source_length = min([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Min source length: {min_source_length}")


# The maximum total sequence length for target text after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["val"]]).map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")
min_target_length = min([len(x) for x in tokenized_targets["input_ids"]])
print(f"Min target length: {min_target_length}")

Max source length: 1630
Min source length: 63
Max target length: 291
Min target length: 13
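The maximum lengths above are set by the single longest example (1630 source tokens), so padding every input to that length spends memory on padding for most examples. One alternative, sketched here under the assumption that truncating a few outliers is acceptable, is to pick a high percentile of the length distribution instead of the maximum. The helper name `length_at_percentile` and the list of lengths are made up for illustration; in the notebook the lengths would come from `tokenized_inputs["input_ids"]`.

```python
import statistics

def length_at_percentile(lengths, percentile):
    """Return the length at the given percentile (1 to 99) of the examples."""
    # statistics.quantiles with n=100 returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return int(cut_points[percentile - 1])

# Hypothetical token lengths, not taken from the real dataset.
example_lengths = [63, 120, 180, 210, 250, 300, 340, 410, 520, 1630]

cutoff = length_at_percentile(example_lengths, 90)
covered = sum(length <= cutoff for length in example_lengths)
print(f"90th percentile length: {cutoff}")
print(f"Examples that fit without truncation: {covered}/{len(example_lengths)}")
```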


In [7]:
special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
for token in special_tokens:
    if token in tokenizer.get_vocab():
        print(f"'{token}' is already in the vocabulary.")
    else:
        print(f"'{token}' is not in the vocabulary.")


'#CarNumber#' is not in the vocabulary.
'#SSN#' is not in the vocabulary.
'#PhoneNumber#' is not in the vocabulary.
'#PassportNumber#' is not in the vocabulary.
'#Email#' is not in the vocabulary.
'#CardNumber#' is not in the vocabulary.
'#Address#' is not in the vocabulary.
'#DateOfBirth#' is not in the vocabulary.
'#Person4#' is not in the vocabulary.
'#Person7#' is not in the vocabulary.
'#Person3#' is not in the vocabulary.
'#Person2#' is not in the vocabulary.
'#Person#' is not in the vocabulary.
'#Person6#' is not in the vocabulary.
'#Person5#' is not in the vocabulary.
'#Person1#' is not in the vocabulary.


In [9]:
original_vocab_size = len(tokenizer)

special_tokens = ['#CarNumber#', '#SSN#', '#PhoneNumber#', '#PassportNumber#', '#Email#', '#CardNumber#', '#Address#', '#DateOfBirth#', \
'#Person4#', '#Person7#', '#Person3#', '#Person2#', '#Person#', '#Person6#', '#Person5#', '#Person1#']
#tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
tokenizer.add_tokens(special_tokens)
new_vocab_size = len(tokenizer)

print(f"Original vocab size: {original_vocab_size}")
print(f"New vocab size: {new_vocab_size}")

# Original vocab size: 256000
# New vocab size: 256016

Original vocab size: 256000
New vocab size: 256016
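The placeholder list above is hard-coded. A way to double-check it against the data is to collect every `#...#` placeholder that actually occurs in the text. The sketch below assumes placeholders consist of letters with an optional trailing number between two `#` characters; the helper name `count_placeholders` is hypothetical, and the third sample line is invented for illustration.

```python
import re
from collections import Counter

# Assumed placeholder format: '#', letters, optional digits, '#'.
PLACEHOLDER_PATTERN = re.compile(r"#[A-Za-z]+\d*#")

def count_placeholders(texts):
    """Count how often each placeholder appears across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(PLACEHOLDER_PATTERN.findall(text))
    return counts

texts = [
    "#Person1#: 안녕하세요, 스미스씨. 저는 호킨스 의사입니다. 오늘 왜 오셨나요?",
    "#Person2#: 건강검진을 받는 것이 좋을 것 같아서요.",
    "#Person1#: 제 번호는 #PhoneNumber#입니다.",  # hypothetical line, not from the dataset
]

placeholder_counts = count_placeholders(texts)
for token, count in placeholder_counts.most_common():
    print(token, count)
```

Applied to `dataset['train']['dialogue']`, the keys of the resulting counter could be compared with `special_tokens` to spot missing or unused entries.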


## Add dataset words to the tokenizer vocabulary

In [10]:
import pandas as pd

# Get the original vocabulary size
original_vocab_size = len(tokenizer)
print(f"Original vocab size: {original_vocab_size}")

# Define a function to extract unique words from text
def extract_unique_words(dataset_column):
    unique_words = set()
    for sentence in dataset_column:
        words = sentence.split()  # Simple split, adjust with tokenizer if needed
        unique_words.update(words)
    return unique_words

# Step 1: Extract unique words from the dataset
unique_words_train_dialogue = extract_unique_words(dataset['train']['dialogue'])
unique_words_train_summary = extract_unique_words(dataset['train']['summary'])
unique_words_val_dialogue = extract_unique_words(dataset['val']['dialogue'])
unique_words_val_summary = extract_unique_words(dataset['val']['summary'])

# Step 2: Extract unique words from the test set
test = pd.read_csv('/data/ephemeral/home/data/test.csv')
unique_words_test_dialogue = extract_unique_words(test['dialogue'])

# Combine all unique words
all_unique_words = unique_words_train_dialogue | unique_words_train_summary | unique_words_val_dialogue | unique_words_val_summary | unique_words_test_dialogue

# Step 3: Add these unique words to the tokenizer vocabulary
tokenizer.add_tokens(list(set(all_unique_words)))

# Step 4: Check the new vocabulary size
new_vocab_size = len(tokenizer)
print(f"New vocab size: {new_vocab_size}")

# Original vocab size: 256016
# New vocab size: 391363

Original vocab size: 256016
New vocab size: 391363
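Growing the vocabulary from 256016 to 391363 entries also grows the model: every added token needs its own row in the embedding matrix once `resize_token_embeddings` is called. The arithmetic below is a rough sketch of that cost. The hidden size of 2048 is an assumption based on the Gemma 2B configuration and should be read from `model.config.hidden_size` rather than trusted here.

```python
original_vocab_size = 256_016  # printed above
new_vocab_size = 391_363       # printed above
hidden_size = 2048             # assumption; check model.config.hidden_size

added_tokens = new_vocab_size - original_vocab_size
added_parameters = added_tokens * hidden_size
added_gib_fp32 = added_parameters * 4 / 1024**3  # 4 bytes per fp32 value

print(f"Added tokens: {added_tokens}")
print(f"Added embedding parameters: {added_parameters}")
print(f"Extra fp32 weight memory: {added_gib_fp32:.2f} GiB")
```

The new rows are freshly initialised and have to be learned from the fine-tuning data, so it may be worth filtering the word list (for example by frequency) before adding it.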


### Tokenizer check: pad and bos differ from the original model

In [11]:
# Check that the tokenizer works as expected
# Define a test sentence
sentence = dataset["train"]['dialogue'][0]


# Encode the sentence using the tokenizer (returns plain Python lists)
sentence_encoded = tokenizer(sentence, 
                             max_length=max_source_length, 
                             padding="max_length", 
                             truncation=True, 
                             add_special_tokens=True)

# Decode the encoded sentence, skipping special tokens.
# decode() only needs the ids; the length and padding options apply to encoding.
sentence_decoded = tokenizer.decode(
        sentence_encoded["input_ids"], 
        skip_special_tokens=True
    )

# Print SENTENCE
print('SENTENCE:')
print(sentence)

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

SENTENCE:
#Person1#: 안녕하세요, 스미스씨. 저는 호킨스 의사입니다. 오늘 왜 오셨나요?
#Person2#: 건강검진을 받는 것이 좋을 것 같아서요.
#Person1#: 그렇군요, 당신은 5년 동안 건강검진을 받지 않았습니다. 매년 받아야 합니다.
#Person2#: 알고 있습니다. 하지만 아무 문제가 없다면 왜 의사를 만나러 가야 하나요?
#Person1#: 심각한 질병을 피하는 가장 좋은 방법은 이를 조기에 발견하는 것입니다. 그러니 당신의 건강을 위해 최소한 매년 한 번은 오세요.
#Person2#: 알겠습니다.
#Person1#: 여기 보세요. 당신의 눈과 귀는 괜찮아 보입니다. 깊게 숨을 들이쉬세요. 스미스씨, 담배 피우시나요?
#Person2#: 네.
#Person1#: 당신도 알다시피, 담배는 폐암과 심장병의 주요 원인입니다. 정말로 끊으셔야 합니다. 
#Person2#: 수백 번 시도했지만, 습관을 버리는 것이 어렵습니다.
#Person1#: 우리는 도움이 될 수 있는 수업과 약물들을 제공하고 있습니다. 나가기 전에 더 많은 정보를 드리겠습니다.
#Person2#: 알겠습니다, 감사합니다, 의사선생님.

ENCODED SENTENCE:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

## Summarizing Using Prompt Engineering

### Applying Zero Shot Inference

In [15]:
# zero shot
from transformers import AutoModelForCausalLM

# load model from the hub
#model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.resize_token_embeddings(len(tokenizer))


# Define a test sentence
sentence = dataset["train"]['dialogue'][0]
golden = dataset["train"]['summary'][0]

instruction = f"""다음 대화를 한국어로 요약해줘:\n{sentence}
"""
#instruction = ["Please summarize the conversation by clearly stating what each speaker did or said. : " +sentence]
# instruction = ["In this '#Person1#: Hello, Mr. Smith. I'm Dr. Hawkins.' dialogue, the speaker is #Person1#. \
#     Summarize the conversation with a focus on the speakers, ensuring that each speaker's name or identifier, such as #Person1#, is accurately used as the subject in the summary. : " + sentence]
# Encode the sentence using the tokenizer, returning PyTorch tensors
sentence_encoded = tokenizer(instruction, 
                             max_length=max_source_length, 
                             padding="longest", 
                             truncation=True, 
                             add_special_tokens=True,
                             return_tensors="pt")  # Ensure tensors are returned for model input

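# Generate the summary using the model.
# For a decoder-only model, max_length counts the prompt tokens too, so the
# budget for the summary is set with max_new_tokens / min_new_tokens instead.
summary_ids = model.generate(
    **sentence_encoded,  # passes input_ids and attention_mask
    max_new_tokens=max_target_length, 
    min_new_tokens=40, 
    num_beams=5,  # Optional: control the generation strategy
    early_stopping=True,  # Optional: stop early when all beams are finished
    no_repeat_ngram_size=2
)

# Decode only the newly generated tokens (the output starts with the prompt)
prompt_length = sentence_encoded["input_ids"].shape[1]
sentence_decoded = tokenizer.decode(
    summary_ids[0][prompt_length:],  # Select the first (and usually only) sequence generated
    skip_special_tokens=True  # Skip special tokens in the final output
    )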

# Print the encoded sentence's representation
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"])

# Print the decoded sentence
print('\nDECODED SENTENCE:')
print(sentence_decoded)

# Print the reference (golden) summary
print('\nGOLDEN:')
print(golden)

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


ENCODED SENTENCE:
tensor([[     2, 320263, 235248, 331041, 235248, 377723, 235248, 379340, 244669,
         235292,    108, 258768, 235248, 258599, 235248, 303493, 235248, 367847,
         235248, 297359, 235248, 333742, 235248, 295054, 235248, 242940, 235248,
         369132,    108, 344504, 235248, 358333, 235248, 355211, 235248, 376036,
         235248, 307995, 235248, 237506, 235248, 382916,    108, 258768, 235248,
         351656, 235248, 381835, 235248, 368284, 235248, 284944, 235248, 358333,
         235248, 375494, 235248, 300594, 235248, 320935, 235248, 319707, 235248,
         356875,    108, 344504, 235248, 269735, 235248, 297098, 235248, 374301,
         235248, 256545, 235248, 344577, 235248, 305146, 235248, 242940, 235248,
         360508, 235248, 296148, 235248, 346427, 235248, 302345,    108, 258768,
         235248, 343990, 235248, 274558, 235248, 330265, 235248, 299728, 235248,
         328214, 235248, 309287, 235248, 327889, 235248, 312465, 235248, 350099,
         

## 3. Fine-tune and evaluate the model

After we have processed our dataset, we can start training our model. We first need to load the model from the Hugging Face Hub. This notebook uses `beomi/gemma-ko-2b` (the `model_id` set above) on a single NVIDIA GeForce RTX 3090, rather than the `base` version of [FLAN-T5](https://huggingface.co/models?search=flan-t5) on an NVIDIA V100 that this section was originally written for.
_I plan to do a follow-up post on how to fine-tune the `xxl` version of the model using Deepspeed._


In [54]:
dataset

DatasetDict({
    train: Dataset({
        features: ['fname', 'dialogue', 'summary', 'topic'],
        num_rows: 12457
    })
    val: Dataset({
        features: ['fname', 'dialogue', 'summary', 'topic'],
        num_rows: 499
    })
})

In [11]:
def preprocess_function(sample, padding="max_length"):
    
    # Build the model input by combining the prompt with each dialogue.
    # The prompt "대화를 요약해줘." means "Summarize the conversation."
    inputs = ["대화를 요약해줘.\n" + item for item in sample["dialogue"]]

    # Allow for the extra prompt tokens when setting max_length
    model_inputs = tokenizer(inputs, max_length=max_source_length + 20, padding=padding, truncation=True, add_special_tokens=True)

    # Tokenize the targets (summary) as well
    labels = tokenizer(text_target=sample["summary"], 
                        max_length=max_target_length, 
                        padding=padding, 
                        truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore padding in the loss.
    if padding == "max_length":
        
        if isinstance(labels["input_ids"][0], list):  # Check if it is a list of lists
            print(f'labels["input_ids"][0]: {labels["input_ids"][0]}')
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]
        else:  # Handle single instance case
            print(f'labels["input_ids"]: {labels["input_ids"]}')
            labels["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in labels["input_ids"]]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Apply the preprocessing function to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=['fname', 'dialogue', 'summary', 'topic'])

# Print the keys of the processed dataset
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


Map:   0%|          | 0/12457 [00:00<?, ? examples/s]

labels["input_ids"][0]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 316767, 235248, 362221, 235248, 382484, 235248, 376365, 235248, 332268, 235248, 357390, 235248, 373660, 235248, 362221, 235248, 375618, 235248, 293122, 235248, 333682, 235248, 332268, 235248, 357390, 235248, 316767, 235248, 374067, 235248, 373999, 235248, 23826

Map:   0%|          | 0/499 [00:00<?, ? examples/s]

labels["input_ids"][0]: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 260093, 235248, 340666, 235248, 384692, 235248, 323829, 235248, 357390, 235248, 371184, 235248, 261684, 235248, 371854, 235248, 303214, 235248, 363986, 235248, 242602, 235248, 352113, 23
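`preprocess_function` tokenizes inputs and labels separately, which is the layout used for encoder-decoder (seq2seq) models. A decoder-only model such as the one loaded with `AutoModelForCausalLM` is commonly trained on a single sequence that contains the prompt followed by the target, with the prompt positions masked out of the loss. The sketch below shows that layout with toy token ids. It is not the notebook's method; the function name `build_causal_lm_example` and all ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value ignored by the PyTorch cross-entropy loss

def build_causal_lm_example(prompt_ids, target_ids, eos_token_id, max_length):
    """Concatenate prompt and target ids and mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(target_ids) + [eos_token_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids) + [eos_token_id]
    # Truncate both lists the same way so they stay aligned position by position.
    input_ids = input_ids[:max_length]
    labels = labels[:max_length]
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy ids standing in for a tokenized prompt+dialogue and a tokenized summary.
example = build_causal_lm_example(
    prompt_ids=[2, 11, 12, 13],
    target_ids=[21, 22],
    eos_token_id=1,
    max_length=16,
)
print(example["input_ids"])
print(example["labels"])
```

Note that right-truncation in this sketch cuts into the summary when the prompt is long, so in practice the dialogue would usually be shortened first to leave room for the target.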

In [12]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    
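    # Convert to an integer array and clip ids to the tokenizer's full vocabulary,
    # including added tokens (tokenizer.vocab_size excludes them, len(tokenizer) does not)
    preds = np.array(preds, dtype=np.int64)
    preds = np.clip(preds, 0, len(tokenizer) - 1)
    print(preds)
  
    # Decode token ids to text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    print(decoded_preds)
    # Replace -100 in the labels with the pad token id
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Put each sentence on its own line, as rougeLsum expects
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute the metrics and return the result
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)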
    result = {k: round(v * 100, 4) for k, v in result.items()}
    
    # prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    # result["gen_len"] = np.mean(prediction_lens)
    return result

[nltk_data] Downloading package punkt to
[nltk_data]     /data/ephemeral/home/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
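To make the numbers returned by `compute_metrics` easier to interpret, the sketch below computes a simplified ROUGE-1 F1 score by hand: the unigram overlap between a prediction and a reference, combined into precision, recall and F1. It uses plain whitespace tokenization, which is an assumption and differs from the `rouge` metric's own tokenization and stemming options, so the values will not match the library exactly.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on whitespace-separated unigrams."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = rouge1_f1(prediction, reference)
print(f"ROUGE-1 F1: {score:.4f}")
```

Because tokenization decides what counts as a matching unigram, it may be worth checking how the metric's default tokenizer handles Korean text before relying on the scores.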


In [16]:
from transformers import DataCollatorForSeq2Seq, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_id)
model.resize_token_embeddings(len(tokenizer))

# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8
)


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [48]:
!wandb login

In [17]:
import torch
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, EarlyStoppingCallback

# Check whether a GPU is available and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Hugging Face repository id
repository_id = f"{model_id.split('/')[1]}-{dataset_id}"

# Define training args with additional parameters
training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    num_train_epochs=20,  # train for up to 20 epochs
    learning_rate=1e-5,  # learning rate
    per_device_train_batch_size=1,  # training batch size per device
    per_device_eval_batch_size=1,  # evaluation batch size per device
    warmup_ratio=0.1,  # warmup ratio (not used when warmup_steps > 0)
    warmup_steps=500,  # number of warmup steps; takes precedence over warmup_ratio
    weight_decay=0.01,  # weight decay
    lr_scheduler_type='cosine',  # cosine learning-rate schedule
    optim='adamw_torch',  # optimizer: AdamW
    gradient_accumulation_steps=16,  # gradient accumulation steps
    evaluation_strategy='epoch',  # evaluate every epoch
    save_strategy='epoch',  # save every epoch
    save_total_limit=5,  # keep at most 5 checkpoints
    fp16=True,  # enable mixed-precision training; note: True can cause overflow
    load_best_model_at_end=True,  # load the best model at the end of training
    seed=42,  # seed for reproducibility
    logging_dir="./logs",  # log directory
    logging_strategy="epoch",  # log every epoch
    predict_with_generate=True,  # use generate() during evaluation
    # generation_max_length=max_target_length,  # maximum generation length
)

# Create Trainer instance with early stopping callback
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["val"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,  # stop after 3 evaluations without improvement
            early_stopping_threshold=0.001  # minimum improvement that counts as progress
        )
    ]
)

# Start training
trainer.train()
# 20 epochs: about 13 hours expected
# 1 epoch: about 0.69 hours expected

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mthebestday[0m ([33mthebestdayor[0m). Use [1m`wandb login --relogin`[0m to force relogin


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.49 GiB. GPU 0 has a total capacty of 23.69 GiB of which 1.02 GiB is free. Process 3252996 has 22.66 GiB memory in use. Of the allocated memory 22.24 GiB is allocated by PyTorch, and 115.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
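A back-of-the-envelope estimate helps explain the out-of-memory error above. With full fine-tuning and AdamW, each parameter typically needs about 16 bytes: 4 for the fp32 weight, 4 for its gradient and 8 for the two optimizer moments. The sketch below applies that rule of thumb. The parameter count of roughly 2.5 billion is an assumption for a 2B-class Gemma checkpoint, the helper name `training_state_gib` is hypothetical, and activations are ignored entirely.

```python
def training_state_gib(num_parameters, bytes_per_parameter=16):
    """Estimate memory for weights, gradients and AdamW state, ignoring activations."""
    return num_parameters * bytes_per_parameter / 1024**3

# Assumption: roughly 2.5 billion parameters; count the real value with
# sum(p.numel() for p in model.parameters()).
approx_parameters = 2_500_000_000
gpu_total_gib = 23.69  # total capacity reported in the error message above

estimate = training_state_gib(approx_parameters)
print(f"Estimated training state: {estimate:.1f} GiB")
print(f"GPU capacity:             {gpu_total_gib:.2f} GiB")
print(f"Fits on the GPU: {estimate < gpu_total_gib}")
```

Under these assumptions the training state alone exceeds the card's memory before any activations are stored, so a smaller batch or shorter sequences would not be enough on their own.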

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 819
  Batch size = 8


{'eval_loss': 1.3715944290161133,
 'eval_rouge1': 47.2358,
 'eval_rouge2': 23.5135,
 'eval_rougeL': 39.6266,
 'eval_rougeLsum': 43.3458,
 'eval_gen_len': 17.39072039072039,
 'eval_runtime': 108.99,
 'eval_samples_per_second': 7.514,
 'eval_steps_per_second': 0.945,
 'epoch': 5.0}

In [None]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

## 4. Run Inference

Now that we have a trained model, we can use it to run inference. We will use the `pipeline` API from transformers and a `test` example from our dataset.

In [None]:
from transformers import pipeline
from random import randrange        

# load model and tokenizer from huggingface hub with pipeline
summarizer = pipeline("summarization", model="philschmid/flan-t5-base-samsum", device=0)

# select a random test sample
sample = dataset['test'][randrange(len(dataset["test"]))]
print(f"dialogue: \n{sample['dialogue']}\n---------------")

# summarize dialogue
res = summarizer(sample["dialogue"])

print(f"flan-t5-base summary:\n{res[0]['summary_text']}")

dialogue: 
Abby: Have you talked to Miro?
Dylan: No, not really, I've never had an opportunity
Brandon: me neither, but he seems a nice guy
Brenda: you met him yesterday at the party?
Abby: yes, he's so interesting
Abby: told me the story of his father coming from Albania to the US in the early 1990s
Dylan: really, I had no idea he is Albanian
Abby: he is, he speaks only Albanian with his parents
Dylan: fascinating, where does he come from in Albania?
Abby: from the seacoast
Abby: Duress I believe, he told me they are not from Tirana
Dylan: what else did he tell you?
Abby: That they left kind of illegally
Abby: it was a big mess and extreme poverty everywhere
Abby: then suddenly the border was open and they just left 
Abby: people were boarding available ships, whatever, just to get out of there
Abby: he showed me some pictures, like <file_photo>
Dylan: insane
Abby: yes, and his father was among the people
Dylan: scary but interesting
Abby: very!
---------------
flan-t5-base summary:
A