# Research in a Flash: Gemma Sprint
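As the final activity (a completion requirement) of the Google Machine Learning Bootcamp, I carried out a project to train and fine-tune the Gemma 2 model. The topic is "Research in a Flash". I had recently been reading many papers, such as Sign2GPT, and wondered whether I could build a model that summarizes computer science papers more clearly, which led to this project. The project description I wrote on the dashboard is as follows.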


## Project Description
This project aims to leverage the power of the Gemma 2b model to create a specialized tool for academic paper summarization. By fine-tuning Gemma on a carefully curated dataset of scientific articles and their corresponding abstracts, we'll develop a model capable of distilling complex academic content into concise, informative summaries. This tool will assist researchers, students, and academics by providing quick insights into extensive research papers, saving time and effort in literature reviews and study preparation. The model will be implemented using Hugging Face's Transformers library and will be optimized for deployment on various platforms, making it accessible for desktop and cloud environments. By focusing on key domains such as medicine, computer science, and social sciences, the project will ensure that the summarization model is versatile and applicable across a wide range of academic fields.

## Project Objectives

- Fine-tune the Gemma 2b model for summarizing scientific papers.
- Filter the dataset for computer science papers to optimize training time.
- Deploy the model on Hugging Face for easy accessibility.

In [None]:
!pip3 install -q -U transformers
!pip3 install -q -U datasets
!pip3 install -q -U bitsandbytes
!pip3 install -q -U peft
!pip3 install -q -U trl
!pip3 install -q -U accelerate


In [None]:
import torch
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TrainingArguments
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from datasets import load_dataset

data = load_dataset("abisee/cnn_dailymail", "3.0.0")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
# Build prompts that pair each source article with its summary
def generate_prompt(example):
    prompt_list = []

    for i in range(len(example)):
        # Access the original text (article) and the summary text (highlights) from the dataset
        original_text = example[i]['article']  # source text
        summary_text = example[i]['highlights']  # summary text

        # Format the pair with the Gemma turn markers and append it to the list.
        # The Korean instruction means "Please summarize the following text:".
        prompt_list.append(r"""<bos><start_of_turn>user
다음 글을 요약해주세요:

{}
<end_of_turn>
<start_of_turn>model
{}<end_of_turn><eos>""".format(original_text, summary_text))

    return prompt_list
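
To make the template easier to inspect, here is a minimal sketch that formats a single article/summary pair with the same turn markers, plus a variant that stops right after the model turn opener, as one would want at inference time. The helper names `build_training_prompt` and `build_inference_prompt` and the sample strings are hypothetical and not part of the notebook.

```python
# Hypothetical helpers; the markers mirror the template used in generate_prompt.
INSTRUCTION = "다음 글을 요약해주세요:"  # "Please summarize the following text:"


def build_inference_prompt(article: str) -> str:
    """Prompt up to the point where the model should start writing."""
    return (
        "<bos><start_of_turn>user\n"
        f"{INSTRUCTION}\n\n{article}\n<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


def build_training_prompt(article: str, summary: str) -> str:
    """Full training text: the inference prompt followed by the target summary."""
    return build_inference_prompt(article) + f"{summary}<end_of_turn><eos>"


# Made-up sample strings, only to show the resulting layout.
print(build_training_prompt("Some article text.", "A short summary."))
```

Keeping the inference prompt as a strict prefix of the training prompt means the model sees the same context at generation time as it did during fine-tuning.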

In [None]:
import random
train_data = data['train']  # access the train split
train_prompts = generate_prompt(train_data)  # build prompts from the train split
train_prompts = random.sample(train_prompts, 10000)  # keep only 10,000 samples
print(train_prompts[0])  # print the first prompt

validation_data = data['validation']
validation_prompts = generate_prompt(validation_data)
validation_prompts = random.sample(validation_prompts, 1000)
print(validation_prompts[0])

# test_data = data['test']
# test_prompts = generate_prompt(test_data)
# print(test_prompts[0])

<bos><start_of_turn>user
다음 글을 요약해주세요:

LONDON, England (CNN) -- A medical ailment that has worried male members of string sections across the music world for over 30 years has been exposed as a hoax. Male cellists of the world can breathe easy again. A senior British lawmaker confessed to making up the condition known as "cello scrotum" -- which relates to chafing from the instrument -- after reading about another musically-related ailment called "guitarist's nipple" in the British Medical Journal in 1974. Elaine Murphy, who is a member of The House of Lords and a trained doctor, came clean about the prank she devised with husband John in a letter to the BMJ published on Wednesday. She said: "Perhaps after 34 years it's time for us to confess that we invented cello scrotum. "Reading (Dr) Curtis's 1974 letter to the BMJ on guitar nipple, we thought it highly likely to be a spoof and decided to go one further by submitting a letter pretending to have noted a similar phenomenon in cellis

In [None]:
print(len(train_prompts))
print(len(validation_prompts))
# print(len(test_prompts))

10000
1000
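
The sampling cell above formats every row of a split before drawing 10,000 (or 1,000) prompts, and it does so without a fixed seed. A lighter, reproducible alternative is to draw row indices first and format only those rows. The sketch below assumes rows can be addressed by integer index; the function name `sample_indices` and the seed value are illustrative.

```python
import random


def sample_indices(dataset_size: int, sample_size: int, seed: int = 42) -> list[int]:
    """Draw distinct row indices reproducibly, without touching the rows themselves."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(range(dataset_size), sample_size)


# Split sizes as reported when the dataset was loaded.
train_indices = sample_indices(287113, 10000)
validation_indices = sample_indices(13368, 1000)
print(len(train_indices), len(validation_indices))
```

With a Hugging Face `datasets.Dataset`, `data['train'].select(train_indices)` would then return only the chosen rows, which can be passed to `generate_prompt`.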

In [None]:
lora_config = LoraConfig(
    r=6,
    lora_alpha = 8,
    lora_dropout = 0.05,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
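
For a sense of scale, LoRA adds two small matrices to each targeted linear layer: `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`, which comes to `r * (d_in + d_out)` extra trainable weights, and the low-rank update is scaled by `lora_alpha / r`. The sketch below applies that formula to the seven target modules. The layer shapes and the layer count are assumptions about the Gemma 2 2B architecture and should be checked against the model's `config.json`.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


R, LORA_ALPHA = 6, 8  # values from lora_config

# ASSUMED (d_in, d_out) shapes per decoder layer; verify against config.json.
ASSUMED_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
ASSUMED_NUM_LAYERS = 26  # also an assumption

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA weights: {total:,}")
print(f"scaling factor: {scaling:.3f}")
```

Once a model is wrapped with PEFT, `print_trainable_parameters()` on the `PeftModel` reports the actual count, which is the number to trust over this estimate.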

In [None]:
base_model = "google/gemma-2-2b-it"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.padding_side = "right"


In [None]:
# Wrap the prompt lists in a torch Dataset
# (aliased so it does not shadow datasets.Dataset imported earlier)
from torch.utils.data import Dataset as TorchDataset

class PromptDataset(TorchDataset):
    def __init__(self, prompts_list, tokenizer, max_seq_length):
        self.prompts = prompts_list
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        prompt = self.prompts[idx]
        inputs = self.tokenizer(
            prompt,
            max_length=self.max_seq_length,
            padding='max_length',
            truncation=True,
            return_tensors="pt"
        )
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
        }

# Build the train and validation datasets; each prompt is tokenized on access
train_dataset = PromptDataset(train_prompts, tokenizer, max_seq_length=512)
validation_dataset = PromptDataset(validation_prompts, tokenizer, max_seq_length=512)


# Check the dataset lengths
print(len(train_dataset))
print(len(validation_dataset))

10000
1000
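
What `padding='max_length'` and `truncation=True` do inside `__getitem__` can be illustrated without a tokenizer. This sketch works on a plain list of token IDs; the function name `pad_or_truncate` and the pad ID of 0 are assumptions for illustration, and right-side padding matches `tokenizer.padding_side = "right"` set earlier.

```python
def pad_or_truncate(token_ids: list[int], max_length: int, pad_id: int = 0) -> dict:
    """Return fixed-length input_ids and the matching attention_mask (right padding)."""
    kept = token_ids[:max_length]   # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)  # padding needed to reach max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding positions
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
overlong = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)
print(overlong)
```

Because truncation cuts from the right and the summary sits at the end of each prompt, articles longer than the 512-token limit may lose part or all of their summary. It is worth checking the token-length distribution of the prompts before settling on `max_seq_length`.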


In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="outputs",
        num_train_epochs = 1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",
        warmup_steps=0,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=100,
        push_to_hub=False,
        report_to='none',
        save_strategy="epoch",
        evaluation_strategy="epoch",
    ),
    peft_config=lora_config,
    formatting_func=generate_prompt,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


In [None]:
# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.0833        | 2.095336        |


TrainOutput(global_step=2500, training_loss=2.1130770263671876, metrics={'train_runtime': 1355.4973, 'train_samples_per_second': 7.377, 'train_steps_per_second': 1.844, 'total_flos': 6.243242213376e+16, 'train_loss': 2.1130770263671876, 'epoch': 1.0})
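
The numbers in `TrainOutput` can be cross-checked with a little arithmetic: 10,000 samples at a per-device batch size of 4 with no gradient accumulation gives the reported 2,500 steps (assuming a single device), and the validation loss converts to perplexity via `exp(loss)`. A quick sketch:

```python
import math

num_samples = 10000
batch_size = 4              # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
num_devices = 1             # assumption: a single GPU
train_runtime = 1355.4973   # seconds, from TrainOutput
validation_loss = 2.095336  # from the evaluation at the end of epoch 1

steps = math.ceil(num_samples / (batch_size * grad_accum * num_devices))
samples_per_second = num_samples / train_runtime
steps_per_second = steps / train_runtime
perplexity = math.exp(validation_loss)  # mean cross-entropy in nats -> perplexity

print(steps, round(samples_per_second, 3), round(steps_per_second, 3), round(perplexity, 2))
```

The step count and both throughput figures agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.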

In [None]:
base_model = "google/gemma-2-2b-it"
adapter_model = "lora_adapter"

# Save the trained LoRA adapter
trainer.model.save_pretrained(adapter_model)

# Reload the original pretrained base model
model = AutoModelForCausalLM.from_pretrained(base_model, device_map='auto', torch_dtype=torch.float16)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, adapter_model, device_map='auto', torch_dtype=torch.float16)

# Merge the adapter weights into the base model
model = model.merge_and_unload()

# Save the final model
model.save_pretrained('gemma-2-2b-it-research-in-a-flash')
tokenizer.save_pretrained('gemma-2-2b-it-research-in-a-flash')

model.push_to_hub("gemma-2-2b-it-research-in-a-flash")
tokenizer.push_to_hub("gemma-2-2b-it-research-in-a-flash")


CommitInfo(commit_url='https://huggingface.co/dwhouse/gemma-2-2b-it-research-in-a-flash/commit/b53d66fc1b40db53f52d532e3226a5d6b7d638fa', commit_message='Upload tokenizer', commit_description='', oid='b53d66fc1b40db53f52d532e3226a5d6b7d638fa', pr_url=None, pr_revision=None, pr_num=None)
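
Conceptually, `merge_and_unload()` folds each adapter into its base weight so the saved model no longer needs PEFT at inference: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below checks that identity with plain Python lists. The matrices and helper names are made up for illustration, and this is not a description of how PEFT implements the merge internally.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y, scale=1.0):
    """Element-wise x + scale * y."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


r, lora_alpha = 1, 2
scaling = lora_alpha / r

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (d_out, d_in)
A = [[1.0, 2.0]]              # LoRA A, shape (r, d_in)
B = [[0.5], [1.0]]            # LoRA B, shape (d_out, r)
x = [[3.0], [4.0]]            # one input column vector

# Adapter kept separate: base output plus the scaled low-rank path.
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter merged into the weight once, then a single matmul.
W_merged = add(W, matmul(B, A), scale=scaling)
merged = matmul(W_merged, x)

print(separate == merged)
```

Merging trades the flexibility of swapping adapters for a single set of weights that loads with plain `AutoModelForCausalLM.from_pretrained`, which is why the merged model is the one saved and pushed to the Hub.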