# 데이터셋 준비하기
- https://www.kaggle.com/code/mitanshuchakrawarty/fine-tune-llm-for-text-summary

## 1. 환경 셋업

In [95]:
!huggingface-cli login --token ""

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/ec2-user/SageMaker/.cache/token
Login successful


In [96]:
import os 
os.environ['TRANSFORMERS_CACHE'] = "/home/ec2-user/SageMaker/.cache" 
os.environ['HF_DATASETS_CACHE'] = "/home/ec2-user/SageMaker/.cache" 
os.environ['HF_HOME'] = "/home/ec2-user/SageMaker/.cache"

In [97]:
import torch
import time
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

from datasets import Dataset, load_dataset
from datasets import load_dataset, load_metric
from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import warnings
warnings.filterwarnings("ignore")

## 2. 데이터 셋 준비
### 데이터 셋 다운로드

In [98]:
huggingface_dataset_name = "cnn_dailymail"

dataset = load_dataset(huggingface_dataset_name, "3.0.0")
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [99]:
sample = dataset["train"][1]
print(f"""Article (excerpt of 500 characters, total length: {len(sample["article"])}):""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 4051):
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


## 3. 데이터셋 변형

### Chat Message 형태 템플릿 정의

In [100]:
import json

def format_instruction(system_prompt: str, article: str, summary: str):
    message = [
            {
                'content': system_prompt,
                'role': 'system'
            },
            {
                'content': f'Please summarize the goals for journalist in this text:\n\n{article}',
                'role': 'user'
            },
            {
                'content': f'{summary}',
                'role': 'assistant'
            }
        ]
    
    return message # json.dumps(message, indent=2) # json.dumps(message, ensure_ascii=False, indent=2)


# 사용 예시
# system_prompt = "You are an AI assistant specialized in news articles. Your role is to provide accurate summaries and insights. Please analyze the given text and provide concise, informative summaries that highlight the key goals and findings."
# article = "Within three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert's portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. \"We can't ask restoration ecologists to plant nonnative species or to just take their best guess and throw things out there,\" says Rinkert."
# summary = "Scientists are studying nests hoping to learn about transitional habitats that could help restore the shoreline of San Francisco Bay."

# print(format_instruction(system_prompt, article, summary))

### Chat Message 형태로 변환

In [101]:

def generate_instruction_dataset(data_point):
    system_prompt = "You are an AI assistant specialized in news articles.Your role is to provide accurate summaries and insights.Please analyze the given text and provide concise, informative summaries that highlight the key goals and findings."

    return {
        "messages": format_instruction(system_prompt, data_point["article"],data_point["highlights"])
    }

def process_dataset(data: Dataset):
    return (
        data.shuffle(seed=42)
        .map(generate_instruction_dataset).remove_columns(['article','highlights','id'])
    )    

##### 전체 데이터 셋에서 일부 데티터 추출 (짧은 실습을 위해서)

In [102]:
train_num_debug_samples = 10
test_num_debug_samples = 10

In [103]:
## APPLYING PREPROCESSING ON WHOLE DATASET

dataset["train"] = process_dataset(dataset["train"].select(range(train_num_debug_samples)))
dataset["test"] = process_dataset(dataset["validation"])
dataset["validation"] = process_dataset(dataset["validation"])

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [104]:
# Select 1000 rows from the training split
train_dataset = dataset['train'].shuffle(seed=42).select([i for i in range(train_num_debug_samples)])

# Select 100 rows from the test and validation splits
test_dataset = dataset['test'].shuffle(seed=42).select([i for i in range(test_num_debug_samples)])
validation_dataset = dataset['validation'].shuffle(seed=42).select([i for i in range(test_num_debug_samples)])


In [105]:
train_dataset,test_dataset,validation_dataset

(Dataset({
     features: ['messages'],
     num_rows: 10
 }),
 Dataset({
     features: ['messages'],
     num_rows: 10
 }),
 Dataset({
     features: ['messages'],
     num_rows: 10
 }))

In [106]:
train_dataset[0]

{'messages': [{'content': 'You are an AI assistant specialized in news articles.Your role is to provide accurate summaries and insights.Please analyze the given text and provide concise, informative summaries that highlight the key goals and findings.',
   'role': 'system'},
  {'content': 'Please summarize the goals for journalist in this text:\n\nMINNEAPOLIS, Minnesota (CNN) -- Drivers who were on the Minneapolis bridge when it collapsed told harrowing tales of survival. "The whole bridge from one side of the Mississippi to the other just completely gave way, fell all the way down," survivor Gary Babineau told CNN. "I probably had a 30-, 35-foot free fall. And there\'s cars in the water, there\'s cars on fire. The whole bridge is down." He said his back was injured but he determined he could move around. "I realized there was a school bus right next to me, and me and a couple of other guys went over and started lifting the kids off the bridge. They were yelling, screaming, bleeding. I

## 4. 데이터 셋을 JSON 으로 저장

In [107]:
import os

data_folder = os.path.join("../data/", huggingface_dataset_name)
os.makedirs(data_folder, exist_ok=True)

train_data_json = os.path.join(data_folder, "train_dataset.json")
validation_data_json = os.path.join(data_folder, "validation_dataset.json")
test_data_json = os.path.join(data_folder, "test_dataset.json")
print("train_data_json: ", train_data_json)
print("validation_data_json: ", validation_data_json)
print("test_data_json: ", test_data_json)

train_data_json:  ../data/cnn_dailymail/train_dataset.json
validation_data_json:  ../data/cnn_dailymail/validation_dataset.json
test_data_json:  ../data/cnn_dailymail/test_dataset.json


In [108]:
# save datasets to disk 
train_dataset.to_json(train_data_json, orient="records", force_ascii=False)
validation_dataset.to_json(validation_data_json, orient="records", force_ascii=False)
test_dataset.to_json(test_data_json, orient="records", force_ascii=False)

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

40884

In [109]:
import json

# 파일 경로
file_path = "../data/cnn_dailymail/train_dataset.json"

try:
    # 파일 열기 및 내용 읽기
    with open(file_path, 'r', encoding='utf-8') as file:
        # 파일의 전체 내용을 문자열로 읽음
        json_str = file.read()
        
        # JSON 파싱
        data = json.loads(json_str)
        
        # 첫 번째 레코드 출력 (데이터가 리스트인 경우)
        if isinstance(data, list) and len(data) > 0:
            first_record = data[0]
            print(json.dumps(first_record, indent=2, ensure_ascii=False))
        elif isinstance(data, dict):
            # 데이터가 딕셔너리인 경우 전체 출력
            print(json.dumps(data, indent=2, ensure_ascii=False))
        else:
            print("데이터가 비어있거나 예상치 못한 형식입니다.")

except FileNotFoundError:
    print(f"파일을 찾을 수 없습니다: {file_path}")
except json.JSONDecodeError as e:
    print(f"JSON 파싱 오류: {e}")
except Exception as e:
    print(f"오류 발생: {e}")

JSON 파싱 오류: Extra data: line 2 column 1 (char 4595)


### 다음 노트북에서 사용하기 위해 변수 저장

In [113]:
%store data_folder
%store train_data_json 
%store validation_data_json 
%store test_data_json 


Stored 'data_folder' (str)
Stored 'train_data_json' (str)
Stored 'validation_data_json' (str)
Stored 'test_data_json' (str)


## 5. Option: 데이터 셋을 ChatTemplate 형태로 바꾸기

### Chat Template 정의

In [110]:
LLAMA_3_CHAT_TEMPLATE = (
    "{% for message in messages %}"
        "{% if message['role'] == 'system' %}"
            "{{ message['content'] }}"
        "{% elif message['role'] == 'user' %}"
            "{{ '\n\nHuman: ' + message['content'] +  eos_token }}"
        "{% elif message['role'] == 'assistant' %}"
            "{{ '\n\nAssistant: '  + message['content'] +  eos_token  }}"
        "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '\n\nAssistant: ' }}"
    "{% endif %}"
)
LLAMA_3_CHAT_TEMPLATE

"{% for message in messages %}{% if message['role'] == 'system' %}{{ message['content'] }}{% elif message['role'] == 'user' %}{{ '\n\nHuman: ' + message['content'] +  eos_token }}{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: '  + message['content'] +  eos_token  }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}"

### Chat Template 으로 변형하기

In [111]:
# Tokenizer        
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = LLAMA_3_CHAT_TEMPLATE

# template dataset
def template_dataset(examples):
    return{"text":  tokenizer.apply_chat_template(examples["messages"], tokenize=False)}

train_dataset = train_dataset.map(template_dataset, remove_columns=["messages"])
test_dataset = test_dataset.map(template_dataset, remove_columns=["messages"])
validation_dataset = validation_dataset.map(template_dataset, remove_columns=["messages"])    

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

### 변형된 Chat Message 형태 예시 보기

In [112]:
# print random sample
import random

for index in random.sample(range(len(train_dataset)), 1):
    print("index: ", index)
    # index = 5343
    print(train_dataset[index]["text"])

index:  6
You are an AI assistant specialized in news articles.Your role is to provide accurate summaries and insights.Please analyze the given text and provide concise, informative summaries that highlight the key goals and findings.

Human: Please summarize the goals for journalist in this text:

WASHINGTON (CNN) -- Doctors removed five small polyps from President Bush's colon on Saturday, and "none appeared worrisome," a White House spokesman said. The polyps were removed and sent to the National Naval Medical Center in Bethesda, Maryland, for routine microscopic examination, spokesman Scott Stanzel said. Results are expected in two to three days. All were small, less than a centimeter [half an inch] in diameter, he said. Bush is in good humor, Stanzel said, and will resume his activities at Camp David. During the procedure Vice President Dick Cheney assumed presidential power. Bush reclaimed presidential power at 9:21 a.m. after about two hours. Doctors used "monitored anesthesia c