Note: The model used in this notebook works with English text.

# Text Summarization for Healthcare
## Part 1: Fine-tuning Flan-T5 locally in the notebook

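In this notebook, you will learn how to fine-tune Flan-T5 for a medical summarization task locally in the notebook.
Fine-tuning uses the MeQSum dataset, which contains three columns: id, text, and summary. First, the dataset is split into training, validation, and test sets.
For training, the text column is the input and the summary column is the target output. After training, the model generates summaries for the test set, which are then compared with the human-written ones.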
### MeQSum dataset
"On the Summarization of Consumer Health Questions". Asma Ben Abacha and Dina Demner-Fushman. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019.  
#### Citation
@Inproceedings{MeQSum,
author = {Asma {Ben Abacha} and Dina Demner-Fushman},
title = {On the Summarization of Consumer Health Questions},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28th - August 2},
year = {2019},
abstract = {Question understanding is one of the main challenges in question answering. In real world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization. }}



### Kernel and SageMaker setup
Use an ml.g4dn.2xlarge instance with the 'Data Science - Python3' kernel.

In [None]:
!pip install -q openpyxl==3.0.3 xlrd==1.2.0
!pip install -q torch==1.13.1 datasets==2.12.0 transformers==4.28.0 rouge-score==0.1.2 nltk==3.8.1 sentencepiece==0.1.99 evaluate==0.4.0

## 1. Preparing the dataset

In [None]:
import urllib.request
urllib.request.urlretrieve('https://github.com/abachaa/MeQSum/raw/master/MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx', 'MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx')
# Wait a few seconds for the Excel file to finish downloading into the folder.

In [None]:
import pandas as pd

# dataset from https://github.com/abachaa/MeQSum
df = pd.read_excel('MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx')
df = df.drop('File', axis=1)
df = df.rename(columns={'CHQ':'Text'})
df = df.dropna()
df['Text'] = df['Text'].apply(lambda x: x.lower())
df['Summary'] = df['Summary'].apply(lambda x: x.lower())
df['Id'] = range(0, len(df.index))
df = df[['Id', 'Text', 'Summary']]
# df = df.sample(frac=1).reset_index(drop=True) # Uncomment to shuffle the data.
df

In [None]:
import torch
import datasets
from datasets import Dataset, concatenate_datasets

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

import numpy as np
import evaluate

import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

In [None]:
model_checkpoint = 'google/flan-t5-small' # google/mt5-X for Japanese training. 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokenizer("Hello, Welcome to AWS!")

In [None]:
train = df[:700]
val = df[700:900]
test = df[900:]
print('train: {}, val: {}, test: {}'.format(train.shape, val.shape, test.shape))
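
The slices above take rows in file order. If a shuffled split is preferred (the commented-out `df.sample(frac=1)` line earlier hints at this), one option is to shuffle the row indices with a fixed seed before slicing. The following is a minimal sketch in plain Python; `split_indices` and its parameters are hypothetical names, not part of the notebook, and the 1,000-row count is only an example.

```python
import random

def split_indices(n_rows, n_train=700, n_val=200, seed=42):
    """Return shuffled, non-overlapping index lists for train/val/test.

    Hypothetical helper: the sizes mirror the 700/200/rest slices used
    in the notebook, and the fixed seed makes the shuffle reproducible.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

The resulting index lists could then be passed to `df.iloc[...]` to build the three DataFrames.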

In [None]:
# Create datasets from the DataFrames.
train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test)

In [None]:
# Set the maximum input and output lengths based on the dataset.
tokenized_inputs = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Text"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_input_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max input length: {max_input_length}")

tokenized_targets = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Summary"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")
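
One detail worth keeping in mind: the lengths above are measured on the raw `Text` column, whereas the preprocessing step later prepends the prefix `"summarize: "`. The toy sketch below illustrates how a prefix adds to the token count. It uses a whitespace tokenizer purely as a stand-in (an assumption; the real Flan-T5 tokenizer is SentencePiece-based and splits text differently), and `toy_tokenize` and `max_length` are hypothetical helpers.

```python
PREFIX = "summarize: "

def toy_tokenize(text):
    """Whitespace tokenizer for illustration only (an assumption;
    the real tokenizer produces subword tokens instead)."""
    return text.split()

def max_length(texts, prefix=""):
    """Longest token count over all texts, optionally with a prefix."""
    return max(len(toy_tokenize(prefix + t)) for t in texts)

texts = ["what are the side effects of aspirin", "is ibuprofen safe"]
without_prefix = max_length(texts)
with_prefix = max_length(texts, PREFIX)
print(without_prefix, with_prefix)  # 7 8
```

With the real tokenizer, measuring lengths on the prefixed text would likely avoid truncating the longest example by a few tokens.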

In [None]:
# Prompt template for summarization
def preprocess_function(sample, padding="max_length"):
    inputs = ["summarize: " + item for item in sample["Text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=padding, truncation=True)

    labels = tokenizer(text_target=sample["Summary"], max_length=max_target_length, padding=padding, truncation=True)

    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
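
The preprocessing function above replaces padding tokens in the labels with `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss. A minimal sketch of that masking step on toy ids follows; `mask_padding` is a hypothetical helper and the token ids are made up, with `0` standing in for the pad id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace every pad token id in the labels with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label ids padded to length 5; 0 plays the role of the pad token.
labels = [[71, 1712, 1, 0, 0], [9, 1, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)  # [[71, 1712, 1, -100, -100], [9, 1, -100, -100, -100]]
```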

In [None]:
# Tokenize the datasets.
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

print(f"Keys of tokenized dataset: {tokenized_train.features}")

In [None]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

## 2. Training with Hugging Face

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
# Parameters
batch_size = 4
label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=label_pad_token_id, pad_to_multiple_of=8)
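
The `pad_to_multiple_of=8` argument asks the collator to round the padded length of each batch up to a multiple of 8, and `label_pad_token_id=-100` is the value used to pad the labels. The sketch below shows only the arithmetic on toy label ids; it is not the collator's implementation, and `padded_length` and `pad_labels` are hypothetical helpers.

```python
import math

def padded_length(longest, multiple=8):
    """Round a sequence length up to the next multiple."""
    return math.ceil(longest / multiple) * multiple

def pad_labels(batch, multiple=8, label_pad_token_id=-100):
    """Right-pad every label sequence in the batch to a shared length."""
    target = padded_length(max(len(seq) for seq in batch), multiple)
    return [seq + [label_pad_token_id] * (target - len(seq)) for seq in batch]

batch = [[5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
padded = pad_labels(batch)
print([len(seq) for seq in padded])  # [16, 16]
```

Because the preprocessing step in this notebook already pads to a fixed maximum length, the collator's extra padding probably has little effect here.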

In [None]:
# Evaluation metric (ROUGE)
metric = evaluate.load("rouge")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
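
To give an intuition for the ROUGE scores reported by `compute_metrics`, the sketch below computes a simplified ROUGE-1 F1 from unigram overlap. It is a rough approximation under stated assumptions: it splits on whitespace and skips the stemming and tokenization rules of the `rouge` metric loaded above, so its numbers will not match that metric exactly. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("what are the side effects of aspirin",
                  "what are side effects of aspirin")
print(round(score * 100, 2))  # 92.31
```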

In [None]:
import gc
gc.collect()

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
if DEVICE == "cuda":
    print("[INFO] training using {}".format(torch.cuda.get_device_name(0)))

torch.cuda.empty_cache()
%env WANDB_DISABLED=True

In [None]:
# Arguments for sequence-to-sequence training
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-meqsum2019",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    logging_strategy="steps",
    logging_steps=100,
    predict_with_generate=True,
    fp16=False
)
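
With these arguments, the number of optimizer steps follows from the training set size, batch size, and epoch count. The sketch below works through that arithmetic for the 700-row training split, assuming a single GPU, no gradient accumulation, and that the last partial batch is kept; `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(n_train, batch_size, epochs, logging_steps):
    """Return (steps per epoch, total steps, number of logging events)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    log_events = total_steps // logging_steps
    return steps_per_epoch, total_steps, log_events

print(training_schedule(n_train=700, batch_size=4, epochs=10, logging_steps=100))
# (175, 1750, 17)
```

Since both evaluation and checkpointing run once per epoch, this setup would evaluate ten times, and `save_total_limit=3` caps how many checkpoints stay on disk.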

In [None]:
# Trainer object
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

## 3. Inference on the test dataset

In [None]:
# Tokenize the test dataset
test_dataset = Dataset.from_pandas(test)
tokenized_test = test_dataset.map(
                preprocess_function,
                batched=True)

In [None]:
predict_results = trainer.predict(tokenized_test)
predict_results.metrics

In [None]:
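# Decode the predictions
if args.predict_with_generate:
    # Defensive step: map any -100 padding back to the pad token id before decoding
    pred_ids = np.where(predict_results.predictions != -100, predict_results.predictions, tokenizer.pad_token_id)
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    predictions = [pred.strip() for pred in predictions]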

In [None]:
test = test.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
test['Predicted Summary'] = predictions
pd.set_option('display.max_colwidth', 1024)
test

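## Stopping the notebook instance's kernel app
This notebook uses an ml.g4dn.2xlarge instance, which is also used in other labs, so stop the kernel app when you are finished.
To stop it, click the icon showing a black square inside a circle in the left-hand menu, then click the power button next to ml.g4dn.2xlarge.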