# Text Summarization for Healthcare
## Part 2: Fine-tuning Flan-T5 via the SageMaker SDK
In the previous notebook we fine-tuned Flan-T5 on the MeQSum dataset on a local notebook instance. In this notebook we will learn how to use the SageMaker SDK to spin up training instances for fine-tuning the Flan-T5 model on a medical question summarization task.
### MeQSum Dataset
"On the Summarization of Consumer Health Questions". Asma Ben Abacha and Dina Demner-Fushman. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019.
#### Citation Information
@Inproceedings{MeQSum,
author = {Asma {Ben Abacha} and Dina Demner-Fushman},
title = {On the Summarization of Consumer Health Questions},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28th - August 2},
year = {2019},
abstract = {Question understanding is one of the main challenges in question answering. In real world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization. }}


### Kernel and SageMaker Setup
Please use the ml.t3.medium instance for this notebook. The Kernel is 'Data Science - Python3'.

In [7]:
!pip -q install transformers==4.28.0 datasets==2.12.0 sagemaker==2.156.0 --upgrade

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytest-astropy 0.8.0 requires pytest-cov>=2.0, which is not installed.
pytest-astropy 0.8.0 requires pytest-filter-subpackage>=0.1, which is not installed.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0 which is incompatible.
awscli 1.27.111 requires botocore==1.29.111, but you have botocore 1.29.147 which is incompatible.
awscli 1.27.111 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.
awscli 1.27.111 requires rsa<4.8,>=3.1.2, but you have rsa 4.9 which is incompatible.
aiobotocore 2.4.2 requires botocore<1.27.60,>=1.27.59, but you have botocore 1.29.147 which is incompatible.

[notice] A new release of pip is available: 23.0.1 -> 23.1.2

## 1. Preparing Dataset

In [8]:
import datasets
from datasets import Dataset
from datasets import load_metric
from datasets import concatenate_datasets
from datasets.filesystems import S3FileSystem

import transformers
from transformers import AutoTokenizer

import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [9]:
import pandas as pd
# dataset from https://github.com/abachaa/MeQSum

df = pd.read_excel('MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx')
df = df.drop('File', axis=1)
df = df.rename(columns={'CHQ':'Text'})
df = df.dropna()
df['Text'] = df['Text'].apply(lambda x: x.lower())
df['Summary'] = df['Summary'].apply(lambda x: x.lower())
df['Id'] = range(0, len(df.index))
df = df[['Id', 'Text', 'Summary']]
# df = df.sample(frac=1).reset_index(drop=True) # uncomment to shuffle the rows
df

Unnamed: 0,Id,Text,Summary
0,0,subject: who and where to get cetirizine - d\n...,who manufactures cetirizine?
1,1,who makes bromocriptine\ni am wondering what c...,who manufactures bromocriptine?
2,2,subject: nulytely\nmessage: hello can you tell...,"who makes nulytely, and where can i buy it?"
3,3,williams' syndrome\ni would like to have my da...,where can i get genetic testing for william's ...
4,4,clinicaltrials.gov - question - general inform...,where can i get genetic testing for multiple m...
...,...,...,...
995,995,subject: after surgery of ear drum still same ...,what are the treatments for perforated eardrum?
996,996,subject: clinicaltrials.gov - question - speci...,what are the treatments for glycogen storage d...
997,997,message: i have numbness/tingling in my lower ...,where can i find information and treatment for...
998,998,subject: sleep apnea\nmessage: i was diagnosed...,how long does swelling from sleep apnea take t...


In [10]:
model_checkpoint = 'google/flan-t5-base' # 'google/flan-t5-small' for quick training.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [11]:
train = df[:700]
val = df[700:900]
test = df[900:]
print('train: {}, val: {}, test: {}'.format(train.shape, val.shape, test.shape))

train: (700, 3), val: (200, 3), test: (100, 3)
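
The split above takes the rows in file order. If the spreadsheet happens to be grouped by topic, a shuffled split with a fixed seed may give more representative validation and test sets. The following is a minimal sketch of that idea; `split_indices` is a hypothetical helper that is not part of the original notebook, and the seed value is arbitrary.

```python
import random


def split_indices(n_rows, n_train, n_val, seed=42):
    """Return shuffled, non-overlapping index lists for train, validation, and test."""
    indices = list(range(n_rows))
    # A dedicated Random instance with a fixed seed keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000, 700, 200)
print(len(train_idx), len(val_idx), len(test_idx))

# With the notebook's DataFrame, the indices could be applied like this:
# train, val, test = df.iloc[train_idx], df.iloc[val_idx], df.iloc[test_idx]
```

Whether shuffling helps depends on how the source file is ordered, so treat this as an option to evaluate rather than a required change.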


In [12]:
# Dataset from dataframe
train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test)

In [13]:
# Max input length and max target length based on the dataset.
# Note: truncation=True caps each sequence at the tokenizer's maximum length.
tokenized_inputs = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Text"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_input_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max input length: {max_input_length}")

tokenized_targets = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Summary"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")

Max input length: 512


Max target length: 60


In [14]:
# Prefix each question with the "summarize: " prompt, then tokenize inputs and targets
def preprocess_function(sample, padding="max_length"):
    inputs = ["summarize: " + item for item in sample["Text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=padding, truncation=True)

    labels = tokenizer(text_target=sample["Summary"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace pad token ids in the labels with -100 so the loss ignores padding positions
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
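
The label-masking step in `preprocess_function` is easy to miss, so here is a small, tokenizer-free illustration. PyTorch's cross-entropy loss skips targets equal to `-100` by default, which is why padded label positions are rewritten to that value. The token ids below are made up for illustration; the only real values are T5's pad id (`0`) and end-of-sequence id (`1`). `mask_padding` is a hypothetical helper name.

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace pad token ids with ignore_index so the loss skips those positions."""
    return [
        [(tok if tok != pad_token_id else ignore_index) for tok in seq]
        for seq in label_ids
    ]


# Toy label sequences padded to length 5 (ids are illustrative, 0 = pad, 1 = eos).
padded_labels = [
    [363, 33, 1, 0, 0],
    [113, 1, 0, 0, 0],
]

masked_labels = mask_padding(padded_labels, pad_token_id=0)
print(masked_labels)
```

Only the padding is masked; the end-of-sequence token stays in the labels so the model still learns where a summary ends.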

In [15]:
# Tokenized dataset
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

print(f"Keys of tokenized dataset: {tokenized_train.features}")

Keys of tokenized dataset: {'Id': Value(dtype='int64', id=None), 'Text': Value(dtype='string', id=None), 'Summary': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


In [16]:
# Uploading dataset to S3
s3 = S3FileSystem()

bucket = sess.default_bucket()
s3_prefix = "huggingface/meqsum-flan-t5-summarization"

base_job_name="huggingface-meqsum-flan-t5-summarization"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

# The local path where the model will save its checkpoints in the training container
checkpoint_local_path="/opt/ml/checkpoints"

dataset_input_path = "s3://{}/{}".format(bucket, s3_prefix)
train_input_path = "{}/train".format(dataset_input_path)
valid_input_path = "{}/validation".format(dataset_input_path)

print(dataset_input_path)
print(train_input_path)
print(valid_input_path)
print(checkpoint_s3_bucket)

tokenized_train.save_to_disk(train_input_path, fs=s3)
tokenized_val.save_to_disk(valid_input_path, fs=s3)

s3://sagemaker-us-east-1-431579215499/huggingface/meqsum-flan-t5-summarization
s3://sagemaker-us-east-1-431579215499/huggingface/meqsum-flan-t5-summarization/train
s3://sagemaker-us-east-1-431579215499/huggingface/meqsum-flan-t5-summarization/validation
s3://sagemaker-us-east-1-431579215499/huggingface-meqsum-flan-t5-summarization/checkpoints




## 2. Training using SageMaker Training

In [17]:
# hyperparameters
hyperparameters = {
    "epochs": 10,
    "learning-rate": 2e-5,
    "train-batch-size": 4,
    "eval-batch-size": 4,
    "model-name": model_checkpoint,
    'output_dir': checkpoint_local_path
}
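
SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument of the form `--key value`, and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The `train.py` script is not shown in this notebook, so the following is only a sketch of how such a script might read these values. The argument names mirror the dictionary keys above; the defaults and the `--train-dir` / `--valid-dir` names are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a hypothetical train.py might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: the option names must match the dictionary keys exactly.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--learning-rate", type=float, default=5e-5)
    parser.add_argument("--train-batch-size", type=int, default=8)
    parser.add_argument("--eval-batch-size", type=int, default=8)
    parser.add_argument("--model-name", type=str, default="google/flan-t5-base")
    parser.add_argument("--output_dir", type=str, default="/opt/ml/checkpoints")
    # Input channels: SageMaker sets SM_CHANNEL_TRAIN / SM_CHANNEL_VALID for the
    # "train" and "valid" channels; the fallbacks are the standard container paths.
    parser.add_argument(
        "--train-dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--valid-dir",
        default=os.environ.get("SM_CHANNEL_VALID", "/opt/ml/input/data/valid"),
    )
    return parser.parse_args(argv)


# Simulate the arguments SageMaker would build from the dictionary above.
args = parse_args([
    "--epochs", "10",
    "--learning-rate", "2e-05",
    "--train-batch-size", "4",
    "--eval-batch-size", "4",
    "--model-name", "google/flan-t5-base",
    "--output_dir", "/opt/ml/checkpoints",
])
print(args.epochs, args.learning_rate, args.model_name, args.output_dir)
```

Note that `argparse` converts hyphens in option names to underscores, so `--learning-rate` is read as `args.learning_rate`, while `--output_dir` keeps its underscore.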

In [18]:
# Raw strings avoid invalid-escape warnings; "\." matches a literal decimal point.
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': r"'eval_loss': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': r"'epoch': ([0-9]+(\.|e\-)[0-9]+),?"}
]
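
It can help to test metric patterns locally with Python's `re` module before launching a job. The sketch below uses a made-up log line in the style of the Hugging Face `Trainer` output. It compares a pattern with the same shape as the definitions above (digits, one separator, digits) against a stricter alternative that also accepts a decimal part followed by an exponent. The stricter pattern is a suggestion, not part of the original notebook.

```python
import re

# Hypothetical log line, written in the style of the Trainer's logging output.
log_line = "{'loss': 1.2345, 'learning_rate': 1.8e-05, 'epoch': 2.0}"

# Same shape as the definitions above: digits, one separator, digits.
two_part_pattern = r"'learning_rate': ([0-9]+(\.|e\-)[0-9]+),?"

# Alternative: optional decimal part, then optional exponent (non-capturing groups).
strict_pattern = r"'learning_rate': ([0-9]+(?:\.[0-9]+)?(?:e[+-]?[0-9]+)?),?"

two_part_value = re.search(two_part_pattern, log_line).group(1)
strict_value = re.search(strict_pattern, log_line).group(1)

# The two-part pattern stops after the decimal part and drops the exponent.
print("two-part:", two_part_value)
print("strict:  ", strict_value, "->", float(strict_value))
```

For values such as `1.2345` or `2e-05` both patterns agree; they differ only when a value has both a decimal part and an exponent, which is common for learning rates.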

In [19]:
huggingface_estimator = HuggingFace(
    role=sagemaker.get_execution_role(),
    entry_point="train.py",
    dependencies=["requirements.txt"],
    hyperparameters=hyperparameters,
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    metric_definitions=metric_definitions
    # distribution={"smdistributed": {"dataparallel": {"enabled": True}}}, # For distributed training.
)
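
With `checkpoint_s3_uri` and `checkpoint_local_path` set, SageMaker syncs the local checkpoint directory with S3, so a restarted job finds earlier checkpoints already in place. The Hugging Face `Trainer` saves checkpoints in folders named `checkpoint-<step>`. How `train.py` resumes is not shown here, so the following is only a sketch of locating the most recent checkpoint folder with the standard library. `find_last_checkpoint` is a hypothetical helper; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.

```python
import os
import re
import tempfile


def find_last_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered 'checkpoint-<step>' folder, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    candidates = []
    for name in os.listdir(checkpoint_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare by step number, not by string, so 1500 sorts after 500.
    return os.path.join(checkpoint_dir, max(candidates)[1])


# Demonstration with a temporary directory standing in for /opt/ml/checkpoints.
with tempfile.TemporaryDirectory() as tmp_dir:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(tmp_dir, f"checkpoint-{step}"))
    last_checkpoint = find_last_checkpoint(tmp_dir)
    print(os.path.basename(last_checkpoint))
```

In a training script the result would typically be passed to `trainer.train(resume_from_checkpoint=...)` when it is not `None`.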

In [20]:
huggingface_estimator.fit({"train": train_input_path, "valid": valid_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


Using provided s3_resource


INFO:sagemaker:Creating training-job with name: huggingface-meqsum-flan-t5-summarizatio-2023-06-06-01-49-20-635


2023-06-06 01:49:21 Starting - Starting the training job...
2023-06-06 01:49:46 Starting - Preparing the instances for training.........
2023-06-06 01:51:21 Downloading - Downloading input data
2023-06-06 01:51:21 Training - Downloading the training image........................
2023-06-06 01:55:18 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-06 01:55:36,435 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-06 01:55:36,454 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-06-06 01:55:36,466 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-06 01:55:36,469 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-06-06 01:55:

In [21]:
huggingface_estimator.model_data

's3://sagemaker-us-east-1-431579215499/huggingface-meqsum-flan-t5-summarizatio-2023-06-06-01-49-20-635/output/model.tar.gz'

## 3. Inference using SageMaker endpoint

In [22]:
huggingface_predictor = huggingface_estimator.deploy(
    initial_instance_count=1, instance_type="ml.p3.2xlarge"
)

INFO:sagemaker:Creating model with name: huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115
INFO:sagemaker:Creating endpoint-config with name huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115
INFO:sagemaker:Creating endpoint with name huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115


------------!

In [23]:
predictions = []
for test_data in test_dataset: 
    prediction = huggingface_predictor.predict({"inputs": f"summarize: {test_data['Text']}"})
    predictions.append(prediction[0]['generated_text'])
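
The request body sent to the endpoint is a plain dictionary. Besides `inputs`, the default Hugging Face inference handler is generally expected to accept a `parameters` entry that is forwarded to generation, which may help when a summary is cut off by the default generation length. The sketch below only builds such a payload; `build_payload` is a hypothetical helper, and whether `parameters` is honored depends on the inference code deployed with the model. The default of 60 tokens follows the maximum target length computed earlier.

```python
def build_payload(text, max_new_tokens=60, prefix="summarize: "):
    """Build a request dictionary with the prompt prefix used during training."""
    return {
        "inputs": f"{prefix}{text}",
        # Assumption: the endpoint forwards these values to the generate call.
        "parameters": {"max_new_tokens": max_new_tokens},
    }


payload = build_payload("subject: sleep apnea\nmessage: i was diagnosed with sleep apnea")
print(payload)

# With the predictor from this notebook, the payload could be sent as:
# prediction = huggingface_predictor.predict(payload)
```

Keeping the `summarize: ` prefix in one place also guards against a mismatch between the training prompt and the inference prompt.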

In [24]:
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning (test was created as a slice of df).
test = test.copy()
test['Predicted Summary'] = predictions
pd.set_option('display.max_colwidth', 1024)
test


Unnamed: 0,Id,Text,Summary,Predicted Summary
900,900,subject: just a question\nmessage: hi..just wanna ask... 1.how the aspirin can affect the ear? 2. what is the cause of suddenly ringging in the ear? isn't dangerous? tq.. :),"what causes ringing in the ear, and can aspirin affect the ear?",what are the causes of sudden ringing in the ear?
901,901,"dear doc,\ni am now turning 40years in november and all my life i have desired\nmiserably for a divine intervention to restore my smell sense so that i can\nfully appreciate and participate in this one life on earth. i truly wish to\nbe a part of your research if need be because the disorder had greatly\naffected my life. if you already have medical drugs to cure and restore my\nsmell sense kindly give information on how i can acquire to benefit from\nthis. i pray that god the creator gracefully grants me favour with this so\nthat i can enjoy the beauty of his creation in this world, with respect to\nsmell, before i depart to continue with him in heaven. cheers for now as i\nwait to hear from you.\n[name], ms.\nsmell disorder (anosmia) patient /sufferer,\nwriting from [location]. cell: [contact], [contact].",what are the treatments for anosmia?,what are the treatments for smell disorder?
902,902,"subject: cosmetic leg shortening surgery\nmessage: hi, i am a tall girl(5'8""), who wants to undergo leg shortening sugery of 2 inches for cosmetic purpose. it would be good if i can get more information about it. i would like to know the cost of this surgery, the recovery time and the risks associated with it. how long should i stay in the hospital? thanks and regards","where can i find information on leg shortening surgery, including risks, cost, and recovery time?","what are the costs of leg shortening surgery, recovery time and risks associated with it?"
903,903,"subject: clinicaltrials.gov - question - specific study\nmessage: i am working with a hep c patient who needs treatment but cannot afford tx. how can i help her get in touch with a recruiting study? there are no numbers or ways to contact a recruiting study. sincerely, [name]",where can i find clinical trials on hepatitis c?,"where can i find information on hepatitis c research, including a"
904,904,"subject: laparoscopic splenectomy\nmessage: dear sir/madam my brother [name] is diagnosed itp. his doctor advises laparoscopic splenectomy for him. can you please mail me detail and cost of this surgery. his platletts count is decreased to 12 and his doctors giving him injection mebthera today to increase plattlets count. we are form [location], [location]. please mail us as soon possible. thanks & best regards [name]","where can i find information on laparoscopic splenectomy, including cost?",what are the costs and benefits of laparoscopic splenectomy for itp?
...,...,...,...,...
995,995,subject: after surgery of ear drum still same problem\nmessage: i got surgery for hole in my ear drum(hole was in my ear from 5 0r 6 ears but i did not know it but when i came to know i got surgery) but after two year surgery still i have same problem. problem in listening and continuous noise like buzzing or ringing in my right ear.so sir what should i do right now? plz sir help me. buzzing in my both has been started from last 3 year.plz help me....,what are the treatments for perforated eardrum?,what are the treatments for buzzing and ringing in the ears after eardrum surgery
996,996,subject: clinicaltrials.gov - question - specific study\nmessage: looking for help for my nephew with glycogen storage disease. he lives in virginia and is suffering badly. he has been hospitalized for severe cramping about 5 times this year so far. any guidance you could give would be greatly appreciated.,what are the treatments for glycogen storage disease?,what are the treatments for glycogen storage disease?
997,997,"message: i have numbness/tingling in my lower right arm from elbow to my fingers. a emg has shown nothing abnormal. i have had this for a long time, i need help.",where can i find information and treatment for numbness and tingling in lower right arm?,what are the treatments for arm numbness and tingling?
998,998,subject: sleep apnea\nmessage: i was diagnosed with sleep apnea (prolly had it for 5 years) and i have swelling issues caused from that (it has been ruled out from everything else so the doctor thinks). i just got my cpap machine. i was wondering how long will it take for the swelling to go away. thank you!,how long does swelling from sleep apnea take to heal?,how long will it take for sleep apnea swelling to go away?
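
The table supports a side-by-side reading of reference and predicted summaries. For a rough quantitative check, a unigram-overlap score in the spirit of ROUGE-1 can be computed with the standard library. The sketch below is a simplified approximation (whitespace tokenization, no stemming or punctuation handling), not the official ROUGE implementation, so its numbers are not directly comparable with those reported in the MeQSum paper. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Simplified ROUGE-1 F1: unigram overlap on whitespace-separated tokens."""
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two reference/prediction pairs taken from the table above.
pairs = [
    ("what are the treatments for glycogen storage disease?",
     "what are the treatments for glycogen storage disease?"),
    ("what are the treatments for anosmia?",
     "what are the treatments for smell disorder?"),
]
scores = [rouge1_f1(ref, pred) for ref, pred in pairs]
print(scores)

# With the notebook's DataFrame, a mean score could be computed as:
# test.apply(lambda r: rouge1_f1(r['Summary'], r['Predicted Summary']), axis=1).mean()
```

An exact match scores 1.0, while a paraphrase such as "smell disorder" for "anosmia" is penalized, which is a known limitation of overlap-based metrics.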


## 4. Clean up
Remember to delete the endpoint when you are done: you are charged for the endpoint instance for as long as it is running.

In [25]:
huggingface_predictor.delete_model()
huggingface_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115
INFO:sagemaker:Deleting endpoint with name: huggingface-meqsum-flan-t5-summarizatio-2023-06-06-02-24-30-115
