# Fine-Tuning T5-base for the Medical Domain

- https://github.com/artidoro/qlora#tutorials-and-demonstrations
- https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/peft-flan-t5-int8-summarization.ipynb

In [46]:
# setup env
!pip install -q bitsandbytes datasets accelerate loralib rouge_score evaluate
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git


In [1]:
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import bitsandbytes as bnb
from datasets import load_dataset
from huggingface_hub import notebook_login

from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer,  AutoModelForSeq2SeqLM

from peft import LoraConfig, get_peft_model, TaskType

import evaluate

In [3]:
# set variable & parameters
MODEL_CHECKPOINT = "t5-base" # t5-3b
MODEL_REPO = "t5-base-adapt"
PREFIX = "summarize: "
MAX_INPUT_LENGTH = 512   # > CUDA out of memory
MAX_TARGET_LENGTH = 64
BATCH_SIZE = 8

In [4]:
notebook_login()
# Paste your Hugging Face access token into the prompt; never store it in the notebook.


## Load Dataset

### Dataset Summary
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 1,000,000 scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. This is a processed version of the dataset, in which some empty entries were removed and the records were formatted to be compatible with Alpaca-style training. For more details on the data, please refer to the original publication.

In [5]:
split = 'train[0:5000]'
billsum = load_dataset("medalpaca/medical_meadow_cord19", split=split)
billsum = billsum.train_test_split(test_size=0.2)
billsum["train"][0]

{'instruction': 'Please summerize the given abstract to a title',
 'output': 'Living in cohousing communities: Psychological effects and coping strategies in times of covid-19',
 'input': 'The aim of this study was to compare a sample of residents in cohousing communities (n = 180) and inhabitants in traditional neighborhoods (n = 104) During the social isolation that was decreed by the German government due to the COVID-19 pandemic, data collection was carried out through the Internet Psychological symptoms and coping strategies were measured, and their differences were investigated by multivariate analysis of variance (MANOVA) Results showed that residents in cohousing communities have lower levels of depressive, anxiety, compulsive and eating disorders, as well as less use of coping strategies which are based on emotional concealment, problem avoidance, and social withdrawal Moreover, its inhabitants showed higher levels in the use of social support It is concluded that living in a 

In [6]:
billsum

DatasetDict({
    train: Dataset({
        features: ['instruction', 'output', 'input'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['instruction', 'output', 'input'],
        num_rows: 1000
    })
})

## Preprocess Dataset

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [8]:
def preprocess_function(examples):
    inputs = [PREFIX + doc for doc in examples["input"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    labels = tokenizer(text_target=examples["output"], max_length=MAX_TARGET_LENGTH, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [9]:
tokenized_dataset = billsum.map(preprocess_function, batched=True, remove_columns=['instruction', 'output', 'input'])


## Load Model with PEFT

In [10]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

In [11]:
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],    # optional; it is fine to omit this
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(model, config)

In [12]:
model.print_trainable_parameters()

trainable params: 1,769,472 || all params: 224,673,024 || trainable%: 0.7875765272113843
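The trainable-parameter count above can be reproduced by hand. The sketch below assumes the usual T5-base architecture figures (hidden size 768, 12 encoder and 12 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block); these figures are assumptions and are not stated in this notebook.

```python
# Back-of-the-envelope check of the LoRA parameter count reported above.
# Assumed T5-base architecture figures (not stated in this notebook):
D_MODEL = 768              # hidden size; q and v each project 768 -> 768
ENCODER_LAYERS = 12        # one self-attention block per encoder layer
DECODER_LAYERS = 12        # self-attention plus cross-attention per decoder layer
R = 16                     # LoRA rank, from the LoraConfig above
TARGETS_PER_ATTENTION = 2  # the "q" and "v" projections

attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

# Each adapted projection gains two low-rank matrices:
# A with shape (r, d_in) and B with shape (d_out, r).
params_per_module = R * (D_MODEL + D_MODEL)
lora_params = adapted_modules * params_per_module

ALL_PARAMS = 224_673_024   # "all params" from print_trainable_parameters()
trainable_percent = 100 * lora_params / ALL_PARAMS

print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable share: {trainable_percent:.4f}%")
```

Under these assumptions the arithmetic lands on the same 1,769,472 trainable parameters that `print_trainable_parameters()` reports.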


In [13]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
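As a rough illustration of what the collator does for each batch, the sketch below pads inputs and labels to the longest sequence in the batch using plain Python lists. It is a simplified stand-in, not the library implementation: `pad_batch` is a hypothetical helper, and the padding values assume T5's pad token id of 0 and the collator's default label padding of -100.

```python
PAD_TOKEN_ID = 0      # assumed: T5's padding token id
LABEL_PAD_ID = -100   # assumed: default label padding, ignored by the loss

def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input = len(f["input_ids"])
        n_label = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_input - n_input))
        # Attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * n_input + [0] * (max_input - n_input))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - n_label))
    return batch

features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "labels": [4, 3, 2, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch rather than to `MAX_INPUT_LENGTH` keeps short batches small, which is why the tokenizer call in `preprocess_function` does not pad.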

## Compute Metrics

In [14]:
metrics = evaluate.load("rouge")
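To give a feel for what the ROUGE-1 score measures, here is a minimal sketch of unigram-overlap F1 in plain Python. It is a simplification and not the `rouge_score` implementation: it splits on whitespace and skips stemming, and `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each word's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("living in communities", "living in cohousing communities")
print(round(score, 4))
```

In this example all three predicted words appear in the four-word reference, so precision is 1.0, recall is 0.75, and the F1 score is 6/7.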

In [15]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = metrics.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
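The two bookkeeping steps in `compute_metrics` (swapping -100 back to the pad id before decoding, and counting non-pad tokens for `gen_len`) can be sketched on toy token ids with plain lists. The helper names are hypothetical, and the pad id of 0 is an assumption about the T5 tokenizer.

```python
PAD_TOKEN_ID = 0     # assumed: T5's padding token id
IGNORE_INDEX = -100  # label positions that the loss ignores

def unmask_labels(labels):
    """Replace ignore-index positions with the pad id so the ids can be decoded."""
    return [[PAD_TOKEN_ID if t == IGNORE_INDEX else t for t in row] for row in labels]

def mean_generated_length(predictions):
    """Average number of non-pad tokens per generated sequence."""
    lengths = [sum(1 for t in row if t != PAD_TOKEN_ID) for row in predictions]
    return sum(lengths) / len(lengths)

labels = [[9, 1, -100, -100], [4, 3, 2, 1]]
predictions = [[0, 9, 1, 0], [0, 4, 3, 1]]  # generated ids start with the pad token

decodable = unmask_labels(labels)
gen_len = mean_generated_length(predictions)
print(decodable)
print(gen_len)
```

Without the first step, the tokenizer would be asked to decode -100, which is not a valid vocabulary id.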

## Define Hyperparameters

In [21]:
# Set up training arguments
training_args = Seq2SeqTrainingArguments(
    MODEL_REPO,
    evaluation_strategy="steps",
    eval_steps=500,
    learning_rate=1e-3,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    push_to_hub=True
)

In [22]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Training

In [23]:
torch.cuda.empty_cache()

In [24]:
trainer.train()

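| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 500  | 2.0799 | 1.837721 | 0.4134 | 0.2133 | 0.3461 | 0.3464 | 17.684 |
| 1000 | 1.9217 | 1.815728 | 0.4164 | 0.2216 | 0.3544 | 0.3547 | 17.504 |
| 1500 | 1.8161 | 1.8008   | 0.4223 | 0.2237 | 0.3596 | 0.3598 | 17.668 |
| 2000 | 1.7366 | 1.796608 | 0.4219 | 0.2238 | 0.3561 | 0.3562 | 17.807 |
| 2500 | 1.6718 | 1.79981  | 0.422  | 0.2274 | 0.3592 | 0.3594 | 17.713 |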




TrainOutput(global_step=2500, training_loss=1.8452285400390624, metrics={'train_runtime': 1792.7197, 'train_samples_per_second': 11.156, 'train_steps_per_second': 1.395, 'total_flos': 1.1589970917482496e+16, 'train_loss': 1.8452285400390624, 'epoch': 5.0})
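The step count and throughput in `TrainOutput` follow from the dataset size and the training arguments. The sketch below redoes that arithmetic, assuming a single device and no gradient accumulation (neither is stated explicitly in this notebook).

```python
import math

# Values taken from the cells above.
TRAIN_EXAMPLES = 4000        # rows in the train split
BATCH_SIZE = 8               # per_device_train_batch_size
EPOCHS = 5                   # num_train_epochs
TRAIN_RUNTIME_S = 1792.7197  # train_runtime from TrainOutput

# Assumption: one device and gradient_accumulation_steps == 1.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

samples_per_second = TRAIN_EXAMPLES * EPOCHS / TRAIN_RUNTIME_S
steps_per_second = total_steps / TRAIN_RUNTIME_S

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"samples/second:  {samples_per_second:.3f}")
print(f"steps/second:    {steps_per_second:.3f}")
```

With `eval_steps=500` and 500 steps per epoch, each evaluation row in the table corresponds to the end of one epoch.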

In [None]:
trainer.push_to_hub()

In [33]:
# Push HuggingFace Hub using the model.push_to_hub method.
# model.push_to_hub(MODEL_REPO)
# tokenizer.push_to_hub(MODEL_REPO)

# Save our LoRA model & tokenizer results
# peft_model_id="results"
# trainer.model.save_pretrained(peft_model_id)
# tokenizer.save_pretrained(peft_model_id)

# if you want to save the base model to call
# trainer.model.base_model.save_pretrained(peft_model_id)

('results/tokenizer_config.json',
 'results/special_tokens_map.json',
 'results/tokenizer.json')

## Evaluate

In [11]:
text_1 = """
About acne
Acne is a common skin condition that affects most people at some point.
 It causes spots, oily skin and sometimes skin that's hot or painful to touch.

Acne most commonly develops on the:

face – this affects almost everyone with acne
back – this affects more than half of people with acne
chest – this affects about 15% of people with acne
Types of spots
There are 6 main types of spot caused by acne:

blackheads – small black or yellowish bumps that develop on the skin; they're not filled with dirt, but are black because the inner lining of the hair follicle produces pigmentation (colouring)
whiteheads – have a similar appearance to blackheads, but may be firmer and won't empty when squeezed
papules – small red bumps that may feel tender or sore
pustules – similar to papules, but have a white tip in the centre, caused by a build-up of pus
nodules – large hard lumps that build up beneath the surface of the skin and can be painful
cysts – the most severe type of spot caused by acne; they're large pus-filled lumps that look similar to boils and carry the greatest risk of causing permanent scarring
"""


text_2 = """
COURSE WHILE IN HOSPITAL
Relevant Complaint(s) and Concerns:
1. Upon arrival: Patient presented with five days of increased urinary frequency, urgency and dysuria as well as
48 hours of fever and rigors. He was hypotensive and tachycardic upon arrival to the emergency department.
The internal medicine service was consulted. The following issues were addressed during the hospitalization:
Summary Course in Hospital (Issues Addressed):
2. Fever and urinary symptoms: A preliminary diagnosis of pyelonephritis was established. Other causes of fever
were possible but less likely. The patient was hypotensive on initial assessment with a blood pressure of
80/40. Serum lactate was elevated at 6.1. A bolus of IV fluid was administered (1.5L) but the patient remained
hypotensive. Our colleagues from ICU were consulted. An arterial line was inserted for hemodynamic
monitoring. Hemodynamics were supported with levophed and crystalloids. Piptazo was started after blood
and urine cultures were drawn. After 12 hours serum lactate had normalized and hemodynamics had
stabilized. Blood cultures were positive for E.Coli that was sensitive to all antibiotics. The patient was stepped
down to oral ciprofloxacin to complete a total 14 day course of antibiotics.
On further review it was learned that the patient has been experiencing symptoms of prostatism for the last
year. An abdominal ultrasound performed for elevated liver enzymes and acute kidney injury confirmed a
"""


text = """
DIAGNOSIS:
A. SKIN, RIGHT ARM, SHAVE BIOPSY:
COMPATIBLE WITH PERFORATING DISORDER WITH FEATURES OF
ELASTOSIS PERFORANS SERPIGINOSUM.
B. SKIN, LEFT NECK, SHAVE BIOPSY:
1. COMPATIBLE WITH PERFORATING DISORDER WITH FEATURES
OF ELASTOSIS PERFORANS SERPIGINOSUM.
2. ASSOCIATED SPONGIOTIC DERMATITIS WITH OCCASIONAL
EOSINOPHILS (SEE NOTE).
"""

In [None]:
# Second case https://stackoverflow.com/questions/76459034/how-to-load-a-fine-tuned-peft-lora-model-based-on-llama-with-huggingface-transfo
from peft import PeftConfig, PeftModel

# Load the PEFT config for the fine-tuned checkpoint, load the base model it
# points to, then attach the LoRA adapter weights on top of the base model.
peft_model_id = "fahmiaziz/t5-base-adapt"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained("t5-base")

In [37]:
def summarize_text(text: str, model):
    # `model` is a checkpoint name: the base model or the LoRA adapter repo.
    model_name = model
    if model_name == "fahmiaziz/t5-base-adapt":
        # Load the base model named in the adapter config, then attach the adapter.
        config = PeftConfig.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
        model = PeftModel.from_pretrained(model, model_name)
        tokenizer = AutoTokenizer.from_pretrained("t5-base")
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate to the same input length that was used during training.
    inputs = tokenizer.encode(
        PREFIX + text, return_tensors="pt",
        max_length=MAX_INPUT_LENGTH, truncation=True
    )
    outputs = model.generate(
        input_ids=inputs,
        min_length=25,
        max_length=258,
        num_beams=10,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True,
        no_repeat_ngram_size=3,
        temperature=0.1,
        do_sample=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [38]:
t5_base = "t5-base"
t5_adapt = "fahmiaziz/t5-base-adapt"

summary_1 = summarize_text(text, model=t5_base)
summary_2 = summarize_text(text, model=t5_adapt)
print("Medical diagnosis :", text, "\n")
print("Summary T5-base :", summary_1, "\n")
print("Summary T5-Fine Tuning :", summary_2)

'COMPATIBLE WITH PERFORATING DISORDER. 2. ASSOCIATED SPONGIOTIC DERMATITIS WITH OCCASIONAL EOSINOPHILS (SEE NOTE).'

In [39]:
summarize_text(text, model=t5_adapt)

'COMPATIBLE WITH PERFORATING DISORDER WITH FEATURES OF ELASTOSIS PERFORANS SERPIGINOSUM.'

In [40]:
summary_1 = summarize_text(text_1, model=t5_base)
summary_2 = summarize_text(text_1, model=t5_adapt)

print("Medical diagnosis :", text_1, "\n")
print("Summary T5-base :", summary_1, "\n")
print("Summary T5-Fine Tuning :", summary_2)


In [41]:
summary_1 = summarize_text(text_2, model=t5_base)
summary_2 = summarize_text(text_2, model=t5_adapt)

print("Medical diagnosis :", text_2, "\n")
print("Summary T5-base :", summary_1, "\n")
print("Summary T5-Fine Tuning :", summary_2)


## Build App

In [42]:
!pip install -q streamlit transformers
!npm install localtunnel

In [61]:
%%writefile app.py
import streamlit as st
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_REPO = "fahmiaziz/t5-base-adapt"
max_input_length = 512

st.header("Summarize Medical Diagnosis")
st_model_load = st.text('Loading summarization model...')

# Cache the loaded model across reruns (st.cache_resource replaces the deprecated st.cache).
@st.cache_resource
def load_model():
    print("Loading model...")
    # Load the base model named in the adapter config, then attach the LoRA adapter.
    config = PeftConfig.from_pretrained(MODEL_REPO)
    model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
    model = PeftModel.from_pretrained(model, MODEL_REPO)
    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    print("Model loaded!")
    return tokenizer, model

tokenizer, model = load_model()
st.success('Model loaded!')
st_model_load.text("")

with st.sidebar:
    # The slider's return value is used directly, so no session-state callback is needed.
    temperature = st.slider("Temperature", min_value=0.1, max_value=1.5, value=0.6, step=0.05)
    st.markdown("_Higher temperature makes the results more random_")

if 'text' not in st.session_state:
    st.session_state.text = ""
st_text_area = st.text_area('Summarize Text', value=st.session_state.text, height=500)

def summarize_text():
    # Use the same task prefix as in training ("summarize: ").
    PREFIX = "summarize: "
    st.session_state.text = st_text_area

    # Truncate to the input length that was used during training.
    inputs = tokenizer.encode(
        PREFIX + st_text_area, return_tensors="pt",
        max_length=max_input_length, truncation=True
    )
    outputs = model.generate(
        input_ids=inputs,
        min_length=25,
        max_length=258,
        num_beams=10,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True,
        no_repeat_ngram_size=3,
        temperature=temperature,
        do_sample=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# generate button
if st.button("Summarize"):
    summary = summarize_text()
    st.write(summary)

Overwriting app.py


In [62]:
!streamlit run /content/app.py &>/content/logs.txt &

In [63]:
!npx localtunnel --port 8501 & curl https://ipv4.icanhazip.com

34.126.188.33
npx: installed 22 in 1.661s
your url is: https://better-apples-trade.loca.lt


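#### How to use
- Copy the IP address printed by the cell above (in this run, `34.126.188.33`).
- Open the `loca.lt` URL printed after `your url is:` (the URL changes on every run).
- Paste the copied IP address when the localtunnel page asks for the tunnel password.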