<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 10px 10px 10px #042b4c">Project Title: HealthCare Chatbot Using T5 Fine-Tuning</p></b></h1>



### Table of Contents:



* [Import Libraries](#1)

* [Load Dataset](#2)

* [T5 Model](#4)

* [Evaluation](#3)

* [Predictions on the Test Set](#8)

<a id="1"></a>

<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 10px 10px 10px #042b4c">Import Libraries</p></b></h1>

<a class="btn" href="#home">Table of Contents</a>

In [1]:
import pandas as pd
import re
import torch
from datasets import Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from huggingface_hub import login
import wandb
wandb.init(mode = "disabled")
import warnings
warnings.filterwarnings('ignore')

<a id="2"></a>

<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 10px 10px 10px #042b4c">Load Dataset</p></b></h1>

<a class="btn" href="#home">Table of Contents</a>

In [2]:
data = pd.read_csv('/kaggle/input/alzhimer-chat-leader/full_Chat_data.csv',usecols=[1,2])
data.sample(5)

Unnamed: 0,Questions,Answers
8034,Can medications for behavioral symptoms have s...,"Yes, medications for behavioral symptoms can h..."
11777,How does social engagement contribute to a sen...,"Social engagement fosters connections, builds ..."
3638,How does sugar consumption affect the gut-brai...,Sugar consumption can affect the gut-brain-axi...
6422,How does nicotine influence the levels of prot...,Investigating how nicotine influences proteins...
18080,How does maintaining good vascular health impa...,Good vascular health is crucial for maintainin...


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25177 entries, 0 to 25176
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Questions  25173 non-null  object
 1   Answers    25154 non-null  object
dtypes: object(2)
memory usage: 393.5+ KB


In [4]:
data.dropna(inplace=True)

In [5]:
data.Questions = data.Questions.astype(str)
data.Answers = data.Answers.astype(str)

In [6]:
# Split the data into train, validation, and test sets
train_size = int(len(data) * 0.9)
val_size = int(len(data) * 0.05)
test_size = int(len(data) * 0.05)

train_df = data[:train_size].reset_index(drop=True)  # Reset index for train set
val_df = data[train_size:train_size + val_size].reset_index(drop=True)  # Reset index for validation set
test_df = data[train_size + val_size:].reset_index(drop=True)  # Reset index for test set
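The slices above take rows in file order, so if the CSV happens to be grouped by topic, the validation and test sets may cover different topics than the training set. A minimal sketch of a seeded, shuffled split over row indices follows. The function name `split_indices`, the seed, and the row count 25154 are assumptions for illustration (the count is inferred from the split sizes 22638 + 1257 + 1259 reported by this notebook's outputs).

```python
import random

def split_indices(n_rows, train_frac=0.9, val_frac=0.05, seed=42):
    """Return shuffled (train, val, test) index lists; test takes the remainder."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train_size = int(n_rows * train_frac)
    val_size = int(n_rows * val_frac)
    train_idx = indices[:train_size]
    val_idx = indices[train_size:train_size + val_size]
    test_idx = indices[train_size + val_size:]
    return train_idx, val_idx, test_idx

# Assumed row count after dropna, inferred from the reported split sizes
train_idx, val_idx, test_idx = split_indices(25154)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, the index lists could then be applied as `data.iloc[train_idx].reset_index(drop=True)`, and likewise for the other two sets.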

In [7]:
# Clean the text by removing unwanted characters
def clean_text(text):
    text = re.sub(r'\r\n', ' ', text)  # Remove carriage returns and line breaks
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'<.*?>', '', text)  # Remove any XML tags
    text = text.strip().lower()  # Strip and convert to lower case
    return text

# Apply cleaning to the Questions and Answers columns of each split
train_df['Questions'] = train_df['Questions'].apply(clean_text)
train_df['Answers'] = train_df['Answers'].apply(clean_text)
test_df['Questions'] = test_df['Questions'].apply(clean_text)
test_df['Answers'] = test_df['Answers'].apply(clean_text)
val_df['Questions'] = val_df['Questions'].apply(clean_text)
val_df['Answers'] = val_df['Answers'].apply(clean_text)
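In `clean_text`, tags are removed after whitespace has been collapsed, so deleting a tag that sits between spaces can leave a double space, and deleting one between two words joins them. The sketch below is one possible variant, not the notebook's method: it strips tags first (replacing them with a space) and collapses whitespace afterwards. The name `clean_text_tags_first` and the sample string are hypothetical.

```python
import re

def clean_text_tags_first(text):
    """Variant of clean_text that strips tags before collapsing whitespace."""
    text = re.sub(r'<.*?>', ' ', text)  # replace tags with a space so words do not merge
    text = re.sub(r'\s+', ' ', text)    # collapse whitespace runs, including \r\n
    return text.strip().lower()

sample = "What is <b>Alzheimer's</b>\r\n disease? "
print(clean_text_tags_first(sample))
```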

In [8]:
train_df.head()

Unnamed: 0,Questions,Answers
0,what is alzheimer’s disease?,alzheimer’s disease is the most common form of...
1,what causes alzheimer's disease?,the fundamental causes of alzheimer’s disease ...
2,what are the symptoms of alzheimer's disease?,early signs and symptoms of alzheimer’s diseas...
3,is alzheimer's disease the same thing as demen...,dementia is a syndrome and has many causes inc...
4,how common is alzheimer's disease?,approximately 50 million people worldwide are ...


In [9]:
# Convert Pandas DataFrame to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
val_dataset = Dataset.from_pandas(val_df)
train_dataset

Dataset({
    features: ['Questions', 'Answers'],
    num_rows: 22638
})

<a id="4"></a>

<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 10px 10px 10px #042b4c">T5 Model</p></b></h1>

<a class="btn" href="#home">Table of Contents</a>

In [10]:
# Initialize tokenizer and model
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


In [11]:
max_input = max([len(tokenizer.encode(text)) for text in train_dataset['Questions']])
max_output = max([len(tokenizer.encode(text)) for text in train_dataset['Answers']])
print(f"Calculated max_input: {max_input}")
print(f"Calculated max_output: {max_output}")

Token indices sequence length is longer than the specified maximum sequence length for this model (1635 > 512). Running this sequence through the model will result in indexing errors


Calculated max_input: 125
Calculated max_output: 1635
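The maximum of 1635 tokens may be driven by a few very long answers. If most answers are much shorter, as the sample answers shown earlier suggest, a percentile of the length distribution can be a more useful guide for choosing `max_length` than the maximum. A minimal sketch follows; the helper `length_percentile` and the token counts are hypothetical, and in the notebook the counts would come from `tokenizer.encode`.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile of a list of token counts (pct in 0..100)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical token counts with one long outlier
answer_lengths = [40, 55, 62, 48, 70, 51, 66, 58, 45, 1635]
print(length_percentile(answer_lengths, 90))
print(length_percentile(answer_lengths, 100))
```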


In [12]:
# Tokenization function
def tokenize_function(examples):
    inputs = tokenizer(examples['Questions'], truncation=True, padding="max_length", max_length=256)
    targets = tokenizer(examples['Answers'], truncation=True, padding="max_length", max_length=1024)
    inputs['labels'] = targets['input_ids']
    return inputs
# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
print(train_dataset[0])

Map:   0%|          | 0/22638 [00:00<?, ? examples/s]

Map:   0%|          | 0/1259 [00:00<?, ? examples/s]

Map:   0%|          | 0/1257 [00:00<?, ? examples/s]

{'Questions': 'what is alzheimer’s disease?', 'Answers': 'alzheimer’s disease is the most common form of dementia. alzheimer’s is a progressive neurodegenerative condition that impacts a person’s memory and other cognitive functions to a degree that inhibits daily tasks and activities.', 'input_ids': [125, 19, 491, 172, 3254, 49, 22, 7, 1994, 58, 1, 0, 0, 0, ...
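As written, the labels keep the pad id (0 for T5) in every padded position, so the training loss is also computed on padding. That likely makes the reported losses look lower than the loss on real answer tokens. A common convention is to replace pad ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows this on plain lists; the helper name `mask_padding` and the sample ids are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default
PAD_TOKEN_ID = 0     # T5's pad token id (tokenizer.pad_token_id)

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids with IGNORE_INDEX in a batch of label id lists."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]

# Two short, made-up label rows padded to length 6
batch_labels = [[125, 19, 1994, 1, 0, 0], [491, 1, 0, 0, 0, 0]]
print(mask_padding(batch_labels))
```

Inside `tokenize_function`, this could be applied as `inputs['labels'] = mask_padding(targets['input_ids'], tokenizer.pad_token_id)`. Losses from a run with masked labels would not be directly comparable to the ones reported in this notebook.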

In [13]:
training_args = TrainingArguments(
    output_dir="./Finetuning_T5_HealthCare_Chatbot",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
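To get a feel for how long this configuration trains, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation; on a multi-GPU Kaggle session the effective batch size, and therefore the step count, would differ. The helper name is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps, keeping the last partial batch of each epoch."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

# 22638 training rows, batch size 4, 10 epochs (values from this notebook)
steps_per_epoch, total_steps = total_optimizer_steps(22638, 4, 10)
print(steps_per_epoch, total_steps)
```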

In [14]:
trainer.train()
trainer.save_model()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,0.1198,0.146376
2,0.1098,0.140238
3,0.1032,0.137282
4,0.1016,0.134929
5,0.1019,0.133123
6,0.0972,0.132139
7,0.0964,0.13127
8,0.0947,0.130832
9,0.0959,0.130483
10,0.0933,0.130393
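The table shows the validation loss still decreasing at epoch 10, but by less each epoch. A small sketch that makes the per-epoch improvement explicit, using the validation losses copied from the table above (the helper name is an assumption):

```python
# Validation losses for epochs 1..10, copied from the training log above
val_loss = [0.146376, 0.140238, 0.137282, 0.134929, 0.133123,
            0.132139, 0.131270, 0.130832, 0.130483, 0.130393]

def epoch_improvements(losses):
    """Loss decrease from each epoch to the next (positive means improvement)."""
    return [round(prev - curr, 6) for prev, curr in zip(losses, losses[1:])]

deltas = epoch_improvements(val_loss)
print(deltas[0], deltas[-1])
```

If the improvement in the last few epochs is judged too small to matter, fewer epochs or an early-stopping criterion might save training time; that trade-off is a judgment call rather than something the log decides on its own.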


In [15]:
tokenizer.save_pretrained(training_args.output_dir)

('./Finetuning_T5_HealthCare_Chatbot/tokenizer_config.json',
 './Finetuning_T5_HealthCare_Chatbot/special_tokens_map.json',
 './Finetuning_T5_HealthCare_Chatbot/spiece.model',
 './Finetuning_T5_HealthCare_Chatbot/added_tokens.json')

In [None]:
login(token="HUGGINGFACE_TOKEN")
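Rather than pasting a token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token has been stored under `HF_TOKEN`; the helper `read_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the Hugging Face token from an environment mapping, or raise."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"{var_name} is not set")
    return token
```

It could then be used as `login(token=read_hf_token())`, which keeps the secret out of the saved notebook.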

In [17]:
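repo_name = "ahmed792002/Finetuning_T5_HealthCare_Chatbot"
# Trainer.push_to_hub takes a commit message as its first argument; the target repo
# comes from training_args (hub_model_id, or the output_dir name by default).
trainer.push_to_hub(commit_message="End of training")
tokenizer.push_to_hub(repo_name)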


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/ahmed792002/Finetuning_T5_HealthCare_Chatbot/commit/dd995136b2bba3be95d81c5436c259061f981df8', commit_message='Upload tokenizer', commit_description='', oid='dd995136b2bba3be95d81c5436c259061f981df8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ahmed792002/Finetuning_T5_HealthCare_Chatbot', endpoint='https://huggingface.co', repo_type='model', repo_id='ahmed792002/Finetuning_T5_HealthCare_Chatbot'), pr_revision=None, pr_num=None)

<a id="3"></a>

<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 10px 10px 10px #042b4c">Evaluation</p></b></h1>

<a class="btn" href="#home">Table of Contents</a>

In [18]:
results = trainer.evaluate(test_dataset)
print("Evaluation results:")
print("Test Loss",results["eval_loss"])

Evaluation results:
Test Loss 0.10196499526500702
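The test loss is a single number and does not show how closely generated answers match the reference answers. As a rough complement, a word-overlap score between a generated answer and its reference can be computed. The sketch below is a simple unigram F1 on whitespace tokens, with made-up example strings. It is not the metric used in this notebook, and an established ROUGE implementation (for example via the `evaluate` library) would be the more standard choice.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction and reference
print(token_f1("hsps prevent protein aggregation",
               "hsps help prevent protein aggregation"))
```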


<a id="8"></a>

<h1><b><p style="background-image: url(https://i.postimg.cc/0Qwf8YX6/2927262.jpg);font-family:camtasia;font-size:110%;color:white;text-align:center;border-radius:15px 50px; padding:7px; border:solid 2px #09375b; box-shadow: 20px 10px 10px #042b4c">Predictions on the Test Set</p></b></h1>

In [19]:
tokenizer = T5Tokenizer.from_pretrained("ahmed792002/Finetuning_T5_HealthCare_Chatbot")
model = T5ForConditionalGeneration.from_pretrained("ahmed792002/Finetuning_T5_HealthCare_Chatbot")


In [20]:
device = model.device

def chatbot(query):
    query = clean_text(query)
    # Tokenize the query and move the tensors to the model's device
    encoded = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
    inputs = {key: value.to(device) for key, value in encoded.items()}
    outputs = model.generate(
        **inputs,  # pass input_ids and attention_mask from the device-side copy
        max_length=1024,
        num_beams=5,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
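Several generated answers in the cells that follow repeat the same word or phrase. `generate` accepts decoding options such as `no_repeat_ngram_size` and `repetition_penalty` that are commonly used to reduce this. The dictionary below is a sketch of such settings; every value is an untuned assumption, not something the notebook's results were produced with.

```python
# Hypothetical decoding settings; the values are illustrative and untuned.
generation_kwargs = {
    "max_new_tokens": 256,       # cap on generated tokens, independent of input length
    "num_beams": 5,              # same beam width as the chatbot function
    "early_stopping": True,
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram within one output
    "repetition_penalty": 1.2,   # values above 1.0 penalize already generated tokens
}
print(sorted(generation_kwargs))
```

Inside `chatbot`, these could replace the explicit arguments, for example `model.generate(**inputs, **generation_kwargs)`. Outputs would then differ from the ones shown in this notebook.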

In [21]:
sequence = test_dataset['Questions'][100]
print("Q : ",test_dataset['Questions'][100])
print("\n","*"*20,"\n")
print("A : ",test_dataset['Answers'][100])
print("\n","*"*20,"\n")
print("G : ",chatbot(sequence)) 

Q :  what is the significance of targeting hsps as a strategy for alzheimer's disease treatment?

 ******************** 

A :  targeting hsps offers a new mechanism of action for reducing pathogenic tau levels and restoring normal tau homeostasis, which are key factors in alzheimer's disease progression. hsps play a crucial role in protein folding and preventing protein aggregation, processes that are disrupted in neurodegenerative diseases like alzheimer's.

 ******************** 

G :  targeting hsps as a strategy for alzheimer's disease treatment is a key area of research. it's crucial to consult with a healthcare provider for personalized care.


In [22]:
sequence = test_dataset['Questions'][50]
print("Q : ",test_dataset['Questions'][50])
print("\n","*"*20,"\n")
print("A : ",test_dataset['Answers'][50])
print("\n","*"*20,"\n")
print("G : ",chatbot(sequence)) 

Q :  what are the different types of chaperones and their functions in protein folding?

 ******************** 

A :  there are three main types of chaperones: molecular chaperones, pharmacological chaperones, and chemical chaperones. molecular chaperones assist other proteins in folding or unfolding. pharmacological chaperones are small compounds that induce refolding of proteins. chemical chaperones stabilize protein structure.

 ******************** 

G :  chaperones have different functions in protein folding, such as apnea, apnea, apnea, and apnea, which are common in the brain. these functions can include apnea, apnea, apnea, and apnea.


In [23]:
sequence = test_dataset['Questions'][150]
print("Q : ",test_dataset['Questions'][150])
print("\n","*"*20,"\n")
print("A : ",test_dataset['Answers'][150])
print("\n","*"*20,"\n")
print("G : ",chatbot(sequence)) 

Q :  what are other disease-modifying treatments under investigation for alzheimer's disease treatment?

 ******************** 

A :  other dmts targeting aβ and tau pathologies, such as aducanumab, gantenerumab, crenezumab, tideglusib, lithium, and others, are under investigation.

 ******************** 

G :  other disease-modifying treatments for alzheimer's disease are being investigated for their effectiveness, effectiveness, and efficacy in reducing the risk of developing alzheimer's disease.


In [24]:
sequence = "what is alzheimer's disease"
print("Q : ",sequence)
print("\n","*"*20,"\n")
print("G : ",chatbot(sequence)) 

Q :  what is alzheimer's disease

 ******************** 

G :  alzheimer's disease is a form of dementia, characterized by memory loss, confusion, confusion, and confusion, which are common in individuals with alzheimer's.


In [25]:
sequence = "what is symptoms of alzheimer's later stages"
print("Q : ",sequence)
print("\n","*"*20,"\n")
print("G : ",chatbot(sequence)) 

Q :  what is symptoms of alzheimer's later stages

 ******************** 

G :  symptoms of alzheimer's later stages include memory loss, confusion, confusion, difficulty with communication, difficulty with tasks, difficulty with tasks, difficulty with tasks, difficulty with tasks, difficulty with tasks, and difficulty with tasks.
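The repeated phrases in these generated answers can be quantified with a simple distinct n-gram ratio: the share of n-grams in a text that are unique. The sketch below is one rough way to flag degenerate outputs. The function name is an assumption, and the two strings are shortened, illustrative versions of the kind of output seen above.

```python
def distinct_ngram_ratio(text, n=2):
    """Share of n-grams in text that are unique (1.0 means no repeated n-gram)."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)

repetitive = "difficulty with tasks, difficulty with tasks, difficulty with tasks"
varied = "memory loss, confusion, and difficulty with communication"
print(distinct_ngram_ratio(repetitive), distinct_ngram_ratio(varied))
```

A low ratio averaged over the test set would suggest that decoding settings or training deserve another look, although a high ratio alone does not mean the answers are correct.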

