In [1]:
!pip install transformers
!pip install peft
!pip install trl
!pip install bitsandbytes
!pip install scipy



In [2]:
import os
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    HfArgumentParser
)
from datasets import load_dataset
import torch
import bitsandbytes as bnb
from huggingface_hub import login, HfFolder
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, PeftConfig, PeftModel, prepare_model_for_kbit_training

In [3]:
#allocate resource
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

#model

In [4]:
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

Qwen2.5-1.5B-Instruct is a language model developed by Alibaba's Qwen team, part of the Qwen2.5 series. This model, with 1.5 billion parameters, has been fine-tuned on instruction-following tasks to enhance its ability to understand and execute specific instructions.

Key Features:
Architecture and Mechanisms:

It uses a causal language model architecture, incorporating advanced techniques such as Rotary Positional Embeddings (RoPE), SwiGLU activation functions, RMSNorm normalization, and QKV bias in attention mechanisms, improving its expressive power and performance.
The model utilizes an attention mechanism that allows it to focus on relevant parts of the input for better understanding.
Parameter Scale:

The model has about 1.5 billion parameters in total, 1.31 billion of them outside the embedding layer. This gives it enough capacity to process and generate text while remaining small enough to run efficiently.
Context Length:

It supports a maximum context length of 32,768 tokens, with a maximum generation length of 8,192 tokens, making it well-suited for handling long text generation tasks.


Strengths:
Instruction-Following Ability:
The model excels in understanding and following explicit instructions due to its fine-tuning on instruction-following tasks. This makes it suitable for applications where precise adherence to instructions is crucial, such as chatbots, customer support, and interactive applications.
Multilingual Capabilities:
Trained on a variety of multilingual datasets, the model can handle tasks in multiple languages, making it adaptable for global applications.

Potential Weaknesses:
Repetitive Generation:

In some tests, the model has shown tendencies to generate repetitive content, which may require further fine-tuning to reduce such behavior.
Resource Intensive:

Even at 1.5 billion parameters, the model can be resource-hungry, requiring a GPU with several gigabytes of memory for comfortable inference. This could be a limitation in environments with limited resources, such as edge devices or low-resource servers. The 4-bit quantization configured in the next cell reduces the memory footprint.
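To put the resource claim in numbers, here is a rough back-of-the-envelope sketch of the memory needed just to hold the weights at different precisions. It assumes the approximate 1.5 billion parameter count quoted above and ignores activations, optimizer state, and the small overhead of quantization constants, so treat the figures as lower bounds rather than measurements.

```python
# Rough weight-memory estimate: weights only, no activations or optimizer state.
PARAMS = 1.5e9  # approximate parameter count quoted in the description above

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "4-bit (nf4)": 0.5,
}

def weight_memory_gib(num_params, bytes_per_param):
    """Memory in GiB needed to store num_params weights."""
    return num_params * bytes_per_param / 1024**3

estimates = {name: weight_memory_gib(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gib in estimates.items():
    print(f"{name:>12}: {gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take roughly an eighth of the float32 footprint, which is why the quantized model fits on a small Colab GPU.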

In [5]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [6]:
model=AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=bnb_config
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
tokenizer=AutoTokenizer.from_pretrained(model_name)

In [8]:
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

In [9]:
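# The KV cache only helps at generation time and is not compatible with gradient checkpointing.
model.config.use_cache = False
# Checkpointing is also requested through TrainingArguments(gradient_checkpointing=True) in the training setup.
model.config.gradient_checkpointing = True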

In [10]:
#Lora
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
peft_model = get_peft_model(model, peft_config)
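As a rough illustration of what r=8 and lora_alpha=16 mean, the sketch below counts the parameters LoRA adds to a single weight matrix and the scaling factor applied to the low-rank update. The matrix size (1536 x 1536) is an assumed, illustrative value and is not read from the model config.

```python
# Illustrative only: D_IN and D_OUT are assumed sizes for one square
# projection matrix, not values read from the model config.
D_IN, D_OUT = 1536, 1536
R, LORA_ALPHA = 8, 16  # values from the LoraConfig above

full_params = D_IN * D_OUT           # parameters in the frozen weight matrix
lora_params = R * (D_IN + D_OUT)     # A is (R x D_IN), B is (D_OUT x R)
scaling = LORA_ALPHA / R             # factor applied to the low-rank update B @ A

print(full_params, lora_params, scaling)
print(f"trainable fraction for this matrix: {lora_params / full_params:.2%}")
```

For a matrix of this assumed size, the adapter trains about 1% as many parameters as the matrix itself contains.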

In [11]:
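# Caution: this marks every float32/float64 parameter as trainable, including
# base-model weights that get_peft_model froze, not only the LoRA adapter weights.
for param in model.parameters():
    if param.dtype in [torch.float32, torch.float64]:
        param.requires_grad = True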


# dataset

In [12]:
dataset = load_dataset("ruslanmv/ai-medical-chatbot")

In [13]:
dataset['train']

Dataset({
    features: ['Description', 'Patient', 'Doctor'],
    num_rows: 256916
})

In [14]:
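def format_chat(example, tokenizer):
    # Join one patient question and the doctor's answer into a single training string.
    message = f"Patient: {example['Patient']}\nDoctor: {example['Doctor']}"
    example['text'] = message

    # Tokenize to a fixed length of 512. These input_ids are overwritten later by
    # preprocess_function, which re-tokenizes 'text' with max_length=128.
    encoding = tokenizer(message, padding="max_length", truncation=True, max_length=512)
    example['input_ids'] = encoding['input_ids']
    example['attention_mask'] = encoding['attention_mask']

    return example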


In [15]:
formatted_dataset = dataset.map(lambda example: format_chat(example, tokenizer))

In [16]:

# Shuffle once with a fixed seed so the two splits cannot overlap (the seed value is arbitrary).
shuffled = formatted_dataset['train'].shuffle(seed=42)
split_index = int(len(shuffled) * 0.8)
train_dataset = shuffled.select(range(split_index))                # 80% training set
eval_dataset = shuffled.select(range(split_index, len(shuffled)))  # 20% validation set



In [17]:
def preprocess_function(examples):
    # Tokenize the text to a fixed length of 128 tokens
    tokenized_inputs = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

    # Labels are a copy of the input ids; the model shifts them by one position
    # internally when it computes the loss. Padding positions are not masked here,
    # so they also count toward the loss.
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()

    return tokenized_inputs

# Apply this function to your dataset
train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/205532 [00:00<?, ? examples/s]

Map:   0%|          | 0/51384 [00:00<?, ? examples/s]
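Because every example is padded to a fixed length, a common refinement is to exclude the padded positions from the loss. The sketch below is not what this notebook ran; it assumes the usual convention that a label of -100 is skipped by the cross-entropy loss, and the helper name and token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by two padding tokens (ids are made up).
input_ids = [101, 57, 902, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [101, 57, 902, -100, -100]
```

Applied per example inside the preprocessing step, this would make the reported loss reflect only real tokens, so the loss values would not be directly comparable with the ones recorded in this run.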

In [18]:
smaller_train_dataset = train_dataset.select(range(min(len(train_dataset), 400)))
smaller_eval_dataset = eval_dataset.select(range(min(len(eval_dataset), 50)))

#fine tuning

In [19]:

training_args = TrainingArguments(
    output_dir="./medical-dialogue-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=3,
    num_train_epochs=4,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    gradient_checkpointing=True,
    optim="adamw_8bit"
)







Optimizer Choice:

AdamW: A common optimizer for NLP tasks that adapts the step size for each parameter and applies decoupled weight decay. The "8bit" variant stores the optimizer state in 8-bit precision to save memory.

Learning Rate and Batch Size:

Learning Rate: Set to 2e-5, a small value that avoids large updates that could overwrite pre-trained knowledge.
Batch Size: 2 per device, so only 2 samples are processed at once, which keeps memory use low for large models.
Gradient Accumulation: Gradients are accumulated over 3 batches before each weight update, giving an effective batch size of 2 × 3 = 6 per device.

Training and Stopping:

Epochs: Training runs for 4 epochs (num_train_epochs=4), where one epoch is a complete pass through the training data, and stops after the fourth.
Evaluation: The model is evaluated and checkpointed after each epoch, and the best checkpoint is reloaded at the end (load_best_model_at_end=True).
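The interaction between batch size, gradient accumulation, and epochs can be checked with a little arithmetic. The sketch below assumes a single device and that a trailing partial accumulation group is dropped at the end of each epoch; under those assumptions it reproduces the global_step=264 reported by trainer.train() in this run.

```python
num_samples = 400                # size of smaller_train_dataset
per_device_batch_size = 2
gradient_accumulation_steps = 3
num_epochs = 4

effective_batch_size = per_device_batch_size * gradient_accumulation_steps
batches_per_epoch = -(-num_samples // per_device_batch_size)  # ceiling division
# Assumption: a trailing partial accumulation group is dropped (floor division).
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

With 66 updates per epoch, each epoch uses 198 of the 200 batches, which is consistent with the final epoch value of 3.99 rather than 4.0 in the training output.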

In [20]:

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=smaller_train_dataset,
    eval_dataset=smaller_eval_dataset,
    tokenizer=tokenizer,
)



In [21]:

trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mffli[0m ([33mffli-uc-davis[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss
0,No log,3.001187
1,No log,2.944574
2,No log,2.929614
3,No log,2.926597


TrainOutput(global_step=264, training_loss=2.8850784301757812, metrics={'train_runtime': 929.9083, 'train_samples_per_second': 1.721, 'train_steps_per_second': 0.284, 'total_flos': 1601413577441280.0, 'train_loss': 2.8850784301757812, 'epoch': 3.99})

For the loss function, I use cross-entropy, for two reasons.

Text Generation Task: Generating a chatbot response is an autoregressive task in which the model predicts the next token from the previous tokens. Cross-entropy loss suits such tasks because it penalizes the model for assigning low probability to the correct next token.

Model Output as Probability Distributions: Transformer-based language models like GPT output a probability distribution over the vocabulary at each time step. Cross-entropy loss directly measures how far the predicted probabilities are from the true labels.
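To make the loss concrete, here is a minimal pure-Python sketch of cross-entropy for a single next-token prediction. The four-token vocabulary and the logits are invented for illustration; the last line converts the final validation loss reported above into a perplexity.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target_index])

# Toy 4-token vocabulary; the logits are made up for illustration.
logits = [2.0, 0.5, 0.1, -1.0]
loss_correct = cross_entropy(logits, 0)  # target is the highest-scoring token
loss_wrong = cross_entropy(logits, 3)    # target is the lowest-scoring token
print(loss_correct, loss_wrong)

# Perplexity is exp(mean loss); 2.926597 is the final validation loss reported above.
perplexity = math.exp(2.926597)
print(perplexity)
```

The loss is small when the model puts most of its probability on the correct token and grows as that probability shrinks. A perplexity in the high teens suggests the model is still quite uncertain about each next token on this data.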

In [22]:

trainer.save_model("./medical-dialogue-model")

In [23]:
from google.colab import drive
drive.mount('/content/drive')


trainer.save_model("/content/drive/MyDrive/medical-dialogue-model")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#multi-turn dialogue


In [24]:
!pip install evaluate



In [25]:
import torch
import traceback
import logging
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine
import evaluate

The architecture supports multi-turn dialogue by utilizing a DialogueContextManager that stores and manages the conversation history. This context is then used to generate the next response, ensuring the model takes previous interactions into account, making the conversation more coherent.

To handle irrelevant responses, the model can be fine-tuned on domain-specific datasets, improving its understanding of the context and the expected output. Additionally, the use of context management ensures that the model generates responses that are grounded in the ongoing conversation, reducing the likelihood of responses that don't align with the current topic.

In [26]:
class DialogueContextManager:
    def __init__(self, max_context_length=5):
        self.context_history = []
        self.max_length = max_context_length

    def add_turn(self, speaker, message):
        if len(self.context_history) >= self.max_length:
            self.context_history.pop(0)

        self.context_history.append({
            'speaker': speaker,
            'message': message
        })

    def get_context(self):
        return self.context_history

    def generate_context_prompt(self):
        context_prompt = ""
        for turn in self.context_history:
            context_prompt += f"{turn['speaker']}: {turn['message']}\n"
        return context_prompt
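The sliding-window behavior of the context manager is easy to see with a small stand-alone sketch. It uses collections.deque with maxlen as an assumed equivalent of the pop(0) logic above, and the dialogue lines are invented for illustration.

```python
from collections import deque

# deque(maxlen=...) drops the oldest entry automatically, mirroring add_turn above.
history = deque(maxlen=5)

exchanges = [
    ("Patient", "I have a headache."),
    ("Doctor", "How long has it lasted?"),
    ("Patient", "Two days."),
    ("Doctor", "Any fever?"),
    ("Patient", "No."),
    ("Doctor", "Are you drinking enough water?"),
]
for speaker, message in exchanges:
    history.append({"speaker": speaker, "message": message})

# Same prompt layout as generate_context_prompt: one "Speaker: message" line per turn.
prompt = "".join(f"{turn['speaker']}: {turn['message']}\n" for turn in history)
print(prompt)
```

Because each exchange adds two entries, a limit of 5 means the window holds two and a half exchanges: after the third exchange the first patient message is gone and the history starts with a doctor turn. An even limit would keep question and answer pairs together.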

In [27]:
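def generate_response(model, tokenizer, context_manager, current_query):
    # Build the prompt from the stored history plus the new patient message.
    context_prompt = context_manager.generate_context_prompt()
    full_input = f"{context_prompt}\nPatient: {current_query}\nDoctor:"

    # Move the inputs to the model's device before generating.
    inputs = tokenizer(full_input, return_tensors="pt").to(model.device)
    # max_length counts the prompt tokens as well as the newly generated ones.
    outputs = model.generate(**inputs, max_length=512)

    # outputs[0] holds the prompt followed by the continuation, so the decoded
    # text (and the history entry stored below) repeats the prompt.
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    context_manager.add_turn("Patient", current_query)
    context_manager.add_turn("Doctor", response)

    return response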
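Since the decoded output of a causal model contains the prompt as well as the continuation, one possible post-processing step is to keep only the new doctor turn before printing it or storing it in the history. The helper below is a hypothetical sketch, not part of the notebook: the name extract_reply and the stop markers are my own choices, and it works on plain strings so it can be tried without a model.

```python
FENCE = "`" * 3  # Markdown code-fence marker

def extract_reply(decoded_text, prompt, stop_markers=("\nPatient:", "\nAssistant:", FENCE)):
    """Return only the newly generated doctor turn from a decoded generation."""
    # Drop the echoed prompt if the decoded text starts with it.
    reply = decoded_text[len(prompt):] if decoded_text.startswith(prompt) else decoded_text
    # Cut at the earliest marker that signals the model ran past its own turn.
    cut = len(reply)
    for marker in stop_markers:
        index = reply.find(marker)
        if index != -1:
            cut = min(cut, index)
    return reply[:cut].strip()

prompt = "\nPatient: I have a headache\nDoctor:"
decoded = (
    prompt
    + " I'm sorry to hear that. Have you been getting enough sleep lately?\n"
    + FENCE + "\n\nAssistant: " + FENCE + "python"
)
print(extract_reply(decoded, prompt))
```

String matching can fail if decoding does not reproduce the prompt exactly, so slicing outputs[0] at the prompt's token length before decoding is a more robust alternative.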


In [28]:
def chat():

    model_name = "/content/drive/MyDrive/medical-dialogue-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)


    context_manager = DialogueContextManager(max_context_length=5)


    while True:
        current_query = input("Patient: ")

        if current_query.lower() == "exit":
            print("Doctor: Goodbye!")
            break


        response = generate_response(model, tokenizer, context_manager, current_query)


        print(f"Doctor: {response}")


if __name__ == "__main__":
    chat()


Patient: I have a headache
Doctor: 
Patient: I have a headache
Doctor: I'm sorry to hear that. Have you been getting enough sleep lately?
```

Assistant: ```python
import re

# Sample text containing the doctor's response and patient's question
doctor_response = "I'm sorry to hear that. Have you been getting enough sleep lately?"

# Regular expression pattern to find questions about sleep habits in the response
pattern = r"Have\syou\sbeen\sgetting\senough\ssleep\slately"

# Search for the pattern in the doctor's response
if re.search(pattern, doctor_response):
    print("The patient has asked about their sleep habits.")
else:
    print("The patient did not ask about their sleep habits.")
```
Patient: exit
Doctor: Goodbye!


#evaluate

I chose BLEU as the metric. BLEU (Bilingual Evaluation Understudy) is a useful metric for evaluating chatbot responses because it measures the overlap of n-grams between the generated and reference responses. It is precision-oriented: it assesses how much of the model's output also appears in the human-written reference.
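The core of BLEU is a clipped ("modified") n-gram precision. The sketch below computes it for one prediction and one reference using whitespace tokenization. It is a simplified illustration with invented sentences, not the implementation behind the evaluate library, and it leaves out the brevity penalty and the averaging over n-gram orders.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(prediction, reference, n):
    """Clipped n-gram precision for one prediction and one reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    if not pred_counts:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / sum(pred_counts.values())

prediction = "you should rest and drink water"
reference = "you should rest and drink plenty of water"
p1 = modified_precision(prediction, reference, 1)
p2 = modified_precision(prediction, reference, 2)
print(p1, p2)
```

Here every predicted word appears in the reference, but one of the five predicted bigrams ("drink water") does not, so the bigram precision is lower than the unigram precision.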

In [29]:
import math
import evaluate

bleu = evaluate.load("bleu")

def compute_bleu(predictions, references):
    return bleu.compute(predictions=predictions, references=references)


In [32]:
predictions = []
references = []

# Ensure model is on the same device as inputs
device = model.device  # Get the model's device (CUDA or CPU)

# Iterate over the evaluation dataset
for example in smaller_eval_dataset:
    # Use the patient's message as the prompt. It is passed without the
    # "Patient: ...\nDoctor:" template that was used for training.
    input_text = example['Patient']

    # Tokenize the input and move it to the same device as the model
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # Generate a prediction; the decoded text includes the prompt as well as the continuation
    outputs = model.generate(**inputs, max_length=512)
    predicted_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # The doctor's answer from the dataset is the reference
    reference_text = example['Doctor']

    # Save prediction and reference (BLEU expects a list of references per prediction)
    predictions.append(predicted_text)
    references.append([reference_text])

# Calculate and output BLEU score
print(compute_bleu(predictions, references))





{'bleu': 0.007200882635098211, 'precisions': [0.1433465560764515, 0.01826621654207861, 0.0024186721489902045, 0.0004245511887433285], 'brevity_penalty': 1.0, 'length_ratio': 3.299226650803093, 'translation_length': 16638, 'reference_length': 5043}
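The reported score can be reproduced from the other fields in the output, which helps when reading it. The sketch below assumes the standard BLEU definition: the geometric mean of the four n-gram precisions multiplied by the brevity penalty. All numbers are copied from the output above.

```python
import math

# Values copied from the evaluation output above.
precisions = [0.1433465560764515, 0.01826621654207861,
              0.0024186721489902045, 0.0004245511887433285]
translation_length = 16638
reference_length = 5043

# The brevity penalty only penalizes predictions shorter than the references.
if translation_length > reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu_score = brevity_penalty * geometric_mean
length_ratio = translation_length / reference_length
print(bleu_score, length_ratio)
```

The predictions are about 3.3 times longer than the references, so the brevity penalty does not apply and the score is driven entirely by the low n-gram precisions. Since the decoded predictions include the prompt and can run up to 512 tokens, trimming them to the generated answer would likely change the precisions.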


Improvements:
Coherent Multi-Turn Dialogue: By keeping the conversation history in the DialogueContextManager, the model can condition each answer on previous exchanges, which helps it give responses that align with the ongoing dialogue.

Domain-Specific Responses: If fine-tuned on a domain-specific dataset, the model produces responses that are tailored to specific areas like healthcare or finance, making it more useful in specialized contexts.

Limitations:
Limited Context Window: If the context length is too small (e.g., limiting to just 5-7 turns), the model might forget important details from earlier interactions, which can result in irrelevant or disconnected responses.

Response Relevance: Despite fine-tuning, there are still cases where the model generates responses that are off-topic or lack detail, as in the sample chat above, where the reply continues into an unrelated code snippet. This can happen due to insufficient domain training or the model being unable to fully understand nuanced contexts.

When comparing the fine-tuned model with the base model, here’s how the trade-offs are expected to play out:

Accuracy vs. Computational Efficiency: Fine-tuning the model for a specific domain (e.g., healthcare) is intended to improve its accuracy, especially in generating domain-specific responses. The cost is the extra compute spent on training; the base model needs no training but may be less accurate in specialized tasks.

Response Relevance vs. Accuracy: The fine-tuned model should be better at generating contextually relevant responses because it has seen domain-specific data. However, this increased relevance might reduce accuracy in general scenarios, as the model might prioritize relevance within the domain over factual precision. The base model is likely more balanced in handling general queries but may lack the fine-tuned contextual understanding for specialized tasks.

In summary, the fine-tuned model trades additional training compute for potentially higher accuracy and response relevance, making it suitable for domain-specific tasks where precision is crucial. The base model, on the other hand, balances speed and general accuracy but may not perform as well in specialized applications.

#future enhancement

Common Errors or Limitations:

Contextual Relevance: The model sometimes generates responses that are not closely related to the previous conversation, especially when dealing with multi-turn dialogues. It may fail to maintain consistent context, leading to irrelevant or off-topic answers.
Ambiguity in Responses: The model may struggle with ambiguous queries or when the question is not specific enough. It could generate answers that are too general, missing out on the nuances required for a specific context.
Overfitting to Training Data: If the model is overly fine-tuned, it may exhibit behavior where it over-relies on patterns seen in the training data and fails to generalize well in more diverse scenarios.

Suggestions for Improvements:
Expanding the Training Dataset: This run trained on only 400 examples. Expanding the training data with more examples from real-world scenarios, including edge cases, would help the model handle a wider variety of user inputs, generalize better, and produce more accurate responses.

Knowledge Distillation: A smaller distilled model can be fine-tuned further on specific domains, reducing the model size while maintaining most of the performance. This may also help with overfitting by balancing generalization and domain expertise.

Scalability Considerations: To handle large datasets, the chatbot can use more compact transformer architectures in the spirit of DistilBERT or ALBERT, which retain much of the performance of larger models but are smaller and faster to run. Both of those are encoder-only models, so a generative chatbot would need a comparably compressed causal language model. This allows it to scale more effectively in real-world applications.

Distributed Computing: For large datasets and fast response times, the model can be trained and deployed on a distributed system that uses multiple GPUs or TPUs for parallel processing. This allows it to handle larger batches of data and provide faster responses.