In [5]:
# Test the model with a sample question
question = "What are the common symptoms of diabetes?"
response = generate_response(question)
print(f"Q: {question}\nA: {response}")

Q: What are the common symptoms of diabetes?
A: What are the common symptoms of diabetes? 1. Unquenchable thirst 2. Frequent urination 3. Fatigue 4. Unusual weight loss 5. Numbness and tingling in the legs and feet 6. Vision problems 7. Sore throat 8. Dry mouth 9. Bad breath 10. Swollen gums
Answer
Yes, you are correct. These symptoms can be caused by high blood pressure. It is advisable to
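
The answer above begins with a copy of the question because the decoded sequence from a causal language model includes the prompt tokens. A minimal sketch of one way to drop that echo after decoding is shown below. `strip_prompt_echo` is a hypothetical helper, not part of the notebook, and the strings are shortened from the output above. An alternative, assuming the same `generate_response` layout, is to decode only `output_ids[0][input_ids.shape[1]:]` and to bound the answer with `max_new_tokens` instead of `max_length`.

```python
def strip_prompt_echo(prompt: str, decoded: str) -> str:
    """Return `decoded` without a leading copy of `prompt`, if one is present.

    Hypothetical helper for illustration; it only handles an exact echo
    at the very start of the decoded text.
    """
    text = decoded.lstrip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Illustrative strings modeled on the cell output above (shortened).
question = "What are the common symptoms of diabetes?"
decoded = question + " 1. Unquenchable thirst 2. Frequent urination"

print(strip_prompt_echo(question, decoded))
```

String matching is the simpler option here because it can be applied to responses that were already generated. Slicing the token IDs is more robust when the tokenizer does not round-trip the prompt text exactly.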


In [15]:
# Step 1: Install Necessary Libraries
!pip install transformers accelerate torch rouge-score nltk scikit-learn bert-score

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from bert_score import score as bert_score
import nltk

# Download METEOR data
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.translate.meteor_score import meteor_score

# Step 2: Load Model and Tokenizer
model_name = "KarthikNimmagadda/Deepseek-Finetuned-Medical-Dataset"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="cuda")

# Step 3: Define Test Samples
test_data = [
    {
        "context": "Heart Disease Prevention",
        "input": "How can one prevent heart disease?",
        "expected_output": "Heart disease prevention includes a healthy diet, regular exercise, and avoiding smoking."
    },
    {
        "context": "Sleep and Wellness",
        "input": "How much sleep is needed for an adult?",
        "expected_output": "Adults typically need between 7 to 9 hours of sleep each night for optimal health."
    },
    {
        "context": "Hydration",
        "input": "Why is it important to stay hydrated?",
        "expected_output": "Staying hydrated helps maintain bodily functions, supports digestion, and regulates body temperature."
    },
    {
        "context": "Mental Health",
        "input": "How can therapy benefit mental health?",
        "expected_output": "Therapy can help individuals understand their emotions, develop coping strategies, and improve overall well-being."
    }
]





[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [14]:
# Step 4: Function to Generate AI Responses
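def generate_response(prompt, max_length=100):
    """Generate a completion for `prompt` with the fine-tuned model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    # max_length counts the prompt tokens too, so longer prompts leave less room for the answer
    output_ids = model.generate(input_ids, max_length=max_length)
    # output_ids contains the input tokens, so the decoded text starts with the prompt itself
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response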

# Step 5: Evaluation Metrics Function
def evaluate_responses(test_data):
    bleu_scores, rouge_scores, cosine_scores, bert_scores, meteor_scores = [], [], [], [], []
    vectorizer = TfidfVectorizer()

    for test in test_data:
        user_query = test["input"]
        expected_response = test["expected_output"]
        generated_response = generate_response(user_query)

        # BLEU Score
        reference = [expected_response.split()]
        candidate = generated_response.split()
        bleu = sentence_bleu(reference, candidate)
        bleu_scores.append(bleu)

        # ROUGE Score
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
        rouge = scorer.score(expected_response, generated_response)
        rouge_scores.append(rouge["rouge1"].fmeasure)

        # Cosine Similarity
        vectors = vectorizer.fit_transform([expected_response, generated_response])
        cosine = cosine_similarity(vectors[0], vectors[1])[0][0]
        cosine_scores.append(cosine)

        # BERTScore
        P, R, F1 = bert_score([generated_response], [expected_response], lang="en")
        bert_scores.append(F1.mean().item())

        # METEOR Score
        # meteor_score expects pre-tokenized input: a list of reference token lists and one hypothesis token list
        generated_tokens = generated_response.split()
        expected_tokens = expected_response.split()
        meteor = meteor_score([expected_tokens], generated_tokens)
        meteor_scores.append(meteor)

        # Print individual results
        print(f"\nUser Query: {user_query}")
        print(f"Expected Response: {expected_response}")
        print(f"AI Response: {generated_response}")
        print(f"BLEU Score: {bleu:.4f}, ROUGE Score: {rouge['rouge1'].fmeasure:.4f}, Cosine Similarity: {cosine:.4f}")
        print(f"BERTScore (F1): {F1.mean().item():.4f}, METEOR Score: {meteor:.4f}")

    # Summarized metrics
    summary = {
        "Average BLEU Score": sum(bleu_scores) / len(bleu_scores),
        "Average ROUGE Score": sum(rouge_scores) / len(rouge_scores),
        "Average Cosine Similarity": sum(cosine_scores) / len(cosine_scores),
        "Average BERTScore (F1)": sum(bert_scores) / len(bert_scores),
        "Average METEOR Score": sum(meteor_scores) / len(meteor_scores),
    }

    return summary

# Step 6: Run Evaluation
summary_results = evaluate_responses(test_data)

# Display summarized metrics
print("\n=== Evaluation Summary ===")
for metric, value in summary_results.items():
    print(f"{metric}: {value:.4f}")


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
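
The BLEU warnings above explain why every BLEU score in this run is 0.0000: the default score is a geometric mean of 1-gram to 4-gram precisions, so a single order with no overlap zeroes the whole score. The sketch below reproduces that effect in plain Python under simplifying assumptions: one reference, whitespace tokens, no brevity penalty, and made-up sentences. It is not NLTK's implementation. The `SmoothingFunction` class mentioned in the warning (in `nltk.translate.bleu_score`) is the usual way to avoid the collapse.

```python
import math
from collections import Counter


def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def geometric_mean(values):
    """Geometric mean; it is zero as soon as any single value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative sentences (assumptions, not taken from the evaluation run).
reference = "adults need 7 to 9 hours of sleep".split()
candidate = "the recommended amount of sleep is 7 and 9 hours".split()

precisions = [ngram_precision(reference, candidate, n) for n in range(1, 5)]
for n, p in enumerate(precisions, start=1):
    print(f"{n}-gram precision: {p:.4f}")
print(f"Unsmoothed geometric mean: {geometric_mean(precisions):.4f}")
```

Here half of the candidate's unigrams match the reference, yet the missing 3-gram and 4-gram overlaps force the combined score to zero. The same pattern appears in the evaluation output when the model paraphrases the expected answer.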



User Query: How can one prevent heart disease?
Expected Response: Heart disease prevention includes a healthy diet, regular exercise, and avoiding smoking.
AI Response: How can one prevent heart disease? What are the best methods to do so?

### Question
How can one prevent heart disease? What are the best methods to do so?

### Answer
1. Stop smoking. 2. Eat a low fat diet. 3. Exercise regularly. 4. Maintain a healthy weight. 5. Limit alcohol consumption. 6. Avoid fast food. 7. Keep stress levels in check. 8. Reduce salt intake. 9.
BLEU Score: 0.0000, ROUGE Score: 0.2308, Cosine Similarity: 0.1627
BERTScore (F1): 0.8619, METEOR Score: 0.2423





User Query: How much sleep is needed for an adult?
Expected Response: Adults typically need between 7 to 9 hours of sleep each night for optimal health.
AI Response: How much sleep is needed for an adult? Is there a standard recommendation?
Yes, I agree. For adults, the recommended amount of sleep is between 7 and 9 hours per night. This is because an adult body requires adequate sleep to maintain proper bodily functions, cognitive performance, and emotional well-being.
BLEU Score: 0.0000, ROUGE Score: 0.3333, Cosine Similarity: 0.2321
BERTScore (F1): 0.8984, METEOR Score: 0.4022





User Query: Why is it important to stay hydrated?
Expected Response: Staying hydrated helps maintain bodily functions, supports digestion, and regulates body temperature.
AI Response: Why is it important to stay hydrated? Well, hydration is essential for maintaining the body's normal bodily functions. When the body is properly hydrated, it has enough water to carry out all the necessary bodily functions, including digestion, circulation, respiration, and temperature regulation. It also helps to maintain a healthy digestive system and can aid in weight loss by boosting metabolism and helping the body eliminate waste. Proper hydration can also prevent urinary tract infections and kidney stones, which can be very painful and require medical
BLEU Score: 0.0000, ROUGE Score: 0.2222, Cosine Similarity: 0.2882
BERTScore (F1): 0.8986, METEOR Score: 0.3276





User Query: How can therapy benefit mental health?
Expected Response: Therapy can help individuals understand their emotions, develop coping strategies, and improve overall well-being.
AI Response: How can therapy benefit mental health? 
Therapy helps people identify and process their emotions, thoughts, and behaviors. It can help improve communication skills, build self-esteem, and foster positive relationships. Additionally, therapy can help individuals recognize and work through negative patterns that might be affecting their mental health. It can also help people work through past traumas, loss, and other life challenges.
</think>
I have read your question and will be happy to answer it. I am a clinical psychologist and
BLEU Score: 0.0000, ROUGE Score: 0.1875, Cosine Similarity: 0.3064
BERTScore (F1): 0.8807, METEOR Score: 0.2184
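
The ROUGE values above are ROUGE-1 F-measures, which combine unigram precision and recall. Because the generated responses are much longer than the expected ones (and repeat the question), precision drops even when the key facts are covered. The sketch below shows that effect with a simplified unigram F1. It assumes lowercased whitespace tokens and no stemming, so it will not reproduce the numbers from `rouge_score`, and the sentences are invented for illustration.

```python
from collections import Counter


def unigram_prf(reference: str, candidate: str):
    """Return (precision, recall, F1) over lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection: each shared token counts up to its minimum frequency.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative sentences (assumptions, not taken from the evaluation run).
reference = "stay hydrated to regulate body temperature"
short = "hydration helps regulate body temperature"
padded = short + " and also many other things besides that"

for name, text in [("short", short), ("padded", padded)]:
    p, r, f1 = unigram_prf(reference, text)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Both candidates recall the same reference words, but the padded one scores lower because the extra words dilute precision. Trimming the echoed prompt and capping the answer length would likely raise the overlap-based metrics without changing what the model says.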

=== Evaluation Summary ===
Average BLEU Score: 0.0000
Average ROUGE Score: 0.2435
Average Cosine Similarity: 0.2474
Average BERTScore (F1): 0.8849
Avera