Step 1: Set Up Your Development Environment

Install the necessary packages:

In [None]:
!pip install google-generativeai pandas datasets rouge-score


In [None]:
!pip install langchain langchain_community



In [None]:
!pip install -q google-generativeai


Import the required libraries in your notebook:

In [None]:
import google.generativeai as genai
import pandas as pd
from datasets import load_dataset
from rouge_score import rouge_scorer  # RougeScorer is used in Step 6
from langchain_community.chat_models import ChatOpenAI



Step 2: Access Gemini Pro API

Add your API key to Colab:

In [None]:
GOOGLE_API_KEY = "YOUR_API_KEY"  # Placeholder: paste your own key here and never share a notebook that contains a real one
genai.configure(api_key=GOOGLE_API_KEY)
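
A key typed directly into a cell is easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been stored in an environment variable named GOOGLE_API_KEY before the notebook starts (the helper name load_api_key is hypothetical and not part of any library):

```python
import os


def load_api_key(var_name="GOOGLE_API_KEY", env=None):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    # `env` can be any dict-like object, which makes the helper easy to test.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running this notebook.")
    return key


# Usage in the notebook (assumes the variable was set outside the notebook):
# genai.configure(api_key=load_api_key())
```

The helper fails loudly when the variable is missing, so a misconfigured session stops here rather than at the first API call.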



In [None]:
available_models = genai.list_models()
for model in available_models:
    print(model.name)


models/chat-bison-001
models/text-bison-001
models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-1.5-flash-8b-exp-0827
models/gemini-1.5-flash-8b-exp-0924
models/gemini-2.5-pro-exp-03-25
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
models/gemini-2.0-flash-thinking-exp-01-21
models/gemini-2.0-flash-thinking

Test if Gemini API is working:

In [None]:
# Choose the correct available model
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")

# Test the model
response = model.generate_content("How can I manage anxiety?")
print(response.text)


Managing anxiety involves a multifaceted approach that can include lifestyle changes, coping mechanisms, and sometimes professional help. Here's a breakdown of strategies you can try:

**Lifestyle Changes:**

* **Regular Exercise:** Physical activity is a powerful anxiety reducer. Aim for at least 30 minutes of moderate-intensity exercise most days of the week.  Even short bursts of activity can make a difference.
* **Healthy Diet:**  Nourishing your body with whole foods, limiting processed foods, sugar, and caffeine can help stabilize your mood and energy levels.
* **Sufficient Sleep:** Aim for 7-9 hours of quality sleep per night.  Establish a regular sleep schedule and create a relaxing bedtime routine.
* **Mindfulness and Meditation:**  Practicing mindfulness helps you focus on the present moment, reducing rumination and worry. Meditation apps can guide you through various techniques.
* **Limit Alcohol and Nicotine:** These substances can worsen anxiety symptoms.
* **Time Manageme

Step 3: Load & Preprocess the MentalChat16K Dataset

Download the MentalChat16K dataset from Hugging Face:

In [None]:
from datasets import load_dataset

dataset = load_dataset("ShenLab/MentalChat16K")
df = pd.DataFrame(dataset["train"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/3.58k [00:00<?, ?B/s]

Interview_Data_6K.csv:   0%|          | 0.00/13.6M [00:00<?, ?B/s]

Synthetic_Data_10K.csv:   0%|          | 0.00/32.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16084 [00:00<?, ? examples/s]

Remove irrelevant columns and duplicate rows:

In [None]:
df = df[['input', 'output']]  # Keep only the question and answer columns
df = df.drop_duplicates().reset_index(drop=True)  # Drop repeated question-answer pairs



Save the cleaned dataset for training/testing:

In [None]:
df.to_csv("mentalchat16k_cleaned.csv", index=False)
print("Dataset cleaned and saved!")


Dataset cleaned and saved!
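
To use the cleaned data for both training and testing, it helps to hold out a fixed test portion. A minimal sketch with pandas, assuming a DataFrame with a unique index; the function name split_train_test, the 20% test fraction, and the seed are illustrative choices, and a toy DataFrame stands in for the real dataset:

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of rows as a test set (assumes a unique index)."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)  # Everything that was not sampled into the test set
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy data with the same column names as the cleaned dataset
toy = pd.DataFrame({
    "input": [f"q{i}" for i in range(10)],
    "output": [f"a{i}" for i in range(10)],
})
train_df, test_df = split_train_test(toy)
print(len(train_df), len(test_df))
```

Fixing the seed keeps the split reproducible, so scores from different runs are computed on the same held-out rows.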


Step 4: Implement the Chatbot using Gemini

Define a chatbot function:

In [None]:
def chat_with_gemini(user_input):
    model = genai.GenerativeModel("gemini-1.5-pro-latest")
    response = model.generate_content(user_input)
    return response.text


Test the chatbot:

In [None]:
print(chat_with_gemini("How can I manage anxiety?"))


Managing anxiety can involve a combination of lifestyle changes, coping mechanisms, and sometimes professional help. Here's a breakdown of strategies you can try:

**Lifestyle Changes:**

* **Regular Exercise:** Physical activity is a powerful anxiety reducer. Aim for at least 30 minutes of moderate-intensity exercise most days of the week.  Even short bursts of activity can help.
* **Healthy Diet:**  A balanced diet can improve mood and energy levels. Limit processed foods, caffeine, and alcohol, which can exacerbate anxiety.
* **Sufficient Sleep:** Aim for 7-9 hours of quality sleep per night.  Establish a regular sleep schedule and create a relaxing bedtime routine.
* **Mindfulness and Meditation:**  These practices help you focus on the present moment and reduce overthinking.  Numerous apps and online resources can guide you.
* **Limit Stressors:** Identify and minimize sources of stress in your life where possible.  This might involve setting boundaries, saying no to commitments, 

Step 5: Hyperparameter Tuning

5.1 Apply Hyperparameter Tuning
This tutorial calls Gemini through the API without fine-tuning it, so we optimize performance by adjusting the generation parameters:

Temperature: Controls response randomness (lower = more deterministic, higher = more creative).

Top-K Sampling: Restricts sampling to the K most probable tokens at each step.

Top-P (Nucleus) Sampling: Restricts sampling to the smallest set of tokens whose cumulative probability reaches P.

Max Tokens: Limits response length (max_output_tokens) to prevent unnecessary verbosity.

In [None]:
# Reuses the model object created in Step 2
response = model.generate_content(
    "How can I manage anxiety?",
    generation_config={"temperature": 0.7, "top_k": 50, "top_p": 0.9, "max_output_tokens": 200}
)
print(response.text)
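
How temperature, top-k, and top-p shape the choice of the next token can be illustrated on a toy distribution. The sketch below is a simplified illustration and not Gemini's internal implementation: it assumes temperature is applied to the logits first, then top-k, then top-p, which is a common ordering. The function names and the logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most probable tokens, then the shortest prefix whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}  # Renormalized candidates


logits = [2.0, 1.0, 0.5, -1.0]            # Toy scores for four tokens
sharp = apply_temperature(logits, 0.5)    # Low temperature: mass concentrates on token 0
flat = apply_temperature(logits, 2.0)     # High temperature: mass spreads out
candidates = filter_top_k_top_p(apply_temperature(logits, 1.0), top_k=3, top_p=0.9)
print(sharp[0] > flat[0], sorted(candidates))
```

Lowering top_p or top_k shrinks the candidate set, which makes responses more predictable at the cost of variety.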


5.2 Use Prompt Engineering
We can improve chatbot responses by:

Few-shot prompting: Provide examples to guide response style.

Role-based prompting: Guide the AI's behavior (e.g., "You are a mental health assistant...").

Chain-of-thought prompting: Encourage logical step-by-step reasoning.

In [None]:
prompt = """You are a supportive mental health assistant.
A user is feeling anxious and needs guidance. Offer practical, empathetic advice.

User: How can I manage anxiety?"""

response = model.generate_content(prompt)
print(response.text)
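
Role-based and few-shot prompting can be combined in one prompt string. A minimal sketch, assuming a plain "User:" / "Assistant:" layout; the helper name build_prompt and the example exchange are hypothetical and only illustrate the structure:

```python
def build_prompt(user_message, examples,
                 role="You are a supportive mental health assistant."):
    """Build a prompt from a role line, example exchanges, and the new user message."""
    lines = [role, ""]
    for question, answer in examples:
        # Each example shows the model the tone and length we want
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
        lines.append("")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")  # Leave the final turn open for the model to complete
    return "\n".join(lines)


# Illustrative example pair, not taken from the dataset
examples = [
    ("I can't sleep before exams.",
     "That sounds stressful. A short wind-down routine before bed may help."),
]
prompt = build_prompt("How can I manage anxiety?", examples)
print(prompt)
```

The resulting string can be passed to model.generate_content in the same way as the prompt in the cell above.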


5.3 Run Evaluations to Compare Results
We compare different hyperparameter settings by scoring the responses produced under each setting with the evaluation metrics described in Step 6.
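
One way to organize such a comparison is a small grid over generation settings. The sketch below is an assumption about how this could be structured, not the notebook's own code: compare_configs, fake_generate, and word_overlap are hypothetical names, and the last two are offline stand-ins for a real model call and a real metric.

```python
from itertools import product


def compare_configs(generate, score, questions, references, temperatures, top_ps):
    """Return (config, mean score) pairs sorted from best to worst."""
    results = []
    for temperature, top_p in product(temperatures, top_ps):
        config = {"temperature": temperature, "top_p": top_p}
        answers = [generate(q, config) for q in questions]
        values = [score(a, r) for a, r in zip(answers, references)]
        results.append((config, sum(values) / len(values)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-ins so the sketch runs offline; swap in a real model call and metric.
def fake_generate(question, config):
    return "Practice deep breathing." if config["temperature"] < 0.5 else "Try something new."


def word_overlap(answer, reference):
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(r)  # Fraction of reference words found in the answer


ranking = compare_configs(
    fake_generate, word_overlap,
    ["How can I manage anxiety?"],
    ["Practice deep breathing and mindfulness."],
    temperatures=[0.2, 0.9], top_ps=[0.9],
)
print(ranking[0])
```

In the notebook, the generate argument could wrap model.generate_content(question, generation_config=config), and score could be any of the per-example metrics from Step 6.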

Step 6: Evaluation Metrics
To benchmark chatbot performance, we use five key evaluation metrics:

ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference text.

ROUGE-1: Measures unigram (single-word) overlap.

ROUGE-2: Measures bigram (two-word) overlap.

BERTScore: Compares contextual transformer embeddings of the two texts to measure semantic similarity.

BLEU: Measures n-gram precision of the generated text against human-written reference responses.
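
To make ROUGE-L concrete, the sketch below computes an LCS-based F-measure by hand. It is a simplified illustration that assumes lowercase whitespace tokenization; the rouge-score library applies its own tokenizer and optional stemming, so its numbers can differ. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(prediction, reference):
    """F-measure built from LCS precision and LCS recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# The LCS here is "deep breathing and mindfulness" (4 tokens)
value = rouge_l_f1("try deep slow breathing and mindfulness",
                   "practice deep breathing and mindfulness")
print(round(value, 4))
```

Because precision divides by the length of the prediction, a very long response compared against a one-sentence reference gets a low score even when it contains the reference content.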



In [None]:
!pip install bert-score nltk



In [None]:
from bert_score import score
import nltk
nltk.download('punkt')  # Tokenizer data for nltk.word_tokenize; sentence_bleu on pre-split tokens does not need it


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Code for Evaluation:

In [None]:
from rouge_score import rouge_scorer
from bert_score import score
from nltk.translate.bleu_score import sentence_bleu

def evaluate_model(predictions, references):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    scores = {"rouge1": [], "rouge2": [], "rougeL": [], "bleu": [], "bertscore": []}

    for pred, ref in zip(predictions, references):
        rouge_scores = scorer.score(ref, pred)
        scores["rouge1"].append(rouge_scores["rouge1"].fmeasure)
        scores["rouge2"].append(rouge_scores["rouge2"].fmeasure)
        scores["rougeL"].append(rouge_scores["rougeL"].fmeasure)
        scores["bleu"].append(sentence_bleu([ref.split()], pred.split()))

    # BERTScore
    P, R, F1 = score(predictions, references, lang="en", model_type="bert-base-uncased")
    scores["bertscore"] = F1.tolist()

    return scores


Run Evaluation:

In [None]:
test_questions = ["How can I manage anxiety?", "What should I do if I'm feeling depressed?"]
test_answers = ["Practice deep breathing and mindfulness.", "Reach out to a trusted friend or therapist."]

model_responses = [chat_with_gemini(q) for q in test_questions]

results = evaluate_model(model_responses, test_answers)
print(results)


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()



{'rouge1': [0.019569471624266147, 0.03278688524590164], 'rouge2': [0.003929273084479371, 0.004705882352941176], 'rougeL': [0.015655577299412915, 0.028103044496487123], 'bleu': [4.5612762621768366e-232, 1.1036181875670853e-155], 'bertscore': [0.41353362798690796, 0.37577903270721436]}


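Final Interpretation
The low ROUGE, BLEU, and BERTScore values show that the chatbot's responses do not closely match the reference answers. With only two test questions, these numbers are a rough signal rather than a benchmark.

Possible reasons:

Length mismatch: each reference answer is a single short sentence, while the chatbot returns a long, multi-section reply, so overlap-based scores are very low.

Chatbot output is generic and shares few exact words with the reference answers.

BLEU is computed without smoothing, so, as the warnings above show, a missing higher-order n-gram match drives the score to effectively zero.

The model has not been fine-tuned on the MentalChat16K dataset.

The prompt is a bare question; prompt engineering can help by steering the style and length of the answer.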

Next Steps to Improve Performance
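✔ Use prompt engineering (for example, a role instruction and a length limit) to make the chatbot more focused.
✔ Where a tunable model variant is available, fine-tune it on the dataset instead of only calling the base model through the API.
✔ Apply BLEU smoothing so that short texts without higher-order n-gram matches do not score zero.
✔ Read the responses themselves to check that they are meaningful even when the scores are low.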
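
The near-zero BLEU values in the results come from unsmoothed n-gram precisions: one order with no matches collapses the whole score. NLTK offers a SmoothingFunction class for this, as its warning suggests. The sketch below instead shows the idea in plain Python, assuming add-one smoothing on the precisions for n > 1 and whitespace tokenization; it is a simplified stand-in and will not reproduce NLTK's numbers exactly. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(prediction, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n-gram precisions for n > 1."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it appears in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if n > 1:
            matches, total = matches + 1, total + 1  # Add-one smoothing
        if matches == 0:
            return 0.0  # No unigram overlap at all
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for predictions shorter than the reference
    brevity = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * math.exp(sum(log_precisions) / max_n)


score_value = smoothed_bleu("try deep breathing and yoga",
                            "practice deep breathing and mindfulness")
print(round(score_value, 4))
```

With smoothing, a response that shares some words with the reference gets a small positive score instead of a value indistinguishable from zero, which makes comparisons between settings meaningful.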