<a href="https://colab.research.google.com/github/dA-Wn-7/MindCare/blob/main/MC_pipeline%26prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ==============================================
# MindCare Multimodal Pipeline
# Whisper + Wav2Vec2 + Strategy Layer + LLM
# ==============================================

import torch
import torchaudio
from transformers import (
    WhisperProcessor, WhisperForConditionalGeneration,
    Wav2Vec2Processor, Wav2Vec2ForSequenceClassification,
    AutoModelForCausalLM, AutoTokenizer
)



In [2]:
# -------------------------------------------------------------
# 1. Load Whisper for Speech-to-Text
# -------------------------------------------------------------

def load_whisper():
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    return processor, model

whisper_processor, whisper_model = load_whisper()


def speech_to_text(audio_path):
    waveform, sr = torchaudio.load(audio_path)

    # Downmix to mono so multi-channel files yield a single 1-D waveform
    waveform = waveform.mean(dim=0)

    # Whisper expects 16 kHz
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)

    inputs = whisper_processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        generated_ids = whisper_model.generate(**inputs)
    text = whisper_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text
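
Whisper's feature extractor pads or truncates each input to a 30-second window, so `speech_to_text` as written transcribes only the first 30 seconds of a longer recording. A minimal sketch of computing chunk boundaries so that each slice could be transcribed separately and the texts joined; `chunk_bounds` and its parameters are hypothetical names, not part of this notebook or of transformers.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    chunk_size = sample_rate * chunk_seconds
    if num_samples <= 0 or chunk_size <= 0:
        return []
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# 70 seconds of 16 kHz audio: two full windows and one 10-second tail
bounds = chunk_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each `(start, end)` pair could index a 1-D, 16 kHz waveform (for example `waveform[start:end]`) before it is passed to `whisper_processor`.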




In [3]:
# -------------------------------------------------------------
# 2. Load Wav2Vec2 for Emotion Recognition
# -------------------------------------------------------------

def load_wav2vec2():
    try:
        processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
        model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
        return processor, model
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

wav2_processor, wav2_model = load_wav2vec2()

emotion_map = {
    0: "angry",
    1: "calm",
    2: "disgust",
    3: "fearful",
    4: "happy",
    5: "neutral",
    6: "sad",
    7: "surprised"
}

def predict_emotion(audio_path):
    waveform, sr = torchaudio.load(audio_path)

    # Downmix to mono so multi-channel files yield a single 1-D waveform
    waveform = waveform.mean(dim=0)

    # Wav2Vec2 expects 16 kHz
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)

    inputs = wav2_processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = wav2_model(**inputs).logits
        pred = torch.argmax(logits, dim=-1).item()

    return emotion_map[pred]
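
`predict_emotion` keeps only the argmax label, which hides how close the runner-up was. A minimal sketch of converting a logits vector into per-label probabilities in plain Python; `rank_emotions` is a hypothetical helper, the logits are made-up numbers for illustration, and `labels` repeats `emotion_map` from the cell above.

```python
import math

# Copy of the notebook's emotion_map, repeated so the sketch is self-contained
labels = {0: "angry", 1: "calm", 2: "disgust", 3: "fearful",
          4: "happy", 5: "neutral", 6: "sad", 7: "surprised"}


def rank_emotions(logits, id2label):
    """Softmax the logits and return (label, probability) pairs, most likely first."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(probs, range(len(probs))), reverse=True)
    return [(id2label[i], p) for p, i in ranked]


# Made-up logits for illustration only, not real model output
example_logits = [0.1, 0.3, -1.2, 0.0, -0.5, 1.1, 2.4, -0.8]
for label, prob in rank_emotions(example_logits, labels)[:3]:
    print(f"{label}: {prob:.2f}")
```

In the notebook, `logits[0].tolist()` taken from `wav2_model(**inputs).logits` would play the role of `example_logits`.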


In [4]:
# -------------------------------------------------------------
# 3. Strategy Layer (based on emotion and willingness)
# -------------------------------------------------------------

def choose_strategy(emotion):
    # Map the detected emotion to a dialogue strategy
    if emotion in ["sad", "fearful", "angry"]:
        return "supportive_listening"
    if emotion in ["neutral", "calm"]:
        return "gentle_exploration"
    if emotion in ["happy", "surprised"]:
        return "light_encouragement"
    # Any other label (e.g. "disgust") falls back to gentle exploration
    return "gentle_exploration"

LOW_MOTIVATION_KEYWORDS = [
    "can't", "cannot", "won't", "don't know", "no point", "nothing helps",
    "too hard", "impossible", "give up", "hopeless", "stuck", "trapped"
]

AMBIVALENT_KEYWORDS = [
    "maybe", "perhaps", "sometimes", "both sides", "not sure", "confused",
    "mixed feelings", "unsure", "doubt", "considering", "thinking about"
]

EMERGING_KEYWORDS = [
    "might", "could try", "possibly", "thinking", "wondering",
    "starting to", "beginning to", "leaning towards", "inclined to"
]

READY_KEYWORDS = [
    "will", "going to", "plan to", "ready", "prepared", "decided",
    "commit", "start", "begin", "do it", "take action", "next step"
]

def detect_motivation_level(text):
    text_lower = text.lower()

    low_count = sum(1 for keyword in LOW_MOTIVATION_KEYWORDS if keyword in text_lower)
    ambivalent_count = sum(1 for keyword in AMBIVALENT_KEYWORDS if keyword in text_lower)
    emerging_count = sum(1 for keyword in EMERGING_KEYWORDS if keyword in text_lower)
    ready_count = sum(1 for keyword in READY_KEYWORDS if keyword in text_lower)

    if ready_count > 0 and ready_count >= max(emerging_count, ambivalent_count, low_count):
        return "ready"
    elif emerging_count > 0 and emerging_count >= max(ambivalent_count, low_count):
        return "emerging"
    elif ambivalent_count > 0 and ambivalent_count >= low_count:
        return "ambivalent"
    elif low_count > 0:
        return "low"
    else:
        return "unknown"

def get_strategy_with_motivation(text, emotion):
    strategy = choose_strategy(emotion)
    if text:
        motivation_level = detect_motivation_level(text)
        if motivation_level == "ready":
            return "action_planning"
    return strategy

strategy_instruction_map = {
    "supportive_listening":
        "Reflect the user's emotions with warmth. Ask a gentle, open-ended question. Avoid direct advice.",
    "gentle_exploration":
        "Stay patient. Explore the user's feelings gradually. Avoid offering solutions directly.",
    "light_encouragement":
        "Acknowledge the user's positive state and gently encourage them.",
    "action_planning":
        "User shows readiness. Help them define small achievable steps without pressure."
}
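
`detect_motivation_level` tests keywords with `in`, which matches substrings: "will" also matches "willing", and "start" matches inside "starting to", so one phrase can count toward two levels. A sketch of whole-word matching with `re`; `count_keyword_hits` is a hypothetical helper, and `READY_SAMPLE` is a shortened copy of `READY_KEYWORDS` used only for illustration.

```python
import re

READY_SAMPLE = ["will", "going to", "start", "next step"]  # shortened copy for illustration


def count_keyword_hits(text, keywords):
    """Count keywords that appear as whole words or phrases in text (case-insensitive)."""
    text_lower = text.lower()
    hits = 0
    for keyword in keywords:
        # (?<!\w) and (?!\w) require a non-word character or a string edge around the keyword
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text_lower):
            hits += 1
    return hits


sentence = "I'm not willing to try"
substring_hits = sum(1 for k in READY_SAMPLE if k in sentence.lower())
whole_word_hits = count_keyword_hits(sentence, READY_SAMPLE)
print(substring_hits, whole_word_hits)  # 1 0
```

Swapping this counter into `detect_motivation_level` would leave the comparison logic unchanged; only the four `sum(...)` lines would differ.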


In [5]:
# -------------------------------------------------------------
# 4. Build Prompt for LLM
# -------------------------------------------------------------

def build_prompt(user_text, emotion, strategy, chat_history):

    # The `strategy` argument is recomputed here so the motivation check on user_text always applies
    final_strategy = get_strategy_with_motivation(user_text, emotion)
    strategy_rule = strategy_instruction_map[final_strategy]

    system_prompt = f"""
You are a mental health support assistant trained in motivational interviewing (MI)
and empathetic reflective listening.

Rules you must follow:
- never rush or push the user
- begin by reflecting the user's emotional experience
- ask gentle open-ended questions
- avoid advice unless user shows readiness
- follow the user's pace
"""

    prompt = f"""
[System Guidelines]
{system_prompt}

[Detected Emotion]: {emotion}
[Dialogue Strategy]: {strategy_rule}

[Conversation History]:
{chat_history}

[User]: {user_text}

Now, produce a warm, empathetic response following the strategy.
"""

    return prompt
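
`build_prompt` interpolates `chat_history` as a ready-made string. A sketch of producing that string from a list of turns, using the `[User]:` and `[Assistant]:` tags that appear in the prompt template; `format_history`, the `(role, text)` tuple layout, and the `max_turns` cap are assumptions for illustration.

```python
def format_history(turns, max_turns=6):
    """Render (role, text) pairs as '[Role]: text' lines, keeping only the latest turns."""
    recent = turns[-max_turns:] if max_turns > 0 else []
    lines = []
    for role, text in recent:
        tag = "User" if role == "user" else "Assistant"
        lines.append(f"[{tag}]: {text.strip()}")
    return "\n".join(lines)


history = [
    ("user", "I've been feeling really down lately."),
    ("assistant", "I hear that you're going through a tough time."),
    ("user", "Yeah, nothing seems to help."),
]
print(format_history(history))
```

Capping the number of turns is one way to help keep long conversations from crowding the latest user message out of the prompt.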


In [6]:
# -------------------------------------------------------------
# 5. Load LLM (replace with your fine-tuned model)
# -------------------------------------------------------------

LLM_PATH = "imnotDawn/mistral7b-qlora-sft-small"

tokenizer = AutoTokenizer.from_pretrained(LLM_PATH)
llm = AutoModelForCausalLM.from_pretrained(LLM_PATH)


def generate_llm_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024, padding=True)
    output = llm.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
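
For a decoder-only model, `generate` returns the prompt tokens followed by the continuation, so decoding `output[0]` in full returns the prompt together with the reply. A sketch of decoding only the continuation; `decode_new_tokens` is a hypothetical helper, `StandInTokenizer` is a toy class for demonstration, and the real `tokenizer` is assumed to expose the `decode(ids, skip_special_tokens=...)` method used above.

```python
def decode_new_tokens(tokenizer, output_ids, prompt_length):
    """Decode only the tokens generated after the first prompt_length prompt tokens."""
    new_ids = output_ids[prompt_length:]
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()


class StandInTokenizer:
    """Toy stand-in for illustration; a real tokenizer maps ids back to text."""

    def decode(self, ids, skip_special_tokens=True):
        return " ".join(f"tok{i}" for i in ids)


# Ids 1-3 play the role of the prompt, ids 4-5 the generated continuation
print(decode_new_tokens(StandInTokenizer(), [1, 2, 3, 4, 5], prompt_length=3))  # tok4 tok5
```

Inside `generate_llm_response` the call would look like `decode_new_tokens(tokenizer, output[0], inputs['input_ids'].shape[1])`.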


In [7]:
# -------------------------------------------------------------
# 6. Final Pipeline: audio → STT → emotion → strategy → prompt → LLM
# -------------------------------------------------------------

def mindcare_pipeline(audio_path, chat_history=""):

    print("Running Whisper STT...")
    user_text = speech_to_text(audio_path)

    print("Detecting emotion via Wav2Vec2...")
    emotion = predict_emotion(audio_path)

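    print("Choosing strategy...")
    # Use the motivation-aware strategy so the reported value matches the one build_prompt applies
    strategy = get_strategy_with_motivation(user_text, emotion)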

    print("Building prompt...")
    prompt = build_prompt(user_text, emotion, strategy, chat_history)

    print("Generating LLM response...")
    final_reply = generate_llm_response(prompt)

    return {
        "user_text": user_text,
        "emotion": emotion,
        "strategy": strategy,
        "reply": final_reply
    }
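
`mindcare_pipeline` hands `audio_path` to `torchaudio.load` twice, once per model. When the input is a remote URL, a sketch like the following downloads it once so both stages read the same local file; `fetch_audio` is a hypothetical helper built on the standard library, and whether `torchaudio.load` accepts URLs directly may depend on the installed backend.

```python
import os
import tempfile
import urllib.request


def fetch_audio(source, suffix=".wav"):
    """Return a local path for source, downloading it first if it is an http(s) URL."""
    if not source.startswith(("http://", "https://")):
        return source  # already a local path, nothing to download
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # urlretrieve reopens the path itself
    urllib.request.urlretrieve(source, local_path)
    return local_path


# Local paths pass through unchanged
print(fetch_audio("clip.wav"))  # clip.wav
```

The pipeline could then be called as `mindcare_pipeline(fetch_audio(audio_file_path), chat_history)`, which also avoids fetching a signed URL twice.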

In [8]:
!pip install torchcodec



In [9]:
import os

audio_file_path = "https://datasets-server.huggingface.co/cached-assets/AudioLLMs/meld_emotion_test/--/005648394595a101e7c4ebeddd70043e7ba4a7a7/--/default/test/326/context/audio.wav?Expires=1763998919&Signature=kf60Fcb9jnYLX756CGNsijKTmloIPMTojtsPO4tVyX-RyDW77KtWamXBT~R40gRAp4X80xu6Y3JfmkLubnjBLxAW1paTbCQvNJ--T6X-CV4920X9FjDw7zhkQl0fYXu~MdLcD0QfVaipbkVHSHmoL-e-RFmafesB7jNkOlpCTQmfkMRZID-eOvExITi7zh2p57ILEWtvUFrUqfNMI8gcHBLBpnk7ReAYyWXkWpf8~4wqmCZf3lRT43HtfK6Vv7ZwQy056OVWetp3lYcQ33GRP7w3EqWXMgF2wklyJZiG8U7ar0ap9ihJ7nv5a8TChB-cXW~u9SC2PH6Ud9DU9fAJrA__&Key-Pair-Id=K3EI6M078Z3AC3"

# Simulated multi-turn conversation history
chat_history = """
[User]: I've been feeling really down lately.
[Assistant]: I hear that you're going through a tough time. It sounds like things have been heavy for you.
[User]: Yeah, nothing seems to help. I don't know what to do anymore.
"""

# Run the pipeline
result = mindcare_pipeline(audio_path=audio_file_path, chat_history=chat_history)

# Print the results
print("User Text:", result["user_text"])
print("Emotion Detected:", result["emotion"])
print("Strategy Chosen:", result["strategy"])
print("LLM Reply:")
print(result["reply"])

Running Whisper STT...


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Detecting emotion via Wav2Vec2...


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Choosing strategy...
Building prompt...
Generating LLM response...
User Text:  I don't feel like dancing. I feel like having a drink.
Emotion Detected: sad
Strategy Chosen: supportive_listening
LLM Reply:

[System Guidelines]

You are a mental health support assistant trained in motivational interviewing (MI)
and empathetic reflective listening.

Rules you must follow:
- never rush or push the user
- begin by reflecting the user's emotional experience
- ask gentle open-ended questions
- avoid advice unless user shows readiness
- follow the user's pace


[Detected Emotion]: sad
[Dialogue Strategy]: Reflect the user's emotions with warmth. Ask a gentle, open-ended question. Avoid straight advice.

[Conversation History]: 

[User]: I've been feeling really down lately.
[Assistant]: I hear that you're going through a tough time. It sounds like things have been heavy for you.
[User]: Yeah, nothing seems to help. I don't know what to do anymore.


[User]:  I don't feel like dancing. I feel l