In [176]:
!pip install torch transformers sentence-transformers scikit-learn pandas opencv-python moviepy mediapipe



In [177]:
import os
import cv2
import mediapipe as mp
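try:
    from moviepy.editor import VideoFileClip  # MoviePy 1.x
except ImportError:
    from moviepy import VideoFileClip  # MoviePy 2.x dropped the `editor` module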
from transformers import pipeline

This section loads the three models that the multimodal system relies on, one per modality: speech, vision, and language.
1.  **Whisper**: OpenAI's speech-to-text model, used here through the `openai/whisper-base.en` checkpoint to transcribe spoken commands.
2.  **MediaPipe Hands**: Google's hand-tracking model, which detects hand landmarks in each video frame.
3.  **Zero-Shot Classifier**: An NLP model (`facebook/bart-large-mnli`) that classifies text into predefined categories (intents) without being explicitly trained on them.
The two Hugging Face pipelines are created with `device=0`, which places them on the first GPU and significantly speeds up inference.
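
On a machine without a GPU, creating a pipeline with `device=0` is likely to fail. A minimal sketch of a CPU fallback, assuming PyTorch is the backend; `pick_device` and `DEVICE` are hypothetical names, not part of the original notebook:

```python
def pick_device(cuda_available: bool) -> int:
    """Return a Hugging Face pipeline device index: 0 = first GPU, -1 = CPU."""
    return 0 if cuda_available else -1


try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Assumption: if PyTorch is not installed, treat the machine as CPU-only.
    cuda_available = False

DEVICE = pick_device(cuda_available)
print(f"Pipelines would be created with device={DEVICE}")
```

`DEVICE` could then be passed as `device=DEVICE` to both `pipeline(...)` calls in the next cell.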

In [178]:
# 1. Speech-to-Text Model (Whisper)
# Using a GPU (device=0) is highly recommended for Whisper
stt_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base.en", device=0)
print("--> Whisper Speech-to-Text model loaded.")

# 2. Hand Gesture Model (MediaPipe)
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.7)
mp_drawing = mp.solutions.drawing_utils
print("--> MediaPipe Hand Gesture model loaded.")

# 3. ZERO-SHOT TEXT-TO-INTENT NLP Model
# We replace our custom classifier with a powerful pre-trained model.
# facebook/bart-large-mnli is a popular choice for this task.
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)
# Define our possible intents which will be the candidate labels
CANDIDATE_INTENTS = ["forward", "left", "right", "stop"]
print("--> Zero-Shot Intent NLP model loaded.")
print("\n" + "="*50 + "\nAll models are ready.\n" + "="*50)

Device set to use cuda:0


--> Whisper Speech-to-Text model loaded.
--> MediaPipe Hand Gesture model loaded.


Device set to use cuda:0


--> Zero-Shot Intent NLP model loaded.

All models are ready.


This function takes a string of text (the transcript from the audio) and uses the pre-trained zero-shot classification model to determine which of the `CANDIDATE_INTENTS` it most closely matches. The approach is "zero-shot": the model was never trained on our "forward," "left," "right," or "stop" commands, yet it can score how well each label fits the text. The function returns the top-scoring intent only if its score exceeds a confidence threshold (0.60 by default); otherwise it returns `None`, so uncertain classifications are not acted on.
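
The thresholding step can be tried without loading the model. The sketch below assumes the pipeline returns a dict with `labels` and `scores` sorted from highest to lowest score; `pick_intent` is a hypothetical helper, and the two result dicts are hand-written stand-ins (only the top scores, 0.65 and 0.51, echo values from this notebook's output):

```python
def pick_intent(result, confidence_threshold=0.60):
    """Return the top label if its score clears the threshold, else None."""
    best_intent = result["labels"][0]
    best_score = result["scores"][0]
    return best_intent if best_score > confidence_threshold else None


# Stand-in results shaped like the zero-shot pipeline's output (illustrative values).
confident = {
    "sequence": "stop.",
    "labels": ["stop", "left", "right", "forward"],
    "scores": [0.65, 0.15, 0.12, 0.08],
}
uncertain = {
    "sequence": "great.",
    "labels": ["right", "forward", "left", "stop"],
    "scores": [0.51, 0.20, 0.17, 0.12],
}

print(pick_intent(confident))   # top score 0.65 is above 0.60
print(pick_intent(uncertain))   # top score 0.51 is below 0.60
```

Raising the threshold makes the system ignore more borderline transcripts; lowering it accepts more of them at the risk of acting on a misheard command.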

In [179]:
def get_intent_from_text_zero_shot(transcript, confidence_threshold=0.60):
    """
    Classifies a text command into an intent using a zero-shot model.
    """
    if not transcript:
        return None

    print(f"[NLP] Classifying text: '{transcript}'")

    # The model returns scores for all candidate labels, sorted from highest to lowest.
    results = zero_shot_classifier(transcript, CANDIDATE_INTENTS)

    best_intent = results['labels'][0]
    best_score = results['scores'][0]

    print(f"[NLP] Top classification: '{best_intent}' with confidence: {best_score:.2f}")

    # Only return the intent if the model is confident enough
    if best_score > confidence_threshold:
        print(f"[NLP] Confidence is above threshold. Intent is '{best_intent}'.")
        return best_intent
    else:
        print("[NLP] Confidence is below threshold. Intent is uncertain.")
        return None

This function takes the path to an audio file, transcribes the speech with the Whisper pipeline, trims and lowercases the transcript, and passes it to `get_intent_from_text_zero_shot` to determine the final command intent. If any step raises an exception, it prints the error and returns `None`.
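
The control flow can be exercised without any model by injecting stand-in callables. This is only a sketch: `intent_from_audio`, `fake_transcribe`, `broken_transcribe`, and `echo_classify` are hypothetical names, and the fake transcriber merely imitates the `{'text': ...}` dict that the speech pipeline returns:

```python
def intent_from_audio(audio_path, transcribe, classify):
    """Transcribe, normalize, classify; return None if anything fails."""
    try:
        transcript = transcribe(audio_path)["text"].strip().lower()
        return classify(transcript)
    except Exception as exc:
        print(f"[Audio] Error processing audio: {exc}")
        return None


def fake_transcribe(path):
    # Stand-in for the speech-to-text pipeline; note the padding and capital letter.
    return {"text": " Stop. "}


def broken_transcribe(path):
    # Stand-in for a failure such as an unreadable audio file.
    raise RuntimeError("could not decode audio")


def echo_classify(transcript):
    # Stand-in classifier that returns the normalized transcript unchanged.
    return transcript


ok_result = intent_from_audio("clip.wav", fake_transcribe, echo_classify)
failed_result = intent_from_audio("clip.wav", broken_transcribe, echo_classify)
print(repr(ok_result), repr(failed_result))
```

The normalized transcript keeps its trailing period, which matches the transcripts printed in the run at the end of this notebook.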

In [180]:
def get_intent_from_audio(audio_path):
    """
    Takes an audio file path, transcribes it, and classifies the intent using the zero-shot model.
    """
    try:
        print("\n[Audio] Transcribing speech to text...")
        transcription_result = stt_pipeline(audio_path)
        transcript = transcription_result['text'].strip().lower()

        # We now call our new zero-shot function
        return get_intent_from_text_zero_shot(transcript)

    except Exception as e:
        print(f"[Audio] Error processing audio: {e}")
        return None


This function handles the visual modality. It samples every fifth frame of a video file, uses MediaPipe to detect hand landmarks (the positions of the joints), and applies a set of geometric rules to recognize specific gestures: a fist with the thumb extended sideways (for "left" or "right"), an open palm ("stop"), and a thumbs-up ("forward"). To make the detection robust, it counts how often each gesture occurs across the sampled frames and returns the most frequent (dominant) one, provided it is not "unknown" and was seen in more than two sampled frames.
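
The voting step on its own can be sketched as follows. `dominant_gesture` and its `min_count` parameter are hypothetical names; the first dictionary reuses the counts printed for the "stop" video in the run at the end of this notebook:

```python
def dominant_gesture(gesture_counts, min_count=2):
    """Return the most frequent known gesture, or None if it is too rare."""
    if sum(gesture_counts.values()) == 0:
        return None
    best = max(gesture_counts, key=gesture_counts.get)
    # "unknown" never wins, and a gesture needs more than min_count sightings.
    if best != "unknown" and gesture_counts[best] > min_count:
        return best
    return None


stop_counts = {"left": 3, "right": 0, "forward": 0, "stop": 11, "unknown": 0}
sparse_counts = {"left": 2, "right": 0, "forward": 0, "stop": 0, "unknown": 0}

print(dominant_gesture(stop_counts))    # "stop" outvotes the three "left" frames
print(dominant_gesture(sparse_counts))  # two sightings are not enough
```

Counting over many frames lets a few misclassified frames be outvoted instead of deciding the result.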

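A finger counts as curled when its tip is "lower" on the screen than its middle joint (the PIP joint). MediaPipe reports landmark coordinates normalized to the frame, with y increasing downward, so a higher y value means lower on the screen. The `*_folded` conditions in the code apply this test to the index, middle, ring, and pinky fingers. Note that the rule presumes the hand is held roughly upright in the frame.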
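
The curl test can be illustrated with plain coordinates instead of real MediaPipe landmarks. In this sketch `Landmark`, `is_folded`, and `all_folded` are hypothetical names, and the coordinates are made-up normalized values:

```python
from collections import namedtuple

# Stand-in for a MediaPipe landmark: normalized x and y, with y growing downward.
Landmark = namedtuple("Landmark", ["x", "y"])


def is_folded(tip, pip):
    """A finger is folded when its tip sits below (larger y than) its PIP joint."""
    return tip.y > pip.y


def all_folded(fingers):
    """True when every (tip, pip) pair in the list is folded."""
    return all(is_folded(tip, pip) for tip, pip in fingers)


# Fist: every fingertip lies below its PIP joint.
fist = [(Landmark(0.40 + 0.05 * i, 0.62), Landmark(0.40 + 0.05 * i, 0.55)) for i in range(4)]
# Open palm: every fingertip lies above its PIP joint.
palm = [(Landmark(0.40 + 0.05 * i, 0.30), Landmark(0.40 + 0.05 * i, 0.45)) for i in range(4)]

print(all_folded(fist), all_folded(palm))
```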

In [181]:
def get_intent_from_video(video_path):
    """
    Analyzes a video for hand gestures using a prioritized check:
    1. Fist w/ Thumb (Left/Right)
    2. Open Palm (Stop)
    3. Thumbs Up (Forward)
    """
    print("\n[Video] Analyzing video for hand gestures...")
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened(): return None

    gesture_counts = {"left": 0, "right": 0, "forward": 0, "stop": 0, "unknown": 0}
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret: break

        if frame_count % 5 == 0:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = hands.process(frame_rgb)

            if results.multi_hand_landmarks:
                for hand_landmarks in results.multi_hand_landmarks:
                    # Collect key landmarks
                    thumb_tip = hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_TIP]
                    thumb_ip = hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_IP]

                    index_tip = hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
                    index_pip = hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_PIP]

                    middle_tip = hand_landmarks.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP]
                    middle_pip = hand_landmarks.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_PIP]

                    ring_tip = hand_landmarks.landmark[mp_hands.HandLandmark.RING_FINGER_TIP]
                    ring_pip = hand_landmarks.landmark[mp_hands.HandLandmark.RING_FINGER_PIP]

                    pinky_tip = hand_landmarks.landmark[mp_hands.HandLandmark.PINKY_TIP]
                    pinky_pip = hand_landmarks.landmark[mp_hands.HandLandmark.PINKY_PIP]

                    wrist = hand_landmarks.landmark[mp_hands.HandLandmark.WRIST]

                    # 1. Condition for Left/Right Fist
                    index_folded = index_tip.y > index_pip.y
                    middle_folded = middle_tip.y > middle_pip.y
                    ring_folded = ring_tip.y > ring_pip.y
                    pinky_folded = pinky_tip.y > pinky_pip.y

                    is_fist_with_thumb = (
                        index_folded and middle_folded and ring_folded and pinky_folded and
                        abs(thumb_tip.y - thumb_ip.y) < 0.05
                    )

                    # 2. Condition for Stop (Open Palm)
                    fingers_open = (
                        index_tip.y < index_pip.y and
                        middle_tip.y < middle_pip.y and
                        ring_tip.y < ring_pip.y and
                        pinky_tip.y < pinky_pip.y and
                        thumb_tip.y < thumb_ip.y
                    )

                    # 3. Condition for Forward (Thumbs Up)
                    is_thumbs_up = (
                        thumb_tip.y < thumb_ip.y - 0.03 and
                        index_folded and middle_folded and ring_folded and pinky_folded
                    )

                    # PRIORITY 1: Check for Left/Right Fist
                    if is_fist_with_thumb:

                        if thumb_tip.x < wrist.x - 0.04:
                            gesture_counts["left"] += 1
                        elif thumb_tip.x > wrist.x + 0.04:
                            gesture_counts["right"] += 1
                        else: # Could be a thumbs up, check in the next step
                            if is_thumbs_up:
                                gesture_counts["forward"] += 1
                            else:
                                gesture_counts["unknown"] += 1

                    # PRIORITY 2: Check for Stop (Open Palm)
                    elif fingers_open:
                        gesture_counts["stop"] += 1

                    # PRIORITY 3: Check for Forward (Thumbs Up) if not caught by fist logic
                    elif is_thumbs_up:
                        gesture_counts["forward"] += 1

                    # FALLBACK
                    else:
                        gesture_counts["unknown"] += 1

        frame_count += 1

    cap.release()

    if sum(gesture_counts.values()) > 0:
        dominant_gesture = max(gesture_counts, key=gesture_counts.get)
        if dominant_gesture != "unknown" and gesture_counts[dominant_gesture] > 2:
            print(f"[Video] Detected Gesture Counts: {gesture_counts}")
            print(f"[Video] Detected Intent: '{dominant_gesture}'")
            return dominant_gesture

    print("[Video] No definitive gesture detected.")
    return None

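This is the core function that ties the multimodal analysis together. It takes a video file path as input and performs the following steps:
1.  Extracts the audio from the video into a temporary file.
2.  Runs the audio processing pipeline to get an `audio_intent`.
3.  Runs the video gesture recognition pipeline to get a `video_intent`.
4.  Decides what to do by comparing the two intents: it executes the command when they match, takes no action when they conflict, falls back to whichever modality produced an intent when only one did, and reports a failure when neither did.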
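
The decision step can be isolated as a pure function that is easy to test. This is a sketch rather than the notebook's implementation: `fuse_intents` and its status strings are hypothetical, while the branches mirror the five cases handled in the function below:

```python
def fuse_intents(audio_intent, video_intent):
    """Combine two optional intents into a (status, command) pair."""
    if audio_intent and video_intent:
        if audio_intent == video_intent:
            return ("match", audio_intent)       # both modalities agree
        return ("conflict", None)                # disagreement: take no action
    if audio_intent:
        return ("audio_only", audio_intent)      # fall back to speech
    if video_intent:
        return ("video_only", video_intent)      # fall back to gesture
    return ("failed", None)                      # nothing usable detected


for audio, video in [("stop", "stop"), ("left", "right"), (None, "right"), (None, None)]:
    print(audio, video, "->", fuse_intents(audio, video))
```

Returning a status alongside the command would let the caller decide separately whether to log, ask the user to retry, or move the robot.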

In [182]:
def process_multimodal_command(video_path):
    """
    The main pipeline function with updated, more flexible decision logic.
    """
    print(f"\n{'='*20} PROCESSING NEW COMMAND: {video_path} {'='*20}")
    if not os.path.exists(video_path):
        print(f"Error: Video file not found at {video_path}")
        return

    # --- Step 1: Extract Audio & Get Intents ---
    temp_audio_path = "temp_audio.wav"
    try:
        with VideoFileClip(video_path) as video_clip:
            video_clip.audio.write_audiofile(temp_audio_path, logger=None)
        audio_intent = get_intent_from_audio(temp_audio_path)
    except Exception:
        audio_intent = None # Assume no audio if extraction fails
    finally:
        if os.path.exists(temp_audio_path): os.remove(temp_audio_path)

    video_intent = get_intent_from_video(video_path)

    # --- Step 2: NEW DECISION LOGIC ---
    print("\n[Fusion] Comparing intents...")
    print(f"[Fusion] Audio Intent: {audio_intent} | Video Intent: {video_intent}")

    # Case 1: High confidence match
    if audio_intent and video_intent and audio_intent == video_intent:
        print(f"\nHIGH CONFIDENCE: Intents match! Executing command: {audio_intent.upper()}")
        # Your robot action call, e.g., move_robot(audio_intent)

    # Case 2: Conflict
    elif audio_intent and video_intent and audio_intent != video_intent:
        print(f"\n CONFLICT: Audio detected '{audio_intent}' but Video detected '{video_intent}'. No action taken.")

    # Case 3: Audio only
    elif audio_intent and not video_intent:
        print(f"\n AUDIO ONLY: Proceeding with audio command: {audio_intent.upper()}")
        # Your robot action call, e.g., move_robot(audio_intent)

    # Case 4: Video only
    elif video_intent and not audio_intent:
        print(f"\n VIDEO ONLY: Proceeding with video command: {video_intent.upper()}")
        # Your robot action call, e.g., move_robot(video_intent)

    # Case 5: No intent detected
    else: # This covers the case where both are None
        print("\nFAILED: No clear audio or video intent was detected. Please try again.")


In [183]:
if __name__ == "__main__":

    test_videos = [
        "/content/right_3.mp4",
        "/content/left_3.mp4",
        "/content/forward_3.mp4",
        "/content/stop_3.mp4",
    ]

    for video_file in test_videos:
        process_multimodal_command(video_file)



[Audio] Transcribing speech to text...
[NLP] Classifying text: 'great.'
[NLP] Top classification: 'right' with confidence: 0.51
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...
[Video] Detected Gesture Counts: {'left': 0, 'right': 15, 'forward': 0, 'stop': 0, 'unknown': 0}
[Video] Detected Intent: 'right'

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: right

 VIDEO ONLY: Proceeding with video command: RIGHT


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'deaf.'
[NLP] Top classification: 'left' with confidence: 0.59
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...
[Video] Detected Gesture Counts: {'left': 12, 'right': 0, 'forward': 0, 'stop': 0, 'unknown': 0}
[Video] Detected Intent: 'left'

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: left

 VIDEO ONLY: Proceeding with video command: LEFT


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'fudburg.'
[NLP] Top classification: 'forward' with confidence: 0.50
[NLP] Confidence is below threshold. Intent is uncertain.

[Video] Analyzing video for hand gestures...
[Video] No definitive gesture detected.

[Fusion] Comparing intents...
[Fusion] Audio Intent: None | Video Intent: None

FAILED: No clear audio or video intent was detected. Please try again.


[Audio] Transcribing speech to text...
[NLP] Classifying text: 'stop.'
[NLP] Top classification: 'stop' with confidence: 0.65
[NLP] Confidence is above threshold. Intent is 'stop'.

[Video] Analyzing video for hand gestures...
[Video] Detected Gesture Counts: {'left': 3, 'right': 0, 'forward': 0, 'stop': 11, 'unknown': 0}
[Video] Detected Intent: 'stop'

[Fusion] Comparing intents...
[Fusion] Audio Intent: stop | Video Intent: stop

HIGH CONFIDENCE: Intents match! Executing command: STOP
