ERIS is a real-time multimodal system that recognises human emotion from speech and facial expressions, fuses both signals with a learned gate, generates empathetic LLM responses, and can drive an emotion-aware quadruped robot simulation.
The project exists in two tightly related layers:

- **Multimodal emotion understanding**: a live Gradio application that continuously processes webcam and microphone input, predicts emotion, tracks emotional context, and generates short supportive responses.
- **Embodied robotic expression**: a ROS 2 / Gazebo / CHAMP-based quadruped control stack that converts the predicted emotion into physically distinct gait, posture, and motion behaviour.
## Table of Contents

- Overview
- Core Features
- System Architecture
- Emotion Classes
- Project Structure
- Requirements
- Installation
- Running the Project
- Robot Integration
- Model Notes
- Acknowledgements
- License
## Overview

ERIS started as a multimodal emotion recognition system and was extended into a complete interaction pipeline. The final system can:
- capture live audio and video
- predict one of four emotions: angry, happy, neutral, sad
- combine audio and visual evidence using a learned gated fusion module
- keep short-term emotional context across successive predictions
- generate empathetic responses with an LLM
- export the predicted emotional state to a robot simulation
- modulate a quadruped’s gait and posture to match the detected emotion
The goal is not only to classify emotion, but to make the response feel more natural, contextual, and embodied.
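At a glance, the live pipeline repeats a capture, predict, respond cycle. The following is a minimal sketch of that loop under assumed names (`capture`, `model`, `responder`, and the `emotion_state.json` fields are illustrative, not the actual app.py API):

```python
import json
import time

def live_loop(capture, model, responder, state_path="emotion_state.json"):
    """Illustrative main loop; object names and methods are assumptions, not app.py's API."""
    while True:
        audio_win, video_win = capture.latest_windows()              # rolling audio / video buffers
        emotion, probs, gate = model.predict(audio_win, video_win)   # gated multimodal inference

        # Export the current state so the robot process can pick it up.
        with open(state_path, "w") as f:
            json.dump({"emotion": emotion, "intensity": float(max(probs))}, f)

        reply = responder.maybe_reply(emotion, probs)                # may return None when suppressed
        if reply:
            print(reply)

        time.sleep(1.0)                                              # wait for the next window
```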
## Core Features

### Live multimodal application

- Continuous webcam and microphone capture in background threads
- Rolling multimodal windows for repeated inference
- Real-time prediction display
- Emotion timeline / history across the session
- Live gate readout showing the balance between audio and visual modalities
- Context-aware short replies generated through Groq-hosted Llama 3.1 8B
- Response behaviour that can adapt to:
  - the current emotion
  - the detected intensity
  - recent emotion shifts
- Optional response suppression when the emotional state is stable or not strong enough to warrant a reply (see the sketch below)
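A possible shape for that suppression rule, as a hedged sketch (threshold values and the neutral-handling policy are assumptions, not the actual app.py logic):

```python
def should_reply(current, previous, confidence,
                 min_confidence=0.6, reply_on_change=True):
    """Decide whether this prediction window warrants an LLM reply.

    Illustrative only: thresholds and rules in app.py may differ.
    """
    if confidence < min_confidence:              # weak evidence: stay quiet
        return False
    if current == previous == "neutral":         # stable neutral state: nothing to respond to
        return False
    if reply_on_change and current != previous:  # emotion shift: acknowledge it
        return True
    return current != "neutral"                  # sustained non-neutral emotion
```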
### Video file analysis

- Analyse a pre-recorded video file
- Run the same multimodal inference pipeline
- Display class probabilities and gate behaviour
- Skip the live conversational layer when not needed
### Robot integration

- Emotion state written to a shared file for the robot process
- Quadruped gait changes based on the predicted emotion
- Smooth transitions between states using cosine interpolation (see the sketch after this list)
- Emotion-specific posture, velocity, and motion profiles
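The cosine transition can be pictured with a short sketch like the one below (parameter names and values are hypothetical; the real blending lives in the Stride_bot code):

```python
import math

def cosine_blend(start, target, t):
    """Blend two gait-parameter dicts with a cosine ease, t in [0, 1].

    A cosine profile ramps in and out smoothly, so posture and gait
    changes do not jump when the detected emotion switches.
    """
    t = min(max(t, 0.0), 1.0)
    w = (1 - math.cos(math.pi * t)) / 2          # eased weight, 0 -> 1
    return {k: (1 - w) * start[k] + w * target[k] for k in start}

# Hypothetical profiles: move 30% of the way from "sad" towards "happy".
sad   = {"body_height": 0.16, "step_velocity": 0.4, "swing_height": 0.03}
happy = {"body_height": 0.22, "step_velocity": 1.0, "swing_height": 0.07}
print(cosine_blend(sad, happy, 0.3))
```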
## System Architecture

```
Audio stream (microphone)           Video stream (webcam)
          │                                  │
          ▼                                  ▼
     WavLM-Base+                    ViT-Face-Expression
          │                                  │
  attention pooling                  temporal pooling
          │                                  │
          └────────────────┬─────────────────┘
                           ▼
                      Learned gate
                           ▼
                  Multimodal classifier
                           ▼
          angry / happy / neutral / sad
                           ▼
          ┌────────────────┴────────────────┐
          ▼                                 ▼
 Empathetic LLM reply             Robot emotion state
          ▼                                 ▼
 User-facing response            Quadruped gait control
```
## Emotion Classes

ERIS currently predicts four emotions:
- angry
- happy
- neutral
- sad
These classes are used consistently across the multimodal model, the live response layer, and the robot behaviour mapping.
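In practice this usually means a single shared label list; a hypothetical constant (the actual identifier and class order live in model.py) would look like:

```python
# Hypothetical shared constant; the real name and class order are defined in model.py.
EMOTIONS = ["angry", "happy", "neutral", "sad"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(EMOTIONS)}
```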
## Project Structure

A typical project layout is:
```
ERIS/
├── app.py                          # Gradio UI, live inference, LLM integration
├── model.py                        # Multimodal model architecture
├── multimodal_final_v4.pt          # Trained model checkpoint
├── emotion_state.json              # Shared emotion state for the robot
├── blaze_face_short_range.tflite   # MediaPipe face detector model
├── .env                            # API keys and local configuration
├── .gitignore
├── README.md
├── Stride_bot-main/                # Robot simulation / gait control
│   ├── bot_walk.py                 # Robot runtime loop
│   ├── stridebot.urdf              # Robot description
│   ├── src/
│   │   ├── emotion_gait.py         # Emotion-specific gait parameters
│   │   ├── gaitPlanner.py          # Base trot gait planner
│   │   ├── kinematic_model.py      # IK model
│   │   └── ...
│   └── LICENSE
└── ros2_quad_ws/                   # ROS 2 integration workspace (or similar), if used
    ├── src/
    ├── launch/
    └── ...
```
## Requirements

### Multimodal application

- Python 3.10 or newer
- ffmpeg
- Webcam
- Microphone

### Robot simulation (ROS 2 variant)

- Ubuntu 22.04
- ROS 2 Humble
- Gazebo Fortress
- CHAMP-based quadruped control stack

### Key Python packages

- PyTorch, OpenCV, Gradio, MediaPipe, SoundDevice
## Installation

Clone the repository:

```bash
git clone <your-repo-url>
cd ERIS
```

Create and activate a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Install the Python dependencies:

```bash
pip install torch torchvision transformers mediapipe opencv-python \
    sounddevice soundfile gradio groq python-dotenv \
    matplotlib pybullet numpy
```

Download the MediaPipe face detector model:

```bash
curl -L -o blaze_face_short_range.tflite \
  "https://storage.googleapis.com/mediapipe-models/face_detector/blaze_face_short_range/float16/1/blaze_face_short_range.tflite"
```

Create a `.env` file in the project root with your API keys:

```
GROQ_API_KEY=your_groq_api_key_here
HF_TOKEN=your_huggingface_token_here
```

## Running the Project

Start the Gradio application:

```bash
python app.py
```

The Gradio app runs locally and provides a live interface for webcam and microphone input.
In a separate terminal, run the robot simulation:
```bash
cd Stride_bot-main
python bot_walk.py
```

The robot reads the current emotion from `emotion_state.json` and changes gait accordingly.
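A minimal sketch of the robot side of that handoff, assuming a simple payload such as `{"emotion": "happy", "intensity": 0.8}` (the exact fields written by the app may differ):

```python
import json
from pathlib import Path

def read_emotion_state(path="emotion_state.json", default="neutral"):
    """Read the latest emotion written by the Gradio app.

    Field names are assumptions; falls back to neutral if the file is
    missing or being rewritten at the moment it is read.
    """
    try:
        data = json.loads(Path(path).read_text())
        return data.get("emotion", default), float(data.get("intensity", 0.5))
    except (FileNotFoundError, ValueError):
        return default, 0.5

emotion, intensity = read_emotion_state()
print(f"Driving gait for: {emotion} (intensity {intensity:.2f})")
```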
If you are using the ROS 2 integration variant, start the simulator and behaviour nodes in separate terminals:
```bash
source /opt/ros/humble/setup.bash
source ~/ros2_ws/install/setup.bash
ros2 launch go2_config gazebo.launch.py
ros2 launch go2_behaviour emotion_motion.launch.py
```

You can also dispatch manual commands from the CLI:

```bash
ros2 run go2_behaviour behaviour_commander --action move --value 3.0 --emotion happy
ros2 run go2_behaviour behaviour_commander --action turn --value 90.0 --emotion angry
ros2 run go2_behaviour behaviour_commander --action move --value 1.5 --emotion sad
```

## Robot Integration

The robot side of ERIS is designed to turn emotion into visible motion rather than a simple preset animation.
- Happy: energetic, bouncy, prancing gait
- Sad: slower movement with a hunched posture
- Angry: faster, more forceful motion with sharper turns
- Neutral: stable, calm baseline locomotion
- Smooth interpolation between emotion states
- Safe posture updates to avoid inverse kinematics instability
- Dynamic adjustment of gait parameters such as stance duration and swing height
- Heading correction using ground-truth odometry in the ROS 2 variant
The multimodal app can publish the current emotion state directly to the robot control layer, allowing the perception pipeline and the robot behaviour pipeline to operate as one connected system.
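Conceptually, emotion_gait.py maps each emotion to a gait-parameter profile along these lines (the parameter names and values below are illustrative, not the shipped defaults):

```python
# Illustrative profiles only; the real parameters live in Stride_bot-main/src/emotion_gait.py.
EMOTION_GAIT = {
    "happy":   {"step_velocity": 1.2, "swing_height": 0.08, "body_height": 0.22, "stance_duration": 0.35},
    "angry":   {"step_velocity": 1.5, "swing_height": 0.06, "body_height": 0.20, "stance_duration": 0.30},
    "sad":     {"step_velocity": 0.4, "swing_height": 0.03, "body_height": 0.16, "stance_duration": 0.60},
    "neutral": {"step_velocity": 0.8, "swing_height": 0.05, "body_height": 0.20, "stance_duration": 0.45},
}

def gait_for(emotion):
    """Return the gait profile for an emotion, defaulting to the neutral baseline."""
    return EMOTION_GAIT.get(emotion, EMOTION_GAIT["neutral"])
```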
## Model Notes

The multimodal network uses:
- WavLM-Base+ for speech emotion features
- ViT-Face-Expression for facial emotion features
- Temporal pooling / transformer-based aggregation for sequence-level video understanding
- Learned gated fusion to balance audio and video on a per-sample basis
- Final classifier to predict the emotion label
This design makes the model more robust than either modality alone, since the per-sample gate can shift weight toward whichever signal is currently more reliable.
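A minimal PyTorch sketch of the gated-fusion idea (embedding dimensions and layer names are assumptions; the actual architecture is in model.py):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-sample gate that weighs audio vs. visual embeddings before classification.

    Illustrative only: the real model.py may use different dimensions and layers.
    """
    def __init__(self, audio_dim=768, video_dim=768, hidden=256, n_classes=4):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(audio_dim + video_dim, 1), nn.Sigmoid())
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, audio_emb, video_emb):
        g = self.gate(torch.cat([audio_emb, video_emb], dim=-1))     # (batch, 1), in [0, 1]
        fused = g * self.audio_proj(audio_emb) + (1 - g) * self.video_proj(video_emb)
        return self.classifier(fused), g                             # class logits + gate readout

# Usage with dummy pooled embeddings:
logits, gate = GatedFusion()(torch.randn(2, 768), torch.randn(2, 768))
```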
## Acknowledgements

- WavLM-Base+ for speech feature extraction
- ViT-Face-Expression for facial emotion classification
- Groq / Llama 3.1 8B for low-latency empathetic response generation
- MediaPipe for face detection
- Stride Bot for the quadruped simulation and gait framework
- ROS 2, Gazebo Fortress, and CHAMP for the embodied robotics pipeline
## License

This project contains components derived from the Stride Bot quadruped simulation, which retains its original GPL-3.0 licence in `Stride_bot-main/LICENSE`. Any additional project code should be licensed according to your repository policy.
ERIS is a real-time emotion-aware interaction system that combines speech and facial emotion recognition, learned multimodal fusion, contextual LLM responses, and emotion-driven quadruped motion into a single integrated pipeline.