## Imports

In [1]:
from ared import EmotionDetector
from ared import ASR
from ared.utils import (
    id2emotion, emotion2id, load_first_50_images, load_audio_from_file
)
import random

random.seed(20)

## Loading the Model
- Provide the paths to the weights of the audio, vision, and text preprocessor models.

**Audio Transcription**

- Any ASR model of your choice can be used for transcription. The implementation below uses the **Qwen** model to transcribe the video.
- The Qwen model is quite large and requires at least 9 GB of space.
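
Because any ASR model can be swapped in, one option is a thin wrapper that exposes the same `convert_speech_to_text(path)` method this notebook calls on `asr_model`. The sketch below is only an assumption about how such a wrapper could look: `CustomASR`, `transcribe_fn`, and `dummy_transcribe` are hypothetical names and are not part of `ared`.

```python
from typing import Callable


class CustomASR:
    """Hypothetical drop-in wrapper around any transcription function."""

    def __init__(self, transcribe_fn: Callable[[str], str]):
        # transcribe_fn maps a media file path to its transcript
        self._transcribe_fn = transcribe_fn

    def convert_speech_to_text(self, path: str) -> str:
        # Same method name the notebook calls on `asr_model`
        return self._transcribe_fn(path).strip()


# Placeholder backend for illustration only; replace with a real ASR call
def dummy_transcribe(path: str) -> str:
    return f" transcript of {path} "


custom_asr = CustomASR(dummy_transcribe)
```

An instance such as `custom_asr` could then stand in wherever the notebook uses `asr_model`, provided the backend returns plain text.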

In [2]:
# paths containing the weights of the model
vis_weights = './weights/vision/MELDSceneNet_best.pt'
audio_weights = './weights/audio/model_best_sentiment.pth'
text_weights = './weights/text/'

device = 'cuda'

# load the emotion detection model
detector = EmotionDetector(vis_model_weights=vis_weights,
                           text_model_weights=text_weights,
                           audio_model_weights=audio_weights,
                           device=device)

# load the ASR model
asr_model = ASR(device)

Some weights of GPT2DoubleHeadsModel were not initialized from the model checkpoint at ./weights/text/ and are newly initialized: ['multiple_choice_head.summary.bias', 'multiple_choice_head.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


audio_start_id: 155163, audio_end_id: 155164, audio_pad_id: 151851.


The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

## Predicting the Emotion

**From a video file directly**
- Given the path to a video, the model computes the emotion.

In [3]:
video_path = './dia0_utt0.mp4'

utterance = asr_model.convert_speech_to_text(video_path)
emotion, probab = detector.detect_emotion(video=video_path, 
                                          audio=video_path, 
                                          text=utterance)
emotion

'anger'
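
The second return value, `probab`, is not displayed above. Assuming it is a flat sequence of per-class probabilities indexed by emotion id, and that `id2emotion` is a mapping from id to label (neither is confirmed by this notebook), the classes could be ranked roughly as follows. `rank_emotions` is a hypothetical helper, and the values below are toy placeholders rather than model output.

```python
from typing import Mapping, Sequence


def rank_emotions(probs: Sequence[float],
                  id_to_label: Mapping[int, str],
                  top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k (label, probability) pairs, highest first."""
    pairs = [(id_to_label[i], float(p)) for i, p in enumerate(probs)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_k]


# Toy values for illustration only; not real model output
toy_labels = {0: 'anger', 1: 'joy', 2: 'neutral'}
toy_probs = [0.7, 0.1, 0.2]
ranked = rank_emotions(toy_probs, toy_labels, top_k=2)
```

If `probab` turns out to be a tensor or a nested array, it would first need to be flattened into a plain list before being passed to a helper like this.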

**From an array of images (the audio must still be loaded from a file)**
- The model takes a sequence of 50 images, the audio signal of the last 2 seconds, and the utterance as input.
- **TODO:** Ideally, the audio should also be processable directly from a NumPy array.

In [4]:
images = load_first_50_images(video_path)
audio = load_audio_from_file(video_path)

utterance = asr_model.convert_speech_to_text(video_path)

emotion, probab = detector.detect_emotion(video=images, audio=audio, text=utterance)
emotion

'anger'
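
For the TODO about feeding audio from a NumPy array, one building block would be cutting the last 2 seconds out of an in-memory waveform. The sketch below assumes a 1-D mono array and a known sample rate; `last_seconds` is a hypothetical helper, the 16 kHz rate is only an example, and whether `detect_emotion` accepts such an array is not established here.

```python
import numpy as np


def last_seconds(waveform: np.ndarray, sample_rate: int,
                 seconds: float = 2.0) -> np.ndarray:
    """Return the trailing `seconds` of a 1-D mono waveform.

    Shorter inputs are left-padded with zeros (silence) so the output
    always has the same length.
    """
    n_samples = int(round(sample_rate * seconds))
    tail = waveform[-n_samples:] if n_samples > 0 else waveform[:0]
    if tail.shape[0] < n_samples:
        pad = np.zeros(n_samples - tail.shape[0], dtype=waveform.dtype)
        tail = np.concatenate([pad, tail])
    return tail


# Synthetic 3-second signal at 16 kHz (sample rate is an assumption)
sr = 16_000
signal = np.arange(3 * sr, dtype=np.float32)
clip = last_seconds(signal, sr)
```

Left-padding keeps the most recent audio aligned with the end of the clip, which matches the "last 2 seconds" description; a different padding policy may suit the trained model better.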

## Using the Webcam for Real-Time Detection

**TODO**
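
This part is not implemented yet. As one possible building block, and assuming a real-time pipeline would reuse the 50-image input described earlier, incoming frames could be held in a rolling window. `FrameWindow` is a hypothetical helper, not part of `ared`; webcam capture and the call to `detector.detect_emotion` are deliberately left out.

```python
from collections import deque


class FrameWindow:
    """Keeps only the most recent `size` frames."""

    def __init__(self, size: int = 50):
        self._frames = deque(maxlen=size)

    def push(self, frame) -> None:
        # Once the window is full, the oldest frame is dropped automatically
        self._frames.append(frame)

    def is_full(self) -> bool:
        return len(self._frames) == self._frames.maxlen

    def snapshot(self) -> list:
        # Copy of the current window, oldest frame first
        return list(self._frames)


window = FrameWindow(size=50)
for frame_id in range(60):  # integers stand in for webcam frames
    window.push(frame_id)
```

Once `is_full()` returns true, `snapshot()` yields a 50-frame sequence that could be passed on for prediction, together with matching audio and text.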