# Multimodal Emotion Recognition (1/2)
## Zichen Xu (zichenxu407@gmail.com)

The feature extraction procedure is adapted, with modifications, from the work presented in https://arxiv.org/abs/1804.05788. This work is also published on Kaggle: https://www.kaggle.com/code/zichenxxx/emo-detection.

The code primarily handles audio and motion capture data from the IEMOCAP dataset. It includes three functions: `get_mocap_hand`, `get_mocap_rot`, and `get_mocap_head`, which are used to read and calculate the average values of hand, rotation, and head motion capture data, respectively. Finally, the function `read_iemocap_mocap` integrates the audio, emotion labels, transcribed text, and the three types of motion capture data mentioned above. It processes and sorts them into a unified format, returning an array that contains all the processed data.

- The functions used are encapsulated in the files `features.py`, `helper.py`, and `mocap_data_collect.py`.
- **The results after data cleaning and preprocessing are stored as .pickle files. These include two datasets: "pikle-data" (a partial dataset using only 1/5 of the data for lightweight model tuning) and "pikle-full-data" (the full dataset), both of which are open-sourced on the Kaggle platform. These can be directly accessed and reused without the need for separate data cleaning and preprocessing each time.**
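
The chunk-averaging idea behind the three `get_mocap_*` helpers can be sketched on synthetic data as follows. This is a minimal illustration, not the code that produced the pickle files: the helper name `average_into_chunks` and the zero-fill for empty chunks are assumptions made for this example.

```python
import numpy as np

def average_into_chunks(frames, n_chunks=200):
    """Reduce a (n_frames, n_features) array to (n_chunks, n_features)
    by averaging consecutive groups of frames.

    Hypothetical helper for illustration only. Chunks that end up empty
    (fewer frames than chunks) are left as zeros, which is an assumption.
    """
    frames = np.asarray(frames, dtype=float)
    out = np.zeros((n_chunks, frames.shape[1]))
    for k, chunk in enumerate(np.array_split(frames, n_chunks)):
        if len(chunk) > 0:
            out[k] = chunk.mean(axis=0)
    return out

# Synthetic stand-in for one utterance of motion-capture frames.
rng = np.random.default_rng(0)
fake_frames = rng.normal(size=(1234, 6))
averaged = average_into_chunks(fake_frames)
print(averaged.shape)  # (200, 6)
```

Every utterance then has the same number of time steps regardless of its duration, which is what allows the samples to be stacked into one tensor later on.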

## 1. Package loading

In [1]:
!pip install torchsummary

Collecting torchsummary
  Downloading torchsummary-1.5.1-py3-none-any.whl.metadata (296 bytes)
Downloading torchsummary-1.5.1-py3-none-any.whl (2.8 kB)
Installing collected packages: torchsummary
Successfully installed torchsummary-1.5.1


In [2]:
import numpy as np
import os
import sys
import pandas as pd

import wave
import copy
import math

# from keras.models import Sequential, Model
# from keras.layers.core import Dense, Activation
# from keras.layers import LSTM, Input, Flatten, Merge
# from keras.layers.wrappers import TimeDistributed
# from keras.optimizers import SGD, Adam, RMSprop
# from keras.layers.normalization import BatchNormalization
from sklearn.preprocessing import label_binarize

module_path = '/kaggle/input/emo-detection1/'
sys.path.append(module_path)
from features import *
from helper import *

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from torchsummary import summary

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

## 2. Parameters

In [3]:
batch_size = 128
nb_feat = 34
nb_class = 4
nb_epoch = 80

optimizer = 'Adadelta'


# code_path = os.path.dirname('/kaggle/input/iemocapfullrelease/IEMOCAP_full_release')
emotions_used = np.array(['ang', 'exc', 'neu', 'sad'])
data_path = os.path.dirname('/kaggle/input/iemocapfullrelease/IEMOCAP_full_release/')
# sessions = ['/Session1']
sessions = ['/Session1', '/Session2', '/Session3', '/Session4', '/Session5']
framerate = 16000

## 3. Data Preprocessing

In [4]:
# import os
# import numpy as np

# def get_mocap_hand(path_to_mocap_hand, filename, start, end):
#     mocap_hand_avg = []  # Collects the hand motion-capture frames inside the time window
#     with open(path_to_mocap_hand + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Try to parse the timestamp as a float
#                 if time_value > start and time_value < end:  # Keep frames inside (start, end)
#                     mocap_hand_avg.append(np.array(data2[2:]).astype(float))  # Convert the mocap values (third column onward) to float and store them
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats

#     if mocap_hand_avg:  # At least one valid row was found
#         mocap_hand_avg = np.array_split(np.array(mocap_hand_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_hand_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (as written, the result is not stored back in the list)
#     else:
#         mocap_hand_avg = [np.zeros(1)]  # No valid rows: return a single zero array

#     return mocap_hand_avg  # Return the chunked hand motion-capture data


# def get_mocap_rot(path_to_mocap_rot, filename, start, end):
#     mocap_rot_avg = []  # Collects the rotation motion-capture frames inside the time window
#     with open(path_to_mocap_rot + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Try to parse the timestamp as a float
#                 if time_value > start and time_value < end:  # Keep frames inside (start, end)
#                     mocap_rot_avg.append(np.array(data2[2:]).astype(float))  # Convert the mocap values (third column onward) to float and store them
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats

#     if mocap_rot_avg:  # At least one valid row was found
#         mocap_rot_avg = np.array_split(np.array(mocap_rot_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_rot_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (as written, the result is not stored back in the list)
#     else:
#         mocap_rot_avg = [np.zeros(1)]  # No valid rows: return a single zero array

#     return mocap_rot_avg  # Return the chunked rotation motion-capture data


# def get_mocap_head(path_to_mocap_head, filename, start, end):
#     mocap_head_avg = []  # Collects the head motion-capture frames inside the time window
#     with open(path_to_mocap_head + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Try to parse the timestamp as a float
#                 if time_value > start and time_value < end:  # Keep frames inside (start, end)
#                     mocap_head_avg.append(np.array(data2[2:]).astype(float))  # Convert the mocap values (third column onward) to float and store them
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats

#     if mocap_head_avg:  # At least one valid row was found
#         mocap_head_avg = np.array_split(np.array(mocap_head_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_head_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (as written, the result is not stored back in the list)
#     else:
#         mocap_head_avg = [np.zeros(1)]  # No valid rows: return a single zero array

#     return mocap_head_avg  # Return the chunked head motion-capture data


# def read_iemocap_mocap():
#     data = []  # Final list of merged utterance records
#     ids = {}  # IDs that have already been added
#     for session in sessions:  # Loop over every session
#         path_to_wav = data_path + session + '/dialog/wav/'  # Audio files
#         path_to_emotions = data_path + session + '/dialog/EmoEvaluation/'  # Emotion labels
#         path_to_transcriptions = data_path + session + '/dialog/transcriptions/'  # Transcripts
#         path_to_mocap_hand = data_path + session + '/dialog/MOCAP_hand/'  # Hand motion capture
#         path_to_mocap_rot = data_path + session + '/dialog/MOCAP_rotated/'  # Rotation motion capture
#         path_to_mocap_head = data_path + session + '/dialog/MOCAP_head/'  # Head motion capture

#         files2 = os.listdir(path_to_wav)  # List everything in the audio directory

#         files = []
#         for f in files2:  # Loop over all directory entries
#             if f.endswith(".wav"):  # Keep only .wav files
#                 if f[0] == '.':
#                     files.append(f[2:-4])  # Hidden file: drop the two-character prefix and the extension
#                 else:
#                     files.append(f[:-4])  # Regular file: drop the extension
#
#         for f in files:  # Loop over all audio files
#             print(f)
#             mocap_f = f
#             if f == 'Ses05M_script01_1b':
#                 mocap_f = 'Ses05M_script01_1'  # Special case: this dialog uses a different mocap file name

#             wav = get_audio(path_to_wav, f + '.wav')  # Read the audio file
#             transcriptions = get_transcriptions(path_to_transcriptions, f + '.txt')  # Read the transcript
#             emotions = get_emotions(path_to_emotions, f + '.txt')  # Read the emotion labels
#             sample = split_wav(wav, emotions)  # Cut the audio into one segment per utterance

#             for ie, e in enumerate(emotions):  # Loop over every labelled utterance
#                 e['signal'] = sample[ie]['left']  # Attach the left-channel signal of the segment
#                 e.pop("left", None)
#                 e.pop("right", None)
#                 e['transcription'] = transcriptions[e['id']]  # Attach the transcript
#                 e['mocap_hand'] = get_mocap_hand(path_to_mocap_hand, mocap_f + '.txt', e['start'], e['end'])  # Hand motion capture
#                 e['mocap_rot'] = get_mocap_rot(path_to_mocap_rot, mocap_f + '.txt', e['start'], e['end'])  # Rotation motion capture
#                 e['mocap_head'] = get_mocap_head(path_to_mocap_head, mocap_f + '.txt', e['start'], e['end'])  # Head motion capture
#                 if e['emotion'] in emotions_used:  # Keep only the emotions in emotions_used
#                     if e['id'] not in ids:  # Skip IDs that were already added
#                         data.append(e)  # Add the record to the final list
#                         ids[e['id']] = 1  # Mark the ID as processed

#     sort_key = get_field(data, "id")  # Sort key
#     return np.array(data)[np.argsort(sort_key)]  # Return the records sorted by ID

# data = read_iemocap_mocap()  # Read the IEMOCAP dataset

In [5]:
# import pickle  # Used to serialize and deserialize Python objects

# # Open 'data_collected.pickle' in 'wb' (write binary) mode
# # '/kaggle/working/' is the output directory
# with open('/kaggle/working/'+'data_collected.pickle', 'wb') as handle:

#     # Serialize the 'data' object into the open file with pickle.dump
#     # protocol=pickle.HIGHEST_PROTOCOL selects the highest available pickle protocol
#     pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 4. Modeling

In this section, I will perform more detailed feature processing on the previously cleaned and feature-extracted dataset, build the relevant functions, and ultimately use the dataset for model analysis.

In [6]:
import pickle  # Used to read and write pickle files

# Open the pickle file that holds the preprocessed data
with open('/kaggle/input/pikle-full-data/data_collected_full.pickle', 'rb') as handle:
    data2 = pickle.load(handle)  # Load the records into data2

## 4.1 Feature extraction: Language

In [7]:
import torch  # PyTorch
from torchtext.data.utils import get_tokenizer  # Tokenizer factory from torchtext
from torchtext.vocab import build_vocab_from_iterator  # Builds a vocabulary from an iterator
from torch.nn.utils.rnn import pad_sequence  # Pads sequences to a common length

# Assumes data2 has already been loaded and contains the required fields

text = [ses_mod['transcription'] for ses_mod in data2]  # Transcript of every utterance

MAX_SEQUENCE_LENGTH = 500  # Maximum sequence length

# Use torchtext's basic_english tokenizer
tokenizer = get_tokenizer('basic_english')

# Build the vocabulary
def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(text), specials=["<unk>"])  # Build the vocabulary with the special token "<unk>"
vocab.set_default_index(vocab["<unk>"])  # Map out-of-vocabulary tokens to "<unk>"

# Convert each transcript to a sequence of token indices
token_tr_X = [vocab(tokenizer(t)) for t in text]

# Pad all sequences to the length of the longest one
x_train_text = pad_sequence([torch.tensor(seq, dtype=torch.long) for seq in token_tr_X], 
                            batch_first=True, padding_value=vocab["<unk>"])

# Truncate sequences longer than the maximum length
x_train_text = x_train_text[:, :MAX_SEQUENCE_LENGTH]

# Pad further if the sequences are shorter than the maximum length
if x_train_text.size(1) < MAX_SEQUENCE_LENGTH:
    pad_size = MAX_SEQUENCE_LENGTH - x_train_text.size(1)
    x_train_text = torch.cat((x_train_text, torch.full((x_train_text.size(0), pad_size), vocab["<unk>"], dtype=torch.long)), dim=1)

print(x_train_text.shape)  # Check the resulting shape

torch.Size([4936, 500])
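
The tokenize, index, truncate and pad steps of the cell above can be followed on a toy example without torchtext. This is only a sketch: the regex tokenizer is a rough stand-in for `basic_english`, the helper names (`tokenize`, `build_vocab`, `encode`) are hypothetical, and indices are assigned in order of first appearance, whereas `build_vocab_from_iterator` orders tokens by frequency.

```python
import re

def tokenize(text):
    # Rough stand-in for a basic English tokenizer: lowercase the text,
    # then split it into word tokens and single punctuation tokens.
    return re.findall(r"[a-z0-9']+|[.,!?;]", text.lower())

def build_vocab(texts, unk="<unk>"):
    # Index 0 is reserved for the unknown token, which doubles as padding here.
    vocab = {unk: 0}
    for t in texts:
        for tok in tokenize(t):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab, max_len, unk="<unk>"):
    # Map tokens to indices, cut to max_len, then right-pad with the unk index.
    ids = [vocab.get(tok, vocab[unk]) for tok in tokenize(text)]
    ids = ids[:max_len]
    return ids + [vocab[unk]] * (max_len - len(ids))

texts = ["Stop. Would you just go away?", "I mean, you're impossible."]
vocab = build_vocab(texts)
batch = [encode(t, vocab, max_len=10) for t in texts]
for row in batch:
    print(row)
```

Because the padding value and the unknown-token index are the same (index 0), the model cannot tell padding apart from out-of-vocabulary words; a dedicated `<pad>` token would be one way to separate the two.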


In [8]:
import torch  # PyTorch
import numpy as np  # NumPy
import os  # Operating-system utilities
from torchtext.data.utils import get_tokenizer  # Tokenizer factory from torchtext
from torchtext.vocab import build_vocab_from_iterator, Vocab  # Vocabulary builder and Vocab class

EMBEDDING_DIM = 300  # Dimension of the word embeddings

# Assumes data2 has already been loaded and contains the required fields
text = [ses_mod['transcription'] for ses_mod in data2]  # Transcript of every utterance

# Use torchtext's basic_english tokenizer
tokenizer = get_tokenizer('basic_english')

# Build the vocabulary
def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(text), specials=["<unk>"])  # Build the vocabulary with the special token "<unk>"
vocab.set_default_index(vocab["<unk>"])  # Map out-of-vocabulary tokens to "<unk>"
word_index = vocab.get_stoi()  # Token-to-index mapping
print(f'Found {len(word_index)} unique tokens')  # Number of unique tokens in the vocabulary

file_loc = os.path.join('/kaggle/input/glove42b300dtxt/glove.42B.300d.txt')  # Path to the GloVe file
print(file_loc)  # Print the file path

# Load the GloVe word embeddings
gembeddings_index = {}
with open(file_loc, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        gembedding = np.asarray(values[1:], dtype='float32')
        gembeddings_index[word] = gembedding
print(f'G Word embeddings: {len(gembeddings_index)}')  # Number of GloVe vectors loaded

# Build the embedding matrix
nb_words = len(word_index) + 1
g_word_embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    gembedding_vector = gembeddings_index.get(word)
    if gembedding_vector is not None:
        g_word_embedding_matrix[i] = gembedding_vector

print(f'G Null word embeddings: {np.sum(np.sum(g_word_embedding_matrix, axis=1) == 0)}')  # Number of rows left without a pretrained vector

# Convert to the tensor expected by a PyTorch embedding layer
g_word_embedding_matrix = torch.tensor(g_word_embedding_matrix, dtype=torch.float32)

# Initialize the embedding layer from the pretrained vectors
embedding_layer = torch.nn.Embedding.from_pretrained(g_word_embedding_matrix, freeze=False)

# Print the weight shape of the embedding layer as a sanity check
print(f'Embedding layer weights shape: {embedding_layer.weight.shape}')

Found 3129 unique tokens
/kaggle/input/glove42b300dtxt/glove.42B.300d.txt
G Word embeddings: 1917494
G Null word embeddings: 475
Embedding layer weights shape: torch.Size([3130, 300])
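
How the embedding matrix is filled, and why some rows stay empty, can be seen on a toy vocabulary. This sketch assumes a four-dimensional embedding and made-up vectors; `word_index` and `pretrained` are illustrative stand-ins, not data from the notebook.

```python
import numpy as np

EMBEDDING_DIM = 4  # toy size; the notebook uses 300

# Toy stand-ins for the vocabulary and for the pretrained vectors.
word_index = {"<unk>": 0, "stop": 1, "go": 2, "zzyzx": 3}
pretrained = {
    "stop": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "go": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}

# One row per vocabulary index, plus one spare row as in the notebook.
matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM), dtype="float32")
for word, idx in word_index.items():
    vec = pretrained.get(word)
    if vec is not None:
        matrix[idx] = vec  # rows without a pretrained vector stay all-zero

# Count rows that are entirely zero (stricter than testing whether the row sum is 0).
missing = int(np.sum(~matrix.any(axis=1)))
print(matrix.shape, missing)  # (5, 4) 3
```

In this toy case the three empty rows are `<unk>`, the unseen word, and the spare last row, so a null-embedding count of this kind includes rows that are not real words.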


In [9]:
# Assumes data2 has already been loaded and contains the required fields
text_store = [ses_mod['transcription'] for ses_mod in data2]  # Transcript of every utterance

print(x_train_text.shape)
i=4900
print(text_store[i])

torch.Size([4936, 500])
Stop.  Would you just go away?  I mean, you're conceited, and overbearing, and utterly impossible.


## 4.2 Feature extraction: Audio

In [10]:
import torch  # PyTorch
import numpy as np  # NumPy

def calculate_features(frames, freq, options):
    window_sec = 0.2  # Window length in seconds
    window_n = int(freq * window_sec)  # Number of samples per window
    
    # Extract short-term features with a 50% window overlap
    st_f = stFeatureExtraction(frames, freq, window_n, window_n // 2)
    st_f = torch.tensor(st_f, dtype=torch.float32)  # Convert the NumPy array to a PyTorch tensor
    
    if st_f.shape[1] > 2:  # More than two frames
        i0 = 1
        i1 = st_f.shape[1] - 1
        if i1 - i0 < 1:
            i1 = i0 + 1
        
        deriv_st_f = torch.zeros((st_f.shape[0], i1 - i0), dtype=torch.float32)  # Output tensor (first and last frame are dropped)
        for i in range(i0, i1):
            i_left = i - 1
            i_right = i + 1
            deriv_st_f[:, i - i0] = st_f[:, i]  # Copy frame i into the output (no derivative is computed)
        return deriv_st_f  # Return the feature tensor
    elif st_f.shape[1] == 2:  # Exactly two frames
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor with a single frame
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first frame
        return deriv_st_f  # Return the feature tensor
    else:  # Fewer than two frames
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor with a single frame
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first frame
        return deriv_st_f  # Return the feature tensor


In [11]:
import torch  # PyTorch
import numpy as np  # NumPy
import soundfile as sf  # Used to write audio data to disk

def calculate_features(frames, freq, options):
    window_sec = 0.2  # Window length in seconds
    window_n = int(freq * window_sec)  # Number of samples per window
    
    # Extract short-term features with a 50% window overlap
    st_f = stFeatureExtraction(frames, freq, window_n, window_n // 2)
    st_f = torch.tensor(st_f, dtype=torch.float32)  # Convert the NumPy array to a PyTorch tensor
    
    if st_f.shape[1] > 2:  # More than two frames
        i0 = 1
        i1 = st_f.shape[1] - 1
        if i1 - i0 < 1:
            i1 = i0 + 1
        
        deriv_st_f = torch.zeros((st_f.shape[0], i1 - i0), dtype=torch.float32)  # Output tensor (first and last frame are dropped)
        for i in range(i0, i1):
            i_left = i - 1
            i_right = i + 1
            deriv_st_f[:, i - i0] = st_f[:, i]  # Copy frame i into the output (no derivative is computed)
        return deriv_st_f  # Return the feature tensor
    elif st_f.shape[1] == 2:  # Exactly two frames
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor with a single frame
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first frame
        return deriv_st_f  # Return the feature tensor
    else:  # Fewer than two frames
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor with a single frame
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first frame
        return deriv_st_f  # Return the feature tensor

# Pad (or truncate) a list of feature sequences into one fixed-size array
def pad_sequence_into_array(sequences, maxlen):
    num_samples = len(sequences)  # Number of sequences
    if len(sequences) == 0:
        return torch.zeros((num_samples, maxlen, 1), dtype=torch.float32), 0  # Empty input: return an empty zero tensor
    num_features = sequences[0].shape[0] if len(sequences[0].shape) > 1 else 1  # Number of features per frame
    padded_array = torch.zeros((num_samples, num_features, maxlen), dtype=torch.float32)  # Zero tensor that receives the padded sequences
    for i, seq in enumerate(sequences):
        if len(seq.shape) == 1:
            seq = seq.reshape(1, -1)  # Promote 1-D sequences to 2-D
        length = min(maxlen, seq.shape[1])  # Sequence length, capped at maxlen
        padded_array[i, :, :length] = seq[:, :length]  # Copy the sequence into the padded array
    return padded_array, length  # Return the padded array and the length of the last sequence

# Process the speech signals and convert them to PyTorch tensors
x_train_speech = []  # Processed speech features
counter = 0

for ses_mod in data2:
    x_head = ses_mod['signal']  # Raw speech signal
    
    # Save the first and second audio signals as example WAV files
#     if counter == 0 or counter == 1:
#         sf.write(f'original_audio_{counter}.wav', x_head, framerate)
    
    st_features = calculate_features(x_head, framerate, None)  # Compute the features of the speech signal
    st_features, _ = pad_sequence_into_array([st_features], maxlen=100)  # Pad or truncate to 100 frames
    x_train_speech.append(st_features)  # Store the processed features
    counter += 1
    if counter % 100 == 0:
        print(counter)

x_train_speech = torch.cat(x_train_speech, dim=0)  # Concatenate all tensors along the sample axis
print(x_train_speech.shape)  # Shape of the processed speech features

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900
torch.Size([4936, 34, 100])
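
The fixed length of 100 frames follows from simple window arithmetic. The sketch below assumes that `stFeatureExtraction` slides a 0.2 s window in steps of half a window without padding, which is an assumption about `features.py`. `n_frames` and `kept_columns` are hypothetical helper names, and the drop of the first and last frame mirrors `calculate_features` above.

```python
def n_frames(n_samples, win, step):
    # Number of full windows that fit when sliding by `step` without padding.
    if n_samples < win:
        return 0
    return (n_samples - win) // step + 1

def kept_columns(total):
    # calculate_features drops the first and last frame when there are
    # more than two frames, and keeps a single frame otherwise.
    return total - 2 if total > 2 else 1

framerate = 16000
win = int(framerate * 0.2)  # 3200 samples per window
step = win // 2             # 1600 samples between window starts

for seconds in (1, 3, 12):
    total = n_frames(seconds * framerate, win, step)
    kept = kept_columns(total)
    final = min(kept, 100)  # longer utterances are cut to 100 columns
    print(seconds, total, kept, final)
```

Under these assumptions, utterances shorter than roughly 10 seconds are zero-padded on the right and longer ones are truncated.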


In [12]:
counter=0
for ses_mod in data2:
    x_head = ses_mod['signal']  # Raw speech signal
    # Save samples 4895-4904 as example WAV files
    if counter >= 4895 and counter <= 4904:
        sf.write(f'original_audio_{counter}.wav', x_head, framerate)
    counter += 1
    if counter % 100 == 0:
        print(counter)

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500
3600
3700
3800
3900
4000
4100
4200
4300
4400
4500
4600
4700
4800
4900


In [13]:
import torch  # PyTorch
import numpy as np  # NumPy
import soundfile as sf  # Used to write audio data to disk
from IPython.display import Audio, display  # Used to display and play audio files

# Display and play one of the stored WAV files
display(Audio(f'original_audio_{i}.wav'))

## 4.3 Feature extraction: Motion capture

In [14]:
def process_mocap_data(data, expected_length=200, expected_features=189):
    if not isinstance(data, list):
        data = [data]
    
    processed_data = np.zeros((expected_length, expected_features))
    
    for i, row in enumerate(data):
        if i >= expected_length:
            break
        if isinstance(row, (list, np.ndarray)):
            if len(np.array(row).shape) > 1:
                row = np.array(row).flatten()
            row = np.nan_to_num(row, nan=0.0)  # Replace NaN with 0
            processed_data[i, :min(len(row), expected_features)] = row[:expected_features]
    
    return processed_data

def normalize_mocap(data):
    mean = np.nanmean(data, axis=(0, 1), keepdims=True)
    std = np.nanstd(data, axis=(0, 1), keepdims=True)
    return (data - mean) / (std + 1e-8)

x_train_mocap = []
for ses_mod in data2:
    x_head = process_mocap_data(ses_mod['mocap_head'], expected_length=200, expected_features=18)
    x_hand = process_mocap_data(ses_mod['mocap_hand'], expected_length=200, expected_features=6)
    x_rot = process_mocap_data(ses_mod['mocap_rot'], expected_length=200, expected_features=165)

    x_head = normalize_mocap(x_head)
    x_hand = normalize_mocap(x_hand)
    x_rot = normalize_mocap(x_rot)

    x_mocap = np.concatenate((x_head, x_hand, x_rot), axis=1)
    x_train_mocap.append(torch.tensor(x_mocap, dtype=torch.float32))

x_train_mocap = torch.stack(x_train_mocap)
x_train_mocap = x_train_mocap.view(-1, 200, 189, 1)

# Replace any remaining NaN values with 0
x_train_mocap = torch.nan_to_num(x_train_mocap, nan=0.0)

In [15]:
def check_nan_mocap(data, name):
    nan_mask = torch.isnan(data)
    if nan_mask.any():
        print(f"{name} contains NaN values:")
        print(f"Total NaN values: {nan_mask.sum().item()}")
        print(f"Samples with NaN: {nan_mask.any(dim=(1,2,3)).sum().item()}/{data.shape[0]}")
        return True
    return False

check_nan_mocap(x_train_mocap, "Mocap data")

False

In [16]:
def interpolate_nan_mocap(data):
    data_np = data.numpy()
    for i in range(data_np.shape[0]):
        for j in range(data_np.shape[2]):
            data_np[i, :, j, 0] = pd.Series(data_np[i, :, j, 0]).interpolate().values
    return torch.from_numpy(data_np)

if check_nan_mocap(x_train_mocap, "Mocap data"):
    x_train_mocap = interpolate_nan_mocap(x_train_mocap)

In [17]:
print(x_train_mocap.shape)  # Check the resulting shape

torch.Size([4936, 200, 189, 1])
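
`normalize_mocap` standardizes each motion-capture block of each sample with a single mean and standard deviation taken over both axes. The following sketch reproduces that behaviour on synthetic data; `normalize_block` is a hypothetical name and the random inputs are not IEMOCAP values.

```python
import numpy as np

def normalize_block(block, eps=1e-8):
    # One scalar mean and one scalar std over the whole (time, feature) block.
    mean = np.nanmean(block)
    std = np.nanstd(block)
    return (block - mean) / (std + eps)

rng = np.random.default_rng(1)
head = rng.normal(loc=5.0, scale=2.0, size=(200, 18))  # synthetic block
empty = np.zeros((200, 6))  # e.g. a stream with no valid frames

head_n = normalize_block(head)
empty_n = normalize_block(empty)

print(round(float(head_n.mean()), 6), round(float(head_n.std()), 6))
print(bool(empty_n.any()))  # False: an all-zero block stays all-zero
```

The small `eps` keeps the division defined when a block is constant. Since the statistics are computed per sample, no information from the test split leaks into the training split through this step.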


In [18]:
import torch
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Collect the emotion label of every utterance
Y = []
for ses_mod in data2:
    Y.append(ses_mod['emotion'])

# Encode the string labels as integers with LabelEncoder
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

# Convert Y to a PyTorch tensor of type long
Y = torch.tensor(Y, dtype=torch.long)

# Print the shape of Y
print(Y.shape)  # Should be torch.Size([950]) or torch.Size([4936])

torch.Size([4936])


In [19]:
from collections import Counter

print(Counter(Y.numpy()))

Counter({2: 1708, 0: 1103, 3: 1084, 1: 1041})


## 4.4 Dataset preparation

In [20]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class EmotionDataset(Dataset):
    def __init__(self, text, speech, mocap, labels):
        self.text = [t.clone().detach().long() for t in text]
        self.speech = [s.clone().detach().float() for s in speech]
        self.mocap = [m.clone().detach().float() for m in mocap]
        self.labels = [l.clone().detach().long() for l in labels]
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'text': self.text[idx],
            'speech': self.speech[idx],
            'mocap': self.mocap[idx],
            'labels': self.labels[idx]
        }

In [21]:
from sklearn.model_selection import train_test_split

# First, split the dataset into a training set and a test set
X_text_train, X_text_test, X_speech_train, X_speech_test, X_mocap_train, X_mocap_test, y_train, y_test = train_test_split(
    x_train_text, x_train_speech, x_train_mocap, Y, test_size=0.2, random_state=66, stratify=Y
)

# Flatten and concatenate all training features into one 2-D array
X_combined_train = np.hstack((X_text_train.numpy(), 
                              X_speech_train.numpy().reshape(X_speech_train.shape[0], -1), 
                              X_mocap_train.numpy().reshape(X_mocap_train.shape[0], -1)))
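
SMOTE expects one flat feature vector per sample, so the three modalities are laid out side by side in a single row. With the shapes printed earlier, each row has 500 + 34·100 + 200·189·1 = 41,700 columns. The toy example below, with much smaller made-up shapes, shows the layout and how each modality can be recovered by slicing at the same offsets.

```python
import numpy as np

n = 5  # number of toy samples
text = np.arange(n * 4).reshape(n, 4)                        # (n, 4)
speech = np.arange(n * 2 * 3).reshape(n, 2, 3) + 100         # (n, 2, 3)
mocap = np.arange(n * 3 * 2 * 1).reshape(n, 3, 2, 1) + 1000  # (n, 3, 2, 1)

# Flatten everything except the sample axis and concatenate column-wise.
flat = np.hstack((text, speech.reshape(n, -1), mocap.reshape(n, -1)))

# Column offsets of each modality inside the flat row.
t_len = text.shape[1]
s_len = speech.shape[1] * speech.shape[2]

text_back = flat[:, :t_len]
speech_back = flat[:, t_len:t_len + s_len].reshape(-1, *speech.shape[1:])
mocap_back = flat[:, t_len + s_len:].reshape(-1, *mocap.shape[1:])

print(flat.shape)  # (5, 16)
```

One consequence of this layout is that interpolation treats token indices like continuous values, so synthetic samples can contain fractional "token ids" that are later cast to integers.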

In [22]:
import numpy as np
from imblearn.over_sampling import SMOTE
from collections import Counter

# Assumes y_train holds the labels of the training set

# First, count the samples in each class
class_counts = Counter(y_train.numpy())

# Size of the majority class
max_samples = max(class_counts.values())

# Resampling strategy: raise every minority class to the size of the majority class
sampling_strategy = {cls: max_samples for cls in class_counts.keys() if class_counts[cls] < max_samples}

# Apply SMOTE
smote = SMOTE(sampling_strategy=sampling_strategy, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_combined_train, y_train.numpy())

# Print the class distribution after resampling
print("Class distribution after resampling:", Counter(y_resampled))

Class distribution after resampling: Counter({3: 1366, 2: 1366, 0: 1366, 1: 1366})


In [23]:
from sklearn.utils.class_weight import compute_class_weight

num_workers = 4

# Split the resampled training data back into its original per-modality shapes
text_shape = X_text_train.shape[1]
speech_shape = X_speech_train.shape[1] * X_speech_train.shape[2]
mocap_shape = X_mocap_train.shape[1] * X_mocap_train.shape[2] * X_mocap_train.shape[3]

X_text_resampled = torch.tensor(X_resampled[:, :text_shape])
X_speech_resampled = torch.tensor(X_resampled[:, text_shape:text_shape+speech_shape]).reshape(-1, X_speech_train.shape[1], X_speech_train.shape[2])
X_mocap_resampled = torch.tensor(X_resampled[:, text_shape+speech_shape:]).reshape(-1, X_mocap_train.shape[1], X_mocap_train.shape[2], X_mocap_train.shape[3])
y_resampled = torch.as_tensor(y_resampled).clone().detach()

# Build the datasets: resampled training set and original test set
train_dataset = EmotionDataset(X_text_resampled, X_speech_resampled, X_mocap_resampled, y_resampled)
test_dataset = EmotionDataset(X_text_test, X_speech_test, X_mocap_test, y_test)

# Build pred_dataset from the first 10 samples of the test set
pred_dataset = EmotionDataset(X_text_test[:10], X_speech_test[:10], X_mocap_test[:10], y_test[:10])


# Create the data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False, pin_memory=True)
pred_loader = DataLoader(pred_dataset, batch_size=1, shuffle=False)  # batch_size=1 so that each sample is evaluated on its own


# Compute class weights on the resampled training set
class_weights = compute_class_weight('balanced', classes=np.unique(y_resampled), y=y_resampled.numpy())
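
The `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * count_c)`. The pure-Python sketch below applies that formula to the label counts printed earlier for the full label set and to the post-SMOTE counts; `balanced_weights` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from collections import Counter

def balanced_weights(labels):
    # weight_c = n_samples / (n_classes * count_c)
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}

# Label counts of the full dataset, taken from the Counter output above.
before = [0] * 1103 + [1] * 1041 + [2] * 1708 + [3] * 1084
# Label counts after SMOTE: 1366 samples in each of the four classes.
after = [c for c in range(4) for _ in range(1366)]

w_before = balanced_weights(before)
w_after = balanced_weights(after)
print(w_before)
print(w_after)  # every class gets weight 1.0
```

Since SMOTE has already equalized the classes, weights computed on the resampled labels are all 1.0 and have no effect if passed to a loss function.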

## 4.5 Function preparation

In [24]:
import time  # Used by train_model to time each epoch
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
import numpy as np

def calculate_metrics(outputs, labels):
    preds = torch.max(outputs, dim=1)[1].cpu().numpy()
    labels = labels.cpu().numpy()
    
    # F1 score(Weighted)
    f1 = f1_score(labels, preds, average='weighted')
    
    # Unweighted accuracy (computed as the plain fraction of correct predictions)
    uw_acc = accuracy_score(labels, preds)
    
    return f1, uw_acc

# Model validation
def validate(model, val_loader, criterion, device):
    val_loss = 0
    val_f1, val_uw_acc = 0, 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in val_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            outputs = model(text, speech, mocap)  # Forward pass
            loss = criterion(outputs, labels)  # Compute the loss
            val_loss += loss.item()  # Accumulate the loss
            
            f1, uw_acc = calculate_metrics(outputs, labels)
            val_f1 += f1
            val_uw_acc += uw_acc

    val_loss /= len(val_loader)  # Average loss per batch
    val_f1 /= len(val_loader)  # Average F1 score per batch
    val_uw_acc /= len(val_loader)  # Average unweighted accuracy per batch

    return val_loss, val_f1, val_uw_acc

# Print the training results
def print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, 
              val_loss, val_f1, val_uw_acc, epochs=10):
    print(f"Epoch [{epoch+1}/{epochs}], time: {train_time:.2f}s")
    print(f"Train - loss: {train_loss:.4f}, F1: {train_f1:.4f}, UW_Acc: {train_uw_acc:.4f}")
    print(f"Val - loss: {val_loss:.4f}, F1: {val_f1:.4f}, UW_Acc: {val_uw_acc:.4f}")

# Model training
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs, device):
    train_losses = []
    train_f1s, train_uw_accs = [], []
    val_losses = []
    val_f1s, val_uw_accs = [], []
    
    for epoch in range(num_epochs):
        model.train()
        
        train_loss = 0
        train_f1, train_uw_acc = 0, 0
        
        start_time = time.time()  # Start time of this epoch
        
        for batch in train_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            # Skip batches whose inputs contain NaN
            if torch.isnan(text).any() or torch.isnan(speech).any() or torch.isnan(mocap).any():
                print("Input data contains NaN!")
                continue
            
            optimizer.zero_grad()  # Reset the gradients of all model parameters
            outputs = model(text, speech, mocap)  # Forward pass
            
            loss = criterion(outputs, labels)  # Compute the loss
            train_loss += loss.item()  # Accumulate the loss
            
            f1, uw_acc = calculate_metrics(outputs, labels)
            train_f1 += f1
            train_uw_acc += uw_acc
            
            loss.backward()  # Backpropagate to compute the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()  # Update the model parameters
        
        end_time = time.time()  # End time of this epoch
        train_time = end_time - start_time  # Training time of this epoch
        
        train_loss /= len(train_loader)  # Average loss per batch
        train_f1 /= len(train_loader)  # Average F1 score per batch
        train_uw_acc /= len(train_loader)  # Average unweighted accuracy per batch
        
        val_loss, val_f1, val_uw_acc = validate(model, val_loader, criterion, device)
        
        scheduler.step(val_f1)
        
        train_losses.append(train_loss)
        train_f1s.append(train_f1)
        train_uw_accs.append(train_uw_acc)
        val_losses.append(val_loss)
        val_f1s.append(val_f1)
        val_uw_accs.append(val_uw_acc)
        
        print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, 
                  val_loss, val_f1, val_uw_acc, epochs=num_epochs)  # Print the results of this epoch

    return train_losses, train_f1s, train_uw_accs, val_losses, val_f1s, val_uw_accs

In [25]:
def evaluate_model(model, test_loader, criterion, device):
    val_loss, f1, uw_acc = validate(model, test_loader, criterion, device)
    print(f"Test Results:")
    print(f"Loss: {val_loss:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Unweighted Accuracy: {uw_acc:.4f}")

In [26]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
from torch.utils.data import DataLoader
import soundfile as sf
import tempfile

def predict(model, data_loader, device):
    model.eval()  # Switch the model to evaluation mode
    predictions = []
    
    with torch.no_grad():
        for batch in data_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            
            outputs = model(text=text, speech=speech, mocap=mocap)  # Forward pass
            _, predicted = torch.max(outputs, 1)  # Predicted class of each sample
            predictions.extend(predicted.cpu().numpy())  # Append the predictions to the list

    return predictions

In [27]:
def print_model_summary(model, input_size):
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"Model Structure:\n{model}\n")
    print(f"Input size: {input_size}")
    
    total_params = 0
    for name, module in model.named_children():
        params = sum(p.numel() for p in module.parameters() if p.requires_grad)
        total_params += params
        print(f"{name}: {params:,} parameters")
        
        if hasattr(module, 'named_children'):
            for sub_name, sub_module in module.named_children():
                sub_params = sum(p.numel() for p in sub_module.parameters() if p.requires_grad)
                print(f"  {sub_name}: {sub_params:,} parameters")
    
    print(f"\nTotal trainable parameters: {total_params:,}")
    
    print("\nDetailed parameter shapes:")
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name}: {param.shape}")

In [28]:
# Parameter initialization
nb_words = 3130 # 3130 or 1419
embedding_dim = 300
max_sequence_length = 500  # Maximum sequence length

## 5. Deep learning

## Model 1: Text + Speech

In [29]:
import torch  # PyTorch
import torch.nn as nn  # Neural-network modules
import torch.nn.functional as F  # Functional API (activation functions, loss functions, etc.)


# Text model
class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn_dense = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = F.relu(self.bn_dense(self.dense(x)))
        return x

# Speech model class
class SpeechModel(nn.Module):
    def __init__(self):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.dense2 = nn.Linear(1024, 512)
        self.bn2 = nn.BatchNorm1d(512)
        self.dense3 = nn.Linear(512, 256)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.dense2(x)))
        x = self.dropout(x)
        x = F.relu(self.dense3(x))
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(CombinedModel, self).__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)
        self.speech_model = SpeechModel()
        self.fc1 = nn.Linear(512, 256)
        self.fc2 = nn.Linear(256, 4)
    
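    def forward(self, text, speech, mocap=None):
        # mocap is accepted only for compatibility with the three-argument call
        # model(text, speech, mocap) used below; this two-branch model ignores it.
        text = text.long()
        speech = speech.float()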
    
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        combined = torch.cat((text_out, speech_out), dim=1)
        combined = F.relu(self.fc1(combined))
        combined = self.fc2(combined)
        return combined
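
The `32 * max_sequence_length` input size of `TextModel.dense` follows from the convolution settings: with `kernel_size=3`, `padding=1` and the default stride of 1, each `Conv1d` preserves the sequence length. A minimal sketch of that arithmetic, using the standard output-length formula for a 1D convolution (the helper name `conv_out_len` is hypothetical):

```python
def conv_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    # Standard output-length formula for a 1D convolution
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


max_sequence_length = 500

length = max_sequence_length
for _ in range(4):  # conv1 .. conv4 in TextModel
    length = conv_out_len(length, kernel_size=3, padding=1)

flat_features = 32 * length  # 32 channels after conv4, flattened
fused_features = 256 + 256   # text and speech branch outputs, concatenated

print(length, flat_features, fused_features)
```

The flattened size agrees with the `Linear(in_features=16000, ...)` layer in the printed model, and the fused width agrees with `fc1 = nn.Linear(512, 256)`.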

In [30]:
# Select the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)

# Print the model structure
print(model_combined)

# print("Text Model Summary:")
# print_model_summary(model_combined.text_model, (max_sequence_length,))

# print("\nSpeech Model Summary:")
# print_model_summary(model_combined.speech_model, (100, 34))

print("\nCombined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (dense): Linear(in_features=16000, out_features=256, bias=True)
    (bn_dense): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

In [31]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Sanity check on a single training batch
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape: (batch_size, num_classes)
print(outputs.shape)

Using 2 GPUs!
torch.Size([128, 4])
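
One practical consequence of wrapping a model in `nn.DataParallel` is that the original network lives under `.module`, and the keys of `state_dict()` gain a `module.` prefix. The following is a small sketch, not part of the original notebook, of a helper (hypothetically named `unwrap`) that returns the underlying module so checkpoints can be saved and loaded independently of the number of GPUs:

```python
import torch.nn as nn


def unwrap(model):
    """Return the underlying module if `model` is wrapped in nn.DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


net = nn.Linear(4, 2)          # stand-in for model_combined
wrapped = nn.DataParallel(net)

wrapped_keys = list(wrapped.state_dict().keys())
plain_keys = list(unwrap(wrapped).state_dict().keys())

print(wrapped_keys)
print(plain_keys)
```

Saving `unwrap(model).state_dict()` keeps the keys free of the prefix, so the weights load into an unwrapped model without renaming.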


In [32]:
import torch.optim as optim
from torch.optim import lr_scheduler

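# Loss function. Note: class_weights is prepared above but is not passed here;
# use nn.CrossEntropyLoss(weight=class_weights) to apply class weighting.
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: halve the LR when the monitored metric
# (mode='max', so higher is better) stops improving for more than 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Run training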
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 4.78s
Train - loss: 1.3212, F1: 0.3707, UW_Acc: 0.3859
Val - loss: 1.2829, F1: 0.3388, UW_Acc: 0.3933
Epoch [2/80], time: 3.19s
Train - loss: 1.0742, F1: 0.5331, UW_Acc: 0.5480
Val - loss: 1.0858, F1: 0.4950, UW_Acc: 0.5170
Epoch [3/80], time: 3.17s
Train - loss: 0.9018, F1: 0.6251, UW_Acc: 0.6296
Val - loss: 1.1226, F1: 0.5176, UW_Acc: 0.5326
Epoch [4/80], time: 3.17s
Train - loss: 0.7610, F1: 0.6919, UW_Acc: 0.6943
Val - loss: 1.0435, F1: 0.5555, UW_Acc: 0.5633
Epoch [5/80], time: 3.17s
Train - loss: 0.6333, F1: 0.7555, UW_Acc: 0.7562
Val - loss: 0.9975, F1: 0.5990, UW_Acc: 0.6025
Epoch [6/80], time: 3.13s
Train - loss: 0.5227, F1: 0.8029, UW_Acc: 0.8019
Val - loss: 0.9679, F1: 0.6111, UW_Acc: 0.6149
Epoch [7/80], time: 3.36s
Train - loss: 0.4536, F1: 0.8338, UW_Acc: 0.8335
Val - loss: 1.0967, F1: 0.5752, UW_Acc: 0.5699
Epoch [8/80], time: 3.18s
Train - loss: 0.3836, F1: 0.8620, UW_Acc: 0.8613
Val - loss: 1.0748, F1: 0.6063, UW_Acc: 0.6157
Epoch [9/80], time: 3.27

([1.3211665874303773,
  1.0742320016373035,
  0.9017948868662812,
  0.7610041901122692,
  0.6332793097163356,
  0.5226917350014975,
  0.4536152003809463,
  0.3835705518722534,
  0.3314973784740581,
  0.2735847647106925,
  0.2507367764794549,
  0.22548675848994143,
  0.19502402010352113,
  0.1947686346464379,
  0.16205689172412074,
  0.14906084312255993,
  0.15259484747468038,
  0.14363009222718173,
  0.11863893816290899,
  0.11968681788028673,
  0.11280434318753176,
  0.09421394254232562,
  0.07576486962132675,
  0.06714728578578594,
  0.06765408393775307,
  0.06459579040664573,
  0.0624237384127323,
  0.051243735109131,
  0.04703464378537827,
  0.051298795311256896,
  0.04015109999928364,
  0.037956087546812935,
  0.042191963296296986,
  0.03648049305301419,
  0.03409007733124633,
  0.03119143800333489,
  0.02833251860859089,
  0.02902270051065919,
  0.03279795377450281,
  0.03067034595580988,
  0.026346370058004245,
  0.02638894403461627,
  0.02516505098901689,
  0.02751426170255209,
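
The training cell above prepares `class_weights` but builds `nn.CrossEntropyLoss()` without them. As an illustration of what passing the weights would change, the sketch below compares the unweighted and weighted losses on random logits. The weight values are made up for the example and are not the notebook's `class_weights`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical per-class weights for the 4 emotion classes (illustrative values only)
example_weights = torch.tensor([1.0, 2.0, 1.0, 0.5])

logits = torch.randn(6, 4)                    # stand-in for model outputs
targets = torch.tensor([0, 1, 1, 2, 3, 3])    # stand-in for labels

plain_loss = nn.CrossEntropyLoss()(logits, targets)
weighted_loss = nn.CrossEntropyLoss(weight=example_weights)(logits, targets)

# With reduction='mean', the weighted loss is a weighted average:
# sum(w[y_i] * loss_i) / sum(w[y_i])
per_sample = F.cross_entropy(logits, targets, reduction='none')
w = example_weights[targets]
manual_weighted = (w * per_sample).sum() / w.sum()

print(plain_loss.item(), weighted_loss.item(), manual_weighted.item())
```

Because the weighted mean is normalized by the sum of the target weights, the loss scale stays comparable to the unweighted case while errors on up-weighted classes contribute more to the gradient.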

In [33]:
evaluate_model(model_combined, test_loader, criterion, device)

Test Results:
Loss: 1.6039
F1 Score: 0.6402
Unweighted Accuracy: 0.6431
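
For reference, "unweighted accuracy" in speech emotion recognition usually means the mean of the per-class recalls, so every class counts equally regardless of its frequency. The sketch below assumes that convention; `evaluate_model` is defined elsewhere in the notebook and may compute the metric differently. The function names and toy labels are illustrative only.

```python
import numpy as np


def unweighted_accuracy(y_true, y_pred, num_classes=4):
    """Mean of per-class recalls, skipping classes absent from y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: the fraction of samples predicted correctly."""
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())


# Toy labels, not taken from IEMOCAP
y_true = [0, 0, 0, 1, 1, 2, 3, 3]
y_pred = [0, 0, 1, 1, 1, 2, 3, 0]

print(unweighted_accuracy(y_true, y_pred))
print(weighted_accuracy(y_true, y_pred))
```

On imbalanced data the two values diverge, which is why the per-class average is often reported alongside F1.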


### Model-1 Prediction

In [34]:
for i in range(4895, 4905):
    print(text_store[i])

Not- Would you shut up.
Very well.  If you have to be boorish and idiotic.
You are far too temperamental.  Try to control yourself.
Go away!  Go away!  I- I hate you.
You know what, I'm sick and tire of listening to you.  You're a-- You're a total sadistic bully.
Stop.  Would you just go away?  I mean, you're conceited, and overbearing, and utterly impossible.
This is the end! Do you understand me? This is the end.
Yes, I am. Let go with me
You are a cruel fiend and I-I loath you.  I thank God I realized who you are before I decided to marry you again.  Oh my God, never!
Beast


In [35]:
import torch  # PyTorch
import numpy as np  # NumPy
import soundfile as sf  # reading and writing audio files
from IPython.display import Audio, display  # in-notebook audio playback

# Display and play the stored WAV files
for i in range(4895, 4905):
    display(Audio(f'original_audio_{i}.wav'))

In [36]:
print(Y[4895:4905])

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


In [37]:
prediction = predict(model_combined, pred_loader, device)
print(prediction)

[1, 0, 3, 1, 2, 2, 3, 2, 0, 3]


## Model-3: Text + Speech + Mocap (DepthwiseSeparableConv + CNN + CNN)

In [44]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, padding):
        super(DepthwiseSeparableConv1d, self).__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size=kernel_size, padding=padding, groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_channels)
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        x = self.bn(x)
        return x

class OptimizedTextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(OptimizedTextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = DepthwiseSeparableConv1d(embedding_dim, 128, kernel_size=3, padding=1)
        self.conv2 = DepthwiseSeparableConv1d(128, 64, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(0.1)
        self.global_avg_pool = nn.AdaptiveAvgPool1d(1)
        self.dense = nn.Linear(64, 128)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.conv1(x))
        x = self.dropout(x)
        x = F.relu(self.conv2(x))
        x = self.dropout(x)
        x = self.global_avg_pool(x).squeeze(-1)
        x = self.dense(x)
        return x

class OptimizedSpeechModel(nn.Module):
    def __init__(self, dropout_rate=0.1):
        super(OptimizedSpeechModel, self).__init__()
        self.conv1 = nn.Conv1d(34, 64, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.dropout = nn.Dropout(dropout_rate)
        self.global_avg_pool = nn.AdaptiveAvgPool1d(1)
        self.dense = nn.Linear(128, 128)
    
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = self.global_avg_pool(x).squeeze(-1)
        x = self.dense(x)
        return x

class OptimizedMocapModel(nn.Module):
    def __init__(self):
        super(OptimizedMocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.dropout = nn.Dropout(0.1)
        self.global_avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 128)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2).float()
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = self.global_avg_pool(x).squeeze(-1).squeeze(-1)
        x = self.fc(x)
        return x

class OptimizedCombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(OptimizedCombinedModel, self).__init__()
        self.text_model = OptimizedTextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)
        self.speech_model = OptimizedSpeechModel()
        self.mocap_model = OptimizedMocapModel()
        self.fc = nn.Linear(128 * 3, 4)
    
    def forward(self, text, speech, mocap):
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        mocap_out = self.mocap_model(mocap)
        combined = torch.cat((text_out, speech_out, mocap_out), dim=1)
        combined = self.fc(combined)
        return combined
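
To illustrate why the text branch uses a depthwise separable convolution, the sketch below compares its parameter count with a standard `Conv1d` of the same input and output shape (300 to 128 channels, `kernel_size=3`). The `BatchNorm1d` inside `DepthwiseSeparableConv1d` is left out of the comparison; `n_params` is a hypothetical helper.

```python
import torch
import torch.nn as nn


def n_params(module):
    return sum(p.numel() for p in module.parameters())


# Standard convolution: every output channel mixes all 300 input channels
standard = nn.Conv1d(300, 128, kernel_size=3, padding=1)

# Depthwise separable: a per-channel 3-tap filter, then a 1x1 channel mixer
depthwise = nn.Conv1d(300, 300, kernel_size=3, padding=1, groups=300)
pointwise = nn.Conv1d(300, 128, kernel_size=1)

standard_count = n_params(standard)
separable_count = n_params(depthwise) + n_params(pointwise)

# Both variants map (batch, 300, length) to (batch, 128, length)
x = torch.randn(2, 300, 500)
same_shape = standard(x).shape == pointwise(depthwise(x)).shape

print(standard_count, separable_count, same_shape)
```

The separable version needs roughly a third of the parameters here, at the cost of filtering each channel independently before mixing.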

In [45]:
# Select the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = OptimizedCombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)

# Print the model structure
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

OptimizedCombinedModel(
  (text_model): OptimizedTextModel(
    (embedding): Embedding(3130, 300)
    (conv1): DepthwiseSeparableConv1d(
      (depthwise): Conv1d(300, 300, kernel_size=(3,), stride=(1,), padding=(1,), groups=300)
      (pointwise): Conv1d(300, 128, kernel_size=(1,), stride=(1,))
      (bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (conv2): DepthwiseSeparableConv1d(
      (depthwise): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), groups=128)
      (pointwise): Conv1d(128, 64, kernel_size=(1,), stride=(1,))
      (bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (dropout): Dropout(p=0.1, inplace=False)
    (global_avg_pool): AdaptiveAvgPool1d(output_size=1)
    (dense): Linear(in_features=64, out_features=128, bias=True)
  )
  (speech_model): OptimizedSpeechModel(
    (conv1): Conv1d(34, 64, kernel_size=(3,), stride=(2,), padding=(1,))
    (bn1): BatchNorm1d(64

In [46]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Sanity check on a single training batch
batch = next(iter(train_loader))

text = batch['text'].to(device)
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape: (batch_size, num_classes)
print(outputs.shape)

Using 2 GPUs!
torch.Size([128, 4])


In [47]:
import torch.optim as optim
from torch.optim import lr_scheduler

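# Loss function. Note: class_weights is prepared above but is not passed here;
# use nn.CrossEntropyLoss(weight=class_weights) to apply class weighting.
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: halve the LR when the monitored metric
# (mode='max', so higher is better) stops improving for more than 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Run training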
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 4.06s
Train - loss: 1.3651, F1: 0.2813, UW_Acc: 0.3384
Val - loss: 1.3406, F1: 0.3175, UW_Acc: 0.3730
Epoch [2/80], time: 3.86s
Train - loss: 1.2909, F1: 0.4514, UW_Acc: 0.4652
Val - loss: 1.2465, F1: 0.4926, UW_Acc: 0.4965
Epoch [3/80], time: 3.85s
Train - loss: 1.1932, F1: 0.5280, UW_Acc: 0.5311
Val - loss: 1.1667, F1: 0.4755, UW_Acc: 0.4901
Epoch [4/80], time: 3.87s
Train - loss: 1.1123, F1: 0.5737, UW_Acc: 0.5755
Val - loss: 1.1151, F1: 0.4969, UW_Acc: 0.5123
Epoch [5/80], time: 3.85s
Train - loss: 1.0444, F1: 0.5909, UW_Acc: 0.5911
Val - loss: 1.0488, F1: 0.5511, UW_Acc: 0.5555
Epoch [6/80], time: 3.85s
Train - loss: 0.9809, F1: 0.6167, UW_Acc: 0.6149
Val - loss: 1.0289, F1: 0.5450, UW_Acc: 0.5605
Epoch [7/80], time: 3.80s
Train - loss: 0.9392, F1: 0.6397, UW_Acc: 0.6403
Val - loss: 0.9913, F1: 0.5831, UW_Acc: 0.5892
Epoch [8/80], time: 3.83s
Train - loss: 0.9114, F1: 0.6482, UW_Acc: 0.6468
Val - loss: 0.9691, F1: 0.5909, UW_Acc: 0.5988
Epoch [9/80], time: 3.82

([1.365081182746,
  1.2908805414687756,
  1.1931511635004088,
  1.1122762699459874,
  1.0443510246831318,
  0.9808774590492249,
  0.93920141042665,
  0.9114096400349639,
  0.8804659940475641,
  0.8515178963195446,
  0.8301646570826686,
  0.8109531291695529,
  0.7920774068943289,
  0.7654421897821648,
  0.7475215210471042,
  0.7312098810839098,
  0.7093645195628322,
  0.6911532837291097,
  0.6745662065439446,
  0.6539195498754812,
  0.6326395786085794,
  0.6151994646981706,
  0.5969779477563015,
  0.574604238188544,
  0.5517123182152592,
  0.5390168463074884,
  0.5126729080843371,
  0.49764945410018746,
  0.47634216172750604,
  0.4566368273524351,
  0.44164937873219334,
  0.42186966469121534,
  0.40730175168015237,
  0.39107797797336136,
  0.3736729344656301,
  0.3636680030545523,
  0.3431406984495562,
  0.33110382147999695,
  0.317385932040769,
  0.29955653499725254,
  0.2944110229957935,
  0.28274216035077737,
  0.2737375795841217,
  0.2616219693838164,
  0.2557247731574746,
  0.25508
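
The scheduler used in these training cells, `ReduceLROnPlateau(mode='max', factor=0.5, patience=5)`, only lowers the learning rate once the monitored metric has failed to improve for more than `patience` consecutive epochs. The sketch below demonstrates that behaviour in isolation with a dummy parameter and a constant metric. Which metric `train_model` actually passes to `scheduler.step` is not shown here, so the value 0.60 is purely illustrative.

```python
import torch

# A single dummy parameter is enough to build an optimizer
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=5
)

lrs = []
for epoch in range(7):
    scheduler.step(0.60)  # a validation metric that never improves
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

The first call sets the best value, the next five count as non-improving epochs within the patience window, and the seventh call triggers the halving.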

In [48]:
prediction = predict(model_combined, pred_loader, device)
print(prediction)

[1, 0, 3, 2, 2, 2, 3, 2, 0, 3]


## Model-4: CNN Text + Speech + Mocap

In [49]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

class SpeechModel(nn.Module):
    def __init__(self):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.dense2 = nn.Linear(1024, 512)
        self.bn2 = nn.BatchNorm1d(512)
        self.dense3 = nn.Linear(512, 256)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.dense2(x)))
        x = self.dropout(x)
        x = self.dense3(x)
        return x

class MocapModel(nn.Module):
    def __init__(self):
        super(MocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.bn5 = nn.BatchNorm2d(256)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(256 * 7 * 6, 256)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(CombinedModel, self).__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)
        self.speech_model = SpeechModel()
        self.mocap_model = MocapModel()
        self.fc1 = nn.Linear(256 * 3, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(256, 4)

    def forward(self, text, speech, mocap):
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        mocap_out = self.mocap_model(mocap)
        combined = torch.cat((text_out, speech_out, mocap_out), dim=1)
        combined = F.relu(self.bn1(self.fc1(combined)))
        combined = self.fc2(combined)
        return combined
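
The `256 * 7 * 6` input size of `MocapModel.dense` can be derived from the input resolution and the five stride-2 convolutions. The sketch below assumes the mocap input is `(200, 189, 1)` per sample, as passed to `print_model_summary` in this notebook, and uses the standard convolution output-size formula; `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel_size=3, stride=2, padding=1):
    # Standard output-size formula for a convolution (dilation 1)
    return (size + 2 * padding - kernel_size) // stride + 1


height, width = 200, 189  # spatial size after permute(0, 3, 1, 2)
sizes = [(height, width)]
for _ in range(5):  # conv1 .. conv5 in MocapModel
    height, width = conv_out(height), conv_out(width)
    sizes.append((height, width))

flat_features = 256 * height * width  # 256 channels after conv5

print(sizes)
print(flat_features)
```

If the mocap input resolution changes, the `nn.Linear` input size has to be recomputed in the same way. The global average pooling used in `OptimizedMocapModel` avoids this dependency.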

In [50]:
# Select the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)

# Print the model structure
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (dense): Linear(in_features=16000, out_features=256, bias=True)
  )
  (speech_model): SpeechModel(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (dense1):

In [51]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Sanity check on a single training batch
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape: (batch_size, num_classes)
print(outputs.shape)

Using 2 GPUs!
torch.Size([128, 4])


In [52]:
import torch.optim as optim
from torch.optim import lr_scheduler

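# Loss function. Note: class_weights is prepared above but is not passed here;
# use nn.CrossEntropyLoss(weight=class_weights) to apply class weighting.
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: halve the LR when the monitored metric
# (mode='max', so higher is better) stops improving for more than 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Run training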
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 4.79s
Train - loss: 1.1816, F1: 0.4513, UW_Acc: 0.4687
Val - loss: 1.2538, F1: 0.3920, UW_Acc: 0.4459
Epoch [2/80], time: 4.84s
Train - loss: 0.8137, F1: 0.6725, UW_Acc: 0.6763
Val - loss: 1.0416, F1: 0.5490, UW_Acc: 0.5741
Epoch [3/80], time: 4.64s
Train - loss: 0.6560, F1: 0.7460, UW_Acc: 0.7477
Val - loss: 1.0723, F1: 0.5391, UW_Acc: 0.5629
Epoch [4/80], time: 4.68s
Train - loss: 0.5607, F1: 0.7888, UW_Acc: 0.7884
Val - loss: 0.9802, F1: 0.6007, UW_Acc: 0.6189
Epoch [5/80], time: 4.69s
Train - loss: 0.4757, F1: 0.8302, UW_Acc: 0.8296
Val - loss: 0.9908, F1: 0.6028, UW_Acc: 0.6220
Epoch [6/80], time: 4.70s
Train - loss: 0.3983, F1: 0.8627, UW_Acc: 0.8626
Val - loss: 0.9552, F1: 0.6297, UW_Acc: 0.6433
Epoch [7/80], time: 4.70s
Train - loss: 0.3451, F1: 0.8787, UW_Acc: 0.8782
Val - loss: 0.9230, F1: 0.6522, UW_Acc: 0.6622
Epoch [8/80], time: 4.67s
Train - loss: 0.2948, F1: 0.8999, UW_Acc: 0.8997
Val - loss: 0.8896, F1: 0.6786, UW_Acc: 0.6823
Epoch [9/80], time: 4.66

([1.181634993054146,
  0.8136673500371534,
  0.6559732362281444,
  0.5606872758200002,
  0.4757061434346576,
  0.39830331469691077,
  0.3451340028712916,
  0.2947857761105826,
  0.2534596462582433,
  0.2295351939838986,
  0.19296216895413953,
  0.1644008799001228,
  0.160465766177621,
  0.14415887396696003,
  0.12179975055677947,
  0.11501856849983681,
  0.09636338410335918,
  0.079844314258459,
  0.06922518357981083,
  0.07147969097592109,
  0.061832337358663246,
  0.057646943646114925,
  0.055197394604599756,
  0.04741740707567958,
  0.03996554014901089,
  0.035338247628059496,
  0.03816727204464896,
  0.03352465645171875,
  0.03094068724055623,
  0.03664201246791108,
  0.030847851273625396,
  0.03135395931556474,
  0.028404018235241257,
  0.025608292152715283,
  0.02395260775851649,
  0.026696113263105236,
  0.02533338853526254,
  0.024505490502125995,
  0.023830109362512134,
  0.022185346111655235,
  0.02312047156873484,
  0.021668507024472535,
  0.019387611696973096,
  0.022114062

In [53]:
prediction = predict(model_combined, pred_loader, device)
print(prediction)

[1, 0, 2, 1, 2, 2, 2, 2, 0, 3]
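
The three prediction lists printed so far can be compared against the labels `Y[4895:4905]` shown earlier. The sketch below assumes that `pred_loader` yields exactly those ten samples in the same order, which this section does not confirm; the lists are copied from the cell outputs above.

```python
# Labels printed by print(Y[4895:4905])
labels = [0] * 10

# Predictions printed after each model's training run
predictions = {
    "Model 1": [1, 0, 3, 1, 2, 2, 3, 2, 0, 3],
    "Model-3": [1, 0, 3, 2, 2, 2, 3, 2, 0, 3],
    "Model-4": [1, 0, 2, 1, 2, 2, 2, 2, 0, 3],
}


def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and label agree
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


scores = {name: accuracy(labels, preds) for name, preds in predictions.items()}
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

Under that assumption, each model matches the label on the same two utterances (positions 1 and 8 in the slice).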


## Text only

In [54]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn5 = nn.BatchNorm1d(256)
        self.fc = nn.Linear(256, 4)
    
    def forward(self, text=None, speech=None, mocap=None):
        if text is not None:
            x = self.embedding(text)
            x = x.permute(0, 2, 1)
            x = F.relu(self.bn1(self.conv1(x)))
            x = self.dropout(x)
            x = F.relu(self.bn2(self.conv2(x)))
            x = self.dropout(x)
            x = F.relu(self.bn3(self.conv3(x)))
            x = self.dropout(x)
            x = F.relu(self.bn4(self.conv4(x)))
            x = self.dropout(x)
            x = self.flatten(x)
            x = F.relu(self.bn5(self.dense(x)))
            x = self.fc(x)
            return x
        else:
            raise ValueError("Expected text input for TextModel")

In [55]:
# Select the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix).to(device)

# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    text_model = nn.DataParallel(text_model)

# Move the model to the device
text_model.to(device)

# Print the model structure
print(text_model)

Using 2 GPUs!
DataParallel(
  (module): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (dense): Linear(in_features=16000, out_features=256, bias=True)
    (bn5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=T

In [56]:
import torch.optim as optim
from torch.optim import lr_scheduler

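# Loss function. Note: class_weights is prepared above but is not passed here;
# use nn.CrossEntropyLoss(weight=class_weights) to apply class weighting.
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(text_model.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: halve the LR when the monitored metric
# (mode='max', so higher is better) stops improving for more than 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Run training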
train_model(text_model, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 3.08s
Train - loss: 1.4059, F1: 0.2841, UW_Acc: 0.2979
Val - loss: 1.3478, F1: 0.2224, UW_Acc: 0.3558
Epoch [2/80], time: 2.98s
Train - loss: 1.2901, F1: 0.3926, UW_Acc: 0.3946
Val - loss: 1.3074, F1: 0.3817, UW_Acc: 0.4029
Epoch [3/80], time: 3.03s
Train - loss: 1.1915, F1: 0.4713, UW_Acc: 0.4724
Val - loss: 1.2809, F1: 0.4063, UW_Acc: 0.4335
Epoch [4/80], time: 2.98s
Train - loss: 1.0772, F1: 0.5517, UW_Acc: 0.5518
Val - loss: 1.2164, F1: 0.4584, UW_Acc: 0.4599
Epoch [5/80], time: 3.00s
Train - loss: 0.9849, F1: 0.6034, UW_Acc: 0.6039
Val - loss: 1.1848, F1: 0.4900, UW_Acc: 0.5036
Epoch [6/80], time: 3.04s
Train - loss: 0.8830, F1: 0.6474, UW_Acc: 0.6469
Val - loss: 1.1550, F1: 0.5205, UW_Acc: 0.5299
Epoch [7/80], time: 3.02s
Train - loss: 0.8097, F1: 0.6837, UW_Acc: 0.6828
Val - loss: 1.1324, F1: 0.5256, UW_Acc: 0.5365
Epoch [8/80], time: 3.19s
Train - loss: 0.7472, F1: 0.7135, UW_Acc: 0.7127
Val - loss: 1.1211, F1: 0.5433, UW_Acc: 0.5481
Epoch [9/80], time: 2.99

([1.4058793505956961,
  1.2900846503501715,
  1.191461904104366,
  1.07723007368487,
  0.9848765134811401,
  0.8829683079275974,
  0.8097093507300975,
  0.7471928499465765,
  0.6798630093419274,
  0.6264752052551092,
  0.5789528759412987,
  0.5310434608958489,
  0.4840540234432664,
  0.450188479451246,
  0.41648244857788086,
  0.3923124580882316,
  0.3734765142895455,
  0.3512683469195699,
  0.32523694669091424,
  0.3070671353922334,
  0.29494788203128547,
  0.2685646574164546,
  0.25116006129009777,
  0.23575728196044302,
  0.23130190926928854,
  0.2284947204035382,
  0.21599150258441305,
  0.20571296090303465,
  0.2000138406143632,
  0.19793193943278733,
  0.19859337529470755,
  0.1893630078019098,
  0.18711008096850196,
  0.18443428083907726,
  0.1744989738907925,
  0.17133374616157176,
  0.1797145837268164,
  0.17351985844068749,
  0.16898273815249287,
  0.17057445524043816,
  0.1719930664051411,
  0.16557880716268406,
  0.1694233436570611,
  0.16942767435035042,
  0.16396862909544

## Speech only

In [57]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechModel(nn.Module):
    def __init__(self):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(500, 1024)  # Input dimension set to 500
        self.dense2 = nn.Linear(1024, 512)
        self.dense3 = nn.Linear(512, 256)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x, text=None, mocap=None):
        x = x.float()  # Cast the input to float
        x = self.flatten(x)
        x = F.relu(self.dense1(x))
        x = self.dropout(x)
        x = F.relu(self.dense2(x))
        x = self.dropout(x)
        x = self.dense3(x)
        return x

In [58]:
# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Convert class_weights to a tensor if it is not one already, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

speech_model = SpeechModel()

# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    speech_model = nn.DataParallel(speech_model)

# Move the model to the device
speech_model.to(device)

# Print the model summary
print(speech_model)

Using 2 GPUs!
DataParallel(
  (module): SpeechModel(
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (dense1): Linear(in_features=500, out_features=1024, bias=True)
    (dense2): Linear(in_features=1024, out_features=512, bias=True)
    (dense3): Linear(in_features=512, out_features=256, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
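
As a cross-check of the summary above, the parameter count of the three `Linear` layers can be worked out by hand. This is a small arithmetic sketch rather than part of the original notebook; `linear_params` is a hypothetical helper, and it counts only weights and biases (`Flatten` and `Dropout` have no parameters).

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# (in_features, out_features) for dense1, dense2, dense3 as printed above
layers = [(500, 1024), (1024, 512), (512, 256)]
counts = [linear_params(i, o) for i, o in layers]
total = sum(counts)

print(counts)  # [513024, 524800, 131328]
print(total)   # 1169152
```

On a live model the same figure should come out of `sum(p.numel() for p in speech_model.parameters())`.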


In [59]:
import torch.optim as optim
from torch.optim import lr_scheduler

# Cross-entropy loss. Note: class_weights is prepared above but not passed here, so the loss is unweighted
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(speech_model.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Call the training function
train_model(speech_model, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 2.12s
Train - loss: 4.9028, F1: 0.2401, UW_Acc: 0.2516
Val - loss: 2.9648, F1: 0.2895, UW_Acc: 0.2941
Epoch [2/80], time: 1.99s
Train - loss: 2.9726, F1: 0.3086, UW_Acc: 0.3163
Val - loss: 2.8022, F1: 0.2711, UW_Acc: 0.2974
Epoch [3/80], time: 1.94s
Train - loss: 2.4442, F1: 0.3418, UW_Acc: 0.3498
Val - loss: 2.7703, F1: 0.2672, UW_Acc: 0.2842
Epoch [4/80], time: 1.94s
Train - loss: 2.0741, F1: 0.3674, UW_Acc: 0.3723
Val - loss: 2.4119, F1: 0.3200, UW_Acc: 0.3347
Epoch [5/80], time: 1.93s
Train - loss: 1.9338, F1: 0.3745, UW_Acc: 0.3831
Val - loss: 2.3873, F1: 0.2631, UW_Acc: 0.2898
Epoch [6/80], time: 1.94s
Train - loss: 1.7002, F1: 0.4109, UW_Acc: 0.4156
Val - loss: 2.1403, F1: 0.3061, UW_Acc: 0.3088
Epoch [7/80], time: 1.94s
Train - loss: 1.5908, F1: 0.4171, UW_Acc: 0.4195
Val - loss: 2.0462, F1: 0.2909, UW_Acc: 0.2951
Epoch [8/80], time: 1.94s
Train - loss: 1.4855, F1: 0.4385, UW_Acc: 0.4398
Val - loss: 2.1327, F1: 0.2796, UW_Acc: 0.2774
Epoch [9/80], time: 1.94

([4.902778442515883,
  2.9726296801899754,
  2.4442196751749794,
  2.0740671656852543,
  1.9337890259055204,
  1.7002301132956217,
  1.5908415982889574,
  1.4854505145272543,
  1.4003784490186115,
  1.3418607018714728,
  1.2351188992345057,
  1.2029346726661505,
  1.1638495007226632,
  1.144536245700925,
  1.1382584585699924,
  1.1234898303830347,
  1.101185962211254,
  1.0874968323596688,
  1.0665803440781527,
  1.0628467326940492,
  1.0682098713032036,
  1.0531370279400847,
  1.0258147189783495,
  1.0430466976276664,
  1.024514445038729,
  1.0109601977259615,
  1.028559797031935,
  1.0162219945774522,
  1.0165246087451314,
  1.0051947041999463,
  0.993175936299701,
  0.980851314788641,
  0.9895011477692183,
  0.990820241528888,
  0.9882237564685733,
  0.9886138993640279,
  0.9776762640753458,
  0.9864614009857178,
  0.9801240771315819,
  0.9818106182785922,
  0.989205521206523,
  0.9790595714436021,
  0.9799060724502386,
  0.9764028635135916,
  0.9779544181601946,
  0.979799231817556
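
The long list printed above is the raw return value of `train_model`, which is hard to read at a glance. A minimal sketch for condensing it follows, assuming the return value is a tuple of six per-epoch lists in the order `(train_losses, train_f1s, train_uw_accs, val_losses, val_f1s, val_uw_accs)`, as in the definition given in the Mocap-only section. `summarize_history` is a hypothetical helper, and the sample values are only the first three speech-only epochs copied from the log above.

```python
def summarize_history(history):
    """Return the metrics of the epoch with the highest validation F1."""
    train_losses, train_f1s, train_uw_accs, val_losses, val_f1s, val_uw_accs = history
    best = max(range(len(val_f1s)), key=val_f1s.__getitem__)
    return {
        "best_epoch": best + 1,  # 1-based, to match the training log
        "train_loss": train_losses[best],
        "val_loss": val_losses[best],
        "val_f1": val_f1s[best],
        "val_uw_acc": val_uw_accs[best],
    }

# Illustrative input: epochs 1-3 of the speech-only run
history = (
    [4.9028, 2.9726, 2.4442],  # train loss
    [0.2401, 0.3086, 0.3418],  # train F1
    [0.2516, 0.3163, 0.3498],  # train UW_Acc
    [2.9648, 2.8022, 2.7703],  # val loss
    [0.2895, 0.2711, 0.2672],  # val F1
    [0.2941, 0.2974, 0.2842],  # val UW_Acc
)
summary = summarize_history(history)
print(summary)
```

In the notebook, assigning the result (`history = train_model(...)`) instead of leaving the call as the last expression in the cell would also keep the raw lists out of the cell output.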

## Mocap only

In [60]:
import time
from sklearn.metrics import f1_score, accuracy_score
import torch
import torch.nn.functional as F

# Model validation
def validate(model, val_loader, criterion, device):
    val_loss = 0
    val_f1, val_uw_acc = 0, 0

    model.eval()

    with torch.no_grad():
        for batch in val_loader:
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()

            outputs = model(mocap)  # Forward pass
            loss = criterion(outputs, labels)  # Compute the loss
            val_loss += loss.item()  # Accumulate the loss

            f1, uw_acc = calculate_metrics(outputs, labels)
            val_f1 += f1
            val_uw_acc += uw_acc

    val_loss /= len(val_loader)  # Mean loss over batches
    val_f1 /= len(val_loader)  # Mean of the per-batch F1 scores
    val_uw_acc /= len(val_loader)  # Mean of the per-batch unweighted accuracies

    return val_loss, val_f1, val_uw_acc

# Print the training results
def print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, val_loss, val_f1, val_uw_acc, epochs=10):
    print(f"Epoch [{epoch+1}/{epochs}], time: {train_time:.2f}s")
    print(f"Train - loss: {train_loss:.4f}, F1: {train_f1:.4f}, UW_Acc: {train_uw_acc:.4f}")
    print(f"Val - loss: {val_loss:.4f}, F1: {val_f1:.4f}, UW_Acc: {val_uw_acc:.4f}")

# Model training
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs, device):
    train_losses = []
    train_f1s, train_uw_accs = [], []
    val_losses = []
    val_f1s, val_uw_accs = [], []

    for epoch in range(num_epochs):
        model.train()

        train_loss = 0
        train_f1, train_uw_acc = 0, 0

        start_time = time.time()  # Start time of this epoch

        for batch in train_loader:
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()

            optimizer.zero_grad()  # Reset the gradients of all model parameters
            outputs = model(mocap)  # Forward pass

            loss = criterion(outputs, labels)  # Compute the loss
            train_loss += loss.item()  # Accumulate the loss

            f1, uw_acc = calculate_metrics(outputs, labels)
            train_f1 += f1
            train_uw_acc += uw_acc

            loss.backward()  # Backpropagate to compute gradients
            optimizer.step()  # Update the model parameters

        end_time = time.time()  # End time of this epoch
        train_time = end_time - start_time  # Training time for this epoch

        train_loss /= len(train_loader)  # Mean loss over batches
        train_f1 /= len(train_loader)  # Mean of the per-batch F1 scores
        train_uw_acc /= len(train_loader)  # Mean of the per-batch unweighted accuracies

        val_loss, val_f1, val_uw_acc = validate(model, val_loader, criterion, device)

        scheduler.step(val_f1)  # Step the scheduler on the validation F1

        train_losses.append(train_loss)
        train_f1s.append(train_f1)
        train_uw_accs.append(train_uw_acc)
        val_losses.append(val_loss)
        val_f1s.append(val_f1)
        val_uw_accs.append(val_uw_acc)

        print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, 
                  val_loss, val_f1, val_uw_acc, epochs=num_epochs)  # Print the results for this epoch

    return train_losses, train_f1s, train_uw_accs, val_losses, val_f1s, val_uw_accs
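
Both `validate` and `train_model` average the per-batch F1 returned by `calculate_metrics`. If that function computes a macro F1 on each batch (an assumption, since it is defined outside this section), the average over batches can differ from the F1 computed once over all predictions, especially when some classes are rare in a batch. The sketch below illustrates the gap with a hypothetical pure-Python `macro_f1` and made-up labels; it is not the notebook's metric code.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes that appear in y_true or y_pred."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up (labels, predictions) for two batches
batches = [
    ([0, 1, 1, 1], [0, 1, 1, 1]),  # every sample correct
    ([0, 0, 0, 1], [0, 0, 0, 0]),  # the single class-1 sample is missed
]

# Mean of per-batch scores, as in validate()
batch_mean = sum(macro_f1(t, p) for t, p in batches) / len(batches)

# One score over the pooled predictions
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
pooled = macro_f1(all_true, all_pred)

print(f"{batch_mean:.3f} vs {pooled:.3f}")  # 0.714 vs 0.873
```

Collecting predictions and labels across the loop and scoring them once at the end of the epoch would remove the dependence on batch composition.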

In [61]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MocapModel(nn.Module):
    def __init__(self):
        super(MocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.bn5 = nn.BatchNorm2d(256)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(10752, 256)  # 10752 = 256 channels * 7 * 6 after five stride-2 convs on a 200 x 189 input

    def forward(self, x):
        # x shape is already [batch_size, 200, 189, 1]
        x = x.permute(0, 3, 1, 2)  # Change to [batch_size, 1, 200, 189]
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x
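
The `10752` input size of the final `Linear` layer follows from the convolution arithmetic. The sketch below reproduces it, assuming the `[batch_size, 200, 189, 1]` input stated in `forward` and the standard output-size formula for a convolution with dilation 1; `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one spatial dimension after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

h, w = 200, 189          # spatial size after the permute in forward()
for _ in range(5):       # conv1 ... conv5 all use kernel 3, stride 2, padding 1
    h, w = conv_out(h), conv_out(w)

flat_features = 256 * h * w  # conv5 has 256 output channels
print(h, w, flat_features)   # 7 6 10752
```

If the mocap input shape changes, the first argument of `nn.Linear` has to be recomputed the same way.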

In [62]:
# Use the GPU if available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Convert class_weights to a tensor if it is not one already, then move it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

mocap_model = MocapModel().to(device)

# Print the model summary
print(mocap_model)

MocapModel(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv3): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv5): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (bn5): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (dense): Linear(in_features=10752, out_features=256, bias=True)
)

In [63]:
import torch.optim as optim
from torch.optim import lr_scheduler

# Cross-entropy loss. Note: class_weights is prepared above but not passed here, so the loss is unweighted
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(mocap_model.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Call the training function
train_model(mocap_model, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 4.09s
Train - loss: 1.6364, F1: 0.2840, UW_Acc: 0.3067
Val - loss: 1.3982, F1: 0.2747, UW_Acc: 0.3314
Epoch [2/80], time: 3.95s
Train - loss: 1.2377, F1: 0.4394, UW_Acc: 0.4475
Val - loss: 1.3634, F1: 0.3280, UW_Acc: 0.3880
Epoch [3/80], time: 4.04s
Train - loss: 1.1530, F1: 0.4798, UW_Acc: 0.4932
Val - loss: 1.3012, F1: 0.3789, UW_Acc: 0.4381
Epoch [4/80], time: 4.04s
Train - loss: 1.0353, F1: 0.5549, UW_Acc: 0.5593
Val - loss: 1.3002, F1: 0.3910, UW_Acc: 0.4488
Epoch [5/80], time: 3.98s
Train - loss: 0.9808, F1: 0.5815, UW_Acc: 0.5862
Val - loss: 1.3478, F1: 0.3852, UW_Acc: 0.4553
Epoch [6/80], time: 4.02s
Train - loss: 0.9260, F1: 0.6258, UW_Acc: 0.6279
Val - loss: 1.2299, F1: 0.4351, UW_Acc: 0.4811
Epoch [7/80], time: 4.01s
Train - loss: 0.8915, F1: 0.6312, UW_Acc: 0.6328
Val - loss: 1.2235, F1: 0.4446, UW_Acc: 0.4887
Epoch [8/80], time: 4.01s
Train - loss: 0.8701, F1: 0.6450, UW_Acc: 0.6465
Val - loss: 1.2321, F1: 0.4455, UW_Acc: 0.4986
Epoch [9/80], time: 3.99

([1.6363637364187906,
  1.2376834985821745,
  1.1529635800871738,
  1.0353045934854552,
  0.980841686559278,
  0.9260283015495123,
  0.8915295295937117,
  0.870084356429965,
  0.847113853277162,
  0.831225055594777,
  0.828596830368042,
  0.7990725802820783,
  0.7887383131093757,
  0.7688648021498392,
  0.7497997228489366,
  0.7217255525810774,
  0.7181307576423468,
  0.7062970690949019,
  0.7098891014276549,
  0.6857547565948131,
  0.6640053169671879,
  0.6526546367379122,
  0.6479673538097116,
  0.6403157004090243,
  0.6203379582527072,
  0.622186096601708,
  0.6077933061954587,
  0.5883539935877157,
  0.5944317752538726,
  0.5872921409995057,
  0.5608809063600939,
  0.557252282320067,
  0.5269641585128252,
  0.5189874865287958,
  0.5234565360601559,
  0.5033251684765483,
  0.4961353838443756,
  0.49946940499682757,
  0.48686794139618095,
  0.4715919300567272,
  0.47675867759904195,
  0.47573470930720485,
  0.46892649697702987,
  0.4747811237046885,
  0.465092001959335,
  0.463243651