# Multimodal Emotion Recognition (2/2)
## Zichen Xu (zichenxu407@gmail.com)

The feature extraction procedure is adapted, with modifications, from the work presented in https://arxiv.org/abs/1804.05788. This work is also published on Kaggle: https://www.kaggle.com/code/girlfriendohmygod/emo-detection-transformer/notebook

The code primarily handles audio and motion capture data from the IEMOCAP dataset. It includes three functions: `get_mocap_hand`, `get_mocap_rot`, and `get_mocap_head`, which are used to read and calculate the average values of hand, rotation, and head motion capture data, respectively. Finally, the function `read_iemocap_mocap` integrates the audio, emotion labels, transcribed text, and the three types of motion capture data mentioned above. It processes and sorts them into a unified format, returning an array that contains all the processed data.
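
As a rough illustration of the averaging step described above, the sketch below splits a block of motion-capture frames into 200 chunks with `np.array_split` and keeps the mean of each chunk. The helper name `average_in_chunks` and the toy input are assumptions made for this example; they are not taken from `mocap_data_collect.py`.

```python
import numpy as np

def average_in_chunks(frames, n_chunks=200):
    """Split `frames` (n_frames x n_features) into n_chunks and average each chunk.

    Hypothetical helper, for illustration only.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.size == 0:
        # Fallback when no frame falls inside the time window
        return np.zeros((1, 1))
    chunks = np.array_split(frames, n_chunks)
    n_features = frames.shape[1]
    # An empty chunk (fewer frames than chunks) becomes a zero row instead of NaN
    means = [c.mean(axis=0) if len(c) else np.zeros(n_features) for c in chunks]
    return np.stack(means)

# Toy input: 400 frames with 6 features, every value equal to the frame index
toy = np.repeat(np.arange(400, dtype=float)[:, None], 6, axis=1)
avg = average_in_chunks(toy)
print(avg.shape)   # (200, 6)
print(avg[0])      # mean of frames 0 and 1, i.e. 0.5 for every feature
```

Collecting the means in a new list matters here: assigning to the loop variable inside a `for` loop would leave the original chunks unchanged.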

- The functions used are encapsulated in the files `features.py`, `helper.py`, and `mocap_data_collect.py`.
- **The results after data cleaning and preprocessing are stored as .pickle files. These include two datasets: "pikle-data" (a partial dataset using only 1/5 of the data for lightweight model tuning) and "pikle-full-data" (the full dataset), both of which are open-sourced on the Kaggle platform. These can be directly accessed and reused without the need for separate data cleaning and preprocessing each time.**

## 1. Package loading

In [1]:
!pip install torchsummary

Collecting torchsummary
  Downloading torchsummary-1.5.1-py3-none-any.whl.metadata (296 bytes)
Downloading torchsummary-1.5.1-py3-none-any.whl (2.8 kB)
Installing collected packages: torchsummary
Successfully installed torchsummary-1.5.1


In [2]:
import numpy as np
import os
import sys
import pandas as pd

import wave
import copy
import math

# from keras.models import Sequential, Model
# from keras.layers.core import Dense, Activation
# from keras.layers import LSTM, Input, Flatten, Merge
# from keras.layers.wrappers import TimeDistributed
# from keras.optimizers import SGD, Adam, RMSprop
# from keras.layers.normalization import BatchNormalization
from sklearn.preprocessing import label_binarize

module_path = '/kaggle/input/emo-detection1/'
sys.path.append(module_path)
from features import *
from helper import *

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from torchsummary import summary

In [3]:
batch_size = 128
nb_feat = 34
nb_class = 4
nb_epoch = 80

optimizer = 'Adadelta'


# code_path = os.path.dirname('/kaggle/input/iemocapfullrelease/IEMOCAP_full_release')
emotions_used = np.array(['ang', 'exc', 'neu', 'sad'])
data_path = os.path.dirname('/kaggle/input/iemocapfullrelease/IEMOCAP_full_release/')
sessions = ['/Session1']
# sessions = ['/Session1', '/Session2', '/Session3', '/Session4', '/Session5']
framerate = 16000

## 2. Data Preprocessing

In [4]:
# import os
# import numpy as np

# def get_mocap_hand(path_to_mocap_hand, filename, start, end):
#     mocap_hand_avg = []  # Collects the hand motion-capture frames inside (start, end)
#     with open(path_to_mocap_hand + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Parse the timestamp
#                 if time_value > start and time_value < end:  # Keep frames strictly inside the time window
#                     mocap_hand_avg.append(np.array(data2[2:]).astype(float))  # Columns 3 onward hold the mocap values
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats
    
#     if mocap_hand_avg:  # At least one valid frame was found
#         mocap_hand_avg = np.array_split(np.array(mocap_hand_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_hand_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (NOTE: rebinding `spl` does not update the list)
#     else:
#         mocap_hand_avg = [np.zeros(1)]  # No valid frames: return a single zero array
    
#     return mocap_hand_avg  # Return the chunked hand motion-capture data


# def get_mocap_rot(path_to_mocap_rot, filename, start, end):
#     mocap_rot_avg = []  # Collects the rotated motion-capture frames inside (start, end)
#     with open(path_to_mocap_rot + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Parse the timestamp
#                 if time_value > start and time_value < end:  # Keep frames strictly inside the time window
#                     mocap_rot_avg.append(np.array(data2[2:]).astype(float))  # Columns 3 onward hold the mocap values
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats
    
#     if mocap_rot_avg:  # At least one valid frame was found
#         mocap_rot_avg = np.array_split(np.array(mocap_rot_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_rot_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (NOTE: rebinding `spl` does not update the list)
#     else:
#         mocap_rot_avg = [np.zeros(1)]  # No valid frames: return a single zero array
    
#     return mocap_rot_avg  # Return the chunked rotated motion-capture data


# def get_mocap_head(path_to_mocap_head, filename, start, end):
#     mocap_head_avg = []  # Collects the head motion-capture frames inside (start, end)
#     with open(path_to_mocap_head + filename, 'r') as file:
#         next(file)  # Skip the header line
#         for line in file:
#             data2 = line.split()  # Split the line on whitespace
#             if len(data2) < 3:  # Skip rows that are too short to be valid
#                 continue
#             try:
#                 time_value = float(data2[1])  # Parse the timestamp
#                 if time_value > start and time_value < end:  # Keep frames strictly inside the time window
#                     mocap_head_avg.append(np.array(data2[2:]).astype(float))  # Columns 3 onward hold the mocap values
#             except ValueError:
#                 continue  # Skip rows that cannot be parsed as floats
    
#     if mocap_head_avg:  # At least one valid frame was found
#         mocap_head_avg = np.array_split(np.array(mocap_head_avg), 200)  # Split the frames into 200 chunks
#         for spl in mocap_head_avg:
#             spl = np.mean(spl, axis=0)  # Mean of each chunk (NOTE: rebinding `spl` does not update the list)
#     else:
#         mocap_head_avg = [np.zeros(1)]  # No valid frames: return a single zero array
    
#     return mocap_head_avg  # Return the chunked head motion-capture data


# def read_iemocap_mocap():
#     data = []  # Final list of merged utterance records
#     ids = {}  # IDs that have already been processed
#     for session in sessions:  # Iterate over the sessions
#         path_to_wav = data_path + session + '/dialog/wav/'  # Audio files
#         path_to_emotions = data_path + session + '/dialog/EmoEvaluation/'  # Emotion label files
#         path_to_transcriptions = data_path + session + '/dialog/transcriptions/'  # Transcription files
#         path_to_mocap_hand = data_path + session + '/dialog/MOCAP_hand/'  # Hand motion-capture files
#         path_to_mocap_rot = data_path + session + '/dialog/MOCAP_rotated/'  # Rotated motion-capture files
#         path_to_mocap_head = data_path + session + '/dialog/MOCAP_head/'  # Head motion-capture files

#         files2 = os.listdir(path_to_wav)  # List everything in the audio directory

#         files = []
#         for f in files2:  # Iterate over all entries
#             if f.endswith(".wav"):  # Keep only .wav files
#                 if f[0] == '.':
#                     files.append(f[2:-4])  # Hidden file: strip the two-character prefix and the extension
#                 else:
#                     files.append(f[:-4])  # Regular file: strip the extension

#         for f in files:  # Iterate over the audio files
#             print(f)
#             mocap_f = f
#             if f == 'Ses05M_script01_1b':
#                 mocap_f = 'Ses05M_script01_1'  # Special case: this dialog uses a different mocap file name

#             wav = get_audio(path_to_wav, f + '.wav')  # Read the audio file
#             transcriptions = get_transcriptions(path_to_transcriptions, f + '.txt')  # Read the transcriptions
#             emotions = get_emotions(path_to_emotions, f + '.txt')  # Read the emotion labels
#             sample = split_wav(wav, emotions)  # Split the audio into one segment per utterance

#             for ie, e in enumerate(emotions):  # Iterate over the labelled utterances
#                 e['signal'] = sample[ie]['left']  # Attach the left-channel signal of the segment
#                 e.pop("left", None)
#                 e.pop("right", None)
#                 e['transcription'] = transcriptions[e['id']]  # Attach the transcription
#                 e['mocap_hand'] = get_mocap_hand(path_to_mocap_hand, mocap_f + '.txt', e['start'], e['end'])  # Hand motion-capture data
#                 e['mocap_rot'] = get_mocap_rot(path_to_mocap_rot, mocap_f + '.txt', e['start'], e['end'])  # Rotated motion-capture data
#                 e['mocap_head'] = get_mocap_head(path_to_mocap_head, mocap_f + '.txt', e['start'], e['end'])  # Head motion-capture data
#                 if e['emotion'] in emotions_used:  # Keep only the emotions in emotions_used
#                     if e['id'] not in ids:  # Skip IDs that were already processed
#                         data.append(e)  # Add the record to the final list
#                         ids[e['id']] = 1  # Mark the ID as processed

#     sort_key = get_field(data, "id")  # Sort key: the utterance ID
#     return np.array(data)[np.argsort(sort_key)]  # Return the records sorted by ID

# data = read_iemocap_mocap()  # Read the IEMOCAP dataset

In [5]:
# import pickle  # Serialization and deserialization of Python objects

# # Open 'data_collected.pickle' in 'wb' (write-binary) mode
# # '/kaggle/working/' is the output directory
# with open('/kaggle/working/'+'data_collected.pickle', 'wb') as handle:

#     # Serialize 'data' into the open file with pickle.dump
#     # protocol=pickle.HIGHEST_PROTOCOL selects the highest available pickle protocol
#     pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 3. Feature extraction

In [6]:
import pickle  # Used to read and write pickle files

# Open the pickle file that holds the preprocessed data
with open('/kaggle/input/pikle-full-data/data_collected_full.pickle', 'rb') as handle:
    data2 = pickle.load(handle)  # Load the records into data2

In [7]:
import torch  # PyTorch
from torchtext.data.utils import get_tokenizer  # Tokenizer factory from torchtext
from torchtext.vocab import build_vocab_from_iterator  # Builds a vocabulary from an iterator of token lists
from torch.nn.utils.rnn import pad_sequence  # Pads variable-length sequences

# Assumes data2 has already been loaded and holds the required fields

text = [ses_mod['transcription'] for ses_mod in data2]  # Extract the 'transcription' field from data2

MAX_SEQUENCE_LENGTH = 500  # Maximum sequence length

# Use torchtext's basic_english tokenizer
tokenizer = get_tokenizer('basic_english')

# Build the vocabulary
def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(text), specials=["<unk>"])  # Build the vocabulary with the special token "<unk>"
vocab.set_default_index(vocab["<unk>"])  # Out-of-vocabulary tokens map to "<unk>"

# Convert each transcription to a sequence of token indices
token_tr_X = [vocab(tokenizer(t)) for t in text]

# Pad the sequences to a common length (the "<unk>" index doubles as the padding value)
x_train_text = pad_sequence([torch.tensor(seq, dtype=torch.long) for seq in token_tr_X], 
                            batch_first=True, padding_value=vocab["<unk>"])

# Truncate sequences that exceed the maximum length
x_train_text = x_train_text[:, :MAX_SEQUENCE_LENGTH]

# Pad on the right if the batch is shorter than the maximum length
if x_train_text.size(1) < MAX_SEQUENCE_LENGTH:
    pad_size = MAX_SEQUENCE_LENGTH - x_train_text.size(1)
    x_train_text = torch.cat((x_train_text, torch.full((x_train_text.size(0), pad_size), vocab["<unk>"], dtype=torch.long)), dim=1)

print(x_train_text.shape)  # Check the resulting shape

torch.Size([4936, 500])
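
The tokenize, index, truncate and pad steps of the cell above can be sketched without torchtext. This is a simplified, dependency-free illustration: `basic_tokenize`, `build_vocab` and `encode` are hypothetical helpers, the whitespace tokenizer is cruder than `basic_english`, and the dedicated `<pad>` token is an assumption (the notebook pads with the `<unk>` index instead).

```python
from collections import Counter

def basic_tokenize(sentence):
    # Simplified stand-in for a tokenizer: lowercase and split on whitespace
    return sentence.lower().split()

def build_vocab(sentences, specials=("<pad>", "<unk>")):
    counts = Counter(tok for s in sentences for tok in basic_tokenize(s))
    # Special tokens first, then tokens by descending frequency (ties alphabetical)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(list(specials) + ordered)}

def encode(sentence, vocab, max_len):
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in basic_tokenize(sentence)]
    ids = ids[:max_len]                                    # truncate long sequences
    return ids + [vocab["<pad>"]] * (max_len - len(ids))   # right-pad short ones

sentences = ["I am fine", "I am not fine at all"]
vocab = build_vocab(sentences)
batch = [encode(s, vocab, max_len=5) for s in sentences]
print(batch)   # [[4, 2, 3, 0, 0], [4, 2, 7, 3, 6]]
```

Keeping padding and unknown words on separate indices lets a model or a mask tell "no token here" apart from "token not in the vocabulary".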


In [8]:
import torch  # PyTorch
import numpy as np  # NumPy
import os  # Operating-system utilities
from torchtext.data.utils import get_tokenizer  # Tokenizer factory from torchtext
from torchtext.vocab import build_vocab_from_iterator, Vocab  # Vocabulary builder and the Vocab class

EMBEDDING_DIM = 300  # Dimensionality of the word embeddings

# Assumes data2 has already been loaded and holds the required fields
text = [ses_mod['transcription'] for ses_mod in data2]  # Extract the 'transcription' field from data2

# Use torchtext's basic_english tokenizer
tokenizer = get_tokenizer('basic_english')

# Build the vocabulary
def yield_tokens(data):
    for text in data:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(text), specials=["<unk>"])  # Build the vocabulary with the special token "<unk>"
vocab.set_default_index(vocab["<unk>"])  # Out-of-vocabulary tokens map to "<unk>"
word_index = vocab.get_stoi()  # Mapping from token to index
print(f'Found {len(word_index)} unique tokens')  # Number of unique tokens in the vocabulary

file_loc = os.path.join('/kaggle/input/glove42b300dtxt/glove.42B.300d.txt')  # Path to the GloVe file
print(file_loc)  # Print the file path

# Load the GloVe word embeddings
gembeddings_index = {}
with open(file_loc, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        gembedding = np.asarray(values[1:], dtype='float32')
        gembeddings_index[word] = gembedding
print(f'G Word embeddings: {len(gembeddings_index)}')  # Number of GloVe vectors loaded

# Build the embedding matrix
nb_words = len(word_index) + 1
g_word_embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    gembedding_vector = gembeddings_index.get(word)
    if gembedding_vector is not None:
        g_word_embedding_matrix[i] = gembedding_vector

print(f'G Null word embeddings: {np.sum(np.sum(g_word_embedding_matrix, axis=1) == 0)}')  # Number of all-zero rows

# Convert to the tensor expected by a PyTorch embedding layer
g_word_embedding_matrix = torch.tensor(g_word_embedding_matrix, dtype=torch.float32)

# Initialize the embedding layer from the pretrained vectors
embedding_layer = torch.nn.Embedding.from_pretrained(g_word_embedding_matrix, freeze=False)

# Print some information about the embedding layer as a sanity check
print(f'Embedding layer weights shape: {embedding_layer.weight.shape}')  # Shape of the embedding weights

Found 3129 unique tokens
/kaggle/input/glove42b300dtxt/glove.42B.300d.txt
G Word embeddings: 1917494
G Null word embeddings: 475
Embedding layer weights shape: torch.Size([3130, 300])
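
To make the embedding-matrix construction easier to follow, here is a minimal sketch on a three-dimensional toy file in the same "word followed by floats" text layout. The names `load_vectors` and `build_embedding_matrix`, the toy vectors, and the toy vocabulary are all assumptions for illustration.

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse lines of the form '<word> <v1> <v2> ...' into a dict of float32 vectors."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    matrix = np.zeros((len(word_index), dim), dtype="float32")
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and vec.shape[0] == dim:
            matrix[idx] = vec
        else:
            missing.append(word)   # the row stays all-zero
    return matrix, missing

toy_file = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
word_index = {"<unk>": 0, "happy": 1, "sad": 2, "umm": 3}
matrix, missing = build_embedding_matrix(word_index, load_vectors(toy_file), dim=3)
print(matrix.shape, missing)   # (4, 3) ['<unk>', 'umm']
```

Recording the missing words explicitly is one way to report coverage; counting rows whose sum is zero gives a similar number but would also count any vector that happens to sum to zero.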


In [9]:
print(x_train_text.shape)

torch.Size([4936, 500])


In [10]:
import torch  # PyTorch
import numpy as np  # NumPy

# Assumes the stFeatureExtraction function is already defined
# If stFeatureExtraction is NumPy-based, it could be ported to torch

def calculate_features(frames, freq, options):
    window_sec = 0.2  # Window length of 0.2 seconds
    window_n = int(freq * window_sec)  # Number of samples per window
    
    # Extract short-term features with a 50% window overlap
    st_f = stFeatureExtraction(frames, freq, window_n, window_n // 2)
    st_f = torch.tensor(st_f, dtype=torch.float32)  # Convert the NumPy array to a PyTorch tensor
    
    if st_f.shape[1] > 2:  # More than two entries along dimension 1
        i0 = 1
        i1 = st_f.shape[1] - 1
        if i1 - i0 < 1:
            i1 = i0 + 1
        
        deriv_st_f = torch.zeros((st_f.shape[0], i1 - i0), dtype=torch.float32)  # Output tensor
        for i in range(i0, i1):
            i_left = i - 1  # unused
            i_right = i + 1  # unused
            deriv_st_f[:, i - i0] = st_f[:, i]  # Copy the interior columns (first and last are dropped)
        return deriv_st_f  # Despite the name, no derivative is computed
    elif st_f.shape[1] == 2:  # Exactly two entries along dimension 1
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first column
        return deriv_st_f
    else:  # Fewer than two entries along dimension 1
        deriv_st_f = torch.zeros((st_f.shape[0], 1), dtype=torch.float32)  # Output tensor
        deriv_st_f[:, 0] = st_f[:, 0]  # Keep only the first column
        return deriv_st_f
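
The names `deriv_st_f`, `i_left` and `i_right` hint at a derivative, but `calculate_features` as written returns the interior columns unchanged. If delta features were wanted, one possible formulation is a central difference over neighbouring frames. The sketch below is an assumption about what such a step could look like, not the author's method; `central_difference` is a hypothetical helper.

```python
import numpy as np

def central_difference(features):
    """Approximate the frame-to-frame derivative of a (n_features, n_frames) array.

    Returns an array of shape (n_features, n_frames - 2). Illustration only.
    """
    features = np.asarray(features, dtype=float)
    if features.shape[1] < 3:
        raise ValueError("need at least three frames for a central difference")
    # (x[t+1] - x[t-1]) / 2 for every interior frame t
    return (features[:, 2:] - features[:, :-2]) / 2.0

toy = np.array([[0.0, 1.0, 4.0, 9.0, 16.0]])   # one feature, five frames
print(central_difference(toy))                  # [[2. 4. 6.]]
```

The output has the same width as the interior-column slice used above, so it could be stacked with the static features if that were desired.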


In [11]:
import torch  # PyTorch
import numpy as np  # NumPy

# Pad (or truncate) a list of feature sequences into a fixed-size array
def pad_sequence_into_array(sequences, maxlen):
    num_samples = len(sequences)  # Number of sequences
    if len(sequences) == 0:
        return torch.zeros((num_samples, maxlen, 1), dtype=torch.float32), 0  # Empty input: return an empty tensor
    num_features = sequences[0].shape[0] if len(sequences[0].shape) > 1 else 1  # Number of features
    padded_array = torch.zeros((num_samples, num_features, maxlen), dtype=torch.float32)  # Zero tensor that receives the sequences
    for i, seq in enumerate(sequences):
        if len(seq.shape) == 1:
            seq = seq.reshape(1, -1)  # Promote a 1-D sequence to 2-D
        length = min(maxlen, seq.shape[1])  # Sequence length, capped at maxlen
        padded_array[i, :, :length] = seq[:, :length]  # Copy the sequence; the remainder stays zero
    return padded_array, length  # Padded array and the length of the last sequence copied

# Process the speech signals and convert them to PyTorch tensors
x_train_speech = []  # Holds the processed speech features
counter = 0
for ses_mod in data2:
    x_head = ses_mod['signal']  # Raw speech signal
    st_features = calculate_features(x_head, framerate, None)  # Short-term features of the signal
    st_features, _ = pad_sequence_into_array([st_features], maxlen=100)  # Pad or truncate to 100 frames
    x_train_speech.append(st_features)  # Collect the processed features
    counter += 1
    if counter % 100 == 0:
        print(counter)

x_train_speech = torch.cat(x_train_speech, dim=0)  # Concatenate all tensors along the sample axis
print(x_train_speech.shape)  # Shape of the processed speech data

100
200
...
4900
torch.Size([4936, 34, 100])
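
The relation between clip duration and the 100-frame limit can be estimated with simple arithmetic. The sketch assumes that the short-term extractor emits one frame per full window and drops any window that would run past the end of the signal; that behaviour of `stFeatureExtraction` is an assumption here, and `frame_count` is a hypothetical helper.

```python
def frame_count(n_samples, window, step):
    """Number of full windows of length `window` taken every `step` samples.

    Assumes windows that run past the end of the signal are dropped.
    """
    if n_samples < window:
        return 0
    return (n_samples - window) // step + 1

framerate = 16000
window = int(framerate * 0.2)   # 3200 samples per 0.2 s window
step = window // 2              # 1600 samples, i.e. 50% overlap

for seconds in (1, 5, 10, 20):
    print(seconds, "s ->", frame_count(seconds * framerate, window, step), "frames")
# 1 s -> 9, 5 s -> 49, 10 s -> 99, 20 s -> 199
```

Under this assumption, `maxlen=100` covers roughly the first ten seconds of an utterance; longer utterances are truncated and shorter ones are zero-padded.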


In [12]:
def process_mocap_data(data, expected_length=200, expected_features=189):
    if not isinstance(data, list):
        data = [data]
    
    processed_data = np.zeros((expected_length, expected_features))
    
    for i, row in enumerate(data):
        if i >= expected_length:
            break
        if isinstance(row, (list, np.ndarray)):
            if len(np.array(row).shape) > 1:
                row = np.array(row).flatten()
            row = np.nan_to_num(row, nan=0.0)  # Replace NaN with 0
            processed_data[i, :min(len(row), expected_features)] = row[:expected_features]
    
    return processed_data

def normalize_mocap(data):
    mean = np.nanmean(data, axis=(0, 1), keepdims=True)
    std = np.nanstd(data, axis=(0, 1), keepdims=True)
    return (data - mean) / (std + 1e-8)

x_train_mocap = []
for ses_mod in data2:
    x_head = process_mocap_data(ses_mod['mocap_head'], expected_length=200, expected_features=18)
    x_hand = process_mocap_data(ses_mod['mocap_hand'], expected_length=200, expected_features=6)
    x_rot = process_mocap_data(ses_mod['mocap_rot'], expected_length=200, expected_features=165)

    x_head = normalize_mocap(x_head)
    x_hand = normalize_mocap(x_hand)
    x_rot = normalize_mocap(x_rot)

    x_mocap = np.concatenate((x_head, x_hand, x_rot), axis=1)
    x_train_mocap.append(torch.tensor(x_mocap, dtype=torch.float32))

x_train_mocap = torch.stack(x_train_mocap)
x_train_mocap = x_train_mocap.view(-1, 200, 189, 1)

# Replace any remaining NaN values with 0
x_train_mocap = torch.nan_to_num(x_train_mocap, nan=0.0)

In [13]:
def check_nan_mocap(data, name):
    nan_mask = torch.isnan(data)
    if nan_mask.any():
        print(f"{name} contains NaN values:")
        print(f"Total NaN values: {nan_mask.sum().item()}")
        print(f"Samples with NaN: {nan_mask.any(dim=(1,2,3)).sum().item()}/{data.shape[0]}")
        return True
    return False

check_nan_mocap(x_train_mocap, "Mocap data")

False

In [14]:
def interpolate_nan_mocap(data):
    data_np = data.numpy()
    for i in range(data_np.shape[0]):
        for j in range(data_np.shape[2]):
            data_np[i, :, j, 0] = pd.Series(data_np[i, :, j, 0]).interpolate().values
    return torch.from_numpy(data_np)

if check_nan_mocap(x_train_mocap, "Mocap data"):
    x_train_mocap = interpolate_nan_mocap(x_train_mocap)

In [15]:
print(x_train_mocap.shape)  # Check the resulting shape

torch.Size([4936, 200, 189, 1])


In [16]:
import torch
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Collect the emotion label of every sample
Y = []
for ses_mod in data2:
    Y.append(ses_mod['emotion'])

# Encode the string labels as integers with LabelEncoder
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

# Convert Y to a PyTorch tensor of type long
Y = torch.tensor(Y, dtype=torch.long)

# Print the shape of Y
print(Y.shape)  # Should be torch.Size([950]) or torch.Size([4936])

torch.Size([4936])
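
`LabelEncoder` assigns integer codes in sorted order of the class names, which determines how the class indices in later outputs map back to emotions. A dependency-free sketch of that mapping, using a hypothetical `fit_label_mapping` helper and a toy label list:

```python
def fit_label_mapping(labels):
    # Sorted unique labels -> consecutive integer codes
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

labels = ["neu", "sad", "ang", "exc", "neu", "ang"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]
inverse = {code: label for label, code in mapping.items()}

print(mapping)   # {'ang': 0, 'exc': 1, 'neu': 2, 'sad': 3}
print(encoded)   # [2, 3, 0, 1, 2, 0]
```

With the four emotions used in this notebook, class 0 is therefore `ang`, 1 is `exc`, 2 is `neu` and 3 is `sad`.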


## 4. Dataset preparation

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class EmotionDataset(Dataset):
    def __init__(self, text, speech, mocap, labels):
        self.text = [t.clone().detach().long() for t in text]
        self.speech = [s.clone().detach().float() for s in speech]
        self.mocap = [m.clone().detach().float() for m in mocap]
        self.labels = [l.clone().detach().long() for l in labels]
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'text': self.text[idx],
            'speech': self.speech[idx],
            'mocap': self.mocap[idx],
            'labels': self.labels[idx]
        }

In [18]:
from sklearn.model_selection import train_test_split

# First, split the data into training and test sets
X_text_train, X_text_test, X_speech_train, X_speech_test, X_mocap_train, X_mocap_test, y_train, y_test = train_test_split(
    x_train_text, x_train_speech, x_train_mocap, Y, test_size=0.2, random_state=66, stratify=Y
)

# Flatten and concatenate all training features into one 2-D array
X_combined_train = np.hstack((X_text_train.numpy(), 
                              X_speech_train.numpy().reshape(X_speech_train.shape[0], -1), 
                              X_mocap_train.numpy().reshape(X_mocap_train.shape[0], -1)))
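
Because the three modalities are flattened and stacked side by side, it helps to know which column range belongs to which modality when the array is split again later. The sketch below derives those ranges from the per-sample shapes printed earlier in this notebook; the `shapes` and `offsets` dictionaries are illustrative names.

```python
from math import prod

# Per-sample shapes reported earlier in this notebook
shapes = {"text": (500,), "speech": (34, 100), "mocap": (200, 189, 1)}

offsets = {}
start = 0
for name, shape in shapes.items():
    width = prod(shape)              # number of columns after flattening
    offsets[name] = (start, start + width)
    start += width

print(offsets)                  # {'text': (0, 500), 'speech': (500, 3900), 'mocap': (3900, 41700)}
print("total columns:", start)  # total columns: 41700
```

Each sample thus becomes a row of 41,700 values, most of which come from the motion-capture block.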

In [19]:
import numpy as np
from imblearn.over_sampling import SMOTE
from collections import Counter

# Assumes y_train holds the training labels

# First, count the samples in each class
class_counts = Counter(y_train.numpy())

# Find the size of the majority class
max_samples = max(class_counts.values())

# Resampling strategy: raise every minority class to the size of the majority class
sampling_strategy = {cls: max_samples for cls in class_counts.keys() if class_counts[cls] < max_samples}

# Apply SMOTE
smote = SMOTE(sampling_strategy=sampling_strategy, random_state=66)
X_resampled, y_resampled = smote.fit_resample(X_combined_train, y_train.numpy())

# Print the class distribution after resampling
print("Class distribution after resampling:", Counter(y_resampled))

Class distribution after resampling: Counter({3: 1366, 2: 1366, 0: 1366, 1: 1366})
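
SMOTE creates a synthetic minority sample by interpolating between an existing sample and one of its nearest neighbours of the same class. The sketch below shows only that interpolation step, not the neighbour search; `interpolate` is a hypothetical helper and the token-id rows are made-up values.

```python
import random

def interpolate(sample, neighbor, lam):
    """Return sample + lam * (neighbor - sample), the step that creates a synthetic point."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(66)
lam = rng.random()                     # random factor in [0, 1)

token_ids_a = [12, 7, 0, 0]            # toy token-id rows (hypothetical values)
token_ids_b = [40, 3, 9, 0]
synthetic = interpolate(token_ids_a, token_ids_b, lam)

print(synthetic)                        # generally non-integer values
print([int(v) for v in synthetic])      # what truncation to integers would keep
```

One consequence worth keeping in mind: when the flattened row contains token indices, the interpolated values are usually fractional, and casting them back to integers selects tokens that may appear in neither source utterance.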


In [20]:
from sklearn.utils.class_weight import compute_class_weight

num_workers = 4

# Split the resampled training data back into the original per-modality shapes
text_shape = X_text_train.shape[1]
speech_shape = X_speech_train.shape[1] * X_speech_train.shape[2]
mocap_shape = X_mocap_train.shape[1] * X_mocap_train.shape[2] * X_mocap_train.shape[3]

X_text_resampled = torch.tensor(X_resampled[:, :text_shape])
X_speech_resampled = torch.tensor(X_resampled[:, text_shape:text_shape+speech_shape]).reshape(-1, X_speech_train.shape[1], X_speech_train.shape[2])
X_mocap_resampled = torch.tensor(X_resampled[:, text_shape+speech_shape:]).reshape(-1, X_mocap_train.shape[1], X_mocap_train.shape[2], X_mocap_train.shape[3])
y_resampled = torch.as_tensor(y_resampled).clone().detach()

# Build datasets from the resampled training set and the original test set
train_dataset = EmotionDataset(X_text_resampled, X_speech_resampled, X_mocap_resampled, y_resampled)
test_dataset = EmotionDataset(X_text_test, X_speech_test, X_mocap_test, y_test)

# Build pred_dataset from the first 10 test samples
pred_dataset = EmotionDataset(X_text_test[:10], X_speech_test[:10], X_mocap_test[:10], y_test[:10])


# Create the data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False, pin_memory=True)
pred_loader = DataLoader(pred_dataset, batch_size=1, shuffle=False)  # batch_size=1 so each sample is evaluated on its own


# Compute class weights from the resampled training set
class_weights = compute_class_weight('balanced', classes=np.unique(y_resampled), y=y_resampled.numpy())
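
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * count)`. The dependency-free sketch below reproduces that formula with a hypothetical `balanced_class_weights` helper and shows what it yields on the resampled class counts printed above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# After resampling, every class has the same number of samples
resampled = [0] * 1366 + [1] * 1366 + [2] * 1366 + [3] * 1366
print(balanced_class_weights(resampled))   # {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}

# Toy imbalanced labels for comparison: the rarer class gets the larger weight
toy_weights = balanced_class_weights([0, 0, 0, 1])
print(toy_weights)
```

Since the training set is already balanced by oversampling, all four weights come out as 1.0, so a weighted loss behaves like an unweighted one here.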

## 5. Function preparation

In [21]:
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
import numpy as np
import time  # Needed by train_model for per-epoch timing

def calculate_metrics(outputs, labels):
    preds = torch.max(outputs, dim=1)[1].cpu().numpy()
    labels = labels.cpu().numpy()
    
    # Weighted F1 score
    f1 = f1_score(labels, preds, average='weighted')
    
    # Overall accuracy from accuracy_score (reported as UW_Acc in this notebook)
    uw_acc = accuracy_score(labels, preds)
    
    return f1, uw_acc

# Model validation
def validate(model, val_loader, criterion, device):
    val_loss = 0
    val_f1, val_uw_acc = 0, 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in val_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            outputs = model(text, speech, mocap)  # Forward pass
            loss = criterion(outputs, labels)  # Compute the loss
            val_loss += loss.item()  # Accumulate the loss
            
            f1, uw_acc = calculate_metrics(outputs, labels)
            val_f1 += f1
            val_uw_acc += uw_acc

    val_loss /= len(val_loader)  # Mean loss over batches
    val_f1 /= len(val_loader)  # Mean F1 score over batches
    val_uw_acc /= len(val_loader)  # Mean accuracy over batches

    return val_loss, val_f1, val_uw_acc

# Print the results of one epoch
def print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, 
              val_loss, val_f1, val_uw_acc, epochs=10):
    print(f"Epoch [{epoch+1}/{epochs}], time: {train_time:.2f}s")
    print(f"Train - loss: {train_loss:.4f}, F1: {train_f1:.4f}, UW_Acc: {train_uw_acc:.4f}")
    print(f"Val - loss: {val_loss:.4f}, F1: {val_f1:.4f}, UW_Acc: {val_uw_acc:.4f}")

# Model training
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs, device):
    train_losses = []
    train_f1s, train_uw_accs = [], []
    val_losses = []
    val_f1s, val_uw_accs = [], []
    
    for epoch in range(num_epochs):
        model.train()
        
        train_loss = 0
        train_f1, train_uw_acc = 0, 0
        
        start_time = time.time()  # Start time of this epoch
        
        for batch in train_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            # Sanity-check the inputs
            if torch.isnan(text).any() or torch.isnan(speech).any() or torch.isnan(mocap).any():
                print("Input data contains NaN!")
                continue
            
            optimizer.zero_grad()  # Reset the gradients of all model parameters
            outputs = model(text, speech, mocap)  # Forward pass
            
            loss = criterion(outputs, labels)  # Compute the loss
            train_loss += loss.item()  # Accumulate the loss
            
            f1, uw_acc = calculate_metrics(outputs, labels)
            train_f1 += f1
            train_uw_acc += uw_acc
            
            loss.backward()  # Backpropagate to compute gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()  # Update the model parameters
        
        end_time = time.time()  # End time of this epoch
        train_time = end_time - start_time  # Training time of this epoch
        
        train_loss /= len(train_loader)  # Mean loss over batches
        train_f1 /= len(train_loader)  # Mean F1 score over batches
        train_uw_acc /= len(train_loader)  # Mean accuracy over batches
        
        val_loss, val_f1, val_uw_acc = validate(model, val_loader, criterion, device)
        
        scheduler.step(val_f1)
        
        train_losses.append(train_loss)
        train_f1s.append(train_f1)
        train_uw_accs.append(train_uw_acc)
        val_losses.append(val_loss)
        val_f1s.append(val_f1)
        val_uw_accs.append(val_uw_acc)
        
        print_log(epoch, train_time, train_loss, train_f1, train_uw_acc, 
                  val_loss, val_f1, val_uw_acc, epochs=num_epochs)  # Print the results of this epoch

    return train_losses, train_f1s, train_uw_accs, val_losses, val_f1s, val_uw_accs
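
In the speech emotion recognition literature, "weighted accuracy" usually means the overall fraction of correct predictions, while "unweighted accuracy" means the average of per-class recall. The value returned by `accuracy_score` corresponds to the former. The sketch below contrasts the two on a toy imbalanced example; both functions are illustrative helpers, not part of this notebook.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    # Fraction of all samples classified correctly
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    # Mean of per-class recall, so each class counts equally
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(weighted_accuracy(y_true, y_pred))    # 0.9
print(unweighted_accuracy(y_true, y_pred))  # 0.75
```

The two numbers diverge whenever the classes are imbalanced, which is the case for the untouched test set. Averaging per-batch metrics, as the loops above do, can also differ slightly from computing the metric once over all predictions.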

In [22]:
def evaluate_model(model, test_loader, criterion, device):
    val_loss, f1, uw_acc = validate(model, test_loader, criterion, device)
    print(f"Test Results:")
    print(f"Loss: {val_loss:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Unweighted Accuracy: {uw_acc:.4f}")

In [23]:
def print_model_summary(model, input_size):
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"Model Structure:\n{model}\n")
    print(f"Input size: {input_size}")
    
    total_params = 0
    for name, module in model.named_children():
        params = sum(p.numel() for p in module.parameters() if p.requires_grad)
        total_params += params
        print(f"{name}: {params:,} parameters")
        
        if hasattr(module, 'named_children'):
            for sub_name, sub_module in module.named_children():
                sub_params = sum(p.numel() for p in sub_module.parameters() if p.requires_grad)
                print(f"  {sub_name}: {sub_params:,} parameters")
    
    print(f"\nTotal trainable parameters: {total_params:,}")
    
    print("\nDetailed parameter shapes:")
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f"{name}: {param.shape}")
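
The counts printed by `print_model_summary` can be cross-checked by hand. The sketch below applies the standard parameter formulas for embedding, 1-D convolution and linear layers to sizes used by the text branch in Section 6 (vocabulary 3130, embedding dimension 300, kernel size 3, sequence length 500). The three helper functions are hypothetical names introduced for this check.

```python
def embedding_params(num_embeddings, dim):
    # One vector of length `dim` per vocabulary entry
    return num_embeddings * dim

def conv1d_params(in_ch, out_ch, kernel_size, bias=True):
    # One (in_ch x kernel_size) filter per output channel, plus optional biases
    return out_ch * in_ch * kernel_size + (out_ch if bias else 0)

def linear_params(in_features, out_features, bias=True):
    # Weight matrix plus optional bias vector
    return in_features * out_features + (out_features if bias else 0)

print(embedding_params(3130, 300))   # 939000
print(conv1d_params(300, 256, 3))    # 230656
print(linear_params(32 * 500, 256))  # 4096256
```

The dense layer that follows the flattened attention output dominates the text branch, because its input width grows with the maximum sequence length.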

In [24]:
# Parameter initialization
nb_words = 3130 # 3130 or 1419
embedding_dim = 300
max_sequence_length = 500  # Maximum sequence length
num_heads = 8

## 6. Deep learning models

### Model-5: Multi-attention Text+speech

In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(ff_dim, embed_dim)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = nn.Dropout(dropout)

        assert (
            self.head_dim * num_heads == embed_dim
        ), "Embedding dimension must be divisible by number of heads"

        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.fc_out = nn.Linear(embed_dim, embed_dim)
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = FeedForward(embed_dim, 4 * embed_dim, dropout)

    def forward(self, x):
        N, seq_length, embed_dim = x.shape

        queries = self.query(x).view(N, seq_length, self.num_heads, self.head_dim)
        keys = self.key(x).view(N, seq_length, self.num_heads, self.head_dim)
        values = self.value(x).view(N, seq_length, self.num_heads, self.head_dim)

        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        energy = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** (1 / 2))
        attention = self.dropout(torch.softmax(energy, dim=-1))  # Dropout on the attention weights

        out = torch.matmul(attention, values)
        out = out.transpose(1, 2).contiguous().view(N, seq_length, self.embed_dim)
        
        out = self.fc_out(out)
        out = self.dropout(out)  # Dropout on the projected output
        out = self.layer_norm1(out + x)
        ff_out = self.feed_forward(out)
        out = self.layer_norm2(ff_out + out)

        return out

class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.attention = MultiHeadSelfAttention(embed_dim=32, num_heads=num_heads)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn_dense = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = x.permute(0, 2, 1)
        x = self.attention(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn_dense(self.dense(x))
        return x

class SpeechModel(nn.Module):
    def __init__(self):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.dense2 = nn.Linear(1024, 512)
        self.bn2 = nn.BatchNorm1d(512)
        self.dense3 = nn.Linear(512, 256)
        self.bn3 = nn.BatchNorm1d(256)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.dense2(x)))
        x = self.dropout(x)
        x = self.bn3(self.dense3(x))
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(CombinedModel, self).__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads)
        self.speech_model = SpeechModel()
        self.attention = MultiHeadSelfAttention(embed_dim=512, num_heads=num_heads)
        self.fc1 = nn.Linear(512, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(256, 4)
    
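    def forward(self, text, speech, mocap=None):
        # mocap is accepted for a uniform call signature but is not used by this model
        text_out = self.text_model(text)        # (N, 256)
        speech_out = self.speech_model(speech)  # (N, 256)
        combined = torch.cat((text_out, speech_out), dim=1)  # (N, 512)
        # The fused vector is treated as a length-1 sequence for the attention block
        combined = self.attention(combined.unsqueeze(1)).squeeze(1)
        combined = F.relu(self.bn1(self.fc1(combined)))
        combined = self.fc2(combined)           # (N, 4) class logits
        return combined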
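
In `CombinedModel.forward` above, `combined.unsqueeze(1)` gives the attention block a sequence of length 1. With a single key, the softmax over keys is always 1, so the block acts as a learned projection with residual and feed-forward layers rather than as a weighting over several positions. The sketch below checks this on random tensors; the batch size, `num_heads = 8`, and `head_dim = 64` are illustrative assumptions, not values taken from the notebook.

```python
import torch

# Illustrative sizes (assumptions): batch of 4, 8 heads, head_dim 64 (8 * 64 = 512)
N, num_heads, head_dim = 4, 8, 64
seq_length = 1  # the sequence length produced by combined.unsqueeze(1)

queries = torch.randn(N, num_heads, seq_length, head_dim)
keys = torch.randn(N, num_heads, seq_length, head_dim)

# Same scaled dot-product form as in MultiHeadSelfAttention.forward
energy = torch.matmul(queries, keys.transpose(-1, -2)) / (head_dim ** 0.5)
attention = torch.softmax(energy, dim=-1)

# With one key per query, every attention weight equals 1
all_ones = bool(torch.allclose(attention, torch.ones_like(attention)))
print(attention.shape, all_ones)
```

The same observation applies wherever a single feature vector is passed through attention via `unsqueeze(1)`.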

In [26]:
# Select the device: GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the selected device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model (the loss function and optimizer are created in a later cell)
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)

# Print the model structure
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (attention): MultiHeadSelfAttention(
      (dropout): Dropout(p=0.1, inplace=False)
      (query): Linear(in_features=32, out_features=32, bias=True)
      (key): Linear(in_features=32, out_features=32, bias=True)
      (value): Linear(in_features=32, o

In [27]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Take a single batch for a smoke test
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape
print(outputs.shape)

Using 2 GPUs!
torch.Size([128, 4])
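
A freshly built `nn.Module` starts in training mode, so unless `print_model_summary` switched the model to eval mode, the smoke-test forward pass above also updates the BatchNorm running statistics on that batch. A minimal sketch of the difference, using a standalone `nn.BatchNorm1d` layer and random data (both are illustrative assumptions, not part of the notebook's models):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)             # starts in training mode
x = torch.randn(128, 4) + 5.0      # made-up batch with a mean far from zero

before = bn.running_mean.clone()
bn(x)                              # training-mode forward: running stats are updated
after_train = bn.running_mean.clone()

bn.eval()
with torch.no_grad():
    bn(x)                          # eval-mode forward: running stats stay fixed
after_eval = bn.running_mean.clone()

print(before, after_train, after_eval)
```

Running a shape check under `model.eval()` and `torch.no_grad()` would avoid touching these statistics before training starts.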


In [28]:
import torch.optim as optim
from torch.optim import lr_scheduler

# Loss function. Note: class_weights is prepared above but not passed here, so the loss is unweighted.
criterion = torch.nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.3, patience=5, verbose=True)

# Run training
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 6.92s
Train - loss: 1.2308, F1: 0.4019, UW_Acc: 0.4150
Val - loss: 1.2428, F1: 0.3883, UW_Acc: 0.4267
Epoch [2/80], time: 5.74s
Train - loss: 0.9752, F1: 0.5737, UW_Acc: 0.5772
Val - loss: 0.9938, F1: 0.5650, UW_Acc: 0.5697
Epoch [3/80], time: 5.77s
Train - loss: 0.8241, F1: 0.6621, UW_Acc: 0.6621
Val - loss: 0.9848, F1: 0.5862, UW_Acc: 0.5923
Epoch [4/80], time: 5.84s
Train - loss: 0.7002, F1: 0.7193, UW_Acc: 0.7185
Val - loss: 1.0538, F1: 0.5584, UW_Acc: 0.5845
Epoch [5/80], time: 5.82s
Train - loss: 0.6003, F1: 0.7676, UW_Acc: 0.7675
Val - loss: 0.9236, F1: 0.6536, UW_Acc: 0.6540
Epoch [6/80], time: 5.82s
Train - loss: 0.5133, F1: 0.8013, UW_Acc: 0.8010
Val - loss: 1.0607, F1: 0.6042, UW_Acc: 0.6081
Epoch [7/80], time: 5.84s
Train - loss: 0.4510, F1: 0.8277, UW_Acc: 0.8274
Val - loss: 1.0844, F1: 0.6067, UW_Acc: 0.6134
Epoch [8/80], time: 5.84s
Train - loss: 0.3922, F1: 0.8533, UW_Acc: 0.8526
Val - loss: 1.1765, F1: 0.5953, UW_Acc: 0.6068
Epoch [9/80], time: 5.80

([1.2308305335599323,
  0.9752326427504073,
  0.824103923731072,
  0.7001542421274407,
  0.6002598903899969,
  0.5133445962916973,
  0.45102210238922474,
  0.39216787420039956,
  0.3517377179029376,
  0.3171789618425591,
  0.2970693298550539,
  0.23460519279158393,
  0.20575914895811745,
  0.18209245800971985,
  0.16530216260011807,
  0.17571557347857675,
  0.1639437801962675,
  0.16070767942556116,
  0.1506528980856718,
  0.14721127304919931,
  0.1322363128149232,
  0.12884535666468533,
  0.13599042793692545,
  0.11811296485884246,
  0.1116602674646433,
  0.10665130008791768,
  0.11274751691624176,
  0.10998728914662849,
  0.1061461030743843,
  0.10124147544766582,
  0.10359856183099192,
  0.09171678134521773,
  0.09660792965875116,
  0.09841575282950733,
  0.09690624588104181,
  0.09938645406171333,
  0.09457190473412358,
  0.09064373980427898,
  0.09612741191373315,
  0.08914554006485052,
  0.1046496389736963,
  0.0845747735611228,
  0.09147951336101044,
  0.0939126706920391,
  0.08
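
The training cell above builds `criterion` without the `class_weights` tensor prepared earlier, so the reported losses are unweighted. If class weighting is wanted, `nn.CrossEntropyLoss` accepts it through its `weight` argument. The sketch below uses made-up weights, logits, and targets to show how the weighted mean is formed; it is not the configuration that produced the logs above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up values for illustration: 4 emotion classes, 6 samples
class_weights = torch.tensor([1.0, 2.0, 1.5, 0.5])
logits = torch.randn(6, 4)
targets = torch.tensor([0, 1, 2, 3, 1, 0])

unweighted = nn.CrossEntropyLoss()(logits, targets)
weighted = nn.CrossEntropyLoss(weight=class_weights)(logits, targets)

# With reduction='mean', each sample's loss is scaled by the weight of its
# target class, and the sum is divided by the sum of those weights.
per_sample = F.cross_entropy(logits, targets, reduction='none')
sample_w = class_weights[targets]
manual = (per_sample * sample_w).sum() / sample_w.sum()

print(unweighted.item(), weighted.item(), manual.item())
```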

## Model-6: Multi-Attention Text+Speech

In [29]:
# import torch
# import torch.nn as nn
# import torch.nn.functional as F

# class MultiHeadSelfAttention(nn.Module):
#     def __init__(self, embed_dim, num_heads):
#         super(MultiHeadSelfAttention, self).__init__()
#         self.embed_dim = embed_dim
#         self.num_heads = num_heads
#         self.head_dim = embed_dim // num_heads

#         assert (
#             self.head_dim * num_heads == embed_dim
#         ), "Embedding dimension must be divisible by number of heads"

#         self.query = nn.Linear(embed_dim, embed_dim)
#         self.key = nn.Linear(embed_dim, embed_dim)
#         self.value = nn.Linear(embed_dim, embed_dim)
#         self.fc_out = nn.Linear(embed_dim, embed_dim)
#         self.layer_norm = nn.LayerNorm(embed_dim)

#     def forward(self, x):
#         N, seq_length, embed_dim = x.shape

#         # Split the embedding into self.num_heads different pieces
#         queries = self.query(x).view(N, seq_length, self.num_heads, self.head_dim)
#         keys = self.key(x).view(N, seq_length, self.num_heads, self.head_dim)
#         values = self.value(x).view(N, seq_length, self.num_heads, self.head_dim)

#         # Transpose the queries, keys, and values
#         queries = queries.transpose(1, 2)
#         keys = keys.transpose(1, 2)
#         values = values.transpose(1, 2)

#         # Calculate the energy between the queries and keys
#         energy = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** (1 / 2))
#         attention = torch.softmax(energy, dim=-1)

#         # Multiply the attention values with the values
#         out = torch.matmul(attention, values)
#         out = out.transpose(1, 2).contiguous().view(N, seq_length, self.embed_dim)
        
#         # Apply the final fully connected layer
#         out = self.fc_out(out)
        
#         # Add residual connection and apply layer normalization
#         out = self.layer_norm(out + x)

#         return out

# class TextModel(nn.Module):
#     def __init__(self, embedding_dim, max_sequence_length, g_word_embedding_matrix, dropout_rate=0.1, num_heads=num_heads):
#         super(TextModel, self).__init__()
#         self.embedding_dim = embedding_dim
#         self.embedding = nn.Embedding.from_pretrained(g_word_embedding_matrix, freeze=False)  # Make embeddings trainable
#         self.lstm1 = nn.LSTM(input_size=embedding_dim, hidden_size=512, batch_first=True, bidirectional=True, num_layers=2, dropout=dropout_rate)
#         self.lstm2 = nn.LSTM(input_size=512*2, hidden_size=512, batch_first=True, bidirectional=True, num_layers=2, dropout=dropout_rate)  # Note: 512*2 due to bidirectional
#         self.attention = MultiHeadSelfAttention(embed_dim=512*2, num_heads=num_heads)
#         self.dropout = nn.Dropout(dropout_rate)  # Dropout layer
#         self.dense = nn.Linear(512*2, 256)  # 512*2 due to bidirectional output
    
#     def forward(self, x):
#         x = self.embedding(x)
#         x, _ = self.lstm1(x)
#         x, _ = self.lstm2(x)
#         x = self.attention(x)
#         x = self.dropout(x[:, -1, :])  # Apply dropout to the output of the last LSTM cell
#         x = self.dense(x)  # Only use the output of the last LSTM cell
#         return x

# class SpeechModel(nn.Module):
#     def __init__(self, dropout_rate=0.3, num_heads=num_heads):
#         super(SpeechModel, self).__init__()
#         self.flatten = nn.Flatten()
#         self.dense1 = nn.Linear(100 * 34, 1024)
#         self.bn1 = nn.BatchNorm1d(1024)
#         self.attention = MultiHeadSelfAttention(embed_dim=1024, num_heads=num_heads)
#         self.dense2 = nn.Linear(1024, 256)
#         self.bn2 = nn.BatchNorm1d(256)
#         self.dropout = nn.Dropout(dropout_rate)
    
#     def forward(self, x):
#         x = self.flatten(x)
#         x = F.relu(self.bn1(self.dense1(x)))
#         x = self.attention(x.unsqueeze(1)).squeeze(1)
#         x = self.dropout(x)
#         x = self.bn2(self.dense2(x))
#         return x

# class CombinedModel(nn.Module):
#     def __init__(self, embedding_dim, max_sequence_length, g_word_embedding_matrix, dropout_rate=0.3, num_heads=num_heads):
#         super(CombinedModel, self).__init__()
#         self.text_model = TextModel(embedding_dim, max_sequence_length, g_word_embedding_matrix, dropout_rate, num_heads)
#         self.speech_model = SpeechModel(dropout_rate, num_heads)
#         self.fc1 = nn.Linear(256 * 2, 256)
#         self.bn1 = nn.BatchNorm1d(256)
#         self.fc2 = nn.Linear(256, 4)
    
#     def forward(self, text, speech, mocap=None):
#         text_out = self.text_model(text)
#         speech_out = self.speech_model(speech)
#         combined = torch.cat((text_out, speech_out), dim=1)
#         combined = F.relu(self.bn1(self.fc1(combined)))
#         combined = self.fc2(combined)
#         return combined

In [30]:
# # Select the device: GPU if available, otherwise CPU
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# # Make sure class_weights is a tensor, then move it to the selected device
# if not isinstance(class_weights, torch.Tensor):
#     class_weights = torch.FloatTensor(class_weights)

# class_weights = class_weights.to(device)

# # Instantiate the model (the loss function and optimizer are created in a later cell)
# model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix)

# # Print the model structure
# print(model_combined)

# print("Combined Model Summary:")
# print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

In [31]:
# # Wrap the model in DataParallel when more than one GPU is available
# if torch.cuda.device_count() > 1:
#     print(f"Using {torch.cuda.device_count()} GPUs!")
#     model_combined = nn.DataParallel(model_combined)

# # Move the model to the device
# model_combined.to(device)

# # Take a single batch for a smoke test
# batch = next(iter(train_loader))
# text = batch['text'].to(device).long()
# speech = batch['speech'].to(device)
# mocap = batch['mocap'].to(device)

# # Forward pass
# outputs = model_combined(text, speech, mocap)

# # Print the output shape
# print(outputs.shape)

In [32]:
# import torch.optim as optim
# from torch.optim import lr_scheduler
# criterion = torch.nn.CrossEntropyLoss()
# # Optimizer
# optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# # Learning-rate scheduler
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.3, patience=5, verbose=True)

# # Run training
# train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

## Model-7: Multi-Attention Text+Speech+Mocap

In [33]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score
import time

class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(ff_dim, embed_dim)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = nn.Dropout(dropout)

        assert (
            self.head_dim * num_heads == embed_dim
        ), "Embedding dimension must be divisible by number of heads"

        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.fc_out = nn.Linear(embed_dim, embed_dim)
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = FeedForward(embed_dim, 4 * embed_dim, dropout)

    def forward(self, x):
        N, seq_length, embed_dim = x.shape

        queries = self.query(x).view(N, seq_length, self.num_heads, self.head_dim)
        keys = self.key(x).view(N, seq_length, self.num_heads, self.head_dim)
        values = self.value(x).view(N, seq_length, self.num_heads, self.head_dim)

        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        energy = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** (1 / 2))
        attention = self.dropout(torch.softmax(energy, dim=-1))  # dropout on the attention weights

        out = torch.matmul(attention, values)
        out = out.transpose(1, 2).contiguous().view(N, seq_length, self.embed_dim)

        out = self.fc_out(out)
        out = self.dropout(out)  # dropout on the projected output
        out = self.layer_norm1(out + x)
        ff_out = self.feed_forward(out)
        out = self.layer_norm2(ff_out + out)

        return out

class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.attention = MultiHeadSelfAttention(embed_dim=32, num_heads=num_heads)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn5 = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = x.permute(0, 2, 1)
        x = self.attention(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn5(self.dense(x))
        return x

class SpeechModel(nn.Module):
    def __init__(self, dropout_rate=0.1, num_heads=num_heads):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.attention = MultiHeadSelfAttention(embed_dim=1024, num_heads=num_heads)
        self.dense2 = nn.Linear(1024, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.dropout = nn.Dropout(dropout_rate)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.attention(x.unsqueeze(1)).squeeze(1)
        x = self.dropout(x)
        x = self.bn2(self.dense2(x))
        return x

class MocapModel(nn.Module):
    def __init__(self):
        super(MocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.bn5 = nn.BatchNorm2d(256)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        # 256 channels on a 7 x 6 grid: five stride-2 convolutions applied to a 200 x 189 input
        self.dense = nn.Linear(256 * 7 * 6, 256)
        self.bn8 = nn.BatchNorm1d(256)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W) for Conv2d
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn8(self.dense(x))
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(CombinedModel, self).__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads)
        self.speech_model = SpeechModel(num_heads=num_heads)
        self.mocap_model = MocapModel()
        self.fc1 = nn.Linear(256 * 3, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.fc2 = nn.Linear(256, 4)
    
    def forward(self, text, speech, mocap):
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        mocap_out = self.mocap_model(mocap)
        combined = torch.cat((text_out, speech_out, mocap_out), dim=1)
        combined = F.relu(self.bn1(self.fc1(combined)))
        combined = self.fc2(combined)
        return combined
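
The `256 * 7 * 6` input size of `MocapModel.dense` follows from the standard convolution output-size formula applied five times to a 200 x 189 input (the mocap shape passed to `print_model_summary`). A small check, where `conv_out` is a hypothetical helper name introduced only for this sketch:

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one convolution along a single spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Mocap input is (200, 189, 1); after permute the spatial dims are 200 x 189
h, w = 200, 189
sizes = [(h, w)]
for _ in range(5):  # conv1 .. conv5 all use kernel_size=3, stride=2, padding=1
    h, w = conv_out(h), conv_out(w)
    sizes.append((h, w))

flattened = 256 * h * w  # 256 output channels from conv5
print(sizes)
print(flattened)
```

If the mocap input shape changes, the same loop gives the value needed for the `nn.Linear` input size.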

In [34]:
# Select the device: GPU if available, otherwise CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor, then move it to the selected device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads)

# Print the model structure
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (attention): MultiHeadSelfAttention(
      (dropout): Dropout(p=0.1, inplace=False)
      (query): Linear(in_features=32, out_features=32, bias=True)
      (key): Linear(in_features=32, out_features=32, bias=True)
      (value): Linear(in_features=32, o

In [35]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Take a single batch for a smoke test
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape
print(outputs.shape)

Using 2 GPUs!
torch.Size([128, 4])


In [36]:
import torch.optim as optim
from torch.optim import lr_scheduler

# Loss function. Note: class_weights is prepared above but not passed here, so the loss is unweighted.
criterion = torch.nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.3, patience=5, verbose=True)

# Run training
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 8.09s
Train - loss: 1.1401, F1: 0.4806, UW_Acc: 0.4936
Val - loss: 1.2778, F1: 0.4386, UW_Acc: 0.4581
Epoch [2/80], time: 7.87s
Train - loss: 0.8184, F1: 0.6622, UW_Acc: 0.6654
Val - loss: 0.9675, F1: 0.5730, UW_Acc: 0.5837
Epoch [3/80], time: 7.85s
Train - loss: 0.6374, F1: 0.7499, UW_Acc: 0.7497
Val - loss: 0.9889, F1: 0.6106, UW_Acc: 0.6167
Epoch [4/80], time: 7.84s
Train - loss: 0.5260, F1: 0.8007, UW_Acc: 0.8009
Val - loss: 0.9681, F1: 0.6285, UW_Acc: 0.6417
Epoch [5/80], time: 7.82s
Train - loss: 0.4377, F1: 0.8416, UW_Acc: 0.8411
Val - loss: 0.9710, F1: 0.6369, UW_Acc: 0.6526
Epoch [6/80], time: 7.98s
Train - loss: 0.3716, F1: 0.8644, UW_Acc: 0.8642
Val - loss: 0.8444, F1: 0.6798, UW_Acc: 0.6858
Epoch [7/80], time: 7.80s
Train - loss: 0.3140, F1: 0.8885, UW_Acc: 0.8880
Val - loss: 1.1308, F1: 0.6330, UW_Acc: 0.6436
Epoch [8/80], time: 7.81s
Train - loss: 0.2715, F1: 0.9070, UW_Acc: 0.9066
Val - loss: 0.8907, F1: 0.6975, UW_Acc: 0.7022
Epoch [9/80], time: 7.80

([1.1400883280953695,
  0.8183701468068499,
  0.6373799207598664,
  0.5260107877642609,
  0.4377241952474727,
  0.3715539713238561,
  0.3140091622291609,
  0.27146527586981306,
  0.2247300373260365,
  0.19808672299218733,
  0.16724539322908535,
  0.16186780343915141,
  0.14570256495891615,
  0.12352123665948246,
  0.11319833370142204,
  0.1101403577729713,
  0.09423800774438437,
  0.09566593456060388,
  0.08530494042260703,
  0.07384864487793556,
  0.07560433590308178,
  0.06419770086054193,
  0.0638206223415774,
  0.0639672267229058,
  0.05755404842107795,
  0.061320294431129165,
  0.037792531990034635,
  0.02482133384707362,
  0.020490199500738188,
  0.02129561782186461,
  0.018252627869938,
  0.019371400782188703,
  0.016898191923838714,
  0.013487059276464373,
  0.016634339206787044,
  0.01811978286998563,
  0.01435749194324883,
  0.011512130764204749,
  0.01021599393272989,
  0.00807534152273695,
  0.008785394345258558,
  0.00903115774299083,
  0.008789011699602355,
  0.0087687523
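
`ReduceLROnPlateau` with `mode='max'` expects `scheduler.step(metric)` to receive a quantity that should increase, such as validation F1 or accuracy. `train_model` is defined elsewhere, so which metric it passes is not visible here. The sketch below uses a dummy parameter and made-up validation scores to show when the learning rate drops with `factor=0.3` and `patience=5`.

```python
import torch

# Dummy parameter standing in for a model (assumption for illustration)
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.3, patience=5
)

# Made-up validation scores: one best epoch, then six epochs without improvement
val_scores = [0.60] + [0.59] * 6

lrs = []
for score in val_scores:
    # No real training step here; only the scheduler logic is exercised
    scheduler.step(score)
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)
```

The learning rate is reduced only once the number of epochs without improvement exceeds `patience`, so the drop appears on the sixth non-improving epoch. If a loss were passed instead of a score, `mode='min'` would be the matching setting.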

## Model-8: Self- and Cross-Attention Text+Speech+Mocap

In [37]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score
import time

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class MultiHeadCrossAttention(nn.Module):
    def __init__(self, query_dim, key_dim, num_heads, dropout=0.1):
        super(MultiHeadCrossAttention, self).__init__()
        self.num_heads = num_heads
        self.head_dim = query_dim // num_heads

        assert (
            self.head_dim * num_heads == query_dim
        ), "Query dimension must be divisible by number of heads"

        self.query = nn.Linear(query_dim, query_dim)
        self.key = nn.Linear(key_dim, query_dim)
        self.value = nn.Linear(key_dim, query_dim)
        self.fc_out = nn.Linear(query_dim, query_dim)
        self.layer_norm1 = nn.LayerNorm(query_dim)
        self.layer_norm2 = nn.LayerNorm(query_dim)
        self.dropout = nn.Dropout(dropout)
        self.feed_forward = FeedForward(query_dim, query_dim * 4, dropout)

    def forward(self, query, key, value):
        N, query_len, _ = query.shape
        _, key_len, _ = key.shape

        residual = query

        queries = self.query(query).view(N, query_len, self.num_heads, self.head_dim)
        keys = self.key(key).view(N, key_len, self.num_heads, self.head_dim)
        values = self.value(value).view(N, key_len, self.num_heads, self.head_dim)

        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        energy = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** (1 / 2))
        attention = torch.softmax(energy, dim=-1)

        out = torch.matmul(attention, values)
        out = out.transpose(1, 2).contiguous().view(N, query_len, -1)
        
        out = self.fc_out(out)
        out = self.dropout(out)
        out = self.layer_norm1(out + residual)

        ff_out = self.feed_forward(out)
        out = self.layer_norm2(ff_out + out)

        return out

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        assert (
            self.head_dim * num_heads == embed_dim
        ), "Embedding dimension must be divisible by number of heads"

        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.fc_out = nn.Linear(embed_dim, embed_dim)
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.feed_forward = FeedForward(embed_dim, embed_dim * 4, dropout)

    def forward(self, x):
        N, seq_length, embed_dim = x.shape

        residual = x

        queries = self.query(x).view(N, seq_length, self.num_heads, self.head_dim)
        keys = self.key(x).view(N, seq_length, self.num_heads, self.head_dim)
        values = self.value(x).view(N, seq_length, self.num_heads, self.head_dim)

        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        energy = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** (1 / 2))
        attention = torch.softmax(energy, dim=-1)

        out = torch.matmul(attention, values)
        out = out.transpose(1, 2).contiguous().view(N, seq_length, self.embed_dim)
        
        out = self.fc_out(out)
        out = self.dropout(out)
        out = self.layer_norm1(out + residual)

        ff_out = self.feed_forward(out)
        out = self.layer_norm2(ff_out + out)

        return out

class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(TextModel, self).__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.attention = MultiHeadSelfAttention(embed_dim=32, num_heads=num_heads)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn5 = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = x.permute(0, 2, 1)
        x = self.attention(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn5(self.dense(x))
        return x

class SpeechModel(nn.Module):
    def __init__(self, dropout_rate=0.1, num_heads=num_heads):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.attention = MultiHeadSelfAttention(embed_dim=1024, num_heads=num_heads)
        self.dense2 = nn.Linear(1024, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.dropout = nn.Dropout(dropout_rate)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.attention(x.unsqueeze(1)).squeeze(1)
        x = self.dropout(x)
        x = self.bn2(self.dense2(x))
        return x

class MocapModel(nn.Module):
    def __init__(self):
        super(MocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.bn5 = nn.BatchNorm2d(256)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(256 * 7 * 6, 256)
        self.bn8 = nn.BatchNorm1d(256)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn8(self.dense(x))
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads):
        super(CombinedModel, self).__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=num_heads)
        self.speech_model = SpeechModel(num_heads=num_heads)
        self.mocap_model = MocapModel()
        
        self.text_speech_attention = MultiHeadCrossAttention(256, 256, num_heads)
        self.text_mocap_attention = MultiHeadCrossAttention(256, 256, num_heads)
        
        self.final_attention = MultiHeadSelfAttention(256, num_heads)
        
        self.fc1 = nn.Linear(256, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 4)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, text, speech, mocap):
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        mocap_out = self.mocap_model(mocap)
        
        text_speech_attended = self.text_speech_attention(text_out.unsqueeze(1), speech_out.unsqueeze(1), speech_out.unsqueeze(1)).squeeze(1)
        text_mocap_attended = self.text_mocap_attention(text_out.unsqueeze(1), mocap_out.unsqueeze(1), mocap_out.unsqueeze(1)).squeeze(1)
        
        combined = torch.stack([text_speech_attended, text_mocap_attended, text_out], dim=1)
        
        combined = self.final_attention(combined)
        combined = combined.mean(dim=1)  # Average pooling across the sequence dimension
        
        combined = F.relu(self.bn1(self.fc1(combined)))
        combined = self.dropout(combined)
        combined = self.fc2(combined)
        return combined
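
The `256 * 7 * 6` input size of `MocapModel.dense` follows from applying five stride-2 convolutions to the `(200, 189, 1)` mocap input passed to `print_model_summary`. A minimal sketch of that arithmetic, assuming the standard convolution output-size formula with dilation 1 (the helper names `conv_out` and `mocap_feature_shape` are illustrative and do not appear in the notebook):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one spatial dimension after a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def mocap_feature_shape(height=200, width=189, num_layers=5):
    """Apply the same kernel/stride/padding num_layers times, as MocapModel does."""
    for _ in range(num_layers):
        height, width = conv_out(height), conv_out(width)
    return height, width


h, w = mocap_feature_shape()
# Spatial size after conv5, and the flattened feature count with 256 channels
print(h, w, 256 * h * w)  # 7 6 10752
```

If the mocap input resolution changes, the `nn.Linear(256 * 7 * 6, 256)` layer would likely need to be resized accordingly.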

In [38]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor before moving it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads)

# Print the model summary
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (attention): MultiHeadSelfAttention(
      (query): Linear(in_features=32, out_features=32, bias=True)
      (key): Linear(in_features=32, out_features=32, bias=True)
      (value): Linear(in_features=32, out_features=32, bias=True)
      (fc_out): Line

In [39]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Take a single batch for a quick forward-pass check
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs = model_combined(text, speech, mocap)

# Print the output shape
print(outputs.shape)  # one row of 4 class logits per sample

Using 2 GPUs!
torch.Size([128, 4])
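
The `[128, 4]` output holds one row of four raw logits per sample, which `CrossEntropyLoss` consumes directly in the next cell. As a rough sanity check on the loss values logged during training, the sketch below computes softmax and cross-entropy for a single row in plain Python. The logit values are made up for illustration, and the helpers are not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


# A model that knows nothing spreads probability evenly over the 4 classes
uniform_logits = [0.0, 0.0, 0.0, 0.0]
print(round(cross_entropy(uniform_logits, 2), 4))  # 1.3863, i.e. ln(4)

# Hypothetical logits favouring class 2
confident_logits = [0.1, 0.2, 4.0, -0.3]
probs = softmax(confident_logits)
predicted = probs.index(max(probs))
print(predicted, cross_entropy(confident_logits, 2))
```

A first-epoch loss a little below ln(4) ≈ 1.386 is therefore what one would expect from a 4-class model that has just started to learn.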


In [40]:
import torch.optim as optim
from torch.optim import lr_scheduler
# Note: class_weights is prepared above but not passed here, so the loss is unweighted
criterion = torch.nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.AdamW(model_combined.parameters(), lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: multiply the LR by 0.3 when the monitored metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.3, patience=5, verbose=True)

# Run training
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=80, device=device)

Epoch [1/80], time: 7.81s
Train - loss: 1.2983, F1: 0.3625, UW_Acc: 0.3890
Val - loss: 1.1388, F1: 0.5254, UW_Acc: 0.5262
Epoch [2/80], time: 7.82s
Train - loss: 0.9086, F1: 0.6355, UW_Acc: 0.6417
Val - loss: 1.0061, F1: 0.5722, UW_Acc: 0.5918
Epoch [3/80], time: 7.79s
Train - loss: 0.6949, F1: 0.7342, UW_Acc: 0.7333
Val - loss: 0.9920, F1: 0.5926, UW_Acc: 0.5980
Epoch [4/80], time: 7.85s
Train - loss: 0.5886, F1: 0.7775, UW_Acc: 0.7775
Val - loss: 1.1453, F1: 0.5475, UW_Acc: 0.5792
Epoch [5/80], time: 7.77s
Train - loss: 0.4979, F1: 0.8244, UW_Acc: 0.8238
Val - loss: 1.1919, F1: 0.5478, UW_Acc: 0.5818
Epoch [6/80], time: 7.78s
Train - loss: 0.4142, F1: 0.8528, UW_Acc: 0.8525
Val - loss: 0.9200, F1: 0.6418, UW_Acc: 0.6546
Epoch [7/80], time: 7.94s
Train - loss: 0.3698, F1: 0.8674, UW_Acc: 0.8673
Val - loss: 1.0046, F1: 0.6301, UW_Acc: 0.6394
Epoch [8/80], time: 7.80s
Train - loss: 0.3273, F1: 0.8885, UW_Acc: 0.8882
Val - loss: 1.0805, F1: 0.6114, UW_Acc: 0.6282
Epoch [9/80], time: 7.82

([1.2982983810957087,
  0.9085727431053339,
  0.6948946354001068,
  0.5885889959889788,
  0.4978982480459435,
  0.41418074036753455,
  0.36979189584421557,
  0.3272529743438543,
  0.2950351886277975,
  0.2637899926928587,
  0.24173882915530093,
  0.21217315886602844,
  0.21515137165091758,
  0.19557491464670315,
  0.1673129202668057,
  0.17216877258101174,
  0.14846423268318176,
  0.1423806565445523,
  0.1381496660584627,
  0.1040314812819625,
  0.08720562355809433,
  0.0827846571283285,
  0.07540996164776557,
  0.07339802925843139,
  0.06354606480792512,
  0.05589027659491051,
  0.056165609811974125,
  0.04997726118322029,
  0.04474440156373867,
  0.04593914929170941,
  0.04278192781778269,
  0.04471057116292244,
  0.04320821694509928,
  0.03655503837521686,
  0.037088304909682554,
  0.034897514553957204,
  0.033937419699721555,
  0.04049589660365221,
  0.034805150139470435,
  0.03326475206589283,
  0.03683922366174155,
  0.0346474964171648,
  0.03255828251239172,
  0.0379127673371586
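
The scheduler configured above (`mode='max'`, `factor=0.3`, `patience=5`) lowers the learning rate only after the monitored score has failed to improve for more than `patience` consecutive epochs. The following is a simplified plain-Python sketch of that rule, not the PyTorch implementation: it ignores the `threshold`, `cooldown`, and `min_lr` options, and the score sequence is hypothetical.

```python
def simulate_plateau(scores, lr=1e-4, factor=0.3, patience=5):
    """Return the learning rate in effect after each epoch ('max' mode, simplified)."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for score in scores:
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
        # Reduce only once the streak of non-improving epochs exceeds patience
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs


# Hypothetical validation scores: the best value (0.60) is not beaten for six epochs
scores = [0.50, 0.60, 0.59, 0.58, 0.59, 0.57, 0.58, 0.59, 0.61]
lrs = simulate_plateau(scores)
for epoch, (score, lr) in enumerate(zip(scores, lrs), start=1):
    print(f"epoch {epoch}: score={score:.2f}, lr={lr:.2e}")
```

In this toy run the reduction happens at epoch 8, the sixth epoch in a row without a new best score.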

## Model-9 Mixed Attention 3 modalities

In [41]:
# def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, device):
#     history = {
#         'train_loss': [], 'train_f1': [], 'train_uw_acc': [],
#         'val_loss': [], 'val_f1': [], 'val_uw_acc': [],
#         'train_weights': []
#     }
    
#     for epoch in range(num_epochs):
#         model.train()
#         epoch_results = {'train_loss': 0, 'train_f1': 0, 'train_uw_acc': 0, 'train_weights': torch.zeros(3).to(device)}
        
#         start_time = time.time()
        
#         for batch in train_loader:
#             text = batch['text'].to(device)
#             speech = batch['speech'].to(device)
#             mocap = batch['mocap'].to(device)
#             labels = batch['labels'].to(device).long()
            
#             optimizer.zero_grad()
#             outputs, weights = model(text, speech, mocap)
            
#             loss = criterion(outputs, labels)
            
#             loss.backward()
#             torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#             optimizer.step()
            
#             epoch_results['train_loss'] += loss.item()
#             f1, uw_acc = calculate_metrics(outputs, labels)
#             epoch_results['train_f1'] += f1
#             epoch_results['train_uw_acc'] += uw_acc
#             epoch_results['train_weights'] += weights.sum(dim=0)
        
#         for key in ['train_loss', 'train_f1', 'train_uw_acc']:
#             epoch_results[key] /= len(train_loader)
#         epoch_results['train_weights'] /= len(train_loader.dataset)
        
#         val_results = validate(model, val_loader, criterion, device)
        
#         train_time = time.time() - start_time
        
#         for key in epoch_results:
#             history[key].append(epoch_results[key])
#         for key in val_results:
#             history[key].append(val_results[key])
        
#         print_log(epoch, train_time, epoch_results, val_results, epochs=num_epochs)

#     return history

# def validate(model, val_loader, criterion, device):
#     model.eval()
#     results = {'val_loss': 0, 'val_f1': 0, 'val_uw_acc': 0}
    
#     with torch.no_grad():
#         for batch in val_loader:
#             text = batch['text'].to(device)
#             speech = batch['speech'].to(device)
#             mocap = batch['mocap'].to(device)
#             labels = batch['labels'].to(device).long()
            
#             outputs, _ = model(text, speech, mocap)  # the weights are not needed during validation
            
#             loss = criterion(outputs, labels)
            
#             results['val_loss'] += loss.item()
#             f1, uw_acc = calculate_metrics(outputs, labels)
#             results['val_f1'] += f1
#             results['val_uw_acc'] += uw_acc
    
#     for key in ['val_loss', 'val_f1', 'val_uw_acc']:
#         results[key] /= len(val_loader)
    
#     return results
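
Both the commented-out loop above and the active `train_model` call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before each optimizer step. A minimal plain-Python sketch of what global-norm clipping does is given below. The gradient values are hypothetical and the helper `clip_by_global_norm` is illustrative, not a PyTorch function.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients together so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scaling coefficient is capped at 1, so small gradients are left unchanged
    coef = min(1.0, max_norm / (total_norm + eps))
    return [[g * coef for g in vec] for vec in grads], total_norm


# Two hypothetical parameter tensors, flattened to lists
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before)  # 5.0
print(norm_after)   # just under 1.0
```

Because every gradient is multiplied by the same coefficient, the update direction is preserved and only its length shrinks.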

In [42]:
import time

def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs, device):
    history = {
        'train_loss': [], 'train_f1': [], 'train_uw_acc': [],
        'val_loss': [], 'val_f1': [], 'val_uw_acc': [],
        'train_weights': [], 'val_weights': []
    }
    
    for epoch in range(num_epochs):
        model.train()
        epoch_results = {'train_loss': 0, 'train_f1': 0, 'train_uw_acc': 0, 'train_weights': torch.zeros(3).to(device)}
        
        start_time = time.time()
        
        for batch in train_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            optimizer.zero_grad()
            outputs, weights = model(text, speech, mocap)
            
            loss = criterion(outputs, labels)
            
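            # Diversity term: minimizing the negative per-sample std of the
            # modality weights pushes them away from a uniform split
            diversity_loss = -torch.std(weights, dim=-1).mean()
            loss += 0.1 * diversity_loss
            
            # L2 penalty on the modality weights (summed over the batch)
            l2_reg = 0.001 * weights.pow(2).sum()
            loss += l2_reg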
            
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            
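            epoch_results['train_loss'] += loss.item()  # includes both regularization terms
            f1, uw_acc = calculate_metrics(outputs, labels)
            epoch_results['train_f1'] += f1
            epoch_results['train_uw_acc'] += uw_acc
            # Detach so the running sum does not keep the autograd graph alive
            epoch_results['train_weights'] += weights.detach().sum(dim=0)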
        
        for key in ['train_loss', 'train_f1', 'train_uw_acc']:
            epoch_results[key] /= len(train_loader)
        epoch_results['train_weights'] /= len(train_loader.dataset)
        
        val_results = validate(model, val_loader, criterion, device)
        
        train_time = time.time() - start_time
        
        for key in epoch_results:
            history[key].append(epoch_results[key])
        for key in val_results:
            history[key].append(val_results[key])
        
        print_log(epoch, train_time, epoch_results, val_results, epochs=num_epochs)
        
        scheduler.step(val_results['val_f1'])

    return history

def validate(model, val_loader, criterion, device):
    model.eval()
    results = {'val_loss': 0, 'val_f1': 0, 'val_uw_acc': 0, 'val_weights': torch.zeros(3).to(device)}
    
    with torch.no_grad():
        for batch in val_loader:
            text = batch['text'].to(device)
            speech = batch['speech'].to(device)
            mocap = batch['mocap'].to(device)
            labels = batch['labels'].to(device).long()
            
            outputs, weights = model(text, speech, mocap)
            
            loss = criterion(outputs, labels)
            
            results['val_loss'] += loss.item()
            f1, uw_acc = calculate_metrics(outputs, labels)
            results['val_f1'] += f1
            results['val_uw_acc'] += uw_acc
            results['val_weights'] += weights.sum(dim=0)
    
    for key in ['val_loss', 'val_f1', 'val_uw_acc']:
        results[key] /= len(val_loader)
    results['val_weights'] /= len(val_loader.dataset)
    
    return results

def print_log(epoch, train_time, train_results, val_results, epochs=100):
    print(f"Epoch [{epoch+1}/{epochs}], time: {train_time:.2f}s")
    print(f"Train - loss: {train_results['train_loss']:.4f}, F1: {train_results['train_f1']:.4f}, UW_Acc: {train_results['train_uw_acc']:.4f}")
    print(f"Train Weights - Text: {train_results['train_weights'][0]:.4f}, Speech: {train_results['train_weights'][1]:.4f}, Mocap: {train_results['train_weights'][2]:.4f}")
    print(f"Val - loss: {val_results['val_loss']:.4f}, F1: {val_results['val_f1']:.4f}, UW_Acc: {val_results['val_uw_acc']:.4f}")
    print(f"Val Weights - Text: {val_results['val_weights'][0]:.4f}, Speech: {val_results['val_weights'][1]:.4f}, Mocap: {val_results['val_weights'][2]:.4f}")
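
The two regularizers added inside `train_model` act on the per-sample modality weights in opposite directions: the "diversity" term rewards a large spread across the three weights, while the L2 term is smallest when the weights are equal. The sketch below evaluates both terms on two hypothetical weight rows in plain Python. `statistics.stdev` uses the same n - 1 denominator as the default `torch.std`, and the function names are illustrative.

```python
import statistics


def diversity_term(weight_rows):
    """Negative mean of the per-sample standard deviation of the weights."""
    return -statistics.mean([statistics.stdev(row) for row in weight_rows])


def l2_term(weight_rows):
    """Sum of squared weights over all samples."""
    return sum(w * w for row in weight_rows for w in row)


uniform = [[1 / 3, 1 / 3, 1 / 3]]  # all modalities weighted equally
peaked = [[0.9, 0.05, 0.05]]       # one modality dominates

for name, rows in [("uniform", uniform), ("peaked", peaked)]:
    total = 0.1 * diversity_term(rows) + 0.001 * l2_term(rows)
    print(name, diversity_term(rows), l2_term(rows), total)
```

With the coefficients used in the notebook (0.1 and 0.001), the diversity term dominates for a single sample, so the combined penalty favours peaked weights. Since the L2 term is summed rather than averaged, its relative strength grows with the batch size.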

In [43]:
# def print_log(epoch, train_time, train_results, val_results, epochs=100):
#     print(f"Epoch [{epoch+1}/{epochs}], time: {train_time:.2f}s")
#     print(f"Train - loss: {train_results['train_loss']:.4f}, F1: {train_results['train_f1']:.4f}, UW_Acc: {train_results['train_uw_acc']:.4f}")
#     print(f"Train Weights - Text: {train_results['train_weights'][0]:.4f}, Speech: {train_results['train_weights'][1]:.4f}, Mocap: {train_results['train_weights'][2]:.4f}")
#     print(f"Val - loss: {val_results['val_loss']:.4f}, F1: {val_results['val_f1']:.4f}, UW_Acc: {val_results['val_uw_acc']:.4f}")

In [44]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.att_drop = nn.Dropout(dropout)
        self.projection = nn.Linear(d_model, d_model)
        
    def forward(self, x, mask=None):
        qkv = self.qkv(x).reshape(x.size(0), x.size(1), 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        
        dots = torch.matmul(q, k.transpose(-1, -2)) / (self.head_dim ** 0.5)
        if mask is not None:
            dots = dots.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(dots, dim=-1)
        attn = self.att_drop(attn)
        
        out = torch.matmul(attn, v).transpose(1, 2).reshape(x.size(0), x.size(1), self.d_model)
        out = self.projection(out)
        return out
    
class ModalityEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.norm = nn.LayerNorm(output_dim)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return self.norm(x)

class DynamicQuerySelection(nn.Module):
    def __init__(self, modal_dim, num_modalities):
        super().__init__()
        self.modal_encoders = nn.ModuleList([
            ModalityEncoder(modal_dim, 128, modal_dim) for _ in range(num_modalities)
        ])
        self.selector = nn.Sequential(
            nn.Linear(modal_dim * num_modalities, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_modalities)
        )
        self.temperature = nn.Parameter(torch.ones(1) * 0.5)
        
    def forward(self, modalities):
        encoded = [encoder(modality) for encoder, modality in zip(self.modal_encoders, modalities)]
        combined = torch.cat(encoded, dim=-1)
        logits = self.selector(combined)
        weights = F.softmax(logits / self.temperature, dim=-1)
        query = sum(w.unsqueeze(1) * m for w, m in zip(weights.unbind(dim=-1), encoded))
        return query, weights

class MixedAttention(nn.Module):
    def __init__(self, d_model, num_modalities, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_modalities = num_modalities
        self.multihead_attn = nn.MultiheadAttention(d_model * num_modalities, num_heads)
        
    def forward(self, modalities):
        # Concatenate all modalities
        combined = torch.cat(modalities, dim=-1)  # Shape: [batch_size, d_model * num_modalities]
        combined = combined.unsqueeze(0)  # Add sequence dimension: [1, batch_size, d_model * num_modalities]
        
        # The sequence length is 1, so each sample attends only to itself and the
        # attention weights are trivially 1; the layer acts as a learned projection
        attn_output, _ = self.multihead_attn(combined, combined, combined)
        return attn_output.squeeze(0)  # Remove sequence dimension: [batch_size, d_model * num_modalities]


class TextModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(nb_words, embedding_dim)
        self.embedding.weight.data.copy_(g_word_embedding_matrix.clone().detach())
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=256, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(256)
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(128)
        self.conv3 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(64)
        self.conv4 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(32)
        self.attention = MultiHeadAttention(32, num_heads)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(32 * max_sequence_length, 256)
        self.bn5 = nn.BatchNorm1d(256)
    
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = x.permute(0, 2, 1)
        x = self.attention(x)
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn5(self.dense(x))
        return x

class SpeechModel(nn.Module):
    def __init__(self, dropout_rate=0.1, num_heads=8):
        super(SpeechModel, self).__init__()
        self.flatten = nn.Flatten()
        self.dense1 = nn.Linear(100 * 34, 1024)
        self.bn1 = nn.BatchNorm1d(1024)
        self.attention = MultiHeadAttention(1024, num_heads)
        self.dense2 = nn.Linear(1024, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.dropout = nn.Dropout(dropout_rate)
    
    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.bn1(self.dense1(x)))
        x = self.attention(x.unsqueeze(1)).squeeze(1)
        x = self.dropout(x)
        x = self.bn2(self.dense2(x))
        return x

class MocapModel(nn.Module):
    def __init__(self):
        super(MocapModel, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(128)
        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)
        self.bn5 = nn.BatchNorm2d(256)
        self.dropout = nn.Dropout(0.1)
        self.flatten = nn.Flatten()
        self.dense = nn.Linear(256 * 7 * 6, 256)
        self.bn8 = nn.BatchNorm1d(256)

    def forward(self, x):
        x = x.permute(0, 3, 1, 2)
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.dropout(x)
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.dropout(x)
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.dropout(x)
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.dropout(x)
        x = self.flatten(x)
        x = self.bn8(self.dense(x))
        return x

class CombinedModel(nn.Module):
    def __init__(self, nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads=8):
        super().__init__()
        self.text_model = TextModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads)
        self.speech_model = SpeechModel(num_heads=num_heads)
        self.mocap_model = MocapModel()
        
        self.dynamic_query = DynamicQuerySelection(256, 3)  # 3 modalities, 256 is the output dim of each model
        self.mixed_attention = MixedAttention(256, 3, num_heads)
        
        self.fc1 = nn.Linear(256 * 4, 128)  # 256 * 3 mixed-attention features + 256 dynamic-query features
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 4)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, text, speech, mocap):
        text_out = self.text_model(text)
        speech_out = self.speech_model(speech)
        mocap_out = self.mocap_model(mocap)
        
        query, weights = self.dynamic_query([text_out, speech_out, mocap_out])
        combined = self.mixed_attention([text_out, speech_out, mocap_out])
        
        x = torch.cat([combined, query], dim=-1)
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x, weights
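
`DynamicQuerySelection` turns the selector logits into modality weights with a temperature-scaled softmax and then forms the query as the weighted sum of the encoded modalities. A small plain-Python sketch of those two steps follows. The logits and feature vectors are invented for illustration, and the helper names are not from the notebook.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; lower temperatures give sharper weights."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_query(weights, encoded):
    """Weighted sum of per-modality feature vectors for one sample."""
    dim = len(encoded[0])
    return [sum(w * vec[i] for w, vec in zip(weights, encoded)) for i in range(dim)]


logits = [1.0, 0.5, 0.0]  # hypothetical selector outputs (text, speech, mocap)
soft = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)  # 0.5 is the initial value of self.temperature
print(soft)
print(sharp)

# Hypothetical 2-dimensional encodings standing in for the 256-dimensional ones
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(weighted_query(sharp, encoded))
```

Because `self.temperature` is a learnable `nn.Parameter`, its value can drift during training; constraining it to stay positive may be worth considering if the weights become unstable.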

In [45]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Make sure class_weights is a tensor before moving it to the device
if not isinstance(class_weights, torch.Tensor):
    class_weights = torch.FloatTensor(class_weights)

class_weights = class_weights.to(device)

# Instantiate the model
model_combined = CombinedModel(nb_words, embedding_dim, max_sequence_length, g_word_embedding_matrix, num_heads)

# Print the model summary
print(model_combined)

print("Combined Model Summary:")
print_model_summary(model_combined, [(max_sequence_length,), (100, 34), (200, 189, 1)])

CombinedModel(
  (text_model): TextModel(
    (embedding): Embedding(3130, 300)
    (conv1): Conv1d(300, 256, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv1d(256, 128, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv1d(128, 64, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv4): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (bn4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (attention): MultiHeadAttention(
      (qkv): Linear(in_features=32, out_features=96, bias=True)
      (att_drop): Dropout(p=0.1, inplace=False)
      (projection): Linear(in_features=32, out_features=32, bias=True)
    )
    (dropout): Dropout(p=0.1, i

In [46]:
# Wrap the model in DataParallel when more than one GPU is available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model_combined = nn.DataParallel(model_combined)

# Move the model to the device
model_combined.to(device)

# Take a single batch for a quick forward-pass check
batch = next(iter(train_loader))
text = batch['text'].to(device).long()
speech = batch['speech'].to(device)
mocap = batch['mocap'].to(device)

# Forward pass
outputs, weights = model_combined(text, speech, mocap)

# Print the output shapes
print("Output shape:", outputs.shape)  # class logits, one row per sample
print("Weights shape:", weights.shape)  # modality weights, one row per sample

Using 2 GPUs!
Output shape: torch.Size([128, 4])
Weights shape: torch.Size([128, 3])
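
The `[128, 3]` weights tensor contains one softmax row (text, speech, mocap) per sample. `train_model` and `validate` sum these rows over every batch and divide by `len(loader.dataset)`, so the logged "Weights" values are per-sample averages that should add up to roughly 1. A plain-Python sketch of that bookkeeping, using invented weight rows and an illustrative helper name:

```python
def mean_modality_weights(batches):
    """Sum the weight rows batch by batch, then divide by the number of samples."""
    totals = [0.0, 0.0, 0.0]
    num_samples = 0
    for batch in batches:
        for row in batch:
            totals = [t + w for t, w in zip(totals, row)]
        num_samples += len(batch)
    return [t / num_samples for t in totals]


# Two hypothetical batches of unequal size; every row sums to 1
batches = [
    [[0.2, 0.3, 0.5], [0.1, 0.4, 0.5]],
    [[0.3, 0.3, 0.4]],
]
means = mean_modality_weights(batches)
print(means, sum(means))
```

Note that the loss and metric totals are divided by the number of batches instead. If a loader were configured to drop its last incomplete batch, dividing the weight sums by the full dataset size would slightly underestimate the averages.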


In [47]:
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Note: class_weights is prepared above but not passed here, so the loss is unweighted
criterion = torch.nn.CrossEntropyLoss()

# Optimizer: unwrap DataParallel once so both branches share the same code
base_model = model_combined.module if isinstance(model_combined, nn.DataParallel) else model_combined
optimizer = AdamW([
    {'params': [p for n, p in base_model.named_parameters() if 'dynamic_query' not in n]},
    # The dynamic query selector trains with a 10x larger learning rate
    {'params': base_model.dynamic_query.parameters(), 'lr': 0.001}
], lr=0.0001, weight_decay=0.001)

# Learning-rate scheduler: halve the LR once validation F1 has stopped improving for more than 5 epochs
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5, verbose=True)

# Run training
train_model(model_combined, train_loader, test_loader, criterion, optimizer, scheduler, num_epochs=100, device=device)

Epoch [1/100], time: 9.70s
Train - loss: 1.1955, F1: 0.4684, UW_Acc: 0.4731
Train Weights - Text: 0.3356, Speech: 0.2264, Mocap: 0.4380
Val - loss: 1.2933, F1: 0.3855, UW_Acc: 0.4520
Val Weights - Text: 0.1360, Speech: 0.2895, Mocap: 0.5746
Epoch [2/100], time: 9.42s
Train - loss: 0.8886, F1: 0.6452, UW_Acc: 0.6484
Train Weights - Text: 0.1572, Speech: 0.3463, Mocap: 0.4965
Val - loss: 1.0031, F1: 0.5603, UW_Acc: 0.5816
Val Weights - Text: 0.1072, Speech: 0.3224, Mocap: 0.5704
Epoch [3/100], time: 9.34s
Train - loss: 0.7579, F1: 0.7127, UW_Acc: 0.7132
Train Weights - Text: 0.1344, Speech: 0.4022, Mocap: 0.4635
Val - loss: 0.9595, F1: 0.5933, UW_Acc: 0.6128
Val Weights - Text: 0.1177, Speech: 0.3731, Mocap: 0.5092
Epoch [4/100], time: 9.41s
Train - loss: 0.6733, F1: 0.7492, UW_Acc: 0.7493
Train Weights - Text: 0.1399, Speech: 0.4079, Mocap: 0.4522
Val - loss: 1.2069, F1: 0.5070, UW_Acc: 0.5523
Val Weights - Text: 0.0987, Speech: 0.4834, Mocap: 0.4179
Epoch [5/100], time: 9.41s
Train - l

{'train_loss': [1.1955225731051244,
  0.8886402512705603,
  0.7579318215680677,
  0.673343769339628,
  0.5863422231618748,
  0.5249941640121992,
  0.47843838015268014,
  0.41670882563258327,
  0.3778486071630966,
  0.3495812104191891,
  0.3122910687396693,
  0.2863878105268922,
  0.2693114758923996,
  0.2245076620994612,
  0.1889611766781918,
  0.17359101113884948,
  0.16333338393028393,
  0.15777981177318928,
  0.15733104060555614,
  0.14182114826385364,
  0.13777344448621884,
  0.1302212157914805,
  0.12430706672197164,
  0.12848506521346958,
  0.10527856613314429,
  0.09076940649470618,
  0.09024466210326483,
  0.0880931698132393,
  0.08740681617758995,
  0.08998939074402632,
  0.07632826615211576,
  0.07485216162925543,
  0.0698351708435735,
  0.06950563382963802,
  0.06614254692266154,
  0.06785599098995675,
  0.06325140023647352,
  0.05924138790646265,
  0.059794145441332526,
  0.06343710223256155,
  0.06321996502404989,
  0.06084943701361501,
  0.05721440244206162,
  0.053353863