Emotion Recognition
=====



- 사람의 음성/시각정보등을 통해서 감정을 인식하는 과정

- 대표적인 모델은 CNN based

In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
import torch
import torchvision
import torch.nn.functional as F
import torch.nn as nn
import torchvision.transforms as transforms

from sklearn.metrics import classification_report
from sklearn.model_selection import KFold
from typing import Union
from torch.nn.modules.pooling import AdaptiveAvgPool2d
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader, RandomSampler

In [2]:
# Fix the seed
def seed_set(s : int = 42) -> None:
  torch.manual_seed(s)
  np.random.seed(s)
  random.seed(s)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False

random_seed = 41
seed_set(random_seed)

# CUDA device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
print("halo")

## Dataset

### RAVDESS 

[Download](https://zenodo.org/records/1188976)

각 오디오 파일마다 의미에 따른 고유의 번호로 구성되어져 있음 (예ㅣ03-01-02-01-01-01-24)

- 양식 (01 = full-AV, 02 = video-only, 03 = audio-only) - 주어진 데이터셋의 경우 **03**에 해당
    - 보컬 채널 (01 = Speech, 02 = Song) - 주어진 데이터셋의 경우 **01**에 해당


- 감정표현 (01 = Neutral, 02 = Calm, 03 = Happy,  04 = Sad, 05 = Angry, 06 = Fearful, 07 = Disgust, 08 = Surprised)
    - 총 1440개의 데이터 샘플 존재 (192 * 7 + 96 * 1 = 1440)
 


- 감정의 강함정도 (01 = Normal, 02 = Strong) - Neutral을 제외한 각 감정표현마다 존재

- 문장 (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”)

 

- 반복 (01 = “1st”, 02 = “2nd”) 

 

- 배우(01 ~ 24) - 여성배우는 짝수, 남성배우는 홀수로 구분

In [3]:
ravdess_path = "/home/mllab/jhson/dann_ser/RAVDESS"
generated_ravdess_path = "/home/mllab/dchung/dataset/ravdess-hifi"

In [4]:
emotions = {'01' : 'Neutral', '02' : 'Calm', '03' : 'Happy',  '04' : 'Sad',
              '05' : 'Angry', '06' : 'Fearful', '07' : 'Disgust', '08' : 'Surprised'}

level = {'01' : 'Normal', '02' : 'Strong'}

statement = {'01' : 'Kids are talking by the door' , '02' : 'Dogs are sitting by the door'}

invert_emotions = dict(zip(emotions.values(), emotions.keys()))

def gender_classifier(x : int) -> bool:
  x = int(x)
  if (x % 2 == 0):
    return "Female"
  return "Male"

In [5]:
# rav_load() returns pandas dataframe processing the given directory
def rav_load(dir : str) -> pd.DataFrame:
    emotion = []
    intensity = []
    path = []
    sentence = []
    id = []

    for dirname, _, filenames in os.walk(dir):
      for filename in filenames:
        if ".npy" in filename:
            continue
        path.append(os.path.join(dirname, filename))
        Sequence = filename.split('-')
        emotion.append(Sequence[2])
        intensity.append(Sequence[3])
        sentence.append(Sequence[4])
        id.append(Sequence[6].split('.')[0])

    df = pd.DataFrame({
        'ID' : id,
        'Emotion' : emotion,
        'Intensity' : intensity,
        'Statement' : sentence,
        'Path' : path
        })

    df['Emotion'] = df['Emotion'].map(emotions)
    df['Intensity'] = df['Intensity'].map(level)
    df['Statement'] = df['Statement'].map(statement)
    df['Gender'] = df['ID'].apply(lambda x : gender_classifier(x))
    return df

df = rav_load(ravdess_path)
generated_df = rav_load(generated_ravdess_path)

In [6]:
df.head()

Unnamed: 0,ID,Emotion,Intensity,Statement,Path,Gender
0,22,Calm,Normal,Kids are talking by the door,/home/mllab/jhson/dann_ser/RAVDESS/Actor_22/03...,Female
1,22,Happy,Normal,Kids are talking by the door,/home/mllab/jhson/dann_ser/RAVDESS/Actor_22/03...,Female
2,22,Surprised,Strong,Dogs are sitting by the door,/home/mllab/jhson/dann_ser/RAVDESS/Actor_22/03...,Female
3,22,Sad,Strong,Kids are talking by the door,/home/mllab/jhson/dann_ser/RAVDESS/Actor_22/03...,Female
4,22,Surprised,Strong,Kids are talking by the door,/home/mllab/jhson/dann_ser/RAVDESS/Actor_22/03...,Female


In [7]:
generated_df.head()

Unnamed: 0,ID,Emotion,Intensity,Statement,Path,Gender
0,22,Calm,Normal,Kids are talking by the door,/home/mllab/dchung/dataset/ravdess-hifi/Actor_...,Female
1,22,Happy,Normal,Kids are talking by the door,/home/mllab/dchung/dataset/ravdess-hifi/Actor_...,Female
2,22,Surprised,Strong,Dogs are sitting by the door,/home/mllab/dchung/dataset/ravdess-hifi/Actor_...,Female
3,22,Sad,Strong,Kids are talking by the door,/home/mllab/dchung/dataset/ravdess-hifi/Actor_...,Female
4,22,Surprised,Strong,Kids are talking by the door,/home/mllab/dchung/dataset/ravdess-hifi/Actor_...,Female


### TESS

In [8]:
# tor_load(str) returns pandas dataframe processing the given directory
def tor_load(dir : str) -> pd.DataFrame:
    emotion = []
    path = []

    emotions_To = {'neutral' : 1, 'happy' : 2, 'sad': 3, 'angry' : 4, 'fear' : 5, 'disgust' : 6, 'ps' : 7}

    for dirname, _, filenames in os.walk(dir):
      for filename in filenames:
        path.append(os.path.join(dirname, filename))
        Sequence = filename.split('_')
        Sequence = Sequence[2].split('.')
        emotion.append(Sequence[0])



    df = pd.DataFrame({'Emotion' : emotion, 'Path' : path})
    df['Emotion'] = df['Emotion'].map(emotions_To)
    
    return df

tor_df = tor_load("/home/mllab/dchung/dataset/TESS")

In [9]:
tor_df.head()

Unnamed: 0,Emotion,Path
0,1,/home/mllab/dchung/dataset/TESS/YAF_neutral/YA...
1,1,/home/mllab/dchung/dataset/TESS/YAF_neutral/YA...
2,1,/home/mllab/dchung/dataset/TESS/YAF_neutral/YA...
3,1,/home/mllab/dchung/dataset/TESS/YAF_neutral/YA...
4,1,/home/mllab/dchung/dataset/TESS/YAF_neutral/YA...


### AudioUtils

**특징**
- Spectrogram관련 처리 라이브러리는 librosa 사용

**If applicable**

- [0 - 1] 사이로 Clipping

- (128, 420)으로 제로 패딩 처리 

- (S : original log mel, delta S, delta of delta of S)로 3채널 구성 : [링크](https://wiki.aalto.fi/display/ITSP/Deltas+and+Delta-deltas)


In [8]:
# normalize_mel() normalizes a tensor of log mel-spectrogram
def normalize_mel(S : torch.Tensor) -> torch.Tensor:
    min_level_db= -80
    return torch.clip((S-min_level_db)/-min_level_db, 0, 1)

In [9]:
# mel_spec() returns a log mel-spectrogram
def mel_spec(path : str, padding : bool = False) -> Union[np.ndarray, torch.Tensor]:
  y, sr = librosa.load(path, sr = 48000)
  S = librosa.feature.melspectrogram(y = y, n_fft = 2048, hop_length = 480, sr = sr, win_length = 1920)
  logS = librosa.power_to_db(S, ref=np.max)

  if (padding):
     logS = torch.from_numpy(logS)
     logS = normalize_mel(logS)
     desired_shape = (128, 420)
     padding_size = desired_shape[1] - logS.shape[1]
     logS = torch.nn.functional.pad(logS, (0, padding_size))
     return logS.unsqueeze(0)
     
  return  logS

In [10]:
# multi_mel_spec() returns a log-mel spectrogram composed of three channels
def multi_mel_spec(path : str) -> torch.Tensor:
  y, sr = librosa.load(path, sr = 48000)
  S = librosa.feature.melspectrogram(y = y, n_fft = 2048, hop_length = 480, sr = sr, win_length = 1920)

  # 3-Log-Mel
  logS = librosa.power_to_db(S, ref=np.max)
  logS_delta = librosa.feature.delta(logS)
  logS_ddelta = librosa.feature.delta(logS_delta)

  # Clipping
  logS = torch.from_numpy(logS)
  logS_delta = torch.from_numpy(logS_delta)
  logS_ddelta = torch.from_numpy(logS_ddelta)
  logS, logS_delta, logS_ddelta = normalize_mel(logS), normalize_mel(logS_delta), normalize_mel(logS_ddelta)

  # Zero-padding
  desired_shape = (128, 420)
  padding_size = desired_shape[1] - logS.shape[1]
  logS = torch.nn.functional.pad(logS, (0, padding_size))
  logS_delta = torch.nn.functional.pad(logS_delta, (0, padding_size),value=1)
  logS_ddelta = torch.nn.functional.pad(logS_ddelta, (0, padding_size),value=1)

  return  torch.stack([logS, logS_delta, logS_ddelta], dim=0)

In [11]:
# draw_mel() draws a spectrogram
def draw_mel(S : Union[np.ndarray, torch.Tensor]) -> None:
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(np.float32(logS), x_axis='time', y_axis='mel', sr = sr)
    plt.colorbar(format='%+2.0f dB')
    plt.title('Mel-Spectrogram')
    plt.show()

In [12]:
# df_transform() converts data into a log mel-spectrogram
def df_transform(df : pd.DataFrame, multi_channel : bool = False, random_seed : int = random_seed) -> pd.DataFrame:
    if (multi_channel): df['Path'] = df['Path'].apply(lambda x: multi_mel_spec(x))
    else: df['Path'] = df['Path'].apply(lambda x: mel_spec(x))

    df['Emotion'] = df['Emotion'].apply(lambda x: int(invert_emotions[x]) - 1)
    df['ID'] = df['ID'].apply(lambda x: int(x))
    df = df.sample(frac=1, random_state = random_seed)
    return df

mel = df_transform(df)
generated_mel = df_transform(generated_df)
concat_mel = pd.concat([mel, generated_mel])

**Transform**

1. 기존 데이터를 PIL Image로 변환
2. (128, 256)으로 Resize를 진행
3. 다시 Tensor로 변환

제로패딩할시 사용하지않음

In [13]:
# Doesn't need to use if zero-padded
data_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(size=(128, 256)),
    transforms.ToTensor()
])

In [14]:
class RAVDESS_Dataset(Dataset):
  def __init__(self, x : pd.Series, y : pd.Series, transform):
    self.emotion = y.to_list()
    self.path = x.to_list()
    self.transform = transform
    
  def __len__(self):
    return len(self.emotion)

  def __getitem__(self, idx):
    spec = self.transform(self.path[idx])
    y = self.emotion[idx]
    return spec, y

In [15]:
# prepare_dataloader() returns torch train/test(valid) DataLoaders
def prepare_dataloader(train : pd.DataFrame, valid : pd.DataFrame, batch_size : int, seed : int = random_seed):

  train_dataset = RAVDESS_Dataset(train['Path'], train['Emotion'], data_transform)
  test_dataset = RAVDESS_Dataset(valid['Path'], valid['Emotion'], data_transform)

  train_sampler = RandomSampler(train_dataset, generator=torch.Generator().manual_seed(seed))
  test_sampler = RandomSampler(test_dataset, generator=torch.Generator().manual_seed(seed))

  train_dataloader = DataLoader(train_dataset, batch_size = batch_size, sampler = train_sampler)
  test_dataloader = DataLoader(test_dataset, batch_size = batch_size, sampler = test_sampler)

  return train_dataloader, test_dataloader

## Model

### Attention

##### U-vector attention: [Attention Based Fully Convolutional Network forSpeech Emotion Recognition](https://arxiv.org/abs/1806.01506)

**특징**

![Alt text](image.png)
- Lambda값은 0-1의 값을 가질수있음. 값 *0.3* 사용

- Xavier Uniform Initialization을 사용하여 임의에 U vecotr를 생성
    - Xavier Normal Initialization
    - Zero Initialization
    등도 사용가능

In [18]:
class Attention(nn.Module):
  def __init__(self, num_features : int):
    super().__init__()
    self.dense = nn.Linear(in_features = num_features, out_features = num_features)
    self.lamb_val = 0.3
    self.u = nn.init.xavier_uniform_(nn.Parameter(torch.randn(1, num_features), requires_grad=True))
    self.tanh = nn.Tanh()
    
  def forward(self, x : torch.Tensor) -> torch.Tensor:
    batch_size, channel_size, variable_length = x.size()
    x = torch.transpose(x, -1, -2)
    e = self.tanh(self.dense(x))
    e = self.u @ e.transpose(-1,-2) * self.lamb_val
    a = torch.softmax(e, dim=-1)
    out = a @ x
    out = out.transpose(-1, -2)
    output = torch.sum(out, dim=2)
    return output

### Scratch

##### CNN14 : [PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)

**특징**
- VGG-ish Network

- Max Pooling 대신 Average Pooling 사용

- 타 오디오셋에서 먼저 트레이닝후 Fine-tune


**논문결과 (RAVDESS)**

|          | Scratch | Fine-tune |
| -------- | ------- | --------- |
| Accuracy | 69.2%   | 72.1%     |
- Speaker간의 Independency를 지켜서 실험


- Global Average Pooling을 대신하여 Attention Layer를 적용가능

In [49]:
class ConvBlock(nn.Module):
    def __init__(self, in_channels : int, out_channels : int):

        super(ConvBlock, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                              kernel_size=(3, 3), stride=(1, 1),
                              padding=(1, 1), bias=False)

        self.conv2 = nn.Conv2d(in_channels=out_channels, out_channels=out_channels,
                              kernel_size=(3, 3), stride=(1, 1),
                              padding=(1, 1), bias=False)

        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()
        self.avgpool = nn.AvgPool2d((2,2))
        self.dropout = nn.Dropout(0.1)

    def forward(self, x : torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.avgpool(x)
        x = self.dropout(x)
        return x

class CNN14(nn.Module):
  def __init__(self, uattn : bool):
    super().__init__()
    self.conv_layer1 = ConvBlock(in_channels = 1, out_channels = 64)
    self.conv_layer2 = ConvBlock(64, 128)
    self.conv_layer3 = ConvBlock(128, 256)
    self.conv_layer4 = ConvBlock(256, 512)
    self.conv_layer5 = ConvBlock(512, 1024)
    self.conv_layer6 = ConvBlock(1024, 2048)
    self.glob_pool = nn.AdaptiveAvgPool2d((1,1))
    self.fc_layer = nn.Sequential(
                  nn.Linear(2048, 512),
                  nn.ReLU()
    )
    
    self.classifier = nn.Linear(512, 8)
    self.flatten = nn.Flatten(start_dim=1)
    self.attention = None
    if (uattn):
        self.attention = Attention(2048)
        self.flatten = nn.Flatten(start_dim=2)

  def forward(self, x : torch.Tensor) -> torch.Tensor:
    x = self.conv_layer1(x)
    x = self.conv_layer2(x)
    x = self.conv_layer3(x)
    x = self.conv_layer4(x)
    x = self.conv_layer5(x)
    x = self.conv_layer6(x)
    if (self.attention is None):
        x = self.glob_pool(x)
        x = self.flatten(x)
    
    else:
        x = self.flatten(x)
        x = self.attention(x)
        
    x = self.fc_layer(x)
    x = self.classifier(x)
    return x

model = CNN14(False)

##### CNNX : [Shallow over Deep Neural Networks: A Empirical Analysis for Human Emotion Classification Using Audio Data](https://link.springer.com/chapter/10.1007/978-3-030-76736-5_13)

**구조**
|          Layer           | Depth |
| :----------------------: | :---: |
| 2D Conv (ReLU + Avgpool) |   8   |
| 2D Conv (ReLU + Avgpool) |  16   |
| 2D Conv (ReLU + Avgpool) |  32   |
|         13440 FC         |   |
|     2048 FC ||
|            2048 FC            |       |

- Shallow Neural Network로 단순하게 구성
- Max Pooling 대신 Average Pooling을 사용

**논문결과**

|          | Scratch |
| -------- | ------- | 
| Accuracy | 82.99%  |

- 평가 기준이 명확하지 않음

In [16]:
class CNNX(nn.Module):
  def __init__(self):
    super().__init__()
    self.conv1 = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=8, kernel_size=(3, 3)),
        nn.ReLU()
    )

    self.conv2 = nn.Sequential(
        nn.Conv2d(8, 16, (3, 3)),
        nn.ReLU()
    )

    self.conv3 = nn.Sequential(
        nn.Conv2d(16, 32, (3, 3)),
        nn.ReLU()
    )

    self.fc_layer1 = nn.Sequential(
        nn.Linear(in_features = 13440, out_features = 2048),
        nn.ReLU()
    )

    self.fc_layer2 = nn.Sequential(
        nn.Linear(2048, 2048),
        nn.ReLU()
    )
    self.fc_layer3 = nn.Linear(2048, 8)
    self.avg_pool = nn.AvgPool2d((2, 2))
    self.flatten = nn.Flatten(start_dim=1)
    
  def forward(self, x : torch.Tensor) -> torch.Tensor:
    x = self.avg_pool(self.conv1(x))
    x = self.avg_pool(self.conv2(x))
    x = self.avg_pool(self.conv3(x))
    x = self.flatten(x)
    x = self.fc_layer1(x)
    x = self.fc_layer2(x)
    x = self.fc_layer3(x)
    return x

##### 1DCNNLSTM : [A Hybrid CNN–LSTM Network for the Classification of Human Activities Based on Micro-Doppler Radar](https://ieeexplore.ieee.org/document/8978926)


**구조**

|          Layer           | Width |
| :----------------------: | :---: |
| 1D Conv (ReLU + Maxpool) |  64   |
| 1D Conv (ReLU + Maxpool) |  128  |
| 1D Conv (ReLU + Maxpool) |  256  |
|           LSTM           |  256  |
|            256 FC            |   |

In [35]:
class CNNLSTM(nn.Module):
  def __init__(self):
    super().__init__()
    self.conv_layer1 = nn.Sequential(
        nn.Conv1d(128, 64, 3),
        nn.ReLU(inplace=True)
    )

    self.conv_layer2 = nn.Sequential(
        nn.Conv1d(64, 128, 3),
        nn.ReLU(inplace=True)
    )

    self.conv_layer3 = nn.Sequential(
        nn.Conv1d(128, 256, 3),
        nn.ReLU(inplace=True)
    )

    self.LSTM = nn.LSTM(256, 256, batch_first=True)
    self.maxpool = nn.MaxPool1d(2)
    self.classifier = nn.Linear(256, 8)

  def forward(self, x : torch.Tensor) -> torch.Tensor:
    x = x.squeeze(1)
    x = self.conv_layer1(x)
    x = self.maxpool(x)
    x = self.conv_layer2(x)
    x = self.maxpool(x)
    x = self.conv_layer3(x)
    x = x.transpose(1, 2)
    x, (h0, c0) = self.LSTM(x)
    x = self.classifier(h0[-1])
    return x

### Fine-tune

- 이미 만들어진 구조와 Pretrained Weight값을 사용

- 1채널 텐서를 입력받을수 있게 레이어를 추가 혹은 변경

- Task에 맞게 마지막 레이어를 추가

##### ResNet 18

In [35]:
class ResNet18(nn.Module):
    def __init__(self, pretrained : bool = True):
        super().__init__()
        self.network = torchvision.models.resnet18(pretrained)

        # Change # of in_channels
        self.network.conv1 = nn.Conv2d(in_channels = 1, out_channels = self.network.conv1.out_channels,
                                        kernel_size=7, stride=2, padding=3)

        self.network.fc = nn.Linear(in_features = self.network.fc.in_features, out_features = 512)
        self.classifier = nn.Linear(512, 8)


    def forward(self, x : torch.Tensor) -> torch.Tensor:
        x = self.network(x)
        x = self.network.fc(x)
        x = self.classifier(x)
        return x


model = ResNet18()



##### VGG19

- Batch Normalization이 적용되어있음

- Global Average Pooling 대신 Attention 사용

In [63]:
class VGG19(nn.Module):
  def __init__(self, pretrained : bool = True):
    super().__init__()
    self.network = torchvision.models.vgg19_bn(pretrained)

    # Add one layer that takes one channel input
    first_conv_layer = [nn.Conv2d(in_channels = 1, out_channels = 3, 
                                  kernel_size = 3, stride = 1, 
                                  padding = 1, dilation = 1, 
                                  groups = 1, bias=True)]
                            
    first_conv_layer.extend(list(self.network.features))  
    self.network.features= nn.Sequential(*first_conv_layer)
    
    self.classifier = nn.Linear(512, 8)
    self.attention = Attention(512)
    self.flatten = nn.Flatten(start_dim = 2)
    
  def forward(self, x : torch.Tensor) -> torch.Tensor:
    x = self.network.features(x)
    x = self.flatten(x)
    x = self.attention(x)
    x = self.classifier(x)
    return x

model = VGG19()

##### AlexNet


- Global Average Pooling 대신 Attention 사용

In [27]:
class AlexNet(nn.Module):
    def __init__(self, pretrained : bool = True):
        super().__init__()
        self.network = torchvision.models.alexnet(pretrained)
        
        first_conv_layer = [nn.Conv2d(in_channels = 1, out_channels = 3, 
                                  kernel_size = 3, stride = 1, 
                                  padding = 1, dilation = 1, 
                                  groups = 1, bias=True)]
        
        first_conv_layer.extend(list(self.network.features))  
        self.network.features = nn.Sequential(*first_conv_layer)
        
        self.flatten = nn.Flatten(start_dim = 2)
        self.avgpool  = self.network.avgpool
        self.classifier = nn.Linear(256, 8)
        self.attention = Attention(256)


    def forward(self, x : torch.Tensor) -> torch.Tensor:
        x = self.network.features(x)
        x = self.flatten(x)
        x = self.attention(x)
        x = self.classifier(x)
        return x


model = AlexNet()



## Train & Evaluate

### TrainUtils

In [17]:
# cal_avg() returns accuracy, recall, f1-score, and precision scores
def cal_avg(l : dict) -> float:
  Accuracy = 0
  Recall = 0
  F1Score = 0
  Precision = 0
  support = 0

  for report in l:
    support += 1
    Accuracy += report['accuracy']
    Recall += report['macro avg']['recall']
    F1Score += report['macro avg']['f1-score']
    Precision += report['macro avg']['precision']

    print(f" Fold {support}: Accuracy : {round(report['accuracy'], 4)} Precision : {round(report['macro avg']['precision'],4)} Recall : {round(report['macro avg']['recall'], 4)} F1-Score : {round(report['macro avg']['f1-score'],4)}")

  Accuracy /= support
  Recall /= support
  F1Score /= support
  Precision /= support
  
  return round(Accuracy, 4), round(Recall, 4), round(F1Score, 4), round(Precision, 4)


# accuracy() returns the accuracy of the predicted label(s) compared to the true label(s)
def accuracy(y_pred : torch.Tensor, y_true : torch.Tensor) -> float:
    correct = torch.eq(y_true, y_pred).sum().item()
    acc = (correct / len(y_pred)) * 100
    return acc


# train_step() trains the model
def train_step(model, dataloader, optim, loss_fn, accuracy_fn) -> float:
    train_loss = 0.0
    train_acc = 0.0

    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X = X.to(device)
        y = y.to(device)
        y_logits = model(X)
        y_preds = torch.log_softmax(y_logits, dim=1).argmax(dim=1)

        acc = accuracy_fn(y_preds, y)
        loss = loss_fn(y_logits, y)

        optim.zero_grad()
        loss.backward()
        optim.step()

        train_loss += loss.item()
        train_acc += acc

    train_loss /= len(dataloader)
    train_acc /= len(dataloader)
    print(f"Train loss: {train_loss:.5f} | Train accuracy: {train_acc:.2f}%")

    return train_loss


# eval_step() evalutes the model
def eval_step(model, dataloader, optim, loss_fn, accuracy_fn) -> float:
    test_loss = 0.0
    test_acc = 0.0

    model.eval()
    with torch.inference_mode():
        for b, (X, y) in enumerate(dataloader):
            X = X.to(device)
            y = y.to(device)
            y_logits = model(X)
            y_preds = torch.log_softmax(y_logits, dim=1).argmax(dim=1)

            acc = accuracy_fn(y_preds, y)
            loss = loss_fn(y_logits, y)

            test_loss += loss.item()
            test_acc += acc
    test_loss /= len(dataloader)
    test_acc /= len(dataloader)
    print(f"Test loss: {test_loss:.5f} | Test accuracy: {test_acc:.2f}%\n")
    return test_loss


# report() returns the classification report in a dictionary format evluating the given model
def report(model, dataloader) -> dict:
  model.eval()
  y_true=[]
  y_pred=[]
  labels = ['Neutral','Calm','Happy','Sad','Angry','Fearful','Disgust','Surprised']

  with torch.inference_mode():
      for b, (X, y) in enumerate(dataloader):
        X = X.to(device)
        y = y.to(device)
        y_logits = model(X)
        y_preds = torch.softmax(y_logits, dim=1).argmax(dim=1)
        y_true.append(y.cpu())
        y_pred.append(y_preds.cpu())

  true = np.concatenate(y_true)
  pred = np.concatenate(y_pred)
  print(classification_report(true, pred, digits=4))
  return classification_report(true, pred, digits=4, output_dict = True)

### Hyperparameters

More info @ /home/mllab/dchung/Notebook/config

In [18]:
# Hyperparameters
criterion = nn.CrossEntropyLoss()
batch_size = 32
learning_rate = 0.0001
epochs = 100

### KFold

|              | Actor 1 | Actor2 | ...     | Actor k |
| ------------ | ------- | ------ | ------- | ------- |
| # of samples | 60      | 60     | k=1..24 | 60      |

- Speaker Independence를 유지하기 위해 Actor의 ID를 기준으로 Train/Test(Valid)셋을 나눔

    - 1 Fold : [4, 9, 10, 14, 15, 17] 

    - 2 Fold : [5, 13, 16, 18, 20, 23] 

    - 3 Fold : [1, 2, 6, 8, 19, 21] 

    - 4 Fold : [3, 7, 11, 12, 22, 24] 
 
*시드(351) 고정 필요*

In [19]:
actor = [x for x in range(1, 25)]
k = 4
kf = KFold(n_splits=k, shuffle=True, random_state = 351) # fix this seed

In [None]:
dict_list = []
k_train_acc = []
k_test_acc = []
k_train_loss = []
k_test_loss =[]

# KFold
for i, (train_index, valid_index) in enumerate(kf.split(actor)):
  
    train_loss = []
    test_loss =[]

    print(f'FOLD {i + 1}')
    print('--------------------------------')
    train_actor, valid_actor = [actor[j] for j in train_index], [actor[j] for j in valid_index]
    train = mel[mel['ID'].isin(train_actor)]
    valid = mel[mel['ID'].isin(valid_actor)]

    '''
    normal, strong = valid[valid['Intensity'] == "Normal"], valid[valid['Intensity'] == "Strong"]

    # Train Strong Test Normal
    train = pd.concat([train, strong])
    valid = normal

    # Train Normal Test Strong
    valid = pd.concat([strong, normal[normal['Emotion'] == 0]])
    train = pd.concat([train, normal[normal['Emotion'] != 0]])
    '''

    
    train_dataloader, test_dataloader = prepare_dataloader(train, valid, batch_size, random_seed)
    model = CNNX()
    optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.00001)
    model = model.to(device)

    # Train & Evaluate
    for epoch in range(epochs):
      print(f"Epoch {epoch + 1} result:")
      train_result = train_step(model, train_dataloader, optimizer, criterion, accuracy)
      test_result = eval_step(model, test_dataloader, optimizer, criterion, accuracy)
      train_loss.append(train_result)
      test_loss.append(test_result)


    k_train_loss.append(train_loss)
    k_test_loss.append(test_loss)
    dict_list.append(report(model, test_dataloader))

FOLD 1
--------------------------------
Epoch 1 result:
Train loss: 2.24100 | Train accuracy: 15.66%
Test loss: 2.08737 | Test accuracy: 12.50%

Epoch 2 result:
Train loss: 2.01926 | Train accuracy: 20.37%
Test loss: 1.99212 | Test accuracy: 20.05%

Epoch 3 result:
Train loss: 1.90029 | Train accuracy: 28.68%
Test loss: 1.84139 | Test accuracy: 25.26%

Epoch 4 result:
Train loss: 1.74116 | Train accuracy: 35.02%
Test loss: 1.73483 | Test accuracy: 34.38%

Epoch 5 result:
Train loss: 1.58140 | Train accuracy: 40.38%
Test loss: 1.70163 | Test accuracy: 31.25%

Epoch 6 result:
Train loss: 1.42359 | Train accuracy: 47.76%
Test loss: 1.73783 | Test accuracy: 36.20%

Epoch 7 result:
Train loss: 1.29815 | Train accuracy: 52.60%
Test loss: 1.60573 | Test accuracy: 40.62%

Epoch 8 result:
Train loss: 1.15692 | Train accuracy: 59.28%
Test loss: 1.63718 | Test accuracy: 41.93%

Epoch 9 result:
Train loss: 1.11185 | Train accuracy: 59.80%
Test loss: 1.60250 | Test accuracy: 42.19%

Epoch 10 result

## Result

- Macro Average를 사용
    - Balanced Dataset이여서 Weighted Average와 Macro Average와 큰 차이없음

In [39]:
accuracy, recall, f1, precision = cal_avg(dict_list)
print("---------------------------")
print(f"Accuracy : {accuracy * 100}% \nPrecision : {precision * 100}% \nRecall : {recall * 100}% \nF1-Score : {f1 * 100}%")

 Fold 1: Accuracy : 0.55 Precision : 0.5443 Recall : 0.5391 F1-Score : 0.5376
 Fold 2: Accuracy : 0.5417 Precision : 0.5435 Recall : 0.5365 F1-Score : 0.5288
 Fold 3: Accuracy : 0.5639 Precision : 0.5505 Recall : 0.5547 F1-Score : 0.5472
 Fold 4: Accuracy : 0.625 Precision : 0.6425 Recall : 0.6224 F1-Score : 0.6221
---------------------------
Accuracy : 57.010000000000005% 
Precision : 57.02% 
Recall : 56.32% 
F1-Score : 55.88999999999999%


Training from Scratch
|               | Accuracy | Precision | Recall | F1-Score |
| ------------- | -------- | --------- | ------ | -------- |
| CNN-14        | 60.00%   | 66.28%    | 60.48% | 59.30%   |
| CNN-14 + Attn | 60.68%   | 66.05%    | 61.14% | 60.20%   |
| CNN-X         | 54.93%   | 57.43%    | 54.43% | 54.24%   |
| 1DCNNLSTM     | 56.81%   | 56.98%    | 55.73% | 54.91%   |



Fine-Tuning
|                | Accuracy | Precision | Recall | F1-Score |
| -------------- | -------- | --------- | ------ | -------- |
| VGG19 + Attn   | 65.28%   | 66.42%    | 65.10% | 64.43%   |
| ResNet         | 57.01%   | 57.02%    | 56.32% | 55.89%   |
| AlexNet + Attn | 55.62%   | 55.24%    | 54.69% | 53.81%   |