<a href="https://colab.research.google.com/github/YuanGongND/vocalsound/blob/main/colab/VocalSound.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [VocalSound: A Dataset for Improving Human Vocal Sounds Recognition](https://ieeexplore.ieee.org/document/9746828)

- This Colab notebook contains the official code for the data preparation and baseline experiment in the [ICASSP paper](https://ieeexplore.ieee.org/document/9746828).

- Please cite our paper(s) if you find the VocalSound dataset and code useful. The first paper introduces the VocalSound dataset, and the second describes the training pipeline and model we used for the baseline experiment.

  - ```
@INPROCEEDINGS{gong_vocalsound,
  author={Gong, Yuan and Yu, Jin and Glass, James},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition},
  year={2022},
  pages={151-155},
  doi={10.1109/ICASSP43922.2022.9746828}}
```
  - ```
@ARTICLE{gong_psla,
    author={Gong, Yuan and Chung, Yu-An and Glass, James},
    title={PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    doi={10.1109/TASLP.2021.3120633}
}
```

# Step 1. Prepare the VocalSound Data

**Step 1.1** Mount your Google Drive in this Colab session, since the VocalSound dataset is quite large (2.5 GB). You will be asked to complete a few security checks in this step.

In [None]:
import os

from google.colab import drive
drive.mount('/content/drive')

data_dir = '/content/drive/MyDrive/vocalsound_baseline'
if os.path.exists(data_dir):
    print('data path already exists')
else:
    os.mkdir(data_dir)

Mounted at /content/drive
data path already exists


**Step 1.2** Download and unzip the VocalSound dataset (16 kHz version). Unzipping can take up to 20 minutes, so please be patient.

In [None]:
if not os.path.exists('/content/drive/MyDrive/vocalsound_baseline/vs_release.zip'):
  print('Downloading and uncompressing the VocalSound dataset, it takes up to 20 minutes, please be patient.')
  # quote the URL so the shell does not interpret the "?" in it
  os.system('wget "https://www.dropbox.com/s/c5ace70qh1vbyzb/vs_release_16k.zip?dl=1" -O /content/drive/MyDrive/vocalsound_baseline/vs_release.zip')
  os.system('unzip -q /content/drive/MyDrive/vocalsound_baseline/vs_release.zip -d /content/drive/MyDrive/vocalsound_baseline/')

**Step 1.3** Automatically correct the audio paths in the data JSON files so that they point to the dataset location on your Drive.
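
The path correction in this step can be tried on a throwaway file first. The following is a minimal sketch using only the standard library; the file name `clip_0001.wav`, the label string, and both directories are hypothetical, and `rewrite_wav_paths` is an illustrative helper that mirrors the notebook's `change_path`.

```python
import json
import os
import tempfile


def rewrite_wav_paths(json_file_path, target_path):
    """Point every "wav" entry at target_path/audio_16k/<original file name>."""
    with open(json_file_path, 'r') as fp:
        data = json.load(fp)['data']
    for item in data:
        # keep only the file name and re-attach it to the new root
        file_name = item['wav'].split('/')[-1]
        item['wav'] = target_path + '/audio_16k/' + file_name
    with open(json_file_path, 'w') as fp:
        json.dump({'data': data}, fp, indent=1)
    return data


# Hypothetical entry: the paths and label below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_json = os.path.join(tmp_dir, 'demo.json')
    with open(demo_json, 'w') as fp:
        json.dump({'data': [{'wav': '/old/location/audio_16k/clip_0001.wav',
                             'labels': 'label_a'}]}, fp)
    rewritten = rewrite_wav_paths(demo_json, '/new/location')
    print(rewritten[0]['wav'])
```

Only the directory part of each path changes; the file name and the labels are left untouched, so the rewrite can be repeated safely if the dataset is moved again.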

In [None]:
import json

def get_immediate_files(a_dir):
    return [name for name in os.listdir(a_dir) if os.path.isfile(os.path.join(a_dir, name))]

def change_path(json_file_path, target_path):
    with open(json_file_path, 'r') as fp:
        data_json = json.load(fp)
    data = data_json['data']

    # change the path in the json file
    for i in range(len(data)):
        ori_path = data[i]["wav"]
        new_path = target_path + '/audio_16k/' + ori_path.split('/')[-1]
        data[i]["wav"] = new_path

    with open(json_file_path, 'w') as f:
        json.dump({'data': data}, f, indent=1)

# for train, validation, test
json_files = get_immediate_files(data_dir + '/datafiles/')
for json_f in json_files:
    if json_f.endswith('.json'):
        print('now processing ' + data_dir + '/datafiles/' + json_f)
        change_path(data_dir + '/datafiles/' + json_f, data_dir)

# for subtest sets
json_files = get_immediate_files(data_dir + '/datafiles/subtest/')
for json_f in json_files:
    if json_f.endswith('.json'):
        print('now processing ' + data_dir + '/datafiles/subtest/' + json_f)
        change_path(data_dir + '/datafiles/subtest/' + json_f, data_dir)


now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/te.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/val.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/all.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/tr.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/subtest/te_male.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/subtest/te_age3.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/subtest/te_age2.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/subtest/te_female.json
now processing /content/drive/MyDrive/vocalsound_baseline/datafiles/subtest/te_age1.json


# Step 2. Define the EfficientNet Model

You can plug in and test your own model here. As long as your model's input and output shapes are the same as those of the baseline model, it should work with the rest of the script.
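
As an illustration of that input/output contract, here is a minimal sketch of a replacement model. `TinyCNN` and its layer sizes are hypothetical and are not the baseline from the paper; the only requirements taken from this notebook are an input of shape `(batch_size, time_frames, frequency_bins)` and an output of shape `(batch_size, 6)`.

```python
import torch
import torch.nn as nn


class TinyCNN(nn.Module):
    """Hypothetical stand-in model with the same input/output shapes as the baseline."""

    def __init__(self, label_dim=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.linear = nn.Linear(32, label_dim)

    def forward(self, x):
        # x: (batch_size, time_frames, frequency_bins)
        x = x.unsqueeze(1)            # add a channel axis: (batch, 1, time, freq)
        x = self.conv(x)
        x = self.pool(x).flatten(1)   # global average pooling: (batch, 32)
        return self.linear(x)         # class logits: (batch, label_dim)


model = TinyCNN(label_dim=6)
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 512, 128))
print(logits.shape)
```

To run such a model through the pipeline, it would need to be constructed in place of `EffNetOri` in the training cell of Step 6.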

In [None]:
import argparse
import os
import pickle
import sys
import time
import torch
import torch.nn as nn
import shutil
import numpy as np
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

class EffNetOri(torch.nn.Module):
    def __init__(self, label_dim=6, pretrain=True, level=0):
        super().__init__()
        b = int(level)
        if pretrain == True:
            print('now train a effnet-b{:d} model with ImageNet pretrain'.format(b))
        else:
            print('now train a effnet-b{:d} model without ImageNet pretrain'.format(b))
        if b == 7:
            self.model = torchvision.models.efficientnet_b7(pretrained=pretrain)
        elif b == 6:
            self.model = torchvision.models.efficientnet_b6(pretrained=pretrain)
        elif b == 5:
            self.model = torchvision.models.efficientnet_b5(pretrained=pretrain)
        elif b == 4:
            self.model = torchvision.models.efficientnet_b4(pretrained=pretrain)
        elif b == 3:
            self.model = torchvision.models.efficientnet_b3(pretrained=pretrain)
        elif b == 2:
            self.model = torchvision.models.efficientnet_b2(pretrained=pretrain)
        elif b == 1:
            self.model = torchvision.models.efficientnet_b1(pretrained=pretrain)
        elif b == 0:
            self.model = torchvision.models.efficientnet_b0(pretrained=pretrain)

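        # the first conv layer takes a 1-channel spectrogram; its weight is overwritten below
        # with the original 3-channel (RGB) weights summed over the input-channel axis
        new_proj = torch.nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)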
        print('conv1 get from pretrained model.')
        new_proj.weight = torch.nn.Parameter(torch.sum(self.model.features[0][0].weight, dim=1).unsqueeze(1))
        new_proj.bias = self.model.features[0][0].bias
        self.model.features[0][0] = new_proj
        self.model = create_feature_extractor(self.model, {'features.8': 'mout'})
        self.feat_dim, self.freq_dim = self.get_dim()
        self.linear = nn.Linear(self.feat_dim, label_dim)

    def get_dim(self):
        # expect input x = (batch_size, time_frame_num, frequency_bins), e.g., (12, 1024, 128)
        x = torch.zeros(10, 1000, 128)
        x = x.unsqueeze(1)
        x = x.transpose(2, 3)
        x = self.model(x)['mout']
        return int(x.shape[1]), int(x.shape[2])

    def forward(self, x):
        # expect input x = (batch_size, time_frame_num, frequency_bins), e.g., (12, 1024, 128)
        x = x.unsqueeze(1)
        x = x.transpose(2, 3)
        x = self.model(x)['mout']
        x = torch.mean(x, dim=[2, 3])
        x = self.linear(x)
        return x

# # You can define your own model here and test it with our pipeline.
# # For simplicity, you can give your model the same name as ours and comment out our model.
# class YourModel(torch.nn.Module):
#     def __init__(self):
#         super().__init__()
#
#     def forward(self, x):
#         # input in shape [batch_size, number of time frames, number of frequency bins], e.g., (12, 1024, 128)
#         # your model here
#         # output should be in shape [batch_size, 6] where 6 is the number of classes.
#         return x

# Step 3. Build the Dataloader
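
Each item's `labels` field is a comma-separated string of class identifiers, which the dataset converts into a multi-hot target vector by looking them up in the label CSV. The following is a minimal, standard-library sketch of that lookup. The CSV rows and the identifiers `id_a`, `id_b`, `id_c` are hypothetical; only the column names `index`, `mid`, and `display_name` are taken from the code in this step.

```python
import csv
import io

# Hypothetical label table; the real one is read from the label CSV file.
label_csv_text = (
    "index,mid,display_name\n"
    "0,id_a,class A\n"
    "1,id_b,class B\n"
    "2,id_c,class C\n"
)


def make_index_dict_from_text(csv_text):
    """Map each class identifier ('mid') to its integer index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row['mid']: int(row['index']) for row in reader}


def encode_labels(label_str, index_dict):
    """Turn a comma-separated label string into a multi-hot list of floats."""
    target = [0.0] * len(index_dict)
    for mid in label_str.split(','):
        target[index_dict[mid]] = 1.0
    return target


index_dict = make_index_dict_from_text(label_csv_text)
target = encode_labels('id_b', index_dict)
print(target)
```

With a single label per clip, as in this sketch, the vector has exactly one non-zero entry, which is why the training loop can later recover the class index with an argmax.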

In [None]:
import csv
import json
import torchaudio
import numpy as np
import scipy.signal
import torch
import torch.nn.functional
from torch.utils.data import Dataset
import random

def make_index_dict(label_csv):
    index_lookup = {}
    with open(label_csv, 'r') as f:
        csv_reader = csv.DictReader(f)
        line_count = 0
        for row in csv_reader:
            index_lookup[row['mid']] = row['index']
            line_count += 1
    return index_lookup

def make_name_dict(label_csv):
    name_lookup = {}
    with open(label_csv, 'r') as f:
        csv_reader = csv.DictReader(f)
        line_count = 0
        for row in csv_reader:
            name_lookup[row['index']] = row['display_name']
            line_count += 1
    return name_lookup

def lookup_list(index_list, label_csv):
    label_list = []
    table = make_name_dict(label_csv)
    for item in index_list:
        label_list.append(table[item])
    return label_list

class VSDataset(Dataset):
    def __init__(self, dataset_json_file, label_csv=None, audio_conf=None, raw_wav_mode=False, specaug=False):
        self.datapath = dataset_json_file
        with open(dataset_json_file, 'r') as fp:
            data_json = json.load(fp)
        self.data = data_json['data']
        self.audio_conf = audio_conf
        self.mode = self.audio_conf.get('mode')
        self.melbins = self.audio_conf.get('num_mel_bins')
        self.index_dict = make_index_dict(label_csv)
        self.label_num = len(self.index_dict)
        #print('Number of classes is {:d}'.format(self.label_num))

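        # window functions live in scipy.signal.windows; the top-level aliases such as
        # scipy.signal.hamming were deprecated and are not available in recent SciPy releases
        self.windows = {'hamming': scipy.signal.windows.hamming, 'hann': scipy.signal.windows.hann, 'blackman': scipy.signal.windows.blackman, 'bartlett': scipy.signal.windows.bartlett}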

        # whether to load only the raw waveform
        self.raw_wav_mode = raw_wav_mode
        if specaug:
            self.freqm = self.audio_conf.get('freqm')
            self.timem = self.audio_conf.get('timem')
            print('now using following mask: {:d} freq, {:d} time'.format(self.audio_conf.get('freqm'), self.audio_conf.get('timem')))
        self.specaug = specaug
        self.mixup = self.audio_conf.get('mixup')
        #print('now using mix-up with rate {:f}'.format(self.mixup))
        #print('now add rolling and new mixup stategy')

    def _wav2fbank(self, filename, filename2=None):
        # no mix-up here: the Colab version removes the mix-up part, so filename2 is unused
        waveform, sr = torchaudio.load(filename)
        waveform = waveform - waveform.mean()

        fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
                                                  window_type='hanning', num_mel_bins=self.melbins, dither=0.0, frame_shift=10)

        target_length = self.audio_conf.get('target_length', 512)
        n_frames = fbank.shape[0]

        p = target_length - n_frames

        # cut and pad
        if p > 0:
            m = torch.nn.ZeroPad2d((0, 0, 0, p))
            fbank = m(fbank)
        elif p < 0:
            fbank = fbank[0:target_length, :]

        return fbank

    def __getitem__(self, index):
        datum = self.data[index]
        label_indices = np.zeros(self.label_num) + 0.00
        fbank = self._wav2fbank(datum['wav'])
        for label_str in datum['labels'].split(','):
            label_indices[int(self.index_dict[label_str])] = 1.0
        label_indices = torch.FloatTensor(label_indices)

        if self.specaug == True:
            freqm = torchaudio.transforms.FrequencyMasking(self.freqm)
            timem = torchaudio.transforms.TimeMasking(self.timem)
            fbank = torch.transpose(fbank, 0, 1)
            fbank = fbank.unsqueeze(0)
            fbank = freqm(fbank)
            fbank = timem(fbank)
            fbank = fbank.squeeze(0)
            fbank = torch.transpose(fbank, 0, 1)

        # normalize with mean -3.05 and std 5.42, obtained from the validation set as a prior
        fbank = (fbank + 3.05) / 5.42

        # random circular shift along the time axis, applied to the training set only
        if self.mode == 'train':
            fbank = torch.roll(fbank, np.random.randint(0, 1024), 0)

        return fbank, label_indices

    def __len__(self):
        return len(self.data)

# Step 4. Define the Metrics
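
The `d_prime` metric in this step is a monotonic transform of the AUC: $d' = \sqrt{2}\,\Phi^{-1}(\mathrm{AUC})$, where $\Phi^{-1}$ is the inverse CDF of the standard normal distribution. The following sketch reproduces the formula with the standard library only, as a cross-check of the SciPy-based version; `d_prime_from_auc` is an illustrative name, and the AUC value 0.761857 is taken from the epoch 1 log in this notebook.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc):
    """d' = sqrt(2) * inverse standard-normal CDF of the AUC (requires 0 < auc < 1)."""
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


# An AUC of 0.5 (chance level) corresponds to d' = 0.
print(round(d_prime_from_auc(0.5), 6))

# AUC reported after epoch 1 in the training log of this notebook.
print(round(d_prime_from_auc(0.761857), 3))
```

Note that `NormalDist.inv_cdf` raises an error for an AUC of exactly 0 or 1, whereas SciPy's `ppf` returns an infinite value in those cases.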

In [None]:
import numpy as np
from scipy import stats as stats_func
from sklearn import metrics
import torch

def d_prime(auc):
    standard_normal = stats_func.norm()
    d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
    return d_prime

def calculate_stats(output, target):
    """Calculate statistics including mAP, AUC, etc.

    Args:
      output: 2d array, (samples_num, classes_num)
      target: 2d array, (samples_num, classes_num)

    Returns:
      stats: list of statistic of each class.
    """

    classes_num = target.shape[-1]
    stats = []

    # Class-wise statistics
    for k in range(classes_num):

        # Average precision
        avg_precision = metrics.average_precision_score(
            target[:, k], output[:, k], average=None)

        # AUC
        auc = metrics.roc_auc_score(target[:, k], output[:, k], average=None)

        # Accuracy
        # this is only used for single-label classification such as esc-50, not for multiple label one such as AudioSet
        acc = metrics.accuracy_score(np.argmax(target, 1), np.argmax(output, 1))

        # F1
        target_i = np.argmax(target, axis=1)
        output_i = np.argmax(output, axis=1)
        f1 = metrics.f1_score(target_i, output_i, average=None)

        # Precisions, recalls
        (precisions, recalls, thresholds) = metrics.precision_recall_curve(
            target[:, k], output[:, k])

        # FPR, TPR
        (fpr, tpr, thresholds) = metrics.roc_curve(target[:, k], output[:, k])

        save_every_steps = 1     # Sample statistics to reduce size
        # named class_stats to avoid shadowing the built-in dict
        class_stats = {'precisions': precisions[0::save_every_steps],
                       'recalls': recalls[0::save_every_steps],
                       'AP': avg_precision,
                       'fpr': fpr[0::save_every_steps],
                       'fnr': 1. - tpr[0::save_every_steps],
                       'auc': auc,
                       'acc': acc,
                       'f1': f1
                       }
        stats.append(class_stats)

    return stats

# Step 5. Create the Training and Evaluation Pipeline
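
The training and validation loops in this step convert each one-hot label to a class index with an argmax and then apply `nn.CrossEntropyLoss`, which for a single example with logits $z$ and true class $y$ evaluates $-\log \mathrm{softmax}(z)_y$. The following is a minimal pure-Python sketch of that per-example computation; the logit values are made up for illustration, and averaging over a batch is left out.

```python
import math


def cross_entropy(logits, one_hot_label):
    """Per-example cross-entropy: -log softmax(logits)[argmax(one_hot_label)]."""
    # recover the class index from the one-hot label (argmax)
    class_index = max(range(len(one_hot_label)), key=lambda k: one_hot_label[k])
    # subtract the maximum logit for numerical stability
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[class_index] - log_sum_exp)


# Uniform logits over 6 classes give a loss of log(6), the value of an uninformed guess.
uniform_loss = cross_entropy([0.0] * 6, [0, 0, 1, 0, 0, 0])
print(uniform_loss, math.log(6))

# A confident, correct prediction gives a loss close to zero.
confident_loss = cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0, 0, 0])
print(confident_loss)
```

The value log(6) ≈ 1.79 is a useful reference point when reading the `valid_loss` lines in the training log: a loss below it indicates the model is doing better than a uniform guess over the six classes.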

In [None]:
import os
import datetime
import time
import torch
import numpy as np
import pickle

def train(audio_model, train_loader, test_loader, args):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    torch.set_grad_enabled(True)
    best_epoch, best_cum_epoch, best_mAP, best_acc, best_cum_mAP = 0, 0, -np.inf, -np.inf, -np.inf
    global_step, epoch = 0, 0
    exp_dir = args.exp_dir

    audio_model = audio_model.to(device)
    # Set up the optimizer
    audio_trainables = [p for p in audio_model.parameters() if p.requires_grad]
    print('Total parameter number is : {:.3f} million'.format(sum(p.numel() for p in audio_model.parameters()) / 1000000))
    print('Total trainable parameter number is : {:.3f} million'.format(sum(p.numel() for p in audio_trainables) / 1000000))
    trainables = audio_trainables

    optimizer = torch.optim.Adam(trainables, args.lr, weight_decay=args.weight_decay, betas=(0.95, 0.999))

    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, list(range(10, 60)), gamma=1.0)

    epoch += 1

    print("current #steps=%s, #epochs=%s" % (global_step, epoch))
    print("start training...")

    result = np.zeros([args.n_epochs, 9])
    audio_model.train()
    while epoch < args.n_epochs + 1:
        audio_model.train()

        for i, (audio_input, labels) in enumerate(train_loader):
            # move the batch to the training device
            B = audio_input.size(0)
            audio_input = audio_input.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            audio_output = audio_model(audio_input)
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(audio_output, torch.argmax(labels.long(), axis=1))

            # original optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            end_time = time.time()
            global_step += 1

        print('start validation')
        stats, valid_loss = validate(audio_model, test_loader, args, epoch)

        cum_stats = stats

        cum_mAP = np.mean([stat['AP'] for stat in cum_stats])
        cum_mAUC = np.mean([stat['auc'] for stat in cum_stats])
        cum_acc = np.mean([stat['acc'] for stat in cum_stats])

        mAP = np.mean([stat['AP'] for stat in stats])
        mAUC = np.mean([stat['auc'] for stat in stats])
        acc = np.mean([stat['acc'] for stat in stats])

        middle_ps = [stat['precisions'][int(len(stat['precisions'])/2)] for stat in stats]
        middle_rs = [stat['recalls'][int(len(stat['recalls'])/2)] for stat in stats]
        average_precision = np.mean(middle_ps)
        average_recall = np.mean(middle_rs)

        print("---------------------Epoch {:d} Results---------------------".format(epoch))
        print("ACC: {:.6f}".format(acc))
        print("mAP: {:.6f}".format(mAP))
        print("AUC: {:.6f}".format(mAUC))
        print("Avg Precision: {:.6f}".format(average_precision))
        print("Avg Recall: {:.6f}".format(average_recall))
        print("d_prime: {:.6f}".format(d_prime(mAUC)))
        print("valid_loss: {:.6f}".format(valid_loss))

        result[epoch-1, :] = [mAP, acc, average_precision, average_recall, d_prime(mAUC), valid_loss, cum_mAP, cum_acc, optimizer.param_groups[0]['lr']]

        np.savetxt(exp_dir + '/result.csv', result, delimiter=',')

        if acc > best_acc:
            best_acc = acc
            best_acc_epoch = epoch
            torch.save(audio_model.state_dict(), "%s/models/best_audio_model.pth" % (exp_dir))

        scheduler.step()

        #print('number of params groups:' + str(len(optimizer.param_groups)))
        print('Epoch-{0} lr: {1}'.format(epoch, optimizer.param_groups[0]['lr']))

        epoch += 1

def validate(audio_model, val_loader, args, epoch):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    audio_model = audio_model.to(device)
    audio_model.eval()

    A_predictions, A_targets, A_loss = [], [], []
    with torch.no_grad():
        for i, (audio_input, labels) in enumerate(val_loader):
            audio_input = audio_input.to(device)

            # compute output
            audio_output = audio_model(audio_input)
            predictions = audio_output.to('cpu').detach()

            A_predictions.append(predictions)
            A_targets.append(labels)

            # compute the loss
            labels = labels.to(device)
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(audio_output, torch.argmax(labels.long(), axis=1))
            A_loss.append(loss.to('cpu').detach())

        audio_output = torch.cat(A_predictions)
        target = torch.cat(A_targets)
        loss = np.mean(A_loss)
        stats = calculate_stats(audio_output, target)

    return stats, loss

# Step 6. Train the EfficientNet model with VocalSound
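
The cell below builds its configuration with `argparse` and calls `parse_args(args=[])`. Passing an explicit list makes the parser ignore the notebook kernel's own command-line arguments, and it also offers a simple way to override a default without editing the `add_argument` calls. The following sketch shows this with three of the options defined below; the override values are arbitrary examples.

```python
import argparse
import ast

parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--n-epochs", type=int, default=20, help="number of maximum training epochs")
parser.add_argument("--lr", "--learning-rate", default=1e-4, type=float, help="initial learning rate")
# ast.literal_eval turns the strings 'True' / 'False' into real booleans,
# and argparse also applies it to the string default
parser.add_argument("--imagenet_pretrain", type=ast.literal_eval, default="False",
                    help="if use pretrained imagenet efficient net")

# Empty list: use every default.
defaults = parser.parse_args(args=[])
print(defaults.n_epochs, defaults.imagenet_pretrain)

# Explicit list: override selected options (example values only).
overridden = parser.parse_args(args=["--n-epochs", "30", "--imagenet_pretrain", "True"])
print(overridden.n_epochs, overridden.imagenet_pretrain)
```

Dashes in option names become underscores in the parsed namespace, so `--n-epochs` is read back as `args.n_epochs`.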

In [None]:
import argparse
import sys
import time
import torch
import shutil
import ast
import numpy as np

print("I am process %s, running on %s: starting (%s)" % (
        os.getpid(), os.uname()[1], time.asctime()))

# I/O args
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--data-train", type=str, default='/content/drive/MyDrive/vocalsound_baseline/datafiles/tr.json', help="training data json")
parser.add_argument("--data-val", type=str, default='/content/drive/MyDrive/vocalsound_baseline/datafiles/val.json', help="validation data json")
parser.add_argument("--label-csv", type=str, default='/content/drive/MyDrive/vocalsound_baseline/class_labels_indices_vs.csv', help="csv with class labels")
parser.add_argument("--exp-dir", type=str, default="/content/drive/MyDrive/vocalsound_baseline/baseline_exp/", help="directory to dump experiments")
# training and optimization args
parser.add_argument("--optim", type=str, default="adam", help="training optimizer", choices=["sgd", "adam"])
parser.add_argument('-b', '--batch-size', default=80, type=int, metavar='N', help='mini-batch size')
parser.add_argument('-w', '--num-workers', default=2, type=int, metavar='NW', help='# of workers for dataloading')
parser.add_argument('--lr', '--learning-rate', default=1e-4, type=float, metavar='LR', help='initial learning rate')
parser.add_argument('--lr-decay', default=40, type=int, metavar='LRDECAY', help='Divide the learning rate by 10 every lr_decay epochs')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M', help='momentum')
parser.add_argument('--weight-decay', '--wd', default=5e-7, type=float, metavar='W', help='weight decay')
parser.add_argument("--n-epochs", type=int, default=20, help="number of maximum training epochs")
parser.add_argument("--n-print-steps", type=int, default=1, help="number of steps to print statistics")
# models args
parser.add_argument("--n_class", type=int, default=6, help="number of classes")
parser.add_argument('--save_model', help='save the models or not', type=ast.literal_eval, default='False')
parser.add_argument("--model", type=str, default='eff_mean', help="model")
parser.add_argument("--model_size", type=int, default=0, help="model size")
parser.add_argument('--imagenet_pretrain', help='if use pretrained imagenet efficient net', type=ast.literal_eval, default='False')
parser.add_argument('--freqm', help='frequency mask max length', type=int, default=48)
parser.add_argument('--timem', help='time mask max length', type=int, default=192)
parser.add_argument("--mixup", type=float, default=0, help="how many (0-1) samples need to be mixup during training")

args = parser.parse_args(args=[])

audio_conf = {'num_mel_bins': 128, 'target_length': 512, 'freqm': args.freqm, 'timem': args.timem, 'mixup': args.mixup, 'mode': 'train'}

print('balanced sampler is not used')
train_loader = torch.utils.data.DataLoader(
    VSDataset(args.data_train, label_csv=args.label_csv, audio_conf=audio_conf, raw_wav_mode=False, specaug=True),
    batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers, pin_memory=True)

val_audio_conf = {'num_mel_bins': 128, 'target_length': 512, 'mixup': 0, 'mode': 'test'}

val_loader = torch.utils.data.DataLoader(
    VSDataset(args.data_val, label_csv=args.label_csv, audio_conf=val_audio_conf, raw_wav_mode=False),
    batch_size=200, shuffle=False, num_workers=args.num_workers, pin_memory=True)

if args.model == 'eff_mean':
    audio_model = EffNetOri(label_dim=args.n_class, level=args.model_size, pretrain=args.imagenet_pretrain)
else:
    raise ValueError('Model Unrecognized')

# start training
if os.path.exists(args.exp_dir):
    print("Deleting existing experiment directory %s" % args.exp_dir)
    shutil.rmtree(args.exp_dir)
print("\nCreating experiment directory: %s" % args.exp_dir)
os.makedirs("%s/models" % args.exp_dir)
with open("%s/args.pkl" % args.exp_dir, "wb") as f:
    pickle.dump(args, f)

print('Now starting training for {:d} epochs'.format(args.n_epochs))
train(audio_model, train_loader, val_loader, args)

# test on the test set and sub-test set, model selected on the validation set
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sd = torch.load(args.exp_dir + '/models/best_audio_model.pth', map_location=device)
audio_model.load_state_dict(sd)

all_res = []

# best model on the validation set, repeat to confirm
stats, _ = validate(audio_model, val_loader, args, 'valid_set')
# note it is NOT mean of class-wise accuracy
val_acc = stats[0]['acc']
val_mAUC = np.mean([stat['auc'] for stat in stats])
print('---------------evaluate on the validation set---------------')
print("Accuracy: {:.6f}".format(val_acc))
all_res.append(val_acc)

# test the model on the evaluation set
data_eval_list = ['te.json', 'subtest/te_age1.json', 'subtest/te_age2.json', 'subtest/te_age3.json', 'subtest/te_female.json', 'subtest/te_male.json']
eval_name_list = ['all_test', 'test age 18-25', 'test age 26-48', 'test age 49-80', 'test female', 'test male']

data_dir = '/'.join(args.data_val.split('/')[:-1])
for idx, cur_eval in enumerate(data_eval_list):
    cur_eval = data_dir + '/' + cur_eval
    eval_loader = torch.utils.data.DataLoader(
        VSDataset(cur_eval, label_csv=args.label_csv, audio_conf=val_audio_conf),
        batch_size=args.batch_size*2, shuffle=False, num_workers=args.num_workers, pin_memory=True)
    stats, _ = validate(audio_model, eval_loader, args, eval_name_list[idx])
    eval_acc = stats[0]['acc']
    all_res.append(eval_acc)
    print('---------------evaluate on {:s}---------------'.format(eval_name_list[idx]))
    print("Accuracy: {:.6f}".format(eval_acc))

all_res = np.array(all_res)
all_res = all_res.reshape([1, all_res.shape[0]])
np.savetxt(args.exp_dir + '/all_eval_result.csv', all_res, header=','.join(['validation'] + eval_name_list), delimiter=',')

I am process 71, running on 0dab978c433d: starting (Sat Apr 16 20:30:23 2022)
balanced sampler is not used
now using following mask: 48 freq, 192 time
now train a effnet-b0 model without ImageNet pretrain
conv1 get from pretrained model.
Deleting existing experiment directory /content/drive/MyDrive/vocalsound_baseline/baseline_exp/

Creating experiment directory: /content/drive/MyDrive/vocalsound_baseline/baseline_exp/
Now starting training for 30 epochs
cuda
Total parameter number is : 4.015 million
Total trainable parameter number is : 4.015 million
current #steps=0, #epochs=1
start training...
start validation
---------------------Epoch 1 Results---------------------
ACC: 0.366038
mAP: 0.371119
AUC: 0.761857
Avg Precision: 0.281954
Avg Recall: 0.827373
d_prime: 1.007327
valid_loss: 1.491638
number of params groups:1
Epoch-1 lr: 0.0001
start validation
---------------------Epoch 2 Results---------------------
ACC: 0.657143
mAP: 0.676608
AUC: 0.896861
Avg Precision: 0.327757
Avg Recal