# Task description
- Classify the speaker of each utterance from the given audio features.
- Main goal: Learn how to use a Transformer.
- Baselines:
  - Easy: Run the sample code and learn how to use a Transformer.
  - Medium: Learn how to adjust the Transformer's parameters.
  - Strong: Build a [conformer](https://arxiv.org/abs/2005.08100), which is a variant of the Transformer.
  - Boss: Implement [Self-Attention Pooling](https://arxiv.org/pdf/2008.01077v1.pdf) & [Additive Margin Softmax](https://arxiv.org/pdf/1801.05599.pdf) to further boost the performance.

- Other links
  - Kaggle: [link](https://www.kaggle.com/t/ac77388c90204a4c8daebeddd40ff916)
  - Slide: [link](https://docs.google.com/presentation/d/1HLAj7UUIjZOycDe7DaVLSwJfXVd3bXPOyzSb6Zk3hYU/edit?usp=sharing)
  - Data: [link](https://drive.google.com/drive/folders/1vI1kuLB-q1VilIftiwnPOCAeOOFfBZge?usp=sharing)

# Download dataset
- Data is [here](https://drive.google.com/drive/folders/1vI1kuLB-q1VilIftiwnPOCAeOOFfBZge?usp=sharing)

Speech feature extraction, explaining how Mel-spectrograms and Mel-frequency cepstral coefficients (MFCCs) work: https://blog.csdn.net/weixin_50547200/article/details/117294164

Signal-to-noise ratio (SNR)
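
SNR compares signal power with noise power and is usually reported in decibels, as 10·log10(P_signal / P_noise). The snippet below is a minimal sketch of that formula; the function name `snr_db` and the sample values are made up for illustration and are not part of the assignment code.

```python
import math

def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels for two sample lists."""
    # Average power is the mean of the squared samples.
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

# Toy values (made up): the signal amplitude is 10x the noise amplitude,
# so the power ratio is about 100.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(f"{snr_db(clean, noise):.1f} dB")
```

A tenfold amplitude ratio corresponds to roughly 20 dB, which is why the decibel scale is convenient for comparing recordings.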

In [336]:
!wget https://github.com/MachineLearningHW/ML_HW4_Dataset/releases/latest/download/Dataset.tar.gz.partaa
!wget https://github.com/MachineLearningHW/ML_HW4_Dataset/releases/latest/download/Dataset.tar.gz.partab
!wget https://github.com/MachineLearningHW/ML_HW4_Dataset/releases/latest/download/Dataset.tar.gz.partac
!wget https://github.com/MachineLearningHW/ML_HW4_Dataset/releases/latest/download/Dataset.tar.gz.partad

!cat Dataset.tar.gz.part* > Dataset.tar.gz

# unzip the file
!tar zxvf Dataset.tar.gz
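
The `cat` and `tar` steps above assume a Unix-like shell. Where those tools are missing (for example a plain Windows command prompt), the same join-and-extract step can be sketched with the Python standard library, assuming the four parts have already been downloaded some other way. The helper name `join_and_extract` is hypothetical.

```python
import shutil
import tarfile
from pathlib import Path

def join_and_extract(part_paths, archive_path, out_dir="."):
    """Concatenate split archive parts in order, then extract the tar.gz."""
    with open(archive_path, "wb") as archive:
        for part in sorted(part_paths):  # partaa, partab, ... sort alphabetically
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, archive)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out_dir)

if __name__ == "__main__":
    parts = sorted(Path(".").glob("Dataset.tar.gz.part*"))
    if parts:  # only runs when the parts are present in the working directory
        join_and_extract(parts, "Dataset.tar.gz")
```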



## Fix Random Seed

In [337]:
import numpy as np
import torch
import random

# Fix the random seed so that results are reproducible
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(87)

# Data

## Dataset
- The original dataset is [VoxCeleb2](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html).
- See the [license](https://creativecommons.org/licenses/by/4.0/) and its [complete version](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/files/license.txt) for VoxCeleb2.
- We randomly select 600 speakers from VoxCeleb2.
- The raw waveforms are then preprocessed into mel-spectrograms.

- Args:
  - data_dir: The path to the data directory.
  - metadata_path: The path to the metadata.
  - segment_len: The length of audio segment for training. 
- The architecture of data directory \\
  - data directory \\
  |---- metadata.json \\
  |---- testdata.json \\
  |---- mapping.json \\
  |---- uttr-{random string}.pt \\

- The information in metadata
  - "n_mels": The dimension of the mel-spectrogram.
  - "speakers": A dictionary.
    - Key: speaker id.
    - Value: a list of utterances, each with a "feature_path" and a "mel_len".
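
To make the layout concrete, here is a small sketch that parses a hypothetical miniature of `metadata.json` and flattens it into (feature path, speaker) pairs, much as the dataset class further down does. The speaker ids, file names, and lengths are invented for illustration.

```python
import json

# Hypothetical miniature of metadata.json; ids, file names, and lengths are made up.
metadata_text = """
{
  "n_mels": 40,
  "speakers": {
    "id00001": [
      {"feature_path": "uttr-aaaa.pt", "mel_len": 435},
      {"feature_path": "uttr-bbbb.pt", "mel_len": 490}
    ],
    "id00002": [
      {"feature_path": "uttr-cccc.pt", "mel_len": 317}
    ]
  }
}
"""
metadata = json.loads(metadata_text)

# Flatten to (feature_path, speaker) pairs: one entry per utterance.
pairs = [
    (utt["feature_path"], speaker)
    for speaker, utterances in metadata["speakers"].items()
    for utt in utterances
]
print(len(metadata["speakers"]), "speakers,", len(pairs), "utterances")
```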


For efficiency, the mel-spectrograms are cut into fixed-length segments during training.
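
The cropping idea can be sketched on a plain list standing in for the frames of one mel-spectrogram. The helper name `random_segment` is hypothetical; the logic follows the dataset class below, where longer utterances are cropped at a random offset and shorter ones are left for later padding.

```python
import random

def random_segment(frames, segment_len):
    """Return a random contiguous window of segment_len frames (or all frames if shorter)."""
    if len(frames) <= segment_len:
        return frames
    # random.randint includes both endpoints, so the last valid start is len - segment_len.
    start = random.randint(0, len(frames) - segment_len)
    return frames[start:start + segment_len]

frames = list(range(555))            # stand-in for a 555-frame mel-spectrogram
segment = random_segment(frames, 128)
print(len(segment), segment[0], segment[-1])
```

Because a fresh offset is drawn on every call, the same utterance yields a different window each epoch, which acts as a light form of data augmentation.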

Code walkthrough: https://geek.csdn.net/65bc959fb8e5f01e1e45cdf2.html?dp_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6MzczMTA3MCwiZXhwIjoxNzI0MDU3NTczLCJpYXQiOjE3MjM0NTI3NzMsInVzZXJuYW1lIjoiY2hlbmJvdyJ9.kF9J7xhF7ZsX-vb-rJT_2FRD7K9iaCnBeefQVVrJx-Y

### Build a dataset: organize the .pt files by speaker index and bring all mel-spectrograms to a common length

In [338]:
import os
import json
import torch
import random
from pathlib import Path
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence
 
class myDataset(Dataset):
	# Build self.data from mapping.json and metadata.json: a list of [feature path, training label] pairs.
	def __init__(self, data_dir, segment_len=128):
		self.data_dir = data_dir
		self.segment_len = segment_len
	
		# Load the mapping from speaker name to the corresponding id.
		mapping_path = Path(data_dir) / "mapping.json"
		# On Python's json.loads() and json.dumps(): https://blog.csdn.net/m0_72524813/article/details/132420537#:~:text=json.loads()%20%E6%98%AFPython,%E7%A8%8B%E5%BA%8F%E4%B9%8B%E9%97%B4%E4%BC%A0%E9%80%92%E6%95%B0%E6%8D%AE%E3%80%82
		with mapping_path.open() as f:
			mapping = json.load(f)
		self.speaker2id = mapping["speaker2id"]
	
		# Load metadata of training data.
		# json.load (standard library) parses JSON from a file object into Python data structures.
		metadata_path = Path(data_dir) / "metadata.json"
		with open(metadata_path) as f:
			metadata = json.load(f)["speakers"]
	
		# Get the total number of speakers.
		# metadata.keys(): dict_keys(['id03074', 'id05623', 'id06406', 'id01014', 'id02426', 'id01503', ...])
		self.speaker_num = len(metadata.keys())
		self.data = []
		for speaker in metadata.keys():
			for utterances in metadata[speaker]:
				# if self.speaker2id[speaker] == 436: # this speaker is id03074
					# everything id03074 said
					# print('utterances["feature_path"]', utterances["feature_path"]) # uttr-18e375195dc146fd8d14b8a322c29b90.pt, uttr-da9917d5853049178487c065c9e8b718.pt...
					# print('self.speaker2id[speaker]', self.speaker2id[speaker]) # 436
				self.data.append([utterances["feature_path"], self.speaker2id[speaker]])
 
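	# Return the number of samples in the dataset; one speaker has several utterances.
	def __len__(self):
		return len(self.data)
	
	# Load the .pt features for one utterance, crop them at random if they are longer than segment_len, and return (mel, speaker): the features and their label.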
	def __getitem__(self, index):
		feat_path, speaker = self.data[index]
		# print('feat_path', feat_path) # feat_path uttr-159e0381f24f40eda0b240c27f26e853.pt
		# print('speaker', speaker) # speaker 46
		# Load preprocessed mel-spectrogram.
		# torch.load() restores tensors or models that were saved to disk.
		mel = torch.load(os.path.join(self.data_dir, feat_path))

		# Opening a .pt file in Netron shows its structure, e.g. tensor.float32[623, 40].

		# print('mel', mel) # each mel comes from a different utterance .pt file
		# print('len(mel)', len(mel))
		# print('mel.shape', mel.shape)
		# The length of each file is recorded in metadata.json / testdata.json.
		'''
			mel tensor([[ -2.7560,  -2.5961,  -4.3199,  ...,  -3.6494,  -4.5743,  -5.1913],
					[ -1.0380,   0.8124,   1.3774,  ...,  -3.4867,  -3.9669,  -4.8328],
					[  1.0720,   3.0578,   3.0533,  ...,  -3.7178,  -3.9180,  -6.3955],
					...,
					[ -7.9589,  -8.1239,  -7.5815,  ...,  -9.3548,  -9.3447,  -9.3637],
					[-20.7233, -20.4128, -19.7443,  ..., -20.3118, -20.7233, -20.7233],
					[-20.7233, -20.7233, -20.7233,  ..., -20.7233, -20.7233, -20.7233]])

			len(mel) 555 (mel lengths vary; this is just one example)
			mel.shape torch.Size([555, 40])
		'''

		# segment_len: The length of audio segment for training.
		# Segment mel-spectrogram into "segment_len" frames.
		if len(mel) > self.segment_len:

			# Crop every longer mel (utterance tensor) to the same length.
			# random.randint(a, b) returns a random integer in [a, b], both ends included,
			# so the largest possible start still leaves a full segment_len window.
			# len(mel) counts rows (frames), so the slice below selects rows.
			start = random.randint(0, len(mel) - self.segment_len)
			mel = torch.FloatTensor(mel[start:start+self.segment_len])
		else:
			# e.g. lengths 91, 116, ...
			# Utterances shorter than segment_len are padded later by pad_sequence.
			mel = torch.FloatTensor(mel)
		# Turn the speaker id into long for computing loss later.
		speaker = torch.FloatTensor([speaker]).long() # tensor([46]); .long() gives a 64-bit integer tensor

		# print('len(mel)', len(mel)) # at most 128, the configured segment_len
		# print('mel.shape', mel.shape) # torch.Size([128, 40])

		return mel, speaker
 
	def get_speaker_number(self):
		return self.speaker_num

## Dataloader
- Split dataset into training dataset(90%) and validation dataset(10%).
- Create dataloader to iterate the data.

The data loader, which also splits the dataset.

In [339]:
import torch
from torch.utils.data import DataLoader, random_split
from torch.nn.utils.rnn import pad_sequence

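# Stacking samples into a batch fails when the elements differ in length, e.g. multi-hot or sequence data.
# Such data has to be padded to a common length before stacking, and the default batching has no step for that.
# Some operations also need to work on a whole batch. collate_fn is the function that assembles the drawn samples by hand.

# Purpose: make the feature sizes within a batch uniform.
def collate_batch(batch):
	# Process features within a batch.
	"""Collate a batch of data."""
	# zip(*batch) regroups the (mel, speaker) pairs into one tuple of mels and one tuple of speakers.
	mel, speaker = zip(*batch) # mel and speaker are tuples, which have no size attribute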
	# len(mel) 32
	# mel (tensor([[-0.7146,  0.3835, -0.2590,  ..., -1.3224, -2.0467, -5.4494],
	#         [-2.2641,  0.4378,  0.6055,  ..., -1.4691, -1.9975, -5.6835],
	#         [-1.0164,  0.4325, -0.0409,  ..., -1.8047, -2.4112, -5.6914],
	#         ...,
	#         [-1.6105,  0.2047,  2.2918,  ..., -0.5493, -0.1460, -1.6393],
	#         [-2.1356,  0.6594,  1.9534,  ..., -0.4245, -0.5643, -2.2908],
	#         [-1.9108, -0.4720,  1.8428,  ..., -0.4028, -0.6166, -2.7526]]), tensor([[-2.8393, -3.3396, -1.2337,  ..., -4.0980, -3.7093, -4.0470],
	#         [-2.7348, -1.9792, -0.8202,  ..., -3.9078, -4.0070, -4.1508],
	#         [-0.7501, -1.1022,  0.0266,  ..., -4.1193, -3.8542, -4.0073],
	#         ...,
	#         [-1.7079, -1.5688, -2.2516,  ..., -4.3531, -4.0317, -5.1432],
	#         [-2.1183, -1.8093, -1.7475,  ..., -4.3342, -4.1762, -4.3311],
	#         [-2.4244, -2.8033, -2.1104,  ..., -1.0629, -1.8171, -3.1625]]), tensor([[-3.8964, -2.8853,  3.6164,  ..., -0.3559, -1.7570, -4.5594],
	#         [-3.7899, -3.5051,  3.6453,  ...,  0.0273, -1.3136, -4.6941],
	#         [-3.8008, -3.2892,  3.6823,  ..., -1.0819, -2.3080, -4.8499],
	#         ...,
	#         [-4.3418,  0.3452,  1.3784,  ..., -0.9108, -1.8571, -3.8895],
	#         [-4.1228,  0.8943,  2.1872,  ..., -2.6148, -3.4352, -4.7241],
	#         [-3.4088,  1.2481,  2.3568,  ..., -3.2097, -3.7072, -5.2161]]), tensor([[ 1.6985,  3.0389,  2.3919,  ..., -6.8099, -7.4328, -7.9691],
	#         [ 1.3528,  2.7300,  2.0499,  ..., -6.3623, -6.8602, -7.9829],
	#         [ 1.3052,  2.3795,  1.3503,  ..., -7.2043, -7.1261, -8.2887],
	#         ...,
	#         [ 0.3529,  1.5088,  0.9150,  ..., -6.8059, -7.5539, -7.5360],
	#         [-0.8137,  0.4170,  0.1198,  ..., -6.7557, -7.6424, -7.7298],
	#         [ 0.5288, -0.5472, -1.8947,  ..., -7.1280, -6.5842, -7.4351]]), tensor([[-6.1522, -5.0040, -2.9372,  ..., -5.1671, -4.7596, -5.7620],
	# ...
	#         [ 3.7019,  3.5262,  3.3843,  ..., -4.9649, -5.2592, -4.5914],
	#         [ 0.7989,  0.8617,  0.7932,  ..., -6.3390, -6.3799, -5.8654],
	#         [-8.6659, -8.4837, -7.7574,  ..., -9.0995, -8.8258, -7.9695]]))
	# speaker (tensor([517]), tensor([372]), tensor([172]), tensor([472])...)
	
	# Because we train the model batch by batch, we need to pad the features in the same batch to make their lengths the same.

	# pad_sequence() is PyTorch's function for padding sequences. Input sequences often differ in length, so they are padded to a common length to allow batch processing.
	# A quick, example-based guide to pad_sequence, pack_padded_sequence and pad_packed_sequence: https://blog.csdn.net/qq_43391414/article/details/123289492
	# mel holds the utterance tensors
	mel = pad_sequence(mel, batch_first=True, padding_value=-20)    # pad with log 10^(-20), which is a very small value.
	# mel.size(): torch.Size([32, 128, 40])
	# mel: (batch size, length, 40)
	return mel, torch.FloatTensor(speaker).long() # .long() gives 64-bit integers

def get_dataloader(data_dir, batch_size, n_workers):
	"""Generate dataloader"""
	dataset = myDataset(data_dir)
	speaker_num = dataset.get_speaker_number()
	# Split dataset into training dataset and validation dataset
	trainlen = int(0.9 * len(dataset))
	lengths = [trainlen, len(dataset) - trainlen] # training set size, validation set size
	trainset, validset = random_split(dataset, lengths)

	train_loader = DataLoader(
		trainset,
		batch_size=batch_size,
		shuffle=True,
		drop_last=True,
		num_workers=n_workers,
		pin_memory=True,
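		# collate_fn lets us customize how samples are assembled into a batch, which helps with complex data structures and preprocessing.
		# If collate_fn is not given, PyTorch's default is used; either way the function is called with one argument, the list of samples in the batch.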
		collate_fn=collate_batch,
	)
	valid_loader = DataLoader(
		validset,
		batch_size=batch_size,
		num_workers=n_workers,
		drop_last=True,
		pin_memory=True,
		collate_fn=collate_batch,
	)

	return train_loader, valid_loader, speaker_num
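
The padding done inside `collate_batch` can be seen on a toy batch. This is a small sketch with all-zero tensors standing in for real mel-spectrograms; the lengths 2, 4, and 3 are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy "utterances" with 2, 4, and 3 frames of 40 features each.
mels = [torch.zeros(2, 40), torch.zeros(4, 40), torch.zeros(3, 40)]

# Every sequence is padded with -20 up to the longest length in the batch.
batch = pad_sequence(mels, batch_first=True, padding_value=-20)
print(batch.shape)         # torch.Size([3, 4, 40])
print(batch[0, 2:, 0])     # padded frames of the shortest sequence: tensor([-20., -20.])
```

Padding only reaches the longest sequence in the current batch, so batches made entirely of short utterances stay short.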

<b>The elements of a batch list must be of the same type, but the DataLoader does not look at what those elements are made of; it only hands back batches.</b>

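A data loader splits the dataset into batches, each holding several samples. When it builds a batch it does not require every sample to have identical internals, only that the samples are of the same kind.

For example, mel-spectrograms of different lengths can share a batch, provided the `collate_fn` pads them to a common length before they are stacked.

Keeping every sample to one consistent type lets the data loader assemble batches reliably and present a uniform interface to the model during training.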

<!-- ![image.png](./1.jpeg) -->

<img src="./1.jpeg"  width="600" />


# Model
- TransformerEncoderLayer:
  - Base transformer encoder layer in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
  - Parameters:
    - d_model: the number of expected features of the input (required).

    - nhead: the number of heads of the multiheadattention models (required).

    - dim_feedforward: the dimension of the feedforward network model (default=2048).

    - dropout: the dropout value (default=0.1).

    - activation: the activation function of intermediate layer, relu or gelu (default=relu).

- TransformerEncoder:
  - TransformerEncoder is a stack of N transformer encoder layers
  - Parameters:
    - encoder_layer: an instance of the TransformerEncoderLayer() class (required).

    - num_layers: the number of sub-encoder-layers in the encoder (required).

    - norm: the layer normalization component (optional).
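
As a rough illustration of how these parameters fit together, the sketch below builds one encoder layer and stacks two copies of it. The specific values (`d_model=80`, `nhead=2`, `dim_feedforward=256`, `num_layers=2`) are chosen for illustration and are not tuned settings.

```python
import torch
import torch.nn as nn

d_model = 80  # must be divisible by nhead

# One encoder layer, then a stack of two such layers.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, dim_feedforward=256, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(128, 32, d_model)   # (length, batch size, d_model); batch_first defaults to False
out = encoder(x)
print(out.shape)                    # torch.Size([128, 32, 80])
```

The encoder keeps the input shape, so depth and width can be changed without touching the layers that come before or after it.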

Transformer implementation and PyTorch source walkthrough (part 1), data input: https://blog.csdn.net/weixin_41806489/article/details/128380667?spm=1001.2014.3001.5502

Build a Transformer model

In [340]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
	def __init__(self, d_model=80, n_spks=600, dropout=0.1):
		super().__init__()
		# Project the dimension of features from that of input into d_model.
		self.prenet = nn.Linear(40, d_model)
		# TODO:
		#   Change Transformer to Conformer.
		#   https://arxiv.org/abs/2005.08100
		# Conformer: Convolution-augmented Transformer for Speech Recognition
		# nn.TransformerEncoderLayer explained in detail, with usage: https://blog.csdn.net/qlkaicx/article/details/138316578?spm=1001.2101.3001.6661.1&utm_medium=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-138316578-blog-109279616.235%5Ev43%5Epc_blog_bottom_relevance_base1&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-138316578-blog-109279616.235%5Ev43%5Epc_blog_bottom_relevance_base1&utm_relevant_index=1
		
		'''
		Args:
			d_model: the number of expected features in the input (required).
			nhead: the number of heads in the multiheadattention models (required).
			dim_feedforward: the dimension of the feedforward network model (default=2048).
			dropout: the dropout value (default=0.1).
			activation: the activation function of the intermediate layer, can be a string
				("relu" or "gelu") or a unary callable. Default: relu
			layer_norm_eps: the eps value in layer normalization components (default=1e-5).
			batch_first: If ``True``, then the input and output tensors are provided
				as (batch, seq, feature). Default: ``False``.
			norm_first: if ``True``, layer norm is done prior to attention and feedforward
				operations, respectively. Otherwise it's done after. Default: ``False`` (after).
		'''
		self.encoder_layer = nn.TransformerEncoderLayer(
			d_model=d_model, dim_feedforward=256, nhead=2
		)
		# self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)

		# Project the dimension of features from d_model into speaker nums.
		# Maps the Transformer output to the final number of classes.
		self.pred_layer = nn.Sequential( 
			nn.Linear(d_model, d_model),
			nn.ReLU(),
			nn.Linear(d_model, n_spks),
		)

	def forward(self, mels):
		# The batch size is set to 32
		# print('mels.shape', mels.shape) # torch.Size([32, 128, 40])
		"""
		args:
			mels: (batch size, length, 40)
		return:
			out: (batch size, n_spks)
		"""
		# out: (batch size, length, d_model)
		out = self.prenet(mels) # out 是 tensor 
		# With "batch_size": 32 and segment_len=128 (each mel cropped or padded to 128 frames)
		# mel.shape: torch.Size([128, 40])
		# out.shape: torch.Size([32, 128, 80])
		
		# out: (length, batch size, d_model)
		out = out.permute(1, 0, 2)
		# The encoder layer expect features in the shape of (length, batch size, d_model).
		out = self.encoder_layer(out)
		# out: (batch size, length, d_model)
		out = out.transpose(0, 1)
		# mean pooling
		stats = out.mean(dim=1)

		# out: (batch, n_spks)
		out = self.pred_layer(stats)
		return out
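
The mean pooling step in `forward` weights every frame equally. For the Boss baseline, one common formulation of self-attention pooling learns a score per frame and takes a softmax-weighted average instead. The module below is a minimal sketch under that assumption; the class name `SelfAttentionPooling` is hypothetical, it is not necessarily identical to the formulation in the linked paper, and it does not mask padded frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPooling(nn.Module):
    """Weighted average over time, with weights computed from the frames themselves."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # one scalar score per frame

    def forward(self, x):
        # x: (batch size, length, d_model)
        weights = F.softmax(self.score(x).squeeze(-1), dim=1)   # (batch size, length)
        return torch.sum(x * weights.unsqueeze(-1), dim=1)      # (batch size, d_model)

pool = SelfAttentionPooling(d_model=80)
stats = pool(torch.randn(32, 128, 80))
print(stats.shape)  # torch.Size([32, 80])
```

The output has the same shape as `out.mean(dim=1)`, so a module like this could take the place of the mean pooling line without changing `pred_layer`.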

# Learning rate schedule
- For transformer architecture, the design of learning rate schedule is different from that of CNN.
- Previous works show that the warmup of learning rate is useful for training models with transformer architectures.
- The warmup schedule
  - Set learning rate to 0 in the beginning.
  - The learning rate increases linearly from 0 to initial learning rate during warmup period.

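Learning rate warmup is an optimization strategy for training deep learning models: the learning rate is raised gradually at the start of training so the model can settle into the training process, which improves stability and convergence speed.

During warmup, the learning rate starts from a small value (0 or close to 0) and increases linearly (or non-linearly) up to the initial learning rate set in the optimizer. Later in training the learning rate is gradually reduced until it reaches 0. That later decay is not part of warmup itself; it belongs to learning rate scheduling more broadly.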
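
To see the shape of such a schedule, the sketch below computes the multiplier applied to the base learning rate at a few steps: a linear ramp during warmup followed by a half-cosine decay. It mirrors the formula used by `get_cosine_schedule_with_warmup` in this notebook; the standalone helper name `lr_multiplier` is hypothetical, and the step counts come from the training config further down.

```python
import math

def lr_multiplier(step, warmup_steps, total_steps, num_cycles=0.5):
    """Factor applied to the optimizer's base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear ramp from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

# 1000 warmup steps out of 70000 total steps.
for step in (0, 500, 1000, 35500, 70000):
    print(step, round(lr_multiplier(step, 1000, 70000), 3))
```

The multiplier reaches 1 exactly when warmup ends, passes one half midway through the decay, and returns to 0 at the final step.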

In [341]:
# Not part of this assignment's code
# transformer
import torch
embedding = torch.nn.Embedding(4, 3)
embedding

Embedding(4, 3)

Set up the learning rate schedule

In [342]:
import math

import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR


def get_cosine_schedule_with_warmup(
	optimizer: Optimizer,
	num_warmup_steps: int,
	num_training_steps: int,
	num_cycles: float = 0.5,
	last_epoch: int = -1,
):
	"""
	Create a schedule with a learning rate that decreases following the values of the cosine function between the
	initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the
	initial lr set in the optimizer.

	Args:
		optimizer (:class:`~torch.optim.Optimizer`):
		The optimizer for which to schedule the learning rate.
		num_warmup_steps (:obj:`int`):
		The number of steps for the warmup phase.
		num_training_steps (:obj:`int`):
		The total number of training steps.
		num_cycles (:obj:`float`, `optional`, defaults to 0.5):
		The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
		following a half-cosine).
		last_epoch (:obj:`int`, `optional`, defaults to -1):
		The index of the last epoch when resuming training.

	Return:
		:obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
	"""
	def lr_lambda(current_step):
		# Warmup
		if current_step < num_warmup_steps: # still inside the warmup phase
			# Multiplier = current step / warmup steps (denominator kept >= 1),
			# so the learning rate ramps up from 0 until the end of warmup.
			return float(current_step) / float(max(1, num_warmup_steps))
		# Decay
		progress = float(current_step - num_warmup_steps) / float(
			max(1, num_training_steps - num_warmup_steps)
		)
		# cos() lies in [-1, 1]; 0.5 * (1 + cos) maps it to [0, 1], which is used as the learning rate multiplier.
		return max(
			0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
		)

	return LambdaLR(optimizer, lr_lambda, last_epoch)

# Model Function
- Model forward function.

Predict the class

In [343]:
import torch


def model_fn(batch, model, criterion, device):
	"""Forward a batch through the model."""

	mels, labels = batch
	mels = mels.to(device)
	labels = labels.to(device)

	outs = model(mels)

	loss = criterion(outs, labels)

	# Get the speaker id with highest probability.
	# argmax(1) finds, for each sample (each row), the index of the largest score; that index is the class the model considers most likely.
	preds = outs.argmax(1)
	# Compute accuracy.
	accuracy = torch.mean((preds == labels).float())

	return loss, accuracy
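
The prediction and accuracy logic in `model_fn` can be traced by hand on plain lists. This is a sketch with made-up scores for three utterances and four speakers; it reproduces what `outs.argmax(1)` and the mean of `preds == labels` compute, without needing PyTorch.

```python
# Toy scores for 3 utterances and 4 speakers (values made up).
outs = [
    [0.1, 2.3, -1.0, 0.4],
    [1.9, 0.2, 0.3, 0.0],
    [0.0, 0.1, 0.2, 3.1],
]
labels = [1, 2, 3]

# Equivalent of outs.argmax(1): index of the largest score in each row.
preds = [max(range(len(row)), key=row.__getitem__) for row in outs]
# Equivalent of torch.mean((preds == labels).float()).
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

Here the second utterance is misclassified (predicted 0, labelled 2), so two of three predictions are correct.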

# Validate
- Calculate accuracy of the validation set.

In [344]:
from tqdm import tqdm
import torch


def valid(dataloader, model, criterion, device): 
	"""Validate on validation set."""

	model.eval()
	running_loss = 0.0
	running_accuracy = 0.0
	pbar = tqdm(total=len(dataloader.dataset), ncols=0, desc="Valid", unit=" uttr")

	for i, batch in enumerate(dataloader):
		with torch.no_grad():
			loss, accuracy = model_fn(batch, model, criterion, device)
			running_loss += loss.item()
			running_accuracy += accuracy.item()

		pbar.update(dataloader.batch_size)
		pbar.set_postfix(
			loss=f"{running_loss / (i+1):.2f}",
			accuracy=f"{running_accuracy / (i+1):.2f}",
		)

	pbar.close()
	model.train()

	return running_accuracy / len(dataloader)

# Main function

The code below produces the progress display shown in the following image

```python
		pbar.set_postfix(
			loss=f"{batch_loss:.2f}",
			accuracy=f"accuracy--{batch_accuracy:.2f}",
			step=step + 1,
		)
```

![image.png](1.png)

In [345]:
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, random_split


def parse_args():
	"""arguments"""
	config = {
		"data_dir": "./Dataset",
		"save_path": "model.ckpt",
		"batch_size": 32,
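		# Keep num_workers=0 here: on this machine any value > 0 failed with a "broken pipe" error.
		# (num_workers sets the number of data-loading worker processes; it does not depend on a GPU.)
		# "n_workers": 8,
		"n_workers": 0,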
		"valid_steps": 2000,
		"warmup_steps": 1000,
		"save_steps": 10000,
		"total_steps": 70000,
	}

	return config


def main(
	data_dir,
	save_path,
	batch_size,
	n_workers,
	valid_steps,
	warmup_steps,
	total_steps,
	save_steps,
):
	"""Main function."""
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	print(f"[Info]: Use {device} now!")

	train_loader, valid_loader, speaker_num = get_dataloader(data_dir, batch_size, n_workers) # data_dir comes from the config: "./Dataset"
	# As stated at the start, 600 speakers were randomly selected from Voxceleb2; Classifier also defaults to n_spks=600.
	# print('speaker_num', speaker_num) # 600
	train_iterator = iter(train_loader)
	print(f"[Info]: Finish loading data!",flush = True)

	model = Classifier(n_spks=speaker_num).to(device)
	criterion = nn.CrossEntropyLoss()
	optimizer = AdamW(model.parameters(), lr=1e-3)
	scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
	print(f"[Info]: Finish creating model!",flush = True)

	best_accuracy = -1.0
	best_state_dict = None

	pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

	for step in range(total_steps):
		# Get data
		try:
			batch = next(train_iterator)
		except StopIteration:
			train_iterator = iter(train_loader)
			batch = next(train_iterator)

		loss, accuracy = model_fn(batch, model, criterion, device)
		batch_loss = loss.item()
		batch_accuracy = accuracy.item()

		# Update model
		loss.backward()
		optimizer.step()
		scheduler.step()
		optimizer.zero_grad()

		# Log
		# Refresh the progress bar display
		pbar.update()
		pbar.set_postfix(
			loss=f"{batch_loss:.2f}",
			accuracy=f"{batch_accuracy:.2f}",
			step=step + 1,
		)

		# Do validation
		if (step + 1) % valid_steps == 0:
			pbar.close()

			valid_accuracy = valid(valid_loader, model, criterion, device)

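			# keep the best model
			if valid_accuracy > best_accuracy:
				best_accuracy = valid_accuracy
				# state_dict() returns references to the live parameters, so clone them to freeze this snapshot.
				best_state_dict = {k: v.detach().clone() for k, v in model.state_dict().items()}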

			pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

		# Save the best model so far.
		if (step + 1) % save_steps == 0 and best_state_dict is not None:
			torch.save(best_state_dict, save_path)
			pbar.write(f"Step {step + 1}, best model saved. (accuracy={best_accuracy:.4f})")

	pbar.close()


if __name__ == "__main__":
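	# parse_args() here just returns a config dict (the name follows the argparse convention); ** unpacks it into keyword arguments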
	main(**parse_args())

[Info]: Use cpu now!
[Info]: Finish loading data!
[Info]: Finish creating model!

KeyboardInterrupt: 
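
The training loop counts steps rather than epochs, so it restarts the data iterator whenever a pass over the data ends. That pattern can be sketched in isolation; the generator name `take_steps` is hypothetical, and a plain list stands in for the `DataLoader`.

```python
def take_steps(loader, total_steps):
    """Yield exactly total_steps batches, restarting the loader whenever it runs out."""
    iterator = iter(loader)
    for _ in range(total_steps):
        try:
            batch = next(iterator)
        except StopIteration:
            iterator = iter(loader)   # start a new pass over the data
            batch = next(iterator)
        yield batch

# A list stands in for a DataLoader here; any non-empty re-iterable object works.
print(list(take_steps(["a", "b", "c"], 7)))
```

With a shuffling `DataLoader`, each call to `iter(loader)` draws a new order, so every pass sees the batches differently.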

# Inference

## Dataset of inference

In [None]:
import os
import json
import torch
from pathlib import Path
from torch.utils.data import Dataset


class InferenceDataset(Dataset):
	def __init__(self, data_dir):
		testdata_path = Path(data_dir) / "testdata.json"
		metadata = json.load(testdata_path.open())
		self.data_dir = data_dir
		self.data = metadata["utterances"]

	def __len__(self):
		return len(self.data)

	def __getitem__(self, index):
		utterance = self.data[index]
		feat_path = utterance["feature_path"]
		mel = torch.load(os.path.join(self.data_dir, feat_path))

		return feat_path, mel


def inference_collate_batch(batch):
	"""Collate a batch of data."""
	feat_paths, mels = zip(*batch)

	return feat_paths, torch.stack(mels)

## Main function of Inference

In [None]:
import json
import csv
from pathlib import Path
from tqdm.notebook import tqdm

import torch
from torch.utils.data import DataLoader

def parse_args():
	"""arguments"""
	config = {
		"data_dir": "./Dataset",
		"model_path": "./model.ckpt",
		"output_path": "./output.csv",
	}

	return config

def main(
	data_dir,
	model_path,
	output_path,
):
	"""Main function."""
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	print(f"[Info]: Use {device} now!")

	mapping_path = Path(data_dir) / "mapping.json"
	mapping = json.load(mapping_path.open())

	dataset = InferenceDataset(data_dir) # the custom Dataset class defined above
	dataloader = DataLoader(
		dataset,
		batch_size=1,
		shuffle=False,
		drop_last=False,
		# num_workers=8, 
		num_workers=0, 
		collate_fn=inference_collate_batch,
	)
	print(f"[Info]: Finish loading data!",flush = True)

	speaker_num = len(mapping["id2speaker"])
	model = Classifier(n_spks=speaker_num).to(device)
	model.load_state_dict(torch.load(model_path, map_location=device))
	model.eval()
	print(f"[Info]: Finish creating model!",flush = True)

	results = [["Id", "Category"]]
	for feat_paths, mels in tqdm(dataloader):
		with torch.no_grad():
			mels = mels.to(device)
			outs = model(mels)
			# argmax(1) picks, for each sample (row), the index of the largest score, i.e. the most likely class.
			preds = outs.argmax(1).cpu().numpy()
			for feat_path, pred in zip(feat_paths, preds):
				results.append([feat_path, mapping["id2speaker"][str(pred)]])

	with open(output_path, 'w', newline='') as csvfile:
		writer = csv.writer(csvfile)
		writer.writerows(results)


if __name__ == "__main__":
	print('parse_args()', parse_args()) # prints: {'data_dir': './Dataset', 'model_path': './model.ckpt', 'output_path': './output.csv'}
	main(**parse_args())

	'''
		1. `parse_args()`: a function call. The name follows the `argparse` convention for parsing command-line arguments, but here the function simply returns a dict of settings.

		2. `**`: the unpacking operator. In a function call it passes each key/value pair of a dict as a keyword argument.

		3. `main(**parse_args())`: calls `main`, with the dict returned by `parse_args()` unpacked into keyword arguments.

		In short, `main(**parse_args())` passes the configuration to `main` as keyword arguments, which keeps the code concise and easy to maintain.
	'''

parse_args() {'data_dir': './Dataset', 'model_path': './model.ckpt', 'output_path': './output.csv'}
[Info]: Use cpu now!
[Info]: Finish loading data!
[Info]: Finish creating model!


  0%|          | 0/8000 [00:00<?, ?it/s]
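
The `main(**parse_args())` call used in both the training and inference cells relies on dictionary unpacking. A reduced sketch of the pattern, using a two-key config for brevity (the reduced `main` signature is for illustration only):

```python
def parse_args():
    """Return a plain dict of settings (no command-line parsing involved)."""
    return {"data_dir": "./Dataset", "output_path": "./output.csv"}

def main(data_dir, output_path):
    return f"read {data_dir}, write {output_path}"

# ** turns each key/value pair into a keyword argument, so
# main(**parse_args()) is the same call as main(data_dir=..., output_path=...).
print(main(**parse_args()))
```

The dict keys must match the parameter names of `main` exactly; a missing or extra key raises a `TypeError` at call time.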