# Task 1, solution by D. Uzhva

## Introduction

### Task statement

Implement the approach described in the paper ["Speaker Recognition from Raw Waveform with SincNet"](https://arxiv.org/abs/1808.00158).
Set up experiments and compare the results with those reported in the paper.

### Paper review

The paper proposes SincNet, a one-dimensional convolutional neural network (CNN) for speaker recognition.
Compared to a classical 1-D CNN, SincNet differs in its first layer: each convolutional kernel is constrained to samples of the reference function
$$
    g[n, f_1, f_2] = 2 f_2 \text{sinc}(2 \pi f_2 n) - 2 f_1 \text{sinc}(2 \pi f_1 n),
$$ 
where $f_1$ and $f_2$ are the low and high cutoff frequencies and the [$\text{sinc}$](https://en.wikipedia.org/wiki/Sinc_function) function is defined as
$$
    \text{sinc}(x) = \frac{\sin(x)}{x}.
$$
A layer built from such kernels is called SincConv.
In the frequency domain, each such kernel is a band-pass filter: $g$ is the difference of two $\text{sinc}$ low-pass filters with cutoffs $f_2$ and $f_1$, so it passes only the band between them.
The SincConv layer is therefore expected to extract more meaningful features than filters with arbitrary values.
Indeed, band-pass filtering with $\text{sinc}$ functions can isolate the pitch and formants in a speech sample, so that the subsequent layers learn from simpler yet more informative features.
In other words, band-pass filtering acts as a learnable preprocessing step before the main CNN pipeline.
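
To make the formula concrete, the following is a minimal NumPy sketch of a single kernel, not the code from [model.py](model.py). It assumes cutoff frequencies normalised by the sampling rate (cycles per sample) and an odd kernel length; the function name `sinc_bandpass_kernel` and the values used below are illustrative, and no window is applied to the truncated kernel.

```python
import numpy as np


def sinc_bandpass_kernel(f1, f2, length):
    """Sample g[n, f1, f2] at `length` points centred on n = 0.

    Assumption: f1 and f2 are normalised cutoff frequencies
    (cycles per sample) with 0 <= f1 < f2 <= 0.5.
    """
    # Symmetric sample indices, e.g. -125 ... 125 for length 251.
    n = np.arange(length) - (length - 1) / 2
    # np.sinc(x) is sin(pi x) / (pi x), so sinc(2 pi f n) == np.sinc(2 f n).
    return 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)


# Illustrative values: a filter passing the band between 0.1 and 0.2.
kernel = sinc_bandpass_kernel(0.1, 0.2, 251)

# Magnitude response on 513 frequency bins covering 0 ... 0.5 cycles/sample.
response = np.abs(np.fft.rfft(kernel, 1024))
```

The magnitude response is close to one inside the band between `f1` and `f2` and close to zero outside it, which is the band-pass behaviour described above.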

Moreover, under this restriction a SincConv layer has only $2 F$ learnable parameters (two cutoff frequencies per filter), where $F$ is the number of SincConv filters in the layer.
In comparison, a conventional convolution layer with $F$ filters of length $L$ requires $L \cdot F \gg 2 F$ parameters.
A SincConv layer may therefore improve the overall performance of the network during both training and evaluation.
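
As a quick check of the parameter counts, the sketch below evaluates both expressions for one single-channel layer without bias terms. The values of `F` and `L` are illustrative assumptions and are not taken from the configs in this repository.

```python
def conv_first_layer_params(num_filters, kernel_length):
    """Weights of a conventional single-channel 1-D convolution (no bias)."""
    return num_filters * kernel_length


def sincconv_params(num_filters):
    """Two cutoff frequencies per SincConv filter."""
    return 2 * num_filters


# Illustrative values (assumption): 80 filters of 251 samples each.
F, L = 80, 251
standard = conv_first_layer_params(F, L)  # 20080
sinc = sincconv_params(F)                 # 160
ratio = standard / sinc                   # 125.5
```

Note that the ratio equals $L / 2$, so the saving grows with the kernel length and does not depend on the number of filters.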

### Research questions

In order to verify the approach proposed in the paper, the following research questions are stated.

1. Does a SincConv layer improve performance during training and validation?
2. Does a SincConv layer improve the accuracy of the CNN in speaker identification?

The improvements are measured against a CNN with a conventional first convolution layer.

## Experimental setup

To reproduce the method proposed in the paper, a CNN training pipeline was prepared.
The following packages were used:

* Python 3
* PyTorch
* NumPy
* Pandas
* [SoundFile](https://pypi.org/project/SoundFile/)

The SincConv layer and the network architecture are implemented in [model.py](model.py).
Note that this implementation allows the SincConv layer to accept multi-channel input, in contrast to the single-channel version proposed in the [original paper](https://arxiv.org/abs/1808.00158).
A model can be passed to the `train_model()` function in [trainer.py](trainer.py), which also splits the dataset into training and validation parts.
A dataset is loaded by the corresponding loader in [dataloader.py](dataloader.py).
In [main.py](main.py), one can schedule training tasks for the configs defined in [cfg/](cfg/).
The overall pipeline is outlined in [nn_pipeline.pdf](nn_pipeline.pdf):

![](nn_pipeline.png)
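
The train/validation split performed by the trainer could look roughly like the sketch below. This is only an assumption about its behaviour: the helper name `split_indices`, the validation fraction of 0.2, and the fixed seed are hypothetical, and the actual logic in [trainer.py](trainer.py) may differ.

```python
import random


def split_indices(num_items, val_fraction=0.2, seed=0):
    """Shuffle indices 0..num_items-1 and split them into train and validation lists."""
    indices = list(range(num_items))
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_items * val_fraction))
    return indices[num_val:], indices[:num_val]


# Hypothetical usage: 6300 items with 20% held out for validation.
train_idx, val_idx = split_indices(6300, val_fraction=0.2)
```

Fixing the seed means that models compared against each other are validated on the same held-out items.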

### Data preparation

The experiments use the [TIMIT dataset](https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3).
The dataset consists of 6300 spoken sentences stored in `.WAV` files, recorded by 630 different speakers (10 sentences per speaker).
The speakers can also be divided by gender into two classes.
Accordingly, [timit_make_labels.py](timit_make_labels.py) prepares a dataframe with a label assignment for each sentence.
Two labeling schemes are available: one yields 630 speaker classes, while the other yields two gender classes.
In this way, the same TIMIT dataset was used for both a binary (gender) and a multi-class (speaker) classification task.
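
The two labeling schemes can be sketched as follows. This is a simplified illustration, not the code from [timit_make_labels.py](timit_make_labels.py): it assumes the standard TIMIT directory layout `<split>/<dialect>/<speaker>/<sentence>.WAV`, where the speaker directory name starts with `F` or `M`, and the function names are hypothetical. It returns plain lists rather than a dataframe.

```python
from pathlib import PurePosixPath


def speaker_id(wav_path):
    """Speaker directory name, e.g. 'FCJF0' (assumes the standard TIMIT layout)."""
    return PurePosixPath(wav_path).parent.name


def make_labels(wav_paths, scheme="speaker"):
    """Map each path to an integer class index under the chosen scheme."""
    if scheme == "speaker":
        keys = [speaker_id(p) for p in wav_paths]
        classes = sorted(set(keys))
    elif scheme == "gender":
        # The first letter of a TIMIT speaker ID encodes the gender.
        keys = [speaker_id(p)[0].upper() for p in wav_paths]
        classes = ["F", "M"]
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    index = {name: i for i, name in enumerate(classes)}
    return [index[k] for k in keys]


# Hypothetical example paths following the assumed layout.
paths = [
    "TRAIN/DR1/FCJF0/SA1.WAV",
    "TRAIN/DR1/MDAC0/SA1.WAV",
    "TRAIN/DR1/FCJF0/SA2.WAV",
]
speaker_labels = make_labels(paths, scheme="speaker")  # [0, 1, 0]
gender_labels = make_labels(paths, scheme="gender")    # [0, 1, 0]
```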

## Results

### Performance analysis