The main goal of this project is to present a performance comparison across different input features. We will train the same model, based on a convolutional neural network (CNN), with each feature set and survey how the choice of features affects performance.
We will train a separate network for each of the following features and present a comparison:
- Mel spectrogram: Mel-scaled power spectrogram
- MFCC: Mel-Frequency Cepstral Coefficients
- Chroma STFT: Chromagram from a waveform or power spectrogram
- Chroma CQT: Constant-Q chromagram
- Spectral Contrast: Per-band contrast between spectral peaks and valleys
All of the proposed features are spectral; to extract them we will use the Python library librosa.
[Todo] Detail each transformation.
Our CNN model is based on this reference, and it is as follows:
For each stereo audio signal, we will pre-process the left and right channels independently, extracting the features from each. These two feature sets will be the inputs to two independent CNNs, as described in the previous figure; we will then concatenate their outputs and estimate the category through a softmax regression.
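The two-branch pipeline above can be sketched with the Keras functional API. This is an assumption about the framework, and the layer sizes, branch depth, and class count are placeholders, not the architecture from the cited reference:

```python
# Sketch: one small CNN per stereo channel, outputs concatenated,
# followed by a softmax classification head. All sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

def channel_branch(input_shape):
    """Build one per-channel CNN branch over a (n_bins, n_frames, 1) feature map."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return inp, x

def build_model(input_shape, n_classes):
    """Two independent branches (left/right), concatenated, softmax output."""
    left_in, left_out = channel_branch(input_shape)
    right_in, right_out = channel_branch(input_shape)
    merged = layers.Concatenate()([left_out, right_out])
    out = layers.Dense(n_classes, activation="softmax")(merged)
    return Model(inputs=[left_in, right_in], outputs=out)

# e.g. a 128-bin Mel spectrogram with 44 frames per channel, 10 classes:
model = build_model((128, 44, 1), n_classes=10)
model.summary()
```

The branches share structure but not weights; weight sharing between channels would be an alternative design worth comparing.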