Sound recognition with neural networks
This was an experiment to see how well the Mel Frequency Cepstral Coefficients (MFCC’s) and Chroma analysis are doing in extracting features from audio signals. To do this, I wanted to detect whether some song is by Chet Baker or Beyonce - clearly two very different genres. This turned out horribly difficult to accomplish, so I moved onto simpler data and used raw audio books to detect if some voice is of a women or man. Next, I transformed the preprocessed audio snippets, and fed them all into a neural network to classify different pitches.
- see the soundTransformation directory for my implementation of the Mel Frequency Cepstral Coefficients (MFCC’s). For most of the transformation I used the Python Librosa library
- see all preprocessing in the music directory
chromogram.pyto get the cleaned up sound input from sound_input.wav -> this contains a data array of 1 second CQT transformed clips.
- run network by
- cleaning up and processing of two audiobooks with male & female voices (concatenated two files, trimmed to equal lengths, removed silence below 20 decibels)
- implementation of the sound transformation both with MFCCs, and CQT
- placed 1 second clips of the audio data into a numpy array
- network all set up -> 99% accuracy when testing on people from the training sample
- 84% accuracy when testing on people different from the training sample
Ideas for the future
For robustness: train on a new data array with a bigger variety of females/male voices -> problematic as there is no data In terms of trying different models, for more complex tasks, a recurrcent neural network would be more appropriate, so an idea could be to try that with on a larger dataset -> again, tough to get a larger sample.