## Dataset and its structure

1. We can use the UrbanSound8K dataset ( https://urbansounddataset.weebly.com/ ), a popular benchmark for urban sound classification.
2. Whichever dataset you use, it is important to understand its structure and how to extract the required features from it.
3. The UrbanSound8K dataset can be downloaded from the following link ( https://goo.gl/8hY5ER  ). It downloads a compressed tar file of around 6 GB.
4. Once extracted, it contains two folders named 'audio' and 'metadata'.
5. The audio folder contains 10 folders named fold1, fold2 and so on, each holding several hundred audio files (8,732 in total). Each clip is at most 4 s long.
6. The metadata folder contains a .csv file with columns such as slice_file_name, fsID, start, end, salience, fold, classID and class (the label corresponding to classID).
7. A complete description can be found here: https://urbansounddataset.weebly.com/urbansound8k.html
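
The folder layout described above can be turned into file paths with a small helper. This is only a sketch: `clip_path` and `parse_slice_name` are hypothetical helper names, `AUDIO_ROOT` is an assumed extraction location, and the `fsID-classID-occurrenceID-sliceID.wav` naming pattern is taken from the dataset description page. The example file name and fold come from the first metadata row shown later in this notebook.

```python
from pathlib import PurePosixPath

# Assumed extraction location; adjust to wherever the archive was unpacked.
AUDIO_ROOT = PurePosixPath("UrbanSound8K/audio")


def clip_path(fold, slice_file_name, root=AUDIO_ROOT):
    """Build the path of one clip from its fold number and file name."""
    return root / f"fold{fold}" / slice_file_name


def parse_slice_name(slice_file_name):
    """Split '<fsID>-<classID>-<occurrenceID>-<sliceID>.wav' into integers."""
    stem = slice_file_name.rsplit(".", 1)[0]
    fs_id, class_id, occurrence_id, slice_id = (int(p) for p in stem.split("-"))
    return {"fsID": fs_id, "classID": class_id,
            "occurrenceID": occurrence_id, "sliceID": slice_id}


path = clip_path(5, "100032-3-0-0.wav")
info = parse_slice_name("100032-3-0-0.wav")
print(path)
print(info)
```

Parsing the file name is mainly useful as a sanity check against the `fsID` and `classID` columns of the metadata file.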

## Research Paper and Resources to follow

1. https://github.com/meyda/meyda/wiki/audio-features
2. https://github.com/tyiannak/pyAudioAnalysis/wiki/3.-Feature-Extraction
3. https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a
4. https://towardsdatascience.com/urban-sound-classification-part-1-99137c6335f9
5. https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/

## Library To Use

We can use the librosa library, which can be installed using 
> pip install librosa

It can use ffmpeg as a backend to decode some audio formats. On Debian or Ubuntu, ffmpeg can be installed using 
> apt-get install ffmpeg

Librosa can read audio files and convert them to their amplitude values, one per audio sample. Suppose an audio file is 4 s long and has a sampling rate of 22050 Hz. This means 22050 amplitude samples are recorded for each second of audio. Hence a 4 s audio file with a sampling rate of 22050 Hz can be expressed as an array of size 4\*22050=88200.
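
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch in which a synthetic 440 Hz tone (an arbitrary choice) stands in for a real recording.

```python
import math

sample_rate = 22050   # samples per second (Hz)
duration_s = 4        # clip length in seconds

# Total number of amplitude samples in the clip.
num_samples = sample_rate * duration_s

# A synthetic 440 Hz tone stands in for real audio here.
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(num_samples)]

print(num_samples)               # 88200
print(len(tone) / sample_rate)   # 4.0
```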


## How to Load Audio Files and Extract Features

The load method of the librosa library reads audio files. It takes a file path as input and returns an array of amplitude samples along with the sampling rate of the file.

Librosa has many built-in methods for extracting the features mentioned in the resources; each returns another array of features.
We can use various combinations of those features. You can experiment to see how features such as MFCC, spectral features and energy affect the classification of audio.

For example, in the first stage you can extract only MFCC features, build a model and check its accuracy. Then try the same with other features. To improve accuracy further, you can also combine more than one type of feature and compare the results.
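
One way to combine several feature types is to reduce each to a fixed-length summary and concatenate the results. The sketch below uses random placeholder arrays in the `(n_features, n_frames)` layout that librosa functions such as `librosa.feature.mfcc`, `librosa.feature.spectral_centroid` and `librosa.feature.rms` return. The frame count and the `summarize` helper are illustrative assumptions, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays with the layout (n_features, n_frames).
# Real values would come from the audio.
n_frames = 173
mfcc = rng.normal(size=(40, n_frames))      # stand-in for 40 MFCCs
centroid = rng.normal(size=(1, n_frames))   # stand-in for spectral centroid
rms = rng.normal(size=(1, n_frames))        # stand-in for RMS energy


def summarize(feature):
    """Average a (n_features, n_frames) array over time."""
    return feature.mean(axis=1)


# Stage 1: MFCC only. Stage 2: MFCC plus two extra features.
mfcc_only = summarize(mfcc)
combined = np.concatenate([summarize(f) for f in (mfcc, centroid, rms)])

print(mfcc_only.shape)
print(combined.shape)
```

With a combined vector, the input size of the model has to match the new feature count (42 here instead of 40).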

## Using CNN to classify sound

This is a classic approach to sound classification, since similar types of sound tend to have similar spectrograms (read resource 3 to learn more about spectrograms). A spectrogram is a visual representation of how the frequency spectrum of a sound or other signal varies with time. We can therefore train a CNN that takes these spectrogram images as input, learns general patterns from them and uses those patterns to classify the sounds. Note that the model built later in this notebook is a simpler fully connected network trained on time-averaged MFCC features rather than a CNN on spectrogram images.
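
To make the idea of a spectrogram concrete, here is a minimal NumPy sketch of a magnitude spectrogram computed with a hand-rolled short-time Fourier transform. The window size, hop length and the 440 Hz test tone are assumptions chosen for illustration; in practice a library routine would normally be used instead.

```python
import numpy as np

sr = 22050          # sampling rate in Hz
n_fft = 2048        # samples per analysis window
hop = 512           # samples between successive windows

# One second of a 440 Hz sine wave as a stand-in for real audio.
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Slice the signal into overlapping, Hann-windowed frames.
n_frames = 1 + (len(signal) - n_fft) // hop
window = np.hanning(n_fft)
frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                   for i in range(n_frames)])

# Magnitude of the FFT of each frame: rows are frequency bins, columns are time.
spectrogram = np.abs(np.fft.rfft(frames, axis=1)).T

# The strongest frequency bin should sit close to 440 Hz.
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)
peak_freq = freqs[spectrogram.mean(axis=1).argmax()]
print(spectrogram.shape)
print(peak_freq)
```

The resulting 2-D array is what would be rendered as an image and fed to a CNN.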

In [1]:
pip install librosa


In [4]:
# ffmpeg-python is only a Python wrapper; the ffmpeg binary itself must be installed separately
pip install ffmpeg-python



In [8]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is 7061-83D3

 Directory of C:\Users\drish\Downloads\Coding Ninjas ML\Urban Sound Classification

21-10-2022  19:26    <DIR>          .
19-10-2022  15:16    <DIR>          ..
19-10-2022  15:19    <DIR>          .ipynb_checkpoints
21-10-2022  17:46    <DIR>          UrbanSound8K
04-06-2014  03:46     7,097,425,920 UrbanSound8K.tar
21-10-2022  17:37     6,023,741,708 UrbanSound8K.tar.gz
21-10-2022  19:26            10,843 urban_sound_classification.ipynb
               3 File(s) 13,121,178,471 bytes
               4 Dir(s)  326,412,599,296 bytes free


In [10]:
import librosa
import pandas as pd
import os
import numpy as np

audio_dataset_path="UrbanSound8K/UrbanSound8K/audio/"
metadata=pd.read_csv("UrbanSound8K/UrbanSound8K/metadata/UrbanSound8K.csv")
metadata.head()

| | slice_file_name | fsID | start | end | salience | fold | classID | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 100032-3-0-0.wav | 100032 | 0.0 | 0.317551 | 1 | 5 | 3 | dog_bark |
| 1 | 100263-2-0-117.wav | 100263 | 58.5 | 62.5 | 1 | 5 | 2 | children_playing |
| 2 | 100263-2-0-121.wav | 100263 | 60.5 | 64.5 | 1 | 5 | 2 | children_playing |
| 3 | 100263-2-0-126.wav | 100263 | 63.0 | 67.0 | 1 | 5 | 2 | children_playing |
| 4 | 100263-2-0-137.wav | 100263 | 68.5 | 72.5 | 1 | 5 | 2 | children_playing |


## MFCC
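MFCCs (Mel-frequency cepstral coefficients) describe the short-term spectral shape of a sound on the mel scale, so they capture both the time and the frequency characteristics of an audio file.
By default, librosa resamples every audio file to 22,050 Hz and mixes it down to mono, which keeps the inputs uniform. The samples it returns are floating-point values normalised to the range -1 to +1.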

In [11]:
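def features_extractor(file):
    """Return a 40-dimensional feature vector: each MFCC averaged over time."""
    # Load as mono at librosa's default 22,050 Hz; "kaiser_fast" trades resampling quality for speed
    audio,sample_rate=librosa.load(file,res_type="kaiser_fast")
    # MFCC matrix of shape (40, n_frames)
    mfcc_features=librosa.feature.mfcc(y=audio,sr=sample_rate,n_mfcc=40)
    # Average over frames so clips of any length give a vector of shape (40,)
    mfcc_scaled_features=np.mean(mfcc_features.T,axis=0)
    return mfcc_scaled_features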
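
Averaging over the frame axis is what lets clips of different lengths share one input size. The sketch below illustrates this with random placeholder matrices in place of real MFCC output; the frame counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for MFCC matrices of a short and a long clip:
# 40 coefficients per frame, different numbers of frames.
short_clip_mfcc = rng.normal(size=(40, 14))
long_clip_mfcc = rng.normal(size=(40, 173))

# Same operation as in features_extractor: transpose to (frames, 40),
# then average over the frame axis.
short_vec = np.mean(short_clip_mfcc.T, axis=0)
long_vec = np.mean(long_clip_mfcc.T, axis=0)

print(short_vec.shape, long_vec.shape)
```

The trade-off is that the ordering of frames is discarded, so two sounds with the same average spectrum but different temporal patterns become indistinguishable.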

In [12]:
from tqdm import tqdm
# Iterate through all files listed in the metadata and extract features with the function above
extracted_features=[]
for index_num,row in tqdm(metadata.iterrows()):
    # Clips are stored as <audio root>/fold<N>/<slice_file_name>
    file_name=os.path.join(os.path.abspath(audio_dataset_path),"fold"+str(row["fold"]),str(row["slice_file_name"]))
    final_class_labels=row["class"]
    data=features_extractor(file_name)
    extracted_features.append([data,final_class_labels])

8732it [07:31, 19.34it/s]


In [13]:
extracted_features_df=pd.DataFrame(extracted_features,columns=["feature","class"])
extracted_features_df.head()

| | feature | class |
|---|---|---|
| 0 | [-217.35526, 70.22339, -130.38527, -53.282898,... | dog_bark |
| 1 | [-424.09818, 109.34077, -52.919525, 60.86475, ... | children_playing |
| 2 | [-458.79114, 121.38419, -46.520657, 52.00812, ... | children_playing |
| 3 | [-413.89984, 101.66373, -35.42945, 53.036358, ... | children_playing |
| 4 | [-446.60352, 113.68541, -52.402206, 60.302044,... | children_playing |


In [18]:
X=np.array(extracted_features_df["feature"].to_list())
y=np.array(extracted_features_df["class"].to_list())

In [19]:
X.shape

(8732, 40)

In [20]:
#One-hot encode the class labels (one column per class) for the network
y=np.array(pd.get_dummies(y))
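
The effect of one-hot encoding can be seen on a tiny example. This sketch uses NumPy only rather than `pd.get_dummies`, and the four labels are made up for illustration.

```python
import numpy as np

labels = np.array(["dog_bark", "children_playing", "siren", "dog_bark"])

# Sorted unique class names and, for each label, its index into that list.
class_names, class_ids = np.unique(labels, return_inverse=True)

# One row per sample, one column per class, with a single 1 per row.
one_hot = np.eye(len(class_names), dtype="float32")[class_ids]

# argmax along the class axis recovers the original labels.
decoded = class_names[one_hot.argmax(axis=1)]

print(class_names)
print(one_hot)
```

Keeping the sorted class names around is what makes it possible to turn a predicted column index back into a readable label later.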

In [22]:
y.shape

(8732, 10)

In [23]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.2,random_state=0)
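
The dataset documentation advises against reshuffling the data and recommends evaluating with the predefined folds, because slices cut from the same recording can otherwise end up in both the training and the test set. A minimal sketch of a fold-based split follows; the placeholder arrays stand in for `X` and `metadata["fold"]`, and `split_by_fold` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 20 samples with 40 features each, plus a fold id (1..10)
# per sample, standing in for X and metadata["fold"].
X_demo = rng.normal(size=(20, 40))
folds_demo = np.repeat(np.arange(1, 11), 2)


def split_by_fold(features, folds, test_fold):
    """Hold out every sample of `test_fold`; train on the remaining folds."""
    test_mask = folds == test_fold
    return features[~test_mask], features[test_mask]


X_train_demo, X_test_demo = split_by_fold(X_demo, folds_demo, test_fold=10)
print(X_train_demo.shape, X_test_demo.shape)
```

Repeating this with each fold as the held-out set in turn gives a 10-fold cross-validation estimate, which is likely to be lower than the accuracy from a random split.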

In [24]:
X_train.shape,Y_train.shape

((6985, 40), (6985, 10))

In [25]:
X_test.shape,Y_test.shape

((1747, 40), (1747, 10))

## Building Model

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Activation,Flatten
from tensorflow.keras.optimizers import Adam
from sklearn import metrics

In [30]:
num_labels=Y_test.shape[1]
num_labels

10

In [38]:
model=Sequential()
#first layer
model.add(Dense(256,input_shape=(40,))) #because 40 features
model.add(Activation("relu"))
model.add(Dropout(0.5))

#second layer
model.add(Dense(512)) 
model.add(Activation("relu"))
model.add(Dropout(0.5))

#third layer
model.add(Dense(1024)) 
model.add(Activation("relu"))
model.add(Dropout(0.5))

#final layer
model.add(Dense(num_labels))
model.add(Activation("softmax"))
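
The parameter counts that `model.summary()` reports for this network can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is hypothetical, and the output size of 10 is the `num_labels` value computed above.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [40, 256, 512, 1024, 10]   # input, three hidden layers, output
params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(params)
print(sum(params))
```

The first two values, 10496 and 131584, match the `Param #` column for the first two Dense layers. Activation and Dropout layers add no parameters.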

In [39]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_7 (Dense)             (None, 256)               10496     
                                                                 
 activation_7 (Activation)   (None, 256)               0         
                                                                 
 dropout_6 (Dropout)         (None, 256)               0         
                                                                 
 dense_8 (Dense)             (None, 512)               131584    
                                                                 
 activation_8 (Activation)   (None, 512)               0         
                                                                 
 dropout_7 (Dropout)         (None, 512)               0         
                                                                 
 dense_9 (Dense)             (None, 1024)             

In [40]:
model.compile(loss="categorical_crossentropy",metrics=["accuracy"],optimizer="Adam")

In [41]:
model.fit(X_train,Y_train,batch_size=32,epochs=100,validation_data=(X_test,Y_test))

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x190ffa3d430>

In [42]:
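# evaluate returns [loss, accuracy] because the model was compiled with metrics=["accuracy"]
test_loss,test_accuracy=model.evaluate(X_test,Y_test,verbose=0)
print(test_accuracy)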

0.8494561910629272
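
To use the trained network on a new clip, the softmax output has to be mapped back to a class name. The sketch below assumes the columns are in alphabetical order, which is how `pd.get_dummies` orders them. The probability rows are made-up values standing in for what `model.predict` would return.

```python
import numpy as np

# Class names in alphabetical order, assumed to match the column order
# produced by pd.get_dummies on the "class" column.
CLASS_NAMES = np.array([
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
])

# Made-up softmax outputs for two clips (each row sums to 1).
probabilities = np.array([
    [0.01, 0.01, 0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.03, 0.02],
    [0.02, 0.02, 0.70, 0.05, 0.03, 0.03, 0.03, 0.02, 0.05, 0.05],
])

# The predicted class is the column with the highest probability.
predicted = CLASS_NAMES[probabilities.argmax(axis=1)]
print(predicted)
```

A new clip would first go through `features_extractor` and be reshaped to `(1, 40)` before being passed to the model.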
