### Step 1: Preparing the Dataset and Extracting Audio Features with Mel-Frequency Cepstral Coefficients (MFCC)

Every signal has its own characteristics. In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC.
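To make the "mel" part concrete: the mel scale maps frequency in Hz onto a perceptual pitch scale that is roughly linear at low frequencies and logarithmic at high ones. The sketch below assumes the common HTK-style formula m = 2595 * log10(1 + f / 700); librosa's default mel filterbank uses a slightly different (Slaney-style) variant. The function names are illustrative and not part of this notebook.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 1000 Hz step covers far fewer mels at high frequencies than at low ones
low_step = hz_to_mel(1000) - hz_to_mel(0)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(low_step, high_step)
```

This compression of high frequencies is why mel-based features devote more resolution to the lower part of the spectrum.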

For a detailed introduction to the MFC, see: https://www.youtube.com/watch?v=4_SH2nfbQZ8&t=0s

Using the librosa library, we will extract these characteristics from every audio file in our dataset and store them in a list.

In [1]:
import tensorflow as tf
print(tf.__version__)

2.8.0


In [2]:
import pandas as pd
import os
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation
from tqdm import tqdm





### Feature Extraction

Here we extract Mel-Frequency Cepstral Coefficients (MFCCs) from the audio samples. MFCCs summarise the frequency distribution within each short analysis window, so the resulting sequence of frames captures both the frequency and the time characteristics of the sound. These audio representations give us the features used for classification.


In [None]:
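dataset_path = os.path.join(".", "dataset")  # avoids the invalid "\d" escape and works on any OS
train = pd.read_csv(os.path.join(dataset_path, "train.csv"), index_col=[0])
label_n = train['class'].nunique()  # number of distinct classes
labels = train['class'].unique()    # class names, in order of first appearance
train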

In [None]:
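def features_extractor(filename):
    # Load the audio (librosa resamples to 22050 Hz by default); 'kaiser_fast' trades some quality for speed
    audio, sample_rate = librosa.load(filename, res_type='kaiser_fast')
    # MFCC matrix of shape (n_mfcc, n_frames)
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=128)
    # Average over time frames to get one fixed-length vector of 128 values per file
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features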
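The averaging step in `features_extractor` is what lets clips of different lengths feed a network with a fixed input size. A minimal sketch of that idea, using random matrices as stand-ins for real MFCC output (the shapes and the helper name `mean_over_time` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for MFCC matrices of shape (n_mfcc, n_frames); random values, not real audio
short_clip = rng.normal(size=(128, 40))
long_clip = rng.normal(size=(128, 173))

def mean_over_time(mfcc_matrix):
    """Average each coefficient across frames, as features_extractor does."""
    return np.mean(mfcc_matrix.T, axis=0)

short_vec = mean_over_time(short_clip)
long_vec = mean_over_time(long_clip)
print(short_vec.shape, long_vec.shape)  # both are (128,)
```

The trade-off is that the order of frames is discarded: two sounds with the same average spectrum but different temporal patterns produce the same feature vector.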

In [None]:
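extracted_features = []
# total= lets tqdm show a proper progress bar, since iterrows() has no length
for index, row in tqdm(train.iterrows(), total=len(train)):
    file_name = os.path.join(dataset_path, "train2", row['filename'])
    data = features_extractor(file_name)
    extracted_features.append([data, row['class']])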




In [None]:
# Convert extracted_features to a pandas DataFrame and one-hot encode the class labels
extracted_features_df = pd.DataFrame(extracted_features,columns=['feature','class'])
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(extracted_features_df[['class']]).toarray())

encoder_df_list = encoder_df.to_numpy()
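The encoding step can be pictured with plain NumPy. The sketch below is an illustration rather than the notebook's code: the class names are made up, and it relies on the fact that `OneHotEncoder`, like `np.unique`, orders categories by sorted value by default, so column i corresponds to the i-th sorted class name.

```python
import numpy as np

# Hypothetical class labels, for illustration only
classes = np.array(["dog_bark", "siren", "dog_bark", "drilling"])

# Sorted unique categories and, for each sample, the index of its category
categories, indices = np.unique(classes, return_inverse=True)

# One row per sample with a single 1 in the column of its category
one_hot = np.eye(len(categories))[indices]
print(categories)
print(one_hot)
```

Keeping track of this column order matters later, because the model's predicted index refers to a column of this matrix, not to a row of the original CSV.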


In [None]:
# Separate the features (x) from the one-hot encoded labels (y)
x = np.array(extracted_features_df['feature'].tolist())
y = encoder_df_list

In [None]:
print(extracted_features_df)
print(encoder_df)

In [None]:
# Split the dataset into training and test sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=69)

### Step 2: Building a Neural Network Model and Training It with the UrbanSound8K Dataset.


In [None]:
# Number of classes; this sets the size of the model's output layer
num_labels = label_n

In [None]:
# Now we start building the model: a fully connected network with three hidden layers

model = Sequential()

# 1. hidden layer
model.add(Dense(125,input_shape=(128,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
# 2. hidden layer
model.add(Dense(250))
model.add(Activation('relu'))
model.add(Dropout(0.5))
# 3. hidden layer
model.add(Dense(125))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# output layer
model.add(Dense(num_labels))
model.add(Activation('softmax'))

# Display the model summary
model.summary()
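As a cross-check on `model.summary()`, the parameter counts can be worked out by hand: a `Dense` layer with n inputs and m units has n * m weights plus m biases, while `Activation` and `Dropout` layers add none. The sketch below assumes 10 classes purely for illustration; in the notebook `num_labels` comes from the data.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

num_labels = 10  # assumption for illustration; the notebook reads this from the data

layer_sizes = [128, 125, 250, 125, num_labels]  # input, three hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Only the last entry depends on the number of classes; the three hidden layers contribute the same counts regardless of the dataset's labels.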

In [None]:
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
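The `categorical_crossentropy` loss pairs naturally with the softmax output and one-hot labels used here. A minimal NumPy sketch of the textbook definition follows; it is not Keras's exact implementation, and the logits are made-up numbers.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

# Two toy samples with 4 classes; the true class is 0 in both
logits = np.array([[0.0, 0.0, 0.0, 0.0],   # no preference between classes
                   [8.0, 0.0, 0.0, 0.0]])  # confident in class 0
y_true = np.array([[1.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0]])

probs = softmax(logits)
uniform_loss = categorical_crossentropy(y_true[:1], probs[:1])
confident_loss = categorical_crossentropy(y_true[1:], probs[1:])
print(uniform_loss, confident_loss)
```

A model that guesses uniformly over k classes has a loss of log(k), which is a useful baseline when reading the first epochs of the training log.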

In [None]:
# Training the model
epochscount = 200
num_batch_size = 32
model.fit(x_train, y_train, batch_size=num_batch_size, epochs=epochscount, validation_data=(x_test, y_test), verbose=1)

In [31]:
validation_test_set_accuracy = model.evaluate(x_test,y_test,verbose=0)
print("Accuracy : ",validation_test_set_accuracy[1])

Accuracy :  0.8508771657943726


In [30]:
# Save the model, using the test accuracy in the directory name
model.save(f'./models/model_{validation_test_set_accuracy[1]}')

INFO:tensorflow:Assets written to: ./models/model_0.8508771657943726\assets


### Step 3: Finally, We Predict an Audio File's Class Using Our Trained Model.

We first preprocess the new audio data and then predict the class.


In [None]:
model = tf.keras.models.load_model('path')  # load the saved model

filename = r"path"
# Extract features exactly as during training, using features_extractor defined above.
# The training features were computed without min-max normalizing the waveform,
# so no normalization is applied here either.
mfccs_scaled_features = features_extractor(filename).reshape(1, -1)  # shape (1, 128)
result_array = model.predict(mfccs_scaled_features)
result = np.argmax(result_array[0])  # index of the most probable class
print(result)
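The printed result is only a column index of the softmax output. To turn it into a class name, index the encoder's category list; in this notebook that would be `encoder.categories_[0]` from the fitted `OneHotEncoder`. Note that `labels` from `train['class'].unique()` is in order of first appearance and may not match the column order. A sketch with made-up class names and probabilities:

```python
import numpy as np

# Hypothetical sorted class names and a made-up softmax output for one file
categories = np.array(["dog_bark", "drilling", "siren"])
result_array = np.array([[0.10, 0.15, 0.75]])

best = int(np.argmax(result_array[0]))
top2 = np.argsort(result_array[0])[::-1][:2]  # indices of the two highest probabilities

print(categories[best], float(result_array[0][best]))
print([(str(categories[i]), float(result_array[0][i])) for i in top2])
```

Reporting the runner-up alongside the top class gives a rough sense of how confident the model is in a given prediction.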