<a href="https://colab.research.google.com/github/aaramoni/neural_nets/blob/main/classify_spoken_digits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification of spoken digits with CNN

Clone the audio preprocessing tools from the GitHub repo and import the 
required libraries.

In [1]:
!git clone https://github.com/aaramoni/tools.git

Cloning into 'tools'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 9 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.


In [12]:
import numpy as np
import random
from IPython.display import Audio
from tools.audio_preprocessing import AudioPreprocessor
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow import keras

Load and preprocess the input data.

The dataset consists of 3000 recordings of different speakers saying the digits 0 to 9.

https://www.kaggle.com/datasets/joserzapata/free-spoken-digit-dataset-fsdd

In [3]:
ap = AudioPreprocessor()
ap.load('/content/drive/MyDrive/data/spoken_digits')
ap.pad(1)
ap.log_spectrogram()
ap.normalize()
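
The internals of `AudioPreprocessor` are not shown in this notebook, so the following is only a rough sketch of what padding and normalization might look like, not the library's actual implementation. It assumes that `pad(1)` means one second, a 22050 Hz sample rate (the rate used for playback further down), a hypothetical hop length of 256 samples with centered frames, and min-max scaling. These assumptions were picked because they yield the 87 time frames that the model input expects.

```python
# Hypothetical parameters; the real AudioPreprocessor may use different ones.
SAMPLE_RATE = 22050   # Hz, assumed
HOP_LENGTH = 256      # samples between successive frames, assumed


def pad_signal(signal, seconds, sample_rate=SAMPLE_RATE):
    """Zero-pad (or truncate) a signal to a fixed duration."""
    target = int(seconds * sample_rate)
    padded = list(signal[:target])
    padded.extend([0.0] * (target - len(padded)))
    return padded


def frame_count(n_samples, hop_length=HOP_LENGTH):
    """Number of STFT frames when frames are centered on multiples of hop_length."""
    return 1 + n_samples // hop_length


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


padded = pad_signal([0.1, -0.2, 0.3], seconds=1)
print(len(padded))                         # 22050
print(frame_count(len(padded)))            # 87
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Padding every clip to the same length is what lets the spectrograms be stacked into a single array with a fixed shape.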

In [4]:
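X = ap.spectrograms
# Add a trailing channel axis, since Conv2D expects (height, width, channels)
X = X.reshape(X.shape + (1,))

# The first character of each filename is the spoken digit
y = ap.filenames
y = [int(filename[0]) for filename in y]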

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
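
For reference, here is a minimal pure-Python sketch of what the label handling above amounts to: taking the leading digit of each filename and turning it into a one-hot vector, which is the encoding `keras.utils.to_categorical` produces for integer class labels. The filenames and the helper name `one_hot` are illustrative.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


# Illustrative filenames; the leading character is the spoken digit.
filenames = ['7_jackson_32.wav', '0_george_5.wav']
labels = [int(name[0]) for name in filenames]
encoded = [one_hot(label) for label in labels]

print(labels)      # [7, 0]
print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Note that the split is done on the integer labels before encoding, which is what allows `stratify=y` to keep the digit proportions similar in both subsets.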

Create and train a convolutional neural network model for classification.

In [9]:
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Softmax

model = Sequential([
    Input(shape=(256, 87, 1)),
    Conv2D(256, kernel_size=7, strides=2, padding='same', activation='relu'),
    MaxPooling2D(2),
    Conv2D(128, kernel_size=5, strides=1, padding='same', activation='relu'),
    MaxPooling2D(2),
    Conv2D(64, kernel_size=3, strides=1, padding='same', activation='relu'),
    MaxPooling2D(2),
    Flatten(),
    Dense(10),
    Softmax()
    ])

model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_6 (Conv2D)           (None, 128, 44, 256)      12800     
                                                                 
 max_pooling2d_6 (MaxPooling  (None, 64, 22, 256)      0         
 2D)                                                             
                                                                 
 conv2d_7 (Conv2D)           (None, 64, 22, 128)       819328    
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 32, 11, 128)      0         
 2D)                                                             
                                                                 
 conv2d_8 (Conv2D)           (None, 32, 11, 64)        73792     
                                                                 
 max_pooling2d_8 (MaxPooling  (None, 16, 5, 64)       
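
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults: a `'same'` convolution yields ceil(size / stride) along each spatial axis, and max pooling with the default `'valid'` padding floors the division. The helper names are illustrative.

```python
import math


def conv_same(size, stride):
    """Output length of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)


def pool(size, window=2):
    """Output length of max pooling with 'valid' padding and stride == window."""
    return size // window


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


height, width, channels = 256, 87, 1
layers = [(256, 7, 2), (128, 5, 1), (64, 3, 1)]  # (filters, kernel_size, stride)

for filters, kernel, stride in layers:
    params = conv_params(kernel, channels, filters)
    height, width = conv_same(height, stride), conv_same(width, stride)
    print(f'conv: ({height}, {width}, {filters}), params={params}')
    height, width, channels = pool(height), pool(width), filters
    print(f'pool: ({height}, {width}, {channels})')

flat = height * width * channels
print(f'flatten: {flat}, dense params={flat * 10 + 10}')
```

Odd sizes round up in the strided convolution (87 becomes 44) and round down in pooling (11 becomes 5), which explains the uneven widths in the summary.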

In [10]:
BATCH_SIZE = 32
EPOCHS = 20

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    batch_size=BATCH_SIZE, epochs=EPOCHS)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
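
The model above is trained with mean squared error on softmax outputs. That works, but categorical cross-entropy is the more common pairing for one-hot targets because it penalizes confident wrong predictions far more strongly. The sketch below compares the two losses on hand-made probability vectors; the numbers are illustrative and not taken from this model.

```python
import math


def mse(target, predicted):
    """Mean of squared differences over classes."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def cross_entropy(target, predicted, eps=1e-7):
    """Categorical cross-entropy for a single one-hot target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


target = [0.0, 1.0, 0.0]              # true class is index 1
confident_right = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

for name, probs in [('right', confident_right), ('wrong', confident_wrong)]:
    print(f'{name}: mse={mse(target, probs):.3f}, '
          f'cross-entropy={cross_entropy(target, probs):.3f}')
```

Mean squared error stays below 1 even for the confidently wrong vector, while cross-entropy grows without bound as the probability of the true class approaches zero. In Keras, switching would mean passing `loss='categorical_crossentropy'` to `model.compile`.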


Predict a random sample and compare the result with the audio.

In [24]:
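# randrange excludes the upper bound, so the index is always valid
rand = random.randrange(len(y))

value = y[rand]
# model.predict expects a batch, so slice to keep the leading dimension
prediction = np.argmax(model.predict(X[rand:rand+1]))

print(f'Actual value: {value}')
print(f'Predicted value: {prediction}')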

Audio(ap.signals[rand], rate=22050)

Actual value: 9
Predicted value: 9
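
A single random draw, which may come from the training split, gives only an anecdotal check. A minimal sketch of a more systematic comparison follows: overall accuracy and a count of confused digit pairs over lists of true and predicted labels. The label lists here are made-up placeholders; in the notebook they could be obtained, for example, from `np.argmax(y_test, axis=1)` and `np.argmax(model.predict(X_test), axis=1)`.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusions(y_true, y_pred):
    """Count (true, predicted) pairs for misclassified samples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Placeholder labels for illustration only.
y_true = [9, 3, 5, 3, 0]
y_pred = [9, 3, 9, 3, 0]

print(accuracy(y_true, y_pred))    # 0.8
print(confusions(y_true, y_pred))  # Counter({(5, 9): 1})
```

The confusion counts show which digits get mistaken for which, something a single accuracy number hides.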
