# Digit and Simple Voice Recognizer
A simple speech pattern recognizer that uses MFCCs (Mel-Frequency Cepstral Coefficients) and Dynamic Time Warping (DTW) to match test utterances against a set of template utterances.
<br><br>
Additional library required to run this notebook:
<br>
    `Librosa`
<br>
<br>
*@edwardpassagi* on **<a href="https://github.com/edwardpassagi/voiceRecognizer">GitHub</a>**


### Library Imports

In [1]:
# Library Import and basic function definition
from scipy.io import wavfile

import random

import scipy
import scipy.spatial.distance as dis
import scipy.signal as signal
import numpy as np
import IPython.display as ipd
import librosa

# Progress visualization
from tqdm.auto import tqdm, trange

# Ignore MFCC warning due to wavfile tag
import warnings
warnings.filterwarnings('ignore')

# Display an audio player, optionally with a text label next to it
def sound( x, rate=8000, label=''):
    from IPython.display import display, Audio, HTML
    if label == '':
        display( Audio( x, rate=rate))
    else:
        display( HTML( 
        '<style> table, th, td {border: 0px; }</style> <table><tr><td>' + label + 
        '</td><td>' + Audio( x, rate=rate)._repr_html_()[3:] + '</td></tr></table>'
        ))

## 1. Data Preparation

### Audio Import and MFCC transformation

Since the recognizer compares the MFCC features of the template and test recordings, we load each WAV file and compute its MFCC representation.

In [2]:
# sr = 44100
sr, templateData = wavfile.read("./digits_samples/template.wav")
_, testData = wavfile.read("./digits_samples/test.wav")

# take L channel
template = np.array(templateData[:,0], dtype=float)
test = np.array(testData[:,0], dtype=float)

# find MFCC for both sets (recent librosa releases require keyword arguments)
templateMFCC = librosa.feature.mfcc(y=template, sr=sr, n_mfcc = 50)
testMFCC = librosa.feature.mfcc(y=test, sr=sr, n_mfcc = 50)


### Split the recordings into individual digits

In [3]:
# split the template into 10 digits (kept as a list, so chunks may differ in length)
tempMFList = []
tempDigs = np.array_split(template,10)

# split the test recording into 110 digits
testMFList = []
testDigs = np.array_split(test,110)

# compute one MFCC matrix per digit
for i in range(10):
    tempMFList.append(librosa.feature.mfcc(y=tempDigs[i], sr=sr, n_mfcc = 50))
    
for i in range(110):
    testMFList.append(librosa.feature.mfcc(y=testDigs[i], sr=sr, n_mfcc = 50))
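The split above assumes that every digit occupies an equal share of the recording. As a rough illustration of what `np.array_split` does under that assumption, here is a minimal sketch on a made-up signal (the toy arrays `signal` and `uneven` are not part of the notebook's data):

```python
import numpy as np

# Toy "recording": 10 equal-length segments, segment k is filled with the value k.
segment_len = 5
signal = np.repeat(np.arange(10), segment_len).astype(float)

# np.array_split returns a list of chunks.
chunks = np.array_split(signal, 10)

# With equally long segments, chunk k contains exactly the samples of "digit" k.
labels = [int(chunk[0]) for chunk in chunks]
print(labels)

# When the length is not a multiple of the chunk count, chunk lengths
# differ by at most one sample (the longer chunks come first).
uneven = np.array_split(np.arange(23), 10)
uneven_lengths = [len(c) for c in uneven]
print(uneven_lengths)
```

If the digits were spoken at an uneven pace, the chunk boundaries would cut into neighbouring digits, so this kind of split only suits recordings made with steady timing.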

#### The digit spoken in test clip `i` is `i mod 10`

In [4]:
# sound of some template digits
print("Template digits")
for i in range(0,10,2):
    pr = "number: "+str(i)
    sound(tempDigs[i], sr, pr)

# sound of some test digits

print("Test digits")
for i in range(90,100,2):
    pr = "number: "+str(i)
    sound(testDigs[i], sr, pr)

Template digits

[audio players: number 0, 2, 4, 6, 8]

Test digits

[audio players: number 90, 92, 94, 96, 98]

## 2. The Algorithm

Since each digit (or any utterance) can be spoken in vastly different ways, we want to ignore small, irrelevant differences in timing and pronunciation that do not change what was said.<br><br>
We approach this by comparing every MFCC frame of the test clip with every frame of each template, then finding the lowest-cost alignment path between the two; the template with the cheapest path is our prediction. <br>
The path search relies on *Bellman's Principle of Optimality*. (Further reading <a href="https://en.wikipedia.org/wiki/Bellman_equation">here</a>.)
<br><br>
**First**, we need the pairwise distances between the frames of the two feature matrices (in my case, MFCCs). We use the cosine distance:<br>
$$D(\mathbf{a},\mathbf{b}) = 1 - \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\sqrt{\sum_i b_i^2}}$$<br>
where entry *(i,j)* of the resulting matrix is the distance between the *i-th* frame of the input and the *j-th* frame of the template.

In [5]:
def D_mat(a,b):
    D = np.zeros((len(a.T), len(b.T)))
    for i, matA in enumerate(a.T):
        for j, matB in enumerate(b.T):
            # get cosine distance between the two frames
            D[i,j] = dis.cosine(matA,matB)
    return D
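The double loop in `D_mat` calls `dis.cosine` once per frame pair, which is slow in pure Python. One possible alternative, sketched here with SciPy's `cdist`, computes the same matrix in a single call. The names `D_mat_loop` and `D_mat_fast` and the random feature matrices are made up for this comparison:

```python
import numpy as np
from scipy.spatial.distance import cdist, cosine

def D_mat_loop(a, b):
    # Same structure as D_mat: one cosine distance per pair of frames (columns).
    D = np.zeros((a.shape[1], b.shape[1]))
    for i, frame_a in enumerate(a.T):
        for j, frame_b in enumerate(b.T):
            D[i, j] = cosine(frame_a, frame_b)
    return D

def D_mat_fast(a, b):
    # cdist expects one observation per row, hence the transposes.
    return cdist(a.T, b.T, metric="cosine")

# Random stand-ins for two MFCC matrices (50 coefficients, 12 and 9 frames).
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 12))
b = rng.normal(size=(50, 9))

same = np.allclose(D_mat_loop(a, b), D_mat_fast(a, b))
print(same, D_mat_fast(a, b).shape)
```

Both versions return a matrix with one row per frame of `a` and one column per frame of `b`, so `D_mat_fast` could stand in for `D_mat` without touching the rest of the pipeline.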

**Second**, we compute the *cost* matrix, where entry *(i,j)* holds the cost of the cheapest path reaching *(i,j)*. We constrain each step to arrive from **(i,j-1)**, **(i-1,j-1)**, or **(i-1,j)**.

In [6]:
def C_mat(D):
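    # start from the local distances; row 0 and column 0 are left as-is,
    # so a path may begin anywhere along those two edges
    C = D.copy()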
    
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            curr = C[i,j]
            W = C[i, j-1] + curr
            NW = C[i-1,j-1] + curr
            N = C[i-1,j] + curr
            # assign lowest value to C matrix
            C[i,j] = np.nanmin([W,NW,N])
    return C
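To make the recurrence concrete, here is a small worked example on a hand-picked 3x3 distance matrix. The function `accumulate_cost` is a name made up for this sketch; it is meant to follow the same update rule as `C_mat`, including the untouched first row and column:

```python
import numpy as np

def accumulate_cost(D):
    # Copy the local distances; row 0 and column 0 keep their original values.
    C = np.array(D, dtype=float)
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            # Add the cheapest predecessor among the left, diagonal and upper cells.
            C[i, j] += min(C[i, j - 1], C[i - 1, j - 1], C[i - 1, j])
    return C

# Hand-picked local distances: the diagonal is cheap, everything else is costly.
D = np.array([[1, 3, 4],
              [2, 1, 5],
              [6, 2, 1]])

C = accumulate_cost(D)
print(C)
```

In this example the bottom-right entry comes out as 3, which is the sum 1 + 1 + 1 along the diagonal, while row 0 and column 0 still hold the original distances.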

**Finally**, we can form our classifier: the template whose alignment with the input has the *lowest cost* is the prediction.

In [7]:
def classify(inputMFCC, templateMFCC, window = 40):
    retval = np.zeros(len(templateMFCC))
    
    for i, templateFrame in enumerate(templateMFCC):
        D = D_mat(inputMFCC, templateFrame)
        C = C_mat(D)
        
        # get the minimum cost along the last column and the last row,
        # skipping the first `window` entries of each so that very short alignments are ignored
        opt = min(min(C[window:,-1]),min(C[-1,window:]))
        retval[i]=opt
    
    # return the minimum index
    return np.argmin(retval)
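As a sanity check of the idea, the following sketch runs the same three steps on synthetic data: random feature matrices stand in for MFCCs, and the query is one template with every frame repeated twice (a uniform time stretch). The helper names `accumulate` and `classify_toy`, the random data, and `window=4` are all assumptions made for this illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

def accumulate(D):
    # Same recurrence as the cost matrix above; borders keep their local distances.
    C = D.copy()
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            C[i, j] += min(C[i, j - 1], C[i - 1, j - 1], C[i - 1, j])
    return C

def classify_toy(query, templates, window):
    costs = []
    for template in templates:
        # Frames are columns, so transpose before computing pairwise cosine distances.
        C = accumulate(cdist(query.T, template.T, metric="cosine"))
        # Cheapest path end on the last column or last row, past the first `window` entries.
        costs.append(min(C[window:, -1].min(), C[-1, window:].min()))
    return int(np.argmin(costs)), costs

# Three random "templates": 8 features, 12 frames each.
rng = np.random.default_rng(1)
templates = [rng.normal(size=(8, 12)) for _ in range(3)]

# Query: template 2 spoken "twice as slowly" (each frame repeated).
query = np.repeat(templates[2], 2, axis=1)

predicted, costs = classify_toy(query, templates, window=4)
print(predicted, np.round(costs, 3))
```

The stretched query aligns with its own template at essentially zero cost, because every query frame has an identical partner, whereas the unrelated templates accumulate a clearly positive cost. Note that `window` must be smaller than both dimensions of the cost matrix, otherwise the slices are empty.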

We are now done with our algorithm, and can now test it with our test sets.

## 3. Digit Recognizer

Remember that the actual digit (0-9) is represented as `testIndex mod(10)`.

In [8]:
# Test on digit 3
testIndex = 93

predicted = classify(testMFList[testIndex], tempMFList)
print("Actual: {}, Predicted: {}".format(testIndex%10,predicted))
sound(testDigs[testIndex], sr, "Actual digit")
sound(tempDigs[predicted], sr, "Template digit")

Actual: 3, Predicted: 3


[audio players: Actual digit, Template digit]


The algorithm gets this example right. To measure its accuracy, we run it on the whole test set.<br><br>
The test set contains 110 clips (11 recordings of each digit).

In [9]:
correct = np.zeros(10)
for i in trange(110, desc = 'Test set'):
    guessedval = classify(testMFList[i], tempMFList)
    actualIndex = i%10
    
    # Count the clip as correct when the prediction matches the spoken digit
    if guessedval == actualIndex:
        correct[actualIndex] += 1

### Summary
We can now see our accuracy for each digit:

In [10]:
## Data Summary
print("Data Summary:\n")
totalCorrectDigit = int(np.sum(correct))
print("Total Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalCorrectDigit/110*100, totalCorrectDigit, 110-totalCorrectDigit))

for idx, c in enumerate(correct):
    print("Digit {} Accuracy: {}%".format(idx, c/11*100) )

Data Summary:

Total Accuracy: 99.0909090909091%, Correct Guesses: 109, False Guesses: 1

Digit 0 Accuracy: 100.0%
Digit 1 Accuracy: 100.0%
Digit 2 Accuracy: 100.0%
Digit 3 Accuracy: 100.0%
Digit 4 Accuracy: 100.0%
Digit 5 Accuracy: 100.0%
Digit 6 Accuracy: 100.0%
Digit 7 Accuracy: 100.0%
Digit 8 Accuracy: 90.9090909090909%
Digit 9 Accuracy: 100.0%
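The same summary can be computed without an explicit loop. The sketch below assumes the predictions are stored in an array; since the notebook does not record which clip was misclassified or what it was mistaken for, the single error placed at clip 48 (predicted as 3) is purely hypothetical and only chosen to reproduce the counts above:

```python
import numpy as np

n_clips = 110
actual = np.arange(n_clips) % 10      # spoken digit of clip i is i mod 10

# Hypothetical predictions: all correct except one clip of digit 8.
predicted = actual.copy()
predicted[48] = 3                     # made-up mistake, for illustration only

hits = (predicted == actual).astype(float)

# Correct guesses per digit divided by the number of clips per digit (11 each).
per_digit = np.bincount(actual, weights=hits, minlength=10) / np.bincount(actual, minlength=10)
overall = hits.mean()

print("Total accuracy: {:.2f}%".format(overall * 100))
for digit, acc in enumerate(per_digit):
    print("Digit {} accuracy: {:.2f}%".format(digit, acc * 100))
```

With one error among 110 clips this gives 109/110 overall and 10/11 for the affected digit, matching the figures printed by the notebook.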


## 4. Voice-driven dialler

We can now apply what we have built to something useful in daily life.<br><br>
In this case, we build a voice-driven dialler: we register a number for each of our friends (by saying the friend's name followed by their phone number), and then call them just by saying their name.<br><br>
Let's start by importing our audio files and splitting them (similar to section 1).

### Audio Import and Parsing

In [11]:
# sr = 44100
sr = wavfile.read("./voice_dialler/input.wav")[0]

# take L channel
tempVD = np.array(wavfile.read("./voice_dialler/input.wav")[1][:,0], dtype=float)
testVD = np.array(wavfile.read("./voice_dialler/names.wav")[1][:,0], dtype=float)

In [12]:
print("Input data:")
sound(tempVD, sr, "Setup audio")
sound(testVD, sr, "Test names")

Input data:


[audio players: Setup audio, Test names]


### Contact List:
| Name | Phone Number |
| --- | --- |
| Furkan | 1379 |
| Simon | 5240 |
| Mohamed | 6683 |
| Edward | 7134 |
| Amir | 9523 |

In [13]:
recipientNum = 5
phoneDigitsAmt = 4
names = ["Furkan", "Simon","Mohamed","Edward","Amir"]

# split the setup audio into 10 chunks: names and phone numbers alternate
tempWAV = np.array_split(tempVD,10)

# split the test audio into 10 chunks, one spoken name each
testMFListVD = []
testWAV = np.array_split(testVD,10)
    
for i in range(10):
    testMFListVD.append(librosa.feature.mfcc(y=testWAV[i], sr=sr, n_mfcc = 50))

In [14]:
# play every chunk of the setup audio (names and phone numbers alternate)
print("template chunks")
for i in range(10):
    pr = "chunk: "+str(i)
    sound(tempWAV[i], sr, pr)

# play every test chunk
# the name spoken in test chunk i is names[i mod 5]
print("test names")
for i in range(10):
    pr = "chunk: "+str(i)
    sound(testWAV[i], sr, pr)

template chunks

[audio players: chunk 0 through chunk 9]

test names

[audio players: chunk 0 through chunk 9]

### Phone Numbers and Names
We can now convert the spoken phone numbers from *WAV audio* into *strings*.

In [15]:
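tempNames = []
phoneNumber = []

# even chunks hold a spoken name, odd chunks hold the phone number that follows it
for i in range(10):
    # keep only the first quarter of each name chunk as the name template
    if i % 2 == 0: tempNames.append(np.array_split(tempWAV[i],4)[0])
    else: phoneNumber.append(tempWAV[i])

        
# Find each name template's MFCC representation using librosa
tempNamesMF = []
for i in range(recipientNum):
    tempNamesMF.append(librosa.feature.mfcc(y=tempNames[i], sr=sr, n_mfcc = 50))
    
# split each phone-number chunk into its individual digits
phoneDigs = []
for i in range(recipientNum):
    phoneDigs.append(np.array_split(phoneNumber[i],phoneDigitsAmt))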

In [16]:
# Convert each phone number to a string

phoneNumArr = np.zeros((recipientNum,phoneDigitsAmt))

for i in trange(recipientNum, desc='recipients'):
    for j in tqdm(range(phoneDigitsAmt), desc='digits'):
        # classify each spoken digit against the digit templates
        curDigMFCC = librosa.feature.mfcc(y=phoneDigs[i][j], sr=sr, n_mfcc = 50)
        curDigit = classify(curDigMFCC, tempMFList)
        phoneNumArr[i][j]=curDigit

# join the recognized digits of each recipient into one string
phoneNumStr = []
for i in range(recipientNum):
    phoneNumStr.append("".join(str(int(d)) for d in phoneNumArr[i]))

In [17]:
print(phoneNumStr)

['1379', '5240', '2783', '7134', '9523']


**Accuracy** for our digit recognizer here is 90% (18 of 20 digits). Mohamed's phone number (index 2) is recognized as `2783`, although it should be `6683`.
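The 90% figure can be checked by comparing the recognized strings with the contact list digit by digit. This is a small sketch using only the numbers shown above; the variable names are made up for the example:

```python
# Phone numbers from the contact list and the strings printed by the recognizer.
expected = ["1379", "5240", "6683", "7134", "9523"]
recognized = ["1379", "5240", "2783", "7134", "9523"]

# Pair up corresponding digits across all five numbers.
pairs = [(e, r) for exp, rec in zip(expected, recognized) for e, r in zip(exp, rec)]

n_correct = sum(e == r for e, r in pairs)
digit_accuracy = n_correct / len(pairs)

# A number is only dialled correctly when all four digits are right.
numbers_correct = sum(exp == rec for exp, rec in zip(expected, recognized))

print("Digits correct: {}/{} ({:.0f}%)".format(n_correct, len(pairs), digit_accuracy * 100))
print("Whole numbers correct: {}/{}".format(numbers_correct, len(expected)))
```

Two wrong digits out of 20 give 90% at the digit level, but both errors fall in the same number, so only 4 of the 5 stored numbers are fully correct.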

### Testing the feature
We can now test the feature to see if our algorithm correctly calls the spoken name.

In [18]:
# Testing to call "Mohamed", phone number 6683
testIdx = 2

title = "input: "+ names[testIdx%5]
# play the same clip that is classified below
sound(testWAV[testIdx], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)

print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))

[audio player: input: Mohamed]

matches with:

[audio player: template: Mohamed]


Dialling Mohamed, with phone number: 2783


In [19]:
# Testing to call "Furkan", phone number 1379
testIdx = 0

title = "input: "+ names[testIdx%5]
# play the same clip that is classified below
sound(testWAV[testIdx], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)

print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))

[audio player: input: Furkan]

matches with:

[audio player: template: Furkan]


Dialling Furkan, with phone number: 1379


### Determining the overall accuracy
Now that we have tested the feature, we can determine the overall accuracy of the algorithm on a test set.<br>
In my case, I have 10 test clips: each of the 5 names is spoken twice.<br><br>
The name spoken in test clip `i` is `names[i mod 5]`.

In [20]:
correctVD = np.zeros(5)

for i in trange(10, desc='Test Set'):
    guessedNameIdx = classify(testMFListVD[i], tempNamesMF)
    actualIndex = i%5
    if guessedNameIdx == actualIndex:
        correctVD[actualIndex] += 1

The summary of the accuracy for each case is listed below:

In [21]:
## Data Summary
totalVDCorrect = int(np.sum(correctVD))
print("Data Summary:\n")
print("Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalVDCorrect/10*100, totalVDCorrect, 10-totalVDCorrect))

for idx, c in enumerate(correctVD):
    print("Name {} Accuracy: {}%".format(names[idx%5], c/2*100) )

Data Summary:

Accuracy: 100.0%, Correct Guesses: 10, False Guesses: 0

Name Furkan Accuracy: 100.0%
Name Simon Accuracy: 100.0%
Name Mohamed Accuracy: 100.0%
Name Edward Accuracy: 100.0%
Name Amir Accuracy: 100.0%


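On this small test set, the voice-driven dialler matches every spoken name to the correct contact, although the stored number itself can still be wrong when a digit is misrecognized (as with Mohamed's).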

## Conclusion

In conclusion, the algorithm is not optimized for large template sets: it iterates over every template, and for each one computes the distance between every pair of MFCC frames, so the runtime grows *significantly* on bigger datasets.<br><br>
The high accuracy may partly be due to the test and template recordings being very similar (I recorded both under similar conditions).