# Speech recognition

A simple speech recognition system can be implemented using dynamic time warping (DTW).

This notebook is inspired by the [Rouanet DTW library example](http://nbviewer.jupyter.org/github/pierre-rouanet/dtw/blob/master/speech-recognition.ipynb).

We will use a simple [database](https://www.dropbox.com/s/c12fmsctfwwov5d/sounds.zip) composed of 12 French words, each pronounced about 25 times by different speakers.
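For intuition about what DTW computes, here is a minimal sketch of the classic dynamic-programming recurrence on 1-D sequences. It is not the implementation used by the libraries imported below; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D sequences.

    Illustrative only: the local cost is the absolute difference
    and the allowed steps are (1, 0), (0, 1) and (1, 1).
    """
    n, m = len(a), len(b)
    # acc[i, j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            acc[i, j] = local + min(acc[i - 1, j],      # a advances, b repeats
                                    acc[i, j - 1],      # b advances, a repeats
                                    acc[i - 1, j - 1])  # both advance
    return acc[n, m]

slow = [0, 1, 1, 2, 3, 3]   # same shape as `fast`, but stretched in time
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))          # 0.0
print(dtw_distance(slow, [3, 2, 1, 0]))  # a reversed sequence costs more
```

Because repeated samples can be matched to a single sample at no extra cost, the stretched sequence aligns perfectly with the fast one. This tolerance to differences in speaking rate is what makes DTW a natural fit for comparing spoken words.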

In [1]:
import librosa
from dtw import dtw
import matplotlib.pyplot as plt
import numpy as np
import glob
import operator

%matplotlib inline

### Loading Data

In [2]:
%%time

# one label per line; line i is the label of sounds/i.wav
with open('sounds/wavToTag.txt') as f:
    y = [line.rstrip('\n') for line in f]

X = []
for i in range(len(y)):
    x, sample_rate = librosa.load("sounds/{}.wav".format(i))
    X.append(x)

CPU times: user 9.78 s, sys: 68 ms, total: 9.85 s
Wall time: 10.9 s


### Processing

In [3]:
n_window_samples = int(sample_rate * 2 * 10**(-3))

def reshape_sound(x):
    # reshape into windows of n_window_samples samples each
    # (2 ms at the loaded sample rate); each column holds one window
    
    new_len = np.floor_divide(x.shape[0], n_window_samples) * n_window_samples
    x = x[0:new_len]
    x = x.reshape((n_window_samples, -1), order='F')
    return x
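To see what the column-major reshape does, here is a small sketch on a toy signal. The window length of 4 samples and the 10-sample signal are arbitrary values chosen for illustration, not the ones used in this notebook.

```python
import numpy as np

n_window_samples = 4    # hypothetical window length, in samples
x = np.arange(10)       # toy signal with 10 samples

# Drop the trailing samples that do not fill a complete window
new_len = (x.shape[0] // n_window_samples) * n_window_samples

# order='F' fills the result column by column,
# so each column is one consecutive window of the signal
frames = x[:new_len].reshape((n_window_samples, -1), order='F')

print(frames.shape)   # (4, 2)
print(frames)
# [[0 4]
#  [1 5]
#  [2 6]
#  [3 7]]
```

The last two samples (8 and 9) are discarded, and the result has one row per position inside a window and one column per window.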

In [4]:
X = [reshape_sound(x) for x in X]

### Define ground-truth data

In [5]:
gt = dict()

unique_labels = set(y)
for l in unique_labels:
    idx = y.index(l)
    y.pop(idx)
    x = X.pop(idx)
    gt[l] = x
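The cell above takes the first recording of each word as its reference and removes it from the test set. The same logic can be traced on toy data; the label list and the string stand-ins for the audio arrays below are made up for illustration.

```python
labels = ['biere', 'stade', 'biere', 'gants', 'stade']   # toy labels
sounds = ['s0', 's1', 's2', 's3', 's4']                   # stand-ins for the arrays in X

references = {}
for label in set(labels):
    idx = labels.index(label)             # first remaining occurrence of this label
    labels.pop(idx)                       # remove it from the test labels...
    references[label] = sounds.pop(idx)   # ...and move the sound to the references

print(sorted(references.items()))
# [('biere', 's0'), ('gants', 's3'), ('stade', 's1')]
print(labels, sounds)
# ['biere', 'stade'] ['s2', 's4']
```

Both lists are popped at the same index, so the remaining labels and sounds stay aligned regardless of the order in which the set is iterated.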

### Classification

In [None]:
%%time

# one independent list per label ([[]] * n would make every key share the same list)
classifications = {label: [] for label in unique_labels}

for idx in range(len(y)):
    x = X[idx]
    
    # for each reference sound calculate normalized dtw distance to x
    # save it in a dictionary where the key is the reference label
    x_distance = {}
    for label, ground in gt.items():
        # returns the accumulated cost matrix and the warping path
        # (librosa.dtw in older librosa releases)
        cost, path = librosa.sequence.dtw(x, ground)
        path = np.array(path)
        rows = path[:, 0]     # frame indices in x
        columns = path[:, 1]  # frame indices in the reference
        min_cost = np.sum(cost[rows, columns])
        min_cost = min_cost / cost.size
        x_distance[label] = min_cost

    # ascending order by distance
    ordered_distance = sorted(x_distance.items(), key=operator.itemgetter(1))
    predictions_rank = [l[0] for l in ordered_distance]
    real_label_position = predictions_rank.index(y[idx])
    classifications[y[idx]].append(real_label_position)
    
    print("{0: <15s}{1:2d}º{2: >15s}".format(y[idx], (real_label_position+1), predictions_rank[0]))

chaussure       2º         sofoot
manette         2º          gants
chaussure       2º         sofoot
sofoot          1º         sofoot
manette         2º          biere
stade           2º      girondins
gants           1º          gants
jeuvideo        5º      girondins
stade           4º        manette
gants           1º          gants
gants           3º      girondins
biere           5º      girondins
jeuvideo        5º      chaussure
zidane          6º        beckham
sofoot          1º         sofoot
biere           4º        manette
zidane         10º      chaussure
jeuvideo        4º      chaussure
cocacola       12º      chaussure
manette        10º         zidane
sofoot          2º      chaussure
cocacola        6º        beckham
biere           6º         zidane
sofoot          5º        beckham
beckham         1º        beckham
stade           5º        beckham
zidane         10º        manette
chaussure       4º      girondins
manette         3º      girondins
biere         