# Keyword spotting

[Interesting Article about audio](https://www.seeedstudio.com/blog/2018/11/23/6-important-speech-recognition-technology-you-need-to-know/)

Keyword spotting consists of detecting a limited set of keywords; this is typically what is used to wake up IoT devices (Alexa etc.). One of the available datasets is [Google Speech Commands](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html), described in the [related paper](https://arxiv.org/abs/1804.03209). This is the one we are going to focus on. It contains 35 words for a total of 105,829 utterances. They were recorded by different users, all using their phone or laptop microphone (the data was collected using a web application). The dataset also contains background noise audio (see the `_background_noise_` folder), because it is important to be able to distinguish audio that contains speech from audio that contains none.

Here is the list of words and the number of occurrences:

![List of keywords](images/Capture.PNG)

For the project we could use several of those keywords for the robot to understand. We use the V2 version of the dataset.
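
As a rough illustration of how per-word counts like the ones above can be reproduced, the sketch below walks an extracted copy of the dataset and counts the `.wav` files in each keyword folder. It assumes a layout of one sub-folder per word; the function name `count_utterances` is hypothetical and not part of any library.

```python
from collections import Counter
from pathlib import Path


def count_utterances(root):
    """Count .wav files per keyword folder under `root` (hypothetical helper).

    Assumes one sub-folder per word. Folders whose name starts with an
    underscore (such as _background_noise_) are skipped.
    """
    counts = Counter()
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir() or folder.name.startswith("_"):
            continue
        counts[folder.name] = sum(1 for _ in folder.glob("*.wav"))
    return counts
```

Calling `count_utterances(path_to_dataset)` returns a `Counter` mapping each word to its number of recordings, and `sum(counts.values())` gives the total number of utterances.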


As a side note, a framework called [fairseq](https://github.com/facebookresearch/fairseq) could be used for more complex speech recognition tasks. It is very popular (over 20k stars on GitHub).

We will use the model implemented in this [paper](https://arxiv.org/ftp/arxiv/papers/2101/2101.04792.pdf), as it has the best results on [Papers with Code](https://paperswithcode.com/sota/keyword-spotting-on-google-speech-commands) for our dataset.

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [33]:
class Res15(nn.Module):
    def __init__(self, n_maps):
        super().__init__()
        # conv0 maps the single input channel to n_maps feature maps.
        self.conv0 = nn.Conv2d(1, n_maps, (3, 3), padding=(1, 1), bias=False)
        self.n_layers = n_layers = 13
        dilation = True
        if dilation:
            self.convs = [nn.Conv2d(n_maps, n_maps, (3, 3), padding=int(2 ** (i // 3)), dilation=int(2 ** (i // 3)),
                                    bias=False) for i in range(n_layers)]
        else:
            self.convs = [nn.Conv2d(n_maps, n_maps, (3, 3), padding=1, dilation=1,
                                    bias=False) for _ in range(n_layers)]
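        # Register each conv and its batch norm as submodules; layers held
        # only in a plain Python list would not be tracked by nn.Module.
        for i, conv in enumerate(self.convs):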
            self.add_module("bn{}".format(i + 1), nn.BatchNorm2d(n_maps, affine=False))
            self.add_module("conv{}".format(i + 1), conv)

    def forward(self, audio_signal, length=None):
        x = audio_signal.unsqueeze(1)
        for i in range(self.n_layers + 1):
            y = F.relu(getattr(self, "conv{}".format(i))(x))
            if i == 0:
                if hasattr(self, "pool"):
                    y = self.pool(y)
                old_x = y
            if i > 0 and i % 2 == 0:
                x = y + old_x
                old_x = x
            else:
                x = y
            if i > 0:
                x = getattr(self, "bn{}".format(i))(x)
        x = x.view(x.size(0), x.size(1), -1)  # shape: (batch, feats, o3)
        x = torch.mean(x, 2)
        return x.unsqueeze(-2), length
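
To make the dilation pattern in `Res15` concrete, this small sketch recomputes the per-layer dilation `2 ** (i // 3)` used above, together with the receptive field that results from stacking the 3x3 convolutions. It is plain Python with no model weights involved; the helper name `receptive_field` is introduced here for illustration only.

```python
N_LAYERS = 13  # same value as n_layers in Res15

# Dilation (and padding) of conv1..conv13, as computed in Res15.__init__.
dilations = [2 ** (i // 3) for i in range(N_LAYERS)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field (per axis) of conv0 followed by the dilated convs.

    Every stride-1 convolution widens the field by (kernel_size - 1) * dilation.
    """
    field = 1 + (kernel_size - 1)  # conv0 is an undilated 3x3 convolution
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


print(dilations)
print(receptive_field(dilations))
```

The dilation doubles every three layers, from 1 up to 16, which gives a receptive field of 125 bins along each axis. Because the padding equals the dilation, each 3x3 convolution leaves the spatial size unchanged.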

## Using HuggingFace

[superb/wav2vec2-base-superb-ks](https://huggingface.co/superb/wav2vec2-base-superb-ks) seems to be a popular option.

It is trained on Speech Commands v1 and recognizes the following 10 words, plus the `_silence_` and `_unknown_` classes:
1. Yes
2. No
3. Up
4. Down
5. Left
6. Right
7. On
8. Off
9. Stop
10. Go

In [1]:
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("anton-l/superb_demo", "ks", split="test")

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
labels = classifier(dataset[0]["file"], top_k=5)




In [2]:
print(labels)

[{'score': 0.9999943971633911, 'label': '_silence_'}, {'score': 2.4985183699755e-06, 'label': 'left'}, {'score': 1.984544496735907e-06, 'label': 'down'}, {'score': 4.3534416249713104e-07, 'label': '_unknown_'}, {'score': 3.17640683533682e-07, 'label': 'stop'}]
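
Since the classifier also returns the `_silence_` and `_unknown_` classes, it can help to filter them out before acting on a prediction. The helper below is a minimal sketch that works on the list-of-dicts format printed above; `pick_keyword` is a hypothetical name and is not part of the `transformers` API, and the default threshold simply mirrors the `> 0.99` check used in the live loop later in this notebook.

```python
NON_KEYWORD_LABELS = {"_silence_", "_unknown_"}


def pick_keyword(predictions, threshold=0.99):
    """Return the top label if it is a confident keyword, else None.

    `predictions` is a list of {"score": float, "label": str} dicts,
    in the format printed by the cell above.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    if best["label"] in NON_KEYWORD_LABELS or best["score"] < threshold:
        return None
    return best["label"]


example = [
    {"score": 0.9999943971633911, "label": "_silence_"},
    {"score": 2.4985183699755e-06, "label": "left"},
]
print(pick_keyword(example))  # silence is the top class, so no keyword is reported
```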


In [3]:
dataset[0]["file"]

'/home/guillaume/.cache/huggingface/datasets/downloads/extracted/8b845bff8b050c19206f97a59ed3450967d8cd6f93823158ae48dae62e8c2041/_silence_/5.wav'

In [4]:
# with pyaudio
import pyaudio
import wave

CHUNK = 320  # number of audio samples per frame
FORMAT = pyaudio.paInt16  # 16-bit PCM samples
CHANNELS = 1  # mono audio
RATE = 48000  # sampling rate in Hz
RECORD_SECONDS = 0.5  # duration of each recording in seconds
FILE_NAME = "temp.wav"  # overwritten on every iteration

def record_audio():
    """Yield the path of a WAV file holding the latest RECORD_SECONDS of audio."""
    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK,
                    input_device_index=4)  # machine-specific device index

    try:
        while True:
            frames = []  # to store audio frames

            for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
                data = stream.read(CHUNK)
                frames.append(data)

            # write frames to a temporary WAV file
            with wave.open(FILE_NAME, 'wb') as wf:
                wf.setnchannels(CHANNELS)
                wf.setsampwidth(p.get_sample_size(FORMAT))
                wf.setframerate(RATE)
                wf.writeframes(b''.join(frames))

            # hand the file path to the caller
            yield FILE_NAME

    except KeyboardInterrupt:
        pass
    finally:
        # release the audio device even when the generator is closed early
        stream.stop_stream()
        stream.close()
        p.terminate()
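
Recording disjoint half-second clips can cut a keyword in two. One common alternative is a sliding window: keep the most recent second of audio and classify it again after every new chunk. The sketch below shows only the buffering logic on raw byte chunks; `RollingAudioBuffer` is a hypothetical helper, not something provided by PyAudio, and the one-second default is an assumption.

```python
from collections import deque


class RollingAudioBuffer:
    """Keep the most recent `window_seconds` of fixed-size audio chunks."""

    def __init__(self, rate, chunk, window_seconds=1.0):
        # Number of chunks that cover the window (at least one).
        self.max_chunks = max(1, int(rate / chunk * window_seconds))
        self.chunks = deque(maxlen=self.max_chunks)

    def push(self, data):
        """Append one chunk of raw bytes; the oldest chunk drops out when full."""
        self.chunks.append(data)

    def is_full(self):
        """True once the buffer covers the whole window."""
        return len(self.chunks) == self.max_chunks

    def window(self):
        """Return the buffered audio as a single bytes object."""
        return b"".join(self.chunks)
```

With `RATE = 48000` and `CHUNK = 320`, a one-second window holds 150 chunks. Each `stream.read(CHUNK)` would be followed by `push(data)`, and the window would be written out and classified once `is_full()` returns true.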


In [5]:
import sounddevice as sd
import numpy as np
import wave
import logging
import scipy.signal as signal
import pyaudio

In [6]:
sd.query_devices()

   0 HD-Audio Generic: HDMI 0 (hw:0,3), ALSA (0 in, 8 out)
   1 Sennheiser XS LAV USB-C: Audio (hw:1,0), ALSA (1 in, 0 out)
   2 HD-Audio Generic: ALC257 Analog (hw:2,0), ALSA (0 in, 2 out)
   3 hdmi, ALSA (0 in, 8 out)
   4 jack, ALSA (2 in, 2 out)
   5 pipewire, ALSA (64 in, 64 out)
   6 pulse, ALSA (32 in, 32 out)
*  7 default, ALSA (32 in, 32 out)
   8 Family 17h (Models 10h-1fh) HD Audio Controller Analog Stereo, JACK Audio Connection Kit (0 in, 0 out)
   9 Sennheiser XS LAV USB-C Mono, JACK Audio Connection Kit (1 in, 0 out)
  10 GNOME Settings, JACK Audio Connection Kit (2 in, 2 out)
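
Hard-coded device indices depend on the machine and on what is plugged in. As a hedged sketch, the helper below looks a device up by name instead. It works on any list of dicts with `name` and `max_input_channels` keys, which is the shape of the entries returned by `sd.query_devices()`; `find_input_device` itself is a hypothetical name, and the sample list is a hand-written stand-in built from the listing above.

```python
def find_input_device(devices, name_part):
    """Return the index of the first input-capable device whose name contains name_part.

    `devices` is an iterable of dicts with "name" and "max_input_channels" keys.
    Returns None when nothing matches.
    """
    for index, device in enumerate(devices):
        matches = name_part.lower() in device["name"].lower()
        if matches and device["max_input_channels"] > 0:
            return index
    return None


# Stand-in for sd.query_devices(), using three entries from the listing above.
devices = [
    {"name": "HD-Audio Generic: HDMI 0 (hw:0,3)", "max_input_channels": 0},
    {"name": "Sennheiser XS LAV USB-C: Audio (hw:1,0)", "max_input_channels": 1},
    {"name": "HD-Audio Generic: ALC257 Analog (hw:2,0)", "max_input_channels": 0},
]
print(find_input_device(devices, "sennheiser"))  # 1
```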

In [12]:
import sounddevice as sd
from scipy.io.wavfile import write

fs = 48000  # Sample rate
seconds = 3  # Duration of recording

# device=1 is the Sennheiser USB microphone in the listing above
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1, device=1)
sd.wait()  # Wait until recording is finished
write('output.wav', fs, myrecording)  # Save as WAV file


KeyboardInterrupt: 

In [9]:
for wav_path in record_audio():
    # pass the path of the recorded WAV file to the keyword spotter
    label = classifier(wav_path, top_k=1)
    if label[0]["score"] > 0.99:
        print(label)
    else:
        print("No keyword detected")


No keyword detected
No keyword detected
No keyword detected
No keyword detected
No keyword detected
No keyword detected
No keyword detected
No keyword detected


KeyboardInterrupt: 