

# ASR: Automatic Speech Recognition
* Also called STT (Speech to Text)



* First, speech recognition, which allows the machine to catch the words, phrases, and sentences we speak.

* Second, natural language processing, which allows the machine to understand what we speak.

* Third, speech synthesis, which allows the machine to speak.

# Difficulties in developing a speech recognition system

* **Size of the vocabulary** − The size of the vocabulary affects how easy it is to develop an ASR system. Typical size classes are:

 * A small vocabulary consists of 2-100 words, for example, in a voice-menu system.

 * A medium vocabulary consists of several hundred to several thousand words, for example, in a database-retrieval task.

 * A large vocabulary consists of several tens of thousands of words, as in a general dictation task.





* **Channel characteristics** − Channel quality is another important dimension. For example, directly recorded human speech has a high bandwidth covering the full frequency range, while telephone speech has a low bandwidth with a limited frequency range. The latter is harder to recognize.

* **Speaking mode** − The ease of developing an ASR system also depends on the speaking mode, that is, whether the speech consists of isolated words, connected words, or continuous speech. Continuous speech is the hardest to recognize.

* **Speaking style** − Speech may be read in a formal style, or it may be spontaneous and conversational in a casual style. The latter is harder to recognize.
* **Speaker dependency** − A system can be speaker dependent, speaker adaptive, or speaker independent. A speaker-independent system is the hardest to build.

* **Type of noise** − Noise is another factor to consider while developing an ASR system. The signal-to-noise ratio (SNR) falls into different ranges depending on how much background noise the acoustic environment contains:
 * If the SNR is greater than 30 dB, it is considered high.

 * If the SNR lies between 10 dB and 30 dB, it is considered medium.

 * If the SNR is less than 10 dB, it is considered low.

* **Microphone characteristics** − The quality of the microphone may be good, average, or below average. The distance between the mouth and the microphone can also vary. Recognition systems should take these factors into account as well.
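
The SNR ranges listed under "Type of noise" can be made concrete with a small sketch. It assumes the signal and the noise are available as separate sample sequences, which is rarely true outside synthetic data. The helper names `snr_db` and `snr_range` are hypothetical, and placing the boundary values of exactly 10 dB and 30 dB in the medium range is an assumption, since the text does not say where they belong.

```python
import math

def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels from two sample sequences."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)

def snr_range(snr):
    """Map an SNR in dB onto the high / medium / low ranges described above."""
    if snr > 30.0:
        return "high"
    if snr >= 10.0:  # assumption: the 10 dB and 30 dB boundaries count as medium
        return "medium"
    return "low"

# Synthetic example: a 440 Hz tone with amplitude 1.0, and a 50 Hz hum
# with amplitude 0.1 standing in for background noise.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
hum = [0.1 * math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(sample_rate)]

value = snr_db(tone, hum)
print(f"{value:.1f} dB -> {snr_range(value)}")
```

An amplitude ratio of 10 corresponds to a power ratio of 100, so this synthetic pair lands at about 20 dB, in the medium range.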

# Simple audio recognition: Recognizing keywords

**Dataset:**

**speech_commands**


An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1 second files, in the training set they are provided as long segments under the "background_noise" folder. Here this background noise is split into 1 second clips, and one of the files is kept for the validation set.
[link](https://www.tensorflow.org/datasets/catalog/speech_commands)

In [None]:
!pip install -U -q tensorflow tensorflow_datasets
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages will be REMOVED:
  libcudnn8-dev
The following held packages will be changed:
  libcudnn8
The following packages will be DOWNGRADED:
  libcudnn8
0 upgraded, 0 newly installed, 1 downgraded, 1 to remove and 22 not upgraded.
Need to get 430 MB of archives.
After this operation, 1,153 MB disk space will be freed.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  libcudnn8 8.1.0.77-1+cuda11.2 [430 MB]
Fetched 430 MB in 6s (75.1 MB/s)
(Reading database ... 122518 files and directories currently installed.)
Removing libcudnn8-dev (8.7.0.84-1+cuda11.8) ...
update-alternatives: removing manually selected alterna

In [None]:
import os
import pathlib

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display

# Set the seed value for experiment reproducibility.
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

In [None]:
DATASET_PATH = 'data/mini_speech_commands'

data_dir = pathlib.Path(DATASET_PATH)
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip


In [None]:
from IPython.display import Audio, display

display(Audio('data/mini_speech_commands/down/004ae714_nohash_0.wav', autoplay=True))

In [None]:
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
print('Commands:', commands)

Commands: ['left' 'no' 'right' 'stop' 'yes' 'down' 'README.md' 'go' 'up']


In [None]:
type(commands)

numpy.ndarray

In [None]:
commands = commands[(commands != 'README.md') & (commands != '.DS_Store')]
print('Commands:', commands)

Commands: ['left' 'no' 'right' 'stop' 'yes' 'down' 'go' 'up']


This dataset contains only single-channel audio, so use the tf.squeeze function to drop the extra channel axis.

The utils.audio_dataset_from_directory function only returns up to two splits. It's a good idea to keep a test set separate from your validation set. Ideally you'd keep it in a separate directory, but in this case you can use Dataset.shard to split the validation set into two halves. Note that iterating over any shard will load all the data, and only keep its fraction.
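
The two steps described above, squeezing the channel axis and sharding the validation set, can be sketched on a synthetic dataset. The random tensors below are an assumption standing in for the batches that the audio loader would return, and the batch size of 2 is chosen only to keep the example small. `tf.squeeze` and `Dataset.shard` are the calls named in the text.

```python
import tensorflow as tf

# Synthetic stand-in for loader output (assumption): 8 one-second clips
# at 16 kHz with a trailing channel axis, plus integer labels.
audio = tf.random.normal([8, 16000, 1], seed=0)
labels = tf.range(8, dtype=tf.int32)
val_ds = tf.data.Dataset.from_tensor_slices((audio, labels)).batch(2)

def squeeze(audio, labels):
    # Drop the single-channel axis: (batch, samples, 1) -> (batch, samples).
    return tf.squeeze(audio, axis=-1), labels

val_ds = val_ds.map(squeeze, num_parallel_calls=tf.data.AUTOTUNE)

# Split the validation batches into two halves. Shard 0 takes batches
# 0, 2, ... and shard 1 takes batches 1, 3, ...
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)

for batch_audio, batch_labels in test_ds.take(1):
    print(batch_audio.shape, batch_labels.numpy())
```

Because sharding happens at the batch level here, each shard still reads through every batch and discards the ones it does not own, which is the cost the note above refers to.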

# Convert waveforms to spectrograms

The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from time-domain signals into time-frequency-domain signals by computing the short-time Fourier transform (STFT). The result is a spectrogram, which shows how the frequency content changes over time and can be represented as a 2D image. You will feed the spectrogram images into your neural network to train the model.

Now, create spectrogram datasets from the audio datasets:
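
To make the STFT step concrete, here is a minimal NumPy sketch of what the transform computes: the waveform is cut into overlapping frames, each frame is windowed, and the magnitude of each frame's FFT becomes one row of the spectrogram. This is an illustration, not the notebook's own code. The function name `stft_magnitude`, the Hann window, and the frame parameters (255-sample frames, a 128-sample step, a 256-point FFT) are all assumptions. In TensorFlow, `tf.signal.stft` provides the same kind of transform.

```python
import numpy as np

def stft_magnitude(waveform, frame_length=255, frame_step=128, fft_length=256):
    """Return a magnitude spectrogram of shape (frames, fft_length // 2 + 1)."""
    num_frames = 1 + (len(waveform) - frame_length) // frame_step
    window = np.hanning(frame_length)  # assumption: Hann window
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft zero-pads each frame to fft_length and keeps the non-negative frequencies.
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))

# Synthetic one-second clip at 16 kHz containing a 1 kHz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 1000 * t)

spectrogram = stft_magnitude(waveform)
# Add a channel axis so the result can be treated like a single-channel image.
spectrogram = spectrogram[..., np.newaxis]
print(spectrogram.shape)

# The strongest frequency bin, averaged over time, should sit at the tone frequency.
peak_bin = int(spectrogram[..., 0].mean(axis=0).argmax())
print(peak_bin * sample_rate / 256, "Hz")
```

With these parameters a one-second clip yields 124 frames of 129 frequency bins each, and the trailing axis gives the image-like shape that convolutional layers expect.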

# Build and train the model

## Display a confusion matrix

## Run inference on an audio file

# Export the model with preprocessing