# MLMI2: Speech Recognition

Lecturer: Prof. Phil Woodland

----

# Table of Contents

>## 1. Introduction
* 1.1. Human Speech
* 1.2. ASR
* 1.3. Recognition Model Architecture

>## 2. Speech Signal Processing
* 2.1. Quasi-Stationary Assumption
* 2.2. Digital Signals
* 2.3. Windowing
* 2.4. Fourier Transform
* 2.5. DFT - Discrete Fourier Transform
* 2.6. Complex Formulation of DFT
* 2.7. Spectral Properties of Speech
* 2.8. Spectral Features of Sounds
* 2.9. Spectrograms

>## 3. Linear Prediction and Cepstral Analysis
* 3.1. Source-Filter Model
* 3.2. General Linear Constant Coefficient Difference Equation
* 3.3. Linear Prediction
* 3.4. Autocorrelation Method
* 3.5. Inverse Filtering
* 3.6. LP Spectrum
* 3.7. Cepstral Analysis

>## 4. HMM Speech Recognition - Introduction
* 4.1. Hidden Markov Models
* 4.2. Composite HMMs
* 4.3. Viterbi Algorithm
* 4.4. Viterbi Training


----

# 1. Introduction

## 1.1. Human Speech

* **Speech Waveform**

>1. **Non-stationary** 
>2. Mixture of **pseudo-periodic** $+$ **random** components

* **Human Speech Production Mechanism**

>1. **Acoustic tube**: variably-shaped / responsible for detailed sounds
>2. **Excitation source**: responsible for broad distinctions in speech sound
>  1. **Voiced sounds** $\leftarrow$ Vocal cords which vibrate when air from the lungs is forced through them
>  2. **Fricative sounds** $\leftarrow$ Turbulence caused by forcing air through constrictions formed by raising the tongue to narrow the acoustic tube
>  3. **Plosive sounds** $\leftarrow$ Turbulence caused by the release of air following a complete closure of the acoustic tube

* **Elements of Human Speech**

>* **Speech**: Sequence of **phones**
>  * **Phones**: Directly associated with **phonemes**
>    * **Phonemes**: Abstraction of phones / basic units of speech
>      * **ARPAbet**: American English phone set
>* **Consonant Classification**
>  * 5 classes based on the type of vocal tract constriction
>  * $\Rightarrow$ plosives (stops), fricatives, affricates, liquids (semi-vowels), nasals
>* **Vowel Classification**
>  * Based on **tongue position** (front to back) & **jaw position** (low to high)
>  * $\Rightarrow$ Represented on the **vowel quadrilateral** (**diphthongs** are drawn as arrows)

## 1.2. ASR

>$$\text{Changing time signal} \rightarrow \text{Corresponding symbol sequence}\;(\textbf{string of words})$$
>* We need a **sequence-to-sequence** model
>* This is **NOT** about understanding the input

* **Several Styles of ASR**

>|Style|Input Mode|Vocab Size|Basic Unit|Grammar|
>|-|-|-|-|-|
>|**Isolated**|discrete|small|word|none|
>|**Continuous**|continuous|medium|phone|finite-state / N-gram|
>|**Discrete large vocab**|discrete|large|phone|N-gram|
>|**Continuous large vocab**|continuous|large|phone|N-gram|

* **Performance Metrics for ASR**

>* **WER (Word Error Rate)** for isolated word recognition (the percentage of reference words that are **not** recognised correctly)
>$$\text{%WER}=100\times\frac{N(\text{reference})-N(\text{correct})}{N(\text{reference})}$$
>$$\;$$
>* **WER** for continuous speech recognition
>$$\text{%WER}=100\times\frac{N(\text{ins.})+N(\text{del.})+N(\text{sub.})}{N(\text{reference})}$$

>* **User Satisfaction**
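
The continuous-speech WER formula above needs the insertion, deletion and substitution counts, which are usually obtained from a minimum-edit-distance alignment between the reference and the recognised word strings. The following is a minimal sketch of that idea, not the scoring tool used in the course; the function names `word_errors` and `wer_percent` are hypothetical, and words are assumed to be separated by whitespace.

```python
def word_errors(reference, hypothesis):
    """Return (subs, dels, ins) from a minimum-edit-distance word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match: no new error
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            dele = (t + 1, s, d + 1, n)      # reference word missing
            t, s, d, n = cost[i][j - 1]
            inse = (t + 1, s, d, n + 1)      # extra hypothesis word
            # Tuples compare on total errors first, so the total is minimal;
            # ties between equally good alignments are broken arbitrarily.
            cost[i][j] = min(diag, dele, inse)
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def wer_percent(reference, hypothesis):
    """%WER = 100 * (ins + del + sub) / N(reference)."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = word_errors(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / n_ref


# One substitution (sat -> sit) and one deletion (the) against 6 reference words.
example_wer = wer_percent("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted in the numerator but not in $N(\text{reference})$, the %WER computed this way can exceed 100.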

* **The performance of ASR is affected by**

>* **Inter-speaker variability** (important to develop **recognizer robustness**)
>* **Intra-speaker variability** (e.g. situation, aging, ...)
>* **Microphone** (close vs far / fixed vs variable / bandwidth)
>* **Background noise** 

## 1.3. Recognition Model Architecture

* **Feature Extraction**

>* Normally uses a **fixed duration** of speech signal (a **frame**)
>* Each frame corresponds to a **feature vector**
>* Spectral information is often encoded as a **cepstral representation** (e.g. MFCCs)
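
As a rough illustration of the frame-to-feature-vector step described above, the sketch below cuts a sampled signal into fixed-duration, overlapping frames and maps each frame to a small vector. The 25 ms frame length, 10 ms shift, 16 kHz sample rate and the two toy features (log energy and zero-crossing rate) are assumptions chosen for illustration; they are not cepstral features such as MFCCs, and the function names are hypothetical.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a sampled signal into overlapping fixed-duration frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; trailing samples that do not fill a frame are dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def toy_feature_vector(frame):
    """Map one frame to [log energy, zero-crossing rate] (illustrative only)."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]


# Assumed test signal: one second of a 200 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(samples, sample_rate)
features = [toy_feature_vector(f) for f in frames]  # one feature vector per frame
```

The resulting list `features` plays the role of the stream of feature vectors that the recogniser consumes, one vector per frame.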

* **Generative model approach**


>$$\hat{W}=\underset{W}{\arg\max}\;{P(\textbf{O}|W)P(W)}$$
>* $\textbf{O}$: Stream of feature vectors describing the utterance
>* $P(\textbf{O}|W)$: given by acoustic model
>* $P(W)$: given by language model
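
The decision rule above follows from Bayes' rule: $P(W|\textbf{O}) \propto P(\textbf{O}|W)P(W)$, because $P(\textbf{O})$ does not depend on $W$. A minimal sketch of the rule over an explicit list of candidate word strings is given below, working with log probabilities to avoid numerical underflow. The function `decode` and all scores are hypothetical, made-up values for illustration; a real recogniser searches the hypothesis space rather than enumerating it.

```python
import math


def decode(acoustic_loglik, lm_logprob):
    """Return the W maximising log P(O|W) + log P(W), with its score."""
    best_w, best_score = None, -math.inf
    for w, log_p_o_given_w in acoustic_loglik.items():
        score = log_p_o_given_w + lm_logprob[w]  # log of P(O|W) * P(W)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Hypothetical acoustic-model log-likelihoods log P(O|W) for one utterance O.
acoustic_loglik = {
    "recognise speech": -120.0,
    "wreck a nice beach": -118.5,
}
# Hypothetical language-model log-probabilities log P(W).
lm_logprob = {
    "recognise speech": math.log(1e-4),
    "wreck a nice beach": math.log(1e-7),
}

best_w, best_score = decode(acoustic_loglik, lm_logprob)
```

With these made-up numbers the acoustic model slightly prefers the second string, but the language model prior outweighs that difference, so the first string is selected.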

* **Hidden Markov Model (HMM)**

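>* **GMM-HMM**: models the output distribution of each HMM state with a Gaussian Mixture Model
>* **DNN-HMM**: uses a Deep Neural Network to estimate HMM state probabilities from the feature vectors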