# 2. Spectral Analysis

## 2.1. Quasi-Stationarity Assumption
* Speech is stationary over a **short interval** $(10\;\text{~}\;20\;\text{ms})$ $\rightarrow$ find the signal characteristics in each interval

## 2.2. Digital Signals

>$$\text{Analog} \rightarrow \text{Digital}$$
>
>* **Step 1: Low-pass filter** (remove $f>0.5\;f_{sampling}$: **Nyquist's theorem**)
>* **Step 2: Sampling** (discretisation in time)
>  * High Quality: $16\text{kHz}$ vs. Normal Quality: $8\text{kHz}$
>* **Step 3: ADC** (discretisation in amplitude)
>  * High Quality: $16\text{bits/sample}$ vs. Normal Quality: $8\text{bits/sample}$
>  * With $Q$ bits per sample, the output codes range over $-2^{Q-1}, \dots , 2^{Q-1}-1$
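
The sampling and quantisation steps can be sketched as follows. This is only an illustration: the function names, the assumed input range $[-1, 1)$ and the floor-and-clip rounding rule are assumptions, not part of the notes.

```python
import math

def quantise(x, q_bits):
    """Map x (assumed to lie in [-1.0, 1.0)) to a signed integer code of q_bits bits."""
    levels = 2 ** (q_bits - 1)
    code = int(math.floor(x * levels))
    # Clip to the representable range -2^(Q-1) .. 2^(Q-1)-1
    return max(-levels, min(levels - 1, code))

def sample_and_quantise(freq_hz, fs_hz, n_samples, q_bits):
    """Sample a unit-amplitude sine at rate fs_hz and quantise every sample."""
    return [quantise(math.sin(2 * math.pi * freq_hz * n / fs_hz), q_bits)
            for n in range(n_samples)]

# "Normal quality" settings from the list above: 8 kHz, 8 bits/sample
codes = sample_and_quantise(freq_hz=1000, fs_hz=8000, n_samples=8, q_bits=8)
```

The 1 kHz tone is below half the sampling rate, so it satisfies the Nyquist condition in Step 1.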

## 2.3. Windowing
* DFT assumes periodicity $\rightarrow$ **Discontinuity** $\rightarrow$ Undesired high frequency
* **Hamming Window**
  * Attenuates the discontinuity **but** also smears the spectral peaks

>$$w(nT)=0.54-0.46\cos{ \left[ \frac{2\pi n}{N-1} \right] }$$

* **Block Processing**
  * Allow overlap between windows
  * Typically: frame shift $10\text{ms}$, window length $25\text{ms}$
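
A minimal sketch of block processing with a Hamming window, assuming a 25 ms window advanced by 10 ms. The names `hamming` and `frames` and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
import math

def hamming(n_len):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def frames(signal, fs_hz, win_ms=25.0, shift_ms=10.0):
    """Split signal into overlapping, Hamming-windowed frames."""
    win = int(round(fs_hz * win_ms / 1000))
    shift = int(round(fs_hz * shift_ms / 1000))
    w = hamming(win)
    out = []
    # A trailing partial frame is dropped (an assumption of this sketch)
    for start in range(0, len(signal) - win + 1, shift):
        out.append([signal[start + n] * w[n] for n in range(win)])
    return out
```

At $16\text{kHz}$ this gives 400-sample windows advanced by 160 samples, so consecutive windows overlap by 240 samples.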

## 2.4. Fourier Transform
* **Fourier Analysis**

>* **Periodic Signal of $f_0$**: constructed by sinusoids of $f_0, 2f_0, 3f_0, ...$

>  * $f_0$: fundamental frequency
>  * $nf_0 (n>1)$: harmonics

>* **Aperiodic & Stochastic Signal**: has a spectrum that is a continuous function of frequency

* **Fourier Transform**

><img src="images/image04.png" width=500>

* **FFT - Fast Fourier Transform**

>$$\mathcal{O}(N^2) \rightarrow \mathcal{O}(N\log N)$$
>
>* Exploits the symmetry and periodicity of the complex exponentials $e^{-j2\pi np/N}$
>* The radix-2 form requires the window to be a power of 2 samples in size
>* This can be achieved by appropriate choice of analysis size and/or zero-padding (after windowing) the frame to a power of 2
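
As an illustration of the power-of-2 requirement, here is a small recursive radix-2 FFT with zero-padding. It is a sketch for clarity rather than speed, and the function names are assumptions.

```python
import cmath

def next_pow2(n):
    """Smallest power of 2 that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for p in range(n // 2):
        # Twiddle factor; the symmetry W^(p+N/2) = -W^p halves the work
        t = cmath.exp(-2j * cmath.pi * p / n) * odd[p]
        out[p] = even[p] + t
        out[p + n // 2] = even[p] - t
    return out

def fft_zero_padded(frame):
    """Zero-pad an (already windowed) frame to a power of 2, then apply the FFT."""
    n = next_pow2(len(frame))
    return fft(list(frame) + [0.0] * (n - len(frame)))
```

A 400-sample frame would be padded to 512 samples before the transform.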

## 2.5. DFT - Discrete Fourier Transform
* **Cosine Correlation**

>$$c(\Omega)=\sum^{N-1}_{n=0}{s(nT)\cos{\left(\frac{2\pi np}{N}\right)}}\;\;\;,\;\;\;p=0,1,...,N-1$$

* **Sine Correlation**

>$$s(\Omega)=\sum^{N-1}_{n=0}{s(nT)\sin{\left(\frac{2\pi np}{N}\right)}}\;\;\;,\;\;\;p=0,1,...,N-1$$

* **Amplitude**

>$$a_p=\sqrt{c^2(\Omega)+s^2(\Omega)}$$

* **Phase**

>$$\phi_p=\tan^{-1}\left[\frac{s(\Omega)}{c(\Omega)}\right]$$
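
The correlation form can be evaluated directly. The sketch below computes the amplitude and phase for one index $p$; the variable names are assumptions, and `atan2` is used in place of a plain $\tan^{-1}$ so that the quadrant is resolved and a zero cosine correlation causes no division error.

```python
import math

def dft_amplitude_phase(s, p):
    """Amplitude and phase of component p from cosine and sine correlations."""
    n_len = len(s)
    cos_corr = sum(s[n] * math.cos(2 * math.pi * n * p / n_len) for n in range(n_len))
    sin_corr = sum(s[n] * math.sin(2 * math.pi * n * p / n_len) for n in range(n_len))
    amplitude = math.sqrt(cos_corr ** 2 + sin_corr ** 2)
    phase = math.atan2(sin_corr, cos_corr)  # assumption: atan2 instead of tan^-1
    return amplitude, phase

# A cosine with exactly 3 cycles in a 32-sample frame correlates only with p = 3
frame = [math.cos(2 * math.pi * 3 * n / 32) for n in range(32)]
amp3, phase3 = dft_amplitude_phase(frame, 3)
amp5, _ = dft_amplitude_phase(frame, 5)
```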

## 2.6. Complex Formulation of DFT


* **DFT**

>\begin{align}
S_p&=\sum^{N-1}_{n=0}{s(nT)\left[\cos{\left(\frac{2\pi np}{N}\right)}-j\sin{\left(\frac{2\pi np}{N}\right)}\right]}\\
&=\sum^{N-1}_{n=0}{s(nT) e^{-j\left(\frac{2\pi np}{N}\right)}}\;\;\;,\;\;\;p=0,1,...,N-1
\end{align}

* **Inverse DFT**

>$$s(nT)=\frac{1}{N}\sum^{N-1}_{p=0}{S_p e^{j\left(\frac{2\pi np}{N}\right)}}\;\;\;,\;\;\;n=0,1,...,N-1$$
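
A direct $\mathcal{O}(N^2)$ transcription of the two formulas, useful as a check that the inverse recovers the original samples. The function names are assumptions.

```python
import cmath

def dft(s):
    """S_p = sum_n s[n] * exp(-j*2*pi*n*p/N), p = 0..N-1."""
    n_len = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * n * p / n_len) for n in range(n_len))
            for p in range(n_len)]

def idft(S):
    """s[n] = (1/N) * sum_p S_p * exp(+j*2*pi*n*p/N), n = 0..N-1."""
    n_len = len(S)
    return [sum(S[p] * cmath.exp(2j * cmath.pi * n * p / n_len) for p in range(n_len)) / n_len
            for n in range(n_len)]

samples = [1.0, 2.0, 0.5, -1.0, 0.0]
recovered = idft(dft(samples))  # complex values with negligible imaginary part
```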

## 2.7. Spectral Properties of Speech

* **Vowel** (iy)

><img src="images/image05.png" width=400>
>
>* **Time domain**: approximately **periodic** with $f_0=130\text{Hz}$
>* **Frequency domain**: corresponding periodic excitation $(\text{~}7.5\;\text{cycles/1000Hz})$

* **Fricative** (s)

><img src="images/image06.png" width=400>
>
>* **Time domain**: no periodicity
>* **Frequency domain**: random variations at much higher frequency

## 2.8. Spectral Features of Sounds

* **Vowel**

>* Characterised by the first 3 **formants**
>* The first two formants relate to jaw height and tongue position roughly as follows

>||Tongue Front|Tongue Back|
|-|-|-|
|**High Jaw**|$F_1$ Low, $F_2$ High|$F_1$ Low, $F_2$ Low|
|**Low Jaw**|$F_1$ High, $F_2$ High|$F_1$ High, $F_2$ Low|

* **Liquids**

>* Characterised by formant position & dynamics
>* Overall energy is lower than for vowels

* **Nasals**

>* Strong low $F_1$ around $250\text{Hz}$ and weak higher formants
>* Often energy around $2.5\text{kHz}$

* **Fricatives**

>* Most energy in higher frequencies
>* Voiced fricatives show weak formant structure

* **Stops**

>* Characterised by silence, optionally followed by a burst of high energy

## 2.9. Spectrograms
* Dimensions: **time & frequency**
* Intensity of image: **spectral energy**

><img src="images/image07.png" width=400>
>
>||Short window|Wide window|
|-|-|-|
|Time resolution|Good|Poor|
|Frequency resolution|Poor|Good|
||(vertical lines)|(horizontal lines)|
|Band|Wide|Narrow|
||(pitch periods visible)|(harmonics of $f_0$ visible)|

# 3. Linear Prediction and Cepstral Analysis

## 3.1. Source-Filter Model

* **Basic Idea**

>$$\underset{\text{(vocal cords)}}{\textbf{Sound Source}} \;\; + \;\; \underset{\text{(vocal tract)}}{\textbf{Linear Acoustic Filter}}$$
>
>* **Linear Prediction Analysis:** method to explicitly determine the parameters of the source-filter model
>* When used for **speech synthesis:** filter parameters are updated every $\text{~10ms}$
>* When used for **speech analysis:** speech is divided into segments $(\text{10~25ms})$

* **Assumptions**

>1. **Sound source (excitation):** periodic pulse train or noise
>2. **Filter:** single lossless linear time variant filter with single input
>3. The filter and excitation characteristics are stationary over $\text{~}10\text{ms}$

## 3.2. General Linear Constant Coefficient Difference Equation

* **General Eq. may include any no. of past inputs & outputs**

>$$y[nT]=\sum^p_{i=1}a_iy\left[(n-i)T\right]+\sum^q_{i=0}b_ix\left[(n-i)T\right]$$

* **Taking Z-transforms**

>$$Y(z)=Y(z)\sum^p_{i=1}a_iz^{-i} + X(z)\sum^q_{i=0}b_iz^{-i}$$

>$$H(z)=\frac{Y(z)}{X(z)}=\frac{\sum^q_{i=0}b_iz^{-i}}{1-\sum^p_{i=1}a_iz^{-i}}$$

>$$p=\text{no. of poles} \;\;\;,\;\;\; q=\text{no. of zeros}$$

>* $p=0$: **non-recursive** or **FIR** (finite impulse response)
>* $p>0$: **recursive** or **IIR** (infinite impulse response)
>* $q=0,p>0$: **all-pole filter**

* **Two-Pole Resonators**
  * Used to **model a single formant** of a particular bandwidth and frequency
  * Can be used in series or parallel to create more complex filters

>$$H(z)=\frac{1}{1-bz^{-1}-cz^{-2}}$$
>
>* For complex-conjugate poles at $re^{\pm j\theta}$, $b=2r\cos\theta$ and $c=-r^2$, so a **stable** filter requires $|c|<1$; $b$ and $c$ together determine the frequency and bandwidth of the resonator
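
A sketch of how $b$ and $c$ might be chosen for a given formant. It assumes complex-conjugate poles at radius $r$ and angle $\theta=2\pi f/f_s$, and uses the common approximation $r=e^{-\pi B/f_s}$ for bandwidth $B$; that mapping and the function names are assumptions, not taken from the notes.

```python
import cmath
import math

def resonator_coefficients(freq_hz, bandwidth_hz, fs_hz):
    """Return (b, c) for H(z) = 1 / (1 - b z^-1 - c z^-2)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius (approximation)
    theta = 2 * math.pi * freq_hz / fs_hz          # pole angle
    return 2 * r * math.cos(theta), -r * r

def magnitude_response(b, c, freq_hz, fs_hz):
    """|H(z)| evaluated on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    return abs(1 / (1 - b * z_inv - c * z_inv ** 2))

# A single formant at 500 Hz with 100 Hz bandwidth, sampled at 8 kHz
b, c = resonator_coefficients(500, 100, 8000)
peak = magnitude_response(b, c, 500, 8000)
```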

## 3.3. Linear Prediction

* **Linear Prediction**
  * **Idea:** each sample can be predicted as a weighted sum of the $p$ preceding samples

>$$\hat{s}[n]=\sum^p_{i=1}a_is[n-i]$$
>* $s[n]\equiv s[nT]$

* **Prediction Error**

>$$e[n]=s[n]-\hat{s}[n]=s[n]-\sum^p_{i=1}a_is[n-i]$$

* **Taking Z-transforms**

>$$E(z)=\left[ 1-\sum^p_{i=1}a_iz^{-i} \right] S(z)$$

* **Filter Transfer Function**

>$$H(z)=\frac{S(z)}{E(z)}=\frac{1}{1-\sum^p_{i=1}a_iz^{-i}}$$
>* **All-pole** digital filter, with prediction error signal as input

* **Source Filter Model**
  * Assume that the input is spectrally flat $\rightarrow$ $E(z)=G$
  
>$$H(z)=\frac{G}{1-\sum^p_{i=1}a_iz^{-i}}=\frac{G}{A(z)}$$
>* $G$: estimated to match the overall energy
>* $1/A(z)$: models the spectral shape of the vocal tract (also glottal source shaping)
>* Error sequence: models the excitation

* **Calculating the LP Coefficients $\Rightarrow$ Normal Equations**

>$$E_p=\sum_n e[n]^2=\sum_n \left( s[n] - \sum^p_{k=1}a_ks[n-k] \right)^2$$

>$$\frac{\partial E_p}{\partial a_k}=0 \;\;\;\Rightarrow\;\;\; \sum_n s[n]s[n-k]=\sum^p_{i=1}a_i\sum_ns[n-i]s[n-k]\;\;\;,\;\;\;k=1,...,p$$

## 3.4. Autocorrelation Method

* **Autocorrelation Sequence** $r_{i-k}$

>$$r_{i-k}=\sum^\infty_{n=-\infty} s[n-i]s[n-k] = \sum^\infty_{n=-\infty} s[n]s[n+i-k]$$

* **Normal Equations**

>$$r_k=\sum^p_{i=1}a_ir_{i-k}\;\;\;,\;\;\;k=1,...,p$$
>* For $N$ sample window,
>$$r_k=\sum^{N-k-1}_{n=0}s[n]s[n+k]$$
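
The windowed autocorrelation is a short sum for each lag. A minimal sketch, with the function name as an assumption:

```python
def autocorrelation(s, max_lag):
    """r_k = sum_{n=0}^{N-k-1} s[n] * s[n+k] for k = 0..max_lag."""
    n_len = len(s)
    return [sum(s[n] * s[n + k] for n in range(n_len - k))
            for k in range(max_lag + 1)]

r = autocorrelation([1, 2, 3], max_lag=2)
```

For an order-$p$ predictor, lags $0$ to $p$ are needed.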

* **Durbin's Algorithm**

>\begin{align}
k_i &= \left( r_i - \sum^{i-1}_{j=1}a_j^{(i-1)}r_{i-j} \right) / E_p^{(i-1)} \\
a^{(i)}_i &= k_i \\
a^{(i)}_j &= a^{(i-1)}_j-k_ia^{(i-1)}_{i-j} \;\;\;,\;\;\; 1 \leq j < i \\
E^{(i)}_p &= (1-k^2_i)E_p^{(i-1)} \\
A_i &= \left[ \frac{1-k_i}{1+k_i} \right] A_{i-1}
\end{align}

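>* $a_k^{(i)}$: LP parameters at iteration $i$
>* $E_p^{(i)}$: sum-squared prediction error at iteration $i$ $(E^{(0)}_p=r_0)$; since $|k_i|<1$ it cannot increase from one iteration to the next
>* $k_i$: reflection coefficients $(|k_i|<1)$
>* $A_i$: cross-sectional area of the $i$-th section in the equivalent lossless-tube model of the vocal tract (distinct from the polynomial $A(z)$)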
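
A sketch of the recursion in code, using the predictor convention $\hat{s}[n]=\sum_i a_i s[n-i]$ from Section 3.3. It uses the standard Levinson-Durbin update, in which each new coefficient set is built from the previous iteration's set. The function name and return layout are assumptions, and the tube-area update is left out.

```python
def durbin(r, p):
    """Solve the autocorrelation normal equations for an order-p predictor.

    r: autocorrelation values r[0..p].
    Returns (a, k, e): LP coefficients a_1..a_p, reflection coefficients
    k_1..k_p, and the final sum-squared prediction error.
    """
    a = [0.0] * (p + 1)      # a[j] holds a_j; a[0] is unused
    e = r[0]                 # E^(0) = r_0
    ks = []
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]   # uses the previous iteration's values
        a = new_a
        e *= (1.0 - k * k)
        ks.append(k)
    return a[1:], ks, e

a, k, err = durbin([1.0, 0.5, 0.1], p=2)
```

The result can be checked by substituting `a` back into the normal equations.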

## 3.5. Inverse Filtering
* **Inverse Filtering:** finding the error signal or prediction residual
* Example: the residual signal for a vowel waveform:

><img src="images/image08.png" width=400>

>* Most short term correlations seem to be lost in the error signal
>* The residual contains long term correlations due to pitch pulses
>* There is a spike at the pitch periods when prediction is poor
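
Inverse filtering amounts to passing the speech through $A(z)$. A minimal sketch, assuming samples before the start of the frame are zero (the function name is also an assumption):

```python
def lp_residual(s, a):
    """e[n] = s[n] - sum_{i=1..p} a_i * s[n-i], with a = [a_1, ..., a_p]."""
    e = []
    for n in range(len(s)):
        prediction = sum(a_i * s[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        e.append(s[n] - prediction)
    return e

# A decaying exponential is perfectly predicted by a_1 = 0.5 after the first sample
residual = lp_residual([1.0, 0.5, 0.25, 0.125], a=[0.5])
```

Only the first sample, which the predictor has no history for, leaves an error; this is the toy analogue of the spikes at pitch pulses.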

## 3.6. LP Spectrum

* **LP Spectrum**

>* Frequency response of the **Linear Prediction Filter** $H(z)$: smoothed approximation of the speech spectrum
>* Evaluate using:
>  * **Direct substitution**: set $z=e^{j\omega T}$ in $H(z)$ and compute the magnitude
>  * **DFT/FFT**: take the impulse response of the inverse filter $A(z)$, i.e. the coefficients $1,-a_1,\dots,-a_p$, zero-pad to $N$ points and compute its DFT; then invert the magnitude and scale by $G$ to obtain $|H(z)|$
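
The direct-substitution route can be sketched as below, evaluating $|G/A(z)|$ at evenly spaced frequencies from $0$ up to (but not including) $f_s/2$. The function name, the frequency grid and the return layout are assumptions.

```python
import cmath
import math

def lp_spectrum(a, gain, fs_hz, n_points=256):
    """Return (frequency_hz, |H|) pairs for H(z) = G / (1 - sum_i a_i z^-i)."""
    out = []
    for k in range(n_points):
        f = k * (fs_hz / 2) / n_points
        omega_t = 2 * math.pi * f / fs_hz            # omega * T
        a_of_z = 1 - sum(a_i * cmath.exp(-1j * omega_t * i)
                         for i, a_i in enumerate(a, start=1))
        out.append((f, gain / abs(a_of_z)))
    return out

spectrum = lp_spectrum([0.5], gain=1.0, fs_hz=8000, n_points=4)
```

With a single coefficient $a_1=0.5$ the response is largest at $0\text{Hz}$ and falls with frequency, as expected for a real pole at $z=0.5$.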

* **Applications of Linear Prediction**

>* **1. Pitch Extraction**
>* **2. Formant Analysis**
>* **3. Speech Synthesis/Coding**
>* **4. Speech Recognition Features**

## 3.7. Cepstral Analysis

* **Aim**

>$$\text{Signal Spectrum} \rightarrow \text{Excitation} \times \text{Vocal Tract} \times \text{Other Filtering Effects}$$

* **Homomorphic Filtering**

><img src="images/image09.png" width=400>

* **Cepstrum**

><img src="images/image10.png" width=400>

>* For real cepstrum, the log is applied to the magnitude spectrum
>* In the cepstrum, vocal tract impulse response decays rapidly and can be separated (by windowing) from the excitation
>* The cepstrum is computed in the **quefrency** domain and filtering in this domain is called **liftering**
>* Note that the IDFT used to compute the cepstrum does not return to the time domain, because of the non-linear log operation

><img src="images/image11.png" width=400>

>* Making $h$ smaller increases the smoothing over the whole spectrum
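
A sketch of the real cepstrum and a simple low-quefrency lifter, written with a direct DFT so that it stays self-contained. The function names, the small floor inside the log and the `cutoff` parameter are assumptions for illustration and are not tied to the symbols in the figures.

```python
import cmath
import math

def real_cepstrum(frame):
    """IDFT of the log magnitude spectrum of a frame."""
    n_len = len(frame)
    spectrum = [sum(frame[n] * cmath.exp(-2j * cmath.pi * n * p / n_len)
                    for n in range(n_len)) for p in range(n_len)]
    # Floor the magnitude to avoid log(0) (assumption of this sketch)
    log_mag = [math.log(max(abs(v), 1e-12)) for v in spectrum]
    # The log magnitude of a real frame is real and symmetric, so the IDFT is real
    return [sum(log_mag[p] * cmath.exp(2j * cmath.pi * n * p / n_len)
                for p in range(n_len)).real / n_len for n in range(n_len)]

def low_time_lifter(cepstrum, cutoff):
    """Keep quefrencies below cutoff (and their mirror images); zero the rest."""
    n_len = len(cepstrum)
    return [c if (n < cutoff or n > n_len - cutoff) else 0.0
            for n, c in enumerate(cepstrum)]
```

Transforming the liftered cepstrum back with a DFT would give a smoothed log spectrum, with the rapidly varying excitation component removed.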

* **Applications of Cepstral Analysis**

>* **1. Pitch Estimation**
>* **2. Smoothed Spectrum**
>* **3. Speech Coding**
>* **4. Recognition**

* **Mel-Scale Filterbanks**
  * The energy in each frequency band is computed from the DFT

><img src="images/image12.png" width=300>

>$$\textbf{Mel-scale:}\;\;\;\text{Mel}(f)=2595\log_{10}{\left(1+\frac{f}{700}\right)}$$
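
The mel mapping and its inverse are easy to compute directly. The sketch below also places filter centre frequencies at equal spacing on the mel scale; that placement rule and the function names are assumptions, since the notes give only the mapping itself.

```python
import math

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595 * math.log10(1 + f_hz / 700)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700 * (10 ** (mel / 2595) - 1)

def mel_centres(n_channels, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) equally spaced on the mel scale (assumed layout)."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (n_channels + 1)
    return [mel_to_hz(m_low + i * step) for i in range(1, n_channels + 1)]

centres = mel_centres(n_channels=24, f_low_hz=0, f_high_hz=8000)
```

The spacing between centres grows with frequency, giving finer resolution at low frequencies.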

><img src="images/image13.png" width=300>

* **Discrete Cosine Transform**
  * Simplified version of DFT
  * Uses the fact that the log magnitude spectrum is real-valued, symmetric w.r.t. 0 and periodic in frequency
  
>$$c_n=\sqrt{\frac{2}{P}}\sum^P_{i=1}m_i\cos{\left[ \frac{n(i-\frac{1}{2})\pi}{P} \right]}$$
>
>* $P$: no. of filterbank channels
>* $c_n$: **MFCCs(Mel-Frequency Cepstral Coefficients)**
>* $c_0$: measure of the signal energy
>* DCT **decorrelates** the spectral coefficients and allows them to be modelled with diagonal Gaussian distributions
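
The DCT formula translates directly into code. This sketch assumes `m` holds the $P$ log filterbank energies $m_1..m_P$; the function name and the `n_ceps` parameter are assumptions.

```python
import math

def mfcc_from_log_energies(m, n_ceps):
    """c_n = sqrt(2/P) * sum_{i=1..P} m_i * cos(n * (i - 1/2) * pi / P), n = 0..n_ceps-1."""
    p_channels = len(m)
    scale = math.sqrt(2.0 / p_channels)
    return [scale * sum(m[i - 1] * math.cos(n * (i - 0.5) * math.pi / p_channels)
                        for i in range(1, p_channels + 1))
            for n in range(n_ceps)]

# A flat log spectrum puts everything into c_0
flat = mfcc_from_log_energies([1.0, 1.0, 1.0, 1.0], n_ceps=3)
```

For the flat input only $c_0$ is non-zero, which matches its reading as an overall energy term.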

* **PLP(Perceptual Linear Prediction)**
  * Alternative for MFCC
  
>* 1. First the power spectrum is computed on a non-linear frequency scale (Bark scale)
>* 2. The power is compressed (e.g. with a power-law compression) and other compensation for the frequency sensitivity of human hearing is applied
>* 3. Autocorrelation coefficients are obtained from inverse DFT of the power spectrum
>* 4. LP coefficients are computed from the autocorrelation coefficients using Durbin's algorithm
>* 5. Cepstral coefficients are obtained from the LP coefficients using the LP-to-cepstrum recursion
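
The last step can be sketched with the standard recursion for an all-pole model $1/A(z)$, $A(z)=1-\sum_i a_iz^{-i}$: $c_n=a_n+\sum_{k=1}^{n-1}\frac{k}{n}c_ka_{n-k}$, with $a_n=0$ for $n>p$. The notes do not state this recursion, so treat it, the function name and the omission of the gain term $c_0$ as assumptions.

```python
def lp_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1/A(z), with a = [a_1, ..., a_p]."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)        # c[0] (log gain) is not computed here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:          # a_{n-k} is zero beyond the model order
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# For a single pole, 1 / (1 - 0.5 z^-1), the cepstrum is 0.5^n / n
ceps = lp_to_cepstrum([0.5], n_ceps=3)
```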