# Speech enhancement by Wiener filtering - Voice activity detection and all pole modelling of the vocal tract
*Laurent Colbois, Lionel Desarzens  
COM-415 Audio and Acoustic Signal Processing course  
EPFL 2018-2019*

***Abstract***

## 1. Introduction

## 2. Review of existing techniques (How thorough ?)

## 3. Algorithm description

We denote by $y$ the noisy speech signal to enhance. It is the sum of a clean speech signal $s$ and a noise signal $n$, i.e. $y = s + n$. The denoising algorithm should produce a good estimate $\hat{s}$ of $s$. In practice, we build $y$ artificially by adding noise of our choice (either white noise, or noise from another sound file) to a clean speech sample, which in particular lets us freely control the SNR of $y$. The algorithm we use for speech enhancement is called *Iterative Wiener Filtering* and is presented below.
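As an illustration of how $y$ can be built at a chosen SNR, here is a minimal sketch in plain Python. The function name `mix_at_snr` and the toy signals (a sine wave standing in for clean speech) are assumptions made for the example, not the code used in this project.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Return y = s + c*n, with c chosen so that the SNR of y equals snr_db."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR_dB = 10*log10(p_speech / (c^2 * p_noise))  =>  solve for c
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy example: a sine wave standing in for clean speech, plus white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 0.01 * l) for l in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

The noise is rescaled rather than the speech, so the clean reference $s$ stays untouched for later comparison with $\hat{s}$.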

### 3.1 Wiener filtering
The denoising process relies on a Wiener filter applied in the STFT domain.
The main principle of a Wiener filter is the following : given an input signal $x$ and a target signal $d$, the Wiener filter is the linear filter that, when applied to $x$, produces an output $\hat{d}$ such that the error $e = \hat{d} - d$ is minimized in the least-squares (minimum mean-square error) sense. In the context of speech enhancement, the input signal is $y = s + n$ and the target signal is $s$. In general $s$ is not actually known; however, given some insight into the spectral properties of $s$, a reasonable Wiener filter can be constructed.

According to **Cite main article** or **Cite Loizou book**, in the context of speech enhancement a good Wiener filter, given in the frequency domain, is
$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega)+P_n(\omega)}$$
where $P_s$ and $P_n$ are the power spectral densities (PSDs) of the speech and of the noise, respectively. The main focus of this work is to build good estimates of those PSDs.
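The gain formula can be evaluated bin by bin once both PSDs are sampled on the same frequency grid. The sketch below assumes the PSDs are given as lists of non-negative numbers; the function name `wiener_gain` and the zero-gain convention for an empty bin are choices made for this example.

```python
def wiener_gain(speech_psd, noise_psd):
    """Per-bin gain H = P_s / (P_s + P_n); inputs are non-negative PSD samples."""
    gains = []
    for p_s, p_n in zip(speech_psd, noise_psd):
        total = p_s + p_n
        # Guard against an all-zero bin: nothing to keep, so the gain is 0.
        gains.append(p_s / total if total > 0 else 0.0)
    return gains

# Three bins: speech-dominated, balanced, noise-dominated.
print(wiener_gain([9.0, 1.0, 0.0], [1.0, 1.0, 1.0]))  # [0.9, 0.5, 0.0]
```

The gain is real and lies in $[0, 1]$: bins dominated by speech are left almost untouched, while bins dominated by noise are attenuated.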

We build those estimates with the following steps.
> 0. Split the signal into analysis frames (possibly overlapping)
> 1. Run the signal through a voice activity detection (VAD) module to label each frame as either speech or noise
> 2. If processing a noise frame : compute a noise PSD estimate
> 3. If processing a speech frame :  
    3.1 Read the noise PSD estimate from previous noise frames  
    3.2 Compute the speech PSD estimate  
    3.3 Apply the denoising procedure (build and apply the Wiener filter) to obtain an estimate $\hat{s}$ of $s$  
    3.4 Optional : compute a noise PSD estimate for the current frame from the residual $y - \hat{s}$, which can then be used when filtering future frames
    
The implementation details of the VAD algorithm, the noise PSD estimation and the speech PSD estimation are given in the following subsections.
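The control flow of the steps listed above can be sketched as a simple frame loop. Everything here is illustrative: `enhance_frames` and its four callables are hypothetical hooks standing in for the modules described in the next subsections, the optional residual-based noise update is left out, and passing noise frames through unchanged is an assumption, since the steps do not say what to output for them.

```python
def enhance_frames(frames, is_speech, noise_psd_of, speech_psd_of, wiener_filter):
    """Frame loop for steps 1 to 3. All four callables are hypothetical hooks."""
    noise_psd = None  # latest noise PSD estimate, updated on noise frames
    enhanced = []
    for frame in frames:
        if not is_speech(frame):
            # Step 2: noise frame, refresh the noise PSD estimate.
            noise_psd = noise_psd_of(frame)
            enhanced.append(frame)  # assumption: noise frames pass through
        elif noise_psd is None:
            # No noise frame seen yet: nothing to filter with, pass through.
            enhanced.append(frame)
        else:
            # Step 3: speech frame, estimate the speech PSD and filter.
            speech_psd = speech_psd_of(frame, noise_psd)
            enhanced.append(wiener_filter(frame, speech_psd, noise_psd))
    return enhanced

# Toy stand-ins: a frame is a list of samples, a "PSD" is a single number.
frames = [[0.1, -0.1], [1.0, -1.0], [0.1, 0.1]]
out = enhance_frames(
    frames,
    is_speech=lambda f: max(abs(x) for x in f) > 0.5,
    noise_psd_of=lambda f: sum(x * x for x in f) / len(f),
    speech_psd_of=lambda f, p_n: max(sum(x * x for x in f) / len(f) - p_n, 0.0),
    wiener_filter=lambda f, p_s, p_n: [x * p_s / (p_s + p_n) for x in f],
)
```

In the toy run, the first and last frames are labeled as noise and returned unchanged, while the middle frame is scaled by a gain close to 0.99.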

### 3.2 Voice Activity Detection (VAD)
Voice activity detection is a standard building block in speech processing. Its main purpose is, as stated above, to classify each frame as containing speech or not. Many different kinds of VAD exist (see [], [] and []), but since the VAD is not the main objective of this project, we used the one defined in [] (see github... for the implementation).

#### 3.2.1 Estimating noise floor level [...]
This VAD works using the average noise level. For each frame, the signal energy level is computed :

$$L^{(n)} = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left(W_k \cdot \left \lvert Y_{k}^{(n)} \right \rvert \right )^2}$$

where $W_k$ is a weighting function, and $Y_k^{(n)}$ is the $K$-point DFT of the $n$-th frame. The weighting function gives more importance to the frequencies where noise and speech differ the most.

As most of the noise energy is contained in the low part of the spectrum, using a high-pass weighting might be a good idea. Moreover, most of the speech energy is contained between $125\,\mathrm{Hz}$ and $300\,\mathrm{Hz}$.
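The level formula can be checked on a small frame. The sketch below uses a naive DFT in plain Python to stay dependency-free; the names `dft` and `frame_level` and the flat weighting used in the example are assumptions, not the weighting chosen in this project.

```python
import cmath
import math

def dft(frame):
    """Naive O(K^2) DFT, fine for a short illustrative frame."""
    K = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * l / K) for l, x in enumerate(frame))
            for k in range(K)]

def frame_level(frame, weights):
    """Weighted RMS level L = sqrt(1/K * sum_k (W_k * |Y_k|)^2)."""
    spectrum = dft(frame)
    K = len(spectrum)
    return math.sqrt(sum((w * abs(y)) ** 2 for w, y in zip(weights, spectrum)) / K)

# With flat weights (W_k = 1), Parseval gives L = sqrt(sum of squared samples).
frame = [1.0, 0.0, -1.0, 0.0]
print(frame_level(frame, [1.0] * 4))  # approximately 1.4142, i.e. sqrt(2)
```

Setting some $W_k$ to zero removes the corresponding bins from the level, which is how a high-pass weighting would discard low-frequency noise.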

#### 3.2.2 Dual time-constant integrator
With this method, the estimate of the noise floor level at the $n$-th frame is computed as follows :

$$L_{min}^{(n)} = \begin{cases} \left (1-\frac{T}{\tau_{up}} \right )L_{min}^{(n-1)}+\frac{T}{\tau_{up}}L^{(n)}, & L^{(n)}>L_{min}^{(n-1)} \\ 
\left (1-\frac{T}{\tau_{down}} \right )L_{min}^{(n-1)}+\frac{T}{\tau_{down}}L^{(n)}, & L^{(n)}\leq L_{min}^{(n-1)} \end{cases}$$

where $T$ is the frame duration, and $\tau_{up}$ and $\tau_{down}$ are the time constants used to track the noise floor when the level rises and falls, respectively.
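One step of this recursion is a first-order smoother whose coefficient depends on the direction of change. In the sketch below, the function name `update_noise_floor` and the numerical settings (10 ms frames, $\tau_{up} = 3$ s, $\tau_{down} = 50$ ms) are hypothetical values picked for illustration.

```python
def update_noise_floor(prev_floor, level, frame_dur, tau_up, tau_down):
    """One step of the dual time-constant integrator for L_min."""
    # Pick the time constant according to whether the level is above the floor.
    tau = tau_up if level > prev_floor else tau_down
    alpha = frame_dur / tau
    return (1.0 - alpha) * prev_floor + alpha * level

# Hypothetical settings: 10 ms frames, slow rise (3 s), fast fall (50 ms).
floor = 1.0
for level in [1.0, 5.0, 5.0, 0.5, 0.5]:
    floor = update_noise_floor(floor, level, frame_dur=0.01, tau_up=3.0, tau_down=0.05)
```

With $\tau_{up}$ much larger than $\tau_{down}$, the floor barely moves during short bursts of high level but quickly follows the level back down, which is the behaviour one would want from a noise floor tracker.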

#### 3.2.3 Decision
Finally, using the estimated noise floor level and the signal energy level, the decision of the VAD can be expressed as follows :
$$V^{(n)} = \begin{cases} 0 & \text{if } \frac{L^{(n)}}{L_{min}^{(n)}}<T_{down} \\ 
1 & \text{if } \frac{L^{(n)}}{L_{min}^{(n)}}>T_{up} \\
V^{(n-1)} & \text{otherwise}\end{cases}$$

where $T_{down} < T_{up}$ are two decision thresholds (not to be confused with the frame duration $T$), and $V^{(n)} = 1$ marks the $n$-th frame as speech.
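The decision rule is a two-threshold hysteresis, which can be sketched as follows. The function name `vad_decision`, the threshold values and the constant noise floor used in the example are assumptions chosen to make the behaviour visible.

```python
def vad_decision(level, noise_floor, prev_decision, t_down, t_up):
    """Hysteresis decision: 0 below t_down, 1 above t_up, else keep the previous one."""
    ratio = level / noise_floor
    if ratio < t_down:
        return 0
    if ratio > t_up:
        return 1
    return prev_decision

# Noise floor fixed at 1.0 for the example; thresholds are hypothetical.
decisions = []
state = 0
for level in [1.0, 2.5, 4.0, 2.5, 1.2]:
    state = vad_decision(level, 1.0, state, t_down=1.5, t_up=3.0)
    decisions.append(state)
print(decisions)  # [0, 0, 1, 1, 0]
```

The level 2.5 lies between the two thresholds and is labeled differently depending on the previous frame, which prevents the decision from toggling on every small fluctuation.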

#### 3.2.4 Further implementation
It has been shown in [???] that the method described above works well at high SNR. In some cases, it might be worthwhile to implement a second type of VAD that is more robust at low SNR. Such algorithms are called statistical VADs.

As the aim of this project was to reduce noise, we did not implement this kind of VAD.

### 3.3 Noise PSD estimation
All computations are done under the assumption that the noise is zero-mean white Gaussian with variance $\sigma_n^2$. In this case we simply have $P_n(\omega) = \sigma_n^2$.
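The flat PSD of white noise can be checked numerically. The sketch below is only an illustration of the assumption, not the update rule used in this project: it averages the periodogram $|Y_k|^2 / N$ of synthetic noise frames, where the normalization by $N$, the frame length, the number of frames and the value of $\sigma_n$ are all choices made for the example.

```python
import cmath
import random

def periodogram(frame):
    """|Y_k|^2 / N for each DFT bin k (naive O(N^2) DFT)."""
    N = len(frame)
    out = []
    for k in range(N):
        y_k = sum(x * cmath.exp(-2j * cmath.pi * k * l / N) for l, x in enumerate(frame))
        out.append(abs(y_k) ** 2 / N)
    return out

rng = random.Random(0)
sigma = 0.5                      # hypothetical noise standard deviation
N, n_frames = 32, 200
avg = [0.0] * N                  # periodogram averaged over all frames
for _ in range(n_frames):
    frame = [rng.gauss(0.0, sigma) for _ in range(N)]
    for k, p in enumerate(periodogram(frame)):
        avg[k] += p / n_frames
```

Every entry of `avg` fluctuates around $\sigma_n^2 = 0.25$, so a single scalar is enough to describe the noise PSD under this assumption.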

**Update rule for $\sigma$'s**

### 3.4 Speech PSD estimation : all-pole model
The estimation of the speech PSD relies on modelling speech production as an auto-regressive process. **Cite Lim, Oppenheim**. More specifically, the speech signal $s$ is written as
$$ s[l] = \sum\limits_{k=1}^{p} a_k s[l-k] + g\cdot w[l] $$
where $g$ is a gain factor, $w[l]$ is a simple excitation and the model order $p$ is a chosen parameter. For voiced speech, $w[l]$ is modeled as a simple periodic excitation, while it is modeled as white noise for unvoiced speech. Equivalently, the vocal tract acts on the excitation as a filter with the following transfer function :

$$ V(\omega) = \frac{g}{1- \sum\limits_{k=1}^{p} a_k e^{-jk\omega}} $$

In this case the speech PSD is

$$ P_s(\omega) = \frac{g^2}{\left| 1- \sum\limits_{k=1}^{p} a_k e^{-jk\omega} \right|^2}$$

with $g^2$ chosen such that

$$ \int_{-\pi}^{\pi} P_s(\omega) \,\text{d}\omega = \sum\limits_{k=0}^{N-1}y^2[k] - N\sigma_n^2 $$

where $N$ is the number of samples in the frame.
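Given a set of coefficients $a_k$, the PSD and the gain constraint can be evaluated numerically. The sketch below assumes the $a_k$ are already known (how they are estimated is not covered here); the function names, the Riemann-sum integration and the clamping of a negative energy target to zero are choices made for this example.

```python
import cmath
import math

def all_pole_shape(a, omega):
    """1 / |1 - sum_k a_k e^{-jk omega}|^2, i.e. P_s(omega) for g = 1."""
    denom = 1.0 - sum(a_k * cmath.exp(-1j * (k + 1) * omega) for k, a_k in enumerate(a))
    return 1.0 / abs(denom) ** 2

def gain_squared(a, noisy_frame, noise_var, n_grid=4096):
    """g^2 such that the integral of P_s over one period equals sum(y^2) - N*sigma_n^2."""
    step = 2.0 * math.pi / n_grid
    # Riemann sum of the g = 1 spectrum over [-pi, pi).
    integral = sum(all_pole_shape(a, -math.pi + i * step) for i in range(n_grid)) * step
    target = sum(y * y for y in noisy_frame) - len(noisy_frame) * noise_var
    # Clamp at zero: the noisy-frame energy can fall below N*sigma_n^2.
    return max(target, 0.0) / integral

# Toy first-order model (p = 1) on a four-sample frame.
a = [0.5]
g2 = gain_squared(a, noisy_frame=[1.0, 1.0, 1.0, 1.0], noise_var=0.5)
speech_psd_at_dc = g2 * all_pole_shape(a, 0.0)
```

For this first-order example the integral has the closed form $2\pi / (1 - a_1^2)$, which gives a convenient check of the numerical integration.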

## Results and discussion

### Evaluation metrics  
An obvious qualitative way of evaluating the performance of a speech enhancement system is simply to listen to the signal before and after the denoising operation. However we would like to also have a quantitative measure of the efficiency of the algorithm. We choose the two following metrics :

> - A posteriori SNR : from the original signal $y$ and the speech signal estimate $\hat{s}$, one can build an estimate $\hat{n} = y - \hat{s}$ of the noise signal. This makes it possible to compute the SNR of $\hat{s}$ with respect to $\hat{n}$, and to compare it with the true SNR of $s$ with respect to $n$. This a posteriori SNR should be higher than the true SNR if the denoising operation is successful.


> - Intelligibility : even when it increases the SNR, the denoising operation might add some distortion to the speech signal and decrease its intelligibility. In order to get a quantitative measure of intelligibility, we use a neural network pretrained for speech classification (Google Speech Commands Dataset). This network is asked to classify speech signals at various SNRs, and we compare its classification certainty for noisy speech and denoised speech. The code for this operation was largely inspired by [1].
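The first metric above can be computed in a few lines. This is a minimal sketch with hypothetical function names; SNR values are expressed in dB, using the ratio of signal energy to noise energy.

```python
import math

def snr_db(signal, noise):
    """10*log10 of the energy ratio between two equal-length sequences."""
    e_sig = sum(x * x for x in signal)
    e_noise = sum(x * x for x in noise)
    return 10.0 * math.log10(e_sig / e_noise)

def a_posteriori_snr_db(noisy, enhanced):
    """SNR of s_hat against the residual n_hat = y - s_hat."""
    residual = [y - s for y, s in zip(noisy, enhanced)]
    return snr_db(enhanced, residual)

# Toy check: the estimate keeps half of each noisy sample, so s_hat and n_hat
# have the same energy and the a posteriori SNR is 0 dB.
print(a_posteriori_snr_db([2.0, -2.0], [1.0, -1.0]))  # 0.0
```

Since $s$ and $n$ are known in our experiments, `snr_db(s, n)` gives the true SNR to compare against.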

## Conclusion

## Bibliography

[1] Alexis Mermet. Augmenting "pyroomacoustics" with machine learning utilities. Semester project, 2018. http://infoscience.epfl.ch/record/255834
