Correcting (19), thanks Bruce Mackinnon #50

Merged 3 commits on May 5, 2024
Binary file modified: doc/codec2.pdf (not shown)
doc/codec2.tex: 25 changes (20 additions & 5 deletions)
@@ -81,7 +81,7 @@ \subsection{Model Based Speech Coding}

A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbit/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bit/s for HF). Speech coding is the art of ``what can we throw away''. We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.

-At such low bit rates we use a speech production ``model''. The input speech is anlaysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20 ms.
+At such low bit rates we use a speech production ``model''. The input speech is analysed, and we extract model parameters, which are then sent over the channel. An example of a model based parameter is the pitch of the person speaking. We estimate the pitch of the speaker, quantise it to a 7 bit number, and send that over the channel every 20 ms.
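
As a quick check of the bit-rate arithmetic in the paragraphs above, here is an editor's sketch in C (not part of Codec 2; all names are invented for illustration):

/* Editor's sketch: bit-rate arithmetic from the text above. 16 bit
   samples at 8 kHz give 128 kbit/s of raw PCM, while one 7 bit pitch
   parameter every 20 ms costs only 350 bit/s. */
#include <stdio.h>

int main(void) {
    int fs = 8000;             /* A/D sample rate, Hz */
    int bits_per_sample = 16;  /* A/D resolution */
    double frame_s = 0.020;    /* parameter update interval, 20 ms */
    int pitch_bits = 7;        /* quantised pitch parameter */

    printf("raw PCM: %d bit/s\n", fs * bits_per_sample);    /* 128000 */
    printf("pitch:   %.0f bit/s\n", pitch_bits / frame_s);  /* 350 */
    return 0;
}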

The model based approach used by Codec 2 allows high compression, with some trade-offs such as noticeable artefacts in the decoded speech. Higher bit rate codecs (above 5000 bit/s), such as those used for mobile telephony or voice on the Internet, tend to pay more attention to preserving the speech waveform, or use a hybrid approach of waveform and model based techniques. They sound better but require a higher bit rate.

@@ -488,24 +488,39 @@ \subsection{Voicing Estimation}

For each band we first estimate the complex harmonic amplitude (magnitude and phase) using \cite{griffin1988multiband}:
\begin{equation}
+\label{eq:est_amp_mbe1}
B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) W^* (k - \lfloor mr \rceil)}{\sum_{k=a_m}^{b_m} |W (k - \lfloor mr \rceil)|^2}
\end{equation}
-where $r= \omega_0 N_{dft}/2 \pi$ is a constant that maps the $m$-th harmonic to a DFT bin, and $ \lfloor x \rceil$ is the rounding operator. As $w(n)$ is a real and even, $W(k)$ is real and even so we can write:
+where $r= \omega_0 N_{dft}/2 \pi$ is a constant that maps the $m$-th harmonic to a DFT bin, and $\lfloor x \rceil$ is the rounding operator. To avoid negative array indexes we define the shifted window function:
+\begin{equation}
+U(k) = W(k-N_{dft}/2)
+\end{equation}
+such that $U(N_{dft}/2)=W(0)$. As $w(n)$ is real and even, $W(k)$ is real and even, so we can write:
+\begin{equation}
+\begin{split}
+W^* (k - \lfloor mr \rceil) &= W(k - \lfloor mr \rceil) \\
+&= U(k - \lfloor mr \rceil + N_{dft}/2) \\
+&= U(k + l) \\
+l &= N_{dft}/2 - \lfloor mr \rceil \\
+& = \lfloor N_{dft}/2 - mr \rceil
+\end{split}
+\end{equation}
+for even $N_{dft}$. We can therefore write (\ref{eq:est_amp_mbe1}) as:
\begin{equation}
\label{eq:est_amp_mbe}
-B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) W (k + \lfloor mr \rceil)}{\sum_{k=a_m}^{b_m} |W (k + \lfloor mr \rceil)|^2}
+B_m = \frac{\sum_{k=a_m}^{b_m} S_w(k) U(k + l)}{\sum_{k=a_m}^{b_m} |U(k + l)|^2}
\end{equation}
Note this procedure is different to the $A_m$ magnitude estimation procedure in (\ref{eq:mag_est}), and is only used locally for the MBE voicing estimation procedure. Unlike (\ref{eq:mag_est}), the MBE amplitude estimation (\ref{eq:est_amp_mbe}) assumes the energy in the band of $S_w(k)$ is from the DFT of a sine wave, and $B_m$ is complex valued.
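
To make the indexing concrete, here is an editor's sketch in C of (\ref{eq:est_amp_mbe}), including the computation of the offset $l$; the array layout and the name est_Bm are assumptions for illustration, not Codec 2's actual source:

/* Sketch of the MBE band amplitude estimate B_m.
   Sw[]: DFT of windowed input speech (complex, Ndft bins).
   U[] : shifted real window spectrum, U[k] = W[k - Ndft/2]. */
#include <complex.h>
#include <math.h>

double complex est_Bm(const double complex Sw[], const double U[],
                      int a_m, int b_m, int m, double r, int Ndft) {
    int l = (int)floor(Ndft/2.0 - m*r + 0.5); /* l = round(Ndft/2 - mr) */
    double complex num = 0.0;                 /* sum of Sw(k) U(k+l) */
    double den = 0.0;                         /* sum of |U(k+l)|^2   */
    for (int k = a_m; k <= b_m; k++) {
        num += Sw[k] * U[k + l];
        den += U[k + l] * U[k + l];
    }
    return num / den;                         /* complex-valued B_m */
}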

The synthesised frequency domain speech for this band is defined as:
\begin{equation}
-\hat{S}_w(k) = B_m W(k + \lfloor mr \rceil), \quad k=a_m,...,b_m-1
+\hat{S}_w(k) = B_m U(k + l), \quad k=a_m,...,b_m-1
\end{equation}
The error between the input and synthesised speech in this band is then:
\begin{equation}
\begin{split}
E_m &= \sum_{k=a_m}^{b_m-1} |S_w(k) - \hat{S}_w(k)|^2 \\
-&=\sum_{k=a_m}^{b_m-1} |S_w(k) - B_m W(k + \lfloor mr \rceil)|^2
+&=\sum_{k=a_m}^{b_m-1} |S_w(k) - B_m U(k + l)|^2
\end{split}
\end{equation}
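
Continuing the same illustrative sketch (same invented names, not Codec 2 source), the band error $E_m$ follows directly:

#include <complex.h>

/* Residual energy in one band after subtracting the synthesised
   component B_m U(k+l), summed over k = a_m ... b_m-1. */
double band_error(const double complex Sw[], const double U[],
                  double complex Bm, int a_m, int b_m, int l) {
    double Em = 0.0;
    for (int k = a_m; k < b_m; k++) {
        double complex e = Sw[k] - Bm * U[k + l];
        Em += creal(e * conj(e));  /* |Sw(k) - Bm U(k+l)|^2 */
    }
    return Em;
}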
A Signal to Noise Ratio (SNR) is defined as: