<div style="margin: 0 auto 10px; height: 70px; border: 2px solid gray; border-radius: 6px;">
  <div style="float: left; margin: 5px 10px 5px 10px; "><img src="img/bfh.jpg" /></div>
  <div style="float: right; margin: 20px 30px 0; font-size: 15pt; font-weight: bold; color: #98b7d2;"><a href="https://moodle.bfh.ch/course/view.php?id=39255" style="color: #98b7d2;">BTE5476 - Project-Oriented Digital Signal Processing </a></div>
</div>
<div style="clear: both; font-size: 30pt; font-weight: bold; color: #64788b; margin-left: 30px;">
    Autotune
</div>

In [None]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as sp
from IPython.display import Audio
from scipy.io import wavfile

In [None]:
plt.rcParams['figure.figsize'] = 14, 4 

In [None]:
def multiplay(clips, sf, title=None):
    outs = [widgets.Output() for c in clips]
    for ix, item in enumerate(clips):
        with outs[ix]:
            print(title[ix] if title is not None else "")
            display(Audio(item, rate=sf, normalize=True))
    return widgets.HBox(outs)

In [None]:
# pre-load audio samples in a dictionary
audio, audio_sf = {}, None
for file in ['speech', 'music', 'yesterday', 'voiced', 
            'sing-whisper', 'sing-robot', 'sing-daft', 
            'talk-whisper', 'talk-robot', 'talk-daft'] :
    sf, x = wavfile.read(f'data/{file}.wav')
    if audio_sf is None:
        audio_sf = sf
    else:
        assert sf == audio_sf, 'notebook requires sampling rates for all audio samples to be the same'
    audio[file] = (x - np.mean(x)) / 32767.0

In [None]:
def plot_spec(x, Fs, max_freq=None, do_fft=True):
    C = int(len(x) / 2)  # positive frequencies only
    if max_freq:
        C = int(C * max_freq / float(Fs) * 2) 
    X = np.abs(np.fft.fft(x)[0:C]) if do_fft else x[0:C]
    N = Fs * np.arange(0, C) / len(x)
    plt.plot(N, X)
    return N, X

In [None]:
def ms2n(ms, sf):
    return int(float(sf) * ms / 1000.0)

In [None]:
# frequency ratio between successive semitones in 12-tone equal temperament
semitone = 2 ** (1.0 / 12)

# Auto-Tune

<img width="300" style="float: right; margin: 0 30px 0 30px;" src="img/autotune.jpg"> 

[Auto-Tune](https://en.wikipedia.org/wiki/Auto-Tune) is a commercial pitch correction algorithm designed specifically for vocal audio tracks; by combining pitch estimation with pitch shifting, an autotune plug-in can remove any slightly out-of-tune notes in a singer's performance. In this notebook we will study how to adapt a standard pitch-shifting algorithm to the specificities of the human voice.
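
As a rough illustration of the "correction" step, the sketch below computes the pitch-shift factor that moves an estimated pitch onto the nearest note of the 12-tone equal-tempered scale. The helper name `correction_factor` and the 440 Hz reference are assumptions made for this example only; pitch estimation itself is not shown.

```python
import numpy as np

def correction_factor(f0, ref=440.0):
    # hypothetical helper: distance of f0 from the reference note, in semitones
    semis = 12 * np.log2(f0 / ref)
    # frequency of the nearest equal-tempered note
    target = ref * 2 ** (np.round(semis) / 12)
    # pitch-shift factor that maps f0 onto the target note
    return target / f0

beta = correction_factor(450.0)   # a slightly sharp A4 needs a factor below 1
```

A factor smaller than one lowers the pitch, a factor larger than one raises it; a note that is already in tune yields a factor of exactly one.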

# Pitch shifting: recap

The pitch shifting methods we have implemented in the previous notebooks used time scaling followed by resampling; more precisely, to shift the pitch of an audio signal by a factor $\beta$:
 * we time-scale the signal by a factor $1/\beta$ using a phase vocoder (i.e. increase its length if raising the pitch and vice-versa)
 * we resample the time-scaled signal by a factor $\beta$
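
The length bookkeeping of the two steps can be checked with a few lines of arithmetic. The figures used here (a 16000-sample input, a shift of three semitones) are arbitrary assumptions.

```python
semitone = 2 ** (1 / 12)
beta = semitone ** 3                       # shift up by three semitones
n_in = 16000                               # assumed input length, in samples

n_stretched = int(n_in * beta)             # after time scaling by 1/beta
n_out = int(n_stretched / beta + 0.5)      # after resampling by beta
```

When raising the pitch the intermediate signal is longer than the input, and the resampling step brings it back to (approximately) the original duration.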

## The building blocks

In [None]:
def hann_cola(N: int, K=2) -> (np.ndarray, int, int):
    if K <= 1:
        return np.ones(N), N, N
    S = N // K
    N = K * S + 1
    win = (2 / K) * np.sin(np.pi * np.arange(0, N) / (N-1)) ** 2
    return win, N, S
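
The constant-overlap-add (COLA) property of this window can be verified numerically. The sketch below rebuilds the same window under a different, hypothetical name (`hann_window`) so that it runs on its own, and sums shifted copies spaced by the stride; the grain size and overlap factor are arbitrary.

```python
import numpy as np

def hann_window(N, K):
    # same construction as above: K-fold overlapping Hann window, scaled by 2/K
    S = N // K
    N = K * S + 1
    return (2 / K) * np.sin(np.pi * np.arange(N) / (N - 1)) ** 2, N, S

win, N, S = hann_window(1024, 4)

# overlap-add 20 copies of the window, each shifted by the stride S
acc = np.zeros(N + 19 * S)
for k in range(20):
    acc[k * S:k * S + N] += win

# away from the edges, the shifted windows sum to a constant
steady = acc[N:-N]
```

In the steady-state region the sum is equal to one, which is why the factor $2/K$ appears in the window definition.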

In [None]:
def resample(x, alpha):
    N = len(x)
    y, m = np.zeros(int(N / alpha + 0.5)), 0
    while True:
        t = m * alpha 
        n = int(np.floor(t))
        if n < N - 1:
            y[m] = (n + 1 - t) * x[n] + (t - n) * x[n + 1] 
            m += 1
        else:
            break
    return y[:m]
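
For reference, the same linear interpolation can be written in vectorized form with `np.interp`. This is a sketch of an equivalent approach, not the implementation used in the rest of the notebook; the name `resample_interp` is hypothetical and the result may differ from the loop above by one sample at the very end because of floating-point rounding.

```python
import numpy as np

def resample_interp(x, alpha):
    # read the input at the fractional positions t = m * alpha, for t < len(x) - 1
    t = np.arange(0, len(x) - 1, alpha)
    return np.interp(t, np.arange(len(x)), x)

ramp = np.arange(10.0)
slow = resample_interp(ramp, 0.5)   # about twice as many samples
fast = resample_interp(ramp, 2.0)   # about half as many samples
```

On a linear ramp the interpolated values coincide with the reading positions themselves, which makes the behavior easy to inspect.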

In [None]:
def timescale_pv(x, alpha, grain_size, K=4):
    win, N, S = hann_cola(grain_size, K)
    a_hop, s_hop = int(S * alpha), S 
    
    wk = 2 * np.pi * np.arange(0, N) / N
    phase = prev_phase = np.angle(np.fft.fft(x[:N]))

    def wrap_angle(w):
        return w - 2 * np.pi * np.round(w / 2 / np.pi)
    
    y = np.zeros(int(len(x) / alpha + 1))    
    n, m = 0, 0
    while n < len(x) - N and m < len(y) - N:
        grain_fft = np.fft.fft(win * x[n:n + N])
        grain_phase = np.angle(grain_fft)
        phase_delta = wrap_angle(grain_phase - prev_phase - a_hop * wk)
        prev_phase = grain_phase
        phase = wrap_angle(phase + s_hop * (wk + phase_delta / a_hop))
        grain = np.real(np.fft.ifft(np.abs(grain_fft) * np.exp(1j * phase)))
        y[m:m+N] += grain * win
        n += a_hop
        m += s_hop
    return y

## Pitch shifting (two passes)

In [None]:
def pitchshift_pv(x, beta, grain_size, K=3):
    return(resample(timescale_pv(x, 1 / beta, grain_size, K), beta))

In [None]:
shift = 3

for file, gs_ms in [('speech', 80), ('music', 200)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [ audio[file], *( pitchshift_pv(audio[file], semitone ** s, gs) for s in [shift, -shift] ) ], 
        audio_sf, 
        [f'{file} sample', 'higher pitch', 'lower pitch' ]
    ))    

## Warm-up exercise (optional): single-pass implementation

Implement the pitch shifting algorithm in causal form (i.e. using a single pass). Here is a possible approach (note how the analysis and synthesis hops are both equal to the stride):

 * choose the synthesis grain length $N$; the analysis grain size will be $M=\lfloor \beta N \rfloor$
 * choose the grain overlap factor $K$ and obtain the COLA synthesis stride $S$ 
 * iterate over $k = 0, 1, \ldots$:
     * extract a grain of length $M$ from the input at time $n = kS$
     * resample the grain with factor $\beta$ to obtain a grain of length $N$
     * compute the phase offset wrt the previous output grain and update the global phase
     * set the new grain's phase to the global phase and obtain  $g'_k[n]$
     * add $w[n-kS]g'_k[n-kS]$ to the output signal

In [None]:
def pitchshift_pv_rt(x, beta, grain_size, K=3):
    win, N, S = hann_cola(grain_size, K)
    y = np.zeros_like(x)   
    
    # your code here...
    
    return y

In [None]:
for file, gs_ms in [('speech', 120), ('music', 200)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [ audio[file], *( pitchshift_pv_rt(audio[file], semitone ** s, gs) for s in [shift, -shift] ) ], 
        audio_sf, 
        [f'{file} sample', 'higher pitch', 'lower pitch' ]
    ))    

## Pitch-shifting vocals

When we pitch-shift a singing voice via time scaling and resampling, the result doesn't sound very natural:

In [None]:
file, gs_ms = 'yesterday', 160

shift = 2

display(multiplay(
    [ audio[file], *( pitchshift_pv(audio[file], semitone ** s, ms2n(gs_ms, audio_sf)) for s in [shift, -shift] ) ], 
    audio_sf, 
    [f'{file} sample', 'higher pitch', 'lower pitch' ]
))    

This is due to the particular structure of a voice signal, which the pitch shifting algorithm needs to take into account.

# Linear Predictive Coding of speech signals

LPC is a speech analysis technique widely used to compress voice signals to very low bitrates. It is based on a source-filter model for speech production and on the estimation of the time-varying model parameters.

## The source-filter model for voice

<img width="400" style="float: right; margin-right: 30px;" src="img/lpc.jpg"> 

Vocal production can be modeled as an *excitation source signal* shaped by the frequency response of the *vocal tract*; the peaks of the frequency response are called the *formants* and their position changes according to the position of moving parts such as lips, tongue, jaw, etc. 

The source is time-varying and produces either:
 * a periodic signal generated by the vibration of the vocal cords (voiced sounds)
 * a noise-like signal generated by an air flow (unvoiced sounds)
 
The filter's response depends on:
 * the shape of the vocal tract (larynx, mouth, nose: time varying)
 * the resonances in head and chest (fixed)
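
To make the model concrete, here is a minimal sketch that synthesizes a crude voiced sound by passing an impulse train (the source) through a single two-pole resonance (the filter). The sampling rate, pitch, formant frequency and pole radius are all assumed values chosen for illustration; a real vocal tract has several formants.

```python
import numpy as np
import scipy.signal as sp

sf = 8000                          # assumed sampling rate, Hz
f0, formant, r = 100, 700, 0.97    # assumed pitch, resonance frequency, pole radius

# voiced excitation: one impulse per pitch period
e = np.zeros(sf)
e[::sf // f0] = 1.0

# vocal tract: a single resonance, i.e. a pole pair at r * exp(+/- j w0)
w0 = 2 * np.pi * formant / sf
a = [1.0, -2 * r * np.cos(w0), r * r]
x = sp.lfilter([1.0], a, e)

# the strongest spectral line is the harmonic of f0 closest to the formant
peak_hz = np.argmax(np.abs(np.fft.rfft(x))) * sf / len(x)
```

The spectrum of `x` consists of lines at multiples of the pitch frequency, whose amplitudes follow the frequency response of the resonator.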

The spectrum of a short voiced speech segment shows the harmonic nature of the sound and the overall envelope determined by the head and mouth:

In [None]:
# plot the spectrum of a short voiced speech segment and the estimated frequency response of the vocal tract
#  (the actual response will be computed explicitly later)
vs_env = np.fft.fft([1.0, -2.1793, 2.4140, -1.6790, 0.3626, 0.5618, -0.7047, 
                0.1956, 0.1872, -0.2878, 0.2354, -0.0577, -0.0815, 0.0946, 
                0.1242, -0.1360, 0.0677, -0.0622, -0.0306, 0.0430, -0.0169], len(audio['voiced']));
plot_spec(audio['voiced'], audio_sf);
plot_spec(np.abs(np.divide(1.0, vs_env)), audio_sf, do_fft=False);

Key observations:

 * the frequency response of the vocal tract is independent of the source excitation
 * to pitch-shift speech we should only modify the excitation
 * if we shift the whole signal, the envelope of the vocal tract is shifted as well and the voice sounds unnatural

## LPC analysis

The source-filter model for voice production is:

$$
    X(z) = A(z)E(z)
$$

where
 * $A(z)$: frequency response of the vocal tract for the segment
 * $E(z)$: excitation signal (either noise or periodic oscillation)

Both are unknown and we need to estimate _both_ from the samples $x[n]$. Under some reasonable assumptions, $A(z)$ can be determined by computing the so-called Linear Prediction Coefficients (LPC) of a speech segment; the analysis windows must be short enough that the nature of the source and the frequency response of the filter do not change within each segment.

### The all-pole filter model

We model the vocal tract as an all-pole filter of order $p$; this is reasonable since the vocal tract is a passive filter with clear resonance peaks:

$$
  A(z) = \frac{1}{1 - \sum_{k=1}^{p}a_kz^{-k}}
$$ 

In the time domain we have
$$
  x[n] = \sum_{k=1}^{p}a_k x[n-k] + e[n]
$$

### The AR estimation problem

In the absence of an external excitation, the all-pole model states that $x[n]$ is fully determined by $x[n-1], \dots, x[n-p]$; we call this an *autoregressive* (AR) signal. 

With an external excitation, if we knew the $a_k$ we could predict $x[n]$ from its past values and the prediction "error" would be the excitation itself, which is assumed to be independent of the past samples:

$$
  e[n] = x[n] - \sum_{k=1}^{p}a_k x[n-k]
$$

The optimal set of $a_k$ minimizes the energy of the unpredictable error component, $E[e^2[n]]$:
 * if $E[e^2[n]]$ is minimized, then $e[n]$ is orthogonal to the past samples $x[n-1], \dots, x[n-p]$ used for the prediction (see the [orthogonality principle](https://en.wikipedia.org/wiki/Orthogonality_principle))
 * orthogonal signals share no linearly predictable information: we have separated the excitation from the filter!
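
The orthogonality property can be observed numerically. The sketch below generates a synthetic AR(2) signal with assumed coefficients, estimates the $a_k$ by ordinary least squares (a stand-in for the autocorrelation method, chosen here for brevity), and computes the inner products between the residual and the past samples.

```python
import numpy as np
import scipy.signal as sp

rng = np.random.default_rng(0)
a_true = np.array([1.5, -0.8])        # assumed AR(2) coefficients (stable filter)
e = rng.standard_normal(5000)         # white excitation
x = sp.lfilter([1.0], np.r_[1.0, -a_true], e)

p = len(a_true)
# each row holds the past samples x[n-1], ..., x[n-p] used to predict x[n]
past = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
a_hat, *_ = np.linalg.lstsq(past, x[p:], rcond=None)

# prediction error and its inner product with each column of past samples
residual = x[p:] - past @ a_hat
inner = past.T @ residual
```

The entries of `inner` are zero up to numerical precision, and `a_hat` is close to the coefficients used to generate the signal.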

### The linear prediction coefficients

The coefficients of the filter $A(z)$ can be found as

$$
    \begin{bmatrix}
        r_0 & r_1 & r_2 & \ldots & r_{p-1} \\
        r_1 & r_0 & r_1 & \ldots & r_{p-2} \\        
        & & & \vdots \\
        r_{p-1} & r_{p-2} & r_{p-3} & \ldots & r_{0} \\
    \end{bmatrix}
    \begin{bmatrix}
        a_1 \\
        a_2 \\        
        \vdots \\
        a_{p}
    \end{bmatrix} = 
    \begin{bmatrix}
        r_1 \\
        r_2 \\        
        \vdots \\
        r_{p}
    \end{bmatrix}
$$

where $r$ is the biased autocorrelation of the $N$-point input data:

$$
  r_m = (1/N)\sum_{k = 0}^{N-m-1}x[k]x[k+m]
$$

Because of the Toeplitz structure of the autocorrelation matrix, the system of equations can be solved very efficiently using the Levinson-Durbin algorithm. Here is a direct implementation of the method:

In [None]:
def autocorrelation(x: np.ndarray, p: int) -> np.ndarray:
    # compute the biased autocorrelation for x up to lag p
    N = len(x)
    assert N > p, f'vector x must contain at least {p + 1} elements'
    r = np.zeros(p+1)
    for m in range(0, p+1):
        for n in range(0, N-m):
            r[m] += x[n] * x[n+m]
        r[m] /= float(N)
    return r

In [None]:
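def levinson_durbin(r: np.ndarray, p: int) -> np.ndarray:
    assert len(r) > p, f'vector r must have at least {p + 1} elements'
    # order-1 solution: reflection coefficient and prediction error power
    g = r[1] / r[0]
    a = np.array([g])
    v = (1. - g * g) * r[0]
    for i in range(1, p):
        # reflection coefficient for order i+1 (a is stored in reverse order)
        g = (r[i+1] - np.dot(a, r[1:i+1])) / v
        a = np.r_[ g,  a - g * a[i-1::-1] ]
        v *= 1. - g * g
    # return the coefficients of the prediction-error filter 1 - sum_k a_k z^{-k},
    #  i.e. the denominator of A(z)
    return np.r_[1, -a[::-1]]     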

In [None]:
def lpc(x: np.ndarray, p: int) -> np.ndarray:
    # compute p LPC coefficients for a speech segment
    return levinson_durbin(autocorrelation(x, p), p)
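
As a cross-check, the same normal equations can be solved with library routines. The sketch below assumes a random test signal and an arbitrary order; it computes the biased autocorrelation with `np.correlate` and compares SciPy's Toeplitz solver with a general-purpose solver applied to the explicit matrix.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

rng = np.random.default_rng(1)
x = rng.standard_normal(512)          # assumed test signal
p = 10                                # assumed prediction order
N = len(x)

# biased autocorrelation for lags 0..p (zero lag sits at index N-1 of the full output)
r = np.correlate(x, x, mode='full')[N - 1:N + p] / N

# solve R a = [r_1 ... r_p], exploiting the Toeplitz structure
a_fast = solve_toeplitz(r[:p], r[1:p + 1])
# same system, built explicitly and solved with a general-purpose routine
a_ref = np.linalg.solve(toeplitz(r[:p]), r[1:p + 1])
```

Both solutions contain the coefficients $a_1, \ldots, a_p$; the filter coefficients in the convention used by `lpc` would be `np.r_[1, -a_fast]`.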

Let's plot the previous figure again, this time computing the envelope directly:

In [None]:
x = audio['voiced']
a = lpc(x, 20)

A = np.abs(np.divide(1.0, np.fft.fft(a, len(audio['voiced']))))  

plot_spec(x, audio_sf);
plot_spec(A, audio_sf, do_fft=False);

In [None]:
exc = sp.lfilter(a, [1], x)
plt.plot(x)
plt.plot(exc);
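
Inverse filtering and forward filtering undo each other, which is what makes the source-filter decomposition reversible. A minimal sketch, assuming an arbitrary stable set of filter coefficients and a random test signal:

```python
import numpy as np
import scipy.signal as sp

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)        # assumed test signal
a = [1.0, -1.5, 0.8]                 # assumed stable prediction-error filter

exc = sp.lfilter(a, [1.0], x)        # inverse (FIR) filter: recover the excitation
x_rec = sp.lfilter([1.0], a, exc)    # forward (all-pole) filter: resynthesize the signal
```

With zero initial conditions on both filters, `x_rec` matches `x` up to numerical precision.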

### Aside: LPC speech compression

Basic encoding:
 * segment speech in 20ms chunks
 * find $A(z)$ using LPC analysis
 * inverse filter to obtain the excitation signal 
 * determine if $e[n]$ is periodic (voiced) and if so find the period
 * data per chunk: $p$ LPC coefficients + source period

Typical compression rate: 
 * PCM speech signal sampled at 8kHz, 8 bits/sample: 64 kb/s
 * LPC-coded, 40ms frames, 20 LPC coefficients: 4 kb/s

Mobile phones (GSM) use LPC-based encoding for voice transmission.

## Vocal pitch shifting using LPC

Once we have separated the source from the filter, vocal pitch shifting can be performed by pitch shifting only the excitation; here is a single-pass implementation using the phase vocoder.

For a pitch-shift factor $\beta$: 

 * choose a *synthesis* grain size $N$
 * choose the grain overlap factor $K$ and obtain the COLA synthesis stride $S$ 
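 * the _analysis_ grain size will be $M = \lfloor \beta (N + p) \rfloor$, where $p$ is the LPC order (the extra $p$ samples leave some margin so that the resampled grain is at least $N$ samples long)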
 * iterate over $k = 0, 1, \ldots$:
     * extract a grain of length $M$ from the input at time $n=kS$
     * compute the LPC coefficients for the grain
     * inverse-filter the grain to recover the excitation signal
     * resample the excitation with a factor $\beta$
     * forward-filter the resampled excitation to apply the original vocal tract envelope
     * adjust the phase of the grain with the phase vocoder
     * add the resynthesized grain to the output signal at $n=kS$
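
The effect of shifting only the excitation can be seen on a synthetic example. The sketch below uses an impulse train through a known one-formant filter, so the LPC estimation step is skipped and the true filter is used instead; all numerical values and the helper names `peak_hz` and `resample_interp` are assumptions. Time scaling is ignored, so both outputs are shorter than the input.

```python
import numpy as np
import scipy.signal as sp

sf, f0, formant, r, beta = 8000, 100, 800, 0.97, 2.0   # all values are assumptions

def peak_hz(s):
    # frequency of the strongest spectral line
    return np.argmax(np.abs(np.fft.rfft(s))) * sf / len(s)

def resample_interp(s, alpha):
    # linear-interpolation resampler (stand-in for the notebook's resample)
    return np.interp(np.arange(0, len(s) - 1, alpha), np.arange(len(s)), s)

# synthetic voiced sound: impulse train through a known one-formant filter
e = np.zeros(sf)
e[::sf // f0] = 1.0
w0 = 2 * np.pi * formant / sf
a = [1.0, -2 * r * np.cos(w0), r * r]
x = sp.lfilter([1.0], a, e)

# naive approach: resample the whole signal, the formant moves with the pitch
naive = resample_interp(x, beta)

# source-filter approach: resample the excitation only, then re-apply the filter
exc = sp.lfilter(a, [1.0], x)
shifted = sp.lfilter([1.0], a, resample_interp(exc, beta))
```

With these values the spectral peak of `naive` moves from 800 Hz to 1600 Hz, while the peak of `shifted` stays at 800 Hz even though the spacing between harmonics has doubled.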

In [None]:
def pitchshift_lpc(x, beta, grain_size, K=3, order=20):
    win, N, S = hann_cola(grain_size, K)
    M = int((N + order) * beta)
    
    wk = 2 * np.pi * np.arange(0, N) / N
    phase = prev_phase = np.angle(np.fft.fft(win * x[:N]))

    filter_state = np.zeros(order)

    def wrap_angle(w):
        return w - 2 * np.pi * np.round(w / 2 / np.pi)
    
    y = np.zeros_like(x)    
    for n in range(0, len(x) - max(N, M), S):
        chunk = x[n:n+M]
        a = lpc(chunk, order)
        exc = sp.lfilter(a, [1], chunk)
        exc = resample(exc, beta)
        grain, filter_state = sp.lfilter([1], a, exc, zi=filter_state)

        grain_fft = np.fft.fft(win * grain[:N])
        grain_phase = np.angle(grain_fft)
        phase_delta = wrap_angle(grain_phase - prev_phase - S * wk)
        prev_phase = grain_phase
        phase = wrap_angle(phase + S * wk + phase_delta)
        grain = np.real(np.fft.ifft(np.abs(grain_fft) * np.exp(1j * phase)))
        y[n:n+N] += grain * win
    return y

In [None]:
file, gs_ms = 'yesterday', 160

shift = 2

for algo in [pitchshift_pv, pitchshift_lpc]:
    display(multiplay(
        [ audio[file], *( algo(audio[file], semitone ** s, ms2n(gs_ms, audio_sf), K=3) for s in [shift, -shift] ) ], 
        audio_sf, 
        ['vocal sample', f'higher pitch ({algo.__name__})', f'lower pitch ({algo.__name__})' ]
    ))    

# Fun with the vocoder

The digital phase vocoder that we implemented earlier is a modern variation of the original analog [vocoder](https://en.wikipedia.org/wiki/Vocoder), the first speech compression device. The vocoder was developed in the 1930s in order to increase the efficiency of existing telecommunication channels and it was later used to build the first voice encryption device used by the Allies during WWII.

The vocoder worked by performing a source-filter decomposition of the speech signal, much in the same way as what we achieved via LPC analysis. When the vocoder's output was resynthesized into a speech signal, however, it was quickly noticed that one could create very interesting and unexpected sounds by replacing the original source signal with a synthetic one. The vocoder quickly attracted the attention of artists and musicians and, to this day, its modern digital implementation is used to create a variety of sound effects.

As an example, here are three classic voice effects that can be achieved by applying a modified vocoder to our usual audio samples:

In [None]:
modes = ['whisper', 'robot', 'daft']
for voice in ['talk', 'sing']:
    display(widgets.VBox([widgets.Label(voice), multiplay([ audio[f'{voice}-{mode}'] for mode in modes ], audio_sf, modes) ]  ))

## Exercise

Modify the LPC-based autotune algorithm to reproduce as well as you can the three vocoder "styles" demonstrated above. Feel free to get creative and experiment!

In [None]:
def vocoder(x, mode, sf, **argv):
    # mode selects among 'whisper', 'robot', 'daft'
    
    y = np.zeros_like(x)    

    # your code here...
    
    return y

In [None]:
modes = ['whisper', 'robot', 'daft']
for file in ['speech', 'yesterday']:
    display(multiplay(
        [ audio[file], *( vocoder(audio[file], mode, audio_sf) for mode in modes ) ], 
        audio_sf, 
        [f'{file} (original)',  *modes]
    ))    