<div style="margin: 0 auto 10px; height: 70px; border: 2px solid gray; border-radius: 6px;">
  <div style="float: left; margin: 5px 10px 5px 10px; "><img src="img/bfh.jpg" /></div>
  <div style="float: right; margin: 20px 30px 0; font-size: 15pt; font-weight: bold; color: #98b7d2;"><a href="https://moodle.bfh.ch/course/view.php?id=39255" style="color: #98b7d2;">BTE5476 - Project-Oriented Digital Signal Processing </a></div>
</div>
<div style="clear: both; font-size: 30pt; font-weight: bold; color: #64788b; margin-left: 30px;">
    Phase vocoder
</div>

In [None]:
%matplotlib inline
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as sp
from IPython.display import Audio
from scipy.io import wavfile

plt.rcParams['figure.figsize'] = 14, 4 
plt.rcParams['image.cmap'] = 'tab10'

In [None]:
def multiplay(clips, sf, title=None):
    outs = [widgets.Output() for c in clips]
    for ix, item in enumerate(clips):
        with outs[ix]:
            print(title[ix] if title is not None else "")
            display(Audio(item, rate=sf, normalize=True))
    return widgets.HBox(outs)

In [None]:
# pre-load audio samples in a dictionary
audio, audio_sf = {}, None
for file in ['speech', 'music', 'cymbal', 'clarinet'] :
    sf, x = wavfile.read(f'data/{file}.wav')
    if audio_sf is None:
        audio_sf = sf
    else:
        assert sf == audio_sf, 'notebook requires sampling rates for all audio samples to be the same'
    audio[file] = (x - np.mean(x)) / 32767.0

In [None]:
def ms2n(ms, sf):
    return int(float(sf) * ms / 1000.0)

# Time scaling with granular synthesis

<img width="800" style="margin: 10px;" src="img/ola.png">
<div style="clear: both;"></div>

For a scaling factor $\alpha$: 

 * choose a grain size of $N$ samples (typically 40 ms for speech, 100 ms for music) 
 * choose the grain overlap factor $K$ and obtain the COLA stride $S$ 
 * $S$ is known as the _synthesis hop size_, since output grains will be spaced $S$ samples apart
 * the _analysis hop size_ is $S_a = \lfloor S \alpha \rfloor$ 
 * iterate over $k = 0, 1, \ldots$:
     * extract a grain from the input as $g_k[n] = x_i[n + kS_a]$  
     * add $w[n-kS]g_k[n-kS]$ to the output signal
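
The COLA (constant overlap-add) requirement in the second step can be checked numerically. The following is a minimal sketch, assuming a Hann-type window of length $KS+1$ scaled by $2/K$; the helper name `cola_window` and the chosen sizes are only illustrative:

```python
import numpy as np

def cola_window(N, K=2):
    # Hann-type window of length K*S + 1, scaled by 2/K (illustrative helper)
    S = N // K
    N = K * S + 1
    win = (2 / K) * np.sin(np.pi * np.arange(N) / (N - 1)) ** 2
    return win, N, S

win, N, S = cola_window(1024, K=4)

# overlap-add 20 copies of the window, spaced S samples apart
num_grains = 20
acc = np.zeros(num_grains * S + N)
for k in range(num_grains):
    acc[k * S : k * S + N] += win

# region where K shifted windows overlap completely
steady = acc[N : num_grains * S]
print("max deviation from 1:", np.max(np.abs(steady - 1)))
```

In the fully overlapped region the shifted windows sum to one up to rounding errors, so the overlap-add operation by itself does not modulate the amplitude of the output.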

In [None]:
def hann_cola(N: int, K=2) -> (np.ndarray, int, int):
    if K <= 1:
        return np.ones(N), N, N
    S = N // K
    N = K * S + 1
    win = (2 / K) * np.sin(np.pi * np.arange(0, N) / (N-1)) ** 2
    return win, N, S

In [None]:
def timescale_gs(x, alpha, grain_size, K=2):
    win, N, S = hann_cola(grain_size, K)
    y = np.zeros(int(len(x) / alpha + 1))
    a_hop, s_hop = int(S * alpha), S 
    n, m = 0, 0
    while n < len(x) - N and m < len(y) - N:
        y[m:m+N] += x[n:n+N] * win
        n += a_hop
        m += s_hop
    return y

In [None]:
factors = [1.4, 0.8]

for file, gs_ms in [('speech', 120), ('music', 200)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [audio[file], *(timescale_gs(audio[file], a, gs, K=3) for a in factors)], 
        audio_sf, 
        [f'{file} sample', *('slower' if a < 1 else 'faster' for a in factors)] 
    ))    

The technique works well for speeding up speech, but slowed-down speech still has clicking artefacts.

## The problem with "simple" granular synthesis

Granular synthesis seems to work well on non-pitched sounds but not so well on sustained periodic sounds.

In [None]:
alpha = 0.75

for file, gs_ms in [('cymbal', 80), ('clarinet', 100)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [audio[file], timescale_gs(audio[file], alpha, gs, K=3)], 
        audio_sf, 
        [f'{file}', 'time-stretched'] 
    ))    

It will be easier to investigate the problem if we use a simple sinusoid:

In [None]:
x = np.sin(2 * np.pi * (110 / audio_sf) * np.arange(0, audio_sf))
s = slice(1000, 3000)
plt.plot(x[s]);
Audio(x, rate=audio_sf)

In [None]:
gs = ms2n(40, audio_sf)
x_ts = timescale_gs(x, alpha, gs)
plt.plot(x_ts[s])
Audio(x_ts, rate=audio_sf)

The reason for the strange waveform becomes evident if we set the grain overlap to zero: phase jumps!

In [None]:
plt.plot(timescale_gs(x, alpha, gs, K=0)[s])
for n in range(gs - s.start % gs, s.stop - s.start, gs):
    plt.axvline(n, color='red', alpha=0.4)

## Phase jumps

The $k$-th grain extracted from the input is
$$
    g_{k}[n] = x_i[kS_a + n]
$$
which, for $x_i[n] = \sin(\omega_0 n)$, becomes
$$
    g_{k}[n] = \sin(\omega_0 n + k \omega_0  S_a  ).
$$
The phase difference at the beginning of two consecutive grains is 
$$
    \Delta_k = \angle g_{k}[0] - \angle  g_{k-1}[0] = \omega_0 S_a
$$
When the grains are merged in the output, however, they will be spaced $S = S_a/\alpha$ samples apart and so, to avoid phase discontinuities, their phase difference should be
$$
    \Delta'_k = \angle g_{k-1}[S] - \angle  g_{k-1}[0] = \omega_0 S = \Delta_k/ \alpha
$$
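
The mismatch between $\Delta_k$ and $\Delta'_k$ can be made concrete with a few numbers. This is a small sketch under assumed values: the sampling rate and the hop size below are illustrative choices, not values taken from the audio files used in this notebook.

```python
import numpy as np

sf = 16000                    # assumed sampling rate (Hz), for illustration only
w0 = 2 * np.pi * 110 / sf     # frequency of a 110 Hz sinusoid (rad/sample)
alpha = 0.75                  # time-scaling factor
S = 320                       # synthesis hop size (samples), illustrative
Sa = int(S * alpha)           # analysis hop size

delta = w0 * Sa               # phase advance between consecutive input grains
delta_needed = w0 * S         # phase advance required for a seamless output

# phase jump at each grain boundary, wrapped to [-pi, pi]
jump = np.angle(np.exp(1j * (delta - delta_needed)))
print(f"phase jump per grain: {jump:.3f} rad")
```

Unless $\omega_0 (S_a - S)$ happens to be a multiple of $2\pi$, every grain boundary in the output carries the same nonzero phase jump.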

Of course, if the input was a sinusoid of known frequency and fixed amplitude, we would not need granular synthesis at all. For arbitrary signals, nevertheless, we can use the same approach if we consider the DFT of each grain.

# The Phase vocoder

If $G_k[m]$ are the $N$ DFT coefficients of a grain, using polar coordinates we can write $G_k[m] = A_k[m]e^{j\varphi_k[m]}$. With this we can express the grain as the sum of $N$ sinusoidal signals using the inverse DFT formula:
$$
    g_k[n] = \frac{1}{N}\sum_{m=0}^{N-1} A_k[m]\, \mathrm{exp}[j(\omega_m n + \varphi_k[m])], \quad \omega_m = \frac{2\pi}{N}m.
$$
Within a grain, the phase of each component increases by $\omega_m$ per sample, so that the expected phase after $S_a$ samples is 
$$
    \gamma_{k,m}[S_a] = \omega_m S_a + \varphi_{k}[m].
$$
If the signal were a sum of $N$-periodic sinusoids, that is, sinusoids at the DFT frequencies $\omega_m$, the initial phase $\varphi_k[m]$ of each grain $k$ would be exactly equal to $\gamma_{k-1,m}[S_a]$; in all other cases (which is to say, always), the two values differ. We define the phase offset between successive grains as the difference between the initial phase of the current grain and the expected phase of the previous grain at $n=S_a$:
$$
    \Delta_{k, m} = \varphi_{k}[m] - \gamma_{k-1,m}[S_a] =  \varphi_{k}[m] - \varphi_{k-1}[m] - \omega_m S_a.
$$
Importantly, the phase offset must be "wrapped" over the $[-\pi, \pi]$ interval. 
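
Wrapping matters because the offset is later divided by $\alpha$: an extra multiple of $2\pi$ is harmless in a phase, but $2\pi/\alpha$ is generally not. A minimal sketch of one possible wrapping function, using rounding to the nearest multiple of $2\pi$ (the test values are arbitrary):

```python
import numpy as np

def wrap_angle(phi):
    # subtract the nearest integer multiple of 2*pi
    return phi - 2 * np.pi * np.round(phi / (2 * np.pi))

phi = np.array([0.0, 3.5, -3.5, 7.0, 100.0])
wrapped = wrap_angle(phi)

# the wrapped values describe the same angles but lie in [-pi, pi]
print(wrapped)
```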

To create the time-stretched signal, before being merged into the output each grain is resynthesized from a modified set of DFT coefficients $G'_k[m] = A_k[m]e^{j\varphi'_k[m]}$; the amplitude remains the same but the phase is 
$$
    \varphi'_k[m] = \varphi'_{k-1}[m] + \omega_m S + \Delta_{k,m}/\alpha.
$$
In other words, starting from an initial output phase $\varphi'_0[m]$, the phase for each component is incremented by the expected amount $\omega_m S$ plus the phase offset divided by the time-stretch factor $\alpha$.
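
One way to read the update rule: since $S = S_a/\alpha$, the increment equals $S(\omega_m + \Delta_{k,m}/S_a)$, so that $\omega_m + \Delta_{k,m}/S_a$ plays the role of an estimate of the actual frequency of the component in bin $m$. The sketch below illustrates this on a sinusoid whose frequency falls between two DFT bins; the grain size, hop size, test frequency and the use of `np.hanning` are all assumptions made for the example.

```python
import numpy as np

N, Sa = 1024, 128
w0 = 2 * np.pi * 50.3 / N            # true frequency, between bins 50 and 51
x = np.sin(w0 * np.arange(N + Sa))
win = np.hanning(N)

# DFT of two consecutive windowed grains, Sa samples apart
G0 = np.fft.fft(win * x[:N])
G1 = np.fft.fft(win * x[Sa:Sa + N])

# bin with the largest magnitude among the positive frequencies
m = np.argmax(np.abs(G0[:N // 2]))
wm = 2 * np.pi * m / N

# wrapped phase offset for that bin
delta = np.angle(G1[m]) - np.angle(G0[m]) - wm * Sa
delta -= 2 * np.pi * np.round(delta / (2 * np.pi))

# frequency estimate: bin frequency plus offset per sample
w_est = wm + delta / Sa
print(f"bin frequency {wm:.5f}, estimate {w_est:.5f}, true {w0:.5f}")
```

The estimate lands much closer to the true frequency than the bin frequency does, which is why advancing the output phase by $S$ times this value keeps the resynthesized sinusoid coherent across grains.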

### Example

Consider two segments of an audio signal and their superposition:

In [None]:
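def test_pv(x, N, Sa, S):
    # two input grains of length N, taken Sa samples apart
    x1, x2 = x[:N], x[Sa:Sa+N]
    X1, X2 = np.fft.fft(x1), np.fft.fft(x2)

    # DFT bin frequencies
    w = 2 * np.pi / N * np.arange(N)
    
    # phase offset between the two grains, wrapped to [-pi, pi]
    Delta = np.angle(X2) - np.angle(X1) - w * Sa
    Delta = Delta - 2 * np.pi * np.round(Delta / 2 / np.pi)
    
    # resynthesize the second grain with phases adjusted for a hop of S samples
    g2 = np.real(np.fft.ifft(np.abs(X2) * np.exp(1j * (np.angle(X1) + w * S + Delta * S / Sa))))
    
    # first figure: the original grains placed S samples apart
    plt.plot(x1)
    plt.plot(np.arange(S,S+N), x2)    
    # second figure: the first grain and the phase-corrected second grain
    plt.figure()
    plt.plot(x1)
    plt.plot(np.arange(S,S+N), g2)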

In [None]:
test_pv(audio['music'][65000:], 1000, 120, 250)

## Implementation

Changing the phase of a sinusoidal component is equivalent to a circular shift; since the first and the last samples in a grain most likely have different values, it's important to use a tapering window before computing each DFT as well. With this, here is a full implementation:

In [None]:
def timescale_gs_pv(x, alpha, grain_size, K=4):
    win, N, S = hann_cola(grain_size, K)
    a_hop, s_hop = int(S * alpha), S 
    
    # DFT bin frequencies and initial phases
    wk = 2 * np.pi * np.arange(0, N) / N
    phase = prev_phase = np.angle(np.fft.fft(x[:N]))

    def wrap_angle(w):
        # wrap to the [-pi, pi] interval
        return w - 2 * np.pi * np.round(w / 2 / np.pi)
    
    y = np.zeros(int(len(x) / alpha + 1))    
    n, m = 0, 0
    while n < len(x) - N and m < len(y) - N:
        grain_fft = np.fft.fft(win * x[n:n + N])
        grain_phase = np.angle(grain_fft)
        # offset with respect to the expected phase advance over a_hop samples
        phase_delta = wrap_angle(grain_phase - prev_phase - a_hop * wk)
        prev_phase = grain_phase
        # advance the output phase over s_hop samples
        phase = wrap_angle(phase + s_hop * (wk + phase_delta / a_hop))
        # resynthesize the grain with the original magnitudes and the new phases
        grain = np.real(np.fft.ifft(np.abs(grain_fft) * np.exp(1j * phase)))
        # taper again before overlap-add
        y[m:m+N] += grain * win
        n += a_hop
        m += s_hop
    return y
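
As a side check of the remark that a phase change acts like a circular shift, the sketch below adds a linear phase term $-\omega_m d$ to the DFT of a short random grain and compares the result with `np.roll`. The grain length, the shift and the random seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 3
g = rng.standard_normal(N)

# DFT bin frequencies
w = 2 * np.pi * np.arange(N) / N

# add a linear phase term -w*d to every DFT coefficient
shifted = np.real(np.fft.ifft(np.fft.fft(g) * np.exp(-1j * w * d)))

# the result is the grain circularly shifted by d samples
print(np.allclose(shifted, np.roll(g, d)))
```

Because the shift is circular, samples pushed past one end of the grain reappear at the other end; a window that tapers to zero at both ends makes this wrap-around much less audible.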

We can now test with the sinusoidal signal again:

In [None]:
x_ts_pv = timescale_gs_pv(x, alpha, gs)
plt.plot(x_ts_pv[s])
multiplay([x, x_ts, x_ts_pv], audio_sf, title=['test sinusoid', 'stretched with GS', 'stretched with GS-PV'])   

In [None]:
for file, gs_ms in [('speech', 80), ('music', 200)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [ audio[file], *( ts(audio[file], alpha, gs, K=3) for ts in [timescale_gs, timescale_gs_pv] ) ], 
        audio_sf, 
        [f'{file} sample', *( f'slower ({algo})' for algo in ['GS', 'GS_PV'] ) ]
    ))    

## Pitch shifting

In [None]:
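def resample(x, alpha):
    # linear interpolation of x at the fractional positions t = m * alpha
    N = len(x)
    # allocate with ceil so that the loop can never write past the end of y
    y, m = np.zeros(int(np.ceil(N / alpha))), 0
    while True:
        t = m * alpha 
        n = int(np.floor(t))
        if n < N - 1:
            y[m] = (n + 1 - t) * x[n] + (t - n) * x[n + 1] 
            m += 1
        else:
            break
    return y[:m]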
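
Reading a signal at positions spaced $\alpha$ samples apart scales all of its frequencies by $\alpha$ (and its duration by $1/\alpha$). The sketch below illustrates this with a vectorized linear interpolation based on `np.interp`; the helper name `resample_interp`, the sampling rate and the test frequency are assumptions made for the example.

```python
import numpy as np

def resample_interp(x, alpha):
    # linear interpolation of x at positions t = 0, alpha, 2*alpha, ... (illustrative helper)
    t = np.arange(0, len(x) - 1, alpha)
    return np.interp(t, np.arange(len(x)), x)

sf = 8000                                  # assumed sampling rate (Hz)
f0 = 200.0                                 # test frequency (Hz)
x = np.sin(2 * np.pi * f0 / sf * np.arange(sf))

alpha = 2 ** (3 / 12)                      # three semitones
y = resample_interp(x, alpha)

# dominant frequency of the resampled signal when played back at the same rate
f_peak = np.argmax(np.abs(np.fft.rfft(y))) * sf / len(y)
print(f"input {f0:.1f} Hz, output about {f_peak:.1f} Hz, expected {f0 * alpha:.1f} Hz")
```

Combining such a resampling step with a time-scaling step that compensates for the change in duration yields a signal of the original length with a shifted pitch.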

In [None]:
semitone = 2 ** (1.0 / 12)
shift = 3
alpha = semitone ** shift

for file, gs_ms in [('speech', 80), ('music', 200)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [ audio[file], *( resample(ts(audio[file], alpha, gs, K=3), 1/alpha) for ts in [timescale_gs, timescale_gs_pv] ) ], 
        audio_sf, 
        [f'{file} sample', *( f'pitch-shifted ({algo})' for algo in ['GS', 'GS_PV'] ) ]
    ))    