<div style="margin: 0 auto 10px; height: 70px; border: 2px solid gray; border-radius: 6px;">
  <div style="float: left; margin: 5px 10px 5px 10px; "><img src="img/bfh.jpg" /></div>
  <div style="float: right; margin: 20px 30px 0; font-size: 15pt; font-weight: bold; color: #98b7d2;"><a href="https://moodle.bfh.ch/course/view.php?id=39255" style="color: #98b7d2;">BTE5476 - Project-Oriented Digital Signal Processing </a></div>
</div>
<div style="clear: both; font-size: 30pt; font-weight: bold; color: #64788b; margin-left: 30px;">
    Time and pitch scaling for audio
</div>

In [None]:
%matplotlib inline
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as sp
from IPython.display import Audio, display
from scipy.io import wavfile

plt.rcParams['figure.figsize'] = 14, 4 
plt.rcParams['image.cmap'] = 'tab10'

In [None]:
def multiplay(clips, sf, title=None):
    outs = [widgets.Output() for c in clips]
    for ix, item in enumerate(clips):
        with outs[ix]:
            print(title[ix] if title is not None else "")
            display(Audio(item, rate=sf, normalize=True))
    return widgets.HBox(outs)

In [None]:
def plot_spec(x, Fs, max_freq=None, do_fft=True):
    C = int(len(x) / 2)  # positive frequencies only
    if max_freq:
        C = int(C * max_freq / float(Fs) * 2) 
    X = np.abs(np.fft.fft(x)[0:C]) if do_fft else x[0:C]
    N = Fs * np.arange(0, C) / len(x)
    plt.plot(N, X)
    return N, X

In [None]:
# pre-load audio samples in a dictionary
audio, audio_sf = {}, None
for file in ['speech', 'music', 'cymbal', 'clarinet', 'yesterday'] :
    sf, x = wavfile.read(f'data/{file}.wav')
    if audio_sf is None:
        audio_sf = sf
    else:
        assert sf == audio_sf, 'notebook requires sampling rates for all audio samples to be the same'
    audio[file] = (x - np.mean(x)) / 32767.0

# Problem setup

Consider a finite-support analog audio signal $x_i(t)$ starting at $t=0$ and ending at $t=T$.

## Time, pitch, timbre

 * the duration of an audio signal is equal to the length $T$ of its finite support
 * a local periodicity, $x(t) \approx x(t + nP)$, produces a sensation of _pitch_ with fundamental frequency $f_1 = 1/P$
 * the magnitude spectrum $|X(f)|$ determines the perceived _timbre_

Note that the spectrum of a pitched signal has a _harmonic_ structure, that is, the energy is concentrated around multiples of the fundamental frequency $f_1$.
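
As a quick check of the harmonic structure, here is a minimal sketch on a synthetic tone. The helper names `harmonic_tone` and `strongest_peaks`, the 8 kHz sampling rate and the $1/k$ amplitudes are assumptions made for the example, not part of the notebook.

```python
import numpy as np

def harmonic_tone(f1, sf, duration, n_harmonics=5):
    """Sum of sinusoids at multiples of f1 with 1/k amplitudes (an arbitrary choice)."""
    t = np.arange(int(sf * duration)) / sf
    return sum(np.sin(2 * np.pi * k * f1 * t) / k for k in range(1, n_harmonics + 1))

def strongest_peaks(x, sf, count):
    """Frequencies (Hz) of the `count` largest FFT magnitude bins, sorted."""
    X = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sf)
    return np.sort(freqs[np.argsort(X)[-count:]])

sf = 8000                                  # assumed sampling rate
x = harmonic_tone(220, sf, duration=1.0)   # one second, so bins are 1 Hz apart
print(strongest_peaks(x, sf, 5))           # energy sits at multiples of 220 Hz
```

The five strongest bins land on 220, 440, 660, 880 and 1100 Hz, that is, on $nf_1$.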

In [None]:
plt.subplot(2, 1, 1)
plot_spec(audio['clarinet'], audio_sf);
plt.subplot(2, 1, 2)
plot_spec(audio['cymbal'], audio_sf);

## The goal

 * time scaling: change the duration of a signal
 * pitch scaling: change the pitch of a signal

Can we scale time and pitch independently of each other, and without affecting timbre?

# Time scaling via warping

The simplest form of time scaling is time warping, that is, we change the underlying timebase.

## Continuous time
<img width="300" style="float: right;" src="img/turntable.jpg">

For analog signals, warping the time axis is simply:
$$
    x_o(t) = x_i(\alpha t), \qquad \alpha > 0
$$

 * $\alpha > 1$ : signal plays faster
 * $\alpha < 1$ : signal plays slower

This is what happens if you spin a vinyl record faster or slower than its nominal RPM value.

## Discrete time

For $\alpha = M/N$ with $N, M$ coprime we can use multirate processing:

<img width="600" style="float:right;" src="img/multirate.png">

 * upsample by $N$
 * apply a lowpass filter
 * downsample by $M$
 
but this can get very expensive computationally if $N$ is large.
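
The three multirate steps are available in SciPy as `scipy.signal.resample_poly(x, up, down)`. The sketch below is only an illustration on a test tone; the values $N = 5$, $M = 7$ and the tone frequency are assumptions chosen so that the numbers come out round.

```python
import numpy as np
from scipy.signal import resample_poly

# test tone: 35 full cycles over 700 samples (0.05 cycles per sample)
x = np.cos(2 * np.pi * 0.05 * np.arange(700))

N, M = 5, 7                    # upsample by N, lowpass, downsample by M
y = resample_poly(x, N, M)
print(len(x), len(y))          # the number of samples changes by a factor N/M

# location of the spectral peak, in cycles per sample
f_in = np.argmax(np.abs(np.fft.rfft(x))) / len(x)
f_out = np.argmax(np.abs(np.fft.rfft(y))) / len(y)
print(f_in, f_out)             # the normalized frequency changes by a factor M/N
```

Played back at the same sampling rate, the output is shorter by a factor $N/M$ and its frequencies are multiplied by $M/N$.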

### Fractional resampling

For efficiency, we can approximate the exact multirate approach via local interpolation. For instance, using _linear_ interpolation, each output sample $x_o[m]$ is computed like so:

 * find the two closest input samples $x_i[n]$ and $x_i[n+1]$ where $n = \lfloor \alpha m \rfloor$
 * $\tau = \alpha m - n$ is the "time" offset between output and input ($0 \le \tau < 1$)
 * compute the linear interpolation between $x_i[n]$ and $x_i[n+1]$ at $t=\tau$: $$
   x_o[m] = (1-\tau)x_i[n] + \tau\, x_i[n+1]
   $$
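
The same interpolation rule can also be written without an explicit loop using `np.interp`. This is a minimal sketch; the name `resample_interp` is hypothetical, and unlike a loop that stops before the last input sample, it also evaluates the instant $\alpha m = N - 1$ when it falls exactly on the end of the input.

```python
import numpy as np

def resample_interp(x, alpha):
    """Linear-interpolation time warping, x_o[m] = x_i(alpha * m), vectorized."""
    N = len(x)
    # output instants alpha * m that fall inside the input support [0, N - 1]
    t = alpha * np.arange(int(np.floor((N - 1) / alpha)) + 1)
    # np.interp evaluates (1 - tau) * x[n] + tau * x[n + 1] at each instant
    return np.interp(t, np.arange(N), x)

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
print(resample_interp(x, 0.5))   # slower: one extra sample between input samples
print(resample_interp(x, 1.5))   # faster: samples taken at t = 0, 1.5, 3
```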

In [None]:
def resample(x, alpha):
    N = len(x)
    # ceil(N / alpha) slots are always enough; the result is trimmed to m samples below
    y, m = np.zeros(int(np.ceil(N / alpha))), 0
    while True:
        t = m * alpha 
        n = int(np.floor(t))
        if n < N - 1:
            y[m] = (n + 1 - t) * x[n] + (t - n) * x[n + 1] 
            m += 1
        else:
            break
    return y[:m]

In [None]:
factors = [1.4, 0.8]
for file in ['speech', 'music']:
    display(multiplay(
        [audio[file], *(resample(audio[file], a) for a in factors)], 
        audio_sf, 
        [f'{file} sample', *('slower' if a < 1 else 'faster' for a in factors)] 
    ))    

## Issues with time warping:

If the time axis is scaled by a factor $\alpha$:

 * period is scaled by $1/\alpha$ since $x_i(t) = x_i(t + nP) \Rightarrow x_o(t) \approx x_o(t + nP/\alpha)$; \
   pitch is scaled as $f_o = \alpha f_i$ and we hear a higher pitch if $\alpha > 1$, lower if $\alpha < 1$
 * timbre is affected because $X_o(f) = \frac{1}{\alpha} X_i(f / \alpha)$; \
   if $\alpha > 1$ the spectrum is stretched and everything sounds shrill (chipmunk voice), if $\alpha < 1$ the spectrum shrinks around zero and everything sounds dark (Darth Vader voice)
 * the method cannot be implemented in real time (it cannot be driven by the input signal)

# Frequency shifting

To change the pitch of an audio signal we could try to simply shift its spectrum up or down in frequency.

## Shifting via modulation

Using the modulation theorem: 
$$
\mathrm{DTFT}\{2x[n]\cos(\omega_0 n)\} = X(\omega - \omega_0) + X(\omega + \omega_0)
$$
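
A numerical sanity check of the theorem on a single spectral line; the 8 kHz sampling rate and the two frequencies are assumptions picked so that every component falls on an exact FFT bin.

```python
import numpy as np

sf = 8000                      # assumed sampling rate for the toy example
n = np.arange(sf)              # one second of samples, so FFT bins are 1 Hz apart
x = np.cos(2 * np.pi * 1000 * n / sf)          # single spectral line at 1000 Hz
y = 2 * x * np.cos(2 * np.pi * 300 * n / sf)   # modulate at 300 Hz

Y = np.abs(np.fft.rfft(y))
peaks = np.sort(np.argsort(Y)[-2:])            # two strongest bins, in Hz
print(peaks)                                   # copies at 1000 - 300 and 1000 + 300
```

The original line disappears and two copies show up, one shifted up and one shifted down.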

In [None]:
def modshift(x, f, sf):
    return 2 * x * np.cos(2 * np.pi / sf * f * np.arange(0, len(x)))

In [None]:
x = audio['speech']
shift = 400

x_mod = modshift(x, shift, audio_sf)
multiplay([x, x_mod], audio_sf, ['original', 'upshift via modulation'])

<img width="250" style="float: right; margin: 10px;" src="img/voice_changer.jpg">

Simple technique to make your voice hard to recognize, low-cost and real-time; used in old-time "voice scramblers" for anonymous prank calls. However:

 * audio is baseband, so we can only go "up"
 * signal's bandwidth much larger than shift frequency: the two spectral copies overlap (distortion)
 * inaudible low frequencies are shifted to hearing range (warbling)

Example:

 * negative frequencies are aliased to positive
 * lots of spurious content in low frequency region

In [None]:
plt.subplot(2, 1, 1)
plot_spec(x, audio_sf, 2000)
plt.subplot(2, 1, 2)
plot_spec(x_mod, audio_sf, 2000)
plt.axvline(shift, color='red');

## Improved modulation-based shifting

Use the same approach as in SSB (single-sideband modulation):

<img width="600" style="" src="img/ssb.png">

Algorithm:
 * apply a bandpass filter to eliminate inaudible low frequencies and to limit bandwidth
 * if shifting down, also eliminate low frequencies below modulation frequency
 * compute the one-sided analytic signal using a Hilbert filter
 * shift analytic signal via a complex exponential
 * take the real part
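
To see the last three steps in isolation, here is a sketch that swaps the Hilbert filter for the FFT-based `scipy.signal.hilbert`, which returns the analytic signal of a whole block at once. It skips the bandpass step and is not causal, so it is only a way to check the idea; the name `ssbshift_fft` and the test tone are assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def ssbshift_fft(x, f, sf):
    """Shift the spectrum of a real signal by f Hz using its analytic signal."""
    xa = hilbert(x)   # x + j * H{x}: the negative frequencies are removed
    return np.real(xa * np.exp(2j * np.pi * f / sf * np.arange(len(x))))

sf = 8000                                   # assumed sampling rate
n = np.arange(sf)                           # one second, bins are 1 Hz apart
x = np.cos(2 * np.pi * 1000 * n / sf)       # single line at 1000 Hz
for f in (300, -300):
    y = ssbshift_fft(x, f, sf)
    print(f, np.argmax(np.abs(np.fft.rfft(y))))   # one line only, at 1000 + f
```

Unlike plain modulation, only one shifted copy is left, and negative shifts are possible.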

In [None]:
def ssbshift(x, f, sf, L=40):
    # bandpass the voice signal
    f0 = 100.0 if f > 0 else 100.0 - f
    x = sp.lfilter(*sp.butter(6, [f0, min(10000, 0.9 * sf/2)], fs=sf, btype='band'), x)
    # compute analytic signal
    h = sp.remez(2 * L + 1, [0.02, 0.48], [-1], type='hilbert')
    xa = 1j * sp.lfilter(h, 1, x)[L:] + x[:-L]
    # shift and take real part
    return np.real(xa * np.exp(1j * 2 * np.pi * f / sf * np.arange(0, len(xa))))

In [None]:
x_ssb = ssbshift(x, shift, audio_sf)

multiplay([x, x_ssb, ssbshift(x, -shift, audio_sf)], audio_sf, ['original', 'ssb upshift', 'ssb downshift'])

Let's compare simple modulation and SSB modulation:

In [None]:
multiplay([x, x_mod, x_ssb], audio_sf, ['original', 'simple modulation', 'ssb modulation'])

In [None]:
plt.subplot(3, 1, 1)
plot_spec(x, audio_sf, 2000)
plt.subplot(3, 1, 2)
plot_spec(x_mod, audio_sf, 2000)
plt.axvline(shift, color='red')
plt.subplot(3, 1, 3)
plot_spec(x_ssb, audio_sf, 2000)
plt.axvline(shift, color='red');

## _The_ issue with frequency shifting

Good things so far: 
 * can be implemented in real time
 * preserves the speed of the original audio
 
But it sounds **awful** with music:

In [None]:
x = audio['music']
shift=200
multiplay([x, ssbshift(x, shift, audio_sf), ssbshift(x, -shift, audio_sf)], 
          audio_sf, 
          ['music sample', 'ssb upshift', 'ssb downshift'])

In pitch perception, the brain checks for a _harmonic_ spectral structure, with spectral lines at multiples of a fundamental frequency:

$$
    f_n = nf_1, \quad n = 1, 2, 3, \ldots \Rightarrow \frac{f_n}{f_1} = n
$$

Spectral shifting breaks the harmonic structure:

$$
    f'_n = f_c + f_n = f_c + nf_1 \Rightarrow \frac{f'_n}{f'_1} = \frac{f_c + nf_1}{f_c + f_1}
$$
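
A small numerical example makes the point; the values $f_1 = 220$ Hz and $f_c = 200$ Hz are hypothetical.

```python
import numpy as np

f1, fc = 220.0, 200.0            # hypothetical fundamental and shift, in Hz
n = np.arange(1, 6)
harmonics = n * f1               # f_n = n f_1
shifted = fc + harmonics         # f'_n = f_c + n f_1

print(harmonics / harmonics[0])  # integer ratios 1, 2, 3, ...
print(shifted / shifted[0])      # ratios are no longer integers
```

After the shift the partials are no longer multiples of the lowest one, so the ear does not fuse them into a single pitch.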

# Granular Synthesis
<img width="500" style="float: right;" src="img/gsplot.jpg">

In [Granular Synthesis](https://en.wikipedia.org/wiki/Granular_synthesis) complex waveforms are built by stitching together very short sound snippets called "grains". 
 
 * used as a compositional tool to generate complex timbres at arbitrary pitches
 * each grain must be "pitched"
 * works well for non-pitched sounds too
 * lots of sophisticated variations exist to maximize output quality





By decomposing a signal into grains and then putting the grains back together we can try to change the overall duration.

## Time stretching via granular synthesis 

Simple idea to increase the duration of a signal: 
 * split signal into small grains
 * repeat each grain two or more times

In [None]:
def gs_poc(x, grain_size, M=2):
    y = np.zeros(M * (len(x) + 1))
    for n in range(0, len(x) - grain_size, grain_size):
        y[M * n : M * (n + grain_size)] = np.tile(x[n : n + grain_size], M)
    return y

Helper function to convert milliseconds to samples:

In [None]:
def ms2n(ms, sf):
    return int(float(sf) * ms / 1000.0)

In [None]:
grain_size = ms2n(40, audio_sf)

for file in ['speech', 'music']:
    display(multiplay(
        [audio[file], gs_poc(audio[file], grain_size)], audio_sf, 
        [f'{file} sample', 'stretched'] 
    ))    

The good:
 * pitch and timbre are well-preserved
 * very cheap to implement

The bad:
 * grain repetition creates discontinuities in the signal that we hear as a clicking noise 
 * can we use granular synthesis also to _reduce_ the length of a signal?

## Crossfading grains

In granular synthesis, grains should overlap and be _merged_ together using a multiplicative *tapering window*. If $g_k[n]$ is a set of grains of length $N$, the output signal is defined as 

$$
    x_o[n] = \sum_{k=-\infty}^\infty g_k[n-kS]w[n - kS]
$$

where

 * $w[n]$ is a finite-support window of length $N$, smoothly tapering to zero at both ends
 * $S$ is the synthesis _stride_ (that is $S = N(1-a)$ where $0 \le a < 1$ is the amount of overlap between successive grains)
 * the window should fulfill the COLA condition (constant overlap-add):$$
   \sum_{k=-\infty}^\infty w[n - kS] = 1
   $$

### COLA windows

There is a vast literature on the design of windows with the COLA property; here, for simplicity, we will use the Hann (or Hanning) window, defined as 

$$
    w[n] = \frac{1}{2}\left[1-\cos\left(\frac{2\pi n}{N-1}\right)\right], \quad n = 0, 1, \ldots, N-1.
$$

The Hann window fulfills the constant overlap-add condition when the stride is of the form 
$$
    S = \frac{N}{K}
$$
where $K$ is an integer greater than one; since the stride must also be an integer, depending on the value of $K$ we may need to adjust the window length. Note that the overlap factor is $a = (K-1)/K$.
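
The condition can be checked numerically. In this sketch the window length is taken as $KS + 1$ (an assumption, so that the zero-valued endpoints of neighboring windows coincide) and the window is scaled by $2/K$, because $K$ overlapping unscaled Hann windows add up to $K/2$ rather than to one. The name `overlap_sum` is hypothetical.

```python
import numpy as np

def overlap_sum(K, S, grains=12):
    """Overlap-add `grains` scaled Hann windows of length K*S + 1 with stride S."""
    N = K * S + 1
    # Hann window as in the formula above, scaled by 2/K
    w = (2 / K) * 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / (N - 1)))
    y = np.zeros((grains - 1) * S + N)
    for k in range(grains):
        y[k * S : k * S + N] += w
    # discard the fade-in and fade-out regions, where fewer than K windows overlap
    return y[N:-N]

for K in (2, 3, 4):
    y = overlap_sum(K, S=25)
    print(K, y.min(), y.max())   # both values should be one up to rounding
```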

In [None]:
def hann_cola(N, K=2):
    assert K >= 2, 'cannot satisfy COLA with zero overlap. Set K > 1'
    S = N // K
    N = K * S + 1
    win = (2 / K) * np.sin(np.pi * np.arange(0, N) / (N-1)) ** 2
    return win, N, S

In [None]:
def test_overlap(win, N, S, M=8):
    assert N == len(win)
    y = np.zeros((M - 1) * S + N)
    for m in range(0, M * S, S):
        plt.plot(m + np.arange(0, N), win, 'C0')
        y[m:m+N] += win
    plt.plot(y, 'C2', linewidth=8, alpha=0.3)
    plt.gca().set_ylim([0, 1.2]);

In [None]:
test_overlap(*hann_cola(50, 2))

In [None]:
test_overlap(*hann_cola(101, 4))

## Time scaling with granular synthesis

<img width="800" style="float: right;" src="img/ola.png">

For a scaling factor $\alpha$: 

 * choose grain size (around 40 ms for speech, 100 ms for music) 
 * choose overlap factor and obtain grain stride $S$ (in samples)
 * iterate over $k = 0, 1, \ldots$:
     * extract a grain from the input at sample $k \lfloor S \alpha \rfloor$  
     * blend the grain into the output signal at sample $kS$

In [None]:
def timescale_gs(x, alpha, grain_size, K=2):
    win, N, S = hann_cola(grain_size, K)
    y = np.zeros(int(len(x) / alpha))
    n, m = 0, 0
    while n < len(x) - N and m < len(y) - N:
        y[m:m+N] += x[n:n+N] * win
        m += S
        n += int(S * alpha)
    return y

In [None]:
factors = [1.4, 0.8]

for file, gs_ms in [('speech', 40), ('music', 100)]:
    gs = ms2n(gs_ms, audio_sf)
    display(multiplay(
        [audio[file], *(timescale_gs(audio[file], a, gs, K=2) for a in factors)], 
        audio_sf, 
        [f'{file} sample', *('slower' if a < 1 else 'faster' for a in factors)] 
    ))    

Remarks so far:
  * time scaling works best when speeding up speech
  * slowed down speech still has clicking artefacts
  * music is OK but the lower pitches have significant detuning
  
There seems to be a problem with low frequencies and still significant artefacts. We'll talk more about those later.

## Pitch shifting via granular synthesis

Since GS allows us to change the time scale leaving the pitch unchanged, what if:
 * we use GS to scale time by a factor $1/\alpha$; pitch remains the same
 * we apply fractional resampling (i.e. time warping) by a factor $\alpha$; this restores the original length but changes the pitch as a side effect.
 * pitch is raised if $\alpha > 1$, lowered otherwise

The following implementation uses two passes over the input, the first to time-scale it with granular synthesis, the second to resample it; this approach works but it cannot be used in real time.

In [None]:
def pitchshift_gs(x, alpha, grain_size, K=2):
    return resample(timescale_gs(x, 1 / alpha, grain_size, K), alpha)

In [None]:
semitone = 2 ** (1.0 / 12)

x = audio['music']
grain_size = ms2n(100, audio_sf)

shift = 3
alpha = semitone ** shift
multiplay([x, pitchshift_gs(x, alpha, grain_size), pitchshift_gs(x, 1 / alpha, grain_size)], 
          audio_sf, 
          title=['music sample', f'up {shift} semitones (GS)', f'down {shift} semitones (GS)'])

# Exercise: real-time implementation

We want to implement a pitch shifting algorithm that works in real time, namely:

 * the algorithm produces one output sample every new input sample, synchronously
 * we can accept a reasonable amount of fixed processing delay
 * the algorithm can only access past input samples

The fundamental idea is to use granular synthesis to merge *resampled* grains directly; let's break the problem into manageable pieces.

## Real-time dummy granular synthesis

Consider a granular synthesis function that simply reconstructs the input signal; from the general formula

$$
    x_o[n] = \sum_{k=-\infty}^\infty g_k[n-kS]w[n - kS]
$$

if we define each grain as $g_k[n] = x_i[kS+n]$, we can see that the output is equal to the input as long as the COLA condition is satisfied:

$$
    x_o[n] = \sum_{k=-\infty}^\infty x_i[n]w[n - kS] = x_i[n] \sum_{k=-\infty}^\infty w[n - kS] = x_i[n]
$$ 

Assume the window has length $N = KS$, where $S$ is the stride in samples. The window is nonzero only over the interval $[0, N-1]$, so every output sample is the sum of weighted samples from $K$ grains:
$$
    w[n-kS] \neq 0 \quad \Leftrightarrow \quad 0 \le n - kS < N \quad \Leftrightarrow \quad n/S - K < k \le n/S
$$
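
The count can be verified by brute force. In this sketch the name `active_grains` and the values $K = 3$, $S = 10$ are assumptions, and the window length is taken to be exactly $KS$.

```python
def active_grains(n, N, S):
    """Indices k of the grains whose window covers output sample n."""
    # only grains starting within one window length before n can contribute
    return [k for k in range(n // S - N // S, n // S + 1) if 0 <= n - k * S < N]

K, S = 3, 10
N = K * S                        # window length assumed to be a multiple of the stride
for n in (40, 45, 49, 50):
    print(n, active_grains(n, N, S))   # always K consecutive grain indices
```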

If we store incoming input samples $x_i[n]$ in a memory buffer
 * when $n = kS$ (at multiples of the stride) the buffer will have accumulated the data for grain $g_{k-1}[m]$
 * at time $n$ the buffer will contain data for grains $g_{k-1}[m], g_{k-2}[m], \ldots$, where $k = \lfloor n / S \rfloor$
 * at time $n$ we will be able to output $x_o[n - N + 1]$
 * the total delay is therefore $N - 1$ samples, that is, about one grain length

The following picture illustrates the buffering structure, where solid lines indicate full grains already in the buffer and dashed lines grains not yet acquired:

In [None]:
win, N, S = hann_cola(100, K=3)
D = int(S * 0.8)
for g in [(-2 + i, 'C'+c, f'$g_{{k{-3+i:+}}}$') for i, c in enumerate(('0', '9', '2', '6:', '3:', '1:'))]:
    plt.plot(np.arange(N) + g[0] * S, win, g[1], label=g[2])
plt.axvline(N+D, c='C2', lw=3)
plt.axvline(D+1, c='C2', ls='--')
plt.gca().set_xticks([D+1, N-S-1, N-1, N+D], ['$n-N+1$', '$(k-1)S$', '$kS$', '$n$'])
plt.legend();

In [None]:
def dummy_gs(x, grain_size, K=2):
    win, N, S = hann_cola(grain_size, K)

    # keep in memory the last K grains 
    grains = np.zeros((K, N))

    # input buffer, you can make this as large as you want (at least as big as a grain)
    B = 2 * N
    buffer = np.zeros(B)

    y = np.zeros(len(x))
    for n in range(len(x)):
        # your code here
        pass
    return y

In [None]:
# testing the function on a ramp; output should be a delayed version of the input
x = np.arange(1, 100)
plt.plot(x)
plt.plot(dummy_gs(x, 10, 2));

## Real-time pitch shifting granular synthesis

With the "dummy" granular synthesis in place, the trick now is to blend together time-warped grains. For this we will need to do the following:

 * for a given grain size $N$, we will need to resample a chunk of $M \approx \alpha N$ samples, where $\alpha$ is the resampling factor
 * we need to allocate an input buffer of size at least $M$

In [None]:
def pitchshift_gs_rt(x, alpha, grain_size, K=2):
    y = np.zeros(len(x))
    # your code here
    
    for n in range(len(x)):
        # your code here
        y[n] = ...
        pass
        
    return y

In [None]:
x = audio['music']
grain_size = ms2n(100, audio_sf)

shift = 3
alpha = semitone ** shift
multiplay([x, pitchshift_gs_rt(x, alpha, grain_size), pitchshift_gs_rt(x, 1 / alpha, grain_size)], 
          audio_sf, 
          title=['music sample', f'up {shift} semitones (GS)', f'down {shift} semitones (GS)'])

# Before DSP...

<img width="600" style="" src="img/pitchshift.jpg"> 

Although we have just described a purely digital version of grain-based pitch shifting, it is interesting to remark that, before digital audio was a reality, the only true pitch-shifting devices available to the music industry were extremely complex (and costly) mechanical devices that implemented, in analog, the same principle behind granular synthesis. 

Above is the block diagram of such a contraption: the original sound is recorded on the main tape spool, which is run at a speed that can vary with respect to the nominal recording speed to raise or lower the pitch. To compensate for these changes in speed the tape head is actually a rotating disk containing four individual coils; at any given time, at least two neighboring coils are picking up the signal from the tape, with an automatic fade-in and fade-out as they approach and leave the tape. The head disk rotates at a speed that compensates for the change in speed of the main tape, therefore keeping the timebase constant. The coils on the head disk picking up the signal are in fact producing overlapping "grains" that are mixed together in the output signal.  