[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/derrick-nuby/intro-to-ai-ai6/blob/hw1-answers/assignments/hw01_style_transfer/hw01_answer.ipynb)

# Style transfer using a Linear Transformation

Here we have three pieces of audio. The first two are `Synth.wav` (audio $\mathbf{A}$) and `Piano.wav` (audio $\mathbf{B}$), which are recordings of a chromatic scale in a single octave played by a synthesizer and a piano respectively. The third piece of audio is the intro melody of “Blinding Lights” (audio $\mathbf{C}$) by The Weeknd, played with the same synth tone used to generate `Synth.wav`.

All audio files are in the `hw01_style_transfer/data` folder.

From these files, you can obtain the spectrograms $\mathbf{M}_A$, $\mathbf{M}_B$, and $\mathbf{M}_C$. Your objective is to find the spectrogram of the piano version of the song “Blinding Lights” ($\mathbf{M}_D$).

In this problem, we assume that style can be transferred using a linear transformation. Formally, we need to find the matrix $\mathbf{T}$ such that:

$$
\mathbf{T}\mathbf{M}_A \approx \mathbf{M}_B
$$

1. Write your code to determine matrix $\mathbf{T}$.

2. Our model assumes that $\mathbf{T}$ can transfer style from synthesizer music to piano music. Applying $\mathbf{T}$ to $\mathbf{M}_C$ should therefore give us an estimate of $\mathbf{M}_D$, the spectrogram of “Blinding Lights” played on a piano. Using this matrix and the phase matrix of $\mathbf{C}$, synthesize an audio signal.

    Submit your synthesized audio as `problem3.wav`.

In [None]:
# Mounts your Google Drive so you can access files stored there directly, as you would on your local machine
# from google.colab import drive

# drive.mount("/content/gdrive")
# %cd gdrive/MyDrive/

## Solution

You need to compute the spectrogram of the music file. First, read and load the audio file at its correct sample rate (44100 in our case).

If you are using Python, you can use [Librosa](https://librosa.org/doc/latest/index.html) to load the wav file as follows (we also recommend the NumPy package for the matrix operations below):

```python
import librosa
audio, sr = librosa.load(filename, sr = None) # sr = None means the audio will be loaded with its original sr - sample rate
assert sr == 44100 # make sure the sample rate read from the music file is what we expect: 44100 Hz
```

Next, we can compute the complex Short-Time Fourier Transform (STFT) of the signal and its magnitude spectrogram. Use windows of `2048` samples, which at 44100 Hz correspond to analysis windows of about 46 ms, and a hop length of `256` samples, which gives about 172 frames per second of signal. Different toolboxes should provide similar spectrograms. If you are using the Python Librosa library, you can use the following command:

```python
spectrogram = librosa.stft(audio, n_fft=2048, hop_length=256, center=False, win_length=2048)
M = abs(spectrogram)
phase = spectrogram/(M + 2.2204e-16)
```

Here, $\mathbf{M}$ is the magnitude spectrogram of the music file. It should be a matrix whose rows correspond to frequencies and whose columns correspond to time frames. To visualize the spectrogram, see the online documentation for `librosa.display.specshow`.
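
The STFT settings determine the time resolution and the shape of $\mathbf{M}$. The following is a small sketch of that arithmetic, assuming the 44100 Hz sample rate used in this assignment. The 5-second signal length is a made-up value for illustration only.

```python
# STFT parameters used in this notebook
sr = 44100        # sample rate in Hz
n_fft = 2048      # window length in samples
hop_length = 256  # step between consecutive frames in samples

window_ms = 1000 * n_fft / sr        # duration of one analysis window
frames_per_second = sr / hop_length  # how many columns M gets per second
n_bins = 1 + n_fft // 2              # rows of M: frequency bins from 0 Hz to sr/2

# Hypothetical 5-second signal (assumption for illustration only)
n_samples = 5 * sr
# With center=False, only windows that fit entirely inside the signal are used
n_frames = 1 + (n_samples - n_fft) // hop_length

print(f"window: {window_ms:.1f} ms, frames/s: {frames_per_second:.1f}")
print(f"M shape: ({n_bins}, {n_frames})")
```

The number of rows depends only on `n_fft`, so $\mathbf{M}_A$, $\mathbf{M}_B$, and $\mathbf{M}_C$ all have the same number of rows even when the recordings differ in length.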

In [None]:
import numpy as np

### Computing $\mathbf{M}_A$, $\mathbf{M}_B$, $\mathbf{M}_C$

In [None]:
import librosa

# Load audio files
audio_A, sr = librosa.load('data/Synth.wav', sr=None)
audio_B, sr = librosa.load('data/Piano.wav', sr=None)
audio_C, sr = librosa.load('data/Blinding_Lights_synth.wav', sr=None)
assert sr == 44100  # Verify sample rate

# Compute STFT for each audio
spectrogram_A = librosa.stft(audio_A, n_fft=2048, hop_length=256, center=False, win_length=2048)
spectrogram_B = librosa.stft(audio_B, n_fft=2048, hop_length=256, center=False, win_length=2048)
spectrogram_C = librosa.stft(audio_C, n_fft=2048, hop_length=256, center=False, win_length=2048)

# Compute magnitude and phase
MA = np.abs(spectrogram_A)
phase_A = spectrogram_A / (MA + 2.2204e-16)  # Avoid division by zero

MB = np.abs(spectrogram_B)
phase_B = spectrogram_B / (MB + 2.2204e-16)

MC = np.abs(spectrogram_C)
phase_C = spectrogram_C / (MC + 2.2204e-16)

### Learning $\mathbf{T}$

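We are given that:

$$
\mathbf{T}\mathbf{M}_A \approx \mathbf{M}_B
$$

If $\mathbf{M}_A$ were square and invertible, we could multiply both sides on the right by its inverse $\mathbf{M}_A^{-1}$ and obtain:

$$
\mathbf{T} = \mathbf{M}_B \mathbf{M}_A^{-1}
$$

Matrices cannot be divided like scalars, and **$\mathbf{M}_{A}$ is not a square matrix**, so it has no ordinary inverse. We use its [pseudo-inverse](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html), written $\operatorname{pinv}(\mathbf{M}_A)$, instead. So our final formula will be:

$$
\mathbf{T} = \mathbf{M}_B \operatorname{pinv}(\mathbf{M}_A)
$$

This choice of $\mathbf{T}$ minimizes the squared error $\lVert \mathbf{T}\mathbf{M}_A - \mathbf{M}_B \rVert_F^2$.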
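
To make the pseudo-inverse step concrete, here is a minimal sketch on small random matrices. The names `A`, `B`, and `T_true` and the sizes are made up for illustration; they stand in for $\mathbf{M}_A$, $\mathbf{M}_B$, and an unknown transform. The sketch also compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the magnitude spectrograms: 5 frequency bins, 40 frames
F, N = 5, 40
A = rng.random((F, N))          # plays the role of M_A
T_true = rng.random((F, F))     # a known transform, used only to build B
B = T_true @ A                  # plays the role of M_B (exactly linear here)

# A is not square, so it has no ordinary inverse; the pseudo-inverse still works
T = B @ np.linalg.pinv(A)

# Same answer from a least-squares solver: T A = B  <=>  A^T T^T = B^T
T_lstsq = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T

print(np.allclose(T, T_true))   # checks whether the known transform is recovered
print(np.allclose(T, T_lstsq))  # checks whether both routes agree
```

With real recordings the relation is only approximate, so the learned transform leaves a nonzero residual.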

**Step 1**: Compute $\mathbf{T}$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. **Try and understand why**.

In [None]:
# Compute T using matrix multiplication and pseudo-inverse
T = np.matmul(MB, np.linalg.pinv(MA))

### Computing error

We would also like to know how well the $\mathbf{T}$ we obtained transfers style from synthesizer music to piano music. We can do this by computing the error, in this case with the [Frobenius norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).

**Step 2:** Compute error $\lVert \mathbf{T}\mathbf{M}_A - \mathbf{M}_B \rVert_F^2$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. Try and understand why.

In [None]:
# Compute error using Frobenius norm
error = np.linalg.norm(np.matmul(T, MA) - MB, ord='fro') ** 2
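
The squared Frobenius norm is the sum of the squared entries of the residual matrix. Because that raw number grows with the size and loudness of the recordings, a relative error can be easier to interpret. A minimal sketch with made-up matrices (the names `approx` and `target` are hypothetical stand-ins for $\mathbf{T}\mathbf{M}_A$ and $\mathbf{M}_B$):

```python
import numpy as np

# Hypothetical small matrices standing in for T @ MA and MB
target = np.array([[1.0, 2.0], [3.0, 4.0]])
approx = np.array([[1.0, 2.5], [2.0, 4.0]])

residual = approx - target

# Squared Frobenius norm: two equivalent ways to compute it
error = np.linalg.norm(residual, ord="fro") ** 2
error_by_hand = np.sum(residual ** 2)

# Relative error: residual energy as a fraction of the target's energy
relative_error = error / np.linalg.norm(target, ord="fro") ** 2

print(error, error_by_hand, relative_error)
```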

### Finding $\mathbf{M}_D$

Our model assumes that $\mathbf{T}$ can transfer style from synthesizer music to piano music. Applying $\mathbf{T}$ to $\mathbf{M}_C$ should therefore give us an estimate of $\mathbf{M}_D$, the spectrogram of “Blinding Lights” played on a piano:

$$
\mathbf{M}_D = \mathbf{T}\mathbf{M}_C
$$

**Step 3:** Compute $\mathbf{M}_D$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. **Try and understand why**.
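
One way to see why the hint matters is to compare shapes. This is a sketch with made-up sizes: the real $\mathbf{T}$ is square with one row per frequency bin, and $\mathbf{M}_C$ has one column per time frame.

```python
import numpy as np

F, N = 4, 6                 # hypothetical: 4 frequency bins, 6 time frames
T = np.ones((F, F))         # stands in for the transform T
MC = np.ones((F, N))        # stands in for the spectrogram M_C

# Matrix multiplication: (F, F) @ (F, N) -> (F, N), one output column per frame
MD = np.matmul(T, MC)
print(MD.shape)

# Element-wise multiplication needs compatible shapes; (4, 4) and (4, 6) are not
try:
    T * MC
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print("element-wise product failed:", exc)
```

Each column of the matrix product mixes all frequency bins of one frame, which is what lets $\mathbf{T}$ reshape the spectrum of every frame.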

In [None]:
# Compute MD using matrix multiplication
MD = np.matmul(T, MC)

### Recover $\mathbf{M}_D$ signal

To recover the signal from the constructed spectrogram $\mathbf{M}_D$, we need the `phase` matrix we computed earlier from the original signal. Combine the two, compute the inverse STFT to obtain a signal vector, and then write it to a wav file. To compute the inverse STFT and write the wav file, you can use the following Python commands:

```python
# Recent versions of Librosa don't have an audio write function. Use PySoundFile instead.
import soundfile as sf
# M is the magnitude spectrogram; multiplying it by the phase restores a complex STFT
signal = librosa.istft(M * phase, hop_length=256, center=False, win_length=2048)
sf.write("problem3.wav", signal, 44100) # write the recovered signal at the original sr, which is 44100 Hz
```

**Step 4:** Using the matrix $\mathbf{M}_D$ and the **phase matrix of $\mathbf{C}$**, synthesize an audio signal.

Hint: Here use **element-wise multiplication**. **Try and understand why**.

In [None]:
import soundfile as sf

# Recover the signal from MD using phase_C
MD_spectrogram = MD * phase_C
MD_signal = librosa.istft(MD_spectrogram, hop_length=256, center=False, win_length=2048)

# Save the signal to problem3.wav
sf.write("problem3.wav", MD_signal, 44100)
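
The element-wise product works because magnitude and phase are stored entry by entry: each complex STFT value equals its magnitude times a unit-length complex number. A minimal sketch with random complex numbers standing in for an STFT (the array `S` is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complex "spectrogram": 3 frequency bins, 5 frames
S = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))

eps = np.finfo(float).eps     # about 2.22e-16, the constant used in this notebook
M = np.abs(S)                 # magnitude of every entry
phase = S / (M + eps)         # unit-length complex numbers carrying the phase

# Multiplying entry by entry puts magnitude and phase back together
S_rebuilt = M * phase

print(np.allclose(S_rebuilt, S))
print(np.allclose(np.abs(phase), 1.0))
```

In the notebook, $\mathbf{M}_D$ takes the place of the magnitude while `phase_C` is kept, which is why the two arrays must have the same shape.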

### Bonus: Listen to your reconstructed signal



In [None]:
import IPython.display as ipd
ipd.Audio('problem3.wav') # load a local WAV file