[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/derrick-nuby/intro-to-ai-ai6/blob/local-answers-python/assignments/hw01_style_transfer/hw01_answer_local_version.ipynb)

# Style transfer using a Linear Transformation

Here we have three pieces of audio. The first two are `Synth.wav` (audio $\mathbf{A}$) and `Piano.wav` (audio $\mathbf{B}$), which are recordings of a chromatic scale in a single octave played by a synthesizer and a piano respectively. The third piece of audio is the intro melody of “Blinding Lights” (audio $\mathbf{C}$) by The Weeknd, played with the same synth tone used to generate `Synth.wav`.

All audio files are in the `hw01_style_transfer/data` folder.

From these files, you can obtain the spectrograms $\mathbf{M}_A$, $\mathbf{M}_B$ and $\mathbf{M}_C$. Your objective is to find the spectrogram of the piano version of the song “Blinding Lights” ($\mathbf{M}_D$).

In this problem, we assume that style can be transferred using a linear transformation. Formally, we need
to find the matrix $\mathbf{T}$ such that:

$$
\mathbf{T}\mathbf{M}_A \approx \mathbf{M}_B
$$

1. Write your code to determine matrix $\mathbf{T}$.

2. Our model assumes that $\mathbf{T}$ can transfer style from synthesizer music to piano music. Applying $\mathbf{T}$ to $\mathbf{M}_C$ should therefore give an estimate of $\mathbf{M}_D$, the spectrogram of “Blinding Lights” played on a piano. Using this matrix and the phase matrix of $\mathbf{C}$, synthesize an audio signal.

    Submit your synthesized audio as `problem3.wav`.

In [None]:
# 1. Install dependencies
!pip install librosa soundfile

# 2. Import libraries
import os
import urllib.request
import numpy as np

# 3. Create data folder
os.makedirs('data', exist_ok=True)

# 4. Download audio files from GitHub
print("Downloading audio files...")
base_url = "https://raw.githubusercontent.com/derrick-nuby/intro-to-ai-ai6/local-answers-python/assignments/hw01_style_transfer/data/"
files = ["Synth.wav", "Piano.wav", "BlindingLights.wav"]

success_count = 0
for file in files:
    try:
        urllib.request.urlretrieve(base_url + file, f"data/{file}")
        print(f"✅ Downloaded {file}")
        success_count += 1
    except Exception as e:
        print(f"❌ Could not download {file}: {e}")

print(f"\n🎉 Setup complete! {success_count}/{len(files)} files downloaded.")
if success_count == len(files):
    print("✅ All files ready! You can now run the rest of the notebook.")
else:
    print("⚠️  Some files failed to download. You may need to upload them manually.")
    print("   Click the folder icon on the left and upload the missing files to the 'data' folder.")

## Solution

You need to compute the spectrogram of each music file. First, load the audio file at its original sample rate (44100 Hz in our case).

If you are using Python, you can use [Librosa](https://librosa.org/doc/latest/index.html) to load the wav file as follows (we also recommend the NumPy package for the matrix operations below):

```python
import librosa
filename = "data/Synth.wav"  # any of the three files in the data folder
# sr=None loads the audio at its original sample rate instead of resampling it
audio, sr = librosa.load(filename, sr=None)
# Check that the sample rate read from the file is the expected 44100 Hz
assert sr == 44100
```

Next, we can compute the complex Short-Time Fourier Transform (STFT) of the signal and its magnitude spectrogram. Use windows of `2048` samples, which correspond to analysis windows of roughly 46 ms at 44100 Hz, and a hop length of `256` samples, which gives about 172 frames per second of signal. Different toolboxes should provide similar spectrograms. If you are using the Python Librosa library, you can use the following commands:

```python
spectrogram = librosa.stft(audio, n_fft=2048, hop_length=256, center=False, win_length=2048)
M = abs(spectrogram)  # magnitude spectrogram
phase = spectrogram / (M + 2.2204e-16)  # unit-magnitude phase; the small constant avoids division by zero
```

In this case, $\mathbf{M}$ represents the music file and should be a matrix, where the rows correspond to the frequencies and the columns to time. For visualizing the spectrogram, see the documentation online for `librosa.display.specshow`.
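As a sanity check on the matrix dimensions, the sketch below computes the shape one would expect from the STFT settings above. It assumes Librosa's framing rule for `center=False` (one frame per full window, no padding). The helper `expected_stft_shape` and the sample count `435712` are introduced here for illustration only.

```python
def expected_stft_shape(n_samples, n_fft=2048, hop_length=256):
    """Expected (rows, columns) of an STFT computed without padding (center=False)."""
    n_bins = 1 + n_fft // 2                            # frequencies from 0 Hz up to Nyquist
    n_frames = 1 + (n_samples - n_fft) // hop_length   # number of full windows that fit
    return n_bins, n_frames

sr = 44100
print(f"window length: {1000 * 2048 / sr:.1f} ms")      # window length: 46.4 ms
print(f"frame rate: {sr / 256:.1f} frames per second")  # frame rate: 172.3 frames per second
print(expected_stft_shape(435712))                      # (1025, 1695)
```

A signal of 435712 samples lasts about 9.88 seconds at 44100 Hz, so a spectrogram with 1025 rows and 1695 columns is what these settings would produce for a clip of that length.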

In [None]:
import numpy as np

### Computing $\mathbf{M}_A$, $\mathbf{M}_B$, $\mathbf{M}_C$

In [None]:
import librosa

# TODO: Compute MA and phase_A
audio_A, sr_A = librosa.load("data/Synth.wav", sr=None)
assert sr_A == 44100  # Make sure sample rate is correct
spectrogram_A = librosa.stft(audio_A, n_fft=2048, hop_length=256, center=False, win_length=2048)
MA = abs(spectrogram_A)
phase_A = spectrogram_A / (MA + 2.2204e-16)

# TODO: Compute MB and phase_B
audio_B, sr_B = librosa.load("data/Piano.wav", sr=None)
assert sr_B == 44100  # Make sure sample rate is correct
spectrogram_B = librosa.stft(audio_B, n_fft=2048, hop_length=256, center=False, win_length=2048)
MB = abs(spectrogram_B)
phase_B = spectrogram_B / (MB + 2.2204e-16)

# TODO: Compute MC and phase_C
audio_C, sr_C = librosa.load("data/BlindingLights.wav", sr=None)
assert sr_C == 44100  # Make sure sample rate is correct
spectrogram_C = librosa.stft(audio_C, n_fft=2048, hop_length=256, center=False, win_length=2048)
MC = abs(spectrogram_C)
phase_C = spectrogram_C / (MC + 2.2204e-16)

print(f"MA shape: {MA.shape}")
print(f"MB shape: {MB.shape}")
print(f"MC shape: {MC.shape}")
print("Spectrograms computed successfully!")

MA shape: (1025, 1695)
MB shape: (1025, 1695)
MC shape: (1025, 1695)
Spectrograms computed successfully!


### Learning $\mathbf{T}$

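We are given that:

$$
\mathbf{T}\mathbf{M}_A \approx \mathbf{M}_B
$$

Division by a matrix is not defined, so we cannot simply "divide" by $\mathbf{M}_A$. If $\mathbf{M}_A$ were a square, invertible matrix, we could instead right-multiply both sides by its inverse $\mathbf{M}_A^{-1}$, which plays the role that $\frac{1}{x}$ plays for a scalar $x$:

$$
\mathbf{T} = \mathbf{M}_B \mathbf{M}_A^{-1}
$$

Here, however, **$\mathbf{M}_A$ is not a square matrix**, so it has no inverse, and we use its pseudo-inverse instead. Our final formula is:

$$
\mathbf{T} = \mathbf{M}_B \, \mathrm{pinv}(\mathbf{M}_A)
$$

where $\mathrm{pinv}$ means the [pseudo-inverse](https://numpy.org/doc/stable/reference/generated/numpy.linalg.pinv.html) of a matrix. This choice of $\mathbf{T}$ is a least-squares solution: it minimizes the error $\lVert \mathbf{T}\mathbf{M}_A - \mathbf{M}_B \rVert_F^2$.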
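To make the role of the pseudo-inverse concrete, here is a minimal sketch on small random matrices. The names `A`, `B`, and `T_lstsq` are stand-ins chosen for this example, not part of the assignment. It compares the pseudo-inverse formula with NumPy's least-squares solver applied to the transposed problem.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 10))   # stand-in for M_A: 4 frequency bins, 10 frames
B = rng.random((4, 10))   # stand-in for M_B, same shape

# Pseudo-inverse formula: (4, 10) @ (10, 4) -> (4, 4)
T = B @ np.linalg.pinv(A)

# T @ A ≈ B is the same problem as A.T @ T.T ≈ B.T, which lstsq can solve directly
T_lstsq = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T

print(T.shape)                             # (4, 4)
print(np.allclose(T, T_lstsq))             # both routes give the same T
# At a least-squares minimum, the residual is orthogonal to the rows of A
print(np.allclose((T @ A - B) @ A.T, 0))
```

On the real data the same reasoning applies with 1025 frequency bins and 1695 frames, which is why $\mathbf{T}$ comes out as a 1025 × 1025 matrix.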

**Step 1**: Compute $\mathbf{T}$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. **Try to understand why**.
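The difference between the two products is easiest to see on small stand-in matrices. In this sketch, `MB_small` and `MA_pinv_small` are hypothetical placeholders with the same orientation as $\mathbf{M}_B$ and the pseudo-inverse of $\mathbf{M}_A$.

```python
import numpy as np

MB_small = np.ones((4, 10))        # stand-in for M_B: (frequencies, frames)
MA_pinv_small = np.ones((10, 4))   # stand-in for pinv(M_A): (frames, frequencies)

# Matrix product: the inner dimensions (10 and 10) match, giving a (4, 4) result
print((MB_small @ MA_pinv_small).shape)

# Element-wise product: shapes (4, 10) and (10, 4) cannot be broadcast together
try:
    MB_small * MA_pinv_small
except ValueError as err:
    print(f"element-wise product failed: {err}")
```

The `@` operator is shorthand for `np.matmul` on 2-D arrays, so either spelling works in the cell below.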

In [None]:
import numpy as np

# Assuming MA and MB are defined earlier in the code
# MA = ...
# MB = ...

# TODO: compute T
# T = MB * pinv(MA)
# Using matrix multiplication (not element-wise) as specified in the hint
T = np.matmul(MB, np.linalg.pinv(MA))

print(f"T shape: {T.shape}")
print("Transformation matrix T computed successfully!")

T shape: (1025, 1025)
Transformation matrix T computed successfully!


### Computing error

We would also like to know how well the $\mathbf{T}$ we found maps the synthesizer spectrogram onto the piano spectrogram. We can measure this by computing the reconstruction error on the pair ($\mathbf{M}_A$, $\mathbf{M}_B$) that $\mathbf{T}$ was fitted to, using the [Frobenius norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).

**Step 2:** Compute the error $\lVert \mathbf{T}\mathbf{M}_A - \mathbf{M}_B \rVert_F^2$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. Try to understand why.
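The squared Frobenius norm is simply the sum of the squared entries of a matrix, which the toy example below checks. A raw error value is hard to judge on its own, so the sketch also includes a hypothetical helper, `relative_error`, that divides by the squared norm of the target. That normalization is a suggestion made here, not something the assignment asks for.

```python
import numpy as np

D = np.array([[1.0, -2.0],
              [3.0,  0.0]])   # stand-in for the difference T @ MA - MB

err_norm = np.linalg.norm(D, "fro") ** 2
err_sum = np.sum(D ** 2)      # 1 + 4 + 9 + 0 = 14
print(np.isclose(err_norm, err_sum))

def relative_error(estimate, target):
    """Squared Frobenius error divided by the squared Frobenius norm of the target."""
    return np.linalg.norm(estimate - target, "fro") ** 2 / np.linalg.norm(target, "fro") ** 2
```

With the notebook's variables, a call such as `relative_error(np.matmul(T, MA), MB)` would give a scale-free number: 0 means a perfect fit, and values near 1 mean the error is as large as the target itself.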

In [None]:
# TODO: compute the error (frobenius norm)
# Error = ||T*MA - MB||_F^2
# Using matrix multiplication as specified in the hint
T_MA = np.matmul(T, MA)
difference = T_MA - MB
error = np.linalg.norm(difference, 'fro')**2

print(f"Reconstruction error (Frobenius norm squared): {error}")
print("Error computed successfully!")

Reconstruction error (Frobenius norm squared): 9440.37890625
Error computed successfully!


### Finding $\mathbf{M}_D$

Our model assumes that $\mathbf{T}$ can transfer style from synthesizer music to piano music. Applying $\mathbf{T}$ to $\mathbf{M}_C$ should therefore give an estimate of $\mathbf{M}_D$, the spectrogram of “Blinding Lights” played on a piano.

$$
\mathbf{M}_D = \mathbf{T}\mathbf{M}_C
$$

**Step 3:** Compute $\mathbf{M}_D$

Hint: use [matrix multiplication](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html), not element-wise multiplication, to avoid a broadcast error. **Try to understand why**.
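One detail the formula does not guarantee is that every entry of $\mathbf{T}\mathbf{M}_C$ is non-negative, even though a magnitude spectrogram should be. The sketch below uses toy matrices (`T_small` and `MC_small` are made up for this example) to show how one might count negative entries and, optionally, clip them at zero. Clipping is an assumption made here, not a step the assignment requires.

```python
import numpy as np

T_small = np.array([[1.0, -1.0],
                    [0.0,  1.0]])   # a transform with a negative entry
MC_small = np.array([[1.0, 2.0],
                     [3.0, 1.0]])   # non-negative "magnitudes"

MD_small = T_small @ MC_small       # [[-2., 1.], [3., 1.]]
n_negative = int(np.sum(MD_small < 0))
print(f"negative entries: {n_negative}")   # negative entries: 1

# Optional post-processing: force all magnitudes to be non-negative
MD_clipped = np.maximum(MD_small, 0.0)
print(MD_clipped)
```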

In [None]:
# TODO: compute MD
MD = np.matmul(T, MC)

print(f"MD shape: {MD.shape}")
print("Piano version spectrogram MD computed successfully!")

MD shape: (1025, 1695)
Piano version spectrogram MD computed successfully!


### Recover $\mathbf{M}_D$ signal

To recover the signal from the constructed spectrogram $\mathbf{M}_D$, we need the `phase` matrix we computed earlier from the original signal. Combine the two, compute the inverse STFT to obtain a time-domain signal, and write that signal to a wav file. To compute the inverse STFT and write the wav file, you can use the following Python commands:

```python
# Recent Librosa versions don't have an audio write function. Use the soundfile package instead.
import soundfile as sf
# M is a magnitude spectrogram and phase is the matching unit-magnitude phase matrix
signal = librosa.istft(M * phase, hop_length=256, center=False, win_length=2048)
sf.write("problem3.wav", signal, 44100)  # write the signal at the original sample rate of 44100 Hz
```

**Step 4:** Using the matrix $\mathbf{M}_D$ and the **phase matrix of $\mathbf{C}$**, synthesize an audio signal.

Hint: Here, use **element-wise multiplication**. **Try to understand why**.
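The reason for the element-wise product can be checked on a small complex matrix. In this sketch, `Z` is a random stand-in for a complex STFT. Each entry of `phase` is a complex number of (almost exactly) unit length, so multiplying it entry by entry with a magnitude gives a complex value with that magnitude and the original angle.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))  # stand-in for a complex STFT

M = np.abs(Z)                    # magnitudes
phase = Z / (M + 2.2204e-16)     # unit-length complex numbers carrying only the angle

print(np.allclose(np.abs(phase), 1.0))               # every phase entry has length 1
print(np.allclose(M * phase, Z))                     # magnitude * phase restores Z entry by entry
print(np.allclose(np.abs((2 * M) * phase), 2 * M))   # a new magnitude keeps the old angles
```

In Step 4, `MD` takes the place of the new magnitude and `phase_C` supplies the angles, so both must have the same shape and be combined entry by entry.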

In [None]:
import soundfile as sf
import librosa
import numpy as np

# TODO: recover the signal from MD
# Using element-wise multiplication with phase_C as specified in the hint
# We use the phase from the original Blinding Lights (C) to reconstruct the audio
complex_spectrogram = MD * phase_C
MD_signal = librosa.istft(complex_spectrogram, hop_length=256, center=False, win_length=2048)

# TODO: save the recovered signal in a "problem3.wav" file
sf.write("problem3.wav", MD_signal, 44100)

print("Audio signal recovered and saved as 'problem3.wav'!")
print(f"Audio length: {len(MD_signal)/44100:.2f} seconds")

Audio signal recovered and saved as 'problem3.wav'!
Audio length: 9.88 seconds


### Bonus: Listen to your reconstructed signal



In [None]:
import IPython.display as ipd
ipd.Audio('problem3.wav') # load a local WAV file