## Spectrogram Test

This notebook tests the audio quality that can be recovered from 512x512 spectrogram images.

This is the image size that stream diffusion will generate, so it is important to check the output quality before training the model.

In [1]:
! pip install librosa matplotlib soundfile



In [5]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio_file = 'audio/cw_amen11_145.wav'

y, sr = librosa.load(audio_file, sr=None)

# sr=None loads the audio file at its original sampling rate instead of resampling it

Generate the spectrogram using the STFT (short-time Fourier transform).

**Frequency Bins:** The frequency axis of the spectrogram is divided into bins, each representing a specific frequency range. The number of frequency bins is determined by the FFT (Fast Fourier Transform) size: for a real-valued signal the FFT output is conjugate-symmetric, so only the non-redundant half is kept, giving `n_fft / 2 + 1` bins. For example, an FFT size of 2048 gives 1025 frequency bins, and the size of 1024 used in this notebook gives 513.

**Buffer Size (Window Size):** The buffer size is the length of the audio segment used to compute each FFT. A larger buffer size gives you higher frequency resolution (more detailed frequency information), but lower time resolution (less detail in timing). Conversely, a smaller buffer size gives you higher time resolution but lower frequency resolution.

**Hop Length (Stride):** This is the number of samples between the starts of successive frames. The overlap between adjacent windows is the window size minus the hop length, so a smaller hop length results in more overlap, providing smoother transitions but at a higher computational cost.

**Frame Rate:** The frame rate (or time resolution) of the spectrogram is the sample rate divided by the hop length. It tells you how many FFT frames are computed per second along the time axis.
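
The relationships between these four parameters can be checked with a little arithmetic. The sketch below is illustrative only: `stft_layout` is a hypothetical helper (not part of librosa), and the sample rate of 44100 Hz and hop length of 256 are assumed values, not ones taken from the audio file used in this notebook.

```python
def stft_layout(sr, n_fft, hop_length, win_length=None):
    """Summarise the spectrogram layout implied by a set of STFT parameters."""
    # Assumption: the window spans the whole FFT buffer unless stated otherwise
    win_length = n_fft if win_length is None else win_length
    return {
        "freq_bins": n_fft // 2 + 1,                # non-redundant half of the FFT
        "bin_width_hz": sr / n_fft,                 # frequency resolution per bin
        "overlap_samples": win_length - hop_length, # samples shared by adjacent windows
        "frames_per_second": sr / hop_length,       # frame rate along the time axis
    }


# Assumed example values (not read from the notebook's audio file)
layout = stft_layout(sr=44100, n_fft=1024, hop_length=256)
for name, value in layout.items():
    print(f"{name}: {value}")
```

Doubling `n_fft` halves the bin width (finer frequency detail) but makes each frame cover twice as many samples, which is the time/frequency trade-off described above.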

In [6]:
# Parameters for a spectrogram with 512 time steps
n_fft = 1024  # Gives n_fft // 2 + 1 = 513 frequency bins (height of the spectrogram)
hop_length = int(len(y) / 512)  # Aim for 512 time steps (width); the STFT returns slightly more, trimmed below

# Generate the spectrogram
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Crop (or pad) the time axis to exactly 512 frames
S_db_resized = librosa.util.fix_length(S_db, size=512, axis=1)

# Check the shape: 513 frequency bins x 512 frames, one row more than a 512x512 image
assert S_db_resized.shape == (513, 512), f"Spectrogram shape is {S_db_resized.shape}, expected (513, 512)"

# Check the range of S_db to adjust visualization if needed
print(f"Decibel range: min {S_db_resized.min()}, max {S_db_resized.max()}")



Decibel range: min -80.0, max 0.0
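
The reason the time axis needs trimming can be worked out by hand. With librosa's default centered framing (`center=True`), the STFT returns `1 + len(y) // hop_length` frames, so choosing `hop_length = len(y) // 512` always yields at least 513 frames. The sketch below illustrates this; `stft_frame_count` is a hypothetical helper and the sample count is an assumed value, not the length of the audio file used here.

```python
def stft_frame_count(n_samples, hop_length):
    """Number of frames a centered STFT produces for a signal of n_samples."""
    return 1 + n_samples // hop_length


n_samples = 100_000                # assumed signal length, for illustration only
hop_length = n_samples // 512      # same rule as in the cell above
n_frames = stft_frame_count(n_samples, hop_length)
frames_cropped = n_frames - 512    # trailing frames removed by fix_length

print(f"hop_length={hop_length}, frames={n_frames}, cropped={frames_cropped}")
```

The cropped frames correspond to the very end of the clip, so an image built from the trimmed spectrogram covers slightly less audio than the original file.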


In [7]:
import soundfile as sf

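# Inverse STFT to reconstruct the audio signal.
# D still holds the original phase and every frame (not only the 512 kept in
# S_db_resized), so this is a best-case reconstruction.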
y_reconstructed = librosa.istft(D, hop_length=hop_length)

sf.write('audio/cw_amen11_145_recon.wav', y_reconstructed, sr)


In [8]:
# Create a figure without any axes or titles
fig, ax = plt.subplots(figsize=(5.12, 5.12))  # 5.12 in at dpi=100 gives a 512 px canvas per side
ax.axis('off')  # Turn off axes
librosa.display.specshow(S_db_resized, sr=sr, hop_length=hop_length, x_axis=None, y_axis=None, cmap='inferno')
plt.subplots_adjust(left=0, right=1, top=1, bottom=0)  # Remove padding

# Save the spectrogram image without any additional chart elements
output_image_path = 'spectrogram_512x512_no_axes.png'
plt.savefig(output_image_path, bbox_inches='tight', pad_inches=0, dpi=100)
plt.close()

In [1]:
# Testing the SpectrogramGenerator class

from SpectrogramGenerator import SpectrogramGenerator as sg

generator = sg(input_audio_file='audio/cash_register_reconstructed.wav', output_image_path='spectrograms/', output_audio_file_path='')

generator.generate_spectrogram()


Spectrogram shape is (513, 512)
